
Network Drivers

 

Special Features of Network Devices

  • do not look/act like files
  • do not correspond to inodes in /dev
  • read and write do not make sense
  • multiple sockets and protocols can be multiplexed on a single NIC
  • packets received asynchronously, from outside
    input is "pushed" from outside, rather than "pulled" from inside the system

Linux Network Subsystem

  • protocol-independent
  • driver and kernel interact one packet at a time
  • separated into:
    • protocol-independent networking infrastructure (e.g., for all packets)
    • protocol-specific infrastructure (e.g., for TCP, or for UDP)
    • generic device-class infrastructure (e.g., for all ethernet devices, or for all PCI devices)
    • device-specific implementations (e.g., for just the E100)

The goal of using generic code wherever possible results in many call-back interfaces:

  • E100 code calls generic PCI code -- direct call
  • generic PCI code calls E100 device-specific code -- must be call-back, via function pointer
  • E100 code calls generic ethernet code -- direct
  • generic code calls E100-specific code -- call-back
  • protocol implementation calls protocol-independent code -- direct
  • protocol-independent code calls protocol-specific-code -- call-back
  • etc.

Sometimes it is even worse, with both sides using pointers to call one another, or indirectly doing so, via chains of calls involving other subsystems.
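The direct-call versus call-back pattern above can be sketched in ordinary user-space C. This is only an illustration: struct bus_driver, bus_register_driver, bus_device_found, and the toy_ names are hypothetical stand-ins, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-simplified stand-in for a bus driver descriptor. */
struct bus_driver {
    const char *name;
    int (*probe)(int device_id);   /* call-back into device-specific code */
};

static const struct bus_driver *registered;   /* a single slot, for brevity */

/* "Generic" code: the device-specific driver calls this directly. */
int bus_register_driver(const struct bus_driver *drv)
{
    assert(drv != NULL && drv->probe != NULL);
    registered = drv;
    return 0;
}

/* Generic code cannot name the driver's functions, so it goes
 * through the function pointer that was registered earlier. */
int bus_device_found(int device_id)
{
    return registered ? registered->probe(device_id) : -1;
}

/* "Device-specific" code: claims only one made-up device id. */
static int toy_probe(int device_id)
{
    return device_id == 42 ? 0 : -1;
}

static const struct bus_driver toy_driver = { "toy", toy_probe };

int toy_init_module(void)
{
    return bus_register_driver(&toy_driver);   /* direct call */
}
```

The generic side never mentions toy_probe by name, which is what lets one copy of the generic code serve many drivers.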


Network Device Descriptor

The datatype struct net_device has many fields, including:

  • init -- a pointer to an initialization routine
  • name -- string identifying the interface
    can use a form like "eth%d" to allow the system to plug in a unique number.
  • several pointers to interface service routines ("methods")

Registering A Network Device

This is done by a call to register_netdev, which calls register_netdevice.
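As a rough user-space illustration of how a name template such as "eth%d" might be resolved to a unique name at registration time, consider the sketch below. The function toy_register_name and its fixed-size table are assumptions made for the example; the kernel's own logic is more elaborate.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_IFACES 8
#define TOY_NAME_LEN   16

static char toy_names[TOY_MAX_IFACES][TOY_NAME_LEN];  /* names already taken */
static int  toy_count;

static int toy_name_in_use(const char *name)
{
    for (int i = 0; i < toy_count; i++)
        if (strcmp(toy_names[i], name) == 0)
            return 1;
    return 0;
}

/* Resolve a template like "eth%d" to the first unused name.
 * Returns 0 and fills `out` on success, -1 if no name is available. */
int toy_register_name(const char *tmpl, char out[TOY_NAME_LEN])
{
    const char *p = strstr(tmpl, "%d");

    assert(strlen(tmpl) < TOY_NAME_LEN - 4);
    if (toy_count == TOY_MAX_IFACES)
        return -1;
    for (int unit = 0; unit < TOY_MAX_IFACES; unit++) {
        if (p)   /* splice the unit number in place of "%d" */
            snprintf(out, TOY_NAME_LEN, "%.*s%d%s",
                     (int)(p - tmpl), tmpl, unit, p + 2);
        else     /* fixed name: only one attempt makes sense */
            snprintf(out, TOY_NAME_LEN, "%s", tmpl);
        if (!toy_name_in_use(out)) {
            strcpy(toy_names[toy_count++], out);
            return 0;
        }
        if (!p)
            break;
    }
    return -1;
}
```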


Example: Initialization of E100 Ethernet Driver

e100.c handles a variety of ethernet controllers based on Intel chips, including the 82557, 82558, 82559, 82550, 82551, and 82562 devices.

e100_init_module calls pci_module_init (pci_register_driver in later kernels) with a reference to a variable of type struct pci_driver called e100_driver.

That structure contains several fields, including a pointer to a probe function, e100_probe.

The e100_probe() function obtains an object of type struct nic, which holds a pointer back to the net_device and a reference to a pci_dev. The nic structure is allocated as the private area of the net_device, and a pointer to it is obtained from the net_device structure via netdev_priv.

The e100_probe() function calls alloc_etherdev, which calls alloc_etherdev_mq(), which in turn calls alloc_netdev_mq.

Much of the work of alloc_netdev_mq() is done by a function parameter that is passed in, named setup. In the current case, this is the function ether_setup.
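The combination of a single allocation that holds both the generic device structure and the driver's private data, plus a setup function passed in as a parameter, can be sketched in user space as follows. All toy_ names and the 32-byte alignment are assumptions chosen for the illustration, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, tiny stand-in for struct net_device. */
struct toy_netdev {
    unsigned mtu;
    unsigned hard_header_len;
};

/* Size of the device struct rounded up, so the private area is aligned. */
#define TOY_ALIGN    ((size_t)32)
#define TOY_DEV_SIZE ((sizeof(struct toy_netdev) + TOY_ALIGN - 1) & ~(TOY_ALIGN - 1))

/* Analogue of netdev_priv(): the private area follows the device struct. */
void *toy_priv(struct toy_netdev *dev)
{
    return (char *)dev + TOY_DEV_SIZE;
}

/* Analogue of the allocator: one allocation, then type-specific setup. */
struct toy_netdev *toy_alloc(size_t sizeof_priv,
                             void (*setup)(struct toy_netdev *))
{
    struct toy_netdev *dev;

    assert(setup != NULL);
    dev = calloc(1, TOY_DEV_SIZE + sizeof_priv);
    if (dev)
        setup(dev);
    return dev;
}

/* Analogue of ether_setup(): defaults shared by all Ethernet devices. */
void toy_ether_setup(struct toy_netdev *dev)
{
    dev->mtu = 1500;
    dev->hard_header_len = 14;
}

/* Hypothetical driver-private structure, in the role of struct nic. */
struct toy_private {
    struct toy_netdev *netdev;   /* pointer back to the generic part */
    int irq;
};
```

A driver would call toy_alloc(sizeof(struct toy_private), toy_ether_setup) and then use toy_priv() to reach its own fields.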

The function e100_probe() initializes the netdev structure returned by alloc_etherdev(), including pointers to methods such as open and stop, and does a lot of other initialization, including all the generic initialization for a PCI device that uses DMA.

One of the interesting features is the setting up of a timer with the handler e100_watchdog.


Opening/Closing A Network Device

open

  • called in response to ioctl(SIOCSIFFLAGS) call from ifconfig
  • requests system resources (e.g., irq, buffer space)
  • activates the hardware interface

stop

  • called in response to another ioctl(SIOCSIFFLAGS) call (with IFF_UP cleared) from ifconfig
  • undoes effects of open()

See e100_open for example. It does most of the work in e100_up.


Socket Buffers

The datatype struct sk_buff is used to hold packets that go to or from sockets. It has quite a few fields, including some that are tagged as "important" by the text. Unfortunately, the names of some of these fields appear to have changed. The fields are:

  • dev = the sending or receiving device
  • h, nh, mac = pointers to headers within packet
    • h = transport (e.g., struct tcphdr *th), apparently replaced by transport_header, of type sk_buff_data_t
    • nh = network (e.g., struct iphdr *iph), apparently replaced by network_header, of type sk_buff_data_t
    • mac = link (e.g., struct ethhdr *ethernet), apparently replaced by mac_header, of type sk_buff_data_t
  • head, data, tail, end: pointers to components of the packet
    • head = beginning of space
    • data = beginning of valid data
    • tail = end of valid data
    • end = maximum value tail can reach
  • len = full length of packet (= tail - data when there are no scatter-gather fragments)
  • data_len = nonzero only for scatter-gather
    length of portion stored in separate fragments
  • ip_summed = the checksum policy (set by driver on incoming packets)
  • pkt_type = delivery classification (set by eth_type_trans())
    • PACKET_HOST -- for the local host
    • PACKET_BROADCAST
    • PACKET_MULTICAST
    • PACKET_OTHERHOST
    • PACKET_OUTGOING
    • PACKET_LOOPBACK
    • PACKET_FASTROUTE
  • pointer to structure containing info that is shareable between copies of the sk_buff
    (see scatter gather, below)
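The relationship among head, data, tail, end, and len is easiest to see in a small model. The sketch below is a user-space imitation, assuming a linear (non-fragmented) buffer; the toy_ functions mimic the roles of skb_reserve, skb_put, and skb_push but are not the kernel implementations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct toy_skb {
    unsigned char *head, *data, *tail, *end;
    size_t len;                       /* always tail - data in this model */
};

struct toy_skb *toy_alloc_skb(size_t size)
{
    struct toy_skb *skb = malloc(sizeof *skb);

    if (!skb)
        return NULL;
    skb->head = malloc(size);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;   /* empty: no valid data yet */
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

/* Leave headroom in an empty buffer (role of skb_reserve). */
void toy_reserve(struct toy_skb *skb, size_t n)
{
    assert(skb->len == 0 && n <= (size_t)(skb->end - skb->tail));
    skb->data += n;
    skb->tail += n;
}

/* Extend valid data at the tail (role of skb_put). */
unsigned char *toy_put(struct toy_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    assert(n <= (size_t)(skb->end - skb->tail));
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Prepend a header into the headroom (role of skb_push). */
unsigned char *toy_push(struct toy_skb *skb, size_t n)
{
    assert(n <= (size_t)(skb->data - skb->head));
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

void toy_free_skb(struct toy_skb *skb)
{
    free(skb->head);
    free(skb);
}
```

Reserving headroom first is what lets a lower layer prepend its header later without copying the payload.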

sk_buff without Scatter-Gather

contiguous sk_buff layout diagram, without scatter gather

The book's comment about being prepared to have code that depends on the internals of type struct sk_buff be broken by future kernel releases is certainly true, but may also apply to other aspects of the kernel. The differences even between versions 2.6.6 and 2.6.11 were quite noticeable, as were the differences between version 2.6.11 and 2.6.25.


Scatter-Gather I/O

  • normally used for outgoing network packets
  • allows layers of packet headers to be added, for each layer of protocol
    without recopying the packet
  • accommodated by the sk_buff datatype
  • macro skb_shinfo returns pointer to the "shareable data" of a sk_buff object, including fields:
    • nr_frags = number of fragments (beyond the first sk_buff) in the packet
    • frag_list
    • frags = array of skb_frag_struct, each with the following fields:
      • struct page *page
      • __u16 page_offset
      • __u16 size

sk_buff with Scatter-Gather

uncontiguous sk_buff layout diagram, with scatter gather

See also scatter-gather mappings under DMA I/O.

Scatter-gather allows sharing of common fields, like MAC-address, between packets.

Scatter-gather is also implemented by the user API. See the man-page for sendmsg for specifics. From the user level, the scatter-gather mechanism is implemented by the following structures:

struct msghdr {
      void         * msg_name;     /* optional address */
      socklen_t    msg_namelen;    /* size of address */
      struct iovec * msg_iov;      /* scatter/gather array */
      size_t       msg_iovlen;     /* # elements in msg_iov */
      void         * msg_control;  /* ancillary data, see below */
      socklen_t    msg_controllen; /* ancillary data buffer len */
      int          msg_flags;      /* flags on received message */
  };
  struct iovec {
      void *iov_base;   /* Starting address */
      size_t iov_len;   /* Number of bytes */
  };

It is up to the device driver to decide whether scatter-gather can be supported directly by the device, or whether to force a higher layer of the network implementation to copy scatter-gather structures provided by a user into contiguous ranges of memory.
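A minimal user-level sketch of the gather side of this API follows. It sends a "header" and a "payload" held in two separate buffers as a single datagram with sendmsg, then reads the datagram back. The helper name gather_roundtrip and the use of a local AF_UNIX socket pair are choices made for the example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send hdr and payload from two separate buffers as one datagram and
 * receive it into out. Returns the number of bytes received, or -1. */
ssize_t gather_roundtrip(const char *hdr, const char *payload,
                         char *out, size_t outlen)
{
    int sv[2];
    struct iovec iov[2];
    struct msghdr msg;
    ssize_t n;

    assert(hdr && payload && out);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    iov[0].iov_base = (void *)hdr;       /* first piece: the header */
    iov[0].iov_len  = strlen(hdr);
    iov[1].iov_base = (void *)payload;   /* second piece: the payload */
    iov[1].iov_len  = strlen(payload);

    memset(&msg, 0, sizeof msg);         /* no address, no ancillary data */
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    n = sendmsg(sv[0], &msg, 0);         /* kernel gathers both pieces */
    if (n >= 0)
        n = recv(sv[1], out, outlen, 0);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

The caller never builds a contiguous copy of header plus payload; whether the kernel or the device must do so further down is the driver's decision described above.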


The text says sending is less complicated than receiving, and so chooses to treat it first. However, sending is also less interesting, since it is synchronous, pushed I/O, similar to examples we have seen with other types of devices. Since we have limited time, we will look at the more interesting case first.

The E100 driver does not handle scatter-gather output. (It does not set the NETIF_F_SG bit in netdev->features.) For an example of scatter-gather see the function e1000_tx_map in the E1000 driver. This function is responsible for setting up the DMA mapping for all the elements of a scatter-gather list. Note where this driver sets netdev->features to include the feature NETIF_F_SG, so that the network device layer knows that this driver can accept scatter-gather sk_buffs for this device.


Receiving a Packet

The end goal of the driver is to pass off each packet that it receives, to the higher-level protocol handling routines of the system.

There are two models of packet reception that a driver may implement:

  • interrupt-driven --
    • each packet arrival generates an interrupt
    • can paralyze the system if there is a "packet storm"
    • implemented by the toy driver, snull.c in the text, but not by the E100 driver
    • a packet is passed off via a call to netif_rx()
  • polling -- (better)
    • packet arrivals only generate an interrupt if the system is not actively polling
    • reduced overhead due to interrupts when the system is already busy
    • implemented by the snull, E100, and E1000 drivers
    • a packet is passed off via a call to netif_receive_skb

All the input processing builds up to this.


It may be surprising that "polling" is more efficient than interrupt-driven input, since historically interrupts were developed to address the inefficiency of polling. The reality is that we are talking about a hybrid approach, which involves both polling and interrupts and is superior to either of them alone.

Since the text concentrates on the pure interrupt-driven case, in class we will concentrate on the polling approach, which is more effective, more interesting, and the recommended approach for all Linux network drivers.

Starting with kernel 2.6.6, the E100 driver had an option CONFIG_E100_NAPI to choose between interrupt-driven I/O and polling (a.k.a. "NAPI"), but by kernel 2.6.11 NAPI (polling) was the only option supported.


Pre-NAPI (interrupt-driven) Linux Networking

pre-napi control flow diagram

Data Flow of Pre-NAPI Linux Network Driver API

pre-napi data flow diagram

The diagrams above are reproduced from piters.home.cern.ch/piters/TCP/seminar/TDAQ-April02/TDAQ-April02.ps.


NAPI -- the "New API" for Network Device Drivers

  • Interrupt from NIC is normally disabled
  • Driver has thread that polls to see if there is work for the driver to do:
    1. Handling received (RX) packets
      • undo DMA mapping, parse link-level packet header, pass packet up to next level of protocol stack
    2. Cleaning up transmitted (TX) packets
      • undo DMA mapping, recycle buffer, possibly notify sender
  • If thread finds no work to do, it enables the interrupt
  • Interrupt handler disables interrupt and signals polling thread
  • Advantages:
    • Less time wasted in overhead of interrupt handling
    • Poller can handle several packets in one cycle
    • If packets must be dropped (due to overload) they are dropped before wasting CPU time on them
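The interplay between the interrupt and the poller can be modeled with a few lines of user-space C. This is a simulation under simplifying assumptions (single CPU, no real hardware); struct napi_model and its functions are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

struct napi_model {
    int  rx_pending;      /* packets waiting in the receive ring */
    bool irq_enabled;     /* device is allowed to interrupt */
    bool poll_scheduled;  /* poller has been asked to run */
    int  irq_count;       /* interrupts actually taken */
};

/* A packet arrives from outside. It costs an interrupt only when
 * the system is not already polling. */
void model_packet_arrives(struct napi_model *m)
{
    m->rx_pending++;
    if (m->irq_enabled) {
        m->irq_count++;
        m->irq_enabled = false;     /* handler disables the interrupt ... */
        m->poll_scheduled = true;   /* ... and signals the poller */
    }
}

/* One polling pass: handle at most `budget` packets. When the ring
 * is empty, fall back to interrupt mode. Returns packets handled. */
int model_poll(struct napi_model *m, int budget)
{
    int done = 0;

    assert(m->poll_scheduled && budget > 0);
    while (done < budget && m->rx_pending > 0) {
        m->rx_pending--;            /* "pass the packet up the stack" */
        done++;
    }
    if (m->rx_pending == 0) {
        m->poll_scheduled = false;
        m->irq_enabled = true;
    }
    return done;
}
```

In this model, a burst of ten packets costs one interrupt rather than ten, which is the saving the list above describes.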

The Ethernet-HOWTO says:

When a card receives a packet from the network, what usually happens is that the card asks the CPU for attention by raising an interrupt. Then the CPU determines who caused the interrupt, and runs the card's driver interrupt handler which will in turn read the card's interrupt status to determine what the card wanted, and then in this case, run the receive portion of the card's driver, and finally exits.

Now imagine you are getting lots of Rx data, say 10 thousand packets per second all the time on some server. You can imagine that the above IRQ run-around into and out of the Rx portion of the driver adds up to a lot of overhead. A lot of CPU time could be saved by essentially turning off the Rx interrupt and just hanging around in the Rx portion of the driver, since it knows there is pretty much a steady flow of Rx work to do. This is the basic idea of NAPI.

For additional explanation, see:


Packet Reception in the E100 Driver

  • polling function is registered with the network_device layer, responsible for passing packets up to the network layer
  • DMA receive-buffers are allocated for the device, normally by e100_up
  • e100_up also installs an interrupt handler, responsible for activating polling
  • buffers are eventually returned to the device

Polling Function Registration

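The function e100_poll is bound as the poll method of the device in the e100_probe function (via netif_napi_add in the 2.6.25 kernel), and the polling weight is specified. The weight limits how many packets the device may process in one polling pass.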


Receive-Buffer Allocation

The data structure passed to netif_rx() or netif_receive_skb() is a struct sk_buff, which is allocated by calling netdev_alloc_skb().

The function netdev_alloc_skb() is called from e100_rx_alloc_skb(), which allocates a socket buffer of size determined by:

  • 2 bytes, referred to by symbol NET_IP_ALIGN, which pad the buffer so that the IP header following the 14-byte Ethernet header starts on a 4-byte boundary
  • the size of struct rfd
  • 1518 bytes, referred to by symbol VLAN_ETH_FRAME_LEN, which is the maximum number of octets permitted in a VLAN-tagged Ethernet frame, less the Frame Check Sequence (FCS)
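The arithmetic can be checked with a short sketch. The two constants are the values quoted above. The 14-byte Ethernet header length is standard, while the RFD size is left as a parameter because it depends on the driver's struct rfd. The TOY_ names are local to this example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NET_IP_ALIGN        2     /* padding, value quoted in the text */
#define TOY_ETH_HLEN            14    /* dest MAC + source MAC + EtherType */
#define TOY_VLAN_ETH_FRAME_LEN  1518  /* value quoted in the text */

/* Total receive buffer size for a given RFD size. */
size_t toy_rx_buffer_size(size_t rfd_size)
{
    return TOY_NET_IP_ALIGN + rfd_size + TOY_VLAN_ETH_FRAME_LEN;
}

/* Offset of the IP header from the start of the buffer, assuming the
 * RFD sits between the padding and the Ethernet frame. */
size_t toy_ip_header_offset(size_t rfd_size)
{
    assert(rfd_size % 4 == 0);   /* otherwise the padding would not help */
    return TOY_NET_IP_ALIGN + rfd_size + TOY_ETH_HLEN;
}
```

With an RFD size of 16 bytes (an assumption for the example), the buffer is 1536 bytes and the IP header lands at offset 32, a multiple of 4.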

Device-dependent details: The "RFD" in the comments refers to a "Receive Frame Descriptor", and the "RFA" refers to the "Receive Frame Area", which is a linked list of RFDs. For output, there is a linked list of CFDs (Command Frame Descriptors), called the CBL (Command Block List). The other communication with the device is through a third memory area, called the CSR (Control/Status Registers).


Receive Frame Descriptors (Simple Mode)

rfd diagram

Receive Frame Descriptors (Flexible Mode)

cb diagram

RFD Contents

Each RFD (Receive Frame Descriptor) has the following fields:

  • EL (Bit 31) - indicates that this RFD is the last one in the RFA
    (When the device uses up this frame it enters the "no resource" condition, generates an RNR interrupt, and starts discarding frames until the RU (receive unit) is restarted with sufficient resources.)
  • S (Bit 30) - indicates the RU should suspend after receiving the frame
  • H (Bit 20) - indicates the current RFD is a header RFD.
  • SF (Bit 19) - indicates not "simplified mode"
  • C (Bit 15) - indicates completion of frame reception (set by device)
  • OK (Bit 13) - indicates the frame was received without any errors, including sufficient memory
  • Status (Bits 12:0) - indicate the result of the receive operation
  • Link Address - 32-bit offset to the next RFD, relative to the RU base. (Circular lists are OK.)
  • Size - the data buffer size (only used in "simplified mode")
    In the header RFD this is the data buffer size excluding the header area.
    It should be an even number.
  • EOF - indicates the device has completed placing data in the data area
    (Set by device; must be cleared by software before re-use.)
  • F - indicates the device has updated the actual count field
    (Set by device; must be cleared by software before re-use.)
  • Actual Count - the number of bytes written into the data area.
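The bit positions listed above translate directly into masks. The sketch below assumes the first 32-bit word of the RFD has already been read into a host-order integer (byte-order conversion is ignored), and the macro and function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as listed in the text. */
#define RFD_EL          (UINT32_C(1) << 31)  /* last RFD in the RFA */
#define RFD_S           (UINT32_C(1) << 30)  /* suspend after this frame */
#define RFD_H           (UINT32_C(1) << 20)  /* header RFD */
#define RFD_SF          (UINT32_C(1) << 19)  /* not simplified mode */
#define RFD_C           (UINT32_C(1) << 15)  /* reception complete */
#define RFD_OK          (UINT32_C(1) << 13)  /* received without errors */
#define RFD_STATUS_MASK UINT32_C(0x1FFF)     /* bits 12:0 */

/* A frame may be passed up only when it is both complete and error-free. */
bool rfd_frame_ready(uint32_t word)
{
    return (word & RFD_C) != 0 && (word & RFD_OK) != 0;
}

bool rfd_is_last(uint32_t word)
{
    return (word & RFD_EL) != 0;
}

uint32_t rfd_status(uint32_t word)
{
    uint32_t status = word & RFD_STATUS_MASK;

    assert(status <= RFD_STATUS_MASK);
    return status;
}
```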

Command Block List

cb diagram

Driver can add blocks to the end of the chain while the device is processing blocks earlier in the chain.


Command Block Contents

Each CB (Command Block) has the following fields:

  • EL (Bit 31) - indicates that this CB is the last one in the list
    (After execution of the command, the CU becomes idle.)
  • S (Bit 30) - indicates the CU should suspend after executing the command
  • I (Bit 29) - indicates the CU should generate a CX interrupt after executing the command
  • CMD (Bits 18:16) - specifies the command.
    Examples:
    • Transmit
    • NOP
    • Individual (source) address setup
    • Configure
    • Multicast setup
    • Load microcode
    • Dump
    • Diagnose
  • C (Bit 15) - used to indicate completion of command execution (set by device)
  • X (Bit 14) - reserved
  • OK (Bit 13) - used to indicate error-free completion of the command (set by device)
  • Link Offset - 32-bit offset to the next CB, relative to the CU base. (Circular lists are OK.)
  • Optional Address and Data Fields - depend on the command. For Transmit:
    • Transmit buffer descriptor array address
    • TBD number = number of TB's in the TBD array
      TBD = (TB pointer, TB size)
    • Transmit threshold = min # of 8-byte blocks in FIFO before start of transmit
    • EOF
    • Transmit command block byte count (contiguous bytes, possibly in addition to other buffers in "flexible mode")

E100 Interrupts

  • CNA - if configured, when the CU suspends
  • CX - if configured, when CU encounters a TxCB with the I bit set
  • Frame receipt - optionally two interrupts, one of them at the start of receipt
  • RNR - out of buffers
  • SWI - software interrupt (generated by a bit in the SCB command word)

For more information on the E100 programming interface, see the Intel documentation.

All interrupts can be masked by the Mask bit in the SCB command word.


Back to the E100 Driver

The function e100_rx_alloc_skb() is called in two places, principally in e100_rx_alloc_list().

The function e100_rx_alloc_list() is called from several places, including e100_up().
It actually allocates objects of the datatype struct rx, which includes an sk_buff along with forward and backward links and a DMA address.

The function e100_up() is called from several places, principally from e100_open(), which is one of the standard network device entry points (exported methods).


Both struct rx and struct sk_buff include forward and backward links. How is each pair used?


Interrupt Handler

Besides allocating a list of receive frame descriptors, e100_up() installs an interrupt handler, e100_intr.

The function e100_intr() does several things, including reading the status of the NIC, acknowledging the interrupt, and calling __netif_rx_schedule_prep and then __netif_rx_schedule.

These two are actually wrappers for napi_schedule_prep and __napi_schedule. The former is part of a two-phase protocol for waking up the NAPI polling thread without accidentally running two copies of the polling routine (perhaps on different CPUs). The latter adds the device to this CPU's softnet_data poll_list and raises the softirq NET_RX_SOFTIRQ to indicate that the list needs polling. The handler for that softirq will do the next step in processing of the incoming packet.
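The two-phase idea can be sketched with a C11 atomic flag. This is a user-space model, not the kernel code: the toy_ names are invented, and the poll list is reduced to a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_napi {
    atomic_flag scheduled;   /* set while some poller owns the device */
    int on_poll_list;        /* how many times the device was queued */
};

void toy_napi_init(struct toy_napi *n)
{
    assert(n != NULL);
    atomic_flag_clear(&n->scheduled);
    n->on_poll_list = 0;
}

/* Phase 1: atomically claim the device. Exactly one caller gets
 * `true` until toy_complete() runs, even with concurrent callers. */
bool toy_schedule_prep(struct toy_napi *n)
{
    return !atomic_flag_test_and_set(&n->scheduled);
}

/* Phase 2: only the caller that won phase 1 queues the device. */
void toy_schedule(struct toy_napi *n)
{
    n->on_poll_list++;
}

/* Called by the poller when it has no more work for this device. */
void toy_complete(struct toy_napi *n)
{
    n->on_poll_list--;
    atomic_flag_clear(&n->scheduled);
}
```

Because phase 1 is a single atomic test-and-set, a second interrupt that arrives before polling finishes finds the flag already set and does not queue the device again.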


Why is no lock needed to add this device to the "poll list" for the current CPU?


Network Device Polling

In net_dev_init, which is called during system initialization to initialize the network device subsystem, the function net_rx_action is attached to the softirq NET_RX_SOFTIRQ.

When the function net_rx_action is called, in response to this softirq, it runs through a per-CPU softnet_data queue, executing the poll method of each device on the queue.


E100 Polling

In the case of the e100, the polling function is e100_poll. The most interesting part of this function is the call to e100_rx_clean.

The function e100_rx_clean() calls e100_rx_indicate() for each received frame, and replenishes the receive sk_buff list for the device.

The function e100_rx_indicate() does several things, including setting the data and tail pointers of the sk_buff to indicate the actual beginning and end of the data (via calls to skb_reserve and skb_put), and setting the pkt_type (via eth_type_trans) and the (link layer) protocol field of the sk_buff. It finally calls netif_receive_skb().

The function netif_receive_skb() eventually calls deliver_skb(), which does the actual delivery to the protocol-specific handler, via the function pt_prev->func().


The loop structure in netif_receive_skb is interesting. Why is pt_prev passed to deliver_skb rather than ptype? If you are not yet familiar with the Linux kernel struct list_head usage (covered in Ch 11 under "Linked Lists"), this may be a good time to learn how and why this kind of loop works.
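One way to see why the previous match is delivered, rather than the current one, is the following user-space model. It assumes each handler consumes one reference to the packet. Delivery to a matching handler is deferred until the next match is found, so that only the final handler receives the caller's original reference and no extra reference is taken for it. The toy_ names and the plain singly linked list are simplifications for the example.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pkt {
    int users;   /* reference count */
    int proto;   /* protocol number of the packet */
};

struct toy_ptype {
    int proto;
    int (*func)(struct toy_pkt *);   /* protocol-specific handler */
    struct toy_ptype *next;
};

static int toy_calls;   /* how many handlers ran, for demonstration */

/* Example handler: consumes (drops) one reference to the packet. */
int toy_handler(struct toy_pkt *pkt)
{
    pkt->users--;
    toy_calls++;
    return 0;
}

/* Deliver to a handler that is not the last: take an extra reference. */
static int toy_deliver(struct toy_pkt *pkt, struct toy_ptype *pt_prev)
{
    pkt->users++;
    return pt_prev->func(pkt);
}

int toy_receive(struct toy_pkt *pkt, struct toy_ptype *list)
{
    struct toy_ptype *pt_prev = NULL;
    int ret = -1;

    assert(pkt->users == 1);
    for (struct toy_ptype *pt = list; pt; pt = pt->next) {
        if (pt->proto != pkt->proto)
            continue;
        if (pt_prev)                 /* a later match exists, so the   */
            ret = toy_deliver(pkt, pt_prev);  /* earlier one is not last */
        pt_prev = pt;
    }
    if (pt_prev)
        ret = pt_prev->func(pkt);    /* last match gets our reference */
    else
        pkt->users--;                /* nobody wanted it: drop */
    return ret;
}
```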

The function netif_receive_skb conditionally calls netpoll_rx(), which appears to directly handle ARP and UDP packets, but seems to simply return for TCP packets. This seems to be an optional optimization to cut down on the overhead of passing certain packets up through the protocol hierarchy.

Because we have limited classroom time, we have concentrated on the e100 driver (a complex real driver) instead of the toy driver snull provided in the text. Although the textbook examples are now mostly obsolete, I have updated them enough that they will at least compile with the 2.6.25 kernel, and it may still help to review the network driver concepts in that simpler context.

In particular, it may be useful to now look back at the snull driver, starting with snull_rx, snull_regular_interrupt, and snull_napi_interrupt.


Sending a Packet

Sending of a packet starts with one of the application-level calls, like send() or sendto(). After the system determines the route and assembles the packet with the appropriate headers, it eventually makes an internal call to the netdev->hard_start_xmit method of the actual device.

For the e100 driver, the actual function bound to this link is e100_xmit_frame(), which calls e100_exec_cb(), which calls e100_exec_cmd() to finally deliver a command to the controller. Most of the code in these functions is device-specific. Since this is a DMA device, the data is passed to and from the device indirectly, by giving pointers to memory-mapped buffers containing the commands.

Observe the call to netif_stop_queue from e100_xmit_frame() in the case that the packet used up the last available queue space for this device.
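The flow-control idea behind stopping the queue can be modeled as follows. This is a user-space sketch with invented names (struct toy_txq, toy_xmit, toy_tx_clean) and an arbitrary ring size. The "stopped" flag plays the role of the state set by netif_stop_queue and cleared when the queue is woken.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_TX_RING 4   /* arbitrary number of transmit slots */

struct toy_txq {
    int  in_flight;   /* slots handed to the device, not yet cleaned */
    bool stopped;     /* true: upper layer must not call toy_xmit() */
};

/* Queue one packet. Returns 0 on success, -1 if the ring was full. */
int toy_xmit(struct toy_txq *q)
{
    if (q->in_flight == TOY_TX_RING) {
        q->stopped = true;           /* should not happen if callers */
        return -1;                   /* respect the stopped flag     */
    }
    q->in_flight++;
    if (q->in_flight == TOY_TX_RING)
        q->stopped = true;           /* this packet used the last slot */
    return 0;
}

/* Transmit-completion cleanup: free slots and restart the queue. */
void toy_tx_clean(struct toy_txq *q, int completed)
{
    assert(completed >= 0 && completed <= q->in_flight);
    q->in_flight -= completed;
    if (completed > 0)
        q->stopped = false;          /* room again: wake the queue */
}
```

Stopping the queue as soon as the last slot is taken means the upper layer normally never sees a failed transmit call.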


This driver does not address the short packet information leakage problem discussed in the text. That is OK, because the comments say "hardware padding of short packets to the minimum packet size is enabled".


Timeouts

  • can detect several kinds of failure:
    • hardware bugs
    • driver bugs
    • failures/bugs of communication partners (other hosts)
  • implemented at two levels in the E100 driver:
    • a generic timeout, implemented by network device layer (net_device)
      • netdev->tx_timeout method and netdev->watchdog_timeo value are set in e100_probe
      • e100_tx_timeout is called by the network device layer if the timeout value is exceeded
        (look what this handler does)
    • a specific timeout, implemented within the driver itself
      • A kernel timer, nic->watchdog is also set up in e100_probe, with its own handler
      • e100_watchdog, the handler for this timer, does device-specific recovery actions
        (contrast this with what is done in the generic timeout handler, above)
      • The watchdog timer is first set inside e100_up, and reset by a call to mod_timer inside the handler

Review of Uses of Memory Mapping in the E100 Driver

  • A call to pci_iomap(pdev, ...) is used to provide memory-mapped access to the device registers.

  • The receive frame descriptors (struct rx) are allocated using kcalloc() (contiguous/array allocation).

  • Each struct rx holds a pointer to an sk_buff, which is allocated by a call to netdev_alloc_skb(). The data area of this sk_buff is what is passed to pci_map_single().

  • A call to pci_map_single(...) is used to make each sk_buff accessible to the device, for example in e1000_tx_map, e1000_clean_rx_irq_ps, and e1000_alloc_rx_buffers_ps of the E1000 driver. For the i386, this seems to do very little, just calling flush_write_buffers() and then virt_to_phys(), since the buffer is assumed to already lie in the range of pages that are always mapped into kernel virtual memory.

  • The E100 driver does not handle scatter-gather output. For an example of memory mapping to implement scatter-gather see the function e1000_tx_map in the E1000 driver. This function is responsible for setting up the DMA mapping for all the elements of a scatter-gather list.


Review of Uses of Interrupts in the E100 Driver

  • request_irq is called in e100_up to bind the handler e100_intr.

  • e100_intr reads the status of the device and acknowledges the interrupt to the device. If the device has hit "receive no resource" (meaning it is out of receive buffer space), a flag is set to indicate that the device is not receiving any more. In all cases, netif_rx_schedule is called to make certain an effort is under way to process the packets already received.

© 2004-2008 T. P. Baker. ($Id: ch17.html,v 1.1 2008/04/28 12:41:35 baker Exp baker $)