This is an overview of InfiniBand architecture that I gathered from the IB white paper and user manual.
InfiniBand (IB) is an alternative to the widely used TCP/IP/Ethernet networking stack. IB mainly targets High Performance Computing systems, but it also supports a range of transport services: reliable or unreliable, connection-oriented or connectionless.
IB is designed around two fundamental goals:
- Direct access to the NIC from user space
- Ability to transfer data without the OS having to copy memory, translate addresses, etc.
IB differs from Ethernet in the way it presents the messaging service to the application. Ethernet is byte-stream oriented, whereas IB is message oriented. Before communication can start, a channel needs to be created between the two applications/services. Each end of the channel has a Queue Pair (QP), which consists of a send queue and a receive queue. The QP is mapped directly into the user address space. Actions are performed on the QP using Verbs (the APIs provided by the IB driver).
IB provides two data transfer semantics: channel semantics and memory semantics. With channel (send/receive) semantics, messages are sent through the channel by posting Work Requests (WRs). Using verbs, the sender posts a WR to its send queue. For the receiver to receive the data, it has to post a Receive WR to its receive queue beforehand. This is an important characteristic of IB: it uses a credit-based system to transfer data in order to avoid packet drops due to buffer overflow.
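The send/receive rule above can be illustrated with a small toy model. This is only a sketch in plain Python, not the real verbs API: the names `QueuePair`, `post_send`, `post_recv` and `deliver` are made up for illustration, and `deliver` stands in for the hardware that moves a message across the channel. The point it shows is that a send only completes once the receiver has posted a receive buffer.

```python
from collections import deque


class QueuePair:
    """Toy stand-in for one end of a channel (hypothetical, not the verbs API)."""

    def __init__(self):
        self.send_queue = deque()   # messages waiting to be sent
        self.recv_queue = deque()   # receive buffers posted in advance
        self.completions = []       # messages that have been received

    def post_send(self, payload):
        """Queue a message for sending."""
        self.send_queue.append(bytes(payload))

    def post_recv(self, buffer_size):
        """Post an empty receive buffer so that one incoming message can land."""
        self.recv_queue.append(bytearray(buffer_size))


def deliver(sender, receiver):
    """Move one message from sender to receiver.

    Returns False (and leaves the message queued) when there is nothing to
    send or when the receiver has not posted a receive buffer.
    """
    if not sender.send_queue or not receiver.recv_queue:
        return False
    payload = sender.send_queue[0]
    buf = receiver.recv_queue[0]
    if len(payload) > len(buf):
        raise ValueError("posted receive buffer is too small")
    sender.send_queue.popleft()
    receiver.recv_queue.popleft()
    buf[:len(payload)] = payload
    receiver.completions.append(bytes(buf[:len(payload)]))
    return True
```

In this model, calling `deliver` before the receiver has called `post_recv` leaves the message in the send queue; once a buffer is posted, the same call succeeds.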
Memory (RDMA read/write) semantics is used to directly read from or write to a remote memory location. For example, for a file system (FS) to write data to a remote block storage device, the FS first writes the data to a local buffer and registers that buffer with the NIC, which returns a key. The FS sends this key and the buffer's virtual memory address, along with the SCSI block write command, to the storage device. In effect, the FS has passed control of the buffer to the remote server. The FS then posts a receive request and waits for the remote server to respond on completion. After reading the request, the remote server constructs an RDMA read operation against the FS's buffer using the key and virtual memory address, and posts it on its work queue. Once the transfer is done, it responds to the FS with the status.
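The file system example can be traced step by step with another toy model. Again, this is a plain Python sketch under stated assumptions, not real RDMA code: `ToyNic`, `register`, `rdma_read` and `storage_handle_write` are hypothetical names, the command is a plain dict rather than a SCSI command, and the "remote" read is just a method call on the initiator's NIC object.

```python
class ToyNic:
    """Toy NIC that tracks registered memory regions (hypothetical)."""

    def __init__(self):
        self._regions = {}   # key -> (base virtual address, buffer)
        self._next_key = 1

    def register(self, address, buffer):
        """Register a buffer and return the key a remote peer must present."""
        key = self._next_key
        self._next_key += 1
        self._regions[key] = (address, buffer)
        return key

    def rdma_read(self, key, address, length):
        """Serve a remote read of `length` bytes at `address`, checked against the key."""
        if key not in self._regions:
            raise PermissionError("unknown memory key")
        base, buffer = self._regions[key]
        offset = address - base
        if offset < 0 or offset + length > len(buffer):
            raise ValueError("read falls outside the registered region")
        return bytes(buffer[offset:offset + length])


def storage_handle_write(command, initiator_nic, disk):
    """Storage-server side: pull the data with an RDMA read, then report status."""
    data = initiator_nic.rdma_read(command["key"], command["addr"], command["length"])
    disk[command["block"]] = data
    return "OK"


# File-system side: fill a buffer, register it, and send key + address with the command.
fs_nic = ToyNic()
fs_buffer = bytearray(b"file contents")
fs_key = fs_nic.register(0x1000, fs_buffer)
write_command = {"block": 7, "key": fs_key, "addr": 0x1000, "length": len(fs_buffer)}

disk = {}
status = storage_handle_write(write_command, fs_nic, disk)
```

Note that the FS never pushes the data itself: it only hands over the key and address, and the storage server pulls the bytes when it is ready.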
Even though channel semantics does not involve the CPU in the transfer itself, it is slower than memory semantics. For this reason, channel semantics is typically used for control messages, while memory semantics carries the bulk data.
Like the OSI model, IB defines a full set of networking layers. I believe the transport layer and everything below it are implemented in hardware. IB provides a Software Transport Interface that the application uses to communicate with the transport layer. The transport layer implementation is slightly different from TCP, since IB provides credit-based flow control. For routing between subnets, the network layer uses 128-bit Global IDs (GIDs) that follow the IPv6 address format, rather than running the IPv6 protocol itself. Compared to Ethernet, the IB link layer provides much richer functionality, for example priority channels. IB supports up to 16 such channels, called Virtual Lanes (VLs). VL15 is reserved for subnet management traffic and takes precedence over the others; the remaining VLs are used for data transfer. For link-level routing within a subnet, IB uses a Local ID (LID) assigned by the subnet manager instead of a globally unique MAC address.
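As a rough illustration of the link-layer ideas above, here is a toy switch that forwards on the destination LID and distinguishes the management lane from the data lanes. It is a simplified sketch, not how a real switch is programmed: `ToySwitch`, `set_route` and `forward` are hypothetical names, and `set_route` stands in for the subnet manager filling in the forwarding table.

```python
MANAGEMENT_VL = 15   # lane reserved for subnet management traffic
NUM_VLS = 16         # VL0 to VL15


class ToySwitch:
    """Toy switch that forwards packets by destination LID (hypothetical)."""

    def __init__(self):
        self.forwarding_table = {}   # destination LID -> output port

    def set_route(self, lid, port):
        """Stand-in for the subnet manager programming one table entry."""
        if not 0 <= lid <= 0xFFFF:
            raise ValueError("LID must fit in 16 bits")
        self.forwarding_table[lid] = port

    def forward(self, dest_lid, vl):
        """Return (output port, traffic class) for a packet on the given VL."""
        if not 0 <= vl < NUM_VLS:
            raise ValueError("VL must be in the range 0..15")
        port = self.forwarding_table[dest_lid]
        traffic_class = "management" if vl == MANAGEMENT_VL else "data"
        return port, traffic_class
```

The lookup uses only the subnet-local LID, which is why no globally unique address is needed at this layer.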
IB can also be used as an alternative to PCI. The main difference is that PCI is a shared bus architecture, whereas IB is a switched fabric.