Configuring a point-to-point IP connection with Infiniband

Introduction

I had access to a couple of single-port PCI Infiniband cards and a suitable cable. I wanted to use them to set up a point-to-point connection for syncing DRBD devices between two Debian 11 systems.

I found a few links describing what needs to be done (Arch Linux’s Infiniband page was probably the best), but none of them explained what depended on what in enough detail for me to understand. Hence this page.

This page is a work in progress! When I meet a new problem I’ll come back and update it. Last edited: 13/02/2021.

Note that an Infiniband network needs at least one subnet manager, which checks the network for new adapters and adds them to the routing tables. As a consequence, the first node behaves slightly differently during its configuration from the second: the first node will not yet have a subnet manager available to it, whereas the second node will. This procedure tries to take that into account.
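
The difference between the two nodes shows up in the port state reported by ibstat (installed in the procedure below): a port whose link is up but which no subnet manager has configured yet stays in Initializing, and it moves to Active once a subnet manager has dealt with it. As a minimal sketch, assuming ibstat output in the same shape as the extracts on this page, the helper functions below pull out that state. The function names are illustrative and are not part of infiniband-diags.

```shell
# Print the logical port state ("Initializing", "Active", ...) of the first
# port found in ibstat-style text on stdin. "Physical state:" lines are
# skipped because they do not begin with "State:".
ib_port_state() {
    awk -F': *' '/^[ \t]*State:/ { print $2; exit }'
}

# Succeed (exit status 0) if the port still appears to be waiting for a
# subnet manager, i.e. its logical state is Initializing.
ib_waiting_for_sm() {
    [ "$(ib_port_state)" = "Initializing" ]
}
```

On a node with the card installed this could be used as: ibstat | ib_waiting_for_sm && echo "no subnet manager has configured this port yet"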

Procedure

Complete this procedure on each of the two nodes, ideally doing it in parallel.

  1. Insert the cards and connect with the cable.
  2. Use lspci to verify the card is detected, as in this output:
    torchio# lspci 
    ...
    03:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
    ...
    torchio#
  3. Install the following package to get access to the ibstat command:
    apt-get install infiniband-diags
    
  4. Run ibstat; the state should be either Active or Initializing, depending on whether there is already a subnet manager on the network, as shown in this output:
    torchio# ibstat
    ...
    State: Initializing
    Physical state: LinkUp
    ...
    torchio#
  5. Without the correct modules loaded, the IB network cannot be reached (regardless of state), as shown in the output below:
    torchio# ibhosts 
    ibwarn: [1267] get_abi_version: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?
    ibwarn: [1267] mad_rpc_open_port: can't open UMAD port ((null):0)
    src/ibnetdisc.c:786; can't open MAD port ((null):0)
    /usr/sbin/ibnetdiscover: iberror: failed: discover failed
    torchio#
  6. Load the following modules:
    modprobe -a ib_uverbs mlx5_ib mlx5_core ib_core ib_umad rdma_ucm
  7. Run ibhosts again; this time it should report other HCAs (cards) on the IB network, as shown in the output below:
    torchio# ibhosts
    Ca : 0x7cfe900300b82270 ports 1 "MT4113 ConnectIB Mellanox Technologies"
    Ca : 0x7cfe900300b82220 ports 1 "MT4113 ConnectIB Mellanox Technologies"
    torchio#
  8. But looking back at ibstat it still shows Initializing because we still have no subnet manager, as shown in the output below:
    torchio# ibstat
    ...
    State: Initializing
    Physical state: LinkUp
    ...
    torchio#
  9. Regardless of whether there is already a subnet manager on the network, install and start one:
    apt-get -y install opensm
  10. Run ibstat again; this time it should show the state as Active, as shown in this output:
    torchio# ibstat
    ...
    State: Active
    Physical state: LinkUp
    ...
    torchio#
  11. Even though ibstat and ibhosts now work, the infiniband IP interface is still not accessible until this is run:
    torchio# modprobe -a ib_ipoib
    torchio#

    (Note that this fix does not survive a reboot.)

  12. To determine the infiniband IP interface name run:
    torchio# ifconfig -a
    ...
    torchio#

    (Note that the interface name can vary; in my own case it was ibp1s0, which is used in the text below.)

  13. Temporarily configure an IP address on the Infiniband IP interface (ibp1s0 here) on both hosts with something like this:
    fiori# ifconfig ibp1s0 192.168.2.6 netmask 255.255.255.0 up
    fiori#
    
    torchio# ifconfig ibp1s0 192.168.2.7 netmask 255.255.255.0 up
    torchio#
  14. Do a ping test, as shown in this output:
    torchio# ping 192.168.2.6 -c 2
    PING 192.168.2.6 (192.168.2.6) 56(84) bytes of data.
    64 bytes from 192.168.2.6: icmp_seq=1 ttl=64 time=0.162 ms
    64 bytes from 192.168.2.6: icmp_seq=2 ttl=64 time=0.123 ms
    
    --- 192.168.2.6 ping statistics ---
    2 packets transmitted, 2 received, 0% packet loss, time 8ms
    rtt min/avg/max/mdev = 0.123/0.142/0.162/0.022 ms
    torchio#
  15. To make all the above work reboot safe:
    1. Add the needed modules to /etc/modules-load.d/ib.conf, one per line, by running:
      echo ib_uverbs mlx5_ib mlx5_core ib_core ib_umad rdma_ucm ib_ipoib | xargs -n 1 echo >> /etc/modules-load.d/ib.conf
    2. Add a suitable entry to /etc/network/interfaces or /etc/network/interfaces.d/ibp1s0 as shown in the extract below:
      auto ibp1s0 
      iface ibp1s0 inet static
          address 192.168.2.7
          netmask 255.255.255.0
    3. Reboot and test that ping still works.
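
Step 15 can also be sketched as two small generator functions, which makes it easy to look at the file contents before writing anything under /etc. This is only a sketch: the function names are hypothetical, and the module list, interface name and addresses are simply the ones used on this page.

```shell
# Print one module name per line, which is the layout expected in
# /etc/modules-load.d/*.conf files.
ib_modules_conf() {
    printf '%s\n' ib_uverbs mlx5_ib mlx5_core ib_core ib_umad rdma_ucm ib_ipoib
}

# Print an ifupdown stanza for a statically addressed interface.
# Usage: ib_iface_stanza IFACE ADDRESS NETMASK
ib_iface_stanza() {
    printf 'auto %s\niface %s inet static\n    address %s\n    netmask %s\n' \
        "$1" "$1" "$2" "$3"
}
```

Once the output looks right, it can be redirected into place as root:

    ib_modules_conf > /etc/modules-load.d/ib.conf
    ib_iface_stanza ibp1s0 192.168.2.7 255.255.255.0 > /etc/network/interfaces.d/ibp1s0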

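For step 12, the Infiniband IP interface can also be picked out without reading through the whole interface list. A sketch, assuming the kernel reports link type 32 (ARPHRD_INFINIBAND) in /sys/class/net/<name>/type for such interfaces; the function name and the optional root argument (there only so the function can be tried against a fake directory tree) are illustrative.

```shell
# List network interfaces whose link type is 32 (ARPHRD_INFINIBAND).
# Usage: list_ipoib_ifaces [ROOT]
# ROOT defaults to the empty string, so the real /sys is searched.
list_ipoib_ifaces() {
    root=${1:-}
    for type_file in "$root"/sys/class/net/*/type; do
        [ -r "$type_file" ] || continue      # the glob matched nothing
        if [ "$(cat "$type_file")" = "32" ]; then
            basename "$(dirname "$type_file")"
        fi
    done
}
```

Run without arguments after ib_ipoib has been loaded, it should print the interface name (ibp1s0 in my case).
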
Tuning and performance testing

  1. Install ibutils, run ibdiagnet and examine the output for warnings and errors, as shown in the output below:
    torchio# apt-get install ibutils
    torchio# ibdiagnet 
    ...
    -W- Topology file is not specified.
    ...
    -W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps
    ...
    torchio#
  2. The first warning I did not manage to solve in a nice manner. I could use option -wt to write a topology file:
    torchio# ibdiagnet -wt ~/pasta.top > /dev/null
    torchio#

    but when I tried to call ibdiagnet with that file as the topology file, it aborted:

    torchio# ibdiagnet -t ~/pasta.top -s S7cfe900300b82270
    ...
    Aborted
    torchio# 
    
  3. The second warning may only apply to multicast groups. The interfaces can be run in ‘datagram’ or ‘connected’ mode; the former offers lower latency; the latter offers a higher MTU. This page contains the following table:
    Mode       MTU    MB/s  us latency
    datagram   2044   707   19.4
    connected  2044   353   18.9
    connected  65520  726   19.6

    which I take to mean: the most important thing is not to change the MTU from the default for a particular mode.
  4. I did my own performance tests as follows:
    torchio# apt-get -y install netperf
    torchio# netserver -4
    torchio#
    
    fiori# apt-get -y install netperf
    fiori# netperf -4 -H 192.168.3.7   #  over DRBD cluster network
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.7 () port 0 AF_INET : demo
    Recv   Send    Send                          
    Socket Socket  Message  Elapsed              
    Size   Size    Size     Time     Throughput  
    bytes  bytes   bytes    secs.    10^6bits/sec
    131072  16384  16384    10.02     936.37
    fiori# netperf -4 -H 192.168.2.7 # over IB
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.7 () port 0 AF_INET : demo
    Recv   Send    Send                          
    Socket Socket  Message  Elapsed              
    Size   Size    Size     Time     Throughput  
    bytes  bytes   bytes    secs.    10^6bits/sec  
    131072  16384  16384    10.00    3308.75
    fiori#

    But this is still a long way from what I was expecting.
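
To put a number on the gap, the throughput figure is the last column of the numeric line at the end of each netperf run. The sketch below (hypothetical function names, and assuming output in the shape shown above) extracts that figure and computes the ratio between two runs.

```shell
# Print the throughput (10^6 bits/sec) from netperf TCP stream output on
# stdin: the last field of the last line that contains only numbers.
netperf_mbps() {
    awk '/^[ \t]*[0-9][0-9. \t]*$/ { v = $NF } END { if (v != "") print v }'
}

# Print how many times larger the second figure is than the first.
# Usage: speedup BASELINE NEW
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", new / base }'
}
```

With the two results above, speedup 936.37 3308.75 prints 3.53, so the IB link is about three and a half times faster than the existing DRBD cluster network.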

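For step 3 of this section, the mode and MTU that an interface is currently using can be read from sysfs. A sketch, assuming the ipoib driver exposes /sys/class/net/<name>/mode (containing datagram or connected) next to the standard mtu file; the function name and the optional root argument (there only so the function can be tried against a fake directory tree) are illustrative.

```shell
# Print "IFACE MODE MTU" for an Infiniband IP interface, or return 1 if
# the interface has no readable mode and mtu files.
# Usage: ipoib_mode_mtu IFACE [ROOT]
ipoib_mode_mtu() {
    dir="${2:-}/sys/class/net/$1"
    [ -r "$dir/mode" ] && [ -r "$dir/mtu" ] || return 1
    printf '%s %s %s\n' "$1" "$(cat "$dir/mode")" "$(cat "$dir/mtu")"
}
```

Running ipoib_mode_mtu ibp1s0 on each node shows whether both ends agree on the mode and whether the MTU is still the default for that mode.
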
To do

  1. Look at l-s’s IB tests – is there anything useful there?
  2. Look at the other scripts – I thought there was a way to get info on cables
  3. Switch DRBD to using infiniband

See also