Wednesday, March 21, 2018

Homelab Infiniband part 1

Infiniband is a high speed network protocol. It uses special PCIe cards (called Host Channel Adapters, or HCAs), cables (copper or fiber optic, both passive and active), and switches. An Infiniband network also requires a "subnet manager" running on one of the nodes, either a computer or a switch, to manage the fabric.

There are multiple versions/speeds. It started with "SDR", or "Single Data Rate", which was 4x lanes at 2.5 Gbit/s = 10 Gbit/s. Then came DDR, "Double Data Rate", with 4x 5 Gbit/s lanes for 20 Gbit/s. Then QDR ("Quad Data Rate") at 40 Gbit/s, FDR10 (also 40 Gbit/s), FDR (56 Gbit/s), and so on. The actual throughput is usually lower than the signaling rate: for example, the effective throughput of QDR is 32 Gbit/s because of 8b/10b encoding overhead, though starting with FDR10 the encoding changed and throughput got much closer to lane speed x number of lanes. The number of lanes is almost always 4x.

There are various manufacturers, but the two main ones are Mellanox and Intel. Mellanox hardware is often rebranded as HP, IBM, and Sun. Intel purchased QLogic, and there are various Intel/QLogic rebrands, too. While they're all supposed to be compatible, they really aren't: the Intel/QLogic cards use the PSM protocol for MPI, while the Mellanox cards use Verbs and offload the message processing from the CPU to the card. There are advantages and disadvantages to both. You can sometimes get Intel and Mellanox hardware to work together, but it's rare and usually means something is not running efficiently. It's strongly suggested to stick with either Mellanox and its rebrands, or Intel and its rebrands, only.
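Backing up to the speed numbers for a second, here is the back-of-the-envelope math behind them (assuming the usual 4x lanes, 8b/10b encoding for SDR/DDR/QDR, and 64b/66b encoding for FDR10/FDR):

# effective throughput = per-lane signaling rate x 4 lanes x encoding efficiency
echo "QDR:   $(echo '10 * 4 * 8 / 10'       | bc -l) Gbit/s"   # 32
echo "FDR10: $(echo '10 * 4 * 64 / 66'      | bc -l) Gbit/s"   # ~38.8
echo "FDR:   $(echo '14.0625 * 4 * 64 / 66' | bc -l) Gbit/s"   # ~54.5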

A note about firmware: the Mellanox HCAs use firmware that has to be periodically updated, and it's usually available for free. However, the rebranded ones have their own firmware built from the Mellanox firmware. If you can figure out which Mellanox card your rebrand is based on, you can reflash it with the original Mellanox firmware. The Intel/QLogic cards do not require firmware: they are ASICs, and since the message processing isn't offloaded to the card, they can be simpler. However, they require a special software library (which only works with specific versions of specific OSes) to function correctly.
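Going back to the Mellanox reflashing for a second, here is a rough sketch of what cross-flashing a rebrand looks like with the open-source mstflint tool. The PCI address 04:00.0 and the firmware file name are placeholders, and you should verify the PSID and exact Mellanox part number for your card before burning anything:

# Find the card and check its current firmware version and PSID
lspci | grep -i mellanox
mstflint -d 04:00.0 query
# Burn the stock Mellanox image; rebrands normally need the PSID override flag
mstflint -d 04:00.0 -i fw-ConnectX2-example.bin -allow_psid_change burn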

A note about cables and connections: you have to be careful about connectors when buying Infiniband hardware. SDR and DDR hardware typically has CX4 connectors and cables. However, some DDR hardware has QSFP+ ports and cables, which are also used by QDR and FDR and are not compatible with CX4. None of this is compatible with SFP+ or Ethernet cabling, though you can buy adapters (which limit your speed). Each link will auto-negotiate down to the slowest component in it. For example, if you buy 3x FDR cards, an FDR switch, and QDR cables, you will probably end up with a QDR-speed network.
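Once everything is cabled up, it's worth checking what each link actually negotiated rather than assuming. Two commands from the infiniband-diags package do the job (run as root with the subnet manager up; exact output depends on your hardware):

# Local HCA port state and negotiated rate ("Rate: 40" = QDR, "Rate: 20" = DDR)
ibstat
# Walk the whole fabric and list the negotiated width/speed of every link
iblinkinfo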

A note about Ethernet: a lot of homelabbers want a high speed network. The most cost-effective way to do this at the time of this writing is actually with QDR Infiniband hardware, not 10GbE hardware. You can generally buy QDR HCAs and switches for less than equivalent 10GbE hardware. Once you have your QDR network set up, the best thing to do is run IPoIB (IP over Infiniband). That will give you much higher data throughput than 10GbE while still acting a lot like Ethernet.
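Here's a minimal sketch of bringing up IPoIB on CentOS 7 with NetworkManager. It assumes the "Infiniband Support" group is already installed, that the HCA port shows up as ib0, and the 192.168.50.x addressing is made up for illustration:

# Load the IPoIB module and create an addressed connection on the ib0 interface
modprobe ib_ipoib
nmcli connection add type infiniband ifname ib0 con-name ib0 ip4 192.168.50.10/24
# "connected" transport mode usually gives better throughput than "datagram"
nmcli connection modify ib0 infiniband.transport-mode connected
nmcli connection up ib0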

This is a very brief summary. The Wikipedia article is good, but there is a lot more information available elsewhere online.

Anyway, before I knew all of the above, I had purchased a random hodgepodge of Infiniband hardware from eBay. I had a 36-port internally managed Sun switch, a Mellanox IS50XX, two HP 544FLR-QSFP QDR ConnectX-3 HCAs, one Sun dual port QDR ConnectX-2 HCA, 4x QLE7340 single port QDR QLogic cards (which came with a Supermicro server), 5x 2M HP (Mellanox) QDR cables, and 2x QLogic QDR cables.

The first setup I tried, which mostly just worked, was the Sun switch with the Sun and HP HCAs. I updated the firmware on the HP HCAs since that was freely available. Sun's firmware updates are only available with a very expensive paid customer service contract; I could have reflashed the Sun HCA with Mellanox firmware, but it turns out I didn't need to. Luckily the manuals are all available online. I plugged it all together, installed the "Infiniband Support" package group in CentOS on all three nodes, and it just worked. Boom, 40 Gbit/s (according to ibstat). This meant the Sun switch's internal subnet manager was working.

I wanted to be able to access the switch's management interface, though, just for fun. I didn't know the IP address of the ethernet management port, but the switch has a USB serial port on it. I purchased two USB-to-serial converters and a null modem cable to go between them, connected everything to my laptop, and tried to connect. Nothing...not a peep. I tried this every way I could think of. I even plugged the other USB end into another USB port just to make sure the cables were working (they were). It seems that either the USB converter wasn't really converting right, or the prior owner (oddly enough it identifies itself as a dhs.gov switch in the output of ibswitches) locked it down. It turns out that these switches have no hardware reset. That sucks. I then tried something a little crazy: brute-force pinging all of the private IP address ranges to see if I got a response. There are parallelized Linux and Windows tools available to do this. It still took about a day...nothing. *sigh*. Oh well, at least the subnet manager starts up every time and seems to be doing its job.
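For reference, the CentOS 7 side of that setup was just the package group plus a couple of checks, and the ping sweep at the end is how I did the brute-force hunt for the management IP. The sweep below uses nmap as one example of a sweep tool, the targets are just the standard private (RFC 1918) ranges, and the service names are as they were on CentOS 7:

# On each node: install the Infiniband stack and start the RDMA service
yum groupinstall -y "Infiniband Support"
systemctl enable rdma
systemctl start rdma
# Verify the link came up at QDR ("Rate: 40") and the port is Active
ibstat
# The day-long brute-force search for the switch's management IP
nmap -sn 192.168.0.0/16 10.0.0.0/8 172.16.0.0/12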

It was around this time I bought the Supermicro server. This came with the 4x QLE7340s, which turned out to be nightmares. I also bought the Mellanox switch....also a nightmare.

I'll start with the Mellanox switch. I purchased an IBM rebrand of a Mellanox IS50XX from eBay. It turns out you really need to check whether they come with the 36-port enable license (otherwise you're limited to 18 ports) and the FabricIT license (which runs the internal subnet manager). To save space, here are the Mellanox community post and ServeTheHome post on this mess. While I never figured out if the switch was DDR or QDR, I'm fairly certain IBM never sold any DDR switches, so it was probably some sort of configuration problem. I ended up re-selling it.

The QLE7340s: after about two weeks of emailing back and forth with the Intel rep, I finally got them working back-to-back (plugged directly into each other) at QDR speed. The problem was that the published PDF user guide for the required software stack is missing some prerequisites. Here's the user guide you need, and here are the commands you need to enter after installing CentOS 7.2 minimal. You must use 7.2 unless they've updated the software since this post; the supported version is listed in the readme text file.

yum install -y dmidecode tcl tcl-devel pciutils-devel binutils-devel tk libstdc++ libgfortran sysfsutils zlib-devel perl lsof tcsh glibc libstdc++-devel gcc-gfortran rpm-build glibc.i686 libtool bison flex gcc-c++
yum install -y http://vault.centos.org/7.2.1511/os/x86_64/Packages/kernel-devel-3.10.0-327.el7.x86_64.rpm

Then install "OFED+ Host" software as in the pdf guide. The True Scale software is pay-for only, but you don't need it unless you have a huge Infiniband "fabric". That should get the QLE7340's working back-to-back. However, they refused to negotiate to more than 10 Gb/s with my Sun switch, or with the Mellanox IS5030 switch. After another month with the Intel Rep, turns out that the QLE7340's only support the Mellanox X series or newer switches and Intel/Qlogic switches (all are still pretty expensive). I was lucky he helped me out so much...none of the companies officially support EoL hardware. I sold the QLE7340's and am going to buy 4x Sun HCAs identical the one I currently have since I know they work with my switch. I test fit that one in the cramped high density nodes...supermicro did an excellent job designing those by the way. It fits. Oh, I also sold the HP HCAs because I sold the HP servers. My timeline is kind of messed up, sorry.

Here is a table of Mellanox Infiniband HCA rebrands and corresponding Mellanox part numbers. This is useful for finding the correct Mellanox firmware for IBM, HP, and Sun/Oracle cards.

Part 2 will be about getting the full setup working.
