Disclaimer

This is a collection of random infrastructure notes based on the work I'm doing at any given time. Most of the technical notes here assume an infrastructure similar to the one I'm working on (which I will not describe in detail, and which is subject to change). I can't be responsible if you do something that's documented here and bad things happen.

Wednesday, February 13, 2008

Multipath and EqualLogic iSCSI

At my job, we're using EqualLogic iSCSI arrays for all of our online primary storage. It's an easy-to-configure, highly scalable, and highly flexible solution for us. We have two arrays combined into a logical "group". Both arrays are connected to two Foundry FastIron gigabit Ethernet switches, and each server has a NIC connected to each switch as well, for path redundancy. We use CentOS on our servers. Any technical advice below assumes you have a similar setup. I'm also assuming that you've already configured your arrays and read up on EqualLogic's best practices regarding jumbo frames and flow control.

A nice thing that CentOS 5.1 adds over the 5.0 version is an updated iscsi-initiator-utils. It adds the "iface" context, which allows you to configure more than one physical interface for connections to iSCSI targets. With CentOS 5.0, the best alternative for redundant paths was to use the Linux bonding driver and create an active-standby bond between two interfaces. That's great for redundancy, but doesn't provide any load balancing. Interestingly, the upstream version number for the iSCSI utilities is the same (6.2.0), leading me to believe that for whatever reason either CentOS or Red Hat had stripped the "iface" functionality out.

This adds a couple of steps to iSCSI configuration. First, you have to change some settings in /etc/iscsi/iscsid.conf:

node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 10
node.session.timeo.replacement_timeout = 15


The first two lines set the interval at which to send an iSCSI nop-out as a test of the channel. If the daemon doesn't get a response within the number of seconds set by the second value, it fails the channel. These are normally 10 and 15 seconds, respectively, but we want it to be a bit more aggressive so it'll go to the other channel before applications start experiencing I/O failures. The third sets how many seconds to wait for a channel to reestablish before failing an operation back up to the application level. It's normally 120 seconds, to give the channel plenty of time to recover. In this case, again, we want it to fail quickly so the operation is retried on the other channel. iscsid.conf is well annotated, so there's more insight to be had reading through it.
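
Later, once discovery (covered below) has created node records, you can sanity-check that these values were picked up by printing a node record and grepping for the timeout keys. The target name and portal here are placeholders for whatever your array presents:

iscsiadm -m node -T iqn.2001-05.com.equallogic:example-volume -p 10.X.X.X | grep timeo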

It is necessary to set up the iSCSI interfaces before doing discovery. It goes like this:


iscsiadm -m iface -I iface0 --op=new
iscsiadm -m iface -I iface1 --op=new
iscsiadm -m iface -I iface0 --op=update -n iface.hwaddress -v 00:16:3E:XX:XX:XX
iscsiadm -m iface -I iface1 --op=update -n iface.hwaddress -v 00:16:3E:XX:XX:XX

The above example sets up two interfaces. You get the hardware address for the NICs you want to use by checking the output of the "ifconfig" command. The interfaces must each be configured with an IP address that can reach the iSCSI target.
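
To confirm the bindings took, you can list the interfaces back out and verify that each NIC can actually reach the group IP. The eth1/eth2 names and the group address here are placeholders for whatever your setup uses:

iscsiadm -m iface
ping -c 3 -I eth1 10.X.X.X
ping -c 3 -I eth2 10.X.X.X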

Update: If you want to script the process, you can source the ifcfg-ethX files for the interfaces you want to use. Then you can refer to the HWADDR variable, e.g.

. /etc/sysconfig/network-scripts/ifcfg-eth1
iscsiadm -m iface -I iface1 --op=update -n iface.hwaddress -v $HWADDR


NOTE FOR XEN USERS: Since you're associating the iSCSI interface with the NIC by hardware address, if you're doing this within a virtual machine it is important that the hardware address doesn't change between boots. Your VM definition file should explicitly define the hardware address where it defines the virtual interfaces.
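
For example, in an xm-style Xen guest config file the vif lines would pin the MACs explicitly. This is just a sketch; the addresses and bridge names are placeholders:

vif = [ 'mac=00:16:3e:xx:xx:xx, bridge=xenbr0',
        'mac=00:16:3e:yy:yy:yy, bridge=xenbr1' ]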

Now you're ready to do your iSCSI discovery and login. Ping the iSCSI group IP address from each NIC to make sure it's reachable, and then:


iscsiadm -m discovery -t st -p 10.X.X.X
iscsiadm -m node --loginall=all
iscsiadm -m session

The output of the last command should show each of your targets twice. If it does, this means that your interfaces are correctly configured and they're each talking to the array. Take a break. Go get a glass of water.

Okay, now you have to deal with the multipath layer. This is really quite easy. Make sure you have the device-mapper-multipath package installed and multipathd configured to run at startup (chkconfig multipathd on). The configuration file (/etc/multipath.conf) is set up by default to ignore, or "blacklist," all devices. So the first thing you need to do is comment out the following lines:

blacklist {
        devnode "*"
}


Now you'll want to add a blacklist stanza to cover any devices you don't want multipathed (e.g. your local disks). In the example below, I'm running on a Xen VM, so I want to blacklist the standard Xen block devices:


blacklist {
        devnode "^xvd[a-z]"
}
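
On a physical host you'd want the equivalent for the local disks. One option is to blacklist by WWID instead of device node, since sdX names can shift around between boots. This is just a sketch; the WWID is a made-up placeholder, and you'd find the real one with the same scsi_id invocation used in the getuid_callout of the devices stanza below:

blacklist {
        # placeholder WWID for the local RAID volume; find yours with
        # /sbin/scsi_id -g -u -s /block/sda
        wwid "36001e4f0XXXXXXXXXXXXXXXXXXXXXXXX"
}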

Easy. Now we create a "devices" stanza which defines our EqualLogic array and sets some config values for dealing with it:

devices {
        device {
                vendor "EQLOGIC"
                product "100E-00"
                path_grouping_policy group_by_prio
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
                features "1 queue_if_no_path"
                path_checker readsector0
                failback immediate
        }
}

H/T to Jason Koelker, from whose blog I ripped off this configuration (he shows values for NetApp; I made some changes to adapt it to EqualLogic). He does a much better job than I could at explaining the options. There are plenty of other options in multipath.conf; my advice is to study them and tune them as necessary, but don't feel like you have to change every default.

Okay, great. Now you're ready to discover your paths:

multipath

Whew! That was hard. Follow that up with a "multipath -ll" and you should see something like:

mpath1 (36090a01820494c58XXXXXXXXXXXXXXXX) dm-3 EQLOGIC,100E-00
[size=128G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][active]
\_ 3:0:0:0 sdd 8:48 [active][ready]
\_ 2:0:0:0 sdc 8:32 [active][ready]
mpath0 (36090a01820492c56XXXXXXXXXXXXXXXX) dm-2 EQLOGIC,100E-00
[size=128G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][active]
\_ 1:0:0:0 sdb 8:16 [active][ready]
\_ 0:0:0:0 sda 8:0 [active][ready]



If you want to address these devices (e.g. with a "pvcreate" command) you can find them in /dev/mpath as /dev/mpath/mpath0 and /dev/mpath/mpath1. I highly recommend managing the disks with LVM; you'll avoid a lot of brain damage when Linux decides to arbitrarily change the device names after a reboot. Install the "dstat" utility (available from the Dag yum repository) and you can watch I/O going out both NICs, with reads and writes balanced across both channels, while you're running I/O to the devices.
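
Here's a rough sketch of both of those steps. The volume group and logical volume names and sizes are arbitrary, and the NIC and dm device names assume the setup shown above (eth1/eth2 as the iSCSI NICs, dm-2/dm-3 as the multipath devices):

pvcreate /dev/mpath/mpath0 /dev/mpath/mpath1
vgcreate vg_iscsi /dev/mpath/mpath0 /dev/mpath/mpath1
lvcreate -L 100G -n lv_data vg_iscsi
mkfs.ext3 /dev/vg_iscsi/lv_data

# watch the iSCSI NICs and multipath devices while running I/O
dstat -n -N eth1,eth2 -d -D dm-2,dm-3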

Based on my reading of EqualLogic's documentation (which is sorely lacking for those of us who use Linux), I didn't expect multipath load balancing to work, since a single session can only use one channel on the array end at a time. If you used the Linux bonding driver to create a trunk of two NICs, for example, you could still only connect to one of the gigabit interfaces on the array end, so there'd be no performance gain. The dm-multipath method, however, uses multiple iSCSI login sessions and balances between them, so you really do get a significant improvement in performance along with quicker failover versus the Linux bonding method. In ideal conditions, I've seen reads and writes at 200MB/second.

Got an additional tip for squeezing performance out of an EqualLogic? Share it in comments! Those of us who use Linux with EqualLogic have to support each other, because again, their Linux-related documentation is largely non-existent.

Update (20080419): Run Away! Run Away!

We just acquired 15 Sun X4150 servers for our datacenter and tried to deploy them as described above. So far it has been an unmitigated disaster, with well over half the servers having crashed at least once while attempting to run a production VM load. We've never seen crashes like this on our Dell gear. Near as I can tell, the key components the Suns have that differ from the Dell 1950s we've been using are the RAID controller and the e1000 NICs. The Dells have e1000s as well, but only two of the four (the other two are Broadcom). Right now the best theory is that there is a problem somewhere in the interaction between the e1000 driver, open-iscsi, and dm-multipath which causes the kernel to panic so badly that it winks out with no warning, no logs, no core. It's as though someone walked up to the box and hit the reset button. Anyway, we've had to retreat to the old Linux bonding path-failover method until we get the problem figured out, and will likely delay moving any servers into a dm-multipath configuration until we have a root cause nailed down.

Further Update: It's been an interesting few weeks. 1U servers can be a lot like Formula 1 cars, I guess; if you don't drive them hard enough, the tires cool down and you crash. The apps we were running were highly I/O intensive, heating up the RAID card and the NIC in the server, but not CPU intensive, and since the CPU temperature controls the speed of the chassis fans, it seems we were overheating our PCI cards while the CPUs stayed nice and cool. Knowing this, we've developed some workarounds to keep things cooler, so my guess is we could re-implement multipath as described above and wouldn't see much problem.

On the other hand, we've also learned that EqualLogic iSCSI arrays have a maximum number of connections in the ~500 range, and we were in danger of running out. Since each LUN presented by the array takes one session per path, and multipath uses two paths, a server with 5 LUNs exposed over multipath takes up 10 connections. We'll keep multipath in reserve as a point solution if we find we need higher I/O for a specific purpose.
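
If you want to keep an eye on it, each host's contribution is easy to count, and the math is simple (the numbers here are just the example above):

# count this host's iSCSI sessions; each one is a connection on the group
iscsiadm -m session | wc -l

# 5 LUNs x 2 paths = 10 sessions per host, so the ~500-connection
# ceiling works out to roughly 50 hosts at that ratio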