This is a collection of random infrastructure notes based on the work I'm doing at any given time. Most of the technical notes here assume an infrastructure similar to the one I'm working on (which I will not describe in detail, and which is subject to change). I can't be responsible if you do something that's documented here and bad things happen.

Thursday, July 24, 2008

Cfengine rocks it

Nice to be able to do a good-news followup to one of my posts. Having now rolled cfengine out across a couple hundred nodes, it is doing fantastic so far. Key things we have cfengine doing:

  • Keeping our yum repository definition files current. This is awesome, as it allows us to quickly roll out things like "exclude" statements in repo definition files.
  • Rolling out authorized_keys files. This is much less dangerous if I can make a quick change and have a new file in place within 15 minutes.
  • Getting key packages installed (!!). Cfengine understands the RPM format and can install necessary packages from yum repositories.
And we're still really just scratching the surface. We have been struggling with Red Hat's kickstart to get servers in as close to production-ready shape as possible at install time. It turns into a nightmare of maintaining different kickstart files for each type of server we deploy. With cfengine, we're now moving toward a single kickstart configuration file for the whole environment, with customization done post-install depending on the hostname construction. Slick!

Coming soon we're going to start implementing alerts based on cfenvd. This is a daemon that continually gathers statistical information and defines cfengine classes based on anomalous behavior. I hope to write about that in the very near future.

Wednesday, July 9, 2008


It's been a while since I jumped on a nice steep learning curve, and cfengine is suiting me nicely to that end. Many administrators probably know of cfengine as something that sounds incredibly useful but a bit of work to get rolling, which ends up being true. You can head to www.cfengine.org to get a pretty good basic introduction to what it is and what it does. I'll also try to describe it a bit here.

Basically, cfengine is a suite of daemons that run in a client-server type configuration. They are intended to compare the state of a given computer system to the ideal state described by a configuration file, and to take steps to correct or warn of any problems they find. This might take the form of:
  • files that should be the same for every box in the environment (e.g. /etc/resolv.conf)
  • processes that should or shouldn't be running
  • ownership and permissions on files
  • configuration of some system daemons
The basic gist is that if you have a network of over 200 virtual machines like I manage, and you need to make a change to a system configuration file, cfengine provides a method to do this without having to manually log into each machine and make the change or, alternatively, run a dangerous non-interactive shell script on every host.

One of the powerful aspects of this is that cfengine is aware of the differences between, say, how a Linux system does things and how a Solaris system does things. So for many tasks, the same line in the configuration file could cause a vastly different sequence of events depending on the system architecture. The system administrator doesn't have to care about this; she or he simply states in the high-level cfengine configuration language how the system should behave, and cfengine will figure out the details.

Another powerful aspect is the notion of "classes". This is great if you have a QA environment closely modeled on your production environment, but with some key changes. The cfengine client daemon (cfagent) will automatically assign itself to several classes based on system variables such as subdomain or subnet. So if you have a "prod.acme.com" subdomain and a "qa.acme.com" subdomain for example, you can define different versions of key files for those subdomains.
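
To make the class naming concrete: cfengine turns things like the domain name into class names by replacing characters that aren't valid in an identifier (dots, hyphens) with underscores, so "qa.acme.com" becomes the class qa_acme_com. The shell function below is only a rough illustration of that mapping; the name class_from_domain is made up for this sketch and is not something cfengine ships.

```shell
# Illustrative only: mimic how a domain name maps onto a cfengine-style class name.
# class_from_domain is a hypothetical helper, not part of cfengine.
class_from_domain() {
    # Replace every character that is not a letter, digit, or underscore with "_"
    printf '%s\n' "$(printf '%s' "$1" | tr -c 'A-Za-z0-9_' '_')"
}
```

Calling class_from_domain qa.acme.com prints qa_acme_com, which is the kind of name you would then use to scope lines in the configuration file.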

At Lijit, we're tying it in to Subversion to make management simpler. We have a repository of cfengine configuration files (which get propagated to the clients in the same way as other files), and a repository of other files we wish to be distributed by cfengine. I can check out these repos to my PC, make the edits I need to make, commit the changes, and within 15 minutes all of my clients will have the new configuration and, presumably, will have taken whatever steps they need to get in sync. There are methods for "pushing" updates from the master server as well, but I haven't gotten that far yet.

One stumbling block that I'm just figuring out now is how to use "define" statements to cause chains of events to take place. Take /etc/syslog.conf. If I update that file, the syslog daemon needs to be restarted. Cfengine can restart running processes in a myriad of ways, so that's no problem, but I don't want syslogd restarted every 15 minutes; that'd just be a waste, and could lead to dropped messages. So I add a "define" statement to the configuration line for the syslog.conf file. Like so:

/foo/bar/qa/syslog.conf dest=/etc/syslog.conf mode=0644 owner=0 group=0 server=cfengine.acme.com define=new_syslog_conf

So, this line (in the "copy" section of the file) says that if you are a member of the qa_acme_com class, make sure your /etc/syslog.conf matches the master version and copy it over if it doesn't. If you copy the file, then define yourself as a member of the class "new_syslog_conf". I can then later on in the config say that members of the "new_syslog_conf" class should restart their syslog daemon.
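
For anyone who thinks better in shell than in cfengine syntax, the chain described above boils down to something like the sketch below. It is only an analogy for the logic (copy if different, and only then trigger the follow-up action), not what cfagent actually runs; sync_file and sync_and_restart are hypothetical names, and the restart command is whatever your system uses.

```shell
# Illustrative only: the copy-then-define chain expressed as plain shell.
# sync_file and sync_and_restart are hypothetical stand-ins for what cfagent does.

# Copy SRC over DEST only when they differ; succeed (return 0) only if a copy happened.
sync_file() {
    if cmp -s "$1" "$2"; then
        return 1                # already in sync: the "class" stays undefined
    fi
    cp "$1" "$2"                # out of sync: install the master version
}

# sync_and_restart SRC DEST COMMAND...
# Run COMMAND only when the file was actually replaced.
sync_and_restart() {
    src=$1; dest=$2; shift 2
    if sync_file "$src" "$dest"; then
        "$@"                    # e.g. /sbin/service syslog restart
    fi
}
```

Something like sync_and_restart /foo/bar/qa/syslog.conf /etc/syslog.conf /sbin/service syslog restart would restart syslog only on the runs where the file changed, which is the whole point of the define.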

This is just barely scratching the surface of cfengine. There are good tutorials and documentation out there for anyone wishing to learn about it. I suspect it will become an invaluable tool for us at Lijit as the number of hosts we manage continues to grow. If you're managing more than a couple dozen hosts I definitely recommend having a look.

Thursday, May 29, 2008

SVN and Yum Repositories

We use "yum" in our environment, not just for OS updates, but for managing third-party software and our own production software. It's a great way for rolling releases out to a large number of servers in a way that ensures that everyone is running the same code. The ideal is that, if it isn't in one of our yum repositories, it's not going on one of our servers. The reality is considerably more nuanced, of course, but it's still a good idea.

So here's the challenge: If we're, say, mirroring Dag Wieers's excellent repository of third-party apps and we want to refresh the mirror to get updated packages, subsequent new server builds will be out-of-sync with production until all of the systems have been updated. If we get partially through that update process and find a problem which requires a roll-back, it becomes difficult to unwind to the previous state. Enter Subversion.

If you can make it through a whole post on this blog without falling asleep, chances are you already know what Subversion is. What we're doing is combining it with Apache and mod_dav_svn to allow our hosts to update directly from the Subversion repository. We have a "production" and a "qa" branch for each repo, and we just point yum on the servers to the appropriate branch. Since yum uses HTTP for its transport this requires no trickery at all with yum. Simple, elegant, manageable, and saves a ton of disk space over manually managing multiple versions of a repo. Subversion only stores one copy of a file if it's identical across branches, and most repository updates will consist of just adding some RPM files and changing the metadata file.

Update: Well, that didn't work. Turns out that yum uses an HTTP/1.1 byte-range request to get the headers out of RPM files for dependency checking. Unfortunately mod_dav_svn doesn't seem to support this type of request, so it's back to the drawing board.
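
A quick way to check whether a given web server honours byte-range requests before pointing yum at it is to ask for a small range with curl and look at the status code. This is a minimal sketch: range_probe and range_verdict are hypothetical helper names, and the exact response mod_dav_svn gave in this case isn't recorded here, so treat the interpretation as the generic HTTP behaviour (206 for a partial response, 200 when the range is ignored).

```shell
# Illustrative only: probe whether a web server honours HTTP/1.1 byte-range requests.
# range_probe and range_verdict are hypothetical helper names.

# Ask for the first 100 bytes of URL and print the HTTP status code.
range_probe() {
    curl -s -o /dev/null -r 0-99 -w '%{http_code}\n' "$1"
}

# Interpret the status code: 206 means the range was honoured,
# 200 means the server ignored it and sent the whole file.
range_verdict() {
    case "$1" in
        206) echo "range honoured" ;;
        200) echo "range ignored" ;;
        *)   echo "unexpected status $1" ;;
    esac
}
```

Usage would look like range_verdict "$(range_probe URL)", with URL pointing at any RPM file served from the repository you want to test.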

Sunday, April 20, 2008

Good Article: The Six Dumbest Ideas in Computer Security

Found an interesting article on security today. Here's the blurb:

What are they? They're the anti-good ideas. They're the braindamage that makes your $100,000 ASIC-based turbo-stateful packet-mulching firewall transparent to hackers. Where do anti-good ideas come from? They come from misguided attempts to do the impossible - which is another way of saying "trying to ignore reality." 

There's some good advice in there. Most of all I appreciate the author's rejection of the romanticization of "hacking". It has bugged the crap out of me for years that people see black-hat hacking activities as a path to a career as a security consultant, and there are enterprises out there explicitly enabling this!  It does get a bit dogmatic at times.  Yes, okay, we'd never fly on commercial airliners if the airlines took the same attitude towards airplane maintenance as most take towards network security, but then again no one dies if my network gets penetrated, so let's not get too overheated.


Wednesday, February 13, 2008

Multipath and EqualLogic iSCSI

At my job, we're using EqualLogic iSCSI arrays for all of our on-line primary storage. It's an easy-to-configure, highly scalable, and highly flexible solution for us. We have two arrays combined into a logical "group". Both arrays are connected to two Foundry FastIron gigabit Ethernet switches, and each server has a NIC connected to each switch as well, for path redundancy. We use CentOS on our servers. Any technical advice below assumes you have a similar setup. I'm also assuming that you've already configured your arrays, and read up on EqualLogic's best practices regarding jumbo frames and flow control.

A nice thing that CentOS 5.1 adds over the 5.0 version is an updated iscsi-initiator-utils. It adds the "iface" context, which allows you to configure more than one physical interface for connection to iSCSI targets. With CentOS 5.0 the best alternative for redundant paths was to use the Linux bonding driver and create an active-standby bond between two interfaces. This is great for redundancy, but doesn't provide any load balancing. Interestingly, the upstream version number for the iSCSI utilities is the same (6.2.0), leading me to believe that for whatever reason either CentOS or Red Hat had stripped the "iface" functionality out.

This adds a couple steps to iSCSI configuration. First you have to change some settings in /etc/iscsi/iscsid.conf:

node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 10
node.session.timeo.replacement_timeout = 15

The first two lines set the interval to send an iSCSI nop-out as a test of the channel. If the daemon doesn't get a response in the number of seconds set by the second value it fails the channel. These are normally 10 and 15 seconds, respectively, but we want it to be a bit more aggressive so it'll go to the other channel before applications start experiencing I/O failures. The third sets how many seconds to wait for a channel to reestablish before failing an operation back up to the application level. It's normally 120 seconds, to give the channel plenty of time to recover. In this case, again, we want it to fail quickly so the operation is retried on the other channel. iscsid.conf is well annotated, so there's more insight to be had reading through that.
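
If you'd rather script these three changes than edit the file by hand, a small helper along the following lines is one way to do it. This is a sketch under assumptions: set_conf_option is a made-up name, it only rewrites uncommented "key = value" lines (appending the option if it isn't there), and you should try it on a copy of iscsid.conf before trusting it with the real one.

```shell
# Illustrative only: set "key = value" in an iscsid.conf-style file.
# set_conf_option is a hypothetical helper, not part of iscsi-initiator-utils.
set_conf_option() {
    conf_file=$1; conf_key=$2; conf_value=$3
    tmp_file=$(mktemp) || return 1
    awk -v key="$conf_key" -v val="$conf_value" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)            # tolerate leading whitespace
            n = split(line, parts, /[ \t]*=/)   # parts[1] is the option name
            if (n > 1 && parts[1] == key) {
                if (!seen) print key " = " val  # rewrite the first match
                seen = 1
                next                            # drop any duplicate settings
            }
            print                               # keep comments and other options
        }
        END { if (!seen) print key " = " val }  # option was absent: append it
    ' "$conf_file" > "$tmp_file" && cat "$tmp_file" > "$conf_file"
    status=$?
    rm -f "$tmp_file"
    return $status
}
```

For example, set_conf_option /etc/iscsi/iscsid.conf 'node.session.timeo.replacement_timeout' 15 applies the third setting above. Quote the key so the shell leaves the [0] in the node.conn options alone.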

It is necessary to set up the iSCSI interfaces before doing discovery. It goes like this:

iscsiadm -m iface -I iface0 --op=new
iscsiadm -m iface -I iface1 --op=new
iscsiadm -m iface -I iface0 --op=update -n iface.hwaddress -v 00:16:3E:XX:XX:XX
iscsiadm -m iface -I iface1 --op=update -n iface.hwaddress -v 00:16:3E:XX:XX:XX

The above example sets up two interfaces. You get the hardware address for the NICs you want to use by checking the output of the "ifconfig" command. The interfaces must each be configured with an IP address that can reach the iSCSI target.
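
If you don't feel like copying the address by eye, the field after "HWaddr" can be pulled out of the ifconfig output with a one-line awk filter. This assumes the classic net-tools output format (a "HWaddr xx:xx:xx:xx:xx:xx" pair on the first line, as printed on CentOS 5); hwaddr_from_ifconfig is just a name picked for this sketch.

```shell
# Illustrative only: pull the MAC address out of classic net-tools ifconfig output,
# assuming the "HWaddr xx:xx:xx:xx:xx:xx" layout printed by ifconfig on CentOS 5.
# hwaddr_from_ifconfig is a hypothetical helper name.
hwaddr_from_ifconfig() {
    # Print the field that follows the literal word "HWaddr", then stop
    awk '{ for (i = 1; i < NF; i++) if ($i == "HWaddr") { print $(i + 1); exit } }'
}
```

It reads from standard input, so the intended use is ifconfig eth1 | hwaddr_from_ifconfig.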

Update: If you want to script the process, you can source the ifcfg-ethX files for the interfaces you want to use. Then you can refer to the HWADDR variable, e.g.

. /etc/sysconfig/network-scripts/ifcfg-eth1
iscsiadm -m iface -I iface1 --op=update -n iface.hwaddress -v $HWADDR
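
Extending that idea a little, the sketch below generates both iscsiadm commands for an iface from an ifcfg file, so you can eyeball them before running anything. iface_commands is a hypothetical helper name; it assumes the ifcfg file sets HWADDR, and it sources the file in a subshell so the other variables in there don't end up in your environment.

```shell
# Illustrative only: emit the iscsiadm commands for one iSCSI iface, taking the
# MAC address from an ifcfg file. iface_commands is a hypothetical helper name.
iface_commands() {
    iface_name=$1; ifcfg_file=$2
    # Source the file in a subshell so its other variables don't leak into ours
    hwaddr=$( unset HWADDR; . "$ifcfg_file" && printf '%s' "$HWADDR" )
    if [ -z "$hwaddr" ]; then
        echo "no HWADDR found in $ifcfg_file" >&2
        return 1
    fi
    echo "iscsiadm -m iface -I $iface_name --op=new"
    echo "iscsiadm -m iface -I $iface_name --op=update -n iface.hwaddress -v $hwaddr"
}
```

Run iface_commands iface1 /etc/sysconfig/network-scripts/ifcfg-eth1 to review the commands, and pipe the output to sh once they look right.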

NOTE FOR XEN USERS: Since you're associating the iSCSI interface with the NIC by hardware address, if you're doing this within a virtual machine it is important that the hardware address doesn't change between boots. Your VM definition file should explicitly define the hardware address where it defines the virtual interfaces.

Now you're ready to do your iSCSI discovery and login. Ping the iSCSI group IP address from each NIC to make sure it's reachable, and then:

iscsiadm -m discovery -t st -p 10.X.X.X
iscsiadm -m node --loginall=all
iscsiadm -m session

The output of the last command should show each of your targets twice. If it does, this means that your interfaces are correctly configured and they're each talking to the array. Take a break. Go get a glass of water.
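
With more than a handful of targets, checking "each one twice" by eye gets old. The filter below counts sessions per target and complains about any target that doesn't have the expected number. It is a sketch that assumes each line of the session listing has the target name as its fourth field (something like "tcp: [1] 10.X.X.X:3260,1 iqn..."), which may differ in other versions of the tools; check_session_counts is a made-up name.

```shell
# Illustrative only: check that every target shows up once per iface.
# Assumes each line of "iscsiadm -m session" looks like
#   tcp: [1] 10.X.X.X:3260,1 iqn.2001-05.com.equallogic:...
# i.e. the target name is the fourth field. check_session_counts is a hypothetical name.
check_session_counts() {
    expected=$1
    awk -v want="$expected" '
        NF >= 4 { count[$4]++ }                 # tally sessions per target name
        END {
            bad = (NR == 0)                     # no sessions at all is also a failure
            for (t in count)
                if (count[t] != want) { print t ": " count[t] " session(s)"; bad = 1 }
            exit bad
        }'
}
```

Use it as iscsiadm -m session | check_session_counts 2; it prints nothing and returns success when every target has exactly two sessions.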

Okay, now you have to deal with the multipath layer. This is really quite easy. Make sure you have the device-mapper-multipath package installed and multipathd configured to run at startup (chkconfig multipathd on). The configuration file (/etc/multipath.conf) is set up by default to ignore, or "blacklist", all devices. So the first thing you need to do is comment out the following lines:

blacklist {
        devnode "*"
}

Now you'll want to add a blacklist stanza to cover any devices you don't want multipathed (e.g. your local disks). In the example below, I'm running on a Xen VM, so I want to blacklist the standard Xen block devices:

blacklist {
        devnode "^xvd[a-z]"
}

Easy. Now we create a "devices" stanza which defines our EqualLogic array and sets some config values for dealing with it:

devices {
        device {
                vendor "EQLOGIC"
                product "100E-00"
                path_grouping_policy group_by_prio
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
                features "1 queue_if_no_path"
                path_checker readsector0
                failback immediate
        }
}

H/T to Jason Koelker, from whose blog I ripped off this configuration (he shows values for NetApp; I made some changes to adapt it to EqualLogic). He does a much better job than I could at explaining the options. There are plenty of other options in multipath.conf; my advice is to study them and tune them as necessary, but don't feel like you have to change every default.

Okay, great. Now you're ready to discover your paths:


Whew! That was hard. Follow that up with a "multipath -ll" and you should see something like:

mpath1 (36090a01820494c58XXXXXXXXXXXXXXXX) dm-3 EQLOGIC,100E-00
[size=128G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][active]
\_ 3:0:0:0 sdd 8:48 [active][ready]
\_ 2:0:0:0 sdc 8:32 [active][ready]
mpath0 (36090a01820492c56XXXXXXXXXXXXXXXX) dm-2 EQLOGIC,100E-00
[size=128G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][active]
\_ 1:0:0:0 sdb 8:16 [active][ready]
\_ 0:0:0:0 sda 8:0 [active][ready]
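
To keep an eye on this without reading the whole listing, the output can be boiled down to one line per map with the number of paths that are both active and ready. The filter below is a sketch that assumes the layout shown above (map lines start with the map name, healthy path lines contain [active][ready]); count_ready_paths is a hypothetical name.

```shell
# Illustrative only: summarise "multipath -ll" output as "<map> <ready paths>",
# assuming the layout shown above (map lines start with the map name, path lines
# contain [active][ready]). count_ready_paths is a hypothetical helper name.
count_ready_paths() {
    awk '
        /^mpath/ { map = $1; order[++n] = map; ready[map] = 0 }   # new map section
        /\[active\]\[ready\]/ && map != "" { ready[map]++ }       # healthy path line
        END { for (i = 1; i <= n; i++) print order[i], ready[order[i]] }'
}
```

Fed the listing above via multipath -ll | count_ready_paths, it prints "mpath1 2" and "mpath0 2"; anything less than 2 means a path has dropped.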

If you want to address these devices (e.g. with a "pvcreate" command) you can find them in /dev/mpath as /dev/mpath/mpath0 and /dev/mpath/mpath1. I highly recommend managing the disks with LVM; you'll avoid a lot of brain damage when Linux decides to arbitrarily change the device names after a reboot. Install the "dstat" utility (available from the Dag yum repository) and you can watch the I/O going out both NICs, with reads and writes going to both channels while you're running I/O to the devices.

Based on my reading of EqualLogic's documentation (which is sorely lacking for those of us who use Linux), I didn't expect multipath load balancing to work, since a single session can only use one channel on the array end at a time. Thus, for example, if you used the Linux bonding driver to create a trunk of two NICs, you could still only connect to one of the gig interfaces on the array end, so no performance gain. The dm-multipath method, however, uses multiple iSCSI login sessions and balances between them, so you really do get a significant improvement in performance along with quicker failover vs. the Linux bonding method. In ideal conditions, I've seen reads and writes at 200MB/second.

Got an additional tip for squeezing performance out of an EqualLogic? Share it in comments! Those of us who use Linux with EqualLogic have to support each other, because again, their Linux-related documentation is largely non-existent.

Update (20080419): Run Away! Run Away!

We just acquired 15 Sun X4150 servers for our datacenter and tried to deploy them as described above. So far it has been an unmitigated disaster, with well over half the servers having crashed at least once while attempting to run a production VM load. We've never seen crashes like this on our Dell gear. Near as I can tell, the key components on which the Suns differ from the Dell 1950s we've been using are the RAID controller and the e1000 NICs. The Dells have e1000s as well, but only two of the four (the other two are Broadcom). Right now the best theory is that there is a problem somewhere in the interaction between the e1000 driver, open-iscsi, and dm-multipath which causes the kernel to panic so badly that it winks out with no warning, no logs, no core. It's as though someone walked up to the box and hit the reset button. Anyway we've had to retreat to the old Linux bonding path failover method until we get the problem figured out, and will likely delay moving any servers into a dm-multipath configuration until we have a root cause nailed down.

Further Update: It's been an interesting few weeks. 1U servers can be a lot like Formula 1 cars, I guess; if you don't drive them hard enough the tires cool down and you crash. The apps we were running were highly I/O intensive, heating up the RAID card and the NIC in the server, but not CPU intensive, and since the CPU temperature controls the speed of the chassis fans, it seems that we were overheating our PCI cards while the CPUs stayed nice and cool. Knowing this we've developed some workarounds to keep things cooler, so my guess is we could re-implement multipath as described above and wouldn't see much problem.

On the other hand, we've also learned that EqualLogic iSCSI arrays have a maximum number of connections in the ~500 range, and we were in danger of running out of available connections. Since each LUN presented by an iSCSI array takes a session, and multipath requires two sessions, that means a server with 5 LUNs exposed over multipath will take up 10 connections. We'll keep multipath in reserve as a point solution if we find we need higher I/O for a specific purpose.
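
The connection budget is simple multiplication, but it's worth writing down because it runs out faster than you'd think. A small sketch, with made-up function names and the roughly-500 figure from above taken as the limit:

```shell
# Illustrative only: back-of-the-envelope iSCSI connection budget.
# The function names are hypothetical; 500 is the approximate limit quoted above.

# connections_used SERVERS LUNS_PER_SERVER PATHS_PER_LUN
connections_used() {
    echo $(( $1 * $2 * $3 ))
}

# max_servers LIMIT LUNS_PER_SERVER PATHS_PER_LUN
max_servers() {
    echo $(( $1 / ($2 * $3) ))
}
```

So connections_used 1 5 2 gives the 10 connections mentioned above, and max_servers 500 5 2 gives 50 servers with multipath versus max_servers 500 5 1 giving 100 without it.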