This is a collection of random infrastructure notes based on the work I'm doing at any given time. Most of the technical notes here assume an infrastructure similar to the one I'm working on (which I will not describe in detail, and which is subject to change). I can't be responsible if you do something that's documented here and bad things happen.

Thursday, July 24, 2008

Cfengine rocks it

Nice to be able to do a good-news followup to one of my posts. Having now rolled cfengine out across a couple hundred nodes, I can say it is doing fantastically so far. Key things we have cfengine doing:

  • Keeping our yum repository definition files current. This is awesome, as it allows us to quickly roll out things like "exclude" statements in repo definition files.
  • Rolling out authorized_keys files. This is much less dangerous if I can make a quick change and have a new file in place within 15 minutes.
  • Getting key packages installed (!!). Cfengine understands the rpm format and can install necessary packages from yum repositories.
And we're still really just scratching the surface. We have been struggling with Red Hat's kickstart to get servers in as close to production-ready shape as possible at install time. It turns into a nightmare of maintaining different kickstart files for each type of server we deploy. With cfengine, we're now moving toward a single kickstart configuration file for the whole environment, with customization done post-install depending on how the hostname is constructed. Slick!

Coming soon, we're going to start implementing alerts based on cfenvd. This is a daemon that continually gathers statistical information and defines cfengine classes based on anomalous behavior. I hope to write about that in the very near future.

Wednesday, July 9, 2008


It's been a while since I jumped on a nice steep learning curve, and cfengine is suiting me nicely to that end. Many administrators probably know of cfengine as something that sounds incredibly useful but a bit of work to get rolling, which ends up being true. You can head to www.cfengine.org to get a pretty good basic introduction to what it is and what it does. I'll also try to describe it a bit here.

Basically, cfengine is a suite of daemons that run in a client-server type configuration. They are intended to compare the state of a given computer system to the ideal state described by a configuration file, and take steps to correct or warn of any problems they find. The things they check might include:
  • files that should be the same for every box in the environment (e.g. /etc/resolv.conf)
  • processes that should or shouldn't be running
  • ownership and permissions on files
  • configuration of some system daemons
The basic gist is that if you have a network of over 200 virtual machines like the one I manage, and you need to make a change to a system configuration file, cfengine provides a way to do this without having to manually log into each machine and make the change, or alternatively run a dangerous non-interactive shell script on every host.
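To make the "compare to the ideal state and correct it" idea concrete, here is a minimal sketch in Python. It is not how cfengine is implemented; the function name converge_file and the sample resolv.conf content are made up for illustration. The point is only that running the same check twice is harmless: the second run finds nothing to fix.

```python
import os
import stat
import tempfile


def converge_file(path, wanted_content, wanted_mode):
    """Bring one file to the desired state; return the list of repairs made.

    Hypothetical helper for illustration only, not part of cfengine.
    """
    repairs = []

    # Read the current content, treating a missing file as "no content".
    try:
        with open(path, "r") as handle:
            current = handle.read()
    except FileNotFoundError:
        current = None

    # Only rewrite the file when it differs from the desired content.
    if current != wanted_content:
        with open(path, "w") as handle:
            handle.write(wanted_content)
        repairs.append("content")

    # Only change the permissions when they differ from the desired mode.
    current_mode = stat.S_IMODE(os.stat(path).st_mode)
    if current_mode != wanted_mode:
        os.chmod(path, wanted_mode)
        repairs.append("mode")

    return repairs


if __name__ == "__main__":
    # Work in a scratch directory rather than touching the real /etc.
    scratch = tempfile.mkdtemp()
    target = os.path.join(scratch, "resolv.conf")
    print(converge_file(target, "nameserver 192.0.2.1\n", 0o644))
    print(converge_file(target, "nameserver 192.0.2.1\n", 0o644))
```

A one-off shell script pushed to every host makes the change whether or not it is needed; a check like this only acts when the host has drifted, which is what makes it safe to run every 15 minutes.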

One of the powerful aspects of this is that cfengine is aware of the differences between, say, how a Linux system does things and how a Solaris system does things. So for many tasks, the same line in the configuration file could cause a vastly different sequence of events depending on the platform. The system administrator doesn't have to care about this; she or he simply states in the high-level cfengine configuration language how the system should behave, and cfengine will figure out the details.

Another powerful aspect is the notion of "classes". This is great if you have a QA environment closely modeled on your production environment, but with some key changes. The cfengine client program (cfagent) will automatically assign itself to several classes based on system variables such as subdomain or subnet. So if you have a "prod.acme.com" subdomain and a "qa.acme.com" subdomain, for example, you can define different versions of key files for those subdomains.
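As a rough illustration of the class idea, the Python sketch below derives class names from a host name and uses them to choose which master copy of a file applies. This only mimics the behavior described above; the function names, the host web01.qa.acme.com, and the /foo/bar/prod path are assumptions made up for the example, and the real cfagent defines many more classes than this.

```python
def classes_for_host(fqdn):
    """Return a set of class names derived from a fully qualified host name.

    Hypothetical helper: it imitates the idea of automatic classes only.
    """
    def canonify(text):
        # Map anything that is not a letter or digit to an underscore,
        # so "qa.acme.com" becomes "qa_acme_com".
        return "".join(ch if ch.isalnum() else "_" for ch in text.lower())

    short_name, _, domain = fqdn.partition(".")
    classes = {"any", canonify(short_name), canonify(fqdn)}
    if domain:
        classes.add(canonify(domain))
    return classes


def pick_source(fqdn, sources_by_class):
    """Return the master file for the first listed class the host belongs to."""
    mine = classes_for_host(fqdn)
    for class_name, source in sources_by_class:
        if class_name in mine:
            return source
    return None


if __name__ == "__main__":
    # The prod path is an assumed counterpart to the qa path, for illustration.
    sources = [
        ("prod_acme_com", "/foo/bar/prod/syslog.conf"),
        ("qa_acme_com", "/foo/bar/qa/syslog.conf"),
    ]
    print(sorted(classes_for_host("web01.qa.acme.com")))
    print(pick_source("web01.qa.acme.com", sources))
```

The host never has to be told which environment it is in; membership falls out of its own name, so one configuration can serve both subdomains.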

At Lijit, we're tying it in to Subversion to make management simpler. We have a repository of cfengine configuration files (which get propagated to the clients in the same way as other files), and a repository of other files we wish to be distributed by cfengine. I can check out these repos to my PC, make the edits I need to make, commit the changes, and within 15 minutes all of my clients will have the new configuration and, presumably, will have taken whatever steps they need to get in sync. There are methods for "pushing" updates from the master server as well, but I haven't gotten that far yet.

One stumbling block that I'm just figuring out now is how to use "define" statements to cause chains of events to take place. Take /etc/syslog.conf. If I update that file, the syslog daemon needs to be restarted. Cfengine can restart running processes in a myriad of ways, so that's no problem, but I don't want syslogd restarted every 15 minutes; that'd just be a waste, and could lead to dropped messages. So I add a "define" statement to the configuration line for the syslog.conf file. Like so:

/foo/bar/qa/syslog.conf dest=/etc/syslog.conf mode=0644 owner=0 group=0 server=cfengine.acme.com define=new_syslog_conf

So, this line (in the "copy" section of the file, under the qa_acme_com class) says that if you are a member of the qa_acme_com class, make sure your /etc/syslog.conf matches the master version and copy it over if it doesn't. If you copy the file, then define yourself as a member of the class "new_syslog_conf". Later on in the config, I can then say that members of the "new_syslog_conf" class should restart their syslog daemon.
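The chain of events can be sketched in Python as follows. This is an illustration of the logic, not cfengine's implementation: copy_if_changed and run_pass are invented names, the file paths are scratch files, and the restart is a stand-in callback rather than a real syslogd restart. Only the class name new_syslog_conf comes from the line above.

```python
import filecmp
import os
import shutil
import tempfile


def copy_if_changed(source, dest, classes, define):
    """Copy source over dest only when they differ; define a class if we copied."""
    if os.path.exists(dest) and filecmp.cmp(source, dest, shallow=False):
        return False  # already in sync, so nothing is defined
    shutil.copyfile(source, dest)
    classes.add(define)
    return True


def run_pass(source, dest, restart):
    """One agent run: the copy step first, then any action keyed on the class."""
    classes = set()
    copy_if_changed(source, dest, classes, "new_syslog_conf")
    # The restart is conditional on the class, not on the clock.
    if "new_syslog_conf" in classes:
        restart()
    return classes


if __name__ == "__main__":
    scratch = tempfile.mkdtemp()
    master = os.path.join(scratch, "master-syslog.conf")
    local = os.path.join(scratch, "syslog.conf")
    with open(master, "w") as handle:
        handle.write("*.info /var/log/messages\n")

    restarts = []
    for _ in range(3):
        run_pass(master, local, lambda: restarts.append("restart"))
    print(len(restarts))
```

Three runs against an unchanged master produce a single restart: the first run copies the file and defines the class, and the later runs find the file in sync and define nothing.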

This is just barely scratching the surface of cfengine. There are good tutorials and documentation out there for anyone wishing to learn about it. I suspect it will become an invaluable tool for us at Lijit as the number of hosts we manage continues to grow. If you're managing more than a couple dozen hosts, I definitely recommend having a look.