Wednesday, December 19, 2012

Ubuntu 12.04 (LTS), Xen 4.1, and DRBD

I recently set up a couple high-end servers as a small "cluster" to replace an old Xen VM server. The idea was to use DRBD (Distributed Replicated Block Device) to mirror all of the storage between servers for both data redundancy and resource flexibility. This allows separate VMs to run on both servers, taking full advantage of available CPU and memory, while still being able to be quickly and easily migrated between servers in the event of failure, or just to optimize performance and resource usage. DRBD allows this to be done without some sort of iSCSI or NAS typically used in larger clusters, with the added benefit of being decentralized and not suffering from a network latency overhead. Additionally, the targeted VMs have vastly different performance requirements, ranging from simple web servers to high-performance research workstations, so having two servers allowed me to economically provision one as "low-end" for simple web/file server VMs that just needed high availability, and one as "high-end" for the compute and memory intensive VMs.

Both servers use Supermicro's brand new, very, very nice X9DAX-7TF motherboard, which supports dual-socket Intel E5-2600 series CPUs, along with dual 10 GbE ports, USB 3.0, and a ton of other neat goodies. Unfortunately this board is oversized, and, since I wanted both rackmount and tower configurations along with full-size PCIe cards, the only chassis that really worked was the somewhat overpriced Supermicro SC747. The "low-end" server has dual E5-2620s (6-core, 2 GHz) and 64 GB of RAM, and the "high-end" server has dual E5-2687Ws (8-core, 3.1 GHz, 3.8 GHz turbo boost) and 96 GB of RAM. Since redundancy, uptime, and protection against data loss were top priorities, both servers have RAID 6 arrays using OCZ Vertex 4 SSDs (along with a couple of RevoDrives on the high-end server), which (as mentioned) are mirrored between servers using DRBD.



Since this blog is just as much notes to myself as anything, I'll start with a couple of ridiculous, minor, annoying issues that I ran into but that aren't really applicable to most setups.

  • OCZ's 2.5" to 3.5" drive adapter DOES NOT MEET 3.5" drive specifications! This means that it will not fit into 3.5" hot-swap drive bays, since the port is misaligned both horizontally and vertically :/. It is also pretty hard to find a drive adapter that meets spec, since the 2.5" drive has to be almost flush against the cage, so most that do meet spec are outrageously expensive (such as the Icy Dock drive converter). Since I used iStarUSA's 5.25" bay to 6x2.5" drive adapter for most of the SSDs, I only needed two 2.5" to 3.5" adapters, so a quick trip to the machine shop to mill down the bevels and screw holes of the OCZ adapters fixed the problem (it is a bit kludgy, since you can't put all the screws in, and it is hard to fasten the drive to the adapter and the adapter to the hot-swap cage, but it worked well as an immediate/free solution).

  • The onboard LSI 2208 controller did not have JBOD enabled by default. This was a huge pain, since I wasn't able to use any drives without configuring a proprietary RAID array (not to mention I wanted to use data from existing disks). It turns out you can download the MegaCli tool (here), follow the instructions to install it (just a simple "alien" command to convert the .rpms, then "dpkg -i"), then run "sudo MegaCli -AdpSetProp -EnableJBOD -val 1 -aALL" to enable it (the "-aALL" means enable it on all present LSI controllers; make sure to run it with sudo, or as root, since otherwise it gives cryptic error messages).

  •  On a side note, I was also a bit dismayed with the build quality of the SC747 chassis and X9DAX-7TF motherboards. It is certainly a nice case, but not up to par with what I would expect for $900. One of the power supplies doesn't line up well with the connector, so it is a pain to get back in. The hot-swap cages are flimsy and don't slide in and out easily. One of the motherboard ethernet ports doesn't clip in the cable, and the BIOS splash screen has a misspelling. The boot time is also ridiculously long; I realize there are a ton of peripherals for it to initialize, but it is pushing a whopping 2 minutes before it even gets to grub, which is a huge annoyance when you are constantly rebooting to debug/configure the bad LSI controller. (Though don't take this the wrong way, I'm still pretty happy with the cases/motherboards, it just seems that for the money they should have been better.)

  • I had a lot of problems getting the servers connected to the network for a really odd/surprising reason. Our network assigns affinity groups (VLANs) based on MAC address. I had the MAC addresses set up and registered correctly, but for some reason the ports were still going up and down randomly and hopping between VLANs. It turns out the built-in NICs send frames from two different MAC addresses! I still haven't looked into why (perhaps it is because of the VMDq, virtual machine device queue, or IOV functionality), but as soon as I made sure all of the MACs were properly registered it worked fine. Weird.


Now on to the good stuff =). So one of the main points of this setup is to have super high data redundancy and uptime. This is accomplished by using RAID 6 arrays which are mirrored across the two servers using DRBD, allowing up to 5 drives to fail, or an entire server to fail, and still have no data loss or downtime. (If each server has 2 drives fail then we are still running at 100% because of RAID 6, the next drive that fails takes down a server, but since everything is on the other server we can just move the VMs over.) For added reliability, the servers will also be placed in separate rooms (though still with a direct 10G ethernet connection over existing wiring), to provide some added location redundancy (against theft, A/C leaks, small fires, etc.). I'm working on an offsite backup too, but that's a pain to administer, and may be a bit overkill for our current needs.
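
The "up to 5 drives" figure can be checked with a little arithmetic. The sketch below is only an illustration and assumes each server is a single RAID 6 array (which survives 2 failed drives) holding a full DRBD copy of the data; the function names are made up for this example.

```shell
# Sketch: does the mirrored pair still hold a complete copy of the data?
# Assumption: one RAID 6 array per server, and DRBD keeps a full copy on each.
RAID6_TOLERANCE=2

# server_ok FAILED -> status 0 if an array with FAILED dead drives is still up
server_ok() {
    [ "$1" -le "$RAID6_TOLERANCE" ]
}

# data_available FAILED_A FAILED_B -> status 0 if at least one server is still up
data_available() {
    server_ok "$1" || server_ok "$2"
}

# Check every way 5 failures can be split across the two servers.
worst_case_five() {
    a=0
    while [ "$a" -le 5 ]; do
        data_available "$a" $((5 - a)) || return 1
        a=$((a + 1))
    done
    return 0
}
```

Any split of 5 failures leaves at most one server with 3 or more dead drives, so one full copy always survives; a 3/3 split (6 failures) is the first case that can lose data.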

 The Xen setup stuff is mostly covered here, but I have a couple points to add, mainly with regard to migration:

  • I found the command "dd if=/dev/myoldvolume bs=4M | ssh -c arcfour mynewserver dd of=/dev/mynewvolume bs=4M" useful for copying logical volumes to the new server (run it on the old server). It is pretty straightforward; the "bs" argument speeds up the disk access, and the "arcfour" encryption is much faster (though less secure) than the default. You can probably use netcat to get better performance, but it seems a bit less reliable to me, and this ran at ~80MB/s, which was plenty (and may have been a network bottleneck more than anything). 

  • If you are migrating pygrub VMs, make sure to update the path to pygrub (for me it was replacing "/usr/bin/pygrub" with "/usr/lib/xen-4.1/bin/pygrub" in the .cfg), along with, of course, all the appropriate logical volume and volume group adjustments for the drives.

  • Both RealVNC and TightVNC had issues displaying the Xen HVMs, but UltraVNC worked great.

  • You need to turn hibernation off in Windows 7/8 HVM virtual machines using "powercfg -h off" in an administrator command prompt (cmd).

  • When I ran HVM guests I noticed an error stating that the qemu keymaps couldn't be found. This is quickly fixed by symlinking qemu using "ln -s /usr/share/qemu-linaro/ /usr/share/qemu/".

  • For some reason the xen-tools /usr/lib/xen-tools/debian.d/20-setup-apt script is hardcoded to add Debian mirrors to the VM's /etc/apt/sources.list. My quick fix was to add a few lines before the security block (just before the chroot) that overwrite the entire sources list to match dom0's, with the appropriate version:
    #fix for ubuntu (since above security breaks ubuntu dists...)
    case ${mirror} in
        *ubuntu*)
            sed "s/`xt-guess-suite-and-mirror --suite`/${dist}/g" /etc/apt/sources.list > ${prefix}/etc/apt/sources.list
            ;;
    esac

  • I noticed terrible block device performance in my VMs (at least for continuous reads). This is largely because the default read-ahead is not set correctly in the domUs. Essentially the default read-ahead value in Linux is 128 sectors, or 64 kB; however, in RAID arrays this is multiplied by the number of disks (since each is read simultaneously). You can check the settings by running either "blockdev --report" or "cat /sys/block/blkdev/queue/read_ahead_kb". Similarly, to set it, you can run either "blockdev --setra XX /dev/blkdev" (XX in 512-byte sectors) or "echo XX > /sys/block/blkdev/queue/read_ahead_kb" (XX in kB; note this has to be done as root, not with sudo, since the > redirection won't be run as root). I haven't found a good way to automatically do this on VM creation yet, but for now I have just stuck it in the /etc/rc.local file (though apparently this may not work if you run a full GNOME desktop in the environment, since it will reset it). You may also be able to do it with sysctl or /etc/sysctl.conf, but it isn't clear to me how. After setting the correct read-ahead in my domUs I noticed a 3x increase in speed. On a similar note, it seems that it is important to tune the ext filesystem parameters stride and stripe-width for RAID arrays.

  • To enable IOMMU, VT-d, and SR-IOV (which my setup supports), I had to modify /etc/default/grub by adding "pci_pt_e820_access=on" to the option, and adding a line that says GRUB_CMDLINE_XEN="iommu=1 vtd=1 dom0_mem=4096M,max:4096M dom0_max_vcpus=4". (Don't forget to run update-grub after this.) Note that this also caps the dom0 memory and CPUs for performance reasons (you should probably also modify /etc/xen/xl.conf and xend-config.sxp to prevent autoballooning). I also added "xm sched-credit -d Domain-0 -w 1024" to /etc/rc.local for better IO performance with the intensive RAID 6 being processed by dom0. Update: Actually it looks like iommu and vtd are enabled by default in Xen 4.x, but it doesn't hurt to add them. It also appears that there is an issue with dom0 memory, which requires you to run "xm mem-set Domain-0 4096" after boot to reclaim your memory (mine was 2012 without it); just put it in rc.local.

  • There is some weird bug with openvswitch (at least in the kernel update to 3.2.0-35-generic): the new openvswitch module wasn't compiled/loaded when I updated kernels. You can have it recompiled by running "sudo apt-get remove openvswitch-datapath-dkms; sudo apt-get install openvswitch-datapath-dkms xcp-guest-templates xcp-networkd xcp-xapi" (the extra packages are there because the remove takes out more than just the openvswitch module). To update it, install the source, then compile and install it using: "sudo apt-get install openvswitch-datapath-source && sudo m-a a-i openvswitch-datapath". Also, since I compiled and installed DRBD myself, I had to update its kernel module manually ("sudo m-a a-i drbd8" if the module source is already installed).
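
Going back to the read-ahead point above, the two ways of setting it use different units (512-byte sectors for blockdev, kB for read_ahead_kb), which is easy to mix up. Here is a small sketch of the arithmetic that only prints the commands rather than running them. It assumes the rule of thumb from that bullet (128 sectors per disk, multiplied by the number of disks); the device name and disk count are hypothetical.

```shell
# Sketch of the read-ahead arithmetic; prints commands instead of running them.
# Assumption: the per-disk default of 128 sectors (64 kB) is scaled by the
# number of disks in the array.
DEFAULT_RA_SECTORS=128   # 512-byte sectors

# ra_sectors NUM_DISKS -> read-ahead in 512-byte sectors (unit for blockdev --setra)
ra_sectors() {
    echo $(( DEFAULT_RA_SECTORS * $1 ))
}

# sectors_to_kb SECTORS -> the same size in kB (unit for read_ahead_kb)
sectors_to_kb() {
    echo $(( $1 / 2 ))
}

# print_ra_commands DEVICE NUM_DISKS -> the two equivalent ways to apply the value
print_ra_commands() {
    dev=$1
    sectors=$(ra_sectors "$2")
    kb=$(sectors_to_kb "$sectors")
    echo "blockdev --setra $sectors /dev/$dev"
    echo "echo $kb > /sys/block/$dev/queue/read_ahead_kb"
}

# Example: a domU disk "xvda" backed by a 6-disk array (made-up values)
print_ra_commands xvda 6
```

Either printed line could then go into the domU's /etc/rc.local; you only need one of them.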

I looked at some cluster management software, like Pacemaker and Ganeti, to take care of VM management and volume replication, but they all seemed like they had a lot of overhead and were overkill for what I needed (plus they had a somewhat steep learning curve). I could have replicated the entire LVM volume group (or RAID device) over DRBD, but this wouldn't have given me the flexibility I wanted (since I don't necessarily want *every* LV replicated, and I have some volumes on the RevoDrive that I may want replicated), so I ended up using DRBD on a per-volume basis. Since I wasn't about to manually set up every VM to use DRBD, I ended up writing some quick and dirty scripts to automatically take any VM and set it up to be replicated to the other server. This is actually a bit trickier than it sounds, since it involves:

  1. Writing resource files to /etc/drbd.d/ for every VM (made more complicated by VMs with more than one volume, such as their swap). 

  2. Modifying Xen's .cfg file in /etc/xen to point to the appropriate /dev/drbd_ devices and setting the device type to "drbd", if necessary. 

  3. Creating the appropriately sized logical volumes on the other server.

  4. Copying the configs to the other server. 

  5. And, finally, performing a number of administrative tasks, such as creating the DRBD meta data and forcing the primary. 
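
To give a feel for step 1, here is a minimal sketch of generating a resource stanza. This is not the code from my scripts, just an illustration: the function name, host names, IP addresses, volume group, minor numbers, and ports are all made up, and it only prints to stdout rather than writing into /etc/drbd.d/.

```shell
# drbd_resource_stanza NAME MINOR PORT BACKING_LV HOST_A IP_A HOST_B IP_B
# Prints one DRBD 8.4 resource stanza with a named device, an explicit minor
# number, and its own TCP port. All values are supplied by the caller.
drbd_resource_stanza() {
    name=$1; minor=$2; port=$3; lv=$4
    host_a=$5; ip_a=$6; host_b=$7; ip_b=$8
    cat <<EOF
resource $name {
    device    /dev/drbd_$name minor $minor;
    disk      $lv;
    meta-disk internal;
    on $host_a { address $ip_a:$port; }
    on $host_b { address $ip_b:$port; }
}
EOF
}

# One resource per volume, both destined for the same per-VM file
# (hypothetical hosts "alpha"/"beta" and volume group "vg0").
{
    drbd_resource_stanza myvm-disk 10 7810 /dev/vg0/myvm-disk alpha 10.0.0.1 beta 10.0.0.2
    drbd_resource_stanza myvm-swap 11 7811 /dev/vg0/myvm-swap alpha 10.0.0.1 beta 10.0.0.2
}
```

The remaining steps (rewriting the Xen .cfg, creating the LVs on the peer, copying configs, creating metadata) all touch real state, which is why the actual scripts are so heavy on sanity checks.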

While I'm not going to detail every line of code, and it is too much to post inline, you can find the scripts here. THESE ARE PROVIDED WITH NO WARRANTY. I recommend you read every line of the code to figure out what it is doing. In fact, you should probably use them just as reference and never actually use them yourself unless it is on a sandbox with no real data, as THEY COULD CAUSE PERMANENT DATA LOSS. That being said, I did try to do a number of sanity checks and make them fairly robust (used on HVMs, pygrub, etc.). I do feel a bit like I was probably re-inventing the wheel though, so if anyone knows of a good light-weight Xen/DRBD volume manager I would love to hear about it.

I ran into a number of not-so-obvious issues with DRBD while I was building and testing these tools:

  • DRBD 8.3 does not support user-defined device names, which makes setting up Xen config files much, much more difficult. It also does not support online switching of replication protocols. For performance reasons I wanted my synchronization to be asynchronous (Protocol A), but I also wanted to support Xen live migration of VMs, which requires "dual-primary" mode, which is only supported by fully synchronous replication (Protocol C). Ubuntu 12.04 (and 12.10 for that matter) does not have DRBD 8.4 in its repositories, and LINBIT, the developer of DRBD, only has a repository for paying support customers. This means that I had to compile DRBD 8.4 from scratch, but this worked just fine following the instructions in the DRBD documentation. (Update: It looks like some of my VMs on old kernels had (seemingly intermittent) problems with live migration, giving a kernel BUG in spinlock.c. This problem was reportedly fixed after kernel 2.6.38.)

  • DRBD does not do a good job of documenting how to use user-defined block device names. I was very, very dismayed to find that even with a user-defined device name, you still require a device number (i.e., "minor number"). So the correct syntax in the resource file is "device /dev/drbd_mydefinedname minor myminor#;". It is also not clear that every resource has to have its own port (though I guess it makes sense in retrospect...). This means that my scripts had to take over the unfortunate task of keeping track of minor numbers and port numbers :/, which greatly complicated things, and made device creation much more prone to error (you have to make sure to keep the minor number in sync across servers and lined up with the appropriate volumes). Because of this I ended up creating a separate resource for every volume (i.e., both swap and root on most VMs), just so that the port and minor number would stay in sync. I still only used one resource file per VM though, which helps keep things organized.

  • The resync-rate is a bit confusing, and I am still not sure how to set it. Despite the fact that I am using a direct 10GbE connection and RAID arrays that have more than 1GB/s r/w on the storage, I was only syncing at ~100MB/s. I tried setting the "resync-rate 300M;" property in the global config under disk, but it seemed to have no effect on the initial sync rate. I found older references to "syncer", but it seems to be deprecated in 8.4. I'm still looking around, but I guess it's not that big of a deal.

  • Be careful NOT to delete any DRBD .res files without making sure that the resource has been stopped with "drbdadm down myresource". I did this a couple of times during testing/development, and it makes it impossible to stop the resource without recreating the .res file (or rebooting). It seems like the whole .res system could be greatly improved. If DRBD had internal management of ports and minor numbers, then perhaps all the resource/volume-specific information could be moved into the metadata block, and everything could be managed through command line tools... (more like mdadm).

  • /proc/drbd doesn't list the resource names, just the minor number, which is very, very annoying. It makes it impossible to tell the status of volumes, since you have to cross-reference with the actual resource config files in /etc/drbd.d/.  Or you could, you know, just use "drbd-overview" like you are supposed to.

  • When you use the "drbd:" type device in your Xen .cfg, you don't need the full device path, just the drbd resource name, i.e. 'drbd:myvm-disk,xvda2,w'
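
Since keeping track of minor numbers came up a couple of times above, here is a rough sketch of how that bookkeeping can be done by scanning the .res files. It assumes every "resource NAME {" line is followed by a "device ... minor N;" line, as in the syntax described above; the function names are made up, and this is a simplified illustration rather than what my scripts actually do.

```shell
# list_minors DIR -> "MINOR RESOURCE" pairs found in DIR/*.res, sorted by minor.
# Assumes each "resource NAME {" line is followed by "device ... minor N;".
list_minors() {
    cat "$1"/*.res 2>/dev/null | awk '
        $1 == "resource" { name = $2 }
        $1 == "device" && $3 == "minor" { sub(/;/, "", $4); print $4, name }
    ' | sort -n
}

# next_free_minor DIR -> smallest minor number not used by any resource in DIR
next_free_minor() {
    list_minors "$1" | awk '
        { used[$1] = 1 }
        END { n = 0; while (n in used) n++; print n }
    '
}
```

Running "list_minors /etc/drbd.d" gives a quick cross-reference for the bare minor numbers shown in /proc/drbd. The same minor has to be used on both servers, so the result should be checked against the peer's config before creating a new resource.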

After I got everything set up correctly, live migration went without a hitch on para-virtualized Linux VMs, which was pretty awesome =). Notably, defining the volumes with "drbd:" in the Xen .cfg file will automatically take care of everything if you are running Protocol C (it will even promote volumes to primary when you first try to run the VM). If you aren't running Protocol C, then you will have to manually (and temporarily) switch the protocol and enable dual-primary mode on any volumes the VMs are using, by doing "sudo drbdadm net-options --protocol=C --allow-two-primaries myresource". Of course after the migration you will have to change everything back (if you want) using "sudo drbdadm net-options --protocol=A --allow-two-primaries=no myresource". I did try to live migrate a Windows 7 HVM guest; it worked fine initially, but then the VM randomly rebooted and said it had recovered from a blue screen. I'll have to play with it some more to see what the issue was.
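
For VMs with more than one volume, the switch-migrate-switch-back sequence gets tedious, so here is a dry-run sketch that just prints the commands in order. The function name, domain, host, and resource names are hypothetical, and depending on the setup the net-options changes may need to be applied on the peer as well.

```shell
# migration_plan DOMAIN TARGET_HOST RESOURCE...
# Prints (does not run) the commands for a live migration of a Protocol A VM.
migration_plan() {
    domain=$1; target=$2; shift 2
    # 1. Temporarily switch every volume to Protocol C with dual-primary allowed
    for res in "$@"; do
        echo "drbdadm net-options --protocol=C --allow-two-primaries $res"
    done
    # 2. Live migrate the domain
    echo "xm migrate --live $domain $target"
    # 3. Switch every volume back to asynchronous replication
    for res in "$@"; do
        echo "drbdadm net-options --protocol=A --allow-two-primaries=no $res"
    done
}

# Example with made-up names
migration_plan myvm beta myvm-disk myvm-swap
```

Printing the plan first makes it easy to eyeball the resource names before piping anything to a root shell.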

I'm still looking for a good way to be alerted of any failures; it looks like mon is probably the best solution, but I haven't had a chance to set it up yet. Of course installing sendmail and configuring the email address in /etc/mdadm/mdadm.conf will send you notifications of important mdadm RAID events (notably you need a fully qualified domain name for sendmail to work; this can be set in the /etc/hosts file like this: "127.0.0.1 localhost myhost.mydomain.com myhost"). Unfortunately DRBD doesn't have similar functionality.
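
In the meantime, a crude DRBD check can be hacked together by parsing /proc/drbd. The sketch below is only an illustration (the function name is made up): it reads /proc/drbd-style text on stdin and prints any device that is not Connected with both sides UpToDate. It assumes the usual status line layout, e.g. " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----".

```shell
# drbd_problems: reads /proc/drbd-style text on stdin and prints
# "MINOR CONNECTION_STATE DISK_STATES" for every device that is not
# Connected with both disks UpToDate. Prints nothing when all is well.
drbd_problems() {
    awk '
        $1 ~ /^[0-9]+:$/ {
            minor = $1; sub(/:/, "", minor)
            cs = ""; ds = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/) cs = substr($i, 4)
                if ($i ~ /^ds:/) ds = substr($i, 4)
            }
            if (cs != "Connected" || ds != "UpToDate/UpToDate")
                print minor, cs, ds
        }
    '
}
```

Something like "drbd_problems < /proc/drbd" could be run from cron, mailing the output whenever it is non-empty. Note that it would also flag a resync in progress, which may or may not be what you want.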

Anyway, that's about it. If you want a small, reliable, high-availability cluster that is very resilient to data loss, I would highly recommend Xen+DRBD so far. RAID 6 may be a little overkill with the DRBD replication, but I would rather avoid the hassle of recovering from a server going down (and instead just hot-swap bad drives), and given the total cost, the extra two drives to bump from RAID 5 to RAID 6 were just a small percent. For smaller budgets RAID 5 or 1 would still offer a ton of protection ;).

3 comments:

angrygreenfrogs said...

Don't give up on Ganeti my friend. It does a lot of the DRBD orchestration magic that your custom scripts are doing, but it's a refined system and really works great.

Worth your time to check it out.

I'm using XenServer and DRBD in production, but I may move some services to Ganeti in the future since I had some real success with it in my lab environment.

I can send you my Ganeti install notes if you're interested - I set it up on a small lab of 3 machines and had a lot of fun moving around the DRBD volumes and seeing how it handles migration, failover, etc. Pretty cool stuff.

Clayton Shepard said...

Thanks for the tip, I'll definitely take another look if I get a chance.

The current setup is working great for now though, and since it's in production at this point, I really can't take it down to tinker ;).

Actually what I am mostly missing now is some sort of failure monitor. I've had a couple issues with individual processes on VMs hanging (like apache), which I wasn't automatically alerted to. Perhaps I should also take another look at something like heartbeat.

John Barness said...

I've been thinking about moving to the Ubuntu for a week and I am just curious how good web-based applications work on it. I reviewed the best virtual data rooms and it shouldn't be any troubles with that.