by Florian Haas.
- Almost everyone has done HA.
- About half of the people have worked with replicated storage.
- A few people had heard of FlashCache.
- Two people in the audience are actually running it on a server.
- The key driver behind FlashCache was Facebook, along with a number of other sponsors.
- Facebook doesn't talk much about what they use it for.
- It probably resolves a problem with certain MySQL backends (e.g. InnoDB) where a shutdown/restart cycle would result in crippled performance until MySQL's caches had repopulated.
- It’s not MySQL or InnoDB specific, though, it’s a general block-level caching technology.
- Main use case is for accelerating performance of large data sets.
- Say you have a large data set to serve: you could throw money at it by replacing your cheap storage and migrating to e.g. SSDs for performance. "You could call it the 1%-er approach".
- So what if we could combine a small, expensive SSD with large, cheap storage?
- This is the same approach as e.g. Adaptec and LSI RAID controllers, but there you're locked into the physical controller.
- FlashCache gives you this in Linux.
- FlashCache is simply a device-mapper target; simple, generic, and very familiar to admins.
- There are 4 userland utilities, with a fifth coming, for managing FlashCache: initialising and configuring the cache and loading the devices into the device mapper.
- From there you use it as a generic block device; normally you might just put a filesystem on it, stack LVM on top, or whatever.
- It functions as you'd expect: reads are copied to the cache and used for subsequent accesses - a standard LRU algorithm.
- You can use a variety of modes: writeback, writethrough.
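A minimal sketch of that workflow, assuming the SSD is `/dev/ssd` and the backing disk is `/dev/sdb` (both device names are placeholders for illustration):

```shell
# Create a FlashCache device-mapper device named "cachedev".
# -p back selects writeback mode; -p thru would select writethrough.
flashcache_create -p back cachedev /dev/ssd /dev/sdb

# The result is an ordinary block device under /dev/mapper, so you
# can put a filesystem (or LVM) on it as usual:
mkfs.ext4 /dev/mapper/cachedev
mount /dev/mapper/cachedev /srv/data

# After a reboot, an existing writeback cache is re-attached with:
flashcache_load /dev/ssd
```

The split between `flashcache_create` (initialise a new cache) and `flashcache_load` (re-attach an existing one) matters for writeback mode, where the SSD may hold dirty blocks that must not be discarded.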
When we want to use this in an HA environment, we have a number of options; discussion covers DRBD.
- Option one: use a FlashCache device on each node, underlying the DRBD device.
- Running off cold caches is still slow when you fail over: the data will be there, but the cache isn't populated.
- But there is an option to work around that scenario.
- Rather than running DRBD over the whole lot, we run DRBD over each type of storage (flash and disk) separately, and then put FlashCache above the two DRBD devices.
- This preserves the LRU caching behaviour, so it's ready to run at full speed immediately on failover.
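The per-tier stacking can be sketched as follows (resource and device names are illustrative, not from the talk):

```shell
# Two DRBD resources, one per storage tier:
#   /dev/drbd0 - replicates the SSD  (e.g. backed by /dev/ssd on each node)
#   /dev/drbd1 - replicates the disk (e.g. backed by /dev/sdb on each node)
#
# FlashCache then sits on top of both DRBD devices, so the cache
# contents themselves are replicated and arrive warm on the peer:
flashcache_create -p back cachedev /dev/drbd0 /dev/drbd1
```

Contrast this with option one (DRBD on top of FlashCache), where only the combined device is replicated and the peer's cache starts cold.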
- The OCF resource agent manages this behaviour as part of your normal Pacemaker configuration for DRBD.
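A hypothetical `crm` shell sketch of how such a configuration might look; the resource agent's provider, name, and parameter names here are assumptions (check the installed agent's metadata with `crm ra info`), and `ms_drbd_ssd`/`ms_drbd_disk` are placeholder names for the two DRBD master/slave resources:

```shell
# Hypothetical: a flashcache primitive stacked on two promoted DRBD
# resources. Colocate it with the DRBD masters, and start it only
# after both have been promoted.
crm configure primitive p_flashcache ocf:heartbeat:flashcache \
    params device=cachedev \
           cache_device=/dev/drbd0 backing_device=/dev/drbd1
crm configure colocation c_flashcache_with_drbd \
    inf: p_flashcache ms_drbd_ssd:Master ms_drbd_disk:Master
crm configure order o_drbd_before_flashcache \
    inf: ms_drbd_ssd:promote ms_drbd_disk:promote p_flashcache:start
```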
- There are some gotchas:
- FlashCache is still out-of-tree and there seems to be no current effort to mainline it, with all the usual problems that entails.
- It is currently being maintained for RHEL 5, RHEL 6, and Debian.
- Ditto DRBD: while the mainline version is there, it's old (8.3.11), and to make this work right you need 8.4, which is still out of tree. Florian doesn't see this changing any time soon.
- If you hate out-of-tree code, you'll probably give this a miss.
- There's no real packaging and even the build environment sucks. Florian understands the hate for autotools, but the build process is completely non-tweakable. Hence, no RPMs or deb files.
- The OCF resource agent is there and it works, but it could use some testing.