I want to have these guys’ babies. Incredibly cool stuff from NTT, who also brought us NILFS.
(As an aside: it’s sad that there’s this whole deep geek culture in Japan, throwing up things like Ruby and company-sponsored work like NTT’s contributions, that gets horribly underexposed to the English-speaking world.)
Distributed storage for QEMU/KVM.
- Amazon EBS-like volume pool
- Scalable, available, and reliable
- Support for advanced volume management
- Many physical nodes in a Sheepdog cluster present as multiple HA volumes, accessed as block devices by the VMs, with snapshots, cloning, etc.
- Fully symmetric. Zero config for cluster nodes.
- Auto-detect added/removed nodes.
- Similar to the Isilon and XIV architectures.
- New nodes are auto-added and data rebalanced onto them; removed nodes have their data rebalanced onto the survivors.
Goals
- Managed autonomously, with auto relocation and balancing.
- Scale to hundreds of nodes, with linear scaling of performance and capacity.
- HA volumes, data replicated across multiple nodes with auto-recovery of lost data.
- Support useful volume manipulation: snapshots, LVM-style management, etc.
Design
- Not a general filesystem. The question was asked whether it ever will be; that’s not a design goal and it most likely never will be, since too much of the design is optimised for working well with KVM. I’d say that’s a feature, not a bug; outside of the likes of Weta, most shops now have too much iron, and partitioning/virtualisation is the Way of the Future(tm).
- API specific to QEMU/KVM
- Cannot be used as a regular filesystem; all comms happen through the collie process on the nodes and clients.
- One volume can be attached to only one VM at a time.
- Volumes are divided into 4 MB objects. They’d like to revisit this to tune for different cluster sizes, but 4 MB seems optimal for the overhead/performance balance.
- Each object has a GUID and is replicated to multiple nodes. Most of the examples during the demo showed 3 copies being the norm.
- Sparse allocation so only written objects are allocated.
- Consistent hashing decides which nodes objects should be stored on (see the sketch after this list).
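To make the placement scheme concrete, here’s a minimal sketch in Python of the general idea: split a volume offset into 4 MB objects, then use consistent hashing over the cluster’s nodes to pick the replicas. This is my own illustration of the technique, not Sheepdog’s actual code; the names (OBJECT_SIZE, Ring, place_object) are made up for the example.

    # Toy illustration of 4 MB objects + consistent hashing; not Sheepdog code.
    import hashlib
    from bisect import bisect

    OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB objects, as in the talk

    def _hash(key):
        # Stable 64-bit hash of a string key.
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    class Ring:
        """A minimal consistent-hash ring of storage nodes."""
        def __init__(self, nodes, vnodes=64):
            # Several virtual points per node keep load even and make
            # rebalancing move only a small fraction of objects.
            self.points = sorted(
                (_hash(f"{n}:{i}"), n) for n in nodes for i in range(vnodes)
            )

        def replicas(self, key, copies=3):
            # Walk clockwise from the key's position, collecting distinct nodes.
            hashes = [h for h, _ in self.points]
            idx = bisect(hashes, _hash(key))
            chosen = []
            for _, node in self.points[idx:] + self.points[:idx]:
                if node not in chosen:
                    chosen.append(node)
                if len(chosen) == copies:
                    break
            return chosen

    def place_object(ring, volume_guid, offset, copies=3):
        # Sparse allocation: an object exists only once its range is written.
        obj_index = offset // OBJECT_SIZE
        object_id = f"{volume_guid}/{obj_index}"
        return object_id, ring.replicas(object_id, copies)

    ring = Ring([f"node{i}" for i in range(6)])
    print(place_object(ring, "vol-1234", 10 * 1024 * 1024))  # object 2, 3 nodes

The nice property of the ring is that when a node joins or leaves, only the objects whose positions fall near that node need to move, which is what makes the zero-config add/remove behaviour above cheap.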
Cluster Management
- corosync: supports reliable, totally ordered multicasting.
- Sheepdog uses it for lock/release, join/leave, and master election.
- Let’s see what happens when a node falls down.
- Node falls off.
- Sheepdog notices a decrease in the redundancy count.
- Data is re-arranged to restore the desired redundancy.
- Three-way replication is the target, allowing the cluster to survive a double failure (a sketch of the recovery idea follows this list).
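The recovery loop itself is simple in concept. Here’s a minimal sketch of the idea, reusing the Ring class from the earlier sketch; again this is my own made-up illustration, not Sheepdog or corosync code.

    TARGET_COPIES = 3  # three-way replication: survives a double failure

    def copy_object(src, dst, object_id):
        # Stand-in for the real node-to-node object transfer.
        print(f"re-replicating {object_id}: {src} -> {dst}")

    def on_node_left(ring, dead_node, object_map):
        # object_map: object_id -> set of nodes believed to hold a copy.
        # `ring` is a consistent-hash ring (e.g. the Ring sketched earlier),
        # already rebuilt without dead_node.
        for object_id, holders in object_map.items():
            holders.discard(dead_node)                # that copy is gone
            for node in ring.replicas(object_id, TARGET_COPIES):
                if node not in holders and holders:   # under-replicated here
                    copy_object(next(iter(holders)), node, object_id)
                    holders.add(node)

Because corosync’s multicast is reliable and totally ordered, every surviving node sees the same membership change in the same order, so a fully symmetric cluster can agree on the new placement without a central coordinator.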
The demo had tantalising glimpses of a decent-looking management GUI being developed. This is really nice - like documentation, it’s a hugely under-realised part of a lot of open source.
Scalability
Essentially, the performance testing to date suggests it’ll beat out GFS on a NetApp once you get a bunch of nodes, but until then GFS wins. Of course, they’re also relatively early in their development cycle. And, I’ll note, cheap dumb flocks of disk are likely a damn sight cheaper than a NetApp.
Current Status
Early development, not suitable for production.
I can’t overstate how exciting this one was. This is fantastic-looking technology, and I hope NTT drive it hard. Having seen the KVM orientation of Certain Large Vendors who’ve given me NDAed presentations, I’d say KVM is going to own the Linux full virtualisation space in the not too distant future, and Sheepdog would be a godsend for large KVM installations, letting them do cool stuff for reasonable prices. Extra kudos to Kazutaka Morita, because coming and presenting in a second language looked like hard work when I saw the folks in Brisbane do it, and it looked daunting here, too.