mdadm and the Tyranny of Documentation Lag

So you’re building mdadm arrays the modern way: avoiding RAID 5 because of the write hole and the rebuild problems with modern large disks. That would incline you to build a RAID 10 layout. Great! Until the day you need to grow your array. You read the Internet, you check your man pages, and you discover, to your growing horror, that you’re screwed! You can’t add any more disks and do a reshape - you’re going to have to dump your filesystems to a backup, rebuild the arrays and filesystems from scratch, and then do a restore.

What is this, 1999? ZFS?

But you’re wrong. It’s not your fault, though. You’ve been misled by documentation lag.

To wit:

[root@fedora-22]# mdadm --grow /dev/md127 --add /dev/sdb -n 5
[root@fedora-22]# cat /proc/mdstat
Personalities: [raid10]
md127 : active raid10 sdb[4] sdc[3] sdf[0] sdd[1] sdi[2]
      52388864 blocks super 1.2 512k chunks 2 near-copies [5/5] [UUUUU]
      [===============>...] resync 87.5% (45865088/52388864) finish=5.7 min speed=24999K/s

But wait - what is this sorcery? This is impossible. You should be seeing an error! Something has gone wrong, or, depending on your point of view, very right. You’re getting your online resize of the raid10 array. Which is great, but both the Internet and the man pages say this can’t happen.
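If you'd rather not squint at the output to convince yourself, the member counts can be pulled out mechanically. What follows is only a minimal sketch: the function name md_member_counts is my own invention, and it assumes the mdstat layout shown above, where the bracketed [5/5] pair on the "blocks" line gives the number of devices the array is configured for followed by the number currently active.

```shell
#!/bin/sh
# Hypothetical helper (not part of mdadm): reads mdstat-format text on stdin
# and prints "<array> <configured>/<active>" for each md array it finds.
md_member_counts() {
    awk '
        # Remember the array name from lines such as "md127 : active raid10 ..."
        /^md[0-9]+ :/ { name = $1 }
        # The "blocks" line carries the [configured/active] pair, e.g. [5/5]
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            print name, substr($0, RSTART + 1, RLENGTH - 2)
        }
    '
}

# Sample copied from the transcript above, so the sketch runs anywhere.
sample='md127 : active raid10 sdb[4] sdc[3] sdf[0] sdd[1] sdi[2]
      52388864 blocks super 1.2 512k chunks 2 near-copies [5/5] [UUUUU]'

printf '%s\n' "$sample" | md_member_counts
```

On a live system you would feed it the real thing with md_member_counts < /proc/mdstat; if the count went from 4 to 5, the grow really did happen, whatever the documentation says.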

Documentation lag is your answer. Documentation lag is something normally associated with the Internet; because search engines tend to privilege things like the popularity of an answer to a problem above any notion of currency or accuracy, it’s fairly easy to find yourself looking at StackOverflow or old Ubuntu forums answers to questions which may have been correct in 2013, but have since been overtaken by events. This is a real pain in the arse - you’ll quickly end up doing things (whether sysadminly or programmery things) that turn out to be remarkably bad decisions. In this case, for example, you might feel that the Internet’s authoritative claims about raid10 on Linux mdadm mean you have to do a lot of highly disruptive, risky work; or, if you did your reading first, you might avoid raid10 altogether, compromising on a worse fit for your needs.

That’s a problem that the likes of Google and Bing would do well to think harder about - to weight correct, current information over out-of-date, misleading answers[1]. But at a human level, the answer to that is normally “don’t Google answers”; if you’re solving a programming problem, look at the current documentation for the programming language you’re using; if you’re a Unix admin, you should be using the man pages for the version of *ix that you’re running. In this case, though, that doesn’t get you very far. Because if you log into a fairly current (Fedora 22 or 23, for example) system, you’ll find the nominally accurate guide to the system’s capabilities (the man pages) is just as misleading.

This is extremely unfortunate, but it reflects a common cultural issue in programming communities (closed source as well as open source): documentation is the unloved, unsexy bastard child of projects. Even the code commentary may be out of date, leaving the documentation authors with the unlovely task of tracking changes by reading either the code commits or the rat’s nest of mailing lists, IRC, Slack channels, or whatever ad-hoc mechanisms are used to record change.

This is a shame; if you want people to use the awesome features of your software, the best way to encourage them to do so is to make sure they know about them. Otherwise they’ll judge you on the five- or ten-year-old version of your capabilities - and if they jump ship to something else, you’ll have been done in by documentation lag.

  1. No, I don’t know how you do that. If I did, I’d be paid a lot more money.