I Can't Believe This is Butter! A tour of btrfs.

2012-01-18

Technology · Conferences

lca2012 · lca · Avi Miller · btrfs

Avi Miller

Some key points (having lost many notes due to Firefox being fucking useless).

There’s a bunch of stuff that still works badly or not at all, plus optimisations still to be done.
e.g. metadata is stuck at a 4K blocksize, and that hurts performance. This is being fixed.
RAID is block redundancy across disks. So a RAID-1 mirror with 5 different-sized disks will simply make sure that blocks are duplicated somewhere in the array.
Scrubbing is great, and will auto-fix on read. There are some important caveats, though; the biggest is that btr prefers to always read from the same device if it can. This means that if you don’t force scrubs occasionally you can have a drive crap itself, pull the drive, and then discover that your alternate block was corrupt, leaving you with no good copy. Oops.
Chris M recommends scrubbing periodically with the sum tool (say weekly for a busy filesystem).
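To make the scrub caveat concrete, here is a toy model in Python. It is not btrfs code: the ToyMirror class, its two-device layout and the CRC32 checksums are all assumptions made up for illustration. It only shows why a copy that is never read can rot unnoticed until the preferred device is gone.

```python
import zlib


class ToyMirror:
    """Hypothetical two-device mirror; every block has a stored checksum."""

    def __init__(self, blocks):
        self.copies = [list(blocks), list(blocks)]  # copies[device][block]
        self.sums = [zlib.crc32(b) for b in blocks]
        self.present = [True, True]

    def _ok(self, dev, i):
        # A copy is good if its device is attached and the checksum matches.
        return self.present[dev] and zlib.crc32(self.copies[dev][i]) == self.sums[i]

    def read(self, i, preferred=0):
        other = 1 - preferred
        if self._ok(preferred, i):
            return self.copies[preferred][i]  # the other copy is never checked
        if self._ok(other, i):
            if self.present[preferred]:
                self.copies[preferred][i] = self.copies[other][i]  # fix on read
            return self.copies[other][i]
        raise OSError(f"block {i}: no good copy left")

    def scrub(self):
        """Check every copy on every attached device; repair from a good one."""
        repaired = 0
        for i in range(len(self.sums)):
            good = [d for d in (0, 1) if self._ok(d, i)]
            if not good:
                continue
            for d in (0, 1):
                if self.present[d] and not self._ok(d, i):
                    self.copies[d][i] = self.copies[good[0]][i]
                    repaired += 1
        return repaired


def demo(scrub_first):
    m = ToyMirror([b"alpha", b"beta"])
    m.copies[1][1] = b"bxta"   # silent corruption on the non-preferred device
    m.read(1)                  # served from device 0, nobody notices
    if scrub_first:
        m.scrub()
    m.present[0] = False       # device 0 dies and gets pulled
    try:
        return m.read(1)
    except OSError:
        return None
```

In this model, demo(False) loses the block once device 0 is pulled, while demo(True) still returns it because the scrub repaired the second copy first.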
You can mount any device in an array and everything mounts.
No idea what happens if you try mounting multiple devices in the array.
Disk replacement is working smoothly, and Just Works.
btr send/receive is working. It sends a “neutral” stream, so it ought to scrub and dump errors.
btr is friendlier to small machines than ZFS, but not to small disks - it tends to allocate heaps of metadata.
RAID 0, 1, and 10 are there, but RAID 5, 6 and triple mirroring are still sitting in the merge queue, thanks to Intel.
You can mix RAID levels on the same disks, because, hey, it’s just block duplication.
Unfortunately df and the like just Don’t Work. e.g. until you force sync, the filesystem will report the wrong utilisation, and it will always tell you the FS size is the sum of all the disks in an array.
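A rough sketch of why "sum of all the disks" is misleading for a RAID-1 array of different-sized disks. The allocation policy here (put each pair of copies on the two devices with the most free space) and the 1 GiB chunk size are assumptions for illustration, not something stated in the talk, and the function names are made up.

```python
def naive_df_size(sizes_gib):
    """What the notes say df reports: the plain sum of all the disks."""
    return sum(sizes_gib)


def raid1_usable(sizes_gib, chunk_gib=1):
    """Simulate mirrored allocation: every chunk needs room on two devices."""
    free = list(sizes_gib)
    usable = 0
    while len(free) >= 2:
        # Assumed policy: pick the two devices with the most free space.
        order = sorted(range(len(free)), key=lambda d: free[d], reverse=True)
        a, b = order[0], order[1]
        if free[b] < chunk_gib:
            break  # fewer than two devices can still hold a chunk
        free[a] -= chunk_gib
        free[b] -= chunk_gib
        usable += chunk_gib
    return usable


disks = [500, 320, 250, 160, 80]  # hypothetical disk sizes in GiB
print("df-style size:", naive_df_size(disks))
print("mirrored data that fits:", raid1_usable(disks))
```

Under these assumptions the mirrored capacity can never exceed half of the raw total, and a single oversized disk (try [300, 100]) is limited by how much space the other devices can pair with it.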

When Bad Things Happen to Good Data

There’s a read-only btrfs tool, so you can try and save your data when btr goes bad. It works well.
Chris Mason will be talking about btrfs on Saturday. You may choose to assume that btrfsck will be announced then. If you want.
Oracle have publicly stated that they will take it into production with btrfs.
Even if the filesystem isn’t changing, the metadata rolls its root backup (every 30 seconds). You can switch this off.
Avi has some amusing tools to corrupt files and filesystems.
And “mount -o recovery” just fixed the checksum corruption he inflicted on his test filesystem. Worst-case scenario, you’ve lost 30 seconds of data per write.

Beeeellions of files

ext4, xfs, and btrfs all have problems with lots of files.
ext4 is journal-bound
xfs has fixed this in head. It spams files all over the place and gets generally good performance, but generates many seeks.
btr load-levels across the disk, so it isn’t seek-thrashing the disk.
btr and xfs are both CPU-limited on SSDs.
“seekwatcher” is one of Chris M’s tools that shows what’s going on.

yum upgrade and snapshots

Requires btrfs root, and allows you to snapshot on upgrade and rollback in one hit.
It’s easier to use Fedora than OEL to convert the FS from ext4. Since the ext4 filesystem is stored as a conversion snapshot, you can roll back to ext4 later.
Avi no longer uses the 3D accelerators for VirtualBox so he never has to use GNOME 3.
When you convert ext4 -> btrfs remember to edit /etc/fstab and change the FS type!
You need the yum snapshot plugin to be installed.
Then yum install just creates a snapshot.
New Fujitsu logging has improved the speed of apt-get and yum, both of which generate a lot of fsync() calls.
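The /etc/fstab reminder above is easy to forget, so here is a minimal Python sketch of the edit. The function name set_fstab_type is made up, and it only touches the third field (the filesystem type) for one mount point. It works on text you pass in rather than on the live file, and it does not check whether the mount options on that line still make sense for the new filesystem, which you would want to review by hand.

```python
def set_fstab_type(fstab_text, mount_point, new_type):
    """Return fstab text with the FS type changed for one mount point."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # fstab fields: device, mount point, type, options, dump, pass
        is_entry = (
            len(fields) >= 3
            and not fields[0].startswith("#")
            and fields[1] == mount_point
        )
        if is_entry:
            fields[2] = new_type
            line = " ".join(fields)  # note: original column alignment is lost
        out.append(line)
    return "\n".join(out) + "\n"


example = (
    "# /etc/fstab (made-up example)\n"
    "UUID=abc / ext4 defaults 1 1\n"
    "/dev/sdb1 /home ext4 defaults 1 2\n"
)
print(set_fstab_type(example, "/", "btrfs"))
```

Comment lines and other mount points are left exactly as they were; only the matching entry is rewritten.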

Questions

Some people do md-raid and btr-RAID.
Dedupe? Not on the roadmap right now. Disks are so big; the cost of CPU and RAM to dedupe is huge.