Seven Challenges for the Kernel Community

aka The Kernel Report LCA2010

Jonathan Corbet from LWN

Organised in terms of challenges, because this will help spell out what we’ll have to do.

Vitality

Andrew Morton said the volume would have to drop off one day. He said this in 2005.

If you worry about the vitality of the process, a drop-off of the “we’re done, we can go do WordPress plugins” kind would be a bad thing.

In 2005, there were 4000 changesets into 2.6.18; now we’re regularly up to 10,000 plus. So, drop off? Not so much. In the last year we saw 55,000 changesets from 2,700 developers, with at least 370 employers, and 2.8 million LOC.

The process is definitely alive!

The sources are the same as they ever were: about 20% indies, 10% unknown, the rest paid-for. Red Hat, IBM, Novell, Oracle.

Aside: where’s Ubuntu? Again.

So what good new stuff have we got:

  • GEM graphics.
  • ext4 no longer experimental.
  • staging tree for code which is not ready for mainline, but which would benefit from being closer to the kernel to get more eyes. Things go mainline, or drop out.
  • Wireless USB.
  • Kernel mode setting.
  • btrfs, squashfs (been out for a long time), nilfs.
  • WIMAX.
  • 4096 CPUs.
  • TOMOYO Linux - MAC (mandatory access control).
  • Integrity measurement - uses Trusted Platform Modules (TPMs) so remote sites can verify boot-time integrity.
  • ATI R6xx/R7xx graphics support.
  • Performance counter support.
  • char devices in userland - “does anyone actually want this?”
  • kmemleak - “useful debugging tool.”
  • TTM and Radeon KMS. “TTM is the equiv of GEM for ATI.”
  • Storage topology support.
  • Block scalability work.
  • Perf counter improvements.
  • Scheduler interactivity.
  • KSM for virtualisation.
  • HWPOISON - keep running past hardware errors.

Coming Soon™

  • Dynamic ftrace.
  • DRBD distributed storage.
  • I/O bandwidth controller - control by process, aimed at virtualisation.
  • TCP Cookie Transactions - “not complete, but the groundwork is there.”
  • Nouveau driver - “Devs didn’t want it, but Linus pulled it in anyway.” The free driver, not NVIDIA’s.

There’s a lot going on; the consensus at the kernel summit in Tokyo was that it’s mostly working well from a process perspective. There were some concerns, though:

Participation

It only works if people opt to participate; it works less well if you opt out of working with others in the sandpit. Jonathan highlighted a discussion by the lead Google engineer; Google have been working out-of-band on older kernels in-house. They have over 1200 patches and are now trying to get up to date, burning money and engineering resources to roll half a million lines of code forward to newer kernels. It’s been so painful and expensive that Google are going to get closer to the mainline.

Scalability

  • 2.0 in 1996 started scaling to 2 CPUs. Wooooooo!
  • Now 2 - 32 are common, and 4096 is not unheard of.
  • Workqueue restructuring, multiqueue networking, cpumask networking (tidy up the data structures for the processor representations).

Problems:

  • dcache_lock is a big bottleneck under massive IO; it’s not well understood, and fixing it will be hard.
  • Networking: sure, we can run 10 Gbit adapters, and with big packets it works well, but we handle many small packets badly.
  • Solid state devices - the paradigm shift to many IOPS is a big change, with the block dev layer looking maybe more like the network layer in the future; interrupt mitigation, for example, was pulled from the network layer’s equivalent, which was implemented ten years ago.

Scalability needs to work both ways. If you just do big systems, you ain’t scalable. Have a look at bloatwatch. At the moment, we’re growing slower than hardware, but this is still a concern; there could be more participation from the embedded world to help with scaling down.

Storage

Storage devices are getting bigger but not always faster, but we’ve also got small, fast devices in SSDs. The short-term answer is ext4: good compatibility, “fairly stable”, in consumer distros (aside: pity Ubuntu’s is corrupting large files), generally works pretty well, lifts performance.

btrfs

New, from scratch.

  • Fast
  • Full checksumming for commodity hardware
  • Snapshots - F13 is looking at using snapshots as part of the upgrade/rollback process.
  • Integral RAID/volume management.
  • In the mainline, but not production-ready. It will be quite some time before it is.

Solid-state

  • Moving into the market
  • Some interesting performance issues; they degrade over time.
  • TRIM/discard doesn’t always work well.
  • There are whole new layers of problems.

Visibility

An increasing demand for being able to look into the running system in more detail, both as developers and managers. Also, how do you shut up DTrace users, although they’re kinda quieter (aside: you need to know more about how Solaris works than most loudmouths do).

Systemtap: nice, featureful, but not well-loved by the developers. But then they’re the wrong people to be driving this IMO, because it’s meant for admins and engineers.

ftrace: lightweight, popular with kernel developers, especially now dynamic tracing is available. Jonathan thinks this will deprecate systemtap; I reckon it only will if it actually gets documented and broadened. Great if you only care about the kernel.

Perf Events: Heavy development, useful for optimising low-level stuff.

Aside: Disappointed that the focus is entirely on the kernel, which won’t shut the dtrace crowd up, because it’s kernel dev focused - it’s useless to engineers.

Responsiveness/Real-time

It’s about determinism, not speed. Critical in embedded systems: keyboards, music, TVs, etc.

Surprise, though: Used in financial systems! Linux lets us crash markets quicker than ever before! The financial people are doing heavy work in this space.

Deterministic real-time, general purpose kernel.

Priority inheritance: Linus said it would never get in, but there it is!

Most of the outstanding work is around spinlocks, big (global) kernel locks, and the like.

Deadline scheduling: moving from priorities to deadlines. Priorities are traditional POSIX, but this is a bad match for the realtime world, which thinks in terms of “How much, by when, and how often”. The scheduler will refuse work it can’t run, because it’s more important to be accurate.
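To make the “how much, by when, and how often” idea concrete, here is a minimal sketch of an admission test. It is not the kernel’s code: the `Task` and `AdmissionControl` names, the microsecond units, and the single-CPU density test (sum of runtime / min(deadline, period) must stay at or below 1) are all assumptions picked for illustration.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Task:
    """Hypothetical task description, not a kernel structure."""
    name: str
    runtime_us: int   # "how much": CPU time needed per activation
    deadline_us: int  # "by when": deadline relative to each activation
    period_us: int    # "how often": interval between activations


def density(task):
    # Fraction of one CPU the task needs in the worst case.
    return Fraction(task.runtime_us, min(task.deadline_us, task.period_us))


class AdmissionControl:
    """Toy single-CPU admission test: refuse work that cannot be guaranteed."""

    def __init__(self):
        self.admitted = []

    def load(self):
        return sum((density(t) for t in self.admitted), Fraction(0))

    def try_admit(self, task):
        # Sufficient (not necessary) test: total density must not exceed 1.
        if self.load() + density(task) > 1:
            return False  # refused: admitting it could make someone miss a deadline
        self.admitted.append(task)
        return True


ac = AdmissionControl()
print(ac.try_admit(Task("audio", 2_000, 10_000, 10_000)))    # needs 1/5 of the CPU
print(ac.try_admit(Task("video", 15_000, 30_000, 30_000)))   # needs 1/2 of the CPU
print(ac.try_admit(Task("bulk", 40_000, 100_000, 100_000)))  # needs 2/5: over budget
```

The third task is turned away rather than accepted and run late, which is the “refuse work it can’t run” behaviour described above.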

Containment

Virtualisation: Full VMs, different guests, etc. Containers: Jails on steroids. More limited, but more efficient.

Virtualisation

  • KVM in, Xen still out (comment in response to a question that Linus had advised the Xen maintainers to “treat Ingo as damage and route around him”).
  • KSM gives inter-guest page sharing, which is pretty cool.
  • Compcache: swap memory to memory with compression.
  • Transcendent memory: see the talk.
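As a rough illustration of what inter-guest page sharing buys you, here is a toy model where identical pages from different guests are stored once and split apart again when one guest writes. Everything here is an assumption for the sketch: the `PageStore` class, the guest names, and especially the SHA-256 lookup, which is a shortcut and not a description of how the kernel’s KSM finds matching pages.

```python
import hashlib

PAGE_SIZE = 4096


class PageStore:
    """Toy model of content-based page sharing between guests."""

    def __init__(self):
        self.frames = {}    # digest -> page contents (one stored copy)
        self.refcount = {}  # digest -> number of guest pages mapped to it
        self.mappings = {}  # (guest, page_no) -> digest

    def write(self, guest, page_no, data):
        if len(data) != PAGE_SIZE:
            raise ValueError("pages must be exactly PAGE_SIZE bytes")
        # A write never modifies a shared frame; the old mapping is dropped
        # and the page is re-mapped, which gives copy-on-write behaviour.
        self._unmap(guest, page_no)
        digest = hashlib.sha256(data).digest()
        if digest not in self.frames:
            self.frames[digest] = bytes(data)
            self.refcount[digest] = 0
        self.refcount[digest] += 1
        self.mappings[(guest, page_no)] = digest

    def _unmap(self, guest, page_no):
        digest = self.mappings.pop((guest, page_no), None)
        if digest is None:
            return
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            del self.frames[digest]
            del self.refcount[digest]

    def read(self, guest, page_no):
        return self.frames[self.mappings[(guest, page_no)]]

    def frames_in_use(self):
        return len(self.frames)


store = PageStore()
zero_page = bytes(PAGE_SIZE)
store.write("guest-a", 0, zero_page)
store.write("guest-b", 0, zero_page)
print(store.frames_in_use())  # two guest pages, one stored frame

store.write("guest-b", 0, b"\x01" * PAGE_SIZE)
print(store.frames_in_use())  # guest-b diverged, so two frames now
```

The point of the sketch is only the bookkeeping: many guests booted from the same image have lots of identical pages, and each group of duplicates costs one frame until somebody writes.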

Containers

  • Linux does namespace isolation for this.
  • Resource controllers are underway.
  • Checkpoint/restart.

Hardware

  • Hardware support is near-universal.
  • Graphics is still a problem, but it’s getting there. (Aside: this seems likely to be a moving target for me.)
  • Some network adapters. Just avoid dickish vendors. They need us more than the reverse.

Q&A

  • How do new developers get accepted? Participation is hard; Jonathan wants more people in, but acceptance is hard. lkml is not friendly or accepting. “It can be intimidating, but there are people evangelising, writing, documenting, educating people, and working on changing the mailing list culture to make it friendlier. It may still be a bit cabalish at the higher levels, but we’re making progress and generally doing good.”
  • Can we speak to regressions, particularly on older hardware? “Keeping older hardware working can be a problem, but it needs people testing, reporting, and working with the developers on fixing it.”