You can't spell KABOOM without OOM

A real-world debugging problem.

  • Koji - the build system used inside Red Hat and by Fedora.
  • DB (postgresql) for build metadata; Hub for the XMLRPC interface; Web GUI; workers (mock, rpmbuild) to do the builds.

Background

  • Memory is important!
  • NFS, disk, mmap, swap are all great, but not good enough.
  • And then the OOM killer comes to town!
  • Back in the 2.2 and 2.4 days the OOM killer was appalling.
  • The process selection has improved enough over time that Anthony’s thinking about running without swap and relying on the OOM killer.

The Problem

  • Koji server running out of memory; it alerts because it’s ground to a halt on low memory, so let’s reboot!
  • Everything works again, it’s not the DB box, so there’s no data loss. What’s the problem?
  • Well, the workers need to be restarted, and nobody likes getting paged at 3 a.m. every other day.
  • So fix the bug!

Fixing the Bug

  • Some OOM problems previously, linked to being able to do crazy queries like “List all history for everything forever”, which gets bundled up as a big glob of query, turned into a big glob of XML for XMLRPC, and bad things happened.
  • Used setrlimit() to kill processes that go too big as a workaround.
  • Loss of code trust.
  • Could be another bug in koji - turn on lots of debug logs, trawl through the usage logs.
  • Throttle incoming requests for overuse, but the “overuse” was long-term and hadn’t been causing problems.
  • Maybe it’s mod_python memory leaks in RHEL5’s version of python? Upgrading to RHEL6 or moving to WSGI seems like a bad idea when there’s already a problem.
  • Reduce apache MaxRequestsPerChild? Didn’t help.
  • setrlimit() stops any single process from getting huge, but not many processes each getting big. Reducing the number of clients with MaxClients helps with that, but it cuts the amount of concurrent usage.
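
For anyone who hasn't used the setrlimit() workaround, here is a minimal sketch of the idea in Python. It is not Koji's actual code: the function name cap_address_space and the choice of RLIMIT_AS are assumptions for illustration. With RLIMIT_AS the oversized allocation simply fails (in Python it surfaces as MemoryError), which takes out the one runaway request rather than the whole box.

```python
import resource


def cap_address_space(max_bytes):
    """Lower this process's soft address-space limit (hypothetical helper).

    Returns the (soft, hard) pair in force afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # An unprivileged process cannot set the soft limit above the hard limit.
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
    return resource.getrlimit(resource.RLIMIT_AS)
```

Note that the limit is per process, which is exactly why it catches one huge process but does nothing about fifty merely big ones.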

  • Move to testing - a crash script that opens a number of sockets, does a little work on each, and reports on the success/failure.

  • Couldn’t reproduce the problem.

  • Give it more memory!

  • Still crashes.
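
The crash script wasn't shown, so the following is only a sketch of the shape described above: open a number of sockets at once, do a little work on each, and count successes and failures. The names one_request and hammer, the plain HTTP-style payload, and the defaults are all assumptions; a real test against the hub would presumably send XMLRPC calls instead.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def one_request(host, port, payload, timeout):
    """Open one socket, send a small request, report whether any reply came back."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
            return len(sock.recv(4096)) > 0
    except OSError:
        # Covers refused connections, resets and timeouts.
        return False


def hammer(host, port, connections=50,
           payload=b"GET / HTTP/1.0\r\n\r\n", timeout=5.0):
    """Run `connections` requests concurrently; return (successes, failures)."""
    with ThreadPoolExecutor(max_workers=connections) as pool:
        results = list(pool.map(
            lambda _: one_request(host, port, payload, timeout),
            range(connections)))
    ok = sum(results)
    return ok, len(results) - ok
```

Something like hammer("localhost", 80, connections=200) would then give a quick pass/fail count per run.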

Back to First Principles

  • Use a soft toy.
  • Are you running out of memory?
  • Are you sure?
  • How does the kernel track memory usage to make that decision?
  • How about kernel memory structures? Look at slabinfo.
  • Lo and behold - koji was using 1 GB, but 4 - 5 GB was showing as used.
  • Oh dear. nfs_inode_cache was caching 2.5 GB of data.
  • Oh dear oh dear. It’s a regression in the RHEL 5 kernel. 5.7 specifically.
  • Capturing slabinfo while running, but, oh dear, the shell script doesn’t work when you’re swapping heavily. Bugger.
  • You want to use Python, Perl, etc. in one long-running process, so you aren’t forking or execing for every sample and are less likely to be stuck waiting on swap.
  • Demonstrated that it’s an NFS caching problem.
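
A sketch of what such a capture script could look like, assuming the usual /proc/slabinfo 2.x column layout (name, active_objs, num_objs, objsize, ...) and estimating each cache's footprint as num_objs * objsize. The function names, the sampling interval and the output format are made up for illustration, and reading /proc/slabinfo generally needs root. Everything runs inside one process, so nothing is forked per sample.

```python
import time


def parse_slabinfo(text):
    """Return {cache_name: approximate_bytes} from /proc/slabinfo contents."""
    usage = {}
    for line in text.splitlines():
        # Skip the version banner and the column-header comment.
        if line.startswith("slabinfo") or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        num_objs, objsize = int(fields[2]), int(fields[3])
        usage[fields[0]] = num_objs * objsize
    return usage


def top_caches(text, n=5):
    """The n largest slab caches as (name, bytes), biggest first."""
    usage = parse_slabinfo(text)
    return sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:n]


def sample(path="/proc/slabinfo", interval=10.0, count=6, n=5):
    """Print the top caches every `interval` seconds, `count` times."""
    for _ in range(count):
        with open(path) as f:
            text = f.read()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for name, size in top_caches(text, n):
            print("%s %-24s %8.1f MiB" % (stamp, name, size / 2.0 ** 20))
        time.sleep(interval)
```

Left running, a steadily growing nfs_inode_cache line in the output is the sort of evidence that points at NFS caching rather than at koji.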