A real-world debugging problem.
- Koji - the build system used inside Red Hat and by Fedora.
- DB (postgresql) for build metadata; Hub for the XMLRPC interface; Web GUI; workers (mock, rpmbuild) to do the builds.
Background
- Memory is important!
- NFS, disk, mmap, swap are all great, but not good enough.
- And then the OOM killer comes to town!
- Back in the 2.2 and 2.4 days the OOM killer was appalling.
- It’s improved enough over time in its process selection that Anthony’s thinking about running without swap and relying on the OOM killer.
The Problem
- Koji server running out of memory; it alerts because it’s ground to a halt on low memory, so let’s reboot!
- Everything works again, it’s not the DB box, so there’s no data loss. What’s the problem?
- Well, the workers need to be restarted, and nobody likes getting paged at 3 a.m. every other day.
- So fix the bug!
Fixing the Bug
- Some OOM problems previously, linked to being able to run crazy queries like “List all history for everything forever”, which gets bundled up as a big glob of query, turned into a big glob of XML for XMLRPC, and bad things happened.
- Used setrlimit() to kill processes that grow too big as a workaround.
- Loss of trust in the code.
- Could be another bug in koji - lots of debug logs, trawl through the usage logs.
- Throttle incoming requests for overuse? But the “overuse” was long-term and hadn’t been causing problems.
- Maybe it’s mod_python memory leaks in RHEL 5’s version of Python? Upgrading to RHEL 6 or mod_wsgi seems like a bad idea when there’s already a problem.
- Reduce Apache’s MaxRequestsPerChild? Didn’t help.
- setrlimit() prevents huge processes, but not many big ones. Reducing the number of clients with MaxClients helps, but impacts the amount of concurrent usage.
- Move to testing - a crash script that opens a number of sockets, does a little work, and reports on success/failure.
- Couldn’t reproduce the problem.
- Give it more memory!
- Still crashes.
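The talk doesn’t show the setrlimit() workaround itself, but in Python it might look roughly like this (the function name and the 2 GiB figure are illustrative, not from the talk):

```python
import resource

def cap_address_space(max_bytes):
    """Cap this process's virtual address space. Allocations past the
    limit fail (Python raises MemoryError) instead of dragging the
    whole box into swap until the OOM killer shows up."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))

# e.g. cap a worker at 2 GiB before it starts handling requests:
# cap_address_space(2 * 1024**3)
```

As the notes say, this only catches the single runaway process; a crowd of merely-big processes still exhausts memory without any one of them tripping the limit.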
Back to First Principles
- Use a soft toy.
- Are you running out of memory?
- Are you sure?
- How does the kernel track memory usage to make that decision?
- How about kernel memory structures - slabinfo
- Lo and behold - koji was using 1 GB, but 4 - 5 GB was showing as used.
- Oh dear. nfs_inode_cache was caching 2.5 GB of data.
- Oh dear oh dear. It’s a regression in the RHEL 5 kernel. 5.7 specifically.
- Capturing slabinfo while running, but, oh dear, the shell script doesn’t work when you’re swapping heavily. Bugger.
- You want to use Python, Perl, etc., so the monitor isn’t forking or execing and can’t get swapped out.
- Demonstrated that it’s an NFS caching problem.