A real-world debugging problem.
- Koji - the build system used inside Red Hat and by Fedora.
- DB (PostgreSQL) for build metadata; Hub for the XMLRPC interface; Web GUI; workers (mock, rpmbuild) to do the builds.
- Memory is important!
- NFS, disk, mmap, swap are all great, but not good enough.
- And then the OOM killer comes to town!
- Back in the 2.2 and 2.4 days the OOM killer was appalling.
- Its process selection has improved enough over time that Anthony’s thinking about running without swap and relying on the OOM killer.
- The Koji server runs out of memory; it alerts because it’s ground to a halt on low memory, so let’s reboot!
- Everything works again, it’s not the DB box, so there’s no data loss. What’s the problem?
- Well, the workers need to be restarted, and nobody likes getting paged at 3 a.m. every other day.
- So fix the bug!
Fixing the Bug
- Some OOM problems previously, linked to being able to do crazy queries like “List all history for everything forever”, which gets bundled up as one big glob of query results, turned into a big glob of XML for XMLRPC, and bad things happened.
- Used setrlimit() as a workaround, to kill processes that grow too big.
- Loss of trust in the code.
- Could be another bug in koji - add lots of debug logs, trawl through the usage logs.
- Throttle incoming requests for overuse, but the “overuse” was long-term and hadn’t been causing problems.
- Maybe it’s mod_python memory leaks in RHEL 5’s version of Python? Upgrading to RHEL 6 or migrating to mod_wsgi seems like a bad idea when there’s already a problem.
- Reduce Apache’s MaxRequestsPerChild? Didn’t help.
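The “big glob of XML” failure mode can be sketched with the stdlib XML-RPC marshaller (illustrative Python, not Koji’s actual hub code; `response_size` is a hypothetical helper):

```python
from xmlrpc.client import dumps

def response_size(rows):
    # Marshal a query result the way an XML-RPC server responds:
    # the entire result set becomes one XML string in memory
    # before a single byte goes out on the wire.
    return len(dumps((rows,), methodresponse=True))

# A made-up "all history forever" result: every extra row adds a few
# hundred bytes of XML, all of it held in RAM at once.
history = [{"build_id": i, "state": "complete"} for i in range(1000)]
```

An unbounded query therefore costs memory roughly linearly in the result size, on top of the database rows themselves.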
setrlimit() killing prevents any single huge process, but not many merely big ones. Reducing the number of clients with MaxClients helps, but impacts the amount of concurrent usage.
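The setrlimit() workaround looks roughly like this (a minimal sketch; the `run_capped` helper is hypothetical, not Koji code): cap a child’s address space so a runaway request dies with MemoryError instead of dragging the whole box into swap.

```python
import os
import resource

def run_capped(func, max_bytes):
    # Fork, cap the child's address space with setrlimit(), then run
    # the work in the child; exit status 1 means the cap was hit.
    pid = os.fork()
    if pid == 0:
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
        try:
            func()
            os._exit(0)
        except MemoryError:
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

This stops any one process from ballooning, but - as noted - a dozen merely big workers can still eat the box between them.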
Move to testing - a crash script that opens a number of sockets, does a little work, and reports on success/failure.
Couldn’t reproduce the problem.
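The crash script was along these lines - a minimal sketch, with `hammer()` and its host/port/payload parameters as stand-ins for whatever the real script pointed at the hub:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def hammer(host, port, payload, clients=50):
    # Open `clients` concurrent connections, do a little work on
    # each, and report (successes, failures).
    def one(_):
        try:
            with socket.create_connection((host, port), timeout=5) as s:
                s.sendall(payload)
                return bool(s.recv(1))  # any reply counts as success
        except OSError:
            return False
    with ThreadPoolExecutor(max_workers=clients) as ex:
        results = list(ex.map(one, range(clients)))
    return results.count(True), results.count(False)
```

Run it at increasing client counts and watch whether the failure count ever moves.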
Give it more memory!
Back to First Principles
- Use a soft toy.
- Are you running out of memory?
- Are you sure?
- How does the kernel track memory usage to make that decision?
- How about kernel memory structures - slabinfo
- Lo and behold - koji was using 1 GB, but 4 - 5 GB was showing as used.
- Oh dear. nfs_inode_cache was caching 2.5 GB of data.
- Oh dear oh dear. It’s a regression in the RHEL 5 kernel. 5.7 specifically.
- Capturing slabinfo while running, but, oh dear, the shell script doesn’t work when you’re swapping heavily. Bugger.
- You want to use Python, Perl, etc., so you aren’t forking or execing for every sample and the monitor stays resident instead of getting swapped out.
- Demonstrated that it’s an NFS caching problem.
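The no-fork monitor can be sketched like this (a hypothetical script, not the one from the story): it parses /proc/slabinfo’s num_objs and objsize columns and rewinds a single already-open fd per sample, so nothing forks or execs while the box is thrashing. Note that /proc/slabinfo is root-readable only on recent kernels.

```python
import time

def top_slabs(lines, n=5):
    # slabinfo(5) columns: name, active_objs, num_objs, objsize, ...
    # Approximate each cache's footprint as num_objs * objsize.
    rows = []
    for line in lines:
        if line.startswith(("slabinfo", "#")) or not line.strip():
            continue
        f = line.split()
        rows.append((f[0], int(f[2]) * int(f[3])))
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]

def watch(path="/proc/slabinfo", interval=30, count=None):
    # Keep one fd open and seek(0) to resample: no fork(), no exec(),
    # so the monitor keeps logging even when the box can no longer
    # launch a `cat`.
    samples = []
    with open(path) as src:
        taken = 0
        while count is None or taken < count:
            src.seek(0)
            samples.append((time.time(), top_slabs(src.readlines())))
            taken += 1
            if count is None or taken < count:
                time.sleep(interval)
    return samples
```

A few hours of samples is enough to watch nfs_inode_cache climb to the top of the list.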