Sarah Novotny
The talk originated when a customer rang to complain that a site was wrong, but Sarah couldn't find a problem; this provoked her into thinking about where data can and should be cached.
<!--more-->
Why Cache?
We want to move data as close to the end user as possible, while retaining ACID-style guarantees. The abandonment rate after seven seconds is huge. We need reliable speed.
Count Them.
- CPU caches: L1, L2, L3.
- Filesystem/VM caches.
- Controller caching.
- Disk caches. Great, but disks now lie about whether a write is safely on the platter or still sitting in cache.
- SSD hybrids.
- RAM on disks.
- DB caching.
- memcached or other application-level caching.
- Protocol-level caching, e.g. HTTP, DNS.
- Transparent proxies.
- Browser caching.
- A lot of this is transparent to you (a timing sketch follows this list).
- Sometimes things ignore the semantics around caching, too.
- Users often don't have the knowledge to bypass bad caching by e.g. doing a browser force-refresh.
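To make the "transparent" point concrete, here's a minimal Python sketch that shows the filesystem/page cache at work. It assumes big.dat is a placeholder for a large file that hasn't been read recently; the second read should be dramatically faster because it never touches the disk.

```python
import time

PATH = "big.dat"  # placeholder: any large file not read recently

def timed_read(path: str) -> float:
    """Read the whole file in 1 MiB chunks and return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass
    return time.perf_counter() - start

cold = timed_read(PATH)  # first read likely hits the disk
warm = timed_read(PATH)  # repeat read is served from the page cache
print(f"cold: {cold:.2f}s  warm: {warm:.2f}s")
```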
A Short Diversion
- Sarah's DBA/SA background means she cares a lot about ACID semantics around data.
- She will therefore focus on the DB.
Which Caches are Redundant?
- Some caching is redundant.
- They tackle the same functions, so one layer is either redundant or even harmful. Battery-backed controller caches are good and cache what the disks would; disk caches also cache, but are unlikely to be "safe".
- You need to ensure durability in those cases.
- For MySQL you also have the query cache and the InnoDB buffer pool.
Why do we keep doing this? Because we want things to go faster! But there's a conflict between the DB cache and the filesystem cache, too: you're double-buffering, with the same pages held in both. This isn't particularly dangerous on modern filesystems, but it's an inefficient use of memory and CPU to manage both sets of caches.
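This double-buffering is why InnoDB is often run with innodb_flush_method=O_DIRECT: the database manages its own buffer pool and bypasses the page cache. A minimal, Linux-only Python sketch of the same idea (the filename and block size here are illustrative, not from the talk):

```python
import mmap
import os

BLOCK = 4096  # O_DIRECT needs block-aligned buffers, offsets and sizes

# An anonymous mmap is page-aligned, which satisfies the alignment
# requirement that a plain bytes object would not.
buf = mmap.mmap(-1, BLOCK)
buf.write(b"x" * BLOCK)

# O_DIRECT (Linux-specific) bypasses the page cache, so these bytes
# are not double-buffered between the application and the kernel.
fd = os.open("direct.dat", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    os.write(fd, buf)
    os.fsync(fd)  # still required to flush the disk's own write cache
finally:
    os.close(fd)
```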
Which Caches are Risky?
- Badly set expiries on memcached will result in data being lost; Sarah is of the opinion that you should only use it for temporary data (see the TTL sketch after this list).
- Hypervisors often cache guest disk writes in host memory, without telling the guest what happens to them.
- Disks lie! They report writes as succeeding when they aren't on the platter for reals.
- RAID controllers lie, but at least they lie with battery backup (if you spent the money), so you’re probably OK.
- The last two are really toxic, because you can end up losing data on power failure. Sarah recommends controlled power failures to test this.
- TURN YOUR DISK CACHE OFF IF YOU VALUE YOUR DATA (a sketch of doing this follows this list).
- MySQL generally does better if it bypasses the FS cache for direct-attached storage. However, for SAN-attached disk you should leave FS caching on.
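On the memcached point above, the safe pattern is to set an explicit TTL and to be able to regenerate any value from durable storage on a miss. A sketch using the pymemcache client (my choice of client, not the talk's; the key name and TTL are illustrative):

```python
from pymemcache.client.base import Client  # assumes the pymemcache package

client = Client(("localhost", 11211))

# Always set an explicit expiry (in seconds) and treat the value as
# disposable: a miss must be answerable from durable storage.
client.set("session:42", b"transient state", expire=300)

value = client.get("session:42")  # None once evicted or expired
if value is None:
    value = b"recomputed from the database"  # fall back to the real store
```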
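And on turning the disk cache off: on Linux this is typically done with hdparm for ATA/SATA disks. A sketch, assuming root privileges and /dev/sda as a placeholder device:

```python
import subprocess

DEVICE = "/dev/sda"  # placeholder: the disk that holds your data

# hdparm -W reports the drive's volatile write-cache state and -W0
# disables it. Requires root and an ATA/SATA disk; SAS/SCSI disks
# use sdparm instead.
subprocess.run(["hdparm", "-W", DEVICE], check=True)
subprocess.run(["hdparm", "-W0", DEVICE], check=True)
```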
Benchmarking
- You need to be careful when benchmarking, but in general it’s good and you can never do enough.
- It’s not magic. You just need to do it right.
- Don’t do bad benchmarks that just e.g. exercise your cache.
- You need to touch the slowest part of the system. Force pessimistic scenarios, e.g. when your controller cache goes offline.
- You also want to test the normal production case, with real data sets and a workload that looks like production behaviour.
- You can test in prod, but you're better off using proper staging hardware that's similar to it.
- Benchmarking with real data also exercises your backups if you populate from them.
- You can also use a replica/DR server on a short-term basis. Breaking replication and then restoring it is good practice for this.
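As an example of touching the slowest part of the system rather than the cache, here's a minimal durable-write micro-benchmark in Python (the iteration count and filename are arbitrary). Note that a disk cache that lies will inflate this number, which ties back to the earlier section:

```python
import os
import time

N = 200  # arbitrary iteration count

# Each fsync forces the write down through every volatile layer to
# stable storage (assuming the disk doesn't lie), so this measures
# the slowest, most honest path instead of just the page cache.
fd = os.open("fsync-bench.dat", os.O_WRONLY | os.O_CREAT, 0o644)
start = time.perf_counter()
for _ in range(N):
    os.write(fd, b"x" * 4096)
    os.fsync(fd)
elapsed = time.perf_counter() - start
os.close(fd)
print(f"{N / elapsed:.0f} durable writes/sec")
```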
Monitoring
- Only monitor the things you actually care about.
- Monitor multiple layers in your infrastructure: test both what the end customer sees and each touch point along the way (see the sketch after this list).
- Monitoring is an evolving practice; treat it like you'd treat unit testing in software.
- There’s no boilerplate. Every system is unique.
- So many tools.
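A minimal sketch of the multi-layer idea, with placeholder endpoints (example.com for the customer-facing site, db.internal:3306 for a database touch point); a real setup would add one probe per cache layer and service:

```python
import socket
import time
import urllib.request

def http_check(url: str) -> float:
    """Time a full page fetch: roughly what the end customer sees."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=5) as resp:
        resp.read()
    return time.perf_counter() - start

def tcp_check(host: str, port: int) -> float:
    """Time a bare TCP connect to one touch point behind the site."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass
    return time.perf_counter() - start

print(f"site: {http_check('https://example.com/'):.3f}s")  # placeholder URL
print(f"db:   {tcp_check('db.internal', 3306):.3f}s")      # placeholder host
```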