Sarah Novotny
The talk originated when a customer rang to complain that a site was wrong, but Sarah couldn't find a problem; this provoked her into thinking about where data can and should be cached.
<!--more-->
Why Cache?
We want to move data as close to the end user as possible, while retaining ACID-style guarantees. The abandonment rate after seven seconds is huge. We need reliable speed.
Count Them.
- CPU caches: L1, L2, L3.
- Filesystem/VM caches.
- Controller caching.
- Disk caches. Great, but disks now lie about whether a write is safely on the platter or still sitting in cache.
- SSD hybrids.
- RAM on disks.
- DB caching.
- memcached or other application-level caching.
- Protocol-level caching, e.g. HTTP, DNS.
- Transparent proxies.
- Browser caching.
- A lot of this is transparent to you (a timing sketch follows this list).
- Sometimes things ignore the semantics around caching, too.
- Users often don't have the knowledge to bypass bad caching by e.g. doing a browser force-refresh.
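To make the "transparent" point concrete, here's a minimal Python sketch that shows the filesystem/page cache at work. It assumes big.dat is a placeholder for a large file that hasn't been read recently; the second read should be dramatically faster because it never touches the disk.

```python
import time

PATH = "big.dat"  # placeholder: any large file not read recently

def timed_read(path: str) -> float:
    """Read the whole file in 1 MiB chunks and return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass
    return time.perf_counter() - start

cold = timed_read(PATH)  # first read likely hits the disk
warm = timed_read(PATH)  # repeat read is served from the page cache
print(f"cold: {cold:.2f}s  warm: {warm:.2f}s")
```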
A Short Diversion
- Sarah's DBA/SA background means she cares a lot about ACID semantics around data.
- She will therefore focus on the DB.
Which Caches are Redundant?
- Some caching is redundant.
- They tackle the same functions, so one layer is either redundant or even harmful. Battery-backed controller caches are good and cache what the disks would; disk caches also cache, but are unlikely to be "safe".
- You need to ensure durability in those cases.
- For MySQL you also have the query cache and the InnoDB buffer pool.
Why do we keep doing this? Because we want things to go faster! But there's a conflict between the DB cache and the filesystem cache, too: you're double-buffering, with the same pages held in both. This isn't particularly dangerous on modern filesystems, but it's an inefficient use of memory and CPU to manage both sets of caches.
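This double-buffering is why InnoDB is often run with innodb_flush_method=O_DIRECT: the database manages its own buffer pool and bypasses the page cache. A minimal, Linux-only Python sketch of the same idea (the filename and block size here are illustrative, not from the talk):

```python
import mmap
import os

BLOCK = 4096  # O_DIRECT needs block-aligned buffers, offsets and sizes

# An anonymous mmap is page-aligned, which satisfies the alignment
# requirement that a plain bytes object would not.
buf = mmap.mmap(-1, BLOCK)
buf.write(b"x" * BLOCK)

# O_DIRECT (Linux-specific) bypasses the page cache, so these bytes
# are not double-buffered between the application and the kernel.
fd = os.open("direct.dat", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    os.write(fd, buf)
    os.fsync(fd)  # still required to flush the disk's own write cache
finally:
    os.close(fd)
```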
Which Caches are Risky?
- Badly set expiries on memcached will result in data being lost; Sarah is of the opinion that you should only use it for temporary data (see the TTL sketch after this list).
- Hypervisors often cache guest disk writes in host memory, without telling the guest what happens to them.
- Disks lie! They report writes as succeeding when they aren't on the platter for reals.
- RAID controllers lie, but at least they lie with battery backup (if you spent the money), so you’re probably OK.
- The last two are really toxic, because you can end up losing data on power failure. Sarah recommends controlled power failures to test this.
- TURN YOUR DISK CACHE OFF IF YOU VALUE YOUR DATA (a sketch of doing this follows this list).
- MySQL generally does better if it bypasses the FS cache for direct-attached storage. However, for SAN-attached disk you should leave FS caching on.
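On the memcached point above, the safe pattern is to set an explicit TTL and to be able to regenerate any value from durable storage on a miss. A sketch using the pymemcache client (my choice of client, not the talk's; the key name and TTL are illustrative):

```python
from pymemcache.client.base import Client  # assumes the pymemcache package

client = Client(("localhost", 11211))

# Always set an explicit expiry (in seconds) and treat the value as
# disposable: a miss must be answerable from durable storage.
client.set("session:42", b"transient state", expire=300)

value = client.get("session:42")  # None once evicted or expired
if value is None:
    value = b"recomputed from the database"  # fall back to the real store
```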
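And on turning the disk cache off: on Linux this is typically done with hdparm for ATA/SATA disks. A sketch, assuming root privileges and /dev/sda as a placeholder device:

```python
import subprocess

DEVICE = "/dev/sda"  # placeholder: the disk that holds your data

# hdparm -W reports the drive's volatile write-cache state and -W0
# disables it. Requires root and an ATA/SATA disk; SAS/SCSI disks
# use sdparm instead.
subprocess.run(["hdparm", "-W", DEVICE], check=True)
subprocess.run(["hdparm", "-W0", DEVICE], check=True)
```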
Benchmarking
- You need to be careful when benchmarking, but in general it’s good and you can never do enough.
- It’s not magic. You just need to do it right.
- Don’t do bad benchmarks that just e.g. exercise your cache.
- You need to touch the slowest part of the system. Force pessimistic scenarios, e.g. when your controller cache goes offline.
- You also want to test the normal production case, with real data sets and a workload that looks like production behaviour.
- You can test in prod, but you're better off using proper staging hardware that's similar to it.
- Benchmarking with real data also exercises your backups if you populate from them.
- You can also use a replica/DR server on a short-term basis. Breaking replication and then restoring it is good practice for this.
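As an example of touching the slowest part of the system rather than the cache, here's a minimal durable-write micro-benchmark in Python (the iteration count and filename are arbitrary). Note that a disk cache that lies will inflate this number, which ties back to the earlier section:

```python
import os
import time

N = 200  # arbitrary iteration count

# Each fsync forces the write down through every volatile layer to
# stable storage (assuming the disk doesn't lie), so this measures
# the slowest, most honest path instead of just the page cache.
fd = os.open("fsync-bench.dat", os.O_WRONLY | os.O_CREAT, 0o644)
start = time.perf_counter()
for _ in range(N):
    os.write(fd, b"x" * 4096)
    os.fsync(fd)
elapsed = time.perf_counter() - start
os.close(fd)
print(f"{N / elapsed:.0f} durable writes/sec")
```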
Monitoring
- Only monitor the things you actually care about.
- Monitor multiple layers in your infrastructure: test both what the end customer sees and each touch point along the way (see the sketch after this list).
- Monitoring is an evolving practice; treat it like you'd treat unit testing in software.
- There’s no boilerplate. Every system is unique.
- So many tools.
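A minimal sketch of the multi-layer idea, with placeholder endpoints (example.com for the customer-facing site, db.internal:3306 for a database touch point); a real setup would add one probe per cache layer and service:

```python
import socket
import time
import urllib.request

def http_check(url: str) -> float:
    """Time a full page fetch: roughly what the end customer sees."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=5) as resp:
        resp.read()
    return time.perf_counter() - start

def tcp_check(host: str, port: int) -> float:
    """Time a bare TCP connect to one touch point behind the site."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass
    return time.perf_counter() - start

print(f"site: {http_check('https://example.com/'):.3f}s")  # placeholder URL
print(f"db:   {tcp_check('db.internal', 3306):.3f}s")      # placeholder host
```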