Where is Your Data Cached and Where Should It Be Cached?

Sarah Novotny

The talk originated with a customer ringing to complain that a site was wrong; Sarah couldn’t find a problem, and that got her thinking about where data can be cached and where it should be.

<!--more-->

Why Cache?

We want to move data as close to the end user as possible, while retaining ACID-style guarantees. The abandonment rate after 7 seconds of waiting is huge. We need reliable speed.

Count Them.

  • CPU Caches - L1, L2, L3.

  • Filesystem/VM caches.

  • Controller caching.

  • Disk caches. Great, but disks now lie about whether a write has actually reached the platter.

  • SSD hybrids.

  • RAM on disks.

  • DB caching.

  • memcached or other application level caching.

  • Protocol level caching, e.g. HTTP, DNS (there’s a header-inspection sketch after this list).

  • Transparent proxies.

  • Browser caching.

  • A lot of this is transparent to you.

  • Sometimes stuff ignores semantics around caching, too.

  • Users often don’t have the knowledge to bypass bad caching by e.g. doing a browser force-refresh.
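
To make the protocol-level layer concrete, here’s a minimal sketch (not from the talk) of inspecting the HTTP caching headers on a response. It assumes the Python requests library, a placeholder URL, and that your CDN or proxy exposes something like an X-Cache header, which varies by provider.

```python
import requests

url = "https://example.com/"   # placeholder URL
resp = requests.get(url)

# Cache-Control tells downstream caches (proxies, browsers) what they may keep and for how long.
print("Cache-Control:", resp.headers.get("Cache-Control"))

# A non-empty Age header means an intermediate cache served this response,
# and says how many seconds old the cached copy is.
print("Age:", resp.headers.get("Age"))

# Many CDNs and transparent proxies add their own hit/miss hint;
# the header name (X-Cache here) varies by provider.
print("X-Cache:", resp.headers.get("X-Cache"))
```

If Age is non-zero or X-Cache reports a HIT, some cache between you and the origin served the response, which is exactly the kind of layer that’s easy to forget when a customer says the site is “wrong”.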

A Short Diversion

  • DBA/SA background means Sarah cares a lot about ACID semantics around data.
  • Will therefore focus on the DB.

Which Caches are Redundant?

  • Some caching is redundant.
  • They tackle the same function, but end up redundant or even harmful. Battery-backed controller caches are good and cache what goes to disk; disk caches do the same job, but are unlikely to be “safe”.
  • You need to ensure durability in those cases.
  • For MySQL you also have the query cache and the InnoDB buffer pool.

Why do we keep doing this? Because we want things to go faster! But there’s a conflict between the DB cache and the filesystem cache, too: you’re double-buffering. That isn’t particularly dangerous on modern filesystems, but it’s an inefficient use of memory and CPU to manage both sets of caches.
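
As a rough illustration (not from the talk), the sketch below checks how much work the InnoDB buffer pool is actually doing before you start tuning around double-buffering. It assumes the PyMySQL driver and placeholder connection details; the status variables are standard MySQL ones.

```python
import pymysql

# Placeholder connection details; a read-only monitoring account is enough.
conn = pymysql.connect(host="localhost", user="monitor",
                       password="secret", database="mysql")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'")
        status = dict(cur.fetchall())
finally:
    conn.close()

# read_requests = logical read requests; reads = requests that missed the pool and went to disk.
logical = int(status["Innodb_buffer_pool_read_requests"])
physical = int(status["Innodb_buffer_pool_reads"])
hit_ratio = 1 - physical / logical if logical else 0.0
print(f"InnoDB buffer pool hit ratio: {hit_ratio:.4%}")
```

A high hit ratio means the buffer pool, not the filesystem cache, is doing the useful work, which is the case where giving memory back from the FS cache makes sense.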

Which Caches are Risky?

  • Expiries that aren’t set sensibly on memcached will result in data being lost; Sarah is of the opinion you should only use it for temporary data (there’s a sketch after this list).
  • Hypervisors often cache disk in memory, without advising the guest what happens here.
  • Disks lie! They report writes as succeeding when the data isn’t actually on the platter.
  • RAID controllers lie, but at least they lie with battery backup (if you spent the money), so you’re probably OK.
  • The last two are really toxic, because you can end up losing data on power failure. Sarah recommends controlled power failures to test this.
  • TURN YOUR DISK CACHE OFF IF YOU VALUE YOUR DATA.
  • MySQL generally does better if it bypasses the FS cache for direct-attached storage. However, for SAN-attached disk you should leave FS caching on.
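
Here’s a minimal sketch of the “memcached is for temporary data only” point: every write carries an explicit expiry and every read handles a miss by going back to the database, which stays the source of truth. It assumes the pymemcache client; the key scheme and the load_profile_from_database helper are made up for illustration.

```python
import json
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))   # placeholder memcached host

def get_user_profile(user_id):
    """Read-through cache for temporary copies only; the DB stays the source of truth."""
    key = f"profile:{user_id}"          # invented key scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    profile = load_profile_from_database(user_id)    # hypothetical loader, not a real library call
    cache.set(key, json.dumps(profile), expire=300)  # explicit 5-minute expiry on every write
    return profile
```

If memcached restarts or evicts the key, nothing is lost except a little speed, which is the only failure mode a cache should ever have.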

Benchmarking

  • You need to be careful when benchmarking, but in general it’s good and you can never do enough.
  • It’s not magic. You just need to do it right.
  • Don’t do bad benchmarks that just e.g. exercise your cache.
  • You need to touch the slowest part of the system. Force pessimistic scenarios, e.g. when your controller cache goes offline.
  • You also want to test the normal production case, with real data sets and a workload that looks like production behaviour (see the harness sketch after this list).
  • You can test in prod, but it’s better to use proper staging hardware that’s similar.
  • Benchmarking with real data also exercises your backups if you populate from them.
  • You can also use a replica/DR server on a short-term basis; breaking replication and then restoring it is good practice for this.
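
As one way to act on “don’t just exercise your cache”, here’s a bare-bones harness (a sketch, not the talk’s tooling) that runs a workload many times and reports latency percentiles, so a single warm-cache run can’t hide the slow tail. The workload passed in is a stand-in; in reality it would issue real queries against production-like data with randomised keys.

```python
import random
import statistics
import time

def run_benchmark(workload, iterations=1000):
    """Time a callable repeatedly and report p50/p95/p99 latency."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        latencies.append(time.perf_counter() - start)

    # statistics.quantiles with n=100 gives percentile cut points.
    cuts = statistics.quantiles(latencies, n=100)
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    print(f"p50={p50*1000:.2f}ms  p95={p95*1000:.2f}ms  p99={p99*1000:.2f}ms")

# Stand-in workload: random sleeps. Replace with real queries over real data,
# picking keys at random so you don't just hammer whatever is already cached.
run_benchmark(lambda: time.sleep(random.uniform(0.001, 0.01)))
```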

Monitoring

  • Only monitor the stuff you want.
  • Test multiple layers in your infrastructure: both what the end customer sees and each touch point along the way (see the probe sketch after this list).
  • Monitoring is an evolving practice; treat it like you’d treat unit testing in software.
  • There’s no boilerplate. Every system is unique.
  • So many tools.
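
A toy example of probing more than one touch point: check the public, customer-facing URL and the origin behind it with the same health probe, and record latency for each. The hosts, paths and timeout are invented placeholders; the point is only that every layer gets its own check.

```python
import time
import requests

# Invented placeholder endpoints for each layer you want to watch.
CHECKS = {
    "edge (what the customer sees)": "https://www.example.com/health",
    "origin web tier":               "https://origin.example.com/health",
}

def probe(name, url, timeout=5):
    """Hit one endpoint, report status and latency."""
    start = time.perf_counter()
    try:
        ok = requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        ok = False
    elapsed = time.perf_counter() - start
    print(f"{name}: {'OK' if ok else 'FAIL'} in {elapsed*1000:.0f}ms")
    return ok

# Run every probe (no short-circuiting) so each layer always gets checked.
results = [probe(name, url) for name, url in CHECKS.items()]
print("all layers healthy" if all(results) else "at least one layer failing")
```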