An amble up to the Te Papa function centre for the CloudNative Summit Wellington edition. Apparently you can win an e-scooter if you collect all the sponsor booth stickers; already having an e-scooter of my own, I demurred.
Set Your Sites on Tracing
Adrian Cole @adrianfcole
Adrian works on Zipkin for Pivotal; Zipkin is a distributed tracing tool; distributed tracing tracks requests across the various touch points in your system: web server, DB, cache, app servers, etc. Requests are correlated by a unique trace ID and produce a causal diagram. The pitch is that this is easier to understand than traditional log correlation: it reduces time spent in triage, visualises latency, helps you understand complex applications, and shows the reality of the architecture, as opposed to the theory. SLAs/SLOs are almost impossible to reason about for complex architectures.
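To make the trace-ID correlation idea concrete, here's a minimal sketch (mine, not from the talk) of a service propagating Zipkin's standard B3 headers to a downstream call. The header names are the real B3 propagation headers; the downstream URL and service roles are invented.

```python
# Minimal sketch of B3 trace-context propagation between services.
# The X-B3-* header names are Zipkin's standard propagation headers;
# the downstream URL and service roles are invented for illustration.
import uuid

import requests


def new_span_id() -> str:
    """64-bit span ID, hex-encoded, as B3 expects."""
    return uuid.uuid4().hex[:16]


def call_downstream(incoming_headers: dict) -> None:
    # Reuse the caller's trace ID so the whole request shares one trace;
    # start a new one if we're the entry point.
    trace_id = incoming_headers.get("X-B3-TraceId", uuid.uuid4().hex)
    current_span = incoming_headers.get("X-B3-SpanId", new_span_id())

    # Each outgoing call gets its own span, parented on the current span.
    outgoing = {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": new_span_id(),
        "X-B3-ParentSpanId": current_span,
        "X-B3-Sampled": incoming_headers.get("X-B3-Sampled", "1"),
    }
    requests.get("http://inventory.internal/stock/42", headers=outgoing, timeout=2)
```

Every hop carries the same trace ID, which is what lets the Zipkin UI stitch the hops back into one causal diagram.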
Zipkin is an open source effort, and lives on GitHub.
“When you live in IoT land, it’s not just users complaining, you may have an angry dishwasher complaining.” There are other tools out there, but Zipkin is cheap - not just in the sense of up-front licensing, but in terms of the people and infrastructure required to support it; Adrian says that it’s been very cheap for their users. That said, he notes that instrumentation is a less magical story for Zipkin than for proprietary tools: tracers and instrumentation probably require a little more work.
A typical Zipkin site
A Zipkin site is a production deployment of distributed tracing. The Zipkin community maintains a collection of sites who have been prepared to contribute real detail about the site; not “just a logo”, but how it’s implemented, how many people support it, how it’s being used, what it costs, and other real-world details.
Ascend Money have been using it to measure whether their decomposition of a monolith into microservices is actually working - is it hitting its targets, are people actually using the designs? Conversely, Hotels.com use it for the classic case of identifying performance bottlenecks and responding to them. For Netflix the story is about directing developer effort into the areas that actually matter.
While most sites use the Zipkin server as the collector and UI, it’s possible to use Zipkin-format traces with other tools; for example Hotels.com use Haystack for long-term storage. Some sites keep all traces; others keep only samples over the longer term. The Zipkin project also collaborates with other projects to emit Zipkin-format data - they don’t insist you instrument with only the Zipkin libraries. As Netflix have moved to Zipkin for tracing, the easiest way to bring their old tracing tools into a unified world was to proxy the old tracing messages into Zipkin format, rather than retrofitting everything with Zipkin libraries.
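As a rough illustration of that proxying idea (mine, not Netflix’s): take a record from a hypothetical legacy tracer, map it onto a Zipkin v2 span, and post it to the collector’s HTTP endpoint. The legacy field names are invented; the span fields and the /api/v2/spans path are Zipkin’s standard v2 API.

```python
# Rough sketch of proxying a legacy tracing record into Zipkin v2 format.
# The "legacy" field names are hypothetical; the span fields and the
# /api/v2/spans collector endpoint are the standard Zipkin v2 HTTP API.
import requests

ZIPKIN_URL = "http://zipkin.internal:9411/api/v2/spans"  # assumed deployment


def to_zipkin_v2(legacy: dict) -> dict:
    return {
        "traceId": legacy["correlation_id"],                 # hypothetical legacy key
        "id": legacy["call_id"],
        "parentId": legacy.get("caller_id"),
        "name": legacy["operation"],
        "kind": "SERVER",
        "timestamp": int(legacy["start_epoch_ms"] * 1000),   # Zipkin wants microseconds
        "duration": int(legacy["elapsed_ms"] * 1000),
        "localEndpoint": {"serviceName": legacy["service"]},
        "tags": {"converted-from": "legacy-tracer"},
    }


def forward(legacy_records: list) -> None:
    spans = [to_zipkin_v2(r) for r in legacy_records]
    resp = requests.post(ZIPKIN_URL, json=spans, timeout=5)
    resp.raise_for_status()
```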
Common points of difference between sites are the data collection policies: do you scrub secrets and customer identifiers? Do you keep everything or sample? Do you do a mix - full retention for a new product rollout, stepping down to sampling over time?
Production ready Kubernetes platform service (behind the scenes)
Bruno Largo and Feilong Wang
Bruno is the Managing Director at Catalyst Cloud, while Feilong is the Head of Research.
Catalyst Cloud have 3 sites in New Zealand, and have also been working with companies who want help building private clouds; Catalyst are contributors to OpenStack Magnum. Magnum deploys container orchestration stacks - k8s, Swarm, and Mesos - but in practice most Catalyst customers are running k8s. This attempts to abstract away the complexity of k8s deploys behind an API (with Ansible and Terraform interaction); it needs to provide rolling upgrades and patching, auth[nz], and network policies, as well as day 2 operations like scaling and healing the environment.
Catalyst use Keystone for their auth layer, Calico to provide BGP and IP routing to ensure routing between nodes and pods works in a way that keeps k8s happy. For provisioning other resources such as storage k8s interacts with the OpenStack Cloud Provider, which acts as a broker into (e.g.) Neutron and Cinder for resource provisioning to support nodes and pods.
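As a simplified illustration of that broker pattern: a workload just asks Kubernetes for a volume, and the cloud provider turns the request into a Cinder call behind the scenes. A sketch using the official Kubernetes Python client; the storage class name is an assumption about how such a cluster might be configured.

```python
# Sketch: requesting block storage on a Magnum-style cluster. Kubernetes
# resolves the claim through the OpenStack cloud provider, which brokers the
# request to Cinder; the "cinder-csi" storage class name is an assumption.
from kubernetes import client, config


def request_volume(namespace: str = "default") -> None:
    config.load_kube_config()  # or load_incluster_config() inside a pod
    core = client.CoreV1Api()

    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="app-data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            storage_class_name="cinder-csi",  # hypothetical class backed by Cinder
            resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
        ),
    )
    core.create_namespaced_persistent_volume_claim(namespace=namespace, body=pvc)
```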
Building Serverless Applications at Scale
Vanessa Thornton
When Vanessa started at Xero their APIs were delivered to customers and partners via an EC2-deployed API gateway monolith. Under load, this system would buckle; EC2 instances could take up to 15 minutes to autoscale. One of the reasons for this was that applications were polling the API to check whether an update was ready. The team’s challenge was to produce a webhook API that would let applications move away from hammering the EC2 gateways for updates. This was built on serverless and, over a couple of years, has scaled from 7 million webhook requests per month to 20 million. It uses only a few AWS Lambdas and some streams backing them.
“Serverless is one of the hottest design patterns in the cloud right now.” It’s the surrounding services that make serverless interesting: in its infancy, serverless was very hard work; today it’s much, much easier. Today Vanessa is focusing on scalability: serverless doesn’t give you scalability for free if you do it wrong. You need to focus on how your system handles concurrency, rather than on the aggregate number of connections. Vanessa uses the term “concurrency zone”: there’s no point scaling your requests horizontally if you simply have bottlenecks further back in your stack.
For example, a common pattern is to have an Amazon API Gateway, a Lambda function, and an RDS DB backing the whole lot. The problem here is that while the gateway and Lambdas scale easily, the database doesn’t. So how do you solve that? Well, we can add a stream or a queue: messages drop onto the queue/stream, and the database pulls the data in at whatever speed it can cope with.
Vanessa looks at the standard pattern many people use for reading from a Kinesis stream: each Lambda polls each stream shard every second, which can cause bottlenecks. Vanessa prefers to map Lambdas to shards: a one-to-one relationship between Lambdas and shards to avoid contention as many Lambdas hit the same shards. Do remember, though, that batch sizes are limited to 10,000 messages of up to 6 MB in aggregate size; you are also limited by your Lambda invocation time.
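For flavour, a minimal sketch (not Xero’s code) of the consumer side: a Lambda handler receiving a batch of Kinesis records. The event shape (Records → kinesis → data, base64-encoded) is AWS’s standard Kinesis event source format; what the handler does with each record is invented.

```python
# Sketch of a Lambda consuming a batch of Kinesis records. The event shape
# is the standard Kinesis event source mapping format; the per-record
# processing is a placeholder.
import base64
import json


def handler(event, context):
    processed = 0
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        webhook = json.loads(payload)
        deliver(webhook)  # hypothetical downstream delivery
        processed += 1
    # Returning normally checkpoints the whole batch; raising retries it.
    return {"processed": processed}


def deliver(webhook: dict) -> None:
    """Placeholder for the real delivery logic."""
    print(webhook.get("event_type"))
```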
If you’re using SQS, Vanessa notes that batching is important here: the maximum throughput for a Lambda working with a queue is 300/second. If you batch, the per-Lambda limit drops to 10 a second, but the aggregate rises to thousands per second from the queue; by focusing on aggregate throughput you end up with a better experience overall.
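To make the batching point concrete, a hedged sketch of wiring a queue to a Lambda with a batch size via boto3; the ARN, queue, and function names are placeholders, while create_event_source_mapping and its BatchSize parameter are the standard Lambda API.

```python
# Sketch: configure an SQS -> Lambda event source mapping with batching.
# The ARN and function name are placeholders; create_event_source_mapping
# and BatchSize are standard boto3/Lambda API.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:ap-southeast-2:123456789012:webhook-queue",
    FunctionName="webhook-dispatcher",
    BatchSize=10,  # each invocation receives up to 10 messages
)


# The matching handler iterates the batch rather than handling one message
# per invocation, which is where the aggregate throughput win comes from.
def handler(event, context):
    for record in event["Records"]:
        process(record["body"])  # hypothetical per-message work


def process(body: str) -> None:
    print(body)
```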
You should also look at the reserved concurrency configuration: this allows you to limit the scaling of Lambdas, both in absolute terms and relative to one another. Without these controls, low-priority Lambdas could starve high-priority Lambdas; instead you can limit your overall workload (and spend!) to what your most critical components can cope with, while ensuring your most business-critical functions are always prioritised.
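A small sketch of that knob via boto3; the function names and limits are invented, and put_function_concurrency is the standard Lambda API for reserved concurrency.

```python
# Sketch: cap a low-priority function so it can't starve the critical path.
# Function names and limits are invented; put_function_concurrency is the
# standard Lambda API call for reserved concurrency.
import boto3

lambda_client = boto3.client("lambda")

# The business-critical webhook dispatcher always has headroom...
lambda_client.put_function_concurrency(
    FunctionName="webhook-dispatcher",
    ReservedConcurrentExecutions=400,
)

# ...while the nightly reporting function is capped well below it.
lambda_client.put_function_concurrency(
    FunctionName="usage-report-builder",
    ReservedConcurrentExecutions=20,
)
```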
Performance: performance still isn’t magic. Opening the floodgates will ultimately hit limits. Waiting until you hit problems, then reacting, is a terrible experience for your customers. Vanessa highly recommends you test. And test. And test again. She recommends a number of tools:
- Serverless Artillery. A tool devoted to testing serverless performance.
- JMeter. A lot of Xero folks love JMeter.
- Roll your own. Sounds weird, but don’t be afraid to build your own in this space; it’s still immature, so you may be best served by tooling tailored to your needs.
Monitoring? AWS tools. Vanessa notes she’s never been paged for this system, unlike the old EC2 starting point.
Practical Istio
Zack Butcher @zackbutcher
So you have services that want to communicate:
- You drop a proxy - Envoy in this case - alongside the applications. It lives in the same trust zone - VM or pod - as the application it lives with.
- Out of the box, Envoy intercepts via iptables. It can also use BPF or the application can be explicitly configured to talk to it.
- Envoy will discover the service as part of the intercept, and map that to what it knows about the topology.
- This frees us from centralised proxies/load balancers.
- This is a very reliable, bottleneck-free architecture.
- The receiver will check policy with Mixer to see if the connection should be allowed; from there traffic will be accepted or denied.
- Then you deploy Galley to configure the rest of your control plane; Galley is aware of the network semantics, so it can roll configs without breaking the whole control plane in one go.
- Pilot is then used to configure the sidecars - programming Envoy - by pushing the overall topology of e.g. k8s into the Envoy on each pod/VM.
- Mixer provides telemetry and enforces policy. It does this out-of-band.
- Citadel provides identity to machines. IP/port pairs stopped being adequate a long time ago, and k8s completely breaks them with fully-dynamic infrastructure. Citadel hands out certificates with the name of the workload, and uses mutual TLS with a SPIFFE extension. SPIFFE includes the workload spec, providing an additional level of validation.
- Traffic management:
- Fine-grained control - per request - load balancing and performance controls.
- East-West traffic without centralised load balancers.
- High resilience - circuit breakers, fewer SPOFs.
- Visibility:
- Real time maps of the real world state.
- Application/protocol-aware metrics.
- Security:
- Real identities for deployed infrastructure.
- Policy can be detached from physical implementation.
Most important of all, this is consistent across the fleet. You aren’t reliant on, say, every developer remembering to implement it. You can guarantee security and observability are what you want. You’ve made it easy for people to do the right thing.
GitOps Driven Deployments on OpenShift
Everett Toews and Heather Cumberworth-Lane
GitOps - source-control centric operations. “Apply a developer experience to ops.”
Starting at the end:
- A PR arrives.
- A webhook kicks off a build process.
- The application is deployed.
It’s a simple outcome, but how do we get there? (A minimal sketch of the webhook step follows the list below.)
- Git gives us access control, and audit. In git they have environment and service repos.
- env: system-focused repositories. One repo per environment. These contain the pipeline definitions for OpenShift.
- service: the application-focused repositories. These contain conventional C# repos.
- Bitbucket is used to manage the repos.
- A version to deploy is selected here.
- Pipelines.
- An interface into the OpenShift environment.
- The Pipelines handle the deploy config, and any policies: linting, security scans, and so on.
- The pipeline does a full build from source only for the first step.
- They also handle the push.
- Environments.
- Build/dev environments, where build happens.
- Higher environments: these are deploy-only, and include test, prod, stress, etc.
- You can guarantee every deploy here is the same.
- It’s a different cluster with highly restricted access.
- This reduces the rates of errors. “When I arrived deployments were zip files.”
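As a very rough sketch of the webhook step at the start of this flow (not their actual implementation): a tiny service that receives a Bitbucket-style pull-request event and kicks off the matching OpenShift pipeline. The payload keys, the repo-to-pipeline mapping, and the build-trigger invocation are all assumptions.

```python
# Rough sketch of the "PR merged -> webhook -> pipeline" step. The payload
# keys, repo-to-pipeline mapping, and the oc invocation are assumptions
# about how such a setup might look, not their actual implementation.
import subprocess

from flask import Flask, request

app = Flask(__name__)

# Hypothetical mapping from service repo to the env repo's pipeline.
PIPELINES = {"billing-service": "dev-env-billing-pipeline"}


@app.route("/webhook", methods=["POST"])
def on_pull_request():
    event = request.get_json()
    repo = event.get("repository", {}).get("name")       # assumed payload shape
    commit = event.get("merge_commit", {}).get("hash")    # assumed payload shape

    pipeline = PIPELINES.get(repo)
    if not pipeline or not commit:
        return ("ignored", 200)

    # Start the OpenShift pipeline, pinning the exact commit so the full
    # build from source happens exactly once.
    subprocess.run(
        ["oc", "start-build", pipeline, f"--env=GIT_COMMIT={commit}"],
        check=True,
    )
    return ("started", 202)
```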
Related work:
- Jenkins X.
- Flux, which was recently contributed to the CNCF by WeaveWorks.
Cross-Service Consistency in a Microservice World
Andy Marks
“Microservices are the best of architectures, and the worst of architectures.” But, Andy notes, you can say that about any architecture - everything comes with trade-offs, and is a Faustian pact. Andy wants to talk about the various trade-offs you’re making with the microservices choice. Microservices are great for delivery team autonomy; teams can make the best choices for their services: refactor their own data, scale as they need, and pick the best tools. Ultimately, this is about delivering to customers at speed.
The fly in the ointment is consistency: in the monolith world, consistency was pretty easy, and a function of the code base - people had to work together, so implementations had to be consistent. Think about logging format and detail; security - which libraries are we using, can every team deploy fixes in a timely fashion; resilience - do we all have the same maturity around retries, bulkheads, circuit breakers, and so on?
So what is the price of autonomy, and how do we mitigate the cost? Architects often lack influence and will howl into the void, but production incidents focus the mind - and the organisation - wonderfully. To that end, Andy decided to survey fellow ThoughtWorks practitioners and see what had worked for them in terms of making microservices work in the wild.
Some approaches to maintaining consistency and quality:
- Coding Standards
- Written down statements about how we write code.
- A loose social contract.
- Moderately effective.
- Service Templates
- Bootstraps, starter services, whatever.
- Start with a hello world example of how you do services.
- These have the config, the opinions, you name it.
- e.g. SpringBoot Actuator templates.
- Client libraries
- Provide a client library. No-one wants to deal with an HTTP client and marshalling and unmarshalling and all that. They want an idiomatic client library (a minimal sketch follows this list).
- “Native semantics for the language of my choice” are important. Support all the languages in the target communities.
- Provides guidance to the consumer.
- Platform Library
- Provide authors with a set of common, consistent “bits” for the service.
- e.g. logging library.
- Service Mesh
- Every service has a personal assistant!
- Makes life much easier for service developers, and makes managing the runtime easier.
- Container orchestrators
- Makes life easier and saner.
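Here is the client-library sketch referred to above: a toy example of wrapping the HTTP and (un)marshalling so consumers get idiomatic calls and typed results. The service name, endpoint, and fields are invented.

```python
# Toy sketch of an idiomatic client library: the consumer never sees HTTP,
# headers, or JSON marshalling. The service, endpoint, and fields are invented.
from dataclasses import dataclass

import requests


@dataclass
class Invoice:
    id: str
    total_cents: int
    status: str


class BillingClient:
    def __init__(self, base_url: str, token: str, timeout: float = 2.0):
        self._base_url = base_url.rstrip("/")
        self._session = requests.Session()
        self._session.headers["Authorization"] = f"Bearer {token}"
        self._timeout = timeout

    def get_invoice(self, invoice_id: str) -> Invoice:
        resp = self._session.get(
            f"{self._base_url}/invoices/{invoice_id}", timeout=self._timeout
        )
        resp.raise_for_status()
        data = resp.json()
        return Invoice(id=data["id"], total_cents=data["total_cents"], status=data["status"])


# Consumers write idiomatic code instead of hand-rolling requests:
# invoice = BillingClient("https://billing.internal", token).get_invoice("inv-42")
```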
Research
Andy has done internal research at ThoughtWorks, with a survey that tried to understand which approaches the ~3,000 engineers at ThoughtWorks find most effective.
- On average each team is responsible for 4 services. The median was 2.1; the outlier was more than 20, and that large number was a result of being too fine-grained. The team said that if they were starting over, they’d reduce the number of services.
- Coding standards are overwhelmingly popular (21/26).
- Communication of these is hard.
- Tracking reality in a fast-changing world is hard.
- And keeping them from becoming overly prescriptive is hard.
- So they’re a great tool, but they require a lot of work.
- Service templates are very popular (21/30).
- One team did stop using them.
- SpringBoot is very popular as a starting point.
- They’re a good base, but quickly mutate once they’re copy-pasted. Don’t expect them to be any more than a starting point.
- Client library: 12/24.
- There’s a lot of overhead in managing the propagation of change.
- And who owns and guides them? The consumer community? The service owner? Both?
- Platform library: 16/22.
- SpringBoot is popular, but not overwhelming.
- Service mesh: not so popular.
- Istio more than 75% of the time.
- Container orchestration: everyone loves Kubernetes!
- k8s is 2/3s.
- 25% is ECS.
- Everything else is last.
Observability for Everyone
Inny So and Andrew Jones
Is this just another buzzword? Well, it was a term coined back in 1961 by Kalman: “A system is observable if the behaviour of the entire system can be understood entirely from its inputs and outputs.”
Story time! Andrew starts with the tale of an app which speaks to a host. We’re a small startup so no-one cares about it, but as people arrive, we might have some ops people hook things up to Nagios to understand what the hosts are doing. “We’re shitting… uh shoving metrics into Nagios.” I think that was unintentional. “Then we have too many machines to follow the logs” so we stand up an ELK stack to understand them. At which point we discover how terrible logs are for incident resolution, because logs which are structured for humans to read are hard for machines to parse and search and correlate. And /then/ we decide to poke our app metrics into statsd.
And now we’ve got a dev/ops split. Everyone is focused on their thing and not customer happiness - here Andrew name-checks Charity Majors’ “nines don’t matter if the customer isn’t happy”. So now we wire in New Relic to understand the customer experience. And now we’re trying to correlate across four different tools. And we’re bored with that. So let’s put all the ELK and statsd data into Splunk instead! Fewer tools! One tool to rule them all!
But one tool to rule them all is pretty much an anti-pattern. Things get even worse as we add microservices. Splunk is too expensive, let’s switch it off in dev!¹ So maybe we replace it with CloudWatch and hook it up to PagerDuty. None of this is a great dev or ops experience. It’s hideous. There’s lots of integration boilerplate and still lots of places to look. CloudFormation to PagerDuty alone is a nightmare! Also CloudFormation is hideous to use, so let’s add SumoLogic. And Prometheus for our k8s.
And dashboards. Dashboard culture is poison, says Andrew. Dashboards are worthless for triaging new problems. They only tell you things you already know, like the system is broken.
At this point, Andrew is foaming at the mouth, so Inny takes over. The world, she notes, has changed: our apps have decomposed, our infra is dynamic. Why is our monitoring still stuck 20 years in the past? This isn’t good enough. We’re walking into a crime scene with a magnifying glass.
Let’s make on-call fun again!
Let’s take some cues from modern application architectures. In the modern day, we have event-driven architectures. If our understanding of system behaviour were to follow the events that are generated by customer behaviour, wouldn’t that help us understand the things we need to know to troubleshoot? Events (unlike logs) are structured. Events, unlike metrics, can carry rich information. So if we think in events, we can tie the context (what the customer did) to the information about the event (which we traditionally tie to logs and metrics).
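A minimal sketch of what thinking in events might look like at the code level: emit one structured event per unit of customer-visible work, carrying the business context alongside the timing. The field names and the stdout “transport” are my assumptions; the point is the shape, not the plumbing.

```python
# Minimal sketch of emitting a structured event per unit of work. Field names
# and the stdout transport are assumptions; in practice this would be shipped
# to whatever event platform you run.
import json
import sys
import time
import uuid


def emit_event(name: str, **fields) -> None:
    event = {
        "event": name,
        "timestamp": time.time(),
        "event_id": str(uuid.uuid4()),
        **fields,
    }
    sys.stdout.write(json.dumps(event) + "\n")


def checkout(customer_id: str, cart_id: str) -> None:
    started = time.time()
    # ... the actual work would happen here ...
    emit_event(
        "checkout.completed",
        customer_id=customer_id,  # the customer context we want to keep
        cart_id=cart_id,
        duration_ms=(time.time() - started) * 1000,
        items=3,                  # rich, queryable detail, not a pre-aggregated metric
    )
```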
So what do we need?
We need to make it easy for ops and devs to funnel the information into the event platform. We need to avoid pre-aggregating or otherwise baking in assumptions about what we’ll do with it. It needs to be easy to query, easy to change alerts, and to come with sensible defaults. Above all, we want it to be easy to evolve - to change the query or visualisation tools. So really, this is a general-purpose data platform, just for the ops team. Because these problems are largely solved in the big data world.
In this world we move our alerts to just be another set of consumers that trigger on an event - any event, according to a simple code definition: “alerts as code”. And we can test them. Inny argues that sufficiently good alerts should replace dashboards - it’s an argument I find compelling; a light went on over my head with the caption “dashboards mean you don’t trust your alerts to tell you the things you need to know”.
Andrew argues that we should treat this like TDD: just as we write tests to describe what we expect our code to do, we should likewise emit events that instrument behaviour at the same time.
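A toy illustration of alerts as code, tested like any other code; the event shape follows the sketch above and the threshold is invented.

```python
# Toy "alert as code": the alert is a plain function over events, so it can be
# unit-tested like any other code. The event shape and threshold are invented.
def checkout_latency_alert(event: dict) -> bool:
    """Fire when a completed checkout took longer than a second."""
    return (
        event.get("event") == "checkout.completed"
        and event.get("duration_ms", 0) > 1000
    )


def test_slow_checkout_fires():
    assert checkout_latency_alert({"event": "checkout.completed", "duration_ms": 2500})


def test_fast_checkout_is_quiet():
    assert not checkout_latency_alert({"event": "checkout.completed", "duration_ms": 120})
```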
“This is not new. Charity Majors and the Honeycomb folks have been saying this for a long time.” But we need to move to it, regardless of the tools.
The Nature and Characteristics of Adopting the Hybrid Cloud
Mandi Buswell
“We talk about the cloud because we used to use a picture of the cloud to represent the Internet.”
- Moving workloads to the cloud - public or private - is inevitable. Automation, orchestration, and self-service are the key elements.
- AWS introduced the first public cloud in 2006, with EC2.
- Unfortunately not every workload is a good fit.
- And while cloud provides many benefits, it creates anxiety around sovereignty, security, and complexity. So people started building private clouds. Many, many private clouds.
- Since we still need some traditional workloads, we end up with hybrid cloud: orchestration across multiple clouds.
There can be compelling reasons to be across clouds: one cloud may be better at specific functions than another, or offer a significant price benefit, or have significant benefits around sovereignty. But, Mandi notes, while burstable cloud is a popular talking point, it’s the one that happens least. And to take advantage of any of this you have some prerequisites: you need to be able to automate your builds, moves, observability, and so on.
So how do you choose your hybrid path? Well, start by asking what the customer wants and needs. You need a design thinking approach - “a human-centred approach to the possibilities of technology.” Keep it open, keep it portable, make it great!
Portable Open Source Serverless Runtimes on Kubernetes
Scott Coulton
Why serverless on k8s?
- They may not use a public cloud.
- Flexibility of language (“I like to use Rust,” says the guy who works for Microsoft).
- Compliance and security.
- Multi-cloud portability.
In descending order of complexity:
- Knative
- Comes from Google.
- A lot of dependencies, some of which are highly complex (e.g. Istio).
- The most complex option.
- Exposes a focused API.
- Loosely coupled.
- Will run on any k8s, anywhere.
- Three components: build, serving, and eventing.
- Builds in-cluster, or accepts container images.
- YAML templating.
- Knative serving can be swapped out for e.g. OpenWhisk.
- OpenFaaS
- Big community but limited production deployments.
- Relatively simple to use.
- Not locked to k8s - can work with Swarm or raw Docker.
- Locked into Prometheus.
- A gateway, a Watchdog, and FaaS-idle (to scale to zero).
- Insecure by default - the gateway is open to the Internet.
- Does give you a UI, which is nice, as well as a CLI.
- Keda
- Fine-grained scaling of k8s workloads.
- Acts as a metrics server and scales workloads based on a resource definition.
- No dependencies other than k8s itself.
- Simply listens to and reacts to events.
- Allows scale to zero.
- Security and management is inherited from k8s.
- Event sources include SQS, RabbitMQ, Azure Service Bus Queues and Topics, and Prometheus.
Kubernetes Security Low-Hanging Fruit
This talk is about where to start, rather than an attempt to be exhaustive; there are a few things you can do that will address most of the risks you face. You don’t need to start by worrying about nation state actors.
- Containers are great, but they lose some isolation relative to VMs.
- Containers share a kernel and memory; the shared kernel becomes a common point of vulnerability.
- Don’t multi-tenant!
- A k8s pod is one or more containers. Many pods run on a node, which may be a VM or a physical server. There is a separate control plane, which holds the secrets and cluster control.
- An attacker will probably want to break out, first into the host (which lets them spread to other containers on the same host), and then perhaps into the control plane (which lets them own the cluster, including the secrets and other material available to the cluster as a whole).
- An attacker will probably look for a vulnerable application, such as one running an old version of a library.
- From there, you can get a shell and start looking for ways to compromise the application itself (for example, exfiltrating the data the application manages).
- If the application is boring, pivot to attacking the kernel of the host the pod is running on.
- e.g. running containers as root or leaving the docker socket in the container basically gives away the farm.
- If you can talk to the k8s API from a host, you can re-use those creds to try to attack the control plane, in order to pivot to the whole cluster. Don’t leave creds on the host unless they’re needed!
Levels of defence:
- Supply chain: make sure you know what you’re getting and who you’re getting it from.
- Container image: check the quality of what’s gone into it, that the image is what you think it is, that it doesn’t require privs you shouldn’t give it.
- Pipeline controls: make it easy to redeploy when something turns out to be full of holes, protect secrets, lock down CI servers, and decouple CI and SD.
- Linux security: use it. cgroups, namespaces, don’t use root, use LSMs like SELinux.
- Cluster: RBAC, admission controllers, secure container runtime, segregate etcd from your master.
- Network: limit outbound access, use a service mesh, mutual TLS, etc.
So what are the low-hanging fruit?
- Minimise the container image. Keep containers super lean if you can: no shell, no cURL/wget, no nmap or nc, no editor.
- Use RBAC - consider the CKAE guidelines.
- Sensible network policy.
- Do container image scanning.
- Be able to release quickly.
- Use the default set of recommended Admission Controllers.
- Use cgroup limits.
Unfortunately the New Relic presentation - which I have culled from my notes because it included phrases such as “we gamified the dashboards for our squadified teams” - ran so far over that I had to leave before the last couple of presentations, but overall the day was one well spent. Thanks to the organisers and speakers!
-
1. Or, you know, half-arse production, he says, through gritted teeth. ↩︎