LCA 2016 Day 1

One of the nice things about the rotation of LCA around various cities in Australia (and New Zealand) is that I end up visiting parts of Australia I wouldn’t otherwise see - Canberra, Ballarat, and now Geelong.

LCA 2016 is running later in the year than the norm - I can’t say this especially pleases me, since I miss the first week of my French course and it leaves Maire to wrangle the first week of the school year. The latter makes it even less accessible for parents, I imagine; perhaps schools start later in Australia.

Of course, having grumbled about the family-friendliness of the timing, I have to acknowledge the organisers for coming up with professionally run childcare, which is awesome.

In a departure from the usual format, keynote questions will be emailed and filtered. Hopefully this is a response to the problem of non-questions, rather than, say, to people being upset about actual questions that some of the audience don’t like (like Matthew Garrett asking Linus about his behavioural problems).

Continuous Delivery Using Blue-Green Deployments

Ruben Rubio

Introduction

Deployments have risks associated with them: application failures, capacity issues, people and process failures, and so on. The traditional response is either to fix in production or to roll back; but applications can suffer rollback failures.

Experience has taught Ruben that even with careful testing there will be failures in production. He believes that Blue-Green deployments on immutable infrastructure can reduce this risk.

Blue-Green Deployments

This is a technique of running multiple production deployments in parallel: the “blue” environment, which is the current production environment, and the “green” environment, which is the new deployment.

Once the green deployment has been tested, users can be migrated across to it; if there is a problem, there is still a complete blue environment to fail back to. If the green environment is good, the blue environment is eventually destroyed.
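The cut-over described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not anything Ruben showed: the Router class, its method names, and the caller-supplied health check are all hypothetical, standing in for whatever load balancer or DNS switch you actually use.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy traffic router: points users at exactly one live environment."""

    environments: dict = field(default_factory=dict)  # name -> version string
    live: str = "blue"

    def deploy(self, name: str, version: str) -> None:
        # Build a fresh environment alongside the live one; never modify in place.
        self.environments[name] = version

    def cut_over(self, target: str, healthy) -> str:
        # Only migrate users once the new environment passes its checks.
        # `healthy` is a hypothetical check function supplied by the caller.
        if target not in self.environments:
            raise KeyError(f"unknown environment: {target}")
        if healthy(self.environments[target]):
            self.live = target
        return self.live

    def destroy(self, name: str) -> None:
        # The old environment may only be removed once it is no longer live.
        if name == self.live:
            raise ValueError("refusing to destroy the live environment")
        del self.environments[name]
```

If the health check fails, the router simply stays on blue, which is the "easy rollback" property the technique is after.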

Immutable Infrastructure

A long-standing problem in IT is special snowflake servers and environments, which are fragile, long-lived, and can never be replicated, or even modified, reliably.

Immutable infrastructure, on the other hand, cleanly separates data (which must persist) from infrastructure (which is stateless). Here, we rebuild servers as part of every deployment, ensuring we never lose confidence in our ability to reproduce an environment.

Ruben notes that a lot of the conversations around this deployment model are focused on tools - containers, especially Docker - while he insists that containers are not necessary for this technique.

How?

Process is critical:

  • You must never touch infrastructure. You have to change from a culture of hotfixing individual servers to resolve problems - you must have the discipline of rebuilding them.
  • This implies you must be able to rebuild infrastructure quickly, or you won’t be able to solve problems in the time required.
  • It also implies that builds must be consistent and configured via a configuration management system.

Forcing from-scratch builds of environments forces the cultural change, because changes that aren’t made through the proper mechanisms will be destroyed quickly.

Define Your Infrastructure

  • Purely stateless: app servers, web servers, load balancers, etc.
  • Persistent state: SQL DB, filesystem.
  • Volatile state: Message, task, email queues.

The third type must be dealt with carefully; it can be destroyed, but only when the queue is drained, or state will be lost.

In some cases caches (which are nominally stateless) need to be treated as stateful; this is because a cache may take long enough to warm up that there is significant impact in replacing it.

Blue-Green deployments need to take account of the fact that jobs in the volatile state may have data format changes, so the different versions of the infrastructure may not be able to process one another’s jobs. The routing infrastructure must, therefore, be able to route different messages to the correct version of the infrastructure.
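To make the volatile-state rules concrete, here is a minimal Python sketch of version-aware routing with a drain check. It is an illustration only: the VersionedQueues class and the "format_version" field are assumptions of mine, and a real system would use its message broker's own routing features.

```python
from collections import deque


class VersionedQueues:
    """One queue per deployment version, so consumers only see jobs they can parse."""

    def __init__(self):
        self.queues = {}  # format version -> deque of job payloads

    def publish(self, job: dict) -> None:
        # Each job carries the format version of the deployment that produced it
        # (the "format_version" field name is hypothetical).
        self.queues.setdefault(job["format_version"], deque()).append(job)

    def consume(self, version: str):
        # A consumer asks only for jobs in its own format.
        queue = self.queues.get(version)
        return queue.popleft() if queue else None

    def can_destroy(self, version: str) -> bool:
        # Volatile state may be torn down only once its queue is drained.
        return not self.queues.get(version)
```

The blue environment would keep consuming its own queue until can_destroy("blue") returns True, and only then be removed.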

Which Technologies?

  • Ad-hoc technologies: work very well for a single cloud provider; for example, CloudFormation. The trade-off is that you’ll be locked into a particular cloud or toolset.
  • Service Wrappers: these orchestration engines abstract away the cloud suppliers to avoid lock-in. The challenge is that the wrappers may not be updated across all platforms as quickly as the providers offer new features.
  • Frameworks: for example, ManageACloud’s mac framework. Ruben suggests this offers better compatibility by reducing the level of abstraction (no wrapper layer). The automation is written in the framework, but the template items are written natively.

Remaining Challenges

Data is hard, particularly schema-type changes. Since the data layer is a junction between the versions of the infrastructure, all changes would have to be compatible in the persistence layer; where data does change, you will either have to keep multiple versions of the data around (and have the migration maintain and transform them), or use a traditional migration for those cases.
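One way to read “keep multiple versions of the data around” is to write both layouts during the transition, so that blue and green can each read the rows they understand. A minimal Python sketch, assuming a hypothetical customer record whose single name field is being split in two; none of these names come from the talk.

```python
def write_customer(first_name: str, last_name: str) -> dict:
    # During the transition every row carries both the old layout ("name")
    # and the new layout ("first_name" / "last_name").
    return {
        "name": f"{first_name} {last_name}",
        "first_name": first_name,
        "last_name": last_name,
    }


def read_customer_blue(row: dict) -> str:
    # The old (blue) code only knows about the combined field.
    return row["name"]


def read_customer_green(row: dict) -> tuple:
    # The new (green) code only uses the split fields.
    return (row["first_name"], row["last_name"])
```

Once blue is destroyed, the old field can be dropped in a later, separate change.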

Advantages

  • Rollback is easy (and possible).
  • Configuration drift disappears.
  • Infrastructure is properly documented - because every build is documented.

Q&A

  • “How do you synchronise versions in the persistence layer?” You can use triggers or similar techniques, but in some cases there is simply no easy answer - you may have to use traditional deployment techniques for these cases. Blue-Green deployments can’t solve every problem.
  • “Does a framework-based approach imply that the implementation-specific parts of your implementation blueprint need to be rewritten for each provider?” Yes, that is correct.
  • “I didn’t follow the distinction between the wrapper-based approach and the framework-based approach? Aren’t you better sharing the burden of API change with your wrapper provider?” Ruben argues that the rate of change makes this impractical - that in practice, the restricted subset the wrapper offers isn’t worth the benefit of someone else maintaining that interface. “They work very well, but they have limitations with what they can do.”
  • “Is this an alternative to staging environments? Have you been burned by not being able to reproduce production problems or catch them before production?” Yes.

The Twelve-Factor Container

Casey West - @caseywest

“This isn’t really a talk about containers. It’s a talk about operational maturity. Don’t tell the organisers.”

“I like hand-based participation.”

Participation quickly demonstrates the audience has used containers in dev a lot more than production.

“Docker docker docker docker docker docker. Now we’re done discussing docker.”

The idea of “12 factor development” came out of Heroku, who have some fantastic ideas around development to maximise reuse in a distributed model.

One codebase, tracked in revision control, many deploys

An anti-pattern: different images for development, testing, and production. One immutable image for every environment is the right way.

Images should be immutable, and they should have provenance. Immutability is not the same as unchanging; it’s about being reliable at points in time.

Another anti-pattern: tags for dev and prod. Just don’t build images which vary depending on the tag. Cut that out.

Use feature flags or environment flags in the application to control behaviors. They can be a big source of contention between developers and operational teams. They need a certain amount of rigor in the development process to make them work well, including pruning dead code.

Explicitly declare and isolate dependencies

Microservices: instead of having one deployment that breaks every month, have hundreds of deployments.

Anti-pattern: latest. Do not do that. You must declare the version numbers of things you depend on.

Best practice: depend on base images for default filesystems and runtimes. Don’t cram everything into your container. Do not jam lots of crappy bash into your container and call it “operationally mature”. Layer your runtime on a base container - imagine updating a Shellshock-class vulnerability. Do you want to rebuild a base, then the app layer, then push the update, or rebuild hundreds of bases?
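The “latest” anti-pattern above is easy to check for mechanically. The following Python sketch is my own illustration, not something from the talk: is_pinned is a hypothetical helper that treats an image reference as pinned only if it carries an explicit tag other than “latest”, or a digest.

```python
def is_pinned(image_ref: str) -> bool:
    """Return True if the reference names an explicit non-'latest' tag or a digest."""
    if "@" in image_ref:
        # Digest form, e.g. name@sha256:..., is as pinned as it gets.
        return True
    # Look only at the last path component so a registry host:port
    # is not mistaken for a tag.
    last_part = image_ref.rsplit("/", 1)[-1]
    _name, sep, tag = last_part.partition(":")
    return bool(sep) and tag != "" and tag != "latest"
```

A check like this could run in a build pipeline and fail the build when a base image or dependency is left floating.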

Store config in the environment variables

Anti-pattern: config.yml, properties.xml. They just get turned into environment variables anyway.

Anti-pattern: hard-coded feature flags. They don’t belong as booleans in your codebase.
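As a minimal sketch of both points, here is how configuration and a feature flag might be read from the environment in Python. The variable names (DATABASE_URL, LOG_LEVEL, FEATURE_NEW_CHECKOUT) are hypothetical, chosen for illustration.

```python
import os


def load_config(environ=os.environ) -> dict:
    """Build the app's configuration purely from environment variables."""
    flag = environ.get("FEATURE_NEW_CHECKOUT", "false")
    return {
        # Required setting: a missing value raises KeyError at startup.
        "database_url": environ["DATABASE_URL"],
        # Optional setting with a default.
        "log_level": environ.get("LOG_LEVEL", "INFO"),
        # Feature flag controlled per environment, not hard-coded as a boolean.
        "new_checkout": flag.lower() in ("1", "true", "yes"),
    }
```

The same immutable image can then behave differently in dev and production simply by being started with different variables.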

Treat backend resources as attached resources

Anti-pattern: local disk. When you migrate apps in, one of the first changes you need to make is to migrate away from local files. DON’T RELY ON LOCAL DISK. Don’t re-implement NFS. That’s all I ask.

Connect to network-attached services using connection info from the environment. If you need parameters or credentials, get them from an appropriate service, e.g. Eureka service discovery.

Strictly separate build and run stages

Anti-pattern: install on deploy. Consider the Perl ecosystem: CPAN pulls in install images (code, tests) at the deployment stage.

Create the immutable image, then deploy it. Respect the lifecycle: build, run, deploy.

Execute the app as one or more stateless processes

This can be challenging because we’re used to thinking about apps as tightly coupled: db + application + HTML + JS. You need to get away from this; the data isn’t part of the application.

Schedule long-running processes by distributing them across a cluster of physical hardware.

Scheduling is a solved-but-hard problem. Use tools like Kubernetes, don’t try and solve it all yourself.

Anti-pattern: NFS. Don’t use it.

Export services via port binding

Best practice: port = ENV.fetch("PORT") or whatever works in your language. Don’t hardcode the port. Don’t make assumptions.
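In Python the same idea might look like the sketch below, using only the standard library. The PORT variable name and the tiny Hello handler are assumptions for illustration, not something from the talk.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def port_from_env(environ=os.environ) -> int:
    # No fallback default: a missing PORT raises KeyError instead of guessing.
    return int(environ["PORT"])


class Hello(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(environ=os.environ) -> HTTPServer:
    # Bind on all interfaces so the container runtime can map the port.
    return HTTPServer(("0.0.0.0", port_from_env(environ)), Hello)
```

Running it is then a matter of setting PORT in the container's environment and calling make_server().serve_forever().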

Scale out via the process model

Horizontally scale by adding instances. Don’t pre-fork.

Maximise robustness with fast startup and graceful shutdown

There are no particular examples because containers get you that for free; container startup is quick and cheap in Linux.

Keep dev, staging, and production as similar as possible

This should be obvious. Tools like Docker make this easier. Run containers in development so you’re creating the same experience everywhere - don’t run a big ball of mud. Docker Compose is great for this.

Logging

Anti-pattern: Don’t #yolo log files all over the filesystem.

Best practice: STDOUT STDOUT STDOUT STDOUT. We don’t need to daemonise any more.
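For a Python application, a minimal sketch of “everything to STDOUT” with the standard logging module could look like this. The configure_logging helper and the log format are my own choices, not something prescribed in the talk.

```python
import logging
import sys


def configure_logging(level: str = "INFO") -> logging.Logger:
    """Send all log records to stdout and let the platform collect them."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    root = logging.getLogger()
    # Replace any existing handlers (for example file handlers) with stdout.
    root.handlers = [handler]
    root.setLevel(level)
    return root
```

The process stays in the foreground and writes a stream of events; where that stream ends up is the container platform's decision, not the application's.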

Run admin tasks as one-off processes

Anti-pattern: Custom containers for tasks. Don’t have a special application with “doit.sh” or jump containers.

Instead reuse application images with special entry points for tasks - you could “docker run migration”, for example.
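A sketch of what “one image, several entry points” might look like inside the application, assuming a Python app with hypothetical serve and migrate subcommands. The container would then be started with either command as its argument, much like the “docker run migration” idea above.

```python
import argparse


def serve(args) -> str:
    # Placeholder for starting the long-running web process.
    return "serving web traffic"


def migrate(args) -> str:
    # Placeholder for a one-off admin task that reuses the application code.
    return f"migrating schema to {args.target}"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="app")
    sub = parser.add_subparsers(dest="command", required=True)

    serve_cmd = sub.add_parser("serve", help="run the application")
    serve_cmd.set_defaults(func=serve)

    migrate_cmd = sub.add_parser("migrate", help="run a one-off migration")
    migrate_cmd.add_argument("--target", default="head")
    migrate_cmd.set_defaults(func=migrate)
    return parser


def main(argv) -> str:
    args = build_parser().parse_args(argv)
    return args.func(args)
```

Because the task lives in the same codebase and image as the app, it goes through the same review, testing, and release process.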

Payoff

You are all now cloud-native.

Q&A

  • You talked about deploying containers with multiple entry points - will it make security people nervous? When you build a management task, rather than have bolted-on warts of code, leverage the application code, so it’s part of the proper SDLC, so it can be inspected, tested, and ultimately removed once you don’t need it any more. You’re doing them anyway, this just makes it more visible/better managed.
  • How do you debug production with immutable containers? That’s a wonderful question. “It never ceases to amaze me that developers say it’s ops problem now.” You can never simulate production characteristics in dev, because you can’t use the production data, scale, and so on. “You can’t have fewer than three of anything in production, you ought to be able to look at a live container. But you need to be able to take it out of the live pool. And once you’ve finished with it, it’s now a special snowflake and it must be destroyed afterwards.” You can also save the state of a running container and move it into a safe environment.
  • I appreciate the “don’t run NFS in your container” advice - what would you recommend for solving storage problems? Kubernetes has some interesting development around persistent storage; there is also S3 or Swift. You need to move away from structured filesystems for the storage. You may need to accept that your storage in these scenarios will be slower than what you can build with e.g. Berkeley DB on SSD.

Manage Infrastructure as Code

Alan Shone

A brief hiatus at the beginning as the presentation fairies take their toll on displaying slides.

Legacy Concepts

  • Infrastructure was extremely manual (I disagree - traditional mainframe sysprogs and *ix admins have always automated; rcs was commonly used for versioning, for example [1]).
  • Everything managed in plain text files.
  • Documentation was a nightmare.

Problems with this sort of approach included keeping meta information up to date, handling migration, reporting, and cumbersome interfaces.

Ideas

  • Some sort of versioning.
  • Easier interface for collaboration.
  • Provisioning of host state.
  • Automation.

Software?

  • Infrastructure requires orchestration.
  • Software dependencies can be pushed within this infrastructure.
  • Hardware often requires different flows.

Ansible

  • Provides inheritance.
  • Allows for variable-driven config.
  • Straightforward, easy to implement and use.
  • Playbooks provide grouping of instructions.
  • No built-in versioning.
  • Drawbacks include: no real instance tracking; supplier specific wrappers for each class of infrastructure; versioning is DIY.

Chef

  • Based on Ruby syntax.
  • Variables.
  • Push model.
  • Cookbooks for grouping commands.
  • A central, versioned repository is available.
  • Drawbacks: dedicated management server; pegged to using Ruby; you need to write plugins for functionality.
  • OS and packaging restrictions.

Puppet

  • Simple syntax (although I will note that my experience is that it can become very complex).
  • Easy to automate (see above - it is at first).
  • Central server available.
  • Drawbacks: very specific DSL; complex infrastructure can become cumbersome; dependency-based, which can make it difficult to control the ordering of execution.

CloudFormation

  • JSON config of your infrastructure, which gives complete control over the provisioning and automation.
  • Information is all available in the API.
  • Works very well.
  • But it locks you entirely into AWS.
  • The JSON config is difficult to maintain - no comments allowed.
  • Not idempotent.

Terraform

  • Orchestration and provisioning.
  • Simple syntax to grasp and maintain.
  • Parameterisation options are simple and work well.
  • Drawbacks: tightly integrated with vendors. If it isn’t already there, you’re shit out of luck; the syntax has a learning curve; and it is a work in progress.

ManageACloud

  • Complete solution.
  • “Completely open.” Except it’s actually not, there are closed source chunks.
  • This part started sounding a bit like an advert, especially after Ruben’s talk this morning (both are affiliated with MAC).
  • Macfiles are used for the manifests.

What About People?

DevOps - “automation is a key aspect”. I’m going to agree and disagree. Automation is tools and sets of tools. “DevOps”, like “Security”, is a way of doing things, not tools you drop into the picture.

“There will always be people.” “Workflows and processes are important.”

(The message I harp on when I push automation culture to people is simple: “I want to take the shit parts of your job away.”)

Decisions

  • There are always more options available than time to discuss.
  • “What everyone else is using isn’t a good criteria.” I totally disagree. A large library of expertise and implementation is hugely important.

Comments

I was very disappointed by this talk. Apart from it coming across as a MAC advert, it seemed to me to get quite a few things wrong: it missed the importance of culture change to DevOps; it dismissed the benefits of tools that are widely used; it pushed the idea that “tradition” was manual; and the speaker showed an apparent lack of a broader view outside AWS.

Cloud Anti-Patterns

Casey West

“The Five Stages of Cloud Native”

“No weird family stuff, but we’re going to talk about your delivery pipeline.”

Denial

“Containers are just like tiny virtual machines.”

Putting stuff that doesn’t work into the cloud doesn’t make it work. Do not treat containers as tiny VMs.

“How many of you think a container is Docker? No-one? Good.”

We Don’t Need to Automate Continuous Delivery

“We can’t automate delivery, because delivery breaks every time, which is why we have a manual checklist. We shouldn’t automate it until it’s perfect.”

Automation isn’t about perfection, it’s about being consistent. Being broken the same way every time gives you something to improve on, and automating gives you the time and space to fix it.

Note this isn’t technology-specific. This is a principle that’s independent of your tools.

Anger

It Works on My Machine

“It’s beautiful and perfect and my unit tests tell me that, so if it breaks in production, it’s your fault.”

Dev is Just #YOLO-ing Shit To Production

“If we’re honest with ourselves as developers, that’s true.”

Having the ability to push things into production doesn’t mean we should. The objective is not to push worse code to production faster, it’s about delivering value to production with lower risk.

Bargaining

“We crammed this monolith into a container and called it a microservice. It runs on Docker!”

You can’t just cram your old busted shit into a container and call it good. You need to use your critical thinking skills and work out what you need to refactor from your existing applications.

Bi-Modal IT

This term comes from Gartner. They want you to drink their Kool-Aid, which is like top-shelf martinis because they’re expensive.

The idea is that you’ll have fast and slow lanes - but the problem is anyone who doesn’t want to change will argue they belong on the slow lane; it’s also a false dichotomy, because there’s a spectrum.

Legacy software should be defined as “anything we can’t iterate on quickly.”

What If We Create “Microservices” That All Talk To The Same Data Source?

Then a single SQL server is your API.

Depression

We Made 200 Microservices and Forgot to Set Up Jenkins

You’ve made things worse. “We built an automated build pipeline, but we only release twice a year.” If your business aren’t ready for faster delivery, for example, you’re just piling up work quicker and you’ve wasted all that effort to change things.

Acceptance

We need small batch sizes for re-platforming, too - just as we develop small units, we should change in small steps, and we need to change all the things we need to change.

Cloud Crafting - Public/Private/Hybrid

Steven Ellis

  • Public Cloud: AWS, Azure, Google Cloud, etc.
  • Private Cloud: private IaaS, VMware, Hyper-V, RHEV, Oracle VM.
  • So hybrid is managing both of those, making them interoperable.

“That overused term, a single pane of glass.”

ManageIQ orchestrates cloud providers, internal VMs, and containers (Kubernetes, OpenShift, and so on).

“Brown fields as well as green fields.” ManageIQ can discover existing VMs on your VM farms, cloud providers, and whatnot and learn about them. It doesn’t need you to re-platform everything to get some benefit.

It can also pull in Heat and CloudFormation information from OpenStack and AWS, respectively.

Integration via a REST API, and integration points with various well-known CMDBs and ticketing systems - the latter lets you identify slow process points by reporting on ticket open/close/action times.

Reporting, chargeback, and the like are all built-ins.

There is a demo. The demo gods are kind!

This talk is essentially ManageIQ for people who don’t know it - since I know the commercial downstream it’s a bit old hat for me.

If your problem is “how do I glue bits of my automation together” this is one good answer.

Live Migration of Linux Containers

Tycho Andersen

“We’ve always taken the view that containers are simple lightweight VMs,” noting this is at odds with Casey’s talk.

Fundamentally, Canonical see LXD as belonging to the class of hypervisor managers - Hyper-V, RHEV, etc. So snapshots and so on. LXD talks to lxc as the mediator into the kernel.

“So one function we want to be able to enable is migrating one namespace to another”: lxc move host1 to host2. lxd facilitates that co-ordination, moving the filesystem and process state from one system to another.

It relies on CRIU, which has been around for 5 years, beginning on OpenVZ; the OpenVZ team have been upstreaming OpenVZ functionality in chunks for a number of years now, and checkpoint/restore is part of that.

One challenge is that Linus is not a fan of the kernel extensions, and Lennart isn’t going to add the functionality to systemd, in the latter case because it isn’t terribly reliable - functions in CRIU work “often but not always”. Features need checkpoint/restore support, and even then the userspace will trail behind.

Security

  • cgroups
  • AppArmor, but not SELinux
  • seccomp (STRICT, FILTER)
  • user namespaces

Making this work requires a lot of care and consideration around the order of operations, since getting the ordering wrong can result in one security function killing the migration of the process.

To enable seccomp migration requires disabling bits of seccomp so code can be injected for the migration. Kees Cook describes this as “giving me the creeps” and I had the same reaction.

Performance

  • It’s still pretty slow.
  • Using btrfs or ZFS will run much faster than other filesystems because it can take advantage of their snapshotting and sending features.
  • Otherwise it falls back to rsync.
  • The memory transfer can be time-consuming.

All in all it struck me as “technically neat but practically questionable.” Not least because I’m not sure this is the Right Thing to do with containers, and it’s worth noting the whole presentation is diametrically opposed to what most of the other speakers described as advantages of containers.


  1. To elaborate on this: this position seems very ahistorical to me. One of my first Unix mentors explained that, to him, he wasn’t doing his job unless everything was so automated he could just put his feet up for the day and do nothing, confident that his monitoring, jobs, and so on, would all run smoothly enough that he could rely on his error reporting to let him know if he had to do something. This was a normal position; my experience with VMS was similar, and so on and so forth. Builds in classic Solaris land were automated with Jumpstart, user provisioning happened by (heaven help us) YP, and the idea of spending a lot of time redoing tasks over and over by hand was anathema. That seemed to me to have changed by the late 90s and early part of this century. I don’t know what did change it - perhaps the sheer volume of stuff being built during the .com boom and the influx of folks because of it - but it seemed like a lot of old practices (/etc in rcs, script everything, and so on) got lost. It’s great that we’ve kicked on from there with even better automation and version control concepts, but the idea that we are now entering a blessed state from an ignorant past is a fairy story.