LCA 2018 Day 5

Another hot, sticky day in Sydney. The climate is my least favourite thing about the place.

Surprising amounts of the city apparently close for “the public holiday”, as the organisers are referring to it, in an effort to maintain neutrality on the local arguments around whether the current form of celebrating the European colonisation of Australia is the right one.

As an aside, mentioning I’m staying in Redfern elicits the odd interesting response from people who think this is basically equivalent to checking into a hotel in a war zone; I was told there would be riots here on Friday, which doesn’t in fact seem to be the case. While the locals are not exactly Abbott voters, the only dire predictions of danger that appear to be accurate are Sydneysiders warning of aggressive gentrification.

Once more we roll through the door prizes: winners one and two are not present, but winner number three is here!

(Applause for Next Day Video - more like same day video!)

Containers from User Space

Jess Frazelle

“I work out at Microsoft. Selling out has been amazing.”

“I live between the layers of abstraction and love it there. I love pulling features from one layer into another.”

What is a container?

  • Not a real thing - they’re a collection of Linux primitives, rather than a first-class construct (as VMs, zones, jails are).
    • This is a feature, not a bug. It lets you do cooler things.
    • The base is namespaces and cgroups.
    • You can layer other things on them; an LSM, for example (AppArmor or SELinux) to restrict access to parts of the filesystem, or seccomp (whitelisting syscalls).
    • seccomp default currently blocks about 150 syscalls.
    • In the kernel the NoNewPrivs flag is set in Docker; in k8s this is controlled by allowPrivilegeEscalation.
  • But what about Intel Clear Containers?
    • KVM-based.
    • “I like it,” but is it really a container?
    • “I would have called them glass houses”.
  • History:
    • OpenVZ was released 12 years ago.
    • Linux-vserver - 9 years ago.
    • lxc - 9 years ago.
    • Docker - 5 years ago. Initially based on lxc, but the backend eventually became libcontainer.
    • lmctfy (let me contain that for you) from Google was released 4 years ago. Based around libcontainer.
    • rkt - 3 years ago. CoreOS’s containers; the start of the container wars.
    • runc - 2 years ago. An attempt to end the container wars; part of the OCI (Open Container Initiative).

“Container runtimes are the new JavaScript frameworks.”

Are Containers Secure?

Yes, “and I can prove it”. VMs, zones, and jails are like buying Lego if the Lego was already glued together out of the box. Linux containers, on the other hand, are like a Lego set. You can, for example, turn on and off certain namespaces. If network performance is critical to you, you could turn off the namespace just for one container so you can use advanced kernel features; you can share namespaces: run strace or Wireshark in one container to inspect another container, by sharing the relevant namespaces.

Every container in a k8s pod shares the PID and net namespace.
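As a minimal sketch of how you might check which namespaces two processes share (this is not something shown in the talk): on Linux, each entry under /proc/<pid>/ns is a symlink whose target identifies the namespace, so two processes with equal targets are in the same namespace. The helper names and the sample identifiers below are made up for illustration.

```python
import os


def read_namespaces(pid="self"):
    """Map namespace name -> identifier such as 'net:[4026531992]'.

    Linux only; reading another process's entries may need extra privileges.
    """
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in os.listdir(ns_dir)
    }


def shared_namespaces(ns_a, ns_b):
    """Return the sorted names of namespaces that both processes are in."""
    return sorted(
        name for name in ns_a
        if name in ns_b and ns_a[name] == ns_b[name]
    )


# Hypothetical identifiers for two containers in the same pod:
# they share pid and net, but each has its own mount namespace.
container_a = {"pid": "pid:[4026532001]", "net": "net:[4026532002]", "mnt": "mnt:[4026532003]"}
container_b = {"pid": "pid:[4026532001]", "net": "net:[4026532002]", "mnt": "mnt:[4026532099]"}

print(shared_namespaces(container_a, container_b))
```

On a real Linux host you could pass two PIDs through read_namespaces() and compare the results the same way.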

Docker has sane defaults to try and help you avoid shooting yourself in the foot. You can make them more restrictive, or relax the restrictions.

Jess runs a game that teaches you about containers, and is also meant to prove containers are secure. If you think containers are breakable, feel free to CTF and prove your point. People have tried, and failed.

One reason is that the seccomp profile locks the environment down enough that it requires ROP (return oriented programming) to crack the container. And no-one is going to do that just to break “some shitty web app”. And if you break that, you’re probably going to need a kernel 0-day to break all the way out.

Containers are sandboxes. If you turn off the security features, they are no longer a sandbox; if you want all the features of a VM, security-wise, you should just use a VM.

Containers on the Desktop

The key patch is for rootless containers, using runc. This gets Jess a tidy computer with all her applications neatly contained in sandboxes. But that still wasn’t enough: Jess was running Debian, which was still too messy. So she migrated to CoreOS, which forces the entire desktop into containers, because CoreOS is a read-only root.

“Little boxes of the [CoreOS] hillside.”

What if?

What if we applied the same principles to programming languages?

Take some Go code; as you build it, generate seccomp filters, based on the source code, so you have a seccomp profile that allows you to restrict the binary to only call the syscalls the source code actually needs.

Yes, you can do this with post-compile profiling, but in practice that option is unreliable, and users don’t bother.

“The Go compiler is unique in that it generates the assembler for the syscalls at compile time”, which makes it easier to understand which syscalls are going to be required.

This approach mostly works; there are a couple of hiccups:

  • If the Go program execs other binaries, the binaries aren’t in the profile.
  • Go plugins have arrived, and again, it’s difficult to understand what is going to be needed.
  • You can use the syscall.RawSyscall() interface, which will bypass the compiler’s normal syscall interface.
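To make the idea concrete, here is a rough sketch of the last step only: turning a list of syscalls into an allow-list profile. It is written in Python rather than as part of the Go toolchain, it is not Jess’s patch, and the syscall list is a made-up stand-in for what build-time analysis would produce. The JSON layout follows the Docker-style seccomp profile format (a default action plus a list of allowed names).

```python
import json


def build_seccomp_profile(allowed_syscalls):
    """Build a deny-by-default seccomp profile that allows only the given syscalls."""
    return {
        # Any syscall not listed below fails with an error instead of running.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                # De-duplicate and sort so the generated profile is stable.
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Hypothetical result of analysing a small program at build time.
needed = ["read", "write", "openat", "close", "exit_group", "write"]
profile = build_seccomp_profile(needed)

print(json.dumps(profile, indent=2))
```

The hiccups listed above are exactly the cases where the input list would be incomplete: exec’d binaries, plugins, and raw syscalls all add calls that this kind of analysis cannot see.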

Jess hasn’t merged this because dealing with upstream politics is too hard. It’s an approach Jess thinks has merit, and “Rome wasn’t built in a day.” Applying these principles to programming languages could be really beneficial.

Other Crazy Things

SCONE: Linux containers with Intel SGX (secure enclaves) - SGX is primarily aimed at people who don’t trust the cloud. It’s a way of protecting your runtime from root. SCONE assumes that your hardware and OS have been compromised; but it does not claim to stop side channel attacks or DoS.

There are some interesting lessons from this secure enclave approach:

  1. Keep code small. If everything is in the sandbox, there is no sandbox.
    • But if the base is too small, you become too dependent on the host OS, so you aren’t protected from it any more.
  2. Performance: there are memory copies and syscalls being copied in and out of the enclave, and the cache gets busted to keep the encrypted memory secure. This does hurt performance.

After a number of approaches, the current approach is to implement a shielding layer, which encrypts all the syscalls going in and out of the host OS layer - disk, network, and so on. This is expensive (“Like, no shit”).

Jess thinks this is probably too slow and overcomplicated to be worthwhile at the moment. You can play with it in Azure, where a Windows hypervisor based approach has been implemented based on Microsoft’s Haven paper.


At the moment k8s doesn’t support multi-tenancy. Jess is part of a working group trying to make this work.

  • Soft multi-tenancy is for preventing accidents, not for malice.
  • Hard multi-tenancy is for preventing malice in an environment with untrusted actors.

This means you have a lot of moving parts you need to think about; you need a minimal base OS. For a container runtime, you could run clear containers if you want full VM level protection, or a well-controlled container if you’re slightly less paranoid.

The network needs to be whitelist only; one challenge is that default k8s DNS is oriented to service discovery which is not ideal. You could switch it off, or run it as a per-pod sidecar.

AuthN/AuthZ: There are a lot of answers here, and they’ll be specific to your scenario.

The master and system service nodes need to be isolated.

Heavily restrict access to your host resources. From there on, it’s about making the containers dumb to their surroundings. Minimal knowledge inside the container is good; Jess notes that when she worked at Docker there was a constant pressure from customers to make containers smarter and smarter, and to know more about the world outside.

The RomCom, the App and the Wardrobe

Hannah Thompson (@hannnahcancode)

“This is a talk about React and React Native. It is also a talk about history and fashion.”

(“This font is called Princess Sophia if you need some curls.”)

“A time of curls, and marrying for financial security, and dying in childbirth. Which is not funny.”

203 years ago Jane Austen was writing; Charles Babbage was a young man struggling for a professorship when she died.

Emma was written and released in 18xx; its heroine was an independent woman - which at this time meant independently wealthy. She was a matchmaker who meddled in her friends’ lives; Emma the novel is about how people make and miss connections.

Now we travel to 1995: Clueless, a modern interpretation of Emma. Cher, like Emma, is a fashionista and a matchmaker. In the film, she had a computer helping her pick her outfits, something that seemed magical at the time. This technology was extraordinary in 1995, but is very doable today.

“At its heart it’s just a fruit machine for clothes.”

It’s 2018. Amazon have a creepy robot they want you to put into your bedroom, so you can get dressed in front of the Internet for Amazon to judge your choices. This seems… less than ideal.

Interestingly enough, the app from Clueless is a popular Google search. It struck a chord for many people. There are a lot of efforts at this, but they’re mostly neither very good, nor free. But women get into technology through their clothes - pocket sewing BOFs, Pinterest, you name it. There’s a lot of stuff out there, but nothing with the simplicity of the Clueless app.

Last year, Hannah started a degree in programming. One of her friends immediately demanded she write the Clueless app. So she did. She made the app in React Native.

Under the Bonnet

  • React Native uses JavaScriptCore on iOS; on Android it bundles its own copy with the app. This makes it more native than most comparable JS frameworks.
  • This lets you use native UI elements and capabilities.

So why React Native?

  • You know JavaScript.
  • You want to be cross-mobile-platform.
  • You have an existing web app you want to make native.
  • You want hot reload, not compile-deploy while you’re developing.


Disclaimer: iOS development still requires OS X. You need node.js and the development tools.

Install node and watchman (which provides hot reloading), then run npm install -g react-native-cli.

Now you can create a new React Native application with react-native init MyApp and pull the packages you want into your package.json.


A component is a chunk of UI; you may have a group of text components in a view component. Everything in React is a component.

It’s just React and JavaScript.

The Clueless app: Three carousel components and a footer. The footer can be two text components.

  • New components can be created by extending the base Component class.
  • Components have a leading capital letter.
  • Styling components is via Flexbox.
    • Styles are part of the JavaScript, and are CSS like, but not CSS.
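The “fruit machine for clothes” logic behind the three carousels can be sketched independently of React Native. The following is a plain Python model, not Hannah’s app: the class name, the wardrobe contents, and the method names are all invented for illustration.

```python
import random


class Carousel:
    """One row of the app: a list of items with a currently selected index."""

    def __init__(self, items):
        self.items = list(items)
        self.index = 0

    @property
    def current(self):
        return self.items[self.index]

    def next(self):
        """Swipe to the next item, wrapping around at the end."""
        self.index = (self.index + 1) % len(self.items)
        return self.current

    def spin(self, rng):
        """Fruit-machine style: jump to a random item."""
        self.index = rng.randrange(len(self.items))
        return self.current


# Hypothetical wardrobe: one carousel each for tops, bottoms and shoes.
wardrobe = {
    "top": Carousel(["yellow plaid jacket", "white shirt"]),
    "bottom": Carousel(["yellow plaid skirt", "jeans"]),
    "shoes": Carousel(["mary janes", "boots"]),
}

rng = random.Random()
outfit = {slot: carousel.spin(rng) for slot, carousel in wardrobe.items()}
print(outfit)
```

In the React Native version each carousel would be a component and the selected index would live in its state.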


Props let you pass information in and out of a component. Components can alter themselves. You can build functional components, components that do something when frobbed with.


State allows you to determine behaviour (e.g. conditional rendering) based on the state assigned. Without an event, this might be limited to e.g. changing the appearance of a button.


Events allow you to wire logic in. You can use setState() (which is async) to force a re-render with the new state. You can run arbitrary JavaScript on the event.


The API provides the interaction with the phone; for example, the CameraRoll API lets you get photographs from the camera roll.

Driving Virtual Reality from Linux

Keith Packard

This work started in Hobart last year, and in spite of an optimistic estimate, it’s taken the better part of the year to deliver.

The head mounted display requires the application to compute the left and right eye buffers independently, and merge them. However, there’s additional complexity: the OLED displays all have per-pixel variation of brightness and the lenses manipulate the image. Your application needs to correct for both of these aspects when rendering to the HTC Vive. You can’t simply dump a desktop onto the Vive and get useful output. You’d need to manipulate it first.

Also they’re the most hard real time device we have for the desktop. If you don’t track smoothly with super-low latency for every frame, the user becomes violently ill. We want 90 frames per second with no dropped frames, with varying work to render that successfully.
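To put a number on that: 90 frames per second leaves roughly 11 ms to render each frame. A small worked example follows, with invented render times, showing how you might flag frames that blow the budget.

```python
TARGET_FPS = 90


def frame_budget_ms(fps):
    """Time available to render a single frame, in milliseconds."""
    return 1000.0 / fps


def missed_frames(render_times_ms, fps=TARGET_FPS):
    """Return the indices of frames that took longer than the budget."""
    budget = frame_budget_ms(fps)
    return [i for i, t in enumerate(render_times_ms) if t > budget]


# Hypothetical per-frame render times in milliseconds.
samples = [8.0, 11.0, 11.2, 9.5]

print(f"budget: {frame_budget_ms(TARGET_FPS):.2f} ms")
print("frames over budget:", missed_frames(samples))
```

Because the work per frame varies, it is the worst case rather than the average that has to fit inside the budget.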

When the display locks up, if you don’t keep your head still, you fall over.

HMD Display Options

Of the four approaches (ICCCM, RandR, Meta display server, and kernel changes), Keith and Dave Airlie thought that changing the kernel, to allow the HMD to be addressed directly from the application as a display, was the best option. This is analogous to the way the DRM is used by 3D applications.

“So all I need to do is add modesetting capability to the application … but it’s a privileged option.” “We didn’t want to just let applications gain that control, unlike a graphics card vendor that makes a proprietary driver.” So they decided to allow the window manager to offer a ‘lease’ on the HMD to the application; the lessor will let the application use the HMD while active, and if it crashes or exits, the desktop will take the lease back and clean up the HMD.
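The lease idea can be modelled in a few lines. This is a toy model only: it does not use the real kernel or X interfaces, and the class, method, and output names are invented. It shows the two properties described above: an output can only be leased to one application at a time, and the lessor reclaims it when the application goes away.

```python
class Lessor:
    """Stands in for the window manager / desktop that owns the outputs."""

    def __init__(self, outputs):
        self.free = set(outputs)
        self.leases = {}  # lessee name -> set of outputs it currently holds

    def grant(self, lessee, output):
        """Hand exclusive use of an output to an application."""
        if output not in self.free:
            raise RuntimeError(f"{output} is not available to lease")
        self.free.remove(output)
        self.leases.setdefault(lessee, set()).add(output)

    def revoke(self, lessee):
        """Called when the lessee exits or crashes: take everything back."""
        returned = self.leases.pop(lessee, set())
        self.free |= returned
        return returned


desktop = Lessor(outputs=["monitor", "hmd"])
desktop.grant("vr-game", "hmd")

# A second application cannot grab the HMD while it is leased.
try:
    desktop.grant("other-app", "hmd")
except RuntimeError as err:
    print("refused:", err)

# The game crashes; the desktop reclaims the HMD and can clean it up.
print("reclaimed:", desktop.revoke("vr-game"))
```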

The vblank API needed some modifications to support this - its timers weren’t fine-grained enough, and it had wraparound timers. It was a hassle to get the changes merged, and there was a lot of unpleasant bikeshedding around the change. Keith gave a shoutout to Daniel Vetter’s talk as describing the unpleasantness of the process.
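Why wraparound counters are awkward can be shown with a small sketch. It assumes a 32-bit frame counter purely for illustration (the notes don’t give the width) and uses the usual modular-difference trick for comparing counters that may have wrapped. The function name is made up.

```python
MASK_32 = 0xFFFFFFFF
HALF_RANGE = 0x80000000


def seq_after(a, b):
    """True if 32-bit counter value a comes after b, allowing for wraparound.

    Only meaningful when the two values are less than half the range apart.
    """
    diff = (a - b) & MASK_32
    return diff != 0 and diff < HALF_RANGE


just_before_wrap = 0xFFFFFFFE
just_after_wrap = 5

# A naive comparison gets the order wrong once the counter wraps...
print(just_after_wrap > just_before_wrap)

# ...while the modular comparison still sees the later frame as later.
print(seq_after(just_after_wrap, just_before_wrap))
```

A wider counter with finer-grained timestamps avoids having to make every caller do this kind of arithmetic.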

Keith broke the change down into three phases:

  1. Adding the API, which had no functional changes.
  2. Adding the code infrastructure.
  3. Exposing to user space.

Breaking into chunks like this makes it much easier to understand the changes, and is a low-risk way of getting it into the kernel.

“This is a live demo of a script I’ve never run before.”

Keith notes that he could extend the approach to allow leasing of output devices to X servers to provide multiseat setups. There have had to be a bunch of small changes to stop things crashing - for example, many X applications crash if RandR resources come and go.

For the Vulkan extensions, Keith tossed his original plans when it became apparent getting a new extension through the standards process was too hard, so he overloaded an existing NVidia standard. He did have to tinker with it, though, because NVidia don’t have any kind of access controls, so applications can scribble on anything, which Keith is not keen on.

Making distributed storage easy: usability in Ceph Luminous and beyond

Sage Weil

Ceph is Hard

This is a common complaint, and for a long time Sage ignored this, because distributed systems are hard. Over the last year, though, Sage has realised that if Ceph is too hard, then he’s not really succeeding at his goals. So they’ve been working on solving this problem.

Luminous Goodness

  • RADOS
    • BlueStore - direct management of the devices.
    • Erasure coding overwrites.
    • ceph-mgr: more scalable, connectivity to lots of management tools, RESTful API, new web dashboard.
    • AsyncMessenger by default.
  • RGW object store.
    • Search.
    • Compression.
    • Encryption.
    • NFS gateway.
  • RBD
    • iSCSI HA.
  • CephFS

But there has been a cost to this: a lower release cadence. From every 6 months, with alternating LTS releases, to 9 months with every release being an LTS and support for 2 cycles.


  • Clearer output from the command line, e.g. ceph -s.
  • Clearer health warnings with relevant detail.
  • Cleaner logging.
  • Redid the whole configuration infrastructure. Documentation is more useful. Clean separation of user vs. developer options.
  • Coming in Mimic: centralised administration of config files across the cluster, rather than relying on external config management.
    • May use DNS SRV for bootstrapping.
  • Simplify authn/z config.
  • Upgrades:
    • ceph versions shows you the farm.
    • Cluster-wide compatibility settings to prevent old machines joining.
    • It will stop you from configuring the cluster in a way that breaks the features you want, or breaks the current environment.

Easy Stuff

  • MTU sized pings between OSDs to make it easy to see if there are network problems.
  • Disable auto-out on small clusters.
  • Better defaults for HDD and SSD.

ceph-mgr

A new native component, which is a mandatory part of the install; an outage won’t affect data though. It’s intended to provide a better home for management functions than ceph-mon; it has a fast async view of the world.

This also has the happy side effect of allowing ceph-mon to scale much better.

It’s a natural integration point for plug-ins and external components, and makes it easy to write modules, including sophisticated tooling like the balancer, which forces a better distribution of data across the cluster.

Currently the main tunable for sharding is PG_NUM. Getting it wrong is a big problem, getting it right is hard. There is work in progress on making this easier to manage, having the manager able to retune automatically, and removing some of the limitations of changing the PG_NUM downward.
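For a sense of what “getting it right” involves, here is a sketch of the rule of thumb commonly quoted in the Ceph documentation of this era (it was not presented in the talk): aim for about 100 placement groups per OSD, divide by the replica count, and round up to a power of two. The function name and the default target are assumptions for illustration.

```python
def suggested_pg_num(num_osds, replica_count, target_pgs_per_osd=100):
    """Rule-of-thumb PG count: (OSDs * target) / replicas, rounded up to a power of two."""
    raw = num_osds * target_pgs_per_osd / replica_count
    power = 1
    while power < raw:
        power *= 2
    return power


# 10 OSDs with 3 replicas: 10 * 100 / 3 is about 333, rounded up to 512.
print(suggested_pg_num(10, 3))

# 200 OSDs with 3 replicas: about 6667, rounded up to 8192.
print(suggested_pg_num(200, 3))
```

The awkward part is that the right answer changes as the cluster grows, which is why having the manager retune it automatically is attractive.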

The servicemap is generic functionality to make it easier for tools to report on the overall health of the environment.


It turns out that a lot of users really, really want a web UI dashboard; as of this release it’s an un-auth’ed read-only dashboard. It can be enabled from the command line, and is built in to the ceph distribution now, so it can provide a solid reference dashboard and is manager hosted.

The best result from this is that SUSE have decided to fold openATTIC from an out-of-tree tool into the basis for the second version of the dashboard. This is driving some consensus around what things should look like:

  • Cluster operations such as managing pools and filesystems.
  • Rich Grafana metrics.
  • Deployment tasks such as provisioning new systems.

This is well underway but will not arrive before Mimic.

Deployment Tools

  • ceph-deploy is very basic and limited.
  • ceph-ansible and DeepSea are the best third party deployment tools.


  • k8s and containers are a bad fit for OSDs.
  • But a lot of ceph components, like the gateways, managers, and so on, are good fits for these things.
  • Containers are also an excellent choice for small, hyperconverged clusters.

This has led to a tool called Rook, started by Quantum. It’s a Kubernetes operator for ceph. It can run ceph daemons properly: for example it’s aware of the state of the quorum and keeps the right number of monitors running to maintain quorum; it can upgrade and be aware of the config changes required. This is likely to become the default choice for k8s deployments with ceph.
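The quorum arithmetic an operator has to respect is simple majority: more than half of the monitors must be healthy. The sketch below is not Rook’s code; the function names are invented, and it only shows the kind of check such an operator would make before deciding how many monitors to start.

```python
def quorum_size(total_monitors):
    """Smallest number of monitors that forms a strict majority."""
    return total_monitors // 2 + 1


def has_quorum(total_monitors, healthy_monitors):
    """True if enough monitors are healthy to form a majority."""
    return healthy_monitors >= quorum_size(total_monitors)


def monitors_to_start(desired_monitors, healthy_monitors):
    """How many replacement monitors are needed to get back to the desired count."""
    return max(desired_monitors - healthy_monitors, 0)


# A three-monitor cluster survives one failure but not two.
print(has_quorum(3, 2))
print(has_quorum(3, 1))
print(monitors_to_start(3, 1))
```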

Don’t Look Back in Anger: Wildman Whitehouse and the Great Failure of 1858

Lilly Ryan

Lilly introduces us to the problems of arsenic-doped wallpapers in the Victorian era. “If you wonder about the behaviour of people in this era, remember they were inhaling poison with their breakfast.”

“Reflecting on the mistakes of the past helps us do better in the future.” We must reflect on the past, and we must apply those lessons to the future. You can often get away with not doing this for a while, but it will catch up with you eventually.

One of Lilly’s favourite things is go-live parties; perhaps one of her favourite go-live parties was the one that happened in 1858, after the completion of the first trans-Atlantic telegraph cable. Telegraphs were a hugely important part of the changes in the 19th century world. By 1858 both Britain and the United States had rich, well-connected internal networks, but communication between them still relied on mail.

The celebrations were immense: they fired cannons, danced in the streets, and wrote the worst poetry Lilly has ever read. People really, really loved this cable. They were still celebrating a week later, until…

…the cable went dead.

These days, we tend to have some sort of way to deal with failure. But there was no fallback planned for the trans-Atlantic cable.

When your user base is the population of Europe and America, when your product is literally world-changing, failure is a big problem. Failure can be a great teacher! In this day and age we are more likely to deal with failure constructively. Unfortunately they were

Dr Edward Orange Wildman Whitehouse, with his magnificent sideburns, was a prolific holder of patents, and was appointed the Chief Electrician of the company. Which was unfortunate, since he had no experience with the field. And electricity was poorly understood at the time, so trying to run current along a 3,000 km cable was quite a gamble.

Another person on the project was William Thomson, an actual physicist. Cyrus Field, the funder of the project, hired him too. Unfortunately he and Whitehouse fell out immediately, and disagreed fiercely over, amongst other things, the construction of the cable. Whitehouse wanted high-voltage, high current, heavy cables; Thomson didn’t think this was a good design. The board backed Whitehouse, and stuck half of the cable on each of two ships; Thomson was nonetheless assigned to one ship to start spooling the cable.

The cable promptly snapped under its own weight. The next time, they started from the middle of the Atlantic. And they failed three more times, once snapped by a passing whale - “I like to call this the original fail whale”.

Finally they succeeded. Whitehouse now made the mistake that sealed his fate. The high-voltage transmission was ramped up even further by Whitehouse, in secret, because he believed that it would make the messages clearer and quicker. Unfortunately it melted the cable.

Unsurprisingly Field, the board, and the investors were rather unhappy. Questions were asked. The best companies would take a step back and think about how things had gone wrong. Which is what happened. There was a public enquiry. Thomson testified as to his advice, which had been ignored, and others testified that Wildman Whitehouse had interfered with the cable after go-live.

As often happens, people who are put on the defensive react poorly. Wildman Whitehouse did what Victorian gentlemen do when extremely angry: he wrote a pamphlet. It might as well have been titled “everyone is wrong but me”. He disclaimed all responsibility for the problem, claiming he had done the right thing all along in the face of opposition.

Publishing this defensive pamphlet did not have the desired effect. Mostly it convinced them they no longer wished to work with him.

  • Open mindedness is crucial; during the project and after.
  • Treat feedback sensitively, with tact, and preferably in private.
  • Assume everyone did the best job they could, given the constraints they were operating under.
  • Most importantly there is no room for self-styled heroes. One of the strongest findings was that Wildman Whitehouse chose to act on his own, acting against the best interests of the project.

So what happened to the project after the pamphlet? Well, the company learned something about teamwork: they appointed Thomson as Chief Electrician, who not only had a better grasp of the science involved, but also worked well with his team. They used a low-voltage, light, and flexible cable.

The first time they went to deploy it, the cable snapped. But they were able to repair it, rather than discard it; they drew a second cable, and only after testing, they announced it was ready.