QCon London 2016 – Day 2

If day one of QCon was about Continuous Delivery for us (which you can read about here), then day two was about large-scale architectures and using containers in production. Day two was packed with some great tracks: ‘Architectures you’ve always wondered about’, ‘Close to the metal’, ‘Containers (in production)’, ‘Modern CS in the real world’, ‘Security, incident response & fraud detection’, and ‘Optimizing you’. For day two, we split our time between the ‘Architectures you’ve always wondered about’ (intriguing, right?) and ‘Containers in production’ tracks, with a brief sojourn to hear more about unikernels. Unikernels are seen as a future part of the container ecosystem – reflected in part by Docker’s recent purchase of Unikernel Systems.


The first of the architecture talks was given by Stephen Godwin from the BBC on the iPlayer’s transformation from monolith to Cloud-based (AWS) microservices. Did you know the iPlayer is used by about a third of all adults in the UK? And it receives about 10 million requests a day to play back video. According to Stephen, after moving to the Cloud and the new architecture, they are now processing 21TB of data a day into AWS S3!

At the beginning of 2013, the iPlayer transitioned from monolith to microservices over the course of nine months. This has meant that more than double the amount of video is available on iPlayer, and it’s been possible to extend the availability of content from 7 to 30 days. As a side note, we had assumed these limitations were due to regulations or other factors – definitely not technical limitations. These changes have also meant iPlayer can handle sudden spikes caused by, for example, large sporting events like Wimbledon. The move to microservices has been accompanied by a move to a Continuous Delivery (CD) deployment approach (of course!). This means changes can be deployed to production in under 15 minutes, with developers delivering hundreds of changes and carrying out dozens of deployments a week. In total, the team are running about 20 microservices, each with about 600 Java statements (now that is small). The idea is that only a few developers work on each service, which means the deployment process can be streamlined (including committing changes to trunk) – delivering changes quickly. For testing, they use the common trio: TDD/ADD/BDD, and it was noted that developers spend about 60% of their time writing tests!

Once again, moving to microservices and employing CD has meant faster deployments and better scale – and, ultimately, a better experience for users. That’s another win.

Next up was a great talk, ‘#NetflixEverywhere Global Architecture’, from Josh Evans, Director of Operations Engineering at Netflix (NF). NF are now truly global – covering most of the planet (besides China), and even the air: NF is available on some US carriers. However, this wasn’t always so, and it has taken a lot of engineering to run a truly reliable, global infrastructure serving 75 million customers. To that end, NF has invested heavily in resiliency. Failure is inevitable, of course, and they call their approach ‘failure-driven architecture’ – “never fail the same way twice”.


NF run all of their infrastructure in AWS – they’re all in. Josh took us on a fascinating journey through the evolution of NF’s four key architectural pillars: microservices (NF are seen as an early adopter of microservices), databases, caches, and traffic. For microservices, some of the big challenges they faced were around routing and failure. To address some of these issues, NF wrote and open sourced some interesting tools, including the well-known Simian Army and, in particular, the Chaos Monkey, which randomly terminates parts of the system to test resiliency. For an interesting read about the Chaos Monkey, check out Netflix’s blog. For databases, as they’ve scaled, NF have tried a number of solutions, starting with AWS SimpleDB and ending with Apache Cassandra (to handle billions of records), replicating Cassandra’s rings globally (across AWS Availability Zones and AWS Regions) using Apache Kafka (initially using AWS SQS). As for traffic, NF are able to recover from all sorts of failure – including AWS Region failures – with the ability to route traffic to other regions and then back again. Ultimately, the scale of NF’s infrastructure in AWS and how it’s evolved to adapt to failure (developing so-called ‘resiliency patterns’) is remarkable.
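To make the Chaos Monkey idea concrete, here is a minimal sketch of the core mechanic – our illustration, not Netflix’s actual code: pick a random instance from a service group and terminate it, forcing the surrounding system to prove it can cope. The `Instance` type and `terminate` hook are hypothetical stand-ins; a real implementation would call the cloud provider’s termination API.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Instance is a hypothetical stand-in for a running VM or container.
type Instance struct {
	ID string
}

// terminate is a placeholder hook; in a real setup this would call
// the cloud provider's termination API.
func terminate(i Instance) {
	fmt.Printf("chaos: terminating instance %s\n", i.ID)
}

// unleash picks one random instance from the group and kills it --
// the essence of chaos-monkey-style resiliency testing.
func unleash(group []Instance) {
	if len(group) == 0 {
		return
	}
	terminate(group[rand.Intn(len(group))])
}

func main() {
	rand.Seed(time.Now().UnixNano())
	group := []Instance{{ID: "i-01"}, {ID: "i-02"}, {ID: "i-03"}}
	unleash(group)
}
```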

Lastly in the architecture track, Micah Lemonik from Google gave an interesting talk on the history of the Google Docs architecture, in particular looking at real-time collaboration on documents like spreadsheets and slides. Micah went through some of the complexities of managing collaborative data models, that is, data models that are being concurrently updated by multiple users. At its core, the system keeps data models in memory (the servers are stateful) and relies on operational transformation (OT): an approach to collaborative editing of documents that is responsive and ensures users’ documents will converge on the same state. Finally, Micah explained how OT can be applied to spreadsheets, and in particular how one might go about designing an ‘undo’ function. It was fascinating to see that the backend of Google Sheets, Slides, and Docs has remained essentially unchanged since it was originally developed in 2003/04 – and to understand why.
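To give a flavour of the core OT idea – a toy sketch of our own, not Google’s implementation – consider two users concurrently inserting text into the same document. Each replica transforms the ‘other’ operation against the one it has already applied, shifting its position, so that both replicas converge on the same state:

```go
package main

import "fmt"

// Insert is a simplified OT operation: insert Text at position Pos.
type Insert struct {
	Pos  int
	Text string
}

// transform adjusts op so it can be applied after `concurrent` has
// already been applied: if the concurrent edit landed earlier in the
// document, op's position shifts right by the inserted length.
// (A real implementation also needs a deterministic tie-break, e.g.
// by user ID, when the two positions are equal.)
func transform(op, concurrent Insert) Insert {
	if concurrent.Pos <= op.Pos {
		op.Pos += len(concurrent.Text)
	}
	return op
}

// apply inserts op.Text into doc at op.Pos.
func apply(doc string, op Insert) string {
	return doc[:op.Pos] + op.Text + doc[op.Pos:]
}

func main() {
	doc := "QCon"
	a := Insert{Pos: 0, Text: "Hello "} // user A's edit
	b := Insert{Pos: 4, Text: " 2016"}  // user B's concurrent edit

	// Replica 1 applies a first, then b transformed against a.
	r1 := apply(apply(doc, a), transform(b, a))
	// Replica 2 applies b first, then a transformed against b.
	r2 := apply(apply(doc, b), transform(a, b))

	fmt.Println(r1) // "Hello QCon 2016"
	fmt.Println(r2) // "Hello QCon 2016" -- the replicas converge
}
```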

The first container talk we attended was given by Mitchell Hashimoto, CEO of HashiCorp and the creator of Vagrant and Consul, called ‘Observe, enhance, and control: VMs to containers’. Mitchell gave a great overview of how infrastructure is evolving from VMs to containers. In the Age of Virtual Machines (circa 2006, with public clouds appearing), infrastructure consisted of datacentres that had no APIs, offered no elasticity, and ran monolithic applications – and IaaS was very young. The problems faced included uniformity of servers, scalable change management, auditing server state, and early service discovery, and systems were developed to address these types of problems: for monitoring, Nagios and Sensu; for configuration, Chef and Puppet; for deployment, Fabric, Chef, and Puppet. These tools are still widely used (we use a few!) but are perhaps no longer best in class.

Fast forward to 2016: containers are taking over. What happens if we apply the same reasoning to predict where systems are going? Datacentres are now quite different: they’re API-driven; highly elastic; using small, bin-packed servers; running containers on VMs; and fast. With these characteristics, the class of problems has naturally changed. They’re now around infrastructure management; service discovery and load balancing; configuration management (at many levels, e.g. host, container runtime, containers); and scale: speed and size. Mitchell posits that the software to address these issues consists of distributed systems: situated in environments where failures are expected (e.g. Cloud environments, noisy neighbours); API-driven, with infrastructure managed as code; and built for low resource usage, i.e. written not in dynamic languages like Ruby and Python, but in ones better suited to building distributed systems, e.g. Golang. The new tools include sysdig and Datadog for monitoring; Consul and etcd for configuration (coordination stores in the vein of Apache ZooKeeper, rather than configuration management tools); and systems like Apache Mesos, Kubernetes, and Nomad for deployment – or rather, orchestration and scheduling. A small example of the newer style follows below.
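As a taste of that newer tooling, here is a minimal sketch of registering a service with a local Consul agent over its HTTP API (the /v1/agent/service/register endpoint), assuming an agent is listening on Consul’s default port 8500; the service name and port are made up for illustration. Once registered, other services can discover it, e.g. via Consul’s DNS interface (web.service.consul).

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical service definition: a "web" service on port 8080.
	payload := []byte(`{"Name": "web", "Port": 8080}`)

	// Register with the local Consul agent's HTTP API.
	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:8500/v1/agent/service/register",
		bytes.NewBuffer(payload))
	if err != nil {
		panic(err)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("register status:", resp.Status)
}
```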

It is our view that the ‘old’ tools will certainly be used for some time (if only because they are so widely used – and they do work). However, we would broadly agree with Mitchell’s position because, simply put, there is ever-increasing pressure for speed (and flexibility) – to reduce the cycle time, to “observe, enhance and control” fast – and containers (and their ecosystem) are part of the next revolution.

Next, going beyond containers, we turn to unikernels. Unikernels are seen as the next logical step after containers. They are similar to containers but do not rely on a host OS – they compile against library OSes, including only what is required for the application to run. This makes them much smaller than, say, Docker images. Unikernels are a hot topic right now, and we were certainly curious to find out more. In search of knowledge and understanding, we decided to attend a couple of talks on unikernels. The first was given by Anil Madhavapeddy, an engineer at Docker and one of the original team behind the Xen hypervisor, called ‘Not quite so broken TLS using unikernels’, and the second by Docker engineer Justin Cormack, talking about how to ‘Build, ship, and run unikernels’. Anil’s talk was a deep dive into a project rebuilding a TLS (transport layer security) implementation using unikernel techniques. The outcome was a clean reimplementation (in OCaml!) of the protocol as a tiny unikernel that was well understood (compared to the arcane C implementation) and fully type-safe. A very interesting talk.


Justin gave a good overview of unikernels and why we should be interested. Whereas containers depend on the host OS, unikernels are self-contained – bundling all their dependencies, and only their dependencies. One powerful consequence of this is that their attack surface is much smaller – they are more secure out of the box. The box is also much smaller: images range from small to tiny – for example, one DNS server unikernel was 2/3 KB in size (!) – whereas, for reference, a Docker image based on an Ubuntu base image might be several GB. Justin noted that unikernels are where Linux containers were three years ago, before Docker: they are being used in production but have yet to achieve widespread adoption. Expect big things to come.

Unikernels are making their way out of the lab, especially with Docker’s purchase of Unikernel Systems. Expect a drop-in replacement for Docker containers soon. Exciting!

What we learnt on day two:

  • Architectures evolve fast – be ready for change; be flexible.
  • Software architecture now: microservices; immutable infrastructure; containers.
  • We are entering the ‘Age of Containers’. And it is a revolution.
  • Unikernels next.
