Today is the OpenShift Commons day - a bunch of information, mostly from
Brandon Philips and Clayton Coleman
“We wanted to bring the smartest Kubernetes developers together,” said tongue in cheek.
“Developers and people who understand how the Internet works are horribly outnumbered. New users are joining the Internet user base faster than the developers and ops people needed to support them. CoreOS is about how we make it as easy as possible to manage that, because we can’t train our way out of that.”
“We wanted to change the way people operate servers. No matter where you run your application, the experience should be the same. The infrastructure around your application should be the same, no matter where you run it. That’s the CoreOS mission.”
“OpenShift is about the easiest way to ship software. Make it easy to iterate and deliver to production. OpenShift pre-dates k8s, but switched over to using it 3 years ago. Kube is now the default orchestrator choice.”
So what’s the shared mission - “Automated operations.” Cloud is essentially two parts: renting someone else’s computer, which is an old model; the other component is “I make an API call, and now infrastructure happens.”
So what do we automate? ALL THE THINGS! The platform and the application; these things exist to serve the developers and the operators. So what we want is the Automated Platform, based around CoreOS Tectonic:
- One-click updates to the OS.
- Everything is run under Kubernetes, including the k8s components themselves.
- These automated operations go all the way from the cluster down to the individual machine (bare metal or VM).
- This means you can update everything from the ground up, invisibly to the users of the environment.
That leads into Automated Applications:
- How do you build applications that are easy to deploy in a cloud-native fashion?
- The Operator Framework: individual components, like a cache, are wrapped in Kubernetes APIs.
- "Now at Red Hat we have the horsepower to make this happen."
- No application is an island - chargeback, data, storage, service catalogues and so on are all part of the lifecycle.
Brandon - “We want to give people the ability to inject new services into the cloud, including the public cloud, which is something you can’t really do today.”
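The operator idea above boils down to a reconcile loop: observe the desired state declared in a Kubernetes resource, compare it with the actual state, and act on the difference. A minimal sketch in plain Python - the dicts stand in for cluster state; this is not a real Kubernetes client:

```python
# Minimal sketch of the operator reconcile pattern: compare desired state
# (what the custom resource declares) with actual state and compute actions.
# All names here are illustrative stand-ins, not a real Kubernetes API.

def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to drive actual state toward desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name, None))
    return actions

desired = {"cache": {"replicas": 3}, "web": {"replicas": 2}}
actual = {"cache": {"replicas": 2}, "old-job": {"replicas": 1}}
print(reconcile(desired, actual))
```

A real operator runs this loop continuously against the API server; the point is that day-2 operations become code rather than runbooks.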
Building Streaming Data Platforms using Kubernetes and Kafka
- Provide services to airlines.
- OpenShift users for 4 years.
- Using append-only streams of immutable events.
- You can play and replay events as you need.
- You can process them as you need.
- A common initial use case: the application logs.
- Consumers then update data stores and caches independent of the source application.
- Eventually consistent.
- You can add a stream processor to update an output stream.
- This output stream allows you to create complex, event-driven applications very easily.
- Kafka is a great tool for this.
- You can either call the Kafka API directly. Easy when you’re stateless, but hard if you want state.
- There are a number of Stream Processor frameworks to make this easier - Spark, Storm, etc.
- This model gives you:
- Loose coupling and resilience.
- Flexibility & agility.
- Auditability and error recovery - immutability means you can always go back and demonstrate what happened, and replay to recover from failures.
- This is foundational for a data-driven architecture.
- Pluggable business logic - microservice or serverless patterns.
- Can parallel feed your analytics processes.
- Feed insights back into the applications - ML models for example can be fed back into the business layer.
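The append-only model above can be sketched in a few lines: events are immutable, consumers build derived views independently, and replaying the stream reconstructs state from scratch. A purely illustrative in-memory model, not a Kafka client (the keys and values are made up):

```python
# Sketch of the append-only event stream idea: the log is the source of
# truth, derived views are rebuilt by replay, and history is never lost.

log = []  # append-only stream of immutable events

def emit(event_type, key, value):
    log.append({"type": event_type, "key": key, "value": value})

def replay(events):
    """Rebuild a key/value view by replaying the stream from the beginning."""
    view = {}
    for e in events:
        if e["type"] == "set":
            view[e["key"]] = e["value"]
        elif e["type"] == "delete":
            view.pop(e["key"], None)
    return view

emit("set", "fare:LHR-JFK", 420)
emit("set", "fare:LHR-JFK", 435)   # later event supersedes, but the old one stays
emit("delete", "fare:CDG-SIN", None)

cache = replay(log)                 # a consumer-maintained, eventually consistent view
print(cache)                        # → {'fare:LHR-JFK': 435}
print(len(log))                     # → 3: full history kept for audit and recovery
```

Because the view is a pure function of the log, error recovery is just replay - which is exactly the auditability argument made above.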
How do we run all this?
- Kafka didn’t run well in OpenShift for some time. It wanted persistent components, which OpenShift didn’t offer.
- OpenShift now offers Stateful Sets - consistent names, persistent storage, ordered startup and shutdown.
- They have been running this for a year.
- Use affinity and anti-affinity for cluster resilience.
- Persistence - common wisdom is to use persistent volumes. Amadeus have found that replication is adequate, with local, non-persistent storage.
- What about topics?
- Many microservices means many topics.
- This means topic management becomes a challenge:
- Make sure topics exist when applications depend on them.
- Make sure they’re consistent across environments.
- A k8s ConfigMap is used to describe the topics, which is then applied to the Kafka cluster.
- This then allows them to be delivered alongside other applications as needed.
- Topology management
- A tool to develop and design topics and services.
- An operator to visualise, deploy, and monitor.
- Available via the OpenShift console.
- Prometheus monitoring for each service.
- OpenTracing for following messages across microservices, whether streaming or HTTP.
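The declarative topic management described above - topics declared in config and reconciled against the cluster - can be sketched as follows. The topic names and settings are invented for illustration; the real system drives this from a ConfigMap:

```python
# Hedged sketch of declarative topic management: topics are declared in
# config and compared with what the cluster actually has, so applications
# always find the topics they depend on, consistently across environments.

DECLARED = {
    "bookings": {"partitions": 12, "retention_ms": 604800000},
    "app-logs": {"partitions": 6,  "retention_ms": 86400000},
}

def plan(declared: dict, existing: set) -> list:
    """List the topics that must be created in this environment."""
    return sorted(name for name in declared if name not in existing)

existing_topics = {"app-logs"}
print(plan(DECLARED, existing_topics))  # → ['bookings']
```

Shipping the declaration alongside the application is what lets topics "be delivered alongside other applications as needed."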
- Notification at Scale
- With thousands of nodes for the pricing applications this becomes a challenge.
- Pricing is propagated across the nodes.
- First level cache - per node.
- Second level - central cache.
- Around 20k notifications per second.
- This is delivered via the Kafka stream.
- Small (200 byte) messages.
- Fetched from Couchbase as the data bus, with Kafka as the notification system.
- Will evolve out to multi-region, multi-cloud.
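The two-level cache plus notification scheme above can be sketched with in-memory stand-ins - the real system uses Couchbase for the data and Kafka for the ~200-byte notifications, but the invalidation logic is the same idea:

```python
# Sketch of the two-level cache: a per-node first-level cache, a central
# second-level store, and small notifications that evict stale local entries.
# Dicts stand in for Couchbase and the per-node cache.

central = {"fare:123": 100}   # second-level (central) cache
local = {}                    # first-level (per-node) cache

def read(key):
    if key not in local:      # miss: fall through to the central cache
        local[key] = central[key]
    return local[key]

def notify(key):
    """A tiny notification carries just the key; the node drops its copy."""
    local.pop(key, None)

assert read("fare:123") == 100   # populates the local cache
central["fare:123"] = 105        # a price update lands centrally
assert read("fare:123") == 100   # local copy is stale until notified
notify("fare:123")
assert read("fare:123") == 105   # re-fetched after invalidation
```

Keeping the notification to just a key is what makes 20k messages per second cheap: the payload travels over the data store, not the notification stream.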
OpenShift at Barclays
- Old bank with a lot of tech debt: 327 years old.
- 30% of the annual UK GDP processed every day.
- Moving to a view of being a “technology company that does banking, not a banking company with a technology department.”
- aPaaS is the Barclays PaaS service - OpenShift is the core of the platform, but they have spent a lot of time working on other aspects of the platform: how it operates, accountabilities and so on.
Started with OpenShift 2 (pre-Kubernetes). Now tens of thousands of pods. Struggling to keep up with growth.
Deployed separate clusters in two different datacentres, joined by a load balancer.
- The registry, however, is shared.
- Everyone has to deploy into both data centres - active-active is a must.
- This is new to the bank - traditionally active-passive.
- This is the first place they’ve achieved this.
- Separate production and pre-production environments.
Deployed via Chef.
- Probably not a great idea - repackaging OpenShift’s Ansible-driven deployment into Chef causes a slowdown.
Master nodes have been split into their components over time.
Continual iteration over time.
It’s also in the DMZ environment.
- This adds many challenges - splitting OpenShift across a traditional multi-firewall environment.
- Iterate on governance to modify how they operate to line up with regulatory and security requirements.
- No builds in the DMZ - only uses pre-built containers.
- Token auth rather than AD for managing.
- Gluster for the internal PVs, while the campus is NFS.
- Self-service focus.
- Accountability needs to follow control.
- Developers own their own things.
- Patching just happens. This proves out the resilience, and also forces people to write their applications correctly.
- Three teams support the platform.
- Multiskilled, dedicated team.
- Consultancy and pre-sales roles to help sell and on-board.
- Extensive documentation.
- Pre-prod is treated as production.
- Extensive marketing.
- Two day developer labs.
- Aim for pay as you go.
- The financial structures are the hardest problem they’ve encountered.
- Funding for continual growth.
- Culture is harder than tech.
- Multi-tenancy is a challenge.
- Steep learning curve.
- A lot of trust required.
- Application-level resilience.
- De-commissioning old systems.
- Finance processes.
“Entire teams can go live with applications without anyone in ops needing to know about it.”
Making Big Yellow Fly: DHL goes bimodal with OpenShift
- The challenge: a huge company, and keeping everything running and going faster in an intensely competitive market.
- How to be faster and more flexible.
Split into two modes: mode one is waterfall, predictable and reliable; mode two is fast and flexible, with new processes, organisations, and technology.
- Mode 2: zero downtime container platform.
- Automated deployments with an appropriate commercial model.
- New skills and mindset: new software architecture, new development models, new team structures.
- An entire end-to-end ecosystem for developers.
- They turn up with source code and everything is set.
- Start on-premise for proximity to existing applications.
- Multiple zones and clusters: production and preproduction; internal and DMZ.
- Driven by not wanting to change the existing firewall landscapes.
- But run production and test on the same clusters.
- Management is segregated and has access to the different clusters.
- No east-west traffic between the firewall zones.
- This requires the routers to be separated.
An OpenShift cluster is fun to implement, but where do you go from there? You need fast processes that scale - a CI/CD pipeline that scales:
- Jenkins & git.
- Artifactory for the registry - base images come from Red Hat.
- Fortify security scan and Sonarqube code scan to keep the quality to a high level.
- Selenium functional testing.
- Integrate change management into the platform so the change process doesn’t become the bottleneck.
- The traditional DHL process is a two-week approval lead time.
- ServiceNow is integrated: tickets are raised automatically and auto-approved if the pipeline shows that the checks pass.
- Pay-per-use, à la cloud.
- “Application boxes” limit the cost exposure for projects; resource use is capped. To stop people gaming the charges, the floor is actual use or 20% of quota, whichever is higher.
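The charging floor above is simple arithmetic: bill the greater of actual use and 20% of the reserved quota. A small sketch (the unit price is an invented example):

```python
# Sketch of the pay-per-use charging rule described above: projects pay for
# actual use, but never less than 20% of their reserved quota, which
# discourages reserving capacity that then sits idle.

def monthly_charge(actual_use: float, quota: float, unit_price: float) -> float:
    billable = max(actual_use, 0.20 * quota)
    return billable * unit_price

print(monthly_charge(actual_use=10, quota=100, unit_price=2.0))  # → 40.0 (floor applies)
print(monthly_charge(actual_use=60, quota=100, unit_price=2.0))  # → 120.0 (actual use)
```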
Maintaining Container Images at Cisco
- Using Quay to provide a “container as a service” model. Quay was deployed prior to OpenShift.
- The goal was to manage container use by developers at Cisco.
- Highly available, multi-datacentre.
- Restrict image source.
- Enterprise RBAC.
- Vulnerability scanning.
- UI: visibility, promote image sharing.
- Vendor support.
- Looked at Docker Trusted Registry, Artifactory, OpenShift, and Quay. Quay won.
- Multi-DC architecture fronted by a global load balancer.
- DC 1 is the primary, with two others.
- Backed by PostgreSQL, with a warm standby model and manual failover.
Blue Cross Blue Shield of Florida
- Florida was the first state to adopt OpenShift as a container platform.
- Apps and microservices.
- Health care for 4 million customers. A not-for-profit policy-holder owned mutual company.
- On OpenShift since 2015. All in on 2016.
- Seven upgrades with zero application downtime.
- Process seven billion dollars a year in payments.
OpenShift isn’t just for business applications, though.
- By running their tooling on OpenShift, they have improved how they manage it.
- There was a lot of resistance initially; “isn’t this a dev thing”, “our tools don’t go down.”
- It was also an exercise in eating their own dogfood and understanding how their platform works for their developers.
Fry was their first application to cut over.
- Multilanguage, poorly maintained.
- Looked after NAS, tape robots, and other storage.
- Single-instance - downtime means they can’t monitor and manage storage any more.
- People were very nervous - it’s an important application!
- Move to a multi-site, easy-to-deploy model.
- Rewrite with Django, deployed via S2I.
- No more SPOFs.
- No more downtime.
- Continuously available across all datacentres.
- Autoscales under load rather than crashing as it used to.
What did they learn:
- How to debug OpenShift.
- How config and deploy really works.
- They had the answers before devs or business units came to them.
- Found platform bugs in ScheduledJobs and other alpha features.
- Disseminate knowledge.
- Create a community of practice/support.
- There will be demand.
They’ve now moved many of their other tools - Prometheus, rocket.chat, Grafana and so on - into OpenShift.
Niklas Tanskanen, Jesse Haka, Severi Haverila
Elisa is a telco providing mobile, videoconferencing, IPTV, and other solutions.
Rather than trying to force everyone onto OpenShift, they chose to hook into a rewrite of a service that was being refreshed - their video-on-demand solution for TV and movie rental. The team weren’t familiar with containers, but were using Jenkins.
So they migrated the users’ Jenkins to integrate with OpenShift. The pipeline functionality quickly became more complex as more functionality was added; this led them to create a shared library on top of the base Jenkins integration, which included multi-cluster deploys, canary deployments, and code analysis.
The Jenkins plugin uses templates for things like route configuration, so the developer is abstracted away from any configuration beyond their Jenkins and git repo.
This user-friendliness became a problem: the developers were so abstracted away from the platform that they didn’t learn enough to effectively support their applications running on it. It’s necessary to find a balance between making onboarding an easy experience and learning to use what’s under the hood.
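The template approach described above - developers supply only an app name and a git repo, and the pipeline fills in route configuration - can be sketched as below. The field names follow the general shape of an OpenShift Route, but the function, domain, and defaults are all invented for illustration, not Elisa's actual shared library:

```python
# Hedged sketch of template-driven configuration: the developer provides a
# name, the pipeline renders everything else. 'apps.example.com' and the
# function name are made-up illustrations.

def render_route(app_name: str, cluster_domain: str = "apps.example.com") -> dict:
    return {
        "kind": "Route",
        "metadata": {"name": app_name},
        "spec": {
            "host": f"{app_name}.{cluster_domain}",
            "to": {"kind": "Service", "name": app_name},
        },
    }

route = render_route("vod-frontend")
print(route["spec"]["host"])  # → vod-frontend.apps.example.com
```

This is exactly the double-edged sword the talk describes: the developer never sees the Route, which speeds onboarding but hides the platform.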
ACI Universal Payments
- World-spanning payments system.
- Includes payments intelligence.
- 80 ms to make a call as to whether a payment should be made.
- That has to include fraud, anti-money laundering and so on.
- All transactions go to the real-time model, with the model being updated from data sent to the data science area.
- Scaling became harder and harder.
- Breaking things down into microservices made this easier to scale.
- Microservices are easier to scale in a low-latency fashion.
- Highly secure. Never lost a transaction in 40 years.
- Replaced traditional RDBMS with Cassandra.
- Rode out Black Friday and Cyber Monday without ever breaching 30 ms.
- Cassandra and Hadoop are not yet on OpenShift. That’s their next big goal.
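The scoring flow above - every transaction hits a real-time model inside an 80 ms budget, while the model itself is refreshed out of band by the data science side - can be sketched like this. The class, the placeholder rules, and the thresholds are all illustrative stand-ins, not ACI's actual system:

```python
# Sketch of real-time scoring with an out-of-band model update: traffic keeps
# flowing while the model reference is swapped, and each call is checked
# against the latency budget. The "models" are trivial placeholder rules.

import time

class Scorer:
    def __init__(self, model):
        self.model = model            # current model, swapped by the updater

    def update_model(self, model):
        self.model = model            # new model from the data science pipeline

    def score(self, txn, budget_ms=80):
        start = time.perf_counter()
        decision = self.model(txn)    # approve (True) or decline (False)
        elapsed_ms = (time.perf_counter() - start) * 1000
        return decision, elapsed_ms < budget_ms

scorer = Scorer(lambda txn: txn["amount"] < 1000)      # placeholder fraud rule
decision, in_budget = scorer.score({"amount": 250})
assert decision and in_budget
scorer.update_model(lambda txn: txn["amount"] < 500)   # tightened model, no downtime
assert scorer.score({"amount": 750})[0] is False
```

Separating the hot scoring path from the model update path is what lets the fraud/AML checks stay inside the latency budget while the model keeps learning.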