Red Hat Summit 2018 Day 2

Breakfast was, again, a bit of an interesting experience. And by interesting I mean sugar-laden. Everything tastes very sweet to my palate, but I guess this is the norm in the States.

Keynote

Jim Whitehurst

Last year’s keynote was “planning is dead”, but Jim wants to revisit that, because people come to him and say they still need some planning and forecasting. The “planning” he means is “planning writ large” - the traditional model of a multi-year strategic plan with detailed top-down plans for execution and delivery, which doesn’t really work in many cases. The world is a more uncertain place, and you can’t plan for every contingency.

Planning can be replaced by configuring - set yourself up to respond and act - “Configure for constant change”. You need to “try - learn - modify” - apply agile/devops principles to broader organisational problems and strategy. Build modularity into the structures of your architecture; aim for flexibility, not highly-optimised systems that are hard to change.

Rather than prescribe, enable people to do what is needed. Automation - replacing and augmenting people - is not the point here. Rather, how do you get people the context, information, and tools they need to make good decisions without having to check back up a management chain before they act?

Engagement - what do leaders do, if not plan and drive the execution? Well, you need to be focused on helping people want to do the right things. IT is now the foundation of most businesses - and in many cases IT are waiting on the business, not the other way around.

Adel Al-Saleh - T Systems

Invested over a billion euros in building new ICT capabilities for their customers - those customers are demanding results at a higher cadence, and expecting providers to understand what they do. T Systems have been investing heavily in OpenStack and OpenShift as a foundation for service offerings.

An example is connected cars - it used to take years to deliver new applications to cars, but the platforms T Systems provide to accelerate development and delivery have reduced that to a month or two.

Security and GDPR are big challenges, ongoing challenges.

José María Ruesta - BBVA

BBVA has 73 million customers across a number of different countries. Blockchain - “I’ve got no idea. If you know [how this will affect banking] I have a job for you.” “Financial services over the next five years? I will be five years older.”

How do you prepare yourself for success when you don’t know what the future will look like? You build for risk and efficiency, so that the foundations are sound. Risk management doesn’t mean buying the most expensive products from the biggest countries in the world. Efficiency doesn’t mean tailoring to every country or business unit as though they were completely unique; that creates a mess of complex, undocumented systems, and you can’t innovate because your base run function is too complex.

BBVA have replaced local proprietary systems with global, open source ones. Drive down the cost of transactions, and make it as easy to work and deliver as possible, in a secure, reliable fashion. It’s a long, sometimes tough journey - but BBVA are succeeding. They are processing 3,000 transactions/second (50% of Spain’s volume) on commodity systems that have replaced proprietary ones.

Osmar Alza - DNM Migraciones and David Abrahams - IAG

The National Migration Administration manages immigration and visits to Argentina: 80 million events last year. Argentina has a large border area, and the department is responsible for securing it. The DNM are moving to a more predictive system, using many diverse data sources, many of them external, and working carefully within legal and security boundaries. They have implemented a private cloud using Data Virtualisation and OpenShift (amongst other things); every border decision draws on up to 2 billion data points.

The insurance industry has been a bit insulated from disruption - as a highly regulated industry, it’s had a high barrier to entry; nonetheless customers are pushing for change in the sector, asking why they can’t have the same flexibility and ease of use as they find elsewhere. Moreover, data has exploded - so taking in information and using it to make decisions has become a challenge.

Tying disparate systems together is a challenge - IAG has grown via mergers and acquisitions, with lots of data silos, which is a bad look for a company that is fundamentally a data company. The push is to build a unified data platform which is easier for the right people to access. Open Source (Kafka, PostgreSQL) has been key to this because it’s allowed them to change the tools to fit their needs, rather than depend on what a closed-source vendor will drop every 10 years.

Tobias Mohr - Lufthansa Technik and Nick Costidas - UPS

Tobias - their service offering is an aviation platform (across the industry) that provides predictive analysis for maintenance of aircraft fleets - and they were mandated to do this within 100 days. They achieved that by putting their best experts together and giving them a free hand to do as they thought best. They had their first cut of the app ready after 30 days to show a customer. Having good infrastructure in place was critical so it wasn’t in the road.

A self-service data integration service allows airlines to on-board in a day.

Nick - technology is the core of their operations and always has been. They’re always looking to use technology to improve their operations; this award is for Edge and Sight, which allows UPS staff to visualise what’s going on across billions of events and adjust operations across 34,000 people accordingly.

Operators can make decisions on staffing, routing, equipment. It’s great for staff, because they can make decisions at the edge. OpenShift is a foundational tool for this - it’s changed how they develop and deliver; they reduced delivery times from 12 to 18 months down to weeks. Even better, it makes it easy to scale during their five-week peak period, and span private out to public cloud as needed.

Marco Bill-Peter

Support org - not just the support team, but security, documentation, and quality engineering. This is unusual, but it has a lot of benefits, particularly in terms of being able to, for example, build upstream testing that reflects real-world use cases; a cross-functional team.

Erica Kochi - unicef Innovation

Worked on a school mapping project with Innovation Labs; open source is critical because they can take the tools and systems they build to any country in the world, and it allows anyone to contribute.

In many of the Latin American countries unicef work in they are dealing with schools in places with a history of natural disasters and conflict, and so understanding what capabilities schools have, what their challenges are, leads into understanding where and what unicef should target in order to help communities.

If you want to help, look at unicef stories and get in touch.

Dr Ellen Grant - Boston Children’s Hospital

Data overload is a key problem for clinicians. ChRIS could “change medicine as we know it today”. ChRIS helps pull data from the archives and provides rapid data processing to give bedside data analysis while working with patients. Doctors work in seconds or minutes, while historically this sort of analysis took weeks or months.

The roots of this came from not having image analysis while working with patients, and having to spend a lot of time learning the computer science, which is not a doctor’s main skill. Dr Grant’s team didn’t want to work with proprietary vendors because they wanted to work with actual experts - deal directly with the programmers building the tools as part of an open community, rather than having layers between physicians and those programmers.

Open is also important because the traditional medical software industry will take the data and lock it away, leaving a third party in charge of the patient data and clinical analysis; doctors are uncomfortable with this. An open source approach leaves the hospital or a consortium of clinicians in charge of the patient data.

The future looks like better collaboration within divisions in the hospital, breaking down the barriers between the various specialties and tools; and work between various hospitals. Particularly in the case of uncommon illnesses, there may only be one or two patients in any given hospital. Another case is with hospitals in poorer areas - they may have all the tools to measure patient information, but not access to the tools to analyse that data.

Constraint Optimiser: I bet you I am better than a human

Justin Goldsmith and Christian Witchger

Business Optimiser solves planning problems: reaching goals with limited resources, e.g. maximising profit with a minimal ecological footprint and a limited number of employees. The engine is OptaPlanner; it’s part of the JBoss Decisions Manager as a commercial offering.

Planning problems are still hard - there’s no practical way to know the best answer in a reasonable time: the travelling salesman problem doesn’t require many cities to hit the point where it’s effectively unsolvable.
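
To put a rough number on that, here is a minimal sketch that counts how many distinct round trips a brute-force search would have to compare. It assumes symmetric distances (A to B costs the same as B to A); the function name is mine, not something from the talk.

```python
from math import factorial


def distinct_tours(cities: int) -> int:
    """Count distinct round trips through `cities` cities with symmetric distances."""
    if cities < 3:
        return 1
    # Fix the starting city, then halve the count because each loop
    # can be driven in either direction.
    return factorial(cities - 1) // 2


for n in (5, 10, 15, 20):
    print(n, distinct_tours(n))
```

Five cities give 12 tours and ten give 181,440; by twenty cities the count is around 6 × 10^16, which is why solvers settle for good answers found quickly rather than provably best ones.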

You have positive and negative constraints to weight the solutions the algorithms will choose.

Shift Rostering

  • A shift has a date, a time range, and required skills.
  • Employees have skills, seniority, days off preferences, and so on.

Constraints are divided into hard constraints and soft constraints - can’t break and shouldn’t break.

At the client the goals were:

  1. Minimise overtime.
  2. Balance work hours.
  3. “Hard” vs “soft” skills[1] e.g. security staff who “must carry a gun” vs who “must speak Spanish”.

Constraints are defined in an XML config. The application shows the “thought” process in action as it’s running.
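
The hard/soft split can be sketched as a two-level score. This is my own illustration in plain Python, not the XML config or the OptaPlanner API shown in the session; the class names, the 40-hour threshold, and the penalty weights are all assumptions.

```python
from dataclasses import dataclass

STANDARD_WEEK = 40  # assumed overtime threshold, in hours


@dataclass(frozen=True)
class Employee:
    name: str
    skills: frozenset
    hours: int  # hours already rostered this week


@dataclass(frozen=True)
class Shift:
    hours: int
    required: frozenset   # hard constraint: must not be broken
    preferred: frozenset  # soft constraint: should not be broken


def score(shift: Shift, employee: Employee) -> tuple:
    """Return (hard, soft) penalties; 0 is perfect, more negative is worse."""
    hard = -len(shift.required - employee.skills)
    overtime = max(0, employee.hours + shift.hours - STANDARD_WEEK)
    soft = -overtime - len(shift.preferred - employee.skills)
    return (hard, soft)


def best_employee(shift: Shift, employees: list) -> Employee:
    # Tuples compare left to right, so a single hard violation
    # outweighs any amount of soft penalty.
    return max(employees, key=lambda e: score(shift, e))


shift = Shift(hours=8, required=frozenset({"armed"}), preferred=frozenset({"spanish"}))
staff = [
    Employee("Ana", frozenset({"armed"}), hours=36),
    Employee("Ben", frozenset({"spanish"}), hours=0),
    Employee("Cy", frozenset({"armed", "spanish"}), hours=38),
]
print(best_employee(shift, staff).name)
```

Ben has no overtime at all but is never picked, because he breaks the hard constraint; between Ana and Cy the soft penalties decide.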

Vehicle Routing with Time Windows

A team with 75,000 vehicles and hundreds of thousands of tasks.

  • Jobs: time window for job, skills required, location.
  • Technician: skills, working hours, average job time.

Time windows and skills are hard constraints. Others were:

  1. As many jobs in a day as possible.
  2. Least number of technicians.
  3. Minimise the travel.

Planning speed is important - you need to be able to re-plan as events change; for example, if customers solve a fault themselves, they will cancel an appointment. OptaPlanner can produce a first cut within milliseconds, then iterate over that answer to move towards an optimal solution. So the customer would run overnight to get the best solution, then update during the day as needed, for a few minutes at a time.

OptaPlanner claims better results in less time (e.g. 5 minutes vs 3 hours).
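
The “quick first cut, then keep improving within a time budget” pattern looks roughly like this. It is a toy sketch for a single vehicle with no time windows or skills, using a nearest-neighbour construction step and 2-opt local search; those are stand-in techniques I picked for illustration, not a description of OptaPlanner’s internals.

```python
import math
import time


def route_length(route, points):
    return sum(math.dist(points[a], points[b]) for a, b in zip(route, route[1:]))


def first_cut(points):
    """Construction step: from the depot (index 0), always drive to the nearest unvisited job."""
    route, todo = [0], set(range(1, len(points)))
    while todo:
        here = points[route[-1]]
        nxt = min(todo, key=lambda j: math.dist(here, points[j]))
        route.append(nxt)
        todo.remove(nxt)
    return route


def improve(route, points, budget_seconds):
    """Local search (2-opt): reverse segments for as long as that shortens the route."""
    deadline = time.monotonic() + budget_seconds
    best = route[:]
    improved = True
    while improved and time.monotonic() < deadline:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if route_length(candidate, points) < route_length(best, points) - 1e-9:
                    best, improved = candidate, True
    return best


jobs = [(0, 0), (1, 5), (2, 1), (5, 5), (6, 0), (3, 3), (4, 1)]
quick = first_cut(jobs)
better = improve(quick, jobs, budget_seconds=0.5)
print(route_length(quick, jobs), route_length(better, jobs))
```

The budget is the knob: a long overnight run and a few minutes during the day are the same loop with a different deadline, and the answer in hand is always usable.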

Task Assignment

  • Tasks have task types and require skills, priority.
  • Reviewers: skills, affinity.

  1. Dollar value for the claim.
  2. Date added.
  3. Customer status.

Locking in-progress tasks is an important challenge: a task must be marked as no longer up for grabs once someone starts working on it.
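
A minimal sketch of what that locking means for re-planning: tasks already being worked on keep their reviewer, and only the rest are redistributed. The least-loaded rule and all the names here are hypothetical, chosen to keep the example small.

```python
from collections import Counter


def replan(tasks, current, locked, reviewers):
    """Reassign every task except the locked ones, balancing load across reviewers."""
    # Locked tasks keep their current reviewer and count towards that reviewer's load.
    plan = {task: current[task] for task in tasks if task in locked}
    load = Counter(plan.values())
    for task in tasks:
        if task in plan:
            continue
        # Pick the least-loaded reviewer; break ties by name for repeatability.
        reviewer = min(reviewers, key=lambda r: (load[r], r))
        plan[task] = reviewer
        load[reviewer] += 1
    return plan


tasks = ["claim-1", "claim-2", "claim-3", "claim-4"]
current = {"claim-1": "ana", "claim-2": "ana", "claim-3": "ben", "claim-4": "ben"}
plan = replan(tasks, current, locked={"claim-2"}, reviewers=["ana", "ben"])
print(plan)
```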

Gerrymandering

The practice of redrawing electoral districts to bias results by creating unrepresentative pools of voters.

The optimiser uses census-driven block groups to iterate over the best shape for 13 different districts in a way that hits criteria for fair representation, although deciding what the criteria look like is where the challenge lies.

How to build a European scale instant payments platform

Giovanni Fulco, Giuseppe Bonocore, Ugo Landini

SIA processes 6.1 billion operations, 3.3 billion payments transactions, and 56.2 billion financial transactions in markets; Italian based but world-wide operations. SIA have been using Linux and contributing to open source since 2000.

Wire transfer without Instant Payments

The current user experience is pretty straightforward, but under the hood it’s batch jobs and file transfers, which take up to 4 days in Europe. “Your money is eventually consistent.”

Wire transfer with Instant Payments

The user experience remains the same, but the transactions are settled in less than 50 ms rather than 4 days.

What’s Under the Hood

  • JDG: JBoss Data Grid
  • AMQ.
  • FIS (Fuse Integration Service) - Camel on OpenShift.
  • Cassandra - non repudiation and transaction history.

The bank makes an AMQ call, which then goes into Camel for processing. Infinispan is used to propagate the transaction; Cassandra is used to store a copy of the message, and AMQ is used to deliver to the destination bank.

I will note that using an in-memory grid for this is amazingly bold. Like, way bolder than I am. But it’s not the first time I’ve recently heard of people relying on multi-way replication for this kind of critical data[2], so maybe I’m just getting old and too conservative.

Confirmations return on the same path.
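
The shape of that flow, reduced to in-process stand-ins: plain Python queues, a dict, and a list take the place of AMQ, the data grid, and Cassandra. None of this is the real API of those products; it is only meant to show the order in which a message touches each component.

```python
from collections import deque

inbound, outbound = deque(), deque()  # stand-ins for the AMQ queues
grid = {}                             # stand-in for the in-memory data grid
history = []                          # stand-in for the Cassandra message log


def submit(payment):
    """The originating bank drops a payment onto the inbound queue."""
    inbound.append(payment)


def route_one():
    """Stand-in for the Camel route: push one message through each stage."""
    payment = inbound.popleft()
    grid[payment["id"]] = {**payment, "status": "in-flight"}
    history.append(dict(payment))  # kept for non-repudiation and history
    outbound.append(payment)       # delivery to the destination bank
    return payment["id"]


def confirm(payment_id):
    """The confirmation travels back along the same path."""
    grid[payment_id]["status"] = "confirmed"
    history.append({"id": payment_id, "event": "confirmed"})


submit({"id": "p1", "amount": 100})
confirm(route_one())
print(grid["p1"]["status"], len(history))
```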

JDG, In Memory Data Grid

“Memory is the new disk.” SSD is fast but memory is still 1,000 times faster.

JDG is the commercial version of Infinispan. It’s a “polyglot clustered in-memory store”. All data is kept in-memory as a key-value store; data overflows into multiple nodes and is replicated for resilience. You can persist to disk, but for this project they didn’t.

Various modes:

  • Distributed mode: typically one replica of the data.
    • A value is written to one node, and a copy will be created in a second node.
    • Subsequent writes are distributed across different nodes, as are the replicas.
    • Consistent hashing is the algorithm used, which means a given key is always hashed to the same nodes. It’s simple to find the right node given the key, which is important for performance - lookups to a directory server are not required.
  • When nodes are lost, a new cluster is (effectively) formed, and lost keys are re-hashed and re-distributed.
    • Clients are updated with the new topology; clients are topology aware.
  • Data affinity: co-locating the data close to the clients.
    • Compute should be co-located with the data.
    • This is achieved with “grouping” in JDG.
    • The affinity parameters are set, and then the data will be partitioned accordingly - for example, customer data, or credit card data.
    • This requires hashing on the partition, not just the node.
    • This further reduces the latency and improves performance.
    • In the best case, data can be consistently fetched from local memory, as the primary copy will always be on the same system as the compute.
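
The distribution and grouping ideas above can be sketched with a small hash ring. This is a generic consistent-hashing illustration, not JDG’s actual implementation; the `Ring` class, the SHA-256 hash, and the 64 points per node are my assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # A stable hash, so every client computes the same owners without a directory lookup.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, points_per_node=64):
        # Each node owns several points on the ring so keys spread evenly.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(points_per_node)
        )
        self._hashes = [h for h, _ in self._ring]

    def owners(self, key, group=None, copies=2):
        """Return the primary owner followed by the backup owner(s) for `key`.

        When `group` is given it is hashed instead of the key, so every
        entry in the same group lands on the same nodes (data affinity).
        """
        start = bisect.bisect(self._hashes, _hash(group if group is not None else key))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
            if len(found) == copies:
                break
        return found


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.owners("payment-1"))
print(ring.owners("card:1", group="customer-42"), ring.owners("account:9", group="customer-42"))
```

Dropping a node from the ring only moves the keys that node owned; every other key keeps both of its owners, which is what keeps re-distribution after a failure cheap.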

Requirements

The requirements were set out in the EU’s Euro Banking RT1 tender: 5,000 tx/second, sub-second round trip (900 ms worst case), active/active across at least two data centres, zero data loss, and 24x365 availability with no planned outages.
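
Some back-of-the-envelope arithmetic on those figures, using Little’s law (average number in the system = arrival rate × time in the system). It assumes the platform runs flat out at the tender’s limits, which is my simplification rather than anything claimed in the session.

```python
def in_flight(throughput_tps: int, round_trip_ms: int) -> int:
    """Little's law: transactions in the system = arrival rate x time in the system."""
    return throughput_tps * round_trip_ms // 1000


def per_day(throughput_tps: int) -> int:
    return throughput_tps * 24 * 60 * 60


print(in_flight(5_000, 900))  # at the tender's worst-case round trip
print(in_flight(5_000, 50))   # at the 50 ms figure quoted in the talk
print(per_day(5_000))
```

That is 4,500 transactions in flight at the worst-case round trip, 250 at 50 ms, and 432 million transactions a day at the sustained rate.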

This rules out orthodox RDBMS (hence JDG being chosen). The availability requirement drove the use of AMQ, with multiple brokers and zero data loss. The final challenge was dealing with split-brain; they deployed a light-heavy config of JDG (3 nodes on one site, 2 on the other), which does require manual intervention to override quorum failure.
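
Why the 3 + 2 layout behaves that way can be seen with a simple majority rule. This sketch assumes a plain “more than half the cluster” quorum; I don’t know the exact partition-handling settings used in the project.

```python
SITES = {"heavy": 3, "light": 2}
CLUSTER_SIZE = sum(SITES.values())


def has_quorum(reachable_nodes: int, cluster_size: int = CLUSTER_SIZE) -> bool:
    """A partition may keep serving only if it can see a strict majority of the cluster."""
    return reachable_nodes > cluster_size // 2


# Link between the sites fails: each site only sees its own nodes.
print("heavy side:", has_quorum(SITES["heavy"]))
print("light side:", has_quorum(SITES["light"]))
```

If the inter-site link drops, the heavy site carries on and the light site stops, so there is never a split brain. If the heavy site itself is lost, the two surviving nodes cannot form a majority, which is where the manual override comes in.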

Elasticsearch and Kibana were used to build monitoring/visualisation of the platform.

In practice

Since November 2017 one million payments have been processed, with headroom for plenty more; transaction times averaged 50 ms, well ahead of the tender’s requirements.

Next Steps

  • Using EnMasse (messaging as a service in OpenShift) to streamline queue management. It’s not yet production ready.
  • More containerisation in order to allow safer releases of newer versions and instances. JDG for example is not supported on OpenShift at the moment, but will be with the next release, when they will migrate.

  1. This is a bit of a false dichotomy, and one that devalues the more important skill. If speaking Spanish means you’re less likely to shoot someone, well, which is the more important skill?
  2. Listening to someone senior in Microsoft’s SQLServer team explain that they have customers using SQLServer on OpenShift with only ephemeral storage and relying on AlwaysOn Availability Groups is another exercise in realising I’m not as rip-shit-and-bust on this sort of thing as I could be.