LCA 2018 Day 2

I’m putting myself off-side with general sentiment in Sydney, but I find bin chickens adorable.

As always, the day starts with randomly selected prizes - I am dubious about the claims lava lamp RNG was involved - and today’s winner was present!

Collaborating with Everyone: Open Source Drug Discovery

Matthew Todd

This is a very different audience for Professor Todd; normally his first job with this presentation is to sell a hostile audience on the concept of open source; this will not, of course, be a problem here.

“All I’m trying to do is to apply a concept pioneered here to wet lab work.”

The term is quite contentious; rms, for example, has thrown rocks at calling science work “open source”. Matthew has argued the toss with Richard, but their differences are irreconcilable. I find this very peculiar - the only time I’ve seen Stallman talk he spoke eloquently about how the history of open collaboration in science is an inspiration for free software.

First Open Source Chemistry Project

Praziquantel - a WHO drug that is used for treating a parasitic disease in Africa. It’s given as a pill which has two isomers, one of which is effective; the other is useless and tastes horrible, putting people off the drug. The normally synthesised form has a 1:1 ratio of the active and foul-tasting inactive forms, which is cheap and easy; separating them has been too expensive to be useful.

Professor Todd worked with Ginger Taylor, who works as a Drupal developer, on building a community to allow multiple labs to collaborate on attacking this problem. They quickly became better known, featuring on chemistry-specific blogs and sites. This catalysed contribution from all over the world; to their surprise, 75% of the contributions were from private industry. Surprisingly enough, a huge number of contributors arrived from a LinkedIn posting, “finally finding a use for LinkedIn”.

They very quickly found a solution. Unknown to Professor Todd the WHO were funding a private effort in parallel; both came up with valid solutions via different paths.

Interestingly enough, while not patent protected, it’s progressing through the approval process, something which is normally not considered possible by advocates of closed chemistry and medicine.

In the industry there’s a lot of talk about “Open Innovation”; this is not at all open. Companies publish problems, and then buy suggestions from contributors, keeping the solution and chemistry closed and patented.

Conventional Drug Discovery

Normal drug discovery typically starts with a “hit”, a molecule that has the desired effect but has problems that prevent it from being useful; research works towards a “lead”, a molecule that can be practically used.

The six laws of open science:

  1. All data are open and all ideas are shared.
  2. Anyone can take part at any level.
  3. There will be no patents.
  4. Suggestions are the best form of criticism.
  5. Public discussion is much more valuable than private email.
  6. An open project is bigger than, and is not owned by, any given lab.

Information sharing is becoming widely accepted as critical to better quality research; the Bill and Melinda Gates Foundation has been gratifyingly aggressive in their support of open publishing, for example.

What Actually Is It?

Starts with an open lab book. Most people in chemistry - perhaps 70% - still work in paper notebooks, so simply getting people to use an electronic notebook, like Lab Trove or Lab Archives, is a huge change for many people; from there, it needs to be shared - sharing happens every day at the end of lab work.

They’ve hijacked GitHub so they can use the wiki, to-do lists, and issue trackers. They’re very happy with it, even though they “don’t know what they’re doing”; the molecules are tracked in a Google sheet.

Lab vs Code

Technology is “an improvement in the instructions for mixing together raw materials”, from PostCapitalism by Paul Mason. The molecules researched so far are very far along, with mouse experiments showing a set of working molecules. They still don’t understand why they work, but they’re close. This, by the way, is the point at which traditional researchers would be thinking about patents.

(Professor Todd notes they have found 30 useful molecules, with different structure, which have the same outcome, which is surprising and interesting.)

So they decided to run a competition for a computational model that would help understand how the molecules would work; it was popular and interesting, but none of the resulting models are good enough at predicting the molecules’ behaviour; some more work, based off more experimental data, is needed to refine them.

“Everyone in the competition wants to work together to help, which is so awesome.”

There have been big inputs from some big pharma, too: Scott Obach from Pfizer has been doing significant amounts of work in his labs, sharing the results back with the Malaria team. While the companies themselves don’t do open science, the scientists who work there are very keen to participate.

Vito Spadavecchio is another contributor who arrived without prompting - he is very interested in helping, and likely works in private industry.

Crowdsourced, Open Undergrad Lab Classes

Lawrence University have started using Open Source Malaria research in their student labs, with students excited to synthesise real, useful antimalarials as part of their lab work. It’s tremendously uplifting and practical for the students.

Similarly, a Sydney high school chemistry class synthesised Daraprim as part of their coursework to understand the gap between the cost of synthesising medical drugs and their price. This ended up in the news, because it’s the drug that was connected with Martin Shkreli. There is, Professor Todd notes, a complete disconnect between the cost of synthesising medicine and the price paid for it; moreover he was impressed with the quality of the molecules produced by the high school students.


There’s a lot of anxiety about the idea of being unable to publish papers based on public-domain research, but over the last five years more and more major scientific publishers, including PLOS and Nature, have been accepting papers derived from open research.

Next Targets for OSM

  • Stage 4 clinical trials, where they’re used with people. This would be the first time this has happened.
  • Funding is a big thing.
  • Legal and economic scope of trials will be a challenge.

What is missing?

Representing molecules is, at the moment, tremendously difficult. Lab notebooks, search engines - none of the tools today understand the structure of molecules. The semantics of transcribing molecules are weak.

SCINDR (in homage to Tinder) is a tool designed to help connect potential collaborators, but it struggles with the inability to understand the structure of molecules.

Science doesn’t work seamlessly with computers today; work happens in the lab, then you finish up and go to the computer. We need to get to the point where the computer is used in the lab, during the work cycle, to enable better collaboration and to augment with machine learning or AI techniques.

Better web and mailing list infrastructure, and data migration. Should they become a non-profit?

Gonna need stickers!

The automatic creation of narrative - an AI system that can read papers and create plain-language summaries to make the papers easily understood. This is something that is already happening in the biological sciences, so chemists need to catch up.

Open Source Mycetoma

There is a lead research project, and the open team are providing backup. Mycetoma is a fungal disease with no treatment other than amputation. It’s common in Sudan, and the current state of treatment is “like a butcher’s shop”. This initiative launches next week.

Application to pharma?

Funding this, through to the market, is an unsolved problem. How do you fund getting a drug to market working completely in the open? Matthew is working on this. One advantage is that you have better knowledge of the molecules before you apply funding; the problem isn’t even really funding, there’s lots available. The problem is the lack of precedent for getting to patients.

Matthew thinks the open source world provides a model - “I had no idea that so much backing for open source is from private companies.”

There’s high resistance from people who believe you can’t develop drugs without patents, but this is nonsense; the polio vaccine didn’t need patents. The problem isn’t the patents; the problem is how entities can make their money back.

Data exclusivity is one possibility; if you fund taking a drug through trials, you get 6 years of exclusivity. This could be the scaffolding for solving the problem without resorting to patents.

The Future of Art

J. Rosenbaum @minxdragon

The future is in machine learning, “and this is not hyperbole”. This doesn’t have to be a bad thing - the machines can be our new best friends. In the field of art, we’re starting to see that ML art has limitless potential, in spite of art having been seen as a uniquely human endeavour.

J is delighted by the frivolous: Janelle Shane’s work with neural networks, generating names from inputs that she trains up. This is a consistent theme in J’s talk; the first steps toward art are the playful use of a thing (I am put in mind of Wilde’s observation that “all art is quite useless”).

Botnik is another example, creating ads; plausible yet unreal and hilarious. Contribution is open, and the humans are still guiding the machines. This openness and collaboration - whether with machines or the other people in the projects - is where the breakthroughs will come from.

Deep Dream would be an interesting, fleeting phenomenon except that Google made it available to everyone; this has kicked off communities built around the works. J says that the discussion that has formed is as much part of the art as the generated images themselves; it’s what you do with it that leads to art, rather than the tools themselves.

A Neural Algorithm of Artistic Style leads on from this initial work, going further down the route of “style transfer”. J compares style transfer to, say, a Photoshop filter; while there are some similarities, the filter is designed for predictability, while style transfer, which breaks the image down and rebuilds it, is unpredictable. The interaction between the machine, the algorithm, and the human is where the art comes from, and how it transcends something like a filter.

“The most exciting moment is when the neural network adds colours that are neither present in the source render nor the reference.” As humans, we see the totality of the source material, but to the algorithm, a simple colour aberration that we overlook is as important as every other pixel. J gives an example of a style transfer she performed where the algorithm added a purple hue which probably came from a few pixel artefacts ignored by a human viewer.

Creative adversarial networks are an emerging and exciting area; they are reaching the point where human viewers assume the adversarial network output is by humans, rating it more highly than actual human-created art shown in the same context. One risk, though, is that the algorithms may fine-tune themselves to be so optimised for positive feedback that they never take risks, narrowing the scope of generated art.

Music is highly mathematical, and already amenable to being generated - but people have trouble listening to generated music. J has generated music with Magenta; J argues that generated music that tries to mimic human composition too closely will be the most jarring, and the most artistically interesting efforts will come from giving the machine its head, with lighter moderation from a human broker.

“One of my favourite moments is when the tools are released into the wild.” People discuss how creepy and discomforting the generated work is, and how silly. “I love seeing the frivolous and creepy uses, because this is where progress will be made into how we think.” #puppyslug is a term for the creepy works generated by Deep Dream; birds and dogs have been so common that they reinforce themselves; Deep Dream itself has a kind of confirmation bias.

Dinosaurs by Chris Rodley introduced many people to the idea of style transfer; style transfer is less prone to this kind of creepy, self-confirming bias.

J has played with FaceApp and is intrigued, but regrets that it can’t be easily investigated, since the algorithm and code are closed. Muglife, released by Nvidia, tries to create memes based off animations and static images. J has been playing with it; “the fails are almost as interesting as the successes.”

Pix2Pix is another tool that has gone from creepy to wondrous. It has many uses, but it’s best known for filling in sketches with photographic data. It became initially famous for creepy images, but artists like Patrick Tresset have been creating art with it, mixing elements of traditional art - people sitting for portraits - and machine elements.

Mike Tyka’s Portraits of Imaginary People uses multiple algorithms to create unreal yet uncannily human faces. Not many months later, Nvidia released synthetic celebrity photos generated from the CelebA dataset. Interestingly enough, the former received complaints of the uncanny valley, but the latter didn’t. Why is that? J is fascinated by why we consider something to fall into that space; moreover, J is concerned that without a touch of the uncanny valley we move from an emotional reaction to a work that is too smooth, too slick, something that has no emotional hook, arguing that if there is no reaction, there is no art.

The Next Rembrandt is a project by J. Walter Thompson, which used every known Rembrandt as source material, using the corpus to decide what a new image would look like. The artists took it to the next step, printing the image with the same paint application technique as Rembrandt. What is the difference, at this point, between the human-painted image and the machine-painted one? Which is art, and why?

Machine learning is a wonderful tool, because as artists, our work is the product of everything we have absorbed; we train our minds on the visual imagery we consume. Collaboration with the algorithm is an extension of this process.

Can machines create art by themselves, without that human guidance, though? Will they remain a tool guided by the human, like a camera, or will they overtake us? J argues that the synthesis of algorithm and human will create something that goes beyond the capability of either party; moreover, J argues that intent is the key thing. Machine learning treats all sources as the same; the art is in the intent.

This was an amazing talk that filled my head chock-full. I will have to go back and re-watch the video.

On Writing Machines

Mark Rickerby @maetl

What are writing machines? A phrase to capture the field of generating texts without direct human authoring; the human is the blind watchmaker.

The Eureka, a Victorian machine for generating Latin poetry, had a line-by-line, closed-grammar output. Roald Dahl wrote a story called The Great Automatic Grammatizator in 1953; around this time, we start seeing software becoming important. In 1962, MIT created the SAGA II script generator, which was actually used to generate scripts that were filmed for TV. Sadly, the code has been lost.

What do writing machines need to do?

A combination of novelty and constraint; we need the novelty to surprise us and feel real, but the texts need to be constrained to our grammar and readability to make sense. Well-formed, comprehensible, meaningful, and expressive. The last is especially important for art.

If we’re writing a game, app, or bot, they need to be context-aware and responsive, as well.

That’s quite a heavy set of criteria. No-one has really cracked this. Do we really need to solve this problem?

Should we constrain the domain? For example, if we’re writing poetry, we care about feeling and meaning, but perhaps not so much about structure; reports care more about facts and less about creativity and novelty.

Connecting with older traditions of novelty and constraint: ideas such as the I Ching or tarot, which were used for fortune telling, provided a basis for creativity.

Let’s start with strings: instead of shooting for the moon, let’s start bottom up. A simple example:

  • 0 posts selected.
  • 1 posts selected.
  • 2 posts selected.

Whoops. This is a generative text problem. A common solution is to hardcode for this case, but that doesn’t scale. There are so many cases for every natural language that you can’t code every special case. Consider:

  • Eat a apple.
  • Eat a banana.
  • Eat a orange.

Now this may not be logical; it’s a function of how language sounds, and we’ve evolved our language over hundreds of years. But we need to model it correctly if we want to make things look right.
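The two failure modes above - plural agreement and article choice - can be sketched in a few lines of Python. This is a hypothetical illustration, not anything from the talk, and the article heuristic is deliberately naive; the point is exactly that such hardcoded rules don’t scale to real language (“an hour”, “a university”):

```python
def count_phrase(n, noun):
    """Naive plural handling: append 's' unless the count is exactly 1."""
    suffix = "" if n == 1 else "s"
    return f"{n} {noun}{suffix} selected."

def article_for(word):
    """Crude heuristic: 'an' before a written vowel.

    English actually keys off *sound*, not spelling, which is why
    special-casing like this breaks down quickly."""
    return "an" if word[0].lower() in "aeiou" else "a"

print(count_phrase(1, "post"))   # 1 post selected.
print(count_phrase(2, "post"))   # 2 posts selected.
print(article_for("apple"))      # an
print(article_for("orange"))     # an
```

Even this small sketch handles the “1 posts selected” bug, but it already fails on “0 post” edge conventions in other languages - Mark’s point about per-language special cases stands.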


There is no universal method of representing writing with computers. We can start from the atoms of letters and phonemes, up through layers of words, sentences, paragraphs, passages, and narratives. It’s not the only way to think about it, but dividing things up this way, Mark argues, helps us think about solving the problems piecemeal.

We can start with statistical methods: machine learning, bottom up, sampled from a corpus. Symbolic methods, on the other hand, are top down: generative grammars.

The latter is an old technique in computer science; we express some structural rules. This allows us to express quite sophisticated structures. There are many great examples out there, such as, for example, Tracery.
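As a rough sketch of the idea (this is a toy of my own, not Tracery’s actual API), a generative grammar can be as small as a dictionary of rules plus a recursive expander that replaces `#symbol#` references:

```python
import random

# A toy Tracery-style grammar: each rule maps a symbol to alternatives,
# and #symbol# references inside a string expand recursively.
GRAMMAR = {
    "origin": ["The #adj# #noun# #verb#."],
    "adj": ["ancient", "glittering"],
    "noun": ["library", "river"],
    "verb": ["sleeps", "sings"],
}

def expand(symbol, grammar, rng=random):
    """Pick one alternative for `symbol`, then expand any #refs# in it."""
    text = rng.choice(grammar[symbol])
    while "#" in text:
        start = text.index("#")
        end = text.index("#", start + 1)
        inner = text[start + 1:end]
        text = text[:start] + expand(inner, grammar, rng) + text[end + 1:]
    return text

print(expand("origin", GRAMMAR))  # e.g. "The glittering river sings."
```

The grammar guarantees well-formed output (the constraint) while the random choices supply the novelty - the combination Mark describes.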

For statistical methods the best known is perhaps the Markov chain: read a corpus, break it down into the likelihood of which word follows another. It relies on an idea in computational linguistics: that words appearing in similar contexts tend to have the same meanings.
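A minimal word-level Markov chain along those lines might look like this - a toy sketch of the technique, not anything shown in the talk:

```python
import random
from collections import defaultdict

def build_chain(corpus):
    """Map each word to the list of words observed to follow it.

    Repeated followers stay in the list, so random.choice naturally
    samples them in proportion to their observed frequency."""
    chain = defaultdict(list)
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, rng=random):
    """Walk the chain from `start`, emitting up to `length` words."""
    words = [start]
    for _ in range(length - 1):
        followers = chain.get(words[-1])
        if not followers:
            break  # dead end: the last word never had a successor
        words.append(rng.choice(followers))
    return " ".join(words)
```

Trained on a real corpus this produces locally plausible but globally meandering text - which is why Mark frames it as a bottom-up method that still needs higher layers for structure.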

The cutting edge in this area is word vectors; within word vectors we can map semantic and syntactic relationships, and then perform vector operations on the data set. Beware: word vectors directly encode the values and biases of their source texts. Be careful playing with this stuff; you can easily create awful output.
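To illustrate the “vector operations” idea: the classic analogy query solves “a is to b as c is to ?” by computing a − b + c and finding the nearest word by cosine similarity. The vectors below are made-up, three-dimensional toy numbers (real embeddings are learned from a corpus and have hundreds of dimensions):

```python
import math

# Toy "word vectors" with hand-picked, illustrative values only.
VECS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.05, 0.05, 0.05],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vecs):
    """Solve 'a is to b as c is to ?' via a - b + c, nearest by cosine."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    return max((w for w in vecs if w not in (a, b, c)),
               key=lambda w: cosine(target, vecs[w]))
```

With real learned vectors the same arithmetic surfaces both useful relationships and the biases Mark warns about, since both come straight from the source text.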

Vladimir Propp wrote a book, the Morphology of the Folktale, to provide rules for combining the archetypes of characters and plots in folktales. It was a tremendously popular source for work in the 70s.

This leads into the idea of world models: representing the high-level idea of the story and the shape of the world; these can be derived from a graph representation of the story’s possibilities.

This leads into the idea of world state and actors - at which point my understanding of the talk solidly trailed off; it’s a lot cleverer than I am.

But what’s the point?

  • Generating larger volumes of text that would be difficult to manage otherwise.
  • For example, generating elements of a fictional work in order to support a core work; worldbuilding.
  • Explore weird forms of writing that were previously impossible.
  • Consider Borges The Library of Babel.
  • People are building this.
  • Replacing traditional authors with machines.
  • This is very problematic.
  • We want to augment humans, not destroy ourselves.
  • Fight back against the lack of ethics in ML and AI.

User Session Recording: An Open Source Solution

Fraser Tweedie


Why record user sessions?

  • Comply with government or industry regulations.
  • Track what contractors or other third parties are doing on our systems.
  • Know who broke our server, and how.

The dream:

  • Record what users do.
  • Store it somewhere safe.
  • Retrieve when needed.

There is a supply:

  • Range from app level proxies through to user space processes on the target system.
  • Tracking, central storage, and playback.
  • Integration with identity management.
  • It’s important to only record activity where there’s some justification; only on certain systems or while using certain roles.

Many of these tools aren’t great; they’re expensive, proprietary tools that are hard to understand, fix, or extend.

While there are some hacks, they have flaws:

  • script is useful, but not security oriented.
  • sudo is security-oriented and searchable, but only covers sudo-ed activities.
  • tty audit with auditd. Only records input, no output, so less useful.
The open source solution:

  1. Record session I/O with tlog.
  • Shim between terminal and shell.
  • Logs all input and output and terminal events.
  • Stored in a JSON schema.
  • Can be centrally stored.
  • Can be tuned for latency and overhead.
  • Rate limiting.
  • The user session starts with tlog-rec-session and logs to Journal or syslog.
  • You can play back with tlog-play.
  2. Searching: Elasticsearch and Kibana.
  • The ViaQ project is pulling this and other Red Hat things into Elasticsearch.
  3. Kernel auditing.
  • Use the aushape tool, which transforms the events into JSON or XML to give them a more regular format.
  4. View events via Cockpit.
  • Cockpit has been extended to add the playback functionality as a proof of value.
  • Can even show Nyancat.
  5. Control via sssd.
  • Work in progress.
  • Will be able to perform HBAC.

Closing the Skills Gap for Distributed Technology

Florian Haas

This is not about what we tend to label education (primary, secondary, tertiary, etc.); it is about ongoing education as a professional. It’s a mix of the implementation tools and the whys of ongoing education.

Florian notes that the biggest challenge with adopting new technologies is not cost, or scalability, or speed of delivery; rather, it is people learning how to use the new tools. The blocker to achieving any of these goals is finding the people; they don’t exist, and we can’t easily hire people to do the work.

So why not build the teams ourselves, by training our people? But Florian asserts that the traditional methods of vocational education are a poor fit for actually achieving that goal: pulling together your people, getting them in a classroom with an instructor - who may not be available when you need them - and taking them out of their week.

People try to work around this: watching a video of an instructor, for example. Doing that well costs a fortune. Most likely you’re doing it on the cheap and creating something boring, that can’t be searched or easily referred back to. Then we ask people to apply this knowledge to complex systems like SDN.

It doesn’t work.

So what should we look for in a learning environment worthy of the name?

  1. They should be immersive; they should be lab environments that let us do the things we’d do in production.
  2. We want to be able to interact with our peers in the way we would at work.
  3. We would ideally like them to be open source so they can grow and develop rapidly.

Florian has been trying to solve this problem for about 3 or 4 years; he started with OpenEdX, an open-sourced learning management environment. They then combined that with OpenStack to provide the learning environments. If the subject is Hadoop or Ceph, then everyone gets a full environment they can build and break alongside the rest of the course.


Why OpenEdX?

  • Openness and community.
  • There were about 3 contenders: Moodle, Google Course Builder, and OpenEdX.
  • Course Builder had no real community at the time, and in the last few years it’s gone quiet.
  • Technology.
  • Florian and his team are more familiar with Python.
  • Extensibility.
  • They knew that neither Moodle nor OpenEdX had out-of-the-box integration with OpenStack, so being able to write plug-ins would be critical.

Florian prefers to have the LMS check the state of the as-built environment, rather than using post-lab quizzes; in this way it measures the actual learning. OpenEdX will shell into your environment and check you’ve built what the config defines as necessary.

They have built Apache Guacamole extensions to embed terminal and graphical remoting into the OpenEdX platform.

MQTT as a Unified Message Bus for Infrastructure Services

Matthew Treinish

The OpenStack community infra operates more than 50 services on more than 250 servers, which supply a variety of event streams. Each service exposes its events its own way; it’s a mess for service consumers to navigate.

As a result, they introduced the Firehose: an MQTT broker for the infra. It has anonymous, read-only access with optional encryption.

  • Pub/sub messaging protocol.
  • An ISO standard - ISO/IEC 20922.
  • Maintained by OASIS.
  • Came from the home automation world; low bandwidth, designed to handle unreliable networking.
  • This makes it a good fit for the public cloud.
  • Has a central broker.
  • There are a number of brokers, including brokers-as-a-service.
  • There are many client bindings for pretty much any language.
  • Client bindings are generally pretty similar in terms of how you interact.
  • Topics and subscriptions.
  • Topics are generated dynamically, hierarchical, and support wildcarding.
  • For example: sensors/HOSTNAME/temperature/HDD_NAME.
  • There are three levels of QoS: at most once (“maybe”), at least once, and exactly once.
  • You can set this on a per-message basis, so you don’t have to have different topics for different QoS levels.
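The wildcard semantics implied by that topic hierarchy - `+` matching exactly one level, `#` matching the remainder - can be sketched as a small matcher. This is an illustrative re-implementation, not Firehose code, and the topic names are hypothetical, mirroring the sensors example above:

```python
def topic_matches(pattern, topic):
    """Match an MQTT topic against a subscription pattern.

    '+' matches exactly one level; '#' matches the rest of the topic
    (including zero levels) and may only appear as the final level."""
    p_levels = pattern.split("/")
    t_levels = topic.split("/")
    for i, p in enumerate(p_levels):
        if p == "#":
            return True  # multi-level wildcard swallows everything after
        if i >= len(t_levels):
            return False  # pattern is deeper than the topic
        if p != "+" and p != t_levels[i]:
            return False  # literal level mismatch
    # All levels matched; require equal depth (no trailing topic levels).
    return len(p_levels) == len(t_levels)

print(topic_matches("sensors/+/temperature/#",
                    "sensors/web01/temperature/sda"))  # True
```

This is why a consumer can subscribe once to `sensors/+/temperature/#` and receive the per-host, per-disk streams without knowing the hostnames in advance.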

Firehose runs the Mosquitto broker, which requires fairly small resources; OpenStack picked it because they found a C application more comfortable to manage and modify than an Erlang one.

Firehose runs between 500 and 2,500 messages per minute, with spikes on the hour as cron kicks off Ansible jobs. Manual load testing showed an ability to sustain over two million messages per minute, using only 40% of their CPU and less than 2 GB of memory - bandwidth, at 200 Mbit/s, turned out to be the limit.

It has proven less popular with users than they hoped, though third-party CI operators are using it as part of their hardware testing; some users use mqttwarn to surface events as desktop notifications. It’s beginning to be used for inter-service communications.


I was going to learn about service discovery with DNS; instead, I ended up learning about acroyoga and some simple base work, which turned out to be a much more entertaining use of my time. Thanks to Casey and Darcy for the hallway tutorial.