KiwiPyCon 2022 Day 2

Off and away for day 2, but various appointments mean this won't be a full day for me, which is a shame.

Keynote: Convention and Construction

Christopher Neugebauer

Machine Learning with Python and a Game Engine

Paris Buttfield-Addison

The basics of ML: learning patterns from data. There are two main paths:

  • Classification links inputs to desired labels.
  • Reinforcement Learning links actions to feedback. It's good for problems where it's hard to write enough rules, and it's popular in game development: you offer the model rewards for doing what you want, and as you keep rewarding it, it will eventually do what you want.

The goal of RL is for the agent to develop a policy based on the responses to behaviour. During the training phase we:

  1. Have the agent observe the world that you've made available to it.
  2. The agent takes actions, either randomly or based on a policy.
  3. The agent is rewarded as it does the things that you want.
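The observe-act-reward loop can be sketched in miniature; the toy environment and tabular policy below are invented for illustration, nothing like ML-Agents' actual internals:

```python
import random

class WalkEnv:
    """Toy 1-D world: the agent starts at 0; reaching +5 earns a reward."""
    actions = [1, -1]

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos += action
        done = self.pos in (5, -5)           # episode ends at either edge
        reward = 1.0 if self.pos == 5 else 0.0
        return self.pos, reward, done

class TabularPolicy:
    """Remembers the reward seen after each (state, action) pair."""
    def __init__(self):
        self.value = {}

    def best_action(self, state):
        return max(WalkEnv.actions,
                   key=lambda a: self.value.get((state, a), 0.0))

    def update(self, state, action, reward):
        key = (state, action)
        self.value[key] = self.value.get(key, 0.0) + reward

def train(env, policy, episodes=200, epsilon=0.5):
    for _ in range(episodes):
        state = env.reset()                  # 1. observe the world
        done = False
        while not done:
            if random.random() < epsilon:    # 2. act randomly...
                action = random.choice(env.actions)
            else:                            # ...or follow the current policy
                action = policy.best_action(state)
            prev = state
            state, reward, done = env.step(action)
            policy.update(prev, action, reward)  # 3. rewards shape the policy

random.seed(0)
policy = TabularPolicy()
train(WalkEnv(), policy)
```

After enough rewarded episodes, the policy prefers the action that led to the reward — which is the whole trick.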

Paris brings up ML-Agents, a Unity extension that makes it easy to do machine learning within the Unity programming tools; it also plumbs Python into the normal Unity tooling. Here he shows us a demo, creating an object within the Unity 3D editor and marking it as an agent. Paris notes that it's worth adding keyboard controls to the agent, so that you can override the random behaviour to encourage the model to pick up rewards. Up until this point, the work is C# code, the norm for Unity development; when we come to the rewards and model, though, we drop into Python, since it makes use of PyTorch.

Paris notes that Proximal Policy Optimization (PPO) is generally considered a good model in the ML world. The Python-driven model starts up a command-line server which interacts with the Unity IDE, driving random behaviours and adding rewards until the model is built to the standard you want; it outputs an ONNX file, a neural-net model that then drives the agent's behaviour in Unity. This demo is based on a ball trying to hit another object, and is a nice visualisation of the learning phase.

The next example Paris gives us makes richer use of the capabilities of the Unity editor: the agent is a cube trying to keep a ball balanced on its top. The feedback loop is via the in-IDE camera, which notes whether the ball falls off or stays on and rewards accordingly; Paris breaks out into a Jupyter notebook, showing how you can capture data from a camera.

The final demo is a worm that wriggles toward a target, having been trained with rewards for moving to the object.

This was a nice taster of some things which are possible (and facilitated by Python), but mostly serves as a pointer to things that a motivated person might go off to explore.

Q&A

  • Given the problems with Unity, what would you recommend as an alternative? Unfortunately there's nothing as good for ML. Unreal are working on something, but it's not as good.
  • How did you integrate with Unity? pip install ml-agents is all you need.
  • How do you ship this? What's the impact on performance? Unity have their own runtime for the neural net, so it's not a Python runtime, and performance is excellent on modern devices.

The When of Python

Grant Paton-Simpson & Ben Denham

Simplicity is one of the most-touted reasons we consider Python attractive. But is Python still simple? Well, that's an interesting question. Let's start with readability - how easy it is to look at a bit of code and understand what it does. When we look at new features in Python, we've added not just the features themselves, but different ways of using them; consider the question of how to write async code.

Learnability is another attribute of Python's simplicity: it's popular as an educational language. But to teach a language, you need to be able to discuss a large subset of its day-to-day features. Grant and Ben argue that extending the language while retaining old features - language creep - is making this harder and harder, as it expands the amount a person needs to learn to get up to speed.

So what to do? Do we follow the JavaScript example of focusing on "the good parts"? Well, that sounds attractive, but Python has no tradition of deprecating features, and no real agreement that there are obvious bad parts to drop. So are we deadlocked?

Everyday Python - a Python Constrictor if you will - would be three tiers of Python:

  1. Almost always use, which is everyday Python.
  2. Sometimes use: situational or for advanced users.
  3. Very niche or deprecated.

Some examples

  • String formatting:
    1. f-strings
    2. .format() for occasional use.
    3. %-formatting - consider it deprecated.
  • Data-Storage Objects:
    1. @dataclass: the everyday choice.
    2. Named tuples.
  • Structural Pattern Matching: it has many benefits (a concise switch statement! easy unpacking of complex data structures), but also some costs (it's a whole mini-language that looks like Python, but isn't).
    3. Niche use only, for example unpacking complex unstructured data.
  • Concurrency:
    1. Concurrency isn't an everyday feature.
    2. concurrent.futures has a simple interface and is straightforward to use when needed.
    3. asyncio, while useful, is tremendously more complicated to write; very robust, but a niche feature.
  • for else:
    3. Don't use it.
  • Misc
    1. Comprehensions.
    2. Format numbers with underscores - low overhead, greater readability.
    3. Type hinting.
    4. Anonymous lambda functions are useful, but aren't very Pythonic.
    5. Don't use the Walrus operator if you can get away with it.
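A few of the tier-1 (and one tier-2) recommendations above, in code - pure illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Tier 1 string formatting: f-strings (with a format spec).
name, score = "Kiwi", 0.9716
label = f"{name}: {score:.2f}"

# Tier 2: .format() for the occasional template built at runtime.
template = "{}: {:.2f}"
assert template.format(name, score) == label

# Everyday data-storage objects: @dataclass.
@dataclass
class Talk:
    title: str
    speaker: str

talk = Talk("The When of Python", "Grant & Ben")

# Underscores in big literals: low overhead, greater readability.
population = 5_120_000

# Comprehensions as the everyday transform.
lengths = [len(text) for text in (talk.title, talk.speaker)]

# Tier 2 concurrency: concurrent.futures keeps the interface simple.
with ThreadPoolExecutor() as pool:
    squares = list(pool.map(lambda n: n * n, range(5)))
```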

Underlying Principles

  1. Try to winnow everything down to one easy, safe approach.
  2. Be prepared to shrink Python, focusing on common tasks and features.
  3. Think about whether new features are a simplification of old mechanisms, or whether they're complex for supporting niche problems.

So where to from here?

  • Collaboration
  • Consensus
  • Standards

This seems like something that the Steering Council should take on, as a living standard akin to PEP 8, providing strong guidance. This is desirable, they argue, because otherwise relying on conventions will likely fragment the community further into niches, each driving and driven by its own conventions, rather than a simple language unified by strong constructs.

Real-World Use

Scraped from recently-uploaded GitHub repos:

  • datetime is common, in 30% of code, likewise type-hinting.
  • for else is in 8% of repos, even if Guido hates it.
  • dataclasses, while useful, are only in 6% of repos.
  • asyncio is used in a surprisingly large 8% of repos, while concurrent.futures is in only 4%. This is very surprising, given that learning material steers people away from asyncio.

Further investigation suggested that about 62% of uses of asyncio are part of the target use case, but perhaps 18% of the repos they looked at are using it when they don't need to.
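Their actual scraping pipeline isn't shown, but the stdlib `ast` module makes this kind of feature census straightforward; a sketch, with a made-up sample file:

```python
import ast

SOURCE = """
import asyncio

async def main():
    for i in range(3):
        print(i)
    else:
        print("done")
"""

def feature_flags(source: str) -> dict:
    """Scan one file for a few of the features discussed in the talk."""
    tree = ast.parse(source)
    flags = {"asyncio": False, "for_else": False, "async_def": False}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name == "asyncio" for alias in node.names):
                flags["asyncio"] = True
        elif isinstance(node, (ast.For, ast.AsyncFor)) and node.orelse:
            flags["for_else"] = True
        elif isinstance(node, ast.AsyncFunctionDef):
            flags["async_def"] = True
    return flags
```

Run over a corpus of repos, counters like these would yield exactly the percentages Grant and Ben quote.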

This approach - monitoring real-world usage to inform decisions about the language - combined with the idea of tiering features by whether they are in the proposed Everyday Python core, could make steering the language much simpler. Alongside that formal process, they argue that this thinking should be part of training material, blog posts, books, team standards and style guides.

Q&A

  • How do you think adoption across different communities would affect the idea of Everyday Python, and would the definition change for different communities? It may well differ somewhat per community, but the core ought to have substantial overlap across communities.
  • Is this published? Can I use it? It's not published yet, but we want to publish the code and idea shortly.

This was a great talk - very thought-provoking, and it was fantastic to see them suggest that real-world use ought to be part of the feedback loop for language design decisions.

Securing Python Web Apps

Oliver Ewert

While Oliver is leaning on his experience with Django applications, he's hoping that most of the resources will be useful for any Python web app, because, to borrow from the Django deployment checklist, "the Internet is a hostile environment". There are good checklists - including the ones provided by Django itself, and a built-in checker. The same is true for Flask, which has its own documentation, as well as the flask-security-too library, which pulls together a bunch of libraries that can help you secure your application.

Oliver suggests the OWASP Top 10 as a starting point, and walks us through it:

Broken Access Control

This used to be number five, but it's now number one, and Oliver isn't surprised by that - it's the only one on the list for which he doesn't have a straightforward fix. It's very easy to make mistakes that leave an application open; many frameworks fail open, and during development you may relax controls and forget to re-enable them later. There are some things you can do to mitigate this:

  • Testing - make sure you have coverage that tries to access views in your application with inappropriate creds, and see what happens.
  • Middleware that checks views for missing access controls - tricky, but can be useful.
  • DB row-level security to re-check user identity against the data.
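The first mitigation - tests that probe views with the wrong credentials - can be sketched framework-agnostically; `login_required`, `Forbidden`, and the dict-as-request below are all invented stand-ins, not Django's API:

```python
import functools

class Forbidden(Exception):
    """Raised when a request fails the access check."""

def login_required(view):
    """Fail closed: no authenticated user on the request means no access."""
    @functools.wraps(view)
    def wrapper(request, *args, **kwargs):
        if request.get("user") is None:
            raise Forbidden("authentication required")
        return view(request, *args, **kwargs)
    return wrapper

@login_required
def secret_view(request):
    return "top secret"

# The kind of coverage Oliver suggests: hit views with bad credentials
# and check that they are actually rejected, not quietly served.
def anonymous_is_rejected():
    try:
        secret_view({"user": None})
    except Forbidden:
        return True
    return False
```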

Cryptographic Failures

  • Inappropriate application of cryptography - out-of-date algorithms, not switched on where it should be, and so on.
  • Keeping private material private.
  • Understanding the difference between different types of cryptography and where to apply them appropriately.

A common issue that Oliver has seen over the years is where people override defaults at some point, and never re-check that decision when code is updated, leaving them with downgraded or vulnerable applications.

Injection

Command injection - SQL injection, or commands in NoSQL databases - is a function of not sanitising user input correctly, allowing an attacker to pass commands through to an application or database. This is a classic, and remarkably common; up until this year it was still number one on the OWASP Top 10. With Python frameworks, simply using bind variables or an ORM largely eliminates this risk for databases.
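A sketch of the bind-variable point using the stdlib sqlite3 driver; the table and data are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin'), ('bob', 'user')")

# Hostile input: naive string interpolation would turn the query into
# WHERE name = '' OR '1'='1' and return every row.
user_input = "' OR '1'='1"

# Bind variables: the driver treats the input purely as data, never as SQL.
rows = conn.execute(
    "SELECT role FROM users WHERE name = ?", (user_input,)
).fetchall()

legit = conn.execute(
    "SELECT role FROM users WHERE name = ?", ("alice",)
).fetchall()
```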

The other risk is cross-site scripting, where an attacker can inject fragments of code (e.g. JavaScript) that will run on other users' sessions. Using templates like Jinja2 or similar for your web apps can eliminate most of these risks.
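Template engines like Jinja2 do this automatically via autoescaping; the stdlib `html.escape` shows the underlying idea:

```python
from html import escape

# A hostile comment a user might submit to be shown to other visitors.
comment = '<script>alert("pwned")</script>'

# Escaping renders the markup inert: angle brackets and quotes
# become HTML entities instead of executable tags.
safe = escape(comment)
```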

Insecure Design

"Shift left" is one buzzword for this, "secure by design" another; generally, it means making sure you think about these areas of concern at the start of building tools and applications. Easier said than done.

Security Misconfiguration

Your settings are always right when you push to prod, right? You wouldn't accidentally use test settings in prod? Never happens!

Check your configuration as you push to production.

My observation would be that the best way to avoid this is to have your config pulled from the environment, not from the code base or files that people have to get right as they push to production.
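A minimal sketch of pulling settings from the environment; `env_flag` and `APP_DEBUG` are made-up names, not any framework's API:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean setting from the environment, failing safe."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}

# In a settings module this might look like:
#     DEBUG = env_flag("APP_DEBUG")           # defaults to False in prod
#     SECRET_KEY = os.environ["APP_SECRET"]   # crash loudly if missing
```

The point is that the same artifact deploys everywhere; only the environment differs, so there's no test-settings file to accidentally ship.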

Update Your Dependencies

As Laura highlighted yesterday, this is a key thing to do. Oliver notes that you should be familiar with the release schedule of your frameworks and tools, and build updates into your release schedules. Oliver's a fan of dependency pinning so that you can reason about what's in your code. He notes that Dependabot, Renovate (Mend), and Snyk are all excellent tools for understanding what you ought to update - and even if you don't update because you don't believe you're affected by something, you should document that decision properly.

Oliver mentions that you need to consider all of your dependencies, not just the core application; you need to be aware of your OS, for example.

My experience is that the biggest blocker on updates and upgrades is lack of decent test coverage and production instrumentation; if you can't be confident that you haven't broken anything as part of an upgrade, you'll always get push-back on rolling fixes, even critical ones, because people are trying to evaluate the risk of a breach versus the risk of a regression-related outage.

Pandas: Librarian to the outer Solar System

Nicole Tan

Nicole is an astronomy PhD student who wants to discuss how she uses Pandas to organise the metadata for the images that she uses to study the universe, specifically objects beyond Neptune.

Nicole introduces us to the solar system beyond Neptune; it's a relatively new field for astronomers: while Pluto is the most famous and has been known for a while, the second significant Trans-Neptunian Object (TNO) was only discovered in the 90s. It's a field that attracts a lot of interest, because current thinking is that the dust and objects scattered beyond Neptune are the remnants of the early solar system, thrown out during the early expansion that happened when the solar system first formed.

We study TNOs, much like a lot of astronomy, by deciphering light, either via spectroscopy or photometry; the former measures across a broad range of wavelengths, while the latter focuses on a narrow range of wavelengths through a particular filter. Because TNOs are so distant and faint, there isn't enough light for spectroscopy, so we use photometry. By examining the brightness through different filters, we can construct the spectral shape of the object. Nicole's area of focus is the near-UV wavelengths, an area which is hard to study from Earth: our atmosphere shields us from a lot of UV light. Nicole notes that UV allows her to deduce information about organic compounds, which have notable UV signatures. She pulls data from two telescope surveys.
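Photometry rests on the astronomical magnitude scale, m = -2.5·log10(F/F0); comparing magnitudes through two filters gives a colour index, which is how a spectral shape is built up from filtered brightness measurements. A minimal sketch - the flux values and band names are invented for illustration:

```python
import math

def magnitude(flux: float, zero_point_flux: float = 1.0) -> float:
    """Astronomical magnitude: m = -2.5 * log10(F / F0).
    Smaller (or more negative) magnitudes mean brighter objects."""
    return -2.5 * math.log10(flux / zero_point_flux)

# A colour index is the magnitude difference between two filters.
u_flux, g_flux = 1.0e-3, 4.0e-3
colour = magnitude(u_flux) - magnitude(g_flux)   # positive: fainter in u
```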

Her PhD requires her to answer a couple of key questions: how many unique objects does she have in the images, and how many times are they imaged simultaneously by the two different telescopes with different filters applied.

Observations are not the same as images: you can't just list all the images taken; there may be multiple TNOs per image, and hence multiple observations, in each image. Happily, the Canadian Astronomy Data Centre has a database that you can query to find solar system objects; while they have an API, the data structures returned are less than ideal - but fortunately there's a package to help with that: Astropy. Astropy helps break the raw returns from the CADC API into a table. From there, Nicole created a dataframe with Pandas, allowing her to work with the data. One thing that is particularly important for Nicole's work is that Pandas can retain specialised data formats, such as the specialised MJD (Modified Julian Day) datatype used in astronomy.
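MJD itself is a simple convention: days elapsed since 1858-11-17 00:00 UTC (equivalently, JD − 2400000.5). A stdlib sketch of the conversion underlying the datatype Pandas is preserving:

```python
from datetime import datetime, timezone

# MJD 0 is defined as 1858-11-17 00:00 UTC (i.e. MJD = JD - 2400000.5).
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)

def to_mjd(when: datetime) -> float:
    """Convert a timezone-aware datetime to Modified Julian Day."""
    return (when - MJD_EPOCH).total_seconds() / 86400.0
```

As a fractional day count, MJD makes it cheap to sort, diff, and compare observation timestamps from different surveys.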

With the information in a data frame, it becomes significantly easier to understand the data - checking for uniqueness of images, filtering out observations that aren't from one of the telescopes that she's supposed to be looking at (the answer, by the way, is 166 TNOs in the u-band).

Simultaneous observation allows different filters to be applied at the same time while observing the same object through two different telescopes[1]. By pulling the two sets of observation data into Pandas, Nicole can create a plot showing where observations overlap - the same object at the same time being what she's looking for. The plot makes it easy to visualise where there are gaps in the observations (including a telescope crashing in the first year); happily, for the most part, there were good overlaps to work with.
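The overlap search can be sketched as a simple time-matching pass over the two observation lists; the MJD timestamps and tolerance below are invented for illustration, not Nicole's actual data or code:

```python
def simultaneous(obs_a, obs_b, tolerance=0.001):
    """Pair up observations (MJD timestamps) from two telescopes
    taken within `tolerance` days of one another."""
    return [(t_a, t_b)
            for t_a in obs_a
            for t_b in obs_b
            if abs(t_a - t_b) <= tolerance]

telescope_1 = [59000.1030, 59001.2000, 59002.5000]
telescope_2 = [59000.1034, 59003.0000]
pairs = simultaneous(telescope_1, telescope_2)   # one genuine overlap
```

The unmatched timestamps are exactly the gaps a plot of the two series would make visible.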

This was a really cool talk!

Q&A

  • The question was inaudible but the answer was: We've only had one probe out far enough to observe TNOs, the New Horizons probe.
  • Another inaudible: There was an ultraviolet-specialised probe, but it has been decommissioned.

  1. This is a very, very, very expensive version of how digital cameras capture images. ↩︎