KiwiPyCon 2022 Day 3
Keynote: Code Like a Scientist: Free Software and "Good enough" Practise
Dr Héloïse Stevance
Héloïse has no formal education as a programmer, and her day job does not pay her to write code. This isn't unique to her: it's common not just to researchers, but to data analysts and data scientists who are paid to produce results, not to write code. And that affects how code is written and shared.
She offers a sample plot with the question: how do you know this is real science? Because it's not pretty. In a lot of science - and in data jobs generally - the data is messy; her example is hundreds of plots in the same file, with different shapes. Her own job makes the point: she's an astrophysicist, and her data sources are complex 3D captures, or simulations that started in the 60s and have evolved from early FORTRAN through FORTRAN 77 and even some FORTRAN 90.
This means that her needs are specialised and niche, and libraries that assume data will be clean, tidy, and well-organised - like Pandas or even numpy - aren't going to have a good time. But out there in the world, whether it's Héloïse's world of Hertzsprung–Russell diagrams or other problems, data can be very irregular. She was quite surprised that no-one had written a library to handle this - it's a common diagram derived from common data sets. So she wrote one, called Hoki.
And she shared it - via GitHub, but also in The Journal of Open Source Software (JOSS). And other people started using it: the first team other than Héloïse's own used it for a study of the R136 star cluster, which contains the most massive known star (around 200 solar masses).
Héloïse refers back to Bill Gates' (in)famous open letter about software developers getting paid. How do you get paid for code? It's a good question, she suggests: in her case, she doesn't get paid based on her code; she gets recognised based on citations - the h-index. In the first paper using her code, Dr Stevance isn't cited as an author, so it doesn't count towards her h-index, which means that, professionally speaking, the code was worthless. Publishing code takes time, good code takes time, and it's time not spent publishing papers, which is how you keep your job. Worse yet, if someone publishes results with your code before you do, you've been disadvantaged by sharing it.
This creates a landscape where people have an incentive to write bad code, and not publish it, which undercuts reproducibility - the heart of science. Héloïse offers the example of code that was published to implement the Voronoi binning method, VorBin. She found that the core of the published library does nothing - and why? Because the core code that it relies upon is proprietary. You can download the wrapper, but you can't do anything useful with it.
What makes the problem worse is that people are ashamed of their code: they don't want to show it, because they know that they aren't producing polished code. They're intimidated by the big-name projects and the standards they set, and lose sight of the fact that code for a small community is valuable, even if it's a bit rough and ready. So why bother? Well, Héloïse offers some examples of what happens when you don't share:
- Duplication of effort, and lower quality code.
- It makes results impossible to reproduce, and reproducibility is the heart of science.
So if the problem is that people aren't sharing code, or spending the time to write good code, how do we solve it? Well, code needs to move your metrics: for scientists, that's the recognised impact metrics like the h-index. Scientists need to be generous with their citations, which in turn encourages other scientists to share their code. Going back to the R136 paper: while Héloïse is not an author, she is cited - and that means it does in fact reward her professionally, and she can justify the time to work on the code.
Lifting code quality helps - but one problem is that describing what that looks like is difficult. Take the weasel-word phrase "best practice": what does that mean? There's no best practice devoid of context. Dr Stevance offers the example of cooking: a home cook isn't going to wear a hair net or gloves as a professional would - it's a waste of time and effort. So she wants us to think about "good enough practice"; first, think about what tier you are working to (a hypothetical sketch of what this can look like in code follows the list):
- One-off scripts. Not scripts that are only ever run once; they may be re-used within a piece of research, but they aren't heavily used outside it:
  - Meaningful variable names.
  - Comments. Comp sci folks may think that code is self-documenting; it isn't. Record the why, record the choices you make.
  - Make your code re-usable.
- Unit testing.
- Docstrings.
- Version control.
- Sharing with the community:
  - High-quality error messages and error handling.
  - Clear documentation that helps people flatten the learning curve - you should already have docstrings, and Sphinx can turn them into real documentation.
  - Please, please, please write some tutorials.
  - Remember that putting your code on GitHub (or equivalent) doesn't make it open source. You need to pick a license that allows for sharing.
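None of the following comes from Hoki or Héloïse's slides - it's a minimal, hypothetical Python sketch of what "good enough" can look like across those tiers: a meaningful name and a why-comment for a one-off script, plus a docstring and a small unit test once the code is meant to be shared. The function and values are made up for illustration.

```python
# Hypothetical example, not from Hoki: meaningful names, a comment that
# records *why*, a docstring, and one small unit test that pytest can run.
import math


def absolute_magnitude(apparent_magnitude, distance_parsecs):
    """Convert an apparent magnitude to an absolute magnitude.

    Uses the standard distance modulus: M = m - 5 * log10(d / 10 pc).
    """
    # Fail loudly on bad catalogue rows rather than silently returning NaN.
    if distance_parsecs <= 0:
        raise ValueError(f"distance must be positive, got {distance_parsecs}")
    return apparent_magnitude - 5 * math.log10(distance_parsecs / 10)


def test_absolute_magnitude_at_ten_parsecs():
    # At exactly 10 parsecs, apparent and absolute magnitudes are equal.
    assert absolute_magnitude(4.83, 10) == 4.83
```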
But this will differ from environment to environment: what does the hierarchy of coding needs look like in your environment?
Q&A
- Have you heard of the Credit System in academia? Héloïse has not, but it sounds interesting.
- Have other people published through JOSS? Yes - Héloïse has evangelised this and other people are starting to adopt it. It even has an RSS feed; moreover, the Astrophysical Journal has a formal relationship with JOSS, which feeds into the review, feedback, and credit systems.
- Do you see a way in which non-academics could contribute? Yes, but not from a free-time/hobbyist perspective. Some senior academics are hiring programmers, but it's still rare.
- Have you seen the Good Research Code Guideline? Héloïse hasn't, but she will take a look.
- It's not uncommon for open source communities to be harsh in their criticism of code - do you think this is one reason that scientists are reluctant to share code? She hasn't had that experience, but she doesn't engage with the general open source community. She does note, though, that we do often talk about bad code and of course no-one likes to be made fun of.
Keeping It Simple and Scalable: quick production-scale data pipelines
Jenny Sahng
Jenny starts out by surveying us on our preferences across a number of choices - and notes that the options can quickly seem overwhelming: package tooling, code formatting, deployment, you name it, there's plenty of room to end up blocked trying to make a choice; and when you do choose, picking too many tools can slow you down.
A data pipeline can be very simple, fed from a CSV, or very complex, like astronomical data. An example Jenny offers is a simple mashup of GitHub and Jira into a dashboard. The talk's title might sound intimidating, but there's no reason it has to be: you don't need to pick tools that you don't know if the tools you already have are good enough, for example. Using a real-world example, she shows us pulling real-time events from GitHub: a simple Python wrapper that pulls the events, tidies them up - which is most of the complexity - and then stashes them in a database.
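This isn't Jenny's actual code - it's a minimal sketch of the shape she describes, assuming the public GitHub REST events endpoint, the requests library, and a local SQLite database; the repository, table schema, and function names are made up.

```python
import sqlite3

import requests


def fetch_events(owner, repo):
    """Pull recent events for a repository from the GitHub REST API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/events"
    response = requests.get(url, headers={"Accept": "application/vnd.github+json"})
    response.raise_for_status()
    return response.json()


def tidy(events):
    """Flatten the nested API payload into simple rows - most of the work."""
    return [
        (
            event["id"],
            event["type"],
            (event.get("actor") or {}).get("login"),
            event["created_at"],
        )
        for event in events
    ]


def store(rows, db_path="events.db"):
    """Stash the tidied rows in a local database."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS events "
            "(id TEXT PRIMARY KEY, type TEXT, actor TEXT, created_at TEXT)"
        )
        conn.executemany("INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)", rows)


if __name__ == "__main__":
    store(tidy(fetch_events("python", "cpython")))
```

The point is how little is needed: three small functions plus the standard library's sqlite3 cover extract, transform, and load, and each piece can be swapped out later if the needs grow.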
Simple
Simple doesn't mean low-quality, though: you still want testing and a proper CI/CD pipeline, for example. A simple stack should be easy to learn, easy to read, and accessible to your team. The ideal tools should be focused on data wrangling, easy to deploy on your chosen infrastructure, and should be scalable enough for your likely future needs. That said, Jenny notes that "what you know" is also an important consideration - if you can do the job with tools you know, that will often be a better choice than learning new things.
It's remarkable how much more productive you can be if you have the discipline to focus on the things that you know and can deliver, rather than being lured into a swamp of choices and tooling that you don't understand.
Scalable
Over time, though, you'll have new needs, which means you need to be prepared to adopt or modify your tool choices: when Jenny's jobs started regularly taking longer to run than a Lambda allows for, they moved to EC2 instances. And as the number of users has grown, logging and alerting have become important.
Different use cases require different levels of maturity in the code, though: Jenny notes that the best way to test new features is a UX fed with test data from the pipeline, rather than a static mock-up. In that case, being able to scale the pipeline down for testing and delivery is important.
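One way to read "scaling the pipeline down" - an assumption about the approach, not Jenny's implementation - is to give the pipeline a size knob so the same code path can feed a UX prototype or a test with a small, realistic slice of data:

```python
# Hypothetical illustration: the same pipeline with a `limit` knob, so tests
# and UX prototypes run end-to-end on a small but realistic slice of data
# instead of a static mock-up.
def tidy_row(row):
    # Placeholder for the real cleaning logic.
    return {key.strip().lower(): value for key, value in row.items()}


def run_pipeline(source_rows, limit=None):
    """Run the full tidy step, optionally on a reduced slice."""
    rows = source_rows if limit is None else source_rows[:limit]
    return [tidy_row(row) for row in rows]  # same transformations either way


# Production processes everything; a UI test seeds itself from a small sample.
sample = run_pipeline([{" Name ": "Ada"}, {" Name ": "Grace"}], limit=50)
```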
Q&A
- What are the pull request metrics that you look at? Jenny notes that they focus on turn-around on commits being approved, collaboration metrics (are people working together, or are there silos?), and out-of-hours commits as a signal of wellness problems. (A rough sketch of the turn-around metric follows the Q&A.)
- Do you use your own tools for your own development? Yes, constantly.
- Do you look at anything other than GitHub? At the moment, only GH, but in the future they're keen to look at Slack. They're rolling out Linear, and Jira is next. It's focused on trying to understand the types of work that are happening and whether people are overloaded or siloed with different types of work.
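As a footnote to the first answer, here is a rough sketch of how the turn-around metric might be computed - an assumption about the approach, not Jenny's code; the created_at, state, and submitted_at fields come from the GitHub REST API's pull request and review payloads.

```python
from datetime import datetime


def parse_timestamp(value):
    """GitHub timestamps look like '2022-08-20T03:15:00Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def approval_turnaround_hours(pull_request, reviews):
    """Hours from a PR being opened to its first APPROVED review, or None."""
    opened = parse_timestamp(pull_request["created_at"])
    approvals = [
        parse_timestamp(review["submitted_at"])
        for review in reviews
        if review["state"] == "APPROVED"
    ]
    if not approvals:
        return None
    return (min(approvals) - opened).total_seconds() / 3600
```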