Blog
Category
-
Python package managers: uv vs pixi?
When I talk to people about Python package management in 2025 I see the following tools in active use: uv, pixi, pip, conda, mamba, micromamba and poetry. There may be others, but I don’t hear much about them.
-
Revisiting old contributions for Hacktoberfest
It’s Hacktoberfest time again!
For the last few years Hacktoberfest has been opt-in for project maintainers to avoid the bombardment of spam PRs across GitHub. To find participating projects you can view the hacktoberfest tag on GitHub.
-
The Majority Of Your Users
The majority of your users don’t read your changelog.
The majority of your users only upgrade to new versions when forced to.
-
Generating useful titles for automated PRs in GitHub Actions
In kr8s I have a GitHub Actions workflow which runs a script nightly on a cron job. The workflow grabs a list of actively supported Kubernetes versions from endoflife.date and then cross-references them with the available kind container images for running the tests in CI.
-
Why don't my markdown titles work sometimes?
I write a lot of markdown. I use it on GitHub when creating issues/PRs, I use it in Obsidian when I take notes, I use it in Hugo when writing blog posts (like this one), I use it in Jupyter Notebooks when working with data and I use it in Sphinx with MyST when writing documentation.
-
100 Days of Coreutils
I consider myself an advanced Linux and macOS user. I’m currently a software engineer developing primarily for Linux systems, and I’ve previously worked as a Linux and Mac System Administrator. Over the years I’ve spent tons of time on the command line, however I bet there are a bunch of GNU Core Utilities (coreutils) commands I’ve never used before.
-
Using multiple config files with kubectl and other Kubernetes tools
If you want to point tools like kubectl to a config file other than ~/.kube/config you can set the environment variable KUBECONFIG. But did you know that KUBECONFIG behaves sort of like a path, and kubectl will load all the config files it finds?
-
Most stale bots are anti-user and anti-contributor, but they don't have to be
If you’ve been around open source projects on GitHub you may have encountered a project with a stale bot.
-
Python version epochs are broken
In PEP 440 Python introduced Version Epochs as a mechanism to allow projects to change versioning scheme. Unfortunately I can’t see a way for a project to actually make use of this without confusing its users.
-
Creating GitHub Releases automatically on tags
GitHub Releases is a feature where you can create a page associated with a git tag that contains a description of the changes in that tag along with build artifacts for users to download.
-
A beginner's guide to managing Kubernetes resources in Python with kr8s
Managing Kubernetes resources with Python has never been easier thanks to the kr8s Kubernetes client for Python.
-
Running Dask on Databricks
Databricks is a very popular data analytics platform used by data scientists, engineers, and businesses around the world. It was founded by the creators of Apache Spark, a powerful open-source data processing engine, and builds on top of Spark to provide a comprehensive analytics platform.
-
Running Dask workloads on multiple cluster backends with zero code changes using dask-ctl
Sometimes you want to write some code using Dask which can then be run against multiple different cluster backends. For example, for local testing you might want to use LocalCluster, but in production use KubeCluster. Or perhaps you want to easily switch between an on-premise HPC with SLURMRunner or the cloud with Coiled.
-
EffVer: Version your code by the effort required to upgrade
Version numbers are hard to get right. Semantic Versioning (SemVer) communicates backward compatibility via version numbers which often lead to a false sense of security and broken promises. Calendar Versioning (CalVer) sits at the other extreme of communicating almost no useful information at all.
-
How to highlight lines in a Hugo code block
Sometimes when writing code in a blog post I want to emphasize a couple of lines in particular. Today I found out that Hugo has really nice syntax to do this in a regular markdown code-fence.
-
How to get typer to show help by default
I love using typer for creating CLI tools in Python. It makes creating complex trees of subcommands really straightforward.
-
GitHub streaks and work/life balance
I recently read Loving and hating the Streak by Cassidy Williams. The post was all about committing code on GitHub every single day to maintain a streak.
-
Comparison of kr8s vs other Python libraries for Kubernetes
I’ve been working on kr8s for a while now and one of my core goals is to build a Python library for Kubernetes that is the simplest and most readable, and that produces the most maintainable code. It should enable folks to write dumb code when working with Kubernetes.
-
The challenge of updating an aging blog
The Dask blog is a bit neglected these days. The website is an aging Jekyll blog and is well past its prime. Bringing it into the current decade has been on my backlog for a while and today I decided to dedicate some time to getting it up to date.
-
How I fixed my UniFi Devices intermittently showing as offline
Since upgrading to a UniFi Dream Machine (UDM) Pro I’ve had a problem with some of my UniFi devices showing as offline. TL;DR: it turned out that I accidentally had two controllers on my network and the devices were hopping back and forth between them.
-
Livestream notes: Replacing aiohttp with httpx in kr8s
This post will be updated with notes from the livestream throughout the day. Today I will be streaming some open source code refactoring. Come and join in on Twitch! Don’t forget to say hi in the chat 😊.
-
Introducing kr8s, a new Kubernetes client library for Python inspired by kubectl
For the last few months I’ve been tinkering with a new Kubernetes client library for Python called kr8s.
-
Avoid indirection in tests at all costs
When writing tests the balance between avoiding indirection and DRY-ness should be much more weighted towards avoiding indirection than in the code it is testing.
-
Mini demos
As software engineers we should all be able to communicate the things we have built to others, but giving a formal demo of something you’ve been working on can be daunting. Mini demos are a great way to build muscles around giving ad-hoc demos with little to no preparation.
-
Debugging Data Science workflows at scale
May 12, 2023 15 minute read #python, #dask, #kubernetes, #apache-beam, #google-cloud, #google-kubernetes-engine
The more we scale up our workloads the more we run into bugs that only appear at scale. Reproducing these bugs can be expensive, time consuming and error prone. In order to report a bug on a GitHub repo you generally need to isolate the bug and come up with a minimal reproducer so that the maintainer can investigate. But what if a minimal reproducer requires hundreds of servers to isolate and replicate?
-
Running Jupyter in your Dask Kubernetes cluster
Did you know that the Dask scheduler has a --jupyter flag that will start a Jupyter server running within the Dask Dashboard?
-
Being intentional with container terminology
When writing and speaking about Linux container technologies I’m trying to be more intentional with the words I use, which often means avoiding the word docker. My goal is to communicate clearly to experts and novices alike.
-
Oversubscribing GPUs in Kubernetes
Sometimes I want to oversubscribe the GPUs in my Kubernetes cluster. This is especially useful when I’m developing but could also be useful in light workloads where you have ample GPU memory and don’t mind the occasional failure.
-
Quick and dirty way to pre-pull container images on Kubernetes
Sometimes when I give live demos with Kubernetes clusters I want to make sure that the container images I’m going to use are already pulled onto all of the nodes in my cluster. The last thing I want is for a Pod to be created and then sit in a Pending state while an image is pulled, especially given how large containers can be in the Data Science space.
-
Debugging Sphinx extensions in VSCode
This week I’ve been working on some custom Sphinx extensions for a documentation site.
Sphinx is a pretty complex tool with a broad ecosystem so documentation tends to be spread across the upstream project, dependencies like docutils and popular extensions like MyST. Therefore figuring out what is going on can be challenging, so I almost always resort to digging through state in a debugger and doing code spelunking on GitHub.
-
Sometimes I regret using CalVer
Over the last few years, many open-source Python projects that I work on have switched to CalVer. I’ve felt some pain around this, particularly in Dask and its subprojects. I want to unpack some of my thoughts and feelings around this trend.
-
Narrative driven development
In July I published a blog post on using Dask on KubeFlow with the Dask Kubernetes Operator. I originally outlined that post in January before the Dask Operator even existed as part of my planning for that work.
-
Accelerating ETL on KubeFlow with RAPIDS
Aug 30, 2022 11 minute read #dask, #etl, #kubeflow, #pandas, #rapids, #technical-walkthrough Archive
In the machine learning and MLOps world, GPUs are widely used to speed up model training and inference, but what about the other stages of the workflow like ETL pipelines or hyperparameter optimization?
-
How to check your NVIDIA driver and CUDA version in Kubernetes
When using GPUs with Kubernetes it can be important to know which driver and CUDA versions are installed on the nodes.
-
Using Dask on KubeFlow with the Dask Kubernetes Operator
Kubeflow is a popular Machine Learning and MLOps platform built on Kubernetes for designing and running Machine Learning pipelines for training models and providing inference services. It has a notebook service that lets you launch interactive Jupyter servers (and more) on your Kubernetes cluster as well as a pipeline service with a DSL library written in Python for designing and building repeatable workflows. It also has tools for hyperparameter tuning and running model inference servers, everything you need to build a robust ML service.
-
Don't prematurely squash/rebase and force push your PRs
A big frustration for me when reviewing Pull Requests on GitHub is coming back to a PR I’ve already reviewed to check on recent changes and being greeted with “We went looking everywhere, but couldn’t find those commits”.
-
Commenting on Pull Requests with GitHub Actions
When someone opens a Pull Request (PR) on your GitHub project it can be helpful for a bot to comment on the PR. You might want to thank the user for the contribution, provide some useful information such as giving a binder link where folks can try out the PR, or providing more verbose output from some tests or other checks.
-
The secret to making code contributions that stand the test of time
When you contribute code to collaborative projects, whether they are open source community projects or large internal projects inside organisations, the feeling of having your code running inside a large application can be very rewarding.
-
How to set environment variables on your Dask workers
When working with Dask clusters you often need the remote worker environment to match your local environment. This generally means having the same packages and data available.
-
Golang block until interrupt with ctrl+c
Today I found myself needing a Go application’s main thread to stop and wait until the user wants it to exit with a ctrl+c keyboard interrupt.
-
Goodbye Docker Desktop for Mac, Hello Colima
Today is the deadline for the license changes to Docker Desktop for Mac and Windows. This means that if you are employed at a company with more than 250 employees or your company makes more than $10m you need to start paying a subscription to continue using Docker Desktop.
-
Docker Desktop for Mac alternatives for developers
In a couple of days Docker will begin charging employees of companies with >250 employees to use Docker Desktop. I have no problem with paying for software that brings me value, but you wouldn’t believe how complex it can be for large companies to sign employees up to subscription services. Paperwork everywhere! To avoid this I’m evaluating alternatives for Docker Desktop to use on my MacBook.
-
Running Kubeflow inside Kind with GPU support
This week I’ve been playing around with Kubeflow as part of a larger effort to make it simpler to use Dask and RAPIDS in MLOps workflows.
-
Quick hack: Adding GPU support to kind
This post has been superseded by this tutorial that no longer requires any code changes. Please read that instead.
Don't be that open-source user, don't be me
Before I was a maintainer of open source software I was a user of open source software, and I sometimes behaved badly.
Branding your open source Python package
Having a brand can help give your open source project some legitimacy, and you don’t need to be a designer to see these benefits. However it is important to understand that you do not need to add branding to your project in order for it to be successful, and adding branding can even harm your project.
What is the difference between Dask and RAPIDS?
Both Dask and RAPIDS are Python libraries to scale your workflow and empower you to process more data and leverage more compute resources. Both use interfaces modeled after the PyData ecosystem, making them familiar to most data practitioners.
The evolution of a Dask Distributed user
This week was the 2021 Dask Summit and one of the workshops that we ran covered many deployment options for Dask Distributed.
Building a contributor community for your open source project
With our open source project published on GitHub we probably want to allow folks to contribute changes. Some users of the project may find bugs, or desire extra features and will open issues to tell you. Users who have the skills required to make that change can open a Pull Request on GitHub to propose it. As the maintainer you can then review and merge those changes.
Communicating with your open source community
Once your open source Python project has users and a community you will likely want to communicate with them in an official capacity. Perhaps you want to tell them about a new release, show a use case where someone is using your tool or solicit feedback on an upcoming feature.
Building a user community for your open source project
Now that our open source Python project exists and users can install it we will want to turn our attention to sustainability, reach and ongoing maintenance. By putting it out there and gaining users you are opening yourself up to questions, bug reports and feature requests.
Documenting Python projects with Sphinx and Read the Docs
In part four of this series we discussed documenting our code as we went along by adding docstrings throughout our project. In this post we will see that effort pay off by building a documentation site using Sphinx which will leverage all of our existing docstrings.
Monitoring Dask + RAPIDS with Prometheus + Grafana
Prometheus is a popular monitoring tool within the cloud community. It has out-of-the-box integration with popular platforms including Kubernetes, Open Stack, and the major cloud vendors, and integrates with dashboarding tools like Grafana.
Automating releases of Python packages with GitHub Actions
In this post we will cover automatically packaging and releasing our project when a new git tag is pushed to GitHub.
Testing and Continuous Integration for Python packages with GitHub Actions
In this post we will cover automatically running our tests when we push new code to GitHub, and when contributors raise Pull Requests against our project.
Awaitable Objects and Async Context Managers in Python
Python objects are synchronous by default. When working with asyncio, if we create an object the __init__ is a regular function and we cannot do any async work in here.
Test driven development in Python
What is test driven development (TDD)?
Test driven development is a style of development where you write your tests before you write your code.
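As a rough sketch of that workflow, the test below is written first and the implementation is only filled in afterwards to make it pass. The slugify function and its behaviour are made up for illustration and are not taken from the post.

```python
import re


def test_slugify():
    # Step 1: written first, before slugify() exists, to pin down the behaviour we want.
    assert slugify("Hello World") == "hello-world"
    assert slugify("  Spaces   everywhere ") == "spaces-everywhere"


def slugify(text: str) -> str:
    # Step 2: the simplest implementation that makes the test above pass.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words)


if __name__ == "__main__":
    test_slugify()
    print("all tests passed")
```

Running the test before step 2 exists fails with a NameError, which is the expected starting point: the failing test describes the work still to be done.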
Testing your Python package
In this post we will cover testing our code.
Testing
There are many many great resources out there for learning about testing software. In this post I’m going to try and focus on simple examples that you can use to get started quickly. Once you have a good foundation for your tests you can then dive into mocking, replaying HTTP requests or even hypothesis testing.
Documenting your Python code
This post will cover documenting our code. Specifically adding documentation within the code itself.
Docstrings
Right now our code is undocumented, so if the user inspects our function they will only see the interface (the way you call it) but with no other context. We can use IPython to quickly inspect this.
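A minimal sketch of the difference, using the standard library inspect module instead of IPython so that it runs anywhere. The greet functions are hypothetical stand-ins for the project's own code.

```python
import inspect


def greet(name):
    return f"Hello, {name}!"


def greet_documented(name):
    """Return a friendly greeting for ``name``."""
    return f"Hello, {name}!"


# Without a docstring the user only learns how to call the function.
print(inspect.signature(greet))
print(inspect.getdoc(greet))

# With a docstring the same inspection also explains what the function does.
print(inspect.signature(greet_documented))
print(inspect.getdoc(greet_documented))
```

In IPython the equivalent check is typing the function name followed by a question mark, which shows the signature and the docstring if one exists.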
How to interactively debug GitHub Actions with netcat
Update: This was a fun experiment and I recommend you check out the post for a fun read on setting up reverse shells. But I’ve since discovered this awesome tmate action which lets you interactively debug in the browser or via SSH.
How to check out the default git branch
Many open source projects are taking steps to update terminology to be more inclusive. The largest of these changes has been renaming the “trunk” branch of git repositories from master to main.
Leveraging the Hacktoberfest community
Hacktoberfest is approaching once again. In previous years I have both participated and contributed to open source, and also tried to leverage the community in the open source projects I maintain by curating and labeling issues.
Running Dask tutorials
Aug 21, 2020 20 minute read #python, #dask, #distributed-computing, #open-source, #community, #tutorials Archive
Originally published on the Dask blog on August 21st, 2020.
For the last couple of months we’ve been running community tutorials every three weeks or so. The response from the community has been great and we’ve had 50-100 people at each 90 minute session.
The current state of distributed Dask clusters
Originally published on the Dask blog on July 23rd, 2020.
Dask enables you to build up a graph of the computation you want to perform and then executes it in parallel for you. This is great for making best use of your computer’s hardware. It is also great when you want to expand beyond the limits of a single machine.
How to use OBS Studio with Zoom, Hangouts, Teams and more on macOS
A popular tool with streamers and YouTubers is Open Broadcaster Software®️ Studio or OBS for short. It allows you to compose scenes with cameras, desktop sharing, video snippets, images, web pages and more and then stream that video to services like Twitch or Mixer. You can also save recordings locally if you want to upload them to YouTube.
How to enable SSH on Binder
⚠️ This post is no longer valid.
Running SSH on Binder has not been possible since late 2020. Due to abuse from botnets Binder will now kill sessions running sshd.
Publishing open source Python packages on GitHub, PyPI and Conda Forge
In this post we will cover making our code available to people. This is the bit where we open the source! We will push our code to a code posting platform and then package up our library and submit it to a couple of repositories to make it easy for people to install.
Versioning and formatting your Python code
In this post, we will cover a few project hygiene things that we may want to put into place to make our lives easier in the future.
Testing static sites with Lighthouse CI and GitHub Actions
Feb 13, 2020 7 minute read #python, #github, #tutorial, #github-actions, #static-sites, #lighthouse-ci
When you build a website you want pages to load as quickly as possible for users. Google has a tool called PageSpeed Insights which you can run on your website to see various metrics about the page. I’ve used it in the past while working on my blog and other sites.
Creating an open source Python project from scratch
Have you had a great idea for an open-source Python library that you think people will find useful, but you don’t know where to begin in creating and publishing it?
5 Tips to help you ace your internship and entry-level job interviews
Applying for internships and entry-level positions can be tricky. Interviewers want to hear you talk about your experiences and things you’ve done that prove you’re a good fit for the job. However given that you are applying for an entry-level role you likely don’t have much real world experience in this space. It’s a chicken and egg situation that everyone faces when they are first starting out or wanting to make a shift to a new area.
Creating GitHub Actions in Python
Note: This post is also available in Go flavour.
GitHub Actions provide a way to automate your software development workflows on GitHub. This includes traditional CI/CD tasks on all three major operating systems such as running test suites, building applications and publishing packages. But it also includes automated greetings for new contributors, labelling pull requests based on the files changed, or even creating cron jobs to perform scheduled tasks.
Creating GitHub Actions in Go
Note: This post is also available in Python flavour.
GitHub Actions provide a way to automate your software development workflows on GitHub. This includes traditional CI/CD tasks on all three major operating systems such as running test suites, building applications and publishing packages. But it also includes automated greetings for new contributors, labelling pull requests based on the files changed, or even creating cron jobs to perform scheduled tasks.
How to run Jupyter Lab at startup on macOS
In my day to day work I generally access a variety of Jupyter installations. Sometimes these are short lived installations in conda environments on my laptop, sometimes they are running on a remote server, and sometimes I use a managed service like JupyterHub or Binder.
The three types of fun
According to folks who enjoy outdoor activities there are three types of fun. I’ve been using this scale for a while to categorize my own enjoyment of things and wanted to share my version.
Why your profile picture is important
Choosing a good profile picture will make collaborating with others easier, especially if you haven’t met them yet. Here are some tips to help you pick a good one.
Cleaning up conda environments
Often when I’m developing or debugging in Python I end up creating throw away conda environments. They will be to test some package installation or combination of packages and once I’ve finished I will probably never use them again.
Setting up GPU Data Science Environments for Hackathons
Originally published on the RAPIDS AI blog on August 13th, 2019.
Background
In my first week working at NVIDIA, I have been spending some time with my previous colleagues at the Met Office to explore how the two organizations can collaborate.
Switching to Hugo
It has been nearly two years since I published a new blog post on this website. That doesn’t mean I haven’t been writing things. It’s just that much of my content has been posted on other platforms. I’ve decided recently to gather everything together and make this website the canonical source for the things I produce. This includes blog posts, talks, videos and more.
Hypothetical datasets
In Theo’s previous posts on storing high momentum data and its accompanying metadata we get some interesting insights into the future of cloud based data storage. In this post I’m going to cover how we are working with today’s NetCDF-based challenges, by making assumptions!
Intro to Earth Information Workshop
This article was originally written for the Met Office workshop run at the Intro to Earth Information event on the 12th of March 2019.
My pragmatic workshop format
Jan 30, 2019 7 minute read #workshop, #conference-planning, #facilitation, #public-speaking, #training Archive
Mozfest workshop facilitators meeting
Figuring out the right format for a workshop can be tricky. There are so many factors: what is the subject, do people need any equipment, how many people will attend, how many facilitators will there be, where will it be held, what level of expertise will the participants have, the list goes on…
Debugging Kubernetes PVCs
Sometimes I find that something goes wrong in a container and some data stored in a persistent volume gets corrupted. This may result in me having to get my hands dirty and have a poke around in the filesystem myself.
Using Xiaomi door/window sensors as light switches
Introduction
For a while I’ve been searching for a decent light switch solution for my home automation setup. I’ve recently put in a pretty good solution using Xiaomi door/window sensors. I’m very happy with it and it ticks a lot of boxes.
Exploring Dask and Distributed on AWS Lambda
I spent some time this week exploring whether it would be possible to run Dask and Distributed on a function as a service platform like AWS Lambda.
Instant access to auto-scaling personal Python clusters
Originally published on the Met Office Informatics Lab blog on February 7th, 2018.
We are excited to announce that the work we’ve been doing with distributed Dask clusters running on Kubernetes has been absorbed into an awesome new tool called Daskernetes through our work on the Pangeo project.
ChatOps - Automation via chat
Originally published on the Met Office Informatics Lab blog on December 19th, 2017.
This article is a companion to a workshop on using chat to automate ops workflows. This is a static version of a Jupyter Notebook which you can download here.
Deploying opsdroid using ZEIT
ZEIT is a great platform for deploying your opsdroid instance. Particularly because it is free for light use, which many opsdroid deployments will be.
Article in Computer Weekly
This week Computer Weekly have published an article interviewing me about how the Met Office is tackling the vast amount of data we are producing. They reference the work the Informatics Lab have done on the Jade project and Met Office public datasets.
Adaptive Dask clusters on Kubernetes and AWS
Originally published on the Met Office Informatics Lab blog on July 21st, 2017.
Introduction
This article assumes a basic understanding of Amazon Web Services (AWS), Kubernetes, Docker and Dask. If you are unfamiliar with any of these you should do some preliminary research before continuing.
RITA 2017 Innovation Award
My team won a Real IT Award for Innovation! Check out the full post here.
Monitoring scalable infrastructure
Originally published on the Met Office Informatics Lab blog on May 8th, 2017.
Recently we’ve been thinking a lot about monitoring. In a world of ephemeral servers, auto-scaling, spot instances and infrastructure-as-code, monitoring has to be tackled differently.
Using Jupyter notebooks for SysAdmin, CloudOps and DevOps workflows.
Originally published on the Met Office Informatics Lab blog on May 8th, 2017.
Jupyter notebooks are awesome. If you speak to a data scientist or analyst who writes Python there’s a very good chance that they use Jupyter notebooks. But I think there’s another community that would benefit hugely from including them in their standard arsenal of tools, and that’s folks in IT Infrastructure.
Moving large volumes of data to S3
Originally published on the Met Office Informatics Lab blog on April 20th, 2017.
We just moved ~80TB of data to S3 (stay tuned to hear what we’re doing with it).
A game on the perception of symbols
Originally published on the Met Office Informatics Lab blog on June 24th, 2016.
With some friends look out the window and each choose a weather symbol which represents what you see. Do you all agree?
Cracking Enigma with Go
Originally published on the Met Office Informatics Lab blog on June 2nd, 2016.
Can I crack the Enigma code with Go on a MacBook? Yes!
A Raspberry Pi Docker Cluster
Originally published on the Met Office Informatics Lab blog on December 12th, 2015.
Introduction
We are fortunate in the Lab to have a small stash of Raspberry Pis in our cupboard which are used at hackathons and other events. As there are no events using them currently I thought I’d take the opportunity to make a nice demonstration piece to show off clustering containers.
Building with Kubernetes
Originally published on the Met Office Informatics Lab blog on October 1st, 2015.
For our 3D visualisation project we wanted to build a data processing service using Docker containers. We quickly found that once you are running more than a couple of containers you need a way to manage them. After looking into the different tools available we decided to give Kubernetes a go. This is what we learned.
govspeak: An open source markup language
Originally published on the Met Office Informatics Lab blog on July 22nd, 2015.
The Informatics Lab website is created with an application called Jekyll. Recently I made an enhancement to it which I’m very excited about. It allows us to write our articles in a markup language called Govspeak, which is an extension to the excellent markdown.
Lab School: Docker
Originally published on the Met Office Informatics Lab blog on June 24th, 2015.
Welcome to the first ever Lab School session. This session aims to give you an overview of Docker and how we are currently using it in the Lab.
How I value media and entertainment
I’ve been meaning to write about how I value media/entertainment and my thought process when purchasing it. This is different to my normal style of article but hopefully people will find it interesting.