#dask
Tag
-
Running Dask on Databricks
Databricks is a very popular data analytics platform used by data scientists, engineers, and businesses around the world. It was founded by the creators of Apache Spark, a powerful open-source data processing engine, and builds on top of Spark to provide a comprehensive analytics platform.
-
Running Dask workloads on multiple cluster backends with zero code changes using dask-ctl
Sometimes you want to write Dask code that can run against several different cluster backends. For example, for local testing you might want to use LocalCluster, but in production use KubeCluster. Or perhaps you want to switch easily between an on-premise HPC with SLURMRunner and the cloud with Coiled.
-
The challenge of updating an aging blog
The Dask blog is a bit neglected these days. The website is an aging Jekyll blog and is well past its prime. Bringing it into the current decade has been on my backlog for a while, and today I decided to dedicate some time to getting it up to date.
-
Debugging Data Science workflows at scale
May 12, 2023 15 minute read #python, #dask, #kubernetes, #apache-beam, #google-cloud, #google-kubernetes-engine
The more we scale up our workloads, the more we run into bugs that only appear at scale. Reproducing these bugs can be expensive, time-consuming and error-prone. In order to report a bug on a GitHub repo you generally need to isolate the bug and come up with a minimal reproducer so that the maintainer can investigate. But what if a minimal reproducer requires hundreds of servers to isolate and replicate?
-
Running Jupyter in your Dask Kubernetes cluster
Did you know that the Dask scheduler has a --jupyter flag that will start a Jupyter server running within the Dask Dashboard?
-
Accelerating ETL on KubeFlow with RAPIDS
Aug 30, 2022 11 minute read #dask, #etl, #kubeflow, #pandas, #rapids, #technical-walkthrough Archive
In the machine learning and MLOps world, GPUs are widely used to speed up model training and inference, but what about the other stages of the workflow like ETL pipelines or hyperparameter optimization?
-
Using Dask on KubeFlow with the Dask Kubernetes Operator
Kubeflow is a popular Machine Learning and MLOps platform built on Kubernetes for designing and running Machine Learning pipelines for training models and providing inference services. It has a notebook service that lets you launch interactive Jupyter servers (and more) on your Kubernetes cluster as well as a pipeline service with a DSL library written in Python for designing and building repeatable workflows. It also has tools for hyperparameter tuning and running model inference servers: everything you need to build a robust ML service.
-
How to set environment variables on your Dask workers
When working with Dask clusters you often need the remote worker environment to match your local environment. This generally means having the same packages and data available.
-
What is the difference between Dask and RAPIDS?
Both Dask and RAPIDS are Python libraries to scale your workflow and empower you to process more data and leverage more compute resources. Both use interfaces modeled after the PyData ecosystem, making them familiar to most data practitioners.
-
The evolution of a Dask Distributed user
This week was the 2021 Dask Summit and one of the workshops that we ran covered many deployment options for Dask Distributed.
-
Monitoring Dask + RAPIDS with Prometheus + Grafana
Prometheus is a popular monitoring tool within the cloud community. It has out-of-the-box integration with popular platforms including Kubernetes, OpenStack, and the major cloud vendors, and integrates with dashboarding tools like Grafana.
-
Running Dask tutorials
Aug 21, 2020 20 minute read #python, #dask, #distributed-computing, #open-source, #community, #tutorials Archive
Originally published on the Dask blog on August 21st, 2020.
For the last couple of months we’ve been running community tutorials every three weeks or so. The response from the community has been great and we’ve had 50-100 people at each 90-minute session.
-
The current state of distributed Dask clusters
Originally published on the Dask blog on July 23rd, 2020.
Dask enables you to build up a graph of the computation you want to perform and then executes it in parallel for you. This is great for making best use of your computer’s hardware. It is also great when you want to expand beyond the limits of a single machine.
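The "build a graph, then execute it in parallel" idea can be illustrated with a toy sketch that uses only the Python standard library. This is not Dask's scheduler or API. The dictionary layout is loosely modeled on Dask's dict-based task graphs, and the `execute` helper is a hypothetical name introduced here purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add, mul

# Toy task graph: each key maps to a literal value or to a
# (function, *argument_keys) tuple naming the keys it depends on.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "product": (mul, "x", "y"),
    "total": (add, "sum", "product"),
}


def execute(graph, max_workers=4):
    """Evaluate every key, running independent tasks in parallel threads.

    Hypothetical helper for illustration only; Dask's real schedulers
    are far more sophisticated.
    """
    results = {}
    pending = dict(graph)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # A key is ready if it is a literal, or if all of its
            # dependencies have already been computed.
            ready = [
                key
                for key, task in pending.items()
                if not isinstance(task, tuple)
                or all(dep in results for dep in task[1:])
            ]
            if not ready:
                raise ValueError("graph has a cycle or a missing dependency")
            futures = {}
            for key in ready:
                task = pending.pop(key)
                if isinstance(task, tuple):
                    func, *deps = task
                    # Tasks in the same wave do not depend on each other,
                    # so they can be submitted to the pool together.
                    futures[key] = pool.submit(func, *(results[d] for d in deps))
                else:
                    results[key] = task
            for key, future in futures.items():
                results[key] = future.result()
    return results


results = execute(graph)
print(results["total"])
```

Here "sum" and "product" depend only on "x" and "y", so they land in the same wave and can run concurrently, while "total" has to wait for both.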
-
Exploring Dask and Distributed on AWS Lambda
I spent some time this week exploring whether it would be possible to run Dask and Distributed on a function as a service platform like AWS Lambda.
-
Instant access to auto-scaling personal Python clusters
Originally published on the Met Office Informatics Lab blog on February 7th, 2018.
We are excited to announce that the work we’ve been doing with distributed Dask clusters running on Kubernetes has been absorbed into an awesome new tool called Daskernetes through our work on the Pangeo project.
-
Adaptive Dask clusters on Kubernetes and AWS
Originally published on the Met Office Informatics Lab blog on July 21st, 2017.
Introduction
This article assumes a basic understanding of Amazon Web Services (AWS), Kubernetes, Docker and Dask. If you are unfamiliar with any of these you should do some preliminary research before continuing.