Starting with Metaflow

Metaflow is (my favorite, just the best 😍) a machine learning library that offers simple Python decorators to establish reproducible data engineering, model training, model validation, and other steps, and to execute them locally or in the cloud on AWS, Kubernetes, or Titus.

Metaflow is open-source and used at Netflix and many other companies in production machine learning and data science workflows.

What problems does Metaflow help to solve❓

  • Get training data, train a model on a schedule, and keep an audit trail of all training executions — ✅
  • Establish an ETL pipeline with just a few lines of Python code — ✅
  • Train a large-scale model on AWS, Kubernetes, or Titus — again, with just a few lines of Python — ✅
  • Quickly establish directed graphs of computation steps for parallel computing — ✅
  • Resume your compute from a certain step — ✅

I’ve seen Metaflow being used for small ETL jobs as well as for multi-day training marathons. Its simplicity makes it an extremely versatile library.

Want to try it out? This exercise will take just a few minutes of your time! I advise performing the steps below in a Python virtual environment. ❗️

You can quickly create a virtual environment with the virtualenv Python module.
The commands are similar on Mac and Linux, and slightly different on Windows.

python3 -m virtualenv venv
# activate the new virtualenv
source venv/bin/activate
# install Metaflow
pip3 install metaflow

Let’s start with a simple flow to make sure everything works. Create a file (for example, linear_flow.py) with the code snippet below:

from metaflow import FlowSpec, step

class LinearFlow(FlowSpec):

    @step
    def start(self):
        self.my_var = 'hello world'
        self.next(self.step_one)

    @step
    def step_one(self):
        print('the data artifact is: %s' % self.my_var)
        self.next(self.end)

    @step
    def end(self):
        print('the data artifact is still: %s' % self.my_var)

if __name__ == '__main__':
    LinearFlow()

To execute the flow, let’s run

python3 linear_flow.py run

You should see an output in the console:

Metaflow 2.4.3 executing LinearFlow for user:{your_user_name}
Validating your flow...
The graph looks good!
Running pylint...
Pylint is happy!
Workflow starting (run-id 1637382785717584):
2021-11-19 20:33:05.736 [1637382785717584/start/1 (pid 6096)] Task is starting.
... [start/1] Task finished successfully.
... [step_one/2] Task is starting.
... [step_one/2] the data artifact is: hello world
... [step_one/2] Task finished successfully.
... [end/3] Task is starting.
... [end/3] the data artifact is still: hello world
... [end/3] Task finished successfully.
... Done!

🎉🎉🎉 You have created your first flow! 🎉🎉🎉

There are a few essential features of Metaflow showcased in the above example.

  • When you assign a value to self inside a step, it becomes available to all subsequent steps, ❗️ unless the flow splits into parallel steps somewhere in between. ❗️
  • If the flow splits into parallel steps, values assigned before the split are not automatically available to the steps after the join; the join step has to merge them explicitly.
  • When you run your flow on AWS, values assigned to self are pickled and stored in the S3 object store. In the example above, my_var gets the value 'hello world' in the start step and is then read in the other steps. You can use this pattern to pass DataFrames, media files, and other artifacts between steps.
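To build intuition for that persistence pattern, here is a stdlib-only sketch. This is my simplification, not Metaflow's actual implementation: the helper names are hypothetical, and a local temp directory stands in for the S3 object store.

```python
import os
import pickle
import tempfile

# Hypothetical helpers that mimic the pattern of pickling step artifacts.
# A local temp directory stands in for the S3 object store.

def save_artifact(store_dir, name, value):
    # persist one artifact under its attribute name
    with open(os.path.join(store_dir, name + '.pkl'), 'wb') as f:
        pickle.dump(value, f)

def load_artifact(store_dir, name):
    # read the artifact back in a later step
    with open(os.path.join(store_dir, name + '.pkl'), 'rb') as f:
        return pickle.load(f)

store = tempfile.mkdtemp()

# the "start" step assigns self.my_var = 'hello world' ...
save_artifact(store, 'my_var', 'hello world')

# ... and a later step reads the artifact back
my_var = load_artifact(store, 'my_var')
print(my_var)  # hello world
```

Because the values go through pickle, anything picklable (a DataFrame, a numpy array, bytes of a media file) can travel between steps the same way.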
[Image: steps A → B → C. B is a parallel step with many instances of itself; C is a join step that merges all the compute that happened in the B step.]
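The A → B → C fan-out/join pattern described above can be sketched in plain Python with the standard library. This is only a conceptual analogy, not Metaflow's API (in a real flow you would express the fan-out with self.next(..., foreach=...) and a join step):

```python
from concurrent.futures import ThreadPoolExecutor

def step_a():
    # A: produce a list of items to fan out over
    return [1, 2, 3, 4]

def step_b(item):
    # B: runs once per item, in parallel; here it computes a square
    return item * item

def step_c(results):
    # C: the join step that merges all parallel results
    return sum(results)

items = step_a()
with ThreadPoolExecutor() as pool:
    results = list(pool.map(step_b, items))
total = step_c(results)
print(results, total)  # [1, 4, 9, 16] 30
```

The join step is where you decide how the parallel results combine: summing, concatenating DataFrames, picking the best model, and so on.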

As you can see, it is pretty easy to enhance your Python code with Metaflow and add parallel processing or cloud computing to your data science project. Thank you for reading!

Open sourcing

We, the tech community, thrive on open source.

When we use open source projects, we do not ask ourselves:
1. What kind of education do the project's contributors have?
2. Can the contributors balance a binary tree or implement an LRU cache?
3. Can the open source folks design Twitter?
We just take an open source project and use it.

Why, when it comes to interviewing and hiring, do we forget about this and jump into the colonist mentality of looking for top-of-the-class, most experienced, most-this and most-that talent? Why do we forget to focus on the talent who gets the job done?

Every day, we use open source software possibly written by people who lack a lot of privileges, just to hire people who are privileged all around.

Notes on the book: Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists by Alice Zheng and Amanda Casari

Recently finished reading Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists by Alice Zheng and Amanda Casari.

Great coverage of a topic that is often omitted in machine learning intro books and books on ML infrastructure.
Picked up tons of programmatic tricks for dealing with numerical and categorical data. There are good examples in Python.

Several spots in the book are in need of more attention or rework, imho.
The code printouts are quite long and hard to follow on a Kindle or iPad.

Also, the appendix on linear modeling and linear algebra doesn’t seem to belong. It could be part of the feature engineering topic, but in a more intuitive and down-to-earth way; the coverage seemed more abstract than would benefit the reader.

Informative error messages are great.

Looks like beloved Gabriel José de la Concordia García Márquez would NOT be able to change his address on the USPS website.

Thank you, USPS, for the error message that doesn’t make any sense to people with last names that consist of several distinct words.

{field: "lastName", message: "The Last Name field allows only letters and the following characters ( - )."}