Starting with Metaflow

Metaflow (my favorite, just the best 😍) is a machine learning library that uses simple Python decorators to define reproducible steps for data engineering, model training, model validation, and more, and runs them locally or in the cloud on AWS, Kubernetes, or Titus.

Metaflow is open-source and used at Netflix and many other companies in production machine learning and data science workflows.

What problems does Metaflow help to solve❓

  • Get training data, train a model on a schedule, and keep an audit trail of all the training executions ✅
  • Establish an ETL pipeline with just a few lines of Python code ✅
  • Train a large-scale model on AWS, Kubernetes, or Titus, again with just a few lines of Python ✅
  • Quickly establish directed graphs of different computation steps for parallel computing? Check! ✅
  • Resume your compute from a certain step? ✅

I’ve seen Metaflow being used for small ETL jobs as well as for multi-day training marathons. Its simplicity makes it an extremely versatile library.

Want to try it out? This exercise will take just a few minutes of your time! ❗️ I advise performing the steps below in a Python virtual environment.

You can quickly create a virtual environment with the Python module virtualenv.
The commands are similar on Mac and Linux and slightly different on Windows.

# requires the virtualenv package (pip3 install virtualenv)
python3 -m virtualenv venv
# activate the new virtualenv
source venv/bin/activate
# install Metaflow into it
pip3 install metaflow

Let’s start with a simple flow to make sure everything works. Create a metaflow_start.py with the below code snippet:

from metaflow import FlowSpec, step

class LinearFlow(FlowSpec):

    @step
    def start(self):
        self.my_var = 'hello world'
        # the next step must be named after an existing step method
        self.next(self.step_one)

    @step
    def step_one(self):
        print('the data artifact is: %s' % self.my_var)
        self.next(self.end)

    @step
    def end(self):
        print('the data artifact is still: %s' % self.my_var)

if __name__ == '__main__':
    LinearFlow()

To execute the flow, run:

python3 metaflow_start.py run

You should see output like this in the console:

Metaflow 2.4.3 executing LinearFlow for user:{your_user_name}
Validating your flow...
The graph looks good!
Running pylint...
Pylint is happy!
Workflow starting (run-id 1637382785717584):
2021-11-19 20:33:05.736 [1637382785717584/start/1 (pid 6096)] Task is starting.
... Task finished successfully.
...Task is starting..../step_one/2 the data artifact is: hello world... Task finished successfully.
... Task is starting.... the data artifact is still: hello world... Task finished successfully.
... Done!

🎉🎉🎉 You have created your first flow! 🎉🎉🎉

There are a few essential features of Metaflow showcased in the above example.

  • When you assign a value to self inside a step, it is available to every later step through to the last one, ❗️ unless the flow splits into parallel steps somewhere in the middle. ❗️
  • If the flow splits into parallel steps, values assigned in the preceding steps are not automatically available after the parallel steps. The join step has to carry them forward explicitly.
  • When you run your flow on AWS, values assigned to self are pickled and stored in the S3 object store. In the example, my_var gets the value 'hello world', and later steps can then read my_var. You can use this pattern to pass DataFrames, media files, and other artifacts between steps.
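Pickling is what limits which values can travel between steps. The sketch below uses only the standard pickle module to show the round trip an artifact goes through. It illustrates the idea and is not Metaflow's own storage code; the helper names round_trip and is_picklable are made up for this example.

```python
import pickle


def round_trip(value):
    """Serialize a value to bytes and load it back, as a stand-in for
    storing an artifact in one step and reading it in a later one."""
    blob = pickle.dumps(value)
    return pickle.loads(blob)


def is_picklable(value):
    """Return True if the standard pickle module can serialize the value."""
    try:
        pickle.dumps(value)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == '__main__':
    # Hypothetical artifact, similar to what a step might assign to self.
    artifact = {'my_var': 'hello world', 'scores': [0.1, 0.2, 0.3]}
    # The restored object equals the original but is a separate copy.
    restored = round_trip(artifact)
    print(restored == artifact, restored is artifact)
    # Plain data can be pickled; a lambda cannot.
    print(is_picklable(artifact), is_picklable(lambda x: x))
```

In practice this means plain data, lists, dictionaries, and most library objects work as artifacts, while things like open file handles or lambdas are better recreated inside the step that needs them.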
The image shows steps A, B, and C, with the flow going from A to B and from B to C. Step B is a parallel step that runs as many instances of itself. Step C is a join step that merges the results of all the computation done in the B instances.

As you can see, it is pretty easy to enhance your Python code with Metaflow and add parallel processing or cloud computing to your data science project. Thank you for reading!
