What is Airflow?

Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows.

The philosophy is "Workflows as Code", which serves several purposes:

  • dynamic: pipelines are defined in Python code, so they can be generated dynamically (see the sketch after this list)
  • extensible: can connect with numerous other technologies (packages, databases)
  • flexible: parameterization leverages the Jinja templating engine (also in the sketch below)
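
As a rough sketch of the dynamic and flexible points (the dag_id, task ids, and echo commands here are made up for illustration), tasks can be generated in a plain Python loop and parameterized with Jinja template variables such as {{ ds }}, Airflow's built-in logical-date variable:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="dynamic_demo", start_date=datetime(2022, 1, 1), schedule="@daily") as dag:
    # Dynamic: one extract task per source, generated in a loop
    for source in ["orders", "users", "events"]:
        BashOperator(
            task_id=f"extract_{source}",
            # Flexible: {{ ds }} is rendered by Jinja at runtime to the logical date
            bash_command="echo extracting " + source + " on {{ ds }}",
        )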

Here's a simple example:

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator

# A DAG represents a workflow, a collection of tasks
with DAG(dag_id="demo", start_date=datetime(2022, 1, 1), schedule="0 0 * * *") as dag:

    # Tasks are represented as operators
    marco = BashOperator(task_id="marco", bash_command="echo marco")

    @task()
    def polo():
        print("polo")

    # Set dependencies between tasks
    marco >> polo()

There are three main concepts to understand.

  • DAGs: describe the work (tasks) and the order in which to carry out the workflow
  • Operators: classes that act as templates for carrying out some work
  • Tasks: the units of work (nodes) in a DAG; each task is created by instantiating an operator (directly, or via the @task decorator)

Here we have a DAG (Directed Acyclic Graph) named "demo" that starts on January 1st, 2022 and runs daily at midnight (the "0 0 * * *" cron schedule).

We have two tasks defined:

  • A BashOperator running a bash command
  • A Python function defined with the @task decorator

>> is Python's bitshift operator, which Airflow overloads to define the dependency and ordering of tasks.

Here it means marco runs first, then polo().
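
There are a few equivalent ways to declare the same edge; a minimal sketch (the dag_id and task ids here are made up):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="deps_demo", start_date=datetime(2022, 1, 1), schedule=None) as dag:
    a = BashOperator(task_id="a", bash_command="echo a")
    b = BashOperator(task_id="b", bash_command="echo b")

    # "a runs before b"; the same edge could be written b << a, or a.set_downstream(b)
    a >> b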

Why Airflow?

  • Coding > clicking
  • version control: you can roll back to previous versions of a workflow
  • workflows can be developed by multiple people simultaneously
  • write tests to validate functionality (a small sketch follows this list)
  • components are extensible
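
On the testing point, a minimal pytest-style sketch, assuming the "demo" DAG above lives in the configured DAG folder:

from airflow.models import DagBag

def test_dags_parse_cleanly():
    # All DAG files load without import errors
    dagbag = DagBag(include_examples=False)
    assert not dagbag.import_errors

def test_demo_dag_structure():
    # The "demo" DAG exists and contains the expected tasks
    dag = DagBag(include_examples=False).get_dag("demo")
    assert dag is not None
    assert {"marco", "polo"} <= {t.task_id for t in dag.tasks}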

Why not Airflow?

  • not for infinitely-running, event-based workflows (that's streaming territory, e.g. Kafka)
  • if you like clicking

Resources

Alternatives

Blogs

Jun 29, 2023