Parent: Apache Suite
Airflow is a platform where you can plan, architect, trigger, and monitor your data pipelines. At its core it is an orchestrator that governs how each function, script, and tool interacts with the others:
- In what sequence?
- Under what condition?
- How frequently?
- With what SLA?
- What to do when something fails?
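Each of those questions maps to a knob on the DAG or its tasks. A minimal sketch (Airflow 2.x; the DAG id and task ids are made up for illustration):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="orchestration_knobs",      # hypothetical pipeline name
    schedule="0 6 * * *",              # how frequently? -> daily at 06:00
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={
        "retries": 3,                       # what to do when something fails?
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=1),          # with what SLA?
    },
) as dag:
    extract = EmptyOperator(task_id="extract")
    load = EmptyOperator(
        task_id="load",
        trigger_rule="all_success",    # under what condition? (the default)
    )
    extract >> load                    # in what sequence?
```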
What Airflow does for you
It handles the sequencing logic you would otherwise hard-code in a plain Python script: the ifs and elses.
To do this, you define a few things in Python:
- DAG (the pipeline governor)
- Tasks (the operations within the pipeline)
- Operators (now largely abstracted away by the TaskFlow API)
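With the TaskFlow API, plain decorated functions become tasks and the DAG structure is inferred from how their outputs are passed around. A small sketch (function and DAG names are invented for the example):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract():
        # In a real pipeline this would pull from an API or database.
        return {"rows": 42}

    @task
    def load(payload):
        # Downstream task; receives extract()'s return value via XCom.
        print(f"loaded {payload['rows']} rows")

    # Passing the output wires the dependency: extract >> load.
    load(extract())


example_pipeline()
```

Note that calling the decorated functions does not run anything immediately; it only registers tasks and dependencies with the scheduler.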
Airflow is useful in the following use cases:
- Data-powered applications, e.g. Garmin Connect
- Data powering critical operational processes (also fun), e.g. MH02 alerts
- Data for reporting and analytics
- Data for ML models and predictions
When not to use Airflow:
- Not for Stream Processing
- It's not for data transformation per se; it's for managing data workflows
Enough theory!
Let's get to practice:
- Setting up airflow in a reproducible way
- Airflow Operators
- Production grade airflow scripting
Related:
- Python Data Orchestration
- Airflow Engine