Are you wondering what Apache Airflow is and how it operates? Or maybe you are more interested in hearing about how Airflow compares to similar tools? Well either way, you are in the right place! In this article we tell you everything you need to know to understand what Airflow is and what its main advantages and disadvantages are.
We start out with a high-level explanation of what Apache Airflow is used for. After that, we go over some key concepts that you need to know to understand how Airflow works. Next, we discuss some of the main advantages and disadvantages of Airflow. Finally, we provide information on common alternatives to Airflow.
What is Apache Airflow used for?
What is Apache Airflow used for? Apache Airflow is used to schedule workflows that need to run on a regular cadence. Apache Airflow ensures that all of the steps in a workflow are run in the correct order and are allocated an appropriate amount of resources. In addition to providing scheduling capabilities, Airflow also provides a user-friendly UI that makes it easy to monitor your jobs and understand where failures occur.
What is an example of a use case where you might need to use Apache Airflow? Imagine you had a machine learning model that you wanted to use to make predictions on a new batch of data each day. You would need to wait until the underlying data was published, then preprocess that data, then make predictions using the preprocessed data.
This would be a perfect example of an Airflow use case because it is a multi-step workflow that needs to be re-run at a regular cadence. There are multiple steps in this workflow that need to run in the appropriate order. For example, the data cannot be preprocessed before the underlying data is published and the predictions cannot be made until the data is preprocessed.
Key concepts in Apache Airflow
Now we will talk about some of the most important concepts that you need to know to understand how Airflow works. Learning these concepts up front will make it much easier to read through the Airflow documentation.
DAGs in Airflow
One of the concepts that is most fundamental to Airflow is that of a DAG, or a directed acyclic graph. A DAG is not a concept that is unique to Airflow, so before we talk about what a DAG is in the context of Airflow, we will first explain what a DAG is in general terms.
What is a DAG in general terms?
A DAG is simply a graph that captures the dependencies between different entities. When a DAG is drawn as a diagram, each individual entity is usually represented by a shape like a circle or a rectangle. The relationships between different entities are represented by arrows. The arrows generally point from an upstream entity to a downstream entity that depends on it.
One important characteristic of a DAG is that it is acyclic, which means that there can be no circular dependencies within the graph. In other words, if you start at a given entity and follow its dependency chain, you can never loop back to the entity you started with.
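To make the idea concrete, here is a minimal sketch in plain Python (not Airflow itself) that uses the standard library's graphlib module, available in Python 3.9 and later. It orders three steps by their dependencies and shows what happens when a circular dependency is introduced. The step names are illustrative assumptions borrowed from the daily prediction example.

```python
from graphlib import CycleError, TopologicalSorter

# Map each step to the set of steps it depends on (its upstream steps).
# The step names are hypothetical and mirror the daily prediction example.
dependencies = {
    "wait_for_data": set(),
    "preprocess": {"wait_for_data"},
    "predict": {"preprocess"},
}

# static_order() yields each step only after all of its upstream steps.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)

# Making the first step depend on the last one creates a cycle, so the
# graph is no longer a DAG and no valid run order exists.
cyclic = {**dependencies, "wait_for_data": {"predict"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

Because the first graph is a simple chain, there is only one valid order. The second graph is rejected, which is the reason a scheduler needs the graph to be acyclic: with a cycle, there is no step it could safely run first.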
What is a DAG in the context of Airflow?
Within the context of Airflow, DAGs are used to represent individual workflows. Anytime you want to schedule a new workflow in Airflow, you must build a DAG that represents that workflow so that Airflow knows which steps need to be run in which order.
Tasks in Airflow
The next concept that you need to know is the task. A task is an individual unit of work or step that needs to be executed in a workflow. Each node that is represented in an Airflow DAG corresponds to one task.
Operators in Airflow
Another concept that is closely related to the concept of a task is an operator. An operator is an entity that performs a task. The Airflow library contains many different operators that you can import to perform different types of tasks. You can choose which operator you want to use depending on what kind of task you want to perform.
Here are just a few examples of different types of tasks that operators can perform.
- Execute custom Python code
- Run a bash script
- Send an email
- Send an HTTP request
Sensors in Airflow
The final concept that you need to know to understand how Airflow operates is that of a sensor. A sensor is technically a special type of operator, but sensors deserve some special attention because they perform a critical task. The main job of a sensor is to evaluate whether a certain condition is met. If the condition is not met, the sensor will continue to evaluate the condition at a regular interval until it is met.
The reason that sensors are so important is that they make it possible to encode external dependencies and ensure that your tasks run only after the upstream processes they depend on are finished. For example, if one of your tasks relies on a file being uploaded by an external system, you can use a sensor to determine whether that file has been uploaded.
Here are a few examples of types of events that sensors can listen for.
- A specific time on the clock.
- A file to be uploaded somewhere.
- A record to be added to a SQL database.
- A task in a different DAG to complete.
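The polling behavior described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Airflow's implementation: the function wait_for and the stand-in condition are hypothetical, and the parameter names poke_interval and timeout are borrowed from the arguments that Airflow sensors accept.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], poke_interval: float, timeout: float) -> bool:
    """Re-check `condition` every `poke_interval` seconds until it is true.

    Returns True once the condition is met, or False if the next check
    would fall after the `timeout` deadline. Hypothetical helper, not
    part of Airflow.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() + poke_interval > deadline:
            return False
        time.sleep(poke_interval)


# Stand-in for an external dependency: the "file" shows up on the third check.
checks = {"count": 0}


def file_has_arrived() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


arrived = wait_for(file_has_arrived, poke_interval=0.01, timeout=1.0)
never = wait_for(lambda: False, poke_interval=0.01, timeout=0.05)
```

In Airflow, the condition check lives in a sensor's poke method, and downstream tasks start only once the sensor succeeds. A timeout matters in practice because a sensor that waits forever on a file that never arrives keeps occupying a slot in the scheduler.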
How are Airflow DAGs implemented?
So how are pipelines actually implemented in Airflow? Airflow DAGs are implemented entirely in Python code. Objects such as DAGs and Operators can be imported from the Airflow Python library in order to build out Airflow pipelines.
The fact that Airflow DAGs are implemented entirely in code makes it easier to maintain, collaborate on, and test Airflow DAGs. It also makes it possible to generate pipelines dynamically with ordinary Python, and to fill in task parameters at runtime using Jinja templating.
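As a rough illustration, here is a minimal sketch of what the daily prediction workflow might look like as an Airflow DAG. It assumes an Airflow 2.x installation (2.4 or later for the schedule argument). The DAG name, start date, and the bodies of the two task functions are placeholders invented for this example, and the step that waits for the data to be published is left out to keep the sketch short.

```python
from datetime import datetime


def preprocess_data() -> str:
    # Placeholder for real preprocessing logic.
    return "preprocessed"


def make_predictions() -> str:
    # Placeholder for real model inference.
    return "predictions"


def build_daily_predictions_dag():
    """Build the DAG. Airflow is imported inside this function so that the
    plain task functions above can be tested without Airflow installed."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="daily_predictions",       # hypothetical name
        start_date=datetime(2024, 1, 1),  # hypothetical start date
        schedule="@daily",                # assumes Airflow 2.4+; older versions use schedule_interval
        catchup=False,                    # do not backfill runs before today
    ) as dag:
        preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess_data)
        predict = PythonOperator(task_id="predict", python_callable=make_predictions)
        # The >> operator declares that predict runs after preprocess.
        preprocess >> predict
    return dag
```

In a real DAG file you would assign the result at module level, for example dag = build_daily_predictions_dag(), so that the scheduler can discover it. Each PythonOperator becomes one task in the graph, and the >> line is what encodes the dependency between them.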
What are the advantages of using Airflow?
What advantages does Airflow have over similar workflow management tools? Here are some of the main advantages provided by Airflow.
- Scalable. One of the main advantages of Apache Airflow is its ability to scale as the number of workflows that need to be managed increases. Airflow is a great option to turn to if you expect that the number of workflows that need to be managed is going to increase over time.
- Powerful UI. Airflow also has a powerful UI for visualizing workflows that is generally considered to be superior to the interfaces offered by similar tools. This makes it easier to gain visibility into the status of ongoing workflows and identify where errors occurred in terminated workflows.
- Modular & easily extensible. The Python library that is used to build out Airflow pipelines is modular and easily extensible. This makes it easier to build out non-traditional workflows that need a little bit of customization. For example, it is straightforward to define your own operators and then incorporate them into your Airflow DAGs.
- Robust integrations. Airflow offers many integrations for things such as cloud services. These integrations make it easier to incorporate Airflow into your current ecosystem without having to make many changes to your current infrastructure.
What are the disadvantages of Airflow?
What are some disadvantages that Airflow has compared to similar tools? Here are some disadvantages of Airflow.
- Off-schedule workflows. One of the main disadvantages of Airflow is that it is built around scheduled runs, so it is not as straightforward to run workflows in an off-schedule or ad-hoc fashion as it is in some similar tools.
- Steep learning curve. Another disadvantage of Airflow is that it is generally considered to have a steeper learning curve than similar tools. Deep technical knowledge may be required to implement some complex or custom workflows.
- Documentation is lacking. A final disadvantage of Airflow is that it does not have as robust documentation as some other tools. This is related to the previous point and contributes to the steep learning curve.
What are common alternatives to Airflow?
- Luigi. Luigi is a Python module that can be used to schedule batch jobs. Luigi is generally considered to be simpler than Airflow. On one hand, this means that Luigi has less of a learning curve for teams that need to schedule simple workflows. On the other hand, Luigi may lack features that are required to support more complex workflows.
- Oozie. Oozie is another workflow scheduler that comes with a built-in UI. Oozie was specifically designed for scheduling Hadoop jobs, so it is not as versatile as Airflow if you have other types of tasks that need to be scheduled.
- Argo. Argo is a workflow scheduler that operates within the Kubernetes ecosystem.
- Prefect. Prefect is a Python-based scheduler that aims to offer many of the features that Airflow supports with a simpler interface. While the core project is open source, many of the extensions are offered as a paid service.
- Others. Dagster, Apache NiFi, etc.