Harry Tan

My Top 10 Airflow Tricks

For the past four years, my team and I have relied on Airflow as our primary orchestration tool. Working alongside five talented data engineers at a private SaaS company of approximately 800 employees, we’ve managed all our ETL jobs using Airflow. Over the years, we’ve upgraded, relocated, migrated, and made numerous modifications. In this article, I’ve highlighted 10 straightforward Airflow tweaks that have significantly simplified our tasks. I believe that if you adopt some of these changes, you’ll appreciate them as much as we do.

Let’s dive in.

1. Design Your DAG to Operate With or Without a ‘Config’

Typically, DAGs run as scheduled, processing data in their predefined manner. However, there are instances when we need to rerun specific historical dates or certain segments. In such cases, it’s highly beneficial to design the DAG to be triggered with a ‘config’, often specified by a particular start and end date.

Another tip is to include instructions on composing the config within the DAG description. This ensures that team members understand how to execute it correctly.
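
To make this concrete, here is a minimal sketch of such a DAG; the DAG id, dates, and task logic are hypothetical, and it assumes Airflow 2.4+ for the `schedule` argument. Scheduled runs fall back to their normal data interval, while a manual trigger can pass `start_date`/`end_date` in the config to reprocess a range. The `doc_md` doubles as the in-UI instructions mentioned above.

```python
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.models.param import Param

with DAG(
    dag_id="orders_backfillable_load",  # hypothetical DAG
    schedule="@daily",                  # use schedule_interval on older Airflow 2.x
    start_date=datetime(2023, 1, 1),
    catchup=False,
    params={
        # Defaults keep scheduled runs working; a manual trigger can override both.
        "start_date": Param(None, type=["null", "string"], description="YYYY-MM-DD"),
        "end_date": Param(None, type=["null", "string"], description="YYYY-MM-DD"),
    },
    doc_md=(
        "Trigger with a config to reprocess a range, e.g. "
        '`{"start_date": "2023-06-01", "end_date": "2023-06-07"}`. '
        "Without a config, the scheduled data interval is used."
    ),
) as dag:

    @task
    def load(**context):
        params = context["params"]
        # Fall back to the scheduled data interval when no config is supplied.
        start = params["start_date"] or context["data_interval_start"].to_date_string()
        end = params["end_date"] or context["data_interval_end"].to_date_string()
        print(f"Loading orders from {start} to {end}")

    load()
```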

2. Create Unscheduled “Tooling” DAGs

We’ve developed several ‘tooling’ DAGs without assigning them regular running schedules. For instance, one might be designed to delete files with a specific S3 prefix. This approach significantly addresses permission challenges. Manual operations in the production environment often demand elevated permissions, which many of our non-system users lack. However, the Airflow user already possesses these permissions for standard operations in production.
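
As a rough illustration, here is a minimal sketch of such a tooling DAG; the DAG id, bucket name, and AWS credentials setup are all assumptions. It has no schedule, so it only runs when triggered manually, and the prefix to delete is passed in as a param.

```python
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.models.param import Param

with DAG(
    dag_id="tool_delete_s3_prefix",  # hypothetical DAG
    schedule=None,                   # no schedule: manual triggers only
    start_date=datetime(2023, 1, 1),
    catchup=False,
    params={"prefix": Param("tmp/", type="string", description="S3 prefix to delete")},
    tags=["tooling"],
) as dag:

    @task
    def delete_prefix(**context):
        import boto3  # assumes AWS credentials are available to the worker

        s3 = boto3.resource("s3")
        bucket = s3.Bucket("my-data-bucket")  # hypothetical bucket name
        responses = bucket.objects.filter(Prefix=context["params"]["prefix"]).delete()
        print(f"Issued {len(responses)} delete batch(es)")

    delete_prefix()
```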

An added advantage is the ability to log these manual operations. Every DAG run leaves a trace in the database, which is then displayed in the UI, ensuring transparency and accountability.

3. Sending Slack Alerts Made Easy

In Slack, every channel can generate its own email address (look for ‘Send emails to this channel’ in the channel’s integration settings). The generated address will resemble ‘mychannel_[some_random_text]@myorg.slack.com’. Simply add this new address to the recipient list of your DAGs, and you’re good to go.

For example, if we want to send an additional copy of Airflow alerts to our Slack channel, we can add the email address to the central recipient list instead of updating all the DAGs.
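
A minimal sketch of what this looks like in code, with hypothetical addresses and DAG names, assuming SMTP is already configured for Airflow’s standard email alerts; the Slack address is simply another entry in the shared alert list.

```python
from datetime import datetime

from airflow import DAG

# Hypothetical central alert list; the Slack address is the one generated by the channel.
ALERT_EMAILS = [
    "data-team@myorg.com",
    "mychannel_abc123@myorg.slack.com",
]

default_args = {
    "owner": "data-engineering",
    "email": ALERT_EMAILS,        # Airflow's built-in email alerting handles the rest
    "email_on_failure": True,
    "email_on_retry": False,
}

with DAG(
    dag_id="sales_daily_load",    # hypothetical DAG
    default_args=default_args,
    schedule="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
):
    ...
```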

4. Utilize Ignore Files

We utilize .airflowignore files to exclude specific DAGs from targeted deployments. This strategy serves two main purposes:

Regional Deployments: We operate across multiple data centers, so each region gets its own Airflow deployment. To manage this, we maintain a dedicated exclusion file per region, such as .airflowignore_eu and .airflowignore_us, which controls the DAGs each deployment loads (see the sketch after this list).

Temporary Exclusions: There are instances when we need to temporarily sideline certain DAGs. For instance, while some stakeholders may consent to the removal of their DAGs, they might also anticipate a future need for them. Instead of removing the code entirely from our codebase, using this exclusion feature allows us to retain the code without deploying the associated DAGs. It’s a more efficient way to manage temporary changes.
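
Since Airflow only reads a file literally named `.airflowignore` in the DAGs folder (one exclusion pattern per line), one way to wire this up is a deploy step that copies the regional variant into place. A minimal sketch, with hypothetical paths and region variable:

```python
import shutil
from pathlib import Path

DAGS_FOLDER = Path("/opt/airflow/dags")  # assumed DAGs folder
REGION = "eu"                            # e.g. derived from the deployment environment

# Copy the regional exclusion file to the name Airflow actually reads,
# so each deployment skips the DAGs listed for its region.
shutil.copyfile(DAGS_FOLDER / f".airflowignore_{REGION}", DAGS_FOLDER / ".airflowignore")
```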

5. Harness the Power of Tags for Grouping DAGs

If you’ve been using Airflow for a while, you might be familiar with the tagging feature. Introduced in recent versions, tags weren’t available when we began our Airflow journey. However, upon upgrading, we immediately saw the benefits of using tags to group similar DAGs. For instance, when we import multiple tables from the same database connection, a single connection issue can affect all related DAGs. By tagging these DAGs, we can swiftly address any issues that arise from a shared data source.
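
For example, with hypothetical DAG and tag names, every DAG that reads from the same CRM database gets a shared tag, so filtering the DAGs list by that tag immediately shows everything affected by a connection problem:

```python
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="crm_accounts_import",            # hypothetical DAG
    schedule="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
    tags=["crm-db", "import", "marketing"],  # filterable in the DAGs list view
):
    ...
```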

6. Include Stakeholders in the DAG Description

Much like ours, your setup might include data pipelines serving specific stakeholder teams, such as Marketing, Sales, or Customer Success. By naming these stakeholders in the DAG description, we’ve streamlined communication significantly.
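
A minimal sketch, with hypothetical team names, of where that information lives:

```python
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="sales_pipeline_daily",  # hypothetical DAG
    schedule="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
    description="Daily sales pipeline load. Stakeholders: Sales Ops (#sales-ops), Marketing.",
):
    ...
```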

7. Customize Your Timetable for Varied Schedules

We’ve found value in adjusting our job schedules based on the day of the week. For instance, our weekend jobs run less frequently since our data users aren’t active then, which helps cut costs. Airflow’s custom timetable feature has been instrumental in this regard. Check out the official documentation for more details.
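
What follows is a rough sketch of such a custom timetable, not the exact one we run: it schedules six-hour intervals on weekdays and a single daily interval on weekends, and it omits catchup handling for brevity. It assumes Airflow 2.2+ (when custom timetables were introduced) and is registered through a plugin.

```python
from datetime import timedelta

from pendulum import DateTime

from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable


class WeekdayWeekendTimetable(Timetable):
    """Hypothetical timetable: 6-hour intervals on weekdays, 24-hour intervals on weekends."""

    def _interval(self, start: DateTime) -> timedelta:
        # isoweekday(): Monday=1 ... Sunday=7
        return timedelta(hours=24) if start.isoweekday() >= 6 else timedelta(hours=6)

    def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
        # Manual runs cover the interval that ends at the trigger time.
        length = self._interval(run_after)
        return DataInterval(start=run_after - length, end=run_after)

    def next_dagrun_info(self, *, last_automated_data_interval, restriction: TimeRestriction):
        if last_automated_data_interval is not None:
            start = last_automated_data_interval.end
        elif restriction.earliest is None:
            return None  # no start_date, nothing to schedule
        else:
            start = restriction.earliest  # catchup handling omitted for brevity
        end = start + self._interval(start)
        if restriction.latest is not None and end > restriction.latest:
            return None
        return DagRunInfo.interval(start=start, end=end)


class WeekdayWeekendTimetablePlugin(AirflowPlugin):
    name = "weekday_weekend_timetable_plugin"
    timetables = [WeekdayWeekendTimetable]
```

Once the plugin is loaded, a DAG can use it via `schedule=WeekdayWeekendTimetable()` (or `timetable=` on Airflow 2.2/2.3).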

8. Have a Default DAG Template

We utilize a central Data Factory to generate DAGs with default configurations. For specialized DAG types, we inherit from this central factory, tailoring it to produce custom behaviors. While this approach simplifies DAG generation, it can complicate the code over time.
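
A minimal sketch of the idea, with hypothetical defaults rather than our actual factory; specialized factories can wrap or subclass this to add their own behavior:

```python
from datetime import datetime, timedelta

from airflow import DAG

# Team-wide defaults applied to every generated DAG (hypothetical values).
DEFAULT_ARGS = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
}


def build_dag(dag_id: str, schedule: str = "@daily", **overrides) -> DAG:
    """Return a DAG with the team's default settings; callers override what they need."""
    return DAG(
        dag_id=dag_id,
        schedule=schedule,
        start_date=datetime(2023, 1, 1),
        catchup=False,
        default_args={**DEFAULT_ARGS, **overrides.pop("default_args", {})},
        tags=overrides.pop("tags", ["default-factory"]),
        **overrides,
    )


# Usage: an individual DAG file stays very small.
with build_dag("marketing_events_load", schedule="0 6 * * *") as dag:
    ...
```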

9. Dual Schedulers Enhance Performance

For large Airflow deployments with numerous DAGs, optimizing performance is vital. One effective method is using dual schedulers.

The scheduler in Airflow determines which tasks are ready to run. As the number of DAGs increases, a single scheduler can become overwhelmed. Implementing dual schedulers offers:

  • Load Distribution: Spreading DAG parsing and task scheduling across multiple instances reduces the delay before queued tasks start.

  • Redundancy: If one scheduler falters, the other continues, ensuring uninterrupted task scheduling.

  • Scalability: As Airflow usage grows, adding more scheduler instances becomes seamless.

However, be mindful of your database backend: running multiple schedulers relies on row-level locking (SELECT ... FOR UPDATE SKIP LOCKED), which MySQL 5.x doesn’t support, so you’ll need MySQL 8+ or a recent PostgreSQL.

10. Prioritize Lint Style Checks and Unit Testing

While it might seem evident to many, the importance of lint style checks and unit testing cannot be overstated. As the Airflow codebase expands, these tools become invaluable. PyLint ensures consistent coding styles and data type usage, while unit tests are essential for maintaining software quality. We run these checks locally and integrate them into our CI/CD pipeline. Any code that doesn’t pass lint and unit-test stages is not merged, simplifying our processes and ensuring quality.
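
As an illustration, a common minimal test (hypothetical file name, using pytest and Airflow’s `DagBag`) simply verifies that every DAG file imports cleanly and follows a couple of team conventions:

```python
# tests/test_dag_integrity.py (hypothetical path)
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag() -> DagBag:
    # Parses every DAG file in the configured DAGs folder.
    return DagBag(include_examples=False)


def test_no_import_errors(dag_bag):
    assert dag_bag.import_errors == {}, f"Broken DAG files: {dag_bag.import_errors}"


def test_dags_have_tags_and_owner(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tags, f"{dag_id} has no tags"
        assert dag.owner, f"{dag_id} has no owner"
```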

In conclusion, navigating the complexities of Airflow can be challenging, but with the right strategies and tools, it’s possible to streamline processes and enhance performance. From the convenience of tags and the clarity of detailed DAG descriptions to the power of dual schedulers, these tips reflect our journey and learnings over the years. As with any tool, the key is to adapt and evolve, ensuring that Airflow continues to meet the unique demands of your data workflows. We hope these insights serve you well in your own Airflow journey.
