We have a saying here on the Technical Operations team at Insight TV: “Work smarter, not harder.” In practice, this means adopting a CI/CD approach to everything we do, using technology to optimise our operations and processes in support of business initiatives.
As the business grew (and continues to grow), there was increasing demand and pressure on the Media Ops team to deliver our content in various formats and flavours, both to our affiliates for content distribution deals and to our own digital platform. We quickly realised we had to improve our on-premises infrastructure here in Amsterdam and combine it with cloud services, reducing transfers over the public internet and keeping control of the entire lifecycle of our assets (a typical TX master asset, for example, is close to 200 GB in size).
Adding flexibility, capacity and throughput to our infrastructure is one thing; managing these new resources and optimising our return on investment is another. We quickly realised that running multiple instances of Bash and Python scripts, along with a multitude of watchfolders, was neither a sustainable nor a scalable solution.
Enter Apache Airflow. Initially created at Airbnb in 2014, open-sourced in 2015 and subsequently accepted into the Apache Software Foundation’s Incubator Program, it is now an Apache Top-Level Project.
Whilst hunting for an Orchestration tool to help us tie all of our microservices and disparate resources together in a ‘single pane of glass’, I stumbled upon Apache Airflow.
The advantages were immediately obvious:
- Python based using standard frameworks.
- Familiar technology stack i.e. Flask, PostgreSQL, Celery running on Ubuntu.
- Easy to scale vertically and horizontally.
- Existing hooks to Cloud Services.
- Operator friendly UI.
- Easy to chain tasks and set dependencies.
- Open Source.
- Actively developed with a huge user-base including some heavyweight tech companies.
- Ability to pass metadata between tasks using XComs.
Following the configuration-as-code principle, workflows are authored as Directed Acyclic Graphs (DAGs), which define tasks along with their running order and dependencies, with each task assigned to an Operator. These DAGs are then run by a Scheduler and executed by a Worker or a cluster of Workers.
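As a minimal sketch of what that looks like (the DAG id, schedule and commands here are hypothetical, not our production pipeline), a two-task DAG chaining Bash commands might be authored like this:

```python
# Illustrative Airflow 2.x DAG: task ids, schedule and commands are
# made up for the example.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_media_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered manually or by a sensor
    catchup=False,
) as dag:
    probe = BashOperator(
        task_id="probe_file",
        bash_command="mediainfo --Output=JSON /media/incoming/master.mxf",
    )
    transcode = BashOperator(
        task_id="transcode",
        bash_command="echo 'submit transcode job here'",
    )

    # The >> operator sets the dependency: probe runs before transcode.
    probe >> transcode
```

The `>>` chaining is what makes dependencies so easy to express: the Scheduler reads this file and will only queue `transcode` once `probe` has succeeded.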
Suitability for Media
Although popular with data scientists for moving data and executing ETL pipelines, there was little (or no) information on its suitability for executing the long-running, resource-intensive tasks typical of media workflows.
These are some of the top features we found that made it suitable for our needs:
- Easily extensible – Utilise the full flexibility of Python (and its associated modules) within a task.
- Manage hardware resources – The use of resource pools and queues allowed us to manage (and load balance) resource intensive tasks e.g. CPU resources for transcode tasks or Internet connectivity resources for inbound/outbound file transfers.
- Run Bash commands using the BashOperator.
- Easy integration with media-specific resources (e.g. transcode, file QC, HDR/SDR up/down/cross-conversion, rewrapping etc.) without requiring a third-party vendor to develop plug-ins.
- Self-healing and decision-making tasks – Failed tasks are automatically retried, and a task can make decisions, driving the direction of downstream tasks based on the results or output of upstream tasks.
- Use Variables to define fixed parameters for a DAG. Each Variable is stored in JSON format, making it easily readable and editable by operators, e.g. setting a maximum nit value for an HDR upconversion is as simple as updating the appropriate parameter in the Web UI.
- Consolidation and easy-access to resource logs through the UI.
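To illustrate the resource-pool point above, a CPU-heavy transcode task can be confined to a dedicated pool and routed to specific workers via a queue. The pool and queue names below are made up for the example; they would be created in the Airflow UI or CLI:

```python
# Hypothetical sketch: "transcode_cpu" pool and "transcode" queue are
# illustrative names, not our production configuration.
from airflow.operators.bash import BashOperator

transcode = BashOperator(
    task_id="transcode_uhd_master",
    bash_command="echo 'run transcoder here'",
    pool="transcode_cpu",  # caps how many CPU-heavy jobs run at once
    queue="transcode",     # routes the task to dedicated Celery workers
)
```

With a pool size of, say, 2, a third transcode task simply waits for a free slot instead of overloading the host.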
As an example, one of our transcode DAGs runs the following tasks in sequence:
- A sensor task polls a designated folder on our NAS. We use a regex to ensure only the right files trigger the DAG, e.g. based on file-naming convention or extension. This task also pushes the triggered filename to an XCom for easy access by downstream tasks and moves the file to a working directory.
- Create a new DAG run, ready to pick up the next file.
- Call the MediaInfo CLI to look up the media file’s Transfer Characteristics and store the value in an XCom.
- Determine whether the returned Transfer Characteristics indicate HDR HLG or BT.709 (SDR) and push the relevant transcode profiles downstream.
- Submit a transcode job to our AWS Elemental Server.
- Look up the Elemental job ID from the returned response.
- Poll the Elemental Server for the status of the previously submitted job. The task completes once the transcode job completes, or fails if the transcode job fails.
- Call Nablet MediaEngine to encode our final XAVC-I Class 300 UHD master file.
- Delete the temporary v210 raw file generated by the Elemental Server.
- Move the source file from the working directory to the done directory.
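The HDR/SDR decision step above can be sketched as a plain Python function of the kind a BranchPythonOperator would call. The task ids and the exact MediaInfo strings matched here are assumptions for the example, not our production values:

```python
# Sketch of the branch decision: given the "Transfer characteristics"
# value reported by MediaInfo, pick the downstream transcode task.
# Task ids and matched strings are illustrative assumptions.
def choose_transcode_profile(transfer_characteristics: str) -> str:
    tc = transfer_characteristics.strip().lower()
    if "hlg" in tc:
        # HDR HLG source: route to the HDR transcode profile
        return "transcode_hdr_hlg"
    if "bt.709" in tc:
        # SDR source: route to the BT.709 profile
        return "transcode_sdr_709"
    # Anything unexpected goes to a manual-inspection task
    return "flag_for_review"
```

In Airflow, the string returned by such a function tells the BranchPythonOperator which downstream task id to run, and the other branches are skipped.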
Lessons Learned
- Ensure your Airflow server has sufficient resources (CPU and RAM) to run all required services.
- Use sensors in reschedule mode so they release their worker slot between pokes, preventing DAG concurrency contention.
- Ensure all of your workers use the exact same versions of all components and modules including Airflow.
- Consolidate your DAG variables into a single JSON variable to eliminate multiple DB calls.
- Ensure your test and staging environments match production exactly, otherwise you will get inconsistent results. DAGs can sometimes be challenging to troubleshoot.
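The single-JSON-Variable tip can be illustrated as follows. The `get_variable` helper below is a stand-in for Airflow's `Variable.get(key, deserialize_json=True)`, and the key and parameter names are invented for the example:

```python
import json

# Stand-in for Airflow's Variable.get(key, deserialize_json=True);
# in a real DAG, that single call replaces several per-parameter lookups.
def get_variable(key: str, store: dict) -> dict:
    return json.loads(store[key])

# One consolidated JSON Variable instead of many scalar Variables
# (key and parameter names are illustrative).
variable_store = {
    "hdr_pipeline_config": json.dumps({
        "max_nits": 1000,
        "watch_dir": "/media/incoming",
        "done_dir": "/media/done",
    })
}

config = get_variable("hdr_pipeline_config", variable_store)  # one DB call
max_nits = config["max_nits"]
watch_dir = config["watch_dir"]
```

Because each `Variable.get()` is a metadata-database query, fetching one JSON blob per DAG run is noticeably cheaper than fetching each parameter individually.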
Current Integration Points
Airflow now provides us with a way to manage and integrate the following services and resources together:
- Dell EMC Isilon
- Synology NAS
- AWS S3
- AWS Elemental Server
- Nablet Media Engine
- Technicolor ITM
- Baton QC
- Forensic Watermarking
- MS Teams
This is just the tip of the iceberg for us as we continue to build new workflow pipelines, handing off labour-intensive, repetitive tasks to machines so we can be more creative and innovative. Stay tuned for more technical posts in the future!