How do you guys do the static analysis on the queries? I notice you support dbt, BigQuery, etc., but all of our company's pipelines are in Airflow. That makes static analysis difficult because we're dealing with arbitrary Python code that programmatically generates queries :).
Any plans to support Airflow in the future? Would love to have something like this for our company's 500k+ Airflow jobs.
It depends a bit on your stack. Out of the box it does a lot with the metadata produced by the tools you're using. With something like dbt we can do things like extract your test assertions, while for Postgres we might use database constraints.
More generally, we can embed the transformation logic of each stage of your data pipelines into the edge between nodes (like two columns). Like you said, in the case of SQL there are lots of ways to statically analyze that pipeline, but it becomes much more complicated with something like pure Python.
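To make that concrete, here's a toy illustration of why SQL is the tractable case: the query is declarative text, so even a crude parser can recover which tables feed a result, whereas a query assembled by arbitrary Python has no comparable static surface. This is a stdlib-only sketch; `referenced_tables` is a hypothetical helper, not Grai's implementation, and a real tool would use a full SQL parser rather than a regex.

```python
import re

def referenced_tables(sql: str) -> set[str]:
    # Toy static analysis: grab identifiers that follow FROM/JOIN.
    # Handles only simple queries; no CTEs, subqueries, or quoting.
    return set(re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE))

sql = "SELECT o.id, c.name FROM orders o JOIN customers c ON o.cid = c.id"
print(sorted(referenced_tables(sql)))  # ['customers', 'orders']
```

The same query built by string concatenation inside a Python task can't be analyzed this way until the code actually runs, which is the core difficulty with Airflow pipelines.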
As an intermediate solution you can manually curate data contracts or assertions about application behavior into Grai, but these inevitably fall out of sync with the code.
Airflow has a really great API for exposing task-level lineage, but we've held off integrating it because we weren't sure how to convert that into robust column- or field-level lineage. How are y'all handling testing / observability at the moment?
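For anyone curious, the task-level hooks look roughly like this. A minimal DAG sketch assuming Airflow 2.x; the DAG id, command, and dataset URIs are made-up examples. Note that this only captures asset-level edges: nothing here says which columns of the output depend on which columns of the input, which is exactly the gap.

```python
# Hypothetical DAG illustrating Airflow's inlets/outlets lineage parameters.
import pendulum
from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_etl",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
) as dag:
    load_orders = BashOperator(
        task_id="load_orders",
        bash_command="python load_orders.py",
        inlets=[Dataset("s3://raw/orders")],         # upstream asset
        outlets=[Dataset("s3://warehouse/orders")],  # downstream asset
    )
```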
Testing:
- we have a dedicated dev environment for analysts to get a dev/test loop. Unfortunately, none of the pipelines can be run locally.
- we have CI jobs and unit tests that run on all pipelines
Observability:
- we have data quality checks for each dataset, organized by tier. These also integrate with our alerting system to page the owning team when data quality dips.
- Airflow and our query engines (Hive/Spark/Presto) each integrate with our in-house lineage service. We have a lineage graph that shows which pipelines produce/consume which assets, but it doesn't work at the column level because our internal version of Hive doesn't support that.
- we have a service that essentially surfaces observability metrics for pipelines in a nice UI
- our Airflow is integrated with PagerDuty to page owning teams when pipelines fail.
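The tier-based quality-check-plus-paging setup above could be sketched like this. Everything here is hypothetical (the `Tier` levels, `Check` shape, and paging callback are illustrative, not our actual service); the point is just that checks carry a tier and only critical tiers page.

```python
# Toy sketch of tiered data-quality alerting; all names are made up.
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Tier(Enum):
    T0 = 0  # most critical: failures page the owning team
    T1 = 1  # important: failures also page
    T2 = 2  # best-effort: failures are only recorded

@dataclass
class Check:
    dataset: str
    tier: Tier
    passed: Callable[[], bool]

def run_checks(checks: list[Check], page: Callable[[str], None]) -> list[str]:
    """Run every check; page only for critical-tier failures."""
    failures = []
    for c in checks:
        if not c.passed():
            failures.append(c.dataset)
            if c.tier in (Tier.T0, Tier.T1):
                page(c.dataset)
    return failures

paged: list[str] = []
failures = run_checks(
    [
        Check("orders", Tier.T0, lambda: False),  # fails -> recorded + paged
        Check("logs", Tier.T2, lambda: False),    # fails -> recorded only
        Check("users", Tier.T0, lambda: True),    # passes
    ],
    page=paged.append,
)
print(failures)  # ['orders', 'logs']
print(paged)     # ['orders']
```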
We'd like to do more, but nobody has really put in the work to build a good static analysis system for Airflow/Python. Couple that with the lack of out-of-the-box support for column-level lineage and it's easy to get into a mess. For large migrations (Airflow/infra/Python/dependency changes) we still end up doing ad hoc analysis to make sure things go right, and we often miss important things.
Happy to talk more about this if you're interested.