Sammy and Lucas here. We are building an open-source framework that monitors your metrics, sends alerts when anomalies are detected and automates root cause analysis.
Think of Datadrift as a simple & open-source Monte Carlo for the semantic layer era. The repo is at https://github.com/data-drift/data-drift
Datadrift started as an internal tool built at our former company, a large European B2B Fintech. We had data reliability challenges impacting key metrics used for financial and regulatory reporting.
However, when we tried existing data quality tools, we were always frustrated. They provide row-level static testing (e.g. uniqueness or nullness checks), which does not address time-varying metrics like revenue. And commercial observability solutions cost $manyK a month and bring compliance and security overhead.
We designed Datadrift to solve these problems. You simply add a monitor where your metric is computed. Datadrift then understands how your metric is computed and which upstream tables it depends on. When an issue occurs, it pinpoints exactly which rows were updated and introduced the change.
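The row-pinpointing idea boils down to diffing two snapshots of an upstream table. Here is a minimal generic sketch using pandas (this is not Datadrift's actual API; the table and column names are made up for illustration):

```python
import pandas as pd

def diff_snapshots(before: pd.DataFrame, after: pd.DataFrame, key: str) -> pd.DataFrame:
    """Compare two snapshots of a table and flag rows that were
    added, deleted, or updated between them."""
    merged = before.merge(after, on=key, how="outer",
                          suffixes=("_before", "_after"), indicator=True)
    merged["change"] = merged["_merge"].map(
        {"left_only": "deleted", "right_only": "added", "both": "updated"}
    )
    # For rows present in both snapshots, keep only those whose values changed.
    unchanged = merged["change"].eq("updated")
    for col in before.columns:
        if col != key:
            unchanged &= merged[f"{col}_before"].eq(merged[f"{col}_after"])
    return merged.loc[~unchanged, [key, "change"]]

# Example: one row updated, one deleted, one added.
before = pd.DataFrame({"id": [1, 2, 3], "revenue": [100, 200, 300]})
after = pd.DataFrame({"id": [1, 2, 4], "revenue": [100, 250, 400]})
print(diff_snapshots(before, after, key="id"))
```

A real monitor would store daily snapshots (or use warehouse time travel) and run this diff whenever the downstream metric moves.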
You can also set up and customise alerting. For example, you can open and assign a GitHub issue to the analyst who owns the revenue metric when a +10% change is detected. We tried to make it developer friendly and easy to customise.
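The +10% rule above can be sketched as a simple threshold check that fires a callback. This is a hypothetical illustration, not Datadrift's real configuration; the placeholder would be replaced by a call to the GitHub API in practice:

```python
def check_metric(previous: float, current: float, threshold: float = 0.10):
    """Return the relative change and whether it breaches the threshold."""
    change = (current - previous) / previous
    return change, abs(change) >= threshold

def alert(metric_name: str, previous: float, current: float) -> None:
    change, breached = check_metric(previous, current)
    if breached:
        # Placeholder: in practice this could open a GitHub issue and
        # assign it to the metric's owner.
        print(f"ALERT {metric_name}: {change:+.1%} change")

alert("monthly_revenue", previous=100_000, current=112_000)  # +12% -> fires
```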
We are thinking of adding features around root cause analysis automation and issue pattern analysis to help data teams improve metric quality over time. We’d love to hear your feature requests.
I work as a data lead at a fintech company based in the EU.
I built a simple observability tool for key data assets in a data warehouse. It's a Python monitor you add to a given table; it checks that table daily and tells you when there is an issue & which rows introduced it.
We used static testing frameworks like Great Expectations, but that was not enough. We did not have the budget for the big data observability players like Monte Carlo, so we kept it simple.
Congrats on this - love the bitemporal aspect. It was a real struggle for me in past analytics roles, where we spent a lot of time recomputing key metrics 'as of' certain dates for reporting / auditing.
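The 'as of' recomputation problem can be sketched with a bitemporal table that keeps both the business date and the time each row was recorded. A minimal pandas sketch (column names are illustrative assumptions):

```python
import pandas as pd

# Each row records a metric value for a business day, plus when we learned it.
rows = pd.DataFrame({
    "business_date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "recorded_at":   ["2023-01-02", "2023-01-15", "2023-01-03"],  # Jan 15: late correction
    "revenue":       [100, 120, 90],
})

def metric_as_of(df: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Replay the table as it looked on a given date: for each business day,
    keep the latest value recorded on or before `as_of`."""
    known = df[df["recorded_at"] <= as_of]  # ISO dates compare lexicographically
    latest = known.sort_values("recorded_at").groupby("business_date").tail(1)
    return latest[["business_date", "revenue"]]

print(metric_as_of(rows, "2023-01-05"))  # Jan 1 -> 100 (correction not yet known)
print(metric_as_of(rows, "2023-01-31"))  # Jan 1 -> 120 (after the correction)
```

Keeping the `recorded_at` axis around is what makes auditable 'as of' reports a query instead of a recomputation job.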
It's more about engineering management, but 'Accelerate: Building and Scaling High Performing Technology Organizations' by Nicole Forsgren (GitHub VP), Jez Humble and Gene Kim is a must-read.
Currently doing it for an open-source metrics observability and troubleshooting tool (15 PoCs in production, no revenue yet). Committed about 30% of the amount so far, but it's tough and expectations seem to keep increasing (revenue, community traction, etc.). Curious to hear others' experiences as well!
One that comes top of mind is Swedish "startup" H2 Green Steel (https://www.h2greensteel.com/). They're building a steel plant powered by a giga-scale electrolyser to produce hydrogen (rather than using coal).
Aligned with the humble way. Have you tried the user-research angle, like "hey, I'm building XXX, thought it might be useful for you because YYY. Would you be open to trying it and giving us your feedback"? I've been doing this for a dev tool for data analysts and it works pretty well. Anyway, keep trying and good luck - been there and it's not easy.
- "automates root cause analysis" -> it means (1) showing which rows have affected the metric and (2) providing some automated context (is it an update? a delete? a dimension that changed? etc.). But it is still very early for (2).
- Metrics are defined by users in their usual "data" repository (using dbt, for example). The metric computation is not defined in Datadrift; we only read it.
- No, it's really for batch processing in a data warehouse (like hourly / daily computations)
- That's not something we had in mind (I know some dbt package can help you do this)