I recently had the experience of setting up some Prefect pipelines, so I can compare that to this article. Note that while I'm not new to data engineering, I'm new to these open-source frameworks, though I have some insight into Airflow (I've studied its architecture in depth and written a lot of code in it).
Prefect is generally very easy to use. Essentially, you: (a) write a Python-based flow, which defines some job to run (with subtasks), (b) turn on an orchestrator on a server somewhere, (c) turn on an agent on a server somewhere (to run the flow when instructed by the orchestrator), and (d) connect to the orchestrator, build & apply a deployment, and run it.
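For a sense of what step (a) looks like, here's a minimal sketch of a Prefect 2.x flow with subtasks (the function names and URL are made up):

    # Minimal Prefect 2.x flow with subtasks (illustrative names only).
    from prefect import flow, task

    @task(retries=2)
    def extract(url: str) -> dict:
        # Pretend this hits some REST API.
        return {"url": url, "rows": 123}

    @task
    def load(payload: dict) -> None:
        print(f"loaded {payload['rows']} rows from {payload['url']}")

    @flow
    def my_etl(url: str = "https://example.com/data"):
        load(extract(url))

    if __name__ == "__main__":
        my_etl()

Steps (b) through (d) were then, at least in the versions I used, roughly "prefect orion start" for the orchestrator, "prefect agent start -q default" for the agent, and "prefect deployment build" / "prefect deployment apply" followed by triggering a run.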
I find the docs a little half-baked right now. One example is that cron schedules, which one would think are essential to something like Prefect, basically couldn't be set up (as of a month ago) without touching the Prefect UI. This is extremely odd.
I also found it fairly confusing which components were supposed to be checked into source control and which weren't. I blame this on Python deployment generally being very odd and confusing, but the Prefect docs don't make it any clearer. Prefect assumes that there's S3-like storage that both the submitting computer (my laptop) and the orchestrator (the server) can access.
Overall I find it quite handy, and probably won't switch. It feels more lightweight than, say, using full Docker containers, which we probably don't need right now. The UI is nicer than Airflow's, and the orchestrator & agent are much easier on resources. It feels more reproducible. I haven't tried Prefect Cloud, and we're unlikely to (security & cost are the main reasons).
We've been looking at a few of these for what I thought were "basic" AI data-prep tasks, like scraping with backfill and periodic refreshes (basically a bunch of REST queries with care not to overwhelm the targets; 2000s-era backpressure, good-citizen stuff), and found that all the scheduling primitives ended up being half-baked from a data orchestration perspective. That's even before we get into periodic rescoring for the actual AI parts. So these ended up feeling like generic manual task/orchestration tools, for which there are simpler and more powerful technologies. We did the same exercise 2-3 years ago, and surprisingly, not much has changed in the core scheduler interfaces from this perspective.
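(For concreteness, the "good citizen" part is roughly this kind of thing; a hand-rolled sketch, not any particular framework's API:)

    # Rate-limited REST fetching with exponential backoff (illustrative sketch).
    import time
    import requests

    def polite_fetch(urls, min_interval=1.0, max_retries=5):
        results = []
        for url in urls:
            delay = min_interval
            for attempt in range(max_retries):
                resp = requests.get(url, timeout=30)
                if resp.status_code == 429 or resp.status_code >= 500:
                    time.sleep(delay)
                    delay *= 2  # back off when the target pushes back
                    continue
                resp.raise_for_status()
                results.append(resp.json())
                break
            time.sleep(min_interval)  # pace requests between targets
        return results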
Curious if there are positive experiences with any tools here from a data orchestration perspective, esp OSS?
i mean, the leading new OSS solutions are Prefect and Dagster, it's not like there are a million of these out there. would just try them out and see what you think, each has their fans
We have tried + investigated a variety, trying to avoid unnecessary shade for indiv co's. Hundreds of millions maybe even billions have gone into funding this space now, so folks get twitchy :)
It looks like we added cron and other schedule types to the deployment CLI just under a month ago[1].
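For anyone reading along, the shape of it is roughly this (the flow path, deployment name, and generated file name here are made up):

    prefect deployment build flows/etl.py:my_etl -n nightly --cron "0 3 * * *"
    prefect deployment apply my_etl-deployment.yaml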
Over the last couple of releases, we've also made it easier to pull deployments from GitHub or bake your flow code into Docker images instead of needing S3-like storage.
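The GitHub option is roughly this shape (the repo URL is made up, and the exact field names may differ slightly from this sketch, so check the docs):

    # Rough sketch of registering a GitHub storage block (Prefect 2.x).
    from prefect.filesystems import GitHub

    gh = GitHub(
        repository="https://github.com/example-org/example-flows",
        reference="main",
    )
    gh.save("example-flows")

which a deployment can then point at with the -sb / --storage-block flag at build time.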
As with any product, there's always more to do, so I appreciate you sharing your thoughts. More than anywhere else I've worked, community feedback is a huge driver of product enhancements and feature development. Feel free to join our Slack community[2] if you'd like to share more feedback or ask questions.
I hear grumbling from a friend on the data engineering side of my company. It takes an enormous amount of effort to stay on top of their data pipelines, and they still have lots of failures in cleaning, transformation, orchestration, and reporting. One product wasn't updated in 5 months!
They've tried everything: Airflow, Informatica, Alteryx, etc.
They've even built their own custom dataflow ETL in Python.
I often wonder if the real issues they face are more about expectations and standards, such as centralized logging, easy report/artifact generation, ops management, and hiring more developer-oriented data engineers.
I think a partial but substantive explanation is fragmented domain expertise across the stakeholder base and a lack of ownership of the vision of what the stack is meant to achieve. The recent blog post from the Lago folks about why they exited the no-code reverse ETL space discusses parts of this problem [1].
> Marketers often already have access to data, at least in ‘read-only’ mode, and can download it in CSV format. They don’t really explore their options, not because they would need to learn SQL (no need with spreadsheets), but because this would require them to study the whole data structure of the company.
> A database usually contains dozens or hundreds of data tables, and understanding how they are organized, how they relate to each other and how often they are updated is a huge effort. Therefore, they say they want to be more data-driven, but they rarely acquire the knowledge that is required to do so, because the bar is pretty high and because this would add up to their existing workload.
The inverse observation here, I believe, is also relevant and rings true. Engineers often don't have the marketing/sales/analytics/business domain context needed to make optimal infrastructure decisions. There's a bidirectional dog-piling effect that just gets worse over time.
My own personal experience has led me to believe that this is largely an ownership and governance issue. It's not enough to just give marketers access to the data warehouse, and it's also not enough to sync up with engineers about KPIs and OKRs on a weekly basis. The result of that is the equivalent of a kids soccer game where everyone is chasing the ball rather than playing their positions as a team.
Adding tooling complexity on top of all that, pressed into the service of fixing "pipeline woes", compounds the problem substantially.
Well, it all starts with the seductive but mostly bullshit idea that once all the data is transformed and put into some common data lake, amazing, deep insights can be extracted that were not possible before. And there is a large budget to be spent on developers, hardware, software, consulting and so on. Who will say no to this?
I mean, I have actually used a data lake and it worked pretty well for its intended purpose. Fuck having to email 5 different people (and have a 50 minute meeting) and wait 3 days (or 3 weeks) for them all to send me fucked up CSVs and Zip files full of XML and whatever the hell else, with no clear provenance and absolutely no documentation.
Dagster's product is great, but comparing it to MWAA is unfair. MWAA is a poor quality product - difficult to use, unstable, inflexible, and poorly-supported. A fairer comparison would be against Astronomer. Astronomer is a MUCH better product than MWAA.
I don't know one off the top of my head, but here are a few bullet points:
* Both provide fully-managed or hybrid SAAS options. Dagster has Dagster Cloud, while Astronomer has Astro.
* Both are containerized deployments.
* Both have fully-functional local development environments. My Airflow development environment is my local env managed by the Astro CLI. It works great. I haven't worked with Dagster, but I've heard that it's lighter weight and local development is delightful.
* Both have functionality for data lineage. Dagster has Software Defined Assets, while Airflow/Astronomer integrates with OpenLineage. I don't have direct experience here, btw.
* Both are reasonably priced. They're cheap enough that the build vs buy decision is a no-brainer: buy buy buy.
One difference, however, is that Dagster natively isolates tasks in the DAG into separate Kubernetes pods. You can get this with Airflow if you use KubernetesExecutor, but KubernetesExecutor is only available in Astronomer's Hybrid-SAAS product. Only CeleryExecutor is available in Astro.
There's also a difference in syntax for defining DAGs. I find Dagster's approach more Pythonic than Airflow's standard way of defining DAGs. However, Airflow's TaskFlow API is almost at parity with Dagster's approach.
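To make the syntax difference concrete, a rough toy comparison (both trimmed way down; the names are made up):

    # Airflow TaskFlow-style definition (toy example).
    from airflow.decorators import dag, task
    import pendulum

    @dag(schedule_interval=None, start_date=pendulum.datetime(2022, 1, 1), catchup=False)
    def taskflow_etl():
        @task
        def extract():
            return {"rows": 3}

        @task
        def load(data: dict):
            print(data["rows"])

        load(extract())

    taskflow_etl()

    # Roughly the equivalent with Dagster ops (toy example).
    from dagster import op, job

    @op
    def extract_op():
        return {"rows": 3}

    @op
    def load_op(data: dict):
        print(data["rows"])

    @job
    def dagster_etl():
        load_op(extract_op())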
One advantage that Airflow has over Dagster is its maturity and pre-built integrations with other systems. If I need to interact with Fivetran in Airflow, there's a provider for that. Hightouch? there's a provider. Snowflake? Yep. If there's a widely used product or managed service related to data, there's probably an Airflow provider for it.
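For example, using the Snowflake provider looks roughly like this (the connection id and query are made up):

    # Pre-built provider operator in a plain Airflow DAG (toy example).
    from airflow import DAG
    from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
    import pendulum

    with DAG(
        dag_id="nightly_rollup",
        start_date=pendulum.datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        SnowflakeOperator(
            task_id="rollup_orders",
            snowflake_conn_id="snowflake_default",
            sql="select count(*) from orders",
        )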
I tried (and tried, and tried, and tried) once to set up a local "test" instance of Airflow just to try out a few different things and understand better how the whole thing worked. I finally gave up after a week - I've never come across any software that I couldn't install, but Airflow just ended up being too much.
The author seems to think that Dagster or Prefect will overtake Airflow; I don't think this is true. All of them being open source means that if one has a good idea or a better way of doing something, the others can quickly implement the feature and even use some of the same code. We saw it with Airflow implementing the TaskFlow API as a response to Dagster, and in a few weeks Airflow 2.4 is going to ship dataset scheduling.
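Based on what has been previewed for 2.4, dataset scheduling looks roughly like this (the URI and names are made up, and details may shift before release):

    # Sketch of Airflow 2.4 dataset-driven scheduling (illustrative only).
    from airflow import Dataset
    from airflow.decorators import dag, task
    import pendulum

    orders = Dataset("s3://example-bucket/orders.parquet")

    @dag(start_date=pendulum.datetime(2022, 1, 1), schedule="@daily", catchup=False)
    def producer():
        @task(outlets=[orders])
        def write_orders():
            ...
        write_orders()

    @dag(start_date=pendulum.datetime(2022, 1, 1), schedule=[orders], catchup=False)
    def consumer():
        @task
        def read_orders():
            ...
        read_orders()

    producer()
    consumer()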
So, Airflow's head start is going to be extremely hard to overcome, if they remain adaptable.
Also, as others have mentioned, the comparison to MWAA is unfair; the true owner of Airflow is Astronomer, as they have over 50% of the commits to Airflow, and Astronomer is a much better product than MWAA.
Dagster contributor here responding to Astronomer employee ^^.
A "quickly-implemented" feature != parity in utility. E.g. Airflow's TaskFlow superficially looks like some of Dagster's APIs, but the experience of using them is way different:
- TaskFlow is built on top of XCom, which isn't designed for data sizes larger than small.
Having worked on Hadoop in the past, I saw how it's easy for large Apache projects to add tons of new features to get "parity" with upstart competitors, but still lose to software like Snowflake and Spark that engineer those features in the "right" way as part of a streamlined product vision.
On one hand, it is an immensely successful Apache project.
On the other hand, it is a failure of open source - Databricks Spark is full of proprietary extensions that are not in Apache Spark. AWS Spark (in EMR and Glue) is full of proprietary extensions that are not in Apache Spark. Same with Cloudera, IBM and other players.
There's also the story of how Databricks people blocked an IBM patch to Apache Spark that was supposed to make it faster on the IBM POWER platform - they didn't want to help the competition.
>On the other hand, it is a failure of open source - Databricks Spark is full of proprietary extensions that are not in Apache Spark. AWS Spark (in EMR and Glue) is full of proprietary extensions that are not in Apache Spark. Same with Cloudera, IBM and other players.
Well, can't disagree with that.
>There's also the story of how Databricks people blocked an IBM patch to Apache Spark that was supposed to make it faster on the IBM POWER platform - they didn't want to help the competition.
Yeah, I don't mean to knock large Apache projects in general - they can be great software. My point is that it's hard for them to change / successfully copy their competitors.
Much of what makes Spark successful was there at the beginning - e.g. a clean programming model and an architecture that didn't require provisioning new resources from the resource manager every time it launches a task.
I don't really think that either will overtake Airflow in terms of sheer number of users or orgs it is deployed at.
1. Dagster and Prefect won't ever have a hosted version on AWS/GCP (teams can spin these up from their consoles without even considering a vendor conversation or leaving their infra).
2. There are very large Airflow projects out there with huge DE teams running them that just can't really move to Dagster or Prefect very easily. Nor do they really want to if they can do what they need to in a tool they are familiar with.
My perspective has very much been one of coming at the tools with fresh eyes.
Will they overtake Airflow in terms of features? They probably already have.
For sure, between these competing products you will see one respond to a popular feature by mimicking it. But the Dagster framework is fundamentally different from Airflow so over time I don't see it being a feature-vs-feature decision but more of a declarative/reconciliation vs. imperative/task-centric split.
sounds like a very positive experience all around, but not going too deep into dagster itself yet. i feel like comparing MWAA to Dagster is not on an even footing, i'd be interested in seeing how Astronomer has also improved the Airflow experience.
An often overlooked framework used by NASA among others is Kedro https://github.com/kedro-org/kedro. Kedro is probably the simplest set of abstractions for building pipelines but it doesn't attempt to kill Airflow. It even has an Airflow plugin that allows it to be used as a DSL for building Airflow pipelines or plug into whichever production orchestration system is needed.
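For a flavor of the abstraction, a minimal sketch (dataset and function names are made up; the named inputs/outputs get resolved through Kedro's Data Catalog):

    # Minimal Kedro pipeline: pure functions wired by named datasets.
    from kedro.pipeline import node, pipeline

    def clean(raw_orders):
        return [o for o in raw_orders if o.get("amount", 0) > 0]

    def summarize(clean_orders):
        return {"count": len(clean_orders)}

    data_pipeline = pipeline([
        node(clean, inputs="raw_orders", outputs="clean_orders"),
        node(summarize, inputs="clean_orders", outputs="order_summary"),
    ])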
Seconding Temporal, it is definitely a second-gen workflow platform with approaches that make many Airflow frustrations nonexistent.
In particular, workflows are extremely durable, they can easily just sit indefinitely waiting on a condition and you can redeploy workers whenever, and workflows will resume exactly where they were.
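Roughly what that looks like with the Python SDK (a sketch; the workflow and signal names are made up):

    # Temporal workflow that waits indefinitely on a condition (illustrative).
    from temporalio import workflow

    @workflow.defn
    class ApprovalWorkflow:
        def __init__(self) -> None:
            self.approved = False

        @workflow.signal
        def approve(self) -> None:
            self.approved = True

        @workflow.run
        async def run(self) -> str:
            # This can sit here for days; workers can be redeployed in the
            # meantime and the workflow resumes from its recorded history.
            await workflow.wait_condition(lambda: self.approved)
            return "approved"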
Shameless self-promotion: I built Eventline (see my profile) precisely because I just wanted to schedule and run code in various environments, manually or in reaction to various events, without the complexity that goes with specialized tools.
It gives you a pretty flexible job system, identities (secrets), monitoring, notifications, a web interface, a command-line tool and an HTTP API. You can build lots of things on top of it.
AWS Simple Workflows or Azure Logic Apps are both services that let you define S2S workflows however you like without any particular bias to CI/CD or business operations.
If you want to go even lower level, a framework like DTFx lets you define long-running, distributed and resilient orchestrations in code:
I've been using ploomber over the last year to build ML pipelines. It's good for both dev/prod workflows. The other frameworks were too bulky for a small team with little infra support.
Disclaimer: I am building a no-code, SQL-focused data pipeline platform to improve the experience around data pipelines, see my profile for more. The idea is to replace all of Fivetran, DBT, Airflow and more with a single platform that can handle all with no code.
I like the fact that the articles the author has are walking the reader through their own journey, along with the learnings and opinions after the trial periods. I am curious to hear how Dagster will evolve for their usage.
One of the things that has been a big discussion point among the data folks I have been talking to is that people who haven't done ops before underestimate the amount of operations work that goes into managing an Airflow instance; there is still quite a lot of stuff to be figured out:
- how do I get my pipeline code there?
- where do I execute them?
- how do I setup my development environment?
- how do I make sure the platform is up and running?
- how do I know if a task fails?
- where do I store my logs?
- how do I scale my setup?
If you are using a managed solution like MWAA or Cloud Composer, some of these questions might go away, but not all of them do. As it stands today, Airflow is a powerful but hard-to-use technology; in my opinion, it is less of a tool that is supposed to be used directly by data engineers / analysts, and more of a platform that should enable easier-to-use tooling for its internal users.
In that sense, I believe Dagster is hitting the right chord: they focus on the DX pain points of Airflow and similar solutions, they have figured out how to do development branches, they are focusing on assets rather than tasks, and they are constantly improving their product as far as I can tell from the outside. However, it is still a platform that you have to have engineers writing code for: asking a data analyst to write Python code to schedule a few SQL queries still adds a huge barrier to entry.
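(To illustrate the "assets rather than tasks" part above, a toy sketch with made-up names: you declare the data assets and their dependencies, and Dagster derives the graph and can schedule around it.)

    # Dagster software-defined assets (toy example).
    from dagster import asset

    @asset
    def raw_orders():
        return [{"id": 1, "amount": 10}, {"id": 2, "amount": 0}]

    @asset
    def clean_orders(raw_orders):
        return [o for o in raw_orders if o["amount"] > 0]

    @asset
    def order_summary(clean_orders):
        return {"count": len(clean_orders)}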
I am excited to see all the innovation that is happening in this space. I find the point about people getting used to the baggage around Airflow to be a quite real problem, and I am very happy to see solutions like Dagster gaining speed. All in all, it is a very large space, and there are many different problems that need to be solved to bring the data capabilities larger organizations have to smaller companies with limited budgets.
All I can say about your aspirations is: good luck. I'm working for a startup in this space currently - a billion-dollar-valued, flush-with-cash startup that is only aiming to capture one of the several segments you mentioned.
If you can solve all of these problems, you deserve to be rich. If you are a little unsure, perhaps take one of the segments you mentioned and focus on that.
2 ways to make money - bundling and unbundling. unbundling works until a competent enough founder actually makes a good enough bundle. it's an enormous undertaking to rebundle an unbundled industry, which data eng basically is