cgardens's comments

Hiya, I'm the original author. tl;dr for those deciding whether or not to read it:

If you are thinking about build versus buy for your ETL solution, in this day and age, there are enough great tools out there that buy is almost always the right option. You may think you can write a "simple" little ETL script to solve your problem, but invariably it grows into a monster that becomes a reliability liability and an engineering time suck. The post goes into more depth on why that is. Enjoy!


Rushing a job search. Decided it was time to move on (right decision), but allowed one company to court me and did not take enough time to play the field (bad decision).


The next step is to make the tool easy enough to use such that GPT-3 could build a connector in 2 hours.


We couldn’t agree more that producing high-quality connectors requires a lot of work. The hardest part is that connectors must evolve quickly (due to changes in the API, new corner cases, etc.). The quality of a connector is not just how well the first version works but how well it works throughout its entire lifetime.

Our perspective is that by providing these connectors as open source, we can arrive at higher-quality connectors. With a closed-source solution, a user has to go through customer service and persuade them that there is indeed a problem. A story we have heard countless times is that SaaS ETL providers are slow to fix corner cases discovered by users, leading to extended downtime. With an OSS solution, a user can fix the problem themselves and be back online immediately.

We proactively maintain all connectors, but we believe that by sharing that responsibility with the OSS community, we can achieve the highest quality connectors.

One of Airbyte's main focuses is to provide a strong, MIT-licensed open-source standard for developing and testing connectors (base packages, standard tests, best practices…) in order to achieve the highest quality.


If you are interested, John (one of the co-founders) wrote an article about how we imagine ELT evolving: https://airbyte.io/articles/data-engineering-thoughts/why-th...


(Airbyte Engineer)

I think what you're saying here is often true until it isn't. For a personal project where you're pulling data from one API? Sure.

Once you have an engineering system with multiple engineers relying on the data being pulled reliably, having a host of individual ELT crons gets brittle really fast. At the past couple companies I've worked at, this same narrative has played out:

"Oh we need to pull data from X let's build a cron." (3 months later.) "Wait a second why is all of this data 1 month old? Oh the cron hasn't run in a month because of a schema change. Let's add monitoring." (3 months later.) "We need to change the cron to pull fields A,B,C hourly and field D,E,F weekly." (3 months later.) "The amount of data we're pulling is making this too expensive, we need to implement some sort of incremental replication." ... etc

They always start out as "small" scripts, but they rarely stay that way. In my experience, they end up needing the same features, which get rebuilt over and over again on an ad hoc basis. We want an engineer to be able to get these features out of the box.

We generally think that for most engineering teams (even pretty small ones), the ad hoc crons for pulling data become nightmarish pretty fast. This problem is compounded if you are already using an ETL-as-a-service provider but they don't support one of your data sources, so you also have to maintain a separate set of crons. By taking an OSS approach we're trying to cover that long tail, so that all of your ELT can be managed with one tool.


Different Airbyte engineer here!

Wanted to help answer your question about what "optional normalized schemas" means. When writing data into your data warehouse we provide two options: (1) write each record as a JSON blob, or (2) infer the schema of the data and write each value in a record to its own column with an appropriate type.
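
A toy sketch of the two options (not Airbyte's actual code; the record, tables, and column names are made up), using SQLite as a stand-in warehouse:

    import json
    import sqlite3

    record = {"id": 42, "email": "jane@example.com", "signup_date": "2020-09-30"}
    conn = sqlite3.connect(":memory:")

    # Option 1: each record lands as a single JSON blob.
    conn.execute("CREATE TABLE users_raw (data TEXT)")
    conn.execute("INSERT INTO users_raw VALUES (?)", (json.dumps(record),))

    # Option 2: infer the schema and give each value its own typed column.
    conn.execute("CREATE TABLE users (id INTEGER, email TEXT, signup_date TEXT)")
    conn.execute("INSERT INTO users VALUES (?, ?, ?)",
                 (record["id"], record["email"], record["signup_date"]))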

We are betting on EL(T), meaning we think the Transform should be considered separately from the EL. To give a more concrete example: if you are already using DBT in your data warehouse to normalize your data, you likely prefer operating on the "raw" (JSON blob) data rather than on an arbitrarily normalized form that your EL pipeline has decided on for you. We are seeing this trend pretty pervasively: a lot of nominally ELT pipelines are outsourcing the Transform to best-of-breed tools like DBT.
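
As a toy illustration (not DBT itself; the table and fields are invented), transforming on the raw blob just means your warehouse SQL pulls out the fields it cares about, on your own terms:

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users_raw (data TEXT)")
    conn.execute("INSERT INTO users_raw VALUES (?)",
                 (json.dumps({"id": 42, "email": "jane@example.com"}),))

    # The Transform step (what a DBT model would express in SQL) runs against
    # the raw blobs, not a pre-baked normalized form. json_extract needs
    # SQLite's JSON1 extension, built into most modern builds.
    rows = conn.execute("""
        SELECT json_extract(data, '$.id') AS id,
               json_extract(data, '$.email') AS email
        FROM users_raw
    """).fetchall()
    print(rows)  # [(42, 'jane@example.com')]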

Thanks for asking this question btw, we'll do our best to clarify in our docs!


Thank you for replying. Just scanned DBT. Cool. Its reliance on SQL is The Correct Answer™. (For SQL-capable systems, of course.) The solutions based on schemas, where the SQL is then somehow code-generated, are terrible.

I'm skeptical of DBT's inference of dependencies between queries, but I'll keep an open mind until I have direct experience.

Happy hunting.

