There is no data engineering roadmap (alasdairb.com)
212 points by sebg on July 14, 2023 | 129 comments


Here's the thing with data engineering: if you do an absolutely superb job, you end up being largely invisible to the business. Like the very necessary guys who fix the water mains or the sewers, the expectation is that the service/data pipeline/database will always be available, and nobody really cares too much either way once everything's working as expected.

The issue I encountered working as a data engineer many moons ago is that in a business that's either agnostic to data quality or doesn't invest in reliability, the data engineers tend to become the face of failure. If a Kafka queue begins acting up, or a data ingestion job fails because of some upstream data source change, then it's usually up to the data engineer to...

1. Notice that something's gone wrong.

2. Figure out who in the organization is going to be affected and let them know.

3. Run triage and comms with the business, keeping them in the loop until it's fixed.

In my experience, like the guy who only gives good news to the dictator, certain execs/leadership would quite happily let data engineers take the lead during outages, only swooping back in once everything was up and running again to demand a report on "how this happened" and offer some glib advice to "take time to stop this happening again".

While it can be technically rewarding as a career move, it's got to be highlighted that it's only fun and games until you hit your first major outage.


This is a lesson I keep repeating to our juniors: when you do something notable, celebrate it, and make sure to include the whole team, the users and the boss in the celebration.

It can be as simple as an email, but sometimes it means a few chat messages or even calls. It takes time, and it seems silly, but you get a fantastic ROI on this.

Because now people see your work as valuable. It's concrete to them. They understand its value. And this has many positive consequences for your life.

This, of course, works only if the celebration avoids technical terms, is short and straight to the point, and explains what the benefit is to them and the org, and how much it took to do it.

People must know that you are doing a good job, and you are the only one in a position to let them know. They will not go out of their way to learn about this.

E.g., a chat message I would send to my client's team boss:

Hello <first name>, it took 3 weeks, but we finally finished improving <system x>. This should save us <y> of <z> in the future. We tried hard to make sure your work is not disturbed by the change so it should be transparent for you. But let me know if you notice anything out of the ordinary. Cheers

Believe it or not, people actually like it. It keeps them in the loop, gives them context and perspective, and helps them make better decisions. So don't hesitate to do it.


> It keeps them in the loop, gives them context and perspective, and helps them make better decisions

Other side of the coin: "I just spent 2 weeks optimising a daily process to be twice as fast" might well end with me wondering why anyone would care about a 1-minute reduction in delivery time for daily-cadence data. It might also result in me being stroppy about work I cannot get resourced that has obvious $ value.

But that is also a win for the business. Either I am wrong and am persuaded of the value, or I am not and I become more engaged with stakeholder prioritisation and we set up something more effective.


That's good feedback to have. If your boss/client steps in and says "we don't need that", this allows you to re-calibrate the team priorities, and it shows you need to get better at extracting the real need in the future.


I completely agree with your initial comment and your response here. Transparency is a great feedback loop. And it's always good to celebrate what you do.


Or just do this:

1) Break something on purpose.

2) Make sure the business notices it (usually a spectacular cost increase does the job).

3) Come in and fix the issue.

4) Make a quick before/after cost comparison chart.

5) Become famous.


You don't need to do it artificially.

If you look long enough in most code bases, you will eventually find a spectacular problem.

Case in point: I was working for several months for a client, and I had a slow afternoon, so I decided to convert some calculation to numpy, to see if we'd gain any free perf.

We got a 100x local speed-up, which was very fishy. Gaining speed is common with vectorization, but two orders of magnitude is a lot.

So I looked at the original algo. There was a glaring mistake in it, which my numpy code had fixed without my noticing; just changing that in the pure Python algo, without using numpy at all, made it 50x faster.
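To make that concrete, here's a hypothetical sketch of the shape of such a bug (the actual calculation was different; all names here are invented). The pure Python version recomputes an invariant inside the loop; vectorizing with numpy computes it once, silently fixing the mistake:

    import numpy as np

    def scores_slow(prices, weights):
        # Pure Python, with the bug: the weighted total is recomputed
        # for every element, turning O(n) work into O(n^2).
        out = []
        for i in range(len(prices)):
            total = sum(p * w for p, w in zip(prices, weights))
            out.append(prices[i] / total)
        return out

    def scores_numpy(prices, weights):
        # The numpy rewrite computes the total once and broadcasts it,
        # which both vectorizes and removes the redundant work.
        prices = np.asarray(prices, dtype=float)
        total = float(np.dot(prices, weights))
        return prices / total

Hoisting `total` out of the loop in the pure Python version is the "just changing that" fix: no numpy, still a dramatic speed-up.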

I could then call my client, and celebrate the good news. Not "there was a mistake", no. But "we found margin for progress".


>I could then call my client, and celebrate the good news. Not "there was a mistake", no. But "we found margin for progress".

This is great framing.

Bugs in code are not really "mistakes"; they will occur. Finding and fixing them is a positive.


> Not "there was a mistake", no. But "we found margin for progress".

Were your clients nontechnical? Or did they just not care to know how you got a 50x speedup?


The boss only cared about the end result and got a quick message. He won't read more anyway.

The technical team got an in-depth explanation with code snippets.

Know your audience.


This hits home. I communicate a lot more with up- and downstream stakeholders as a data engineer. Be it designing future changes and working out how those affect others, or, as you said, triaging outages/delays and figuring out where the missing dependency is. Less problem solving by coding.

The SQL thing the article mentions is also true, even if you work with higher-level tech like Spark. SQL gives you a good feel for joins. I also like the fast feedback cycles you get from SQL compared to something Scala/Python based.


This is exactly why I found DE is not a long term career path. I work in data, I occasionally do DE roles, but if you aren't moving over to the app side or business/domain side, you are leaving a lot of money on the table long term.

The funny thing is, being in data, if you take the time to understand it... you are tremendously valuable to the business/domain side.

Otherwise you 1000% risk becoming part of the background infra. No one celebrates when the faucet dispenses water or the power switch works. They do lose their collective #$#@ when the water or power fails though. Data Engineering has a tendency to be treated that way.

DQ (data quality) is a big part of it as well. Everyone wants it, no one wants to pay the hours/dollars to do it. No one is sure what they want, but probably what you propose is too simple. Can't AI do it? Etc.


TBF, in most companies that have a traditional core business (let's say, selling pharma products), IT is a cost center and your run-of-the-mill engineer is seen as the plumber who's supposed to make everything seamless, as you describe.

Short of solving a pain point in a spectacular way, the whole IT dept will be the face of stuff crashing, security issues and budget burning.


> if you do an absolutely superb job you end up being largely invisible to the business

Openly communicate how difficult some feature, project or solution is going to be. Estimate the time, then multiply by pi and round up. Then proudly present how you got it done despite obstacles, possibly even in less time.

Also good: create problems, then solve them.


The Futurama quote is relevant: "When you do things right, people won't be sure you've done anything at all". I think this goes for a lot of engineering, though. Especially infrastructure.


> Like the very necessary guys who fix the water mains or the sewers, the expectation is that the service/data pipeline/database will be always available, and nobody really cares too much either way once everything's working as expected.

Sounds a lot like Cybersecurity.


Cybersecurity jobs get paid very well, though.


> All you need is l̶o̶v̶e̶ SQL

I've ended up doing a lot of data engineering over the years, because I have a background in low-level search/database systems coding and know text well. I have mixed feelings about the field precisely because it's so SQL dominated.

Data engineering can be unsatisfying if you thrive on writing reliable systems. There’s a whole lot of big-ball-of-inscrutable-SQL work out there which when it gets changed breaks in unpredictable ways. And the cultural traditions of testing and validation are underdeveloped because SQL has no core testing story — people count on joins being logically correct, but don’t go to the trouble to prove that true. Vector operations are just hard to get right in any ecosystem and if you don’t stay humble and apply skepticism to your work, it’s going to be wrong a lot.

I try to do stuff in pandas or similar — which has its own problems, but at least allows for testable library code. But in the field of data engineering, that’s swimming against the tide.


DE is an inherently less tame domain than traditional SWE. The ground is more likely to shift under your feet. IMO the real problem with testing data engineering is that the data changes at a cadence that is independent of your version control/CI/CD cadence. You can have airtight unit and integration tests that get wrecked when the data schema changes without warning, the distribution of the data shifts, etc.

And you can certainly say "I'm going to enforce x rules on ingest and if it doesn't match it doesn't pass". Depending on your system you may end up with a lot of on-call work, self-enforced downtime, or data loss. You might whitelist a schema and then later find out you've been missing out on 3 months of high-value data because your upstream provider added a new column. Alternatively, you can let things float through and monitor changes, but not let them plug up your system. This is also a risky game to play, with its own set of downsides.
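A minimal sketch of the "let it float through, but monitor" option (field names invented, not from any particular system):

    import logging

    EXPECTED = {"event_id", "ts", "amount"}

    def ingest(record: dict) -> dict:
        # Keep the fields we know how to handle.
        row = {k: v for k, v in record.items() if k in EXPECTED}

        # Surface drift instead of silently discarding it, so a new
        # upstream column becomes an alert rather than months of lost data.
        unknown = set(record) - EXPECTED
        if unknown:
            logging.warning("schema drift, unexpected fields: %s", sorted(unknown))

        # Hard-fail only on what the pipeline genuinely cannot live without.
        missing = EXPECTED - set(record)
        if missing:
            raise ValueError(f"record missing required fields: {sorted(missing)}")
        return row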

Striving for comprehensive testing is still super important though. I find probably 60-80% of my time is spent building test frameworks that let us address the various edge cases. We use pyspark, and the amount of test-oriented tooling around it is, at least to me, surprisingly underdeveloped.
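For what it's worth, the scaffolding usually boils down to the standard pytest-plus-local-SparkSession pattern; a minimal sketch (the transformation and all names are invented for illustration):

    import pytest
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    @pytest.fixture(scope="session")
    def spark():
        # A local[2] session keeps tests self-contained; no cluster needed.
        return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

    def dedupe_latest(df):
        # Transformation under test: keep only the newest row per id.
        w = Window.partitionBy("id").orderBy(F.col("ts").desc())
        return df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")

    def test_dedupe_latest(spark):
        df = spark.createDataFrame(
            [(1, 10, "old"), (1, 20, "new"), (2, 5, "only")],
            ["id", "ts", "payload"],
        )
        result = {r["id"]: r["payload"] for r in dedupe_latest(df).collect()}
        assert result == {1: "new", 2: "only"}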


I've recently done my first implementation of dbt and though its fundamental Jinja+yaml nature does show through a bit, I was really pleasantly charmed by the data testing. The catalog of data assertions (dbt-utils and dbt-expectations really fill out the bare bones default tests) is generic across data sources, intermediate tables, or output tables, and is typically 3-5 lines of yaml for any given test. Hits all the basics: nullness, uniqueness, freshness, value cardinality, value distribution, row count, foreign key validity, etc.
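For anyone who hasn't seen it, a generic test really is just a few lines of yaml in the model's schema file; something like this (model and column names invented; unique, not_null, accepted_values and relationships are the built-ins):

    version: 2
    models:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'returned']
          - name: customer_id
            tests:
              - relationships:
                  to: ref('customers')
                  field: id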

I still don't like the yaml completely, but it makes adding data assertions so easy there is no excuse not to.

This project is pretty black and white, but I'm glad dbt supports test failure severity (error vs warning) to lower the on-call load when I have the need.

Just wanted to add a positive review for dbt, that it made it fun to do better data testing!


We have a DBT-style system that shunts "bad" data to "rejected_<model>" tables where we can generate reports. We also have tons of SQL-based tests to prevent inaccurate data from getting through the pipeline. So far so good, but it took us a little while to get there.
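The core of the pattern is simple enough to sketch in a few lines of SQL (Postgres flavour, names invented; in practice the rules live in generated models):

    -- Invalid rows land in a rejected_* table with a reason...
    insert into rejected_orders (raw, reason)
    select to_jsonb(s), 'missing customer_id'
    from staging_orders s
    where s.customer_id is null;

    -- ...and only valid rows continue down the pipeline.
    insert into orders
    select * from staging_orders s
    where s.customer_id is not null;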


You made great points. Data Engineering cannot claim to support test-driven development to half the extent of the rest of software engineering.

Frequently DE projects have no concept of Dev, Test, Staging and then Prod. Quite often it's dev and then straight to Prod. SQL, of course, is partly or fully to blame for this.

At my last job, for a large insurance company, they happily set a best practice of 90% test coverage (which in itself ended up being an artificial, ritualistic goal), which is impossible to achieve with DE tasks.


In my experience another big problem is that it's just expensive. For Spark jobs, for instance, it's very common for a job to run fine on the small test dataset on your laptop, but then when you release it to run for 2 days on the prod dataset, you end up having to do tuning there, with pretty long turnaround times.

And that then extends to why staging and even dev aren't very useful - the scale is part of the equation. Even just doing your dev loop, it's pretty easy to get something that's logically correct on your test sample but then when you run across the real dataset you find that 1 row out of every million is weird but you still gotta deal with it.


You seem to be touching on two disparate issues that I find very interesting to tackle.

One is testing performance of code (SQL or otherwise) and ensuring it achieves a particular level of performance, or ensuring no regressions. This is a problem not just with automatically unit-testing SQL, but any other code. In general, for regular code, we simply go with special, "manual" tests of the execution time for critical pieces (the biggest issue is fragility or flakiness: due to changing conditions a test runs under, speed is not always stable). We could do something similar for testing SQL performance. If we know the database we are targeting, we could also use some of the introspection tools it offers to get even better tests (e.g. we could run an "EXPLAIN (FORMAT JSON)" query on a very large Postgres database matching production to ensure the right indexes are being hit, no seqscans are being done, and Postgres' estimation of the time is on target).
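A sketch of what that could look like as an automated check, assuming psycopg2, a prod-sized replica, and an invented events table:

    import json
    import psycopg2

    def node_types(plan):
        # Walk the JSON plan tree, yielding every node type.
        yield plan["Node Type"]
        for child in plan.get("Plans", []):
            yield from node_types(child)

    def test_user_lookup_uses_index():
        conn = psycopg2.connect("dbname=analytics_replica")  # assumed DSN
        with conn.cursor() as cur:
            cur.execute(
                "EXPLAIN (FORMAT JSON) SELECT * FROM events WHERE user_id = %s",
                (42,),
            )
            plan = cur.fetchone()[0]
            if isinstance(plan, str):  # depending on driver setup, json may arrive as text
                plan = json.loads(plan)
            nodes = list(node_types(plan[0]["Plan"]))
        assert "Seq Scan" not in nodes, f"expected an index scan, got: {nodes}"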

Basically, so far it's hard because "performance" means so many different things, but I don't think it's impossible.

As for the other point, those 1-in-a-million edge cases: we've got those with regular code too! If you really want to test something against a production-like DB, it's not hard (make an anonymised replica of the prod DB), it's just expensive (tests will be slow, getting this set up will be slow, etc).

I personally believe the right balance for cases like those is to catch them in production: automated tests should be quick and allow quick iteration if someone wants to do TDD on any part of the codebase. There are certainly product niches where this is not true (let's not have airbags in cars deploy accidentally every 1M rides, because there's a lot more than 1M rides daily :)), but for our regular applications, that's usually more than fine: quick tests will make it easy to fix the particular edge case once we hit it.

FWIW, for those who haven't seen it: I love the idea of pgTAP too.


It's probably worth noting that I was mostly talking about Data Engineering proper. I think "SQL is king" (ime anyway) comes from data analysts, who are usually not engineers and have more of a math or stats background. Data engineers do seem to write spark/pandas, which is a different set of problems. I think a lot of the "big ball of SQL mud" and the brittleness just come from the fact that the people writing them aren't engineers and aren't following an engineering discipline. They're analysts trying to answer individual questions and find patterns to then pass off to ML engineering or data engineering teams to productionize. Or they're lower-value pipelines that wouldn't be worth the time to ask an engineering team to prioritize - so you either get a brittle thing that mostly works and needs occasional love, or you do without it entirely.


Writing performant transformation code is one of the critical skills that a data engineer needs to master. It's a combination of experience with and knowledge about the underlying technology: how it actually processes data (e.g. learning how database pages work, or going beyond dataframes and getting hands-on experience with Spark RDDs).


Data segregation requirements fundamentally break the utility of staged environments. There are some startups that try to fill this gap (Tonic.ai etc.) with data generation technologies -- yet it is extremely expensive to generate meaningful test data to populate development and staging environments that deeply mimic the interrelationships inherent to production data. These data relationships are very valuable to test. Because of this segregation, many organizations resort to ad-hoc and manual testing techniques.


I see a lot of replies saying something to the effect that "SQL is actually really testable, people just don't do it", which I'm sure is true, but probably less helpful if you're having to work on a tech stack that doesn't offer unit testing (I'm thinking something like Azure Data Factory, or even some setups/configurations of Airflow).

Would love to hear from anyone out there who has a good test cycle for predominantly SQL-based code. What tools are you using? How does the process work out for you? What are some pitfalls you found along the way?


I gave a positive review of dbt's testing capabilities in this other comment: https://news.ycombinator.com/item?id=36720013

This is what the actual generic tests look like. There's also support for one-off SQL-based tests: https://docs.getdbt.com/reference/resource-properties/tests#...

This article covers a lot of ground as to how to actually use dbt tests in practice. I'm particularly glad they mentioned the dbt-utils and dbt-expectations libraries: https://www.datafold.com/blog/7-dbt-testing-best-practices

The dbt ecosystem is also a strength. This article has good examples of the type of tooling and libraries in the ecosystem: https://meltano.com/blog/top-5-of-dbt-packages-and-tools-in-...

I have no affiliation with any of the above linked companies, just a happy dbt user! I'll also add that I don't use the dbt cloud, just invoke the dbt cli in a backend task runner.


What I tell people about SQL is that there are two different domains: (a) operational data ingestion which is usually for the current month and (b) analytics historical data which is for all time.

Think of how a business runs its finances and accounting. The analysts run daily reports to ensure the data ingested yesterday ran correctly. They enter adjusting entries as needed and call Data Engineering to fix any integration issues. At the end of the month, they "close" the month and that data is now historical.

For the first scenario, data ingestion... I have typically seen the business analysts and engineering analysts run daily reports to ensure all the integrations and systems were successful. When data engineers deploy new code, yes, there is also limited testing. The most frequent problem I see is when the business analysts merge duplicate records; this throws everything off. Engineering will have to work with the business to correct a slew of systems and records to adjust for the fact that where there were once two John Doe records and hundreds of child records in multiple systems... there is now one John Doe record, hundreds of child records, and new adjusting records. You could argue there shouldn't have been two John Doe records in the first place. That is true if you manage only one system. In many companies there are subsidiaries which have their own data entry points. E.g., when a big corp acquires smaller companies, it will want to suck up the subsidiary data and reconcile all the duplicates. It takes years to sort out all the system integrations.

At the end of the month, when the data becomes historical, the company may decide to change its business rules going forward. So the historical data violates the new business rules. E.g., during a teacher shortage, the local school district says we will hire military veterans without college degrees; next year we will go back to the old policy. E.g., this year the state decides to start daylight saving time earlier than the rest of the country. The upshot is that it is very time-consuming to write test cases for historical data, especially if no one documents the historical business rules, which happens more often than you think.


> So the historical data violates the new business rules.

How did you solve this? Is there a set of standard practices to address situations like this?


There is no tech best practice I know of. Even the business finance folks have no solution. This is where idealism hits reality. Yeah, it would be nice to say that GitHub and source control can give you state. But as we all know, code is not reality. It is an idealized representation of reality, where we would like to believe business rules don't change and things never fail.


We use dbt. It definitely does improve the testability of SQL and provides a nexus for engineers and analysts to collaborate. I still heavily advocate keeping its utilization surface as low as possible, and maintaining efficient, compact, and incremental data modeling.


Can’t speak for SQL _in general_ but in Postgres land pgTAP is wonderful. Invest in writing some factory functions, and you have the ability to write a very expressive test suite. We use Postgres functions to provide a data interface that is thoroughly tested. This allows us to ensure that any migration does not break any contracts.
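A minimal sketch of what such a suite looks like (has_table, col_is_pk and results_eq are real pgTAP assertions; the factory and interface functions are our own, invented here for illustration):

    -- Runs inside a transaction, so it leaves no trace behind.
    BEGIN;
    SELECT plan(3);

    SELECT has_table('orders');
    SELECT col_is_pk('orders', 'id');

    -- Seed a fixture via a factory function, then assert on the
    -- data interface function under test.
    SELECT make_test_order(customer_id := 1, amount := 9.99);
    SELECT results_eq(
        'SELECT count(*)::int FROM orders_for_customer(1)',
        ARRAY[1],
        'orders_for_customer sees the seeded order'
    );

    SELECT * FROM finish();
    ROLLBACK;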


While environments like dbt can bring some reasoning, testability, jinja macros and iterative constructs to SQL, they can also help contain it. At my organization I advocate for the minimization of the SQL surface, and we utilize DRYer, modern, testable languages wherever feasible, like Clojure (mostly) and Python. Clojure integrates very well with Kafka, and it is used throughout our organization. While Java/Clojure lacks some of the data applications found in the Python ecosystem, we certainly use Python where appropriate.


> Data engineering can be unsatisfying if you thrive on writing reliable systems

This is what I needed to hear to stop considering data engineering. Not that I was thinking about it a lot. I feel very happy with my current job, but the thought of the code I write being so brittle as to paralyze me from making any changes terrifies me.


Unless you get to work at a place that values your test-driven perspective. Either because they are doing it or desperately want to (where want = they are already investing heavily in it).


We build pipelines in staging schemas, and we only promote them to prod once they pass tests. This is pretty analogous to SWE (it's where we got the idea) and is a pretty crucial part of us meeting the requirement where we don't put inaccurate data in front of customers. It's--of course--only as good as the tests, but that's every reliable system.
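A sketch of the shape of it, assuming Postgres and invented names (the real gates are the test suite, and the orchestrator aborts if a check returns rows):

    -- Build the new table in the staging schema.
    CREATE TABLE staging.daily_revenue AS
    SELECT order_date, sum(amount) AS revenue
    FROM prod.orders
    GROUP BY order_date;

    -- Example gate: any day missing from the last 30 fails the run.
    SELECT d.day
    FROM generate_series(current_date - 30, current_date, interval '1 day') AS d(day)
    LEFT JOIN staging.daily_revenue r ON r.order_date = d.day
    WHERE r.order_date IS NULL;

    -- Only once every check passes does the table get promoted.
    ALTER TABLE staging.daily_revenue SET SCHEMA prod;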


I've written heavily tested code in Python, but also heavily tested code in SQL. When I was doing data engineering 5 years ago, dbt was a new tool, and I used it to write tests, CI/CD, etc. in data engineering. It was a game changer.


There's plenty of support for testable SQL. There's just a long tradition of not using it.


Just yesterday I installed a bunch of tSQLt tests on a test SQL Server database for our subsidiary, and watching unit tests written in SQL run inside of SSMS made me realize just how massively I miss this ability. There is absolutely no reason we should be able to test API calls and not the database calls underneath.


I personally don't have an issue with SQL. I started my career a couple decades ago as a SQL guy and I still use it occasionally.

If you want to, you can write tests for the queries. I've done so in the past.


At the end of the day with SQL, the question is: what are you testing?

I support an application that does authentication and URL/component registration based on the IP/hostname/FQDN of the machine at startup. It shoves some of this stuff off into a table for URL validation, as requests might come in from a different node and it needs to verify that the host was part of the cluster.

In client systems this startup was almost always insanely fast, until one day, for one of our clients, it wasn't. The startup process was in a restart loop with a timeout where I had never seen one before. I started it in debug mode without a timeout and it was taking minutes to start up. Looking at the SQL side, this thing was processing more IP records than I'd ever seen in the app, and the registration table size was huge.

Turns out this machine had gone through some massive number of IPv6 addresses in a very short period of time. The way the SQL was written for this query wasn't great, but it never needed to be, because most instances of this software have maybe 20 records in the table at most. That's the one 'fun' thing about SQL that's hard to test for.


I don't understand what makes pandas testable that is not also doable with the part of one's code that emits the SQL queries, assuming your test suite is capable of running with a database connection (which most are).


> joins being logically correct

What do you mean by this?


I think there's a fair caveat here:

In a context free situation, sure, just learn SQL. It's the workhorse, and you'll definitely use it.

If you have to start somewhere, start there.

...buuuut, obviously look at job ads.

If all the junior jobs say "experience with SnowPanda" or "working knowledge of FancyCloudPlatform" or whatever the name of the latest hotness is, you'd be daft not to invest some time and effort into whatever everyone is actually asking for.

You don't need someone to tell you what those things are; literally look at the bullet points on job ads... but don't ignore them.


This also works when reviewing resumes. If you see enough candidates list xyz as a skill, then it's probably worth looking into.


I view data engineering as a hybrid of SQL and programming. I'd go as far as saying that it's a mixture of various disciplines.

In the end it depends on what you'd be doing as a data engineer. My experience transitioning into the role/ecosystem in the last decade (I'm only 33) + hiring data analysts + scientists + engineers has left me with this impression:

Data engineering isn't so much an entry-level career, similar to data science if you don't have some core stats+math background. You could find a job that requires you to write some data pipelines and needs just SQL knowledge, and that's fine. However, consider what similarity you'd have with an ETL developer (think Informatica, Oracle, SSIS).

I'm probably one of those gatekeepers, because I do think there's some roadmap. I agree that it almost always starts with SQL. It's important to acquire a broad toolkit though, because some people who've gone into data eng roles straight from uni often find themselves knowing only SQL or some bespoke tool that they're using at work, and end up struggling when they encounter the Sparks and glitters of the field.

Of the hiring processes that I've been involved with (say 20 candidates so far), I've found software engineers to transition better into the role, because most of the time candidates who come from that discipline have more experience debugging performance issues, which translates well into SQL optimisation and generally figuring one's way around tools.

Lastly, I'm self-taught. I'm professionally an accountant who started programming while in uni. I was fortunate to move into an analytics team at work and build my foundations from there. The analysts often had Math backgrounds, and I realised that I wouldn't cut it, so I focused on the tooling and giving them data fast and reliably. It was a lot of night-time and weekend tinkering to get to where I am.


> It was a lot of night-time and weekend tinkering to get to where I am.

I think that this gets at the difficult part of hiring junior candidates for any role -- you want the people that are capable of learning a new thing if they're given some time to learn, but it's hard to find a signal for 'capable of and willing to learn'.


I agree with so much in this article: there's no narrow path, SQL is king, and enough with the damn influencers.

But how many entry-level DE jobs will hire folks without programming experience? I've never heard of such a thing in the US, perhaps the UK market is different? If you're new to DE you're unlikely to have any experience managing a database or using common ETL tooling, so your transferable skills are basically SQL and python, and that's what every interview process is going to screen you for.


I've worked in a big company that had explicit "junior data engineer" roles, but even those would expect some programming experience - either through a bootcamp, degree or related role like data analysis.

But basically yeah, as a UK data engineer, I'd say what you've said is pretty much right on- programming language + SQL will get you hired much more than "I've used this ETL tool"


I think the idea of a data engineering roadmap is flawed because there isn't a single cookie-cutter "data engineer" role to aim for, just like it wouldn't make sense to talk about a single route into becoming a "software engineer".

I'm a data engineer but spend most of my time writing code in dask, and use fairly little SQL. I've been in roles before, though, where long and complex SQL routines were the main focus.

My advice that nobody asked for: focus on 1) general programming skills, which are transferable anywhere, and 2) doing what you personally find interesting. Like SQL? Sweet, plenty of people will pay you to do that. Obsessed with using Scala and Spark in combo? Someone'll probably pay you to do that too.

Don't focus on being a "data engineer" or any other label- just enjoy programming and find roles that have more of what makes you happy.


Good article but a bit short. Anyway two books I can recommend for data engineering:

Designing Data-Intensive Applications: will give you a good overview of tools and the theories, algorithms, and data structures behind them for different types of DB.

High Performance Browser Networking: throwing this in because it can help extend the book above outside of a data center to the last mile. Sometimes you can save yourself a bunch of headaches on the data server side by doing things client side and caching/prefetching more than what the user asked for.


Interesting. I work in the space currently (an ELT company) and I wouldn’t characterize the skill sets you mentioned as critical to a data team or a data driven organization. They are helpful, sure, but a lot of the problems that data teams face in my experience are around how to expose data to the organization in a meaningful, regular, digestible, actionable way. This means thinking about how to ingest and transform data and present it to end users for analysis (BI) in such a way that it’s not painful or arduous to consume it, and ensuring that it is clean and accurate. In other words, how well you move the data has always been a secondary point to how people can access and use said data.

Note that I’m not in a dedicated data engineering role, but have worked adjacent to it for the better part of a decade and have done a fair amount of data engineering in that time.


>a lot of the problems that data teams face in my experience are around how to expose data to the organization in a meaningful, regular, digestible, actionable way. This means thinking about how to ingest and transform data and present it to end users for analysis (BI) in such a way that it’s not painful or arduous to consume it, and ensuring that it is clean and accurate

100% this. I basically work in this role in my organization. I'm not sure how typical my experience is compared with other people who call themselves 'data Engineers'. I'm usually put into the situation where I'm looking at how to capture a raw signal (out of an instrument, device or similar) and package it up as structured data and store it in some kind of meaningful format people can use.

Useful skills I think (other than SQL) are:

- Some understanding of the PLC/SCADA/OPC communication world.

- Knowledge of signal analysis/DSP, Kalman filters, Fourier transforms, etc. Especially useful when you need to store a raw signal as "data" of some kind; also increasingly useful when the underlying data is in some format like video or audio.

- An understanding of time series analysis, which can help with data transformation etc. (downsampling a high-frequency signal into a lower-frequency one, for example; see the sketch below).
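For the downsampling case, the pandas version is pleasantly short; a sketch with an invented 100 Hz sensor signal:

    import pandas as pd

    # Hypothetical 100 Hz vibration signal indexed by timestamp.
    idx = pd.date_range("2023-07-14", periods=100_000, freq="10ms")
    raw = pd.DataFrame({"vibration": range(100_000)}, index=idx)

    # Downsample to 1-second buckets, keeping the summary stats that
    # analysts actually query, instead of storing every raw sample.
    summary = raw["vibration"].resample("1s").agg(["mean", "min", "max", "std"])
    print(summary.head())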

Once you have captured the signal and stored it, you move on to the second part: how to make it available to end users in a meaningful and accessible way. This is where SQL knowledge and working with DBAs come into it. In my experience it has become increasingly common that you need to understand cloud architecture etc. as well. For example, in my org more and more of our storage is moving to Azure, so knowing how to navigate that stack and all the terms they use for things can save a lot of time and frustration.

Lastly it can help if you are at least somewhat familiar with the BI tools the end users will be using to ultimately leverage the data.


One really interesting aspect of this, which I've had some revealing conversations about with the data engineers in my org, is how best to expose the firehose of data for people in BI tooling. We use dbt in my org, and Metabase as the BI tool, and a lot of thought is put into how to create a clearinghouse that serves the needs of the organization. The current pattern that has been of interest is to ELT into what the data engineers call OBT (one big table). The OBT is cleaned, denormalized, and able to be sliced on. An org might have several of these OBTs covering various areas of interest. End users then import the OBT in Metabase to drive their filtering and build dashboards. The goal is to reduce reliance on custom SQL scripts and push all of that custom slicing and dicing of the data into Metabase's front-end logic, where filtering can be applied dynamically, rather than trying to maintain a bazillion SQL variants.
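For the curious, the OBT itself is just a wide denormalized table; a sketch with invented names:

    -- Flatten the facts and their dimensions into one clean, wide table
    -- that the BI tool can filter and slice dynamically.
    CREATE TABLE obt_orders AS
    SELECT
        o.order_id,
        o.ordered_at,
        o.amount,
        c.customer_name,
        c.customer_segment,
        p.product_name,
        p.product_category
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products p ON p.product_id = o.product_id;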

Eventually I think we will move into a post ChatGPT world where you’ll give ChatGPT (or whatever equivalent) your schema and a question and it will output the dashboards for you. We aren’t quite there yet though


I like this and I think it's where modern AI will shine the most. Like a clippy but for data.

The (outside of this scope) question is what happens when you feed that decision back into the system. I think the "recursive AI" question has been exhausted though.


Have you evaluated Superset or Lightdash against Metabase? If so I'd love to hear about your experience. I'll shortly be helping a client company migrate BI off Looker and haven't gotten my hands dirty with the options yet.


I imagine the number of people who fit the Venn overlap of data engineer and signal analysis knowledge/skills is pretty small. Is hiring for that kind of role difficult? How would you recommend people with skills in SWE/DE become hireable in the signal processing world?


I currently work in a Data Engineering role and agree with your assessment. At the business I work for, there was no BI or data management process automation until I joined and wrote all the pipelines (in Python) and queries (in T-SQL). Being comfortable with Python and pandas/arrow (or some other language or ETL/ELT platform), SQL, and working with REST APIs is all usually necessary or helpful, but SQL is ubiquitous. And the most challenging problems for the data mgmt team I'm on involve:

(1) getting datasets from historically siloed data management teams to join with each other cleanly and programmatically — which in my case is a political issue, not a technical one

(2) determining the best way to replace old tabular (Excel) reporting with modern interactive (Tableau) dashboards that refresh from a database — which in my case usually means getting a better understanding of many cross-team business processes, understanding how data management is involved in them, and how data is being used to make decisions and take action

(3) determining how to structure and present this data to them with the least possible friction, and in a way that's the most useful and impactful to them

This is all to say that DE roles seem to vary wildly depending on the domain you work in and the kind of business you work for. But at the end of the day the purpose of the role is to make clean and accurate data available to end users with the least friction and in the most impactful way. And most kinds of data businesses use can be represented by, transformed with, and interacted with using SQL.

There's tons of SQL resources, but I'd love to find some books that cover these deeper issues of politics and business process discovery from a data engineering perspective.


I wrote Excel macros that connected to a VAX VMS DB2 to get updated data so that management could just open the same excel files as they always did. This was the early/mid 1990s


This kinda sounds like higher level concerns where the books cover lower level problems?


I think it's just two sides of the same coin. There's definitely an art to making it consumable to end users. It's even an engineering challenge to open these data lakes to end users without killing the database (think of a dev trying to show load average on a graph for all running servers over last 3 years and there's 5000 servers). You can't really blame them and realistically they might have a project where they need that data. And then there's data crazy people who make 400 graphs of various metrics and only 20 get actively used. It can lead you down red herring paths and all kinds of chaos as an SRE, and also blow the budget.

Anyway, I understand what they are saying and unfortunately I have no book recommendations to solve these problems. In larger companies it can be much more of a challenge than actually ingesting data, especially in the observability space.


To add some additional notes - Designing Data Intensive Applications was a heavy read to dive into for me as a junior engineer. I took a step back after the first couple of chapters and read Understanding Distributed Systems[0] which was a fantastic primer.

[0] https://understandingdistributed.systems


One of the problems I’ve seen with folks in data engineering is an underdeveloped sense of data modeling and table design. And this article is an example of that. It heavily emphasized SQL, but didn’t mention anything about data modeling. If you aren’t designing your tables thoughtfully, no amount of SQL proficiency will save you.


This reminds me of an interesting project when I was still a consultant. Our partner (audit + consultancy firm) sold the idea of a magical "analytical base table" which would consolidate the client's data in a single place for quicker reporting and analysis.

Of course all of this was new to me (the ABT thing really obscured the problem). I went for a quality review with another partner at the firm whose bread & butter is data management. Let's say it was a very harsh review.

At the end of the review, he recommended that I learn data modelling properly. He gave me a copy of his Kimball book for the December holidays. I went through and completed it, and on the other side of that we were able to build a coherent dimensional data model to address the client's problem efficiently.

That knowledge makes me agree with you, merely knowing SQL or your way around a database is often not enough, and knowing some theory about data modelling (regardless of which type you prefer) is a key element in a data engineer's toolkit.


The Data Engineering track seems to attract data analysts and non-engineers. I'm one of them, and have worked with many others in the same bucket. What makes Data Engineers excel, I think, is having all the skills of a software engineer + the ability to think like a data analyst.

SQL is definitely part of that journey, and more skills should be mastered if you’re going to do things like use Kafka, airflow or spark.


The problem is that if you have all the skills of a software engineer, lots of people would pay you more to do different things than they will to do "data engineering", which is often a lot about cobbling different tools together and doing a bunch of custom transformations and validations that are more tedious than creative.


Agreed. The job title of Data Engineer gets abused. Some DE's are Analysts, others are Backend SWEs. Their goals are aligned, to create value out of data, and their methods may differ greatly.


A data engineer is as much a software engineer as a back-end engineer or front-end engineer and paid equivalently.


Same as “game developer”

But that’s not how markets work.

Also, a job title doesn't really say much. In all 4 of these roles there are people doing trivial work, just as there are people doing very deep technical work. Same with pushing the envelope on tech.


I would certainly prefer it if that were the case, but it largely is not, in my experience.


It seems like your experience doesn't overlap with data engineering, in that case.


Ok. I think this is just going to end up being "no true scotsman". I'll say "in many companies, data engineers are more doing BI data preparation and presentation than building distributed processing infrastructure or whatever and they are compensated more like analysts than general software engineers" and you'll say "well that's not data engineering".

My argument is not an "ought" argument, it's an "is". I agree with you that data engineering "ought" to be, if anything, a more difficult specialized practice of software engineering, because you need all the software engineering skills and then also specialized data skills. But I'm saying that what it often "is", is more like a specialization of BI and analyst roles.

I think the OP is a pretty good demonstration of this. If "all you need is SQL", does that sound more like a specialization of a software engineer skillset or an analyst skillset? I think the latter... And sure you can say, "well the article is wrong, if all you need is SQL, that's not data engineering", but we're just back to "no true scotsman"; I believe it is common to see the role this way.


For a moment there I thought you said "what makes data engineers is excel".


do not speak its name


Really enjoyed this article. Made me feel a tad inspired to write a similar one, "There is no Product Management roadmap", tailored for the many people looking to break into that discipline. Loooots of parallels.


I think you're probably onto something. What do you think the equivalent of SQL would be in product management?


It's a little more loosey-goosey, but there's really no single tool. The intuition is that a good PM on a team helps guide things along at all stages, such that when anyone looks back, they think "this never would have happened without them driving this". But the details of that aren't easily moldable into some framework or tool you can use for other projects, teams, or organizations.


Jira? Powerpoint?


Nah. Being good at Jira makes you worse at product management.


You can't have a roadmap if you don't have a destination, or at least a direction of travel for the immediate next steps.

Remarkably, the article doesn't even try to define data engineering, so how can you possibly have a roadmap to it?

"SQL" is too vague and jargonistic. Could be just "data analyst". The Engineering aspect is supposed to add a more fundamental data flow approach, less business context driven and more close to the metal. This is the skillset of big data framework (streaming data) nosql etc.

We could say it puts more focus on performance, but even that is imprecise, because a lot of SQL expertise is also about performant designs, and many people would claim that e.g. graph DBs are not essential.


The lack of a roadmap is disastrous in terms of professional quality IMO. A lot of orgs have data scientists who are good at coding and maths who can't do their work because the data engineers, who got their jobs because they kinda knew SQL a decade ago, are incompetent.

I'm sure this is less of a problem in tech companies which are skimming the cream of the crop, but in other organisations it's apparent that a third-rate data scientist or software engineer is probably much better at their job than a second-rate data engineer. And indeed if I were running an IT dept I'd probably just fill the data engineering roles with decent coders and tell them to reskill a bit.


I've been in such an org. The problem is that few decent coders want to do data engineering work, because they'd rather build software, and because data engineering pays less and is kind of a dead end in a lot of orgs.

You end up with a lot of DBAs that self-style as Data Engineers.


I'm in Australia, and data engineers here are overpaid; afaict slightly better paid on average than software engineers, who are better paid than data people.

So you end up with bizarro stuff where a $150k AUD data engineer who basically can't code or even set up FTP is telling some maths PhD data scientist on $100k what he can or can't do. Managers, unable to gauge technical ability, assume the data engineer must be highly skilled because they're getting paid so much.

The wages and lack of skill are, I think, symptomatic of there being no real entry point into data eng. There's a very slow osmosis of people from genuinely skilled areas into data eng, the main drawback being that the entire culture of data eng is defective, with so many unskilled hacks.


I work in the Aus data space. I've met a lot of intelligent PhD data scientists, but it's very rare to meet one who can scale their work efficiently across the org. The amount of time wasted on models that never make it to production is ridiculous.


>the data engineers, who got their jobs because they kinda knew SQL a decade ago, are incompetent.

I don't think I've ever been called out more in my life.


Data engineer sounds like a great second remote job to pick up. If you already have a remote software engineering job, picking up an entry-level data engineer role on the side will allow you to build up that resume in parallel, without having to worry about the lower salary. It's an investment and gives you a second career branch to eventually fall back on if you get tired of software engineering, or to eventually just have two different six-figure jobs with plenty of skill crossover.


I work as a (data) analyst, and I'll echo that many data engineers just end up in the engineering role after having worked as a traditional analyst.

You start doing the traditional analysis work, and then slowly get more involved on the tooling and pipeline side. Then you become the "go-to" guy on some specific tech, and slowly find yourself doing more and more engineering work rather than writing traditional reports. You get offered more money to do the engineering work, due to engineering positions having higher salary caps than the analyst positions.

But it is a natural evolution. When you work in analysis, you obviously work very closely with the data. You work closely with the engineers (if your org/company has them), or become the engineer if there aren't any.


Maybe it's just the positions I've ended up in, but my DE roles have required a TON of ops as well. You can build all these pipelines, but you will be absolutely buried in operational work if you can't get the automation squared away. With the role I've been in for the last 2 years, the most significant things we've done were done in Terraform, Ansible, and Prometheus. Sure, there's lots of SQL and Python, some Java, some Scala, but the biggest issue has been getting the system to where we get useful and actionable alerts and downstream data consumers get notified about late or missing data without requiring manual intervention from the data engineers.


Agree with this. I think a good 75% of the value I've delivered in data engineering roles has come from proficiency in building out cloud infrastructure, CI/CD, etc.

Teams hire data engineers when the analysts are spending too much time and effort (and failing to succeed at) building data pipelines. Often there is a significant need for engineering and cloud infrastructure improvements beyond fixing up a few wonky python ETLs and broken SQL queries.


I'll second this too. As a former manager told me, figure out how to scale yourself. Making the end-to-end development lifecycle easy for peers and partner teams is one way to do that. It could just as easily be called platform or systems engineering with a touch of DevOps.

It's like UX for data people. How to make a cohesive experience among the various tools, scripts, services that people use day-to-day so they can use and maintain datasets efficiently.


There’s long been framework fatigue in frontend JS work. I feel data engineering has a lot of that now. There’s different schedulers and orchestrators. Databases, ORMs, and wrappers. Cloud providers and hosting platforms. The list goes on.

It’s overwhelming.


Software always involves data.

There is now a tendency for large companies to announce a slice of an engineer's day-to-day as a standalone position, and others will follow suit, thinking they also need a dedicated position/team to handle such an operation. And management training indoctrinates you that a team needs objectives, measurable results, and sometimes a roadmap. Somewhere down the road, you realise that whatever your goal is, it is 2 degrees detached from your business.

Data consistency, mean time to recovery, sprint velocity, DORA, SPACE, what else? It is as if one opens a ramen shop only to find that keeping the light bulbs on all the time is the most important job. What about the ramen?

In microservices, there is this grains-of-sand anti-pattern where a service is too granular[1]. Zoom in a bit, you get left-pad as a library. Zoom out a bit, you get the identity crisis of the overly granular team. Don't forget to preach that communication is important. How else are you going to coordinate all these services, I mean, teams, otherwise?

[1] https://www.oreilly.com/content/microservices-antipatterns-a...


Author of the post here, I had no idea it got posted here and just noticed a huge traffic spike. So, hello!

Something I got asked a lot in response to this post, or maybe berated for, is that there is a lot more to Data Engineering than SQL.

FWIW, I agree. There is so, so much more to DE than SQL.

The point of this post was purely to cover entry-level DE, and not 'This is the only thing you'll ever need in your career'. In my view, developing your SQL skills teaches you a lot of the fundamentals of working with data, enough that you can start to use it in a professional setting. From there, you will develop a lot more skills than just SQL, but your journey could take you down so many different paths that I find it hard to agree with folks who say you should always learn X, Y, Z specific tools. Some folks will end up in Python shops, or Scala, or Bash, or no-code, Airflow, Dagster, Prefect, Control-M, Spark, Flink, Pandas, Polars, BigQuery, Snowflake, Redshift, ClickHouse, MSSQL, Hadoop, dbt... Some might even go down the K8s and Terraform road, becoming more of a platform-focused DE...

There is just so much potential variance depending on where you work and who you are, that it doesn't sit well with me to shoe-horn everyone into a pattern of "First learn Airflow, then learn BigQuery, then learn Spark" when you could have a successful career and never touch a single one of those tools.

So, the reason I focused on SQL here is that in my career consulting with hundreds of DE teams, the only consistent skill across every single team has been SQL. I do not believe SQL is all you'll ever need, but I do think it's all you need to get started. It's the only skill I see being relevant regardless of which team you join, and that will still be relevant in 15+ years' time.


Out of curiosity, what then makes the difference between a data engineer and an ETL developer?

> There is just so much potential variance depending on where you work and who you are, that it doesn't sit well with me to shoe-horn everyone into a pattern of "First learn Airflow, then learn BigQuery, then learn Spark" when you could have a successful career and never touch a single one of those tools.

I agree with this. My observation has been that people who start/stay in 1 place for too long end up knowing some tool(s) very well but not others. So surely then we should abstract away the tools and identify what's common about them (and SQL)?

When I produced classroom content for the graduate program that we ran at work, I used to focus more on what's being done (ETL, data management, perf optimisation) and primitives like HTTP for REST, bits about DevOps, security, and analysis. The idea being that if one is fluent in understanding the underlying problem, tooling becomes easier to pick up.

So I still think that there's a roadmap, and that roadmap is the underlying problems that data engineering tries to solve. I've found that you can teach someone enough about:

- ETL & data management (incl. modelling)

- DBA

- DevOps

- Software (architecture and development)

to the point where they're not as good as the best people in those fields (if those people are confining themselves to one), but where they become good at the parts of those fields that are relevant to data engineering. That's often my roadmap.


I don't disagree with you, and I believe you're probably teaching people many of the right things that will set them up to do well in the future. But... the tricky part about learning just concepts and not the skills is that they are quite abstract, and it's very hard for a junior to immediately see how they can be applied in practice when getting their first job. They're also quite hard to demonstrate to an employer. Learning SQL gives you something tangible to do that can help you learn about many of those areas, maybe not too deeply, while also getting a hands-on skill that's easier to demonstrate to an employer. The reason I think SQL is the right skill there is that it's specific enough to be actually useful, but generic enough to be widely applicable to many employers (rather than specialising in one specific tool, like, say, Spark).

For example, you might learn the theory behind internal combustion engines, but does that make you more attractive as a mechanic's apprentice than a kid who's rebuilt his lawn mower's engine? Neither is 100% perfect, but the hands-on experience might make it easier to demonstrate that you can be effective on day 1, and you can learn some of that theory on the job. Idk if that's the best analogy, and I'm not saying it's 100% the case here.

I did a degree in Computer Science and learnt a lot of theory; it's all super interesting, and personally I still love understanding all of the theory and low-level details behind software... but many of the DEs working in Enterprise don't have that background (or interest). And that's not a bad thing; they're just as competent data engineers as I am.

I guess it's a matter of what your end goal (and timeline) is. Some folks aren't looking to be a world-class DE expert; for some it's just a job, and DE is a booming space with above-average compensation. In that case, you're probably not inclined to sit in a classroom and learn all the theory; you just want to land the job and get on with it. Nothing wrong with that at all; some of the best get-stuff-done engineers are like this. For others, even if they find the space fascinating and they want to be at the top, they might be more interested in one very specific hands-on area, and they only need as much theory as they pick up day-to-day.

Anyway, I don't think your approach is wrong, and everything you're teaching is great for someone who is genuinely interested in the field of DE - but I also don't think it's all absolutely necessary if you just want to get your first DE job.

RE: DE vs ETL developer; I think DE has become more of an umbrella term like "Software Engineer", which encompasses many different roles. An ETL developer probably falls under Data Engineering these days; it's just a specific niche.


(Summary of this article):

> Data engineering career roadmaps are a waste of time because they're full of things you don't need to learn. Here instead is an over-simplified data engineering career roadmap.

Roadmaps are useful when there's a lot you could learn, you're not sure you have a complete list, a lot of it isn't essential, you don't know what order to do it in, and you don't know what resources to learn from. Many juniors are in that place. To get into a company where they can learn on-the-job or from a mentor is extremely difficult. So a good roadmap is a pretty good alternative to that.

Saying "there cannot be a roadmap" or "I didn't need a roadmap, neither do you" is unehlpful at best, damaging at worst. To be an effective junior data engineer needs more than just knowing SQL.


Data engineering reminds me of the George Carlin bit on how names change, but the things don't.

Database management -> Data management -> Data architecture -> Database administration -> Big Data engineering -> Data engineering.


Database administration is more like being a "sysadmin of databases"; IMO not equivalent to the other ones. Data Architecture is like software architecture, so more high-level system design. I do agree, though, that data engineering is equivalent to what data management referred to in the past.


> People used to say “Software Engineering is not an entry level role”.

Something that I ran into myself and have seen other data-people struggle with is that it is hard to get good experience working with data when you don't have access to data.

Online SQL tutorials and classes typically have pretty clean/small toy datasets that are only really useful for teaching you the correct syntax. Going from these simple queries that are all ~10 lines to a production system with queries that are 100s of lines is jarring.
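
One workaround is to synthesize the volume yourself. A minimal sketch (Postgres syntax; the orders table and its columns are invented for illustration):

    -- Ten million rows of fake order data, built in one statement.
    CREATE TABLE orders AS
    SELECT
        n                                   AS order_id,
        (n % 100000) + 1                    AS customer_id,   -- 100k fake customers
        round((random() * 500)::numeric, 2) AS amount,
        timestamp '2020-01-01'
            + (n % 1000) * interval '1 day' AS ordered_at
    FROM generate_series(1, 10000000) AS s(n);

That's enough data to make bad joins and missing indexes actually hurt, in a way the tutorial datasets never do.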


And dealing with databases that have millions or hundreds of millions of rows.

Moving to a system where your indexes are larger than main memory, and where some really expensive queries may need to spill to temp tables if you don't keep your data sizes under control, is also jarring. Big transactions hitting deadlocks, reading the deadlock graphs, and learning to deal with that locking when you write your SQL starts to become a full-time job.
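
For anyone who hasn't hit one yet, the classic deadlock shape is two transactions touching the same rows in opposite order. A sketch (hypothetical accounts table, Postgres-flavoured):

    -- Session A:
    BEGIN;
    UPDATE accounts SET balance = balance - 10 WHERE id = 1;
    -- Session B has meanwhile done:
    --   BEGIN;
    --   UPDATE accounts SET balance = balance - 10 WHERE id = 2;
    --   UPDATE accounts SET balance = balance + 10 WHERE id = 1;  -- blocks on A
    UPDATE accounts SET balance = balance + 10 WHERE id = 2;  -- blocks on B: deadlock
    -- One fix: always lock rows in a consistent order first:
    -- SELECT id FROM accounts WHERE id IN (1, 2) ORDER BY id FOR UPDATE;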


The author believes that SQL is the core competency of Data Engineering, and ChatGPT can now translate natural language into good SQL. So is Data Engineering going to be retired?


I think it will alleviate the tedium of the boring bits: exploring a dataset, figuring out whether it joins to another dataset the way you first thought, and writing a few of the dashboards.

Hopefully it means I can dig into the real problem (correctness, idempotency, and performance) rather than the tedious implementation.


One of the most important things about SQL is learning to read the output of EXPLAIN: it's the key to understanding why your queries aren't running as fast as you'd like.
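
Agreed. A minimal sketch of the loop (Postgres syntax; the orders table is invented for illustration):

    EXPLAIN ANALYZE
    SELECT customer_id, sum(amount)
    FROM orders
    WHERE ordered_at >= date '2023-01-01'
    GROUP BY customer_id;
    -- If the plan shows "Seq Scan on orders" over millions of rows
    -- for a selective filter, an index is the usual first move:
    CREATE INDEX orders_ordered_at_idx ON orders (ordered_at);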


A data engineer's "hammer" may be SQL but the data engineer's JOB is gluing all that SQL together into a process that works efficiently and reliably.

You need to know how those other tools work together, whether it's DBT or Airflow or ... or ...
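
In dbt, for instance, the glue is mostly declarative: each model is a SQL file, and ref() is what builds the dependency graph. A hedged sketch (model and table names invented):

    -- models/daily_revenue.sql
    -- dbt sees the ref() and knows to build stg_orders first.
    SELECT
        date_trunc('day', ordered_at) AS order_date,
        sum(amount)                   AS revenue
    FROM {{ ref('stg_orders') }}
    GROUP BY 1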

I do agree with ignoring "influencers": they are almost always trying to tie you into their stack or product and don't necessarily have your best interests at heart.


Does DevOps ever hire juniors? I applied many times as a DE but never got any reply, not even one interview.

I'm preparing to get an RH cert and see if it opens some doors.


Sounds like a good idea, despite how many people diss certs on this forum. I've been trying to get an entry-level DevOps role for over a year and I can't even get a nibble—nothing! No replies, no interviews.

I feel like I'm a competent backend dev, even though a lot of times I feel imposter syndrome when I look at the things other developers are doing.


But how else will the influencers get their riches if not by selling bootcamps and courses?

Like the others, they've learned to sell shovels during a gold rush.


I am a Senior Data Engineer professionally and I simply do not use SQL every day. Services that I test and invoke every day do use various dialects of SQL, yes. As an almost joking aside, "(f)or Data Engineering, there is only one skill that is absolutely, non-negotiably the first thing you should learn to get started": Terraform.


Additional useful tech skills besides SQL:

grep/sort/uniq, basic bash, jq, python, numpy/scipy.


I'm not a data engineer, but what about dimensional modelling and stuff like The Data Warehouse Toolkit? I would imagine that it is somewhat important?


It is important, but when the shift to cheap compute and storage hit an inflection point, it became possible to build out wide tables that combine the characteristics of facts and dimensions, and real data modeling took a back seat. That possibility led to a bias toward moving quickly, with less emphasis on making the data model sustainable.

I'm seeing the pendulum start to swing the other way, where the complexity of these scrappy and loosely structured data models is hampering the ability to innovate, and even slowing down the business. The models are often inflexible and hard to maintain with hidden bugs and gotchas.
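
To make the contrast concrete, a hedged sketch of the two shapes (all names invented):

    -- The "one big table" shortcut: dimension attributes duplicated per row,
    -- so a customer changing tier means rewriting (or versioning) history.
    CREATE TABLE orders_wide (
        order_id      bigint,
        ordered_at    timestamp,
        amount        numeric(10,2),
        customer_name text,
        customer_tier text,
        product_name  text
    );

    -- The dimensional version: the fact holds keys and measures,
    -- the dimension holds each attribute exactly once.
    CREATE TABLE dim_customer (customer_id bigint PRIMARY KEY, name text, tier text);
    CREATE TABLE fact_orders (
        order_id    bigint,
        customer_id bigint REFERENCES dim_customer,
        ordered_at  timestamp,
        amount      numeric(10,2)
    );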


Interesting; this makes me think of a process I am going through, where I have a couple of very wide tables and feel the need to build fact tables simply to get a better understanding of the data and the domain they come from. I have already stumbled a couple of times because the groupings were not as expected at all. I think this was the last push for me to build those fact tables once and for all, to get an overview and also to be able to add my own inferred groups to the data.


Sounds like you are on the right track.

A non-technical benefit is there won't be so much context to keep in your head when triaging data issues in the future.

Designing a database table is very similar to the separation-of-concerns problem in core software engineering: a table really is just an abstraction for storing and fetching various data points.

Finding the right balance is part art and part science.


A lot of it is dated, but the first three chapters are still well worth a read for an aspiring DE.


Telling data engineers to learn SQL because it's what's used at BoringCorp is a bit like telling front end developers in 2003 to learn VB because it's what most user interfaces are written in.

The data engineering that pays well happens exactly where databases fail and you need bespoke solutions.


They’re speaking to entry-level people, who will not be designing bespoke solutions. SQL is exactly the right recommendation.


In the same way that VB was to front end developers 20 years ago.


Sure, the language that has underpinned data work for the last few decades and has relevance regardless of what shiny framework you put on top of it is just like VB twenty years ago. Good comparison.


I was going to say, "Bullsh*t. Learn SQL." Then I read the post.


At least you're eventually consistent ^_^


Then learn what sequence of actions your data store(s) take based on that SQL. Relational engines are great but can need tuning (adding indexes, de/normalizing, materialized views or not), and key/value stores are generally going to scan the entire cluster unless you get the cluster key into the query, and ideally the primary key too.
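
On the relational side, those levers look something like this (Postgres syntax; table names invented for illustration):

    -- Targeted index for a hot access path:
    CREATE INDEX fact_orders_customer_idx ON fact_orders (customer_id);

    -- Or precompute an expensive aggregate once instead of on every read:
    CREATE MATERIALIZED VIEW daily_revenue AS
    SELECT date_trunc('day', ordered_at) AS order_date, sum(amount) AS revenue
    FROM fact_orders
    GROUP BY 1;
    REFRESH MATERIALIZED VIEW daily_revenue;  -- rerun on whatever cadence you need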


Am I the only one who finds this to be nearly unreadable on mobile? The full width spaced out text, font, contrast...


Hey, it's my blog - apologies! I am really bad at frontend and this was the first iteration of me trying to build my blog from scratch doing everything myself. It's definitely not perfect! Appreciate the feedback and I'll try to make it more readable on mobile, sorry!


This person is essentially correct.


Can't forget about SQL. It's also worth noting that designing any central data system without a SQL interface is probably a mistake. The people making it may be more comfortable with a REST interface or Python API, but data users will show up and demand that data be SQL-accessible. So now you get to hack it in, or help them develop their own shadow database that is poorly replicated from the main system.


The author stated that the only logical starting point is to learn to solve problems in a dialect of SQL then approach problems of the same type in other dialects of SQL. You’re entirely right, and the author seems to agree to an extent.


Now that I think about it you're right. All the maps for dev careers so far:

1. Staff eng map - https://staffeng.com/stories/

2. Bootstrappers - Indiehackers

3. Ecommerce - StarterStory

4. Software Engineering - /u/softwareengineering

5. Higher salary - Blind, reddit

6. Technical interviews - Leetcode forums

Looks like the shape of the solution can be forums, books, articles. What do you think you'll do?



