
> All you need is l̶o̶v̶e̶ SQL

I’ve ended up doing a lot of data engineering over the years, because I have a background in low-level search/database systems coding and know text well. I have mixed feelings about the field precisely because it’s so SQL-dominated.

Data engineering can be unsatisfying if you thrive on writing reliable systems. There’s a whole lot of big-ball-of-inscrutable-SQL work out there which, when it gets changed, breaks in unpredictable ways. And the cultural traditions of testing and validation are underdeveloped because SQL has no core testing story — people count on joins being logically correct, but don’t go to the trouble to prove that true. Vector operations are just hard to get right in any ecosystem, and if you don’t stay humble and apply skepticism to your work, it’s going to be wrong a lot.

I try to do stuff in pandas or similar — which has its own problems, but at least allows for testable library code. But in the field of data engineering, that’s swimming against the tide.
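For what it's worth, the pattern I have in mind looks roughly like this: a minimal sketch with a made-up transform and column names, just to show the logic living in a plain function that a test suite can exercise directly.

    import pandas as pd

    # Hypothetical transform: keep only the most recent row per order_id.
    def dedupe_latest(orders: pd.DataFrame) -> pd.DataFrame:
        return (
            orders.sort_values("updated_at")
                  .drop_duplicates(subset="order_id", keep="last")
                  .reset_index(drop=True)
        )

    def test_dedupe_latest_keeps_newest_row():
        df = pd.DataFrame({
            "order_id": [1, 1, 2],
            "updated_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
            "amount": [10.0, 12.0, 5.0],
        })
        out = dedupe_latest(df)
        assert sorted(out["order_id"]) == [1, 2]
        assert out.loc[out["order_id"] == 1, "amount"].iloc[0] == 12.0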



DE is an inherently less tame domain than traditional SWE. The ground is more likely to shift under your feet. IMO the real problem with testing data engineering is that the data changes at a cadence that is independent of your version control/CI/CD cadence. You can have airtight unit and integration tests that get wrecked when the data schema changes without warning, the distribution of the data shifts, etc.

And you can certainly say “I’m going to enforce x rules on ingest and if it doesn’t match it doesn’t pass”. Depending on your system you may end up with a lot of on-call, self-imposed downtime, or data loss. You might whitelist a schema and then later find out you’ve been missing out on 3 months of high-value data because your upstream provider added a new column. Alternatively, you can let things float through and monitor the changes, but not let them plug up your system. This is also a risky game to play, with its own set of downsides.
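To illustrate the second option, here's a rough sketch of a permissive ingest step that monitors drift instead of dropping data (the expected column set and logging hook are invented):

    import logging
    import pandas as pd

    logger = logging.getLogger("ingest")

    # Hypothetical contract for the upstream feed.
    EXPECTED_COLUMNS = {"event_id", "user_id", "event_ts", "payload"}

    def ingest_batch(batch: pd.DataFrame) -> pd.DataFrame:
        incoming = set(batch.columns)
        missing = EXPECTED_COLUMNS - incoming
        extra = incoming - EXPECTED_COLUMNS

        if missing:
            # Hard requirement: refuse batches that lack contract columns.
            raise ValueError(f"batch missing required columns: {sorted(missing)}")
        if extra:
            # Soft requirement: keep the data flowing, but alert so new columns
            # get reviewed instead of being silently ignored for months.
            logger.warning("unexpected new columns from upstream: %s", sorted(extra))

        return batch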

Striving for comprehensive testing is still super important though. I find probably 60-80% of my time is spent building test-framework tooling that lets us address the various edge cases. We use pyspark, and the amount of test-oriented tooling around it is, at least to me, surprisingly underdeveloped.
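As a concrete example of the kind of tooling I mean, here's a minimal pytest sketch with a local SparkSession (the transformation under test is made up):

    import pytest
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    @pytest.fixture(scope="session")
    def spark():
        # Small local session so the suite can run on a laptop or in CI.
        return (
            SparkSession.builder
            .master("local[2]")
            .appName("unit-tests")
            .getOrCreate()
        )

    # Hypothetical transformation under test.
    def daily_totals(df):
        return df.groupBy("day").agg(F.sum("amount").alias("total"))

    def test_daily_totals(spark):
        df = spark.createDataFrame(
            [("2024-01-01", 1.0), ("2024-01-01", 2.0), ("2024-01-02", 5.0)],
            ["day", "amount"],
        )
        rows = {r["day"]: r["total"] for r in daily_totals(df).collect()}
        assert rows == {"2024-01-01": 3.0, "2024-01-02": 5.0}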


I've recently done my first implementation of dbt and though its fundamental Jinja+yaml nature does show through a bit, I was really pleasantly charmed by the data testing. The catalog of data assertions (dbt-utils and dbt-expectations really fill out the bare bones default tests) is generic across data sources, intermediate tables, or output tables, and is typically 3-5 lines of yaml for any given test. Hits all the basics: nullness, uniqueness, freshness, value cardinality, value distribution, row count, foreign key validity, etc.

I still don't like the yaml completely, but it makes adding data assertions so easy there is no excuse not to.

This project is pretty black and white, but I'm glad dbt supports test failure severity (error vs warning) to lower on-call load when I have the need.

Just wanted to add a positive review for dbt, that it made it fun to do better data testing!


We have a DBT-style system that shunts "bad" data to "rejected_<model>" tables where we can generate reports. We also have tons of SQL-based tests to prevent inaccurate data from getting through the pipeline. So far so good, but it took us a little while to get there.
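A stripped-down version of the shunting idea, shown in pandas rather than SQL (the validation rule and table names are placeholders):

    import pandas as pd

    def split_valid_rejected(df: pd.DataFrame):
        """Route rows that fail validation toward a rejected_<model>-style table."""
        # Hypothetical rule: amounts must be present and non-negative.
        ok = df["amount"].notna() & (df["amount"] >= 0)
        return df[ok].copy(), df[~ok].copy()

    # valid continues down the pipeline; rejected gets written out to
    # something like rejected_orders for reporting and later triage.
    valid, rejected = split_valid_rejected(
        pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, -4.0, None]})
    )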


You made great points. Data Engineering cannot claim to support test-driven development to half the extent of the rest of software engineering.

Frequently DE projects have no concept of Dev, Test, Staging and then Prod. Quite often it's dev and then straight to Prod. SQL, of course, is partly or fully to blame for this.

At my last job, at a large insurance company, they happily set a best practice of 90% test coverage (which in itself ended up being an artificial, ritualistic goal), which is impossible to achieve with DE tasks.


In my experience another big problem is that it's just expensive. For Spark jobs, for instance, it's very common for a job to run fine on the small test dataset on your laptop, but then when you release it to run for 2 days on the prod dataset, you end up having to do tuning there with pretty long turnaround times.

And that then extends to why staging and even dev aren't very useful - the scale is part of the equation. Even just doing your dev loop, it's pretty easy to get something that's logically correct on your test sample but then when you run across the real dataset you find that 1 row out of every million is weird but you still gotta deal with it.


You seem to be touching on two disparate issues that I find very interesting to tackle.

One is testing the performance of code (SQL or otherwise) and ensuring it achieves a particular level of performance, or ensuring no regressions. This is a problem not just with automatically unit-testing SQL, but any other code. In general, for regular code, we simply go with special, "manual" tests of the execution time for critical pieces (the biggest issue is fragility or flakiness: due to changing conditions a test runs under, speed is not always stable). We could do similar for testing SQL performance. If we know the database we are targeting, we could also use some of the introspection tools it offers to get even better tests (e.g. we could run an "EXPLAIN (FORMAT JSON)" query on a very large Postgres database matching production to ensure the right indexes are being hit, no seqscans are being done, and Postgres' estimation of the time is on target).
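For the Postgres case, a rough sketch of that plan assertion might look like this (assuming psycopg2 and a production-sized replica; the query, table, and connection string are illustrative):

    import json
    import psycopg2

    def assert_no_seq_scan(conn, query, params=()):
        # Fail if the planner chooses a sequential scan anywhere in the plan tree.
        with conn.cursor() as cur:
            cur.execute("EXPLAIN (FORMAT JSON) " + query, params)
            raw = cur.fetchone()[0]
            doc = raw if isinstance(raw, list) else json.loads(raw)

        def walk(node):
            assert node["Node Type"] != "Seq Scan", \
                f"seq scan on {node.get('Relation Name')}"
            for child in node.get("Plans", []):
                walk(child)

        walk(doc[0]["Plan"])

    def test_orders_lookup_uses_index():
        conn = psycopg2.connect("dbname=analytics_replica")  # prod-like replica
        assert_no_seq_scan(conn, "SELECT * FROM orders WHERE customer_id = %s", (42,))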

Basically, so far it's hard because "performance" means so many different things, but I don't think it's impossible.

As for the other point, those 1-in-a-million edge cases, we've got those with regular code too! If you really want to test something against production-like DB, it's not hard (make an anonymised replica of prod DB), it's just expensive (tests will be slow, getting this set up will be slow, etc).

I personally believe the right balance for cases like those is to catch them in production: automated tests should be quick and allow quick iteration if someone wants to do TDD on any part of the codebase. There are certainly product niches where this is not true (let's not have airbags in cars deploy accidentally every 1M rides, because there's a lot more than 1M rides daily :)), but for our regular applications, that's usually more than fine: quick tests will make it easy to fix the particular edge case once we hit it.

FWIW, I love the idea of pgTAP too, for those who haven't seen it.


It's probably worth noting that I was mostly talking about Data Engineering proper. I think "SQL is king" (ime anyway?) comes from data analysts, who are usually not engineers and have more of a math or stats background. Data engineers do seem to write spark/pandas, which is a different set of problems. I think a lot of the "big ball of sql mud" and brittleness just comes from the fact that the people writing them aren't engineers and aren't following an engineering discipline. They're analysts trying to answer individual questions and find patterns to then pass off to ML engineering or data engineering teams to productionize. Or they're lower-value pipelines that wouldn't be worth an engineering team's time to prioritize - so you either get a brittle thing that mostly works and needs occasional love, or you do without it entirely.


Writing performant transformation code is one of the critical skills that a data engineer needs to master. It's a combination of experience with and knowledge about the underlying technology: how it actually processes data (e.g. learning how database pages work, or going beyond dataframes and getting hands-on experience with Spark RDDs).


Data segregation requirements fundamentally break the utility of staged environments. There are some startups that try to fill this gap (Tonic.ai etc.) with data generation technologies -- yet it is extremely expensive to generate meaningful test data to populate development and staging environments that deeply mimic the interrelationships inherent in production data. These data relationships are very valuable to test. Because of this segregation, many organizations resort to ad hoc and manual testing techniques.


I see a lot of replies saying something to the effect that "SQL is actually really testable, people just don't do it" which I'm sure is true, but probably less helpful if you're having to work on a tech stack that doesn't offer unit testing (I'm thinking something like Azure Data Factory or even some setups/configurations of airflow).

Would love to hear from anyone out there who has a good test cycle for predominantly SQL based code? What tools are you using? How does the process work out for you? What are some pitfalls you found along the way?


I gave a positive review of dbt's testing capabilities in this other comment: https://news.ycombinator.com/item?id=36720013

This is what the actual generic tests look like. There's also support for one-off SQL-based tests: https://docs.getdbt.com/reference/resource-properties/tests#...

This article covers a lot of ground as to how to actually use dbt tests in practice. I'm particularly glad they mentioned the dbt-utils and dbt-expectations libraries: https://www.datafold.com/blog/7-dbt-testing-best-practices

The dbt ecosystem is also a strength. This article has good examples of the type of tooling and libraries in the ecosystem: https://meltano.com/blog/top-5-of-dbt-packages-and-tools-in-...

I have no affiliation with any of the above linked companies, just a happy dbt user! I'll also add that I don't use the dbt cloud, just invoke the dbt cli in a backend task runner.


What I tell people about SQL is that there are two different domains: (a) operational data ingestion which is usually for the current month and (b) analytics historical data which is for all time.

Think of how a business runs its finances and accounting. The analysts run daily reports to ensure the data ingested yesterday ran correctly. They enter adjusting entries as needed and call Data Engineering to fix any integration issues. At the end of the month, they "close" the month and that data is now historical.

For the first scenario, data ingestion... I have typically seen the business analysts and engineering analysts run daily reports to ensure all the integrations and systems were successful. When data engineers deploy new code, yes, there is also limited testing. The most frequent problem I see is when the business analysts merge duplicate records; this throws everything off. Engineering will have to work with the business to correct a slew of systems and records to adjust for the fact that where there were once two John Doe records and hundreds of child records in multiple systems... there is now one John Doe record, hundreds of child records, and new adjusting records. You can argue there shouldn't have been two John Doe records in the first place. That is true if you only manage one system. In many companies there are subsidiaries which have their own data entry points. E.g., when a big corp acquires smaller companies, it will want to suck up the subsidiary data and reconcile all the duplicates. It takes years to sort out all the system integrations.

At the end of the month, when the data becomes historical, the company may decide to change its business rules going forward. So the historical data violates the new business rules. E.g., during a teacher shortage, the local school district says we will hire military veterans without college degrees; this year we will go back to the old policy. E.g., this year the state decides to start daylight saving time earlier than the rest of the country. The upshot is that it is very time-consuming to write test cases for historical data, especially if no one documents the historical business rules, which happens more often than you think.


> So the historical data violates the new business rules.

How did you solve this? Is there a set of standard practices to address situations like this?


There is no tech best practice I know of. Even the business finance folks have no solution. This is where idealism hits reality. Yeah, it would be nice to say that GitHub and source control can give you state. But as we all know, code is not reality. It is an idealized representation of reality where we would like to believe business rules don’t change and things never fail.


We use dbt. It definitely does improve the testability of SQL and provides a nexus for engineers and analysts to collaborate. I still heavily advocate keeping its utilization surface as low as possible, and maintaining efficient, compact, and incremental data modeling.


Can’t speak for SQL _in general_ but in Postgres land pgTAP is wonderful. Invest in writing some factory functions, and you have the ability to write a very expressive test suite. We use Postgres functions to provide a data interface that is thoroughly tested. This allows us to ensure that any migration does not break any contracts.


While environments like dbt can bring some reasoning, testability, jinja macros and iterative constructs to SQL, they can also help contain it. At my organization I advocate for minimizing the SQL surface, and we use DRYer, modern, testable languages wherever feasible, like Clojure (mostly) and Python. Clojure integrates very well with Kafka, and it is used throughout our organization. While Java/Clojure lacks some of the data applications found in the Python ecosystem, we certainly use Python where appropriate.


> Data engineering can be unsatisfying if you thrive on writing reliable systems

This is what I needed to hear to stop considering data engineering. Not that I was thinking about it a lot. I feel very happy with my current job, but the thought of the code I write being so brittle as to paralyze me from making any changes terrifies me.


Unless you get to work at a place that values your test-driven perspective, either because they are already doing it or desperately want to (where "want" = they are already investing heavily in it).


We build pipelines in staging schemas, and we only promote them to prod once they pass tests. This is pretty analogous to SWE (it's where we got the idea) and is a pretty crucial part of us meeting the requirement where we don't put inaccurate data in front of customers. It's--of course--only as good as the tests, but that's every reliable system.
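Roughly, the promotion gate looks like this (a sketch, assuming Postgres-style schemas; the checks and table names are made up):

    import psycopg2

    # Hypothetical data checks run against the staging schema before promotion.
    CHECKS = {
        "no_null_ids": "SELECT count(*) FROM staging.orders WHERE order_id IS NULL",
        "no_dupe_ids": """SELECT count(*) FROM (
                              SELECT order_id FROM staging.orders
                              GROUP BY order_id HAVING count(*) > 1) d""",
    }

    def promote_if_clean(conn):
        with conn.cursor() as cur:
            for name, sql in CHECKS.items():
                cur.execute(sql)
                violations = cur.fetchone()[0]
                if violations:
                    raise RuntimeError(f"check {name} failed: {violations} bad rows")
            # All checks passed: swap staging in for prod within one transaction.
            cur.execute("ALTER TABLE IF EXISTS prod.orders RENAME TO orders_old")
            cur.execute("ALTER TABLE staging.orders SET SCHEMA prod")
        conn.commit()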


I've written heavily tested code in Python, but also heavily tested code in SQL. When I was doing data engineering 5 years ago, dbt was a new tool and I used it to write tests, CI/CD, etc. in data engineering. It was a game changer.


There's plenty of support for testable SQL. There's just a long tradition of not using it.


Just yesterday I installed a bunch of tSQLt tests on a test SQL Server database for our subsidiary, and watching unit tests written in SQL run inside of SSMS made me realize just how massively I miss this ability. There is absolutely no reason we should be able to test API calls and not the database calls underneath.


I personally don't have an issue with SQL. I started my career a couple decades ago as a SQL guy and I still use it occasionally.

If you want to, you can write tests for the queries. I've done so in the past.


At the end of the day with SQL, the question is what you are testing....

I support an application that does authentication and does URL/component registration based on the IP/hostname/FQDN of the machine at startup. It shoves some of this stuff off into a table for URL validation, since requests might come in from a different node but need to verify that the host was part of the cluster.

In client systems this startup was almost always insanely fast, until one day for one of our clients it wasn't. The startup process was in a restart loop with a timeout where I had never seen one before. I started it in debug mode without a timeout and it was taking minutes to start up. Looking at the SQL side, this thing was processing more IP records than I'd ever seen in the app, and the registration table size was huge.

Turns out this machine had gone through a massive number of IPv6 addresses in a very short period of time. The way the SQL was written for this query wasn't great, but it never needed to be, because most instances of this software have maybe 20 records in the table at most. That's the one 'fun' thing about SQL that's hard to test for.


I don't understand what makes pandas testable that is not also doable with the part of one's code that emits the SQL queries, assuming your test suite is capable of running with a database connection (which most are).
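For example, with an in-memory SQLite connection (or a throwaway test database), the query-emitting code is just as exercisable. A sketch, with an invented query builder:

    import sqlite3

    # Hypothetical query-emitting function under test.
    def top_customers_query(min_total):
        sql = """SELECT customer_id, SUM(amount) AS total
                 FROM orders
                 GROUP BY customer_id
                 HAVING SUM(amount) >= ?
                 ORDER BY total DESC"""
        return sql, (min_total,)

    def test_top_customers_query():
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)",
                         [(1, 50.0), (1, 60.0), (2, 5.0)])
        sql, params = top_customers_query(100.0)
        assert conn.execute(sql, params).fetchall() == [(1, 110.0)]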


> joins being logically correct

What do you mean by this?



