For Want of a JOIN (moderndescartes.com)
268 points by subset on Dec 22, 2022 | 316 comments


It is one thing when a junior does this because they haven't learned better.

It's quite another when experienced seniors ban the use of SQL features because it's not "modern" or there is an architectural principle to ban "business logic" in SQL.

In our team we use SQL quite heavily: Process millions of input events, sum them together, produce some output events, repeat -- perfect cases for pushing compute to where the data is, instead of writing a loop in a backend that fetches events and updates projections.

Almost every time we interact with other programmers or architects it's an uphill battle to explain this -- "why can't you just put your millions of events into a service bus and write some backend to react to them to update your aggregate?". Yes we CAN do that, but why would we when it's 15 lines of SQL and 5 seconds of compute -- instead of a new microservice or whatever and some minutes of compute.
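Concretely, the kind of statement I mean is something like this rough sketch (Postgres-flavored; the events/daily_totals tables and all column names are invented, and it assumes a unique index on (aggregate_id, day)):

    -- fold a batch of raw events into a projection in one set-based statement,
    -- instead of fetching them into a backend loop
    INSERT INTO daily_totals AS d (aggregate_id, day, total)
    SELECT e.aggregate_id,
           date_trunc('day', e.event_time) AS day,
           sum(e.amount)
    FROM events e
    WHERE e.event_time >= now() - interval '1 day'
    GROUP BY e.aggregate_id, date_trunc('day', e.event_time)
    ON CONFLICT (aggregate_id, day)
    DO UPDATE SET total = d.total + EXCLUDED.total;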

People bend over backwards and basically re-implement what the database does for you in their service mesh.

And with events and business logic in SQL we can do simulations, debugging, inspect state at every point with very low effort and without relying on getting logging right in our services (because you know -- doing JOIN in SQL is not modern, but pushing the data to your service logs and joining those to do some debugging is just fine...)

I think a lot of blame is with the database vendors. They only targeted some domains and not others, so writing SQL is something of an acquired taste. I wish there was a modern language that compiled to SQL (like PRQL, but with data mutation).


This debate has been happening forever and really the fundamentals haven't changed. It doesn't matter if you pull data from the DB to an old middle tier or to a nice modern microservice architecture, you're almost always losing the performance game at that point.

The database already has most of your data cached in memory, it already built statistics on the best methods to use to join the data, and the data is always local to the DB.

Reading lots of data from a DB to do the same operation in a microservice means you incur the cost of data retrieval, the memory for a copy of the dataset (plus enough to do a join), the network transfer time, and then you're giving up all the indexing and statistics that a database provides.

There is almost never a reason for this unless you need to do some kind of data analysis that isn't supported by the DB. Maybe most of all, this copy-the-data-somewhere-else-to-do-a-thing approach is never going to scale well.

Don't mind me, I'm just a retired former DB nerd.


> the fundamentals haven't changed

I believe they have, because we've gotten so much better at ORMs. ORMs get a bad rap because people use them terribly (and it's largely the ORM's fault because they encourage their own terrible use). But they are a true "change in fundamentals" that can finally end this debate.

Business logic doesn't belong in the database layer, because the database layer doesn't support the level of abstraction, composability, testing, and type safety that modern programming languages afford. However, that does not mean that business logic doesn't belong in SQL. It just means you have to treat SQL as an output of your actual business layer.

A good use of an ORM looks like a metaprogramming environment for conveniently building syntax trees that get converted into intelligent SQL. You know it's working well if the SQL looks somewhat like you'd write yourself and you can build one SQL statement with multiple layers that are abstracted from each other (in C#, think about passing IQueryables around). The structures are parsed into SQL and executed very explicitly only at the end of the chain, do a lot of work, and never produce SELECT N+1s. A good ORM user is thinking in SQL but writing in C# (or whatever your business layer is in).

A bad use of an ORM is trying to pretend like SQL doesn't exist, or is too scary for regular programmers to think about. It has SELECT N+1s everywhere. A bad ORM user is thinking in C# and hoping the database will roughly do the correct thing.


> I believe they have, because we've gotten so much better at ORMs.

This has almost nothing to do with what the parent was describing. Poor ORM use (or poor ORMs) introduce a different set of ways to mess up performance.

[EDIT] OK, this isn't entirely fair, or at least I didn't explain it well enough (no, it's not getting downvoted, I just decided I'm not happy with it). The problem in question is, at its heart, developers not realizing what they should be letting the database do, perhaps not even realizing what it could do, or deciding they shouldn't let the database do it for some probably-misguided purity reasons or whatever. ORMs generating more-efficient queries or exposing more features is great, and it does help with the problem of straightforward, natural use of ORMs sometimes resulting in poorly-optimized queries. But it doesn't fix the problem of developers not knowing that a block of logic in [programming language of their application] should have been left to the database instead, whether that's achieved by hand-writing some SQL, writing a stored procedure, or directing the ORM to do it. It's a little related, in that hopefully better ORMs will result in ORM-dependent developers learning more about what their database could be doing for them, or being more willing to poke around and experiment with the ORM since it's more pleasant to use, but I'd expect that effect to be pretty marginal.


A problem that many ORMs have is thinking/working with row-expressions rather than sets. If everything is expressed in the plural -- sometimes with zero or one elements, other times many -- then the N+1s mostly go away. The same goes for many interfaces/APIs that have single and multiple forms: only make the multiple forms and have callers call them with [single].

Related pet peeve: calling tables by plural names. The table (or any relation) should be named for the set of X rather than thinking of it as Xs.


> A problem that many ORMs have is thinking/working with row-expressions rather than sets.

Absolutely. There's a reason that the famous Out Of The Tar Pit[1] paper identifies relational algebra as the solution to many programming woes. Thinking in sets instead of individual items is extremely powerful, when using an ORM and also in general. If an ORM makes this hard, use a different one (hopefully there is a better option).

> Related pet peeve: calling tables by plural names

Agreed again! Based on my limited observations, this seems like a big cultural difference between "database people" and "software people". The people who spend most of their time working directly in databases (and trying to basically write fully fledged business applications entirely in the database layer) seem to think of tables as big containers. If you labeled a box full of people (or a binder full of women?), you'd probably label it "People". Whereas "software people" tend to think of a database table as a definition of something, more like a class or type. Clearly the correct label for that definition is "Person".

Actually the latter is also wrong for a different reason: that is not how nouns work. But that's a story too long to fit in this comment.

[1] Out of the tar pit (warning: direct link to 66-page PDF): https://curtclifton.net/papers/MoseleyMarks06a.pdf


It's the other way around - "software people" tend to use plurals for table names, because it then maps nicely to plurals in property names in their object layer, where it's already the established convention.

The "database people" OTOH tend to use singular, which goes at least as far back as C.J. Date. I'm not sure why, but perhaps it's because fully qualified field names read more natural.


Actually, all the old relational texts talk about select * from employees (plural) because they didn’t have aliases for table names in queries.

I agree that when you use aliases for multiple tables with (Oracle-style, non-standard SQL) joins it looks prettier, but you can also do that with plural table names.

Select employee.name, manager.name from employee, manager where employee.manager_id = manager.id

But it doesn’t matter that much when you use the traditional join syntax:

Select employees.name, managers.name from employees left outer join managers on employees.manager_id = managers.id

Still I prefer singular table names, just disagree that it was data people who preferred singular and why.

Of course, with aliases you can return the exact column names you want:

Select e.name as employee, m.name as manager from employees e, managers m where e.manager_id = m.id


It is either a bunch of apples, in which case you would call it apples, or it is the box that holds apples, in which case you would call it apple, as in apple_box.

Got to admit I am pretty firmly in the "it is the box that holds apples" camp, so none of my array, table, or dictionary names are plural. Sometimes I comfort myself by saying "well, apple[5] just sounds better", but that is really up for debate.


But a set is plural. We say "the set of things", not "the set of thing".


I guess it depends on the usage that's familiar to you. I would say "set of Real numbers" and call the set Real (like a class), not Reals like a collection of instances.


I agree with your pet peeve. ActiveRecord (Rails ORM) encourages by convention that your tables are named in the plural, and it sometimes drives me crazy. When writing queries outside the ORM, I end up aliasing tables/relations to the singular so the query is more sane to reason about, e.g.:

> SELECT person.id, person.name, person.age ... FROM people person JOIN ...

Which reads so much nicer to me than:

> SELECT people.id, people.name, people.age ... FROM people JOIN ...

(obviously a somewhat contrived example because of the people/person inflection that makes it awkward already)


> A good ORM user is thinking in SQL but writing in C# (or whatever your business layer is in).

That doesn't sound like metaprogramming; it sounds like insanity brought about by bureaucratic limitations on language choice.


It should be "thinking in relational algebra but writing in C#". Arguably, LINQ with C# represents relational algebra better than SQL does.


But not more efficiently.

Working with small datasets in memory can be fast, but larger datasets consume more memory and CPU.

The purpose of a database is to efficiently store and retrieve data from disk — and limiting it to only the data you need.

Most database interactions are also over a network, which is always slower than (and in addition to) disk retrieval. Data should not cross a network except when you cannot fit it on disk (or work with it in memory) on the same system.

Services should not be separated except for the same reasons (exceeding the computation or memory of a single system), because separation incurs the same costs (network and disk latency).

We used to perform billions of complex computations in seconds reading from slow disks and slow systems with much smaller memory footprints on single systems with single cpus.

This article is a parable for the consequences of not understanding that.

The Google and Amazon and Netflix etc white papers are about organizations that build solutions for applications that cannot fit on disk or in memory or be handled by the computation of a single system — and do not apply to 99% of application architectures that use them, including those developed by Google, Amazon, etc.


Consider a function that takes an IQueryable<T> (for any T) and connects it with your change tracking logic to return an IQueryable<Tracked<T>>, or connects it with your comments system to return an IQueryable<Commented<T>>. I can take a query of (nearly) arbitrary complexity represented by an IQueryable<ComplexModel> and turn it into an IQueryable<Tracked<Commented<ComplexModel>>> in one line, while still keeping the result as a single SQL statement. If you hate that type, note that it's almost never explicitly written out like that (thank you, 'var' keyword).

On the other hand, your databases have a "Comments" field in every table; a "LastModifiedBy" in every table. Your database tables grossly violate the single responsibility principle: every single cross-cutting concern is represented in every single one of your "primary" tables. (Level up: every concern is a cross-cutting concern.) Your databases have association tables between X and Y for every primary data type X and every cross-cutting concern Y, leading to a combinatorial explosion of redundant tables. Your SQL queries/views/procedures/triggers are repetitive and full of boilerplate. If your client told you they needed you to change how change tracking is done in your system, you'd have to touch nearly every single SQL module in your entire system.

For me, I need to change the implementation of IChangeTrackingSystem, and that's literally all. All of my other queries don't have to know about the change, because they're simply composed together with whichever IChangeTrackingSystem is in place. Show me how to do that in T-SQL and I'll reconsider my position.

Now that I've tasted this fruit, the old way of doing things sounds like insanity to me. You're being needlessly zealous and close-minded here.

Edit: And another thing! (Shakes fist)

Much of the beauty of modern programming languages is their adaptability to whichever domain you're working in. We no longer need domain-specific languages for every different task, because we can embed those languages inside our parent language; then we don't have to reinvent static typing, write a new IDE, and learn decades of programming language design before we can start on our actual business logic. So, yes:

When writing data access logic, you should be thinking in SQL (or generic relational logic) but writing in C#.

When writing a game renderer, you should be thinking in linear algebra, but writing in C#.

When writing a payroll processing system, you should be thinking in payroll, but writing in C#.

When writing a chemical engineering toolbox, you should be thinking in molecules and reactions and units, but writing in C#.

This way, anyone who knows C# is already halfway (yes, only half) toward being able to maintain your system. If you insist on using a DSL for every single one of these tasks, 80% of your time will be spent on context switching and trying to get them to talk to each other correctly and correcting issues in the DSL itself.

Of course I do often write raw SQL as views and scripts, and of course I write TypeScript and HTML and CSS/SASS when working on a front-end (although I usually "think in HTML and write in TypeScript", not surprisingly). But that's mostly for development and maintenance; not for core business logic or library development.


Let us assume you accept that writing SQL is better than writing the equivalent machine code.

The “meta-programming” is stating there is a “language” representing their problems that is better than SQL. If we call that better framework(language) Blub[1] then that is probably a good metaphor, rather than causing people to get triggered by the generic ORM tag.

[1] https://www.benkuhn.net/blub/


To be fair, there is one important thing the ORM brings, though: sanitizing inputs.


Don't sanitize your inputs; parameterize your queries instead.
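A minimal sketch of the difference, in plain Postgres SQL (the users table is hypothetical): the query text is fixed and the value travels separately, so hostile input stays data and never becomes syntax.

    PREPARE find_user (text) AS
        SELECT id, name FROM users WHERE email = $1;

    EXECUTE find_user('alice@example.com');
    EXECUTE find_user(''' OR 1=1 --');  -- matches nothing; it is only a string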


I see those as two different solutions to two different, but slightly overlapping problems.


Going to keep those problems a secret?


Imo, the problems are the same but the conditions are different. If the query is used many times, parameterize it. If the input is used many times, sanitize it. If both are used many times, parameterize.


Parameterising works for individual fields in a statement. However, for complex queries (the reason for the meta-programming comment) you can't always parameterise the additional subqueries/tables/fields. You can use stored procedures, but that just shifts the necessary code from one language to SQL, and SQL doesn't have a robust library you can just use.


> A good use of an ORM looks like a metaprogramming environment for conveniently building syntax trees that get converted into intelligent SQL. You know it's working well if the SQL looks somewhat like you'd write yourself and you can build one SQL statement with multiple layers that are abstracted from each other (in C#, think about passing IQueryables around). The structures are parsed into SQL and executed very explicitly only at the end of the chain, do a lot of work, and never produce SELECT N+1s. A good ORM user is thinking in SQL but writing in C# (or whatever your business layer is in).

Sounds very similar to how Django (python) wants you to pass around QuerySets. It's very easy to set up an initial query with joins/etc, then pass the QuerySet into multiple functions to filter it in multiple different ways (each filter creates a new instance, so you're forking the original set and don't have to specify the joins multiple times), but the query itself is never actually run until you try to read from it.


Also databases (yup, MySQL I'm looking at you) got better at handling subqueries quickly, which makes the job of ORMs a lot easier.


It's still (5.7) really bad sometimes. A simple `IN (subquery)` can sometimes run much better if the subquery is fetched and another round-trip query is made using literal values for the `IN (...)`. I'm sure there are plenty of other 'deoptimizing' query patterns that shouldn't be.


To use SQL (or rather, push compute to the database) well you need to think in sets, indexes, etc.; the primitives you work with are rather different from the ones you usually use in C#.

You need to switch mode of thought anyway.

If a really good language for that happens to be expressed in the language of the C# AST -- instead of some new syntax -- that would be fine with me. I do not see a big difference.

But since one needs to switch mode of thought anyway, I would slightly prefer a new high-level language that compiles to SQL and is usable across all backend languages. Still, whatever fixes the problem of pushing computation to the database without all the warts of SQL, I am all for.

Until that gets a bit further than where it is today, I prioritize writing SQL over abstractions that are a bit too leaky.


A separate high-level language just for queries doesn't make much practical sense, since queries are normally surrounded by plenty of other code.

OTOH something like C# LINQ, which, on one hand, plays nicely with the rest of the language, and on the other, can be directly mapped to SQL (without ORM and other impedance-mismatch-inducing layering) is great. But, necessarily, language-specific to ensure tight integration.


As a data platform manager in a data-rich company I agree: programmers tend to prefer their favourite hammers to sql, even if a modern relational database can do the thing much better.

But! As somebody with a relatively good understanding of the history of SQL, relational DBs, and related concepts, I have to add: SQL is often to blame.

All the amazing engineering that goes into database engines, the clean and coherent ideas of relational algebra, the optimisability of a declarative approach to computation -- all of that gets a bad rep because of the nightmare of SQL-the-language.

The standard, the syntax, every little bit that could go wrong is just wrong from the point of view of a language designer. Composability, modularity, predictability, even core null-related defaults.

But it's everywhere. We just have to accept it.


It’s not just the language, it’s schema evolution, data distribution, and exposed APIs.

I don’t want to give other teams direct access to a DB and have them not only take a dependency on the schema, but also gain the ability to run arbitrary queries that may exhaust resources in ways that impact normal operations. If I expose an API, I control the access patterns and can evolve the schema separately to suit the workload.

If other teams need a replica to perform their arbitrary queries, I’d much rather have them using a richer data model that they can normalize into whatever form suits their needs than have to conflate that into a source of truth data store.

If you have a single business unit and can get away with commingling concerns within a small team, great, throw it all in a single DB. If it makes sense to split, however, do it quick and early to avoid a decoupling hell that is more expensive than having split in the first place.


> If I expose an API, I control the access patterns and can evolve the schema separately to suit the workload.

And then they will depend on your API data schema...

Yes, schema evolution is hard, but databases have many tools to help here that you will either have to recreate on your APIs or live without and have a harder time. Either way, all the trouble comes from data evolution, and any schema-only change is trivial to deal with.

Data distribution is something that varies from one DB to another; they usually have very good performance that is hard to replicate in your application layer, but are very hard to set up and keep running. The point about control of resource usage is a good one, though.


> And then they will depend on your API data schema...

Which we have methods of evolving. I can put my API in gRPC and know exactly which changes will or won't break compatibility. Try doing that with a database.


> And then they will depend on your API data schema...

+100 to this


But that API schema is unrelated to the underlying DB schema. I’ve been able to run services with different backends (eventually consistent & low latency vs transactional) exposing the same API. That would not have been possible by just giving consumers access to a DB.

The "the DB can do everything" crowd can't seem to understand this, for some reason.


Any nontrivial change to the base model will mean a lot of complexity in the API layer and degraded performance. Maybe that's worth it for you, maybe if you're exposing this data to hundreds of external users who don't need high performance. But I feel that for most usecases, barebones DB access is the better option.


> That would not have been possible by just giving consumers access to a DB.

That depends on the DB's foreign table support, or equivalent, if it has it. It's certainly not categorically impossible with a DB.


Moving the goalposts, I can see.

Of course, merging different backends can not be done on the backend. Either you add a layer or you do it on the client.


That seems to be an odd way of looking at it, IMO. Most of the time, the entire point of building an API is to present consistent functionality to any consumer, not only ones under your control. Also, a well-behaved API is versioned so as to allow evolution of the API without breaking existing clients who can upgrade to new versions as they are able.


That’s not a new problem though? The classic way to solve this in databases is to expose the data API as stored procedures/views and restrict query access on the actual tables to just the DBAs - I think even MySQL, which was late to the party here, has had this ability for some time now.


> I don’t want to give other teams direct access to a DB [...]

You don't have to. Package up necessary queries into views and/or stored procedures and grant permissions only on those. Views can also shield from schema changes.
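A sketch of that setup (Postgres-style; every name here is invented):

    -- consumers get a stable view, never the base tables
    CREATE VIEW order_summary AS
    SELECT o.id, o.placed_at, c.name AS customer, sum(l.amount) AS total
    FROM orders o
    JOIN customers c   ON c.id = o.customer_id
    JOIN order_lines l ON l.order_id = o.id
    GROUP BY o.id, o.placed_at, c.name;

    REVOKE ALL ON orders, customers, order_lines FROM reporting_role;
    GRANT SELECT ON order_summary TO reporting_role;
    -- the base tables can now be refactored as long as the view keeps its shape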


That also requires all data to be in the same DB. Often that’s not practical or possible.


Postgres FDW's do wonders to centralize this type of access.


In SQL Server you can define external tables that access data on another server, or even parse a file.


As someone who absolutely loves the power of SQL, I abhor the footguns involved. Especially with null-based ternary logic that is incomprehensible to most people.


You have to learn how your database handles NULLs, different databases do it slightly differently. Once you do that, you're fine.

One big thing is that NULL should never "mean" anything in a business sense. It is the absence of a value, hence it cannot mean anything.

By the same token you need to understand how your database handles concurrency. Different databases do it differently.


> One big thing is that NULL should never "mean" anything in a business sense. It is the absence of a value, hence it cannot mean anything

I always had a problem with that notion. I mean, it has a memory representation, it has a set of operators you can apply to it, defined behavior in UNIQUE and FOREIGN KEY constraints etc. All this is well documented (though can behave slightly differently between databases, as you mentioned).

So, it has a set of valid values (just one: NULL) and a set of valid operations, so it's a type!

And now you have a type that looks somewhat similar to a null in "normal" programming languages, and SQL generally lacks the mechanisms for inventing your own types, so why wouldn't you use it in your business logic where it makes sense?

The design of NULL seems like a historical accident anyway. As best I can discern, there was a need for something to behave as "excluded" in the context of FOREIGN KEYs and outer joins, and so that semantic was just passed along to other areas where it made less sense.

I think a better type system and a better separation between comparison logic and the core type would have obviated most of the NULL's weirdness and made it far less foot-gunny in the process...


The problem is that it isn't just me. It's everybody. Everybody that uses an SQL system has to learn a version of logic that looks like something they've learned before but actually has no relation to it. People can learn it, but it is a chore and takes a lot of real world experience (i.e. costly mistakes) to get it drilled into their head just like it did with me. As someone who does a lot of mentoring of data engineers, it's infuriating that they all have to go through this at some point.

Unfortunately the problem is made worse by the proliferation of languages making the equally heinous mistake of treating null as "false-y". Bad form, Peter, bad form!


Many businesses are absent of value :)


Agreed. The power of SQL is awesome. But I really wish people would start making better languages that compile to SQL, like what happened with JavaScript.


Same. Edgedb is doing something good but I wish I didn’t have to deploy a new service on top of my own db. Postgres only as well.


Is there something fundamental stopping us from making a new frontend to something like Postgres? I think all the solutions that compile to SQL kinda work, but it would be nice to have a new native interface that sucks less.


Nope, nothing but industry-wide inertia; it's just so massive by now that it doesn't make sense to switch. NoSQLs tried hard and went nowhere.

What I find funny is that most DBs translate SQL into an internal representation that is remarkably similar to a proper relational algebra and optimise on that. I'd really just prefer the algebraic language as described in the original, ages-old paper.

There were also a few DBs trying to push SQL-but-better languages... haven't heard about them for a while.


I have experimented with a different query model from sql for time series data. A query took the form of a Rhai script. (Rhai is a scripting language that has great interop with rust, so it was similar to how lua would be used to script parts of a game.)

Each query script would act on a few objects in global scope: `db` (handle to database), `start`, and `end` (time range of grafana dashboard that query was for).

I found being able to write imperative (rather than declarative) code to build a query to be extremely powerful, especially for storing variables and looping over things.

e.g. query script - just to get a feel for it:

    let dalmp = db.ts("pjm-da-lmp/western-hub") // ie retrieve the timeseries named 'pjm..'
        .with_time_range(start, end);
    
    let rtlmp = db.ts("pjm-5min-lmp-rt-lmp/western-hub")
        .with_time_range(start, end)
        .resample("1h", "mean");
    
    let da_err = rtlmp.diff(dalmp);
    
    #{
        dalmp: dalmp,
        rtlmp: rtlmp,
        da_err: da_err,
    }
A query script would be expected to return a dictionary-like object. The keys would be used as labels and the values would each be a time series object.

This is not the perfect solution for every problem, but I thought it might be interesting to see an example of a very different approach to querying compared to SQL.


I'm a fan of relational databases, but I think we should concede three points:

1. Databases should accept queries in a structured machine-readable format rather than a plaintext language.

2. SQL in particular is a poorly designed language: it isn't very composable, it has lots of annoying edge cases (like NULLs), and it has a number of annoying limitations (in particular, its historically limited support for structured data within fields).

3. Given how most RDBMSs are designed, you often need to handle denormalization and caching manually. This requires doing a lot of excess data management in a middleware layer--for instance, querying a cache before accessing the DB, or storing data in multiple places for denormalization. Some of this can be done in SQL (e.g. denormalization through TRIGGERs), but since SQL is not a very good language (see (2)) that can be tough.
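For instance, the TRIGGER-based denormalization from point 3 might look like this sketch (Postgres-flavored, using the PG 11+ EXECUTE FUNCTION syntax; the orders/order_lines schema is hypothetical):

    -- keep a denormalized orders.total in sync with its order_lines
    CREATE FUNCTION refresh_order_total() RETURNS trigger AS $$
    DECLARE
        v_order_id bigint;
    BEGIN
        IF TG_OP = 'DELETE' THEN
            v_order_id := OLD.order_id;
        ELSE
            v_order_id := NEW.order_id;
        END IF;
        UPDATE orders
           SET total = (SELECT coalesce(sum(amount), 0)
                          FROM order_lines
                         WHERE order_id = v_order_id)
         WHERE id = v_order_id;
        RETURN NULL;  -- AFTER trigger: the return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER order_lines_total
    AFTER INSERT OR UPDATE OR DELETE ON order_lines
    FOR EACH ROW EXECUTE FUNCTION refresh_order_total();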


SQL being a plaintext, human-friendly language is a good thing. SQL is a common skill transferrable between languages and environments, and it is also easily usable by non-developers. If we replaced SQL with some abstract language, what language would you use when talking to the database directly (via psql, sqlplus, or whatever)? Would you need to learn the PythonQuery, JavaQuery, and C#Query languages separately? Would the language used in ETL tools be different still? What language would you write database views, triggers, functions in?


Have you looked at Quel? We have multiple languages to run our code on generic CPUs (C, Python, Java, Ada, Go, Rust, Lisp, etc.) Why must we have only one database language that doesn't do a very good job at Codd's relational calculus?

So many people have drunk the kool-aid that SQL is the answer. Maybe it's time to change this.

The success of ORMs could be thought of as the market voting against SQL.

https://www.holistics.io/blog/quel-vs-sql/


The blogspam post you linked has zero examples of QUEL. Looking at Wikipedia [0], it seems much uglier and less readable than SQL.

I am not a database theory person. I'm a developer who does not care about Codd's relational calculus. As for ORMs, they are great at solving simpler problems and CRUD data access, and they make the developer’s life easier by giving them nice objects to work with as opposed to raw database rows. However, any advanced analytics/reporting/summary queries tend to look awful with an ORM.

[0]: https://en.wikipedia.org/wiki/QUEL_query_languages


I was hoping you would address the points about composability rather than turning it into a beauty contest between the two languages.


SQL actually isn't all that transferable, because in practice you almost always use an ORM or a query builder instead of writing queries directly into your codebase. So when switching languages you still need to learn the new language's ORM; your knowledge of the underlying SQL will only go so far.

Of course queries should be human-readable, but there's no need for it to be a complete language with its own grammar and templating via prepared statements. The queries could be encoded in JSON or some similar (probably custom) human-readable language that can easily be generated programmatically. MongoDB does this IIRC; it's probably the only thing I like about Mongo, but it's a good idea.


in my experience most engineering shops that care about database perf are not using ORMs except for the most generic CRUD features. you gotta handroll your queries with an eye on EXPLAIN once you pass a billion rows in your tables, in my experience anyway


My experience as well. ORMs are really nice in the beginning, they save a lot of boilerplate code. But they become the enemy once scale and/or performance become an issue.


The thing about #2, is poorly designed in comparison to what?

Chances are high the application layer is written in JavaScript, PHP, Ruby, or Python. We don't even talk about nasty edge cases in those languages because they are uncountable.


Those languages all have first-class functions and OOP-style encapsulation.

In SQL stored procedures (at least in Postgres), you can't have a variable that holds multiple records. You can't even define a variable inside of a block.


DECLARE foo record[];


Temp tables?


None of those languages is as unpleasant as SQL. Not even PHP.


For me, I would much rather use SQL, especially when also using dbt to manage the queries like any other language. I thought the same way as you, and SQL certainly has its warts, but I have grown to appreciate its elegance. I think over the long term it is easier to maintain because it is so concise and the most common second language among programmers.


Once you start thinking in sets instead of collections of independent statements as per PHP et al, SQL starts looking a lot better.

If you try to fit SQL into the general purpose programming language box, you're gonna have a bad time. A really bad time.

SQL is a DSL for set theory. Nothing more. Nothing less.


The things that are bad about SQL have nothing to do with set theory. The non-orthogonal syntax. The terribly limited and opaque type system. Ternary logic that fails silently. The utter lack of composability or testability.


Also, views are the essence of composability.

As for testing in the Postgres sphere, there's pgTAP and pg_prove. https://pgtap.org/

Extension development includes a built-in unit testing harness as well.

For use with any database, a tool like Sqitch allows for DB-agnostic verification at each migration point. https://sqitch.org/docs/manual/sqitchtutorial/#trust-but-ver...

Proprietary databases have their own solutions as well. https://learn.microsoft.com/en-us/sql/ssdt/walkthrough-creat...

I think "utter lack" is grossly misrepresenting the state of the art. If you mean "widespread ignorance of existing techniques related to composability or testability," then we are in agreement.


CTEs are quite composable. I might be spoiled by Postgres, but types are quite robust there.

Replace "NULL" with "unknown" in your head, and the ternary makes more sense. When you're building your schema, does an unknown value make sense in that context? Many times not, and the column should either not be nullable or should be referenced in a different table by a foreign key.

3 = NULL

"Is 3 equal to this unknown value?" Maybe yes. Maybe no. It's unknown. Therefore the answer to "3 = NULL" is NULL. The answer is also unknown. Not true. Not false. Unknown.

IS NULL or IS NOT NULL, but never = NULL or <> NULL.
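A few runnable illustrations of that reading (psql; t and col are hypothetical):

    SELECT 3 = NULL;              -- NULL (unknown), not false
    SELECT NULL = NULL;           -- NULL: two unknowns may or may not be equal
    SELECT 3 IN (1, 2, NULL);     -- NULL: 3 might equal the unknown member
    SELECT count(*) FROM t WHERE col = NULL;   -- matches zero rows, always
    SELECT count(*) FROM t WHERE col IS NULL;  -- the correct test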

It may be unusual to someone coming from a general purpose programming language's notion of null as a (known) missing value, but that doesn't make it wrong. It means you need to reorient your mind toward set theory, where NULL means "unknown" if you're going to work with SQL and relational databases in general.

Folks often speak of the impedance mismatch between relational models and in-memory object models. NULL is one of those mismatches.


> CTEs are quite composable. I might be spoiled by Postgres, but types are quite robust there.

> Replace "NULL" with "unknown" in your head, and the ternary makes more sense. When you're building your schema, does an unknown value make sense in that context? Many times not, and the column should either not be nullable or should be referenced in a different table by a foreign key.

Cool, now how do I do these two incredibly basic things at the same time and make something not nullable or referencing another table in a CTE?


CTEs are quite composable -- in SQL land.

I wish SQL itself were more composable.


> in particular, its historically limited support for structured data within fields

This is not particular to SQL though, and is the rationale behind the first normal form. Codd argued that any complex data structure could be represented in the form of relations, so adding non-relational structures would just complicate things for no additional power.


The issue is that, to put things in 1NF, you need to fully normalize everything, which has a big performance penalty since every query now has to JOIN a large number of tables together.

Of course, an RDBMS could be designed to do that without a performance penalty, by storing data in a denormalized form and automatically translating queries for the normalized data accordingly.

But SQL doesn't have the features you'd need to control and manage that sort of transparent denormalization. So you'd end up having to extend SQL to support it properly so that the performance penalty in question could be mitigated in all cases.

edit: Rather than "you need to fully normalize everything," I should have said "you need to split all your data across multiple tables to eliminate the need for structured data within records." The performance penalty happens when you need to do this everywhere for sufficiently complex datasets.


I doubt querying JSON or XML or some other embedded structured format would be faster than querying normalized data. It might be true in some special cases, but certainly not in the general case of ad-hoc queries across nested data structures.

But I totally agree SQL could be improved to make normalization feel like less of a burden. It really highlights a problem when it feels like it's more convenient to just dump a JSON array into a field rather than extract it to a separate table.


> The issue is that, to put things in 1NF, you need to fully normalize everything

Not unless you redefine “fully normalize” you don’t.


The boundary between "simple" and "complex" data structures is largely arbitrary, though. It's not unreasonable to consider an integer as a complex data structure, consisting of bits, and in some contexts we do that - but in most, it's obviously more convenient to treat it as a single value. In the same vein, it should be possible to treat a tuple of several numbers (say, point coordinates) as a single value as well, in contexts where this makes sense.


In the context of the relational model, simple and composite refer to how values are treated by the relational operators. You can't select individual bits of an integer (without special-purpose operators), which means an integer is a single value.

Presumably an integer is physically stored as a set of bits, but this is not exposed to the logical layer (for good reasons - e.g. whether the machine uses big endian or little endian should not affect the logical layer). If you actually wanted to operate on individual bits using relational operators, you would use a column for each individual bit.

Having XML or JSON fields is also totally fine according to the relational model as long as they are "black box" values for the logical layer. But Codd observed that if you wanted to treat individual values as composite you end up with a much more complex query language. And indeed this has happened with XPath and JSON queries and whatnot embedded in SQL. Presumably it should then be possible to have XML inside a JSON struct, and a table inside the XML. If this is even possible, it would be hideously complex. But normalized relations already allow this without any fuss.


This is an arbitrary distinction in the first place. You can absolutely have a relational database that only allows 0 and 1 as valid values and requires you to build integers up yourself using relational techniques. The relational model itself doesn't care what "values" are.

In practice, simple value tuples make things much more convenient, and the edge cases are minimal. You don't have to allow full-fledged composition like nested tables etc.


> The relational model itself doesn't care what "values" are.

Maybe I misunderstand what you are arguing, but the relational model is defined in terms of values, sets, and domains (data types); of course, the domains chosen for a particular database schema depend on the business requirements.


Codd argued this a long time ago. Unfortunately Codd died twenty years ago, so we can't ask him his current thoughts on the matter. On the other hand, Chris Date dropped this rigid view of the relational model way back in the 1990s.

https://www.youtube.com/watch?v=swR33jIhW8Q


The question isn't who said what when, the question is what reasoning holds today. Codd's argument was simply that if we allow tables embedded inside fields and we want to query across multiple "layers" of embedded tables, we get a much more complex query language and implementation for no additional benefit, since the same relationship can be represented with foreign keys. I haven't seen any reasonable counterargument against this.

If I understand Date correctly, he is just saying that individual values can be arbitrarily complex as long as they are treated as "atomic" by the relational operators. I don't disagree, but the reality is that very soon after you start storing stuff like XML or JSON in database fields, someone wants to query sub-structures, e.g. filter on individual properties in the JSON. And then you have a mess.


Also, caching can be done with materialized views


IIRC in Postgres you need to refresh an entire materialized view all at once, effectively recreating the entire table; you can't just have it update incrementally whenever the underlying data changes.

I think SQL Server can do this, but...then you have to use SQL Server.
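For reference, the full-refresh statement in question (view name made up); CONCURRENTLY avoids blocking readers but still recomputes everything, and requires a unique index on the view:

    REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals;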


> I think SQL Server can do this, but...then you have to use SQL Server.

Yes, SQL Server can update indexed views incrementally, but there are severe limitations:

https://learn.microsoft.com/en-us/sql/relational-databases/v...

If memory serves, indexed views have been in SQL Server for 20-odd years and haven't seen meaningful improvements in all that time. We still can't do a LEFT JOIN, or join the same table more than once, or use MAX, etc...

The same story with T-SQL, which is firmly stuck in the '80s (not that other databases are better).

There are some extremely powerful features in SQL Server that can be used effectively with some pain, but they could be so much better if Microsoft invested in fully fleshing-out their potential instead of chasing the latest buzzword.

Sorry for the rant.


Yep, which is why materialized views don't tend to work great for a lot of tasks where they initially seem to be the most natural implementation, particularly almost any view that's a feed or aggregate of data in the system over all time. It's so easy to just play with a simple SQL select query until you get it working, then throw it into a materialized view. It'll probably even work for a long time! But as soon as the data in the system grows and that refresh starts getting slower, you're stuck with a (potentially tricky or at least frustrating) migration to another implementation (maybe something like event sourcing).


I think it's being worked on for Postgres, but it's probably another major release or two away. A quick Google came up with this: https://pgconf.ru/en/2021/288667 - and I'm sure I've come across discussion in the Postgres mailing lists / wiki in the past.

It would definitely be a nice feature to have. Without it, I find the main use case I have for materialised views is batch processing, where you want to prepare a complex result set and then stream-process it in a cron job or similar.


There’s also info on this page about it (incremental materialised views).

https://wiki.postgresql.org/wiki/Incremental_View_Maintenanc...

In the meantime my preferred technique is to have a column where I stamp the last generation time for each row and then I rebuild anything that’s changed since then (assuming all your source data has some sort of last updated stamp).
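A rough sketch of that technique (all names invented; assumes a unique index on order_summary(order_id)):

    -- rebuild only the summary rows whose source orders changed since the last run
    INSERT INTO order_summary (order_id, total, generated_at)
    SELECT o.id, sum(l.amount), now()
    FROM orders o
    JOIN order_lines l ON l.order_id = o.id
    WHERE o.updated_at > coalesce((SELECT max(generated_at) FROM order_summary),
                                  '-infinity')
    GROUP BY o.id
    ON CONFLICT (order_id)
    DO UPDATE SET total = EXCLUDED.total,
                  generated_at = EXCLUDED.generated_at;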


Also computed columns in a covering index.


> It's quite another when experienced seniors ban the use of SQL features because it's not "modern" or there is an architectural principle to ban "business logic" in SQL.

While I agree with the idea of pushing heavy compute to where the data resides, I wholeheartedly disagree with the statement quoted above. Concentrating business logic in SQL makes it effectively untestable. It's not easy to "compartmentalize" SQL code such that each individual piece is testable on its own. Often this lets major issues go unnoticed until the SQL query executes in production on some unexpected input data and people are woken up at 3AM. Add on top of this the fact that SQL exceptions are a PITA to debug on a normal day, and it probably isn't any easier at 3AM.


Are you talking about stored procedures? SQL itself is super easy to test: insert data into tables, run query, compare output. It's also easy to test queries on production data since every db has a repl. Queries are also mostly composed of read-only, referentially transparent parts so it's super easy to take snippets and test/run them in isolation. For complex updates you join the table you want to update to a read-only query with all the logic to calculate the new values.

SQL is probably the easiest language there is to write well tested, composable, easy to debug code.
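A minimal version of that workflow, runnable in psql (names invented):

    BEGIN;
    CREATE TEMP TABLE orders (id int, amount numeric);
    INSERT INTO orders VALUES (1, 10), (1, 15), (2, 7);

    -- the query under test, diffed against the expected rows;
    -- an empty result means it passes (diff both directions for a full check)
    SELECT * FROM (VALUES (1, 25::numeric), (2, 7::numeric)) AS expected(id, total)
    EXCEPT
    SELECT id, sum(amount) FROM orders GROUP BY id;

    ROLLBACK;  -- leave no trace behind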


> Queries are also mostly composed of read-only, referentially transparent parts so it's super easy to take snippets and test/run them in isolation

True enough, but now you need to ensure that the snippets copied out into tests stay in sync with the inline versions embedded inside your six-screen-long SQL query.


In MSSQL you have "inline table-valued functions" which you can declare and reuse; they are inlined, and optimization happens across them. We use sqlcode [1] to get around weaknesses in deploying stored functions.

Big drawback though is that functions can only take scalar arguments, not table arguments.

Another method is to just have a bunch of CTE clauses, and append something different to the end of the CTE depending on which parts of it to use (i.e. some string assembling required).

[1] https://github.com/vippsas/sqlcode


Easy to debug and write well-tested - mostly yes. Composable is tough, though (make sure all your table aliases are unique, that there are no ambiguous column names, that all your ANDs are in place, and don't forget that 1=1 to make it easier to uncomment parts). You already need a somewhat sophisticated query builder just to compose multiple JOINs cleanly. Recently I needed to get a query builder-ish thing going with an arbitrary number of conditions which should be done as JOINs, and the most I could muster to keep it at least a little manageable was a "pkey IN({literal_subquery})". It does compose, as in "yes you can do it", but I wouldn't say it composes very conveniently.


Excuse me, but can you show me where in the SQL language documentation I can find info about its unit test harness framework, teedeedee best practices, mocking, shimming, and dependency injection? That's just the bare minimum; I start with those as the first thing to learn in any new language.


At my first real programming job, I found an error in the SQL functions that the company had written. After pointing it out, the CEO bet me that I couldn't fix it. Apparently their best programmers had tried and failed.

I did end up fixing it, but it took me a couple weeks.

Beyond being untestable, it's also hard to version control and roll back if there's a problem. We had some procedures around SQL updates that were kind of a pain because of that.


Regarding SQL being "untestable"...

I have put SQL DDL+DML+stored procedures in version control, created/run stored procedure (TDD) unit/integration tests on mock data against other stored procedures, had pass-fail testing/deployment in my CI/CD tool right alongside native app code, and done rollback, all using Liquibase change sets (+git+Jenkins).

Using Liquibase .sql scripts for version control isn't hard. Testing is always more work, but it's doable.

I don't completely disagree with you on rollback being hard, though, at least full pure rollback-from-anything. Having built tooling to do it once with Liquibase, I found the effort to guarantee rollback in all circumstances took more effort than it was worth. A lot of DDL and code artifacts and statements like TRUNCATE are not transaction-safe and not easy to systematically roll back. Liquibase does let you specify a rollback SQL command for every SQL command you execute, so you could make it work if you had the time, but writing+testing a rollback SQL command for every SQL command you execute wasn't worth it, and is indeed materially more effort than just rolling back to earlier .war/.jar/.py/docker/etc files. (The latter are easier in part because they are stateless, of course.)

In any case, something like Liquibase can get you a long ways if you have the testing mindset. (Basically it lets you execute a series of SQL changesets and you can have preconditions and postconditions for each changeset that cause an abort or rollback.)


If you mean 5+-page SQLs are untestable, I don't disagree. Like any code you need to break it up once it gets past a certain size.


> Like any code you need to break it up once it gets past a certain size.

And SQL doesn't give you the tools to do that. There's no easy or natural way to split a statement into smaller parts.


SQL doesn’t give you tools to break up code?? Err, I don’t agree there. I can think of four options just off the top of my head.

Simplest is to use WITH clauses (common table expressions/CTEs). They can help readability a lot and add a degree of composability within a query.

Second, you can split one query into several, each of which creates a temporary (or intermediate physical) table.

Third, you can define intermediate meaningful bits of logic as views to encapsulate/hide the logic from parents. For performance you materialize them.

Fourth, you can create stored procedures which return a result set and like any procedural or functional language chain or nest them.

These techniques are available in most databases. More mature databases support forms of recursion for CTEs or stored procedures or dynamic SQL.

As with most programming, proper decomposition and naming helps a fair bit.
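A sketch of the first option, with invented names: each CTE is a named, separately readable building block.

    WITH active_customers AS (
        SELECT id, name
        FROM customers
        WHERE cancelled_at IS NULL
    ),
    recent_orders AS (
        SELECT customer_id, sum(amount) AS total
        FROM orders
        WHERE placed_at > now() - interval '30 days'
        GROUP BY customer_id
    )
    SELECT c.name, coalesce(r.total, 0) AS recent_total
    FROM active_customers c
    LEFT JOIN recent_orders r ON r.customer_id = c.id;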


> Beyond being untestable, it's also hard to version control and roll back if there's a problem.

Recently, I've learned that SQLServer supports synonyms. So you version functions / procedures (like MySP_1, MySP_2, etc...) and establish a synonym MySP -> MySP_1. Then you test MySP_2 and when ready, change the synonym to point to MySP_2. Of course, all code uses just the synonym.
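A T-SQL sketch of that cutover (procedure names as in the comment; the rest is illustrative):

    CREATE SYNONYM dbo.MySP FOR dbo.MySP_1;
    GO
    -- ...later, once MySP_2 has been tested side by side:
    DROP SYNONYM dbo.MySP;
    GO
    CREATE SYNONYM dbo.MySP FOR dbo.MySP_2;  -- callers flip over, no code changes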


We use sqlcode as a more general solution to this:

https://github.com/vippsas/sqlcode


I'm not sure where this idea comes from. There are unit testing frameworks for T-SQL and pl/sql. Stored procedure code can be version controlled like any other code.


We care A LOT about testing everything, always.

We use SQL containers [edit: a new, freshly cloned DB for a test function in less than a second] and find it to be no problem in practice to write integration tests for our pieces of Go code that call the SQL queries with some business logic in them.

(No, we don't do a sprawling mess of stored procedures calling each other. We just try not to move data over to backends unless we really need to.)

If you really need complex flow, making connection scoped temp tables for temporary results in SQL and having the composability/orchestration through backend functions calling each other and passing the SQL connection between them is doable.

Yes, you cannot unit test every small line of SQL, but since SQL is much higher level that isn't really needed. Test the functional behaviour / inputs and outputs and you are fine.

It really isn't different from writing for a GPU, in a sense.


> Concentrating business logic in SQL makes it effectively untestable. It's not easy to "compartmentalize" SQL code such that each individual piece is testable on its own.

What are you talking about? SQL is just as easy to compartmentalize and test as anything else.

Each query statement belongs in a function -- there, compartmentalized. Now set up a table state, run the function, and compare with new table state. The test either passes or fails.

Also no idea why you'd think SQL is a PITA to debug. It's a relatively compact and straightforward language once you learn it, and queries are self-contained. It's generally much easier to debug a query than it is to debug something happening somewhere across 10,000 LOC across 400 functions.


It's totally possible to validate data before pushing to a table, and have normal tests around sql massaging it. Saying this as somebody looking at 1000s of transformations done by my teams.

Admittedly, it took a while for data engineers in the industry to accept these practices, though.


PS: but stored procedures and UDFs are evil, yes


A lot of people are not comfortable learning languages beyond the ALGOL-like paradigm. SQL's building blocks are incredibly odd if you're used to the idea that work gets done via variables, conditionals, and loops.

Personally I think it's a wonderful (if imperfect), ultra-powerful, and easy language to learn. But if it doesn't click for you and you're under the gun at the job, I bet it's very easy to develop a bad attitude towards SQL.


"SQL's building blocks are incredibly odd if you're used to the idea that work gets done via variables, conditionals, and loops."

They're even weirder when you get into stored procedures, where SQL statements are your very un-ALGOL-like elementary statement, but then in between them you have a procedural language operating on the results, except when the optimizer figures it can "see through" your procedures to get back to something it can optimize through. And the "declarative" nature of them makes understanding cost models a challenge sometimes. You need a deep understanding of how the database works to get the cost model out of your stored procedure code.

Very powerful. I don't do a lot of deep database stuff but I have a couple of times turned something that required an arbitrary number of thousands of back-and-forths with the DB taking seconds from the application code into a one-shot "send this to the DB, get answer back about a millisecond later" using them, and you can end up with performance so good that your fellow developers literally won't believe it's a "conventional stodgy old relational database" blowing their socks off. But it's a weird programming model.


I'd also add that "doesn't click" is sometimes confounded by a respect gradient. With SQL (also CSS) I've seen some people pick up an attitude that those aren't Serious Languages worthy of their time or respect, where they avoid learning the fundamental concepts, have problems, and then say the language is too hard or old-fashioned to use. I've seen people write many thousands of lines of Java because nobody pushed back on that mountain of fragile code telling them "maybe you should take a day and really get up to speed with how SQL works".


> A lot of people are not comfortable learning languages beyond the ALGOL-like paradigm. SQL's building blocks are incredibly odd if you're used to the idea that work gets done via variables, conditionals, and loops.

Yes, although somewhat amusingly I've found that non-programmers who have never learnt an imperative programming paradigm tend to find SQL a lot more intuitive than ALGOL-style programming languages.


I'm curious about resources to learning SQL well.

In our company I've done a series of teaching sessions (we're on about the 10th hour now). It's working OK, but I wish I knew of good material / blog posts to point at. I feel like there's no good community to learn from, like there is for many backend languages.

E.g. after learning about "cross apply" (T-SQL; "lateral join" in postgres) everything got 10x easier to express in SQL. But: How is a developer who's just getting started in SQL going to know that?
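For anyone who hasn't seen it, a sketch (Postgres; names invented): the lateral subquery can reference the outer row, here to get the latest three orders per customer. T-SQL spells the same thing CROSS APPLY.

    SELECT c.name, o.id, o.placed_at
    FROM customers c
    CROSS JOIN LATERAL (
        SELECT id, placed_at
        FROM orders
        WHERE customer_id = c.id
        ORDER BY placed_at DESC
        LIMIT 3
    ) o;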

And where is the community that can teach a backend developer getting started with SQL to ignore some of the advice from the DBA and Data Analytics communities? E.g. the advice to "not pin indexes" -- which I believe is 100% wrong advice for a typical backend application where reproducibility across environments is key, and where any query not supported directly by an index is probably a bug anyway.


"E.g. after learning about "cross apply" (T-SQL; "lateral join" in postgres) everything got 10x easier to express in SQL."

I feel like even in the past few years the database community has still been learning about what you need the databases to be able to do in order to make good code. I personally don't like the "declarative" memeset and think it set the community back literally decades, but with recent Postgreses (by which I mean the whole last five years or so... lots of people are still running older things) all the functions and functionality are technically there to arbitrarily convert between rows, arrays, columnsets, etc., and more and more you can use them arbitrarily as well, so you can JOIN against a columnset you bodged together from two other queries that you pulled into an array and then put that array into a columnset, without it ever having to be turned into a full "table". Cross apply is another example of that, where IIRC a row can be turned into multiple rows.
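
For example, the basic conversions look like this in Postgres (trivial cases, just to show the shape; names invented):

    SELECT array_agg(id) FROM some_table;          -- rows -> array
    SELECT * FROM unnest(ARRAY[1, 2, 3]) AS t(id); -- array -> rows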

The problem is that while all the functionality I've wanted on this front does now seem to exist, it's all incredibly haphazard. Cross joining is an SQL keyword, but arrays look more like a data structure, and I can't remember what all was going on but I recall having more hassle turning arrays into columnsets for some reason. If I were going to be doing this full time I think I'd build myself a matrix cheat sheet of how to convert between all these things, and I bet there's still holes in the matrix even today (is there an opposite of a cross join? dunno, but I wouldn't be surprised if the answer is "no").

I feel like I'm doing a lot less "work around things missing in SQL (that I have access to)" than I did 15-20 years ago, but rather than a cohesive and well-organized toolset for dealing with all these things, I've got a haphazard set of Bob's Custom Tool for This and A Semi-Standard, Modestly Extensible Tool for that, neither of which were ever designed with the other in mind, and yeah, in the end I can do everything I want quite nicely but it's up to me to notice that what this tool calls a 1/8 inch quartzic turns out to be the same as a Number Seven smithnoczoid and so in fact they do work together perfectly despite the fact the documentation for neither of them suggests that such a thing is possible, etc.


Mode's SQL school is the best I've seen: https://mode.com/sql-tutorial/


And when it does click for you, all the other things that do not involve for/while loops (dicts, map/reduce/group by) become very obvious.


This depends on the use case. SQL is the king of batch processing - queries are declarative, with decades of effort put into optimization.

For real-time / streaming use cases, however, there is not yet a mature solution in SQL. Flink SQL / Materialize is getting there, but the state of the art is still the Flink / Kafka Streams approach - put your state in memory / on local disk, and mutate it as you consume messages.

This actually echoes the "Operate on data where it resides" principle in the article.


We do mini-batch processing in SQL. Some hundred milliseconds of latency, some hundred events consumed from the inbound event table per iteration. Paginate through using a (Shard, EventSequenceNumber) key; writers to the table synchronize/lock so that this is safe.
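
One iteration of the consumer loop looks roughly like this (T-SQL; table and column names invented):

    -- @Shard / @LastSeen are the consumer's cursor position
    SELECT TOP (1000) EventSequenceNumber, Payload
    FROM   InboundEvent
    WHERE  Shard = @Shard AND EventSequenceNumber > @LastSeen
    ORDER BY EventSequenceNumber;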

Kafka-in-SQL if you wish. Or, homegrown Flink.

(There are many different uses for the events inside our SQL processing pipelines, and we have to store the ingested events in SQL anyway.)

I am sure real Kafka+Flink has some advantages, but...what we do works really well, is simple, and feels right for our scale.

There is enough batching in SQL to get real speed/CPU benefits on inserts/updates (vs e.g. hitting SQL once per consumed event, which would be way worse). And with Azure SQL the infra is extremely simple vs getting a Kafka cluster in our context.


Do you find Flink SQL immature? To me it looks a lot like syntactic sugar on top of the DataStream API.

Same thing, less code?


I've met so many developers who seem to be willing to do anything, except use sql. I remember one case where we were dealing with clearly relational data. One of the senior developers was adamant that we use a non-relational database. When pushed as to why, he said it was because it would have better performance. I pointed out that this for a process where data would be sent out and we wouldn't expect to receive it back for 3 days or so. A few milliseconds performance boost was hardly beneficial over 3 days. But he was insistent and was senior, so we did it. Shockingly, it turned out to be the wrong decision. Later we learned that the senior developer didn't like sql.


> I think a lot of blame is with the database vendors. They only targeted some domains and not others, so writing SQL is something of an acquired taste. I wish there was a modern language that compiled to SQL (like PRQL, but with data mutation).

There's EdgeQL https://www.edgedb.com/blog/we-can-do-better-than-sql#lack-o... which I like the look of - but as it adds one more layer over a database I haven't used it yet.

Edit: I hadn't seen PRQL before, reading the site now https://prql-lang.org/


I have run into similar attitudes. In my case, it's an unfamiliarity with SQL, fear of the database, and a desire to utilize strongly typed ORM's for everything. Implementing these sorts of queries often requires using raw sql which is seen as taboo by such people. They think it's unsafe. Meanwhile we're pulling more records than we need, searching them, then firing off more queries in a for loop.


How easy is it to verify that the configuration of your database matches a checked-in configuration or source file these days? My beef with a lot of installed procedures in a SQL database comes down to deployment and rollback difficulty.


For Postgres views, materialized views, functions and triggers I've made a tool that can do that, DBSamizdat:

https://sr.ht/~nullenenenen/DBSamizdat


Depends on the DBMS, but pt-config-diff works well for MySQL.

edit: This will only help for configuration, not procedures.

https://docs.percona.com/percona-toolkit/pt-config-diff.html


I feel this has gotten worse with the latest round of data science graduates wanting everything in python.

Probably just perspective. I ducked out of the push for everything to be in Hadoop. And while I can appreciate the foot gun that is indexing everything so that ad hoc queries work, I also have to deal with folks thinking elastic search somehow avoids that trap.

I think I've seen the same claims and push for graphql. :(


This, a thousand times this. A hundred lines of Java/Python/C# can save you at least 10 lines of a stored proc :) Also why don't they teach SQL in most schools???


> Yes we CAN do that but why do that it's 15 lines of SQL and 5 seconds compute -- instead of a new microservice or whatever and some minutes of compute.

This works until your database falls over in production. Recently someone started appending to a json field type over and over in our production database. And then on some queries, postgres crashed due to lack of memory. The fix was to remove the field and code that constantly appended to it, and do something else.

No, the database should not be the answer to all your data problems. Yes a microservice may be the best answer. But for structured data and queries that can run with normal amounts of memory the standard SQL DB is fine.


> No, the database should not be the answer to all your data problems. Yes a microservice may be the best answer.

Or to put it another way: No, microservices should not be the answer to all your data problems. Yes, using the DB may be the best answer.


In one case it might just be slow. In the other your db will fall over. Unfortunately, you may find the answer when your production environment fails.


If you have a single DB instance in production, something has already gone wrong.

Read replicas.


> Recently someone started appending to a json field type over and over in our production database. And then on some queries, postgres crashed due to lack of memory. The fix was to remove the field and code that constantly appended to it, and do something else.

So your choices are: do something manifestly non-optimal in the database in violation of best practices, or don't use the database for non-trivial business logic?

Sounds like a false dichotomy to me. Maybe just find a better solution?

I mean, if someone writes an O(n^3) algorithm in Python, is the solution to use a better strategy or to swear off Python for anything non-trivial?


> I wish there was a modern language that compiled to SQL (like PRQL, but with data mutation).

That is nice, but even better would be a language that is relational itself.

I'm working on one (https://tablam.org) and you can even do stuff like:

    for p in cross(products, qty) ?limit 10 do print(p.products.price * p.qty) end

The thing is, being functional/relational only is too mind-bending, and it is very nice to also have procedural constructs to work with.

BTW my plan is that the query (?) section will compile to optimal executions defined per storage engine (memory, sqlite, mysql, redis, etc), so `products ? price = 10.0` is executed on the server, not on the client.


Interesting! Have you looked at Datalog? In many ways it's the ideal purely relational language.


Yes. I think it is not friendly for end-users, and building this in Rust makes it easy to plug in iterators/generators, so I went that way.

(Also, I never wrapped my head around how to use it internally!)


IME SQL and relational databases have the framework problem, where they expect to be used in a particular way and won't let you use the internals. It's like when a language (Ada?) famously built a high-level version of concurrency into the language, where you were meant to use its "rendezvous" (I've probably misremembered the name), and instead what happened is that programmers implemented mutexes on top of the rendezvous and then reimplemented high-level concurrency on top of that.

What we need is databases that expose more of their internals, that are designed to be embedded in applications and used as libraries. All serious databases use MVCC these days, but none of them expose it to the user. All serious databases separate updating the data from updating the index, but few of them expose that to the user. All serious databases know the difference between an indexed join and a table scan, but good luck figuring it out by just looking at an SQL query. Etc.


Oh this reminds me a lot of a familiar kafka vs SQL battle.

Yes, it is possible to do it in Kafka, but everything is obscured in the sense that you can't peek at what you are doing, you spend your precious CPU on serializing/deserializing, and things like backfill are a mess.


I find the opposite; SQL databases work a lot like Kafka underneath, but hide it from you, burning all your CPU to generate this illusion of a globally consistent state of the world that you don't actually want or need and doing everything they can to avoid ever showing you what they're actually doing.


Hold up. I need some clarification here. SQL databases are opaque with regard to resource usage but Kafka is completely transparent in this regard? That is your personal experience and assertion?


Yes. I once saw an outage caused by an SQL database server collapsing because of a query that had been issued 23 days earlier. I saw another one gradually get slower and slower over a period of weeks because a data scientist had left a command prompt open and forgotten about it. Kafka brokers are a lot more consistent and predictable IME.


> "why can't just just put your millions of events into a service bus and write some backend to react to them to update your aggregate". Yes we CAN do that but why do that it's 15 lines of SQL and 5 seconds compute

Do people actually do this? I feel like this is classic Occam's Razor. This is one of the biggest things I do in SQL and I couldn't imagine the time and effort to do it with a separate service.


I interact with SQL on a daily basis but never write a single query by hand. It's all abstracted by the Django ORM. I do have to be mindful of what I do of course but mostly I just plow away with the abstractions offered. Once in a while I have to take a look at the SQL or the query plan but those are few and far between.


Sure, but there's a world of difference between someone who knows SQL and uses a ORM for productivity and someone who uses an ORM because it's the only tool in their box.


Exactly. Effective use of an ORM requires knowledge of the ORM's API in addition to SQL, not in lieu of it.

Using an ORM without a firm understanding of SQL is a recipe for disaster (or at least glacial performance).


I've seen both extremes of this.

On one end, you have applications that execute thousands of SQL queries for each page load, for populating a table with some data or something like that, which has bunches of rules for what should be displayed, all implemented as nested method calls (e.g. the service pattern) in your application. It's a performance nightmare when you get more data, or more users.

On the other end, you have an application where your back end acts just as a view for a DB that has all of the logic in it. There will rarely be proper tests for it. There will rarely be any sort of logging or observability solutions for this in place, the discoverability will be pretty bad, debugging will often be really bad, versioning of changes will also be pretty awkward, but at least the performance will typically be okay.

Just look at the JetBrains Survey from 2021: https://www.jetbrains.com/lp/devecosystem-2021/databases/#Da...

  Do you debug stored procedures?
  47% Never
  44% Rarely
  9% Frequently
  
  Do you have tests in your database?
  14% Yes
  70% No
  15% I don't know
  
  Do you keep your database scripts in a version control system?
  54% Yes
  37% No
  9% I don't know
  
  Do you write comments for the database objects?
  49% No
  27% Yes, for many types of objects
  24% Yes, only for tables
If something that most would consider to be a "good practice" isn't done, then clearly that's a bit of a canary about the state of the technology and the ecosystem around it. Consider that databases are very important for most types of systems, and yet about half of people don't debug their stored procedures, most people don't have tests for them, only about half version their scripts and about half don't bother with comments.

I'm pretty much convinced that it's possible to write bad software regardless of the approach that's used. In my mind, the happy path for succeeding in even sub-optimal circumstances is a bit like this:

  - make liberal use of DB views for querying data; if you have a table in your app, it should have a matching DB view (see the sketch after this list), which will also make debugging easier
  - make use of in-database processing only when it makes a lot of sense and anything else would be a horrible choice (e.g. ETL or batch processes/pipelines), have log tables and such regardless
  - for most other concerns (e.g. typical CRUD), write app logic: most back end languages will have better support for observability and tracing, logging and debugging, as well as scaling for any expensive operations
  - still, be wary of the N+1 problem, that might mean that you don't have enough views, or that you're not using JOINs for your queries properly
  - also, look into using something like Redis for caching, S3 (or something compatible, like MinIO) for binary blobs, and something like RabbitMQ for task queues; just because you can shove everything into the DB doesn't mean that you should, and sometimes these specialized solutions will have better support and more standardized libraries than whatever you can concoct
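
A minimal sketch of that first point (invented names); the app selects from the view, so there is one place to look when debugging:

    CREATE VIEW v_order AS
    SELECT o.id, o.created_at, c.name AS customer_name
    FROM orders o
    JOIN customers c ON c.id = o.customer_id;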


Yep. Folks treat relational databases like dumb bit buckets. They assume they are limited and therefore treat them as limited despite the reality of modern SQL. It should not be a surprise to find that most devs skip modern development practices with it as well.


The main reason to do this is so they can be turned into streams that have the source and, potentially, multiple destinations decoupled. If the source and destination will always be the same, sure, do it in the DB.


I couldn't have put it better. Use the SQL, Luke.


Firstly, I'd suggest the author look at this differently; perhaps "For Want of a Code Review". Especially code from a relatively recent graduate, on a piece of code for which the engineer in question has little experience.

With that said, the JOIN is a very powerful concept which, unfortunately, has been given a terrible reputation by the NoSQL community. Moving such logic out of the database and into the DB's client is just a waste of IO and computing bandwidth.

SQL has been the ONLY technology/language that has stuck with me for > 25 years. The fact that it is (apparently) not being taught by institutions of higher learning is just a shame.


> Firstly, I'd suggest the author look at this differently; perhaps "For Want of a Code Review". Especially code from a relatively recent graduate, on a piece of code for which the engineer in question has little experience.

I assume the story is made up, but if we were to take it at face value the title would be "For Want of Basic Human Decency"; the author is saying that they saw this whole easily preventable train wreck happen in slow motion and did not lift a finger to prevent it, instead laughing, taking notes and thinking of the fabulous snarky blog post that would come out of it.


The way I read it, they accepted the decisions of those higher in the hierarchy, after providing their feedback. It wasn't clear if lots of money was lost, just lots of time. I didn't think it was made up.


Eventually you learn it's not always a great idea to stop your employer from getting burned. Some people don't learn until it hurts.


From the article:

> I’d definitely commented on the JOIN statements during the initial code review


How about when it became a problem? Or the second time? The third?


I agree. It took me 3 years or so to actually land in a project and learn SQL for the first time. Before it was all with ORMs. I didn't know what a join was for the first couple of years of my career.

Understanding SQL and being able to work with data interactively has made me a better software engineer. This tech is important enough that it should be taught in university/coding camps.


You DIDN'T learn SQL in school? Probably my most useful class. I hated it at the time; I was a desktop and embedded dev, and this was before SQLite roamed the earth.


>Probably my most useful class.

Ditto.

My teacher was hardcore. He was a graybeard who was around before Codd's now famous paper. He worked with some of the old pre-relational hierarchical databases.

We had to take SQL queries, turn them into relational calculus and algebra, turn that into a query plan, then come up with an estimate for the time the query would take to run given various hardware speed numbers and the size of the data.

We had to implement our own (primitive!) database engines, including various join algorithms.

To date it's one of the hardest, yet most rewarding, learning experiences I've had.


I would take that today!


Not exactly the same, but check out the CMU database course lectures on YouTube.


That.. Sounds graduate-level. How's your PhD?


I took it during summer, and there were some masters and PhD students in there, but I took it as an undergrad course. It was very intense, 3 hours per day, 3 days per week, and 1.5 hours per day the other two.


Man, I got downvoted into oblivion for my comment, but seriously, yeah, graduate level.

That sounds fun. My most intense undergrad class series was one where we soldered together a M68HC11 computer and learned assembly one class, built an OS for it the next, and then turned it into a robot with control software running on our OS-es.


Not necessarily. One of my undergrad courses (~2008) had us implementing a basic inverted index (think Solr or Elasticsearch), which gave me some insight into Solr that my co-workers didn't have and that helped with performance issues.

If we had a database course that went that in-depth, I'd've definitely taken it, too.


That sort of course was very much par for undergrad courses at CMU when I was taking CS classes there 25 years ago. The OS course was very intense. I wasn’t a major and didn’t have time to take it, but my networks course was of similar rigor (i.e., implement a toy TCP/IP stack).


This was obviously a joke. Sounds like there's a range in SQL training that goes from "I've heard of SQL" to "I implemented a PostgreSQL-compatible db my sophomore year."


When I was at Rice a couple decades ago, the database class was a 400-level class in which we learned relational algebra and relational calculus before SQL. The professor must have been good at teaching because I loved learning the formal underpinnings even though my memory of them has faded, but I do recall that I went from zero SQL knowledge to being very excited by its power. So many of my CS classes were very theoretical, and even though we learned some theory in the database class, it was definitely one of the single most (the single most?) pragmatic & practical of all the CS classes I had.

I was so zealous about normal forms that I complained loudly at one job where they used an old D3 database with multivalue fields. It was so glaring to me because we actually used all hand-rolled SQL instead of an ORM in those days. Years later, after growing less tech-centric and more thoughtful of business needs, I realized that sparingly using multivalue fields was not a hill to die on. :)

Fast forward many years to my first startup in Boston. Google App Engine was new and I wasted precious time trying to figure out how to shoehorn a typical relational data model into the early NoSQL data store available for App Engine at the time. This was just after the financial crisis and I hadn't yet heard the mantra to pick boring technologies, and I learned through sheer pain that unless you really really really need to, don't waste effort by walking away from relational databases. And also, most apps can get by with whatever the ORM does and if there's a performance issue, optimize that one query instead of trying to optimize all your SQL from the beginning. There's a lot I still don't know about pushing heavily complex queries down to the db level, but for expensive problems I'd reach for expensive assistance, because it's worth it (after trying to play with the SQL myself).


For me the first time we dabbled with SQL was in a 3rd year Software Engineering course where the focus of the class was a single group project that we managed among ourselves by splitting tasks, conducting code reviews and handling the build and release in teams.

I recall one group doing the project's login, which went much along the lines of what the OP's article touched on. Their code was essentially:

    var success = false;
    var reader = db.Execute("SELECT * FROM users");  // the entire table
    while (reader.Read())
    {
        // compares credentials row by row, in application code
        if (reader["user"] == input_user && reader["password"] == input_pass)
        {
            success = true;  // and keeps looping even after a match
        }
    }
Yes. They selected the entire user table.

Yes. They iterated over the entire result (even if the first returned result was valid)

Yes. That was "shipped" for the project

No. My complaints that they should be leveraging the database for all the things they were doing wrong were ignored. It was performant! Look! It logs in instantly! Yeah, because there are 8 users in the database for this project; what about when it ""ships"" and there's 100,000? More?
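
For contrast, the database-side version is a single indexed lookup (sketch only; parameter names invented, and a real implementation compares salted password hashes, never raw passwords):

    SELECT 1
    FROM users
    WHERE username = @input_user
      AND password_hash = @input_pass_hash;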

---

My first real job dealing with a database wasn't much better. We were using an MS Access database with no normalized data. Our client's primary transaction data was spread across a table with 70-some columns, many of which held duplicated values in some form or otherwise reflected very bad practices. Since joining this company I've sped up queries in almost immeasurable ways and done things my older coworkers initially derided because they couldn't understand the syntax.

TL;DR: SQL, for some stupid reason, is still treated as second class to core languages, and it is a god damn shame


> SQL is still treated as second class

Agree so much. And if you've ever seen a real SQL wizard in action, you realise how much can be done with it. Like most of the business logic of a system can be in the database, with an interface that's a set of stored procs/functions. And fast.


> most of the business logic of a system can be in the database

The problem of this approach is the tooling and lock-in.

If databases had first-class versioning support for their code objects (which could easily interoperate with git), testing automation, and a semblance of standardization across the industry, then a lot of people would be very happy to work with that model.

But they don't.


> The problem of this approach is the tooling and lock-in.

I've seen a program rewritten or heavily refactored on top of an existing database more times than I've seen the database swapped on an app that had reached production (which I've seen zero times).

Consequently, I regard remaining "database agnostic" as having very little worth. If you pick a DB with a bunch of great features that can save you time, improve performance, and improve data integrity: use those features!

Plus, if you find yourself in that rewriting-or-heavily-refactoring job that I've seen a few times, your favorite person in the whole world will be whoever put all those annoying constraints and triggers and such in the DB itself. It'll make the operation far easier and safer.


While the tooling could definitely be better, a lot of those issues aren't so problematic if you just use Postgres. Use a migration tool and store the migrations in git, use a tool like Zapatos to provide typing for your queries at the application layer, support multiple versions of stored procedures using schemas with a defined search order and test your procedures using pgTap.
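
For example, a minimal sketch of the schema-based versioning idea (names invented):

    -- v2 of the API lives alongside v1; callers pick a version via search_path
    CREATE SCHEMA api_v2;
    CREATE FUNCTION api_v2.get_user(p_id int) RETURNS text
    LANGUAGE sql AS $$ SELECT name FROM users WHERE id = p_id $$;
    SET search_path TO api_v2, api_v1, public;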

Postgres-as-a-platform is definitely a new architectural trend, but because of companies like Supabase it's maturing quickly, and there are so many benefits to it when executed properly.


Came in to say pretty much the same thing. Discoverability is a huge issue... and even if you do things in a way that lends itself to that, it gets really clunky really quickly.


I may be misunderstanding here because I’m not a dev and have only a cursory level of experience doing some basic programming or sql but is this not what dbt allows?


I'm all for straight up queries and understanding... even some more complex sprocs... I'm not a fan of too much logic in the dbms, since it's pretty much a lock-in for a single vendor, limits breaking pieces out for scale and makes things generally much harder to find/understand in practice. I'm a proponent of what I like to call discoverable code structures, sprocs/functions don't lend themselves to that.


yeah, I should have added "can be, not should be" ;)

Though I have met DB Admins who insist that the only way of stopping bad data getting into the DB is to have the DB do all the data manipulation, including a lot of what we would now consider business logic


Yeah, I've worked in environments like that... and after a few years, it takes an inordinate amount of time to add/change anything as a result.


I took several classes on it and I still didn't really 'get' it until I had to work on challenging problems in the real world. Granted, my education was not great quality overall.


I just enrolled in an online CS degree course. The database class is an elective. Crazy, I know.


Agreed... ORMs can be nice, but one should understand how it works. I'm a pretty big proponent of simple mappers (Dapper for .Net, template literals for JS/TS) with straight SQL over ORMs at this point.


I was at a place that used sharding, so the data was scattered across 128 database servers.

SELECT works there but JOIN doesn’t, as your right side may reside at another shard.


IMO...

If the query is for OLAP the data may need to be extracted to another data store.

If the query is for OLTP, then the design is wrong. I don't know your problem space, but pulling data from 128 shards to resolve queries while a user is waiting is just a really bad idea.


> If the query is for OLTP, then the design is wrong. I don't know your problem space, but pulling data from 128 shards to resolve queries while a user is waiting is just a really bad idea.

well, that's the basic idea of microservices lol. forget living on a different shard, lots of times your data is going to round-trip to JSON and back a couple times and then be manually joined in some backend/service layer, or in graphql!

one bad abstraction I see a lot from microservice teams (that don't really understand it past the high-level concept) is "every table is a service", or "every minimal set of tables and its codeset is a service", and that's exactly how that ends up. Microservices really ought to be chunky enough to do their business without ending up calling 27 different services under the hood just to do simple operations. Obviously there is a point where it's too chunky, but too micro is also bad.


> pulling data from 128 shards to resolve queries while a user is waiting is just a really bad idea.

To the contrary, pulling data from 128 shards can be done in parallel, and about 127 of them don't have any data to return.


Yes. This is a super common problem with no-sql engines - we ran into something similar with SOLR when objects are not flattened (eg @JsonUnwrapped annotation). Child objects are stored as separate documents with a join... but if the child object is not stored in the same [file-]block then predicate-scans for the parent may not encounter the child object that causes predicate satisfaction. This breaks deep pagination and some other abstractions.

To me this really is the fundamental distinction for no-sql vs RDBMS. If your data model involves lots of joins... it's RDBMS even if you're using mongo or some other document store under the hood. Ideally you will be storing some large analytical document that contains a lot of details about the thing, rather than just treating it as "rows as a document".

the thing about JOINs breaking across blocks/shards is one thing, and it's ultimately something you can work around for a lot of data (again, flatten with @JsonUnwrapped for example) but if you find yourself reaching for joins, your data is relational, or at least your representation is relational.


If you had to do joins on different shards, then you have implemented sharding wrong.


This is to some extent an implementation limitation, not a theoretical one.

Typical SQL databases support neither the data organization nor the parallel orchestration features required to support these types of JOINs well. The practical issue is that you can't add these features to an existing database kernel architecture if it was not designed to make this feasible from day one, and people are rightly reluctant to design a new SQL database kernel architecture from scratch just so that these features are available. SQL databases are trapped in a local minimum.


Our SQL course in uni left a lot to be desired. Very little time spent on joins, much more on subqueries, oddly. At my first job out of school there was a SQL portion to the interview, and they were impressed by my overuse of subqueries. The best SQL I learned was in the first few months in a real database with real information, instead of a student-courses mock DB with 15 rows that seems to be the academic standard for teaching.


I'm frankly surprised that some of the larger MS based data sets aren't more standard for learning. MS SQL Server isn't generally my first choice (preferring PostgreSQL for standards and portability), but it's got some pretty great example data out there.


Oh it's taught. I have a bone to pick with how. I'd rather have spent a lot of time on the practical application of SQL than the theoretical background of column and table operations.


I've learned that "a month in the lab saves an hour in the library" usually can be distilled to "A shallow understanding produces complex solutions. A deeper understanding is usually required to create simple solutions."

While the original example of not understanding JOIN might just be a lack of of general knowledge, the later steps are great examples of this, especially if someone else comes along and is told to fix the error.

Making something execute slow code in parallel is pretty easy to do generically. It doesn't require understanding much about the slow code. It's fairly low risk, you probably won't have to tweak tests, there won't be additional side effects. The major risks will be around error handling and it's easy to turn a blind eye to partial success/failure and leave that as a problem for a future team. You can confidently build the parallel for loop, call the task done and move on.

Striving for a deeper understanding requires a lot more effort and a lot more risk. Re-writing the slow code is a lot more risk. All side effects must be accounted for. Tests might have to be re-written. The new implementation might be slower. The new index might confuse the query planner and make unrelated queries slower somehow. It's not just a matter of investing time, it's investing energy/focus and taking on risk. But the result will have comparatively fewer failure modes, it'll be cheaper to operate and less likely to have security implications.

I've been in both spots and while I wish I could say we always went with the deeper understanding that wouldn't be an honest statement. But the framing has been really helpful, especially as I work with other execs in the company to prioritize our limited resources.


"Simplicity is the highest form of sophistication"


"less is more"


<>>


golf clap


Reminds me of Blaise Pascal who apologized to a correspondent saying something like "I apologize for the long letter, I didn't have time to write a shorter one.".


The worst part is not the missing JOIN. This happens, especially with juniors.

It's the 'all signup errors warrant paging the on-call, even at 4am' bureaucratic decision, followed by being unable to apply any fix quickly. No surprise the author did not stay.


The worst part is more senior devs being happy to watch it happen and even accept the commits, and then write a long-winded story of how this 'car crash unfolded', apparently unaware that they're a tacitly active contributor to it happening and then escalating out of control.

If you're going to be a part of throwing junior devs under the bus, at least have the self-awareness not to brag about it on the internet.


This article speaks to me. So many times I have needed to go back and fix queries that were naively written this way like it was some kind of "optimization". There is no difference in effort between writing a join or doing the ORM-double-round-trip in the vast majority of cases. People are so afraid of doing joins I see people doing subqueries with the id in a subselect because "joins are slow". The worst is usually some kind of pseudo-join and then an aggregate or filtering in the application code. It drives me up the wall when I see it in code review, usually because I get into some argument about "joins are slow" (with no evidence) and then I have to go and rewrite the query and maybe add an index to show that, yes - an aggregate that takes seconds and a ton of memory in the application code can in fact take milliseconds in the database.
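
To make the pattern concrete (invented schema), the fix is usually replacing the pseudo-join plus app-side aggregate:

    -- The pseudo-join version, followed by summing in application code:
    SELECT * FROM orders
    WHERE customer_id IN (SELECT id FROM customers WHERE region = 'EU');

    -- Versus one statement the database can index and optimize:
    SELECT c.id, sum(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE c.region = 'EU'
    GROUP BY c.id;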

The NoSQL people have really done a lot of brain-damage to this industry.

It's so pervasive that I've started using this kind of question in our technical interviews; doing a double round-trip ends the interview for anyone higher than a junior.


Here is some actionable advice for speeding up a join with indexes. You'll need one on each table, covering the join columns. For example:

    SELECT ...
    FROM table_A
    JOIN table_B ON table_A.column_A1 = table_B.column_B1
                AND table_A.column_A2 = table_B.column_B2

You can add indexes like this:

    table_A(column_A1, column_A2)
    table_B(column_B1, column_B2)

If both tables are large enough, this query can probably take advantage of those indexes to perform a merge join.

Postgres tip: you can also add columns to the INCLUDE part of the index to speed up filters in the WHERE conditions. You might even get an index-only scan! Look into covering indexes to learn more about it.
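
Concretely, something like this (Postgres 11+; column_A3 is a made-up example column):

    -- column_A3 rides along in the index leaves, so a query filtering or
    -- selecting it may be answered by an index-only scan, without heap access
    CREATE INDEX idx_a_join ON table_A (column_A1, column_A2)
        INCLUDE (column_A3);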


Not related, but at my job a query just stopped working because the data got too big. We use left joins everywhere because we don't want to lose the main table data. Do you think the trick you just mentioned could optimize our left joins as well?


Most likely, yes. Combined indexes are real cool, because if the database can satisfy a join only from the index, it can usually perform it completely in memory, instead of having to do a table scan/hitting the disk for the values.

Another trick people usually shy away from: temporary tables. It might seem slow and wasteful to create a table (plus indices) just to store results for a fraction of a second, but for very large tables with large indices, creating a smaller copy of the base table (with its own small index) and joining against that can be orders of magnitude faster!

There is even dedicated syntax for that: CREATE TEMPORARY TABLE. They are local to the connection and will get dropped automatically at the end of the sql session.
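
A hedged sketch of the pattern (Postgres flavor; names invented):

    CREATE TEMPORARY TABLE recent_orders AS
        SELECT id, customer_id, total
        FROM   orders
        WHERE  created_at > now() - interval '7 days';

    CREATE INDEX ON recent_orders (customer_id);

    SELECT c.name, sum(r.total)
    FROM   customers c
    JOIN   recent_orders r ON r.customer_id = c.id
    GROUP BY c.name;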

They are also great for storing the results of (non-dependent) subqueries, because for large sets not every database is able to find the proper optimizations - MySQL versions < 8, for example.

I really recommend you try that one. So far I have been able to fix every "query takes too long" problem that resisted other solutions that way.


I suppose so. I would just check if these tables are vacuumed often (preferably with autovacuum) otherwise the query planner might decide not to use the index depending on the visibility map conditions.

This post explains it in more detail: https://blog.pythian.com/postgres-covering-indexes-and-the-v...


Agreed, partial indexes were a game changer for me. After that, Window Functions was like a whole new world of awesomeness opened to me. Nothing has been the same ever since I figured out how to effectively use Window Functions. One of the biggest Eureka! moments of my entire career.


Actually I was talking about covering indexes:

CREATE INDEX ... INCLUDE ...

They can be used to speed up queries that have WHERE clauses, so I see how it might have caused some confusion, since partial indexes have WHERE clauses in their definition.


Interesting, I've never seen INCLUDE before - probably because we use a version of postgres that does not support it - but I see now that it can be useful instead of doing CREATE INDEX idx ON table (colA, colB) because of the lack of ordering in the b-tree leaves. Good stuff. Just another piece of ammunition for upgrading our db to a newer version that has all sorts of new goodies I've been reading about lately.


ok, now i cant resist

> This article speaks to me. So many times I have needed to go back and fix queries that were naively written this way like it was some kind of "optimization"

in some cases, doing joins in the application is more performant than making the database do it. It's usually better to do it with a join, but depending on the data you're joining you might incur significant slowdowns. It's always better to start with the join and only evaluate the application join if there is a need to improve performance, however. Nonetheless, a sweeping statement like yours doesn't help either.


Read my comment again, I said in the vast majority of cases, not "all".


Back in the 90s, I remember getting a project from a local F500 company. Our design team had been doing some work for them and they'd been happy with the results, so when they had problems on a backend project which was over a year behind schedule they asked if we could help & I was pulled in. The project was a fairly straightforward product selector for industrial equipment, but the team from a large consulting firm which had been working on it was struggling with performance & hadn't completed most of the features. The client was saying it was unacceptable that pages would take 5 or more minutes to load and they weren't going to drop $500K on bigger servers like the developers were swearing were necessary to run the site.

I knew something was off performance-wise since the entire product catalog was only on the order of tens of thousands of records. As soon as I looked at the source code, the mystery was explained: they had 3 allegedly experienced developers working on it but none of them knew about SQL WHERE constraints! Instead, they were doing nested for loops to repeatedly retrieve every row of every table and doing the equality checks in VBScript. Finishing the rest of the project backlog took me a couple of days and the customer was quite happy that the slowest pages were now measured in hundreds of milliseconds rather than tens of minutes.

I was proud of how quickly we were able to turn that project around but the PM & I were discussing how even our rush rate wasn't enough to get us anywhere close to the amount of money the previous contractors had charged.


> With the exception of NPM modules, most tools are designed to solve problems, possibly the ones you have

Upvoted just because of the chuckle this gave me.


As someone who’s primarily worked with monoliths, I often wonder how often this exact problem happens, but where A and B are [micro]services owned by two different teams, one is required by company policy to use their APIs not their raw databases, and escalation of each of these issues e.g. query size/rate limiting runs the risk of burning political capital on top of everything else.

How does one JOIN across not just tables but opaque services, in the general case? Or does every team doing microservices silently expect that one day a data team will start querying for a massive number of records-by-ID from every service, and the veterans in each team plan for this load pattern accordingly?


> Or does every team doing microservices silently expect that one day a data team will start querying for a massive number of records-by-ID from every service, and the veterans in each team plan for this load pattern accordingly?

What tends to be by far more common is that each team fails to envision that someone, somewhere, sometime in the not so distant future will want or be required to retrieve more than one "element" at a time via their APIs. And so panic ensues when "other entity" begins feeding 20 API retrievals per second at their "one-at-a-time API" and their performance goes off the cliff it was always sitting near.


Extra points if you manage to do it to your own team’s APIs because of bad design and planning


You mean like 8+ joins to get that single record and then having it bottleneck when you get a few hundred requests a minute? (not bitter at all here)


I think it depends on whether you are talking ad-hoc, i.e. a user analyzing several datasets, or pipelined, i.e. preprocessed joins. For pipelined joins, effectively your data forms a DAG (directed acyclic graph, and yes we are ignoring recursion here). Provided your data services speak the same language, you can create a pipeline off the first pipeline that joins the data and sticks it in a cache (e.g. RDS, Elasticsearch). Changing the underlying data should then trigger a reload of the downstream pipelines. This is basically what Materialize.io, KSQLDB et al. do - a reactive DAG with a database as the cache.

One issue for larger companies is that you don't control the whole DAG, so discovery, security, protocols etc. need to be coordinated by an overarching architecture for this to work.

Something like Apollo (GraphQL) is a simpler solution (in some ways) as you control the joins on the Apollo server which speak to backend (REST) APIs (other teams).


Doesn’t this mean every service must expose a way for downstream consumers (via Apollo or not) to subscribe to updates to allow them to invalidate their caches? I shudder to think how stale the DAG approach would be without this. I suppose this is doable if the company lives on Kafka, but what if it lives on RESTful microservices?


Usually restful services are behind a loadbalancer/reverse proxy. This proxy could also handle things like security, logging, cache invalidation (the service behind the proxy) and service registration (other downstream services).

Marking APIs as immutable/up or down (for rollback), state migrations, etc. is really important from a platform perspective, so it makes sense for APIs to have a cache hook as well.

The proxy can then manage the platform state wrt change propagation.


>Or does every team doing microservices silently expect that one day a data team will start querying for a massive number of records-by-ID from every service, and the veterans in each team plan for this load pattern accordingly?

I assume here by "data team" you mean reporting. Reporting and operations groups are very different with very different needs.

Microservices are useful in operations settings where the flexibility of taking modules out of a monolith and putting the network between them outweighs the performance hit.

Reporting directly from microservices is a recipe for disaster. To support reporting, the microservices need to contribute data to a data lake, data warehouse, or other repository.


IME that antipattern is common in small companies. This is mainly so because it's often the most expedient way to get something "working". A related problem is creating a new RPC for every variation of a query that an external service may require.

One better approach is to ensure each service's db has the data it needs already at query time. For example, each service should ingest events from elsewhere in the system, and accumulate the relevant data for its responsibilities. Joins should always happen in the db.

Another approach is to keep all the data in the same RDBMS. You can slice up the data into different schemas as you see fit. I have had a lot of success with this approach, reuniting databases where people have gone a bit too microservice-wild for their actual circumstances. You can vertically scale an RDBMS to quite a large size before seeking other approaches.


> How does one JOIN across not just tables but opaque services, in the general case?

You (should) never do that. It's as simple as that. If you create microservices that are atomically depending on each other, you are doing something _extremely_ wrong.


Definitely not implying that two services should mutually depend on each other. But a third service C may want to look up in B for every record available from A - say, if B reports reservations, and A reports users who are members of a certain group, and you want the reservations relevant to a specific group. The OP article describes all sorts of pitfalls for the C team if you only had access to B via a "get by IDs" API.


It really depends on how much data you need... I worked in an org where the primary data was in one database and secondary data was in another. The DBA team wrote the query to call across to the other (remote) db in part of the statement, and it was horribly slow... The N+1 pattern combined with memcached on the secondary lookups was so much faster in the end, since it was limited to a display page's worth of secondary lookups. A GraphQL server can do this for you relatively effectively.

In the end it really depends... if you're talking even 10-100k users, a single, well optimized SQL RDBMS is your best bet... getting past that takes deep knowledge and/or more options/skills. In the end, most don't have that next step and the trend to Micro-Service all the things is jumped to too soon in most cases (and not soon enough in others).


A couple of ways. If the need is analytical rather than real-time, you feed the data from multiple services into a separate BI database which can do slower and more complex joins across data from multiple sources. Or, if the need is real-time, you build a paginated API with a page limit that can always be processed within the API SLA. Then you build workflows on top of the paginated API to operate on that data.

Generally, unbounded operations have to be broken up at some point. It just depends on how big the data set is.


>How does one JOIN across not just tables but opaque services

One solution we use is to have an event queue from service A to which service B is subscribed. In the event handler, service B fills its own view table with data from service A. And then it can do joins on data from multiple services because everything is in the same DB. We require services to always emit "created" and "updated" events for its objects.
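
In practice the handler often boils down to an upsert into the local view table; a minimal Postgres-flavored sketch (names invented):

    -- Service B handling a "user created/updated" event from service A
    INSERT INTO user_view (user_id, name, email)
    VALUES ($1, $2, $3)
    ON CONFLICT (user_id) DO UPDATE
        SET name = EXCLUDED.name, email = EXCLUDED.email;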


You have a shared datastore, but not an SQL RDBMS. If you're big enough you build it in-house. See "An oral history of Bank Python", posted here a while back.


That was a fun read, and I loved that little joke with NPM packages.

I find SQL, Regular Expressions, DNS, Client-side caching, CORS, TLS, and a few other things to be a MUST when hiring people, because most of the over-engineered crap can be avoided with a little bit of expertise with these. I spend most of my semi-leisure time with some good Regex books and golfing too.

Modern databases are amazing. Every few months, I take pleasure in (and don't shy away from) refactoring some complex and frequent queries into SQL views, carefully replacing data logic (but not business logic) with stored procedures, and replacing certain batch scripts with one-off queries.


I was recently discussing something that involved knowing whether a string consisted of only a single repeated character. Having spent many years in the trenches with Perl, my first thought was /^(.)\1*$|^$/, which is the kind of thing people dismiss as "line noise" because they haven't spent a few minutes learning a language that can easily express what you want. We have this trend now from people who like languages like Go where answering the question "does this string consist of a single repeated character" begins with "I would now like to reserve space for a 64-bit integer which I shall henceforth refer to as 'i'...", and that's considered a virtue.


To be honest this sounds like the very definition of "you solved the problem with regex, and now you have two problems". In essentially every language in existence the question you're asking can be solved with a simple loop and perhaps 3-4 lines of code. In many languages it's easily expressed as a one liner, i.e. (with C#) `s.All(c => c == s[0])`.

I really don't get the love affair that some devs have with regex. In my 10-year career I don't think I've run into more than a dozen problems in a production system that _required_ regex to solve. When you're working with robust modern languages there's almost always a solution other than regex that's significantly easier to understand and maintain, and probably a lot more performant to boot. Is regex useful for other things, especially CLI stuff like grep and sed? Oh yes, absolutely. But generally speaking I really don't want it in my code base unless there's no other choice.


10 years and a dozen problems? Conversely, I encounter pattern matching problems and use RegEx near daily. Both these perspectives are anecdata; neither is useful.

> and probably a lot more performant to boot

I highly doubt your home-grown pattern matching functions could beat the decades of optimization that have gone into RegEx engines, in anything but the most trivial of patterns (like the one demonstrated here). Creating your own ad-hoc pattern matcher instead of using the ubiquitous one built into your language is like the junior in the article re-implementing JOINs. Sure, you may be able to beat the engine occasionally on particularly simple patterns, but I guarantee you'll lose out overall.

RegEx is not inherently slow, and it is definitely possible to maintain. See industries with serious text processing demands like bioinformatics, where Perl is still used extensively. They could not operate like they do if they shied away from RegEx like many developers seem to.


> Creating your own ad-hoc pattern matcher instead of using the ubiquitous one built into your language

GP was using LINQ, which is a first-class language construct in C#. I'll grant that it may not be _as_ optimized as Perl's regex routines, but it's hardly ad-hoc or slow.


Yes, the linked example falls in the category of "particularly simple patterns" I mentioned. This strategy only works for such simple patterns; add in a single alternation with a common base and the naïve iteration strategy falls apart. Implementing a sensible algorithm to evaluate such a pattern would require much more code than a couple of brackets, would not benefit from RegEx caching, etc.


> I highly doubt your home-grown pattern matching functions could beat the decades of optimization that have gone into RegEx engines

On the contrary, the All() method used here (which is part of the .Net standard library) is literally just a loop that evaluates each item in the collection to verify that they all match the predicate function. It'll be able to check hundreds if not thousands of characters in the time that it takes the regex engine to initialize and parse the pattern.


Interesting how you conveniently ignored the second half of my sentence. I will paste it here:

> in anything but the most trivial of patterns (like the one demonstrated here)


In other words, you're not caching/reusing your regex patterns?

I think I see the cause of one of your recurring performance problems.


>reserve space for a 64-bit integer which I shall henceforth refer to as 'i'

In this scenario, regex processing should allocate more and be slower. The for loop is more optimal even if takes more lines. There's probably some SIMD solution which would be the fastest.


It's probably slower by a constant amount. It also probably allocates a constant amount of memory.

In a compiled language, odds are that the regexp is faster and uses the same amount of memory.


Let's see if I remember Perl regexps: '/': this is a regular expression. '^' at the beginning of a line, '(.)' match any character, and remember it for later. '\1' match the same character that you just remembered, '*' zero or more times. '$' then match the end of the line. '|' Or, '^$' match the beginning and end of the line with nothing in between.


all(c == str[0] for c in str)


I think the only thing I might add is a guardrail for strings that are too long, and might behave really badly with regexp.


s/\./[\\s.]/ iirc. Or s/\/$/&s/ if the engine implements it.


len(set(x)) == 1


>because they haven't spent a few minutes learning a language

Regex is pretty far from an easy-to-learn language and you're going to need more than a few minutes with it. Like, imagine if a standard string library only had functions with a single character name and how awful that would be to use.


Favorite regex books? Only one I've read is Mastering Regular Expressions by Jeffrey Friedl and I loved it. If you have any more recs, I'm all ears.


Excellent book! I was only about 25-30 pages into the first edition when it all finally clicked for me. I've never feared a regex since.


I am curious about the "but not business logic" part. Can you make it more concrete: what are the parts that you chose not to use SQL for?


If the biz logic is about structured data storage and transformation, it belongs in the database.

If the biz logic relates to data validation (data types and data affinity), it belongs in the database (and probably should be checked elsewhere as well).

If it relates to data integrity and correctness (foreign keys, check constraints, uniqueness, cascade behavior, etc.), it belongs in the database.

If it relates to communication with external services (email, queues, file storage, HTML rendering, data compression, scheduling, lookups to 3rd-party APIs, pure computation, etc.), it very much does not belong in the database.

All of these are "business logic". All are important. Use the right tool for the job at hand. Set theory for the databases, general purpose computing for the app layer.


Then I think we have different definitions of business logic?

HTML rendering, email, data compression, ... for me never belonged to the business logic.


I could agree with HTML rendering and compression, but sending an email during user registration or as part of an event notification is 100% business logic.


I've seen systems where people are doing manual JOINs with CSV, JSON, and the results of DynamoDB scans on relatively tiny datasets (<10 megs.) Everything could fit in sqlite on a single machine. Instead, they build a Rube Goldberg contraption that uses "modern cloud architecture."


The aversion to SQL by younger devs is pretty amazing. Yes it has a weird cognitive model and a learning curve but it's a cornerstone of web dev. Instead they resort to convoluted and technically inferior solutions like Prisma just because of a superficial DX advantage.

I'm certain Mongo only became popular because of this even though for many years it was crap.

That said I do think we need a better SQL. It's still not there but EdgeDB looks very promising.


Mongodb is very convenient for CRUD operations, while relational databases need a complex ORM to handle that sanely. Considering how much CRUD a typical application contains, I certainly get the appeal. However, the aggregation framework is horrible and the 16MB document limit is very annoying.

In general I find SQL/relational models easy to understand conceptually, but they map badly to both the rest of the application and the problem domain.

I also hope that edgedb will help with that. When I modeled one of my applications in its SDL it was a very clean match. I don't have much experience with its query language. But so far it looks much nicer than SQL, but still uglier than functional programming.


Those who fail to learn the lessons of SQL are doomed to re-implement it… poorly.

SQL-92 is no longer the baseline. Now we have CTEs, laterals, graph queries, JSON storage, system versioning, and more within a reasonably concise DSL for set theory.
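
For a taste of what's past SQL-92, a recursive CTE; a minimal sketch run through sqlite from Python, though the syntax is close to universal:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    rows = conn.execute("""
        WITH RECURSIVE counter(n) AS (
            SELECT 1
            UNION ALL
            SELECT n + 1 FROM counter WHERE n < 5
        )
        SELECT n FROM counter
    """).fetchall()
    print(rows)  # [(1,), (2,), (3,), (4,), (5,)]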

And even more showing up all the time like function pipelines. https://docs.timescale.com/timescaledb/latest/how-to-guides/...

SQL is the Rodney Dangerfield of programming languages. It don't get no respect!


My main issue is:

1. The ORMs are really not updated to utilize these useful features, largely because they try to support all the databases, and end up supporting really just the ANSI standard set of features, with maybe a few extensions.

2. Without the ORM, you're writing SQL-as-string, and it's as demented and awful as one would expect of writing all your code inside a random string. It doesn't help that SQL features are implemented inconsistently language-wise, so you're basically guaranteed to have at least one syntax error using anything novel (with an utterly useless error message from your favorite SQL compiler), which can only be caught at runtime (because your IDE's DB linting and language support is also limited to mostly ANSI SQL, because they also try to support all the databases).

3. If you make it a stored proc/function/view, the DB IDE tooling is shit across the board. They're all miles behind any decent app-lang IDE in terms of features/tooling they provide. It's actually impressive how pathetic an environment DBA's put up with. You're really not going to get much more support than an autocomplete on table/column names, and maybe datatypes.

So ultimately, the act of writing SQL is terrible compared to doing your normal app-logic. The only reason I'm willing to put up with it is because the positives of using an RDBMS properly dramatically outweighs the negatives. But that tradeoff isn't immediately visible to the novice, so this absurdity of reimplementing SQL with not-SQL becomes reasonable.

The engine is beautiful. The relational algebra -- glorious. The language, tooling and ecosystem? It's all stuck in the 80s.


Agreed


This isn't just a younger dev thing. This attitude was common even twenty years ago.

And for the record, I love both SQL and tools like Prisma. (I actually prefer Postgraphile, but that's a minor distinction without a substantive difference.)


The other problem with the double join is that it's not an atomic operation, so between the two selects the data could have changed.


I really wish more people understood this. Thank you.

ACID is not just a good time on a Saturday night.


I'm not sure I understand how a JOIN would have fixed this problem. That is, if each chunk is fetching 1k rows, and you're doing 50 simultaneous chunks, then you're doing a 50,000 row query, and that's ALSO going to be extremely slow, in terms of exclusive database contention (less of an issue with bigquery) and result set memory usage (definitely still a huge issue for python). In fact, one of my most frequent pieces of feedback to junior engineers who are just working on a larger backend for the first time is "this query tries to fetch too much data at once, it will take too long and use way more memory than it needs to, please use find_each to automatically batch the query so that we balance memory usage and database contention". Indeed, Rails by default will use the exact same batching strategy the junior engineer chose in this case: fetch 1k items, process those items, and move on to the next 1k items. I understand that the author rankles about things not being done the "right way" with JOIN, but I question whether their focus on "best practices" is preventing them from seeing the optimization forest (split things up into parallel background tasks, don't try to keep the entire dataset in memory at once) for the "doing things right" trees (use JOIN).


> then you're doing a 50,000 row query, and that's ALSO going to be extremely slow, in terms of exclusive database contention

In the example the author gives the query cost is very likely dominated by finding matching rows in A. Where there is no index, then we can expect a full scan of A (or the index of A.id) for every batch of B.

This is the case no matter how many rows of B you are searching with; by running the query 50x you make this cost 50x greater. Using a join you pay it once.

In addition, and probably more to the point, the round trip database costs (serialisation, parsing, planning, scheduling, network comms) are going to dominate the actual query costs for something like this (unless A is exceptionally large).

Furthermore, the memory cost to the DB of the serialisation and parsing is likely to be much larger than just storing all those ids in their native format - and there would be no client memory footprint in a join. For the final result set the client can reduce their memory footprint by using a streaming result which every BigData DB supports, and most others too. If you are particularly concerned about client side memory it is best to either: do everything on the database, or manifest a temporary result table and batch out of that.

There are circumstances where the JOIN will be too expensive to do all at once. I've worked with what is claimed to be "BigData" for about 4 years and have had only a few situations like that; but none of them would be amenable to batching like this, and instead needed much more complex architectural steps to make cheaper.


> In the example the author gives the query cost is very likely dominated by finding matching rows in A. Where there is no index, then we can expect a full scan of A (or the index of A.id) for every batch of B.

Why would you expect that there's no index? I have never seen a single database system where the most basic primary key A.id wasn't indexed. Instead, I would expect that you're correct below that the cost of the query is dominated by fetching the rows from disk and serializing them—this is a linear cost that increases with the number of rows returned, so fetching 50,000 rows should be about 50x as slow as fetching 1,000 rows (especially as long as you're fetching them in some sort of block-cache-amenable order, such as in increasing ID order, so that you're seeking to sequential places on the disk most of the time instead of fetching just random blocks)

> In addition, and probably more to the point, the round trip database costs (serialisation, parsing, planning, scheduling, network comms) are going to dominate the actual query costs for something like this (unless A is exceptionally large).

Aside from a small overhead, serialization, parsing and network comms will all increase linearly with the amount of data returned. 50,000 rows of data will be about 50x the serialization and network cost of 1,000 rows.

> Furthermore, the memory cost to the DB of the serialisation and parsing is likely to be much larger than just storing all those ids in their native format - and there would be no client memory footprint in a join. For the final result set the client can reduce their memory footprint by using a streaming result which every BigData DB supports, and most others too

Sure, I can absolutely agree that using a streaming result set would be the best of all possible worlds here. However, it does require you to keep a client connection open for 50x longer than batching would, which on many databases (e.g. Postgres), would lead to more memory usage and CPU contention than batching the result in a background job queueing system. This comes down to what % of your total pipeline is spent in the database in question compared to data processing or other databases—if only 20% of your job's runtime is fetching the rows from this database, then it's a bad idea to monopolize that DB memory for the much larger amount of time it takes you to process the entire result set, when instead you could be yielding that memory back to the system for other transactions to use. But if 80%+ of your time is spent in the database, then the small amount of time that other transactions would be able to reclaim wouldn't be worth the amount of fixed overhead from re-planning, re-executing, re-fetching the index from cache, etc. And obviously these tradeoffs are workload-dependent: as you may have been able to guess, my experience here is rooted in OLTP workloads using Postgres, and I'm sure there are plenty of differences with BigQuery's architecture.


On the off chance you see this:

> Why would you expect that there's no index?

I don't. Whilst I didn't write it particularly eloquently, I included that it would be an index-scan if there was one. And like the full table scan, this is a 50 vs 1 cost (unless the querying ids are well sorted, at which point you'd maybe get a 5 vs 1 cost at best).

> I have never seen a single database system where the most basic primary key A.id wasn't indexed.

Nobody has said it was a primary key. In fact it very likely isn't. All we really know is that there were approx 50k rows selected from B; we do not know how many are matched in A.

As they are using BigQuery and the queries are taking such a long time, it would be reasonable to assume A is some large dataset clustered around some other value (e.g. timestamp). But that itself would be an assumption.

> [streaming] does require you to keep a client connection open for 50x longer than batching would,

It does not. It will be less time.

---

Looking at pg.

I'm trying really hard to see your point. As far as I can tell, you're bothered by the working set memory of the query caused by the join exceeding a limit and causing contention - this is the only time the streamed join is worse than the batching. On an index join this would have to be a very large table.

As for CPU contention - it's a non-issue.

There may be a point related to time-to-execute with respect to lock contention.

Regardless, if either lock or index memory contention is a problem for you, then you will still want to `JOIN` - just against a subquery/CTE with limit and offset.

Roundtripping is not the answer!


Do you know what's slower than fetching 50000 rows with a big join? Fetching them with the overhead of thousands of tiny queries instead of one, and repeating disk reads thousands of times because you cannot consolidate the thousands of query executions.


Definitely don't underestimate a good database's ability to stream large queries. One of the many ways an ORM-centric view of the world can mess you up, since ORMs have a constitutional bias towards instantiating the entire result of the query in memory. (They don't have to, it isn't completely impossible for them to stream, but even if your ORM can stream it probably doesn't take much to convince it not to, even perhaps accidentally.) It is generally better all the way around to send a single query that is everything you want, if at all possible, let the database do its thing, and then spew a stream of all the results you want at you as fast as the network can carry it, than to be sitting there constantly harassing the poor thing with tiny query after tiny query, adding latency every step of the way. Match it with code that can consume the result as a stream and you can do a lot of work without using a lot of simultaneous resources.

This does break down eventually but I feel this is another one of the several places where developers still sometimes subconsciously have an early-2000s view of the world, as if all relational databases start panting and sweating if you ask them to return more than a couple hundred rows of any kind. No, set them up with the right indexes and foreign keys and they'll happily stream gigabytes at you, without the CPU even hardly doing anything. It's just as likely to be the consuming code that is the bottleneck!

You get up to "big data" and this approach stops working but what constitutes "big data" has also gotten a lot bigger since the early 2000s. Even in the engineering-centric company I work for, a lot of engineers & management assume that things are "big data" way before they should.


The applicability of this advice definitely depends on the database, drivers, and query.

For example with PostgreSQL you need to create a cursor, then FETCH NEXT 1000 over and over again in a loop. This is a bit of a pain, but is the difference between processing as data arrives, with only small buffers everywhere, versus waiting for all data to arrive before doing anything.

What exactly you need to do and how to make it work is very much database specific.
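
With psycopg2, for example, a named cursor turns into exactly that DECLARE/FETCH loop on the server side (a sketch: the DSN, table, and handler are placeholders):

    import psycopg2

    def handle(row):                # stand-in for real per-row processing
        print(row)

    conn = psycopg2.connect("dbname=example")      # placeholder DSN
    with conn:
        with conn.cursor(name='streamer') as cur:  # named => server-side cursor
            cur.itersize = 1000                    # rows per FETCH round trip
            cur.execute("SELECT * FROM big_table ORDER BY id")  # hypothetical table
            for row in cur:         # processing starts as batches arrive
                handle(row)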


Yeah, I wish this was more standard. SQL is not so much a standard as a skeleton of a standard. Better than nothing, maybe, but still, every database I walk up to, I pretty quickly hit issues like this.

I'm not trying to promise that every database will stream a petabyte without a problem; I'm more trying to help people get out of an early 2000s mindset and if nothing else, check what their DB will do. A lot of old programmer's tales about how to baby old databases along are actively pessimal and unnecessary in 2022/almost 2023. Don't spend days writing code to correctly slice and dice a query into tiny pieces when you could just send it in one shot and get better performance in every way.


Thousands? The article said explicitly that there were only 50 queries (in fact, it said they only hit the 50 query limit after the job was taking "several hours" to complete and parallel queries were introduced). That's two orders of magnitude below "thousands".


Two clear inefficiencies in the junior's code:

1. It is pushing the entire id table back and forth through the network connection, bit by bit. Replacing with a join completely eliminates this.

2. A query with an IN clause is (probably) doing a hash join under the hood to calculate the result of the IN clause. So the junior's code is effectively submitting a join query over and over, each time with a slightly different tiny chunk of data, rather than asking for the joined data once and processing the result in batches.

It is also worth considering if the entire data processing pipeline can be in SQL, but I can't tell if that's the case from the blog post.
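
The contrast in miniature (sqlite in-memory, schema invented for illustration):

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.executescript("""
        CREATE TABLE a (id INTEGER PRIMARY KEY, payload TEXT);
        CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER REFERENCES a(id));
        INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
        INSERT INTO b (a_id) VALUES (1), (3);
    """)

    # The manual version: pull the ids out, then ship them straight back in.
    ids = [r[0] for r in conn.execute("SELECT a_id FROM b")]
    marks = ','.join('?' * len(ids))
    print(conn.execute(f"SELECT * FROM a WHERE id IN ({marks})", ids).fetchall())

    # The join: same result, one statement, no id round-tripping.
    print(conn.execute("SELECT a.* FROM a JOIN b ON b.a_id = a.id").fetchall())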


Without a join, you are sending the entirety of Table B over the wire and processing it in Python. Then you send that back to the database, inline in a query, to do an ad-hoc join on Table A. The results are then sent back over the wire to the Python side. Notably, the result set may be small, and it may always be small. Table B may grow very large, and all of it will always be sent over the wire and processed in Python.

With a join, the database is able to do the join on Table A and Table B in place, using whatever indexes it already has built up. The only thing sent over and processed by the Python is the result set. Even if Table B becomes very large, only the result set is sent over the wire and processed in Python.

Without a join in the query, you're essentially having to replicate the kind of logic that already exists in the database engine, in Python. That is, the database engine is already doing chunking and parallelization for you.


Sure, I mean, I want to be clear—there's certainly a lot of inefficiency here. But my point is that the OP is talking about their data processing pipeline taking HOURS to handle queries for 50,000 IDs (50 parallel queries—1,000 IDs per query). When it comes down to it, I just don't believe that the inefficiency in serializing 1,000 numbers in python and then deserializing them in BigQuery has anything to do with the problems OP was experiencing. Remember—OP is probably going to be getting back 50,000 rows from the database anyway, the additional work here is linear with respect to the amount of rows retrieved. Sending 50,000 numbers over the wire to get back 50,000 rows worth of data is not an "hours long" processing cost, in the scheme of things.


> I'm not sure I understand how a JOIN would have fixed this problem

If the set of things in A that have ids in B is very small, then very little data is returned from the JOIN query, while a lot may be returned from the B query by itself.

(if that set of things is large, then you'll still want to batch the joined query as well, e.g. using find_each in rails. They're orthogonal requirements)


Sure, I mentioned this in my comment as well, but I think it's clear from the use-case of the article that the cardinalities of the two tables are approximately the same.


Fetching 50k rows is a no-brainer, both speed- and memory-wise. Relational DBs are super fast when used properly.


A join would have fixed the problem by pushing the logic inside of the database where it would have been optimized more easily. It would have traversed the indexes in parallel. Once.

But they didn't. Instead they made the database parse every single record, compile it, and then try to optimize it. Which will come up with a plan where you had to do index lookup after index lookup. That parsing and optimization overhead is probably most of your time. But even ignoring that, a single scan for `n` things in an index with `m` things winds up taking an average time `O(n log(m/n))`. Which is generally faster than the `O(n log(m))` of separate lookups.

This change saves a tremendous amount of work on the database, and therefore reduces contention for resources. That's database 101, and any competent DBA should be able to give you the lecture. As a programmer you might not understand how much of a difference it makes. But trust me, it does.

Now about data quantity. You're giving cargo cult advice on queries that is only sometimes going to be right. What is the actual tradeoff for find_each?

The one win is that you return limited data on each trip. 50,000 records really isn't that much these days, so I discount the win. But it can matter, particularly for memory constrained containers.

But what is happening inside the database if you fetch 50,000 records from a join, in batches of a thousand? As I understand, it uses limit and offset statements to figure out the result. But how does that work?

First, it calculates the join to find 1000 records and returns them.

Second, it calculates the join to find 2000 records, throws 1000 away, and returns the rest.

Third, it calculates the join to find 3000 records, throws 2000 away, and returns the rest.

And so on until it has found a full 1,275,000 records, of which it has thrown away 1,225,000 and returned 50,000. Guess what this means for total database work required? And the behavior is fundamentally quadratic. If you have 10x the data to process, your database has to do 100x the work.

There are definitely a lot of use cases where you need to batch records. But your batch size should be as large as you can comfortably use. And you need to realize that you're trading up-front memory for time and more work inside of the database.

The next time that you find yourself having to go down the "optimization tree", I strongly recommend considering whether you're in fact trying to put a patch on a self-inflicted wound. Try proper joins, indexes, and a larger batch size first. See how much of a difference that makes.

Alternately take advantage of the fact that you know your tables in a way that Rails doesn't. Order the results by primary key. Every time you fetch a batch, record the largest primary key you returned. Then instead of offset/limit on the next query, use a limit and a condition on the key. This will eliminate almost all of the duplicate work in most situations, at the cost of having somewhat more fragile logic.
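
That last idea (keyset pagination) in sketch form, with an invented table:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO events (body) VALUES (?)",
                     [(f"event {i}",) for i in range(10_000)])

    # Filter on the last-seen key instead of OFFSET: every batch is an index
    # seek, so there's no re-scanning of the rows already skipped.
    last_id, batch_size = 0, 1000
    while True:
        rows = conn.execute(
            "SELECT id, body FROM events WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            break
        last_id = rows[-1][0]   # largest key returned this batch
        # ... process rows ...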


> But they didn't. Instead they made the database parse every single record, compile it, and then try to optimize it. Which will come up with a plan where you had to do index lookup after index lookup. That parsing and optimization overhead is probably most of your time.

See https://news.ycombinator.com/item?id=34095480 for a more detailed discussion of where time is actually spent here, I think this short explanation glosses over a lot of important issues.

> As I understand, it uses limit and offset statements to figure out the result

You are incorrect. Your entire comment is based on a faulty premise. find_each uses an ordered primary ID column which can be queried efficiently using indexes.


Huh, I googled for how it worked and found a limit/offset explanation. Then I wandered around the source and verified what you said. I'm not a Rails programmer, so I did get that wrong. (Though I've seen that exact mistake over and over again when people are using microservices. So it is worth being aware of that in your APIs.)

But that said, your "detailed discussion" is going to be wrong for most databases that I've worked with. MySQL makes queries cheap. But PostgreSQL, Oracle, and so on make parsing expensive. Having to parse and try to optimize a good chunk of a MB of SQL is almost certainly more expensive than 50,000 individual index lookups. (The tradeoff is that the other databases are likely to produce better execution plans if you run the same query over and over again.)


Yes I remember inheriting a project where in a similar fashion people were allergic to join. So we got js code selecting entire table, looping over the rows and then doing inner loops with further selects. I eventually had to switch everything around to using joins. There seemed to be a huge disdain for SQL.. like if you ever endeavoured to try some raw SQL you were playing with matches. Sure OK but code that is handling db operations that inefficiently is 100x worse tho...


I tend to push for the other direction, especially if using JS/TS... template literals are so useful here...

    const foo: MyType[] = await db.query`
      SELECT ...
      FROM ...
      WHERE bar = ${baz}
    `;
And simply understanding how the queries work... very similar with Dapper in C#... I'm kind of all out against ORMs at this point.


Uhhhh, you should be extremely careful with string interpolation around DB statements. The code sample you posted is pretty much a textbook case of a SQL injection vulnerability if the value of ${baz} is ever provided by a user.


No, it isn't... db.query method receives the parameters separately from the string parts and will turn it into a parameterized query. You're confusing/conflating db.query`...` with db.query(``);

https://www.javascript.christmas/2020/11


Ah, sorry! I haven't used templates much and an interpolation inside a string triggered my "oh shit security problem" sense.


Can understand that... in general, have fought vigorously against ORMs (and DI/IoC tooling) in JavaScript, and use Dapper in C# with similar interfaces.

The template methods have allowed for some really powerful adaptations. Mostly in Database/SQL, XML/HTML, and JSS interpreters.


I sling a lot of SQL, and, mirroring a lot of peoples sentiment here, wish it had better syntax and composability.

DuckDB and Apache spark expose nice apis that almost completely remove the need to faff around with textual strings. Each projection returns a view that can be treated like another table, so composition and reuse is simple.. It would be nice if such a thing were more standard and available on the other dbms that I have to work with.
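
For example, with DuckDB's Python relational API (a small sketch):

    import duckdb

    # Each step returns a relation that composes like another table;
    # no string splicing required.
    events = duckdb.sql("SELECT range AS n FROM range(100)")
    evens  = events.filter("n % 2 = 0")        # still a lazy relation
    top    = evens.order("n DESC").limit(5)
    print(top.fetchall())  # [(98,), (96,), (94,), (92,), (90,)]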

I feel like, in the continuum of abstraction, SQL is like opengl 3.. high level and a bit inflexible. Taking the analogy further, an ORM would be like the game engine on top of opengl.. What doesn't exist, as far as I know, is the Vulkan equivalent: a low-level API that exposes the relational algebra and exactly how to execute it. There are cases where I would have saved a lot of effort if I could just write the damned physical plan for a query execution myself rather than rearranging table join orders and sending hints that the query optimizer is just going to passive-aggressively ignore anyway.


Don’t TVFs accomplish the level of composability you’re describing? They give exactly a “view that can be treated like another table”.

https://cloud.google.com/bigquery/docs/reference/standard-sq...


An SQL query goes into a bar, walks up to two tables and asks: May I join you?


I would call it “for want of reasonable hiring and onboarding processes”. How does someone get into a data engineering job without any knowledge of SQL and doesn’t even get basic onsite training?


Largely due to the fact that "Data Engineer" is defined differently at every company. Some want an SRE, some want a Database Architect, some want a software engineer that knows some SQL, others want only SQL junkies.

As a result, I have picked up a variety of skills to fit whatever my company dictated a Data Engineer should handle.


"If you encounter an unusually round system limit, you’re probably using the system in a way its designers never imagined."

Haha, so true. We triggered a static code analyzer error "Cyclomatic Complexity bigger than 1.000.000.000!". The vendor was very interested in that code snippet (generated classifier code) and we shared a good laugh.


We hired a data engineering consulting company and none of their team of SQL experts had heard of upsert or merge. I find it weird that people don't spend a bit of time searching for a better way of doing stuff before just jumping into a long, hard way of doing things.
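
For reference, an upsert in one statement (the ON CONFLICT spelling shown works in SQLite 3.24+ and Postgres; SQL Server and Oracle spell the same idea MERGE):

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

    upsert = """
        INSERT INTO kv (k, v) VALUES (?, ?)
        ON CONFLICT (k) DO UPDATE SET v = excluded.v
    """
    conn.execute(upsert, ('greeting', 'hello'))
    conn.execute(upsert, ('greeting', 'hello again'))  # updates instead of failing
    print(conn.execute("SELECT * FROM kv").fetchall())  # [('greeting', 'hello again')]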


Unfortunately many data engineers don't know basic stuff about SQL and databases, but are experts in ETL tools and data warehouses where such features are not relevant or don't exist. You were probably looking for a DBA, which is something different.


"Don’t let junior SWEs get 2000 lines into a change before submitting a pull request."

This is good advice. Share your code early and often so you can get feedback before you're fully committed to one approach.


Yeah, this is insane to me. 2000 lines written well in an expressive programming language is a small but complete library. It's 15 to 30 code files. 2000 lines is just about enough to write a complete spelling and word-use checker with a persistence layer. It's enough to write a generic mass-balance process simulator, or a web security library, or a moderately complex workflow-management library, or a toy game. 2000 lines is a week or two of work. If your junior SWE is writing 2000 lines of code before anyone looks at it, you're basically letting them be raised by wolves.


In the spirit of HN I should confess I’ve done exactly this, not the using Python to feed data back to SQL, but writing terrible hacks to get around resource limits on deadlines.

On a recent project I needed to process a couple years of data for a hard deadline of Monday, and it was Friday. Our DB had a query timeout and a resource memory limit which blocked doing the full analysis without building new data models, and shipping those would take days. The deadline couldn't be moved, so hacks were needed.

The solution: write some Python code to generate one query per week of data going back two years (over 100 queries), save the results to individual scratch tables, and then use a second query to union all the results together in our BI tool.
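
Roughly this shape (every identifier here is hypothetical):

    from datetime import date, timedelta

    # Fan one oversized query out into per-week scratch tables; the BI tool
    # then UNIONs scratch.results_* back together.
    week, end = date(2020, 12, 7), date(2022, 12, 5)
    statements = []
    while week < end:
        nxt = week + timedelta(days=7)
        statements.append(f"""
            CREATE TABLE scratch.results_{week:%Y_%m_%d} AS
            SELECT user_id, COUNT(*) AS n
            FROM warehouse.events
            WHERE ts >= '{week}' AND ts < '{nxt}'
            GROUP BY user_id
        """)
        week = nxt
    # each statement now fits under the query timeout and memory limits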

Of course the first time I ran it serially it was too slow, so I parallelized it. That was too many queries so I added a limit. Then one query failure broke the whole thing so I added retries… by the end of the day it looked exactly like this article.

It worked though! I got all the data we needed processed for Monday, I presented it to our execs and our project was approved. We only needed to manually run that script once more before I built the real solution and deleted the script.


This reminds me of some log parsing code I wrote that had to batch download sets of logs from our log provider, and then used some gnu parallel, jq, and some general shell tools to spit things out (may have thrown the data into some format and used textql for the end of it.)

(I can't recall why the general log searching tools we had didn't work in this situation, I think it was because I needed to get data from a lot of disparate logs at once, or it was driven by having to make lots of separate downloads of the logs.)


I'm gonna be the contrarian and say this is mostly fine. We can research the proper way to do things, or use code review to teach about the proper ways. But this can lead to code shaming and a fearful environment where people second guess themselves and spend a lot of time chasing a perfection that doesn't move the business metrics.

In this case, doing the join manually isn't a huge deal, chunking isn't a huge deal, parallel requests isn't a huge deal. But "concurrent limit reached" is the point in this story where Bob should have put on the thinking cap and reasoned that "this shouldn't be hard, other people do things like this with bigger datasets all the time, I wonder how". Before that point it's literally just a matter of changing a couple lines to solve the issue. So what? After that point however, it's starting to affect the overall design around it in harmful ways, and turning the issue into a bigger one.


And that's how we end up with chat apps that consume gigabytes of RAM, and blinking cursors using one whole CPU core.


That just means he did the parallelization incorrectly, there should be a fixed maximum number of concurrent requests.


I got a chuckle out of "With the exception of NPM modules, most tools are designed to solve problems, possibly the ones you have," but have to agree with some other commenters that a better title for this would have been "for want of a mature code review process"


I have to wonder if the code review process wouldn't be as big a deal if they hired a proper DBA. This problem would not have been an issue if the DBA had been told "we need to do X," and the DBA would craft a stored procedure for X.

I get that stored procedures aren't a cure-all, and sure, they can get out of hand, but doing this stuff in code is often worse than letting the DB do its job.


This is more common than you would believe. Other issues I've seen: no `limit` on the query, fetching all the results and then sorting in your own app code, using the wrong joins. Many of these happen while using ORMs as well. SQL is a context switch for most devs and very few understand it, and even those that do might not be familiar with the capabilities of your startup's db choice.

Shameless plug but this was my motivation behind building GraphJin, a GraphQL to SQL compiler, and it's my single go-to force multiplier for most projects. https://github.com/dosco/graphjin


The principle is that you should be using the underlying API if something is already solved on a lower level, and not replicate the functionality on a higher level, because it will perform poorly. There was a reason why the lower level API exists in the first place.

This applies to graphics programming very well: it goes without saying that you wouldn't write your own pixel rasterizer instead of using DX, OpenGL or Vulkan, for example.

The big recognition is that when doing business apps, SQL database functionality is the underlying API, and you should prefer using that.


> The principle is that you should be using the underlying API if something is already solved on a lower level

I think you're making a great point but I want to consider what this suggests about ORMs. Using an ORM means you're not directly using the underlying API. In theory an ORM should be a very small "distance" from the underlying API. When that's the case, they are a no-brainer. But no ORM has 100% feature parity and for more complex queries this distance from the underlying API can grow considerably. And if you insist on ONLY using the ORM API then you're going to find yourself doing some pretty dumb shit in the application layer.

Personally I think ORMs are great, but there is this common problem of over-insisting on their API and treating raw SQL as the devil.


Back in the days of MySQL with the MyISAM engine, it was sometimes way faster on big datasets and underpowered DB servers to do the query exactly this way. Even with all the correct indices in place, the JOINs (especially if more than one table was in play) would often just freeze the server for 15-20 minutes, while joining data at the app level in a for loop with lookup tables for ids would typically take only a few seconds. Obviously this has been an obsolete hack for a long time now...


I can imagine that, but MySQL didn't really start to become a competitive relational database until the 3.23 timeframe (around 2000), and it is hard to imagine MyISAM (with no actual transaction support) being used for a production database except under very carefully controlled circumstances.

To some degree that was true of a lot of earlier competing databases as well, which tended to take escalating locks on everything from the page level on up just to implement basic read consistency. So any transaction of any type could easily lock up a random set of unrelated rows if not entire tables until completion.


My key takeaway here is that not spending an hour reviewing code probably cost man-days worth of work.

The technical capabilities are all there on the team, from description. What was probably missing is someone both technical and assertive, who could politely say to the deadline setters "This is fucking stupid and it's not going to work".


> Don’t let junior SWEs get 2000 lines into a change before submitting a pull request.

Why junior SWEs and not all SWEs?


This doesn't really add up to me.

The article explains how the original bad code gets checked in which seems plausible enough.

But that doesn't explain why the first fix wasn't to just start using a JOIN? Or the second fix.

I guess it's a made up story, to make a point? Anyway, I found the plot holes distracting.


This is an incredibly common thing. The worst I've seen it is when people drop their SQL DB for a No-SQL thing (for no good reason) and then end up implementing all the joins they lost in the application :(


Ragequitting a company and calling it a "trainwreck" because one developer didn't know about JOIN seems... extreme.


That's an excellent description of Contagion caused by tech debt.

How much does this problem grow and spread the longer it goes unfixed?


Wait until he hears about LEFT JOIN


We've all been Bob at least once in our life. By 'we' I really mean me.


Shipping it the first time may have made sense. Second time? Not so much.


how is CSV not "wrap the fields in double quotes and join them with commas and newlines"?


Escaping.


isn't that covered by wrapping in double quotes?


Not if the content contains a double quote


well it's not really wrapping it if it's not escaped.

Regardless it is all trivial. Not sure what the point of that comment was.
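
For the record, the rule being danced around: RFC 4180 doubles embedded quotes, which the naive wrap-and-join misses. Python's csv module applies it for you:

    import csv, io

    field = 'she said "hi", then left'
    naive = '"' + field + '",plain'   # embedded quote makes this ambiguous

    buf = io.StringIO()
    csv.writer(buf).writerow([field, 'plain'])
    print(buf.getvalue())  # "she said ""hi"", then left",plain  (quote doubled)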



