For Want of a JOIN (moderndescartes.com)
268 points by subset on Dec 22, 2022 | 316 comments


It is one thing when a junior does this because they haven't learned better.

It's quite another when experienced seniors ban the use of SQL features because it's not "modern" or there is an architectural principle to ban "business logic" in SQL.

In our team we use SQL quite heavily: Process millions of input events, sum them together, produce some output events, repeat -- perfect cases for pushing compute to where the data is, instead of writing a loop in a backend that fetches events and updates projections.

Almost every time we interact with other programmers or architects it's an uphill battle to explain this -- "why can't you just put your millions of events into a service bus and write some backend to react to them to update your aggregate?". Yes we CAN do that, but why would we when it's 15 lines of SQL and 5 seconds of compute -- instead of a new microservice or whatever and some minutes of compute.
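Concretely, the kind of statement I mean is something like this rough sketch (Postgres-flavored; the events/daily_totals tables and all column names are invented, and it assumes a unique index on (aggregate_id, day)):

    -- fold a batch of raw events into a projection in one set-based statement,
    -- instead of fetching them into a backend loop
    INSERT INTO daily_totals AS d (aggregate_id, day, total)
    SELECT e.aggregate_id,
           date_trunc('day', e.event_time) AS day,
           sum(e.amount)
    FROM events e
    WHERE e.event_time >= now() - interval '1 day'
    GROUP BY e.aggregate_id, date_trunc('day', e.event_time)
    ON CONFLICT (aggregate_id, day)
    DO UPDATE SET total = d.total + EXCLUDED.total;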

People bend over backwards and basically re-implement what the database does for you in their service mesh.

And with events and business logic in SQL we can do simulations, debugging, inspect state at every point with very low effort and without relying on getting logging right in our services (because you know -- doing JOIN in SQL is not modern, but pushing the data to your service logs and joining those to do some debugging is just fine...)

I think a lot of blame is with the database vendors. They only targeted some domains and not others, so writing SQL is something of an acquired taste. I wish there was a modern language that compiled to SQL (like PRQL, but with data mutation).


This debate has been happening forever and really the fundamentals haven't changed. It doesn't matter if you pull data from the DB to an old middle tier or to a nice modern microservice architecture, you're almost always losing the performance game at that point.

The database already has most of your data cached in memory, it already built statistics on the best methods to use to join the data, and the data is always local to the DB.

Reading lots of data from a DB to do the same operation in a microservice means you incur the cost of data retrieval, the memory for a copy of the dataset (plus enough to do a join), the network transfer time, and then you're giving up all the indexing and statistics that a database provides.

There is almost never a reason for this unless you need to do some kind of data analysis that isn't supported by the DB. Maybe most of all, this copy-the-data-somewhere-else-to-do-a-thing approach is never going to scale well.

Don't mind me, I'm just a retired former DB nerd.


> the fundamentals haven't changed

I believe they have, because we've gotten so much better at ORMs. ORMs get a bad rap because people use them terribly (and it's largely the ORM's fault because they encourage their own terrible use). But they are a true "change in fundamentals" that can finally end this debate.

Business logic doesn't belong in the database layer, because the database layer doesn't support the level of abstraction, composability, testing, and type safety that modern programming languages afford. However, that does not mean that business logic doesn't belong in SQL. It just means you have to treat SQL as an output of your actual business layer.

A good use of an ORM looks like a metaprogramming environment for conveniently building syntax trees that get converted into intelligent SQL. You know it's working well if the SQL looks somewhat like you'd write yourself and you can build one SQL statement with multiple layers that are abstracted from each other (in C#, think about passing IQueryables around). The structures are parsed into SQL and executed very explicitly only at the end of the chain, do a lot of work, and never produce SELECT N+1s. A good ORM user is thinking in SQL but writing in C# (or whatever your business layer is in).

A bad use of an ORM is trying to pretend like SQL doesn't exist, or is too scary for regular programmers to think about. It has SELECT N+1s everywhere. A bad ORM user is thinking in C# and hoping the database will roughly do the correct thing.


> I believe they have, because we've gotten so much better at ORMs.

This has almost nothing to do with what the parent was describing. Poor ORM use (or poor ORMs) introduce a different set of ways to mess up performance.

[EDIT] OK, this isn't entirely fair, or at least I didn't explain it well enough (no, it's not getting downvoted, I just decided I'm not happy with it). The problem in question is, at its heart, developers not realizing what they should be letting the database do, perhaps not even realizing what it could do, or deciding they shouldn't let the database do it for some probably-misguided purity reasons or whatever. ORMs generating more-efficient queries or exposing more features is great, and it does help with the problem of straightforward, natural use of ORMs sometimes resulting in poorly-optimized queries. But it doesn't fix the problem of developers not knowing that a block of logic in [programming language of their application] should have been left to the database instead, whether that's achieved by hand-writing some SQL, writing a stored procedure, or directing the ORM to do it. It's a little related, in that hopefully better ORMs will result in ORM-dependent developers learning more about what their database could be doing for them, or being more willing to poke around and experiment with the ORM since it's more pleasant to use, but I'd expect that effect to be pretty marginal.


A problem that many ORMs have is thinking/working with row-expressions rather than sets. If everything is expressed in the plural -- sometimes with zero or one elements, other times many -- then the N+1s mostly go away. The same goes for many interfaces/APIs that have single and multiple forms: only make the multiple forms and have callers call them with [single].

Related pet peeve: calling tables by plural names. The table (or any relation) should be named for the set of X rather than thinking of it as Xs.


> A problem that many ORMs have is thinking/working with row-expressions rather than sets.

Absolutely. There's a reason that the famous Out Of The Tar Pit[1] paper identifies relational algebra as the solution to many programming woes. Thinking in sets instead of individual items is extremely powerful, when using an ORM and also in general. If an ORM makes this hard, use a different one (hopefully there is a better option).

> Related pet peeve: calling tables by plural names

Agreed again! Based on my limited observations, this seems like a big cultural difference between "database people" and "software people". The people who spend most of their time working directly in databases (and trying to basically write fully fledged business applications entirely in the database layer) seem to think of tables as big containers. If you labeled a box full of people (or a binder full of women?), you'd probably label it "People". Whereas "software people" tend to think of a database table as a definition of something, more like a class or type. Clearly the correct label for that definition is "Person".

Actually the latter is also wrong for a different reason: that is not how nouns work. But that's a story too long to fit in this comment.

[1] Out of the tar pit (warning: direct link to 66-page PDF): https://curtclifton.net/papers/MoseleyMarks06a.pdf


It's the other way around - "software people" tend to use plurals for table names, because it then maps nicely to plurals in property names in their object layer, where it's already the established convention.

The "database people" OTOH tend to use singular, which goes at least as far back as C.J. Date. I'm not sure why, but perhaps it's because fully qualified field names read more natural.


Actually, all the old relational texts talk about select * from employees (plural) because they didn’t have aliases for table names in queries.

I agree that when you use aliases for multiple tables with (Oracle-style, non-standard SQL) joins it looks prettier, but you can also do that with plural table names.

Select employee.name, manager.name from employee, manager where employee.manager_id = manager.id

But it doesn’t matter that much when you use the traditional join syntax:

Select employees.name, managers.name from employees left outer join managers on employees.manager_id = managers.id

Still I prefer singular table names, just disagree that it was data people who preferred singular and why.

Of course, with aliases you can return the exact column names you want:

Select e.name as employee, m.name as manager from employees e, managers m where e.manager_id = m.id


It is either a bunch of apples, in which case you would call it apples, or it is the box that holds apples, in which case you would call it apple, as in apple_box.

Got to admit I am pretty firmly in the "it is the box that holds apples" camp, so none of my array, table, or dictionary names are plural. Sometimes I comfort myself by saying "well, apple[5] just sounds better", but that is really up for debate.


But a set is plural. We say "the set of things", not "the set of thing".


I guess it depends on the usage that's familiar to you. I would say "set of Real numbers" and call the set Real (like a class), not Reals like a collection of instances.


I agree with your pet peeve. ActiveRecord (Rails ORM) encourages by convention that your tables are named in the plural, and it sometimes drives me crazy. When writing queries outside the ORM, I end up aliasing tables/relations to the singular so the query is more sane to reason about, e.g.:

> SELECT person.id, person.name, person.age ... FROM people person JOIN ...

Which reads so much nicer to me than:

> SELECT people.id, people.name, people.age ... FROM people JOIN ...

(obviously a somewhat contrived example because of the people/person inflection that makes it awkward already)


> A good ORM user is thinking in SQL but writing in C# (or whatever your business layer is in).

That doesn't sound like metaprogramming; it sounds like insanity brought about by bureaucratic limitations on language choice.


It should be "thinking in relational algebra but writing in C#". Arguably, LINQ with C# represents relational algebra better than SQL does.


But not more efficiently.

Working with small datasets in memory can be fast, but larger datasets consume more memory and CPU.

The purpose of a database is to efficiently store and retrieve data from disk — and limiting it to only the data you need.

Most database interactions are also over a network, which is always slower than (and in addition to) disk retrieval. Data should not cross a network except when you cannot fit it on disk (or work with it in memory) on the same system.

Services should not be separated except for the same reasons (exceeding the computation or memory of a single system), because separation incurs the same costs (network and disk latency).

We used to perform billions of complex computations in seconds reading from slow disks and slow systems with much smaller memory footprints on single systems with single cpus.

This article is a parable for the consequences of not understanding that.

The Google and Amazon and Netflix etc white papers are about organizations that build solutions for applications that cannot fit on disk or in memory or be handled by the computation of a single system — and do not apply to 99% of application architectures that use them, including those developed by Google, Amazon, etc.


Consider a function that takes an IQueryable<T> (for any T) and connects it with your change tracking logic to return an IQueryable<Tracked<T>>, or connects it with your comments system to return an IQueryable<Commented<T>>. I can take a query of (nearly) arbitrary complexity represented by an IQueryable<ComplexModel> and turn it into an IQueryable<Tracked<Commented<ComplexModel>>> in one line, while still keeping the result as a single SQL statement. If you hate that type, note that it's almost never explicitly written out like that (thank you, 'var' keyword).

On the other hand, your databases have a "Comments" field in every table; a "LastModifiedBy" in every table. Your database tables grossly violate the single responsibility principle: every single cross-cutting concern is represented in every single one of your "primary" tables. (Level up: every concern is a cross-cutting concern.) Your databases have association tables between X and Y for every primary data type X and every cross-cutting concern Y, leading to a combinatorial explosion of redundant tables. Your SQL queries/views/procedures/triggers are repetitive and full of boilerplate. If your client told you they needed you to change how change tracking is done in your system, you'd have to touch nearly every single SQL module in your entire system.

For me, I need to change the implementation of IChangeTrackingSystem, and that's literally all. All of my other queries don't have to know about the change, because they're simply composed together with whichever IChangeTrackingSystem is in place. Show me how to do that in T-SQL and I'll reconsider my position.

Now that I've tasted this fruit, the old way of doing things sounds like insanity to me. You're being needlessly zealous and close-minded here.

Edit: And another thing! (Shakes fist)

Much of the beauty of modern programming languages is their adaptability to whichever domain you're working in. We no longer need domain-specific languages for every different task, because we can embed those languages inside our parent language; then we don't have to reinvent static typing, write a new IDE, and learn decades of programming language design before we can start on our actual business logic. So, yes:

When writing data access logic, you should be thinking in SQL (or generic relational logic) but writing in C#.

When writing a game renderer, you should be thinking in linear algebra, but writing in C#.

When writing a payroll processing system, you should be thinking in payroll, but writing in C#.

When writing a chemical engineering toolbox, you should be thinking in molecules and reactions and units, but writing in C#.

This way, anyone who knows C# is already halfway (yes, only half) toward being able to maintain your system. If you insist on using a DSL for every single one of these tasks, 80% of your time will be spent on context switching and trying to get them to talk to each other correctly and correcting issues in the DSL itself.

Of course I do often write raw SQL as views and scripts, and of course I write TypeScript and HTML and CSS/SASS when working on a front-end (although I usually "think in HTML and write in TypeScript", not surprisingly). But that's mostly for development and maintenance; not for core business logic or library development.


Let us assume you accept that writing SQL is better than writing the equivalent machine code.

The “meta-programming” is stating there is a “language” representing their problems that is better than SQL. If we call that better framework(language) Blub[1] then that is probably a good metaphor, rather than causing people to get triggered by the generic ORM tag.

[1] https://www.benkuhn.net/blub/


To be fair, there is one important thing the ORM brings, though: sanitizing inputs.


Don't sanitize your inputs; parameterize your queries instead.
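A minimal sketch of the difference, in plain Postgres SQL (the users table is hypothetical): the query text is fixed and the value travels separately, so hostile input stays data and never becomes syntax.

    PREPARE find_user (text) AS
        SELECT id, name FROM users WHERE email = $1;

    EXECUTE find_user('alice@example.com');
    EXECUTE find_user(''' OR 1=1 --');  -- matches nothing; it is only a string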


I see those as two different solutions to two different, but slightly overlapping problems.


Going to keep those problems a secret?


Imo, the problems are the same but the conditions are different. If the query is used many times, parameterize it. If the input is used many times, sanitize it. If both are used many times, parameterize.


Parameterising works for individual fields in a statement. However, for complex queries (the reason for the meta-programming comment) you can't always parameterise the additional subqueries/tables/fields. You can use stored procedures, but that just shifts the necessary code from one language to SQL, and SQL doesn't have a robust library you can just use.


> A good use of an ORM looks like a metaprogramming environment for conveniently building syntax trees that get converted into intelligent SQL. You know it's working well if the SQL looks somewhat like you'd write yourself and you can build one SQL statement with multiple layers that are abstracted from each other (in C#, think about passing IQueryables around). The structures are parsed into SQL and executed very explicitly only at the end of the chain, do a lot of work, and never produce SELECT N+1s. A good ORM user is thinking in SQL but writing in C# (or whatever your business layer is in).

Sounds very similar to how Django (python) wants you to pass around QuerySets. It's very easy to set up an initial query with joins/etc, then pass the QuerySet into multiple functions to filter it in multiple different ways (each filter creates a new instance, so you're forking the original set and don't have to specify the joins multiple times), but the query itself is never actually run until you try to read from it.


Also databases (yup, MySQL I'm looking at you) got better at handling subqueries quickly, which makes the job of ORMs a lot easier.


It's still (5.7) really bad sometimes. A simple `IN (subquery)` can sometimes run much better if the subquery is fetched and another round-trip query is made using literal values for the `IN (...)`. I'm sure there are plenty of other 'deoptimizing' query patterns that shouldn't be.


To use SQL (or rather, push compute to the database) well you need to think in sets, indexes, etc.; the primitives you work with are rather different from the ones you usually use in C#.

You need to switch mode of thought anyway.

If a really good language for that happens to be expressed in the language of the C# AST -- instead of some new syntax -- that would be fine with me. I do not see a big difference.

But since one needs to switch mode of thought anyway, I would slightly prefer a new high-level language that compiles to SQL and is usable across all backend languages. Still, whatever fixes the problem of pushing computation to the database without all the warts of SQL, I am all for.

Until that gets a bit further than where it is today, I prioritize writing SQL over abstractions that are a bit too leaky.


A separate high-level language just for queries doesn't make much practical sense, since queries are normally surrounded by plenty of other code.

OTOH something like C# LINQ, which, on one hand, plays nicely with the rest of the language, and on the other, can be directly mapped to SQL (without ORM and other impedance-mismatch-inducing layering) is great. But, necessarily, language-specific to ensure tight integration.


As a data platform manager in a data-rich company I agree: programmers tend to prefer their favourite hammers to sql, even if a modern relational database can do the thing much better.

But! As somebody with a relatively good understanding of the history of SQL, relational DBs, and related concepts, I have to add: SQL is often to blame.

All the amazing engineering that goes into database engines, the clean and coherent ideas of relational algebra, the optimisability of a declarative approach to computation -- all of that gets a bad rep because of the nightmare of SQL-the-language.

The standard, the syntax, every little bit that could go wrong is just wrong from the point of view of a language designer. Composability, modularity, predictability, even core null-related defaults.

But it's everywhere. We just have to accept it.


It’s not just the language, it’s schema evolution, data distribution, and exposed APIs.

I don’t want to give other teams direct access to a DB and have them not only take a dependency on the schema, but also gain the ability to run arbitrary queries that may exhaust resources in ways that impact normal operations. If I expose an API, I control the access patterns and can evolve the schema separately to suit the workload.

If other teams need a replica to perform their arbitrary queries, I’d much rather have them using a richer data model that they can normalize into whatever form suits their needs than have to conflate that into a source of truth data store.

If you have a single business unit and can get away with commingling concerns within a small team, great, throw it all in a single DB. If it makes sense to split, however, do it quick and early to avoid a decoupling hell that is more expensive than having split in the first place.


> If I expose an API, I control the access patterns and can evolve the schema separately to suit the workload.

And then they will depend on your API data schema...

Yes, schema evolution is hard, but databases have many tools to help here that you will either have to recreate on your APIs or live without and have a harder time. Either way, all the trouble comes from data evolution, and any schema-only change is trivial to deal with.

Data distribution is something that varies from one DB to another; they usually have very good performance that is hard to replicate in your application layer, but are very hard to set up and keep running. The point about control of resource usage is a good one, though.


> And then they will depend on your API data schema...

Which we have methods of evolving. I can put my API in gRPC and know exactly which changes will or won't break compatibility. Try doing that with a database.


> And then they will depend on your API data schema...

+100 to this


But that API schema is unrelated to the underlying DB schema. I’ve been able to run services with different backends (eventually consistent & low latency vs transactional) exposing the same API. That would not have been possible by just giving consumers access to a DB.

The "the DB can do everything" crowd can't seem to understand this, for some reason.


Any nontrivial change to the base model will mean a lot of complexity in the API layer and degraded performance. Maybe that's worth it for you, maybe if you're exposing this data to hundreds of external users who don't need high performance. But I feel that for most usecases, barebones DB access is the better option.


> That would not have been possible by just giving consumers access to a DB.

That depends on the DB's foreign table support, or equivalent, if it has it. It's certainly not categorically impossible with a DB.


Moving the goalposts, I can see.

Of course, merging different backends can not be done on the backend. Either you add a layer or you do it on the client.


That seems to be an odd way of looking at it, IMO. Most of the time, the entire point of building an API is to present consistent functionality to any consumer, not only ones under your control. Also, a well-behaved API is versioned so as to allow evolution of the API without breaking existing clients who can upgrade to new versions as they are able.


That’s not a new problem though? The classic way to solve this in databases is to expose the data API as stored procedures/views and restrict query access on the actual tables to just the DBAs - I think even MySQL, which was late to the party here, has had this ability for some time now.


> I don’t want to give other teams direct access to a DB [...]

You don't have to. Package up necessary queries into views and/or stored procedures and grant permissions only on those. Views can also shield from schema changes.
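A sketch of that setup (Postgres-style; every name here is invented):

    -- consumers get a stable view, never the base tables
    CREATE VIEW order_summary AS
    SELECT o.id, o.placed_at, c.name AS customer, sum(l.amount) AS total
    FROM orders o
    JOIN customers c   ON c.id = o.customer_id
    JOIN order_lines l ON l.order_id = o.id
    GROUP BY o.id, o.placed_at, c.name;

    REVOKE ALL ON orders, customers, order_lines FROM reporting_role;
    GRANT SELECT ON order_summary TO reporting_role;
    -- the base tables can now be refactored as long as the view keeps its shape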


That also requires all data to be in the same DB. Often that’s not practical or possible.


Postgres FDW's do wonders to centralize this type of access.


In SQL Server you can define external tables that access data on another server, or even parse a file.


As someone who absolutely loves the power of SQL, I abhor the footguns involved. Especially with null-based ternary logic that is incomprehensible to most people.


You have to learn how your database handles NULLs, different databases do it slightly differently. Once you do that, you're fine.

One big thing is that NULL should never "mean" anything in a business sense. It is the absence of a value, hence it cannot mean anything.

By the same token you need to understand how your database handles concurrency. Different databases do it differently.


> One big thing is that NULL should never "mean" anything in a business sense. It is the absence of a value, hence it cannot mean anything

I always had a problem with that notion. I mean, it has a memory representation, it has a set of operators you can apply to it, defined behavior in UNIQUE and FOREIGN KEY constraints etc. All this is well documented (though can behave slightly differently between databases, as you mentioned).

So, it has a set of valid values (just one: NULL) and a set of valid operations, so it's a type!

And now you have a type that looks somewhat similar to a null in "normal" programming languages, and SQL generally lacks the mechanisms for inventing your own types, so why wouldn't you use it in your business logic where it makes sense?

The design of NULL seems like a historical accident anyway. As best I can discern, there was a need for something to behave as "excluded" in the context of FOREIGN KEYs and outer joins, and so that semantic was just passed along to other areas where it made less sense.

I think a better type system and a better separation between comparison logic and the core type would have obviated most of the NULL's weirdness and made it far less foot-gunny in the process...


The problem is that it isn't just me. It's everybody. Everybody that uses an SQL system has to learn a version of logic that looks like something they've learned before but actually has no relation to it. People can learn it, but it is a chore and takes a lot of real world experience (i.e. costly mistakes) to get it drilled into their head just like it did with me. As someone who does a lot of mentoring of data engineers, it's infuriating that they all have to go through this at some point.

Unfortunately the problem is made worse by the proliferation of languages making the equally heinous mistake of treating null as "false-y". Bad form, Peter, bad form!


Many businesses are absent of value :)


Agreed. The power of SQL is awesome. But I really wish people would start making better languages that compile to SQL, like what happened with JavaScript.


Same. Edgedb is doing something good but I wish I didn’t have to deploy a new service on top of my own db. Postgres only as well.


Is there something fundamental stopping us from making a new frontend to something like Postgres? I think all the solutions that compile to SQL kinda work, but it would be nice to have a new native interface that sucks less.


Nope, nothing but industry-wide inertia; it's just so massive by now that it doesn't make sense to switch. NoSQLs tried hard and went nowhere.

What I find funny is that most DBs translate SQL into an internal representation that is remarkably similar to a proper relational algebra and optimise on that. I'd really just prefer the algebraic language as described in the original, ages-old paper.

There were also a few DBs trying to push SQL-but-better languages... haven't heard about them for a while.


I have experimented with a different query model from sql for time series data. A query took the form of a Rhai script. (Rhai is a scripting language that has great interop with rust, so it was similar to how lua would be used to script parts of a game.)

Each query script would act on a few objects in global scope: `db` (handle to database), `start`, and `end` (time range of grafana dashboard that query was for).

I found being able to write imperative (rather than declarative) code to build a query to be extremely powerful, especially for storing variables and looping over things.

e.g. query script - just to get a feel for it:

    let dalmp = db.ts("pjm-da-lmp/western-hub") // ie retrieve the timeseries named 'pjm..'
        .with_time_range(start, end);
    
    let rtlmp = db.ts("pjm-5min-lmp-rt-lmp/western-hub")
        .with_time_range(start, end)
        .resample("1h", "mean");
    
    let da_err = rtlmp.diff(dalmp);
    
    #{
        dalmp: dalmp,
        rtlmp: rtlmp,
        da_err: da_err,
    }
A query script would be expected to return a dictionary-like object. The keys would be used as labels and the values would each be a time series object.

This is not the perfect solution for every problem, but I thought it might be interesting to see an example of a very different approach to querying compared to SQL.


I'm a fan of relational databases, but I think we should concede three points:

1. Databases should accept queries in a structured machine-readable format rather than a plaintext language.

2. SQL in particular is a poorly designed language: it isn't very composable, it has lots of annoying edge cases (like NULLs), and it has a number of annoying limitations (in particular, its historically limited support for structured data within fields).

3. Given how most RDBMSs are designed, you often need to handle denormalization and caching manually. This requires doing a lot of excess data management in a middleware layer--for instance, querying a cache before accessing the DB, or storing data in multiple places for denormalization. Some of this can be done in SQL (e.g. denormalization through TRIGGERs), but since SQL is not a very good language (see (2)) that can be tough.
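For instance, the TRIGGER-based denormalization from point 3 might look like this sketch (Postgres-flavored, using the PG 11+ EXECUTE FUNCTION syntax; the orders/order_lines schema is hypothetical):

    -- keep a denormalized orders.total in sync with its order_lines
    CREATE FUNCTION refresh_order_total() RETURNS trigger AS $$
    DECLARE
        v_order_id bigint;
    BEGIN
        IF TG_OP = 'DELETE' THEN
            v_order_id := OLD.order_id;
        ELSE
            v_order_id := NEW.order_id;
        END IF;
        UPDATE orders
           SET total = (SELECT coalesce(sum(amount), 0)
                          FROM order_lines
                         WHERE order_id = v_order_id)
         WHERE id = v_order_id;
        RETURN NULL;  -- AFTER trigger: the return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER order_lines_total
    AFTER INSERT OR UPDATE OR DELETE ON order_lines
    FOR EACH ROW EXECUTE FUNCTION refresh_order_total();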


SQL being a plaintext, human-friendly language is a good thing. SQL is a common skill transferrable between languages and environments, and it is also easily usable by non-developers. If we replaced SQL with some abstract language, what language would you use when talking to the database directly (via psql, sqlplus, or whatever)? Would you need to learn the PythonQuery, JavaQuery, and C#Query languages separately? Would the language used in ETL tools be different still? What language would you write database views, triggers, functions in?


Have you looked at Quel? We have multiple languages to run our code on generic CPUs (C, Python, Java, Ada, Go, Rust, Lisp, etc.) Why must we have only one database language that doesn't do a very good job at Codd's relational calculus?

So many people have drunk the kool-aid that SQL is the answer. Maybe it's time to change this.

The success of ORMs could be thought of as the market voting against SQL.

https://www.holistics.io/blog/quel-vs-sql/


The blogspam post you linked has zero examples of QUEL. Looking at Wikipedia [0], it seems much uglier and less readable than SQL.

I am not a database theory person. I'm a developer who does not care about Codd's relational calculus. As for ORMs, they are great at solving simpler problems and CRUD data access, and they make the developer’s life easier by giving them nice objects to work with as opposed to raw database rows. However, any advanced analytics/reporting/summary queries tend to look awful with an ORM.

[0]: https://en.wikipedia.org/wiki/QUEL_query_languages


I was hoping you would address the points about composability rather than turning it into a beauty contest between the two languages.


SQL actually isn't all that transferable, because in practice you almost always use an ORM or a query builder instead of writing queries directly into your codebase. So when switching languages you still need to learn the new language's ORM; your knowledge of the underlying SQL will only go so far.

Of course queries should be human-readable, but there's no need for it to be a complete language with its own grammar and templating via prepared statements. The queries could be encoded in JSON or some similar (probably custom) human-readable language that can easily be generated programmatically. MongoDB does this IIRC; it's probably the only thing I like about Mongo, but it's a good idea.


in my experience most engineering shops that care about database perf are not using ORMs except for the most generic CRUD features. you gotta handroll your queries with an eye on EXPLAIN once you pass a billion rows in your tables, in my experience anyway


My experience as well. ORMs are really nice in the beginning, they save a lot of boilerplate code. But they become the enemy once scale and/or performance become an issue.


The thing about #2, is poorly designed in comparison to what?

Chances are high the application layer is written in JavaScript, PHP, Ruby, or Python. We don't even talk about nasty edge cases in those languages because they are uncountable.


Those languages all have first-class functions and OOP-style encapsulation.

In SQL stored procedures (at least in Postgres), you can't have a variable that holds multiple records. You can't even define a variable inside of a block.


DECLARE foo record[];


Temp tables?


None of those languages is as unpleasant as SQL. Not even PHP.


For me, I would much rather use SQL, especially when also using dbt to manage the queries like any other language. I thought the same way as you, and SQL certainly has its warts, but I have grown to appreciate its elegance. I think over the long term it is easier to maintain because it is so concise and the most common second language among programmers.


Once you start thinking in sets instead of collections of independent statements as per PHP et al, SQL starts looking a lot better.

If you try to fit SQL into the general purpose programming language box, you're gonna have a bad time. A really bad time.

SQL is a DSL for set theory. Nothing more. Nothing less.


The things that are bad about SQL have nothing to do with set theory. The non-orthogonal syntax. The terribly limited and opaque type system. Ternary logic that fails silently. The utter lack of composability or testability.


Also, views are the essence of composability.

As for testing in the Postgres sphere, there's pgTAP and pg_prove. https://pgtap.org/

Extension development includes a built-in unit testing harness as well.

For use with any database, a tool like Sqitch allows for DB-agnostic verification at each migration point. https://sqitch.org/docs/manual/sqitchtutorial/#trust-but-ver...

Proprietary databases have their own solutions as well. https://learn.microsoft.com/en-us/sql/ssdt/walkthrough-creat...

I think "utter lack" is grossly misrepresenting the state of the art. If you mean "widespread ignorance of existing techniques related to composability or testability," then we are in agreement.


CTEs are quite composable. I might be spoiled by Postgres, but types are quite robust there.

Replace "NULL" with "unknown" in your head, and the ternary makes more sense. When you're building your schema, does an unknown value make sense in that context? Many times not, and the column should either not be nullable or should be referenced in a different table by a foreign key.

3 = NULL

"Is 3 equal to this unknown value?" Maybe yes. Maybe no. It's unknown. Therefore the answer to "3 = NULL" is NULL. The answer is also unknown. Not true. Not false. Unknown.

IS NULL or IS NOT NULL, but never = NULL or <> NULL.
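A few runnable illustrations of that reading (psql; t and col are hypothetical):

    SELECT 3 = NULL;              -- NULL (unknown), not false
    SELECT NULL = NULL;           -- NULL: two unknowns may or may not be equal
    SELECT 3 IN (1, 2, NULL);     -- NULL: 3 might equal the unknown member
    SELECT count(*) FROM t WHERE col = NULL;   -- matches zero rows, always
    SELECT count(*) FROM t WHERE col IS NULL;  -- the correct test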

It may be unusual to someone coming from a general purpose programming language's notion of null as a (known) missing value, but that doesn't make it wrong. It means you need to reorient your mind toward set theory, where NULL means "unknown" if you're going to work with SQL and relational databases in general.

Folks often speak of the impedance mismatch between relational models and in-memory object models. NULL is one of those mismatches.


> CTEs are quite composable. I might be spoiled by Postgres, but types are quite robust there.

> Replace "NULL" with "unknown" in your head, and the ternary makes more sense. When you're building your schema, does an unknown value make sense in that context? Many times not, and the column should either not be nullable or should be referenced in a different table by a foreign key.

Cool, now how do I do these two incredibly basic things at the same time and make something not nullable or referencing another table in a CTE?


CTEs are quite composable -- in SQL land.

I wish SQL itself were more composable.


> in particular, its historically limited support for structured data within fields

This is not particular to SQL though, and is the rationale behind the first normal form. Codd argued that any complex data structure could be represented in the form of relations, so adding non-relational structures would just complicate things for no additional power.


The issue is that, to put things in 1NF, you need to fully normalize everything, which has a big performance penalty since every query now has to JOIN a large number of tables together.

Of course, an RDBMS could be designed to do that without a performance penalty, by storing data in a denormalized form and automatically translating queries for the normalized data accordingly.

But SQL doesn't have the features you'd need to control and manage that sort of transparent denormalization. So you'd end up having to extend SQL to support it properly so that the performance penalty in question could be mitigated in all cases.

edit: Rather than "you need to fully normalize everything," I should have said "you need to split all your data across multiple tables to eliminate the need for structured data within records." The performance penalty happens when you need to do this everywhere for sufficiently complex datasets.


I doubt querying JSON or XML or some other embedded structured format would be faster than querying normalized data. It might be true in some special cases, but certainly not in the general case of ad-hoc queries across nested data structures.

But I totally agree SQL could be improved to make normalization feel like less of a burden. It really highlights a problem when it feels like it's more convenient to just dump a JSON array into a field rather than extract it to a separate table.


> The issue is that, to put things in 1NF, you need to fully normalize everything

Not unless you redefine “fully normalize” you don’t.


The boundary between "simple" and "complex" data structures is largely arbitrary, though. It's not unreasonable to consider an integer as a complex data structure, consisting of bits, and in some contexts we do that - but in most, it's obviously more convenient to treat it as a single value. In the same vein, it should be possible to treat a tuple of several numbers (say, point coordinates) as a single value as well, in contexts where this makes sense.


In the context of the relational model, simple and composite refer to how values are treated by the relational operators. You can't select individual bits of an integer (without special-purpose operators), which means an integer is a single value.

Presumably an integer is physically stored as a set of bits, but this is not exposed to the logical layer (for good reasons - e.g. whether the machine uses big endian or little endian should not affect the logical layer). If you actually wanted to operate on individual bits using relational operators, you would use a column for each individual bit.

Having XML or JSON fields is also totally fine according to the relational model as long as they are "black box" values for the logical layer. But Codd observed that if you wanted to treat individual values as composite you end up with a much more complex query language. And indeed this has happened with XPath and JSON queries and whatnot embedded in SQL. Presumably it should then be possible to have XML inside a JSON struct, and a table inside the XML. If this is even possible, it would be hideously complex. But normalized relations already allow this without any fuss.


This is an arbitrary distinction in the first place. You can absolutely have a relational database that only allows 0 and 1 as valid values and requires you to build integers up yourself using relational techniques. The relational model itself doesn't care what "values" are.

In practice, simple value tuples make things much more convenient, and the edge cases are minimal. You don't have to allow full-fledged composition like nested tables etc.


> The relational model itself doesn't care what "values" are.

Maybe I misunderstand what you are arguing, but the relational model is defined in terms of values, sets, and domains (data types); of course, the domains chosen for a particular database schema depend on the business requirements.


Codd argued this a long time ago. Unfortunately Codd died twenty years ago, so we can't ask him his current thoughts on the matter. On the other hand, Chris Date dropped this rigid view of the relational model way back in the 1990s.

https://www.youtube.com/watch?v=swR33jIhW8Q


The question isn't who said what when, the question is what reasoning holds today. Codd's argument was simply that if we allow tables embedded inside fields and we want to query across multiple "layers" of embedded tables, we get a much more complex query language and implementation for no additional benefit, since the same relationship can be represented with foreign keys. I haven't seen any reasonable counterargument against this.

If I understand Date correctly, he is just saying that individual values can be arbitrarily complex as long as they are treated as "atomic" by the relational operators. I don't disagree, but the reality is that very soon after you start storing stuff like XML or JSON in database fields, someone wants to query sub-structures, e.g. filter on individual properties in the JSON. And then you have a mess.


Also, caching can be done with materialized views


IIRC in Postgres you need to refresh an entire materialized view all at once, effectively recreating the entire table; you can't just have it update incrementally whenever the underlying data changes.

I think SQL Server can do this, but...then you have to use SQL Server.
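For reference, the full-refresh statement in question (view name made up); CONCURRENTLY avoids blocking readers but still recomputes everything, and requires a unique index on the view:

    REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals;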


> I think SQL Server can do this, but...then you have to use SQL Server.

Yes, SQL Server can update indexed views incrementally, but there are severe limitations:

https://learn.microsoft.com/en-us/sql/relational-databases/v...

If memory serves, indexed views have been in SQL Server for 20-odd years and haven't seen meaningful improvements in all that time. We still can't do a LEFT JOIN, or join the same table more than once, or use MAX, etc...

The same story with T-SQL, which is firmly stuck in the '80s (not that other databases are better).

There are some extremely powerful features in SQL Server that can be used effectively with some pain, but they could be so much better if Microsoft invested in fully fleshing-out their potential instead of chasing the latest buzzword.

Sorry for the rant.


Yep, which is why materialized views don't tend to work great for a lot of tasks where they initially seem to be the most natural implementation, particularly almost any view that's a feed or aggregate of data in the system over all time. It's so easy to just play with a simple SQL select query until you get it working, then throw it into a materialized view. It'll probably even work for a long time! But as soon as the data in the system grows and that refresh starts getting slower, you're stuck with a (potentially tricky or at least frustrating) migration to another implementation (maybe something like event sourcing).


I think it's being worked on for Postgres, but it's probably another major release or two away. A quick Google came up with this: https://pgconf.ru/en/2021/288667 - and I'm sure I've come across discussion in the Postgres mailing lists / wiki in the past.

It would definitely be a nice feature to have. Without it, I find the main use case I have for materialised views is batch processing, where you want to prepare a complex result set and then stream-process it in a cron job or similar.


There’s also info on this page about it (incremental materialised views).

https://wiki.postgresql.org/wiki/Incremental_View_Maintenanc...

In the meantime my preferred technique is to have a column where I stamp the last generation time for each row and then I rebuild anything that’s changed since then (assuming all your source data has some sort of last updated stamp).
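A rough sketch of that technique (all names invented; assumes a unique index on order_summary(order_id)):

    -- rebuild only the summary rows whose source orders changed since the last run
    INSERT INTO order_summary (order_id, total, generated_at)
    SELECT o.id, sum(l.amount), now()
    FROM orders o
    JOIN order_lines l ON l.order_id = o.id
    WHERE o.updated_at > coalesce((SELECT max(generated_at) FROM order_summary),
                                  '-infinity')
    GROUP BY o.id
    ON CONFLICT (order_id)
    DO UPDATE SET total = EXCLUDED.total,
                  generated_at = EXCLUDED.generated_at;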


Also computed columns in a covering index.


> It's quite another when experienced seniors ban the use of SQL features because it's not "modern" or there is an architectural principle to ban "business logic" in SQL.

While I agree with the idea of pushing heavy compute to where the data resides, I wholeheartedly disagree with the statement quoted above. Concentrating business logic in SQL makes it effectively untestable. It's not easy to "compartmentalize" SQL code such that each individual piece is testable on its own. Often this lets major issues go unnoticed until the SQL query executes in production on some unexpected input data and people are woken up at 3AM. Add on top of this the fact that SQL exceptions are a PITA to debug on a normal day, and it probably isn't any easier at 3AM.


Are you talking about stored procedures? SQL itself is super easy to test: insert data into tables, run query, compare output. It's also easy to test queries on production data since every db has a repl. Queries are also mostly composed of read-only, referentially transparent parts so it's super easy to take snippets and test/run them in isolation. For complex updates you join the table you want to update to a read-only query with all the logic to calculate the new values.

SQL is probably the easiest language there is to write well tested, composable, easy to debug code.
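A minimal version of that workflow, runnable in psql (names invented):

    BEGIN;
    CREATE TEMP TABLE orders (id int, amount numeric);
    INSERT INTO orders VALUES (1, 10), (1, 15), (2, 7);

    -- the query under test, diffed against the expected rows;
    -- an empty result means it passes (diff both directions for a full check)
    SELECT * FROM (VALUES (1, 25::numeric), (2, 7::numeric)) AS expected(id, total)
    EXCEPT
    SELECT id, sum(amount) FROM orders GROUP BY id;

    ROLLBACK;  -- leave no trace behind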


> Queries are also mostly composed of read-only, referentially transparent parts so it's super easy to take snippets and test/run them in isolation

True enough, but now you need to ensure that the snippets copied out into tests stay in sync with the inline versions embedded inside your six-screen-long SQL query.


In MSSQL you have "inline table-valued functions" which you can declare and reuse; they are inlined, and optimization happens across them. We use sqlcode [1] to get around weaknesses in deploying stored functions.

Big drawback though is that functions can only take scalar arguments, not table arguments.

Another method is to just have a bunch of CTE clauses, and append something different to the end of the CTE depending on which parts of it to use (i.e. some string assembling required).

[1] https://github.com/vippsas/sqlcode


Easy to debug and write well-tested - mostly yes. Composable is tough, though (make sure all your table aliases are unique, that there are no ambiguous column names, that all your ANDs are in place, and don't forget that 1=1 to make it easier to uncomment parts). You already need a somewhat sophisticated query builder just to compose multiple JOINs cleanly. Recently I needed to get a query builder-ish thing going with an arbitrary number of conditions which should be done as JOINs, and the most I could muster to keep it at least a little manageable was a "pkey IN({literal_subquery})". It does compose, as in "yes you can do it", but I wouldn't say it composes very conveniently.


Excuse me, but can you show me where in the SQL language documentation I can find info about its unit test harness framework, teedeedee best practices, mocking, shimming, and dependency injection? That's just the bare minimum; I start with those as the first thing to learn in any new language.


At my first real programming job, I found an error in the SQL functions that the company had written. After pointing it out, the CEO bet me that I couldn't fix it. Apparently their best programmers had tried and failed.

I did end up fixing it, but it took me a couple weeks.

Beyond being untestable, it's also hard to version control and roll back if there's a problem. We had some procedures around SQL updates that were kind of a pain because of that.


Regarding SQL being "untestable"...

I have put SQL DDL+DML+stored procedures in version control, created/run stored procedure (TDD) unit/integration tests on mock data against other stored procedures, had pass-fail testing/deployment in my CI/CD tool right alongside native app code, and done rollback, all using Liquibase change sets (+git+Jenkins).

Using Liquibase .sql scripts for version control isn't hard. Testing is always more work, but it's doable.

I don't completely disagree with you on rollback being hard, though, at least full pure rollback-from-anything. Having built tooling to do it once with Liquibase, I found the effort to guarantee rollback in all circumstances took more effort than it was worth. A lot of DDL and code artifacts and statements like TRUNCATE are not transaction-safe and not easy to systematically roll back. Liquibase does let you specify a rollback SQL command for every SQL command you execute, so you could make it work if you had the time, but writing+testing a rollback SQL command for every SQL command you execute wasn't worth it, and is indeed materially more effort than just rolling back to earlier .war/.jar/.py/docker/etc files. (The latter are easier in part because they are stateless, of course.)

In any case, something like Liquibase can get you a long ways if you have the testing mindset. (Basically it lets you execute a series of SQL changesets and you can have preconditions and postconditions for each changeset that cause an abort or rollback.)


If you mean 5+-page SQLs are untestable, I don't disagree. Like any code you need to break it up once it gets past a certain size.


> Like any code you need to break it up once it gets past a certain size.

And SQL doesn't give you the tools to do that. There's no easy or natural way to split a statement into smaller parts.


SQL doesn’t give you tools to break up code?? Err, I don’t agree there. I can think of four options just off the top of my head.

Simplest is to use WITH clauses (common table expressions/CTEs). They can help readability a lot and add a degree of composability within a query.

Second, you can split one query into several, each of which creates a temporary (or intermediate physical) table.

Third, you can define intermediate meaningful bits of logic as views to encapsulate/hide the logic from parents. For performance you materialize them.

Fourth, you can create stored procedures which return a result set and like any procedural or functional language chain or nest them.

These techniques are available in most databases. More mature databases support forms of recursion for CTEs or stored procedures or dynamic SQL.

As with most programming, proper decomposition and naming helps a fair bit.
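A sketch of the first option, with invented names: each CTE is a named, separately readable building block.

    WITH active_customers AS (
        SELECT id, name
        FROM customers
        WHERE cancelled_at IS NULL
    ),
    recent_orders AS (
        SELECT customer_id, sum(amount) AS total
        FROM orders
        WHERE placed_at > now() - interval '30 days'
        GROUP BY customer_id
    )
    SELECT c.name, coalesce(r.total, 0) AS recent_total
    FROM active_customers c
    LEFT JOIN recent_orders r ON r.customer_id = c.id;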


> Beyond being untestable, it's also hard to version control and roll back if there's a problem.

Recently, I've learned that SQLServer supports synonyms. So you version functions / procedures (like MySP_1, MySP_2, etc...) and establish a synonym MySP -> MySP_1. Then you test MySP_2 and when ready, change the synonym to point to MySP_2. Of course, all code uses just the synonym.
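A T-SQL sketch of that cutover (procedure names as in the comment; the rest is illustrative):

    CREATE SYNONYM dbo.MySP FOR dbo.MySP_1;
    GO
    -- ...later, once MySP_2 has been tested side by side:
    DROP SYNONYM dbo.MySP;
    GO
    CREATE SYNONYM dbo.MySP FOR dbo.MySP_2;  -- callers flip over, no code changes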


We use sqlcode as a more general solution to this:

https://github.com/vippsas/sqlcode


I'm not sure where this idea comes from. There are unit testing frameworks for T-SQL and pl/sql. Stored procedure code can be version controlled like any other code.


We care A LOT about testing everything, always.

We use SQL containers [edit: a new, freshly cloned DB for a test function in less than a second] and find it to be no problem in practice to write integration tests for our pieces of Go code that call the SQL queries with some business logic in them.

(No, we don't do a sprawling mess of stored procedures calling each other. We just try not to move data over to backends unless we really need to.)

If you really need complex flow, making connection scoped temp tables for temporary results in SQL and having the composability/orchestration through backend functions calling each other and passing the SQL connection between them is doable.

Yes, you cannot unit test every small line of SQL, but since SQL is much higher level that isn't really needed. Test the functional behaviour / inputs and outputs and you are fine.

It really isn't different from writing for a GPU, in a sense.


> Concentrating business logic in SQL makes it effectively untestable. It's not easy to "compartmentalize" SQL code such that each individual piece is testable on its own.

What are you talking about? SQL is just as easy to compartmentalize and test as anything else.

Each query statement belongs in a function -- there, compartmentalized. Now set up a table state, run the function, and compare with new table state. The test either passes or fails.

Also no idea why you'd think SQL is a PITA to debug. It's a relatively compact and straightforward language once you learn it, and queries are self-contained. It's generally much easier to debug a query than it is to debug something happening somewhere across 10,000 LOC across 400 functions.


It's totally possible to validate data before pushing to a table, and have normal tests around sql massaging it. Saying this as somebody looking at 1000s of transformations done by my teams.

Admittedly, it took a while for data engineers in the industry to accept these practices, though.


PS: but stored procedures and UDFs are evil, yes


A lot of people are not comfortable learning languages beyond the ALGOL-like paradigm. SQL's building blocks are incredibly odd if you're used to the idea that work gets done via variables, conditionals, and loops.

Personally I think it's a wonderful (if imperfect), ultra-powerful, and easy language to learn. But if it doesn't click for you and you're under the gun at the job, I bet it's very easy to develop a bad attitude towards SQL.


"SQL's building blocks are incredibly odd if you're used to the idea that work gets done via variables, conditionals, and loops."

They're even weirder when you get into stored procedures, where SQL statements are your very un-ALGOL-like elementary statement, but then in between them you have a procedural language operating on the results, except when the optimizer figures it can "see through" your procedures to get back to something it can optimize through. And the "declarative" nature of them makes understanding cost models a challenge sometimes. You need a deep understanding of how the database works to get the cost model out of your stored procedure code.

Very powerful. I don't do a lot of deep database stuff but I have a couple of times turned something that required an arbitrary number of thousands of back-and-forths with the DB taking seconds from the application code into a one-shot "send this to the DB, get answer back about a millisecond later" using them, and you can end up with performance so good that your fellow developers literally won't believe it's a "conventional stodgy old relational database" blowing their socks off. But it's a weird programming model.


I'd also add that "doesn't click" is sometimes confounded by a respect gradient. With SQL (also CSS) I've seen some people pick up an attitude that those aren't Serious Languages worthy of their time or respect, where they avoid learning the fundamental concepts, have problems, and then say the language is too hard or old-fashioned to use. I've seen people write many thousands of lines of Java because nobody pushed back on that mountain of fragile code telling them "maybe you should take a day and really get up to speed with how SQL works".


> A lot of people are not comfortable learning languages beyond the ALGOL-like paradigm. SQL's building blocks are incredibly odd if you're used to the idea that work gets done via variables, conditionals, and loops.

Yes, although somewhat amusingly I've found that non-programmers who have never learnt an imperative programming paradigm tend to find SQL a lot more intuitive than ALGOL-style programming languages.


I'm curious about resources to learning SQL well.

In our company I've done a series of teaching sessions (we're on about the 10th hour now). It's working OK, but I wish I knew of good material / blog posts to point at. I feel like there's no good community to learn from, like there is for many backend languages.

E.g. after learning about "cross apply" (T-SQL; "lateral join" in postgres) everything got 10x easier to express in SQL. But: How is a developer who's just getting started in SQL going to know that?
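For anyone who hasn't seen it, a sketch (Postgres; names invented): the lateral subquery can reference the outer row, here to get the latest three orders per customer. T-SQL spells the same thing CROSS APPLY.

    SELECT c.name, o.id, o.placed_at
    FROM customers c
    CROSS JOIN LATERAL (
        SELECT id, placed_at
        FROM orders
        WHERE customer_id = c.id
        ORDER BY placed_at DESC
        LIMIT 3
    ) o;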

And where is the community that can teach a backend developer getting started with SQL to ignore some of the advice from the DBA and Data Analytics communities? E.g. the advice to "not pin indexes" -- which I believe is 100% wrong advice for a typical backend application where reproducibility across environments is key, and where any query not supported directly by an index is probably a bug anyway.


"E.g. after learning about "cross apply" (T-SQL; "lateral join" in postgres) everything got 10x easier to express in SQL."

I feel like even in the past few years the database community has still been learning about what you need the databases to be able to do in order to make good code. I personally don't like the "declarative" memeset and think it set the community back literally decades, but with recent Postgreses (by which I mean the whole last five years or so... lots of people are still running older things) all the functions and functionality are technically there to arbitrarily convert between rows, arrays, columnsets, etc., and more and more you can use them arbitrarily as well, so you can JOIN against a columnset you bodged together from two other queries that you pulled into an array and then put that array into a columnset, without it ever having to be turned into a full "table". Cross apply is another example of that, where IIRC a row can be turned into multiple rows.
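
For example, the basic conversions look like this in Postgres (trivial cases, just to show the shape; names invented):

    SELECT array_agg(id) FROM some_table;          -- rows -> array
    SELECT * FROM unnest(ARRAY[1, 2, 3]) AS t(id); -- array -> rows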

The problem is that while all the functionality I've wanted on this front does now seem to exist, it's all incredibly haphazard. Cross joining is an SQL keyword, but arrays look more like a data structure, and I can't remember what all was going on but I recall having more hassle turning arrays into columnsets for some reason. If I were going to be doing this full time I think I'd build myself a matrix cheat sheet of how to convert between all these things, and I bet there's still holes in the matrix even today (is there an opposite of a cross join? dunno, but I wouldn't be surprised if the answer is "no").

I feel like I'm doing a lot less "work around things missing in SQL (that I have access to)" than I did 15-20 years ago, but rather than a cohesive and well-organized toolset for dealing with all these things, I've got a haphazard set of Bob's Custom Tool for This and A Semi-Standard, Modestly Extensible Tool for that, neither of which were ever designed with the other in mind, and yeah, in the end I can do everything I want quite nicely but it's up to me to notice that what this tool calls a 1/8 inch quartzic turns out to be the same as a Number Seven smithnoczoid and so in fact they do work together perfectly despite the fact the documentation for neither of them suggests that such a thing is possible, etc.


Mode's SQL school is the best I've seen: https://mode.com/sql-tutorial/


And when it does click for you, all the other things that do not involve for/while loops (dicts, map/reduce/group by) become very obvious.


This depends on the use case. SQL is the king of batch processing - queries are declarative, with decades of effort put into optimization.

For real-time / streaming use cases, however, there is not yet a mature solution in SQL. Flink SQL / Materialize is getting there, but the state of the art is still the Flink / Kafka Streams approach - put your state in memory / on local disk, and mutate it as you consume messages.

This actually echoes the "Operate on data where it resides" principle in the article.


We do mini-batch processing in SQL. Some hundred milliseconds of latency, some hundred events consumed from the inbound event table per iteration. Paginate through using a (Shard, EventSequenceNumber) key; writers to the table synchronize/lock so that this is safe.
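
One iteration of the consumer loop looks roughly like this (T-SQL; table and column names invented):

    -- @Shard / @LastSeen are the consumer's cursor position
    SELECT TOP (1000) EventSequenceNumber, Payload
    FROM   InboundEvent
    WHERE  Shard = @Shard AND EventSequenceNumber > @LastSeen
    ORDER BY EventSequenceNumber;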

Kafka-in-SQL if you wish. Or, homegrown Flink.

(There are many different uses for the events inside our SQL processing pipelines, and we have to store the ingested events in SQL anyway.)

I am sure real Kafka+Flink has some advantages, but...what we do works really well, is simple, and feels right for our scale.

There is enough batching in SQL to get real speed/CPU benefits on inserts/updates (vs e.g. hitting SQL once per consumed event, which would be way worse). And with Azure SQL the infra is extremely simple vs getting a Kafka cluster in our context.


Do you find Flink SQL immature? To me it looks a lot like syntactic sugar on top of the DataStream API.

Same thing, less code?


I've met so many developers who seem to be willing to do anything, except use sql. I remember one case where we were dealing with clearly relational data. One of the senior developers was adamant that we use a non-relational database. When pushed as to why, he said it was because it would have better performance. I pointed out that this for a process where data would be sent out and we wouldn't expect to receive it back for 3 days or so. A few milliseconds performance boost was hardly beneficial over 3 days. But he was insistent and was senior, so we did it. Shockingly, it turned out to be the wrong decision. Later we learned that the senior developer didn't like sql.


> I think a lot of blame is with the database vendors. They only targeted some domains and not others, so writing SQL is something of an acquired taste. I wish there was a modern language that compiled to SQL (like PRQL, but with data mutation).

There's EdgeQL https://www.edgedb.com/blog/we-can-do-better-than-sql#lack-o... which I like the look of - but as it adds one more layer over a database I haven't used it yet.

Edit: I hadn't seen PRQL before, reading the site now https://prql-lang.org/


I have run into similar attitudes. In my case, it's an unfamiliarity with SQL, fear of the database, and a desire to utilize strongly typed ORM's for everything. Implementing these sorts of queries often requires using raw sql which is seen as taboo by such people. They think it's unsafe. Meanwhile we're pulling more records than we need, searching them, then firing off more queries in a for loop.


How easy is it to verify that the configuration of your database matches a checked-in configuration or source file these days? My beef with a lot of installed procedures in a SQL database comes down to deployment and rollback difficulty.


For Postgres views, materialized views, functions and triggers I've made a tool that can do that, DBSamizdat:

https://sr.ht/~nullenenenen/DBSamizdat


Depends on the DBMS, but pt-config-diff works well for MySQL.

edit: This will only help for configuration, not procedures.

https://docs.percona.com/percona-toolkit/pt-config-diff.html


I feel this has gotten worse with the latest round of data science graduates wanting everything in python.

Probably just perspective. I ducked out of the push for everything to be in Hadoop. And while I can appreciate the foot gun that is indexing everything so that ad hoc queries work, I also have to deal with folks thinking elastic search somehow avoids that trap.

I think I've seen the same claims and push for graphql. :(


This, a thousand times this. A hundred lines of Java/Python/C# can save you at least 10 lines of a stored proc :) Also why don't they teach SQL in most schools???


> Yes we CAN do that but why do that it's 15 lines of SQL and 5 seconds compute -- instead of a new microservice or whatever and some minutes of compute.

This works until your database falls over in production. Recently someone started appending to a json field type over and over in our production database. And then on some queries, postgres crashed due to lack of memory. The fix was to remove the field and code that constantly appended to it, and do something else.

No, the database should not be the answer to all your data problems. Yes a microservice may be the best answer. But for structured data and queries that can run with normal amounts of memory the standard SQL DB is fine.


> No, the database should not be the answer to all your data problems. Yes a microservice may be the best answer.

Or to put it another way: No, microservices should not be the answer to all your data problems. Yes, using the DB may be the best answer.


In one case it might just be slow. In the other your db will fall over. Unfortunately, you may find the answer when your production environment fails.


If you have a single DB instance in production, something has already gone wrong.

Read replicas.


> Recently someone started appending to a json field type over and over in our production database. And then on some queries, postgres crashed due to lack of memory. The fix was to remove the field and code that constantly appended to it, and do something else.

So your choices are: do something manifestly non-optimal in the database in violation of best practices, or don't use the database for non-trivial business logic?

Sounds like a false dichotomy to me. Maybe just find a better solution?

I mean, if someone writes an O(n^3) algorithm in Python, is the solution to use a better strategy or to swear off Python for anything non-trivial?


> I wish there was a modern language that compiled to SQL (like PRQL, but with data mutation).

That is nice, but even better would be a language that is relational itself.

I'm working on one (https://tablam.org) and you can even do stuff like:

    for p in cross(products, qty) ?limit 10 do print(p.products.price * p.qty) end

The thing is, being functional/relational only is too mind-bending, and it is very nice to also have procedural constructs to work with.

BTW my plan is that the query (?) section will compile to optimal executions defined per storage engine (memory, sqlite, mysql, redis, etc), so `products ? price = 10.0` is executed on the server, not on the client.


Interesting! Have you looked at Datalog? In many ways it's the ideal purely relational language.


Yes. I think it is not friendly for end-users, and building this in Rust makes it easy to plug in iterators/generators, so I went that way.

(Also, I never wrapped my head around how to use it internally!)


IME SQL and relational databases have the framework problem, where they expect to be used in a particular way and won't let you use the internals. It's like when a language (Ada?) famously built a high-level version of concurrency into the language, where you were meant to use its "rendezvous" (I've probably misremembered the name), and instead what happened is that programmers implemented mutexes on top of the rendezvous and then reimplemented high-level concurrency on top of that.

What we need is databases that expose more of their internals, that are designed to be embedded in applications and used as libraries. All serious databases use MVCC these days, but none of them expose it to the user. All serious databases separate updating the data from updating the index, but few of them expose that to the user. All serious databases know the difference between an indexed join and a table scan, but good luck figuring it out by just looking at an SQL query. Etc.


Oh this reminds me a lot of a familiar kafka vs SQL battle.

Yes, it is possible to do it in Kafka, but everything is obscured in the sense that you can't peek at what you are doing, you spend your precious CPU on serializing/deserializing, and things like backfill are a mess.


I find the opposite; SQL databases work a lot like Kafka underneath, but hide it from you, burning all your CPU to generate this illusion of a globally consistent state of the world that you don't actually want or need and doing everything they can to avoid ever showing you what they're actually doing.


Hold up. I need some clarification here. SQL databases are opaque with regard to resource usage but Kafka is completely transparent in this regard? That is your personal experience and assertion?


Yes. I once saw an outage caused by an SQL database server collapsing because of a query that had been issued 23 days earlier. I saw another one gradually get slower and slower over a period of weeks because a data scientist had left a command prompt open and forgotten about it. Kafka brokers are a lot more consistent and predictable IME.


> "why can't just just put your millions of events into a service bus and write some backend to react to them to update your aggregate". Yes we CAN do that but why do that it's 15 lines of SQL and 5 seconds compute

Do people actually do this? I feel like this is classic Occam's Razor. This is one of the biggest things I do in SQL and I couldn't imagine the time and effort to do it with a separate service.


I interact with SQL on a daily basis but never write a single query by hand. It's all abstracted by the Django ORM. I do have to be mindful of what I do of course but mostly I just plow away with the abstractions offered. Once in a while I have to take a look at the SQL or the query plan but those are few and far between.


Sure, but there's a world of difference between someone who knows SQL and uses a ORM for productivity and someone who uses an ORM because it's the only tool in their box.


Exactly. Effective use of an ORM requires knowledge of the ORM's API in addition to SQL, not in lieu of it.

Using an ORM without a firm understanding of SQL is a recipe for disaster (or at least glacial performance).


I've seen both extremes of this.

On one end, you have applications that execute thousands of SQL queries for each page load, for populating a table with some data or something like that, which has bunches of rules for what should be displayed, all implemented as nested method calls (e.g. the service pattern) in your application. It's a performance nightmare when you get more data, or more users.

On the other end, you have an application where your back end acts just as a view for a DB that has all of the logic in it. There will rarely be proper tests for it. There will rarely be any sort of logging or observability solutions for this in place, the discoverability will be pretty bad, debugging will often be really bad, versioning of changes will also be pretty awkward, but at least the performance will typically be okay.

Just look at the JetBrains Survey from 2021: https://www.jetbrains.com/lp/devecosystem-2021/databases/#Da...

  Do you debug stored procedures?
  47% Never
  44% Rarely
  9% Frequently
  
  Do you have tests in your database?
  14% Yes
  70% No
  15% I don't know
  
  Do you keep your database scripts in a version control system?
  54% Yes
  37% No
  9% I don't know
  
  Do you write comments for the database objects?
  49% No
  27% Yes, for many types of objects
  24% Yes, only for tables
If something that most would consider to be a "good practice" isn't done, then clearly that's a bit of a canary about the state of the technology and the ecosystem around it. Consider that databases are very important for most types of systems, and yet about half of people don't debug their stored procedures, most people don't have tests for them, only about half version their scripts and about half don't bother with comments.

I'm pretty much convinced that it's possible to write bad software regardless of the approach that's used. In my mind, the happy path for succeeding in even sub-optimal circumstances is a bit like this:

  - make liberal use of DB views for querying data; if you have a table in your app, it should have a matching DB view (see the sketch after this list), which will also make debugging easier
  - make use of in-database processing only when it makes a lot of sense and anything else would be a horrible choice (e.g. ETL or batch processes/pipelines), have log tables and such regardless
  - for most other concerns (e.g. typical CRUD), write app logic: most back end languages will have better support for observability and tracing, logging and debugging, as well as scaling for any expensive operations
  - still, be wary of the N+1 problem, that might mean that you don't have enough views, or that you're not using JOINs for your queries properly
  - also, look into using something like Redis for caching, S3 (or something compatible, like MinIO) for binary blobs, and something like RabbitMQ for task queues; just because you can shove everything into the DB doesn't mean that you should, and sometimes these specialized solutions will have better support and more standardized libraries than whatever you can concoct
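
A minimal sketch of that first point (invented names); the app selects from the view, so there is one place to look when debugging:

    CREATE VIEW v_order AS
    SELECT o.id, o.created_at, c.name AS customer_name
    FROM orders o
    JOIN customers c ON c.id = o.customer_id;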


Yep. Folks treat relational databases like dumb bit buckets. They assume they are limited and therefore treat them as limited despite the reality of modern SQL. It should not be a surprise to find that most devs skip modern development practices with it as well.


The main reason to do this is so they can be turned into streams that have the source and, potentially, multiple destinations decoupled. If the source and destination will always be the same, sure, do it in the DB.


I couldn't have put it better. Use the SQL, Luke.


Firstly, I'd suggest the author look at this differently; perhaps "For Want of a Code Review". Especially code from a relatively recent graduate, on a piece of code for which the engineer in question has little experience.

With that said, the JOIN is a very powerful concept which, unfortunately, has been given a terrible reputation by the NoSQL community. Moving such logic out of the database and into the DB's client is just a waste of IO and computing bandwidth.

SQL has been the ONLY technology/language that has stuck with me for > 25 years. The fact that it is (apparently) not being taught by institutions of higher learning is just a shame.


> Firstly, I'd suggest the author look at this differently; perhaps "For Want of a Code Review". Especially code from a relatively recent graduate, on a piece of code for which the engineer in question has little experience.

I assume the story is made up, but if we were to take it at face value the title would be "For Want of Basic Human Decency"; the author is saying that they saw this whole easily preventable train wreck happen in slow motion and did not lift a finger to prevent it, instead laughing, taking notes and thinking of the fabulous snarky blog post that would come out of it.


The way I read it, they accepted the decisions of those higher in the hierarchy, after providing their feedback. It wasn't clear if lots of money was lost, just lots of time. I didn't think it was made up.


Eventually you learn it's not always a great idea to stop your employer from getting burned. Some people don't learn until it hurts.


From the article:

> I’d definitely commented on the JOIN statements during the initial code review


How about when it became a problem? Or the second time? The third?


I agree. It took me 3 years or so to actually land in a project and learn SQL for the first time. Before it was all with ORMs. I didn't know what a join was for the first couple of years of my career.

Understanding SQL and being able to work with data interactively has made me a better software engineer. This tech is important enough that it should be taught in university/coding camps.


You DIDN'T learn SQL in school? Probably my most useful class. I hated it at the time; I was a desktop and embedded dev, and this was before SQLite roamed the earth.


>Probably my most useful class.

Ditto.

My teacher was hardcore. He was a graybeard who was around before Codd's now famous paper. He worked with some of the old pre-relational hierarchical databases.

We had to take SQL queries, turn them into relational calculus and algebra, turn that into a query plan, then come up with an estimate for the time the query would take to run given various hardware speed numbers and the size of the data.

We had to implement our own (primitive!) database engines, including various join algorithms.

To date it's one of the hardest, yet most rewarding, learning experiences I've had.


I would take that today!


Not exactly the same, but check out the CMU database course lectures on YouTube.


That.. Sounds graduate-level. How's your PhD?


I took it during summer, and there were some masters and PhD students in there, but I took it as an undergrad course. It was very intense, 3 hours per day, 3 days per week, and 1.5 hours per day the other two.


Man, I got downvoted into oblivion for my comment, but seriously, yeah, graduate level.

That sounds fun. My most intense undergrad class series was one where we soldered together a M68HC11 computer and learned assembly one class, built an OS for it the next, and then turned it into a robot with control software running on our OS-es.


Not necessarily. One of my undergrad courses (~2008) had us implementing a basic inverted index (think Solr or Elasticsearch), which gave me some insight into Solr that my co-workers didn't have and that helped with performance issues.

If we had a database course that went that in-depth, I'd've definitely taken it, too.


That sort of course was very much par for undergrad courses at CMU when I was taking CS classes there 25 years ago. The OS course was very intense. I wasn’t a major and didn’t have time to take it, but my networks course was of similar rigor (i.e., implement a toy TCP/IP stack).


This was obviously a joke. Sounds like there's a range in SQL training that goes from "I've heard of SQL" to "I implemented a PostgreSQL-compatible db my sophomore year."


When I was at Rice a couple decades ago, the database class was a 400-level class in which we learned relational algebra and relational calculus before SQL. The professor must have been good at teaching because I loved learning the formal underpinnings even though my memory of them has faded, but I do recall that I went from zero SQL knowledge to being very excited by its power. So many of my CS classes were very theoretical, and even though we learned some theory in the database class, it was definitely one of the single most (the single most?) pragmatic & practical of all the CS classes I had.

I was so zealous about normal forms that I complained loudly at one job where they used an old D3 database with multivalue fields. It was so glaring to me because we actually used all hand-rolled SQL instead of an ORM in those days. Years later, after growing less tech-centric and more thoughtful of business needs, I realized that sparingly using multivalue fields was not a hill to die on. :)

Fast forward many years to my first startup in Boston. Google App Engine was new and I wasted precious time trying to figure out how to shoehorn a typical relational data model into the early NoSQL data store available for App Engine at the time. This was just after the financial crisis and I hadn't yet heard the mantra to pick boring technologies, and I learned through sheer pain that unless you really really really need to, don't waste effort by walking away from relational databases. And also, most apps can get by with whatever the ORM does and if there's a performance issue, optimize that one query instead of trying to optimize all your SQL from the beginning. There's a lot I still don't know about pushing heavily complex queries down to the db level, but for expensive problems I'd reach for expensive assistance, because it's worth it (after trying to play with the SQL myself).


For me the first time we dabbled with SQL was in a 3rd year Software Engineering course where the focus of the class was a single group project that we managed among ourselves by splitting tasks, conducting code reviews and handling the build and release in teams.

I recall one group doing the project's login, which went much along the lines of what the OP's article touched on. Their code was essentially:

    var success = false;
    var reader = db.Execute("SELECT * FROM users");  // the entire table
    while (reader.Read())
    {
        // compares credentials row by row, in application code
        if (reader["user"] == input_user && reader["password"] == input_pass)
        {
            success = true;  // and keeps looping even after a match
        }
    }
Yes. They selected the entire user table.

Yes. They iterated over the entire result (even if the first returned result was valid)

Yes. That was "shipped" for the project

No. My complaints that they should be leveraging the database for all the things they were doing wrong were ignored. It was performant! Look! It logs in instantly! Yeah, because there are 8 users in the database for this project; what about when it ""ships"" and there's 100,000? More?
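
For contrast, the database-side version is a single indexed lookup (sketch only; parameter names invented, and a real implementation compares salted password hashes, never raw passwords):

    SELECT 1
    FROM users
    WHERE username = @input_user
      AND password_hash = @input_pass_hash;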

---

My first real job dealing with a database wasn't much better. We were using an MS Access database with no normalized data. Our client's primary transaction data was spread across a table with 70-some columns, many of which held duplicated values in some form or otherwise reflected very bad practices. Since joining this company I've sped up queries in almost immeasurable ways and done things my older coworkers initially derided because they couldn't understand the syntax.

TL;DR: SQL, for some stupid reason, is still treated as second class to core languages, and it is a god damn shame


> SQL is still treated as second class

Agree so much. And if you've ever seen a real SQL wizard in action, you realise how much can be done with it. Like most of the business logic of a system can be in the database, with an interface that's a set of stored procs/functions. And fast.


> most of the business logic of a system can be in the database

The problem of this approach is the tooling and lock-in.

If databases had first-class versioning support for their code objects (which could easily interoperate with git), testing automation, and a semblance of standardization across the industry, then a lot of people would be very happy to work with that model.

But they don't.


> The problem of this approach is the tooling and lock-in.

I've seen a program rewritten or heavily refactored on top of an existing database more times than I've seen the database swapped on an app that had reached production (which I've seen zero times).

Consequently, I regard remaining "database agnostic" as having very little worth. If you pick a DB with a bunch of great features that can save you time, improve performance, and improve data integrity: use those features!

Plus, if you find yourself in that rewriting-or-heavily-refactoring job that I've seen a few times, your favorite person in the whole world will be whoever put all those annoying constraints and triggers and such in the DB itself. It'll make the operation far easier and safer.


While the tooling could definitely be better, a lot of those issues aren't so problematic if you just use Postgres. Use a migration tool and store the migrations in git, use a tool like Zapatos to provide typing for your queries at the application layer, support multiple versions of stored procedures using schemas with a defined search order and test your procedures using pgTap.
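
For example, a minimal sketch of the schema-based versioning idea (names invented):

    -- v2 of the API lives alongside v1; callers pick a version via search_path
    CREATE SCHEMA api_v2;
    CREATE FUNCTION api_v2.get_user(p_id int) RETURNS text
    LANGUAGE sql AS $$ SELECT name FROM users WHERE id = p_id $$;
    SET search_path TO api_v2, api_v1, public;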

Postgres-as-a-platform is definitely a new architectural trend, but because of companies like Supabase it's maturing quickly, and there are so many benefits to it when executed properly.


Came in to say pretty much the same thing. Discoverability is a huge issue... and even if you do things in a way that lends itself to that, it gets really clunky really quickly.


I may be misunderstanding here because I’m not a dev and have only a cursory level of experience doing some basic programming or sql but is this not what dbt allows?


I'm all for straight up queries and understanding... even some more complex sprocs... I'm not a fan of too much logic in the dbms, since it's pretty much a lock-in for a single vendor, limits breaking pieces out for scale and makes things generally much harder to find/understand in practice. I'm a proponent of what I like to call discoverable code structures, sprocs/functions don't lend themselves to that.


yeah, I should have added "can be, not should be" ;)

Though I have met DB Admins who insist that the only way of stopping bad data getting into the DB is to have the DB do all the data manipulation, including a lot of what we would now consider business logic


Yeah, I've worked in environments like that... and after a few years, it takes an inordinate amount of time to add/change anything as a result.


I took several classes on it and I still didn't really 'get' it until I had to work on challenging problems in the real world. Granted, my education was not great quality overall.


I just enrolled in an online CS degree course. The database class is an elective. Crazy, I know.


Agreed... ORMs can be nice, but one should understand how it works. I'm a pretty big proponent of simple mappers (Dapper for .Net, template literals for JS/TS) with straight SQL over ORMs at this point.


I was at a place that used sharding, so the data was scattered across 128 database servers.

SELECT works there but JOIN doesn’t, as your right side may reside at another shard.


IMO...

If the query is for OLAP the data may need to be extracted to another data store.

If the query is for OLTP, then the design is wrong. I don't know your problem space, but pulling data from 128 shards to resolve queries while a user is waiting is just a really bad idea.


> If the query is for OLTP, then the design is wrong. I don't know your problem space, but pulling data from 128 shards to resolve queries while a user is waiting is just a really bad idea.

well, that's the basic idea of microservices lol. forget living on a different shard, lots of times your data is going to round-trip to JSON and back a couple times and then be manually joined in some backend/service layer, or in graphql!

one bad abstraction I see a lot from microservice teams (that don't really understand it past the high-level concept) is "every table is a service", or "every minimal set of tables and its codeset is a service", and that's exactly how that ends up. Microservices really ought to be chunky enough to do their business without ending up calling 27 different services under the hood just to do simple operations. Obviously there is a point where it's too chunky, but too micro is also bad.


> pulling data from 128 shards to resolve queries while a user is waiting is just a really bad idea.

To the contrary, pulling data from 128 shards can be done in parallel, and about 127 of them don't have any data to return.


Yes. This is a super common problem with no-sql engines - we ran into something similar with SOLR when objects are not flattened (eg @JsonUnwrapped annotation). Child objects are stored as separate documents with a join... but if the child object is not stored in the same [file-]block then predicate-scans for the parent may not encounter the child object that causes predicate satisfaction. This breaks deep pagination and some other abstractions.

To me this really is the fundamental distinction for no-sql vs RDBMS. If your data model involves lots of joins... it's RDBMS even if you're using mongo or some other document store under the hood. Ideally you will be storing some large analytical document that contains a lot of details about the thing, rather than just treating it as "rows as a document".

the thing about JOINs breaking across blocks/shards is one thing, and it's ultimately something you can work around for a lot of data (again, flatten with @JsonUnwrapped for example) but if you find yourself reaching for joins, your data is relational, or at least your representation is relational.


If you had to do joins on different shards, then you have implemented sharding wrong.


This is to some extent an implementation limitation, not a theoretical one.

Typical SQL databases support neither the data organization nor the parallel orchestration features required to support these types of JOINs well. The practical issue is that you can't add these features to an existing database kernel architecture if it was not designed to make this feasible from day one, and people are rightly reluctant to design a new SQL database kernel architecture from scratch just so that these features are available. SQL databases are trapped in a local minimum.


Our SQL course in uni left a lot to be desired. Very little time spent on joins, much more on subqueries, oddly. At my first job out of school there was a SQL portion to the interview, and they were impressed by my overuse of subqueries. The best SQL I learned was in the first few months in a real database with real information, instead of a student-courses mock DB with 15 rows that seems to be the academic standard for teaching.


I'm frankly surprised that some of the larger MS based data sets aren't more standard for learning. MS SQL Server isn't generally my first choice (preferring PostgreSQL for standards and portability), but it's got some pretty great example data out there.


Oh it's taught. I have a bone to pick with how. I'd rather have spent a lot of time on the practical application of SQL than the theoretical background of column and table operations.


I've learned that "a month in the lab saves an hour in the library" usually can be distilled to "A shallow understanding produces complex solutions. A deeper understanding is usually required to create simple solutions."

While the original example of not understanding JOIN might just be a lack of of general knowledge, the later steps are great examples of this, especially if someone else comes along and is told to fix the error.

Making something execute slow code in parallel is pretty easy to do generically. It doesn't require understanding much about the slow code. It's fairly low risk, you probably won't have to tweak tests, there won't be additional side effects. The major risks will be around error handling and it's easy to turn a blind eye to partial success/failure and leave that as a problem for a future team. You can confidently build the parallel for loop, call the task done and move on.

Striving for a deeper understanding requires a lot more effort and a lot more risk. Re-writing the slow code is a lot more risk. All side effects must be accounted for. Tests might have to be re-written. The new implementation might be slower. The new index might confuse the query planner and make unrelated queries slower somehow. It's not just a matter of investing time, it's investing energy/focus and taking on risk. But the result will have comparatively fewer failure modes, it'll be cheaper to operate and less likely to have security implications.

I've been in both spots and while I wish I could say we always went with the deeper understanding that wouldn't be an honest statement. But the framing has been really helpful, especially as I work with other execs in the company to prioritize our limited resources.


"Simplicity is the highest form of sophistication"


"less is more"


<>>


golf clap


Reminds me of Blaise Pascal who apologized to a correspondent saying something like "I apologize for the long letter, I didn't have time to write a shorter one.".


The worst part is not the missing JOIN. This happens, especially with juniors.

It's the 'all signup errors warrant paging the on-call, even at 4am' bureaucratic decision, followed by being unable to apply any fix quickly. No surprise the author did not stay.


The worst part is more senior devs being happy to watch it happen and even accept the commits, and then write a long-winded story of how this 'car crash unfolded', apparently unaware that they're a tacitly active contributor to it happening and then escalating out of control.

If you're going to be a part of throwing junior devs under the bus, at least have the self-awareness not to brag about it on the internet.


This article speaks to me. So many times I have needed to go back and fix queries that were naively written this way like it was some kind of "optimization". There is no difference in effort between writing a join or doing the ORM-double-round-trip in the vast majority of cases. People are so afraid of doing joins I see people doing subqueries with the id in a subselect because "joins are slow". The worst is usually some kind of pseudo-join and then an aggregate or filtering in the application code. It drives me up the wall when I see it in code review, usually because I get into some argument about "joins are slow" (with no evidence) and then I have to go and rewrite the query and maybe add an index to show that, yes - an aggregate that takes seconds and a ton of memory in the application code can in fact take milliseconds in the database.
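
To make the pattern concrete (invented schema), the fix is usually replacing the pseudo-join plus app-side aggregate:

    -- The pseudo-join version, followed by summing in application code:
    SELECT * FROM orders
    WHERE customer_id IN (SELECT id FROM customers WHERE region = 'EU');

    -- Versus one statement the database can index and optimize:
    SELECT c.id, sum(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE c.region = 'EU'
    GROUP BY c.id;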

The NoSQL people have really done a lot of brain-damage to this industry.

It's so pervasive that I've started using this kind of question in our technical interviews; doing a double round-trip ends the interview for anyone higher than a junior.


Here is some actionable advice for speeding up a join with indexes. You'll need one on each table, covering the join columns. For example:

    SELECT ...
    FROM table_A
    JOIN table_B ON table_A.column_A1 = table_B.column_B1
                AND table_A.column_A2 = table_B.column_B2

You can add indexes like this:

    table_A(column_A1, column_A2)
    table_B(column_B1, column_B2)

If both tables are large enough, this query can probably take advantage of those indexes to perform a merge join.

Postgres tip: you can also add columns to the INCLUDE part of the index to speed up filters in the WHERE conditions. You might even get an index-only scan! Look into covering indexes to learn more about it.
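
Concretely, something like this (Postgres 11+; column_A3 is a made-up example column):

    -- column_A3 rides along in the index leaves, so a query filtering or
    -- selecting it may be answered by an index-only scan, without heap access
    CREATE INDEX idx_a_join ON table_A (column_A1, column_A2)
        INCLUDE (column_A3);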


Not related, but at my job a query just stopped working because the data got too big. We use left joins everywhere because we don't want to lose the main table data. Do you think the trick you just mentioned could optimize our left joins as well?


Most likely, yes. Combined indexes are real cool, because if the database can satisfy a join only from the index, it can usually perform it completely in memory, instead of having to do a table scan/hitting the disk for the values.

Another trick people usually shy away from: temporary tables. It might seem slow and wasteful to create a table (plus indices) just to store results for a fraction of a second, but for very large tables with large indices, creating a smaller copy of the base table (with its own small index) and joining against that can be orders of magnitude faster!

There is even dedicated syntax for that: CREATE TEMPORARY TABLE. They are local to the connection and will get dropped automatically at the end of the sql session.
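
A hedged sketch of the pattern (Postgres flavor; names invented):

    CREATE TEMPORARY TABLE recent_orders AS
        SELECT id, customer_id, total
        FROM   orders
        WHERE  created_at > now() - interval '7 days';

    CREATE INDEX ON recent_orders (customer_id);

    SELECT c.name, sum(r.total)
    FROM   customers c
    JOIN   recent_orders r ON r.customer_id = c.id
    GROUP BY c.name;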

They are also great for storing the results of (non-dependent) subqueries, because for large sets not every database is able to find the proper optimizations - MySQL versions < 8, for example.

I really recommend you try that one. So far I have been able to fix every "query takes too long" problem that resisted other solutions that way.


I suppose so. I would just check if these tables are vacuumed often (preferably with autovacuum) otherwise the query planner might decide not to use the index depending on the visibility map conditions.

This post explains it in more detail: https://blog.pythian.com/postgres-covering-indexes-and-the-v...


Agreed, partial indexes were a game changer for me. After that, Window Functions was like a whole new world of awesomeness opened to me. Nothing has been the same ever since I figured out how to effectively use Window Functions. One of the biggest Eureka! moments of my entire career.


Actually I was talking about covering indexes:

CREATE INDEX ... INCLUDE ...

They can be used to speed up queries that have WHERE clauses, so I see how it might have caused some confusion, since partial indexes have WHERE clauses in their definition.


Interesting, I've never seen INCLUDE before - probably because we use a version of postgres that does not support it - but I see now that it can be useful instead of doing CREATE INDEX idx ON table (colA, colB) because of the lack of ordering in the b-tree leaves. Good stuff. Just another piece of ammunition for upgrading our db to a newer version that has all sorts of new goodies I've been reading about lately.


ok, now i cant resist

> This article speaks to me. So many times I have needed to go back and fix queries that were naively written this way like it was some kind of "optimization"

in some cases, doing joins in the application is more performant than making the database do it. It's usually better to do it with a join, but depending on the data you're joining you might incur significant slowdowns. It's always better to start with the join and only evaluate the application join if there is a need to improve performance, however. Nonetheless, a sweeping statement like yours doesn't help either.


Read my comment again, I said in the vast majority of cases, not "all".


Back in the 90s, I remember getting a project from a local F500 company. Our design team had been doing some work for them and they'd been happy with the results, so when they had problems on a backend project which was over a year behind schedule they asked if we could help & I was pulled in. The project was a fairly straightforward product selector for industrial equipment, but the team from a large consulting firm which had been working on it was struggling with performance & hadn't completed most of the features. The client was saying it was unacceptable that pages would take 5 or more minutes to load and they weren't going to drop $500K on bigger servers like the developers were swearing were necessary to run the site.

I knew something was off performance-wise since the entire product catalog was only on the order of tens of thousands of records. As soon as I looked at the source code, the mystery was explained: they had 3 allegedly experienced developers working on it but none of them knew about SQL WHERE constraints! Instead, they were doing nested for loops to repeatedly retrieve every row of every table and doing the equality checks in VBScript. Finishing the rest of the project backlog took me a couple of days and the customer was quite happy that the slowest pages were now measured in hundreds of milliseconds rather than tens of minutes.

I was proud of how quickly we were able to turn that project around but the PM & I were discussing how even our rush rate wasn't enough to get us anywhere close to the amount of money the previous contractors had charged.


> With the exception of NPM modules, most tools are designed to solve problems, possibly the ones you have

Upvoted just because of the chuckle this gave me.


As someone who’s primarily worked with monoliths, I often wonder how often this exact problem happens, but where A and B are [micro]services owned by two different teams, one is required by company policy to use their APIs not their raw databases, and escalation of each of these issues e.g. query size/rate limiting runs the risk of burning political capital on top of everything else.

How does one JOIN across not just tables but opaque services, in the general case? Or does every team doing microservices silently expect that one day a data team will start querying for a massive number of records-by-ID from every service, and the veterans in each team plan for this load pattern accordingly?


> Or does every team doing microservices silently expect that one day a data team will start querying for a massive number of records-by-ID from every service, and the veterans in each team plan for this load pattern accordingly?

What tends to be by far more common is that each team fails to envision that someone, somewhere, sometime in the not so distant future will want or be required to retrieve more than one "element" at a time via their APIs. And so panic ensues when "other entity" begins feeding 20 API retrievals per second at their "one-at-a-time API" and their performance goes off the cliff it was always sitting near.


Extra points if you manage to do it to your own team’s APIs because of bad design and planning


You mean like 8+ joins to get that single record and then having it bottleneck when you get a few hundred requests a minute? (not bitter at all here)


I think it depends on whether you are talking ad-hoc, i.e. a user analyzing several datasets, or pipelined, i.e. preprocessed joins. For pipelined joins, effectively your data forms a DAG (directed acyclic graph, and yes we are ignoring recursion here). Provided your data services speak the same language, you can create a pipeline off the first pipeline that joins the data and sticks it in a cache (e.g. RDS, Elasticsearch). Changing the underlying data should then trigger a reload of the downstream pipelines. This is basically what Materialize.io, KSQLDB et al. do - a reactive DAG with a database as the cache.

One issue for larger companies is that you don't control the whole DAG, so discovery, security, protocols etc. need to be coordinated by an overarching architecture for this to work.

Something like Apollo (GraphQL) is a simpler solution (in some ways) as you control the joins on the Apollo server which speak to backend (REST) APIs (other teams).


Doesn’t this mean every service must expose a way for downstream consumers (via Apollo or not) to subscribe to updates to allow them to invalidate their caches? I shudder to think how stale the DAG approach would be without this. I suppose this is doable if the company lives on Kafka, but what if it lives on RESTful microservices?


Usually restful services are behind a loadbalancer/reverse proxy. This proxy could also handle things like security, logging, cache invalidation (the service behind the proxy) and service registration (other downstream services).

Marking APIs as immutable/up or down (for rollback), state migrations, etc. is really important from a platform perspective, so it makes sense for APIs to have a cache hook as well.

The proxy can then manage the platform state wrt change propagation.


>Or does every team doing microservices silently expect that one day a data team will start querying for a massive number of records-by-ID from every service, and the veterans in each team plan for this load pattern accordingly?

I assume here by "data team" you mean reporting. Reporting and operations groups are very different with very different needs.

Microservices are useful in operations settings where the flexibility of taking modules out of a monolith and putting the network between them outweighs the performance hit.

Reporting directly from microservices is a recipe for disaster. To support reporting, the microservices need to contribute data to a data lake, data warehouse, or other repository.


IME that antipattern is common in small companies. This is mainly so because it's often the most expedient way to get something "working". A related problem is creating a new RPC for every variation of a query that an external service may require.

One better approach is to ensure each service's db has the data it needs already at query time. For example, each service should ingest events from elsewhere in the system, and accumulate the relevant data for its responsibilities. Joins should always happen in the db.

Another approach is to keep all the data in the same RDBMS. You can slice up the data into different schemas as you see fit. I have had a lot of success with this approach, reuniting databases where people have gone a bit too microservice-wild for their actual circumstances. You can vertically scale an RDBMS to quite a large size before seeking other approaches.


> How does one JOIN across not just tables but opaque services, in the general case?

You (should) never do that. It's as simple as that. If you create microservices that are atomically depending on each other, you are doing something _extremely_ wrong.


Definitely not implying that two services should mutually depend on each other. But a third service C may want to look up in B for every record available from A - say, if B reports reservations, and A reports users who are members of a certain group, and you want the reservations relevant to a specific group. The OP article describes all sorts of pitfalls for the C team if you only had access to B via a "get by IDs" API.


It really depends on how much data you need... I worked in an org where the primary data was in one database and secondary data was in another. The DBA team wrote the query to call across to the other (remote) db in part of the statement, and it was horribly slow... The N+1 pattern combined with memcached on the secondary lookups was so much faster in the end, since it was limited to a display page's worth of secondary lookups. A GraphQL server can do this for you relatively effectively.

In the end it really depends... if you're talking even 10-100k users, a single, well optimized SQL RDBMS is your best bet... getting past that takes deep knowledge and/or more options/skills. In the end, most don't have that next step and the trend to Micro-Service all the things is jumped to too soon in most cases (and not soon enough in others).


A couple of ways. If the need is analytical rather than real-time, you feed the data from multiple services into a separate BI database which can do slower and more complex joins across data from multiple sources. Or, if the need is real-time, you build a paginated API with a page limit that can always be processed within the API SLA. Then you build workflows on top of the paginated API to operate on that data.

Generally, unbounded operations have to be broken up at some point. It just depends on how big the data set is.


>How does one JOIN across not just tables but opaque services

One solution we use is to have an event queue from service A to which service B is subscribed. In the event handler, service B fills its own view table with data from service A. And then it can do joins on data from multiple services because everything is in the same DB. We require services to always emit "created" and "updated" events for its objects.
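
In practice the handler often boils down to an upsert into the local view table; a minimal Postgres-flavored sketch (names invented):

    -- Service B handling a "user created/updated" event from service A
    INSERT INTO user_view (user_id, name, email)
    VALUES ($1, $2, $3)
    ON CONFLICT (user_id) DO UPDATE
        SET name = EXCLUDED.name, email = EXCLUDED.email;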


You have a shared datastore, but not an SQL RDBMS. If you're big enough you build it in-house. See "An oral history of Bank Python", posted here a while back.


That was a fun read, and I loved that little joke with NPM packages.

I find SQL, Regular Expressions, DNS, Client-side caching, CORS, TLS, and a few other things to be a MUST when hiring people, because most of the over-engineered crap can be avoided with a little bit of expertise with these. I spend most of my semi-leisure time with some good Regex books and golfing too.

Modern databases are amazing. Every few months, I take pleasure in (and don't shy away from) refactoring some complex and frequent queries into SQL views, carefully replacing data logic (but not business logic) with stored procedures, and replacing certain batch scripts with one-off queries.


I was recently discussing something that involved knowing whether a string consisted of only a single repeated character. Having spent many years in the trenches with Perl, my first thought was /^(.)\1*$|^$/, which is the kind of thing people dismiss as "line noise" because they haven't spent a few minutes learning a language that can easily express what you want. We have this trend now from people who like languages like Go where answering the question "does this string consist of a single repeated character" begins with "I would now like to reserve space for a 64-bit integer which I shall henceforth refer to as 'i'...", and that's considered a virtue.


To be honest this sounds like the very definition of "you solved the problem with regex, and now you have two problems". In essentially every language in existence the question you're asking can be solved with a simple loop and perhaps 3-4 lines of code. In many languages it's easily expressed as a one liner, i.e. (with C#) `s.All(c => c == s[0])`.

I really don't get the love affair that some devs have with regex. In my 10-year career I don't think I've run into more than a dozen problems in a production system that _required_ regex to solve. When you're working with robust modern languages there's almost always a solution other than regex that's significantly easier to understand and maintain, and probably a lot more performant to boot. Is regex useful for other things, especially CLI stuff like grep and sed? Oh yes, absolutely. But generally speaking I really don't want it in my code base unless there's no other choice.


10 years and a dozen problems? Conversely, I encounter pattern matching problems and use RegEx near daily. Both these perspectives are anecdata; neither is useful.

> and probably a lot more performant to boot

I highly doubt your home-grown pattern matching functions could beat the decades of optimization that have gone into RegEx engines, in anything but the most trivial of patterns (like the one demonstrated here). Creating your own ad-hoc pattern matcher instead of using the ubiquitous one built into your language is like the junior in the article re-implementing JOINs. Sure, you may be able to beat the engine occasionally on particularly simple patterns, but I guarantee you'll lose out overall.

RegEx is not inherently slow, and it is definitely possible to maintain. See industries with serious text processing demands like bioinformatics, where Perl is still used extensively. They could not operate like they do if they shied away from RegEx like many developers seem to.


> Creating your own ad-hoc pattern matcher instead of using the ubiquitous one built into your language

GP was using LINQ, which is a first-class language construct in C#. I'll grant that it may not be _as_ optimized as Perl's regex routines, but it's hardly ad-hoc or slow.


Yes, the linked example falls in the category of "particularly simple patterns" I mentioned. This strategy only works for such simple patterns; add in a single alternation with a common base and the naïve iteration strategy falls apart. Implementing a sensible algorithm to evaluate such a pattern would require much more code than a couple of brackets, would not benefit from RegEx caching, etc.


> I highly doubt your home-grown pattern matching functions could beat the decades of optimization that have gone into RegEx engines

On the contrary, the All() method used here (which is part of the .Net standard library) is literally just a loop that evaluates each item in the collection to verify that they all match the predicate function. It'll be able to check hundreds if not thousands of characters in the time that it takes the regex engine to initialize and parse the pattern.


Interesting how you conveniently ignored the second half of my sentence. I will paste it here:

> in anything but the most trivial of patterns (like the one demonstrated here)


In other words, you're not caching/reusing your regex patterns?

I think I see the cause of one of your recurring performance problems.


>reserve space for a 64-bit integer which I shall henceforth refer to as 'i'

In this scenario, regex processing should allocate more and be slower. The for loop is more optimal even if takes more lines. There's probably some SIMD solution which would be the fastest.


It's probably slower by a constant amount. It also probably allocates a constant amount of memory.

In a compiled language, odds are that the regexp is faster and uses the same amount of memory.


Let's see if I remember Perl regexps: '/': this is a regular expression. '^' at the beginning of a line, '(.)' match any character, and remember it for later. '\1' match the same character that you just remembered, '*' zero or more times. '$' then match the end of the line. '|' Or, '^$' match the beginning and end of the line with nothing in between.


all(c == str[0] for c in str)


I think the only thing I might add is a guardrail for strings that are too long, and might behave really badly with regexp.


s/\./[\\s.]/ iirc. Or s/\/$/&s/ if the engine implements it.


len(set(x)) == 1


>because they haven't spent a few minutes learning a language

Regex is pretty far from an easy-to-learn language and you're going to need more than a few minutes with it. Like, imagine if a standard string library only had functions with a single character name and how awful that would be to use.


Favorite regex books? Only one I've read is Mastering Regular Expressions by Jeffrey Friedl and I loved it. If you have any more recs, I'm all ears.


Excellent book! I was only about 25-30 pages into the first edition when it all finally clicked for me. I've never feared a regex since.


I am curious about the "but not business logic" part. Can you make it more concrete: what are the parts that you chose not to use SQL for?


If the biz logic is about structured data storage and transformation, it belongs in the database.

If the biz logic relates to data validation (data types and data affinity), it belongs in the database (and probably should be checked elsewhere as well).

If it relates to data integrity and correctness (foreign keys, check constraints, uniqueness, cascade behavior, etc.), it belongs in the database.

If it relates to communication with external services (email, queues, file storage, HTML rendering, data compression, scheduling, lookups to 3rd-party APIs, pure computation, etc.), it very much does not belong in the database.

All of these are "business logic". All are important. Use the right tool for the job at hand. Set theory for the databases, general purpose computing for the app layer.


Then I think we have different definitions of business logic?

HTML rendering, email, data compression, ... for me never belonged to the business logic.


I could agree with HTML rendering and compression, but sending an email during user registration or as part of an event notification is 100% business logic.


I've seen systems where people are doing manual JOINs with CSV, JSON, and the results of DynamoDB scans on relatively tiny datasets (<10 megs.) Everything could fit in sqlite on a single machine. Instead, they build a Rube Goldberg contraption that uses "modern cloud architecture."


The aversion to SQL by younger devs is pretty amazing. Yes it has a weird cognitive model and a learning curve but it's a cornerstone of web dev. Instead they resort to convoluted and technically inferior solutions like Prisma just because of a superficial DX advantage.

I'm certain Mongo only became popular because of this even though for many years it was crap.

That said I do think we need a better SQL. It's still not there but EdgeDB looks very promising.


Mongodb is very convenient for CRUD operations, while relational databases need a complex ORM to handle that sanely. Considering how much CRUD a typical application contains, I certainly get the appeal. However, the aggregation framework is horrible and the 16MB document limit is very annoying.

In general I find SQL/relational models easy to understand conceptually, but they map badly to both the rest of the application and the problem domain.

I also hope that edgedb will help with that. When I modeled one of my applications in its SDL it was a very clean match. I don't have much experience with its query language. But so far it looks much nicer than SQL, but still uglier than functional programming.


Those who fail to learn the lessons of SQL are doomed to re-implement it… poorly.

SQL-92 is no longer the baseline. Now we have CTEs, laterals, graph queries, JSON storage, system versioning, and more within a reasonably concise DSL for set theory.
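
For a taste of what's past SQL-92, a recursive CTE; a minimal sketch run through sqlite from Python, though the syntax is close to universal:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    rows = conn.execute("""
        WITH RECURSIVE counter(n) AS (
            SELECT 1
            UNION ALL
            SELECT n + 1 FROM counter WHERE n < 5
        )
        SELECT n FROM counter
    """).fetchall()
    print(rows)  # [(1,), (2,), (3,), (4,), (5,)]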

And even more showing up all the time like function pipelines. https://docs.timescale.com/timescaledb/latest/how-to-guides/...

SQL is the Rodney Dangerfield of programming languages. It don't get no respect!


My main issue is:

1. The ORMs are really not updated to utilize these useful features, largely because they try to support all the databases, and end up supporting really just the ANSI standard set of features, with maybe a few extensions.

2. Without the ORM, you're writing SQL-as-string, and it's as demented and awful as one would expect of writing all your code inside a random string. It doesn't help that SQL features are implemented inconsistently language-wise, so you're basically guaranteed to have at least one syntax error using anything novel (with an utterly useless error message from your favorite SQL compiler), which can only be caught at runtime (because your IDE's DB linting and language support is also limited to mostly ANSI SQL, because they also try to support all the databases).

3. If you make it a stored proc/function/view, the DB IDE tooling is shit across the board. They're all miles behind any decent app-lang IDE in terms of features/tooling they provide. It's actually impressive how pathetic an environment DBA's put up with. You're really not going to get much more support than an autocomplete on table/column names, and maybe datatypes.

So ultimately, the act of writing SQL is terrible compared to doing your normal app-logic. The only reason I'm willing to put up with it is because the positives of using an RDBMS properly dramatically outweighs the negatives. But that tradeoff isn't immediately visible to the novice, so this absurdity of reimplementing SQL with not-SQL becomes reasonable.

The engine is beautiful. The relational algebra -- glorious. The language, tooling and ecosystem? It's all stuck in the 80s.


Agreed


This isn't just a younger dev thing. This attitude was common even twenty years ago.

And for the record, I love both SQL and tools like Prisma. (I actually prefer Postgraphile, but that's a minor distinction without a substantive difference.)


The other problem with the double join is that it's not an atomic operation, so between the two selects the data could have changed.


I really wish more people understood this. Thank you.

ACID is not just a good time on a Saturday night.


I'm not sure I understand how a JOIN would have fixed this problem. That is, if each chunk is fetching 1k rows, and you're doing 50 simultaneous chunks, then you're doing a 50,000 row query, and that's ALSO going to be extremely slow, in terms of exclusive database contention (less of an issue with bigquery) and result set memory usage (definitely still a huge issue for python). In fact, one of my most frequent pieces of feedback to junior engineers who are just working on a larger backend for the first time is "this query tries to fetch too much data at once, it will take too long and use way more memory than it needs to, please use find_each to automatically batch the query so that we balance memory usage and database contention". Indeed, Rails by default will use the exact same batching strategy the junior engineer chose in this case: fetch 1k items, process those items, and move on to the next 1k items. I understand that the author rankles about things not being done the "right way" with JOIN, but I question whether their focus on "best practices" is preventing them from seeing the optimization forest (split things up into parallel background tasks, don't try to keep the entire dataset in memory at once) for the "doing things right" trees (use JOIN).


> then you're doing a 50,000 row query, and that's ALSO going to be extremely slow, in terms of exclusive database contention

In the example the author gives the query cost is very likely dominated by finding matching rows in A. Where there is no index, then we can expect a full scan of A (or the index of A.id) for every batch of B.

This is the case no matter how many rows of B you are searching with; by running the query 50x you make this cost 50x greater. Using a join you pay it once.

In addition, and probably more to the point, the round trip database costs (serialisation, parsing, planning, scheduling, network comms) are going to dominate the actual query costs for something like this (unless A is exceptionally large).

Furthermore, the memory cost to the DB of the serialisation and parsing is likely to be much larger than just storing all those ids in their native format - and there would be no client memory footprint in a join. For the final result set the client can reduce their memory footprint by using a streaming result which every BigData DB supports, and most others too. If you are particularly concerned about client side memory it is best to either: do everything on the database, or manifest a temporary result table and batch out of that.

There are circumstances where the JOIN will be too expensive to do all at once. I've worked with what is claimed to be "BigData" for about 4 years and have had only a few situations like that; but none of them would be amenable to batching like this, and instead needed much more complex architectural steps to make cheaper.


> In the example the author gives the query cost is very likely dominated by finding matching rows in A. Where there is no index, then we can expect a full scan of A (or the index of A.id) for every batch of B.

Why would you expect that there's no index? I have never seen a single database system where the most basic primary key A.id wasn't indexed. Instead, I would expect that you're correct below that the cost of the query is dominated by fetching the rows from disk and serializing them—this is a linear cost that increases with the number of rows returned, so fetching 50,000 rows should be about 50x as slow as fetching 1,000 rows (especially as long as you're fetching them in some sort of block-cache-amenable order, such as in increasing ID order, so that you're seeking to sequential places on the disk most of the time instead of fetching just random blocks)

> In addition, and probably more to the point, the round trip database costs (serialisation, parsing, planning, scheduling, network comms) are going to dominate the actual query costs for something like this (unless A is exceptionally large).

Aside from a small overhead, serialization, parsing and network comms will all increase linearly with the amount of data returned. 50,000 rows of data will be about 50x the serialization and network cost of 1,000 rows.

> Furthermore, the memory cost to the DB of the serialisation and parsing is likely to be much larger than just storing all those ids in their native format - and there would be no client memory footprint in a join. For the final result set the client can reduce their memory footprint by using a streaming result which every BigData DB supports, and most others too

Sure, I can absolutely agree that using a streaming result set would be the best of all possible worlds here. However, it does require you to keep a client connection open for 50x longer than batching would, which on many databases (e.g. Postgres), would lead to more memory usage and CPU contention than batching the result in a background job queueing system. This comes down to what % of your total pipeline is spent in the database in question compared to data processing or other databases—if only 20% of your job's runtime is fetching the rows from this database, then it's a bad idea to monopolize that DB memory for the much larger amount of time it takes you to process the entire result set, when instead you could be yielding that memory back to the system for other transactions to use. But if 80%+ of your time is spent in the database, then the small amount of time that other transactions would be able to reclaim wouldn't be worth the amount of fixed overhead from re-planning, re-executing, re-fetching the index from cache, etc. And obviously these tradeoffs are workload-dependent: as you may have been able to guess, my experience here is rooted in OLTP workloads using Postgres, and I'm sure there are plenty of differences with BigQuery's architecture.


On the off chance you see this:

> Why would you expect that there's no index?

I don't. Whilst I didn't write it particularly eloquently, I included that it would be an index-scan if there was one. And like the full table scan, this is a 50 vs 1 cost (unless the querying ids are well sorted, at which point you'd maybe get a 5 vs 1 cost at best).

> I have never seen a single database system where the most basic primary key A.id wasn't indexed.

Nobody has said it was a primary key. In fact it very likely isn't. All we really know is that there were approx 50k rows selected from B; we do not know how many are matched in A.

As they are using BigQuery and the queries are taking such a long time, it would be reasonable to assume A is some large dataset clustered around some other value (e.g. timestamp). But that itself would be an assumption.

> [streaming] does require you to keep a client connection open for 50x longer than batching would,

It does not. It will be less time.

---

Looking at pg.

I'm trying really hard to see your point. As far as I can tell, you're bothered by the working set memory of the query caused by the join exceeding a limit and causing contention - this is the only time the streamed join is worse than the batching. On an index join this would have to be a very large table.

As for CPU contention - it's a non-issue.

There may be a point related to time-to-execute with respect to lock contention.

Regardless, if either lock or index memory contention is a problem for you, then you will still want to `JOIN` - just against a subquery/CTE with limit and offset.

Roundtripping is not the answer!


Do you know what's slower than fetching 50000 rows with a big join? Fetching them with the overhead of thousands of tiny queries instead of one, and repeating disk reads thousands of times because you cannot consolidate the thousands of query executions.


Definitely don't underestimate a good database's ability to stream large queries. One of the many ways an ORM-centric view of the world can mess you up, since ORMs have a constitutional bias towards instantiating the entire result of the query in memory. (They don't have to, it isn't completely impossible for them to stream, but even if your ORM can stream it probably doesn't take much to convince it not to, even perhaps accidentally.) It is generally better all the way around to send a single query that is everything you want, if at all possible, let the database do its thing, and then spew a stream of all the results you want at you as fast as the network can carry it, than to be sitting there constantly harassing the poor thing with tiny query after tiny query, adding latency every step of the way. Match it with code that can consume the result as a stream and you can do a lot of work without using a lot of simultaneous resources.

This does break down eventually but I feel this is another one of the several places where developers still sometimes subconsciously have an early-2000s view of the world, as if all relational databases start panting and sweating if you ask them to return more than a couple hundred rows of any kind. No, set them up with the right indexes and foreign keys and they'll happily stream gigabytes at you, without the CPU even hardly doing anything. It's just as likely to be the consuming code that is the bottleneck!

You get up to "big data" and this approach stops working but what constitutes "big data" has also gotten a lot bigger since the early 2000s. Even in the engineering-centric company I work for, a lot of engineers & management assume that things are "big data" way before they should.


The applicability of this advice definitely depends on the database, drivers, and query.

For example with PostgreSQL you need to create a cursor, then FETCH NEXT 1000 over and over again in a loop. This is a bit of a pain, but is the difference between processing as data arrives, with only small buffers everywhere, versus waiting for all data to arrive before doing anything.

What exactly you need to do and how to make it work is very much database specific.
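
With psycopg2, for example, a named cursor turns into exactly that DECLARE/FETCH loop on the server side (a sketch: the DSN, table, and handler are placeholders):

    import psycopg2

    def handle(row):                # stand-in for real per-row processing
        print(row)

    conn = psycopg2.connect("dbname=example")      # placeholder DSN
    with conn:
        with conn.cursor(name='streamer') as cur:  # named => server-side cursor
            cur.itersize = 1000                    # rows per FETCH round trip
            cur.execute("SELECT * FROM big_table ORDER BY id")  # hypothetical table
            for row in cur:         # processing starts as batches arrive
                handle(row)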


Yeah, I wish this was more standard. SQL is not so much a standard as a skeleton of a standard. Better than nothing, maybe, but still, every database I walk up to, I pretty quickly hit issues like this.

I'm not trying to promise that every database will stream a petabyte without a problem; I'm more trying to help people get out of an early 2000s mindset and if nothing else, check what their DB will do. A lot of old programmer's tales about how to baby old databases along are actively pessimal and unnecessary in 2022/almost 2023. Don't spend days writing code to correctly slice and dice a query into tiny pieces when you could just send it in one shot and get better performance in every way.


Thousands? The article said explicitly that there were only 50 queries (in fact, it said they only hit the 50 query limit after the job was taking "several hours" to complete and parallel queries were introduced). That's two orders of magnitude below "thousands".


Two clear inefficiencies in the junior's code:

1. It is pushing the entire id table back and forth through the network connection, bit by bit. Replacing with a join completely eliminates this.

2. A query with an IN clause is (probably) doing a hash join under the hood to calculate the result of the IN clause. So the junior's code is effectively submitting a join query over and over, each time with a slightly different tiny chunk of data, rather than asking for the joined data once and processing the result in batches.

It is also worth considering if the entire data processing pipeline can be in SQL, but I can't tell if that's the case from the blog post.
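
The contrast in miniature (sqlite in-memory, schema invented for illustration):

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.executescript("""
        CREATE TABLE a (id INTEGER PRIMARY KEY, payload TEXT);
        CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER REFERENCES a(id));
        INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
        INSERT INTO b (a_id) VALUES (1), (3);
    """)

    # The manual version: pull the ids out, then ship them straight back in.
    ids = [r[0] for r in conn.execute("SELECT a_id FROM b")]
    marks = ','.join('?' * len(ids))
    print(conn.execute(f"SELECT * FROM a WHERE id IN ({marks})", ids).fetchall())

    # The join: same result, one statement, no id round-tripping.
    print(conn.execute("SELECT a.* FROM a JOIN b ON b.a_id = a.id").fetchall())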


Without a join, you are sending the entirety of Table B over the wire and processing it in Python. Then you send that back to the database, inline in a query, to do an ad-hoc join on Table A. The results are then sent back over the wire to the Python side. Notably, the result set may be small, and it may always be small. Table B may grow very large, and all of it will always be sent over the wire and processed in Python.

With a join, the database is able to do the join on Table A and Table B in place, using whatever indexes it already has built up. The only thing sent over and processed by the Python is the result set. Even if Table B becomes very large, only the result set is sent over the wire and processed in Python.

Without a join in the query, you're essentially having to replicate the kind of logic that already exists in the database engine, in Python. That is, the database engine is already doing chunking and parallelization for you.


Sure, I mean, I want to be clear—there's certainly a lot of inefficiency here. But my point is that the OP is talking about their data processing pipeline taking HOURS to handle queries for 50,000 IDs (50 parallel queries—1,000 IDs per query). When it comes down to it, I just don't believe that the inefficiency in serializing 1,000 numbers in python and then deserializing them in BigQuery has anything to do with the problems OP was experiencing. Remember—OP is probably going to be getting back 50,000 rows from the database anyway, the additional work here is linear with respect to the amount of rows retrieved. Sending 50,000 numbers over the wire to get back 50,000 rows worth of data is not an "hours long" processing cost, in the scheme of things.


> I'm not sure I understand how a JOIN would have fixed this problem

If the set of things in A that have ids in B is very small, then very little data is returned from the JOIN query, while a lot may be returned from the B query by itself.

(if that set of things is large, then you'll still want to batch the joined query as well, e.g. using find_each in rails. They're orthogonal requirements)


Sure, I mentioned this in my comment as well, but I think it's clear from the use-case of the article that the cardinalities of the two tables are approximately the same.


Fetching 50k rows is a no-brainer, both speed- and memory-wise. Relational DBs are super fast when used properly.


A join would have fixed the problem by pushing the logic inside of the database where it would have been optimized more easily. It would have traversed the indexes in parallel. Once.

But they didn't. Instead they made the database parse every single record, compile it, and then try to optimize it. Which will come up with a plan where you had to do index lookup after index lookup. That parsing and optimization overhead is probably most of your time. But even ignoring that, a single scan for `n` things in an index with `m` things winds up taking an average time `O(n log(m/n))`. Which is generally faster than the `O(n log(m))` of separate lookups.

This change saves a tremendous amount of work on the database, and therefore reduces contention for resources. That's database 101, and any competent DBA should be able to give you the lecture. As a programmer you might not understand how much of a difference it makes. But trust me, it does.

Now about data quantity. You're giving cargo cult advice on queries that is only sometimes going to be right. What is the actual tradeoff for find_each?

The one win is that you return limited data on each trip. 50,000 records really isn't that much these days, so I discount the win. But it can matter, particularly for memory constrained containers.

But what is happening inside the database if you fetch 50,000 records from a join, in batches of a thousand? As I understand, it uses limit and offset statements to figure out the result. But how does that work?

First, it calculates the join to find 1000 records and returns them.

Second, it calculates the join to find 2000 records, throws 1000 away, and returns the rest.

Third, it calculates the join to find 3000 records, throws 2000 away, and returns the rest.

And so on until it has found a full 1,275,000 records, of which it has thrown away 1,225,000 and returned 50,000. Guess what this means for total database work required? And the behavior is fundamentally quadratic. If you have 10x the data to process, your database has to do 100x the work.

There are definitely a lot of use cases where you need to batch records. But your batch size should be as large as you can comfortably use. And you need to realize that you're trading up-front memory for time and more work inside of the database.

The next time that you find yourself having to go down the "optimization tree", I strongly recommend considering whether you're in fact trying to put a patch on a self-inflicted wound. Try proper joins, indexes, and a larger batch size first. See how much of a difference that makes.

Alternately take advantage of the fact that you know your tables in a way that Rails doesn't. Order the results by primary key. Every time you fetch a batch, record the largest primary key you returned. Then instead of offset/limit on the next query, use a limit and a condition on the key. This will eliminate almost all of the duplicate work in most situations, at the cost of having somewhat more fragile logic.
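
That last idea (keyset pagination) in sketch form, with an invented table:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO events (body) VALUES (?)",
                     [(f"event {i}",) for i in range(10_000)])

    # Filter on the last-seen key instead of OFFSET: every batch is an index
    # seek, so there's no re-scanning of the rows already skipped.
    last_id, batch_size = 0, 1000
    while True:
        rows = conn.execute(
            "SELECT id, body FROM events WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            break
        last_id = rows[-1][0]   # largest key returned this batch
        # ... process rows ...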


> But they didn't. Instead they made the database parse every single record, compile it, and then try to optimize it. Which will come up with a plan where you had to do index lookup after index lookup. That parsing and optimization overhead is probably most of your time.

See https://news.ycombinator.com/item?id=34095480 for a more detailed discussion of where time is actually spent here, I think this short explanation glosses over a lot of important issues.

> As I understand, it uses limit and offset statements to figure out the result

You are incorrect. Your entire comment is based on a faulty premise. find_each uses an ordered primary ID column which can be queried efficiently using indexes.


Huh, I googled for how it worked and found a limit/offset explanation. Then I wandered around the source and verified what you said. I'm not a Rails programmer, so I did get that wrong. (Though I've seen that exact mistake over and over again when people are using microservices. So it is worth being aware of that in your APIs.)

But that said, your "detailed discussion" is going to be wrong for most databases that I've worked with. MySQL makes queries cheap. But PostgreSQL, Oracle, and so on make parsing expensive. Having to parse and try to optimize a good chunk of a MB of SQL is almost certainly more expensive than 50,000 individual index lookups. (The tradeoff is that the other databases are likely to produce better execution plans if you run the same query over and over again.)


Yes I remember inheriting a project where in a similar fashion people were allergic to join. So we got js code selecting entire table, looping over the rows and then doing inner loops with further selects. I eventually had to switch everything around to using joins. There seemed to be a huge disdain for SQL.. like if you ever endeavoured to try some raw SQL you were playing with matches. Sure OK but code that is handling db operations that inefficiently is 100x worse tho...


I tend to push for the other direction, especially if using JS/TS... template literals are so useful here...

    const foo: MyType[] = await db.query`
      SELECT ...
      FROM ...
      WHERE bar = ${baz}
    `;
And simply understanding how the queries work... very similar with Dapper in C#... I'm kind of all out against ORMs at this point.


Uhhhh, you should be extremely careful with string interpolation around DB statements. The code sample you posted is pretty much a textbook case of a SQL injection vulnerability if the value of ${baz} is ever provided by a user.


No, it isn't... db.query method receives the parameters separately from the string parts and will turn it into a parameterized query. You're confusing/conflating db.query`...` with db.query(``);

https://www.javascript.christmas/2020/11


Ah, sorry! I haven't used templates much and an interpolation inside a string triggered my "oh shit security problem" sense.


Can understand that... in general, have fought vigorously against ORMs (and DI/IoC tooling) in JavaScript, and use Dapper in C# with similar interfaces.

The template methods have allowed for some really powerful adaptations. Mostly in Database/SQL, XML/HTML, and JSS interpreters.


I sling a lot of SQL, and, mirroring a lot of peoples sentiment here, wish it had better syntax and composability.

DuckDB and Apache spark expose nice apis that almost completely remove the need to faff around with textual strings. Each projection returns a view that can be treated like another table, so composition and reuse is simple.. It would be nice if such a thing were more standard and available on the other dbms that I have to work with.
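
For example, with DuckDB's Python relational API (a small sketch):

    import duckdb

    # Each step returns a relation that composes like another table;
    # no string splicing required.
    events = duckdb.sql("SELECT range AS n FROM range(100)")
    evens  = events.filter("n % 2 = 0")        # still a lazy relation
    top    = evens.order("n DESC").limit(5)
    print(top.fetchall())  # [(98,), (96,), (94,), (92,), (90,)]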

I feel like, in the continuum of abstraction, SQL is like opengl 3.. high level and a bit inflexible. Taking the analogy further, an ORM would be like the game engine on top of opengl.. What doesn't exist, as far as I know, is the Vulkan equivalent: a low-level API that exposes the relational algebra and exactly how to execute it. There are cases where I would have saved a lot of effort if I could just write the damned physical plan for a query execution myself rather than rearranging table join orders and sending hints that the query optimizer is just going to passive-aggressively ignore anyway.


Don’t TVFs accomplish the level of composability you’re describing? They give exactly a “view that can be treated like another table”.

https://cloud.google.com/bigquery/docs/reference/standard-sq...


An SQL query goes into a bar, walks up to two tables and asks: May I join you?


I would call it “for want of reasonable hiring and onboarding processes”. How does someone get into a data engineering job without any knowledge of SQL and doesn’t even get basic onsite training?


Largely due to the fact that "Data Engineer" is defined differently at every company. Some want an SRE, some want a Database Architect, some want a software engineer that knows some SQL, others want only SQL junkies.

As a result, I have picked up a variety of skills to fit whatever my company dictated a Data Engineer should handle.


"If you encounter an unusually round system limit, you’re probably using the system in a way its designers never imagined."

Haha, so true. We triggered a static code analyzer error "Cyclomatic Complexity bigger than 1.000.000.000!". The vendor was very interested in that code snippet (generated classifier code) and we shared a good laugh.


We hired a data engineering consulting company and none of their team of SQL experts had heard of upsert or merge. I find it weird that people don't spend a bit of time searching for a better way of doing stuff before just jumping into a long, hard way of doing things.
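
For reference, an upsert in one statement (the ON CONFLICT spelling shown works in SQLite 3.24+ and Postgres; SQL Server and Oracle spell the same idea MERGE):

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

    upsert = """
        INSERT INTO kv (k, v) VALUES (?, ?)
        ON CONFLICT (k) DO UPDATE SET v = excluded.v
    """
    conn.execute(upsert, ('greeting', 'hello'))
    conn.execute(upsert, ('greeting', 'hello again'))  # updates instead of failing
    print(conn.execute("SELECT * FROM kv").fetchall())  # [('greeting', 'hello again')]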


Unfortunately many data engineers don't know basic stuff about SQL and databases, but are experts in ETL tools and data warehouses where such features are not relevant or don't exist. You were probably looking for a DBA, which is something different.


"Don’t let junior SWEs get 2000 lines into a change before submitting a pull request."

This is good advice. Share your code early and often so you can get feedback before you're fully committed to one approach.


Yeah, this is insane to me. 2000 lines written well in an expressive programming language is a small but complete library. It's 15 to 30 code files. 2000 lines is just about enough to write a complete spelling and word-use checker with a persistence layer. It's enough to write a generic mass-balance process simulator, or a web security library, or a moderately complex workflow-management library, or a toy game. 2000 lines is a week or two of work. If your junior SWE is writing 2000 lines of code before anyone looks at it, you're basically letting them be raised by wolves.


In the spirit of HN I should confess I’ve done exactly this, not the using Python to feed data back to SQL, but writing terrible hacks to get around resource limits on deadlines.

On a recent project I needed to process a couple years of data for a hard deadline of Monday, and it was Friday. Our DB had a query timeout and a resource memory limit which blocked doing the full analysis without building new data models, and shipping those would take days. The deadline couldn't be moved, so hacks were needed.

The solution: write some Python code to generate one query per week of data going back two years (over 100 queries), save the results to individual scratch tables, and then use a second query to union all the results together in our BI tool.
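
Roughly this shape (every identifier here is hypothetical):

    from datetime import date, timedelta

    # Fan one oversized query out into per-week scratch tables; the BI tool
    # then UNIONs scratch.results_* back together.
    week, end = date(2020, 12, 7), date(2022, 12, 5)
    statements = []
    while week < end:
        nxt = week + timedelta(days=7)
        statements.append(f"""
            CREATE TABLE scratch.results_{week:%Y_%m_%d} AS
            SELECT user_id, COUNT(*) AS n
            FROM warehouse.events
            WHERE ts >= '{week}' AND ts < '{nxt}'
            GROUP BY user_id
        """)
        week = nxt
    # each statement now fits under the query timeout and memory limits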

Of course the first time I ran it serially it was too slow, so I parallelized it. That was too many queries so I added a limit. Then one query failure broke the whole thing so I added retries… by the end of the day it looked exactly like this article.

It worked though! I got all the data we needed processed for Monday, I presented it to our execs and our project was approved. We only needed to manually run that script once more before I built the real solution and deleted the script.


This reminds me of some log parsing code I wrote that had to batch download sets of logs from our log provider, and then used some gnu parallel, jq, and some general shell tools to spit things out (may have thrown the data into some format and used textql for the end of it.)

(I can't recall why the general log searching tools we had didn't work in this situation, I think it was because I needed to get data from a lot of disparate logs at once, or it was driven by having to make lots of separate downloads of the logs.)


I'm gonna be the contrarian and say this is mostly fine. We can research the proper way to do things, or use code review to teach about the proper ways. But this can lead to code shaming and a fearful environment where people second guess themselves and spend a lot of time chasing a perfection that doesn't move the business metrics.

In this case, doing the join manually isn't a huge deal, chunking isn't a huge deal, parallel requests isn't a huge deal. But "concurrent limit reached" is the point in this story where Bob should have put on the thinking cap and reasoned that "this shouldn't be hard, other people do things like this with bigger datasets all the time, I wonder how". Before that point it's literally just a matter of changing a couple lines to solve the issue. So what? After that point however, it's starting to affect the overall design around it in harmful ways, and turning the issue into a bigger one.


And that's how we end up with chat apps that consume gigabytes of RAM, and blinking cursors using one whole CPU core.


That just means he did the parallelization incorrectly, there should be a fixed maximum number of concurrent requests.


I got a chuckle out of "With the exception of NPM modules, most tools are designed to solve problems, possibly the ones you have," but have to agree with some other commenters that a better title for this would have been "for want of a mature code review process"


I have to wonder if the code review process wouldn't be as big a deal if they hired a proper DBA. This problem would not have been an issue if the DBA had been told "we need to do X," and the DBA would craft a stored procedure for X.

I get that stored procedures aren't a cure-all, and sure, they can get out of hand, but doing this stuff in code is often worse than letting the DB do its job.


This is more common than you would believe. Other issues I've seen: no `limit` on the query, fetching all the results and then sorting in your own app code, using the wrong joins. Many of these happen while using ORMs as well. SQL is a context switch for most devs and very few understand it, and even those that do might not be familiar with the capabilities of your startup's db choice.

Shameless plug but this was my motivation behind building GraphJin, a GraphQL to SQL compiler, and it's my single go-to force multiplier for most projects. https://github.com/dosco/graphjin


The principle is that you should be using the underlying API if something is already solved on a lower level, and not replicate the functionality on a higher level, because it will perform poorly. There was a reason why the lower level API exists in the first place.

This applies to graphics programming very well: it goes without saying that you wouldn't write your own pixel rasterizer instead of using DX, OpenGL or Vulkan, for example.

The big recognition is that when doing business apps, SQL database functionality is the underlying API, and you should prefer using that.


> The principle is that you should be using the underlying API if something is already solved on a lower level

I think you're making a great point but I want to consider what this suggests about ORMs. Using an ORM means you're not directly using the underlying API. In theory an ORM should be a very small "distance" from the underlying API. When that's the case, they are a no-brainer. But no ORM has 100% feature parity and for more complex queries this distance from the underlying API can grow considerably. And if you insist on ONLY using the ORM API then you're going to find yourself doing some pretty dumb shit in the application layer.

Personally I think ORMs are great, but there is this common problem of over-insisting on their API and treating raw SQL as the devil.


Back in the days of MySQL with the MyISAM engine, it was sometimes way faster on big datasets and underpowered DB servers to do the query exactly this way. Even with all the correct indices in place, the JOINs (especially if more than one table was in play) would often just freeze the server for 15-20 minutes, while joining data at the app level in a for loop with lookup tables for ids would typically take only a few seconds. Obviously this has been an obsolete hack for a long time now...


I can imagine that, but MySQL didn't really start to become a competitive relational database until the 3.23 timeframe (around 2000), and it is hard to imagine MyISAM (with no actual transaction support) being used for a production database except under very carefully controlled circumstances.

To some degree that was true of a lot of earlier competing databases as well, which tended to take escalating locks on everything from the page level on up just to implement basic read consistency. So any transaction of any type could easily lock up a random set of unrelated rows if not entire tables until completion.


My key takeaway here is that not spending an hour reviewing code probably cost man-days worth of work.

The technical capabilities are all there on the team, from description. What was probably missing is someone both technical and assertive, who could politely say to the deadline setters "This is fucking stupid and it's not going to work".


> Don’t let junior SWEs get 2000 lines into a change before submitting a pull request.

Why junior SWEs and not all SWEs?


This doesn't really add up to me.

The article explains how the original bad code gets checked in which seems plausible enough.

But that doesn't explain why the first fix wasn't to just start using a JOIN? Or the second fix.

I guess it's a made up story, to make a point? Anyway, I found the plot holes distracting.


This is an incredibly common thing. The worst I've seen it is when people drop their SQL DB for a No-SQL thing (for no good reason) and then end up implementing all the joins they lost in the application :(


Ragequitting a company and calling it a "trainwreck" because one developer didn't know about JOIN seems... extreme.


That's an excellent description of Contagion caused by tech debt.

How much does this problem grow and spread the longer it goes unfixed?


Wait until he hears about LEFT JOIN


We've all been Bob at least once in our life. By 'we' I really mean me.


Shipping it the first time may have made sense. Second time? Not so much.


how is CSV not "wrap the fields in double quotes and join them with commas and newlines"?


Escaping.


isn't that covered by wrapping in double quotes?


Not if the content contains a double quote


well it's not really wrapping it if it's not escaped.

Regardless it is all trivial. Not sure what the point of that comment was.
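
For the record, the rule being danced around: RFC 4180 doubles embedded quotes, which the naive wrap-and-join misses. Python's csv module applies it for you:

    import csv, io

    field = 'she said "hi", then left'
    naive = '"' + field + '",plain'   # embedded quote makes this ambiguous

    buf = io.StringIO()
    csv.writer(buf).writerow([field, 'plain'])
    print(buf.getvalue())  # "she said ""hi"", then left",plain  (quote doubled)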



