
> In the example the author gives the query cost is very likely dominated by finding matching rows in A. Where there is no index, then we can expect a full scan of A (or the index of A.id) for every batch of B.

Why would you expect that there's no index? I have never seen a single database system where the most basic primary key A.id wasn't indexed. Instead, I would expect that you're correct below that the cost of the query is dominated by fetching the rows from disk and serializing them. This is a linear cost that increases with the number of rows returned, so fetching 50,000 rows should be about 50x as slow as fetching 1,000 rows (especially if you fetch them in some block-cache-amenable order, such as increasing ID order, so that most reads hit sequential places on disk rather than random blocks).
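
To make the "fetch matched rows in increasing ID order" point concrete, here's a minimal sketch using sqlite3 (table and column names A, B, payload, a_id are assumed for illustration, not taken from the thread):

```python
import sqlite3

# Hypothetical schema: A(id PRIMARY KEY, payload), B(a_id).
# The claim above: fetch cost scales roughly linearly with row count,
# and ordering by id keeps page reads mostly sequential.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE B (a_id INTEGER);
""")
conn.executemany("INSERT INTO A VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(1, 101)])
conn.executemany("INSERT INTO B VALUES (?)",
                 [(i,) for i in range(1, 101, 2)])  # 50 odd ids

# ORDER BY A.id means the scan touches pages in increasing order
# rather than seeking to random blocks.
rows = conn.execute("""
    SELECT A.id, A.payload
    FROM A JOIN B ON A.id = B.a_id
    ORDER BY A.id
""").fetchall()
print(len(rows))  # 50
```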

> In addition, and probably more to the point, the round trip database costs (serialisation, parsing, planning, scheduling, network comms) are going to dominate the actual query costs for something like this (unless A is exceptionally large).

Aside from a small fixed overhead, serialization, parsing, and network comms all increase linearly with the amount of data returned: 50,000 rows will cost about 50x the serialization and network time of 1,000 rows.

> Furthermore, the memory cost to the DB of the serialisation and parsing is likely to be much larger than just storing all those ids in their native format - and there would be no client memory footprint in a join. For the final result set the client can reduce their memory footprint by using a streaming result which every BigData DB supports, and most others too

Sure, I can absolutely agree that using a streaming result set would be the best of all possible worlds here. However, it does require you to keep a client connection open for 50x longer than batching would, which on many databases (e.g. Postgres) would lead to more memory usage and CPU contention than batching the work through a background job queueing system.

This comes down to what fraction of your total pipeline is spent in the database in question versus data processing or other databases. If only 20% of your job's runtime is fetching the rows from this database, then it's a bad idea to monopolize that DB's memory for the much larger amount of time it takes you to process the entire result set, when instead you could be yielding that memory back to the system for other transactions to use. But if 80%+ of your time is spent in the database, then the small amount of resource other transactions could reclaim wouldn't be worth the fixed overhead of re-planning, re-executing, re-fetching the index from cache, etc. As you may have guessed, my experience here is rooted in OLTP workloads on Postgres, and I'm sure there are plenty of differences with BigQuery's architecture.
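
For what it's worth, here's a minimal sketch of the streaming pattern, using sqlite3 as a stand-in (on Postgres you'd use a server-side cursor, e.g. psycopg2's named cursors, so the server ships rows incrementally instead of materializing them all). The trade-off described above: the one cursor and connection stay open for the whole processing run, pinning server resources that batching would release between chunks.

```python
import sqlite3

# Toy table standing in for the real result set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?)",
                 [(i, str(i)) for i in range(10_000)])

# One long-lived cursor; client memory stays bounded per chunk, but the
# connection is held open until the last chunk is processed.
cur = conn.execute("SELECT id, v FROM a ORDER BY id")
seen = 0
while True:
    chunk = cur.fetchmany(1000)
    if not chunk:
        break
    seen += len(chunk)  # real processing of the chunk would go here
print(seen)  # 10000
```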



On the off chance you see this:

> Why would you expect that there's no index?

I don't. Whilst I didn't write it particularly eloquently, I did say it would be an index scan if there was one. And like the full table scan, this is a 50 vs 1 cost (unless the queried ids are well sorted, at which point you'd maybe get a 5 vs 1 cost at best).

> I have never seen a single database system where the most basic primary key A.id wasn't indexed.

Nobody has said it was a primary key. In fact it very likely isn't. All we really know is that approx 50k rows were selected from B; we do not know how many are matched in A.

As they are using BigQuery and the queries are taking such a long time, it would be reasonable to assume A is some large dataset clustered around some other value (e.g. timestamp). But that itself would be an assumption.

> [streaming] does require you to keep a client connection open for 50x longer than batching would,

It does not. It will be less time.

---

Looking at pg.

I'm trying really hard to see your point. As far as I can tell, you're bothered by the working-set memory of the join exceeding a limit and causing contention - this is the only case where the streamed join is worse than batching. With an index join, that would require a very large table.

As for CPU contention - it's a non-issue.

There may be a point related to time-to-execute with respect to lock contention.

Regardless, if either lock or index memory contention are problems for you then you will still want to `JOIN` - just against a subquery/cte with limit and offset.
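
A sketch of that pattern, again with sqlite3 and assumed table names A/B: each batch is still a single server-side `JOIN`, but against a limited subquery of B, so the client never round-trips the ids.

```python
import sqlite3

# Hypothetical schema: A(id PRIMARY KEY, payload), B(a_id).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE B (a_id INTEGER);
""")
conn.executemany("INSERT INTO A VALUES (?, ?)",
                 [(i, f"p{i}") for i in range(1, 1001)])
conn.executemany("INSERT INTO B VALUES (?)",
                 [(i,) for i in range(1, 1001)])

# JOIN against a limited/offset subquery: the batching happens inside
# the database, not by shipping ids back and forth.
batch_sql = """
    SELECT A.id, A.payload
    FROM (SELECT a_id FROM B ORDER BY a_id LIMIT ? OFFSET ?) AS b
    JOIN A ON A.id = b.a_id
    ORDER BY A.id
"""
out = []
offset, limit = 0, 250
while True:
    rows = conn.execute(batch_sql, (limit, offset)).fetchall()
    if not rows:
        break
    out.extend(rows)
    offset += limit
print(len(out))  # 1000
```

(On large tables, keyset pagination - `WHERE a_id > ?` - avoids the re-scan cost that grows with OFFSET.)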

Roundtripping is not the answer!



