The legacy back-end system being migrated was Clipper + dBase III on DOS, which is reminiscent of COBOL.
The part I added was built with ASP.NET 2.0 on top of Microsoft SQL Server 2005, and was eventually upgraded to 4.0 and 2008 respectively.
The only magic sauce was using SQLCLR to embed a few small snippets of C# code into the SQL Server database engine. This let the full-text indexing be specialised for the high-level data partitioning. Without it, searches would have taken up to ten seconds; with the custom search, the p90 response time was about 15 milliseconds! I believe PostgreSQL is the only other popular database engine out there that allows this level of fine-tuned custom indexing.
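For anyone who hasn't touched SQLCLR: the database loads a compiled .NET assembly and exposes its static methods as ordinary T-SQL functions. A minimal sketch of the wiring, with the assembly and function names invented for illustration:

    -- Enable CLR integration (off by default in SQL Server 2005).
    EXEC sp_configure 'clr enabled', 1;
    RECONFIGURE;

    -- Load the compiled C# assembly into the database.
    CREATE ASSEMBLY SearchHelpers
    FROM 'C:\deploy\SearchHelpers.dll'
    WITH PERMISSION_SET = SAFE;

    -- Expose a static C# method as a T-SQL scalar function that
    -- the search queries can call directly.
    CREATE FUNCTION dbo.NormalizeSearchTerms (@terms NVARCHAR(4000))
    RETURNS NVARCHAR(4000)
    AS EXTERNAL NAME SearchHelpers.[SearchHelpers.Indexing].NormalizeSearchTerms;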
p90 for a full-text search on 330 million documents was 15ms?
I know you can tune the hell out of search performance, but that seems a bit too insane for what looks like a relatively unspecialized setup (Standard DB).
Not likely the full book text, just title, author, and a few other low-cardinality fields, I'm sure. Also not likely 330 million unique volumes, but the total count of copies across all the libraries. This is within reach of a single database with proper indexing.
I simply added the "library id" as a prefix to almost every table's primary key. Every lookup specified it with an equality filter, so essentially it was thousands of standalone libraries in a single schema.
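In table terms it looked something like this (table and column names invented here):

    -- The clustered primary key leads with LibraryId, so each
    -- library's rows are physically grouped together.
    CREATE TABLE dbo.Books (
        LibraryId INT           NOT NULL,
        BookId    INT           NOT NULL,
        Title     NVARCHAR(400) NOT NULL,
        Author    NVARCHAR(200) NOT NULL,
        PRIMARY KEY CLUSTERED (LibraryId, BookId)
    );

    -- Every lookup pins the prefix with an equality filter, so a
    -- query only ever touches one library's slice of the index.
    SELECT BookId, Title, Author
    FROM dbo.Books
    WHERE LibraryId = @LibraryId
      AND Title LIKE @TitlePrefix + N'%';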
One hiccup was that when the cardinality estimator got confused, it would occasionally ignore the partition prefix and do a full scan somewhere, inflating the rows examined by a factor of about 2000x! This caused dramatic slowdowns at random, and the engine would then often cache the inefficient query plan, keeping things slow until the server was restarted.
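For anyone hitting the same thing today, the usual countermeasures are query hints along these lines (a sketch reusing the invented table from above; OPTIMIZE FOR UNKNOWN only arrived in SQL Server 2008):

    -- Compile a fresh plan on every execution, so a plan built for
    -- one unrepresentative library can never get cached.
    SELECT BookId, Title
    FROM dbo.Books
    WHERE LibraryId = @LibraryId
    OPTION (RECOMPILE);

    -- Or plan for an 'average' value rather than sniffing whatever
    -- parameter happened to arrive first.
    SELECT BookId, Title
    FROM dbo.Books
    WHERE LibraryId = @LibraryId
    OPTION (OPTIMIZE FOR (@LibraryId UNKNOWN));

    -- The blunt 2005-era fix once a bad plan was already cached:
    DBCC FREEPROCCACHE;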
This is a very deep rabbit hole to go down. For example, many large cloud vendors have an Iron Rule that relational databases must never be used, precisely because they're concerned about this issue occurring, only at a vastly greater scale.
I could have used SQL Server's native table partitioning, but I discovered it had undesirable side-effects for some cross-library queries. However, for typical queries it would have "enforced" use of the partitioning key, side-stepping the problem the cloud vendors worry about.
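For reference, native table partitioning looks roughly like this (boundary values invented):

    -- Map ranges of library ids onto partitions...
    CREATE PARTITION FUNCTION pfLibrary (INT)
    AS RANGE RIGHT FOR VALUES (1000, 2000, 3000);

    -- ...and map those partitions onto filegroups.
    CREATE PARTITION SCHEME psLibrary
    AS PARTITION pfLibrary ALL TO ([PRIMARY]);

    -- A table created on the scheme is physically split by
    -- LibraryId: queries filtering on it touch one partition,
    -- while cross-library queries must visit all of them.
    CREATE TABLE dbo.BooksPartitioned (
        LibraryId INT           NOT NULL,
        BookId    INT           NOT NULL,
        Title     NVARCHAR(400) NOT NULL,
        PRIMARY KEY CLUSTERED (LibraryId, BookId)
    ) ON psLibrary (LibraryId);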
Modern versions of SQL Server have all sorts of features to correct or avoid inefficient query plans. E.g. Query Store can track the "good" and "bad" plans for each query and, after sufficient samples, start forcing the good one. That would have been useful back in 2007, but it wasn't available, so I spent about a month doing the same thing by hand.
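A sketch of the modern equivalent (Query Store shipped in SQL Server 2016; the query and plan ids below are placeholders you'd read out of the Query Store views):

    -- Start recording every plan the optimizer produces, along
    -- with its runtime statistics.
    ALTER DATABASE CURRENT SET QUERY_STORE = ON;

    -- Pin the known-good plan once you've spotted the regression
    -- in the captured history.
    EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 7;

    -- SQL Server 2017+ can even do the forcing automatically when
    -- it detects that a new plan performs worse.
    ALTER DATABASE CURRENT
    SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);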
This makes the performance a lot more understandable if you're only searching in a single library. I assume that cuts out >99.9% of those 330 million documents.