Demand the impossible: Rigorous database benchmarking (erthalion.info)
58 points by erthalion on Dec 31, 2023 | hide | past | favorite | 15 comments



It's worth noting that the author is also the same person who gave the talk "Modern B-Tree Techniques" at Strange Loop, which is more or less the de facto overview of B-Trees as they relate to modern databases.

https://www.youtube.com/watch?v=4ELJDEjDpqk


Rigorous database benchmarking is indeed very difficult and time-consuming. I spent the last ~7 years working on benchmarks for graph processing systems in the Linked Data Benchmark Council (LDBC) [1], originally established in 2012 as an EU research project.

LDBC creates TPC-style application-level database benchmarks which can be used for system-to-system comparison. We provide detailed specifications, data generators, benchmark frameworks, and multiple reference implementations. The benchmarks are implemented by vendors for their database products, and the implementations are submitted to be run by independent third-party auditors, who verify their correctness and reproducibility.

We have found that there is a market for audits for graph processing systems, albeit a small one: over the last 4 years, we have published 34 audited results, see [2], [3], and [4].

A major problem we face is that the process of implementing the benchmark for a system and getting an audited result is long (and therefore expensive). Vendors spend months implementing and tuning the benchmarks for their systems. It is also typical for the auditor to spend 50+ hours on the auditing process, which includes a lengthy code review step, setting up the system, running the experiments, testing ACID properties, writing a report, etc. The length of the process is exacerbated by the lack of widely-adopted standard graph query languages, which means the auditor may need to learn a new language for every new audit.

We have tried to mitigate this problem by improving our documentation, creating more reference implementations, and distributing pre-generated data sets. There are new standard graph query languages (SQL/PGQ, GQL), but their adoption is currently very limited. Overall, the auditing process is still quite long, which is mainly caused by the essential complexity of the problem: implementing an application-level benchmark and getting reliable results is very difficult.

[1] https://ldbcouncil.org/introduction/

[2] https://ldbcouncil.org/benchmarks/snb-interactive

[3] https://ldbcouncil.org/benchmarks/snb-bi/

[4] https://ldbcouncil.org/benchmarks/spb/


Wouldn't a vendor tuning the benchmark be ... cheating?


Vendors are tuning their implementation (e.g. the schema, indexing, and queries used to implement the benchmark) and also their database system, but they cannot tune the benchmark workload itself. Apologies if this was ambiguous in my comment.

It is quite typical that vendors find performance issues in their systems when they first adopt the benchmark. This then leads to improvements in their operators, optimizer, storage, etc. These improvements will be included in a new release of the system – which is great, as a main point of the benchmark is to drive innovation and raise the general level of performance in the technology space.

Importantly, audits can only be conducted on systems that are generally available. We don't allow audits on some one-off version of a system that has special features just to perform well on the benchmark (these are known as "benchmark specials" and are likewise prohibited in TPC benchmarks).


I've worked on a similar testing process (but for filesystems): https://www.spec.org/sfs2014/ (and it has increments every so many years).

"Tuning the benchmark" is probably "tuning your system to score high on the benchmark". There are always caveats, of course. You, as a developer, may be able to find some implicit assumptions in the test and exploit them to "game" it. Suppose, for example, the test assumes that acknowledging deletion means deletion completed, but your code simply schedules a file for deletion, instead of actually deleting it, before the acknowledgement is sent.

You'd have to argue somehow that your system behaves in such a way that scheduling and deleting are essentially the same thing, which would allow you to pass the test, but that's about the level of "cheating" you can do on these tests.

Also, given how these systems are sold, you don't really want to pull your numbers out of nowhere. Unsophisticated fraud of that kind will get you disqualified very quickly, while building such a system is usually a five-plus-year investment before you even have something to show a customer. So it's a large investment that you don't want to risk.


Demand it at your peril: players like Oracle are infamous for EULAs that expressly forbid publishing your findings.


They should be included in any and all benchmarking results anyway, but just ranked last on everything, with a little star note explicitly stating that Oracle prohibits publishing measured performance.


For such cases, we usually write "Database A" or similar, and the reader can easily guess which one it is. Maybe also include a benchmark where the database performs poorly, so the vendor has no incentive to confirm that you were referring to their product.


This is a really good article to read if you're interested in any kind of systemic approach to load-testing, not just benchmarking databases.

- open/closed load generation strategies

- nature of the systems-under-test being almost always that of a network of queues

- behaviour of systems-under-test tending to be non-normal and multimodal
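The open/closed distinction in the first bullet can be sketched with a toy single-server simulation (my own illustration, not from the article; all names and numbers are made up):

```python
import random

def open_arrivals(rate, n, rng):
    # Open model: a Poisson process fires requests at `rate` per second,
    # regardless of whether earlier requests have completed.
    t, out = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)
        out.append(t)
    return out

def closed_arrivals(n, service_time, think_time=0.0):
    # Closed model (single client): the next request is only sent after the
    # previous response arrives, so the offered load adapts to the server.
    t, out = 0.0, []
    for _ in range(n):
        out.append(t)
        t += service_time + think_time
    return out

def fifo_latencies(arrivals, service_time):
    # Single FIFO server with deterministic service time; returns
    # per-request latency (queueing delay + service).
    server_free, lat = 0.0, []
    for t in arrivals:
        start = max(t, server_free)
        server_free = start + service_time
        lat.append(server_free - t)
    return lat
```

Under overload, the open generator keeps injecting requests and latency grows without bound, while the closed generator self-throttles at the server's pace; a closed-loop benchmark can therefore badly underestimate the tail latency a real open workload would see.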


interesting but not very practical?

let's take the max_wal_size parameter, for example.

for most systems it's much simpler to set max_wal_size to a value high enough that it doesn't force checkpoints under load.

so get enough free disk space and allocate maybe 5% of it. look at the logs and increase the value if the logs show checkpoints being forced by WAL size.

(the downside is that crash recovery will take longer, since it has to replay more WAL)
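As a sketch, the "just make it big" approach above might look like this in postgresql.conf (values are illustrative, assuming a disk where ~5% of free space for WAL comes out around this size):

```
# Size the WAL high enough that checkpoints are triggered by
# checkpoint_timeout, not by WAL volume (which forces early checkpoints).
max_wal_size = 32GB

# Log checkpoint activity; the "checkpoints are occurring too frequently"
# warning in the logs is the signal to raise max_wal_size further.
log_checkpoints = on
```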


very good example

for each setting, we need to remember trade-offs

and in this particular case, the trade-off is increased recovery time after potential crashes

it is possible to conduct another benchmark that measures this recovery time, showing how it grows in the "avg" case (normal TPS) and in the "worst"* case (increased TPS, e.g. a massive UPDATE with random IO access), and collect another set of interesting data; the two sets can then be combined to support decisions

as a result, we will choose something like: we decided to use 32 GiB for max_wal_size and 15 min for checkpoint_timeout, and we know that if we crash in the worst* case, the DB will need up to 10 minutes to recover; but the benefit is much less disk IO stress during massive writes.

___

*) "worst" here has two levels that show up when writes are massive with a random IO pattern of block writes:

- excessive writes from frequent checkpoints

- additionally, more WAL needs to be written if full_page_writes=on, since each page modified for the first time after a checkpoint is logged in full
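Written down as a config sketch, the decision described in the comment above might look like this (the 32 GiB / 15 min numbers come from the comment and are an example, not a recommendation; the completion target is my addition):

```
max_wal_size       = 32GB    # fewer forced checkpoints during massive writes
checkpoint_timeout = 15min   # accepted cost: up to ~10 min crash recovery in the worst case
checkpoint_completion_target = 0.9  # spread checkpoint IO over the interval (common practice)
```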


[flagged]


Care to back your comment up and say why? Otherwise it's just boring trolling.



I like IBM’s clause, seems the most reasonable.


> Unknown unknowns, e.g. we have no idea that we run the database on a buggy Linux Kernel version, where buffered IO in a cgroup is slow due to wrong memory pressure calculation and constant page reclaiming.

?



