NoisePage – Self-Driving Database Management System

etaioinshrdlu · on Nov 3, 2020

It seems like "Self-Driving" to the authors means simply that it has a query optimizer using ML.

I would counter that idea by suggesting that a more foolproof database would also incorporate truly solid auto-scaling, and global replication, so that one can just throw immense amounts of data into tables, while retaining fast CRUD operations worldwide, without any DBA support required. A little bit like the direction CockroachDB and TiDB are headed, perhaps.

apavlo · on Nov 4, 2020

> It seems like "Self-Driving" to the authors means simply that it has a query optimizer using ML.

Nope. Please see this article for my definition of self-driving: http://www.cs.cmu.edu/~pavlo/blog/2018/04/what-is-a-self-dri...

brainless · on Nov 4, 2020

I had come across peloton before in my research in this domain as I mentioned in another comment that I am building a product with similar long-term goals. Congrats on the work, would love to see schema also take care of.

jasonwatkinspdx · on Nov 3, 2020

This is part of an existing line of research going back decades. Fully self tuning database engines have proven to be a lot more tricky than you'd assume.

It's not the same thing as horizontal scalability. If you want to build a Spanner/Cockroach/Ti style system, you'll face the questions they're examining in detail in your per node storage engine.

The empirical evaluation paper in particular is interesting and useful stuff.

brainless · on Nov 4, 2020

Does not seem like that to me at all. Just their about page says they are taking care of the entire deployment and lifespan of the database - indexes to hardware capacity. I would like them to take care of schema too.

_alex_ · on Nov 4, 2020

Query planners unexpectedly “helping” can lead to very real availability and performance problems. For large scale workloads, I really want to know what my database is doing. Having it suddenly flip an index for reasons no one can predict or explain because some ML algorithm decided it was the right thing sounds incredibly dangerous.

SomaticPirate · on Nov 4, 2020

One of the authors is famous for his love of the Wu-Tang Clan.

Cool to see bits of this love in his code. https://github.com/cmu-db/noisepage/blob/ed86192c8ea5c7a0754...

aratno · on Nov 3, 2020

ML-assisted query optimization is super interesting to me (I've had to fight the uphill battle against anomalies in Postgres stats and estimations[1]), but I'd also love to see more optimization and auto-tuning across the "stack":

- Use column-oriented storage for columns that are frequently scanned in analytical queries

- Automated index management, based on user-provided boundaries for write throughput, disk footprint, query latency

- Calibrated optimism for heap updates. AFAIK, current DBs are either optimistic (assume transactions are more likely to commit than rollback) and update the heap in-place, and write the old value elsewhere in case of rollback, or pessimistically write all updates in new, versioned tuples and let MVCC (and garbage collection) handle the rest. Would be interesting to see the performance improvement that could come from modeling commit / rollback outcomes and optimizing accordingly.

- Using variable-sized pages to reduce overhead for older pages, as a sort of light cold storage

Anyone know of any DBs that automatically tune these aspects?

[1]: http://www.vldb.org/pvldb/vol9/p204-leis.pdf

brainless · on Nov 4, 2020

I love this project. This is part of my long term goal for my main product [1] which is all about making databases easier to work with for everyone while abstracting away SQL.

The use of ML/RL is exactly the method I have in mind for long term. As I am not from that background, so my immediate goals are more like a set of best practices and migration management with a Python layer (SQLAlchemy, sqlalchemy-migrate and automation on top).

[1] https://github.com/brainless/dwata

carlineng · on Nov 4, 2020

Somewhat related is Pavlo's musings on Naming a Database Management System [1]:

"In my opinion, the best DBMS names from the last thirty years are Postgres[4] and Clickhouse.

These two names did not mean anything before, but now they only have one connotation. There is no ambiguity. There is no overlap with other DBMS names or non-database entities. They are easy to spell correctly[5]. Everyone refers to them by their full name. Nobody mistakingly calls them "PostgresDB" or "ClickhouseSQL."

After reflecting on what makes them so good, I realized what the secret was to them. They are a two-syllable name that is derived from combining two unrelated one-syllable words together (e.g., Post + Gres, Click + House). Each individual word has its own meaning. It is only when you put them together does it mean the database.

Given this, I henceforth contend that the best way to name a DBMS is to combine two one-syllable words."

Noise + Page

[1]: https://www.cs.cmu.edu/~pavlo/blog/2020/03/on-naming-a-datab...

chrismorgan · on Nov 4, 2020

> They are easy to spell correctly[5]. Everyone refers to them by their full name. Nobody mistakingly calls them "PostgresDB" or "ClickhouseSQL."

… and yet that article misspells both of them, consistently not calling one of them by its full name (though it acknowledges this in the fourth footnote, with the relevant history).

Postgres is actually PostgreSQL.

Clickhouse is actually ClickHouse.

Gluing two words together guarantees that a significant percentage of your users will misspell it. (I’m categorising incorrect capitalisation as a misspelling.) Some will follow the declared spelling, and some will incorrectly capitalise, regardless of what you declare to be true, either introducing spurious capitals or lowercasing authentic capitals.

Some entities change the spelling of their name over time. Some are inconsistent by accident, which I don’t really understand (I could never do it myself; yet I observe it happens quite commonly). But you know what really grinds my gears? When the original source is deliberately inconsistent in its spelling, refusing to declare a canonical spelling, as is the case with sauceHut.

edulix · on Nov 4, 2020

Let me first say sorry. I know it's a bit unrelated.. but it seems like the right people are here. Is there any kind of open source database that is close to Amazon's QLDB? ideally as a postgresql plugin.

Also: is there any "distributed database system" that allows to synchronize for example some tables (or even specific fields) of a server-side database with a third-party client (which might be sqlite vs postgresql in the server side), so that only part of the database is synchronized, in a secure way? this "slave" should not be trusted, think about it as a web user.

taywrobel · on Nov 4, 2020

Looks like Peloton finally got too popular for Andy Pavlo to keep using their name for the project.

Sort of a shame, I thought the old name was a more apt analogy for the goals of the system.

samsquire · on Nov 3, 2020

This is really cool.

Does anybody know how you implement multiversion concurrency control?

I'm trying to implement it in a toy project that simulates bank accounts withdrawing and depositing money in a multithreaded scenario where no money should be lost or created.

I found that the algorithm only worked when I considered only write timestamps. If there is a write timestamp that is greater than this commit's write timestamp, then abort and retry.

apavlo · on Nov 4, 2020

> Does anybody know how you implement multiversion concurrency control?

I have lectures available on Youtube:

Intro DB Class:

* https://15445.courses.cs.cmu.edu/fall2019/schedule.html#nov-...

Advanced DB Class (with reading list):

* Design Decisions: https://15721.courses.cs.cmu.edu/spring2020/schedule.html#ja...

* Protocols: https://15721.courses.cs.cmu.edu/spring2020/schedule.html#ja...

* Garbage Collection: https://15721.courses.cs.cmu.edu/spring2020/schedule.html#ja...

jasonwatkinspdx · on Nov 3, 2020

So this comes out of Andy Pavlo's group. He teaches an undergrad and graduate level database class each year, and makes nearly all of his course materials available (there's some guest lectures that don't want to be public sadly): http://www.cs.cmu.edu/~pavlo/

I'd definitely go through the slide decks on concurrency control in both of these classes. They're an amazing resource for anyone interested in database internals. There's a ton of tradeoffs, and a lot of prior research into them. Andy's slide decks are something of a road map to the different choices.

What you describe is roughly snapshot isolation, which mostly works, but is vulnerable to write skew.

sitharus · on Nov 3, 2020

It really depends on your use-case.

A very brief summary for the case of Postgres: every tuple has a pair of hidden columns `min_xid` and `max_xid`. Every transaction get assigned an `xid` that's always increasing (it wraps around on int_max, that's a problem for later), and it can only see rows where `xid > min_xid and (max_xid is null or xid <= max_xid)`.

On writes the database is checked for rows with a matching PK where min_xid > xid, if any are found an error is raised.

The reality is a lot more complex than this and involves some locks during transaction commits and the vacuum process to clean up dead rows and handle xid wrap-around, but that's the basic idea.

6thbit · on Nov 4, 2020

This is using several buzzwords some database vendors are already using to promote their fully managed solutions, yet the "autonomy" of this dbms appears to be limited to SQL tuning.

I don't even see references to high availability configuration and how this would withstand transactional load and be resilient enough in comparison with more mature engines.

apavlo · on Nov 4, 2020

> This is using several buzzwords some database vendors are already using to promote their fully managed solutions, yet the "autonomy" of this dbms appears to be limited to SQL tuning.

This is incorrect. Where did you get this impression? We are trying to support automated physical design, knob configuration tuning, SQL tuning, and capacity planning/scaling.

Please refer to our vision paper: https://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf

ellisv · on Nov 4, 2020

Hey Andy, thanks for keeping it real!

georgewfraser · on Nov 4, 2020

Welcome to HN Andy! As you are witnessing, people find that actually reading the links can be a real impediment to hot takes...

jfbaro · on Nov 4, 2020

Would it be able to apply some (most) of these technics to other RDBMS like PostgreSQL and MySQL? Maybe some of (non private/secure) information about the queries/stats could be sent back to a centralised server that would be ingested by the ML model making it even "smarter".

jdc · on Nov 3, 2020

They have to limit the scope somewhere.

airstrike · on Nov 3, 2020

The edges of the Milky Way galaxy still qualify as "somewhere"