I also want to point out that their Node.js transactions API is broken, and it looks like they have no idea how promises or async code work in JS.
In Mongo, you have a `withTransaction(fn)` helper that passes a session parameter. Mongo can call this function multiple times with the same session object.
This means that if you have an async function holding a reference to a session and a transaction gets retried, you very often get "part of one attempt + some parts of another" committed.
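To make the failure mode concrete, here is a self-contained toy simulation (no MongoDB involved): the fake `withTransaction` below just mimics how the real driver re-invokes your callback with the same session object on a transient error, and shows how a stray `await` from attempt 1 can leak a write into attempt 2.

```javascript
// Toy model: NOT the real driver, just the retry-with-same-session shape.
const committed = [];

function makeSession() {
  return { writes: [], write(op) { this.writes.push(op); } };
}

async function withTransaction(fn) {
  const session = makeSession();       // ONE session object for every attempt
  try {
    await fn(session);
  } catch (e) {
    session.writes.length = 0;         // "abort" attempt 1...
    await fn(session);                 // ...and retry with the SAME session
  }
  committed.push(...session.writes);   // "commit" whatever the session holds
}

let attempt = 0;
const result = (async () => {
  await withTransaction(async (session) => {
    attempt += 1;
    const myAttempt = attempt;
    session.write(`A${myAttempt}`);
    if (myAttempt === 1) {
      // A slow non-Mongo await (think: Redis) that completes AFTER the
      // retry has started, still holding a reference to the session:
      setTimeout(() => session.write('leftover-from-attempt-1'), 10);
      throw new Error('TransientTransactionError');
    }
    await new Promise((r) => setTimeout(r, 50)); // let the stray write land
    session.write(`B${myAttempt}`);
  });
  return committed;
})();

result.then((c) => console.log(c.join(',')));
// → A2,leftover-from-attempt-1,B2 — attempt 2 plus a leftover from attempt 1
```

The real driver is obviously more involved, but the shape is the same: nothing invalidates the session between attempts, so late writes from an aborted attempt can land inside the retried transaction.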
We had to write a ton of logic around their poor implementation and I was shocked to see the code underneath.
It was just such a stark contrast to products that I worked with before that generally "just worked", like Postgres, Elasticsearch or Redis. Even tools people joke about a lot, like MySQL, never gave me this sort of data corruption.
Edit: I was kind of angry when writing this, so I didn't provide a source, and I'm a bit surprised this got so many upvotes without a source (I guess this community is more trusting than I assumed :] ). Anyway, for good measure, and to behave the way I'd like others to when making such accusations, here is where they pass the same session object to the transaction: https://github.com/mongodb/node-mongodb-native/blob/e5b762c6... (follow from withTransaction in that file) - I can add examples of code easily introducing the above-mentioned bug if people are interested.
If you work for Mongo and are reading this: please just fix it. I don't need to win and I don't care about being "right".
I just don't want to be called to the office on a weekend anymore for this sort of BS.
Production incidents with MongoDB last year: 15
Production incidents with Redis, Elasticsearch and MySQL combined last year: 2 (and with much less severity)
Edit: just to add: I didn't pick Mongo, I was just the engineer called in to clean up that mess. I've created enough messes of my own not to resent the person who made that call. We are constantly on the verge of rewriting the MongoDB stuff, since a database that small (~250GB) really should not have this many issues (in previous workplaces I ran ~10TB PostgreSQL deployments with much more complicated schemas and queries, with far fewer issues). It's also expensive, and support at Mongo Atlas hasn't been great (we should probably self-host, but I am not used to small databases being this problematic).
The Guardian posted quite a nice blog post in 2018 about their switch from MongoDB to Postgres. Especially interesting because they intended to use Postgres as a replacement document store. Here's the link: https://www.theguardian.com/info/2018/nov/30/bye-bye-mongo-h...
I was actually amazed that a big CMS/E-commerce vendor proudly proclaimed in a sales meeting that they were on MongoDB.
I suppose salespeople probably aren't into the nitty-gritty, but their tech people should have warned them about this. Maybe they were just trying to pull our collective leg, but I suppose that's why I was at that meeting.
There aren't a lot of CMS/Ecommerce vendors that sit on MongoDB, so maybe we were in a meeting together!
Even if we weren't - as a sales engineer on a large CMS/ECommerce platform with merchants running $150M+ in annual revenue, with an average client retention of seven years, and two decades of agency experience behind the decisions around building that platform, if you instantly said no just because of MongoDB, maybe you don't know as much about MongoDB as you think you do.
I came from a SQL background myself, and had reservations based on all the things I'd read about MongoDB as we decided to build a platform after doing things bespoke for two decades, but time has proven our architecture choices out. It's easy to be proud of something that works well.
I didn't pick Mongo, I was just the engineer called to clean that mess.
My only experience with MongoDB is being "the engineer called to clean the mess". I'm sure you can effectively use MongoDB in production if you're knowledgeable and careful, but most people aren't, and they shouldn't have to know the detailed inner workings to avoid creating a mess.
In this case, the parent commenter probably meant that "newbie web developers" are likely to choose MongoDB. Of course, web developers have a range of experience, some new, some seasoned.
Caveat: this is a meta-comment about voting, not a complaint about how people upvoted or downvoted the parent comment. (My motivations are explained at the very bottom).
Based on seeing how comments like this may get interpreted, as well as broader thinking about online communication, I think HN should consider a more nuanced system of comment feedback mechanisms.
I don't have a particular plan finalized, but I would like to see HN provide feedback on different aspects of the comment. Below are some important aspects:
To what degree does the reader / voter... *
1. agree/disagree with the comment?
2. find the comment relevant / irrelevant to the topic as a whole?
3. find the comment is situated in the correct / incorrect location in the thread? (e.g. responding to the parent comment or not)
4. find the comment interesting / uninteresting?
5. think the comment adds to a diversity of perspectives?
* When I write '/' above, I intend it to be a continuum; e.g. hot/cold means "in the continuum between hot and cold".
Additionally, being able to give feedback in a more granular fashion could be of use. For example, in my comment above, I would not be surprised if a significant number of people were bothered/offended by my commentary that people seem to be taking offense more easily. Some would call this ironic -- I wouldn't -- I think it gives more data to prove the point.
Motivations: my goal here is not to gain or lose karma -- I care very little about karma here, precisely because it is so muddled and varied from person to person -- as long as I have enough to participate fully. My goal is to learn and play a small part in fostering awareness and community, while hopefully motivating others to reflect on their impact on the community here.
MySQL is less of a joke than MongoDB is. They were similarly started by people who didn't know much about databases and learned about them on the go. Both started as much faster alternatives to other databases, and both ended up having their engines completely rewritten by outsiders who knew their stuff: MySQL went from ISAM to MyISAM and then InnoDB (written by an outsider), and MongoDB similarly got WiredTiger.
The thing is that MySQL is older, so it went through all of this earlier, but it still suffers from poor decisions from the past. This contrasts with PostgreSQL, where correctness and reliability were #1 from the beginning. It started as an awfully slow database, but performance improved over time and we now have a correct, reliable and fast database.
If you were around back in the day you will remember the MySQL team claiming that no one needed transactions or referential integrity, that you should just do it yourself in the application...
MySQL's rise IMO cannot be considered without also looking at the rise of Ruby on Rails and other CRUD-optimized platforms and frameworks. Also ORMs. These things denigrated the idea of using an RDBMS as anything but a dumb table store. Features like stored procedures and views were seen as pointless. MySQL was the perfect database for people who had no respect for databases.
I agree that the rise of MySQL is combined with using RDBMS as a table store rather than a relational database, but I am not positive that this was driven by RoR and ORMs. Every large-scale system I have worked with that utilizes MySQL (and I'm on at least my third in a row of these systems, sadly!) is/was driven by application-logic database utilization via the "FriendFeed model" - that is, a big fat ID->Document Blob table for persistence and breakout tables for indexing.
ORMs and ActiveRecord in particular encourage, to some extent, the use of a RDBMS, even if they didn't get to take advantage of them well for a long time - for example, in RoR "has_one / has_many" for foreign-key relationship, .joins(:field_name) for, well, joins, and so on.
Perhaps. Something happened between those first-generation web sites where you were writing SQL by hand -- so you could just as easily be writing (injection-attack-prone) queries that made use of stored procedures etc -- and today.
A big reason I called out RoR is that back in '04-05 I was railing against its default use of plural table names, and DHH on IRC recommended I shut up and just flip the configuration switch and turn off the feature, but of course when I did that all sorts of latent bugs were exposed.
RoR was the beginning of hipster "coding" and I therefore blame it for everything.
I wasn't previously familiar with the FriendFeed approach to database (ab)use. I paid about as much attention to it as I did to MySpace back in the day -- nearly zilch -- so its innards are doubly obscure to me.
MySQL was already ridiculously popular from use in PHP/MySQL applications well before Rails was popular. That said, I generally agree with your statement:
> MySQL was the perfect database for people who had no respect for databases.
No, but it does support online DDL for some operations in InnoDB.
Very few database systems support online DDL, which unlike a transaction, does not require undo or rollback resources.
Of course one must have a rollback procedure if something fails, but you need one for transactions too, just in case.
An online rollback is far less costly than a transactional rollback, because an online rollback is just undoing what you did. Added a column you didn't want in one query? Remove it again in another, very quickly.
TokuDB (a MySQL/MariaDB storage engine) supported all DDL as an online operation. But Percona killed it in favour of TokuMX, the MongoDB equivalent.
TokuMX has no upgrade path to WiredTiger, only one major customer at Percona (I can't say who it is) and no engineers.
Any kind of DDL is tricky and requires users to RTFM for the intricacies of their chosen database. One size rarely fits all.
TokuDB is a great storage engine! Online DDL and fast compression are a winning combo. We use it for all our big MySQL tables. It is still available in MySQL 8.0.
I really wish Percona would reconsider their decision to deprecate it.
After Percona took over TokuDB's creator TokuTek, they wasted so much of their development time and money on TokuMX (Percona's fractal tree-enabled MongoDB server) only to abandon it in 2017.
That money would have been better spent on TokuDB development, to allow it to match the features present in InnoDB like generated columns, spatial indexing, fulltext indexes and Galera.
TokuDB still has many users and MyRocks is just no substitute.
Good question. You'd need some accurate-enough data source telling you about failed writes. Which eventually comes back around to needing a consistent database and indications of client disconnects.
With a huge amount of data (as I've heard analytics is), could you take a sampling approach where you log every nth transaction and only check those against the DB?
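A sketch of what that sampling could look like, assuming a toy in-memory `Map` stands in for the real database (all names are made up):

```javascript
// Record every nth write locally, then audit only those samples
// against what the database actually returns.
function makeSampler(n) {
  let count = 0;
  const samples = [];
  return {
    record(key, value) {
      count += 1;
      if (count % n === 0) samples.push({ key, value }); // keep 1-in-n
    },
    // Return the keys of sampled writes the store no longer agrees with.
    audit(db) {
      return samples.filter((s) => db.get(s.key) !== s.value).map((s) => s.key);
    },
  };
}

// Usage: 1000 writes, one sampled write silently "lost", audit finds it.
const db = new Map();
const sampler = makeSampler(100);
for (let i = 1; i <= 1000; i++) {
  db.set(`k${i}`, i);
  sampler.record(`k${i}`, i);
}
db.delete('k300'); // simulate a lost write that happened to be sampled
console.log(sampler.audit(db)); // → [ 'k300' ]
```

The obvious caveat: a lost write only shows up in the audit if it happened to be sampled, so this estimates a loss rate rather than catching every lost write.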
Sure. Data that people don't care about enough to be worried about losing--for example, time series data from an unimportant remote sensor. Should this data be recorded at all? Maybe not, but if it should, then a best-effort recording may be fine. It may even be all that's possible.
I wouldn’t go as far as to say an “unimportant” remote sensor... but I think you’re correct in spirit.
I could think of an instance where you’d like to log data, but the occasional datapoint being missing wouldn’t be terrible. Maybe something like a temperature monitor — you’d like to have a record of the temperature by the minute, but if a few records dropped out, you’d be able to guess the missing values from context. Something like the data monitoring equivalent of UDP vs TCP.
Even more elementary than the sibling comments: this also happens in gaming all the time. You are recording live results, say in FIFA, but if you unplug your device, your results are gone, since they were in memory only. The game simply cannot afford to write to disk; the write is "non-guaranteed" in the true sense of the word, but it is fast.
You then "checkpoint" when the game is over.
You might object that this is not a "non-guaranteed" write, because in fact the write did occur, but I simply want to allude to the concept of a "non-secured" write, in that it vanished without an fsync.
When I was evaluating MongoDB a couple of years ago (around the time they were switching to the WiredTiger engine), I found a memory leak in their Node.js client on day one. I submitted a ticket on their Jira and at the same time had a look at the other issues they had there: memory leak after memory leak, memory corruption everywhere, data disappearing without any reason, segfaults, etc. After that, MongoDB was dropped as a candidate for a DB in the project I was working on; we went with Postgres and never regretted it.
Not the author, but I've done similar things (patching something rather than migrating away from it). Usually it's way more work to migrate away than to just patch it again to fit your use case. Once you find yourself having to patch it too often, you start thinking about migrating away. Then the research slowly begins, ad hoc, until it hits "seems we need to migrate away now, otherwise we're spending too much time working around it / fixing their broken shit"; that's when you sit down and decide to migrate away from it.
It also depends on how long you think the application will be around. You're building an MVP to evaluate something? Just hack together whatever works (then throw it away). You're maintaining software for a library/archive that will most likely stick around for a long time, even if they say it's just temporary? Make decisions that will help in the future, always.
We have a complicated system, and migration is ~3 months during which we won't be shipping features.
We have a roadmap we need to meet, and so far we have been trying to throw money at the problem rather than developers (paying Mongo Atlas) and adding features incrementally as Mongo gets them (like transactions).
If this wasn't a startup we would probably rewrite.
> Even tools people joke about a lot like mysql never gave me this sort of data corruption.
That's about a decade out of date at this point. MySQL/InnoDB is the standard table engine and corruption is exceedingly rare. As of 2014, when I last directly worked on MySQL prod systems, there was no practical difference from PostgreSQL in terms of transactional guarantees. That includes APIs like JDBC which we used for billions of transactions.
The biggest issue with MySQL/MariaDB isn't so much data corruption at the InnoDB level but stuff like:
MariaDB [test]> create table test ( i int );
Query OK, 0 rows affected (0.06 sec)
MariaDB [test]> insert into test values (''), ('xxx');
Query OK, 2 rows affected, 2 warnings (0.01 sec)
MariaDB [test]> select * from test;
+------+
| i |
+------+
| 0 |
| 0 |
+------+
2 rows in set (0.01 sec)
There's a bunch of other similar caveats as well, and this can really take you by surprise. I've seen it introduce data integrity issues more than once.
That's a new MariaDB 15.1 with the default settings I just installed the other day to test some WordPress stuff. I know there are warnings, and that you can configure this by adding STRICT_ALL_TABLES to SQL_MODE, but IMO it's a dangerous default.
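For what it's worth, a sketch of forcing strict mode server-wide (note: real configs usually append STRICT_ALL_TABLES to the existing sql_mode list rather than replacing it wholesale, so check what your server already has set):

```
# my.cnf / mariadb.cnf sketch: reject invalid values instead of coercing them
[mysqld]
sql_mode = "STRICT_ALL_TABLES"
```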
This is also an issue with using MongoDB as a generic database: every time I've seen it used there were these kinds of data integrity issues, sometimes minor, sometimes bringing everything down. Jepsen reports aside, this alone should make people double-check whether they really want or need MongoDB, because it turns out that most of the time you don't.
MySQL still has no transactional DDL (and I think still even autocommits if you try). This is a major difference from Postgres which I believe supports everything short of dropping tables.
Every month, we do an external database import into our production PostgreSQL database. In a single transaction, we drop dozens of tables, create new ones with the same names, insert hundreds of thousands of rows, and recreate indexes. It works flawlessly.
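A hedged sketch of what such an import can look like (table, column and file names are made up):

```sql
BEGIN;
DROP TABLE IF EXISTS monthly_import;
CREATE TABLE monthly_import (id bigint PRIMARY KEY, payload jsonb);
COPY monthly_import FROM '/data/import.csv' WITH (FORMAT csv);
CREATE INDEX ON monthly_import ((payload ->> 'sku'));
COMMIT;
-- If any statement fails, the whole transaction aborts and the previous
-- tables, rows and indexes are all still in place.
```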
I wouldn't use that particular thing against MySQL. DDL is normally supposed to be outside a transaction; it's just a PostgreSQL feature that you can run it inside one and be able to roll back. BTW, I'm convinced you can also drop a table within a transaction in PostgreSQL.
No, MySQL stands out here. Postgres, SQL Server, DB2, and Firebird all give at least some way to do some major DDL transactionally. Usability varies (e.g. Oracle supports a very specific kind of change that is not its normal DDL statements), but it's at least possible.
> DDL normally supposed to be always outside of a transaction
A basic element of the relational model is that metadata is stored as relational data and that the same guarantees that apply to manipulating main data in the database apply to manipulating the schema metadata.
It's true that many real relational databases compromise on this element in various ways at times, but it is absolutely not the case that DDL “is supposed to be” non-transactional.
Mongo may retry running it (calling the function again) if a "TransientTransactionError" is raised (the transaction is retried from the client side rather than at the cluster).
However, when the driver calls your function again it doesn't invalidate the `session` object - so previous calls to the same function can make updates to the database.
Let's say `someOp` does something that causes the transaction to retry and `someOtherOp` is doing something non-mongo-related in the meantime (like pulling a value from redis). Now `someOtherOp` reached the mongo part of its code and it is executing it happily with the same session object (so operations succeed although they really shouldn't)
The point of transactions like you said is to perform multiple operations atomically and for them to happen "exactly once or not at all". With Mongo in practice it is very easy to get "Once and some leftovers from a previous attempt".
Sorry, I haven’t had my coffee yet. If I am reading this correctly, either someOp() or someOtherOp() may execute first, no? And if you introduce an external database, why do you expect Mongo to handle that rollback? Say someOtherOp() increments a Redis value by 1. If that part executed first since both are asynchronous here, what would a Mongo session have to do with it?
What exactly would invalidating that session object do here? And what would the session object do after it was invalidated?
Thanks for this explanation. So if I understand correctly, `someOp` has thrown an error but this doesn't affect `someOtherOp`? So `someOtherOp` will end up being called twice?
I think this is the expected behaviour of the transaction, but the problem comes from the fact that you wrap all the DB operations inside a Promise.all.
Because you wrap the DB operations inside a Promise.all, it will run them all, but it will not revert them if one fails (it's not atomic; it just tells you that one has failed and you need to catch it). It rejects them but does not revert them: the CUD operations will already have changed the data.
The problem, I believe, is that the transaction considers the Promise.all and not what's inside of it, so it will run it again despite the fact that some operations already succeeded earlier.
I think you just have to resolve each of them outside a Promise.all.
In your case, because the Promise.all was rejected, it will redo the transaction, and therefore redo the operations that already worked in the first call.
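A tiny runnable illustration of that point: `Promise.all` rejects as soon as one promise rejects, but it does nothing to undo the side effects of the ones that already succeeded (the `applied` array here stands in for database writes):

```javascript
const applied = []; // stands in for rows already written by CUD operations

const write = (name) => Promise.resolve().then(() => { applied.push(name); });
const failing = () => Promise.reject(new Error('boom'));

const result = (async () => {
  try {
    await Promise.all([write('op1'), failing(), write('op2')]);
  } catch (e) {
    // We land here, but 'op1' and 'op2' were still applied: nothing
    // reverted them; Promise.all only reported the failure.
  }
  return applied;
})();

result.then((a) => console.log(a.join(','))); // → op1,op2
```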
Not OP, but I think what he means is that the session, even when it has already failed once, can still be used (without error) in the next operation, without being invalidated.
Without Promise.all, I think it can be replicated like this:
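Something along these lines, perhaps (a hedged pseudocode sketch, not runnable as-is; `client`, `db` and `redis` are hypothetical stand-ins):

```javascript
const session = client.startSession();

await session.withTransaction(async () => {
  await db.collection('a').updateOne({ _id: 1 }, { $inc: { n: 1 } }, { session });

  // Work that awaits something non-Mongo while holding the session:
  const straggler = (async () => {
    await redis.get('some-key'); // slow external await
    // If a TransientTransactionError aborted attempt 1 in the meantime,
    // this write executes under the SAME session during attempt 2:
    await db.collection('b').insertOne({ x: 1 }, { session });
  })();

  // If this conflicts and throws a transient error, the driver retries
  // the whole callback while `straggler` is still pending:
  await db.collection('a').updateOne({ _id: 2 }, { $inc: { n: 1 } }, { session });
  await straggler;
});
```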
I have found that people attack me when I make technical criticisms of technologies they like.
Am I expected to justify being able to code on HN? I could go on about being a maintainer of the two most popular promise libraries on npm (bluebird and Q), being in Node core, organizing sessions about promise APIs in Node core, and having over 1000 answers about promises on SO.
I generally find that sort of "non-technical chat" boring compared to the technical stuff, and appeals to authority kind of lame.
The only other person attacked in this issue is aphyr, so I believe I am in very good company in this particular instance.
Yes, several times - we pay Mongo Atlas over $5000 per month.
We reported it immediately at the highest severity and we pay for the highest tier of support - we tried to collaborate as soon as possible. It sort of went "over their head".
Can you link the JIRA ticket? I use the Node driver heavily and have contributed several PRs to it in the past, would be more than happy to fix this since it seems fairly bad.
Hi folks! Author of the report here. If anyone has questions about detecting transactional anomalies, what those anomalies are in the first place, snapshot isolation, etc., I'm happy to answer as best I can.
Have you considered presenting the data in a concise manner in addition to the in-depth analyses?
That is, a table on the jepsen.io frontpage, or at least on each product's review page, with database products and configuration on rows and consistency properties on columns, and a nice "Yay!" or "Nope!" mark in the cell, plus links on how to achieve the database configurations in the table (esp. how to configure each database to have the most guarantees).
Also, ideally the analyses should be rerun automatically (or possibly after being paid, but making it easy for the company to do so) every time a new major release happens rather than being done once and then being stale.
Finally, there should be tests for the non-broken databases (PostgreSQL for instance, both in single-server mode, deployed with Stolon on Kubernetes and using the multimaster projects) as well to confirm they actually work.
> That is, a table on the jepsen.io frontpage, or at least on each product's review page, with database products and configuration on rows and consistency properties on columns, and a nice "Yay!" or "Nope!" mark in the cell, plus links on how to achieve the database configurations in the table (esp. how to configure each database to have the most guarantees).
This is a wonderful idea, and I've got no idea how to actually do it in a standardized, rigorous way. Vendor claims are often contradictory, it's hard to get a good idea of anomaly frequency, availability is... a rabbithole, and it's hard to come up with a standard taxonomy of anomalies--most of the analyses I do wind up finding something I've never really seen before, haha. With that in mind, I've wound up letting the reports speak for themselves.
> Also, ideally the analyses should be rerun automatically (or possibly after being paid, but making it easy for the company to do so) every time a new major release happens rather than being done once and then being stale.
I don't know a good way to do this either. Each report is typically the product of months of experimental work; it's not like Jepsen is a pass-fail test suite that gives immediately accurate results. There is, unfortunately, a lot of subtle interpretive work that goes into figuring out if a test is doing something meaningful, and a lot of that work needs to be repeated on each test run. Think, like... staring at the logs and noticing that a certain class of exception is being caught more often than you might have expected, and realizing that a certain type of transaction now triggers a new conflict detection mechanism which causes higher probabilities of aborts; those aborts reduce the frequency with which you can observe database state, allowing a race condition to go un-noticed. That kinda thing.
If I'm lucky and the API/setup process haven't changed, I can re-run an analysis in about a week or so. If I'm unlucky, there's been drift in the OS, setup process, APIs, client libraries, error handling, etc. It's not uncommon for a repeat analysis to take months. :-(
It's probably more snarky than helpful, but it'd be great to have a section where it's just marketing materials or docs that you've corrected with a red pen
It's probably better to keep it professional. Your average employee can afford some snark. But when companies hire you for this sort of consulting, you could turn off a lot of potential clients by including it in materials you produce, even when they didn't pay for it. Because it is a representation of the product they would be paying for.
It would be kinda like you including this sort of thing on your resume. Which would also be a bad idea.
For those who don’t know, Kyle makes a living offering these types of analysis to database companies directly. While a lot of us love to dunk on Mongo (myself included), it would be silly to expect Kyle to risk his livelihood.
If done accurately and professionally, something like you're suggesting could be really useful to aid people and organizations during vendor selection.
Thank you for all of your work over the years. Your reports have helped me and others stand up to bizdev hype and make better decisions for our companies and customers.
Postgres is widely understood to be a robust database with safe defaults. I, and perhaps others, would love to see you aim your array of weapons at Postgres. Do you have any plans to look at stock Postgres?
It's been on my list for a long time, but I've also struggled to find out like... what, exactly, is the right way to do postgres replication? Every time I go into the docs I wind up with a laundry list of different mechanisms for replication and failover, and no idea which one would be most appropriate for a test. I gotta get on this!
It'd be especially interesting given that MongoDB claims this:
> Postgres has both asynchronous (the default) and synchronous replication options, neither of which offers automatic failure detection and failover [12]. The synchronous replication only waits for durability on one additional node, regardless of how many nodes exist [13]. Additionally, Postgres allows one to tune these durability behaviors at the user level. When reading from a node, there is no way to specify the durability or recency of the data read. A query may return data that is subsequently lost. Additionally, Postgres does not guarantee clients can read their own writes across nodes.
> It'd be especially interesting given that MongoDB claims this:
> > Postgres has both asynchronous (the default) and synchronous replication options, neither of which offers automatic failure detection and failover [12]. The synchronous replication only waits for durability on one additional node, regardless of how many nodes exist [13]. Additionally, Postgres allows one to tune these durability behaviors at the user level. When reading from a node, there is no way to specify the durability or recency of the data read. A query may return data that is subsequently lost. Additionally, Postgres does not guarantee clients can read their own writes across nodes.
This is like those commonly seen tables comparing your product with others where your product had checkmarks in all categories, and of course competitors are missing a bunch of them. The problem is that the categories were picked by you, and are often irrelevant to the other product. This is the case here.
PostgreSQL is not a distributed database; the master does all the writes and the replicas are read only. By default, replicas are asynchronous, which means they won't affect master performance, at the cost of the data there lagging by a few seconds. Since you can't write to replicas, this can't cause data corruption, only delay, which is often acceptable. If you design your application to have two database endpoints, one for writes and one just for reads, you can then decide based on context which endpoint you want to use. The read-only endpoint is easy to scale, but as mentioned earlier it is read only and might lag slightly.
Now, for failover, you might also opt to use synchronous replicas. This adds extra latency, but then you always have at least one machine with the same data. They mentioned that if you have multiple synchronous standbys, only one needs to acknowledge the write. Actually, that's configurable: you can specify a group of synchronous machines and how many of them need to be in sync, with the remaining ones acting as backups in case the ones you specified aren't available.
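The knob for that is `synchronous_standby_names` (on recent PostgreSQL versions); a sketch, with hypothetical standby names:

```
# postgresql.conf on the primary: wait for any two of the three listed
# standbys to confirm each commit; the others remain asynchronous backups.
synchronous_standby_names = 'ANY 2 (standby_a, standby_b, standby_c)'
synchronous_commit = on
```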
Besides, writes don't work the same way as in Mongo: when a standby node is in sync, it isn't just in sync for that particular write, it is completely in sync, so their argument about not being able to specify durability/recency of data on read is moot. If you contact the master or a synchronous replica, you will always get the most recent state. If you don't mind a slight delay, you should query the asynchronous replicas (in fact you should prefer them whenever you can, since they are cheap to add).
I think that it'd be super-valuable to do an analysis of an RDS Postgres deployment. Amazon is doing some dark magic with RDS that sits at this really interesting "distributed, but not that distributed" inflection point, which impacts the basic assumptions of lots of distributed database design.
I believe RDS Postgres is probably the right answer for lots of applications, especially for those that already depend on AWS for baseline availability. I'd love to see if that holds up against a rigorous analysis.
the setup would probably be pgsql primary, aurora secondaries on diff zones and something changing cross-zone or cross-region vpc settings to try to break replication? never tried that, but i was hurt by rds pure pgsql cross-region replication in a network outage situation.
i feel like this is the reaction of everyone who has ever tried to set up postgres replication. With your audience, you deciding on a particular setup will probably help a LOT of people, and ultimately the postgres project as well.
If you worry about data, you should not use automatic failover. It's nearly impossible for the standby to know why the master stopped responding. Maybe there was a hardware failure, or maybe the master is just busy. This is why manual failover is better: you can find out the real reason and decide whether you should perform the failover or just wait.
With tools like repmgr it is just a single command invoked on the standby.
If you absolutely don't want to lose any data, you should have two masters in close proximity (so the latency isn't high) set up with synchronous replication, then have one or two standbys with asynchronous replication. This will reduce throughput, but then you can be sure that the other machine has all the same transactions. If something happens to both, you can then fall back to the asynchronous one, which might be a bit behind.
Quoting a former colleague here, but "if it hurts, do it more often". That is what you should do with your PostgreSQL failovers.
I have clusters running on timelines in the hundreds without a byte of data loss due to using synchronous replication, tools that help out with leader election, and just doing it often.
Can Patroni tell if the master node is not responsive because it is busy vs dead? GitHub (I believe) had a few outages that caused data loss because their auto-failover mechanism kicked in when it shouldn't have.
I would actually be interested in aphyr's analysis of Patroni and other distributed add-ons to PostgreSQL.
There is no real difference between dead or too-busy.
The only question is how soon you are going to page humans: after the automated mechanism has flipped your master 2-3 times but the cluster still hasn't made progress (nothing coming out of the master, or it locks up after a few minutes again), or right after some other automated mechanism detects that there's a problem.
Whatever automation you have in place, it has advantages and disadvantages. In the GitHub case - I suppose - they determined post-mortem that it would have been better to just let the master chug through the incoming onslaught of queries instead of failing over, and over, and over. (But of course this seems like a trivial problem in any auto failover setup, so I suspect there's more to the story.)
> Can Patroni tell if master node is not responsive because it is busy vs dead
No. But the contract Patroni has is this:
I only serve a master (primary) if I have the lock.
If I do not have the lock I will demote.
This guarantees that there can be only one primary active at any given point in time, even if the network is partitioned.
This in and of itself does not guarantee no-split-brain situations, a split-brain can occur if writes were made on the former primary, but not yet on the future primary.
This however can be mitigated with synchronous replication.
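That lock contract can be sketched as a lease loop. This is a toy model of the idea, not Patroni's actual code; `FakeDCS` stands in for a real DCS like etcd or Consul, and all names here are my own invention:

```python
class FakeDCS:
    """Stand-in for etcd/Consul: a single leader lock with a TTL."""
    def __init__(self):
        self.holder, self.expires = None, 0.0

    def acquire(self, node, ttl, now):
        # The current holder may renew; anyone may take an expired lock.
        if self.holder in (None, node) or now >= self.expires:
            self.holder, self.expires = node, now + ttl
            return True
        return False

class Node:
    def __init__(self, name, dcs):
        self.name, self.dcs, self.role = name, dcs, "replica"

    def tick(self, now):
        # The whole contract: serve as primary only while holding the lock;
        # demote the moment the lock cannot be (re)acquired.
        if self.dcs.acquire(self.name, ttl=30, now=now):
            self.role = "primary"
        else:
            self.role = "replica"

dcs = FakeDCS()
a, b = Node("a", dcs), Node("b", dcs)
a.tick(now=0); b.tick(now=0)
assert (a.role, b.role) == ("primary", "replica")

# "a" is partitioned away and stops renewing; once the TTL lapses,
# "b" takes the lock. The partitioned "a" must demote itself too,
# because it can no longer acquire the lock either.
b.tick(now=31)
a.tick(now=31)
assert (a.role, b.role) == ("replica", "primary")
```

The point of the sketch: neither node ever reasons about whether the other is "busy vs. dead" -- holding the lease is the only thing that matters.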
> tell if master node is not responsive because it is busy vs dead?
The postgres documentation will tell you that you'll need to set up your own mechanisms for this, and that they will need to integrate with OS facilities as appropriate. One-size-fits-all does not cut it. Not wrt. replication, not wrt. HA/failover.
Well, the built-in mechanisms are the right way to do it. But given that PostgreSQL is quite conservative about this, it will be hard to find issues there: replicas are read-only, so at worst you'll just see replication delay, unless you use synchronous replication, which removes the delay at the cost of slower performance.
All the tooling that provides extra distributed functionality not present in Postgres (auto failover, multi-master replication, sharding, etc.) will surely have issues, but then you aren't testing PostgreSQL itself, you're testing the tooling. So, to be fair, the article should evaluate those tools, and any shortcomings shouldn't be attributed to PostgreSQL (unless it really is a PostgreSQL issue).
Postgres is not a distributed database and doesn't have a single safe default for running it in a distributed configuration, including talking to it over the network. It can't claim any consistency guarantee, so there is nothing for aphyr to test it for.
Even common highly available configurations take the route of no consistency guarantees by doing primitive async replication and primitive failover.
Postgres supports multi-master replication, among other replication models. This could provide an interesting target.
In a classic single node configuration, a confirmation that its transaction isolation behaviors exhibited the corresponding anomalies would be valuable.
Postgres doesn't natively support multi-master. (Although there are a variety of open source/proprietary offerings that add support for it to various degrees.)
PostgreSQL doesn't offer multi-master replication. There are extensions that do, but if aphyr evaluates them he should emphasize that he is testing those extensions, not PostgreSQL itself (unless he finds a bug in PostgreSQL proper).
I think he did something similar for MySQL when evaluating the Galera cluster.
Jepsen reports often include two distinct types of analyses: correctness in a distributed storage system under a variety of failure scenarios, and in-depth analysis of consistency claims. Both examinations are extremely helpful.
In a single write master configuration, Postgres runs transactions concurrently, so the consistency analysis is still quite relevant.
I don’t think it’s a stretch to say that everyone expects Postgres to get top marks in this configuration and it would be worth confirming that this is the case.
But that was long ago, and maybe it needs to be redone?
Edit: after re-reading, he treats it as a distributed system because the client and server communicate over a network. And that is true; it can also be thoughtof as a distributed system because, as you said, transactions are concurrent and run as separate processes. Although in these cases you can't have a partition (which aphyr uses to find weaknesses), or maybe there is something equivalent that can happen?
> PostgreSQL doesn't offer multi master replication.
Not in itself, but it does offer PREPARE TRANSACTION / COMMIT PREPARED / ROLLBACK PREPARED, which could be used to add such support in the future. This would not be unprecedented, as the simpler case of db sharding is already supported via the PARTITION BY feature, combined with "FOREIGN" database access.
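Those commands are the building blocks of two-phase commit across nodes. Here's a toy Python simulation of the pattern they enable (my own illustration; `Shard` and `coordinator` are invented names, and real PREPARE TRANSACTION also makes the prepared state durable across crashes, which this sketch ignores):

```python
class Shard:
    """A node-local store with prepare/commit/rollback, 2PC-style."""
    def __init__(self):
        self.committed, self.prepared = {}, {}

    def prepare(self, txid, writes, ok=True):
        # PREPARE TRANSACTION: make the work ready to commit, not yet visible.
        if not ok:                      # ok=False simulates a prepare failure
            return False
        self.prepared[txid] = writes
        return True

    def commit_prepared(self, txid):
        self.committed.update(self.prepared.pop(txid))

    def rollback_prepared(self, txid):
        self.prepared.pop(txid, None)

def coordinator(shards_writes, txid):
    # Phase 1: every shard must prepare. Phase 2: commit all or roll back all.
    results = [(s, s.prepare(txid, w, ok)) for s, w, ok in shards_writes]
    if all(ok for _, ok in results):
        for s, _ in results:
            s.commit_prepared(txid)
        return True
    for s, ok in results:
        if ok:
            s.rollback_prepared(txid)
    return False

a, b = Shard(), Shard()
assert coordinator([(a, {"x": 1}, True), (b, {"y": 2}, True)], "t1")
assert a.committed == {"x": 1} and b.committed == {"y": 2}
# If any participant fails to prepare, nothing commits anywhere:
assert not coordinator([(a, {"x": 9}, True), (b, {"y": 9}, False)], "t2")
assert a.committed == {"x": 1} and b.committed == {"y": 2}
```

The hard parts a real implementation adds on top -- coordinator crash recovery, in-doubt transactions, timeouts -- are exactly why this isn't shipped as built-in multi-master today.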
I'm not sure what you mean by pg not being a "distributed" database. It has replication and sharding functionality that lets it run in various clustering configurations. That looks enough to me to qualify it for aphyr's tests.
Replicas are read-only, so at worst there's only a delay when replication is set up asynchronously, but ultimately they end up the same as the master. As for the sharding part, do you mean FDW? I don't think PostgreSQL gives any consistency guarantees if you use them.
Not a question necessarily about the technical side, but I'm interested in your opinion as to the root cause – is it a desire to achieve certain results for marketing purposes, a lack of understanding/training in the team about distributed systems, just bugs and a lack of testing...? Alternatively, does most of this come down to one specific technical choice, and why might they have made that choice?
Very happy for (informed) speculation here, I recognise we'll probably never know for certain, but I'm interested to avoid making similar mistakes myself.
There's a few things at play here. One is talking only about the positive results from the previous Jepsen analysis, while not discussing the negative ones. Vendors often try to represent findings in the most positive light, but this was a particularly extreme case. Not discussing default behavior is a significant oversight, and it's especially important given ~80% of people run with default write concern, and 99% run with default read concern.
The middle part of the report talks about unexpected but (almost all) documented behavior around read and write concern for transactions. I don't want to conjecture too much about motivations here, but based on my professional experience with a few dozen databases, and surveys of colleagues, I termed it "surprising". The fact that there's explicit documentation for what I'd consider Counterintuitive API Design suggests that this is something MongoDB engineers considered, and possibly debated, internally.
The final part of the report talks about what I'm pretty sure are bugs. I'm strongly suspicious of the retry mechanism: it's possible that an idempotency token doesn't exist, isn't properly used, or that MongoDB's client or server layers are improperly interpreting an indeterminate failure as a determinate one. It seems possible that all 4 phenomena we observed stem from the retry mechanism, but as discussed in the report, it's not entirely clear that's the case.
I get the impression that MongoDB may have hyped themselves into a corner in the early days with poorly made (or misleading) benchmarks. Perhaps they have customers with a lot of influence determining how they think about performance vs consistency.
Maybe this, combined with patching, re-patching, and re-patching again their replication logic/consistency algorithm, means that they'll be stuck in this sort of position for a long time.
Possibly! You're right that path dependence played a role in safety issues: the problems we found in 3.4.0-rc3 were related to grafting the new v1 replication protocol onto a system which made assumptions about how v0 behaved. That said, I don't want to discount that MongoDB has made significant improvements over the years. Single-document linearizability was a long time in the works, and that's nothing to sneeze at!
Yeah, there's no workaround that I can find for 3.4 (duplicate effects), 3.5 (read skew), 3.6 (cyclic information flow), or 3.7 (read own future writes). I've arranged those in "increasingly worrying order"--duplicating writes doesn't feel as bad as allowing transactions to mutually observe each other's effects, for example. The fact that you can't even rely on a single transactions' operations taking place (or, more precisely, appearing to take place) in the order they're written is especially worrying. All of these behaviors occurred with read and write concerns set to snapshot/majority.
That's not to say that workarounds don't exist, just that I didn't find any in the documentation or by twiddling config flags in the ~2 weeks I was working on this report. :)
Hi Kyle, thanks for Elle :) I want to use Elle to check long histories of transactions over a small set of keys with a read-dominant workload. The paper recommends using lists over registers, but when the history becomes long, on the one hand it becomes too wasteful to read the register's full history on each request, and on the other hand Elle's input becomes very large. E.g., when each read must return the whole register's history, the size of the history grows O(n^2) compared to the case where reads return just the head.
So I'm curious: how would you describe the ability to find violations with Elle using read-write registers with unique values vs. append-only lists?
> E.g., when each read must return the whole register's history, the size of the history grows O(n^2) compared to the case where reads return just the head.
If you look at Elle's transaction generators, you can cap the size of any individual key, and use an uneven (e.g. exponential) distribution of key choices to get various frequencies. That way keys stay reasonably small (I use 1-10K writes/key), some keys are updated frequently to catch race conditions, and others last hundreds of seconds to catch long-lasting errors.
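Something like the following sketch captures that generator shape. This is my own illustration, not Elle's actual code, and the specific numbers are arbitrary:

```python
import random
from collections import Counter

def key_stream(max_writes_per_key=1000, skew=8, seed=0):
    """Yield keys forever: a small active set, chosen with a skewed
    distribution, each key retired after max_writes_per_key appends."""
    rng = random.Random(seed)
    active, counts, next_key = [0, 1, 2, 3], {}, 4
    while True:
        # Exponentially skewed choice: low slots (hot keys) are picked far
        # more often than high slots (long-lived, rarely-touched keys).
        i = min(int(rng.expovariate(1.0) * len(active) / skew), len(active) - 1)
        k = active[i]
        counts[k] = counts.get(k, 0) + 1
        yield k
        if counts[k] >= max_writes_per_key:
            active[i] = next_key   # retire the full key, bring in a fresh one
            next_key += 1

gen = key_stream(max_writes_per_key=100)
picks = [next(gen) for _ in range(10_000)]
c = Counter(picks)
assert max(c.values()) <= 100   # no key grows past the cap
assert len(c) >= 100            # keys get retired and replaced over time
```

Capping per-key size bounds the O(n^2) read cost, while the skew keeps some keys hot enough to catch races and others alive long enough to catch slow-burn errors.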
> So I'm curious: how would you describe the ability to find violations with Elle using read-write registers with unique values vs. append-only lists?
RW registers are significantly weaker, though I don't know how to quantify the difference. I've still caught errors with registers, but the grounds for inferring anomalies are a.) less powerful and b.) can only be applied in certain circumstances--we talk about some of these details in the paper.
Huge fan of your work! I was curious if you've ever attempted to run your (or part of) Mongo test suite against FoundationDB using their DocumentLayer since it's supposed to be Mongo API compatible.
IIRC one of the FoundationDB engineers tested with Jepsen and found that it passed in its default configuration, but the blog post seems to have disappeared.
Thanks for firing up the time machine! I've been using FDB for a little over a year now and can't recommend it enough. Such a solid piece of meticulous engineering.
I was lucky to have a good education: my B.A. involved courses in contemporary experimental physics and independent research in nonlinear quantum dynamics (esp. proofs, experimental design, writing), cognitive and social psychology (more experiment design and stats), math structures (proof techniques), philosophy (metaphysics, philosophy of science), and English (rhetoric). All of those helped give me a foundation for doing this kind of experimental work and communicating it to others.
Jepsen draws inspiration from a long line of work on property-based testing, especially Quickcheck & co. It also draws on roughly 10 years of experience building & running distributed systems in production. A lot of Jepsen I invented from whole cloth, but some of the checkers in Jepsen are derived from specific research papers, like work by Wing, Gong, and Howe on linearizability checking.
Then it's just... a lot of thinking, experimenting, and writing. Jepsen's the product of ~6 years of full-time work. Elle, the system which detected the anomalies in this report, was a research project I've been puzzling over for roughly two years.
I write the Jepsen series, and open-source all of the code for these tests, partly as a resource so that other people can learn to do this same kind of work. :-)
I guess you've answered my question, but to be clear, you do not instrument/analyse the code, you treat it as a black box which you hammer on externally, is that right?
Pretty much, yeah. There are some cases where Jepsen reaches into the guts of a database or lies to it via LD_PRELOAD shims, but generally these are Just Plain Old Binaries provided by vendors; no instrumentation required.
Hi Kyle! I’ve really enjoyed your work over the years. I was wondering, with all of your testing and experimentation, is there any system that had really impressed you?
I've actually written a little mini-Jepsen workbench for folks who want to practice implementing a distributed key-value store. Might be worth a spin! https://github.com/jepsen-io/maelstrom
Mongo has been related to "perpetual irritation" up to "major production issue" at all three of my last companies.
For as easy as it is to use jsonb in Postgres, or Redis, or RocksDB/SQLite, or whatever else depending on your use case - I can't find any reason to advocate its use these days. In my anecdotal experience, the success stories never happen, and nearly every developer I know has an unpleasant experience they can share.
Big thanks to aphyr and the Jepsen suite (and unrelated blog posts like Hexing the Interview) for inspiring me to do thorough engineering.
I find that using JSON for things you don't need to query/validate (like big blobs you just want to store) and breaking the rest out to columns works well enough. Plus, you can always migrate the data out to a field anyway.
Postgres 12 has generated columns, so you can throw your data in a jsonb column and have Postgres pull data out of it into separate columns for indexing for example.
Generated columns are not necessary for indexing in Postgres; you can create an index on any expression over the record (supported for many versions now).
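For reference, the two approaches look roughly like this (the table and key names are made up):

```sql
-- PostgreSQL 12+: a generated column extracted from a jsonb payload
ALTER TABLE events
  ADD COLUMN user_id text GENERATED ALWAYS AS (payload->>'user_id') STORED;
CREATE INDEX ON events (user_id);

-- Older versions: an expression index directly on the jsonb
CREATE INDEX ON events ((payload->>'user_id'));
```

The expression index works everywhere, while the generated column additionally gives you a plain queryable column at the cost of storing the value twice.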
Oh, trust me, I'm aware. But inevitably I will be in a design meeting where they will want a non-SQL alternative, and it'd be nice to know what I can suggest besides Mongo.
It depends what you're using it for. Postgres is a very good all-around choice these days (compared to when the whole 'noSql' thing got started) and also supports document-based scenarios quite well via JSON/JSONB columns and its support for these datatypes in queries, updates, indexing etc. Sharding and replication can also be set up via fairly general mechanisms, as described in pgSQL documentation. (For instance, the FDW facility is often used to set up sharding, but it could also support e.g. aggregation.)
As an engineer for whom automated testing tools are crucial to my mental health, let me know if you want a UX tester or just someone to provide feedback on the documentation.
This article reinforces my stance that bad defaults are a bug. Defaults should be set up with the least number of pitfalls and safety tradeoffs possible so that the system is as robust as it can be for the majority of its users, since the vast majority of them aren't going to change the defaults.
Sometimes you end up with bad defaults simply by accident but I feel like for MongoDB the morally correct choice would be to own up to past mistakes and change the defaults rather than maintain a dangerous status quo for "backwards compatibility", even if you end up looking worse in benchmarks as a result.
I think this is a good way to look at things, and there are vendors who do this! VoltDB, for instance, changed their defaults to be strict serializable even though it imposed a performance hit, following their Jepsen analysis. https://www.voltdb.com/blog/2016/07/voltdb-6-4-passes-offici...
How many more years do we have to keep evaluating, studying, and reading about MongoDB's ongoing failures? It would appear this product has been a great burden on the community for many years.
I like to keep in mind that MongoDB's existing feature set is maturing--occasional regressions may happen, but by and large they're making progress. The problems in this analysis were in a transaction system that's only been around for a couple years, so it's had less time to have rough edges sanded off.
I’m Dan Pasette, Executive Vice President of Core Engineering at MongoDB. I'd like to thank aphyr for posting the detailed report on MongoDB 4.2.6. We were able to use these findings to identify a bug that can lead to a previously committed transaction being incorrectly retried in the presence of a primary failover and a subsequent transaction commit retry. From our testing, this bug is the cause of the anomalies described in sections 3.4 - 3.7 of the report.
This bug has been fixed and backported, and will be available to users in MongoDB 4.2.8 onwards. The MongoDB test suite has been updated to ensure that this specific phenomenon is detected in future releases. We are also planning to update the version of Jepsen we are currently running in our CI loop to include the newest test case used in the report.
Last, we’ve made some changes to how we share information discussed in the Jepsen reports on our website. You can find the updated page here (https://www.mongodb.com/jepsen).
There are so many great databases out there. There's no need for one that has been mediocre for years and continues to make false claims. This is an issue of years of super-aggressive marketing of an inferior product making it hard on engineers.
I think if you compared it to other databases that are designed to scale horizontally like Cassandra and DynamoDB, you might have a more favorable opinion. IMHO, most products at this scale are terrible in different ways, because it is a difficult problem to solve generally.
I have been responsible for <100 clustered Cassandra instances, and <500 clustered MongoDB instances, and I would choose the latter every time.
It has been a decade since MongoDB was initially released. It still isn't a database in any meaningful sense of the word. Please don't downplay the amount of trouble one can get into by assuming it is a database.
> Clients observed a monotonically growing list of elements until [1 2 3 5 4 6 7], at which point the list reset to [], and started afresh with [8]. This could be an example of MongoDB rollbacks, which is a fancy way of saying “data loss”.
I hope they learned the lesson, don't fuck with aphyr.
I agree but maybe it’s the only lesson they are able to understand at this time. Their attitude was asking for somebody to call them, which aphyr is maybe the best positioned to do.
I’d love to read a roasting like that authored by Leslie Lamport for a different perspective but aphyr’s works absolutely stand on their own.
Any ideas how to get Jepsen and TLA to work together? :)
I wanted to incorporate MongoDB into a C++ server at one point.
Their C/C++ client is literally unusable. I went to look into writing my own that actually worked and their network protocols are almost impossible to understand. BSON is a wreck and basically the whole thing discouraged me from ever trying to interact with that project again.
Aphyr is such a competent professional. What a relatively thorough and polite response to Mongo's inaccurate claims. "We also wish to thank MongoDB’s Maxime Beugnet for inspiration." is a nice touch.
The general mood I observed about MongoDB was that it used to be inconsistent and unreliable but they fixed most, if not all of those problems and they now have a stable product but bad word of mouth among developers. Personally, I've treated it as "legacy" and migrated everything that I had to touch since 2013 [0], and luckily (just read the article so hindsight 20/20 -- transaction running twice and seeing its own updates? holy...) never gave it another try.
[0]: https://news.ycombinator.com/item?id=6801970 (BTW: no, my dream of simple migration never materialized, but exporting and dumping data to Postgres JSONB columns and rewriting queries turned out to be neither buggy nor hard).
> MongoDB was that it used to be inconsistent and unreliable but they fixed most, if not all of those problems and they now have a stable product but bad word of mouth among developers.
This report is 9 days old, and tests the latest stable release of MongoDB. The problems it discusses are present on modern MongoDB.
If it wasn't clear, I said "mood" (which you conveniently ignored), referring to chit-chat I heard recently, and I was underlining how wrong that mood has been. I totally understand what the report says and know what version it tests.
In my defense, it wasn't clear that's what you were saying in your original comment. "Mood" has become a filler word at this point -- hence why I omitted it from the quote -- and can mean anything from the traditional meaning of "mood in the room" to "incredibly relatable/factual statement". How I originally understood your comment was that you were saying that you felt that most of the issues are in the past, but you still decided to migrate away from it.
This is not directly related to this report or Jepsen, but since you're here I've got to ask: Aphyr, are there any recent papers/research in the realm of distributed databases which you're excited about?
Calvin and CRDTs aren't new, but I still think they're dramatically underappreciated! Heidi Howard's recent work on generalizing Paxos quorums is super intriguing, and from some discussion with her, I think there are open possibilities in making leaderless single-round-trip consensus systems for log-oriented FSMs, which is what pretty much everyone WANTS.
I'm also excited about my own research with Elle, but we're still working on getting that through peer review, haha. ;-)
> I think there are open possibilities in making leaderless single-round-trip consensus systems for log-oriented FSMs, which is what pretty much everyone WANTS.
Woah, that's wild. Are there any pre-prints/papers/talks that you can link to on this subject? I'd _love_ to read this.
> I'm also excited about my own research with Elle, but we're still working on getting that through peer review, haha. ;-)
I read over bits of Elle; the documentation in it is absolutely top-notch. You and Peter Alvaro knocked it out of the park!
> I think there are open possibilities in making leaderless single-round-trip consensus systems for log-oriented FSMs, which is what pretty much everyone WANTS.
This is based on her presentation and some dinner conversation at HPTS 2019, so I don't know if there's actually a paper I can point to. The gist of it is that Paxos normally involves an arbitration phase when there are conflicting proposals, which adds a second pair of message delays. But if you relax the consensus problem to agreement on a set of proposals, rather than a single proposal, you don't need the arbitration phase. Instead of "who won", it becomes "everyone wins". Then you can impose an order on that set via, say, sorting, and iterate to get a replicated log.
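A toy model of that relaxation (my own reading of the description above, not Howard's actual protocol): nodes agree on a *set* of proposals in one round, then derive identical log slices by deterministic sorting, with no arbitration between conflicting proposals:

```python
def round_to_log_slice(proposals_seen_by_node):
    # Each node may receive the round's proposals in a different order,
    # possibly with duplicates from retries. As long as every node ends up
    # with the same *set*, no "who won" arbitration phase is needed:
    # deterministic sorting imposes the same order everywhere.
    return sorted(set(proposals_seen_by_node))

# Three nodes see the same three proposals, in different orders:
node_a = round_to_log_slice(["tx2", "tx1", "tx3"])
node_b = round_to_log_slice(["tx3", "tx2", "tx1", "tx1"])
node_c = round_to_log_slice(["tx1", "tx3", "tx2"])

# All nodes append the identical slice to their replicated log;
# iterating this per round yields a full log for the FSM to consume.
assert node_a == node_b == node_c == ["tx1", "tx2", "tx3"]
```

The hard part that this sketch waves away is, of course, getting every node to the same set in one round trip despite faults -- that's where the actual consensus machinery lives.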
> I read over bits of Elle; the documentation in it is absolutely top-notch. You and Peter Alvaro knocked it out of the park!
Thank you! Could I... hang on, just let me grab reviewer #1 quickly, I'd like them to hear this. ;-)
> This is based on her presentation and some dinner conversation at HPTS 2019, so I don't know if there's actually a paper I can point to. The gist of it is that Paxos normally involves an arbitration phase when there are conflicting proposals, which adds a second pair of message delays. But if you relax the consensus problem to agreement on a set of proposals, rather than a single proposal, you don't need the arbitration phase. Instead of "who won", it becomes "everyone wins". Then you can impose an order on that set via, say, sorting, and iterate to get a replicated log.
This sounds very similar to atomic broadcast (https://en.wikipedia.org/wiki/Atomic_broadcast) where each node sends a single message and the process ensures that all nodes agree on the same set of messages. Not sure how it would fit with a log-oriented FSM, but it certainly sounds interesting.
It’s really pretty trivial to implement RSM given an atomic broadcast protocol. But you can implement many other things, like totally ordered ephemeral messaging with arbitrary fanout, or a replicated durable log ala Kafka. Here’s my current favorite atomic broadcast protocol (from 2007 or so), which is leaderless, has write throughput saturating network bandwidth, and read throughput scaling linearly with cluster size:
Nope! Something weird happened to that post; it got a lot of upvotes and some comments, but never made it to frontpage. After the InfoQ article took off yesterday, an HN mod got in touch and asked if I'd like to resubmit it.
I suppose there are reasons why the defaults are the way they are. Can anyone comment on the implications, performance or otherwise, of bumping up the read/write concerns?
Latency is a big one--you've got to wait an extra round-trip for secondaries to acknowledge primary writes, and primaries (assuming you don't have reliable clocks) need to check in with secondaries to confirm they have the most recent picture of things if you want to do a linearizable read. Snapshot isolated reads shouldn't require that, at least in theory--it's legal to read state from the past under SI, so there's no need to establish present leadership. That's why I'm surprised that MongoDB requires snapshot reads to go through write concern majority--it doesn't seem like it'd be necessary. Might have something to do with sharding--maybe establishing a consistent cut across shards requires a round of coordination. Even then I feel like that's a cost you should be able to pay only at write time, making reads fast, but... apparently not! I'm sure the MongoDB engineers who designed this system have good reasons; they're smart folks and understand the replication protocol much better than I do.
MongoDB's also published a writeup (which is cited a few times in the Jepsen report!) talking about the impact of stronger safety settings and why they choose weak defaults: http://www.vldb.org/pvldb/vol12/p2071-schultz.pdf
In general, MongoDB’s defaults fall into two categories. The first could possibly be justified as making it easy for inexperienced devs to get started, but it means that people rely on those defaults and then try to promote to production, and unless there is an experienced traditional DBA with the power to veto it, it will go ahead. This is how they “backdoor” their way into companies. The second category is whatever will look good on a benchmark, regardless of any corners cut.
Compare and contrast with the highly ethical Postgres team, who encourage good practices from the start and who get a feature right first before worrying about performance. That may harm their adoption in the short term but over the long term, that's why they're the gold standard. And with their JSONB datatype they have a better MongoDB than MongoDB anyway! And have a million other features besides!
> Compare and contrast with the highly ethical Postgres team
You do know that PostgreSQL had issues with not fsyncing data as well? It's technology. Bugs will be made. Design decisions will be wrong.
I think it's really disappointing and inappropriate to label MongoDB engineers as unethical simply for having incorrect defaults, which, historically, they have often changed after being made aware of them.
> You do know that PostgreSQL had issues with not fsyncing data as well?
See, you can name just one Postgres bug, and they held their hands up to it straight away. Whereas the MongoDB "bugs" are countless and, by sheer coincidence, mostly skew toward improving performance in benchmarks and demos. That's a pattern.
The downloadable version of DynamoDB is only intended for testing and is not a distributed system by any definition, nor does its behavior match the production system exactly.
There's no reason for Jepsen to be applied to a single-node in-memory KV store.
At this point I think we might be going a bit overboard with title changes.
Now that it's just "MongoDB 4.2.6", the title makes me think that this is a release announcement, not an analysis of the software.
The first title (that specifically referenced a finding of the analysis) was best, imo. Mildly opinionated or whatever, but at least it quickly communicated the gist of the post. On the other hand:
"Jepsen: MongoDB 4.2.6" – not super helpful if you're not already familiar with the Jepsen body of work.
"MongoDB 4.2.6" – as stated above, sounds like a release announcement.
If you want a suggestion, maybe something like "Jepsen evaluation of MongoDB 4.2.6"? Not overly specific (/ negative) like the first title, but at least provides some slight amount of context.
Please read the site guidelines: https://news.ycombinator.com/newsguidelines.html. They say: "If the title includes the name of the site, please take it out, because the site name will be displayed after the link." That's why a moderator changed it: the submitted title was "Jepsen: MongoDB 4.2.6".
I don't mind making an exception, since exceptions are things sometimes. Jepsen is famous on HN, so the current title is not an issue. Indeed, referencing a specific finding would arguably be misleading, since this article is the Jepsen report about MongoDB 4.2.6. Btw, I don't know what you mean by "The first title (that specifically referenced a finding of the analysis) was best". The submitted title was "Jepsen: MongoDB 4.2.6" and it has only ever rotated between two states, one with "Jepsen: " and one without. Are you confusing this thread with https://news.ycombinator.com/item?id=23285249?
It's very silly to have this be the top comment on the page (I've since downweighted it, but that's where it was when I looked in). Yesterday I briefly swapped the URL of this article into the other thread, but then reversed that because it seemed that thread couldn't support a more technical discussion (https://news.ycombinator.com/item?id=23288120). I invited aphyr to repost it instead, which was quite a break from our standard practice of downweighting follow-up posts, but seemed like the best solution at the time. What technical discussion was our reward? Bickering about title policy!
This... usually happens on Jepsen HN threads. The full title, as in the page metadata, and as originally submitted, is "Jepsen: MongoDB 4.2.6". At some point a mod drops the "Jepsen:" part, then we have this discussion, and it comes back. :)
"Why don't you put 'Jepsen:' on the same line as the database name and version?"
Space concerns, and also, it's immediately above the DB name in giant letters.
"Why don't you give them more creative names?"
Clients love to argue about the titles of these analyses; having a concise, predictable policy for titling is how I get past those discussions.
As another commenter pointed out, it might be worth making the titles "An evaluation of X" going forward – better for HN and probably better everywhere else this is shared too.
Not sure how many ways I can say this: the titles are already "Jepsen: X". HN's got a policy in place that means sometimes mods change the title to just "X". That's not something I have control over, sorry.
"An evaluation of MongoDB 4.2.6" might be neutral and informative enough I suppose.
But then again, ultimately the blame is on the author of the article; it's a terrible title for this type of article. I can understand if the moderators here don't want to go through the trouble of dealing with editorialized titles (with all the controversy they could generate) when clearly the original author didn't care enough to come up with a decent title.
Why? His site is about evaluating distributed data stores. In context of his site, that title makes perfect sense, HN should just add the missing context to its title.
Because as can be seen from the fact that most people only found this article because it was posted on HN (and not because they were browsing the site), the context of the overall site isn't super relevant.
Site context isn't a given when most of us are finding content via 3rd party sources.
A generic “Mongo 4.2.6” title doesn’t help me decide whether to click on the link (especially with how light the domain is). I thought it was a release announcement and only clicked through to the comments because of yesterday’s discussion.
That's a fair point, but people have a lot of contradictory preferences about things like that. I think I'd rather address this by allowing more customization of the site. Still thinking about https://news.ycombinator.com/item?id=23199264.
I mean, this was kind of an exception case, where there is a big old technical war of words back and forth. Almost a "He said She said" except here, He is an absolute expert, and She is just some marketing dorks at Mongo.
I, for one, welcome this by-hand moderation because it keeps this issue alive, and allows Kyle to keep the discussion going.
As I commented in a previous post, Kyle is the Chef Ramsay of database testing, and here he's in a position where some idiot has just served him an undercooked hamburger. Bits will fly, marketing people will be flayed alive, and Kyle will be the only one left standing at the end.
Without this by-hand moderation, we'd be missing out on the second act of this intense thriller!
They use a combination of algorithms and human intervention, to generally good effect.
No clue if this "downweighting" in this case is an algorithm or a manual thing. I would assume algorithm for the downweighting and human intervention for reversing it, but that's sort of a guess or inference.
I am the tech lead for a project that revolves around multiple terabytes of trading data for one of the ten largest banks in the world. My team has three 3-node, 3TB-per-node MongoDB clusters where we keep a huge number of documents (mostly immutable, 1kB to 10kB in size).
Majority write/read concern exists exactly so that you don't lose data and don't observe writes that are going to be rolled back. It is important to understand this when you evaluate MongoDB for your solution. That it comes with additional downsides is hardly a surprise; otherwise there would be no reason to specify anything other than majority.
You just can't test lower levels of guarantees and then complain you did not get what higher levels of guarantees were designed to provide.
It is also obvious that, with majority concern, some of the nodes may accept a write and then have to roll it back when the majority cannot acknowledge it. This may cause some writes to fail that would succeed if the write concern were configured not to require majority acknowledgment.
The article simply misses the mark by trying to create sensation where there is none to be found.
The MongoDB documentation explains the architecture and guarantees provided by MongoDB enough so that you should be able to understand various read/write concerns and that anything below majority does not guarantee much. This is a tradeoff which you are allowed to make provided you understand the consequences.
To quote from the report: "Moreover, the snapshot read concern did not guarantee snapshot unless paired with write concern majority—even for read-only transactions."
Of course it doesn't work when you don't pair it with majority read/write concern. You can't expect to get a snapshot of data that wasn't yet acknowledged by a majority of the cluster.
As to the quote you probably are referring to:
"Jepsen evaluated MongoDB version 4.2.6, and found that even at the strongest levels of read and write concern, it failed to preserve snapshot isolation."
I did not find any proof of this in the rest of the report. It seems to be mostly a complaint about what happens when you mix different read and write concerns.
I would also suggest thinking a little about the concept of a snapshot in the context of a distributed system. With MongoDB's architecture, it is not possible to have the same kind of snapshot that you would get with a single-node database. MongoDB is a distributed system where you will get different results depending on which node you ask.
The only way you could get close to a global snapshot is if all nodes agreed on a single source of truth (for example a single log file, a blockchain, etc.), which would preclude reads and writes with a concern level less than majority.
Did you see the part about "Operations in a transaction use the transaction-level read concern. That is, any read concern set at the collection and database level is ignored inside the transaction."?
"Tansactions without an explicit read concern downgrade any requested read concern at the database or collection level to a default level of local, which offers “no guarantee that the data has been written to a majority of replicas (i.e. may be rolled back).”"
The big problem is that, even if somebody correctly sets the read and write concerns to something sensible, the moment they use a transaction those guarantees fly out the window, unless they read the docs carefully enough to realise they have to set the read and write concern for the transaction too. The defaults are very unintuitive; I can't imagine that needing snapshot isolation in general but being fine with arbitrary data loss in transactions is a common case, compared to wanting to avoid data loss both generally and in transactions.
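For what it's worth, a minimal sketch of what "setting the concerns for the transaction too" looks like. The option names follow the MongoDB Node.js driver documentation; the session and collection usage in the comment is illustrative, not taken from any real codebase:

```javascript
// Sketch: pass explicit transaction-level concerns so the transaction does
// not fall back to the default read concern "local" described above.
const transactionOptions = {
  readConcern: { level: 'snapshot' },
  writeConcern: { w: 'majority' },
};

// Hedged usage, assuming `session` came from client.startSession():
// await session.withTransaction(async () => {
//   await collection.updateOne({ _id: 1 }, { $inc: { n: 1 } }, { session });
// }, transactionOptions);
```

Without that second argument, the transaction silently runs at read concern "local" regardless of what was configured at the database or collection level.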
Not saying you're wrong. As an anecdotal data point - we've read the docs (carefully) and spoke to MongoDB quite a bit when implementing transactions including their highest paid levels of support and still ran into this issue:
> transactions running with the strongest isolation levels can exhibit G1c: cyclic information flow.
As well as the Node.js API issue (I just checked randomly and their Python API has the same bug lol) listed above.
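To illustrate the failure mode I mean, here is a hypothetical simulation, not the driver's actual code: `withTransactionLike` stands in for a retrying `withTransaction`-style helper, and the closure's captured state plays the role of work tied to the shared session object:

```javascript
// Hypothetical simulation of a retrying withTransaction-style helper.
// A retry invokes the SAME callback again, so any state the closure
// accumulated during a failed attempt survives into the next one.
async function withTransactionLike(fn, maxAttempts = 2) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      // transient error: retry with the same callback (and, in the real
      // driver, the same session object)
    }
  }
}

async function demo() {
  const itemsWritten = []; // state captured by the closure across retries
  await withTransactionLike(async (attempt) => {
    itemsWritten.push(`write-from-attempt-${attempt}`);
    if (attempt === 1) throw new Error('TransientTransactionError');
  });
  // Both attempts contributed: "part of one attempt + parts of another".
  return itemsWritten;
}
```

Under these assumptions, `demo()` resolves with writes from both attempts mixed together, which is the corruption pattern described above when the callback isn't idempotent.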
It is not different. For a product like MongoDB, both the durability guarantees and the documentation explaining them are an integral part of the user experience. If I'm starting a project, I'm making decisions for a junior developer whom I'll hire in two years. I care what code that junior developer will be most nudged to write.
If the Stripe API had documentation that was needlessly unclear in a way which led people to lose a significant amount of money, that would be a bug.
Chief, it does not have to be this hard. 3.4 clearly states:
This anomaly occurred even with read concern snapshot and write concern majority
3.5: In this case, a test running with read concern snapshot and write concern majority executed a trio of transactions with the following dependency graph
3.6: Worse yet, transactions running with the strongest isolation levels can exhibit G1c: cyclic information flow.
3.7: It’s even possible for a single transaction to observe its own future effects. In this test run, four transactions, all executed at read concern snapshot and write concern majority, append 1, 2, 3, and 4 to key 586—but the transaction which wrote 1 observed [1 2 3 4] before it appended 1.
Like... if you had read any of these sections--or even their very first sentences--you wouldn't be in this position. They're also summarized both in the abstract and discussion sections, in case you skipped the results.
4.0: Finally, even with the strongest levels of read and write concern for both single-document and transactional operations, we observed cases of G-single (read skew), G1c (cyclic information flow), duplicated writes, and a sort of retrocausal internal consistency anomaly: within a single transaction, reads could observe that transaction’s own writes from the future. MongoDB appears to allow transactions to both observe and not observe prior transactions, and to observe one another’s writes. A single write could be applied multiple times, suggesting an error in MongoDB’s automatic retry mechanism. All of these behaviors are incompatible with MongoDB’s claims of snapshot isolation.
May I suggest alternative perspective on the matter?
Compared to a product like Oracle, transactions in MongoDB are very new, very niche functionality. Even MongoDB consultants openly suggest not using them.
MongoDB is really meant to store and retrieve documents. That's where the majority read/write concern guarantees come from.
As long as you are storing and retrieving documents, you are pretty safe.
Your article presents the situation as if MongoDB did not work correctly at all. That is simply not true, the most you can say is that a single (niche) feature doesn't work.
Have you ever tried distributed transactions with relational databases? Everybody knows these exist, but nobody of sound mind would ever architect their application to rely on them.
Any person with a bit of experience will understand that things don't come free and some things are just too good to be true. MongoDB marketing may be a bit trigger-happy with their advertisements, but that does not mean the product is unusable; they probably just promised a bit too much.
The world does not revolve around HN votes. If your first urge is whether the post gets downvoted or not you might want to rethink your life a little bit.
I'm not "worried" nor experiencing an "urge." Please skip the concern trolling.
What I do have an interest in is HN's accepted decorum, which I admittedly stepped outside of when I implored you to stop digging yourself such a hole.
HN is far from perfect but there is a culture of respectful discourse here, which is part of the reason for its value IMO.
May I suggest the tiniest bit of consideration (such as reading the report) before jumping to conclusions and low-key offending the author? You should be embarrassed.
This comment looks a bit comical when compared with the one you started this whole thread with. You're an engineer, why are you siding with marketing over measured technical facts? Do you think denial will make your infrastructure any safer? Don't make excuses for MongoDB, just acknowledge the article as an appropriately well weighted response to their marketing claims and move on.
> May I suggest alternative perspective on the matter?
Can't reply to that since it's too nested so I'll reply here. I warmly recommend climbing down from that tree and actually reading the article, because if you do you will see you are not disagreeing on that part.
The article is a mostly technical analysis of the transaction isolation levels and where they hold. The main criticism is how MongoDB advertises itself. If they didn't claim the database is "fully ACID" then the article would have just been a technical analysis :]
> The article simply misses the mark by trying to create sensation where there is none to be found.
As someone who is a tech lead for a large database install, I'd urge you to read the rest of the Jepsen reports. They aren't intended to be hit pieces on technology - they're deep dives into the claims and guarantees of each database. IIRC MDB has explicitly reached out to OP in the past (I doubt they'll continue to do so after this).
Why that matters to the rest of us: once I learn all those dials and knobs I'm left wondering why I would choose Mongo over another technology, and how much the design of the default behavior and complexity of said dials/knobs are influenced by their core business.
I would also wonder about the surrounding ecosystem of tooling & libraries.
Imagine there was a programming language which had rather inconsistent naming, poor automated testing support, and a history of guiding its users toward security vulnerabilities. A culture would grow up around that language and the most successful members would be those who could best tolerate those properties. People generally self-select into language communities. So unless some powerful influence pushed random programmers to use the language or made it easier to add new tooling, the culture would continue to undervalue what the language originally lacked.
I suspect the same social dynamic would apply to a database.
I agree. MongoDB has a large number of peculiarities that you'd better know before you buy in. It is definitely not as rosy as advertised. In particular, the product does not seem mature (especially if you come from the Oracle world) and the features seem slapped on as they go rather than thought through.
The documentation states that very clearly and the attributes are part of every call to the database (as long as you are using native driver).
In any case, anyone with some experience in distributed systems will understand roughly what it means to get an acknowledgment from just a single node vs. waiting for the majority.
Oracle also does not use serializable as its default isolation level, yet it advertises it.
This is all part of the product functionality. Whenever you evaluate product for your project you have to understand various options, functionalities and their tradeoffs.
Defaults don't mean shit. In a complex clustered product you need to understand all important knobs to decide the correct settings and configurable guarantees are most important knobs there are.
Since you just leaned all the way in, repeatedly proving you either will not or cannot read the posted article: will you let us know which bank you support, so at least I can make sure I never use it?
Thanks,
Those of us who care about our banking and investing data.