I also want to point out that their Node.js transactions API is broken, and it looks like they have no idea how promises or async code work in JS.
In Mongo, you have a `withTransaction(fn)` helper that passes a session parameter. Mongo can call this function multiple times with the same session object.
This means that if you have an async function holding a reference to a session and a transaction gets retried, you very often get "part of one attempt + some parts of another" committed.
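To make the failure mode concrete, here is a self-contained toy simulation (no MongoDB involved): the fake `withTransaction` below just mimics how the real driver re-invokes your callback with the same session object on a transient error, and shows how a stray `await` from attempt 1 can leak a write into attempt 2.

```javascript
// Toy model: NOT the real driver, just the retry-with-same-session shape.
const committed = [];

function makeSession() {
  return { writes: [], write(op) { this.writes.push(op); } };
}

async function withTransaction(fn) {
  const session = makeSession();       // ONE session object for every attempt
  try {
    await fn(session);
  } catch (e) {
    session.writes.length = 0;         // "abort" attempt 1...
    await fn(session);                 // ...and retry with the SAME session
  }
  committed.push(...session.writes);   // "commit" whatever the session holds
}

let attempt = 0;
const result = (async () => {
  await withTransaction(async (session) => {
    attempt += 1;
    const myAttempt = attempt;
    session.write(`A${myAttempt}`);
    if (myAttempt === 1) {
      // A slow non-Mongo await (think: Redis) that completes AFTER the
      // retry has started, still holding a reference to the session:
      setTimeout(() => session.write('leftover-from-attempt-1'), 10);
      throw new Error('TransientTransactionError');
    }
    await new Promise((r) => setTimeout(r, 50)); // let the stray write land
    session.write(`B${myAttempt}`);
  });
  return committed;
})();

result.then((c) => console.log(c.join(',')));
// → A2,leftover-from-attempt-1,B2 — attempt 2 plus a leftover from attempt 1
```

The real driver is obviously more involved, but the shape is the same: nothing invalidates the session between attempts, so late writes from an aborted attempt can land inside the retried transaction.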
We had to write a ton of logic around their poor implementation and I was shocked to see the code underneath.
It was just such a stark contrast to products that I worked with before that generally "just worked", like Postgres, Elasticsearch or Redis. Even tools people joke about a lot, like MySQL, never gave me this sort of data corruption.
Edit: I was kind of angry when writing this, so I didn't provide a source, and I'm a bit surprised this got so many upvotes without a source (I guess this community is more trusting than I assumed :] ). Anyway, for good measure, and to behave the way I'd like others to when making such accusations, here is where they pass the same session object to the transaction: https://github.com/mongodb/node-mongodb-native/blob/e5b762c6... (follow from withTransaction in that file) - I can add examples of code easily introducing the above-mentioned bug if people are interested.
If you work for Mongo and are reading this: please just fix it. I don't need to win and I don't care about being "right".
I just don't want to be called to the office on a weekend anymore for this sort of BS.
Production incidents with MongoDB last year: 15
Production incidents with Redis, Elasticsearch and MySQL combined last year: 2 (and with much less severity)
Edit: just to add: I didn't pick Mongo, I was just the engineer called in to clean up that mess. I've created enough messes of my own not to resent the person who made that call. We are constantly on the verge of rewriting the MongoDB stuff, since a database that small (~250GB) really should not have this many issues (in previous workplaces I ran ~10TB PostgreSQL deployments with much more complicated schemas and queries, with far fewer issues). It's also expensive, and support at Mongo Atlas hasn't been great (we should probably self-host, but I am not used to small databases being this problematic).
The Guardian posted quite a nice blog post in 2018 about their switch from MongoDB to Postgres. Especially interesting because they intended to use Postgres as a replacement document store. Here's the link: https://www.theguardian.com/info/2018/nov/30/bye-bye-mongo-h...
I was actually amazed that a big CMS/E-commerce vendor proudly proclaimed in a sales meeting that they were on MongoDB.
I suppose salespeople probably aren't into the nitty-gritty, but their tech people should have warned them about this. Maybe they were just trying to pull our collective leg, but I suppose that's why I was at that meeting.
There aren't a lot of CMS/Ecommerce vendors that sit on MongoDB, so maybe we were in a meeting together!
Even if we weren't - as a sales engineer on a large CMS/ECommerce platform with merchants running $150M+ in annual revenue, with an average client retention of seven years, and two decades of agency experience behind the decisions around building that platform, if you instantly said no just because of MongoDB, maybe you don't know as much about MongoDB as you think you do.
I came from a SQL background myself, and had reservations based on all the things I'd read about MongoDB as we decided to build a platform after doing things bespoke for two decades, but time has proven our architecture choices out. It's easy to be proud of something that works well.
I didn't pick Mongo, I was just the engineer called to clean that mess.
My only experience with MongoDB is being "the engineer called to clean the mess". I'm sure you can effectively use MongoDB in production if you're knowledgeable and careful, but most people aren't, and they shouldn't have to know the detailed inner workings to avoid creating a mess.
In this case, the parent commenter probably meant that "newbie web developers" are likely to choose MongoDB. Of course, web developers have a range of experience, some new, some seasoned.
Caveat: this is a meta-comment about voting, not a complaint about how people upvoted or downvoted the parent comment. (My motivations are explained at the very bottom).
Based on seeing how comments like this may get interpreted, as well as broader thinking about online communication, I think HN should consider a more nuanced system of comment feedback mechanisms.
I don't have a particular plan finalized, but I would like to see HN provide feedback on different aspects of the comment. Below are some important aspects:
To what degree does the reader / voter... *
1. agree/disagree with the comment?
2. find the comment relevant / irrelevant to the topic as a whole?
3. find the comment is situated in the correct / incorrect location in the thread? (e.g. responding to the parent comment or not)
4. find the comment interesting / uninteresting?
5. think the comment adds to a diversity of perspectives?
* When I write '/' above, I intend it to be a continuum; e.g. hot/cold means "in the continuum between hot and cold".
Additionally, being able to give feedback in a more granular fashion could be of use. For example, in my comment above, I would not be surprised if a significant number of people were bothered/offended by my commentary that people seem to be taking offense more easily. Some would call this ironic -- I wouldn't -- I think it gives more data to prove the point.
Motivations: my goal here is not to gain or lose karma -- I care very little about karma here, precisely because it is so muddled and varied from person to person -- as long as I have enough to participate fully. My goal is to learn and play a small part in fostering awareness and community, while hopefully motivating others to reflect on their impact on the community here.
MySQL is less of a joke than MongoDB is. They were similarly started by people who didn't know much about databases and learned about them on the go. Both started as much faster alternatives to other databases, and both ended up having their engines completely rewritten by outsiders who knew their stuff: MySQL went from ISAM to MyISAM and then InnoDB (written by an outsider), and MongoDB similarly got WiredTiger.
The thing is that MySQL is older, so it went through all of this earlier, but it still suffers from poor decisions from the past. This contrasts with PostgreSQL, where correctness and reliability were #1 from the beginning. It started as an awfully slow database, but performance improved over time and we now have a correct, reliable and fast database.
If you were around back in the day you will remember the MySQL team claiming that no one needed transactions or referential integrity, that you should just do it yourself in the application...
MySQL's rise IMO cannot be considered without also looking at the rise of Ruby on Rails and other CRUD-optimized platforms and frameworks. Also ORMs. These things denigrated the idea of using an RDBMS as anything but a dumb table store. Features like stored procedures and views were seen as pointless. MySQL was the perfect database for people who had no respect for databases.
I agree that the rise of MySQL is combined with using RDBMS as a table store rather than a relational database, but I am not positive that this was driven by RoR and ORMs. Every large-scale system I have worked with that utilizes MySQL (and I'm on at least my third in a row of these systems, sadly!) is/was driven by application-logic database utilization via the "FriendFeed model" - that is, a big fat ID->Document Blob table for persistence and breakout tables for indexing.
ORMs and ActiveRecord in particular encourage, to some extent, the use of a RDBMS, even if they didn't get to take advantage of them well for a long time - for example, in RoR "has_one / has_many" for foreign-key relationship, .joins(:field_name) for, well, joins, and so on.
Perhaps. Something happened between those first-generation web sites where you were writing SQL by hand -- so you could just as easily be writing (injection-attack-prone) queries that made use of stored procedures etc -- and today.
A big reason I called out RoR is that back in '04-05 I was railing against its default use of plural table names, and DHH on IRC recommended I shut up and just flip the configuration switch and turn off the feature, but of course when I did that all sorts of latent bugs were exposed.
RoR was the beginning of hipster "coding" and I therefore blame it for everything.
I wasn't previously familiar with the FriendFeed approach to database (ab)use. I paid about as much attention to it as I did to MySpace back in the day -- nearly zilch -- so its innards are doubly obscure to me.
MySQL was already ridiculously popular from use in PHP/MySQL applications well before Rails was popular. That said, I generally agree with your statement:
> MySQL was the perfect database for people who had no respect for databases.
No, but it does support online DDL for some operations in InnoDB.
Very few database systems support online DDL, which unlike a transaction, does not require undo or rollback resources.
Of course one must have a rollback procedure if something fails, but you need one for transactions too, just in case.
An online rollback is far less costly than a transactional rollback, because an online rollback is just undoing what you did. Added a column you didn't want in one query? Remove it again in another, very quickly.
TokuDB (a MySQL/MariaDB storage engine) supported all DDL as an online operation. But Percona killed it in favour of TokuMX, the MongoDB equivalent.
TokuMX has no upgrade path to WiredTiger, only one major customer at Percona (I can't say who it is) and no engineers.
Any kind of DDL is tricky and requires users to RTFM for the intricacies of their chosen database. One size rarely fits all.
TokuDB is a great storage engine! Online DDL and fast compression are a winning combo. We use it for all our big MySQL tables. It is still available in MySQL 8.0.
I really wish Percona would reconsider their decision to deprecate it.
After Percona took over TokuDB's creator TokuTek, they wasted so much of their development time and money on TokuMX (Percona's fractal tree-enabled MongoDB server) only to abandon it in 2017.
That money would have been better spent on TokuDB development, to allow it to match the features present in InnoDB like generated columns, spatial indexing, fulltext indexes and Galera.
TokuDB still has many users and MyRocks is just no substitute.
Good question. You'd need some accurate-enough data source telling you about failed writes. Which eventually comes back around to needing a consistent database and indications of client disconnects.
With a huge amount of data (as I've heard analytics is), could you take a sampling approach where you log every nth transaction and only check those against the DB?
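A sketch of what that sampling could look like, assuming a toy in-memory `Map` stands in for the real database (all names are made up):

```javascript
// Record every nth write locally, then audit only those samples
// against what the database actually returns.
function makeSampler(n) {
  let count = 0;
  const samples = [];
  return {
    record(key, value) {
      count += 1;
      if (count % n === 0) samples.push({ key, value }); // keep 1-in-n
    },
    // Return the keys of sampled writes the store no longer agrees with.
    audit(db) {
      return samples.filter((s) => db.get(s.key) !== s.value).map((s) => s.key);
    },
  };
}

// Usage: 1000 writes, one sampled write silently "lost", audit finds it.
const db = new Map();
const sampler = makeSampler(100);
for (let i = 1; i <= 1000; i++) {
  db.set(`k${i}`, i);
  sampler.record(`k${i}`, i);
}
db.delete('k300'); // simulate a lost write that happened to be sampled
console.log(sampler.audit(db)); // → [ 'k300' ]
```

The obvious caveat: a lost write only shows up in the audit if it happened to be sampled, so this estimates a loss rate rather than catching every lost write.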
Sure. Data that people don't care about enough to be worried about losing--for example, time series data from an unimportant remote sensor. Should this data be recorded at all? Maybe not, but if it should, then a best-effort recording may be fine. It may even be all that's possible.
I wouldn’t go as far as to say an “unimportant” remote sensor... but I think you’re correct in spirit.
I could think of an instance where you’d like to log data, but the occasional datapoint being missing wouldn’t be terrible. Maybe something like a temperature monitor — you’d like to have a record of the temperature by the minute, but if a few records dropped out, you’d be able to guess the missing values from context. Something like the data monitoring equivalent of UDP vs TCP.
Even more elementary than the sibling comments: this also happens in gaming all the time. You are recording live results, say in FIFA, but if you unplug your device, your results are gone, since they were in memory only. The game simply cannot afford to write to disk; the write is "non-guaranteed" in the true sense of the word, but it is fast.
You then "checkpoint" when the game is over.
You might object that this is not a "non-guaranteed" write, because in fact the write did occur, but I simply want to allude to the concept of a "non-secured" write, in that it vanished without an fsync.
When I was evaluating MongoDB a couple of years ago (around the time they were switching to the WiredTiger engine), I found a memory leak in their Node.js client on day one. I submitted a ticket on their Jira and at the same time had a look at the other issues they had there: memory leak after memory leak, memory corruption everywhere, data disappearing without any reason, segfaults, etc. After that, MongoDB was dropped as a candidate for a DB in the project I was working on; we went with Postgres and never regretted it.
Not the author, but I've done similar things (patching something rather than migrating away from it). Usually it's way more work to migrate away than to just patch it again to fit your use case. Once you find yourself having to patch it too often, you start thinking about migrating away. Then the research slowly begins, ad hoc, until it hits "seems we need to migrate away now, otherwise we're spending too much time working around it / fixing their broken shit"; that's when you sit down and decide to migrate away from it.
It also depends on how long you think the application will be around. You're building an MVP to evaluate something? Just hack together whatever works (then throw it away). You're maintaining software for a library/archive that will most likely stick around for a long time, even if they say it's just temporary? Make decisions that will help in the future, always.
We have a complicated system, and migration is ~3 months during which we won't be shipping features.
We have a roadmap we need to meet, and so far we have been trying to throw money at the problem rather than developers (paying Mongo Atlas) and adding features incrementally as Mongo gets them (like transactions).
If this wasn't a startup we would probably rewrite.
> Even tools people joke about a lot like mysql never gave me this sort of data corruption.
That's about a decade out of date at this point. MySQL/InnoDB is the standard table engine and corruption is exceedingly rare. As of 2014, when I last directly worked on MySQL prod systems, there was no practical difference from PostgreSQL in terms of transactional guarantees. That includes APIs like JDBC which we used for billions of transactions.
The biggest issue with MySQL/MariaDB isn't so much data corruption at the InnoDB level but stuff like:
MariaDB [test]> create table test ( i int );
Query OK, 0 rows affected (0.06 sec)
MariaDB [test]> insert into test values (''), ('xxx');
Query OK, 2 rows affected, 2 warnings (0.01 sec)
MariaDB [test]> select * from test;
+------+
| i |
+------+
| 0 |
| 0 |
+------+
2 rows in set (0.01 sec)
There's a bunch of other similar caveats as well, and this can really take you by surprise. I've seen it introduce data integrity issues more than once.
That's a new MariaDB 15.1 with the default settings I just installed the other day to test some WordPress stuff. I know there are warnings, and that you can configure this by adding STRICT_ALL_TABLES to SQL_MODE, but IMO it's a dangerous default.
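For what it's worth, a sketch of forcing strict mode server-wide (note: real configs usually append STRICT_ALL_TABLES to the existing sql_mode list rather than replacing it wholesale, so check what your server already has set):

```
# my.cnf / mariadb.cnf sketch: reject invalid values instead of coercing them
[mysqld]
sql_mode = "STRICT_ALL_TABLES"
```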
This is also an issue with using MongoDB as a generic database: every time I've seen it used there were these kinds of data integrity issues, sometimes minor, sometimes bringing everything down. Jepsen reports aside, this alone should make people double-check whether they really want or need MongoDB, because it turns out that most of the time you don't.
MySQL still has no transactional DDL (and I think still even autocommits if you try). This is a major difference from Postgres which I believe supports everything short of dropping tables.
Every month, we do an external database import into our production PostgreSQL database. In a single transaction, we drop dozens of tables, create new ones with the same names, insert hundreds of thousands of rows, and recreate indexes. It works flawlessly.
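A hedged sketch of what such an import can look like (table, column and file names are made up):

```sql
BEGIN;
DROP TABLE IF EXISTS monthly_import;
CREATE TABLE monthly_import (id bigint PRIMARY KEY, payload jsonb);
COPY monthly_import FROM '/data/import.csv' WITH (FORMAT csv);
CREATE INDEX ON monthly_import ((payload ->> 'sku'));
COMMIT;
-- If any statement fails, the whole transaction aborts and the previous
-- tables, rows and indexes are all still in place.
```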
I wouldn't use that particular thing against MySQL. DDL is normally supposed to be outside a transaction; it's just a PostgreSQL feature that you can run it inside one and be able to roll back. BTW, I'm convinced you can also drop a table within a transaction in PostgreSQL.
No, MySQL stands out here. Postgres, SQL Server, DB2, and Firebird all give at least some way to do some major DDL transactionally. Usability varies (e.g. Oracle supports a very specific kind of change that is not its normal DDL statements), but it's at least possible.
> DDL normally supposed to be always outside of a transaction
A basic element of the relational model is that metadata is stored as relational data and that the same guarantees that apply to manipulating main data in the database apply to manipulating the schema metadata.
It's true that many real relational databases compromise on this element in various ways at times, but it is absolutely not the case that DDL “is supposed to be” non-transactional.
Mongo may retry running it (calling the function again) if a "TransientTransactionError" is raised (the transaction is retried from the client side rather than at the cluster).
However, when the driver calls your function again it doesn't invalidate the `session` object - so previous calls to the same function can make updates to the database.
Let's say `someOp` does something that causes the transaction to retry and `someOtherOp` is doing something non-mongo-related in the meantime (like pulling a value from redis). Now `someOtherOp` reached the mongo part of its code and it is executing it happily with the same session object (so operations succeed although they really shouldn't)
The point of transactions like you said is to perform multiple operations atomically and for them to happen "exactly once or not at all". With Mongo in practice it is very easy to get "Once and some leftovers from a previous attempt".
Sorry, I haven’t had my coffee yet. If I am reading this correctly, either someOp() or someOtherOp() may execute first, no? And if you introduce an external database, why do you expect Mongo to handle that rollback? Say someOtherOp() increments a Redis value by 1. If that part executed first since both are asynchronous here, what would a Mongo session have to do with it?
What exactly would invalidating that session object do here? And what would the session object do after it was invalidated?
Thanks for this explanation. So if I understand correctly, `someOp` has thrown an error but this doesn't affect `someOtherOp`? So `someOtherOp` will end up being called twice?
I think this is the expected behaviour of the transaction, but the problem comes from the fact that you wrap all the DB operations inside a Promise.all.
Because you wrap the DB operations inside a Promise.all, it will run them all, but it will not revert them if one fails (it's not atomic; it just tells you that one has failed and you need to catch it). It rejects them but does not revert them: the CUD operations will already have changed the data.
The problem, I believe, is that the transaction considers the Promise.all and not what's inside of it, so it will run it again despite the fact that some operations already succeeded earlier.
I think you just have to resolve each of them outside a Promise.all.
In your case, because the Promise.all was rejected, it will redo the transaction, and therefore redo the operations that already worked in the first call.
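A tiny runnable illustration of that point: `Promise.all` rejects as soon as one promise rejects, but it does nothing to undo the side effects of the ones that already succeeded (the `applied` array here stands in for database writes):

```javascript
const applied = []; // stands in for rows already written by CUD operations

const write = (name) => Promise.resolve().then(() => { applied.push(name); });
const failing = () => Promise.reject(new Error('boom'));

const result = (async () => {
  try {
    await Promise.all([write('op1'), failing(), write('op2')]);
  } catch (e) {
    // We land here, but 'op1' and 'op2' were still applied: nothing
    // reverted them; Promise.all only reported the failure.
  }
  return applied;
})();

result.then((a) => console.log(a.join(','))); // → op1,op2
```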
Not OP, but I think what he means is that the session, even when it has already failed once, can still be used (without error) in the next operation, without being invalidated.
Without Promise.all, I think it can be replicated like this:
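Something along these lines, perhaps (a hedged pseudocode sketch, not runnable as-is; `client`, `db` and `redis` are hypothetical stand-ins):

```javascript
const session = client.startSession();

await session.withTransaction(async () => {
  await db.collection('a').updateOne({ _id: 1 }, { $inc: { n: 1 } }, { session });

  // Work that awaits something non-Mongo while holding the session:
  const straggler = (async () => {
    await redis.get('some-key'); // slow external await
    // If a TransientTransactionError aborted attempt 1 in the meantime,
    // this write executes under the SAME session during attempt 2:
    await db.collection('b').insertOne({ x: 1 }, { session });
  })();

  // If this conflicts and throws a transient error, the driver retries
  // the whole callback while `straggler` is still pending:
  await db.collection('a').updateOne({ _id: 2 }, { $inc: { n: 1 } }, { session });
  await straggler;
});
```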
I have found that people attack me when I make technical criticisms of technologies they like.
Am I expected to justify being able to code on HN? I could go on about being a maintainer of the two most popular promise libraries on npm (bluebird and Q), being in Node core, organizing sessions about promise APIs in Node core, and having over 1000 answers about promises on SO.
I generally find that sort of "non-technical chat" boring compared to the technical stuff, and appeals to authority kind of lame.
The only other person attacked in this issue is aphyr, so I believe I am in very good company in this particular instance.
Yes, several times - we pay Mongo Atlas over $5000 per month.
We reported it immediately at the highest severity and we pay for the highest tier of support - we tried to collaborate as soon as possible. It sort of went "over their head".
Can you link the JIRA ticket? I use the Node driver heavily and have contributed several PRs to it in the past, would be more than happy to fix this since it seems fairly bad.
Hi folks! Author of the report here. If anyone has questions about detecting transactional anomalies, what those anomalies are in the first place, snapshot isolation, etc., I'm happy to answer as best I can.
Have you considered presenting the data in a concise manner in addition to the in-depth analyses?
That is, a table on the jepsen.io frontpage, or at least on each product's review page, with database products and configuration on rows and consistency properties on columns, and a nice "Yay!" or "Nope!" mark in the cell, plus links on how to achieve the database configurations in the table (esp. how to configure each database to have the most guarantees).
Also, ideally the analyses should be rerun automatically (or possibly after being paid, but making it easy for the company to do so) every time a new major release happens rather than being done once and then being stale.
Finally, there should be tests for the non-broken databases (PostgreSQL for instance, both in single-server mode, deployed with Stolon on Kubernetes and using the multimaster projects) as well to confirm they actually work.
> That is, a table on the jepsen.io frontpage, or at least on each product's review page, with database products and configuration on rows and consistency properties on columns, and a nice "Yay!" or "Nope!" mark in the cell, plus links on how to achieve the database configurations in the table (esp. how to configure each database to have the most guarantees).
This is a wonderful idea, and I've got no idea how to actually do it in a standardized, rigorous way. Vendor claims are often contradictory, it's hard to get a good idea of anomaly frequency, availability is... a rabbithole, and it's hard to come up with a standard taxonomy of anomalies--most of the analyses I do wind up finding something I've never really seen before, haha. With that in mind, I've wound up letting the reports speak for themselves.
> Also, ideally the analyses should be rerun automatically (or possibly after being paid, but making it easy for the company to do so) every time a new major release happens rather than being done once and then being stale.
I don't know a good way to do this either. Each report is typically the product of months of experimental work; it's not like Jepsen is a pass-fail test suite that gives immediately accurate results. There is, unfortunately, a lot of subtle interpretive work that goes into figuring out if a test is doing something meaningful, and a lot of that work needs to be repeated on each test run. Think, like... staring at the logs and noticing that a certain class of exception is being caught more often than you might have expected, and realizing that a certain type of transaction now triggers a new conflict detection mechanism which causes higher probabilities of aborts; those aborts reduce the frequency with which you can observe database state, allowing a race condition to go un-noticed. That kinda thing.
If I'm lucky and the API/setup process haven't changed, I can re-run an analysis in about a week or so. If I'm unlucky, there's been drift in the OS, setup process, APIs, client libraries, error handling, etc. It's not uncommon for a repeat analysis to take months. :-(
It's probably more snarky than helpful, but it'd be great to have a section where it's just marketing materials or docs that you've corrected with a red pen
It's probably better to keep it professional. Your average employee can afford some snark. But when companies hire you for this sort of consulting, you could turn off a lot of potential clients by including it in materials you produce, even when they didn't pay for it. Because it is a representation of the product they would be paying for.
It would be kinda like you including this sort of thing on your resume. Which would also be a bad idea.
For those who don’t know, Kyle makes a living offering these types of analysis to database companies directly. While a lot of us love to dunk on Mongo (myself included), it would be silly to expect Kyle to risk his livelihood.
If done accurately and professionally, something like you're suggesting could be really useful to aid people and organizations during vendor selection.
Thank you for all of your work over the years. Your reports have helped me and others stand up to bizdev hype and make better decisions for our companies and customers.
Postgres is widely understood to be a robust database with safe defaults. I, and perhaps others, would love to see you aim your array of weapons at Postgres. Do you have any plans to look at stock Postgres?
It's been on my list for a long time, but I've also struggled to find out like... what, exactly, is the right way to do postgres replication? Every time I go into the docs I wind up with a laundry list of different mechanisms for replication and failover, and no idea which one would be most appropriate for a test. I gotta get on this!
It'd be especially interesting given that MongoDB claims this:
> Postgres has both asynchronous (the default) and synchronous replication options, neither of which offers automatic failure detection and failover [12]. The synchronous replication only waits for durability on one additional node, regardless of how many nodes exist [13]. Additionally, Postgres allows one to tune these durability behaviors at the user level. When reading from a node, there is no way to specify the durability or recency of the data read. A query may return data that is subsequently lost. Additionally, Postgres does not guarantee clients can read their own writes across nodes.
> It'd be especially interesting given that MongoDB claims this:
> > Postgres has both asynchronous (the default) and synchronous replication options, neither of which offers automatic failure detection and failover [12]. The synchronous replication only waits for durability on one additional node, regardless of how many nodes exist [13]. Additionally, Postgres allows one to tune these durability behaviors at the user level. When reading from a node, there is no way to specify the durability or recency of the data read. A query may return data that is subsequently lost. Additionally, Postgres does not guarantee clients can read their own writes across nodes.
This is like those commonly seen tables comparing your product with others where your product had checkmarks in all categories, and of course competitors are missing a bunch of them. The problem is that the categories were picked by you, and are often irrelevant to the other product. This is the case here.
PostgreSQL is not a distributed database; the master does all the writes and the replicas are read only. By default, replicas are asynchronous, which means they won't affect master performance, at the cost of the data there lagging by a few seconds. Since you can't write to replicas, this can't cause data corruption, only delay, which is often acceptable. If you design your application to have two database endpoints, one for writes and one just for reads, you can then decide based on context which endpoint you want to use. The read-only endpoint is easy to scale, but as mentioned earlier it is read only and might lag slightly.
Now, for failover, you might also opt to use synchronous replicas. This adds extra latency, but then you always have at least one machine with the same data. They mentioned that if you have multiple synchronous standbys, only one needs to acknowledge the write. Actually, that's configurable: you can specify a group of synchronous machines and how many of them need to be in sync, with the remaining ones acting as backups in case the ones you specified aren't available.
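The knob for that is `synchronous_standby_names` (on recent PostgreSQL versions); a sketch, with hypothetical standby names:

```
# postgresql.conf on the primary: wait for any two of the three listed
# standbys to confirm each commit; the others remain asynchronous backups.
synchronous_standby_names = 'ANY 2 (standby_a, standby_b, standby_c)'
synchronous_commit = on
```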
Besides, writes don't work the same way as in Mongo: when a standby node is in sync, it isn't just in sync for that particular write, it is completely in sync, so their argument about not being able to specify durability/recency of data on read is moot. If you contact the master or a synchronous replica, you will always get the most recent state. If you don't mind a slight delay, you should query the asynchronous replicas (in fact you should prefer them whenever you can, since they are cheap to add).
I think that it'd be super-valuable to do an analysis of an RDS Postgres deployment. Amazon is doing some dark magic with RDS that sits at this really interesting "distributed, but not that distributed" inflection point, which impacts the basic assumptions of lots of distributed database design.
I believe RDS Postgres is probably the right answer for lots of applications, especially for those that already depend on AWS for baseline availability. I'd love to see if that holds up against a rigorous analysis.
the setup would probably be pgsql primary, aurora secondaries on diff zones and something changing cross-zone or cross-region vpc settings to try to break replication? never tried that, but i was hurt by rds pure pgsql cross-region replication in a network outage situation.
i feel like this is the reaction of everyone who has ever tried to set up postgres replication. With your audience, you deciding on a particular setup will probably help a LOT of people, and ultimately the postgres project as well.
If you worry about data, you should not use automatic failover. It's nearly impossible for the standby to know why the master stopped responding. Maybe there was a hardware failure, or maybe the master is just busy. This is why manual failover is better: you can find out the real reason and decide whether you should perform the failover or just wait.
With tools like repmgr it is just a single command invoked on the standby.
If you absolutely don't want to lose any data, you should have two masters in close proximity (so the latency isn't high) set up with synchronous replication, then have one or two standbys with asynchronous replication. This will reduce throughput, but then you can be sure that the other machine has all the same transactions. If something happens to both, you can then fall back to the asynchronous one, which might be a bit behind.
Quoting a former colleague here, but "if it hurts, do it more often". That is what you should do with your PostgreSQL failovers.
I have clusters running on timelines in the hundreds without a byte of data loss due to using synchronous replication, tools that help out with leader election, and just doing it often.
Can Patroni tell if the master node is not responsive because it is busy vs dead? GitHub (I believe) had a few outages that caused data loss because their auto-failover mechanism kicked in when it shouldn't have.
I would actually be interested in aphyr's analysis of Patroni and other distributed add-ons to PostgreSQL.
There is no real difference between dead or too-busy.
The only question is how soon you are going to page humans: after the automated mechanism has flipped your master 2-3 times but the cluster still hasn't made progress (nothing coming out of the master, or it locks up after a few minutes again), or right after some other automated mechanism detects that there's a problem.
Whatever automation you have in place, it has advantages and disadvantages. In the GitHub case - I suppose - they determined post-mortem that it would have been better to just let the master chug through the incoming onslaught of queries instead of failing over, and over, and over. (But of course this seems like a trivial problem in any auto failover setup, so I suspect there's more to the story.)
> Can Patroni tell if master node is not responsive because it is busy vs dead
No. But the contract Patroni has is this:
I only serve a master (primary) if I have the lock.
If I do not have the lock I will demote.
This guarantees that there can be only one primary active at any given point in time, even if the network is partitioned.
This in and of itself does not guarantee no-split-brain situations, a split-brain can occur if writes were made on the former primary, but not yet on the future primary.
This however can be mitigated with synchronous replication.
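That lock contract can be sketched as a lease loop. This is a toy model of the idea, not Patroni's actual code; `FakeDCS` stands in for a real DCS like etcd or Consul, and all names here are my own invention:

```python
class FakeDCS:
    """Stand-in for etcd/Consul: a single leader lock with a TTL."""
    def __init__(self):
        self.holder, self.expires = None, 0.0

    def acquire(self, node, ttl, now):
        # The current holder may renew; anyone may take an expired lock.
        if self.holder in (None, node) or now >= self.expires:
            self.holder, self.expires = node, now + ttl
            return True
        return False

class Node:
    def __init__(self, name, dcs):
        self.name, self.dcs, self.role = name, dcs, "replica"

    def tick(self, now):
        # The whole contract: serve as primary only while holding the lock;
        # demote the moment the lock cannot be (re)acquired.
        if self.dcs.acquire(self.name, ttl=30, now=now):
            self.role = "primary"
        else:
            self.role = "replica"

dcs = FakeDCS()
a, b = Node("a", dcs), Node("b", dcs)
a.tick(now=0); b.tick(now=0)
assert (a.role, b.role) == ("primary", "replica")

# "a" is partitioned away and stops renewing; once the TTL lapses,
# "b" takes the lock. The partitioned "a" must demote itself too,
# because it can no longer acquire the lock either.
b.tick(now=31)
a.tick(now=31)
assert (a.role, b.role) == ("replica", "primary")
```

The point of the sketch: neither node ever reasons about whether the other is "busy vs. dead" -- holding the lease is the only thing that matters.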
> tell if master node is not responsive because it is busy vs dead?
The postgres documentation will tell you that you'll need to set up your own mechanisms for this, and that they will need to integrate with OS facilities as appropriate. One-size-fits-all does not cut it. Not wrt. replication, not wrt. HA/failover.
Well, the built-in mechanisms are the right way to do it. But given that PostgreSQL is quite conservative about this, it will be hard to find issues there: replicas are read-only, so at worst you'll just see replication delay, unless you use synchronous replication, which removes the delay at the cost of slower performance.
All the tooling that provides extra distributed functionality not present in Postgres (auto failover, multi-master replication, sharding, etc.) will surely have issues, but then you aren't testing PostgreSQL itself, you're testing the tooling. So, to be fair, the article should evaluate those tools, and any shortcomings shouldn't be attributed to PostgreSQL (unless it really is a PostgreSQL issue).
Postgres is not a distributed database and doesn't have a single safe default for running it in a distributed configuration, including talking to it over the network. It can't claim any consistency guarantee, so there is nothing for aphyr to test it for.
Even common highly available configurations take the route of no consistency guarantees by doing primitive async replication and primitive failover.
Postgres supports multi-master replication, among other replication models. This could provide an interesting target.
In a classic single node configuration, a confirmation that its transaction isolation behaviors exhibited the corresponding anomalies would be valuable.
Postgres doesn't natively support multi-master. (Although there are a variety of open source/proprietary offerings that add support for it to various degrees.)
PostgreSQL doesn't offer multi-master replication. There are extensions that do, but if aphyr evaluates them he should emphasize that he is testing those extensions, not PostgreSQL itself (unless he finds a bug in PostgreSQL proper).
I think he did something similar for MySQL when evaluating the Galera cluster.
Jepsen reports often include two distinct types of analyses: correctness in a distributed storage system under a variety of failure scenarios, and in-depth analysis of consistency claims. Both examinations are extremely helpful.
In a single write master configuration, Postgres runs transactions concurrently, so the consistency analysis is still quite relevant.
I don’t think it’s a stretch to say that everyone expects Postgres to get top marks in this configuration and it would be worth confirming that this is the case.
But that was long ago, and maybe it needs to be redone?
Edit: after re-reading, he treats it as a distributed system because the client and server communicate over a network. And that is true; it can also be thoughtof as a distributed system because, as you said, transactions are concurrent and run as separate processes. Although in these cases you can't have a partition (which aphyr uses to find weaknesses), or maybe there is something equivalent that can happen?
> PostgreSQL doesn't offer multi master replication.
Not in itself, but it does offer PREPARE TRANSACTION / COMMIT PREPARED / ROLLBACK PREPARED, which could be used to add such support in the future. This would not be unprecedented, as the simpler case of db sharding is already supported via the PARTITION BY feature, combined with "FOREIGN" database access.
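Those commands are the building blocks of two-phase commit across nodes. Here's a toy Python simulation of the pattern they enable (my own illustration; `Shard` and `coordinator` are invented names, and real PREPARE TRANSACTION also makes the prepared state durable across crashes, which this sketch ignores):

```python
class Shard:
    """A node-local store with prepare/commit/rollback, 2PC-style."""
    def __init__(self):
        self.committed, self.prepared = {}, {}

    def prepare(self, txid, writes, ok=True):
        # PREPARE TRANSACTION: make the work ready to commit, not yet visible.
        if not ok:                      # ok=False simulates a prepare failure
            return False
        self.prepared[txid] = writes
        return True

    def commit_prepared(self, txid):
        self.committed.update(self.prepared.pop(txid))

    def rollback_prepared(self, txid):
        self.prepared.pop(txid, None)

def coordinator(shards_writes, txid):
    # Phase 1: every shard must prepare. Phase 2: commit all or roll back all.
    results = [(s, s.prepare(txid, w, ok)) for s, w, ok in shards_writes]
    if all(ok for _, ok in results):
        for s, _ in results:
            s.commit_prepared(txid)
        return True
    for s, ok in results:
        if ok:
            s.rollback_prepared(txid)
    return False

a, b = Shard(), Shard()
assert coordinator([(a, {"x": 1}, True), (b, {"y": 2}, True)], "t1")
assert a.committed == {"x": 1} and b.committed == {"y": 2}
# If any participant fails to prepare, nothing commits anywhere:
assert not coordinator([(a, {"x": 9}, True), (b, {"y": 9}, False)], "t2")
assert a.committed == {"x": 1} and b.committed == {"y": 2}
```

The hard parts a real implementation adds on top -- coordinator crash recovery, in-doubt transactions, timeouts -- are exactly why this isn't shipped as built-in multi-master today.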
I'm not sure what you mean by pg not being a "distributed" database. It has replication and sharding functionality that lets it run in various clustering configurations. That looks enough to me to qualify it for aphyr's tests.
Replicas are read-only, so at worst there's only a delay when replication is set up asynchronously, but ultimately they end up the same as the master. As for the sharding part, do you mean FDW? I don't think PostgreSQL gives any consistency guarantees if you use them.
Not a question necessarily about the technical side, but I'm interested in your opinion as to the root cause – is it a desire to achieve certain results for marketing purposes, a lack of understanding/training in the team about distributed systems, just bugs and a lack of testing...? Alternatively, does most of this come down to one specific technical choice, and why might they have made that choice?
Very happy for (informed) speculation here, I recognise we'll probably never know for certain, but I'm interested to avoid making similar mistakes myself.
There's a few things at play here. One is talking only about the positive results from the previous Jepsen analysis, while not discussing the negative ones. Vendors often try to represent findings in the most positive light, but this was a particularly extreme case. Not discussing default behavior is a significant oversight, and it's especially important given ~80% of people run with default write concern, and 99% run with default read concern.
The middle part of the report talks about unexpected but (almost all) documented behavior around read and write concern for transactions. I don't want to conjecture too much about motivations here, but based on my professional experience with a few dozen databases, and surveys of colleagues, I termed it "surprising". The fact that there's explicit documentation for what I'd consider Counterintuitive API Design suggests that this is something MongoDB engineers considered, and possibly debated, internally.
The final part of the report talks about what I'm pretty sure are bugs. I'm strongly suspicious of the retry mechanism: it's possible that an idempotency token doesn't exist, isn't properly used, or that MongoDB's client or server layers are improperly interpreting an indeterminate failure as a determinate one. It seems possible that all 4 phenomena we observed stem from the retry mechanism, but as discussed in the report, it's not entirely clear that's the case.
I get the impression that MongoDB may have hyped themselves into a corner in the early days with poorly made (or misleading) benchmarks. Perhaps they have customers with a lot of influence determining how they think about performance vs consistency.
Maybe this, combined with patching, re-patching, and re-patching again their replication logic/consistency algorithm, means that they'll be stuck in this sort of position for a long time.
Possibly! You're right that path dependence played a role in safety issues: the problems we found in 3.4.0-rc3 were related to grafting the new v1 replication protocol onto a system which made assumptions about how v0 behaved. That said, I don't want to discount that MongoDB has made significant improvements over the years. Single-document linearizability was a long time in the works, and that's nothing to sneeze at!
Yeah, there's no workaround that I can find for 3.4 (duplicate effects), 3.5 (read skew), 3.6 (cyclic information flow), or 3.7 (read own future writes). I've arranged those in "increasingly worrying order"--duplicating writes doesn't feel as bad as allowing transactions to mutually observe each other's effects, for example. The fact that you can't even rely on a single transactions' operations taking place (or, more precisely, appearing to take place) in the order they're written is especially worrying. All of these behaviors occurred with read and write concerns set to snapshot/majority.
That's not to say that workarounds don't exist, just that I didn't find any in the documentation or by twiddling config flags in the ~2 weeks I was working on this report. :)
Hi Kyle, thanks for Elle :) I want to use Elle to check long histories of transactions over a small set of keys with a read-dominant workload. The paper recommends using lists over registers, but when the history becomes long, on the one hand it becomes too wasteful to read the register's full history on each request, and on the other hand Elle's input becomes very large. E.g., when each read must return the whole register's history, the size of the history grows O(n^2) compared to the case where reads return just the head.
So I'm curious: how would you describe the ability to find violations with Elle using read-write registers with unique values vs. append-only lists?
> E.g., when each read must return the whole register's history, the size of the history grows O(n^2) compared to the case where reads return just the head.
If you look at Elle's transaction generators, you can cap the size of any individual key, and use an uneven (e.g. exponential) distribution of key choices to get various frequencies. That way keys stay reasonably small (I use 1-10K writes/key), some keys are updated frequently to catch race conditions, and others last hundreds of seconds to catch long-lasting errors.
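Something like the following sketch captures that generator shape. This is my own illustration, not Elle's actual code, and the specific numbers are arbitrary:

```python
import random
from collections import Counter

def key_stream(max_writes_per_key=1000, skew=8, seed=0):
    """Yield keys forever: a small active set, chosen with a skewed
    distribution, each key retired after max_writes_per_key appends."""
    rng = random.Random(seed)
    active, counts, next_key = [0, 1, 2, 3], {}, 4
    while True:
        # Exponentially skewed choice: low slots (hot keys) are picked far
        # more often than high slots (long-lived, rarely-touched keys).
        i = min(int(rng.expovariate(1.0) * len(active) / skew), len(active) - 1)
        k = active[i]
        counts[k] = counts.get(k, 0) + 1
        yield k
        if counts[k] >= max_writes_per_key:
            active[i] = next_key   # retire the full key, bring in a fresh one
            next_key += 1

gen = key_stream(max_writes_per_key=100)
picks = [next(gen) for _ in range(10_000)]
c = Counter(picks)
assert max(c.values()) <= 100   # no key grows past the cap
assert len(c) >= 100            # keys get retired and replaced over time
```

Capping per-key size bounds the O(n^2) read cost, while the skew keeps some keys hot enough to catch races and others alive long enough to catch slow-burn errors.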
> So I'm curious: how would you describe the ability to find violations with Elle using read-write registers with unique values vs. append-only lists?
RW registers are significantly weaker, though I don't know how to quantify the difference. I've still caught errors with registers, but the grounds for inferring anomalies are a.) less powerful and b.) can only be applied in certain circumstances--we talk about some of these details in the paper.
Huge fan of your work! I was curious if you've ever attempted to run your (or part of) Mongo test suite against FoundationDB using their DocumentLayer since it's supposed to be Mongo API compatible.
IIRC one of the FoundationDB engineers tested with Jepsen and found that it passed in its default configuration, but the blog post seems to have disappeared.
Thanks for firing up the time machine! I've been using FDB for a little over a year now and can't recommend it enough. Such a solid piece of meticulous engineering.
I was lucky to have a good education: my B.A. involved courses in contemporary experimental physics and independent research in nonlinear quantum dynamics (esp. proofs, experimental design, writing), cognitive and social psychology (more experiment design and stats), math structures (proof techniques), philosophy (metaphysics, philosophy of science), and English (rhetoric). All of those helped give me a foundation for doing this kind of experimental work and communicating it to others.
Jepsen draws inspiration from a long line of work on property-based testing, especially Quickcheck & co. It also draws on roughly 10 years of experience building & running distributed systems in production. A lot of Jepsen I invented from whole cloth, but some of the checkers in Jepsen are derived from specific research papers, like work by Wing, Gong, and Howe on linearizability checking.
Then it's just... a lot of thinking, experimenting, and writing. Jepsen's the product of ~6 years of full-time work. Elle, the system which detected the anomalies in this report, was a research project I've been puzzling over for roughly two years.
I write the Jepsen series, and open-source all of the code for these tests, partly as a resource so that other people can learn to do this same kind of work. :-)
I guess you've answered my question, but to be clear, you do not instrument/analyse the code, you treat it as a black box which you hammer on externally, is that right?
Pretty much, yeah. There are some cases where Jepsen reaches into the guts of a database or lies to it via LD_PRELOAD shims, but generally these are Just Plain Old Binaries provided by vendors; no instrumentation required.
Hi Kyle! I’ve really enjoyed your work over the years. I was wondering, with all of your testing and experimentation, is there any system that had really impressed you?
I've actually written a little mini-Jepsen workbench for folks who want to practice implementing a distributed key-value store. Might be worth a spin! https://github.com/jepsen-io/maelstrom
Mongo has been related to "perpetual irritation" up to "major production issue" at all three of my last companies.
For as easy as it is to use jsonb in Postgres, or Redis, or RocksDB/SQLite, or whatever else depending on your use case - I can't find any reason to advocate its use these days. In my anecdotal experience, the success stories never happen, and nearly every developer I know has an unpleasant experience they can share.
Big thanks to aphyr and the Jepsen suite (and unrelated blog posts like Hexing the Interview) for inspiring me to do thorough engineering.
I find that using JSON for things you don't need to query/validate (like big blobs you just want to store) and breaking the rest out to columns works well enough. Plus, you can always migrate the data out to a field anyway.
Postgres 12 has generated columns, so you can throw your data in a jsonb column and have Postgres pull data out of it into separate columns for indexing for example.
Generated columns are not necessary for indexing in Postgres; you can create an index on any expression over the record (supported for many versions now).
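For reference, the two approaches look roughly like this (the table and key names are made up):

```sql
-- PostgreSQL 12+: a generated column extracted from a jsonb payload
ALTER TABLE events
  ADD COLUMN user_id text GENERATED ALWAYS AS (payload->>'user_id') STORED;
CREATE INDEX ON events (user_id);

-- Older versions: an expression index directly on the jsonb
CREATE INDEX ON events ((payload->>'user_id'));
```

The expression index works everywhere, while the generated column additionally gives you a plain queryable column at the cost of storing the value twice.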
Oh, trust me, I'm aware. But inevitably I will be in a design meeting where they will want a non-SQL alternative, and it'd be nice to know what I can suggest besides Mongo.
It depends what you're using it for. Postgres is a very good all-around choice these days (compared to when the whole 'noSql' thing got started) and also supports document-based scenarios quite well via JSON/JSONB columns and its support for these datatypes in queries, updates, indexing etc. Sharding and replication can also be set up via fairly general mechanisms, as described in pgSQL documentation. (For instance, the FDW facility is often used to set up sharding, but it could also support e.g. aggregation.)
As an engineer for whom automated testing tools are crucial to my mental health, let me know if you want a UX tester or just someone to provide feedback on the documentation.
This article reinforces my stance that bad defaults are a bug. Defaults should be set up with the least number of pitfalls and safety tradeoffs possible so that the system is as robust as it can be for the majority of its users, since the vast majority of them aren't going to change the defaults.
Sometimes you end up with bad defaults simply by accident but I feel like for MongoDB the morally correct choice would be to own up to past mistakes and change the defaults rather than maintain a dangerous status quo for "backwards compatibility", even if you end up looking worse in benchmarks as a result.
I think this is a good way to look at things, and there are vendors who do this! VoltDB, for instance, changed their defaults to be strict serializable even though it imposed a performance hit, following their Jepsen analysis. https://www.voltdb.com/blog/2016/07/voltdb-6-4-passes-offici...
How many more years do we have to keep evaluating, studying, and reading about MongoDB's ongoing failures? It would appear this product has been a great burden on the community for many years.
I like to keep in mind that MongoDB's existing feature set is maturing--occasional regressions may happen, but by and large they're making progress. The problems in this analysis were in a transaction system that's only been around for a couple years, so it's had less time to have rough edges sanded off.
I’m Dan Pasette, Executive Vice President of Core Engineering at MongoDB. I'd like to thank aphyr for posting the detailed report on MongoDB 4.2.6. We were able to use these findings to identify a bug that can lead to a previously committed transaction being incorrectly retried in the presence of a primary failover and a subsequent transaction commit retry. From our testing, this bug is the cause of the anomalies described in sections 3.4 - 3.7 of the report.
This bug has been fixed and backported, and will be available to users in MongoDB 4.2.8 onwards. The MongoDB test suite has been updated to ensure that this specific phenomenon is detected in future releases. We are also planning to update the version of Jepsen we are currently running in our CI loop to include the newest test case used in the report.
Last, we’ve made some changes to how we share information discussed in the Jepsen reports on our website. You can find the updated page here (https://www.mongodb.com/jepsen).
There are so many great databases out there. There's no need for one that has been mediocre for years and continues to make false claims. This is an issue of years of super-aggressive marketing of an inferior product making it hard on engineers.
I think if you compared it to other databases that are designed to scale horizontally like Cassandra and DynamoDB, you might have a more favorable opinion. IMHO, most products at this scale are terrible in different ways, because it is a difficult problem to solve generally.
I have been responsible for <100 clustered Cassandra instances, and <500 clustered MongoDB instances, and I would choose the latter every time.
It has been a decade since MongoDB was initially released. It still isn't a database in any meaningful sense of the word. Please don't downplay the amount of trouble one can get into by assuming it is a database.
> Clients observed a monotonically growing list of elements until [1 2 3 5 4 6 7], at which point the list reset to [], and started afresh with [8]. This could be an example of MongoDB rollbacks, which is a fancy way of saying “data loss”.
I hope they learned the lesson, don't fuck with aphyr.
I agree but maybe it’s the only lesson they are able to understand at this time. Their attitude was asking for somebody to call them, which aphyr is maybe the best positioned to do.
I’d love to read a roasting like that authored by Leslie Lamport for a different perspective but aphyr’s works absolutely stand on their own.
Any ideas how to get Jepsen and TLA to work together? :)
I wanted to incorporate MongoDB into a C++ server at one point.
Their C/C++ client is literally unusable. I went to look into writing my own that actually worked and their network protocols are almost impossible to understand. BSON is a wreck and basically the whole thing discouraged me from ever trying to interact with that project again.
Aphyr is such a competent professional. What a relatively thorough and polite response to Mongo's inaccurate claims. "We also wish to thank MongoDB’s Maxime Beugnet for inspiration." is a nice touch.
The general mood I observed about MongoDB was that it used to be inconsistent and unreliable but they fixed most, if not all of those problems and they now have a stable product but bad word of mouth among developers. Personally, I've treated it as "legacy" and migrated everything that I had to touch since 2013 [0], and luckily (just read the article so hindsight 20/20 -- transaction running twice and seeing its own updates? holy...) never gave it another try.
[0]: https://news.ycombinator.com/item?id=6801970 (BTW: no, my dream of simple migration never materialized, but exporting and dumping data to Postgres JSONB columns and rewriting queries turned out to be neither buggy nor hard).
> MongoDB was that it used to be inconsistent and unreliable but they fixed most, if not all of those problems and they now have a stable product but bad word of mouth among developers.
This report is 9 days old, and tests the latest stable release of MongoDB. The problems it discusses are present on modern MongoDB.
If it wasn't clear, I said "mood" (which you conveniently ignored), referring to chit-chat I heard recently, and I was underlining how wrong that mood has been. I totally understand what the report says and know what version it tests.
In my defense, it wasn't clear that's what you were saying in your original comment. "Mood" has become a filler word at this point -- hence why I omitted it from the quote -- and can mean anything from the traditional meaning of "mood in the room" to "incredibly relatable/factual statement". How I originally understood your comment was that you were saying that you felt that most of the issues are in the past, but you still decided to migrate away from it.
This is not directly related to this report or Jepsen, but since you're here I've got to ask: Aphyr, are there any recent papers/research in the realm of distributed databases which you're excited about?
Calvin and CRDTs aren't new, but I still think they're dramatically underappreciated! Heidi Howard's recent work on generalizing Paxos quorums is super intriguing, and from some discussion with her, I think there are open possibilities in making leaderless single-round-trip consensus systems for log-oriented FSMs, which is what pretty much everyone WANTS.
I'm also excited about my own research with Elle, but we're still working on getting that through peer review, haha. ;-)
> I think there are open possibilities in making leaderless single-round-trip consensus systems for log-oriented FSMs, which is what pretty much everyone WANTS.
Woah, that's wild. Are there any pre-prints/papers/talks that you can link to on this subject? I'd _love_ to read this.
> I'm also excited about my own research with Elle, but we're still working on getting that through peer review, haha. ;-)
I read over bits of Elle; the documentation in it is absolutely top-notch. You and Peter Alvaro knocked it out of the park!
> I think there are open possibilities in making leaderless single-round-trip consensus systems for log-oriented FSMs, which is what pretty much everyone WANTS.
This is based on her presentation and some dinner conversation at HPTS 2019, so I don't know if there's actually a paper I can point to. The gist of it is that Paxos normally involves an arbitration phase when there are conflicting proposals, which adds a second pair of message delays. But if you relax the consensus problem to agreement on a set of proposals, rather than a single proposal, you don't need the arbitration phase. Instead of "who won", it becomes "everyone wins". Then you can impose an order on that set via, say, sorting, and iterate to get a replicated log.
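A toy model of that relaxation (my own reading of the description above, not Howard's actual protocol): nodes agree on a *set* of proposals in one round, then derive identical log slices by deterministic sorting, with no arbitration between conflicting proposals:

```python
def round_to_log_slice(proposals_seen_by_node):
    # Each node may receive the round's proposals in a different order,
    # possibly with duplicates from retries. As long as every node ends up
    # with the same *set*, no "who won" arbitration phase is needed:
    # deterministic sorting imposes the same order everywhere.
    return sorted(set(proposals_seen_by_node))

# Three nodes see the same three proposals, in different orders:
node_a = round_to_log_slice(["tx2", "tx1", "tx3"])
node_b = round_to_log_slice(["tx3", "tx2", "tx1", "tx1"])
node_c = round_to_log_slice(["tx1", "tx3", "tx2"])

# All nodes append the identical slice to their replicated log;
# iterating this per round yields a full log for the FSM to consume.
assert node_a == node_b == node_c == ["tx1", "tx2", "tx3"]
```

The hard part that this sketch waves away is, of course, getting every node to the same set in one round trip despite faults -- that's where the actual consensus machinery lives.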
> I read over bits of Elle; the documentation in it is absolutely top-notch. You and Peter Alvaro knocked it out of the park!
Thank you! Could I... hang on, just let me grab reviewer #1 quickly, I'd like them to hear this. ;-)
> This is based on her presentation and some dinner conversation at HPTS 2019, so I don't know if there's actually a paper I can point to. The gist of it is that Paxos normally involves an arbitration phase when there are conflicting proposals, which adds a second pair of message delays. But if you relax the consensus problem to agreement on a set of proposals, rather than a single proposal, you don't need the arbitration phase. Instead of "who won", it becomes "everyone wins". Then you can impose an order on that set via, say, sorting, and iterate to get a replicated log.
This sounds very similar to atomic broadcast (https://en.wikipedia.org/wiki/Atomic_broadcast) where each node sends a single message and the process ensures that all nodes agree on the same set of messages. Not sure how it would fit with a log-oriented FSM, but it certainly sounds interesting.
It’s really pretty trivial to implement RSM given an atomic broadcast protocol. But you can implement many other things, like totally ordered ephemeral messaging with arbitrary fanout, or a replicated durable log ala Kafka. Here’s my current favorite atomic broadcast protocol (from 2007 or so), which is leaderless, has write throughput saturating network bandwidth, and read throughput scaling linearly with cluster size:
Nope! Something weird happened to that post; it got a lot of upvotes and some comments, but never made it to frontpage. After the InfoQ article took off yesterday, an HN mod got in touch and asked if I'd like to resubmit it.
I suppose there are reasons why the defaults are the way they are. Can anyone comment on the implications, performance or otherwise, of bumping up the read/write concerns?
Latency is a big one--you've got to wait an extra round-trip for secondaries to acknowledge primary writes, and primaries (assuming you don't have reliable clocks) need to check in with secondaries to confirm they have the most recent picture of things if you want to do a linearizable read. Snapshot isolated reads shouldn't require that, at least in theory--it's legal to read state from the past under SI, so there's no need to establish present leadership. That's why I'm surprised that MongoDB requires snapshot reads to go through write concern majority--it doesn't seem like it'd be necessary. Might have something to do with sharding--maybe establishing a consistent cut across shards requires a round of coordination. Even then I feel like that's a cost you should be able to pay only at write time, making reads fast, but... apparently not! I'm sure the MongoDB engineers who designed this system have good reasons; they're smart folks and understand the replication protocol much better than I do.
MongoDB's also published a writeup (which is cited a few times in the Jepsen report!) talking about the impact of stronger safety settings and why they choose weak defaults: http://www.vldb.org/pvldb/vol12/p2071-schultz.pdf
In general, MongoDB’s defaults fall into two categories. The first could possibly be justified as making it easy for inexperienced devs to get started, but it means that people rely on those defaults and then try to promote to production, and unless there is an experienced traditional DBA with the power to veto it, it will go ahead. This is how they “backdoor” their way into companies. The second category is whatever will look good on a benchmark, regardless of any corners cut.
Compare and contrast with the highly ethical Postgres team, who encourage good practices from the start and who get a feature right first before worrying about performance. That may harm their adoption in the short term but over the long term, that's why they're the gold standard. And with their JSONB datatype they have a better MongoDB than MongoDB anyway! And have a million other features besides!
> Compare and contrast with the highly ethical Postgres team
You do know that PostgreSQL had issues with not fsyncing data as well? It's technology. Bugs will be made. Design decisions will be wrong.
I think it's really disappointing and inappropriate to label MongoDB engineers as unethical simply for having incorrect defaults, which, historically, they have often changed after being made aware of them.
> You do know that PostgreSQL had issues with not fsyncing data as well?
See, you can name just one Postgres bug, and they held their hands up to it straight away. Whereas the MongoDB "bugs" are countless and, by sheer coincidence, mostly skew toward improving performance in benchmarks and demos. That's a pattern.
The downloadable version of DynamoDB is only intended for testing and is not a distributed system by any definition, nor does its behavior match the production system exactly.
There's no reason for Jepsen to be applied to a single-node in-memory KV store.
At this point I think we might be going a bit overboard with title changes.
Now that it's just "MongoDB 4.2.6", the title makes me think that this is a release announcement, not an analysis of the software.
The first title (that specifically referenced a finding of the analysis) was best, imo. Mildly opinionated or whatever, but at least it quickly communicated the gist of the post. On the other hand:
"Jepsen: MongoDB 4.2.6" – not super helpful if you're not already familiar with the Jepsen body of work.
"MongoDB 4.2.6" – as stated above, sounds like a release announcement.
If you want a suggestion, maybe something like "Jepsen evaluation of MongoDB 4.2.6"? Not overly specific (/ negative) like the first title, but at least provides some slight amount of context.
Please read the site guidelines: https://news.ycombinator.com/newsguidelines.html. They say: "If the title includes the name of the site, please take it out, because the site name will be displayed after the link." That's why a moderator changed it: the submitted title was "Jepsen: MongoDB 4.2.6".
I don't mind making an exception, since exceptions are things sometimes. Jepsen is famous on HN, so the current title is not an issue. Indeed, referencing a specific finding would arguably be misleading, since this article is the Jepsen report about MongoDB 4.2.6. Btw, I don't know what you mean by "The first title (that specifically referenced a finding of the analysis) was best". The submitted title was "Jepsen: MongoDB 4.2.6" and it has only ever rotated between two states, one with "Jepsen: " and one without. Are you confusing this thread with https://news.ycombinator.com/item?id=23285249?
It's very silly to have this be the top comment on the page (I've since downweighted it, but that's where it was when I looked in). Yesterday I briefly swapped the URL of this article into the other thread, but then reversed that because it seemed that thread couldn't support a more technical discussion (https://news.ycombinator.com/item?id=23288120). I invited aphyr to repost it instead, which was quite a break from our standard practice of downweighting follow-up posts, but seemed like the best solution at the time. What technical discussion was our reward? Bickering about title policy!
This... usually happens on Jepsen HN threads. The full title, as in the page metadata, and as originally submitted, is "Jepsen: MongoDB 4.2.6". At some point a mod drops the "Jepsen:" part, then we have this discussion, and it comes back. :)
"Why don't you put 'Jepsen:' on the same line as the database name and version?"
Space concerns, and also, it's immediately above the DB name in giant letters.
"Why don't you give them more creative names?"
Clients love to argue about the titles of these analyses; having a concise, predictable policy for titling is how I get past those discussions.
As another commenter pointed out, it might be worth making the titles "An evaluation of X" going forward – better for HN and probably better everywhere else this is shared too.
Not sure how many ways I can say this: the titles are already "Jepsen: X". HN's got a policy in place that means sometimes mods change the title to just "X". That's not something I have control over, sorry.
"An evaluation of MongoDB 4.2.6" might be neutral and informative enough I suppose.
But then again, ultimately the blame is on the author of the article; it's a terrible title for this type of article. I can understand if the moderators here don't want to go through the trouble of dealing with editorialized titles (with all the controversy they could generate) when clearly the original author didn't care enough to come up with a decent title.
Why? His site is about evaluating distributed data stores. In context of his site, that title makes perfect sense, HN should just add the missing context to its title.
Because as can be seen from the fact that most people only found this article because it was posted on HN (and not because they were browsing the site), the context of the overall site isn't super relevant.
Site context isn't a given when most of us are finding content via 3rd party sources.
A generic “Mongo 4.2.6” title doesn’t help me decide whether to click on the link (especially with how light the domain is). I thought it was a release announcement and only clicked through to the comments because of yesterday’s discussion.
That's a fair point, but people have a lot of contradictory preferences about things like that. I think I'd rather address this by allowing more customization of the site. Still thinking about https://news.ycombinator.com/item?id=23199264.
I mean, this was kind of an exception case, where there is a big old technical war of words back and forth. Almost a "He said She said" except here, He is an absolute expert, and She is just some marketing dorks at Mongo.
I, for one, welcome this by-hand moderation because it keeps this issue alive, and allows Kyle to keep the discussion going.
As I commented in a previous post, Kyle is the Chef Ramsay of database testing, and here he's in a position where some idiot has just served him an undercooked hamburger. Bits will fly, marketing people will be flayed alive, and Kyle will be the only one left standing at the end.
Without this by-hand moderation, we'd be missing out on the second act of this intense thriller!
They use a combination of algorithms and human intervention, to generally good effect.
No clue if this "downweighting" in this case is an algorithm or a manual thing. I would assume algorithm for the downweighting and human intervention for reversing it, but that's sort of a guess or inference.
I am the tech lead for a project that revolves around multiple terabytes of trading data for one of the ten largest banks in the world. My team has three 3-node, 3TB-per-node MongoDB clusters where we keep a huge number of documents (mostly immutable, 1kB to 10kB in size).
Majority write/read concern exists exactly so that you don't lose data and don't observe writes that are going to be rolled back. It is important to understand this when you evaluate MongoDB for your solution. That it comes with additional downsides is hardly a surprise; otherwise there would be no reason to specify anything other than majority.
You just can't test lower levels of guarantees and then complain you did not get what higher levels of guarantees were designed to provide.
It is also obvious that, with majority concern, some of the nodes may accept a write and then have to roll it back when the majority cannot acknowledge it. This may cause some writes to fail that would succeed if the write concern were configured not to require majority acknowledgment.
The article simply misses the mark by trying to create sensation where there is none to be found.
The MongoDB documentation explains the architecture and guarantees provided by MongoDB enough so that you should be able to understand various read/write concerns and that anything below majority does not guarantee much. This is a tradeoff which you are allowed to make provided you understand the consequences.
To quote from the report: "Moreover, the snapshot read concern did not guarantee snapshot unless paired with write concern majority—even for read-only transactions."
Of course it doesn't work when you don't pair it with majority read/write concern. You can't expect to get a snapshot of data that wasn't yet acknowledged by a majority of the cluster.
As to the quote you probably are referring to:
"Jepsen evaluated MongoDB version 4.2.6, and found that even at the strongest levels of read and write concern, it failed to preserve snapshot isolation."
I did not find any proof of this in the rest of the report. It seems to be mostly a complaint about what happens when you mix different read and write concerns.
I would also suggest thinking a little about the concept of a snapshot in the context of a distributed system. With MongoDB's architecture, it is not possible to have the same kind of snapshot that you would get with a single-node database. MongoDB is a distributed system where you will get different results depending on which node you ask.
The only way you could get close to a global snapshot is if all nodes agreed on a single source of truth (for example a single log file, a blockchain, etc.), which would preclude reads and writes with a concern level less than majority.
Did you see the part about "Operations in a transaction use the transaction-level read concern. That is, any read concern set at the collection and database level is ignored inside the transaction."?
"Tansactions without an explicit read concern downgrade any requested read concern at the database or collection level to a default level of local, which offers “no guarantee that the data has been written to a majority of replicas (i.e. may be rolled back).”"
The big problem is that, even if somebody correctly sets the read and write concerns to something sensible, the moment they use a transaction those guarantees fly out the window, unless they read the docs carefully enough to realise they have to set the read and write concern for the transaction too. The defaults are very unintuitive; I can't imagine that needing snapshot isolation in general but being fine with arbitrary data loss in transactions is a common case, compared to wanting to avoid data loss both generally and in transactions.
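For what it's worth, a minimal sketch of what "setting the concerns for the transaction too" looks like. The option names follow the MongoDB Node.js driver documentation; the session and collection usage in the comment is illustrative, not taken from any real codebase:

```javascript
// Sketch: pass explicit transaction-level concerns so the transaction does
// not fall back to the default read concern "local" described above.
const transactionOptions = {
  readConcern: { level: 'snapshot' },
  writeConcern: { w: 'majority' },
};

// Hedged usage, assuming `session` came from client.startSession():
// await session.withTransaction(async () => {
//   await collection.updateOne({ _id: 1 }, { $inc: { n: 1 } }, { session });
// }, transactionOptions);
```

Without that second argument, the transaction silently runs at read concern "local" regardless of what was configured at the database or collection level.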
Not saying you're wrong. As an anecdotal data point - we've read the docs (carefully) and spoke to MongoDB quite a bit when implementing transactions including their highest paid levels of support and still ran into this issue:
> transactions running with the strongest isolation levels can exhibit G1c: cyclic information flow.
As well as the Node.js API issue (I just checked randomly and their Python API has the same bug lol) listed above.
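To illustrate the failure mode I mean, here is a hypothetical simulation, not the driver's actual code: `withTransactionLike` stands in for a retrying `withTransaction`-style helper, and the closure's captured state plays the role of work tied to the shared session object:

```javascript
// Hypothetical simulation of a retrying withTransaction-style helper.
// A retry invokes the SAME callback again, so any state the closure
// accumulated during a failed attempt survives into the next one.
async function withTransactionLike(fn, maxAttempts = 2) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      // transient error: retry with the same callback (and, in the real
      // driver, the same session object)
    }
  }
}

async function demo() {
  const itemsWritten = []; // state captured by the closure across retries
  await withTransactionLike(async (attempt) => {
    itemsWritten.push(`write-from-attempt-${attempt}`);
    if (attempt === 1) throw new Error('TransientTransactionError');
  });
  // Both attempts contributed: "part of one attempt + parts of another".
  return itemsWritten;
}
```

Under these assumptions, `demo()` resolves with writes from both attempts mixed together, which is the corruption pattern described above when the callback isn't idempotent.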
It is not different. For a product like MongoDB, both the durability guarantees and the documentation explaining them are an integral part of the user experience. If I'm starting a project, I'm making decisions for a junior developer whom I'll hire in two years. I care what code that junior developer will be most nudged to write.
If the Stripe API had documentation that was needlessly unclear in a way which led people to lose a significant amount of money, that would be a bug.
Chief, it does not have to be this hard. 3.4 clearly states:
This anomaly occurred even with read concern snapshot and write concern majority
3.5: In this case, a test running with read concern snapshot and write concern majority executed a trio of transactions with the following dependency graph
3.6: Worse yet, transactions running with the strongest isolation levels can exhibit G1c: cyclic information flow.
3.7: It’s even possible for a single transaction to observe its own future effects. In this test run, four transactions, all executed at read concern snapshot and write concern majority, append 1, 2, 3, and 4 to key 586—but the transaction which wrote 1 observed [1 2 3 4] before it appended 1.
Like... if you had read any of these sections--or even their very first sentences--you wouldn't be in this position. They're also summarized both in the abstract and discussion sections, in case you skipped the results.
4.0: Finally, even with the strongest levels of read and write concern for both single-document and transactional operations, we observed cases of G-single (read skew), G1c (cyclic information flow), duplicated writes, and a sort of retrocausal internal consistency anomaly: within a single transaction, reads could observe that transaction’s own writes from the future. MongoDB appears to allow transactions to both observe and not observe prior transactions, and to observe one another’s writes. A single write could be applied multiple times, suggesting an error in MongoDB’s automatic retry mechanism. All of these behaviors are incompatible with MongoDB’s claims of snapshot isolation.
May I suggest alternative perspective on the matter?
Compared to a product like Oracle, transactions in MongoDB are very new, very niche functionality. Even MongoDB consultants openly suggest not using them.
MongoDB is really meant to store and retrieve documents. That's where the majority read/write concern guarantees come from.
As long as you are storing and retrieving documents, you are pretty safe.
Your article presents the situation as if MongoDB did not work correctly at all. That is simply not true, the most you can say is that a single (niche) feature doesn't work.
Have you ever tried distributed transactions with relational databases? Everybody knows these exist, but nobody of sound mind would ever architect their application to rely on them.
Any person with a bit of experience will understand that things don't come free and some things are just too good to be true. MongoDB marketing may be a bit trigger-happy with their advertisements, but that does not mean the product is unusable; they probably just promised a bit too much.
The world does not revolve around HN votes. If your first urge is whether the post gets downvoted or not you might want to rethink your life a little bit.
I'm not "worried" nor experiencing an "urge." Please skip the concern trolling.
What I do have an interest in is HN's accepted decorum, which I admittedly stepped outside of when I implored you to stop digging yourself such a hole.
HN is far from perfect but there is a culture of respectful discourse here, which is part of the reason for its value IMO.
May I suggest the tiniest bit of consideration (such as reading the report) before jumping to conclusions and low-key offending the author? You should be embarrassed.
This comment looks a bit comical when compared with the one you started this whole thread with. You're an engineer, why are you siding with marketing over measured technical facts? Do you think denial will make your infrastructure any safer? Don't make excuses for MongoDB, just acknowledge the article as an appropriately well weighted response to their marketing claims and move on.
> May I suggest alternative perspective on the matter?
Can't reply to that since it's too nested so I'll reply here. I warmly recommend climbing down from that tree and actually reading the article, because if you do you will see you are not disagreeing on that part.
The article is a mostly technical analysis of the transaction isolation levels and where they hold. The main criticism is how MongoDB advertises itself. If they didn't claim the database is "fully ACID" then the article would have just been a technical analysis :]
> The article simply misses the mark by trying to create sensation where there is none to be found.
As someone who is a tech lead for a large database install, I'd urge you to read the rest of the Jepsen reports. They aren't intended to be hit pieces on technology - they're deep dives into the claims and guarantees of each database. IIRC MDB has explicitly reached out to OP in the past (I doubt they'll continue to do so after this).
Why that matters to the rest of us: once I learn all those dials and knobs I'm left wondering why I would choose Mongo over another technology, and how much the design of the default behavior and complexity of said dials/knobs are influenced by their core business.
I would also wonder about the surrounding ecosystem of tooling & libraries.
Imagine there was a programming language which had rather inconsistent naming, poor automated testing support, and a history of guiding its users toward security vulnerabilities. A culture would grow up around that language and the most successful members would be those who could best tolerate those properties. People generally self-select into language communities. So unless some powerful influence pushed random programmers to use the language or made it easier to add new tooling, the culture would continue to undervalue what the language originally lacked.
I suspect the same social dynamic would apply to a database.
I agree. MongoDB has a large number of peculiarities that you'd better know before you buy in. It is definitely not as rosy as advertised. In particular, the product does not seem mature (especially if you come from the Oracle world) and the features seem slapped on as they go rather than thought through.
The documentation states that very clearly and the attributes are part of every call to the database (as long as you are using native driver).
In any case, anyone with some experience in distributed systems will understand roughly what it means to get an acknowledgment from just a single node vs. waiting for the majority.
Oracle also does not use serializable as its default isolation level, yet it advertises it.
This is all part of the product functionality. Whenever you evaluate product for your project you have to understand various options, functionalities and their tradeoffs.
Defaults don't mean shit. In a complex clustered product you need to understand all important knobs to decide the correct settings and configurable guarantees are most important knobs there are.
Since you just leaned all the way in, repeatedly proving you either will not or cannot read the posted article: will you let us know which bank you support, so at least I can make sure I never use it?
Thanks,
Those of us who care about our banking and investing data.