I would argue that MongoDB is not—and has never been—the best choice for solving any particular technical problem. But it had some other "advantages" over other, better solutions – in that it was easier to set up, didn't require schema definition, had a passable clustering story etc.
I have worked with at least one company that had been built using MongoDB as a primary data store from day one. This caused untold pain later on, but the trade-off is that it likely allowed the company to exist at all – the founder was more of a domain expert than a technical expert, but was able to use it to scale their idea pretty quickly without having to pay much attention to all that tedious "reliability" and "safety" nonsense :)
That said, it's not something that an experienced developer should be using for anything nowadays, and the solution might be to ensure that competing alternatives (like Postgres) can learn from why MongoDB became popular and seek to solve some of the pain points in their own implementations.
I once worked for a company where Lotus Notes fulfilled a similar role (this was back in the late 1990s). Eventually they ran an entire free-to-sign-up, web-based email system on 4 giant Notes instances. It was an absolute nightmare, but the company would probably never have happened otherwise, because Notes was all that the founders knew. My job was managing the migration off it to a normal SQL database, which took the best part of a year.
There are, I am sure, numerous stories like this. I know of a case where discussions and votes were conducted using a shared yahoo mail account -- a bunch of folks would just log in and note their preferences in the draft email with subject "Vote for X". A single account, shared password (in theory one could erase/edit previous votes), but it worked just fine as the group was small and no one wanted to do mischief.
The flip side of this is that beyond a certain size such initial designs are a killer. And a company should monitor time sunk into maintaining such things, prioritize those that are most expensive and fix them as soon as it can afford to. And this is often done way too late, causing harm that can be impossible to repair. My 2c.
Every time we wanted to do something new a founder would say "Domino can do this!" and then they would spend the whole weekend setting it up for what we were trying to do.
All because we were some kind of IBM partner and they wanted to please the big wigs.
Once upon a time in a job far away, about 15% of my time was allocated to being the company sysadmin, and, as all good sysadmins do, I reduced that time to about 5%... until...
One day, and despite my protests, I had to bin our perfectly fine and well maintained/loved Exchange server for fricken Notes/Domino just because one of the investors had some free licenses. After the migration, everyone hated it and I was persona non grata seeing as it was me who had to switch us over. It never worked properly. I left not long afterwards and the poor sod who was my replacement had the joy of looking after that dumpster fire.
I used to work for a company that occasionally had to interact with others' Lotus Notes setups.
At one point, there was some discussion about migrating one of our systems to a document store. There was a fair bit of back-and-forth at first. After someone pointed out that Lotus Notes is basically a document store, though, the idea was dropped more-or-less immediately and unanimously.
I can see the attraction when you're just starting out on a project, but we had just spent way too much time dealing with what they're like after they've had time to mature into full-blown quagmires.
As a n00b when it comes to web development I've sort of given up on fretting more than "too much" about "Is this the best way?" and focused on "let's try some stuff; if it doesn't work I'll learn from it and make a better choice when I come to it".
There seems to be a cost associated with evaluating optimal paths. Personally I get SQL (well enough to use it) and I can work with it, so I don't bother with MongoDB much. But if there's someone who doesn't, and needs to get a project going, and the alternative is lots of delay or no project at all, yeah, why not.
I feel like when it comes to development people really gripe about the best choices ... when the non-best choices might just as well be looked at as a path to the best choices, and there's a lot of value in that.
Now that's all within the context of weighing the costs to ... everyone. Hobby project few if anyone will see, hey who cares. Work project, a lot more diligence and consultation with others ;)
Could you elaborate please on the untold pain part? We use Mongo as our primary storage at the moment and I would like to avoid painful issues. Thank you in advance.
On the other note, I keep seeing people recommend Postgres, but to me that is apples vs. oranges. Just because it has JSON storage doesn't make it a replacement. How much effort and money does a company need to keep a replicated Postgres cluster running vs. Mongo? In one of my previous jobs we had to hire a consultant to do that. With Mongo, pretty much any experienced dev could do it.
My personal view is that MongoDB's main advantage is its flexibility. It allows teams to shape and evolve the product as needed. When specific features/parts require new solutions, new solutions can be used (financial transactions, etc). Features that allowed us to leverage what we have instead of researching new tools were:
1. Schemaless - Logic is in the app; gives visibility; can be tested; fewer migration headaches; easier to evolve your architecture.
2. Indexed arrays and attributes (name/value pairs) - Allows you to be more creative (e.g. add a tagging system to any type of data you have).
3. Aggregation - Basic BI can be done from day one, so the business gets value sooner.
It’s been a while since I’ve worked with Mongo but I remember having significant performance issues with indexed arrays, particularly for compound indexes. They weren’t solvable by any tweaking, it was a fundamental problem with that feature, to the point where we wondered why they allowed it at all. It’s possible they’ve fixed it by now.
If you "abuse" MongoDB and treat it like a relational store, certain queries will be difficult to write and very poorly performing. At least that was our experience, 2 years ago.
You can use Citus, as others have mentioned, or something like BDR – but bear in mind that vanilla Postgres will happily run a big pile of hot standby servers for scaling reads, and will scale vertically very well.
You might have a case in which you need something faster – I'd argue that there are still better solutions, but YMMV.
It's dead easy to migrate an existing PostgreSQL install to Citus (application code can all stay the same), so it's a great way to scale your application when you need it, and at the same time just use PostgreSQL for when you don't need scale yet (99% of all apps).
In my experience, when there's something that offers additional guarantees for additional work, it's very rarely the right solution to completely forgo the guarantees. This goes for schemas, typing in languages, etc.
Maybe the additional work is too much for the benefit, but it's not that there's no benefit. Forgoing the guarantees usually leads to much more extra work. I think Python's "no types anywhere" was a reaction to Java's "types everywhere", but neither is optimal. Something with gradual typing or type inference has much more bang for its buck, so whenever I see "I use MongoDB because there's no schema to worry about", I always think that it's going to come back and bite someone in the ass.
I disagree... I think if you have an application where you have several variants of a type of document in a handful of collections, MongoDB is a great fit (think classifieds, or other storefronts). If the structures don't nest too deeply, and they vary a bit from one subtype to another, it can work very well. And it scales better in the box than many alternatives (a shortfall of PostgreSQL imho).
It's a combination of factors. I still mostly reach for more traditional SQL options (including SQLite). It just depends on the needs of the application, the costs associated and the skill and experience of the developers working on the application.
I would still reach for Mongo first for a handful of application types, I do wish that RethinkDB had the marketing effort/money/skill that Mongo did, as I think they were more solid on the technology, and the admin interfaces are great. I'd love to see a mid-scale vendor pick up the tech and run with it.
I have to absolutely disagree with the point you're making about MongoDB never being something an experienced developer should use.
An experienced developer understands when the goal is to deliver something as fast as possible vs. building something more resilient.
If you're building an internal tool, particularly one that won't be storing data that would later be used for analytics, there's absolutely no reason not to go with MongoDB over a relational database, considering you can get hosted versions of both for the same price. For a less-than-a-week-long project, you could be saving at least 20% of the time spent not having to worry about a data model.
> You could be saving at least 20% of the time spent not having to worry about a data model.
You are not "not worrying about a datamodel" – you are just ignoring the data model that exists. And that overhead, if it exists at all, is far less than "20%".
That's what I've seen. I've been involved with efforts to do something that an actual RDBMS is designed for with Mongo, Cassandra, Couchbase, Redis and even ElasticSearch - in every case we ended up re-creating so much of the functionality that is built into an RDBMS that the effort took much longer than it would have taken if we'd used MySQL or Postgres, in addition to being much slower (not to say anything of how proprietary and locked in the solution ended up being). There are good use cases for all of these products, and when used the way they were designed to be used, they're great; they're just not used that way very often (that I've seen).
It was often the best choice for getting a product up and running fast and rapid iteration for exactly the reasons you mention. Once the schema solidified, it stopped being the right choice.
We (ZeroTier) experienced the same thing but with RethinkDB, a similar schema-less database. We outgrew it when we needed something more solid and more reliable and we had a pretty settled schema.
Postgres is not an alternative to Mongo, though. Provide a noSQL alternative if you want an alternative. RDBs and non-RDBs seek to address different sets of problems, and sometimes you do not want or need a relational database.
But postgres actually is that with json and jsonb. The postgres developers are also very focused on providing a better alternative to mongo.
AFAIK a similarly configured Postgres also challenges or beats MongoDB in performance.
Setting up a simple cluster without relying on addons, or having to worry about which server in the cluster your writes are going to (bisected application configurations)
I went to production with Mongo and it exceeded my expectations. It operated without a single issue for years under high load. It was a dream to administer and a vital piece of a multimillion dollar franchise.
Everybody else I know, however, has had nothing but headaches.
They’re the same reasons I’ve seen people fail at using Cassandra, Redis, or Spanner. If you can’t adjust to the limitations and paradigm shift, you get no benefits. And an ORM often makes everything worse.
The “no” in NoSQL doesn’t seem to stop people from modeling join relationships in Redis, or chaining distributed queries with fully consistent writes on Cassandra.
I’m on a project right now, where a developer has selected an ORM for PostgreSQL that forgoes joins. They’ve managed to generate about 100 queries, in one case, where a single query is all that’s needed. 2ms vs. 800+ms. That individual is incapable of using something more complicated. Substituting Mongo as-is will make everything worse, and they’ll triumphantly proclaim how terrible NoSQL is and then write an article.
I feel confident using Mongo for any task. I don’t feel confident letting most of my peers use it.
There are a LOT of legitimate gripes, but no article I’ve read mentions them. It’s always the same superficial complaints from ten years ago. If you can’t get past those, choose another tool. End. Of. Story.
The biggest headaches I ever saw with it were when Azure had several regions down (which could happen with any cloud)... When it came back up, it was spotty and the cluster itself never really recovered. Fortunately it was replicated read data from another source, and in the end it was faster to stand up a new cluster. It could have been much more painful than it was, in my own experience.
Since everyone is sharing their opinion and experience with mongodb I think I’ll share mine.
As an appeal to authority I would like to mention that I have relevant vocational qualifications on the subject (more geared towards scalability and operations). Although I don’t believe it really matters - it will to those who assume I don’t understand best practice.
MongoDB itself is not /really/ a valid choice in many of the scenarios it was painted as solving. Their only fault is overzealous marketing; it has (in my opinion) very clear pain points that should be avoided, but those pain points are antithetical to why many people used it in the first place.
Most people pick up Mongo because it's painted as being "beginner developer friendly". I don't mean new developers; I mean that picking it up and running with it, without understanding it, was made to be incredibly easy. But MongoDB itself needs you to understand your data patterns before you start adding shards, so the technology itself depends on you actually sitting down and designing an architecture with that understanding. These goals are at odds with each other.
In MongoDB (as it was when I was using it in full prod 6+ years ago) you -needed- to understand how your data was going to grow and how it would be queried long before you ever created an index. You could not grow after creation. But using it as a plain document store with no searching, and heavy sharding on the document ID, is the best way to go. And in that scenario it is much better than most competitors.
In nearly every /other/ scenario it's a less favourable choice than another technology of some variety.
I would argue the data loss point but I think if that’s not a solved issue it will be, and I’m fairly certain you can configure it to be slower but correct (my memory is bad).
I am not a MongoDB advocate, nor do I hate the technology outright. I strongly dislike how it was marketed as being a panacea.
And for the same reason I avoid PHP, I will attempt to avoid MongoDB.
(As in; it can be done well but the majority of cases will be poorly implemented)
I’m a big fan of Mongo for the use case you described - searching by ID and all information in one document.
But people don't seem to understand that there are plenty of scenarios where you really either don't know the schemas in advance and/or the "schema" is defined by an external source.
I worked for a company that sold software that allowed users to create forms that could be filled out either on the web or via a mobile app.
The user created the form, and the schema and the indexes were created on the fly - one collection per type of form. What would an RDBMS have bought us?
And if you create a table with a single ID column and a single JSON column you’ve essentially re-invented a NoSQL database. But I guess you can pretend it isn’t.
I've been using this in production for around 1 year - it's an absolute dream to use!
For context, I've previous experience with NHibernate, EF, EF Core, Dapper and some others from yesteryear - Marten is probably the best dev experience I've had from an ORM.
And then what happens when they add a field to the form and the table already has a million rows? What happens when they decide that the numeric field should have strings?
It would probably work using a forms table, fields table, submissions table, and values table.
I didn't ask "would it have worked", I asked "what would it have bought us".
Alter table is generally no big deal for any of the use cases that MongoDB is also able to handle.
On any good RDBMS, adding a nullable column to an existing table is an O(1) operation. This is the only option that's comparable to what's available in MongoDB, and it has the same performance characteristics.
On the great ones, adding a non-nullable column with a default value to an existing table is also an O(1) operation; on the good-but-not-great ones, it's O(N). (As always, you get what you pay for.) For MongoDB, wanting to do this would be unusual, but you would have the option of back-filling every record. It would be an O(N) operation, too. So, for this case, the characteristics of the RDBMS are no worse, and possibly better.
Adding a non-nullable column with no default is always O(N), but the fact that you're suggesting a document store as an alternative implies even more strongly that this is not the use case you're trying to cover. That said, if you did do it, it would also be O(N).
Converting a numeric column to a string column is always going to be O(N), yes. Whether or not that's the better option is something that's got to be decided in context. Basically, do you want to pay the cost of datatype conversion in one lump sum and then be done with it forevermore, or do you want to pay a small fee for datatype coalescing every time you access that field? There are good reasons to choose both options. However, all too often, the 2nd option is chosen for a very bad reason: Simply assuming that it's zero cost.
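To make the cost comparison concrete, here's a minimal sketch of those three migrations against Postgres using Npgsql (the submissions table, its columns, and the connection string are all hypothetical):
using Npgsql;
using var conn = new NpgsqlConnection("Host=localhost;Database=forms;Username=app;Password=secret");
conn.Open();
// Adding a nullable column: a metadata-only change on Postgres, effectively O(1).
using (var cmd = new NpgsqlCommand("ALTER TABLE submissions ADD COLUMN notes text", conn))
    cmd.ExecuteNonQuery();
// Adding a NOT NULL column with a default: O(1) on Postgres 11+, a full table rewrite (O(N)) on older versions.
using (var cmd = new NpgsqlCommand("ALTER TABLE submissions ADD COLUMN status text NOT NULL DEFAULT 'new'", conn))
    cmd.ExecuteNonQuery();
// Converting a numeric column to text: always rewrites every row, i.e. O(N).
using (var cmd = new NpgsqlCommand("ALTER TABLE submissions ALTER COLUMN amount TYPE text USING amount::text", conn))
    cmd.ExecuteNonQuery();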
What happens when you need to do something like "Select browser user agent from all users who filled forms for a particular set of clients after a given date." ?
This would fit in a single SQL query which is expected to perform reasonably well; with an unstructured database, optimizing this query will take months of work.
Something like this? I'm not sure why this couldn't be an optimized query in mongo, but I'm also not sure why a query like this one needs to be optimized? This would run fast enough without needing indexes, and really fast with an index on a couple fields, but is a query like this run so frequently you need to have it be extremely optimized?
How so? In our case, metadata like the user ID, browser agent, date entered, etc. was always added to the object before it was stored, and those fields were indexed. They are just name/value pairs.
The point is that the query I mentioned requires joins.
You can of course get the same information from key value pairs, it will just require a number of scans over all your data, which doesn't scale if you need the queries to be fast.
On the RDBMS side, there has been more than three decades of research on optimizing patterns like this. You don't want to try and reinvent that.
If you can know for sure from the start that you'll never need queries like this, then of course something like Mongo will be awesome. But requirements change, hence this article.
You saw the part where I said that all the forms had different schemas and were in different collections? The RDBMS equivalent would be that all of the different types of forms would be in different tables and each user would have their own database. You would still have the same issue where you would have to query the database's metadata to get all of the tables and programmatically join the data.
At another company where I worked where we used Postgres, we had a multitenant set up where each of our (large) customers had their own database. The issue would have been the same.
You would no more "scan over all of your data" with Mongo with indexed fields than you would with an RDBMS with indexes.
Yes Mongo supports joins. But, I wouldn’t use them. Application servers scale much easier than database servers. You’re not getting any efficiency gains from doing server side joins over just reading documents from the left side and doing an “in” query with the ids from the right side. Assuming you are doing the equivalent of a left outer join.
In fact, if you are using C#, you could use the same LINQ syntax either way.
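For what it's worth, here's a rough sketch of that client-side join with the C# driver: read the left side, then one $in query for the right side (the Post/User classes, collection names and fields are invented for the example).
using System.Collections.Generic;
using System.Linq;
using MongoDB.Driver;
public class User { public string Id { get; set; } public string Name { get; set; } }
public class Post { public string Id { get; set; } public string AuthorId { get; set; } public string Title { get; set; } }
public static class ClientSideJoin
{
    // The application-level equivalent of "posts LEFT OUTER JOIN users".
    public static List<(Post Post, User Author)> PostsWithAuthors(IMongoDatabase db)
    {
        var posts = db.GetCollection<Post>("posts").Find(FilterDefinition<Post>.Empty).ToList();
        var authorIds = posts.Select(p => p.AuthorId).Distinct().ToList();
        var users = db.GetCollection<User>("users")
            .Find(Builders<User>.Filter.In(u => u.Id, authorIds))
            .ToList()
            .ToDictionary(u => u.Id);
        return posts.Select(p => (p, users.TryGetValue(p.AuthorId, out var u) ? u : null)).ToList();
    }
}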
I work a lot with documents in my current role which includes a lot of JSON structures as well.
MongoDB has been immensely useful for a team with limited scope (and requisitional abilities within the organization) to get up and running and store backups of documents that have been processed and JSON API responses.
I definitely wouldn’t apply it as a panacea, either.
Like any tool, it has its place in the belt for me. It's no universal hammer, though.
As with anything, it depends on the project. I’m working on an internal service that uses Mongo as a single merged cache for a lot of mostly unchanging data from various data stores with different credentials for each, which are distributed around the world, that we otherwise have to fetch through multiple comparatively slow API calls. For this, Mongo is perfect: no messing with schemas as they change, unannounced, from upstream; I can index just the fields I want to search on; Mongo will expire things for me on a TTL; the query API is simpler than the API we’re caching from; and we get results 20-50x faster. We looked into FoundationDB and Postgres, but they require a lot more initial setup. ElasticSearch is the closest solution, but it needs a lot more info about the schema up front and its query language is a nightmare compared to Mongo’s, for no real gain in functionality that I can see.
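As a sketch of the kind of setup being described (collection, field names and the TTL are invented), the expiry and the selective indexing are just a couple of index definitions in the C# driver:
using System;
using MongoDB.Bson;
using MongoDB.Driver;
var cache = new MongoClient("mongodb://localhost:27017").GetDatabase("cache").GetCollection<BsonDocument>("upstream_objects");
// Expire cached documents 6 hours after their "cachedAt" timestamp.
cache.Indexes.CreateOne(new CreateIndexModel<BsonDocument>(
    Builders<BsonDocument>.IndexKeys.Ascending("cachedAt"),
    new CreateIndexOptions { ExpireAfter = TimeSpan.FromHours(6) }));
// Index only the fields we actually search on; everything else in the document stays schemaless.
cache.Indexes.CreateOne(new CreateIndexModel<BsonDocument>(
    Builders<BsonDocument>.IndexKeys.Ascending("region").Ascending("objectId")));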
Is Mongo the right tool to build your entire business on top of? Probably not, but it can be the best tool for the right job.
You and I must be the only ones on HN using mongo at scale and enjoying it.
We use it as an event database which collects over 100M semi-structured records daily with about 200 (and growing) different schemas... It keeps 1.5TB of records in the collection, which is achieved using the invaluable capped collection feature, and we can index the structured fields very easily. We also pipe the data into Elastic for quick Kibana querying, but just that step requires a lot of index partitioning and mapping to work smoothly.
I also feel that a significant amount of Mongo's power lies in the aggregation pipeline, which is often ignored and which can replace entire ETLs; sending the compute to the data can be much more efficient in a lot of use cases.
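For anyone who hasn't used those two features, a rough sketch in C# (names and sizes are invented): a capped collection holds a fixed amount of data and drops the oldest records automatically, and a small aggregation pipeline pushes an ETL-style rollup to the server instead of pulling raw records out.
using MongoDB.Bson;
using MongoDB.Driver;
var db = new MongoClient("mongodb://localhost:27017").GetDatabase("events");
// Capped collection: fixed size on disk, oldest documents age out automatically.
db.CreateCollection("raw_events", new CreateCollectionOptions { Capped = true, MaxSize = 1_500L * 1024 * 1024 * 1024 });
var events = db.GetCollection<BsonDocument>("raw_events");
// Aggregation pipeline: filter and roll up per day on the server.
var dailyRevenue = events.Aggregate()
    .Match(new BsonDocument("type", "purchase"))
    .Group(new BsonDocument {
        { "_id", new BsonDocument("$dateToString", new BsonDocument { { "format", "%Y-%m-%d" }, { "date", "$createdAt" } }) },
        { "count", new BsonDocument("$sum", 1) },
        { "revenue", new BsonDocument("$sum", "$amount") } })
    .Sort(new BsonDocument("_id", 1))
    .ToList();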
All this runs for us on some relatively cheap (compared to 1.5TB of high i/o RDS) commodity hardware, running 3 replicated nodes. More complex ETL jobs or Hadoop users can pull data from mongo selectively and quickly - spark has a way to partition a query into N threads based on a provided key achieving line rate data pulls even from a single node.
Throw in compression on disk and in-flight and you get something that is really compelling. We've all run out of disk on a database before, it's never fun.
My one gripe is the learning curve for new developers. I do find that once leveraged, I see a lot more data being stored when it's as simple as nesting an object into your main object and saving it, which data teams like.
Depends on your use cases, but really, it's been an invaluable tool so far.
My project as well, sharded over 4 nodes, ingesting about a TB every two weeks. Automatic deletion of old data was too slow, so we had to work with a scheme that allowed us to drop daily collections, but besides that it ran great. This was 6 to 3 years ago though. Maybe there's a better log stash out there now? I haven't seen one yet.
ElasticSearch does most of what you are saying out of the box; you can even have each day go into a different index and query them via joined aliases. At least IIRC, I'm not an expert. When I first tried using it and Mongo, I had issues with ES geo indexing, but since then I have used both with little issue. Just depends on the scenario.
Latest ES supports purging expiry based on a time field I believe, but before that you were creating daily/hourly indexes and haggling with elastic curator to delete them. There is also no meaningful way to do it by logical size.
Elastic is great in its own right, but it's not as flexible as Mongo when it comes to schemas.
Agreed... I didn't know it allowed for time-based expiry now; it's been a couple of years. I just find that at least half the time where I would consider Mongo today, ES seems to be a better fit. If PG were nearly as easy to set up with replication and hot/automatic failover, I'd probably favor that 95% of the time. I really wish RethinkDB had been as successful with marketing as Mongo, though.
Elasticsearch can index documents dynamically, and doesn't require a schema to create an index. Dynamic data types for fields may not always produce what you want, but it's possible to define a partial schema for the fields that are important and let Elasticsearch handle the rest.
The query language is verbose but I would hesitate to call it a nightmare. You can always search using the Lucene query language, and SQL support is landing sometime soon.
My last project required full-text search and I was going to go with Elasticsearch, but being dependency-averse (it also complicates the deployment story), I ended up using Postgres' built-in full-text search capabilities and it actually works really nicely, especially after I added a little DSL to take advantage of the full power of `to_tsquery` (so I could avoid the oversimplified `plainto_tsquery`).
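For anyone curious, a minimal sketch of that Postgres full-text search from C#/Npgsql (the articles table and the search expression are invented); to_tsquery supports the & / | / ! operators, which is where a small DSL pays off over plainto_tsquery:
using System;
using Npgsql;
using var conn = new NpgsqlConnection("Host=localhost;Database=app;Username=app;Password=secret");
conn.Open();
using var cmd = new NpgsqlCommand(
    @"SELECT id, title
      FROM articles
      WHERE to_tsvector('english', title || ' ' || body) @@ to_tsquery('english', @query)
      ORDER BY ts_rank(to_tsvector('english', title || ' ' || body), to_tsquery('english', @query)) DESC
      LIMIT 20", conn);
cmd.Parameters.AddWithValue("query", "mongodb & !hype");
using var reader = cmd.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader.GetInt32(0)}: {reader.GetString(1)}");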
I used mongo years ago with pretty much the exact same use case and it was awesome. I moved companies a few years ago and they were using mongo as well, but for services that were 95% relational. It was awful. We no longer use mongo for anything
I think you’ve got the right use case. Personally I’ve never found NoSQL to be superior, or even equal, to relational databases for general product development, but with Mongo’s recent improvements (fixing the data loss bugs, adding multidoc TXs), it’s become a pretty good platform for consuming system-of-record data and making it rapidly available. Great for customer 360 or CDP scenarios where you need to rapidly analyze personalization or segmentation data and act on it.
One benefit to ES is scale. Elasticsearch can scale to at least a couple trillion records (on the order of 1.5 PB) if you spread your data out across enough hardware. You can then search those records in seconds (for simple queries). I'm not sure Mongo can scale quite that far; it was given up on before I started this work. Not saying it can't, just saying others had more success scaling ES and haven't really hit its limits. Bear in mind that datasets span multiple clusters and searches are merged using cross-cluster search in these larger cases, so it's not likely to be an apples-to-apples comparison.
My anecdote: several years back I had a test lab for measuring the performance and scalability characteristics of various geospatial databases. We added MongoDB to the mix a couple years after they released geospatial support.
We always verified basic correctness with a new database by inserting several billion geometries as fast as the database would accept them, reading the entire data set back out, and comparing it to the original data set we inserted for any discrepancies. MongoDB never passed this test. It would apparently lose records semi-randomly every time, so we removed it from the test set. It was the only database we tested that had this issue.
I evaluated several distributed databases for a healthcare-related system. The ability to lose messages in sharding scenarios, and the specifics of how one would recover them, made me think I could never support MongoDB for anything more serious than Reddit.
The Jepsen tests [1] have been run against MongoDB - while older versions presented edge-case opportunities for data loss, that's no longer the case with recent versions. The Jepsen tests also specifically test sharded clusters. From Aphyr's report:
These tests are now integrated into MongoDB's regular test suite. Maybe MongoDB wasn't the right choice for you at the time you were evaluating it, but I just want to point out that MongoDB has matured and improved a great deal.
Wasn't it only as of version 3.4 that Jepsen stopped finding single-node data loss bugs in MongoDB? So it's been 3 years that MongoDB has been suitable for single-node data storage, and apparently 5 months that it's been reasonable to use in a sharded deployment.
Perhaps in another decade, MongoDB can shed its well-earned reputation for eating data.
I think it's fair to say that it might take some time to regain trust. Just want people to know that the software is being improved and large strides have been made in this area.
The thing is, it's too late. When MongoDB was in its heyday, it got known as something that loses records. Not much you can do now when for every post affirming its consistency, there are two about how someone tested it and it failed consistency checks.
I'm not disagreeing with anybody else's test results. Just want to say that MongoDB has matured a lot and people should test out the newer versions to see how they fare.
Kind of in line with the article, I think that people should methodically check what works for them... particularly for something as serious as a database. It would be a shame if somebody decided against using something because they heard about an issue that has been fixed and/or improved in more recent versions.
FWIW, the data loss I was referring to happened on a single node. There may have been issues in a distributed environment as well but we never got there.
I'm still in charge of a production system serving around 2,000 small to medium websites from a 2-machine MongoDB cluster. It's been running on MongoDB since around 2010 and we have NEVER had any issues.
I accept that the unacknowledged-writes default was a bad decision, but IMHO if you deploy a new database without reading the documentation, you have bigger issues.
The reality is that there are some places where speed of movement is important and referential integrity is just not that big a deal. We're not all building banking systems.
1. Writing DDL is not hard. It's just not very hard.
2. You can go from strict guarantees to looseness safely, when you demonstrably need to. The reverse isn't true -- it's easy to wind up realising, much too late, that you actually needed particular guarantees that you didn't even think of.
Relational databases didn't become incredibly popular by accident. It's because they were a drastic improvement -- theoretically and empirically -- on the generation of NoSQL databases that preceded them.
Why do we always go from one extreme to the other? Mongo is great at handling large volumes of schema-less data. Relational databases are great at connecting two datasets.
If you are just using one table with two fields (an id and a value, where the value contains a JSON object that changes) for logging, perhaps a relational database isn't the right choice. Redis might be better / Mongo might be better, depending on what comes next.
Let's stop the next cycle where everyone moves to Postgres for everything only to throw it away for the next thing. I was using Postgres in 2008 and it was great... why did it take until 2019 for everyone else to discover it? We are in the peak Postgres cycle. I hope it doesn't get discarded when the serverless hype kicks into gear.
All the SQL databases with which I'm familiar support large string columns. What's the downside to just stuffing your schema-less or volatile-schema data in one of those?
The downside is that you can’t easily query against the individual fields and you can’t index individual JSON fields.
I’m bringing up C# again...
Querying with Mongo using the Mongo driver in C#....
var people = db.GetCollection<People>("people").AsQueryable();
var seniorMales = from p in people where p.Age >= 65 && p.Sex == "M" select p;
Querying an RDMS with EF:
var people = context.People;
var seniorMales = from p in people where p.Age >= 65 && p.Sex == "M" select p;
Both queries get translated to their respective query languages and run on the server. C# enforces the types in either case and you get compile time type checking and IDE Intellisense. In either case you can index the Age and Sex fields.
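As a sketch of the indexing side (the collection name and the EF mapping are invented; People is the class from the example above):
using MongoDB.Driver;
var db = new MongoClient("mongodb://localhost:27017").GetDatabase("app");
// Mongo: compound index on Age and Sex via the C# driver.
db.GetCollection<People>("people").Indexes.CreateOne(new CreateIndexModel<People>(
    Builders<People>.IndexKeys.Ascending(p => p.Age).Ascending(p => p.Sex)));
// Relational/EF Core: the equivalent of CREATE INDEX ix_people_age_sex ON People (Age, Sex),
// declared in OnModelCreating: modelBuilder.Entity<People>().HasIndex(p => new { p.Age, p.Sex });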
NoSQL databases aren't "schemaless". Mongo understands the schema of JSON data and can query against it just like an RDBMS understands rows and columns.
It is still in production as the main database in a startup I joined in 2010, where I kicked off their software's development. I've been through smaller ups and downs, but we never lost data. The reason why I loved it back then and still do today (I'm using MongoDB in my own little webapp) is the speed of development. The few data migrations that I had to write in over eight years were nothing compared to what you'd have to do in a relational database. With new features we were always able to keep the db schema in some state of fluidity on our dev and staging machines until we were happy with the data's architecture. No back and forth like you'd have with a relational database.
We have only a dozen collections in the db and only four were pulled during a normal user session, but those translated not just into four models/classes but into a few more embedded ones, which makes total sense because, except for reporting, we didn't pull out the embedded data, even though that has become so much smoother with the aggregation framework.
It sounds like your product was in the sweet spot where there isn’t much data complexity or evolution, and the schema is understandable and easy enough to modify over time without strictness.
The product I work on has ~400 tables active at the moment, and ~800 that have ever existed in it over the last 6 years. We depend on the database schema to reduce complexity at the application level.
I don’t think the complexity would be manageable at the application level if we couldn’t do this. At least not with the ~6 engineers working on it.
That's about six times the engineering power you have there.
Our data is indeed simple and, besides data collection via a web app, the focus is on reports with that data; but always in small, sensible chunks and never across the whole available data for a model.
I always supposed that document databases were based on the object databases of the early 90s... and that the "NoSQL" craze was simply a continuation of that progression. I mean, there is a difference between NoSQL and no schema. The early object databases had schemas. The reason they wanted to abandon SQL was because they believed that they never wanted relational data: they just wanted a persistence layer for their business models.
I might be wrong about that because I pretty much ignored DBs in the late 90's and 00s, so I never really followed what was happening. However, one thing I'm absolutely sure about: MongoDB had very few features that were really novel. I'm not really sure why people think everything started there (especially someone who knows about 4GL ;-) )
First we had an XML craze - XML had to be used everywhere, including communicating with the browser - see AJAX. Then people realised that XML was very inconvenient for the browser and for human readability, while JSON fit the bill perfectly. Then it was the JSON craze - why not use it for your DB??
As a serialization format, JSON is horrible. Way too loose regarding formatting and way too much noise. All that is typically needed is a standard text format for relational data. I.e. a fixed CSV.
The only "advantage" to JSON is that it maps directly to the object trees most developers use in their scripted programs (which is misguided IMHO).
Not sure I would call CSV a "standard text format" - I've seen many many problems over the years with badly formed CSV files and bad Unicode handling. CSV appears to be an "almost standard" where 98% of the time it is fine and the remaining 2% are an utter nightmare.
Yes, CSV in the wild is an absolute mess. It was made an RFC standard at some point but even that was quite unclear if I recall correctly. That's why I wrote "a fixed CSV".
After 2010 there was also the surge of single-page application frameworks a la Angular.js; my take is that having a JSON-native database with a REST interface meant that anybody could whip up a nice Angular frontend with Mongo as its only backend.
Sure, business logic in the browser, antipatterns all over the place, security be damned, but if it meant that the business could live at all, then people got on with it.
> Having a database enforce these relationships can offload a lot of work from your application, and therefore from your engineers.
I've never seen anyone use MongoDB without also using something like Mongoose where you get all this for free. Zero work for your engineers. https://mongoosejs.com/
> Lack of ability to enforce data structure
Again, Mongoose. The work doesn't fall on your engineers, it falls on an awesome, heavily-used, heavily-tested library.
> Custom query language
Can you give an example of something that you realistically would want to do in SQL that you can't with a JSON query?
> Loss of tooling ecosystem
This might be the only valid caveat in the entire article.
The problem with thinking a library like Mongoose satisfies those constraints is that it only works with a single application. Every database I've worked with that made it beyond a prototype outlived the original application. So you either end up building an RDBMS app/api in front of mongo for every new application to use or you have to translate those constraints to every client.
I'm of the opinion to let the "awesome, heavily-used, heavily-tested" RDBMSs built to manage data and constraints do just that.
True, using a library to hide the limitations of a DB smells bad. It is fixing the problem in the wrong location.
I may have misread your comment, but rarely does more than one client application access a database. Especially in today's often microservices-based architectures. So using a library as mentioned is fine. And as the app evolves/gets rewritten you can keep the library or replace it with a similar one.
Over 10+ years ago I did join a few projects with existing architectures that had evolved to multiple applications accessing the same database. And they were a nightmare, but that is rare to encounter these days as most people have learned that one app = one DB. (Data warehousing is a possible exception, but mostly those datasets are exported instead, as are streams for machine learning.)
Probably preaching to the choir but multiple clients mean feature freeze/deadlock, no DB refactoring and spaghetti architecture.
This post isn’t about a single snapshot in time, this is about all of the people who didn’t evaluate MongoDB properly over the years and then were “burned” by it. I actually think it has received a bad rap due to people over the years not evaluating it properly. Sure, some of these things have changed (transactions very recently), but they all come with serious caveats and limitations that need to be explored... which was the main point of the article.
Hey Justin, thanks for writing the article! I agree with you that there was a failure of evaluation, and that MongoDB has matured quite a bit since the NoSQL explosion.
Sorry if my comment came across as a criticism of your post. It was more targeted at the long threads of people bashing MongoDB, thinking the caveats are still relevant/accurate today. Because they really aren't – MongoDB is a completely valid choice of database today.
ACID transactions were added over a year ago, and Mongoose is almost as old as MongoDB itself.
> > Having a database enforce these relationships can offload a lot of work from your application, and therefore from your engineers.
> I've never seen anyone use MongoDB without also using something like Mongoose where you get all this for free. Zero work for your engineers.
For me, Mongoose is actually another argument against MongoDB. It's basically trying to add, on top of MongoDB, basic functionality that most RDBMSs already have.
And no, it does not actually do "all of that", just some of it. Just an example: Mongoose would still let you delete a user while all of the related posts remain with broken references to that user.
> Can you give an example of something that you realistically would want to do in SQL that you can't with a JSON query?
I've been using Mongoose in production for several years, 2011-2014 and again for the last 12 months, within teams of 4-8 developers. The amount of engineering, bug fixing and testing was definitely higher in MongoDB than with MySQL or Postgres.
Once I got to work with MySQL and Postgres after 2014, I found it so much easier to query for the data I'm looking for and to ensure strong data consistency - mind you, even without any use of a library!
Lack of ability to enforce structure - when using Mongo with C# you're working with a strongly typed Collection<T> where types are enforced by the compiler when you add records, and a strongly typed IMongoQueryable<T> when you query the database.
Custom query language - I bet he also hates ElasticSearch. But again, using C# and the first party Mongo LINQ library, you are working with LINQ - the same built in C# query language that every C# developer should be familiar with. One of the requirements I wrote when I was hiring contractors for a Mongo based project was familiarity with Entity Framework. Even though we wouldn’t be using EF, if they knew it, it would be seamless to transition over to Mongo/LINQ.
Edit:
And as a side note, creating your “application defined schema” from a pre-existing JSON document, is a simple matter of copying the JSON to:
This is fine and well if you're using your team's language tooling to enforce these things. But the moment you're outside those boundaries all bets are off.
Even if I'm using an RDBMS, as a tech lead, I still enforce that only one "service" writes to a particular set of domain-related tables. I don't mean an HTTP microservice necessarily. It could very well be a separate project/module in a monolith, or even a package/module shared via an internal package manager.
But I would never use a weakly typed language and a weakly typed data store. That’s a mess waiting to happen.
> Can you give an example of something that you realistically would want to do in SQL that you can't with a JSON query?
Can you define joins in terms of each other? Otherwise, any join that isn't left outer join ($lookup) can't be represented in an aggregate query AFAIK.
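For illustration, a rough sketch of $lookup from the C# driver (collections and fields are invented): $lookup itself is a left outer join, and the usual workaround for inner-join semantics is to $unwind the joined array, which drops documents that found no match.
using MongoDB.Bson;
using MongoDB.Driver;
var db = new MongoClient("mongodb://localhost:27017").GetDatabase("app");
var orders = db.GetCollection<BsonDocument>("orders");
// $lookup: every order comes back with a (possibly empty) "customer" array - a left outer join.
var joined = orders.Aggregate().Lookup("customers", "customerId", "_id", "customer");
// $unwind without preserveNullAndEmptyArrays drops orders that matched no customer,
// which is how an inner join is usually approximated.
var innerJoin = joined.Unwind("customer").ToList();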
Most of this counterpoint seems to revolve around the third-party library Mongoose.
I'm not that deeply familiar with MongoDB first-hand. Is Mongoose something you can bring to bear as a plugin on the database server? Or is it strictly a client application library?
Because to analogize to relational databases:
* If you cite PL/SQL, T-SQL, PL/pgSQL, etc as solutions for SQL's shortcomings, then you have a point.
* But if you point to JPA/Hibernate, Entity Framework, SQLAlchemy, or just raw hand-rolled application-tier logic, then that's no counterpoint at all. Because it means that only one client can safely use the database, and that assumption breaks down in the real world beyond the toy proof-of-concept stage.
> Bandwagon effect – Everyone knows this, and yet it is still hard to fight against. Just make sure that you’re choosing a technology because it solves real needs for you, not because the cool kids are doing it.
> Mere newness bias – Many software developers tend to undervalue technologies they have worked with for a long time, and overvalue the benefits of a new technology. This isn’t specific to software engineers, everyone has the tendency to do this.
> Feature-positive effect – We tend to see what is present, and overlook what isn’t there. This can wreak havoc when working in concert with the “Mere newness bias”, since not only are you inherently putting more value on the new technology, but you’re also overlooking the gaps of the new tech.
This is so much what I've always wanted to and tried to say on the topic. Adopting new technologies thoughtfully rather than reflexively is something I'd love to find out about a team before joining one again.
Great article - it's definitely vital to fully understand the tradeoffs of any technology you choose, and not just go with the latest fad.
I work at MongoDB, so I do want to mention that MongoDB has changed a lot over the past few years. MongoDB now has multi-document ACID transactions. Also, schema validation means you can enforce a strict data structure if you want. On top of that there have been improvements in data durability and consistency (the Jepsen tests are now integrated into the MongoDB test suite).
As mentioned in the article MongoDB has matured a great deal. It's far more mature and fully-featured in 2019 than it was in 2012.
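For reference, a minimal sketch of that schema validation from the C# driver (the collection name and schema are invented); the rules are enforced by the server, not by an application library:
using MongoDB.Bson;
using MongoDB.Driver;
var db = new MongoClient("mongodb://localhost:27017").GetDatabase("app");
// Inserts and updates that don't match the $jsonSchema are rejected by the server.
var schema = BsonDocument.Parse(@"{ $jsonSchema: {
    bsonType: 'object',
    required: ['email', 'createdAt'],
    properties: {
        email:     { bsonType: 'string' },
        createdAt: { bsonType: 'date' },
        age:       { bsonType: 'int', minimum: 0 } } } }");
db.CreateCollection("users", new CreateCollectionOptions<BsonDocument> {
    Validator = schema,
    ValidationAction = DocumentValidationAction.Error });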
I hate these kinds of articles; however, I do recognise that people are allowed their opinions, which is good. But these sorts of articles take a very polarised view of the world.
> Does this tool solve a real problem for us
> Do we thoroughly understand the tradeoffs
Well, no. But neither did running JS on the server (along with numerous other similar examples). I've used MongoDB plenty of times before (in production with 1 issue to date) and I absolutely will again.
Huh? Enough people apparently knew JavaScript well enough that running it on a server was really useful. Why's that not a real problem? (Not being able to do so.)
In my previous job, at a payment service provider, I wrote a "velocity" system. This was required to block transactions that met certain conditions within a certain time frame. A typical example was: block transactions that are from the same IP if that happens more than 5 times in a 10 minute period. Or: block transactions from the same card number if it is used more than 3 times in a minute.
The problem was that the "velocity" blocking requirements could be completely arbitrary, and could be anything a merchant required as long as they used any of the data that was available from the transaction/previous transactions - block a transaction if it's more than 500 GBP and we've had 3 other transactions greater than 500 GBP in the same country in the previous 10 minutes.
This essentially meant we would either have to do dev work whenever a merchant had a new "velocity" rule, write our own complex framework to handle the addition of arbitrary rules, or just stuff the data into a schemaless NoSQL store and then leverage the power of the engine's query syntax as part of the merchant's configuration.
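To make that concrete, a rule along those lines boils down to a filter plus a count over recent documents. A rough sketch with today's C# driver (field names and thresholds are invented, and the original system was on MongoDB 1.6, so it would have looked different):
using System;
using MongoDB.Bson;
using MongoDB.Driver;
var txns = new MongoClient("mongodb://localhost:27017").GetDatabase("velocity").GetCollection<BsonDocument>("transactions");
// Rule: block if the same IP has been seen more than 5 times in the last 10 minutes.
// The filter, window and limit can live in the merchant's configuration rather than in code.
var windowStart = DateTime.UtcNow.AddMinutes(-10);
var filter = Builders<BsonDocument>.Filter.And(
    Builders<BsonDocument>.Filter.Eq("ip", "203.0.113.7"),
    Builders<BsonDocument>.Filter.Gte("createdAt", windowStart));
bool block = txns.CountDocuments(filter) > 5;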
We went for the schemaless NoSQL solution and we used MongoDB v1.6 - around 2010 IIRC, before it reached peak hype, before it "fixed" a lot of the out-of-the-box defaults, before it got a lot of hate. It worked perfectly and ran from the time we deployed it until I left the company a few years later. Maybe I left a mass of technical debt, but the solution ended up being so simple and so little code that I doubt it.
The other nice thing was that the velocity system was not essential - if the time to do a velocity check took more than 0.05s we would ignore it. If the backend wasn't responding we ignored it. If the write to the storage failed it didn't matter. We didn't need to keep more than a couple of hours worth of data. If we lost all of that data it didn't matter.
I don't know if I'd take the same approach now, nine years later, but at the time the use of mongodb worked perfectly. It was only a couple of weeks work, and solved the problem elegantly. So in response to the original question: yes, but as usual RTFM and make sure the pros and the cons fit your use cases.
> If you have almost no guarantees to provide, almost any solution would have worked.
Within reason. If the failure rate is 1 in 1000 then it's probably not as big a concern as 1 in 10, depending on your use case.
> In summary, it is not really a useful data point on whether MongoDB is useful or not.
As I said in the last sentence - it worked perfectly. In other words the "no guarantees to provide" never actually came up and I don't recall us having any data loss, response time, or write issues.
I'm not clear on why the datastore needed to be schemaless though? It sounds like the schema for the transaction data was well understood and only the blocking requirements were not, but the blocking requirements were formulas/assertions over the well-formed data.
The transaction data was well understood, but did change over time. It was well normalised in a relational backend that was (at the time) a few hundred GB, but it did not have any partitioning.
The problem we were trying to solve was that allowing, essentially, arbitrary queries against the main database had the potential to cause problems because we would have to ensure that those queries were optimal and all the relations were correct before we allowed those rules to be put into place - i.e. more dev and DBA work and potentially schema changes (addition of more indexes, etc).
It was far easier to have an ephemeral duplicate of the transaction data with a flat structure, a single key value document, as that mitigated the risks and removed the aforementioned problem.
The velocity system was essentially a firewall in front of the main database server - this was one of the main requirements: the velocity system must not have any effect on the load of the main transaction system. We could have added another db server, but that would have put us on the route to more servers, more complexity, etc., as we were self-hosting at the time.
I used it before I understood anything about databases, and I got some jobs done with it. Now that I have some knowledge, I choose PostgreSQL for most cases. But mongodb got me up and rolling in a pretty DRY way when I first started programming for the web.
If you get to the point where your website is outgrowing its initial MongoDB implementation, you have a very good problem on your hands.
It doesn’t take a lot of traffic to outgrow a poorly implemented mongodb structure with a lot of data. That’s not a fault of mongodb as much as it is using the tool wrong. When I first started using mongodb I treated it like a relational database and hit performance issues very quickly.
Man, I remember the Mongo craze. You couldn't have a rational discussion about databases with some of the most rabid supporters. I made a comment on a post yesterday about tech having those few early and loud supporters who shut down any conversation that doesn't support their chosen technology. The Mongo craze was like that early on.
It was really frustrating not being able to look at other databases when everyone was chanting "Mongo! Mongo!" Thankfully tech has moved on to Kubernetes and React and microservices and now we can finally have a rational discussion about databases.
Of course it was; prototyping speed on MongoDB was (and probably still is) excellent.
Some features that don't scale are very nice to have when you don't have scaling issues. For example, if you add tags to your documents and you want to query on those tags (find all documents containing tag A and B), it's nice that's just a builtin.
I haven't found a single datastore that is as developer friendly that supports that use case, so for now, I'm sticking with mongodb for my pet project.
(if you know of a datastore that has support for this query out of the box, please let me know)
The other answers have confirmed that Postgres will do this with array fields, and it's good advice to follow. It's also in my view much easier to read than MongoDB's query language is!
CREATE TABLE documents (name text, tags text[]);
INSERT INTO documents VALUES ('Doc1', '{tag1, tag2}');
INSERT INTO documents VALUES ('Doc2', '{tag2, tag3}');
INSERT INTO documents VALUES ('Doc3', '{tag2, tag3, tag4}');
SELECT * FROM documents WHERE tags @> '{tag1}';
name | tags
------+-------------
Doc1 | {tag1,tag2}
SELECT * FROM documents WHERE tags @> '{tag2}';
name | tags
------+-------------
Doc1 | {tag1,tag2}
Doc2 | {tag2,tag3}
Doc3 | {tag2,tag3,tag4}
SELECT * FROM documents WHERE tags @> '{tag2, tag3}';
name | tags
------+------------------
Doc2 | {tag2,tag3}
Doc3 | {tag2,tag3,tag4}
Postgres certainly isn't perfect, but it's usually a good answer to "how do I store and query data" where you don't have any particular specialist requirements.
Someone has mentioned PostgreSQL jsonb, which basically works the same way as Mongo, although I won't argue the query syntax is easier to learn.
But also, PostgreSQL supports arrays, so you can have much more native tags as well, and it can be indexed and searched in a more traditional SQL fashion.
I have only used this with Django ORM, but I believe it's pretty straightforward if my memory is correct.
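For reference, a minimal sketch of the jsonb route from C# with Npgsql (table, fields and query are invented): a GIN index on the jsonb column makes containment queries on arbitrary JSON fields indexable, which is the part that makes it comparable to Mongo.
using System;
using Npgsql;
using NpgsqlTypes;
using var conn = new NpgsqlConnection("Host=localhost;Database=app;Username=app;Password=secret");
conn.Open();
// A documents table with a jsonb payload and a GIN index for containment queries.
using (var ddl = new NpgsqlCommand(
    "CREATE TABLE IF NOT EXISTS docs (id bigserial PRIMARY KEY, body jsonb NOT NULL); " +
    "CREATE INDEX IF NOT EXISTS docs_body_gin ON docs USING gin (body)", conn))
    ddl.ExecuteNonQuery();
// Find documents whose JSON contains the given structure - this can use the GIN index.
using var query = new NpgsqlCommand("SELECT id, body ->> 'title' FROM docs WHERE body @> @pattern", conn);
query.Parameters.AddWithValue("pattern", NpgsqlDbType.Jsonb, @"{""tags"": [""tag2""], ""status"": ""active""}");
using var reader = query.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader.GetInt64(0)}: {(reader.IsDBNull(1) ? "(no title)" : reader.GetString(1))}");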
I've only ever heard of mongo successfully used for two use cases:
1. As a cache, like redis
2. As a log store, like Elasticsearch
In both cases, the data is somewhat ephemeral, and not the "source of truth" for the app. The minute it is used for holding real, customer supporting data, things start to get dire real fast.
Plenty of people are successfully using MongoDB for real, customer supporting data at a large scale. There's a selection of users on the website for a start: https://www.mongodb.com/who-uses-mongodb
I have used Mongo to store medical data that had to be stored for 7 years. We set it to make sure that all three servers in the cluster did a write acknowledgement.
Thank you! As an aside, is there a reason you're not using a write concern of "majority"? If you have w:3 in a three node cluster then if one node goes down, writes are going to start throwing wtimeout errors (assuming you have wtimeout set) even though the data may safely be written to a majority of nodes. We generally recommend setting w:"majority" for this reason.
It was recommended by our vendor. I asked why; he was just paranoid. But most of the data was entered on a semi-connected mobile device, stored on the handheld, and synced when it had service. If it wasn't immediately writable to the server, it wasn't a show stopper.
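For reference, the difference being discussed is just the write concern applied to the collection (or the whole client); a sketch in C# (collection and document are invented):
using System;
using MongoDB.Bson;
using MongoDB.Driver;
var records = new MongoClient("mongodb://localhost:27017").GetDatabase("ehr").GetCollection<BsonDocument>("medical_records");
// w:3 - acknowledged by all three nodes; with one node down, writes stall or time out
// even though two copies of the data exist.
var allThree = records.WithWriteConcern(new WriteConcern(3));
// w:"majority" - acknowledged by 2 of 3 nodes; tolerates a single node outage and
// still guarantees the write survives a failover.
var majority = records.WithWriteConcern(WriteConcern.WMajority);
majority.InsertOne(new BsonDocument { { "patientId", "p-123" }, { "recordedAt", DateTime.UtcNow } });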
I found developing with MongoDB a pleasure, until its lack of transactions became problematic. Fortunately, the project I was working on didn't go very far.
At the time, I concluded that MongoDB was the "Visual Basic of Databases." It was very easy get something simple running, much like Visual Basic classic was.
Quite honestly, something ACID-compliant with a MongoDB-like API is really needed for small-scale projects and prototyping.
After using MongoDB for many years, I haven't run into a use case where I should have used it over SQL. The only benefit for me personally is rapid prototyping, and it initially felt awesome to have Mongoose in Node be the schema, as it was super easy to modify. But you pay for it big time as you go: you find that anything difficult means backflips and more requests to get things done, and you get used to very strange and verbose nested JSON queries. It's the 10-15% of use cases that are really hard. NoSQL is especially annoying when needing any type of join, because despite your best planning, it still happens. There were many cases where I needed to fall back on Postgres to get stuff done and have a second source of data to sync, and then I was wondering why I didn't just do it all in Postgres first.
I wish I had really thought about this when I first wrote my largest web scraper. At the time, I was still relatively new to database design and programming in general. This web scraper, out of the thousands I've written in the interim, is--of course--the one that is still going strong many years later.
I eschewed MongoDB for all the reasons given to me on the internet and, because I was slowly gaining competence with SQL, ended up building a large and complex pipeline to send the data right into Postgres. In retrospect, this was a serious design mistake, and the one that I regret the most.
Although I still contend that the data did eventually need to be normalized, I now believe that I was doing it far too early. By ingesting the JSON stream into a parser, splitting it up, generating foreign keys, and then forcing the whole works into a single Postgres database I severely limited the capacity of my web scraper (and also guaranteed the need for a very powerful server to run it).
Had I initially dumped all results into MongoDB (or some other efficient document store) and then, separately, parsed the output into normalized SQL, I would have dramatically simplified the operation, maintenance, and debugging of my web scraper. Plus it would have been much simpler to spawn work jobs on to different machines instead of trying to break up huge monolithic processes with poorly defined endpoints. There have been many lessons learned.
In short, Mongo likely serves a very good purpose for high-speed data storage and manipulation (although it's hardly alone in this space). However, it's still likely not a great all-around solution and works best when supported by an ACID-based normalized RDBMS. Unless, of course, things have dramatically changed in recent updates.
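A minimal sketch of that two-stage split (raw dump first, normalization as a separate job); all names and connection strings are hypothetical.

    import { MongoClient } from "mongodb";
    import { Pool } from "pg";

    // Stage 1: the scraper only appends raw results; no parsing, no foreign keys.
    async function storeRawResult(mongo: MongoClient, raw: unknown) {
      await mongo.db("scraper").collection("raw_pages")
        .insertOne({ raw, scrapedAt: new Date(), processed: false });
    }

    // Stage 2: a separate worker (possibly on another machine) normalizes into Postgres.
    async function normalizeBatch(mongo: MongoClient, pg: Pool) {
      const rawPages = mongo.db("scraper").collection("raw_pages");
      const batch = await rawPages.find({ processed: false }).limit(100).toArray();
      for (const doc of batch) {
        // Parse doc.raw, derive keys, insert into normalized tables...
        await pg.query("INSERT INTO pages (scraped_at, body) VALUES ($1, $2)",
          [doc.scrapedAt, JSON.stringify(doc.raw)]);
        await rawPages.updateOne({ _id: doc._id }, { $set: { processed: true } });
      }
    }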
I also really like CouchDB and can attest to it being a reliable part of the stack. I'm using it on a personal project (Node/Nano/Couch) that may eventually need to store a lot of data but hasn't gotten there yet, so I can't yet speak from experience on performance/scalability. So far it has been great in production, and also great for the normal CRUD parts of this project.
The main reasons I chose Couch over something relational like MySQL were essentially: 1. to have a clear path for data to travel as JSON between server/API/client without the need for mapping; 2. schemaless, to allow for quick iteration and development; 3. easy to start with local-first development and then roll out for deployment. Also, I hadn't worked on a stack whose persistence strategy was solely document-oriented, so it was a fun and good learning experience.
A few things I learned along the way:
I still needed to create externalized "views" of the data that either combined data from multiple documents or hid private data. I still needed to serve lists of data in pages. More importantly, I needed to provide a lot of ad hoc reporting over the various data I'm storing across multiple Couch dbs.
The views and paging are all easily solvable with Couch, but they're so much easier to implement using SQL, and it feels like I'm just pushing that need elsewhere or compensating somewhere else in the implementation. But the friction around quick reporting has made me second-guess choosing Couch/document-based storage over something like MySQL/ORM.
The author's point about the custom query language (Mango, in Couch's case) and the loss of the tooling ecosystem has been the biggest problem for me on this project, and I'm considering migrating to a Node/Sequelize/MySQL stack just to avoid wasting future cycles trying to quickly report on the data I'm storing. When the project started, the reporting aspects weren't as apparent as they are today; the project has evolved and other requirements have emerged.
If anyone has any experience with or recommendations for tools that can easily do ad hoc reporting against CouchDB, or against documents in general, I'm interested in hearing about them.
I've had decent success so far with CouchDB 2.0 and Mango for performing ad hoc queries.
I believe Mango will automatically create the view index on the fly for any query you make and save the index for later. Before CouchDB 2.0 this would usually have to be done manually before the query was run.
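For what it's worth, a bare-bones Mango query against the /_find endpoint looks something like this (database and field names are invented); selectors can be run ad hoc, though in practice you would usually also create a JSON index on the selector fields via /_index to keep the queries fast.

    // Made-up database and field names; POSTs a Mango query to CouchDB 2.x's /_find endpoint.
    async function findOpenOrders() {
      const res = await fetch("http://localhost:5984/orders/_find", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          selector: { status: "open", total: { $gt: 100 } },
          fields: ["_id", "customer", "total"],
          limit: 25,
        }),
      });
      return res.json();
    }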
Came here and searched for "couch". Love it to death and have used it at medium-to-large scale with very large custom indexes.
Once you understand the way that views and map/reduce work, it's super powerful for doing large-scale statistics on data sets quickly. The only thing I wish for is a Mongoose adapter for CouchDB, so I could use it as a backend when an open-source project wants to use MongoDB (looking at keystone.js, for example)...
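As a rough illustration of that map/reduce style (document shape and names invented): a design document with a map function and the built-in _sum reduce can pre-aggregate totals per day, and a single grouped view query then returns the statistics.

    // Hypothetical design document: sum order totals per day using the built-in _sum reduce.
    const designDoc = {
      _id: "_design/stats",
      views: {
        total_by_day: {
          map: `function (doc) {
            if (doc.type === "order") {
              emit(doc.createdAt.slice(0, 10), doc.total);
            }
          }`,
          reduce: "_sum",
        },
      },
    };

    async function createViewAndQuery() {
      const couch = "http://localhost:5984/orders";
      // Install the design document (updating it later requires the current _rev),
      // then query the view grouped by day.
      await fetch(`${couch}/${designDoc._id}`, {
        method: "PUT",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(designDoc),
      });
      const res = await fetch(`${couch}/_design/stats/_view/total_by_day?group=true`);
      return res.json();
    }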
> Feature-positive effect – We tend to see what is present, and overlook what isn’t there.
This is such a problem. Just yesterday someone came to me and said, "We're now using x to do y. It makes z a lot easier. Just wanted to let you know so you can make the appropriate changes to your analysis & reporting applications."
Say what now? I'm embedded within the operational unit. I know a lot about day-to-day operations in and out of the systems in use, and how they ultimately translate through into data. I asked half a dozen questions, and 4 of them were met with "oh, we hadn't thought about that." These are deal-breaker questions, things the current method handled without issue, so much so that they became invisible to the users, until they decided to make a change and realized the new method makes no provision for them.
MongoDB, with its JSON model, helped us develop features and iterate much faster than the previous MySQL/PG setup. So it was a huge win for us (though we had some pain with a few issues).
Today we use PG everywhere with JSON, though the library support for JSON in Mongo is often still a little better than in PG drivers.
The really interesting part of the article comes after the "What could have been done differently?" headline. I think the article would have been better if it weren't focused so much on MongoDB, because now everybody is discussing whether the caveats still apply or ever applied.
If you need limited document storage capabilities, I'd recommend just using S3 alongside a traditional relational database like PostgreSQL. You can simply store object IDs in the primary database and the actual document/data in S3. I used this approach for a cloud platform I built that needed to store large 3D models uploaded by users: metadata was stored in PostgreSQL, and the actual data in S3. This also made it possible to generate an acquisition URL on the server which could be triggered client-side, so that after initial creation there was almost no primary-server overhead (bandwidth or storage) for retrieval.
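A rough sketch of that split using the AWS SDK v3 and node-postgres; the bucket, table, and region names are placeholders.

    import { S3Client, PutObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";
    import { getSignedUrl } from "@aws-sdk/s3-request-presigner";
    import { randomUUID } from "node:crypto";
    import { Pool } from "pg";

    const s3 = new S3Client({ region: "us-east-1" });                        // placeholder region
    const pg = new Pool({ connectionString: "postgres://localhost/models" }); // placeholder DB
    const BUCKET = "example-3d-models";                                       // placeholder bucket

    // Store the blob in S3 and only its key + metadata in Postgres.
    async function saveModel(userId: number, name: string, body: Buffer) {
      const key = `models/${userId}/${randomUUID()}`;
      await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: key, Body: body }));
      await pg.query("INSERT INTO models (user_id, name, s3_key) VALUES ($1, $2, $3)",
        [userId, name, key]);
    }

    // Hand the client a short-lived presigned URL so downloads bypass the app server.
    async function getDownloadUrl(s3Key: string): Promise<string> {
      return getSignedUrl(s3, new GetObjectCommand({ Bucket: BUCKET, Key: s3Key }),
        { expiresIn: 900 });
    }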
Yeah, Meteor was pretty amazing when it first came out. It felt like it was from several years in the future. As I recall, it relied on mongo on the server and a minimongo that ran in the client. It made it trivially easy to create universal real-time webapps, which was a game-changer for me at the time. Reactivity has since become mainstream, but at the time it represented a bright line / step function increase in both DX and UX.
Meteor, for me, was really good, and then kinda got worse over time. They started to remove the simplicity of the design, but that was what it really had going for it.
The main use case for MongoDB I have seen are custom forms or data that can be nested a variable number of times.
The main problems I have had with MongoDB were that, as of several years ago, it did not integrate well with most (I would argue, in practice, any) third-party data reporting/display tools.
Also, some of the more advanced queries were not at all intuitive. In fact, I barely remember any of the syntax now. In Mongo's defense, that might have to do with the fact that we attempted some things in MongoDB that we would never attempt with SQL cursors.
> data that can be nested a variable number of times
I'm far from a SQL expert and have mostly done client-side work in my life, but a SQL-based database does seem like a good match for things like these. You just need to write more foreign key relations, and constraints are a bit more complicated, but I see no reason to switch to NoSQL for things like these.
While I think MongoDB is deserving of a lot of the criticism it receives I think the bigger issue is with those who adopt it without a full understanding of their data and the use cases they might need to implement. There are countless stories now from people who assumed their data wasn't relational only to discover much later on that the opposite was true. There are cases where a document oriented datastore could be the way to go but everything I've read seems to suggest that they're in a minority.
I don't think the article addresses a key issue with Mongo: poor technical implementation.
There are countless stories of odd performance issues, occasional data loss, and similar.
PostgreSQL can handle JSON documents with none of that, and with better performance. And if you're not doing a lot of relational work, relational databases shard really nicely for arbitrary scalability, etc.
I love the concept of Mongo, but when I used it, I found it unusable. I switched back to relational databases as well as S3. On the whole, it just works better.
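For anyone who hasn't tried it, the JSONB route in Postgres is about this much ceremony (table and field names are made up):

    import { Pool } from "pg";

    const pg = new Pool({ connectionString: "postgres://localhost/app" }); // placeholder

    // One JSONB column gives Mongo-style schemaless storage; a GIN index keeps
    // containment queries (@>) fast.
    async function setup() {
      await pg.query("CREATE TABLE IF NOT EXISTS events (id bigserial PRIMARY KEY, doc jsonb NOT NULL)");
      await pg.query("CREATE INDEX IF NOT EXISTS events_doc_idx ON events USING gin (doc)");
    }

    async function demo() {
      await pg.query("INSERT INTO events (doc) VALUES ($1)",
        [{ type: "signup", user: { id: 42, plan: "pro" } }]);
      // ->> extracts a field as text; @> matches rows whose document contains the given JSON.
      const { rows } = await pg.query(
        `SELECT doc ->> 'type' AS type, doc #>> '{user,plan}' AS plan
           FROM events
          WHERE doc @> '{"user": {"plan": "pro"}}'`
      );
      return rows;
    }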
If relations are embedded in a document then you favour querying from the "side" of the relation that is the outer document. But usually you cannot anticipate from what sides of the relation you may want to query in the future. As such, MongoDB may scale nicely for performance, but it doesn't scale nicely for business needs. A relational database is completely unbiased in that regard, you do not need to make nor encode any decision on what would be the "outer" document or inner nested objects. It's all just entities and relations.
Great for prototyping, and cases where you can contain everything in one document, but as soon as you're referencing model2 from model1, it's probably time to switch to a SQL based DB.
I know MongoDB supports references, but it's like using a flathead driver for a Phillips head screw.
If one researches the company's history, MongoDB came from a specific need the founders had for another product: ShopWiki.
ShopWiki is a shopping price-listing site that keeps track of prices for any item: computers, clothing, food, etc. ShopWiki needs to be able to store, access, AND search across all of these different items. So MongoDB is the perfect solution.
If the application being built has similar requirements to ShopWiki or a retail site where it sells _everything_, then MongoDB IS the right choice, because the founders basically built MongoDB for ShopWiki.
To sum it up: Is MongoDB the right choice? If the product is similar to ShopWiki, then yes.
I actually think NoSQL might be better for MVP and prototyping. Why? Because they can handle churn more quickly, and you don't have to worry about scale.
Once the data layer starts to settle on a solution ... then pick the right horse for it.
Why not just use a SQL db that handles json and stick everything in a couple columns? Changing databases is a lot of work, and as far as I can tell, mongo isn’t really providing much value over SQL dbs that already support schemaless json columns.
"Support" for schemaless json columns tends to be quite limited. Maybe your database supports it, but do the drivers for that database in all the languages you might use support them? Will all that SQL tooling that you're so happy about handle the constructs that are needed to query into those JSON columns?
> Why not just use a SQL db that handles json and stick everything in a couple columns?
Honest question: what's wrong with just defining a domain model, adopting an ORM and a serialization framework, and simply going with a conventional RDBMS? AFAIK all the reference web application frameworks handle this right out of the box.
Nothing on paper, if you get the schema more or less correct the first time. The problem is when you need to do a schema change that affects terabytes of data down the road.
That is going to be a problem regardless of technology.
With an enforced schema you will at least know that all of the existing data matches the schema. Without it you have to hope that you had zero bugs while collecting the terabytes of data.
If you've a schema, you sometimes need to rewrite an entire table as part of a migration, and that can mean heavy engineering if you want to avoid downtime or disabling writes during the migration. There are 1:1 relationships between tables in the wild that wouldn't have passed a sniff test as part of an early schema design review, but then got created regardless to avoid a lengthy table rewrite.
If you've no schema proper, by contrast, you can manage multiple variations of what the data might look like in code. Not that such a thing is simple; it's definitely not, for the reason you raised. But it's simpler to deploy and migrate, or certainly might appear to be so to someone who isn't comfortable with SQL.
Also, there's a class of apps where having a schema doesn't add much value and NoSQL actually makes sense. Think storing and mining logs, scraped data, ML training sets, etc. -- apps where it doesn't matter much how a big pile of data gets stored, so long as you can shovel through it in parallel or store it very fast.
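To be fair, "managing multiple variations of what the data might look like in code" tends to end up as something like the sketch below (document shapes and fields are invented), and it never really goes away:

    // Hypothetical document shapes: v1 stored a single "name" string,
    // v2 split it into firstName/lastName. With no enforced schema, the
    // application has to reconcile both forms at read time, indefinitely.
    type UserV1 = { schemaVersion?: 1; name: string };
    type UserV2 = { schemaVersion: 2; firstName: string; lastName: string };
    type StoredUser = UserV1 | UserV2;

    type User = { firstName: string; lastName: string };

    function normalizeUser(doc: StoredUser): User {
      if (doc.schemaVersion === 2) {
        return { firstName: doc.firstName, lastName: doc.lastName };
      }
      // v1 fallback: best-effort split of the legacy field.
      const [firstName, ...rest] = doc.name.split(" ");
      return { firstName, lastName: rest.join(" ") };
    }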
The ability to query, index, etc. is, I believe, much more limited than in regular Mongo, but you can correct me if I'm wrong. Also, it feels like a kludge.
An RDBMS, to me, is too expensive in terms of churn, and I've run into some deadly query scenarios.
As a non-storage expert, I would just love one of those 2000s-era 'object databases' to start with. We used to use them a lot in networking because they suited topology well. I'm not sure they're much of a thing anymore, however; I suspect JSON DBs are close enough that there's no need for them.
For any team that ever avoided a six or seven figure SQL Server or Oracle license and managed to scale to more than 50 transactions per second with horizontal scaling, yes, MongoDB was absolutely worth it.