The problem with MongoDB is their shadiness. They shipped with unacknowledged writes until not too long ago. In other words, you would write to it and there wouldn't be an ok or fail response; you'd just sort of hope it would go in.
They fixed that problem but it was too late. In my eyes they proved they are not to be trusted with data.
Had they called themselves MangoCache or MongoProbabilisticStorage, fine: silently drop writes all you like, I don't care, it's not a database. But telling people they are a "database", tweaking their defaults to look good in stupid little benchmarks, and telling people they are webscale sealed the deal for me. Never looking at that product again.
And it's not even good or recommended as a cache under any kind of load. So I guess their most valuable niche is low-traffic/prototype sites with poor architecture discipline or genuinely unrelated data sets. There are so many better tools for caching (memcached/redis), durable persistent storage (postgres), session storage (memcached/redis/browser hybrids) and document storage (postgres). Mongo is just one of those brands that quickly solved a problem most of the better technologies missed: the user interface.
I think flexible data storage with indexing is where they are better than most other options; there's something to be said for some of what they do offer. I was able to replace a site's SQL-based search system, which includes geolocation, with MongoDB, and it works fairly well. I had considered ElasticSearch or something similar, but Mongo was a better fit.
Today, I would be inclined to use PostgreSQL with JSON support, and some triggers to update an aggregate search table, or look more seriously towards RethinkDB.
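The trigger-maintained aggregate search table idea can be sketched with Python's stdlib sqlite3 standing in for PostgreSQL (table and column names here are made up for illustration):

```python
import sqlite3

# Sketch: a trigger keeps a denormalized search table in sync on every
# insert, so search queries hit one flat table. sqlite3 stands in for
# PostgreSQL; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE listings (id INTEGER PRIMARY KEY, title TEXT, city TEXT);
CREATE TABLE search_index (listing_id INTEGER, blob TEXT);

-- Update the aggregate search table whenever a listing is added.
CREATE TRIGGER listings_ai AFTER INSERT ON listings BEGIN
    INSERT INTO search_index (listing_id, blob)
    VALUES (NEW.id, NEW.title || ' ' || NEW.city);
END;
""")
conn.execute("INSERT INTO listings (title, city) VALUES ('Loft', 'Berlin')")
row = conn.execute("SELECT blob FROM search_index").fetchone()
print(row[0])  # -> Loft Berlin
```

In Postgres you would do the same with a `PL/pgSQL` trigger function, and the indexed column could be `tsvector` or JSON.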
With any NoSQL system you give up something; you just need to be aware of what you are giving up, why, and for what gains.
I understand some of the reasons people didn't like Mongo, but this always vexed me. The default write level was very clearly documented and you could always change it as necessary. Surely it would be necessary to read the documentation of a database before rolling it out to production?
> Surely it would be necessary to read the documentation of a database before rolling it out to production?
You buy a car. It comes with the brakes disabled because, for whatever reason, that also lets it reach a higher top speed. You are expected to read your car owner's manual, and on page 54 you find that you have to hold the "enable brakes" button under the console for 10 seconds to turn on your brakes. Would it vex you that people might be slightly critical of that car? Clearly they are silly for not reading their car manual until page 54.
That "feature" is not something that should be discovered by reading docs, or after you get a crash, load a backup from another week, still get a crash, and then start hitting your head on your desk.
Anything calling itself a "database" should not have shipped with those default settings _ever_. If they did they might have gotten away with it in my book by having a big flashing red warning on the front or download page. I don't remember one.
I would also read the manual of a car I just bought before driving it. I guess that's just my style.
Don't get me wrong, I'm not saying your assumption is unreasonable. But in the end, it's on you as a conscientious developer to read the documentation. I'm not even suggesting cover to cover - in this case though they are very up front about write concerns. There is no real excuse to find this out any other way, it's just negligence.
What is impractical is that brakes not working is not something you want to find in the manual on page 54. You don't want the car to start without that feature unless you enter a special code and confirm you know what you are doing.
A database whose default configuration ends up corrupting users' data silently is like buying a car with the brakes disabled.
Well, except that in the car case disabled brakes won't make the car go faster, but in the case of MongoDB I remember fans strutting write benchmarks around, comparing it to Postgres, Couch and other databases and telling everyone how webscale it is. The reason that design decision was made is shady. That was my initial point.
Your approach is wholly impractical on its face, actually.
So, let me get this straight: You laid down tens of thousands of dollars on a vehicle that you only post-purchase read the manual of, and you're raising this as some sort of standard people should follow?
Honestly, asking the right questions (and test-driving) upfront should be what lands the purchase, and not discovering the folly of purchasing a car with such ass-backwards issues you only discover after the fact when you bother to dig out the manual.
You drove it off the lot after you bought it, right? Or did you read the manual in the lot right after signing the papers locking you into the purchase?
Turns out you can read the manual of a vehicle ahead of time. Turns out you can also test drive and do everything else you said, and we don't need to pretend that it's all mutually exclusive. Stop being a pedant -- me listing every bit of due diligence about my car isn't relevant, so let's stay on topic.
Honestly, how can people on HN actually be this against reading? Especially things that are really important? Sure, don't read the contest rules for your McDonald's monopoly. But if the data for your livelihood depends on something, there's no excuse for not reading the documentation.
If understanding your production database is a bad use of your time, then I really don't understand your priorities, but I'm glad you're not on my team.
You don't need to read every word of everything, but some things are worth it. Do you sign contracts without reading them too since it's a "terrible use of" your time?
> You don't need to read every word of everything, but some things are worth it.
Yes - I am saying that in this particular car example, the benefit derived from reading the entire manual before purchasing/driving a car is not worth the cost unless your time is worth very little. As others have pointed out, no one is flipping through the manual to check whether the brake pedal actually applies the brakes.
But it is a good engineering decision to thoroughly read the docs before jumping into a new datastore like Mongo, I agree. Learning that there are things like gigantic global locks and unsafe writes is normally enough to make you say, "hey, I probably shouldn't use this to store production data I actually care about".
That's fairly unusual; modern cars have a largely standardized user experience, such that there aren't many ways for a car to do something surprising and dangerous that's covered in the manual.
I did experience one once though: I discovered that a vehicle had traction control when the system activated during a skid. The computer and I disagreed about the best way to respond, and the surprise did make the situation more dangerous than it could have been.
In both vehicles and databases, the situations in which the product might do something unexpected and dangerous should be clearly documented in their own section of the manual. Databases should say "here are the things that could lead to data corruption or loss". Vehicles should say "here are the situations where the vehicle might disregard or override the driver's control inputs".
I'm curious about your desired response during the skid and what the traction control system did differently. Would you mind expanding on that if you remember it clearly?
Sure. The vehicle involved was rear wheel drive and, as loaded probably had a rear-biased mass distribution. It began to oversteer in a corner on a patch of ice. Standard protocol for these situations is to apply a moderate amount of throttle to shift weight to the rear and increase traction there while reducing or reversing steering input. The traction control system counteracted my attempt to increase power, requiring vastly more reverse steering input and interfering with my ability to position the vehicle on the road.
I suspect some people will believe the results wouldn't have been what I expected without the traction control. I can't prove they would have been, but I did grow up and learn to drive in Alaska. Based on my experience, I think I would have done better than the computer did.
Yes, whenever I buy a car, I read the manual. If it's second-hand, I download the manual.
Heck, even when I have a rental car for a single day, I will read the manual. Maybe not every page, but I will skim it for gotchas (and if I have time, the whole thing).
Maybe it's just the engineer in me, but it's what I do.
Come on guys, as computer engineers/programmers/developers/whatever, surely professional pride alone would mean we at least read the README and/or the manual before putting something into production?
This is a pretty silly argument. The people whom it affects are not people buying cars, it's more like launching a shuttle mission, at which point I would assume you have read the f*ing manual.
Given how much time we spend talking about MVPs, Lean Startup, etc., I think people on this site are trying to avoid launching a shuttle mission. [Often] they're looking at building startups and are looking for both time-tested and new-but-advantage-providing technologies and techniques. At first glance, MongoDB appears to be advantage-providing, so people adopted it quickly. They didn't read the manual; they put it in production on a small site and got surprised by the lack of durability.
I assure you, once you are spending 75% of your administration time dealing with MongoDB problems, taking you away from other critical tasks, you'll see it's not really a great problem to have.
I like your analogy, very fitting. In their defense, however, they are far from the only ones to do that... I lost about 2h worth of production data with HBase in just the same way. Fortunately I didn't want to trust it completely anyway and had my own logs of all transactions on the filesystem, but it definitely shattered my trust in that DB (not to mention it was a pain to set up and had no secondary indexes).
I use MongoDB in production now and I am happy with it. It's not a huge dataset by any means, so MongoDB fits the bill perfectly. It has a few idiosyncrasies (it doesn't release disk space after deleting records - what?!?) and you definitely want to read the manual on settings. But it is incredibly easy to use (documents instead of relational data) and allows me to focus on app development instead of my storage backend.
HBase never needed nor used append. "Append" refers to the ability to reopen an existing file from a new client and append data to it.
HBase writes only immutable files, which means they are written once then only read. Even the WAL is no exception; it is written once, and then only opened for read when needed for recovery or replication.
HBase needs hflush to make sure that the WAL edits are resident on at least 3 (the default) HDFS data node machines.
Not sure how exactly the grandparent lost data. Each edit is first written to the WAL, then committed to the in-memory store. The in-memory store is flushed to disk into a new file at a certain size.
If a server crashes with unflushed data in its memory store, that part of the data is replayed from the WAL on another server.
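The write path described above can be reduced to a toy model in Python (pure illustration of the WAL-then-memstore pattern, not HBase code; all names are made up):

```python
import os
import tempfile

# Toy model: every edit goes to a write-ahead log first, then to the
# in-memory store. After a "crash" the memstore is rebuilt by replaying
# the log, which is why unflushed edits survive.
class ToyStore:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memstore = {}
        self.wal = open(wal_path, "a")

    def put(self, key, value):
        self.wal.write(f"{key}\t{value}\n")   # 1. append to the WAL
        self.wal.flush()
        os.fsync(self.wal.fileno())           #    (hflush-like durability)
        self.memstore[key] = value            # 2. then the memstore

    def recover(self):
        # Replay the WAL, as a region server would after a crash.
        self.memstore = {}
        with open(self.wal_path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                self.memstore[key] = value

wal_file = os.path.join(tempfile.mkdtemp(), "wal.log")
store = ToyStore(wal_file)
store.put("row1", "hello")
store.memstore.clear()        # simulate losing the in-memory store
store.recover()
print(store.memstore["row1"])  # -> hello
```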
HDFS didn't have append at the time, not sure how it is now. It did have some filesystem journalling though (if I remember correctly), we just didn't know we should turn it on.
Other databases are forgiving; they are configured "safe", even at the expense of speed. The intention is that you can deploy a small system immediately; if/as you grow you will see that the database is going too slowly. You can _then_ look at the performance/safety dials you can tune and choose appropriate trade-offs.
These are systems designed for the real world, where people don't read the manual until they have to.
When people assume MongoDB was similarly designed with their best interests in mind, that's when things go wrong.
>When people assume MongoDB was similarly designed with their best interests in mind, that's when things go wrong.
No, I just assume that a database has a similar set of features as other databases have had for decades. Mongo does not; it is clearly the exception - and for possibly nefarious reasons, as well.
I'm not sure how anyone else could know what my best interests are. There are a lot of real world applications where small amounts of data loss don't matter but latency matters a lot.
Any time I deploy something as critical as a database, I carefully read about what it does and how it works. Not doing so is like signing a contract without reading it.
> a lot of real world applications where small amounts of data loss don't matter but latency matters a lot.
I don't understand this reasoning. We are talking about defaults. Defaults are used by people who did not tweak the settings yet. If I am just starting building a thing, I will have bugs and squeaks and I want to make sure I am not fooled by some unreliable data store. I am not likely to need 100GiB/s throughput, but I am very likely to have to hunt bugs, like "I did click on this <like> button but it did not add to the total likes". And I would really really hate it if after half a day of bug hunting I would realize that my data store just didn't store the thing...
> I understand some of the reasons people didn't like Mongo, but this always vexed me. The default write level ... Surely it would be necessary to read the documentation
I don't have much sympathy for people who can't RTFM but storing data is kind of a thing for databases.
Mongo was sort of great at first; it's a not-bad solution for low-traffic sites, and it's also great for prototyping. But it falls apart when anything grows up into high availability, high traffic or anything "high".
It's kind of funny, and interesting, that it took this long for much of the industry to start knocking down the house of cards built around the database. However, many of the databases coming out these days have a lot of the same cultural issues of "hide the issues" and "talk about the cool parts".
I'm expecting any minute now for the anti-schema-less pro-schema movement to rise up...
IMHO, it's all about what you're doing at the time and making the right decision... although it helps if the people behind a database aren't glossing over the issues as insignificant. :-/
btw - "MongoProbabilisticStorage" is a great name for a product!
Note that even with the changed default to 'acknowledged', data is not guaranteed to have been written to the journal. So, there is still no full durability in writes (by default) and there is a chance data might be lost (e.g. a mongod instance crashes).
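To get journaled durability, the write concern has to be raised explicitly. A rough sketch in mongo-shell syntax; this is an assumption about the option names, which have varied across MongoDB versions, so verify against the docs for yours:

```
// Default ("acknowledged"): the server confirms it received the write,
// but the data may still be only in memory, not yet in the journal.
db.orders.insert({sku: "abc"});

// Journaled write concern: don't return until the edit has reached the
// on-disk journal. Slower, but survives a mongod crash.
db.orders.insert({sku: "abc"}, {writeConcern: {w: 1, j: true}});
```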
I posted this further down the thread, but I thought I'd share my thoughts on why I like mongo.
Most people don't like mongo because 10gen gives the impression that mongo is better than it actually is, many people feel that mongo is not reliable enough for at-scale applications. They're right; it's not. But that's ok, because:
Mongo's really great for rapid prototyping. You don't need to worry about updating the schema at the db level, it can store any type of document in any collection without complaining, it's really easy to install and configure, the query language is simple and only takes a couple of minutes to learn, it's pretty fast in most use cases, it's pretty safe in most use cases, and it's easy to create a replica set once your prototype gets usage and starts scaling.
Mongo does everything well up until you reach the level where you need heavy-hitting, at-scale, mission-critical performance and reliability. Most projects out there (99 in 100?) will never reach the level of scale that requires better tools than mongo. And since the rest of it is so easy to use, that makes mongo a great starting point for most projects. You can always switch databases later, but mongo gives you the flexibility to concentrate on more important things in the early stages of a project.
Application design for me almost always begins with data and data structures. Whether my database has an explicit schema or not, I always have one in mind, documented or otherwise reified in the data structures in my code. I just don't get why people would want a schema-free database that is in almost every way inferior to the rock-solid power beast that is Postgres. Just use a library with proper migration support so you can propagate changes to your schema rapidly during development. You'll thank us later when you learn a little bit of SQL and start analyzing your data, running circles around the no-sql guys.
Cassandra et. al. are completely different, in that you don't use them because they are more fun to use. You use them despite their awkward, low-level interfaces because you're going to dump billions of data cells into your database from day one with no end in sight and want all the easy scaling/availability features provided.
Do you always start with the perfect data structure? I find myself adding, removing, and restructuring schema often. Just as you think it's silly to use an "inferior" db during prototyping, I think it's silly to have to jump through hoops -- even minor ones -- while I'm just trying to experiment with a new technology or play with a concept, product design, or pet project. 99 times out of 100, I don't care if my project survives the weekend. Let me use that database I want to use!
> You'll thank us later when you learn a little bit of SQL and start analyzing your data, running circles around the no-sql guys.
That's a little condescending... do you know a single mongo user who doesn't have experience with SQL? Plus, I love the fact that I can literally run javascript against my database. Good for production? Certainly not. But that doesn't mean it's not fun or useful.
Not every project requires such rigor. If that's how you enjoy development, that's great! Very few of my projects put the db layer to the test, and so I'm happy with the balance that mongo gives me. I use it in about 4/5 of my experiments and side projects.
That is easiest [cough, imho] solved by adding a version number to stored records. Since the data is not in much of a normal form and there won't be that many joins, it is generally easy to handle in code.
Sometimes you have to do update of records with a certain version number.
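The lazy-upgrade idea can be sketched in Python; the field names and version numbers are made up for illustration:

```python
# Each stored document carries a _schema_version; reads upgrade old
# records step by step until they match the current schema.
def v1_to_v2(doc):
    doc = dict(doc)
    doc["name"] = doc.pop("username")   # v2 renamed the field
    doc["_schema_version"] = 2
    return doc

def v2_to_v3(doc):
    doc = dict(doc)
    doc.setdefault("likes", 0)          # v3 added a counter
    doc["_schema_version"] = 3
    return doc

MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3

def upgrade(doc):
    while doc.get("_schema_version", 1) < CURRENT_VERSION:
        doc = MIGRATIONS[doc.get("_schema_version", 1)](doc)
    return doc

old = {"_schema_version": 1, "username": "alice"}
new = upgrade(old)
print(new)  # -> {'_schema_version': 3, 'name': 'alice', 'likes': 0}
```

The same `upgrade` can also be run as a batch job to convert records of a certain version in place, as suggested above.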
My opinions, for the record: MongoDB is a tool with some use cases. I'm more of an SQL+Memcache guy, if possible, but not religiously if a good argument is presented (that don't sound like "let's use .*, I want another keyword on my cv").
If you make 12 schema changes in month 1 and then no schema changes for the next year, does it really make sense to keep a month's worth of data in 12 different formats and maintain code to support all of the different versions? Why not just do a simple schema change and/or data migration each time and be done with it?
And since this is supposed to aid in rapid prototyping, how does it do so? It seems to me that it does just the opposite by introducing a significant and totally unnecessary burden.
Generally I'd only have at most 2 formats at once while you converted the older records to the new format. You're right that there's no sense in keeping around a dozen versions but there are a lot of business cases for having two versions of a schema active at once. For example, if you can't bring down your application to convert everything mid-day and instead want to do an incremental conversion.
As functional_test said. Also note that this e.g. depends on how long lived your data is.
(An update routine can be run at any point with low use like Xmas, etc. This is potentially neat, depending on use statistics.)
I'm not saying this is a common thing, but the lack of joins makes the data a bit more flexible -- this can't be too much, if nothing else because then the Javascript will begin to break.
(I do think there are much more use cases for nosql than as a Memcached with more features. Where an old job used MongoDB wasn't one.)
Couldn't you argue that e.g. Postgres and ActiveRecord give you the same rapid prototyping ability but with an easier (and more established) path towards scalability? It is easy to change your schema with migrations at the beginning of a project - just go edit the original ones and nuke your database. And I don't have to worry about properly configuring write-locks, replica sets, or writing map reduce javascript.
Of course you could argue that. But so what? Having an easier path towards scalability is nice, but irrelevant for the vast majority of projects; not every project is going to turn into a startup or a real product or even something you work on for more than a few weekends!
The last time you hacked together a blogging engine in Node.js one weekend, were you worried about future scalability, or just playing with new technologies because it's fun?
And while it's pretty easy to do schema migrations, it's not easier than _not_ doing them. And what is it you really want to worry about? Making sure your DB is production ready, or tinkering with Express.js and Backbone?
So because of that, many people use mongo as their de facto database. It's just what I use when I need a persistence layer for anything I build, because I already have a mongo db running for like 30 different defunct projects on my dev server.
And then, by happy accident, one of your side projects turns into a real product, and then mongo serves you really well for the first year or so, right up to the point of having to hire a real devops engineer, at which point you swap out your ORM layer and switch to postgres.
> while it's pretty easy to do schema migrations, it's not easier than _not_ doing them.
Regardless of whether you are dealing with a strict schema or flexible schema, you still have to make changes to how you structure your data as you are prototyping or otherwise iterating on it. MongoDB provides no tangible benefit in this case. If you want to rename a field, then you still need to run an update.
How are ad hoc, manual, historically opaque tweaks to data in any way better than an easily generated and version controlled series of scripts representing a replayable history of changes to the data?
If anything, manual untracked tweaks make "rapid prototyping" more difficult since lots of partially or completely undocumented changes to the structure of the data are harder to revert, replay, reason about, or share with others. It's also more work to do it manually since you need to run the commands in multiple environments, rather than just entering the same command or, more frequently, a shortcut command into a generated file.
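A minimal sketch of such a replayable migration history, using Python's stdlib sqlite3 (table and column names are hypothetical; the rename step needs SQLite 3.25+):

```python
import sqlite3

# Migrations are numbered, applied in order, and the database records
# which ones have already run, so replaying is idempotent.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)",
    "ALTER TABLE users ADD COLUMN email TEXT",
    "ALTER TABLE users RENAME COLUMN username TO name",  # SQLite 3.25+
]

def migrate(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (v INTEGER)")
    applied = conn.execute("SELECT MAX(v) FROM schema_version").fetchone()[0] or 0
    for i, sql in enumerate(MIGRATIONS[applied:], start=applied + 1):
        conn.execute(sql)
        conn.execute("INSERT INTO schema_version VALUES (?)", (i,))

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # second run is a no-op: already up to date
cols = [r[1] for r in conn.execute("PRAGMA table_info(users)")]
print(cols)  # -> ['id', 'name', 'email']
```

The whole history lives in version control, and every environment ends up with the same schema by running the same script.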
> while it's pretty easy to do schema migrations, it's not easier than _not_ doing them.
I just don't buy this argument, writing and executing migrations is braindead simple and usually takes what, 20 seconds start to finish? Writing the line of code you need for mongo must be about 5 seconds.
edit: I did actually give mongodb a good crack (I used it on a side-project for 6 months last year) but I found that I actually spent a huge proportion of my time working around things that were missing compared to ActiveRecord. It was a huge net loss for me in terms of productivity.
Actually postgres can be a bit of a PITA, but so can mongo. At the risk of sounding reckless, unless the app needs to support high CUD throughput I sometimes opt for sqlite. It doesn't get much easier than that, and its read performance is impressive from what I've seen.
Even then you can sometimes get away with staying on sqlite for your admin side CRUD and redis for the heavy / concurrent writing from the public facing side (obviously situational).
I've been using sqlite more and more as well. Super lightweight, but since it's SQL most ORMs can handle switching to MySQL/postgres really easily if you ever need to make the switch.
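The zero-setup quality being praised here shows in how little it takes to get going with Python's stdlib sqlite3:

```python
import sqlite3

# A full SQL database in one file (or in memory): no server process,
# no user accounts, nothing to configure.
conn = sqlite3.connect(":memory:")  # or a path such as "app.db"
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
conn.commit()
print(conn.execute("SELECT body FROM notes").fetchone()[0])  # -> hello
```

Because it's plain SQL, an ORM pointed at this schema can later be repointed at MySQL or Postgres with little ceremony.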
Now this is an important point. Once you make your product, what do you need to do to retool a major part of the application? This seems like an excellent approach.
Many people in the Java world use something very simple like hsqldb, then shift to a new database when out of development.
Configuring write-locks? I guess you have never actually used MongoDB before, which explains why absolutely none of what you said makes any sense. MongoDB is far easier to use, manage the schema with and scale than PostgreSQL.
That's why you use a language or technology that's politically unfeasible for your rapid prototypes, like Clojure or Haskell, or...for that matter...MongoDB. ;-)
I think they are upset with their marketing, like "web scale". Even the name "Mongo" is derived from "Humongous" -- but that's exactly the scale at which you'd switch away from Mongo.
Riak also has massive problems. Realistically, figure out your data that you want to stick in a database, why, and how you're going to query it, and then work from there.
Your entire database's keyspace must fit in memory if you're using Bitcask. If you're using LevelDB, you have compaction overheads. Also, sibling resolution can get very messy, and complicated if your app works in a way that can potentially result in sibling explosions.
See, you're just perpetuating The NoSQL Problem. :) Riak is well-suited to some tasks, but it is no more a magical fits-every-problem thing than MongoDB is.
I sometimes throw out these things as a quick way to get reactions and interesting feedback as to why something is good/bad. For instance right now I know mostly good things about Riak, that's why I posted this ending to my comment.
Having recently rolled out riak into a production environment I can offer some off-the-cuff bullet points:
- Mostly easy to work with. Mapreduces can be a big pain to troubleshoot because you can't console.log() in your JS. Didn't try it in erlang.
- Being masterless, it has a very good replication story for servers _in the same data center_. It really bit us that there was no good riak solution for syncing data across multiple data centers. There is an enterprise solution for that, but it's quite expensive, which makes riak less appealing if you don't have much budget on your project.
- Errors in general are next to useless. Get comfortable waiting for answers in IRC when you get opaque error messages after running queries. You can definitely work past this, but it wasted a lot of my time.
- Not sure if pro or con, but as the cluster reached load capacity, from a combination of data size and read requests, map functions would begin to slowly fail. After a while, we could tell which completely useless error message (preflist_exhausted, my old friend) could be fixed by a cluster restart, and which would simply begin to happen with greater frequency as more data was added. This was exacerbated by my company refusing to pay for anything more than a three node cluster. You might say I should have fought harder for more, but I had to fight to make them not host all three nodes on a single server. There are places that will hire you that simply do not intend to do anything sane, but I digress. The takeaway: riak is not a super cheap way to scale.
- Bulk inserts? What are bulk inserts?
- Key filtering is just a shim over listing all keys in a bucket. Further, listing all keys in a bucket, or all buckets in a cluster, can be very expensive, and basically you'd never do it unless you had a very small bucket. The bag of tricks you can apply to speed up slow queries is basically "Do you have secondary indexes? Ok, good."
Those points do read a little negative, but actually I would use riak again. To me it works best as a temporary event store living in one data center. If you've got a bunch of items shuffling around your backend in real time, being processed to and fro, you could definitely do worse than sticking in it riak and adding more nodes as needed.
A bit off-topic, but if you've got a moment I would love to pick your brain a bit more about your Riak usage. Shoot me an email if you're up for it - mark@basho.com
Have you considered RethinkDB? For most of the advantages of MongoDB that don't specifically come from mmap and overwrite-in-place, it's equal or better.
Most developers who use Windows will just go with something which doesn't require a virtual machine. For example, MongoDB, CouchDB, OrientDB, Cassandra, and ArangoDB work fine everywhere.
Mongo is also brilliant for internal tools that need to change rapidly but you absolutely know will never require significant scale. I've been badly burnt trying to use mongo on a large dataset but it's genuinely great for getting things done quickly.
I also used it to write a service that had to go from nothing to working in a couple of days. I then spent the next two days swapping it back out again. It was surprisingly painless to go from an object store to a relational model.
I just use the filesystem for that sort of thing. Everyone justifies using MongoDB because it's easy and general, but compared to the tooling and compatibility ecosystem around files it's awkward and primitive.
The case for using it as a prototyping database is the best use-case I've seen for Mongo, however I'm not sure it's always a good idea.
For a hack-weekend sort of project, fine, but if you are in any way attempting to make a product, it strikes me as the sort of thing that would be really difficult to change later down the line, and so worth investing the very little extra effort it takes to include your schema in the database, and use something like Postgres/MySQL/etc.
If this was a proprietary database we'd call that vendor lock-in and advocate an open source solution. 10gen is a company that earns its revenue from selling support. They are highly incentivized to lure you in and trap you in a situation that requires a lot of consulting.
Or they could just be interested in adding useful features.
PostgreSQL has HSTORE which is a useful but proprietary feature. Cassandra has the ability to have Lists/Maps as data types. Again useful but proprietary.
If you are that concerned about database independence then do what everyone else does. Use an ORM, minimise coupling in your domain model and do as much as possible in the application layer.
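A minimal sketch of that decoupling in Python (all names hypothetical): the domain code depends only on a small repository interface, so the backing store can be swapped later without touching business logic.

```python
# The domain layer sees only this interface; a Mongo-, Postgres-, or
# in-memory-backed class just has to implement the same two methods.
class UserRepository:
    def save(self, user):
        raise NotImplementedError
    def find(self, user_id):
        raise NotImplementedError

class InMemoryUserRepository(UserRepository):
    def __init__(self):
        self._rows = {}
    def save(self, user):
        self._rows[user["id"]] = user
    def find(self, user_id):
        return self._rows.get(user_id)

def register(repo, user_id, name):
    # Domain logic: knows nothing about the storage engine.
    repo.save({"id": user_id, "name": name})
    return repo.find(user_id)

repo = InMemoryUserRepository()
print(register(repo, 1, "alice"))  # -> {'id': 1, 'name': 'alice'}
```

Swapping databases then means writing one new repository class, not rewriting the domain model.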
Cassandra support for lists/maps is open-source. Cassandra is Apache Software Foundation project. Where is it proprietary? Or do you mean something else under "proprietary"?
I feel the opposite way. If your application's data layer is sensibly designed, it shouldn't be too bad to switch to postgres when you need to. It may be tedious if you have a large codebase, but it won't be difficult.
I think the up-front benefits of using mongo (especially as a sole developer/devops/sysadmin person) outweigh the difficulty of the changes you'll need to make later on, which will only happen as you hit scale and have more resources to nurture the devops side of the tech stack.
This prototyping story sounds to me like admittedly shooting yourself in the foot if your prototype turns out to be worth a damn. Basically, you're saying mongo is an excellent choice only when your storage backend is a moot point.
I can't think of any codebases I've seen where intentionally choosing the storage backend you know you don't want to use (if the project is successful) would be a reasonable thing to do. Understate it if you must, but having to change your backend from mongo to postgres is not a desirable situation. Besides, if you're going to use postgres to scale, use its features and write well-optimized queries for it. The difference between a bad massively complex query and a well-optimized one can be several orders of magnitude, and that optimization can indeed be difficult. It goes without saying that you wouldn't leave that to an ORM.
The advantage of shooting yourself in the foot if your prototype turns out to be worth a damn is that it forces you to rewrite it with more rigorous development practices ASAP. Usually, choice of a storage engine isn't the only problem with a throwaway MVP - you've probably written it in a language that won't scale, and skimped on error handling, and are using really inefficient algorithms, and didn't bother documenting anything.
That said, I would use Postgres for my MVPs, using it as a key-value store initially until it's clearer what the schema should be. That is, if I still bothered using code for prototypes; of late I've been more fond of napkins and Adobe Fireworks.
I've never understood people who worry so much about the schema. It's like they've missed the last decade of computing. Everybody these days uses an ORM. Which means that (a) data migration between databases is a relatively simple task and (b) schemas often just get in your way.
The problem with that argument is that prototyping rarely needs a database. Just store stuff in memory. Who the fk in the real world is seriously writing data-layer code for a prototype?
None of the things you listed are any harder with Postgres and SQLAlchemy. Learning to use MongoDB isn't exactly trivial anyway, so why choose the thing that is known to be broken at all, when it's neither easier nor faster?
Previous versions of my startup's enterprise product used to be based on relational DBs (mostly Oracle, MySQL also). This year we switched to Mongo and dropped RDBMS support.
RDBMS performance was fine most of the time as we're not doing big data really. Our problem was developing and maintaining a schema that holds lots of metadata many levels deep. Our app allows for unlimited user-defined forms and fields, some of which may hold grids inside, which hold some more fields... Our app also handles lots of logs and large file dumps, which slowly made data, cache and fulltext search management mission impossible. Even though we had considerable previous experience with Mongo, it took us a long time to switch because we were utterly scared. It's nice to sell a product that is Oracle-based, as that sent out a message about our "high level of industry standardization and corporate commitment" bullshit that (we thought) was quite positive for a startup competing against the likes of IBM, HP, etc.
To our surprise, our customers (some Fortune 500 and the like) were VERY receptive to switching to a NoSQL, open-source database. Surprising, especially given it would be supported by us instead of their dreadfully expensive and mostly useless DBA departments. It even came to a point where it changed their perception of our product and our company as next generation, and surprisingly set us apart from our competition even further.
In short, as many people here know, not all MongoDB users are cool kids in startups that need to fend off HN front page peak traffic day in day out. Having a schemaless, easy to manage database is a step forward for sooo many use cases, from little intranet apps to log storage to some crazy homebrew queue-like thing. 10gen's superb, although criticized, "marketing effort" also helps a lot when you need to convince a customer's upper management this is something they should trust and even invest in. I can't express my gratitude and appreciation for 10gen's simultaneous interest in community building, flirting with corporate wigs and getting the word out to developers for every other language. Mongo is definitely a flawed product, but why should I care about the clownshoeness of its mmapped files when it has given us so much for so long?
Well written post. Even as a detractor of mongo I'll agree that it works for your use case. But the key is "Our app allows for unlimited user defined forms and fields, some of which may hold grids". That really isn't a very common case. SQL is not great at representing large groups of documents without any common structure.
The vast majority of apps just don't deal with that problem. If MongoDB were really only used by people it's a good fit for (like yourself), it'd really be a niche product. They're marketing it as a general-purpose product, which is why they've earned scorn from so many.
Bingo. They should call themselves mongodocs, not mongodb. The way I see it, mongodb sees widespread misunderstanding about its use cases, and instead of making those use cases clearer, they seem to take an interest in seeing mongodb used unnecessarily.
> Having a schemaless, easy to manage database is a step forward for sooo many use cases
Can you explain why you can't do schemaless with an RDBMS?
From what I understand MongoDB is schemaless by storing all fields as one single JSON document. So what stops you from doing the same in an RDBMS - have a catch-all field "JSON" and store all your data there?
That gets you halfway there, but you still don't have the ability to query your datastore by structure, unless you've installed PostgreSQL 9.3 and are using its JSON field type, which does have that capability, thus entirely demolishing the NoSQL USP as far as I can determine.
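The catch-all-column idea can be sketched with any SQL engine that has JSON functions. Here's a minimal, hypothetical illustration using Python's stdlib sqlite3 (whose JSON functions are built into most modern SQLite builds); Postgres would use its native `json` type and the `->>` / `#>>` operators instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One catch-all TEXT column holding a JSON document per row.
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO docs (body) VALUES (?)",
             ('{"name": "Ada", "tags": ["math"]}',))
conn.execute("INSERT INTO docs (body) VALUES (?)",
             ('{"name": "Grace", "tags": ["navy"]}',))

# The key point: with JSON functions you can query BY STRUCTURE,
# not just fetch the blob back and parse it in the application.
rows = conn.execute(
    "SELECT json_extract(body, '$.name') FROM docs "
    "WHERE json_extract(body, '$.tags[0]') = 'math'"
).fetchall()
print(rows)  # [('Ada',)]
```

Postgres 9.3's JSON operators go further (and can be indexed via expression indexes), but the shape of the idea is the same.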
Also of note is that stored procedures are supported in a variety of languages, including Javascript, so it's quite easy to handle cases where the surprisingly broad range of core JSON functions and operators [1] doesn't include what you need.
PostgreSQL has also recently added a key-value store type [2] with semantics reminiscent of Redis. The impression I get is that they're gunning for the NoSQL kids in general, and this pleases me; while I grant it is sometimes possible and necessary to obtain new insight in a field by ignoring all that's gone before, I very much doubt this is one of those times, and I am therefore delighted to see a properly engineered database engine gain more or less the entirety of the features which draw interest to the NoSQL crowd in the first place.
That is extremely interesting. So it looks like you can store a JSON type as well as a KV datatype in Postgres! And it looks like it is relatively easy to convert between the two.
This leaves only performance. I think I'm still confused around this area - why do people say that non-relational technologies like MongoDB are faster than relational databases?
Thanks for sharing your experience on this. We make a lot of custom enterprise intranet applications, and we've been considering adding MongoDB to our toolchest. My concern has been what you say- there will be customer resistance. That they'll have fear for the future of their DB, since it's not SQL. Based on your post, it seems like it may be my fear and not necessarily the clients'. I was very interested in NoSQL at the beginning and what that did is make me realize I need to up my SQL game. I want to make sure I don't move into NoSQL just because it's hip and relational DBs annoy me.
We are evaluating Mongo to be the persistence for single page web applications, which is how we'd like to start making the majority of our intranet/private enterprise jobs. Are you using it in this context and has it been helpful? Your example of nested fields being easier made me warm and fuzzy- we had a project last year that had growing, fluid, user-defined data structures. We made it work (well) with Postgres but there were several kludges that really bothered me. One of them was handling delete dependencies gracefully on user-defined, nested structures. Did you encounter this issue pre-Mongo as well and if so, did it help with it?
Perhaps your customers were receptive because you were supporting it, and they were happy with the support they got from you and unhappy with their internal DBA department? It's possible that they didn't care about the technology and just realized that the support was going to be better.
I'm confused by your comment. The beginning acknowledges the fact that MongoDB has a weak storage engine, but your conclusion is that, even with a strong storage engine like ours, there is still a problem. What other problems do you see? Are they something we could work on?
This is going to come off as a bit negative but I kinda feel it has to be said. I would first like to say I do love the Fractal tree indexing, very cool, and it could have a lot more interesting use cases outside of databases (I'm thinking logical volume/block storage etc... I'm always thinking in kernel land).
The problem is that Mongo advertised itself as a database and wasn't one. Once you do that, the reputation of the product is dead forever.
TokuMX is a real database as far as I can see, MVCC, great indexing story etc.
By association TokuMX is probably not regarded as highly as it should be. Which is a shame but it's a people problem, not a technical one. People can very easily lose trust in a technology at which point it's effectively dead, it might take a long time to die due to lock-in but it's dead.
For instance, I have recently started playing with RethinkDB over TokuMX almost purely because of the Mongo association.
Now technically that might not sound like good reasoning, but when you think about the kind of person who writes a database that doesn't fsync your writes by default and relies on the page cache instead of doing direct I/O... it doesn't really inspire confidence in the network stack, the query planner, or, well, anything.
If anything it makes you insistent on not having ANYTHING to do with that sort of codebase.
Just replacing the storage engine might actually be good enough, but at this point restoring my trust in the rest of the codebase is almost a lost cause.
I've seen a lot of the rest of their code, and most of it is getting better over time; as they grow they're forced to adopt better habits in order to scale their engineering team. I think you're misunderstanding the type of programmers they are. They didn't use mmap because they are sloppy everywhere; they used mmap because their critical innovation was not in storage. What they really thought was valuable, what they wanted to work on, was the query language and cluster management tools, so they did the simplest thing for storage and moved on (personally I don't understand why they didn't just use BDB, maybe they were afraid of transactions, but I suppose everyone has a little NIH syndrome in their database).

Now they're a bit locked into that code, because after bolting on journaling (that architecture is a brilliant but incredibly dirty hack), the code is a mess and I'm sure nobody wants to touch it. In fact most of the other subsystems have been getting cleaner rewrites, except for the storage layer. I think the only way out is a complete replacement, which is what we did, so I feel pretty good about that.

So I don't know if I'll convince you, but I've read a lot of their code (especially in the last few weeks, I've been backporting things from 2.4), and that's the feeling I get about their history and vision. Hope it gives you some insight.
I don't think the problem is that I misunderstand them; I just disagree with them.
I disagree with them on what is the minimum viable product for a database. I come from a storage and service provider background, where failures are treated very harshly (usually the death of a company for a singular mistake), so I take releasing a product that stores customer data very seriously.
To be honest this is the biggest attraction of RethinkDB for me. They waited a sufficiently long time, with a commercially backed team of very competent engineers who obviously have the required background, to sit down and DESIGN a database. The query language generates a non-Turing-complete language with a clean AST that has all the right deterministic characteristics to implement a powerful planner/optimizer. Their on-disk format has been a bit in flux, but the core design is excellent and you can see that it has been optimized for very fast range queries. Even the API protocol and serialization were designed with care, not to mention the excellent ReQL language and the attention to detail when integrating drivers into the host language.
Which is the other thing I tend to dislike about Mongo: it reeks of a lack of design. The journalling effort, for instance, as you pointed out, is very ad hoc, and this goes for GridFS and a lot of the other features they have integrated into the codebase.
These are smells that I can't ignore when looking at a product that I need to trust with my data.
The counter-argument is to not trust it with your data. But I have yet to find a case where that makes sense and another datastore wouldn't be a better choice.
As I recall it, all the early noise they generated was their excitement about their hot benchmarks and how good mmap was...
They can try and rewrite the web and remove all the silly benchmarks, but they were the loudest "web scale" cowboys back in the beginning and we remember them for it.
There are also people using MongoDB and finding it meets their needs well, and don't feel the need to keep writing about how everything sucks or is wonderful. (I'm one of them.)
None of how MongoDB works is a secret. And just like everything else it has sweet spots and problem areas. And like many others, development continues and it gets better.
The database does not get the job done - it is a tool to help get the job done.
Maybe not now, but this hasn't always been the case. The fact that they had (have?) a global write lock was completely buried on the doc site for ages. Benchmarks were waved in front of developers' faces to distract them from the "drivers don't actually write data, they just blast it out in every direction and hope it lands somewhere good" BS.
I don't use Mongo anymore, and I think a lot of it has not to do with the database itself, but with the way 10gen used their marketing machine in a dishonest way. They incurred a lot of trust-debt, and now have a serious amount of work to do to pay it back.
I worked for 10gen (now MongoDB) for over 2 years (I left in December).
Never once while I was there did they publish a benchmark: There was a [publicly] stated company policy to not publish or comment on benchmarks.
If you have evidence otherwise (i.e. benchmarks published by the folks working on MongoDB) fine, but I take this as a deliberately inflammatory (and false) statement.
EDIT: The global write lock was removed ~last August; there is now a database level lock. Future releases will likely make that more fine grained. Additionally, the drivers no longer do "unsafe" writes, but check w/ server.. as of the same release.
It is not a sin of commission Mongo is accused of, but a sin of omission. MongoDB out of the box is configured to be fast-but-unsafe. Postgres and other databases out of the box are configured to be safe-even-if-slower. A benchmark which doesn't spend the time to configure both systems equivalently (i.e. most community benchmarks) will therefore show MongoDB as the faster system. The policy of not publishing/commenting on benchmarks simply allows misleading benchmarks to be created and to stand. It's a self-serving policy.
Mongo has repeatedly chosen defaults for their database which make naive benchmarks look better, at the expense of production safety. You seem to be willing to attribute that to Mongo's incompetence. Proverbs are on your side, but it sure ties in nicely with the "leave the benchmarks to the community" policy.
Sorry, but the global write lock has been public knowledge since the very early days. Mongo's done nothing to hide that. Also (i worked for 10gen at the time) the "marketing department" that you refer to, was a single person organizing events to give developers what they wanted: knowledge and a community around mongodb.
I think it's really less that 10gen itself was trying to mislead people, and more that the community itself was building up a strange mythos with little relation to reality. This, unfortunately, tends to happen rather often (node.js is magic! Java is really slow! etc.)
That's what made me drop research into NoSQL a couple of years ago- the overly optimistic and magical thinking seemed to be really pervasive. I don't want to get burned from joining in a group delusion. (I'm not trying to say that's what's going on specifically in Mongo or anything else, just acknowledging that the mythos phenomenon mentioned above can repel me.) Since you've coined the term, are there any NoSQL projects you've seen that don't suffer from the mythos issue?
It varies, a fair bit. MongoDB is perhaps the worst, possibly because it's easy to get up and running with, and behaves a little magically (you don't need to know how it works to use it, or at least so you might think initially). It also has a company pushing it, of course.
I'd say that this is less of a problem for the Dynamo paper databases (Riak, Voldemort, Cassandra), because really to use them at all you have to have some idea of what's going on.
I'm a bit biased here, though, while I've found Voldemort, Redis and Cassandra useful and can see Riak and a couple of others being useful, I could never really figure out a good reason that anyone would use MongoDB besides naiveté. That said, I've never really tried, as I don't have a problem that fits it (part of my issue is that I don't know what a problem that fits it would look like).
If you are looking for a sober and no bullshit NOSQL database, consider RethinkDB. The DB is quite young but it doesn't have some obvious flaws out of the gate.
That's all true, but they were giving 90% of their users exactly what they wanted:
"We value feature-set and expressiveness much more than scalability at our data size, but we want to feel like we're big data too so say some of that stuff"
And that's their brilliance, they listened to what people said they wanted and then gave them what they really wanted.
No, what they thought they wanted. They promised a world without DBAs, but the "DBA" is really whoever gets called at 3am when the site goes down. If you don't know who the DBA is, it's you. They promised a world without schemas, but you always have a schema; the only question is whether you know what it is and whether the tools help you validate it. They promised massive scalability, but really just pushed scalability problems out into other layers of the stack. I could go on and on...
Global write lock. Abysmal performance when your data set doesn't fit in ram. Clustering with durability. Clustering in a way that isn't horribly complex and fragile.
Mongos (the routing process for MongoDB when clustering) has a bunch of drawbacks that make MongoDB worse. One of them is dropping all connections when a master switches. Have you dealt with those problems yet? My perception is that MongoDB is generally fine at small scale.
The new database level lock in 2.2 is also annoying (and arbitrary) but it is better than the global lock.
I'm curious, how do other DBMSs handle a master switch/other cluster updates? I'm familiar enough with mongos to know how it works but not what e.g. redis or postgres or mysql does.
Oracle restarts your query, where it was interrupted, on another node. The feature is called TAF, transparent application failover. A client might notice a brief pause, but probably not even that.
The best is not to ever need it by using an architecture with no SPOF (even temporary). Master switch is a huge pain - there are simply too many nasty ways it can fail miserably. I'd stay away from databases needing it, if high availability is the primary concern.
OK, but that's a serious performance and functionality tradeoff.
Most modern architectures make the choice between having nodes serve as master for a subset of the data, and the increased cross-link bandwidth needs and reduced flexibility of a master-less system.
Most master-slave databases don't do auto-promotion themselves; it's a bit of a minefield. (In particular, in cases of network partition, where some applications servers may have a different view to others on whether the master is dead or not).
OK, but that's a different concern. If a database admin manually switches a master (called a primary in Mongo), then someone is responsible for deciding what to do with queries that are in flight. In Mongo, the driver drops them, which is less than optimal. It certainly is a hard problem for writes, but not so for reads.
To your point, I think deployments on modern architectures generally want the ability to scale out and tolerate network partitions, which makes the ability to drop and re-elect masters, reconcile a node that rejoins after a partition, manage shards, avoid hosing remaining nodes, etc. critical. Inability to do so really hurts on a platform like AWS (or in any multi-datacenter deployment, really).
I think dropping reads is just as unsettling as dropping writes in production.
Also, MongoDB does let you elect a primary via setting a priority. Really, it should be a requirement because sometimes mongo nodes will switch due to a dropped packet (or this is all I can assume at least) and the arbiter just randomly picks a node when there aren't priorities.
Most people here seem to complain that their champion lost a benchmark, or that Mongodb does not follow their ideology.
I've used Mongodb for the last 18 months, never lost any data, and it made it obvious that "enough durability" is sometimes... enough.
When I switched to MongoDB from MySQL, performance rose 10+ times and I switched from "don't hit the db too much" to "give it more work, 'cause it's idle".
I know it's an ideology problem: otherwise, I can't see why people would complain about a tool they don't use.
I am so happy Postgres is adding support for JSON. This is a big change. The sole benefit of mongo to me is that you can be flexible with your schema at the beginning. But the consequences are
* you have to learn to do indexing right later (if you have to scale)
* failures and misses start to occur (as you scale)
* more code to write to manage legacy schema and optional fields
The last is painful and ugly, whereas if you start out with a good schema that last point is in good hands. With SQL, when "xyz" attributes start repeating you can just make a new relation, whereas with mongo you'd stuff 20 fields into a single collection. The refactoring is harder.
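The repeating-attributes point is the classic normalization move. A hypothetical sketch with Python's stdlib sqlite3 (table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Instead of stuffing tag1..tag20 into one wide row (the document-store
# temptation), the repeating attribute gets its own relation:
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE post_tags (post_id INTEGER REFERENCES posts(id), tag TEXT);
""")
conn.execute("INSERT INTO posts VALUES (1, 'On Databases')")
conn.executemany("INSERT INTO post_tags VALUES (1, ?)",
                 [("sql",), ("mongo",), ("rant",)])

# Any number of tags per post, no schema change and no optional fields:
tags = [t for (t,) in conn.execute(
    "SELECT tag FROM post_tags WHERE post_id = 1 ORDER BY tag")]
print(tags)  # ['mongo', 'rant', 'sql']
```

The trade-off: a join at read time, in exchange for never having to write "legacy schema and optional field" handling code in the application.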
I will begin to migrate back to SQL for new projects.
Also, the ecosystem is richer in SQL. I have not seen a good ORM for Mongo. MongoEngine is fine, but the implementation plus the db have a lot of issues that make that ORM a bit unusable from time to time. SQLAlchemy is good.
PS: For quick PoC and Hackathon projects sure prototyping with mongo is fine.
He's right that MongoDB could use improvements like string interning so you don't need to worry about field names. But overall, I think this article is very misleading.
If you use MongoDB in production, you should definitely take the time to learn about the durability options on the database side AND in your driver. By using them appropriately, you can have as little or as much durability as you like. Data sets larger than 100GB are no problem either -- right now I'm running an instance with a 1.6TB database.
As always, use the right tool for the right job. If you need joins/etc. and don't need unstructured data, Mongo probably isn't a great choice (even with the aggregation framework).
I use it to store a lot of historical time series data that doesn't change once written (at least, not often). I can easily achieve the write performance necessary to record the data streams live. Since it's all append-only, I don't need to worry about fragmentation. With replication, it's possible to access the data with very high throughput which is useful when the data is being accessed by a cluster, for example.
I also use it as a metadata "scratch space" for highly available applications (things where failures are not acceptable and must run for days at a time). Again, with replication and automatic fail overs, I've been able to maintain 100% uptime outside of maintenance windows. Obviously that can't last, but so far it's been >2 years with no major problems.
EDIT: I should point out that although the size of the metadata objects can be highly variable, since I usually had a small number of them relative to the time series, fragmentation was still not an issue.
You have a smallish number of documents where some particular field of fixed size gets overwritten a lot, the old values are uninteresting, and it wouldn't really be a tragedy if your data got trashed. For example, it's the player's score.
You want a fixed-size, rolling backlog of time series data such as logs.
Postgres update performance is pretty bad. When running a big data migration, it's generally faster to copy the old table to a new temporary table and rename the temp table to the old table than it is to run an update.
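The copy-and-swap trick described above can be sketched with stdlib sqlite3 (the pattern is the same in Postgres, where you'd additionally need to recreate indexes and constraints on the new table; names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "A@X.COM"), (2, "B@X.COM")])

# Instead of `UPDATE users SET email = lower(email)` touching every row
# in place, write the migrated data into a fresh table and swap names:
conn.execute("CREATE TABLE users_new AS "
             "SELECT id, lower(email) AS email FROM users")
conn.execute("DROP TABLE users")
conn.execute("ALTER TABLE users_new RENAME TO users")

result = conn.execute("SELECT email FROM users ORDER BY id").fetchall()
print(result)  # [('a@x.com',), ('b@x.com',)]
```

The copy is a sequential scan plus sequential writes, which is why it can beat a row-by-row update that has to maintain indexes and old row versions as it goes.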
RDBMSes need to check for primary key violations, hence read before write. Random access is slow. The fastest you could do is "no read-before-write, append-only writes, compact later" (Cassandra way).
Most RDBMSs will, if you rewrite a field, write a fresh row and tombstone the old one, and clear it down in the next compaction. This is what MVCC means in practice: the old version doesn't disappear while the new is being written.
MongoDB by contrast will simply mmap that block of file, overwrite the contents, and fsync. Yes, this has obvious downsides.
I won't dispute the empirical finding that MongoDB and friends are faster at this than RDBMSs.
However, I'm not sure why this should be the case. You mention the complexity of updating a row in MVCC; sure, but all the database has to do before reporting success to the user is write its intent to make this change to the transaction log (WAL in PostgreSQL, redo log in Oracle). The actual changes to the data files can be written back later on. The transaction log is a single stream being continuously written to disk, so that should be very fast.
MongoDB, on the other hand, is making scattered writes across its mmapped data files, which should be much slower. Except that of course it's probably doing this on a journalled filesystem, which is using exactly the same mechanism as the RDBMSs to provide fast, safe updates.
I'd be really interested to see how a simple update to a single field translates into actual writes to disk for PostgreSQL and MongoDB. If only I knew how to use strace!
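The transaction-log idea is simple enough to sketch: acknowledge a write only after the intent record hits an append-only log, and apply the change to the data files lazily. A toy illustration in Python (file layout and record format are made up, and the in-memory dict stands in for the data files):

```python
import json
import os
import tempfile

logpath = os.path.join(tempfile.mkdtemp(), "wal.log")
data = {}  # stands in for the scattered data files

def update(key, value):
    # 1. Append the intent to the sequential log and fsync it; only
    #    now may we tell the client the write succeeded. This is the
    #    fast part: one sequential append, not a random write.
    with open(logpath, "a") as log:
        log.write(json.dumps({"k": key, "v": value}) + "\n")
        log.flush()
        os.fsync(log.fileno())
    # 2. The random-access write to the data files can happen whenever;
    #    if we crash before it does, recovery replays the log.
    data[key] = value

def recover():
    # Replay the log from the top; the last record for a key wins.
    recovered = {}
    with open(logpath) as log:
        for line in log:
            rec = json.loads(line)
            recovered[rec["k"]] = rec["v"]
    return recovered

update("score", 10)
update("score", 42)
print(recover())  # {'score': 42}
```

A journalled filesystem under MongoDB's mmapped files is doing roughly this same dance one layer down, which is why it's not obvious the mmap approach should win.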
It's "a" right tool in any case where distributed storage of unstructured data in JSON format is wanted, where database-level locks won't be an issue of concern, and availability is the primary, overriding concern.
What does mongo offer over (possibly sharded) postgres for this use case? Postgres won't hit you with db-level locks and gives you master/slave replication for availability. You can also get great performance if you put the WAL on a ramdisk, which I think is roughly equivalent to how mongodb handles writes.
I'm really not trying to be argumentative here, I'm just trying to understand what mongodb is for.
IIRC, until fairly recently (well after Mongo had launched), "master/slave replication for availability" in Postgres was a bitch to set up, requiring 3rd-party tools + manual failover if the master died. It was a lot easier to get going with Mongo, which is really what matters if you're a 2 person startup just trying to validate an idea.
Strongly disagree here. MongoDB (10gen, at the time) had absolutely insane, irresponsible defaults set in all its drivers until, like, 1.8 (very recent). This is anathema to "we're just trying to validate an idea" especially when "our idea took off and now 6 months later we actually do need to scale."
They've fixed it, like I said, but that whole "we're just using it to validate an idea" thing is a total con. "Nothing is so permanent as a temporary solution."
> What does mongo offer over (possibly sharded) postgres for this use case?
Doesn't really matter for the point I'm making. It's a solution for a given set of constraints. Not the solution, or the very best tippy-top solution in all the kingdom, just a solution.
The point being, I can't think of a use case where this is true; but if you read the article, the author does include what he says is the only reasonable use case for MongoDB.
I'd submit that database-level locks make any claims of availability or distributed storage a little overblown. If a single query can blow you out of the water, you're really not highly available. Although I don't have a lot of experience doing big mongodb personally, so maybe I'm missing something.
That's exactly right imo. Running MongoDB in production, you end up concerned over the performance of each query (as you should be). MongoDB's profiler makes this easier to investigate.
If you hit a db level lock limit, you're probably running a sub-optimal or unindexed query.
That doesn't have anything to do with availability, at least not in the CAP theorem sense as I understand it. What I think you're talking about (being "blown out of the water" is pretty vague, though) is partition tolerance: high-latency requests that are practically indistinguishable from network partition events.
I'm not sure what MongoDB returns (or how its clients react) when there are no available connections because of a lock whose duration exceeds the configured timeout. I'm pretty confident, though, that this sort of thing is covered by basic driver config.
From what I've seen, the data model has a lot of utility as long as you don't need super high concurrent performance. Basically, the same area as where rails is the right tool - we want easy features and rapid development, will worry about scaling later.
Mongo's really great for rapid prototyping. You don't need to worry about updating the schema at the db level, it can store any type of document in any collection without complaining, it's really easy to install and configure, the query language is simple and only takes a couple of minutes to learn, it's pretty fast in most use cases, it's pretty safe in most use cases, and it's easy to create a replica set once your prototype gets usage and starts scaling.
Mongo does everything well up until you reach the level where you need heavy-hitting, at-scale, mission-critical performance and reliability. Most projects out there (99 in 100?) will never reach the level of scale that requires better tools than mongo. And since the rest of it is so easy to use, it makes mongo a great starting point. You can always switch databases later, but mongo gives you the flexibility to concentrate on more important things in the early stages of a project.
> You don't need to worry about updating the schema at the db level
What's your magic non-db level, supposedly-easier-than-updating-a-schema approach to renaming a field common to all existing documents in a collection, eg, rename an "author" field to "writer"?
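For what it's worth, neither side makes this free: in SQL it's one DDL statement (`ALTER TABLE posts RENAME COLUMN author TO writer`), while a Mongo-side rename (`db.coll.updateMany({}, {"$rename": {"author": "writer"}})`) has to touch every document. What that document-by-document rename amounts to, simulated over plain Python dicts (collection contents are made up):

```python
# A "schemaless" collection: the field may simply be absent in some docs.
docs = [
    {"author": "Ada", "title": "Notes"},
    {"author": "Grace", "title": "Logs"},
    {"title": "Anonymous"},
]

# The rename is a full-collection rewrite, one document at a time:
for doc in docs:
    if "author" in doc:
        doc["writer"] = doc.pop("author")

print([d.get("writer") for d in docs])  # ['Ada', 'Grace', None]
```

And if old application code still reads `author`, nothing in the database will tell you; the schema check you skipped at the db level reappears as defensive code in the application.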
I've barely used it, but the json document thing, with a lot of random convenience functions in the query language, seems to lend itself well to rapid development.
For postgres you'd be mapping to a relational schema, and for redis you'd be storing the json yourself as a blob, without any server-side manipulation capabilities (or using redis maps/sets/etc, which are awesome, but aren't as general as json).
I haven't been doing very much web dev the last few years though so it's possible that my first impressions are wrong. I'm just repeating what I've been told, basically.
Postgres has had a native json type and the latest version improves upon the functions given to manipulate json data. So data that may not map well to a relational schema can just be put in the json object type and handled accordingly. Also, because it's attached to an SQL engine, you can use things like views on your json data if it makes sense for the type of data you're querying.
There has been a considerable amount of work put into postgres over the past few years to get it to handle your data regardless of what it looks like. The developers seem to have a very good grasp of the fact that not all data is alike, and providing tools that work well with, and across, all your data leads to a lot fewer headaches in the long run.
Really like this article. I try not to dump on MongoDB too much because frankly I have never taken the time to understand its internals. I constrain my criticisms to particular unnecessary failures/inadequacies that I personally have experienced (or any "I'll just use mongodb so I don't have to worry about my data" sentiment).
I like this too. There's very little to criticize, since it basically tells it exactly like it is without too much embellishment, and makes intelligent, honest conclusions.
Article is spot on about MongoDB being ideal for online games. We use it as the main datastore for our latest game, and it has worked out very well for us. My main gripes with it have been keys taking up too much space and how difficult it is to shard. I think RethinkDB will be even better once it matures.
I think the genius referred to is in its simplicity. This simplicity let MongoDB get a product to market very quickly, as well as the inherent goodness of making simple things.
I think in MongoDB's case, the getting-to-market part pushed a little too hard on the make-it-simple part. Simple is good, but a thing should be as simple as possible, and no simpler.
Its apparent simplicity to _developers_ was certainly a good marketing tool, but it did come with tradeoffs; some of its limitations imply a considerable amount of developer and operational complexity if you actually want to use it.
>But in that case, it also wouldn’t be crazy to pull a Viaweb and store it on the file system
I've done this before when I was doing work for a client using an existing simple web host with no built-in options for databases. It works well, and the nice part is that there's a simple, obvious way to do any query. The bad part is that anything other than a primary key lookup is slow unless you add a lot of complexity.
I've used MongoDB for various projects and found it nice to use. Lately though, I've found MySQL to be pretty enjoyable too, so honestly, what's all the fuss? It's a database.
Nobody writes about the filesystem like they do the database, and yet they do the same job - store and retrieve data.
The fuss is that a "proper" database does so much more than store and retrieve data.
If you've only ever used Mongo, a filesystem, and/or MySQL, then you've never really used a database. Postgres (and MSSQL, Oracle, etc.) are so much richer; they are so much more than storage systems. I'm not saying no one should ever use Mongo or MySQL, I'm just saying that they are generally far inferior choices for problems bigger than mere storage.
It's high time we started thinking about storage systems and the higher-level functionality of databases separately. We can make different, more informed, and generally better tradeoffs than we're currently making by viewing this broad category of software through such a foggy lens. For instance: take a look at how Datomic utilizes pluggable storage to provide a sensible information model, with raw index access and powerful, pluggable querying.
I would venture that you probably have low standards when it comes to databases. Have you used anything other than MongoDB and MySQL? These are basically two of the worst NoSQL and SQL implementations in existence.
Recommendation: Stop following all the hype and what all the other blind sheep are doing.
Regarding filesystems... yes they do. Tons of information out there about filesystems you just have to look for it. Read up about ZFS, ReFS and that should get you started.
I think more often it's easy to poke fun at _how_ it's used.
When any tool or tech is used globally, before knowing its limitations, problems are likely. Attempting to use MongoDB in all storage or persistence scenarios is no more sensible than using MySQL in all cases.
Yes, there is marketing around this product that must be looked at critically - after taking into account that many newly developed technologies won't solve all the problems older tech has spent decades solving.
> Attempting to use MongoDB in all storage or persistence scenarios is no more sensible than using MySQL in all cases.
Substantially less sensible in many cases. MySQL has its issues (it has a lot of issues), but people have been able to get it to work surprisingly well in roles that it wasn't designed for (albeit sometimes by just building a database on top of it, as with Twitter's thing).
No check constraints. Spotty transaction isolation. Silent data corruption if you happen to make certain kinds of updates while using statement-based replication. No on-line schema updates (is that still true?). Complete inability to execute joins of any size in reasonable time due to the lack of merge or hash join strategies. Corresponding inability to handle subqueries of any complexity. Readers block writers (at table level with MyISAM - and still at row level with InnoDB?).
As well as things like that which are actually ridiculous, there is also the substantial gap in features as compared to real databases. Things like recursive queries, user-defined types, partial indices, etc, are commonplace in the more sophisticated databases. You probably won't need them for a simple web application (or even a complex one!), but they can be very useful when trying to do more complex things, or manage a complex system efficiently.
I believe that InnoDB is an MVCC implementation, so readers blocking writers shouldn't happen. Another thing to add to your list is missing window functions.
I'm not a big MySQL fan at all, but it's still leaps and bounds ahead of mongo technologically.
Having read through it, I rather suspect that that's not a matter of writers blocking readers or the other way round, but instead a case of writers blocking writers - he's writing a lot of data to the table, and it's highly likely that InnoDB has escalated the lock to a table lock - which effectively prevents concurrent writes.
> InnoDB does locking on the row level and runs queries as nonlocking consistent reads by default, in the style of Oracle. The lock information in InnoDB is stored so space-efficiently that lock escalation is not needed: Typically, several users are permitted to lock every row in InnoDB tables, or any random subset of the rows, without causing InnoDB memory exhaustion.
My apologies - you're right. In the general case readers don't block writes, but InnoDB does use share locks on some foreign key interactions and when using the SERIALIZABLE isolation level.
People make fun of MySQL all the time, especially PostgreSQL people :)
Anecdotally, in the more than 10 years I've used MySQL I've never had any issues, whereas with PostgreSQL I've had a few major downtime incidents. It can be very stubborn and arcane. But at least I didn't lose any data.
So, what's a good NoSQL database for e.g. node.js use? The only alternative I know of is CouchDB. (Yes, I should give more parameters about the intended use, but I really don't know any alternatives).
Node.js isn't really a use case that dictates what datastore you should use; it's just a certain way of writing a totally different piece of your app.
In most node.js apps, the best answer is probably a SQL database. Sorry.
If you're working with timelines or other cases where Redis's data models can help you, consider it, though beware that if your data is large, things will get more expensive fast since you're keeping everything in RAM.
HBase, Cassandra, and Riak are all reasonable in similar cases and have their own tradeoffs.
And yes, Couch fits a similar niche as Mongo. You might even be able to use something simpler like BerkeleyDB (quite mature) if you think you want a document store.
RethinkDB may be a nice choice too. It's fairly young but looks like it's going good places.
But your choice should be mostly dependent on what kind of data you're storing and what kind of guarantees and access models you need.
It should not be based on someone on HN telling you "Riak is the best NoSQL database for Node.js" because their idea of what most Node apps need may not be what yours needs.
What's wrong with the Viaweb/Arc/HackerNews/Mailinator approach of just using in-memory datastructures (hashtables, linked lists) and then journaling out changes to the filesystem as records that are read in on startup? It's incredibly simple and blindingly fast as long as you stay on one server, and you can get several thousand QPS of capacity on that one server (vs. like 10 with a Django/Rails + SQL database solution).
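That journaling approach can be sketched in a few lines. This is a minimal illustration with invented file names and record formats, not anyone's actual implementation: state lives in a dict, every mutation is appended to a log, and startup replays the log.

```python
import json
import os
import tempfile

# In-memory state + append-only journal, replayed on startup.
JOURNAL = os.path.join(tempfile.gettempdir(), "demo_journal.log")
if os.path.exists(JOURNAL):
    os.remove(JOURNAL)  # start the demo from a clean slate

def replay(path):
    """Rebuild the in-memory state by replaying the journal."""
    state = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                if rec["op"] == "set":
                    state[rec["key"]] = rec["value"]
                elif rec["op"] == "del":
                    state.pop(rec["key"], None)
    return state

def journal_set(state, path, key, value):
    # Append to disk before mutating memory, so a crash loses at
    # most the in-flight operation.
    with open(path, "a") as f:
        f.write(json.dumps({"op": "set", "key": key, "value": value}) + "\n")
    state[key] = value

state = replay(JOURNAL)
journal_set(state, JOURNAL, "user:1", {"name": "alice"})
os.remove(JOURNAL)  # clean up the demo file
```

Everything after this is refinement: fsync policy, periodic snapshots so replay doesn't grow unbounded, and log compaction.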
Another highly underrated solution is using MySQL/Postgres as a key-value store. Just create one table for each entity type, with the primary key as the key and a JSON or protobuf blob as the value. You're using completely battle-tested solutions, you've got bindings in basically every language, you're doing basically the same work (at the same speed) as your NoSQL solutions, but you have a lot more flexibility to add additional indices and can rely more on pre-existing functionality than with a MongoDB or CouchDB solution.
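The table-per-entity key-value pattern is tiny in practice. A sketch using SQLite as a stand-in for MySQL/Postgres (schema and names invented; the real engines would use their own upsert syntax):

```python
import json
import sqlite3

# One table per entity type: primary key as the key, JSON blob as the value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, body TEXT NOT NULL)")

def put_user(user_id, doc):
    # SQLite upsert; MySQL would use ON DUPLICATE KEY UPDATE,
    # Postgres ON CONFLICT ... DO UPDATE.
    conn.execute("INSERT OR REPLACE INTO users (id, body) VALUES (?, ?)",
                 (user_id, json.dumps(doc)))

def get_user(user_id):
    row = conn.execute("SELECT body FROM users WHERE id = ?",
                       (user_id,)).fetchone()
    return json.loads(row[0]) if row else None

put_user("u42", {"name": "alice", "plan": "pro"})
```

When you later need a secondary index, you add a column and backfill it, instead of migrating to a different datastore.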
That's caused by using closures to create dynamically generated "callbacks" on the server, not by keeping data structures in RAM. If you ask for some old item not in memory, it just gets lazily loaded.
I suspect he doesn't have to rely on closures to do pagination: they're a programming convenience that means you don't have to do things like think about what state persists between pages.
Anything you can do with SQL you can do with in-memory data structures. If you're interested, I'll be happy to take any SQL query and convert it to some Python list comprehensions on arrays of dicts.
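Taking that claim at face value, here is one such translation (table and column names invented): a filter-and-sort and a join, each as a comprehension over dicts.

```python
# SELECT name FROM users WHERE age > 30 ORDER BY name;
users = [
    {"name": "carol", "age": 41},
    {"name": "bob", "age": 25},
    {"name": "alice", "age": 35},
]
result = sorted(u["name"] for u in users if u["age"] > 30)

# SELECT u.name, o.total FROM users u JOIN orders o ON o.user = u.name;
orders = [{"user": "alice", "total": 10}, {"user": "carol", "total": 7}]
joined = [(u["name"], o["total"])
          for u in users
          for o in orders
          if o["user"] == u["name"]]
```

The catch, of course, is that these are all full scans; the database gives you indexes and a planner for free.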
Pretty rarely, at least in Python. I don't miss MultiSet, because Python has that (collections.Counter). Ditto LinkedHashMap (collections.OrderedDict). Those are the two "extended" collections that I most often use. I do miss the absence of balanced binary trees occasionally, since sometimes it's useful to have an associative container with a defined iteration order, but sorted(dict) is usually good enough where performance is critical. And Python's heapq module is a bit harder to use than Java's PriorityQueues, but all the functionality is there.
I think I'd miss these a bit more in Go, because the built-in datatypes are privileged in some of the language statements, but I haven't written enough Go code to really feel their absence.
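The stdlib stand-ins mentioned above, in miniature:

```python
import heapq
from collections import Counter, OrderedDict

# Multiset / MultiSet equivalent
counts = Counter("mississippi")

# Insertion-ordered map / LinkedHashMap equivalent
ordered = OrderedDict([("a", 1), ("b", 2)])

# Priority queue via the heapq module
heap = [5, 1, 4]
heapq.heapify(heap)
smallest = heapq.heappop(heap)

# sorted(dict) as the stand-in for a balanced tree's ordered iteration
keys_in_order = sorted({"b": 2, "a": 1})
```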
Because it's much cleaner and more powerful if the thing that generates the next page is a closure rather than just an index. Among other things it lets you show each user a different set of items (depending on whether they have showdead turned on for example).
> It's incredibly simple and blindingly fast as long as you stay on one server and you can get several thousand QPS of capacity on that one server (vs. like 10 with a Django/Rails + SQL database solution).
Wait, what? Even if vertical scaling was a good idea, scaling is far from the only reason you should have more than one server for anything serious.
Isn't this thread about non-serious use? Pretty much everything I see here is about how MongoDB is only suitable for prototypes, how it doesn't even guarantee writes, how they just want something quick & dirty to build a MVP with. The parent poster asked for something to replace MongoDB with - if the use-case is prototypes and "web scale" startups that don't have users or a product yet, I think a single server with in-memory data structures is a perfectly adequate starting point.
If you do get to the point where you need some redundancy (and don't yet need to scale horizontally), you can proxy all writes to a second server running the same codebase, have it update its in-memory data structures in the background, and hot-swap it over if the master dies.
I would say that a MVP should be written in such a way that you don't have to waste time rewriting from scratch once the concept is validated. If you're writing a MVP to be disposable you aren't necessarily launching it on a real server with persistent storage anyway, more likely Heroku or at the very least AWS, but in either case you're well equipped to do the right thing from the outset rather than being forced to totally rewrite your app to enable a real architecture later on.
You will have to rewrite anyway. Multiple times. If you pick a RDBMS you will have to rewrite it to scale, if you pick MongoDB you will have to rewrite it for reliability, if you pick Heroku or AppEngine you will have to rewrite it to avoid paying them a good chunk of your profits.
That's probably the biggest surprise I learned from working in a fast-growing, well-functioning engineering organization. The half-life of code in a market that's actively growing and changing is roughly 1 year, i.e. 50% of the code you write now will have been removed within a year from now. And attempts to optimize for problems you're going to have in a year, rather than the ones you have now, actively make things worse because you inevitably have a different product direction in a year, and baking in last year's speculative assumptions just means there's more code you have to work around.
If it takes you a year from persisting serialized data on the hard drive of your one server to using a real data store, you're fucked either way. As low as that half-life might be, that's no reason to make deliberately short-sighted engineering decisions to make it even worse, especially when all the quick and easy ways of shipping an MVP effectively preclude that strategy. You're gonna go through all the effort of shipping your MVP to a real server but you're not going to go through the effort of setting up a database? Are you kidding me? Setting up Heroku with shared Postgres is not only much quicker to ship to, but it gives you a software and data architecture that you can much more easily improve in the future.
You understand that Hacker News uses precisely this persistence strategy (in-memory data structures with persistent state written to the filesystem on the hard disk of the server), and has been going on 6 years now?
You also understand that most of the advice easily accessible on the Internet comes from people trying to sell you something, and so they have a vested interest in you adding many layers into your software stack that you don't need?
If you work in an actual engineering organization that has a clue what they're doing, mmap() is your best friend, and the more layers you can cut out of the stack, the better off you are.
If anything it's this fantasy of vertical scaling that's perpetuated by "people trying to sell you something". If you're going to go with "Hacker News does it, therefore it's okay", I guess that means it's sensible for any web app to use a Lisp dialect of their own invention implemented on top of Scheme, for URL's to be generated pseudorandomly and time out, and so forth.
> What's wrong with the Viaweb/Arc/HackerNews/Mailinator approach of just using in-memory datastructures (hashtables, linked lists) and then journaling out changes to the filesystem as records that are read in on startup?
That works for some things. However, it's no more a foolproof magical solution than MySQL or MongoDB or Cassandra or Oracle or... It just has different tradeoffs (non-primary key queries will tend to be a problem, you'll have to make your own replication, sharding will be a problem, etc etc).
Well, that's of course true. All engineering systems face trade-offs.
The nice thing about doing the dead simple solutions first is that they give you time to focus on the things all startups have to do (getting users, building product) and to defer the things that very few startups have the luxury of needing to deal with (scaling, fault tolerance, reporting, alternative views of data).
Throughout the lifetime of my first startup, I was obsessed with the question of "What are we going to do when we need to scale?" It failed because it had a daily userbase measured in the dozens. Then I went to Google to learn how to scale things. And it turned out the biggest lesson I learned at Google was not how to scale things (though I did learn that too), but that you shouldn't scale things, not until you need to. Because the process of designing for scale slows you down significantly, and makes it much harder to develop a system that's usable and performs well under small workloads. Google products take forever to launch, because they have to scale to millions of users from day 1. As a result, their product decisions are very often questionable in early versions. Most startups don't have the luxury of Google's brand name and billions in cash to tide them over that learning process, and need to hit the ground running.
Focus on the problems you have, not the problems you hope to have in the future.
Leaving aside the issue of scalability (generally, by the time you find out that you need to scale up, it's already almost too late if you haven't been keeping the ability to scale up in the back of your mind all along), there are other reasons that you don't necessarily want to commit to a solution that makes it difficult to use more than one machine; availability is the obvious one.
I think most people who have never worked for a very fast-growing company grossly underestimate the number of rewrites that it will require anyway. You are not committing to a solution that makes it difficult to use more than one machine; you are trying to get to the point where you need to replace that architecture. Pretty much all your other architectural choices will be bad ones at that point anyways.
Mongo (and other document stores) let you do pretty complex querying within JSON objects. Postgres unfortunately doesn't let you do so; that's really the only thing keeping me with Mongo at the moment.
> So, what's a good NoSQL database for e.g. node.js use?
Well, this is the thing; 'NoSQL' is really a pretty unhelpful term. It tends to just mean "not relational", and covers a vast number of things.
So, for instance, you might be okay with having to have your data set fit in RAM (with MongoDB you'll suffer if it doesn't, anyway), and not care too much about availability. In that case, Redis might be good. Or maybe you care deeply about availability; in that case, one of the Dynamo paper databases might be good, if you're willing to put in the work dealing with the consistency issues. Or...
I could go on for a bit. 'NoSQL' is verging on a meaningless term.
Well, the obvious answer to your question is: CouchDB! It's a brilliant, underrated database, and hey, it backs NPM!
On the other hand, Redis, Cassandra, Riak, and many more are also excellent NoSQL databases. But none of them, including CouchDB, are excellent at everything. What are you planning on making? You can write a lot of different things in node.js. If you're writing, say, a blogging engine you probably should look into flat files, or maybe Postgres, and forget the NoSQL kool-aid. :)
I'm a huge fan of NoSQL in general, and document databases in particular. However, I think this is nuts. :) NoSQL is about making tradeoffs; you give up some of the strengths of a traditional RDBMS but in return you get some unique advantages. The problem is, you aren't actually using any of those advantages with a blog...are you?
I really don't see how MongoDB beats Postgres for running a basic blog. And while it doesn't prove anything, I note that Ghost (which has been getting a lot of press as a new, shiny, node.js based blogging platform) is backed by SQLite of all things. Why is it obvious that they should have used a document database instead? What advantages do you think that would have given them? Because of the top of my head I can't think of one.
1. If you are building a multilingual site, you can store content for multiple languages in a single document instead of futzing around with {lang, content} tables
2. If you want to do custom forms/content, it is trivial in a document database instead of relying on key/attribute tables.
3. Just store your theme in a single document, which can include various html templates, css, etc. To export or import a theme is also easy - just stuff the whole document into the db.
4. If you want to add plugins to enhance the capability of your blog/cms, they can have their own nested document inside their target document. Everything is contained.
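A minimal sketch of point 1, with invented field names: all translations nest inside one document, so fetching a post in any language is a single lookup rather than a join against a {lang, content} table.

```python
# One document holds every translation of the post.
post = {
    "_id": "welcome",
    "title": {"en": "Welcome", "de": "Willkommen", "fr": "Bienvenue"},
    "body": {"en": "Hello!", "de": "Hallo!", "fr": "Bonjour !"},
}

def render(doc, lang, fallback="en"):
    # Fall back to a default language when a translation is missing.
    title = doc["title"].get(lang) or doc["title"][fallback]
    body = doc["body"].get(lang) or doc["body"][fallback]
    return title, body
```

The tradeoff is the usual document-store one: queries like "list every post missing a German translation" get clumsier than they would be against a normalized table.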
Couch is a great choice. So is Redis. I would go with Redis if you can afford the cost of RAM and you know you'll always be looking stuff up by key. Otherwise I'd go with Couch, since the queries are just map/reduce functions and the API is just REST.
I'm unsure why you say there are no other alternatives. Surely the choice of database should preference your data and what you want to do with it over the language you'll use? There are node libraries for any database I've ever wanted to use (SQL and NoSQL alike)
I meant to say 'no NoSQL alternatives'. And I noticed that while there are node libraries for most databases, the ease of use differs for each one.
You know, if you like it with node, ignore the haters and use it anyway. It's not some atrocious mess. I love Mongoose on top of it and use it all the time.
A lot of these downsides are fixed by TokuMX: real transactions, document-level locking, compression, and disk-optimized indexes. I suggest everyone take a look at it.
I agree. TokuMX has filled many holes in Mongo and (at least in my experience so far) performs very well. It's got great documentation and is backed by a brilliant team. As a drop-in replacement for the mongo binaries, it's really easy to install, and it offers professional/enterprise support if you need it.
I feel like the discussion of MongoDB is a bit like the discussion around the Affordable Care Act (aka "Obamacare"). The conversation always shifts between whether the very idea of a noSQL db is a good one, to the question of Mdb's implementation faults and (I guess?) its strengths.
Whether 10gen are vapid spin-meisters or not, even whether they have developed a usable product, seems orthogonal to the question as to whether a schemaless persistent storage layer might be a better fit for some projects than a relational database.
One thing I always do is try to scope a database to the problem. This site[1] has been a valuable resource for me and a colleague who were evaluating the best way to store/access our data depending on what we need back.
Can someone comment on how mature rethinkdb is at the moment?
I'm considering moving away from MongoDB before I have to implement what seems to be an incredibly complicated architecture to get it to scale to the level of tens/hundreds of millions of documents.
RethinkDB is not yet "ready for production use" but reportedly will be soon.
If you're currently on MongoDB but need more performance, concurrency, or compression, please try TokuMX: http://www.tokutek.com/products/tokumx-for-mongodb It's a drop-in replacement server that uses a better storage engine but speaks the same protocol and query language.