MongoDB really sucks at count. It's a problem I've run into repeatedly with games that automatically submit high scores (i.e. without a name field/submit button) via an API call of ours that saves the score and then returns the rank by counting how many scores are higher or lower than it... on games that get millions of plays.
My solution ultimately ended up being to put the leaderboards doing that particular thing into a special "truncate" list; a periodic job deletes everything past each board's 50,000th score.
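Roughly, in mongo shell terms (collection and field names here are made up, not our actual schema):

    // rank = number of scores strictly greater, plus one;
    // on a big collection, this count() is the expensive part
    var rank = db.scores.find({lid: boardId, score: {$gt: newScore}}).count() + 1;

    // periodic truncation: find the 50,000th-best score and
    // delete everything ranked below it
    var c = db.scores.find({lid: boardId}).sort({score: -1}).skip(50000).limit(1);
    if (c.hasNext()) {
      db.scores.remove({lid: boardId, score: {$lt: c.next().score}});
    }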
I run/built mogade.com, which does the same thing. I blogged about how I switched to Redis for this specific feature (it made the top of HN, weren't you paying attention that day!?). Maybe you'll find it useful; a simple test went from over 5 minutes to 147ms:
For what it's worth, I had used MySQL on a table with ~20 million rows, and doing a count on an index would also be extremely slow. I put in triggers to maintain the counts, which worked perfectly for 99% of the sites, but there was some issue (either a bug or programmer error, never figured out which) where a handful of rows in this count table would claim there were a negative number of rows for a given query. (Any row with a negative number typically corresponded to a count of zero in the real table. In the cases where it was not zero, it was very close to it.)
Currently painted myself into a corner with this and trying to find a workaround that I'm happy with. Using Mongoid as my ORM and subclassing models seemed like a great idea, that is, until I had 20M instances of these models. It's currently making my dashboard functionality useless for the user and slowing the system down considerably whenever it's requested.
Understandably, this is the cost of using things like MongoDB. Until now I've been very happy with the benefits and tradeoffs, but the problems with count() on reasonably sized collections were a particularly nasty surprise.
Author of the article here. The count caching solved the problem for most practical production purposes (CRITICAL ALLITERATION), but it's still uglier than I'd like. However, if the use case is pagination, then "faking it" once you reach a certain size works well, and is unlikely to be noticed by the user. If your counts are on relatively static pieces of your dataset, then keeping counts manually somewhere (via $inc) will get you that performance back, but obviously at the cost of having to maintain it all manually. Incremental map/reduce isn't a bad idea, but again, depends on the complexity of what's being counted.
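For the $inc route, the sketch is something like this (names invented; every write path has to remember to do it, which is the maintenance cost I mentioned):

    // whenever a comment is added to a post, bump its cached count
    db.posts.update({_id: postId}, {$inc: {comment_count: 1}});
    // and decrement on removal
    db.posts.update({_id: postId}, {$inc: {comment_count: -1}});

    // reading the "count" is then a single indexed document fetch
    db.posts.findOne({_id: postId}).comment_count;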
All in all, it's something of a sore spot in an otherwise very enjoyable development experience.
Count can be done with an incremental map-reduce. CouchDB does this automatically, but you could easily (read: with some difficulty) replicate this with any stack.
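In Mongo that would look roughly like this (the query filter and lastRun bookkeeping are assumptions on my part, but out: {reduce: ...} is the incremental piece):

    var map = function() { emit(this.bucket, 1); };
    var reduce = function(key, values) { return Array.sum(values); };

    // only process documents created since the last run, and fold
    // the results into the existing "counts" collection instead of
    // overwriting it
    db.events.mapReduce(map, reduce, {
      query: {created_at: {$gt: lastRun}},
      out: {reduce: "counts"}
    });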
This would be much easier to do if Mongo had triggers.
Yup, but the slow part of count() is the value comparison over n documents, more so than the actual iteration. So while the point holds true, it's less of a practical concern.
I had the same problem with a dataset of 1M+ records in Mongo and really slow pagination with will_paginate. I have since swapped it out for the 'kaminari' gem (https://github.com/amatsuda/kaminari). This should solve your problem and is almost as simple as will_paginate to drop in. FYI, I'm using Mongoid and not MongoMapper.
Kaminari should suffer from the same issues; it's still invoking count() on the scope (at least, it appears to, from the source), which is going to trigger the problem. If you're counting the whole collection, it's going to be very fast, since mongo will just return the number of records in the collection, but if you're viewing a subset of the collection, it's still going to iterate over each record and do a value comparison.
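To make the distinction concrete (names invented):

    // fast: answered from collection metadata, no scan
    db.comments.count();

    // slow: even with an index on user_id, mongo walks every
    // matching entry and compares values one by one
    db.comments.find({user_id: someUserId}).count();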
You can combine limit() with count() by passing true to the count function. So if you only care about counts up to, say, 10,000, you can do db.collection.find({criteria: '123'}).limit(10000).count(true) and you will only ever get back a max of 10000. This puts an upper bound on waiting for some large counts to come back.
I forked a version of Kaminari a few months ago to solve this problem for me that I just updated with the latest upstream: https://github.com/joshcutler/kaminari
I added a class method for Mongoid::Document models called limit_max_count_results_to that limits the size of Kaminari's total_count.
By the way, if you do .limit() / .offset() pagination, by the time you get to really high offsets, queries will start going slow again. Mongo has to scan up to the document at the offset you're asking for.
If your pagination interface is simply "Next", then why not skip the count altogether, fetch limit(N+1) documents, and offer the "Next" button if the result set is N+1 large?
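Something like this, as a sketch (the sort field, page variable, and page size are assumptions):

    var N = 25;
    var docs = db.posts.find({author: authorId})
                       .sort({created_at: -1})
                       .skip(page * N)   // still O(offset), per the comment above
                       .limit(N + 1)
                       .toArray();
    var hasNextPage = docs.length > N;   // show "Next" only if true
    var pageDocs = docs.slice(0, N);     // render these N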
One of the things that I think puts people off CouchDB is the comparative complexity of the query system, what with having to map/reduce and all. SQL queries are incredibly flexible by comparison, not to mention that SQL is how many developers think.
But after studying data structures, I realized that most other databases have a lot of time complexity that they're basically sweeping under the rug, whereas Couch's relatively simple query model can guarantee roughly O(log(M) + N), with M the number of rows in the view and N the number of rows returned.
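As a sketch, the CouchDB version of an incrementally maintained count (the view layout is illustrative; "_count" is one of CouchDB's built-in reduces):

    // map function: emit one row per document, keyed by whatever
    // you want to count on
    function (doc) {
      emit(doc.user_id, null);
    }
    // reduce: "_count"
    // CouchDB stores partial reductions in the view B-tree's inner
    // nodes, so asking for the count of a key (or key range) is a
    // tree descent plus a few node reads, not a scan.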
This is only true of the MyISAM backend (which has poor concurrency and more complex table repair). If you use InnoDB, count(*) will be a full table scan, just as in PostgreSQL (as this is a direct tradeoff against those other properties).
Right... but PostgreSQL does that as well, so it really isn't relevant to a PostgreSQL vs. MySQL comment, which is what I was responding to. ;P PostgreSQL and InnoDB have the exact same constraints and tradeoffs with respect to count() performance, AFAIK.
(That said, an index doesn't make this magically faster... the fast performance that people credit MyISAM with is that asking for the total number of rows in a table was O(1), as its lack of intricate transaction concurrency allowed it to just keep a global counter; one which of course was difficult and costly to repair after an unclean outage. In comparison, that indexes can optimize a count() over a where is obvious and true of any sane database engine, whether it be PostgreSQL or any of the common MySQL backends.)
There is a big difference between PostgreSQL and InnoDB when you have an indexed COUNT(*) query.
MySQL can do index-only queries, whereas PG currently cannot (it is part of the next release, IIRC). This can often mean a fraction of the disk I/O for MySQL.
Interesting; I did not realize that InnoDB supported covering indexes. That said, it is still "only" a constant factor: the difference between O(n) (looking at just the index) and O(2n) (scanning the index, but having to also pull that original record, which you now know the exact position of, to see if it is alive for the current transaction).
Thanks for pointing this out, though! I'm going to go look into how InnoDB manages to handle that.
Actually, I'm seeing that MySQL uses any index available to do count(*), even if there isn't a WHERE clause. For example, I just ran the following queries on a WordPress database:
[1] select count(object_id) from wp_term_relationships force index(PRIMARY);
[2] select count(*) from wp_term_relationships;
[3] select count(*) from wp_posts;
... all of which produce execution plans that show indexes are being used to perform the count() query. All tables are using InnoDB.
/rant/
You know what I think is "a big fat pile of suck"? This guy and his hack blog post. Congrats, you figured out what O(n) means. You know, this stuff is not magic. You can't magically say, hey database, give me everything I want and do it for me immediately. Oh, and make it open source, aka free. Oh, and on and on and on.
This stuff is hard. Nothing comes for free. I don't even use mongodb in any production project and am really feeling for them. All these pseudo functional developers think they know how things should work and then get all indignant when stuff doesn't work the way they want it to or think it should.
In this season of Thanks, let us give thanks to the real engineers who make these databases (and other software) and then give them out to everyone for free. Go send a polite email to their mailing lists and say "Dear OpenSource Developer, Thank you for all your hard work."
Re. this specific problem - cache your data and stop crying about it or simply don't present counts to users.
Hi, big fat pile of hack-posting author here. I know it's not magic. I'm a huge fan of Mongo. I blog about the issues I have with it, because the issues I don't have with it are far less interesting. The solution wasn't "AMG, MONGO SUX", it was "here's how I band-aided the problem".
The impetus for this particular post was that it was somewhat surprising that counts were still doing full scans, even when the query conditions were on an indexed field. My expectation was that the index might carry some information about how big a given part of the index tree is (given two bounds), but it turns out that it doesn't, so you end up inspecting each matching document, row by row. This isn't normally a problem, but the comparison is relatively slow, so the whole thing starts to drag faster than you'd normally expect.
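You can watch this happen with explain() (nothing here is exotic; the shape of the query is the point):

    // this count walks the index range, then inspects documents
    // one by one; the index doesn't store subtree sizes, so there's
    // no shortcut for "how many entries between these two bounds"
    db.events.find({created_at: {$gte: start, $lt: end}}).count();

    // explain() on the equivalent find shows the number of scanned
    // entries growing linearly with the size of the range
    db.events.find({created_at: {$gte: start, $lt: end}}).explain();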
When the problem is hidden below several levels of library/ORM candy coating, it's not going to be immediately obvious that this is a problem, especially if the queries against that field are otherwise performant thanks to proper use of indexes. It's non-obvious, and given that someone else found it interesting enough to post here, I'm apparently not the only person in the world to ever have this problem. There are multiple open tickets in the Mongo JIRA system, as well as multiple discussions on this topic on the mailing list. This isn't me whining that 10gen won't fix my software, it's an acknowledgment of the problem and steps towards fixing it.
10gen themselves have acknowledged via the mailing lists that counts are slower than they should be, and it's on the list of things to improve. Until then, we have to make do with workarounds.
I think that more than the content of your post, siculars is responding to the somewhat inflammatory title. As soon as I saw the title I thought, "Someone is going to be complaining about mongoDB having a slow operation, and it is going to be upvoted endlessly because mongoDB is webscale." Upon closer inspection, it became clear that this title is only inflammatory because it is on HN, and also that the title was not designed for HN: on your blog, it is simply a description of what the post is about. On HN, it becomes a honeypot.
@cheald, clearly this isn't your fault, and you don't control the tenor of the discussion of mongoDB. What follows isn't a criticism of you, but more a general comment.
I don't think that criticism of MongoDB should be stifled on HN (it clearly still has huge flaws; the lack of granularity when grabbing the write lock comes to mind), but it seems like it's a little hard to have an intelligent conversation in this context. It might make sense to encourage HN to discuss Mongo in ways that aren't just criticizing a performance detail (no database can do everything for everyone), e.g. comparing it against other databases. When criticizing a performance detail, it might make sense to move that discussion to the MongoDB JIRA, where your voice can actually help influence decision making.