I see this sentiment a lot. It annoys me. NoSQL is not a performance optimization. It's about choosing a representation that is appropriate for your data.
It becomes one, but only when your dataset reaches sizes almost nobody will ever see.
If your business model implies a scale-or-die condition, then I'd say it's fair to design your solution to accommodate NoSQL, but not waste time implementing it in production. It's a horrible waste of time to design something around a Cassandra cluster that will be hit once every second.
And, BTW, if you hit your architecture's performance limit (actually, you should have detailed monitoring telling you how many weeks you have until you hit it), then you should consider what you'll do to solve your particular problem (which may not be what you anticipated initially). You should always invest a little time evaluating your options for different scenarios.
> It's a horrible waste of time to design something around a Cassandra cluster that will be hit once every second.
While I would probably agree that Cassandra was designed for scalability, there are hundreds of other NoSQL databases designed for completely different problems that have nothing to do with scaling. CouchDB, for instance, is really good at synchronizing data between separate disjoint databases (think temporary offline use); a problem some have with databases both big and small.
Every NoSQL database was born out of the need to solve a problem that was not well solved by existing technologies. Some of those problems are related to scaling, but scaling problems are just the tip of what is now available to you. Definitely use SQL when it is the right tool for the job, but don't try to shoehorn it into the wrong job. There are other tools that just might be a better fit, no matter how big your database is or how many people are going to be using it.
It is overkill to implement a full-fledged SQL database and write a bunch of SQL queries if all I need is a persistent key/value store.
I've been playing with Redis lately, and it is really incredibly easy to use, and it maps quite well to the data structures that I already have in my code.
There is certainly use for both types of database, but a lot of the NoSQL products are genuinely good tools (completely ignoring scalability), and make using SQL look like hammering screws for certain applications.
If all you need is a persistent key/value store then why do you need more than one table and one each of SELECT, INSERT, UPDATE and DELETE statements?
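That really is all it takes; here's a minimal sketch of a persistent key/value store over exactly one table and one statement each, using Python's stdlib sqlite3 (the table and function names are mine, invented for illustration):

```python
import sqlite3

# One table, one statement each for read, write, and delete:
# that's the whole "database layer" for a key/value store.
db = sqlite3.connect(":memory:")  # use a file path for real persistence
db.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

def put(key, value):
    # INSERT OR REPLACE covers both the INSERT and UPDATE cases at once
    db.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
               (key, value))
    db.commit()

def get(key):
    row = db.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

def delete(key):
    db.execute("DELETE FROM kv WHERE key = ?", (key,))
    db.commit()

put("session:42", "alice")
print(get("session:42"))  # alice
delete("session:42")
print(get("session:42"))  # None
```

Whether that counts as "overkill" compared to a purpose-built store is exactly the judgment call being debated here.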
I'm all for using the most appropriate tools, but sometimes the path of least resistance is worth considering too. Android bundles SQLite as a basic service, and many (if not most) organisations already have RDBMS servers as part of their infrastructure that are well understood and maintained. In either of those scenarios SQL may well have more power than is required, but it's also already in place and well understood, so on the principle of minimal deviation from known standards, I'd probably still use it.
The path of least resistance for me, BY FAR, was CouchDB.
What I wanted:
* Easy interface to a key/value store.
* Reliability. Once something is stored in the database, I want a reasonably high level of guarantee that it won't be forgotten.
* Trivial replication, so if a server goes down and takes the database with it, I can restore it quickly and have a backup server in the meantime.
* High speed on minimal hardware, so I wouldn't NEED to scale under likely usage scenarios.
* Easy path to scaling if it should become necessary.
Every SQL database I know takes nontrivial effort to shard. CouchDB has built-in master-master replication that takes two seconds to set up.
SQL imposes performance penalties that I simply don't need, and so NoSQL is going to be faster on the same hardware, and require less scaling.
CouchDB makes things so simple and low-maintenance that I don't need to hire people who understand it, because I can keep it maintained in my free time. I haven't needed to do anything with it after I spent a couple days to deploy it, and it's happily replicating to a second CouchDB instance that I can fall back on if needed.
And if I get to the point where I need to scale, well, it will work fine to create more CouchDB instances (my data is read FAR more often than it's written, so simple replication should be workable).
Even with MySQL, I feel like I would NEED an expert to be sure to achieve all of my goals above -- not because I couldn't, but because it would waste too much of my time. With CouchDB I got something that Just Worked in exactly the way I needed it to.
Well, one reason that forces itself on you is that the 'value' part is not a homogeneous, primitive type. It is often a kind of tree or graph, with possibly different branches for each key.
A lot of it really is a matter of what you know. I think SQL is incredibly easy to use. ESPECIALLY if all you are doing is select, insert, update, delete. Many people never use the full power of their RDBMS, and in that sense it may be the case that they don't need an RDBMS.
I recently built a view in postgres that uses windowed aggregate functions (what Oracle would call "analytics") to present a certain summary of statistical data. The beauty of this is that it's declarative, and took me a LOT less time than coding up a manual aggregation over the specific windows and grouping that I needed, the unit testing surface area is a lot smaller, and it's written once and available to any client platform/language/framework/tools that can talk to Postgres.
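The same declarative style can be sketched in a self-contained way with stdlib sqlite3, since SQLite also supports window functions (the sales schema below is invented for illustration; Postgres syntax is essentially the same):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)",
               [("east", 1, 10), ("east", 2, 20), ("east", 3, 30),
                ("west", 1, 5),  ("west", 2, 15)])

# A running total per region: one declarative windowed query instead of
# a hand-written aggregation loop, reusable from any client that can
# talk to the database.
rows = db.execute("""
    SELECT region, day,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running
    FROM sales
    ORDER BY region, day
""").fetchall()

for row in rows:
    print(row)
# ('east', 1, 10)
# ('east', 2, 30)
# ('east', 3, 60)
# ('west', 1, 5)
# ('west', 2, 20)
```

The manual version of this is a loop with per-group accumulators and all the edge cases that come with it; here the database does the windowing.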
Postgres and even moreso Oracle gives you this kind of power out of the box. If you're not using it, you're not getting the full value out of those products.
Let's say my business is real-time analytics: tracking mouse movements on the page.
This is tackled by quite a few startups, so it's not out of the realm of things we HNers do ourselves.
Mouse tracking can generate potentially millions of data points per day for even a small number of users. However, we only store a set amount of data per client.
If, for example, you think that every point should be tracked individually, you can easily run into a case where Postgres will be non-optimal under the read/write load.
Something like Redis, on the other hand, might be optimal.
This is a case where it's not a premature optimization; it's just the right tool for the job.
I once had to track user actions in a pretty active online game - I configured the Apache servers to pipe their logs to a Perl script. The Perl script was parsing the requests, then pushing that data to a buffer (i.e. an array).
Then, when the buffer reached 5000 requests, it would push the data into PostgreSQL using COPY FROM STDIN. When committing the transaction, we also set synchronous_commit = off.
As far as DB tables go, each day a new table was created with a timestamp in its name, then a cron script would take care of tables older than one week, aggregating data and getting rid of junk.
This setup was handling tens of thousands of writes per second without breaking a sweat on a pretty modest server. Of course, it's less than what Redis can do, but then again, I trust PostgreSQL more than I trust Redis.
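The buffer-then-bulk-write pattern described above generalizes beyond Perl and Postgres; here's a sketch of the same idea with stdlib sqlite3 standing in for PostgreSQL (executemany is the closest stdlib stand-in for COPY FROM STDIN; the table, function names, and threshold are illustrative):

```python
import sqlite3

BUFFER_SIZE = 5000  # flush threshold, as in the setup described above

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE actions (user_id INTEGER, action TEXT)")

buffer = []

def log_action(user_id, action):
    # Cheap in-memory append on the hot path...
    buffer.append((user_id, action))
    if len(buffer) >= BUFFER_SIZE:
        flush()

def flush():
    # ...and one bulk write per batch, amortizing transaction overhead
    # (COPY FROM STDIN plays this role in PostgreSQL).
    db.executemany("INSERT INTO actions VALUES (?, ?)", buffer)
    db.commit()
    buffer.clear()

for i in range(12_000):
    log_action(i, "click")
flush()  # drain whatever is left in the buffer

print(db.execute("SELECT COUNT(*) FROM actions").fetchone()[0])  # 12000
```

The trade-off is the usual one: anything still sitting in the buffer is lost on a crash, which is acceptable for analytics-style logging but not for money.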
Out of curiosity, were you doing a high number of reads at the same time? And if so, were you using the same batch-processing method, or were they random reads?
I will readily admit that I am biased. A good deal of my experience has been in large, high-load web systems with a good deal of legacy environments. When it comes to the final end point for my data, I like relational structures. Personally, I find them more adaptable to the unforeseen as far as business insight is concerned, and in the environments I have worked in, the unforeseen occurs daily.
For example, a marketing manager wants to aggregate data set X against Y to see the outcome. The more data I have in structures that support these unforeseen and ad-hoc requirements, the more insight my organization obtains. So, for your situation, I would personally front-cache the data in a NoSQL-type structure and bulk-transfer it into a relational structure at set intervals, to avoid having to write logic in an application layer for each use case that comes through the door. I know there are emerging tools in this space for the NoSQL databases, but I still find analytics and reporting easier to do in the relational world; relational models seem to lend themselves better to discovering links between data sets after the fact.
Why not store the data points in a Stable Bloom Filter and use the filter to automatically tell you the frequency of mouse movements? Could this work well? What does the data for mouse movements look like?
I am curious if your hypothetical question is a reality. :)
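For anyone unfamiliar with the idea, here is a minimal plain Bloom filter sketch in Python (not the stable variant, which additionally decays cells over time so it can handle unbounded streams; the size and hash-count parameters here are arbitrary, and the mouse-coordinate key format is invented):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=4):
        self.size = size
        self.hashes = hashes
        self.bits = [0] * size

    def _positions(self, item):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # No false negatives; false positives possible once the filter fills.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("x=102,y=344")
print("x=102,y=344" in bf)  # True
```

Note a plain Bloom filter only answers set membership, not frequency; answering "how often" would need a counting variant, which is part of why the question above is interesting.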
> If your business model implies a scale-or-die condition, then I'd say it's fair to design your solution to accommodate NoSQL, but not waste time implementing it in production. It's a horrible waste of time to design something around a Cassandra cluster that will be hit once every second.
It sounded to me like he was saying that sometimes it's actually more work to figure out how to stuff your data into a relational db than a non-relational one?
As the paper mentions, the relational model is mathematically more general than the object or column models, but it is very difficult to implement fully and efficiently. Yet Oracle almost did it!
W.E. Wolfengagen, a famous Russian CS professor, likes to note that he doesn't know of any complete relational computation system.
Oops... Sorry. I probably projected my expectations on your answer.
Sure. When your data is not relational (or representing it as relations doesn't help you get at it), you should consider other forms of representing it.
I am very fond of Zope and its underlying ZODB storage (and the connectors that allow it to store documents on mostly anything else), but it's very Python-centric and Python, as much as I love it, is not the perfect solution to every problem.
Agreed, there are things I can do with CouchDB that make me absolutely giddy. And I'm not talking about performance, and I don't even want to think about what the relational model would look like.