How to Efficiently Choose the Right Database for Your Applications (pingcap.com)
80 points by gesaint on Feb 28, 2021 | 93 comments



I kinda disagree with a separate branch for "document database" for Mongo. Mongo is a key-value storage, with a thin wrapper that converts BSON<->JSON, and indices on subfields.

You can achieve exactly the same thing with PostgreSQL tables with two columns (key JSONB PRIMARY KEY, value JSONB), including indices on subfields. With way more other functionality and support options.
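
For example, a minimal sketch of that two-column layout with psycopg2 (connection string, table and field names are made up):

  import psycopg2  # any Postgres driver works; psycopg2 is just an example

  conn = psycopg2.connect("dbname=test")
  cur = conn.cursor()

  # Two-column "document store": both key and value are JSONB.
  cur.execute("""
      CREATE TABLE IF NOT EXISTS docs (
          key   JSONB PRIMARY KEY,
          value JSONB NOT NULL
      );
      -- the GIN index covers containment queries on arbitrary subfields
      CREATE INDEX IF NOT EXISTS docs_value_gin ON docs USING GIN (value);
  """)

  cur.execute(
      "INSERT INTO docs (key, value) VALUES (%s::jsonb, %s::jsonb)"
      " ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value",
      ('"user:42"', '{"name": "Ada", "tags": ["admin"]}'),
  )

  # Find documents containing {"tags": ["admin"]} -- served by the GIN index.
  cur.execute("SELECT value FROM docs WHERE value @> %s::jsonb",
              ('{"tags": ["admin"]}',))
  print(cur.fetchall())
  conn.commit()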


Not really true. MongoDB natively supports sharding, multiple indexes, high availability, arrays, sub documents, array traversal, etc - all able to be accessed in your native language with get/set functionality (or via MQL if you want). While PostgreSQL is a really powerful database, the JSON support is really painful to program against.
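
For example, a rough sketch of that get/set style with PyMongo (connection details and names are made up):

  from pymongo import MongoClient

  db = MongoClient()["app"]  # hypothetical local instance

  # Insert a document and index a subfield -- no schema or mapping layer needed.
  db.users.insert_one({"_id": "user:42", "name": "Ada", "tags": ["admin"]})
  db.users.create_index("tags")

  # Query by subfield / array membership with plain dict syntax.
  print(db.users.find_one({"tags": "admin"}))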


Postgres also supports sharding (and partitioning), with some limitations, by expression and at multiple levels as well. HA obviously exists for Postgres in a number of forms; obviously there are limits, but then CAP is an issue for all forms of distribution regardless of DB type. Postgres supports arrays, JSON arrays, JSONB arrays, indexes on arrays, traversal and looping over arrays, etc. And you get almost all of this via SQL, which you can then map onto your dev language of choice via various libraries. Personally I don't use ORMs if I can help it; I use SQL in my models as it eases performance tuning. I do realise that a lot of devs don't do or understand SQL, but then I find that a case of the dev not knowing their craft well enough; you need to understand not just logic but data as well. Also, more recent versions of Postgres (12+) have proper support for JSON path queries.

Using Postgres with JSON operators isn't that difficult. Of course there are some pitfalls and corner cases due to Postgres's architectural choices, but then you'll get that with just about any DB choice. And if Postgres's JSON operators aren't to your taste, there are JSONPath queries you can use too.
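
A quick sketch of both styles side by side, again via psycopg2 (document and connection string are made up):

  import psycopg2

  conn = psycopg2.connect("dbname=test")
  cur = conn.cursor()

  doc = '{"user": {"name": "Ada", "logins": [1, 2, 3]}}'

  # Operator style: -> descends into the document, ->> returns text.
  cur.execute("SELECT %s::jsonb -> 'user' ->> 'name'", (doc,))
  print(cur.fetchone()[0])   # Ada

  # JSONPath style (Postgres 12+).
  cur.execute("SELECT jsonb_path_query(%s::jsonb, '$.user.logins[*]')", (doc,))
  print(cur.fetchall())      # [(1,), (2,), (3,)]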


This. Postgres as a document db is more capable than mongodb.


> You can achieve exactly the same thing with PostgreSQL tables with two columns (key JSONB PRIMARY KEY, value JSONB), including indices on subfields. With way more other functionality and support options.

PostgreSQL docs > "JSON Functions and Operators" https://www.postgresql.org/docs/current/functions-json.html

MongoDB can do jsonSchema:

> Document Validator: You can use $jsonSchema in a document validator to enforce the specified schema on insert and update operations:

   db.createCollection( <collection>, { validator: { $jsonSchema: <schema> } } )
   db.runCommand( { collMod: <collection>, validator:{ $jsonSchema: <schema> } } )
https://docs.mongodb.com/manual/reference/operator/query/jso...

Looks like there are at least 2 ways to handle JSONschema with Postgres: https://stackoverflow.com/questions/22228525/json-schema-val... ; neither of which are written in e.g. Rust or Go.

Is there a good way to handle JSON-LD (JSON Linked Data) with Postgres yet?

There are probably 10 comparisons of triple stores with rule inference slash reasoning on data ingress and/or egress.


If mongodb is a key-value storage, then so is postgresql, but you make it sound as if it's something to stay away from.


As I also have to manage databases, I take into careful consideration how much work they are to manage.

So, at one end of the spectrum is SQLite (zero administration) and at the opposite is Oracle (a major PITA).

Postgresql/MariaDB lie in the middle.


The author omits Postgresql yet includes MySQL. I don't trust the author's expertise, or motives, about what database to use.


This article is just marketing material for one of their products, TiDB.


Leaving out Postgres and SQLite seems like a pretty efficient way to end up at a bad decision. Turns out the article's title is misleading, as it is actually about HTAP databases.


One of the flowcharts says that TokuDB is an alternative to MySQL.

TokuDB is a MySQL engine, not the actual database...


I believe I can simplify the flow chart for "How to efficiently choose a relational database"

  +----------------+
  |                |
  | Use PostgreSQL |
  |                |
  +----------------+


I have found that for many personal projects, sqlite is more than good enough and the simplification of infrastructure makes it worth it over pg.


SQLite is ubiquitous as an embedded database in client-side apps, especially on iOS and Android. This ubiquity and simplicity make it a viable alternative to PostgreSQL during development, which is why all popular ORM/QueryBuilder frameworks support SQLite, and it fits with your appreciation of it.

SQLite as an embedded server-side database requires extra work and configuration to make it a viable alternative to PostgreSQL. It lacks good write concurrency and recoverability by default. It is, however, continually improving but gaps remain.
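
For what it's worth, much of the write-concurrency pain comes from the default rollback-journal mode; a minimal sketch of the usual mitigation (file name made up):

  import sqlite3

  db = sqlite3.connect("app.db")
  # WAL lets readers proceed while a single writer is active.
  db.execute("PRAGMA journal_mode=WAL")
  # Common pairing with WAL: fewer fsyncs; recent commits can be lost on
  # power failure, but the database file itself stays consistent.
  db.execute("PRAGMA synchronous=NORMAL")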

Since it does not have a wire protocol, SQLite is rarely connected to a data warehouse via ETL so it does not fit well as an alternative to TiDB.


> SQLite as an embedded server-side database requires extra work...

No, it depends on the server application. An all-read or read-mostly server requires nothing special. The same goes for any server that expects a low number of users, or that keeps a database per user.

Mozilla uses it on servers for documentation sites. You can also run your own Firefox sync server using SQLite.


SQLite is surprisingly good. You can even run a small site or intranet from it.

It has the advantage of requiring zero administration. All you need is a file: backups or copies are just a matter of copying the file.
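
One caveat: copying the file while a write is in flight can give you a torn copy, so for a live database the online backup API is the safer version of the same idea (a sketch, file names made up):

  import sqlite3

  src = sqlite3.connect("app.db")
  dst = sqlite3.connect("app-backup.db")
  with dst:
      src.backup(dst)  # consistent copy even if src is being written to
  dst.close()
  src.close()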


Yep, I think too many people fall into the trap of thinking they are Google or Netflix when the reality is their site could probably be run on a potato. I wouldn’t use sqlite for a commercial web app but even a good chunk of those would be just fine on it.


If you have already made your choice, you do not need a flow chart. If you are uncertain, use postgres.


I would phrase it more like:

  How much time and money do you have?
  
    "None" -> Use PostgreSQL
    "A little" -> Pick a database which matches your application use cases
    "A lot" -> Use PostgreSQL
    "FAANG" -> Roll your own
With "A lot" of time and money, you quickly spend way too much time on databases, and the sprawl of the work will eat up time and budget that could be better spent on improving products / your organization. Just get the enterprise standardized on Postgres and move on with life.


I wonder what FB is using now, having abandoned Cassandra in 2010.


For the last decade or so, since I first became acquainted with databases, the HN crowd has said “Use PostgreSQL.” And every few weeks there’s a company blog post mentioning that they use MySQL. Why is this?


MySQL is certainly more popular, but that doesn't necessarily mean it is a better database from a technical standpoint. MySQL is a fantastic piece of technology and it will likely serve most of your needs if you are building websites. If you have used it before and are familiar with it there may not be a great reason to change. You also have to consider that databases are "sticky" products. Once the choice is made it is unlikely to change for the lifetime of the project due to high switching costs so there is not really an incentive to go out of your way to learn about multiple databases.

MySQL was basically the default database for developing web applications from 1998-2013. (It is the "M" in the LAMP stack.) It gained this position by being free to use, reliable and stable. This began a virtuous cycle where more companies catered to MySQL users: deploying and managing MySQL got easier because everyone catered to that audience, which in turn drove more developers to use MySQL.

A lot of PostgreSQL's popularity can be traced to Heroku, where it was the default choice for a database. Heroku made it even easier for developers to deploy their applications to the internet. Instead of having some janky build process you would just type "git push heroku master" and your changes would be live in a matter of minutes. This ease of deployment drove the virtuous cycle for PostgreSQL.

For the web it's unlikely that choosing either database would be a mistake. They both are great options. Going into the technical reasons why I believe PostgreSQL is superior if you don't know what to choose would be a separate post in and of itself, but I've already gone on long enough.


There are some tools that make horizontal sharding much easier, like Vitess which is MySQL only. But most businesses won't come close to needing these kinds of capabilities.

I would say for anyone starting out and worrying about how large a single node can scale, Postgres can run well on a 64/128-core server with 2TB of RAM and 20TB of ZRAID6 storage, plus a chain of read replicas. This can be done out of the box on Postgres without much issue and can get many businesses quite far. But once you get to lots and lots of TB, or you have specific write latency, consistency, or other requirements, you have to evaluate multiple databases against your company's specific usage patterns and data, as no benchmark will give you a good idea.


Vertical scaling and scaling out via replicas is only excusable in a world where sharding isn't a solved problem.

With a bunch of read replicas you have to deal with eventual consistency when you get replication lag as well as the inability to scale writes.

If you shard out horizontally you get write scalability, and read scalability with much less chance of replication delay. You will likely be reading from the master node most of the time. You also gain a bunch of operational flexibility, and smaller failure domains.


Don't you still have replication delay when sharding horizontally?


How does horizontal sharding help with HA/DR concerns?


Multi-master replication was also a MySQL-only functionality that came built in, while on Postgres you had to pick a third-party solution (I think...)? Did that change in recent times?


I have been looking for how to set up multi-master replication, and as far as I know you still need a third-party solution.


> There are some tools that make horizontal sharding much easier, like Vitess which is MySQL only.

I am just curious, how does Vitess do that?


Apart from the other reasons people have mentioned - the headline “we switched from Postgres to $database” (for example MySQL) is so unusual that it warrants a justification. PG is so good that you really need to be a very niche case for there to be objective reasons to do such a switch. And granted, the bigger the company, the more intricate and unique its requirements.

I personally advise people to just start with PG and if they encounter very unique requirements that for some reason PG can’t be tuned to - then do the switch.


Legacy. MySQL used to be very popular in the late-90s and early-2000s as part of a broader MySQL+PHP web stack. But that was before PostgreSQL got good enough (including wrt. performance) for that sort of simple use case. Nowadays, there's really no reason not to go for Postgres.


I also think it was because MySQL didn’t require vacuuming so small hosting providers and resellers didn’t have to deal with support issues around that. I know when I was looking around in the late 90s for a database it was a deciding element for me.


I use MySQL. From what I’ve read Postgres is better.

As a developer I’ve used MySQL a lot, installed it myself in local environment and used it with fairly large tables. It’s very reliable and I’m generally comfortable with it. Once it’s running I don’t give it much thought frankly, so I’m not itching to change.


I think it's inertia. A lot of people have been using MySQL for 10+ years and it's still Good Enough. It takes them an extra 10 minutes to figure out how authentication or permissions work in postgres, and that extra 10 minutes isn't worth it.


Remove the word “relational” from your title and it’s still accurate nearly all the time.

There’s maybe a 1% box down at the bottom, for cases that have reached a hard limit in production, entitled “read the linked article”.


I'm not sure PostgreSQL is the right choice for large analytical workloads. Unless you consider TB-size datasets less than 1% of the cases? https://news.ycombinator.com/item?id=26186955

I'd appreciate if anyone could share their experience with using PostgreSQL for large enough data.


Traditionally, many OLTP operational databases are connected via ETL to an OLAP data warehouse; they are not mutually exclusive. PingCap is the company behind TiDB, an HTAP NewSQL engine that competes with CockroachDB and YugabyteDB; the OP is content marketing for TiDB.

TiDB distinguishes itself with HTAP; transparently incorporating OLTP/ETL/OLAP in a single cluster. You have to specify the ETL layer and data warehouse in addition to PostgreSQL to make an apples to apples comparison; that is the core of HTAP positioning.

SAP HANA is the poster child for HTAP, a data warehouse with good enough OLTP performance to replace Oracle RDBMS; a single system is used for both SAP app tiers, Business Suite and Business Warehouse. The same value proposition applies to cloud apps. Independent OLTP/ETL/OLAP is still robust and is more modular while HTAP is more tightly integrated and simpler to operate.


We tried to deploy HANA with BW in a larger-ish life sciences company (200k employees) and so far it has been a huge waste of money. It's not actually performing well and the support team had to stop replicating some of our most important data because "there's too much".

I'm not convinced HTAP can actually work - the way OLTP and OLAP work internally seems too different.


That’s interesting and makes sense. SAP introduced HANA NLS (nearline storage) based on Sybase IQ to address your use case (I think). HANA HTAP is in-memory while IQ works with shared storage clusters so it’s ideal for offloading massive amounts of historical BW data that doesn’t fit in memory.

In-Memory HANA is freakishly good for running OLAP BW queries against fresh data. Column stores like HANA and IQ are both good at this, but to be honest, I don’t know how BW systems were typically configured before HANA/IQ.


I have to admit I'm not super involved with the system in question, and it's possible the team is underfunded or just not that great.

But we definitely ended up in a situation where it was hard to get to some important data, since the team couldn't SLT it to BW because "it's too many transactions" and trying to get it from Sidecar was blowing up, because it's too much data. And that was already with S/4.

But again, I don't know why it was the case. It would have been great and saved us a lot of work if it worked out and we could've done stuff in HANA directly, instead of copying data to different OLAP system daily. Especially now, when everyone is trying to get on the realtime-train.


A rough figure: 100GB is still fine with PostgreSQL, but even then the problem is not the database (query) but rather your pipeline to get data into it. At this point you'll probably optimize both.


OLAP is kind of a different world - none of the typical Postgres "competition" (like MySQL or Oracle) works really any better in this domain.

There really isn't a very good free one part solution here - so either you pay big bucks for the likes of Google BigQuery or Snowflake so they can become gatekeepers to your own data, or you end up burning a lot of engineering time to get the likes of Hadoop or Spark on K8s or Trino working.


In common usage does OLAP automatically imply data that is too large for a single node to handle? I'm wondering if a lot of OLAP workloads couldn't be handled with some parquet files (or some other column based storage)?

I realize parquet is used by the hadoop/spark ecosystem, but do you really need those systems? I'm thinking that a lot of companies reach for a hadoop cluster when some parquet files in a regular file system would be much simpler. I've done things like this and in my experience it works quite well. But only for personal projects.
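
Concretely, something like this is all it takes to query parquet straight off a plain filesystem (a sketch; the path and columns are made up, and it assumes pandas with the pyarrow engine installed):

  import pandas as pd

  # Scan a directory of parquet files, reading only the columns the query needs.
  df = pd.read_parquet("events/", columns=["user_id", "amount"])
  print(df.groupby("user_id")["amount"].sum())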


No, of course you're right, I'm just bit too comfortable in my own domain.

For smaller data (and here that would mean a few hundred GB, which is pretty huge by normal standards) OLTP databases would cover you pretty well. Oracle has some fancy bitmap indexing, MS SQL even has columnar tables, and pretty much anything has partitioning.

Parquet is really cool - especially since there's something like a 30x ratio between a "generic oracle table" and a gzip-compressed parquet file, so scan times are really in a different world.

But by itself parquet doesn't solve a whole lot - where are the files stored and what scans them? What happens if someone is updating the files while someone else is reading them?


Yeah the storage size was a real clincher for me, as much as I appreciate a SQL database.

Re: concurrent reading and writing, for my use case the files are immutable so that isn't a concern. But I agree, I don't think pure parquet is a good fit there.


Postgres is our go-to, but in particular we have been pretty satisfied lately with AWS's so-called “serverless” Aurora Postgres, which allows utilization to scale down during quiet periods. That's very helpful in a lots-of-microservices context, which can otherwise have a pretty high cost floor from having many Postgres DBs that each have to be sized to accommodate their peak load.

DynamoDB, which I was slow to come around on, is attractive for the same reason, although it only works if the use case fits, obviously.


Even if you use DynamoDB, you still need to remember to have backups. I have seen a recent mishap where devops accidentally dropped the production tables, and restoration required high IOPS in order to restore soon enough. The IOPS were not that high for the usual use case, but when you want to reduce MTTR, you need to increase the IOPS (which means $$$, too).

Eventually your dropping of the database is consistent. So back it up no matter what.
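
Enabling point-in-time recovery is close to a one-liner, so there's little excuse to skip it (a sketch with boto3; the table name is a placeholder):

  import boto3  # assumes AWS credentials are already configured

  dynamodb = boto3.client("dynamodb")

  # Continuous backups: restore to any second in the last 35 days, which also
  # covers the "someone dropped the production table" case.
  dynamodb.update_continuous_backups(
      TableName="production-table",
      PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
  )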


I mean, for straightforward use cases where Postgres would work, CockroachDB isn't a bad choice to get used to early on.


I didn't see Postgres in the article?


Its conspicuous absence alongside so many obscure alternatives is baffling.


My interpretation after reading more is that the article is basically: out of the list of databases we use, how to determine which one to use for any particular task.


Depends, on the projects I work on, it usually goes with either Oracle or SQL Server.

Occasionally PostgreSQL gets used, as kind of staging database for small teams on a department level, with a db link for the big boy database used at corporate level.


I mean no disrespect, but I've never met another developer/organization that was happy using Oracle. I'm genuinely curious as to why someone would prefer it over PostgreSQL if they were familiar with both and had the option to not use it.


IDE development tooling straight out of Oracle without going to hunt for third parties, including proper debugging of stored procedures with everything that I expect from a modern language.

Ada flavoured PL/SQL.

The Java and .NET drivers, with support for advanced stuff like distributed clustered transactions and direct mappings of UDTs into source language types.

Support for nice stuff like OLAP cubes, bare metal databases and APEX provides a nice way to quickly build database frontends.

As mentioned I am familiar with PostgreSQL, and honestly other than a couple of SQL extensions that are easier to use, I don't see much value when I compare everything that is on the box, and usually at the project scale I work on (just yet another cog on the enterprise wheel), license costs aren't the biggest hurdle to care about, there are other pain points where money matters more.


We use it and sell Oracle DB consulting services. Support from Oracle is pretty good. Postgresql also cannot compete with Oracle in HA setups.

Don’t get me wrong, we love Postgresql, and MariaDB, but Oracle is still a great database, with all the features and stability you could possibly want, just at a hefty price.


I am pretty sure at least some of this is not 100% true. Otherwise the "Russian Gmail" would not have migrated 300TB from Oracle to Postgresql https://news.ycombinator.com/item?id=12489055


I think that’s an edge case, most businesses don’t have the scale where the investment in building a Postgresql setup like that is cheaper than just paying Oracle.


I am quite certain that Oracle being a high-profile US company, and the possibility of export restrictions, also played a role.


I don't know for sure but my guess is that Oracle gets sold to managers in large orgs where such decisions are not necessarily made by engineers.


Is there a gold standard for hosted Postgres? I’m impressed with AWS Aurora Postgres but it’s still high enough maintenance that for something simple I’ll go with DynamoDB most of the time.


There are big players such as AWS/GCP who have hosted instances. Smaller companies such as aiven.io offer more dedicated services and are pretty awesome.


+1 for aiven.io, works fantastically well and you don't need a DBA any more.


If you are already on AWS, I would suggest checking out RDS. You pay a slightly higher price, but the time savings are well worth the extra cost.


I would go even further and say:

  +------------+
  | Use a file |
  +------------+
Personally I use JSON over my own async HTTP (server and client).


  import json

  # `database` is whatever dict/list you want to persist
  with open("database.json", "w") as f:
      f.write(json.dumps(database))

:)
- 0 dependencies

- easily inspectable and editable with any text editor or cli via jq

- backup and diff

- language agnostic


This "simple" approach will also lead to all kinds of problems quite quickly once you grow past one concurrent user.

If you really must use JSON, at least use SQLite in place of open().


I actually use one file per value!

I have to create the ext4 filesystem with the 'small' type, otherwise I run out of inodes before disk space!

Here is what it looks like in action: http://root.rupy.se/link/type/task/847068548006606746

The front end: http://talk.binarytask.com


- occasionally accidentally truncates the database

If you're going to do this, at least write to a temporary file, fsync the file, rename into place, and fsync the directory. But I recommend SQLite any time someone is tempted to write these lines.
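
That sequence, as a minimal POSIX-flavored sketch (on Windows the directory fsync doesn't apply):

  import json, os, tempfile

  def write_json_atomically(path, obj):
      dir_name = os.path.dirname(os.path.abspath(path))
      fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
      try:
          with os.fdopen(fd, "w") as f:
              json.dump(obj, f)
              f.flush()
              os.fsync(f.fileno())    # data is on disk before the rename
          os.replace(tmp_path, path)  # atomic swap into place
          dfd = os.open(dir_name, os.O_DIRECTORY)
          try:
              os.fsync(dfd)           # make the rename itself durable
          finally:
              os.close(dfd)
      except BaseException:
          if os.path.exists(tmp_path):
              os.unlink(tmp_path)
          raise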


I tried this approach for scraping a REST API returning json (on a few tens of thousands of ids).

Since it was a long running job with many failures (throttling and connectivity issues), I had to constantly kill and restart the script.

I needed to know what ids to skip over on restart but for some reason listing a folder with a few tens of thousands of files is very slow. So restarting takes forever. I also ran into issues with nonatomic file writes where even though a json file was written, it was incomplete or empty.

I think if I had just inserted them as json strings into sqlite it would've been more robust? I am okay with losing writes (since I will just redownload them), but it was the incomplete writes and long time to reload the set of seen ids on restart that drove me crazy.
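
In hindsight, even the most naive SQLite version would have avoided both problems; something like this sketch (file name and schema made up):

  import json, sqlite3

  db = sqlite3.connect("scrape.db")
  db.execute("CREATE TABLE IF NOT EXISTS results (id TEXT PRIMARY KEY, body TEXT)")

  def already_done(item_id):
      # Fast even with tens of thousands of ids -- no directory listing on restart.
      return db.execute("SELECT 1 FROM results WHERE id = ?",
                        (item_id,)).fetchone() is not None

  def save(item_id, payload):
      # Committed atomically: a killed process leaves the whole row or nothing.
      with db:
          db.execute("INSERT OR REPLACE INTO results (id, body) VALUES (?, ?)",
                     (item_id, json.dumps(payload)))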


MariaDB is fine too


A database that doesn’t allow you to interface with views externally is fatally flawed because it means you cannot use views. This is one of a few reasons why we are moving away from MariaDB: https://jira.mariadb.org/plugins/servlet/mobile#issue/MDEV-1...


Do you run into the same issue with materialised views? Or does materialisation force mariadb to treat the view as a separate table? I'm also assuming that materialised views may not be useful for you due to update / synchronisation issues.


I decided for my application to use MariaDB because I have some MySQL experience and can get it up and running quickly with best practices, and when the app reaches a certain mass we can migrate to PostgreSQL or a solution like it as we find need for the kitchen-sink features. Right now the app is one view and 5 tables, with only one table using a JSON type column.

edit: grammar


username checks out?


Nope, it's a java/javascript thing.


Yeah, I was disappointed when your comment history wasn't just a bunch of Postgres recommendations


This flowchart is difficult to screw up.


I wish I saw something like this during my internship


And what happened to Postgresql, MariaDB, Oracle or graph NoSQL databases?


Future generations will look back at us and wonder - 'Wait, they called it NoSQL? why couldn't they come up with a better name to denote what it does do, instead of what it doesn't do'


“No rules” would have been a better way to describe how people actually use Mongo.

Though, we should remember that Mongo wasn’t the only “NoSQL” database at the time when that term was taking off - there were others competing for mindshare, like Cassandra (dead[0]), HBase (dead), Riak (dead), and CouchDb (dead).

[0] Obviously not actually dead - I’m taking a bit of license here. People are still running these things, and maybe they’re occasionally still the right choice


If you need multiple failover regions at large scale for real time operations, and particularly if you want to run true OSS, what alternatives are there today besides Cassandra?

Most use cases don’t fall into this niche, but they never did. Its earlier popularity was perhaps artificial, as perhaps is the popularity (particularly on HN) of the current wave of NewSQL.


There was a little footnote in my comment :)

> Obviously not actually dead - I’m taking a bit of license here. People are still running these things, and maybe they’re occasionally still the right choice


ScyllaDB, which is Cassandra- and DynamoDB-compatible, is a good alternative.


It doesn't fall under my definition of true* OSS as

1) it's owned by a private company, so the long term direction of the project is privately controlled;

2) it is released under only AGPL, by far the least business-friendly OSS license - I assume specifically to encourage direct licensing

There's nothing wrong with this, but it does constrain the freedoms of people who use the software more than the Apache license, and offers fewer opportunities to influence the project direction than the Apache Software Foundation (for all its many flaws).

Full disclosure: I'm a committer to Apache Cassandra, which is very much not a dead project, though it has been quiet for a while - focusing on not very visible aspects of the database.

* perhaps that's poor phrasing from my original post, or perhaps it is conveniently defined, but OSS isn't a scalar and we lack sufficient labels to express the relative freedoms associated with certain models


Someone from ScyllaDB here (Technical Marketing Manager). Forgive, if you can, the length of my reply.

AGPL was chosen to prevent people from taking the software and making it an -as-a-Service (-aaS) offering without contributing anything back to it. Which, if you look at other open source products, can cause them to wither on the vine as people reap the benefits without having to sustain and enhance the base code.

We now have plenty of folks using our Scylla Open Source product across spaces from cybersecurity to IIoT. No one who is just using Scylla internally really needs to worry about AGPL. Though I do admit that many people are allergic to it for lawyerly reasons. But it's also helped prevent other not-so-fine people from utterly vulching the code.

Scylla Open Source is often used under JanusGraph, which is the open source fork of TitanGraph now supported by the CNCF (folks familiar with the history know what happened to TitanGraph, so yes, your concerns are warranted). We use open source Prometheus and Grafana for our monitoring, rather than proprietary offerings.

We're also taking your first point seriously (long-term direction). We see ourselves as stewards of the software; we don't want to bottleneck or freeze out contributions. For example, open source contributor @Fastio began adding the Redis API into Scylla Open Source! I remember when I learned he was planning on doing it, beginning with a Redis on Seastar implementation called "Pedis." Now it's there in the open source code base. Pretty amazing work, and you have to just thank amazing contributors like that.

https://github.com/scylladb/scylla/blob/master/docs/design-n...

https://github.com/scylladb/scylla/tree/master/redis

Apache Cassandra is also an awesome project, and ScyllaDB definitely owes a lot of our success to the groundbreaking work done there. Anyone working on it gets nothing but big props from me.

We therefore also want to ensure that what we do stays pretty much compatible with Cassandra (CQL v4, murmur3). Like the new Rust driver we wrote as part of our internal hackathon:

https://www.scylladb.com/2021/02/17/scylla-developer-hackath...

While the rivalry with the Cassandra community remains pretty heated in some parts with some parties, you'll get none of that from me. Personally I just hope that end user developers just get better code, better features, better choices.

In 2018, the head-to-head rivalry seemed pretty fierce. But now there are soooo many closed source CQL offerings out there: DataStax, Amazon Keyspaces, Azure CosmosDB, Scylla Enterprise (separate from our open source). There are also other open source offerings like Scylla Open Source and Yugabyte. Of all of those, we hope to show up as the "most open" of the competing offerings.

Also as of 2021 Scylla has broadened who we can please (or, I suppose, be mad at us) by offering other APIs. We support a CQL interface for Cassandra compatibility, a DynamoDB-compatible API, and, still under development, the aforementioned Redis API.

Each of those different NoSQL communities and constituencies bring high expectations for excellence, and their own high standards for what they want from an open source vendor. We definitely take their criticisms to heart.

And yes, our DynamoDB implementation, Alternator, is fully 100% open source. You can totally run your workloads where you want. On premise, on any cloud, or even still on AWS. We take that aspect of open source very seriously. We could have made it simply an enterprise feature. But we opened it up.

I know my title is "Marketing" and some people see that as a license to lie on behalf of a vendor, but I have never been more proud to see the open source commitment and contributions of any company I've worked for to date.

Thanks for the mention and for reading this far. And best wishes to anyone working on hard big data problems these days, regardless of your database-of-choice.


Totally agree. It should have been called “Scalability and availability (even degraded) first, with as much durability and consistency as possible”. After all, that was the motivation behind the “NoSQL” movement, when databases like Oracle just couldn’t scale, even at infinite cost, so things like Amazon went down for Christmas 2006 (?) and Amazon decided to write DynamoDB. A different set of requirements created a different product. And then MongoDB came along, initially to power DoubleClick, after relational had failed there too. Neither one had anything remotely to do with SQL but with the lack of scalability in the backend relational databases - a problem that still plagues single-primary-writer relational databases to this day.

And then other non-relational databases came along, like Couch and Cassandra and Scylla - all focused (again) on things other than “SQL vs something else”.

And now they are all adding transactions and secondary indexes and all that stuff - but starting from a much more solid base of a distributed architecture rather than the antiquated monolithic architectures of the relational leaders. Those same relational leaders (Oracle, SQL Server, MySQL, and even the current leader on the dance card, PostgreSQL) are all multi-million line monoliths which are incredibly hard to distribute, scale, and make available and operate at scale - much less easy to develop and improve.

The name is just so sad


Easy. Most likely you don't need any database.


The last time I had a professional task that didn't involve a database was 2007, and that's because my role was technical marketing. When I got started in the 90s, databases were weird entities you'd see in corporate IT settings but not regular embedded/consumer app development roles. Now they're common in embedded applications and ubiquitous in server applications.


There is this creepy fad of shoehorning databases into all applications, even when they are not needed at all, and architecting the whole app around the database constraints. It's weird.


s/database/data, and you have actual application design spot-on.

Applications are designed around the data constraints. Full stop. Or you are just pretending.


Yes, I agree! My point is that using a "database" for storing your data is often an unnecessary abstraction, and sometimes even pernicious.


Most applications expect their data to be stored somewhere. To be able to give some guarantee of data consistency, correctness and persistence, that somewhere is usually a database, which alleviates the cost of reinventing the wheel and in-housing the development, education and maintenance costs of such a storage layer.



