> For example, Github had 128 million public repositories in 2020. Even with 20 issues per repository it will cross the serial range. Also changing the type of the table is expensive.
I expect the majority of those public repositories are forks of other repositories, and those forks only exist so someone could create pull requests against the main repository. As such, they won't ever have any issues, unless someone makes a mistake.
Beyond that, there are probably a lot of small, toy projects that have no issues at all, or at most a few. Quickly-abandoned projects will suffer the same fate.
I suspect that even though there are certainly some projects with hundreds and thousands of issues, the average across all 128M of those repos is likely pretty small, probably keeping things well under the 2B limit.
Having said that, I agree that using a 4-byte type (well, 31-bit, really) for that table is a ticking time bomb for some orgs, github.com included.
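A quick back-of-envelope check of that intuition, using only the 128M figure from the quote above (everything else here is assumed):

```sql
-- Average issues per repo at which a single 31-bit sequence runs out,
-- if all 128M repositories share it (assumed scenario, not GitHub's schema).
SELECT 2147483647 / 128000000.0 AS avg_issues_per_repo_at_overflow;
-- ≈ 16.8, so an average of ~20 issues per repo would indeed cross the range
```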
The LowCardinality data type does not require the whole set of values to have low cardinality. It benefits when the values have locally low cardinality. For example, if the number of unique values in `repo_name` is a hundred million, but every million consecutive values contain only ten thousand unique ones, it will give a great speed-up.
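For illustration, a minimal ClickHouse-style sketch; the table and columns here are made up, not an actual schema:

```sql
-- LowCardinality dictionary-encodes values per data part, so it pays off
-- whenever values repeat locally, even if the global number of distinct
-- repo_name values is in the hundreds of millions.
CREATE TABLE events_sketch
(
    repo_name  LowCardinality(String),
    event_type LowCardinality(String),
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY (repo_name, created_at);
```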
This is an ad-hoc query. It does a full scan, processing slightly less than a billion rows per second on a single machine, and finishes in a reasonable time with over 7 billion events on GitHub from 2015. While it does not make sense to optimize this table for my particular query, the fact that it works well for arbitrary queries is worth noting.
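For a sense of what such an ad-hoc query can look like, here is a hypothetical one of the same shape; the table and column names are assumptions, not the exact schema:

```sql
-- Hypothetical full-scan aggregation over an events table.
SELECT repo_name, count() AS issue_events
FROM github_events
WHERE event_type = 'IssuesEvent'
GROUP BY repo_name
ORDER BY issue_events DESC
LIMIT 10;
```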
I'm convinced that GitHub's decision to move away from Rails was partly influenced by a significant flaw in ActiveRecord: its lack of support for composite primary keys. The need for something as basic as PRIMARY KEY(repo_id, issue_id) becomes unnecessarily complex within ActiveRecord, forcing developers to use workarounds that involve a unique key alongside a singular primary key column to meet ActiveRecord's requirements—a less than ideal solution.
Moreover, the use of UUIDs as primary keys, while seemingly a workaround, introduces its own set of problems. Despite adopting UUIDs, the necessity for a unique constraint on the (repo_id, issue_id) pair persists to ensure data integrity, but this significantly increases the database size, leading to substantial overhead. This is a major trade-off with potential repercussions on your application's performance and scalability.
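To make the overhead concrete, the two shapes look roughly like this (a sketch with assumed column names, not GitHub's actual schema):

```sql
-- What the data model wants: a natural composite key, one index.
CREATE TABLE issues (
    repo_id  bigint NOT NULL,
    issue_id bigint NOT NULL,
    title    text,
    PRIMARY KEY (repo_id, issue_id)
);

-- The ActiveRecord-friendly workaround: a surrogate key plus a unique
-- constraint, i.e. two indexes where one would have sufficed.
CREATE TABLE issues_workaround (
    id       uuid PRIMARY KEY,
    repo_id  bigint NOT NULL,
    issue_id bigint NOT NULL,
    title    text,
    UNIQUE (repo_id, issue_id)
);
```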
This brings us to a broader architectural concern with Ruby on Rails. Despite its appeal for rapid development cycles, Rails' application-level enforcement of the Model-View-Controller (MVC) pattern, where there is a singular model layer, a singular controller layer, and a singular view layer, is fundamentally flawed. This monolithic approach to MVC will inevitably lead to scalability and maintainability issues as the application grows. The MVC pattern would be more effectively applied within modular or component-based architectures, allowing for better separation of concerns and flexibility. The inherent limitations of Rails, especially in terms of its rigid MVC architecture and database management constraints, are significant barriers for any project beyond the simplest MVPs, and these are critical factors to consider before choosing Rails for more complex applications.
Do you have any sources on GitHub moving away from Rails? This is the first I've heard of it, and my google-fu has returned zero results. Just last year they published a blog post, "Building GitHub with Ruby and Rails" [0], so your remark caught me off guard.
Perhaps too late, but Rails 7.1[1] introduced composite primary key support, and there's been a third-party gem[2] offering the functionality for earlier versions of ActiveRecord.
Whilst I would agree that a monolith can run into scalability issues, I am not sure your characterisation of Rails as such is proportionate.
To say that Rails' architecture is a "significant barrier for any project beyond the simplest MVPs" is rather hyperbolic, and the list of companies running monolithic Rails apps is a testament to that.
On this very topic, I would recommend reading GitLab's own post from 2022 on why they are sticking with a Rails monolith[1].
I can't really comment on GitHub, but Rails supports composite primary keys as of Rails 7.1, the latest released version [1].
About modularity, there are projects like Mongoid which can completely replace ActiveRecord. And there are plugins for the view layer, like "jbuilder" and "haml", and we can bypass the view layer completely by generating/sending data inside controller actions. But fair, I don't know if we can completely replace the view and controller layers.
I know I'm missing your larger point about architecture! I don't have so much to say, but I agree I've definitely worked on some hard-to-maintain systems. I wonder if that's an inevitability of Rails or an inevitability of software systems—though I'm sure there are exceptional codebases out there somewhere!
Do we know for sure if gitlab cloud uses a multi-tenanted database, or a db per user/customer/org? In my experience products that offer both a self hosted and cloud product tend to prefer a database per customer, as this greatly simplifies the shared parts of the codebase, which can use the same queries regardless of the hosting type.
If they use a db per customer, then no one will ever approach those usage limits, and anyone who does would be better suited to a self-hosted solution.
Unless something has substantially changed since I last checked, gitlab.com is essentially self-hosted GitLab Ultimate with a few feature flags to enable some marginally different behaviour. That is, it uses one multi-tenant DB for the whole platform.
> This effectively results in two code paths in many parts of your platform: one for the SaaS version, and one for the self-hosted version. Even if the code is physically the same (i.e. you provide some sort of easy to use wrapper for self-hosted installations), you still need to think about the differences.
The migration has to rewrite the whole table: bigint needs 8 bytes, so you have to make room for that.
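In plain Postgres terms, the naive in-place version is a single statement, which is exactly what makes it deceptive; table and column names below are illustrative:

```sql
-- One statement, but it rewrites the whole table (and rebuilds indexes on
-- the column) while holding an ACCESS EXCLUSIVE lock.
ALTER TABLE issues
    ALTER COLUMN id TYPE bigint;

-- If the column is fed by a sequence created for a serial column, its
-- MAXVALUE may need to be raised as well.
```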
I have done several such primary key migrations on tables with 500M+ records; they took anywhere from 30 to 120 minutes depending on the number of columns and indexes. If you have foreign keys it can take even longer.
Edit: But there is another option which is logical replication. Change the type on your logical replica, then switch over. This way the downtime can be reduced to minutes.
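A rough sketch of that logical-replication path with the built-in facilities (Postgres 10+); names and connection details are placeholders:

```sql
-- On the source database: publish the tables you need to copy.
CREATE PUBLICATION widen_pk FOR TABLE issues;

-- On the logical replica, where the column type has already been changed
-- to bigint: subscribe, let it catch up, then switch the application over.
CREATE SUBSCRIPTION widen_pk_sub
    CONNECTION 'host=old-primary dbname=app user=replicator'
    PUBLICATION widen_pk;
```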
In practice the only option that I’ve seen work for very large teams and very large relational databases is online schema change tools like https://github.com/shayonj/pg-osc and https://github.com/github/gh-ost (the latter developed for GitHub’s monolith). It’s just too difficult to model what migrations will cause problems under load. Using a binlog/shadowtable approach for all migrations mostly obviates the problem.
Rails migrations really fail to be viable at scale. And it's not really because of the migrations in Rails; it's because schema changes in PostgreSQL get very, very expensive. Things have gotten better in PG 15, but it's still not quite there yet.
I've run into an integer PK almost exhausting its range twice now. In one of these cases I used pg-osc to rewrite an 8TB table on disk, without downtime, over a period of a week to fix the issue. In the other case it was about 6TB within 2 days. It's doable, but the amount of planning and time it takes is non-trivial.
So far I have not found any other viable solution either. I keep hearing about logical replication, but I haven't seen a great real-world solution at scale so far.
- decouple migration execution from code deployment: require that all migrations live in totally separate PRs so they can be run well in advance of the code that depends on the db change
- maintain a stateful migration execution system, built on an online schema change tool and supervised by DBAs
- PRs adding migrations still need to be merged and deployed, but only after the actual migration has run.
Large tables take hours, if not days. I attempted a test case on AWS using souped-up io2 disks (the fastest, most expensive disks they have) and a beast of a DB server (r5.12xl, I think), and it became abundantly clear that at a certain scale you won't be doing any kind of in-place table updates like that on the system. Especially if your allowed downtime is a one-hour maintenance window per week...
I did it on an r6.24xlarge RDS instance and the CPU wasn't doing much during the operation. IO peaked at 40k IOPS on EBS with provisioned IOPS; I'm not sure if a local disk would be any faster, but I already know that rewriting the table and creating the indexes are all single-threaded, so there isn't much you could gain.
Once I got the logical replication setup to work I changed 20 tables on the replica and made the switch with 15 minutes of downtime. That saved me a lot of long nights.
You can get an idea of how long this could take by running pg_repack, it's basically doing the same thing: Copying the data and recreating all the indexes.
In JavaScript land, postgres bigints deserialize as strings. Is your application resilient to this? Are your downstream customers ready to handle that sort of schema change?
Depends on the lib. Max safe int size is like 9 quadrillion. You can safely deserialize serial bigints to this without ever worrying about hitting that limit in many domains.
2^53 - 1 (Number.MAX_SAFE_INTEGER), to be precise. If your application involves assigning a unique identifier to every ant on the planet Earth (approx. 10^15 ≈ 2^50), you might need to think about this. Otherwise, I wouldn't worry about it.
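A quick way to see the headroom (plain SQL arithmetic, nothing application-specific):

```sql
-- JS Numbers are exact integers only up to 2^53 - 1; Postgres bigint goes
-- up to 2^63 - 1, so the string-vs-number question appears long before
-- bigint itself overflows.
SELECT (2::numeric ^ 53) - 1 AS js_max_safe_integer,  -- 9007199254740991
       9223372036854775807  AS pg_bigint_max;
```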
FKs, indexes and constraints in general make the process more difficult, but it's still possible. The data migration took some hours in my case, but there was no need for it to be fast.
AFAIK GitLab has tooling to run such tasks after an upgrade, so it works no matter where you are in a version upgrade.