> For example, Github had 128 million public repositories in 2020. Even with 20 issues per repository it will cross the serial range. Also changing the type of the table is expensive.
I expect the majority of those public repositories are forks of other repositories, and those forks only exist so someone could create pull requests against the main repository. As such, they won't ever have any issues, unless someone makes a mistake.
Beyond that, there are probably a lot of small, toy projects that have no issues at all, or at most a few. Quickly-abandoned projects will suffer the same fate.
I suspect that even though there are certainly some projects with hundreds and thousands of issues, the average across all 128M of those repos is likely pretty small, probably keeping things well under the 2B limit.
Having said that, I agree that using a 4-byte type (well, 31-bit, really) for that table is a ticking time bomb for some orgs, github.com included.
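A quick back-of-envelope check of that intuition, using only the 128M figure from the quote above (everything else here is assumed):

```sql
-- Average issues per repo at which a single 31-bit sequence runs out,
-- if all 128M repositories share it (assumed scenario, not GitHub's schema).
SELECT 2147483647 / 128000000.0 AS avg_issues_per_repo_at_overflow;
-- ≈ 16.8, so an average of ~20 issues per repo would indeed cross the range
```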
The LowCardinality data type does not require the whole set of values to have low cardinality. It benefits when the values have locally low cardinality. For example, if the number of unique values in `repo_name` is a hundred million, but every million consecutive values contain only ten thousand unique ones, it will give a great speed-up.
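For illustration, a minimal ClickHouse-style sketch; the table and columns here are made up, not an actual schema:

```sql
-- LowCardinality dictionary-encodes values per data part, so it pays off
-- whenever values repeat locally, even if the global number of distinct
-- repo_name values is in the hundreds of millions.
CREATE TABLE events_sketch
(
    repo_name  LowCardinality(String),
    event_type LowCardinality(String),
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY (repo_name, created_at);
```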
This is an ad-hoc query. It does a full scan, processing slightly less than a billion rows per second on a single machine, and finishes in a reasonable time with over 7 billion events on GitHub from 2015. While it does not make sense to optimize this table for my particular query, the fact that it works well for arbitrary queries is worth noting.
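For a sense of what such an ad-hoc query can look like, here is a hypothetical one of the same shape; the table and column names are assumptions, not the exact schema:

```sql
-- Hypothetical full-scan aggregation over an events table.
SELECT repo_name, count() AS issue_events
FROM github_events
WHERE event_type = 'IssuesEvent'
GROUP BY repo_name
ORDER BY issue_events DESC
LIMIT 10;
```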
I'm convinced that GitHub's decision to move away from Rails was partly influenced by a significant flaw in ActiveRecord: its lack of support for composite primary keys. The need for something as basic as PRIMARY KEY(repo_id, issue_id) becomes unnecessarily complex within ActiveRecord, forcing developers to use workarounds that involve a unique key alongside a singular primary key column to meet ActiveRecord's requirements—a less than ideal solution.
Moreover, the use of UUIDs as primary keys, while seemingly a workaround, introduces its own set of problems. Despite adopting UUIDs, the necessity for a unique constraint on the (repo_id, issue_id) pair persists to ensure data integrity, but this significantly increases the database size, leading to substantial overhead. This is a major trade-off with potential repercussions on your application's performance and scalability.
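To make the overhead concrete, the two shapes look roughly like this (a sketch with assumed column names, not GitHub's actual schema):

```sql
-- What the data model wants: a natural composite key, one index.
CREATE TABLE issues (
    repo_id  bigint NOT NULL,
    issue_id bigint NOT NULL,
    title    text,
    PRIMARY KEY (repo_id, issue_id)
);

-- The ActiveRecord-friendly workaround: a surrogate key plus a unique
-- constraint, i.e. two indexes where one would have sufficed.
CREATE TABLE issues_workaround (
    id       uuid PRIMARY KEY,
    repo_id  bigint NOT NULL,
    issue_id bigint NOT NULL,
    title    text,
    UNIQUE (repo_id, issue_id)
);
```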
This brings us to a broader architectural concern with Ruby on Rails. Despite its appeal for rapid development cycles, Rails' application-level enforcement of the Model-View-Controller (MVC) pattern, where there is a singular model layer, a singular controller layer, and a singular view layer, is fundamentally flawed. This monolithic approach to MVC will inevitably lead to scalability and maintainability issues as the application grows. The MVC pattern would be more effectively applied within modular or component-based architectures, allowing for better separation of concerns and flexibility. The inherent limitations of Rails, especially in terms of its rigid MVC architecture and database management constraints, are significant barriers for any project beyond the simplest MVPs, and these are critical factors to consider before choosing Rails for more complex applications.
Do you have any sources on GitHub moving away from Rails? This is the first I've heard of it, and my google-fu has returned zero results. Just last year they published a blog post, "Building GitHub with Ruby and Rails" [0], so your remark caught me off guard.
Perhaps too late, but Rails 7.1[1] introduced composite primary key support, and there's been a third-party gem[2] offering the functionality for earlier versions of ActiveRecord.
Whilst I would agree that a monolith can run into scalability issues, I am not sure your characterisation of Rails as such is proportionate.
To say that Rails' architecture is a "significant barrier for any project beyond the simplest MVPs" is rather hyperbolic, and the list of companies running monolithic Rails apps is a testament to that.
On this very topic, I would recommend reading GitLab's own post from 2022 on why they are sticking with a Rails monolith[1].
I can't really comment on GitHub, but Rails supports composite primary keys as of Rails 7.1, the latest released version [1].
About modularity, there are projects like Mongoid which can completely replace ActiveRecord. And there are plugins for the view layer, like "jbuilder" and "haml", and we can bypass the view layer completely by generating/sending data inside controller actions. But fair, I don't know if we can completely replace the view and controller layers.
I know I'm missing your larger point about architecture! I don't have so much to say, but I agree I've definitely worked on some hard-to-maintain systems. I wonder if that's an inevitability of Rails or an inevitability of software systems—though I'm sure there are exceptional codebases out there somewhere!
Do we know for sure if gitlab cloud uses a multi-tenanted database, or a db per user/customer/org? In my experience products that offer both a self hosted and cloud product tend to prefer a database per customer, as this greatly simplifies the shared parts of the codebase, which can use the same queries regardless of the hosting type.
If they use a db per customer, then no one will ever approach those usage limits, and anyone who does would be better suited to a self-hosted solution.
Unless something has substantially changed since I last checked, gitlab.com is essentially self-hosted GitLab Ultimate with a few feature flags to enable some marginally different behaviour. That is, it uses one multi-tenant DB for the whole platform.
> This effectively results in two code paths in many parts of your platform: one for the SaaS version, and one for the self-hosted version. Even if the code is physically the same (i.e. you provide some sort of easy to use wrapper for self-hosted installations), you still need to think about the differences.
The migration has to rewrite the whole table: bigint needs 8 bytes, so you have to make room for that.
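In plain Postgres terms, the naive in-place version is a single statement, which is exactly what makes it deceptive; table and column names below are illustrative:

```sql
-- One statement, but it rewrites the whole table (and rebuilds indexes on
-- the column) while holding an ACCESS EXCLUSIVE lock.
ALTER TABLE issues
    ALTER COLUMN id TYPE bigint;

-- If the column is fed by a sequence created for a serial column, its
-- MAXVALUE may need to be raised as well.
```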
I have done several such primary key migrations on tables with 500M+ records; they took anywhere from 30 to 120 minutes depending on the number of columns and indexes. If you have foreign keys it can take even longer.
Edit: But there is another option which is logical replication. Change the type on your logical replica, then switch over. This way the downtime can be reduced to minutes.
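A rough sketch of that logical-replication path with the built-in facilities (Postgres 10+); names and connection details are placeholders:

```sql
-- On the source database: publish the tables you need to copy.
CREATE PUBLICATION widen_pk FOR TABLE issues;

-- On the logical replica, where the column type has already been changed
-- to bigint: subscribe, let it catch up, then switch the application over.
CREATE SUBSCRIPTION widen_pk_sub
    CONNECTION 'host=old-primary dbname=app user=replicator'
    PUBLICATION widen_pk;
```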
In practice the only option that I’ve seen work for very large teams and very large relational databases is online schema change tools like https://github.com/shayonj/pg-osc and https://github.com/github/gh-ost (the latter developed for GitHub’s monolith). It’s just too difficult to model what migrations will cause problems under load. Using a binlog/shadowtable approach for all migrations mostly obviates the problem.
Rails migrations really fail to be viable at scale. And it's not really because of the migrations in Rails; it's because schema changes in PostgreSQL get very, very expensive. Things have gotten better in PG 15, but it's still not quite there yet.
I've run into an integer PK almost exhausting its range twice now. In one of these cases I used pg-osc to rewrite an 8TB table on disk, without downtime, over a period of a week to fix the issue. In the other case it was about 6TB within 2 days. It's doable, but the amount of planning and time it takes is non-trivial.
So far I have not found any other viable solution either. I keep hearing about logical replication, but I haven't seen a great real-world solution at scale so far.
- decouple migration execution from code deployment: require that all migrations live in totally separate PRs so they can be run well in advance of the code that depends on the db change
- maintain a stateful migration execution system, built on an online schema change tool and supervised by DBAs
- PRs adding migrations still need to be merged and deployed, but only after the actual migration has run.
Large tables take hours, if not days. I attempted a test case on AWS using souped-up io2 disks (the fastest, most expensive disks they have) and a beast of a DB server (r5.12xl, I think), and it became abundantly clear that at a certain scale you won't be doing any kind of in-place table updates like that on the system. Especially if your allowed downtime is a one-hour maintenance window per week...
I did it on an r6.24xlarge RDS instance and the CPU wasn't doing much during the operation. IO peaked at 40k IOPS on EBS with provisioned IOPS; I'm not sure if a local disk would be any faster, but I already know that rewriting the table and creating the indexes are all single-threaded, so there isn't much you could gain.
Once I got the logical replication setup to work I changed 20 tables on the replica and made the switch with 15 minutes of downtime. That saved me a lot of long nights.
You can get an idea of how long this could take by running pg_repack, it's basically doing the same thing: Copying the data and recreating all the indexes.
In JavaScript land, postgres bigints deserialize as strings. Is your application resilient to this? Are your downstream customers ready to handle that sort of schema change?
Depends on the lib. Max safe int size is like 9 quadrillion. You can safely deserialize serial bigints to this without ever worrying about hitting that limit in many domains.
2^53 - 1 (Number.MAX_SAFE_INTEGER), to be precise. If your application involves assigning a unique identifier to every ant on the planet Earth (approx. 10^15 ≈ 2^50), you might need to think about this. Otherwise, I wouldn't worry about it.
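A quick way to see the headroom (plain SQL arithmetic, nothing application-specific):

```sql
-- JS Numbers are exact integers only up to 2^53 - 1; Postgres bigint goes
-- up to 2^63 - 1, so the string-vs-number question appears long before
-- bigint itself overflows.
SELECT (2::numeric ^ 53) - 1 AS js_max_safe_integer,  -- 9007199254740991
       9223372036854775807  AS pg_bigint_max;
```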
FKs, indexes and constraints in general make the process more difficult, but it's still possible. The data migration took some hours in my case, but there was no need for it to be fast.
AFAIK GitLab has tooling to run such tasks after an upgrade, so it works no matter where you are in a version upgrade.