What are the challenges of such projects? How many people are usually involved? Does it incur downtime or significant technical challenges for either the infrastructure or the codebase?
Changing the type of the column is no big deal per se, except that on a massive table it's a non-trivial operation. But you also have to change the type in everything that touches it: everywhere it's assigned or copied, everywhere it's sent over the wire and deserialized where assumptions might be made, any tests, and on, and on. And god help you if you've got stuff like int.MaxValue having a special meaning (we didn't in this context, fortunately).
Our hosting environment at that time was a data centre, so we were limited on storage, which complicated matters a bit. Ideally you'd create a copy of the table but with a wider PK column, write to both tables, then migrate your reads, etc., but we couldn't do that because the table was massive and we didn't have enough space. Procuring more drives was possible but sometimes took weeks - not just dragging a slider in your cloud portal. And then of course you'd have to schedule a maintenance window for somebody to plug them in. It was absolutely archaic, especially when you consider this was late 2017/early 2018.
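For anyone unfamiliar, the dual-write approach looks roughly like this in application code - just a sketch assuming node-postgres and made-up orders / orders_v2 table names (orders_v2 being the copy with the BIGINT key), not what we actually ran, since we never had the space for the second table:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Write every new row to both the old table (INT key) and the new one
// (BIGINT key) in a single transaction, so reads can be cut over later.
async function insertOrder(customerId: number, total: string): Promise<number> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    const res = await client.query(
      "INSERT INTO orders (customer_id, total) VALUES ($1, $2) RETURNING id",
      [customerId, total]
    );
    const id: number = res.rows[0].id;
    await client.query(
      "INSERT INTO orders_v2 (id, customer_id, total) VALUES ($1, $2, $3)",
      [id, customerId, total]
    );
    await client.query("COMMIT");
    return id;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```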
You need multiple environments so you can do thorough testing, which we barely had at that point, and because every major system component was impacted, we had to redeploy our entire platform. Also, because it was the PK column that was affected, we couldn't do any kind of staged migration or rollback without the project becoming much more complex and taking a lot longer - time we didn't have, given the rate at which we were consuming 32-bit integer values.
In the end it went off without a hitch, but pushing it live was still a bit of a white knuckle moment.
If you've written your services in JavaScript, going from i32 to i64 means your driver is probably going to return it as a string (or a BigInt, or some custom Decimal type) rather than the IEEE 754 number you were getting before. This means you now need to change your interfaces (both internal and public-facing) to a string or some other safely serializable representation. And if you're going to go through all that trouble, you may as well take the opportunity to just switch to some UUID strategy anyway.
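Concretely, the interface change amounts to this (User and the field names are just illustrative); the JSON.parse line is why you can't keep shipping the raw number once IDs get big:

```typescript
// Before: the driver handed back a plain number, safe while ids stay small.
interface UserV1 {
  id: number;
  name: string;
}

// After: BIGINT ids travel as strings end to end, internal and public-facing.
interface UserV2 {
  id: string;
  name: string;
}

// Why the string representation matters: numbers above 2^53 silently lose
// precision in every JavaScript client that parses your JSON.
const parsed = JSON.parse('{"id": 9007199254740993}');
console.log(parsed.id); // 9007199254740992 - off by one, and no error thrown
```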
The alternative is that you can monkey-patch the database driver to parse the i64 id as an IEEE 754 number anyway and deal with this problem later, when you overflow JavaScript's max safe integer (2^53 - 1) - except when that happens it will manifest in some really wacky ways, rather than the DB just refusing to insert a new row.
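For the record, with node-postgres (assuming that's your driver) the monkey-patch is a one-liner - pg returns int8 as a string by default precisely to avoid this - and the last line shows the kind of wackiness you're signing up for:

```typescript
import { types } from "pg";

// OID 20 is int8/BIGINT. Force it back to a JavaScript number instead of a string.
types.setTypeParser(20, (value) => parseInt(value, 10));

// Fine until ids pass Number.MAX_SAFE_INTEGER (2^53 - 1 = 9007199254740991);
// beyond that, two distinct ids can collapse into the same number.
console.log(parseInt("9007199254740992", 10) === parseInt("9007199254740993", 10)); // true
```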
Maybe you are better off moving to UUIDs then? It seems there are packages to make handling them easier, but you'll still need a tiny hack to map old i32 IDs to some UUID.
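The tiny hack can be as simple as deriving a deterministic UUIDv5 from each legacy integer ID, so the same old ID always maps to the same UUID - a sketch using the uuid package, with a made-up namespace constant:

```typescript
import { v5 as uuidv5 } from "uuid";

// Any fixed UUID works as the namespace; this one is arbitrary but must never change.
const LEGACY_ID_NAMESPACE = "5b671a64-40d5-491e-99b0-da01ff1f3341";

function legacyIdToUuid(id: number): string {
  // Hashing the decimal string of the old id gives a stable mapping.
  return uuidv5(String(id), LEGACY_ID_NAMESPACE);
}

console.log(legacyIdToUuid(42)); // always the same UUID for 42
```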
I remember such a project. Because of our large and aging TypeScript frontend projects, it would have added a couple of weeks to adjust all the affected types. IDs used in many places deep in the code caused thousands of errors from the mismatch, which was a nightmare. I can't remember exactly why it was so tough to go through them all, but we were under intense time pressure.
To speed things up we decided to correct the ID types for the server response, which was key since those were generated from protobuf. But we kept number-typed IDs everywhere else, even though the values would actually be strings. That didn't cause many issues, because there ain't much reason to be doing numeric operations on an ID, except the odd sort function.
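That sort-function caveat, for the record (a toy example, not our code):

```typescript
// Default Array.prototype.sort compares lexicographically, which breaks
// as soon as numeric ids of different lengths are stored as strings.
const ids = ["9", "10", "2"];

ids.sort();                                // ["10", "2", "9"] - wrong order
ids.sort((a, b) => Number(a) - Number(b)); // ["2", "9", "10"] - fine below 2^53
// Once ids can exceed 2^53, compare as BigInt instead:
ids.sort((a, b) => (BigInt(a) < BigInt(b) ? -1 : BigInt(a) > BigInt(b) ? 1 : 0));
```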
I remember the smirk on my face when I suggested it to my colleague, and at the time we knew it was what made sense. It must have been one of the dumbest solutions I've ever thought of, but it allowed us to eventually switch the type to string as we changed code, instead of converting the entire repos at once. Such a JavaScript memory, that one :)
Not the original commenter, but I've read through half a dozen post-mortems about this kind of thing. The answer is: yes. There are challenges, and sometimes downtime and/or breaking changes are inevitable.
For one, if your IDs are approaching the 2^31 signed integer limit, then by definition, you have nearly two billion rows, which is a very big DB table! There are only a handful of systems that can handle any kind of change to that volume of data quickly. Everything you do to it will either need hours of downtime or careful orchestration of incremental/rolling changes. This issue tends to manifest first on the "biggest" and hence most important table in the business, such as "sales entries" or "user comments". It's never some peripheral thing that nobody cares about.
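In practice the "careful orchestration" usually means a batched backfill into a new wider column rather than a single giant ALTER - a sketch assuming Postgres with node-postgres, a hypothetical big_table, and a nullable id_new BIGINT column added ahead of time:

```typescript
import { Pool } from "pg";

const pool = new Pool();
const BATCH_SIZE = 10_000;

// Copy ids into the new BIGINT column a batch at a time, so each statement
// holds locks briefly and replicas can keep up, instead of one multi-hour ALTER.
async function backfillIdNew(): Promise<void> {
  for (;;) {
    const res = await pool.query(
      `UPDATE big_table
          SET id_new = id
        WHERE id IN (
          SELECT id FROM big_table WHERE id_new IS NULL LIMIT $1
        )`,
      [BATCH_SIZE]
    );
    if (!res.rowCount) break;                      // nothing left to copy
    await new Promise((r) => setTimeout(r, 100));  // brief pause between batches
  }
}
```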
Second, if you're using small integer IDs, that decision was probably motivated in part by using those integers as foreign keys and keeping your secondary indexes efficient. GUIDs are "simpler" in some ways but need 4x the storage of a 32-bit integer (assuming you're using a clustered database like MySQL or SQL Server). Even just the change from 32 bits to 64 bits doubles the size of that storage in a lot of places. For 2 billion rows, that's 8 GB more data minimum, but it's almost certainly north of 100 GB across all tables and indexes.
Third, many database engines will refuse to establish foreign key constraints if the types don't match. This can force big-bang changes or very complex duplication of data during the migration phase.
Fourth, this is a breaking change to all of your APIs, both internal and external. Every ORM, REST endpoint, etc. will have to be updated with a new major version. There's a chance that all of your analytics, ETL jobs, etc. will also need to be touched.
> For one, if your IDs are approaching the 2^31 signed integer limit, then by definition, you have nearly two billion rows
Just wanted to nitpick this; this is not actually definitively true. A failed insert in some systems will increment the counter and deleting rows usually does not allow the deleted ID to be re-used (new inserts use the current counter). Of course, that is beside the point: the typical case of a table approaching this limit is a very large table.
It's actually fairly common to see this problem crop up in systems that are using a database table as a queue (which is a bad idea for many reasons, but people still do it) in which case the number of live rows in the table can be fairly small.