Immutable Data (2015)

kpmah · on June 26, 2023

Author here, 8 years on.

Although the advantages are real, I can't say I have had much opportunity to implement schemas like this. The extra complexity is usually what gets in the way, and it can add difficulty to migrations.

I think it would be useful in certain scenarios, for specific parts of an application. Usually where the history is relevant to the user. I think using it more generally could be helped by some theoretical tooling for common patterns and data migrations.

lichtenberger · on June 26, 2023

You can use Datomic for instance (mentioned already in your article IIRC!?) or SirixDB[1] on sich I'm working in my spare time.

The idea is an indexed append-only log-structure and to use a functional tree structure (sharing unchanged nodes between revisions) plus a novel algorithm to balance incremental and full dumps of database pages using a sliding window instead.

[1] https://sirix.io | https://github.com/sirixdb/sirix

benjaminwootton · on June 26, 2023

I agree this is more of an application level concern than a database thing. If you need to maintain a history for the user requirement then you will naturally land on a scheme like this.

We also have help from other quarters nowadays.

Databases often provide a time travel feature where we can query AS OF a certain date.

Some people went down the whole event sourcing / CQRS / Kafka route where there is an immutable audit log of updates.

Data warehousing has moved on such that we can implement “slowly changing data” there.

All in all, complicating our application logic, migrations and GDPR in order to maintain history in line of business applications might not be worthwhile.

refset · on June 26, 2023

> Databases often provide a time travel feature where we can query AS OF a certain date.

This is a good overview of SQL:2011 temporal functionality support across the major players: https://illuminatedcomputing.com/posts/2019/08/sql2011-surve...

code_biologist · on June 26, 2023

I'm curious if the author of that blog ever completed their work with temporal Postgres. I know about [1] but unfortunately work on hosted Postgres most often where the extension isn't an option.

[1] https://pgxn.org/dist/temporal_tables/

pil0u · on June 26, 2023

From a user's perspective, I can see a privacy drawback as well.

Suppose that instead of a typical User table, you have a User_Revision table like suggested. Every time a user updates their account settings, you INSERT a new row there. If a user changes their email address, you get a row each time they update it.

Not only the company gets an history of email addresses, but also they are tied to each other. If this information gets leaked, the user is exposed to more vectors of attack.

ludston · on June 26, 2023

GDPR and the right to having your personal data deleted certainly puts a bit of a stopper on using an immutable database for anything personally identifying.

kaveh_h · on June 26, 2023

This would be the case if you’re a purist, but you can be practical and allow physical deletions of certain record just for this reason.

goodbyesf · on June 26, 2023

Pretty much all databases are "immutable". The logs contain every transaction ever made and most institutions save their logs. Also, backups of data are stored on tape and likely shipped somewhere for safekeeping.

roflyear · on June 26, 2023

First, I don't consider that immutable. Second, that isn't true for all databases, often you have to set it up and turn it on, and think about what you're doing as there are considerations for storage and backups. Third, it isn't like you can just get the data back magically and without effort. It can be a problem to hunt through. It can be an enormous amount of data.

ludston · on June 27, 2023

That's not immutability, that's "log shipping". And if an organisation doesn't have the capability to delete personal data from it's backups, and it operates in Europe, chances are that it's in violation of GDPR and could be subject to millions of Euro's in fines.

razzm256 · on June 26, 2023

Forgetting is a separate operation that can be (relatively) easily added whereas updating existing system to work w/ immutable data is an effort on a whole another scale.

kaba0 · on June 26, 2023

Slightly unrelated: A great way to solve this specific email issue is to store hashes for emails only, similarly to passwords. When a user needs a password reset, they should send an email to a specific address, the sender mail is hashed, and if it matches a user, a password reset link is replied. That way a data breach can’t expose user emails.

(This is what Tildes use: https://tildes.net/settings/account_recovery , and I found it very genius)

vbezhenar · on June 26, 2023

While it's better than storing emails in clear-text, nobody uses vvkcbrxrsewf@gmail.com. Brute-forcing emails is going to be pretty effective. And if you have another e-mail database (which are plenty available in the open wild), I bet that success rate will be huge.

I like Apple's approach to generating random throwaway emails.

kaba0 · on June 26, 2023

You don’t have to use the same salt for each email, right? You can hash each email with a separate, random salt and “brute force” upon password recovery. That way while individual emails can still be restored, you can’t easily get a database of emails to mass-spam, which is the point (emails are not really private completely, anyway).

I also like apple’s method, but it is not either-or.

bloak · on June 26, 2023

This is probably worth doing, but one should be aware that a typical e-mail address does not contain enough entropy to not be guessable offline.

Perhaps someone should deliberately leek a bogus database that contains the bcrypt hash of davey.jones.490@well-known-email-provider.com and see if that account starts receiving spam ... probably it won't.

There is (almost certainly) no associated e-mail account, but the following is the SHA-256 hash of firstname.lastname.threedigits, something like "davey.jones.490". I'm half-expecting someone to reply with the solution, though I think there are about 36 bits of entropy in my choice:

b8b6395265714064afb525f37e9969a7dbf09a4a2afa28812e917580be2dc667

misterS · on June 26, 2023

This works great for password resets.

It does not work if you actually want to send emails, for instance notifications.

kaba0 · on June 26, 2023

That’s true, though many websites would be much better off if they would properly evaluate the question of do they actually need to send those?

coldtea · on June 26, 2023

That's an added bonus to the user in most cases!

roenxi · on June 26, 2023

I'll argue that this is bad design. It works as long as the amount of data is small, but even then taking a low-data scenario and building lots of views or triggers just seems a bit weird. When I do that, it works for a month and then falls over in maintenance because you've got a table-based database where none of the important data is in a table?! This is not a design that will be flexible if needs change even slightly.

Immutable design is extremely powerful, but it needs to be a first-class citizen to get full benefit. Clojure's data structures are a great study in this - they squeeze a shocking amount of efficiency out because they have guarantees that the underlying data is immutable (eg, copy-and-slightly-update a large object is effectively a free operation, we have old & new objects available for comparison and that is lovely). Mimicking the same style of programming in, say, Java would gain none of the performance advantages or the logical conveniences. I expect it would be an uncomfortable programmer experience.

Here I think it would be more effective to design the tables as-usual and keep a log of all the changes separately. There is a chance the log will get out of sync with the active tables but frankly if that is a problem go use something designed with immutability in mind and don't twist PostgreSQL into pretzel shapes.

jerrygenser · on June 26, 2023

It's ironic that postgres the underlying storage model for postgres are immutable tables that marks records as "invisible" and asynchronously cleaned up.

roenxi · on June 26, 2023

Yeah. Immutable design is better, what more is there to say?

Unfortunately Postgres doesn't have an API to get at those well designed internals. It implements SQL.

refset · on June 26, 2023

Old but relevant discussion about pulling out the MVCC data from Postgres to mimic Datomic: https://news.ycombinator.com/item?id=4448189

wredue · on June 26, 2023

>what more is there to say?

That it’s not better. Immutability is a tool, not a rule. Deploying immutability unanimously without regard for anything is a great way to create a terrible application.

pharmakom · on June 26, 2023

It’s a mistake to use timestamp for sorting versions instead of revision_id imo

steve_adams_86 · on June 26, 2023

What case would you make for using revision_id?

zambal · on June 26, 2023

I'm not OP, but I can at least give an argument against using a timestamp for sorting versions: for various reasons system time is not guaranteed to be monotonic, which can result in cases where newer versions would be sorted before older versions.

See https://www.erlang.org/doc/apps/erts/time_correction.html#in... for more info. It's about time handling in the Erlang run-time system, but the issue it describes is universal.

And having an Erlang background, I would say timestamps are also tricky in the context of distributed systems.

steve_adams_86 · on June 27, 2023

This is great, thank you. I’d never considered using time stamps like this before which is why I was curious, but I suppose now I can continue not using them slightly more confidently. What is programming really, but figuring out what not to do.

nine_k · on June 26, 2023

E.g. timestamps may (slightly) differ if the updates occur in different transactions, while the revision ID is likely passed around, and is easier to correlate.

This is relatively rare though.

steve_adams_86 · on June 27, 2023

It does seem like it would be rare, but it’s a valid concern. You can arguably gain any insights that time stamps offer by including them as metadata or in an audit log, right?

I find date and time data so difficult to work with that I typically avoid using it for anything important unless absolutely necessary.

whoisthemachine · on June 26, 2023

Evergreen article that neatly lists the problems with time (not just as a version key): https://infiniteundo.com/post/25326999628/falsehoods-program...

pharmakom · on June 26, 2023

Total ordering. Database guarantees monotonicity.

crdrost · on June 26, 2023

(1) You will benefit immensely in the applications built from these sorts of structures by keeping the version log for an entity in a separate table from the current value. First because you can just use old-school triggers, “any update over here triggers inserting a new row into the corresponding versions table,” you don't need to even care about implementing the versioning in application logic. Second because the sorts of queries that would involve looking at previous versions tend to not be the same as the sorts of queries for general data manipulation.

(2) To give an instance of where (1) becomes important, suppose you change some X to X' and then want to change it back. Suppose that after the change some entity was deleted—X foreign keys to a now-deleted value, X' does not. Most applications that try to shove both current state and history into one ubertable disable a bunch of constraint checking and other suchness, and permit this dubious feature of partially-rolling-back into an inconsistent state. But if you just DELETED the row when you said you had, then you would have gotten a foreign-key-error and your user would have copy-pasted you on their “unexpected error occurred” error message and you'd immediately be able to diagnose what foreign key constraint was blocking the undo, rather than mysterious failures several weeks later.

(3) Regardless of your stance on (1), once your application supports deletion, your relational integrity usually suffers because the technically correct value for all of the columns in a deleted-row is to make them all null. This is basically the problem that databases do not have sum types. A sum type in a database is not hard to create once you need it, create a row that has 3 columns which foreign-key to other tables, plus constraints that exactly one of these values is non-null. So the very lightweight construction if you are upset about denormalizing your data is for a Cat in your Cats table in your PetStore database to be a nullable pointer to a CatVersion. So that's how to proceed if you REALLY want to normalize.

(4) All of the above assumes that for every edit to an entity you will save a new row in the versions table, copying all of the other data. The problem is that inevitably some tables get super wide as they have to hold dozens of pieces of business data together, and it's never the ones that you initially expected. There is an easy fix for this as well, it is for versions to also be “mutable.” Whaaaaa??? Yes. Snapshots plus deltas. It's not really mutable because it's append-only.

phrenicfraud · on June 26, 2023

[flagged]

phrenicfraud · on June 26, 2023

[flagged]

smokel · on June 26, 2023

I did not downvote your comment, but the downvotes might be because your message was formulated quite strongly.

Perhaps if you offer a more nuanced view, people may agree more and upvote it as well.

If your stance on mutability is really so strong, then you have to actually persuade people about your views, and that takes quite some effort (mostly from the person to be persuaded, not from you.) Being friendly instead of dismissive might make people more eager to take on another perspective.

Hope this helps a bit for future situations :)

mrkeen · on June 26, 2023

> I don't want things to change, I simply will not change them, not rely on obscure and irrelevant languages

This gives you the worst of both worlds. It requires programmer discipline, and it prevents non-obscure and relevant languages from reaping any of the benefits of immutability.

c_crank · on June 26, 2023

Most relevant and non obscure languages support a const feature of some kind.

mrkeen · on June 26, 2023

It's too little.

There's no language I know of where slapping 'const/final' in front of a field or a function will free you of the burden of considering whether you can safely share it (with the exception of primitives).

If I have a final field of Set<AllowedUsers>, I can't return it, because the caller can insert a new user.

c_crank · on June 26, 2023

Speaking as someone working in the CRUD Java salt mines, I've never had an issue with sharing variables like this that allegedly makes imperative programming fundamentally broken and unproductive. Scoping and any basically functional program layout does 95% of the work for you.

59nadir · on June 26, 2023

I think your comment would've been better received if you maybe phrased it as a question that encouraged some discourse.

"Do people really feel like languages and systems that enforce immutability help them? I'd be interested in any arguments for handling immutability with language restrictions over code guidelines, because I'm not seeing how we need these restrictions on a language level to get the benefits we want."

The above has a clear stance and still encourages conversation, I think. For what it's worth, I mostly agree with you that in the end, I don't really give a shit if we can mutate things. In fact, I think the places where our software falls apart the most (performance on different axes) would benefit from being able to mutate things for sure.

boredumb · on June 26, 2023

Not even sure if this is serious or not. Immutable doesn't mean you can't change data, it means you can't change a variable that has already been assigned in most languages. Everyone here redditors aside can work with mutable state, but I can also work in assembly and choose to work in rust or golang because it allows me to move faster without having to test as many things, same thing with immutable data first, i'm able to move a lot faster without having to worry about mutable state. It's really not that difficult to use immutable data structures and be more efficient than otherwise.