
Rarely do I read something so specific about software development that I agree with 100%.

One thing I would add: soft deletes in relational databases. There's a really good chance that eventually you'll need it for customer support, for debugging, or to fix nasty performance issues caused by cascading deletes.
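The usual shape, as a minimal sketch (PostgreSQL here, with a hypothetical "orders" table), is a nullable deleted_at timestamp that every live-data query filters on:

  -- NULL deleted_at means the row is live
  ALTER TABLE orders ADD COLUMN deleted_at timestamptz;

  -- "deleting" a row
  UPDATE orders SET deleted_at = now() WHERE id = $1;

  -- normal reads have to remember the filter
  SELECT * FROM orders WHERE deleted_at IS NULL;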

For some types of businesses, I wonder if "right to be forgotten" should be designed in from the beginning as well. This can be a problem with hard deletes and with soft deletes. With soft deletes, well, it's hard to figure out how to actually delete things if you've been growing a data model with soft deletes for a couple of years. With hard deletes, well, after you apply some hacks to prevent cascading deletes from killing your performance, now you're in the same place. Maybe worse if your foreign keys are no longer an exhaustive guide to relationships. "Right to be forgotten" will be a nightmare if it hasn't been designed in from the start. Obviously not every kind of business will have to worry about this, but I think the ones that do should consider putting some effort into making sure their design supports it.



I'm far more partial to the "deleted" table approach, which simply moves deleted entries from one table into another.

Slightly more hassle on the recovery side (in the rare event you need it), way less risk on the read side, and it covers the 90% case, where the data is mostly useful for things like debugging. Only rarely in my career have I seen soft-deleted data get restored, but you don't lose that ability this way; you simply mitigate against accidentally showing the data.

All in all, it's a better compromise than soft deletes.
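A minimal sketch of the move in PostgreSQL, assuming a hypothetical "orders" table with an identically-shaped "deleted_orders"; doing it in one statement means there's no window where the row exists in both tables or neither:

  WITH moved AS (
      DELETE FROM orders
      WHERE id = $1
      RETURNING *
  )
  INSERT INTO deleted_orders SELECT * FROM moved;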

EDIT: I'm pretty tempted to just convert this to dumping all soft deletes out to a SQLite 3 database. I wonder if actual DB separation is better somehow for backup / compliance.


> I'm far more partial to the "deleted" table approach, which simply moves deleted entries from one table into another

Or you could use a system-versioned temporal table (or the equivalent in non-MSSQL databases); then the DB does the work for you, and allows recovery of UPDATEs too. Not sure exactly what the performance implications are.
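For reference, a minimal T-SQL sketch (table names hypothetical):

  CREATE TABLE dbo.Orders (
      Id        int PRIMARY KEY,
      Status    nvarchar(20) NOT NULL,
      ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
      ValidTo   datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
      PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
  )
  WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.OrdersHistory));

  -- deleted rows (and prior versions of updated rows) can be read back later:
  SELECT * FROM dbo.Orders FOR SYSTEM_TIME AS OF '2024-01-01';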


Temporal tables can get pretty bloated when the table sees a lot of updates, since they essentially save a copy of the row every time it's modified. You could of course add a reaper system to only keep the last N versions, or only versions less than N days old, etc. Unfortunately, IIRC there's no built-in support to do that automatically.


I've actually not used them in anger, just looked into them at one point; the only reason we didn't go with them was that we didn't want to tie ourselves too closely to SQL Server. I'm assuming they can only work sensibly if you have an ID-generation/usage mechanism that guarantees an ID is never reused, even after records are deleted. That's arguably the most common scenario anyway, but I have worked with at least a few database tables in the past where this wasn't actually the case (IDs were randomly generated, but with a sufficiently small range of possible values that there was a reasonable chance of generating the same ID twice within, say, 6 months of heavy usage).


Our current system uses soft deletes, and I wish we had done a "deleted" table. The biggest downside to soft deletes for us is that it ruins the default indexes, because every query gets a !IsDeleted WHERE clause added to it.


This works better if you split your whole schema into a physical and a logical layer. That's a lot of boilerplate, but it seems like a PAGNI as well, since physical concerns that shouldn't change logical semantics always eventually creep in. It entails at least a view with INSTEAD OF DELETE triggers per table, plus making all indexes filtered indexes.
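Per table, that looks roughly like this T-SQL sketch (all names hypothetical):

  -- physical layer
  CREATE TABLE dbo.OrdersRows (
      Id        int PRIMARY KEY,
      Status    nvarchar(20) NOT NULL,
      IsDeleted bit NOT NULL DEFAULT 0
  );
  GO
  -- logical layer: the thing application code actually talks to
  CREATE VIEW dbo.Orders AS
      SELECT Id, Status FROM dbo.OrdersRows WHERE IsDeleted = 0;
  GO
  -- a DELETE against the view becomes a soft delete underneath
  CREATE TRIGGER dbo.Orders_InsteadOfDelete ON dbo.Orders
  INSTEAD OF DELETE
  AS
      UPDATE r SET IsDeleted = 1
      FROM dbo.OrdersRows r
      JOIN deleted d ON d.Id = r.Id;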


Wouldn’t partition indexes solve this?


Yes, we use filtered indexes to mitigate it. It's a lot of additional boilerplate though.
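Something like this per index, in T-SQL (names hypothetical):

  -- the predicate keeps soft-deleted rows out of the index entirely
  CREATE INDEX IX_Orders_CustomerId
      ON dbo.Orders (CustomerId)
      WHERE IsDeleted = 0;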


Uh, that's a really interesting idea that I've never used before -- like a graveyard for data.

Have you done it before? How did you implement it?


Yes, I have done it, and I've implemented it two ways:

In the first, a temporary "marked for deletion" field (I guess a temporary soft_delete) was used to mark rows for deletion (we had a user-trash type of deal). Anything older than 30 days just got dumped into a new table, which recorded some simple metadata, with the data itself dumped as JSONB. The important thing is we always recorded a way to figure out the customer / user without having to dig into the JSON blob, so if they ever actually left the platform we could still delete their data in full quite easily. This was key. It also preserved the PK system we had in place for stuff like this.

In another, more naive implementation, we just dumped it out into JSON blobs with special "$metadata" fields describing data types, deletion time, and other associated metadata I can't quite recall. If I recall correctly, those were then encrypted and stored elsewhere (I can't recall where exactly, but likely S3 or equivalent). I don't think this was a good idea, but it's what was done.
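A sketch of the first approach in PostgreSQL (all names hypothetical); the point is that customer_id lives outside the JSONB payload, so a full right-to-be-forgotten delete never has to parse the blobs:

  CREATE TABLE deleted_records (
      id           bigserial   PRIMARY KEY,
      customer_id  bigint      NOT NULL,  -- kept outside the blob on purpose
      source_table text        NOT NULL,
      source_pk    bigint      NOT NULL,
      deleted_at   timestamptz NOT NULL DEFAULT now(),
      payload      jsonb       NOT NULL   -- the full original row
  );

  -- reap rows marked for deletion more than 30 days ago
  WITH gone AS (
      DELETE FROM user_files
      WHERE marked_for_deletion_at < now() - interval '30 days'
      RETURNING *
  )
  INSERT INTO deleted_records (customer_id, source_table, source_pk, payload)
  SELECT customer_id, 'user_files', id, to_jsonb(gone) FROM gone;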


There is at least some mismatch between relational databases and soft deletes.

While for some fact tables (time series) it is easy to implement soft deletes, for other dynamic workloads (e.g. when both fact tables and dimension tables are updated) it can become a messy nightmare to keep referential integrity up to date with the dangling soft-deleted records.

Sometimes it will be difficult to decide whether a long-ago soft-deleted record should be kept while the dimension it references is still maintained.



Relatedly, I've worked with systems that would maintain a "deleted" or "archived" table for every normal table, with an identical schema. I nowadays prefer it over a deleted_at field for many of the same reasons as the article.

I also nowadays don't care much for created_at or updated_at; IMO it's preferable to maintain an actual transaction log (which is something I'd add to the YPGNI list, given how invaluable that tends to be for auditing purposes).
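As a sketch, a minimal append-only version of that in PostgreSQL (all names hypothetical):

  CREATE TABLE audit_log (
      id          bigserial   PRIMARY KEY,
      occurred_at timestamptz NOT NULL DEFAULT now(),
      actor_id    bigint,               -- who did it, if known
      table_name  text        NOT NULL,
      row_pk      bigint      NOT NULL,
      action      text        NOT NULL CHECK (action IN ('INSERT', 'UPDATE', 'DELETE')),
      old_row     jsonb,                -- NULL for INSERTs
      new_row     jsonb                 -- NULL for DELETEs
  );

created_at and updated_at then fall out as the first and latest entries for any given row.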


How about using YEGNI for “you’re eventually gonna need it”? Not exactly the same as “probably”, but more pronounceable.


I guess YUGNI (You're Usually Gonna Need It) would be close to the ideal intersection between pronounceability and original meaning.


Thanks for the reminder -- I forgot one of the great benefits of soft deletes that I posted in that conversation. If your customer support can quickly investigate and resolve false reports of "my data disappeared without me deleting it", then they can pass along the rarer, more mysterious cases to engineering, a lot of which will be real bugs. With hard deletes, all the reports look the same to customer support, so nothing gets passed on to engineering (or worse, everything does).


Too late to edit my original comment, but I'm surprised not to see more discussion about "right to be forgotten." It's legally mandated for some types of data in some jurisdictions, and it's hard to implement in a schema that has evolved without it. It doesn't affect the company I work at now, so I haven't had to think much about it, and I was hoping to hear from people who have.


My company doesn't deal with personal information, so ymmv, but the way we deal with soft deletes is deleting the password, setting the name to "Deleted user", and setting the user's e-mail address to deleted-<random_hash>@ourdomain. That's all the personal information; all other information they produced on our platform is the property of the company they work for, so it's not their call, or even ours, whether it should be deleted.

Keeping the user record in the database helps bring complexity down tremendously.
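In SQL terms it's roughly one statement, e.g. this PostgreSQL sketch (column names hypothetical):

  UPDATE users
  SET password_hash = NULL,
      name          = 'Deleted user',
      email         = 'deleted-' || md5(random()::text) || '@ourdomain.example'
  WHERE id = $1;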


Can't agree with you there. I've never seen a situation in which it was necessary to recover "optimistically" soft-deleted data. I have seen multiple situations in which soft-deleted data was accidentally included in live queries.


Not recover, but explain to the users that the data was deleted, and when it was deleted. If you have to provide support to customers, you will get a steady trickle of confused and angry customers wondering why their data disappeared after they (or their coworker, etc.) deleted it. Soft deletes let you answer those inquiries confidently with very little effort. With hard deletes, complaints like that can eat up support time with no satisfactory resolution.

I haven't experienced soft-deleted data being accidentally included, but I think that's because I've worked with systems that included soft deletes from the very start.


Perhaps tombstones would be a better choice, where the data is deleted but a record of when and why is kept instead?


The difference here being that a tombstone doesn't contain enough information to recreate the record should that be necessary. Guess it's a tradeoff whether you want a graveyard you can use digital necromancy on, or simply an indicator of why the data got yeeted.
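A tombstone sketch in PostgreSQL (names hypothetical): one small row per deletion, with no payload to resurrect anything from:

  CREATE TABLE tombstones (
      table_name text        NOT NULL,
      row_pk     bigint      NOT NULL,
      deleted_at timestamptz NOT NULL DEFAULT now(),
      deleted_by bigint,     -- who deleted it, if known
      reason     text,       -- e.g. 'user request', 'retention policy'
      PRIMARY KEY (table_name, row_pk)
  );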


From a customer support PoV, soft deletes have the wonderful property that you can reassure the upset / irate individual on the other end of the phone that their data can indeed be recovered from whatever mishap they just engineered, and do so with a simple query.

Yes, backups can help, but without live backups (WAL shipping or equivalent) you still risk some form of data loss. And if you have those, you just have soft deletes at the system level rather than the in-DB level, plus a vastly more complex system of partial restores.


Soft deletes can be thought of as a combination of versioning and timestamps. By using the Sixth Normal Form (6NF) with timestamps in the key, you get those for free, but this kind of schema may be a bit too complex for many simpler applications.


Can you point to an example please?


Wikipedia has a couple of non-timeseries examples[1].

I guess with timeseries you'd have something along the lines of:

  user | time_range   | status
  =====+==============+=========
     1 | [2018, 2021) | ACTIVE
     2 | [2019, )     | ACTIVE
     1 | [2021, )     | BLOCKED

Where the PK is (user, time_range). (The time_range field is shortened to just the year here for simplicity's sake.)

In PostgreSQL you can use the tstzrange[2] type. You could also only store one of the timestamps, but that would come at the cost of more complex queries.
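As a sketch (names hypothetical), with an exclusion constraint so one user can't have overlapping ranges:

  CREATE EXTENSION IF NOT EXISTS btree_gist;  -- for the user_id equality below

  CREATE TABLE user_status (
      user_id    bigint    NOT NULL,
      time_range tstzrange NOT NULL,
      status     text      NOT NULL,
      PRIMARY KEY (user_id, time_range),
      EXCLUDE USING gist (user_id WITH =, time_range WITH &&)
  );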

[1]: https://en.wikipedia.org/wiki/Sixth_normal_form#Examples

[2]: https://www.postgresql.org/docs/current/rangetypes.html


For compliance reasons, you can always hard delete the data later. It's also not a terrible idea to hard delete data that was soft-deleted more than X amount of time ago.
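e.g. a periodic job running something like this sketch (names hypothetical):

  -- hard-delete graveyard rows older than the retention window
  DELETE FROM deleted_records
  WHERE deleted_at < now() - interval '90 days';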



