
Rarely do I read something so specific about software development that I agree with 100%.

One thing I would add: soft deletes in relational databases. There's a really good chance that eventually you'll need it for customer support, for debugging, or to fix nasty performance issues caused by cascading deletes.
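The usual shape, as a minimal sketch (PostgreSQL here, with a hypothetical "orders" table), is a nullable deleted_at timestamp that every live-data query filters on:

  -- NULL deleted_at means the row is live
  ALTER TABLE orders ADD COLUMN deleted_at timestamptz;

  -- "deleting" a row
  UPDATE orders SET deleted_at = now() WHERE id = $1;

  -- normal reads have to remember the filter
  SELECT * FROM orders WHERE deleted_at IS NULL;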

For some types of businesses, I wonder if "right to be forgotten" should be designed in from the beginning as well. This can be a problem with hard deletes and with soft deletes. With soft deletes, well, it's hard to figure out how to actually delete things if you've been growing a data model with soft deletes for a couple of years. With hard deletes, well, after you apply some hacks to prevent cascading deletes from killing your performance, now you're in the same place. Maybe worse if your foreign keys are no longer an exhaustive guide to relationships. "Right to be forgotten" will be a nightmare if it hasn't been designed in from the start. Obviously not every kind of business will have to worry about this, but I think the ones that do should consider putting some effort into making sure their design supports it.



I'm far more partial to the "deleted" table approach, which simply moves deleted entries from one table into another.

Slightly more hassle on the recovery side (in the rare event you need it), way less risk on the read side, and it covers the 90% case, where the data is mostly useful for things like debugging. Only rarely in my career have I seen soft-deleted data get restored, but you don't lose that ability this way; you simply mitigate against accidentally showing the data.

All in all, it's a better compromise than soft deletes.
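A minimal sketch of the move in PostgreSQL, assuming a hypothetical "orders" table with an identically-shaped "deleted_orders"; doing it in one statement means there's no window where the row exists in both tables or neither:

  WITH moved AS (
      DELETE FROM orders
      WHERE id = $1
      RETURNING *
  )
  INSERT INTO deleted_orders SELECT * FROM moved;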

EDIT: I'm pretty tempted to just convert this to dumping all soft deletes out to a SQLite 3 database. I wonder if actual DB separation is better somehow for backup / compliance.


> I'm far more partial to the "deleted" table approach, which simply moves deleted entries from one table into another

Or you could use a system-versioned temporal table (or the equivalent in non-MSSQL databases); then the DB does the work for you, and allows recovery of UPDATEs too. Not sure exactly what the performance implications are.
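For reference, a minimal T-SQL sketch (table names hypothetical):

  CREATE TABLE dbo.Orders (
      Id        int PRIMARY KEY,
      Status    nvarchar(20) NOT NULL,
      ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
      ValidTo   datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
      PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
  )
  WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.OrdersHistory));

  -- deleted rows (and prior versions of updated rows) can be read back later:
  SELECT * FROM dbo.Orders FOR SYSTEM_TIME AS OF '2024-01-01';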


Temporal tables can get pretty bloated when the table sees a lot of updates, since they essentially save a copy of the row every time it's modified. You could of course add a reaper system to only keep the last N versions, or only versions less than N days old, etc. Unfortunately, IIRC there's no built-in support to do that automatically.


I've actually not used them in anger, just looked into them at one point; the only reason we didn't go with them was that we didn't want to tie ourselves too closely to SQL Server. I'm assuming they can only work sensibly if you have an ID-generation/usage mechanism that guarantees an ID is never reused, even after records are deleted. That's arguably the most common scenario anyway, but I have worked with at least a few database tables in the past where this wasn't actually the case (IDs were randomly generated, but with a sufficiently small range of possible values that there was a reasonable chance of generating the same ID twice within, say, 6 months of heavy usage).


Our current system uses soft deletes, and I wish we had done a "deleted" table. The biggest downside to soft deletes for us is that it ruins the default indexes, because every query gets a !IsDeleted WHERE clause added to it.


This works better if you split your whole schema into a physical and a logical layer. That's a lot of boilerplate, but it seems like a PAGNI as well, since physical concerns that shouldn't change logical semantics always eventually creep in. It entails at least a view with INSTEAD OF DELETE triggers per table, plus making all indexes filtered indexes.
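Per table, that looks roughly like this T-SQL sketch (all names hypothetical):

  -- physical layer
  CREATE TABLE dbo.OrdersRows (
      Id        int PRIMARY KEY,
      Status    nvarchar(20) NOT NULL,
      IsDeleted bit NOT NULL DEFAULT 0
  );
  GO
  -- logical layer: the thing application code actually talks to
  CREATE VIEW dbo.Orders AS
      SELECT Id, Status FROM dbo.OrdersRows WHERE IsDeleted = 0;
  GO
  -- a DELETE against the view becomes a soft delete underneath
  CREATE TRIGGER dbo.Orders_InsteadOfDelete ON dbo.Orders
  INSTEAD OF DELETE
  AS
      UPDATE r SET IsDeleted = 1
      FROM dbo.OrdersRows r
      JOIN deleted d ON d.Id = r.Id;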


Wouldn’t partition indexes solve this?


Yes, we use filtered indexes to mitigate it. It's a lot of additional boilerplate though.
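Something like this per index, in T-SQL (names hypothetical):

  -- the predicate keeps soft-deleted rows out of the index entirely
  CREATE INDEX IX_Orders_CustomerId
      ON dbo.Orders (CustomerId)
      WHERE IsDeleted = 0;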


Uh, that's a really interesting idea that I've never used before -- like a graveyard for data.

Have you done it before? How did you implement it?


Yes, I have done it, and I've implemented it two ways:

In the first, a temporary "marked for deletion" field (I guess a temporary soft_delete) was used to mark rows for deletion (we had a user-trash type of deal). Anything older than 30 days just got dumped into a new table, which recorded some simple metadata, with the data itself dumped as JSONB. The important thing is we always recorded a way to figure out the customer / user without having to dig into the JSON blob, so if they ever actually left the platform we could still delete their data in full quite easily. This was key. It also preserved the PK system we had in place for stuff like this.

In another, more naive implementation, we just dumped it out into JSON blobs with special "$metadata" fields describing data types, deletion time, and other associated metadata I can't quite recall. If I recall correctly, those were then encrypted and stored elsewhere (I can't recall where exactly, but likely S3 or equivalent). I don't think this was a good idea, but it's what was done.
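A sketch of the first approach in PostgreSQL (all names hypothetical); the point is that customer_id lives outside the JSONB payload, so a full right-to-be-forgotten delete never has to parse the blobs:

  CREATE TABLE deleted_records (
      id           bigserial   PRIMARY KEY,
      customer_id  bigint      NOT NULL,  -- kept outside the blob on purpose
      source_table text        NOT NULL,
      source_pk    bigint      NOT NULL,
      deleted_at   timestamptz NOT NULL DEFAULT now(),
      payload      jsonb       NOT NULL   -- the full original row
  );

  -- reap rows marked for deletion more than 30 days ago
  WITH gone AS (
      DELETE FROM user_files
      WHERE marked_for_deletion_at < now() - interval '30 days'
      RETURNING *
  )
  INSERT INTO deleted_records (customer_id, source_table, source_pk, payload)
  SELECT customer_id, 'user_files', id, to_jsonb(gone) FROM gone;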


There is at least some mismatch between relational databases and soft deletes.

While for some fact tables (time series) it is easy to implement soft deletes, for other dynamic workloads (e.g. when both fact tables and dimension tables are updated) it can become a messy nightmare to keep referential integrity up to date with the dangling soft-deleted records.

Sometimes it will be difficult to decide whether a long-ago soft-deleted record should be kept while the dimension it references is still maintained.



Relatedly, I've worked with systems that would maintain a "deleted" or "archived" table for every normal table, with an identical schema. I nowadays prefer it over a deleted_at field for many of the same reasons as the article.

I also nowadays don't care much for created_at or updated_at; IMO it's preferable to maintain an actual transaction log (which is something I'd add to the YPGNI list, given how invaluable that tends to be for auditing purposes).
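As a sketch, a minimal append-only version of that in PostgreSQL (all names hypothetical):

  CREATE TABLE audit_log (
      id          bigserial   PRIMARY KEY,
      occurred_at timestamptz NOT NULL DEFAULT now(),
      actor_id    bigint,               -- who did it, if known
      table_name  text        NOT NULL,
      row_pk      bigint      NOT NULL,
      action      text        NOT NULL CHECK (action IN ('INSERT', 'UPDATE', 'DELETE')),
      old_row     jsonb,                -- NULL for INSERTs
      new_row     jsonb                 -- NULL for DELETEs
  );

created_at and updated_at then fall out as the first and latest entries for any given row.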


How about using YEGNI for “you’re eventually gonna need it”? Not exactly the same as “probably”, but more pronounceable.


I guess YUGNI (You're Usually Gonna Need It) would be close to the ideal intersection between pronounceability and original meaning.


Thanks for the reminder -- I forgot one of the great benefits of soft deletes that I posted in that conversation. If your customer support can quickly investigate and resolve false reports of "my data disappeared without me deleting it", then they can pass along the rarer, more mysterious cases to engineering, a lot of which will be real bugs. With hard deletes, all the reports look the same to customer support, so nothing gets passed on to engineering (or worse, everything does).


Too late to edit my original comment, but I'm surprised not to see more discussion about "right to be forgotten." It's legally mandated for some types of data in some jurisdictions, and it's hard to implement in a schema that has evolved without it. It doesn't affect the company I work at now, so I haven't had to think much about it, and I was hoping to hear from people who have.


My company doesn't deal with personal information, so ymmv, but the way we deal with soft deletes is deleting the password, setting the name to "Deleted user", and setting the user's e-mail address to deleted-<random_hash>@ourdomain. That's all the personal information; all other information they produced on our platform is the property of the company they work for, so it's not their call, or even ours, whether it should be deleted.

Keeping the user record in the database helps bring complexity down tremendously.
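In SQL terms it's roughly one statement, e.g. this PostgreSQL sketch (column names hypothetical):

  UPDATE users
  SET password_hash = NULL,
      name          = 'Deleted user',
      email         = 'deleted-' || md5(random()::text) || '@ourdomain.example'
  WHERE id = $1;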


Can't agree with you there. I've never seen a situation in which it was necessary to recover "optimistically" soft-deleted data. I have seen multiple situations in which soft-deleted data was accidentally included in live queries.


Not recover, but explain to the users that the data was deleted, and when it was deleted. If you have to provide support to customers, you will get a steady trickle of confused and angry customers wondering why their data disappeared after they (or their coworker, etc.) deleted it. Soft deletes let you answer those inquiries confidently with very little effort. With hard deletes, complaints like that can eat up support time with no satisfactory resolution.

I haven't experienced soft-deleted data being accidentally included, but I think that's because I've worked with systems that included soft deletes from the very start.


Perhaps tombstones would be a better choice, where the data is deleted but a record of when and why is kept instead?


The difference here being that a tombstone doesn't contain enough information to recreate the record should that be necessary. Guess it's a tradeoff whether you want a graveyard you can use digital necromancy on, or simply an indicator of why the data got yeeted.
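A tombstone sketch in PostgreSQL (names hypothetical): one small row per deletion, with no payload to resurrect anything from:

  CREATE TABLE tombstones (
      table_name text        NOT NULL,
      row_pk     bigint      NOT NULL,
      deleted_at timestamptz NOT NULL DEFAULT now(),
      deleted_by bigint,     -- who deleted it, if known
      reason     text,       -- e.g. 'user request', 'retention policy'
      PRIMARY KEY (table_name, row_pk)
  );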


From a customer support PoV, soft deletes have the wonderful property that you can reassure the upset / irate individual on the other end of the phone that their data can indeed be recovered from whatever mishap they just engineered, and do so with a simple query.

Yes, backups can help, but without live backups (WAL shipping or equivalent) you still risk some form of data loss. And if you have those, you just have soft deletes at the system level rather than the in-DB level, plus a vastly more complex system of partial restores.


Soft deletes can be thought of as a combination of versioning and timestamps. By using the Sixth Normal Form (6NF) with timestamps in the key, you get those for free, but this kind of schema may be a bit too complex for many simpler applications.


Can you point to an example please?


Wikipedia has a couple of non-timeseries examples[1].

I guess with timeseries you'd have something along the lines of:

  user | time_range   | status
  =====+==============+=========
     1 | [2018, 2021) | ACTIVE
     2 | [2019, )     | ACTIVE
     1 | [2021, )     | BLOCKED

Where the PK is (user, time_range). (The time_range field is shortened to just the year here for simplicity's sake.)

In PostgreSQL you can use the tstzrange[2] type. You could also only store one of the timestamps, but that would come at the cost of more complex queries.
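As a sketch (names hypothetical), with an exclusion constraint so one user can't have overlapping ranges:

  CREATE EXTENSION IF NOT EXISTS btree_gist;  -- for the user_id equality below

  CREATE TABLE user_status (
      user_id    bigint    NOT NULL,
      time_range tstzrange NOT NULL,
      status     text      NOT NULL,
      PRIMARY KEY (user_id, time_range),
      EXCLUDE USING gist (user_id WITH =, time_range WITH &&)
  );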

[1]: https://en.wikipedia.org/wiki/Sixth_normal_form#Examples

[2]: https://www.postgresql.org/docs/current/rangetypes.html


For compliance reasons, you can always hard delete the data later. It's also not a terrible idea to hard delete data that was soft-deleted more than X amount of time ago.
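e.g. a periodic job running something like this sketch (names hypothetical):

  -- hard-delete graveyard rows older than the retention window
  DELETE FROM deleted_records
  WHERE deleted_at < now() - interval '90 days';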



