Soft-deletes in relational databases are almost always a mistake. Most use cases are actually business rules/processes that need to be examined clearly and handled more explicitly and contextually, at least if you ever want to restore anything.
For example, you don't soft-delete a customer row, you "deactivate" a customer, and change logic that looks at customer-related entities to fit that. The "Customers" page that shouldn't show those anymore gets retitled to "Active Customers." Later there will be either a process for reactivating customers or one for using a permanently deactivated customer as a template for a new one.
What goes wrong with the classic soft-delete approach? The one where every table has a boolean is_deleted column and hopefully the deleting code updates the correct subgraph of records at the same time?
It typically rots as you expand your model and add features. It's not clear what significance different tables have, and you can't "restore" things with any reliability or safety. That may be because of relational/index conflicts, or because you don't have enough information about how it was originally deleted to actually "restore" something that is both correct and useful, or because there are actually 3 different kinds of restoration schemes/scopes but the current problem calls for a fourth.
> What goes wrong with the classic soft-delete approach? The one where every table has a boolean is_deleted column and hopefully the deleting code updates the correct subgraph of records at the same time?
> It typically rots as you expand your model and add features. It's not clear what significance different tables have, and you can't "restore" things with any reliability or safety. That may be because of relational/index conflicts, or because you don't have enough information about how it was originally deleted to actually "restore" something that is both correct and useful, or because there are actually 3 different kinds of restoration schemes/scopes but the current problem calls for a fourth.
The long and short of soft deletes is that it breaks my data's correctness guarantees WRT referential integrity: i.e. I 'delete' a record while there are still references to it, and the RDBMS cannot even warn me, much less stop me!
If I wanted broken data, I'd use a NoSQL DB, or flat files even.
Also makes UNIQUE indices much harder to use. You can't just delete and later re-insert the same data, you need to "undelete" it instead, but that can actually mess up the chronology of the data, so then you need a separate table to track dates... and things just get over-engineered quickly...
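To make the re-insert problem concrete, here is a minimal Postgres sketch (a hypothetical users table, not taken from the comment above):

CREATE TABLE users (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email      text NOT NULL UNIQUE,
    deleted_at timestamptz
);

INSERT INTO users (email) VALUES ('a@example.com');
UPDATE users SET deleted_at = now() WHERE email = 'a@example.com';  -- the "soft delete"
INSERT INTO users (email) VALUES ('a@example.com');  -- fails: unique violation, the "deleted" row still holds the email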
> For example, you don't soft-delete a customer row, you "deactivate" a customer
I'm all for naming things more precisely, but functionally speaking, at that point it's really just semantics: you still end up with some DB column acting as a flag, and from then on you need to take it into account in every single query that touches that table in the whole app. Since most modern ORMs know how to handle soft deletes internally, it's far easier to just stick to the defaults and use `deleted_at` if you really need a way to keep the records around. And you often do need it, for the referential integrity of the historical data.
Nah, there's a structural difference between (A) adding a status column to one table and changing your queries and joins to care/not care about it, versus (B) adding is_deleted to every table and layering it in as a bonus WHERE clause everywhere.
In other words, the thing being solved is not suitable for a cookie-cutter table/row-level design. It's a real business-logic quality of your data model that lives in particular places and means slightly different things in those different places.
This is a problem, but IMO the bigger problem is that you create a pit of failure rather than a pit of success: you now have to remember to add "and deleted_at is null" everywhere, and forgetting to do so often doesn't look like a bug until the day some user sees something they thought had been deleted still appear.
Perhaps it's a scale thing, but I get more "oh crap, I just deleted this thing, can you get it back for me?" requests than I do bug reports about soft deleted records showing up incorrectly. Being able to restore records immediately is a big win imo.
Being able to contact a dev to undelete a record from a database does sound like a thing that's hard to scale. If this happens a lot, it could be worth examining if there's a UX problem.
This can be addressed with PG views. You create a view where deleted_at IS NULL and now you query the view instead. Furthermore, if the view is “updatable” (which is the case when it only has filters), you can also do updates and deletes directly against it, and it all just works. So as long as your ORM/query builder can work with views, you are good to go.
I understand many devs want to avoid moving more logic to the database (I also personally avoid stored procedures and similar) but views are quite declarative and simpler to manage than indexes.
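A rough sketch of that setup, assuming a hypothetical customer table with a name and a deleted_at column:

CREATE VIEW active_customer AS
SELECT * FROM customer
WHERE deleted_at IS NULL;

-- a simple single-table view like this is automatically updatable in Postgres,
-- so writes against it are rewritten onto the underlying customer table
UPDATE active_customer SET name = 'New name' WHERE id = 42;
DELETE FROM active_customer WHERE id = 43;  -- still a real DELETE unless you add a rule or INSTEAD OF trigger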
I haven't used row level security, but wouldn't it be easy to set the table up such that only a dedicated recovery role could select rows that have deleted_at set? Then the view wouldn't even be necessary.
too much magic for my taste. I think the more proper thing would be to simply use views, and expose the natural absurdity of the design with one-view-per-table.
Right. I have never tried those but they sound unappealing to me. I would expect edge cases where they don't work and unpredictable performance implications. But as I said, I have no experience so this is just my imagination.
IDK. I think people can overthink the problem a bit.
I've done the is_deleted flag with a deleted_at timestamp that gets cascaded down the table relations. Restoring is an admin job, and the logic is to follow the table relations down again, restoring linked objects with the same deleted_at timestamp.
I have this in a metadata database that controls a data warehouse. Accidentally knocking out a metadata record could nuke large parts of the warehouse, so I have to have this and a locked flag that prevents records from being removed or even soft deleted once data is in the warehouse.
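Roughly what that cascade looks like, with hypothetical orders/order_items tables and a locked column standing in for the setup described above:

-- soft-delete a parent and stamp its children with the same timestamp
WITH del AS (
    UPDATE orders
    SET is_deleted = true, deleted_at = now()
    WHERE id = 42 AND NOT locked        -- the locked flag blocks removal entirely
    RETURNING id, deleted_at
)
UPDATE order_items oi
SET is_deleted = true, deleted_at = del.deleted_at
FROM del
WHERE oi.order_id = del.id;

-- restoring is the same walk in reverse, matching children on the parent's
-- deleted_at value so unrelated soft-deleted rows are left alone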
> I have this in a metadata database that controls a data warehouse. Accidentally knocking out a metadata record could nuke large parts of the warehouse
I don't have problems with that, we have effective controls. But the warehouse data requires context, and that's supplied via our metadata db. Once metadata starts having actual records associated with it you can't remove it, or you just end up storing junk.
Pretty much every soft-delete implementation I've seen involves modifying a table's schema to add a deleted_at column.
Invariably, this becomes a rather annoying thorn in my side when I need to do more advanced indexing, like unique indexes for upserts. Nothing impossible, but I'm quickly swimming upstream against whatever ORM I happen to be using (I'm looking at you, Prisma).
What I haven't seen much of is a trigger-based system which moves the deleted record into an entirely new table. Is it possible to make a deletions table that always mirrors the source table's schema? I know table inheritance is a thing but I don't know if that applies to future schema changes.
Anyways, I know deleted_at quite well, but I long for something better.
I wonder if this could be solved with partitioning on the deleted_at column; you’d only need two partitions: one for live data (deleted_at is null) and one for deleted data. Then, in Postgres at least, you could just query the live partition directly when you want to skip deleted records.
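Something like that should work in Postgres by list-partitioning on the expression (deleted_at IS NULL). A rough, untested sketch with a hypothetical customer table (caveat: partition-wide unique constraints are awkward with an expression partition key, so treat this as an outline):

CREATE TABLE customer (
    id         bigint NOT NULL,
    name       text   NOT NULL,
    deleted_at timestamptz
) PARTITION BY LIST ((deleted_at IS NULL));

CREATE TABLE customer_live    PARTITION OF customer FOR VALUES IN (true);
CREATE TABLE customer_deleted PARTITION OF customer FOR VALUES IN (false);

-- since Postgres 11, an UPDATE that sets deleted_at moves the row between partitions,
-- and you can read the live partition directly to skip deleted records
SELECT * FROM customer_live;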
Yeah, but Ecto does. You can implement the Repo.prepare_query/3 callback to make sure the query has WHERE deleted_at IS NULL. In fact, that's the exact example in the documentation.
You don't have to reply to a comment if you don't really have anything else to say. What's the point of saying "Now I have two problems" without saying what the new problem is? So what is it? It can't be that an ORM has been added to the mix, since Ecto isn't an ORM.
The problem is you have to manage indexes in the database and keep that in sync with each application that queries it. Ecto is one of those applications.
I like this (1). Like anything related to triggers, you need to be mindful of tables with heavy throughput and maybe move the logic to "code" so you can batch or skip things, but for most tables it works like a charm.
You are right, the main downside of the approach in the article is that you have to remember to filter the deleted_at column on constraints and consider how foreign keys cascade.
Since I am exploring a sync engine, having a separate table for deletes means I’d need to run additional queries when catching up, which led me to explore the single table approach. Let’s see how it evolves.
You read my mind. Yes, this is possible. I haven’t done it in prod, but I made a PoC. As to table inheritance, yes, you can also do that, and yes, in PG anyway it automatically propagates schema changes.
There are quite a few gotchas [0], though, so think carefully before doing so.
The below assumes you have a user/customer table named customer, and another copy named inactive_customer. One other note: if you use an IDENTITY for the PK in the main table, you’ll want to drop that first in the inactive_customer table, so the ids match. Example below.
ALTER TABLE inactive_customer ALTER COLUMN id DROP IDENTITY;
CREATE FUNCTION move_inactive_customer()
RETURNS TRIGGER
LANGUAGE plpgsql
AS $function$
BEGIN
    -- OLD.* expands to the whole deleted row, so this works as long as
    -- inactive_customer keeps the same column layout as customer
    INSERT INTO inactive_customer
    SELECT OLD.*
    ON CONFLICT (id) DO NOTHING;  -- modify for your table constraints
    RETURN OLD;
END;
$function$;
CREATE TRIGGER redirect_customer
BEFORE DELETE
ON customer
FOR EACH ROW
EXECUTE PROCEDURE move_inactive_customer();
I mean, Prisma is just awful, that's hardly a problem of soft deletes.
At my previous job we used Prisma, and what a terrible system, the amount of problems I saw there. Some have been fixed since, but it shouldn't take 3 major versions of an ORM to use an actual upsert operation instead of doing
if select() then update() else insert() at the client side, which is all sorts of wrong.
Or the 50ms of latency it introduced per query by starting a separate process that generated the queries, or the lack of schema flexibility, or the idiotic new join features that transform rows into JSON in a destructive way.
I've hit a lot of rough edges with it too. A few off the top of my head:
- in memory joins (until they added the preview relationalJoin feature)
- no partial indexes
- no select for update
- no table partitioning (though if you do this out of band it all works fine)
- use of create type enum instead of check constraints makes maintenance of enumerated columns more challenging (IMO)
- little support for less common (but useful!) data types like geospatial, intervals, ltree, etc
- the reliance on wasm/rust makes debugging what it's doing difficult
- the connection pooling defaults are surprising (never reaps idle connections, AFAIK not cgroups aware and over provisions the pool size based on host CPU count)
For simple use cases it's appealing, and I suspect it's probably a better experience with nosql databases, but when you're used to taking full advantage of postgres' feature set it gets frustrating quickly.
There are issues in the tracker for basically everything above, and some may have even been solved by now - in a few more years it might be a better experience.
> in memory joins (until they added the preview relationalJoin feature)
This, to me, was the single issue that made me forever not take Prisma seriously. I get that “ship now, fix later” is a mindset, but you simply cannot ship an RDBMS wrapper that doesn’t handle JOINs. That’s an integral part of the system.
> the connection pooling defaults are surprising (never reaps idle connections, AFAIK not cgroups aware and over provisions the pool size based on host CPU count)
Tbf I have yet to find a piece of software that hasn’t messed up cgroupsv1. That’s not really an excuse – read docs – but at least everyone seems to make the same mistake.
I suppose "on delete cascade" would fire appropriate triggers in the downstream tables, and "on delete set null" will take care of upstream and peer tables.
If this is not sufficient, and a live record in one table may legitimately refer to a "dead" record in another table, then well, maybe "soft delete" is what you need, and your data just has a deactivated state.
However, in my experience, it's more easily maintainable to use normal deletes, have a trigger to move it to a schemaless table (jsonb column type that contains all of the data at the time of delete), and have manual restore logic via a good UX.
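That pattern is fairly compact in Postgres. A minimal sketch, using a hypothetical generic deleted_record table (all names made up here):

CREATE TABLE deleted_record (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    table_name text        NOT NULL,
    data       jsonb       NOT NULL,
    deleted_at timestamptz NOT NULL DEFAULT now()
);

CREATE FUNCTION archive_deleted_row()
RETURNS TRIGGER
LANGUAGE plpgsql
AS $$
BEGIN
    -- to_jsonb(OLD) captures the whole row at the time of delete, schemalessly
    INSERT INTO deleted_record (table_name, data)
    VALUES (TG_TABLE_NAME, to_jsonb(OLD));
    RETURN OLD;
END;
$$;

-- attach to any table; the restore path stays manual, as described above
CREATE TRIGGER archive_customer
AFTER DELETE ON customer
FOR EACH ROW EXECUTE FUNCTION archive_deleted_row();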
For Django, django-simple-history [1] is killer at this. It doesn't track perfectly in the same way as temporal tables (if a DB operation doesn't trigger a save hook, it won't be tracked) but the ergonomics for Django work is pretty unbeatable. Great integration into the ORM + auth + admin. History tables automatically derived from model definitions with sane nullability defaults, sane migration strategy, and hackability where needed.
It's made "can we get an audit table / undo delete for XYZ?" feature requests really easy.
Checking whether the trigger exists is already built into the code, check it out.
As far as changes (ALTER TABLE) go, I need to take history tables into consideration when writing migration scripts. Since I always do my migrations with SQL scripts, that's not an issue.
However, there's a concept of system event triggers in PostgreSQL that can intercept alter table statements, so I'm thinking to expand my solution to alter history tables automatically. Check it out:
https://www.postgresql.org/docs/current/event-triggers.html
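For anyone curious, a rough sketch of what that might look like (the actual diffing and ALTER generation for the history table is the hard part and is only hinted at here):

CREATE FUNCTION mirror_ddl_to_history()
RETURNS event_trigger
LANGUAGE plpgsql
AS $$
DECLARE
    obj record;
BEGIN
    FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        RAISE NOTICE 'intercepted % on %', obj.command_tag, obj.object_identity;
        -- here you would build and EXECUTE the matching ALTER for the history table
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER history_ddl_mirror
ON ddl_command_end
WHEN TAG IN ('ALTER TABLE')
EXECUTE FUNCTION mirror_ddl_to_history();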
While I appreciate all criticism of this approach in the comments, it's not that simple.
1. There may be audit policies, compliance policies, retention policies in place. You may HAVE to keep data which is deleted from user's perspective.
2. People make mistakes. While I agree that restoring a long-ago soft-deleted object may be very tricky because of referential integrity, it's easy to restore a recently (hours, days) soft-deleted object. Not only easy, it's much faster than restoring a point-in-time backup of an entire database. Also, the restored version is guaranteed to be the latest one, unlike in a backup.
3. PostgreSQL has partial indexes; other databases support similar features too. That allows excluding soft-deleted records from indexes and makes indexes smaller. A good query does not scan a table anyway. If one needs to read fewer pages from a table, PostgreSQL has the CLUSTER command. If there are too many soft-deleted tuples, maybe it's time to think about an archive table anyway.
Unique constraints may ignore soft-deleted objects, so only active objects need to be unique (sketch at the end of this comment). That is very useful if you need to support idempotent calls. Additionally, that may actually be a security feature too: if I create a new user with the same username as an old deleted user, I know that username is unique, but also that I will not accidentally get permissions of the old user assigned.
4. The worst is tooling. Forgetting "WHERE deleted_at IS NULL" or "WHERE NOT is_deleted" is a very real problem. I would rather advise against soft-deleted objects because of that problem alone.
On the other hand, if you use a decent ORM (and why wouldn't you?) like SQLAlchemy, Django ORM or EntityFramework/Linq (sorry for the folks who are not on Python/.Net, I do not have a good example for you), then it's quite easy to create a default model query that is not empty, i.e. one that already carries the soft-delete filter. But honestly, if your system is complex enough and has any custom row-level security, if it is multi-tenant, if not all users see everything, you already have a quite similar problem: you already have to start with a non-empty query which already has WHERE and possibly even JOIN clauses. Without a good ORM it will suck anyway.
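To illustrate point 3: a partial unique index on a hypothetical users table keeps only the active rows unique, so a soft-deleted user no longer blocks the username:

CREATE UNIQUE INDEX users_username_active_uniq
ON users (username)
WHERE deleted_at IS NULL;

-- the soft-deleted user keeps its row while a new signup reuses the username,
-- and soft-deleted rows never enter the index at all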
Does anyone have experience with soft deletes? Any links?
I've worked both ways. Hard DELETE and soft UPDATE deleted_at = NOW().
They both have tradeoffs.
It seems to me that deleted_at columns allow for faster development at the cost of more runtime bugs. Like why is this deleted row being listed in this <select>? Did I forget to add `WHERE deleted_at IS NULL` again?
Meanwhile a hard DELETE usually means I get to copy rows to a cemetery table just in case. And it forces me to think about what to do with Foreign Keys that point to the row being deleted. Can I really delete this row if 3 other tables point to it? It makes some business rules evident.
I'm thinking, perhaps I can serialize hard DELETED rows as JSON and store all system deleted rows in a trash table. This way I don't have to keep schema in sync there too.
> Like why is this deleted row being listed in this <select>? Did I forget to add `WHERE deleted_at IS NULL` again?
This is adding more complexity but... views?
For example: rename the "foo" table to "all_foo" and "CREATE VIEW foo AS SELECT * FROM all_foo WHERE deleted_at IS NULL"
I think you could even create a trigger on the view so "DELETE FROM foo" gets replaced with an "UPDATE all_foo SET deleted_at = NOW()", but I've never tried that before.
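That should work in Postgres with an INSTEAD OF trigger on the view. A rough sketch, assuming all_foo has an id primary key:

CREATE FUNCTION soft_delete_foo()
RETURNS TRIGGER
LANGUAGE plpgsql
AS $$
BEGIN
    UPDATE all_foo SET deleted_at = now() WHERE id = OLD.id;
    RETURN OLD;
END;
$$;

-- the trigger takes precedence over the view's auto-updatable behavior,
-- so DELETE FROM foo becomes a soft delete on all_foo
CREATE TRIGGER foo_soft_delete
INSTEAD OF DELETE ON foo
FOR EACH ROW EXECUTE FUNCTION soft_delete_foo();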
> Did I forget to add `WHERE deleted_at IS NULL` again?
This seems like an unaddressed use case on the part of the RDBMS to me, a feature that shouldn’t be reinvented by ORMs or each individual project.
Instead, you should be able to just do DELETE SOFT FROM … in your queries and have the DB make the rows unavailable in all queries.
Or maybe even set it up, so that regular DELETE is treated like a soft delete, so that suddenly all of the ORMs and users wouldn’t be able to drop data on accident.
Maybe allow overwriting it with DELETE /*+DELETE_HARD*/ FROM … if you so choose, as well as do SELECT /*+WITH_SOFT_DELETED*/ … if you don’t want to introduce any historical incompatibility.
If such features don’t exist at the DB or a DB extension level, then we’ll be doomed to deal with ORMs reimplementing it which will be pretty bad if you ever need more than one tech stack connected to it, alongside the issues of all of your queries needing additional conditions.
Temporal tables seem like a step in the right direction, but they too add complexity and gotchas. And I think they are kind of overkill for just soft deletes.
Another low-level solution is views. But I have databases approaching a thousand tables; duplicating 1000 tables as views would add significant friction to development and maintenance.
> I'm thinking, perhaps I can serialize hard DELETED rows as JSON and store all system deleted rows in a trash table. This way I don't have to keep schema in sync there too.
this is normally known as an audit table.
The problem with all of these methods, though, is that the data is persisted but the strategy to actually recover it is left as an exercise for the reader. And in my experience, it's never implemented, and the data is never actually reviewed/recovered in practice. The scenarios where a "recycling bin" is actually needed usually end up encoded as a user-facing business requirement with well-defined, non-generic, and actually recoverable logic that has nothing to do with the generically applicable soft/hard/audit strategies.
I agree it's very rare, if ever, that business folks ask to undelete data.
But I have to say, with the developer hat on, softdeleted rows helped me debug systems more than once. With that said, softdeleted rows also wasted my time more than I'd like to admit while exploring data only to realize I was looking at stale, deleted data.
An audit table is useful for that purpose. Soft delete is a product of the confused mind, thinking one can get data recovery almost for free. Audit tables admit that, at best, it'll be a manual process to be custom-designed for the problem at hand, when it comes to it.
I'm currently in the process of migrating back to hard-delete from soft-delete on my current company's codebase. We use a slightly different approach of using triggers to do cascade deletes. It's just very difficult to maintain as the product grows bigger. All that additional overhead to create additional rules for every table, and the scope for human error. Overall, I don't recommend anyone take the route of soft-deletes unless it's absolutely required.
Not a fan of soft deleting, but as I sit here I wonder if using an RLS policy to limit to anything without a deleted field would make the whole “forgetting a where” a bit easier? I’m not sure the performance would be great though
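Something along these lines, with a hypothetical customer table and made-up role names (usual caveats: table owners bypass RLS unless you FORCE it, and write commands would need their own policies):

ALTER TABLE customer ENABLE ROW LEVEL SECURITY;

-- the application role only ever sees live rows, no extra WHERE clause needed
CREATE POLICY hide_deleted ON customer
    FOR SELECT
    TO app_user
    USING (deleted_at IS NULL);

-- a dedicated recovery role, as suggested upthread, can still see soft-deleted rows
CREATE POLICY see_everything ON customer
    TO recovery_admin
    USING (true);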