For architectural documentation like this one, the C4 Model [0] is a much better fit than UML - primarily because it's less rigid in notation and modeling components. And in terms of tooling, I find IcePanel [1] to have the right combination of flexibility and simplicity.
The bottom layer of C4 is still basically UML, although everyone usually skips that.
I love IcePanel and would recommend everyone try it. But like all these things, it requires an almost superhuman level of commitment to get value out of it. It has built-in mechanisms to keep you honest and up to date, and I have found it useful at both the strategic and tactical level when used right. But ultimately it’s difficult to build an engineering culture around long-term, living diagrams. The moment everything gets out of sync it might as well just be a photo of a whiteboard. I strongly suspect this sort of platform plus AI will be a great combination (I think at least one HNer is working on exactly that and I assume IcePanel too).
I can't speak for your parent, but I'm aware of Zawinski's Law and could see that's what the comment was about. Like your parent, though, it's not at all clear to me why; it's a non-sequitur. This is giving Rust a convenient, safe way to upcast, which is not anywhere in the ballpark of the sort of "expand to do everything" that Jamie was describing as inevitable.
If you say "Ooh, the new kitten is learning how the cat flap works" and I respond "Those who cannot remember the past are condemned to repeat it" and then you respond with confusion that's because my quoting of Santayana is a non-sequitur. I might even agree with his sentiment about the importance of history, but the kitten learning how to use a cat flap isn't anywhere close to what Santayana was talking about so...
My jest wasn't meant to say that Rust is expanding to do everything, but rather the opposite. The comment I replied to somehow seems to believe Rust is becoming more "OOP", so I took it a step further and also referenced a fairly known pitfall for platforms (so doesn't even apply to programming languages, in my mind).
In the end, it's a joke with not even a pinch of truth, which seems to have fallen flat. That's on me, I suppose.
Some parts of this book are extremely useful, especially when it's talking about concepts that are more general than Python or any other specific language -- such as event-driven architecture, commands, CQRS etc.
That being said, I have a number of issues with other parts of it, and I have seen how dangerous it can be when inexperienced developers take it as gospel and try to implement everything at once (which is a common problem with any collection of design patterns like this).
For example, repository is a helpful pattern in general; but in many cases, including the examples in the book itself, it is a huge overkill that adds complexity with very little benefit. Even more so as they're using SQLAlchemy, which is a "repository" in its own right (or, more precisely, a relational database abstraction layer with an ORM added on top).
Similarly, service layers and unit of work are useful when you have complex applications that cover multiple complex use cases; but in a system consisting of small services with narrow responsibilities they quickly become overly bloated using this pattern. And don't even get me started with dependency injection in Python.
The essential thing about design patterns is that they're tools like any other, and developers should understand when to use them and, even more importantly, when not to use them. This book has some advice in that direction, but in my opinion it should be more prominent and placed upfront rather than at the end of each chapter.
Could you explain how repository pattern is a "huge overkill that adds complexity with very little benefit"? I find it a very light-weight pattern and would recommend to always use it when database access is needed, to clearly separate concerns.
In the end, it's just making sure that all database access for a specific entity goes through one point (the repository for that entity). Inside the repository, you can do whatever you want (run queries yourself, use an ORM, etc).
A lot of the stuff written in the article under the section Repository pattern has very little to do with the pattern, and much more to do with all sorts of Python, Django, and SQLAlchemy details.
In theory it's a nice abstraction, and the benefit is clear. In practice, your repository likely ends up forwarding its arguments one-for-one to SQLAlchemy's select() or session.query().
That's aside from their particular example of SQLAlchemy sessions, which is extra weird because a Session is already a repository, more or less.
I mean, sure, there's a difference between your repository for your things and types you might consider foreign, in theory, but how theoretical are we going to get? For what actual gain? How big of an app are we talking?
You could alias Repository = Session, or define a simple protocol with stubs for some of Session's methods, just for typing, and you'd get the same amount of theoretical decoupling with no extra layer. If you want to test without a database, don't bind your models to a session. If you want to use a session anyway but still not touch the database, replace your Session's scopefunc and your tested code will never know the difference.
It's not a convincing example.
Building your repository layer over theirs, admittedly you stop the Query type from leaking out. But then you implement essentially the Query interface in little bits for use in different layers, just probably worse, and lacking twenty years of testing.
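For what it's worth, the "simple protocol with stubs" idea could look roughly like the sketch below. It assumes the real session exposes `add(instance)` and `get(entity, ident)` with these shapes; `Batch`, `InMemoryRepository` and `allocate` are made-up names for illustration, and nothing here imports SQLAlchemy.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol


class Repository(Protocol):
    """Only the slice of the session interface the application code needs."""

    def add(self, instance: Any) -> None: ...

    def get(self, entity: type, ident: Any) -> Optional[Any]: ...


@dataclass
class Batch:  # hypothetical domain object
    ref: str
    qty: int


class InMemoryRepository:
    """Test double that satisfies Repository structurally, with no database."""

    def __init__(self) -> None:
        self._items: dict[tuple[type, Any], Any] = {}

    def add(self, instance: Any) -> None:
        self._items[(type(instance), instance.ref)] = instance

    def get(self, entity: type, ident: Any) -> Optional[Any]:
        return self._items.get((entity, ident))


def allocate(repo: Repository, ref: str, qty: int) -> int:
    """Application code depends on the protocol, not on a concrete class."""
    batch = repo.get(Batch, ref)
    if batch is None:
        raise LookupError(ref)
    batch.qty -= qty
    return batch.qty


repo = InMemoryRepository()
repo.add(Batch("batch-1", 10))
remaining = allocate(repo, "batch-1", 3)
```

Because the protocol is structural, any object with matching `add` and `get` methods type-checks as a `Repository`, so the decoupling costs one small class rather than a wrapper layer.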
Thanks, that makes a lot of sense. I don't have a whole bunch of experience with SQLAlchemy itself. In general, I prefer not to use ORMs but just write queries and map the results into value objects. That work I would put into a Repository.
Also in my opinion it's important to decouple the database structure from the domain model in the code. One might have a Person type which is constructed by getting data from 3 tables. A Repository class could do that nicely: maybe run a join query and a separate query, combine the results together, and return the Person object. ORMs usually tightly couple with the DB schema, which might create the risk of coupling the rest of the application as well (again, I don't know how flexible SQLAlchemy is in this).
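A minimal sketch of that kind of repository, using the stdlib `sqlite3` module so it runs anywhere. The three tables, their columns and the `Person` fields are assumptions made up for the example.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    id: int
    name: str
    city: str
    emails: tuple[str, ...]


class PersonRepository:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def get(self, person_id: int) -> Optional[Person]:
        # One join query for the single-valued fields...
        row = self._conn.execute(
            "SELECT p.id, p.name, a.city FROM people p "
            "JOIN addresses a ON a.person_id = p.id WHERE p.id = ?",
            (person_id,),
        ).fetchone()
        if row is None:
            return None
        # ...and a separate query for the one-to-many part.
        emails = self._conn.execute(
            "SELECT email FROM emails WHERE person_id = ? ORDER BY email",
            (person_id,),
        ).fetchall()
        return Person(row[0], row[1], row[2], tuple(e[0] for e in emails))


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE addresses (person_id INTEGER, city TEXT);
    CREATE TABLE emails (person_id INTEGER, email TEXT);
    INSERT INTO people VALUES (1, 'Ada');
    INSERT INTO addresses VALUES (1, 'London');
    INSERT INTO emails VALUES (1, 'ada@example.com'), (1, 'a.l@example.com');
    """
)
person = PersonRepository(conn).get(1)
```

The rest of the application only ever sees `Person`, so the table layout can change without touching the callers.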
There could be some value in hiding SQLAlchemy, in case one would ever like to replace it with a better alternative. I don't have enough experience with Python to understand if that ever will be the case though.
All in all, trade-offs are always important to consider. A tiny microservice consisting of a few functions: just do whatever. A growing modulith with various evolving domains which have not been fully settled yet: put some effort into decoupling and separating concerns.
I've used SqlAlchemy in a biggish project. Had many problems; the worst ones were around session scoping and the DB hitting session limits, but we had issues around the models too.
The argument for hiding SqlAlchemy is nothing to do with "what if we change the DB"; that's done approximately never, and, even if so, you have some work to do, so do it at the time. YAGNI
The argument is that SA models are funky things with lazy loading. IIRC, that's the library where the metaclasses have metaclasses! It's possible to accidentally call the DB just by accessing a property.
It can be a debugging nightmare. You can have data races. I remember shouting at the code, "I've refreshed the session you stupid @#£*"
The responsible thing to do is flatten them to, say, a pydantic DTO. Then you can chuck them about willy-nilly. Your type checker will highlight a DTO problem that an SA model would have slipped underneath your nose.
The difficulty you have following that is that, when you have nested models, you need to know in advance what fields you want so you don't overfetch. I guess you're thinking "duh, I handcraft my queries" and my goodness I see the value of that approach now. However, SA still offers benefits even if you're doing this more tightly-circumscribed fetch-then-translate approach.
This is partly how I got from the eager junior code golf attitude to my current view, which is, DO repeat yourself, copy-paste a million fields if you need, don't sweat brevity, just make a bunch of very boring data classes.
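To make the flattening step concrete, here is a rough sketch that uses a frozen stdlib dataclass as a stand-in for the pydantic DTO, and a fake `LazyAuthor` class as a stand-in for an ORM model. Both are assumptions for illustration; the counter only simulates a lazy load.

```python
from dataclasses import dataclass


class LazyAuthor:
    """Stand-in for an ORM model: touching .books triggers a 'query'."""

    queries = 0  # counts simulated database round trips

    def __init__(self, name: str) -> None:
        self.name = name

    @property
    def books(self) -> list[str]:
        LazyAuthor.queries += 1  # a real ORM might emit a SELECT here
        return ["Book A", "Book B"]


@dataclass(frozen=True)
class AuthorDTO:
    """Boring, fully loaded data: attribute access can never hit the database."""

    name: str
    books: tuple[str, ...]


def to_dto(author: LazyAuthor) -> AuthorDTO:
    # Pay for the load exactly once, at a boundary you control.
    return AuthorDTO(name=author.name, books=tuple(author.books))


author = LazyAuthor("Ann")
dto = to_dto(author)
titles = [dto.books, dto.books, dto.books]  # no further 'queries'
```

After `to_dto` the counter stays at one no matter how often the DTO is read, which is the property you want before passing the object around.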
Just a heads-up if you haven't seen it: Overriding lazy-loading options at query time can help with overfetching.
```python
from sqlalchemy import select
from sqlalchemy.orm import raiseload, relationship

class Author(Model):
    books = relationship(..., lazy='select')

fetch_authors = select(Author).options(raiseload(Author.books))
```
Anything that gets its Authors with fetch_authors will get instances that raise instead of doing a SELECT for the books. You can throw that in a smoke test and see if there's anything sneaking a query. Or if you know you never want to lazy-load, relationship(..., lazy='raise') will stop it at the source.
SQLModel is supposed to be the best of both Pydantic and SQLAlchemy, but by design an SQLModel entity backed by a database table doesn't validate its fields on creation, which is the point of Pydantic.
I can't take a position without looking under the hood, but what concerns me is "SqlModel is both a pydantic model and an SA model", which makes me think it may still have the dynamic unintended-query characteristics that I'm warning about.
I seem to recall using SqlModel in a pet project and having difficulty expressing many-to-many relationships, but that's buried in some branch somewhere. I recall liking the syntax more than plain SA. I suspect the benefits of SqlModel are syntactical rather than systemic?
"Spaghetti" is an unrelated problem. My problem codebase was spaghetti, and that likely increased the problem surface, but sensible code doesn't eliminate the danger
I mean that from the point of view of YAGNI for a small app. For a big one, absolutely, you will find the places where the theoretical distinctions suddenly turn real. Decoupling your data model from your storage is a real concern and Session on its own won't give you that advantage of a real repository layer.
SQLAlchemy is flexible, though. You can map a Person from three tables if you need to. It's a data mapper, then a separate query builder on top, then a separate ORM on top of that, and then Declarative which ties them all together with an ActiveRecord-ish approach.
> I prefer not to use ORMs but just write queries and map the results into value objects. That work I would put into a Repository.
Yep, I hear ya. Maybe if they'd built on top of something lower-level like stdlib sqlite3, it wouldn't be so tempting to dismiss as YAGNI. I think my comment sounded more dismissive than I really meant.
SQLAlchemy Session is actually a unit of work (UoW), which they also build on top of. By the end of the book they are using their UoW to collect and dispatch events emitted by the services. How would they have done that if they just used SQLAlchemy directly?
You might argue that they should have waited until they wanted their own UoW behaviour before actually implementing it, but that means by the time they need it they need to go and modify potentially hundreds of bits of calling code to swap out SQLAlchemy for their own wrapper. Why not just build it first? The worst that happens is it sits there being mostly redundant. There have been far worse things.
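As a rough illustration of what such a wrapper buys you, here is a sketch of a UoW that delegates to an underlying session and also collects events from the aggregates it has seen. `FakeSession`, `Order`, `track` and `collect_new_events` are names assumed for this example, not the book's or SQLAlchemy's code.

```python
from dataclasses import dataclass, field


@dataclass
class Order:  # hypothetical aggregate
    ref: str
    events: list = field(default_factory=list)

    def ship(self) -> None:
        self.events.append(("shipped", self.ref))


class FakeSession:
    """Stand-in for the underlying session (e.g. an ORM session)."""

    def __init__(self) -> None:
        self.committed = False

    def commit(self) -> None:
        self.committed = True

    def rollback(self) -> None:
        pass


class UnitOfWork:
    """Thin wrapper: delegates commit/rollback and tracks aggregates it has seen."""

    def __init__(self, session) -> None:
        self.session = session
        self.seen = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Roll back anything left uncommitted (a no-op for the fake session).
        self.session.rollback()

    def track(self, aggregate) -> None:
        self.seen.append(aggregate)

    def commit(self) -> None:
        self.session.commit()

    def collect_new_events(self):
        # Drain events from every tracked aggregate, oldest first.
        for aggregate in self.seen:
            while aggregate.events:
                yield aggregate.events.pop(0)


session = FakeSession()
with UnitOfWork(session) as uow:
    order = Order("order-1")
    uow.track(order)
    order.ship()
    uow.commit()
dispatched = list(uow.collect_new_events())
```

The calling code only ever talks to `uow`, so event collection can be added later without touching the call sites.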
The tricks you mention for the tests might work for SQLAlchemy, but what if we're not using SQLAlchemy? The repository pattern works for everything. That's what makes it a pattern.
I understand not everyone agrees on what "repository" means. The session is a UoW (at two or three levels) and also a repository (in the sense of object-scoped persistence) and also like four other things.
I'm sort of tolerant of bits of Session leaking into things. I'd argue that its leaking pieces are the application-level things you'd implement, not versions of them from the lower layers that you need to wrap.
When users filter data and their filters go from POST submissions to some high-level Filter thing I'd pass to a repository query, what does that construct look like? Pretty much Query.filter(). When I pick how many things I want from the repository, it's Query.first() or Query.one(), or Query.filter().filter().filter().all().
Yes, it's tied to SQL, but only in a literal sense. The API would look like that no matter what, even if it wasn't. When the benefit outweighs the cost, I choose to treat it like it is the thing I should have written.
It isn't ideal or ideally correct, but it's fine, and it's simple.
You seem to have stopped reading my comment after the first sentence. I asked some specific questions about how you would do what they did if you just use SQLAlchemy as your repository/UoW.
Repository pattern is useful if you really feel like you're going to need to switch out your database layer for something else at some point in the future, but I've literally never seen this happen in my career ever. Otherwise, it's just duplicate code you have to write.
What is the alternative that you use, how do you provide data access in a clean, separated, maintainable way?
I have seen it a lot in my career, and have used it a lot. I've never used it in any situation to switch out a database layer for something else. It seems like we have very different careers.
I also don't really see how it duplicates code. At the basic level, it's practically nothing more than putting database access code in one place rather than all over the place.
What we are talking about is a "transformation" or "mapper" layer isolating your domain entities from the persistence. If this is what we call "Repository" then yes, I absolutely agree with you -- this is the right approach to this problem. But if the "Repository pattern" means a complex structure of abstract and concrete classes and inheritance trees -- as I have usually seen it implemented -- then it is usually an overkill and rarely a good idea.
Thanks. In my mind, anything about complex structures of (abstract) classes and/or inheritance trees has nothing to do with a Repository pattern.
As I understand it, Repository pattern is basically a generalization of the Data Access Object (DAO) pattern, and sometimes treated synonymously.
The way I mean it and implement it, is basically for each entity have a separate class to provide the database access. E.g. you have a Person (not complex at all, simply a value object) and a PersonRepository to get, update, and delete Person objects.
Then, based on the complexity and scope of the project, Person either maps 1-to-1 to, e.g., a database table or stored object/document, or it is a somewhat more complex object in the business domain and the repository could be doing a little bit more work to fetch and construct it (e.g. perhaps some joins or more than one query for some data).
> for each entity have a separate class to provide the database access
Let me correct you: for each entity that needs database access. This is why I'm talking about layers here: sometimes entities are never persisted directly, but only as "parts" or "relations" of other entities; in other cases you might have a very complex persistence implementation (e.g. some entities are stored in an RDB, while others are stored in a filesystem) and there is no clear mapping.
I recommend approaching this from the perspective of each domain entity individually; "persistability" is essentially just another property which might or might not apply in each case.
Naturally, Repository is a pattern for data(base) access, so it should have nothing to do with objects that are not persisted. I used "entity" as meaning a persisted object. That was not very clear, sorry.
Well, again, that is not completely straightforward - what exactly is a "persisted object"? We have two things here that are usually called entities:
1. The domain entities, which are normally represented as native objects in our codebase. They have no idea whether they need to be persisted and how.
2. The database entities, which are - in RDBs at least - represented by tables.
It is not uncommon that our entities of the first type can easily be mapped 1:1 to our entities of the second type - but that is far from guaranteed. Even if this is the case, the entities will be different because of the differences between the two "worlds": for example, Python's integer type doesn't have a direct equivalent in, say, PostgreSQL (it has to be converted into smallint, integer, bigint or numeric).
In my "correction" above I was talking about the domain entities, and my phrasing that they "need database access" is not fully correct; it should have been "need to be persisted", to be pedantic.
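The integer example can be written down as a small mapping function. This is only a sketch of the conversion decision, based on PostgreSQL's documented 2-, 4- and 8-byte integer ranges; the function name is made up.

```python
def postgres_integer_type(value: int) -> str:
    """Pick the narrowest PostgreSQL type that can hold a Python int."""
    if -(2**15) <= value < 2**15:
        return "smallint"  # 2 bytes
    if -(2**31) <= value < 2**31:
        return "integer"  # 4 bytes
    if -(2**63) <= value < 2**63:
        return "bigint"  # 8 bytes
    return "numeric"  # anything that does not fit in 64 bits
```

In practice a column has one fixed type, so the mapping layer has to pick it once per field rather than per value, which is exactly where the two "worlds" diverge.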
I’ve seen it, but of course there was no strict enforcement of the pattern so it was a nightmare of leakage and the change got stuck half implemented, with two databases in use.
In my experience, both SQL and real-world database schema are each complex enough beasts that to ensure everything is fetched reasonably optimally, you either need tons of entity-specific (i.e. not easily interface-able) methods for every little use case, or you need to expose some sort of builder, at which point why not just use the query builder you're almost certainly already calling underneath?
Repository patterns are fine for CRUD but don't really stretch to those endpoints where you really need the query with the two CTEs and the four joins onto a query selecting from another query based on the output of a window function.
I rarely mock a repository. Mocking the database is nice for unit-testing, and it's also a lot faster than using a real DB, but the DB and the DB-application interface are some of the hottest spots for bugs: using a real DB (same engine as prod) gives me a whole lot more confidence that my code actually works. It's probably the thing I'm least likely to mock out, despite that making tests more difficult to write and quite a bit slower.
I had a former boss who strongly pushed my team to use the repository pattern for a microservice. The team wanted to try it out since it was new to us and, like the other commenters are saying, it worked but we never actually needed it. So it just sat there as another layer of abstraction, more code, more tests, and nothing benefited from it.
Anecdotally, the project was stopped after nine months because it took too long. The decision to use the repository pattern wasn't the straw that broke the camel's back, but I think using patterns that were more complicated than the use case required was at the heart of it.
Could you give me some insight into what alternative you would rather have seen?
I am either now learning that the Repository pattern is something different than what I understand it to be, or there is misunderstanding here.
I cannot understand how (basically) tucking away database access code in a repository can lead to complicated code, long development times, and the entire project failing.
Your understanding of the repository pattern is correct. It's the other people in this thread that seem to have misunderstood it and/or implemented it incorrectly. I use the repository pattern in virtually every service (when appropriate) and it's incredibly simple, easy to test and document, and easy to teach to coworkers. Because most of our services use the repository pattern, we can jump into any project we're not familiar with and immediately have the lay of the land, knowing where to go to find business logic or make modifications.
One thing to note -- you stated in another comment that the repository pattern is just for database access, but this isn't really true. You can use the repository pattern for any type of service that requires fetching data from some other location or multiple locations -- whether that's a database, another HTTP API, a plain old file system, a gRPC server, an ftp server, a message queue, an email service... whatever.
This has been hugely helpful for me as one of the things my company does is aggregate data from a lot of other APIs (whois records, stuff of that nature). Multiple times we've had to switch providers due to contract issues or because we found something better/cheaper. Being able to swap out implementations was incredibly helpful because the business logic layer and its unit tests didn't need to be touched at all.
Before I started my current role, we had been using kafka for message queues. There was a huge initiative to switch over to rabbit and it was extremely painful ripping out all the kafka stuff and replacing it with rabbit stuff and it took forever and we still have issues with how the switch was executed to this day, years later. If we'd been using the repository pattern, the switch would've been a piece of cake.
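A small sketch of the provider-swapping case, assuming a whois-style lookup. `ProviderA`, `ProviderB` and their canned responses are hypothetical; real implementations would call the vendors' APIs instead of returning hard-coded data.

```python
from typing import Protocol


class WhoisRepository(Protocol):
    def registrar_for(self, domain: str) -> str: ...


class ProviderA:
    """Hypothetical provider; a real one would call its HTTP API here."""

    def registrar_for(self, domain: str) -> str:
        return {"example.com": "Registrar One"}.get(domain, "unknown")


class ProviderB:
    """Hypothetical replacement with a different response shape."""

    def registrar_for(self, domain: str) -> str:
        record = {"example.com": {"registrar": {"name": "Registrar One"}}}
        return record.get(domain, {}).get("registrar", {}).get("name", "unknown")


def describe(repo: WhoisRepository, domain: str) -> str:
    """Business logic: unaware of which provider sits behind the repository."""
    return f"{domain} is registered with {repo.registrar_for(domain)}"
```

Switching vendors then means writing one new class and changing the object passed to `describe`; the business logic and its unit tests stay as they are.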
Thanks. I was starting to get pretty insecure about it. I don't actually know why in my brain it was tightly linked to only database access. It makes perfect sense to apply it to other types of data retrieval too. Thanks for the insights!
> And don't even get me started with dependency injection in Python.
Could I get you started? Or could you point me to a place to get myself started? I primarily code in Python and I've found dependency injection, by which I mean giving a function all the inputs it needs to calculate via parameters, is a principle worth designing projects around.
> I have seen how dangerous it can be when inexperienced developers take it as a gospel and try to implement everything at once
This book explicitly tells you not to do this.
> Similarly, service layers and unit of work are useful when you have complex applications that cover multiple complex use cases; but in a system consisting of small services with narrow responsibilities they quickly become overly bloated using this pattern. And don't even get me started with dependency injection in Python.
I have found service layers and DI really helpful for writing functional programs. I have some complex image-processing scripts in Python that I can use as plug-ins with a distributed image processing service in Celery. Service layer and DI just take code from:
```python
dependency.do_thing(params)
```
To:
```python
do_thing(dependency, params)
```
Which ends up being a lot more testable. I can run image processing tasks in a live deployment with all of their I/O mocked, or I can run real image processing tasks on a mocked version of Celery. This lets me test all my different functions end-to-end before I ever do a full deploy. Also using the Result type with service layer has helped me propagate relevant error information back to the web client without crashing the program, since the failure modes are all handled in their specific service layer function.
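To show the shape of this, here is a minimal sketch of a service-layer function that receives its I/O as a parameter and returns a Result-style value instead of raising. `Ok`, `Err`, `make_thumbnail` and `fake_load` are names invented for the example, and the "processing" is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Ok:
    value: bytes


@dataclass(frozen=True)
class Err:
    reason: str


Result = Union[Ok, Err]


def make_thumbnail(load: Callable[[str], bytes], path: str, size: int) -> Result:
    """Service-layer function: all I/O arrives through the `load` parameter."""
    try:
        data = load(path)
    except FileNotFoundError:
        return Err(f"missing image: {path}")
    # Placeholder for real image processing: keep the first `size` bytes.
    return Ok(data[:size])


def fake_load(path: str) -> bytes:
    """Test double standing in for disk or network access."""
    if path != "cat.png":
        raise FileNotFoundError(path)
    return b"0123456789"
```

The same function runs unchanged with a real loader in production and with `fake_load` in tests, and the failure case comes back as data the web layer can report.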
I just rolled my own mocked Celery objects. I have mocked Groups, Chords, Chains, and Signatures, mocked Celery backend, and mocked dispatch of tasks. Everything runs eagerly because it's all just running locally in the same thread, but the workflow still runs properly-- the output of one task is fed into the next task, the tasks are updated, etc.
I actually pass Celery and the functions like `signature`, `chain`, etc as a tuple into my service layer functions.
It's mostly just to test that the piping of the workflow is set up correctly so I don't find out that my args are swapped later during integration tests.
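Roughly, the idea can be sketched as below. These doubles are hypothetical and are not Celery's classes; they only assume that a chain feeds each task's result into the next task as its first argument, and that the primitives are handed to the service layer as a tuple.

```python
from typing import Any, Callable, NamedTuple


class FakeSignature:
    """Eager stand-in for a task signature: just a function plus bound args."""

    def __init__(self, func: Callable[..., Any], *args: Any) -> None:
        self.func, self.args = func, args

    def run(self, *prefix: Any) -> Any:
        return self.func(*prefix, *self.args)


def fake_chain(*signatures: FakeSignature) -> Callable[[], Any]:
    """Run signatures in order, feeding each result into the next as first arg."""

    def run() -> Any:
        result = signatures[0].run()
        for sig in signatures[1:]:
            result = sig.run(result)
        return result

    return run


class Primitives(NamedTuple):
    signature: Callable[..., Any]
    chain: Callable[..., Any]


def build_workflow(prims: Primitives, width: int) -> Callable[[], Any]:
    """Service-layer function: receives the workflow primitives as a tuple."""
    load = prims.signature(lambda: list(range(10)))
    crop = prims.signature(lambda pixels, w: pixels[:w], width)
    total = prims.signature(sum)
    return prims.chain(load, crop, total)


workflow = build_workflow(Primitives(FakeSignature, fake_chain), width=4)
```

Calling `workflow()` runs all three steps in the current thread, so swapped arguments or a miswired step show up immediately rather than during integration tests.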
This was my takeaway too. It’s interesting to see the patterns. It would be helpful to have some guidance upfront around the situations in which they are most useful to implement. If a pattern is a tool, then steering me towards when it’s used or best avoided would be helpful. I do appreciate that the pros and cons sections get to this point, so perhaps it’s just ordering and emphasis.
That said, having built a small web app to enable a new business, and learning python along the way to get there, this provided me with some ideas for patterns I could implement to simplify things (but others I think I’ll avoid).
> That being said, I have a number issues with other parts of it, and I have seen how dangerous it can be when inexperienced developers take it as a gospel and try to implement everything at once (which is a common problem with any collection of design patterns like this.
Robert Martin is one of those examples, he did billions in damages by brainwashing inexperienced developers with his gaslighting garbage like "Clean Code".
Software engineering is not a hard science, so there is almost never a silver bullet; everything is trade-offs. People who claim to know the one true way are subcriminal psychopaths or noobs.
Clean code has lots of useful tips and techniques.
When people criticize it, they pick a concept from one or two pages out of the hundreds and use it to dismiss the whole book. This is a worse mistake than introducing concepts that may be foot guns in some situations.
Becoming an experienced engineer is learning how, when and where to apply tools from your toolkit.
> I see OOP as fundamentally one key concept: that there are functions which "belong to" specific types, and all other operations on those types go through the functions that "belong to" them
I would go even further:
All programming essentially consists of two things only:
1. data structures (often, but not necessarily, formalised as "types")
2. mechanisms to transform that data (functions, methods, algorithms, tasks, whatever)
Programming paradigms essentially describe different ways to organise these two: functional programming, for example, keeps the two strictly separate; in OOP the mechanisms are "attached" to the data they operate on. Of course, in practice there are many combinations and variations, with more and less flexibility (from strictly closed objects to mixins and similar), but this is the foundation.
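A tiny sketch of the two arrangements, using made-up `Point` and `Vector` types: the same data and the same transformation, organised once as a free function and once as a method.

```python
from dataclasses import dataclass


# Separate: the data is a plain record and the transformation is a free function.
@dataclass(frozen=True)
class Point:
    x: float
    y: float


def scaled(point: Point, factor: float) -> Point:
    return Point(point.x * factor, point.y * factor)


# Attached: the same transformation "belongs to" the type it operates on.
@dataclass(frozen=True)
class Vector:
    x: float
    y: float

    def scaled(self, factor: float) -> "Vector":
        return Vector(self.x * factor, self.y * factor)
```

Both versions compute the same thing; the difference is only where the function lives and therefore who is allowed to know about the data's layout.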
In my experience, design patterns cause a lot of problems when developers reach the stage when they understand them enough to know how to implement them, but not when (and even more importantly, when not).
Most design patterns were invented as workarounds to specifics and limitations of certain languages/paradigms, and work great in that case -- but are much more limited, or even downright dangerous, when used in other contexts. As a clear example, the GoF patterns focus on statically typed OOP languages like Java or C++, and will result in difficult-to-maintain bloatware if copied directly to dynamic languages like Python, JavaScript or PHP (source: been there, done that, got the t-shirt).
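One illustration, assuming the Strategy pattern as the example: in Python the interface-plus-concrete-classes structure can often collapse into passing a function. The pricing names below are hypothetical.

```python
from typing import Callable

# The "strategy" is any callable with this shape; no interface class is needed.
PricingStrategy = Callable[[float], float]


def full_price(amount: float) -> float:
    return amount


def half_price(amount: float) -> float:
    return amount * 0.5


def checkout(amount: float, pricing: PricingStrategy = full_price) -> float:
    """The context simply calls whatever function it was given."""
    return round(pricing(amount), 2)
```

A one-off strategy can even be a lambda at the call site, which is the kind of shortcut a direct port of the class-based version would hide.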
I'm honestly disappointed how OpenAPI keeps being used over and over as documentation, and extremely rarely as what it excels at, which is specification.
We build all kinds of frameworks with routing and request/response validation, and then extract that into OpenAPI format, often having to jump through hoops to adapt our internal data types and structures into those supported by JSON Schema. Instead, we could be doing the opposite: writing the OpenAPI spec first, and using tooling to generate routing and validation from it in an automated way. That has been done before [0] [1], but we're still just scratching the surface of what is possible.
Yes, I am aware that it's not easy to manually write the specs using JSON or even YAML, but we need a better focus on tooling around it; currently only Stoplight [2] gives a solid level of support in that area.
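As a toy illustration of the spec-first direction, the sketch below routes and validates a request using only a fragment shaped like an OpenAPI document. The spec fragment, `HANDLERS` table and `dispatch` function are assumptions for the example; real tooling would also resolve `$ref`s, path templates and full JSON Schema.

```python
from typing import Any, Callable

# A tiny fragment shaped like an OpenAPI document (paths -> methods -> operation).
SPEC = {
    "paths": {
        "/pets": {
            "get": {
                "operationId": "list_pets",
                "parameters": [
                    {"name": "limit", "in": "query", "required": True,
                     "schema": {"type": "integer"}},
                ],
            }
        }
    }
}

JSON_TYPES = {"integer": int, "string": str, "boolean": bool}


def list_pets(limit: int) -> list[str]:
    return ["rex", "tom", "kit"][:limit]


# Handlers are looked up by the operationId declared in the spec.
HANDLERS: dict[str, Callable[..., Any]] = {"list_pets": list_pets}


def dispatch(method: str, path: str, query: dict[str, Any]) -> Any:
    """Route and validate a request using only what the spec declares."""
    operation = SPEC["paths"][path][method.lower()]
    for param in operation.get("parameters", []):
        name = param["name"]
        if name not in query:
            if param.get("required"):
                raise ValueError(f"missing required parameter: {name}")
            continue
        expected = JSON_TYPES[param["schema"]["type"]]
        if not isinstance(query[name], expected):
            raise TypeError(f"{name} must be of type {param['schema']['type']}")
    return HANDLERS[operation["operationId"]](**query)
```

The handler contains no routing or validation code at all; changing the spec changes the behaviour, which is the inversion being argued for.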
As much as I am glad that it looks like one solution is being more and more accepted as the golden standard, I'm a little disappointed that PDM [0] -- which has been offering pretty much everything uv does for quite some time now -- has been completely overlooked. :(
[0] https://c4model.com/
[1] https://icepanel.io/