I kind of hate all these, because matrix multiplication is easiest to think of as simply repeated matrix-vector multiplication, which you need anyways (and earlier).
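To spell that out: if B has columns b_1, ..., b_n, then

    A B = [ A b_1 | A b_2 | ... | A b_n ]

i.e. each column of the product is just A applied to the corresponding column of B, so once matrix-vector multiplication makes sense, the general case comes for free.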
I'm sceptical of that article; it's making guesses about the limitations of SQL query optimisers.
Consider the Simple example it presents. The article is in effect implying that no query optimiser would be able to figure out the equivalence of the two predicates.
(Let's ignore that the two predicates aren't actually equivalent; the first version may raise an exception if myIntColumn is negative, depending on the DBMS.)
I should have been clearer, I meant the Wikipedia article.
Ozar's article is much better. It doesn't make sweeping assumptions about the limitations of all query optimisers, or commit basic oversights when contrasting supposedly equivalent predicates.
> This is literally what NULL is for. It's a special value precisely because missing data or a N/A field is so common.
Kinda. You need null for outer joins, but you could have a relational DBMS that prohibits nullable columns in tables. Christopher Date thought that in properly normalised designs, tables should never use nullable columns. Codd disagreed. [0]
> If you're suggesting mandatory additional has_address and has_customer_id fields, I would disagree. You'd be reinventing a database tool that already exists precisely for that purpose.
The way to do it without using a nullable column is to introduce another table for the 'optional' data, and use a left outer join.
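For concreteness, a minimal sketch of what I mean (the table and column names are invented for illustration): the optional address lives in its own table keyed by the customer, and queries that need it use a left outer join.

    CREATE TABLE Customer (
        CustomerId  INT PRIMARY KEY,
        Name        VARCHAR(100) NOT NULL
    );

    -- The optional data gets its own table; a row exists only when the data does.
    CREATE TABLE CustomerAddress (
        CustomerId  INT PRIMARY KEY REFERENCES Customer (CustomerId),
        Address     VARCHAR(200) NOT NULL
    );

    -- Customers without an address simply come back with a NULL produced by the join.
    SELECT c.CustomerId, c.Name, a.Address
    FROM Customer c
    LEFT OUTER JOIN CustomerAddress a ON a.CustomerId = c.CustomerId;

Neither base table has a nullable column; the null only appears in the result of the outer join.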
> The way to do it without using a nullable column
I mean, you could, but having separate tables for every optional field would be an organizational and usability nightmare. Queries would be longer and slower for no good reason. Not to mention a gigantic waste of space with all those repeated primary keys and their indexes.
And you could have databases that prohibited NULL values, but we mostly don't, because they're so useful.
Right, Date's idea didn't catch on. I'm not aware of any industrial-strength RDBMS that prohibits storing null.
Agreed that queries would tend to be longer as you'd need joins, although views could help, especially for read operations (sketch at the end of this comment).
Regarding storage-efficiency and query speed, agreed that it could well hurt both, but it's going to depend.
If a column holds null in almost all rows, we would expect it to be more space-efficient to use a separate table and a left outer join.
Query speed could also improve for queries that don't reference the nullable column, as the 'main' table would be smaller in storage. (I'm assuming a rowstore here.)
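To make the views point concrete, a sketch using the hypothetical Customer / CustomerAddress tables from my earlier comment: reads go through a view that flattens the join, so callers see what looks like a single table with a nullable column, while the base tables stay null-free and the 'main' table stays narrow.

    CREATE VIEW CustomerWithAddress AS
    SELECT c.CustomerId, c.Name, a.Address
    FROM Customer c
    LEFT OUTER JOIN CustomerAddress a ON a.CustomerId = c.CustomerId;

    -- Reads look exactly like querying one wide table with a nullable Address column.
    SELECT CustomerId, Name, Address
    FROM CustomerWithAddress
    WHERE CustomerId = 42;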
Or maybe NULLs are actually a great solution here, and it's fine.
The idea that having a separate table for every optional field is unworkable isn't an issue with SQL. It's a fundamentally overcomplicated idea. It's like a programming language where every variable that could be null had to be put in its own file. It would be terrible design.
> The idea that having a separate table for every optional field is unworkable isn't an issue with SQL.
It sure is. Consider a database language that innately supported algebraic data types. Instead of:
    table Contact { int Id; Nullable<string> email; Nullable<string> phoneNo; }
you have:
    type PhoneOrEmail = Phone<string> | Email<string>
    table Contact { int Id; PhoneOrEmail info; }
This completely clarifies the relationships between the nullable columns (can both be null, or only one null?), and the database storage layer would manage how to actually store this. That ambiguity in the nullable-column version is a direct consequence of SQL's semantics and how it implements the relational calculus.
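For comparison, a rough sketch of how you'd have to encode that intent in SQL today, with a CHECK constraint doing the work the type system would otherwise do (the names are just illustrative, following the Contact example above):

    CREATE TABLE Contact (
        Id       INT PRIMARY KEY,
        Email    VARCHAR(255),
        PhoneNo  VARCHAR(64),
        -- Exactly one of the two must be present; nothing in the column
        -- declarations alone expresses this relationship.
        CHECK (
            (Email IS NOT NULL AND PhoneNo IS NULL) OR
            (Email IS NULL AND PhoneNo IS NOT NULL)
        )
    );

The constraint recovers the 'exactly one of' rule, but it lives off to the side instead of being part of the column's type, which is exactly the ambiguity the sum type removes.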
That seems like a logical nightmare to me. Think about which fields get returned, which fields you index over, and so on. It's a nice idea in theory, but it would add an astronomical level of complexity in practice.
If the database supports algebraic data types, you wouldn't have to think about which fields get returned; it understands the needed semantics and handles that for you. Like I said, these are SQL limitations.
You'd need to handle it in your code. The last thing I want is a database that returns different columns per row.
No, these aren't SQL limitations; it's a design that is super complex. Figuring out how to index over these multi-type fields isn't a SQL limitation, it's a hard engineering problem.
> The last thing I want is a database that returns different columns per row.
Actually that's exactly what you'd want, because with properly normalized disjoint data sets it saves you from running two different queries in those cases, and moves more of the domain's constraints into the database schema where they belong.
Definitely not what I want. I'm perfectly happy getting both fields at the same time in a single query the way I do now.
> and moves more of the domain's constraints into the database schema where they belong.
No, I prefer to keep my domain constraints at the application level where they're far more flexible.
The database is for storing information, not for validating my business logic.
I mean, I realize some people want to build some of that logic into the database, especially when many applications interact with it. But it's not a superior design pattern. If you have a single application, it's perfectly valid and desirable to put all business logic in the application, not the database.
Over 90% of my optionals have their own tables and it’s the cleanest and most maintainable database I’ve ever worked with. I will always design databases this way going forward. That’s my experience.
I remember working on ERP systems with 40+ column tables, most of which were null, with no clear constraints on which options should or shouldn't enable, or make mandatory, other options. Those relationships become incredibly obvious and natural when you group the options into additional tables.
My tables are incredibly concise and the cache loves this.
400 columns all of which are nullable screams "dynamic field set" to me. Why have all of those as columns rather than something like:
    table Entity_Fields (ParentEntityId : int not null, FieldId : int not null, IntValue : int, TextValue : varchar(MAX), DateValue : datetime, ...)
Or per the OP's suggestion, a table per field type:
    table Entity_IntFields (ParentEntityId : int not null, FieldId : int not null, Value : int not null)
    table Entity_TextFields (ParentEntityId : int not null, FieldId : int not null, Value : varchar(MAX) not null)
    table Entity_DateFields (ParentEntityId : int not null, FieldId : int not null, Value : datetime not null)
> 400 columns all of which are nullable screams "dynamic field set" to me
Could have been, but no. And doing it like you suggest would mean overview grids would have to do 50+ subqueries for each row, and loading a record would mean hundreds of queries.
And insertion performance would crater I assume, since the DB now needs to do hundreds of inserts per record rather than a single row. We do have some cases where we get 100k inserts per hour once a day or so. And this is just the main table, we have many child tables already, though they're not nearly as wide.
I think a more realistic split would have been to split the main table into maybe 10 tables or so at most. That would still result in a fair number of subqueries for grids and such, but not that bad.
> And doing it like you suggest would mean overview grids would have to do 50+ subqueries for each row, and loading a record would mean hundreds of queries.
Only if you insist on loading the data into a flat record type with all of those fields. You have a dynamically evolving datatype, so your class should also be dynamic:
    using System;
    using System.Collections.Generic;

    // One enum member per dynamic field; it grows as the schema evolves.
    public enum FieldId
    {
        Field1,
        Field2,
        //...
    }

    public class YourRecord
    {
        // One dictionary per underlying value type; a field with no value is
        // simply absent from its dictionary rather than stored as null.
        public Dictionary<FieldId, string> TextValues { get; set; }
        public Dictionary<FieldId, int> IntValues { get; set; }
        public Dictionary<FieldId, DateTime> DateValues { get; set; }
        // ...
    }
At most the number of subqueries increases by the number of types your class needs. SQL has a fixed number of datatypes, and most apps use a small subset of these (dates, text, integral, floating-point and decimal types; that's typically 5 at most). You could collapse these all into a single table with optional columns for each value too, and then it's only one query.
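To make that concrete, a rough sketch against the hypothetical per-type tables from upthread (Entity_IntFields and friends; the @EntityId parameter is just illustrative): loading a record is one query per value-type table, each returning all of that record's fields of that type, not one query per field.

    -- One round trip fetches every integer-valued field for the record.
    SELECT FieldId, Value
    FROM Entity_IntFields
    WHERE ParentEntityId = @EntityId;

    -- Same shape for Entity_TextFields, Entity_DateFields, etc.; the results
    -- populate the IntValues / TextValues / DateValues dictionaries above.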
> And insertion performance would crater I assume, since the DB now needs to do hundreds of inserts per record rather than a single row. We do have some cases where we get 100k inserts per hour once a day or so.
That's a fairly small amount of data; I don't really see an issue here.
Also, you're neglecting the fact that since the fields are independent, adding a field value doesn't involve locking and rewriting a large 400-column row to disk, but only appending a tiny 4-column row to a separate, distinct table.
> At most the number of subqueries increases by the number of types your class needs.
Fair enough. Application logic will want to access the data as a flat record, but that could be handled through getters. We have views which are also used by customers and our reporting tool, but we're moving the customers to API access, and reporting could probably be done with something better if we were starting from scratch.
> That's a fairly small amount of data, I don't really see an issue here.
Well it turns 100k inserts into tens of millions. On tables where users need to work without slowdown while this is going on.
It probably works if you throw enough hardware at it, but currently we get by with quite modest DB hardware.
That said, how do you create compound indexes over these fields? Say you need to index one date and one varchar column? Such demands can arise down the line and are often hard to predict up front.
> Well it turns 100k inserts into tens of millions.
I don't think "number of inserts" is the right metric, because the total number of bytes written is almost the same; it's just written in different areas and still mostly contiguously. I think "number of distinct tables being written" is a better metric. Assuming all 400 columns become 400 records in the data-type tables, say evenly divided among the 4 most common data types (int, decimal, text, date), and each record's rows for a given table go in as one batched insert, that would be more like (4 or 5) x 100k = 400k-500k inserts. I would still hesitate to naively compare it this way without a benchmark though, because with 4 or 5 tables being written there's also less contention than there is when everyone is writing to 1 table.
Regarding indexes, you can index the field tables, but obviously such an index covers the whole field table rather than one specific field. Without more understanding of your domain I can't really say if this breaks down. There's also the possibility that this one table is serving two competing goals, e.g. perhaps the user should be able to add data incrementally (so this pattern applies), but then at some point when all of the data is filled it should be migrated to an actual table with all non-null columns that can be indexed and queried/processed as you usually do.
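For what it's worth, a sketch of what indexing looks like with the hypothetical field tables from upthread (parameter names are invented): each per-type table gets an index on (FieldId, Value), and a condition spanning a date field and a text field becomes a join between the two field tables rather than a single composite index.

    -- Supports "which entities have date field F in some range" lookups.
    CREATE INDEX IX_Entity_DateFields_Field_Value
        ON Entity_DateFields (FieldId, Value, ParentEntityId);

    -- A compound condition across a date field and a text field:
    SELECT d.ParentEntityId
    FROM Entity_DateFields d
    JOIN Entity_TextFields t
        ON t.ParentEntityId = d.ParentEntityId
    WHERE d.FieldId = @SomeDateFieldId AND d.Value >= @Since
      AND t.FieldId = @SomeTextFieldId AND t.Value = @ExpectedText;

Whether that join performs acceptably compared to one composite index on a conventional wide table is exactly the kind of thing I'd benchmark rather than assert.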
In any case, what I've sketched out isn't really new; it's been used at least since the 90s under "Dynamic Object Model" and "Entity-Attribute-Value" [1,2], so if it was usable on 90s hardware I can't imagine the pattern would be unusable on modern hardware.
At that point why not abandon databases entirely and use a key-value store?
The answer, obviously, is because traditional tables make lots of things really easy. SQL is designed for that use case, and it's performant.
You're not even talking about traditional relational databases anymore. You're trying to construct tables inside of tables, which means abandoning performance, indexes, etc.
> At that point why not abandon databases entirely and use a key-value store?
Because there is a schema, it's just a dynamically evolving one. Also, presumably the rest of your system depends on the relational DB, so why multiply your dependencies unnecessarily?
> You're not even talking about traditional relational databases anymore.
There's nothing non-relational about the schema I've outlined; it matches what the domain requires with less noise than 400 nullable columns.
This sounds like the kind of healthcare claims data that I often analyze for work. Just in case you have a team of seven doctors working on you at the same time while you're in hospice, they keep a column for just about everything.
I'm not sure I follow. Matrix multiplication isn't inherently 'branchy' in a way that we would expect to cause inefficient execution on SIMT (i.e. branch divergence).
I think the remark is more about how Tensor Cores (or Matrix Cores in AMD lingo) are distributed per SM (rather than sitting off on an interconnect as individually programmable units), so on the same SM you have your classical warps (CUDA cores) AND the Tensor units, and switching between one and the other might be confusing.
My vision of SMs has always been "assume AVX512 is the default ISA" and "tensor cores are another layer alongside this" (kind of like AMX), and you have this heterogeneous "thing" to program. Don't know if it helps. The CUDA programming model hides a lot, and looking at PTX code in nsight-compute is most enlightening.
> Readability is a non-concept really. Maxwell's famous equations are readable to some and absolutely impenetrable to the rest of us.
When we talk about a language's readability we're typically talking about 'accidental complexity', to use Brooks' term [0], and not the 'essential complexity' of the problem being solved. For a hairy enough algorithm, even pseudocode can be difficult to understand.
Readability applies in mathematics too, as a bad notation may make formulae unnecessarily difficult to comprehend.
> So when someone says "code should be readable", to whom exactly?
I'll have a go: to another competent engineer familiar with the general problem domain but not familiar with your specific work. This includes yourself in 2 years' time.
This seems rather like the question of readability for scientific writing. Research papers should be readable to other researchers in the field, but they aren't generally expected to be readable to a general audience.
Don't introduce any clumsy nonstandard transformations like 'rotating' a matrix; just highlight the relevant row and column. [1]
[0] https://news.ycombinator.com/item?id=41402224
[1] https://www.mathsisfun.com/algebra/images/matrix-multiply-a.... (from https://www.mathsisfun.com/algebra/matrix-multiplying.html )