
> At some point transactional workload volume will grow to what we call "big data" today, and we'll be forced to use more storage-efficient columnar data representations for OLTP.

I wonder.

I built some proof-of-concept storage using row representations with a novel compression scheme designed to compete with compressed columnar formats for space while remaining fast.

The result is very storage efficient compared with the competing alternatives I started from.

It has fewer writes per stored row than columnar storage when you don't have large batches of rows to write (often the case with OLTP), and fewer reads per full row when doing lots of random-access point queries (sometimes the case with OLTP, and certainly the case for the application I did the tests for). I didn't compare with a compressed columnar representation, but in theory (there was some compression theory) it should take the same or less space in many cases, as well as using less I/O.
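To make the write/read intuition concrete, here's a toy page-touch model (illustrative only, with made-up page counts, not measurements from the POC):

    # Toy model: count page touches for a single-row insert and a full-row
    # point read, row layout vs. columnar layout.  Assumes the row fits in
    # one page in row form, and that each of C columns lives in its own
    # page/file in columnar form -- ignores WAL, caching and indexes.
    def page_touches(num_columns, layout):
        if layout == "row":
            return {"insert": 1, "point_read": 1}
        if layout == "columnar":
            return {"insert": num_columns, "point_read": num_columns}
        raise ValueError(layout)

    for layout in ("row", "columnar"):
        print(layout, page_touches(num_columns=20, layout=layout))
    # row {'insert': 1, 'point_read': 1}
    # columnar {'insert': 20, 'point_read': 20}

Large batches amortise the columnar cost across many rows, which is why the advantage shows up mainly when writes arrive a few rows at a time.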

(It also was neither a B-tree nor an LSM-tree, but had some desirable performance characteristics of both; that's another story. Not a fractal tree / B^ε tree either, as people sometimes ask.)

Presumably, well-funded groups are out there doing a much better job than I did in my spare time, so I wouldn't be surprised if "big data" OLTP doesn't end up requiring columnar representations after all, or if a scheme such as the one I used ends up competing with columnar formats on space and performance.



I've also been thinking about future data representations; in particular, some blend of LSM and fractal trees combined with a blend of a generalized hierarchical prefix encoding and delta encoding. I'd be interested to see what you've got.
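For the prefix/delta part, roughly this kind of thing, as a toy sketch on sorted keys (a hypothetical example, nothing more):

    # Toy sketch: sorted byte-string keys become (shared-prefix length, suffix)
    # pairs, and sorted integers become successive differences.  A real design
    # would layer hierarchical/restart points and an entropy coder on top.
    def prefix_encode(sorted_keys):
        prev, out = b"", []
        for key in sorted_keys:
            shared = 0
            while shared < min(len(prev), len(key)) and prev[shared] == key[shared]:
                shared += 1
            out.append((shared, key[shared:]))
            prev = key
        return out

    def delta_encode(sorted_ints):
        prev, out = 0, []
        for value in sorted_ints:
            out.append(value - prev)
            prev = value
        return out

    print(prefix_encode([b"user:1001", b"user:1002", b"user:1017"]))
    # [(0, b'user:1001'), (8, b'2'), (7, b'17')]
    print(delta_encode([1000, 1003, 1004, 1010]))
    # [1000, 3, 1, 6]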

> I built some proof-of-concept storage using row representations with a novel compression scheme designed to compete with compressed columnar formats. ... I didn't compare with a compressed columnar representation, but in theory ... it should take the same or less space ... as well as using less I/O.

Some of your statements felt a bit discordant when placed together.


> Some of your statements felt a bit discordant when placed together.

Ah, thank you, I think I see what you mean.

I designed compression methods with field-aware entropy coding in row form, while maintaining enough indexing structure (also compressed) for logarithmic-time queries and updates. I did this while being aware of columnar compression and wanting to replicate a similar size benefit on data that compresses particularly well in columnar form: low-cardinality fields (e.g. booleans and small enums), which compress very well when partially sorted, and data like time series and similar series over branching DAGs, which compress well with a delta compressor or another predictor.
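By "field-aware" I mean something along these lines: each field position gets its own model, while values are still emitted in row order. A minimal sketch, using per-field empirical entropy as a stand-in for a real entropy coder, with made-up data:

    import math
    from collections import Counter

    # Minimal sketch of field-aware coding cost: one frequency model per field,
    # each row charged sum(-log2 p(value)) under its field's model.  A real
    # coder would be adaptive and contextual rather than using static counts.
    def field_aware_bits(rows):
        columns = list(zip(*rows))                    # per-field views of the data
        models = [Counter(col) for col in columns]    # one model per field
        n = len(rows)
        bits = 0.0
        for row in rows:
            for i, value in enumerate(row):
                bits += -math.log2(models[i][value] / n)
        return bits

    rows = [("GB", True, 200), ("GB", True, 200), ("US", False, 404), ("GB", True, 200)]
    print(round(field_aware_bits(rows), 1), "ideal bits for", len(rows), "rows")

Low-cardinality fields end up cheap because their models are sharply peaked, which is the same effect that makes them compress well columnwise.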

Using some concepts from compression theory I convinced myself that an optimal structure of this type in row form is size-equivalent (up to rounding errors) to an optimal encoding of the data in columnar form, and degrades in a similar way when used with lower-quality probability models. Notably, with suitable probability models, run-length encoding, numeric delta coding (such as time series), string delta coding, and low-cardinality values in compressed columnar form appear to be roughly matched in theory.
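For one concrete case from that list: when rows arrive in time order, delta-coding a timestamp field across successive rows yields the same residual stream as delta-coding that field stored as a column in the same order, so an ideal entropy stage over the residuals costs the same either way. A toy illustration, ignoring block boundaries:

    # Toy illustration: the timestamp field delta-coded across successive rows
    # equals the delta-coded column extracted from the same rows, so the
    # downstream entropy-coding cost is identical for that field.
    rows = [
        (1700000000, "sensor-a", 21.5),
        (1700000010, "sensor-a", 21.6),
        (1700000020, "sensor-b", 19.9),
        (1700000031, "sensor-a", 21.7),
    ]

    def deltas(values):
        out, prev = [], 0
        for v in values:
            out.append(v - prev)
            prev = v
        return out

    row_side = deltas([row[0] for row in rows])     # field visited row by row
    column_side = deltas(list(zip(*rows))[0])       # same field as a column
    print(row_side == column_side, row_side)        # True [1700000000, 10, 10, 11]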

At the same time, the hidden keys stored in columnar form to connect columns together are not required, which saves space in the row form; instead, the new indexing structures that ensure logarithmic-time queries take up space (optionally, if you want the faster lookups), with different bounds. So the two are somewhat balanced, depending on how the hidden keys are represented.

So I did some comparisons between row and column forms at a theoretical, more abstract level, considering columnar compression for different types of data and designing the rows and indexing structure with that in mind.

However, I didn't compare what I implemented against a real compressed-columnar implementation over the same data, to see whether it worked as well in practice as it did in theory.

What I did instead was try out parts of it on some existing large databases, in the terabytes range, to confirm how well various approaches reduced the space required compared with other implementations that also used a row structure.

I knew that the databases the POC was designed for were hitting I/O bottlenecks before CPU bottlenecks in some phases of data usage (though not others) under other implementations, so I had some leeway to spend more CPU on compression to reduce streaming I/O, and motivation to reduce the number of random-access I/Os by using an appropriately clustered representation, i.e. row form for most fields, together with an appropriate inter-block tree structure. In the process I found some ways that intra-block field compression and inter-block merge+lookup structures complement each other, each helping the other perform better.
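The clustering side of that, in miniature (a generic sketch using zlib and a sparse in-memory index, not the POC's actual block format or tree structure):

    import bisect, json, zlib

    # Generic sketch: rows are clustered into blocks, each block is compressed
    # as one unit, and a small in-memory index of the first key per block is
    # kept, so a point lookup decompresses only one block -- roughly one
    # random read's worth of I/O.
    BLOCK_ROWS = 1000

    def build(rows_sorted_by_key):
        blocks, index = [], []
        for start in range(0, len(rows_sorted_by_key), BLOCK_ROWS):
            chunk = rows_sorted_by_key[start:start + BLOCK_ROWS]
            index.append(chunk[0][0])                       # first key in the block
            blocks.append(zlib.compress(json.dumps(chunk).encode()))
        return index, blocks

    def point_lookup(index, blocks, key):
        pos = bisect.bisect_right(index, key) - 1           # candidate block
        if pos < 0:
            return None
        chunk = json.loads(zlib.decompress(blocks[pos]))
        return next((row for row in chunk if row[0] == key), None)

    rows = [[i, f"payload-{i}"] for i in range(10_000)]
    index, blocks = build(rows)
    print(point_lookup(index, blocks, 4321))                # [4321, 'payload-4321']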

Between all the tests and theorising I reached the view that it "should take the same or less space ... as well as using less I/O" (for the applications being targeted in particular), but I haven't actually implemented a columnar equivalent or compared against one to confirm this view. It's something I would like to do in due course, though using a configurable row/column hybrid rather than fully one or the other. This is work in progress, but spare time has been elusive (it's a side project, and I've spent recent months on another side project, plus the "day" job).



