Column – High-performance, columnar, in-memory store with bitmap indexing in Go (github.com/kelindar)
157 points by ngaut on June 21, 2021 | hide | past | favorite | 51 comments


I love seeing stuff like this; getting a better understanding of the layers underlying high-performance data analytics is super interesting to me.

This project seems very similar to Apache Arrow, if OP or anyone else is around to explain why one might be used over the other that would be great.


> This project seems very similar to Apache Arrow, if OP or anyone else is around to explain why one might be used over the other that would be great.

Arrow is primarily a serialization format to transfer data between distributed systems. It uses zero copy and other techniques to quickly process and store large data sets in memory.

Other libraries allow you to query Arrow data once processed.

This project is an in-memory columnar data store with querying and other capabilities.


Great stuff. Can it work with larger-than-memory datasets? Is there a way to limit resource consumption? Or will the process just blow up in such a case?


It's actually possible: columns are simple Go interfaces and can be defined or re-defined for specific types. You can easily build an implementation of a column that actually loads data from disk or even a remote server (RDBMS, S3, ..?) and retains the indexing capability.

On the flip side, you could actually fit more data in memory than with non-columnar methods: since the storage is column-by-column, it compresses very well. For example, boolean values are stored as bitmaps in this implementation, and strings could be stored in a hash map so that only one copy of each distinct string is kept in memory, even if you have millions of rows.
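To make the two compression ideas above concrete, here is a minimal sketch. It is not the library's actual code: `boolColumn`, `dictColumn` and their methods are hypothetical names made up for illustration. It packs booleans 64 to a word and stores each distinct string once, with a small integer code per row.

```go
package main

import "fmt"

// boolColumn (hypothetical) packs one boolean per bit, so 64 rows share one uint64.
type boolColumn struct {
	words []uint64
}

// Set stores v at row idx, growing the backing slice when needed.
func (c *boolColumn) Set(idx uint32, v bool) {
	word, bit := idx/64, idx%64
	for int(word) >= len(c.words) {
		c.words = append(c.words, 0)
	}
	if v {
		c.words[word] |= 1 << bit
	} else {
		c.words[word] &^= 1 << bit
	}
}

// Get reports the value at row idx; rows never written read as false.
func (c *boolColumn) Get(idx uint32) bool {
	word, bit := idx/64, idx%64
	return int(word) < len(c.words) && c.words[word]&(1<<bit) != 0
}

// dictColumn (hypothetical) stores each distinct string once and keeps a code per row.
type dictColumn struct {
	codes  map[string]uint32 // distinct value -> code
	values []string          // code -> distinct value
	rows   []uint32          // one code per row
}

func newDictColumn() *dictColumn {
	return &dictColumn{codes: make(map[string]uint32)}
}

// Append adds a row, reusing the existing code when the value was seen before.
func (c *dictColumn) Append(v string) {
	code, ok := c.codes[v]
	if !ok {
		code = uint32(len(c.values))
		c.codes[v] = code
		c.values = append(c.values, v)
	}
	c.rows = append(c.rows, code)
}

// Get returns the string stored at row idx.
func (c *dictColumn) Get(idx int) string { return c.values[c.rows[idx]] }

func main() {
	var active boolColumn
	active.Set(3, true)
	active.Set(70, true)
	fmt.Println(active.Get(3), active.Get(4), len(active.words))

	classes := newDictColumn()
	for _, v := range []string{"mage", "rogue", "mage", "mage"} {
		classes.Append(v)
	}
	fmt.Println(classes.Get(2), len(classes.values))
}
```

With four rows and two distinct values, the dictionary holds two strings and four 4-byte codes; the saving grows with the number of repeated values.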


Is there a Go equivalent of Calcite? If so, could probably bolt that onto the query path and work in the logical plan translation to the physical plan - which is the query API that’s currently provided.



Calcite is Java based, hence my question


Very cool. This kind of storage is similar to what's typically being used in Entity Component Systems like EnTT [0], which can also be interpreted as in-memory column oriented databases.

Recently I'm starting to like this type of programming over OOP. Each part of your application accesses a global in-memory DB with the full state. But unlike a traditional persistent db it's really fast.

[0] https://github.com/skypjack/entt


The same developer has an open-source entity-component-system, as well: https://github.com/kelindar/ecs


Wait, no: that repo was an experiment. I'll be rebasing it and finally building a real ECS on top of the columnar storage library.


Something that gets lost is that this is also a variety of OOP.

https://www.amazon.com/Component-Software-Object-Oriented-Pr...

Programming against Objective-C protocols, COM interfaces, Component Pascal framework, and so forth.


What? In ECS, state is managed separately from logic and there is no inheritance. How is it a variety of OOP?


OOP is not inheritance, just one possible trait among OOP implementations, just like FP isn't Haskell.

Component programming with interface separated from state is exactly what Objective-C protocols, COM, VBX, SOM, Component Pascal were all about.

Those that promote ECS as not being OOP 99% of the times never read books like the one I linked on my comment.

Instead they reference a talk done at GDC by one of the very first engines that made it well known to those that never read CS papers or books.


Interfaces/protocols aren’t really the same though. An interface defines capabilities for an object; the capabilities are directly associated with the type.

In ECS (where Components are a bag of data, and systems handle all logic/operations), and like a DB, the object is defined by an id and its relations; from the relations, you can derive available capabilities.

That is, an object tells you what it can do. In a DB, what it can do tells you the object.

You can create the same system with interfaces by simply ignoring the methods part of it, and keeping the data part, but associating data with capabilities is pretty much the defining difference between objects and structs.

More importantly from an architectural perspective, in ECS the logic isn’t associated with the object, it’s associated with a system that takes the object as input. The system is shared across all objects. The object (entity) for ECS is little more than an id and some relations.

An ECS very directly corresponds to an RDBMS. To call it OOP is to deny the ORM’s classic Object-Relational mismatch.
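For readers who haven't seen the pattern, a minimal sketch of the layout described above might look like this. All names (`Entity`, `Position`, `Velocity`, `World`, `moveSystem`) are made up for illustration and are not taken from any particular ECS framework.

```go
package main

import "fmt"

// Entity is just an identifier; it carries no data and no methods.
type Entity uint32

// Components are plain data, stored per type and keyed by entity.
type Position struct{ X, Y float64 }
type Velocity struct{ DX, DY float64 }

// World holds one table per component type, much like tables in a database.
type World struct {
	positions  map[Entity]Position
	velocities map[Entity]Velocity
}

// moveSystem holds the logic. It runs over every entity that has both a
// Position and a Velocity, which is effectively a join of the two tables.
func moveSystem(w *World, dt float64) {
	for e, v := range w.velocities {
		p, ok := w.positions[e]
		if !ok {
			continue // entity has no position, so this system ignores it
		}
		p.X += v.DX * dt
		p.Y += v.DY * dt
		w.positions[e] = p
	}
}

func main() {
	w := &World{
		positions:  map[Entity]Position{},
		velocities: map[Entity]Velocity{},
	}
	w.positions[1] = Position{0, 0}
	w.velocities[1] = Velocity{1, 2}
	w.positions[2] = Position{5, 5} // no velocity, so moveSystem skips it

	moveSystem(w, 0.5)
	fmt.Println(w.positions[1], w.positions[2])
}
```

What an entity "can do" here follows from which tables contain its id, not from a type it was declared with.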


An interface in a component object model can be made only of properties.

Secondly most languages with OOP support aren't Smalltalk/Java, rather multi-paradigm, e.g. Objective-C, Component Pascal, C++, Delphi, Python, among others when Component Programming came into CS papers for the first time.

To argue that Component Programming is not OOP is just religious hate that shows lack of knowledge regarding CS literature.


> An interface in a component object model can be made only of properties.

Yes, I addressed this case. It can be done, but it rips the “object” out of it — you’re no longer sensibly organizing things under an “OOP” paradigm, but something else altogether. You’re quite directly reducing the “object” to a struct.

> Secondly most languages with OOP support aren't Smalltalk/Java, rather multi-paradigm, e.g. Objective-C, Component Pascal, C++, Delphi, Python, among others when Component Programming came into CS papers for the first time.

That’s fine, but doesn’t really support a case that it’s an “OOP” architecture/mindset/organization. In fact, that rather undermines your case..? That a language which isn’t interested in being purely OOP (is any?) is happy to introduce an arbitrary construct is not an argument that the construct must therefore be “OOP”.

> To argue that Component Programming is not OOP is just religious hate that shows lack of knowledge regarding CS literature.

It’d be more about diluting the term into nothingness — closures are a poor man’s objects, and objects are a poor man’s closures. It’s technically true, but it wouldn’t be useful to go ahead and describe all functional programming as “OOP”, because the main interest in defining the terms is to indicate architectural and logical flow/patterns one would expect under such a paradigm/organization.

Interfaces and classes with no defined methods look and feel much closer to Haskell and C than to C#, Python and Smalltalk.

OOP is not a technical definition. It’s an organizational strategy, and the term itself is just a marker on a continuum. Everything is OOP, and nothing is, just as all things are Turing machines. But that’s not the point.


That is surely the point, everything else is religious hate without any technical substance.

Apparently me in the mid-90's porting a particle engine from Objective-C on NeXT to Visual C++ using COM on Windows, with a component-based architecture, did not happen.


I’m not clear — did you just ignore everything I wrote and mindlessly restate your claim, in a manner akin to one undergoing a fit of religious fervor?


> OOP is not inheritance, just one possible trait among OOP implementations, just like FP isn't Haskell.

I didn't say it was. The key part was "state is managed separately from logic."

> Component programming with interface separated from state

There are no such things as interfaces in ECS. Interfaces are a way of describing how state is bundled with logic. ECS does not do that.

> Those that promote ECS as not being OOP 99% of the times never read books like the one I linked on my comment.

I'll admit that I have not read that book. Your condescending appeal to authority here doesn't actually promote conversation.

Please, tell me what the "objects" are in ECS or what else qualifies it as OOP?


There are no "objects" in ECS, rather Components as the name clearly states.

Funny why do most ECS frameworks use interfaces/traits/pure virtual classes/static polymorphism then, since they don't exist according to you?

https://github.com/skypjack/entt

Component-oriented programming is a subset of OOP; as for why that is, I provided a book, feel free to educate yourself.

More CS papers are available on demand.


> I didn't say it was. The key part was "state is managed separately from logic."

But state is managed separately from logic pretty much anywhere where you don't store lambdas as fields of structs.


How is using an in-memory database related to OOP? They seem completely orthogonal to me.


It is not. It is related to ECS which is being contrasted with OOP.


I've wondered about bitmap indexing before. Is it an optimization for speed, too, or just memory?

If I had an array of a million things, and I wanted to specify some large subset of them via a separate million element array (like in numpy/pandas), is it faster to do it via a million bytes or via a million bits (ie I think this is bitmap indexing, right?). I would think that the bytes would be faster, even though terribly wasteful of memory. From my rudimentary knowledge of CPUs I thought they didn't really operate at the bit level, and so you'd have to do a few instructions of calculations. Or would it be made up for by the cache line reading in more of the indexer in one fetch?
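One way to look at the bytes-versus-bits question: a million bits is about 125 KB instead of 1 MB, and combining two selections touches 64 rows per machine word. A rough sketch, assuming plain `[]uint64` bitmaps (the `bitmap` type and its helpers are made up for illustration, not the project's actual index type); which representation wins in practice will depend on selectivity and access pattern, so treat this as something to benchmark rather than a verdict.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitmap marks selected rows with one bit each; word i covers rows 64*i to 64*i+63.
type bitmap []uint64

func newBitmap(rows int) bitmap { return make(bitmap, (rows+63)/64) }

// set marks a single row as selected.
func (b bitmap) set(row int) { b[row/64] |= 1 << uint(row%64) }

// and intersects two selections 64 rows at a time; both must have the same length.
func and(a, b bitmap) bitmap {
	out := make(bitmap, len(a))
	for i := range a {
		out[i] = a[i] & b[i]
	}
	return out
}

// count returns the number of selected rows, one population count per word.
func (b bitmap) count() int {
	n := 0
	for _, w := range b {
		n += bits.OnesCount64(w)
	}
	return n
}

// each calls fn for every selected row in ascending order, skipping zero words.
func (b bitmap) each(fn func(row int)) {
	for i, w := range b {
		for w != 0 {
			fn(i*64 + bits.TrailingZeros64(w))
			w &= w - 1 // clear the lowest set bit
		}
	}
}

func main() {
	const rows = 1000000
	even, third := newBitmap(rows), newBitmap(rows)
	for r := 0; r < rows; r++ {
		if r%2 == 0 {
			even.set(r)
		}
		if r%3 == 0 {
			third.set(r)
		}
	}
	both := and(even, third) // rows divisible by 6
	fmt.Println(len(both)*8, "bytes,", both.count(), "rows selected")
}
```

Extracting individual rows does cost a few extra instructions per bit, as the question suspects; the win comes from set operations and counts that never look at individual rows, and from the index fitting in cache.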


I'm a little naive on this subject, but just wondering what are the use cases for in-memory columnar stores? I was under the impression that columnar stores are good for OLAP use cases involving massive amounts of data. For datasets that fit within memory, are there still benefits in organizing data in a columnar manner and are the performance gains appreciable?


Also see my comment above, but you find this kind of storage commonly in game development [0] where you are optimizing for batch access on specific columns to minimize cache misses. It's usually used as the storage layer for Entity Component Systems. It's also called data-oriented design [1]

[0] http://cowboyprogramming.com/2007/01/05/evolve-your-heirachy...

[1] https://en.wikipedia.org/wiki/Data-oriented_design
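A small sketch of the difference, with hypothetical `Player` and `Players` types: the row-oriented version keeps whole structs together, the column-oriented one keeps each field in its own slice, so a scan over one field reads a dense run of values.

```go
package main

import "fmt"

// Row-oriented: each player is one struct, so scanning a single field
// pulls the other fields through the cache as well.
type Player struct {
	Name   string
	Level  int32
	Health float32
}

// Column-oriented: each field lives in its own contiguous slice.
type Players struct {
	Names   []string
	Levels  []int32
	Healths []float32
}

// Append adds one row by pushing a value onto every column.
func (ps *Players) Append(p Player) {
	ps.Names = append(ps.Names, p.Name)
	ps.Levels = append(ps.Levels, p.Level)
	ps.Healths = append(ps.Healths, p.Health)
}

// sumLevelsRows walks whole structs to read one field from each.
func sumLevelsRows(ps []Player) int64 {
	var total int64
	for i := range ps {
		total += int64(ps[i].Level)
	}
	return total
}

// sumLevelsColumns reads only the Levels slice, a dense run of int32 values.
func sumLevelsColumns(ps *Players) int64 {
	var total int64
	for _, l := range ps.Levels {
		total += int64(l)
	}
	return total
}

func main() {
	rows := []Player{{"ann", 3, 90}, {"bob", 7, 40}, {"cy", 5, 75}}
	var cols Players
	for _, p := range rows {
		cols.Append(p)
	}
	fmt.Println(sumLevelsRows(rows), sumLevelsColumns(&cols))
}
```

Both functions return the same sum; the difference is how many bytes each one has to pull through the cache per row, which only shows up at scale and is worth measuring for your own data.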


thanks!


I’m not sure about any performance gains or working with large datasets, but the ancient Metakit[1] was just a really pleasant relational algebra library ( ≠ SQL data model library, it could do e.g. relations as values which are difficult for row-oriented databases). I’d say that Metakit & OOMK in Tcl is strictly better than the relational part of Pandas in Python, except the documentation is somewhere between bad and nonexistent.

[1]: https://git.jeelabs.org/jcw/metakit


Not a subject matter expert, but a few that come to mind: memory can become a bottleneck; reading sequential data instead of chasing pointers, reading useless data and thrashing caches gives much better throughput; and compression applied to columnar data is more efficient and can give a throughput boost when memory bandwidth becomes a bottleneck on systems with a high number of CPUs.


I am merely a dabbler in this area and definitely not an expert, but my understanding is that columnar stores tend to be substantially more efficient for analytical operations over large sets of in memory data by virtue of the data being easier to operate on with vectorized instructions like SIMD.


Do you have a (Docker) container that can be used for trying it out?


OT: Not a Go dev here but have some side projects written in it... Isn't docker for Go a bit unorthodox? I had a few nice headaches setting up my local env to use docker with Go to mirror my python workflow (all projects have a Dockerfile, no dependencies installed locally). I was under the impression that Pro Go people do not use docker for local Go dev. Please correct me if I am wrong.


Docker and go work fine together but using docker for go dev is just an unnecessary hassle, especially if (like me) you’re doing dev in MacOS - you have to cross compile to Linux which is slower, and then build and deploy the container - versus the very quick compile-run cycle of regular Go.

As a reformed Java developer I can say that docker didn’t add much time to the build cycle and gave us a better way to package resources for Java code, but Go is far more ergonomic, so taking a <2 second compile time for a small microservice and adding docker to turn it into a 30 second build time just isn’t worth whatever utility you get from containers at dev time.


As a not-yet-reformed Java developer: an uberjar, a custom runtime with jlink, or one of the AOT compilers available does the job just as well.


Not really. Even with an uberjar, you still need to get that huge Java runtime distributed somehow. And then you need all the command line rubbish, starting heapsize, system properties, etc. Not to forget, for those of us outside the US, a special handmade distro of Java with the crypto export restrictions file in the right spot.

Docker helps manage all of this, and does it fairly quickly, and made life relatively easy, but not without a cost in time and complexity.

Go, out of the box, produces statically linked machine runnable binaries, including embedded resources, so you get the equivalent of an uberjar, plus resources, plus the runtime, all in a single executable file. And all of this pops out in a second or two with `go build`.

AOT for Java might perhaps have similar advantages except that AFAIK (two years ago) the AOT compilers were expensive and had plenty of caveats with eg reflection. I expect they would be even slower than javac as well. So certainly a solution, and maybe you don’t need docker any more, but then you have a different set of problems. It was never feasible when I was doing Java.

To be clear, this isn’t a Java vs Go thing. The question was why don’t Go devs use Docker, and I’ve given some reasons. I quite like the Java language and miss some aspects of it, but there is a lot about the Java environment that I don’t miss and runtime deployment complexity is one of them.


You missed the part of the comment, "custom runtime with jlink".

I have never been to the US; plus, the restrictions apply to any tech produced in the US, regardless of the programming language.

Thankfully, by having such laws, US made us create other standards as well.

I also don't want to make it into a Java vs Go thing, rather make the point that many dismiss Java without really knowing what has been around in the ecosystem during the last 26 years.

It appears everyone just learns the basics and then complains from there.

Not targeted at you, as you obviously got my point.

On the other hand, kubernetes and docker are all about runtime deployment complexity. It feels like using Websphere 5 all over again, with containers == EAR, thankfully so far I managed to stay mostly away from them.


I exited Java just as modules were kicking in so I’m not really familiar with jlink. But you still need to distribute that custom runtime and docker helps with that. I think docker is a great tool for dealing with Java’s complexities. A Java docker image is like a Go executable.

Re the export restrictions, although you are right in theory, it doesn’t seem to affect Go. There is no special build, the crypto is just built in. Java is unique in how it dealt with this, I never understood why it was so hard.

I agree with you re K8s. And I like the comparison to EARs. Both container systems are pretty poor substitutes for a binary you can just run in an OS.

Go seems to recognise this. It knows its place in the deployment hierarchy and that’s made my life so much easier. Go feels like it’s part of the Unix world, rather than apart from it, and Java was never like that. That’s why docker became so important in the Java world. It gave Java the isolation from the OS that it always craved :)


In regards to restrictions, Go just flies under the radar by being on GitHub.

Ask the developers in countries with blocked access due to US sanctions how they get to use Go without workarounds.

Go was written by two of the UNIX people as a better C, not as a cross-platform tool. In fact, there are still language features missing from Windows support.

Java always felt at home in Solaris though.

Docker isn't that relevant on my Java projects, it is only used in projects where customers want to feel modern or already have it as part of their workflow regardless of the technology stack.

I am pretty much 99% of the time on VM + scripting world.


Go doesn’t benefit as much from docker, but if you’re already living in a docker world (i.e. everything you deploy is a docker image, and it’s managed by compose or kubernetes) then it’s easier to use docker than not.

We build images (about 20, each with a Dockerfile) from a monorepo with a single go.mod. I have basically a full replica of prod running locally in k3s — letting k3s manage it all is easier than dealing with the pile of environment variables that would be needed to get everything hooked up properly. And with kustomize, we can reuse a bunch of yaml from prod.

Sometimes I’ll run go binaries locally on my machine for debugging (the builds still work because go’s packaging is finally stable). But the difference is minimal — using docker/k8s is more about streamlining deployment/config/rollback (and the occasional co-packaged asset) than anything else.


I agree that adding docker to a Go dev setup is not worth it, but I think commenter was asking for a docker image for running it. In that case, I’d say that docker could be worth it for the end user.


I dockerize Go apps to run in AWS ECS Fargate, but otherwise I agree. Go apps don't need docker.


Really nice project! The transactions and replication streaming seem to make it a great choice for sharding/distributed environments!


That's the idea, a transaction commit log decoupled from underlying durable storage allows you to build your own persistence layers. I'm still thinking to build a simple (memory-mapped?) layer, but as an optional, separate lib.


Why the use of Go instead of something more traditional like C++, or even Rust? Isn’t it primarily used for infrastructure scripting, and won’t that affect the performance of the DB?


Honestly, I enjoy programming in Go and have been using it on a daily basis for the last few years. Most importantly, when it comes to performance it's often not the language that matters but how you structure your code. It's very much possible to build a terrible C++ program which thrashes memory and will be very slow. And I feel like Go is actually lacking those nice data-oriented libraries.


What's the problem with Go? Many high-performance things are built in Go. https://awesome-go.com


Exactly to prove to people like yourself that it is possible.

IT industry is full of Matthews that need to be proven wrong for us to advance.


Is "Matthew" some sort of IT version of "Karen"?


I got the name in English wrong, it should have been Thomas.

“You believe because you see me. Great blessings belong to the people who believe without seeing me!” (John 20:24-31 )

Bringing it into the IT context, there are the visionaries that believe something is possible no matter what, and then there are those that even with stuff running in front of them cannot move beyond "yes but...".

Ironically, in the 80's, as far as home computers and game programming were concerned, both C and C++ also belonged to the "yes but..." group.


How does that compare to Hazelcast?


Great job picking up Go for this.



