Great read. I’ve been modeling developer activity as a time-series key-value system where each developer is a key and commits are values.
Faced the same issues: logs grow fast, indexes get heavy, range queries slow down.
How do you decide what to drop when compacting segments? Balancing freshness and retention is tricky.
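For concreteness, here's a minimal sketch of the kind of compaction trade-off I mean (names and the retention window are hypothetical, not from any real system): merge segments, drop entries past a retention cutoff, but always keep each developer's newest commit so freshness survives even for stale keys.

```python
import time

# Hypothetical sketch: segments map (developer, timestamp) -> commit record.
# Compaction merges segments and drops entries older than a retention window,
# but always keeps the newest entry per developer so no key vanishes entirely.

RETENTION_SECONDS = 365 * 24 * 3600  # illustrative one-year window

def compact(segments, now=None):
    """Merge segments; drop stale entries but keep each key's latest commit."""
    now = now if now is not None else time.time()
    cutoff = now - RETENTION_SECONDS
    merged = {}
    for seg in segments:
        merged.update(seg)  # later segments win on duplicate keys
    latest = {}  # dev -> newest timestamp seen for that dev
    for (dev, ts) in merged:
        if ts > latest.get(dev, float("-inf")):
            latest[dev] = ts
    return {
        (dev, ts): commit
        for (dev, ts), commit in merged.items()
        if ts >= cutoff or ts == latest[dev]
    }
```

The "keep the latest per key" exemption is one way to balance freshness against retention; a real policy might instead downsample old entries rather than drop them outright.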
I'm curious how much data you have. I have 12 years of dev data, and reports generate in seconds, if not milliseconds. What are your key patterns? It sounds like a key-design problem.
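To illustrate what I mean by key design (this layout is just an assumption for the sketch, not how any particular store does it): a composite key of developer id plus a zero-padded timestamp sorts lexicographically in time order, so a per-developer range query becomes a cheap prefix scan over sorted keys instead of a full index walk.

```python
# Hypothetical composite-key layout: "<dev_id>#<zero-padded timestamp>".
# Zero-padding makes lexicographic order match numeric time order, so any
# sorted key space (SSTable, B-tree, sorted list) supports range scans.

def make_key(dev_id: str, ts: int) -> str:
    return f"{dev_id}#{ts:013d}"  # 13 digits covers millisecond epochs

def range_scan(sorted_keys, dev_id: str, start: int, end: int):
    """Return keys for one developer within [start, end], inclusive."""
    lo, hi = make_key(dev_id, start), make_key(dev_id, end)
    return [k for k in sorted_keys if lo <= k <= hi]
```

With keys shaped like this, slow range queries usually point to a key layout that scatters one developer's commits across the key space rather than clustering them.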
Great point, that’s definitely one of the biggest limitations right now. GitCruiter only sees public repos, so it naturally misses the context of people’s real work.
I’m exploring ways to let developers optionally connect private repos or upload anonymized activity snapshots, if there’s enough community interest to make privacy and consent solid.
Totally agree this would make the results far more representative. Thanks for raising it.
The compression framing is super interesting. It makes me wonder if there’s an equivalent notion for source code - like how much “information” or entropy a commit contains vs. boilerplate churn.
I’ve been exploring Git activity analysis recently and ran into similar trade-offs: how do you tokenize real-world code and avoid counting noise?
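One rough way to make the entropy framing concrete (a naive sketch, with a deliberately crude tokenizer, not a claim about how any real tool does it): tokenize a commit's added lines and compute Shannon entropy over the token distribution. Repetitive boilerplate churn produces a skewed distribution and scores low; varied logic scores higher.

```python
import math
import re
from collections import Counter

def tokenize(code: str):
    # Naive tokenizer: identifiers, numbers, and single punctuation chars.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

def token_entropy(code: str) -> float:
    """Shannon entropy (bits/token) of the token frequency distribution."""
    tokens = tokenize(code)
    if not tokens:
        return 0.0
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())
```

It's only a first-order proxy: it counts surface-level repetition, so generated code with varied identifiers would still score high, which is exactly the "counting noise" problem you mention.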