> You could say, there might be people who keep posting stories even though they don't get upvoted, but that would be kind of irrational. If the community doesn't seem to be interested in what they post, they will stop doing so sooner or later.
Probably many people do stop sharing in this case. But for those who don't, I guess the idea is: if you keep sharing, you can collect upvotes here and there, and eventually the karma will go up.
> Mostly based on a kind of strange apparent outlier of accounts going idle (1+ year) in 2023
Well, it is not just an outlier in 2023. This has been an upward trend since 2020.
> but the binning is also a full year and the 'idle year' is counted in a weird clippy (i.e. looking at calendar year rather than elapsed year) way
Granted, and I acknowledge this limitation. My idea, however, is that when studying many users in the same manner, this will even out. Why? Because a full calendar year implies somewhere between 0 and 2 elapsed years. So the average elapsed time, over many users, is 1 year.
The upward trend is much smaller than what happens in 2023 so that looks worth looking into. When you have this one outlier and one year can actually mean two years, it's not completely clear how much of the outlier is actual outlieriness and how much is some accidental artifact.
I double checked. I don't really see an issue. The only specific thing that affects 2023 is that I removed the users first seen / last seen in 2024 (since it is not a complete year). The aggregation is also simple: count the users first seen, grouped by year; count the users last seen, grouped by year.
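A minimal sketch of that aggregation, not the actual code behind the article. It assumes the data is a list of `(user, year)` pairs, one per story or comment, and it reads "removed the users first seen / last seen in 2024" as dropping any user whose first or last activity falls in an incomplete year. The function name and the `incomplete_years` parameter are made up for illustration.

```python
from collections import Counter

def first_last_seen_counts(events, incomplete_years=(2024,)):
    """events: iterable of (user, year) pairs, one per story or comment.

    Returns (first_seen_counts, last_seen_counts), each mapping year -> users.
    """
    first, last = {}, {}
    for user, year in events:
        first[user] = min(year, first.get(user, year))
        last[user] = max(year, last.get(user, year))

    # Assumption: users first or last seen in an incomplete year are dropped.
    kept = [
        u for u in first
        if first[u] not in incomplete_years and last[u] not in incomplete_years
    ]
    first_counts = Counter(first[u] for u in kept)  # the "first seen" line
    last_counts = Counter(last[u] for u in kept)    # the "last seen" line
    return first_counts, last_counts


events = [("a", 2012), ("a", 2016), ("b", 2016), ("c", 2023), ("c", 2024)]
first_counts, last_counts = first_last_seen_counts(events)
```

Here user `c` is dropped because their last activity is in 2024, so only `a` and `b` contribute to the two counts.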
There was a separate issue though (I didn't filter out the "dead" and "deleted" stories / comments). I fixed that and updated the article. Some values changed, but the patterns and conclusions stand.
Thanks for looking into this. I'll try to reproduce this myself (but with elapsed times) and see what happens.
Just to double check we're talking about the same thing: The red line is 'users who have been inactive for a year or more, at the time of the aggregate point'. So, for instance, for 2016 you'd have a point for 'users with a year+ inactivity, counted from 2016 back'.
> Thanks for looking into this. I'll try to reproduce this myself (but with elapsed times) and see what happens.
That will be great! Please don't hesitate to reach out if there is anything I can help with.
> Just to double check we're talking about the same thing: The red line is 'users who have been inactive for a year or more, at the time of the aggregate point'. So, for instance, for 2016 you'd have a point for 'users with a year+ inactivity, counted from 2016 back'.
Not quite. It means the user was last seen in that year (2016). By "last seen" I mean that the user's last shared story or comment (separate graphs) was in that year.
I guess I don't exactly understand 'last seen, (not active from >= year)'. So to be part of the red value for a given year, you have to be seen in that year and then what? Be idle for a year after that? What's the connection between seen-ed-ness and idleness?
Perhaps I should have articulated this in a better way.
> So to be part of the red value for a given year, you have to be seen in that year and then what? Be idle for a year after that?
Exactly! Last seen: this is the year of their last contribution (story / comment).
A user shared their first story in 2012, and their last one in 2016: 2012 is when they were first seen, and 2016 is when they were last seen. So, on the blue line, they are part of 2012, and on the red line, they are part of 2016.
> What's the connection between seen-ed-ness and idleness?
If I am last seen in 2016, then I am idle since then, no?
Aha ok, but if I am understanding this right, the future can change the past of this graph, right? Like our hypothetical user who first appeared in 2012 and last posted in 2016 - right now they appear in the 2016 red line but if they showed up again today and you made the graph again next year, they wouldn't be in the 2016 red line anymore. Or put another way and one that you can try: What happens if you cut off the data at 2022, 2021, 2020, 2019, 2018, etc and plotted those graphs? You'd see a different (rather than merely truncated) graph, no? Maybe even a different trend. So if my understanding is right, this is a pretty wiggly metric. The history of something you want to use as a historical trend line should not change as you append more data.
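The truncation experiment can be tried on toy data. This is only a sketch under assumptions: events are `(user, year)` pairs, and `last_seen_counts` is a made-up helper that recomputes the red line using only data up to a cutoff year.

```python
from collections import Counter

def last_seen_counts(events, cutoff_year):
    """Count users per last-seen year, using only events up to cutoff_year."""
    last = {}
    for user, year in events:
        if year <= cutoff_year:
            last[user] = max(year, last.get(user, year))
    return Counter(last.values())


# User "a" posts in 2012 and 2016, then comes back in 2023.
events = [("a", 2012), ("a", 2016), ("a", 2023), ("b", 2016)]

full = last_seen_counts(events, cutoff_year=2023)
truncated = last_seen_counts(events, cutoff_year=2022)

# The 2016 point differs between the two runs: 1 user vs 2 users.
print(full[2016], truncated[2016])
```

With the data cut at 2022, both users sit on the 2016 point. Once the 2023 post is included, user `a` moves to 2023 and the 2016 value drops, so the earlier part of the curve is rewritten rather than merely extended.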
> Like our hypothetical user who first appeared in 2012 and last posted in 2016 - right now they appear in the 2016 red line but if they showed up again today and you made the graph again next year, they wouldn't be in the 2016 red line anymore
That is correct.
> What happens if you cut off the data at 2022, 2021, 2020, 2019, 2018, etc and plotted those graphs? You'd see a different (rather than merely truncated) graph, no? Maybe even a different trend. So if my understanding is right, this is a pretty wiggly metric. The history of something you want to use as a historical trend line should not change as you append more data.
I see your point, but I don't see how it is avoidable. To my knowledge, any user churn metric will suffer from the same effect: if you consider a user churned after two weeks of inactivity, then the result will change if you change the cut-off (the last two weeks of this month? the two weeks before them? ...etc).
Even if you measure the "elapsed time" instead of "last seen", the cut-off will change your curve.
Extreme example: If you assume a user is churned after 1 year of inactivity (elapsed time since last activity), then a user that shared one story in 2007, and then a second story at the end of 2023, will appear as active. If you change the cut-off from 2023 to 2022, then the user will appear as inactive.
You can define a metric such that future data doesn't affect past data. Here's a straightforward one: a user is inactive at time t if they haven't posted in the period between t - k and t, where k is some constant time period one picks. So let's say k is a year and you're looking at active users per year†. So in your last example, the user would be counted as active in 2007 and 2008, counted as inactive from 2009 to 2022, and counted as active again in 2023. If you truncate the data at 2022, nothing changes.
† year is probably too big of a window for this (I'd take something like a month) but let's stick with it for now
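A rough sketch of that trailing-window definition, under some assumptions: time is measured in whole years, the window [t - k, t] is inclusive at both ends (chosen so the 2007 / 2023 example comes out as described), and `is_active` / `activity_by_year` are illustrative names.

```python
def is_active(post_years, t, k=1):
    """A user is active at year t if they posted anywhere in [t - k, t]."""
    return any(t - k <= y <= t for y in post_years)

def activity_by_year(post_years, start, end, k=1):
    """Map each year in [start, end] to True (active) or False (inactive)."""
    return {t: is_active(post_years, t, k) for t in range(start, end + 1)}


posts = [2007, 2023]  # one story in 2007, a second one in 2023

full = activity_by_year(posts, 2007, 2023)

# Truncate the data at 2022: drop later posts and stop the timeline there.
truncated = activity_by_year([y for y in posts if y <= 2022], 2007, 2022)

active_years = [t for t, active in full.items() if active]
unchanged = all(full[t] == truncated[t] for t in truncated)
```

`is_active` at time t only looks at posts at or before t, so appending later data cannot change any earlier point. In this example `active_years` is 2007, 2008 and 2023, and `unchanged` is true.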
In any case, I am happy to help: if you would like an export of the data, or the DB dump, let me know. And I am very much looking forward to your analysis :)
> newly joined users post a disproportionate amount on the website.
What is your definition of a new user?
> For me the answer, as borne out in data, is no. I find from ~ 2016 a change in the types of discussions on the site and find that the newer the poster distribution skews from that time period onward the less interesting the discussion becomes to me.
I am curious: Can you elaborate on how such an analysis is done?
> I am curious: Can you elaborate on how such an analysis is done?
Try doing an analysis of the time at which the posters in a thread joined this site. I weigh their join dates by the number of times they comment in the thread. If you do that you'll see a recency bias. The mass of the distribution is clumped more heavily in the 3 years leading up to the article's submission. If you think about it, that makes a lot of sense. People are probably much more energized to comment when they first join and eventually get bored and stop. You point out a really similar effect in your own analysis about posters that stop posting after a year.
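Roughly, the weighting could look like the sketch below. It is an illustration under assumptions, not the exact analysis: a thread is reduced to a list of comment author names, join dates are reduced to years, and both the function name and the `join_year` lookup are hypothetical.

```python
from collections import Counter

def weighted_join_year_distribution(comment_authors, join_year):
    """Share of a thread's comments attributed to each join year.

    comment_authors: one entry per comment in the thread (author name).
    join_year: mapping of author name -> year the account was created.
    """
    comments_per_author = Counter(comment_authors)
    weights = Counter()
    for author, n_comments in comments_per_author.items():
        # Each author's join year is weighted by how often they commented.
        weights[join_year[author]] += n_comments
    total = sum(weights.values())
    return {year: w / total for year, w in sorted(weights.items())}


thread = ["x", "x", "x", "y"]          # toy thread: x comments 3 times, y once
joined = {"x": 2022, "y": 2010}        # toy join years
dist = weighted_join_year_distribution(thread, joined)
```

In this toy thread three of the four comments come from an account created in 2022, so 2022 carries 0.75 of the mass and 2010 carries 0.25.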
I would be really curious to explore such correlations.
Tbh, while I was working on this, I did struggle with forming hypotheses to test, or even a suitable manifold for the different items/users (e.g., I didn't know some of the users share their PGP keys).
Unsupervised methods were not very satisfying in uncovering such hypotheses either.
There is also the point you raised about how to share (or what to do with) some of these correlations. For example, concerning mental illness, as much as I would love to uncover that due to my sheer curiosity, I am really concerned about what I would see.
It doesn't need to be a leaderboard for "all the time", but over a recent window of time.
> Karma conflates comment karma and post karma, so that’s not really the same thing.
I agree, but the limitation that I faced is that I don't have access to the comments upvotes, and I didn't find a satisfying proxy for it either. That is why I dropped it from the analysis.
I think part of the confusion that I had was that I didn't know about the downvoting option at the beginning.
I think it would be really great if HN released the scores of the comments.