Statistics and privacy

The opt-in statistics.json feature provides real-time sums of the number of users and posts on the D* pod. This might be a privacy issue, especially on small pods.

Is this considered a problem? Should we change the statistics implementation?


Note: This discussion was imported from Loomio.

Also discussed in this post.

I’m really concerned about this feature regarding privacy, which is why I turned it off again to protect my pod’s users.

One can easily track when exactly a new user or a new post appears, even though this data should be hidden imho.

For example, say Alice promised me she would register on pod X. If the user count of that pod didn’t increase within the last week, I know that Alice didn’t register. Or if the user count of pod X increased by only 1, I know exactly when Alice registered. The same is true for posts.
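To make the inference concrete, here is a minimal sketch in Python. It assumes an observer has already collected timestamped copies of a pod's statistics.json; the sample values and the function name registration_windows are made up for illustration and are not part of Diaspora.

```python
from datetime import datetime, timezone


def registration_windows(samples):
    """Return (earlier, later, delta) for each pair of consecutive
    samples in which total_users went up."""
    windows = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        delta = s1["total_users"] - s0["total_users"]
        if delta > 0:
            windows.append((t0, t1, delta))
    return windows


# Hypothetical observations of a small pod, polled once per minute.
samples = [
    (datetime(2014, 3, 4, 12, 0, tzinfo=timezone.utc), {"total_users": 41}),
    (datetime(2014, 3, 4, 12, 1, tzinfo=timezone.utc), {"total_users": 41}),
    (datetime(2014, 3, 4, 12, 2, tzinfo=timezone.utc), {"total_users": 42}),
]

windows = registration_windows(samples)
```

With these samples the only window found is 12:01 to 12:02 with a delta of 1, so a single registration is pinned down to within the polling interval. Polling more often narrows the window further.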

This is a common problem within statistical databases (referring to Stallings).

I share the opinion that there should be a way to make good estimates about the number of people/accounts using Diaspora, and obviously the statistics.json method is a good way to provide those figures. But especially for very small pods (like mine) I consider it a privacy concern if they provide real-time sums for posts. Although small pods are clearly less important for generating network-wide estimates than big pods like geraspora.de or diasp.org, since Diaspora is a decentralized network, there will always be small pods, and, who knows, maybe in the future small pods (<100 accounts) will make up the majority of accounts (although the Pareto principle is more realistic imho - https://en.wikipedia.org/wiki/Pareto_principle).

I’m really keen on providing stats on my pod, but - sorry - not in real time (also not when my pod grows). I cannot provide a service to my users in good conscience as long as it discloses those real-time sums.

Proposal: Provide weekly snapshots instead of real-time data

The statistics.json info (the fields total_users, active_users_halfyear, active_users_monthly and local_posts) should not contain the current sums, but those of the last Tuesday midnight GMT instead. This makes sure that the numbers are not real-time but weekly snapshots.
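The snapshot boundary in this proposal can be sketched as follows. This is only an illustration in Python, not Diaspora code: it reads "Tuesday midnight GMT" as the start of Tuesday (00:00 UTC), which is an assumption, and the function name last_snapshot_time is hypothetical.

```python
from datetime import datetime, timedelta, timezone

TUESDAY = 1  # datetime.weekday() counts Monday as 0


def last_snapshot_time(now):
    """Return the most recent Tuesday 00:00 UTC at or before `now`.

    `now` is expected to be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    days_back = (midnight.weekday() - TUESDAY) % 7
    return midnight - timedelta(days=days_back)
```

A pod would then publish the sums as they stood at that instant, for example last_snapshot_time(datetime(2014, 3, 6, 15, 30, tzinfo=timezone.utc)) gives Tuesday 2014-03-04 00:00 UTC. How the sums at that instant are obtained (a scheduled job, or counting only records created before the boundary) is left open here.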


Outcome: Only 1 of 12 votes in favor. Proposal declined.

Votes:

  • Yes: 1
  • Abstain: 4
  • No: 6
  • Block: 1

Note: This proposal was imported from Loomio. Vote details, some comments and metadata were not imported.

This seems arbitrarily paranoid. While I understand both sides of the story, I will not vote either way on the subject. I don’t agree with either side. Honestly, I fought against any statistics in the first place. This, however, is just trying to stuff the genie back into the bottle.

@starblessed Based on what I read in your other posts, you consider a pod’s providing statistics about the number of accounts and posts a step into centralizing the network and a step against privacy as such.

I disagree on the decentralization point but I understand your privacy point. You can’t have full statistics and full privacy at the same time; those two things are mutually exclusive, just like in Heisenberg’s uncertainty principle. We have to balance statistics against privacy - and for proposing to drop all statistics because of privacy issues, one may call you paranoid. :wink:

And you would be right. I am paranoid. I won’t connect my pod to FB or Twitter for that very reason. I’m getting close to pulling it away from Tumblr.
If I had my way, there would be no public data about any kind of D* statistics. But that’s just me.

@starblessed Well, the nice thing about a decentralized network is that every pod has their own philosophy - some will provide statistics and some won’t, and it’s your very personal decision which one you prefer to open an account on.

I think we should make it easily possible for podmins without any programming experience to choose their own statistics philosophy, maybe even provide more options than to opt-in or not to opt-in.

Btw. if you really want to be on a pod that does not provide any statistics whatsoever, the podmin must ensure that he/she does not even say “well, about a thousand” when being called by media and asked how many accounts he/she serves.

I opted into the stats. Just for now. I want to see how it could possibly affect the value of the data.

In the linked discussion, we learn that there are two separate issues: each pod’s statistics.json file and the central stats collector’s polling. If Diaspora has no concept of regularly scheduled tasks, this change could require a fairly extensive rewrite.

I’m going to abstain because I do not think there is enough of a privacy benefit to justify the extra work this asks Jason to do (rewriting the stats collection process).

@flaburgan you are referring to Jason’s statistics hub that pulls every day, but the data itself is pullable in real-time. Just like Jason did, I could write a bot that pulls the pods’ data every second instead of every day. This topic is not about any pulling bot, it’s about Diaspora’s statistics.json feature.

Stats about non-anonymous data are never “completely anonymous”, ask @starblessed about that. :wink:

@lnxwalt If the community decides that something has to be done, I could do the programming stuff. No need to burden Jason.

Well, in that case, I think that the statistics.json can be updated every day; that looks precise enough for statistics, and long enough to not know when exactly someone registered (but seriously, given the massive amount of data online, what’s the problem with knowing when “someone” is registering? Believe me, I’m really engaged in online privacy, but here I don’t get it…)

I know, this seems somewhat paranoid and the D* project probably has many other issues that are much more important than this one.

I just want to raise awareness that real-time sums may lead to data leaks on relatively small pods that can only be avoided by opting out of statistics altogether. Even on relatively bigger pods, a real-time trend on the number of posts feels somewhat spooky.

We might also only publish the numbers in 100s instead of the exact numbers.
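A rough sketch of that idea in Python, assuming the counts are rounded down to the nearest multiple of 100 (rounding down rather than to the nearest hundred is my assumption, and coarsen is a hypothetical helper, not an existing Diaspora function):

```python
def coarsen(stats, step=100):
    """Round every count down to a multiple of `step`."""
    return {name: (value // step) * step for name, value in stats.items()}


# Hypothetical raw counts for a mid-sized pod.
raw = {
    "total_users": 1234,
    "active_users_halfyear": 310,
    "active_users_monthly": 95,
    "local_posts": 20417,
}

published = coarsen(raw)
```

Here published would report 1200 users, 300 and 0 active users, and 20400 posts. Note that a pod with fewer than 100 accounts would always report 0, and an observer can still see the moment a count crosses a multiple of 100.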

My first thought – dick swinging

I do not understand at all why and who needs to have these statistics? Are they used to solve concrete problems?

For privacy reasons I would prefer not to analyze anything. Everything else is an invitation for data mining.

@adrenalin the statistics feature is already implemented, see https://www.loomio.org/d/FBjn89X2/central-hub?proposal=1y7tgbVP but it is an opt-in feature, so a fresh installation of D* does not publish any statistics whatsoever.

@adrenalin the statistics are really needed, because we need credibility: to have more people coming to diaspora, more journalists and projects talking about us, more developers helping us to build nice software. Most of the people who don’t follow the project just think that “diaspora is dead”. We have to fight this idea and we need to show numbers for that.

@manuelbichler
thanks Manuel, I realized that now but missed the discussion :frowning:

I’d sign @rekado’s comment over there

I don’t think statistics are at all important. Pod-local stats are important for the pod admin; since there is no “network admin” in a decentralised network I don’t see the need for stats at that level. Email didn’t need usage stats either.

If we follow the idea of decentralization, in other words every user setting up her/his own pod, they’ll know their activity … and I agree with @flaburgan’s comment too

“…having a better list of the pods than poduti.me” would be more interesting.

@adrenalin Well, apart from the fact that there is no statistics interface specialized for podmins, don’t you think that @flaburgan’s arguments up there pretty much show why a good estimate of the network-total number of users would be good for pushing the project? I mean, you can’t compare Email 1993 with D* 2014. Email became popular because it empowered people to do things they could never do without it, whereas D*, from a functional point of view, is for most users just a feature-poorer version of Facebook.

Oh, and @adrenalin “having a better list of the pods than poduti.me” already happened, and Jason had the motivation to do it because of the new statistics feature :wink:

I agree with @jonnehass. Closing the vote and reopening it, saying that the update interval should be configurable, might be a good idea.
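One way a configurable update interval could work, sketched in Python under my own assumptions (the class SnapshotCache and its parameters are hypothetical, and nothing here reflects how Diaspora actually builds statistics.json): the sums are recomputed only on the first request in each interval and served unchanged until the next interval starts.

```python
class SnapshotCache:
    """Serve a cached result, refreshing it once per interval."""

    def __init__(self, compute, interval_seconds):
        self._compute = compute            # callable returning the current sums
        self._interval = interval_seconds  # podmin-configurable update interval
        self._bucket = None                # index of the interval last computed
        self._value = None

    def get(self, now_seconds):
        bucket = int(now_seconds // self._interval)
        if bucket != self._bucket:
            self._value = self._compute()
            self._bucket = bucket
        return self._value


# Stand-in for the real counting query: each call sees one more user.
counter = {"n": 0}


def fake_stats():
    counter["n"] += 1
    return {"total_users": 40 + counter["n"]}


cache = SnapshotCache(fake_stats, interval_seconds=86400)  # daily updates
first = cache.get(now_seconds=10)
second = cache.get(now_seconds=86399)
third = cache.get(now_seconds=86400)
```

In this run first and second are identical because they fall into the same day, while third is recomputed. A podmin who wants weekly snapshots would set interval_seconds to 604800 instead.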