Add pull to Diaspora's push model in federation

I had some ideas a while ago about improving communication between pods in instances where it currently falls down, but didn’t know enough about how federation works to be able to flesh them out. Now that Fla’s blog post about federation has helped me understand more about how it works, I’ve refined those ideas.

Just for clarity, this is only a speculative concept. I understand the technical issues only poorly, and so my suggestions as I’ve presented them may not be workable. However, I hope that, even if this proves to be the case, my suggestions will spark ideas in those of you who understand the technical side of Diaspora which might help to improve Diaspora’s federation protocols.

At the moment Diaspora relies solely, or almost solely, on pushing data from one pod to another. This means that if a pod does not receive data when it is pushed, there is no way for that pod to retrieve these data at a later time. I suggest that if we’re going to keep Diaspora working on a push model, we supplement this by enabling pods to pull data under certain circumstances.

New pods

Pods only receive data from pods with which they have an established connection. Currently, these connections are only formed when users connect with users on other pods, and this takes time. I suggest putting in place an automatic means of making connections with other pods as soon as the pod goes online, so that when users start using the pod, these connections are already in place.

I suggest putting in place a sort of ‘handshake’ system.

The process would work something like this:

  1. Podmin sets up Pod Z, and puts it online. Pod Z knows about Pod A.
  2. Pod Z contacts Pod A, and says ‘Hi, which pods do you know about?’
  3. Pod A gives Pod Z a list of pods it knows about.
  4. Pod Z adds each of these pods to its knowledge base.
  5. Pod Z contacts each of these pods and asks the same question in step 2.
  6. This process is repeated until Pod Z is not finding out about any more new pods.

This way the new pod would very quickly build connections with the whole network.
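
The handshake steps above amount to a breadth-first crawl of the pod network. Here is a minimal sketch in Python, not actual Diaspora code: the function `discover_pods`, the callback `ask_known_pods` and the `*.example` pod names are all made up for illustration, and the real exchange would happen over the network rather than through a dictionary lookup.

```python
from collections import deque


def discover_pods(seed, ask_known_pods):
    """Crawl outwards from a seed pod until no new pods turn up.

    ask_known_pods is a hypothetical callback: given a pod's hostname,
    it returns the list of pods that pod says it knows about (steps 2-3).
    """
    known = {seed}            # Pod Z's knowledge base (step 4)
    to_ask = deque([seed])    # pods we still have to question (step 5)
    while to_ask:             # stops when nothing new is found (step 6)
        pod = to_ask.popleft()
        for peer in ask_known_pods(pod):
            if peer not in known:
                known.add(peer)
                to_ask.append(peer)
    return known


# Made-up network standing in for real pods answering over HTTP.
network = {
    "poda.example": ["podb.example", "podc.example"],
    "podb.example": ["poda.example", "podd.example"],
    "podc.example": ["poda.example"],
    "podd.example": [],
}

found = discover_pods("poda.example", lambda pod: network.get(pod, []))
print(sorted(found))
```

Each pod is questioned once only, so the crawl finishes even when pods list each other.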

Of course, there needs to be some means of establishing the first pod to contact (Pod A). This could be prompted by going to the pod of whichever account new accounts are set to auto-follow on that pod (currently the Diaspora HQ account, which is located on joindiaspora.com). Alternatively a list of a few key pods could be kept on diasporafoundation.org (not as a web page visible to visitors, but somewhere from which pods can FTP the data), or the pod could get the information from a site such as podupti.me, which is frequently updated.

One possible way of doing this would be to automatically create ‘bot’ accounts on each pod which communicate with each other via the above protocol. I’m calling them ‘pod-spiders’. If Pod Z knows about Pod A, pod-spider@PodZ.org adds pod-spider@PodA.com to its aspects in order to contact it, and so on. I’m sure the inter-pod communication could be done without setting up bot accounts, and that might be a better way to do it. As much as anything, the ‘pod-spider’ concept is a visual aid.

Tags

As tags are not federated, you could also have each pod-spider account follow all the tags that users on its pod follow or search for. (This could involve only tags that have been searched more than 5 times or are followed by more than 5 people, to eradicate spelling mistakes.) When Pod Z goes online, pod-spider@PodZ.org can also ask each pod it contacts ‘which tags do you know about?’ and can then follow those tags itself. In this way, it might be possible to populate tag searches from the time the pod goes online.
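
To make the ‘more than 5’ filter concrete, here is a rough Python sketch. The function name `tags_to_follow`, the two count dictionaries and the sample numbers are invented for illustration; how a pod would really count searches and followers is not something I know.

```python
MIN_COUNT = 5  # the 'more than 5' threshold suggested above


def tags_to_follow(search_counts, follower_counts, threshold=MIN_COUNT):
    """Return the tags the pod-spider should follow.

    A tag qualifies if it has been searched for more than `threshold`
    times OR is followed by more than `threshold` users on this pod.
    """
    all_tags = set(search_counts) | set(follower_counts)
    return {
        tag
        for tag in all_tags
        if search_counts.get(tag, 0) > threshold
        or follower_counts.get(tag, 0) > threshold
    }


# Made-up counts: 'diapsora' is a one-off spelling mistake.
searches = {"diaspora": 12, "diapsora": 1, "federation": 3}
followers = {"federation": 8, "linux": 5}

print(sorted(tags_to_follow(searches, followers)))
# prints ['diaspora', 'federation']
```

Note that ‘linux’ is left out because it has exactly 5 followers, not more than 5.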

Alternatively, when a user searches for a tag which is not currently in that pod’s database, the pod can pull the data on that tag from all the pods it is connected to. That way, the first time a tag search is done on that pod, it is done by a pull, which would take longer but at least would get the data. After that, data relating to that tag can be pushed to the pod in the usual way.
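
A sketch of this ‘pull on first search, push afterwards’ idea in Python might look like the following. Everything here is hypothetical: the `TagIndex` class, the `pull_tag` callback and the pod names are placeholders, not part of Diaspora.

```python
class TagIndex:
    """Per-pod store of posts by tag, filled by one pull and later pushes."""

    def __init__(self, connected_pods, pull_tag):
        self.connected_pods = connected_pods
        self.pull_tag = pull_tag   # hypothetical callback: (pod, tag) -> list of posts
        self.posts_by_tag = {}     # tags this pod already has data for

    def search(self, tag):
        # First search for an unknown tag: slow pull from every connected pod.
        if tag not in self.posts_by_tag:
            posts = []
            for pod in self.connected_pods:
                posts.extend(self.pull_tag(pod, tag))
            self.posts_by_tag[tag] = posts
        return self.posts_by_tag[tag]

    def receive_push(self, tag, post):
        # After the first pull, new posts arrive by push in the usual way.
        if tag in self.posts_by_tag:
            self.posts_by_tag[tag].append(post)


pull_calls = []


def fake_pull(pod, tag):
    """Stand-in for a real network request to another pod."""
    pull_calls.append((pod, tag))
    return [f"{tag} post from {pod}"]


index = TagIndex(["poda.example", "podb.example"], fake_pull)
first = list(index.search("linux"))    # triggers two pulls
index.receive_push("linux", "pushed linux post")
second = list(index.search("linux"))   # no further pulls
print(len(pull_calls), len(first), len(second))
```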

Non-communication

There are also some circumstances in which an established pod doesn’t receive data that are pushed – for example, if a pod goes offline for a while or is temporarily over capacity. In these circumstances, it would be helpful if the pod can pull data when it goes back online.

At the moment, when Pod A can’t push data to another Pod B, it puts the data back into its send queue and retries a number of times at intervals. When the last of these retries has taken place, Pod A stops trying, whether or not it has been successful. If not successful by the last of these attempts, there is no possibility of the data getting from Pod A to Pod B.

For my suggestion to work, at the end of this process of retries, if the data still cannot be pushed, Pod A should write all data destined for Pod B to a log rather than placing them back in its queue. Pod B is placed on a list of ‘pods incommunicado, do not attempt to communicate’. From then on, when there are new data destined for Pod B, Pod A adds them to this log instead of attempting to push them, which would save network resources. (Pod A could perhaps continue to attempt communication with Pod B, say once a day, and if successful could then push the logged data.)

When Pod B is back online, it immediately communicates with all pods known to it and says: ‘I’m back. What have I missed?’ When Pod A receives this communication, it refers to its log for Pod B, retrieves the data and sends them to Pod B, and once it receives confirmation that this transfer has been successful, deletes the log and removes Pod B from the ‘do not communicate’ list.
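
The log and the ‘I’m back’ exchange could be sketched roughly like this in Python. It is only an illustration under my own assumptions: the class `SendingPod`, the `send` callback and the retry count of 3 are invented, and the waiting between retries is left out.

```python
class SendingPod:
    """Pod A's side: push, give up after some retries, then log instead."""

    def __init__(self, send, max_retries=3):
        self.send = send                # hypothetical callback: (pod, item) -> bool
        self.max_retries = max_retries  # assumed value, not Diaspora's real setting
        self.missed = {}                # incommunicado pod -> log of undelivered items

    def push(self, pod, item):
        if pod in self.missed:
            # Pod is on the 'do not communicate' list: log without trying.
            self.missed[pod].append(item)
            return False
        for _ in range(self.max_retries):   # real retries would be spaced out
            if self.send(pod, item):
                return True
        self.missed[pod] = [item]           # retries exhausted: start a log
        return False

    def handle_im_back(self, pod):
        """Called when `pod` says 'I'm back. What have I missed?'"""
        items = self.missed.get(pod, [])
        # Delete the log only once every item has been confirmed.
        if all(self.send(pod, item) for item in items):
            self.missed.pop(pod, None)
            return True
        return False


online = set()
delivered = []


def fake_send(pod, item):
    """Stand-in for a network push; succeeds only if the pod is online."""
    if pod in online:
        delivered.append((pod, item))
        return True
    return False


pod_a = SendingPod(fake_send)
pod_a.push("podb.example", "post 1")   # fails three times, then logged
pod_a.push("podb.example", "post 2")   # logged straight away
online.add("podb.example")             # Pod B comes back
print(pod_a.handle_im_back("podb.example"), delivered)
```

If the replay fails part-way through, this sketch keeps the whole log and would resend everything next time, so the receiving pod would need to cope with duplicates.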

This should (a) allow pods to receive data pushed when they were unavailable, and (b) save network resources currently wasted by pods trying to communicate many times with pods which are unavailable.

There may be other circumstances in which it would be good for a pod to be able to do a pull request – perhaps if it hadn’t heard from a pod for a set period of time. However, this would involve pods keeping logs of data destined for other pods even when they haven’t detected a communication problem, so it may be a waste of resources.


Note: This discussion was imported from Loomio. Click here to view the original discussion.

New pods
“This way the new pod would very quickly build connections with the whole network.”

Well, we don’t need to do that. To save resources (network, CPU, database), we try to talk only to the pods we need to, and as little as possible.

If I set up a pod and all my contacts come on my pod, or only on one external pod, why should I know the whole network?

So instead of New pod, I would talk about new user, meaning a user not known by my pod (no existing relation). Being able to pull bio, old posts etc. the first time a user is reached by a pod is a good improvement. But that’s only for the first relation: if someone else on my pod adds the contact after me, there is no need to pull; we will receive the data (pushed) because of the other sharing relation.
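
As a rough illustration of ‘pull only for the first relation’, here is a small Python sketch. The `Pod` class, the `pull_history` callback and the account names are invented for the example and are not how the diaspora* code is organised.

```python
class Pod:
    """Tracks which local users share with which remote users."""

    def __init__(self, pull_history):
        self.pull_history = pull_history  # hypothetical callback: remote user -> bio/old posts
        self.sharing = {}                 # remote user -> set of local users sharing with them
        self.history = {}                 # remote user -> data pulled on first contact

    def add_contact(self, local_user, remote_user):
        """Return True if a pull was needed (first relation with this user)."""
        first_relation = remote_user not in self.sharing
        self.sharing.setdefault(remote_user, set()).add(local_user)
        if first_relation:
            # Nobody on this pod knew the user yet: pull bio and old posts once.
            self.history[remote_user] = self.pull_history(remote_user)
        # Later relations need no pull; new data arrives by push.
        return first_relation


pulls = []


def fake_pull(remote_user):
    """Stand-in for a request to the remote user's pod."""
    pulls.append(remote_user)
    return {"bio": f"bio of {remote_user}", "old_posts": []}


my_pod = Pod(fake_pull)
print(my_pod.add_contact("alice", "bob@podb.example"))    # True: first relation, pull
print(my_pod.add_contact("carol", "bob@podb.example"))    # False: already known, no pull
```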

So knowing the whole network is useful only for the very special case of tag searching, and we definitely need to find a more global solution for that.

This suggestion sounds generally good to me as a non techie. I particularly like the tag-searching aspect of it.

I also like this solution more than some central-hub and tag-aggregator ideas. P2P is the way for us to go, I think. Diaspora has “Decentralization” as one of its key philosophies ( https://diasporafoundation.org/ ). I don’t want it to lose that.
There are many working examples of decentralized networks (eDonkey, FreeNet), so it’s not impossible to do.

I hadn’t read this properly when it was posted. While there are definitely some good things here and the whole idea might work, it sounds to me like a kind of “every pod knows every pod” thing. So while that would certainly solve some problems, it’s not a solution that would scale. It’s not realistic to have public posts for example federated in this way, unless we allow diaspora* as a network to stay small. I don’t know about eDonkey and FreeNet but afaik P2P is not what “everybody can follow anyone” is about. Diaspora works very well if you know who to follow. But if you just want to follow posts tagged with something - it simply will not scale, the work required to pass those messages around needs to be outsourced from the diaspora server code.

I think there are solutions to that - it could be enough for a certain pod to know a few other pods, who could just pass its query over. Another possibility - there could be some sort of shared list, saying which pod knows which tags. Some kind of routing algorithm. Another thing that I find bad when we try to centralize something - we have to maintain the code of this central hub, or relays, or tag aggregator.

oh, and “everybody knows everybody” is certainly a bad idea in my opinion :slight_smile:

@macieklozinski however something is done, there is always code to maintain. It’s usually better to split features into separate components and not build one big product that does everything. The server code is already being cleaned up into separate repositories with the federation code being split out thanks to the amazing work by @florianstaudacher :slight_smile:

There are some benefits of this, but I’d rather see a network of similar nodes connected to each other than a group of different services which need to be installed separately and depend on each other.

Sure, the pod should be something that just works, I agree. You’re missing the point that the relay/hub/taggregator ideas are things that podmins don’t need to install - the project with community volunteers would maintain those.

And you probably need time/money/discussion to maintain these volunteers. But, on the other hand, you need these also to maintain extra pod code…

It’s all volunteers, not a single person does anything related to diaspora* for money AFAIK. That is unlikely to change in any immediate future :slight_smile:

Yes, but what about the dependency problem? I think owners of pods want to be as independent from other volunteers as possible. There always is a problem with volunteers - they often become unavailable/unmotivated/busy.

Well, I think then our biggest risk is that the developers get bored and leave - oh wait that is our biggest risk :wink:

That is why the components should be open source and anyone can host them. One relay goes down? Another takes its place.

Yes, but if a developer gets bored - it doesn’t affect the network, it only slows down development. If an admin gets bored - you have a handicapped network.

No, because:

One relay goes down? Another takes its place.

@jasonrobinson, thanks for your comments. This is a proposal to help in three specific circumstances:

  • to populate a stream on a new pod when it is first set up;
  • to retrieve posts made while a pod was down/offline, or in the case of a data loss;
  • an attempt to improve the federation of tags.

I didn’t envisage this method of bot accounts connecting pods being the norm for inter-pod communication in Diaspora during the normal course of events, but something which could kick in when the normal, push method of federation hasn’t worked in a particular circumstance.

So, while it is a case of ‘every pod knows every pod’, hopefully scalability issues wouldn’t be so much of a problem because it’s a method which would only kick in for a brief period of time in occasional circumstances, such as when a new pod is set up or when a pod which has been down comes back on-stream.

But it comes with the caveats that I don’t really understand the technical issues, so my hope is more that some elements of my proposal might spark some ideas of what might work in the minds of people who do understand the technicals, which seems to be happening. If anything I’ve said leads eventually to some improvement in performance through the work of other people, I’ll be happy!

@jasonrobinson Can you explain the issue with scalability further? What kind of obstacles do you expect?

@ryunoki I guess you are referring to this?

It’s not realistic to have public posts for example federated in this way, unless we allow diaspora* as a network to stay small. I don’t know about eDonkey and FreeNet but afaik P2P is not what “everybody can follow anyone” is about. Diaspora works very well if you know who to follow. But if you just want to follow posts tagged with something - it simply will not scale, the work required to pass those messages around needs to be outsourced from the diaspora server code.

The comment is not directly about the conversation Goob started, but about the whole idea of federating public posts around. I really think there are more clever ways of doing things than to make all the pods work equally and pass huge amounts of messages around. In all networks and large software projects there are different components to handle things. We should have different nodes to handle different things - and no, that doesn’t mean giving up decentralization as long as no node is hard coded and there can be several nodes for some purpose - like relaying public posts around.

So scaling in terms of “there would be too much data being passed around”?

Look, for a developer “scaling up” is a common term - but not for non-coders.

Or is it meant rather like this: https://www.loomio.org/d/9vpoe0UR/public-post-federation#comment-61592
(Okay, Landau notation isn’t that common either, but I can understand it as a mathematician …)?