Fixing Federation

Just a general discussion. This isn’t about switching protocols; instead, it asks an equally important question: “What can we fix in our own implementation right now?”


Note: This discussion was imported from Loomio.

Talking to Mike Macgirvin of the Friendica project, I came away with these notes of things we need as far as federation is concerned:

  • Globally Unique Message IDs
  • A better queue manager for federation than Resque, which drops objects when it is overloaded. Sidekiq has been proposed in the past, and it’s been said that writing a queue manager from scratch isn’t all that difficult.

From Mike:

"Decentralised communications have a tendency to cause “fanout” or a hundred/thousand deliveries for one message injected into the system. This is the nature of the beast. Ilya created a nice batching protocol for public posts. That helped.

The decentralised social web needs every trick in the book to fight fanout. Batching and prioritisation are the keys - and are the only things you can change. Sure you can try and work on performance, but that just reduces fanout linearly. One needs to find clever ways to reduce it logarithmically, because you’re dealing with an exponentially expanding input."

The first requirement before anything else is http://loom.io/discussions/612. Just sayin’

Definitely agreed, however to do so, we’ll need an attack plan.

I’m good with plans. How can I help?

(two year bump!)

I hope this isn’t tl;dr so please be patient and stick with it :slight_smile:

So, Sean’s original question asked,

“What can we fix in our own implementation right now?”

One thing I see asked about and discussed countless times is the federation retry issue.

Our Wiki states:

Will a pod eventually receive federated posts that it misses while being offline/down?

Possibly. We retry the delivery three times at one hour intervals.

#WTF?!

We only try to resend a message/information/etc three times at one hour intervals? THREE TIMES? AT ONE HOUR INTERVALS?!??

No wonder there are so many questions and complaints about posts being missed :frowning:

Now here on Loomio there are countless discussions (literally, I didn’t count them, there are that many) about some huge changes which could be made to make federation more reliable, and they are indeed fantastic - but the reality is they are very long-term goals in terms of implementation.

What I would like to suggest is a very simple change to the retry functionality.

Once an hour, for only three hours, is completely unrealistic in today’s Diaspora ecosystem. So many pods, so many different connection types. You only need to look at podupti.me to see just how much downtime there actually is.

I would like to see the retries not only made more frequent in the short term, but also the overall retry period massively extended - to be something more like typical SMTP behaviour when re-trying to deliver a message. RFC 5321 (SMTP) states:

Retries continue until the message is transmitted or the sender gives up; the give-up time generally needs to be at least 4-5 days

Why on earth do we give up after just three hours?

Is there a technical reason why we couldn’t (easily!) implement message delivery retries along the lines of:

  1. Retry every 5 mins for six attempts (30 mins)
  2. Then retry every 1 hour for six attempts (6 hrs)
  3. Then retry every 3 hours for four attempts (12 hrs)
  4. Then retry every 6 hours for four attempts (24 hrs)
  5. Then retry every 12 hours for two attempts (24 hrs)
  6. Then retry every 24 hours for one attempt (24 hrs)

I’ve just pulled these numbers out of the air, there is no science behind them and they are simply a starting point for discussion :slight_smile:
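To make the proposed schedule easier to reason about, here is a rough Ruby sketch that expresses the six stages above as data and adds them up. The names `RETRY_STAGES` and `retry_offsets` are made up for illustration; nothing here is taken from the diaspora* code base.

```ruby
# Proposed schedule from the list above, as [interval_in_minutes, attempts].
# RETRY_STAGES and retry_offsets are hypothetical names, not diaspora* code.
RETRY_STAGES = [
  [5, 6],     # every 5 minutes, six attempts
  [60, 6],    # every hour, six attempts
  [180, 4],   # every 3 hours, four attempts
  [360, 4],   # every 6 hours, four attempts
  [720, 2],   # every 12 hours, two attempts
  [1440, 1]   # every 24 hours, one attempt
].freeze

# Returns the offset (minutes after the first failure) of every retry.
def retry_offsets(stages)
  elapsed = 0
  stages.flat_map do |interval, attempts|
    Array.new(attempts) { elapsed += interval }
  end
end

offsets = retry_offsets(RETRY_STAGES)
give_up_hours = offsets.last / 60.0

puts "#{offsets.size} retries, giving up after #{give_up_hours} hours " \
     "(#{(give_up_hours / 24).round(2)} days)"
```

Added up this way, the schedule makes 23 retries over 90.5 hours, so it still gives up a little short of the 4-5 days quoted from RFC 5321.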

Note that the three retries at one-hour intervals apply only to message delivery. All other potentially recoverable failed jobs are retried with an exponential back-off; the formula for that is (count ** 4) + 15 + (rand(30)*(count+1)), with count being the number of attempts made so far[1]. We default to a maximum of 10 attempts, but this is configurable by the pod maintainer[2]. With the default, the last retry happens approximately 4 hours after the first try.

This is already too much for joindiaspora.com to handle; Max lowered the number of retries for these comparatively light jobs to just three[3].
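For anyone who wants to check the back-off numbers, here is a small Ruby sketch of the quoted formula. It assumes the result is in seconds and that the 10 attempts are counted from 0 to 9; both are assumptions on my part, and `retry_delay` and `total_wait` are illustrative names rather than the actual implementation. The random part is passed in as an argument so the lower and upper bounds can be computed deterministically.

```ruby
# Delay before the next retry, given the number of attempts made so far.
# The formula is the one quoted above; treating the result as seconds is an
# assumption. The jitter argument stands in for rand(30).
def retry_delay(count, jitter = rand(30))
  (count**4) + 15 + (jitter * (count + 1))
end

MAX_ATTEMPTS = 10 # the default mentioned above

# Cumulative waiting time over all retries for a fixed jitter value.
def total_wait(max_attempts, jitter)
  (0...max_attempts).sum { |count| retry_delay(count, jitter) }
end

min_total = total_wait(MAX_ATTEMPTS, 0)   # as if rand(30) always returned 0
max_total = total_wait(MAX_ATTEMPTS, 29)  # as if rand(30) always returned 29

puts format("last retry between %.2f h and %.2f h after the first try",
            min_total / 3600.0, max_total / 3600.0)
```

Under those assumptions the last retry lands between roughly 4.3 and 4.7 hours after the first try, which is in line with the "approximately 4 hours" figure.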

Now, the job to deliver messages is a really heavy and long-running one. My pod knows about 575 pods; more than half of them are gone, and most of the gone ones simply time out. We have a relatively high timeout of 25 seconds to accommodate slowly responding setups, which improved federation stability significantly in the past[4]. Requests to other pods happen in parallel, but we have to limit the number of concurrent connections, since more parallel connections mean drastic spikes in memory usage; the default is currently 20[5]. Lowering memory usage is one of my personal focuses here since it benefits both ends of deployments, big pods as well as small pods.

So yes, the default retry strategies are very conservative, but this is to accommodate running costs. Look at the pricing for the Redis database you need to use on Heroku (which joindiaspora.com is deployed to) alone[6]. We can’t just pile up jobs for weeks in it. Note also that if we extend the delivery retry period, we also need to keep retrying the processing of successfully received comments, likes etc. for which we never got the parent. In sum, this increases the number of jobs to process a lot, which means bigger deployments need to scale up more and thus significantly increase their running costs.

This issue might seem simple at first sight, but there are many variables and stakeholders involved. And nobody actually running one of the big deployments is actively contributing. I’m rather happy with how well it works currently and that we have some defaults that seem to work for most people.
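As a back-of-envelope illustration of why this gets expensive, the figures above can be plugged into a few lines of Ruby. This assumes a worst case where one delivery has to contact every unreachable pod, and it takes "more than half" of 575 as exactly half, rounded up; both are simplifications for the sake of the arithmetic, not measurements.

```ruby
# Figures quoted above. How many pods actually time out is an assumption:
# "more than half" of 575 is taken as exactly half, rounded up.
KNOWN_PODS      = 575
TIMEOUT_SECONDS = 25
CONCURRENCY     = 20

dead_pods = (KNOWN_PODS / 2.0).ceil

# With CONCURRENCY requests in flight at once, timeouts are paid in "waves".
waves = (dead_pods.to_f / CONCURRENCY).ceil
seconds_waiting = waves * TIMEOUT_SECONDS

puts "#{dead_pods} unreachable pods -> #{waves} waves -> " \
     "#{seconds_waiting} s spent waiting on timeouts"
```

In this simplified picture a single worst-case delivery spends over six minutes doing nothing but waiting for dead pods, and every additional retry pays that cost again.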

Hey Jonne.

Note that the three times in one hour interval is only for message delivery

Is message delivery not the most important part of the federation concept though?

I think every task is too much for joindiaspora.com to handle. The server is on its knees :frowning:

Again, longer term, it would be good to be able to cleanly remove a pod from the ecosystem and let all other pods know to stop trying it. It would also be good if pods could expire other pods too, so that if no connection can be made to a pod for perhaps 1 week, it is never tried again (or something). But again, that’s long-term stuff.

Short term, three attempts/hours to deliver a message is still completely unacceptable.

Can you imagine if SMTP gave up after three hours?

So how about putting a setting in diaspora.yml to allow podmins to scale the attempts, according to their infrastructure?

I am sure that many (many) podmins would increase this retry value.

PS. Thanks for the detailed response, my friend! Very grateful :slight_smile:

Well, I guess a config option can’t hurt.

Or maybe instead of 1h/1h/1h intervals, just make this something like 0.5h/2h/6h/24h, or even 1h/4h/24h? It should not put too much strain on servers.

Increasing the interval will mean increased memory usage by Redis since a larger number of retries will pile up.

With what Maciek is suggesting it shouldn’t make a difference to usage as the total retries would still only be 3.

Just at 1/4/24 instead of 1/1/1 ?

Which will help federation for offline pods a huge amount?

It’ll make a difference, it won’t grow after the initial growth, but the base usage will be higher since on average more retries will be in the queue waiting for their time to run.
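A rough way to put numbers on this is Little's law (average jobs waiting = arrival rate × average time spent in the queue). The sketch below assumes that 1h/4h/24h means the gaps between attempts, that a failing job is dropped right after its last retry, and that 100 deliveries per hour never succeed; that rate is purely hypothetical, and the method names are made up for illustration.

```ruby
# Hours a permanently failing delivery stays queued: the sum of its intervals.
# Assumes the job is dropped right after the last retry.
def hours_in_queue(intervals_in_hours)
  intervals_in_hours.sum
end

# Little's law: average jobs waiting = arrival rate * average time in queue.
def average_queue_size(failures_per_hour, intervals_in_hours)
  failures_per_hour * hours_in_queue(intervals_in_hours)
end

FAILURES_PER_HOUR = 100 # hypothetical rate of deliveries that never succeed

current  = average_queue_size(FAILURES_PER_HOUR, [1, 1, 1])
proposed = average_queue_size(FAILURES_PER_HOUR, [1, 4, 24])

puts "1h/1h/1h:  about #{current} jobs waiting on average"
puts "1h/4h/24h: about #{proposed} jobs waiting on average"
```

The number of retries per job is the same in both cases, but each failing job sits in Redis for 29 hours instead of 3, so the steady-state queue is roughly ten times larger.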

Ah I see! Good point my friend :slight_smile:

Ok, so how about we move the three 1hr retry options into diaspora.yml and let the podmin choose the three interval times to suit their server environment?
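To give an idea of what such a setting could look like, here is a minimal Ruby sketch. The `federation` / `delivery_retry_intervals_hours` keys are invented for this example and do not exist in diaspora.yml, and `delivery_retry_intervals` is a hypothetical helper; the fallback of 1h/1h/1h mirrors the current behaviour described in this thread.

```ruby
require "yaml"

# Hypothetical excerpt of config/diaspora.yml. The key names are invented for
# illustration and do not exist in diaspora*.
EXAMPLE_CONFIG = <<~YAML
  federation:
    delivery_retry_intervals_hours: [1, 4, 24]
YAML

DEFAULT_INTERVALS_HOURS = [1, 1, 1].freeze # current behaviour described above

# Reads the intervals from a YAML string, falling back to the defaults when
# the setting is absent or malformed.
def delivery_retry_intervals(yaml_text)
  config = YAML.safe_load(yaml_text)
  config = {} unless config.is_a?(Hash)
  intervals = config.dig("federation", "delivery_retry_intervals_hours")
  valid = intervals.is_a?(Array) && !intervals.empty? &&
          intervals.all? { |hours| hours.is_a?(Numeric) && hours.positive? }
  valid ? intervals : DEFAULT_INTERVALS_HOURS
end

p delivery_retry_intervals(EXAMPLE_CONFIG)
p delivery_retry_intervals("federation: {}")
```

Falling back to the existing defaults means a podmin who never touches the setting sees no change in behaviour.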

Well, I guess a config option can’t hurt.

Proposal: Allow podmins to choose their federation re-try times

To try and improve the federation I propose we move the federation re-try times of 1hr/1hr/1hr from being hard coded, to being contained in config/diaspora.yml

This will allow each podmin to choose the retry interval times to suit their own pod environment.


Outcome: N/A

Votes:

  • Yes: 1
  • Abstain: 0
  • No: 0
  • Block: 0

Note: This proposal was imported from Loomio. Vote details, some comments and metadata were not imported.

?!?

Why the proposal? Where did anyone oppose a configuration option?

Isn’t that what we do on Loomio? Discuss then propose, to give people the opportunity to vote/object?

Or have I completely missed the point? :slight_smile:

Only if there’s controversy.