For now, I will only address the claim of better scaling, and agree to disagree on everything else.
The worker distributing public entities generates the payload exactly once, then leverages typhoeus to actually do the requests. Internally, typhoeus drives libcurl's multi interface with a concurrency limit we define in the config. If you want, you can say this thing does one job: deliver, and deliver fast. Delivering requests is not a bottleneck anymore, and it hasn't been for quite a while. In reality, database queries are the true bottleneck, so federating things in and out is an insignificant part of the chain right now.
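To make that pattern concrete, here is a minimal sketch in Rust (not diaspora's actual Ruby code; the crate choices of tokio, reqwest, and bytes, the pod URLs, and the concurrency value are all my assumptions for illustration): build the payload exactly once, then fan it out under a fixed concurrency limit, much like typhoeus does.

```rust
use bytes::Bytes;
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // Pretend these are the subscribed pods (placeholder URLs).
    let pods = vec![
        "https://pod-one.example/receive/public".to_string(),
        "https://pod-two.example/receive/public".to_string(),
    ];

    // Generate the payload exactly once; Bytes clones are cheap
    // reference-counted handles, not copies of the data.
    let payload = Bytes::from(r#"{"type":"StatusMessage","text":"hello"}"#);

    let client = reqwest::Client::new();
    // The semaphore plays the role of the configured concurrency limit:
    // no matter how many pods exist, only this many requests are in flight.
    let limit = Arc::new(Semaphore::new(20));

    let mut tasks = Vec::new();
    for url in pods {
        let permit = Arc::clone(&limit).acquire_owned().await.unwrap();
        let client = client.clone();
        let body = payload.clone();
        tasks.push(tokio::spawn(async move {
            let result = client.post(&url).body(body).send().await;
            drop(permit); // free a slot for the next delivery
            (url, result.map(|r| r.status()))
        }));
    }
    for task in tasks {
        if let Ok((url, status)) = task.await {
            println!("{url}: {status:?}");
        }
    }
}
```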
If we have a look at the current runtimes, we have to agree that incoming federation is totally irrelevant in terms of processing times, but outgoing federation takes some time if we have a large reach (like, for example, when the dhq account posts something public).
So let’s ignore everything else and focus on delivering payloads, since that is a concern of yours. I don’t like the 10k pods example much, because if we talk about scaling the network, we should think ahead. Let’s go an order of magnitude bigger and deal with 100k pods, just to make my point clear.
It’s easy for me to just say “you are wrong”, since I have worked as a lead engineer on real high-traffic messaging systems in the past, but that would be a bit simple and surely would not get the point across.
Let me start by stating something: you are completely ignoring the fact that federated systems are actually more scalable than centralized ones, and you casually forget how the message volume multiplies if you route everything through a central pub/sub node.
To prove my point, I wrote a simple Rust application that loads a list of domains and executes HTTP requests similar to the way typhoeus does it. I tried to code without fancy Rustness and commented the code a bit so people can understand what I am doing; a condensed sketch of it follows after the list below. This benchmark is not entirely perfect, as…
- I am using GET instead of POST. While that makes no difference for HTTP itself, delivering large payloads can add a few extra milliseconds, since sending the body takes time. However, in my tests, I found that almost the entire runtime is spent connecting to the host in the first place, so this is negligible.
- For hostnames, I used the top 1 million webpages according to Alexa and extracted a random subset of 100k. This isn’t really a good simulation of a large diaspora network, but it’s the best approximation I can get. I also cannot provide the dataset I am using, since the top domain list is now a commercial product, and making it public would be illegal. You should be able to find an old version archived on the internets if you want to follow along.
- Once again, one very important note: I assume that everything gets sent to everyone. Every post gets delivered to every pod, every like gets delivered to every pod, and so on. In reality, this is highly unlikely, but it’s still a fair comparison since I assume the same for both the current federation and the relay.
- My main benchmark system was a virtual machine with 500 MB of RAM and “1 virtual CPU core” (whatever that means, it’s kinda slow) that I rent for 3 euros per month. Probably one of the lower-end machines diaspora will ever run on.
- For comparison, I also ran the test on a Raspberry Pi 2. I should note that the runtimes were almost identical, but that did not really surprise me: since most of the time is spent waiting for connections to be established, there is nothing CPU-intensive involved at all.
- I measured the average route length and found that almost 30% of my simulated pods had a route length of more than 10 hops. That’s mainly because my random set of domains has a very significant portion of Asian domains, and peering towards China is horrible.
- Response times of these websites are surely not the same as response times of diaspora nodes. Some might be faster, some might be slower; I assume it averages out. To avoid issues with Chinese servers not responding to me, I set the connection timeout to five seconds. That’s shorter than you would use in a production system, but it matches the average response time in my sidekiq logs.
- I ran the test 10 times, to smooth out outliers and to get a more realistic scenario: in the real world, pods would have DNS caches for most other pods, and that’s something I wanted to include. I took a simple average as the final result.
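Since I cannot share the dataset, here is at least the promised condensed sketch of what the benchmark does. This is a simplified reconstruction, not the actual code; the `domains.txt` file name and the concurrency of 100 are placeholder assumptions.

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One hostname per line, e.g. extracted from the Alexa top-1M list.
    let domains: Vec<String> = std::fs::read_to_string("domains.txt")?
        .lines()
        .map(|l| format!("http://{l}/"))
        .collect();

    let client = reqwest::Client::builder()
        .connect_timeout(Duration::from_secs(5)) // matches the post
        .build()?;
    let limit = Arc::new(Semaphore::new(100)); // assumed concurrency

    let start = Instant::now();
    let mut tasks = Vec::new();
    for url in domains {
        let permit = Arc::clone(&limit).acquire_owned().await?;
        let client = client.clone();
        tasks.push(tokio::spawn(async move {
            // We only care about elapsed delivery time, so the
            // response (or timeout/error) itself is ignored.
            let _ = client.get(&url).send().await;
            drop(permit);
        }));
    }
    for task in tasks {
        let _ = task.await;
    }
    println!("delivery took {:?}", start.elapsed());
    Ok(())
}
```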
Long story short: submitting payloads to 100k pods took, on average, 416 seconds, if we only count the time actually spent distributing payloads. As a base rate, we can say a diaspora pod can handle roughly 865k requests per hour (100,000 requests / 416 s ≈ 240 requests/s ≈ 865k requests/hour).
So what does this mean? For interactions that get sent to all pods (in reality, this would mainly be public posts, but as before, we have some disagreement on that, so let’s stick with “interaction” as a generic term), we do have a limit on what we can handle. With that processing time, a pod would be limited to 8.6 interactions per hour (3600 s / 416 s ≈ 8.65 full deliveries) before it gets backlogged. That could actually work out for smaller pods (which pods probably would be if there were that many of them), since they only have a few users. Even if we get backlogged, people do not spend 24 hours a day on diaspora, so we should be able to catch up. Incoming messages, as noted earlier, are way faster to process, so we can be fairly certain outgoing delivery is the real ceiling. Reality is probably a bit different, and my result should not be taken as a test of whether diaspora can or cannot handle such traffic; it is meant as a comparison between our current system and a relay-based one.
Let’s call 100k pods with 8 interactions per hour the limit of what native diaspora could handle right now, “somewhat”. It won’t be perfect, but eh… we could be fine, maybe.
Let’s compare that with the relay. All pods would deliver all their stuff to the relay, as suggested. Assuming all pods deliver at their maximum rate, the relay would have to distribute ~800k payloads per hour (100k pods × 8 interactions per hour) to 100k-1 pods each (everyone but the origin). Which means that, in total, it would somehow have to manage almost 80 billion (80 * 10^9, just to avoid any confusion) requests per hour.
We established that the main delay in sending payloads is caused by connecting to the host, so this is nothing a faster and better server can fix. With my Rust script as a delivery machine, you would end up with a whopping ~92,500 hours of work for just one hour of network traffic (80 * 10^9 requests / ~865k requests per hour). That’s more than 10 years.
Now, uh, okay, that’s… impossible.
To get rid of the delay caused by opening connections, you could think about a design where each subscriber keeps a persistent connection to the relay, or where the relay keeps the connections open itself (TCP has keepalive, as we all know). The thing is, you cannot really have more than 10k connections open on a single server. So, uh, you would need 10 servers just to keep the connections open, and “10 servers” is optimistic, since managing the sockets would eat a lot of CPU, so it’s probably more like 20 or so. Not to mention that you would have to distribute the messages between the nodes somehow…
You can work around the C10K problem fairly well with asynchronous I/O. By “fairly well”, I mean there have been people who managed 60k concurrent websocket connections on a heavily customized machine. However, those connections were idling; you’d run into CPU bottlenecks as soon as you actually send messages over those channels. In our example, there would be a lot of traffic on those channels, so based on previous projects, I’d say you could handle something like 5k connections per host, probably way fewer.
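For the curious, here is a minimal sketch of that approach (my assumption of how one would start, not anything that exists): persistent TCP connections opened once, with asynchronous I/O multiplexing the fan-out, so no per-message connect cost is paid. The addresses are made up, and a real relay would additionally need framing, reconnect logic, and backpressure on top.

```rust
use std::sync::Arc;
use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Pretend these are subscriber pods (placeholder addresses).
    let subscribers = vec!["pod-a.example:8080", "pod-b.example:8080"];

    // Open one long-lived connection per subscriber, up front.
    let mut conns = Vec::new();
    for addr in subscribers {
        conns.push(TcpStream::connect(addr).await?);
    }

    // One payload, serialized once, shared across all writers.
    let payload: Arc<Vec<u8>> = Arc::new(b"{\"post\":\"hello\"}".to_vec());

    // Fan out concurrently; the event loop multiplexes all sockets
    // on a handful of OS threads instead of one thread per connection.
    let mut tasks = Vec::new();
    for mut conn in conns {
        let payload = Arc::clone(&payload);
        tasks.push(tokio::spawn(async move {
            conn.write_all(&payload).await
        }));
    }
    for task in tasks {
        task.await.expect("task panicked")?;
    }
    Ok(())
}
```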
In the end, you would be building something similar to what the folks at Twitter and Facebook use to deliver their “realtime” streams (spoiler: those still have up to 15 minutes of delay, as you know if you have a lot of transatlantic friends whose messages and posts take forever to arrive), which means: hundreds of servers distributed worldwide, connected via a private fibre network. Good luck building that. And, eh, you would end up with a federated system to handle the load for a relay you built because you thought a federated system could not handle the load.
Edit 1, 4:30 UTC:
I’d like to point out, just for the sake of pointing it out, that short-term caching of messages (for, say, 15 minutes) followed by bulk delivery is a solution that may come to mind. In the outlined scenario, we’d deal with an average stream of 25 million (outgoing) messages per second, created by only ~250 inbound messages per second (actually more like 220, but I rounded for nicer numbers in the example above). Even when storing only the incoming payloads, if we assume ~10 KiB per payload (which is accurate for posts, since signatures blow the size up a bit…) and decide to bulk-send every 15 minutes, we’d have to hold 250 msg/s × 900 s × 10 KiB ≈ 2.1 GiB in a constant buffer. Now, that’d actually be small enough to store in memory, but keep in mind the 25 million outgoing messages per second would still be there. Even if there is enough space in memory, memory bandwidth (yes, I know, we usually don’t think about such specs) becomes the bottleneck.
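So nobody has to trust my mental arithmetic, here is the back-of-the-envelope math from the paragraph above as a tiny Rust program (all numbers are the scenario’s assumptions, nothing measured):

```rust
fn main() {
    let inbound_per_sec: u64 = 250; // rounded inbound message rate
    let payload_kib: u64 = 10;      // assumed size per payload
    let buffer_secs: u64 = 15 * 60; // bulk-send every 15 minutes
    let pods: u64 = 100_000;

    // Memory needed to buffer 15 minutes of incoming payloads.
    let buffered_kib = inbound_per_sec * buffer_secs * payload_kib;
    println!("buffer: {:.1} GiB", buffered_kib as f64 / (1024.0 * 1024.0));
    // => buffer: 2.1 GiB

    // Every inbound message fans out to every pod.
    let outbound_per_sec = inbound_per_sec * pods;
    println!("outbound: {} million msg/s", outbound_per_sec / 1_000_000);
    // => outbound: 25 million msg/s
}
```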
There actually was a nice case study back in 2014 by RabbitMQ (an insanely awesome message broker built to deliver messages to subscribers in highly scalable environments). They managed to deliver one million messages per second. To achieve that, they needed 32 servers (30 workers, 2 managing) with 8 CPU cores and 30 GB of RAM each. Our scenario is 25 times that throughput, so, in theory, we would need some 750 worker servers (25 × 30) to handle it. Totally the scale of a large social network!
I feel like I should say a bit more about that insanely high number of messages, which might confuse people who read the aforementioned blog post and see that “Apple processes about 40 billion iMessages per day”, which comes down to only ~460k per second, much lower than our 25 million per second. One might think I made some math errors here.
When Apple delivers an iMessage, they get input from one node and have to deliver it to one other node: 1 incoming, 1 outgoing. We would only be processing ~250 messages per second, but since every message has to be delivered to 100k nodes, each incoming message turns into 100k outgoing messages. And suddenly you have to deal with huge numbers.
[End of Edits]
Summary: No, relays do not scale better than pod-to-pod federation. In fact, the opposite is true.
There is no room for discussion here. This is a simple, provable fact.