/stream performance

Thank you for your work, and especially for writing this up in a structured manner.

As you know, I’m not a fan of this solution. About 6 months ago, your mitigation point 2 was my go-to solution when thinking about this problem, with a bit more complexity involved. I have no pod and cannot check theories that come to mind easily. This is why I had all kinds of fancy solutions in mind with retrying, using partitioning and so on.

The other approaches you give, I’m throwing out without even discussing them; I consider the reasons obvious. If an explanation is required, please ask.

I’ve since come to believe that we can solve all queries using proper indexes and query optimizer analysis.

Why not do a retry solution?

First of all, it is a bad solution, and we all (should) know it. There are worse solutions, I’ll grant that: using dedicated software just to cache the stream for every user, or, worse, building our own caching system, for a stream that needs to be rebuilt on every request anyway, because posts can arrive late. The complexity of building the stream for every user on post arrival should throw that solution out of the window before we even leave the drawing board. But retrying the stream query still adds more complexity to the Ruby code and/or to the front end. Arguably, the increased delay for near-empty streams like #muh to decide they’re empty or to still turn up posts may be tolerable.

Second, though, we have not yet explored the possibilities of making the database do its job efficiently. Not even close. A DBMS is built to retrieve data efficiently. If you have fewer than 100 million rows and your query takes longer than a second to retrieve just 15 rows, then we are using it the wrong way. I repeat: if the DBMS doesn’t do its job fast enough, the error is with us, not a limit of its capabilities. For database experts, this is obvious. I do not consider myself an expert, but I have the expertise required to put this knowledge to use.

A few days ago, I created an index and sped up the /public stream by three orders of magnitude on at least one system. I did not change anything else. Mind you, this index appeared obvious to me. It has been missing from the start and was never added. Please understand that the database layer of diaspora* has received little love; apparently databases are not fun to a lot of people, and thus even people with amateur knowledge of database optimization seem to be rare. We’re seeing bad speed because the people who created the slow queries knew how to get the data, but not how to get it fast. And as long as you only test with a dataset of a few hundred thousand posts, you may not even notice the difference. Let’s optimize the completely unoptimized stuff first, before we build contraptions nobody will understand in a few years and nobody will want to maintain.
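I’m not reproducing that exact index here, but to illustrate the kind of change I mean (the column choice below is my assumption for illustration, not necessarily the index I actually created), a covering index matching a “public posts, newest first” access pattern would look roughly like this:

-- Illustrative only: verify against EXPLAIN on a real pod before adding it.
CREATE INDEX index_posts_on_public_and_created_at
    ON public.posts USING btree
    (public, created_at DESC)
    TABLESPACE pg_default;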

To give you a feeling of how optimizations have been done in the past, a little fun fact: not too long ago, we had a serious PR actually proposing to slap an index on every column whose name ends with _id. Never mind the unnecessary overhead this would introduce for updating and maintaining the affected tables; most of these indexes were simply unnecessary by themselves. Obviously, this PR was created due to incomplete knowledge of the inner workings of a DBMS and of our application, diaspora*. Had I not shared what little I know about this stuff, it would have been approved, because to database amateurs it appears sensible. Now, it’s not a bad thing to be an amateur, it’s just a natural step along the way to becoming an expert. But it is a dangerous thing to assume you can do optimizations by applying, on a large scale, a principle you once heard of, if you don’t understand how these optimizations actually work. It’s the Dunning-Kruger effect all over again. However, I’m starting to get philosophical, I apologize.

Pending improvements

@koehn, to give you a taste of what is possible, and to convince you to give me a little more time before pressing your solution further, I will share what I did to improve the tag stream. As you know, the tag stream is part of /stream, as are a few other streams (by design), like the aspects stream. My reasoning is that if we manage to improve every stream by itself, the combined stream will be faster as well. Break it down, build from the bottom up, divide and conquer, yadda yadda yadda.

The original tag stream query is shown below.

SELECT 
    DISTINCT posts.*
 FROM
     "posts"
INNER JOIN "taggings"
    ON "taggings"."taggable_id" = "posts"."id"
        AND "taggings"."taggable_type" = 'Post'
LEFT OUTER JOIN share_visibilities
    ON share_visibilities.shareable_id = posts.id
        AND share_visibilities.shareable_type = 'Post'
WHERE
    "posts"."type" IN ('StatusMessage')
    AND ("share_visibilities"."user_id" = 1
         OR ("posts"."public" = 't' OR "posts"."author_id" = 1))
    AND (taggings.tag_id IN (66))
    AND (posts.created_at < '2018-12-29 02:52:37')
    AND "posts"."type" IN ('StatusMessage', 'Reshare')
ORDER BY
     posts.created_at DESC,
     posts.id DESC
LIMIT 15

The query optimizer loves to narrow down possible candidates first, so it starts selecting and joining from the taggings and share_visibilities tables. That is the moment when the index on posts.created_at becomes useless. During the join, a suitable index on posts.id will be used instead. That, in turn, requires the DBMS to load all eligible posts, then sort them, then take the first 15. You can imagine that for some tags, this takes some effort. For testing, I used tags with lots of posts to get a more noticeable difference in execution time.
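If you want to verify this on your own pod, the join order and the chosen indexes are visible directly in the plan; this is plain PostgreSQL, nothing diaspora*-specific:

-- Prepend this to the tag stream query above (everything from SELECT onward).
-- ANALYZE actually executes the query and reports real timings; BUFFERS adds
-- page-read statistics, so you can see how much data gets pulled in.
EXPLAIN (ANALYZE, BUFFERS)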

Another thing that makes the DBMS crawl instead of run is the DISTINCT posts.* clause. We as developers know that the system need only compare posts.id and be done. The DBMS doesn’t know that. It considers all of the fields during sorting, which leads to an external merge sort in actual files on disk, because more than 100MB of data does not fit into its dedicated sorting memory. You can imagine this takes resources.
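That spill is visible in the plan as well. The sorting budget in question is PostgreSQL’s work_mem setting (4MB by default; judging from the 100MB figure above, the system I measured had it set considerably higher), and EXPLAIN ANALYZE tells you whether the sort stayed in memory or hit the disk:

-- Check the per-sort memory budget.
SHOW work_mem;

-- In the EXPLAIN (ANALYZE) output, a sort that spilled to disk shows up as
--   Sort Method: external merge  Disk: <n>kB
-- while an in-memory sort shows up as
--   Sort Method: quicksort  Memory: <n>kB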

The DBMS is not stupid. It recognizes that it needs to sort for the duplicate elimination as well as for the ORDER BY clause, and thus sorts first by the ORDER BY fields, then by the rest of the fields. Now, it might be able to use an index for the ORDER BY clause. However, the DBMS estimates costs differently than a human being would. For the DBMS, the worst thing it can do is read from disk. If it only had to sort by our sorting fields, it might use an index and be done with it. However, since it needs all of the other fields as well, it considers it cheaper to avoid loading the index into memory, and instead just loads all pages from disk that contain eligible rows. Mind you, it will have to load these fields eventually, but only for the top 15 rows, not for all of the thousands and thousands of rows that match everything except the LIMIT 15.
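The “reading from disk is the worst thing” assumption is literally encoded in the planner’s cost settings. I am not suggesting we tune these for diaspora*; the defaults below are just where the estimate comes from:

-- PostgreSQL's planner cost model (defaults noted in the comments).
SHOW seq_page_cost;      -- 1.0: estimated cost of reading a page sequentially
SHOW random_page_cost;   -- 4.0: estimated cost of reading a page at random, e.g. via an index
-- Because random page reads are priced four times higher, fetching thousands of
-- rows through an index can look more expensive to the planner than reading the
-- table pages directly and sorting them afterwards.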

We can mitigate this by telling it that it only needs to sort and eliminate duplicates based on the sorting fields. It makes the query look slightly more complex, but in practice this is not actually increased complexity. See the slightly changed query below.

SELECT posts.* FROM (
    SELECT 
        DISTINCT posts.id, posts.created_at
    FROM
        "posts"
    INNER JOIN "taggings"
        ON "taggings"."taggable_id" = "posts"."id"
            AND "taggings"."taggable_type" = 'Post'
    LEFT OUTER JOIN share_visibilities
        ON share_visibilities.shareable_id = posts.id
            AND share_visibilities.shareable_type = 'Post'
    WHERE
        "posts"."type" IN ('StatusMessage')
        AND ("share_visibilities"."user_id" = 1
             OR ("posts"."public" = 't' OR "posts"."author_id" = 1))
        AND (taggings.tag_id IN (66))
        AND (posts.created_at < '2018-12-29 02:52:37')
        AND "posts"."type" IN ('StatusMessage', 'Reshare')
    ORDER BY
         posts.created_at DESC,
         posts.id DESC
    LIMIT 15
) AS a
JOIN posts ON a.id = posts.id
ORDER BY
    a.created_at DESC,
    a.id DESC

With the appropriate indices added, this query’s execution time goes down by three orders of magnitude. It’s all about making the inner query use only indexes, and relying on the actual table pages only in the outer query. As soon as there is a field it needs to evaluate that is not in the index, and that thus prompts page loads from the actual table, it prefers to load everything and avoid an otherwise useful index. At least PG10 did. PG11 did magic with parallel queries.

So, for PG10, this query required the following index to run fast.

CREATE INDEX index_posts_on_id_and_type_and_created_at_and_public_and_author
    ON public.posts USING btree
    (id, type COLLATE pg_catalog."default", created_at, public, author_id)
    TABLESPACE pg_default;

All selected fields are in the index, so PG10 knows it can evaluate the inner query from the index alone.
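For completeness, the “appropriate indices” above is plural because the inner query also walks taggings and share_visibilities. I won’t claim an exact definition here, but if a pod turns out to be missing a covering index on taggings, the shape to look for would be something like the sketch below (the column choice is my assumption for illustration, not part of the measurements above):

-- Hypothetical supporting index: lets the tag lookup and the join back to posts
-- be answered from the index alone. Check the EXPLAIN output before actually adding it.
CREATE INDEX index_taggings_on_tag_id_and_taggable_type_and_taggable_id
    ON public.taggings USING btree
    (tag_id, taggable_type COLLATE pg_catalog."default", taggable_id)
    TABLESPACE pg_default;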

For PG10, execution time went from 24s to ~20-40ms. For PG11, without that insane index, it went from 13s to ~20-30ms. I love PG11.

I have not been able to do this for MySQL/MariaDB yet, and I suspect it is going to be a lot more work, because none of my beloved query testers use MySQL/MariaDB anymore, and these two DBMSs offer neither an EXPLAIN output as beautiful and rich as PostgreSQL’s, nor neat features like a Parallel Index Scan.
