/stream performance

I’ve been working on a solution to my pod’s declining /stream performance on my fork.

I did some analysis of the /stream query against my Postgres 11 database with @CSammy’s index on posts.created_at. The main problem with the query as written is that the Postgres (10/11) optimizer runs a full table scan on the posts table instead of using the created_at index, because it estimates the index would be slower than just scanning the table. As you can see, on my relatively capable machine (32GB RAM, 16GB free for caching, 8GB dedicated to Postgres), it’s taking 2800 milliseconds to do the scan.

I recognized that there’s no point in doing a full table scan on the posts table for /stream, since I’m only returning the next fifteen posts behind whatever date is asked for, and with 3.8 million posts in my database, it’s looking at a lot of additional posts for nothing.

So I went to work modifying the query to look only at posts created in the past thirty days. Why thirty? It seemed like a reasonable tradeoff between minimizing the number of posts the query needs to consider and ensuring that the query would still return the requested number of results with high probability. This involved changing the query to use a BETWEEN expression instead of a < expression.
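
To make the shape of the change concrete, here is a minimal sketch of the windowed predicate. The timestamp and the 30-day interval are illustrative only; the real /stream query of course keeps its joins and visibility conditions.

-- Old pagination predicate (simplified):
--   AND posts.created_at < '2018-12-29 02:52:37'
-- Windowed replacement: only consider the 30 days before the requested timestamp.
SELECT posts.*
  FROM posts
 WHERE posts.created_at BETWEEN '2018-12-29 02:52:37'::timestamp - INTERVAL '30 days'
                            AND '2018-12-29 02:52:37'
 ORDER BY posts.created_at DESC, posts.id DESC
 LIMIT 15;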

The results were impressive. The new expression resulted in a ten-fold performance gain, taking only 250 milliseconds. It’s faster because the DB knows, based on its column statistics, that using the index will be radically faster than a table scan, so it uses it. The query still returns the requisite 15 results. Pushing the “next” button slides the window back, just as before. Narrowing the window to 7 days reduced execution time to 70ms, again with no loss of posts.

Upside: Radically faster stream performance. Query execution time is bounded by the density of posts within the window rather than by the number of posts in the pod’s database.

Downside: It’s possible that a thirty-day window of posts might be too small for some users to see all fifteen posts, and were there a thirty-day “gap” in posts between windows, the UI would be convinced there are no more posts to see for that stream. This is unlikely for established users, but it’s possible for brand-new ones. The consequence is that they wouldn’t see posts created before the “gap,” but all posts after it would be displayed (and quickly!).

The downside can be managed in a few ways:

  1. Acceptance. It’s a free social media platform, not a perfection-driven system. Users are unlikely to scroll very far back. Other social media platforms work this way.
  2. Second query. If the user’s /stream query returns no results, run it again with an infinitely-large window, thereby guaranteeing we don’t miss any posts (see the sketch after this list). (Thanks @HankG for this suggestion!)
  3. Pod-level customization. Have a property in diaspora.yml that allows the podmin to set the window size based on the performance they want.
  4. User-level customization. Have a user-settable property in the database, bounded by reasonable ranges, that allows each user to define the window size to make the speed/precision tradeoff that’s best for them.
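
A rough sketch of option 2 at the SQL level, purely illustrative: the zero-result check would of course live in the Ruby code, and the timestamps are placeholders.

-- 1) Try the cheap, windowed query first.
SELECT posts.*
  FROM posts
 WHERE posts.created_at BETWEEN '2018-12-29 02:52:37'::timestamp - INTERVAL '30 days'
                            AND '2018-12-29 02:52:37'
 ORDER BY posts.created_at DESC, posts.id DESC
 LIMIT 15;

-- 2) Only if the query above returned zero rows, fall back to the old, unbounded
--    predicate so posts older than the window remain reachable.
SELECT posts.*
  FROM posts
 WHERE posts.created_at < '2018-12-29 02:52:37'
 ORDER BY posts.created_at DESC, posts.id DESC
 LIMIT 15;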

There are two issues with this approach that need discussing:

  1. Is it technically feasible? While I can hand-assemble the query, my branch may or may not be generating it correctly, and may or may not break other parts of the app. Are other pieces, like configuration, needed to achieve an acceptable solution?
  2. Is it the right thing to do? This approach trades some of the functional usability of the application for better /stream performance, at the cost of added technical complexity. Is that the right way to solve the problem? Are there other ways to improve performance? How do they compare? Should we allow podmins or users to decide for themselves if they want to trade some functionality for performance? And more importantly, how do we come to consensus?

I’m never a fan of this theory, especially as the first entry. It’s also not managing the downside; it’s ignoring it. I think that whatever we do for query optimization can’t have a user-facing artifact like that.

I like the other two mitigation discussions but I think this one is a level of detail that users shouldn’t really be privy to or exposed to.

Which functionality is being traded? The way I read the discussions we had a few days ago, and before you posted the PR today, the change should be transparent to the user except for the performance improvement. I think we should hold ourselves to that for any of these query updates. A slight increase in complexity could be okay too. I know we’ve talked about simplifying EvilQuery to get rid of the need for it. That isn’t necessarily relevant to your potential change, but it may be for @csammy’s and @denschub’s changes. I’ve only seen the one with the reworked indexes, so I can’t speak to the other stuff they are playing around with.

I appreciate the work. I can look at it in more detail tomorrow and play around with it a bit. The API uses the exact same queries, so I’m curious to see whether it is a drop-in replacement there as well. It should be if it’s done correctly.

Thank you for your work, and especially for writing this up in a structured manner.

As you know, I’m not a fan of this solution. About six months ago, your mitigation point 2 was my go-to solution when thinking about this problem, with a bit more complexity involved. I have no pod and cannot easily test theories that come to mind. This is why I had all kinds of fancy solutions in mind, with retrying, partitioning and so on.

The other approaches you give, I’m throwing out without even discussing them; I consider the reasons obvious. If an explanation is required, please ask.

I’ve since come to believe that we can solve all queries using proper indexes and query optimizer analysis.

Why not do a retry solution?

First of all, it is a bad solution, and we all (should) know it. There are worse solutions, I’ll give you that: using dedicated software just for caching the stream for every user, or, worse, building our own caching system, for a stream that needs to be rebuilt on every request anyway because posts can arrive late. The complexity of building the stream for every user on post arrival should throw that solution out of the window before we even leave the drawing board. But retrying the stream query adds more complexity to the Ruby code and/or to the front end. Arguably, the increased delay for near-empty streams like https://nerdpol.ch/tags/muh to decide they’re empty or to still turn up posts may be tolerable.

Second, though, we have not yet explored the possibilities of making the database do its job efficiently. Not even close. A DBMS is built to retrieve data in an efficient manner. If you have fewer than 100 million rows and your query takes longer than a second to retrieve just 15 rows, then we are using it the wrong way. I repeat: if the DBMS doesn’t do its job fast enough, the error is with us, not a limit of its capabilities. For database experts, this is obvious. I don’t consider myself an expert, but I have the expertise required to use this knowledge.

A few days ago, I created an index and sped up the /public stream by three orders of magnitude on at least one system. I did not change anything else. Mind you, this index appeared obvious to me. It has been missing from the start and was never added. Please understand that the database layer of diaspora* has received little love; apparently databases are not fun to a lot of people, and thus people with more than amateur knowledge of database optimization seem to be rare. We’re seeing bad speed because the people who created the slow queries knew how to get the data, but not how to get it fast. And as long as you only test with a dataset of a few hundred thousand posts, you may not even notice the difference. Let’s optimize the completely unoptimized stuff first, before we build contraptions nobody will understand in a few years, and nobody will want to maintain.

To give you a feeling for how optimizations have been done in the past, a little fun fact: not too long ago, we had a serious PR actually proposing to slap an index on every column that ends with _id. Never mind the unnecessary overhead this would introduce for updating and maintaining the affected tables; most of the indexes were simply unnecessary in themselves. Obviously, this PR was created with incomplete knowledge of the inner workings of a DBMS and of our application, diaspora*. Had I not shared what little I know about this stuff, it would have been approved, because to database amateurs it appears sensible. Now, it’s not a bad thing to be an amateur; it’s a natural step along the way to becoming an expert. But it is a dangerous thing to assume you can do optimizations by applying a principle you once heard of on a large scale, if you don’t understand how these optimizations actually work. It’s the Dunning-Kruger effect all over again. However, I’m starting to get philosophical, I apologize.

Pending improvements

@koehn, to give you a taste of what is possible, and to persuade you to give me a little more time before pressing your solution further, I will share what I did to improve the tag stream. As you know, the tag stream is part of /stream, as are a few other streams – by design – like the aspects stream. My reasoning is that if we manage to improve every stream by itself, the combined stream will be faster as well. Break it down, build from the bottom up, divide and conquer, yadda yadda yadda.

The original tag stream query is shown below.

SELECT 
    DISTINCT posts.*
 FROM
     "posts"
INNER JOIN "taggings"
    ON "taggings"."taggable_id" = "posts"."id"
        AND "taggings"."taggable_type" = 'Post'
LEFT OUTER JOIN share_visibilities
    ON share_visibilities.shareable_id = posts.id
        AND share_visibilities.shareable_type = 'Post'
WHERE
    "posts"."type" IN ('StatusMessage')
    AND ("share_visibilities"."user_id" = 1
         OR ("posts"."public" = 't' OR "posts"."author_id" = 1))
    AND (taggings.tag_id IN (66))
    AND (posts.created_at < '2018-12-29 02:52:37')
    AND "posts"."type" IN ('StatusMessage', 'Reshare')
ORDER BY
     posts.created_at DESC,
     posts.id DESC
LIMIT 15

The query optimizer loves to narrow down possible candidates first, so it starts selecting and joining from the taggings and share_visibilities tables. That is the moment the index on posts.created_at becomes useless; during the join, an index leading with posts.id will be used instead. That, in turn, requires the DBMS to load all eligible posts, then sort them, then take the first 15. You can imagine that for some tags, this takes some effort. For testing, I used tags with lots of posts to get a more noticeable difference in execution time.
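
To see this plan on your own pod, you can prefix the query with EXPLAIN. A trimmed-down illustration follows; the tag id and timestamp are placeholders, and the share_visibilities join is left out for brevity.

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT posts.*
  FROM posts
 INNER JOIN taggings
    ON taggings.taggable_id = posts.id
   AND taggings.taggable_type = 'Post'
 WHERE taggings.tag_id IN (66)
   AND posts.created_at < '2018-12-29 02:52:37'
 ORDER BY posts.created_at DESC, posts.id DESC
 LIMIT 15;
-- ANALYZE actually executes the query and reports real timings; BUFFERS shows how
-- many pages were read, which is where the table scan and external sort become visible.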

Another thing that makes the DBMS crawl instead of run is the DISTINCT posts.* phrase. We as developers know that the system only needs to compare posts.id and be done. The DBMS doesn’t know that. It considers all of the fields during sorting, which leads to an external merge sort in actual files, because more than 100MB does not fit into its dedicated sorting memory. You can imagine this takes resources.

The DBMS is not stupid. It recognizes it needs to sort for the duplicate elimination as well as the ORDER BY clause, and thus sorts first by ORDER BY fields, then by the rest of the fields. Now, it might just be able to use an index for the ORDER BY clause. However, the DBMS estimates differently than a human being. For the DBMS, the worst thing it can do is read from disk. If it were to sort only by our sorting fields, it might use an index and be done with it. However, since it needs all of the other fields as well, it considers it cheaper to avoid loading the index in memory, and rather just loads all pages from the disk that contain eligible rows. Mind you, it’ll have to load these fields eventually, but just for the top 15 rows, not for all of the thousands and thousands of rows that match all but the LIMIT 15 criteria.

We can mitigate this by telling it that it only needs to sort and eliminate duplicates based on the sorting fields. It makes the query text slightly longer, but for other reasons, this is not actually increased complexity. See the slightly changed query below.

SELECT posts.* FROM (
    SELECT 
        DISTINCT posts.id, posts.created_at
    FROM
        "posts"
    INNER JOIN "taggings"
        ON "taggings"."taggable_id" = "posts"."id"
            AND "taggings"."taggable_type" = 'Post'
    LEFT OUTER JOIN share_visibilities
        ON share_visibilities.shareable_id = posts.id
            AND share_visibilities.shareable_type = 'Post'
    WHERE
        "posts"."type" IN ('StatusMessage')
        AND ("share_visibilities"."user_id" = 1
             OR ("posts"."public" = 't' OR "posts"."author_id" = 1))
        AND (taggings.tag_id IN (66))
        AND (posts.created_at < '2018-12-29 02:52:37')
        AND "posts"."type" IN ('StatusMessage', 'Reshare')
    ORDER BY
         posts.created_at DESC,
         posts.id DESC
    LIMIT 15
) AS a
JOIN posts ON a.id = posts.id
ORDER BY
    a.created_at DESC,
    a.id DESC

With the appropriate indices added, this query’s execution time goes down by three orders of magnitude. It’s all about making the inner query use only indexes, and touching the actual table pages only in the outer query. As soon as there is a field to evaluate that’s not in the index, and thus prompts page loading from the actual table, the planner prefers to load everything and avoid an otherwise useful index. At least PG10 did. PG11 did magic with parallel queries.

So, for PG10, this query required the following index to run fast.

CREATE INDEX index_posts_on_id_and_type_and_created_at_and_public_and_author
    ON public.posts USING btree
    (id, type COLLATE pg_catalog."default", created_at, public, author_id)
    TABLESPACE pg_default;

All selecting fields are in the index, and PG10 knows it can use the index for the complete evaluation.

For PG10, this changed from 24s execution time to ~20-40ms execution time. For PG11, without that insane index, this changed from 13s to ~20-30ms execution time. I love PG11.

I have not been able to do this for MySQL/MariaDB yet, and it’s going to be a lot more work, I guess, because none of my beloved query testers use MySQL/MariaDB anymore, and these two DBMSs offer neither an EXPLAIN output as beautiful and rich as Postgres’, nor neat features like a Parallel Index Scan.


+1 on that, and I would add podmins to the list too: the performance should be good without trade-offs for users or podmins.

@csammy your work is very interesting and shows that, indeed, the problem is on the DB side and application code modifications should not be needed. However, a “not-perfect-but-shipped” solution is better than a “perfect-but-never-shipped” one, and you already talked about these optimizations maybe a year ago… so I’m a bit worried about the possible integration date. What can we do to help you and see those improvements integrated? The users of my production pod https://diaspora-fr.org know that the pod has been built to test things, so I’m fine with giving you access to the data if you need it. That could be a DB dump or access to the real production server if you want. Oh, and I recently upgraded it from PG9.5 to PG11 :smile:

I’m going to answer more extensively, but for the moment I just want to state very clearly that I do not want unsupervised access to the database or the data. There are several reasons for that.

  1. Privacy. Even though your pod was set up to be a test pod and people should know that, I believe they don’t. Time and time again, people have shown that they will not read things or take into account what they mean. I don’t want to work with data that has not been given to me freely, and to meet my definition of that, every user has to state explicitly that they’re okay with it.

  2. My reputation. I don’t want to be in a position where people can claim stuff and I’m the only one who knows better. I’m pretty sure I won’t abuse data. I also know how desperately I need access to a database to test query performance. But all of that is worth nothing if people come along and say “diaspora* gives data engineers access to their data!” This statement is so wrong on so many levels, but so difficult to disprove if we decide to go down that road.

  3. Precedence. I consider myself trustworthy in this matter, but if podmins see an example of other podmins handing their database to willing contributors, they might get used to the thought and eventually accidentally give data to a would-be developer who just wants to scam them. I don’t want to be the one who starts this.

My solution for the moment is to ask podmins like yourself to execute queries on my behalf and give me the DBMS statistics so I can improve the queries further, and to advance my synthetic dataset, which will be generated and thus not violate privacy.


As I’m quite available during the coming months, that is a solution which works for me. Please do not hesitate to ask me!

This is the heart of what both of these solutions are trying to accomplish. I disagree that windowed searches or a second search are inherently bad. However, that solution was a means to an end: our query for returning 15 stream elements causes a full table scan, which makes it take several seconds on a reasonably populated pod. That behavior in the database has to be stopped. It looks like you have done a great job of exploring why it occurs with our queries and how to make it stop. That would also explain why you get a comparable speed-up without the windowing in this particular case: you have likewise stopped the full-table-scan behavior.

If we had two solutions, one which did windowing but required a second scan on zero results, and one which simplified our unnecessarily convoluted queries (which also happen to force the database into a full table scan and kill our query performance), I would go with the second solution 100% of the time, even if it meant more testing, longer shake-out periods, and more proof-of-concept work on the other database engines. I’ve been thinking about setting up a second dev test machine locally for testing MySQL, since I got bit by that on the API updates with an overly constrained string field.

Getting query performance improved is a high enough priority that, now that the API is in the can, I can help support this before moving on to bigger changes geared towards 0.8.0.0. So, some questions for moving forward:

  1. Have you tested this tuning by modifying the Ruby code that generates the SQL, or is this still hand-tuned SQL, leaving us to figure out how to get ActiveRecord to produce it, short of just feeding it raw SQL?
  2. Does the project have a means of generating test data sets of arbitrary sizes, so that a person could create a multi-GB database in a timely manner for performance testing?
  3. Do you have a branch in your (or someone else’s) fork that others can begin to inspect?
  4. Assuming this gets addressed in a timely manner, is this something that can go into a 0.7 release if there are no schema changes? Does that answer change if there are schema changes?

Thanks again to @koehn, @csammy, and @denschub for the work on trying to tackle this incredibly important but complex problem. I can begin working on this sometime this week if I can get a one-on-one conversation going about it.


@csammy I’d be happy to run queries for you against my DB. I’m still on MySQL, though I’m planning to migrate to PostgreSQL at some point (hopefully sooner rather than later).


First, I’m glad to see the positive discussion. Thanks, everyone, for that. As experienced developers, we’re all aware of the competing priorities that go into any change, and how hard it is to balance the needs of users, admins/podmins, and other developers, especially in a group, and especially in a group of volunteers. Thanks for volunteering your time to be a part of the struggle.

I’d like to avoid the frame of “good” and “bad” when discussing a solution. We’ve all been around long enough to know that it’s either effective or it’s not. The “goodness” or “badness” is really a shorthand for saying “I don’t agree with the priorities that went into this solution,” and it would be better to express things in terms of those priorities. That keeps the discussion going, whereas calling someone else’s solution “bad” tends to shut it down.

With the complexity of these queries and the complexity of modern query optimizers, it makes sense to get as many plans from different pods as possible. I’m happy to send plans to @csammy, and if it would help, I’d propose an API that podmins could grant access to, which would run EXPLAIN on a query that folks like @csammy would like to test. A service could then call all the pods that person is authorized against, gather up the results, and display them in a meaningful way. Just a thought. I’ve found https://explain.depesz.com particularly helpful for displaying plans in a useful way and making them sharable. It also occurs to me that making the relevant parts of Postgres’ pg_stat_statements view available across pods would be helpful, too, for analyzing the effects of these changes.
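
For reference, the kind of pg_stat_statements query I have in mind looks roughly like this. It needs the extension loaded via shared_preload_libraries, and the column names are the PG 11 ones (newer releases rename them).

-- Top statements by cumulative execution time on this pod.
SELECT calls,
       round(total_time::numeric, 1) AS total_ms,
       round(mean_time::numeric, 1)  AS mean_ms,
       rows,
       query
  FROM pg_stat_statements
 ORDER BY total_time DESC
 LIMIT 10;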

Responding to @HankG:

Have you tested this tuning by modifying the Ruby code that generates the SQL, or is this still hand-tuned SQL, leaving us to figure out how to get ActiveRecord to produce it, short of just feeding it raw SQL?

I started by testing against the query that my database was complaining about the most (/stream), learning why it was slow, and playing around with different techniques to make it fast by modifying the SQL directly. I’m not a Ruby on Rails guy, so this was the most efficient way to go about it. With some help, I started working those changes back into the source code, but I may have introduced some unintended consequences, or I may not have hit both of the cases where the between clause is needed: in particular, it looks like the evil_query.ids! function also needs to be modified. I need some help tracking that down and cleaning up the tests I’ve broken. I’m happy to do the heavy lifting, but I can already see that I’m going to get stuck pretty quickly as I come up to speed on the Rails ORM and the test frameworks. At least I have an IDE that supports the language, without which I’d be completely hopeless.

Does the project have a means of generating test data sets of arbitrary sizes, so that a person could create a multi-GB database in a timely manner for performance testing?

Having worked on this problem in the past, I’m pretty sure it’s not worth the effort it would take to get results that mirror the actual data. I’d recommend getting the plans from actual instances instead, although I recognize that this doesn’t help for changes that involve adding, e.g., indices.

In the long term, I think having a stream table is the way to go to solve this problem, but it introduces a bunch of other changes that we’d have to work through first, and for now I’m happy to settle for just making the existing structure faster.
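
Just to illustrate the idea (names invented, not an actual schema proposal): a stream table would essentially precompute the visibility decision per user, so reading a stream becomes a plain range scan on the primary key.

-- Hypothetical per-user stream table; rows would be written when a post becomes
-- visible to a user, so the read path no longer needs joins or visibility logic.
CREATE TABLE stream_entries (
    user_id    bigint    NOT NULL,
    post_id    bigint    NOT NULL,
    created_at timestamp NOT NULL,
    PRIMARY KEY (user_id, created_at, post_id)
);

-- Reading the next page of a user's stream is then a simple range scan:
SELECT post_id
  FROM stream_entries
 WHERE user_id = 1
   AND created_at < '2018-12-29 02:52:37'
 ORDER BY created_at DESC
 LIMIT 15;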

For me, this is not more readable, but here, have the improved tag stream plan anyway :slight_smile: https://explain.depesz.com/s/SijL

Thank you for your input. I’m working on it :slight_smile:

Minor release upgrades should be very easy to do; that means:

  • No manual steps other than the ones listed in the guide (we had exceptions for big bugs, like when deletions were not processed correctly; we added a rake task in a minor version to fix that)
  • The time to upgrade is short, so a migration running for 30 minutes on a good server isn’t acceptable. Maybe it can be, if the migration can run while the pod is still up (which is not what we recommend at the moment)

Same goes for me. Even if I were fluent in Ruby on Rails, I’d do it this way. Iterations are faster.

I’m confident we can put this properly into AR; I already looked at it. It needs a bit of restructuring, e.g., in my opinion all queries should fetch just the ID and not all fields, unless explicitly required.


I’m personally not comfortable transporting actual user data around for testing purposes. If the answer is that we don’t have a good, large test data set, then in the short term we could have pod admins help out with that. I may look into using the API to build a large test data set of non-user data.

Both @denschub and I already helped @csammy with testing out new indexes and providing statistics from our medium-to-large-sized pods, and @flaburgan also offered his help, so that shouldn’t be a problem. :slight_smile:

I helped Sammy create a little Ruby script to insert posts with dummy data directly into the database. It doesn’t do much yet (it doesn’t build public/private posts in the correct proportions, for example), but it creates as many posts as fast as possible, and I think using Ruby/Rails directly to create dummy data is much faster than doing it through the API.
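
For pure query-plan experiments, there is also a database-only route that avoids diaspora*’s real schema and real data entirely; a rough sketch (table and column names are made up, not our schema):

-- Throwaway table with a few million rows spread over roughly a year,
-- to compare the "< timestamp" and BETWEEN plans without touching real data.
CREATE TABLE dummy_posts AS
SELECT g                                        AS id,
       now() - (g * 10 || ' seconds')::interval AS created_at,
       (g % 7 = 0)                              AS public
  FROM generate_series(1, 3000000) AS g;

CREATE INDEX ON dummy_posts (created_at);
ANALYZE dummy_posts;

EXPLAIN ANALYZE
SELECT *
  FROM dummy_posts
 WHERE created_at BETWEEN now() - INTERVAL '30 days' AND now()
 ORDER BY created_at DESC
 LIMIT 15;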


I’m sorry to bump this thread even though I have nothing more to add on the technical side, but users are complaining so much about this on framasphere.org that some of them are starting to leave. I can find money, time, datasets, anything that could help, but we really need to see this performance issue solved. @csammy please tell me how I can help you.


@flaburgan if you’re running MySQL, removing the index that @csammy added to help the /public stream really improves performance.

No, I’m running PostgreSQL.

Here’s another approach.
