Creating a Test Data Set

csammy · October 22, 2018, 6:35pm

I want to improve our stream loading times and take another chance with the Evil Query. Since query optimizers choose the most efficient execution plan for the query at hand depending on the current data set, I need a database filled with life, that is, with data that is typical for a normal pod.

To that end, I took a first step and mapped out an ER model for the entities involved in the creation of our streams.

Entity Relationship Model

If you spot or suspect any errors, feel free to share.

In a second step, I need statistics on the average data distribution among those relations. To make sure I don’t get information I don’t want in the first place, I thought long and hard about how to go about this, especially with the diaspora* project being privacy-respecting to the core.

I need to know the total amount of data points per entity to know how much I need to insert into the tables. In addition, for every relationship in the ER model above, I need to know how “intense” they are and how this is spread.

To put this into perspective, here is an example. If I want to know how resource-heavy it is to get posts by tags, I need to know how many tags the lower 10% of posts have at maximum, the same for 50%, for 90% and the maximum tags on a post overall. This way, I can let a post generator produce random posts with appropriate probabilities for the contained tag count. However, I cannot infer private details from the posts at all, since all I have is four different markers on the probabilty of how many tags were used in a post. This is very important to me, and if you find a mistake, please tell me so.

Now posts as an entity are a bit more difficult than the others because they have multiple visibilities and their timestamp is relevant to sorting. As such, I would also need to know how many posts have been created in the last month in order to reflect and simulate recent posting behaviour. And, very importantly, I’d need to know the share of posts being public and thus implicitly visible to everyone.

I’m going to put together a few SQL commands for both MySQL and PostgreSQL and would be very happy if a few podmins would opt for sharing these numbers. If you feel you would violate the privacy of your users doing this, please talk to me so I’ll be able to revise this, as I really don’t want to have anything private in this dataset.

Edit: Updated ER model

supertux88 · October 22, 2018, 7:07pm

The only thing I see that’s missing is likes, but since it isn’t used for streaming loading, it probably isn’t important and you left it away intentionally? But then comments are probably not interesting either?

Are reshares interesting? Or are they just public posts without tags?

Maybe it’s also interesting to know how many posts in the stream are selected because of tags, or aspects or mentions? Or is that not important?

Let me know what I should run, and I’ll get you the numbers I also have a read-only backup where I can run expensive queries without impacting the performance of the pod

Also I can run some queries on the HQ pod if you like, that’s probably not interesting for most of the stats (since it’s only a single user pod), but since most other pods automatically share with this account, it has many (mostly public) posts. So it’s maybe interesting for how many posts a person writes and how many tags a post have?

HankG · October 23, 2018, 12:57am

One of the things I’m hoping to do with the API (once it’s complete) is external load testing as well. These sorts of metrics will be helpful for determining parameters for that sort of thing. The thing I am torn on is what amount of load testing needs to be with a farm of virtual web browser users versus just stimulating through the API. Since the API calls into the lower level services the same as the controllers that feed the website it’s a question of how much extra resources are spent loading static content, generating the pages, etc. versus the lower level service calls and calls into the data layer. Yes, I know I’m getting ahead of myself until the API is completed…

denschub · October 24, 2018, 7:31pm

The same is true for me, so ++.