Topic on Project:Support desk/Flow

DynamicPageList and Categories

5 comments • 14:52, 1 December 2022 2 years ago


Summary by HirnSpuk

Problems of Extension:DynamicPageList (Wikimedia) are discussed (compare phab:T287380). The extension might cause trouble when intersecting large categories. The pure listing of categories is fine. An example is used to illustrate this and is explained in some depth.

The limits of categories are discussed. Overpopulated categories (compare w:en:Category:Overpopulated categories) are a matter of usability, not of physical limitations.

HirnSpuk (talk) 14:51, 1 December 2022 (UTC)

HirnSpuk (talkcontribs)

Hello, due to a lack of community contributors (content is more or less fine) I would have liked to work on some automatic structuring "tools" based on categories and DPL on dewikibooks. Now I read that DPL has problems (and might possibly be abandoned in the future? Compare phab:T287380) and I found w:en:Category:Overpopulated_categories . So I'm worried whether my ideas will be feasible at all.

What I would like to do: sort the structure of Wikibooks more or less automatically via categorization of pages, so that content contributors only need to worry about categorization and not about editing community pages. Hence fewer edits, fewer errors, less archive work. To do this and to have some kind of control over the visual representation, I would have liked to use DPL. Because of the above-mentioned problem I don't think this is a good idea. Now, if I discard the idea of having some kind of control over the visual representation, I could use the category pages themselves. But if those are limited in some way, it might be a bad idea for the future. Not that I expect to run into problems in the near future, but there might come a time when somebody will. Restructuring might then become a pretty big task. So, if my ideas do not have any structural benefit for the future (aka "exploding" categories instead of "exploding" other pages), it might be a good idea to retire the idea and think about something else.

So two short questions:

1. Can someone explain briefly where the problem with DPL is, because I'm not quite sure if I understand it correctly? And as a follow-up: would you recommend not using it? The answer here is probably not really crucial, because I already think it's a good idea not to use it extensively. But any explanation here will be highly appreciated.
2. Is there a physical limit on how many pages a category can contain, and if so, what is it and where can I read up on this topic?

Thank you very much for any help anyone can provide. Best regards, HirnSpuk (talk) 18:16, 29 November 2022 (UTC)

Edited 18:19, 29 November 2022 2 years ago

Bawolff (talkcontribs)

Honestly, I would just use DPL for now. There is unease with the extension, but as of right now there doesn't seem to be any positive plan to get rid of it.

So the problem with the extension: if you combine (intersect) multiple categories, it does that inefficiently for large categories. That is, it works by loading all the entries in the first category and checking each one to see if it is in all the other categories [1]. If the first category has 200 entries, this is fine. If instead it has 2 million entries, this is not so fine.
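As a rough illustration of the simplified description above, here is a minimal Python sketch. It is not the extension's actual code: the function name `intersect_categories` and the in-memory dictionary standing in for the category data are assumptions made for the example.

```python
def intersect_categories(members, categories):
    """Return (pages found in every listed category, entries examined).

    `members` maps a category name to the set of page titles in it.
    It is a hypothetical in-memory stand-in for the real database table.
    """
    first, *others = categories
    examined = 0
    result = []
    # Walk every entry of the first category...
    for page in sorted(members[first]):
        examined += 1
        # ...and keep it only if it is also in all the other categories.
        if all(page in members[other] for other in others):
            result.append(page)
    return result, examined


members = {
    "Books": {f"Page {i}" for i in range(2000)},
    "Featured": {"Page 7", "Page 42"},
}
pages, examined = intersect_categories(members, ["Books", "Featured"])
pages_rev, examined_rev = intersect_categories(members, ["Featured", "Books"])
print(examined, examined_rev)
```

Both calls find the same two pages, but the work done depends on the size of whichever category is listed first, which is the point of the 200 versus 2 million comparison.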

However, if you are only using DPL with 1 category, then none of this applies. DPL just looks at the entries of the category it needs and displays them. It takes the same amount of time regardless of how big the category is, provided the sorting method is categorysortkey or categoryadd. Sorting by lastedit, length, or created may require loading the whole category.

Anyways, de.wikibooks.org is really small, so things should be fine no matter what you do.

In conclusion, I would suggest using DPL if it's useful. Keep in mind that should de.wikibooks.org really take off and increase in size by 10 or 20 times, you might have to change your templates if they use DPL with multiple categories intersecting.

As far as the drama goes with ru.wikinews.org - A lot of that was due to the community not being very cooperative with developers. In the very unlikely event that something happens and developers tell you you need to change something due to performance concerns, just change whatever needs to be changed. As they say on en wikipedia at w:Wikipedia:Don't worry about performance - generally speaking you shouldn't worry too much, but if something does happen, just cooperate with the people who are trying to fix whatever it is.

For question 2: No limit. de.wikibooks only has about 70,000 pages, so I suppose at most you could have a category with 70,000 pages in it (and currently the biggest is much smaller). Most of MediaWiki is designed so that if you look at a category, MediaWiki only has to load the first 200 entries, not the entire thing, so it doesn't matter much how big they get (DPL when intersecting multiple categories is an exception to this).

[1] - Please note, I'm oversimplifying a lot here, and strictly speaking this is not quite true, but it's mostly close enough.

Edited 20:12, 29 November 2022 2 years ago

HirnSpuk (talkcontribs)

Thank you very much for the explanation, that's exactly the "simplified" version I was looking for. Regarding "oversimplification", just one follow-up for question 2:

So I'm getting it right, there's no limit at all? Not only "no limit, because de.wb is small"? Is this in any way server-cost relevant if categories are large? Can I read up on this somewhere?

That said, another thing regarding oversimplification: being curious and seeing that you might have deeper insights, please feel free to elaborate further, if you'd like. I'm an electrical engineer by training, can program a little, and might get it if I'm not forced to look at a lot of pages figuring out the causal relations by myself. Use any of my talk pages, if you have the time and would like to.

Another side-note: I didn't follow the mentioned "drama". It might not be a question of "wanting to cooperate", but of availability, activity in the wikimedia-universe, enough spare time and ability (regarding rights and/or knowledge). Sure, I'd like to cooperate, but I might not be able to guarantee it because of "real-life-constraints" ;-). Thanks again, regards, HirnSpuk (talk) 12:06, 30 November 2022 (UTC)

Edited 12:07, 30 November 2022 2 years ago

Bawolff (talkcontribs)

There are eventually some limits - servers would eventually run out of disk space and that sort of thing. However, we're not really very close to those limits. There's also some softer limits - as the dataset gets bigger, less of it is going to be cached in ram at any given time, which can have a performance impact. Similarly, that's why large wikis like English Wikipedia get their own DB server, and smaller wikis share a DB server.

Currently the largest category is on commons with 34 million entries.

The way databases work is that you have the data, and then a bunch of "indexes" of the data. An index is just a sorted list of all the data based on some criteria. In the categorylinks table there are a bunch of indexes, but the important one for viewing a category page is the one on (cl_to, cl_type, cl_sortkey, cl_from), which basically sorts all the entries in order based on the name of the category, followed by the page type (media, subcategory or "normal" page) and the sortkey/name of the page in the category. So if you want to display the first 200 normal page entries of a category, the DB basically just finds where that category begins in the index (taking O(log N) time), and then reads 200 entries from that list in order, starting at that point. Since it's already in sorted order, the database can just look at the first 200 entries and stop, instead of looking through the whole category. If you're interested in the nitty gritty details on how this works, see w:B-tree.
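To make the "find the start, then read 200 entries" idea concrete, here is a small Python sketch. It only imitates the behaviour with a sorted list and binary search; a real database uses a B-tree, and the sample rows, the type value "page" as used here, and the function name `first_entries` are assumptions for illustration.

```python
import bisect

# Hypothetical rows shaped like the index described above:
# (cl_to, cl_type, cl_sortkey, cl_from), kept in sorted order.
index = sorted(
    (cat, "page", f"TITLE {n:05d}", 10_000 * c + n)
    for c, cat in enumerate(["Algebra", "Biology", "Chemistry"])
    for n in range(5000)
)


def first_entries(index, category, cl_type="page", limit=200):
    """Return up to `limit` entries of one category in sortkey order."""
    # Binary search for where this category/type begins: O(log N).
    start = bisect.bisect_left(index, (category, cl_type))
    rows = []
    # Read forward from that point and stop early.
    for row in index[start:start + limit]:
        if row[:2] != (category, cl_type):
            break  # ran past the end of this category
        rows.append(row)
    return rows


rows = first_entries(index, "Biology")
print(len(rows), rows[0][2], rows[-1][2])
```

The lookup touches at most 200 rows after the search, no matter whether the category holds 5,000 entries or many millions.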

However, for DPL where multiple categories are specified, you can't really do it like that since you have to find the pages that are in all specified categories. The best case scenario is still relatively cheap, but the worst case might involve looking through all the entries in the category.
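The gap between the best and the worst case can be shown with a toy example. This is a sketch under simplifying assumptions, not how DPL or the database actually plans the query: the function `first_matches` and the sample data are hypothetical.

```python
def first_matches(first_category, other_categories, limit=3):
    """Scan `first_category` in order; stop once `limit` pages are found.

    Returns (matches, entries examined). All names here are hypothetical.
    """
    matches, examined = [], 0
    for page in first_category:
        examined += 1
        if all(page in other for other in other_categories):
            matches.append(page)
            if len(matches) == limit:
                break  # enough matches found, no need to read further
    return matches, examined


big = list(range(100_000))   # entries of a large category, in index order
early = {0, 1, 2}            # the matching pages sit at the very start
none = set()                 # no page is in both categories

best = first_matches(big, [early])
worst = first_matches(big, [none])
print(best[1], worst[1])
```

When matches turn up early, the scan stops after a handful of entries. When there are few or no matches, every entry of the large category has to be examined before the query can finish.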

The situation on Russian Wikinews was that they mass-imported a lot of pages (about half a million) from a freely licensed Russian news source. All of these pages had an infobox on them with a DPL. The increasing size of the categories made these DPLs slow. Additionally, importing all these articles quickly meant they all had to be rendered at the same time. Most of MediaWiki assumes database queries are very quick, and that most of the rendering time is spent in the CPU doing non-database stuff. This wasn't really true in this situation; as a result the DB started to get backed up, and requests to it piled up and overwhelmed it, making everything even slower.

Anyways, after that was dealt with, a few months later Russian Wikinews did another similar import. Although there may possibly have been language barrier issues, I don't know, at the time it seemed like they did it very incautiously, doing the import as fast as possible without any consideration of possible risks. This caused the same DB problem, but also, because the DB was so slowed down, normal requests to projects using that DB sort of hung. As a result all the PHP servers were stuck waiting on the DB to respond, and web requests started to pile up, which caused things to spiral out of control even further, triggering downtime not just for sites using that DB server, but for all Wikimedia websites. Anyways, when Russian Wikinews was told they had to stop, they got very angry, called all the devs incompetent, and wrote a "news" article about how WMF devs are screwing over ru wikinews. This didn't really endear them to the developers involved, who were not too happy to have to deal with a second outage caused by the same people doing the same thing. The end result was that DPL was removed from ruwikinews (after all, it was only designed for small wikis, and ruwikinews was no longer small after the mass import). Additionally, some changes were made to DPL that it was hoped would improve performance. Recently made DPL queries were cached for a short time, as a big part of the problem was doing the same DPL query over and over for templates used on many pages. There was an attempt to use PoolCounter to limit concurrency, but that wasn't enabled due to a bug that we couldn't figure out. A timeout for DPL queries was also added.
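The idea of caching recent queries for a short time can be sketched as follows. This is a generic Python illustration and not the change that was made to the extension: the class `ShortLivedCache`, the 60 second lifetime, and the query string are all assumptions for the example.

```python
import time


class ShortLivedCache:
    """Remember query results for `ttl` seconds (hypothetical helper)."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # query -> (expiry time, result)

    def get(self, query, run_query):
        now = self.clock()
        hit = self._store.get(query)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the database entirely
        result = run_query(query)
        self._store[query] = (now + self.ttl, result)
        return result


calls = []


def slow_query(query):
    calls.append(query)  # record every time the "database" is really hit
    return f"result for {query}"


fake_now = [0.0]  # controllable clock so the example does not need to sleep
cache = ShortLivedCache(ttl=60, clock=lambda: fake_now[0])
for _ in range(1000):  # the same template rendered on 1000 pages
    cache.get("category=News", slow_query)
print(len(calls))
```

With the cache in place, a template used on many pages triggers the expensive query once per lifetime window instead of once per page render.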

The nitty gritty technical details of the Russian Wikinews situation are at wikitech:Incidents/2021-07-26_ruwikinews_DynamicPageList

Anyhow, the moral of the story: if you do something that takes down Wikimedia websites, be really careful before doing it a second time, and don't get angry when someone tells you to stop.

Edited 04:51, 1 December 2022 2 years ago

HirnSpuk (talkcontribs)

You, Sir, are a steely-eyed Wikimedia man :-)! Thank you so much for the perfect answer and explanation. I sincerely appreciate the time and effort you put into answering my question! Have a nice Christmas time of year! Best regards, HirnSpuk (talk) 14:46, 1 December 2022 (UTC)

14:46, 1 December 2022 2 years ago