Topic on Talk:Reading/Web/Projects/Related pages/Flow

J.K. Rowling recommended in (almost) every Novels article (or how the algorithm places too much emphasis on high profile articles)

Sadads (talkcontribs)

So I am noticing a pattern: I am assuming the algorithm that suggests the articles is based on the degree of "relatedness" in some form of network analysis, based on link closeness, categories and similar language? I work on novels articles almost exclusively in my volunteer time (I am on the Wikipedia Library team the rest of the time), and <s>almost every single</s> very many of the novels articles I have read have author pages at the bottom AND one of the authors is J.K. Rowling.

My suggested solution: Might I suggest the algorithm put more weight on categories rather than "closeness" in links or quality or pageviews? I would imagine most of our readers would favor like types of items (for example, novels -> novels instead of novels -> authors who didn't write the book) -- and we have a real opportunity here to profile some of our less viewed articles (not the stub with 50 views a month, but say, the start/C class with 300-400 views a month), instead of cycling everyone to the highest profile articles in the network. I can see how the kinds of recommendations that have been coming to me will lead to a self-reinforcing cycle: the articles with high views and/or high quality get recommended more often than other articles, which means they get more eyes on them, which means more edits to those articles (or fewer new editors, because they aren't seeing the sloppy bits of Wikipedia -- most of us started on copy edits and detail fixes when the material was less than ideal).

In short: the algorithm is optimized poorly for the kind of results we want people to get if there is a slight chance that they will become editors or participate in the less visited parts of Wikipedia: we want their eyes on the less visible and good, but not amazing, items, because that is how we improve the encyclopedia. I would love to talk through this more @Jdlrobson.

Jdlrobson (talkcontribs)

Thank you so much @Sadads for highlighting this example where the algorithm consistently fails.

I'd suggest that @Jkatz (WMF) would be a better person to talk through this with, since he is the lead for this project. That said, he's on paternity leave right now - User:Melamrawy_(WMF), is this something you and User:ABaso_(WMF) could work on capturing in Jon's absence?

Sadads (talkcontribs)

@Jkatz (WMF), @Melamrawy (WMF), @ABaso (WMF) I would be happy to chat or reflect on this.

This morning Rowling wasn't in the articles that I saw her associated with last night: so something changed (or you have a bit of randomness going on). I just get this unnerving feeling that I have seen her image and name on way too many pages --- especially fantasy and/or children's lit pages. She is not the only author we cover, and as someone who would like to see our content grow, I think it's important to show just how diverse our content can be, rather than focusing overwhelmingly on the "high demand" stuff. I feel like, especially on any topic with a popular culture connection, we will end up reinforcing hegemonic topics -- topics with a certain level of cultural ubiquity/dominance -- rather than exciting curiosity in the unknown.

Melamrawy (WMF) (talkcontribs)

Hi @Sadads, I actually didn't encounter a similar J.K. Rowling experience while browsing novel articles. Can you please walk us through your browsing scenario for better clarity? Thanks.

Sadads (talkcontribs)

Hi @Melamrawy (WMF), so I was working on en:Jack (Homes novel), and it had Rowling there yesterday; now the space where Rowling was is occupied by Harvey Milk, and the place where A. M. Homes was is now occupied by David Bowie! (Think through a logical connection there :P) These are crazy odd "related articles": but I could see how an algorithm could call them close if it favored community assessment and pageviews and/or closeness via internal links. I do see Rowling (alongside Kafka and Marilyn Monroe !?!?) on the author page for en:A. M. Homes.

I also work on articles en masse to improve linking and categories for Novels articles, so I can't place exactly which articles Rowling has been on before. However, I do know most of these articles are relatively young novels or authors articles (I use tools like edwardbetts.com/find_link/debut_novel to add links). For example, I just linked debut novel on https://en.wikipedia.org/wiki/Sarah_Mason_(novelist) and yet again Rowling (alongside Mary Shelley and Kylie Minogue !?!?). Perhaps the algorithm is seeing the "Debut novel" article link combined with connections to common pages like en:Novelist and seeing J.K. Rowling as the closest connected of these articles; I would imagine any conventional reader would see these as connected in the "I am reading about this topic, I might want to learn about that other topic" sense. The common thread here: all of the results I have mentioned are high visibility/high quality articles, with only internal linking connecting them (rather than a common category, or sharing non-top level/generic topic connections). I suspect that the biggest nodes in the Wikipedia link network (if the algorithm is using network analysis) are dominating the internal link "closeness" score.
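To make the suspicion above concrete, here is a minimal sketch on a made-up link graph. The page names, link sets, and both scoring functions are assumptions for illustration only; nothing here is known to match what the feature actually computes. It shows how counting raw shared links lets a hub with hundreds of links win, while normalizing by the size of the link sets (Jaccard similarity) favors the closer neighbour.

```python
# Hypothetical toy link graph: page title -> set of pages it links to.
links = {
    "Niche novel": {"Novelist", "Debut novel", "Fantasy", "Small press"},
    # A hub article links to the common pages plus hundreds of others.
    "Hub author": {"Novelist", "Debut novel", "Fantasy"}
    | {f"Topic {i}" for i in range(200)},
    "Similar novel": {"Novelist", "Small press", "Literary fiction"},
}


def shared_links(a, b):
    """Raw count of outgoing links two pages have in common."""
    return len(links[a] & links[b])


def jaccard(a, b):
    """Shared links divided by all links of both pages (0..1)."""
    return len(links[a] & links[b]) / len(links[a] | links[b])


def rank(source, score):
    """Order every other page by the given closeness score, best first."""
    candidates = [page for page in links if page != source]
    return sorted(candidates, key=lambda page: score(source, page), reverse=True)


by_raw = rank("Niche novel", shared_links)
by_jaccard = rank("Niche novel", jaccard)
print("raw shared links:", by_raw)
print("jaccard:", by_jaccard)
```

In this toy graph the hub shares three links with the niche novel and the similar novel only two, so the raw count ranks the hub first; once the hub's 200 unrelated links are counted against it, the similar novel comes out on top.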

Anyway, it's a really neat feature, but it provides really odd/blah results, esp. if we want people to see the less visible parts of Wikipedia (which we do, because that's how they become editors). As someone who curates the content: I want search to find the biggest, yet most useful articles, while this feature should be for surfacing the less needed, but more curiously interesting bits.

Sadads (talkcontribs)

I will keep listing articles that I see Rowling on in this thread: https://en.wikipedia.org/wiki/A_Summer_Bird-Cage https://en.wikipedia.org/wiki/Isabel_Fonseca https://en.wikipedia.org/wiki/Andrew_Michael_Hurley https://en.wikipedia.org/wiki/The_Queen_of_the_Tearling https://en.wikipedia.org/wiki/Did_You_Ever_Have_a_Family https://en.wikipedia.org/wiki/Tell_The_Wolves_I%27m_Home https://en.wikipedia.org/wiki/Cathy_Marie_Buchanan https://en.wikipedia.org/wiki/John_Michael_Cummings

I am also going to track high frequency articles and their topic areas: Mary Shelley, Jane Austen (Romance novels), Kylie Minogue (not seeing a pattern), William Gibson (science fiction), Marilyn Monroe (not seeing a pattern)

Jkatz (WMF) (talkcontribs)

Hi Sadads, sorry for the delay. I have been on paternity leave and am just catching up. I just took a browse through and I am seeing the JK Rowling phenomenon with authors, particularly niche authors. I expect this has something to do with pageview volume. That being said, it is not something I have noticed with niche actors. I think this is something to look into as a refinement, but I am curious: do you see it as a blocker for the feature or something improvable? Does the fact that the selections are editable assuage your concerns?

Sadads (talkcontribs)

No worries on the delay: totally understand paternity leave.

Are the selections editable? I am not seeing a clear way in the interface for me to tweak suggestions.

I am just thinking that there is no value added beyond the fully hand curated "see also" and Navboxes: esp. when the links you are adding are either a) already linked in the article (this is already the case with something like 1/4-1/2 of the results I am seeing -- sometimes it's even the article itself (I think this problem might be associated with redirects)) or b) so central to the network that they ought to be common knowledge -- or can be found really easily through link chaining (vis-a-vis the behavior promoted in http://thewikigame.com/ ). The problem, I think, is that you are using a tool designed for getting a "closest to the search term" result (Cirrus search) to do something that ought to be focused on the "nearest, as in same neighbourhood of knowledge".

The real value of linking in Wikipedia for our readers is the hand curated incongruities between what our readers thought they came to Wikipedia for, and the long chain of other things that are connected to that topic, which excite their curiosity. If you plan to enable this tool: it really ought to provide "unexpected but rationally connected" results that excite the imagination, rather than a) known quantities or b) stuff that doesn't need more attention by potential new editors -- even if it's new or different. This tool seems to be at the opposite extreme of the Random Article tool: it provides almost too obvious/central topics that aren't exciting. It would be great to have a list of articles, editable by (admins?), which could be excluded from the results, to force the algorithm to work around these unusually central/important articles as assessed by the algorithm.

I like the idea of pushing more exploration of our Wikis (this is a really valuable engineering effort), and for smaller Wikis which don't have the level of micromanagement of connections (links, categories, navboxes) that happens on English or some of the higher volume edit Wikis, this tool as it stands might make sense. However, on the bigger wikis, editors much less flexible than me will be very angry about the tool circumventing their long, hard, hand-curated work AND producing unusually not useful links, when this tool could be doing something that creates innovative new "ah ha!" moments (that serendipity moment that makes library research so fun: I would highly recommend reading: http://dp.la/info/2014/02/07/planning-for-serendipity/ ).

The algorithm is just not sophisticated enough and really needs a way to be managed locally so that you aren't having to anticipate the community's tweaking needs centrally. If you could a) define the variables that rate pages, b) provide an interface where admins could tweak those variables to meet something closer to consensus needs, and exclude pages, c) do more testing with people that see hundreds of pages a day (editors), and d) add machine learning that prioritizes the kinds of connections that people click through on, I think you would get something that would be really fun for the communities to play with and use. But as it stands now, it's not useful in the grand scheme of things: it neither promotes exposure of interesting content to readers, nor exposes them to "new/different/esoteric" content that many of our editors pride themselves on working on.
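As a rough sketch of what the locally managed exclusion could look like: the `LocalSettings` fields, the `filter_suggestions` helper, and the candidate list below are all hypothetical names made up for illustration, not an existing interface. The idea is a post-processing step over whatever ranked list the algorithm returns, dropping the article itself, anything on a community-maintained exclusion list, and (optionally) pages already linked in the article.

```python
from dataclasses import dataclass, field


@dataclass
class LocalSettings:
    """Hypothetical knobs a local community (e.g. admins) might maintain."""
    excluded_titles: set = field(default_factory=set)
    drop_already_linked: bool = True
    max_results: int = 3


def filter_suggestions(page, ranked_candidates, linked_on_page, settings):
    """Return the first few ranked candidates that survive the local rules."""
    kept = []
    for title in ranked_candidates:
        if title == page:
            continue  # never suggest the article itself
        if title in settings.excluded_titles:
            continue  # locally blacklisted, overly central page
        if settings.drop_already_linked and title in linked_on_page:
            continue  # reader can already reach it from the article text
        kept.append(title)
    return kept[: settings.max_results]


settings = LocalSettings(excluded_titles={"J. K. Rowling"})
suggestions = filter_suggestions(
    page="Thomas Meehan (writer)",
    # Hypothetical ranked output of the underlying algorithm.
    ranked_candidates=[
        "Thomas Meehan (writer)",
        "Musical theatre",
        "J. K. Rowling",
        "Maury Yeston",
        "Hairspray (2007 film)",
    ],
    linked_on_page={"Musical theatre", "Hairspray (2007 film)"},
    settings=settings,
)
print(suggestions)
```

This sketch compares titles literally, so a redirect pointing back at the article would still slip through; handling that would need redirect resolution before filtering.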

P.s. Some more examples of not useful linking ("useful" ones are the exceptions). Based on this set, you are talking 3 out of 18 links that encourage someone to explore the depths of Wikipedia around similar items, rather than the surface or topics that are already common knowledge:

https://en.wikipedia.org/wiki/Yazoo_and_Mississippi_Valley_Railroad Listed articles: Memphis, Tennessee (central in link network), W. C. Handy (bizarre, possibly central to network?) and Alabama (central in link network)

https://en.wikipedia.org/wiki/Aim%C3%A9_Ngoy_Mukena Military of the Democratic Republic of Congo (linked on page), Democratic Republic of Congo (linked on page), Lubumbashi (bizarre, possibly central to network?)

https://en.wikipedia.org/wiki/Francis_Patrick_Donovan Gough Whitlam (useful), Stanley Bruce (useful), Australia (central to network, linked on page)

https://en.wikipedia.org/wiki/Thomas_Meehan_(writer) Musical theatre (central, and linked on page), Maury Yeston (Unexpected, interesting connection: useful), Hairspray (2007 film) (linked on page)

https://en.wikipedia.org/wiki/Michael_E._Smith Aztec (central), OCLC (central and bizarre), Nahuatl (central)

https://en.wikipedia.org/wiki/James_Morrill Minnesota (central), W.E.B. Du Bois (central), Michigan State University (central and/or tangential).

Jkatz (WMF) (talkcontribs)

@Sadads Thank you for your thoughtful analysis! I agree with most of your concerns here and proposals, specifically:

  1. The algorithm could be better, and in an ideal world, we would improve it. The Android team, which is already using this algorithm, is planning on tweaking it soon.
  2. The algorithm variables/rules should be highly visible (we are planning on publishing them in simple English sometime soon).
  3. Ideally, editors could tweak the algorithm at the local level or at least be able to blacklist or edit the items.

To clarify some points: the read more options are editable. See here for an explanation: Topic:Suqj6do13qpmlerd. It also explains some of the benefits over 'see also'.

Regarding the notion of 'is this useful', I would start with: is it harmful? Since it is at the bottom of the article, below the references, it is hard to argue that it is getting in the way of other content. For the vast majority of pageviews, pages are very long, and anyone who has scrolled past the references is clearly looking for something more than they are finding; there is a need that was not satisfied by the various links above (such as see also) that make Wikipedia so great.

Now, is it actually helpful? The data suggests not only that users are clicking on it but that they are continuing to click on it at a high rate. As has been pointed out several times in this thread, you can game click-through rates with flashing lights, pictures etc. However, the click-through rates on unhelpful links drop over time as users come to trust them less and less. This is not what we were seeing with 'related pages' at all before I left. I'll have to check if this has changed since returning from pat leave, but my SQL access is funky right now.

So, if readers continue to click on related pages links at high rates, more frequently than on any other link on the page (even more than links in the first section), despite their being at the bottom, I am inclined to agree it is a useful navigation tool for our readers (if not for power editors).

Could it be better? Definitely. The question is, now that we are helping people spend more time on Wikipedia, by 5% on desktop, 10% on mobile (as of last check), how much more do we want to invest in this feature to improve it vs. other features? If editors won't be happy with it as is, then maybe more investment is warranted on that alone. Otherwise, it is a question of what else we might do with those resources.

What do you think? (and I owe you updated #s)

Sadads (talkcontribs)

I wrote a really good response, but accidentally clicked on a link and lost it all (ACH!). How about we schedule 20-30 minutes sometime? (I am based on the east coast, and my WMF calendar is usually up to date.)

Jkatz (WMF) (talkcontribs)

Oooh, I feel your pain. Re: chat--this week is wrecked, but I put something in for next week.

Jkatz (WMF) (talkcontribs)

BTW for public record: I reran the numbers and the click through rate for mobile is very high at 19% of anyone who sees it. The desktop (vector) numbers are much lower at 4%. In both cases, the sample size is very small (<200 'seen' events per day). The likelihood of someone reaching the bottom of the desktop page is much higher than on mobile, so that might have something to do with it. This tool may simply be better suited for mobile, where the users who see it are a little more dogged.
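To give a feel for how much the small sample matters, here is a minimal sketch that puts a 95% Wilson score interval around each click-through rate. It assumes, purely for illustration, a single day with exactly 200 'seen' events per platform (the numbers above only say fewer than 200), and it only captures sampling noise, not differences in who reaches the bottom of the page.

```python
import math


def wilson_interval(rate, n, z=1.96):
    """95% Wilson score interval for an observed proportion `rate` over n trials."""
    denom = 1 + z * z / n
    center = (rate + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(rate * (1 - rate) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


SEEN_EVENTS = 200  # assumed upper bound for one day; not a measured value

mobile_low, mobile_high = wilson_interval(0.19, SEEN_EVENTS)
desktop_low, desktop_high = wilson_interval(0.04, SEEN_EVENTS)

print(f"mobile:  {mobile_low:.3f} .. {mobile_high:.3f}")
print(f"desktop: {desktop_low:.3f} .. {desktop_high:.3f}")
print("intervals overlap:", desktop_high >= mobile_low)
```

Under that assumption each interval is several percentage points wide, so the exact rates are uncertain, but the two intervals do not overlap, which is at least consistent with a real mobile/desktop difference.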

Sadads (talkcontribs)

Super exciting: @Jkatz (WMF) A lot less emphasis on J.K. Rowling, and Katy Perry, and any number of other centralities!

Nice. I wonder if we can do anything to bring a little bit more relevance to how the community hand curates material. For example, @Harej has built a database of articles by WikiProject that might be able to help with pointing at the intersection of different curated groupings. I would imagine putting a bit more weight on categories might be useful as well -- of course I realize that might not be a priority at the moment. It also might be worth reducing the scaling effect of popularity (multiplying the scores by each other will always return the really big and popular, whereas reducing the emphasis on that criterion by scaling it to, say, 0.10 of the original value would still give it weight, just not so much that it overwhelms other articles).
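A minimal sketch of the scaling idea, with made-up candidates and scores (the relevance values, pageview counts, and both scoring functions are assumptions, not the real algorithm). One caveat: in a pure product, multiplying popularity by a constant such as 0.10 would not change the order of results, so this sketch instead compresses pageviews with a logarithm and adds them to relevance with a 0.10 weight.

```python
import math

# Hypothetical candidates: (title, relevance in 0..1, monthly pageviews).
candidates = [
    ("Blockbuster author", 0.30, 500_000),
    ("Related mid-list novel", 0.80, 350),
    ("Unrelated celebrity", 0.10, 900_000),
]


def product_score(relevance, views):
    """Relevance multiplied by raw popularity: big pages swamp everything."""
    return relevance * views


def damped_score(relevance, views, popularity_weight=0.10):
    """Relevance plus a small, log-compressed popularity term."""
    popularity = math.log10(views + 1) / 6  # roughly 0..1 up to a million views
    return relevance + popularity_weight * popularity


def top(score):
    """Title of the best candidate under the given scoring function."""
    return max(candidates, key=lambda c: score(c[1], c[2]))[0]


print("product:", top(product_score))
print("damped: ", top(damped_score))
```

With these toy numbers the product picks the blockbuster author, while the damped score picks the more relevant mid-list novel; popularity still breaks ties between equally relevant pages.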

Jkatz (WMF) (talkcontribs)

@Sadads Thanks, good thoughts! I can't say that adding categories to the considerations is something we can prioritize right now, but that is an interesting idea, as is the notion of simply scaling back the coefficient(!). I think for now, the goal is to see how far this tweak gets us.
