I propose to wait a few weeks before making a new page listing. By then, the larger wikis already listed will have had a chance at a first cleanup. Then a more complete (and simpler) script run would be helpful.
Topic on Talk:Parsoid/Language conversion/Preprocessor fixups
@Amire80 asked for more complete results, so I've made one based on the 2017-05-01 dumps. There probably won't be another dump made until 2017-05-20 or so, at the earliest.
@Cscott, I think getting a new list (and eventually sending an update to affected communities) is the last step so I can declare https://phabricator.wikimedia.org/T165175 closed. LMK when you can do it. I think it's safe for you to skip Wikidata (dunno about the rest).
I looked over the list, and most of the wikis have no problems or just a small number of pages. Are you all feeling ready to go? Or do you want to wait for the next dump (a few days from now?) and make a decision then?
In short: 1. Don't regenerate the list; the "done, with exceptions" notes are helpful. 2. Once this goes live, pages will break only in minor ways, and we can keep working through this list (if I understood correctly).
Making a new list from the next dump looks like a bad idea. It would remove all notes like "done, some issues" etc., so helpful status information would be lost (we'd have to revisit and check those wikis once more). Also, because the required edits leave some hits behind (false positives, no harm), we are not aiming for a zero-page list (in which case a new dump list would be helpful).
If "ready to go" means roll out the change, I'd have to leave that decision to others. If I understand the big issue correct, pages will be broken in details (e.g., text disappearing), but not fatally. I can not judge on the effects in sister wikis.
The current plan is to merge the patch on Monday and have it deployed next week. In the meantime, we are considering adding a new Linter category so that we can more precisely identify the remaining instances that might need fixing. So, yes, we can skip doing another round of dump grepping unless it turns out to be necessary for some unforeseen reason.
Is this live now, or was it reverted?
This was live on group2 wikis briefly as part of 1.30.0-wmf.2 before it was reverted because of T166345. So, as of now, this is only live on group0 and group1 wikis.
It has since gone live on all wikis. I'm generating another list from the 20170601 dump, though I think it will be more useful to wait for the 20170620 dump to complete.
Hello, User:Cscott. Thank you for your work. Is there a page I can watch to stay aware of new runs? I only learned about the June 1 run just now, from your post above. Thanks.
Well, the dump should be linked at https://www.mediawiki.org/wiki/Parsoid/Language_conversion/Preprocessor_fixups#Articles_which_need_to_be_fixed.2C_by_project when ready, so I think that's it.
Thank you.
Yeah. I deliberately didn't make a lot of noise about the 20170601 run, because I'm about to replace it with results from the 20170620 dump which should be better, since they won't include as many pages which were already fixed up in the last big community cleanup push. But if you watch the Preprocessor_fixups page, you should get notified when that's done.
What is that page?
The one I linked above? :)
I thought this was the intention, but the June 1 run was published there yesterday, so I don't think Cscott is talking about that one.
Whatever comes up next, you'll find it linked there. Promise!
Like
N3: That seems like a too-big Like. Something wrong with Flow?
Leave it! I like it.
Fine for here, but if it's a bug, it should be fixed for future usage.
Little birdie tells me it's not, see Template:Like.
??? OK...............
We REALLY LIKE things around here.
:-)
New 2017-06-20 dump is up and linked from Preprocessor fixups. Enjoy!
I'm going to be looking at adding a parser warning or a linter rule to catch these in the future, hopefully when they occur, since I noticed a few cases of editors flailing around to see why their templates weren't working.
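(For the curious, here is a minimal standalone sketch of what such a check might look like. It is an assumption-laden approximation of my own, not the actual parser warning or Linter rule: it treats any bare "-{" outside <nowiki> as suspect, whereas a real rule would hook into the preprocessor itself.)

```python
import re

# Hypothetical, simplified converter-markup check. The real Linter rule
# would run inside the parser/preprocessor, not over raw wikitext.
NOWIKI_RE = re.compile(r"<nowiki>.*?</nowiki>", re.DOTALL | re.IGNORECASE)
OPEN_RE = re.compile(r"-\{")

def find_converter_markup(wikitext):
    """Return (line, column) positions of a bare "-{" outside <nowiki>."""
    # Replace <nowiki> spans with same-length padding so the offsets
    # reported below still refer to positions in the original text.
    stripped = NOWIKI_RE.sub(lambda m: " " * len(m.group(0)), wikitext)
    hits = []
    for m in OPEN_RE.finditer(stripped):
        line = stripped.count("\n", 0, m.start()) + 1
        column = m.start() - stripped.rfind("\n", 0, m.start())
        hits.append((line, column))
    return hits

if __name__ == "__main__":
    sample = "plain text, a -{ conversion block }-, and <nowiki>-{ safe }-</nowiki>"
    print(find_converter_markup(sample))  # [(1, 15)]
```

Blanking the <nowiki> spans with same-length padding keeps the reported line/column numbers aligned with the original text, which is what an editor flailing around to find the broken spot would actually need.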
Most of the problems on our wiki came from the Tech News issues about this very problem :-). Surprisingly, about 99% of the problems on user talk pages subscribed to Tech News were not recognized here.
I'd appreciate if you could clarify this comment. I don't understand what you mean.
There were two issues of Tech News that explained the language conversion problem, and they contain the forbidden string in the news message. So the new run found these pages in the Wikipedia namespace, but it did not find them in the User talk namespace, even though many users are subscribed to the news bulletin.
So @Cscott, I know you're filtering out non-wikitext, but if the mention in Tech News shows up in one namespace, shouldn't it also show up in user talk? (TN can get delivered to subscribers' talk pages.) Or maybe we are assuming that it shows up in the Wikipedia namespace because of Tech News, when that's actually not the case.
Can you give me two specific URLs (or article names), one where the mention is included in the results and one where it is not? That will let me diagnose the issue.
I looked into this a bit. We use the -pages-articles.xml.bz2 dumps for everything except labswiki. These contain "articles, templates, media/file descriptions, and primary meta-pages". It appears they do *not* contain user talk pages. The Tech News subscription page appears to push to a list of "community pages", and then to the *talk* page for the pages listed under "Users". I found hits from the "community pages", but some of these have archive or other features which complicate the issue. For example, Commons:User_scripts/tech_news is subscribed, but if you look at the history you see that User:ArchiverBot moves the published Tech News to its own archive after 7 days. So the actual hit listed in the 2017-06-20 dump is Commons:User_scripts/tech_news/Archives/2017/May, which is where Tech News 2017-19 ended up. It just so happens in this case that the archive page is still included in the -pages-articles dump. On other wikis, with other archivers, or different distribution destinations, the ultimate location of Tech News 2017-19 might not be included in -pages-articles, and thus it wouldn't show up in the results.
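(Aside, for anyone wanting to reproduce this kind of check themselves: below is a minimal sketch of a dump scan in Python. It is my own illustration, not the actual script. It streams a -pages-articles.xml.bz2 file and prints titles of pages whose wikitext contains a bare "-{"; the real run additionally filters out non-wikitext content models, which this skips.)

```python
import bz2
import sys
import xml.etree.ElementTree as ET

# Tags in MediaWiki XML dumps carry a version-specific namespace like
# "{http://www.mediawiki.org/xml/export-0.10/}title", so we match on
# the local name instead of hardcoding the namespace.
def localname(tag):
    return tag.rsplit("}", 1)[-1]

def scan_dump(path, needle="-{"):
    """Stream a -pages-articles.xml.bz2 dump and yield the titles of
    pages whose wikitext contains `needle`."""
    with bz2.open(path, "rb") as f:
        title = None
        for _event, elem in ET.iterparse(f, events=("end",)):
            name = localname(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                if elem.text and needle in elem.text:
                    yield title
            elif name == "page":
                elem.clear()  # free memory; these dumps are large

if __name__ == "__main__":
    for t in scan_dump(sys.argv[1]):
        print(t)
```

Run as, e.g., `python scan_dump.py enwiki-20170620-pages-articles.xml.bz2`. And since user talk pages simply aren't in -pages-articles, no amount of grepping this particular dump will surface them, which matches what's described above.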
I don't think we need to dig further. The mention in Tech News doesn't seem to break the pages it lands on. In the cases I have seen, the syntax at fault was in users' signatures, and that wasn't breaking anything either. Wikis like the Dutch one didn't want fixes outside the article namespace and didn't fix the other stuff.
I thought about this. But does the script know to skip Tech News? And anyway, if not all the namespaces are covered, maybe you should consider changing that.
Why? Literally nobody AFAIK has complained about finding errors elsewhere and not being able to determine where they come from. The main namespaces are covered. We shouldn't request additional work when it is not evidently necessary - and there are still articles to fix, FWIW.
I see. Your choice.
I mean, in an ideal world, of course. But we are talking about a few dozen pages here, while we already need to focus on another project that requires intervention on many, many more pages - and this will be a theme for a while.
So, should you change the dump? And do you still need the links? Thank you.