Topic on Project:Support desk

Searching for html tags within a wiki

Summary by Vicarage

Using the Replace Text extension seems the best way of spotting them. SQL queries, Pywikibot, or grepping dumps would also work, though they are rather heavy. Must look at Elastica/CirrusSearch in the long term.

Vicarage

What's a lightweight way of searching for HTML tags within a wiki's page source? My users have an annoying habit of adding '<br>' to pages because it gives them the line breaks they want, and I'd like to spot and correct these (we have legacy <pre> tags too). A standard search just returns the multitude of matches for 'br', and while CirrusSearch/Elastica looks like it could do it, it's a heavyweight solution I don't want to attempt now.

My question was hidden, but as no reason was given, I've unhidden it.

Bawolff

Sorry about your question being hidden. I have no idea why logged-out users are even allowed to hide questions.

So one possibility is Extension:Replace Text, I guess.

Another way is to query the database directly with SQL. (This only works in default configurations; it won't work if you have enabled revision compression or external storage.)

[Note: MediaWiki before 1.31 requires a different query]

SELECT page_namespace, page_title
FROM page
INNER JOIN slots ON slot_revision_id = page_latest
INNER JOIN content ON slot_content_id = content_id
INNER JOIN text ON SUBSTRING(content_address, 4) = old_id
WHERE old_text LIKE '%<br>%';

The last part of the query, '%<br>%', says what to look for in a page. % matches zero or more characters, _ matches exactly one character, and \ is the escape character; everything else matches literally. The match is case-sensitive, so this pattern will not find variants such as <BR> or <br />.
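To make the wildcard rules above concrete, here is a minimal Python sketch that translates a LIKE pattern into a regular expression. The helper names `like_to_regex` and `like` are hypothetical, and the translation only approximates MySQL's behaviour: it compares characters rather than bytes and ignores collations.

```python
import re


def like_to_regex(pattern, escape="\\"):
    """Translate a SQL LIKE pattern into a compiled regex (approximation)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            # Escaped character: the next character matches literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")  # zero or more characters
        elif ch == "_":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    # DOTALL so that wildcards also span newlines in page text.
    return re.compile("".join(parts), re.DOTALL)


def like(text, pattern):
    """Return True if the whole text matches the LIKE pattern, case-sensitively."""
    return like_to_regex(pattern).fullmatch(text) is not None


print(like("line one<br>line two", "%<br>%"))  # lowercase tag is found
print(like("line one<BR>line two", "%<br>%"))  # uppercase tag is missed
print(like("line one<br/>line two", "%<br_>%"))  # _ absorbs the slash
```

Under these assumptions, a pattern such as '%<br%' would be a looser alternative that also catches `<br/>` and `<br />`, at the cost of matching any other tag whose name starts with "br".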


You can run this from the command-line mysql client, or from the phpMyAdmin web interface if you have access to it. If all else fails, you can also use the sql.php maintenance script, although its interface isn't as nice.

Vicarage

The extension should allow me to patrol those pesky users, thanks. I think I'd use Pywikibot or grep on dumps for offline searches rather than SQL queries.
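For the offline route, a rough Python sketch of scanning an XML dump might look like the following. It assumes a dump that holds only the current revision of each page, with the usual `<page>`, `<title>` and `<text>` elements; the function names and the inline `SAMPLE` data are hypothetical stand-ins for a real dump file. Unlike the LIKE pattern, the regular expression here is case-insensitive and also catches `<br/>`, `<br />` and opening `<pre>` tags.

```python
import io
import re
import xml.etree.ElementTree as ET

# Opening <br> and <pre> tags in any letter case, with optional attributes or slash.
TAG_RE = re.compile(r"<\s*(br|pre)\b[^>]*>", re.IGNORECASE)


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def pages_with_tags(dump_file):
    """Yield (title, sorted tag names) for each page whose text contains a match."""
    title = None
    found = set()
    for _event, elem in ET.iterparse(dump_file, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            # The XML parser has already turned &lt;br&gt; back into <br>.
            found.update(m.group(1).lower() for m in TAG_RE.finditer(elem.text or ""))
        elif name == "page":
            if found:
                yield title, sorted(found)
            title, found = None, set()
            elem.clear()  # release the finished page to keep memory use low


# Hypothetical three-page dump used only to exercise the function.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>Page A</title><revision><text>one&lt;br&gt;two</text></revision></page>
  <page><title>Page B</title><revision><text>a break, no tags</text></revision></page>
  <page><title>Page C</title><revision><text>&lt;PRE&gt;x&lt;/pre&gt;&lt;br /&gt;</text></revision></page>
</mediawiki>"""

for page_title, tags in pages_with_tags(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(page_title, tags)
```

With a real dump, the `io.BytesIO(...)` argument would be replaced by a file opened in binary mode. Because `iterparse` streams the file, the whole dump never needs to fit in memory.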