User talk:Amgine/Dump processing/test xml.php
Latest comment: 10 years ago by Amgine
Todos:
- After if( !preg_match add an else which adds the word to a $blacklist; print the array separately in a single txt file.
- Check the word is single script? (Actually, each dictionary as a whole should be single script.) [1]
- Check the string is not confusable.[2]
--Nemo 12:39, 30 March 2014 (UTC)
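A minimal sketch of the first todo, in case it helps. The real pattern used in the script is not quoted on this page, so $rejectPattern, the sample words and the output path below are all placeholders, not what the script actually uses:

```php
<?php
// Hypothetical stand-in for the script's real pattern (not shown on this page):
// it rejects titles containing digits, colons or slashes.
$rejectPattern = '/[0-9:\/]/u';

$dictionary = [];
$blacklist  = [];

// Sample titles for illustration only
foreach ( [ 'cat', 'Appendix:Foo', 'кошка', 'abc123' ] as $word ) {
    if ( !preg_match( $rejectPattern, $word ) ) {
        // Word passed the filter: keep it
        $dictionary[] = $word;
    } else {
        // Word was rejected: remember it so it can be reviewed later
        $blacklist[] = $word;
    }
}

// Write all rejected words to a single text file, one per line
// (the path is an assumption; adjust to wherever the dumps live)
$blacklistFile = sys_get_temp_dir() . '/blacklist.txt';
file_put_contents( $blacklistFile, implode( "\n", $blacklist ) . "\n" );
```

Keeping the rejected words in one file makes it easy to compare two runs and see what a stricter filter throws away.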
- Moved your spoofchecker::isSuspicious up to the ns=0 check so it covers both add2Dictionary calls.
- Does the spoofchecker cover single script, confusable? The problem is that many terms are normalized in other languages, and some languages have multiple writing systems, e.g. Japanese has 4 including Romaji. - Amgine (talk) 00:47, 2 April 2014 (UTC)
- I'm not sure about the spoofchecker; if the docs are correct, no it doesn't, but that "and" is a bit confusing.[3] Currently I'm not even sure the setChecks call is working, I'll need to check what the actual effects are. We may add WHOLE_SCRIPT_CONFUSABLE if it doesn't remove too much stuff. --Nemo 07:20, 2 April 2014 (UTC)
- By quickly glancing at the results (new by me vs. old by Amgine), it seems it removed almost all non-Latin characters, which is good for Vietnamese (per Mxn) and ok for Serbo-Croatian (consistency makes at least one part happy) but nonsense for Russian. Will need to play a bit more with the options. --Nemo 09:47, 3 April 2014 (UTC)
- Actually, no. Serbo-Croatian on en.WT includes Bosnian, Serbian, and Croatian, all of which are written in w:Gaj's Latin alphabet among other writing systems. - Amgine (talk) 15:01, 3 April 2014 (UTC)
- Based on what I can find, to reduce possible complaints here are some rules we should create per-language:
- bs - latinica script (Gaj's), optionally include w:Serbian Cyrillic alphabet
- hr - latinica only (Gaj's)
- sh - Cyrillic script (Serbian) & latinica (Gaj's)
- sr - Cyrillic script (Serbian), optionally include latinica (Gaj's)
- - Amgine (talk) 15:38, 3 April 2014 (UTC)
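The per-language rules above could be sketched roughly as follows. This is only an illustration: $scriptRules, wordAllowed() and the $withOptional switch are made-up names, and the check works at the level of Unicode scripts (Latin, Cyrillic) via PCRE properties, so it does not verify that a word uses only letters of Gaj's alphabet or of the Serbian Cyrillic alphabet specifically.

```php
<?php
// Script rules per language code, following the list above.
// 'always' scripts are accepted unconditionally; 'optional' ones
// only when the caller asks for them.
$scriptRules = [
    'bs' => [ 'always' => [ 'Latin' ],             'optional' => [ 'Cyrillic' ] ],
    'hr' => [ 'always' => [ 'Latin' ],             'optional' => [] ],
    'sh' => [ 'always' => [ 'Cyrillic', 'Latin' ], 'optional' => [] ],
    'sr' => [ 'always' => [ 'Cyrillic' ],          'optional' => [ 'Latin' ] ],
];

/**
 * Return true if $word is written entirely in one of the scripts
 * allowed for $lang. Languages without a rule are not filtered.
 */
function wordAllowed( string $word, string $lang, array $rules, bool $withOptional = false ): bool {
    if ( !isset( $rules[$lang] ) ) {
        return true;
    }
    $scripts = $rules[$lang]['always'];
    if ( $withOptional ) {
        $scripts = array_merge( $scripts, $rules[$lang]['optional'] );
    }
    foreach ( $scripts as $script ) {
        // The whole word must be in this single script. Common and
        // Inherited are allowed too, so hyphens, apostrophes and
        // combining marks do not cause a rejection.
        $pattern = '/^[\p{' . $script . '}\p{Common}\p{Inherited}]+$/u';
        if ( preg_match( $pattern, $word ) === 1 ) {
            return true;
        }
    }
    return false;
}

// Usage: Latin is accepted for hr, Cyrillic is not
var_dump( wordAllowed( 'kuća', 'hr', $scriptRules ) );
var_dump( wordAllowed( 'кућа', 'hr', $scriptRules ) );
```

Because each script is tested separately, a word mixing Latin and Cyrillic letters is rejected even for sh, where both scripts are allowed on their own. A word made up only of Common characters (digits, punctuation) would pass, so this would sit alongside the existing preg_match filter rather than replace it.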