User:DanielRenfro/Character Encoding
A character encoding system consists of a code that pairs each character from a given repertoire with something else (a sequence of natural numbers, octets, or electrical pulses, for instance) in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks, or the storage of text in computers.
Unicode
Strictly speaking, Unicode is not a character encoding. It is a conceptual encoding (a character set) that pairs characters with numbers rather than mapping them to octets (bytes). For example, the Cyrillic capital letter Zhe (Ж) is paired with the number 1046. This number, called a code point, can be represented any number of ways; in Unicode it is written as "U+0416" (a capital U followed by its hexadecimal representation). Unicode also records attributes of characters, such as "3 is a digit" or "É is an uppercase letter whose lowercase equivalent is é." Sometimes a character on the screen may be represented by more than one Unicode code point. For example, most would consider à to be a single character, but in Unicode it can be composed of two code points: U+0061 (a) combined with the grave accent U+0300 (`). Unicode offers a number of these "combining characters" that are intended to follow (and be combined with) a base character. This can cause confusion when a regular expression expects a 'single character' or a 'single byte.'
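A quick way to see this in PHP (a minimal sketch; it assumes PHP 7+ for the \u{...} string escape and the intl extension for the Normalizer class):

<?php
// The same on-screen character, built two different ways.
$precomposed = "\u{00E0}";           // à as a single code point (U+00E0)
$combining   = "\u{0061}\u{0300}";   // 'a' (U+0061) + combining grave accent (U+0300)

var_dump( $precomposed === $combining );   // bool(false), the byte sequences differ

// Normalization Form C composes base + combining mark into one code point.
var_dump( Normalizer::normalize( $combining, Normalizer::FORM_C ) === $precomposed );   // bool(true)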
There are a number of ways to store such a code point, for example:
- UCS-2
- all characters are encoded using two bytes
- UCS-4
- all characters are encoded using four bytes
- UTF-16
- most characters encoded with two bytes, but some with four
- UTF-8
- characters encoded with one to four bytes (the original definition allowed up to six)
What is important is to know which encoding your program uses and how to convert to it from other encodings (from ASCII or Latin-1 to UTF-16, for example).
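In PHP, for example, the iconv extension (enabled by default) can do the conversion; a small sketch:

<?php
$latin1 = "caf\xE9";                                 // "café" encoded as ISO-8859-1
$utf8   = iconv( 'ISO-8859-1', 'UTF-8', $latin1 );   // é becomes the two bytes 0xC3 0xA9
$utf16  = iconv( 'UTF-8', 'UTF-16LE', $utf8 );       // two bytes per character for this string

echo bin2hex( $utf8 ), "\n";    // 636166c3a9
echo bin2hex( $utf16 ), "\n";   // 630061006600e900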
Common character encodings
- ISO 646
- EBCDIC
- ISO 8859:
- ISO 8859-1 Western Europe
- ISO 8859-2 Western and Central Europe
- ISO 8859-7 Greek
- ISO 8859-8 Hebrew
- ISO 8859-16 Central, Eastern and Southern European languages (Albanian, Croatian, Hungarian, Polish, Romanian, Serbian and Slovenian, but also French, German, Italian and Irish Gaelic)
- MS-Windows character sets:
- Windows-1250 for Central European languages that use Latin script, (Polish, Czech, Slovak, Hungarian, Slovene, Serbian, Croatian, Romanian and Albanian)
- Windows-1251 for Cyrillic alphabets
- Windows-1252 for Western languages
- Windows-1253 for Greek
- Mac OS Roman
- Unicode (and subsets thereof, such as the 16-bit 'Basic Multilingual Plane'). See UTF-8
- ANSEL or ISO/IEC 6937
Translation
HTTP
editThe Content-Type header should contain the correct encoding information.
Content-Type: text/html; charset=ISO-8859-1
HTML
Character encodings in HTML can be troublesome when you do not specify the correct encoding in your document. You can declare it in the HTTP response header, or in the HTML document itself, like so:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
If the HTML meta tag and the HTTP header don't match, most browsers will ignore the meta tag in favor of the header. But that raises the question: which one is correct?
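One way to keep them from disagreeing is to emit both from the same place. A minimal PHP sketch:

<?php
// Send the HTTP header before any output; browsers give it precedence.
header( 'Content-Type: text/html; charset=utf-8' );
?>
<!DOCTYPE html>
<html>
<head>
<!-- The meta tag repeats (and must agree with) the header above. -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Example</title>
</head>
<body>Iñtërnâtiônàlizætiøn</body>
</html>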
PHP
- Problem
PHP thinks one character is equal to one byte.
This might have been true on the late-'90s web, but nowadays this assumption leads to some non-obvious internationalization problems. For example, the strlen() function returns 27 instead of 20 when calculating the length of the following string (because of all the accented characters):
Iñtërnâtiônàlizætiøn 12345678901234567890
PHP actually "sees" something more like this (the 27 UTF-8 bytes, displayed as if they were Latin-1):
IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n 123456789012345678901234567
One good thing is that PHP doesn't try to convert your strings to some other encoding; it just operates on whatever bytes you give it. Even though the native string functions don't understand that a character can be more than one byte, they won't corrupt the data. The iconv extension is enabled by default, but is only partially useful. Alternatively, you can compile PHP with the mbstring package, but if you want your software to be portable this can be a problem.
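A short sketch of the mismatch, and of the multibyte-aware alternatives:

<?php
$s = "Iñtërnâtiônàlizætiøn";   // 20 characters, 27 bytes in UTF-8

var_dump( strlen( $s ) );                  // int(27), counts bytes
var_dump( iconv_strlen( $s, 'UTF-8' ) );   // int(20), iconv is enabled by default
var_dump( mb_strlen( $s, 'UTF-8' ) );      // int(20), requires mbstring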
- Solution
Hand the problem off to the browser and let it handle it. Web browsers have excellent support for many different character sets, the most important being UTF-8. All you have to do is tell the browser "everything is UTF-8" and your problems are (partially) solved. UTF-8 is a good choice because it is Unicode, it is a standard, and it is backwards compatible with ASCII. Send the HTTP header together with a matching HTML meta tag (as in the sketch above) and let the browser handle the rest.
Most of this section is taken from the good tutorial at http://www.phpwact.org/php/i18n/charsets.
Mediawiki
Code
All PHP files in the Mediawiki software suite must be encoded as UTF-8, without a byte order mark (BOM). Otherwise bad things happen: PHP sends the BOM bytes as output before anything else, which, among other things, breaks later header() calls.
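A quick way to check a file for a BOM (a sketch; the filename is just an example):

<?php
// The UTF-8 byte order mark is the three bytes EF BB BF.
$head = file_get_contents( 'LocalSettings.php', false, null, 0, 3 );
if ( $head === "\xEF\xBB\xBF" ) {
    echo "BOM found; strip it.\n";
}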
Wikitext
Mediawiki does not make any assumptions about what characters are coming in from the user.
Some global configs dealing with encoding:
- $wgEditEncoding
- For some languages, it may be necessary to explicitly specify which characters make it into the edit box raw or are converted in some way or another. Note that if $wgOutputEncoding is different from $wgInputEncoding, this text will be further converted to wgOutputEncoding.
- $wgLegacyEncoding
- Set this to e.g. 'ISO-8859-1' to perform character set conversion when loading old revisions not marked with the "utf-8" flag. Use this when converting a wiki to UTF-8 without the burdensome mass conversion of old text data. NOTE: this DOES NOT touch any fields other than old_text. Titles, comments, user names, etc. must still be converted en masse in the database before continuing as a UTF-8 wiki.
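A hedged sketch of how that looks in LocalSettings.php, assuming the old revisions were stored as ISO-8859-1:

<?php
// LocalSettings.php (excerpt): convert old, unflagged revisions on load.
$wgLegacyEncoding = 'ISO-8859-1';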
Older global configs (you might run into these):
- $wgInputEncoding (deprecated as of 1.19)
- $wgOutputEncoding (deprecated as of 1.19)
- $wgUseLatin1 (deprecated as of 1.19)
Storage
From what I can tell, Mediawiki uses the BLOB (binary large object) datatype to store wikitext in the text database table. This means that no matter what encoding the input arrives in, Mediawiki will store it just fine; the problem is when it is displayed again. Mediawiki can be configured to send different Content-Type HTTP headers, but I think by default it uses UTF-8.
- $wgDBmysql5
- Set to true to engage MySQL 4.1/5.0 charset-related features; for now will just cause sending of 'SET NAMES=utf8' on connect.
- You should not generally change this value once installed -- if you create the wiki in Binary or UTF-8 schemas, keep this on! If your wiki was created with the old "backwards-compatible UTF-8" schema, it should stay off.
- (See also $wgDBTableOptions which in recentish versions will include the table type and character set used when creating tables.)
- May break if you're upgrading an existing wiki if set differently. Broken symptoms likely to include incorrect behavior with page titles, usernames, comments etc containing non-ASCII characters. Might also cause failures on the object cache and other things.
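For what it's worth, 'SET NAMES utf8' is the same thing mysqli's set_charset() sends; a quick sketch for checking what a connection is actually using (the credentials here are hypothetical):

<?php
$db = new mysqli( 'localhost', 'wikiuser', 'secret', 'wikidb' );
$db->set_charset( 'utf8' );             // equivalent to sending SET NAMES utf8
echo $db->character_set_name(), "\n";   // prints utf8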
includes/search/SearchMySQL.php
Using MySQL as the database backend, Mediawiki will run the following code:
/**
 * Converts some characters for MySQL's indexing to grok it correctly,
 * and pads short words to overcome limitations.
 */
function normalizeText( $string ) {
	global $wgContLang;

	wfProfileIn( __METHOD__ );

	$out = parent::normalizeText( $string );

	// MySQL fulltext index doesn't grok utf-8, so we
	// need to fold cases and convert to hex
	$out = preg_replace_callback(
		"/([\\xc0-\\xff][\\x80-\\xbf]*)/",
		array( $this, 'stripForSearchCallback' ),
		$wgContLang->lc( $out ) );

	// And to add insult to injury, the default indexing
	// ignores short words... Pad them so we can pass them
	// through without reconfiguring the server...
	$minLength = $this->minSearchLength();
	if( $minLength > 1 ) {
		$n = $minLength - 1;
		$out = preg_replace(
			"/\b(\w{1,$n})\b/",
			"$1u800",
			$out );
	}

	// Periods within things like hostnames and IP addresses
	// are also important -- we want a search for "example.com"
	// or "192.168.1.1" to work sanely.
	//
	// MySQL's search seems to ignore them, so you'd match on
	// "example.wikipedia.com" and "192.168.83.1" as well.
	$out = preg_replace(
		"/(\w)\.(\w|\*)/u",
		"$1u82e$2",
		$out );

	wfProfileOut( __METHOD__ );

	return $out;
}

/**
 * Armor a case-folded UTF-8 string to get through MySQL's
 * fulltext search without being mucked up by funny charset
 * settings or anything else of the sort.
 */
protected function stripForSearchCallback( $matches ) {
	return 'u8' . bin2hex( $matches[1] );
}
The take-home message is that any non-ASCII UTF-8 sequence (which MySQL doesn't "grok", or so the MW comments say) gets turned into an ASCII-only token before it reaches the fulltext index (a string that can be encoded using plain Latin-1 or ASCII). That means anyone searching for a non-ASCII character won't find what they're looking for.
For example, this string (from the LacZ product page, containing the Greek letter beta (β)):
Component of [[:Category:Complex:β-galactosidase|β-galactosidase]]
will get turned into this string by the above code:
component of category complex u8ceb2-galactosidase u8ceb2-galactosidase beta-du800-galactosidase
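You can reproduce the armoring of β by hand: its UTF-8 encoding is the two bytes 0xCE 0xB2, so stripForSearchCallback() turns it into the ASCII token u8ceb2:

<?php
echo 'u8' . bin2hex( "β" );   // prints u8ceb2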
Searching
- Problem
- Because of the problems listed above (in Storage), searching for non-ASCII characters won't work; they have been turned into something else.
- Proposed Solutions
1. create an extension that hooks into the search somewhere, capturing non-ASCII characters and converting them to their parsed equivalents
2. create an extension that hooks into the code when the page is saved to the searchindex table, and convert the non-ASCII characters to something better (HTML entities? see the sketch below)
3. both 1 and 2 (?)
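A sketch of the conversion step from solution 2 (entityEncodeForSearch is a hypothetical name, and the hook wiring into the searchindex update is omitted; mb_convert_encoding's 'HTML-ENTITIES' target does the actual work, and requires mbstring):

<?php
function entityEncodeForSearch( $text ) {
	// Replace non-ASCII characters with named HTML entities.
	return mb_convert_encoding( $text, 'HTML-ENTITIES', 'UTF-8' );
}

echo entityEncodeForSearch( 'β-galactosidase' );   // &beta;-galactosidase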