utfnormal

utfnormal is a library that contains Unicode normalization routines. It includes pure PHP implementations, and automatically uses the php-intl extension if installed.

The main function to care about is UtfNormal\Validator::cleanUp(). This will strip illegal UTF-8 sequences and characters that are illegal in XML, and if necessary convert to normalization form C (NFC). See also "Unicode equivalence" on Wikipedia.
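As a minimal sketch of how cleanUp() might be called, assuming the library was installed with Composer and that vendor/autoload.php sits next to the script (the path and the sample strings are illustrative assumptions, not part of the library):

```php
<?php
// Assumption: Composer's autoloader lives in ./vendor relative to this file.
require __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

// "e" followed by U+0301 COMBINING ACUTE ACCENT: valid UTF-8, but not in NFC.
$decomposed = "e\u{0301}";
$clean = Validator::cleanUp( $decomposed );

// In NFC this pair is composed into the single code point U+00E9.
var_dump( $clean === "\u{00E9}" );

// The byte 0xFF can never occur in well-formed UTF-8, so this input is invalid.
$broken = "abc\xFFdef";
$repaired = Validator::cleanUp( $broken );

// preg_match() with the /u modifier only succeeds on well-formed UTF-8,
// so it serves here as a quick validity check on the cleaned-up string.
var_dump( preg_match( '//u', $repaired ) === 1 );
```

Because cleanUp() handles both validation and normalization, it is the safer choice for untrusted input such as user-submitted text.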

If you know the string is already valid UTF-8, you can directly call:

  • UtfNormal\Validator::toNFC(),
  • UtfNormal\Validator::toNFK(),
  • or UtfNormal\Validator::toNFKC()

These functions convert a given UTF-8 string to Normalization Form C, K, or KC, respectively, if it is not already in that form. They assume that the input string is already valid UTF-8; if there are corrupt characters, they may produce erroneous results.
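The direct calls could look roughly like the sketch below. It assumes a Composer install with vendor/autoload.php next to the script, uses made-up sample strings, and shows only toNFC() and toNFKC():

```php
<?php
// Assumption: Composer's autoloader lives in ./vendor relative to this file.
require __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

// Both inputs are known to be valid UTF-8, so the validation step is skipped.

// "e" + U+0301 COMBINING ACUTE ACCENT composes to U+00E9 under NFC.
$nfc = Validator::toNFC( "e\u{0301}" );
var_dump( $nfc === "\u{00E9}" );

// U+FB01 LATIN SMALL LIGATURE FI has a compatibility decomposition to "fi",
// so NFKC replaces it with the two plain ASCII letters.
$nfkc = Validator::toNFKC( "\u{FB01}" );
var_dump( $nfkc === "fi" );
```

If the origin of a string is uncertain, cleanUp() remains the safer entry point, since these functions do not validate their input.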

Performance is kind of stinky in absolute terms, though it should be speedy on pure ASCII text. ;) It's not too awful on text that can quickly be determined to already be in NFC, but otherwise it can get uncomfortably slow, particularly for Korean text (the Hangul decomposition/composition code is extra slow).

Bugs should be filed in Wikimedia's Phabricator under the "utfnormal" project.

To use it in your project, run composer require wikimedia/utfnormal.

This library was first introduced in MediaWiki 1.3 (rev:4965). It was split out of the MediaWiki codebase and published as an independent library during the MediaWiki 1.25 development cycle.
