Fun with mb_strlen
Posted this on dev blog as well.
I noticed the fallback implementation for mb_strlen() that we had in GlobalSettings.php sucked:
    function mb_strlen( $str, $enc = "" ) {
        preg_match_all( '/./us', $str, $matches );
        return count($matches);
    }
There are two things to note about this code:
- It doesn't actually work: count($matches) counts the outer array, which holds only the single list of full-pattern matches, so it always returns 1.
- Even if you fix it to return the matches, it's extremely slow and will eat lots of memory by creating a giant array of every character in the (potentially quite long) string.
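To make both points concrete, here's a minimal sketch of the broken behavior next to the obvious fix. The function names are hypothetical, picked so they don't collide with a real mb_strlen(); this is an illustration, not the code from the repository.

```php
<?php
// Hypothetical names, for illustration only.

// Mirrors the broken fallback: counts the outer $matches array,
// which always holds exactly one entry (the list of full matches).
function broken_strlen_sketch( $str ) {
    preg_match_all( '/./us', $str, $matches );
    return count( $matches );
}

// Counts the full-pattern matches instead. This gives the right
// answer for valid UTF-8, but still builds an array with one
// element per character, which is the memory problem noted above.
function regex_strlen_sketch( $str ) {
    preg_match_all( '/./us', $str, $matches );
    return count( $matches[0] );
}
```

With the five-character string "h\xC3\xA9llo" (six bytes in UTF-8), the first function returns 1 and the second returns 5.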
I'm replacing this with a new version which uses PHP's count_chars() function to count up the ASCII-compatible bytes and multibyte sequence head bytes. It's still a smidge slower than mb_strlen, but it's... much better than the old one.
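Roughly, the idea looks like the sketch below. It assumes the input is valid UTF-8, and the function name is hypothetical; the version that actually gets committed may differ in its details. Every UTF-8 character contains exactly one byte that is not a continuation byte (0x80 through 0xBF), so adding up the frequencies of all the other byte values gives the character count.

```php
<?php
// Hypothetical name; a sketch of the count_chars() approach,
// assuming $str is valid UTF-8.
function utf8_strlen_sketch( $str ) {
    // Mode 0 (the default) returns an array indexed by byte value
    // (0..255) holding how often each byte occurs in $str.
    $counts = count_chars( $str );
    $total = 0;

    // ASCII-compatible bytes: one byte, one character.
    for ( $i = 0x00; $i < 0x80; $i++ ) {
        $total += $counts[$i];
    }

    // Head bytes of multibyte sequences. The continuation bytes
    // 0x80..0xBF are skipped, so each sequence is counted once.
    for ( $i = 0xC0; $i <= 0xFF; $i++ ) {
        $total += $counts[$i];
    }

    return $total;
}
```

The work is one pass over the string inside count_chars() plus a fixed 192 additions, and the only array involved has 256 entries no matter how long the string is.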
Some quick benchmarks using the UTF-8 normalization benchmark pages:
    Testing washington.txt:
        strlen          31526 chars     0.007ms
        mb_strlen       31526 chars     0.114ms
        old_mb_strlen   31526 chars  4813.686ms
        new_mb_strlen   31526 chars     0.132ms

    Testing berlin.txt:
        strlen          36320 chars     0.001ms
        mb_strlen       35899 chars     0.129ms
        old_mb_strlen   35899 chars  6328.748ms
        new_mb_strlen   35899 chars     0.127ms

    Testing bulgakov.txt:
        strlen          36849 chars     0.001ms
        mb_strlen       20418 chars     0.076ms
        old_mb_strlen   20418 chars  3003.042ms
        new_mb_strlen   20418 chars     0.133ms

    Testing tokyo.txt:
        strlen          36244 chars     0.001ms
        mb_strlen       19936 chars     0.071ms
        old_mb_strlen   19936 chars  2623.109ms
        new_mb_strlen   19936 chars     0.131ms

    Testing young.txt:
        strlen          36694 chars     0.001ms
        mb_strlen       16676 chars     0.063ms
        old_mb_strlen   16676 chars  2246.179ms
        new_mb_strlen   16676 chars     0.125ms