Multibyte trim in PHP?


The standard trim function trims a handful of space and space-like characters. These are defined as ASCII characters, which means specific single bytes in the range 0000 0000 to 0010 0000 (by default: space, tab, line feed, carriage return, NUL and vertical tab).

Proper UTF-8 input will never contain multi-byte characters that are made up of bytes of the form 0xxx xxxx. All the bytes in proper UTF-8 multi-byte characters have the form 1xxx xxxx.

This means that in a proper UTF-8 sequence, the bytes 0xxx xxxx can only refer to single-byte characters. PHP's trim function will therefore never trim away "half a character", assuming you have a proper UTF-8 sequence. (Be very careful with improper UTF-8 sequences.)
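To make the byte-level argument concrete, here is a small sketch that prints the bit pattern of each byte of one multi-byte character and checks that trim leaves it intact. The helper name byte_patterns is made up for this illustration, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Hypothetical helper: return each byte of $s as an 8-digit binary string.
function byte_patterns(string $s): array {
    $out = [];
    foreach (str_split($s) as $byte) {
        $out[] = str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
    }
    return $out;
}

$kana = "\u{30DB}"; // the katakana "ho", three bytes in UTF-8

// Every byte of the multi-byte character has its high bit set.
echo implode(' ', byte_patterns($kana)), "\n"; // 11100011 10000011 10011011

// trim() only removes ASCII bytes, so the character survives untouched.
$padded  = " \t" . $kana . "\n ";
$trimmed = trim($padded);
var_dump($trimmed === $kana);          // bool(true)
var_dump(preg_match('//u', $trimmed)); // int(1), the result is still valid UTF-8
```

The empty pattern with the /u modifier is used here only as a cheap UTF-8 validity check that does not need the mbstring extension.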


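In a regular expression without the /u modifier, \s matches mostly the same characters as trim. The two sets are not identical: \s also matches the form feed, while trim's default list also contains the NUL byte.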

The preg functions with the /u modifier only work on UTF-8 encoded patterns and subjects, and /\s/u additionally matches the UTF-8 encoded NBSP (U+00A0). This behaviour with non-breaking spaces is the only advantage to using it.

If you want to replace space characters in other encodings that are not ASCII-compatible, neither method will work.

In other words, if you're trying to trim the usual spaces from an ASCII-compatible string, just use trim. When using /\s/u, be careful about what an NBSP means in your text.


Take care:

  $s1 = html_entity_decode(" Hello &nbsp; "); // the NBSP
  $s2 = " 𩸽 exotic test ホ 𩸽 ";

  echo "\nCORRECT trim: [". trim($s1) ."], [". trim($s2) ."]";
  echo "\nSAME: [". trim($s1) ."] == [". preg_replace('/^\s+|\s+$/','',$s1) ."]";
  echo "\nBUT: [". trim($s1) ."] != [". preg_replace('/^\s+|\s+$/u','',$s1) ."]";

  // DANGER! not UTF8 safe: trim() treats the character list as single bytes
  echo "\n!INCORRECT trim: [". trim($s2,'𩸽 ') ."]";
  echo "\nSAFE ONLY WITH preg: [".
       preg_replace('/^[𩸽\s]+|[𩸽\s]+$/u', '', $s2) ."]";
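The danger of passing a multi-byte character list to trim depends on which bytes happen to sit at the edges of the string. The following sketch uses an input chosen to trigger the problem: the euro sign and the horizontal ellipsis share their first UTF-8 byte, so trim removes "a third of a character". The sample strings are my own choice, not taken from the answer above.

```php
<?php
$euro     = "\u{20AC}"; // euro sign, bytes E2 82 AC
$ellipsis = "\u{2026}"; // horizontal ellipsis, bytes E2 80 A6
$input    = $ellipsis . "wait";

// trim() sees the list as the three bytes E2, 82 and AC. It strips the
// leading E2 of the ellipsis even though no euro sign is present.
$byteTrimmed = trim($input, $euro);
echo bin2hex($byteTrimmed), "\n";          // 80a677616974
var_dump(preg_match('//u', $byteTrimmed)); // bool(false), no longer valid UTF-8

// With /u the pattern works on whole characters, so nothing is removed.
$safe = preg_replace('/^' . $euro . '+|' . $euro . '+$/uD', '', $input);
var_dump($safe === $input);                // bool(true)
```

The D modifier is added so that $ matches only at the very end of the string rather than also before a final newline.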


I don't know what you're trying to do with that endless recursive function you're defining, but if you just want a multibyte-safe trim, this will work.

  function mb_trim($str) {
      return preg_replace("/^\s+|\s+$/u", "", $str);
  }


This version supports the second optional parameter $charlist:

  function mb_trim($string, $charlist = null) {
      if (is_null($charlist)) {
          return trim($string);
      }
      // Escape regex metacharacters and the "/" delimiter in one step.
      $charlist = preg_quote($charlist, '/');
      // D makes $ match only at the very end, not before a final newline.
      return preg_replace("/^[$charlist]+|[$charlist]+$/uD", '', $string);
  }

Unlike trim, it does not support ".." for ranges in $charlist though.
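One possible way to add ".." ranges is to translate the character list into a PCRE character class first. This is only a sketch under a few assumptions: the names charlist_to_class and utf8_trim_ranges are hypothetical (chosen so they cannot clash with the mb_trim that the mbstring extension ships as of PHP 8.4), both arguments are expected to be valid UTF-8, and "X..Y" is treated as a range only when X does not sort after Y.

```php
<?php
// Hypothetical helper: turn a trim-style list such as "a..z_" into the
// body of a PCRE character class ("a-z_").
function charlist_to_class(string $charlist): string {
    // Split into whole UTF-8 characters; false means invalid UTF-8.
    $chars = preg_split('//u', $charlist, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return '';
    }
    $class = '';
    $n = count($chars);
    for ($i = 0; $i < $n; $i++) {
        // Byte-wise comparison of UTF-8 strings follows code point order.
        if ($i + 3 < $n && $chars[$i + 1] === '.' && $chars[$i + 2] === '.'
            && strcmp($chars[$i], $chars[$i + 3]) <= 0) {
            $class .= preg_quote($chars[$i], '/') . '-' . preg_quote($chars[$i + 3], '/');
            $i += 3; // skip the two dots and the range end
        } else {
            $class .= preg_quote($chars[$i], '/');
        }
    }
    return $class;
}

function utf8_trim_ranges(string $string, string $charlist): string {
    $class = charlist_to_class($charlist);
    if ($class === '') {
        return $string; // nothing usable to trim
    }
    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    return preg_replace("/^[$class]+|[$class]+$/uD", '', $string) ?? $string;
}

echo utf8_trim_ranges("123abc456", "0..9"), "\n"; // abc
echo utf8_trim_ranges("\u{3042}\u{3044}hello\u{3046}", "\u{3041}..\u{3093}"), "\n"; // hello
```

The second call trims hiragana from both ends by giving the range U+3041 to U+3093, which a byte-oriented trim cannot express.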