Can PHP detect 4-byte encoded UTF-8 characters?

php utf8mb4

This should work:

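if ($string !== '' && max(array_map('ord', str_split($string))) >= 240) {
    // $string contains at least one 4-byte character.
    // The empty-string guard avoids calling max() on an empty array,
    // which str_split('') returns as of PHP 8.2.
}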

The rationale is that code points up to and including U+FFFF are encoded in at most three bytes, the longest form being 1110xxxx 10xxxxxx 10xxxxxx. Higher code points take the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. the leading byte has a value of 240 (0xF0) or higher. Continuation bytes never exceed 191 (0xBF), so in valid UTF-8 any byte with a value of 240 or more indicates a 4-byte sequence.
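To make the byte layout concrete, here is a small sketch that prints the encoded length and leading byte for one sample character per length class. The helper name leadByte and the sample characters are chosen for illustration only, and the "\u{...}" escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper: numeric value of the first (leading) byte of $char.
function leadByte(string $char): int
{
    return ord($char[0]);
}

// One sample character from each UTF-8 length class (1 to 4 bytes).
$samples = [
    'U+0061'  => "a",
    'U+00E9'  => "\u{E9}",
    'U+20AC'  => "\u{20AC}",
    'U+210C1' => "\u{210C1}",
];

foreach ($samples as $codePoint => $char) {
    printf("%-8s %d byte(s), leading byte %d\n", $codePoint, strlen($char), leadByte($char));
}
```

Only the last sample, which lies above U+FFFF, should report a leading byte of 240 or more.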

If you want to remove 4-byte characters, this will do:

preg_replace_callback('/./u', function (array $match) {
    return strlen($match[0]) >= 4 ? '' : $match[0];
}, $string)

There may be a more elegant way to express high code points directly in the regex, though.
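One possible way to express the high code points directly is a character class over U+10000 to U+10FFFF using the \x{...} syntax, which PCRE accepts under the /u modifier. This is only a sketch: the function name stripSupplementary is made up here, and falling back to the unchanged input when preg_replace returns null (for example on invalid UTF-8) is an assumption about the desired behaviour.

```php
<?php
// Hypothetical helper: removes every code point above U+FFFF,
// i.e. every character that UTF-8 encodes in four bytes.
function stripSupplementary(string $string): string
{
    // Under /u, preg_replace returns null if $string is not valid UTF-8.
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $string);

    // Assumption: on failure, hand back the input unchanged.
    return $result ?? $string;
}

var_dump(stripSupplementary("d\u{210C1}d"));
```

The same character class works with preg_match if the goal is detection rather than removal.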

The following regular expression will replace 4-byte UTF-8 characters:

function replace4byte($string, $replacement = '') {
    return preg_replace('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $replacement, $string);
}

var_dump(replace4byte('d'), replace4byte('d𡃁d'));

This doesn't rely on the /u modifier, so it works even if PCRE was compiled without UTF-8 support. However, if you have that support, deceze's preg_replace_callback is neater.
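Since the question asks about detection rather than replacement, the same byte-level pattern can presumably be used with preg_match. The function name contains4byte below is hypothetical and not part of the answer above; it is a minimal sketch that likewise avoids the /u modifier.

```php
<?php
// Hypothetical helper: true if $string contains at least one
// well-formed 4-byte UTF-8 sequence. Works on raw bytes, no /u needed.
function contains4byte(string $string): bool
{
    return preg_match('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $string) === 1;
}

var_dump(contains4byte('d'), contains4byte("d\u{210C1}d"));
```

Unlike the plain byte-value check, this pattern only matches complete, well-formed sequences, so a stray 0xF0 byte in malformed input does not count.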

(Regex adapted from Ensuring valid utf-8 in PHP)
