Can I safely use a utf8mb4 connection with utf8 columns? Can I safely use a utf8mb4 connection with utf8 columns? php php

Can I safely use a utf8mb4 connection with utf8 columns?


This can be tested quite easily with the following script:

<?php$pdo = new PDO('mysql:host=localhost;dbname=test', 'test', '');$pdo->exec("    drop table if exists utf8_test;    create table utf8_test(        conn varchar(50) collate ascii_bin,        column_latin1  varchar(50) collate latin1_general_ci,        column_utf8    varchar(50) collate utf8_unicode_ci,        column_utf8mb4 varchar(50) collate utf8mb4_unicode_ci    );");$latin = 'abc äŒé';$utf8  = '♔♕';$mb4   = '🛃 🔣';$pdo->exec("set names utf8");$pdo->exec("    insert into utf8_test(conn, column_latin1, column_utf8, column_utf8mb4)     values ('utf8', '$latin', '$latin $utf8', '$latin $utf8 $mb4')");$pdo->exec("set names utf8mb4");$pdo->exec("    insert into utf8_test(conn, column_latin1, column_utf8, column_utf8mb4)     values ('utf8mb4', '$latin', '$latin $utf8', '$latin $utf8 $mb4')");$result = $pdo->query('select * from utf8_test')->fetchAll(PDO::FETCH_ASSOC);var_export($result);

And this is the result:

array (  0 =>   array (    'conn' => 'utf8',    'column_latin1' => 'abc äŒé',    'column_utf8' => 'abc äŒé ♔♕',    'column_utf8mb4' => 'abc äŒé ♔♕ ???? ????',  ),  1 =>   array (    'conn' => 'utf8mb4',    'column_latin1' => 'abc äŒé',    'column_utf8' => 'abc äŒé ♔♕',    'column_utf8mb4' => 'abc äŒé ♔♕ 🛃 🔣',  ),)

As you can see, we can not use utf8 as connection charset, when we work with utf8mb4 columns (see ????). But we can use utf8mb4 for connection when working with utf8 columns. Also neither has problems writing to and reading from latin or ascii columns.

The reason is that you can encode any utf8, latin or ascii character in utf8mb4 but not the other way around. So using utf8mb4 as character set for connection is safe in this case.


Short Answer: Yes, if you are only using 3-byte (or shorter) UTF-8 characters.

Or... No if you intend to work with 4-byte UTF-8 characters such as 😅😘😍.

Long Answer:

(And I will address why "no" could be the right answer.)

The connection establishes what encoding the client is using.

The CHARACTER SET on a column (or, by default, from the table) establishes what encoding can be put into the column.

CHARACTER SET utf8 is a subset of utf8mb4. That is, all characters acceptable to utf8 (via connection or column) are acceptable to utf8mb4. Phrased another way, MySQL's utf8mb4 (same as the outside world's UTF-8) have the full 4-byte utf-8 encoding that includes more Emoji, more Chinese, etc, than MySQL up-to-3-byte utf8 (aka "BMP")

(Technically, utf8mb4 only handles up to 4 bytes, but UTF-8 handles longer characters. However, I doubt if 5-byte characters will happen in my lifetime.)

So, here's what happens with any 3-byte (or shorter) UTF-8 character in the client, given that the Connection is utf8mb4 and the columns in the tables are only utf8: Every character goes into and comes out of the server without transformation and without errors. Note: The problem occurs on INSERT, not on SELECT; however you may not notice the problem until you do a SELECT.

But, what if you have an Emoji in the client? Now you will get an error. (Or a truncated string) (Or question mark(s)) This is because the 4-byte Emoji (eg, 💩) cannot be squeezed into the 3-byte "utf8" (or "1-byte latin1" or ...).

If you are running 5.5 or 5.6, you may run into the 767 (or 191) problem. I provide several workarounds in here. None is perfect.

As for inverting (utf8 connection but utf8mb4 columns): The SELECT can have trouble if you manage to get some 4-byte characters into the table.

"Official sources" -- Good luck. I have spent a decade trying to tease out the ins and outs of character handling, and then simplify it into actionable sentences. Much of that time was thinking I had all the answers, only to encounter yet another failing test case. The common cases are listed in Trouble with UTF-8 characters; what I see is not what I stored . However, that does not directly address your question!

From Comment

mysql> SHOW CREATE TABLE emoji\G*************************** 1. row ***************************       Table: emojiCreate Table: CREATE TABLE `emoji` (  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,  `text` varchar(255) NOT NULL,  PRIMARY KEY (`id`)) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8mb41 row in set (0.00 sec)mysql> insert into emoji (text) values ("abc");Query OK, 1 row affected (0.01 sec)mysql> show variables like 'char%';+--------------------------+----------------------------+| Variable_name            | Value                      |+--------------------------+----------------------------+| character_set_client     | utf8                       || character_set_connection | utf8                       || character_set_database   | utf8mb4                    || character_set_filesystem | binary                     || character_set_results    | utf8                       || character_set_server     | utf8mb4                    || character_set_system     | utf8                       || character_sets_dir       | /usr/share/mysql/charsets/ |+--------------------------+----------------------------+8 rows in set (0.00 sec)

The above says that the "connection" (think "client") is using utf8, not utf8mb4.

mysql> insert into emoji (text) values ("😅😘😍");  -- 4-byte EmojiQuery OK, 1 row affected, 1 warning (0.00 sec)mysql> show warnings;+---------+------+----------------------------------------------------------------------------------+| Level   | Code | Message                                                                          |+---------+------+----------------------------------------------------------------------------------+| Warning | 1366 | Incorrect string value: '\xF0\x9F\x98\x85\xF0\x9F...' for column 'text' at row 1 |+---------+------+----------------------------------------------------------------------------------+1 row in set (0.00 sec)

Now, change the 'connection' to utf8mb4:

mysql> SET NAMES utf8mb4;Query OK, 0 rows affected (0.00 sec)mysql> insert into emoji (text) values ("😅😘😍");Query OK, 1 row affected (0.01 sec)mysql> SELECT * FROM emoji;+----+--------------+| id | text         |+----+--------------+|  1 | ? ? ? ?      ||  2 | abc          ||  3 | ???????????? |   -- from when "utf8" was in use|  4 | 😅😘😍             |  -- Success with utf8mb4 in use+----+--------------+4 rows in set (0.01 sec)


Short answer: NO, it is not safe.

If your data has utf8mb4 characters and you are using a MySQL utf8 charset connection, you will run into problems since MySQL utf8 charset supports only BMP characters (up to 3 bytes characters).

My recommendation is to convert all tables to utf8mb4 for full UTF-8 support. Also, utf8mb4 is backwards compatible with utf8.