Validating 4 bytes UTF-8 characters

When the character set of MySQL column is utf8 and the SQL mode (sql_mode) is not strict mode (i.e. sql_mode does not include STRICT_ALL_TABLES nor STRICT_TRANS_TABLES), setting a character that will be 4 bytes when encoded with utf-8 (such as an emoji like 😁) will truncate the remainder of the characters (with a warning).

To support 4 bytes UTF-8 characters, the columns with utf8mb4 for CHARACTER SET (and utf8mb4_xxx such as utf8mb4_unicode_520_ci, etc for COLLATE) must be used, and the connection character set must also use utf8mb4.

Incidentally, on Rails, if you try to set 4 bytes UTF-8 characters when the character set of MySQL column is utf8, the following error occurs, so the string will not be truncated unnoticed.

An ActiveRecord::StatementInvalid occurred in news#update:

Mysql2::Error: Incorrect string value: 'xF0x9Fx98x80x0Dx0A' for column 'description' at row 1: UPDATE `news` SET `description` = '😀rn' WHERE `news`.`id` = 2
app/controllers/news_controller.rb:98:in `update'

This is because, unless otherwise specified, AbstractMysqlAdapter#configure_connection adds STRICT_ALL_TABLES to the session's SQL_MODE. (NO_AUTO_VALUE_ON_ZERO is also added.)

You can confirm it by doing the following.

  • With the mysql client
    mysql> show variables like 'sql_mode';
    | Variable_name | Value                  |
    | sql_mode      | NO_ENGINE_SUBSTITUTION |
    1 row in set (0.00 sec)
  • With the Rails console against the same database
    > con = ActiveRecord::Base.connection
    > con.select_all("SHOW VARIABLES LIKE 'sql_mode'")
       (0.8ms)  SHOW VARIABLES LIKE 'sql_mode'
     => #<ActiveRecord::Result:0x00007fc6ca533728 @columns=["Variable_name", "Value"], @rows=[["sql_mode", "NO_AUTO_VALUE_ON_ZERO,STRICT_ALL_TABLES,NO_ENGINE_SUBSTITUTION"]], @hash_rows=nil, @column_types={}>


You can convert the CHARACTER SET of the column to utf8mb4 and COLLATE to utf8mb4_xxx and use utf8mb4 for the connection character set, but if you can't convert the column to utf8mb4 for some reason, you'll probably want to reject 4 bytes UTF-8 characters with validation because it's not good to just shut up and truncate the 4 bytes UTF-8 characters and beyond.

The range of Unicode characters that result in 4 bytes when encoded in UTF-8 is U+10000 to U+10FFFF.


if (preg_match('/[x{10000}-x{10FFFF}]/u', $s) { /* ... */ }
if (preg_match('/[xF0-xF7][x80-xBF][x80-xBF][x80-xBF]/', $s)) { /* ... */ }
preg_match_all('/[x{10000}-x{10FFFF}]/u', $s, $matches);
// An array of 4-bytes UTF-8 characters is stored in `$matches[0]`.

On Ruby

if /[u{10000}-u{10FFFF}]/ =~ s
  # ...
chars = s.scan(/[u{10000}-u{10FFFF}]/)
# An array of 4-bytes UTF-8 characters is stored in `chars`.