Category Archives: PHP

Validating 4 bytes UTF-8 characters

When the character set of MySQL column is utf8 and the SQL mode (sql_mode) is not strict mode (i.e. sql_mode does not include STRICT_ALL_TABLES nor STRICT_TRANS_TABLES), setting a character that will be 4 bytes when encoded with utf-8 (such as an emoji like 😁) will truncate the remainder of the characters (with a warning).

To support 4 bytes UTF-8 characters, the columns with utf8mb4 for CHARACTER SET (and utf8mb4_xxx such as utf8mb4_unicode_520_ci, etc for COLLATE) must be used, and the connection character set must also use utf8mb4.

Incidentally, on Rails, if you try to set 4 bytes UTF-8 characters when the character set of MySQL column is utf8, the following error occurs, so the string will not be truncated unnoticed.

An ActiveRecord::StatementInvalid occurred in news#update:

Mysql2::Error: Incorrect string value: 'xF0x9Fx98x80x0Dx0A' for column 'description' at row 1: UPDATE `news` SET `description` = '😀rn' WHERE `news`.`id` = 2
app/controllers/news_controller.rb:98:in `update'

This is because, unless otherwise specified, AbstractMysqlAdapter#configure_connection adds STRICT_ALL_TABLES to the session's SQL_MODE. (NO_AUTO_VALUE_ON_ZERO is also added.)

You can confirm it by doing the following.

  • With the mysql client
    mysql> show variables like 'sql_mode';
    +---------------+------------------------+
    | Variable_name | Value                  |
    +---------------+------------------------+
    | sql_mode      | NO_ENGINE_SUBSTITUTION |
    +---------------+------------------------+
    1 row in set (0.00 sec)
    
  • With the Rails console against the same database
    > con = ActiveRecord::Base.connection
    > con.select_all("SHOW VARIABLES LIKE 'sql_mode'")
       (0.8ms)  SHOW VARIABLES LIKE 'sql_mode'
     => #<ActiveRecord::Result:0x00007fc6ca533728 @columns=["Variable_name", "Value"], @rows=[["sql_mode", "NO_AUTO_VALUE_ON_ZERO,STRICT_ALL_TABLES,NO_ENGINE_SUBSTITUTION"]], @hash_rows=nil, @column_types={}>
    

Workaround

You can convert the CHARACTER SET of the column to utf8mb4 and COLLATE to utf8mb4_xxx and use utf8mb4 for the connection character set, but if you can't convert the column to utf8mb4 for some reason, you'll probably want to reject 4 bytes UTF-8 characters with validation because it's not good to just shut up and truncate the 4 bytes UTF-8 characters and beyond.

The range of Unicode characters that result in 4 bytes when encoded in UTF-8 is U+10000 to U+10FFFF.

On PHP

if (preg_match('/[x{10000}-x{10FFFF}]/u', $s) { /* ... */ }
if (preg_match('/[xF0-xF7][x80-xBF][x80-xBF][x80-xBF]/', $s)) { /* ... */ }
preg_match_all('/[x{10000}-x{10FFFF}]/u', $s, $matches);
// An array of 4-bytes UTF-8 characters is stored in `$matches[0]`.

On Ruby

if /[u{10000}-u{10FFFF}]/ =~ s
  # ...
end
chars = s.scan(/[u{10000}-u{10FFFF}]/)
# An array of 4-bytes UTF-8 characters is stored in `chars`.

Many cachegrind.out.xxxxxx were generated in /private/var/tmp and disk space filled up

Recently I found that the free disk space of my Mac OS X (Yosemite) got very small, so I looked into why.

$ sudo du -sh /*

$ sudo du -sh /private/*

$ sudo du -sh /private/var/*

Finally I found that /private/var/tmp contained over 200GB files.
In /private/var/tmp there are many files whose name are such as cachegrind.out.50526.
It is said that these files are generated by Xdebug profiler.
Yes, I use Xdebug from MacPorts.
Xdebug is convenient because it shows stacktrace in error, enables stepping execution (but it needs Eclipse...), and so on.

Then I attempted to delete them with rm but an error occurred.

$ sudo rm cachegrind.out.*
-bash: sudo: Argument list too long

Because the files were too many, shell expansion for "*" caused Argument list too long error.

Therefore I deleted them using find -exec.

$ cd /private/var/tmp
$ sudo find . -name 'cachegrind.out.*' -maxdepth 1 -exec rm {} \;

I got about 200GB free space.

See below for details about find -exec.
Delete or grep the results of find

If this goes on, files will be generated and they will fill up my disk again. So I decided to modify Xdebug setting so that Xdebug profiler will generate output files not in /private/var/tmp but in /tmp directory.
(I also set trace output directory to /tmp)
They will be deleted on rebooting.

/opt/local/etc/php53/php.ini
Add xdebug.profiler_output_dir and xdebug.trace_output_dir.

[xdebug]
xdebug.profiler_enable=On
xdebug.remote_enable=On
xdebug.remote_host="localhost"
xdebug.remote_port=9000
xdebug.remote_handler="dbgp"
xdebug.idekey=ECLIPSE_DBGP
xdebug.profiler_output_dir=/tmp/ ; where the profiler output will be written to
xdebug.trace_output_dir=/tmp/ ; where the tracing files will be written to

Xdebug: Documentation.