Validating 4-byte UTF-8 characters

When the character set of a MySQL column is utf8 and the SQL mode (sql_mode) is not strict (i.e. sql_mode includes neither STRICT_ALL_TABLES nor STRICT_TRANS_TABLES), storing a character that takes 4 bytes when encoded in UTF-8 (such as an emoji like 😁) truncates the string at that character: the character and everything after it are dropped, with only a warning.

To support 4-byte UTF-8 characters, the columns must use utf8mb4 as CHARACTER SET (with a utf8mb4_xxx COLLATE such as utf8mb4_unicode_520_ci), and the connection character set must also be utf8mb4.

Incidentally, in Rails, trying to store 4-byte UTF-8 characters in a MySQL column whose character set is utf8 raises the following error, so the string is not silently truncated.

An ActiveRecord::StatementInvalid occurred in news#update:

Mysql2::Error: Incorrect string value: '\xF0\x9F\x98\x80\x0D\x0A' for column 'description' at row 1: UPDATE `news` SET `description` = '😀\r\n' WHERE `news`.`id` = 2
app/controllers/news_controller.rb:98:in `update'

This is because, unless otherwise specified, AbstractMysqlAdapter#configure_connection adds STRICT_ALL_TABLES to the session's SQL_MODE. (NO_AUTO_VALUE_ON_ZERO is also added.)

You can confirm it by doing the following.

  • With the mysql client
    mysql> show variables like 'sql_mode';
    +---------------+------------------------+
    | Variable_name | Value                  |
    +---------------+------------------------+
    | sql_mode      | NO_ENGINE_SUBSTITUTION |
    +---------------+------------------------+
    1 row in set (0.00 sec)
    
  • With the Rails console against the same database
    > con = ActiveRecord::Base.connection
    > con.select_all("SHOW VARIABLES LIKE 'sql_mode'")
       (0.8ms)  SHOW VARIABLES LIKE 'sql_mode'
     => #<ActiveRecord::Result:0x00007fc6ca533728 @columns=["Variable_name", "Value"], @rows=[["sql_mode", "NO_AUTO_VALUE_ON_ZERO,STRICT_ALL_TABLES,NO_ENGINE_SUBSTITUTION"]], @hash_rows=nil, @column_types={}>
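
The difference between the two sql_mode values above can also be checked mechanically. The following is a minimal sketch in plain Ruby, assuming that "strict" means what this post describes (the value contains STRICT_ALL_TABLES or STRICT_TRANS_TABLES). The helper name `strict_sql_mode?` is hypothetical and is not part of Rails or mysql2.

```ruby
# Mode names that make a session "strict" in the sense used in this post.
STRICT_MODES = %w[STRICT_ALL_TABLES STRICT_TRANS_TABLES].freeze

# Hypothetical helper: takes the comma-separated sql_mode value
# (as returned by SHOW VARIABLES LIKE 'sql_mode') and reports
# whether it contains one of the strict modes.
def strict_sql_mode?(sql_mode)
  modes = sql_mode.to_s.upcase.split(",").map(&:strip)
  (modes & STRICT_MODES).any?
end

# The two values shown above.
client_mode = "NO_ENGINE_SUBSTITUTION"
rails_mode  = "NO_AUTO_VALUE_ON_ZERO,STRICT_ALL_TABLES,NO_ENGINE_SUBSTITUTION"

puts strict_sql_mode?(client_mode) # false: truncation with a warning
puts strict_sql_mode?(rails_mode)  # true: the statement fails with an error
```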
    

Workaround

The fix is to convert the column's CHARACTER SET to utf8mb4 and its COLLATE to utf8mb4_xxx, and to use utf8mb4 for the connection character set. If you can't convert the column for some reason, you'll probably want to reject 4-byte UTF-8 characters with validation, because silently truncating the string from the first 4-byte character onward is not acceptable.

The range of Unicode characters that result in 4 bytes when encoded in UTF-8 is U+10000 to U+10FFFF.
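
The boundary can be checked by encoding code points on either side of it. This is a small sketch in plain Ruby; the helper name `utf8_bytesize` is made up for this example.

```ruby
# Hypothetical helper: number of bytes UTF-8 needs for one code point.
# Array#pack("U") builds a UTF-8 string from the code point.
def utf8_bytesize(codepoint)
  [codepoint].pack("U").bytesize
end

# Code points just below and above each length boundary.
[0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF].each do |cp|
  puts format("U+%04X -> %d byte(s)", cp, utf8_bytesize(cp))
end

# U+FFFF still fits in 3 bytes; U+10000 is the first 4-byte code point.
puts "😁".bytesize # 4 (U+1F601)
```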

In PHP

if (preg_match('/[\x{10000}-\x{10FFFF}]/u', $s)) { /* ... */ }
if (preg_match('/[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]/', $s)) { /* ... */ }
preg_match_all('/[\x{10000}-\x{10FFFF}]/u', $s, $matches);
// An array of the 4-byte UTF-8 characters is stored in `$matches[0]`.

In Ruby

if /[\u{10000}-\u{10FFFF}]/ =~ s
  # ...
end
chars = s.scan(/[\u{10000}-\u{10FFFF}]/)
# An array of the 4-byte UTF-8 characters is stored in `chars`.
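
For a validation message it helps to tell the user which characters were rejected. The following is a minimal sketch in plain Ruby, assuming the input is a UTF-8 string. The constant `FOUR_BYTE_CHAR` and the helper `four_byte_char_report` are hypothetical names for this example and are not part of Rails.

```ruby
# Matches any character that needs 4 bytes in UTF-8 (U+10000 to U+10FFFF).
FOUR_BYTE_CHAR = /[\u{10000}-\u{10FFFF}]/

# Hypothetical helper: lists each distinct 4-byte character in s
# together with its code point, in order of first appearance.
def four_byte_char_report(s)
  s.scan(FOUR_BYTE_CHAR).uniq.map { |c| format("%s (U+%04X)", c, c.ord) }
end

report = four_byte_char_report("Hello 😁 and 😀😁\r\n")
if report.empty?
  puts "ok"
else
  puts "4-byte characters are not allowed: #{report.join(', ')}"
end
```

In a Rails model, a method like this could be called from a custom validation to add an error to the attribute; that wiring is left out here because it depends on the application.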

Change file timestamp on Windows

Run Set-ItemProperty with PowerShell.

Change last modified datetime

> Set-ItemProperty "<PATH TO FILE>" -Name LastWriteTime -Value "<DATETIME STRING>"

Change creation datetime

> Set-ItemProperty "<PATH TO FILE>" -Name CreationTime -Value "<DATETIME STRING>"

The DATETIME STRING given to -Value appears to accept any standard date and time string that .NET can parse, such as "2018/06/01 12:27:59".
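
If Ruby is at hand instead of PowerShell, the last modified time can be changed with `File.utime`. This is a different technique from Set-ItemProperty, shown only as a rough equivalent: `File.utime` sets the access and modification times and does not cover the creation time. The helper name `set_last_write_time` and the temporary file path are assumptions made for this sketch.

```ruby
require "time"   # Time.strptime
require "tmpdir" # Dir.tmpdir

# Hypothetical helper: sets the last modified time of path from a
# "YYYY/MM/DD hh:mm:ss" string (local time) and keeps the access time.
def set_last_write_time(path, datetime_string)
  mtime = Time.strptime(datetime_string, "%Y/%m/%d %H:%M:%S")
  File.utime(File.atime(path), mtime, path)
  File.mtime(path)
end

# Example on a scratch file in the temporary directory.
path = File.join(Dir.tmpdir, "utime_example.txt")
File.write(path, "example")
puts set_last_write_time(path, "2018/06/01 12:27:59")
```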

Displaying the character code of the character at the cursor position in Emacs

C-x = (M-x what-cursor-position)

The Unicode code point of the character at the cursor position is displayed in the minibuffer.
For example, if you place the cursor on the letter "あ" in a Shift_JIS file and press 'C-x =', the following is displayed in the minibuffer.

Char: あ (12354, #o30102, #x3042, file ...) point=1 of 2 (0%) column=0

`12354, #o30102, #x3042` are the decimal, octal, and hexadecimal notations of the Unicode code point of "あ".

C-u C-x = (M-x describe-char)

Detailed information about the character at the cursor position is displayed in a split window.
For example, if you place the cursor on the letter "あ" in a Shift_JIS file and press 'C-u C-x =', the following is displayed in the split window.

             position: 1 of 2 (0%), column: 0
            character: あ (displayed as あ) (codepoint 12354, #o30102, #x3042)
    preferred charset: japanese-jisx0208 (JISX0208.1983/1990 Japanese Kanji: ISO-IR-87)
code point in charset: 0x2422
               script: kana
               syntax: w  which means: word
             category: .:Base, H:2-byte Hiragana, L:Left-to-right (strong), c:Chinese, h:Korean, j:Japanese, |:line breakable
             to input: type "C-x 8 RET 3042" or "C-x 8 RET HIRAGANA LETTER A"
          buffer code: #xE3 #x81 #x82
            file code: #x82 #xA0 (encoded by coding system japanese-shift-jis-dos)
              display: terminal code #xE3 #x81 #x82

`code point in charset: 0x2422` represents the code point of "あ" in character set JIS X 0208,
`buffer code: #xE3 #x81 #x82` represents the encoding in the buffer (UTF-8),
`file code: #x82 #xA0` represents the encoding in the file (Shift_JIS).
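
The same values can be reproduced outside Emacs. The sketch below uses Ruby's built-in encoding conversion. Deriving the JIS X 0208 code from EUC-JP by clearing the high bit of each byte is a shortcut chosen for this example, not something Emacs does.

```ruby
ch = "あ"

codepoint  = ch.ord                       # Unicode code point
utf8_bytes = ch.encode("UTF-8").bytes     # "buffer code" in Emacs
sjis_bytes = ch.encode("Shift_JIS").bytes # "file code" for a Shift_JIS file

# EUC-JP stores a JIS X 0208 character as its two code bytes with the
# high bit set, so clearing that bit yields the code point in the charset.
jisx0208 = ch.encode("EUC-JP").bytes.map { |b| b & 0x7F }

hex = ->(bytes) { bytes.map { |b| format("#x%02X", b) }.join(" ") }

puts format("codepoint %d, #o%o, #x%X", codepoint, codepoint, codepoint)
puts "buffer code (UTF-8): #{hex.call(utf8_bytes)}"
puts "file code (Shift_JIS): #{hex.call(sjis_bytes)}"
puts format("code point in JIS X 0208: 0x%02X%02X", *jisx0208)
```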

Installing Ruby-2.5.0 on CentOS6

Attempting to install Ruby-2.5.0 from source on CentOS6 fails with an error in `make`.
The error occurs because the gcc shipped with CentOS6 is too old.

$ ./configure --prefix=/opt/ruby-2.5.0 --disable-install-doc
$ make
...(omitted)
prelude.c: In function ‘prelude_eval’:
prelude.c:204: error: #pragma GCC diagnostic not allowed inside functions
prelude.c:205: error: #pragma GCC diagnostic not allowed inside functions
prelude.c:221: error: #pragma GCC diagnostic not allowed inside functions
At top level:
cc1: warning: unrecognized command line option "-Wno-self-assign"
cc1: warning: unrecognized command line option "-Wno-constant-logical-operand"
cc1: warning: unrecognized command line option "-Wno-parentheses-equality"
cc1: warning: unrecognized command line option "-Wno-tautological-compare"
make: *** [prelude.o] Error 1

Bug #14234: Failed to build on CentOS 6.9 - Ruby trunk - Ruby Issue Tracking System

It will be fixed in the next release (Ruby-2.5.1), but for the time being you can install Ruby-2.5.0 using scl's devtoolset.

Installing scl devtoolset on CentOS6

Create a file under /etc/profile.d with the following content (this example uses the devtoolset-4 collection) so that scl's devtoolset is enabled for users after a reboot. With this in place, you can install Passenger with `passenger-install-apache2-module`, or install gems that need a native build via Capistrano.

$ cat /etc/profile.d/enabledevtoolset-4.sh
#!/bin/bash
source scl_source enable devtoolset-4

Show logs on macOS (10.12 Sierra or later)

Overview

Use `log`.

Streaming (like tail command)

`log stream`

Search past logs

`log show`

See `man log` for details.

Example

cron

$ log stream --info --predicate 'process == "cron"'
$ log show --info --predicate 'process == "cron"' --start '2017-05-25'

postfix

$ log stream --info --predicate '(process == "smtp") || (process == "smtpd")'
$ log show --info --predicate '(process == "smtp") || (process == "smtpd")' --start '2017-05-25'

If you don't know which process name to specify

$ log show --info --start '2017-11-08' | grep 'xxx'

and guess the process name from the matching lines.
