{"id":4630,"date":"2020-04-26T10:24:48","date_gmt":"2020-04-26T01:24:48","guid":{"rendered":"https:\/\/www.pistolfly.com\/weblog\/?p=4630"},"modified":"2020-05-07T09:32:25","modified_gmt":"2020-05-07T00:32:25","slug":"validating-4-bytes-utf-8-characters","status":"publish","type":"post","link":"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html","title":{"rendered":"Validating 4 bytes UTF-8 characters"},"content":{"rendered":"\n<p>When the character set of MySQL column is utf8 and the SQL mode (sql_mode) is not strict mode (i.e. sql_mode does not include STRICT_ALL_TABLES nor STRICT_TRANS_TABLES), setting a character that will be 4 bytes when encoded with utf-8 (such as an emoji like \ud83d\ude01) will truncate the remainder of the characters (with a warning).<\/p>\n\n\n\n<p>To support 4 bytes UTF-8 characters, the columns with utf8mb4 for CHARACTER SET (and utf8mb4_xxx such as utf8mb4_unicode_520_ci, etc for COLLATE) must be used, and the connection character set must also use utf8mb4.<\/p>\n\n\n\n<p>Incidentally, on Rails, if you try to set 4 bytes UTF-8 characters when the character set of MySQL column is utf8, the following error occurs, so the string will not be truncated unnoticed.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>An ActiveRecord::StatementInvalid occurred in news#update:<\/p><p>Mysql2::Error: Incorrect string value: 'xF0x9Fx98x80x0Dx0A' for column 'description' at row 1: UPDATE `news` SET `description` = '\ud83d\ude00rn' WHERE `news`.`id` = 2<br \/>app\/controllers\/news_controller.rb:98:in `update'<\/p><\/blockquote>\n\n\n\n<p>This is because, unless otherwise specified, <a aria-label=\" (opens in a new tab)\" href=\"https:\/\/github.com\/rails\/rails\/blob\/8bec77cc0f1fd47677a331a64f68c5918efd2ca9\/activerecord\/lib\/active_record\/connection_adapters\/abstract_mysql_adapter.rb#L739\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"aioseop-link\">AbstractMysqlAdapter#configure_connection<\/a> adds STRICT_ALL_TABLES to the session's SQL_MODE. (NO_AUTO_VALUE_ON_ZERO is also added.)<\/p>\n\n\n\n<p>You can confirm it by doing the following.<\/p>\n\n\n\n<ul>\n<li>With the mysql client\n<pre class=\"command-line\">mysql&gt; show variables like 'sql_mode';\n+---------------+------------------------+\n| Variable_name | Value                  |\n+---------------+------------------------+\n| sql_mode      | NO_ENGINE_SUBSTITUTION |\n+---------------+------------------------+\n1 row in set (0.00 sec)\n<\/pre>\n<\/li>\n<li>With the Rails console against the same database\n<pre class=\"command-line\">&gt; con = ActiveRecord::Base.connection\n&gt; con.select_all(\"SHOW VARIABLES LIKE 'sql_mode'\")\n   (0.8ms)  SHOW VARIABLES LIKE 'sql_mode'\n =&gt; #&lt;ActiveRecord::Result:0x00007fc6ca533728 @columns=[\"Variable_name\", \"Value\"], @rows=[[\"sql_mode\", \"NO_AUTO_VALUE_ON_ZERO,STRICT_ALL_TABLES,NO_ENGINE_SUBSTITUTION\"]], @hash_rows=nil, @column_types={}&gt;\n<\/pre>\n<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Workaround<\/h2>\n\n\n\n<p>You can convert the CHARACTER SET of the column to utf8mb4 and COLLATE to utf8mb4_xxx and use utf8mb4 for the connection character set, but if  you can't convert the column to utf8mb4 for some reason, you'll probably want to reject 4 bytes UTF-8 characters with validation because it's not good to just shut up and truncate the 4 bytes UTF-8 characters and beyond.<br \/><br \/>The range of Unicode characters that result in 4 bytes when encoded in UTF-8 is U+10000 to U+10FFFF.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><a rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/UTF-8#Description\" target=\"_blank\" class=\"aioseop-link\">https:\/\/en.wikipedia.org\/wiki\/UTF-8#Description<\/a><\/li><li><a rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/Unicode#Code_point_planes_and_blocks\" target=\"_blank\">https:\/\/en.wikipedia.org\/wiki\/Unicode#Code_point_planes_and_blocks<\/a><\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">On PHP<\/h3>\n\n\n\n<pre class=\"prettyprint lang-php\">if (preg_match('\/[x{10000}-x{10FFFF}]\/u', $s) { \/* ... *\/ }\n<\/pre>\n<pre class=\"prettyprint lang-php\">if (preg_match('\/[xF0-xF7][x80-xBF][x80-xBF][x80-xBF]\/', $s)) { \/* ... *\/ }\n<\/pre>\n<pre class=\"prettyprint lang-php\">preg_match_all('\/[x{10000}-x{10FFFF}]\/u', $s, $matches);\n\/\/ An array of 4-bytes UTF-8 characters is stored in `$matches[0]`.\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">On Ruby<\/h3>\n\n\n\n<pre class=\"prettyprint lang-rb\">if \/[u{10000}-u{10FFFF}]\/ =~ s\n  # ...\nend\n<\/pre>\n\n<pre class=\"prettyprint lang-ruby\">chars = s.scan(\/[u{10000}-u{10FFFF}]\/)\n# An array of 4-bytes UTF-8 characters is stored in `chars`.\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>When the character set of MySQL column is utf8 and the SQL mode (sql_mode) is not strict mode (i.e. sql_mode d &hellip; <a href=\"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Validating 4 bytes UTF-8 characters<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_locale":"en_US","_original_post":"https:\/\/www.pistolfly.com\/weblog\/?p=4613","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[13,4,27],"tags":[],"class_list":["post-4630","post","type-post","status-publish","format-standard","hentry","category-mysql","category-php","category-ruby","en-US"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Validating 4 bytes UTF-8 characters - Pistolfly<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Validating 4 bytes UTF-8 characters - Pistolfly\" \/>\n<meta property=\"og:description\" content=\"When the character set of MySQL column is utf8 and the SQL mode (sql_mode) is not strict mode (i.e. sql_mode d &hellip; Continue reading Validating 4 bytes UTF-8 characters &rarr;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html\" \/>\n<meta property=\"og:site_name\" content=\"Pistolfly\" \/>\n<meta property=\"article:published_time\" content=\"2020-04-26T01:24:48+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-05-07T00:32:25+00:00\" \/>\n<meta name=\"author\" content=\"Pistolfly\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Pistolfly\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/en\\\/2020\\\/04\\\/validating-4-bytes-utf-8-characters.html#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/en\\\/2020\\\/04\\\/validating-4-bytes-utf-8-characters.html\"},\"author\":{\"name\":\"Pistolfly\",\"@id\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/#\\\/schema\\\/person\\\/e779e918730b507907b3a35b77a9a2ab\"},\"headline\":\"Validating 4 bytes UTF-8 characters\",\"datePublished\":\"2020-04-26T01:24:48+00:00\",\"dateModified\":\"2020-05-07T00:32:25+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/en\\\/2020\\\/04\\\/validating-4-bytes-utf-8-characters.html\"},\"wordCount\":330,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/#\\\/schema\\\/person\\\/e779e918730b507907b3a35b77a9a2ab\"},\"articleSection\":[\"MySQL\",\"PHP\",\"Ruby\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/en\\\/2020\\\/04\\\/validating-4-bytes-utf-8-characters.html#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/en\\\/2020\\\/04\\\/validating-4-bytes-utf-8-characters.html\",\"url\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/en\\\/2020\\\/04\\\/validating-4-bytes-utf-8-characters.html\",\"name\":\"Validating 4 bytes UTF-8 characters - Pistolfly\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/#website\"},\"datePublished\":\"2020-04-26T01:24:48+00:00\",\"dateModified\":\"2020-05-07T00:32:25+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/en\\\/2020\\\/04\\\/validating-4-bytes-utf-8-characters.html#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/en\\\/2020\\\/04\\\/validating-4-bytes-utf-8-characters.html\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/en\\\/2020\\\/04\\\/validating-4-bytes-utf-8-characters.html#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\u30db\u30fc\u30e0\",\"item\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Validating 4 bytes UTF-8 characters\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/#website\",\"url\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/\",\"name\":\"Pistolfly\",\"description\":\"Developer Blog\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/#\\\/schema\\\/person\\\/e779e918730b507907b3a35b77a9a2ab\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/#\\\/schema\\\/person\\\/e779e918730b507907b3a35b77a9a2ab\",\"name\":\"Pistolfly\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/0b3ddb2859adabf28a2b97ae9fff98772ddffc46088d11cad2b9a2cef66115e7?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/0b3ddb2859adabf28a2b97ae9fff98772ddffc46088d11cad2b9a2cef66115e7?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/0b3ddb2859adabf28a2b97ae9fff98772ddffc46088d11cad2b9a2cef66115e7?s=96&d=mm&r=g\",\"caption\":\"Pistolfly\"},\"logo\":{\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/0b3ddb2859adabf28a2b97ae9fff98772ddffc46088d11cad2b9a2cef66115e7?s=96&d=mm&r=g\"},\"description\":\"Software Engineer in Tokyo, Japan\",\"url\":\"https:\\\/\\\/www.pistolfly.com\\\/weblog\\\/author\\\/pistolfly\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Validating 4 bytes UTF-8 characters - Pistolfly","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html","og_locale":"en_US","og_type":"article","og_title":"Validating 4 bytes UTF-8 characters - Pistolfly","og_description":"When the character set of MySQL column is utf8 and the SQL mode (sql_mode) is not strict mode (i.e. sql_mode d &hellip; Continue reading Validating 4 bytes UTF-8 characters &rarr;","og_url":"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html","og_site_name":"Pistolfly","article_published_time":"2020-04-26T01:24:48+00:00","article_modified_time":"2020-05-07T00:32:25+00:00","author":"Pistolfly","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Pistolfly","Est. reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html#article","isPartOf":{"@id":"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html"},"author":{"name":"Pistolfly","@id":"https:\/\/www.pistolfly.com\/weblog\/#\/schema\/person\/e779e918730b507907b3a35b77a9a2ab"},"headline":"Validating 4 bytes UTF-8 characters","datePublished":"2020-04-26T01:24:48+00:00","dateModified":"2020-05-07T00:32:25+00:00","mainEntityOfPage":{"@id":"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html"},"wordCount":330,"commentCount":0,"publisher":{"@id":"https:\/\/www.pistolfly.com\/weblog\/#\/schema\/person\/e779e918730b507907b3a35b77a9a2ab"},"articleSection":["MySQL","PHP","Ruby"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html","url":"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html","name":"Validating 4 bytes UTF-8 characters - Pistolfly","isPartOf":{"@id":"https:\/\/www.pistolfly.com\/weblog\/#website"},"datePublished":"2020-04-26T01:24:48+00:00","dateModified":"2020-05-07T00:32:25+00:00","breadcrumb":{"@id":"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.pistolfly.com\/weblog\/en\/2020\/04\/validating-4-bytes-utf-8-characters.html#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\u30db\u30fc\u30e0","item":"https:\/\/www.pistolfly.com\/weblog"},{"@type":"ListItem","position":2,"name":"Validating 4 bytes UTF-8 characters"}]},{"@type":"WebSite","@id":"https:\/\/www.pistolfly.com\/weblog\/#website","url":"https:\/\/www.pistolfly.com\/weblog\/","name":"Pistolfly","description":"Developer Blog","publisher":{"@id":"https:\/\/www.pistolfly.com\/weblog\/#\/schema\/person\/e779e918730b507907b3a35b77a9a2ab"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.pistolfly.com\/weblog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/www.pistolfly.com\/weblog\/#\/schema\/person\/e779e918730b507907b3a35b77a9a2ab","name":"Pistolfly","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/0b3ddb2859adabf28a2b97ae9fff98772ddffc46088d11cad2b9a2cef66115e7?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/0b3ddb2859adabf28a2b97ae9fff98772ddffc46088d11cad2b9a2cef66115e7?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/0b3ddb2859adabf28a2b97ae9fff98772ddffc46088d11cad2b9a2cef66115e7?s=96&d=mm&r=g","caption":"Pistolfly"},"logo":{"@id":"https:\/\/secure.gravatar.com\/avatar\/0b3ddb2859adabf28a2b97ae9fff98772ddffc46088d11cad2b9a2cef66115e7?s=96&d=mm&r=g"},"description":"Software Engineer in Tokyo, Japan","url":"https:\/\/www.pistolfly.com\/weblog\/author\/pistolfly"}]}},"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/paorzz-1cG","jetpack-related-posts":[],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.pistolfly.com\/weblog\/wp-json\/wp\/v2\/posts\/4630","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pistolfly.com\/weblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.pistolfly.com\/weblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pistolfly.com\/weblog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pistolfly.com\/weblog\/wp-json\/wp\/v2\/comments?post=4630"}],"version-history":[{"count":31,"href":"https:\/\/www.pistolfly.com\/weblog\/wp-json\/wp\/v2\/posts\/4630\/revisions"}],"predecessor-version":[{"id":4768,"href":"https:\/\/www.pistolfly.com\/weblog\/wp-json\/wp\/v2\/posts\/4630\/revisions\/4768"}],"wp:attachment":[{"href":"https:\/\/www.pistolfly.com\/weblog\/wp-json\/wp\/v2\/media?parent=4630"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.pistolfly.com\/weblog\/wp-json\/wp\/v2\/categories?post=4630"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.pistolfly.com\/weblog\/wp-json\/wp\/v2\/tags?post=4630"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}