Character encoding

Breaking and unbreaking your data

Maciej Dobrzanski | @mushupl

Brussels, 1 Feb 2015

01.02.2015 Follow us on Twitter @dbasquare www.psce.com

Character Encoding - MySQL DevRoom - FOSDEM 2015


Character Encoding

• Binary representation of glyphs
• Each character can be represented by 1 or more bytes
• Popular schemes
  • ASCII
  • Unicode
    • UTF-8, UTF-16, UTF-32
  • Language specific character sets
    • US (Latin US)
    • Europe (Latin 1, Latin 2)
    • Asia (EUC-KR, GB18030)
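
To make the "1 or more bytes" point concrete, here is a minimal Python sketch. Python is used for illustration only, and the helper name `byte_lengths` is made up for this example. It counts how many bytes a single character occupies in each Unicode encoding form.

```python
# Encodings compared; the "-le" variants avoid counting a byte-order mark.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def byte_lengths(char: str) -> dict:
    """Return the encoded size in bytes of one character, per encoding."""
    return {enc: len(char.encode(enc)) for enc in ENCODINGS}


for ch in ("B", "ó", "東"):
    print(ch, byte_lengths(ch))
```

The same glyph costs a different number of bytes depending on the scheme, which matters later when column and index sizes are counted in bytes.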


Character Encoding

• Character set defines the visual interpretation of binary information
  • One glyph can be associated with several numeric codes
  • One numeric code may be used to represent several different glyphs
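
Both directions can be shown with a short Python sketch. The codecs `iso8859_1` and `iso8859_2` stand in here for Latin 1 and Latin 2; that mapping is an assumption made for illustration, since MySQL's own `latin1` differs slightly from ISO 8859-1.

```python
# One numeric code, several glyphs: the byte 0xF1 depends on the character set.
code = b"\xf1"
as_latin1 = code.decode("iso8859_1")  # Latin 1 reads it as 'ñ'
as_latin2 = code.decode("iso8859_2")  # Latin 2 reads it as 'ń'

# One glyph, several numeric codes: 'ń' is encoded differently per scheme.
glyph = "ń"
codes = {
    enc: glyph.encode(enc).hex().upper()
    for enc in ("iso8859_2", "utf-8", "utf-16-be")
}

print(as_latin1, as_latin2, codes)
```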


Please state the nature of the emergency

• Application configuration

• Database configuration

• Table/column definitions


Problem #1: We are all born Swedish

• MySQL uses latin1 by default
  • MySQL 5.7 too
• Is anyone actually aware of that?
• Why Swedish?
  • latin1_swedish_ci is the default collation


Problem #1

• Let’s build an application

mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1                        | latin1                         |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)

mysql> CREATE SCHEMA fosdem;
Query OK, 1 row affected (0.00 sec)

mysql> USE fosdem;

mysql> CREATE TABLE locations (city VARCHAR(30) NOT NULL);
Query OK, 0 rows affected (0.15 sec)

mysql> SHOW CREATE TABLE locations\G
*************************** 1. row ***************************
       Table: locations
Create Table: CREATE TABLE `locations` (
  `city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)


Problem #1

• Everything is correct… NOT!

mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from locations;
+--------------------+
| city               |
+--------------------+
| Berlin             |
| Kraków             |
| �京都                |
+--------------------+
3 rows in set (0.00 sec)

mysql> SET NAMES latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from locations;
+-----------+
| city      |
+-----------+
| Berlin    |
| Kraków    |
| 東京都       |
+-----------+
3 rows in set (0.00 sec)
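
The mechanism behind the two different results can be sketched in Python. This is a simplified model, not a reproduction of the exact output above: `server_send` is a hypothetical helper, Python's `latin-1` codec stands in for MySQL's `latin1`, and the application is assumed to have written UTF-8 bytes over a connection declared as latin1.

```python
def server_send(stored: bytes, column_charset: str, client_charset: str) -> bytes:
    """Model of the read path: transcode from column charset to client charset."""
    if column_charset == client_charset:
        return stored  # same label on both sides, bytes pass through untouched
    return stored.decode(column_charset).encode(client_charset)


# The application wrote UTF-8 bytes, but the column believes they are latin1.
stored = "Kraków".encode("utf-8")

# SET NAMES latin1: no conversion, the UTF-8 terminal renders the raw bytes.
seen_with_latin1 = server_send(stored, "latin-1", "latin-1").decode("utf-8")

# SET NAMES utf8: the server "helpfully" converts latin1 -> utf8 a second time.
seen_with_utf8 = server_send(stored, "latin-1", "utf-8").decode("utf-8")

print(seen_with_latin1, seen_with_utf8)
```

Under these assumptions the data only looks right while the client keeps repeating the original wrong declaration.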


Problem #1

• Let’s fix this
  • Or can we ignore it?
  • Ruby may not like it

# grep character-set-server /etc/mysql/my.cnf
character-set-server = utf8

mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| utf8                          | utf8                           |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)

...we are fixing our tables here...

mysql> SHOW CREATE TABLE locations\G
*************************** 1. row ***************************
       Table: locations
Create Table: CREATE TABLE `locations` (
  `city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)


Problem #1: The good news

• It’s usually fixable


Problem #2: Settings, defaults, inheritance

• Where do you set character sets in MySQL?
  • Session settings
    • character_set_server
    • character_set_client
    • character_set_connection
    • character_set_database
    • character_set_results
  • Schema level defaults
  • Table level defaults
  • Column charsets


Problem #2

• Having fixed our problem #1, we continue to develop our application

mysql> SELECT @@session.character_set_server, @@session.character_set_client;
+--------------------------------+--------------------------------+
| @@session.character_set_server | @@session.character_set_client |
+--------------------------------+--------------------------------+
| utf8                           | utf8                           |
+--------------------------------+--------------------------------+
1 row in set (0.00 sec)

mysql> USE fosdem;

mysql> CREATE TABLE people (first_name VARCHAR(30) NOT NULL, last_name VARCHAR(30) NOT NULL);
Query OK, 0 rows affected (0.13 sec)


Problem #2

• Why is the table character set latin1?

mysql> SELECT @@session.character_set_server, @@session.character_set_client;
+--------------------------------+--------------------------------+
| @@session.character_set_server | @@session.character_set_client |
+--------------------------------+--------------------------------+
| utf8                           | utf8                           |
+--------------------------------+--------------------------------+
1 row in set (0.00 sec)

mysql> USE fosdem;

mysql> SHOW CREATE TABLE people\G
*************************** 1. row ***************************
       Table: people
Create Table: CREATE TABLE `people` (
  `first_name` varchar(30) NOT NULL,
  `last_name` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)


Problem #2

• What’s all this, then?

mysql> SHOW SESSION VARIABLES LIKE 'character_set_%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> SHOW CREATE DATABASE fosdem\G
*************************** 1. row ***************************
       Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)


Problem #2

• Can we fix this?

mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT last_name, HEX(last_name) FROM people;
+------------+----------------------+
| last_name  | HEX(last_name)       |
+------------+----------------------+
| Lemon      | 4C656D6F6E           |
| Müller     | 4DFC6C6C6572         |
| Dobrza?ski | 446F62727A613F736B69 |
+------------+----------------------+
3 rows in set (0.00 sec)

mysql> SET NAMES latin2;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT last_name, HEX(last_name) FROM people;
+------------+----------------------+
| last_name  | HEX(last_name)       |
+------------+----------------------+
| Lemon      | 4C656D6F6E           |
| Müller     | 4DFC6C6C6572         |
| Dobrza?ski | 446F62727A613F736B69 |
+------------+----------------------+
3 rows in set (0.00 sec)

• We can’t! :-(
  • 0x3F is '?', so my 'ń' was lost
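
The loss can be reproduced in a few lines of Python. This is a sketch of the write path under stated assumptions: `server_store` is a hypothetical helper, the client correctly declares utf8, the column is latin1, and Python's `errors="replace"` mimics the server substituting '?' for characters the column charset cannot hold.

```python
def server_store(sent: bytes, client_charset: str, column_charset: str) -> bytes:
    """Model of the write path: transcode client bytes into the column charset."""
    text = sent.decode(client_charset)
    # Characters missing from the column charset are replaced with '?'.
    return text.encode(column_charset, errors="replace")


for name in ("Lemon", "Müller", "Dobrzański"):
    stored = server_store(name.encode("utf-8"), "utf-8", "latin-1")
    print(name, stored.hex().upper())
```

'ü' exists in latin1 and survives as 0xFC, while 'ń' does not and is stored as 0x3F. Once that byte is on disk, no client setting can bring the original character back.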


Problem #2: The bad news

• It may not be enough to configure the server correctly

• A mismatch between client and server can permanently break data
  • Implicit conversion inside MySQL server


Problem #2: Settings, defaults, inheritance

• Where do you set character sets in MySQL?
  • Session settings
    • character_set_server
    • character_set_client
    • character_set_connection
    • character_set_database
    • character_set_results
  • Schema level defaults – affect new tables
  • Table level defaults – affect new columns
  • Column charsets


Problem #2: Settings, defaults, inheritance

master [localhost] {msandbox} ((none)) > SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1                        | utf8                           |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)

master [localhost] {msandbox} ((none)) > CREATE SCHEMA fosdem\G
Query OK, 1 row affected (0.00 sec)

master [localhost] {msandbox} ((none)) > SHOW CREATE SCHEMA fosdem\G
*************************** 1. row ***************************
       Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)


Problem #2: Settings, defaults, inheritance

master [localhost] {msandbox} ((none)) > USE fosdem;
Database changed

master [localhost] {msandbox} (fosdem) > CREATE TABLE test (a VARCHAR(300), INDEX (a));
Query OK, 0 rows affected (0.62 sec)

master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
       Table: test
Create Table: CREATE TABLE `test` (
  `a` varchar(300) DEFAULT NULL,
  KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)


Problem #2: Settings, defaults, inheritance

master [localhost] {msandbox} (fosdem) > ALTER TABLE test DEFAULT CHARSET = utf8;
Query OK, 0 rows affected (0.08 sec)
Records: 0  Duplicates: 0  Warnings: 0

master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
       Table: test
Create Table: CREATE TABLE `test` (
  `a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
  KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)


Problem #2: Settings, defaults, inheritance

master [localhost] {msandbox} (fosdem) > ALTER TABLE test ADD b VARCHAR(10);
Query OK, 0 rows affected (0.74 sec)
Records: 0  Duplicates: 0  Warnings: 0

master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
       Table: test
Create Table: CREATE TABLE `test` (
  `a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
  `b` varchar(10) DEFAULT NULL,
  KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
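
The inheritance chain shown in these sessions (server default, then schema, then table, then column) can be summarised as a toy model in Python. The classes `Schema` and `Table` and the function `create_schema` are invented for this sketch and have nothing to do with MySQL internals; they only encode the rule that a default is copied at creation time and never applied retroactively.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Table:
    default_charset: str
    columns: Dict[str, str] = field(default_factory=dict)

    def add_column(self, name: str, charset: Optional[str] = None) -> None:
        # A new column without an explicit charset copies the table default.
        self.columns[name] = charset or self.default_charset

    def alter_default_charset(self, charset: str) -> None:
        # Only the default changes; existing columns keep what they have.
        self.default_charset = charset


@dataclass
class Schema:
    default_charset: str

    def create_table(self, charset: Optional[str] = None) -> Table:
        # A new table without an explicit charset copies the schema default.
        return Table(charset or self.default_charset)


def create_schema(character_set_server: str, charset: Optional[str] = None) -> Schema:
    # A new schema without an explicit charset copies character_set_server.
    return Schema(charset or character_set_server)


fosdem = create_schema(character_set_server="latin1")
test = fosdem.create_table()
test.add_column("a")                # inherits latin1
test.alter_default_charset("utf8")  # column a is untouched
test.add_column("b")                # inherits utf8
print(test.columns)
```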

I f**ckd up. What do I do?

• Let’s start with what you shouldn’t do
• Keep calm and don’t start by changing something
  • Analyze the situation
  • Why did the problem occur in the first place?
• Reassess the damage
  • Is it consistent?
  • Are all rows broken in the same way?
  • Are some rows bad, but others are okay?
  • Are all bad in several different ways?
• Is it actually repairable?
  • No character mapping occurred during writes (e.g. unicode over latin1/latin1)
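
One way to approach the consistency questions is to inspect raw bytes (for example the output of `HEX(col)`) rather than what a terminal displays. The Python sketch below is a heuristic with hypothetical names (`classify`, the sample `rows`), not a definitive detector: it assumes the column is supposed to hold either single-byte text or UTF-8, and a single-byte string that happens to be valid UTF-8 would be misclassified.

```python
from collections import Counter


def classify(raw: bytes) -> str:
    """Guess how a stored value is encoded, from its bytes alone."""
    if raw.isascii():
        return "ascii"        # identical in latin1 and utf8
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "single-byte"  # e.g. genuine latin1 text
    return "utf8"             # multi-byte sequences, possibly mislabelled


# Sample HEX() values: Berlin, Kraków (UTF-8 bytes), Müller (latin1 bytes).
rows = ["4265726C696E", "4B72616BC3B377", "4DFC6C6C6572"]
summary = Counter(classify(bytes.fromhex(h)) for h in rows)
print(summary)
```

A mixed summary suggests the rows are not all broken in the same way. Note that a value such as `446F62727A613F736B69` ('Dobrza?ski') classifies as plain ASCII, so characters already replaced by '?' cannot be detected this way.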


I f**ckd up. What else shouldn’t I do, then?

• Do not rush things as you may easily go from bad to worse
• Do not start fixing this on a replication slave
• You can’t fix this by fixing tables one by one on a live database
  • Unless you really have everything in one table
• Do not use: ALTER TABLE … DEFAULT CHARSET = …
  • It only changes the default character set for new columns
• Do not use: ALTER TABLE … CONVERT TO CHARACTER SET …
  • It’s not for fixing broken encoding
• Do not use: ALTER TABLE … MODIFY col_name … CHARACTER SET …


I f**ckd up. So how do I fix it?

• What needs to be fixed?
  • Schema default character set
    • ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
  • Tables with text columns: CHAR, VARCHAR, TEXT, TINYTEXT, LONGTEXT
    • What about ENUM?
    • Use INFORMATION_SCHEMA to grab a list
  • What about other tables?
    • They too (eventually), but it’s not critical

SELECT CONCAT(c.table_schema, '.', c.table_name) AS candidate_table
FROM information_schema.columns c
WHERE c.table_schema = 'fosdem'
  AND c.column_type REGEXP '^(.*CHAR|.*TEXT|ENUM)(\(.+\))?$'
GROUP BY candidate_table;


I f**ckd up. So how do I fix it?

• Option 1 – requires downtime

• Dump and restore
  • Dump the data preserving the bad configuration and drop the old database

bash# mysqldump -u root -p --skip-set-charset --default-character-set=latin1 fosdem > fosdem.sql
mysql> DROP SCHEMA fosdem;

  • Correct table definitions in the dump file
    • Edit DEFAULT CHARSET in all CREATE TABLE statements
  • Create the database again and import the data back

mysql> CREATE SCHEMA fosdem DEFAULT CHARSET utf8;
bash# mysql -u root -p --default-character-set=utf8 fosdem < fosdem.sql
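
The "edit DEFAULT CHARSET" step is a plain text substitution on the dump file. The Python sketch below is one possible way to do it, assuming table-option lines look like `) ENGINE=... DEFAULT CHARSET=latin1` as in the sessions above; `retarget_dump` and `SAMPLE` are names made up for this example. It deliberately ignores column-level `CHARACTER SET` clauses, which would need the same treatment.

```python
import re

# Match only the table-options line that closes a CREATE TABLE statement,
# so row data that happens to contain the same text is left alone.
TABLE_OPTIONS = re.compile(
    r"^(\) ENGINE=\w+.*?DEFAULT CHARSET=)latin1\b", re.MULTILINE
)


def retarget_dump(dump_sql: str, target: str = "utf8") -> str:
    """Rewrite the default charset of every CREATE TABLE in a dump."""
    return TABLE_OPTIONS.sub(lambda m: m.group(1) + target, dump_sql)


SAMPLE = """CREATE TABLE `locations` (
  `city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `locations` VALUES ('DEFAULT CHARSET=latin1');
"""

print(retarget_dump(SAMPLE))
```

Always review the rewritten file (for instance with a diff) before dropping anything; a real dump may contain variations this pattern does not cover.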


I f**ckd up. So how do I fix it?

• Option 2 – requires downtime

• Perform a two step conversion with ALTER TABLE
  • Original encoding -> VARBINARY/BLOB -> Target encoding
  • Conversion from/to BINARY/BLOB removes character set context
• How?
  • Stop applications
  • On each table, for each text column perform:

ALTER TABLE tbl MODIFY col_name VARBINARY(255);
ALTER TABLE tbl MODIFY col_name VARCHAR(255) CHARACTER SET utf8;

    • You may specify multiple columns per ALTER TABLE
  • Fix the problems (application and/or db configs)
  • Restart applications
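
Why the detour through a binary type works, and a direct charset change does not, can be modelled in Python. Both function names are hypothetical and the model is simplified: `convert_directly` stands for a charset-aware ALTER that transcodes the data, `relabel_via_binary` for the two step conversion that leaves the bytes alone. The stored value is assumed to be UTF-8 bytes sitting in a column labelled latin1.

```python
def convert_directly(stored: bytes, old: str, new: str) -> bytes:
    """Charset-aware conversion: the bytes are transcoded from old to new."""
    return stored.decode(old).encode(new)


def relabel_via_binary(stored: bytes, new: str) -> bytes:
    """Two step conversion: drop the charset label, then attach the new one."""
    stored.decode(new)  # raises if the bytes are not valid in the new charset
    return stored       # the bytes themselves never change


stored = "東京都".encode("utf-8")  # what the application really wrote

broken = convert_directly(stored, "latin-1", "utf-8").decode("utf-8")
fixed = relabel_via_binary(stored, "utf-8").decode("utf-8")

print(repr(broken), fixed)
```

The direct conversion makes the double encoding permanent, while the relabelled bytes read back as the intended text.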


I f**ckd up. So how do I fix it?

• Option 3 – online character set fix; no downtime*

• Thanks to our plugin for pt-online-schema-change
  • and a tiny patch for pt-online-schema-change that goes with the plugin
• How?
  • Start pt-online-schema-change on all tables – one by one
    • Do not rotate tables (--no-swap-tables) or drop pt-osc triggers
  • Wait until all tables have been converted
  • Stop applications
  • Fix the problems (application and/or db configs)
  • Rotate tables – takes just 1 minute
  • Restart applications
  • Et voilà


GOTCHAs!

• Data space requirements may change during conversion
  • Latin1 uses 1 byte per character, utf8 will need to assume 3 bytes
  • VARCHAR/TEXT fit up to 64KB – it won’t fit 65536 multi-byte characters
  • Key length limit is 767 bytes
• Data type and/or index length changes may be required
  • Test and plan this ahead
• There may be more problems than you think
  • Detect irrecoverable problems with a simple stored function
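
The space arithmetic is easy to check ahead of time. A small Python sketch, using the limits quoted above and hypothetical helper names; the 64KB figure is treated as 65535 bytes and per-column length overhead is ignored, so the results are upper bounds.

```python
MAX_KEY_BYTES = 767      # index key length limit quoted above
MAX_VALUE_BYTES = 65535  # "up to 64KB", ignoring length-prefix overhead
BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}  # worst case per character


def max_indexable_chars(charset: str) -> int:
    """Longest index prefix, in characters, that fits the key limit."""
    return MAX_KEY_BYTES // BYTES_PER_CHAR[charset]


def fits_in_index(length: int, charset: str) -> bool:
    """Can a column of `length` characters be fully indexed?"""
    return length * BYTES_PER_CHAR[charset] <= MAX_KEY_BYTES


print(max_indexable_chars("latin1"), max_indexable_chars("utf8"))
print(fits_in_index(300, "latin1"), fits_in_index(300, "utf8"))
print(MAX_VALUE_BYTES // BYTES_PER_CHAR["utf8"])
```

A `VARCHAR(300)` column is fully indexable in latin1 but needs 900 bytes in utf8, so only a 255-character prefix fits the key.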


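-- Switch the statement delimiter so the function body can contain ';'
DELIMITER ;;

-- Returns 1 when the converted value holds exactly the bytes that the
-- original value yields once it is mapped back to latin1, 0 otherwise.
CREATE FUNCTION `cnv_test_conversion` (`value_before` LONGTEXT, `value_after` LONGTEXT)
RETURNS tinyint(1)
BEGIN
  RETURN (IFNULL(CONVERT(CONVERT(`value_before` USING latin1) USING binary), "")
          = IFNULL(CONVERT(`value_after` USING binary), ""));
END;;

DELIMITER ;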
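
For readers who want to experiment without a server, the same comparison can be approximated in Python. This is a loose analogue rather than a port: it assumes `value_before` is the text as MySQL sees it through the wrong latin1 label, that `value_after` is proper utf8 text, and that Python's `latin-1` codec is close enough to MySQL's `latin1`.

```python
from typing import Optional


def cnv_test_conversion(value_before: Optional[str], value_after: Optional[str]) -> bool:
    """True when the converted value carries the same bytes as the original."""
    before = b"" if value_before is None else value_before.encode("latin-1", errors="replace")
    after = b"" if value_after is None else value_after.encode("utf-8")
    return before == after


# Double-encoded but intact: the latin1 view maps back to the UTF-8 bytes.
print(cnv_test_conversion("KrakÃ³w", "Kraków"))
# Already damaged on write: '?' can never match the bytes of 'ń'.
print(cnv_test_conversion("Dobrza?ski", "Dobrzański"))
```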


GOTCHAs!

master [localhost] {msandbox} (fosdem) > ALTER TABLE test MODIFY a VARCHAR(300) CHARACTER SET utf8;
Query OK, 0 rows affected, 1 warning (1.23 sec)
Records: 0  Duplicates: 0  Warnings: 1

master [localhost] {msandbox} (fosdem) > SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Warning
   Code: 1071
Message: Specified key was too long; max key length is 767 bytes
1 row in set (0.00 sec)

master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
       Table: test
Create Table: CREATE TABLE `test` (
  `a` varchar(300) DEFAULT NULL,
  `b` varchar(10) DEFAULT NULL,
  KEY `a` (`a`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

How to do it right?

• Set character-set-server during initial configuration

• When creating new schemas, always specify the desired charset
  • CREATE SCHEMA fosdem DEFAULT CHARSET = utf8
  • ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
• When creating new tables, also explicitly specify the charset
  • CREATE TABLE people (…) DEFAULT CHARSET = utf8
• And don’t forget to configure applications too
  • You can try to force charset on the clients
    • init-connect = "SET NAMES utf8"
    • It might also break applications that don’t want to talk to MySQL using utf8


Oh, and one more thing…


• We are sharing WebScaleSQL packages with the MySQL Community!

• Check out http://www.psce.com/blog for details

• Follow @dbasquare to receive updates


WebScaleSQL

What is WebScaleSQL?

WebScaleSQL is a collaboration among engineers from several companies, such as Facebook, Twitter, Google or LinkedIn, that face the same challenges in deploying MySQL at scale, and seek greater performance from a database technology tailored for their needs.