Character encoding
Breaking and unbreaking your data
Maciej [email protected] | @mushupl
Brussels, 1 Feb 2015
01.02.2015 Follow us on Twitter @dbasquare www.psce.com
Character Encoding
• Binary representation of glyphs
• Each character can be represented by 1 or more bytes
• Popular schemes
  • ASCII
  • Unicode
    • UTF-8, UTF-16, UTF-32
  • Language specific character sets
    • US (Latin US)
    • Europe (Latin 1, Latin 2)
    • Asia (EUC-KR, GB18030)
Character Encoding
• Character set defines the visual interpretation of binary information
• One glyph can be associated with several numeric codes
• One numeric code may be used to represent several different glyphs
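The two bullets above can be checked byte by byte. This is a minimal Python sketch, not MySQL itself; the codec names `iso8859_1` and `iso8859_2` stand in for Latin 1 and Latin 2 and are an assumption of this example.

```python
# One glyph, several numeric codes: the same character in three encodings.
glyph = "ń"
codes = {
    "utf-8": glyph.encode("utf-8"),
    "iso8859_2": glyph.encode("iso8859_2"),   # Latin 2
    "utf-16-be": glyph.encode("utf-16-be"),
}

# One numeric code, several glyphs: the same byte under two character sets.
byte = b"\xf1"
glyphs = {
    "iso8859_1": byte.decode("iso8859_1"),    # Latin 1
    "iso8859_2": byte.decode("iso8859_2"),    # Latin 2
}

for name, raw in codes.items():
    print(f"{glyph!r} in {name}: {raw.hex()}")
for name, text in glyphs.items():
    print(f"0xf1 in {name}: {text!r}")
```

Without knowing which character set applies, the byte 0xF1 alone does not say whether the author meant 'ñ' or 'ń'.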
Please state the nature of the emergency
• Application configuration
• Database configuration
• Table/column definitions
Problem #1: We are all born Swedish
• MySQL uses latin1 by default
  • MySQL 5.7 too
• Is anyone actually aware of that?
• Why Swedish?
  • latin1_swedish_ci is the default collation
Problem #1
• Let’s build an application
mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1                        | latin1                         |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)

mysql> CREATE SCHEMA fosdem;
Query OK, 1 row affected (0.00 sec)

mysql> USE fosdem;

mysql> CREATE TABLE locations (city VARCHAR(30) NOT NULL);
Query OK, 0 rows affected (0.15 sec)

mysql> SHOW CREATE TABLE locations\G
*************************** 1. row ***************************
       Table: locations
Create Table: CREATE TABLE `locations` (
  `city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Problem #1
• Everything is correct… NOT!
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from locations;
+--------------------+
| city               |
+--------------------+
| Berlin             |
| Kraków             |
| �京都              |
+--------------------+
3 rows in set (0.00 sec)

mysql> SET NAMES latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from locations;
+-----------+
| city      |
+-----------+
| Berlin    |
| Kraków    |
| 東京都    |
+-----------+
3 rows in set (0.00 sec)
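The kind of mismatch this slide demonstrates can be reproduced without a server. This is a hedged Python sketch under two assumptions: the application sent UTF-8 bytes over a connection declared as latin1, and Python's `latin-1` codec stands in for MySQL's latin1 (which is closer to cp1252).

```python
original = "Kraków"

# The application sends UTF-8 bytes, but the connection is declared latin1,
# so the server stores the bytes unchanged in the latin1 column.
stored = original.encode("utf-8")

# SET NAMES utf8: the server converts the "latin1" text to UTF-8, treating
# every stored byte as one latin1 character. The result is double-encoded.
seen_with_utf8 = stored.decode("latin-1")

# SET NAMES latin1: the bytes pass through untouched and a UTF-8 terminal
# renders them correctly, which hides the problem.
seen_with_latin1 = stored.decode("utf-8")

# Reversing the wrong interpretation recovers the text, because no byte
# was changed on the way in.
repaired = seen_with_utf8.encode("latin-1").decode("utf-8")

print(seen_with_utf8)
print(seen_with_latin1)
print(repaired)
```

The data looks right only as long as every reader repeats the same wrong setting as the writer.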
Problem #1
• Let’s fix this
  • Or can we ignore it?
    • Ruby may not like it
# grep character-set-server /etc/mysql/my.cnf
character-set-server = utf8

mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| utf8                          | utf8                           |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)

...we are fixing our tables here...

mysql> SHOW CREATE TABLE locations\G
*************************** 1. row ***************************
       Table: locations
Create Table: CREATE TABLE `locations` (
  `city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
Problem #1: The good news
• It’s usually fixable
Problem #2: Settings, defaults, inheritance
• Where do you set character sets in MySQL?
  • Session settings
    • character_set_server
    • character_set_client
    • character_set_connection
    • character_set_database
    • character_set_results
  • Schema level defaults
  • Table level defaults
  • Column charsets
Problem #2
• Having fixed our problem #1, we continue to develop our application
mysql> SELECT @@session.character_set_server, @@session.character_set_client;
+--------------------------------+--------------------------------+
| @@session.character_set_server | @@session.character_set_client |
+--------------------------------+--------------------------------+
| utf8                           | utf8                           |
+--------------------------------+--------------------------------+
1 row in set (0.00 sec)

mysql> USE fosdem;

mysql> CREATE TABLE people (first_name VARCHAR(30) NOT NULL, last_name VARCHAR(30) NOT NULL);
Query OK, 0 rows affected (0.13 sec)
Problem #2
• Why is the table character set latin1?
mysql> SELECT @@session.character_set_server, @@session.character_set_client;
+--------------------------------+--------------------------------+
| @@session.character_set_server | @@session.character_set_client |
+--------------------------------+--------------------------------+
| utf8                           | utf8                           |
+--------------------------------+--------------------------------+
1 row in set (0.00 sec)

mysql> USE fosdem;

mysql> SHOW CREATE TABLE people\G
*************************** 1. row ***************************
       Table: people
Create Table: CREATE TABLE `people` (
  `first_name` varchar(30) NOT NULL,
  `last_name` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Problem #2
• What’s all this, then?
mysql> SHOW SESSION VARIABLES LIKE 'character_set_%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> SHOW CREATE DATABASE fosdem\G
*************************** 1. row ***************************
       Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)
Problem #2
• Can we fix this?
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT last_name, HEX(last_name) FROM people;
+------------+----------------------+
| last_name  | HEX(last_name)       |
+------------+----------------------+
| Lemon      | 4C656D6F6E           |
| Müller     | 4DFC6C6C6572         |
| Dobrza?ski | 446F62727A613F736B69 |
+------------+----------------------+
3 rows in set (0.00 sec)

mysql> SET NAMES latin2;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT last_name, HEX(last_name) FROM people;
+------------+----------------------+
| last_name  | HEX(last_name)       |
+------------+----------------------+
| Lemon      | 4C656D6F6E           |
| Müller     | 4DFC6C6C6572         |
| Dobrza?ski | 446F62727A613F736B69 |
+------------+----------------------+
3 rows in set (0.00 sec)

• We can’t! :-(
  • 0x3F is '?', so my 'ń' was lost
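The HEX values above can be reproduced with a short Python sketch. It assumes the server's implicit conversion behaves like encoding to latin1 with unmappable characters replaced by '?', and uses Python's `latin-1` codec as an approximation of MySQL's latin1.

```python
name = "Dobrzański"

# A utf8 client writes into a latin1 column: the server converts the text,
# and 'ń' has no latin1 code point, so it is replaced by '?' (0x3F).
stored = name.encode("latin-1", errors="replace")
print(stored.hex().upper())

# 'ü' does exist in latin1, so "Müller" survives the same conversion.
survivor = "Müller".encode("latin-1")
print(survivor.hex().upper())

# No choice of character set on the reading side brings 'ń' back:
# the byte that was written really is a question mark.
as_latin1 = stored.decode("latin-1")
as_latin2 = stored.decode("iso8859_2")
print(as_latin1, as_latin2)
```

Unlike the double-encoding case, information was destroyed at write time, so only a backup or the original source can restore it.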
Problem #2: The bad news
• It may not be enough to configure the server correctly
• A mismatch between client and server can permanently break data
  • Implicit conversion inside MySQL server
Problem #2: Settings, defaults, inheritance
• Where do you set character sets in MySQL?
  • Session settings
    • character_set_server
    • character_set_client
    • character_set_connection
    • character_set_database
    • character_set_results
  • Schema level defaults – affect new tables
  • Table level defaults – affect new columns
  • Column charsets
Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} ((none)) > SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1                        | utf8                           |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)

master [localhost] {msandbox} ((none)) > CREATE SCHEMA fosdem\G
Query OK, 1 row affected (0.00 sec)

master [localhost] {msandbox} ((none)) > SHOW CREATE SCHEMA fosdem\G
*************************** 1. row ***************************
       Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)
Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} ((none)) > USE fosdem;
Database changed
master [localhost] {msandbox} (fosdem) > CREATE TABLE test (a VARCHAR(300), INDEX (a));
Query OK, 0 rows affected (0.62 sec)

master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
       Table: test
Create Table: CREATE TABLE `test` (
  `a` varchar(300) DEFAULT NULL,
  KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} (fosdem) > ALTER TABLE test DEFAULT CHARSET = utf8;
Query OK, 0 rows affected (0.08 sec)
Records: 0  Duplicates: 0  Warnings: 0

master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
       Table: test
Create Table: CREATE TABLE `test` (
  `a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
  KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} (fosdem) > ALTER TABLE test ADD b VARCHAR(10);
Query OK, 0 rows affected (0.74 sec)
Records: 0  Duplicates: 0  Warnings: 0

master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
       Table: test
Create Table: CREATE TABLE `test` (
  `a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
  `b` varchar(10) DEFAULT NULL,
  KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
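The inheritance shown in the last few listings can be summarised as a toy model. This Python sketch is an illustration only: the `Schema` and `Table` classes are hypothetical, and the sole rule they encode is that a default is copied when an object is created and never pushed down afterwards.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Table:
    default_charset: str
    columns: Dict[str, str] = field(default_factory=dict)

    def add_column(self, name: str, charset: Optional[str] = None) -> None:
        # A new column copies the table default at creation time.
        self.columns[name] = charset or self.default_charset


@dataclass
class Schema:
    default_charset: str
    tables: Dict[str, Table] = field(default_factory=dict)

    def create_table(self, name: str, charset: Optional[str] = None) -> Table:
        # A new table copies the schema default at creation time.
        table = Table(charset or self.default_charset)
        self.tables[name] = table
        return table


server_charset = "latin1"
fosdem = Schema(server_charset)        # CREATE SCHEMA fosdem
test = fosdem.create_table("test")     # CREATE TABLE test (...)
test.add_column("a")                   # column a inherits latin1
test.default_charset = "utf8"          # ALTER TABLE test DEFAULT CHARSET = utf8
test.add_column("b")                   # ALTER TABLE test ADD b VARCHAR(10)

print(test.columns)
```

Changing a default only affects objects created later, which is why column `a` stays latin1 while column `b` becomes utf8.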
I f**ckd up. What do I do?
• Let’s start with what you shouldn’t do
• Keep calm and don’t start by changing something
  • Analyze the situation
  • Why did the problem occur in the first place?
• Reassess the damage
  • Is it consistent?
  • Are all rows broken in the same way?
  • Are some rows bad, but others are okay?
  • Are all bad in several different ways?
• Is it actually repairable?
  • Only if no character mapping occurred during writes (e.g. Unicode written over a latin1 connection into latin1 columns)
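One way to approach the consistency questions above is to classify a sample of values read through a utf8 connection. This is a rough Python heuristic, not a tool from the talk: the `classify` helper is hypothetical, it only recognises UTF-8 that was stored as latin1, and it cannot detect characters already replaced by '?'.

```python
from collections import Counter


def classify(value: str) -> str:
    """Label a value as 'ok' or 'double-encoded' (UTF-8 bytes read as latin1)."""
    try:
        repaired = value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The characters are not UTF-8 bytes in disguise: treat as stored correctly.
        return "ok"
    return "ok" if repaired == value else "double-encoded"


# Sample values as a utf8 client might see them (made up for illustration).
rows = ["Berlin", "Krak\u00c3\u00b3w", "Kraków"]
summary = Counter(classify(row) for row in rows)
print(summary)
```

A summary with more than one label for the same column means the rows are not all broken in the same way, so a single blanket conversion would damage the healthy ones.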
I f**ckd up. What else shouldn’t I do, then?
• Do not rush things as you may easily go from bad to worse
• Do not start fixing this on a replication slave
• You can’t fix this by fixing tables one by one on a live database
  • Unless you really have everything in one table
• Do not use: ALTER TABLE … DEFAULT CHARSET = …
  • It only changes the default character set for new columns
• Do not use: ALTER TABLE … CONVERT TO CHARACTER SET …
  • It’s not for fixing broken encoding
• Do not use: ALTER TABLE … MODIFY col_name … CHARACTER SET …
I f**ckd up. So how do I fix it?
• What needs to be fixed?
  • Schema default character set
    • ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
  • Tables with text columns: CHAR, VARCHAR, TEXT, TINYTEXT, LONGTEXT
    • What about ENUM?
    • Use INFORMATION_SCHEMA to grab a list
  • What about other tables?
    • They too (eventually), but it’s not critical
SELECT CONCAT(c.table_schema, '.', c.table_name) AS candidate_table
FROM information_schema.columns c
WHERE c.table_schema = 'fosdem'
AND c.column_type REGEXP '^(.*CHAR|.*TEXT|ENUM)(\(.+\))?$'
GROUP BY candidate_table;
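To see which column types the pattern in the query is meant to catch, it can be tried against a few sample `column_type` values. This is a Python approximation: Python's `re` module is not MySQL's regex engine, the case-insensitive flag is an assumption that mirrors MySQL's behaviour for non-binary strings, and the sample types are made up.

```python
import re

# Same pattern as in the query, compiled case-insensitively.
PATTERN = re.compile(r"^(.*CHAR|.*TEXT|ENUM)(\(.+\))?$", re.IGNORECASE)

column_types = [
    "varchar(30)",
    "char(3)",
    "longtext",
    "enum('a','b')",
    "int(11)",
    "varbinary(255)",
    "blob",
]

# Keep only the types that carry a character set and need converting.
candidates = [t for t in column_types if PATTERN.match(t)]
print(candidates)
```

Binary and numeric types fall through, which is the intent: they carry no character set to fix.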
I f**ckd up. So how do I fix it?
• Option 1 – requires downtime
• Dump and restore
  • Dump the data preserving the bad configuration and drop the old database
bash# mysqldump -u root -p --skip-set-charset --default-character-set=latin1 fosdem > fosdem.sql
mysql> DROP SCHEMA fosdem;
  • Correct table definitions in the dump file
    • Edit DEFAULT CHARSET in all CREATE TABLE statements
  • Create the database again and import the data back
mysql> CREATE SCHEMA fosdem DEFAULT CHARSET utf8;
bash# mysql -u root -p --default-character-set=utf8 fosdem < fosdem.sql
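The "correct table definitions in the dump file" step could be scripted rather than done by hand. This is a hedged Python sketch: the `retarget_dump` helper is hypothetical, it assumes mysqldump's usual layout (one column per line, a closing line starting with `)`), and it also rewrites column-level `CHARACTER SET` clauses, which the slide does not mention. Review the result before importing.

```python
import re


def retarget_dump(dump_sql: str, old: str = "latin1", new: str = "utf8") -> str:
    """Rewrite charset declarations inside CREATE TABLE statements only."""
    table_tail = re.compile(rf"^(\).*\bDEFAULT CHARSET=){old}\b")
    column_cs = re.compile(rf"\bCHARACTER SET {old}\b")
    out = []
    in_create = False
    for line in dump_sql.splitlines():
        if line.startswith("CREATE TABLE"):
            in_create = True
        elif in_create:
            if line.startswith(")"):
                # Closing line of the statement carries the table default.
                line = table_tail.sub(rf"\g<1>{new}", line)
                in_create = False
            else:
                # Column definitions may pin their own character set.
                line = column_cs.sub(f"CHARACTER SET {new}", line)
        out.append(line)
    return "\n".join(out)


sample = """CREATE TABLE `test` (
  `a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
  KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `test` VALUES ('DEFAULT CHARSET=latin1');"""

fixed = retarget_dump(sample)
print(fixed)
```

Row data outside CREATE TABLE is left untouched, so a value that happens to contain the text `DEFAULT CHARSET=latin1` is not rewritten.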
I f**ckd up. So how do I fix it?
• Option 2 – requires downtime
• Perform a two step conversion with ALTER TABLE
  • Original encoding -> VARBINARY/BLOB -> Target encoding
  • Conversion from/to BINARY/BLOB removes character set context
• How?
  • Stop applications
  • On each table, for each text column perform:
ALTER TABLE tbl MODIFY col_name VARBINARY(255);
ALTER TABLE tbl MODIFY col_name VARCHAR(255) CHARACTER SET utf8;
  • You may specify multiple columns per ALTER TABLE
  • Fix the problems (application and/or db configs)
  • Restart applications
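With many tables, the pair of statements above can be generated rather than typed. This is a minimal Python sketch with a hypothetical `two_step_alters` helper. It assumes every column is a VARCHAR of known length and ignores attributes such as NOT NULL or DEFAULT, which a real migration would have to carry over.

```python
from typing import List, Tuple


def two_step_alters(table: str, columns: List[Tuple[str, int]],
                    target: str = "utf8") -> List[str]:
    """Build the binary round-trip for several columns in two ALTER TABLEs."""
    # Step 1: drop the character set context by going through VARBINARY.
    to_binary = ", ".join(
        f"MODIFY `{name}` VARBINARY({length})" for name, length in columns
    )
    # Step 2: relabel the untouched bytes with the target character set.
    to_target = ", ".join(
        f"MODIFY `{name}` VARCHAR({length}) CHARACTER SET {target}"
        for name, length in columns
    )
    return [
        f"ALTER TABLE `{table}` {to_binary};",
        f"ALTER TABLE `{table}` {to_target};",
    ]


statements = two_step_alters("people", [("first_name", 30), ("last_name", 30)])
for statement in statements:
    print(statement)
```

Grouping all columns of a table into one statement per step means the table is rebuilt twice instead of twice per column.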
I f**ckd up. So how do I fix it?
• Option 3 – online character set fix; no downtime*
• Thanks to our plugin for pt-online-schema-change
  • and a tiny patch for pt-online-schema-change that goes with the plugin
• How?
  • Start pt-online-schema-change on all tables – one by one
    • Do not rotate tables (--no-swap-tables) or drop pt-osc triggers
  • Wait until all tables have been converted
  • Stop applications
  • Fix the problems (application and/or db configs)
  • Rotate tables – takes just 1 minute
  • Restart applications
  • Et voilà
GOTCHAs!
• Data space requirements may change during conversion
  • Latin1 uses 1 byte per character, utf8 will need to assume 3 bytes
  • VARCHAR/TEXT fit up to 64KB – it won’t fit 65536 multi-byte characters
  • Key length limit is 767 bytes
• Data type and/or index length changes may be required
  • Test and plan this ahead
• There may be more problems than you think
  • Detect irrecoverable problems with a simple stored function
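-- Returns 1 when the latin1 bytes of value_before equal the raw bytes of value_after,
-- i.e. the conversion preserved the stored data.
CREATE FUNCTION `cnv_test_conversion` (`value_before` LONGTEXT, `value_after` LONGTEXT) RETURNS tinyint(1)
BEGIN
  RETURN (IFNULL(CONVERT(CONVERT(`value_before` USING latin1) USING binary), "") = IFNULL(CONVERT(`value_after` USING binary), ""));
END;;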
GOTCHAs!
master [localhost] {msandbox} (fosdem) > ALTER TABLE test MODIFY a VARCHAR(300) CHARACTER SET utf8;
Query OK, 0 rows affected, 1 warning (1.23 sec)
Records: 0  Duplicates: 0  Warnings: 1

master [localhost] {msandbox} (fosdem) > SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Warning
   Code: 1071
Message: Specified key was too long; max key length is 767 bytes
1 row in set (0.00 sec)

master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
       Table: test
Create Table: CREATE TABLE `test` (
  `a` varchar(300) DEFAULT NULL,
  `b` varchar(10) DEFAULT NULL,
  KEY `a` (`a`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
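The silent change from `KEY a (a)` to `KEY a (a(255))` follows from the byte arithmetic. This small Python sketch uses the 767-byte limit and the per-character sizes quoted on the GOTCHAs slide; the `max_index_prefix` helper is hypothetical.

```python
MAX_KEY_BYTES = 767                      # key length limit from the warning
BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}


def max_index_prefix(declared_chars: int, charset: str) -> int:
    """Longest indexable prefix, in characters, for a single column."""
    return min(declared_chars, MAX_KEY_BYTES // BYTES_PER_CHAR[charset])


latin1_prefix = max_index_prefix(300, "latin1")
utf8_prefix = max_index_prefix(300, "utf8")
print(latin1_prefix, utf8_prefix)
```

A VARCHAR(300) fits the index whole as latin1 (300 bytes), but as utf8 it would need 900 bytes, so only the first 255 characters are indexed.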
How to do it right?
• Set character-set-server during initial configuration
• When creating new schemas, always specify the desired charset
  • CREATE SCHEMA fosdem DEFAULT CHARSET = utf8
  • ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
• When creating new tables, also explicitly specify the charset
  • CREATE TABLE people (…) DEFAULT CHARSET = utf8
• And don’t forget to configure applications too
  • You can try to force charset on the clients
    • init-connect = "SET NAMES utf8"
    • It might also break applications that don’t want to talk to MySQL using utf8
• We are sharing WebScaleSQL packages with the MySQL Community!
• Check out http://www.psce.com/blog for details
• Follow @dbasquare to receive updates
WebScaleSQL
What is WebScaleSQL?
WebScaleSQL is a collaboration among engineers from several companies, such as Facebook, Twitter, Google or LinkedIn, that face the same challenges in deploying MySQL at scale, and seek greater performance from a database technology tailored for their needs.