MySQL Character encoding – part 2

In MySQL Character encoding – part 1 we stated that the myriad of ways in which character encoding can be controlled can lead to many situations where your data may not be available as expected.
UTF8 was designed on a placemat in a New Jersey diner one night in September or so 1992.
Setting MySQL Client and Server Character encoding.
Lets restart MySQL with the correct setting for our purpose, UTF8. Here we can see the setting in the MySQL configuration file, in this case /etc/mysql/my.cnf.
character-set-server = utf8
This change is then reflected in the session and global variables once the instance is restarted with the new configuration parameter.
mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+——————————-+——————————–+
| @@global.character_set_server | @@session.character_set_client | +——————————-+——————————–+
| utf8 | utf8 | +——————————-+——————————–+ 1 row in set (0.00 sec)
Now we have verified the server and client are set to use UTF8, we can go ahead, continue developing our application and create a new table people.
mysql> CREATE TABLE people (first_name VARCHAR(30) NOT NULL, last_name VARCHAR(30) NOT NULL); Query OK, 0 rows affected (0.13 sec)
Now let’s enter some data into the new table, which was created with the server and client configured for UTF8.
Something appears to have gone terribly wrong, the accent in Maciek’s surname now appears as a question mark.
mysql> SELECT @@session.character_set_server, @@session.character_set_client; +——————————–+——————————–+
| @@session.character_set_server | @@session.character_set_client | +——————————–+——————————–+
| utf8 | utf8 | +——————————–+——————————–+ 1 row in set (0.00 sec)
The database settings are still UTF8, this should have worked.
mysql> USE fosdem;
mysql> SHOW CREATE TABLE people\G
*************************** 1. row ***************************
Table: people
Create Table: CREATE TABLE `people` (
`first_name` varchar(30) NOT NULL,
`last_name` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 1 row in set (0.00 sec)
Looking at the table, we see that despite being created under a server set to use UTF8, it appears to be set to use latin1.
How can this be?, Let’s look at the session settings.
mysql> SHOW SESSION VARIABLES LIKE ‘character_set_%’; +————————–+—————————-+ | Variable_name | Value | +————————–+—————————-+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+————————–+—————————-+ 8 rows in set (0.00 sec)
We can see the server and client values are as expected, but database is set to something else.
mysql> SHOW CREATE DATABASE fosdem\G *************************** 1. row ***************************
Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */ 1 row in set (0.00 sec)
Since the database was created when the server was set to latin1 it inherited that charset setting, which persists even when the server setting changes.
Can we fix this?
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT last_name, HEX(last_name) FROM people; +————+———————-+
| last_name | HEX(last_name) | +————+———————-+
| Lemon | 4C656D6F6E | | Müller | 4DFC6C6C6572 | | Dobrza?ski | 446F62727A613F736B69 | +————+———————-+ 3 rows in set (0.00 sec)
mysql> SET NAMES latin2;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT last_name, HEX(last_name) FROM people; +————+———————-+
| last_name | HEX(last_name) | +————+———————-+
| Lemon | 4C656D6F6E | | Müller | 4DFC6C6C6572 | | Dobrza?ski | 446F62727A613F736B69 | +————+———————-+ 3 rows in set (0.00 sec)
Unfortunately, no matter how I try to read the data, 0x3F is ‘?’, so the ‘ń’ has been lost forever. Therefore it may not be enough to reconfigure the server, as a mismatch between client and server can permanently break data, due to the implicit conversion inside the MySQL server.
Implicit conversions happen silently when characters of one character set are inserted into a column with a different character set. This behaviour can be controlled by SQL_MODE, which allows you force MySQL to raise an error instead.
In MySQL Character encoding – part 1 we established there were a number of places you can control the character settings, now we can add a couple of important observations to our view of Character encoding settings.
Session settings
character_set_server
character_set_client
character_set_connection
character_set_database
character_set_result
Schema level Defaults – Affects new tables
Table level Defaults – Affects new columns
Column charsets
We have seen how a table created with no explicit charset declaration inherits the database (schema) charset, but what happens to a column when the table charset is changed?.
mysql> USE fosdem;
mysql> CREATE TABLE test (a VARCHAR(300), INDEX (a)); Query OK, 0 rows affected (0.62 sec)
mysql> SHOW CREATE TABLE test\G *************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 1 row in set (0.00 sec)
mysql> ALTER TABLE test DEFAULT CHARSET = utf8; Query OK, 0 rows affected (0.08 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> SHOW CREATE TABLE test\G *************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 1 row in set (0.00 sec)
The columns in a table inherit their default charset value when the table is created, but do not change when the table is changed, however new columns added after the ALTER TABLE would inherit UTF8.
mysql> ALTER TABLE test ADD b VARCHAR(10); Query OK, 0 rows affected (0.74 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> SHOW CREATE TABLE test\G *************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) CHARACTER SET latin1 DEFAULT NULL, `b` varchar(10) DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 1 row in set (0.00 sec
What can you do if you detect inconsistencies in your MySQL Character encoding settings
First of all, keep calm and don’t start by changing something. Analyse the situation and make sure you understand what settings you have and what your application understands regarding reading and writing data from the database.
Once you detect a problem, try to assess the extent of the damage. Firstly, what is the scope of the damage and is it consistent. Are all the rows bad or is it just a subset such as the last days worth of inserts. Are all the bad rows broken in the same way or are there actually a mixture of problems affected different sets of rows. Are the rows actually repairable – could be that recovering from backup and rolling forward is necessary as the inserted data has already been destroyed. Has any character mapping occurred during writes (e.g. unicode over latin1/latin1) – all of this is necessary to get a good picture of where you are starting from.
Take care not to do not do any of the following:
Try to fix this table by table unless you really only have a single table. – Any fix will affect the application and database as a whole, therefore fixing a single table may lead to inconsistencies and further problems elsewhere.
ALTER TABLE … DEFAULT CHARSET = as it only changes the default character set for new columns.
ALTER TABLE … CONVERT TO CHARACTER SET … It’s not for fixing broken encoding.
ALTER TABLE … MODIFY col_name … CHARACTER SET …
What needs to be fixed?
Schema default character set
ALTER SCHEMA fosdem DEFAULT CHARSET = utf8;
Tables with text columns: CHAR, VARCHAR, TEXT, TINYTEXT, LONGTEXT
What about ENUM?
The information schema can provide a list of candidate tables.
SELECT CONCAT(c.table_schema, ‘.’, c.table_name) AS candidate_table FROM information_schema.columns c
WHERE c.table_schema = ‘fosdem’
AND c.column_type REGEXP ‘^(.*CHAR|.*TEXT|ENUM)(\(.+\))?$’ GROUP BY candidate_table;
You must also ensure the database and application configuration is correct also, to avoid having the newly fixed tables broken by new data being introduced incorrectly (for the settings) into the tables.
How do I fix this?
Option 1. Dump and restore (Requires downtime)
Dump the data preserving the bad configuration and drop the old database
# mysqldump -u root -p –skip-set-charset –default-character-set=latin1 fosdem > fosdem.sql
mysql> DROP SCHEMA fosdem;
Correct table definitions in the dump file by editing DEFAULT CHARSET in all CREATE TABLE statements, then create the database again and import the data.
mysql> CREATE SCHEMA fosdem DEFAULT CHARSET utf8;
# mysql -u root -p –default-character-set=utf8 fosdem < fosdem.sql
Option 2. Two step conversion (Requires downtime)
Perform a two step conversion with ALTER TABLE, converting the original encoding to VARBINARY/BLOB and then from there to the target encoding. Conversion from/to BINARY/BLOB removes character set context.
Stop applications
On each table, for each text column perform:
ALTER TABLE tbl MODIFY col_name VARBINARY(255);
ALTER TABLE tbl MODIFY col_name VARCHAR(255) CHARACTER SET utf8;
You may specify multiple columns per ALTER TABLE
Fix the problems (application and/or db configs)
Restart applications
Option 3. – Online character set fix; (Minimal downtime, Approximately 1 min)
Using pt-online-schema-change with the PSCE plugin and a small patch for pt-online-schema-change, you can convert columns online in the live database.
Start pt-online-schema-change on all tables – one by one with table rotation disabled (–no-swap-tables) or drop pt-online-schema-change triggers
Wait until all tables have been converted
Stop applications
Fix the problems (application and/or db configs)
Rotate the tables – should take a minute or so
Restart applications
Currently the patch to pt-online-schema-change and plugin are available on bitbucket.
In MySQL Character encoding part 3 we will cover the gotchas in the process of fixing broken encoding, and what best practise to follow to get it right each time you setup a new server or create a new database.
via Planet MySQL
MySQL Character encoding – part 2