Preparing your MySQL schema for a character set change

Recently I blogged about identifying necessary index changes when changing the character set for a MySQL database. In this follow-up post I want to talk about general schema changes that you should consider when changing your character set from latin1 to utf8, from utf8 to utf8mb4, etc. The main issues are long VARCHAR columns and columns with length limits that are enforced in bytes rather than characters.
The manual points out the main areas where you may have trouble:
A TINYTEXT column can hold up to 255 bytes, so it can hold up to 85 3-byte or 63 4-byte characters. Suppose that you have a TINYTEXT column that uses utf8 but must be able to contain more than 63 characters. You cannot convert it to utf8mb4 unless you also change the data type to a longer type such as TEXT.
Similarly, a very long VARCHAR column may need to be changed to one of the longer TEXT types if you want to convert it from utf8 to utf8mb4.
The point on TINYTEXT is pretty clear. Frankly I’m not sure why anyone uses TINYTEXT to begin with, but if you have TINYTEXT columns in your database and you are changing your character set, this might be a great opportunity to change those columns to a type that supports longer values, such as TEXT, MEDIUMTEXT, or LONGTEXT.
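An individual column can be widened in place with ALTER TABLE. Here’s a minimal sketch, assuming a hypothetical table named products with a TINYTEXT column named description:

```sql
-- Hypothetical table and column names; pick the TEXT type that fits your data.
alter table products modify column description text;
```

Keep in mind that changing a column’s data type typically forces a full table rebuild, so on large tables it’s worth combining this with the character set change itself.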
Long VARCHAR columns can be a problem because MySQL has a row length limit of 65,535 bytes, and one really long VARCHAR or a few relatively long VARCHARs in a multi-byte character set can push your rows past that limit. For example you can have a VARCHAR(20000) in utf8 because it takes up a maximum of 60,000 bytes, but you can’t have a VARCHAR(20000) in utf8mb4 because that requires up to 80,000 bytes of storage.
Here’s my simple recommendation for preparing a MySQL schema for a character set change:
- Change all TINYTEXT columns to TEXT, MEDIUMTEXT, or LONGTEXT
- Change all long VARCHAR columns to TEXT, MEDIUMTEXT, or LONGTEXT
Read on for sample queries that generate the necessary DDL to modify the relevant columns, and examples that illustrate the problem.
Generating DDL
The following information_schema query generates ALTER TABLE statements to change all TINYTEXT columns in your database to TEXT:
```
-- change all tinytext columns to text
select concat('alter table ',t.table_schema,
    '.',
    t.table_name,
    ' modify column ',
    c.column_name,
    ' ',
    'text',
    ' character set ',
    character_set_name,
    ' collate ',
    collation_name,
    ' ',
    case when c.is_nullable = 'yes' then 'null' else 'not null' end,
    case when c.column_default is not null and c.column_default != '' then concat(' default ''',c.column_default,'''') else '' end,
    ';') as ddl
from information_schema.columns c
inner join information_schema.tables t on t.table_schema = c.table_schema and t.table_name = c.table_name
where c.column_type = 'tinytext'
and t.table_type = 'base table'
order by t.table_name asc;
```
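For a hypothetical database named app with a TINYTEXT column named notes on a table named customers, the query would emit DDL along these lines:

```sql
alter table app.customers modify column notes text character set latin1 collate latin1_swedish_ci null;
```

The query only prints the statements; review them before running anything.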
The following information_schema query generates ALTER TABLE statements to change all long VARCHAR columns in your database to LONGTEXT. I’m using an arbitrary cutoff of 2,000 characters to define long VARCHAR columns; you can tweak that number for your particular use case. The choice of LONGTEXT is also somewhat arbitrary, so feel free to change it to TEXT or MEDIUMTEXT if that works better for you.
```
-- change all varchar columns 2000 characters or longer to longtext
select concat('alter table ',t.table_schema,
    '.',
    t.table_name,
    ' modify column ',
    c.column_name,
    ' ',
    'longtext',
    ' character set ',
    character_set_name,
    ' collate ',
    collation_name,
    ' ',
    case when c.is_nullable = 'yes' then 'null' else 'not null' end,
    case when c.column_default is not null and c.column_default != '' then concat(' default ''',c.column_default,'''') else '' end,
    ';') as ddl
from information_schema.columns c
inner join information_schema.tables t on t.table_schema = c.table_schema and t.table_name = c.table_name
where c.data_type = 'varchar'
and c.character_maximum_length >= 2000
and t.table_type = 'base table'
order by t.table_name asc, c.column_name asc;
```
Those queries will generate multiple ALTER TABLE statements if more than one column needs to be modified in a given table. In order to avoid gratuitous rebuilds of large tables you may want to combine the multiple statements into a single ALTER TABLE for each relevant table.
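For example, if the VARCHAR query flagged two columns on the same (hypothetical) table, the two generated statements could be merged by hand so the table is rebuilt only once:

```sql
-- Hypothetical table and column names; one ALTER, one table rebuild.
alter table app.articles
    modify column body longtext character set utf8 collate utf8_general_ci null,
    modify column summary longtext character set utf8 collate utf8_general_ci null;
```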
Examples
Since TINYTEXT columns can only store 255 bytes, if I try to insert more than floor(255/N) N-byte characters the value is either silently truncated or the insert fails, depending on sql_mode.
Here’s an example:
“`
mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> create table if not exists tinytext_test (
-> id int auto_increment primary key,
-> str tinytext
-> ) engine = InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.04 sec)
mysql> -- insert 255 one-byte characters
mysql> insert into tinytext_test (str) values (repeat('$',255));
Query OK, 1 row affected (0.00 sec)
mysql> -- change table and column to utf8
mysql> alter table tinytext_test DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> alter table tinytext_test modify column str tinytext CHARACTER SET utf8;
Query OK, 1 row affected (0.06 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> -- insert 255 two-byte characters
mysql> insert into tinytext_test (str) values (repeat('å',255));
Query OK, 1 row affected, 1 warning (0.00 sec)
Warning (Code 1366): Incorrect string value: '\xC3\xA5\xC3\xA5\xC3\xA5...' for column 'str' at row 1
mysql> -- insert 255 three-byte characters
mysql> insert into tinytext_test (str) values (repeat('€',255));
Query OK, 1 row affected, 1 warning (0.00 sec)
Warning (Code 1265): Data truncated for column 'str' at row 1
“`
Here’s a query to show how many characters are stored for each row. As expected, we get floor(255/N) characters in each row:
```
mysql> select left(str,1),length(str),char_length(str) from tinytext_test;
+-------------+-------------+------------------+
| left(str,1) | length(str) | char_length(str) |
+-------------+-------------+------------------+
| $           |         255 |              255 |
| å           |         254 |              127 |
| €           |         255 |               85 |
+-------------+-------------+------------------+
3 rows in set (0.00 sec)
```
If I enable strict sql_mode and try to insert 255 multi-byte characters then the INSERT will fail with an error:
“`
mysql> set session sql_mode = 'STRICT_ALL_TABLES';
Query OK, 0 rows affected (0.00 sec)
mysql> -- insert 255 three-byte characters
mysql> insert into tinytext_test (str) values (repeat('€',255));
ERROR 1406 (22001): Data too long for column 'str' at row 1
“`
For long VARCHAR columns, if I try to change the character set of a column in a way that would violate the 65,535-byte row length limit, MySQL will either silently convert the column to a TEXT data type or the ALTER TABLE statement will fail.
Here’s an example where a single very long VARCHAR column is converted from VARCHAR to MEDIUMTEXT:
“`
mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> create table if not exists single_varchar_test (
-> id int auto_increment primary key,
-> str varchar(50000)
-> ) engine = InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.05 sec)
mysql> alter table single_varchar_test DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> alter table single_varchar_test modify column str varchar(50000) CHARACTER SET utf8;
Query OK, 0 rows affected, 1 warning (0.09 sec)
Records: 0 Duplicates: 0 Warnings: 1
Note (Code 1246): Converting column 'str' from VARCHAR to TEXT
mysql> desc single_varchar_test;
+-------+------------+------+-----+---------+----------------+
| Field | Type       | Null | Key | Default | Extra          |
+-------+------------+------+-----+---------+----------------+
| id    | int(11)    | NO   | PRI | NULL    | auto_increment |
| str   | mediumtext | YES  |     | NULL    |                |
+-------+------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)
“`
Here’s an example with several long VARCHAR columns where the ALTER TABLE statement that would push the table past the 65,535 byte row length limit actually fails:
“`
mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)
mysql> create table if not exists multi_varchar_test (
-> id int auto_increment primary key,
-> str1 varchar(5000),
-> str2 varchar(5000),
-> str3 varchar(5000),
-> str4 varchar(5000)
-> ) engine = InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.04 sec)
mysql> alter table multi_varchar_test DEFAULT CHARSET=utf8mb4;
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> alter table multi_varchar_test modify column str1 varchar(5000) CHARACTER SET utf8mb4;
Query OK, 0 rows affected (0.10 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> alter table multi_varchar_test modify column str2 varchar(5000) CHARACTER SET utf8mb4;
ERROR 1118 (42000): Row size too large. The maximum row size for the used table type, not counting BLOBs, is 65535. This includes storage overhead, check the manual. You have to change some columns to TEXT or BLOBs
“`
I don’t know why MySQL does the implicit data type change in some cases and throws an error in others, but the last two examples show that both behaviors exist.