Migrating Database Charsets to utf8mb4: A Story from the Trenches

In this blog post, we’ll look at options for migrating database charsets to utf8mb4.

Migrating charsets, in my opinion, is one of the most tedious tasks in a DBA’s life. There are so many things involved that can screw up our data, making it work is always hard. Sometimes what seems like a trivial task can become a nightmare very easily, and keeps us working for longer than expected.

I’ve recently worked on a case that challenged me with lots of tests due to some existing schema designs that made InnoDB suffer. I’ve decided to write this post to put together some definitive guide to enact charset conversion with minimal downtime and pain.

First disclosure: I can’t emphasize enough that you need to always backup your data. If something goes wrong, you can always roll things back by keeping a healthy set of backups.
Second disclosure: A backup can’t be considered a good backup until you test it, so I can’t emphasize enough that running regular backups and also performing regular restore tests is a must-to-do task for being in the safe side.
Third and last disclosure: I’m not pretending to present the best or only way to do this exercise. This is the way I consider easiest and painless to perform a charset conversion with minimal downtime.

My approach involves at least one slave for failover and logical/physical backup operations to make sure that data is loaded properly using the right charset.

In this case, we are moving from latin1 (default until MySQL 8.0.0) to utf8mb4 (new default from 8.0.1). In this post, Lefred refers to this change and some safety checks for upgrading. For our change, an important thing to consider: Latin1 charset stores one byte per character, while utf8mb4 can store up to four bytes per character. This change definitely impacts the disk usage, but also makes us hit some limits that I describe later in the plan.

So let’s put out hands in action. First, let’s create a slave using a fresh (non-locking) backup. Remember that these operations are designed to minimize downtime and reduce any potential impact on our production server.

If you already have a slave that can act as a master replacement then you can skip this section. In our source server, configure binlog_format and flush logs to start with fresh binary logs:

set global binlog_format=MIXED;

flush logs;

Start a streaming backup using Percona Xtrabackup through netcat in the destination server:

nc –l 9999 | cat – > /dest/folder/backup.tar

and in our source server:

innobackupex —stream=tar ./ | nc dest_server_ip 9999

Once the backup is done, untar and restore the backup. Then set up the slave:

tar –xif /dest/folder/backup.tar

innobackupex —apply–log /dest/folder/

/etc/init.d/mysql stop

rm –rf /var/lib/mysql/

mv /dest/folder/* /var/lib/mysql/

chown –R mysql:mysql /var/lib/mysql

/etc/init.d/mysql start

cat /var/lib/mysql/xtrabackup_binlog_info

change master to master_host=‘master_host’, master_user=‘master_user, master_password=’master_password‘, master_log_file=’file_printed_in_xtrabackup_binlog_info’, master_log_pos=pos_printed_in_xtrabackup_binlog_info;

start slave;

Now that we have the slave ready, we prepare our dataset by running two mysqldump processes so we have data and schemas in separate files. You can also run this operation using MyDumper or mysqlpump, but I will keep it easy:

STOP SLAVE;

SHOW SLAVE STATUS;

Write down this output, as it may be needed later:

mysqldump —skip–set–charset —no–data —databases `mysql —skip–column–names –e “SELECT GROUP_CONCAT(schema_name SEPARATOR ‘ ‘) FROM information_schema.schemata WHERE schema_name NOT IN (‘mysql’,’performance_schema’,’information_schema’);”` > schema.sql

mysqldump —skip–set–charset –n –t —databases `mysql —skip–column–names –e “SELECT GROUP_CONCAT(schema_name SEPARATOR ‘ ‘) FROM information_schema.schemata WHERE schema_name NOT IN (‘mysql’,’performance_schema’,’information_schema’);”` > data.sql

Notice that I’m passing a command as an argument to –databases to dump all databases but mysql, performance_schema and information_schema (hack stolen from this post, with credit to Ronald Bradford). It is very important to keep the replication stopped, as we will resume replication after fully converting our charset.

Now we have to convert our data to utf8mb4. This is easy as we just need to touch the schema.sql file by running few commands:

sed –e “s/DEFAULT CHARACTER SET latin1/DEFAULT CHARACTER SET utf8mb4/g” schema.sql

sed –e “s/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci /” schema.sql

sed –e “s/SET character_set_client = utf8/SET character_set_client = utf8mb4/” schema.sql

Can this be a one-liner? Yes, but I’m not a good basher. 🙂

Now we are ready to restore our data using new encoding:

mysql –e “set global innodb_large_prefix=1;”

mysql < schema.sql

mysql < data.sql

Notice I’ve enabled the variable innodb_large_prefix. This is important because InnoDB limits index prefixes to 768 bytes by default. If you have an index based in a varchar(255) data type, you will get an error because the new charset exceeds this limit (up to four bytes per character goes beyond 1000 bytes) unless you limit the index prefix. To avoid issues during data load, we enable this variable to extend the limit to 3072 bytes.

Finally, let’s configure our server and restart it to make sure to set new defaults properly. In the my.cnf file, add:

[client]

default–character–set=utf8mb4

[mysqld]

skip–slave–start

character–set–server=utf8mb4

collation–server=utf8mb4_unicode_ci

innodb_large_prefix=1

Let’s resume replication after the restart, and make sure everything is ok:

START SLAVE;

SHOW SLAVE STATUS;

Ok, at this point we should be fine and our data should be already converted to utf8mb4. So far so good. The next step is to failover applications to use the new server, and rebuild the old server using a fresh backup using xtrabackup as described above.

There are few things we need to consider now before converting this slave into master:

Make sure you properly configured applications. Charset and collation values can be set as session level, so if you set your connection driver to another charset then you may end up mixing things in your data.
Make sure the new slave is powerful enough to handle traffic from the master.
Test everything before failing over production applications. Going from Latin1 to utf8mb4 should be straightforward, as utf8mb4 includes all the characters in Latin1. But let’s face it, things can go wrong and we are trying to avoid surprises.
Last but not least, all procedures were done in a relatively small/medium sized dataset (around 600G). But this conversion (done via logical backups) is more difficult when talking about big databases (i.e., in the order of TBs). In these cases, the procedure helps but might not be good enough due to time restrictions (imagine loading a 1TB table from a logical dump — it take ages). If you happen to face such a conversion, here is a short, high-level plan:

- Convert only smaller tables in the slave (i.e., those smaller than 500MB) following same procedure. Make sure to exclude big tables from the dump using the –ignore-tables parameter in mysqldump.
- Convert bigger tables via alter table, as follows:
  
  ALTER TABLE big_table MODIFY latin1_column varbinary(250);
  
  ALTER TABLE big_table MODIFY latin1_column varchar(250) CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
- Once everything is finished, you can resume replication. Notice you can do dump/conversion/restore in parallel with the altering of bigger tables, which should reduce the time required for conversion.

It’s important to understand why we need the double conversion from latin1 to varbinary to utf8mb4. This post from Marco Tusa largely explains this.

Conclusion

I wrote this guide from my experience working with these type of projects. If you Google a bit, you’ll find a lot of resources that make this work, along with different solutions. What I’ve tried to present here is a guide to help you deal with these projects. Normally, we have to perform these changes in existing datasets that sometimes are big enough to prevent any work getting done via ALTER TABLE commands. Hopefully, you find this useful!

Francisco Bordenave

Francisco has been working in MySQL since 2006, he has worked for several companies which includes Health Care industry to Gaming. Over the last 6 years he has been working as a Remote DBA and Database Consultant which help him to acquire a lot of technical and multi-cultural skills.
He lives in La Plata, Argentina and during his free time he likes to play football, spent time with family and friends and cook.

via Planet MySQL
Migrating Database Charsets to utf8mb4: A Story from the Trenches

Conclusion

Related