MyDumper’s Stream Implementation

As you might know, mysqldump is single-threaded and writes to STDOUT by default. MyDumper, being multithreaded, has to write to multiple files instead. Since version 0.11.3, released in November 2021, MyDumper has been able to stream backups. We spent several months deciding on the simplest way to implement it, and we also had to add support for compression. After fixing several bugs, we now consider it stable enough, so we can explain how it works.

How Can You Stream if MyDumper is Multithreaded?

Receiving a stream is not a problem for myloader: it receives one file at a time and hands it to a worker thread to process. However, each worker thread in mydumper is connected to the database, and as soon as it reads data it should be sent to the stream, which might cause collisions with other worker threads that are reading data at the same time. To avoid this issue, we ended up with the simplest solution: mydumper takes the backup and stores it in the local file system that you configured, and the filename is enqueued to be processed by the Stream Thread, which pops one file at a time and pipes it to stdout. We studied the alternative of sending chunks of a file while it is being dumped, but the way we implemented it is simpler and improves overall performance.
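
As a rough illustration, the stream the Stream Thread produces can be imitated with a few lines of shell (export-dir and stream.sql are made-up names; the real implementation is C code inside mydumper, and the header format is explained below):

# Toy illustration of the stream format: each file is sent to stdout as a
# "-- <filename>" header line followed by the file's content, one file at a time.
for f in export-dir/*; do
  echo "-- $(basename "$f")"
  cat "$f"
done > stream.sql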

Implementation Details

Here is a high-level diagram of how we implemented it:

[Diagram: MyDumper stream implementation]

When a mydumper Worker Thread processes a job, it connects to the database and stores the output into a file. That hasn't changed, but when streaming, we also push the filename into the mydumper stream_queue.

The mydumper Stream Thread pops filenames from the mydumper stream_queue, sends the header of the file to stdout, and then opens the file and sends its content.

Then, the myloader Stream Thread receives and detects the header, creates a new file with the filename from the header, and stores the content in it.

After closing the file, it enqueues the filename in the myloader stream_queue. A myloader Worker Thread takes that file and processes it according to the kind of file it is.
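
Conversely, the splitting that the myloader Stream Thread performs can be imitated with awk (a toy reconstruction based on the header format shown below; the real myloader also handles details such as compressed files):

# Whenever a "-- <filename>" header line arrives, start writing the
# following lines into a file with that name.
awk '/^-- / { out = substr($0, 4); next } out != "" { print > out }' < stream.sql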

By default, the files are deleted, but if you want to keep them, you can use the --no-delete option.
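
For example, a sketch that streams to a single file while keeping the local copies (filename.sql is an arbitrary name):

mydumper --stream --no-delete -B <SCHEMA_NAME> -h <FROM> > filename.sql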

The header simply prepends "-- " to the filename, so the output remains valid SQL and you can use either myloader or the mysql client to import your database. Here is an example:

-- sbtest-schema-create.sql
CREATE DATABASE /*!32312 IF NOT EXISTS*/ `sbtest` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci */ /*!80016 DEFAULT ENCRYPTION='N' */;

-- sbtest.sbtest1-schema.sql
/*!40101 SET NAMES binary*/;
/*!40014 SET FOREIGN_KEY_CHECKS=0*/;

/*!40103 SET TIME_ZONE='+00:00' */;
CREATE TABLE `sbtest1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `k` int NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  `pad2` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=100010 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- sbtest.sbtest1.00000.sql
/*!40101 SET NAMES binary*/;
/*!40014 SET FOREIGN_KEY_CHECKS=0*/;
/*!40103 SET TIME_ZONE='+00:00' */;
INSERT INTO `sbtest1` VALUES(1,49929,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","67847967377-48000963322-62604785301-91415491898-96926520291","")
…
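
Because the headers are plain SQL comments, a streamed dump saved to a file can, in principle, also be replayed with the mysql client instead of myloader:

mysql -h <TO_SERVER> < filename.sql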

Simple Use Cases

Each thread writes to its own file to avoid collisions, which improves performance. However, having thousands of files for a backup of a couple of tables is not manageable. So, the simplest use case is to send everything to a single file:

mydumper --stream -B <SCHEMA_NAME> -h <FROM> > filename.sql

Then you can simply import it using:

myloader --stream -o -h <TO_SERVER> < filename.sql

Now that you can pipe from a mydumper process to myloader, this execution is possible:

mydumper --stream -B <SCHEMA_NAME> -h <FROM> | myloader --stream -o -h <TO>
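
Streaming also works together with compression, which was added as part of this work. A sketch assuming mydumper's -c/--compress option, with myloader detecting the compressed files by their names:

mydumper --stream --compress -B <SCHEMA_NAME> -h <FROM> | myloader --stream -o -h <TO>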

Or you can send the stream through the network using nc. Start the listener on the destination side first:

nc -l <MYDUMPER_SERVER> <ANY_PORT> | myloader --stream -o -h <TO_SERVER>

Then run the dump on the source side:

mydumper --stream -B <SCHEMA_NAME> -h <FROM_SERVER> | nc <MYDUMPER_SERVER> <ANY_PORT>

This implementation uses the backup directory of mydumper and myloader as a buffer. You must take this into account, as by default each command creates that directory wherever you run it.

Another thing to take into account is that both mydumper and myloader write to disk: the whole backup is written on both file systems while it is being processed, so use file systems with enough disk space on both sides.
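
For example, you can point both buffer directories at file systems that you know have enough free space. This sketch assumes mydumper's -o/--outputdir and myloader's -d/--directory options, with made-up paths:

mydumper --stream -B <SCHEMA_NAME> -h <FROM> -o /data/dump_buffer | myloader --stream -o -h <TO> -d /data/load_buffer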

Finally, you can keep myloader running and send several mydumper backups. First, you need to run:

nc -k -l <MYDUMPER_SERVER> <ANY_PORT> | myloader --stream -o -h <TO_SERVER>

And then execute:

mydumper --stream -B <SCHEMA_NAME_1> -h <FROM_SERVER> | nc <MYDUMPER_SERVER> <ANY_PORT>
mydumper --stream -B <SCHEMA_NAME_2> -h <FROM_SERVER> | nc <MYDUMPER_SERVER> <ANY_PORT>
mydumper --stream -B <SCHEMA_NAME_3> -h <FROM_SERVER> | nc <MYDUMPER_SERVER> <ANY_PORT>
mydumper --stream -B <SCHEMA_NAME_4> -h <FROM_SERVER> | nc -N <MYDUMPER_SERVER> <ANY_PORT>

Some versions of nc have these two options:

      -k      When a connection is completed, listen for another one.  Requires -l.

      -N      shutdown(2) the network socket after EOF on the input.  Some servers require this to finish their work.

This is very useful if you are refreshing a testing environment and you only need a couple of tables from different databases, or if you are using a WHERE clause that only applies to some tables.
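
For instance, a refresh that only moves a couple of tables with a filter might look like this (a sketch assuming mydumper's -T/--tables-list and --where options; check the table-name format your version expects):

mydumper --stream -B <SCHEMA_NAME> -T <SCHEMA_NAME>.<TABLE_1>,<SCHEMA_NAME>.<TABLE_2> --where "id < 10000" -h <FROM_SERVER> | nc <MYDUMPER_SERVER> <ANY_PORT>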

Considerations

Usually, when you send data to STDOUT, you are not going to have trouble with disk space usage on the dumper server. That is NOT true if you are using MyDumper: files are stored on the mydumper server until they are transferred to the receiving server. For instance, if you have a 10TB database and the network bandwidth is much lower than the disk bandwidth, you might fill up the disk where the files are kept temporarily.
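
A rough pre-check before streaming a large instance is to compare the approximate data size against the free space on the buffer file systems on both sides (hypothetical paths):

# Approximate size of the data to be dumped, in GB.
mysql -h <FROM_SERVER> -e "SELECT ROUND(SUM(data_length+index_length)/POW(1024,3),1) AS approx_gb FROM information_schema.tables"
# Free space where mydumper and myloader keep their temporary files.
df -h /path/used/by/mydumper /path/used/by/myloader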

Conclusion

We focused this implementation on speeding up the export and import processes. Unlike other tools or implementations, we use the file system as a buffer, which causes higher disk utilization.
