As a Technical Account Manager at Percona, I get to work with many of our largest clients. While the industry verticals vary, one core challenge generally remains the same: what do I do with all this data? Dealing with massive data sets in MySQL isn’t a new challenge, but the best approach still isn’t trivial. Each application is obviously different, but I wanted to discuss some of the main best practices for dealing with lakes of data.
Keep MySQL Instances Small
First and foremost, the architecture needs to be designed to keep each MySQL instance relatively small. A very common question I get from teams new to working with MySQL is: “So what is the largest instance size MySQL supports?” My answer goes back to my time in consulting: “It depends.” Can my MySQL instance support a 20TB dataset? Maybe, but it depends on the workload pattern. Should I store 20TB of data in a single MySQL instance? In most cases, absolutely not.
MySQL can definitely store massive amounts of data, but an RDBMS is expected to do more than just store it: it has to serve reads and writes efficiently. When the data grows that large, read performance often starts to suffer. But what if my working dataset still fits in RAM? This is often the critical consideration when sizing an instance. In that case, active read/write operations may remain fast, but what happens when you need to take a backup or ALTER a table? You are reading (and writing out) 20TB, an operation that will always be bounded by I/O.
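To ground the “does it fit in RAM” question, comparing the on-disk data size against the InnoDB buffer pool is a reasonable sanity check. A minimal sketch (information_schema sizes are estimates, and the schema filter is just an example):

```sql
-- Approximate total data + index footprint, excluding system schemas.
SELECT ROUND(SUM(data_length + index_length) / POW(1024, 3), 1) AS dataset_gb
FROM information_schema.tables
WHERE table_schema NOT IN
      ('mysql', 'sys', 'information_schema', 'performance_schema');

-- The configured InnoDB buffer pool, for comparison.
SELECT ROUND(@@innodb_buffer_pool_size / POW(1024, 3), 1) AS buffer_pool_gb;
```

If dataset_gb dwarfs buffer_pool_gb, only the hot working set can stay cached, and cold reads will hit disk.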
So what is the magic number for sizing? Many large-scale shops try to keep individual instance sizes under the 2-3TB mark. This results in a few major advantages:
- Predictable operational times (backups, ALTERs, etc.)
- Allows for optimized and standardized hardware
- Potential for parallelism in loading data
If I know my instance will never exceed a couple of terabytes, I can fully optimize my systems for that data size. The result is predictable, repeatable operational actions. Now, when a backup is “slow”, it is almost assuredly due to hardware, not an outlier instance that is double the size. That is a huge win for the operations team managing the overall infrastructure. Beyond backups, you also have to consider restore time: massive backups slow restoration and have a negative impact on RTO (recovery time objective).
Store Less Data
Now that the negative impact of large individual instances is known, let’s look at how we keep sizes down. While seemingly obvious, the best way to keep data sizes small is to store less data. There are a few ways to approach this, each illustrated in the sketch after the list:
- Optimize data types
  - Data types bigger than needed result in excess disk footprint (e.g., using bigint when int will suffice)
- Review indexes for bloat
  - Limit composite Primary Keys (PKs)
  - Find and remove redundant indexes (using pt-duplicate-key-checker)
  - Avoid PKs using varchar
- Purge old data
  - When possible, remove records not being read
  - Tools like pt-archiver can really help in this process
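To make these concrete, here is a sketch against a hypothetical events table (the schema, names, and 90-day retention window are invented for illustration; always rehearse ALTERs and purges on a replica first):

```sql
-- Hypothetical schema, invented purely for illustration.
CREATE TABLE events (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_id    BIGINT UNSIGNED NOT NULL,
  created_at DATETIME NOT NULL,
  payload    VARCHAR(255),
  KEY idx_user (user_id),
  KEY idx_user_created (user_id, created_at)
);

-- Optimize data types: if user_id never exceeds ~4.3 billion,
-- INT UNSIGNED saves 4 bytes per row (and per index entry) over BIGINT.
ALTER TABLE events MODIFY COLUMN user_id INT UNSIGNED NOT NULL;

-- Remove redundant indexes: idx_user is a left prefix of idx_user_created,
-- so it adds write and disk overhead with no read benefit.
-- This is exactly the pattern pt-duplicate-key-checker flags.
ALTER TABLE events DROP INDEX idx_user;

-- Purge old data in small batches to avoid huge transactions and long locks;
-- pt-archiver automates this loop with throttling and optional archiving.
DELETE FROM events
WHERE created_at < NOW() - INTERVAL 90 DAY
LIMIT 5000;  -- repeat until no rows are affected
```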
These techniques can help you delay the need for more drastic measures. However, in some cases (compliance requirements, limited flexibility, etc.), the above options aren’t possible. In other cases, you may already be doing all of them and are still hitting size limits.
Horizontal Sharding
So what is another way to deal with massive data sets in MySQL? When all other options are exhausted, you need to look at splitting the data horizontally and spreading it across multiple, equally sized instances. Unfortunately, this is much easier said than done. While there are some tools and options out there for MySQL (such as Vitess), often the best and most flexible approach is to build the sharding logic directly into your application. Sharding can be done statically (key modulus, for example), dynamically (via a dictionary lookup), or through some hybrid of the two.
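As a minimal sketch of the two routing styles (the shard_map table, the tenant_id key, and the count of eight shards are all assumptions for illustration, not a prescription):

```sql
-- Static routing: derive the shard from the key itself.
-- Stateless and simple, but rebalancing means re-hashing existing data.
SELECT 123456789 % 8 AS shard_id;  -- assumes 8 fixed shards

-- Dynamic routing: a small dictionary maps each tenant to its shard.
-- Rebalancing becomes a metadata update plus a data move, not a code change.
CREATE TABLE shard_map (
  tenant_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  shard_id  SMALLINT UNSIGNED NOT NULL
);

SELECT shard_id FROM shard_map WHERE tenant_id = 123456789;
```

A common hybrid hashes the key into a fixed number of buckets and then maps buckets to shards through the dictionary, keeping the lookup table small while still allowing rebalancing.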
Sharding Considerations
When you finally have to bite the bullet and split data horizontally, there are definitely some things to keep in mind. First and foremost, picking the correct sharding key is imperative. With the wrong key, shards won’t be balanced and you’ll end up with sizes all over the board, which puts you right back where you started: a shard that grows too large. A quick distribution check, like the one below, can catch skew early.
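Assuming the shard_map dictionary from the previous sketch, one way to watch for skew (tenant counts are only a proxy for data size, but a lopsided distribution here usually predicts lopsided shards):

```sql
-- Tenants per shard; a heavily skewed spread here is an early warning.
SELECT shard_id, COUNT(*) AS tenants
FROM shard_map
GROUP BY shard_id
ORDER BY tenants DESC;
```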
Once you have the correct key, you need to understand that different workloads will be impacted by sharding differently. When the data is split across shards, individual lookups are generally the easiest to implement: you take the key, map it to a shard, and fetch the results. However, if the workload requires aggregate access (think reports, totals, etc.), you are now combining results from multiple shards, a scatter-gather pattern sketched below. This is one of the primary challenges of horizontal sharding. As with most architectures, the business requirements and workload will dictate the design.
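A rough sketch of scatter-gather, reusing the hypothetical events table from earlier: the application runs the same aggregate on every shard and merges the partial results itself:

```sql
-- Run identically on each shard (shard_0 ... shard_7):
SELECT COUNT(*)        AS row_count,
       MAX(created_at) AS newest_event
FROM events
WHERE created_at >= CURDATE();

-- The application then merges the partials:
--   total rows   = SUM of row_count across shards
--   newest event = MAX of newest_event across shards
-- Note: averages must be merged as SUM(sums) / SUM(counts),
-- never as an average of per-shard averages.
```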
If your team is struggling with an exploding data set, the Professional Services team at Percona can help you design a more flexible and scalable solution. Each case is unique, and our team can work with your specific use case and business requirements to guide you in the right direction. The biggest thing to remember: please don’t just keep adding disk space to your instances and expect them to scale. Proper design and horizontal sharding are the critical factors as your data grows!
Percona Distribution for MySQL is the most complete, stable, scalable, and secure open-source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!
Download Percona Distribution for MySQL Today