Implementing MySQL Data-at-Rest Encryption: Step-by-Step Guide

https://cdn.prod.website-files.com/6717800cb1e973b8fc433b03/6a0bf6640fd479be346bd089_Implementing%20MySQL%20Data-at-Rest%20Encryption.avif

Implementing MySQL Data-at-Rest Encryption

Implementing encryption in MySQL requires careful planning, correct configuration, and operational discipline. While the feature itself is straightforward, a production deployment involves multiple components such as keyring setup, table encryption, log encryption, backups, and performance validation.

This guide provides a detailed, step-by-step approach to configuring MySQL data-at-rest encryption in a production environment, focusing on the latest standards in MySQL 8.4.

Phase 1: Setting Up the Keyring (MySQL 8.4 Component-Based Approach)

In MySQL 8.4, the keyring architecture transitioned from a plugin-based model to a component-based framework. This update improves flexibility, maintainability, and compatibility with modern MySQL infrastructure.

Unlike MySQL 8.0, where early-plugin-load was necessary, MySQL 8.4 uses a component manifest and configuration file. The keyring initializes automatically during server startup.

1. Understanding the Keyring Components

The keyring setup in MySQL 8.4 involves four main elements:

  • Component: component_keyring_file
  • Configuration file: component_keyring_file.cnf
  • Data file: component_keyring_datafile
  • Manifest file: mysqld.my

These elements work together to ensure the keyring loads before InnoDB initializes.

MySQL 8.4 Keyring Component Initialization Flow

Decoupled startup mapping sequence matching strict system initialization constraints.

Architecture Standard: Fully Symmetrical Pathing

2. Create Keyring Configuration File

First, create the configuration file to define where the database will store the keyring data.

vi /etc/mysql/component_keyring_file.cnf

Add the following configuration:

{
 "path": "/secure/keyring/component_keyring_datafile"
}

3. Create and Secure the Keyring Directory

Create the directory and assign the correct ownership and permissions. This prevents unauthorized OS-level access.

mkdir -p /secure/keyring
chown -R mysql:mysql /secure/keyring
chmod 700 /secure/keyring

4. Create the Manifest File

The manifest file instructs the server to load the keyring component at startup.

vi /var/lib/mysql/mysqld.my

Add the following content:

{
 "components": "file://component_keyring_file"
}

5. Restart MySQL Service

Apply the changes by restarting the MySQL service.

systemctl restart mysqld

6. Verify Keyring Component Status

Run the following query to confirm the keyring is active.

SELECT * FROM performance_schema.keyring_component_status;

Expected Output:

component_keyring_file | ACTIVE | YES

This output confirms the keyring component is operational.

Phase 2: Enabling Tablespace Encryption

With the keyring in place, you can now encrypt your data. Tablespace encryption protects your tables at the storage level.

Encrypt New Tables

Specify the encryption clause when creating a new table.

CREATE TABLE customers (
 id INT PRIMARY KEY,
 name VARCHAR(100)
) ENCRYPTION='Y';

Encrypt Existing Tables

Alter existing tables to enable encryption.

ALTER TABLE customers ENCRYPTION='Y';

Set Global Default

To force encryption for all new tables automatically, adjust the global variable.

default_table_encryption=ON

Verify Encryption

Check the encryption status of your tables using the information schema.

SELECT NAME, ENCRYPTION FROM information_schema.INNODB_TABLESPACES;
Security Audit Interface

Interactive Data Protection Matrix

Simulate your production encryption profile by toggling settings. Ensure complete log coverage to reach 100% compliance.

🔗 Fulfilled Tablespace Encryption (TDE) in Percona Cluster For teams running Percona XtraDB Cluster who need to apply TDE across nodes.

🔗 Shield Your Sensitive Data: MySQL Data Encryption at Rest A webinar recap covering encryption at rest concepts — good supplementary read for teams new to the topic.

Phase 3: Encrypting Logs

To achieve complete data-at-rest protection, you must also encrypt the database logs. Unencrypted logs can expose sensitive data even if the tablespaces are secure.

Redo Log Encryption

innodb_redo_log_encrypt=ON

Undo Log Encryption

innodb_undo_log_encrypt=ON

Binary Log Encryption

binlog_encryption=ON

🔗 How to Use the MySQL 8.4 Audit Log Filter Key rotation and encryption events should be captured in audit logs — this guide explains how to set up granular audit filtering in MySQL 8.4.

Phase 4: Master Key Rotation

Periodic key rotation is a standard security requirement. Rotating the master key generates a new key and re-encrypts the tablespace keys, but it does not require re-encrypting the actual table data.

ALTER INSTANCE ROTATE INNODB MASTER KEY;
Automated Step-by-Step Pipeline

InnoDB Master Key Rotation Workflow

Watch the pipeline phases below to see metadata execution processes.

1. Generate New Master Key

Executing ALTER INSTANCE ROTATE INNODB MASTER KEY signals the active Keyring Component to securely generate a fresh, highly random Master Encryption Key (MEK) internally within its memory boundaries.

Backup and Recovery Considerations

Encryption directly affects how you handle backups and recovery. Keep these principles in mind:

  • Physical backups must include the keyring.
  • Losing the keyring makes encrypted data permanently unrecoverable.
  • Logical backups (like mysqldump) export data in plaintext and are not encrypted by default.

Example Keyring Backup Command:

cp /secure/keyring/keyring /backup/keyring

Performance Impact and Benchmarking

Encryption adds a measurable workload to your database server. Before deploying to production, benchmark the performance impact in a staging environment.

Common observations include:

  • Query latency increases moderately.
  • Commit latency increases slightly due to redo log encryption.
  • Overall CPU and I/O usage will rise.

Example Metrics:

  • Average query latency: 12.3 ms → 14.7 ms
  • Commit latency: 1.2 ms → 1.8 ms
Throughput vs Latency Monitor

Encryption Overhead Benchmarks

Toggle workload patterns to view simulated AES-256 performance impact.

Average Query Latency
12.3 ms vs 14.7 ms (+19.5%)

Transaction Commit Latency
1.2 ms vs 1.8 ms (+50.0%)

Overall CPU Encryption Overhead
Baseline vs Active (+8.2%)

Operational Best Practices

To maintain a secure and stable environment, follow these operational rules:

  • Restrict OS-level access: Limit access to the database data directory and the keyring folder.
  • Enforce minimal privileges: Apply the principle of least privilege for all database users.
  • Align replication nodes: Ensure encryption configurations are identical across all primary and replica nodes.
  • Monitor status regularly: Track the state of your keyring and encryption variables as part of your routine checks.

🔗 Fastest Parallel Replication in MySQL 8For teams aligning encryption settings across replication nodes, this explains replication internals relevant to that process.

Common Pitfalls to Avoid

Many encryption issues surface only during a critical incident. Avoid these frequent mistakes:

  • Failing to back up the keyring file.
  • Skipping routine recovery tests.
  • Forgetting to enable binary log encryption.
  • Deploying to production without proper performance benchmarking.

These oversights often lead to costly delays or permanent data loss during a recovery scenario.

Conclusion

Implementing MySQL data-at-rest encryption is a structured process that extends beyond simple configuration. It requires strict attention to key management, operational workflows, and performance validation.

A well-planned execution ensures that your data remains completely secure, even in the event of a physical or storage-level compromise. For additional context on official specifications, consult the MySQL 8.4 Reference Manual on InnoDB Data Encryption.

Protecting your data at rest is just one part of a robust database strategy. Mydbops provides comprehensive MySQL support, ranging from 24/7 managed services and remote DBA assistance to architectural consulting and thorough security audits. Let our team handle the operational complexities and performance tuning to eliminate bottlenecks and keep your infrastructure secure.

Reach out today to secure and optimize your database environment.

Planet for the MySQL Community

MySQL Data at Rest: Architecture & Best Practices

https://cdn.prod.website-files.com/6717800cb1e973b8fc433b03/6a0c9cfd1d34e7ec60607e8f_MySQL%20Data%20at%20Rest%20Architecture%20%26%20Best%20Practices.avif

Securing MySQL Data at Rest: Architecture, Concepts, and Why It Matters

In today’s data-driven environments, protecting sensitive information is no longer optional, it is foundational. While organizations invest heavily in access controls and network security, one critical area is often misunderstood: data-at-rest encryption in MySQL.

Data-at-rest encryption ensures that even if database files are accessed outside MySQL through disk theft, backup exposure, or filesystem compromise, the data remains unreadable. This post explains how MySQL implements encryption internally, what it protects, and why a well-designed encryption strategy is necessary for modern systems.

Understanding the Real-World Threat Model

Before configuring encryption, you must understand the exact risks it mitigates. Data-at-rest encryption in MySQL protects against scenarios such as:

However, encryption operates exclusively at the storage layer. It does not protect against:

  • Valid database users executing queries.
  • Application-level vulnerabilities.
  • Compromised database credentials.

How MySQL Implements Data-at-Rest Encryption

MySQL utilizes InnoDB storage engine encryption (External Link), which is designed to be transparent to applications and highly efficient at runtime.

Key Characteristics:

  • Encryption is applied at the tablespace level.
  • Data is encrypted at the page level.
  • All encryption and decryption operations occur entirely in memory (within the InnoDB buffer pool).
  • Plaintext data is never written to the physical disk.

This design ensures that your applications require absolutely no code changes, maintaining strong protection at the storage level without sacrificing usability.

Encryption Architecture: A Layered Approach

MySQL relies on a hierarchical key structure to manage encryption securely.

Key Hierarchy

  1. Master Key: Stored securely in the MySQL keyring.
  2. Tablespace Key: A unique key generated for each individual tablespace.
  3. Encrypted Data Pages: The actual user data stored on disk.

How It Works:
Each tablespace is encrypted using its own unique tablespace key. To secure the system further, the tablespace key is then encrypted using the master key. Finally, the master key is stored in the keyring. This layered approach improves overall security and allows administrators to rotate keys efficiently without needing to re-encrypt entire tablespaces.

What Should Be Encrypted in MySQL

A complete encryption strategy involves much more than just securing your tables.

To prevent data leaks, the following components must be considered:

  1. Tablespaces: These hold the actual user data and are the primary target for encryption.
  2. Undo Tablespaces: Undo logs store previous versions of rows, which may include deleted or modified data. Encrypting them stops the exposure of historical data.
  3. Redo Logs: Redo logs capture recent database changes and transaction history. They often contain highly sensitive before-and-after values.
  4. Binary Logs: Binary logs track all database modifications. They are essential for MySQL replication (Internal Link placeholder) and point-in-time recovery.
  5. Relay Logs: Existing on replica servers, relay logs mirror the primary server’s binary logs, making them equally sensitive.

Encrypting only your tablespaces leaves these other layers heavily exposed. Full coverage guarantees your data is protected across all storage components.

Key Management: The Core of Encryption

Encryption is only as strong as its key management. The MySQL keyring carries several vital responsibilities:

  • Storing the master encryption key.
  • Enabling the encryption and decryption of tablespace keys.
  • Supporting secure key rotation.

Example Configuration (MySQL 8.0)

early-plugin-load=keyring_file.so
keyring_file_data=/secure/keyring/keyring

Verification

SHOW PLUGINS;

Expected output:

keyring_file | ACTIVE

Best Practices for Keyring Management:

  • Store keyring files with strictly restricted OS permissions.
  • Keep your keyring storage logically and physically separate from your database files.
  • Always include the keyring in your database backup strategy (Internal Link placeholder).

MySQL 8.0 vs MySQL 8.4: Evolution of the Keyring

MySQL 8.4 introduces a modern, component-based keyring system, shifting away from the older plugin methods.

  • MySQL 8.0: Utilizes a plugin-based approach, which requires the plugin to be loaded early during server startup (early-plugin-load).
  • MySQL 8.4: Utilizes a component-based architecture, offering better flexibility, easier configuration, and long-term maintainability.

Verification in MySQL 8.4

SELECT * FROM performance_schema.keyring_component_status;

Why Data-at-Rest Encryption Is Essential

Implementing encryption extends beyond basic security. It actively supports:

  • Regulatory Compliance: Meeting strict standards for PCI-DSS, GDPR, and HIPAA.
  • Accidental Exposure Prevention: Stopping data leaks from misplaced backups or retired hardware.
  • Defense Against Infrastructure Threats: Adding a final layer of protection against hardware and OS-level breaches.

Modern infrastructure planning must assume that storage layers could eventually be accessed by unauthorized parties.

Conclusion

Data-at-rest encryption in MySQL provides a reliable method to defend sensitive data from unauthorized file-level access. However, its actual effectiveness relies on a complete approach. You must encrypt all relevant database components, manage your encryption keys safely, and clearly understand the scope and limitations of the feature. When set up correctly, data-at-rest encryption acts as a necessary layer in a defense-in-depth strategy.

Data-at-rest encryption in MySQL is a foundational defense-in-depth strategy that protects sensitive information from physical disk theft, backup exposure, and unauthorized filesystem access by ensuring plaintext data is never written to disk. Utilizing InnoDB’s transparent, layered architecture, it secures data via a master key stored in a keyring—which has evolved from a plugin system in MySQL 8.0 to a modern component-based model in MySQL 8.4—to safely encrypt individual tablespace keys.

To achieve true regulatory compliance (like GDPR or HIPAA) and robust security, organizations must extend this encryption beyond standard tablespaces to include undo logs, redo logs, binary logs, and relay logs, ensuring complete data protection backed by strict, isolated key management practices.

Secure Your MySQL Infrastructure

Ensure your database meets strict compliance standards and is fully protected against file-level threats. Our experts at Mydbops are ready to audit, configure, and manage your MySQL encryption strategy.

Planet for the MySQL Community

An Entire Wikipedia That’s 100% AI Hallucinations

"Every link leads to an entry that does not exist yet," explains the GitHub page for a Wikipedia-like site called Halupedia. "Until you click it, at which point an LLM pretends it has always existed and writes it for you, in the deadpan register of a 19th-century scholarly press…"
Every article is invented on demand. The footnotes are also lies… The hardest problem with an infinite, on-demand encyclopedia is internal contradiction… When the LLM writes an article, it is required to add a context="…" attribute on every <a> it inserts, summarising the future article it is linking to (e.g. context="19th-century clerk who formalized footnote drift, Pellbrick’s mentor")… When that target article is later requested for the first time, the worker loads the accumulated hints and injects them into the system prompt as "PRIOR REFERENCES — these are CANON". The LLM is instructed that the encyclopedia is hallucinated and absurd, but it must not contradict itself. Fast Company reports that Halupedia was created by software developer BartÅomiej Strama, who confessed in a Reddit comment that the site came about after a drunk night with a friend. In the week since launch, he says Halupedia has amassed more than 150,000 users."
Beyond indulging in silly alternate histories, what’s the point of using Halupedia? Strama hinted at one larger purpose in a reply to a donor on his Buy Me a Coffee page: "Your contribution towards polluting LLM training data will surely benefit society!" he wrote.
The site is licensed as free software under the GPL-3.0 license. Thanks to long-time Slashdot reader schwit1 for sharing the news.


Read more of this story at Slashdot.

Slashdot

Best practices for upgrading Amazon RDS for MySQL 8.0 to 8.4 with prechecks, Blue/Green, and rollback

https://d2908q01vomqb2.cloudfront.net/887309d048beef83ad3eabf2a79a64a389ab1c9f/2026/05/14/DB5315.png

Upgrading a production MySQL database across major versions is one of the most high-stakes operations a database team faces. Compatibility breaks, unexpected downtime, and the risk of no easy rollback can stall upgrade plans for months. Amazon Relational Database Service (Amazon RDS) for MySQL 8.0 standard support ends on July 31, 2026, and Amazon RDS for MySQL now supports long-term support (LTS) version 8.4. Teams running MySQL 8.0 workloads need a reliable, low-risk path to MySQL 8.4. However, upgrading from 8.0 to 8.4 involves significant compatibility changes that you must evaluate and test before proceeding. In this post, you learn how to build a complete upgrade workflow for RDS for MySQL 8.0 to 8.4, covering pre-upgrade compatibility checks, Amazon RDS Blue/Green Deployments for minimal downtime, and a rollback strategy for added protection.

If you need to stay on MySQL 8.0 beyond that date, check the latest updates on RDS Extended Support. For a detailed breakdown of Extended Support costs, see Upgrade strategies for Amazon RDS for MySQL 8.0 to 8.4. You can also consider Amazon Aurora MySQL-Compatible Edition, which provides its own release calendar and still supports MySQL 8.0 without having to use extended support at least until 30 April 2028.

Solution overview

This post is part of a two-part series on upgrading RDS for MySQL 8.0 to 8.4. Here, we focus on the step-by-step implementation, covering pre-upgrade compatibility checks, Blue/Green Deployments, and rollback preparation. For an overview of upgrade strategies, Extended Support costs, and key considerations, see Upgrade strategies for Amazon RDS for MySQL 8.0 to 8.4.

In this post, we walk you through a complete upgrade workflow covering three tightly coupled phases:

  1. Pre-upgrade checks: Identify incompatibilities using the MySQL Shell upgrade checker utility and RDS built-in validation.
  2. Blue/Green Deployments: Perform the upgrade with minimal downtime using Amazon RDS Blue/Green Deployments.
  3. Rollback preparation: Set up reverse replication before switchover so you can roll back with confidence.

These phases are not independent: pre-check results determine whether the green environment can upgrade successfully, the binlog position from Blue/Green Deployments creation feeds into the rollback setup, and you need to make sure the rollback solution is ready before you switchover. The following architecture diagram illustrates the complete workflow, helping you understand how each phase interacts with the others.

Architecture diagram showing the three-phase upgrade workflow: pre-upgrade checks, Blue/Green Deployments, and rollback preparation with reverse replication

Prerequisites

Blue/Green Deployments and RDS for MySQL 8.4 are available in all AWS Regions. Before you begin, make sure you have the following resources in place:

  • An Amazon RDS for MySQL 8.0 instance with automatic backups enabled.
  • An Amazon Elastic Compute Cloud (Amazon EC2) instance or bastion host in the same VPC for running MySQL Shell.
  • A custom DB parameter group created for MySQL 8.4.
  • An IAM user or role with at least the following permissions for Blue/Green Deployments:
    • Create Blue/Green Deployments
      • rds:CreateBlueGreenDeployment
      • rds:AddTagsToResource
      • rds:CreateDBInstanceReadReplica
    • Switchover Blue/Green Deployments
      • rds:SwitchoverBlueGreenDeployment
      • rds:ModifyDBInstance
      • rds:PromoteReadReplica
    • Delete Blue/Green Deployments
      • rds:DeleteBlueGreenDeployment
      • rds:DeleteDBInstance
  • (For rollback) An AWS Database Migration Service (AWS DMS) replication instance, or network connectivity between the 8.4 and 8.0 instances for native replication. Make sure the security groups associated with both instances allow MySQL traffic on port 3306 between them. If using AWS DMS, see AWS DMS identity-based policy examples for IAM permission best practices.

Phase 1: Pre-upgrade checks

Before upgrading, use the MySQL Shell check-for-server-upgrade utility to identify incompatibilities. Because the utility performs intensive metadata scans across all databases, we recommend running it against a snapshot-restored instance rather than your production database to avoid impacting workload performance. If you are working in a dev/test environment, you can skip the snapshot restore and run the check directly.

To run the pre-upgrade check:

  1. Restore a DB snapshot of your MySQL 8.0 production instance. This gives you an isolated copy to safely run the check without affecting production.
  2. Connect to your Amazon EC2 bastion host (or any instance in the same VPC that has MySQL Shell installed). If MySQL Shell is not already installed, see Installing MySQL Shell.
  3. The upgrade checker utility requires RELOAD, PROCESS, and SELECT privileges. Before running the check, make sure your user account has these privileges. You can use the RDS master user, or grant the required privileges with the following commands:
  1. Run the upgrade checker:

The utility categorizes findings into three severity levels:

  • Error (blocks upgrade).
  • Warning (evaluate case-by-case).
  • Notice (informational).

Whether an RDS for MySQL instance can successfully upgrade from 8.0 to 8.4 is ultimately determined by the RDS pre-upgrade validation. Therefore, this section focuses on the intersection of MySQL Shell’s check-for-server-upgrade findings and the RDS built-in pre-upgrade checks, providing a practical reference for planning your upgrade (detailed in the following section).

MySQL Shell vs. RDS pre-upgrade checks

The MySQL Shell upgrade checker and the RDS built-in pre-upgrade validation have slightly different default check items. The following table provides the main differences between the two. For detailed explanations and resolution steps for each check item, please see the “Understanding and resolving pre-upgrade check errors” section.

Items with a Source of Shell Only are not enforced by the RDS upgrade validation, so their Severity is marked as -.

Check item Source Severity Notes
Removed system variables Shell Only Can be ignored. RDS upgrade validation does not enforce this item
System variables with new default values Both Warning This check lists parameters whose default values have changed in MySQL 8.4. If you have customized any of these in your MySQL 8.0 parameter group.
Issues reported by ‘check table x for upgrade’ command Both Varies Severity depends on the storage engine; handle based on actual precheck output. Typically, if the result shows “Corrupt”, the upgrade checker treats it as an Error item.
Checks for foreign keys not referencing a full unique index Both Warning MySQL 8.4 may forbid foreign keys to partial indexes, this check identifies such cases to warn the user.
Check for deprecated or invalid user authentication methods. Both Warning MySQL 8.4 may deprecate or remove some authentication methods, this check identifies users still using them.
Check for deprecated or removed plugin usage. Both Error RDS does not support these plugins (authentication_fido, keyring_file, keyring_encrypted_file, keyring_oci), so this check is unlikely to be triggered on RDS. If flagged, please verify and remove the plugin.
Check for deprecated or invalid default authentication methods in system Shell Only Can be ignored, RDS upgrade checks ignore this item
Check for deprecated or invalid authentication methods in use by MySQL Shell Only Can be ignored, RDS upgrade checks ignore this item
Checks for errors in column definitions Both Error Identifies column definitions that MySQL 8.4 may no longer support.
Check for allowed values in System Variables. Both Error Checks whether system variables use values that MySQL 8.4 no longer allows.
Checks for user privileges that will be removed Both Notice Informational only. Verifies users with grants that the upgrade process will remove.
Checks for partitions by key using columns with prefix key indexes Both Error MySQL 8.4 no longer supports indexes on column prefixes for key partitioning. The partition function ignores these indexes, so MySQL 8.4 no longer allows them.
Use of AUTO_INCREMENT for DOUBLE and FLOAT data types RDS Only Error MySQL 8.4 deprecates AUTO_INCREMENT for DOUBLE and FLOAT data types.
RDS checked for the use of non-inclusive language in SQL statements. RDS Only Error MySQL 8.4 no longer allows non-inclusive language in SQL statements.
memcached plugin needs to be uninstalled before upgrade RDS Only Error MySQL 8.3 no longer supports the InnoDB memcached plugin
DB instance must have enough free disk space RDS Only Error Make sure the instance has at least 2 GiB of free storage space.
Detect system objects created as tables in sys schema RDS Only Error Remove any user-created tables in the sys schema before upgrading.

The RDS built-in pre-upgrade validation only runs when you initiate an actual upgrade. If the validation finds blocking issues at that point, the upgrade fails and you need to resolve them and reschedule another maintenance window. Running the MySQL Shell upgrade checker beforehand helps you identify and resolve these issues in advance.

The recommended approach:

  1. Run the MySQL Shell upgrade checker to resolve the overlapping items. RDS manages shell-only items marked “can be ignored” internally.
  2. Perform a test upgrade on a snapshot-restored instance to catch the RDS-only items via PrePatchCompatibility.log.

Understanding and resolving pre-upgrade check errors

To show you the pre-upgrade check process and the Error items that block upgrades, we reproduced several Error items on our test environment. In the following sections, we walk you through the actual precheck output and the suggested resolution for each item, so you can interpret and resolve precheck errors in practice.

Issues reported by ‘check table x for upgrade’ command

Runs CHECK TABLE … FOR UPGRADE against each tables. Depending on the scope of impact, the severity varies. The following are examples of both Error and Warning types.

Check item

(Error)

Issues reported by ‘check table x for upgrade’ command

  • db1.t_1 – View ‘db1.t_1’ references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
  • db1.t_1 – Corrupt

Description

The View references a table, column, or function that no longer exists, or the definer/invoker lacks the required privileges. The “Corrupt” status indicates this is an Error-level finding that blocks the upgrade.

Remediation

Identify the invalid references and remove or fix them:

  • Drop the invalid View if no longer needed.
  • Or recreate the missing base table so the View becomes valid again.

Check item

(Warning)

Issues reported by ‘check table x for upgrade’ command

  • db2.t_2 – Row size too large (> 8126). Changing some columns to TEXT or BLOB or using ROW_FORMAT=DYNAMIC or ROW_FORMAT=COMPRESSED may help. In current row format, BLOB prefix of 768 bytes is stored inline.

Description

The table uses COMPACT or REDUNDANT row format, where each BLOB/TEXT column stores a 768-byte prefix inline. With many such columns, the total row size exceeds InnoDB’s 8126-byte page limit.

Remediation

Changing some columns to TEXT or BLOB or using ROW_FORMAT=DYNAMIC or ROW_FORMAT=COMPRESSED may help. This is a Warning item and does not block the upgrade, but INSERT or UPDATE operations that exceed the row size limit may fail after the upgrade.

Note: Changing the row format or altering column types triggers a table rebuild. For large tables, consider running during a maintenance window or using an online DDL tool.

Check for deprecated or removed plugin usage

Check item

Check for deprecated or removed plugin usage.

  • <plugin_name> – deprecated/removed plugin detected.

Description

RDS for MySQL does not support installing the plugins that the precheck scans for (authentication_fido, keyring_file, keyring_encrypted_file, keyring_oci), so this item cannot be reproduced. If detected, remove the reported plugin(s) and retry the upgrade.

Remediation

Uninstall the plugin(s) if found.

Checks for errors in column definitions

Check item

Checks for errors in column definitions. Identifies column definitions that may not be supported in future versions of MySQL.

  • db1.t3.id – The column is of type DOUBLE and has the AUTO_INCREMENT flag set, this is no longer supported.
  • db1.t4.id – The column is of type FLOAT and has the AUTO_INCREMENT flag set, this is no longer supported.

Description

Using DOUBLE or FLOAT with AUTO_INCREMENT is no longer allowed in MySQL 8.4. This check and “Use of AUTO_INCREMENT for DOUBLE and FLOAT data types” point to the same affected tables.

Remediation

Change the column type to an integer type (for example: BIGINT).

Check for allowed values in system variables

Check item

Check for allowed values in system variables. The following system variables are using values that are not allowed.

  • ssl_cipher – Error: The system variable ‘ssl_cipher’ is set to ‘DHE-DSS-AES256-SHA’, allowed values include: ECDHE-ECDSA-AES128-GCM-SHA256, ECDHE-ECDSA-AES256-GCM-SHA384, ECDHE-RSA-AES128-GCM-SHA256, ECDHE-RSA-AES256-GCM-SHA384, ECDHE-ECDSA-CHACHA20-POLY1305, ECDHE-RSA-CHACHA20-POLY1305, ECDHE-ECDSA-AES256-CCM, ECDHE-ECDSA-AES128-CCM, DHE-RSA-AES128-GCM-SHA256, DHE-RSA-AES256-GCM-SHA384, DHE-RSA-AES256-CCM, DHE-RSA-AES128-CCM, DHE-RSA-CHACHA20-POLY1305

Description

Checks whether system variables use values that are no longer allowed in MySQL 8.4.

Remediation

Find non-default system variables and change the value(s).

Checks for partitions by key using columns with prefix key indexes

Check item

Checks for partitions by key using columns with prefix key indexes. Indexes on column prefixes are not supported for key partitioning, they are ignored by the partition function and so they are not allowed as of 8.4.0. This check identifies tables with partitions defined this way, they should be fixed before upgrading to 8.4.0. More information: Partitioning limitations

  • db1.t5 – Error: the db1.t5 table uses partition by KEY using the following columns with prefix index: name.

Description

Tables partitioned by key using columns with prefix key indexes are no longer compatible in MySQL 8.4.

Remediation

If the partition key column uses a prefix index (for example: KEY(col(10))), remove the prefix and use the full column, or redesign the partition strategy (for example: switch to RANGE or HASH).

Use of AUTO_INCREMENT for DOUBLE and FLOAT data types

Check item

Use of AUTO_INCREMENT for DOUBLE and FLOAT data types. Starting with MySQL version 8.4.0, the use of AUTO_INCREMENT is deprecated for DOUBLE and FLOAT data types.

  • db1 – Don’t use AUTO_INCREMENT with DOUBLE and FLOAT data types.

Description

This is the RDS-specific check for the same issue as “Checks for errors in column definitions”. Both checks point to the same affected tables.

Remediation

Same as “Checks for errors in column definitions” — change the column type to an integer type.

RDS checked for the use of non-inclusive language in SQL statements

Check item

RDS checked for the use of non-inclusive language in SQL statements. Starting with MySQL version 8.4.0, the use of non-inclusive language in SQL statements isn’t allowed.

  • [ROUTINE_SCHEMA, ROUTINE_NAME]
  • db1,sp1 – Don’t use SQL statements that contain MASTER or SLAVE.

Description

Starting with MySQL 8.4.0, non-inclusive language (for example: MASTER, SLAVE) in SQL statements is no longer allowed.

Remediation

Replace non-inclusive keywords:

  • MASTER → SOURCE
  • SLAVE → REPLICA
  • SHOW MASTER STATUS → SHOW BINARY LOG STATUS
  • SHOW SLAVE STATUS → SHOW REPLICA STATUS

memcached plugin needs to be uninstalled before upgrade

Check item

memcached plugin needs to be uninstalled before upgrade. Starting with MySQL version 8.3.0, the InnoDB memcached plugin is no longer supported.

  • [name]
  • daemon_memcached – Remove the MySQL memcached support option from your DB instance and then retry the upgrade.

Description

The InnoDB memcached plugin is no longer supported starting with MySQL 8.3.0. Remove it before upgrading.

Remediation

Remove the MEMCACHED option from the Option Group associated with the DB instance.

DB instance must have enough free disk space

Check item

DB instance must have enough free disk space. The DB instance must have at least 2 GiB free space for an upgrade to MySQL 8.0.30 and above.

  • Make sure you have more than 2147483648 bytes available for the DB instance.
  • Scale storage so that there is at least 2 GiB free space before retrying the upgrade.

Description

The DB instance must have at least 2 GiB of free disk space for upgrades to MySQL 8.0.30 and above.

Remediation

Scale up the storage so that the instance has more than 2 GiB of free storage space (FreeStorageSpace).

Detect system objects created as tables in sys schema

Check item

Detect system objects created as tables in sys schema.

  • <table_name> – system object created as a table in sys schema.

Description

Detects whether system objects in the sys schema have been created as regular tables. These objects can conflict during the upgrade process.

Remediation

If results are returned:

1. Back up the data from the table.

2. Remove the sys.<table_name>.

Key compatibility changes in MySQL 8.4

The following changes most commonly affect RDS for MySQL upgrades:

mysql_native_password deprecation: MySQL 8.4 defaults to caching_sha2_password. Existing users with mysql_native_password continue to work after the upgrade, but newly created users will use caching_sha2_password. To change the default authentication plugin, modify the authentication_policy parameter in your DB parameter group. If you do not plan to keep using mysql_native_password, consider switching affected users to caching_sha2_password on your MySQL 8.0 instance and validating application connectivity before proceeding with the upgrade.

Applications using older client drivers that do not support caching_sha2_password will fail to connect with the error: Authentication plugin ‘caching_sha2_password’ cannot be loaded. The following minimum driver versions are required:

Query affected users:

restrict_fk_on_non_standard_key: New parameter in MySQL 8.4 (default ON) that blocks CREATE TABLE and ALTER TABLE statements from creating foreign keys on non-unique or partial keys. This does not affect existing foreign keys or the upgrade itself. It only applies to data definition language (DDL) operations after the upgrade. If your application creates or modifies foreign keys at runtime, set this parameter to OFF in your MySQL 8.4 parameter group, or adjust your DDL statements accordingly.

memcached plugin removal: If enabled in your option group, disable it before upgrading (the RDS pre-upgrade validation can also flag this as an Error).

Parameter default value changes in MySQL 8.4

Beyond the compatibility changes above, MySQL 8.4 also changes the default values for several system variables. If you have explicitly set some of these in your MySQL 8.0 parameter group, verify whether your values need to carry over to the MySQL 8.4 parameter group. If you are using the engine defaults, be aware of the new behavior. The following table only includes parameters that are modifiable in the RDS for MySQL 8.4 parameter group.

Parameter 8.0 default 8.4 default
group_replication_consistency EVENTUAL BEFORE_ON_PRIMARY_FAILOVER
group_replication_exit_state_action READ_ONLY OFFLINE_MODE
innodb_adaptive_hash_index ON OFF
innodb_buffer_pool_instances 8 (or 1 if innodb_buffer_pool_size < 1 GB) MAX(1, #vcpu/4)
innodb_change_buffering all none
innodb_io_capacity 200 10000
innodb_io_capacity_max 200 2 × innodb_io_capacity
innodb_numa_interleave OFF ON
innodb_page_cleaners 4 innodb_buffer_pool_instances
innodb_parallel_read_threads 4 MAX(#vcpu/8, 4)
innodb_read_io_threads 4 MAX(#vcpu/2, 4)

You can check which parameters have been explicitly modified in your current parameter group:

For the complete list of changes, see Amazon RDS for MySQL LTS version 8.4 is now generally available.

Go/no-go checklist

Before proceeding to Phase 2, confirm:

Phase 2: Blue/Green Deployments (minimize downtime)

Amazon RDS Blue/Green Deployments lets you upgrade the major engine version, without affecting production. When you’re ready, you promote the staging environment to become the new production database with downtime typically five seconds or lower for single-Region configurations. This makes Blue/Green Deployments a practical choice for major version upgrades with minimal disruption. The following sections walk you through creating Blue/Green Deployments, upgrading the green environment to MySQL 8.4, and performing the switchover.

Prechecks

Before creating the deployment, confirm:

RDS console showing binlog retention hours configuration result

Create and upgrade the green environment

  1. Create the Blue/Green Deployments: Select the same engine version as the source instance initially. This avoids a failed automatic upgrade requiring you to rebuild the entire deployment.

RDS console showing Blue/Green Deployment creation with same engine version selected

  1. After creation completes, manually upgrade the green instance to MySQL 8.4.x. Select your pre-configured MySQL 8.4 parameter group, or the default parameter group if you’re not customizing any parameters.

RDS console showing the green instance modification to upgrade to MySQL 8.4

RDS console showing MySQL 8.4 parameter group selection during green instance upgrade

  1. If the upgrade fails, check the green instance’s PrePatchCompatibility.log and Recent events for details. Resolve the issues and retry.
  2. (Optional) If you plan to set up a rollback path in Phase 3, verify that binlog retention on the green instance (MySQL 8.4) is set to at least 24 hours before switchover. Otherwise, binlog files might be purged after switchover, making reverse replication impossible.

Switchover

Before switching over:

  • Confirm ReplicaLag on the green instance is near zero.
  • Check for long-running queries on the blue instance (SHOW PROCESSLIST).
  • Verify both environments show Available status.
  • Verify that the application isn’t caching DNS, or if it is then the cache TTL is 5 seconds or less.
  • Complete the rollback setup in Phase 3 first

Set the switchover timeout to your acceptable downtime window (maximum 60 minutes). During switchover, existing connections are dropped.

RDS console showing the Blue/Green Deployment switchover configuration with timeout setting

RDS console showing the switchover confirmation dialog

After a successful switchover, RDS renames the old instance with a -old1 suffix and sets it to read-only. The new MySQL 8.4 instance takes the original name and endpoint. RDS automatically disconnects replication at this point. Blue/Green Deployments is a one-time operation. You can delete the Blue/Green Deployments at your convenience; doing so does not delete the old instance. If you do not need a rollback path, the upgrade is complete at this point.

Important: Note the binlog position recorded after switchover completed, you will need it for the rollback setup in Phase 3. Following these steps to find the binlog position after switchover:

  1. Open the Amazon RDS console and select your new MySQL 8.4 instance.
  2. Open the Logs & events tab and search for Binary log. Note the last recorded binlog file name and position. For example, in our test environment, the values are mysql-bin-changelog.000129 and position 1116.

RDS console Logs and events tab showing the binary log file name and position after switchover

Phase 3: Rollback preparation

Set up reverse replication from the new MySQL 8.4 instance to the old MySQL 8.0 instance after switchover, so you have a working rollback path if issues arise post-upgrade. This approach uses change data capture (CDC) to synchronize only incremental changes. The key input is the binlog position recorded in the switchover event (Logs & eventsRecent events on the new MySQL 8.4 instance). Use this position to establish reverse replication back to the old 8.0 instance.

The following table provides a brief comparison between Option A (native replication) and Option B (AWS DMS) to help you choose the approach that best fits your environment:

Criteria Native Replication AWS DMS
Setup complexity Lower Higher
Additional cost None DMS instance cost
LOB handling Automatic Requires DMS task configuration
When to use Direct network connectivity available Complex environments or when native replication isn’t feasible

Option A: Native replication (preferred)

Migration between RDS for MySQL instances is a homogeneous data migration. We recommend native replication as the preferred approach, because native database migration tools generally provide better performance and accuracy compared to AWS DMS.

After switchover, you do not need to change the read_only parameter to 0 on the MySQL 8.0 -old1 instance. In replication setups, it can be useful to enable read_only on replica servers to make sure that replicas accept updates only from the source server and not from clients.

  1. Retrieve the binlog position from the new MySQL 8.4 instance (Logs & events → Recent events).
  2. On the MySQL 8.4 instance, create a replication user.
  1. On the MySQL 8.0 -old1 instance, configure replication using the binlog position from step 1:

MySQL console output showing successful replication configuration with rds_set_external_master and rds_start_replication

  1. Verify with SHOW REPLICA STATUS\G and confirm the following:
    • Replica_IO_Running: Yes
    • Replica_SQL_Running: Yes
    • Seconds_Behind_Master is decreasing, approaching 0, and is not -1 or NULL

SHOW REPLICA STATUS output confirming Replica_IO_Running and Replica_SQL_Running are both Yes

Option B: AWS DMS

If native replication is not suitable for your environment, you can use AWS DMS as an alternative to set up the rollback path.

Because the MySQL 8.0 -old instance is set to read-only after switchover, change the read_only parameter to 0 in its parameter group before creating the task.

  • Step 1: Create a DMS replication instance
    • Instance class & Allocated storage: based on workload
    • Engine version: use the default
    • Virtual private cloud (VPC): use the same VPC as both the MySQL 8.4 (source) and MySQL 8.0 (target) instances. Note: If the source and target instances are in different VPCs, you need to configure VPC peering or another network connectivity option before creating the replication instance.
  • Step 2: Create DMS endpoints
    • Source endpoint: use the MySQL 8.4 instance endpoint (after switchover)
    • Target endpoint: use the old MySQL 8.0 -old1 instance endpoint
    • Test both endpoint connections from the replication instance before proceeding
  • Step 3: Create a CDC-only DMS task
    • Migration type: CDC only (ongoing replication)
    • CDC start position: enter the binlog position from the switchover event (Logs & eventsRecent events on the new MySQL 8.4 instance)
    • Include LOB columns in replication: if your tables contain LOB columns, consider using Limited LOB mode for faster replication. This mode truncates LOB value that exceeds the configured Maximum LOB size, so you need to set it appropriately.
    • Maximum LOB size (KB): query the maximum LOB size in your instance first, then set accordingly:
  • Table mappings: either explicitly include the databases you want to synchronize, or use % to include every database and exclude system databases (mysql, sys, information_schema, performance_schema).
  • Step 4: Monitor the task
    • Once the task status shows Replication ongoing, CDC synchronization is active and the rollback path is ready.

Once you have validated that the new RDS for MySQL 8.4 environment is working as expected, you can stop the DMS replication task and delete the old RDS for MySQL 8.0 instance.

Clean up

After completing the upgrade and confirming your application is running normally on MySQL 8.4, you should remove the resources created during this process to avoid incurring additional charges. Follow these steps in the AWS Management Console or the AWS Command Line Interface (AWS CLI):

  • Deleting Blue/Green Deployments (deleting the deployment does not delete the DB instances).
    1. Open the Amazon RDS console and choose Databases.
    2. Select the Blue/Green Deployment → Choose Actions → ‘Delete’.
  • Snapshot-restored RDS instance used for pre-upgrade checks.
  • The old RDS for MySQL 8.0 -old1 instance (after you confirm rollback is no longer needed). For more AWS CLI options, see Deleting a DB instance.
    1. Open the Amazon RDS console and choose Databases.
    2. Select the instance → Choose ‘Actions’ → ‘Delete’.
  • Amazon EC2 instance used for running prechecks (if applicable). For more AWS CLI options, see Terminate Amazon EC2 instances.
    1. Open the Amazon EC2 console and choose Instances.
    2. Select the EC2 instance used for running prechecks → Choose ‘Instance state’ → ‘Terminate (delete) instance’.
  • AWS DMS replication instance, endpoints, and task (if you used Option B for rollback).
    • Tasks:
      1. Open the AWS DMS console and choose Tasks.
      2. Select the DMS task used for rollback.
      3. If the task is still running, choose ‘Actions’ → ‘Stop’.
      4. After the task has stopped, choose ‘Actions’ → ‘Delete’.
    • Endpoints:
      1. Choose Endpoints.
      2. Select the source and target endpoints → Choose ‘Actions’ → ‘Delete’.
    • Deleting a replication instance:
      1. Choose Provisioned instances.
      2. Select the replication instance → Choose ‘Actions’ → ‘Delete’.

Conclusion

In this post, we walked through a three-phase approach for upgrading Amazon RDS for MySQL 8.0 to 8.4: interpreting pre-upgrade compatibility checks, performing the upgrade with minimal downtime using Amazon RDS Blue/Green Deployments, and setting up a rollback path with reverse replication. These phases give you a tested, reversible upgrade path with minimal downtime, reducing the risk that typically stalls major version upgrades. With MySQL 8.0 standard support ending on July 31, 2026, we encourage you to start planning your upgrade and validating your workloads against MySQL 8.4’s new features and performance improvements. If you need to continue using MySQL 8.0 beyond the standard support end date, see the breakdown of Extended Support costs in Upgrade strategies for Amazon RDS for MySQL 8.0 to 8.4.

To learn more, see Upgrading a MySQL DB instance engine version and Using Amazon RDS Blue/Green Deployments. If you have questions or feedback, leave a comment on this post.


About the authors

Nicky Hsu

Nicky Hsu is a Cloud Support Engineer II at AWS Support and a Subject Matter Expert in RDS for MySQL and Aurora PostgreSQL. With 3 years at AWS and 8 years of experience in the IT industry, he specializes in providing technical support for Amazon RDS for MySQL, RDS for MariaDB, Aurora MySQL, RDS for PostgreSQL, Aurora PostgreSQL, and AWS DMS, helping customers troubleshoot issues and implement tailored solutions.

Planet for the MySQL Community

Upgrade strategies for Amazon RDS for MySQL 8.0 to 8.4

https://d2908q01vomqb2.cloudfront.net/887309d048beef83ad3eabf2a79a64a389ab1c9f/2026/05/14/DB5369.png

Amazon Relational Database Service (Amazon RDS) for MySQL 8.0 is planned to reach end of standard support on July 31, 2026. To keep your databases up to date with bug fixes, security enhancements and the latest features, upgrade to MySQL version 8.4 before the current version reaches end of standard support. This is a major engine version upgrade.

When a new version is released for RDS for MySQL, you can choose how and when to upgrade your DB instances. However, major engine version upgrades can be complex and risky, introducing changes that might not work with existing applications and raising concerns about downtime and compatibility.

This post is part of a two-part series on upgrading RDS for MySQL 8.0 to 8.4. Here, we cover the end of standard support timeline, extended support costs, upgrade methods, and key best practices. For a step-by-step implementation guide, see Best practices for upgrading RDS for MySQL 8.0 to 8.4 with prechecks, Blue/Green, and rollback.

RDS for MySQL 8.0 end of standard support timeline

It is important to note that individual minor versions within MySQL 8.0 have their own end of support dates that might occur before the July 31, 2026 deadline. For example, minor version 8.0.28 reached its end of standard support on March 28, 2024. Always verify the support status of your specific minor version to avoid running on an unsupported version before the major version deadline. See the RDS for MySQL release calendar to stay up to date on end of standard support dates for various minor versions within a major version.

You can use the describe-db-engine-versions AWS Command Line Interface (AWS CLI) command to check available versions and their upgrade paths:

To check the available upgrade paths for your specific current version, run:

Excerpt from output:

This command helps you identify valid upgrade targets from your current version, so you can proactively plan your upgrade before end-of-support deadlines approach.

Amazon RDS Extended Support

In September 2023, AWS announced Amazon RDS Extended Support, a paid offering that provides critical security and bug fixes for Amazon Aurora MySQL or Amazon RDS for MySQL major versions for up to three years after the end of standard support date. If you plan to upgrade at a later date, extended support gives you additional time to plan, test and execute major version upgrades without being rushed into an emergency migration.

Extended Support provides flexibility to upgrade on your timeline while maintaining full AWS support for security patches and critical bug fixes. It activates automatically after a major version’s end of standard support date and continues until you upgrade to a supported version, ensuring your databases remain secure and stable during your transition period. Extended Support charges are calculated on a per vCPU per hour basis. See Hardware specifications for DB instance classes to check the vCPUs corresponding to your DB instance class. The charges vary by AWS Region, with pricing structured based on how long a major version remains past its end of standard support date.

For a Multi-AZ deployment, both the primary and the standby instance are billed independently, because each runs the same instance class and carries the same vCPU count. This effectively doubles your Extended Support cost compared to a Single-AZ deployment. See the following example cost table for us-east-1 Region.

Deployment Type Instance class vCPUs count Year 1-2 Year 3
Single-AZ (SAZ) db.r5.xlarge 4 $292/month $584/month
Multi-AZ (MAZ) db.r5.xlarge 8 $576/month $1,152/month
MAZ primary with a SAZ read replica db.r5.xlarge 12 $864/month $1,728/month
MAZ primary with a MAZ read replica db.r5.xlarge 16 $1,168/ month $2,336/month

For accurate, Region-specific pricing, refer to Amazon RDS Extended Support costs in RDS for MySQL pricing. You can also use the AWS Pricing Calculator to model your specific fleet.

Key changes in MySQL 8.4

As with any major version upgrade, there are important changes you should be aware of before upgrading so you can test your database and applications accordingly. MySQL 8.4 introduces meaningful behavioral and configuration changes from 8.0 that require thorough validation before migration. We call out a few important changes in this post. Refer to Changes in MySQL 8.4 as your primary reference for the complete list of changes.

In MySQL 8.0 (on RDS), the default authentication plugin was mysql_native_password. In MySQL 8.4, mysql_native_password is disabled by default, and caching_sha2_password is now the default authentication plugin for RDS for MySQL 8.4. For more information, see MySQL default authentication plugin. You can audit your database users to identify any accounts still using mysql_native_password and update them as needed. Verify that your application drivers and connectors support caching_sha2_password.

MySQL 8.4 has also fully transitioned to SOURCE/REPLICA terminology, replacing the legacy MASTER and SLAVE keywords. As a result, review and update any scripts, automation, or documentation that references the old terminology. In addition, there is an updated list of reserved keywords. See MySQL 8.4 Keywords and Reserved Words for the complete list. If any reserved keywords are present in your workload, they will also be flagged in the upgrade precheck log file discussed in the following section.

Lastly, MySQL 8.4 changed the default values for several InnoDB server system variables. These changes can impact performance and behavior, especially if you are relying on the previous defaults. For example, innodb_adaptive_hash_index changed from ON (MySQL 8.0 default) to OFF (MySQL 8.4 default). For a complete list, see MySQL 8.4 Changed Server Defaults. If you have custom parameter group values set for any of these parameters, evaluate whether those are compatible with the new default settings.

For a comprehensive view of all features added, changed, and deprecated in MySQL 8.4 compared to MySQL 8.0, refer to What Is New in MySQL 8.4 since MySQL 8.0.

Upgrade prechecks

When you initiate an upgrade from RDS for MySQL 8.0 to 8.4 from the AWS Management Console or AWS Command Line Interface (AWS CLI), RDS runs mandatory prechecks automatically in the background to detect any incompatibilities. These prechecks run before the DB instance is stopped for the upgrade, with no downtime during the validation phase.

If prechecks detect incompatibilities, RDS automatically cancels the upgrade and generates an event notification. Your original MySQL 8.0 instance remains unchanged with no downtime. To identify the specific incompatibilities that blocked the upgrade, review the PrePatchCompatibility.log file available in the Logs and events section of the Amazon RDS console. This log provides detailed information about each incompatibility, allowing you to address the issues before reattempting the upgrade. You can also retrieve the log file using the AWS CLI with the describe-db-log-files and download-db-log-file-portion commands.

As a best practice, test the upgrade process before upgrading in production. You can either use snapshot restore or RDS Blue/ Green Deployments to test the upgrade. Alternatively, you can also use RDS read replica and perform the upgrade first on the replica instance to test for changes and then upgrade the primary. Performing these tests not only provides you with upgrade incompatibilities (if any) using the RDS prechecks log file, but also provides you with an estimate of how long it takes for the prechecks to run and the complete the upgrade. The duration of the upgrade varies depending on the workload and number of database objects. Lastly, these prechecks identify incompatibilities in the database objects, such as reserved words in a procedure definition. They do not validate any application-side logic. Therefore, verify how any reserved keywords or unsupported syntax can impact your application. For detailed information regarding all RDS prechecks, refer to Best practices for upgrading Amazon RDS for MySQL 8.0 to 8.4 with prechecks, Blue/Green, and rollback.

You can also review any incompatibilities by running the MySQL community upgrade checker utility.

Available version upgrade choices

In this section, we cover the available upgrade options in order of complexity and downtime impact. Upgrading an RDS instance requires downtime for the duration of the upgrade.

  • In-place upgrade
  • Amazon RDS blue/green deployment
  • Snapshot restore

In-place upgrade

The in-place upgrade is the most straightforward path. RDS upgrades the engine version on the existing instance without creating a new one, preserving the same endpoint, resource ID, and storage volume. Before RDS begins the upgrade, it automatically takes a pre-upgrade snapshot that you can use in case of any issues. An in-place upgrade cannot be canceled once started. For Multi-AZ DB instance deployments, RDS upgrades both the primary and standby instances simultaneously during major version upgrades. If the upgrade fails, RDS attempts an automatic rollback. If you do not have automated backups enabled on your RDS instance, take a manual snapshot as a rollback option before starting the upgrade. Verify that your custom parameter groups and option groups are compatible with the target MySQL version, as deprecated or renamed parameters between versions can cause unexpected behavior post-upgrade. For complete steps, refer to upgrading the MySQL DB engine.

Amazon RDS Blue/Green Deployment

For production environments where minimizing downtime is the top priority, Amazon RDS blue/green deployments are the right choice. Amazon RDS creates the green environment from a snapshot of the blue environment. The feature requires automated backups to be enabled and runs your current production instance (blue) and a fully upgraded staging instance (green) side by side, keeping them synchronized via MySQL binary log replication until you’re ready to switchover. As the storage blocks are being loaded from Amazon Simple Storage Service (Amazon S3), the green environment experiences lazy loading effects. Switchover times have been reduced to under five seconds, delivering near-instantaneous cutover with minimal application impact. For switchover, RDS drops existing connections and blocks new connections, performs a final sync, and automatically renames the green endpoints to match the blue endpoints. This eliminates any application configuration changes. If issues surface after the switchover, you can manually switch back to the blue environment, because the original instance remains running and available throughout. The additional infrastructure cost during the deployment window is the primary trade-off, but for risk-averse production upgrades, the near-zero downtime and immediate rollback capability make blue/green the strongest option available.

Snapshot restore

Snapshot restore is a fully manual process best suited for dev/test environments or scenarios where complete isolation between the old and new versions is required. You take a manual snapshot of the source instance, restore from it while selecting the target MySQL 8.4 version, and the upgrade runs as part of the restore operation. The original instance remains untouched throughout, giving you a clean rollback path. To roll back, redirect traffic back to the original endpoint if the upgraded instance has issues. The downside is that the new restored instance gets a different endpoint, requiring application configuration updates, and read replicas must be manually recreated. Restore and upgrade time for large databases can range from minutes to hours, making this approach impractical for production cutover windows but valuable for pre-production validation.

Rollback after failure to upgrade

For an in-place upgrade you can restore the pre-upgrade snapshot taken by RDS in case of any issues. Perform additional validation to verify that there is no data loss for any transactions committed after the snapshot was taken. Blue/green deployments offer the strongest rollback: the blue environment stays live throughout, and switching back takes seconds making it the right choice for any production upgrade where immediate recovery is a hard requirement. Additionally, you can also set up manual binlog replication from the green environment to the blue environment to minimize data loss after switchover. Snapshot restore sits in the middle: the original instance is unchanged and available, but you’re managing two separate endpoints and must redirect traffic manually. For a detailed breakdown of rollback behavior across all three methods, refer to the AWS documentation on rollback after failure to upgrade. In all three cases, test and practice your rollback procedure in a non-production environment before the production upgrade window. Check detailed rollback steps in this post.

Summary

The following table summarizes the upgrade options available for Amazon RDS for MySQL:

Method Downtime Rollback Capability Best for
In-place upgrade Minutes to hours (varies by deployment type and DB size) Limited. Automatic rollback might be possible in some cases; snapshot restore is required Environments with an acceptable downtime window
RDS Blue/Green Deployment Typically, under 5 seconds (switchover) Excellent. Immediate switchback to blue environment with no data loss Production upgrades requiring minimal downtime and fast rollback
Snapshot restore Hours (restore + upgrade time) Good — original instance unchanged; endpoint change required for rollback Dev/test validation; isolated upgrade testing before production cutover

Additional considerations

For Multi-AZ deployments, RDS upgrades both the primary and standby instances simultaneously during the maintenance window, resulting in downtime for the duration of the upgrade process. Blue/green deployments replicate the Multi-AZ configuration in the green environment automatically. If your DB instance uses in-Region or cross-Region read replicas, then you must upgrade all the read replicas before upgrading the source instance. In all cases, review your custom parameter groups against the target MySQL version before initiating the upgrade. Deprecated parameters are a common source of post-upgrade surprises that are easier to catch before the upgrade window than after it.

How Amazon RDS for MySQL performs a major version upgrade

When a major version upgrade is invoked on the console or via the AWS CLI or Amazon RDS API, Amazon RDS automation completes the following steps:

  1. Verifies if there are any read replicas that need to be upgraded. Read replicas must be upgraded before the primary instance. If you have cross-Region read replicas, upgrade those first, then proceed to upgrade the primary instance.
  2. Runs mandatory pre-upgrade prechecks. When upgrading from MySQL 8.0 to 8.4, RDS automatically runs prechecks to detect any incompatibilities before the instance is stopped. This step does not cause any downtime. If incompatibilities are found, the upgrade is canceled automatically. The instance remains on MySQL 8.0 with no changes made, and the findings are reported in the PrePatchCompatibility.log file.
  3. Takes a pre-upgrade snapshot (if automated backups are enabled on the RDS for MySQL DB instance). This snapshot serves as a recovery point and can be used for rollback purposes if the upgrade encounters any issues.
  4. Shuts down the instance and prepares it for upgrade. RDS places the DB instance into a maintenance state, stops accepting new connections, and prepares the environment for the binary upgrade.
  5. Upgrades the MySQL binaries and applies data dictionary changes. RDS upgrades the underlying MySQL binaries from 8.0 to 8.4 and runs the necessary internal upgrade routines to update system tables and validate the data directory against the new version.
  6. Brings the instance back online running MySQL 8.4, updates the engine version metadata, and resumes normal operations.

A major version upgrade cannot be rolled back automatically once it has completed successfully. This is why testing the upgrade on a non-production copy (using a read replica, a snapshot restore, or a point-in-time restore) is strongly recommended before upgrading your production instance.

Key considerations and upgrade best practices

The following key best practices will help you plan and execute a smooth upgrade from RDS for MySQL 8.0 to 8.4. This is not an exhaustive list.

  1. Changes in MySQL 8.4 — Before starting, review Changes in MySQL 8.4 to understand behavioral, configuration, and syntax changes. Some important ones are mysql_native_password authentication plugin being disabled by default, the removal of legacy MASTER/SLAVE terminology, and InnoDB server variable default changes.
  2. Test the upgrade— Restore a snapshot of your production instance and run the full upgrade against it. Validate application behavior and review key metrics in Amazon CloudWatch and CloudWatch Database Insights, as well as the slow query log, before making changes in production. For detailed guidance, see Testing an RDS for MySQL upgrade.
  3. New custom parameter group — MySQL 8.4 requires a parameter group family of mysql8.4. Create and configure your new parameter group before the upgrade window, comparing your existing values against the new defaults. Pay particular attention to InnoDB parameters such as innodb_adaptive_hash_index. See Working with parameter groups.
  4. Review precheck log— RDS runs mandatory prechecks automatically, but run them early on a non-production copy. Download and review the PrePatchCompatibility.log file. A non-zero errorCount means the upgrade did not proceed. Resolve all flagged issues before your production window. For more information, refer to RDS prechecks for upgrades.
  5. Upgrade to the latest version — Target the latest minor MySQL 8.4 version to get the most recent security patches and bug fixes. After the upgrade, configure the Auto minor version upgrade setting according to your business requirements.
  6. Handle read replicas (if any)— Upgrade cross-region read replicas first, then in-region replicas, before upgrading the primary. Multi-AZ standby instances are upgraded automatically. Failing to upgrade replicas first will block the upgrade. See Upgrading a MySQL DB instance.
  7. Take a manual snapshot— Even though RDS takes an automatic pre-upgrade snapshot when automated backups are enabled, take an explicit manual snapshot immediately beforehand for a clean, known-good restore point. This also speeds up the automated pre-upgrade snapshot RDS takes during the upgrade. See Considerations for MySQL upgrades.
  8. Blue/Green Deployments for critical workloads — For production workloads where minimizing risk and downtime is a priority, use Amazon RDS Blue/Green Deployments. Upgrade the green environment, validate fully, then switch over in under a minute. The blue environment is retained as a fallback. For detailed guidance, refer to Best practices for Amazon RDS Blue/Green Deployments.
  9. Monitor RDS recommendations — After the upgrade, check the Recommendations section in the Amazon RDS console regularly for actionable, instance-specific guidance on parameter group adjustments, instance class optimizations, and configuration best practices. Additionally, monitor Database Insights for reviewing any changes with the key metrics.
  10. Warm the database — The engine restart evicts all InnoDB buffer pool contents. Before routing full production traffic, run your most commonly executed queries to repopulate the buffer pool with your actual working set. Avoid broad SELECT * queries, which can flood the buffer pool with non-representative data.

Conclusion

In this post we reviewed the MySQL 8.0 end of standard support timelines, extended support charges and the options available to perform an upgrade. Upgrade your RDS for MySQL instances to the latest minor version in MySQL 8.4 before July 31, 2026 to avoid extended support costs. For critical workloads with minimal downtime requirements, use the Amazon RDS Blue/ Green Deployments with a detailed implementation guide (Best practices for upgrading RDS for MySQL 8.0 to 8.4 with prechecks, Blue/Green, and rollback). We recommend that you start by testing the upgrade, reviewing your precheck log, and building familiarity with MySQL 8.4 before performing the upgrade on your production environments.


About the authors

Kiran Mulupuru

Kiran is a Database Specialist Technical Account Manager at Amazon Web Services. She focuses on Amazon RDS and Amazon Aurora databases. She works with enterprise customers, providing technical assistance on database operational performance and sharing database best practices.

Poulami Maity

Poulami is a Database Specialist Solutions Architect at Amazon Web Services. She works with AWS customers to help them migrate and modernize their existing databases to the AWS Cloud.

Shagun Arora

Shagun is a Sr. Database Specialist Solutions Architect at Amazon Web Services (AWS). She works with customers to design scalable, highly available, and secure solutions in the AWS Cloud.

Planet for the MySQL Community

ProxySQL in Front of AWS RDS & Aurora MySQL — Part 5: Monitoring, Tuning, and Troubleshooting

https://reliadb.com/images/og-default.png

The system is running. ProxySQL in front of Aurora, query rules routing reads to replicas, the two-node cluster syncing config in ~600ms, TLS on the backend leg. Four parts to reach this point. The question shifts now: how do you know it’s still working correctly at 3am on a Sunday — and when it isn’t, where do you look?

This part covers the operational layer. The same Lima lab topology from Parts 1–4 runs throughout: dbdeployer MySQL 8.0.41 sandbox (master on port 25001, two replicas on 25002 and 25003), ProxySQL 2.7.3 on proxysql-1 and proxysql-2, backends in HG 10 (writer) and HG 20 (readers). No AWS resources — Part 5 is fully local. The Aurora-specific captures referenced in Section 2 are reused from Parts 2 and 4, cited explicitly.

The System from Parts 1–4 — and What Part 5 Adds

Part 1 made the placement decision. Part 2 wired ProxySQL to Aurora’s native topology discovery — mysql_aws_aurora_hostgroups, REPLICA_HOST_STATUS, 2 errors across 1,485 queries through a live failover. Part 3 built the query routing layer: mysql_query_rules, the ordering rule for SELECT ... FOR UPDATE, transaction_persistent, and the exact conditions that break multiplexing. Part 4 tested the full HA stack under pressure — Aurora at T0+15s, RDS Multi-AZ at T0+64s, TLS footguns in auto-discovery, and NLB health check timing that says 90 seconds in the docs but measured 110 in the lab.

Part 5 adds three things those parts explicitly deferred: the monitoring layer you query to know the system is healthy, the tuning decisions now grounded in observed behavior, and the recovery path for the most common non-obvious production failure mode.

The Monitoring Layer: What to Watch and Why

Three tables cover health at different granularities. Together they answer: is my proxy routing correctly, how loaded is my backend pool, and did my Aurora topology discovery run cleanly?

stats_mysql_query_digest: Workload Shape and Latency Baselines

-- Tested on Lima VMs, MySQL 8.0.41 via dbdeployer, ProxySQL 2.7.3, 2026-05-10

stats_mysql_query_digest accumulates per-digest statistics for every query ProxySQL routes. The primary uses are identifying slow queries by total time spent and reading the shape of your workload — which hostgroups are receiving traffic and in what proportions.

The two captures below show the same ProxySQL instance under two workload patterns. Both are correct behavior. The point is that monitoring tells you which shape you have; your job is to verify it matches what you expect.

Shape A — Transactional workload with transaction_persistent=1 (Part 5 lab, 20 sysbench threads, 300s oltp_read_write):

-- stats_mysql_query_digest on proxysql-1 — top 5 by sum_time, 300s oltp_read_write load
-- Note: all queries land in HG 10 — see analysis below
hostgroup  schemaname  digest_text                                count_star  avg_time_us
10         lab_test    SELECT c FROM sbtest1 WHERE id=?           193270      1936
10         lab_test    SELECT c FROM sbtest2 WHERE id=?           193150      1929
10         lab_test    SELECT c FROM sbtest4 WHERE id=?           191770      1940
10         lab_test    SELECT c FROM sbtest3 WHERE id=?           192110      1927
10         lab_test    COMMIT                                      77046      4280

Every query — including the SELECT statements — landed in HG 10 (the writer). HG 20 (readers) received 644 total executions against 247,593 for HG 10. This is not a routing misconfiguration. The oltp_read_write workload wraps every statement inside BEGIN … COMMIT. With transaction_persistent=1 set on the app user, ProxySQL pins all queries in a detected transaction to the same hostgroup — the writer, where the transaction opened. The reads never had an opportunity to fan out to replicas because the transaction boundary kept them anchored.

The corresponding connection pool snapshot confirms this:

-- stats_mysql_connection_pool on proxysql-1 (mid-load, 20 sysbench threads)
hostgroup  srv_host       srv_port  status  ConnUsed  ConnFree  ConnOK  Latency_us
10         192.168.105.6  25001     ONLINE  20        0         20      2489
20         192.168.105.6  25002     ONLINE   0        12        12      2556
20         192.168.105.6  25003     ONLINE   0         9         9      2445

ConnUsed=20 on HG 10 — one backend connection per sysbench thread, held for the duration of each active transaction. ConnUsed=0 on both replicas — they’re healthy and connected, but receiving no queries.

Shape B — Idle multiplexing baseline (from the Part 3 lab, 100 Python threads, SELECT 1, no session state — reused for comparison):

-- stats_mysql_connection_pool on proxysql-1 (T0+10s, 100 idle frontends, no session state)
-- Source: Part 3 A.5 capture — reused for baseline comparison
hostgroup  srv_host       srv_port  ConnUsed  ConnFree
10         192.168.105.6  25001     0         100
20         192.168.105.6  25002     0          42
20         192.168.105.6  25003     0          61

One hundred frontend sessions open. Zero backend connections pinned to any of them. After SELECT 1 completed, ProxySQL returned all connections to the free pool — the frontends are connected, they just don’t hold a MySQL thread on the other side.

The diagnostic question these two shapes answer: if you expect reads to distribute across replicas but HG 20 shows ConnUsed=0 and stats_mysql_query_digest shows all executions in HG 10, check transaction_persistent first, then check whether your ORM or application wraps reads inside explicit transactions. Both shapes above represent correct behavior for their respective workloads. The monitoring tells you which one you’re looking at.

The two queries to run routinely:

-- Top 10 queries by total time spent — identifies slow-query candidates
SELECT hostgroup, schemaname, username, digest_text,
       count_star,
       ROUND(sum_time / count_star) AS avg_time_us,
       max_time
FROM stats_mysql_query_digest
ORDER BY sum_time DESC
LIMIT 10;

-- Per-hostgroup execution distribution — reveals workload shape
SELECT hostgroup,
       COUNT(DISTINCT digest)    AS unique_queries,
       SUM(count_star)           AS total_executions
FROM stats_mysql_query_digest
GROUP BY hostgroup
ORDER BY hostgroup;

Rising avg_time_us on a digest that was previously stable is the early slow-query signal. Unexpected hostgroup skew — all traffic in HG 10 when you expect a 70/30 read split — tells you to check transaction_persistent or your query rules before blaming the backends.

mysql_server_aws_aurora_log: Topology Detection and Gap Alerting

monitor.mysql_server_aws_aurora_log is the only table that shows Aurora topology discovery in real time. ProxySQL writes a row every check_interval_ms for each backend it polls. It’s the authoritative record of whether ProxySQL is successfully reading INFORMATION_SCHEMA.REPLICA_HOST_STATUS — and how long each poll took.

-- monitor.mysql_server_aws_aurora_log — healthy polling pattern (3-row excerpt)
-- Source: Part 2 live Aurora lab capture, 2026-05-08 (reused — no AWS resources in Part 5)
SELECT check_utc, hostname, is_writer_per_replica_host_status AS writer_detected, lag_ms
FROM monitor.mysql_server_aws_aurora_log
ORDER BY check_utc DESC
LIMIT 3;
check_utc               hostname (polled)                           writer_detected              lag_ms
2026-05-08 10:53:16     proxysql-aurora-EXAMPLE-writer.EXAMPLE...  proxysql-aurora-EXAMPLE-writer   0
2026-05-08 10:53:14     proxysql-aurora-EXAMPLE-reader.EXAMPLE...  proxysql-aurora-EXAMPLE-writer   0
2026-05-08 10:53:12     proxysql-aurora-EXAMPLE-reader.EXAMPLE...  proxysql-aurora-EXAMPLE-writer   0

A healthy pattern: rows appear at roughly check_interval_ms intervals, writer_detected is consistent across all rows in a given window, and lag_ms stays low or zero. The 6-second detection gap from the Part 2 and Part 4 failover captures appeared in this table exactly: ProxySQL polled on schedule throughout, but Aurora’s backends were unreachable mid-promotion, so no rows appear between 10:53:10 and 10:53:16. Section 3 covers how to size check_interval_ms against that observed promotion floor.

Detection gap alerting rule: alert when no successful poll row appears for more than 2×check_interval_ms. At check_interval_ms=2000, that’s a 4-second silence. Any gap longer than that means ProxySQL either can’t reach the backend or Aurora’s control plane is mid-promotion. This is the right threshold to wire into your monitoring system — not a static time value, but a function of your configured polling interval.

SCHEMA NOTE: mysql_server_aws_aurora_log lives in the monitor schema, not in main or stats. Use SELECT ... FROM monitor.mysql_server_aws_aurora_log. The table main.mysql_server_aurora_log does not exist in ProxySQL 2.7.3. This footgun was documented in Part 2 during the auto-discovery setup.

stats_mysql_connection_pool and stats_mysql_processlist: Pool Headroom

stats_mysql_connection_pool answers the connection budget question: how close am I to exhausting the backend pool? The ConnUsed / (ConnUsed + ConnFree) ratio is the headroom metric. The mid-load capture from the Part 5 lab shows the pattern for a transactional workload:

-- stats_mysql_connection_pool on proxysql-1 (mid-load, 20 threads, oltp_read_write)
-- Tested on Lima VMs, MySQL 8.0.41 via dbdeployer, ProxySQL 2.7.3, 2026-05-10
hostgroup  srv_host       srv_port  status  ConnUsed  ConnFree  ConnOK  ConnERR  Queries   Latency_us
10         192.168.105.6  25001     ONLINE  20        0         20      0        181493    2489
20         192.168.105.6  25002     ONLINE   0        12        12      2280     116       2556
20         192.168.105.6  25003     ONLINE   0         9         9      2230     83        2445

A few readings worth calling out. ConnERR of 2,280 and 2,230 on the replicas are artifacts of earlier lab sessions, not live failures — verify by watching whether they increment during active load. If ConnERR climbs alongside a ConnUsed spike, that’s a backend connectivity problem. If it’s static, it’s historical noise.

Latency_us is the proxy-measured round-trip for health checks to each backend. Rising latency on one backend before rising ConnERR is the early warning signal: the backend is struggling before it starts failing checks. At 2,489µs on the master and ~2,500µs on the replicas, latency is healthy and symmetric in this capture.

Pool headroom alert threshold: flag when ConnUsed / (ConnUsed + ConnFree) > 0.8 sustained for more than 30 seconds on any hostgroup. Below 0.5 at steady state is healthy. Above 0.8 means you’re approaching the connection ceiling — either raise max_connections in mysql_servers, add a backend, or reduce transaction_persistent scope if the workload allows it.

stats_mysql_processlist gives the live per-session view — which hostgroup each frontend session is currently assigned to and what command it’s running:

-- stats_mysql_processlist on proxysql-1 (mid-load snapshot)
SELECT SessionID, user, db, hostgroup, command, time_ms, info
FROM stats_mysql_processlist
ORDER BY time_ms DESC
LIMIT 10;

During the Part 5 sustained load, all 20 sessions showed hostgroup=10 with a mix of Execute and Sleep states. A session in Sleep with time_ms climbing means it’s holding an open backend connection without issuing queries — the cost of transaction_persistent=1 in a slow-consumer application. Use processlist during incidents to see exactly which sessions are holding pool resources and which queries are actively executing.

Production Sizing: Polling Intervals, Lag Thresholds, Multiplexing

The right values for these variables don’t come from the ProxySQL docs. They come from your own observed promotion time and workload shape. Here’s how to derive them from the data Parts 2–4 already captured.

check_interval_ms: Sizing Against the Promotion Floor

The detection latency formula from Part 2’s detection math section:

detection latency = Aurora internal promotion time (~6s, opaque to ProxySQL)
                  + at most one check_interval_ms cycle

Aurora’s internal promotion time is the floor — ProxySQL was polling on schedule throughout both the Part 2 and Part 4 failovers, but the backends were simply unreachable while Aurora was mid-promotion. Lowering check_interval_ms below 1000ms adds polling load on Aurora’s INFORMATION_SCHEMA without meaningfully reducing detection latency — the floor is Aurora’s promotion time, and that’s set by instance class and cross-AZ replication state, not polling frequency.

check_interval_ms controls the worst-case additional lag on top of that floor:

check_interval_ms Worst-case detection Typical use case
2000ms (2s) ~8s Most production workloads — low overhead, tight detection
5000ms (5s) ~11s Cost-sensitive setups; 3s of additional lag vs. 2000ms is acceptable for many apps
10000ms (10s) ~16s Background or batch Aurora clusters where sub-15s detection isn’t required

The Part 2 lab used check_interval_ms=2000; Part 4 used 5000. Both labs produced identical detection floors because the constraint was Aurora’s ~6-second internal promotion, not polling frequency. Choose based on the detection window your application’s connection pool and retry logic can tolerate — not on the assumption that faster polling reduces the floor. check_timeout_ms must also remain below check_interval_ms and at or below 3000ms (ProxySQL 2.7.3 enforces this with a CHECK constraint; a silent INSERT failure is the symptom if you exceed it, as documented in Part 4).

Lag Thresholds: max_replication_lag vs max_lag_ms

These are two different columns in two different tables with different units. Conflating them produces a config that looks correct but either does nothing or clips reads far more aggressively than intended.

Column Table Unit Scope What it controls
max_replication_lag mysql_servers seconds Standard MySQL replication SHUNNED when Seconds_Behind_Source > max_replication_lag
max_lag_ms mysql_aws_aurora_hostgroups milliseconds Aurora only (REPLICA_HOST_STATUS) Excludes reader from HG when replica_lag_in_milliseconds > max_lag_ms

The footgun: max_lag_ms=600000 in mysql_aws_aurora_hostgroups means 600000 milliseconds (= 600 s = 10 minutes of acceptable Aurora replica lag) — a generous lab default from Part 2; the column name carries the _ms unit. The sibling knob mysql_servers.max_replication_lag is in whole seconds for standard replication lag. Copying the numeric literal 600000 from max_lag_ms into max_replication_lag does not mean “10 minutes”; it means 600000 seconds (about 7 days). Your replicas would have to lag roughly a week before ProxySQL excluded them from routing.

Lab result for max_replication_lag: with max_replication_lag=2 set on replica2 (port 25003) and the replica’s SQL thread stopped, Seconds_Behind_Source returns NULL. ProxySQL treats NULL as 60 seconds of lag by default — so a stopped SQL thread suddenly looks like a 60-second-lagging replica even though the underlying data is fine. The variable mysql-monitor_slave_lag_when_null=60 controls this; size it based on how tolerant your application is of reads from a replica whose SQL thread is stopped.

With slave_lag_when_null=60 and max_replication_lag=2, replica2 transitioned to SHUNNED within one monitor_replication_lag_interval cycle (10 seconds) after the SQL thread was stopped. The status was SHUNNED, not OFFLINE_SOFT — that’s the actual ProxySQL 2.7.3 behavior for lag-threshold violations. Reads stopped routing to replica2 immediately; replica1 absorbed them cleanly.

Recovery after START REPLICA SQL_THREAD: replica2 returned to ONLINE in approximately 72 seconds — about 7× the 10-second monitor_replication_lag_interval, as the lag counter drained across multiple polling cycles before ProxySQL confirmed it was clear. Recovery time is bounded by monitor_replication_lag_interval × monitor_replication_lag_count polling cycles, not by a fixed timeout.

transaction_persistent and Multiplexing Variables

The Part 5 sysbench capture (all traffic in HG 10) makes the transaction_persistent tradeoff concrete. With transaction_persistent=1, queries inside an open transaction stay on the writer. This is correct for application accounts that hold real transactions — the alternative, allowing in-transaction reads to jump to a replica, would route a SELECT to a server that doesn’t yet have the transaction’s uncommitted writes visible, which produces inconsistent reads without any error. Don’t set transaction_persistent=0 for application accounts that use explicit transactions or that issue DML.

Set transaction_persistent=0 for analytics or reporting accounts that connect, run a read, and disconnect — no open transactions, no consistency hazard. This is the same analytics user pattern from Part 3.

Two monitor variables worth knowing for the lag and health check rhythm:

  • mysql-monitor_ping_interval=10000 (10s default): how often ProxySQL sends COM_PING to each backend on existing connections. With mysql-monitor_ping_max_failures=3, three consecutive ping failures trigger SHUNNED — a 30-second window of consistently-failing pings before a backend is excluded.
  • mysql-wait_timeout=28800000 (8 hours): how long ProxySQL keeps backend connections alive. This means a credential change on the MySQL side doesn’t immediately invalidate existing ProxySQL connections — they continue using the cached auth until the connections cycle out or a new connection attempt fails. Section 5 covers exactly what this looks like when it’s the monitor user whose credentials change.

Rolling Upgrade Runbook

LAB NOTE: ProxySQL 2.7.3 was the latest available 2.7.x package in our apt repository at time of writing — there was no newer minor version to upgrade to. The runbook below is the canonical drain/upgrade/restore procedure for any binary upgrade; only the apt-get install proxysql=2.7.X version string changes. We executed the full cycle on both nodes to verify timing and zero-error behavior on a properly-configured client.

ZERO ERRORS, ~25 SECONDS PER NODE: A Linux MySQL 8.0 client running queries against both ProxySQL nodes throughout the upgrade window saw 0 errors across 40 requests during post-upgrade verification. Per-node cycle from drain to restored: ~26 seconds on node 1, ~20 seconds on node 2. The surviving node handled all traffic seamlessly during each drain window.
Note: a macOS MySQL 9.5 client in the test harness produced ERROR 2059 (HY000): Authentication plugin 'mysql_native_password' cannot be loaded errors — the mysql_native_password.so plugin was removed from MySQL 9.x. These are client-side errors unrelated to ProxySQL behavior, confirmed by parallel testing from the Linux client which saw 0 errors.

Step 1 — Capture pre-upgrade baseline. Record the version and connection pool state on both nodes before touching anything. If something goes wrong during the upgrade, this snapshot is your reference point.

-- Pre-upgrade version baseline on both nodes (proxysql-1 shown)
-- Tested on Lima VMs, ProxySQL 2.7.3, 2026-05-10
SELECT @@version;
-- 2.7.3-12-g50b7f85

SELECT hostgroup, srv_host, srv_port, status, ConnUsed, ConnFree
FROM stats_mysql_connection_pool
ORDER BY hostgroup, srv_port;

Step 2 — Start background traffic. Run a SELECT loop from your client against both ProxySQL nodes simultaneously. Log every response with a timestamp — this is the evidence trail that quantifies the upgrade’s error window. In production, your application’s existing traffic serves this purpose; in a maintenance window, an explicit probe script gives you a clean record.

Step 3 — Drain proxysql-1. In production: deregister proxysql-1 from the NLB target group first (NLB default connection draining: 30 seconds). Wait for in-flight connections to finish, then stop the service. The NLB routes all new connections to proxysql-2 from the moment the target is deregistered. In the lab, where there’s no NLB, stopping the service directly simulates this:

# Drain proxysql-1 (lab simulation of NLB target deregistration + service stop)
# Production: deregister from NLB first, wait for connection draining, then stop
sudo systemctl stop proxysql

Verify proxysql-1 is unreachable on port 6033 and proxysql-2 is serving normally before proceeding. Part 4’s NLB section covers the 110-second real-world detection window versus the theoretical 90-second threshold — size your drain window accordingly.

Step 4 — Upgrade the binary.

# Upgrade ProxySQL binary (replace 2.7.X with your target version)
sudo apt-get install proxysql=2.7.X

Step 5 — Start the service and verify cluster sync. After systemctl start proxysql, the restarted node bootstraps from its peer automatically — given a populated proxysql_servers table and matching cluster credentials, it fetches the current runtime config from proxysql-2 within the cluster’s check_interval_ms window (~600ms in our lab from Part 4).

# Start ProxySQL after upgrade; cluster sync bootstraps from peer automatically
sudo systemctl start proxysql
-- Verify cluster sync on the restarted node: runtime_mysql_servers should match proxysql-2
SELECT hostgroup_id, hostname, port, status
FROM runtime_mysql_servers
ORDER BY hostgroup_id, port;

If runtime_mysql_servers shows the master (port 25001) in both HG 10 and HG 20 after restart, that’s expected: mysql-monitor_writer_is_also_reader=true places the master in both the writer and reader hostgroups. It’s not a routing anomaly — it reflects the ProxySQL default that allows reads to land on the writer when both replicas are lagging or SHUNNED.

If mysql_servers doesn’t arrive on the restarted node, check whether admin-cluster_mysql_servers_sync_algorithm=1 (delta mode) is set and the node has no sync baseline — the bootstrap footgun from Part 4. Set it to 0 temporarily to force a full pull, then restore 1.

Step 6 — Re-register with NLB (production step) and spot-check traffic through the upgraded node.

Step 7 — Repeat for proxysql-2.

Lab timing for both nodes:

Step proxysql-1 proxysql-2
Service stopped 21:01:27Z 21:02:56Z
Service back online 21:01:53Z 21:03:16Z
Total cycle ~26s ~20s

The per-node time includes the apt-get install step. In a real upgrade where the package download is already cached, the binary swap itself takes under 5 seconds — the remaining time is service start, monitor thread initialization, and cluster sync confirmation.

Troubleshooting: When Something Goes Wrong

The most common non-obvious production failure with ProxySQL follows a specific pattern: backends appear SHUNNED or errors start climbing, the instinct is to check Aurora or the MySQL backends directly, but the actual cause lives in a ProxySQL internal table that most DBAs don’t check first. Here’s the diagnostic sequence that surfaces it quickly.

The worked example is monitor user credential revocation. ProxySQL connects to each backend using the monitor user (set via mysql-monitor_username and mysql-monitor_password) to run health checks — COM_PING, SHOW REPLICA STATUS for replication lag, SHOW GLOBAL VARIABLES LIKE 'read_only' for writer detection. If those credentials break — password rotation without updating ProxySQL, a permission change by someone who didn’t know the monitor user was load-bearing — every health check against every backend starts failing simultaneously.

The troubleshooting flowchart:

Symptom: rising errors or SHUNNED backends in runtime_mysql_servers
                │
                ▼
   Step 1: Check runtime_mysql_servers
   ──────────────────────────────────────────────────────
   All ONLINE?
     YES → backend health is fine → check query rules
           and stats_mysql_query_digest for routing anomalies
     NO (SHUNNED present) → continue ↓
   ──────────────────────────────────────────────────────
                │
                ▼
   Step 2: Check mysql_server_connect_log
   ──────────────────────────────────────────────────────
   SELECT hostname, port, time_start_us,
          connect_success_time_us, connect_error
   FROM monitor.mysql_server_connect_log
   ORDER BY time_start_us DESC LIMIT 20;

   connect_error = NULL?
     YES → connect checks are clean → go to ping_log
     "Access denied for user 'monitor'" → FOUND IT
   ──────────────────────────────────────────────────────
                │
                ▼
   Step 3: Confirm with mysql_server_ping_log
   ──────────────────────────────────────────────────────
   SELECT hostname, port, time_start_us,
          ping_success_time_us, ping_error
   FROM monitor.mysql_server_ping_log
   ORDER BY time_start_us DESC LIMIT 20;

   Same "Access denied" pattern? → confirms monitor credentials
   "Gone away" / timeout? → backend connectivity problem
   ──────────────────────────────────────────────────────
                │
                ▼
   Step 4: Verify the monitor user directly on the backend
   ──────────────────────────────────────────────────────
   mysql -h <backend-host> -P <port> -u monitor -p'<pass>' \
     -e "SHOW REPLICA STATUS\G"

   Access denied → confirm which grant is missing
   ──────────────────────────────────────────────────────
                │
                ▼
   Step 5: Restore
   ──────────────────────────────────────────────────────
   On the MySQL backend (run on master; replicated to replicas):
     GRANT REPLICATION CLIENT ON *.* TO 'monitor'@'%';
     GRANT SELECT ON sys.* TO 'monitor'@'%';
     FLUSH PRIVILEGES;

   If the password changed on the MySQL side, also update ProxySQL:
     SET mysql-monitor_password='<new-pass>';
     LOAD MYSQL VARIABLES TO RUNTIME;
     SAVE MYSQL VARIABLES TO DISK;
   ──────────────────────────────────────────────────────

MONITOR-USER REVOCATION: DIAGNOSTIC ORDER

  1. Check runtime_mysql_servers for SHUNNED backends — this is the symptom, not the cause.
  2. Check monitor.mysql_server_connect_log ordered by time_start_us DESC. Look at connect_error. "Access denied for user ‘monitor’" on every recent row is the smoking gun.
  3. Check monitor.mysql_server_ping_log — the same "Access denied" pattern appears here once existing cached backend connections cycle out.
  4. Test the monitor user directly from the ProxySQL node: mysql -h <backend> -u monitor -p'<pass>' -e "SHOW REPLICA STATUS\G" — confirms which specific privilege is missing.
  5. Restore: GRANT the missing privilege on the MySQL backend, FLUSH PRIVILEGES, and if the password changed on the MySQL side, update mysql-monitor_password in ProxySQL global_variables and LOAD MYSQL VARIABLES TO RUNTIME.

What the lab capture shows. After changing the monitor user’s password to an incorrect value on the MySQL backend, monitor.mysql_server_connect_log filled with this pattern on the next connect-check cycle:

-- monitor.mysql_server_connect_log — credential failure in progress
-- connect_interval=60s; errors appear once per cycle on each backend
hostname       port   time_start_us       connect_success_time_us  connect_error
192.168.105.6  25001  1778361407360863    0   Access denied for user 'monitor'@'proxysql-1' (using password: YES)
192.168.105.6  25002  1778361406680516    0   Access denied for user 'monitor'@'proxysql-1' (using password: YES)
192.168.105.6  25003  1778361408042762    0   Access denied for user 'monitor'@'proxysql-1' (using password: YES)

All three backends, same error, every connect-check cycle. That pattern — not one backend, not an intermittent error, but every backend on every cycle — points directly at the monitor credentials, not at backend connectivity.

Auth failures don’t trigger SHUNNED instantly. The connect check fires every mysql-monitor_connect_interval (default 60s), and ProxySQL needs mysql-monitor_ping_max_failures consecutive ping failures before it formally SHUNs a backend. What you see first is the connect log filling with "Access denied" entries — one per polling cycle per backend. That window, from first error to formal SHUNNED, is your diagnostic opportunity. The signal is clear and early; the cascade is gradual by design. A credential problem that would cause a full SHUNNED state on all backends gives you several minutes of warning in mysql_server_connect_log before client traffic starts seeing widespread errors.

Recovery is equally bounded by the polling cycle. After restoring the correct credentials on the MySQL backend, the connect log showed a clean entry 11 seconds after the GRANT was restored — that’s wherever in the 60-second polling cycle the next connect check happened to fire. The range is 0 to mysql-monitor_connect_interval (60s default); expect recovery on the next monitor poll after the fix is applied.

ProxySQL in production and something doesn’t look right?

The monitoring layer takes minutes to query and hours to interpret if you don’t know what healthy looks like. If you’re seeing SHUNNED backends, rising ConnERR, or Aurora detection gaps you can’t explain, a 30-minute call usually narrows it to one root cause.

Book Free Assessment →

What’s Not in This Series

Three topics adjacent to this series are real and important. Each deserves its own treatment.

Sharding. ProxySQL supports basic query-level sharding — routing by schema boundary or by a rule that hashes a user ID into a destination hostgroup. For simple cases this works. For production sharding at scale, with consistent cross-shard transactions and managed schema migrations, this is Vitess territory. ProxySQL’s sharding support is a routing primitive, not a sharding framework.

Multi-region Aurora + ProxySQL. Aurora Global Database places a writer in one region and reader clusters in others, with sub-second replication lag. ProxySQL in front of a Global Database deployment is a different configuration: mysql_aws_aurora_hostgroups scoped per-region, topology discovery that stays local while the primary region is healthy, and failover coordination when a secondary region is promoted to writer. This series covers single-region Aurora only.

PostgreSQL ProxySQL HA. ProxySQL speaks MySQL wire protocol. For PostgreSQL, the equivalent stack is different: see the ProxySQL PostgreSQL HA series which covers the same placement-to-operations arc for PostgreSQL backends.

Across Five Parts

Across five parts, you’ve built and operated a production-representative ProxySQL + Aurora MySQL topology: decided where the proxy layer goes and why, wired it to Aurora’s native topology discovery, tuned query routing and multiplexing against real workload patterns, tested HA under a live failover with a measured error count, and now have the monitoring queries and runbooks to operate it day-to-day.

Three things adjacent to this series worth exploring from here: the Aurora Performance Insights layer for correlating ProxySQL digest data with query execution inside the database engine itself; the ProxySQL Prometheus exporter for time-series dashboards that alert on ConnUsed headroom and detection gaps without manual polling; and slow query log parsing to match ProxySQL’s stats_mysql_query_digest patterns against Aurora’s slow log and identify the same queries from both sides of the proxy.

If you’re standing up ProxySQL in front of RDS or Aurora MySQL and want a second pair of eyes before production traffic, book a free assessment.

M

Mario — ReliaDB

ReliaDB is a specialist DBA team for PostgreSQL and MySQL performance, high availability, and cloud database optimization. More about ReliaDB →

Planet for the MySQL Community

ProxySQL in Front of AWS RDS & Aurora MySQL — Part 4: HA, Failover Patterns, and TLS

https://reliadb.com/images/og-default.png

Two API calls, same region, same lab session. Aurora writer failover: ProxySQL marked the writer SHUNNED at T0+8.7 seconds. Topology fully inverted at T0+15 seconds. Three errors across 2,020 queries through two ProxySQL nodes — the same order of magnitude as Part 2‘s 2-in-1,485 result from a single node. RDS Multi-AZ failover: 64 seconds from the same API trigger to AWS "completed." No ProxySQL topology change required, because there’s no topology to change — the standby isn’t readable, and the endpoint is DNS-based.

The 4× gap between 15 seconds and 64 seconds isn’t a sizing difference or an artifact of how the failover was triggered. It’s structural. Aurora’s ProxySQL integration reads INFORMATION_SCHEMA.REPLICA_HOST_STATUS directly, on its own polling schedule, independent of the AWS control plane. RDS Multi-AZ failover goes through DNS — and DNS propagation through resolver caches takes as long as it takes, regardless of what ProxySQL is doing.

Parts 1–3 built to this point: the placement decision, the Aurora wiring, and the query routing layer. This part is where the HA system is tested under pressure. ProxySQL Cluster sync timing and footguns, NLB health check reality versus the theoretical window, Aurora dual-node failover, RDS Multi-AZ as the baseline comparison, backend TLS with a footgun hiding in auto-discovery, and query mirroring’s measured latency impact. Everything here comes from a live AWS lab session.

Lab provenance: Tested live on Aurora MySQL 3.12.0 / MySQL 8.0.44 and RDS MySQL 8.0.40 (Multi-AZ), ProxySQL 2.7.3 in us-east-1 on 2026-05-09. All AWS resources destroyed post-capture. Estimated cost: ~$0.65.

ProxySQL Cluster Sync: Timings, Footguns, and Limits

The ProxySQL Cluster propagates configuration automatically between peers — it’s the mechanism that lets you apply a query rule change to one node and have it arrive on the other without a manual push. But the sync model has specific failure modes under fresh nodes and concurrent edits that are worth understanding before you rely on it during an incident.

What Syncs and How Fast

Three variables control sync cadence: admin-cluster_check_interval_ms sets how often each node polls its peers for config checksums; admin-cluster_mysql_query_rules_diffs_before_sync (and the equivalent for mysql_servers, mysql_users, and global_variables) sets how many consecutive checksum differences must be detected before sync fires. With check_interval_ms=200 and diffs_before_sync=3, the minimum propagation window is 3 × 200ms = 600ms.

Lab measurements, inserting a new query rule on proxysql-1 and polling proxysql-2 for arrival:

Config type Measured latency Notes
mysql_query_rules 614ms
mysql_servers 620ms
global variable (mysql-monitor_ping_interval) 597ms
mysql_users ~600ms see note

The mysql_users row warrants a note. runtime_mysql_users stores every user twice — one frontend row (for client authentication against ProxySQL) and one backend row (for ProxySQL’s own authentication against MySQL backends). A correctness check expecting COUNT(*)=1 after sync would always fail, because the synced state is COUNT(*)=2. That’s the correct count, not a sync failure. Verify sync by comparing the count on the peer against the count on the source node — they should match.

Disk persistence works per-node. Each node writes its runtime state to its own sqlite3 database when SAVE ... TO DISK is called. Cluster sync operates at the runtime layer only — it doesn’t sync disk state between nodes. If proxysql-2’s sqlite3 is wiped and proxysql-2 restarts, it comes back with an empty config regardless of what proxysql-1’s disk contains.

admin-cluster_mysql_servers_sync_algorithm=1 is the right setting for steady-state operation. It propagates only deltas — the diff between the current table state and the last state recorded from each peer. Efficient, low-overhead, and correct under normal conditions.

The failure mode is specific: a fresh node that has never seen any state has no baseline. There are no deltas to compute against nothing. The sync mechanism produces nothing, and the fresh node stays empty indefinitely.

Lab result: wiped proxysql-2’s sqlite3 while keeping proxysql_servers intact so the cluster could still communicate. With algorithm=1, mysql_query_rules synced at 592ms — the cluster had tracked that table across prior sessions and had a baseline to diff against. mysql_servers never arrived on proxysql-2, even after 60+ seconds of polling. ProxySQL’s diff engine had no previous state to compare against for mysql_servers on the fresh node, so it computed no changes and pushed nothing.

Fix: set admin-cluster_mysql_servers_sync_algorithm=0 on both nodes and reload. Full-pull mode disregards the diff baseline and copies the complete table from the peer with the highest epoch. Proxysql-2 had a populated mysql_servers within 3 seconds. Once you verify runtime_mysql_servers is correct on the bootstrapped node, restore algorithm=1.

-- Tested: ProxySQL 2.7.3, Lima VMs, 2026-05-09
-- Temporary fix for bootstrapping a fresh or wiped node
SET admin-cluster_mysql_servers_sync_algorithm=0;
LOAD ADMIN VARIABLES TO RUNTIME;

-- After confirming the fresh node has synced (check runtime_mysql_servers):
SET admin-cluster_mysql_servers_sync_algorithm=1;
LOAD ADMIN VARIABLES TO RUNTIME;
SAVE ADMIN VARIABLES TO DISK;

algorithm=0 for initial bootstrap; algorithm=1 for steady-state. Not both at once, not neither.

Concurrent Writes Don’t Converge Cleanly

The cluster’s conflict resolution model is last-epoch-wins: each config change increments the epoch counter on the node that made it, and peers adopt the config from whichever peer reports the highest epoch. For changes separated in time — the normal case — this works without issue.

What happens when two changes hit different nodes within the same check_interval window? Lab result: changed the same query rule row simultaneously on proxysql-1 (destination_hostgroup=10 at T0+50ms) and proxysql-2 (destination_hostgroup=20 at T0+102ms). Both changes fell inside a single 200ms polling cycle. After 5 seconds, neither node had adopted the other’s value. Config checksums had diverged and stayed diverged — proxysql-1 at 0x82176C666CE36C22, proxysql-2 at 0x23250DE02EC63EED.

The root issue: when both nodes increment their epochs within the same check cycle, neither sees the other as definitively higher before committing its own change. The split-brain persisted well past the "a few hundred milliseconds" convergence that the docs describe for normal propagation.

The operational fix is architectural, not a tuning parameter. Pick one ProxySQL node as the exclusive target for all config changes. Apply changes there, let cluster sync carry them to the peer, and confirm arrival before proceeding. Never write conflicting changes to both nodes simultaneously and rely on sync to reconcile the result.

NLB Health Checks: Why Your Runbook Says 90 Seconds But Should Say 110

An NLB in front of the ProxySQL pair routes around a failed node once its health checks reach the unhealthy_threshold. Standard configuration for this topology: TCP target group on port 6033, unhealthy_threshold=3, interval=30s. The theoretical detection window: 3 × 30s = 90 seconds.

The measured result was 110 seconds. The extra 20 seconds isn’t NLB slowness — it’s EC2 shutdown time. When an EC2 instance stops, there’s a graceful shutdown sequence before the OS releases the network stack and the TCP port becomes unreachable. The NLB’s health check timer starts only once connections are actively refused — it can’t detect "instance is shutting down," only "TCP connection failed." The additional 20-second gap between "shutdown initiated" and "connections refused" doesn’t shorten regardless of how you tune the health check thresholds.

Budget 110–120 seconds in your runbooks and alert thresholds, not 90. If your application has connect timeouts shorter than the actual detection window, you’ll see errors before the NLB reroutes. If they’re longer, the reroute happens first and the error disappears silently. The 110-second real-world number is the right one to design against.

Once the NLB marks a target unhealthy, all new TCP connections route to the surviving ProxySQL node. That node needs no configuration change — it already holds its share of backend connections and continues routing normally. Applications using connection pools with retry logic reconnect within one cycle. The surviving node handles the increased load without any config intervention.

Aurora Dual-Node Failover: 4× Faster Than RDS Multi-AZ

Scenario ProxySQL detection Full resolution AWS “completed” event Traffic errors
Aurora + dual ProxySQL T0+8.7s (SHUNNED) T0+15s (topology flip) T0+22s ~3 / 2,020 queries
RDS Multi-AZ (single ProxySQL HG) n/a — DNS-based T0+64s T0+64s not captured

AWS event timestamps drive the RDS comparison. Per-second client error counts for the RDS leg require manual inspection of the raw traffic log — the automated analysis script did not extract them. The 64-second figure comes directly from AWS event timestamps and is the reliable number.

Aurora resolves approximately 4× faster — 15 seconds versus 64 seconds — and without touching DNS. ProxySQL reads INFORMATION_SCHEMA.REPLICA_HOST_STATUS directly and reroutes at the TCP connection level. There’s no resolver cache between ProxySQL and the Aurora instance endpoints.

Aurora ProxySQL Configuration for Dual Nodes

The mysql_aws_aurora_hostgroups configuration adds a dedicated HG pair (200/201) for Aurora alongside the existing Lima MySQL hostgroups (HG 10/20). The key values and their constraints:

-- Tested: ProxySQL 2.7.3, Aurora MySQL 3.12.0 / MySQL 8.0.44, us-east-1, 2026-05-09
INSERT INTO mysql_aws_aurora_hostgroups (
  writer_hostgroup, reader_hostgroup, active,
  aurora_port, domain_name,
  max_lag_ms, check_interval_ms, check_timeout_ms,
  writer_is_also_reader, new_reader_weight,
  comment
) VALUES (
  200, 201, 1,
  3306, '.CLUSTER-EXAMPLE.us-east-1.rds.amazonaws.com',
  60000, 5000, 3000,
  1, 100,
  'part4-aurora'
);

Two values are worth flagging from experience. check_timeout_ms=3000 is the enforced upper bound — ProxySQL 2.7.3 has a CHECK constraint requiring 80 <= check_timeout_ms <= 3000. Exceeding it produces a silent failure: the INSERT returns success but the row isn’t accepted and the Aurora monitor thread never starts. If discovery isn’t happening, verify your check_timeout_ms first. Second: domain_name must start with a dot (.CLUSTER-EXAMPLE..., not CLUSTER-EXAMPLE...). The same constraint enforces this; INSERT without the leading dot fails with a constraint error.

After LOAD MYSQL SERVERS TO RUNTIME, the monitor thread polls the seed endpoint and auto-discovers the writer and reader within two polling cycles:

-- runtime_mysql_servers on proxysql-1 (~10s after LOAD TO RUNTIME)
hostgroup_id  hostname                                              port  status
200           mysql-ha-test-aurora-writer.CLUSTER-EXAMPLE...  3306  ONLINE
200           mysql-ha-test-aurora.cluster-CLUSTER-EXAMPLE... 3306  ONLINE  ← seed
201           mysql-ha-test-aurora-reader.CLUSTER-EXAMPLE...  3306  ONLINE

Both proxysql-1 and proxysql-2 show identical topology within seconds of configuration. The reason isn’t what you might expect.

CLUSTER SYNC ISN’T WHAT MAKES BOTH NODES CONVERGE. Both ProxySQL nodes reached identical Aurora topology at T0+15s. That looks like cluster sync working. It isn’t. Each ProxySQL node runs its own independent Aurora monitor thread that queries INFORMATION_SCHEMA.REPLICA_HOST_STATUS on its own polling schedule. Both threads read the same Aurora metadata — the same SESSION_ID=MASTER_SESSION_ID flip — and independently update their own runtime_mysql_servers. No config change was propagated between nodes during the failover. The convergence is emergent from two monitors reading the same source of truth, not from ProxySQL cluster propagation.

This matters in production: Aurora topology detection works correctly even when cluster sync is degraded or lagging behind. The Aurora monitor subsystem and the ProxySQL Cluster sync subsystem are orthogonal — they share no state and don’t depend on each other. Design your monitoring with that distinction in mind.

The Failover Timeline

Triggered with a single AWS CLI command. T0 is the moment the API accepted the request:

# -- Trigger Aurora cross-AZ failover -- T0 = API accepted (15:32:07 CEST)
aws rds failover-db-cluster --db-cluster-identifier mysql-ha-test-aurora
Marker   Elapsed   Event
------   -------   -----
T0       +0.0s     failover-db-cluster API accepted
         +3.1s     Aurora: "Started cross AZ failover to aurora-reader"
         +8.7s     Writer endpoints SHUNNED on both proxysql-1 and proxysql-2
         +15.0s    Topology flip: aurora-reader → HG 200 ONLINE, old writer → HG 201
         +22.2s    Aurora: "Completed customer initiated failover to aurora-reader"

Two milestones, not one. ProxySQL marked the old writer SHUNNED at T0+8.7s — it knew the writer was unreachable. The full topology flip — old writer demoted to HG 201, new writer promoted to HG 200 — landed at T0+15s, on the next successful poll of REPLICA_HOST_STATUS after Aurora completed the internal promotion. SHUNNED means "I can’t reach this backend right now." The topology flip means "I’ve confirmed, from Aurora’s own metadata, who the new writer is." The gap between them is Aurora’s internal promotion time — the same floor discussed in Part 2’s detection math section. ProxySQL was polling on schedule throughout; the backends were simply unreachable while Aurora was mid-promotion.

The SHUNNED state explains why errors appeared at ~T0+6s, before SHUNNED was recorded at T0+8.7s. ProxySQL routes a query to a backend before detecting its failure — the SHUNNED state is set after a health check fails, not preemptively. Queries that attempted new handshakes in the window between when the writer became unreachable and when ProxySQL formally recorded it got the handshake failure directly.

Three Errors Across 2,020 Queries

Two load generators ran simultaneously — one through proxysql-1:6033, one through proxysql-2:6033 — at 5 SELECTs/second each throughout the failover window.

Proxysql-1: 1 error. Proxysql-2: 2 errors. All three were ERROR 2013 (HY000): Lost connection to server at 'handshake: reading initial communication packet' — new connection attempts during the SHUNNED window that hit an Aurora backend mid-promotion. Queries running on existing multiplexed backend connections continued without error. ProxySQL’s connection multiplexing kept those backend connections alive through the SHUNNED window; only new handshakes failed.

The comparison to Part 2: that run used a single ProxySQL node and recorded 2 errors in 1,485 queries (~0.13% error rate). This run used two nodes and recorded ~3 errors across 2,020 queries (~0.15% error rate). Adding a second ProxySQL node didn’t change the error ratio in any meaningful way. The bottleneck isn’t the proxy layer — it’s Aurora’s internal promotion window, which is identical regardless of how many ProxySQL nodes are in front of it.

RDS Multi-AZ: The 64-Second Baseline

The RDS Multi-AZ configuration in ProxySQL is simpler than Aurora’s: one hostgroup (HG 300), one server row, no topology discovery table. The endpoint DNS is the failover mechanism — after the standby promotes, the same DNS name resolves to the new primary. ProxySQL doesn’t need to change anything.

-- HG 300 for RDS Multi-AZ — single endpoint, no topology discovery needed
INSERT INTO mysql_servers (hostgroup_id, hostname, port, comment)
VALUES (
  300,
  'mysql-ha-test-rds-multiaz.CLUSTER-EXAMPLE.us-east-1.rds.amazonaws.com',
  3306,
  'rds-multiaz'
);
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;

Failover triggered via reboot-db-instance --force-failover. T0 is the moment the API accepted:

Marker   Elapsed   Event
------   -------   -----
T0       +0s       reboot-db-instance --force-failover (15:37:16 CEST)
         +10s      AWS: "Multi-AZ failover started"
         +25s      AWS: "DB instance restarted"
         +64s      AWS: "Multi-AZ instance failover completed"

Post-failover, runtime_mysql_servers shows HG 300 ONLINE again with the same hostname and port. From ProxySQL’s perspective, the backend went unavailable and came back. No hostgroup changes, no topology flip, no epoch updates required.

Two limitations worth naming explicitly. First, the standby isn’t readable during normal operation. You pay for a full second instance — same CPU, memory, and storage class — and receive zero read capacity in return. ProxySQL can distribute reads across Aurora reader instances; it can’t read from an RDS Multi-AZ standby because Aurora’s architecture makes it deliberately inaccessible. Second, the 64-second resolution window goes through DNS — ProxySQL isn’t aware of the topology change until the DNS TTL expires and the endpoint resolves to the new primary. The 4× failover gap and the zero read capacity are the two concrete reasons to prefer Aurora when active read scaling and fast application-visible failover both matter.

TLS on ProxySQL-to-Aurora Backend Connections

What Aurora MySQL Actually Enforces

Aurora MySQL 3 supports TLS and most clients negotiate it automatically — the default cipher from the Lima VMs was TLSv1.3 / TLS_AES_256_GCM_SHA384. But "supports by default" is not "enforces by default." In this lab cluster, require_secure_transport=OFF is the factory setting, and a connection opened with --ssl-mode=DISABLED succeeded with an empty Ssl_cipher. Enforcement requires require_secure_transport=ON set explicitly at the cluster parameter group level. Don’t assume Aurora encrypts backend connections without checking the parameter group.

Configuring ProxySQL for Backend TLS

Two ProxySQL settings control proxy-to-server TLS. mysql-have_ssl=true enables TLS capability on backend connections where TLS can be negotiated. mysql-ssl_p2s_ca provides the CA bundle ProxySQL uses to verify the server certificate. Together they give you encrypted, certificate-verified backend connections.

Aurora provides a CA bundle containing all AWS certificate authorities. Download it and copy it to the ProxySQL VM:

# Download the AWS RDS global CA bundle (162KB, 108 CAs as of May 2026)
curl -o /tmp/aurora-global-bundle.pem \
  https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem

sudo mkdir -p /etc/proxysql/certs
sudo cp /tmp/aurora-global-bundle.pem /etc/proxysql/certs/
-- Configure backend TLS on proxysql-1 (cluster sync carries it to proxysql-2)
SET mysql-have_ssl=true;
SET mysql-ssl_p2s_ca='/etc/proxysql/certs/aurora-global-bundle.pem';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;

Per-server TLS is also controlled by use_ssl in mysql_servers: setting use_ssl=1 on a row tells ProxySQL to use TLS for that specific backend. The interaction with Aurora’s auto-discovery is where it gets complicated — and where the footgun lives.

Certificate Verification Behavior

Direct-connection tests to Aurora to verify behavior before relying on ProxySQL config:

Test ssl-mode flag CA bundle Result
Default TLS none none TLSv1.3 / TLS_AES_256_GCM_SHA384
TLS disabled --ssl-mode=DISABLED none Accepted — Ssl_cipher empty
VERIFY_CA — correct bundle --ssl-mode=VERIFY_CA aurora-global-bundle.pem SUCCESS — TLSv1.3
VERIFY_CA — wrong CA --ssl-mode=VERIFY_CA system CAs (no AWS roots) ERROR 2026: certificate verify failed

AUTO-DISCOVERY DEFEATS PER-SERVER TLS SETTINGS — VERIFY IN runtime_mysql_servers. When mysql_aws_aurora_hostgroups is active, the Aurora monitor thread populates runtime_mysql_servers with rows for every discovered endpoint. Those auto-discovered rows are created with use_ssl=0, regardless of any use_ssl=1 you set in mysql_servers for manually-inserted entries.

Lab evidence: after UPDATE mysql_servers SET use_ssl=1 WHERE hostgroup_id IN (200,201); LOAD MYSQL SERVERS TO RUNTIME;, the static cluster endpoint seed kept use_ssl=1 in the runtime table. The auto-discovered reader and writer endpoints appeared with use_ssl=0. The Aurora monitor thread owns those rows and resets them on each discovery cycle.

The reliable enforcement path is mysql-have_ssl=true combined with mysql-ssl_p2s_ca. These are global defaults that apply to all backend connections capable of negotiating TLS, including auto-discovered endpoints. Per-server use_ssl is not the enforcement mechanism when Aurora auto-discovery is active. After any Aurora configuration change, verify with:

SELECT hostgroup_id, hostname, port, use_ssl
FROM runtime_mysql_servers
WHERE hostgroup_id IN (200, 201)
ORDER BY hostgroup_id, hostname;

A row showing use_ssl=0 on an auto-discovered endpoint isn’t unencrypted if mysql-have_ssl=true is set — it means TLS is negotiated via the global setting, not the per-server flag. But if mysql-have_ssl is false or unset, that same row is sending plaintext and won’t tell you.

Running Aurora MySQL behind ProxySQL in production?

The TLS footgun in auto-discovery is one of several configuration details that fail silently without surfacing an error. If you want a second pair of eyes on your ProxySQL configuration before it handles production Aurora failovers, book a free assessment call.

Book Free Assessment →

Query Mirroring: No Measurable Impact on the Primary Path

Query mirroring in ProxySQL routes each matched query to a secondary hostgroup simultaneously — fire-and-forget, results discarded, the calling application never knows. The mirror_hostgroup column in mysql_query_rules activates it per rule. This section is the latency measurement.

100 queries with mirroring active to a secondary hostgroup: 3,907ms total / 39ms average. 100 queries without mirroring: 4,081ms total / 40ms average. Delta: −174ms total, −1ms per query — within measurement noise and not a meaningful latency penalty.

ProxySQL sends mirrored queries from a separate goroutine. The primary query’s result is returned to the client immediately; the mirror is dispatched independently with no serialization point on the primary path. If the mirror hostgroup is unavailable, the primary path is unaffected. Enabling mirroring to a staging Aurora cluster or a new version under evaluation doesn’t require a performance conversation with your application team. The only cost is backend resources on the mirror target itself.

What’s Not in This Part

Three topics are deferred to Part 5 by design:

  • Production monitoring — querying stats_mysql_query_digest for latency baselines, alerting on Aurora detection gaps via mysql_server_aws_aurora_log, and dashboards for connection pool headroom → Part 5
  • check_interval_ms production sizing — this part used check_interval_ms=5000; Part 5 covers sizing it based on observed Aurora promotion time from your own event history → Part 5
  • Rolling upgrade runbook — how to upgrade the ProxySQL binary on one node without taking the proxy layer offline, including NLB drain sequencing → Part 5

What’s Next

In Part 5, the system from Parts 1–4 is running and has been through a failover. Part 5 is about knowing it’s healthy: stats_mysql_query_digest for query latency baselining, mysql_server_aws_aurora_log for Aurora detection gap alerting, max_lag_ms tuning for excluding lagging readers under replication pressure, the rolling ProxySQL upgrade sequence, and the production-sizing decisions for check_interval_ms and multiplexing variables that weren’t worth making until the system had been under load.

M

Mario — ReliaDB

ReliaDB is a specialist DBA team for PostgreSQL and MySQL performance, high availability, and cloud database optimization. More about ReliaDB →

Planet for the MySQL Community