You probably don’t need a database per tenant!

https://ollieread.com/storage/og/articles/you-probably-dont-need-a-database-per-tenant.png

You could argue that the term "database per tenant" actually
covers two possible approaches, separate instances and
separate databases in the same instance. However, the majority of the time the term
is used, it is referring to the latter, and is often assumed to be the only way, or at least, the best way, to implement
multitenancy. It’s not. Most of the time at least.

This does, however, raise the question, what is the best way to implement data isolation in multitenancy? The answer,
like most things, is that it depends, but with the caveat that it will most often not be a database per tenant. It
depends on the application, the tenants, the database engine, the requirements, the scope, the resources, the scale, and
so on and so forth. There is no one-size-fits-all approach, though there is no harm in starting with something simple
and then working your way up when necessary. Don’t optimise for situations that may never happen, and don’t
overcomplicate things when you don’t need to.

Rather than me writing some sort of hit piece on "database per tenant", and those that advocate for it, this article
is instead going to cover the spectrum of approaches to data isolation in multitenancy. I’ll be laying it all out,
details and all, with some opinion sprinkled in here and there, and then you can make your own decision on
what is best for your application. The goal of this article is to help you make an informed decision.

Isolation is not separation

The whole assumption that "database per tenant" is the best approach is based on people confusing isolation
and separation. Sure, they’re related, but they are not the same thing. Isolation is an issue of security and
correctness, making it so that Tenant A cannot see Tenant B. Separation is a storage and operational concern, making
it so that Tenant A is stored in a different place to Tenant B. You can achieve isolation through separation,
but you can also achieve isolation without separation.

The spectrum, heaviest to lightest

When it comes to data isolation, there are a number of approaches, each with their own trade-offs. The best way to look
at them all is to consider them on a spectrum, from the heaviest and most isolated, with the highest operational cost,
to the lightest and least isolated, with the lowest operational cost. Developers often end up taking a far heavier
approach to achieve a guaranteed level of isolation, when a lighter approach would have been sufficient.

Below are the most common approaches, listed from heaviest to lightest. Each approach is described in terms of its
isolation, operational cost, how to work with it, and the best fit for its use. Each approach also includes a simple
diagram to illustrate the architecture; here’s a quick legend to help you understand them.

instance / server

shared infrastructure

tenant A

tenant B

Separate instance

With this approach, each tenant has its own database instance, each with its own credentials and connection information.
It is the heaviest, and most isolated approach, but also the most operationally expensive. If you were to ask me,
I’d say that this should be the true approach for "database per tenant".

Isolation

Tenant data is completely physically isolated from other tenants. All processes, resources, networking, and data
are completely separate. Not only will no data leak between tenants, but no tenant can even affect the performance of
another tenant. This is the absolute strongest and best level of isolation you can achieve.

Operations

As you can probably imagine, that means that this is also the most operationally expensive approach. You have
N instances (where N is the number of tenants), to provision, monitor, patch, scale, and back up. Migrating becomes
more complex, having to run N times over N hosts. You won’t run into connection pooling issues, though, and backing
up, restoring, and even fully deleting a tenant is trivial.

Working with it

Cross-tenant queries are virtually impossible, and if you had to do any sort of admin or analytics across tenants, you
would have to build some sort of pipeline and/or aggregation system to do it. Onboarding a new tenant requires you to
provision a new instance, which can be slow and expensive. You also need a secure mechanism for storing the
credentials and connection details for each tenant.

Best fit

This approach is best suited when there is a legal requirement for physical isolation. Since the instances are
completely separate, they can be placed in different regions, which is an important thing for certain regulatory
requirements. It’s also useful for resource isolation when you have a tenant that is a noisy neighbour, and you
want to make sure that it doesn’t affect the performance of other tenants.

Separate database, same instance

This approach is similar to the previous one, except each tenant has its own database within the same instance. This is
a lighter approach, for the most part, and it’s what people mean when they say "database per tenant". It is still a
fairly heavy approach and is often overkill for most applications. This is also the first approach where the database
engine in use is a factor.

Isolation

With this approach there is no resource isolation between tenants, meaning a noisy neighbour tenant can affect
the performance of others. As for data isolation, that depends entirely on the engine you’re using.

  • PostgreSQL: Each database is completely isolated from the others, as cross-database queries are not natively
    supported and are only possible if you’ve gone out of your way to make them so. The isolation here is architectural,
    and it’s the best engine for this approach.
  • MySQL/MariaDB: For these engines, isolation is entirely privilege-based. If you’re sharing a single user
    across all tenants, then there is no true isolation, as cross-database queries are supported through the
    {db}.{table} syntax. The isolation here is only as good as your privilege management.
  • SQLite: Each tenant would have its own file, which is sort of like a separate database, but not really. The
    isolation provided by that is filesystem-based because it can be opened or attached by any process that has access to
    the file.

Operations

Migrations on this approach need to be run N times, once for each tenant. Backing up, restoring, and deleting a
tenant is trivial, as you can just back up, restore, or delete the database. Monitoring is also pretty
straightforward as you’re only dealing with a single instance and can often get stats per database. The engine you’re
using, though, will have an operational impact.

  • PostgreSQL: Connections are per-database and poolers will key on a database and user pair, so connection pools
    cannot be shared across tenants. The more tenants you have, the more connections you’ll need, and the more
    connections you have, the closer you get to the database instance’s connection limit.
  • MySQL/MariaDB: Connections can easily change databases on the fly, so connection pools can be shared across
    tenants. However, that’s only applicable if you’re sharing a single user, which will reduce the data isolation
    significantly. If you’re using dedicated users per tenant, then you’ll have the same connection pooling issues as
    PostgreSQL.
  • SQLite: There is no connection limiting, so you’ll be constrained by file handles and write concurrency.

If you’re using Laravel

Changing the database on the fly won’t cause a reconnection when using MySQL or MariaDB, but will
when using PostgreSQL.

Working with it

Onboarding is the same for every engine, as you just need to create the database and migrate it. Developers will
often make the database name deterministic, so you don’t need to store it and can derive it from some data on the
tenant. If you’re also using separate users per tenant, which you should be, then you’ll need to create those and
securely store the credentials. Administering or aggregating across tenants depends on the engine.

  • PostgreSQL: Cross-database queries are not supported, so you’ll need to build a pipeline or aggregation system
    to do that.
  • MySQL/MariaDB: Cross-database queries are supported, but only if you aren’t using separate users per tenant.
    If you are, then it’s the same as above.
  • SQLite: Cross-database queries are as simple as using ATTACH to load an SQLite database file.

Best fit

Shared users with MySQL and MariaDB get you all the operational cost, with none of the isolation, and SQLite has no real
enforceable isolation at all. So, if you’re using either SQLite or shared users, then this approach is not a good fit
for multitenancy.

If you have users per tenant, or are using PostgreSQL, then this approach works best for small applications with a small
number of tenants, ideally one where the tenants are fully known and controlled, and where the number of tenants is not
expected to grow significantly. Anything larger, or with the possibility of exponential growth, will very quickly hit a
ceiling with connections.

Schemas and/or prefixes

This approach is an interesting one, as it’s really two different approaches, with one that only works on a
particular database engine. You can simplify it by thinking of this approach as namespacing. All the tenants share
the same database, but their data is split by a namespace. For PostgreSQL, that namespace is a schema, and for MySQL,
MariaDB, and pretty much every other engine, that namespace is just a prefix on the table name.

Isolation

This approach isolates data by namespace, which is a much softer form of isolation than the previous approach. In
most cases, the isolation is conventional and must be enforced at the query level. There is also absolutely no
resource isolation between tenants, so noisy neighbours are an issue. The last deciding factor, for isolation,
is the engine in use.

  • PostgreSQL: For this engine, schemas are first-class citizens. This is usually achieved by a single user
    setting the search_path in the connection, which will isolate it to that schema. It’s possible to strengthen the
    isolation even further by having users per tenant and only granting them access to their own schema. This will
    run into the connection pooling issues mentioned in the previous approach, so you need to pick between stronger
    isolation and operational cost. If you’re using a transaction mode connection pooler, there is the risk of
    cross-tenant data leakage.
  • MySQL/MariaDB: Schema only exists in these engines as a synonym for database, so you’d need to use table name
    prefixing. There’s also absolutely no enforcement of isolation here, as the table name is just a string that you
    pass to queries, not even one that can be prepared. This is by far the weakest form of isolation so far.
  • SQLite: This is exactly the same as the above, it’s a weak approach that’s only enforced by the string you
    pass to queries.

Operations

Using a single database means that connection pooling issues are effectively gone, at least, from a per-tenant
perspective. This approach does, unfortunately, still incur the ongoing operational cost of migrations, needing to
be run N times, once for each tenant, regardless of whether it’s schemas or prefixes. There are a few
considerations for the engine in use, though.

  • PostgreSQL: Using schemas doesn’t make migrating any easier, though it’s virtually identical to
    the separate database, same instance approach. Backing up, restoring, and
    deleting tenants is also pretty much the same as that approach. However, the more tenants you get, the more of a
    hit the performance will take.
  • MySQL/MariaDB: Migrations are more complex with prefixes, as with the other approaches, you can just swap the
    current database or schema. With prefixes, the migration engine itself needs to be tenant-aware, and table names
    need to be generated dynamically. Backing up, restoring, and deleting tenants is also more complex, and you have
    to do it at the table level.
  • SQLite: This is the same as the above.

Working with it

Cross-tenant queries are easy, as you can just query across schemas or tables, meaning administration tasks and
analytics are far easier than every other approach so far. Onboarding is also easier than the previous approaches,
though there is still some complexity, as you’re still running all the migrations for the new tenant.

Best fit

This approach best fits applications that aren’t large and need tenant data to be separate, though not
necessarily isolated, and can’t pay the operational cost of the heavier approaches. With the following caveats based
on engine.

  • PostgreSQL: This is really the only engine to use for this approach, with the schema level implementation. If
    you’re looking at tens-to-hundreds of tenants, then this is a good fit.
  • MySQL/MariaDB: Since this approach is only achievable through table name prefixing with these engines, it’s
    not a good fit for multitenancy. If you absolutely need the tenants to be separated, pay the slightly higher
    operational cost of the separate database, same instance approach with users
    per tenant.
  • SQLite: Same as above.

Partitioned tables

Partitioned tables are a feature of some database engines that allow you to split a table into multiple physical
storage tables, based on a partitioning key. This can be used to implement multitenancy by partitioning on a
discriminator column. Without PostgreSQL and RLS, this approach adds
nothing in terms of data isolation, beyond what the below discriminator column approach does.

Isolation

Partitioning is a storage concern, so the isolation is as good as the query convention. If you miss the WHERE
clause with the discriminator column, then you’re querying the entire dataset across all tenant boundaries. There
are, however, additional enforcement mechanisms that can be put in place, depending on the engine.

  • PostgreSQL: This is the only engine that creates real isolation between tenant data, though only if you use
    the RLS approach as well.
  • MySQL/MariaDB: These engines have no concept of RLS, so the isolation is only as good as the query convention,
    enforced at the application level.
  • SQLite: This engine does not support partitioning and has no alternative that isn’t just a query convention,
    which is covered by the discriminator column approach.

Operations

Unlike all other approaches so far, migrations only need to be run once, exactly the same way that you would for a
single-tenant application. It even gets rid of the possible connection pooling issue that has appeared in almost
every approach so far. It does, however, come with some new problems that we’ve not seen before. Partitioning
requires additional thought and planning when it comes to designing your database schema.
The key used to partition the table across all engines that support it must be part of the primary key and
every unique key. This is a particular issue with things like Eloquent, Laravel’s ORM, which has limited support for
composite keys.

Partition management is also a concern, which differs based on the strategy you’ve taken. You can partition by list
or by hash. The former would be a partition per tenant, and the latter would distribute tenants across a fixed set
of partitions. If you’re taking the former approach, you can back up, restore, and delete tenants by deleting their
partition. If you take the latter approach, you lose all of that, and everything is done at the table level.

There are also a few things to consider based on the engine in use.

  • PostgreSQL: This engine has the best support for partitioning. The ability to back up and/or query per
    partition is much nicer. You can also move tenants to different tablespaces, so a hot tenant can be swapped to faster
    storage. There’s also no limit to the number of partitions, though obviously performance will degrade at a certain
    point.
  • MySQL/MariaDB: These engines do not allow partitioned tables to contain foreign keys, which means you need to
    give up on database-level referential integrity and enforce it at the application level. These engines also have a
    hard-cap on the number of partitions, which is 8192. This is a hard limit, and you cannot work around it. You can
    back up partitions, though it’s not as nice as PostgreSQL, and you cannot move partitions.
  • SQLite: N/A

Working with it

Since queries are filtered using a discriminator column, cross-tenant queries are as simple as not adding a WHERE
clause. Administration and aggregation operations become as easy as they would be if you had no multitenancy at all.
This approach has a better performance footprint, as the database engine can optimise queries to only hit the relevant
partitions, rather than scanning the entire dataset. Onboarding depends on the strategy you’ve taken. Partitioning by
list requires you to create a new partition for the new tenant, which is an operation that needs to be done on the
database. Partitioning by hash does not require any additional operations, as the new tenant will be automatically
distributed across the existing partitions.

Best fit

Reach for this approach when you’re using one of the below approaches, but you’ve got a table that is large enough
that the physical layout is important, i.e. per-tenant pruning for speed, and/or per-tenant partition operations for
backups and the likes. This approach is better suited as a scaling decision for the cheaper options, than one for
isolation. Again, there are caveats based on the engine in use.

  • PostgreSQL: This is the ideal home for this approach, as it has the best support for partitioning and allows
    you to couple it with RLS for true isolation and improved performance.
  • MySQL/MariaDB: If you’re using one of these, this approach is only worth it if you have a specific table that
    is oversized and pruning is important. But only if you’re okay without foreign keys, otherwise don’t complicate
    your life, just go with the discriminator column approach.
  • SQLite: N/A

Discriminator column

The discriminator column approach is the lightest and least isolated approach, but also the cheapest and easiest to
implement. It’s also all that most applications need. All tables that contain tenant data have a discriminator column,
typically a foreign key to the tenant’s table, which is used to filter the queries to only return data for the current
tenant. For the purpose of explanation, the discriminator column will be referred to as tenant_id, though in reality
it would have a domain appropriate name, like account_id, organization_id, or company_id. This approach is
also the only one that is truly engine agnostic, as it can be implemented on any engine, even without foreign keys.

Isolation

This approach has the weakest isolation of all the approaches, as it is entirely convention-based. There’s no
enforcement in the database, or in the framework or underlying functionality. It’s something you have to enforce in
your application code, and if you ever miss the WHERE clause, you’re going to get data leakage. That being said,
providing that you’re using a proper testing strategy and ORM, the approach is pretty simple to implement, and is
more often than not, sufficient. There is only one note to be made in regard to the engine.

  • PostgreSQL: If you want to enforce the isolation at the engine level, you can make use
    of row-level security, which is detailed in the next section.

Operations

Migrations are straightforward, as only one version of the schema exists, and it only needs to be run once. Connection
pooling is also not an issue, as all tenants share the same database and connection. Backing up, restoring, and deleting
tenants becomes more complex, as you’ve got to unpick the data for that tenant from all the tables, which is a
non-trivial operation. Monitoring is also a little more difficult, as you can’t monitor per tenant as default, and
instead need to build that in yourself.

There are also a few considerations for this particular approach. The first is to do with indexing. Since all tenants
share the same table, you need to make sure that the discriminator column is part of the index for any queries that
filter on it. The second is to do with table volume. Since all tenants share the same table, the table can grow to be
quite large, which can have a performance impact, especially for smaller tenants. This is the particular problem that
usually leads to people reaching for partitioning.

Working with it

Cross-tenant queries are trivial, you just don’t add the WHERE clause, and onboarding a tenant is as simple as
create the tenant row, something you’d have to do anyway. This is the cheapest and simplest approach you could
implement, and as mentioned above, it can be upgraded to the partitioned tables approach if you
need to scale it.

The only real cost with this approach is the burden it puts on the application code to enforce the isolation, which is
an ongoing continual burden, rather than simply a one-off operational cost. In practice, you’d centralise the logic
that handles the isolation, whether by using a global query scope, or a default query constraint/criteria. You
wouldn’t ever expect anyone to have to manually add the WHERE clause to every query, as that would be a recipe for
disaster. It does make the enforcement a bit fragile, though, as even with centralised logic, it’s split across the
whole codebase, and if you ever miss it, you’re going to have a problem.

Best fit

Honestly, this approach is the best fit for most applications, unless you have some strict criteria that requires a
higher level of isolation. At the very least, this is a great place to start, as it’s easier to migrate to a heavier
approach when the time comes than it is to migrate to a lighter approach. This particular approach is the one that was
used by Shopify, though they also sharded tenants across multiple instances to spread the load better, and they’re
operating at a scale that most applications will never reach. The only caveat is that you need to be using an ORM or
framework that allows you to centralise the logic that enforces the isolation.

The only real weakness of this approach is that the isolation is enforced at the application level, which honestly,
can be solved by using PostgreSQL and RLS, which is covered below.

Row-level security closes the case

Row-level security (RLS) isn’t an approach to multitenancy, but rather a particular feature of some database engines,
most notably PostgreSQL. It is used alongside the discriminator column, and by association,
the partitioned tables approach, to provide a stronger level of isolation. It allows you to
move the enforcement from a query convention in the application code to a security policy enforced by the database
engine.

All that is required to make use of the features, once the policies are defined, is to set the tenant
in the connection, which is usually done by setting a session variable. While this does close the gap on isolation
and will do wonders to make you feel better about the risk of data leakage, there are caveats and potential issues
to be aware of.

  • Table owners, that is, the database user that is configured as the owner, are not subject to RLS policies by
    default. You have to enforce it by setting the FORCE ROW LEVEL SECURITY option on the table.
  • Superusers, like owners, will also always bypass RLS policies, along with any user granted BYPASSRLS, regardless
    of whether the table has FORCE ROW LEVEL SECURITY set or not.
  • Migrations must be run by a user that either bypasses RLS policies or simply isn’t subject to them, otherwise the
    migrations will fail to run.
  • As mentioned in the partitioned tables section, if you’re using a transaction mode connection
    pooler, there is a risk of cross-tenant data leakage. Because the current tenant is set in a session variable, if
    the connection is returned to the pool and then used by another tenant, the session variable will still be set.

However, since PostgreSQL is the gift that keeps on giving, there are ways to mitigate all the above issues.

For the table owner and migration issues, the commonly recommended way of setting this up
is to have two different users. The first is the table owner and isn’t subject to RLS policies, allowing them to
run migrations without issue. The second is the application user, who is subject to RLS policies and is used for all
application queries. This does add some slight operational overhead, but it’s a one-off cost during setup, and pales
in comparison to the ongoing operational cost of the heavier approaches.

And for the transaction mode connection pooler issue, you can scope the session variable to the current transaction
using SET LOCAL instead of SET, ensuring that the session variable is reset when the transaction ends. This does
mean that you need to ensure that every query is wrapped in a transaction, otherwise the session variable will not be
set, and the RLS policies will not be applied. When creating the policy, there’s a second argument that tells the
engine what needs to happen when the session variable is not set, with the choices being between returning an
empty result or erroring. In almost every case, you want it to error, as an empty result will be indistinguishable from
a query that simply has no results, which is a recipe for disaster.

When separation is genuinely worth it

It’s pretty clear from the title of this article, and pretty much everything above, that I’m of the opinion that you
can achieve perfectly adequate isolation of tenant data, without having to separate it. I do stand by this, but
there are some things that just can’t be achieved, at least easily, without separation. So, when these things are a
genuine requirement, something that’s important enough that you pay the operational and complexity cost of
implementing separation, then you should go for it.

All of these have been briefly mentioned, mainly in the "Best fit" section for the relevant approaches, but here they
are.

  • Regulatory requirements: Some regulations require that data is physically isolated, and in some cases, even
    that it is stored in a particular jurisdiction. Some regulations also require that the data is encrypted
    per-tenant, to allow for things like cryptographic erasure. If you’ve any of these requirements, then you’ll need
    to separate the tenants.
  • Per-tenant backup, restore, and point-in-time recovery: It is theoretically possible to achieve this with any
    approach, but the further you get away from separation, the more exponentially complex it becomes. With
    separation, it’s trivial, and not even something you need to build yourself, as most engines have this built-in.
  • True resource isolation: If you’ve got yourself some noisy neighbours, who have huge datasets and/or are
    very active, unless you separate them, they’re going to negatively affect the performance of other tenants. The
    good news is that if you have a tenant of that size, you probably have the budget and manpower to absorb the
    operational overhead.

Not every tenant

While the above are all genuinely valid reasons to separate tenants, in most cases they are not applicable to every
tenant. The operational overhead of separation is significant, but it scales linearly with the number of tenants. So,
if you can only separate the tenants that need it and keep the rest on a lighter approach, you can get the best of both
worlds. After all, it’s a spectrum, and you don’t have to pick just one.

You don’t have to pick just one

If you’ve read through even some of the above approaches, you’ll have noticed that many of them reference others
and either build on top of them or sit alongside them. This is because they aren’t mutually exclusive, and they aren’t
simply a linear choice of one or the other. As I said at the start of the article, it’s a spectrum, and how you pick
depends on what you need.

Combining by tenant

This strategy is one of exceptions, meaning that all tenants follow the default approach you picked for
data-isolation, with a few that are exceptions to that rule.

Let’s say that your application uses the
discriminator column approach, but you have a handful of tenants that are noisy neighbours,
and one that is a large enterprise with additional requirements. For your everyday tenants, they stay in the shared
database instance using the discriminator column, but for the noisy neighbours, they each have their own database on
a shared instance, following the separate database, same instance approach. Your
large enterprise tenant has its own dedicated instance in a jurisdiction that meets its regulatory requirements,
following the separate instance approach.

How you combine approaches is entirely up to you and depends entirely on your needs. The one thing to consider is that
combining approaches has a cost, though how much depends on the approaches being combined.

Combining across layers adds complexity

All the approaches covered in this article operate on one of two layers, either the connection layer, or the query
layer. Combining approaches that operate on the same layer is trivial, but combining approaches that operate across
layers adds complexity.

The connection layer covers approaches that are handled at the connection level, so
separate instance, separate database, same instance,
schemas (schema only), and even RLS.

The query layer covers approaches that are handled at the query level, so
partitioned tables, prefixes (prefixes only), and the discriminator
column
.

Some of you eagle-eyed individuals will have noticed that the schemas and/or prefixes
approach is listed in both layers, which I assure you is intentional. That particular approach is a bit of a hybrid, and
the layer it sits in depends on the mechanism being used. If you’re using schemas with PostgreSQL, it’s connection
level, but if you’re using regular old table prefixes, it’s at the query level.

Combining by data

This strategy is almost entirely identical to the above, except that instead of using different approaches based on
specific tenants, you use different approaches based on specific datasets. This particular strategy is often overlooked,
despite it being something that many multitenanted applications will implement without realising.

Imagine you’re isolating tenants at the connection level, each with
a separate database on the same instance, but you want users to be able to
authenticate
against tenants that they are added to, with the same credentials. You’re already going to have a global, or
"landlord", database that contains the tenant table, so you can look up the correct tenant before creating the
connection to the tenant’s database. So to achieve this, rather than duplicate user data across multiple databases,
you’d add the users table to the global database, along with
a tenant_user pivot table that contains a discriminator column to filter the users by tenant.

Crossing a database boundary costs referential integrity

The only real downside to this strategy comes when the approaches you combine cross a database or instance boundary:
you sacrifice referential integrity.
In the above example, the tenant_user table can have a foreign key to
the tenant table because they both live in the global database, but none of the tenant-specific databases can have a
foreign key to the users table. Without foreign keys, there is no referential integrity at the database level.

Another example, but going the other way would be when your application has a small subset of data that has particular
regulatory requirements and needs to either stay in a particular jurisdiction, or needs to be isolated to a higher
degree.
Rather than isolating every tenant’s data to that level, you could isolate just the data that needs it. Your whole
application makes use of the discriminator column approach, but the subset of data uses
the separate database, same instance, or even
the separate instance approach.

Combining by data sidesteps the layer cost

When you separate by data rather than by tenant, you avoid the cross-layer cost completely. With this approach, your
models, entities, or whatever you call them, can belong to different connections or have different scopes. That’s
something easily done by almost every modern ORM or framework. This is much cheaper than the previous strategy,
which would require your models, entities, etc to support multiple different ways simultaneously, based on the
current context.

You do also incur additional operational costs for more complex/heavier approaches; however, since you’re only
isolating a small subset of data, it is slightly reduced compared to the previous strategy, which would require you to
isolate all of a tenant’s data.

Sharding is a different axis

Sharding is entirely a concern of scale and has nothing to do with isolation, unlike the previous two strategies.
While data isolation in multitenancy is a spectrum, as I keep saying, they’re intentionally laid out in this article
to be vertical. Sharding is horizontal, which is why it has been left out of the discussion until now. It has no
fixed isolation level, and inherits the isolation level of whatever approach it is that you’re sharding.

The only approach that you can’t shard, at least, not in the way that I’m talking about it here, is the
separate instance approach, as each tenant is already isolated to its own instance. However,
all the other approaches can be sharded.

To really get the most out of sharding a multitenanted application, you’d split tenant data across multiple database
instances, with the particular structure dependent on the approach. For the
separate database, same instance approach, each tenant would still have their
own database, but they’d be equally spread out across the instances. Sharding tenants for this approach is the
closest to the generic concept of data sharding, outside a multitenanted context.
Schemas and/or prefixes,
partitioned tables, and discriminator column approaches would still
have the tenants split across the instances; the difference being that the structure is identical on every instance,
with only the tenants differing.

Sharding does increase complexity

The cost of sharding depends entirely on how you’re routing tenants to their shard. Evenly distributing tenants across
shards is simple, but basing it on an attribute of the tenant makes it more complex.

The important consideration, however, is whether a tenant’s shard is fixed or can change. If the shard is derived
from an attribute of the tenant, that tenant is effectively locked to that shard, making the process of adding new
shards or balancing for performance far more complex and costly. If tenants are instead mapped to a shard based on
some arbitrary logic during onboarding, it becomes much easier to add new shards and balance, as the cost and
complexity are entirely in the migration of the tenant’s data. Mapping does incur the cost of maintenance, though
it’s trivial and often worth paying a little bit every now and then to save a lot in the future.

Start cheap, escalate on demand

All the bits above deal with the how and why you’d combine approaches, but the most important thing to consider is
the when. You don’t have to start with the most isolated approach, and in fact, you probably shouldn’t. Start with
the cheapest and simplest approach that meets your requirements and then escalate to a heavier approach when the
need arises. You move your tenants to a heavier approach in response to a need, not in anticipation of one. Don’t
try to preemptively solve a problem that doesn’t, and may not ever exist.

It is always going to be cheaper and easier to move a tenant to a heavier approach than it is to move them to a
lighter one. In almost all cases, the heavier approach is additive when it comes to complexity, and as every one of
us knows far better than we’d comfortably like to admit, it’s always easier to make something more complex than it
is to make it simpler.

When I say requirements, I mean it

Any of you that have worked in any sort of agency or sales/marketing heavy situation will know that there are two
types of requirements for any project. The first are the technical requirements, the things that are actually
required to make the project work. The second are the "requirements" that are just things that people want or think
they should have.

I can tell you, from experience, that most people that think they need full separation from the start don’t
actually need it at all. This is something that you or your leadership team will need to determine, and isn’t something
that I can dictate to you. Unless, of course, you want to hire me to do exactly that.

The cost you’re actually choosing

Each approach covered in this article has two costs associated with it. The first is the
architectural cost of implementing the approach, and the second is the ongoing operational cost of maintaining it.
The architectural cost is paid once, per approach. You pay it during the initial building of the application, and
then again each time you implement a new approach, which is unlikely to be a common occurrence.
The operational cost, however, is paid for every tenant, every day, for the lifetime of the application.
When you’re picking an approach, it’s the second operational cost that you’re choosing, and it’s the one that you need
to be most aware of.

Do you want to be able to run a new migration once, or do you want to have it run N times, once per tenant? If
you’re thinking that waiting for a migration to run for every tenant is no big deal, then consider that the time it
takes to run a migration isn’t the only thing here. If you’re running the migration once, and it breaks, you can
either roll back the migration and/or roll back the application. That’s going to be relatively quick and simple. If
you’re running the migration N times, and it breaks on a tenant, then you’ve got a much more complex problem. Sure,
you roll back the migration and/or the application, but now you’ve got to figure out which tenants the migration ran
successfully for, and which ones it didn’t. Building a migration pipeline that can handle this is non-trivial, and
even the best implementation can be fragile.

What about backups? Sure, it’s nice to be able to back up every tenant separately, but again, you’ve got to consider
the operational cost of that. It’s the same as with the migrations, waiting for the backups to run isn’t that much
of an issue, it’s the complexity of the backup pipeline that you need to consider, and the management of the backups.
If you’ve got fully separated tenants, and then you’re backing them all up to the same location, you’ve undermined
the entire point of separating them in the first place. I know that I listed backups as a genuinely good reason to
separate tenants, but that was in the context of a requirement, not a nice-to-have. So, be honest; do you really
need a backup per tenant? Sure, it ticks a box and looks good on a slide, or feature list, but in reality, is it
worth it? Is it frequent enough to pay both the architectural cost of setting it up and the ongoing
operational cost of maintaining it? I’d be willing to bet that if you even need it at all, it’ll be so infrequent and
such a special case that you could just manually do it.

Too many people approach multitenancy and try to answer the question "how isolated do I need it to be?", when the
real question is, and always was, "what am I willing to pay, per tenant, forever, for isolation?". Unless you
absolutely, genuinely, have to have some of the things I
listed above, a PostgreSQL database with a
discriminator column and row-level security will buy
you all the isolation you need, without the operational cost of separation. Compared to this, the
separate database, same instance approach, without any of the requirements, will
incur all the cost, and provide none of the benefits.

If you take only one thing away from this article, let it be this:

The best data isolation approach for any given application will vary from application to application, but it’s
almost never going to be "database per tenant"
.

Laravel News Links

Manage long-running transactions for AWS DMS performance

https://d2908q01vomqb2.cloudfront.net/887309d048beef83ad3eabf2a79a64a389ab1c9f/2026/06/25/4899.png

In this post, we show you how long-running transactions affect AWS Database Migration Service (AWS DMS) change data capture (CDC) latency, walk through monitoring approaches for Oracle, PostgreSQL, MySQL, and SQL Server, and provide ready-to-use scripts to identify and resolve problematic transactions before they impact your replication performance.

Long-running transactions are one of the most common and often overlooked causes of rising CDC source latency during AWS Database Migration Service (AWS DMS) migrations. When a database transaction stays open for hours instead of seconds, DMS must hold its replication position, buffering all subsequent changes until that transaction commits. The result is a cascading delay across all pending changes that can stall your migration pipeline right when low latency matters most, during your final cutover window. At that point, every minute of replication delay translates directly to extended downtime and higher migration risk.

By detecting these transactions early, you can avoid unnecessary replication instance scaling, reduce cutover downtime, and keep migration on schedule.

DMS source latency

When working with AWS DMS, source latency (CDCLatencySource) is a key metric. It measures the delay between the commit time of the last event captured from the source endpoint and the current system timestamp of the replication instance. In other words, it measures how far behind the replication instance is from the latest committed change at the source. Lower latency indicates healthier replication performance.

Several factors contribute to elevated source latency: an uncommitted transaction prevents DMS from applying/forwarding changes to the target, network throughput limitations between source and replication instance, source database performance bottlenecks, and heavy workload during peak processing times. These issues often interconnect, compounding their impact on replication performance.

To troubleshoot effectively, analyze latency patterns over time at hourly, daily, and weekly intervals. This helps you distinguish between temporary spikes during expected peak periods and systematic issues that indicate underlying problems.

The path to resolving source latency isn’t always straightforward, but it typically involves investigating six key areas:

  • High database workload
  • Network connectivity issues
  • Insufficient system resources
  • Large transaction volumes
  • Long-running transactions
  • Archival log processing delays

Each of these factors requires a different troubleshooting approach. Pinpointing the exact root cause before initiating remediation is essential for an effective fix. For the scope of this post, we focus on long-running transactions.

Long-running transactions

Long-running transactions are database operations that remain open for minutes or even hours. In AWS DMS, they can cause significant replication delays.

When a transaction stays open, DMS must retain all transaction log entries from the point that transaction began. For example, if a transaction opens at 9:00 AM and does not commit until 11:00 AM, DMS preserves two hours of log data. This increases storage consumption and memory usage on the replication instance.

The open transaction also blocks DMS from advancing its log-reading position. All changes that commit after the open transaction must wait in queue, causing CDC source latency to climb steadily. In high-transaction environments, a single long-running transaction can create a backlog that takes hours to clear after the transaction finally completes.

Common causes include complex queries processing large datasets, batch operations that run longer than expected, application code that holds transactions open unnecessarily, and missing commit or rollback statements. Each cause calls for a different troubleshooting approach.

To reduce the impact, set up automated monitoring to detect open transactions that exceed a defined threshold. Keep transaction duration as short as possible by optimizing application code and committing frequently. Avoid user interaction within open transactions. Also make sure the replication instance has sufficient storage and computing resources to handle temporary backlogs.

Some long-running transactions are unavoidable for certain business operations. The goal is not to eliminate them entirely but to detect them early and prevent them from stalling your replication pipeline.

How long-running transactions impact DMS source latency

Long-running transactions directly impact your AWS DMS CDC source latency. Identifying them is the first step to troubleshooting performance issues.

How DMS processes transactions

When DMS operates in CDC mode, it captures changes from the source database’s transaction logs (such as Oracle redo logs, SQL Server transaction logs, PostgreSQL WAL, or MySQL binlogs). However, DMS can only capture and replicate committed transactions to maintain data consistency.

A bottleneck forms because DMS cannot skip past the open transaction:

  • CDCLatencySource increases from seconds to potentially hours.
  • CDCIncomingChanges accumulates.
  • CDCChangesMemorySource or CDCChangesDiskSource increases as DMS buffers pending changes.

For instance, consider a batch job that starts a transaction at 10:00 AM and runs for 2 hours without committing. Any changes captured during that window cannot be applied to the target until the transaction commits at 12:00 PM. The DMS replication task is the unit of work responsible for reading the source transaction log and applying changes to the target. During this period, the task buffers all subsequent changes it has read, regardless of whether those changes involve related tables. It maintains transactional ordering within its scope, so no captured changes can be applied until the open transaction completes at noon. This creates a significant replication delay that grows proportionally with the transaction’s duration.

Diagnostic approach

Identifying long-running transactions helps you differentiate between DMS configuration issues and database-level problems when diagnosing source latency. This identification process follows a systematic approach:

  1. Initial investigation: Monitor CDCLatencySource to identify when replication is falling behind, then investigate replication instance metrics and source database open transactions to determine the root cause.
  2. Root cause analysis: Use monitoring scripts to identify active long-running transactions and correlate transaction start times with CDCLatencySource spikes. Track specific identifiers such as Oracle SCN (such as 12345678) or SQL Server LSN (such as 0000000A:00000B00:0001).
  3. Targeted solutions: Once identified, you can request a commit or rollback from the application team, optimize the logic to use smaller transaction batches, or implement transaction timeout policies.

Without identifying these long-running transactions, teams often waste time tuning DMS task settings (like MemoryLimitTotal or parallel threads) or scaling up replication instances unnecessarily, when the actual solution requires addressing the source database’s transaction management practices. The monitoring scripts thus serve as a diagnostic bridge between observing DMS performance symptoms and pinpointing their database-level causes.

Queries to identify long-running transactions

To proactively manage this issue, implementing alerts for long-running transactions is essential. Here are the monitoring queries for each database engine:

For Oracle source engine:

For PostgreSQL source engine:

For MySQL source engine:

SELECT
CONCAT('PID: ', p.id,
' | User: ', p.user, '@', p.host,
' | DB: ', IFNULL(p.db, 'None'),
' | Duration: ', p.time, 's',
' | State: ', IFNULL(p.state, 'N/A'),
' | Locked Rows: ', IFNULL(t.trx_rows_locked, 0),
' | Modified Rows: ', IFNULL(t.trx_rows_modified, 0),
' | Query: PID_', p.id)
FROM information_schema.processlist p
LEFT JOIN information_schema.innodb_trx t ON p.id = t.trx_mysql_thread_id
WHERE p.time > 900 
AND p.command != 'Sleep'
AND p.command != 'Daemon'
AND p.user NOT IN ('event_scheduler', 'rdsadmin', 'system user')
ORDER BY p.time DESC;

For SQL Server source engine:

Monitoring scripts

We’ve developed monitoring scripts for Oracle, PostgreSQL, MySQL, and SQL Server. These scripts alert when long-running transactions stall DMS replication. They detect and alert on long-running transactions (default: 15 minutes). Each script is designed for its specific database engine while following a consistent monitoring framework.

The full scripts are available on GitHub: sample-dms-long-running-transaction-detection.

Infrastructure requirements

  • Linux / Unix environment (bash shell required)
  • Appropriate database client tools (sqlplus, psql, mysql, or sqlcmd)
  • An Amazon Simple Notification Service (Amazon SNS) topic for receiving alert notifications
  • AWS Command Line Interface (AWS CLI) configured with IAM permissions for SNS publish
  • Proper database access permissions
  • Network connectivity from the monitoring host to the database endpoint

Common features across all scripts

All four scripts share a consistent design pattern with the following features:

Feature Description
Interactive Prompts All connection parameters are collected through interactive prompts at runtime. Credentials are securely retrieved from AWS Secrets Manager
TLS/SSL Encryption All database connections are encrypted by default: TCPS for Oracle, -N flag for SQL Server, sslmode=require for PostgreSQL, --ssl-mode=REQUIRED for MySQL
AMAZON SNS Alerts Alerts are published to a configurable AMAZON SNS topic, enabling distribution to multiple subscribers (email, SMS, AWS Lambda, and more)
Prerequisite Checks Each script validates that required tools (AWS CLI, database client) are installed before proceeding
False Positive Filtering Internal/system processes and client warnings are filtered out to prevent false alerts

Core functions

Each script implements these core functions:

  • collect_inputs(): Interactive prompt-based parameter collection.
  • validate_inputs(): Input validation and prerequisite checks.
  • check_long_sessions() / check_long_transactions(): Core monitoring query execution.
  • send_alert(): SNS alert publishing with formatted message.
  • log_message(): Timestamped logging to file.
  • build_*_connstr() / build_*_args(): TLS/SSL connection string construction.
  • main(): Orchestration with configuration summary and log directory setup.

Script details by engine

Oracle long-running session monitor

This bash script automatically detects Oracle database sessions running longer than the configured threshold and publishes alerts to Amazon SNS. It connects using the TCPS protocol for TLS/SSL encryption, with optional Oracle Wallet support for certificate-based authentication.

Usage:

Interactive prompts:

TLS/SSL configuration: The script uses TCPS protocol by default (port 2484). For Oracle Wallet-based SSL, set the following environment variables before running:

Key filters: Excludes BACKGROUND processes and the RDSADMIN user. Only monitors ACTIVE sessions.

PostgreSQL long-running transaction monitor

This bash script automatically detects PostgreSQL transactions exceeding the configured threshold and publishes alerts to Amazon SNS. It enforces TLS/SSL via the sslmode connection parameter.

Usage:

Interactive prompts:

TLS/SSL configuration: Default sslmode=require. For stricter verification, set environment variables:

False positive prevention: The script separates stderr from stdout to prevent psql client warnings (such as libpq.so version messages) from being treated as transaction data.

MySQL long-running transaction monitor

This bash script automatically detects MySQL transactions running longer than the configured threshold, providing detailed reports and publishing alerts to Amazon SNS. It enforces TLS/SSL via the --ssl-mode parameter.

Usage:

Interactive prompts:

TLS/SSL configuration: Default --ssl-mode=REQUIRED. For stricter verification with the RDS CA bundle:

False positive prevention: The script filters out MySQL internal processes (event_scheduler, rdsadmin, system user, Daemon commands) and strips the mysql: [Warning] Using a password on the command line interface can be insecure warning from output.

SQL Server long-running transaction monitor

This bash script automatically detects SQL Server transactions exceeding the configured threshold and publishes alerts to Amazon SNS. It enforces TLS/SSL via the -N flag with sqlcmd.

Usage:

Interactive prompts:

TLS/SSL configuration: The script uses -N (encrypt) and -C (trust server certificate) by default. For strict certificate verification, install the RDS CA bundle into the system trust store and set:

False positive prevention: Filters to is_user_process = 1 only and strips (N rows affected) noise from sqlcmd output. SQL text is truncated to 200 characters.

Sample SNS alert output

When a long-running transaction is detected, the script publishes a formatted alert to the configured SNS topic. Here is an example alert from the Oracle monitor:

All subscribers to the SNS topic (email, SMS, Lambda, and more) receive this alert, so the operations team can respond quickly.

Sample log output

Scheduling with crontab

For continuous monitoring, configure the scripts as cron jobs. Since the scripts use interactive prompts, you can create a wrapper script that pipes the inputs, or modify the scripts to retrieve credentials from AWS Secrets Manager. Alternatively, for crontab usage, you can set the variables directly in a non-interactive wrapper:

Note: The scripts use bash-specific syntax (such as, [[ =~ ]] for regex matching). Always invoke them with bash script_name.sh rather than sh script_name.sh.

Demonstration

We are going to perform a demo of the script in action by choosing Oracle as our source endpoint and check if the script is indeed able to capture the long-running sessions. For the demo we have used Amazon Relational Database Service (Amazon RDS) for Oracle 19c Enterprise Edition as source:

  1. Create a full load and CDC task to load the EVENTS_LOB table. Table structure:
  2. Schedule the monitoring script in crontab to execute every 15 minutes:
  3. Start the DMS task and have it completed the full load and move on to the CDC phase of the migration task.
  4. Generate high volume of DML in the source with commit and wait for the latency spike to happen:
  5. Observe the DMS task Amazon CloudWatch metrics. You will see an increase in CDCLatencySource as the long-running transaction holds up replication.
  6. Execute the monitoring script. The script detects the long-running session and publishes an alert to the SNS topic with full session details including SID, username, duration, SQL text, and wait events.
  7. SNS subscribers receive the alert. All subscribers to the configured SNS topic (email, SMS, Lambda, and more) receive the formatted alert, so the operations team can investigate and take action. A sample alert is included in the following section for reference:

Security considerations

The monitoring scripts incorporate several security best practices:

  • No command-line credentials: All connection parameters are collected via interactive prompts, preventing password exposure in ps output, shell history, or process listings.
  • TLS/SSL encryption: All database connections are encrypted by default across all four engines.
  • IAM-based alerting: Alerts use Amazon SNS with IAM authentication, so no email credentials are stored in the scripts.
  • Password: Credentials are collected via silent interactive prompts (read -s), preventing exposure in shell history or process listings. For automated scheduling, the scripts support --secret-id to retrieve credentials directly from AWS Secrets Manager, eliminating plaintext passwords entirely.
  • SQL ID only: For security, scripts report only SQL_ID/query_id references, not actual SQL text. Use these identifiers to look up queries in database monitoring views if needed.
  • Temporary file cleanup: All temporary SQL files and stderr capture files are cleaned up after use.

Conclusion

Long-running transactions can increase your CDC source latency from seconds to hours and cause replication delays that grow until the transaction commits. The monitoring scripts provided for Oracle, PostgreSQL, MySQL, and SQL Server offer a proactive solution to identify these problematic transactions before they become critical issues. As demonstrated in our Oracle example, these tools effectively detect and alert administrators about long-running sessions, enabling timely intervention and optimization.

The scripts enforce TLS/SSL encryption on all database connections, use Amazon SNS for scalable alert distribution, and include false-positive filtering to make sure alerts are actionable. By implementing this monitoring suite, you transform from reactive troubleshooting to proactive management. This post helps you identify potential latency issues before they impact database operations or AWS DMS replication, ensuring smooth, efficient database operations across all your platforms.


About the authors

Chirantan Pandya

Chirantan is a database engineer (AWS Countdown Premium) and subject matter expert for AWS DMS and Amazon RDS for PostgreSQL, Oracle. At AWS, he collaborates closely with customers to provide guidance and technical assistance about database migration projects, and with Amazon RDS for PostgreSQL and Oracle.

Manojit Saha Sardar

Manojit is a Database Engineer at AWS, specializing in AWS DMS, Amazon RDS, and Amazon Aurora — with deep expertise across Oracle, PostgreSQL, and Aurora PostgreSQL engines. With over six years of hands-on production experience in cloud database systems at scale, he is a certified Subject Matter Expert (SME) across five AWS domains: RDS Core, RDS PostgreSQL, RDS Oracle, Aurora PostgreSQL, and AWS DMS.

Planet for the MySQL Community

MySQL Workbench to DBeaver Transition

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmktJunYqMacdP812wQ-9ETWDfxDZhk0VxnTrc39wD8RbK4uxjd_K-6KpYbz1jaY_zOkkBLjvLCX-h9PWV2QFYwqvd94jwy-PMrwck9JeovAbkhVRrtoZCRereTigTuxcNAsPHVTQhArHfxTiplocSl6VoE8xLur1yUJ7PEtN7tGxnjv9dsUJP7OtgIPye/w1200-h630-p-k-no-nu/Screenshot%202026-06-26%20072921.png

 First, there were the MySQL Query Browser and the MySQL Administrator, collectively known as the MySQL GUI Tool Bundle. But they were given End-of-Life status in 2009. Next was MySQL Workbench, but that tool is now EOL as well.

MySQL Workbench provides a warning if you
try to use it with later versions of MySQL.

 You could still run Workbench, but it will grumble if you try to use it with any version other than 5.6, 5.7, or 8.0. Heck, it warns you that 8.4 may be a bit sketchy. On, and you need an older version for MySQL 5.5 and earlier.

  Many people swore by Workbench as the tool for checking on system status, wiring queries, and making backups. Others swore at it.

Yes, it had quirks. But Workbench did a lot of this right. 

There is a free, open-source database tool you should consider as a replacement. DBeaver Community Edition is also a database tool, but it supports many databases, not just MySQL. It uses JDBC connectors. DBeaver is an IDE that offers smarter auto-completion, advanced sorting, cell inline-editing, and a "Copy As" configuration feature. Additionally, the community version allows developers to integrate OpenAI or Copilot to translate regular text descriptions into SQL queries. Built with superior data-handling agility. You can directly import data from CSV, XML, or Excel XLSX files. It also features robust utilities to migrate data smoothly between two different database formats with minimal manual intervention.

Multiple Databases

Some of the available database connectors 
for the DBeaver Community Edition.

Gone are the days when you had one database. In this cloudy and Kubed world, you may be lucky if you have only a dozen databases to work with. Finding a universal interface, like DBeaver, keeps you from having to try to remember that the psql equivalent of \G is in the MySQL Shell. 

A big bonus is the ability to open a CSV file, transform the data, and then load it into a table in a database. Or compare that data to decide what subset of that data to move into that table. 

First Steps

Establishing a connection from DBeaver is going to be familiar for Workbench fans. 

Copy the hostname, database name,
username and password for DBeaver CE
from Workbench.
You will need the hostname, database name,
username and password for DBeaver CE.

Your connection details are going to be the same, bout double check that you did not ‘fat finger’ the information

Your Database

On the right, you will see my configuration. On the bottom left is a MySQL ‘Sakila’ icon and the World database. Ignore the rest of the database connections, at least for now.

The World database has been around for decades. It is the standard for MySQL documentation, examples, demos, and tutorials. And it is a great place to start for the rest of this blog.

Expanding the Greater Than character to the left of that icon and name, DBeaver will reveal some details about the world database.

World Database at a
high-level view

 We see Databases, Users, Administer, and System Info.

As we expand, we see even more details about the server.  We see the World database, the various users, a session manager, and general information about the instance. 

Right about now, you probably see a lot of overlap between the two products. Both are data tools, with DBeaver having a wider scope than the MySQL-only Workbench. 

Tables

Let’s look at the tables in that world database.

The world database has three tables.

We can drill down to see the tables.  Then we can look at the columns in a specific table.

The city table has five columns

The Views

Double-click on the city table icon, and the data will appear.

The Data

Or you can view the entity-relationship map.

ERM

And you may want to see the table structure.

The city table

You can edit data or structure with these views.

Queries

You need to be able to make queries.

SQL Editor -> Open SQL Console

The Query

You get command completion, highlight coloring, and all the stuff you expect. Click the top-left orange triangle to see the query results.

The results

What if you 

This has been a quick intro to DBeaver Community Edition for those who are used to MySQL Workbench. I always found Workbench useful. Hopefully, you will be pleasantly surprised with DBeaver. 

Planet for the MySQL Community

MySQL Query Optimization & EXPLAIN: A Complete Guide for DBAs and Developers

https://minervadb.com/wp-content/uploads/2026/06/shutterstock_2360159259-1024×520.jpg

MySQL query optimization is one of the most critical skills a database administrator or developer can possess. Whether you are managing a high-traffic e-commerce platform, a data warehouse with billions of rows, or a transactional OLTP system, poorly optimized queries are the leading cause of performance degradation, increased I/O, excessive CPU usage, and frustrated end users. At the heart of MySQL’s query optimization toolkit lies the EXPLAIN statement — a powerful diagnostic command that reveals how the MySQL query optimizer intends to execute a given SQL statement.
In this comprehensive guide, we will explore MySQL query optimization from the ground up: understanding the query execution lifecycle, dissecting every column of the EXPLAIN and EXPLAIN ANALYZE output, identifying common anti-patterns, and applying proven optimization strategies that MySQL DBAs and developers rely on in production environments every day. By the end of this article, you will be equipped with the knowledge to analyze execution plans, eliminate slow queries, and design indexes that drive maximum throughput.

Understanding the MySQL Query Optimizer

Before diving into EXPLAIN, it is essential to understand what the MySQL query optimizer does. The optimizer is a cost-based component within the MySQL server that evaluates multiple possible execution plans for a given query and selects the one with the lowest estimated cost. This cost is calculated based on statistics about tables and indexes stored in the Information Schema and the InnoDB storage engine‘s internal data dictionary.
The optimizer considers factors such as row estimates, index selectivity, join order, and available access methods before producing an execution plan. However, the optimizer is not perfect — it relies on statistics that may be stale or inaccurate, which is why understanding EXPLAIN and knowing how to guide the optimizer with hints is an indispensable skill for any serious MySQL DBA or developer.

The EXPLAIN Statement: Syntax and Variants

MySQL provides several variants of the EXPLAIN statement, each offering different levels of detail about query execution. Understanding when to use each variant is key to efficient query diagnostics.

-- Basic EXPLAIN
EXPLAIN SELECT * FROM orders WHERE customer_id = 1001;

-- EXPLAIN with FORMAT=JSON for richer, structured output
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 1001;

-- EXPLAIN ANALYZE (MySQL 8.0.18+) - executes query and returns real metrics
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 1001;

-- EXPLAIN for DML statements
EXPLAIN UPDATE orders SET status = 'shipped' WHERE order_date < '2024-01-01';
EXPLAIN DELETE FROM audit_log WHERE created_at < NOW() - INTERVAL 90 DAY;
EXPLAIN INSERT INTO archive_orders SELECT * FROM orders WHERE status = 'closed';

Sample Schema for Practical Examples

Throughout this guide, we use a realistic e-commerce schema to demonstrate every optimization technique hands-on.

CREATE TABLE customers (
    customer_id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    email          VARCHAR(255) NOT NULL,
    country_code   CHAR(2) NOT NULL,
    created_at     DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    status         TINYINT(1) NOT NULL DEFAULT 1,
    UNIQUE KEY uk_email (email),
    KEY idx_country_status (country_code, status),
    KEY idx_created_at (created_at)
) ENGINE=InnoDB;

CREATE TABLE orders (
    order_id       BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    customer_id    INT UNSIGNED NOT NULL,
    order_date     DATE NOT NULL,
    total_amount   DECIMAL(12,2) NOT NULL,
    status         ENUM('pending','processing','shipped','delivered','cancelled') NOT NULL,
    KEY idx_customer_id (customer_id),
    KEY idx_order_date_status (order_date, status),
    CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
) ENGINE=InnoDB;

CREATE TABLE order_items (
    item_id        BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    order_id       BIGINT UNSIGNED NOT NULL,
    product_id     INT UNSIGNED NOT NULL,
    quantity       SMALLINT UNSIGNED NOT NULL,
    unit_price     DECIMAL(10,2) NOT NULL,
    KEY idx_order_id (order_id),
    KEY idx_product_id (product_id),
    CONSTRAINT fk_items_order FOREIGN KEY (order_id) REFERENCES orders(order_id)
) ENGINE=InnoDB;

CREATE TABLE products (
    product_id     INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    sku            VARCHAR(50) NOT NULL,
    category_id    INT UNSIGNED NOT NULL,
    price          DECIMAL(10,2) NOT NULL,
    stock_qty      INT NOT NULL DEFAULT 0,
    UNIQUE KEY uk_sku (sku),
    KEY idx_category_id (category_id)
) ENGINE=InnoDB;

Dissecting the EXPLAIN Output: Column by Column

The id Column

The id column represents the sequential identifier of each SELECT within the query. Simple queries have a single id of 1. Subqueries and unions produce multiple rows with different id values. Rows with the same id execute as a join; rows with higher id values represent inner subqueries executed before the outer query.

The select_type Column

The select_type column describes the type of SELECT involved. Key values include: SIMPLE (no subqueries or unions), PRIMARY (the outermost SELECT), SUBQUERY (a subquery in SELECT or WHERE), DERIVED (a subquery in the FROM clause), UNION (subsequent SELECT in a UNION), and DEPENDENT SUBQUERY (a correlated subquery — a critical performance red flag indicating the subquery re-evaluates for each outer row).

-- SIMPLE: No subqueries or unions
EXPLAIN SELECT customer_id, email FROM customers WHERE country_code = 'US';

-- PRIMARY + SUBQUERY: Subquery in WHERE clause
EXPLAIN
SELECT order_id, total_amount FROM orders
WHERE customer_id IN (
    SELECT customer_id FROM customers WHERE country_code = 'DE'
);

-- PRIMARY + DERIVED: Subquery in FROM clause (derived table)
EXPLAIN
SELECT d.country_code, COUNT(*) AS order_count
FROM (
    SELECT c.country_code, o.order_id
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    WHERE o.status = 'delivered'
) d
GROUP BY d.country_code;

-- UNION: Multiple SELECT statements combined
EXPLAIN
SELECT customer_id, 'active' AS label FROM customers WHERE status = 1
UNION ALL
SELECT customer_id, 'inactive' AS label FROM customers WHERE status = 0;

The type Column: The Most Critical Field in EXPLAIN

The type column — also called the join type or access type — is the most important field in the entire EXPLAIN output. It tells you how MySQL accesses rows in a table. From best to worst performance:

  • system — The table has only one row. A special case of const.
  • const — Exactly one matching row via PRIMARY KEY or UNIQUE index. Ideal for primary key lookups.
  • eq_ref — For each row from the preceding table, exactly one row is read via PRIMARY KEY or UNIQUE NOT NULL index. The best possible join access type.
  • ref — Multiple rows may match. Occurs with non-unique indexes or leftmost prefix matches.
  • range — Only rows within a given range are retrieved using an index (BETWEEN, IN, >, <, LIKE with prefix).
  • index — A full index scan. Faster than ALL but potentially a bottleneck on large indexes.
  • ALL — A full table scan. The worst case for large tables — must be eliminated in performance-critical paths.
-- const: Primary key lookup
EXPLAIN SELECT * FROM customers WHERE customer_id = 42;
-- type: const, rows: 1

-- eq_ref: Unique index join (best for joins)
EXPLAIN
SELECT c.email, o.order_id, o.total_amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date = '2024-06-01';
-- type for customers: eq_ref (primary key join)

-- ref: Non-unique index lookup
EXPLAIN SELECT order_id, order_date, status FROM orders WHERE customer_id = 1001;
-- type: ref

-- range: Index range scan
EXPLAIN SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';
-- type: range

-- ALL: Full table scan (must be fixed for large tables!)
EXPLAIN SELECT * FROM orders WHERE total_amount > 5000;
-- type: ALL if no index on total_amount
-- Solution: CREATE INDEX idx_total_amount ON orders(total_amount);

The possible_keys, key, key_len, ref, rows, filtered, and Extra Columns

The possible_keys column lists all indexes MySQL considered; key shows the index actually chosen. When key is NULL despite available indexes in possible_keys, MySQL chose a full table scan — often because statistics suggest too many rows match. Run ANALYZE TABLE to refresh statistics.
The key_len column shows how many bytes of the chosen index are used. For composite indexes, this reveals how many columns are utilized. The rows column is MySQL’s estimated row examination count — minimize this product across joined tables for optimal performance. The filtered percentage shows what fraction of rows examined actually pass the WHERE clause.
The Extra column contains the most actionable diagnostic signals: Using index (covering index — ideal), Using temporary (temp table — investigate), Using filesort (sort without index — add covering index), Using index condition (Index Condition Pushdown active — good), and Using MRR (Multi-Range Read active — good for range scans).

-- Using index: Covering index (zero table row access)
ALTER TABLE orders ADD INDEX idx_cust_covering
    (customer_id, order_id, order_date, total_amount, status);

EXPLAIN
SELECT order_id, order_date, total_amount, status
FROM orders WHERE customer_id = 1001;
-- Extra: Using index

-- Using temporary + Using filesort: Performance red flag
EXPLAIN
SELECT country_code, COUNT(*) AS cnt
FROM customers GROUP BY country_code ORDER BY cnt DESC;
-- Fix: add index on (country_code) to avoid temp table

-- Using filesort on non-indexed ORDER BY
EXPLAIN SELECT order_id, total_amount FROM orders
ORDER BY total_amount DESC LIMIT 20;
-- Fix: CREATE INDEX idx_total_amount ON orders(total_amount);

-- Using index condition: ICP optimization
EXPLAIN SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'
  AND status = 'shipped';
-- Extra: Using index condition

EXPLAIN ANALYZE: Real Execution Metrics in MySQL 8.0

EXPLAIN ANALYZE, introduced in MySQL 8.0.18, executes the query and returns both estimated and actual metrics for each node in the execution plan tree. This is critical for identifying cardinality estimation errors — cases where the optimizer’s row estimates diverge wildly from reality, leading to suboptimal plan selection.

EXPLAIN ANALYZE
SELECT
    c.country_code,
    COUNT(DISTINCT o.order_id)       AS total_orders,
    SUM(oi.unit_price * oi.quantity) AS total_revenue
FROM customers c
JOIN orders o     ON o.customer_id = c.customer_id
JOIN order_items oi ON oi.order_id = o.order_id
WHERE c.status = 1
  AND o.order_date >= '2024-01-01'
  AND o.status = 'delivered'
GROUP BY c.country_code
ORDER BY total_revenue DESC;
-> Sort: total_revenue DESC  (actual time=142.5..142.7 rows=48 loops=1)
    -> Aggregate using temporary table  (actual time=142.2..142.2 rows=48 loops=1)
        -> Nested loop inner join  (cost=18540.23 rows=9820)
                                   (actual time=0.8..138.6 rows=87342 loops=1)
            -> Nested loop inner join  (cost=5421.12 rows=3240)
                                       (actual time=0.5..22.4 rows=28918 loops=1)
                -> Filter: (c.status = 1)  (cost=1240.80 rows=8400)
                   (actual time=0.3..8.7 rows=71230 loops=1)
                    -> Index scan on c using idx_country_status
                       (cost=1240.80 rows=84000)
                       (actual time=0.2..6.9 rows=84000 loops=1)
                -> Filter: (o.order_date >= '2024-01-01') and (o.status='delivered')
                   (cost=0.25 rows=1) (actual time=0.00019..0.00019 rows=0 loops=71230)
                    -> Index lookup on o using idx_customer_id
                       (customer_id=c.customer_id)  (cost=0.25 rows=1)
                       (actual time=0.00017..0.00017 rows=1 loops=71230)
            -> Index lookup on oi using idx_order_id (order_id=o.order_id)
               (cost=1.12 rows=3) (actual time=0.003..0.004 rows=3 loops=28918)

Key analysis points: compare the estimated rows against actual rows. When these diverge by orders of magnitude, consider running ANALYZE TABLE or increasing innodb_stats_persistent_sample_pages. The actual time=start..end values are in milliseconds. The loops value shows how many times each node executed — high loop counts on expensive inner operations are the primary target for optimization.

Common Query Anti-Patterns and How to Fix Them

Anti-Pattern 1: Functions on Indexed Columns in WHERE Clauses

Wrapping an indexed column inside a function prevents MySQL from using the index, forcing a full table scan. This is one of the most common and damaging anti-patterns found in production SQL workloads — and the fix is almost always straightforward.

-- BAD: Function prevents index usage
EXPLAIN SELECT * FROM orders
WHERE YEAR(order_date) = 2024 AND MONTH(order_date) = 6;
-- type: ALL (full table scan on potentially millions of rows)

-- GOOD: Rewrite as range condition (uses index)
EXPLAIN SELECT * FROM orders
WHERE order_date >= '2024-06-01' AND order_date < '2024-07-01';
-- type: range, Extra: Using index condition

-- BAD: LIKE with leading wildcard (no index possible)
EXPLAIN SELECT * FROM products WHERE sku LIKE '%ABC%';
-- Consider FULLTEXT index for arbitrary substring searches
ALTER TABLE products ADD FULLTEXT INDEX ft_sku (sku);
SELECT * FROM products WHERE MATCH(sku) AGAINST('ABC' IN BOOLEAN MODE);

-- GOOD: LIKE with trailing wildcard (uses index prefix scan)
EXPLAIN SELECT * FROM products WHERE sku LIKE 'ABC%';
-- type: range

-- BAD: Function on indexed column breaks index usage
EXPLAIN SELECT * FROM customers WHERE LOWER(email) = 'user@example.com';

-- GOOD: Functional index (MySQL 8.0+) preserves index access
ALTER TABLE customers ADD INDEX idx_email_lower ((LOWER(email)));
EXPLAIN SELECT * FROM customers WHERE LOWER(email) = 'user@example.com';
-- type: ref, key: idx_email_lower

Anti-Pattern 2: The N+1 Query Problem

The N+1 problem occurs when an application executes one query to retrieve N records and then fires an additional query for each record — N+1 total round trips. This is catastrophic at scale and entirely preventable with proper JOIN usage or batch fetching.

-- BAD: N+1 pattern (500 pending orders = 501 queries!)
-- Query 1: SELECT order_id FROM orders WHERE status = 'pending';
-- Then for each order_id:
-- Queries 2..501: SELECT * FROM order_items WHERE order_id = ?;

-- GOOD: Single JOIN eliminates N+1 completely
EXPLAIN
SELECT
    o.order_id, o.order_date, o.total_amount,
    oi.item_id, oi.product_id, oi.quantity, oi.unit_price
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending'
ORDER BY o.order_id, oi.item_id;
-- type for orders: ref (idx_status)
-- type for order_items: ref (idx_order_id)
-- One query, complete result set

Anti-Pattern 3: SELECT * Instead of Column Projection

Using SELECT * prevents covering index usage, transfers unnecessary data across the network, and makes execution plans less predictable as schemas evolve. Always project only the columns your application actually needs.

-- BAD: SELECT * forces table row access even when index could cover query
EXPLAIN SELECT * FROM orders WHERE customer_id = 1001;

-- GOOD: Project only needed columns enables covering index
ALTER TABLE orders ADD INDEX idx_cust_cover
    (customer_id, order_id, order_date, total_amount, status);

EXPLAIN
SELECT order_id, order_date, total_amount, status
FROM orders WHERE customer_id = 1001;
-- type: ref, Extra: Using index (all data from index - zero table access)

Advanced Indexing Strategies for MySQL Query Optimization

Composite Index Design: The Left-Prefix Rule

Composite indexes follow the left-prefix rule: MySQL can only use an index starting from the leftmost column. A composite index on (A, B, C) supports queries on A, A+B, or A+B+C — but not B or C alone. Design composite indexes with equality columns first, range condition columns second, and ORDER BY / GROUP BY columns last to eliminate filesort operations.

-- Query: WHERE status = 'shipped' AND order_date BETWEEN x AND y ORDER BY order_date
-- Optimal: equality first, range second, ORDER BY aligned with range column
ALTER TABLE orders ADD INDEX idx_status_date_opt (status, order_date);

EXPLAIN
SELECT order_id, customer_id, total_amount
FROM orders
WHERE status = 'shipped'
  AND order_date BETWEEN '2024-01-01' AND '2024-06-30'
ORDER BY order_date;
-- type: range, key: idx_status_date_opt
-- Extra: Using index condition  (NO filesort! ORDER BY uses index)

-- Verify index columns being used via key_len
-- status ENUM NOT NULL = 1 byte
-- order_date DATE NOT NULL = 3 bytes
-- key_len = 4 means BOTH columns are utilized

-- Covering composite index for aggregate queries
ALTER TABLE orders ADD INDEX idx_grp_covering
    (status, order_date, customer_id, total_amount);

EXPLAIN
SELECT status, order_date, COUNT(*) AS cnt, SUM(total_amount) AS revenue
FROM orders
WHERE status IN ('shipped', 'delivered')
  AND order_date >= '2024-01-01'
GROUP BY status, order_date;
-- Extra: Using index (full covering index - no table access whatsoever)

Invisible Indexes: Safe Index Testing Without Dropping

MySQL 8.0 introduced invisible indexes, which the optimizer ignores while InnoDB continues maintaining them. This allows DBAs to safely validate the impact of removing an index before permanently dropping it — an indispensable tool for production index lifecycle management.

-- Make an index invisible to test impact of removing it
ALTER TABLE orders ALTER INDEX idx_status INVISIBLE;

-- EXPLAIN now shows optimizer ignoring this index
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
-- possible_keys: NULL (invisible index ignored)

-- Re-enable the index
ALTER TABLE orders ALTER INDEX idx_status VISIBLE;

-- Allow session to see invisible indexes for targeted testing
SET SESSION optimizer_switch = 'use_invisible_indexes=on';
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
SET SESSION optimizer_switch = 'use_invisible_indexes=off';

-- Check visibility status of all indexes
SELECT index_name, is_visible
FROM information_schema.STATISTICS
WHERE table_schema = 'ecommerce' AND table_name = 'orders'
GROUP BY index_name, is_visible;

Index Hints and MySQL 8.0 Optimizer Hints

When the MySQL optimizer makes a poor index selection — often due to outdated statistics or unusual data distributions — index hints and optimizer hints allow targeted intervention. Use them sparingly and always validate with EXPLAIN, as they bypass the optimizer’s cost model.

-- FORCE INDEX: Optimizer must use this index (ignores all others)
EXPLAIN SELECT * FROM orders FORCE INDEX (idx_order_date_status)
WHERE order_date >= '2024-01-01' AND status = 'delivered';

-- USE INDEX: Suggests an index (optimizer may still ignore)
EXPLAIN SELECT * FROM orders USE INDEX (idx_customer_id)
WHERE customer_id = 1001;

-- IGNORE INDEX: Prevents use of a specific index
EXPLAIN SELECT * FROM orders IGNORE INDEX (idx_status)
WHERE status = 'pending' AND order_date >= '2024-01-01';

-- Optimizer hints (MySQL 8.0+ preferred method)
SELECT /*+ NO_HASH_JOIN(o, c) */
    o.order_id, c.email, o.total_amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.status = 'pending';

-- SET_VAR hint: Change variable scope for a single query
SELECT /*+ SET_VAR(sort_buffer_size=4194304) */
    customer_id, SUM(total_amount) AS revenue
FROM orders
GROUP BY customer_id
ORDER BY revenue DESC
LIMIT 100;

Subquery Optimization and Common Table Expressions

Subqueries can be highly efficient or devastating for performance depending on how they are written. The most dangerous anti-pattern is the correlated subquery — a subquery with a DEPENDENT SUBQUERY select_type that re-evaluates for every row of the outer query. MySQL 8.0’s Common Table Expressions (CTEs) provide both performance parity with JOINs and dramatically improved readability for complex multi-step queries.

-- BAD: Correlated subquery (re-evaluated N times for N outer rows)
EXPLAIN
SELECT o.order_id, o.total_amount,
    (SELECT SUM(oi.unit_price * oi.quantity)
     FROM order_items oi
     WHERE oi.order_id = o.order_id) AS calculated_total
FROM orders o
WHERE o.order_date >= '2024-01-01';
-- select_type: DEPENDENT SUBQUERY (executed once per outer row!)

-- GOOD: JOIN with aggregation (single pass over data)
EXPLAIN
SELECT o.order_id, o.total_amount, oi_agg.calculated_total
FROM orders o
JOIN (
    SELECT order_id, SUM(unit_price * quantity) AS calculated_total
    FROM order_items GROUP BY order_id
) oi_agg ON oi_agg.order_id = o.order_id
WHERE o.order_date >= '2024-01-01';

-- BEST: CTE for readability with equivalent performance (MySQL 8.0+)
WITH order_totals AS (
    SELECT order_id, SUM(unit_price * quantity) AS calculated_total
    FROM order_items GROUP BY order_id
)
SELECT o.order_id, o.total_amount, ot.calculated_total
FROM orders o
JOIN order_totals ot ON ot.order_id = o.order_id
WHERE o.order_date >= '2024-01-01';

-- Recursive CTE: Hierarchical queries (category trees, org charts)
WITH RECURSIVE category_tree AS (
    SELECT category_id, parent_id, name, 0 AS depth
    FROM categories WHERE parent_id IS NULL
    UNION ALL
    SELECT c.category_id, c.parent_id, c.name, ct.depth + 1
    FROM categories c
    JOIN category_tree ct ON ct.category_id = c.parent_id
)
SELECT category_id, CONCAT(REPEAT('  ', depth), name) AS indented_name
FROM category_tree ORDER BY category_id;

Optimizing Pagination: Escaping the LIMIT/OFFSET Trap

Naive pagination using high OFFSET values is a classic performance trap. As OFFSET grows, MySQL must scan and discard increasingly large numbers of rows before returning the requested page — a problem known as deep pagination. For large datasets, cursor-based pagination using the last seen primary key delivers constant-time performance regardless of page depth.

-- BAD: High offset forces full scan of 1,000,100 rows
EXPLAIN SELECT order_id, order_date, total_amount
FROM orders ORDER BY order_id
LIMIT 100 OFFSET 1000000;
-- rows: 1000100 (scans and discards 1,000,000 rows)

-- GOOD: Cursor-based (keyset) pagination - constant performance
-- First page:
SELECT order_id, order_date, total_amount
FROM orders WHERE order_id > 0
ORDER BY order_id LIMIT 100;

-- Next page (pass last_order_id from previous result set):
SELECT order_id, order_date, total_amount
FROM orders
WHERE order_id > :last_order_id
ORDER BY order_id LIMIT 100;
-- type: range, rows: 100 (reads exactly what is needed)

-- Alternative: Late row lookup for complex multi-column sort
SELECT o.*
FROM orders o
JOIN (
    SELECT order_id FROM orders
    ORDER BY total_amount DESC, order_id
    LIMIT 100 OFFSET 50000
) ids ON ids.order_id = o.order_id
ORDER BY o.total_amount DESC, o.order_id;
-- Inner query works only with index pages; outer fetches only 100 full rows

Statistics Management and the Query Optimizer

The MySQL optimizer’s decisions are only as good as the statistics it uses. Stale or inaccurate statistics lead to poor plan choices — wrong join orders, missed index usage, and cardinality estimation errors. As a MySQL DBA, proactively managing statistics is a core operational responsibility, especially after bulk data loads or large DELETE operations.

-- Refresh table statistics
ANALYZE TABLE orders, customers, order_items, products;

-- View table statistics and sizes
SELECT table_name,
    table_rows,
    ROUND(data_length / 1024 / 1024, 2)  AS data_mb,
    ROUND(index_length / 1024 / 1024, 2) AS index_mb,
    update_time
FROM information_schema.TABLES
WHERE table_schema = 'ecommerce'
ORDER BY data_length DESC;

-- Check index cardinality (higher = more selective = better)
SELECT index_name, column_name, seq_in_index, cardinality, nullable
FROM information_schema.STATISTICS
WHERE table_schema = 'ecommerce' AND table_name = 'orders'
ORDER BY index_name, seq_in_index;

-- Increase sample pages for better statistics on large tables
ALTER TABLE orders STATS_SAMPLE_PAGES = 50;
ANALYZE TABLE orders;

-- InnoDB persistent statistics settings
SHOW VARIABLES LIKE 'innodb_stats%';
-- innodb_stats_persistent = ON (recommended for production)
-- innodb_stats_persistent_sample_pages = 20 (increase for accuracy)

-- Check when InnoDB table statistics were last updated
SELECT * FROM mysql.innodb_table_stats
WHERE database_name = 'ecommerce';

Performance Schema: Identifying the Highest-Impact Slow Queries

MySQL’s Performance Schema provides comprehensive instrumentation tables for real-time query performance monitoring. For MySQL DBAs, mastering the Performance Schema is essential for identifying the highest-impact optimization targets in production — revealing far more than the slow query log alone.

-- Top 10 slowest queries by total execution time
SELECT
    DIGEST_TEXT                                 AS query_template,
    COUNT_STAR                                  AS exec_count,
    ROUND(SUM_TIMER_WAIT / 1e12, 3)            AS total_time_sec,
    ROUND(AVG_TIMER_WAIT / 1e12, 6)            AS avg_time_sec,
    ROUND(MAX_TIMER_WAIT / 1e12, 6)            AS max_time_sec,
    SUM_ROWS_EXAMINED                           AS total_rows_examined,
    ROUND(SUM_ROWS_EXAMINED / COUNT_STAR, 0)   AS avg_rows_examined,
    SUM_NO_INDEX_USED                           AS full_scans
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME = 'ecommerce'
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;

-- Queries performing full table scans in production
SELECT
    DIGEST_TEXT,
    COUNT_STAR,
    SUM_NO_INDEX_USED,
    ROUND(AVG_TIMER_WAIT / 1e12, 6) AS avg_sec
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME = 'ecommerce' AND SUM_NO_INDEX_USED > 0
ORDER BY SUM_NO_INDEX_USED DESC LIMIT 10;

-- sys schema: Simplified top-level performance view
SELECT * FROM sys.statement_analysis
WHERE db = 'ecommerce'
ORDER BY total_latency DESC LIMIT 10;

-- sys schema: All queries doing full table scans
SELECT * FROM sys.statements_with_full_table_scans
WHERE db = 'ecommerce'
ORDER BY no_index_used_count DESC;

The Optimizer Trace: Deep-Dive Plan Analysis

When EXPLAIN and EXPLAIN ANALYZE do not provide sufficient insight, the Optimizer Trace delivers a complete JSON log of every decision the optimizer made — including all alternative plans considered and their cost estimates. This is the ultimate diagnostic instrument for resolving the most difficult query optimization problems.

-- Enable optimizer trace
SET SESSION optimizer_trace = 'enabled=on';
SET SESSION optimizer_trace_max_mem_size = 1048576;

-- Run the query to analyze
SELECT order_id, customer_id, total_amount
FROM orders
WHERE status = 'shipped'
  AND order_date BETWEEN '2024-01-01' AND '2024-06-30'
ORDER BY total_amount DESC
LIMIT 50;

-- Retrieve the trace (JSON format)
SELECT QUERY, TRACE
FROM information_schema.OPTIMIZER_TRACE\G

-- Key JSON sections to examine:
-- "considered_execution_plans": All plans evaluated
-- "best_access_path": Index chosen and why
-- "rows_estimation": Cardinality estimates per table
-- "cost_info": read_cost, eval_cost, prefix_cost per plan

-- Disable optimizer trace
SET SESSION optimizer_trace = 'enabled=off';

Key MySQL Variables for Query Performance Tuning

Beyond index design, several MySQL server variables directly influence query execution performance. Understanding and tuning these variables is a critical complement to query-level optimization in production environments.

-- Sort buffer: used when ORDER BY/GROUP BY cannot use an index
SHOW VARIABLES LIKE 'sort_buffer_size';          -- Default: 256KB
SET SESSION sort_buffer_size = 4 * 1024 * 1024; -- 4MB for heavy sorts

-- Join buffer: used for Block Nested Loop joins (non-indexed joins)
SHOW VARIABLES LIKE 'join_buffer_size';          -- Default: 256KB
SET SESSION join_buffer_size = 2 * 1024 * 1024; -- 2MB for large joins

-- Temporary table memory thresholds (exceeding causes disk spill)
SHOW VARIABLES LIKE 'tmp_table_size';            -- Default: 16MB
SHOW VARIABLES LIKE 'max_heap_table_size';       -- Default: 16MB
-- Set both equal to prevent disk-based temp tables

-- InnoDB buffer pool: the single most impactful performance variable
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';   -- Target: 70-80% of total RAM

-- Enable slow query log for continuous production monitoring
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;                  -- Capture queries > 1 second
SET GLOBAL log_queries_not_using_indexes = ON;   -- Capture queries without indexes
SHOW VARIABLES LIKE 'slow_query_log_file';       -- Check log file location

-- Read buffer: sequential scan performance
SHOW VARIABLES LIKE 'read_buffer_size';          -- Default: 128KB
SHOW VARIABLES LIKE 'read_rnd_buffer_size';      -- Default: 256KB

MySQL Query Optimization Checklist for DBAs and Developers

The following checklist provides a systematic approach to diagnosing and resolving slow queries in MySQL production environments. Apply these steps in order for every optimization engagement.

  1. Capture the slow query — Use the slow query log, performance_schema.events_statements_summary_by_digest, or sys.statement_analysis to identify the highest-impact queries by total execution time and examination count.
  2. Run EXPLAIN and EXPLAIN ANALYZE — Review every column starting with type (eliminate ALL and index scans), rows (minimize the cross-join product), and Extra (eliminate Using filesort and Using temporary where feasible).
  3. Verify index design — Confirm indexes exist on all columns used in WHERE, JOIN ON, GROUP BY, and ORDER BY. Design composite indexes following the left-prefix rule: equality conditions first, range conditions second.
  4. Eliminate anti-patterns — Remove functions on indexed columns in WHERE, replace SELECT * with projection, convert correlated subqueries to JOINs, and eliminate N+1 patterns entirely.
  5. Update table statistics — Run ANALYZE TABLE after bulk data changes to ensure the optimizer works with accurate cardinality estimates.
  6. Validate with EXPLAIN ANALYZE — After applying changes, re-run EXPLAIN ANALYZE to confirm actual row counts match optimizer estimates and execution time has improved measurably.
  7. Test with production-scale data — Always benchmark optimizations against data volumes comparable to production. An effective index on 10,000 rows may not scale to 100,000,000 rows.
  8. Monitor continuously — Use Performance Schema and the sys schema to continuously monitor query performance and proactively identify regressions before they impact users.

Conclusion

MySQL query optimization is both a science and an art. The science lies in understanding how the cost-based optimizer works, how indexes are structured and accessed internally by InnoDB, and how to interpret every field of the EXPLAIN and EXPLAIN ANALYZE output with precision. The art lies in applying this knowledge pragmatically — knowing when to add a composite index, when to rewrite a correlated subquery as a JOIN, when to refresh statistics, and when to override the optimizer with targeted hints.
Mastering the techniques in this guide — from dissecting EXPLAIN columns and eliminating full table scans, to designing optimal composite and covering indexes, avoiding deep pagination traps, leveraging invisible indexes for safe lifecycle management, and using the Performance Schema for continuous monitoring — equips you to build MySQL-backed systems that scale confidently to hundreds of millions of rows and thousands of concurrent connections.
The return on investment in MySQL query optimization skills is exceptional: reduced infrastructure costs, dramatically improved user experience, fewer on-call incidents, and a more resilient, predictable database tier. Every millisecond shaved from a high-frequency query executed millions of times daily translates directly into meaningful savings and competitive advantage. Start every optimization engagement with EXPLAIN, follow the evidence rigorously, and let the data guide every decision you make.

Planet for the MySQL Community

Ford Rehires ‘Gray Beard’ Engineers After AI Falls Short

Ford executives said they’ve hired 350 veteran engineers — some of them former employees — after AI and automated systems failed to deliver the desired quality, reports TechCrunch:
Bloomberg reports the company’s chief operating officer Kumar Galhotra told journalists that Ford had been "relying more and more on automated quality systems" with disappointing results. So the company "brought back technical specialists," and those specialists "hunt for failure points before a part ever reaches the plant floor." Charles Poon, Ford’s vice president of vehicle hardware engineering, added, "Mistakenly we thought that by just introducing artificial intelligence and ingesting the design requirements that we had, that that would produce a high-quality product."
The article points out that Ford is using the rehired gray beard engineers to train younger staff — and, to reprogram its AI tools.


Read more of this story at Slashdot.

Slashdot

untitled

RonSQL: a new SQL engine for RonDB with predictable low latency and CTEs

RonSQL: a new SQL engine for RonDB with predictable low latency and CTEs

Today we released RonDB 26.04.1, a beta release. It contains a
lot of new features, but the most interesting one is that RonSQL now
supports pushdown join aggregation and CTEs
, so that complex queries run
with low, predictable latency.

RonDB has always been able to answer complex queries through a MySQL Server.
The problem with that path is predictability. The application asks for an
answer, but it has no guarantee about how fast that answer arrives: the MySQL
optimizer picks a plan that may or may not parallelise the query across the RonDB
data nodes, and a plan that looks fine on a small table can fall off a cliff as
the data grows.

RonSQL takes a different contract. The rule is simple:

Anything RonSQL accepts can be pushed down to the RonDB data nodes for
parallel execution.

If a query parses and plans in RonSQL, it runs as a parallel pushdown —
there is no fallback to a slow, single-threaded plan. That means the latency of a
complex query is something the application can actually reason about up front,
instead of discovering it in production.

Why this matters: Feature Stores

RonSQL grew out of the needs of AI applications built on Feature
Stores
, and in particular on-demand (real-time)
transformations in Hopsworks
.

Traditionally an online Feature Store only does primary-key lookups. To keep
those lookups fast, every feature has to be pre-computed and
written back before serving. That works, but it has two costs:

  1. Stale features. A feature is only as fresh as the last
    batch job that recomputed it. The events from the last few seconds —
    often the most predictive ones for fraud, recommendations, or anomaly
    detection — are not yet reflected.
  2. Expensive BLOB packing. A common trick is to pack a range
    of values into a single BLOB (often Avro-encoded) so a whole feature group can
    be fetched in one lookup. But every change to any value means
    re-encoding the whole BLOB, which may hold hundreds of values.

RonSQL attacks both problems:

  • Fresh features on the fly. Instead of pre-computing
    aggregations, you stream raw rows into RonDB and let RonSQL aggregate them at
    query time. A row inserted a second ago is included immediately, so the
    feature reflects what just happened.
  • Index scans instead of BLOBs. Instead of packing values
    into a BLOB and re-encoding on every change, you store the values as ordinary
    rows and let RonSQL read them with an index scan. Updates become simple inserts
    and deletes — and deletes are usually handled for you by RonDB’s
    row-level TTL, so old data ages out without any application
    code.

CTEs (Common Table Expressions, the SQL WITH clause) are what let
you combine these two ideas in a single, readable query: aggregate the fresh fact
rows in a CTE, then join the result against your normalised dimension tables.

A worked example: real-time card-fraud features

Consider a fraud-scoring model. At inference time it needs a feature vector for
one card, computed over that card’s most recent activity. The raw transactions
arrive continuously and are inserted straight into RonDB:

-- Fact table: one row per card transaction, inserted in real time.
CREATE TABLE txn (
  txn_id       BIGINT       NOT NULL,
  cc_num       BIGINT       NOT NULL,   -- card / account identifier
  merchantkey  INT          NOT NULL,   -- references merchant.m_merchantkey
  amount       INT          NOT NULL,   -- minor units (cents)
  txn_time     DATETIME(6)  NOT NULL,
  is_declined  TINYINT      NOT NULL,
  PRIMARY KEY USING HASH (txn_id),
  -- Ordered index: range-scan one card's recent activity cheaply.
  INDEX idx_card_time (cc_num, txn_time)
) ENGINE=NDB
  COMMENT='NDB_TABLE=TTL=604800@txn_time';  -- auto-expire rows after 7 days

-- Small dimension table: replaces a per-card Avro BLOB of merchant attributes.
CREATE TABLE merchant (
  m_merchantkey INT          NOT NULL,
  m_category    VARCHAR(16)  NOT NULL,
  m_risk_score  INT          NOT NULL,
  PRIMARY KEY USING HASH (m_merchantkey)
) ENGINE=NDB;

Step 1 — a fresh feature vector with a single scan

The simplest on-demand feature is a scalar aggregate over the card’s last hour
of transactions. No pre-computation, no BLOB — just an index range scan that
includes whatever was inserted milliseconds ago:

SELECT
  COUNT(*)                                          AS txns_1h,
  SUM(amount)                                       AS amount_1h,
  MAX(amount)                                       AS max_amount_1h,
  AVG(amount)                                       AS avg_amount_1h,
  SUM(CASE WHEN is_declined = 1 THEN 1 ELSE 0 END)  AS declines_1h
FROM txn
WHERE cc_num = 4716253018273645
  AND txn_time >= DATE_SUB('2026-06-29 14:30:00', INTERVAL 1 HOUR);

RonSQL turns the WHERE into an ordered-index range
scan
on idx_card_time — it touches only this card’s
last hour — and pushes the COUNT/SUM/MAX/AVG
and the CASE expression down to the data nodes, which aggregate in
parallel and return a single row.

Step 2 — combining fresh aggregation with a dimension join, using a CTE

Now suppose the model wants spend broken down by merchant category.
The category does not live on the transaction — it lives on the
merchant dimension. The classic Feature Store approach would
denormalise the category into a packed BLOB per card. With RonSQL we keep the
data normalised and join at query time:

WITH spend_by_merchant AS (
  SELECT merchantkey AS m,
         SUM(amount) AS spend,
         COUNT(*)    AS txns
  FROM txn
  WHERE cc_num = 4716253018273645
    AND txn_time >= DATE_SUB('2026-06-29 14:30:00', INTERVAL 1 HOUR)
  GROUP BY merchantkey
)
SELECT m.m_category                  AS category,
       SUM(spend_by_merchant.spend)  AS spend_last_hour,
       SUM(spend_by_merchant.txns)   AS txns_last_hour
FROM merchant AS m
JOIN spend_by_merchant ON spend_by_merchant.m = m.m_merchantkey
GROUP BY m.m_category;

This query is easy to reason about, top to bottom:

  1. The CTE spend_by_merchant runs an
    ordered-index range scan on idx_card_time, restricted to one card
    over the last hour — the only large table in play. The data nodes
    aggregate SUM(amount) and COUNT(*) grouped by
    merchantkey, returning just a handful of rows (one per merchant
    the card touched in that hour).
  2. The join attaches the merchant attributes.
    m_merchantkey is the primary key of merchant, so each
    row is resolved with a cheap primary-key lookup rather than another scan.
    merchant is a small dimension table.
  3. The outer query re-aggregates the joined result by
    m_category, producing one row per merchant category — a
    compact, model-ready feature vector.

Every stage is a pushdown, and stages such as the index scan and the lookups
run in parallel across the data nodes. We could even execute several CTEs in
parallel. Because RonSQL guarantees the whole thing pushes down, the latency is
bounded and predictable — which is exactly the contract a real-time
inference path needs.

Running a RonSQL query

RonSQL is reachable two ways:

  • REST (RDRS). The RonDB REST server exposes a RonSQL
    endpoint, which is the path used by online serving. It even keeps a built-in
    latency histogram so you can watch the predictable-latency promise hold in
    production. The rondb-cli shell sends a line straight to it with
    the RONSQL prefix.
  • ronsql_cli. A standalone client for scripting
    and experimentation. It reads a query from --execute,
    --execute-file, or stdin and can emit results as JSON
    (ideal for a feature vector) or TEXT.

Both paths support EXPLAIN. Prefixing a query with
EXPLAIN shows the chosen pushdown plan — which index drives
each scan, which joins become lookups, and where the aggregation happens —
so “will this be fast?” is a question you answer before you
ship, not after.

What RonSQL supports today

RonSQL is a read-only, aggregation-focused SQL subset designed so that
everything it accepts can be pushed down:

  • Statements: SELECT only (plus
    EXPLAIN). No DDL/DML.
  • CTEs: multiple, comma-separated WITH clauses
    (non-recursive); a CTE can be joined as a child or used as the driving
    table.
  • Joins: INNER JOIN,
    LEFT [OUTER] JOIN, self-joins, and comma cross-joins over scalar
    CTEs. Equi-join conditions, including composite keys
    (a.x = b.x AND a.y = b.y).
  • Filtering: rich WHERE
    = <> < <= > >=, LIKE,
    IN (list), IS [NOT] NULL,
    AND/OR/XOR/NOT, arithmetic, bitwise ops, and
    CASE WHEN.
  • Subqueries: EXISTS,
    IN (subquery), and scalar subqueries.
  • Aggregates: COUNT(*), COUNT(expr),
    SUM, MIN, MAX, AVG.
  • Grouping & shaping: GROUP BY
    (multi-column, any table), HAVING,
    ORDER BY ASC/DESC, LIMIT.
  • Expressions: arithmetic, CASE WHEN,
    GREATEST/LEAST, and date/time functions
    DATE_ADD, DATE_SUB, EXTRACT,
    INTERVAL.
  • Index hints: FORCE INDEX,
    USE INDEX, IGNORE INDEX.

Why express features in SQL at all?

Because the Feature Store has to compute the same feature in two very
different settings. Batch training and batch inference run on
engines like Spark SQL and DuckDB — both
batch query engines, chosen for different characteristics (Spark scales the work
across a cluster for very large datasets; DuckDB runs embedded and is hard to
beat on a single node for moderate data). Online serving runs on
RonSQL, computing the feature fresh at inference time. When all
of them speak SQL, the same feature logic can be expressed as the same query text
on each engine, which eliminates a notorious source of
training/serving skew — features that subtly differ
between the model’s training data and what it sees live at inference.

Where RonSQL goes next

RonSQL is already useful, but there is a clear roadmap, much of it driven
directly by Feature Store needs:

  • Distinct-count features. COUNT(DISTINCT ...),
    and an approximate variant (HyperLogLog), to answer “how many distinct
    merchants / devices / countries in the last hour?” — a staple fraud
    signal. DISTINCT and OFFSET more generally.
  • More aggregate functions. STDDEV and
    VARIANCE (for z-score features), and
    GROUP_CONCAT.
  • Point-in-time correctness (“time travel”).
    As-of joins so the same RonSQL query can reconstruct a feature’s value
    at a historical timestamp for training, exactly matching what online serving
    would have returned — closing the skew gap completely.
  • Richer query shapes. Derived tables / subqueries in
    FROM, UNION, RIGHT/FULL OUTER
    JOIN
    , and recursive CTEs for hierarchy/graph features.
  • Vector / embedding pushdown. Top-K nearest-neighbour
    search pushed to the data nodes, as embeddings increasingly live alongside
    scalar features.
  • Cost-based join ordering. The planner currently joins
    left-to-right; reordering based on table/index statistics would make more
    queries fast by default.
  • Continuous / materialised features. Incrementally
    maintaining a CTE’s result as new rows arrive, blurring the line between
    on-demand and pre-computed features.

The core contribution stays the same: predictable low latency for
complex queries over fresh data
, expressed in portable SQL —
exactly what an online Feature Store needs to serve fresh, skew-free features to
an AI model.

Planet for the MySQL Community

Top MySQL Performance Metrics to Monitor in 2026

https://webyog.com/wp-content/uploads/2017/11/connections-and-buffer-pool-usage-1.png

As databases power increasingly complex workloads — from AI-driven applications to cloud-native
microservices and containerized deployments — the ability to monitor MySQL performance with precision has
never been more important. Whether you’re running MySQL 8.0/8.4 LTS in a stable production environment or
experimenting with the MySQL 9.x Innovation series, the fundamentals of connection management and buffer
pool tuning remain the bedrock of a healthy database.
This post, part of our ongoing MySQL monitoring series, dives into two critical areas: connection metrics and
the InnoDB buffer pool. Mastering these will help you catch problems before they become outages.

Why Performance Monitoring Matters in 2026

Modern infrastructure has raised the stakes for database performance. AI inference workloads generate
high-throughput, low-latency query patterns. Cloud deployments scale horizontally but introduce new failure
modes. Containerized MySQL instances — whether on Kubernetes or ECS — spin up and down rapidly, making
consistent monitoring essential.
The good news: MySQL’s built-in instrumentation is richer than ever. MySQL 8.4 LTS and the 9.x Innovation
releases ship with improved Performance Schema coverage, enhanced replication visibility, and better
diagnostics for connection errors. Knowing which metrics to watch — and what thresholds to act on — separates
reactive firefighting from proactive operations.

Part 1: MySQL Connection Metrics

How MySQL Manages Connections

Every client request to MySQL passes through the connection manager thread. MySQL maintains a pool of
threads to handle these connections, and each active connection consumes memory and CPU. In
high-concurrency workloads — think e-commerce flash sales, real-time analytics pipelines, or AI feature stores
— connection pressure is one of the first things that breaks.

The default max_connections value is 151, which is appropriate for development but far too low for
production. Most production environments should set this to hundreds or even thousands, depending on
available RAM and workload patterns.

Key Connection Metrics to Track

Metric What It Tells You
Threads_connected Number of currently open connections
Threads_running Connections actively executing queries (not idle)
Connections Cumulative total connections since server start
Connection_errors_internal Errors from internal server issues
Aborted_connects Failed connection attempts
Aborted_clients Failed connection attempts

Threads_running is arguably the most important of these. A spike here — especially if it approaches
max_connections — signals that your server is under stress. If Threads_running climbs while
Threads_connected stays flat, you likely have slow queries piling up.

Granular Error Diagnostics

MySQL surfaces granular connection error counters that help you diagnose the root cause of failed connections:

  • Connection_errors_accept — errors at the network accept layer, often a kernel or OS-level issue
  • Connection_errors_max_connections — clients being refused because max_connections is
    exhausted
  • Connection_errors_peer_address — errors resolving client IP addresses
    In containerized environments, Connection_errors_peer_address can spike unexpectedly due to DNS
    resolution latency or ephemeral IP behavior. Watching this counter saves hours of debugging.

Connection Tuning Checklist

  • Set max_connections based on available RAM, not intuition
  • Use connection pooling (ProxySQL, MySQL Router) to reduce raw connection overhead in high-concurrency apps
  • Monitor Aborted_clients — persistent values indicate application-level connection leaks
  • Alert when Threads_running / Threads_connected > 0.5 consistently

Part 2: InnoDB Buffer Pool Metrics

What the Buffer Pool Does

The InnoDB buffer pool is MySQL’s most important memory structure. It caches table data and index pages in
RAM, reducing the need for expensive disk reads. A well-sized buffer pool can serve the majority of reads from
memory — dramatically reducing latency and I/O load.

The default buffer pool size is 128MB, which is a reasonable starting point for development. For dedicated
database servers, the best practice is to allocate approximately 80% of available RAM to the buffer pool.

Sizing the Buffer Pool

The buffer pool size must align with this formula:

innodb_buffer_pool_size = N × innodb_buffer_pool_chunk_size × innodb_buffer_pool_instances

MySQL 8.x allows online resizing of the buffer pool, meaning you can adjust it without a restart — a significant
operational improvement. In cloud environments where instance sizes change frequently, this matters.

The LRU Algorithm and Midpoint Insertion

InnoDB uses a variant of the Least Recently Used (LRU) algorithm to manage which pages stay in the buffer
pool. Rather than a simple LRU list, MySQL uses a midpoint insertion strategy: newly loaded pages enter at
the midpoint of the list, not the head. This prevents large full-table scans from flushing your hot working set out of
the pool.

Two tuning parameters control this behavior:

  • innodb_old_blocks_pct — percentage of the buffer pool reserved for “old” (recently loaded) pages; default is 37%
  • innodb_old_blocks_time — how long a page must stay in the old sublist before it can be promoted to “young”; default is 1000ms

For OLTP workloads with repeated access to the same rows, the defaults work well. For mixed workloads
running analytical queries alongside transactional ones — increasingly common as AI pipelines run batch
feature extraction alongside live serving — you may need to tune these values to protect your hot page set.

Key Buffer Pool Metrics

Metric What It Tells You
Innodb_buffer_pool_read_requests Total logical read requests (memory hits + disk reads)
Innodb_buffer_pool_reads Physical reads from disk (cache misses)
Innodb_buffer_pool_pages_total Total pages in the buffer pool
Innodb_buffer_pool_pages_free Pages currently available (not in use)

Calculating Buffer Pool Efficiency

Cache Miss Rate = (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests) × 100

A healthy production system should have a cache miss rate below 1% — meaning 99%+ of reads are served
from memory. If your miss rate climbs above this threshold, your buffer pool is undersized for your working data
set.

Watch Innodb_buffer_pool_pages_free as well. A consistently low free page count means the pool is
under memory pressure, and MySQL is spending time evicting pages rather than serving data.

Monitoring These Metrics in Practice

Manual monitoring with SHOW GLOBAL STATUS is a starting point, but it doesn’t scale. For teams running
MySQL in production — especially across multiple instances or cloud regions — a dedicated monitoring tool is essential.

SQL Diagnostic Manager for MySQL (part of the Webyog/IDERA family) provides real-time dashboards for all
the metrics discussed here, plus automated alerting, query analysis, and root cause diagnostics. Whether you’re
managing a single server or dozens of replicas behind a load balancer, having these metrics in one place makes
the difference between proactive tuning and reactive recovery.

Summary

Connection management and buffer pool sizing are foundational to MySQL performance. In 2026’s environment
of AI workloads, cloud scaling, and containerized deployments, these metrics deserve continuous attention —
not just one-time configuration.

  • Monitor Threads_running and connection error counters for early warning signs
  • Size your buffer pool to approximately 80% of RAM on dedicated servers
  • Target a cache miss rate below 1%
  • Use a monitoring tool that surfaces these metrics continuously, not just on demand

Stay tuned for the next post in this series, where we cover InnoDB I/O metrics and query performance
diagnostics.

Frequently Asked Questions

What is a healthy value for Threads_running?

In most workloads, Threads_running should stay well below max_connections. If it consistently exceeds
20–30% of your connection limit, investigate slow queries or lock contention. A sudden spike often points to a
rogue query or a batch job gone wrong.

How do I know if my buffer pool is too small?

Check your cache miss rate: (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)
× 100. A value above 1% is a warning sign. Also watch Innodb_buffer_pool_pages_free — if free pages hover near zero, MySQL is under memory pressure and evicting data it needs.

Can I resize the buffer pool without restarting MySQL?

Yes — MySQL 8.x supports online buffer pool resizing via SET GLOBAL innodb_buffer_pool_size. The
resize happens in chunks and may take a few seconds to minutes depending on pool size. No restart required.

What causes Aborted_clients to increase?

Aborted clients typically indicate application-side connection leaks — connections opened but not properly
closed. Check your application’s connection pooling configuration and ensure connections are returned to the
pool after each operation.

How often should I review these metrics?

Critical metrics like Threads_running and buffer pool efficiency should be monitored continuously with
alerting thresholds. Review trends weekly and investigate any sustained drift from your baseline.

Does this apply to cloud-managed MySQL (RDS, Cloud SQL)?

Yes. These metrics are exposed in cloud-managed MySQL instances and most providers surface them in their
native monitoring dashboards. You can also connect SQL Diagnostic Manager for MySQL to RDS and Cloud
SQL instances for deeper analysis.

Ready to Monitor MySQL Like a Pro?

Stop flying blind on your MySQL performance. SQL Diagnostic Manager for MySQL gives you real-time
dashboards, automated alerting, and root cause analysis — covering every metric discussed in this post and
more.

  • Start a free 14-day trial — no credit card required
  • Request a personalised demo — see it working against your own workload
  • Talk to our team — get advice on the right monitoring setup for your environment

Visit webyog.com to get started today.

Planet for the MySQL Community

How to Become a MySQL DBA in 2026

https://webyog.com/wp-content/uploads/2023/04/online-business-database_53876-95876.jpeg

Database administrators are the unsung architects of the modern internet. Every time a patient’s electronic
health record loads instantly, every time you check out a shopping cart without a hitch, every time a
recommendation engine surfaces exactly what you want — a database administrator made sure the system
could handle it. In 2026, that responsibility has grown larger, more complex, and more rewarding than ever.
If you’re wondering how to break into MySQL DBA work — or level up from a junior role — this guide maps out
the path clearly. MySQL 8.0/8.4 LTS and the MySQL 9.x Innovation series have expanded what DBAs need to
know, but the fundamentals remain the same. Let’s walk through them.

Why MySQL DBA Skills Are Still in High Demand

MySQL has consistently ranked among the top relational databases worldwide, trailing only Oracle in the
DB-Engines rankings — and it’s not slowing down. It powers massive global platforms — Meta’s social graph,
YouTube’s video metadata, countless SaaS applications, and a growing share of AI training and inference
pipelines that need fast, reliable structured data access.
Despite the rise of NoSQL systems and cloud-managed databases, MySQL expertise remains a hiring priority.
Cloud providers offering managed MySQL (Amazon RDS, Google Cloud SQL, Azure Database for MySQL)
have increased accessibility, but they haven’t reduced the need for skilled DBAs. Someone still needs to tune
queries, manage access controls, architect backup strategies, and respond when things go wrong. That
someone is you.

What Does a MySQL DBA Actually Do Day-to-Day?

Before you invest months of learning, it helps to understand what the job looks like in practice. A typical MySQL
DBA’s day might include:

  • Implementing database changes — rolling out schema migrations, index additions, or stored procedure
    updates with minimal downtime
  • Refreshing development databases — copying sanitized production data to dev and QA environments
    so developers can test against realistic datasets
  • Diagnosing performance issues — identifying slow queries, lock contention, or replication lag and
    resolving them before they cascade into outages
  • Managing access permissions — using GRANT, REVOKE, and role-based access controls to enforce the
    principle of least privilege
  • Conducting compliance reviews — auditing user permissions and data handling practices, increasingly
    important as privacy regulations tighten globally
  • Supporting disaster recovery testing — verifying that backup and restore procedures work as
    documented, not just as assumed

In 2026, many DBAs also find themselves involved in AI infrastructure work — maintaining databases that store
training datasets, feature stores, or model metadata. Familiarity with high-throughput ingestion patterns and
vector-adjacent storage has become a differentiator.

The Learning Roadmap: What to Study

Installation and Configuration

Start with the basics: install MySQL Community Edition on your local machine (or a free-tier cloud VM),
configure the server, and learn your way around the configuration file (my.cnf / my.ini). Understand the
difference between MySQL 8.0/8.4 LTS (the stable, long-term support branch) and MySQL 9.x Innovation
releases (feature-rich but faster-moving). Most production environments run LTS versions — that’s where your
hands-on practice should focus.

Security and Access Control

Security is non-negotiable. Learn the MySQL privilege system thoroughly:

  • GRANT — assign privileges to users
  • REVOKE — remove specific privileges
  • Role-based access control (introduced in MySQL 8.0 and now mature in 8.4)
  • Authentication plugins, including caching_sha2_password (the modern default)
  • SSL/TLS configuration for encrypted connections

In containerized and cloud environments, managing secrets and rotating credentials securely is as important as
the SQL syntax itself.

Backup and Restore

A DBA who can’t restore a database is a liability. Study:

  • mysqldump for logical backups
  • MySQL Enterprise Backup / Percona XtraBackup for physical backups
  • Point-in-time recovery using binary logs
  • Replication as a component of your HA and DR strategy

Practice restores regularly. Many DBAs have discovered their backup strategy was broken only when they
needed it most.

Indexing and Query Optimization

Understanding how MySQL executes queries is what separates good DBAs from great ones. Learn to:

  • Read EXPLAIN and EXPLAIN ANALYZE output
  • Design indexes that support your workload’s query patterns
  • Identify and resolve N+1 query problems, full table scans, and missing index conditions
  • Use the Performance Schema and sys schema for workload analysis

MySQL 8.4 and 9.x have added richer optimizer tracing and index skip scan capabilities — worth learning
alongside the fundamentals.

Replication and High Availability

Most production MySQL environments use replication. Learn:

  • Asynchronous replication (the classic model)
  • Semi-synchronous replication for stronger durability guarantees
  • Group Replication / InnoDB Cluster for multi-primary topologies
  • MySQL Router for automatic failover routing

Cloud-managed services abstract some of this, but understanding what’s happening underneath makes you far
more effective when things go wrong.

Monitoring

You cannot manage what you cannot measure. Learn to query Performance Schema, watch global status
variables, and set up alerting for:

  • Replication lag
  • Long-running queries and lock waits
  • Connection exhaustion

Familiarity with monitoring tools will accelerate your effectiveness immediately.

Career Paths Into MySQL DBA Work

Many successful MySQL DBAs didn’t start there. Common transition paths include:

  • Systems administrators who learned database operations as part of owning the full stack
  • Software developers who moved into data engineering or backend operations
  • Data analysts who wanted to go deeper into the infrastructure powering their data

If you’re transitioning from sysadmin or dev work, you already have transferable skills — Linux administration,
scripting, networking, and version control all apply directly to DBA work.
Mentorship accelerates the path considerably. If you can find an experienced DBA to learn from — through a
job, a community forum, or an open-source project — take that opportunity. The gap between knowing the
commands and understanding the judgment calls comes from experience, and mentorship compresses that
timeline.

Tools That Will Make You Effective

Learning MySQL’s command-line tools (mysql, mysqladmin, mysqldump, mysqlcheck) is foundational. As
you advance, dedicated tooling becomes essential.
SQL Diagnostic Manager for MySQL is the tool Webyog and IDERA offer for professional MySQL monitoring.
It provides:

  • Real-time performance dashboards covering connections, query throughput, and buffer pool health
  • Disk and lock monitoring to catch I/O bottlenecks and contention before they escalate
  • Security alerts for privilege changes and suspicious access patterns
  • Multi-user access for DBA teams managing multiple instances
  • Root cause analysis tools that surface the query or configuration change behind a performance event

For day-to-day query writing and database browsing, SQLyog (available as Community and paid editions) is a
widely-used GUI client that speeds up development and administration workflows significantly.

How Long Does It Take?

Expect 6–12 months of consistent study and hands-on practice to reach functional competency — enough to
take on a junior DBA role. Most practitioners report logging 200+ hours of practical work before feeling genuinely
confident handling production incidents.

The MySQL documentation is excellent and free. MySQL Community Edition gives you a full server to
experiment on at no cost. There’s no excuse not to start today.

Getting Started This Week

  1. Download and install MySQL Community Edition (8.4 LTS recommended for beginners)
  2. Work through the official MySQL Reference Manual chapters on installation, security, and backup
  3. Install SQLyog Community Edition for a GUI interface alongside your CLI practice
  4. Join the Webyog Forums — a community of 15,000+ MySQL users where questions get answered
  5. Build something real: a sample database, a backup script, a monitoring query

The MySQL DBA path is well-documented, practically learnable, and professionally rewarding. In 2026, the
demand is strong and growing. The question isn’t whether the opportunity is there — it’s whether you’ll start.

Frequently Asked Questions

Do I need a formal degree to become a MySQL DBA?

No. Many successful DBAs come from sysadmin backgrounds, software development, or entirely self-taught
paths. Employers care about what you can demonstrate — not the credential on your resume. A portfolio of
hands-on work with real MySQL instances carries more weight than a certificate alone.

How long does it take to land a junior DBA role?

Most people reach functional competency — enough to handle a junior position — within 6–12 months of
consistent, hands-on practice. Budget for 200+ hours of real work: installing, configuring, breaking, and restoring
MySQL in a lab environment.

What’s the difference between MySQL 8.4 LTS and MySQL 9.x Innovation?

MySQL 8.4 LTS (Long-Term Support) is the stable, production-recommended track with multi-year security and
bug-fix support. MySQL 9.x Innovation releases ship new features faster but are not intended for long-term
production use. For learners, start with 8.4 LTS — it’s what most production environments run.

Is MySQL experience transferable to other databases?

Absolutely. The fundamentals — indexing, query optimisation, backup strategy, access control, replication —
translate well to PostgreSQL, MariaDB, and cloud-managed databases like Amazon Aurora. MySQL DBA
experience is a strong foundation for a broader data infrastructure career.

What tools should I learn as a MySQL DBA?

Start with the MySQL command-line tools (mysql, mysqldump, mysqladmin). Add a GUI client like SQLyog
for day-to-day administration. As you advance, learn a professional monitoring platform — SQL Diagnostic
Manager for MySQL is widely used in production environments and worth familiarising yourself with early.

Is MySQL DBA a good long-term career in the age of AI?

Yes. AI workloads increase — not decrease — the demand for reliable structured data storage. Model training
pipelines, feature stores, and inference logging all depend on databases. DBAs who understand high-throughput
ingestion, replication, and cloud deployments are well-positioned for the AI era.

Start Your MySQL DBA Journey Today

Whether you’re just exploring the role or ready to accelerate your path to production, Webyog has the tools to
get you there faster.

  • Try SQL Diagnostic Manager for MySQL free for 14 days — learn on a real monitoring platform used by
    professionals
  • Request a demo — see how DBAs use it to manage production MySQL environments
  • Contact our team — get guidance on building a MySQL lab environment or career development resources

Visit webyog.com and take the first step today.

Download the IDERA whitepaper “How to Become a MySQL DBA” for a deeper dive into the curriculum and career path. Available at webyog.com.

Curious what this career actually pays? Read our salary deep-dive: MySQL DBAs Are Landing Six-Figure Jobs
in 2026 — And You Can Too.

Planet for the MySQL Community