People used to say, “Coal is king,” and for decades, it was. Coal powered just about everything, but mismanagement and a lack of stewardship left some messes.
These days, “Data is king,” and a lot more. Data powers everything, and unlike coal and coal combustion, data and databases aren’t going away. So staying with our analogy, and learning from historical lessons, organizations must be responsible stewards of data for the sake of customers, stakeholders, and the business itself.
Any organization that uses personal and other sensitive data must have a firm, proven plan for business continuity in the event of a disaster or cyberattack. Losing access to or control of data for an extended period of time will disrupt operations, lead to financial losses, and damage an organization’s reputation. Recovering from a tarnished reputation can be costly and time-consuming.
Data stewardship also means protecting people’s privacy, safeguarding against breaches, and adhering to regulations and standards such as the European Union’s General Data Protection Regulation (GDPR), the United States’ Sarbanes-Oxley Act (SOX), and the Payment Card Industry Data Security Standard (PCI DSS).
In this blog, we’ll focus on the elements of database backup and disaster recovery, and we’ll introduce proven solutions for maintaining business continuity, even amid otherwise dire circumstances.
Why backup and recovery preparedness is so important
Such a royal introduction demands background, so let’s get after it: Any data loss or unexpected downtime hurts an organization. Some losses can be crippling, even a business death knell. So it’s important to know (and not overlook) the trouble spots. Some are hidden, but some are in plain sight. In general terms, here are potential trouble spots:
- Hardware failure: Manufacturing defects, wear and tear, physical damage, and other factors can cause hardware to fail. Power surges, outages, and harsh conditions (e.g., heat) can damage hardware components and cause data loss.
- Software failure: Software applications can become vulnerable, or they can crash altogether. Without data backup mechanisms, there can be data loss or system downtime. Even worse, entire operating systems can crash, also resulting in data loss.
- Human mistakes: Incorrect configuration is an all-too-common cause of hardware and software failure. Similarly, accidental deletion is a culprit.
In all three instances, failure to regularly back up data can result in significant data loss in the event of a disaster. Solid backup procedures must be in place.
Backup types and strategies
It all begins with choosing a strategy, and that depends on factors such as the use and importance of data to your business, your recovery time objectives (RTO), and your budget. Depending on what is needed, here are some common database backup types, strategies, and considerations:
Full backup vs. incremental backup: Best suited for smaller databases or those that don’t incur a lot of changes, a full backup includes a complete restore point. Though it can put safety concerns at ease, a full backup can be time-consuming and expensive. An incremental backup, which is faster and requires less storage than a full backup, captures changes made since the previous backup. It’s suitable for databases with moderate change rates. (For a more detailed description, read Full vs. Incremental vs. Differential Backups: Comparing Backup Types.)
Hot backups vs. cold backups: A hot backup — used to minimize downtime and ensure data availability for critical applications — allows you to create a copy of a database while the system is still actively serving user requests and processing transactions. In a cold backup, the database is taken offline. This is advisable only in certain scenarios, such as maintenance during low-use hours.
Choosing the right backup frequency: The appropriate frequency, of course, will vary from one organization to another. If you can’t afford to lose much data, you’ll need more frequent backups, possibly even continuous data protection solutions. The rate at which data changes within your database is a significant factor as well. To incorporate the latest changes, databases with high write activity might require more frequent backups.
Offsite backup: This involves physically storing backups in a secure location somewhere other than the primary data center. It’s more expensive, but offsite backup protects against site-wide disasters.
With the right backup strategy, you’ll be better able to achieve the aforementioned business continuity. Still, disasters might happen, so let’s also examine recovery.
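To make the full-versus-incremental trade-off concrete, here is a minimal sketch of how a restore might pick which backups to apply. It assumes incrementals always build on the most recent full backup; the `Backup` class, the `restore_chain` function, and the sample dates are hypothetical and not tied to any particular backup tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str  # "full" or "incremental"


def restore_chain(backups: list[Backup], target: datetime) -> list[Backup]:
    """Return the backups to apply, in order, to restore up to `target`."""
    usable = sorted(
        (b for b in backups if b.taken_at <= target),
        key=lambda b: b.taken_at,
    )
    # Find the most recent full backup taken at or before the target.
    full_index = max(
        (i for i, b in enumerate(usable) if b.kind == "full"),
        default=None,
    )
    if full_index is None:
        raise ValueError("no full backup exists at or before the target time")
    # Everything after that full backup is an incremental that builds on it.
    return usable[full_index:]


# Hypothetical schedule: weekly full backups with daily incrementals.
backups = [
    Backup(datetime(2023, 8, 6), "full"),
    Backup(datetime(2023, 8, 7), "incremental"),
    Backup(datetime(2023, 8, 8), "incremental"),
    Backup(datetime(2023, 8, 13), "full"),
    Backup(datetime(2023, 8, 14), "incremental"),
]

chain = restore_chain(backups, datetime(2023, 8, 8, 12))
print([f"{b.kind} {b.taken_at:%Y-%m-%d}" for b in chain])
```

The longer the chain of incrementals, the more steps a restore takes, which is one reason many schedules mix periodic full backups with frequent incrementals.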
Disaster recovery strategies and testing
One might think disaster begets loss, but with the right planning, that loss can be so minimal it’s more “blip” than “bad.” With that planning in place, you can look at it this way instead: Backup begets disaster recovery.
Disaster recovery (DR) strategies are essential to ensuring the integrity, availability, and reliability of data, particularly in the event of unexpected failures or errors. Such strategies help restore a database to a consistent and usable state. A disaster recovery plan can be as simple as the use of a backup and recovery procedure, or it can be complex, depending on the RTO and the recovery point objective (RPO). Key elements of comprehensive database recovery plans include point-in-time recovery, high availability and failover, replication, and others (we’ll break them down in a bit).
Whatever plan you arrive at, you must test it. There should be no shortcuts in the testing, which should include:
- Backup testing: Test to ensure that data can indeed be recovered from backups. This includes both data and log backups.
- Failover testing: If you have a high availability setup or a secondary data center, test to ensure the failover process switches over to the backup database server.
- Application testing: Test to make sure applications function correctly after recovery.
- Data consistency: Verify that data consistency between the primary and secondary systems will exist after recovery.
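As a small illustration of backup testing, the sketch below copies a live database and then checks the copy. It uses SQLite from Python's standard library purely because it runs anywhere; the `backup_and_verify` function, the `orders` table, and the sample rows are hypothetical, and a production test would restore to a separate host with your own database's tooling.

```python
import sqlite3


def backup_and_verify(source: sqlite3.Connection) -> bool:
    """Copy a live database, then check that the copy is intact and complete."""
    copy = sqlite3.connect(":memory:")  # stand-in for a separate restore host
    try:
        source.backup(copy)  # online copy; the source stays usable

        # 1. Structural check on the restored copy.
        integrity = copy.execute("PRAGMA integrity_check").fetchone()[0]

        # 2. Content check: row counts must match for every table.
        tables = [
            row[0]
            for row in source.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        ]
        counts_match = all(
            source.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()
            == copy.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()
            for t in tables
        )
        return integrity == "ok" and counts_match
    finally:
        copy.close()


# Hypothetical sample data for the demonstration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.executemany("INSERT INTO orders (total) VALUES (?)", [(19.99,), (5.00,), (42.50,)])
db.commit()

print(backup_and_verify(db))
```

Row counts are a deliberately cheap check. A fuller test would also compare checksums of the data and run the application's own queries against the restored copy.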
Backup and recovery tools
Now that we’ve covered some strategies, it’s a good time to look at some tools for putting plans in motion. There’s too much out there to provide a comprehensive list, so we’ll mention just a few high-profile options. These obviously are not detailed descriptions; they’re just introductions:
MySQL Enterprise Backup: This software supports hot backups and incremental backups.
Oracle Recovery Manager (RMAN): This enables users of Oracle databases to perform full, incremental, and differential backups. This tool also provides point-in-time recovery.
SQL Server Management Studio: This software includes backup and recovery tools for use with Microsoft SQL Server.
MongoDB Atlas backup: This hosted cloud service offers continuous backups or snapshots for point-in-time recovery, as well as incremental backups.
MongoDB Cloud Manager: This hosted service, which uses a graphical user interface, supports backup and restoration of replica sets and sharded clusters.
MongoDB Ops Manager: Available with Enterprise Advanced subscriptions, this is an on-premises tool that provides backup software and features much like those of Cloud Manager.
IBM Data Studio: This provides graphics-based and command-line tools for backup and recovery within IBM Db2 databases.
Commvault: This data management and protection software includes features for backup and recovery. It’s used to help ensure the functionality of hardware, software, and applications.
All of the options above have enterprise-grade attributes, but not one of them is truly open source. So let’s discuss some options that are open source.
Open source Percona solutions for backup and disaster recovery
When you read reviews of what’s out there for backup and disaster recovery solutions, you tend to see glowing words about software with big-name recognition. But you don’t see a warning that the software is proprietary (most often, it is); you just see hints, like a “starting price.” And you certainly don’t see a warning that vendor lock-in might be lurking.
Amid the options, you see Percona at or near the top of lists and reviews. But with Percona, there are differences. There’s no hint of proprietary lock-in because it doesn’t exist. The software is truly open source. Percona solutions, fully supported and enterprise-grade, include:
Percona Backup for MongoDB: Percona Backup for MongoDB is a distributed and low-impact solution for consistent backups of MongoDB clusters, including sharding support. It enables you to make logical, physical, incremental, and selective backups and restores. Plus, point-in-time recovery functionality allows you to recover your database to a specific timestamp.
Percona XtraBackup: This is a free, complete online backup solution for all versions of Percona Server for MySQL, MySQL, and MariaDB. Percona XtraBackup performs online non-blocking, tightly compressed, highly secure backups on transactional systems. Percona XtraBackup is the world’s only open source, free MySQL hot backup software that performs non-blocking backups for InnoDB and XtraDB databases.
Percona Distribution for PostgreSQL: This production-ready PostgreSQL distribution includes pgBackRest, an open source backup and restore solution that enables full backup, incremental backup, and differential backup in PostgreSQL databases. The toolset also supports high availability and disaster recovery via Patroni, pgBouncer, and HAProxy.
Essential elements of database backup and recovery
Now, let’s introduce a couple of elements of an ideal backup and recovery solution. (There are more coming in the next section, but these two are big. Huge, in fact.)
Database replication
Data redundancy is a cornerstone of disaster recovery strategy, and to achieve it, you must have replication mechanisms in place. Depending on the use case, that can mean synchronous replication or asynchronous replication.
In synchronous replication, data is written to the primary database. The data then is copied to one or more replica databases. The primary database waits for acknowledgment from the replica(s) before confirming the transaction to the application. This means that the data in the primary and replica databases is always in sync. (View a Percona whitepaper that shows synchronous replication in action.)
In asynchronous replication, data is written to the primary database, but the acknowledgment to the application occurs before the data is replicated to secondary databases. This results in a delay between the time data is written to the primary and when it appears in the replica(s). Real-time data consistency is not guaranteed.
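The difference between the two modes comes down to when the application gets its acknowledgment. Here is a toy, in-memory sketch of that ordering. The `Primary` and `Replica` classes are hypothetical stand-ins, and there is no networking or failure handling, so treat it as an illustration of the idea rather than how any real database implements replication.

```python
from collections import deque


class Replica:
    def __init__(self) -> None:
        self.data: dict[str, str] = {}

    def apply(self, key: str, value: str) -> None:
        self.data[key] = value


class Primary:
    def __init__(self, replicas: list[Replica], synchronous: bool) -> None:
        self.data: dict[str, str] = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending: deque[tuple[str, str]] = deque()  # changes not yet shipped

    def write(self, key: str, value: str) -> str:
        self.data[key] = value
        if self.synchronous:
            # Wait for every replica to apply the change before acknowledging.
            for replica in self.replicas:
                replica.apply(key, value)
        else:
            # Acknowledge immediately; replicas catch up later.
            self.pending.append((key, value))
        return "ack"

    def ship_pending(self) -> None:
        """Deliver queued changes to replicas (the asynchronous catch-up step)."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.apply(key, value)


sync_replica, async_replica = Replica(), Replica()
sync_primary = Primary([sync_replica], synchronous=True)
async_primary = Primary([async_replica], synchronous=False)

sync_primary.write("balance", "100")
async_primary.write("balance", "100")

print(sync_replica.data)   # {'balance': '100'}
print(async_replica.data)  # {} (the change is still queued)
async_primary.ship_pending()
print(async_replica.data)  # {'balance': '100'}
```

The window between `write` and `ship_pending` is the replication lag. If the asynchronous primary failed inside that window, the queued change would be missing from the replica.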
Point-in-time recovery
With point-in-time recovery, a database is restored to a specific moment in time rather than the time of the most recent backup. PITR is especially essential in situations when data integrity and consistency cannot be compromised. Financial systems and critical business applications are especially dependent on PITR. In addition to protecting data accuracy and limiting data loss, PITR can help with auditing and compliance requirements by providing a record of changes to the database.
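Conceptually, point-in-time recovery starts from a backup and replays logged changes up to the chosen moment. The following is a minimal sketch of that replay, using a plain dictionary as the "database" and a list of timestamped changes as the "log." The function name, the data, and the log format are all hypothetical; real systems replay binary logs, WAL segments, or oplog entries instead.

```python
from datetime import datetime
from typing import Optional

# (timestamp, key, new value); a value of None means the key was deleted.
Change = tuple[datetime, str, Optional[int]]


def point_in_time_restore(
    snapshot: dict[str, int], change_log: list[Change], target: datetime
) -> dict[str, int]:
    """Rebuild state from a backup snapshot plus logged changes up to `target`."""
    state = dict(snapshot)  # start from the last backup; never mutate it
    for timestamp, key, value in sorted(change_log, key=lambda c: c[0]):
        if timestamp > target:
            break  # stop replaying at the requested moment
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


# Hypothetical scenario: a midnight backup, then changes during the day,
# including an accidental delete at 10:30.
snapshot = {"alice": 100, "bob": 50}
change_log = [
    (datetime(2023, 8, 1, 9, 0), "alice", 80),
    (datetime(2023, 8, 1, 10, 30), "bob", None),
    (datetime(2023, 8, 1, 11, 0), "alice", 0),
]

restored = point_in_time_restore(snapshot, change_log, datetime(2023, 8, 1, 10, 0))
print(restored)  # {'alice': 80, 'bob': 50}
```

Restoring to 10:00 keeps the legitimate 9:00 update and stops before the accidental delete, which restoring the midnight backup alone could not do.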
Common components of DR and HA architectures
By definition, there are differences between high availability (HA) and disaster recovery (DR). High availability is focused on preventing downtime and ensuring that the database remains available; disaster recovery is focused on recovering from a catastrophic event and minimizing negative effects on the business. High availability typically involves redundant hardware, software, applications, and network components that can quickly take over if the primary component fails; disaster recovery typically involves regular backups, replication to a secondary site, and a clear recovery plan with steps to be taken in the event of a disaster.
Although the emphasis and configurations may vary depending on whether the focus is on high availability or disaster recovery, there are shared components. Some of them include:
- Redundant hardware: Both HA and DR use redundant hardware components, including servers and storage devices. Redundancy ensures that if one component fails, there is another to take its place. This helps minimize downtime.
- Clustering: In HA, clustering helps ensure that there are redundant database servers; if one fails, another can take over. This minimizes downtime during disasters or hardware failures. For disaster recovery, clustering can be used to maintain a synchronized copy of the database in a different location. Database replication, log shipping, or synchronous data mirroring can be used for DR purposes.
- Load balancing: Load balancers distribute traffic evenly across multiple servers or data centers. In HA, load balancers help ensure that no single server is overwhelmed. In DR, load balancers route traffic to the secondary data center when a failover occurs.
- Backup systems: Both HA and DR architectures have backup systems in place. HA setups tend to use backup servers within the same data center. DR setups have backup data centers in different locations.
- Monitoring and alerting: Continuous monitoring of system health and performance is essential for both HA and DR. Automated alerts are set up to notify administrators of any issues that require attention.
What’s at stake and what to do about it
Now that we’ve taken a deeper dive into the components of a backup and disaster recovery solution, as well as a look at high availability, let’s expand on what backup and disaster recovery solutions are used for. We’ll also talk about what can happen if you don’t have a solid plan in place, if you take shortcuts, or if you turn it all over to the wrong vendor.
Data archiving and retention matter
We’ve discussed the essential nature of backup and recovery in business continuity. Relatedly, by separating historical data from operational data, archiving helps you manage data growth, maintain compliance, and optimize backup processes. Likewise, establishing clear retention policies for both backups and archives is crucial to balancing data recovery needs with data management efficiency and compliance requirements.
And the essential nature of compliance can’t be emphasized enough. Failure to adhere to legal requirements can result in monetary and even criminal penalties, reputational damage, and loss of data integrity. Here are some of those regulations:
General Data Protection Regulation (GDPR): In addition to stipulating that organizations must have a lawful basis for processing personal data, this regulation includes guidelines for data retention and erasure policies.
Sarbanes-Oxley Act: SOX, a U.S. federal law, requires companies to have database disaster recovery and business continuity plans in place. The purpose is to ensure the availability and integrity of financial data amid unexpected events.
Payment Card Industry Data Security Standard: The PCI DSS mandates that organizations must regularly back up data, including critical payment card data, so that data availability and integrity are maintained. The PCI DSS also prescribes steps for responding to data breaches and other security incidents, including how to restore services and data after a disaster.
California Consumer Privacy Act: Similar to GDPR, CCPA also includes mandates concerning data retention and erasure policies.
Regional regulations: Depending on your location and who your customers are, your organization might have to adhere to privacy, archiving, and retention mandates as spelled out in HIPAA (healthcare), FERPA (education), PIPEDA (Canada), and other regulations.
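A retention policy for backups, as discussed above, eventually becomes a rule that decides which backups can be deleted. Here is a minimal sketch of such a rule. The 30-day window, the `keep_at_least` safeguard, and the function name are hypothetical; the retention period you actually need depends on the regulations that apply to you, and this sketch does not encode any of them.

```python
from datetime import date, timedelta


def backups_to_delete(
    backup_dates: list[date],
    today: date,
    retention_days: int,
    keep_at_least: int = 1,
) -> list[date]:
    """Return backups older than the retention window, always keeping the newest ones."""
    newest_first = sorted(backup_dates, reverse=True)
    protected = set(newest_first[:keep_at_least])  # never delete these
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in backup_dates if d < cutoff and d not in protected)


# Hypothetical backup history and a 30-day retention window.
history = [date(2023, 6, 1), date(2023, 7, 15), date(2023, 8, 1), date(2023, 8, 20)]
expired = backups_to_delete(history, today=date(2023, 8, 25), retention_days=30)
print(expired)  # [datetime.date(2023, 6, 1), datetime.date(2023, 7, 15)]
```

The `keep_at_least` guard means that even if backups stop running for a while, the policy never deletes the last good copy.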
Examining cloud-based backup solutions
Benefits related to accessibility, automation, scalability, and security might inspire you to go with a cloud service for database backup and disaster recovery. If this is the direction you choose, carefully consider the long-term costs, data security concerns, and potential vendor lock-in. Find a partner that will answer any questions about such concerns — and be as certain as possible that vendor lock-in is not on the horizon.
While cloud services can be cost-effective initially, long-term costs can escalate if your needs for data storage grow significantly. Relatedly, transferring large amounts of data into or out of the cloud can be expensive, especially for companies with limited bandwidth. Further, there are data compliance and security concerns that could matter a lot more to you than to the vendor. Whereas the vendor might make promises, the responsibility and potential repercussions are all on you.
Safe and reliable cloud services exist, but read the fine print and ask those questions.
Never scrimp on security measures
Here are some of the best practices for helping ensure that data remains secure during backup and recovery:
- Encryption (which also could have appeared in the Essential Elements section) is a must-have component of database backup and recovery strategies. Using algorithms and keys to make data unreadable, encryption helps safeguard information during backup and recovery. Even if backup files fall into the wrong hands, the data stays unreadable as long as the keys remain protected. Encryption also helps you adhere to strict data-protection regulations (GDPR, Sarbanes-Oxley Act, PCI DSS, CCPA, HIPAA, etc.).
- Access control is the process of managing who can access a database and what actions they can perform. In backup and recovery, access control helps prevent unauthorized access to sensitive backup files and database recovery mechanisms.
- Backup authorization and recovery authorization are the processes of determining whether a user or process has the permissions to perform a specific action. Authorization helps ensure that only authorized entities can initiate backup and recovery operations.
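For the access control and authorization items above, the core idea is a deny-by-default check before any backup or restore operation runs. Here is a minimal sketch. The role names, the `Action` values, and the permission table are hypothetical; in practice these checks live in your database's privilege system or your identity provider.

```python
from enum import Enum


class Action(Enum):
    RUN_BACKUP = "run_backup"
    RESTORE = "restore"
    READ_BACKUP_FILES = "read_backup_files"


# Hypothetical role definitions; each role gets only the actions it needs.
ROLE_PERMISSIONS: dict[str, set[Action]] = {
    "backup_operator": {Action.RUN_BACKUP},
    "dba": {Action.RUN_BACKUP, Action.RESTORE, Action.READ_BACKUP_FILES},
    "developer": set(),
}


def is_authorized(role: str, action: Action) -> bool:
    """Unknown roles get no permissions (deny by default)."""
    return action in ROLE_PERMISSIONS.get(role, set())


print(is_authorized("backup_operator", Action.RUN_BACKUP))  # True
print(is_authorized("backup_operator", Action.RESTORE))     # False
print(is_authorized("contractor", Action.RUN_BACKUP))       # False
```

Separating "can run a backup" from "can restore" and "can read backup files" keeps a compromised low-privilege account from exposing or overwriting data.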
Be proactive — set up monitoring and alerting
The coal miners of yesteryear carried caged canaries deep into mine tunnels. If carbon monoxide or other dangerous gases were present, the gases would kill the canary, signaling the miners to leave the tunnels immediately.
Thankfully, with database backup and disaster recovery, watching for warning signs can be a lot more scientific, and far more reliable. Instead of relying on a wing and a prayer, an effective monitoring and alerting solution can rely on:
Thresholds: Thresholds for various metrics, such as backup completion time, replication lag, and resource availability, are defined. When those thresholds are reached, alerts are triggered.
Notification channels for real-time alerts: With the proper configuration, the appropriate personnel are contacted promptly via automated email, text, chat, and other channels when the previously mentioned thresholds are hit. Such notification should include an escalation process in which a different support avenue or person is contacted if the alert is not acknowledged or resolved within a predetermined amount of time.
Automated storage increase: There should be automation — tied into the alerts — in which storage space is increased when it reaches a predefined threshold. This will help prevent backup failures.
Detailed logs: It’s important to maintain logs of all monitoring activities and alerts. Then, you have the information to generate reports for identifying trends and areas of improvement.
With the right monitoring system in place, you can avoid losses. You also can spot critical performance issues faster, understand the root cause of incidents better, and troubleshoot them more efficiently moving forward.
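To show how thresholds and escalation fit together, here is a minimal sketch. The metric names, limits, escalation timings, and contact roles are all hypothetical; a real setup would take them from your monitoring tool's configuration and your on-call rotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    unit: str


# Hypothetical limits; real values depend on your RPO, RTO, and capacity.
THRESHOLDS = [
    Threshold("backup_duration", 3600, "seconds"),
    Threshold("replication_lag", 30, "seconds"),
    Threshold("backup_disk_used", 85, "percent"),
]


def check_thresholds(metrics: dict[str, float]) -> list[str]:
    """Return one alert message per metric that has reached its threshold."""
    alerts = []
    for t in THRESHOLDS:
        value = metrics.get(t.metric)
        if value is not None and value >= t.limit:
            alerts.append(f"{t.metric} is {value} {t.unit} (threshold {t.limit})")
    return alerts


def escalation_target(minutes_unacknowledged: int) -> str:
    """Pick who to notify based on how long an alert has gone unacknowledged."""
    if minutes_unacknowledged < 15:
        return "on-call DBA"
    if minutes_unacknowledged < 30:
        return "secondary on-call"
    return "engineering manager"


current = {"backup_duration": 1800, "replication_lag": 45, "backup_disk_used": 91}
for alert in check_thresholds(current):
    print(alert)
print(escalation_target(20))  # secondary on-call
```

In this sample, the backup finished well within its limit, so only replication lag and disk usage raise alerts.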
Recovering from different scenarios
In database backup and disaster recovery planning, clear steps for dealing with hardware failure must be in place. Those steps should include:
- Identification: The first step in recovering from a hardware failure is identifying the affected hardware component. This could be a hard drive, a server, or even an entire data center. Monitoring tools and systems can help detect such failures and trigger automated alerts described in the previous section.
- Isolation and remediation: Once the failure is identified, IT staff or automated systems should work to isolate the affected hardware and restore it to a functional state as soon as possible. This may involve taking a server offline, rerouting network traffic, or replacing a failed disk drive.
- Restoration: With the hardware issue resolved, the next step is to restore the database services. This involves restarting database servers, restoring network connections, and ensuring that the database management system is operational. The previously mentioned automated failover mechanisms and load balancers can help minimize downtime during this phase.
- Recovery of data from backups: This might be a full backup or a combination of full and incremental backups. Use backup software to restore the database to its clean state at the time of the last good backup. This might involve copying data from backup storage to the production database server.
Perhaps we should call this the All Things Ominous Section because now we’ll look at restoring data after a cyberattack. (Of course, the dead canary was pretty dark.)
But really, this section is about bouncing back. With the right preparedness, a cyberattack doesn’t have to be that death knell to your database system and business. In fact, you can beat down an incursion and come back stronger. So let’s look at planning, preparedness, and a systematic approach to minimizing downtime and data loss while ensuring the security and integrity of your systems and information. Here are key steps in overcoming a cyberattack:
Preparation: Conduct regular backups that ensure you have recent, clean copies of your data to restore. Keep backup copies in offsite or isolated locations, perhaps in the cloud. This safeguards data from physical damage or compromise in the event of an attack.
Documentation: Maintain thorough documentation of your database configurations, schemas, and data structures. This documentation will be invaluable during the restoration process.
Response plan: Develop a clear incident response plan that outlines roles, responsibilities, and steps to take in the event of a cyberattack.
Detection and isolation: As soon as you detect an attack, quickly identify the scope and nature. Determine which databases or systems are affected. Then, quarantine or disconnect affected systems from the network to prevent the spread of malware and further data corruption.
Damage assessment: Evaluate the extent of data loss or corruption. This assessment will help determine the appropriate restoration strategy.
Culprit identification: So that you can patch vulnerabilities and prevent future attacks, determine how the attack happened.
Data restoration: Use your latest clean backup to restore the affected database, and confirm that the backup was taken before the attack occurred. If the attack compromised data over a period of time, you might have to perform an incremental restoration. This involves applying incremental backups, in order, on top of a clean full backup to bring the data forward to the last known-good point.
Security updates and auditing: Immediately patch and update the database system and associated software to address vulnerabilities that were exploited. To prevent future attacks, implement intrusion detection systems (IDS) and access controls.
Data consistency and integrity
You must maintain data accuracy before, during, and after a disaster or attack. By doing so, your organization can recover quickly and reliably. In addition to replication, monitoring and alerts, encryption, auditing, and other activities already mentioned, here are some other best practices for maintaining data accuracy and addressing data corruption:
- Perform regular backup testing, full and incremental, to verify data integrity.
- Perform automated backup verification to check the integrity of backup files and ensure they are not corrupted.
- Implement version control for database backups, which will give you a history of changes and let you choose a specific point in time if necessary.
- Always apply the latest security patches and updates to your database management system and backup software to prevent vulnerabilities.
- Use IDS and security information and event management (SIEM) tools to monitor network activity and detect suspicious activity.
- Develop and regularly update a comprehensive disaster recovery plan that outlines roles, responsibilities, and procedures for data recovery in various scenarios.
- Consider having outside experts assess your data protection and recovery plans and provide recommendations for improvement.
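Automated backup verification, mentioned in the list above, often starts with something as simple as a checksum recorded at backup time and rechecked later. The sketch below shows that idea with SHA-256 from Python's standard library. The function names and the file contents are hypothetical, and a checksum only proves the file is unchanged; it does not prove the backup can be restored.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path: Path, expected_digest: str) -> bool:
    """True only if the file exists and still matches the recorded checksum."""
    return path.is_file() and sha256_of(path) == expected_digest


with tempfile.TemporaryDirectory() as tmp:
    backup_file = Path(tmp) / "backup.bin"
    backup_file.write_bytes(b"pretend this is a database dump")
    recorded = sha256_of(backup_file)  # store this alongside the backup

    print(verify_backup(backup_file, recorded))  # True

    # Simulate corruption by changing the file contents.
    backup_file.write_bytes(b"pretend this is a corrupted dump")
    print(verify_backup(backup_file, recorded))  # False
```

Storing the recorded checksum somewhere other than next to the backup file makes it harder for corruption or tampering to affect both at once.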
Scaling backup and recovery processes
Massive amounts of data can reside in large and enterprise databases. Though it might seem obvious, it’s important that you aren’t caught off-guard, and that means having significant storage capacity and efficient data transfer mechanisms.
Enterprises also typically require frequent backups, ranging from daily to hourly, depending on their RPOs. Automated backup processes are essential in ensuring data consistency and minimizing downtime. Techniques like online backups and snapshot-based backups can help ensure databases remain accessible during the backup process.
Relatedly, transferring large database backups over the network can strain available bandwidth. So, enterprises might need dedicated high-speed connections or WAN optimization solutions to mitigate network congestion during backup operations. To reduce storage and bandwidth requirements, compression and deduplication techniques are often applied to the backup data. This involves identifying redundant data and storing only the unique blocks.
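Here is a minimal sketch of block-level deduplication combined with compression, to show why the two together can shrink backup data so much. The fixed 4 KB block size, the function names, and the sample payload are hypothetical; production tools typically use more sophisticated chunking and indexing.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical fixed block size


def deduplicate(data: bytes) -> tuple[dict[str, bytes], list[str]]:
    """Split data into blocks; store each unique block once, compressed."""
    store: dict[str, bytes] = {}
    recipe: list[str] = []  # ordered block hashes needed to rebuild the data
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset : offset + BLOCK_SIZE]
        key = hashlib.sha256(block).hexdigest()
        if key not in store:
            store[key] = zlib.compress(block)
        recipe.append(key)
    return store, recipe


def rebuild(store: dict[str, bytes], recipe: list[str]) -> bytes:
    """Reassemble the original data from the stored unique blocks."""
    return b"".join(zlib.decompress(store[key]) for key in recipe)


# Hypothetical payload: 100 blocks, of which only 2 are distinct.
payload = (b"A" * BLOCK_SIZE) * 60 + (b"B" * BLOCK_SIZE) * 40
store, recipe = deduplicate(payload)

print(len(recipe), len(store))            # 100 2
print(rebuild(store, recipe) == payload)  # True
```

Real database pages are far less repetitive than this sample, so actual savings vary widely with the data and the change rate.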
Planning for business continuity
In real estate, they say “location, location, location.” In database management, we say: Proactivity. Proactivity. Proactivity. (OK, not so catchy and maybe we don’t say it in those words, but we think it.) And here, we’ll say it a fourth time in relation to protecting data and avoiding downtime: Proactivity.
Any business or organization that relies on data (meaning just about all of them) must be proactive if they’re to maintain business continuity amid a disaster, power outage, cyberattack, or other event that could threaten data processes.
We covered a lot about components, tools, and best technical practices, so here we’ll key in on the actual planning parts that should be included in a business continuity plan. Since replication and redundancy are such big elements of database management, let’s stick with the redundancy theme and call these proactive activities:
Risk assessment: Identify potential risks and threats that could negatively affect your database systems. They might include natural disasters (hurricanes, tornadoes, earthquakes, flooding, blizzards, etc.), cyberattacks (data breaches, malicious bugs, ransomware, etc.), human error, and hardware failure.
Impact assessment: Evaluate how each identified risk or threat could negatively affect your database systems.
Recovery objectives: To determine how quickly your organization must recover its database systems after a disruption (maximum allowable downtime), you should establish an RTO. To determine the maximum amount of data loss that your organization can tolerate, you should set an RPO. This will determine how frequently you should back up your data.
Disaster recovery plan (DRP): All the components in this section are part of the DRP, which outlines the steps to be taken in the event of a disaster or cyberattack. The DRP should include roles and responsibilities, communication procedures, and recovery procedures. You should test the DRP regularly through simulations to ensure it works effectively.
Communication plan: Develop a communication plan for keeping employees, customers, and stakeholders informed during a disaster or cyberattack.
Financial plan: Allocate budget resources for disaster recovery and business continuity initiatives to ensure they are adequately funded.
Additional tools and outside expertise: A business continuity plan isn’t something you just throw together. You might not have the tools and expertise on staff to get it done. Consider ready-to-go backup software and whether or not you can design, implement, and maintain the business continuity plan on your own. If not, consider outside help, but beware of proprietary licensing and the pitfalls of vendor lock-in.
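The recovery objectives described above lend themselves to some quick arithmetic. As a rough sketch, the functions below check a backup schedule against an RPO and a restore estimate against an RTO. They assume periodic backups are the only protection (no replication or log shipping), and every number in the example is hypothetical.

```python
def worst_case_data_loss_minutes(backup_interval_minutes: float) -> float:
    """A failure just before the next backup loses one full interval of changes."""
    return backup_interval_minutes


def meets_rpo(backup_interval_minutes: float, rpo_minutes: float) -> bool:
    return worst_case_data_loss_minutes(backup_interval_minutes) <= rpo_minutes


def estimated_restore_minutes(
    backup_size_gb: float,
    throughput_gb_per_min: float,
    fixed_overhead_minutes: float = 0.0,
) -> float:
    """Transfer time plus fixed steps such as provisioning and validation."""
    return backup_size_gb / throughput_gb_per_min + fixed_overhead_minutes


def meets_rto(restore_minutes: float, rto_minutes: float) -> bool:
    return restore_minutes <= rto_minutes


# Hypothetical targets: lose at most 60 minutes of data, be back within 90 minutes.
print(meets_rpo(backup_interval_minutes=24 * 60, rpo_minutes=60))  # False
print(meets_rpo(backup_interval_minutes=60, rpo_minutes=60))       # True

restore = estimated_restore_minutes(500, 5, fixed_overhead_minutes=20)
print(restore)                 # 120.0
print(meets_rto(restore, 90))  # False
```

In this example, nightly backups miss the 60-minute RPO while hourly backups meet it, and a 120-minute restore misses the 90-minute RTO, which would point toward faster restore paths or a standby system.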
Keep learning
Now that you’ve got the basics down, level up your knowledge with our on-demand presentation: The Many Ways To Copy Your Database. In it, Nicolai Plum of Booking.com discusses the best ways to copy your database – from logical data dump and file copying through native cloning and backup tools to advanced scale-out techniques.
Watch The Many Ways To Copy Your Database
Percona Database Performance Blog