Continuous data protection has emerged as an approach that picks up where legacy backup methods leave off.
For decades, the standard of care in protecting data has been to run scheduled full backups complemented by incremental backups. Administrators have long made a practice of executing full backups of all data after hours or on weekends, sometimes even taking systems offline to back them up. Then, during normal business hours, they complemented the full backups with incremental backups consisting of only the data that had changed.
That practice has evolved toward continuous data protection, an approach better suited to the constraints of time, storage capacity and network bandwidth that most enterprises face.
What is continuous data protection and how does it work?
In simple terms, continuous data protection (CDP) backs up any data that has changed and maintains a log of those changes so that administrators can restore systems to a previous point in time.
Overall, the term suggests that software and hardware are continuously backing up data as it is created or modified. However, it is more nuanced than that.
Continuous data protection relies on a full backup, but with an important variation: the incremental forever. After the initial full backup, instead of scheduling the incremental backups at given times on given days, administrators schedule incremental forever backups every day or throughout the day. The time interval is not so much the focus as the fact that data protection is always in process.
The incremental forever is effective only if it does not get in the way of other running processes and does not needlessly saturate the network. It depends on techniques that can quickly figure out which files have changed since the last incremental, or in the case of block-based backup, identify the changed blocks of data and back up only those blocks. That technology keeps the incremental backups small, the network traffic low and the backup time short.
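To make the idea concrete, here is a minimal sketch of a file-based incremental pass that copies only the files modified since the previous run. The paths and state file are assumptions for illustration; production tools typically track changes at the block level and handle deletions, locking and verification.

```
# Minimal sketch of an incremental-forever pass: copy only files modified
# since the previous pass. Paths and the state file are hypothetical.
import json
import shutil
import time
from pathlib import Path

SOURCE = Path("/data/app")             # data to protect (assumed path)
TARGET = Path("/backups/incremental")  # backup destination (assumed path)
STATE = Path("/backups/last_run.json") # records when the last pass started

def run_incremental():
    last_run = 0.0
    if STATE.exists():
        last_run = json.loads(STATE.read_text())["last_run"]

    started = time.time()
    for src in SOURCE.rglob("*"):
        if src.is_file() and src.stat().st_mtime > last_run:
            dest = TARGET / str(int(started)) / src.relative_to(SOURCE)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # back up only the changed file

    STATE.write_text(json.dumps({"last_run": started}))

if __name__ == "__main__":
    run_incremental()  # schedule this as often as the hardware allows
```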
Continuous data protection comes to prominence
The push toward continuous data protection is closely tied to several factors in IT.
The first is the rapid increase in the amount of data that companies must generate, process and safeguard. The legacy approach of full and incremental backups may still suffice if your company has up to a few dozen terabytes of data. But enterprises with hundreds of terabytes or even petabytes of data cannot back up that much in a reasonable time frame.
Next, network bandwidth plays a role. Even if the window of available time sufficed to back up that much data, you would still have to move it over the network. Far better to perform the full backup once and then send just the changes, so that backup traffic doesn't saturate your network and your systems' input/output (I/O).
Then, if you’re pushing your backups to the cloud, you’ll want to keep your costs as low as possible, and protecting data continuously can help reduce your storage requirements.
Finally, ransomware has become a factor. When ransomware hits, you want to lose as little data as possible, and continuous data protection shortens the window of potential data loss. You want to return to normal operations quickly, which is why most products in the category include tools for getting your environment up and running. Restoring data is centered on the recovery point objective (RPO) and recovery time objective (RTO), both of which are reduced when you aim for continuous data protection.
The synthetic full: mitigating the risk of data loss
Of course, the goal of data protection is to mitigate the risk of data loss due to factors like human mistakes, illegitimate deletion, malicious insiders and natural disasters. But even data backup strategies introduce the risk of data loss.
Suppose that you configure an incremental forever on a database server that hosts the constantly updated file CustTransaction.db. You start by performing a full backup on Monday, with incremental backups every hour. The database is adequately protected, but by Tuesday, it’s in 24 pieces: the full, plus 23 incrementals. By Wednesday it’s in 48 pieces, and so on, in an ever-growing incremental chain. The problem is that, if one of those incrementals becomes corrupt or goes missing, every subsequent incremental would be unrestorable. Once you’ve stored your full backup and you’re performing those incrementals forever, you don’t want an infinitely long chain of dependent incrementals.
Enter the synthetic full. It's a data protection technique that combines the previous full backup with the subsequent incrementals to produce a new full backup, giving you a new starting point. That reduces the risk of data loss from an incremental chain that never stops growing.
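As a rough illustration of the idea, the sketch below builds a synthetic full by replaying a chain of incrementals, each represented as a map of changed blocks, on top of the previous full. The block-map representation is a simplification; real products work against their own backup formats and verify the result.

```
# Minimal sketch of a synthetic full: start from the previous full backup
# and apply each incremental's changed blocks in order.
from typing import Dict, List

Blocks = Dict[int, bytes]  # block number -> block contents

def synthesize_full(previous_full: Blocks, incrementals: List[Blocks]) -> Blocks:
    synthetic = dict(previous_full)  # copy; don't modify the original full
    for increment in incrementals:   # oldest to newest
        synthetic.update(increment)  # newer blocks overwrite older ones
    return synthetic                 # new starting point for the chain

# Example: a tiny "full" plus two hourly incrementals
full = {0: b"AAAA", 1: b"BBBB", 2: b"CCCC"}
hourly = [{1: b"BBB1"}, {2: b"CCC2", 3: b"DDDD"}]
print(synthesize_full(full, hourly))
# {0: b'AAAA', 1: b'BBB1', 2: b'CCC2', 3: b'DDDD'}
```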
The other way continuous data protection reduces risk is frequency: you're backing up data far more often, and because each pass captures only the changes, it has to move only a small amount of data.
Choosing between legacy backup and protecting data continuously
It's most valuable to continuously protect data that changes constantly, such as the CustTransaction.db file in the example above. Conversely, legacy backup strategies will often keep playing a role in protecting data that changes infrequently. Since most organizations handle data in both categories, each needs to decide on its own mix of the two approaches.
Consider, for instance, an ecommerce company with a web-based store that is always generating transactions from purchases. Their front-line systems are their lifeline, and they want to reduce the RPO and RTO for those systems. Continuous protection would mitigate the risk of losing transactions in case of disaster.
At the same time, however, the store displays thousands of images that rarely change; that static content is better suited to legacy backup. Similarly, virtual desktop infrastructure (VDI) images and virtual machine files rarely change. They occupy a lot of space and are important enough that you would never want to lose them, but because the data in them changes infrequently, legacy backup is a better fit.
True continuous data protection, or near-continuous data protection?
True continuous data protection means that any changes made to the data are immediately recognized and backed up. If disaster strikes, you lose nothing because every transaction is recorded in both the primary system and the backup system. If it were ever necessary to restore, the backup system would have everything you needed.
Near-continuous data protection does not do that. It uses periodic save points with granularity measured down to minutes. In case of disaster, your exposure would be limited to that many minutes' worth of data. Compared to legacy approaches, in which once-daily backups are common, you're greatly reducing the risk of data loss.
Because you could lose the data since the end of the previous period, there will always be some measurable risk of data loss. The goal of continuous data protection is to reduce that risk by configuring the shortest practical period. An hour makes sense in some cases, but it's not uncommon for companies to specify a five-minute period, which in practice comes close to true continuous data protection. After all, in the context of an ecommerce site, think of how many transactions can occur in five minutes. You wouldn't want to lose any of them, so shortening the period is an important factor in effectiveness.
In short, near-continuous data protection does not mean constant and real-time protection of every change. It means a short period — but a period nonetheless — between save points.
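To put rough numbers on that trade-off, the sketch below estimates worst-case exposure for a few save-point periods, using an assumed transaction rate; the figures are illustrative only.

```
# Rough worst-case exposure: if the last save point was a full period ago,
# everything since then is at risk. The transaction rate is an assumption.
TRANSACTIONS_PER_MINUTE = 40  # hypothetical ecommerce load

for period_minutes in (1440, 60, 5):  # daily, hourly, five-minute save points
    at_risk = period_minutes * TRANSACTIONS_PER_MINUTE
    print(f"{period_minutes:>5}-minute period: up to {at_risk} transactions at risk")
```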
The advantages of continuous protection
1. Less storage space for backups
The sooner you run out of available storage space, the shorter your backup retention time and the sooner you have to either discard backups or move them elsewhere. Because incremental forever backups capture only changed data, continuous data protection lets you store more restore points in the same amount of storage, which translates into a longer retention period.
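As a rough illustration of the storage math, the sketch below compares retention for daily full backups against a single full plus small daily incrementals in the same capacity; the dataset size and change rate are assumptions.

```
# Illustrative only: restore points that fit in 10 TB of backup storage.
# The dataset size and daily change rate are assumptions.
CAPACITY_TB = 10.0
FULL_TB = 1.0           # size of one full backup
DAILY_CHANGE_TB = 0.02  # about 2% of the data changes per day

daily_fulls = int(CAPACITY_TB / FULL_TB)
incremental_forever_days = round((CAPACITY_TB - FULL_TB) / DAILY_CHANGE_TB)

print(f"Daily fulls:         about {daily_fulls} days of retention")
print(f"Incremental forever: about {incremental_forever_days} days of retention")
```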
2. Lower costs
That also translates into lower costs, because you're able to store more data in the same amount of space. Note that data deduplication, the process of identifying and eliminating duplicate blocks of data in a backup set, is an enabling technology for continuous data protection. Deduplication is the key to storing as much data as possible within the available capacity.
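For illustration, here is a minimal sketch of block-level deduplication using content hashes; production implementations add persistence, reference counting and integrity checks.

```
# Minimal sketch of block-level deduplication: store each unique block once,
# keyed by its hash, and keep a list of block references for each backup.
import hashlib

BLOCK_SIZE = 4096
block_store: dict[str, bytes] = {}  # hash -> unique block contents

def dedupe(data: bytes) -> list[str]:
    """Split data into blocks, store unseen blocks, return block references."""
    refs = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:
            block_store[digest] = block  # only new blocks consume space
        refs.append(digest)
    return refs

def restore(refs: list[str]) -> bytes:
    """Reassemble the original data from the stored blocks."""
    return b"".join(block_store[ref] for ref in refs)
```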
3. Less network traffic and bandwidth consumption
Continuous protection has the effect of reducing both the amount of network traffic generated by backups and the network bandwidth consumed by them. When you’re protecting throughout the day with short periods, you want that data to take up as little bandwidth as possible. Otherwise, backups saturate the network and slow traffic for all users.
All of these techniques — the incremental forever, the synthetic full and deduplication — fit together to make continuous data protection possible.
The disadvantages
1. Greater hardware requirements
Legacy backups are relatively simple: the technology is well established and understood, and its hardware requirements are modest. Continuous data protection, on the other hand, constantly generates I/O on your storage devices. You need to be prepared to invest more in hardware that can keep up with the amount of data you're pushing to it. The biggest potential disadvantage, therefore, is that hardware sized for legacy backup may not be fast enough to ingest the data adequately.
2. Synthetic fulls
In addition to the I/O generated by backing up, synthetic fulls generate considerable I/O as they piece together incrementals and build a new full backup. Legacy systems don’t have anything like the synthetic full, so they’re not equipped to deal with the spike in I/O. But the architecture of a properly configured continuous data protection system includes software and hardware that account for that.
Mistakes companies make along the way
As noted above, legacy backup is appropriate for data that does not change very often, and continuous protection is suited to constantly changing data. Companies that try to use one where the other belongs soon discover limitations.
Organizations may also make the mistake of neglecting the 3-2-1 rule. Always maintain three copies of your data, store two of them locally on different media types and store one of them offsite (at a remote site or in the cloud). The rule ensures that you’ll have redundant copies of your data in case you need to restore after a disaster. Any good data protection architecture will factor in the 3-2-1 rule. But if a company decides to save money by not implementing the entire architecture, they can be caught without a third, offsite copy of their data when disaster strikes.
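As a simple illustration, the sketch below checks a hypothetical inventory of backup copies against the 3-2-1 rule; real deployments would pull this information from the backup catalog rather than a hand-written list.

```
# Minimal 3-2-1 check against a hypothetical inventory of backup copies:
# at least three copies, on at least two media types, with at least one offsite.
def meets_3_2_1(copies: list[dict]) -> bool:
    media_types = {c["media"] for c in copies}
    offsite = [c for c in copies if c["offsite"]]
    return len(copies) >= 3 and len(media_types) >= 2 and len(offsite) >= 1

copies = [
    {"media": "disk",   "offsite": False},  # primary backup appliance
    {"media": "tape",   "offsite": False},  # second local copy, different media
    {"media": "object", "offsite": True},   # cloud copy
]
print(meets_3_2_1(copies))  # True
```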
Best practices for implementation
1. Data type
The most important thing is to focus on the kind of data you’re backing up: Is it constantly changing, or does it rarely change? That will determine whether it’s suited to continuous data protection or legacy backup, respectively. And from there, you can make decisions about storage hardware that can handle the amount of I/O required.
2. Time period for the incrementals
Figure out how often you can execute incremental backups while still leaving enough overhead in the hardware. To keep protection truly continuous, factor in any scheduled activity that competes for that overhead, such as weekend restore jobs and backup testing.
Suppose that you’re trying to protect every hour, but that it takes 90 minutes to back up the incremental changes. That means you’re not protecting every hour. You have two options: Either you increase the period to 90 minutes or you adjust the hardware to ingest all the data in an hour.
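Using figures like those in that example, a quick back-of-the-envelope check can confirm whether the hardware keeps up; the change rate and ingest throughput below are assumptions.

```
# Quick feasibility check: can the backup target ingest an hour's worth of
# changes within the protection period? The rates here are assumptions.
CHANGED_GB_PER_HOUR = 540.0  # hypothetical change rate
INGEST_GB_PER_MINUTE = 6.0   # hypothetical backup target throughput
PERIOD_MINUTES = 60

backup_minutes = CHANGED_GB_PER_HOUR / INGEST_GB_PER_MINUTE
print(f"Each incremental takes about {backup_minutes:.0f} minutes")
if backup_minutes > PERIOD_MINUTES:
    print("Not keeping up: lengthen the period or speed up the hardware")
```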
Summary
A workable continuous data protection solution keeps backups from being disrupted. It ensures that your system has enough overhead to work continuously while meeting the period you’ve set and permitting restores when needed.