Backups and archives offer different approaches to safeguarding and preserving data. When you compare archive vs. backup, you find that the two overlap in several ways. Both involve non-production data, both involve long-term storage and both are important enough to deserve keeping copies in more than one place.
Although some questions about archives and backups have similar answers, the business and operational questions about an archive vs. backup have very different ones. Often, the need for backups and archives can be determined through a few key questions:
- Why do you need a backup? Why do you think you need an archive?
- How does your organization define “backup” and “archive” for its particular business purposes?
- How often do you plan to access or retrieve the data in your backups? In your archive?
- How often do you plan to modify your archives? Your backups?
This post helps you explore those similarities and differences in light of your company’s regulatory needs, operating environment and risk tolerance.
What is the difference between backup and archive? How are archive and backup defined?
Backups and archives are typically defined as the following:
Backups: Backups store copies of data and files to protect against data loss and are typically used for recovery.
Archives: Archives store and preserve data that may be of value for reference purposes and easy retrieval.
Most people would agree that a backup is what you create for the purpose of protecting yourself against data loss in case of an outage or disaster. They might spell “disaster” in lower case – they’re missing a single file – or they might spell “DISASTER” in all caps – they’re missing a data center. By and large, though, they would agree that the idea of making backups is to be able to recover from a disaster.
But if you ask those same people the important question, “What do you consider to be an archive?” they’ll give you a wide variety of answers.
For that matter, since the purpose of a backup is disaster recovery, companies tend to retain their backups for some length of time. They may, therefore, be tempted to regard their backups as a kind of archive. One rule of thumb – and this is NOT ironclad – is that anything you want to keep longer than one year is a candidate for archiving.
The right questions and rules of thumb
The right question is not “What is Peter’s definition of an archive?” or “What is Quest’s definition of an archive?” The right question is “What do you and your organization want to accomplish?”
The first thing to determine in defining an archive is the purpose you want it to serve. Why do you think you need an archive? From that answer, you can evaluate whether your backup software or some other solution is suited to that need.
Here’s another rule of thumb – and this isn’t ironclad either: An archive is usually where you keep something for legal or historical reasons for a long time, so it is unlikely that you will ever use it to recover from a disaster.
Besides the purpose you want the archive or backup to serve, another important question is “How often will we update or replace that data?” Backup is meant for high frequency. You perform – or should perform – backups daily, or even hourly, so that you can recover from a disaster quickly and resume normal operations. On the other hand, would you archive every hour? Not likely. Most companies update their archives on a weekly, monthly or annual basis.
A final important question is “What is the frequency or likelihood that our users will need to retrieve that data?” Suppose your answer is “My archive is there to make historical data available to users for the job they have to do today.”
In that case, you can bet that the frequency of data retrieval will be high. But your answer may be “No, I need to store data for a long time to comply with a regulation or industry requirement.” That data could spend years and years in your company’s storage without being accessed. Users won’t need it to do their job, but if the tax authorities come knocking, you’ll need to retrieve it. An archive will serve that purpose.
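The questions so far (purpose, update frequency and retention period) can be written down as a rough decision helper. The following is only a sketch: the `DataSet` fields, the `suggest_store` function and the thresholds (about 30 updates a month for "daily", more than one year for "long") are illustrative assumptions based on the rules of thumb in this post, not fixed rules.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    retention_years: float      # how long the data must be kept
    updates_per_month: float    # how often the copy is refreshed
    needed_for_recovery: bool   # would you restore from it after an outage?


def suggest_store(ds: DataSet) -> str:
    """Apply the rules of thumb; the thresholds are illustrative, not ironclad."""
    if ds.needed_for_recovery and ds.updates_per_month >= 30:
        return "backup"    # daily or hourly copies kept for disaster recovery
    if ds.retention_years > 1:
        return "archive"   # kept longer than a year, refreshed rarely
    return "review"        # neither rule applies cleanly; decide case by case


print(suggest_store(DataSet("erp-database", 0.25, 30, True)))
print(suggest_store(DataSet("tax-records-2019", 10, 0.1, False)))
```

A helper like this does not replace the business discussion; it only makes the assumptions explicit so they can be challenged.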
Can you use your backup as an archive?
I mentioned that companies may be tempted to repurpose their backups as a kind of archive by simply holding onto them for a long time. That prompts yet another question: “Is a backup suitable as an archive?” The answer to that depends on what you expect from a backup and what you expect from an archive.
That kind of repurposing can be an acceptable practice as long as your backup software supports your goals. If your goals are to be able to restore data after a disaster and to retrieve it for compliance reasons, then yes, you could probably store your backups for years. But suppose you want to keep the data for historical reasons, like mining it for insights as part of your decision-making process. That’s a completely different use case for which you’ll need a separate archiving environment.
The length of time you need to keep data frequently influences how your organization should store it. If you go the route of repurposing backups into archives, you’ll discover that backup software sometimes suffices as an archiving tool. However, there are situations and use cases where using a backup as an archive is wholly inadequate.
The role of file formats in backup and archive
I’ve mentioned the use case of searching old data; backup software is not designed with that in mind. There’s another big consideration.
If you want to store data for a long time for historic reasons, one of the first questions you should ask yourself is “What will we store in our archive?” More specifically, “Which file formats will we store in the archive?”
Remember 20 years ago when you bought a new car, then discovered that it didn’t have a deck for playing your cassette tapes? Or, more recently, do you remember when you bought a new car five years ago, then discovered that you couldn’t play your CDs in it?
It’s easy to think a year or two into the future and predict that the productivity and business applications you rely on today will still be around. But technology has evolved right over the wildly popular software products of yesteryear, such as WordPerfect, Lotus 1-2-3, dBASE and Harvard Graphics. If you need to access data in one of these formats seven, ten or 20 years from now, will there be an accessible way to read that data?
That is a strong argument for archiving truly important documents not in proprietary formats but in self-describing file formats, which allow you to store metadata that describes their content. If you want some kind of guarantee that you’ll be able to read the data in 20 years, you’ll need a type of data that can describe itself. It sounds odd that something could describe itself, but you need that kind of format if you want to store something for a long time. Examples include portable document format (PDF) and Extensible Markup Language (XML).
Formats for video and pictures have done a good job of that by storing a lot of metadata inside of files. The result is greater assurance that you’ll be able to read those files well into the future.
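To make "self-describing" concrete, here is a minimal sketch that wraps a document in XML together with its own metadata, using Python's standard `xml.etree.ElementTree` module. The element names (`record`, `metadata`, `title`, `created`, `author`, `content`) and the sample values are hypothetical; a real archive would follow whatever schema your organization or industry prescribes.

```python
import xml.etree.ElementTree as ET


def build_archive_record(title: str, created: str, author: str, body: str) -> str:
    """Wrap a document body in XML that carries its own descriptive metadata."""
    record = ET.Element("record")
    meta = ET.SubElement(record, "metadata")
    ET.SubElement(meta, "title").text = title
    ET.SubElement(meta, "created").text = created
    ET.SubElement(meta, "author").text = author
    ET.SubElement(record, "content").text = body
    return ET.tostring(record, encoding="unicode")


xml_text = build_archive_record(
    title="Board meeting minutes",
    created="2024-03-15",
    author="Corporate secretary",
    body="Minutes approved without amendment.",
)

# Any XML parser can read the metadata back without the original application.
parsed = ET.fromstring(xml_text)
print(parsed.findtext("metadata/title"))
```

The point is that the description travels inside the file, so a reader years from now needs only a generic XML parser, not the application that created the record.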
Thus, if you want to store something for historic reasons, you need to consider which file formats and storage media will still be readable in 30 years. That is rarely a problem with backups, because the time horizon is shorter. If, in the next seven years, you’re audited by a tax authority and need to recover a data file, your accounting software will almost certainly still read it. Your incorporation documents, by-laws and meeting minutes are iffier.
Optimal storage media for backup and archive
To where do you write your data? To which types of media do you write it?
Tape once was the leading medium for both backups and archives. That led to the temptation to use simple backup software as an archiving tool because, either way, the data would end up on tape. Now you have more choices.
You can store on tape, on a disk in your own data center or in the cloud (also known as a disk in someone else’s data center).
The TL;DR is that “backup” means “fast access,” which means “more expensive,” and “archive” means “slow,” which means “less expensive storage.”
Storage options for backup
As described, backups involve high frequency, and recovery demands speed. That means that backing up straight to tape is no longer the best fit for most large companies.
Why not? Because the tape usually goes into a vault, which means somebody or something needs to find and load it, then locate and retrieve the data from it. The steps involved mean that tape won’t get you fast access.
For backups nowadays, disk and even cloud are better suited, although restoring from the cloud will probably be slower than restoring from a disk in your local data center. Still, if you can live with a longer recovery time objective (RTO), then the cloud is fine.
Storage options for archive
For archives, money is usually the first consideration. How much data do you want to archive? A large archive on a local disk or in the cloud can become quite expensive. But you’ll likely tolerate slower retrieval from the archive, so perhaps you’ll trade speed of access against storage cost. That brings tape back into your calculations; even though it represents more manual labor, it’s a sensible way to store archives.
Cloud service providers (hyperscalers) offer low-cost storage options specifically for archives. Why are they inexpensive? Because although it looks as though you’re saving to a cloud-based disk, chances are you’re saving to a disk cache, which then hands the data off to tape. Time to first byte can be four to eight hours. If that’s an acceptable retrieval time for your historical information, then a cloud archive is suitable.
Optimal storage tiers: Hot, cold and archive
Before you send anything to the cloud, you must tell your cloud provider the level of service – usually hot, cold or archive – you want for the data. That’s how the provider knows whether to store your data on fast (i.e., expensive) disks, on slower (less expensive) disks, or on tape (least expensive). And that takes you back to the earlier question, “What is the frequency or likelihood that our users will need to retrieve that data?”
Here is a typical scenario: Suppose you decide to store data in the cloud and you start with the hot storage tier.
- Your provider will allow your users to access that data as often as they want. But you will pay the highest price (100%) for immediate access.
- If they don’t touch the data within the next 30 days, your provider can move it into the cold tier, where it’s cheaper for them (and for you). Access is slower, but the data is also accessed less often. You pay a rate that is 50% of the hot-tier rate.
- If your users do access that data within those 30 days, your provider will bill the data at the hot-tier rate of 100% again.
Here is a typical scenario for the archive tier:
- Eventually, if nobody accesses the data, your provider will move it to archive.
- If nobody accesses the data within 180 days, your provider will charge you at the archive rate (30%).
- If somebody accesses the data within those 180 days, the provider must retrieve it from the archive (tape). They will charge you at the cold-tier rate (50%) again, even though it was stored on archive media.
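The arithmetic behind these two scenarios can be sketched in a few lines. The percentages (100%, 50%, 30%) and the idle thresholds (30 and 180 days) come from the scenarios above; the function names, the 10 TB volume and the price of 20 currency units per TB are made up for illustration. The sketch ignores retrieval charges, and real providers publish their own tiers, prices and rules.

```python
# Percentages and idle thresholds are taken from the scenario in the text;
# real providers publish their own tiers, prices and rules.
TIER_RATE_PERCENT = {"hot": 100, "cold": 50, "archive": 30}
COLD_AFTER_DAYS = 30
ARCHIVE_AFTER_DAYS = 180


def tier_for(days_since_last_access: int) -> str:
    """Return the tier the data would be billed at after a period of no access."""
    if days_since_last_access >= ARCHIVE_AFTER_DAYS:
        return "archive"
    if days_since_last_access >= COLD_AFTER_DAYS:
        return "cold"
    return "hot"


def monthly_cost(size_tb: float, hot_price_per_tb: float, days_since_last_access: int) -> float:
    """Monthly storage cost, scaled by the tier's share of the hot-tier price."""
    percent = TIER_RATE_PERCENT[tier_for(days_since_last_access)]
    return size_tb * hot_price_per_tb * percent / 100


# Hypothetical figures: 10 TB at 20 currency units per TB per month in the hot tier.
for idle_days in (5, 45, 200):
    print(idle_days, tier_for(idle_days), monthly_cost(10, 20, idle_days))
```

With these assumed figures, the same 10 TB costs 200 units a month in the hot tier, 100 in the cold tier and 60 in the archive tier.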
It’s up to you as the one who pays for data storage to understand storage tiers.
It’s also up to you to think carefully about how your users understand the term “archive.”
Suppose you referred to your years-old data as your “archives.” It would make sense to you to store it in an archive tier, wouldn’t it? That would be the best, most cost-effective tier for it. But now suppose that your users want to be able to freely search your archives. In that case, you would never store it in an archive tier, because your users would be continually retrieving that data. They would access it as if it were normal production data, and your cloud provider would charge you at the highest rate.
In short, if you know the likelihood that your users will need to retrieve data within the next 30 days, you can choose an optimal tier for your usage. The same thing applies to your archives, which, strictly speaking, are not meant to be accessed routinely. They are kept for a very long time and they are unchangeable.
And that brings up the topic of immutability.
Immutability and the difference between backup and archive
Recall that the main purpose of backups is to protect yourself against and recover from data loss. One of the biggest risks is that your backup data gets compromised – say, by ransomware – leaving you with nothing from which you can restore. Immutable backups are a way around that because, once they’ve been created, they cannot be changed by anything, including ransomware. In that regard, backups and archives are similar: you want to protect both of them against unwanted change or deletion.
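As a toy illustration of what "cannot be changed" means, here is an in-memory, write-once store. The `WriteOnceStore` class is hypothetical and exists only to show the behavior; in a real immutable backup, the storage system itself enforces the rule, not application code like this.

```python
class WriteOnceStore:
    """In-memory stand-in for immutable storage: objects can be added, never changed or removed."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if key in self._objects:
            raise PermissionError(f"{key!r} already exists and cannot be overwritten")
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        raise PermissionError("objects in this store cannot be deleted")


store = WriteOnceStore()
store.put("backup-2024-06-03", b"payroll data")

try:
    # Simulates ransomware trying to replace the backup with encrypted content.
    store.put("backup-2024-06-03", b"encrypted by attacker")
except PermissionError as err:
    print("rejected:", err)

print(store.get("backup-2024-06-03"))
```

The original copy survives the overwrite attempt, which is exactly the property you want from both immutable backups and archives.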
However, there’s a big difference in the rate of change between backups and archives. Your backups have a much higher rate of change because resuming operations after an outage with little data loss depends on capturing changes every hour or day. Also, your software should allow you to clean up your backup set regularly, to ensure that you’re not filling up storage with obsolete files.
In your archives, on the other hand, the rate of change approaches zero. Your archives represent the data you want to keep for a long time, so you store it and that’s it. Occasionally you’ll add to it, but under normal circumstances, it’s unlikely you’ll modify or delete any of it. The files will never become obsolete, so you don’t need to clean up your archive.
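The contrast shows up clearly in a retention rule. The sketch below splits a backup set into copies to keep and copies that have aged out; the function name, the sample dates and the 30-day window are assumptions for illustration. An archive would have no equivalent of this step, because nothing in it expires.

```python
from datetime import date, timedelta


def split_by_retention(backup_dates: list[date], today: date, keep_days: int) -> tuple[list[date], list[date]]:
    """Separate backups still inside the retention window from those that can be cleaned up."""
    cutoff = today - timedelta(days=keep_days)
    keep = [d for d in backup_dates if d >= cutoff]
    expired = [d for d in backup_dates if d < cutoff]
    return keep, expired


backups = [date(2024, 6, 29), date(2024, 6, 1), date(2024, 4, 1)]
keep, expired = split_by_retention(backups, today=date(2024, 6, 30), keep_days=30)
print(keep)     # the two June backups fall inside the 30-day window
print(expired)  # the April backup has aged out and can be cleaned up
```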
Data scope – The data that you back up or archive
Another big difference between archive and backup is the scope of the data that goes into each one. That extends to the workflow you’ll put in place to get data to its proper destination.
For archives, it’s usually simple because you’re keeping the data for historical reasons. Your archiving policy may dictate that you review finished projects – software development, marketing content, architectural drawings – once a month and store them immutably. You can apply a project-management mindset to archiving your finite, closed work in a specific location and never think about it again.
For backups, the workflow is different because you have to apply a work-in-progress mindset. Production is ongoing, projects are in mid-flight and, most of all, communication is unstructured.
In fact, the biggest problem with information arises when it’s unstructured. Data in a discernible shape and form – records in a database, cells in a spreadsheet, text in XML – is structured. Data with no shape and form – dialog and text in a video, bullets in a presentation, web pages – is unstructured.
It’s important to think carefully about unstructured data when it comes to backup and archive. What unstructured data do you need to protect? How long must you retain it to ensure you can address any legal issues?
Email – A special case
Email is unstructured, so how can you know if somebody discusses something important with a specific person about a specific product? You can’t, because there’s no structure, no easy way to locate that conversation, short of logging into the account and searching through the mailbox.
News articles frequently mention the value of email in the legal discovery process, including the importance of retaining it. Think about that in the context of your own business and sales correspondence. Suppose you use web-based customer relationship management (CRM) software to generate a quote for a prospect, then you send a PDF of the quote via email. You don’t back up the CRM data; your CRM vendor is responsible for that. But suppose they suffer data loss or a ransomware attack and your quote disappears from their servers. Suddenly the email and attachment you sent become quite important.
If you want to protect your own history of that information – separate from the CRM data you cannot protect – you need to back up your incoming and outgoing email. You can do that, but you’ll still face the problem that it’s unstructured data.
The next problem is setting a realistic backup frequency for email, and for all unstructured data. Why? Because some users delete email messages – sometimes by accident, sometimes on purpose. To protect your company completely, you’ll need to keep a lot of history. That means a high frequency of backups because of the high rate of change mentioned above. If you back up only at night, a user can send something in the morning and delete it in the afternoon, and you’ll have no copy.
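The nightly-backup gap is easy to check with a small simulation. This is a sketch under simple assumptions: a message is captured only if a backup runs while the message still exists in the mailbox, and the function name, dates and times are invented for the example.

```python
from datetime import datetime, timedelta


def captured_by_backup(created: datetime, deleted: datetime, backup_times: list[datetime]) -> bool:
    """A message reaches a backup only if a backup runs while the message still exists."""
    return any(created <= run < deleted for run in backup_times)


day = datetime(2024, 6, 3)
sent = day.replace(hour=9)       # message sent in the morning
removed = day.replace(hour=15)   # deleted by the user in the afternoon

nightly = [day.replace(hour=23)]
hourly = [day + timedelta(hours=h) for h in range(24)]

print(captured_by_backup(sent, removed, nightly))  # False: the 23:00 run comes too late
print(captured_by_backup(sent, removed, hourly))   # True: runs between 09:00 and 15:00 see it
```

The shorter the interval between backups, the smaller the window in which a message can be sent and deleted without leaving a copy.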
Thus, unstructured data can throw a spanner into the works as you think about backup and archive.
In summary, the right question in dealing with archive and backup is “What do we want to accomplish?”
Create a small spreadsheet with your priorities and limitations, then decide what fits. Ask yourself, “Is this a feature of backup? Or is this a feature of an archiving environment?” Then determine whether you can fulfill your need with your existing backup environment or have to invest in a proper archiving system.
Mind you, the issues of backup and archive should really be evaluated in light of the business, not in light of the technology. They should be part of the process for setting service level agreements for the different types of data in your company. It’s a good idea to have IT weigh in on which system to buy, but it’s not always a good idea to have IT make basic decisions about what to archive and back up. The potential impact of those decisions on the organization is such that your business managers should make them.