When planning backups and long-term data storage, IT professionals must consider not only how much data to keep, but also how long to keep it. Long-term data storage is a balancing act between the cost of keeping data for a long time and the probability that you will ever access that data again. That is especially true given that most organizations retain data over the long term mainly to comply with regulations. Enterprises design their infrastructure to keep data for years. Here, system administrators will learn how to navigate the obstacles to providing long-term data storage in their own organizations.
Why does long-term data storage matter?
The most common reason for keeping data for a long time is that you’re required to keep it. Regulations oblige you to retain a given number of years of your data, and those regulations are usually a function of finance. Tax authorities, financial agencies and industry groups enforce long-term data storage as a means of governance. If you’re obligated to comply with those regulations, you’ll find it’s easier when you retain the original data in case of an eventual audit.
Closely related are the legal reasons for long-term data storage. Suppose your company manufactures a product – whether a screwdriver or a skyscraper – and your customers are entitled to certain guarantees. It’s in your best interest to hold onto all related designs, plans, approvals and documentation so that you can prove, even many years later, that it was built correctly.
Similarly, take the example of an automobile made up of thousands of different parts – engine, electrical, interior, electronics, tires – that come mostly from different sources. You want a guarantee from your suppliers as to the quality of the components so you can extend a good-faith, overall guarantee to your customers.
In case a component of the automobile should prove defective within a certain number of years of manufacture, there will be inquiries around a recall. You’ll search for records – dates, lot numbers, specifications – about the standards you enforced. Long-term data storage can be advantageous to you in determining who will pay for the recall. If you supplied the component and have not retained the data proving that you manufactured and tested it to a specific standard, you could end up paying for the recall.
Note that the technologies behind long-term data storage are similar to those for backups; however, the reasons for using them are different. The former is for reducing liability in normal business operations and the latter is for recovering from an outage or disaster.
Common challenges involved with long-term data storage
Time and the elements degrade any object in storage, whether physical or digital, and that degradation is the main obstacle to storing your data over time.
Degradation
Paper was the medium of recordkeeping for centuries; it decays over time, whereas the ones and zeros of digital records do not. But because of the media on which they are stored, digital records are as vulnerable to moisture and heat as paper records are.
Tape is sensitive to changes in temperature and humidity. If you store to tape, you may find yourself after 10 years with tape that is unreadable because it has glued itself together. The solution is to spin up the tape once per year. It is not even necessary to read the tape; merely unwinding and rewinding it will keep it from sticking to itself.
Optical disks emerged as an alternative for long-term data storage. But they are vulnerable to sunlight and radiation, which can lead to bit rot if the disks are not correctly stored. So, the industry has adopted the checksum, a number computed from the contents of a piece of data. The drive stores the checksum with the data upon writing it to disk, then recomputes and verifies the checksum upon reading the data later. Any difference in checksum indicates that the data has changed and cannot be trusted.
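The write-then-verify cycle described above can be sketched in a few lines of Python. This is only an illustration, not how any particular drive implements it: it assumes SHA-256 from the standard library as the checksum, and the helper names `checksum`, `store` and `verify` are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def store(data: bytes) -> dict:
    """Simulate a write: keep the checksum alongside the data."""
    return {"data": data, "checksum": checksum(data)}


def verify(record: dict) -> bool:
    """Simulate a read: recompute the checksum and compare it with the stored one."""
    return checksum(record["data"]) == record["checksum"]


record = store(b"diagnostic report, lot A-1047")
print(verify(record))  # True: the data is unchanged

# Simulate bit rot by flipping a single bit in the stored bytes.
corrupted = bytearray(record["data"])
corrupted[0] ^= 0b00000001
record["data"] = bytes(corrupted)
print(verify(record))  # False: the data can no longer be trusted
```

A checksum of this kind detects a change but does not repair it, which is one reason to pair it with a second copy of the data.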
The advent of cloud computing and storage may seem like a welcome departure from the risks of faulty media, but that is not necessarily the case. Cloud service providers may have ways of slowing the degradation, but ultimately they face the same media alternatives that you face: tape, optical disks, hard drives and solid-state drives (SSD). Most providers, however, reduce risk by offering colocation and geographical redundancy. As always, reducing risk costs money.
Degradation is ubiquitous and it affects different industries to different degrees. Health care, for example, would not tolerate degradation in data from a body scan or a diagnostic report. Similarly, finance would not tolerate degradation that resulted in a change in bank balance. Those industries would be expected to compare before- and after-checksums to reduce risk.
File format obsolescence
If you plan to store something for a very long time, will you store it in a format that is common today but may not be common in thirty years? Or can you store it in a way that makes reading it less dependent on the formats and tools in use today?
Consider avoiding obsolescence with a self-describing format in which you save, with the data, instructions on how the data can be read. It may seem counterintuitive that a data file could describe itself, but that’s what it takes to lower the risk of obsolescence. Examples include Portable Document Format (PDF) and Extensible Markup Language (XML).
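As a rough sketch of what "self-describing" can mean in practice, the Python below writes an XML record that carries its own field descriptions and units, then reads it back using nothing but a generic XML parser. The element names, field names and values are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def build_record() -> str:
    """Build a record whose field descriptions travel with its values."""
    root = ET.Element("archive_record", {"created": "2024-01-15"})

    # The description of each field is stored inside the file itself.
    fields = ET.SubElement(root, "fields")
    for name, unit, text in [
        ("lot_number", "none", "Manufacturer lot identifier"),
        ("torque", "newton-metre", "Tested tightening torque"),
    ]:
        ET.SubElement(fields, "field", {"name": name, "unit": unit}).text = text

    values = ET.SubElement(root, "values")
    ET.SubElement(values, "value", {"field": "lot_number"}).text = "A-1047"
    ET.SubElement(values, "value", {"field": "torque"}).text = "12.5"
    return ET.tostring(root, encoding="unicode")


def read_record(document: str) -> list:
    """Recover (field, value, unit) triples from the document alone."""
    parsed = ET.fromstring(document)
    units = {f.get("name"): f.get("unit") for f in parsed.iter("field")}
    return [(v.get("field"), v.text, units[v.get("field")])
            for v in parsed.iter("value")]


for field, value, unit in read_record(build_record()):
    print(field, "=", value, "unit:", unit)
```

The reader needs no external schema or application to learn that `torque` is measured in newton-metres; that knowledge is archived with the number.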
Unforeseen consequences of migrating to a new format
Suppose you migrate to a new format to avoid obsolescence. Will something happen in the future to eliminate that advantage?
The best way to answer that question is according to the level of change you’re willing to accept over time. In the enterprise context, do you need to store every change to your sales database for the next 10 years? Or does it suffice to store the state of each user’s account at specific points in time? Similarly, do you need to save the spreadsheet file from which you generated a report, or just a PDF of the report? It’s valid to talk about how much resolution you need in the financial information you save to long-term data storage.
Mistakes organizations make when it comes to long-term data storage
The most common mistake is not understanding how data is stored.
For example, consider a database that you created one year ago in your production environment. Every day since then, transactions have been added to, deleted from and edited in that database. So, at the end of each day, the database is different. Does that mean you should retain 365 versions of the database in long-term storage? Of course not. Since the database has been running for a year, the single version of the database you save from the production environment today already contains 365 days of history. Every version you save contains the entire history of transactions.
The misunderstanding arises when people think that “long-term storage” means that you need to keep every version of the file for years. It’s necessary to understand how your data is stored in the production environment today and how much history is already there. When you understand how your data is structured and stored, then you realize that there is no need to store obsolete versions.
Practices organizations should consider when architecting long-term data storage
The first consideration when it comes to architecting long-term data storage is where to put the data for the long term, which depends in turn on the answers to several other questions.
- What do you need to keep? Which regulations, industry standards or financial requirements come to bear on your decisions about long-term data storage?
- How much resolution do you need to keep? As described above, how much detail are you expected to retain, and for how long?
- How much risk can you take for changes like bit rot that could befall the data?
- How often do users need to access it? If frequently, then it will probably be better to store it where access is quick and easy, even if you must pay more. If infrequently, then you could store it far less expensively, but you will pay more to retrieve it.
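The access-frequency trade-off in the last question can be made concrete with a little arithmetic. The sketch below compares two storage tiers by yearly cost. The tier names and prices are hypothetical placeholders, not quotes from any provider, and the model assumes each retrieval reads the full data set.

```python
def yearly_cost(size_gb: float, retrievals_per_year: int,
                storage_price: float, retrieval_price: float) -> float:
    """Yearly cost = 12 months of storage + the cost of each full retrieval.

    storage_price is per GB per month; retrieval_price is per GB retrieved.
    """
    storage = size_gb * storage_price * 12
    retrieval = size_gb * retrieval_price * retrievals_per_year
    return storage + retrieval


def cheaper_tier(size_gb: float, retrievals_per_year: int, tiers: dict) -> str:
    """Return the name of the tier with the lowest yearly cost."""
    return min(
        tiers,
        key=lambda name: yearly_cost(size_gb, retrievals_per_year, *tiers[name]),
    )


# Hypothetical prices: (storage per GB-month, retrieval per GB).
TIERS = {"fast": (0.020, 0.00), "archive": (0.002, 0.05)}

print(cheaper_tier(1000, 1, TIERS))   # archive: read once a year
print(cheaper_tier(1000, 12, TIERS))  # fast: read every month
```

With these placeholder prices the break-even point falls between four and five full retrievals per year; substituting your own figures shows where it falls for your data.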
Once you’ve decided what to store and where to store it, you can consider how to store it.
- What are the use cases for the data? Under what foreseeable circumstances could you still need it after five, seven, ten or twenty years?
- In which format will you save it? Can you convert it to a new format? In cases like legal documents, the original format is the only one permitted, so the information must not be altered at all. You would have to store the original data because conversion would change it.
- How do you make the data accessible? How can you find and retrieve it if need be? If you just pile your data up in long-term storage without thinking about how to store it, then finding your data again will be difficult. That’s especially true when time is of the essence.
- Who owns the data if your company is acquired? That question becomes even more compelling if the acquiring and target companies are in different countries or trading blocs.
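One way to avoid "piling up" data is to record the answers to the questions above in a catalog at the moment you archive each item. The following is a minimal sketch under that assumption; the `CatalogEntry` fields, the sample paths and the company name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CatalogEntry:
    location: str            # where the archived object lives
    description: str         # human-readable summary to search on
    file_format: str         # format at the time of archiving
    original_required: bool  # True if the original must be kept unaltered
    owner: str               # legal entity that owns the data
    retain_until: date       # earliest date the data may be deleted


def find(catalog: list, keyword: str) -> list:
    """Return entries whose description mentions the keyword (case-insensitive)."""
    return [e for e in catalog if keyword.lower() in e.description.lower()]


def deletable(catalog: list, today: date) -> list:
    """Return entries whose retention period has ended."""
    return [e for e in catalog if e.retain_until <= today]


catalog = [
    CatalogEntry("archive/2019/brake-tests.pdf",
                 "Brake component test results, lot A-1047",
                 "PDF", True, "Example Corp", date(2034, 12, 31)),
    CatalogEntry("archive/2015/sales-q4.xml",
                 "Quarterly sales summary",
                 "XML", False, "Example Corp", date(2022, 12, 31)),
]

print([e.location for e in find(catalog, "brake")])
print([e.location for e in deletable(catalog, date(2025, 1, 1))])
```

Even a catalog this small turns "find the test records for that lot" into a lookup rather than a search through years of files.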
Additional best practices for long-term data storage should include the following:
Automation
First and foremost, automate your procedures for storing data long-term. Anything you need to remember to run is something you may forget or procrastinate on running. The sooner you automate it, the more time you’ll have to devote to high-value tasks. Automation extends to creating the scripts and policies that will store your data long-term with less effort on your part.
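As an illustration of the kind of script meant here, the sketch below moves files that have not been modified for a given number of days into an archive directory. The function name, the age-based policy and the directory layout are assumptions made for the example; a real policy would follow your own retention rules.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def archive_old_files(source: Path, target: Path, max_age_days: int,
                      now: Optional[float] = None) -> list:
    """Move files not modified within max_age_days from source to target.

    Returns the names of the files that were moved, in sorted order.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(source.iterdir()):
        # Only plain files are considered; subdirectories are left alone.
        if path.is_file() and path.stat().st_mtime < cutoff:
            shutil.move(str(path), str(target / path.name))
            moved.append(path.name)
    return moved
```

A call such as `archive_old_files(Path("reports"), Path("archive"), 365)`, run from a scheduler such as cron, would replace a manual yearly clean-up; the paths here are placeholders.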
Encryption
When you convert readable content to encrypted content that requires a key to read, you protect your data even if attackers can get to it. Commonly touted as a best practice for protecting backups, encryption is also valuable in long-term data storage.
Deduplication
How can you keep from running out of long-term storage space? You cannot, of course, but data deduplication is the best way to get more capacity from your existing storage. Plus, the deduplication technique of replacing redundant data with tokens means that you expend less time and fewer network resources in sending data to long-term storage. Source-side deduplication can transmit as much as 90% less data across the network to your storage target.
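The idea of replacing redundant data with tokens can be shown with a toy example. The sketch below splits data into fixed-size chunks, stores each unique chunk once, and keeps an ordered list of tokens for rebuilding the original. It assumes SHA-256 digests as tokens and a four-byte chunk size purely for readability; it does not represent how any specific product deduplicates.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and store each unique chunk once.

    Returns (store, tokens): store maps token -> chunk, and tokens is the
    ordered list of tokens needed to rebuild the data.
    """
    store, tokens = {}, []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        token = hashlib.sha256(chunk).hexdigest()
        store.setdefault(token, chunk)  # a repeated chunk is not stored again
        tokens.append(token)
    return store, tokens


def restore(store: dict, tokens: list) -> bytes:
    """Rebuild the original data from the chunk store and token list."""
    return b"".join(store[t] for t in tokens)


data = b"ABCDABCDABCDWXYZ"
store, tokens = deduplicate(data)
print(len(tokens), "chunks referenced,", len(store), "chunks stored")
# → 4 chunks referenced, 2 chunks stored
print(restore(store, tokens) == data)  # True
```

In the source-side case, only chunks whose tokens the storage target has not already seen would need to cross the network.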
Conclusion
When your business is obligated to use long-term data storage, you face decisions rooted as much in your processes as in your technology. You must balance factors like financial and legal requirements against how you will organize, access and structure long-term data storage years or even decades into the future.
Common long-term obstacles include data degradation, file format obsolescence and the unforeseen consequences of migrating to a new format. Organizations confront basic, high-level questions such as where to store the data and how to store it. The most effective way to answer those questions is in the context of lower-level questions tied to business practices and appetite for risk.