Data is growing, and although user data certainly contributes to the problem, most of that growth comes from machine and sensor data. However, the amount of active data, meaning data accessed within the last 30 days, is growing at a much slower rate. In other words, a considerable share of data growth can be attributed to older, infrequently accessed information.
Much of this dormant data is unstructured (file) data. This is the hardest type of data for a backup system to ingest and the most expensive to store, yet IT professionals methodically back it up week after week. Moving this information out of the backup process and into a data archiving system improves backup operations substantially and reduces costs.
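As a rough illustration of how dormant files might be identified, the sketch below walks a directory tree and reports files whose last-access time is older than the 30-day threshold mentioned above. The function name `find_dormant_files` and its parameters are hypothetical, and the approach assumes the file system records access times reliably, which is not the case on volumes mounted with access-time updates disabled.

```python
import os
import time

SECONDS_PER_DAY = 86_400


def find_dormant_files(root, days=30, now=None):
    """Yield (path, size_bytes) for files under root not accessed in `days` days.

    Hypothetical helper for illustration; relies on st_atime being maintained.
    """
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symbolic links
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if info.st_atime < cutoff:
                yield path, info.st_size
```

Summing the reported sizes gives a first estimate of how much capacity is a candidate for archiving, although a real assessment would also weigh file type, ownership and retention rules.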
The unstructured data challenge
Knowing which unstructured data will become active again is the data center equivalent of finding a needle in a haystack. Manually restoring this data when users ask for it is also cumbersome. Frequently, users cannot provide essential details such as file names or creation dates. And even when the files can be identified, finding them within a vast sea of data can be difficult.
Other articles in this series
In part 2 of this article, learn how to choose the right storage and software for archiving.
The simplest approach seems to be to keep all the data on production servers and not manage it at all. To some extent, data storage vendors have led IT professionals to believe that this is the best possible outcome and that technology can handle it. Scale-out NAS technology does make it possible to add almost limitless capacity. But is it logical? A large scale-out NAS architecture can be expensive to buy, expand, power and cool -- and you still need to protect it.
Backing up all of that data presents a number of challenges:
- The time required for the backup application to examine each of the files -- which could number in the billions -- to determine which ones should be backed up.
- The need to transfer all the data to the backup system over the network.
- The cost to maintain multiple copies of the data. At a minimum, backup storage is two times the size of production storage, and in most cases it is five times or more.
The majority of backup data (80% to 90%) does not change and probably never will, and much of it will never be needed again. The data just consumes capacity and clogs backup jobs, "just in case."
Removing this data from the backup process can provide dramatic results, reducing backup data sets by as much as 80%. With a modern data archiving system in place, only databases and the most active unstructured data must be backed up.
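The arithmetic behind these figures can be sketched as follows. The 100 TB production size is a made-up example, `backup_capacity_tb` is a hypothetical helper, and the calculation assumes the same two-times or five-times multiplier still applies to whatever data remains in the backup set. It also ignores the capacity the archive itself consumes.

```python
def backup_capacity_tb(production_tb, copies_multiplier, archived_fraction=0.0):
    """Estimate backup storage for the data left on production after archiving.

    copies_multiplier: backup storage as a multiple of production (2 to 5 above).
    archived_fraction: share of production data moved to the archive (0.0 to 1.0).
    """
    if not 0.0 <= archived_fraction <= 1.0:
        raise ValueError("archived_fraction must be between 0 and 1")
    remaining_tb = production_tb * (1.0 - archived_fraction)
    return remaining_tb * copies_multiplier


# Hypothetical 100 TB production environment, with 80% of the data archived.
for multiplier in (2, 5):
    before = backup_capacity_tb(100, multiplier)
    after = backup_capacity_tb(100, multiplier, archived_fraction=0.8)
    print(f"{multiplier}x: {before:.0f} TB before, {after:.0f} TB after archiving")
```

Under these assumptions, the backup footprint drops from 200 TB to 40 TB at a two-times multiplier and from 500 TB to 100 TB at five times.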
Three ways a data archiving system eases access
The primary responsibility of a data archiving system is to provide seamless access to data. There are three ways to provide rapid response to recovery requests, and each requires the archive system to have an on-premises disk front end. The archive software can then replicate the directory structure of the primary storage system and provide either an indexing and search function or automated data recovery.
- Replicate the directory structure of the archived data set. Allow the user to navigate the same directory path as before to find files, with "archive" inserted in the path. For example, instead of looking for a file in "\home\docs\", users would be instructed to look in "\archive\home\docs\". This is the easiest method to implement, the one least likely to break due to a software update and the one with the least impact on the environment. It is often the least-expensive option, but it requires the most training, and users have to be comfortable navigating file systems.
- Move all data to the archive and include a "Google-like" search capability. This could include content-level searching. A sophisticated search like this allows users to search by file name, modification date or content within the file. Once they find the necessary file, the interface presents an option to restore it to the location of their choice, assuming they have adequate security clearance.
- Automated recovery. Some products leave stub files or symbolic links, which make it look like the file is in its original place, but when needed, it can be retrieved from its archived location. While this option is easiest on users, it is also the most fragile. If the stub files or links are deleted, accessing the original data can be difficult.
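The three approaches above can be sketched in a few lines of Python. This is a toy illustration rather than how any particular archiving product works: all function names are hypothetical, the search is a linear scan by name and modification time instead of a prebuilt content index, and the stub is a plain symbolic link, which may require extra privileges on some platforms.

```python
import shutil
from pathlib import Path


def archive_path_for(src, prod_root, archive_root):
    """Method 1: mirror the production-relative path under the archive root."""
    relative = Path(src).resolve().relative_to(Path(prod_root).resolve())
    return Path(archive_root) / relative


def archive_file(src, prod_root, archive_root, leave_stub=False):
    """Move one file into the archive; optionally leave a symlink stub (method 3)."""
    src = Path(src)
    dest = archive_path_for(src, prod_root, archive_root)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    if leave_stub:
        # The link makes the file appear to be in its original place.
        src.symlink_to(dest.resolve())
    return dest


def search_archive(archive_root, name_contains="", modified_after=0.0):
    """Method 2 (simplified): find archived files by name fragment and mtime."""
    matches = []
    for path in Path(archive_root).rglob("*"):
        if not path.is_file():
            continue
        name_ok = name_contains.lower() in path.name.lower()
        if name_ok and path.stat().st_mtime >= modified_after:
            matches.append(path)
    return sorted(matches)
```

With `leave_stub=False`, users find the file by browsing the mirrored path or by searching. With `leave_stub=True`, the original path keeps working only for as long as the link survives, which is the fragility noted for the third method.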
Each of these methods significantly reduces the amount of data the backup software needs to transfer from primary to secondary storage. However, only the first two methods reduce the time it takes to verify whether each file needs to be backed up. Known as the file system walk time, this process involves inspecting every file on the storage device, and on large file shares it can take longer than the actual backup. The reduction occurs because the data, and the files associated with it, are no longer on production storage. With the third method, the same number of physical files or objects must still be processed, because stub files replace the migrated files.
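The difference in walk time can be illustrated with a small experiment. The sketch below builds a throwaway directory of ten files, moves some to an archive directory outright and replaces others with symbolic-link stubs, counting the entries a scan has to inspect at each stage. The names `count_walk_entries` and `demo` are hypothetical, and symbolic links stand in for whatever stub mechanism a real product uses.

```python
import os
import tempfile
from pathlib import Path


def count_walk_entries(root):
    """Count the file entries a backup scan would have to inspect under root."""
    return sum(len(filenames) for _dirpath, _dirnames, filenames in os.walk(root))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        prod = Path(tmp) / "prod"
        archive = Path(tmp) / "archive"
        prod.mkdir()
        archive.mkdir()
        for i in range(10):
            (prod / f"file{i}.txt").write_text("data")
        before = count_walk_entries(prod)

        # Methods 1 and 2: files leave production storage entirely.
        for i in range(4):
            (prod / f"file{i}.txt").rename(archive / f"file{i}.txt")
        after_move = count_walk_entries(prod)

        # Method 3: files are moved, but a stub (symlink) replaces each one.
        for i in range(4, 8):
            src = prod / f"file{i}.txt"
            dest = archive / src.name
            src.rename(dest)
            src.symlink_to(dest)
        after_stub = count_walk_entries(prod)
        return before, after_move, after_stub


print(demo())  # (10, 6, 6)
```

Moving four files outright cuts the entry count from 10 to 6, while stubbing four more leaves it at 6: the stubs free capacity but give the scan just as many objects to examine.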