Published: 04 Mar 2015
Historically, backup and archive data has represented the most comprehensive data set in many organizations. But this data is rarely used for anything other than restoring deleted, corrupted or otherwise lost primary data. This is a noble duty in and of itself, but the excitement about big data analytics is leading some users to believe that there's value in this aggregated data collection. While some of this attention may be unfounded, there's an opportunity to turn some of the cost of backup into a resource. But to achieve that, it's necessary to rethink the backup process.
In this article, when we talk about mining a company's existing data for potential value, we're talking about searching and accessing stored data, not about complex big data comparisons that require source data sets to be normalized or converted into different formats. One example could be finding customer data that fits specific criteria, such as purchasing behavior or demographics. Another example may be searching existing digital assets, such as stored videos or images, for data that may pertain to a current project.
Index and search
To search and access stored backups, the backup application must provide an index for data objects (files) that fit certain parameters, since these backup applications typically store data in a proprietary format. Most of these applications were designed to provide fast backups while minimizing the amount of storage consumed. But some backup software providers, such as CommVault and Hewlett-Packard, include more advanced search and archive capabilities.
CommVault collects backup and archive data using a single-pass process, storing data in a repository. Backup and archive data are cataloged in one index, and users can search across all data from a single screen. While this capability is most often deployed for compliance reasons, it can also facilitate data access for business analysis.
Snapshots are a common technology used to provide faster backups and increased efficiency. But they also make data retrieval more complex, especially when potentially hundreds of snapshots are created. Today, some legacy backup vendors have added snapshot index-and-search capabilities to backup software platforms to improve data access for shops that rely heavily on snapshots as part of their data protection strategy.
Another way to search proprietary data sets is to create an external index. Companies like Index Engines have built a business supporting the legal industry with products that crawl the network and catalog unstructured data, including data stored on backup systems. The primary use case for these products is in response to e-discovery requests, where all data objects pertinent to a given legal event can be culled and made available to the court. These indexes can also enable the search and access required to support business analysis.
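To make the external-index idea concrete, here is a minimal sketch in Python. It assumes the data is reachable as ordinary files under a directory tree; commercial indexing products work differently and can read proprietary backup formats, which this sketch does not attempt. The names CatalogEntry, build_catalog and search are hypothetical, chosen only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    path: str
    extension: str
    size_bytes: int
    modified: float  # POSIX timestamp reported by the file system


def build_catalog(root):
    """Walk a directory tree and record basic metadata for every file."""
    catalog = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            catalog.append(CatalogEntry(
                path=full_path,
                extension=os.path.splitext(name)[1].lower(),
                size_bytes=info.st_size,
                modified=info.st_mtime,
            ))
    return catalog


def search(catalog, extensions=None, min_size=0):
    """Return entries that match the given extensions and minimum size."""
    return [
        entry for entry in catalog
        if (extensions is None or entry.extension in extensions)
        and entry.size_bytes >= min_size
    ]
```

A call such as search(build_catalog("/mnt/archive"), extensions={".mp4"}, min_size=1024) would list video files of at least 1 KB; the path is a placeholder. Because the catalog lives outside the data it describes, it can be queried without touching the stored objects until one of them is actually needed.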
Archiving unstructured data
Most of the data growth in recent years has come from unstructured data, especially digital content: relatively large files that are created once and seldom modified, such as images, video and audio. These data objects can represent a significant investment, so they're saved for long periods of time, often indefinitely. But they need to be accessible at the file level to support immediate use cases, such as applications driven by current events or market conditions. These characteristics -- large size, long retention time and static nature -- make these objects ideal for moving out of traditional backup systems and storing in an archive. Moving data from backup into an archive improves search and helps to support the goal of pulling additional value out of stored data.
Scale-out NAS systems that leverage high-capacity disk drives (near-line SAS) or tape can be an effective way to store digital content but still keep it accessible. Solutions like StrongBox by Crossroads Systems Inc. can incorporate archive functions within a NAS architecture to significantly lower the overall cost per gigabyte compared to traditional disk arrays. While these storage systems don't provide search functions themselves, data is still accessible via standard file formats to applications that do.
For data sets that can get exceedingly large, but must remain on disk, object storage platforms with an integrated file system or NAS gateway offer viable alternatives to traditional file storage. These are the architectures deployed by most public cloud storage providers and many enterprise private clouds. While object storage systems are used as unstructured data repositories, object-based architectures also represent a technology that can enhance the value-added processes this article focuses on.
Take a dip in a data lake
The data lake is a single-instancing concept focused on "real" big data analytics, not the simpler search and access processes identified in this article. There are many definitions for data lakes, but most agree that they're enterprise-wide data repositories designed to store data objects (using object-based architectures) in their native formats, as opposed to the format used by each particular application.
The objective is to make a company's data more useful for analysis, without requiring format conversion or normalizing. Most data lake discussions involve Hadoop-based applications, which run the analytics the data lake is meant to support and often serve as the engine for managing the data lake itself.
Copy data management
Companies face an accumulation of data copies used to support disparate applications, such as data protection, disaster recovery, test and development, business analysis and so on. Copy data systems essentially provide a single-instance repository by replacing these redundant data stores with a common storage area. There are several methods these products use, but one process is to create a golden copy of every file under management and keep it updated with incremental change tracking. This copy is then used to create virtual copies as needed for applications such as data protection and business analysis.
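The golden-copy process described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular copy data product works: it substitutes fixed-size blocks and content hashing for whatever change tracking a real system uses, and the class names GoldenCopy and VirtualCopy are hypothetical.

```python
import hashlib


class GoldenCopy:
    """Hypothetical single-instance store: each unique block is kept once,
    and an update writes only the blocks that are not already stored."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # content hash -> block bytes, stored once
        self.files = {}   # file name -> ordered list of block hashes

    def _split(self, data):
        return [data[i:i + self.block_size]
                for i in range(0, len(data), self.block_size)]

    def ingest(self, name, data):
        """Update the golden copy; return how many new blocks were stored."""
        new_blocks = 0
        hashes = []
        for block in self._split(data):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block
                new_blocks += 1
            hashes.append(digest)
        self.files[name] = hashes
        return new_blocks

    def virtual_copy(self, name):
        """Return a point-in-time view made of block references only."""
        return VirtualCopy(self, list(self.files[name]))


class VirtualCopy:
    """A read-only view that reassembles a file from shared blocks."""

    def __init__(self, store, hashes):
        self._store = store
        self._hashes = hashes

    def read(self):
        return b"".join(self._store.blocks[h] for h in self._hashes)
```

In this sketch, a virtual copy handed to a data protection job and another handed to an analysis job share the same stored blocks, so creating either one consumes almost no additional capacity. Old blocks are never deleted here, which is why an earlier virtual copy still reads correctly after the file changes; a real system would need a retention policy.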
Copy data systems are available that can store both unstructured data and databases, replacing backup altogether for the data objects they store. Since they store files in their native formats, copy data management systems can be ideal for supporting data mining and business analysis.
In the near future, new technologies that provide capabilities not available today will come online. For example, object-based storage architectures have the flexibility to support much larger and more sophisticated metadata stores than traditional file systems. This enables more data about each object to be retained within that object to support more content-specific searching and more detailed analysis.
Enhanced metadata is a capability object storage vendors are aware of and could add to their products' feature sets, but most haven't yet.
Object storage systems can support this enhanced metadata processing at scale, while maintaining storage performance. This ability is fundamental to the success object-based architectures have enjoyed to date in the cloud storage space.
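To illustrate why richer per-object metadata matters for search, consider the following minimal Python sketch. It is an in-memory toy, not the API of any object storage product; the ObjectStore class, its put and find methods, and the metadata fields (kind, project, year) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    key: str
    data: bytes
    metadata: dict = field(default_factory=dict)


class ObjectStore:
    """Hypothetical in-memory object store with per-object metadata."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        """Store an object together with arbitrary descriptive metadata."""
        self._objects[key] = StoredObject(key, data, dict(metadata))

    def find(self, **criteria):
        """Return sorted keys of objects matching every metadata criterion."""
        return sorted(
            key for key, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )
```

With objects stored this way, a query such as store.find(kind="video", project="launch") returns only the matching keys without opening a single file. The more descriptive the metadata kept with each object, the more specific such searches can be.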
Data mining is a concept that seems to resonate with companies across the board. The idea of reusing data that's been stored for another purpose (like backup) is popular indeed. Backup, as an application, touches most of the data a company creates or uses, making it an ideal data set for supporting business analysis. If the deployed backup application provides adequate search capabilities, a company may be able to extract significant value from its stored backup.
However, many organizations store large data sets outside of the backup system to enhance search capabilities and save money, especially when data objects are large or saved for long periods of time. These data sets are also stored in native file formats, making them available for search and analysis by tools or applications designed for this purpose.
BIO: Eric Slack is an analyst at Storage Switzerland, an IT analyst firm focused on storage and virtualization.