data reduction

Contributor(s): Brien Posey
This definition is part of our Essential Guide: Complete guide to backup deduplication

Data reduction is the process of minimizing the amount of data that needs to be stored in a data storage environment. Data reduction can increase storage efficiency and reduce costs.

Data reduction can be achieved using several different types of technologies. The best-known data reduction technique is data deduplication, which eliminates redundant data on storage systems. The deduplication process typically occurs at the storage block level. The system analyzes the storage to find duplicate blocks, keeps a single copy of each and removes the redundant copies. The remaining block is shared by every file that requires a copy of it. If an application attempts to modify a shared block, the block is copied before the modification is applied (an approach commonly called copy-on-write), so other files that depend on the block can continue to use the unmodified version, thereby avoiding file corruption.
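The block-sharing and copy-before-modify behavior described above can be illustrated with a small sketch. This is a toy in-memory model under stated assumptions, not any vendor's implementation: the `DedupStore` class, its method names and the four-byte block size are all hypothetical, and SHA-256 is used here only as one plausible way to fingerprint blocks.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small so the example is easy to follow


class DedupStore:
    """Toy block store (hypothetical): identical blocks are kept once and shared."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> block bytes
        self.refcount = {}  # fingerprint -> number of file references
        self.files = {}     # file name -> ordered list of fingerprints

    def _put(self, block):
        # Store the block only if its fingerprint is new, then count the reference.
        fp = hashlib.sha256(block).hexdigest()
        if fp not in self.blocks:
            self.blocks[fp] = block
            self.refcount[fp] = 0
        self.refcount[fp] += 1
        return fp

    def _release(self, fp):
        # Drop one reference; free the block when nothing points at it.
        self.refcount[fp] -= 1
        if self.refcount[fp] == 0:
            del self.refcount[fp]
            del self.blocks[fp]

    def write_file(self, name, data):
        old = self.files.get(name, [])
        self.files[name] = [
            self._put(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)
        ]
        for fp in old:
            self._release(fp)

    def modify_block(self, name, index, new_block):
        # The shared block is never changed in place: the file is pointed at
        # a new block, so other files keep reading the original version.
        old_fp = self.files[name][index]
        self.files[name][index] = self._put(new_block)
        self._release(old_fp)

    def read_file(self, name):
        return b"".join(self.blocks[fp] for fp in self.files[name])


store = DedupStore()
store.write_file("a.txt", b"AAAABBBB")
store.write_file("b.txt", b"AAAACCCC")
unique_before = len(store.blocks)        # "AAAA" is stored once and shared
store.modify_block("a.txt", 0, b"ZZZZ")  # b.txt still sees "AAAA"
```

In this sketch, four logical blocks are written but only three are stored, and changing the first block of `a.txt` leaves `b.txt` intact.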

Some storage arrays track which blocks are the most heavily shared. Those blocks that are shared by the largest number of files may be moved to a memory- or flash storage-based cache so they can be read as efficiently as possible.
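As a rough illustration of how the most heavily shared blocks might be picked for a cache, the sketch below counts how many files reference each block and returns the top entries. The function name `hottest_blocks`, the `file_block_map` structure and the block labels are assumptions made for the example; real arrays use their own metadata and placement policies.

```python
from collections import Counter


def hottest_blocks(file_block_map, cache_slots):
    """Return the IDs of the blocks referenced by the largest number of files."""
    share_counts = Counter()
    for block_ids in file_block_map.values():
        # Use a set so a file counts at most once per block.
        share_counts.update(set(block_ids))
    return [block_id for block_id, _ in share_counts.most_common(cache_slots)]


# Hypothetical mapping of files to the deduplicated blocks they reference.
file_block_map = {
    "vm1.img": ["os", "libs", "app1"],
    "vm2.img": ["os", "libs", "app2"],
    "vm3.img": ["os", "app3"],
}

cached = hottest_blocks(file_block_map, cache_slots=2)
```

Here the `os` block is shared by three files and `libs` by two, so those two blocks would be the candidates for the faster tier.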

Other data reduction techniques

Video: Taneja Group analyst Mike Matchett explains the benefits of compression and deduplication.

While data deduplication is probably the most common data reduction technique, it is not the only viable one. Data archiving and data compression can also reduce the amount of data that has to be stored on primary storage systems.

Data compression reduces the size of a file by encoding redundant information more compactly so that less disk space is required. Storage systems that support compression natively apply algorithms designed to identify repeated patterns in the data and represent them with fewer bits. With lossless compression, the original data can be reconstructed exactly when it is read.
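A short sketch using Python's standard `zlib` module shows the effect of redundancy on compression. It is only an illustration of lossless compression in general, not of how any particular storage system compresses data; the sample payloads are invented for the example.

```python
import os
import zlib

# Highly repetitive data: the same short record repeated many times.
original = b"status=OK;" * 1000
compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)

# Random bytes contain almost no redundancy, so they barely shrink (if at all).
random_data = os.urandom(10_000)
compressed_random = zlib.compress(random_data, level=9)

print(len(original), "->", len(compressed), "bytes (repetitive data)")
print(len(random_data), "->", len(compressed_random), "bytes (random data)")
```

The repetitive payload shrinks dramatically and decompresses to exactly the original bytes, while the random payload stays roughly the same size, which is why compression savings depend heavily on the kind of data being stored.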

Archiving data also reduces data on storage systems, but the approach is quite different. Rather than reducing data within files or databases, archiving removes older, infrequently accessed data from expensive storage and moves it to low-cost, high-capacity storage. Archive storage can be disk, tape or cloud based.
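One simple way to think about an archiving policy is as a filter on last-access time. The sketch below is a hypothetical example: the `select_for_archive` function, the one-year threshold and the file names are assumptions, and a real archiving product would also handle the actual data movement, retention rules and retrieval.

```python
from datetime import datetime, timedelta


def select_for_archive(last_access, now, max_idle_days):
    """Return the names of items not accessed within max_idle_days, sorted."""
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(name for name, accessed in last_access.items() if accessed < cutoff)


# Hypothetical catalog of files and when each was last accessed.
now = datetime(2015, 9, 1)
last_access = {
    "q1_report.pdf": datetime(2013, 4, 2),
    "budget.xlsx": datetime(2015, 8, 20),
    "old_logs.tar": datetime(2012, 11, 15),
}

to_archive = select_for_archive(last_access, now, max_idle_days=365)
```

With these sample dates, the two files untouched for more than a year are selected for the low-cost tier, while the recently used spreadsheet stays on primary storage.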

Data reduction for primary storage

Although data deduplication is often associated with the backup process and secondary storage, it is possible to deduplicate primary storage. Primary storage deduplication can occur as a function of the storage hardware or operating system. Windows Server 2012 and Windows Server 2012 R2, for instance, have built-in data deduplication capabilities. The deduplication engine uses post-processing deduplication, which means deduplication does not occur in real time. Instead, a scheduled process periodically deduplicates primary storage data.
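The idea of post-processing deduplication can be sketched as a volume that accepts writes without inspecting them and deduplicates later in a separate job. This is a conceptual model only and makes no claim about how the Windows Server engine is implemented; the `PostProcessVolume` class and its methods are hypothetical.

```python
import hashlib


class PostProcessVolume:
    """Toy model (hypothetical): writes land as-is, a later job deduplicates them."""

    def __init__(self):
        self.raw = []     # blocks written but not yet deduplicated
        self.unique = {}  # fingerprint -> block, filled in by the dedup job
        self.layout = []  # ordered fingerprints of already-deduplicated blocks

    def write(self, block):
        # Fast path: no hashing or lookup happens at write time.
        self.raw.append(block)

    def run_dedup_job(self):
        # Stands in for the scheduled task: fingerprint each pending block
        # and keep only one copy of each distinct block.
        for block in self.raw:
            fp = hashlib.sha256(block).hexdigest()
            self.unique.setdefault(fp, block)
            self.layout.append(fp)
        self.raw.clear()

    def stored_blocks(self):
        return len(self.raw) + len(self.unique)

    def read_all(self):
        return [self.unique[fp] for fp in self.layout] + list(self.raw)


vol = PostProcessVolume()
for block in [b"AAAA", b"BBBB", b"AAAA", b"AAAA"]:
    vol.write(block)

before = vol.stored_blocks()  # duplicates still occupy space
vol.run_dedup_job()
after = vol.stored_blocks()   # only distinct blocks remain
```

The trade-off shown here is that writes stay cheap, but the space savings appear only after the job has run, so the volume temporarily needs room for the duplicate data.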

Primary storage deduplication is a common feature of many all-flash storage systems. Because flash storage is expensive, deduplication is used to make the most of flash storage capacity. Also, because flash storage offers such high performance, the overhead of performing deduplication has less of an impact than it would on a disk system.

This was last updated in September 2015


Join the conversation



What are your thoughts on whether improvements in storage capacity make data reduction techniques obsolete?
Isn't there a corollary to Parkinson's Law about "Data expands to fill the space available"? Goodness knows no matter how many bookshelves I get, they always end up full, and data seems to be the same way.
I tend to think that having a constraint (in this case, a limited amount of data space) helps drive organizations to place priorities on what they choose to keep and what they choose to archive. Removing or expanding those constraints will just provide more incentive to fill the empty spaces without addressing the problem of making sure you are storing and acting on what is relevant. Much like getting more closet space doesn't mean you have a tidier closet, it typically means you end up with reasons to keep more things that won't be used. 
Our biggest challenge is finding a way to evaluate the data we're keeping. This requires much more processing power. I surmise you're going to spend the money on either storage or on the in-house technology to process the increased amounts of data.
Shouldn't this mention that some kinds of data reduction techniques are vendor-specific, which means if you use them you won't be able to read the data with other vendors' products?

