data reduction

Contributor(s): Johnny Yu and Brien Posey

Data reduction is the process of reducing the amount of capacity required to store data. Data reduction can increase storage efficiency and reduce costs. Storage vendors often describe storage capacity in terms of raw capacity and effective capacity. Raw capacity is the physical space in the system, while effective capacity is the amount of data the system can hold after reduction is applied.
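The relationship between raw and effective capacity can be sketched as simple arithmetic. This is a minimal illustration, assuming the reduction ratio is expressed as a single number such as 4.0 for 4:1; the function names and figures are hypothetical, and real systems also lose some raw capacity to overhead that is ignored here.

```python
def reduction_ratio(logical_bytes: int, stored_bytes: int) -> float:
    """Ratio of data as the host sees it to data actually written to media."""
    if stored_bytes <= 0:
        raise ValueError("stored_bytes must be positive")
    return logical_bytes / stored_bytes


def effective_capacity_tb(raw_tb: float, ratio: float) -> float:
    """Logical data a system could hold if the given reduction ratio holds."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return raw_tb * ratio


# Hypothetical figures: 400 GB of logical data occupying 100 GB on disk.
ratio = reduction_ratio(400, 100)           # 4.0, i.e. 4:1
capacity = effective_capacity_tb(100, ratio)  # 100 TB raw behaves like 400 TB
```

The achieved ratio depends heavily on the data itself, so an effective capacity figure is an estimate rather than a guarantee.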

Data reduction can be achieved in several ways. The main types are data deduplication, compression and single-instance storage. Data deduplication, also known as data dedupe, eliminates redundant segments of data on storage systems. It stores each redundant segment only once and uses that one copy whenever a request is made to access that piece of data. Data dedupe is more granular than single-instance storage, which works on whole files. Single-instance storage finds files such as email attachments sent to multiple people and stores only one copy of each file. As with dedupe, single-instance storage replaces duplicates with pointers to the one saved copy.
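The difference in granularity can be sketched in a few lines. This is an illustrative sketch rather than any vendor's implementation: it assumes fixed-size blocks identified by SHA-256 digests, and the names DedupeStore and single_instance_bytes are hypothetical.

```python
import hashlib


class DedupeStore:
    """Block-level dedupe: each unique block is stored once; files hold pointers."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes (the single saved copy)
        self.files = {}   # file name -> ordered list of digests (pointers)

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # keep only the first copy
            pointers.append(digest)
        self.files[name] = pointers

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


def single_instance_bytes(files):
    """Single-instance storage: only byte-identical whole files are collapsed."""
    unique = {}
    for data in files.values():
        unique.setdefault(hashlib.sha256(data).hexdigest(), len(data))
    return sum(unique.values())


# Two files that share their first two blocks, plus an exact copy of the first.
files = {"a": b"AAAABBBBCCCC", "b": b"AAAABBBBDDDD", "c": b"AAAABBBBCCCC"}
store = DedupeStore(block_size=4)  # tiny block size, for the example only
for name, data in files.items():
    store.write(name, data)
```

In this example the 36 logical bytes shrink to 24 with single-instance storage, because only the exact copy is collapsed, and to 16 with block-level dedupe, because the shared blocks inside the two different files are also stored once.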

Some storage arrays track which blocks are the most heavily shared. Those blocks that are shared by the largest number of files may be moved to a memory- or flash storage-based cache so they can be read as efficiently as possible.
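One way to picture this tracking is a reference count per block. The sketch below assumes a hypothetical mapping from file names to the block IDs they point to, and a cache with a fixed number of slots; real arrays use their own metadata structures and cache policies.

```python
from collections import Counter


def hottest_blocks(file_block_map, cache_slots):
    """Return the IDs of the blocks shared by the largest number of files."""
    refs = Counter()
    for block_ids in file_block_map.values():
        refs.update(set(block_ids))  # count each file once per block
    return [block_id for block_id, _ in refs.most_common(cache_slots)]


# Hypothetical layout: block "b1" is shared by three files, "b2" by two.
layout = {
    "f1": ["b1", "b2"],
    "f2": ["b1", "b3"],
    "f3": ["b1", "b2"],
}
to_cache = hottest_blocks(layout, cache_slots=2)  # ["b1", "b2"]
```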

Data compression is a data reduction method by which files are shrunk at the bit level. It is accomplished natively in storage systems using algorithms or formulas that identify and remove redundant bits, reducing the number of bits needed to represent the data. This is usually done by representing a repeating string of bits with a smaller string of bits and using a dictionary to convert between them.
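The dictionary idea can be illustrated with LZW, one well-known dictionary-based algorithm. This is a teaching sketch only: it works on bytes rather than individual bits for readability, it emits a list of integer codes instead of a packed bit stream, and storage systems may use different algorithms entirely.

```python
from typing import List


def lzw_compress(data: bytes) -> List[int]:
    """Replace repeating byte strings with shorter dictionary codes."""
    dictionary = {bytes([i]): i for i in range(256)}  # all single bytes
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate  # keep extending the known string
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code  # learn the new string
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes: List[int]) -> bytes:
    """Rebuild the same dictionary on the fly and expand the codes."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            entry = previous + previous[:1]  # code defined by this very step
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out)


sample = b"ABABABABABABABAB"
packed = lzw_compress(sample)
restored = lzw_decompress(packed)
```

Because the sample is highly repetitive, the compressor emits fewer codes than there are input bytes, and decompression returns the original data exactly. Data with little repetition compresses poorly or not at all.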

Common techniques of data reduction

There are also ways to reduce the amount of data that has to be stored without actually shrinking the sizes of blocks and files. These techniques include thin provisioning and data archiving.

Thin provisioning allocates storage space dynamically, as data is written. This method keeps reserved space just a little ahead of actual written space, leaving more unreserved space available to other applications. Traditional thick provisioning allocates a fixed amount of storage space as soon as a disk is created, regardless of whether that entire capacity will be filled.
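The contrast can be modeled with a shared pool and two volume types. This is a simplified sketch under stated assumptions: the class names, the block-count units and the chunk size used to grow a thin volume are all hypothetical, and real systems add monitoring so a pool does not run out when thin volumes fill up.

```python
class StoragePool:
    """Shared physical capacity, measured in blocks."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.reserved_blocks = 0

    def reserve(self, count):
        if self.reserved_blocks + count > self.total_blocks:
            raise RuntimeError("pool exhausted")
        self.reserved_blocks += count

    @property
    def free_blocks(self):
        return self.total_blocks - self.reserved_blocks


class ThickVolume:
    """Reserves its full size as soon as it is created."""

    def __init__(self, pool, size_blocks):
        pool.reserve(size_blocks)
        self.size_blocks = size_blocks
        self.data = {}

    def write(self, block_no, payload):
        if not 0 <= block_no < self.size_blocks:
            raise IndexError("block outside volume")
        self.data[block_no] = payload


class ThinVolume:
    """Reserves space in small chunks, just ahead of what has been written."""

    def __init__(self, pool, size_blocks, chunk=4):
        self.pool = pool
        self.size_blocks = size_blocks
        self.chunk = chunk
        self.reserved = 0
        self.data = {}

    def write(self, block_no, payload):
        if not 0 <= block_no < self.size_blocks:
            raise IndexError("block outside volume")
        if block_no not in self.data and len(self.data) + 1 > self.reserved:
            grow = min(self.chunk, self.size_blocks - self.reserved)
            self.pool.reserve(grow)  # may raise if the pool is overcommitted
            self.reserved += grow
        self.data[block_no] = payload


pool = StoragePool(total_blocks=100)
thick = ThickVolume(pool, size_blocks=40)  # 40 blocks gone immediately
thin = ThinVolume(pool, size_blocks=40)    # nothing reserved yet
thin.write(0, b"x")                        # first write reserves one chunk
```

After these steps the thick volume holds 40 blocks of the pool while the thin volume holds only 4, even though both present the same 40-block size to their users.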

[Figure: Differences between thin and thick provisioning]

Archiving data also reduces data on storage systems, but the approach is quite different. Rather than reducing data within files or databases, archiving removes older, infrequently accessed data from expensive storage and moves it to low-cost, high-capacity storage. Archive storage can be on disk, tape or cloud.
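A basic archiving pass might look like the sketch below. It assumes both tiers are ordinary directories, that "infrequently accessed" means a last-access time older than a chosen number of days, and that the file system records access times reliably, which is not always the case. The function name and parameters are hypothetical.

```python
import shutil
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def archive_cold_files(primary_dir, archive_dir, max_idle_days, now=None):
    """Move files not accessed within max_idle_days to the archive directory."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    primary = Path(primary_dir)
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(primary.iterdir()):
        # st_atime is the last access time reported by the file system
        if path.is_file() and path.stat().st_atime < cutoff:
            shutil.move(str(path), str(archive / path.name))
            moved.append(path.name)
    return moved
```

Calling `archive_cold_files("/data/primary", "/data/archive", 365)` with these placeholder paths would move every file untouched for a year and return the list of names it moved. A production tool would also need to handle subdirectories, name collisions and a way to recall archived files.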

Data reduction for primary storage

Although data deduplication was first developed for backup data on secondary storage, it is possible to deduplicate primary storage. Primary storage deduplication can occur as a function of the storage hardware or operating system (OS). Windows Server 2012 and Windows Server 2012 R2, for instance, have built-in data deduplication capabilities. The deduplication engine uses post-processing deduplication, which means deduplication does not occur in real time. Instead, a scheduled process periodically deduplicates primary storage data.
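Post-processing deduplication can be sketched as a write path that does no duplicate detection, followed by a separate job that runs later. This is a conceptual model only and does not represent the Windows Server implementation; the class name, the in-memory block map and the use of SHA-256 are assumptions made for illustration.

```python
import hashlib


class PostProcessVolume:
    """Writes land as-is; a scheduled job removes duplicates afterward."""

    def __init__(self):
        self.block_map = {}  # logical block number -> physical block ID
        self.physical = {}   # physical block ID -> bytes
        self._next_id = 0

    def write(self, lbn, payload):
        # No hashing on the write path, so writes are not slowed down.
        self.physical[self._next_id] = payload
        self.block_map[lbn] = self._next_id
        self._next_id += 1

    def read(self, lbn):
        return self.physical[self.block_map[lbn]]

    def run_dedup_job(self):
        """Point duplicate blocks at one copy; return how many blocks were freed."""
        keepers = {}  # digest -> physical ID of the copy to keep
        for lbn, pid in sorted(self.block_map.items()):
            digest = hashlib.sha256(self.physical[pid]).digest()
            self.block_map[lbn] = keepers.setdefault(digest, pid)
        live = set(self.block_map.values())
        stale = [pid for pid in self.physical if pid not in live]
        for pid in stale:
            del self.physical[pid]
        return len(stale)


volume = PostProcessVolume()
for lbn, payload in enumerate([b"A", b"B", b"A", b"A"]):
    volume.write(lbn, payload)
blocks_before = len(volume.physical)  # 4: duplicates are still on disk
freed = volume.run_dedup_job()        # the periodic job runs
blocks_after = len(volume.physical)   # 2: one copy each of A and B
```

The trade-off shown here is that duplicates consume space until the job runs, in exchange for a write path with no deduplication overhead.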

[Video: Taneja Group analyst Mike Matchett explains the benefits of compression and deduplication.]

Primary storage deduplication is a common feature of many all-flash storage systems. Because flash storage is expensive, deduplication is used to make the most of flash storage capacity. Also, because flash storage offers such high performance, the overhead of performing deduplication has less of an impact than it would on a disk system.

This was last updated in December 2018

Join the conversation


What are your thoughts on whether improvements in storage capacity make data reduction techniques obsolete?
Isn't there a corollary to Parkinson's Law about "Data expands to fill the space available"? Goodness knows no matter how many bookshelves I get, they always end up full, and data seems to be the same way.
I tend to think that having a constraint (in this case, a limited amount of data space) helps drive organizations to place priorities on what they choose to keep and what they choose to archive. Removing or expanding those constraints will just provide more incentive to fill the empty spaces without addressing the problem of making sure you are storing and acting on what is relevant. Much like getting more closet space doesn't mean you have a tidier closet, it typically means you end up with reasons to keep more things that won't be used. 
Whether it's finding ways to maximize storage space or finding ways to sift through an ocean of data, an organization will be investing resources regardless. Given the explosion of secondary data many organizations are experiencing, it would be good to be able to do both.
Our biggest challenge is finding a way to evaluate the data we're keeping. This requires much more processing power. I surmise you're going to spend the money on either storage or on the in-house technology to process the increased amounts of data.
Shouldn't this mention that some kinds of data reduction techniques are vendor-specific, which means if you use them you won't be able to read the data with other vendors' products?