
Navigate the chaos of ephemeral data storage

You may hear chatter that data is too ephemeral to go through the trouble of backing up and archiving it, but Jon Toigo says it's a worthy exercise.


I get extremely perturbed when I hear otherwise intelligent developers talk about data backup and archive as though they were an old man's game. Increasingly, I'm told that storage infrastructure today is best designed "flat and frictionless" rather than tiered with data migration or copying.

When data is no longer accessed or updated, the flat folks argue, it should be quiesced on the disk or SSD where it resides -- just power the drive down and you're set. You can always stand up a new node next to the powered-down one and continue processing, so there is no need to reclaim space through archiving or to duplicate data for off-site backup. Those activities are so "day before yesterday." Plus, there's the issue of ephemeral data, which we'll get into later.

The origins of these views are obvious. First and foremost, there is that entropy thing. A lot of IT folks seem to think they sound "science-y" when using theories and laws from hard sciences like physics as metaphors for stuff in their world, and the laws of thermodynamics are at the top of the list.

The truth about entropy

These IT folks interpret the laws of thermodynamics, especially the second law (that of entropy), incorrectly. To them, it means systems cannot be made more orderly -- that innate entropy, the process by which systems move to an increasingly disorganized state, prevents activities like storage tiering and data archiving from ever improving the order and optimization of data and storage. This perceived futility translates into a preference for flat infrastructure, which makes no attempt to improve orderliness or efficiency and involves no data movement or replication.


The truth of the second law of thermodynamics is that systems can maintain their orderliness in the face of entropy, or be made more orderly in spite of it, by adding energy from somewhere else. The real metaphor here is that folks need to get off their butts and do the hard work of defining the best strategies and policies for implementing the technologies needed to get the job done. The entropy excuse for doing nothing is a veiled defense of laziness.

No data left behind

Most flat storage folks may well respond that their data is too ephemeral to go to all the trouble of backup and archiving, especially those working in big data and the internet of things. Millions of data points gathered from sensors, social media posts and the like are fed into algorithms that are updated every millisecond. The raw data itself is useful, in some cases, for no more than four minutes, after which it is replaced by new input. So, after four minutes, it has no value and can, therefore, be lost.
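To make that argument concrete, here is a minimal sketch of what "useful for no more than four minutes" looks like in practice: a rolling buffer that simply ages readings out after a time-to-live and never writes them anywhere else. The class name and the 240-second window are illustrative assumptions, not any particular product's behavior.

```python
import time
from collections import deque

class EphemeralBuffer:
    """Rolling window of sensor readings; anything older than the TTL is dropped."""

    def __init__(self, ttl_seconds: float = 240.0):  # assumed four-minute useful life
        self.ttl = ttl_seconds
        self._items = deque()  # (timestamp, reading) pairs, oldest first

    def append(self, reading: float) -> None:
        self._items.append((time.time(), reading))
        self._evict()

    def current(self) -> list:
        """Return only the readings still inside the TTL window."""
        self._evict()
        return [reading for _, reading in self._items]

    def _evict(self) -> None:
        cutoff = time.time() - self.ttl
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()  # expired data is gone for good -- nothing is archived

if __name__ == "__main__":
    buf = EphemeralBuffer(ttl_seconds=240.0)
    buf.append(21.7)        # a sensor reading enters the window
    print(buf.current())    # ...and silently disappears four minutes later
```

The point the flat storage crowd draws from this pattern is that data which evaporates on its own is not worth protecting; the rest of this column questions that conclusion.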

At first glance, ephemeral data makes a case for not bothering to archive or back up, I suppose. IDC's Digital Universe report for EMC back in 2014 reflected this thinking. It said that, of the 4.4 zettabytes of data produced that year, only 37% would be useful to tag and analyze, suggesting the rest could go away and no one would be the worse for it. The report also noted that 55% of that total had no provision for backup or data protection -- and it was okay with that, too.


Truth be told, the industry has never really looked closely at the issue of ephemeral data, or at why we bother to collect it at all. Is ephemerality determined by workflow -- or, put another way, is ephemeral data simply that which does not conform to contextual analytical objectives? How do you know which ephemeral data is important now, or will be in the future?

Our flirtation with business intelligence and data mining back in the 1990s and aughts suggested there is tremendous value in keeping original source data in as close to its raw form as possible -- that is, not refined or normalized to work with a specific analytical process or tool. Why? Because experience has proven how useful it is to return to source data with new algorithms from time to time to see what we missed, such as non-intuitable relationships that exist in the data itself. You can't do that if the raw data isn't archived, or if it's missing or has been transformed in some way.
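As a rough illustration of that point, the sketch below archives records exactly as collected (append-only JSON lines, no normalization), so an analysis written long after the fact can still be replayed over the originals. The file name and the toy "new_algorithm" function are assumptions for illustration only, not a reference to any specific tool.

```python
import json
from pathlib import Path

ARCHIVE = Path("raw_sensor_archive.jsonl")  # assumed archive location

def archive_raw(record: dict) -> None:
    """Append the record untouched -- no refining, no schema rewrite."""
    with ARCHIVE.open("a") as f:
        f.write(json.dumps(record) + "\n")

def replay(analysis) -> list:
    """Re-run any analysis function over the full raw archive."""
    with ARCHIVE.open() as f:
        return [analysis(json.loads(line)) for line in f]

def new_algorithm(record: dict) -> float:
    # A relationship nobody thought to look for at collection time.
    return record["temperature"] * record.get("humidity", 1.0)

if __name__ == "__main__":
    archive_raw({"temperature": 21.7, "humidity": 0.45})
    print(replay(new_algorithm))
```

If the raw records had been discarded after their four minutes of operational usefulness, or kept only in a form refined for the original tool, that replay would be impossible.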

Storage has its limits

The flat storage folks would say that powering down a drive and sheltering data in place preserves a data asset. This, too, is poorly thought out. For one thing, advocates of this position have no reliable data about the failure rate of powered-down drives that are restarted after a month, six months or more.

Plus, there is a conceit, or a lack of appreciation of business reality, behind the idea that you can simply replace storage nodes ad infinitum. That is grad student, media lab nonsense. In the real world, there isn't sufficient budget to continuously roll out more all-flash storage or another three nodes of vSAN or whatever. Vendors love the idea of continuously scaling storage infrastructure, but even they will tell you there are fixed limits to the production capacity of the industry and to the manageability of infrastructure at scale.


I know everyone loves to hate latency, but flat infrastructure free of migration and replication voids the entire notion of efficient and intelligent storage. And, by the way, latency is rarely a product of storage I/O. It is more often caused by how raw I/O is handled by multicore CPUs using sequential I/O processing techniques born of unicore chip architectures (see "In enterprise computing systems, where's the Big Bang?").

If you are worried about how to guide data between tiers for archive or protection without using a lot of app server CPU power, think about offloading it to something like StrongLINK from StrongBox Data Solutions. I tested this platform all summer and found it is already on the right track by combining cognitive processing with any-to-any, workflow-to-storage data management designed to simplify things for the app/dev folks whose mental acuity has fallen prey to entropy. Have a look.
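For readers who want a feel for what policy-driven data movement between tiers amounts to, here is a generic sketch: walk a primary tier and relocate files untouched for a set period to an archive tier. This is an illustration of the general idea only -- it is not StrongLINK's API or behavior, and the paths, the 90-day threshold and the reliance on access times are all assumptions.

```python
import shutil
import time
from pathlib import Path

PRIMARY = Path("/mnt/primary")   # assumed primary-tier mount point
ARCHIVE = Path("/mnt/archive")   # assumed archive-tier mount point
COLD_AFTER_DAYS = 90             # assumed policy threshold

def archive_cold_files() -> None:
    """Move files not accessed within the policy window to the archive tier."""
    cutoff = time.time() - COLD_AFTER_DAYS * 86400
    for path in PRIMARY.rglob("*"):
        # st_atime is only meaningful if the filesystem tracks access times
        if path.is_file() and path.stat().st_atime < cutoff:
            target = ARCHIVE / path.relative_to(PRIMARY)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))  # copy-and-delete across filesystems

if __name__ == "__main__":
    archive_cold_files()
```

The value of a dedicated data management platform is that logic like this runs out of band, against policies and metadata, rather than burning application server CPU cycles on every move.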

About the author:
Jon William Toigo is a 30-year IT veteran, CEO and managing principal of Toigo Partners International and chairman of the Data Management Institute.
