Toigo Partners International
Published: 04 Nov 2016
I get extremely perturbed when I hear otherwise intelligent developers talk about data backup and archive like they are an old man's game. Increasingly, I'm told storage infrastructure today is best designed "flat and frictionless" rather than tiered with data migration or copy.
When data is no longer accessed or updated, the flat folks argue it should be quiesced on the disk or
The origins of these views are obvious. First and foremost, there is that entropy thing. A lot of IT folks seem to think they sound "science-y" when using theories and laws from hard sciences like physics as metaphors for stuff in their world, and the laws of thermodynamics are at the top of the list.
The truth about entropy
These IT folks interpret the laws of thermodynamics, especially the second law (that of entropy), incorrectly. To them, it means that systems cannot be made more orderly: innate entropy -- the process by which systems move to an increasingly disorganized state -- supposedly prevents activities like storage tiering and data archiving from contributing to the order and optimization of data and storage. This perceived futility translates into a preference for flat infrastructure, since flatness makes no attempt to improve orderliness or efficiency and involves no data movement or replication.
The truth of the second law of thermodynamics is that systems can maintain their orderliness in the face of entropy, or be made more orderly in spite of entropy by adding energy from somewhere else. The real metaphor here is that folks need to get off of their butts and do the hard work of defining the best strategies and policies for implementing the necessary technologies to get the job done. This entropy excuse for doing nothing is a veiled defense for laziness.
No data left behind
Most flat storage folks may well respond that their data is too ephemeral to go to all the trouble of backup and archiving, especially those working in big data and the internet of things. Millions of data points gathered from sensors or social media posts and so on are included in algorithms that are updated every millisecond. The raw data itself is useful for no more than four minutes, in some cases, after which it is replaced by new input. So, after four minutes, it has no value and can, therefore, be lost.
At first glance, ephemeral data makes a case for not bothering to archive or back up, I suppose. IDC's Digital Universe report for EMC back in 2014 reflected this thinking. It said that, of the 4.4 zettabytes of data produced that year, only 37% would be useful to tag and analyze, suggesting that the rest could go away and no one would be the worse for it. IDC's report also noted that for 55% of that total, there was no provision for backup or data protection -- and they were okay with that, too.
Truth be told, the industry has never really looked closely at the issue of ephemeral data, or why we bother to collect it at all. Is ephemerality determined by workflow -- or, put another way, is ephemeral data that which does not conform to contextual analytical objectives? How do you know what ephemeral data is important now or in the future?
Our flirtation with business intelligence and data mining back in the 1990s and aughts suggested there is tremendous value in keeping original source data in as close to its raw form as possible -- that is, not refined or normalized to work with a specific analytical process or tool. Why? Because experience has proven how useful it is to be able to return to source data with new algorithms from time to time to see what we missed, such as any non-intuitable relationships that might exist in the data itself. You can't do that if the raw data isn't archived and is missing or transformed in some way.
Storage has its limits
The flat storage folks would say powering down a drive and sheltering data in place preserves a data asset. This, too, is poorly thought out. For one thing, advocates of this position have no reliable data about the failure rate of powered-down drives when they are restarted after a month or six months or more.
Plus, there is a conceit or lack of appreciation of business reality behind the idea that you can simply replace storage nodes ad infinitum. That is grad student, media lab nonsense. In the real world, there isn't sufficient budget to continuously roll out more all-flash storage or another three nodes of VSAN or whatever. Vendors love the idea of continuous scaling of storage infrastructure, but even they will tell you that there are fixed limits to the production capacity of the industry and the manageability of infrastructure at scale.
I know everyone loves to hate latency, but flat infrastructure free of migration and replication voids the entire notion of efficient and intelligent storage. And, by the way, latency is rarely a product of storage I/O. It is more often caused by multicore CPUs handling raw I/O one request at a time, using sequential processing techniques born of unicore chip architecture (see "In enterprise computing systems, where's the Big Bang?").
If you are worried about how to guide data between tiers for archive or protection without using a lot of app server CPU power, think about offloading it to something like StrongLINK from StrongBox Data Solutions. I tested this platform all summer and found it is already on the right track by combining cognitive processing with any-to-any, workflow-to-storage data management designed to simplify things for the app/dev folks whose mental acuity has fallen prey to entropy. Have a look.
About the author:
Jon William Toigo is a 30-year IT veteran, CEO and managing principal of Toigo Partners International and chairman of the Data Management Institute.