At a recent CloudCamp I had a discussion about data retention in the cloud. The argument was that the volume of “big data” would shrink significantly if you deleted the unimportant/unnecessary/trivial data.
Problem 1: The Filtering Job
If you want to avoid collecting unimportant data, it has to be filtered as it comes in. If that were an easy job, many companies would not need big data solutions – it would be cheaper and less resource-intensive to just put the data into a SQL database. One of the reasons to work with cloud and big data solutions is that it is easier and less resource-consuming to process the data later, when you want to analyse it, than at the moment you receive it.
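To make that trade-off concrete, here is a minimal sketch in Python. Everything in it (event types, the “keep” rules, the file name) is a made-up assumption for illustration; it only contrasts filtering at ingest time, where you must already know what matters, with storing raw events and deciding at analysis time.

```python
import json

RAW_EVENTS = [
    {"type": "click", "user": 1, "ts": 1000},
    {"type": "heartbeat", "user": 1, "ts": 1001},  # "trivial"... until it isn't
    {"type": "purchase", "user": 2, "ts": 1002},
]

# Option A: filter on ingest -- requires deciding *now* what is unimportant.
def ingest_filtered(events, keep_types=("click", "purchase")):
    return [e for e in events if e["type"] in keep_types]

# Option B: store everything raw, decide later what to look at (schema-on-read).
def ingest_raw(events, path="events.jsonl"):
    with open(path, "a") as f:
        for e in events:
            f.write(json.dumps(e) + "\n")

def analyse(path="events.jsonl", wanted="heartbeat"):
    # The "unimportant" heartbeat events are still available if a new
    # question (e.g. an uptime analysis) turns up later.
    with open(path) as f:
        events = [json.loads(line) for line in f]
    return [e for e in events if e["type"] == wanted]

if __name__ == "__main__":
    print(ingest_filtered(RAW_EVENTS))  # heartbeats are gone forever
    ingest_raw(RAW_EVENTS)
    print(analyse())                    # heartbeats can still be analysed
```

The filtering function is trivial here; in practice the hard part is writing and maintaining the `keep_types` rule before you know which questions you will ask later.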
Problem 2: The Purging Job
If you can’t reasonably filter that data, how about purging it? It all boils down to storage cost vs. purging cost. If purging were simple and effective, you could have done it via filtering. If it’s not, you either have to spend precious compute on deciding what to delete or hire people to evaluate and purge the data. Either way, it’s most likely more expensive than a few more hard drives.
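A rough back-of-envelope sketch of that comparison, in Python. Every number below (storage price, review cost per GB, retention horizon, survival ratio) is an assumption picked for illustration; the point is the shape of the calculation, not the figures.

```python
# Back-of-envelope: keep everything vs. review-and-purge.
# All numbers are assumptions for illustration only.

DATA_TB = 100                 # total data volume in terabytes
STORAGE_PER_GB_MONTH = 0.02   # assumed object-storage price, USD per GB per month
MONTHS = 36                   # retention horizon
PURGE_REVIEW_PER_GB = 0.50    # assumed compute/people cost to decide what to delete
PURGE_KEEP_RATIO = 0.4        # fraction of data that survives the purge

gb = DATA_TB * 1024

cost_keep_everything = gb * STORAGE_PER_GB_MONTH * MONTHS
cost_purge = (gb * PURGE_REVIEW_PER_GB
              + gb * PURGE_KEEP_RATIO * STORAGE_PER_GB_MONTH * MONTHS)

print(f"Keep everything : ${cost_keep_everything:,.0f}")
print(f"Purge then store: ${cost_purge:,.0f}")
# With these assumed numbers, the review cost alone outweighs the storage saved;
# purging only wins if deciding what to delete is nearly free.
```

Swap in your own prices and the conclusion may flip, but the structure stays the same: purging only pays off when the cost of deciding what to delete stays well below the storage you save.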
Problem 3: The Future
What is unimportant? What do you not need? If you think only about now, it might be an easy question. But requirements change, and data may need to be reprocessed in a different light. Your company might do something completely different with the data in a year (this happens more often than you think). So why delete something you might need in the future?