it – Kevin Schmidt

I frequently get into a discussion which goes something like this: I have this high-cpu service which would just work fine and fast on physical hardware but I’m running up the capacity of my VM. Why don’t we just run it on the physical hardware?

The first answer would be that virtual hardware has too many benefits to just easily go back to the physical world. VMs can be replicated, copied and deployed at will. If you also have a decent tool kit to deploy multiple VMs with respective configurations at once on demand, you can replace and fail over quickly. If you have a physical server, you need at least a fail over server – which will also be physical. You also have to pay a guy to fix it if it breaks. Is it really worth the additional money and time?

The second answer, and I think the more important one, is that you should really ask yourself if you are trying to solve a scale problem the wrong way. Sure, I know the occasional optimised C++ service with constant demand happily working for years on one server. But are you sure the demand will not increase in 3 month, 6 month time, beyond the physical capacity of the server? Are you then starting to beef up the server, spending more time and money? Maybe it’s time to look into a distributed architecture that can scale across multiple VMs and you solve the scaling problem before it become big and costly.

At a recent CloudCamp I had a discussion about data retention in the cloud, the argument was that the size of “big data” would be significantly reduced if you delete the unimportant/unnecessary/trivial data.

Problem 1: The Filtering Job

If you want to avoid collecting any unimportant data, it has to be filtered when coming in. If that would be an easy job, some companies would not use big data solutions – it would be less cost and resource intensive to just put it into a SQLDB. One of the reasons that it is necessary to work with cloud and big data solutions is that it is easier/less resource consuming to process the data later when you want to analyse it then when you receive it.

Problem 2: The Purging Job

If you can’t reasonably filter that data, how about purging it? All boils down to storage cost vs purging cost. If it is simple and effective to purge, you could have done it via filtering. It it’s not, you either have to spend precious resources for purging calculation or hire people to evaluate and purge data. Either way, it’s most likely more expensive than some more hard drives.

Problem 3: The Future

What is unimportant? What do you not need? If you think just about now, it might be an easy questions. But requirements might change, Data needs to be reprocessed in a different light. Your company might do something completely different with the data in a year (happens more often than you think). So why delete something you might need in the future?