One of the benefits of coming to a greenfield job – like when I joined Mind Candy two years ago – was that you can jump several technological steps ahead as you don’t have any legacy to deal with. Essentially we could build from scratch based on lessons learned from traditional data architecture. One of the main ones was to establish a real-time path right away to avoid having to shoehorn it in afterwards. Another was to avoid physical hardware. And the most important one was to hold off on Hadoop as long as possible.
The last one might seem surprising, isn’t Hadoop the centre-piece of a data architecture? Unfortunately it creates a lots of admin overhead and it might be a full person’s (or more) workload to maintain. Not ideal in a small company where people resources are limited. AWS S3 can fulfil most of the storage function but requires no maintenance and is largely fast enough. Also while HDFS is important and will probably come back for us soon, MR1 or YARN is just not – there are better and more advanced execution systems that can use HDFS and we used one of those: Mesos.
Mesos is a universal execution engine for job and resource distribution. Unlike YARN it can not only run Spark but also Cassandra, Kafka, Docker containers and recently also HDFS. That works because Mesos just offers resources and let the framework handle the starting and management of the jobs. This finally breaks the link between framework and execution engine: in Mesos you can run not only different frameworks but different versions of the same framework. No more waiting for your infrastructure to upgrade to the latest Hadoop or Spark version, you can run it right now even when all your other jobs run on older versions. Combine that with a robust architecture and simple upgrading and Mesos can easily be seen as the successor to YARN (for more details on why Mesos beats Yarn, see Dean Wampler’s talk from Strata).
For the real-time path the obvious processing solution is Spark Streaming (so we have a simpler code base) running on Mesos with Kafka to feed data in and with Cassandra to store the results. You now have a so-called SMACK stack (Spark Mesos Apache Cassandra Kafka) for data processing which the Mesos folks call Mesosphere Infinity for some reason (aka marketing).
The last bit of a data architecture is the SQL engine. Traditionally this was Hive but we all know Hive is slow. While there are several open-source solutions out there that improve on good old Hive (Impala, Spark SQL) in the end we decided on AWS Redshift. It’s a column-oriented SQL-based data warehouse with PostgreSQL interface which fulfils most of the data analysis and data science needs while being reasonably fast and relatively easy to maintain with few people.
The resulting architecture looks like the above picture. We have an event receiver and a enricher/validate/cleaner, which were written in-house in Scala/Akka and are relatively simple programs using AWS SQS as a transport channel. The data is then send to Kafka and S3. Spark uses data straight from S3 to aggregate and put the processed data back into either Redshift or S3. On the real-time side of things we have Kafka going into Spark Streaming with an output into Cassandra.
What can be improved here? HDFS is still better than S3 for certain large scale jobs and we want to bring it back running on Mesos. Redshift could be replaced with Spark SQL hopefully soon. All in all the switch from tightly coupled Hadoop to an open architecture based on Mesos allowed us to have an unprecedented freedom as to which kind of data jobs we want to run and which frameworks to use, allowing a small team to do data processing in ways previous only possible on a large budget.
I frequently get into a discussion which goes something like this: I have this high-cpu service which would just work fine and fast on physical hardware but I’m running up the capacity of my VM. Why don’t we just run it on the physical hardware?
The first answer would be that virtual hardware has too many benefits to just easily go back to the physical world. VMs can be replicated, copied and deployed at will. If you also have a decent tool kit to deploy multiple VMs with respective configurations at once on demand, you can replace and fail over quickly. If you have a physical server, you need at least a fail over server – which will also be physical. You also have to pay a guy to fix it if it breaks. Is it really worth the additional money and time?
The second answer, and I think the more important one, is that you should really ask yourself if you are trying to solve a scale problem the wrong way. Sure, I know the occasional optimised C++ service with constant demand happily working for years on one server. But are you sure the demand will not increase in 3 month, 6 month time, beyond the physical capacity of the server? Are you then starting to beef up the server, spending more time and money? Maybe it’s time to look into a distributed architecture that can scale across multiple VMs and you solve the scaling problem before it become big and costly.
At a recent CloudCamp I had a discussion about data retention in the cloud, the argument was that the size of “big data” would be significantly reduced if you delete the unimportant/unnecessary/trivial data.
Problem 1: The Filtering Job
If you want to avoid collecting any unimportant data, it has to be filtered when coming in. If that would be an easy job, some companies would not use big data solutions – it would be less cost and resource intensive to just put it into a SQLDB. One of the reasons that it is necessary to work with cloud and big data solutions is that it is easier/less resource consuming to process the data later when you want to analyse it then when you receive it.
Problem 2: The Purging Job
If you can’t reasonably filter that data, how about purging it? All boils down to storage cost vs purging cost. If it is simple and effective to purge, you could have done it via filtering. It it’s not, you either have to spend precious resources for purging calculation or hire people to evaluate and purge data. Either way, it’s most likely more expensive than some more hard drives.
Problem 3: The Future
What is unimportant? What do you not need? If you think just about now, it might be an easy questions. But requirements might change, Data needs to be reprocessed in a different light. Your company might do something completely different with the data in a year (happens more often than you think). So why delete something you might need in the future?