Speaking at the London Hadoop Meetup on Monday

I’m giving a talk about Spark Streaming and probabilistic data structures this Monday at the London Hadoop Meetup. Sign up with the link below!

Hadoop Users Group UK

London, GB
2,359 Hadoopers

We are the Hadoop users group for the UK based in London. We meet monthly for talks and discussion on all topics related to Hadoop. Join if you’re interested in learning what …

Next Meetup

April 2015 Meetup

Monday, Apr 13, 2015, 6:30 PM
120 Attending

Check out this Meetup Group →

Data Engineer vs Data Scientist vs Business Analyst

Looking again at the data science diagram – or the unicorn diagram for that matter – makes me realize they are not really addressing how a typical data science role fits into an organization. To do that we have to contrast it with two other roles: data engineer and business analyst.

What makes a data scientist different from a data engineer? Most data engineers can write machine learning services perfectly well or do complicated data transformation in code. It’s not the skill that makes them different, it’s the focus: data scientists focus on the statistical model or the data mining task at hand, data engineers focus on coding, cleaning up data and implementing the models fine-tuned by the data scientists.

What is the difference between a data scientist and a business/insight/data analyst? Data scientists can code and understand the tools! Why is that important? With the emergence of the new tool sets around data, SQL and point & click skills can only get you so far. If you can do the same in Spark or Cascading your data deep dive will be faster and more accurate than it will ever be in Hive. Understanding your way around R libraries gives you statistical abilities most analysts only dream of. On the other hand, business analysts know their subject area very well and will easily come up with many different subject angles to approach the data.

The focus of a data scientist, what I am looking for when I hire one, should be statistical knowledge and using coding skills for applied mathematics. Yes, there can be the occasional unicorn in a very senior data scientist, but I know few junior or mid-level data scientist who can surpass a data engineer in coding skills. Very few know as much about the business as a proper business analyst.

Which means you end up with something like this:

data science venn

Data scientists use their advanced statistical skills to help improve the models the data engineers implement and to put proper statistical rigour on the data discovery and analysis the customer is asking for. Essentially the business analyst is just one of many customers – in mobile gaming most of the questions come from game designers and product designers – people with a subject matter expertise very few data scientists can ever reach.

But they don’t have to. Occupying the space between engineering and subject matter experts, data scientists can help both by using skills no one else has without having to be the unicorn.

Data Science in the Trenches

If you are working as a data architect or a technical lead of a data team you are in a bit of thankless position at the moment. You could be working at or even founding one of the many data platform startups right now. Or work for the many enterprise consultancies that provide “big data solutions”. Both would mean directly profiting from you acquired technical skills. Instead, you are working in a company that actually needs the data you provide but also doesn’t care how you get it. There is the old business metaphor of selling shovels to gold diggers instead of digging for gold yourself. I think a closer metaphor is that the other guys are logistics and you are fighting with everybody in the trenches.

The particular trench for me is free-to-play mobile gaming which is closer to being a figurative battle field than say web or B2B. You either get big or you die. There is no meeting that goes by without people discussing performance metrics, mostly retention and ARPDAU. Because the business boils down to a mathematical formula: if you have a good retention and a good revenue per user and your acquisition costs are low you make a profit. If either of those is flailing, even just for a couple of days, you don’t. Fortunes can change very very quickly. Where metrics are this important, having people who can provide the metrics accurately is key. Hence front line data science.

The challenges you face in the trenches are of different nature. Real-time is very very important as everybody wants to see the impact of say an Apple feature right away. At the same time product managers and game designers want to crunch weeks of data to optimise say level difficulty. Spark Streaming query bugging out late night on Saturday and your inbox is overflowing with “What’s going on?” emails. Delays in a weekly Hadoop aggregation and a game release might be delayed as an A/B test could not be verified. In the trenches, the meaning within the data is much much more important than the technology you throw at it. But it’s also very limited from a data science point of you: you do a bit of significance testing here and a some revenue predictions there but most of the statistical methods are rather simple. Not what everybody was promised when taking up data science.

What does one gain being on the front lines? The data actually flows into the product every day, what you find during data mining is important to the survival of the game or app. Features live or die with your significance test which you hopefully picked the correct statistical method for. You could be making tools for data scientist or crunching large data sets for reports that one manager might read maybe – but that would be less chaotic, less rushed and less fun than throwing out some data and actually watching your game going up the charts. Welcome to the trenches.

Some thoughts on Bitcoin as a long-term currency

I have been watching the development of Bitcoin since it started and found it fascinating. From a technical point of view the idea of cryptocurrency is intriguing. Even if perfect security can not be guaranteed, I’m not worried about Bitcoin’s long-term prospects from an algorithmic point of view. I think the real problems are economical.

I don’t doubt, as some have, that Bitcoin is a proper currency. Economics has a pretty low bar for a currency: 1) is a store of value, 2) accepted for transactions, 3) is a accounting unit, 4) in common use within a territory. If you read this carefully, you notice that cigarettes in the territory of a prison could also be considered a currency. The important difference to make is between private and public currencies.

There is a reason we run our economy on public currencies aka a currency controlled by a central bank usually bound to a set of fixed rules. It’s mostly lessons learned from economic history. Private currencies have a really bad track record in matters of stability. Before the American Federal Reserve was established in 1913, most currencies in the US were issued by private banks and therefore closer to private currencies. As those currencies were only influenced by the market, there value fluctuated wildly (by modern standards) and banking crisis were a regular occurrence (total of 6 in the US between 1880 and 1913). Public currencies were established to give stability by market intervention –  meaning the central bank countering upwards or downwards trends in demand – and they have in most regards been successful.

Bitcoin behaves very much like a private currency, it’s value is shaped mostly by market forces. There seemed to be a common misconception that only amount of currency units determine value of the currency (which seemed to have been one of the reason Bitcoin was created the way it is). In reality currency value is determined by both demand and supply. The supply of Bitcoins is effectively limited based on mining. Demand on the other hand can vary wildly based on the state of the economy or e.g. events on the stock market. The same factors are behind the large fluctuations in gold price over the last decades. And that was also the problem for the private currencies of the late 19th century: they were mostly based on gold and completely demand driven. Binding yourself to a fix supply increase like gold mining or crypto-algorithms is no guarantee for stability.

The fluctuation in value will make it hard to use Bitcoin for transactions, a problem commonly called currency risk. If you think about the typical larger business operating on contracts where delivery of goods and payment can be month apart, a large short-term fluctuation can be devastating. As the supply of Bitcoins will stop at some point in the future, demand is going to be the main factor for value. While value against other currencies is determined by market forces, long term value is determined between a person using Bitcoin as his primary currency and the economy he is part of. An economy usually increases in value by at least a couple of percent each year, Bitcoin itself will stay the same amount. That is great for the person as long as he has no debt in Bitcoins as debt would increase over time in relation to the economy as a whole. A business, which usually depends on debt to operate would not be willing to take a loan in Bitcoins. This is the real danger of deflation. It’s not a death sentence but will prevent Bitcoin from spreading throughout the economy.

In summary, I can currently not see how Bitcoin would be usable as a day to day currency, even a private one. It will very likely survive as a value store as it is well suited for that compared to say gold.

Should you run anything on physical hardware?

I frequently get into a discussion which goes something like this: I have this high-cpu service which would just work fine and fast on physical hardware but I’m running up the capacity of my VM. Why don’t we just run it on the physical hardware?

The first answer would be that virtual hardware has too many benefits to just easily go back to the physical world. VMs can be replicated, copied and deployed at will. If you also have a decent tool kit to deploy multiple VMs with respective configurations at once on demand, you can replace and fail over quickly. If you have a physical server, you need at least a fail over server – which will also be physical. You also have to pay a guy to fix it if it breaks. Is it really worth the additional money and time?

The second answer, and I think the more important one, is that you should really ask yourself if you are trying to solve a scale problem the wrong way. Sure, I know the occasional optimised C++ service with constant demand happily working for years on one server. But are you sure the demand will not increase in 3 month, 6 month time, beyond the physical capacity of the server? Are you then starting to beef up the server, spending more time and money? Maybe it’s time to look into a distributed architecture that can scale across multiple VMs and you solve the scaling problem before it become big and costly.


Should you ever delete something from the cloud?

At a recent CloudCamp I had a discussion about data retention in the cloud, the argument was that the size of “big data” would be significantly reduced if you delete the unimportant/unnecessary/trivial data.

Problem 1: The Filtering Job

If you want to avoid collecting any unimportant data, it has to be filtered when coming in. If that would be an easy job, some companies would not use big data solutions – it would be less cost and resource intensive to just put it into a SQLDB. One of the reasons that it is necessary to work with cloud and big data solutions is that it is easier/less resource consuming to process the data later when you want to analyse it then when you receive it.

Problem 2: The Purging Job

If you can’t reasonably filter that data, how about purging it? All boils down to storage cost vs purging cost. If it is simple and effective to purge, you could have done it via filtering. It it’s not, you either have to spend precious resources for purging calculation or hire people to evaluate and purge data. Either way, it’s most likely more expensive than some more hard drives.

Problem 3: The Future

What is unimportant? What do you not need? If you think just about now, it might be an easy questions. But requirements might change, Data needs to be reprocessed in a different light. Your company might do something completely different with the data in a year (happens more often than you think). So why delete something you might need in the future?