If you would like to hear my opinions about data science and engineering management, check out my bugtrackers.io interview here:
If you would like to hear me talk about how to use probabilistic data structures for easy scaling and real-time processing with Spark Streaming, here is a video from the London Hadoop User Group:
I’m going to speak at nucl.ai, the conference for Artificial Intelligence in Creative Industries, next week. It’s a great conference with a variety of topics. Check it out!
Join me on Tuesday the 21st of July at 9:45 when I talk about Data Science and Spark.
I’m going to speak at O’Reilly’s Strata + Hadoop World in London next week. Very excited as this is one of the biggest data conferences in the world.
Join me on Thursday the 7th of May at 16:15 when I talk about Spark Streaming and probabilistic data structures.
I’m giving a talk about Spark Streaming and probabilistic data structures this Monday at the London Hadoop Meetup. Sign up with the link below!
Looking again at the data science diagram – or the unicorn diagram for that matter – makes me realize they are not really addressing how a typical data science role fits into an organization. To do that we have to contrast it with two other roles: data engineer and business analyst.
What makes a data scientist different from a data engineer? Most data engineers can write machine learning services perfectly well or do complicated data transformation in code. It’s not the skill that makes them different, it’s the focus: data scientists focus on the statistical model or the data mining task at hand, data engineers focus on coding, cleaning up data and implementing the models fine-tuned by the data scientists.
What is the difference between a data scientist and a business/insight/data analyst? Data scientists can code and understand the tools! Why is that important? With the emergence of the new tool sets around data, SQL and point & click skills can only get you so far. If you can do the same in Spark or Cascading your data deep dive will be faster and more accurate than it will ever be in Hive. Understanding your way around R libraries gives you statistical abilities most analysts only dream of. On the other hand, business analysts know their subject area very well and will easily come up with many different subject angles to approach the data.
The focus of a data scientist, what I am looking for when I hire one, should be statistical knowledge and using coding skills for applied mathematics. Yes, there can be the occasional unicorn in a very senior data scientist, but I know few junior or mid-level data scientists who can surpass a data engineer in coding skills. Very few know as much about the business as a proper business analyst.
Which means you end up with something like this:
Data scientists use their advanced statistical skills to help improve the models the data engineers implement and to put proper statistical rigour on the data discovery and analysis the customer is asking for. Essentially the business analyst is just one of many customers – in mobile gaming most of the questions come from game designers and product designers – people with a subject matter expertise very few data scientists can ever reach.
But they don’t have to. Occupying the space between engineering and subject matter experts, data scientists can help both by using skills no one else has without having to be the unicorn.
If you are working as a data architect or a technical lead of a data team you are in a bit of a thankless position at the moment. You could be working at or even founding one of the many data platform startups right now. Or you could work for one of the many enterprise consultancies that provide “big data solutions”. Both would mean directly profiting from your acquired technical skills. Instead, you are working in a company that actually needs the data you provide but also doesn’t care how you get it. There is the old business metaphor of selling shovels to gold diggers instead of digging for gold yourself. I think a closer metaphor is that the other guys are logistics and you are fighting with everybody in the trenches.
The particular trench for me is free-to-play mobile gaming, which is closer to being a figurative battlefield than, say, web or B2B. You either get big or you die. No meeting goes by without people discussing performance metrics, mostly retention and ARPDAU (average revenue per daily active user). That is because the business boils down to a mathematical formula: if you have good retention and good revenue per user and your acquisition costs are low, you make a profit. If any of those is flailing, even just for a couple of days, you don’t. Fortunes can change very, very quickly. Where metrics are this important, having people who can provide the metrics accurately is key. Hence front line data science.
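To make that formula a bit more concrete, here is a minimal sketch in Python. It is a deliberate simplification and not how any particular studio models its business: it assumes revenue per install is simply ARPDAU times the expected number of active days, and that the expected number of active days is the install day plus the sum of the daily retention rates. The function names and all numbers are made up for illustration.

```python
def expected_active_days(retention_curve):
    """Expected active days per install.

    retention_curve[i] is the fraction of installs still active on day i + 1.
    Day 0 (the install day) always counts as one active day.
    """
    return 1.0 + sum(retention_curve)


def profit_per_install(retention_curve, arpdau, cost_per_install):
    """Simplified unit economics: lifetime revenue minus acquisition cost."""
    lifetime_revenue = arpdau * expected_active_days(retention_curve)
    return lifetime_revenue - cost_per_install


# Hypothetical numbers: day-1 to day-7 retention, ARPDAU and cost in dollars.
retention = [0.40, 0.30, 0.25, 0.20, 0.18, 0.16, 0.15]

cheap_users = profit_per_install(retention, arpdau=0.10, cost_per_install=0.20)
pricey_users = profit_per_install(retention, arpdau=0.10, cost_per_install=0.30)

print(f"profit per install at $0.20 acquisition cost: {cheap_users:+.3f}")
print(f"profit per install at $0.30 acquisition cost: {pricey_users:+.3f}")
```

With these made-up numbers the same game flips from profit to loss purely because acquisition got ten cents more expensive, which is why a couple of bad days in any one of the three inputs matters so much.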
The challenges you face in the trenches are of a different nature. Real-time is very, very important as everybody wants to see the impact of, say, an Apple feature right away. At the same time product managers and game designers want to crunch weeks of data to optimise, say, level difficulty. A Spark Streaming query bugs out late on a Saturday night and your inbox overflows with “What’s going on?” emails. A weekly Hadoop aggregation is delayed and a game release might slip because an A/B test could not be verified. In the trenches, the meaning within the data is much, much more important than the technology you throw at it. But it’s also very limited from a data science point of view: you do a bit of significance testing here and some revenue predictions there, but most of the statistical methods are rather simple. Not what everybody was promised when taking up data science.
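As an illustration of how simple that significance testing can be, here is a sketch of a two-sided two-proportion z-test on day-1 retention in an A/B test, using only the Python standard library. The choice of test, the function name and the numbers are my assumptions for the example; whether a z-test is the right method depends on the metric and the sample sizes.

```python
import math


def two_proportion_z_test(retained_a, installs_a, retained_b, installs_b):
    """Two-sided z-test for the difference between two retention rates.

    Returns (z, p_value). Uses the pooled proportion for the standard error,
    which is the usual choice under the null hypothesis of equal rates.
    """
    rate_a = retained_a / installs_a
    rate_b = retained_b / installs_b
    pooled = (retained_a + retained_b) / (installs_a + installs_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / installs_a + 1 / installs_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical A/B test: variant B changes level difficulty.
z, p = two_proportion_z_test(retained_a=400, installs_a=1000,
                             retained_b=450, installs_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A feature decision would then compare the p-value against whatever threshold the team agreed on before the test started.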
What does one gain being on the front lines? The data actually flows into the product every day, and what you find during data mining is important to the survival of the game or app. Features live or die with your significance test, for which you hopefully picked the correct statistical method. You could be making tools for data scientists or crunching large data sets for reports that one manager might read, maybe – but that would be less chaotic, less rushed and less fun than throwing out some data and actually watching your game going up the charts. Welcome to the trenches.