Kevin Schmidt – ML at Instagram, former CTO Century Tech, former head data cruncher at Mind Candy and Last.fm, Speaker at O'Reilly Strata and Nucl.AI. My writing is also at https://medium.com/@KevinSchmidtBiz

Docker in Production

By now we all know and have worked with Docker a bit. It’s perfect for creating a consistent environment. Running several of them at the same time is easy due to the isolated networking and file system. If you have a set of proper environments (Integration, Staging and Production), then it gets a bit more complicated as one has to integrate Docker with existing services: logging, monitoring, backup as well as deal with deployment and rolling updates.

Logging

Logging is probably the easiest to deal, as you have a variety of options supported by Docker, notably Syslog, JSON file, Fluentd, GELF and various cloud provider services (AWS, GC, etc.). If you have an existing logging infrastructure, like the typical setup of Logstash forwarding to ElasticSearch for Kibana visualisation, you can connect Docker containers directly to Logstash using GELF – just select the correct logging driver and GELF config when starting the container and activate the Logstash plugin for GELF. Sounds easy and for simple programs, it works quite well. But what if you have separate error and access logs? Or log particular sections of the app into different log files? Docker logging by default only captures console logging so to capture multiple logs or file logging you either have to do post-processing in Logstash or mount a logging volume into your container and then pick up the logging files with a typical file forwarder like Filebeat. Which way to pick depends on the complexity of the application and your setup.

Monitoring

Starting containers is easy, keeping them running is harder. Monitoring Docker containers have become much easier: most monitoring solutions now have built-in Docker support or support it via plugins. So at least you get decent alerts when your containers are failing. Standard process monitoring also works on processes inside Docker containers – after all, they are not really separate, like in a VM, but just in a jail, easily noticeable if you just run a ps on your docker host server. Exposed ports are also just ports on the host system. Internal ports or network connection between Docker containers are not easily monitored as the network stack is isolated. In this case one would either have to a) expose a rest status call that returns the status of those ports and connections to the monitoring system or b) start up a docker container with the only purpose of monitoring – both non-trivial solutions.

Keeping Docker containers running can be done in two ways: if you have applications that occasional but regularly crash (e.g. Node), Docker supports the --restart parameter for docker run. The options are always, unless-stopped, on-failure:max-retry. I consider always and unless-stopped both dangerous – a constantly restarting container (due to persistent error) can’t fulfil its function and might not be noticed by any monitoring solution. The on-failure:max-retry is much nicer as long as the max-retry value is picked reasonably well. I currently run most containers at on-failure:5. Note that the container has to exit with an exit code different than 0 for this to be triggered.

If your container tends to fail for long periods of time, e.g. because of an external database outage, it might be more prudent to leave the restarting to your configuration management system or clustering solution.

Backups

Container themselves should not be backed up – that’s against the spirit of containers as disposable environments. But some Docker applications produce data or are databases themselves. You would then mount the respective data directory from the host and then use your trusted backup system of choice to back up from the host.

Alternatively, you can mount NFS/SMB/EFS volumes in Docker directly using the --volume-driver options and respective plugins (http://netshare.containx.io/). I haven’t yet used that in production, but it looks promising.

The third option is to run a Docker container that mounts the volume of another container and runs a backups program. I found that honestly more hassle than just mounting it in on the host, but it is an option.

Deployment

The biggest issues with Docker and supporting multiple environments (Production/Staging/Integration) is configuration and updates, let’s look at configuration first, updates second.

Configuration is naturally different between Integration and Production, different databases, different micro-services to connect to, etc. I tried out four different approaches here that all have their benefits and downsides.

First and simplest is to configure via environment variables when the container starts. That can easily be handled during deployment and works well for a limited set of config options but can get out of hand quickly if you have a service requiring lots of configuration and some frameworks need a config file and won’t be happy with just environment variables.

Second, you can have different docker containers for different environments. That’s easily done by using the layered nature of Docker containers. We start with a base container (say Scala runtime), let the developer add the compiled code and libraries to run the app and lastly DevOps adds the config file for the respective environment. That can be easily organised with naming standards: scala, scala-myapp-1.0, scala-myapp-1.0-prod-16ba001. The first is the base runtime container; the second is the app binary container build from the base container and properly versioned as 1.0. The last one is the final container build with the proper configuration for prod and the git hash of the prod configuration attached – so you know what config you have deployed. This works very well in practice, and the only downside is that you have to have a rather complicated build process to do all those steps.

The third option is to use a distributed service discovery and configuration like Consul, etcd or Zookeeper. That allows your container service to connect directly to the service and download their respective config programmatically which means that the container just needs an environment variable to know if it is in e.g. Staging. I used this setup only with Consul so far, but it works quite well and makes the release process much simpler. The downside is that you have to have a discovery service running without interruption and also make it available for developers otherwise the container won’t start during development. Also, same as in the first option, some frameworks need config files and won’t be happy with distributed variables.

Fourth and so far last option is just to mount the config directory to the host and lot your configuration management system of choice (I use SaltStack) handle config files. That works well for all framework but creates overhead for the config management system: it now needs to know what specific containers are running to deploy the config and also restart the container to pick up the new config.

Updates

You want to update your containers one by one to a new version of the docker images to avoid downtime. Again, the configuration system with built-in Docker support (again SaltStack) can handle that for you by running different docker host servers on different update schedules or roll out to half the servers manually. That can be quite cumbersome, and that’s where Docker clustering solutions like Mesos, Kubernetes or Swarm come in. In Mesos using the Marathon deployer, you can schedule a rolling update of your container with a minimal safe capacity of containers to keep running during an upgrade. That works very well in practice so far not a single hiccup during upgrades in over a year of running on Mesos. How one runs Docker on those clustering solutions in practice would be a bit much for his article so I will explain this in a follow-up.

In summary – thanks to increased Docker support in and with logging, monitoring and configuration management systems – running Docker in production is not particularly hard as long as the limitations of containers are taken into account.

Next Generation Data Architecture

One of the benefits of coming to a greenfield job – like when I joined Mind Candy two years ago – was that you can jump several technological steps ahead as you don’t have any legacy to deal with. Essentially we could build from scratch based on lessons learned from traditional data architecture. One of the main ones was to establish a real-time path right away to avoid having to shoehorn it in afterwards. Another was to avoid physical hardware. And the most important one was to hold off on Hadoop as long as possible.

The last one might seem surprising, isn’t Hadoop the centre-piece of a data architecture? Unfortunately it creates a lots of admin overhead and it might be a full person’s (or more) workload to maintain. Not ideal in a small company where people resources are limited. AWS S3 can fulfil most of the storage function but requires no maintenance and is largely fast enough. Also while HDFS is important and will probably come back for us soon, MR1 or YARN is just not – there are better and more advanced execution systems that can use HDFS and we used one of those: Mesos.

Mesos is a universal execution engine for job and resource distribution. Unlike YARN it can not only run Spark but also Cassandra, Kafka, Docker containers and recently also HDFS. That works because Mesos just offers resources and let the framework handle the starting and management of the jobs. This finally breaks the link between framework and execution engine: in Mesos you can run not only different frameworks but different versions of the same framework. No more waiting for your infrastructure to upgrade to the latest Hadoop or Spark version, you can run it right now even when all your other jobs run on older versions. Combine that with a robust architecture and simple upgrading and Mesos can easily be seen as the successor to YARN (for more details on why Mesos beats Yarn, see Dean Wampler’s talk from Strata).

For the real-time path the obvious processing solution is Spark Streaming (so we have a simpler code base) running on Mesos with Kafka to feed data in and with Cassandra to store the results. You now have a so-called SMACK stack (Spark Mesos Apache Cassandra Kafka) for data processing which the Mesos folks call Mesosphere Infinity for some reason (aka marketing).

The last bit of a data architecture is the SQL engine. Traditionally this was Hive but we all know Hive is slow. While there are several open-source solutions out there that improve on good old Hive (Impala, Spark SQL) in the end we decided on AWS Redshift. It’s a column-oriented SQL-based data warehouse with PostgreSQL interface which fulfils most of the data analysis and data science needs while being reasonably fast and relatively easy to maintain with few people.

The resulting architecture looks like the above picture. We have an event receiver and a enricher/validate/cleaner, which were written in-house in Scala/Akka and are relatively simple programs using AWS SQS as a transport channel. The data is then send to Kafka and S3. Spark uses data straight from S3 to aggregate and put the processed data back into either Redshift or S3. On the real-time side of things we have Kafka going into Spark Streaming with an output into Cassandra.

What can be improved here? HDFS is still better than S3 for certain large scale jobs and we want to bring it back running on Mesos. Redshift could be replaced with Spark SQL hopefully soon. All in all the switch from tightly coupled Hadoop to an open architecture based on Mesos allowed us to have an unprecedented freedom as to which kind of data jobs we want to run and which frameworks to use, allowing a small team to do data processing in ways previous only possible on a large budget.

Interview with yours truly on Bugtrackers

If you would like to hear my opinions about data science and engineering management, check out my bugtrackers.io interview here:

https://www.bugtrackers.io/interview-mindcandy-kevin-schmidt

Probabilistic Data Structures and Spark Streaming Talk

If you like to hear me talk about how to use probabilistic data structures for easy scaling and real-time processing with Spark Streaming, here is a video from the London Hadoop User Group:

https://www.youtube.com/watch?v=yXDpPHvJMT0&index=1&list=PL5OOLwV_m9vYluhSk_7XF4nHpI5W75B3N

Speaking at nucl.ai in Vienna next week

I’m going to speak at nucl.ai, the conference for Artificial Intelligence in Creative Industries, next week. It’s a great conference with a variety of topics, check it out!

Join me on Tuesday the 21th of July at 9:45 when I talk about Data Science and Spark.

http://events.nucl.ai/

Speaking at O’Reilly’s Strata in London next week

I’m going to speak at O’Reilly’s Strata + Hadoop World in London next week. Very excited as this is one of the biggest data conferences in the world.

Join me on Thursday the 7th of May at 16:15 when I talk about Spark Streaming and probabilistic data structures.

http://strataconf.com/big-data-conference-uk-2015/public/schedule/detail/39816

Speaking at the London Hadoop Meetup on Monday

I’m giving a talk about Spark Streaming and probabilistic data structures this Monday at the London Hadoop Meetup. Sign up with the link below!

https://www.meetup.com/hadoop-users-group-uk/

Data Engineer vs Data Scientist vs Business Analyst

Looking again at the data science diagram – or the unicorn diagram for that matter – makes me realize they are not really addressing how a typical data science role fits into an organization. To do that we have to contrast it with two other roles: data engineer and business analyst.

What makes a data scientist different from a data engineer? Most data engineers can write machine learning services perfectly well or do complicated data transformation in code. It’s not the skill that makes them different, it’s the focus: data scientists focus on the statistical model or the data mining task at hand, data engineers focus on coding, cleaning up data and implementing the models fine-tuned by the data scientists.

What is the difference between a data scientist and a business/insight/data analyst? Data scientists can code and understand the tools! Why is that important? With the emergence of the new tool sets around data, SQL and point & click skills can only get you so far. If you can do the same in Spark or Cascading your data deep dive will be faster and more accurate than it will ever be in Hive. Understanding your way around R libraries gives you statistical abilities most analysts only dream of. On the other hand, business analysts know their subject area very well and will easily come up with many different subject angles to approach the data.

The focus of a data scientist, what I am looking for when I hire one, should be statistical knowledge and using coding skills for applied mathematics. Yes, there can be the occasional unicorn in a very senior data scientist, but I know few junior or mid-level data scientist who can surpass a data engineer in coding skills. Very few know as much about the business as a proper business analyst.

Which means you end up with something like this:

Data scientists use their advanced statistical skills to help improve the models the data engineers implement and to put proper statistical rigour on the data discovery and analysis the customer is asking for. Essentially the business analyst is just one of many customers – in mobile gaming most of the questions come from game designers and product designers – people with a subject matter expertise very few data scientists can ever reach.

But they don’t have to. Occupying the space between engineering and subject matter experts, data scientists can help both by using skills no one else has without having to be the unicorn.

Data Science in the Trenches

If you are working as a data architect or a technical lead of a data team you are in a bit of thankless position at the moment. You could be working at or even founding one of the many data platform startups right now. Or work for the many enterprise consultancies that provide “big data solutions”. Both would mean directly profiting from you acquired technical skills. Instead, you are working in a company that actually needs the data you provide but also doesn’t care how you get it. There is the old business metaphor of selling shovels to gold diggers instead of digging for gold yourself. I think a closer metaphor is that the other guys are logistics and you are fighting with everybody in the trenches.

The particular trench for me is free-to-play mobile gaming which is closer to being a figurative battle field than say web or B2B. You either get big or you die. There is no meeting that goes by without people discussing performance metrics, mostly retention and ARPDAU. Because the business boils down to a mathematical formula: if you have a good retention and a good revenue per user and your acquisition costs are low you make a profit. If either of those is flailing, even just for a couple of days, you don’t. Fortunes can change very very quickly. Where metrics are this important, having people who can provide the metrics accurately is key. Hence front line data science.

The challenges you face in the trenches are of different nature. Real-time is very very important as everybody wants to see the impact of say an Apple feature right away. At the same time product managers and game designers want to crunch weeks of data to optimise say level difficulty. Spark Streaming query bugging out late night on Saturday and your inbox is overflowing with “What’s going on?” emails. Delays in a weekly Hadoop aggregation and a game release might be delayed as an A/B test could not be verified. In the trenches, the meaning within the data is much much more important than the technology you throw at it. But it’s also very limited from a data science point of you: you do a bit of significance testing here and a some revenue predictions there but most of the statistical methods are rather simple. Not what everybody was promised when taking up data science.

What does one gain being on the front lines? The data actually flows into the product every day, what you find during data mining is important to the survival of the game or app. Features live or die with your significance test which you hopefully picked the correct statistical method for. You could be making tools for data scientist or crunching large data sets for reports that one manager might read maybe – but that would be less chaotic, less rushed and less fun than throwing out some data and actually watching your game going up the charts. Welcome to the trenches.

Some thoughts on Bitcoin as a long-term currency

I have been watching the development of Bitcoin since it started and found it fascinating. From a technical point of view the idea of cryptocurrency is intriguing. Even if perfect security can not be guaranteed, I’m not worried about Bitcoin’s long-term prospects from an algorithmic point of view. I think the real problems are economical.

I don’t doubt, as some have, that Bitcoin is a proper currency. Economics has a pretty low bar for a currency: 1) is a store of value, 2) accepted for transactions, 3) is a accounting unit, 4) in common use within a territory. If you read this carefully, you notice that cigarettes in the territory of a prison could also be considered a currency. The important difference to make is between private and public currencies.

There is a reason we run our economy on public currencies aka a currency controlled by a central bank usually bound to a set of fixed rules. It’s mostly lessons learned from economic history. Private currencies have a really bad track record in matters of stability. Before the American Federal Reserve was established in 1913, most currencies in the US were issued by private banks and therefore closer to private currencies. As those currencies were only influenced by the market, there value fluctuated wildly (by modern standards) and banking crisis were a regular occurrence (total of 6 in the US between 1880 and 1913). Public currencies were established to give stability by market intervention – meaning the central bank countering upwards or downwards trends in demand – and they have in most regards been successful.

Bitcoin behaves very much like a private currency, it’s value is shaped mostly by market forces. There seemed to be a common misconception that only amount of currency units determine value of the currency (which seemed to have been one of the reason Bitcoin was created the way it is). In reality currency value is determined by both demand and supply. The supply of Bitcoins is effectively limited based on mining. Demand on the other hand can vary wildly based on the state of the economy or e.g. events on the stock market. The same factors are behind the large fluctuations in gold price over the last decades. And that was also the problem for the private currencies of the late 19th century: they were mostly based on gold and completely demand driven. Binding yourself to a fix supply increase like gold mining or crypto-algorithms is no guarantee for stability.

The fluctuation in value will make it hard to use Bitcoin for transactions, a problem commonly called currency risk. If you think about the typical larger business operating on contracts where delivery of goods and payment can be month apart, a large short-term fluctuation can be devastating. As the supply of Bitcoins will stop at some point in the future, demand is going to be the main factor for value. While value against other currencies is determined by market forces, long term value is determined between a person using Bitcoin as his primary currency and the economy he is part of. An economy usually increases in value by at least a couple of percent each year, Bitcoin itself will stay the same amount. That is great for the person as long as he has no debt in Bitcoins as debt would increase over time in relation to the economy as a whole. A business, which usually depends on debt to operate would not be willing to take a loan in Bitcoins. This is the real danger of deflation. It’s not a death sentence but will prevent Bitcoin from spreading throughout the economy.

In summary, I can currently not see how Bitcoin would be usable as a day to day currency, even a private one. It will very likely survive as a value store as it is well suited for that compared to say gold.