Measuring the pace of change of Big Data

Last week I gave a talk to Spark London about the relatively new Microsoft Machine Learning Framework for Apache Spark, MMLSpark. Those of you who have known me for a while will know that I'm like a scratched record about Apache Spark and have dedicated a huge amount of time and effort to understanding the ever-expanding codebase.

The talk gave me a chance to reflect on my time with Spark quite a bit. It's been a five-year journey of a few contributions and ancillary development for spark-contrib, which helped me get closer to the Spark codebase. Three years ago, though, I was talking about our own distribution of Spark called Brisk and one particular aspect which we built in quite early: a metrics handler using the metrics API. A metrics sink fits in nicely here, as it allows the values being inspected to be pushed out in some form. At the time we were using Storage, but now I would probably do the same with something like Microsoft's AppInsights.

The technology itself is not the point of this post; it's actually about the gap in technology at the time. I remember it took me about 6 weeks to write a fairly bad C# MVP which ran a devops process and cobbled together a bunch of things which we had built into the distribution. That included SSH scripts to configure standalone clusters with masters and slaves, as well as the latest compiled jar files containing both the metrics handler and WASB, an HDFS proxy which used Azure Blob Storage to store all of our data (a Microsoft innovation). For orchestration we used persistent virtual machines and disk images of Ubuntu 15 that we were happy with, rather than the ever-changing gallery (some of whose images at the time had disastrous consequences for us). Everything was set up to use Azure Fluent Management, a library that Andy Cross and I had written a few years earlier, which at the time was the only way to control Azure from .NET. All of this in 6 weeks: a functioning version of Spark on Azure that several companies began using.

Microsoft took notice and were very interested in how we had incorporated things like Zeppelin and Jupyter at this early stage, and we were fairly open about our stack as a whole and how we accomplished this. We used it for many customers until, about a year and a half on, Microsoft released an early HDInsight version (which took a few iterations to get right).

I never would have suspected that I would be standing in front of an audience three and a half years later talking about Deep Learning on Spark on Azure! It's easy to forget what the cloud has brought us, but in an era of Big Data the pace of change never seems to be fast enough!

Happy trails, and take some time to reflect on the pace of change of Data Platform tooling!