We started Elastacloud eight years ago by focussing on High Performance Computing on Microsoft Azure. HPC is normally the purview of Researchers, Engineers and Risk Calculators in finance, but it is good for so much more.
Over the next few posts I'll break down the problem domain and explain why I didn't use Big Data technologies to solve it.
- We have several tens of thousands of files an hour being copied into Azure Blob Storage through a series of feeds
- Each feed arrives in the Common Event Format (CEF), which needs to be converted into something that can be processed and ingested into a Data Warehouse (a minimal parsing sketch follows this list)
- Each file is gzipped and, when inflated, can be anywhere between 1 KB and 200 MB
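To make the input concrete, here's a minimal sketch of inflating one of these files with the standard GZipStream and splitting each record on the pipe delimiters the CEF header uses. The field layout follows the published CEF spec; the class name, file handling and output are my own illustration rather than the production code.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Illustrative only: inflate a gzipped CEF file and split each record into
// its pipe-delimited header fields. Real feeds need more care (escaped pipes,
// extension key=value parsing), but the shape of the work is the same.
static class CefReader
{
    public static void Dump(string path)
    {
        using (var file = File.OpenRead(path))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // CEF:Version|Device Vendor|Device Product|Device Version|Signature ID|Name|Severity|Extension
                var parts = line.Split(new[] { '|' }, 8);
                if (parts.Length < 8) continue;   // skip malformed records

                var name = parts[5];
                var severity = parts[6];
                var extension = parts[7];         // space-separated key=value pairs
                Console.WriteLine($"{name} (severity {severity}): {extension}");
            }
        }
    }
}
```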
From the outset this is not a case for Big Data technologies. Even though modern frameworks like Azure HDInsight and Azure Databricks would make it easy to stand up, it's not a good use of resources. Here's why:
- Gzip doesn't scale well with Hadoop or Spark: it isn't a splittable compression format, so each file can only be read by a single task
- Thousands of files need to be read in a short space of time, and at load time this doesn't scale well with IO, especially on an hourly schedule
- Each file needs heavy processing to produce a new intermediate format which can be ingested
The third point is especially relevant because the cost of that processing doesn't scale well on smaller Spark clusters.
In the few short posts that follow I'll build a story of how we can use:
- Azure Data Factory
- Azure Batch
- Parquet.NET
- Azure SQL Data Warehouse
This is a programmer's solution. Nothing comes for free, so there is some work to do and good programming practice that needs to be in play.
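To give a flavour of that work, here's a minimal sketch of writing a couple of converted records out as Parquet with Parquet.NET. It assumes the 3.x column-based API (signatures vary a little between versions), and the column names are made up for illustration:

```csharp
using System.IO;
using Parquet;
using Parquet.Data;

// Illustrative only: two hypothetical columns pulled out of converted CEF records.
var nameField = new DataField<string>("eventName");
var severityField = new DataField<int>("severity");
var schema = new Schema(nameField, severityField);

using (var stream = File.Create("events.parquet"))
using (var writer = new ParquetWriter(schema, stream))
using (var rowGroup = writer.CreateRowGroup())
{
    // Parquet is columnar, so values are written a column at a time.
    rowGroup.WriteColumn(new DataColumn(nameField, new[] { "probe", "scan" }));
    rowGroup.WriteColumn(new DataColumn(severityField, new[] { 3, 7 }));
}
```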
We'll be learning:
- How to get around the limitations of Azure Data Factory and Custom Activities
- How to use Azure Batch to enable pure linear scalability of file conversion (a minimal task-per-file sketch follows this list)
- How to use Batch applications to build reusable, reproducible and versioned deployments
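As a taster of the Azure Batch point above, here's a minimal sketch of the task-per-file idea using the Microsoft.Azure.Batch SDK. The account details, job id and converter command line are placeholders, and it assumes a pool and job have already been created:

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

// Illustrative only: queue one conversion task per gzipped CEF blob.
static class BatchSubmitter
{
    public static void SubmitConversionTasks(IEnumerable<string> blobNames)
    {
        var credentials = new BatchSharedKeyCredentials(
            "https://<account>.<region>.batch.azure.com", "<account>", "<key>");

        using (var batchClient = BatchClient.Open(credentials))
        {
            var tasks = new List<CloudTask>();
            var i = 0;
            foreach (var blob in blobNames)
            {
                // Each task inflates one file and writes the converted output back to
                // blob storage; "CefToParquet.exe" stands in for the real converter.
                tasks.Add(new CloudTask($"convert-{i++}", $"CefToParquet.exe {blob}"));
            }

            // Tasks spread across the pool's nodes, so throughput grows
            // roughly linearly as nodes are added.
            batchClient.JobOperations.AddTask("cef-conversion-job", tasks);
        }
    }
}
```

Because each file is converted by an independent task, adding nodes to the pool is what buys the linear scalability mentioned above.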
In the next post we'll be looking at extending Azure Data Factory Custom Activities to achieve scale and throughput, allowing the conversion of several GB of files in 10-20 minutes.