Big Data and Parquet (the Microsoft way)

The Azure data platform is awash with ways of querying data. Whether the data is relational, NoSQL or unstructured, you generally want to minimise the movement of data at the point of querying. At Elastacloud we're big fans of Microsoft's Azure Data Lake Store (ADLS), which offers us a highly performant way of storing data securely, with the capability to encrypt, audit and set authorization policies on directories. It also has an HDFS (or WebHDFS) interface, which allows it to be used by Apache Hadoop or Apache Spark.
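To give a feel for the WebHDFS interface mentioned above, here is a minimal Python sketch that only builds the REST URL for a WebHDFS-style call. It assumes the `https://<account>.azuredatalakestore.net/webhdfs/v1/` endpoint pattern used by ADLS (Gen1); the helper name `adls_webhdfs_url`, the account name and the path are made up for illustration. It sends no request, and a real call would also need an OAuth bearer token.

```python
from urllib.parse import quote, urlencode


def adls_webhdfs_url(account: str, path: str, op: str = "LISTSTATUS") -> str:
    """Build a WebHDFS-style REST URL for an ADLS account (hypothetical helper).

    Assumes the <account>.azuredatalakestore.net/webhdfs/v1/ endpoint pattern.
    """
    # Percent-encode the path, keeping the slashes that separate directories.
    clean_path = quote(path.strip("/"), safe="/")
    # The WebHDFS operation (e.g. LISTSTATUS, GETFILESTATUS, OPEN) goes in the query string.
    query = urlencode({"op": op})
    return f"https://{account}.azuredatalakestore.net/webhdfs/v1/{clean_path}?{query}"


# "myaccount" and "/data/events" are placeholder values.
print(adls_webhdfs_url("myaccount", "/data/events"))
```

Tools such as Hadoop and Spark speak this interface for you, so you would rarely build these URLs by hand; the sketch is only meant to show what sits underneath.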

One way of querying data is through Azure Data Lake Analytics (ADLA), which lets you write queries in a SQL-like language called U-SQL. It's not hugely popular at Elastacloud due to its design and performance constraints, but it seems popular amongst our customers due to its simplicity. As such, we decided to plug the gap and build a Parquet "Extractor" for it. You can now read Parquet files using U-SQL!

Pop on over to our GitHub repository, https://github.com/elastacloud/parquet-usql, and follow the instructions to deploy the Parquet.Adla library to your local or remote store. Then follow the samples for "Outputter" and "Extractor". Unfortunately, ADLA is not currently easy to debug or test, but if you have specific issues, reach out and we'll try to get them fixed! Have fun!