Spark abstracts the idea of a schema from us by enabling us to read a directory of files which can be similar or identical in schema. A key characteristic is that a superset schema is needed on many occasions. Spark will infer the schema automatically for timestamps, dates, numeric and string types across all of the various data providers including parquet.
This is the first in a series of posts where I'm going to review features that we've copied into Parquet.NET. This post is about automatic schema merges. To understand what this is we'll look at the basic behaviour and then we'll fall back to Set theory.
Spark abstracts the idea of a schema from us by enabling us to read a directory of files which can be similar or identical in schema. A key characteristic is that a superset schema is needed on many occassions. Spark will infer the schema automatically for timestamps, dates, numeric and string types across all of the various data providers including parquet.
For example, this will load all product data into memory:
val product_details = spark.read.parquet("/parquet/product_*").persist()
However, one drawback of this is that the schema read can be wrong given that the schema is inferred on a scan of the first n rows only so for larger files it may not pick up other product schemas.
The alternative is to be able to coerce the schema so that the values are automatically combined.
val df = spark.read.option("mergeSchema", "true").parquet("/parquet/product_1", "/parquet/product_2")
df.printSchema()
In set theory this is a Union operation which allows us to values such as 1,2,3,4 in Set 1 and combine with values 2,2,3,4,5 in Set 2. The Union will give us a unique combination of 1,2,3,4,5.
Parquet.NET now allows for a more explicit union without worrying about schema merger. It will automatically pick out similarities and differences between the schemas and combine them.
var ds1 = new DataSet(schema1);
var ds2 = new DataSet(schema2);
var ds3 = ds1.Merge(ds2);
Look for it formally in release 1.3 but it's floating around in one of the Alpha releases at the moment.
Happy trails!