You've gotta comprehend | elastacloud-channels

One of the things I often find myself doing is collecting data from external sources to get at feeds of data such as weather, pricing, scheduling etc... Often these sources have been around for a while and the world of APIs has moved on, meaning we often find data in massively nested structures, XML formats and some other weird formats.

One source of data I've been working on lately returns the data column-wise, where each field then lists all of the row values in an array, for example.

{
    "column1": [1.1, 2.2, 3.3, 4.4],
    "column2": ["a", "b", "c", "d"]
}

I'm not going to argue if this is the right way or not, it makes sense in some situations and not others, it's all about perspective. But one of the useful things with this data is that I can guarantee that there will always be the same number of values.

The data is being collected using Azure Functions in a function written in C#, and through various manipulation gets into a good structure for use elsewhere, including being written out to Parquet using Parquet .NET. But one of the things I wanted to do was to call the API quickly and produce a reasonable output so that I could view it using Excel (or using the CSV preview extension for Visual Studio Code).

To me the obvious solution was to break out Python and use the requests library to make the API call, then use the Pandas DataFrame from_dict method to create the data frames, and use Python dictionary comprehension to produce the dictionaries.

In the case of this API the response is in JSON format which makes getting the data from the API a lot easier as we can just grab the bit of the response we're interested in. Then it's pretty much a one-liner to produce the data frame.

df = pd.DataFrame.from_dict({k: pd.Series(data[k]) for k in sorted(data.keys()) if k != 'Excluded column'})

At this point I was sorting the keys so that I could keep the output consistent.

Just to keep things interesting I decided to output the data to an Excel workbook. This took a little more code and I wanted to produce a file using the current date as the filename.

writer = pd.ExcelWriter('{}.xlsx'.format(datetime.now().strftime('%Y%m%d')))
df.to_excel(writer, 'Sheet 1', index=False, freeze_panes=(1, 0))
writer.save()

I did try this in C# to start with but for one-off exploration it's hard to argue with doing things using this few lines of code compared to the significantly more complex C# code.