AutoExploreR is an open-source R package for the data exploration and understanding stage of the data science life cycle. At ElastaCloud we have found this part of the process to be particularly time-consuming. A problem we often encounter when using R is that seemingly simple tasks require us either to find and install multiple packages or to write new functions as we go. With that in mind, we decided to create our own package that carries out our most common tasks in a simple but reliable way.
This post introduces some of the functionality included in AutoExploreR and shows examples of its use on a real data set. The tasks that we will showcase are:
Calculating and visualising correlations
Identifying outliers
The data used is the 'swiss' data set, which ships with a standard installation of R and gives a fertility measure and socio-economic indicators for the French-speaking provinces of Switzerland in 1888.
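For readers who want to follow along, here is a minimal base-R sketch for loading and inspecting the data. It uses only functions that come with R itself, not AutoExploreR.

```r
# Load the built-in 'swiss' data set (part of the 'datasets' package in base R).
data(swiss)

# One row per province, one column per indicator.
dim(swiss)

# Column names and types: every column is numeric.
str(swiss)

# Province names are stored as row names rather than as a column.
head(rownames(swiss))
```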
Correlations
When we have numeric data, we usually want to know quickly how the variables are related to one another. Calculating correlations is a common way to do this, but in R it can be awkward to calculate and then visualise many correlations in a larger data set without installing several packages and somehow joining their outputs together. In AutoExploreR we have developed three functions, targetCorrelation, multivariateCorrelation and autoCorrelationPlot, that make this process very easy.
The targetCorrelations function calculates the correlations between a ‘target’ variable and every other numeric variable in a data set, whereas the multivariateCorrelation function calculates the correlations between all pairs of numeric variables, automatically ignoring non-numeric data. With the output argument set to “matrix”, the result of the multivariateCorrelation function can be passed to autoCorrelationPlot to produce a correlation plot automatically, as shown below (an interactive version of the correlation plot is also available).
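To make the underlying calculation concrete, here is a rough base-R sketch of the same three steps: a full correlation matrix, the correlations against one target, and a simple plot. It deliberately does not call AutoExploreR, whose exact function signatures are not shown in this post. The variable names, the choice of Fertility as the target and the use of Pearson correlation are all assumptions made for illustration.

```r
data(swiss)

# Keep numeric columns only, mirroring the idea of ignoring non-numeric data.
numeric_cols <- vapply(swiss, is.numeric, logical(1))
swiss_num <- swiss[, numeric_cols, drop = FALSE]

# All pairwise correlations between numeric variables (Pearson is an assumption).
cor_matrix <- cor(swiss_num, method = "pearson")

# Correlations between one 'target' variable and every other variable.
target <- "Fertility"  # illustrative choice of target
target_cor <- sort(cor_matrix[target, colnames(cor_matrix) != target])
print(round(target_cor, 2))

# A plain heat map of the correlation matrix, with no reordering or rescaling.
heatmap(cor_matrix, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        main = "Correlations in the swiss data")
```

Computing the full matrix once and reading the target row from it keeps the two results consistent with each other, at the cost of calculating some correlations that a target-only analysis does not need.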
Multivariate Outliers
We also want to know whether any data points are outliers. These could be points that contain erroneous data, or points that are particularly interesting because they are so different from what is 'normal' for the data set. AutoExploreR can identify outliers in both univariate and multivariate data; here we look at the MultivariateOutlier function.
When provided with a data set, the function automatically finds the optimal parameters for the outlier detection procedure, then returns a dataframe showing the location of any outliers in the set and a plot, with reduced dimensions, highlighting them.
For the swiss data, the multivariate outlier analysis identified one outlier, Geneva, which has low fertility, very low agriculture and very high education compared to the other provinces. In this case the data point is most likely an outlier because of real differences in the socio-economic development of the provinces, rather than because the data is erroneous.
In this post, I have given a gentle introduction to a few of the capabilities of the AutoExploreR package that we at ElastaCloud have found particularly useful so far. In a following post I will discuss some of the report-generating functions of the package. Stay tuned for further posts too, as the package is under constant development!
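The general workflow described above (score each row, flag the extreme ones, then plot in fewer dimensions) can be sketched in base R. This is not the procedure AutoExploreR uses, which this post does not specify. As a stand-in, the sketch uses the classical Mahalanobis distance with a chi-squared cutoff, and principal components for the reduced-dimension plot. The 0.975 quantile and all variable names are assumptions.

```r
data(swiss)

# Squared Mahalanobis distance of each province from the multivariate centre.
centre <- colMeans(swiss)
covariance <- cov(swiss)
distance_sq <- mahalanobis(swiss, centre, covariance)

# Flag rows beyond a chi-squared quantile (0.975 is an assumed, conventional choice).
cutoff <- qchisq(0.975, df = ncol(swiss))
outlier_table <- data.frame(
  province = rownames(swiss),
  distance_sq = distance_sq,
  is_outlier = distance_sq > cutoff,
  row.names = NULL
)
flagged <- outlier_table[outlier_table$is_outlier, ]
print(flagged)

# Reduce to two dimensions with PCA on standardised variables and highlight flagged rows.
pca <- prcomp(swiss, scale. = TRUE)
scores <- pca$x[, 1:2]
plot(scores, pch = 19,
     col = ifelse(outlier_table$is_outlier, "red", "grey40"),
     main = "swiss data: first two principal components")
text(scores[outlier_table$is_outlier, , drop = FALSE],
     labels = flagged$province, pos = 3)
```

With this cutoff the Geneva row ("V. De Geneve") is among the flagged provinces. A different method or threshold may flag a different set of rows, so the output here need not match AutoExploreR's result exactly.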