Piping hot - scikit-learn's Pipelines

As data scientists, we repeatedly need to perform tasks that follow a set sequence. A typical example is: extract the data, pre-process it, and then train a model. Pipeline lets us run these steps sequentially through a simple interface. In other words, a Pipeline enables us to chain scikit-learn's transformers and estimators into a single unit.

If you haven’t heard of transformers or estimators before now, transformers are used for data preparation tasks such as cleaning, dimensionality reduction or feature extraction. They take your dataset and return a changed (transformed) version of it. Estimators implement both fit() and predict() on the dataset. An example will make this clearer.
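As a minimal sketch of the difference, the snippet below uses StandardScaler as the transformer and LogisticRegression as the estimator. Both choices, and the tiny made-up dataset, are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data invented for this illustration: 4 samples, 2 features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 1, 1])

# A transformer learns something from the data (fit) and returns
# a changed copy of it (transform); fit_transform does both.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# An estimator learns from the data (fit) and then makes predictions.
model = LogisticRegression()
model.fit(X_scaled, y)
predictions = model.predict(X_scaled)
```

Here the scaler only changes the data, while the model is the piece that produces predictions.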

Say we want to build a Pipeline for data that needs to be normalised in a pre-processing stage before the model is built. This is illustrated by the flow diagram below. The extracted data is fed into the pre-processing stage, which normalises it before passing the output to the model building stage.
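The flow just described could be sketched roughly as follows. The iris dataset, the step names "normalise" and "model", and the choice of LogisticRegression are assumptions made for the example, not part of the original setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# "Extract" stage: a small dataset bundled with scikit-learn stands in
# for whatever data source you actually use.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, object) pair; the names are arbitrary labels.
pipe = Pipeline([
    ("normalise", StandardScaler()),               # pre-processing stage
    ("model", LogisticRegression(max_iter=1000)),  # model building stage
])

# fit() runs the steps in order: the scaler is fitted and applied,
# then its output is used to train the model.
pipe.fit(X_train, y_train)

# predict() and score() push new data through the same steps.
test_predictions = pipe.predict(X_test)
accuracy = pipe.score(X_test, y_test)
```

Because the scaler sits inside the Pipeline, it is fitted on the training data only and then reused unchanged on the test data.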

In some cases, the pre-processing stage might require several steps that should run in parallel. Using the example above, say we have both categorical and numerical variables in our dataset. We may want to encode the categorical variables and normalise the numerical ones. The FeatureUnion class helps us achieve this. FeatureUnion combines several transformer objects into a new transformer that combines their outputs. In our example, the output from the extract stage is fed into each of the Normalisation and Encoding stages, and their outputs are combined (by column) and fed into the build model stage (see flow diagram below).
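One way to sketch this parallel flow is shown below. FeatureUnion hands the same input to every branch, so each branch here starts with a small column-selection step. The helper functions select_numeric and select_categorical, the column names, and the toy data are all hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Toy data invented for this illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30000, 45000, 80000, 62000, 52000, 39000],
    "city": ["Lagos", "London", "Lagos", "Paris", "London", "Paris"],
})
y = [0, 0, 1, 1, 1, 0]


def select_numeric(frame):
    """Hypothetical helper: keep only the numerical columns."""
    return frame[["age", "income"]]


def select_categorical(frame):
    """Hypothetical helper: keep only the categorical column."""
    return frame[["city"]]


# Branch 1: pick the numerical columns, then normalise them.
numeric_branch = Pipeline([
    ("select", FunctionTransformer(select_numeric)),
    ("normalise", StandardScaler()),
])

# Branch 2: pick the categorical column, then one-hot encode it.
categorical_branch = Pipeline([
    ("select", FunctionTransformer(select_categorical)),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# FeatureUnion runs both branches on the same input and stacks
# their outputs side by side (by column).
features = FeatureUnion([
    ("numeric", numeric_branch),
    ("categorical", categorical_branch),
])

pipe = Pipeline([
    ("features", features),
    ("model", LogisticRegression()),
])
pipe.fit(df, y)

# Inspect the combined feature matrix: 2 scaled columns + 3 city columns.
combined = features.fit_transform(df)
```

scikit-learn also ships sklearn.compose.ColumnTransformer, which handles the per-column routing itself; the version above stays with FeatureUnion to match the description.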

Overall, using a Pipeline means one can keep making changes and improvements to the script without having to worry about keeping the steps in sequence. One ‘downside’ is that scikit-learn’s API expects numpy arrays and, even if you feed it a dataframe, it outputs a numpy array. As of today, there are some attempts at integrating pandas with sklearn.
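The dataframe-in, array-out behaviour can be seen in a short sketch like the one below, assuming scikit-learn's default output settings. The column names and values are made up, and wrapping the result back into a dataframe by hand is just one possible workaround.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy dataframe invented for this illustration.
df = pd.DataFrame({"height": [1.5, 1.7, 1.9], "weight": [50.0, 70.0, 90.0]})

# With default settings, a dataframe goes in and a plain numpy array
# comes out, so the column names and index are lost.
scaled = StandardScaler().fit_transform(df)
is_plain_array = isinstance(scaled, np.ndarray)

# One manual workaround: rebuild the dataframe around the array.
scaled_df = pd.DataFrame(scaled, columns=df.columns, index=df.index)
```

Newer scikit-learn releases (1.2 onwards) also offer set_output(transform="pandas") on transformers, which is worth checking against the version you have installed.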

Useful resources:

http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

https://signal-to-noise.xyz/post/sklearn-pipeline/

Ayodeji Akiwowo