Operationalising DS 2: pros and cons of using R in SQL

My last post demonstrated a successful use case of executing R scripts as a stored procedure on SQL Server. The main advantage of using R in SQL Server is that is moves intelligent analytical capabilities closer to the data, thereby overcoming issues associated with big data such as moving, merging, managing and munging data. Impressed with its functionality I decided to try and apply it to my current rail project with an Azure SQL database (60GB). I had developed a good model for a small use case in R and now wanted to extend the feature engineering and modelling R scripts for the entire rail network.

Unfortunately, I was blocked from the get go. I began by checking which of the several R packages I had used would be supported to run on SQL server (see here for guided details on how to install additional R packages to SQL Server). The first step would be to enable package management, which I soon found cannot be done for Azure SQL databases (see here).

“The ability to run R and other machine learning scripts in Azure SQL Database is a preview feature in the West Central US region only and the following functionality is not supported:

· The MicrosoftML package is not available.

· Package management features such as CREATE EXTERNAL LIBRARY are not supported.

· You cannot use the Azure SQL database as a remote compute context when executing scripts from an R client. R scripts must be run by using the stored procedure sp_execute_external_script. Scripts called by the stored procedure cannot use other compute contexts.

· You cannot execute calls to RevoScaleR functions that require parallel execution.

· Loopback connections from R script to SQL Server are not supported. In other words, you cannot make external calls from your R script to another ODBC data source.

R packages

· The preview release includes Microsoft R Open 3.3.3, and Microsoft R Server version 9.2. The RevoScaleR package is preinstalled.

· Some R packages have been removed or modified to reduce footprint in the Azure environment. For example, mrsdeploy is not included in Azure SQL Database.

Performance

· Supports training and scoring from any model where data can fit in memory. The amount of memory available depends on the edition of the database.

· Trivial parallelism is supported using the @parallel =1 argument, as well as streaming for R script execution

· In preview, limited to a single R script executed concurrently per database.”

Evidently, the current release is very limited in its capability and not ideal for operationalising data science for big data. It would be extremely inefficient to write R scripts using mainly base R when there are so many amazing R packages that optimise code. However, given the nice implementation available for smaller local databases I remain hopeful for the next release in 2018.

Additionally, there are other services available for operationalising data science which maybe more effective/ successful such as:

· HDInsight (Spark)

· Azure Data Lake Analytics (can have embedded R and Python)

· Machine Learning Model Management and Experimentation

· Azure Functions (can be R / Python)

Happy Experimenting!