Friday, August 7, 2015

R Create Data.Frame Like Read.Csv

While exploring R.NET and RserveLink from C#, I ran into a couple of performance issues with my scenario. In a load-balanced environment that needs to pull both the random forests and the cached CSV data, saving the files to a network share makes sense, but there is a cost. My initial process was:
  1. In C#, load and prep the data.
  2. Serialize the data into CSV format and save it to a network share.
  3. Using R.NET or RserveLink, send the commands below to pull the random forest and the CSV data, then run the data through the random forest.
Below is the proof-of-concept R code.
library("randomForest")
library("caret")

mydata = read.csv(file="IntPonAllTheData.csv",head=TRUE,row.names="IntPonID")

test.predict <- predict(readRDS('//intponsrv/Data/RandomForest/CLASS123.rf'), mydata)
write.table(test.predict)
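Those lines can be collapsed into a single expression; a sketch of that combined form, reusing the same hypothetical file and model paths, looks roughly like this:

library("randomForest")

# Read the CSV, score it with the stored random forest, and print the
# predictions in one expression.
write.table(
  predict(readRDS('//intponsrv/Data/RandomForest/CLASS123.rf'),
          read.csv(file = "IntPonAllTheData.csv", header = TRUE, row.names = "IntPonID")))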
I was able to combine several of the lines, but it didn't improve the performance very much. If I could remove two of the network and file I/O round trips, however, that would improve it greatly. The only question was: how do I create a data.frame in R that produces the same object as read.csv? I asked in the #R channel on freenode (they are awesome, check them out and stay a while) and they suggested looking at the save function or the dput function. Using dput, I was able to serialize the data.frame. The serialized format wasn't an exact match, but it was close enough to work out how the data.frame is structured relative to the CSV data; a quick sketch of that step follows, and after it my converted code, which generates a data.frame directly.
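A minimal sketch of that inspection step, reusing the hypothetical CSV from above: dput prints an expression that recreates the object, which exposes the columns, types, and row names a hand-built data.frame has to match.

# Load the CSV as before, then dump the first couple of rows so the
# structure of the resulting data.frame can be read off directly.
mydata <- read.csv(file = "IntPonAllTheData.csv", header = TRUE, row.names = "IntPonID")
dput(head(mydata, 2))

The printed structure() call is what the generated data.frame() statement below needs to reproduce.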
library("randomForest")
library("caret")

df <- data.frame(Q1 = c(0.301775147928994,0.301775147928994,0.301775147928994,0.301775147928994),Q2 = c(0.301775147928994,0.301775147928994,0.301775147928994,0.301775147928994),Q2 = c(0.094674556213018,0.094674556213018,0.094674556213018,0.094674556213018),Q3 = c(0.301775147928994,0.301775147928994,0.301775147928994,0.301775147928994),Q4 = c(0.082840236686391,0.082840236686391,0.082840236686391,0.082840236686391),row.names = c("baseline","TEST1","TEST2","TEST3"))
write.table(df)

test.predict <- predict(readRDS('//intponsrv/Data/RandomForest/CLASS123.rf'), df)
write.table(test.predict)
Instead of generating the CSV, I can generate the data.frame statement and get the same result. After integrating this into my C# code, processing time per prediction improved by 60-80% when repeatedly processing large data sets.
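The C# code that assembles the data.frame statement isn't shown here; purely as an illustration of the idea, the same string can be built from in-memory vectors like this (sketched in R with made-up column values; the production code would do the equivalent in C#, e.g. with a StringBuilder):

# Hypothetical in-memory feature columns and row names, standing in for the
# values the C# layer already holds.
cols <- list(Q1 = c(0.301775147928994, 0.301775147928994),
             Q2 = c(0.094674556213018, 0.094674556213018))
rows <- c("baseline", "TEST1")

# Format each column as 'name = c(v1,v2,...)' and join everything into the
# data.frame statement that gets sent to R in place of a CSV file.
col_text <- vapply(names(cols), function(nm)
  sprintf("%s = c(%s)", nm, paste(cols[[nm]], collapse = ",")), character(1))
stmt <- sprintf("df <- data.frame(%s,row.names = c(%s))",
                paste(col_text, collapse = ","),
                paste(sprintf('"%s"', rows), collapse = ","))
cat(stmt)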
