This post will explain the approach that the PIRATE TEAM used to achieve an area under the ROC curve (AUC) of 0.85 on the provided test set in the DAM Data Challenge on Saturday 7th May. The code and the results are provided along with the explanation, so you should be able to run the code yourself.
Firstly, you’ll need to install R, and I’d recommend installing RStudio as well. The installation is outside the scope of this guide, but a quick google gave me this link that should point you in the right direction: LINK (at the time of writing the latest version of R is 3.2.5 – this is the version you should install)
With R installed, you’ll need to install a bunch of packages. Run the code below to install all the packages required for this guide (some of these will automatically pull in several other packages as dependencies). It will probably take a little while to download and install all of them.
```r
install.packages("caret")
install.packages("glmnet")
install.packages("dplyr")
```
If you get any errors, use your googley skills to get those packages installed. You won’t be able to continue without them! Once they’re installed, load the packages.
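```r
library(caret)
library(glmnet)
library(dplyr)
```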
It’s probably worth quickly explaining what each of these packages do.
- The caret package is a modelling package which helps to automate a lot of the code you would normally have to write yourself. Models in R have all sorts of different syntax and outputs, and caret helps to abstract a lot of those details away so that you can focus on solving the problem.
- The glmnet package provides the machine learning algorithm we’re going to use, which is called Elastic Net. This algorithm is pretty much just a fancy GLM (which means we can do Logistic Regression) with both L1 and L2 regularisation. Regularisation is probably an explanation too far for this guide, but it’s very well suited to problems where you want to avoid overfitting.
- The dplyr package is one of my favourite things about R. It makes data manipulation really simple through the use of verb-inspired functions (select, mutate, filter, etc) and the chain operator (%>%) – there’s a tiny example just after this list. Hadley Wickham (the author of the dplyr package) has put together the definitive guide here: http://r4ds.had.co.nz/transform.html
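If you haven’t seen dplyr before, here’s a toy sketch of how the verbs and the chain operator fit together. It uses the built-in mtcars dataset, not the challenge data, and is purely illustrative:

```r
library(dplyr)  # already loaded above, repeated here so the snippet stands alone

mtcars %>%
  filter(cyl == 4) %>%              # keep just the 4-cylinder cars
  mutate(kpl = mpg * 0.425144) %>%  # add a kilometres-per-litre column
  select(mpg, kpl, wt)              # keep only the columns we care about
```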
Now to start with the data! Firstly, import the data (you’ll need to replace the path below with the location of the file on your own computer).
```r
input_raw <- read.csv("~/Dropbox/DAM/data/IsAlert-Train.csv")
```
When we looked at the data (not covered in this guide) we realised that a few of the columns would be better off as factors (i.e. categories) rather than integers or numerics. We used some functions from the dplyr package to transform these columns using the as.factor function, and recoded the IsAlert column as Y/N (it makes things a bit easier because a Y/N cannot be misinterpreted as a number by any of the models later on).
```r
input_clean <- input_raw %>%
  mutate(
    TrialID            = as.factor(TrialID),
    vehicular_att7     = as.factor(vehicular_att7),
    environmental_att7 = as.factor(environmental_att7),
    environmental_att8 = as.factor(environmental_att8),
    IsAlert            = as.factor(ifelse(input_raw$IsAlert == 1, 'Y', 'N'))
  )
```
Now we will partition the data into an 80/20 train/test split using the createDataPartition function from the caret package. This is a bit safer than using a purely random sample as it takes a “stratified” sample, which ensures that both the train and test set have the same proportion of Y and N.
```r
trainIndex <- createDataPartition(input_clean$IsAlert, p = .8, list = FALSE, times = 1)

train_data <- input_clean[ trainIndex, ]
test_data  <- input_clean[-trainIndex, ]
```
Now we can start the modelling. Elastic Net has two parameters – alpha and lambda. If I were a magician I could just guess these; however, I’m not a magician and my guess would be terrible. Instead of guessing, you can split the training data again to make a new “validation” set, and “tune” the model using this set.
To tune a model, you train the model multiple times with different parameters (using the training set) and use the validation set to evaluate the performance of the model with those parameters. As an example, you could choose 4 different values of alpha and 4 values of lambda, which would give you 16 different combinations of parameters.
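To make that concrete, here is a sketch of what such a grid could look like. The particular values are made up for illustration – later on we simply let caret pick its own default grid:

```r
# Illustration only: 4 candidate alphas x 4 candidate lambdas = 16 combinations
tune_grid <- expand.grid(alpha  = c(0.1, 0.4, 0.7, 1.0),
                         lambda = c(0.0001, 0.001, 0.01, 0.1))

nrow(tune_grid)  # 16 parameter combinations to evaluate
```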
How do you evaluate the model for these iterations? Given that we’re being assessed on ROC (more correctly the area under the ROC curve) we can just compare the ROC scores for each set of parameters! Use each iteration of the model to predict against the validation set and choose the parameters which give the highest ROC score against the validation set.
Going one step further, we can take multiple validation sets, using a process known as cross-validation (CV). As an example, for 5-fold CV, you would cut the data into 5 roughly equal sets. You then “hold out” the first set as the validation fold (a fold is just a slice of data), train your model using folds 2-5, and measure the ROC score on the fold that you held out. Then you hold out the second fold, train using folds 1 and 3-5, and measure the ROC of the prediction against the 2nd fold. In this way you get 5 estimates of the ROC for each pair of parameters, which gives you a much more robust estimate and helps avoid inadvertent overfitting. So if we’re testing 4 values for each parameter, we’re now running 4x4x5 (80) models to find the best one.
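If you want to see what those folds look like, caret’s createFolds function will cut the data up in this stratified way. This is purely illustrative – the train function we use below handles all of the fold creation internally:

```r
# Illustration only: cut the training data into 5 stratified folds.
# Each element of the list is a set of row indices to hold out as a validation fold.
folds <- createFolds(train_data$IsAlert, k = 5)
str(folds)
```

You would then fit the model five times, each time training on four of the folds and measuring the ROC on the one that was held out.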
One final step further, for increased robustness, you can do CV multiple times! So in the code below, we’re going to run 8-fold CV repeated 5 times.
Now for the code below. This code sets up the “training control” for the caret package. Firstly, we’re setting a seed for the random number generator so that we get the same random folds each time, which makes the results reproducible. We specify that we’re going to use the “repeatedcv” approach, with 8 folds and repeating 5 times. We’re then going to use the default parameter grid – caret has a default grid of parameters to test for each algorithm, and I assume that it’s better than my choices. We’re telling it to return the probabilities (not just the predictions) because we’ll need those for the ROC curve, and telling it to use the twoClassSummary function (this is how it calculates the ROC). Finally, we’re telling it to be “verbose” when it runs, which just means it will provide status updates so we know it’s running.
```r
set.seed(1337)

ctrl <- trainControl(
  method = "repeatedcv",
  number = 8,
  repeats = 5,
  search = "grid",
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  verboseIter = TRUE
)
```
Right! Now that it’s all ready to go, we can run the CV. This will take AGES because we’re running 8-fold CV 5 times (40 models for each combination of parameters in the default grid) so you’ll want to get a drink or something.
The code below tells caret to build a model which predicts IsAlert using every other column in the data as a predictor (except for columns 1 and 2, which I have removed). It also specifies glmnet as the algorithm, ROC as the evaluation metric, and tells the model to allow NAs (missing data) to pass through to the preProcess step. This is a really cool feature of caret: it can automatically do standard preProcessing for you! In the example below I’ve asked it to automatically apply centering and scaling, and to impute any missing data (NAs) using the median value for that field. Finally, we pass the training control options we set above.
Note that I’ve suppressed the output for the purpose of this post, but it’s going to give you a lot of information about what it’s doing.
```r
elasticTune <- train(IsAlert ~ .,
                     data = train_data[, -c(1, 2)],
                     method = "glmnet",
                     metric = "ROC",
                     na.action = na.pass,
                     preProcess = c("center", "scale", "medianImpute"),
                     trControl = ctrl)
```
```
## + Fold1.Rep1: alpha=0.10, lambda=0.04613 
## - Fold1.Rep1: alpha=0.10, lambda=0.04613 
## + Fold1.Rep1: alpha=0.55, lambda=0.04613 
############# OUTPUT SUPPRESSED #############
## - Fold8.Rep5: alpha=0.55, lambda=0.04613 
## + Fold8.Rep5: alpha=1.00, lambda=0.04613 
## - Fold8.Rep5: alpha=1.00, lambda=0.04613 
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 1, lambda = 0.000461 on full training set
```
If you read the output from this step, it tries a whole bunch of models on a whole bunch of CV folds, then chooses the best parameters and trains the model one final time on the whole dataset using those parameters. It stores the results of this model so you don’t need to train it again. So finally, you should have an absolute belter of a model stored in the elasticTune object we just created. The object also has a whole bunch of summary information about the model. You can inspect it using the commands below.
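For example (printing the fitted train object gives the summary shown below; the plot call draws the resampled ROC scores across the tuning grid):

```r
elasticTune        # print the fitted train object (summary shown below)
plot(elasticTune)  # plot the ROC score across the tuning grid
```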
```
## glmnet 
## 
## 173246 samples
##     25 predictor
##      2 classes: 'N', 'Y' 
## 
## Pre-processing: centered (36), scaled (36), median imputation (36) 
## Resampling: Cross-Validated (8 fold, repeated 5 times) 
## Summary of sample sizes: 151590, 151590, 151590, 151590, 151590, 151590, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        ROC        Sens       Spec     
##   0.10   0.0004613352  0.9073139  0.7038542  0.9476173
##   0.10   0.0046133519  0.9064279  0.7011666  0.9467340
##   0.10   0.0461335194  0.8980523  0.6891040  0.9474769
##   0.55   0.0004613352  0.9073296  0.7035280  0.9478877
##   0.55   0.0046133519  0.9058153  0.6990568  0.9486392
##   0.55   0.0461335194  0.8972081  0.6675034  0.9536580
##   1.00   0.0004613352  0.9073338  0.7032763  0.9480606
##   1.00   0.0046133519  0.9047197  0.6982059  0.9488702
##   1.00   0.0461335194  0.8964130  0.6675283  0.9616724
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda
##  = 0.0004613352.
```
We can now test the model against our own internal test set. Assuming that Siamak’s final test set was selected in the same way, this result should be fairly similar to the final result. Note that we’re explicitly removing the IsAlert column from the data frame before we pass it into the prediction to reduce the chance of leakage. To do the tests we will build a data frame with 4 columns:
- The probability that each line was N
- The probability that each line was Y
- The observed value for each line (the truth)
- Our prediction for each line based on the probabilities above
From this table we can generate the confusion matrix and the ROC AUC.
```r
resultsTest <- predict(elasticTune, test_data[, -28], na.action = na.pass, type = "prob")
resultsTest$obs  <- test_data$IsAlert
resultsTest$pred <- predict(elasticTune, test_data[, -28], na.action = na.pass)

print(confusionMatrix(resultsTest$pred, resultsTest$obs))
```
```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     N     Y
##          N  9808  1559
##          Y  4293 27650
##                                           
##                Accuracy : 0.8649          
##                  95% CI : (0.8616, 0.8681)
##     No Information Rate : 0.6744          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6761          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6956          
##             Specificity : 0.9466          
##          Pos Pred Value : 0.8628          
##          Neg Pred Value : 0.8656          
##              Prevalence : 0.3256          
##          Detection Rate : 0.2265          
##    Detection Prevalence : 0.2625          
##       Balanced Accuracy : 0.8211          
##                                           
##        'Positive' Class : N               
## 
```
```r
twoClassSummary(resultsTest, lev = c("Y", "N"))
```
```
##       ROC      Sens      Spec 
## 0.9030444 0.9466260 0.6955535
```
Woohoo! It looks good! 0.9 ROC (area under curve) is excellent but we can’t be too cocky yet.
Now we can import Siamak’s final test data and test our model. We’re also going to apply the same transformations to this data as we did to the training data.
```r
test_final_raw <- read.csv("~/Dropbox/DAM/data/IsAlert_Test1.csv")

test_final_clean <- test_final_raw %>%
  mutate(
    TrialID            = as.factor(TrialID),
    vehicular_att7     = as.factor(vehicular_att7),
    environmental_att7 = as.factor(environmental_att7),
    environmental_att8 = as.factor(environmental_att8),
    IsAlert            = as.factor(ifelse(test_final_raw$IsAlert == 1, 'Y', 'N'))
  )
```
And now we’re going to be really careful to prevent leakage, because we don’t want to accidentally cheat on this final test. We’ll stash the true IsAlert values to one side, drop the IsAlert column from the test data, and then print the first 6 rows to prove it’s gone!
```r
test_final_IsAlert <- test_final_clean$IsAlert
test_final_clean   <- test_final_clean[, -28]

head(test_final_clean)
```
```
##   TrialID ObsNum physiologicl_att1 physiological_att2 physiologicl_att13
## 1     300      0           34.8293            12.2809               1716
## 2     300      1           34.8091            13.8770               1716
## 3     300      2           34.8353            11.6274               1000
## 4     300      3           34.8856            11.6745               1000
## 5     300      4           34.8963            13.9484               1000
## 6     300      5           34.9611            10.3686               1000
##   physiologicl_att14 physiologicl_att15 physiologicl_att16
## 1             34.965           0.247789                532
## 2             34.965           0.247789                532
## 3             60.000           0.303574                532
## 4             60.000           0.303574                532
## 5             60.000           0.303574                528
## 6             60.000           0.303574                528
##   physiologicl_att17 environmental_att1 environmental_att2
## 1            112.782                  0                  0
## 2            112.782                  0                  0
## 3            112.782                  0                  0
## 4            112.782                  0                  0
## 5            113.636                  0                  0
## 6            113.636                  0                  0
##   environmental_att3 environmental_att4 environmental_att5
## 1                  8           0.015749                329
## 2                  8           0.015749                329
## 3                  8           0.015749                329
## 4                  8           0.015749                329
## 5                  8           0.015749                329
## 6                  8           0.015749                329
##   environmental_att6 environmental_att7 environmental_att8
## 1                  1                  1                  1
## 2                  1                  1                  1
## 3                  1                  1                  1
## 4                  1                  1                  1
## 5                  1                  1                  1
## 6                  1                  1                  1
##   environmental_att9 environmental_att10 vehicular_att1 vehicular_att2
## 1                 71                   0          97.42          0.385
## 2                 71                   0          97.12          0.070
## 3                 71                   0          97.68          0.455
## 4                 71                   0          98.09          0.000
## 5                 71                   0          97.82          0.280
## 6                 71                   0          97.43          0.280
##   vehicular_att3 vehicular_att4 vehicular_att5 vehicular_att6
## 1           1008        3.01875           1894           18.4
## 2           1008        3.01875           1899           18.9
## 3            240        4.50625           1889           18.9
## 4            240        4.50625           1894           20.6
## 5           1008        4.50625           1909           21.0
## 6            767        4.50625           1905           21.0
##   vehicular_att7 vehicular_att8
## 1              4        15.8576
## 2              4        15.8723
## 3              4        15.8258
## 4              4        15.8159
## 5              4        15.8289
## 6              4        15.8321
```
So now that we’re sure we aren’t cheating we’ll use the model to predict the results of Siamak’s test data, and then compare these predictions to the truth to generate a confusion matrix and calculate the area under the ROC curve.
```r
resultsFinal <- predict(elasticTune, test_final_clean, na.action = na.pass, type = "prob")
resultsFinal$obs  <- test_final_IsAlert
resultsFinal$pred <- predict(elasticTune, test_final_clean, na.action = na.pass)

print(confusionMatrix(resultsFinal$pred, resultsFinal$obs))
```
```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      N      Y
##          N  51567   4658
##          Y  66864 118619
##                                           
##                Accuracy : 0.7041          
##                  95% CI : (0.7023, 0.7059)
##     No Information Rate : 0.51            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4018          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4354          
##             Specificity : 0.9622          
##          Pos Pred Value : 0.9172          
##          Neg Pred Value : 0.6395          
##              Prevalence : 0.4900          
##          Detection Rate : 0.2133          
##    Detection Prevalence : 0.2326          
##       Balanced Accuracy : 0.6988          
##                                           
##        'Positive' Class : N               
## 
```
```r
twoClassSummary(resultsFinal, lev = c("Y", "N"))
```
```
##       ROC      Sens      Spec 
## 0.8449919 0.9622152 0.4354181
```
Woohoo! ROC AUC is 0.844! There are countless ways that this model could be improved, but that can be the topic of another post in future 🙂
I hope this has been useful, and please feel free to leave a comment if you’re having any issues with parts of the script.