DAM Challenge – a brief guide to getting 0.85 AUC

This post will explain the approach that the PIRATE TEAM used to achieve an Area Under Curve of 0.85 on the provided test set in the DAM Data Challenge on Saturday 7th May. The code and the results are provided along with the explanation, so you should be able to run the code yourself.

Firstly, you’ll need to install R, and I’d recommend installing RStudio as well. The installation is outside the scope of this guide, but a quick google gave me this link that should point you in the right direction: LINK (at the time of writing the latest version of R is 3.2.5 – this is the version you should install)

With R installed, you’ll need to install a bunch of packages. Run the code below to install all the packages required for this guide (some of these will automatically pull in several other packages as dependencies). It will probably take a little while to download and install all of them.

If you get any errors, use your googley skills to get those packages installed. You won’t be able to continue without them! Once they’re installed, load the packages.
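A minimal sketch of the install-and-load step, assuming the three packages described in this guide are the full list (that list is an assumption; depending on your caret version you may also be prompted for extras such as pROC or e1071):

```r
# Packages discussed in this guide (assumed to be the full list).
required <- c("caret", "glmnet", "dplyr")

# Install only the ones that are not already available.
installed <- rownames(installed.packages())
to_install <- setdiff(required, installed)
if (length(to_install) > 0) {
  install.packages(to_install)
}

# Load the packages.
library(caret)
library(glmnet)
library(dplyr)
```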

It’s probably worth quickly explaining what each of these packages does.

  1. The caret package is a modelling package which helps to automate a lot of the code you would normally have to write yourself. Models in R have all sorts of different syntax and outputs, and caret helps to abstract a lot of those details away so that you can focus on solving the problem.
  2. The glmnet package provides the machine learning algorithm we’re going to use, which is called Elastic Net. This algorithm is pretty much just a fancy GLM (which means we can do Logistic Regression) with both L1 and L2 regularisation. Regularisation is probably an explanation too far for this guide, but it’s very well suited to problems where you want to avoid overfitting.
  3. The dplyr package is one of my favourite things about R. It makes data manipulation really simple through the use of verb-inspired functions (select, mutate, filter, etc) and the chain operator (%>%). Hadley Wickham (the author of the dplyr package) has put together the definitive guide here: http://r4ds.had.co.nz/transform.html

Now to start with the data! Firstly, import the data (you’ll need to replace the path below with the location of the file on your own computer).

When we looked at the data (not covered in this guide) we realised that a few of the columns would be better off as factors (i.e. categories) rather than integers or numerics. We used some functions from the dplyr package to transform these columns using the as.factor function, and recoded the IsAlert column as Y/N (it makes things a bit easier because a Y/N cannot be misinterpreted as a number by any of the models later on).
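As a rough sketch of that transformation on a toy data frame: only IsAlert is a column name taken from this guide; the other columns (ID, E3, V1), the choice of which column becomes a factor, and the assumption that IsAlert starts out as 0/1 are all hypothetical.

```r
library(dplyr)

# Toy stand-in for the imported data. Only IsAlert comes from the guide;
# ID, E3 and V1 are hypothetical column names.
rawData <- data.frame(
  ID      = 1:6,
  IsAlert = c(1, 0, 1, 1, 0, 0),   # assumed to be coded 0/1 in the raw file
  E3      = c(0, 4, 4, 0, 2, 2),
  V1      = c(101.9, 99.5, 100.2, 98.7, 102.3, 100.0)
)

cleanData <- rawData %>%
  mutate(
    # Treat a numeric code as a category rather than a number.
    E3      = as.factor(E3),
    # Recode the outcome as N/Y so no model can mistake it for a number.
    IsAlert = factor(ifelse(IsAlert == 1, "Y", "N"), levels = c("N", "Y"))
  )

str(cleanData)
```

Fixing the level order as N then Y keeps the probability columns in the same order later on.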

Now we will partition the data into an 80/20 train/test split using the createDataPartition function from the caret package. This is a bit safer than using a purely random sample as it takes a “stratified” sample, which ensures that both the train and test set have the same proportion of Y and N.
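A small sketch of the stratified split on made-up data (the toy data frame, the seed and the object names are assumptions; createDataPartition itself is the caret function named above):

```r
library(caret)

set.seed(42)  # arbitrary seed, only so the toy example is repeatable

# Toy data with a 40/60 split between N and Y.
toyData <- data.frame(
  IsAlert = factor(rep(c("N", "Y"), times = c(40, 60)), levels = c("N", "Y")),
  x       = rnorm(100)
)

# Row indices for an 80% training sample, drawn within each class.
trainIndex <- createDataPartition(toyData$IsAlert, p = 0.8, list = FALSE)
trainSet <- toyData[trainIndex, ]
testSet  <- toyData[-trainIndex, ]

# The class proportions should be close to identical in both sets.
prop.table(table(trainSet$IsAlert))
prop.table(table(testSet$IsAlert))
```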

Now we can start the modelling. Elastic Net has two parameters – alpha and lambda. If I were a magician I could just guess these; however, I’m not a magician and my guess would be terrible. Instead of guessing, you can split the training data again to make a new “validation” set, and “tune” the model using this set.

To tune a model, you train the model multiple times with different parameters (using the training set) and use the validation set to evaluate the performance of the model with those parameters. As an example, you could choose 4 different values of alpha and 4 values of lambda, which would give you 16 different combinations of parameters.
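To make the 4 x 4 example concrete, here is a sketch of such a grid in base R. The particular alpha and lambda values are made up for illustration and are not the ones used in this guide.

```r
# Four hypothetical values for each parameter.
alphas  <- c(0, 0.33, 0.67, 1)
lambdas <- c(0.001, 0.01, 0.1, 1)

# Every combination of alpha and lambda: one row per candidate model.
tuneGrid <- expand.grid(alpha = alphas, lambda = lambdas)

nrow(tuneGrid)  # 16 combinations
head(tuneGrid)
```

A grid like this can be handed to caret through the tuneGrid argument of train, although this guide sticks with caret’s default grid.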

How do you evaluate the model for these iterations? Given that we’re being assessed on ROC (more correctly the area under the ROC curve) we can just compare the ROC scores for each set of parameters! Use each iteration of the model to predict against the validation set and choose the parameters which give the highest ROC score against the validation set.
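For a feel of what that score measures, here is a small base R sketch that computes the area under the ROC curve from ranks (the Mann-Whitney formulation). The auc helper and the toy predictions are hypothetical; later on caret does this calculation for us.

```r
# Area under the ROC curve: the chance that a randomly chosen Y row gets a
# higher predicted probability than a randomly chosen N row.
auc <- function(prob_pos, truth, positive = "Y") {
  is_pos <- truth == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r      <- rank(prob_pos)  # average ranks, so ties count as half
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth <- c("N", "N", "Y", "Y")

# Predicted probabilities of Y from two hypothetical parameter settings.
auc_a <- auc(c(0.10, 0.40, 0.35, 0.80), truth)  # 0.75: one N outranks one Y
auc_b <- auc(c(0.10, 0.20, 0.70, 0.90), truth)  # 1: every Y outranks every N

c(settingA = auc_a, settingB = auc_b)
```

Here the second setting would be preferred because it ranks the validation rows better.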

Going one step further, we can take multiple validation sets, using a process known as cross-validation (CV). As an example, for 5-fold CV, you would cut the data into 5 roughly equal sets. You then “hold out” the first set as the validation fold (a fold is just a slice of data), train your model using folds 2-5, and measure the ROC score on the fold that you held out. Then you hold out the second fold, train using folds 1 and 3-5, and measure the ROC of the prediction against the 2nd fold. In this way you get 5 estimates of the ROC for each pair of parameters, which gives you a much more robust estimate and helps avoid inadvertent overfitting. So now if we’re testing 4 values for each of the two parameters, then we’re running 4x4x5 (80) models to find the best one.
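The fold bookkeeping described above can be sketched in base R as follows. This is only an illustration of the hold-out logic on 20 made-up rows, with the actual model fitting left as a comment; caret handles all of this internally.

```r
set.seed(1)  # arbitrary seed for the toy example

n <- 20  # number of rows in the toy training set
k <- 5   # number of folds

# Randomly assign each row to one of k roughly equal folds.
fold_id <- sample(rep(seq_len(k), length.out = n))

held_out <- integer(0)
for (fold in seq_len(k)) {
  validation_rows <- which(fold_id == fold)  # the fold we hold out
  training_rows   <- which(fold_id != fold)  # the other k - 1 folds
  # A model would be trained on training_rows and scored on validation_rows here.
  held_out <- c(held_out, validation_rows)
}

# Every row is held out exactly once across the k rounds.
identical(sort(held_out), seq_len(n))

# 4 alphas x 4 lambdas x 5 folds.
n_models <- 4 * 4 * k
n_models
```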

One final step further, for increased robustness, you can do CV multiple times! So in the code below, we’re going to run 8-fold CV repeated 5 times.

Now for the code below. This code sets up the “training control” for the caret package. Firstly, we’re setting a seed for the random number generator so that we get the same random folds each time, which makes the results reproducible. We specify that we’re going to use the “repeatedcv” approach, with 8 folds and repeating 5 times. We’re then going to use the default parameter grid – caret has a default grid of parameters to test for each algorithm, and I assume that it’s better than my choices. We’re telling it to return the probabilities (not just the predictions) because we’ll need those for the ROC curve, and telling it to use the twoClassSummary function (this is how it calculates the ROC). Finally we’re telling it to be “verbose” when it runs, which just means it will provide status updates so we know it’s running.
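A sketch of a training control matching that description. The seed value and the object name fitControl are assumptions; the arguments are standard trainControl options.

```r
library(caret)

# Arbitrary seed. The folds are actually drawn when train() runs, so the seed
# needs to be set before that call with no other random draws in between.
set.seed(1234)

fitControl <- trainControl(
  method          = "repeatedcv",     # repeated k-fold cross-validation
  number          = 8,                # 8 folds
  repeats         = 5,                # repeated 5 times
  classProbs      = TRUE,             # return class probabilities
  summaryFunction = twoClassSummary,  # reports ROC, sensitivity, specificity
  verboseIter     = TRUE              # print progress while training
)
# No tuning grid is supplied anywhere, so caret falls back on its default grid.
```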

Right! Now that it’s all ready to go, we can run the CV. This will take AGES because we’re running 8-fold CV 5 times (40 models for each combination of parameters in the default grid) so you’ll want to get a drink or something.

The code below tells caret to build a model which predicts IsAlert using every other column in the data as a predictor (except for columns 1 and 2 which I have removed). It also specifies glmnet as the algorithm, ROC as the evaluation metric, and tells the model to allow NAs (missing data) to pass through to the preProcess step. This is a really cool feature of caret: it can automatically do standard preProcessing functions for you! In the example below I’ve asked it to automatically apply centering and scaling, and to impute any missing data (NAs) using the median value for that field. Finally, we pass the training control options we set above.

Note that I’ve suppressed the output for the purpose of this post, but it’s going to give you a lot of information about what it’s doing.

If you read the output from this step, it tries a whole bunch of models on a whole bunch of CV folds, then chooses the best parameters and trains the model one final time on the whole dataset using those parameters. It stores the results of this model so you don’t need to train it again. So finally, you should have an absolute belter of a model stored in the elasticTune object we just created. The object also has a whole bunch of summary information about the model. You can inspect it using the commands below.
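A self-contained sketch of the training call and the inspection commands described above, run on a small simulated data set so that it finishes quickly. The toy data, the seed and the name fitControl are assumptions; elasticTune is the object name used in this guide. Verbose output is switched off here just to keep the printout short.

```r
library(caret)

set.seed(1234)  # arbitrary seed

# Simulated stand-in for the training data, with a few missing values.
n <- 200
toyTrain <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toyTrain$IsAlert <- factor(
  ifelse(toyTrain$x1 + 0.5 * toyTrain$x2 + rnorm(n) > 0, "Y", "N"),
  levels = c("N", "Y")
)
toyTrain$x3[sample(n, 10)] <- NA

fitControl <- trainControl(
  method = "repeatedcv", number = 8, repeats = 5,
  classProbs = TRUE, summaryFunction = twoClassSummary,
  verboseIter = FALSE
)

elasticTune <- train(
  IsAlert ~ .,                  # predict IsAlert from every other column
  data       = toyTrain,
  method     = "glmnet",        # Elastic Net
  metric     = "ROC",           # choose parameters by area under the ROC curve
  na.action  = na.pass,         # let NAs through to the preProcess step
  preProcess = c("center", "scale", "medianImpute"),
  trControl  = fitControl
)

elasticTune           # summary of the resampling results
elasticTune$bestTune  # the winning alpha and lambda
elasticTune$results   # ROC, Sens and Spec for every combination tried
```

On the real data the same call takes far longer, because every one of the 40 resamples is fitted on a much bigger table.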


We can now test the model against our own internal test set. Assuming that Siamak’s final test set was selected in the same way then this result should be fairly similar to the final result. Note that we’re explicitly removing the IsAlert column from the data frame before we pass it into the prediction to reduce the chance of leakage. To do the tests we will build a data frame with 4 columns:

  • The probability that each line was N
  • The probability that each line was Y
  • The observed value for each line (the truth)
  • Our prediction for each line based on the probabilities above

From this table we can generate the confusion matrix and the ROC AUC.
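Here is a sketch of that four-column table and the two summaries built from it. To keep it self-contained the probabilities are simulated; in the real pipeline they would come from something like predict(elasticTune, newdata = ..., type = "prob"), which is an assumption about how the original code was written. The 0.5 cut-off is also an assumption.

```r
library(caret)

set.seed(99)  # arbitrary seed

# Simulated truth and simulated probabilities of Y.
obs   <- factor(rep(c("N", "Y"), each = 50), levels = c("N", "Y"))
probY <- ifelse(obs == "Y", rbeta(100, 4, 2), rbeta(100, 2, 4))

# The four columns: P(N), P(Y), the observed value, and our prediction.
results <- data.frame(N = 1 - probY, Y = probY, obs = obs)
results$pred <- factor(ifelse(results$Y > 0.5, "Y", "N"), levels = c("N", "Y"))

# Confusion matrix of predictions against the truth.
cm <- confusionMatrix(data = results$pred, reference = results$obs, positive = "Y")
cm

# ROC AUC, sensitivity and specificity; twoClassSummary expects columns
# named obs, pred, and one probability column per class level.
testSummary <- twoClassSummary(results, lev = levels(results$obs))
testSummary
```

Note that twoClassSummary treats the first factor level (N here) as the event when it reports sensitivity and specificity; the ROC value is unaffected by that choice.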

Woohoo! It looks good! 0.9 ROC (area under curve) is excellent but we can’t be too cocky yet.

Now we can import Siamak’s final test data and test our model. We’re also going to apply the same transformations to this data as we did to the training data.

And now we’re going to be really careful to prevent leaks, because we don’t want to accidentally cheat on this final test: we’ll remove the IsAlert column from the final test data, and then print the first 6 rows to prove it’s gone!
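A minimal sketch of that leak check on a made-up final test set (the data frame, its ID and V1 columns, and the object names are hypothetical; only IsAlert comes from this guide):

```r
library(dplyr)

# Toy stand-in for the final test data after the factor transformations.
finalTest <- data.frame(
  ID      = 1:8,
  IsAlert = factor(c("Y", "N", "N", "Y", "Y", "N", "Y", "N"), levels = c("N", "Y")),
  V1      = c(100.1, 98.4, 101.7, 99.9, 102.0, 97.5, 100.8, 99.2)
)

# Keep the truth to one side for scoring later.
finalTruth <- finalTest$IsAlert

# Drop the outcome so the model cannot see it when predicting.
finalTestX <- finalTest %>% select(-IsAlert)

# First 6 rows: there should be no IsAlert column.
head(finalTestX)
```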

So now that we’re sure we aren’t cheating we’ll use the model to predict the results of Siamak’s test data, and then compare these predictions to the truth to generate a confusion matrix and calculate the area under the ROC curve.

Woohoo! ROC AUC is 0.844! There are countless ways that this model could be improved, but that can be the topic of another post in future 🙂

I hope this has been useful, and please feel free to leave a comment if you’re having any issues with parts of the script.



A collection of resources for the caret package

I’ve been trying to get up and running with the caret package for R and found that there is quite a shortage of good materials. I’m planning to write some tutorials for getting started with caret once I get a bit more comfortable with it, but for now I’ll just post a collection of useful resources. Most of these resources are produced by Max Kuhn, which is hardly surprising as he is the package author and maintainer. Is Max Kuhn to machine learning in R as Hadley Wickham is to everything else in R? I’ll leave that one hanging.

Caret Package Webinar – link
This is a webinar presented by Max Kuhn from February 2014. It covers the motivation for the package as well as quite a bit of R code to set up a machine learning algorithm using the package. You can download the slide deck here.

Machine Learning with R: An Irresponsibly Fast Tutorial – link
This is the only resource on this page which was not produced by Max Kuhn. It provides an excellent worked example which you can submit to Kaggle.

Interview with Max Kuhn – link
Another video with Max Kuhn explaining some of the motivation and intentions behind caret.

Presentation at useR 2013 – link
This is unsurprisingly also by Max Kuhn. It is a longer and more detailed slide deck than the first link on this page, and it includes a different example which shows a few additional features. As it’s a bit older it also does things a bit differently to how you would do them in the latest version of caret.

Applied Predictive Modelling – link
This is unfortunately the only non-free resource. The book is highly regarded and I’ll probably end up buying it at some point. It’s written by Max Kuhn and talks about the theory of predictive modelling as well as providing some examples in R and caret.

The caret website – link
This looks like an excellent technical resource once you’re up and running with caret but it doesn’t really help you get your head around what it’s doing and how to use it.

Max Kuhn’s Blog – link
I can’t believe how much material there is here. I haven’t managed to watch any of the videos yet but this looks like an awesome resource.

Building Predictive Models in R Using the caret Package – link
This is an academic paper also written by Max Kuhn at around the same time as caret was added to CRAN. The package has a bunch of new features since this paper was written so it’s probably not the go-to resource it once was.