DAM Challenge – a brief guide to getting 0.85 AUC

This post will explain the approach that the PIRATE TEAM used to achieve an area under the ROC curve (AUC) of 0.85 on the provided test set in the DAM Data Challenge on Saturday 7th May. The code and the results are provided along with the explanation, so you should be able to run the code yourself.

Firstly, you’ll need to install R, and I’d recommend installing RStudio as well. The installation is outside the scope of this guide, but a quick google gave me this link that should point you in the right direction: LINK (at the time of writing the latest version of R is 3.2.5 – this is the version you should install)

With R installed, you’ll need to install a bunch of packages. Run the code below to install all the packages required for this guide (some of these will automatically pull in several other packages as dependencies). It will probably take a little while to download and install all of them.
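I haven’t reproduced the original install script here, so the package list below is an assumption based on what’s used in the rest of this guide (plus pROC, which caret leans on for its ROC calculations):

  # Install the packages used in this guide (dependencies are pulled in automatically)
  install.packages(c("caret", "glmnet", "dplyr", "pROC"), dependencies = TRUE)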

If you get any errors, use your googley skills to get those packages installed. You won’t be able to continue without them! Once they’re installed, load the packages.
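Assuming the same package list as above, loading them looks like this:

  # Load the packages for this session
  library(caret)
  library(glmnet)
  library(dplyr)
  library(pROC)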

It’s probably worth quickly explaining what each of these packages does.

  1. The caret package is a modelling package which helps to automate a lot of the code you would normally have to write yourself. Models in R have all sorts of different syntax and outputs, and caret helps to abstract a lot of those details away so that you can focus on solving the problem.
  2. The glmnet package provides the machine learning algorithm we’re going to use, which is called Elastic Net. This algorithm is pretty much just a fancy GLM (which means we can do Logistic Regression) with both L1 and L2 regularisation. A full explanation of regularisation is beyond the scope of this guide; the short version is that it penalises large coefficients, which makes Elastic Net very well suited to problems where you want to avoid overfitting.
  3. The dplyr package is one of my favourite things about R. It makes data manipulation really simple through the use of verb-inspired functions (select, mutate, filter, etc) and the chain operator (%>%). Hadley Wickham (the author of the dplyr package) has put together the definitive guide here: http://r4ds.had.co.nz/transform.html

Now to start with the data! Firstly, import the data (you’ll need to replace the path below with the location of the file on your own computer).
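Something like the snippet below should do it (the file name and path here are placeholders, not the actual challenge file name):

  # Read the training data from wherever you saved it
  rawData <- read.csv("C:/path/to/dam_challenge_train.csv")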

When we looked at the data (not covered in this guide) we realised that a few of the columns would be better off as factors (i.e. categories) rather than integers or numerics. We used some functions from the dplyr package to transform these columns using the as.factor function, and recoded the IsAlert column as Y/N (it makes things a bit easier because a Y/N cannot be misinterpreted as a number by any of the models later on).
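The post doesn’t list the exact columns we converted, so the two column names in this sketch are placeholders, but the shape of the code was along these lines:

  # Convert selected columns to factors and recode the response as Y/N
  rawData <- rawData %>%
    mutate(
      SomeColumnA = as.factor(SomeColumnA),   # placeholder names - use the columns you decide are categories
      SomeColumnB = as.factor(SomeColumnB),
      IsAlert = factor(ifelse(IsAlert == 1, "Y", "N"), levels = c("N", "Y"))  # assuming the raw column is coded 0/1
    )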

Now we will partition the data into an 80/20 train/test split using the createDataPartition function from the caret package. This is a bit safer than using a purely random sample as it takes a “stratified” sample, which ensures that both the train and test set have the same proportion of Y and N.
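For example:

  set.seed(42)  # any fixed seed will do - it just makes the split repeatable
  inTrain  <- createDataPartition(rawData$IsAlert, p = 0.8, list = FALSE)
  training <- rawData[inTrain, ]
  testing  <- rawData[-inTrain, ]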

Now we can start the modelling. Elastic Net has two parameters – alpha and lambda. If I was a magician I could just guess these, however I’m not a magician and my guess would be terrible. Instead of guessing, you can split the training data again to make a new “validation” set, and “tune” the model using this set.

To tune a model, you train the model multiple times with different parameters (using the training set) and use the validation set to evaluate the performance of the model with those parameters. As an example, you could choose 4 different values of alpha and 4 values of lambda, which would give you 16 different combinations of parameters.
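Just to make the counting concrete, here is what a 4 x 4 grid might look like (the values themselves are made up, and later on we’ll let caret pick its own default grid anyway):

  # 4 candidate alphas x 4 candidate lambdas = 16 parameter combinations
  exampleGrid <- expand.grid(alpha  = c(0, 0.33, 0.66, 1),
                             lambda = c(0.0001, 0.001, 0.01, 0.1))
  nrow(exampleGrid)  # 16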

How do you evaluate the model for these iterations? Given that we’re being assessed on ROC (more correctly the area under the ROC curve) we can just compare the ROC scores for each set of parameters! Use each iteration of the model to predict against the validation set and choose the parameters which give the highest ROC score against the validation set.

Going one step further, we can take multiple validation sets, using a process known as cross-validation (CV). As an example, for 5-fold CV, you would cut the data into 5 roughly equal sets. You then “hold out” the first set as the validation fold (a fold is just a slice of data), train your model using folds 2-5, and measure the ROC score on the fold that you held out. Then you hold out the second fold, train using folds 1 and 3-5, and measure the ROC of the prediction against the 2nd fold. In this way you get 5 estimates of the ROC for each pair of parameters, which gives you a much more robust estimate and helps avoid inadvertent overfitting. So if we’re testing 4 values for each of the two parameters, then we’re running 4x4x5 (80) models to find the best combination.

One final step further, for increased robustness, you can do CV multiple times! So in the code below, we’re going to run 8-fold CV repeated 5 times.

Now for the code below. This code sets up the “training control” for the caret package. Firstly, we’re setting a seed for the random number generator so that we get the same random folds each time, which makes the results reproducible. We specify that we’re going to use the “repeatedcv” approach, with 8 folds and 5 repeats. We’re then going to use the default parameter grid – caret has a default grid of parameters to test for each algorithm, and I assume its defaults are better than anything I’d pick by hand. We’re telling it to return the probabilities (not just the predictions) because we’ll need those for the ROC curve, and telling it to use the twoClassSummary function (this is how it calculates the ROC). Finally we’re telling it to be “verbose” when it runs, which just means it will provide status updates so we know it’s running.
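Putting that together, the training control looks something like this (the seed value itself is arbitrary):

  set.seed(1234)  # reproducible fold assignment
  fitControl <- trainControl(
    method          = "repeatedcv",
    number          = 8,                # 8 folds
    repeats         = 5,                # repeated 5 times
    classProbs      = TRUE,             # return probabilities, not just predictions
    summaryFunction = twoClassSummary,  # calculates ROC, sensitivity and specificity
    verboseIter     = TRUE              # print progress while it runs
  )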

Right! Now that it’s all ready to go, we can run the CV. This will take AGES because we’re running 8-fold CV 5 times (40 models for each combination of parameters in the default grid) so you’ll want to get a drink or something.

The code below tells caret to build a model which predicts IsAlert using every other column in the data as a predictor (except for columns 1 and 2, which I have removed). It also specifies glmnet as the algorithm, ROC as the evaluation metric, and tells the model to allow NAs (missing data) to pass through to the preProcess step. This is a really cool feature of caret: it can automatically apply standard pre-processing steps for you! In the example below I’ve asked it to automatically apply centering and scaling, and to impute any missing data (NAs) using the median value for that field. Finally, we pass in the training control options we set above.
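A sketch of that call, reusing the object names assumed in the snippets above and assuming the first two columns are identifier columns:

  elasticTune <- train(
    IsAlert ~ .,
    data       = training[, -c(1, 2)],  # drop the first two columns (assumed to be identifiers)
    method     = "glmnet",              # Elastic Net
    metric     = "ROC",                 # optimise area under the ROC curve
    preProcess = c("center", "scale", "medianImpute"),  # standardise and impute missing values
    na.action  = na.pass,               # let NAs through to the preProcess step
    trControl  = fitControl
  )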

Note that I’ve suppressed the output for the purpose of this post, but it’s going to give you a lot of information about what it’s doing.

If you read the output from this step, it tries a whole bunch of models on a whole bunch of CV folds, then chooses the best parameters and trains the model one final time on the whole dataset using those parameters. It stores the results of this model so you don’t need to train it again. So finally, you should have an absolute belter of a model stored in the elasticTune object we just created. The object also has a whole bunch of summary information about the model. You can inspect it using the commands below.
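For example:

  elasticTune            # resampled ROC for each alpha/lambda combination
  elasticTune$bestTune   # the winning pair of parameters
  plot(elasticTune)      # ROC across the tuning grid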


We can now test the model against our own internal test set. Assuming that Siamak’s final test set was selected in the same way, this result should be fairly similar to the final score. Note that we’re explicitly removing the IsAlert column from the data frame before we pass it into the prediction, to reduce the chance of leakage. To do the test we will build a data frame with 4 columns:

  • The probability that each line was N
  • The probability that each line was Y
  • The observed value for each line (the truth)
  • Our prediction for each line based on the probabilities above

From this table we can generate the confusion matrix and the ROC AUC.
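Pulling that together, a sketch of the evaluation might look like this (confusionMatrix comes from caret and roc from pROC; the 0.5 cut-off used for the predicted class is an assumption):

  # Score our held-out 20%, with the truth column removed first
  testInputs  <- testing %>% select(-IsAlert)
  testProbs   <- predict(elasticTune, newdata = testInputs, type = "prob", na.action = na.pass)
  testResults <- data.frame(
    N    = testProbs$N,        # probability that each line was N
    Y    = testProbs$Y,        # probability that each line was Y
    obs  = testing$IsAlert,    # the observed value (the truth)
    pred = factor(ifelse(testProbs$Y > 0.5, "Y", "N"), levels = c("N", "Y"))  # our prediction
  )

  confusionMatrix(testResults$pred, testResults$obs)
  roc(testResults$obs, testResults$Y)   # printing the roc object reports the AUC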

Woohoo! It looks good! An ROC AUC of 0.9 is excellent, but we can’t be too cocky yet.

Now we can import Siamak’s final test data and test our model. We’re also going to apply the same transformations to this data as we did to the training data.
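Something along these lines, with the file name as a placeholder and the same illustrative factor columns as before:

  # Read Siamak's test data and apply the same transformations as the training data
  finalTest <- read.csv("C:/path/to/dam_challenge_final_test.csv") %>%
    mutate(
      SomeColumnA = as.factor(SomeColumnA),
      SomeColumnB = as.factor(SomeColumnB),
      IsAlert = factor(ifelse(IsAlert == 1, "Y", "N"), levels = c("N", "Y"))
    )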

And now we’re going to be really careful to prevent leaks, because we don’t want to accidentally cheat on this final test. So we’ll drop the IsAlert column from the test data before we score it, and then print the first 6 rows to prove it’s gone!
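For instance:

  finalTruth <- finalTest$IsAlert            # keep the truth to one side for scoring later
  finalTest  <- select(finalTest, -IsAlert)  # ...and remove it from the frame we'll predict on
  head(finalTest)                            # first 6 rows - no IsAlert column in sight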

So now that we’re sure we aren’t cheating we’ll use the model to predict the results of Siamak’s test data, and then compare these predictions to the truth to generate a confusion matrix and calculate the area under the ROC curve.
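A sketch of that final scoring step, assuming we kept the truth to one side as above (any identifier columns the model formula didn’t use are simply ignored when predicting):

  finalProbs <- predict(elasticTune, newdata = finalTest, type = "prob", na.action = na.pass)
  finalPred  <- factor(ifelse(finalProbs$Y > 0.5, "Y", "N"), levels = c("N", "Y"))

  confusionMatrix(finalPred, finalTruth)
  roc(finalTruth, finalProbs$Y)   # the final ROC AUC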

Woohoo! ROC AUC is 0.844! There are countless ways that this model could be improved, but that can be the topic of another post in future 🙂

I hope this has been useful, and please feel free to leave a comment if you’re having any issues with parts of the script.

7 thoughts on “DAM Challenge – a brief guide to getting 0.85 AUC”

  1. This is a HD quality post! I can’t begin to explain how helpful this is, so thank you very much! I hope others see this too.

    Two questions:
    1. During cross-validation, when you choose 8 folds and repeat 5 times, does this mean the training set is broken down into 8 parts, and you choose 7 sets to train and test on the remaining one, and repeat this 5 times? Why use 8 and 5?

    2. Avoid leakage by removing the response variable. Can you explain that?

    Thanks David

    1. Yessss nice job planting the HD seed for the teaching staff to see, well played PIRATE TEAM member!

      1. Yes, that’s correct. You cut the training set into 8 parts, train on 7 and test on the 8th part. Then you train on a different 7 and test on the one left out, and so on until you’ve done a train/test with each of the 8 parts left out. I chose 8 because when I ran it on my computer I did it in parallel (which didn’t work very well) and I have 8 processors, so it sort of made sense (only sort of). The 5 repeats was just because the tutorial I was reading on some website used 5, and it seemed like a nice round number.

      2. Leakage is a term for when you use information that you shouldn’t have to predict the outcome. It’s normally an accident or oversight, and in this case I just wanted to be really careful that I didn’t put the answer into the model as a predictor, because that would be cheating! Normally it happens through including information that was created after the event you’re trying to predict – for example, if you were trying to predict in March whether or not a customer would default on a loan in April, then including information about their bank balance in May would be a leak (because you wouldn’t actually have the May balance data at the time you were making the prediction).
