[DVN] Using Gource to visualise UTS MDSI Slack

I wasn’t satisfied with just saying “check out what Martin did with Gource”, so I had a bit more of a look at what I could do. Firstly, I downloaded the UTS MDSI Slack logs, and worked out how to extract the information I needed. I did a bit of transformation, then exported the file in the appropriate format (code below the video).

I then played around with a few of the settings in Gource to make sure that it doesn’t run too slowly, zooms dynamically, doesn’t get too cluttered, etc, before saving it to file. The command for this was:
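The exact command isn’t reproduced in this post, but an invocation along the following lines covers those settings (the resolution, timing values, log file name and output pipeline are illustrative choices rather than the original ones), piping Gource’s frame output into ffmpeg to save a video:

```sh
gource slack_gource.log --log-format custom \
    -1280x720 --seconds-per-day 0.5 --auto-skip-seconds 1 \
    --hide filenames --output-framerate 30 -o - | \
  ffmpeg -y -f image2pipe -vcodec ppm -i - \
    -vcodec libx264 -pix_fmt yuv420p slack_gource.mp4
```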

And now the video!
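As for the extraction and transformation code mentioned above, it isn’t preserved in this post, but a rough sketch of the approach is below. It assumes the standard Slack export layout (one folder per channel containing daily JSON files with ts and user fields) and uses jsonlite and dplyr; the paths, the channel/user mapping and the choice to mark every message as an “A” (add) event are all illustrative assumptions.

```r
library(jsonlite)  # assumption: jsonlite is used to parse the Slack export
library(dplyr)

# A Slack export is typically one folder per channel, each holding a JSON file per day.
json_files <- list.files("slack_export", pattern = "\\.json$",
                         recursive = TRUE, full.names = TRUE)

messages <- bind_rows(lapply(json_files, function(f) {
  msgs <- fromJSON(f, flatten = TRUE)
  if (length(msgs) == 0 || is.null(msgs$ts) || is.null(msgs$user)) return(NULL)
  data.frame(timestamp = floor(as.numeric(msgs$ts)),
             user      = msgs$user,
             channel   = basename(dirname(f)),
             stringsAsFactors = FALSE)
}))

# Gource's custom log format is pipe-delimited: timestamp|user|A|path.
# Treating each user as a "file" inside their channel gives a sensible layout.
gource_log <- messages %>%
  filter(!is.na(user)) %>%
  arrange(timestamp) %>%
  mutate(line = paste(timestamp, user, "A", paste0(channel, "/", user), sep = "|"))

writeLines(gource_log$line, "slack_gource.log")
```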

 

Four Verbs

Looking at the black-box that is the algorithm for my grade in 36106 Data Algorithms and Meaning, I can see that I am required to provide some inputs based on four verbs – this post will be structured accordingly.

Reflect on your contributions to your group project in Assignment 2

I think that my contributions to the group project fall into three categories:

Teaching – Some of our team members were fairly new to the topic of machine learning, and I spent a fair bit of effort both teaching them directly and coaching them on what to study next. I was really pleased to see how much they learned throughout the course, and whilst the vast majority of this learning is due to their own efforts and aptitude, I found the process of teaching them quite rewarding (especially considering it enabled them to magnify their contributions to the assignment!).

Processes, tools and pipelines – In my opinion (and the opinion of many of my colleagues) it is far more important to understand the processes around a machine learning algorithm than it is to understand the algorithm itself. The use of train/test splits, cross-validation, parameter searches, regularisation, resampling etc. often makes the difference between a useless model and a model that drives real value. In our group assignment I established robust and repeatable machine learning pipelines to ensure that the models could be fairly assessed and that the findings were comparable between different modelling approaches. I also set up and administered high performance computing environments for the team as needed.

Developing and modelling – In addition to setting up the processes and pipelines I also developed a regularised logistic regression model using the Elastic Net algorithm. This model uses a mix of L1 and L2 regularisation to optimise the bias-variance trade-off and avoid overfitting.

Select one insight (new understanding, realisation) that helps explain how you generate meaning from data in complex situations.

One insight that I gained through this project was the value inherent in feature importance scores. Most popular models have a number of approaches for objectively scoring and ranking features, and tools such as caret (an R package) make it easy to calculate these feature importance scores for each trained model. By calculating importance for each feature using a number of different models, you can objectively identify features that are important for all models, features that are unimportant for all models, and features whose importance varies between methods.
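As a small illustration (not code from the assignment itself), comparing caret’s importance scores across two hypothetical trained models might look like this:

```r
library(caret)

# rfFit and glmFit are placeholders for models already trained with caret::train().
# varImp() rescales importance to a 0-100 scale by default, which makes the two
# rankings easier to compare side by side.
rf_imp  <- varImp(rfFit)$importance
glm_imp <- varImp(glmFit)$importance

# Join the two rankings on feature name: features scoring highly in both columns
# are robustly important, while features near zero in both are candidates to drop.
comparison <- merge(rf_imp, glm_imp, by = "row.names", suffixes = c(".rf", ".glm"))
comparison[order(-comparison$Overall.rf), ]
```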

Whilst this doesn’t necessarily help improve accuracy, it helps greatly when using machine learning to develop understanding within a business. Even if the end-goal of the modelling is to create a highly predictive model, most businesses are not yet willing to accept predictive models without an understanding of what is going on inside the black box. Being able to supplement your advanced models with an understanding of which features contributed to the accuracy of those models is critical when developing data maturity within an organisation.

Analyse the challenges and impact of algorithms on your meaning-making, using your reading and online forum discussions.

I am interpreting this reflection prompt as asking how algorithms (in particular advanced algorithms) both help and hinder my ability to distil meaning from data.

Firstly, I will consider how algorithms help me find meaning from data. In particular, I will consider how algorithms can be used to develop models, where a model is defined as “a simplified description, especially a mathematical one, of a system or process, to assist calculations and predictions” (Oxford Dictionary). With this definition it is fairly easy to show how an algorithm helps with meaning-making: the algorithm helps to develop a simplified description of a system or process. Who doesn’t love simplification?! These models can be used to inform business understanding, and they can be used to predict what might happen in the future.

Now for the ways in which algorithms can hinder meaning-making. Firstly, an algorithm can only help a model to generalise based on the data it is given. If you collect and sample data in a specific way, then the model can (at best) only help you to predict new data which is collected and sampled in the same way. An example of this is using transport card (e.g. Opal) data to develop a model which predicts how many people will be catching the train tomorrow morning. This model will not account for people who start or stop commuting by train tomorrow. It will not account for people who have moved house. It will not account for people who use single-use Opal tickets, and it will not account for paper ticket users. Whilst you may be able to predict the number of Opal cards using the train with reasonable accuracy, this does not necessarily translate to being able to predict the number of passengers on the train.

Secondly, an algorithm can only make generalisations based on the features available. Too much focus on developing a better model (or a better algorithm) could detract from efforts to improve the features available for modelling. It is not hard to conceive of situations where a subject matter expert could construct additional features with significant predictive power – these features would have a far greater impact on accuracy than the development of a better model. This is even more important for finding meaning in the dataset – features derived by a domain expert can contribute far more to business understanding than an advanced black box model.

Speculate on the implications for future professional practice (for example, ‘what if’s, lessons from key current debates, examples from your own practice).

I believe that whilst the bulk of development in machine learning is focused on improving algorithms, this will not help businesses make meaning from data. The development of robust and flexible machine learning processes – potentially through the development of industry standards – will be far more important, for two reasons:

  1. In most cases where businesses attempt to implement machine learning, they would get better performance from improving their algorithm tuning processes than they would from using a more powerful algorithm.
  2. In cases where machine learning models are trusted blindly without an appropriate test and validation framework, the potential for negative business outcomes is huge. Data Science needs to move quickly to create sustainable business value, otherwise it risks a collapse into the trough of disillusionment. Development of appropriate frameworks is critical for ensuring that models support business value!

If I had to pick a single example of such a development, I would point to the recent advances in marketing using attribution modelling. Attribution modelling is about answering the question “did my intervention cause the change in my observations?” – for example, determining whether a specific marketing campaign drove an increase in sales. There are a number of companies experiencing increasing success in this space – Ambiata and Datalicious are two that come to mind. They are seeing a sustainable competitive advantage through helping other companies to assess and verify the impact of their own algorithmic endeavours.

DAM Challenge – a brief guide to getting 0.85 AUC

This post will explain the approach that the PIRATE TEAM used to achieve an Area Under Curve of 0.85 on the provided test set in the DAM Data Challenge on Saturday 7th May. The code and the results are provided along with the explanation, so you should be able to run the code yourself.

Firstly, you’ll need to install R, and I’d recommend installing RStudio as well. The installation is outside the scope of this guide, but a quick google gave me this link that should point you in the right direction: LINK (at the time of writing the latest version of R is 3.2.5 – this is the version you should install)

With R installed, you’ll need to install a bunch of packages. Run the code below to install all the packages required for this guide (some of these will automatically trigger the installation of several other packages). It will probably take a little while to download and install all of them.
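The original install snippet isn’t shown here, but it would have looked something like the line below. The exact package list is my reconstruction: caret, glmnet and dplyr are the ones described next, while pROC and e1071 are added as a precaution because caret leans on them for the ROC metric and some model utilities.

```r
# One-off installation; this will pull in a number of dependencies automatically.
install.packages(c("caret", "glmnet", "dplyr", "pROC", "e1071"))
```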

If you get any errors, use your googley skills to get those packages installed. You won’t be able to continue without them! Once they’re installed, load the packages.
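Loading them is just a set of library() calls (pROC is my addition, for the AUC calculations later on):

```r
library(caret)   # model training, tuning and evaluation helpers
library(glmnet)  # the Elastic Net algorithm
library(dplyr)   # data manipulation verbs and the %>% operator
library(pROC)    # ROC curves and AUC calculations
```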

It’s probably worth quickly explaining what each of these packages does.

  1. The caret package is a modelling package which helps to automate a lot of the code you would normally have to write yourself. Models in R have all sorts of different syntax and outputs, and caret helps to abstract a lot of those details away so that you can focus on solving the problem.
  2. The glmnet package provides the machine learning algorithm we’re going to use, which is called Elastic Net. This algorithm is pretty much just a fancy GLM (which means we can do Logistic Regression) with both L1 and L2 regularisation. Regularisation is probably an explanation too far for this guide, but it’s very well suited to problems where you want to avoid overfitting.
  3. The dplyr package is one of my favourite things about R. It makes data manipulation really simple through the use of verb-inspired functions (select, mutate, filter, etc) and the chain operator (%>%). Hadley Wickham (the author of the dplyr package) has put together the definitive guide here: http://r4ds.had.co.nz/transform.html

Now to start with the data! Firstly, import the data (you’ll need to replace the path below with the location of the file on your own computer).
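A minimal version of the import (the file path and object name here are placeholders; I don’t have the original path):

```r
# Replace the path with wherever you saved the challenge training data.
dam_data <- read.csv("path/to/dam_training_data.csv", stringsAsFactors = FALSE)

str(dam_data)  # quick sanity check of the column types
```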

When we looked at the data (not covered in this guide) we realised that a few of the columns would be better off as factors (i.e. categories) rather than integers or numerics. We used some functions from the dplyr package to transform these columns using the as.factor function, and recoded the IsAlert column as Y/N (this makes things a bit easier because a Y/N cannot be misinterpreted as a number by any of the models later on).
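The post doesn’t list the actual column names (other than IsAlert), so the snippet below uses placeholder names to show the shape of the transformation:

```r
# P8, E3 and V5 are placeholders - substitute the columns you identified as
# categorical when exploring the data.
dam_data <- dam_data %>%
  mutate(P8 = as.factor(P8),
         E3 = as.factor(E3),
         V5 = as.factor(V5),
         # Recode the target as Y/N so it can't be mistaken for a number.
         IsAlert = factor(ifelse(IsAlert == 1, "Y", "N"), levels = c("N", "Y")))
```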

Now we will partition the data into an 80/20 train/test split using the createDataPartition function from the caret package. This is a bit safer than using a purely random sample as it takes a “stratified” sample, which ensures that both the train and test set have the same proportion of Y and N.
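A sketch of the split (the seed value and object names are mine):

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# createDataPartition samples within each level of IsAlert, so the Y/N
# proportions are the same in both the training and test sets.
train_index <- createDataPartition(dam_data$IsAlert, p = 0.8, list = FALSE)
dam_train <- dam_data[train_index, ]
dam_test  <- dam_data[-train_index, ]
```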

Now we can start the modelling. Elastic Net has two parameters – alpha and lambda. If I were a magician I could just guess these; however, I’m not a magician and my guess would be terrible. Instead of guessing, you can split the training data again to make a new “validation” set, and “tune” the model using this set.

To tune a model, you train the model multiple times with different parameters (using the training set) and use the validation set to evaluate the performance of the model with those parameters. As an example, you could choose 4 different values of alpha and 4 values of lambda, which would give you 16 different combinations of parameters.

How do you evaluate the model for these iterations? Given that we’re being assessed on ROC (more correctly the area under the ROC curve) we can just compare the ROC scores for each set of parameters! Use each iteration of the model to predict against the validation set and choose the parameters which give the highest ROC score against the validation set.

Going one step further, we can take multiple validation sets, using a process known as cross-validation (CV). As an example, for 5-fold CV you would cut the data into 5 roughly equal sets. You then “hold out” the first set as the validation fold (a fold is just a slice of data), train your model using folds 2-5, and measure the ROC score on the fold that you held out. Then you hold out the second fold, train using folds 1 and 3-5, and measure the ROC of the prediction against the 2nd fold. In this way you get 5 estimates of the ROC for each pair of parameters, which gives you a much more robust estimate and helps avoid inadvertent overfitting. So if we’re testing 4 values for each of the two parameters, then we’re running 4x4x5 (80) models to find the best one.

One final step further, for increased robustness, you can do CV multiple times! So in the code below, we’re going to run 8-fold CV repeated 5 times.

Now for the code below. This code sets up the “training control” for the caret package. Firstly, we’re setting a seed for the random number generator so that we get the same random folds each time, which makes the results reproducible. We specify that we’re going to use the “repeatedcv” approach, with 8 folds and repeating 5 times. We’re then going to use the default parameter grid – caret has a default grid of parameters to test for each algorithm, and I assume that it’s better than my choices. We’re telling it to return the probabilities (not just the predictions) because we’ll need those for the ROC curve, and telling it to use the twoClassSummary function (this is how it calculates the ROC). Finally, we’re telling it to be “verbose” when it runs, which just means it will provide status updates so we know it’s running.
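Reconstructed along the lines described above (the object name and seed value are mine), the training control looks roughly like this:

```r
set.seed(1234)  # so the CV folds are the same each time the script is run

fit_control <- trainControl(
  method          = "repeatedcv",     # repeated k-fold cross-validation
  number          = 8,                # 8 folds
  repeats         = 5,                # repeated 5 times
  classProbs      = TRUE,             # keep class probabilities for the ROC curve
  summaryFunction = twoClassSummary,  # report ROC, sensitivity and specificity
  verboseIter     = TRUE              # print progress while training
)
```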

Right! Now that it’s all ready to go, we can run the CV. This will take AGES because we’re running 8-fold CV 5 times (40 models for each combination of parameters in the default grid) so you’ll want to get a drink or something.

The code below tells caret to build a model which predicts IsAlert using every other column in the data as a predictor (except for columns 1 and 2, which I have removed). It also specifies glmnet as the algorithm, ROC as the evaluation metric, and tells the model to allow NAs (missing data) to pass through to the preProcess step. This is a really cool feature of caret: it can automatically apply standard pre-processing steps for you! In the example below I’ve asked it to automatically apply centering and scaling, and to impute any missing data (NAs) using the median value for that field. Finally, we pass in the training control options we set above.
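A sketch of that call, carrying over the object names from the earlier snippets (elasticTune matches the object referred to below; the data frame and control names are mine):

```r
elasticTune <- train(
  IsAlert ~ .,                        # predict IsAlert from every remaining column
  data       = dam_train[, -(1:2)],   # drop the first two columns, as described above
  method     = "glmnet",              # Elastic Net
  metric     = "ROC",                 # tune on area under the ROC curve
  na.action  = na.pass,               # let NAs through to the preProcess step
  preProcess = c("center", "scale", "medianImpute"),
  trControl  = fit_control
)
```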

Note that I’ve suppressed the output for the purpose of this post, but it’s going to give you a lot of information about what it’s doing.

If you read the output from this step, you’ll see that it tries a whole bunch of models on a whole bunch of CV folds, then chooses the best parameters and trains the model one final time on the whole dataset using those parameters. It stores the results of this model so you don’t need to train it again. So finally, you should have an absolute belter of a model stored in the elasticTune object we just created. The object also contains a whole bunch of summary information about the model. You can inspect it using the commands below.
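Some representative ways to inspect it (not necessarily the exact commands from the original post):

```r
elasticTune            # summary of the cross-validation results
elasticTune$bestTune   # the winning alpha/lambda combination
plot(elasticTune)      # ROC across the parameter grid
varImp(elasticTune)    # feature importance for the final model
```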


We can now test the model against our own internal test set. Assuming that Siamak’s final test set was selected in the same way, this result should be fairly similar to the final result. Note that we’re explicitly removing the IsAlert column from the data frame before we pass it into the prediction, to reduce the chance of leakage. To do the tests we will build a data frame with 4 columns:

  • The probability that each line was N
  • The probability that each line was Y
  • The observed value for each line (the truth)
  • Our prediction for each line based on the probabilities above

From this table we can generate the confusion matrix and the ROC AUC.
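Putting that together might look like the sketch below (object names carried over from earlier snippets; the 0.5 probability threshold for the hard prediction is my assumption):

```r
# Class probabilities for the held-out test set. IsAlert is dropped from the
# new data first so the truth can't leak into the prediction.
test_probs <- predict(elasticTune,
                      newdata   = dam_test[, setdiff(names(dam_test), "IsAlert")],
                      type      = "prob",
                      na.action = na.pass)

test_results <- data.frame(
  prob_N = test_probs$N,       # probability that each line was N
  prob_Y = test_probs$Y,       # probability that each line was Y
  obs    = dam_test$IsAlert,   # the observed value (the truth)
  pred   = factor(ifelse(test_probs$Y > 0.5, "Y", "N"), levels = c("N", "Y"))
)

confusionMatrix(test_results$pred, test_results$obs, positive = "Y")
roc(test_results$obs, test_results$prob_Y)  # prints the area under the ROC curve
```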

Woohoo! It looks good! 0.9 ROC (area under curve) is excellent but we can’t be too cocky yet.

Now we can import Siamak’s final test data and test our model. We’re also going to apply the same transformations to this data as we did to the training data.
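Again with a placeholder path and placeholder column names, the import and recoding mirror what we did to the training data:

```r
# Point this at Siamak's final test file.
final_test <- read.csv("path/to/dam_final_test_data.csv", stringsAsFactors = FALSE)

# Same recoding as the training data (placeholder column names again).
final_test <- final_test %>%
  mutate(P8 = as.factor(P8),
         E3 = as.factor(E3),
         V5 = as.factor(V5),
         IsAlert = factor(ifelse(IsAlert == 1, "Y", "N"), levels = c("N", "Y")))
```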

And now we’re going to be really careful to prevent leaks, because we don’t want to accidentally cheat on this final test. We’ll remove the IsAlert column before predicting, and then print the first 6 rows to prove it’s gone!
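Something along these lines (object names are mine):

```r
# Keep the truth to one side for scoring, then drop it from the frame we predict on.
final_truth <- final_test$IsAlert
final_predictors <- final_test[, setdiff(names(final_test), "IsAlert")]

head(final_predictors)  # first 6 rows - no IsAlert column in sight
```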

So now that we’re sure we aren’t cheating we’ll use the model to predict the results of Siamak’s test data, and then compare these predictions to the truth to generate a confusion matrix and calculate the area under the ROC curve.
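The scoring step mirrors the internal test above (same caveats about object names and the 0.5 threshold):

```r
final_probs <- predict(elasticTune, newdata = final_predictors,
                       type = "prob", na.action = na.pass)
final_pred  <- factor(ifelse(final_probs$Y > 0.5, "Y", "N"), levels = c("N", "Y"))

confusionMatrix(final_pred, final_truth, positive = "Y")
roc(final_truth, final_probs$Y)  # area under the ROC curve on the final test set
```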

Woohoo! ROC AUC is 0.844! There are countless ways that this model could be improved, but that can be the topic of another post in future 🙂

I hope this has been useful, and please feel free to leave a comment if you’re having any issues with parts of the script.

 

 

A collection of resources for the caret package

I’ve been trying to get up and running with the caret package for R and found that there is quite a shortage of good materials. I’m planning to write some tutorials for getting started with caret once I get a bit more comfortable with it, but for now I’ll just post a collection of useful resources. Most of these resources are produced by Max Kuhn, which is hardly surprising as he is the package author and maintainer. Is Max Kuhn to machine learning in R as Hadley Wickham is to everything else in R? I’ll leave that one hanging.

Caret Package Webinar – link
This is a webinar presented by Max Kuhn from February 2014. It covers the motivation for the package as well as quite a bit of R code to set up a machine learning algorithm using the package. You can download the slide deck here.

Machine Learning with R: An Irresponsibly Fast Tutorial – link
This is the only resource on this page which was not produced by Max Kuhn. It provides an excellent worked example which you can submit to Kaggle.

Interview with Max Kuhn – link
Another video with Max Kuhn explaining some of the motivation and intentions behind caret.

Presentation at useR 2013 – link
This is unsurprisingly also by Max Kuhn. It is a longer and more detailed slide deck than the first link on this page, and it includes a different example which shows a few additional features. As it’s a bit older it also does things a bit differently to how you would do them in the latest version of caret.

Applied Predictive Modelling – link
This is unfortunately the only non-free resource. The book comes highly regarded and I’ll probably end up buying it at some point. It’s written by Max Kuhn and talks about the theory of predictive modelling as well as providing some examples in R and caret.

The caret website – link
This looks like an excellent technical resource once you’re up and running with caret but it doesn’t really help you get your head around what it’s doing and how to use it.

Max Kuhn’s Blog – link
I can’t believe how much material there is here. I haven’t managed to watch any of the videos yet but this looks like an awesome resource.

Building Predictive Models in R Using the caret Package – link
This is an academic paper, also written by Max Kuhn, from around the time caret was added to CRAN. The package has gained a bunch of new features since this paper was written, so it’s probably not the go-to resource it once was.