Four Verbs

Looking at the black box that is the algorithm for my grade in 36106 Data Algorithms and Meaning, I can see that I am required to provide some inputs based on four verbs – this post will be structured accordingly.

Reflect on your contributions to your group project in Assignment 2

I think that my contributions to the group project fall into three categories:

Teaching – Some of our team members were fairly new to the topic of machine learning, and I spent a fair bit of effort both teaching them directly and coaching them on what to study next. I was really pleased to see how much they learned throughout the course, and whilst the vast majority of this learning is due to their own efforts and aptitude, I found the process of teaching them quite rewarding (especially considering it enabled them to magnify their contributions to the assignment!).

Processes, tools and pipelines – In my opinion (and the opinion of many of my colleagues) it is far more important to understand the different processes around the machine learning algorithm than it is to understand the algorithm itself. The use of train/test splits, cross-validation, parameter searches, regularisation, resampling etc. often makes the difference between a useless model and a model that drives real value. In our group assignment I established robust and repeatable machine learning pipelines to ensure that the models could be fairly assessed and compared, and that the findings were meaningful and comparable between different modelling approaches. I also set up and administered high performance computing environments for the team as needed.
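To make the idea concrete, here is a minimal sketch (not the assignment code, and on a synthetic dataset) of that kind of repeatable pipeline in Python with scikit-learn: a held-out test set, preprocessing and model bundled together, and a cross-validated search over the regularisation strength.

```python
# Minimal, illustrative pipeline: train/test split + cross-validated
# parameter search over regularisation strength. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out a test set so the final comparison between models is fair.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling and modelling live in one Pipeline, so cross-validation
# re-fits the scaler inside each fold and never leaks information.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Cross-validated search over the regularisation strength C.
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

print("best C:", search.best_params_["clf__C"])
print("held-out accuracy:", search.score(X_test, y_test))
```

Because the whole pipeline is a single object, the same procedure can be re-run for each candidate model, which is what makes results comparable between approaches.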

Developing and modelling – In addition to setting up the processes and pipelines I also developed a regularised logistic regression model using the Elastic Net algorithm. This model uses a mix of L1 and L2 regularisation to optimise the bias-variance trade-off and avoid overfitting.
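The elastic net model can be sketched as follows; this is an illustration in Python/scikit-learn on synthetic data, not the group's actual model. The `l1_ratio` parameter controls the mix of L1 and L2 penalties.

```python
# Illustrative elastic-net logistic regression: a mix of L1 and L2
# regularisation. l1_ratio=1.0 is pure L1; 0.0 is pure L2.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=5, random_state=0)

# The "saga" solver is required for the elasticnet penalty.
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=0.5, max_iter=5000)
model.fit(X, y)

# The L1 component can drive uninformative coefficients to exactly
# zero, reducing variance and guarding against overfitting.
n_zero = int(np.sum(model.coef_ == 0))
print(f"{n_zero} of {model.coef_.size} coefficients shrunk to zero")
```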

Select one insight (new understanding, realisation) that helps explain how you generate meaning from data in complex situations.

One insight that I gained through this project was the value inherent in feature importance scores. Most popular model types offer objective ways of scoring and ranking features, and tools such as caret (an R package) make it easy to calculate these feature importance scores for each trained model. By calculating importance for each feature under a number of different models you can objectively identify features which are important across all models, features which are unimportant for all models, and features which only become important under particular methods.
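The cross-model comparison can be sketched like this (a Python equivalent of the caret workflow described above, on synthetic data): rank features under two different model families and look at the overlap.

```python
# Illustrative cross-model feature importance: compare rankings from a
# tree ensemble and a linear model to find consensus features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=3, n_redundant=0, random_state=1)

# Importance according to a tree ensemble.
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
rf_rank = np.argsort(rf.feature_importances_)[::-1]

# Importance according to a linear model: absolute standardised
# coefficient magnitudes.
Xs = StandardScaler().fit_transform(X)
lr = LogisticRegression(max_iter=1000).fit(Xs, y)
lr_rank = np.argsort(np.abs(lr.coef_[0]))[::-1]

# Features ranked in the top 3 by both models are robustly important.
consensus = set(rf_rank[:3]) & set(lr_rank[:3])
print("important under both models:", sorted(consensus))
```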

Whilst this doesn’t necessarily help improve accuracy, it helps greatly when using machine learning to develop understanding within a business. Even if the end-goal of the modelling is to create a highly predictive model, most businesses are not yet willing to accept predictive models without an understanding of what is going on inside the black box. Being able to supplement your advanced models with an understanding of which features contributed to the accuracy of those models is critical when developing data maturity within an organisation.

Analyse the challenges and impact of algorithms on your meaning-making, using your reading and online forum discussions.

I am interpreting this reflection prompt to be asking about how algorithms (in particular advanced algorithms) both help and hinder my ability to distil meaning from data.

Firstly, I will consider how algorithms help me find meaning from data. In particular, I will consider how algorithms can be used to develop models, where a model is defined as “a simplified description, especially a mathematical one, of a system or process, to assist calculations and predictions” (Oxford Dictionary). With this definition it is fairly easy to show how an algorithm helps with meaning-making: the algorithm helps to develop a simplified description of a system or process. Who doesn’t love simplification?! These models can be used to inform business understanding, and they can be used to predict what might happen in the future.

Now for the ways in which algorithms can hinder meaning-making. Firstly, an algorithm can only help a model to generalise based on the data it is given. If you collect and sample data in a specific way, then the model can (at best) only help you to predict new data which is collected and sampled in the same way. An example of this is using transport card (e.g. Opal) data to develop a model which predicts how many people will be catching the train tomorrow morning. This model will not account for people who start or stop commuting by train tomorrow. It will not account for people who have moved house. It will not account for people who use single-use Opal tickets, and it will not account for paper ticket users. Whilst you may be able to predict the number of Opal cards using the train with reasonable accuracy, this does not necessarily translate to being able to predict passengers on the train.

Secondly, an algorithm can only make generalisations based on the features available. Too much focus on developing a better model (or a better algorithm) could detract from efforts to improve the features available for modelling. It is not hard to conceive of situations where a subject matter expert could construct additional features with significant predictive power – these features would have far greater impact on accuracy than development of a better model. This is even more important for finding meaning in the dataset – features derived by a domain expert can contribute far more to business understanding than an advanced black box model.
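As a toy illustration of expert-derived features (hypothetical column names, tiny made-up data): a transport domain expert knows that peak-hour travel behaves differently, and can encode that directly as a feature that a raw timestamp does not expose to a simple model.

```python
# Illustrative domain-expert feature engineering on hypothetical
# tap-on data: derive "hour" and a peak-hour flag from a timestamp.
import pandas as pd

trips = pd.DataFrame({
    "tap_on_time": pd.to_datetime([
        "2017-05-01 08:10", "2017-05-01 13:45", "2017-05-01 17:30"]),
})

# Expert-derived features: hour of day and a peak-hour flag.
trips["hour"] = trips["tap_on_time"].dt.hour
trips["is_peak"] = trips["hour"].isin([7, 8, 9, 16, 17, 18])

print(trips[["hour", "is_peak"]])
```

Features like `is_peak` are both predictive and immediately interpretable to the business, which is exactly the dual benefit described above.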

Speculate on the implications for future professional practice (for example, ‘what if’s, lessons from key current debates, examples from your own practice).

I believe that whilst the bulk of development in machine learning is focused on improving algorithms, this will not help businesses make meaning from data. The development of robust, flexible machine learning processes – potentially through the development of industry standards – will be far more important. This is important for two reasons:

  1. In most cases where businesses attempt to implement machine learning, they would get better performance from improving their algorithm tuning processes than they would from using a more powerful algorithm.
  2. In cases where machine learning models are trusted blindly without an appropriate test and validation framework, the potential for negative business outcomes is huge. Data Science needs to move quickly to create sustainable business value, otherwise it risks a collapse into the trough of disillusionment. Development of appropriate frameworks is critical for ensuring that models are supporting business value!

If I had to pick a single example of such a development, I would point to the recent advances in marketing using attribution modelling. Attribution modelling is about answering the question “did my intervention cause the change in my observations?”, for example determining whether a specific marketing campaign drove an increase in sales. There are a number of companies experiencing increasing success in this space – Ambiata and Datalicious are two that come to mind. They are seeing a sustainable competitive advantage through helping other companies to assess and verify the impact of their own algorithmic endeavours.