How Do I Model The Students Who Leave?
Posted: March 23, 2012
This is a quick note on one of the problems I face in trying to analyse student data: dealing with students who are only in the system so briefly that I can’t capture much data on them. In my other educational research work I can look at student behaviour in terms of final grades and on-time assignment submission but, in order to try and see the impact of what we’re doing on behaviour, I really have to be able to capture data before and after a change. I then have to try and eliminate all other factors to find a correlation that looks like it’s significant.
In yesterday’s post, I didn’t mention one of the issues that the Baldwin-Wallace researchers noted: how to deal with students who gave some initial data and then left the system. How do you incorporate these students in a way that allows you to infer behaviour without introducing the spectre of bias because you’ve inserted dummy data into your system? They had discussed adding another grade type, W or PW, that would allow them to keep students who had left the program early in their data. Can you spot the situation that will lead to people leaving early, and can we predict withdrawal from the course based on earlier performance?
I face the same problem in a lot of my assignment submission data. I have 17,000 students in the initial dataset but, after cleaning and removing students who withdraw, that shrinks a lot. Regrettably, this also removes the students that I really want to work with: those who have withdrawn. We use a binary notation as an overview for on-time and late submission, so extending the sequence is straightforward, but any time we extend the sequence we have to justify it very, very well to make sure that we haven’t introduced too much noise or bias.
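To make the idea of extending the binary notation concrete, here is a minimal sketch. It assumes a history is written as a string with '1' for on-time and '0' for late, and it introduces a hypothetical third symbol, 'W', for assignments that fall after a student has withdrawn. The function name, the symbol, and the sample histories are illustrative assumptions, not the notation actually used in my dataset.

```python
# Assumed symbols: '1' = on time, '0' = late.
# 'W' is a hypothetical extension marking assignments that fall
# after the student has withdrawn (no submission was possible).
ON_TIME, LATE, WITHDRAWN = "1", "0", "W"


def encode_history(submissions, total_assignments):
    """Encode a student's submission record as a fixed-length string.

    submissions: list of booleans (True = on time, False = late), one per
        assignment the student was still enrolled for, in order.
    total_assignments: number of assignments in the whole course.
    """
    if len(submissions) > total_assignments:
        raise ValueError("more submissions than assignments")
    observed = "".join(ON_TIME if on_time else LATE for on_time in submissions)
    # Pad with the explicit withdrawal marker rather than with dummy
    # '0' or '1' values, so no submission behaviour is invented.
    padding = WITHDRAWN * (total_assignments - len(submissions))
    return observed + padding


print(encode_history([True, True, False, True], 4))  # 1101
print(encode_history([True, False], 4))              # 10WW
```

Padding with an explicit marker keeps withdrawn students in the data without pretending they submitted late or on time, but it is still an extension of the sequence and needs the same justification as any other.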
There are a lot of good existing techniques and, of course, Bayesian analysis is once again our friend in many ways, but I’m now looking at machine learning to provide a very simple two-component partitioning – can I learn to predict who will be in the incomplete group and who won’t? I have to do something about the ‘length’ of the submission history or the most obvious thing the machine will probably learn is that ‘short history == fail’. I’m looking forward to getting onto this research in the very near future, especially if it can give me insight into those students who are only with us for a short time. I really need a tool and a model that will work within the first 2-3 weeks – it’s a challenge but a fun one.
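One possible way to stop the learner from keying on history length is to give it only a fixed-size early window of each history. The sketch below does that and then fits a tiny naive Bayes classifier by hand. Everything in it is an assumption for illustration: the window of three assignments, the function names, and the choice of naive Bayes are not a description of the model I will end up using.

```python
import math

WINDOW = 3  # hypothetical: assignments visible in the first 2-3 weeks


def early_features(history, window=WINDOW):
    """Keep only the first `window` flags (1 = on time, 0 = late).

    Every student is described by the same number of features, so the
    length of the full history cannot leak into the model. Returns None
    if the student has fewer observations than the window.
    """
    if len(history) < window:
        return None
    return tuple(history[:window])


def train(rows, window=WINDOW):
    """rows: iterable of (history, completed) pairs, completed is a bool."""
    # Laplace smoothing: one pseudo on-time and one pseudo late
    # observation per class, so no probability is exactly 0 or 1.
    on_time = {True: [1] * window, False: [1] * window}
    totals = {True: 2, False: 2}
    for history, completed in rows:
        feats = early_features(history, window)
        if feats is None:
            continue  # too little data even for the early window
        totals[completed] += 1
        for i, flag in enumerate(feats):
            on_time[completed][i] += flag
    return on_time, totals


def prob_incomplete(model, history, window=WINDOW):
    """Estimated probability that the student ends up in the incomplete group."""
    on_time, totals = model
    feats = early_features(history, window)
    if feats is None:
        raise ValueError("history shorter than the early window")
    n = totals[True] + totals[False]
    log_score = {}
    for label in (True, False):
        score = math.log(totals[label] / n)  # class prior
        for i, flag in enumerate(feats):
            p_on = on_time[label][i] / totals[label]
            score += math.log(p_on if flag else 1.0 - p_on)
        log_score[label] = score
    top = max(log_score.values())
    weight = {k: math.exp(v - top) for k, v in log_score.items()}
    return weight[False] / (weight[True] + weight[False])


# Made-up histories, purely to show the calls.
rows = [
    ([1, 1, 1, 1, 1], True),
    ([1, 1, 0, 1, 1, 1], True),
    ([1, 0, 1, 1], True),
    ([0, 0, 1], False),
    ([0, 1, 0, 0], False),
    ([1, 0, 0], False),
]
model = train(rows)
print(prob_incomplete(model, [0, 0, 0]))
print(prob_incomplete(model, [1, 1, 1]))
```

Because only the first `WINDOW` flags are ever read, a long history and a short one with the same opening receive the same score. The cost is that students who leave before the window closes are skipped, which is exactly the group that still needs a better answer.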