Big Data, Big Problems
Posted: April 26, 2012
My new PhD student joined our research group on Monday last week (Hi, T) and we’ve already tried to explode his brain by discussing every possible idea that we’ve had about his project area: ideas that we developed over the last year but have presented to him in a single week.
He’s still coming to meetings, which is good, because it means that he’s not dead yet. The ideas that we’re dealing with are fairly interesting and build upon some work that I’ve spoken about earlier, where we’ve looked at student data that we happen to have to see if we can determine other behaviours, predict GPA, or get an idea of the likelihood of the student completing their studies.
Our pilot research study is almost written up for submission this Sunday but, like all studies that are conducted after the collection time, we only have the data that was collected rather than the ideal set of data that we would like to collect. That’s one of the things that we’ve given T to think about – what is the complete set of student data that we could collect if we could collect everything?
If we could collect everything, what would be useful? What is duplicated within the collection set? Which of these factors have an impact on things that we care about, like student participation, engagement, level of achievement and development of discipline skills? How can I collect and store the data so that I can not only look at it in light of today’s thinking but also, twenty years from now, completely re-evaluate the data set in different frameworks?
There’s a lot of data out there, there are many ways of collecting it, and there are lots of projects in operation. But there are also lots and lots of problems: correlations to find, factors to exclude, privacy and ethical considerations to take into account, storage systems to wrestle with and, at the end of the day, a giant validation issue to make sure that what we’re doing is fundamentally accurate and useful.
I’ve written before about the data deluge but, even when we restrict our data crawling to one small area, it’s sometimes easy to lose track of how complicated our world is and how many pieces of data we can collect.
Fortunately, or unfortunately, for T, there are many good and bad examples to look at, many studies that didn’t quite achieve what was wanted, and a lot of space for him to explore and define his own research. Now if I could only put aside that much time for my own research.