SIGCSE 2014: Collecting and Analysing Student Data 1, Paper 2, Thursday 3:15 – 5:00pm (#SIGCSE2014)
Posted: March 7, 2014
Whoo! I nearly burnt out a digit writing up the first talk but it’s a subject close to my heart. I’ll try to be a little more terse for these next two talks.
The second talk in this session was “Blackbox: A Large Scale Repository of Novice Programmers’ Activity” by the amazing Blackbox team at Kent, Neil Brown, Michael Kölling, Davin McCall, and Ian Utting. The Blackbox data is the anonymised student data from students coding into the BlueJ Java programming environment. It’s a rich source of information on how students code and Mark and I have been scheming to do something with the Blackbox data for some time. With Ian and Neil here, it’s a good opportunity to steal their brains. I tried to get Ian to agree to doing all the work but it turns out that he’s been in the game long enough to not say “yes” when someone asks him to without context. (Maybe it’s just me.)
Michael was presenting, with some help from Neil, and reviewed the relationship between Blackbox and BlueJ. BlueJ is an educational programming environment for CS education using Java, dating back to the original Blue in 1996. (For those who don’t know, that’s old for this kind of thing. We should throw it a party.) BlueJ is a graphically operated development environment so novice programmers can drag things out to build programs. It’s a well-established and widely used environment.
(Hey, that means BlueJ is 18. Someone buy BlueJ a beer.)
BlueJ had about 2,000,000 users in 2013, who use it for about three months and then move on (it’s not a production tool, it’s a learning environment). The idea of Blackbox came out of SIGCSE sessions about three years ago, where the research being presented had interesting questions, nice set-ups and good designs but really small student groups. One of our common problems is having enough students to actually do a big study and, frankly, all of us are curious about how students code. (It’s really hard to tell this from the final program, trust me.) So BlueJ has lots of users: can we look at their data and then share this with people?
Of course, the first question is “what do we collect?” Normally, we’d collect what we need to answer a research question but this data was going to be used to support lots of different (and currently unasked) research questions. The community was consulted at SIGCSE in 2012 but there has been an evolution of this over time. There are a lot of things collected – go and look at them in the paper because Michael flicked past that slide! 🙂
From an ethical standpoint, participation is an explicit decision made by the student to have their data collected or not. (This does raise the spectre of bias, especially as all the students must be over 16 for legal reasons.) So it’s opt in and THEN anonymised just to make it totally tasty from an ethical perspective.
Session data is collected for each session: start time, end time, project, path and userID (centrally anonymised for tracking).
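To make that list of fields concrete, here’s a minimal sketch of what one session record could look like and one simple thing you could compute from a pile of them. The field names, the `Session` class and the `total_time_per_user` helper are my own inventions for illustration, not the actual Blackbox schema (go to the paper for that).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Session:
    """One recorded session. Hypothetical layout, not the real Blackbox schema."""
    user_id: str      # anonymised identifier, assumed to be assigned centrally
    project: str
    path: str
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        # Time between the recorded start and end of the session.
        return self.end - self.start


def total_time_per_user(sessions):
    """Sum session durations for each anonymised user."""
    totals = {}
    for s in sessions:
        totals[s.user_id] = totals.get(s.user_id, timedelta()) + s.duration
    return totals
```

Because the identifier is anonymised but stable, you can follow one (unknown) user across sessions without ever knowing who they are, which is the whole point of the “for tracking” part.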
So much for keeping it short, hey? Here’s a quick picture to give you a break.
Other things that can be captured are object creation and invocation among many other useful measures. For me, the fact that you can see how and when students are testing is fascinating, as it allows us to evaluate the whole expectation, observation and reflection scientific cycle in action.
The Blackbox project has already been running for 9 months. The opt-in rate is 40% (higher than I or anyone else expected). This means that there’s data from 250,000 users, recording roughly 11 events per second, over more than 1,000,000 projects and 20,000,000 compilations. What a fantastic resource! Michael then handed over to Neil to talk about the challenges.
Neil talked about tracking users, starting from the problem that one machine profile does not necessarily correspond to one user. Another problem is anonymisation, stripping project paths and the code where possible. You can’t guarantee anonymisation, because people sometimes use their own names as variable or class names, but they do what they can. It’s made clear on opt-in what’s going to happen. Data integrity is another challenge. Is it complete? No – it’s client side and there’s no guarantee of completeness, or even connectivity. But the data that you do have for each session is consistent. So the data is consistent but not complete. If you want, locally, you can tie your local data to the Blackbox data that they have on your students but then ethics becomes your problem. This can be done by Experiment and Participant Identifiers as part of the set-up so your students can be grouped. More example mini analyses are in the paper.
Looking at Error Frequency, Neil talked about certain errors and how their frequency changes over the weeks of 2013 (Semicolon expected, unknown variable). Over time, the syntax errors decreased (suggesting a learning effect) but others stay more constant.
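For the curious, this kind of mini analysis is easy to sketch. The snippet below is my own rough illustration, not the Blackbox team’s code: it assumes you have already pulled out compile errors as `(week_number, error_message)` pairs (a made-up format) and works out what share of each week’s errors each message accounts for. The function name `weekly_error_share` is hypothetical too.

```python
from collections import Counter, defaultdict


def weekly_error_share(events):
    """Return {week: {message: fraction of that week's errors}}.

    `events` is an iterable of (week_number, error_message) pairs.
    This input format is an assumption made for illustration only.
    """
    per_week = defaultdict(Counter)
    for week, message in events:
        per_week[week][message] += 1

    shares = {}
    for week, counts in per_week.items():
        total = sum(counts.values())
        # Normalise by the week's total so busy and quiet weeks are comparable.
        shares[week] = {msg: n / total for msg, n in counts.items()}
    return shares
```

Looking at shares rather than raw counts matters here, because the number of active students changes a lot from week to week. A falling share for a syntax error such as a missing semicolon would be consistent with the learning effect Neil described, although it wouldn’t prove it.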
The data is not completely open: you need to request access as a researcher and sign a privacy and access restriction agreement. Students need not apply! There’s a SIGCSE workshop on this on Saturday but I can’t go as my Puzzle Based Learning workshop is on at the same time. Great resource, go and check it out!
The final talk was “Using CodeBrowser to Seek Difference Between Novice Programmers” by Kenny Heinonen, Kasper Hirvikoski, Matti Luukkainen, and Arto Vihavainen, University of Helsinki,