SIGCSE 2014: Automated Assessment Session, Thursday 10:45-12:00
Posted: March 7, 2014 Filed under: Education | Tags: automated assessment, education, education research, higher education, learning, SIGCSE2014, teaching
This session was the one I spoke in and I think it went well. Lots of good questions, which is always handy, and I can only hope that the answers made sense! The next talk was “Adaptively Identifying Non-Terminating Code when Testing Student Programs” presented by Stephen Edwards.
How do we handle infinite loops in student testing? Killing the process works, but what happens to later tests if we use a timeout-based termination? What happens to the data from earlier tests? And every looping test wastes time right up to the timeout. Stephen put the wasted time at 99.2 hours of cumulative delay in the 2012-2013 academic year, over nearly 9,000 loop cases, which works out to roughly 40 seconds per case. A coarse, whole-program timeout would have resulted in the loss of any results from these programs.
(This is a problem close to my heart, so I was listening intently!) Stephen talked about using JUnit 4 rules, which let you attach a timeout to the tests in a class, but the rule has to be added to every test class, rules exist only in JUnit 4 and not JUnit 3, and a single flat timeout can still cause delays. So, sadly, we can’t use this solution as it stands to address our key concerns. Instead, they built on the JUnit 4 rules but wanted to:
- create adaptive timeout rules
- extend JUnit to run JUnit 3-style tests under JUnit 4
- automatically inject the timeout rule into every test class, transparently
The adaptive rule starts with a fixed timeout and then adapts it. I didn’t quite follow some of this so I’ll have to read the paper. There are hard upper and lower bounds on the time limits, both of which are customisable, and the time taken ends up being roughly equivalent to that of the slowest terminating code. They’ve now developed the unit and integrated it with their existing code.
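Since I didn’t catch the details, here is only a minimal sketch of what an adaptive limit with hard bounds might look like, written in Python rather than as a JUnit rule. Everything in it is an assumption for illustration and not the algorithm from the paper: the class name `AdaptiveTimeout`, the halving step after a timeout, and the `slack` multiplier applied to the slowest terminating test are all hypothetical.

```python
class AdaptiveTimeout:
    """Illustrative adaptive time limit with hard bounds.

    This is NOT the rule from the paper; the halving step and the
    `slack` multiplier are assumptions made for the sake of the sketch.
    """

    def __init__(self, initial=10.0, lower=1.0, upper=30.0, slack=2.0):
        if not lower <= initial <= upper:
            raise ValueError("initial limit must lie within [lower, upper]")
        self.lower = lower      # hard lower bound, in seconds
        self.upper = upper      # hard upper bound, in seconds
        self.slack = slack      # hypothetical safety multiplier
        self.limit = initial    # the fixed timeout we start from
        self.slowest = 0.0      # slowest terminating test seen so far

    def _clamp(self, value):
        return max(self.lower, min(self.upper, value))

    def _floor(self):
        # Smallest limit that still leaves room for the slowest
        # test that is known to terminate.
        return self._clamp(self.slowest * self.slack)

    def record_pass(self, elapsed):
        """A test finished on its own after `elapsed` seconds."""
        self.slowest = max(self.slowest, elapsed)
        self.limit = self._clamp(max(self.limit, self._floor()))

    def record_timeout(self):
        """A test was killed at the current limit, so tighten the limit."""
        self.limit = self._clamp(max(self._floor(), self.limit / 2))


if __name__ == "__main__":
    policy = AdaptiveTimeout()
    wasted = 0.0
    for _ in range(4):              # four looping tests in a row
        wasted += policy.limit      # each one burns the current limit
        policy.record_timeout()
    print(f"adaptive: {wasted} s wasted, flat 10 s timeout: {4 * 10.0} s")
```

The point of the sketch is only that each non-terminating test costs less than the one before it, while the limit never drops below what the slowest terminating test appears to need.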
To evaluate it, they took a single data structures programming assignment with 4,214 program submissions and regraded them using the new approach. There were 82 instructor-written reference tests (!!!), resulting in 345,456 test executions (that’s a very funny number!). A very small number of tests caused very large problems for students: 2 students had previously received no feedback at all because everything that they did had an infinite loop in it!
One of the questions asked how you bootstrap the initial timeout periods. Data-driven would be ideal but, without any data, there’s a problem. Stephen wants to do this experiment but hasn’t had a chance to do it yet.
The next talk was “Can Computers Compare Student Code Solutions as Well as Teachers?” presented by Matheus Gaudencio, from the Software Practices Laboratory. They use a lot of automatic tests and code comparison, so their first question was whether they, as teachers, had a similar way of examining and comparing code (the old “how many different marks can you get for the same essay” chestnut). They evaluated 11 teachers and generated a reference solution, which the teachers had to compare to two sample solutions, judging which was the best approximation to the reference code. Agreement varied, down to a low of 62%. From eyeballing his data, it looks like 75-80% agreement is the average.
Matheus then looked at other strategies for computational comparison of code, including token-based and tree-based approaches (out of 7 different strategies). There has to be a threshold (which the paper refers to as Delta) which allows some rubberiness in the similarity equations. They produced a hierarchical clustering tool, which can be found at http://relatedecode.appsot.com. If you’re interested in this you can contact Matheus at firstname.lastname@example.org
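To give a feel for how a token-based comparison with a Delta threshold might work, here is a small Python sketch of the task the teachers were given: deciding which of two solutions is closer to a reference. It is not one of the seven strategies from the talk. The Jaccard similarity over token sets, the crude tokenizer, the function names and the default `delta=0.05` are all assumptions chosen for illustration, as is my reading of Delta as a tolerance within which two similarity scores count as a tie.

```python
import re

# Crude tokenizer (an assumption): identifiers, integers, or any
# single non-whitespace character such as an operator or bracket.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(source):
    """Return the set of distinct tokens in a piece of source code."""
    return set(TOKEN.findall(source))


def jaccard(a, b):
    """Jaccard similarity of two programs' token sets, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty programs are treated as identical
    return len(ta & tb) / len(ta | tb)


def closer_to_reference(reference, first, second, delta=0.05):
    """Say which candidate is nearer the reference, or 'tie'.

    `delta` is the tolerance: scores that differ by no more than
    `delta` are treated as indistinguishable.
    """
    s1 = jaccard(reference, first)
    s2 = jaccard(reference, second)
    if abs(s1 - s2) <= delta:
        return "tie"
    return "first" if s1 > s2 else "second"


if __name__ == "__main__":
    reference = "total = 0\nfor x in xs:\n    total += x"
    renamed = "total = 0\nfor y in xs:\n    total += y"
    builtin = "print(sum(xs))"
    print(closer_to_reference(reference, renamed, builtin))
```

A set-based measure like this ignores token order and structure entirely, which is presumably part of why tree-based strategies were also on the list.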