What are we assessing? How?

How can we create a better assessment system, without penalties, that works in a grade-free environment? Let's provide a foundation for this discussion by looking at assessment today.


Bloom’s Revised Taxonomy

We have many different ways of understanding exactly how we are assessing knowledge. Bloom's taxonomy allows us to classify the objectives that we set for students: we can determine whether we're just asking them to remember something, explain it, apply it, analyse it, evaluate it or, having mastered all of those other aspects, create a new example of it. We've also got Biggs' SOLO taxonomy to classify levels of increasing complexity in a student's understanding of a subject. Now let's add in threshold concepts, learning edge momentum, neo-Piagetian theory and …

Let’s summarise and just say that we know that students take a while to learn things, can demonstrate some convincing illusions of progress that quickly fall apart, and that we can design our activities and assessment in a way that acknowledges this.

I attended a talk by Eric Mazur, of Peer Instruction fame, and he said a lot of what I've already said about assessment not working with how we know we should be teaching. His belief is that we rarely rise above remembering and understanding when it comes to testing, and he's at Harvard, whose practices most people would readily accept as, in theory, being top notch. Eric proposed a number of approaches, but his focus on outcomes was the one that I really liked. He wanted to keep the coaching role he could provide separate from his evaluator role: another thing I think we should be doing more.

Eric is in Physics, but all of these ideas have been extensively explored in my own field, especially when we start to look at which of the levels we teach students to and which we then assess. We do a lot of work on this in Australia, and here is some work by our groups and by others I have learned from:

  • Szabo, C., Falkner, K. & Falkner, N. 2014, ‘Experiences in Course Design using Neo-Piagetian Theory’
  • Falkner, K., Vivian, R., Falkner, N., 2013, ‘Neo-piagetian Forms of Reasoning in Software Development Process Construction’
  • Whalley, J., Lister, R.F., Thompson, E., Clear, T., Robbins, P., Kumar, P. & Prasad, C. 2006, ‘An Australasian study of reading and comprehension skills in novice programmers, using Bloom and SOLO taxonomies’
  • Gluga, R., Kay, J., Lister, R.F. & Teague, D. 2012, ‘On the reliability of classifying programming tasks using a neo-piagetian theory of cognitive development’

I would be remiss not to mention Anna Eckerdal's work, and collaborations, in the area of threshold concepts. You can find her many papers on determining which concepts are going to challenge students the most, and how we could deal with this, here.

Let me summarise all of this:

  • There are different levels at which students will perform as they learn.
  • Careful evaluation is needed to separate students who appear to have learned something from students who have actually learned it.
  • We often focus too much on memorisation and simple explanation, without going to more advanced levels.
  • If we want to assess the advanced levels, we may have to give up on grading these additional steps, because objectivity is almost impossible, as is task equivalence.
  • We should teach in a way that supports the assessment we wish to carry out, and that assessment should be the one that demonstrates true mastery of knowledge and skills.

If we are not designing for our learning outcomes, we’re unlikely to create courses to achieve those outcomes. If we don’t take into account the realities of student behaviour, we will also fail.

We can break our assessment tasks down by one of the taxonomies or learning theories and, from my own work and that of others, we know that we will get better results if we provide a learning environment that supports assessment at the desired taxonomic level.

But there is a problem. The most descriptive, authentic and open-ended assessments incur the most load in terms of expert human marking. We don't have a lot of expert human markers, and overloading them is not good. Pretending that we can mark an infinite number of assignments doesn't make it possible. Our evaluation aesthetics are objectivity, fairness, effectiveness, timeliness and depth of feedback. Assignment evaluation should be useful to the students, to show progress, and useful to us, to show the health of the learning environment. Overloading the marker will compromise the aesthetics.

Our beauty lens tells us very clearly that we need to be careful about how we deal with our finite resources. As Eric notes, and we all know, if we only test the simpler aspects of student learning, we can throw machines at it, and we have a near-infinite supply of machines. I cannot produce more experts like me, easily. (Snickers from the audience.) I can recruit human evaluators from my casual pool and train them to mark to something like my standard, using a rubric or an approximation of my approach.

Thus I have a framework of assignments, divided by level, and I appear to have assignment evaluation resources. And the more expert and human the marker, the more … for want of a better word … valuable the resource, and the better the feedback it can produce. Yet the more valuable the resource, the less of it I have, because it takes time to develop evaluation skills in humans.

Tune in tomorrow for the penalty-free evaluation and feedback that ties all of this together.


SIGCSE Day 3, “What We Say, What They Do”, Saturday, 9–10:15am (#SIGCSE2014)

The first paper was “Metaphors we teach by”, presented by Ben Shapiro from Tufts: what types of metaphors do CS1 instructors use, and what are the wrinkles in these metaphors? What do we mean by metaphors? Ben’s talking about conceptual metaphors, linguistic devices that allow us to understand one idea in terms of another idea that we already know. Example: love is a journey – twists and turns, no guaranteed good ending. The structure of a metaphor is that you have a thing we’re trying to explain (the target) in terms of something we already know (the source). Conceptual metaphors are explanatory devices to assist us in understanding new things.

Metaphors are widely used in teaching CS: pointers, stacks and loops are all metaphorical aspects of computer science, but that’s not the focus of this study. How do people teach with metaphor? The authors couldn’t find any studies on general metaphor use in CS and its implications for student learning. An example from a birds-of-a-feather session held at this conference: a variable is like a box. A box can hold many different things, but the point is that it holds things. (This has been the subject of a specific study.) Ben also introduced the “too much milk” metaphor, which is laid out as follows. Jane comes home from work and goes to get milk from the fridge, but her roommate has already drunk it (bad roommate!). Jane goes out to get more milk. While she’s out, her roommate comes back with milk, then Jane comes back with milk. Now they have too much milk! This could be used to explain race conditions in CS. Another example is the use of bus lockers mapping to virtual memory.
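To make the race concrete, here is a minimal sketch of the “too much milk” scenario as a check-then-act race on shared state. The code, names and timing are my own illustration rather than anything from the talk, and it assumes a C++11 compiler built with -pthread; the counter is deliberately left unsynchronised so the race is visible.

```cpp
#include <chrono>
#include <iostream>
#include <thread>

int milk_cartons = 0;  // shared state: how much milk is in the fridge

void roommate(const char* name) {
    if (milk_cartons == 0) {                                         // check: no milk
        std::this_thread::sleep_for(std::chrono::milliseconds(10));  // walk to the shop
        ++milk_cartons;                                              // act: buy milk
        std::cout << name << " bought milk\n";
    }
}

int main() {
    std::thread jane(roommate, "Jane");
    std::thread roomie(roommate, "Roommate");
    jane.join();
    roomie.join();
    std::cout << "Cartons in the fridge: " << milk_cartons << "\n";  // usually 2
    return 0;
}
```

Because the check and the purchase are not atomic, both threads usually see an empty fridge and both buy milk, which is exactly the situation the metaphor is trying to convey.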
Ben returned to boxes again. One of the problems is that a box can hold many things but a variable can only hold one thing, which appears to be a confusing point for learners who know how boxes work. Is this a common problem? Metaphors have some benefits but come with this kind of baggage. Metaphors are partial mappings – they don’t match every aspect of the target to the source. (If it were a complete mapping, they’d be the same thing.)
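As a concrete illustration (my own, not from the paper) of where the box metaphor leaks: a real box can accumulate several items, but assignment to a variable replaces whatever was there.

```cpp
#include <iostream>

int main() {
    int box = 3;               // "put 3 in the box"
    box = 7;                   // this does not add 7 alongside the 3; the 3 is gone
    std::cout << box << "\n";  // prints 7 only, never 3 and 7
    return 0;
}
```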
The research questions that the group considered were:
  • What metaphors do CS1 instructors use for teaching?
  • What are they trying to explain?
  • What are the sources that they use?
Learners don’t know where the mappings start and stop – where do the metaphors break down for students? What mistakes do they make because of these misunderstandings? Why does this matter? We all have knowledge on how to explain but we don’t have good published collections of the kind of metaphors that we use to teach CS, which would be handy for new teachers. We could study these and work out which are more effective. What are the most enduring and universal metaphors?
The study was interview-based, interviewing university-level CS1 instructors, and ended up with 10 people with an average of 13 years of teaching. The interview questions given to these instructors were (paraphrased):
  • Levels taught and number of years
  • Tell me about a metaphor
  • Target to source mapping
  • Common questions students have
  • Where the metaphor breaks down
  • How to handle the breakdown in teaching.
Ben then presented the results. (We had a brief discussion of similes versus metaphors but I’ll leave that to you.) An instructor discussed using the simile of a portkey from Harry Potter to explain return statements in functions, because students had trouble with return exiting immediately. The group of 10 people provided 18 different CS Concepts (Targets) and 19 Metaphorical Explanations (Sources).
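A tiny sketch (mine, not the instructor’s) of the behaviour the portkey simile is aimed at: return leaves the function immediately, so nothing after it in that branch ever runs.

```cpp
#include <iostream>

int sign(int n) {
    if (n < 0) {
        return -1;  // the portkey: we leave the function right here
        // any code placed after the return in this branch would be unreachable
    }
    if (n == 0) {
        return 0;
    }
    return 1;       // only reached when n is positive
}

int main() {
    std::cout << sign(-5) << " " << sign(0) << " " << sign(4) << "\n";  // -1 0 1
    return 0;
}
```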
What’s the target for “Card Catalogs”? Memory addressing and pointers. The results were interesting – there’s a wide range of ways to explain things! (The paper contains a table of a number of targets and sources.)
Out-of-date cultural references were identified as a problem, and you have to be aware of the students’ cultural context. (Card catalogs and phone booths are nowhere near as widely used as they used to be.) Where do students make inferences beyond the metaphor? None of the 10 participants could give a single example of this happening! (This is surprising – Ben called it weird.) Two hypotheses: our metaphors are special and don’t get overextended (very unlikely), or CS1 instructors have a poor understanding of student thinking (more likely).
The following experimental studies may shed some light on this:
  • Which metaphors work better?
  • Cognitive clinical interviews, exploring how students think with metaphors and where incorrect inferences are drawn.
There was also a brief explanation of PCK (teachers’ pedagogical content knowledge) but I don’t have enough knowledge to fully flesh this out. Ben, if you’re reading this, feel free to add a beautifully explanatory comment. 🙂
The next talk was “‘Explain in Plain English’ Questions Revisited: Data Structures Problems”, presented by Sue Fitzgerald and Laurie. This session opened with a poll to find out what the participants wanted, and we all wanted to find out how to get students to use plain English. An Explain in Plain English (EiPE) question asks you to describe what a chunk of code does, but not as a line-by-line discussion. A student’s ability to explain what a chunk of code does correlates with that student’s ability to write and read code. The study wanted to investigate whether this was just a novice phenomenon or whether the relationship persisted as experience and expertise grew. This study looked at 120 undergraduates in a CS2 course on data structures and algorithms using C++, with much more difficult questions than in earlier studies: linked lists, recursive calls and so on.
The students were given two questions in an exam, with some preamble to describe the underlying class structure along with a short example and a diagram. The students then had to look at a piece of code and determine what would happen, in order to answer the question with a plain-English response. (There’s always a problem when you throw to an interactive response system and the question isn’t repeated on screen; perhaps we need two screens.)
The SOLO taxonomy was used to analyse the problems (more Neo-Piagetian goodness!). Four of the SOLO categories were used: relational (summarises the code), multistructural (line-by-line explanation of the code), unistructural (describes only one portion rather than the whole idea) and prestructural (misses it completely, gibberish). I was interested to see the examples presented, with pointers and mutual function calling, because it quickly became apparent that the room I was in (which had a lot of CS people in it) was having to think relatively hard about the answer to the second example. One of the things about working memory is that it’s not very deep, and none of us were quite ready to work in a session 🙂 but a lot of good discussion ensued. The students would have had ready access to the preamble code, but I do wonder how much obfuscation is really required here. The speaker made a parenthetical comment that experts usually doodle, but where were our pen and paper! (As someone else said, reinforcing the point that we didn’t come prepared to work, nobody told us we had to bring paper. 🙂 ) We then got to classify a student response that was quite “student-y”. (A question came up as to whether an answer can be relational if it’s wrong – the opinion appears to be that a concise, complete and incorrect answer could be considered relational. A point for later discussion.) The answer we saw was multistructural because it was a line-by-line answer – it wasn’t clear, concise and abstract. We then saw another response that was much more terse but far less accurate. The group tossed up between unistructural and prestructural. (The group couldn’t see the original code or the question, so this uncertainty makes sense. Again, a problem with trying to have an engaging online response system and a presentation on the same screen. The presenters did a great job of trying to make it work, but it’s not ideal.)
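To make those categories concrete, here is a hypothetical EiPE-style item of my own (not one of the actual exam questions from the study), together with answers at the relational and multistructural levels.

```cpp
#include <iostream>

// Assume the usual singly linked node from a CS2 data structures course.
struct Node {
    int data;
    Node* next;
};

// The chunk of code students would be asked to explain in plain English.
int mystery(Node* head) {
    int total = 0;
    while (head != nullptr) {
        total += head->data;
        head = head->next;
    }
    return total;
}

int main() {
    Node c{3, nullptr}, b{2, &c}, a{1, &b};  // the list 1 -> 2 -> 3
    std::cout << mystery(&a) << "\n";        // prints 6
    return 0;
}

// A relational answer summarises the whole: "it returns the sum of the data
// values in the list (0 for an empty list)".
// A multistructural answer goes line by line: "it sets total to 0, loops while
// head is not null, adds head->data to total, moves head to the next node, and
// then returns total".
```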
What about correlations? For the first question, students who gave relational and multistructural answers generally passed, with an average grade of 58%; those who answered at the unistructural or prestructural level generally failed, with an average grade of 38%. For the second question, the relational and multistructural group generally passed with an average grade of 61.2%, while the unistructural and prestructural group generally failed with an average grade of 42%.
So these correlations hold for non-novice programmers. A mix of explaining, writing and reading code is an effective way to develop good programming skills, and EiPE questions give students good practice in the valuable skill of explaining code. Instructors can overestimate how well students understand presented code, so asking them to explain it back is very useful for student self-assessment. The authors’ speculation is that explaining code to peers is probably part of the success of peer instruction and pair programming.
The final talk was “A Formative Study of Influences on Student Testing Behaviours”, presented by Kevin Buffardi from VT. In their introductory CS1 and CS2 courses they use Test-Driven Development (TDD) – code a little, test a little – for incremental development. It’s popular in industry, so students come out with relevant experience, and some previous studies have found improvement in student work when students closely adhered to the TDD philosophy. BUT a lot of students didn’t follow it at all! So the authors were looking for ways to encourage students to follow it, especially when they were on their own and programming by themselves. Because it’s a process, you can’t tell what happened just by looking at the final program, but they use WebCAT and so can track the developmental stages of the program as students submit their work for partial grading. These snapshots provide clear views of what the students are doing over time. (I really have to look at what we could do with WebCAT. Our existing automarker is getting a bit creaky.) Students also received hints back when they submitted their work, at both general and instructor level.
The first time students achieved something with any type of testing, they would get “Good Start” feedback and be entitled to a free hint. If you kept up with your testing, you would ‘buy’ more hints. If your test coverage was good, you got more hints; if your coverage was poor, you got general feedback. (Prior to this, WebCAT only gave three hints. Now there are no free hints, but you can buy an unlimited number.) This is an adaptive feedback mechanism, designed to encourage testing with hints as incentives. The study compared reinforcement treatments (a rough sketch of these schedules follows the list below):
  • Constant – every time a goal was achieved, you got a hint (consistently rewards the target behaviour)
  • Delayed – hints when earned, but at most one hint per hour (less incentive to hammer the system)
  • Random – a 50% chance of a hint when a goal is met (should reduce dependency on extrinsic rewards)
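Here is a rough sketch of how I read those three schedules. This is my own illustration, not WebCAT’s actual implementation; the grant_hint function and its parameters are hypothetical.

```cpp
#include <chrono>
#include <iostream>
#include <random>

enum class Schedule { Constant, Delayed, Random };

// Decide whether meeting a testing goal earns a hint under each treatment.
// last_hint records when the student last received one (used by Delayed).
bool grant_hint(Schedule schedule,
                std::chrono::steady_clock::time_point& last_hint,
                std::mt19937& rng) {
    using namespace std::chrono;
    const auto now = steady_clock::now();
    switch (schedule) {
        case Schedule::Constant:
            return true;                            // every goal earns a hint
        case Schedule::Delayed:
            if (now - last_hint >= hours(1)) {      // at most one hint per hour
                last_hint = now;
                return true;
            }
            return false;
        case Schedule::Random: {
            std::bernoulli_distribution coin(0.5);  // 50% chance of a hint
            return coin(rng);
        }
    }
    return false;
}

int main() {
    std::mt19937 rng(42);
    auto last_hint = std::chrono::steady_clock::now() - std::chrono::hours(2);
    std::cout << grant_hint(Schedule::Constant, last_hint, rng) << " "
              << grant_hint(Schedule::Delayed, last_hint, rng) << " "  // earned: 2h since the last hint
              << grant_hint(Schedule::Random, last_hint, rng) << "\n";
    return 0;
}
```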
Should you show them the goal or not? This was an additional factor: the goals were either visual (a concrete goal) or obscured (suggesting improvement without a specified target). These were paired treatments.
What was the impact? There were no differences in the number of lines written, but the visual goal led to students getting better test coverage than the obscured goal. There didn’t appear to be a long-term effect, but there is an upcoming ITiCSE talk that will discuss this further. There were some changes from one submission to another, but this wasn’t covered in detail.
The authors held formative group interviews where the students explained their development process and their interaction with WebCAT. The students said that they valued several types of evaluation, that they paid attention to RED progress bars (visualisation and dashboarding – I’d argue this is more about awareness than motivation), and that they noticed when they had earned a hint but didn’t get it. The students drew their individual development process as a diagram and, while everyone had a unique approach, there were two general patterns. A test-last approach showed up: write a solution, submit it to WebCAT, take a break, do some testing, then submit to WebCAT again. A periodic-testing approach was the other pattern: write a solution, submit to WebCAT, write tests, submit to WebCAT, then revise solution and tests, and iterate.
Going forward, the automated evaluation became part of their development strategy. There were conflicting interests: the correctness reports from WebCAT were actually reducing the need for students to write their own tests, because they were already getting an indication of how well the code was working. This is an important point for me because, from the examples I saw, I really couldn’t see what I would call test-driven development, especially in the test-last pattern, so the framework is not encouraging the right behaviour. Kevin handled my question on this well, because it’s a complicated issue, and I’m really looking forward to seeing the ITiCSE paper follow-up! Behavioural change is difficult and, as Kevin rightly noted, it’s optimistic to think that we can achieve it in the short term.
Everyone wants to get students doing the right thing but it’s a very complicated issue. Much food for thought and a great session!

SIGCSE Best Paper Award Winner – Dr Claudia Szabo, University of Adelaide

Congratulations to my colleague, friend and running partner, Dr Claudia Szabo, on winning the SIGCSE Best Paper Award for a paper entitled “Student Projects are Not Throwaways: Teaching Practical Software Maintenance in a Software Engineering Course”. Claudia has three papers here because overachievement but, more seriously, this is a fantastic achievement, especially for her first SIGCSE. This is also really useful research that has direct practical applications for people who are teaching Software Engineering AND we’re working together to build courses based on some of her earlier work on Neo-Piagetian analysis of existing courses.

Here’s a picture! Yay, Claudia!


Claudia Szabo holding plaque – like a BOSS!