The Illusion of a Number

Rabbit? Duck? Paging Wittgenstein!

I hope you’ve had a chance to read William Rapaport’s paper, which I referred to yesterday. He proposed a great, simple alternative to traditional grading that reduces confusion about what is signalled by ‘grade-type’ feedback, as well as making things easier for students and teachers. Being me, after saying how much I liked it, I then finished by saying “… but I think that there are problems.” His approach was that we could break all grading down into: did nothing, wrong answer, some way to go, pretty much there. And that, I think, is much better than a lot of the nonsense that we pretend we hand out as marks. But, yes, I have some problems.

I note that Rapaport’s exceedingly clear and honest account of what he is doing includes this statement. “Still, there are some subjective calls to make, and you might very well disagree with the way that I have made them.” Therefore, I have license to accept the value of the overall scholarship and the frame of the approach, without having to accept all of the implementation details given in the paper.  Onwards!

I think my biggest concern with the approach given is not in how it works for individual assessment elements. In that area, I think it shines, as it makes clear what has been achieved. A marker can quickly place the work into one of four boxes if there are clear guidelines as to what has to be achieved, without having to worry about one or two percentage points here or there. Because the grade bands are so distinct, as Rapaport notes, it is very hard for the student to make the ‘I only need one more point argument’ that is so clearly indicative as a focus on the grade rather than the learning. (I note that such emphasis is often what we have trained students for, there is no pejorative intention here.) I agree this is consistent and fair, and time-saving (after Walvoord and Anderson), and it avoids curve grading, which I loathe with a passion.

However, my problems start when we are combining a number of these triaged grades into a cumulative mark for an assignment or for a final letter grade, showing progress in the course. Sections 4.3 and 4.4 of the paper detail the implementation of assignments that have triage graded sub-tasks. Now, instead of receiving a “some way to go” for an assignment, we can start getting different scores for sub-tasks. Let’s look at an example from the paper, note 12, to describe programming projects in CS.

  • Problem definition 0,1,2,3
  • Top-down design 0,1,2,3
  • Documented code
    • Code 0,1,2,3
    • Documentation 0,1,2,3
  • Annotated output
    • Output 0,1,2,3
    • Annotations 0,1,2,3

Total possible points = 18

Remember my hypothetical situation from yesterday? I provided an example of two students who managed to score enough marks to pass by knowing the complement of each other’s course knowledge.  Looking at the above example, it appears (although not easily) to be possible for this situation to occur and both students to receive a 9/18, yet for different aspects. But I have some more pressing questions:

  1. Should it be possible for a student to receive full marks for output, if there is no definition, design or code presented?
  2. Can a student receive full marks for everything else if they have no design?

The first question indicates what we already know about task dependencies: if we want to build them into numerical grading, we have to be pedantically specific and provide rules on top of the aggregation mathematics. But, more subtly, by aggregating these measures, we no longer have an ‘accurately triaged’ grade to indicate if the assignment as a whole is acceptable or not. An assignment with no definition, design or code can hardly be considered to be a valid submission, yet good output, documentation and annotation (with no code) will not give us the right result!

The second question is more for those of us who teach programming and it’s a question we all should ask. If a student can get a decent grade for an assignment without submitting a design, then what message are we sending? We are, implicitly, saying that although we talk a lot about design, it’s not something you have to do in order to be successful. Rapaport does go on to talk about weightings and how we can emphasis these issues but we are still faced with an ugly reality that, unless we weight our key aspects to be 50-60% of the final aggregate, students will be able to side-step them and still perform to a passing standard. Every assignment should be doing something useful, modelling the correct approaches, demonstrating correct techniques. How do we capture that?

Now, let me step back and say that I have no problem with identifying the sub-tasks and clearly indicating the level of performance using triage grading, but I disagree with using it for marks. For feedback it is absolutely invaluable: triage grading on sub-tasks will immediately tell you where the majority of students are having trouble, quickly. That then lets you know an area that is more challenging than you thought or one that your students were not prepared for, for some reason. (If every student in the class is struggling with something, the problem is more likely to lie with the teacher.) However, I see three major problems with sub-task aggregation and, thus, with final grade aggregation from assignments.

The first problem is that I think this is the wrong kind of scale to try and aggregate in this way. As Rapaport notes, agreement on clear, linear intervals in grading is never going to be achieved and is, very likely, not even possible. Recall that there are four fundamental types of scale: nominal, ordinal, interval and ratio. The scales in use for triage grading are not interval scales (the intervals aren’t predictable or equidistant) and thus we cannot expect to average them and get sensible results. What we have here are, to my eye, ordinal scales, with no objective distance but a clear ranking of best to worst. The clearest indicator of this is the construction of a B grade for final grading, where no such concept exists in the triage marks for assessing assignment quality. We have created a “some way to go but sometimes nearly perfect” that shouldn’t really exist. Think of it like runners: you win one race and you come third in another. You never actually came second in any race so averaging it makes no sense.

The second problem is that aggregation masks the beauty of triage in terms of identifying if a task has been performed to the pre-determined level. In an ideal world, every area of knowledge that a student is exposed to should be an important contributor to their learning journey. We may have multiple assignments in one area but our assessment mechanism should provide clear opportunities to demonstrate that knowledge. Thus, their achievement of sufficient assignment work to demonstrate their competency in every relevant area of knowledge should be a necessary condition for graduating. When we take triage grading back to an assignment level, we can then look at our assignments grouped by knowledge area and quickly see if a student has some way to go or has achieved the goal. This is not anywhere near as clear when we start aggregating the marks because of the mathematical issues already raised.

Finally, the reduction of triage to mathematical approximation reduces the ability to specify which areas of an assessment are really valuable and, while weighting is a reasonable approximation to this, it is very hard to use a mathematical formula with more and more ‘fudge factors’, a term Rapaport uses, to make up for the fact that this is just a little too fragile.

To summarise, I really like the thrust of this paper. I think what is proposed is far better, even with all of the problems raised above, at giving a reasonable, fair and predictable grade to students. But I think that the clash with existing grading traditions and the implicit requirement to turn everything back into one number is causing problems that have to be addressed. These problems mean that this solution is not, yet, beautiful. But let’s see where we can go.

Tomorrow, I’ll suggest an even more cut-down version of grading and then work on an even trickier problem: late penalties and how they affect grades.