No numbers

We know that grades are really quite arbitrary and that turning numbers into letters, while something we can do, is actually not that strongly coupled to evaluating learning or demonstrating mastery. Why? Because having the appropriate level of knowledge and being able to demonstrate it are not necessarily the same as being able to pass tests or produce solutions to assignments.

For example, if we look at Rapaport’s triage approach as a way to evaluate student interaction with assignments, we can then design our learning environment to provide multiple opportunities to construct and evaluate knowledge, on the understanding that we are seeking clear evidence that a student cannot just perform tasks of this nature but, more important, can do so reliably. We can do this even if we use “Good, getting there, wrong and no submission” rather than numbers. The duality of grades (a symbol and its meaning) degenerates to something other than numbers anyway. Students at my University didn’t care about 84 versus 85 until we put a new letter grade in at 85 (High Distinction). But even these distinctions are arbitrary scales when it comes to evaluating actual learning.

A very arbitrary scale.

Why are numbers not important in this? Because they’re rarely important anyway. Have you ever asked your surgeon what her grades were in school? What about your accountant? Perhaps you’ve questioned the percentage that your favourite Master of Wine achieved in the tasting exams? Of course you haven’t. You’ve assumed that a certification (of some sort) indicates sufficient knowledge to practise. And what we have to face is that we are currently falling back onto numbers to give us false confidence that we are measuring learning. They don’t map. They’re not objective. They’re often mathematically nonsensical. No-one cares about them except to provide yet another way of sorting human beings and, goodness knows, we already have enough of those.

Ah, “but students like to know how they’re going”, right? Yes. Which is where critique and evaluation come in, as well as many other authentic and appropriate ways to recognise progress and encourage curiosity and further development. None of which require numbers.

Let me ask you a question:

Does every student who accumulates enough pass tokens to graduate from your program have a clearly demonstrated ability to perform tasks to the requisite level in all of the knowledge areas of your program?

If the answer is no, then numbers and grades didn’t help, did they? I suspect that, for you as for many others including me, you can probably think of students who managed to struggle through but, in reality, were probably never going to be much good in the field. Perhaps 50% doesn’t magically cover competency? If 50% doesn’t, then raising the bar to 75% won’t solve the problem either. For reasons already mentioned, many of the ways we combine numbers to get grades just don’t make any real sense and they certainly don’t provide much insight into how well the student actually learned what you were trying to teach.

If numbers/grades don’t have much solid foundation, don’t always reflect ability to perform the task, and aren’t actually going to be used in the future, then they are neither good nor true. And they cannot be beautiful.

Thus, let me strip Rapaport back one notch and provide a three-tier grade-free system, commonly used in many places already, that is closer to what we probably want:

  1. Nothing submitted,
  2. Work in progress, resubmit if possible, and
  3. Work to competent standard.

I know that there are concerns about the word ‘competency’ but I think it’s something we’re going to have to think about moving on from. I teach engineers and computer scientists and they have to go out and perform tasks successfully if people are going to employ them or work with them. They have to be competent. Right now, I can tell you which of them have passed but, for a variety of grading reasons, I can’t tell you which one of them, from an academic transcript alone, will be able to sit down and solve your problem. I can see which ones pass exams but I don’t know if this is fixed knowledge or swotting. But what if you made it easy and said “ok, just point to the one who will build me the best bridge”? No. I can’t tell you that. (The most likely worst bridge is easier, as I can identify who does and doesn’t have Civil Engineering qualifications.)

The three-tier scale is simple. The feedback approach that the marker should take is pretty clear in each place and the result is clear to the student. If we build our learning environment correctly, then we can construct a pathway where a student has to achieve tier 3 for all key activities and, at that point, we can actually say “Yes, this student can perform this task or apply this knowledge to the required level”. We do this enough times, we may even start to think that the student could perform this at the level of the profession.
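As a minimal sketch of that pathway, here is one way it could be written down. Everything named here (`Tier`, `outstanding`, `has_demonstrated`, the activity labels) is hypothetical and only for illustration. I am also assuming that a resubmission simply replaces the earlier result, so only the most recent attempt counts.

```python
from enum import IntEnum


class Tier(IntEnum):
    """The three tiers from the list above (names are illustrative)."""
    NOTHING_SUBMITTED = 1
    WORK_IN_PROGRESS = 2
    COMPETENT = 3


def latest_tier(attempts):
    """Most recent result for one activity; no attempts means nothing submitted."""
    return attempts[-1] if attempts else Tier.NOTHING_SUBMITTED


def outstanding(record, key_activities):
    """Key activities that have not yet reached the competent standard."""
    return [activity for activity in key_activities
            if latest_tier(record.get(activity, [])) is not Tier.COMPETENT]


def has_demonstrated(record, key_activities):
    """True only when every key activity is at tier 3; nothing is averaged."""
    return not outstanding(record, key_activities)


# Hypothetical student record: a list of attempts per key activity.
key_activities = ["design", "implementation", "testing"]
record = {
    "design": [Tier.WORK_IN_PROGRESS, Tier.COMPETENT],  # resubmitted once
    "implementation": [Tier.COMPETENT],
    "testing": [Tier.WORK_IN_PROGRESS],
}

print(outstanding(record, key_activities))       # ['testing']
print(has_demonstrated(record, key_activities))  # False
```

The point of the design is that there is no sum anywhere: strong work in one activity cannot buy out a gap in another.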

Wait. Have we just re-invented competency-based assessment? There’s an immediate urge to say “but that’s not a University level thing” and I do understand that. CBA has a strong vocational focus but anyone who works in an engineering faculty is already in that boat. We have industry linked accreditation to allow our students to practise as engineers and they have to demonstrate the achievement of a certified program, as well as work experience. That program is taught at University but, given that all you need is to get the degree, you can do it on raw passes and be ‘as accredited’ as the next person.

Now, I’d be the first person to say that not only are many aspects of the University not vocationally focussed but I’d go further and say that they shouldn’t be vocationally focussed. The University is a place that allows for the unfettered exploration of knowledge for knowledge’s sake and I wouldn’t want to change that. (And, yet, so often, we still grade such abstract ideals…) But let’s take competency away from the words job and vocational for a moment. I’m not suggesting we turn Universities into vocational study centres or shut down “non-Industry” programs and schools. (I’d like to see more but that’s another post.) Let’s look at focusing on clarity and simplicity of evaluation.

A student writes an essay on Brecht and submits it for assessment. All of the rich feedback on language use, referencing and analysis still exists without the need to grade it as A, B or C. The question is whether the work should be changed in response to the feedback (if possible) or whether it is, recognisably, an appropriate response to the question ‘write an essay on Brecht’ that will allow the student to develop their knowledge and skills. There is no job focus here but pulling back to separate feedback and identifying whether knowledge has been sufficiently demonstrated is, fundamentally, a competency argument.

The PhD, the pinnacle of the University system, is essentially not graded. You gain vast amounts of feedback over time, you write in response and then you either defend it to your prospective peers or have it blind-assessed by external markers. Yes, there are degrees of acceptance but, ultimately, what you end up with is “Fine as it is”, “Do some more work”, and “Oh, no. Just no.” If we can extend this level of acceptance of competency to our highest valued qualification, what is the consistent and sound reasoning that requires us to look at a student group and say “Hmm, 73. And this one is… yes, 74.”? If I may, cui bono? Who is benefitting here?

But what would such a program look like, you ask? (Hey, and didn’t Nick say he was going to talk about late penalties?) Yes, indeed. Come back tomorrow!

The Illusion of a Number

Rabbit? Duck? Paging Wittgenstein!

I hope you’ve had a chance to read William Rapaport’s paper, which I referred to yesterday. He proposed a great, simple alternative to traditional grading that reduces confusion about what is signalled by ‘grade-type’ feedback, as well as making things easier for students and teachers. Being me, after saying how much I liked it, I then finished by saying “… but I think that there are problems.” His approach was that we could break all grading down into: did nothing, wrong answer, some way to go, pretty much there. And that, I think, is much better than a lot of the nonsense that we pretend we hand out as marks. But, yes, I have some problems.

I note that Rapaport’s exceedingly clear and honest account of what he is doing includes this statement. “Still, there are some subjective calls to make, and you might very well disagree with the way that I have made them.” Therefore, I have license to accept the value of the overall scholarship and the frame of the approach, without having to accept all of the implementation details given in the paper.  Onwards!

I think my biggest concern with the approach given is not in how it works for individual assessment elements. In that area, I think it shines, as it makes clear what has been achieved. A marker can quickly place the work into one of four boxes if there are clear guidelines as to what has to be achieved, without having to worry about one or two percentage points here or there. Because the grade bands are so distinct, as Rapaport notes, it is very hard for the student to make the ‘I only need one more point’ argument that is so clearly indicative of a focus on the grade rather than the learning. (I note that such emphasis is often what we have trained students for; there is no pejorative intention here.) I agree this is consistent and fair, and time-saving (after Walvoord and Anderson), and it avoids curve grading, which I loathe with a passion.

However, my problems start when we are combining a number of these triaged grades into a cumulative mark for an assignment or for a final letter grade, showing progress in the course. Sections 4.3 and 4.4 of the paper detail the implementation of assignments that have triage graded sub-tasks. Now, instead of receiving a “some way to go” for an assignment, we can start getting different scores for sub-tasks. Let’s look at an example from the paper, note 12, to describe programming projects in CS.

  • Problem definition 0,1,2,3
  • Top-down design 0,1,2,3
  • Documented code
    • Code 0,1,2,3
    • Documentation 0,1,2,3
  • Annotated output
    • Output 0,1,2,3
    • Annotations 0,1,2,3

Total possible points = 18

Remember my hypothetical situation from yesterday? I provided an example of two students who managed to score enough marks to pass by knowing the complement of each other’s course knowledge. Looking at the above example, it appears (although not easily) to be possible for this situation to occur and for both students to receive a 9/18, yet for different aspects. But I have some more pressing questions:

  1. Should it be possible for a student to receive full marks for output, if there is no definition, design or code presented?
  2. Can a student receive full marks for everything else if they have no design?

The first question indicates what we already know about task dependencies: if we want to build them into numerical grading, we have to be pedantically specific and provide rules on top of the aggregation mathematics. But, more subtly, by aggregating these measures, we no longer have an ‘accurately triaged’ grade to indicate if the assignment as a whole is acceptable or not. An assignment with no definition, design or code can hardly be considered to be a valid submission, yet good output, documentation and annotation (with no code) will not give us the right result!
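To make the aggregation problem concrete, here is a small sketch using the six sub-tasks from the rubric above. The two sets of scores are invented, and the dependency rule in `dependency_problems` (documentation, output and annotations all need code) is my assumption for illustration, not something the paper specifies.

```python
# Sub-task names follow the rubric quoted above; all scores are invented.
RUBRIC = ["definition", "design", "code",
          "documentation", "output", "annotations"]


def total(scores):
    """Plain sum over the six sub-tasks, out of 18."""
    return sum(scores[item] for item in RUBRIC)


def dependency_problems(scores):
    """Flag marks awarded for work that depends on missing code.

    This rule is an assumption made for this sketch; the sum itself
    knows nothing about it and has to be told separately.
    """
    if scores["code"] > 0:
        return []
    return [item for item in ("documentation", "output", "annotations")
            if scores[item] > 0]


student_a = {"definition": 3, "design": 3, "code": 3,
             "documentation": 0, "output": 0, "annotations": 0}
student_b = {"definition": 0, "design": 0, "code": 0,
             "documentation": 3, "output": 3, "annotations": 3}

print(total(student_a), total(student_b))  # 9 9
print(dependency_problems(student_a))      # []
print(dependency_problems(student_b))      # ['documentation', 'output', 'annotations']
```

Both students land on 9/18 with nothing in common, and the only way to notice that the second submission has no code is a rule bolted on beside the arithmetic.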

The second question is more for those of us who teach programming and it’s a question we all should ask. If a student can get a decent grade for an assignment without submitting a design, then what message are we sending? We are, implicitly, saying that although we talk a lot about design, it’s not something you have to do in order to be successful. Rapaport does go on to talk about weightings and how we can emphasise these issues but we are still faced with an ugly reality that, unless we weight our key aspects to be 50-60% of the final aggregate, students will be able to side-step them and still perform to a passing standard. Every assignment should be doing something useful, modelling the correct approaches, demonstrating correct techniques. How do we capture that?

Now, let me step back and say that I have no problem with identifying the sub-tasks and clearly indicating the level of performance using triage grading, but I disagree with using it for marks. For feedback it is absolutely invaluable: triage grading on sub-tasks will immediately tell you where the majority of students are having trouble, quickly. That then lets you know an area that is more challenging than you thought or one that your students were not prepared for, for some reason. (If every student in the class is struggling with something, the problem is more likely to lie with the teacher.) However, I see three major problems with sub-task aggregation and, thus, with final grade aggregation from assignments.

The first problem is that I think this is the wrong kind of scale to try and aggregate in this way. As Rapaport notes, agreement on clear, linear intervals in grading is never going to be achieved and is, very likely, not even possible. Recall that there are four fundamental types of scale: nominal, ordinal, interval and ratio. The scales in use for triage grading are not interval scales (the intervals aren’t predictable or equidistant) and thus we cannot expect to average them and get sensible results. What we have here are, to my eye, ordinal scales, with no objective distance but a clear ranking of best to worst. The clearest indicator of this is the construction of a B grade for final grading, where no such concept exists in the triage marks for assessing assignment quality. We have created a “some way to go but sometimes nearly perfect” that shouldn’t really exist. Think of it like runners: you win one race and you come third in another. You never actually came second in any race so averaging it makes no sense.
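A quick way to see why averaging an ordinal scale is shaky: the only thing an ordinal scale promises is order, so any relabelling that keeps the order should tell the same story. The sketch below uses two hypothetical students and two labellings of the same four triage levels (the second labelling, with a big jump at the top, is invented purely for the demonstration).

```python
from statistics import mean, median

# Triage marks (0-3) for two hypothetical students over three assignments.
x = [3, 1, 1]
y = [2, 2, 2]

# Two labellings of the same four ordered levels. Both preserve the
# ranking worst-to-best; only the spacing between levels differs.
equal_steps = {0: 0, 1: 1, 2: 2, 3: 3}
big_top_step = {0: 0, 1: 1, 2: 2, 3: 10}


def relabel(marks, labels):
    """Rewrite marks under a different, order-preserving labelling."""
    return [labels[m] for m in marks]


for labels in (equal_steps, big_top_step):
    mean_says_x_ahead = mean(relabel(x, labels)) > mean(relabel(y, labels))
    median_says_x_ahead = median(relabel(x, labels)) > median(relabel(y, labels))
    print(mean_says_x_ahead, median_says_x_ahead)
# First labelling:  False False
# Second labelling: True False
```

The mean changes its mind about who is ahead as soon as the spacing changes, while the median, which only uses order, does not. That is the runner problem in miniature.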

The second problem is that aggregation masks the beauty of triage in terms of identifying if a task has been performed to the pre-determined level. In an ideal world, every area of knowledge that a student is exposed to should be an important contributor to their learning journey. We may have multiple assignments in one area but our assessment mechanism should provide clear opportunities to demonstrate that knowledge. Thus, their achievement of sufficient assignment work to demonstrate their competency in every relevant area of knowledge should be a necessary condition for graduating. When we take triage grading back to an assignment level, we can then look at our assignments grouped by knowledge area and quickly see if a student has some way to go or has achieved the goal. This is not anywhere near as clear when we start aggregating the marks because of the mathematical issues already raised.

Finally, the reduction of triage to mathematical approximation reduces the ability to specify which areas of an assessment are really valuable and, while weighting is a reasonable approximation to this, it is very hard to use a mathematical formula with more and more ‘fudge factors’, a term Rapaport uses, to make up for the fact that this is just a little too fragile.

To summarise, I really like the thrust of this paper. I think what is proposed is far better, even with all of the problems raised above, at giving a reasonable, fair and predictable grade to students. But I think that the clash with existing grading traditions and the implicit requirement to turn everything back into one number is causing problems that have to be addressed. These problems mean that this solution is not, yet, beautiful. But let’s see where we can go.

Tomorrow, I’ll suggest an even more cut-down version of grading and then work on an even trickier problem: late penalties and how they affect grades.

Assessment is (often) neither good nor true.

If you’ve been reading my blog over the past years, you’ll know that I have a lot of time for thinking about assessment systems that encourage and develop students, with an emphasis on intrinsic motivation. I’m strongly influenced by the work of Alfie Kohn, unsurprisingly given I’ve already shown my hand on Foucault! But there are many other writers who are… reassessing assessment: why we do it, why we think we are doing it, how we do it, what actually happens and what we achieve.


In my framing, I want assessment to be as all other aspects of education: aesthetically satisfying, leading to good outcomes and being clear about what it is and what it is not. Beautiful. Good. True. There are some better and worse assessment approaches out there and there are many papers discussing this. One of these that I have found really useful is Rapaport’s paper on a simplified assessment process for consistent, fair and efficient grading. Although I disagree with some aspects, I consider it to be both good, as it is designed to clearly address a certain problem to achieve good outcomes, and true, because it is very honest about providing guidance to the student as to how well they have met the challenge. It is also highly illustrative and honest in representing the struggle of the author in dealing with the collision of novel and traditional assessment systems. However, further discussion of Rapaport is for the near future. Let me start by demonstrating how broken things often are in assessment, by taking you through a hypothetical situation.

Thought Experiment 1

Two students, A and B, are taking the same course. There are a number of assignments in the course and two exams. A and B, by sheer luck, end up doing no overlapping work. They complete different assignments to each other, half each, and achieve the same (cumulative bare pass overall) marks. They then manage to score bare pass marks in both exams, but one answers only the even questions and the other answers only the odd. (And, yes, there are an even number of questions.) Because of the way the assessment was constructed, they have managed to avoid any common answers in the same area of course knowledge. Yet, both end up scoring 50%, a passing grade in the Australian system.

Which of these students has the correct half of the knowledge?
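If you want to see how little the number says, here is the thought experiment as a toy model. The ten topics are made up, and I am simplifying by assuming each assessed item covers exactly one topic and is either fully demonstrated or not attempted.

```python
# Ten hypothetical topics; each assessed item covers exactly one topic.
topics = [f"topic_{i}" for i in range(1, 11)]

student_a = set(topics[0::2])  # topic_1, topic_3, topic_5, ...
student_b = set(topics[1::2])  # topic_2, topic_4, topic_6, ...


def mark(demonstrated):
    """Percentage of topics demonstrated; every topic weighted equally."""
    return 100 * len(demonstrated) / len(topics)


print(mark(student_a), mark(student_b))  # 50.0 50.0
print(student_a & student_b)             # set()
```

Identical marks, empty intersection. The transcript cannot tell these two apart.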

I had planned to build up to Rapaport but, if you’re reading the blog comments, he’s already been mentioned so I’ll summarise his 2011 paper before I get to my main point. In 2011, William J. Rapaport, SUNY Buffalo, published a paper entitled “A Triage Theory of Grading: The Good, The Bad and the Middling” in Teaching Philosophy. This paper summarised a number of thoughtful and important authors, among them Perry, Wolff, and Kohn. Rapaport starts by asking why we grade, moving through Wolff’s taxonomic classification of assessment into criticism, evaluation, and ranking. Students are trained, by our world and our education systems, to treat grades as a measure of progress and, in many ways, a proxy for knowledge. But this brings us into conflict with Perry’s developmental stages, where students start with a deep need for authority and the safety of a single right answer. It is only when students are capable of understanding that there are, in many cases, multiple right answers that we can expect them to understand that grades can have multiple meanings. As Rapaport notes, grades are inherently dual: a representative symbol attached to a quality measure and then, in his words, “ethical and aesthetic values are attached” (emphasis mine). In other words, a B is a measure of progress (not quite there) that also has a value of being … second-tier if an A is our measure of excellence. A is not A, as it must be contextualised. Sorry, Ayn.

When we start to examine why we are grading, Kohn tells us that the carrot and stick is never as effective as the motivation that someone has intrinsically. So we look to Wolff: are we critiquing for feedback, are we evaluating learning, or are we providing handy value measures for sorting our product for some consumer or market? Returning to my thought experiment above, we cannot provide feedback on assignments that students don’t do, our evaluation of learning says that both students are acceptable for complementary knowledge, and our students cannot be discerned from their graded rank, despite the fact that they have nothing in common!

Yes, it’s an artificial example but, without attention to the design of our courses and in particular the design of our assessment, it is entirely possible to achieve this result to some degree. This is where I wish to refer to Rapaport as an example of thoughtful design, with a clear assessment goal in mind. To step away from measures that provide an (effectively) arbitrary distinction, Rapaport proposes a tiered system for grading that simplifies the overall system with an emphasis on identifying whether a piece of assessment work is demonstrating clear knowledge, a partial solution, an incorrect solution or no work at all.

This, for me, is an example of assessment that is pretty close to true. The difference between a 74 and a 75 is, in most cases, not very defensible (after Haladyna) unless you are applying some kind of ‘quality gate’ that really reduces a percentile scale to, at most, 13 different outcomes. Rapaport’s argument is that we can reduce this further and that this will reduce grade clawing, identify clear levels of achievement and reduce marking load on the assessor. That last point is important. A system that buries the marker under load is not sustainable. It cannot be beautiful.

There are issues in taking this approach and turning it back into the grades that our institutions generally require. Rapaport is very open about the difficulties that he has turning his triage system into an acceptable letter grade and it’s worth reading the paper to see that discussion alone, because it quite clearly shows what

Rapaport’s scheme clearly defines which of Wolff’s criteria he wishes his assessment to achieve. The scheme, for individual assessments, is no good for ranking (although we can fashion a ranking from it) but it is good to identify weak areas of knowledge (as transmitted or received) for evaluation of progress and also for providing elementary critique. It says what it is and it pretty much does it. It sets out to achieve a clear goal.

The paper ends with a summary of the key points of Haladyna’s 1999 book “A Complete Guide to Student Grading”, which brings all of this together.

Haladyna says that “Before we assign a grade to any students, we need:

  1. an idea about what a grade means,
  2. an understanding of the purposes of grading,
  3. a set of personal beliefs and proven principles that we will use in teaching and grading,
  4. a set of criteria on which the grade is based, and, finally,
  5. a grading method, which is a set of procedures that we consistently follow in arriving at each student’s grade.” (Haladyna 1999: ix)

There is no doubt that Rapaport’s scheme meets all of these criteria and, yet, for me, we have not yet gone far enough in search of the most beautiful, most good and most true extent to which we can take this idea. Is point 3, which could be summarised as aesthetics, not enough for me? Apparently not.

Tomorrow I will return to Rapaport to discuss those aspects I disagree with and, later on, discuss both an even more trimmed-down model and some more controversial aspects.