Grades are the fossils of evaluation

Assessments support evaluation, criticism and ranking (Wolff). That’s what it does and, in many cases, that also constitutes a lot of why we do it. But who are we doing it for?

I’ve reflected on the dual nature of evaluation, showing a student her or his level of progress and mastery while also telling us how well the learning environment is working. In my argument to reduce numerical grades to something meaningful, I’ve asked what the actual requirement is for our students, how we measure mastery and how we can build systems to provide this.

But who are the student’s grades actually for?

In terms of ranking, grades allow people who are not the student to place the students in some order. By doing this, we can award awards to students who are in the awarding an award band (repeated word use deliberate). We can restrict our job interviews to students who are summa cum laude or valedictorian or Dean’s Merit Award Winner. Certain groups of students, not all, like to define their progress through comparison so there is a degree of self-ranking but, for the most part, ranking is something that happens to students.

Criticism, in terms of providing constructive, timely feedback to assist the student, is weakly linked to any grading system. Giving someone a Fail grade isn’t a critique as it contains no clear identification of the problems. The clear identification of problems may not constitute a fail. Often these correlate but it’s weak. A student’s grades are not going to provide useful critique to the student by themselves. These grades are to allow us to work out if the student has met our assessment mechanisms to a point where they can count this course as a pre-requisite or can be awarded a degree. (Award!)

Evaluation is, as noted, useful to us and the student but a grade by itself does not contain enough record of process to be useful in evaluating how mastery goals were met and how the learning environment succeeded or failed. Competency, when applied systematically, does have a well-defined meaning. A passing grade does not although there is an implied competency and there is a loose correlation with achievement.

Grades allow us to look at all of a student’s work as if this one impression is a reflection of the student’s involvement, engagement, study, mistakes, triumphs, hopes and dreams. They are additions to a record from which we attempt to reconstruct a living, whole being.

Grades are the fossils of evaluation.

Grades provide a mechanism for us, in a proxy role as academic archaeologist, to classify students into different groups, in an attempt to project colour into grey stone, to try and understand the ecosystem that such a creature would live in, and to identify how successful this species was.

As someone who has been a student several times in my life, I’m aware that I have a fossil record that is not traditional for an academic. I was lucky to be able to place a new imprint in the record, to obscure my history as a much less successful species, and could then build upon it until I became an ACADEMIC TYRANNOSAURUS.

Skull of a Tyrannosaurus Rex at Palais de la Decouverte


But I’m lucky. I’m privileged. I had a level of schooling and parental influence that provided me with an excellent vocabulary and high social mobility. I live in a safe city. I have a supportive partner. And, more importantly, at a crucial moment in my life, someone who knew me told me about an opportunity that I was able to pursue despite the grades that I had set in stone. A chance came my way that I never would have thought of because I had internalised my grades as my worth.

Let’s look at the fossil record of Nick.

My original GPA fossil, encompassing everything that went wrong and right in my first degree, was 2.9. On a scale of 7, which is how we measure it, that’s well below a pass average. I’m sharing that because I want you to put that fact together with what happened next. Four years later, I started a Masters program that I finished with a GPA of 6.4. A few years after the masters, I decided to go and study wine making. That degree was 6.43. Then I received a PhD, with commendation, that is equivalent to GPA 7. (We don’t actually use GPA in research degrees. Hmmm.) If my grade record alone lobbed onto your desk you would see the desiccated and dead snapshot of how I (failed to) engage with the University system. A lot of that is on me but, amazingly, it appears that much better things were possible. That original grade record stopped me from getting interviews. Stopped me from getting jobs. When I was finally able to demonstrate the skills that I had, which weren’t bad, I was able to get work. Then I had the opportunity to rewrite my historical record.

Yes, this is personal for me. But it’s not about me because I wasn’t trapped by this. I was lucky as well as privileged. I can’t emphasise that enough. The fact that you are reading this is due to luck. That’s not a good enough mechanism.

Too many students don’t have this opportunity. That impression in the wet mud of their school life will harden into a stone straitjacket from which they may never escape. The way we measure and record grades has far too much potential to work against students and the correlation with actual ability is there but it’s not strong and it’s not always reliable.

The student you are about to send out with a GPA of 2.9 may be competent and they are, most definitely, more than that number.

The recording of grades is a high-loss storage record of the student’s learning and pathway to mastery. It allows us to conceal achievement and failure alike in the accumulation of mathematical aggregates that proxy for competence but correlate weakly.

We need assessment systems that work for the student first and everyone else second.

Is It Called Ranking Because It Smells Funny?

Years ago, I was a professional winemaker, which is an awesome job but with very long hours (that seems to be a trend for me). One of the things that we did a lot in winemaking was to assess the quality of wine to work out if we’d made what we wanted to but also to allow us to blend this parcel with that parcel and come up with a better wine. Wine judging, for wine shows, is an important part of getting feedback on the quality of your wine as it’s perceived by other professionals. Wine is judged on a 20 point scale most of the time, although some 100 point schemes are in operation. The problem is that this scale is not actually as wide as it might look. Wines below 12/20 are usually regarded as faulty or not at commercial level – so, in reality, most wine shows are working in the range 12-19.5 (20 was relatively rare but I don’t know what it’s like now). This gets worse for the “100 point” ranges, where Wine Spectator claim to go from 50-100, but James Halliday (a prominent wine critic) rates from 75-100, where ‘Good’ starts at 80. This is really quite unnecessarily confusing, because it means that James Halliday is effectively using a version of the 16 available ranks (12-19.5 at 0.5 interval) of the 20 point scale, mapped into a higher range.

Of course, the numbers are highly subjective, even to a very well trained palate, because the difference between an 87 and an 88 could be colour, or bouquet, or flavour – so saying that the wine at 88 is better doesn’t mean anything unless you know what the rater actually means by that kind of ranking. I used to really enjoy the wine selections of a wine writer called Mark Shields, because he used a straightforward rating system and our palates were fairly well aligned. If Mark liked it, I’d probably like it. This is the dirty secret of any ranking mechanism that has any aspect of subjectivity or weighting built into it – it needs to agree with your interpretation of reasonable or you will always be at odds with it.

In terms of wine, the medal system that is in use really does give us four basic categories: commercially sound (no medal), better than usual (bronze), much better than usual (silver) and “please pass me another bottle” (gold). On top of that, you have the ‘best in show’ effectively which says that, in this place and from these tasters, this was the best overall in this category. To be frank, the gold wines normally blow away the bronzes and the no-awards, but the line between silver and gold is a little more blurred. However, the show medals have one advantage in that a given class has been inspected by the same people and the wines have actually been compared (in one sense) and ranked. However, if nothing is outstanding then no medals will be awarded because it is based on the marks on the 20 point scale, so if all the wines come in at 13, there will be no gongs – there doesn’t have to be a gold or a silver, or even a bronze, although that would be highly unusual. More subtly, gold at one show may not even get bronze at another – another dirty little secret of subjective ranking, sometimes what you are comparing things to makes a very big difference.

Which brings me to my point, which is the ranking on Universities. You’re probably aware that there are national and international rankings of Universities across a range of metrics, often related to funding and research, but that the different rankings have broad agreement rather than exact agreement as to who is ‘top’, top 5 and so on. The Times Higher Education supplement provides a stunning area of snazzy looking graphics, with their weightings as to what makes a great University. But, when we look at this, and we appear to have accuracy to one significant figure (ahem), is it significant that Caltech is 1.8 points higher than Stanford? Is this actually useful information in terms of which university a student might wish to attend? Well, teaching (learning environment) makes 30% of the score, international outlook makes up 7.5%, industry income makes up 2.5%, research is 30% (volume, income and reputation) and citations (research influence) make up the last 30%. If we sort by learning environment (because I am a curious undergraduate, say) then the order starts shifting, not at the very top to the list, but certainly further down – Yale would leap to 4th in the US instead of 9th. Once we get out of the top 200, suddenly we have very broad bands and, honestly, you have to wonder why we are still putting together the numbers if the thing that people appear to be worrying about is the top 200. (When you deem worthiness on a rating scale as only being a subset of the available scale, you rapidly turn something that could be considered continuous into something with an increasingly constrained categorical basis.) But let’s go to the Shanghai rankings, where Caltech is dropped from number 1 to number 6. Or the QS World Rankings, who rate Caltech as #10.

Obviously, there is no doubt about the general class of these universities, but it does appear that the judges are having some difficulty in consistently awarding best in class medals. This would be of minor interest, were it not for the fact that these ratings do actually matter in terms of industry confidence in partnership, in terms of attracting students from outside of your home educational system and in terms of who get to be the voices who decide what constitutes a good University. It strikes me that broad classes are something could apply quite well here. Who really cares whether Caltech is 1, 6 or 10 – it’s obviously rating well across the board and, barring catastrophe, always will.

So why keep ranking it? What we’re currently doing is polishing the door knob on the ranking system, devoting effort to ranking Universities like Caltech, Stanford, Harvard, Cambridge and Oxford, who could not, with any credibility be ranked low – or we’d immediately think that the ranking mechanism was suspect. So let’s stop ranking them, because it’s compressing the ranking at a point where the ranking is not even vaguely informational. What would be interesting was more devotion to the bands further down, where a University can assess its global progress against its peers to find out if it’s cutting the mustard.

If I put a bottle of Grange (one of Australia’s best red wines and it is pretty much worthy of its reputation if not its price) into a wine show and it came back with less than 17/20, I’d immediately suspect the rating system and the professionalism of the judges. The question is, of course, why would I put it in other than to win gold medals – what am I actually achieving in this sense? If it’s a commercial decision to sell more wine then I get this but wine is, after all, just wine and you and I drink it the same way. Universities, especially when ranked across complex weighted metrics and by different people, are very different products to different people. The single figure ranking may carry prestige and probably attracts both students and money but should it? Does it make any sense to be so detailed (one significant figure, indeed) about how one stacks up against each other, when in reality you have almost exponentially separated groups – my University will never ‘challenge’ Caltech, and if Caltech ‘drops’ to the graded level of the University of Melbourne (one of our most highly ranked Unis), I’m not sure that the experience will tell Caltech anything other than “Ruh oh!”

The Scooby Gang, stunned that Caltech was now in the range 50-100.

The Scooby Gang, stunned that Caltech was now in the range 50-100.

If I could summarise all of this, it would be to say that our leader board and ranking obsession would be fine, were it not for the amount of time spent on these things, the weight placed upon what is ultimately highly subjective even in terms of the weighting, and is not clearly defined as to how these rankings can be used to make sensible decisions. Perhaps there is something more useful we could be doing with our time?