The Illusion of a Number
Posted: January 10, 2016 | Filed under: Education, Opinion | Tags: authenticity, beauty, curve grading, design, education, educational problem, educational research, ethics, grading, higher education, in the student's head, learning, rapaport, reflection, resources, teaching, teaching approaches, thinking, tools, wittgenstein
I hope you’ve had a chance to read William Rapaport’s paper, which I referred to yesterday. He proposed a great, simple alternative to traditional grading that reduces confusion about what is signalled by ‘grade-type’ feedback, as well as making things easier for students and teachers. Being me, after saying how much I liked it, I then finished by saying “… but I think that there are problems.” His approach was that we could break all grading down into: did nothing, wrong answer, some way to go, pretty much there. And that, I think, is much better than a lot of the nonsense that we pretend we hand out as marks. But, yes, I have some problems.
I note that Rapaport’s exceedingly clear and honest account of what he is doing includes this statement. “Still, there are some subjective calls to make, and you might very well disagree with the way that I have made them.” Therefore, I have license to accept the value of the overall scholarship and the frame of the approach, without having to accept all of the implementation details given in the paper. Onwards!
I think my biggest concern with the approach given is not in how it works for individual assessment elements. In that area, I think it shines, as it makes clear what has been achieved. A marker can quickly place the work into one of four boxes if there are clear guidelines as to what has to be achieved, without having to worry about one or two percentage points here or there. Because the grade bands are so distinct, as Rapaport notes, it is very hard for the student to make the ‘I only need one more point’ argument that is so clearly indicative of a focus on the grade rather than the learning. (I note that such emphasis is often what we have trained students for; there is no pejorative intention here.) I agree this is consistent and fair, and time-saving (after Walvoord and Anderson), and it avoids curve grading, which I loathe with a passion.
However, my problems start when we are combining a number of these triaged grades into a cumulative mark for an assignment or for a final letter grade, showing progress in the course. Sections 4.3 and 4.4 of the paper detail the implementation of assignments that have triage graded sub-tasks. Now, instead of receiving a “some way to go” for an assignment, we can start getting different scores for sub-tasks. Let’s look at an example from the paper, note 12, to describe programming projects in CS.
- Problem definition: 0,1,2,3
- Top-down design: 0,1,2,3
- Documented code
  - Code: 0,1,2,3
  - Documentation: 0,1,2,3
- Annotated output
  - Output: 0,1,2,3
  - Annotations: 0,1,2,3

Total possible points = 18
Remember my hypothetical situation from yesterday? I provided an example of two students who managed to score enough marks to pass by knowing the complement of each other’s course knowledge. Looking at the above example, it appears (although not easily) to be possible for this situation to occur and both students to receive a 9/18, yet for different aspects. But I have some more pressing questions:
- Should it be possible for a student to receive full marks for output, if there is no definition, design or code presented?
- Can a student receive full marks for everything else if they have no design?
The first question indicates what we already know about task dependencies: if we want to build them into numerical grading, we have to be pedantically specific and provide rules on top of the aggregation mathematics. But, more subtly, by aggregating these measures, we no longer have an ‘accurately triaged’ grade to indicate if the assignment as a whole is acceptable or not. An assignment with no definition, design or code can hardly be considered to be a valid submission, yet good output, documentation and annotation (with no code) will not give us the right result!
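To make the aggregation problem concrete, here is a minimal sketch in Python. The sub-task names follow the example from the paper, but the two submissions and the dependency rule are invented for illustration; nothing here is Rapaport's own implementation.

```python
# Sub-task names follow the paper's example; each is scored 0-3.
SUBTASKS = [
    "problem_definition", "top_down_design", "code",
    "documentation", "output", "annotations",
]

def total(scores):
    """Plain sum over the six sub-tasks (maximum 18)."""
    return sum(scores[name] for name in SUBTASKS)

def violates_dependency(scores):
    """Hypothetical rule: output cannot earn marks if no code was credited."""
    return scores["code"] == 0 and scores["output"] > 0

# Two invented submissions with complementary strengths.
no_code = {"problem_definition": 0, "top_down_design": 0, "code": 0,
           "documentation": 3, "output": 3, "annotations": 3}
code_only = {"problem_definition": 3, "top_down_design": 3, "code": 3,
             "documentation": 0, "output": 0, "annotations": 0}

print(total(no_code), total(code_only))   # 9 9
print(violates_dependency(no_code))       # True
print(violates_dependency(code_only))     # False
```

Both submissions total 9/18, so the sum alone cannot tell them apart. Catching the first one takes an explicit rule layered on top of the arithmetic, and every such dependency would need its own rule.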
The second question is more for those of us who teach programming and it’s a question we all should ask. If a student can get a decent grade for an assignment without submitting a design, then what message are we sending? We are, implicitly, saying that although we talk a lot about design, it’s not something you have to do in order to be successful. Rapaport does go on to talk about weightings and how we can emphasise these issues but we are still faced with an ugly reality that, unless we weight our key aspects to be 50-60% of the final aggregate, students will be able to side-step them and still perform to a passing standard. Every assignment should be doing something useful, modelling the correct approaches, demonstrating correct techniques. How do we capture that?
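The weighting arithmetic can be sketched quickly. This Python fragment assumes a 50% pass mark and two invented weighting schemes; neither comes from the paper. It asks what the best possible aggregate is for a student who skips the design entirely and gets full marks on everything else.

```python
PASS_MARK = 0.5  # assumed pass threshold, not taken from the paper

SUBTASKS = [
    "problem_definition", "top_down_design", "code",
    "documentation", "output", "annotations",
]

def best_score_without(weights, skipped):
    """Highest aggregate (0..1) with full marks on all sub-tasks except `skipped`."""
    total_weight = sum(weights.values())
    return (total_weight - weights[skipped]) / total_weight

# Invented schemes: equal weights, and one where design carries 6/11 (about 55%).
equal = {name: 1 for name in SUBTASKS}
design_heavy = dict(equal, top_down_design=6)

can_skip_equal = best_score_without(equal, "top_down_design") >= PASS_MARK
can_skip_heavy = best_score_without(design_heavy, "top_down_design") >= PASS_MARK

print(can_skip_equal)  # True: 5/6 of the marks are still available
print(can_skip_heavy)  # False: only 5/11 of the marks remain
```

Under these assumptions, design only becomes unavoidable once it carries more than half of the total weight, which then squeezes every other sub-task into the remainder.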
Now, let me step back and say that I have no problem with identifying the sub-tasks and clearly indicating the level of performance using triage grading, but I disagree with using it for marks. For feedback it is absolutely invaluable: triage grading on sub-tasks will immediately tell you where the majority of students are having trouble, quickly. That then lets you know an area that is more challenging than you thought or one that your students were not prepared for, for some reason. (If every student in the class is struggling with something, the problem is more likely to lie with the teacher.) However, I see three major problems with sub-task aggregation and, thus, with final grade aggregation from assignments.
The first problem is that I think this is the wrong kind of scale to try and aggregate in this way. As Rapaport notes, agreement on clear, linear intervals in grading is never going to be achieved and is, very likely, not even possible. Recall that there are four fundamental types of scale: nominal, ordinal, interval and ratio. The scales in use for triage grading are not interval scales (the intervals aren’t predictable or equidistant) and thus we cannot expect to average them and get sensible results. What we have here are, to my eye, ordinal scales, with no objective distance but a clear ranking of best to worst. The clearest indicator of this is the construction of a B grade for final grading, where no such concept exists in the triage marks for assessing assignment quality. We have created a “some way to go but sometimes nearly perfect” that shouldn’t really exist. Think of it like runners: you win one race and you come third in another. You never actually came second in any race so averaging it makes no sense.
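One way to see why averaging an ordinal scale is shaky: relabel the grades with different numbers that keep exactly the same order, and check whether the ranking of two students survives. The Python sketch below uses two invented students and an invented alternative labelling; only the 0-3 labels come from the triage scheme.

```python
from statistics import mean, median

# Two invented students, three triage grades each (0-3).
student_a = [3, 1, 1]
student_b = [2, 2, 2]

# Two labellings of the same four ordered categories.
paper_labels = {0: 0, 1: 1, 2: 2, 3: 3}
stretched_labels = {0: 0, 1: 1, 2: 2, 3: 5}  # same order, different spacing

def mean_under(labels, grades):
    """Mean of the grades after mapping each one through `labels`."""
    return mean([labels[g] for g in grades])

# With the paper's labels, B has the higher mean (2 vs about 1.67).
b_ahead_on_paper = mean_under(paper_labels, student_a) < mean_under(paper_labels, student_b)
# With the stretched labels, A has the higher mean (about 2.33 vs 2).
a_ahead_when_stretched = mean_under(stretched_labels, student_a) > mean_under(stretched_labels, student_b)

print(b_ahead_on_paper, a_ahead_when_stretched)  # True True

# An order-based summary such as the median ranks B ahead either way.
print(median(student_a), median(student_b))      # 1 2
```

If the spacing between categories is a matter of opinion, then so is the ranking produced by the mean. The median depends only on the order, so it does not flip here.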
The second problem is that aggregation masks the beauty of triage in terms of identifying if a task has been performed to the pre-determined level. In an ideal world, every area of knowledge that a student is exposed to should be an important contributor to their learning journey. We may have multiple assignments in one area but our assessment mechanism should provide clear opportunities to demonstrate that knowledge. Thus, their achievement of sufficient assignment work to demonstrate their competency in every relevant area of knowledge should be a necessary condition for graduating. When we take triage grading back to an assignment level, we can then look at our assignments grouped by knowledge area and quickly see if a student has some way to go or has achieved the goal. This is not anywhere near as clear when we start aggregating the marks because of the mathematical issues already raised.
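A rough Python sketch of the un-aggregated alternative: keep each assignment's triage result, group by knowledge area, and check every area separately. The area names, the results and the rule for "sufficient" (the best result in an area must reach "pretty much there") are all assumptions made for illustration.

```python
from collections import defaultdict

# The four triage levels, ordered worst to best.
LEVELS = ["did nothing", "wrong answer", "some way to go", "pretty much there"]
GOAL = LEVELS.index("pretty much there")

def areas_met(results):
    """Map each knowledge area to True if its best triage result reaches GOAL.

    `results` is a list of (knowledge_area, level_name) pairs.
    Assumed rule: one result at the goal level is enough for an area.
    """
    best = defaultdict(int)
    for area, level in results:
        best[area] = max(best[area], LEVELS.index(level))
    return {area: rank >= GOAL for area, rank in best.items()}

# Invented results for one student across hypothetical knowledge areas.
results = [
    ("recursion", "some way to go"),
    ("recursion", "pretty much there"),
    ("testing", "some way to go"),
    ("design", "pretty much there"),
]

status = areas_met(results)
print(status)                # {'recursion': True, 'testing': False, 'design': True}
print(all(status.values()))  # False: 'testing' still has some way to go
```

No numbers are averaged at any point. The output says which area is unfinished, which is exactly the information that a single aggregate mark would hide.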
Finally, the reduction of triage to mathematical approximation reduces the ability to specify which areas of an assessment are really valuable and, while weighting is a reasonable approximation to this, it is very hard to use a mathematical formula with more and more ‘fudge factors’, a term Rapaport uses, to make up for the fact that this is just a little too fragile.
To summarise, I really like the thrust of this paper. I think what is proposed is far better, even with all of the problems raised above, at giving a reasonable, fair and predictable grade to students. But I think that the clash with existing grading traditions and the implicit requirement to turn everything back into one number is causing problems that have to be addressed. These problems mean that this solution is not, yet, beautiful. But let’s see where we can go.
Tomorrow, I’ll suggest an even more cut-down version of grading and then work on an even trickier problem: late penalties and how they affect grades.
Nick, I’m a humanities student. [it’s OK, I’m not anyone’s target audience] Students hang on that extra point that will push them to a pass, credit, distinction or HD. In a previous life I was a Humanities student at a (NSW) university where each student received either a pass or fail grade (the tutor/lecturer gave comprehensive feedback). The students rebelled, and felt they weren’t being “rewarded” for hard work, and the university implemented grading.
Now at another (NSW) university, and I’m sick of hearing the students complain about “one more mark” and their illusion that the number maketh the man.
Looking forward to late penalties and grades!
Thanks for the comment! Yes, the problem we have is that that one point is meaningless and dependency upon it is a pretty clear sign of misdirected focus and (as Rapaport notes) dualistic dependency on authority. And yet we have that focus! That’s not a criticism of students, by the way, that’s an admission that we create an environment in which students demonstrate those behaviours.
I’m sure many other readers will be thinking the same thing so it’s great to see this raised and, yes, I will be talking about my own struggles in working out what to do with this.
One of my biggest problems in doing research into this for the past 12 months, with an eye to implementing it on a large scale, has been accepting that such issues are going to occur and then working out how we can address them. There are a number of times where our knowledge of education is at odds with both what students want and what they think will be best for them. However, I’m trying very hard not to make an argument from authority but to develop an approach that is convincing even for those students who are desperately clinging to the grading myths. Oh, and let’s not forget all of my peers who are also clinging to the same myths.
And, from everything we know, those precise-appearing numerical scores are mostly rubbish. We should never be in the business of rubbish!
Out of curiosity, for the NSW uni, was that situation a general policy for that school or program or were some courses graded and some not? I was wondering if the marking system, in isolation, was leading to the problem or a more general problem? Did it affect GPA in some way or was that not a problem?
It’s worth noting, and you probably already know, that MIT in the US has a ‘no marks’ (P/F) policy for the first semester, specifically to address transition issues, and that’s one of the hardest Unis to get into. Several others, including Brown, use grading mechanisms that give control to the students and try to change the environment so that students don’t define themselves through the numbers. We know it can work in some places but we, well I, don’t know the magic recipe to make it work everywhere yet.
Thanks again for the comment. I look forward to hearing more about that course!
I’d be really interested in your opinion of our proposed grading system for the new curriculum we will be rolling out this year. We have tried to break down the grading into different domains, and some of those the student will have to pass independently to progress through a module. For example, obviously a future physician has to have a level of proficiency in clinical skills. So all students will have to demonstrate a minimum proficiency (determined by both written exams as well as working with simulated patients) to pass a module. They will also have to have a minimum grade on standardized MCQ exams to pass a module (required as ultimately they have to pass a high-stakes licensing exam which is all MCQ-based). The remainder of their grade will come from assessment during active learning sessions where we will have well-defined rubrics for what constitutes an appropriate level of participation, and some points for completion of formative quizzes. The remaining grading component is professionalism – all students will start off with a specific number of professionalism points that they can subsequently lose for a variety of reasons – consistently being late, not showing up for mandatory sessions etc. with a sliding scale of points for the seriousness of the offense. They will have to have all of their professionalism points to receive an honors grade (we have Honors, High Pass, Pass and Fail grades). In the first two years, as we consider that they are learning to be professionals, they will have an opportunity to earn one (and only one) point back by doing one thing from a prescribed list including writing a report on how that behavior could impact a patient if they acted that way when in a clinical context, or being mentored by a senior student.