I’ve been arguing that late penalties are not only unhelpful but simply don’t work, yet I keep talking about getting work in on time and tying it to realistic resource allocation. Does this mean I’m really using late penalties?
No, but let me explain why, starting from the underlying principle of fairness that is an aesthetic pillar of good education. One part of this is that the actions of one student should not unduly affect the learning journey of another student. That includes evaluation (and associated marks).
This is the same principle that makes me reject curve grading. It makes no sense to me that someone else’s work is judged in the context of another, when we have so little real information with which we could establish any form of equivalence of human experience and available capacity.
I don’t want to create a market economy for knowledge, where we devalue successful demonstrations of knowledge and skill for reasons that have nothing to do with learning. Curve grading devalues knowledge. Time penalties devalue knowledge.
I do have to deal with resource constraints, in that I often have (some) deadlines that are administrative necessities, such as degree awards and things like this. I have limited human resources, both personally and professionally.
Given that I do not have unconstrained resources, the fairness principle naturally extends to say that individual students should not consume resources to the detriment of others. I know that I have a limited amount of human evaluation time; therefore, I have to treat this as a constrained resource. My E1 and E2 evaluation resources must be, to a degree at least, protected to ensure the best outcome for the most students. (We can factor equity into this, and should, but this stops this from being a simple linear equivalence and makes the terms more complex than they need to be for explanation, so I’ll continue this discussion as if we’re discussing equality.)
You’ve noticed that the E3 and E4 evaluation systems are pretty much always available to students. That’s deliberate. If we can automate something, we can scale it. No student is depriving another of timely evaluation and so there’s no limitation of access to E3 and E4, unless it’s too late for it to be of use.
If we ask students to get their work in at time X, it should be on the expectation that we are ready to leap into action at second X+(prep time), or that the students should be engaged in some other worthwhile activity from X+1, because otherwise we have made up a nonsense figure. In order to be fair, we should release all of our evaluations back at the same time, to avoid accidental advantages because of the order in which things were marked. (We may wish to vary this for time banking but we’ll come back to this later.) As many things are marked in surname or student number order, the only way to ensure that we don’t accidentally keep granting an advantage is to release everything at the same time.
Remember, our whole scheme is predicated on the assumption that we have designed and planned for how long it will take to go through the work and provide feedback in time for modification before another submission. When X+(prep time) comes, we should know, roughly to the hour or day, at worst, when this will be done.
If a student hands up fifteen minutes late, they have most likely missed the preparation phase. If we delay our process to include this student, then we will delay feedback to everyone. Here is a genuine motivation for students to submit on time: they will receive rich and detailed feedback as soon as it is ready. Students who hand up late will be assessed in the next round.
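The queueing rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `assign_round`, the list of deadlines and the example timestamps are all hypothetical, and a real course would need to handle extensions and equity adjustments separately.

```python
from bisect import bisect_left
from datetime import datetime
from typing import Optional


def assign_round(submitted_at: datetime, deadlines: list[datetime]) -> Optional[int]:
    """Return the index of the evaluation round that will assess a submission.

    `deadlines` is a sorted list of submission deadlines (hypothetical values).
    Work handed up on or before a deadline joins that round. Work handed up
    after it waits for the next round, with no marks deducted. If no round
    remains, the opportunity has passed and None is returned.
    """
    # bisect_left finds the first deadline that is not earlier than the submission.
    index = bisect_left(deadlines, submitted_at)
    return index if index < len(deadlines) else None


# Hypothetical deadlines two weeks apart.
deadlines = [datetime(2016, 3, 1, 17, 0), datetime(2016, 3, 15, 17, 0)]
print(assign_round(datetime(2016, 3, 1, 17, 15), deadlines))  # 1: fifteen minutes late, next round
```

Note that nothing in the sketch touches the mark itself: lateness changes when the work is evaluated, not what it is worth.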
That’s how the real world actually works. No-one gives you half marks for something that you do a day late. It’s either accepted or not and, often, you go to the back of the queue. When you miss the bus, you don’t get 50% of the bus. You just have to wait for the next opportunity and, most of the time, there is another bus. Being late once rarely leaves you stranded without apparent hope – unlucky Martian visitors aside.
But there’s more to this. When we have finished with the first group, we can immediately release detailed feedback on what we were expecting to see, providing the best results to students and, from that point on, anyone who submits would have the benefit of information that the first group didn’t have before their initial submission. Rather than make the first group think that they should have waited (and we know students do), we give them the best possible outcome for organising their time.
The next submission deadline is done by everyone with the knowledge gained from the first pass but people who didn’t contribute to it can’t immediately use it for their own benefit. So there’s no free-riding.
There is, of course, a tricky period between the submission deadline and the release, where we could say “Well, they didn’t see the feedback” and accept the work but that’s when we think about the message we want to send. We would prefer students to improve their time management and one part of this is to have genuine outcomes from necessary deadlines.
If we let students keep handing in later and later, we will eventually end up having these late submissions running into our requirement to give feedback. But, more importantly, we will say “You shouldn’t have bothered” to those students who did hand up on time. When you say something like this, students will learn and they will change their behaviour. We should never reinforce behaviour that is the opposite of what we consider to be valuable.
Fairness is a core aesthetic of education. Authentic time management needs to reflect the reality of lost opportunity, rather than diminished recognition of good work in some numerical reduction. Our beauty argument is clear: we can be firm on certain deadlines and remove certain tasks from consideration, and this will be a better approach, more likely to have positive outcomes, than the arbitrary reduction schemes already in use.
How can we create a better assessment system, without penalties, that works in a grade-free environment? Let’s provide a foundation for this discussion by looking at assessment today.
We have many different ways of understanding exactly how we are assessing knowledge. Bloom’s taxonomy allows us to classify the objectives that we set for students, in that we can determine if we’re just asking them to remember something, explain it, apply it, analyse it, evaluate it or, having mastered all of those other aspects, create a new example of it. We’ve also got Biggs’ SOLO taxonomy to classify levels of increasing complexity in a student’s understanding of subjects. Now let’s add in threshold concepts, learning edge momentum, neo-Piagetian theory and …
Let’s summarise and just say that we know that students take a while to learn things, can demonstrate some convincing illusions of progress that quickly fall apart, and that we can design our activities and assessment in a way that acknowledges this.
I attended a talk by Eric Mazur, of Peer Instruction fame, and he said a lot of what I’ve already said about assessment not working with how we know we should be teaching. His belief is that we rarely rise above remembering and understanding, when it comes to testing, and he’s at Harvard, where everyone would easily accept their practices as, in theory, being top notch. Eric proposed a number of approaches but his focus on outcomes was one that I really liked. He wanted to keep the coaching role he could provide separate from his evaluator role: another thing I think we should be doing more.
Eric is in Physics but all of these ideas have been extensively explored in my own field, especially where we start to look at which of the levels we teach students to and then what we assess. We do a lot of work on this in Australia and here is some work by our groups and others I have learned from:
- Szabo, C., Falkner, K. & Falkner, N. 2014, ‘Experiences in Course Design using Neo-Piagetian Theory’
- Falkner, K., Vivian, R., Falkner, N., 2013, ‘Neo-piagetian Forms of Reasoning in Software Development Process Construction’
- Whalley, J., Lister, R.F., Thompson, E., Clear, T., Robbins, P., Kumar, P. & Prasad, C. 2006, ‘An Australasian study of reading and comprehension skills in novice programmers, using Bloom and SOLO taxonomies’
- Gluga, R., Kay, J., Lister, R.F. & Teague, D. 2012, ‘On the reliability of classifying programming tasks using a neo-piagetian theory of cognitive development’
I would be remiss to not mention Anna Eckerdal’s work, and collaborations, in the area of threshold concepts. You can find her many papers on determining which concepts are going to challenge students the most, and how we could deal with this, here.
Let me summarise all of this:
- There are different levels at which students will perform as they learn.
- It needs careful evaluation to separate students who appear to have learned something from students who have actually learned something.
- We often focus too much on memorisation and simple explanation, without going to more advanced levels.
- If we want to assess advanced levels, we may have to give up the idea of trying to grade these additional steps, because objectivity is almost impossible, as is task equivalence.
- We should teach in a way that supports the assessment we wish to carry out. The assessment we wish to carry out is the right choice to demonstrate true mastery of knowledge and skills.
If we are not designing for our learning outcomes, we’re unlikely to create courses to achieve those outcomes. If we don’t take into account the realities of student behaviour, we will also fail.
We can break our assessment tasks down by one of the taxonomies or learning theories and, from my own work and that of others, we know that we will get better results if we provide a learning environment that supports assessment at the desired taxonomic level.
But, there is a problem. The most descriptive, authentic and open-ended assessments incur the most load in terms of expert human marking. We don’t have a lot of expert human markers. Overloading them is not good. Pretending that we can mark an infinite number of assignments is not true. Our evaluation aesthetics are objectivity, fairness, effectiveness, timeliness and depth of feedback. Assignment evaluation should be useful to the students, to show progress, and useful to us, to show the health of the learning environment. Overloading the marker will compromise the aesthetics.
Our beauty lens tells us very clearly that we need to be careful about how we deal with our finite resources. As Eric notes, and we all know, if we were to test simpler aspects of student learning, we can throw machines at it and we have a near infinite supply of machines. I cannot produce more experts like me, easily. (Snickers from the audience) I can recruit human evaluators from my casual pool and train them to mark to something like my standard, using a rubric or using an approximation of my approach.
Thus I have a framework of assignments, divided by level, and I appear to have assignment evaluation resources. And the more expert and human the marker, the more … for want of a better word … valuable the resource. The better feedback it can produce. Yet the more valuable the resource, the less of it I have because it takes time to develop evaluation skills in humans.
Tune in tomorrow for the penalty free evaluation and feedback that ties all of this together.
How does one actually turn everything I’ve been saying into a course that can be taught? We already have examples of this working, whether in the performance/competency based models found in medical schools around the world or in mastery learning based approaches where we do not measure anything except whether a student has demonstrated sufficient knowledge or skill to show an appropriate level of mastery.
An absence of grades, or student control over their grades, is not as uncommon as many people think. MIT in the United States give students their entire first semester with no grades more specific than pass or fail. This is a deliberate decision to ease the transition of students who have gone from being leaders at their own schools to the compressed scale of MIT. Why compressed? If we were to assess all school students then we would need a scale that could measure all levels of ability, from ‘not making any progress at school’ to ‘transcendent’. The tertiary entry band is somewhere between ‘passing school studies’ to ‘transcendent’ and, depending upon the college that you enter, can shift higher and higher as your target institution becomes more exclusive. If you look at the MIT entry requirements, they are a little coy for ‘per student’ adjustments, but when the 75th percentile for the SAT components is 800, 790, 790, and 800,800,800 would be perfect, we can see that any arguments on how demotivating simple pass/fail grades must be for excellent students have not just withered, they have caught fire and the ash has blown away. When the target is MIT, it appears the freshmen get their head around a system that is even simpler than Rapaport’s.
Other universities, such as Brown, deliberately allow students to choose how their marks are presented, as they wish to deemphasise the numbers in order to focus on education. It is not a cakewalk to get into Brown, as these figures attest, and yet Brown have made a clear statement that they have changed their grading system in order to change student behaviour – and the world is just going to have to deal with that. It doesn’t seem to be hurting their graduates, from quotes on the website such as “Our 85% admission rate to medical school and 89% admission rate to law school are both far above the national average.”
And, returning to medical schools themselves, my own University runs a medical program where the usual guidelines for grading do not hold. The medical school is running on a performance/competency scheme, where students who wish to practise medicine must demonstrate that they are knowledgeable, skilful and safe to practise. Medical schools have identified the core problem in my thought experiment where two students could have the opposite set of knowledge or skills and they have come to the same logical conclusion: decide what is important and set up a scheme that works for it.
When I was a soldier, I was responsible for much of the Officer Training in my home state for the Reserve. We had any number of things to report on for our candidates, across knowledge and skills, but one of them was “Demonstrate the qualities of an officer” and this single item could fail an otherwise suitable candidate. If a candidate could not be trusted to one day be in command of troops on the battlefield, based on problems we saw in peacetime, then they would be counselled to see if it could be addressed and, if not, let go. (I can assure you that this was not used often and it required a large number of observations and discussion before we would pull that handle. The power of such a thing forced us to be responsible.)
We know that limited scale, mastery-based approaches are not just working in the vocational sector but in allied sectors (such as the military), in the Ivy League (Brown) and in highly prestigious non-Ivy League institutions such as MIT. But we also know of examples such as Harvey Mudd, who proudly state that only seven students since 1955 have earned a 4.0 GPA and have a post on the career blog devoted to “explaining why your GPA is so low”. And, be in no doubt, Harvey Mudd is an excellent school, especially for my discipline. I’m not criticising their program, I’ve only heard great things about them, but when you have to put up a page like that? You’re admitting that there’s a problem but you are pushing it on to the student to fix it. Contrast that with Brown, who say to employers “look at our students, not their grades” (at least on the website).
Feedback to the students on their progress is essential. Being able to see what your students are up to is essential for the teacher. Being able to see what your staff and schools are doing is important for the University. Employers want to know who to hire. Which of these is the most important?
The students. It has to be the students. Doesn’t it? (Arguments for the existence of Universities as a self-sustaining bureaucracy system in the comments, if you think that’s a thing you want to do.)
This is not an easy problem but, as we can see, we have pieces of the solution all over the place. Tomorrow, I’m going to put in place a cornerstone of beautiful assessment that I haven’t seen provided elsewhere or explained in this way. (Then all of you can tell me which papers I should have read to get it from, I can publish the citation, and we can all go forward.)
Just a quick note that on-line learning is not just videos! I am a very strong advocate of active learning in my face-to-face practice and am working to compose on-line systems that will be as close to this as possible: learning and doing and building and thinking are all essential parts of the process.
Please, once again, check out Mark’s CACM blog on the 10 myths of teaching computer science. There’s great stuff here that extends everything I’m talking about with short video sequences and attention spans. I wrote something ages ago about not turning ‘chalk and talk’ into ‘watch and scratch (your head)’. It’s a little dated but I include it for completeness.
I was recently at a conference-like event where someone stood up and talked about video lectures. And these lectures were about 40 minutes long.
Over several million viewing sessions, EdX have clearly shown that watchable video length tops out at just over 6 minutes. And that’s the same for certificate-earning students and the people who have enrolled for fun. At 9 minutes, students are watching for fewer than 6 minutes. At the 40 minute mark, it’s 3-4 minutes.
I raised this point to the speaker because I like the idea that, if we do on-line it should be good on-line, and I got a response that was basically “Yes, I know that but I think the students should be watching these anyway.” Um. Six minutes is the limit but, hey, students, sit there for this time anyway.
We have never been able to unobtrusively measure certain student activities as well as we can today. I admit that it’s hard to measure actual attention by looking at video activity time but it’s also hard to measure activity by watching students in a lecture theatre. When we add clickers to measure lecture activity, we change the activity and, unsurprisingly, clicker-based assessment of lecture attentiveness gives us different numbers to observation of note-taking. We can monitor video activity by watching what the student actually does and pausing/stopping a video is a very clear signal of “I’m done”. The fact that students are less likely to watch as far on longer videos is a pretty interesting one because it implies that students will hold on for a while if the end is in sight.
In a lecture, we think students fade after about 15-20 minutes but, because of physical implications, peer pressure, politeness and inertia, we don’t know how many students have silently switched off before that because very few will just get up and leave. That 6 minute figure may be the true measure of how long a human will remain engaged in this kind of task when there is no active component and we are asking them to process or retain complex cognitive content. (Speculation, here, as I’m still reading into one of these areas but you see where I’m going.) We know that cognitive load is a complicated thing and that identifying subgoals of learning makes a difference in cognitive load (Morrison, Margulieux, Guzdial) but, in so many cases, this isn’t what is happening in those long videos, they’re just someone talking with loose scaffolding. Having designed courses with short videos I can tell you that it forces you, as the designer and teacher, to focus on exactly what you want to say and it really helps in making your points, clearly. Implicit sub-goal labelling, anyone? (I can hear Briana and Mark warming up their keyboards!)
If you want to make your videos 40 minutes long, I can’t stop you. But I can tell you that everything I know tells me that you have set your materials up for another hominid species because you’re not providing something that’s likely to be effective for current humans.
If you’ve been reading my blog over the past years, you’ll know that I have a lot of time for thinking about assessment systems that encourage and develop students, with an emphasis on intrinsic motivation. I’m strongly influenced by the work of Alfie Kohn, unsurprisingly given I’ve already shown my hand on Foucault! But there are many other writers who are… reassessing assessment: why we do it, why we think we are doing it, how we do it, what actually happens and what we achieve.
In my framing, I want assessment to be as all other aspects of education: aesthetically satisfying, leading to good outcomes and being clear about what it is and what it is not. Beautiful. Good. True. There are some better and worse assessment approaches out there and there are many papers discussing this. One of these that I have found really useful is Rapaport’s paper on a simplified assessment process for consistent, fair and efficient grading. Although I disagree with some aspects, I consider it to be both good, as it is designed to clearly address a certain problem to achieve good outcomes, and true, because it is very honest about providing guidance to the student as to how well they have met the challenge. It is also highly illustrative and honest in representing the struggle of the author in dealing with the collision of novel and traditional assessment systems. However, further discussion of Rapaport is for the near future. Let me start by demonstrating how broken things often are in assessment, by taking you through a hypothetical situation.
Thought Experiment 1
Two students, A and B, are taking the same course. There are a number of assignments in the course and two exams. A and B, by sheer luck, end up doing no overlapping work. They complete different assignments to each other, half each, and achieve the same (cumulative bare pass overall) marks. They then manage to score bare pass marks in both exams, but one answers only the even questions and the other answers only the odd. (And, yes, there are an even number of questions.) Because of the way the assessment was constructed, they have managed to avoid any common answers in the same area of course knowledge. Yet, both end up scoring 50%, a passing grade in the Australian system.
Which of these students has the correct half of the knowledge?
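A small sketch makes the problem concrete. The ten topic names and the one-mark-per-topic scoring are assumptions invented for illustration; the only point is that the percentage cannot tell the two students apart.

```python
# Hypothetical course with ten equally weighted topics (an assumption for illustration).
topics = [f"topic_{n}" for n in range(1, 11)]

# Student A demonstrates only the odd-numbered topics, student B only the even ones.
student_a = set(topics[0::2])
student_b = set(topics[1::2])


def percentage(demonstrated: set[str]) -> float:
    """Score one mark per demonstrated topic, as a percentage of all topics."""
    return 100 * len(demonstrated) / len(topics)


print(percentage(student_a))   # 50.0
print(percentage(student_b))   # 50.0
print(student_a & student_b)   # set(): no shared knowledge at all
```

Both students collapse to the same number even though the sets behind that number have nothing in common.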
I had planned to build up to Rapaport but, if you’re reading the blog comments, he’s already been mentioned so I’ll summarise his 2011 paper before I get to my main point. In 2011, William J. Rapaport, SUNY Buffalo, published a paper entitled “A Triage Theory of Grading: The Good, The Bad and the Middling.” in Teaching Philosophy. This paper summarised a number of thoughtful and important authors, among them Perry, Wolff, and Kohn. Rapaport starts by asking why we grade, moving through Wolff’s taxonomic classification of assessment into criticism, evaluation, and ranking. Students are trained, by our world and our education systems to treat grades as a measure of progress and, in many ways, a proxy for knowledge. But this brings us into conflict with Perry’s developmental stages, where students start with a deep need for authority and the safety of a single right answer. It is only when students are capable of understanding that there are, in many cases, multiple right answers that we can expect them to understand that grades can have multiple meanings. As Rapaport notes, grades are inherently dual: a representative symbol attached to a quality measure and then, in his words, “ethical and aesthetic values are attached” (emphasis mine.) In other words, a B is a measure of progress (not quite there) that also has a value of being … second-tier if an A is our measure of excellence. A is not A, as it must be contextualised. Sorry, Ayn.
When we start to examine why we are grading, Kohn tells us that the carrot and stick is never as effective as the motivation that someone has intrinsically. So we look to Wolff: are we critiquing for feedback, are we evaluating learning, or are we providing handy value measures for sorting our product for some consumer or market? Returning to my thought experiment above, we cannot provide feedback on assignments that students don’t do, our evaluation of learning says that both students are acceptable for complementary knowledge, and our students cannot be discerned from their graded rank, despite the fact that they have nothing in common!
Yes, it’s an artificial example but, without attention to the design of our courses and in particular the design of our assessment, it is entirely possible to achieve this result to some degree. This is where I wish to refer to Rapaport as an example of thoughtful design, with a clear assessment goal in mind. To step away from measures that provide an (effectively) arbitrary distinction, Rapaport proposes a tiered system for grading that simplifies the overall system with an emphasis on identifying whether a piece of assessment work is demonstrating clear knowledge, a partial solution, an incorrect solution or no work at all.
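To make the four tiers concrete, here is a minimal sketch. The tier names follow the description above, but the `Tier` enum, its numeric ordering, the `weak_areas` helper and the example results are all illustrative assumptions of mine, not Rapaport’s published scheme.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Four outcomes for a piece of work; the numbers are only an ordering (assumed)."""
    NO_WORK = 0
    INCORRECT = 1
    PARTIAL = 2
    CLEAR_KNOWLEDGE = 3


def weak_areas(results: dict[str, Tier]) -> list[str]:
    """List the tasks where the student has not yet shown clear knowledge."""
    return sorted(task for task, tier in results.items() if tier < Tier.CLEAR_KNOWLEDGE)


# Hypothetical results for one student.
results = {
    "recursion": Tier.CLEAR_KNOWLEDGE,
    "pointers": Tier.PARTIAL,
    "sorting": Tier.NO_WORK,
}
print(weak_areas(results))  # ['pointers', 'sorting']
```

With so few outcomes, the marker’s decision is about which tier the work belongs to, rather than about defending a one-point difference on a hundred-point scale.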
This, for me, is an example of assessment that is pretty close to true. The difference between a 74 and a 75 is, in most cases, not very defensible (after Haladyna) unless you are applying some kind of ‘quality gate’ that really reduces a percentile scale to, at most, 13 different outcomes. Rapaport’s argument is that we can reduce this further and this will reduce grade clawing, identify clear levels of achievement and reduce marking load on the assessor. That last point is important. A system that buries the marker under load is not sustainable. It cannot be beautiful.
There are issues in taking this approach and turning it back into the grades that our institutions generally require. Rapaport is very open about the difficulties that he has turning his triage system into an acceptable letter grade and it’s worth reading the paper to see that discussion alone, because it quite clearly shows what
Rapaport’s scheme clearly defines which of Wolff’s criteria he wishes his assessment to achieve. The scheme, for individual assessments, is no good for ranking (although we can fashion a ranking from it) but it is good to identify weak areas of knowledge (as transmitted or received) for evaluation of progress and also for providing elementary critique. It says what it is and it pretty much does it. It sets out to achieve a clear goal.
The paper ends with a summary of the key points of Haladyna’s 1999 book “A Complete Guide to Student Grading”, which brings all of this together.
Haladyna says that “Before we assign a grade to any students, we need:
- an idea about what a grade means,
- an understanding of the purposes of grading,
- a set of personal beliefs and proven principles that we will use in teaching
- a set of criteria on which the grade is based, and, finally,
- a grading method, which is a set of procedures that we consistently follow in arriving at each student’s grade.” (Haladyna 1999: ix)
There is no doubt that Rapaport’s scheme meets all of these criteria and, yet, for me, we have not yet gone far enough in search of the most beautiful, most good and most true extent that we can take this idea. Is point 3, which could be summarised as aesthetics, not enough for me? Apparently not.
Tomorrow I will return to Rapaport to discuss those aspects I disagree with and, later on, discuss both an even more trimmed-down model and some more controversial aspects.
For the next week, I’m going to be applying an aesthetic lens to assessment and, because I’m in Computer Science, I’ll be focusing on the assessment of Computer Science knowledge and practice.
How do we know if our students know something? In reality, the best way is to turn them loose, come back in 25 years and ask the people in their lives, their clients, their beneficiaries and (of course) their victims, the same question: “Did the student demonstrate knowledge of area X?”
This is not available to us as an option because my Dean, if not my Head of School, would probably peer at me curiously if I were to suggest that all measurement of my efficacy be moved a generation from now. Thus, I am forced to retreat to the conventions and traditions of assessment: it is now up to the student to demonstrate to me, within a fixed timeframe, that he or she has taken a firm grip of the knowledge.
We know that students who are prepared to learn and who are motivated to learn will probably learn, often regardless of what we do. We don’t have to read Vallerand et al to be convinced that self-motivated students will perform, as we can see it every day. (But it is an enjoyable paper to read!) Yet we measure these students in the same assessment frames as students who do not have the same advantages and, thus, may not yet have the luxury or capacity of self-motivation: students from disadvantaged backgrounds, students who are first-in-family and students who wouldn’t know auto-didacticism if it were to dance in front of them.
How, then, do we fairly determine what it means to pass, what it means to fail and, even more subtly, what it means to pass or fail well? I hesitate to invoke Foucault, especially when I speak of “Discipline and Punish” in an educational setting, but he is unavoidable when we gaze upon a system that is dedicated to awarding ranks, graduated in terms of punishment and reward. It is strange, really, that were many patients to die under the hand of a surgeon for a simple surgery, we would ask for an inquest, but many students failing under the same professor in a first-year course is merely an indicator of “bad students”. So many of our mechanisms tell us that students are failing but often too late to be helpful and not in a way that encourages improvement. This is punishment. And it is not good enough.
Our assessment mechanisms are not beautiful. They are barely functional. They exist to provide a rough measure to separate pass from fail, with a variety of other distinctions that owe more to previous experience and privilege in many cases than any higher pedagogical approach.
Over the next week, I shall conduct an attack upon the assessment mechanisms that are currently used in my field, including my own, in the hope of arriving at a mechanism of design, practice and validation that is pedagogically pleasing (the aesthetic argument again) and will lead to outcomes that are both good and true.