I’ve been talking about why late penalties are neither useful nor effective, yet I keep talking about getting work in on time and tying it to realistic resource allocation. Does this mean I’m really using late penalties after all?
No, but let me explain why, starting from the underlying principle of fairness that is an aesthetic pillar of good education. One part of this is that the actions of one student should not unduly affect the learning journey of another student. That includes evaluation (and associated marks).
This is the same principle that makes me reject curve grading. It makes no sense to me that one student’s work is judged in the context of another’s, when we have so little real information with which we could establish any form of equivalence of human experience and available capacity.
I don’t want to create a market economy for knowledge, where we devalue successful demonstrations of knowledge and skill for reasons that have nothing to do with learning. Curve grading devalues knowledge. Time penalties devalue knowledge.
I do have to deal with resource constraints, in that I often have (some) deadlines that are administrative necessities, such as those tied to degree awards. I have limited human resources, both personally and professionally.
Given that I do not have unconstrained resources, the fairness principle naturally extends to say that individual students should not consume resources to the detriment of others. I know that I have a limited amount of human evaluation time, therefore I have to treat this as a constrained resource. My E1 and E2 evaluation resources must be, to a degree at least, protected to ensure the best outcome for the most students. (We can and should factor equity into this, but that stops it from being a simple linear equivalence and makes the terms more complex than they need to be for explanation, so I’ll continue this discussion as if we’re discussing equality.)
You’ve noticed that the E3 and E4 evaluation systems are pretty much always available to students. That’s deliberate. If we can automate something, we can scale it. No student is depriving another of timely evaluation and so there’s no limitation of access to E3 and E4, unless it’s too late for it to be of use.
If we ask students to get their work in at time X, it should be on the expectation that we are ready to leap into action at second X+(prep time), or that the students should be engaged in some other worthwhile activity from X+1, because otherwise we have made up a nonsense figure. In order to be fair, we should release all of our evaluations back at the same time, to avoid accidental advantages because of the order in which things were marked. (We may wish to vary this for time banking but we’ll come back to this later.) As many things are marked in surname or student number order, the only way to ensure that we don’t accidentally keep granting an advantage is to release everything at the same time.
Remember, our whole scheme is predicated on the assumption that we have designed and planned for how long it will take to go through the work and provide feedback in time for modification before another submission. When X+(prep time) comes, we should know, roughly to the hour or day, at worst, when this will be done.
If a student hands up fifteen minutes late, they have most likely missed the preparation phase. If we delay our process to include this student, then we will delay feedback to everyone. Here is a genuine motivation for students to submit on time: they will receive rich and detailed feedback as soon as it is ready. Students who hand up late will be assessed in the next round.
That’s how the real world actually works. No-one gives you half marks for something that you do a day late. It’s either accepted or not and, often, you go to the back of the queue. When you miss the bus, you don’t get 50% of the bus. You just have to wait for the next opportunity and, most of the time, there is another bus. Being late once rarely leaves you stranded without apparent hope – unlucky Martian visitors aside.
But there’s more to this. When we have finished with the first group, we can immediately release detailed feedback on what we were expecting to see, providing the best results to students and, from that point on, anyone who submits would have the benefit of information that the first group didn’t have before their initial submission. Rather than make the first group think that they should have waited (and we know students do), we give them the best possible outcome for organising their time.
Everyone approaches the next submission deadline with the knowledge gained from the first pass, but people who didn’t contribute to that pass can’t immediately use it for their own benefit. So there’s no free-riding.
There is, of course, a tricky period between the submission deadline and the release, where we could say “Well, they didn’t see the feedback” and accept the work but that’s when we think about the message we want to send. We would prefer students to improve their time management and one part of this is to have genuine outcomes from necessary deadlines.
If we let students keep handing in later and later, we will eventually end up having these late submissions running into our requirement to give feedback. But, more importantly, we will say “You shouldn’t have bothered” to those students who did hand up on time. When you say something like this, students will learn and they will change their behaviour. We should never reinforce behaviour that is the opposite of what we consider to be valuable.
Fairness is a core aesthetic of education. Authentic time management needs to reflect the reality of lost opportunity, rather than diminishing recognition of good work through some numerical reduction. Our beauty argument is clear: we can be firm on certain deadlines and remove certain tasks from consideration, and this will be a better approach, more likely to have positive outcomes, than the arbitrary reduction schemes already in use.
We’ve looked at a classification of evaluators that matches our understanding of the complexity of the assessment tasks we could ask students to perform. If we want to look at this from an aesthetic framing then, as Dewey notes:
“By common consent, the Parthenon is a great work of art. Yet it has aesthetic standing only as the work becomes an experience for a human being.”
John Dewey, Art as Experience, Chapter 1, The Live Creature.
A classification of evaluators cannot be appreciated aesthetically unless we provide a way for it to be experienced. Our aesthetic framing demands an implementation that makes use of such an evaluator classification and applies it to a problem where we can bring a pedagogical lens to bear; then, finally, we can start to ask how aesthetically pleasing it is.
And this is what brings us to beauty.
A systematic allocation of tasks to these different evaluators should provide valid and reliable marking, assuming we’ve carried out our design phase correctly. But what about fairness, motivation or relevancy, the three points that we did not address previously? To be able to satisfy these aesthetic constraints, and to confirm the others, it now matters how we handle these evaluation phases because it’s not enough to be aware that some things are going to need different approaches, we have to create a learning environment to provide fairness, motivation and relevancy.
I’ve already argued that arbitrary deadlines are unfair, that extrinsic motivational factors are grossly inferior to those found within, and, in even earlier articles, that we too often insist on the relevancy of the measurements that we have, rather than designing for relevancy and insisting on the measurements that we need.
To achieve all of this and to provide a framework that we can use to develop a sense of aesthetic satisfaction (and hence beauty), here is a brief description of a four-tier, penalty-free assessment scheme.
Let’s say that, as part of our course design, we develop an assessment item, A1, that is one of the elements to provide evaluation coverage of one of the knowledge areas. (Thus, we can assume that A1 is not required to be achieved by itself to show mastery but I will come back to this in a later post.)
Recall that the marking groups are: E1, expert human markers; E2, trained or guided human markers; E3, complex automated marking; and E4, simple and mechanical automated marking.
A1 has four inbuilt course deadlines but, rather than these being arbitrary points of mark reduction, they reflect the availability of evaluation resources, a real limitation as we’ve already discussed. When the teacher sets the course up, she develops an evaluation scheme for the most advanced aspects (E1, which is her in this case), an evaluation scheme that could be used by other markers or by her (E2), an E3 acceptance test suite and some E4 tests for simplicity. She matches the aspects of the assignment to these evaluation groups, building from simple to complex, concrete to abstract, definite to ambiguous.
The overall assessment of work consists of the evaluation of four separate areas, associated with each of the evaluators. Individual components of the assessment build up towards the most complex: for example, a student should usually have completed at least some of the E4-evaluated work to be able to attempt E3.
Here’s a diagram of the overall pattern for evaluation and assessment.
The first deadline for the assignment is where all evaluation is available. If students provide their work by this time, the E1 evaluator will look at the work, after executing the automated mechanisms (first E4, then E3) and applying the E2 rubrics. If the student has actually answered some E1-level items, then the “top tier” E1 evaluator will look at that work and evaluate it. Regardless of whether there is E1 work or not, human-written feedback from the lecturer on everything will be provided if students get their work in at this point. This includes things that would be of help at all other levels. This is the richest form of feedback, it is the most useful to the students and, if we are going to use measures of performance, this is the point at which the most opportunities to demonstrate performance can occur.
This feedback will be provided in enough time that the students can modify their work to meet the next deadline, which is defined by the availability of E2 markers. Now TAs or casual markers are marking instead, or the lecturer is doing easier evaluation from a simpler rubric. These human markers still start by running the automated scripts, E4 then E3, to make sure that there is something they can mark in E2. They also provide feedback on everything from E2 to E4, sent out in time for students to make changes before the next deadline.
Now note carefully what’s going on here. Students will get useful feedback, which is great, but because we have these staggered deadlines, we can pass on important messages as we identify problems. If the class is struggling with key complex or more abstract elements, harder to fix and requiring more thought, we know about it quickly because we have front-loaded our labour.
Once we move down to the fully automated systems, we’re losing opportunities for rich and human feedback to students who have not yet submitted. However, we have a list of students who haven’t submitted, which is where we can allocate human labour, and we can encourage them to get work in, in time for the E3 “complicated” script. This E3 marking script remains open for the rest of the semester, to encourage students to do the work sometime ahead of the exam. At this point, the discretionary allocation of labour for feedback is possible, because the lecturer has done most of the hard work in E1 and E2 and should, with any luck, have far fewer evaluation activities for this particular assignment. (Other things may intrude, including other assignments, but we have time bounds on this one, which is better than we often have!)
Finally, at the end of the teaching time (in our parlance, a semester’s teaching will end then we will move to exams), we move the assessment to E4 marking only, giving students the ability (if required) to test their work to meet any “minimum performance” requirements you may have for their eligibility to sit the exam. Eventually, the requirement to enter a record of student performance in this course forces us to declare the assessment item closed.
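To make the staggering concrete, here is a minimal sketch, in Python, of how those four deadlines might map to evaluator availability. The dates, tier order and function names are my own illustrative assumptions, not a prescription for how the scheme has to be built; the point is simply that nothing is penalised and later work just meets fewer evaluation resources.

```python
from dataclasses import dataclass
from datetime import datetime

# Evaluator tiers, from most expert/human (E1) to fully automated (E4).
TIERS = ["E1", "E2", "E3", "E4"]

@dataclass
class Submission:
    student: str
    submitted_at: datetime

# Hypothetical deadlines: each marks the last moment at which that tier
# of evaluation is still available for this assignment (A1).
DEADLINES = {
    "E1": datetime(2016, 3, 7),   # expert human feedback on everything
    "E2": datetime(2016, 3, 14),  # trained/guided human markers
    "E3": datetime(2016, 3, 21),  # complex automated acceptance tests
    "E4": datetime(2016, 6, 30),  # simple automated tests, open until close-off
}

def available_tiers(sub: Submission) -> list[str]:
    """Return the evaluation tiers still available to a submission.

    Work submitted before the E1 deadline gets E1 + E2 + E3 + E4; work
    submitted after that but before the E2 deadline gets E2 + E3 + E4,
    and so on. No marks are deducted at any point.
    """
    return [tier for tier in TIERS if sub.submitted_at <= DEADLINES[tier]]

def evaluate(sub: Submission) -> None:
    tiers = available_tiers(sub)
    # Human evaluators always run the automated checks first (E4, then E3)
    # before applying the E2 rubric or E1 judgement.
    order = [t for t in ("E4", "E3", "E2", "E1") if t in tiers]
    label = " -> ".join(order) if order else "assessment closed"
    print(f"{sub.student}: {label}")

if __name__ == "__main__":
    evaluate(Submission("on-time student", datetime(2016, 3, 6)))
    evaluate(Submission("later student", datetime(2016, 3, 20)))
```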
This is totally transparent and it’s based on real resource limitations. Our restrictions have been put in place to improve student feedback opportunities and give them more guidance. We have also improved our own ability to predict our workload and to guide our resource requests, as well as allowing us to reuse some elements of automated scripts between assignments, without forcing us to regurgitate entire assignments. These deadlines are not arbitrary. They are not punitive. We have improved feedback and provided supportive approaches to encourage more work on assignments. We are able to get better insight into what our students are achieving, against our design, in a timely fashion. We can now see fairness, intrinsic motivation and relevance.
I’m not saying this is beautiful yet (I think I have more to prove to you) but I think this is much closer than many solutions that we are currently using. It’s not hiding anything, so it’s true. It does many things we know are great for students so it looks pretty good.
Tomorrow, we’ll look at whether such a complicated system is necessary for early years and, spoilers, I’ll explain a system for first year that uses peer assessment to provide a similar, but easier to scale, solution.
How can we create a better assessment system, without penalties, that works in a grade-free environment? Let’s provide a foundation for this discussion by looking at assessment today.
We have many different ways of understanding exactly how we are assessing knowledge. Bloom’s taxonomy allows us to classify the objectives that we set for students, in that we can determine if we’re just asking them to remember something, explain it, apply it, analyse it, evaluate it or, having mastered all of those other aspects, create a new example of it. We’ve also got Biggs’ SOLO taxonomy to classify levels of increasing complexity in a student’s understanding of subjects. Now let’s add in threshold concepts, learning edge momentum, neo-Piagetian theory and …
Let’s summarise and just say that we know that students take a while to learn things, can demonstrate some convincing illusions of progress that quickly fall apart, and that we can design our activities and assessment in a way that acknowledges this.
I attended a talk by Eric Mazur, of Peer Instruction fame, and he said a lot of what I’ve already said about assessment not working with how we know we should be teaching. His belief is that we rarely rise above remembering and understanding, when it comes to testing, and he’s at Harvard, where everyone would easily accept their practices as, in theory, being top notch. Eric proposed a number of approaches but his focus on outcomes was one that I really liked. He wanted to keep the coaching role he could provide separate from his evaluator role: another thing I think we should be doing more.
Eric is in Physics but all of these ideas have been extensively explored in my own field, especially where we start to look at which of the levels we teach students to and then what we assess. We do a lot of work on this in Australia and here is some work by our groups and others I have learned from:
- Szabo, C., Falkner, K. & Falkner, N. 2014, ‘Experiences in Course Design using Neo-Piagetian Theory’
- Falkner, K., Vivian, R., Falkner, N., 2013, ‘Neo-piagetian Forms of Reasoning in Software Development Process Construction’
- Whalley, J., Lister, R.F., Thompson, E., Clear, T., Robbins, P., Kumar, P. & Prasad, C. 2006, ‘An Australasian study of reading and comprehension skills in novice programmers, using Bloom and SOLO taxonomies’
- Gluga, R., Kay, J., Lister, R.F. & Teague, D. 2012, ‘On the reliability of classifying programming tasks using a neo-piagetian theory of cognitive development’
I would be remiss to not mention Anna Eckerdal’s work, and collaborations, in the area of threshold concepts. You can find her many papers on determining which concepts are going to challenge students the most, and how we could deal with this, here.
Let me summarise all of this:
- There are different levels at which students will perform as they learn.
- It needs careful evaluation to separate students who appear to have learned something from students who have actually learned something.
- We often focus too much on memorisation and simple explanation, without going to more advanced levels.
- If we want to assess advanced levels, we may have to give up the idea of trying to grade these additional steps, as objectivity is almost impossible, as is task equivalence.
- We should teach in a way that supports the assessment we wish to carry out. The assessment we wish to carry out is the right choice to demonstrate true mastery of knowledge and skills.
If we are not designing for our learning outcomes, we’re unlikely to create courses to achieve those outcomes. If we don’t take into account the realities of student behaviour, we will also fail.
We can break our assessment tasks down by one of the taxonomies or learning theories and, from my own work and that of others, we know that we will get better results if we provide a learning environment that supports assessment at the desired taxonomic level.
But, there is a problem. The most descriptive, authentic and open-ended assessments incur the most load in terms of expert human marking. We don’t have a lot of expert human markers. Overloading them is not good. Pretending that we can mark an infinite number of assignments is not true. Our evaluation aesthetics are objectivity, fairness, effectiveness, timeliness and depth of feedback. Assignment evaluation should be useful to the students, to show progress, and useful to us, to show the health of the learning environment. Overloading the marker will compromise the aesthetics.
Our beauty lens tells us very clearly that we need to be careful about how we deal with our finite resources. As Eric notes, and we all know, if we test only the simpler aspects of student learning, we can throw machines at it and we have a near infinite supply of machines. I cannot produce more experts like me, easily. (Snickers from the audience.) I can recruit human evaluators from my casual pool and train them to mark to something like my standard, using a rubric or an approximation of my approach.
Thus I have a framework of assignments, divided by level, and I appear to have assignment evaluation resources. And the more expert and human the marker, the more … for want of a better word … valuable the resource and the better the feedback it can produce. Yet the more valuable the resource, the less of it I have, because it takes time to develop evaluation skills in humans.
Tune in tomorrow for the penalty free evaluation and feedback that ties all of this together.
How does one actually turn everything I’ve been saying into a course that can be taught? We already have examples of this working, whether in the performance/competency-based models found in medical schools around the world or in mastery learning based approaches, where we do not measure anything except whether a student has demonstrated sufficient knowledge or skill to show an appropriate level of mastery.
An absence of grades, or student control over their grades, is not as uncommon as many people think. MIT in the United States gives students their entire first semester with no grades more specific than pass or fail. This is a deliberate decision to ease the transition of students who have gone from being leaders at their own schools to the compressed scale of MIT. Why compressed? If we were to assess all school students then we would need a scale that could measure all levels of ability, from ‘not making any progress at school’ to ‘transcendent’. The tertiary entry band sits somewhere between ‘passing school studies’ and ‘transcendent’ and, depending upon the college that you enter, can shift higher and higher as your target institution becomes more exclusive. If you look at the MIT entry requirements, they are a little coy about ‘per student’ adjustments, but when the 75th percentile for the SAT components is 800, 790 and 790, and 800, 800, 800 would be perfect, we can see that any arguments about how demotivating simple pass/fail grades must be for excellent students have not just withered, they have caught fire and the ash has blown away. When the target is MIT, it appears the freshmen get their heads around a system that is even simpler than Rapaport’s.
Other universities, such as Brown, deliberately allow students to choose how their marks are presented, as they wish to deemphasise the numbers in order to focus on education. It is not a cakewalk to get into Brown, as these figures attest, and yet Brown have made a clear statement that they have changed their grading system in order to change student behaviour – and the world is just going to have to deal with that. It doesn’t seem to be hurting their graduates, from quotes on the website such as “Our 85% admission rate to medical school and 89% admission rate to law school are both far above the national average.”
And, returning to medical schools themselves, my own University runs a medical program where the usual guidelines for grading do not hold. The medical school runs on a performance/competency scheme, where students who wish to practise medicine must demonstrate that they are knowledgeable, skilful and safe to practise. Medical schools have identified the core problem in my thought experiment, where two students could hold complementary sets of knowledge or skills, and they have come to the same logical conclusion: decide what is important and set up a scheme that works for it.
When I was a soldier, I was responsible for much of the Officer Training in my home state for the Reserve. We had any number of things to report on for our candidates, across knowledge and skills, but one of them was “Demonstrate the qualities of an officer” and this single item could fail an otherwise suitable candidate. If a candidate could not be trusted to one day be in command of troops on the battlefield, based on problems we saw in peacetime, then they would be counselled to see if it could be addressed and, if not, let go. (I can assure you that this was not used often and it required a large number of observations and discussions before we would pull that handle. The power of such a thing forced us to be responsible.)
We know that limited scale, mastery-based approaches are working not just in the vocational sector but in allied sectors (such as the military), in the Ivy League (Brown) and in highly prestigious non-Ivy League institutions such as MIT. But we also know of examples such as Harvey Mudd, who proudly state that only seven students since 1955 have earned a 4.0 GPA and who have a post on the career blog devoted to “explaining why your GPA is so low”. And, be in no doubt, Harvey Mudd is an excellent school, especially for my discipline. I’m not criticising their program, I’ve only heard great things about them, but when you have to put up a page like that? You’re admitting that there’s a problem but you are pushing it onto the student to fix it. Contrast that with Brown, who say to employers “look at our students, not their grades” (at least on the website).
Feedback to the students on their progress is essential. Being able to see what your students are up to is essential for the teacher. Being able to see what your staff and schools are doing is important for the University. Employers want to know who to hire. Which of these is the most important?
The students. It has to be the students. Doesn’t it? (Arguments for the existence of Universities as a self-sustaining bureaucracy system in the comments, if you think that’s a thing you want to do.)
This is not an easy problem but, as we can see, we have pieces of the solution all over the place. Tomorrow, I’m going to put in place a cornerstone of beautiful assessment that I haven’t seen provided elsewhere or explained in this way. (Then all of you can tell me which papers I should have read to get it from, I can publish the citation, and we can all go forward.)
There are many lessons to be learned from what is going on in the MOOC sector. The first is that we have a lot to learn, even for those of us who are committed to doing it ‘properly’, whatever that means. I’m not trying to convince you of “MOOC yes” or “MOOC no”. We can have that argument some other time. I’m talking about what we already know from using these tools.
We’ve learned (again) that producing a broadcast video set of boring people reading the book at you in a monotone is, amazingly, not effective, no matter how fancy the platform. We know that MOOCs are predominantly taken by people who have already ‘succeeded’ at learning, often despite our educational system, and are thus not as likely to have an impact in traditionally disadvantaged areas, especially without an existing learning community and culture. (No references, you can Google all of this easily.)
We know that online communities can and do form. Ok, it’s not the same as twenty people in a room with you but our own work in this space confirms that you can have students experiencing a genuine feeling of belonging, facilitated through course design and forum interaction.
“Really?” you ask.
In a MOOC we ran with over 25,000 students, a student wrote a thank you note to us at the top of his code, for the final assignment. He had moved from non-coder to coder with us and had created some beautiful things. He left a note in his code because he thought that someone would read it. And we did. There is evidence of this everywhere in the forums and their code. No, we don’t have a face-to-face relationship. But we made them feel something and, from what we’ve seen so far, it doesn’t appear to be a bad something.
But we, as in the wider on-line community, have learned something else that is very important. Students in MOOCs often set their own expectations of achievement. They come in, find what they’re after, and leave, much like they are asking a question on Quora or StackExchange. Much like you check out reviews on-line before you start watching a show or you download one or two episodes to check it out. You know, 21st Century life.
Once you see that self-defined achievement and engagement, a lot of things about MOOCs, including drop rates and strange progression, suddenly make sense. As does the realisation that this is a total change from what we have accepted for centuries as desirable behaviour. This is something that we are going to have a lot of trouble fitting into our existing system. It also indicates how much work we’re going to have to do in order to bring in traditionally disadvantaged communities, first-in-family and any other under-represented group. Because they may still believe that we’re offering Perry’s nightmare in on-line form: serried ranks with computers screaming facts at you.
We offer our students a lot of choice but, as Universities, we mostly work on the idea of ‘follow this program to achieve this qualification’. Despite notionally being in the business of knowledge for the sake of knowledge, our non-award and ‘not for credit’ courses are dwarfed in enrolments by the ‘follow the track, get a prize’ streams. And that, of course, is where the diminishing bags of dollars come from. That’s why retention is such a hot-button issue at Universities: even 1% more retained students is worth millions to most institutions. A hunt-and-peck community? We don’t even know what retention looks like in that context.
Pretending that this isn’t happening is ignoring evidence. It’s self-deceptive, disingenuous, hypocritical (for we are supposed to be the evidence junkies) and, once again, we have a failure of educational aesthetics. Giving people what they don’t want isn’t good. Pretending that they just don’t know what’s good for them is really not being truthful. That’s three Socratic strikes: you’re out.
We have a message from our learning community. They want some control. We have to be aware that, if we really want them to do something, they have to feel that it’s necessary. (So much research supports this.) By letting them run around in the MOOC space, artificial and heavily instrumented, we can finally see what they’re up to without having to follow them around with clipboards. We see them on the massive scale, individuals and aggregates. Remember, on average these are graduates; these are students who have already been through our machine and come out. These are the last people, if we’ve convinced them of the rightness of our structure, who should be rocking the boat and wanting to try something different. Unless, of course, we haven’t quite been meeting their true needs all these years.
I often say that the problem we have with MOOC enrolments is that we can see all of them. There is no ‘peeking around the door’ in a MOOC. You’re either in or you’re out, because you have to sign up to get access or updates.
If we were collaborating with all of our students to produce learning materials and structures, not just the subset who go into MOOCs, I wonder what we would end up turning out? We still need to apply our knowledge of pedagogy and psychology, of course, to temper desire with what works, but I suspect that we should be collaborating with our learner community in a far more open way. Everywhere else, technology is changing the relationship between supplier and consumer. Name any other industry and we can probably find a new model where consumers get more choice, more knowledge and more power.
No-one (sensible) is saying we should raze the Universities overnight. I keep being told that allowing more student control is going to lead to terrible things but, frankly, I don’t believe it and I don’t think we have enough evidence to stop us from at least exploring this path. I think it’s scary, yes. I think it’s going to challenge how we think about tertiary education, absolutely. I also think that we need to work out how we can bring together the best of face-to-face with the best of on-line, for the most people, in the most educationally beautiful way. Because anything else just isn’t that beautiful.
I was recently at a conference-like event where someone stood up and talked about video lectures. And these lectures were about 40 minutes long.
Over several million viewing sessions, edX have clearly shown that watchable video length tops out at just over 6 minutes. And that’s the same for certificate-earning students and the people who have enrolled for fun. At 9 minutes, students are watching for fewer than 6 minutes. At the 40-minute mark, it’s 3 to 4 minutes.
I raised this point with the speaker, because I like the idea that if we do on-line it should be good on-line, and I got a response that was basically “Yes, I know that, but I think the students should be watching these anyway.” Um. Six minutes is the limit but, hey, students, sit there for forty minutes anyway.
We have never been able to unobtrusively measure certain student activities as well as we can today. I admit that it’s hard to measure actual attention by looking at video activity time but it’s also hard to measure activity by watching students in a lecture theatre. When we add clickers to measure lecture activity, we change the activity and, unsurprisingly, clicker-based assessment of lecture attentiveness gives us different numbers to observation of note-taking. We can monitor video activity by watching what the student actually does and pausing/stopping a video is a very clear signal of “I’m done”. The fact that students are less likely to watch as far on longer videos is a pretty interesting one because it implies that students will hold on for a while if the end is in sight.
In a lecture, we think students fade after about 15-20 minutes but, because of physical implications, peer pressure, politeness and inertia, we don’t know how many students have silently switched off before that because very few will just get up and leave. That 6 minute figure may be the true measure of how long a human will remain engaged in this kind of task when there is no active component and we are asking them to process or retain complex cognitive content. (Speculation, here, as I’m still reading into one of these areas but you see where I’m going.) We know that cognitive load is a complicated thing and that identifying subgoals of learning makes a difference in cognitive load (Morrison, Margulieux, Guzdial) but, in so many cases, this isn’t what is happening in those long videos, they’re just someone talking with loose scaffolding. Having designed courses with short videos I can tell you that it forces you, as the designer and teacher, to focus on exactly what you want to say and it really helps in making your points, clearly. Implicit sub-goal labelling, anyone? (I can hear Briana and Mark warming up their keyboards!)
If you want to make your videos 40 minutes long, I can’t stop you. But I can tell you that everything I know tells me that you have set your materials up for another hominid species because you’re not providing something that’s likely to be effective for current humans.
Before I lay out the program design I’m thinking of (and, beyond any discussion of competency, as a number of you have suggested, we are heading towards Bloom’s mastery learning as a frame with active learning elements), we need to address one of the most problematic areas of assessment: late penalties.
Well, let’s be accurate: penalties are, by definition, punishments imposed for breaking the rules, so these are punishments. This is the stick in the carrot-and-stick reward/punish approach to forcing people to do what you want.
Let’s throw the Greek trinity at this and see how it shapes up. A student produces an otherwise perfect piece of work for an assessment task. It’s her own work. She has spent time developing it. It’s really good. Insightful. Oh, but she handed it up a day late. So we’re now going to say that this knowledge is worth less because it wasn’t delivered on time. She’s working a day job to pay the bills? She should have organised herself better. No Internet at home? Why didn’t she work in the library? I’m sure the campus is totally safe after hours and, well, she should just be careful in getting to and from the library. After all, the most important thing in her life, without knowing anything about her, should be this one hundred line program to reinvent something that has been written over a million times by every other CS student in history.
That’s not truth. That’s establishing a market value for knowledge with a temporal currency. To me, unless there’s a good reason for doing this, this is as bad as curve grading because it changes what the student has achieved for reasons outside of the assignment activity itself.
“Ah!” you say, “Nick, we want to teach people to hand work in on time because that’s how the world works! Time is money, Jones!”
Rubbish. Yes, there are a (small) number of unmovable deadlines in the world. We certainly have some in education because we have to get grades in to achieve graduations and degrees. But most adults function in a world where they choose how to handle all of the commitments in their lives and then they schedule them accordingly. The more you do that, the more practice you get and you can learn how to do it well.
If you have ever given students a week, or even a day’s, extension because of something that has stopped you being able to accept or mark student work, no matter how good the reason, you have accepted that your submission points are arbitrary. (I feel strongly about this and have posted about it before.)
So what would be a good reason for sticking to these arbitrary deadlines? We’d want to see something really positive coming out of the research into this, right? Let’s look at some research on this, starting with Britton and Tesser, “Effects of Time-Management Practices on College Grades”, J Edu Psych, 1991, 83, 3. This reinforces what we already know from Bandura: students who feel in control and have high self-efficacy are going to do well. If a student sits down every day to work out what they’re going to do then they, unsurprisingly, can get things done. But this study doesn’t tell us about long-range time planning – the realm of instrumentality, the capability to link activity today with success in the future. (Here are some of my earlier thoughts on this, with references to Husman.) From Husman, we know that students value tasks in terms of how important they think it is, how motivated they are and how well they can link future success to the current task.
In another J Edu Psych paper (1990, 82, 4), Macan and Shahani reported that participants who felt that they had control over what they were doing did better, but also clearly indicated that ambiguity and stress had an influence on time management, both in perception and in actuality. But the Perceived Control of Time (authors’ caps) dominated everything, reducing the impact of ambiguity, reducing the impact of stress, and leading to greater satisfaction.
Students are rarely in control of their submission deadlines. Worse, we often do not take into account everything else in a student’s life (even other University courses) when we set our own deadlines. Our deadlines look arbitrary to students because they are, in the majority of cases. There’s your truth. We choose deadlines that work for our ability to mark and to get grades in or, perhaps, based on whether we are in the country or off presenting research on the best way to get students to hand work in on-time.
My own research clearly shows that fixed deadlines do not magically teach students the ability to manage their time and, when you examine it, why should they? (ICER 2012; this was part of a larger study that clearly demonstrated students continuing, and even extending, last-minute behaviour all the way to the fourth year of their studies.) Time management is a discipline that involves awareness of the tasks to be performed, a decomposition of those tasks into subtasks that can be performed when the hyperbolic time discounting triggers go off, and a well-developed sense of instrumentality. Telling someone to hand in their work by this date OR ELSE does not increase awareness, train decomposition, or develop any form of planning skill. Well, no wonder it doesn’t work any better than shouting at people teaches them Maxwell’s Equations or caning children suddenly reveals the magic of the pluperfect form in Latin grammar.
So, let’s summarise: students do well when they feel in control, and that feeling of control helps with all of the other factors that could get in the way. Yet, in order to do almost exactly the opposite of supporting this essential step, we impose frequently arbitrary deadlines and then act surprised when students fall prey to a lack of self-confidence, succumb to stress or lose sight of what they’re trying to do. They panic, asking lots of (what appear to be) unnecessary questions because they are desperately trying to reduce confusion and stress. Sound familiar?
I have written about this at length while exploring time banking, giving students agency and the ability to plan their own time, to address all of these points. But the new lens in my educational inspection loupe allows me to be very clear about what is most terribly wrong with late penalties.
They are not just wrong, they satisfy none of our educational aesthetics. Because we don’t take a student’s real life into account, we are not being fair. Because we are not actually developing time management abilities, but treating them as something that will be auto-didactically generated, we are not being supportive. Because we downgrade work when it is still good, we are being intellectually dishonest. Because we vary deadlines to suit ourselves but may not do so for an individual student, we are being hypocritical. We are degrading the value of knowledge for procedural correctness. This is hideously “unbeautiful”.
That is not education. That’s bureaucracy. Just because most of us live within a bureaucracy doesn’t mean that we have to compromise our pedagogical principles. Even trying to make things fit well, as Rapaport did to try and fit into another scale, we end up warping and twisting our intent, even before we start thinking about lateness and difficult areas such as that. This cannot be good.
There is nothing to stop a teacher setting an exercise that is about time management and that is constructed so that all steps lead someone to develop better time management. Feedback or marks that reflect lateness, when timeliness is the only measure of fitness, are totally reasonable. But to pretend that you can slap some penalties onto the side of an assessment and it will magically self-scaffold is to deceive yourself, to your students’ detriment. It’s not true.
Do I have thoughts on how to balance marking resources with student feedback requirements, elastic time management, and real assessments while still recognising that there are some fixed deadlines?
Funny you should ask. We’ll come back to this, soon.
I hope you’ve had a chance to read William Rapaport’s paper, which I referred to yesterday. He proposed a great, simple alternative to traditional grading that reduces confusion about what is signalled by ‘grade-type’ feedback, as well as making things easier for students and teachers. Being me, after saying how much I liked it, I then finished by saying “… but I think that there are problems.” His approach was that we could break all grading down into: did nothing, wrong answer, some way to go, pretty much there. And that, I think, is much better than a lot of the nonsense that we pretend we hand out as marks. But, yes, I have some problems.
I note that Rapaport’s exceedingly clear and honest account of what he is doing includes this statement. “Still, there are some subjective calls to make, and you might very well disagree with the way that I have made them.” Therefore, I have license to accept the value of the overall scholarship and the frame of the approach, without having to accept all of the implementation details given in the paper. Onwards!
I think my biggest concern with the approach given is not in how it works for individual assessment elements. In that area, I think it shines, as it makes clear what has been achieved. A marker can quickly place the work into one of four boxes if there are clear guidelines as to what has to be achieved, without having to worry about one or two percentage points here or there. Because the grade bands are so distinct, as Rapaport notes, it is very hard for the student to make the ‘I only need one more point’ argument that is so clearly indicative of a focus on the grade rather than the learning. (I note that such an emphasis is often what we have trained students for; there is no pejorative intention here.) I agree this is consistent and fair, and time-saving (after Walvoord and Anderson), and it avoids curve grading, which I loathe with a passion.
However, my problems start when we are combining a number of these triaged grades into a cumulative mark for an assignment or for a final letter grade, showing progress in the course. Sections 4.3 and 4.4 of the paper detail the implementation of assignments that have triage graded sub-tasks. Now, instead of receiving a “some way to go” for an assignment, we can start getting different scores for sub-tasks. Let’s look at an example from the paper, note 12, to describe programming projects in CS.
- Problem definition: 0,1,2,3
- Top-down design: 0,1,2,3
- Documented code
  - Code: 0,1,2,3
  - Documentation: 0,1,2,3
- Annotated output
  - Output: 0,1,2,3
  - Annotations: 0,1,2,3

Total possible points = 18
Remember my hypothetical situation from yesterday? I provided an example of two students who managed to score enough marks to pass by knowing the complement of each other’s course knowledge. Looking at the above example, it appears to be possible (although not easy) for this situation to occur, with both students receiving 9/18, yet for different aspects. But I have some more pressing questions:
- Should it be possible for a student to receive full marks for output, if there is no definition, design or code presented?
- Can a student receive full marks for everything else if they have no design?
The first question indicates what we already know about task dependencies: if we want to build them into numerical grading, we have to be pedantically specific and provide rules on top of the aggregation mathematics. But, more subtly, by aggregating these measures, we no longer have an ‘accurately triaged’ grade to indicate if the assignment as a whole is acceptable or not. An assignment with no definition, design or code can hardly be considered to be a valid submission, yet good output, documentation and annotation (with no code) will not give us the right result!
The second question is more for those of us who teach programming and it’s a question we should all ask. If a student can get a decent grade for an assignment without submitting a design, then what message are we sending? We are, implicitly, saying that although we talk a lot about design, it’s not something you have to do in order to be successful. Rapaport does go on to talk about weightings and how we can emphasise these issues, but we are still faced with the ugly reality that, unless we weight our key aspects to be 50-60% of the final aggregate, students will be able to side-step them and still perform to a passing standard. Every assignment should be doing something useful, modelling the correct approaches, demonstrating correct techniques. How do we capture that?
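To make the aggregation problem concrete, here is a small sketch with hypothetical marks and my own naive aggregation code (this is not Rapaport’s implementation): simple addition of triaged sub-task scores cannot see the dependencies between the parts.

```python
# Rapaport-style triage scores (0-3) per sub-task; 18 points in total.
SUBTASKS = ["definition", "design", "code", "documentation", "output", "annotations"]

def aggregate(scores: dict) -> int:
    """Naive aggregation: just add the triage scores together."""
    return sum(scores[s] for s in SUBTASKS)

# Two hypothetical students with complementary strengths and no overlap.
student_a = {"definition": 3, "design": 3, "code": 3,
             "documentation": 0, "output": 0, "annotations": 0}
student_b = {"definition": 0, "design": 0, "code": 0,
             "documentation": 3, "output": 3, "annotations": 3}

# Both total 9/18, yet student B has "output" and "documentation" with no
# definition, design or code behind them - hardly a valid submission.
print(aggregate(student_a), aggregate(student_b))  # 9 9
```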
Now, let me step back and say that I have no problem with identifying the sub-tasks and clearly indicating the level of performance using triage grading, but I disagree with using it for marks. For feedback it is absolutely invaluable: triage grading on sub-tasks will immediately tell you where the majority of students are having trouble. That then lets you know about an area that is more challenging than you thought, or one that your students were not prepared for, for some reason. (If every student in the class is struggling with something, the problem is more likely to lie with the teacher.) However, I see three major problems with sub-task aggregation and, thus, with final grade aggregation from assignments.
The first problem is that I think this is the wrong kind of scale to try and aggregate in this way. As Rapaport notes, agreement on clear, linear intervals in grading is never going to be achieved and is, very likely, not even possible. Recall that there are four fundamental types of scale: nominal, ordinal, interval and ratio. The scales in use for triage grading are not interval scales (the intervals aren’t predictable or equidistant) and thus we cannot expect to average them and get sensible results. What we have here are, to my eye, ordinal scales, with no objective distance but a clear ranking of best to worst. The clearest indicator of this is the construction of a B grade for final grading, where no such concept exists in the triage marks for assessing assignment quality. We have created a “some way to go but sometimes nearly perfect” that shouldn’t really exist. Think of it like runners: you win one race and you come third in another. You never actually came second in any race so averaging it makes no sense.
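As a tiny illustration of the scale problem (made-up marks, and my own code rather than anything from the paper), averaging ordinal triage grades manufactures a value that no piece of work ever earned:

```python
# Triage grades are ordinal: 0 (did nothing) < 1 (wrong) < 2 (some way to go)
# < 3 (pretty much there). The gaps between the levels are not equal intervals.
assignment_grades = [3, 1, 3, 1]   # hypothetical results across four assignments

mean = sum(assignment_grades) / len(assignment_grades)
print(mean)   # 2.0 - "some way to go", a level this student never once produced

# Averaging ranks manufactures a value (a "B") that corresponds to no actual
# piece of work, exactly like averaging a first and a third place into "second".
```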
The second problem is that aggregation masks the beauty of triage in terms of identifying if a task has been performed to the pre-determined level. In an ideal world, every area of knowledge that a student is exposed to should be an important contributor to their learning journey. We may have multiple assignments in one area but our assessment mechanism should provide clear opportunities to demonstrate that knowledge. Thus, their achievement of sufficient assignment work to demonstrate their competency in every relevant area of knowledge should be a necessary condition for graduating. When we take triage grading back to an assignment level, we can then look at our assignments grouped by knowledge area and quickly see if a student has some way to go or has achieved the goal. This is not anywhere near as clear when we start aggregating the marks because of the mathematical issues already raised.
Finally, the reduction of triage to mathematical approximation reduces the ability to specify which areas of an assessment are really valuable and, while weighting is a reasonable approximation to this, it is very hard to use a mathematical formula with more and more ‘fudge factors’, a term Rapaport uses, to make up for the fact that this is just a little too fragile.
To summarise, I really like the thrust of this paper. I think what is proposed is far better, even with all of the problems raised above, at giving a reasonable, fair and predictable grade to students. But I think that the clash with existing grading traditions and the implicit requirement to turn everything back into one number is causing problems that have to be addressed. These problems mean that this solution is not, yet, beautiful. But let’s see where we can go.
Tomorrow, I’ll suggest an even more cut-down version of grading and then work on an even trickier problem: late penalties and how they affect grades.
If you’ve been reading my blog over the past years, you’ll know that I have a lot of time for thinking about assessment systems that encourage and develop students, with an emphasis on intrinsic motivation. I’m strongly influenced by the work of Alfie Kohn, unsurprisingly given I’ve already shown my hand on Foucault! But there are many other writers who are… reassessing assessment: why we do it, why we think we are doing it, how we do it, what actually happens and what we achieve.
In my framing, I want assessment to be as all other aspects of education should be: aesthetically satisfying, leading to good outcomes and being clear about what it is and what it is not. Beautiful. Good. True. There are better and worse assessment approaches out there and there are many papers discussing this. One that I have found really useful is Rapaport’s paper on a simplified assessment process for consistent, fair and efficient grading. Although I disagree with some aspects, I consider it to be both good, as it is designed to clearly address a certain problem to achieve good outcomes, and true, because it is very honest about providing guidance to the student as to how well they have met the challenge. It is also highly illustrative and honest in representing the struggle of the author in dealing with the collision of novel and traditional assessment systems. However, further discussion of Rapaport is for the near future. Let me start by demonstrating how broken things often are in assessment, by taking you through a hypothetical situation.
Thought Experiment 1
Two students, A and B, are taking the same course. There are a number of assignments in the course and two exams. A and B, by sheer luck, end up doing no overlapping work. They complete different assignments to each other, half each, and achieve the same (cumulative bare pass overall) marks. They then manage to score bare pass marks in both exams, but one answers only the even questions and the other only the odd. (And, yes, there are an even number of questions.) Because of the way the assessment was constructed, they have managed to avoid any common answers in the same area of course knowledge. Yet both end up scoring 50%, a passing grade in the Australian system.
Which of these students has the correct half of the knowledge?
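To see how easily straightforward percentage aggregation produces this outcome, here is a tiny sketch with made-up component weights (six assignments and two exams split into odd and even questions); none of these numbers come from a real course, and in this simplification each student scores full marks on the half they attempt and nothing on the rest.

```python
# Hypothetical course: six assignments worth 5% each, two exams worth 35% each,
# with each exam split evenly between odd- and even-numbered questions.
components = ["a1", "a2", "a3", "a4", "a5", "a6",
              "exam1_odd", "exam1_even", "exam2_odd", "exam2_even"]
weights = [5, 5, 5, 5, 5, 5, 17.5, 17.5, 17.5, 17.5]

# Student A completes the odd-numbered pieces, student B the even-numbered ones.
student_a = {c: w for i, (c, w) in enumerate(zip(components, weights)) if i % 2 == 0}
student_b = {c: w for i, (c, w) in enumerate(zip(components, weights)) if i % 2 == 1}

print(sum(student_a.values()), sum(student_b.values()))  # 50.0 50.0 - both pass
assert not set(student_a) & set(student_b)               # with no work in common
```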
I had planned to build up to Rapaport but, if you’re reading the blog comments, he’s already been mentioned, so I’ll summarise his 2011 paper before I get to my main point. In 2011, William J. Rapaport, SUNY Buffalo, published a paper entitled “A Triage Theory of Grading: The Good, the Bad and the Middling” in Teaching Philosophy. This paper summarised a number of thoughtful and important authors, among them Perry, Wolff and Kohn. Rapaport starts by asking why we grade, moving through Wolff’s taxonomic classification of assessment into criticism, evaluation and ranking. Students are trained, by our world and our education systems, to treat grades as a measure of progress and, in many ways, a proxy for knowledge. But this brings us into conflict with Perry’s developmental stages, where students start with a deep need for authority and the safety of a single right answer. It is only when students are capable of understanding that there are, in many cases, multiple right answers that we can expect them to understand that grades can have multiple meanings. As Rapaport notes, grades are inherently dual: a representative symbol attached to a quality measure to which, in his words, “ethical and aesthetic values are attached” (emphasis mine). In other words, a B is a measure of progress (not quite there) that also carries a value of being … second-tier, if an A is our measure of excellence. A is not A, as it must be contextualised. Sorry, Ayn.
When we start to examine why we are grading, Kohn tells us that the carrot and stick is never as effective as the motivation that someone has intrinsically. So we look to Wolff: are we critiquing for feedback, are we evaluating learning, or are we providing handy value measures for sorting our product for some consumer or market? Returning to my thought experiment above, we cannot provide feedback on assignments that students don’t do, our evaluation of learning says that both students are acceptable for complementary knowledge, and our students cannot be discerned from their graded rank, despite the fact that they have nothing in common!
Yes, it’s an artificial example but, without attention to the design of our courses and in particular the design of our assessment, it is entirely possible to achieve this result to some degree. This is where I wish to refer to Rapaport as an example of thoughtful design, with a clear assessment goal in mind. To step away from measures that provide an (effectively) arbitrary distinction, Rapaport proposes a tiered system for grading that simplifies the overall system with an emphasis on identifying whether a piece of assessment work is demonstrating clear knowledge, a partial solution, an incorrect solution or no work at all.
This, for me, is an example of assessment that is pretty close to true. The difference between a 74 and a 75 is, in most cases, not very defensible (after Haladyna) unless you are applying some kind of ‘quality gate’ that really reduces a percentile scale to, at most, 13 different outcomes. Rapaport’s argument is that we can reduce this further, and that this will reduce grade clawing, identify clear levels of achievement and reduce the marking load on the assessor. That last point is important. A system that buries the marker under load is not sustainable. It cannot be beautiful.
There are issues in taking this approach and turning it back into the grades that our institutions generally require. Rapaport is very open about the difficulties that he has in turning his triage system into an acceptable letter grade and it’s worth reading the paper to see that discussion alone, because it quite clearly shows what happens when a novel assessment scheme collides with a traditional grading system.
Rapaport’s scheme clearly defines which of Wolff’s criteria he wishes his assessment to achieve. The scheme, for individual assessments, is no good for ranking (although we can fashion a ranking from it) but it is good to identify weak areas of knowledge (as transmitted or received) for evaluation of progress and also for providing elementary critique. It says what it is and it pretty much does it. It sets out to achieve a clear goal.
The paper ends with a summary of the key points of Haladyna’s 1999 book “A Complete Guide to Student Grading”, which brings all of this together.
Haladyna says that “Before we assign a grade to any students, we need:
- an idea about what a grade means,
- an understanding of the purposes of grading,
- a set of personal beliefs and proven principles that we will use in teaching,
- a set of criteria on which the grade is based, and, finally,
- a grading method, which is a set of procedures that we consistently follow in arriving at each student’s grade.” (Haladyna 1999: ix)
There is no doubt that Rapaport’s scheme meets all of these criteria and yet, for me, we have not gone far enough in search of the most beautiful, most good and most true form that this idea can take. Is point 3, which could be summarised as aesthetics, not enough for me? Apparently not.
Tomorrow I will return to Rapaport to discuss those aspects I disagree with and, later on, discuss both an even more trimmed-down model and some more controversial aspects.
For the next week, I’m going to be applying an aesthetic lens to assessment and, because I’m in Computer Science, I’ll be focusing on the assessment of Computer Science knowledge and practice.
How do we know if our students know something? In reality, the best way is to turn them loose, come back in 25 years and ask the people in their lives, their clients, their beneficiaries and (of course) their victims, the same question: “Did the student demonstrate knowledge of area X?”
This is not available to us as an option because my Dean, if not my Head of School, would probably peer at me curiously if I were to suggest that all measurement of my efficacy be moved a generation from now. Thus, I am forced to retreat to the conventions and traditions of assessment: it is now up to the student to demonstrate to me, within a fixed timeframe, that he or she has taken a firm grip of the knowledge.
We know that students who are prepared to learn and who are motivated to learn will probably learn, often regardless of what we do. We don’t have to read Vallerand et al to be convinced that self-motivated students will perform, as we can see it every day. (But it is an enjoyable paper to read!) Yet we measure these students in the same assessment frames as students who do not have the same advantages and, thus, may not yet have the luxury or capacity of self-motivation: students from disadvantaged backgrounds, students who are first-in-family and students who wouldn’t know auto-didacticism if it were to dance in front of them.
How, then, do we fairly determine what it means to pass, what it means to fail and, even more subtly, what it means to pass or fail well? I hesitate to invoke Foucault, especially when I speak of “Discipline and Punish” in an educational setting, but he is unavoidable when we gaze upon a system that is dedicated to awarding ranks, graduated in terms of punishment and reward. It is strange, really, that were many patients to die under the hand of a surgeon for a simple surgery, we would ask for an inquest, but many students failing under the same professor in a first-year course is merely an indicator of “bad students”. So many of our mechanisms tell us that students are failing but often too late to be helpful and not in a way that encourages improvement. This is punishment. And it is not good enough.
Our assessment mechanisms are not beautiful. They are barely functional. They exist to provide a rough measure to separate pass from fail, with a variety of other distinctions that owe more to previous experience and privilege in many cases than any higher pedagogical approach.
Over the next week, I shall conduct an attack upon the assessment mechanisms that are currently used in my field, including my own, in the hope of arriving at a mechanism of design, practice and validation that is pedagogically pleasing (the aesthetic argument again) and will lead to outcomes that are both good and true.