We know that we can, and do, assess different levels of skill and knowledge. We know that we can, and do, often resort to testing memorisation, simple understanding and, sometimes, the application of the knowledge that we teach. We also know that the best evaluation of work tends to come from the teachers who know the most about the course and have the most experience, but we also know that these teachers have many demands on their time.
The principles of good assessment can be argued but we can probably agree upon a set much like this:
- Valid, based on the content. We should be evaluating things that we’ve taught.
- Reliable, in that our evaluations are consistent and return similar results for different evaluators, that re-evaluating would give the same result, that we’re not unintentionally raising or lowering difficulty.
- Motivating, in that we know how much influence feedback and encouragement have on students, so we should be maximising the motivation and, we hope, this should drive engagement.
- Finally, we want our assessment to be as relevant to us, in terms of being able to use the knowledge gained to improve or modify our courses, as it is to our students. Better things should come from having run this assessment.
Notice that nothing here says “We have to mark or give a grade”, yet we can all agree on these principles, and any scheme that adheres to them, as being a good set of characteristics to build upon. Let me label these as aesthetics of assessment. Now let’s see if I can make something beautiful. Let me put together my shopping list.
- Feedback is essential. We can see that. Let’s have lots of feedback and let’s put it in places where it can be the most help.
- Contextual relevance is essential. We’re going to need good design and work out what we want to evaluate and then make sure we locate our assessment in the right place.
- We want to encourage students. This means focusing on intrinsics and support, as well as well-articulated pathways to improvement.
- We want to be fair and honest.
- We don’t want to overload either the students or ourselves.
- We want to allow enough time for reliable and fair evaluation of the work.
What are the resources we have?
- Course syllabus
- Course timetable
- The teacher’s available time
- TA or casual evaluation time, if available
- Student time (for group work or individual work, including peer review)
- Rubrics for evaluation.
- Computerised/automated evaluation systems, to varying degree.
Wait, am I suggesting automated marking belongs in a beautiful marking system? Why, yes, I think it has a place, if we are going to look at those things we can measure mechanistically. Checking to see if someone has ticked the right box for a Bloom’s “remembering” level activity? Machine task. Checking to see if an essay has a lot of syntax or grammatical errors? Machine task. But we can build on that. We can use human markers and machine markers, in conjunction, to the best of their strengths and to overcome each other’s weaknesses.
If we think about it, we really have four separate tiers of evaluators to draw upon, who have different levels of ability. These are:
- E1: The course designers and subject matter experts who have a deep understanding of the course and could, possibly with training, evaluate work and provide rich feedback.
- E2: Human evaluators who have received training or are following a rubric provided by the E1 evaluators. They are still human-level reasoners but are constrained in terms of breadth of interpretation. (It’s worth noting that peer assessment could fit in here, as well.)
- E3: High-level machine evaluation includes machine-based evaluation of work, which could include structural, sentiment or topic analysis, as well as running complicated acceptance tests that look for specific results, coverage of topics or, in the case of programming tasks, certain output in response to given input. The E3 evaluation mechanisms will require some work to set up but can provide evaluation of large classes in hours, rather than days.
- E4: Low-level machine evaluation, checking for conformity in terms of length of assignment, names, type of work submitted, plagiarism detection. In the case of programming assignments, E4 would check that the filenames were correct, that the code compiled and also may run some very basic acceptance tests. E4 evaluation mechanisms should be quick to set up and very quick to execute.
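One way to picture the four tiers above is as a pipeline in which the cheap machine checks run first and only work that passes them takes up scarce human time. This is a minimal sketch under that assumption, not a description of any existing system: the `Submission` class, the three check functions, and the "stop at the first failing tier" rule are all hypothetical, and the E3 check is a trivial stand-in for real acceptance tests or topic analysis.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Submission:
    filename: str
    text: str
    feedback: List[str] = field(default_factory=list)


# Each check returns (passed, feedback message for the student).
Check = Callable[[Submission], Tuple[bool, str]]


def e4_filename(sub: Submission) -> Tuple[bool, str]:
    # E4: low-level conformity check (hypothetical rule: must be a .py file).
    ok = sub.filename.endswith(".py")
    return ok, ("E4: filename accepted" if ok else "E4: expected a .py file")


def e4_length(sub: Submission) -> Tuple[bool, str]:
    # E4: low-level conformity check (hypothetical minimum of three words).
    ok = len(sub.text.split()) >= 3
    return ok, ("E4: length accepted" if ok else "E4: submission is too short")


def e3_defines_function(sub: Submission) -> Tuple[bool, str]:
    # E3: stand-in for a real acceptance test or structural analysis.
    ok = "def " in sub.text
    return ok, ("E3: function found" if ok else "E3: no function defined")


# Cheapest tier first; E2 and E1 are humans and sit after this list.
MACHINE_TIERS: List[Tuple[str, List[Check]]] = [
    ("E4", [e4_filename, e4_length]),
    ("E3", [e3_defines_function]),
]


def ready_for_humans(sub: Submission) -> bool:
    """Run the machine tiers in order, recording all feedback from each tier.

    Returns True only if every machine tier passed, meaning the work is
    worth passing on to E2 or E1 evaluators.
    """
    for _tier, checks in MACHINE_TIERS:
        results = [check(sub) for check in checks]
        sub.feedback.extend(message for _, message in results)
        if not all(passed for passed, _ in results):
            return False
    return True
```

Every check in a tier runs before the pipeline decides to stop, so a student who fails at E4 still gets all of the E4 feedback at once rather than one problem per resubmission.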
This separation clearly shows us a graded increase of expertise that corresponds to an increase of time spent and, unfortunately, a decrease in time available. E4 evaluation is very easy to set up and carry out but it’s not fantastic for detailed feedback or higher Bloom’s level. Yet we have an almost infinite amount of this marking time available. E1 markers will (we hope) give the best feedback but they take a long time and this immediately reduces the amount of time to be spent on other things. How do we handle this and select the best mix?
While we’re thinking about that, let’s see if we are meeting the aesthetics.
- Valid? Yes. We’ve looked at our design (we appear to have a design!) and we’ve specifically set up evaluation into different areas while thinking about outcomes, levels and areas that we care about.
- Reliable? Looks like it. E3 and E4 are automated and E2 has a defined marking rubric. E1 should also have guidelines but, if we’ve done our work properly in design, the majority of marks, if not all of them, are going to be assigned reliably.
- Fair? We’ve got multiple stages of evaluation but we haven’t yet said how we’re going to use this so we don’t have this one yet.
- Motivating? Hmm, we have the potential for a lot of feedback but we haven’t said how we’re using that, either. Don’t have this one either.
- Relevant to us and the students? No, for the same reasons as “Fair” and “Motivating”: we haven’t yet shown how this can be useful to us.
It looks like we’re half-way there. Tomorrow, we finish the job.
How does one actually turn everything I’ve been saying into a course that can be taught? We already have examples of this working, whether in the performance/competency based models found in medical schools around the world or in mastery learning based approaches where we do not measure anything except whether a student has demonstrated sufficient knowledge or skill to show an appropriate level of mastery.
An absence of grades, or student control over their grades, is not as uncommon as many people think. MIT in the United States give students their entire first semester with no grades more specific than pass or fail. This is a deliberate decision to ease the transition of students who have gone from being leaders at their own schools to the compressed scale of MIT. Why compressed? If we were to assess all school students then we would need a scale that could measure all levels of ability, from ‘not making any progress at school’ to ‘transcendent’. The tertiary entry band is somewhere between ‘passing school studies’ and ‘transcendent’ and, depending upon the college that you enter, can shift higher and higher as your target institution becomes more exclusive. If you look at the MIT entry requirements, they are a little coy for ‘per student’ adjustments, but when the 75th percentile for the SAT components is 800, 790, 790, and 800, 800, 800 would be perfect, we can see that any arguments on how demotivating simple pass/fail grades must be for excellent students have not just withered, they have caught fire and the ash has blown away. When the target is MIT, it appears the freshmen get their head around a system that is even simpler than Rapaport’s.
Other universities, such as Brown, deliberately allow students to choose how their marks are presented, as they wish to deemphasise the numbers in order to focus on education. It is not a cakewalk to get into Brown, as these figures attest, and yet Brown have made a clear statement that they have changed their grading system in order to change student behaviour – and the world is just going to have to deal with that. It doesn’t seem to be hurting their graduates, from quotes on the website such as “Our 85% admission rate to medical school and 89% admission rate to law school are both far above the national average.”
And, returning to medical schools themselves, my own University runs a medical program where the usual guidelines for grading do not hold. The medical school is running on a performance/competency scheme, where students who wish to practise medicine must demonstrate that they are knowledgeable, skilful and safe to practise. Medical schools have identified the core problem in my thought experiment where two students could have the opposite set of knowledge or skills and they have come to the same logical conclusion: decide what is important and set up a scheme that works for it.
When I was a soldier, I was responsible for much of the Officer Training in my home state for the Reserve. We had any number of things to report on for our candidates, across knowledge and skills, but one of them was “Demonstrate the qualities of an officer” and this single item could fail an otherwise suitable candidate. If a candidate could not be trusted to one day be in command of troops on the battlefield, based on problems we saw in peacetime, then they would be counselled to see if it could be addressed and, if not, let go. (I can assure you that this was not used often and it required a large number of observations and discussion before we would pull that handle. The power of such a thing forced us to be responsible.)
We know that limited scale, mastery-based approaches are not just working in the vocational sector but in allied sectors (such as the military), in the Ivy League (Brown) and in highly prestigious non-Ivy League institutions such as MIT. But we also know of examples such as Harvey Mudd, who proudly state that only seven students since 1955 have earned a 4.0 GPA and have a post on the career blog devoted to “explaining why your GPA is so low”. And, be in no doubt, Harvey Mudd is an excellent school, especially for my discipline. I’m not criticising their program, I’ve only heard great things about them, but when you have to put up a page like that? You’re admitting that there’s a problem but you are pushing it on to the student to fix it. But contrast that with Brown, who say to employers “look at our students, not their grades” (at least on the website).
Feedback to the students on their progress is essential. Being able to see what your students are up to is essential for the teacher. Being able to see what your staff and schools are doing is important for the University. Employers want to know who to hire. Which of these is the most important?
The students. It has to be the students. Doesn’t it? (Arguments for the existence of Universities as a self-sustaining bureaucracy system in the comments, if you think that’s a thing you want to do.)
This is not an easy problem but, as we can see, we have pieces of the solution all over the place. Tomorrow, I’m going to put in place a cornerstone of beautiful assessment that I haven’t seen provided elsewhere or explained in this way. (Then all of you can tell me which papers I should have read to get it from, I can publish the citation, and we can all go forward.)
There are many lessons to be learned from what is going on in the MOOC sector. The first is that we have a lot to learn, even for those of us who are committed to doing it ‘properly’, whatever that means. I’m not trying to convince you of “MOOC yes” or “MOOC no”. We can have that argument some other time. I’m talking about what we already know from using these tools.
We’ve learned (again) that producing a broadcast video set of boring people reading the book at you in a monotone is, amazingly, not effective, no matter how fancy the platform. We know that MOOCs are predominantly taken by people who have already ‘succeeded’ at learning, often despite our educational system, and are thus not as likely to have an impact in traditionally disadvantaged areas, especially without an existing learning community and culture. (No references, you can Google all of this easily.)
We know that online communities can and do form. Ok, it’s not the same as twenty people in a room with you but our own work in this space confirms that you can have students experiencing a genuine feeling of belonging, facilitated through course design and forum interaction.
“Really?” you ask.
In a MOOC we ran with over 25,000 students, a student wrote a thank you note to us at the top of his code, for the final assignment. He had moved from non-coder to coder with us and had created some beautiful things. He left a note in his code because he thought that someone would read it. And we did. There is evidence of this everywhere in the forums and their code. No, we don’t have a face-to-face relationship. But we made them feel something and, from what we’ve seen so far, it doesn’t appear to be a bad something.
But we, as in the wider on-line community, have learned something else that is very important. Students in MOOCs often set their own expectations of achievement. They come in, find what they’re after, and leave, much like they are asking a question on Quora or StackExchange. Much like you check out reviews on-line before you start watching a show or you download one or two episodes to check it out. You know, 21st Century life.
Once you see that self-defined achievement and engagement, a lot of things about MOOCs, including drop rates and strange progression, suddenly make sense. As does the realisation that this is a total change from what we have accepted for centuries as desirable behaviour. This is something that we are going to have a lot of trouble fitting into our existing system. It also indicates how much work we’re going to have to do in order to bring in traditionally disadvantaged communities, first-in-family and any other under-represented group. Because they may still believe that we’re offering Perry’s nightmare in on-line form: serried ranks with computers screaming facts at you.
We offer our students a lot of choice but, as Universities, we mostly work on the idea of ‘follow this program to achieve this qualification’. Despite notionally being in the business of knowledge for the sake of knowledge, our non-award and ‘not for credit’ courses are dwarfed in enrolments by the ‘follow the track, get a prize’ streams. And that, of course, is where the diminishing bags of dollars come from. That’s why retention is such a hot-button issue at Universities because even 1% more retained students is worth millions to most Universities. A hunt and peck community? We don’t even know what retention looks like in that context.
Pretending that this isn’t happening is ignoring evidence. It’s self-deceptive, disingenuous, hypocritical (for we are supposed to be the evidence junkies) and, once again, we have a failure of educational aesthetics. Giving people what they don’t want isn’t good. Pretending that they just don’t know what’s good for them is really not being truthful. That’s three Socratic strikes: you’re out.
We have a message from our learning community. They want some control. We have to be aware that, if we really want them to do something, they have to feel that it’s necessary. (So much research supports this.) By letting them run around in the MOOC space, artificial and heavily instrumented, we can finally see what they’re up to without having to follow them around with clipboards. We see them on the massive scale, individuals and aggregates. Remember, on average these are graduates; these are students who have already been through our machine and come out. These are the last people, if we’ve convinced them of the rightness of our structure, who should be rocking the boat and wanting to try something different. Unless, of course, we haven’t quite been meeting their true needs all these years.
I often say that the problem we have with MOOC enrolments is that we can see all of them. There is no ‘peeking around the door’ in a MOOC. You’re in or you’re out, in order to be signed up for access or updates.
If we were collaborating with all of our students to produce learning materials and structures, not just the subset who go into MOOC, I wonder what we would end up turning out? We still need to apply our knowledge of pedagogy and psychology, of course, to temper desire with what works but I suspect that we should be collaborating with our learner community in a far more open way. Everywhere else, technology is changing the relationship between supplier and consumer. Name any other industry and we can probably find a new model where consumers get more choice, more knowledge and more power.
No-one (sensible) is saying we should raze the Universities overnight. I keep being told that allowing more student control is going to lead to terrible things but, frankly, I don’t believe it and I don’t think we have enough evidence to stop us from at least exploring this path. I think it’s scary, yes. I think it’s going to challenge how we think about tertiary education, absolutely. I also think that we need to work out how we can bring together the best of face-to-face with the best of on-line, for the most people, in the most educationally beautiful way. Because anything else just isn’t that beautiful.
I was recently at a conference-like event where someone stood up and talked about video lectures. And these lectures were about 40 minutes long.
Over several million viewing sessions, edX have clearly shown that watchable video length tops out at just over 6 minutes. And that’s the same for certificate-earning students and the people who have enrolled for fun. At 9 minutes, students are watching for fewer than 6 minutes. At the 40 minute mark, it’s 3-4 minutes.
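To make those figures concrete, here is the same information expressed as the fraction of each video that actually gets watched. The viewing times are my own rough readings of the numbers above, not the underlying data: 6 minutes is treated as an upper bound for the 9 minute video, and 3.5 minutes is simply the midpoint of “3-4 minutes”.

```python
# Video length (minutes) -> approximate viewing time (minutes).
# Both viewing times are assumptions read off the prose figures,
# not values taken from the original data set.
approx_viewing_time = {
    9: 6.0,    # "fewer than 6 minutes", so this is an upper bound
    40: 3.5,   # midpoint of "3-4 minutes"
}

# Fraction of each video that is watched before the student stops.
fraction_watched = {
    length: watched / length
    for length, watched in approx_viewing_time.items()
}

for length, fraction in fraction_watched.items():
    print(f"{length:>2} minute video: at most about {fraction:.0%} watched")
```

On these readings, students see at most about two thirds of a 9 minute video and under a tenth of a 40 minute one.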
I raised this point to the speaker because I like the idea that, if we do on-line it should be good on-line, and I got a response that was basically “Yes, I know that but I think the students should be watching these anyway.” Um. Six minutes is the limit but, hey, students, sit there for this time anyway.
We have never been able to unobtrusively measure certain student activities as well as we can today. I admit that it’s hard to measure actual attention by looking at video activity time but it’s also hard to measure activity by watching students in a lecture theatre. When we add clickers to measure lecture activity, we change the activity and, unsurprisingly, clicker-based assessment of lecture attentiveness gives us different numbers to observation of note-taking. We can monitor video activity by watching what the student actually does and pausing/stopping a video is a very clear signal of “I’m done”. The fact that students are less likely to watch as far on longer videos is a pretty interesting one because it implies that students will hold on for a while if the end is in sight.
In a lecture, we think students fade after about 15-20 minutes but, because of physical implications, peer pressure, politeness and inertia, we don’t know how many students have silently switched off before that because very few will just get up and leave. That 6 minute figure may be the true measure of how long a human will remain engaged in this kind of task when there is no active component and we are asking them to process or retain complex cognitive content. (Speculation, here, as I’m still reading into one of these areas but you see where I’m going.) We know that cognitive load is a complicated thing and that identifying subgoals of learning makes a difference in cognitive load (Morrison, Margulieux, Guzdial) but, in so many cases, this isn’t what is happening in those long videos, they’re just someone talking with loose scaffolding. Having designed courses with short videos I can tell you that it forces you, as the designer and teacher, to focus on exactly what you want to say and it really helps in making your points, clearly. Implicit sub-goal labelling, anyone? (I can hear Briana and Mark warming up their keyboards!)
If you want to make your videos 40 minutes long, I can’t stop you. But I can tell you that everything I know tells me that you have set your materials up for another hominid species because you’re not providing something that’s likely to be effective for current humans.
I was inspired to write this by a comment about using late penalties but dealing slightly differently with students when they owned up to being late. I have used late penalties extensively (it’s school policy) and so I have a lot of experience with the many ways students try to get around them.
Like everyone, I have had students who have tried to use honesty where every other possible way of getting the assignment in on time (starting early, working on it before the day before, miraculous good luck) has failed. Sometimes students are puzzled that “Oh, I was doing another assignment from another lecturer” isn’t a good enough excuse. (Genuine reasons for interrupted work, medical or compassionate, are different and I’m talking about the ambit extension or ‘dog ate my homework’ level of bargaining.)
My reasoning is simple. In education, owning up to something that you did knowing that it would have punitive consequences of some sort should not immediately cause things to become magically better. Plea bargaining (and this is an interesting article on why that’s not a good idea anywhere) is you agreeing to your guilt in order to reduce your sentence. But this is, once again, horse-trading knowledge on the market. Suddenly, we don’t just have a temporal currency, we have a conformal currency, where getting a better deal involves finding the ‘kindest judge’ among the group who will give you the ‘lightest sentence’. Students optimise their behaviour to what works or, if they’re lucky, they have a behaviour set that’s enough to get them to a degree without changing much. The second group aren’t mostly who we’re talking about and I don’t want to encourage the first group to become bargain-hunting mark-hagglers.
I believe that ‘finding Mr Nice Lecturer’ behaviour is why some students feel free to tell me that they thought someone else’s course was more important than mine, because I’m a pretty nice person and have a good rapport with my students, and many of my colleagues can be seen (fairly or not) as less approachable or less open.
We are not doing ourselves or our students any favours. At the very least, we risk accusations of unfairness if we extend benefits to one group who are bold enough to speak to us (and we know that impostor syndrome and lack of confidence are rife in under-represented groups). At worst, we turn our students into cynical mark shoppers, looking for the easiest touch and planning their work strategy based on what they think they can get away with instead of focusing back on the learning. The message is important and the message must be clearly communicated so that students try to do the work for when it’s required. (And I note that this may or may not coincide with any deadlines.)
We wouldn’t give credit to someone who wrote ‘True’ and then said ‘Oh, but I really meant False’. The work is important or it is not. The deadline is important or it is not. Consequences, in a learning sense, do not have to mean punishments and we do not need to construct a Star Chamber in our offices.
Yes, I do feel strongly about this. I completely understand why people do this and I have also done this before. But after thinking about it at length, I changed my practice so that being honest about something that shouldn’t have happened was appreciated but it didn’t change what occurred unless there was a specific procedural difference in handling. I am not a judge. I am not a jury. I want to change the system so that not only do I not have to be but I’m not tempted to be.
Before I lay out the program design I’m thinking of (and, beyond any discussion of competency, as a number of you have suggested, we are heading towards Bloom’s mastery learning as a frame with active learning elements), we need to address one of the most problematic areas of assessment: late penalties.
Well, let’s be accurate, penalties are, by definition, punishments imposed for breaking the rules, so these are punishments. This is the stick in the carrot-and-stick reward/punish approach to forcing people to do what you want.
Let’s throw the Greek trinity at this and see how it shapes up. A student produces an otherwise perfect piece of work for an assessment task. It’s her own work. She has spent time developing it. It’s really good. Insightful. Oh, but she handed it up a day late. So we’re now going to say that this knowledge is worth less because it wasn’t delivered on time. She’s working a day job to pay the bills? She should have organised herself better. No Internet at home? Why didn’t she work in the library? I’m sure the campus is totally safe after hours and, well, she should just be careful in getting to and from the library. After all, the most important thing in her life, without knowing anything about her, should be this one hundred line program to reinvent something that has been written over a million times by every other CS student in history.
That’s not truth. That’s establishing a market value for knowledge with a temporal currency. To me, unless there’s a good reason for doing this, this is as bad as curve grading because it changes what the student has achieved for reasons outside of the assignment activity itself.
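To see the temporal currency at work, take a purely hypothetical scheme that deducts 10% of the raw mark for each day late. The rate, the function name and the marks below are all assumptions for illustration, since schools differ widely in what they actually apply.

```python
def penalised_mark(raw_mark: float, days_late: int, percent_per_day: int = 10) -> float:
    """Apply a hypothetical late penalty to a raw mark.

    Deducts `percent_per_day` percent of the raw mark for every day late,
    never going below zero. Nothing about the work itself is consulted.
    """
    percent_kept = max(0, 100 - percent_per_day * days_late)
    return raw_mark * percent_kept / 100


# The same piece of work, assessed as worth 100, handed in at different times.
on_time = penalised_mark(100, days_late=0)
one_day_late = penalised_mark(100, days_late=1)
two_weeks_late = penalised_mark(100, days_late=14)
```

The only input that changes between the three calls is the clock. The work, and whatever the student learned in producing it, is identical each time.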
“Ah!” you say “Nick, we want to teach people to hand work in on time because that’s how the world works! Time is money, Jones!”
Rubbish. Yes, there are a (small) number of unmovable deadlines in the world. We certainly have some in education because we have to get grades in to achieve graduations and degrees. But most adults function in a world where they choose how to handle all of the commitments in their lives and then they schedule them accordingly. The more you do that, the more practice you get and you can learn how to do it well.
If you have ever given students a week, or even a day’s, extension because of something that has stopped you being able to accept or mark student work, no matter how good the reason, you have accepted that your submission points are arbitrary. (I feel strongly about this and have posted about it before.)
So what would be a good reason for sticking to these arbitrary deadlines? We’d want to see something really positive coming out of the research into this, right? Let’s look at some research on this, starting with Britton and Tesser, “Effects of Time-Management Practices on College Grades”, J Edu Psych, 1991, 83, 3. This reinforces what we already know from Bandura: students who feel in control and have high self-efficacy are going to do well. If a student sits down every day to work out what they’re going to do then they, unsurprisingly, can get things done. But this study doesn’t tell us about long-range time planning – the realm of instrumentality, the capability to link activity today with success in the future. (Here are some of my earlier thoughts on this, with references to Husman.) From Husman, we know that students value tasks in terms of how important they think it is, how motivated they are and how well they can link future success to the current task.
In another J Edu Psych paper (1990, 82, 4), Macan and Shahani reported that participants who felt that they had control over what they were doing did better but also clearly indicated that ambiguity and stress had an influence on time management in terms of perception and actuality. But the Perceived Control of Time (authors’ caps) dominated everything, reducing the impact of ambiguity, reducing the impact of stress, and leading to greater satisfaction.
Students are rarely in control of their submission deadlines. Worse, we often do not take into account everything else in a student’s life (even other University courses) when we set our own deadlines. Our deadlines look arbitrary to students because they are, in the majority of cases. There’s your truth. We choose deadlines that work for our ability to mark and to get grades in or, perhaps, based on whether we are in the country or off presenting research on the best way to get students to hand work in on-time.
(Yes, the owl above is staring at me just as hard as he is staring at anyone else here.)
My own research clearly shows that fixed deadlines do not magically teach students the ability to manage their time and, when you examine it, why should they? (ICER 2012; this was part of a larger study that clearly demonstrated students continuing, and even extending, last-minute behaviour all the way to the fourth year of their studies.) Time management is a discipline that involves awareness of the tasks to be performed, a decomposition of those tasks to subtasks that can be performed when the hyperbolic time discounting triggers go off, and a well-developed sense of instrumentality. Telling someone to hand in their work by this date OR ELSE does not increase awareness, train decomposition, or develop any form of planning skills. Well, no wonder it doesn’t work any better than shouting at people teaches them Maxwell’s Equations or caning children suddenly reveals the magic of the pluperfect form in Latin grammar.
So, let’s summarise: students do well when they feel in control and it helps with all of the other factors that could get in the way. So, in order to do almost exactly the opposite of help with this essential support step, we impose frequently arbitrary time deadlines and then act surprised when students fall prey to lack of self-confidence, stress or lose sight of what they’re trying to do. They panic, asking lots of (what appear to be) unnecessary questions because they are desperately trying to reduce confusion and stress. Sound familiar?
I have written about this at length while exploring time banking, giving students agency and the ability to plan their own time, to address all of these points. But the new lens in my educational inspection loupe allows me to be very clear about what is most terribly wrong with late penalties.
They are not just wrong, they satisfy none of anyone’s educational aesthetics. Because we don’t take a student’s real life into account, we are not being fair. Because we are not actually developing the time management abilities but treating them as something that will be auto-didactically generated, we are not being supportive. Because we downgrade work when it is still good, we are being intellectually dishonest. Because we vary deadlines to suit ourselves but may not do so for an individual student, we are being hypocritical. We are degrading the value of knowledge for procedural correctness. This is hideously “unbeautiful”.
That is not education. That’s bureaucracy. Just because most of us live within a bureaucracy doesn’t mean that we have to compromise our pedagogical principles. Even trying to make things fit well, as Rapaport did to try and fit into another scale, we end up warping and twisting our intent, even before we start thinking about lateness and difficult areas such as that. This cannot be good.
There is nothing to stop a teacher setting an exercise that is about time management and is constructed so that all steps will lead someone to develop better time management. Feedback or marks that reflect something being late when that is the only measure of fitness is totally reasonable. But to pretend that you can slap some penalties on to the side of an assessment and it will magically self-scaffold is to deceive yourself, to your students’ detriment. It’s not true.
Do I have thoughts on how to balance marking resources with student feedback requirements, elastic time management, and real assessments while still recognising that there are some fixed deadlines?
Funny you should ask. We’ll come back to this, soon.
We all have heroes. One of mine was David Bowie. Innovator. Musician. Technologist. Visionary. Bowie took Burroughs’ analogue techniques and brought them into the 20th, then 21st Century. But he did so much more. His (now) last album is wonderful. I listened to his previous album and it made my transition to middle-age easier, because I could hear him exploring the same issues. Now I have the album he recorded in the face of death. Undimmed. Unafraid. Strong.
Whatever I do, I want to help people to see things as they are. Truth, in the classical Greek sense of revelation of reality, is what I’ve been talking about. I have been a Bowie fan, follower, and later scholar, for decades. I have no doubt that when David Bowie looked at the world, he saw what it was, what it could be and what he could do with it.
He certainly influenced me and made me think about what could be, if I saw it truly and pursued it with passion.
A moment’s silence for the passing of a unique and important element of our culture.
I’m getting some great comments, on and off the blog, about possible solutions to the problems I’m putting up, as well as thoughts on some of my examples.
Firstly, thank you, everyone! Secondly, I am deliberately starting slowly and building up, to reframe all of these arguments in terms of aesthetics, fitness for purpose and clarity. (Beauty, goodness and truth, again.) I am not trying to make anything appear worse than it is but I’m teasing out some points to show why we should be seeking to change practice that is both widespread and ingrained.
I will quickly note a point that Raymond Lister raised about my thought experiment with the two students who split the knowledge: I don’t differentiate between skills and knowledge (true) and I am talking about an educational design where no work has been done to identify which areas have to be mastered in order to progress (also true). This is totally deliberate on my part, because it reflects a lot of current practice, not because I think it’s what we should be doing. I will be returning to, and extending, this example over time.
(Raymond does great work in a lot of areas dear to my heart and we will be returning to some of his work in our peregrinations, especially the SOLO taxonomy and Bloom’s mappings. Until then, here is his Google Scholar link for you to read some very interesting papers. And I could not agree more that there is no programming gene!)
If you’ve been reading my blog over the past years, you’ll know that I have a lot of time for thinking about assessment systems that encourage and develop students, with an emphasis on intrinsic motivation. I’m strongly influenced by the work of Alfie Kohn, unsurprisingly given I’ve already shown my hand on Foucault! But there are many other writers who are… reassessing assessment: why we do it, why we think we are doing it, how we do it, what actually happens and what we achieve.
In my framing, I want assessment to be as all other aspects of education: aesthetically satisfying, leading to good outcomes and being clear about what it is and what it is not. Beautiful. Good. True. There are some better and worse assessment approaches out there and there are many papers discussing this. One of these that I have found really useful is Rapaport’s paper on a simplified assessment process for consistent, fair and efficient grading. Although I disagree with some aspects, I consider it to be both good, as it is designed to clearly address a certain problem to achieve good outcomes, and true, because it is very honest about providing guidance to the student as to how well they have met the challenge. It is also highly illustrative and honest in representing the struggle of the author in dealing with the collision of novel and traditional assessment systems. However, further discussion of Rapaport is for the near future. Let me start by demonstrating how broken things often are in assessment, by taking you through a hypothetical situation.
Thought Experiment 1
Two students, A and B, are taking the same course. There are a number of assignments in the course and two exams. A and B, by sheer luck, end up doing no overlapping work. They complete different assignments from each other, half each, and achieve the same (cumulative bare pass overall) marks. They then manage to score bare pass marks in both exams, but one answers only the even questions and the other answers only the odd. (And, yes, there is an even number of questions.) Because of the way the assessment was constructed, they have managed to avoid any common answers in the same area of course knowledge. Yet, both end up scoring 50%, a passing grade in the Australian system.
Which of these students has the correct half of the knowledge?
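To make the arithmetic of the thought experiment concrete, here is a minimal sketch. Everything in it is an assumption for illustration: ten equally weighted topics, a simplified view in which each student fully demonstrates exactly half of them, and the hypothetical names `TOPICS` and `overall_mark`.

```python
# Hypothetical illustration of Thought Experiment 1.
# The topic list, equal weighting and all names are assumptions.
TOPICS = [f"topic_{i}" for i in range(1, 11)]  # ten areas of course knowledge


def overall_mark(demonstrated):
    """Percentage of course topics on which knowledge was demonstrated."""
    return 100 * len(demonstrated) / len(TOPICS)


student_a = set(TOPICS[0::2])  # topics 1, 3, 5, 7, 9
student_b = set(TOPICS[1::2])  # topics 2, 4, 6, 8, 10

mark_a = overall_mark(student_a)
mark_b = overall_mark(student_b)
shared = student_a & student_b  # knowledge the two students have in common

print(mark_a, mark_b, sorted(shared))  # 50.0 50.0 []
```

The single aggregate number is identical for both students, while the set of shared topics is empty. The mark alone cannot tell us what either student actually knows.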
I had planned to build up to Rapaport but, if you’re reading the blog comments, he’s already been mentioned so I’ll summarise his 2011 paper before I get to my main point. In 2011, William J. Rapaport, SUNY Buffalo, published a paper entitled “A Triage Theory of Grading: The Good, The Bad and the Middling” in Teaching Philosophy. This paper summarised a number of thoughtful and important authors, among them Perry, Wolff, and Kohn. Rapaport starts by asking why we grade, moving through Wolff’s taxonomic classification of assessment into criticism, evaluation, and ranking. Students are trained, by our world and our education systems, to treat grades as a measure of progress and, in many ways, a proxy for knowledge. But this brings us into conflict with Perry’s developmental stages, where students start with a deep need for authority and the safety of a single right answer. It is only when students are capable of understanding that there are, in many cases, multiple right answers that we can expect them to understand that grades can have multiple meanings. As Rapaport notes, grades are inherently dual: a representative symbol attached to a quality measure and then, in his words, “ethical and aesthetic values are attached” (emphasis mine). In other words, a B is a measure of progress (not quite there) that also has a value of being … second-tier if an A is our measure of excellence. A is not A, as it must be contextualised. Sorry, Ayn.
When we start to examine why we are grading, Kohn tells us that the carrot and stick is never as effective as the motivation that someone has intrinsically. So we look to Wolff: are we critiquing for feedback, are we evaluating learning, or are we providing handy value measures for sorting our product for some consumer or market? Returning to my thought experiment above, we cannot provide feedback on assignments that students don’t do, our evaluation of learning says that both students are acceptable for complementary knowledge, and our students cannot be discerned from their graded rank, despite the fact that they have nothing in common!
Yes, it’s an artificial example but, without attention to the design of our courses and in particular the design of our assessment, it is entirely possible to achieve this result to some degree. This is where I wish to refer to Rapaport as an example of thoughtful design, with a clear assessment goal in mind. To step away from measures that provide an (effectively) arbitrary distinction, Rapaport proposes a tiered system for grading that simplifies the overall system with an emphasis on identifying whether a piece of assessment work is demonstrating clear knowledge, a partial solution, an incorrect solution or no work at all.
This, for me, is an example of assessment that is pretty close to true. The difference between a 74 and a 75 is, in most cases, not very defensible (after Haladyna) unless you are applying some kind of ‘quality gate’ that really reduces a percentile scale to, at most, 13 different outcomes. Rapaport’s argument is that we can reduce this further and this will reduce grade clawing, identify clear levels of achievement and reduce marking load on the assessor. That last point is important. A system that buries the marker under load is not sustainable. It cannot be beautiful.
There are issues in taking this approach and turning it back into the grades that our institutions generally require. Rapaport is very open about the difficulties that he has turning his triage system into an acceptable letter grade and it’s worth reading the paper to see that discussion alone, because it quite clearly shows what
Rapaport’s scheme clearly defines which of Wolff’s criteria he wishes his assessment to achieve. The scheme, for individual assessments, is no good for ranking (although we can fashion a ranking from it) but it is good to identify weak areas of knowledge (as transmitted or received) for evaluation of progress and also for providing elementary critique. It says what it is and it pretty much does it. It sets out to achieve a clear goal.
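As a rough sketch of how a tiered scheme like this can point at weak areas of knowledge, consider the following. The tier names follow the four outcomes described above; the numeric codes, the topic names and the `weak_areas` helper are my own assumptions for illustration, not Rapaport’s published method.

```python
from enum import IntEnum


class Tier(IntEnum):
    # Four outcomes as described in the text. The numeric codes are an
    # assumption, used here only to order the tiers.
    NO_WORK = 0
    INCORRECT = 1
    PARTIAL = 2
    CLEAR = 3


def weak_areas(results):
    """Return topics that fall short of clear knowledge, weakest first."""
    below = [(tier, topic) for topic, tier in results.items() if tier < Tier.CLEAR]
    return [topic for _, topic in sorted(below)]


# Hypothetical results for one student, one tier per piece of work.
results = {
    "recursion": Tier.CLEAR,
    "pointers": Tier.PARTIAL,
    "sorting": Tier.NO_WORK,
    "loops": Tier.INCORRECT,
}

print(weak_areas(results))  # ['sorting', 'loops', 'pointers']
```

Nothing here produces a ranking or a letter grade. It only tells the student and the teacher where to look next, which is the kind of evaluation and elementary critique the scheme sets out to provide.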
The paper ends with a summary of the key points of Haladyna’s 1999 book “A Complete Guide to Student Grading”, which brings all of this together.
Haladyna says that “Before we assign a grade to any students, we need:
- an idea about what a grade means,
- an understanding of the purposes of grading,
- a set of personal beliefs and proven principles that we will use in teaching,
- a set of criteria on which the grade is based, and, finally,
- a grading method, which is a set of procedures that we consistently follow in arriving at each student’s grade.” (Haladyna 1999: ix)
There is no doubt that Rapaport’s scheme meets all of these criteria and, yet, for me, we have not yet gone far enough in search of the most beautiful, most good and most true extent that we can take this idea. Is point 3, which could be summarised as aesthetics, not enough for me? Apparently not.
Tomorrow I will return to Rapaport to discuss those aspects I disagree with and, later on, discuss both an even more trimmed-down model and some more controversial aspects.
I’ve been thinking about learning analytics and, while some Unis have managed to solve parts of the problem, I think that we need to confront the complexity of the problem, to explain why it’s so challenging. I break it into five key problems.
- Data. We don’t currently collect enough of it to analyse, what we do collect is of questionable value and isn’t clearly tied to mechanisms, and we have not confronted the spectre of what we do with this data when we get it.
- Mechanisms linking learning and what is produced. The mechanisms are complex. Students could be failing for any number of reasons, not the least of which is crap staff. Trying to work out what has happened by looking at outputs is unlikely to help.
- Focus. Generally, we measure things to evaluate people. This means that students do tests to get marked and, even where we mix this up with formative work, they tend to focus on the things that get them marks. That’s because it’s how we’ve trained them. This focus warps measurement into an enforcement and judgment mechanism, rather than a supportive and constructive mechanism.
- Community. We often mandate or apply analytics as an extension of the evaluation focus above. This means that we don’t have a community who are supported by analytics, we have a community of evaluators and the evaluated. This is what we would usually label as a Panopticon, because of the asymmetrical application of this kind of visibility. And it’s not a great environment for education. Without a strong community, why should staff go to the extra effort to produce the things required to generate more data if they can’t see a need for it? This is a terribly destructive loop as it requires learning analytics to work and be seen as effective before you have the data to make learning analytics work!
- Support. When we actually have the data, understand the mechanism, have the right focus and are linked in to the community, we still need the money, time and other resources to provide remediation, to encourage development, to pay for the technology, to send people to places where they can learn. For students and staff. We just don’t have that.
I think almost all Unis are suffering from the same problems. This is a terribly complex problem and it cannot be solved by technology alone.
It’s certainly not as easy as driving a car. You know that you make the car go faster by pushing on one pedal and you make it go slower by pushing on another. You look at your speedometer. This measures how often your wheels are rotating and, by simple arithmetic, gives you your speed across the road. Now you can work out the speed you want to travel at, taking into account signs, conditions and things like that. Simple. But this simple, everyday, action and its outcomes are the result of many, many technological, social and personal systems interacting.
The speedometer in the car is giving you continuously available, and reasonably reliable, data on your performance. You know how to influence that performance through the use of simple and direct controls (mechanism). There exists a culture of driver training, road signage and engineering, and car design that provides you with information that ties your personal performance to external achievement (These are all part of support, focus and community). Finally, there are extrinsic mechanisms that function as checks and balances but, importantly, they are not directly tied to what you are doing in the car, although there are strong causative connections to certain outcomes (And we can see elements of support and community in this as we all want to drive on safe roads, hence state support for this is essential).
We are nowhere near the car scenario with learning analytics right now. We have some measurements of learning in the classroom because we grade assignments and mark exams. But these are not continuous feedback, to be consulted wherever possible, and the mechanisms to cause positive change in these are not necessarily clear and direct. I would argue that most of what we currently do is much closer to police enforcement of speed. We ask students to drive a track and, periodically, we check to see if they’re doing the correct speed. We then, often irrevocably from a grading sense, assign a mark to how well they are driving the track and settle back to measure them again later.
Learning analytics faces huge problems before it reaches this stage. We need vast quantities of data that we are not currently generating. Many University courses lack opportunities to demonstrate prowess early on. Many courses offer only two or three measurements of performance to determine the final grade. This is like trying to guess our speed when the speedo only lights up every three to four weeks after we have pressed a combination of pedals.
The mechanisms for improvement and performance control in University education are not just murky, they’re opaque. If we identify a problem, what happens? In the case of detecting that we are speeding, most of us will slow down. If the police detect you are speeding, they may stop you or (more likely) issue you a fine and eventually you’ll use up your licence and have to stop driving. We just give people low marks or fail them. But, combine this with mechanism issues, and suddenly we need to ask if we’re even ready to try to take action if we had the analytics.
Let’s say we get all the data and it’s reliable and pedagogically sensible. We work out how to link things together. We build community support and we focus it correctly. You run analytics over your data. After some digging, you discover that 70% of your teaching staff simply don’t know how to do their jobs. And, as far as you can see, have been performing at this standard for 20 years.
What do you do?
Until we are ready to listen to what analytics tell us, until we have had the discussion of how we deal with students (and staff) who may wish to opt out, and until we have looked at this as the monstrous, resource-hungry, incredibly complex problem that it is, we really have to ask if we’re ready to take learning analytics seriously. And, given how much money can be spent on this, it’s probably better to work out if we’re going to listen before we invest money into a solution that won’t work because it cannot work.