Four-tier assessment

We’ve looked at a classification of evaluators that matches our understanding of the complexity of the assessment tasks we could ask students to perform. If we want to look at this from an aesthetic framing then, as Dewey notes:

“By common consent, the Parthenon is a great work of art. Yet it has aesthetic standing only as the work becomes an experience for a human being.”

John Dewey, Art as Experience, Chapter 1, The Live Creature.

Having a classification of evaluators cannot be appreciated aesthetically unless we provide a way for it to be experienced. Our aesthetic framing demands an implementation that makes use of such an evaluator classification, applies to a problem where we can apply a pedagogical lens and then, finally, we can start to ask how aesthetically pleasing it is.

And this is what brings us to beauty.

A systematic allocation of tasks to these different evaluators should provide valid and reliable marking, assuming we’ve carried out our design phase correctly. But what about fairness, motivation or relevancy, the three points that we did not address previously? To satisfy these aesthetic constraints, and to confirm the others, it now matters how we handle these evaluation phases. It’s not enough to be aware that some things are going to need different approaches; we have to create a learning environment that provides fairness, motivation and relevancy.

I’ve already argued that arbitrary deadlines are unfair, that extrinsic motivational factors are grossly inferior to those found within, and, in even earlier articles, that we too often insist on the relevancy of the measurements that we have, rather than designing for relevancy and insisting on the measurements that we need.

To achieve all of this and to provide a framework that we can use to develop a sense of aesthetic satisfaction (and hence beauty), here is a brief description of a four-tier, penalty free, assessment.

Let’s say that, as part of our course design, we develop an assessment item, A1, that is one of the elements to provide evaluation coverage of one of the knowledge areas. (Thus, we can assume that A1 is not required to be achieved by itself to show mastery but I will come back to this in a later post.)

Recall that the marking groups are: E1, expert human markers; E2, trained or guided human markers; E3, complex automated marking; and E4, simple and mechanical automated marking.

A1 has four, inbuilt, course deadlines but rather than these being arbitrary reductions of mark, these reflect the availability of evaluation resource, a real limitation as we’ve already discussed. When the teacher sets these courses up, she develops an evaluation scheme for the most advanced aspects (E1, which is her in this case), an evaluation scheme that could be used by other markers or her (E2), an E3 acceptance test suite and some E4 tests for simplicity. She matches the aspects of the assignment to these evaluation groups, building from simple to complex, concrete to abstract, definite to ambiguous.

The overall assessment of work consists of the evaluation of four separate areas, associated with each of the evaluators. Individual components of the assessment build up towards the most complex but, for example, a student should usually have had to complete at least some of E4-evaluated work to be able to attempt E3.

Here’s a diagram of the overall pattern for evaluation and assessment.


The first deadline for the assignment is where all evaluation is available. If students provide their work by this time, the E1 evaluator will look at the work, after executing the automated mechanisms, first E4 then E3, and applying the E2 rubrics. If the student has actually answered some E1-level items, then the “top tier” E1 evaluator will look at that work and evaluate it. Regardless of whether there is E1 work or not, human-written feedback from the lecturer on everything will be provided if students get their work in at that point. This includes things that would be of help for all other levels. This is the richest form of feedback and the most useful to the students. If we are going to use measures of performance, this is also the point at which the most opportunities to demonstrate performance can occur.

This feedback will be provided in enough time that the students can modify their work to meet the next deadline, which is the availability of E2 markers. Now TAs or casuals are marking instead, or the lecturer is doing easier evaluation from a simpler rubric. These human markers still start by running the automated scripts, E4 then E3, to make sure that they can mark something in E2. They also provide feedback on everything in E2 to E4, sent out in time for students to make changes for the next deadline.

Now note carefully what’s going on here. Students will get useful feedback, which is great, but because we have these staggered deadlines, we can pass on important messages as we identify problems. If the class is struggling with key complex or more abstract elements, harder to fix and requiring more thought, we know about it quickly because we have front-loaded our labour.

Once we move down to the fully automated systems, we’re losing opportunities for rich and human feedback to students who have not yet submitted. However, we have a list of students who haven’t submitted, which is where we can allocate human labour, and we can encourage them to get work in, in time for the E3 “complicated” script. This E3 marking script remains open for the rest of the semester, to encourage students to do the work sometime ahead of the exam. At this point, the discretionary allocation of labour for feedback is possible, because the lecturer has done most of the hard work in E1 and E2 and should, with any luck, have far fewer evaluation activities for this particular assignment. (Other things may intrude, including other assignments, but we have time bounds on this one, which is better than we often have!)

Finally, at the end of the teaching time (in our parlance, a semester’s teaching will end and then we will move to exams), we move the assessment to E4 marking only, giving students the ability (if required) to test their work to meet any “minimum performance” requirements you may have for their eligibility to sit the exam. Eventually, the requirement to enter a record of student performance in this course forces us to declare the assessment item closed.
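To make the staggered deadlines concrete, here is a minimal sketch of how the tiers close one after another. The class name, field names and dates are all my own illustrative assumptions, not part of any real system; the only thing taken from the scheme above is the order in which E1, E2, E3 and E4 stop being available.

```python
from dataclasses import dataclass
from datetime import date

# Evaluator tiers from the post: E1 expert human, E2 trained or guided human,
# E3 complex automated, E4 simple automated.
TIERS = ("E1", "E2", "E3", "E4")


@dataclass(frozen=True)
class TierSchedule:
    """Hypothetical closing dates for each evaluator tier of one assessment item."""
    e1_closes: date  # last day expert (lecturer) evaluation is available
    e2_closes: date  # last day guided human marking is available
    e3_closes: date  # end of teaching: the complex automated script closes
    e4_closes: date  # the item is declared closed so results can be recorded

    def available_tiers(self, submitted: date) -> list[str]:
        """Return the evaluators still available for work submitted on this date."""
        closes = (self.e1_closes, self.e2_closes, self.e3_closes, self.e4_closes)
        return [tier for tier, close in zip(TIERS, closes) if submitted <= close]


# Example dates are invented purely for illustration.
a1 = TierSchedule(
    e1_closes=date(2016, 3, 14),
    e2_closes=date(2016, 3, 28),
    e3_closes=date(2016, 6, 3),
    e4_closes=date(2016, 6, 17),
)

early = a1.available_tiers(date(2016, 3, 10))      # all four tiers
after_e1 = a1.available_tiers(date(2016, 3, 20))   # E2, E3 and E4
exam_time = a1.available_tiers(date(2016, 6, 10))  # E4 only
closed = a1.available_tiers(date(2016, 7, 1))      # nothing left
```

Nothing in this sketch reduces a mark. A later submission simply sees fewer evaluators, which is the whole point of tying deadlines to resource availability rather than to penalties.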

This is totally transparent and it’s based on real resource limitations. Our restrictions have been put in place to improve student feedback opportunities and give them more guidance. We have also improved our own ability to predict our workload and to guide our resource requests, as well as allowing us to reuse some elements of automated scripts between assignments, without forcing us to regurgitate entire assignments. These deadlines are not arbitrary. They are not punitive. We have improved feedback and provided supportive approaches to encourage more work on assignments. We are able to get better insight into what our students are achieving, against our design, in a timely fashion. We can now see fairness, intrinsic motivation and relevance.

I’m not saying this is beautiful yet (I think I have more to prove to you) but I think this is much closer than many solutions that we are currently using. It’s not hiding anything, so it’s true. It does many things we know are great for students so it looks pretty good.

Tomorrow, we’ll look at whether such a complicated system is necessary for early years and, spoilers, I’ll explain a system for first year that uses peer assessment to provide a similar, but easier to scale, solution.

Who Knew That the Slippery Slope Was Real?

Take a look at this picture.

Dan Ariely. Photo: poptech/Flickr, via

One thing you might have noticed, if you’ve looked carefully, is that this man appears to have had some reconstructive surgery on the right side of his face and there is a colour difference, which is slightly accentuated by the lack of beard stubble. What if I were to tell you that this man was offered the chance to have fake stubble tattooed onto that section and, when he declined because he felt strange about it, received a higher level of pressure and, in his words, guilt trip than for any other procedure during the extensive time he spent in hospital receiving skin grafts and burn treatments? Why was the doctor pressuring him?

Because he had already performed the tattooing remediation on two people and needed a third for the paper. In Dan’s words, again, the doctor was a fantastic physician, thoughtful, and he cared but he had a conflict of interest that meant that he moved to a different mode of behaviour. For me, I had to look a couple of times because the asymmetry that the doctor referred to is not that apparent at first glance. Yet the doctor felt compelled, by interests that were not Dan’s, to make Dan self-conscious about the perceived problem.

A friend on Facebook (thanks, Bill!) posted a link to an excellent article in Wired, entitled “Why We Lie, Cheat, Go to Prison and Eat Chocolate Cake” by Dan Ariely, the man pictured above. Dan is a professor of behavioural economics and psychology at Duke and his new book explores the reasons that we lie to each other. I was interested in this because I’m always looking for explanations of student behaviour and I want to understand their motivations. I know that my students will rationalise and do some strange things but, if I’m forewarned, maybe I can construct activities and courses in a way that heads this off at the pass.

There were several points of interest to me. The first was the question whether a cost/benefit analysis of dishonesty – do something bad, go to prison – actually has the effect that we intend. As Ariely points out, if you talk to the people who got caught, the long-term outcome of their actions was never something that they thought about. He also discusses the notion of someone taking small steps, a little each time, that move them from law abiding, for want of a better word, to dishonest. Rather than set out to do bad things in one giant leap, people tend to take small steps, rationalising each one, and after each step opening up a range of darker and darker options.

Welcome to the slippery slope – beloved argument of rubicund conservative politicians since time immemorial. Except that, in this case, it appears that the slope is piecewise composed of tiny little steps. Yes, each step requires a decision, so there isn’t the momentum that we commonly associate with the slope, but each step, in some sense, takes you to larger and larger steps away from the honest place from which you started.

Ariely discusses an experiment where he gave two groups designer sunglasses and told one group that they had the real thing, and the other that they had fakes, and then asked them to complete a test and then gave them a chance to cheat. The people who had been randomly assigned into the ‘fake sunglasses’ group cheated more than the others. Now there are many possible reasons for this. One of them is the idea that if you know that you are signalling your status deceptively to the world, which is Ariely’s argument, you are in a mindset where you have taken a step towards dishonesty. Cheating a little more is an easier step. I can see many interpretations of this, because of the nature of the cheating, which is in reporting how many questions you completed on the test, where self-esteem issues caused by being in the ‘fake’ group may lead to you over-promoting yourself in the reporting of your success on the quiz – but it’s still cheating. Ultimately, whatever is motivating people to take that step, the step appears to be easier if you are already inside the dishonest space, even to a degree.

[Note: Previous paragraph was edited slightly after initial publication due to terrible auto-correcting slipping by me. Thanks, Gary!]

Where does something like copying software or illicitly downloading music come into this? Does this constant reminder of your small, well-rationalised, step into low-level lawlessness have any impact on the other decisions that you make? It’s an interesting question because, according to the outline in Ariely’s sunglasses experiment, we would expect it to be more of a problem if the products became part of your projected image. We know that having developed a systematic technological solution for downloading is the first hurdle in terms of achieving downloads but is it also the first hurdle in making steadily less legitimate decisions? I actually have no idea but would be very interested to see some research in this area. I feel it’s too glib to assume a relationship, because it is such a ‘slippery slope’ argument, but Ariely’s work now makes me wonder. Is it possible that, after downloading enough music or software, you could actually rationalise the theft of a car? Especially if you were only ‘borrowing’ it? (Personally, I doubt it because I think that there are several steps in between.) I don’t have a stake in this fight – I have a personal code for behaviour in this sphere that I can live with but I see some benefits in asking and trying to answer these questions from something other than personal experience.

Returning to the article, of particular interest to me was the discussion of an honour code, such as Princeton’s, where students sign a pledge. Ariely sees its benefit as a reminder to people that is active for some time but, ultimately, would have little value over several years because, as we’ve already discussed, people rationalise in small increments over the short term rather than constructing long-term models where the pledge would make a difference. Sign a pledge in 2012 and it may just not have any impact on you by the middle of 2012, let alone at the end of 2015 when you’re trying to graduate. Potentially, at almost any cost.

In terms of ongoing reminders, and a signature on a piece of work saying (in effect) “I didn’t cheat”, Ariely asks what happens if you have to sign the honour clause after you’ve finished a test – well, if you’ve finished then any cheating has already occurred so the honour clause is useless then. If you remind people at the start of every assignment, every test, and get them to pledge at the beginning then this should have an impact – a halo effect to an extent, or a reminder of expectation that will make it harder for you to rationalise your dishonesty.

In our school we have an electronic submission system that we require students to use to submit their assignments. It has boilerplate ‘anti-plagiarism’ text and you must accept the conditions to submit. However, this is your final act before submission and you have already finished the code, which falls immediately into the trap mentioned in the previous paragraph. Dan Ariely’s answers have made me think about how we can change this to make it more of an upfront reminder, rather than an ‘after the fact – oh it may be too late now’ auto-accept at the end of the activity. And, yes, reminder structures and behaviour modifiers in time banking are also being reviewed and added in the light of these new ideas.
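As a rough sketch of what ‘upfront’ could look like, the pledge can be made a precondition for starting work rather than for submitting it. This is not how our submission system actually works; the class, its fields and the exception are hypothetical names I have made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


class PledgeNotAccepted(Exception):
    """Raised when work is started before the honour pledge has been accepted."""


@dataclass
class AssignmentSession:
    """Hypothetical record of one student's work on one assignment."""
    student: str
    assignment: str
    pledged_at: Optional[datetime] = None
    started_at: Optional[datetime] = None

    def accept_pledge(self, now: datetime) -> None:
        # Record when the student agreed to the honour clause.
        self.pledged_at = now

    def start(self, now: datetime) -> None:
        # The reminder comes first: no pledge, no access to the assignment.
        if self.pledged_at is None:
            raise PledgeNotAccepted(f"{self.student} must accept the pledge first")
        self.started_at = now

    def pledged_before_work(self) -> bool:
        # True only when the pledge was recorded no later than the start of work.
        return (
            self.pledged_at is not None
            and self.started_at is not None
            and self.pledged_at <= self.started_at
        )


session = AssignmentSession(student="s1234567", assignment="A1")
try:
    session.start(datetime(2012, 6, 1, 9, 0))
    blocked = False
except PledgeNotAccepted:
    blocked = True  # starting without the pledge is refused

session.accept_pledge(datetime(2012, 6, 1, 9, 5))
session.start(datetime(2012, 6, 1, 9, 6))
```

Whether a reminder at the start really changes behaviour is Ariely’s claim to test, not something this sketch demonstrates. It only shows that moving the pledge to the front is a small structural change.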

The Wired Q&A is very interesting and covers a lot of ground but, realistically, I think I have to go and buy Dan Ariely’s book(s), prepare myself for some harsh reflection and thought, and plan for a long weekend of reading.