We know that we can, and do, assess different levels of skill and knowledge. We know that we can, and do, often resort to testing memorisation, simple understanding and, sometimes, the application of the knowledge that we teach. We also know that the best evaluation of work tends to come from the teachers who know the most about the course and have the most experience, but we also know that these teachers have many demands on their time.
The principles of good assessment can be argued but we can probably agree upon a set much like this:
- Valid, based on the content. We should be evaluating things that we’ve taught.
- Reliable, in that our evaluations are consistent and return similar results for different evaluators, that re-evaluating would give the same result, that we’re not unintentionally raising or lowering difficulty.
- Motivating, in that we know how much influence feedback and encouragement have on students, so we should be maximising the motivation and, we hope, this should drive engagement.
- Finally, we want our assessment to be as relevant to us, in terms of being able to use the knowledge gained to improve or modify our courses, as it is to our student. Better things should come from having run this assessment.
Notice that nothing here says “We have to mark or give a grade”, yet we can all agree on these principles, and any scheme that adheres to them, as being a good set of characteristics to build upon. Let me label these as aesthetics of assessment, now let’s see if I can make something beautiful. Let me put together my shopping list.
- Feedback is essential. We can see that. Let’s have lots of feedback and let’s put it in places where it can be the most help.
- Contextual relevance is essential. We’re going to need good design and work out what we want to evaluate and then make sure we locate our assessment in the right place.
- We want to encourage students. This means focusing on intrinsics and support, as well as well-articulated pathways to improvement.
- We want to be fair and honest.
- We don’t want to overload either the students or ourselves.
- We want to allow enough time for reliable and fair evaluation of the work.
What are the resources we have?
- Course syllabus
- Course timetable
- The teacher’s available time
- TA or casual evaluation time, if available
- Student time (for group work or individual work, including peer review)
- Rubrics for evaluation.
- Computerised/automated evaluation systems, to varying degree.
Wait, am I suggesting automated marking belongs in a beautiful marking system? Why, yes, I think it has a place, if we are going to look at those things we can measure mechanistically. Checking to see if someone has ticked the right box for a Bloom’s “remembering” level activity? Machine task. Checking to see if an essay has a lot of syntax or grammatical errors? Machine task. But we can build on that. We can use human markers and machine markers, in conjunction, to the best of their strengths and to overcome each other’s weaknesses.
If we think about it, we really have four separate tiers of evaluators to draw upon, who have different levels of ability. These are:
- E1: The course designers and subject matter experts who have a deep understanding of the course and could, possibly with training, evaluate work and provide rich feedback.
- E2: Human evaluators who have received training or are following a rubric provided by the E1 evaluators. They are still human-level reasoners but are constrained in terms of breadth of interpretation. (It’s worth noting that peer assessment could fit in here, as well.)
- E3: High-level machine evaluation includes machine-based evaluation of work, which could include structural, sentiment or topic analysis, as well as running complicated acceptance tests that look for specific results, coverage of topics or, in the case of programming tasks, certain output in response to given input. The E3 evaluation mechanisms will require some work to set up but can provide evaluation of large classes in hours, rather than days.
- E4: Low-level machine evaluation, checking for conformity in terms of length of assignment, names, type of work submitted, plagiarism detection. In the case of programming assignments, E4 would check that the filenames were correct, that the code compiled and also may run some very basic acceptance tests. E4 evaluation mechanisms should be quick to set up and very quick to execute.
This separation clearly shows us a graded increase of expertise that corresponds to an increase of time spent and, unfortunately, a decrease in time available. E4 evaluation is very easy to set up and carry out but it’s not fantastic for detailed feedback or higher Bloom’s level. Yet we have an almost infinite amount of this marking time available. E1 markers will (we hope) give the best feedback but they take a long time and this immediately reduces the amount of time to be spent on other things. How do we handle this and select the best mix?
While we’re thinking about that, let’s see if we are meeting the aesthetics.
- Valid? Yes. We’ve looked at our design (we appear to have a design!) and we’ve specifically set up evaluation into different areas while thinking about outcomes, levels and areas that we care about.
- Reliable? Looks like it. E3 and E4 are automated and E2 has a defined marking rubric. E1 should also have guidelines but, if we’ve done our work properly in design, the majority of marks, if not all of them, are going to be assigned reliably.
- Fair? We’ve got multiple stages of evaluation but we haven’t yet said how we’re going to use this so we don’t have this one yet.
- Motivating? Hmm, we have the potential for a lot of feedback but we haven’t said how we’re using that, either. Don’t have this one either.
- Relevant to us and the students. No, for the same reasons as 3 and 4, we haven’t yet shown how this can be useful to us.
It looks like we’re half-way there. Tomorrow, we finish the job.