If we are going to try and summarise a complicated, long-term process with a single number, and I don’t see such shortcuts going away anytime soon, then it helps to know:
- Exactly what the number represents.
- How it can be used.
- What the processes are that go into its construction.
We have conventions as to what things mean but, when we want to be precise, we have to be careful about our definition and our usage of the final value. As a simple example, one thing that often surprises people who are new to numerical analysis is that there is more than one way of calculating the average value of a group of numbers.
While average in colloquial language usually means that we take the sum of all of the numbers and divide it by their count, this is more formally referred to as the arithmetic mean. What we usually want from the average is some indication of what the typical value for this group would be. If you weigh ten bags of wheat and the average weight is 10 kilograms, then that’s what many people would expect the weight to be for future bags, unless there was clear early evidence of high variation (some 500 g, some 20 kilograms, for example).
But the mean is only one way to measure central tendency in a group of numbers. We can also measure the median, the number that separates the highest half of the data from the lowest, or the mode, the value that is the most frequently occurring value in the group.
(This doesn’t even get into the situation where we decide to aggregate the values in a different way.)
If you’ve got ten bags of wheat and nine have 10 kilograms in there, but one has only 5 kilograms, which of these ways of calculating the average is the one you want? The mode (and the median) is 10 kg but the mean is 9.5 kg. If you tried to distribute the bags based on the expectation that everyone gets 9.5 kilograms, you’re going to make nine people very happy and one person unhappy.
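To make this concrete, here is a minimal sketch of the three measures using Python’s standard library, applied to the hypothetical wheat bags above:

```python
from statistics import mean, median, mode

# Nine 10 kg bags and one 5 kg bag, as in the example above
weights = [10] * 9 + [5]

print(mean(weights))    # 9.5  (arithmetic mean)
print(median(weights))  # 10   (middle value)
print(mode(weights))    # 10   (most frequent value)
```

Three defensible “averages”, two different answers: which one you report depends on which question you are actually asking.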
Most Grade Point Average calculations are based on a simple arithmetic mean of all available grades, with points allocated from 0 to an upper bound based on the grade performance. As a student adds more courses, these contributions are added to the calculation.
In yesterday’s post, I mused on letting students control which grades go into a GPA calculation and, to explore that, I now have to explain what I mean and why that would change things.
As it stands, because a GPA is an average across all courses, any lower grades will permanently drop the GPA contribution of any higher grades. If a student gets a 7 (A+ or High Distinction) for 71 of her courses and then a single 4 (a Passing grade) for one, her GPA will be roughly 6.96. It can never return to 7. The clear performance band of this student is at the highest level, given that just under 99% of her marks are at the highest level, yet the inclusion of all grades means that a single underperformance in three years, for whatever reason, has cost her standing with those people who care about this figure.
My partner and I discussed some possible approaches to GPA that would be better and, by better, we mean approaches that encourage students to improve, that clearly show what the GPA figure means, and that are much fairer to the student. There are too many external factors contributing to resilience and high performance for me to be 100% comfortable with the questionable representation provided by the GPA.
Before we even think about student control over what is presented, we can easily think of several ways to make a GPA reflect what you have achieved, rather than what you have survived.
1. We could count only a percentage of the courses for each student. Even having 90% counted means that students who stumble a little once or twice do not have this permanently etched into a grade that drags them down.
2. We could allow a future attempt at a course with an improved grade to replace the previous grade. Before we get too caught up in the possibility of ‘gaming’, remember that students would have to pay for this (even if delayed) in most systems and it will add years to their degree. If a student can reach achievement level X in a course, then it’s up to us to make sure that the recorded grade does correspond to that level of achievement!
3. We could count only passes. Given that a student has to assemble sufficient passing grades to be awarded a degree, why would we include the courses that do not count in a calculation of GPA?
4. We could use the mode and report the most common mark the student receives.
5. We could do away with it totally. (Not going to happen any time soon.)
6. We could pair the GPA with a statistical accompaniment that tells the viewer how indicative it is.
Options 1 and 2 are fairly straightforward. Option 3 is interesting because it compresses the measurement band to a range of (in my system) 4-7, and this implicitly recognises that GPA measures for students who graduate are more likely to be in this tighter range: we don’t actually have the degree of separation that we’d assume from a range of 0-7. Option 4 is an interesting way to think about the problem: which grade is the student most likely to achieve, across everything? Option 5 is there for completeness but that’s another post.
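Several of these variants are easy to compute. Here is a rough sketch, using the hypothetical transcript from earlier and assuming (as in my system) that 4 is the lowest passing grade:

```python
from statistics import mean, mode

grades = [7] * 71 + [4]  # the transcript from the example above

gpa_all = mean(grades)                                  # ~6.96: the standard calculation
gpa_best_90 = mean(sorted(grades)[len(grades) // 10:])  # Option 1: drop the lowest 10%
gpa_passes = mean(g for g in grades if g >= 4)          # Option 3: count passes only
gpa_mode = mode(grades)                                 # Option 4: the most common mark

print(gpa_all, gpa_best_90, gpa_passes, gpa_mode)       # 6.958... 7.0 6.958... 7
```

Note that Options 1 and 4 both restore the 7 for this student, while Option 3 changes nothing here because her single stumble was still a pass.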
Option 6 introduces the idea that we stop treating GPA as a bare number and instead carefully and accurately contextualise it. A student who receives all high distinctions in first semester still has a number of known hurdles to get over. The GPA of 7 that would be present now is not as clear an indicator of facility with the academic system as a GPA of 7 at the end of a degree, whichever other GPA adjustment systems are in play.
More evidence makes it clearer what is happening. If we can accompany a GPA (or similar measure) with evidence, then we are starting to make the process apparent and we make the number mean something. It also allows us to let students control what goes into their calculation, from the grades that they have, because a clear measure of the relevance of that figure can be attached to it.
But this doesn’t have to be a method of avoidance; it can be a useful focusing device. If a student did really well in, say, Software Engineering but struggled with an earlier, unrelated stream, why can’t we construct a GPA for Software Engineering that clearly states its area of relevance and how much information it is based on? Isn’t that actually what employers and people interested in SE want to know?
Handing over an academic transcript seems to allow anyone to do this, but human cognitive biases are powerful, subtle and pervasive. It is harder for most humans to recognise positive progress in the areas that interest them if there is evidence of less stellar performance elsewhere. I cite my usual non-academic example: everyone thought Anthony LaPaglia’s American accent was too fake until he stopped telling people he was Australian.
If we have to use numbers like this, then let us think carefully about what they mean and, if they don’t mean that much, then let’s either get rid of them or make them meaningful. These should, at a fundamental level, be useful to the students first, us second.
Like most published academics, I regularly receive invitations to propose books or book chapters from publishers. Today, one of the larger groups contacted me and mentioned that they would also be interested in any proposals for a video lecture sequence.
And so the world changes.
I’m at the Australasian Computer Science Week at the moment and I’m dividing my time between attending amazing talks, asking difficult questions, catching up with friends and colleagues and doing my own usual work in the cracks. I’ve talked to a lot of people about my ideas on assessment (and beauty) and, as always, the responses have been thoughtful, challenging and helpful.
I think I know what the basis of my problem with assessment is, taking into account all of the roles that it can take. In an earlier post, I discussed Wolff’s classification of assessment tasks into criticism, evaluation and ranking. I’ve also made earlier (grumpy) notes about ranking systems and their arbitrary nature. One of the interesting talks I attended yesterday examined the fragility and questionable accuracy of post-University exit surveys, which are used extensively in formal and informal rankings of Universities yet don’t appear to meet many of the statistical or methodological guidelines for validity that we already have.
But let’s put aside ranking for a moment and return to criticism and evaluation. I’ve already argued (successfully I hope) for a separation of feedback and grades from the criticism perspective. While they are often tied to each other, they can be separated and the feedback can still be useful. Now let’s focus on evaluation.
Remind me why we’re evaluating our students? Well, we’re looking to see if they can perform the task, apply the skill or knowledge, and reach some defined standard. So we’re evaluating our students to guide their learning. We’re also evaluating our students to indirectly measure the efficacy of our learning environment and us as educators. (Otherwise, why is it that there are ‘triggers’ in grading patterns to bring more scrutiny on a course if everyone fails?) We’re also, often accidentally, carrying out an assessment of the innate success of each social and socio-economic grouping present in our class, among other things, but let’s drill down to evaluating the student and evaluating the learning environment. Time for another thought experiment.
Thought Experiment 2
There are twenty tasks aligned with a particular learning outcome. It’s an important outcome and we evaluate it in different ways, but the core knowledge or skill is the same. Each of these tasks can receive a ‘grade’ of 0, 0.5 or 1: 0 means unsuccessful, 0.5 is acceptable, 1 is excellent. Student A attempts all tasks and is acceptable in 19, unsuccessful in 1. Student B attempts the first 10 tasks, receives excellent in all of them and stops. Student C sets up a pattern of excellent, unsuccessful, excellent, unsuccessful, and so on, to receive 10 “excellent”s and 10 “unsuccessful”s. When we form an aggregate grade, A receives 47.5%, B receives 50% and C also receives 50%. Which of these students is the most likely to successfully complete the task?
This framing allows us to look at the evaluation of the student in a meaningful way. “Who will pass the course?” is not the question we should be asking; it’s “Who will be able to reliably demonstrate mastery of the skills or knowledge that we are imparting?” Passing the course has a naturally discrete focus of attention: concentrate on n assignments and m exams and pass. Continual demonstration of mastery is a different goal. This framing also allows us to examine the learning environment because, without looking at the design, I can’t tell you whether B and C’s behaviour is problematic or not.
A has undertaken the most tasks to an acceptable level, but an artefact of grading (or bad luck) has dropped the mark below 50%, which would be a fail (an aggregate of less than acceptable) in many systems. B has performed excellently on every task attempted but, being aware of the marking scheme, has behaved strategically, optimised, and walked away. (Many students who perform at this level wouldn’t, I’m aware, but we’re looking at the implications of this.) C has a troublesome pattern that provides the same outcome as B but with half the success rate.
Before we answer the original question (which is most likely to succeed), I can nominate C as the most likely to struggle because C has the most “unsuccessful”s. From a simple probabilistic argument, 10/20 success is worse than 19/20. It’s a bit trickier comparing 10/10 and 10/20 (because of confidence intervals), but a 90% Adjusted Wald interval for 10/20 is roughly ±17%, while the same interval for 10/10 has a lower bound of about 75%. The highest plausible ‘real’ measure for C is therefore under 14/20 and the lowest plausible ‘real’ measure for B is (scaled) 15/20, so they don’t overlap and we can say that B appears to be more successful than C as well.
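For the curious, here is a minimal sketch of that interval calculation (the Agresti–Coull ‘Adjusted Wald’ formula, at the 90% level assumed above):

```python
import math

def adjusted_wald(successes, trials, z=1.6449):
    """Agresti-Coull (Adjusted Wald) interval; z = 1.6449 gives a 90% interval."""
    n = trials + z ** 2                # adjusted number of trials
    p = (successes + z ** 2 / 2) / n   # adjusted proportion
    margin = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - margin), min(1.0, p + margin)

print(adjusted_wald(10, 20))  # C: about (0.33, 0.67), so at best ~13/20
print(adjusted_wald(10, 10))  # B: about (0.75, 1.00), so at worst ~15/20
```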
From a learning design perspective, do our evaluation artefacts have an implicit design that explains C’s pattern? Is there a difference we’re not seeing? Setting aside any ranking of how likely students are to pass our evaluation framework, C’s pattern is so unusual (high success paired with a lack of any progress) that we learn something immediately from the pattern, whether it’s that C is struggling or that we need to review mechanisms we thought were equivalent!
But who is more likely to succeed out of A and B? 19/20 and 10/10 are barely distinguishable in statistical terms! The question for us now is how many evaluations of a given skill or knowledge mastery are required for us to be confident of competence. This totally breaks the discrete model of cramming for exams and focusing on assignments, because all of our science is built on the notion that evidence is accumulated through observation and the analysis of what occurred, in order to construct models that predict future behaviour. In this case, our goal is to see if our students are competent.
I can never be 100% sure that my students will be able to perform a task but what is the level I’m happy with? How many times do I have to evaluate them at a skill so that I can say that x successes in y attempts constitutes a reliable outcome?
If we say that a student has to reliably succeed 90% of the time, we face the problem that just testing them ten times isn’t enough for us to be sure that they’re hitting 90%.
But the amount of evidence we need before we can be confident is quite daunting. Using the same Adjusted Wald approach, if we provide a student with 150 opportunities to demonstrate knowledge and they succeed 143 times, the lower bound of the interval finally sits above 90%: only then is it very likely that their real success level is at least 90%.
If we say that competency is measured by a success rate that is greater than 75%, a student who achieves 10/10 just clears that bar (the lower bound of the 90% interval is fractionally above 75%), but even succeeding at 9/9 falls short of it.
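If you want to check these numbers yourself, a quick sketch using statsmodels (assuming you have it installed) reproduces them:

```python
from statsmodels.stats.proportion import proportion_confint

# 90% Agresti-Coull (Adjusted Wald) intervals: alpha = 0.10
for successes, trials in [(143, 150), (10, 10), (9, 9)]:
    lo, hi = proportion_confint(successes, trials, alpha=0.10,
                                method="agresti_coull")
    print(f"{successes}/{trials}: lower bound = {lo:.3f}")

# 143/150: lower bound ~0.91 -> at least 90% is plausible
# 10/10:   lower bound ~0.75 -> just clears a 75% competency bar
# 9/9:     lower bound ~0.73 -> falls just short
```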
What this tells us (and reminds us) is that our learning environment design is incredibly important and it must start from a clear articulation of what success actually means, what our goals are and how we will know when our students have reached that point.
There is a grade separation between A and B but it’s artificial. I noted that it was hard to distinguish A and B statistically, but there is one important difference in the lower bounds of their confidence intervals: computed over the aggregate mark, A’s is less than 75% and B’s is slightly above.
Now we have to deal with the fact that A and B were both competent (if not identical) for the first ten tests, and A was actually more competent than B until the failed 20th test. This has enormous implications for how we structure evaluation, how many successful repetitions define success, and how many ‘failures’ we can tolerate while still saying that A and B are competent.
Confused? I hope not, but I do hope that this is making you think about evaluation in ways that you may not have before.
There was a time before graphics dominated the way that you worked with computers and, back then, after punchcards and before Mac/Windows, the most common way of working with a computer was to use the Command Line Interface (CLI). Many of you will have seen this; here’s Terminal from Mac OS X, showing a piece of Python code inside an editor.
Rather than use a rich Integrated Development Environment, where text is highlighted and all sorts of clever things are done for me, I would run some sort of program editor from the command line, write my code, close that editor and then see what worked.
At my University, we almost always taught Computer Science using command line tools, rather than rich development environments such as Eclipse or the Visual Studio tools. Why? The reasoning was that the CLI developed skills required to write code, compile it, debug it and run it, without training students into IDE-provided shortcuts. The CLI was the approach that would work anywhere. That knowledge was, as we saw it, fundamental.
But, remember that Processing example? We clearly saw where the error was. This is what a similar error looks like for the Java programming language in a CLI environment.
Same message (and now usefully on the right line because 21st Century) but it is totally divorced from the program itself. That message has to give me a line number (5) in the original program because it has no other way to point to the problem.
And here’s the problem. The cognitive load increases once we separate code and errors. Despite those Processing errors looking like the soft option, everything we know about load tells us that students will find fixing their problems easier if they don’t have to mentally or physically switch between code and error output.
Everything I said about CLIs is still true but that’s only a real consideration if my students go out into the workplace and need some CLI skills. And, today, just about every workplace has graphics-based IDEs for production software. (Networking is often an exception but we’ll skip over that. Networking is special.)
The best approach for students learning to code is that we don’t make things any harder than we need to. The CLI approach is something I would like students to be able to do but my first task is to get them interested in programming. Then I have to make their first experiences authentic and effective, and hopefully pleasant and rewarding.
I have thought about this for years and I started out as staunchly CLI. But as time goes by, I really have to wonder whether a tiny advantage for a small number of graduates is worth additional load for every new programmer.
And I don’t think it is worth it. It’s not fair. It’s the opposite of equitable. And it goes against the research that we have on cognitive load and student workflows in these kinds of systems. We already know of enough load problems in graphics-based environments, even when the screens are large enough, without any flicking from one application to another!
You don’t have to accept my evaluation model to see this because it’s a matter of common sense that forcing someone to unnecessarily switch tasks to learn a new skill is going to make it harder. Asking someone to remember something complicated in order to use it later is not as easy as someone being able to see it when and where they need to use it.
The world has changed. CLIs still exist but graphical user interfaces (GUIs) now rule. Any of my students who needs to be a crack programmer in a text window of 80×24 will manage it, even if I teach with IDEs for the entire degree, because all of the IDEs are made up of small windows. Students can either debug and read error messages or they can’t – a good IDE helps you but it doesn’t write or fix the code for you, in any deep way. It just helps you to write code faster, without the waiting and context switching that a CLI demands to find silly mistakes you could have fixed in a split second in an IDE.
When it comes to teaching programming, I’m not a CLI guy anymore.
If we want to give feedback, then the time it takes to give feedback is going to determine how often we can do it. If the core of our evaluation is feedback, rather than some low-Bloom’s quiz-based approach giving a score of some sort, then we have to set our timelines to allow us to:
- Get the work when we are ready to work on it
- Undertake evaluation to the required level
- Return that feedback
- Do this at such a time that our students can learn from it and potentially use it immediately, to reinforce the learning
A commenter asked me how I actually ran large-scale assessment. The largest class I’ve run detailed feedback/evaluation on was 360 students with a weekly submission of a free-text (and graphics) solution to a puzzle. The goal was to have the feedback back within a week – prior to the next lecture where the solution would be delivered.
I love a challenge.
This scale is, obviously, impossible for one person to achieve reliably (we estimated it as at least forty hours of work). Instead, we allocated a marking team to this task, coordinated by the lead educator. (The E1 and E2 model again. There was no automated capacity for this at the time, although we added some later.)
Coordinating a team takes time. Even when you start with a rubric, free-text answers can turn up answer themes that you didn’t anticipate, and we would often carry out simple checks to make sure that things were working. But, looking at the marking time I was billed for (a good measure), I could run an entire cycle of this in three days, including briefing time, testing, marking and oversight. But this is with a trained team, a big enough team, good conceptual design and a senior educator who’s happy to take a more executive role.
In this case, we didn’t give the students a chance to refactor their work but, if we had, we could have done this with a release 3 days after submission. To ensure that we then completed the work again by the ‘solution release’ deadline, we would have had to set the next submission deadline to only 24 hours after the feedback was released. This sounds short but, if we assume that some work has been done, then refactoring and reworking should take less time.
But then we have to think about the cost. By running two evaluation cycles we are providing early feedback but we have doubled our cost for human markers (a real concern for just about everyone these days).
My solution was to divide the work into two components. The first was quiz-based and could be automatically and immediately assessed by the Learning Management System, delivering a mark at a fixed deadline. The second part was looked at by humans. Thus, students received feedback on part of the problem (or a related problem) straight away, while they were waiting for the humans.
But I’d be the first to admit that I hadn’t linked this properly, according to my new model. It does give us insight into a staged hybrid model where we buffer our human feedback by using either a smart or a dumb automated assessment component to highlight key areas and, better still, we can bring these forward to help guide time management.
I’m not unhappy with that early attempt at large-scale human feedback as the students were receiving some excellent evaluation and feedback and it was timely and valuable. It also gave me a lot of valuable information about design and about what can work, as well as how to manage marking teams.
I also realised that some courses could never be assessed in the way that they claimed unless they had more people on task; otherwise the feedback could only be delivered at a time when the result was no longer usable.
How much time should we give students to rework things? I’d suggest that allowing a couple of days takes into account the life beyond Uni that many students have. That means that we can do a cycle in a week if we can keep our human evaluation stages under 2 days. Then, without any automated marking, we get 2 days (E1 or E2) + 2 days (student) + 2 days (second evaluation, possibly E2) + 1 day (final readjustment) and then we should start to see some of the best work that our students can produce.
Assuming, of course, that all of us can drop everything to slot into this. For me, this motivates a cycle closer to two to three weeks, to allow for everything else that both groups are doing. But that then limits us to fewer than five big assessment items for a twelve-week course!
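As a back-of-the-envelope sketch of that arithmetic (the durations are the assumptions from above, not policy):

```python
# All durations in days; the values are the assumptions from the cycle above.
first_evaluation = 2    # E1/E2 marking pass
student_rework = 2      # students act on the feedback
second_evaluation = 2   # second marking pass (possibly E2)
final_readjustment = 1

cycle = first_evaluation + student_rework + second_evaluation + final_readjustment
print(cycle)            # 7 days: one full cycle per week in the best case

realistic_cycle_weeks = 2.5   # once everything else both groups do is factored in
print(12 / realistic_cycle_weeks)  # 4.8: fewer than five big items in twelve weeks
```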
What’s better? Twelve assessment items that are “submit and done” or four that are “refine and reinforce to best practice”? Is this even a question we can ask? I know which one is aesthetically pleasing, in terms of all of the educational aesthetics we’ve discussed so far, but is this enough for an educator to be able to stand up to a superior and say “We’re not going to do X because it just doesn’t make any sense!”?
What do you think?
One of the problems with any model that builds in more feedback is that we incur the time required to produce the feedback and we also take on an implicit requirement to allow students enough time to assimilate and make use of it. This second requirement is still there even if we don’t have subsequent attempts at the work, as we want to build upon existing knowledge. The requirement for good feedback makes no sense without a requirement that it be useful.
But let me reiterate that pretty much all evaluation and feedback can be very valuable, no matter how small or quick, if we know what we are trying to achieve. (I’ll get to more complicated systems in later posts.)
Novice programmers often struggle with programming, and this early stage of development will often influence whether they start off thinking that they can program or not. Given that automated evaluation only really provides useful feedback once the student has got something working, novice programming classes are an ideal place to put human markers. If we can make students think “Yes, I can do this” early on, this is the emotion that they will remember. We need to get to big problems quickly, turn them into manageable issues that can be overcome, and then let motivation and curiosity do the rest.
There’s an excellent summary paper on computer programming visualisation systems aimed at novice programmers, which discusses some of the key problems novices face on their path to mastery:
- Novices can see some concepts as code rather than the components of a dynamic process. For example, they might see objects as simply a way of containing things rather than modelling objects and their behaviours. These static perceptions prevent the students from understanding that they are designing behaviours, not just writing magic formulas.
- There can be significant difficulties in understanding the computer itself: seeing the notional machine, the abstraction that forms a basis upon which knowledge of one language or platform can be applied elsewhere.
- Misunderstanding fundamental concepts is common, and such misconceptions can easily cause weak understanding, leaving students in a liminal state, unable to assimilate a threshold concept and move on.
- Students struggle to trace programs and work out what state the program should be in. In my own community, Raymond Lister, Donna Teague, Simon and others have clearly shown that many students struggle with the tracing of even simple programs (a sketch of one such task follows this list).
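As a hypothetical example of the kind of short tracing task used in that research (this particular snippet is my own illustration, not one of theirs), consider asking novices to predict the output of:

```python
# What does this print? Many novices predict "5 3", reading it as a swap.
a = 3
b = 5
a = b   # a is now 5
b = a   # b is assigned the *current* value of a, which is 5
print(a, b)  # prints: 5 5
```

A student who cannot reliably trace those four lines has no stable notional machine to build on, which is exactly the kind of problem an on-the-spot human evaluator can catch.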
If we have put human markers (E1 or E2) into a programming class and identified that these are the problems we’re looking for, we can provide immediate, targeted evaluation that is also immediate, constructive feedback: on the day, in response to actual issues, an authentic demonstration of a solution process that students can model. This is the tightest feedback and reward loop we can offer. How does this work?
- Program doesn’t work because of one of the key problem areas.
- Human evaluator intervenes with student and addresses the issue, encouraging discovery inside the problem area.
- Student tries to identify the problem and explains it to the evaluator in context, modelling the evaluator’s approach and building on existing knowledge.
- Evaluator provides more guidance and feedback.
- Student continues to work on problem.
- We hope that the student will come across the solution (or think towards it) but we may have to restart this loop.
Note that we’re not necessarily giving the solution here, but we can consider leading towards it if the student is getting visibly frustrated. I’d suggest never telling a student what to type, as it doesn’t address any of the problems; it just makes the student dependent upon being told the answer. Not desirable. (There’s an argument here for rich development environments that I’ll expand on later.)
Evaluation like this is formative, immediate and rich. We can even streamline it with guidelines to help the evaluators, although much of this will amount to supporting students as they learn to read their own code and understand the key concepts. We should develop students from simple to complex, and from concrete to abstract, so some problems with abstraction are to be expected, especially if we are playing near any threshold concepts.
But this is where learning designers have to be ready to say “this may cause trouble” and properly brief the evaluators who will be on the ground. If we want our evaluators to work efficiently and effectively, we have to brief them on what to expect, what to do, and how to follow up.
If you’ve missed it so far, one of our big responsibilities is training our evaluation team. It’s only by doing this that we can make sure that our evaluators aren’t getting bogged down in side issues or spending too much time with one student and doing the work for them. This training should include active scenario-based training to allow the evaluators to practise with the oversight of the educators and designers.
We have finite resources. If we want to support a room full of novices, we have to prepare for the possibility of all of them having problems at once and the only way to support that at scale is to have an excellent design and train for it.
I’m about to start a new thread of discussion, once I’ve completed the assessment posts, and this seemed to be good priming for thinking ahead.
“The true business of people should be to go back to school and think about whatever it was they were thinking about before somebody came along and told them they had to earn a living.”
Buckminster Fuller