Dances with GPAs

Dragon_dance_at_China_1

The trick to dancing with dragons is to never lose your grip on the tail.

If we are going to try and summarise a complicated, long-term process with a single number, and I don’t see such shortcuts going away anytime soon, then it helps to know:

  • Exactly what the number represents.
  • How it can be used.
  • What the processes are that go into its construction.

We have conventions as to what things mean but, when we want to be precise, we have to be careful about our definition and our usage of the final value. As a simple example, one thing that often surprises people who are new to numerical analysis is that there is more than one way of calculating the average value of a group of numbers.

While average in colloquial language would usually mean that we take the sum of all of the numbers and divide them by their count, this is more formally referred to as the arithmetic mean. What we usually want from the average is some indication of what the typical value for this group would be. If you weigh ten bags of wheat and the average weight is 10 kilograms, then that’s what many people would expect the weight to be for future bags, unless there was clear early evidence of high variation (some 500g, some 20 kilograms, for example.)

But the mean is only one way to measure central tendency in a group of numbers. We can also measure the median, the number that separates the highest half of the data from the lowest, or the mode, the value that is the most frequently occurring value in the group.

(This doesn’t even get into the situation where we decide to aggregate the values in a different way.)

If you’ve got ten bags of wheat and nine have 10 kilograms in there, but one has only 5 kilograms, which of these ways of calculating the average is the one you want? The mode is 10kg but the mean is 9.5kg. If you tried to distribute the bags based on the expectation that everyone gets 9.5, you’re going to make nine people very happy and one person unhappy.

Most Grade Point Average calculations are based on a simple arithmetic mean of all available grades, with points allocated from 0 to an upper bound based on the grade performance. As a student adds more courses, these contributions are added to the calculation.

In yesterday’s post, I mused on letting students control which grades go into a GPA calculation and, to explore that, I now have to explain what I mean and why that would change things.

As it stands, because a GPA is an average across all courses, any lower grades will permanently drop the GPA contribution of any higher grades. If a student gets a 7 (A+ or High Distinction) for 71 of her courses and then a single 4 (a Passing grade) for one, her GPA will be 6.875. It can never return to 7. The clear performance band of this student is at the highest level, given that just under 99% of her marks are at the highest level, yet the inclusion of all grades means that a single underperformance, for whatever reason, in three years has cost her standing for those people who care about this figure.

My partner and I discussed some possible approaches to GPA that would be better and, by better, we mean approaches that encourage students to improve, that clearly show what the GPA figure means, and that are much fairer to the student. There are too many external factors contributing to resilience and high performance for me to be 100% comfortable with the questionable representation provided by the GPA.

Before we even think about student control over what is presented, we can easily think of several ways to make a GPA reflect what you have achieved, rather than what you have survived.

  1. We could only count a percentage of the courses for each student. Even having 90% counted means that students who stumble a little once or twice do not have this permanently etched into a dragging grade.
  2. We could allow a future attempt at a course with an improved to replace the previous grade. Before we get too caught up in the possibility of ‘gaming’, remember that students would have to pay for this (even if delayed) in most systems and it will add years to their degree. If a student can reach achievement level X in a course then it’s up to us to make sure that does correspond to the achievement level!
  3. We could only count passes. Given that a student has to assemble sufficient passing grades to be awarded a degree, why then would we include the courses that do not count in a calculation of GPA?
  4. We could use the mode and report the most common mark the student receives.
  5. We could do away with it totally. (Not going to happen any time soon.)
  6. We could pair the GPA with a statistical accompaniment that tells the viewer how indicative it is.

Options 1 and 2 are fairly straight-forward. Option 3 is interesting because it compresses the measurement band to a range of (in my system) 4-7 and this then implicitly recognises that GPA measures for students who graduate are more likely to be in this tighter range: we don’t actually have the degree of separation that we’d assume from a range of 0-7. Option 4 is an interesting way to think about the problem: which grade is the student most likely to achieve, across everything? Option 5 is there for completeness but that’s another post.

Option 6 introduces the idea that we stop GPA being a number and we carefully and accurately contextualise it. A student who receives all high distinctions in first semester still has a number of known hurdles to get over. The GPA of 7 that would be present now is not as clear an indicator of facility with the academic system as a GPA of 7 at the end of a degree, whichever other GPA adjustment systems are in play.

More evidence makes it clearer what is happening. If we can accompany a GPA (or similar measure) with evidence, then we are starting to make the process apparent and we make the number mean something. However, this also allows us to let students control what goes into their calculation, from the grades that they have, as a clear measure of the relevance of that measure can be associated.

But this doesn’t have to be a method of avoidance, this can be a useful focusing device. If a student did really well in, say, Software Engineering but struggled with an earlier, unrelated, stream, why can’t we construct a GPA for Software Engineering that clearly states the area of relevance and degree of information? Isn’t that actually what employers and people interested in SE want to know?

Handing over an academic transcript seems to allow anyone to do this but human cognitive biases are powerful, subtle and pervasive. It is harder for most humans to recognise positive progress in the areas that they are interested in, if there is evidence of less stellar performance elsewhere. I cite my usual non-academic example: Everyone thought Anthony La Paglia’s American accent was too fake until he stopped telling people he was Australian.

If we have to use numbers like this, then let us think carefully about what they mean and, if they don’t mean that much, then let’s either get rid of them or make them meaningful. These should, at a fundamental level, be useful to the students first, us second.


Total control: a user model for student results

Yesterday, I wrote:

We need assessment systems that work for the student first and everyone else second.


Grades are the fossils of evaluation

Assessments support evaluation, criticism and ranking (Wolff). That’s what it does and, in many cases, that also constitutes a lot of why we do it. But who are we doing it for?

I’ve reflected on the dual nature of evaluation, showing a student her or his level of progress and mastery while also telling us how well the learning environment is working. In my argument to reduce numerical grades to something meaningful, I’ve asked what the actual requirement is for our students, how we measure mastery and how we can build systems to provide this.

But who are the student’s grades actually for?

In terms of ranking, grades allow people who are not the student to place the students in some order. By doing this, we can award awards to students who are in the awarding an award band (repeated word use deliberate). We can restrict our job interviews to students who are summa cum laude or valedictorian or Dean’s Merit Award Winner. Certain groups of students, not all, like to define their progress through comparison so there is a degree of self-ranking but, for the most part, ranking is something that happens to students.

Criticism, in terms of providing constructive, timely feedback to assist the student, is weakly linked to any grading system. Giving someone a Fail grade isn’t a critique as it contains no clear identification of the problems. The clear identification of problems may not constitute a fail. Often these correlate but it’s weak. A student’s grades are not going to provide useful critique to the student by themselves. These grades are to allow us to work out if the student has met our assessment mechanisms to a point where they can count this course as a pre-requisite or can be awarded a degree. (Award!)

Evaluation is, as noted, useful to us and the student but a grade by itself does not contain enough record of process to be useful in evaluating how mastery goals were met and how the learning environment succeeded or failed. Competency, when applied systematically, does have a well-defined meaning. A passing grade does not although there is an implied competency and there is a loose correlation with achievement.

Grades allow us to look at all of a student’s work as if this one impression is a reflection of the student’s involvement, engagement, study, mistakes, triumphs, hopes and dreams. They are additions to a record from which we attempt to reconstruct a living, whole being.

Grades are the fossils of evaluation.

Grades provide a mechanism for us, in a proxy role as academic archaeologist, to classify students into different groups, in an attempt to project colour into grey stone, to try and understand the ecosystem that such a creature would live in, and to identify how successful this species was.

As someone who has been a student several times in my life, I’m aware that I have a fossil record that is not traditional for an academic. I was lucky to be able to place a new imprint in the record, to obscure my history as a much less successful species, and could then build upon it until I became an ACADEMIC TYRANNOSAURUS.

Skull of a Tyrannosaurus Rex at Palais de la Decouverte

LIFE LONG LEARNING, ROAARRRR!

But I’m lucky. I’m privileged. I had a level of schooling and parental influence that provided me with an excellent vocabulary and high social mobility. I live in a safe city. I have a supportive partner. And, more importantly, at a crucial moment in my life, someone who knew me told me about an opportunity that I was able to pursue despite the grades that I had set in stone. A chance came my way that I never would have thought of because I had internalised my grades as my worth.

Let’s look at the fossil record of Nick.

My original GPA fossil, encompassing everything that went wrong and right in my first degree, was 2.9. On a scale of 7, which is how we measure it, that’s well below a pass average. I’m sharing that because I want you to put that fact together with what happened next. Four years later, I started a Masters program that I finished with a GPA of 6.4. A few years after the masters, I decided to go and study wine making. That degree was 6.43. Then I received a PhD, with commendation, that is equivalent to GPA 7. (We don’t actually use GPA in research degrees. Hmmm.) If my grade record alone lobbed onto your desk you would see the desiccated and dead snapshot of how I (failed to) engage with the University system. A lot of that is on me but, amazingly, it appears that much better things were possible. That original grade record stopped me from getting interviews. Stopped me from getting jobs. When I was finally able to demonstrate the skills that I had, which weren’t bad, I was able to get work. Then I had the opportunity to rewrite my historical record.

Yes, this is personal for me. But it’s not about me because I wasn’t trapped by this. I was lucky as well as privileged. I can’t emphasise that enough. The fact that you are reading this is due to luck. That’s not a good enough mechanism.

Too many students don’t have this opportunity. That impression in the wet mud of their school life will harden into a stone straitjacket from which they may never escape. The way we measure and record grades has far too much potential to work against students and the correlation with actual ability is there but it’s not strong and it’s not always reliable.

The student you are about to send out with a GPA of 2.9 may be competent and they are, most definitely, more than that number.

The recording of grades is a high-loss storage record of the student’s learning and pathway to mastery. It allows us to conceal achievement and failure alike in the accumulation of mathematical aggregates that proxy for competence but correlate weakly.

We need assessment systems that work for the student first and everyone else second.


How does competency based assessment work?

From the previous post, I asked how many times a student has to perform a certain task, and to which standard, that we become confident that they can reliably perform the task. In the Vocational Education and Training world this is referred to as competence and this is defined (here, from the Western Australian documentation) as:

In VET, individuals are considered competent when they are able to consistently apply their knowledge and skills to the standard of performance required in the workplace.

How do we know if someone has reached that level of competency?

We know whether an individual is competent after they have completed an assessment that verifies that all aspects of the unit of competency are held and can be applied in an industry context.

The programs involved are made up of units that span the essential knowledge and are assessed through direct observation, indirect measurements (such as examination) and in talking to employers or getting references. (And we have to be careful that we are directly measuring what we think we are!)

A vintage Czech eye chart.

A direct measurement of your eyesight or your ability to memorise Czech eye-charts.

Hang on. Examinations are an indirect measurement? Yes, of course they are here, we’re looking for the ability to apply this and that requires doing rather than talking about what you would do. Your ability to perform the task in direct observation is related to how you can present that knowledge in another frame but it’s not going to be 1:1 because we’re looking at issues of different modes and mediation.

But it’s not enough just to do these tasks as you like, the specification is quite clear in this:

It can be demonstrated consistently over time, and covers a sufficient range of experiences (including those in simulated or institutional environments).

I’m sure that some of you are now howling that many of the things that we teach at University are not just something that you do, there’s a deeper mode of thinking or something innately non-Vocational about what is going on.

And, for some of you, that’s true. Any of you who are asking students to do anything in the bottom range of Bloom’s taxonomy… I’m not convinced. Right now, many assessments of concepts that we like to think of as abstract are so heavily grounded in the necessities of assessment that they become equivalent to competency-based training outcomes.

The goal may be to understand Dijkstra’s algorithm but the task is to write a piece of code that solves the algorithm for certain inputs, under certain conditions. This is, implicitly, a programming competency task and one that must be achieved before you can demonstrate any ability to show your understanding of the algorithm. But the evaluator’s perspective of Dijkstra is mediated through your programming ability, which means that this assessment is a direct measure of programming ability in language X but an indirect measure of Dijkstra. Your ability to apply Dijkstra’s algorithm would, in a competency-based frame, be located in a variety of work-related activities that could verify your ability to perform the task reliably.

All of my statistical arguments on certainty from the last post come back to a simple concept: do I have the confidence that the student can reliably perform the task under evaluation? But we add to this the following: Am I carrying out enough direct observation of the task in question to be able to make a reliable claim on this as an evaluator?

There is obvious tension, at modern Universities, between what we see as educational and what we see as vocational. Given that some of what we do falls into “workplace skills” in a real sense, although we may wish to be snooty about the workplace, why are we not using the established approaches that allow us to actually say “This student can function as an X when they leave here?”

If we want to say that we are concerned with a more abstract education, perhaps we should be teaching, assessing and talking about our students very, very differently. Especially to employers.


Brief reflection on a changing world

Like most published academics, I regularly receive invitations to propose books or book chapters from publishers. Today, one of the larger groups contacted me and mentioned that they would also be interested in any proposals for a video lecture sequence.

And so the world changes.

TV_noise

Something something radio star.


What do we want? Passing average or competency always?

I’m at the Australasian Computer Science Week at the moment and I’m dividing my time between attending amazing talks, asking difficult questions, catching up with friends and colleagues and doing my own usual work in the cracks.  I’ve talked to a lot of people about my ideas on assessment (and beauty) and, as always, the responses have been thoughtful, challenging and helpful.

I think I know what the basis of my problem with assessment is, taking into account all of the roles that it can take. In an earlier post, I discussed Wolff’s classification of assessment tasks into criticism, evaluation and ranking. I’ve also made earlier (grumpy) notes about ranking systems and their arbitrary nature. One of the interesting talks I attended yesterday talked about the fragility and questionable accuracy of post-University exit surveys, which are used extensively in formal and informal rankings of Universities, yet don’t actually seem to meet many of the statistical or sensible guidelines for efficacy we already have.

But let’s put aside ranking for a moment and return to criticism and evaluation. I’ve already argued (successfully I hope) for a separation of feedback and grades from the criticism perspective. While they are often tied to each other, they can be separated and the feedback can still be useful. Now let’s focus on evaluation.

Remind me why we’re evaluating our students? Well, we’re looking to see if they can perform the task, apply the skill or knowledge, and reach some defined standard. So we’re evaluating our students to guide their learning. We’re also evaluating our students to indirectly measure the efficacy of our learning environment and us as educators. (Otherwise, why is it that there are ‘triggers’ in grading patterns to bring more scrutiny on a course if everyone fails?) We’re also, often accidentally, carrying out an assessment of the innate success of each class and socio-economic grouping present in our class, among other things, but let’s drill down to evaluating the student and evaluating the learning environment. Time for another thought experiment.

Thought Experiment 2

There are twenty tasks aligned with a particularly learning outcome. It’s an important task and we evaluate it in different ways but the core knowledge or skill is the same. Each of these tasks can receive a ‘grade’ of 0, 0.5 or 1. 0 means unsuccessful, 0.5 is acceptable, 1 is excellent. Student A attempts all tasks and is acceptable in 19, unsuccessful in 1. Student B attempts the first 10 tasks, receives excellent in all of them and stops. Student C sets up a pattern of excellent,unsuccessful, excellent, unsuccessful.. and so on to receive 10 “Excellent”s and 10 “unsuccessful”s. When we form an aggregate grade, A receives 47.5%, B receives 50% and C also receives 50%. Which of these students is the most likely to successfully complete the task?

This framing allows us to look at the evaluation of the student in a meaningful way. “Who will pass the course?” is not the question we should be asking, it’s “Who will be able to reliably demonstrate mastery of the skills or knowledge that we are imparting.” Passing the course has a naturally discrete attention focus: focus on n assignments and m exams and pass. Continual demonstration of mastery is a different goal. This framing also allows us to examine the learning environment because, without looking at the design, I can’t tell you if B and C’s behaviour is problematic or not.

CompFail

A has undertaken the most tasks to an acceptable level but an artefact of grading (or bad luck) has dropped the mark below 50%, which would be a fail (aggregate less than acceptable) in many systems. B has performed excellently on every task attempted but, being aware of the marking scheme, optimising and strategic behaviour allows this student to walk away. (Many students who perform at this level wouldn’t, I’m aware, but we’re looking at the implications of this.) C has a troublesome pattern that provides the same outcome as B but with half the success rate.

Before we answer the original question (which is most likely to succeed), I can nominate C as the most likely to struggle because C has the most “unsuccessful”s. From a simple probabilistic argument, 10/20 success is worse than 19/20. It’s a bit tricker comparing 10/10 and 10/20 (because of confidence intervals) but 10/20 has an Adjusted Wald range of +/- 20% and 10/10 is -14%, so the highest possible ‘real’ measure for C is 14/20 and the lowest possible ‘real’ measure for B is (scaled) 15/20, so they don’t overlap and we can say that B appears to be more successful than C as well.

From a learning design perspective, do our evaluation artefacts have an implicit design that explains C’s pattern? Is there a difference we’re not seeing? Taking apart any ranking of likeliness to pass our evaluatory framework, C’s pattern is so unusual (high success/lack of any progress) that we learn something immediately from the pattern, whether it’s that C is struggling or that we need to review mechanisms we thought to be equivalent!

But who is more likely to succeed out of A and B? 19/20 and 10/10 are barely distinguishable in statistical terms! The question for us now is how many evaluations of a given skill or knowledge mastery are required for us to be confident of competence. This totally breaks the discrete cramming for exams and focus on assignment model because all of our science is built on the notion that evidence is accumulated through observation and the analysis of what occurred, in order to be able to construct models to predict future behaviour. In this case, our goal is to see if our students are competent.

I can never be 100% sure that my students will be able to perform a task but what is the level I’m happy with? How many times do I have to evaluate them at a skill so that I can say that x successes in y attempts constitutes a reliable outcome?

If we say that a student has to reliably succeed 90% of the time, we face the problem that just testing them ten times isn’t enough for us to be sure that they’re hitting 90%.

But the level of performance we need to be confident is quite daunting. By looking at some statistics, we can see that if we provide a student with 150 opportunities to demonstrate knowledge and they succeed at this 143 times, then it is very likely that their real success level is at least 90%.

If we say that competency is measured by a success rate that is greater than 75%, a student who achieves 10/10 has immediately met that but even succeeding at 9/9 doesn’t meet that level.

What this tells us (and reminds us) is that our learning environment design is incredibly important and it must start from a clear articulation of what success actually means, what our goals are and how we will know when our students have reached that point.

There is a grade separation between A and B but it’s artificial. I noted that it was hard to distinguish A and B statistically but there is one important difference in the lower bound of their confidence interval. A is less than 75%, B is slightly above.

Now we have to deal with the fact that A and B were both competent (if not the same) for the first ten tests and A was actually more competent than B until the 20th failed test. This has enormous implications for we structure evaluation, how many successful repetitions define success and how many ‘failures’ we can tolerate and still say that A and B are competent.

Confused? I hope not but I hope that this is making you think about evaluation in ways that you may not have done so before.

 


Too big for a term? Why terms?

I’ve reached the conclusion that a lot of courses have an unrealistically high number of evaluations. We have too many and we pretend that we are going to achieve outcomes for which we have no supporting evidence. Worse, in many cases, we are painfully aware that we cause last-minute lemming-like effects that do anything other than encourage learning. But why do we have so many? Because we’re trying to fit them into the term or semester size that we have: the administrative limit.

One the big challenges for authenticity in Computer Science is the nature of the software project. While individual programs can be small and easy to write, a lot of contemporary programming projects are:

  1. Large and composed of many small programs.
  2. Complex to a scale that may exceed one person’s ability to visualise.
  3. Long-lived.
  4. Multi-owner.
  5. Built on platforms that provide core services; the programmers do not have the luxury to write all of the code in the system.

Many final year courses in Software Engineering have a large project courses, where students are forced to work with a (usually randomly assigned) group to produce a ‘large’ piece of software. In reality, this piece of software is very well-defined and can be constructed in the time available: it has been deliberately selected to be so.

Is a two month software task in a group of six people indicative of real software?

calendar-660670_960_720

June 16: Remember to curse teammate for late delivery on June 15.

Yes and no. It does give a student experience in group management, except that they still have the safe framework of lecturers over the top. It’s more challenging than a lot of what we do because it is a larger artefact over a longer time.

But it’s not that realistic. Industry software projects live over years, with tens to hundreds of programmers ‘contributing’ updates and fixes… reversing changes… writing documentation… correcting documentation. This isn’t to say that the role of a university is to teach industry skills but these skill sets are very handy for helping programmers to take their code and make it work, so it’s good to encourage them.

I believe finally, that education must be conceived as a continuing reconstruction of experience; that the process and the goal of education are one and the same thing.

from John Dewey, “My Pedagogic Creed”,  School Journal vol. 54 (January 1897)

I love the term ‘continuing reconstruction of experience’ as it drives authenticity as one of the aesthetic characteristics of good education.

Authentic, appropriate and effective learning and evaluation activities may not fit comfortably into a term. We already accept this for activities such as medical internship, where students must undertake 47 weeks of work to attain full registration. But we are, for many degrees, trapped by the convention of a semester of so many weeks, which is then connected with other semesters to make a degree that is somewhere between three to five years long.

The semester is an artefact of the artificial decomposition of the year, previously related to season in many places but now taking on a life of its own as an administrative mechanism. Jamming things into this space is not going to lead to an authentic experience and we can now reject this on aesthetic grounds. It might fit but it’s beautiful or true.

But wait! We can’t do that! We have to fit everything into neat degree packages or our students won’t complete on time!

Really?

Let’s now look at the ‘so many years degree’. This is a fascinating read and I’ll summarise the reported results for degree programs in the US, which don’t include private colleges and universities:

  • Fewer than 10% of reporting institutions graduated a majority of students on time.
  • Only 19% of students at public universities graduate on-time.
  • Only 36% of state flagship universities graduate on-time
  • 5% of community college students complete an associate degree on-time.

The report has a simple name for this: the four-year myth. Students are taking longer to do their degrees for a number of reasons but among them are poorly designed, delivered, administered or assessed learning experiences. And jamming things into semester blocks doesn’t seem to be magically translating into on-time completions (unsurprisingly).

It appears that the way we break up software into little pieces is artificial and we’re also often trying to carry out too many little assessments. It looks like a good model is to stretch our timeline out over more than one course to produce an experience that is genuinely engaging, more authentic and more supportive of long term collaboration. That way, our capstone course could be a natural end-point to a three year process… or however long it takes to get there.

Finally, in the middle of all of this, we need to think very carefully about why we keep using the semester or the term as a container. Why are degrees still three to four years long when everything else in the world has changed so much in the last twenty years?


Follow

Get every new post delivered to your Inbox.

Join 1,063 other followers