Like most published academics, I regularly receive invitations to propose books or book chapters from publishers. Today, one of the larger groups contacted me and mentioned that they would also be interested in any proposals for a video lecture sequence.
And so the world changes.
I’m at the Australasian Computer Science Week at the moment and I’m dividing my time between attending amazing talks, asking difficult questions, catching up with friends and colleagues and doing my own usual work in the cracks. I’ve talked to a lot of people about my ideas on assessment (and beauty) and, as always, the responses have been thoughtful, challenging and helpful.
I think I know what the basis of my problem with assessment is, taking into account all of the roles that it can take. In an earlier post, I discussed Wolff’s classification of assessment tasks into criticism, evaluation and ranking. I’ve also made earlier (grumpy) notes about ranking systems and their arbitrary nature. One of the interesting talks I attended yesterday talked about the fragility and questionable accuracy of post-University exit surveys, which are used extensively in formal and informal rankings of Universities, yet don’t actually seem to meet many of the statistical or sensible guidelines for efficacy we already have.
But let’s put aside ranking for a moment and return to criticism and evaluation. I’ve already argued (successfully I hope) for a separation of feedback and grades from the criticism perspective. While they are often tied to each other, they can be separated and the feedback can still be useful. Now let’s focus on evaluation.
Remind me why we’re evaluating our students? Well, we’re looking to see if they can perform the task, apply the skill or knowledge, and reach some defined standard. So we’re evaluating our students to guide their learning. We’re also evaluating our students to indirectly measure the efficacy of our learning environment and us as educators. (Otherwise, why is it that there are ‘triggers’ in grading patterns to bring more scrutiny on a course if everyone fails?) We’re also, often accidentally, carrying out an assessment of the innate success of each class and socio-economic grouping present in our class, among other things, but let’s drill down to evaluating the student and evaluating the learning environment. Time for another thought experiment.
Thought Experiment 2
There are twenty tasks aligned with a particularly learning outcome. It’s an important task and we evaluate it in different ways but the core knowledge or skill is the same. Each of these tasks can receive a ‘grade’ of 0, 0.5 or 1. 0 means unsuccessful, 0.5 is acceptable, 1 is excellent. Student A attempts all tasks and is acceptable in 19, unsuccessful in 1. Student B attempts the first 10 tasks, receives excellent in all of them and stops. Student C sets up a pattern of excellent,unsuccessful, excellent, unsuccessful.. and so on to receive 10 “Excellent”s and 10 “unsuccessful”s. When we form an aggregate grade, A receives 47.5%, B receives 50% and C also receives 50%. Which of these students is the most likely to successfully complete the task?
This framing allows us to look at the evaluation of the student in a meaningful way. “Who will pass the course?” is not the question we should be asking, it’s “Who will be able to reliably demonstrate mastery of the skills or knowledge that we are imparting.” Passing the course has a naturally discrete attention focus: focus on n assignments and m exams and pass. Continual demonstration of mastery is a different goal. This framing also allows us to examine the learning environment because, without looking at the design, I can’t tell you if B and C’s behaviour is problematic or not.
A has undertaken the most tasks to an acceptable level but an artefact of grading (or bad luck) has dropped the mark below 50%, which would be a fail (aggregate less than acceptable) in many systems. B has performed excellently on every task attempted but, being aware of the marking scheme, optimising and strategic behaviour allows this student to walk away. (Many students who perform at this level wouldn’t, I’m aware, but we’re looking at the implications of this.) C has a troublesome pattern that provides the same outcome as B but with half the success rate.
Before we answer the original question (which is most likely to succeed), I can nominate C as the most likely to struggle because C has the most “unsuccessful”s. From a simple probabilistic argument, 10/20 success is worse than 19/20. It’s a bit tricker comparing 10/10 and 10/20 (because of confidence intervals) but 10/20 has an Adjusted Wald range of +/- 20% and 10/10 is -14%, so the highest possible ‘real’ measure for C is 14/20 and the lowest possible ‘real’ measure for B is (scaled) 15/20, so they don’t overlap and we can say that B appears to be more successful than C as well.
From a learning design perspective, do our evaluation artefacts have an implicit design that explains C’s pattern? Is there a difference we’re not seeing? Taking apart any ranking of likeliness to pass our evaluatory framework, C’s pattern is so unusual (high success/lack of any progress) that we learn something immediately from the pattern, whether it’s that C is struggling or that we need to review mechanisms we thought to be equivalent!
But who is more likely to succeed out of A and B? 19/20 and 10/10 are barely distinguishable in statistical terms! The question for us now is how many evaluations of a given skill or knowledge mastery are required for us to be confident of competence. This totally breaks the discrete cramming for exams and focus on assignment model because all of our science is built on the notion that evidence is accumulated through observation and the analysis of what occurred, in order to be able to construct models to predict future behaviour. In this case, our goal is to see if our students are competent.
I can never be 100% sure that my students will be able to perform a task but what is the level I’m happy with? How many times do I have to evaluate them at a skill so that I can say that x successes in y attempts constitutes a reliable outcome?
If we say that a student has to reliably succeed 90% of the time, we face the problem that just testing them ten times isn’t enough for us to be sure that they’re hitting 90%.
But the level of performance we need to be confident is quite daunting. By looking at some statistics, we can see that if we provide a student with 150 opportunities to demonstrate knowledge and they succeed at this 143 times, then it is very likely that their real success level is at least 90%.
If we say that competency is measured by a success rate that is greater than 75%, a student who achieves 10/10 has immediately met that but even succeeding at 9/9 doesn’t meet that level.
What this tells us (and reminds us) is that our learning environment design is incredibly important and it must start from a clear articulation of what success actually means, what our goals are and how we will know when our students have reached that point.
There is a grade separation between A and B but it’s artificial. I noted that it was hard to distinguish A and B statistically but there is one important difference in the lower bound of their confidence interval. A is less than 75%, B is slightly above.
Now we have to deal with the fact that A and B were both competent (if not the same) for the first ten tests and A was actually more competent than B until the 20th failed test. This has enormous implications for we structure evaluation, how many successful repetitions define success and how many ‘failures’ we can tolerate and still say that A and B are competent.
Confused? I hope not but I hope that this is making you think about evaluation in ways that you may not have done so before.
I’ve reached the conclusion that a lot of courses have an unrealistically high number of evaluations. We have too many and we pretend that we are going to achieve outcomes for which we have no supporting evidence. Worse, in many cases, we are painfully aware that we cause last-minute lemming-like effects that do anything other than encourage learning. But why do we have so many? Because we’re trying to fit them into the term or semester size that we have: the administrative limit.
One the big challenges for authenticity in Computer Science is the nature of the software project. While individual programs can be small and easy to write, a lot of contemporary programming projects are:
- Large and composed of many small programs.
- Complex to a scale that may exceed one person’s ability to visualise.
- Built on platforms that provide core services; the programmers do not have the luxury to write all of the code in the system.
Many final year courses in Software Engineering have a large project courses, where students are forced to work with a (usually randomly assigned) group to produce a ‘large’ piece of software. In reality, this piece of software is very well-defined and can be constructed in the time available: it has been deliberately selected to be so.
Is a two month software task in a group of six people indicative of real software?
Yes and no. It does give a student experience in group management, except that they still have the safe framework of lecturers over the top. It’s more challenging than a lot of what we do because it is a larger artefact over a longer time.
But it’s not that realistic. Industry software projects live over years, with tens to hundreds of programmers ‘contributing’ updates and fixes… reversing changes… writing documentation… correcting documentation. This isn’t to say that the role of a university is to teach industry skills but these skill sets are very handy for helping programmers to take their code and make it work, so it’s good to encourage them.
I believe finally, that education must be conceived as a continuing reconstruction of experience; that the process and the goal of education are one and the same thing.
from John Dewey, “My Pedagogic Creed”, School Journal vol. 54 (January 1897)
I love the term ‘continuing reconstruction of experience’ as it drives authenticity as one of the aesthetic characteristics of good education.
Authentic, appropriate and effective learning and evaluation activities may not fit comfortably into a term. We already accept this for activities such as medical internship, where students must undertake 47 weeks of work to attain full registration. But we are, for many degrees, trapped by the convention of a semester of so many weeks, which is then connected with other semesters to make a degree that is somewhere between three to five years long.
The semester is an artefact of the artificial decomposition of the year, previously related to season in many places but now taking on a life of its own as an administrative mechanism. Jamming things into this space is not going to lead to an authentic experience and we can now reject this on aesthetic grounds. It might fit but it’s beautiful or true.
But wait! We can’t do that! We have to fit everything into neat degree packages or our students won’t complete on time!
Let’s now look at the ‘so many years degree’. This is a fascinating read and I’ll summarise the reported results for degree programs in the US, which don’t include private colleges and universities:
- Fewer than 10% of reporting institutions graduated a majority of students on time.
- Only 19% of students at public universities graduate on-time.
- Only 36% of state flagship universities graduate on-time
- 5% of community college students complete an associate degree on-time.
The report has a simple name for this: the four-year myth. Students are taking longer to do their degrees for a number of reasons but among them are poorly designed, delivered, administered or assessed learning experiences. And jamming things into semester blocks doesn’t seem to be magically translating into on-time completions (unsurprisingly).
It appears that the way we break up software into little pieces is artificial and we’re also often trying to carry out too many little assessments. It looks like a good model is to stretch our timeline out over more than one course to produce an experience that is genuinely engaging, more authentic and more supportive of long term collaboration. That way, our capstone course could be a natural end-point to a three year process… or however long it takes to get there.
Finally, in the middle of all of this, we need to think very carefully about why we keep using the semester or the term as a container. Why are degrees still three to four years long when everything else in the world has changed so much in the last twenty years?
There was a time before graphics dominated the way that you worked with computers and, back then, after punchcards and before Mac/Windows, the most common way of working with a computer was to use the Command Line Interface (CLI). Many of you will have seen this, here’s Terminal from the Mac OS X, showing a piece of Python code inside an editor.
Rather than use a rich Integrated Development Environment, where text is highlighted and all sorts of clever things are done for me, I would run some sort of program editor from the command line, write my code, close that editor and then see what worked.
At my University, we almost always taught Computer Science using command line tools, rather than rich development environments such as Eclipse or the Visual Studio tools. Why? The reasoning was that the CLI developed skills required to write code, compile it, debug it and run it, without training students into IDE-provided shortcuts. The CLI was the approach that would work anywhere. That knowledge was, as we saw it, fundamental.
But, remember that Processing example? We clearly saw where the error was. This is what a similar error looks like for the Java programming language in a CLI environment.
Same message (and now usefully on the right line because 21st Century) but it is totally divorced from the program itself. That message has to give me a line number (5) in the original program because it has no other way to point to the problem.
And here’s the problem. The cognitive load increases once we separate code and errors. Despite those Processing errors looking like the soft option, everything we know about load tells us that students will find fixing their problems easier if they don’t have to mentally or physically switch between code and error output.
Everything I said about CLIs is still true but that’s only a real consideration if my students go out into the workplace and need some CLI skills. And, today, just about every workplace has graphics based IDEs for production software. (Networking is often an exception but we’ll skip over that. Networking is special.)
The best approach for students learning to code is that we don’t make things any harder than we need to. The CLI approach is something I would like students to be able to do but my first task is to get them interested in programming. Then I have to make their first experiences authentic and effective, and hopefully pleasant and rewarding.
I have thought about this for years and I started out as staunchly CLI. But as time goes by, I really have to wonder whether a tiny advantage for a small number of graduates is worth additional load for every new programmer.
And I don’t think it is worth it. It’s not fair. It’s the opposite of equitable. And it goes against the research that we have on cognitive load and student workflows in these kinds of systems. We already know of enough load problems in graphics based environments if we make the screens large enough, without any flicking from one application to another!
You don’t have to accept my evaluation model to see this because it’s a matter of common sense that forcing someone to unnecessarily switch tasks to learn a new skill is going to make it harder. Asking someone to remember something complicated in order to use it later is not as easy as someone being able to see it when and where they need to use it.
The world has changed. CLIs still exist but graphical user interfaces (GUIs) now rule. Any of my students who needs to be a crack programmer in a text window of 80×24 will manage it, even if I teach with all IDEs for the entire degree, because all of the IDEs are made up of small windows. Students can either debug and read error messages or they can’t – a good IDE helps you but it doesn’t write or fix the code for you, in any deep way. It just helps you to write code faster, without having to wait and switch context to find silly mistakes that you could have fixed in a split second in an IDE.
When it comes to teaching programming, I’m not a CLI guy anymore.
Earlier, I split the evaluation resources of a course into:
- E1 (the lecturer and course designer),
- E2 (human work that can be based on rubrics, including peer assessment and casual markers),
- E3 (complicated automated evaluation mechanisms)
- E4 (simple automated evaluation mechanisms, often for acceptance testing)
E1 and E2 everyone tends to understand, because the culture of Prof+TA is widespread, as is the concept of peer assessment. In a Computing Course, we can define E3 as complex marking scripts that perform amazing actions in response to input (or even carry out formal analysis if we’re being really keen), with E4 as simple file checks, program compilation and dumb scripts that jam in a set of data and see what comes out.
But let’s get back to my first year, first exposure, programming class. What I want is hands-on, concrete, active participation and constructive activity and lots of it. To support that, I want the best and most immediate feedback I can provide. Now I can try to fill a room with tutors, or do a lot of peer work, but there will come times when I want to provide some sort of automated feedback.
Given how inexperienced these students are, it could be a quite a lot to expect them to get their code together and then submit it to a separate evaluation system, then interpret the results. (Remember I noted earlier on how code tracing correlates with code ability.)
Thus, the best way to get that automated feedback is probably working with the student in place. And that brings us to the Integrated Development Environment (IDE). An IDE is an application that provides facilities to computer programmers and helps them to develop software. They can be very complicated and rich (Eclipse), simple (Processing) or aimed at pedagogical support (Scratch, BlueJ, Greenfoot et al) but they are usually made up of a place in which you can assemble code (typing or dragging) and a set of buttons or tools to make things happen. These are usually quite abstract for early programmers, built on notional machines rather than requiring a detailed knowledge of hardware.
Even simple IDEs will tell you things that provide immediate feedback. We know how these environments can have positive reception, with some demonstrated benefits, although I recommend reading Sorva et al’s “A Review of Generic Program Visualization Systems for Introductory Programming Education” to see the open research questions. In particular, people employing IDEs in teaching often worry about the time to teach the environment (as well as the language), software visualisations, concern about time on task, lack of integration and the often short lifespan of many of the simpler IDEs that are focused on pedagogical outcomes. Even for well-established systems such as BlueJ, there’s always concern over whether the investment of time in learning it is going to pay off.
In academia, time is our currency.
But let me make an aesthetic argument for IDEs, based on the feedback that I’ve already put into my beautiful model. We want to maximise feedback in a useful way for early programmers. Early programmers are still learning the language, still learning how to spell words, how to punctuate, and are building up to a grammatical understanding. An IDE can provide immediate feedback as to what the computer ‘thinks’ is going on with the program and this can help the junior programmer make immediate changes. (Some IDEs have graphical representations for object systems but we won’t discuss these any further here as the time to introduce objects is a subject of debate.)
Now there’s a lot of discussion over the readability of computer error messages but let me show you an example. What’s gone wrong in this program?
See where that little red line is, just on the end of the first line? Down the bottom there’s a message that says “missing a semicolon”. In the Processing language, almost all lines end with a “;” so that section of code should read:
Did you get that? That missing semicolon problem has been an issue for years because many systems report the semicolon missing on the next line, due to the way that compilers work. Here, Processing is clearly saying: Oi! Put a semi-colon on the red squiggle.
I’m an old programmer, who currently programs in Java, C++ and Processing, so typing “;” at the end of a line is second nature to me. But it’s an easy mistake for a new programmer to make because, between all of the ( and the ) and the , and the numbers and the size and the rect… what do I do with the “;”?
The Processing IDE is functioning in at least an E4 mode: simple acceptance testing that won’t let anything happen until you fix that particular problem. It’s even giving you feedback as to what’s wrong. Now this isn’t to say that it’s great but it’s certainly better than a student sitting there with her hand up for 20 minutes waiting for a tutor to have the time to come over and say “Oh, you’re missing a semicolon.”
We don’t want shotgun coding, where random fixes and bashed-in attempts are made desperately to solve a problem. We want students to get used to getting feedback on how they’re going and using this to improve what they do.
Because of Processing’s highly visual mode, I think it’s closer to E3 (complex scripting) in many ways because it can tell you if it doesn’t understand what you’re trying to do at all. Beyond just not doing something, it can clearly tell you what’s wrong.
But what if it works and then the student puts something up on the screen, a graphic of some sort and it’s not quite right? Then the student has started to become their own E2, evaluating what has happened in response to the code and using human insight to address the shortfall and make changes. Not as an expert but, with support and encouragement, a developing expertise.
Feedback is good. Immediacy is good. Student involvement is good. Code tracing is linked to coding ability. A well-designed IDE can be simple and engage the student to an extent that is potentially as high as E2, although it won’t be as rich, without using any other human evaluation resources. Even if there is no other benefit, the aesthetic argument is giving us a very strong nudge to adopt an appropriate IDE.
Maybe it’s time to hang up the command line and live in a world where IDEs can help us to get things done faster, support our students better and make our formal human evaluation resources go further.
What do you think?
A literate and numerate society is an excellent goal. I’d say it’s probably our least goal for a happy, safe and stable society. But the rise in the number of programmable machines and objects has meant that being able to program or being able to think about programming can make a great deal of difference in the jobs you can hold and in the way that you can amplify your own human effort. Cars help us to go faster but computers help us to get more thinking work done. Being able to program, or knowing when it would be a good idea and how to approach it, will be essential for getting things done.
In fact, having some computer science or programming is handy right now because so many pieces of software can be much more useful if you use their programmatic extensions.
To give you an example, yesterday I was proof reading my first novel. I’m using the Scrivener software package and, among other features, it allows you to use Regular Expressions to search and replace text. A Regular Expression (RegEx) is a type of pattern; once defined, the computer looks for everything that matches that pattern.
I wanted to see if, while writing, I’d accidentally written the same word twice. (Believe me, it happens over 100,000 words.) Instead of searching for duplicate words by having to type ‘of of’ or ‘and and’ into a search field and looking for hits, I can use my knowledge of CS to enter the RegEx:
And this will go looking for any repeated pattern of the form ‘ it it ‘ or ‘ and an d’. (The RegEx should be read as ‘find all the times that I have put two words next to the other, separated by a space, where the words are the same.) Now my hit list is every possible occurrence of this!
By using a RegEx, I found that I had written ‘some some’, a pattern I never would have thought to check for. But that’s the power of programming. When I know how to tell a computer what I actually want, I can use its power to amplify the impact of my thoughts with reduced effort on my part.
Many of today’s applications become much more usable with a little programming. Microsoft Excel is another example where a little CS goes a long way.
That’s why I’m excited by the US President’s announcement on CS for all. You’ll know that our own work in Australia is towards empowering creators and building confidence in all educators and students. It’s great to see such a large and funded initiative being declared for the US. Armed with more knowledge, people can use computers to help themselves and so many more.
You don’t have to be an aesthetic philosopher or educational rebel to know that an empowered and knowledgable generation of school kids is a beautiful thing. As Mark put it, this is huge!
If we want to give feedback, then the time it takes to give feedback is going to determine how often we can do it. If the core of our evaluation is feedback, rather than some low-Bloom’s quiz-based approach giving a score of some sort, then we have to set our timelines to allow us to:
- Get the work when we are ready to work on it
- Undertake evaluation to the required level
- Return that feedback
- Do this at such a time that our students can learn from it and potentially use it immediately, to reinforce the learning
A commenter asked me how I actually ran large-scale assessment. The largest class I’ve run detailed feedback/evaluation on was 360 students with a weekly submission of a free-text (and graphics) solution to a puzzle. The goal was to have the feedback back within a week – prior to the next lecture where the solution would be delivered.
I love a challenge.
This scale is, obviously, impossible for one person to achieve reliably (we estimated it as at least forty hours of work). Instead, we allocated a marking team to this task, coordinated by the lead educator. (E1 and E2 model again. There was, initially, no automated capacity for this at the time although we added some later.)
Coordinating a team takes time. Even when you start with a rubric, free text answers can turn up answer themes that you didn’t anticipate and we would often carry our simple checks to make sure that things were working. But, looking at the marking time I was billed for (a good measure), I could run an entire cycle of this in three days, including briefing time, testing, marking, and oversight. But this is with a trained team, a big enough team, good conceptual design and a senior educator who’s happy to take a more executive role.
In this case, we didn’t give the students a chance to refactor their work but, if we had, we could have done this with a release 3 days after submission. To ensure that we then completed the work again by the ‘solution release’ deadline, we would have had to set the next submission deadline to only 24 hours after the feedback was released. This sounds short but, if we assume that some work has been done, then refactoring and reworking should take less time.
But then we have to think about the cost. By running two evaluation cycles we are providing early feedback but we have doubled our cost for human markers (a real concern for just about everyone these days).
My solution was to divide the work into two components. The first was quiz-based and could be automatically and immediately assessed by the Learning Management System, delivering a mark at a fixed deadline. The second part was looked at by humans. Thus, students received immediate feedback on part of the problem straight away (or a related problem) while they were waiting for humans.
But I’d be the first to admit that I hadn’t linked this properly, according to my new model. It does give us insight for a staged hybrid model where we buffer our human feedback by using either smart or dumb automated assessment component to highlight key areas and, better still, we can bring these forward to help guide time management.
I’m not unhappy with that early attempt at large-scale human feedback as the students were receiving some excellent evaluation and feedback and it was timely and valuable. It also gave me a lot of valuable information about design and about what can work, as well as how to manage marking teams.
I also realised that some courses could never be assessed the way that they claimed unless they had more people on task or only delivered at a time when the result wasn’t usable anymore.
How much time should we give students to rework things? I’d suggest that allowing a couple of days takes into account the life beyond Uni that many students have. That means that we can do a cycle in a week if we can keep our human evaluation stages under 2 days. Then, without any automated marking, we get 2 days (E1 or E2) + 2 days (student) + 2 days (second evaluation, possibly E2) + 1 day (final readjustment) and then we should start to see some of the best work that our students can produce.
Assuming, of course, that all of us can drop everything to slot into this. For me, this motivates a cycle closer to two to three weeks to allow for everything else that both groups are doing. But that then limits us to fewer than five big assessment items for a twelve week course!
What’s better? Twelve assessment items that are “submit and done” or four that are “refine and reinforce to best practice”? Is this even a question we can ask? I know which one is aesthetically pleasing, in terms of all of the educational aesthetics we’ve discussed so far but is this enough for an educator to be able to stand up to a superior and say “We’re not going to do X because it just doesn’t make any sense!”
What do you think?