Reflection: Why I’m stopping daily updates

I’ve written a lot in the past month and a half. Now, because I’m committed to evaluation, I have to look back at all of it and think about some difficult matters:

  1. Is anyone reading this?
  2. Are the people reading this the ones who can make change?
  3. Is this the best way to do this?
  4. Should I be doing something else?

There are roughly 1,000 people who see my posts, between direct subscribers who read in e-mail, Facebook and the elusive following community on Twitter.

Twitter shouldn’t count, as I know from direct experience that the click-through rate from Twitter is tiny. (My posts have been shared by people with 5-10,000 followers and it has turned into maybe 10-20 more people reading.) Now I’m down to maybe 4-500 readers.

Facebook shares a longer fragment of my ideas but the click through is still small. Perhaps this brings me down to the roughly 200 followers I have, who have (over time) contributed about 1,000 ‘Likes’. However, almost all of these positive reinforcements stem from a different phase of the blog, a time when I was blogging conferences and being useful, rather than pontificating on the nature of beauty. My readership used to be 100 people a day, or more. I can’t crack 80 today and the way that I’m blogging is unlikely to reach that larger audience, yet it’s what I want to do.

The answer to 1 is that a few other people a day are reading what I write. I’d put it as high as twenty on a good day but most days it’s under ten.

2’s a tricky question. We can all make change; that’s one of my firmest beliefs. However, there is making change and then there are change makers. I know several people in this area quite well and they read me occasionally but it’s not something that they dedicate time to do. I have people that I always read but I can’t make the changes they need. It’s frustrating. No doubt, my ideas appeal to some people but change takes will and capacity to change, not just a sympathetic ear. I don’t want people to read this and feel trapped because they can’t make change. The answer to 2 is, probably, ‘no’.

3 follows from 1 and 2. If my readership is small and my ideas have little influence then this is not the best way to do things. We face enormous challenges. We need effective mechanisms for sharing information. If I am to make change, I have to invest my time wisely. I am not a large-scale player or a change maker. I need help to do it and if that help isn’t coming from this avenue, I have to choose another.

4 is easier. I can focus on my scholarship, practice, and research, rededicating the time I’ve been spending on this blog. People read papers where they don’t read blogs. Papers drive recognition. Recognition gets you the places to speak where your voice can be heard. There is no point having written all those words in a blog if it’s rarely read. This has been a highly rewarding experience in many ways but you have to wonder why you’re doing it if very few people read it or remember what you’ve written.

I wanted people to think and to talk about the ideas shared here. For those of you who have let me know that this worked, my thanks!

I’m tempted to keep going with the daily blog but the aesthetic argument traps me here. Spending time on something that isn’t working and insisting that it’s valuable is self-deception. Investing energy into an avenue that isn’t achieving your goals isn’t good. I cannot deprive my students of the hour or so a day that I’ve been spending doing this unless I achieve more for them than I would by doing some other aspect of my job.

Students and teachers: the true focus of any aesthetic discussion of education; the most important aspects of any discussion of what we should be doing because they are people and not just machine parts. As for us, so for them.

There are more discussions to be had but they’ll show up in more formal places, most likely. I’m always happy to talk to people about ideas at conferences. I’ve already started a face-to-face discussion about taking some of these ideas further in a more traditional research sense and I’m very excited about that.

But perhaps it’s time to let this blog go, listen to the numbers, reflect on the dissemination of knowledge, and accept that I would not be following my own advice if I were to continue. I love the beauty argument. I think it’s great. I stand by everything I’ve written this year. I just don’t think that this is the way to move people towards that agenda.

Thus, the daily updates stop with this post. I’ll still post things that interest me but there’ll be fewer of them.

I’ll leave you with the message I wanted to get across this year:

  1. Educational philosophy is full of the aesthetics of education. Dewey and Bloom just scratch the surface of this. The late 19th and early 20th century were an incredible time of upheaval and we still haven’t addressed many of the questions raised then. To the libraries!
  2. Fair, equitable, well-designed and evidence-based education is at the core of any beautiful system.
  3. Every day, we should ask ourselves if what we are doing is beautiful, good or true, taking into account all of the difficult questions of how we balance necessities against desirabilities, being honest about which is which. If we aren’t managing this, we need to either seek to change or accept that what we are doing isn’t right.
  4. We should leave enough time for ourselves in all of this, as there should be no sacrificial element to beautiful education.
  5. Change is coming. Change is here. Pretending that it won’t happen isn’t beautiful.

I hope that you all have a fantastic learning and teaching year, with many amazing and beautiful moments and outcomes!

This year, I hope to be at several conferences and I look forward to talking to anyone about the ideas in this phase (or any other phase) of the blog.

Have a great year!

SMPTE_Color_Bars.svg


Aesthetics of group work

What are the characteristics of group work and how can we define these in terms that allow us to form a model of beauty about them? We know what most people want from their group members. They want them to be:

  1. Honest. They do what they say and they only claim what they do. They’re fair in their dealings with others.
  2. Dependable. They actually do all of what they say they’re going to do.
  3. Hard-working. They take a ‘reasonable’ time to get things done.
  4. Able to contribute a useful skill.
  5. A communicator. They let the group know what’s going on.
  6. Positive, possibly even optimistic.

A number of these are already included in the Socratic principles of goodness and truth. Truth, in the sense of being honest and transparent, covers 1, 2 and possibly even 5. Goodness, that what we set out to do is what we do and this leads to beauty, covers 3 and 4, and I think we can stretch it to 6.

But what about the aesthetics of the group itself? What does a beautiful group look like? Let’s ignore the tasks we often use in group environments and talk about a generic group. A group should have at least some of these (from) :

  1. Common goals.
  2. Participation from every member.
  3. A focus on what people do rather than who they are.
  4. A focus on what happened rather than how people intended.
  5. The ability to discuss and handle difference.
  6. A respectful environment with some boundaries.
  7. The capability to work beyond authoritarianism.
  8. An accommodation of difference while understanding that this may be temporary.
  9. The awareness that what group members want is not always what they get.
  10. The realisation that hidden conflict can poison a group.

Note how many of these are actually related to the task itself. In fact, of all of the things I’ve listed, none of the group competencies have anything at all to do with a task and we can measure and assess these directly by observation and by peer report.

How many of these are refined by looking at some arbitrary discipline artefact? If anything, by forcing students to work together on a task ‘for their own good’, are we in direct violation of this new number 7, allowing a group to work beyond strict hierarchies?

512px-Group_font_awesome.svg

“I’m carrying my whole team here!”

I’ve worked in hierarchical groups in the Army. The Army’s structure exists for a very specific reason: soldiers die in war. Roles and relationships are strictly codified to drive skill and knowledge training and to ensure smooth interoperation with a minimum of acclimatisation time. I think we can be bold and state that such an approach is not required for third- or fourth-year computer programming, even at the better colleges.

I am not saying that we cannot evaluate group work, nor am I saying that I don’t believe such training to be valuable for students entering the workforce. I just don’t happen to accept that mediating the value of a student’s skills and knowledge through their ability to carry out group competencies is either fair or honest. Item 9, where group members may have to adopt a role that they have identified is not optimal, is grossly unfair when final marks depend upon how the group work channel mediates the perception of your contribution.

There is a vast amount of excellent group work analysis and support being carried out right now, in many places. The problem occurs when we try to turn this into a mark that is re-contextualised into the knowledge frame. Your ability to work in groups is a competency and should be clearly identified as such. It may even be a competency that you need to display in order to receive industry-recognised accreditation. No problems with that.

The hallmarks of traditional student group work are resentment at having to do it, fear that either their own contributions won’t be recognised or someone else’s will dominate, and a deep-seated desire to get the process over with.

Some tasks are better suited to group solution. Why don’t we change our evaluation mechanisms to give students the freedom to explore the advantages of the group without the repercussions that we currently have in place? I can provide detailed evaluation to a student on their group role and tell a lot about the team. A student’s inability to work with a randomly selected team on a fake project with artificial timelines doesn’t say anything that I would be happy to allocate a failing grade to. It is, however, an excellent opportunity for discussion and learning, assuming I can get beyond the tyranny of the grade to say it.


Challenge accepted: beautiful groupwork

You knew it was coming. The biggest challenge of any assessment model: how do we handle group-based assessment?

Angry_mob_of_four

Come out! We know that you didn’t hand it in on-time!

There’s a joke that says a lot about how students feel when they’re asked to do group work:

When I die I want my group project members to lower me into my grave so they can let me down one more time.

Everyone has horror stories about group work and they tend to fall into these patterns:

  1. Group members X and Y didn’t do enough of the work.
  2. I did all of the work.
  3. We all got the same mark but we didn’t do the same work.
  4. Person X got more than I did and I did more.
  5. Person X never even showed up and they still passed!
  6. We got it all together but Person X handed it in late.
  7. Person W said that he/she would do task T but never did and I ended up having to do it.

Let’s consolidate these. People are concerned about a fair division of work and fair recognition of effort, especially where this falls into an allocation of grades. (Point 6 only matters if there are late penalties or opportunities lost by not submitting in time.)

This is totally reasonable! If someone is getting recognition for doing a task then let’s make sure it’s the right person and that everyone who contributed gets a guernsey. (Australian football reference to being a recognised team member.)

How do we make group work beautiful? First, we have to define the aesthetics of group work: which characteristics define the activity? Then we maximise those as we have done before to find beauty. But in order for the activity to be both good and true, it has to achieve the goals that we define and we have to be open about what we are doing. Let’s start, even before the aesthetics, and ask about group work itself.

What is the point of group work? This varies by discipline but, usually, we take a task that is too large or complex for one person to achieve in the time allowed and that mimics (or is) a task you’d expect graduates to perform. This task is then attacked through some sort of decomposition into smaller pieces, many of which are dependent in a strict order, and these are assigned to group members. By doing this, we usually claim to be providing an authentic workplace or task-focused assignment.

The problem that arises, for me, is when we try and work out how we measure the success of such a group activity. Being able to function in a group has a lot of related theory (psychological, behavioural, and sociological, at least) but we often don’t teach that. We take a discipline task that we believe can be decomposed effectively and we then expect students to carve it up. Now the actual group dynamics will feature in the assessment but we often measure the outputs associated with the task to determine how effective group formation and management was. However, the discipline task has a skill and knowledge dimension, while the group activity elements have a competency focus. What’s more problematic is that unsuccessful group work can overshadow task achievement and lead to a discounting of skill and knowledge success, through mechanisms that are associated but not necessarily correlated.

Going back to competency-based assessment, we assess competency by carrying out direct observation, indirect measures and through professional reports and references. Our group members’ reports on us (and our reports on them) function in the latter area and are useful sources of feedback, identifying group and individual perceptions as well as work progress. But are these inherently markable? We spend a lot of time trying to balance peer feedback, minimise bullying, minimise over-claiming, and get a realistic view of the group through such mechanisms but adding marks to a task does not make it more cognitively beneficial. We know that.

For me, the problem with most group work assessment is that we are looking at the output of the task and competency based artefacts associated with the group and jamming them together as if they mean something.

Much as I argue against late penalties changing the grade you received, which formed a temporal market for knowledge, I’m going to argue against trying to assess group work through marking a final product and then dividing those grades based on reported contributions.

We are measuring different things. You cannot just add red to melon and divide it by four to get a number and, yet, we are combining different areas, with different intentions, and dragging it into one grade that is more likely to foster resentment and negative association with the task. I know that people are making this work, at least to an extent, and that a lot of great work is being done to address this but I wonder if we can channel all of the energy spent in making it work into getting more amazing things done?

Just about every student I’ve spoken to hates group work. Let’s talk about how we can fix that.


Streamlining for meaning.

In yesterday’s musings on Grade Point Average, GPA, I said:

But [GPA calculation adjustment] doesn’t have to be a method of avoidance, this can be a useful focusing device. If a student did really well in, say, Software Engineering but struggled with an earlier, unrelated, stream, why can’t we construct a GPA for Software Engineering that clearly states the area of relevance and degree of information? Isn’t that actually what employers and people interested in SE want to know?

This hits at the heart of my concerns over any kind of summary calculation that obscures the process. Who does this benefit? What use is it to anyone? What does it mean? Let’s look at one of the most obvious consumers of student GPAs: the employers and industry.

Feedback from the Australian industry tells us that employers are generally happy with the technical skills that we’re providing but it’s the softer skills (interpersonal skills, leadership, management abilities) that they would like to see more of and know more about. A general GPA doesn’t tell you this but a Software Engineering focused GPA (as I mentioned above) would show you how a student performed in courses where we would expect to see these skills introduced and exercised.

Putting everything into one transcript gives people the power to assemble this themselves, yes, but this requires the assembler to know what everything means. Most employers have neither the time nor inclination to do this for all 39 or so institutions in Australia. But if a University were to say “this is a summary of performance in these graduate attributes”, where the GAs are regularly focused on the softer skills, then we start to make something more meaningful out of an arbitrary number.

But let’s go further. If we can see individual assessments, rather than coarse subject grades, we can start to construct a model of an individual across the different challenges that they have faced and overcome. Portfolios are, of course, a great way to do this but they’re more work to read than single measures and, too often, such a portfolio is weighed against simpler, apparently meaningful measures such as high GPAs and found wanting. Portfolios also struggle if placed into a context of previous failure, even if recent activity clearly demonstrates that a student has moved on from that troubled or difficult time.

I have a deep ethical and philosophical objection to curve grading, as you probably know. The reason is simple: the actions of one student should not negatively affect the outcomes of another. This same objection is my biggest problem with GPA, although in this case the action and outcomes belong to the same student at different points in her or his life. Rather than using performance in one course to determine access to the learning upon which it depends, we make these grades a permanent effect and every grade that comes afterwards is implicitly mediated through this action.

Dead-Man's_Curve_in_Lebec,_California,_2010

Sometimes you should be cautious regarding adding curves to address your problems.

Should Past Academic Nick have an inescapable impact on Now and Future Academic Nick’s life? When we look at all of the external influences on success, which make it clear how much totally non-academic things matter, it gets harder and harder to say “Yes, Past Academic Nick is inescapable.” Unfairness is rarely aesthetically pleasing.

An excellent comment on the previous post raised the issue of comparing GPAs in an environment where the higher GPA included some fails but the slightly lower GPA student had always passed. Which was the ‘best’ student from an award perspective? Student A fails three courses at the start of his degree, student B fails three courses at the end. Both pass with the same GPA, time to completion, and number of passes and fails. Is there even a sense of ‘better student’ here? B’s struggles are more immediate and, implicitly, concerns would be raised that these problems could still be active. A has, apparently, moved on in some way. But we’d never know this from simplistic calculations.

If we’re struggling to define ‘best’ and we’re not actually providing something that many people feel is useful, while burdening students with an inescapable past, then the least we can do is to sit down with the people who are affected by this and ask them what they really want.

And then, when they tell us, we do something about changing our systems.


Total control: a user model for student results

Yesterday, I wrote:

We need assessment systems that work for the student first and everyone else second.


Grades are the fossils of evaluation

Assessment supports evaluation, criticism and ranking (Wolff). That’s what it does and, in many cases, that also constitutes a lot of why we do it. But who are we doing it for?

I’ve reflected on the dual nature of evaluation, showing a student her or his level of progress and mastery while also telling us how well the learning environment is working. In my argument to reduce numerical grades to something meaningful, I’ve asked what the actual requirement is for our students, how we measure mastery and how we can build systems to provide this.

But who are the student’s grades actually for?

In terms of ranking, grades allow people who are not the student to place the students in some order. By doing this, we can award awards to students who are in the awarding an award band (repeated word use deliberate). We can restrict our job interviews to students who are summa cum laude or valedictorian or Dean’s Merit Award Winner. Certain groups of students, not all, like to define their progress through comparison so there is a degree of self-ranking but, for the most part, ranking is something that happens to students.

Criticism, in terms of providing constructive, timely feedback to assist the student, is weakly linked to any grading system. Giving someone a Fail grade isn’t a critique as it contains no clear identification of the problems. The clear identification of problems may not constitute a fail. Often these correlate but it’s weak. A student’s grades are not going to provide useful critique to the student by themselves. These grades are to allow us to work out if the student has met our assessment mechanisms to a point where they can count this course as a pre-requisite or can be awarded a degree. (Award!)

Evaluation is, as noted, useful to us and the student but a grade by itself does not contain enough record of process to be useful in evaluating how mastery goals were met and how the learning environment succeeded or failed. Competency, when applied systematically, does have a well-defined meaning. A passing grade does not although there is an implied competency and there is a loose correlation with achievement.

Grades allow us to look at all of a student’s work as if this one impression is a reflection of the student’s involvement, engagement, study, mistakes, triumphs, hopes and dreams. They are additions to a record from which we attempt to reconstruct a living, whole being.

Grades are the fossils of evaluation.

Grades provide a mechanism for us, in a proxy role as academic archaeologist, to classify students into different groups, in an attempt to project colour into grey stone, to try and understand the ecosystem that such a creature would live in, and to identify how successful this species was.

As someone who has been a student several times in my life, I’m aware that I have a fossil record that is not traditional for an academic. I was lucky to be able to place a new imprint in the record, to obscure my history as a much less successful species, and could then build upon it until I became an ACADEMIC TYRANNOSAURUS.

Skull of a Tyrannosaurus Rex at Palais de la Decouverte

LIFE LONG LEARNING, ROAARRRR!

But I’m lucky. I’m privileged. I had a level of schooling and parental influence that provided me with an excellent vocabulary and high social mobility. I live in a safe city. I have a supportive partner. And, more importantly, at a crucial moment in my life, someone who knew me told me about an opportunity that I was able to pursue despite the grades that I had set in stone. A chance came my way that I never would have thought of because I had internalised my grades as my worth.

Let’s look at the fossil record of Nick.

My original GPA fossil, encompassing everything that went wrong and right in my first degree, was 2.9. On a scale of 7, which is how we measure it, that’s well below a pass average. I’m sharing that because I want you to put that fact together with what happened next. Four years later, I started a Masters program that I finished with a GPA of 6.4. A few years after the masters, I decided to go and study wine making. That degree was 6.43. Then I received a PhD, with commendation, that is equivalent to GPA 7. (We don’t actually use GPA in research degrees. Hmmm.) If my grade record alone lobbed onto your desk you would see the desiccated and dead snapshot of how I (failed to) engage with the University system. A lot of that is on me but, amazingly, it appears that much better things were possible. That original grade record stopped me from getting interviews. Stopped me from getting jobs. When I was finally able to demonstrate the skills that I had, which weren’t bad, I was able to get work. Then I had the opportunity to rewrite my historical record.

Yes, this is personal for me. But it’s not about me because I wasn’t trapped by this. I was lucky as well as privileged. I can’t emphasise that enough. The fact that you are reading this is due to luck. That’s not a good enough mechanism.

Too many students don’t have this opportunity. That impression in the wet mud of their school life will harden into a stone straitjacket from which they may never escape. The way we measure and record grades has far too much potential to work against students and the correlation with actual ability is there but it’s not strong and it’s not always reliable.

The student you are about to send out with a GPA of 2.9 may be competent and they are, most definitely, more than that number.

The recording of grades is a high-loss storage record of the student’s learning and pathway to mastery. It allows us to conceal achievement and failure alike in the accumulation of mathematical aggregates that proxy for competence but correlate weakly.

We need assessment systems that work for the student first and everyone else second.


What do we want? Passing average or competency always?

I’m at the Australasian Computer Science Week at the moment and I’m dividing my time between attending amazing talks, asking difficult questions, catching up with friends and colleagues and doing my own usual work in the cracks. I’ve talked to a lot of people about my ideas on assessment (and beauty) and, as always, the responses have been thoughtful, challenging and helpful.

I think I know what the basis of my problem with assessment is, taking into account all of the roles that it can take. In an earlier post, I discussed Wolff’s classification of assessment tasks into criticism, evaluation and ranking. I’ve also made earlier (grumpy) notes about ranking systems and their arbitrary nature. One of the interesting talks I attended yesterday talked about the fragility and questionable accuracy of post-University exit surveys, which are used extensively in formal and informal rankings of Universities, yet don’t actually seem to meet many of the statistical or sensible guidelines for efficacy we already have.

But let’s put aside ranking for a moment and return to criticism and evaluation. I’ve already argued (successfully I hope) for a separation of feedback and grades from the criticism perspective. While they are often tied to each other, they can be separated and the feedback can still be useful. Now let’s focus on evaluation.

Remind me why we’re evaluating our students? Well, we’re looking to see if they can perform the task, apply the skill or knowledge, and reach some defined standard. So we’re evaluating our students to guide their learning. We’re also evaluating our students to indirectly measure the efficacy of our learning environment and us as educators. (Otherwise, why is it that there are ‘triggers’ in grading patterns to bring more scrutiny on a course if everyone fails?) We’re also, often accidentally, carrying out an assessment of the innate success of each class and socio-economic grouping present in our class, among other things, but let’s drill down to evaluating the student and evaluating the learning environment. Time for another thought experiment.

Thought Experiment 2

There are twenty tasks aligned with a particular learning outcome. It’s an important task and we evaluate it in different ways but the core knowledge or skill is the same. Each of these tasks can receive a ‘grade’ of 0, 0.5 or 1. 0 means unsuccessful, 0.5 is acceptable, 1 is excellent. Student A attempts all tasks and is acceptable in 19, unsuccessful in 1. Student B attempts the first 10 tasks, receives excellent in all of them and stops. Student C sets up a pattern of excellent, unsuccessful, excellent, unsuccessful... and so on to receive 10 “Excellent”s and 10 “unsuccessful”s. When we form an aggregate grade, A receives 47.5%, B receives 50% and C also receives 50%. Which of these students is the most likely to successfully complete the task?

This framing allows us to look at the evaluation of the student in a meaningful way. “Who will pass the course?” is not the question we should be asking; it’s “Who will be able to reliably demonstrate mastery of the skills or knowledge that we are imparting?” Passing the course has a naturally discrete attention focus: focus on n assignments and m exams and pass. Continual demonstration of mastery is a different goal. This framing also allows us to examine the learning environment because, without looking at the design, I can’t tell you if B and C’s behaviour is problematic or not.

CompFail

A has undertaken the most tasks to an acceptable level but an artefact of grading (or bad luck) has dropped the mark below 50%, which would be a fail (aggregate less than acceptable) in many systems. B has performed excellently on every task attempted but, being aware of the marking scheme, optimising and strategic behaviour allows this student to walk away. (Many students who perform at this level wouldn’t, I’m aware, but we’re looking at the implications of this.) C has a troublesome pattern that provides the same outcome as B but with half the success rate.
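The aggregates in this thought experiment can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the names `SCORE`, `TOTAL_TASKS` and `aggregate` are illustrative, and it assumes that unattempted tasks count as 0 towards the twenty-task total, which is what the 50% figure for B implies.

```python
# Score per task outcome, as defined in the thought experiment.
SCORE = {"unsuccessful": 0.0, "acceptable": 0.5, "excellent": 1.0}
TOTAL_TASKS = 20

def aggregate(results):
    """Percentage over all twenty tasks; unattempted tasks contribute 0 (assumption)."""
    return 100 * sum(SCORE[r] for r in results) / TOTAL_TASKS

student_a = ["acceptable"] * 19 + ["unsuccessful"]  # order does not change the aggregate
student_b = ["excellent"] * 10                      # stops after ten tasks
student_c = ["excellent", "unsuccessful"] * 10      # alternating pattern

for name, results in [("A", student_a), ("B", student_b), ("C", student_c)]:
    print(name, aggregate(results))
```

Three very different patterns of work collapse into almost the same number, and the ordering of the three numbers says nothing about which pattern produced them.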

Before we answer the original question (which is most likely to succeed), I can nominate C as the most likely to struggle because C has the most “unsuccessful”s. From a simple probabilistic argument, 10/20 success is worse than 19/20. It’s a bit trickier comparing 10/10 and 10/20 (because of confidence intervals) but 10/20 has an Adjusted Wald range of +/- 20% and 10/10 is -14%, so the highest possible ‘real’ measure for C is 14/20 and the lowest possible ‘real’ measure for B is (scaled) 15/20, so they don’t overlap and we can say that B appears to be more successful than C as well.
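For anyone who wants to check the interval arithmetic, here is a minimal sketch of an adjusted Wald calculation. It assumes the common form that adds z²/2 successes and z² trials before applying the usual Wald formula. The function name `adjusted_wald` and the choice of z values are assumptions made for illustration, not a record of which calculator or settings produced the figures quoted above.

```python
import math

def adjusted_wald(successes, trials, z=1.96):
    """Return (low, high) for a success rate, clipped to the range [0, 1].

    z = 1.96 corresponds to a two-sided 95% level (assumed default).
    """
    n_adj = trials + z * z                    # add z^2 pseudo-trials
    p_adj = (successes + z * z / 2) / n_adj   # add z^2 / 2 pseudo-successes
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

c_low, c_high = adjusted_wald(10, 20)             # Student C: 10 successes in 20
b_low, b_high = adjusted_wald(10, 10, z=1.645)    # Student B: 10 in 10, one-sided 95% (assumed)

print(f"C: {c_low:.1%} to {c_high:.1%}")
print(f"B: {b_low:.1%} to {b_high:.1%}")
```

With these settings, C comes out at roughly 30% to 70% and B has a lower bound just above 75%. The bound for B is sensitive to the choice of z: passing z=1.96 for B as well gives a lower bound under 70%, so the size of the gap between B and C depends on that assumption.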

From a learning design perspective, do our evaluation artefacts have an implicit design that explains C’s pattern? Is there a difference we’re not seeing? Taking apart any ranking of likeliness to pass our evaluatory framework, C’s pattern is so unusual (high success/lack of any progress) that we learn something immediately from the pattern, whether it’s that C is struggling or that we need to review mechanisms we thought to be equivalent!

But who is more likely to succeed out of A and B? 19/20 and 10/10 are barely distinguishable in statistical terms! The question for us now is how many evaluations of a given skill or knowledge mastery are required for us to be confident of competence. This totally breaks the discrete cramming for exams and focus on assignment model because all of our science is built on the notion that evidence is accumulated through observation and the analysis of what occurred, in order to be able to construct models to predict future behaviour. In this case, our goal is to see if our students are competent.

I can never be 100% sure that my students will be able to perform a task but what is the level I’m happy with? How many times do I have to evaluate them at a skill so that I can say that x successes in y attempts constitutes a reliable outcome?

If we say that a student has to reliably succeed 90% of the time, we face the problem that just testing them ten times isn’t enough for us to be sure that they’re hitting 90%.
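To see why ten tests fall short, here is a minimal sketch of how often a weaker student would still look competent by chance. The 80% true success rate, the assumption that attempts are independent, and the function name `prob_at_least` are all mine, chosen for illustration rather than taken from the scenario above.

```python
from math import comb


def prob_at_least(successes: int, attempts: int, true_rate: float) -> float:
    """Binomial tail: chance of scoring `successes` or more in `attempts`
    independent tries, each succeeding with probability `true_rate`."""
    return sum(
        comb(attempts, k) * true_rate**k * (1 - true_rate) ** (attempts - k)
        for k in range(successes, attempts + 1)
    )


# Hypothetical student whose real success rate is 80%, below the 90% target.
print(f"{prob_at_least(9, 10, 0.80):.2f}")   # 0.38
print(f"{prob_at_least(10, 10, 0.80):.2f}")  # 0.11
```

Under those assumptions, a student who is really at 80% would score 9/10 or better more than a third of the time, so a single run of ten tests cannot separate them from a student who is genuinely at 90%.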

But the level of performance we need to be confident is quite daunting. By looking at some statistics, we can see that if we provide a student with 150 opportunities to demonstrate knowledge and they succeed at this 143 times, then it is very likely that their real success level is at least 90%.
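As a rough check on that figure, here is a sketch of the Adjusted Wald interval mentioned earlier. I am assuming a 95% confidence level, which is my choice and not something stated above; the function name `adjusted_wald` is also mine. The bounds move noticeably if you pick a different level, so treat this as an illustration rather than the definitive calculation.

```python
from math import sqrt
from statistics import NormalDist


def adjusted_wald(successes: int, attempts: int, confidence: float = 0.95):
    """Adjusted Wald (Agresti-Coull) interval for a success proportion.

    Adds z^2/2 pseudo-successes and z^2 pseudo-attempts before applying
    the usual Wald formula, then clips the result to [0, 1].
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    n_adj = attempts + z * z
    p_adj = (successes + z * z / 2) / n_adj
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


low, high = adjusted_wald(143, 150)  # assumed 95% confidence
print(f"lower bound: {low:.3f}")     # just above 0.90
```

With these assumptions, 143 successes in 150 attempts is about the point where the lower bound clears 90%, which is why the required number of observations looks so daunting.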

If we say that competency is measured by a success rate that is greater than 75%, a student who achieves 10/10 has immediately met that but even succeeding at 9/9 doesn’t meet that level.

What this tells us (and reminds us) is that our learning environment design is incredibly important and it must start from a clear articulation of what success actually means, what our goals are and how we will know when our students have reached that point.

There is a grade separation between A and B but it’s artificial. I noted that it was hard to distinguish A and B statistically but there is one important difference in the lower bound of their confidence intervals: A’s is less than 75%, B’s is slightly above.

Now we have to deal with the fact that A and B were both competent (if not the same) for the first ten tests and A was actually more competent than B until the 20th failed test. This has enormous implications for how we structure evaluation, how many successful repetitions define success and how many ‘failures’ we can tolerate and still say that A and B are competent.

Confused? I hope not but I hope that this is making you think about evaluation in ways that you may not have before.

 


Too big for a term? Why terms?

I’ve reached the conclusion that a lot of courses have an unrealistically high number of evaluations. We have too many and we pretend that we are going to achieve outcomes for which we have no supporting evidence. Worse, in many cases, we are painfully aware that we cause last-minute lemming-like effects that do anything other than encourage learning. But why do we have so many? Because we’re trying to fit them into the term or semester size that we have: the administrative limit.

One of the big challenges for authenticity in Computer Science is the nature of the software project. While individual programs can be small and easy to write, a lot of contemporary programming projects are:

  1. Large and composed of many small programs.
  2. Complex to a scale that may exceed one person’s ability to visualise.
  3. Long-lived.
  4. Multi-owner.
  5. Built on platforms that provide core services; the programmers do not have the luxury to write all of the code in the system.

Many final year courses in Software Engineering have a large project course, where students are forced to work with a (usually randomly assigned) group to produce a ‘large’ piece of software. In reality, this piece of software is very well-defined and can be constructed in the time available: it has been deliberately selected to be so.

Is a two month software task in a group of six people indicative of real software?

June 16: Remember to curse teammate for late delivery on June 15.

Yes and no. It does give a student experience in group management, except that they still have the safe framework of lecturers over the top. It’s more challenging than a lot of what we do because it is a larger artefact over a longer time.

But it’s not that realistic. Industry software projects live over years, with tens to hundreds of programmers ‘contributing’ updates and fixes… reversing changes… writing documentation… correcting documentation. This isn’t to say that the role of a university is to teach industry skills but these skill sets are very handy for helping programmers to take their code and make it work, so it’s good to encourage them.

I believe finally, that education must be conceived as a continuing reconstruction of experience; that the process and the goal of education are one and the same thing.

from John Dewey, “My Pedagogic Creed”, School Journal vol. 54 (January 1897)

I love the term ‘continuing reconstruction of experience’ as it drives authenticity as one of the aesthetic characteristics of good education.

Authentic, appropriate and effective learning and evaluation activities may not fit comfortably into a term. We already accept this for activities such as medical internship, where students must undertake 47 weeks of work to attain full registration. But we are, for many degrees, trapped by the convention of a semester of so many weeks, which is then connected with other semesters to make a degree that is somewhere between three and five years long.

The semester is an artefact of the artificial decomposition of the year, previously related to season in many places but now taking on a life of its own as an administrative mechanism. Jamming things into this space is not going to lead to an authentic experience and we can now reject this on aesthetic grounds. It might fit but it’s not beautiful or true.

But wait! We can’t do that! We have to fit everything into neat degree packages or our students won’t complete on time!

Really?

Let’s now look at the ‘so many years degree’. This is a fascinating read and I’ll summarise the reported results for degree programs in the US, which don’t include private colleges and universities:

  • Fewer than 10% of reporting institutions graduated a majority of students on time.
  • Only 19% of students at public universities graduate on time.
  • Only 36% of students at state flagship universities graduate on time.
  • 5% of community college students complete an associate degree on time.

The report has a simple name for this: the four-year myth. Students are taking longer to do their degrees for a number of reasons but among them are poorly designed, delivered, administered or assessed learning experiences. And jamming things into semester blocks doesn’t seem to be magically translating into on-time completions (unsurprisingly).

It appears that the way we break up software into little pieces is artificial and we’re also often trying to carry out too many little assessments. It looks like a good model is to stretch our timeline out over more than one course to produce an experience that is genuinely engaging, more authentic and more supportive of long term collaboration. That way, our capstone course could be a natural end-point to a three year process… or however long it takes to get there.

Finally, in the middle of all of this, we need to think very carefully about why we keep using the semester or the term as a container. Why are degrees still three to four years long when everything else in the world has changed so much in the last twenty years?


Confessions of a CLI guy

There was a time before graphics dominated the way that you worked with computers and, back then, after punchcards and before Mac/Windows, the most common way of working with a computer was to use the Command Line Interface (CLI). Many of you will have seen this; here’s Terminal from Mac OS X, showing a piece of Python code inside an editor.

[Screenshot: Terminal on Mac OS X, showing Python code inside an editor]

Rather than use a rich Integrated Development Environment, where text is highlighted and all sorts of clever things are done for me, I would run some sort of program editor from the command line, write my code, close that editor and then see what worked.

At my University, we almost always taught Computer Science using command line tools, rather than rich development environments such as Eclipse or the Visual Studio tools. Why? The reasoning was that the CLI developed skills required to write code, compile it, debug it and run it, without training students into IDE-provided shortcuts. The CLI was the approach that would work anywhere. That knowledge was, as we saw it, fundamental.

But, remember that Processing example? We clearly saw where the error was. This is what a similar error looks like for the Java programming language in a CLI environment.

[Screenshot: the error message for a Java program at the command line]

Same message (and now usefully on the right line because 21st Century) but it is totally divorced from the program itself. That message has to give me a line number (5) in the original program because it has no other way to point to the problem.

And here’s the problem. The cognitive load increases once we separate code and errors. Despite those Processing errors looking like the soft option, everything we know about load tells us that students will find fixing their problems easier if they don’t have to mentally or physically switch between code and error output.

Everything I said about CLIs is still true but that’s only a real consideration if my students go out into the workplace and need some CLI skills. And, today, just about every workplace has graphics based IDEs for production software. (Networking is often an exception but we’ll skip over that. Networking is special.)

The best approach for students learning to code is that we don’t make things any harder than we need to. The CLI approach is something I would like students to be able to do but my first task is to get them interested in programming. Then I have to make their first experiences authentic and effective, and hopefully pleasant and rewarding.

I have thought about this for years and I started out as staunchly CLI. But as time goes by, I really have to wonder whether a tiny advantage for a small number of graduates is worth additional load for every new programmer.

And I don’t think it is worth it. It’s not fair. It’s the opposite of equitable. And it goes against the research that we have on cognitive load and student workflows in these kinds of systems. We already know of enough load problems in graphics based environments if we make the screens large enough, without any flicking from one application to another!

You don’t have to accept my evaluation model to see this because it’s a matter of common sense that forcing someone to unnecessarily switch tasks to learn a new skill is going to make it harder. Asking someone to remember something complicated in order to use it later is not as easy as someone being able to see it when and where they need to use it.

The world has changed. CLIs still exist but graphical user interfaces (GUIs) now rule. Any of my students who needs to be a crack programmer in a text window of 80×24 will manage it, even if I teach with all IDEs for the entire degree, because all of the IDEs are made up of small windows. Students can either debug and read error messages or they can’t – a good IDE helps you but it doesn’t write or fix the code for you, in any deep way. It just helps you to write code faster, without having to wait and switch context to find silly mistakes that you could have fixed in a split second in an IDE.

When it comes to teaching programming, I’m not a CLI guy anymore.


Is an IDE an E3? Maybe an E2?

Earlier, I split the evaluation resources of a course into:

  • E1 (the lecturer and course designer),
  • E2 (human work that can be based on rubrics, including peer assessment and casual markers),
  • E3 (complicated automated evaluation mechanisms),
  • E4 (simple automated evaluation mechanisms, often for acceptance testing).

E1 and E2 everyone tends to understand, because the culture of Prof+TA is widespread, as is the concept of peer assessment. In a Computing Course, we can define E3 as complex marking scripts that perform amazing actions in response to input (or even carry out formal analysis if we’re being really keen), with E4 as simple file checks, program compilation and dumb scripts that jam in a set of data and see what comes out.
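To make the E4 end of that scale concrete, here is a minimal sketch of a ‘does it even compile?’ check. It uses Python’s built-in `compile()` purely as a stand-in for whatever language the course teaches; the function name `e4_syntax_check` and the filename `submission.py` are hypothetical, not part of any real marking system.

```python
def e4_syntax_check(source: str, filename: str = "submission.py") -> str:
    """Return 'ok' if the source compiles, otherwise a one-line message
    that names the offending line. The code is never executed."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:
        return f"{filename}: line {err.lineno}: {err.msg}"
    return "ok"


good = "x = 1\nprint(x)\n"
bad = "x = 1\ny = = 2\n"  # deliberate mistake on line 2

print(e4_syntax_check(good))  # ok
print(e4_syntax_check(bad))   # message pointing at line 2
```

A check like this is about as dumb as a script can be, which is the point: it is cheap, immediate and needs no human time, but all it can say is which line stopped it.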

But let’s get back to my first year, first exposure, programming class. What I want is hands-on, concrete, active participation and constructive activity and lots of it. To support that, I want the best and most immediate feedback I can provide. Now I can try to fill a room with tutors, or do a lot of peer work, but there will come times when I want to provide some sort of automated feedback.

Given how inexperienced these students are, it could be quite a lot to expect them to get their code together and then submit it to a separate evaluation system, then interpret the results. (Remember I noted earlier on how code tracing correlates with code ability.)

Thus, the best way to get that automated feedback is probably working with the student in place. And that brings us to the Integrated Development Environment (IDE). An IDE is an application that provides facilities to computer programmers and helps them to develop software. They can be very complicated and rich (Eclipse), simple (Processing) or aimed at pedagogical support (Scratch, BlueJ, Greenfoot et al) but they are usually made up of a place in which you can assemble code (typing or dragging) and a set of buttons or tools to make things happen. These are usually quite abstract for early programmers, built on notional machines rather than requiring a detailed knowledge of hardware.

The Processing IDE. Type in one box. Hit play. Rectangle appears.

Even simple IDEs will tell you things that provide immediate feedback. We know that these environments can be positively received, with some demonstrated benefits, although I recommend reading Sorva et al’s “A Review of Generic Program Visualization Systems for Introductory Programming Education” to see the open research questions. In particular, people employing IDEs in teaching often worry about the time to teach the environment (as well as the language), software visualisations, time on task, lack of integration and the often short lifespan of many of the simpler IDEs that are focused on pedagogical outcomes. Even for well-established systems such as BlueJ, there’s always concern over whether the investment of time in learning it is going to pay off.

In academia, time is our currency.

But let me make an aesthetic argument for IDEs, based on the feedback that I’ve already put into my beautiful model. We want to maximise feedback in a useful way for early programmers. Early programmers are still learning the language, still learning how to spell words, how to punctuate, and are building up to a grammatical understanding. An IDE can provide immediate feedback as to what the computer ‘thinks’ is going on with the program and this can help the junior programmer make immediate changes. (Some IDEs have graphical representations for object systems but we won’t discuss these any further here as the time to introduce objects is a subject of debate.)

Now there’s a lot of discussion over the readability of computer error messages but let me show you an example. What’s gone wrong in this program?

 

[Screenshot: a Processing program with a red squiggle at the end of its first line]

See where that little red line is, just on the end of the first line? Down the bottom there’s a message that says “missing a semicolon”. In the Processing language, almost all lines end with a “;” so that section of code should read:

size(200,200);
rect(0,10,100,100);

Did you get that? That missing semicolon problem has been an issue for years because many systems report the semicolon missing on the next line, due to the way that compilers work. Here, Processing is clearly saying: Oi! Put a semicolon on the red squiggle.

I’m an old programmer, who currently programs in Java, C++ and Processing, so typing “;” at the end of a line is second nature to me. But it’s an easy mistake for a new programmer to make because, between all of the ( and the ) and the , and the numbers and the size and the rect… what do I do with the “;”?

The Processing IDE is functioning in at least an E4 mode: simple acceptance testing that won’t let anything happen until you fix that particular problem. It’s even giving you feedback as to what’s wrong. Now this isn’t to say that it’s great but it’s certainly better than a student sitting there with her hand up for 20 minutes waiting for a tutor to have the time to come over and say “Oh, you’re missing a semicolon.”

We don’t want shotgun coding, where random fixes and bashed-in attempts are made desperately to solve a problem. We want students to get used to getting feedback on how they’re going and using this to improve what they do.

Because of Processing’s highly visual mode, I think it’s closer to E3 (complex scripting) in many ways because it can tell you if it doesn’t understand what you’re trying to do at all. Beyond just not doing something, it can clearly tell you what’s wrong.

But what if it works and then the student puts something up on the screen, a graphic of some sort and it’s not quite right? Then the student has started to become their own E2, evaluating what has happened in response to the code and using human insight to address the shortfall and make changes. Not as an expert but, with support and encouragement, a developing expertise.

Feedback is good. Immediacy is good. Student involvement is good. Code tracing is linked to coding ability. A well-designed IDE can be simple and engage the student to an extent that is potentially as high as E2, although it won’t be as rich, without using any other human evaluation resources. Even if there is no other benefit, the aesthetic argument is giving us a very strong nudge to adopt an appropriate IDE.

Maybe it’s time to hang up the command line and live in a world where IDEs can help us to get things done faster, support our students better and make our formal human evaluation resources go further.

What do you think?