SIGCSE Day 2, Assessment and Evaluation Session, Friday 10:45-12:00, (#SIGCSE2014)

The session opened with a talk on the “Importance of Early Performance in CS1: Two Conflicting Assessment Stories” by Leo Porter and Daniel Zingaro. Regular readers will know that I published a paper in ICER 2012 on the impact of early assignment submission behaviour on later activity, so I was looking forward to seeing what the conflict was. This was apparently supposed to be a single story but, like much research, it turned out that there were two different stories.

In early-term performance, do you notice students falling into a small set of performance groups? Does it feel as though you can predict the results? (Shout out to Ahadi and Lister’s “Geek genes, prior knowledge, stumbling points and learning edge momentum: parts of the one elephant?” from ICER 2013!) Is there a truly bimodal distribution of ability? The results don’t match a neat bell curve. (I’m sure a number of readers will want to wait and see where this goes.)

Why? Well, the Geek Gene theory is that there is an innate and immutable talent that you either have or you don’t. The authors don’t agree with this and, as it turns out, neither does the research. The next possibility is a stumbling block, where you misunderstand something critical. The final possibility is learning edge momentum (LEM), where you build knowledge incrementally and early mistakes cascade.

In evaluating these theories, the current approach is to look at a limited number of assessments, but it’s hard to know what happened in between. We need more data! Leo uses Peer Instruction (PI) a lot, so has a lot of clicker question data to draw on. (Leo gave a quick background on PI but you can look that up. 🙂 ) The authors have some studies looking at the correlation between the individual vote and the group vote.

The study was run over a CS1 course in Python with 126 students, with 34 PI sessions over 12 weeks and 8 prac lab sessions. The instructor was experienced in PI and with the material. Components for analysis included the standard assessments (midterm and final), in-class PI for the last two weeks, and the PI results per student, averaged bi-weekly to reduce noise because students might be absent and are graded on participation.

(I was slightly surprised to see that more than 20% of the students had scored 100% on the midterm!) The final was harder, but it was hard to see the modalities in the histograms. Comparing this with the PI results from the last two weeks of the course, that isn’t bi-modal either and looks very different. The next step was to use the weekly assessments to see how they predicted performance in the last two weeks, and that requires a correlation. The Geek Gene should show a strong correlation early and then no change. A stumbling block should show a strong correlation somewhat early and then no change. Lastly, for LEM, a strong correlation somewhat early, then no change – again. These are not really that easy to distinguish.
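
To make the comparison concrete, here’s a minimal sketch of the kind of analysis described: average each student’s clicker scores over two-week blocks, then correlate each block with the final exam mark. This is my own illustration in Python with random placeholder data, not the authors’ code or dataset.

    # Sketch of the described correlation analysis, assuming per-student weekly
    # PI (clicker) scores and final exam marks. Placeholder data only.
    import numpy as np
    from scipy import stats

    n_students, n_weeks = 126, 12
    weekly_pi = np.random.rand(n_students, n_weeks)    # fraction correct per week
    final_exam = np.random.rand(n_students) * 100      # exam mark per student

    # Average weeks in pairs (1-2, 3-4, ...) to reduce noise from absences.
    biweekly = weekly_pi.reshape(n_students, n_weeks // 2, 2).mean(axis=2)

    # Correlate each two-week block of PI performance with the final exam.
    for block in range(n_weeks // 2):
        r, p = stats.pearsonr(biweekly[:, block], final_exam)
        print(f"Weeks {2 * block + 1}-{2 * block + 2}: r = {r:.2f} (p = {p:.3f})")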

The results were interesting. Weeks 1-2 don’t correlate much at all but, from weeks 3-4 onwards, the correlation is roughly 40% and it doesn’t get better. Looking at the final exam correlation with the Week 11/12 PI scores, the correlation is over 60% (growing steadily from weeks 3-4). Let’s look at the exam content (analyse the test) – where did the content fall? 54% of the questions target the first weeks and 46% target the latter half. Buuuuuuut, the later questions were more conceptually rich – and, looking more closely, this revealed a strong bias towards the first half of the class (87%), with only 13% on the later material. The early test indicators were valid because the exam is mostly testing the early section! The PI in Weeks 11 and 12 was actually 50/50 across the first and second halves, so no wonder that correlated!

Threats to validity? Well, the data was noisy and participation was variable. The PI questions are concept tests, focused on a single concept, and may not actually reflect writing code. There were different forms of assessment. The PI itself may actually change student performance, because students generally do better in PI courses. So what does all this mean?

Well, the final exam correlation supports stumbling block and LEM, but the Week 11 and 12 results are different! The final exam story isn’t ideal but the Week 11/12 improvements are promising. We’re addicted to this kind of assessment, and student performance early in term will predict assessment based on that material, but the PI data is more generally useful.

It’s interesting to know that there were no actual MCQs on the final exam.

The next talk was “Reinventing homework as a cooperative, formative assessment” by Don Blaheta. There are a couple of problems in teaching: the students need practice and the students need feedback. In reinventing homework, the big problems are that grading is a lot of work and matching comments to grades and rubrics is hard, with a delay for feedback; it’s not group work, and solitary work isn’t the best for all students; and a lot of the students don’t read the comments anyway. (My ears pricked up, as this is very similar to the work I was presenting on.)

There’s existing work on automation, off-the-shelf programming and testing systems, and online suites, all with immediate feedback. But some things just can’t be auto-graded and we have to come back to manual marking. Diagrams can’t be automarked.
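
As a rough illustration of what this kind of immediate feedback looks like, here’s a toy test-based checker – not any of the systems mentioned in the talk, and the function and test cases are invented for the example.

    # Toy auto-grading loop: run a submitted function against (input, expected)
    # pairs and report pass/fail immediately. Purely illustrative.
    def autograde(student_fn, cases):
        passed = 0
        for args, expected in cases:
            try:
                result = student_fn(*args)
                ok = (result == expected)
            except Exception as exc:
                result, ok = f"raised {exc!r}", False
            status = "ok" if ok else f"expected {expected!r}"
            print(f"{student_fn.__name__}{args} -> {result} ({status})")
            passed += ok
        print(f"{passed}/{len(cases)} tests passed")

    # Hypothetical student submission and tests.
    def sum_of_squares(n):
        return sum(i * i for i in range(1, n + 1))

    autograde(sum_of_squares, [((3,), 14), ((1,), 1), ((0,), 0)])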

To deal with this, the author tried “work together, write alone” but there is confusion about what is and what isn’t acceptable as collaboration – the lecturer ends up grading the same thing three times. What about revising previous work? It’s great for learning but students may not have budgeted any time for it, and some will be happy with a lower mark. There’s the issue of apathy, and it increases the workload.

How can we package these ideas together to get them to work better? We can make the homework group work; the next idea is a revision cycle where an early (ungraded) version is handed back with comments – with a limited-scale response of correct, substantial understanding, or little or no understanding. (Then homework is relatively low stakes.) Other mechanisms include comments with no grades; grades with no comments; or the limited scale. (Comments with no grades should make them look at the comments – with any luck.) Don’t forget that revision increases workload where everything else theoretically decreases it! Comments identify higher-order problems and marks are not handed back to students. The limited scale now reduces marking overhead and can mark improvement rather than absolutes. (And the author referred to my talk from yesterday, which startled me quite a lot, but it’s nice to see! Thanks, Don!)

It’s possible to manage the group, which is self-policing, very interestingly – the “free rider” problem rears its ugly head. Some groups did divide the task but moved to a full-group model after initially splitting up the work. Grades could swing and students might not respond positively.

In the outcomes, while the n is small, he doesn’t see a high homework mark correlated with a low exam average, which would be the expected indicator of the “free rider” or “plagiarist” effect. So, nothing significant but an indication that things are on the right track. Looking at class participation, students are working in different ways, but overall the effect is positive. (The students liked it but you know my thoughts on that. 🙂 ) Increased cooperation is a great outcome, as is making revisions to existing code.

The final talk was on “Evaluating an Inverted CS1”, presented by Jennifer Campbell from the University of Toronto. Their CS1 is a 12-week course with 3 lecture hours and a 2-hour lab per week, in Python with an objects-early, classes-late approach. Lecture size is 130-150 students, mostly first years with some higher years and some non-CS students. Typical lab sizes are 30 students with one TA.

The inverted classroom is also known as the flipped classroom: resources are made available and materials are completed before the students show up, and the face-to-face time is used for activities. Before the lecture, students watch videos from the two instructors, with screencasts and some embedded quizzes (about 15 questions), worth 0.5% per week. In class, the students work on paper exercises, solo or in pairs; the exercises were not handed in or for credit, and the room was staffed by the instructor plus 1 TA per 100 enrolled students. (There was an early indicator of possible poor attendance in class, because the ratio in reality was higher than that.) Most weeks the number of lecture hours was reduced from three to two.

For coursework, there were nine 2-hour labs, some lecture prep, some auto-graded programming assignments, two larger TA-graded programming assignments, one 50-minute midterm and a three-hour final exam.

How did it go? Pre- and post-course surveys were run on paper, covering demographics, interest in pursuing a CS program, interest in CS1, enthusiasm, difficulty, time spent and more. (Part of me thinks that these things are better tracked by looking at later enrolments in the course or degree transfers.) Weekly lecture attendance counts and enrolment were tracked, along with the standard university course evaluation.

There was a traditional environment available for comparison, from a previous offering, so they had collected all of that data. (If you’re going to make a change, establish a baseline first.) Sadly, the baselines were different for the different terms, so comparison wasn’t as easy.

The results? Across their population, 76% of students were not intending to pursue CS, 62% had no prior programming experience and 53% were women! I was slightly surprised that traditional lecture attendance was overall higher, with a much steeper decline early on. For students who completed the course, the average mark for prep work was 81%, so the students were preparing the material but were then not attending the lecture. Hmm. This came out again in the ‘helpfulness’ graphs, where the online materials outscored the in-lecture activities. But the traditional lecture still outscored both – which makes me think this is a hearts and minds problem, combined with some possible problems in the face-to-face activities. (Getting f2f right for flipped classes is hard and I sympathise entirely if this is a start-up issue.)

For those people who responded to both the pre- and post-surveys, enthusiasm increased, but the surveys were done on paper and we already know that there was a drop in attendance, so this has a bias; the online university surveys also backed this up, though. In terms of perceptions of difficulty and time, women found the course harder and more time consuming. What was more surprising is that prior programming experience did not correlate with difficulty or time spent.

Outcomes? The drop rate was comparable to past offerings, at 25% of students. The pass rate was comparable, at 86%, and there was comparable performance on the “standard” exam questions: no significant difference in performance on those three questions. The students who were still attending at the end wanted more of this type of course, not really surprisingly.

Lessons learned – there was a lot learnt! On the resources side, video preparation took ~600 hours and development of in-class exercises took ~130 hours. The extra TA support cost money and, despite trying to make the load easier, two lecture hours per week were too few. (They’ve now reverted to three hours, most weekly two-hour labs are replaced with online exercises and a TA drop-in help centre, which allows them to use the same TA resources as a traditional offering.) In terms of lecture delivery, the in-class exercises on paper were valuable test preparation. There was no review of the lecture material that had been pre-delivered (which is also our approach, by the way) so occasionally students had difficulty getting started. However, they do now start each lecture with a short worked example to prime the students on the material that they had seen before. (It’s really nice to see this because we’re doing almost exactly the same thing in our new Object Oriented Programming course!) They’ve now introduced a weekly online exercise to allow them to assess whether they should be coming to class, but lecture attendance is still lower than for the traditional course.

The take-away is that the initial resource cost is pretty big but you then get to re-use the materials on more than one occasion, a pretty common result. They’re on their third offering, having made ongoing changes. The study has been re-run on the second offering and a follow-up paper, Horton et al., “Comparing Outcomes in Inverted and Traditional CS1”, will appear at ITiCSE 2014.

They haven’t had the chance to ask the students why they’re not coming to the lectures but that would be very interesting to find out. A good talk to finish on!


Workshop report: ALTC Workshop “Assessing student learning against the Engineering Accreditation Competency Standards: A practical approach”

I was fortunate to be able to attend a 3-hour workshop today presented by Professor Wageeh Boles, Queensland University of Technology, and Professor Jeffrey (Jeff) Froyd, Texas A&M, on how we could assess student learning against the accreditation competency standards in Engineering. I’ve seen Wageeh present before in his capacity as an Australian Learning and Teaching Council (ALTC) National Teaching Fellow and greatly enjoyed it, so I was looking forward to today. (Note: the ALTC has been replaced with the Office for Learning and Teaching, OLT, but a number of schemes are still labelled under the old title. Fortunately, I speak acronym.)

Both Wageeh and Jeff spoke at length about why we were undertaking assessment and we started by looking at the big picture: University graduate capabilities and the Engineers Australia accreditation criteria. Like it or not, we live in a world where people expect our students to be able to achieve well-defined things and be able to demonstrate certain skills. To focus on the course, unit, teaching and learning objectives and assessment alone, without framing this in the national and University expectations, is to risk not producing the students that are expected or desired. Ultimately, if the high-level and local requirements aren’t linked then they should be, because otherwise we’re probably not pursuing the right objectives. (Is it too soon to mention pedagogical luck again?)

We then discussed three types of assessment:

  • Assessment FOR Learning: Which is for teachers and allows them to determine the next steps in advancing learning.
  • Assessment AS Learning: Which is for students and allows them to monitor and reflect upon their own progress (effectively formative).
  • Assessment OF Learning: Which is used to assess what the students have learned and is most often characterised as summative assessment.

But, after being asked about the formative/summative approach, this was recast into a decision-making framework. We carry out assessment of all kinds to allow people to make better decisions and the people, in this situation, are Educators and Students. When we see the results of summative assessment we, as teachers, can then ask “What decisions do we need to make for this class?” to improve the levels of knowledge demonstrated in the summative assessment. When the students see the results of formative assessment, we then have the question “What decisions do students need to make?” to improve their own understanding. The final aspect, Assessment FOR Learning, covers those areas of assessment that help both educators and students to make better decisions by making changes to the overall course in response to what we’re seeing.

This is a powerful concept as it identifies assessment in terms of responsible groups: this assessment involves one group, the other or both and this is why you need to think about the results. (As an aside, this is why I strongly subscribe to the idea that formative assessment should never have an extrinsic motivating aspect, like empty or easy submission marks, because it stops the student focussing on the feedback, which will help their decisions, and makes it look summative, which suddenly starts to look like the educator’s problem.)

One point that came out repeatedly was that our assessment methods should be varied. If your entire assessment is based on a single exam, of one type of question, at the end of the semester, then you really only have a single point of data. Anyone who has ever drawn a line on a graph knows that a single point tells you nothing about the shape of the line and, ultimately, the more points that you can plot accurately, the more you can work out what is actually happening. However, varying assessment methods doesn’t mean replicating or proxying the exam; it means providing different assessment types, varying questions, and changing assessment over time. (Yes, this was stressed: changing assessment from offering to offering is important and is as much a part of varying assessment as any other component.)

All delightful music to my ears, which was just as well as we all worked very hard, talking, discussing and sharing ideas throughout the groups. We had a range of people, mostly from within the Faculty, and, while it was a small group and full of the usual faces, we all worked well, had an open discussion and there were some first-timers who obviously learned a lot.

What I found great about this was that it was very strongly practical. We worked on our own courses, looked for points for improvement and I took away four points of improvement that I’m currently working on: a fantastic result for a three-hour investment. Our students don’t need to just have done assessment that makes it look like they know their stuff, they have to actually know their stuff and be confident with it. Job ready. Able to stand up and demonstrate their skills. Ready for reality.

As was discussed in the workshop, assessment of learning occurs when Lecturers:

  • Use evidence of student learning
  • to make judgements on student achievement
  • against goals and standards

And this identifies some of our key problems. We often gather all of the evidence, whether it’s final grades or Student Evaluations, at a point when the students have left, or are just about to leave, the course. How can we change the course for that student? We are always working one step in the past. Even if we do have the data, do we have the time and the knowledge to make the right judgement? If so, is it defensible, fair and meeting the standards that we should be meeting? We can’t apply standards from 20 years ago just because that’s what we’re used to. The future, in Australia, is death by educational acronyms (AQF, TEQSA, EA, ACS, OLT…) but these are the standards by which we are accredited and these are the yardsticks by which our students will be judged. If we want to change those then, sure, we can argue this at the Government level but, until then, they have to be taken into account, along with all of our discipline, faculty and University requirements.

I think that this will probably spill over into a second post but, in short, if you get the chance to see Wageeh and Jeff on the road with this workshop then, please, set aside the time to go and leave time for a chat afterwards. This is one of the most rewarding and useful activities that I’ve done this year – and I’ve had a very good year for thinking about CS Education.