ITiCSE 2014, Day 3, Session 7B, Peer Instruction, #ITiCSE2014 #ITiCSE

The first talk was “Peer Instruction: a Link to the Exam” presented by Daniel Zingaro from University of Toronto. Peer Instruction (PI) is an active learning pedagogy developed for physics and now heavily used in computing. Students complete a reading quiz prior to class and teachers use multiple-choice quizzes to assess knowledge. (You can look this one up in a number of places but I’ve discussed it here before a bit.) There’s a lot of research that shows gains between individual and group vote, with enduring improvements in student learning. (We can use isomorphic questions to reduce the likelihood of copying.) Both students and instructors value the learning.

PI appears to demonstrate improved learning outcomes on final exam grades, as well as perceived depth of learning. (Couple of studies here from Beth Simon et al, and Daniel himself, checking Beth’s results.) But what leads to this improved outcome? The peer discussion? The class-wide discussion? Both? If one part isn’t useful then we can adapt it to make it more useful to Computer Scientists. Daniel is going to use isomorphic questions to investigate the relationships between PI components and final exam grades.

The isomorphic questions test the same concept with different questions, where if they get the first one right, we hope that they get the second one right – and if people learn how to do one, then that knowledge flows on to the other. (The example given was of loop complexity in nested loops depending on different variables.)
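(To make that concrete, here’s a minimal sketch of my own – not the actual questions from the paper – of what an isomorphic pair on nested-loop complexity might look like in Python, where both questions test the same concept but the answer depends on different variables.)

# Hypothetical isomorphic pair (my illustration, not Daniel's questions):
# both ask "how many times does work() run?", testing the same concept
# of nested-loop complexity with a different surface form.

def q1_work_count(n, m):
    # Q1: the inner loop bound depends on m, so work() runs n * m times.
    count = 0
    for i in range(n):
        for j in range(m):
            count += 1  # work()
    return count

def q2_work_count(n, m):
    # Q2 (isomorphic): the inner loop bound now depends on n, so the
    # answer is n * n, even though m is still a parameter.
    count = 0
    for i in range(n):
        for j in range(n):
            count += 1  # work()
    return count

print(q1_work_count(3, 5))  # 15
print(q2_work_count(3, 5))  # 9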

Daniel has two question modes in this experiment, which are slightly different. Both modes include the PI components, but the location of the second, isomorphic question varies between the two approaches. In the Peer (P) mode, the isomorphic question comes directly after the group vote; in the second mode (Combined – C), the Q2 isomorphic question occurs directly after the instructor has had a chance to influence the class.

Are the questions really isomorphic and of the same difficulty? An external ranker was used to verify this and then the question pairs and mode were randomised. The difficulty of the questions was found to be statistically equivalent, based on the percentage of Q1 answers that were correct.
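(For the statistically curious: here’s a rough sketch, entirely mine and with invented counts, of one way you could check that Q1 difficulty is equivalent across the two modes – compare the correct/incorrect counts with a chi-squared test and look for a non-significant difference. The paper may well have done something different.)

# Sketch only: invented counts, not data from the paper.
from scipy.stats import chi2_contingency

# rows: Peer (P) mode, Combined (C) mode; columns: Q1 correct, Q1 incorrect
table = [[48, 22],
         [51, 19]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
# A large p-value is consistent with (though doesn't prove) the claim that
# question difficulty is equivalent across the two modes.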

Daniel had two hypotheses. Firstly, that Peer scores will correlate with final exam scores. Secondly, that Combined scores will also correlate with final exam scores, but the correlation should be stronger than for Peer, with the Combined questions representing learning from the full PI cycle. There were three measures of final exam performance: total exam score, score on a tracing question (similar to PI questions) and score on a code-writing question (very different to PI questions).

The implementation was a CS1 course with 3 lectures/week, with reading quizzes worth 4% submitted prior to each lecture and clicker responses worth 5%. The lectures contained, on average, three PI cycles, and one cycle per lecture contained the follow-up isomorphic question. Multiple regression was used to test relationships between PI and final exam scores.
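(Roughly, the analysis is a hierarchical regression: start with a baseline predictor, then add the Peer scores, then the Combined scores, and see how much extra variance in the final exam each block explains. Here’s a hedged sketch of that idea – the data file and column names are my own invention, not the paper’s actual variables.)

# Hedged sketch of a hierarchical regression: invented file and column names.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pi_scores.csv")  # hypothetical per-student scores

baseline = smf.ols("final_exam ~ baseline_score", data=df).fit()
plus_peer = smf.ols("final_exam ~ baseline_score + peer_q", data=df).fit()
plus_combined = smf.ols("final_exam ~ baseline_score + peer_q + combined_q",
                        data=df).fit()

# The interesting quantity is how much R^2 grows as each block is added.
for name, model in [("baseline", baseline),
                    ("+ peer", plus_peer),
                    ("+ combined", plus_combined)]:
    print(f"{name:>11}: R^2 = {model.rsquared:.2f}")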

All of the results were statistically significant. For code-tracing, what students know before the exam (the baseline) explains 13% of their scores in the final exam. Adding the Peer questions, it goes up to 16%. Adding Combined as well, it goes up to 19%. Is this practically significant? Daniel raised this question because it doesn’t rise very much.

In terms of code writing, the baseline is 16%, +Peer is 22% and +Combined is 25%, so we’re starting to see more contribution from peers than from the instructor in this case. Are we measuring the different difficulty of a problem that peers couldn’t correct, which is why the instructor contributes less?

Overall? Baseline 21%, Peer 30% and then Combined is 34%. (Any questions about the stats, please read the paper. 🙂 )

Maybe adding Combined questions to Peer questions increases our predictive accuracy just because we’re adding more data and thus are able to produce a better model?

In discussion, PI performance related to final exam scores (as expected). Peer learning alone is important and the instructor-led discussion is important, over and above peer learning. This validates the role of the instructor in a “student-centred” classroom. Given that PI uses MCQs, we might expect it to only correlate with code-tracing but it does appear to correlate with code-writing problems as well – there may be deep conceptual similarities between PI questions and programming skills. But would the students who learned from PI also be the students who would have learned from any other form of instruction? That’s still an open question and there’s a lot of ongoing work to do.

The next paper was “Comparing Outcomes in Inverted and Traditional CS1” presented by Diane Horton from U Toronto. I’ve been discussing early intervention and student attendance issues in inverted/hybrid courses with Jennifer Campbell and Michelle Craig, also from U Toronto and also on this paper, as part of an attempt to get some good answers, so I came into this talk straight out of a lunch spent discussing inverted classrooms and their outcomes. (Again, this is why we come to conferences – much of the value is in the meetings and discussions that are just so hard to have when you’re doing your day job or fitting a Skype meeting into the wee small hours to bridge the continental time gap.)

As a reminder, in inverted teaching, some or all of the material is delivered outside the classroom. Work typically done as homework is done in the lecture with the help of instructor or TAs. There were three research questions. Would the inverted offerings have better outcomes? Would inverted teaching affect students’ behaviour or experience? Would particular subgroups respond differently, especially English-language learners and beginner programmers?

The CS1 course at Toronto is a 12-week course in Python, objects-early, classes-late, with most students in 1st year and less than half looking to major in CS. The lecture sections hold roughly 200 students each, with 5 lecture sections in total.

Before the lecture, students prepared by watching videos made by the instructors, mostly screencasts of live programming with voice-over; credit for attempting the quizzes embedded in the videos was 0.5% per week. (It’s scary how small that fraction has to be and really rather sad, from a behavioural perspective.)

During the lecture, the instructors used a worked example and students worked on worksheet-based exercises for most of the lectures, with assistance, solo or in pairs. This was a responsive teaching approach because the instructor could draw the class together as required. There was no mark reward or penalty that depended on attendance. If you were solid on the material, it was okay to miss the lecture. The mark scheme reflected some marks for lecture preparation with an increased number of online exercises and decreased weighting on labs.

The inverted CS1 course had gone well in the pilot in January 2013, which was published in a paper at SIGCSE ’14, but it was hard to compare this with the previous class as the make-up of the cohort varies between September and January courses. The study was run again with a more similar cohort in September 2013. The data presented here is for that similar cohort, with a high overlap of instructors, compared with the traditional offering.

For the present study, there were pre- and post-course surveys about attitude and behaviour, completed on paper in the lecture. Weekly lecture attendance counts were made and standard university course evaluations were collected. In terms of attendance, the inverted pilot was the lowest, but the inverted class had lower attendance most of the time – an effect that we have also seen under some circumstances and are still thinking about. Interestingly, students didn’t see the inverted lectures as being as useful as face-to-face lectures, but the online support materials were seen to be very helpful. As a package, this seems to be an overall positive experience.

The hypothesis was that students in the inverted offering would self-report a higher quality of learning experience and greater enjoyment but this wasn’t supported in the data, nor was it for beginners in particular. However, when asked if they wanted more inverted courses, there was a very strong positive response to this question.

The authors expected that beginners would benefit more because they need more help, and that the gap between beginners and experienced students would narrow – this wasn’t supported by the data. There was no reduction of the gap for English language learners, either.

Would the inverted course help people stay on and pass the course? Well, however success was defined, the pass rate was remarkably consistent, even when the beginners were isolated. However, it does appear that the overall level of knowledge, as measured by the final exam grades, actually improved in the inverted offerings, across two exams of similar difficulty, with a jump from an average grade of 66% to 74% between the terms. Is this just due to the inverted teaching?

Maybe students learned more in the inverted offering because they spent more time on task? Based on self-reported student time, this doesn’t appear to be true. Maybe the beginners got killed off early to reduce their numbers and raise the mark? No, the drop rates among beginners were the same. It appears that the 8 percentage point increase may be related to the inverted mode, although, obviously, more work is required.

Is it worth it? They used no additional TA resources but the development time was enormous, and you may not be ready for that investment. There are other options: you don’t need to use videos, and you can use pre-existing materials to reduce costs.

Future work involves looking at dropping patterns – who drops and when – and students who stumble and recover. They’re also looking at a fully online CS1 course for course credit.

The final talk was on “Making Group Processes Explicit to Students: A Case for Justice” presented by Ville Isomöttönen. Project courses have a very long history in Computer Science, as capstones, using authentic customer projects, and the intention is to provide a realistic experience. (Editor’s note: it’s worth noting that some of this may be coming from the “we got punished like this so you can be too” school of thought.) What do students actually learn from this? Are they learning what we want them to learn, or are they learning something very different and, potentially, much darker?

(This sounds like the kind of philosophical paper I’d give, let’s see where it goes! 🙂 )

If we have tertiary students, why can’t we just place them into a workplace for work experience? They’re adults – maybe we can separate this aspect from the pedagogy. The author’s study wants to look at how to promote conceptual learning in the context of realistic course work. Parker (1999) proposes that students spend their effort on building working products, rather than actually learning about and reflecting upon the professional issues we consider important. The conjecture is that just because the situation is realistic doesn’t mean that the conceptual learning is happening as we intended.

The study is based around a fairly straightforward project-based learning structure, but had a Pass/Fail grade, with no distinction grading as far as I could tell. The teaching was based on weekly group discussions, with self/peer evaluations, also housed in a group situation, and technical supervision offered by a teaching assistant. Throughout the course, students are prompted to think about their operation at a conceptual level. Hmm. I’m not sure what the speaker means by this as, without a very detailed description of what is going on, this could have many different implementations.

We then cut to a diagram of justice conceptualised – I may have missed something as I’m not quite sure how this sits with the group work. I can’t find the diagram online but it involves participation, involving and negotiating with others – fused together as the skill of justice. This sits above statuses, norms and roles. Some of the related work deals with fairness (Richards 2009) as a key attribute of successful group work, Clear (2002) uses it in a diagnostic technique, and Pieterse and Thompson (2010) mention ‘social loafers’ and ‘diligent isolates’.

I’m dreadfully sorry, dear reader, but I’m not following this properly so this may be a bit sketchy. Go and read the paper and I’ll try to get this together. Everyone else in the room appears to be getting this so I may just be tired or struggling with my (not very good) hearing and someone who is speaking rather quietly.

The underlying pedagogy comes from the social realist mindset (Moore, 2000; Maton and Moore, 2010) and “avoids the dilemma between constructivist relativism and positivist absolutism”. We should also look at Integrative Pedagogy (Tynjälä), where the speaker feels that what they are describing is a realist version of this.

The course was surveyed with a preliminary small study (N=21/26, which is curious. Which one is it? Ah, 21 out of 26 enrolled, there we go.). The survey questions were… rather loose and very open to influence, unfortunately, from my quick glance at them but I will have to read the original paper.

Justice is a difficult topic to address, especially where it’s reified as a professional skill that can be developed, and discussing the notion of justice in terms of the ways that a group can work together fairly is very important. I suppose I’m not 100% convinced how much is added in this context through the use of a new term that is an apparent parent to communication and negotiation, with the desired outcome of fairness, because the amalgamation seems to obscure the individual components that would be improved upon to return to a fair state. The very small study, and a small survey, is a valid approach for a case study or phenomenographic approach, but I get the feeling that I was seeing a grounded theory argument. We do have to expose our desired processes to students if we’re going to achieve cognitive apprenticeship and there is a great deal of tension between industrial practice and key concepts, so this is a very interesting area to work in. I completely agree with the speaker that our heavy technical focus often precludes discussions of the empathic, philosophical and intangible, but I’m yet to see how this approach contributes.

The discussions mentioned are indeed very important, but group reports and discussion are a built-in part of many SE process models, so I wonder how the justice theme amplifies these aspects. Again, getting students to engage in a dialogue that they do not expect to have in CS can be very challenging, but we could be discussing issues such as critical thinking and ethics, which are often equally alien and orthogonal to the technical, without forming a compound concept that potentially obscures the underlying component mechanisms.

Simon asked a very good question: you didn’t present anything that showed a problem where the students would have needed the concept of justice. Apparently, this is in the writings that are yet to be analysed. The answer to the question ended up as an unlabelled graph on the blackboard which was focused on a skill difference with more experienced peers. I still can’t see how justice ties into this. I have to go and get my hearing checked.


SIGCSE Day 2, Assessment and Evaluation Session, Friday 10:45-12:00, (#SIGCSE2014)

The session opened with a talk on the “Importance of Early Performance in CS1: Two Conflicting Assessment Stories” by Leo Porter and Daniel Zingaro. Frequent readers will know that I published a paper in ICER 2012 on the impact of early assignment submission behaviour on later activity so I was looking forward to seeing what the conflict was. This was, apparently, supposed to be a single story but, like much research, it suddenly turned out that there were two different stories.

In early-term performance, do you notice students falling into a small set of performance groups? Does it feel as if you can predict the results? (Shout out to Ahadi and Lister’s “Geek genes, prior knowledge, stumbling points and learning edge momentum: parts of the one elephant?” from ICER 2013!) Is there a truly bimodal distribution of ability? The results don’t match a neat bell curve. (I’m sure a number of readers will want to wait and see where this goes.)

Why? Well, the Geek Gene theory is that there is an innate and immutable talent that you either have or you don’t. The author doesn’t agree with this and, by the way, the research supports that position. The next possibility is a stumbling block, where you misunderstand something critical. The final possibility is learning edge momentum (LEM), where you build knowledge incrementally and early mistakes cascade.

In evaluating these theories, the current approach is over a limited number of assessments, but it’s hard to know what happened in between. We need more data! Leo uses Peer Instruction (PI) a lot so has a lot of clicker question data to draw on. (Leo gave a quick background on PI but you can look that up. 🙂 ) The authors have some studies looking at the correlation between the individual vote and the group vote.

The study was run over a CS1 course in Python with 126 students, with 34 PI sessions over 12 weeks and 8 prac lab sessions. The instructor was experienced in PI and the material. Components for analysis include standard assessments (midterm and finals), in-class PI for the last two weeks, and the PI results per student, averaged bi-weekly to reduce noise because students might be absent and are graded on participation.

(I was slightly surprised to see that more than 20% of the students had scored 100% on the midterm!) The final was harder but it was hard to see the modalities in the histograms. Comparing this with the last two weeks of PI in the course, that isn’t bi-modal either and looks very different. The next step was to use the weekly assessments to see how they predict performance in the last two weeks, and that requires a correlation. The Geek Gene theory should show a strong correlation early and no change. A stumbling block should see a strong correlation somewhat early and then no change. Lastly, for LEM, a strong correlation somewhat early, then no change – again. These are not really that easy to distinguish.

The results were interesting. Weeks 1–2 don’t correlate much at all but from weeks 3–4 onwards, correlation is roughly 40% and it doesn’t get better. Looking at the final exam correlation with the Week 11/12 PI scores, correlation is over 60% (growing steadily from weeks 3–4). Let’s look at the exam content (analyse the test) – where did the content fall? 54% of the questions target the first weeks, 46% target the latter half. Buuuuuuut, the later questions were more conceptually rich – and this revealed a strong bias for the first half of the class (87%) and only 13% for the later material. The early test indicators were valid because the exam is mostly testing the early section! The PI in Weeks 11 and 12 was actually 50/50 first half and second half, so no wonder that correlated!
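(Again as a sketch of my own, with invented column names rather than the authors’ data: the correlation step is essentially a series of Pearson correlations between the bi-weekly PI averages and the final exam score, watching where the correlation stops improving.)

# Sketch only: hypothetical per-student bi-weekly averages, not the study's data.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("clicker_averages.csv")  # hypothetical bi-weekly PI averages

for weeks in ["wk1_2", "wk3_4", "wk5_6", "wk7_8", "wk9_10", "wk11_12"]:
    r, p = pearsonr(df[weeks], df["final_exam"])
    print(f"{weeks}: r = {r:.2f} (p = {p:.3f})")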

Threats to validity? Well, the data was noisy and participation was variable. The PI questions are concept tests, focused on a single concept, and may not actually reflect writing code. There were different forms of assessment. The PI itself may actually change student performance because students generally do better in PI courses. So what does all this mean?

Well, the final exam correlation supports stumbling block and LEM, but Weeks 11 and 12 are different! The final exam story isn’t ideal but the Week 11/12 improvements are promising. We’re addicted to this kind of assessment, and student performance early in term will predict assessment based on that material, but the PI measure is more general.

It’s interesting to note that there were no actual MCQs on the final exam.

The next talk was “Reinventing homework as a cooperative, formative assessment” by Don Blaheta. There are a couple of problems in teaching: the students need practice and the students need feedback. In reinventing homework, the big problems are that grading is a lot of work, matching comments to grades and rubrics is hard, there’s a delay for feedback, it’s not group work and solitary work isn’t the best for all students, and a lot of the students don’t read the comments anyway. (My ears pricked up, this is very similar to the work I was presenting on.)

There’s existing work on automation, off-the-shelf programming and testing systems and online suites, with immediate feedback. But some things just can’t be auto-graded and we have to come back to manual marking. Diagrams can’t be auto-marked.

To deal with this, the author tried “work together, write alone”, but there is confusion about what is and isn’t acceptable as collaboration – the lecturer ends up grading the same thing three times. What about revising previous work? It’s great for learning but students may not have budgeted any time for it, and some will be happy with a lower mark. There’s the issue of apathy, and it increases the workload.

How can we package these ideas together to get them to work better? We can make the homework group work; the next idea is that there’s a revision cycle where an early (ungraded) version is handed back with comments – a limited-scale response of correct, substantial understanding, or little or no understanding. (Then homework is relatively low stakes.) Other mechanisms include comments, no grades; grades, no comments; the limited scale. (Comments without grades should make them look at the comments – with any luck.) Don’t forget that revision increases workload where everything else theoretically decreases it! Comments identify higher-order problems and marks are not handed back to students. The limited scale now reduces marking overhead and can mark improvement rather than absolutes. (And the author referred to my talk from yesterday, which startled me quite a lot, but it’s nice to see! Thanks, Don!)

It’s possible to manage the group, which is self-policing, very interestingly – the “free rider” problem rears its ugly head. Some groups did divide the task but moved to a full-group model after initially splitting up the work. Grades could swing and students might not respond positively.

In the outcomes, while the n is small, he doesn’t see a high homework mark correlated with a low exam average, which would be the expected indicator of the “free rider” or “plagiarist” effect. So, nothing significant but an indication that things are on the right track. Looking at class participation, students are working in different ways, but overall it’s positive in effect. (The students liked it but you know my thoughts on that. 🙂 ) Increased cooperation is a great outcome, as is making revisions to existing code.

The final talk was on “Evaluating an Inverted CS1” presented by Jennifer Campbell from the University of Toronto. Their CS1 is a 12-week course with 3 lecture hours and a 2-hour lab per week, in Python with an objects-early, classes-late approach. Lecture size is 130-150 students, mostly 1st years with some higher years and some non-CS. Typical lab sizes are 30 students with one TA.

The inverted classroom is also known as the flipped classroom: resources are made available and materials are completed before the students show up, and the face-to-face time is used for activities. Before the lecture, students watch videos made by the two instructors, screencasts with some embedded quizzes (about 15 questions), worth 0.5% per week. In class, the students work on exercises on paper, solo or in pairs; the exercises were not handed in or for credit, and the staffing was the instructor plus 1 TA per 100 enrolled students. (There was an early indicator of possible poor attendance in class because the ratio in reality was higher than that.) Most weeks the number of lecture hours was reduced from three to two.

In coursework, there were 9 2-hour labs, some lecture prep, some auto-graded programming assignments, two larger TA-graded programming assignments, one 50-minute midterm and a three hour final exam.

How did it go? Pre- and post-course surveys on paper, relating to demography, interest in pursuing a CS program, interest in CS1, enthusiasm, difficulty, time spent and more. (Part of me thinks that these things are better tracked by looking at later enrolments in the course or degree transfers.) Weekly lecture attendance counts and enrolment were tracked, along with the standard university course evaluation.

There was a traditional environment available for comparison, from a previous offering, so they had collected all of that data. (If you’re going to make a change, establish a baseline first.) Sadly, the baselines were different for the different terms, so comparison wasn’t as easy.

The results? Across their population, 76% of students were not intending to pursue CS, 62% had no prior programming experience, and 53% were women! I was slightly surprised that traditional lecture attendance was overall higher, with a much steeper decline early on. For students who completed the course, the average mark for prep work was 81%, so the students were preparing the material but were then not attending the lecture. Hmm. This came out again in the ‘helpfulness’ graphs, where the online materials outscored the in-lecture activities. But the traditional lecture still outscored both – which makes me think this is a hearts-and-minds problem combined with some possible problems in the face-to-face activities. (Getting f2f right for flipped classes is hard and I sympathise entirely if this is a start-up issue.)

For those people who responded to both the pre- and post-course surveys, enthusiasm increased – but the surveys were done on paper and we already know that there was a drop in attendance, so this is biased; the online university surveys also backed it up, though. In terms of perceptions of difficulty and time, women found it harder and more time consuming. What was more surprising is that prior programming experience did not correlate with difficulty or time spent.

Outcomes? The drop rate was comparable to past offerings, with 25% of students dropping the course. The pass rates were comparable, at 86%, and there was comparable performance on “standard” exam questions – no significant difference in the performance on those three exam questions. The students who were still attending at the end wanted more of these types of course, not really surprisingly.

Lessons learned – there was a lot learnt! In terms of resources, video preparation took ~600 hours and development of in-class exercises took ~130 hours. The extra TA support cost money and, despite trying to make the load easier, two lecture hours per week were too few. (They’ve now reverted to three hours, most weekly two-hour labs are replaced with online exercises and a TA drop-in help centre, which allows them to use the same TA resources as a traditional offering.) In terms of lecture delivery, the in-class exercises on paper were valuable test preparation. There was no review of the lecture material that had been pre-delivered (which is always our approach, by the way) so occasionally students had difficulty getting started. However, they do now start each lecture with a short worked example to prime the students on the material that they had seen before. (It’s really nice to see this because we’re doing almost exactly the same thing in our new Object Oriented Programming course!) They’ve now introduced a weekly online exercise to allow them to assess whether they should be coming to class, but lecture attendance is still lower than for the traditional course.

The take-away is that the initial resource cost is pretty big but you then get to re-use it on more than one occasion, a pretty common result. They’re on their third offering, having made ongoing changes. A follow-up study on the second offering has been run and will be presented as Horton et al, “Comparing Outcomes in Inverted and Traditional CS1”, which will appear in ITiCSE 2014.

They haven’t had the chance to ask the students why they’re not coming to the lectures but that would be very interesting to find out. A good talk to finish on!