SIGCSE Day 3, “What We Say, What They Do”, Saturday, 9-10:15am, (#SIGCSE2014)

The first paper was “Metaphors we teach by”, presented by Ben Shapiro from Tufts. What are the types of metaphors that CS1 instructors use and what are the wrinkles in these metaphors? What do we mean by metaphors? Ben is talking about conceptual metaphors: linguistic devices that allow us to understand one idea in terms of another idea that we already know. Example: love is a journey – twists and turns, no guaranteed good ending. The structure of a metaphor is that you have a thing we’re trying to explain (the target) in terms of something we already know (the source). Conceptual metaphors are explanatory devices to assist us in understanding new things.

Metaphors are widely used in teaching CS: pointers, stacks and loops are all metaphorical aspects of computer science, but that’s not the focus of this study. How do people teach with metaphor? The authors couldn’t find any studies on general metaphor use in CS and its implications for student learning. An example from a birds-of-a-feather session held at this conference: a variable is like a box. A box can hold many different things, but it holds things. (This has been the subject of a specific study.) Ben also introduced the “Too much milk” metaphor, which is laid out as follows. Jane comes home from work and goes to get milk from the fridge, but her roommate has already drunk it (bad roommate!). Jane goes out to get more milk. While she’s out, her roommate comes back with milk, then Jane comes back with milk. Now they have too much milk! This could be used to explain race conditions in CS. Another example is the use of bus lockers mapping to virtual memory.
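(As an aside: to make that race condition concrete, here is a minimal sketch of the “too much milk” problem as code. This is my own illustration rather than anything from the talk, and the names are invented; it’s the classic check-then-act race using Python threads.)

```python
import threading
import time

milk_in_fridge = 0  # shared state: how many cartons are in the fridge

def buy_milk_if_needed(name):
    global milk_in_fridge
    if milk_in_fridge == 0:     # check: no milk, better go shopping
        time.sleep(0.1)         # the trip to the shop; the other roommate checks meanwhile
        milk_in_fridge += 1     # act on stale information
        print(f"{name} bought milk")

jane = threading.Thread(target=buy_milk_if_needed, args=("Jane",))
roommate = threading.Thread(target=buy_milk_if_needed, args=("Roommate",))
jane.start(); roommate.start()
jane.join(); roommate.join()
print(f"Cartons in the fridge: {milk_in_fridge}")  # usually 2 – too much milk!
```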
Ben returned to boxes again. One of the problems is that boxes can hold many things but a variable can only hold one thing, which appears to be a confusing point for learners who know how boxes work. Is this a common problem? Metaphors have some benefits but come with this kind of baggage. Metaphors are partial mappings: they don’t match every aspect of the target to the source. (If it was a complete mapping, they’d be the same thing.)
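(A tiny illustration of that breakdown, again my own example rather than one from the paper: a box can accumulate contents, but assigning to a variable replaces whatever was there.)

```python
x = "apples"
x = "oranges"   # the "box" does not now hold both; the old value is gone
print(x)        # prints: oranges
```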
The research questions that the group considered were:
  • What metaphors do CS1 instructors use for teaching?
  • What are they trying to explain?
  • What are the sources that they use?
Learners don’t know where the mappings start and stop – where do the metaphors break down for students? What mistakes do they make because of these misunderstandings? Why does this matter? We all have knowledge on how to explain but we don’t have good published collections of the kind of metaphors that we use to teach CS, which would be handy for new teachers. We could study these and work out which are more effective. What are the most enduring and universal metaphors?
The study was interview-based, interviewing university-level CS1 instructors; the authors ended up with 10 participants, with an average of 13 years of teaching experience. The interview questions given to these instructors were (paraphrased):
  • Levels taught and number of years
  • Tell me about a metaphor
  • Target to source mapping
  • Common questions students have
  • Where the metaphor breaks down
  • How to handle the breakdown in teaching.
Ben then presented the results. (We had a brief discussion of similes versus metaphors but I’ll leave that to you.) An instructor discussed using the simile of a portkey from Harry Potter to explain return statements in functions, because students had trouble with return exiting immediately. The group of 10 people provided 18 different CS Concepts (Targets) and 19 Metaphorical Explanations (Sources).
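(A minimal sketch of what the portkey simile is getting at, my example rather than the instructor’s: the moment return executes, the function exits and nothing after it runs.)

```python
def grade_message(score):
    if score >= 50:
        return "Pass"               # like touching a portkey: we leave immediately
        print("This never runs")    # dead code after the return
    return "Fail"

print(grade_message(72))  # Pass
print(grade_message(31))  # Fail
```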
What’s the target for “Card Catalogs”? Memory addressing and pointers. The results were interesting – there’s a wide range of ways to explain things! (The paper contains a table of a number of targets and sources.)
Out-of-date cultural references were identified as a problem and you have to be aware of the students’ cultural context. (Card catalogs and phone booths are nowhere near as widely used as they used to be.) Where do students make inferences beyond the metaphor? None of the 10 participants could give a single example of this happening! (This is surprising – Ben called it weird.) Two hypotheses: our metaphors are special and don’t get overextended (very unlikely) OR CS1 instructors poorly understand student thinking (more likely).
The following experimental studies may shed some light on this:
  • Which metaphors work better?
  • Cognitive clinical interviews, exploring how students think with metaphors and where incorrect inferences are drawn.
There was also a brief explanation of PCK (teachers’ pedagogical content knowledge) but I don’t have enough knowledge to fully flesh this out. Ben, if you’re reading this, feel free to add a beautifully explanatory comment. 🙂
The next talk was “‘Explain in Plain English’ Questions Revisited: Data Structures Problems”, presented by Sue Fitzgerald and Laurie. This session opened with a poll to find out what the participants wanted, and we all wanted to find out how to get students to use plain English. An Explain in Plain English (EiPE) question asks you to describe what a chunk of code does, but not as a line-by-line discussion. A student’s ability to explain what a chunk of code does correlates with the student’s ability to write and read code. The study wanted to investigate whether this was just a novice phenomenon or whether it held as years of study and expertise advanced. This study looked at 120 undergraduates in a CS2 course in data structures and algorithms using C++, with much more difficult questions than in earlier studies: linked lists, recursive calls and so on.
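(For readers who haven’t met an EiPE question, here’s a made-up illustration in Python rather than the study’s C++: the question is “what does this function do?”, and the expected answer summarises the code rather than narrating it.)

```python
def mystery(values):
    count = 0
    for v in values:
        if v > 0:
            count += 1
    return count

# A line-by-line answer walks through the loop and the if statement.
# A plain English answer: "it returns how many positive numbers are in the list."
```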
The students were given two questions in an exam, with some preamble to describe the underlying class structure, a short example and a diagram. The students then had to look at a piece of code and determine what would happen, in order to answer the question with a plain English response. (There’s always a problem when you throw to an interactive response system and the question isn’t repeated; perhaps we need two screens.)
The SOLO taxonomy was used to analyse the problems (more Neo-Piagetian goodness!). Four of the SOLO categories were used: relational (summarises the code), multistructural (line-by-line explanation of the code), unistructural (only describes one portion rather than the whole idea), and prestructural (misses it completely, gibberish). I was interested to see the examples presented, with pointers and mutual function calling, because it quickly became apparent that the room I was in (which had a lot of CS people in it) was having to think relatively hard about the answer to the second example. One of the things about working memory is that it’s not very deep and none of us were quite ready to work in a session 🙂 but a lot of good discussion ensued. The students would have had ready access to the preamble code but I do wonder how much obfuscation is really required here. The speaker made a parenthetical comment that experts usually doodle, but where was our pen and paper! (As someone else said, reinforcing the point that we didn’t come prepared to work, nobody told us we had to bring paper. 🙂 ) We then got to classify a student response that was quite “student-y”. (A question came up as to whether an answer can be relational if it’s wrong – the opinion appears to be that a concise, complete and incorrect answer could be considered relational. A point for later discussion.) The answer we saw was multistructural because it was a line-by-line answer – it wasn’t clear, concise and abstract. We then saw another response that was much more terse but far less accurate. The group tossed up between unistructural and prestructural. (The group couldn’t see the original code or the question, so this uncertainty makes sense. Again, a problem with trying to have an engaging on-line response system and a presentation on the same screen. The presenters did a great job of trying to make it work but it’s not ideal.)
What about correlations? For the first question, students who gave relational or multistructural answers generally passed, with an average grade of 58%; those who answered at the unistructural or prestructural level generally failed, with an average grade of 38%. For the second test question, the relational and multistructural group generally passed with an average grade of 61.2%, while the unistructural and prestructural group generally failed with an average grade of 42%.
So these correlations hold for non-novice programmers. A mix of explaining, writing and reading code is an effective way to develop good programming skills, and EiPE questions give students good practice in the valuable skills of explaining code. Instructors can overestimate how well students understand presented code – asking them to explain it back is very useful for student self-assessment. The authors’ speculation is that explaining code to peers is probably part of the success of peer instruction and pair programming.
The final talk was “A Formative Study of Influences on Student Testing Behaviours”, presented by Kevin Buffardi from VT. In their introductory CS1 and CS2 courses they use Test-Driven Development (TDD): code a little, test a little, for incremental development. It’s popular in industry, so students come out with relevant experience, and some previous studies have found improvements in student work when students closely adhered to the TDD philosophy. BUT a lot of students didn’t follow it at all! So the authors were looking for ways to encourage students to follow this, especially when they were on their own and programming by themselves. Because it’s a process, you can’t tell what happened just by looking at the final program, but they use WebCAT and so can track the developmental stages of the program as students submit their work for partial grading. These snapshots provide clear views of what the students are doing over time. (I really have to look at what we could do with WebCAT. Our existing automarker is getting a bit creaky.) Students also received hints back when they submitted their work, both general and instructor-level.
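(If you haven’t seen TDD in miniature, here’s a rough Python sketch of “code a little, test a little”; the courses in the talk use WebCAT and their own assignments, so this is purely my illustration: write a small failing test, write just enough code to pass it, and repeat.)

```python
import unittest

# The TDD rhythm: write a small test first, watch it fail,
# write just enough code to make it pass, then add the next test.

def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

class TestMedian(unittest.TestCase):
    def test_odd_length(self):
        self.assertEqual(median([3, 1, 2]), 2)

    def test_even_length(self):
        self.assertEqual(median([4, 1, 3, 2]), 2.5)

if __name__ == "__main__":
    unittest.main()
```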
The first time students achieved something with any type of testing, they would get “Good Start” feedback and be entitled to a free hint. If you kept up with your testing, you would ‘buy’ more hints. If your test coverage was good, you got more hints. If your coverage was poor, you got general feedback. (Prior to this, WebCAT only gave 3 hints. Now there are no free hints but you can buy an unlimited number.) This is an adaptive feedback mechanism, designed to encourage testing with hints as incentives. The study compared three reinforcement treatments (sketched in code after the list):
  • Constant – every time a goal was achieved, you got a hint (consistently rewards target behaviour)
  • Delayed – hints when earned, but at most one hint per hour (less incentive for hammering the system)
  • Random – 50% chance of a hint when a goal is met (should reduce dependency on extrinsic motivation)
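(Here’s the promised sketch of the hint-granting logic, entirely my own reconstruction in Python; the paper describes the behaviour of WebCAT’s adaptive feedback, not this code, and the function and treatment names are invented.)

```python
import random
import time

def hints_earned(treatment, goal_met, last_hint_time, now=None):
    """Decide whether a submission that met a testing goal earns a hint.

    treatment: "constant", "delayed" or "random" (hypothetical names).
    goal_met: did this submission achieve a testing goal?
    last_hint_time: timestamp of the previous hint, or None.
    """
    now = time.time() if now is None else now
    if not goal_met:
        return 0
    if treatment == "constant":
        return 1                                    # reward every achievement
    if treatment == "delayed":
        # at most one hint per hour, less incentive to hammer the system
        if last_hint_time is None or now - last_hint_time >= 3600:
            return 1
        return 0
    if treatment == "random":
        return 1 if random.random() < 0.5 else 0    # 50% chance of a hint
    raise ValueError(f"unknown treatment: {treatment}")
```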
Should you show them the goal or not? This was an additional factor: the goals were either visual (a concrete goal) or obscured (suggesting improvement without a specified target). These were paired treatments.
What was the impact? There were no differences in the number of lines written, but the visual goal led to students getting better test coverage than the obscured goal. There didn’t appear to be a long-term effect, but there is an upcoming ITiCSE talk that will discuss this further. There were some changes from one submission to another but this wasn’t covered in detail.
The authors held formative group interviews where the students explained their development process and interaction with WebCAT. They said that they valued several types of evaluation, they paid attention to RED progress bars (visualisation and dashboarding – I’d argue that this is more about awareness than motivation), and they noticed when they earned a hint but didn’t get it. The students drew their individual development process as a diagram and, while everyone had a unique approach, there were two general patterns. A test-last approach showed up: write a solution, submit it to WebCAT, take a break, do some testing, then submit to WebCAT again. A periodic-testing approach was the other pattern seen: write a solution, submit to WebCAT, write tests, submit to WebCAT, then revise solution and tests, and iterate.
Going forward, the automated evaluation became part of their development strategy. There were conflicting interests: the correctness reports from WebCAT were actually reducing the need for students to write their own tests, because they were getting an indication of how well the code was working. This is an important point for me because, from the examples I saw, I really couldn’t see what I would call test-driven development, especially for the test-last pattern, so the framework is not encouraging the right behaviour. Kevin handled my question on this well, because it’s a complicated issue, and I’m really looking forward to seeing the ITiCSE paper follow-up! Behavioural change is difficult and, as Kevin rightly noted, it’s optimistic to think that we can achieve it in the short term.
Everyone wants to get students doing the right thing but it’s a very complicated issue. Much food for thought and a great session!

SIGCSE Day 2, Assessment and Evaluation Session, Friday 10:45-12:00, (#SIGCSE2014)

The session opened with a talk on the “Importance of Early Performance in CS1: Two Conflicting Assessment Stories” by Leo Porter and Daniel Zingaro. Frequent readers will know that I published a paper in ICER 2012 on the impact of early assignment submission behaviour on later activity so I was looking forward to seeing what the conflict was. This was, apparently, supposed to be a single story but, like much research, it suddenly turned out that there were two different stories.

In early term performance, do you notice students falling into a small set of performance groups? Does it feel like you can predict the results? (Shout out to Ahadi and Lister’s “Geek genes, prior knowledge, stumbling points and learning edge momentum: parts of the one elephant?” from ICER 2013!) Is there a truly bimodal distribution of ability? The results don’t match a neat bell curve. (I’m sure a number of readers will want to wait and see where this goes.)

Why? Well, the Geek Gene theory is that there is an innate and immutable talent that you either have or you don’t. (The author didn’t agree with this, and the research supports that position, by the way.) The next possibility is a stumbling block, where you misunderstand something critical. The final possibility is learning edge momentum (LEM), where you build knowledge incrementally and early mistakes cascade.

In evaluating these theories, the current approach is to look at a limited number of assessments, but it’s hard to know what happened in between. We need more data! Leo uses Peer Instruction (PI) a lot, so he has a lot of clicker question data to draw on. (Leo gave a quick background on PI but you can look that up. 🙂 ) The authors have some studies looking at the correlation between the individual vote and the group vote.

The study was run over a CS1 course in Python with 126 students, with 34 PI sessions over 12 weeks and 8 prac lab sessions. The instructor was experienced in PI and the material. Components for analysis include standard assessments (midterm and finals), in-class PI for the last two weeks, and the PI results per student, averaged bi-weekly to reduce noise because students might be absent and are graded on participation.

(I was slightly surprised to see that more than 20% of the students had scored 100% on the midterm!) The final was harder but it was hard to see the modalities in the histograms. Comparing this with the PI results from the last two weeks of the course, that isn’t bimodal either and looks very different. The next step was to use the weekly assessments to see how they would do in the last two weeks, and that requires a correlation. The Geek Gene theory should show a strong correlation early and then no change. The stumbling block theory should show a strong correlation somewhat early and then no change. Lastly, LEM should show a strong correlation somewhat early, then no change – again. These predictions are not really that easy to distinguish.
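(For the statistically inclined, the comparison boils down to correlating the early bi-weekly PI averages with a later outcome and watching how that correlation changes over the term. A hedged sketch with invented data, using scipy’s Pearson correlation:)

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_students = 126  # class size from the talk; the scores below are invented

# Bi-weekly PI averages (weeks 1-2, 3-4, ..., 11-12) and a final exam score.
pi_biweekly = rng.uniform(0, 1, size=(n_students, 6))
final_exam = 0.6 * pi_biweekly[:, 2:].mean(axis=1) + 0.4 * rng.uniform(0, 1, n_students)

for period, scores in enumerate(pi_biweekly.T, start=1):
    r, p = pearsonr(scores, final_exam)
    print(f"Weeks {2 * period - 1}-{2 * period} PI vs final exam: r = {r:.2f} (p = {p:.3f})")
```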

The results were interesting. Weeks 1 and 2 don’t correlate much at all, but from weeks 3 and 4 onwards the correlation is roughly 40%, and it doesn’t get better. Looking at the final exam correlation with the Week 11/12 PI scores, the correlation is over 60% (growing steadily from weeks 3 and 4). Let’s look at the exam content (analyse the test) – where did the content fall? 54% of the questions target the first weeks and 46% target the latter half. Buuuuuuut, the later questions were more conceptually rich – and this analysis revealed a strong bias towards the first half of the class (87%) with only 13% on the later material. The early test indicators were valid because the exam is mostly testing the early section! The PI in Weeks 11 and 12 was actually 50/50 between the first and second half, so no wonder that correlated!

Threats to validity? Well, the data was noisy and participation was variable. The PI questions are concept tests, focused on a single concept, and may not actually reflect writing code. There were different forms of assessment. The PI itself may actually change student performance because students generally do better in PI courses. So what does all this mean?

Well, the final exam correlation supports the stumbling block and LEM theories, but the Week 11 and 12 results are different! The final exam story isn’t ideal, but the Week 11/12 improvements are promising. We’re addicted to this kind of assessment, and student performance early in term will predict assessment based on that material, but the PI measure is more generally useful.

It’s interesting to note that there were no actual MCQs on the final exam.

The next talk was “Reinventing homework as a cooperative, formative assessment” by Don Blaheta. There are a couple of problems in teaching: the students need practice and the students need feedback. In reinventing homework, the big problems are that grading is a lot of work and matching comments to grades and rubrics is hard, there’s a delay before feedback arrives, it’s not group work (and solitary work isn’t the best for all students), and a lot of the students don’t read the comments anyway. (My ears pricked up, this is very similar to the work I was presenting on.)

There’s existing work on automation, off-the-shelf programming and testing systems, and online suites with immediate feedback. But some things just can’t be auto-graded and we have to come back to manual marking. Diagrams can’t be automarked.

To deal with this, the author tried “work together, write alone”, but there is confusion about what is and isn’t acceptable as collaboration, and the lecturer ends up grading the same thing three times. What about revising previous work? It’s great for learning, but students may not have budgeted any time for it and some will be happy with a lower mark. There’s the issue of apathy, and it increases the workload.

How can we package these ideas together to get them to work better? We can make the homework group work; the next idea is that there’s a revision cycle where an early (ungraded) version is handed back with comments, using a limited-scale response of correct, substantial understanding, or little or no understanding. (Then homework is relatively low stakes.) Other mechanisms include comments with no grades; grades with no comments; or the limited scale. (Comments with no grades should make them look at the comments – with any luck.) Don’t forget that revision increases workload where everything else theoretically decreases it! Comments identify higher-order problems and marks are not handed back to students. The limited scale now reduces marking overhead and can mark improvement rather than absolutes. (And the author referred to my talk from yesterday, which startled me quite a lot, but it’s nice to see! Thanks, Don!)

Very interestingly, it’s possible for the group to manage itself, as the groups are self-policing – the “free rider” problem rears its ugly head. Some groups did divide the task but moved to a full-group model after initially splitting up the work. Grades could swing and students might not respond positively.

In the outcomes, while the n is small, he doesn’t see a high homework mark correlated with a low exam average, which would be the expected indicator of the “free rider” or “plagiarist” effect. So, nothing significant, but an indication that things are on the right track. Looking at class participation, students are working in different ways, but overall the effect is positive. (The students liked it but you know my thoughts on that. 🙂 ) Increased cooperation is a great outcome, as is making revisions on existing code.

The final talk was on “Evaluating an Inverted CS1”, presented by Jennifer Campbell from the University of Toronto. Their CS1 is a 12 week course with 3 lecture hours and a 2 hour lab per week, using Python with an objects-early, classes-late approach. Lecture size is 130-150 students, mostly 1st years with some higher years and some non-CS students. Typical lab sizes are 30 students with one TA.

The inverted classroom is also known as the flipped classroom: resources are available and materials are completed before the students show up, and the face-to-face time is used for activities. Before the lecture, students watch videos made by two instructors, with screencasts and some embedded quizzes (about 15 questions), worth 0.5% per week. In class, the students work on exercises on paper, solo or in pairs; exercises were not handed in or for credit, and staffing was the instructor plus 1 TA per 100 enrolled students. (There was an early indicator of possible poor attendance in class, because the ratio in reality is higher than that.) Most weeks the number of lecture hours was reduced from three to two.

In coursework, there were nine 2-hour labs, some lecture prep, some auto-graded programming assignments, two larger TA-graded programming assignments, one 50-minute midterm and a three-hour final exam.

How did it go? Pre- and post-course surveys were run on paper, relating to demography, interest in pursuing a CS program, interest in CS1, enthusiasm, difficulty, time spent and more. (Part of me thinks that these things are better tracked by looking at later enrolments in the course or degree transfers.) Weekly lecture attendance counts and enrolment were tracked, along with the standard university course evaluation.

There was a traditional environment available for comparison, from a previous offering, so they had collected all of that data. (If you’re going to make a change, establish a baseline first.) Sadly, the baselines were different for the different terms, so comparison wasn’t as easy.

The results? Across their population, 76% of students were not intending to pursue CS, 62% had no prior programming experience, and 53% were women! I was slightly surprised that traditional lecture attendance was overall higher, with a much steeper decline early on. For students who completed the course, the average mark for prep work was 81%, so the students were preparing the material but were then not attending the lecture. Hmm. This came out again in the ‘helpfulness’ graphs, where the online materials outscored the in-lecture activities. But the traditional lecture still outscored both – which makes me think this is a hearts and minds problem combined with some possible problems in the face-to-face activities. (Getting f2f right for flipped classes is hard and I sympathise entirely if this is a start-up issue.)

For those people who responded to both the pre- and post-course surveys, enthusiasm increased, but the surveys were done on paper and we already know that there was a drop in attendance, so this has a bias; the online university surveys also backed it up, though. In terms of perceptions of difficulty and time, women found the course harder and more time consuming. What was more surprising is that prior programming experience did not correlate with difficulty or time spent.

Outcomes? The drop rate was comparable to past offerings, with 25% of students dropping the course. The pass rate was also comparable, at 86%, and there was comparable performance on “standard” exam questions: no significant difference in the performance on those three questions. The students who were still attending at the end wanted more of these types of course, not really surprisingly.

Lessons learned – there was a lot learnt! On the resources side, video preparation took ~600 hours and development of in-class exercises took ~130 hours. The extra TA support cost money and, despite trying to make the load easier, two lecture hours per week were too few. (They’ve now reverted to three hours, most weekly two hour labs are replaced with online exercises and a TA drop-in help centre, which allows them to use the same TA resources as a traditional offering.) In terms of lecture delivery, the in-class exercises on paper were valuable test preparation. There was no review of the lecture material that had been pre-delivered (which is always our approach, by the way), so occasionally students had difficulty getting started. However, they do now start each lecture with a short worked example to prime the students on the material that they had seen before. (It’s really nice to see this because we’re doing almost exactly the same thing in our new Object Oriented Programming course!) They’ve now introduced a weekly online exercise to help students assess whether they should be coming to class, but lecture attendance is still lower than for the traditional course.

The take away is that the initial resource cost is pretty big but you then get to re-use it on more than one occasion, a pretty common result. They’re on their third offering, having made ongoing changes. The study has been re-run on the second offering and a follow-up paper will be presented as Horton et al., “Comparing Outcomes in Inverted and Traditional CS1”, which will appear in ITiCSE 2014.

They haven’t had the chance to ask the students why they’re not coming to the lectures but that would be very interesting to find out. A good talk to finish on!