What do we want? Passing average or competency always?

I’m at the Australasian Computer Science Week at the moment and I’m dividing my time between attending amazing talks, asking difficult questions, catching up with friends and colleagues and doing my own usual work in the cracks.  I’ve talked to a lot of people about my ideas on assessment (and beauty) and, as always, the responses have been thoughtful, challenging and helpful.

I think I know what the basis of my problem with assessment is, taking into account all of the roles that it can take. In an earlier post, I discussed Wolff’s classification of assessment tasks into criticism, evaluation and ranking. I’ve also made earlier (grumpy) notes about ranking systems and their arbitrary nature. One of the interesting talks I attended yesterday discussed the fragility and questionable accuracy of post-University exit surveys, which are used extensively in formal and informal rankings of Universities, yet don’t seem to meet many of the statistical or common-sense guidelines for efficacy that we already have.

But let’s put aside ranking for a moment and return to criticism and evaluation. I’ve already argued (successfully I hope) for a separation of feedback and grades from the criticism perspective. While they are often tied to each other, they can be separated and the feedback can still be useful. Now let’s focus on evaluation.

Remind me why we’re evaluating our students? Well, we’re looking to see if they can perform the task, apply the skill or knowledge, and reach some defined standard. So we’re evaluating our students to guide their learning. We’re also evaluating our students to indirectly measure the efficacy of our learning environment and us as educators. (Otherwise, why is it that there are ‘triggers’ in grading patterns to bring more scrutiny on a course if everyone fails?) We’re also, often accidentally, carrying out an assessment of the innate success of each class and socio-economic grouping present in our class, among other things, but let’s drill down to evaluating the student and evaluating the learning environment. Time for another thought experiment.

Thought Experiment 2

There are twenty tasks aligned with a particular learning outcome. It’s an important outcome and we evaluate it in different ways, but the core knowledge or skill is the same. Each of these tasks can receive a ‘grade’ of 0, 0.5 or 1: 0 means unsuccessful, 0.5 is acceptable, 1 is excellent. Student A attempts all twenty tasks and is acceptable in 19, unsuccessful in 1. Student B attempts the first 10 tasks, receives excellent in all of them and stops. Student C sets up a pattern of excellent, unsuccessful, excellent, unsuccessful… and so on, to receive 10 “excellent”s and 10 “unsuccessful”s. When we form an aggregate grade, A receives 47.5%, B receives 50% and C also receives 50%. Which of these students is the most likely to successfully complete the task?
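As a quick check of the arithmetic, here is a minimal sketch (in Python, my choice purely for illustration) that reproduces the aggregate percentages above, assuming the twenty tasks are equally weighted and that unattempted tasks score 0.

```python
# Aggregate grades for the three students in the thought experiment.
# Assumptions: 20 equally weighted tasks; unattempted tasks count as 0.
UNSUCCESSFUL, ACCEPTABLE, EXCELLENT = 0.0, 0.5, 1.0
TASKS = 20

student_a = [ACCEPTABLE] * 19 + [UNSUCCESSFUL]     # acceptable in 19, unsuccessful in 1
student_b = [EXCELLENT] * 10                       # excellent in 10, then stops
student_c = [EXCELLENT, UNSUCCESSFUL] * 10         # alternating pattern across all 20

def aggregate(grades, total_tasks=TASKS):
    """Aggregate grade as a percentage of the marks available across all tasks."""
    return 100.0 * sum(grades) / total_tasks

for name, grades in [("A", student_a), ("B", student_b), ("C", student_c)]:
    print(f"Student {name}: {aggregate(grades):.1f}%")
# Student A: 47.5%, Student B: 50.0%, Student C: 50.0%
```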

This framing allows us to look at the evaluation of the student in a meaningful way. “Who will pass the course?” is not the question we should be asking; it’s “Who will be able to reliably demonstrate mastery of the skills or knowledge that we are imparting?” Passing the course has a naturally discrete focus of attention: concentrate on n assignments and m exams, and pass. Continual demonstration of mastery is a different goal. This framing also allows us to examine the learning environment because, without looking at the design, I can’t tell you whether B’s and C’s behaviour is problematic or not.


A has undertaken the most tasks to an acceptable level, but an artefact of grading (or bad luck) has dropped the mark below 50%, which would be a fail (aggregate less than acceptable) in many systems. B has performed excellently on every task attempted but, being aware of the marking scheme, has behaved strategically, optimised and walked away. (Many students who perform at this level wouldn’t, I’m aware, but we’re looking at the implications of this.) C has a troublesome pattern that produces the same outcome as B but with half the success rate.

Before we answer the original question (which student is most likely to succeed), I can nominate C as the most likely to struggle, because C has the most “unsuccessful”s. From a simple probabilistic argument, 10/20 successes is worse than 19/20. It’s a bit trickier comparing 10/10 and 10/20 (because of confidence intervals), but 10/20 has an Adjusted Wald range of +/- 20% and 10/10 comes down by about 14%, so the highest possible ‘real’ measure for C is 14/20 and the lowest possible ‘real’ measure for B is (scaled) 15/20. They don’t overlap, so we can say that B appears to be more successful than C as well.
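For readers who want to play with this kind of comparison themselves, here is a minimal sketch of the Adjusted Wald (Agresti–Coull) interval. It is illustrative only: the exact bounds depend on the confidence level and the variant of the formula used, and I have assumed a roughly 90% two-sided interval (z ≈ 1.645), which is not necessarily the setting behind the figures quoted above.

```python
import math

def adjusted_wald(successes, trials, z=1.645):
    """Adjusted Wald (Agresti-Coull) interval for a binomial proportion.

    Assumption: z = 1.645 gives a ~90% two-sided interval; the post does not
    state which confidence level it used, so treat these bounds as illustrative.
    """
    adj_n = trials + z ** 2
    adj_p = (successes + (z ** 2) / 2) / adj_n
    margin = z * math.sqrt(adj_p * (1 - adj_p) / adj_n)
    return max(0.0, adj_p - margin), min(1.0, adj_p + margin)

for name, s, n in [("A", 19, 20), ("B", 10, 10), ("C", 10, 20)]:
    lo, hi = adjusted_wald(s, n)
    print(f"Student {name}: {s}/{n} -> [{lo:.2f}, {hi:.2f}]")
# Under these assumptions, B's interval sits entirely above C's,
# while A's and B's intervals overlap substantially.
```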

From a learning design perspective, do our evaluation artefacts have an implicit design that explains C’s pattern? Is there a difference we’re not seeing? Setting aside any ranking of likelihood to pass our evaluative framework, C’s pattern is so unusual (high success combined with a lack of any progress) that we learn something immediately from it, whether that’s that C is struggling or that we need to review mechanisms we thought were equivalent!

But who is more likely to succeed out of A and B? 19/20 and 10/10 are barely distinguishable in statistical terms! The question for us now is how many evaluations of a given skill or piece of knowledge are required for us to be confident of competence. This totally breaks the discrete model of cramming for exams and focusing on assignments, because all of our science is built on the notion that evidence is accumulated through observation and analysis of what occurred, in order to construct models that predict future behaviour. In this case, our goal is to see whether our students are competent.

I can never be 100% sure that my students will be able to perform a task but what is the level I’m happy with? How many times do I have to evaluate them at a skill so that I can say that x successes in y attempts constitutes a reliable outcome?

If we say that a student has to reliably succeed 90% of the time, we face the problem that just testing them ten times isn’t enough for us to be sure that they’re hitting 90%.

But the level of performance we need before we can be confident is quite daunting. Looking at the statistics, we can see that if we provide a student with 150 opportunities to demonstrate knowledge and they succeed 143 times, then it is very likely that their real success level is at least 90%.

If we say that competency is measured by a success rate that is greater than 75%, a student who achieves 10/10 has immediately met that but even succeeding at 9/9 doesn’t meet that level.
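Here is a small extension of the interval sketch above that checks both of these claims: the 143-out-of-150 case against a 90% competency threshold, and 10/10 versus 9/9 against a 75% threshold. As before, this assumes a ~90% two-sided Adjusted Wald interval (z ≈ 1.645), which is my assumption rather than a level stated in the post; a stricter confidence level pushes the required numbers out even further.

```python
import math

def adjusted_wald_lower(successes, trials, z=1.645):
    """Lower bound of an Adjusted Wald interval (z = 1.645 assumed, ~90% two-sided)."""
    adj_n = trials + z ** 2
    adj_p = (successes + (z ** 2) / 2) / adj_n
    return max(0.0, adj_p - z * math.sqrt(adj_p * (1 - adj_p) / adj_n))

cases = [
    (143, 150, 0.90),  # is "at least 90%" plausible after 143 successes in 150?
    (10, 10, 0.75),    # does a perfect 10/10 clear a 75% competency bar?
    (9, 9, 0.75),      # does a perfect 9/9 clear the same bar?
]
for s, n, threshold in cases:
    lower = adjusted_wald_lower(s, n)
    verdict = "meets" if lower >= threshold else "does not meet"
    print(f"{s}/{n}: lower bound {lower:.3f} {verdict} the {threshold:.0%} threshold")
# Under these assumptions: 143/150 and 10/10 clear their thresholds; 9/9 falls just short.
```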

What this tells us (and reminds us) is that our learning environment design is incredibly important and it must start from a clear articulation of what success actually means, what our goals are and how we will know when our students have reached that point.

There is a grade separation between A and B but it’s artificial. I noted that it was hard to distinguish A and B statistically, but there is one important difference in the lower bounds of their confidence intervals: A’s is less than 75%, B’s is slightly above.

Now we have to deal with the fact that A and B were both competent (if not identical) for the first ten tests, and that A had demonstrated competence more often than B right up until the 20th, failed, test. This has enormous implications for how we structure evaluation, for how many successful repetitions define success, and for how many ‘failures’ we can tolerate and still say that A and B are competent.

Confused? I hope not, but I hope that this is making you think about evaluation in ways that you may not have before.

 


Can we do this? We already have.

How does one actually turn everything I’ve been saying into a course that can be taught? We already have examples of this working, whether in the performance/competency-based models found in medical schools around the world or in mastery-learning approaches, where we do not measure anything except whether a student has demonstrated sufficient knowledge or skill to show an appropriate level of mastery.

An absence of grades, or student control over their grades, is not as uncommon as many people think. MIT in the United States gives students their entire first semester with no grades more specific than pass or fail. This is a deliberate decision to ease the transition of students who have gone from being leaders at their own schools to the compressed scale of MIT. Why compressed? If we were to assess all school students then we would need a scale that could measure all levels of ability, from ‘not making any progress at school’ to ‘transcendent’. The tertiary entry band is somewhere between ‘passing school studies’ and ‘transcendent’ and, depending upon the college that you enter, can shift higher and higher as your target institution becomes more exclusive. If you look at the MIT entry requirements, they are a little coy about ‘per student’ adjustments, but when the 75th percentile for the SAT components is 800, 790 and 790, and 800, 800 and 800 would be perfect, we can see that any arguments about how demotivating simple pass/fail grades must be for excellent students have not just withered; they have caught fire and the ash has blown away. When the target is MIT, it appears the freshmen can get their heads around a system that is even simpler than Rapaport’s.

[Image: the MIT Dome at night]

Pictured: A highly prestigious University with some of the most stringent entry requirements in the world, which uses no grades in first semester.

Other universities, such as Brown, deliberately allow students to choose how their marks are presented, as they wish to de-emphasise the numbers in order to focus on education. It is not a cakewalk to get into Brown, as these figures attest, and yet Brown have made a clear statement that they have changed their grading system in order to change student behaviour – and the world is just going to have to deal with that. It doesn’t seem to be hurting their graduates, judging from quotes on the website such as “Our 85% admission rate to medical school and 89% admission rate to law school are both far above the national average.”

And, returning to medical schools themselves, my own University runs a medical program where the usual guidelines for grading do not hold. The medical school runs on a performance/competency scheme, where students who wish to practise medicine must demonstrate that they are knowledgeable, skilful and safe to practise. Medical schools have identified the core problem in my thought experiment, where two students could have the opposite sets of knowledge or skills, and they have come to the same logical conclusion: decide what is important and set up a scheme that works for it.

When I was a soldier, I was responsible for much of the Officer Training in my home state for the Reserve. We had any number of things to report on for our candidates, across knowledge and skills, but one of them was “Demonstrate the qualities of an officer” and this single item could fail an otherwise suitable candidate. If a candidate could not be trusted to one day be in command of troops on the battlefield, based on problems we saw in peacetime, then they would be counselled to see if it could be addressed and, if not, let go. (I can assure you that this was not used often and it required a large number of observations and discussions before we would pull that handle. The power of such a thing forced us to be responsible.)

We know that limited-scale, mastery-based approaches are working not just in the vocational sector but in allied sectors (such as the military), in the Ivy League (Brown) and in highly prestigious non-Ivy League institutions such as MIT. But we also know of examples such as Harvey Mudd, who proudly state that only seven students since 1955 have earned a 4.0 GPA and who have a post on the career blog devoted to “explaining why your GPA is so low”. And, be in no doubt, Harvey Mudd is an excellent school, especially for my discipline. I’m not criticising their program, I’ve only heard great things about them, but when you have to put up a page like that? You’re admitting that there’s a problem but you are pushing it on to the student to fix it. Contrast that with Brown, who say to employers “look at our students, not their grades” (at least on the website).

Feedback to the students on their progress is essential. Being able to see what your students are up to is essential for the teacher. Being able to see what your staff and schools are doing is important for the University. Employers want to know who to hire. Which of these is the most important?

The students. It has to be the students. Doesn’t it? (Arguments for the existence of Universities as a self-sustaining bureaucracy system in the comments, if you think that’s a thing you want to do.)

This is not an easy problem but, as we can see, we have pieces of the solution all over the place. Tomorrow, I’m going to put in place a cornerstone of beautiful assessment that I haven’t seen provided elsewhere or explained in this way. (Then all of you can tell me which papers I should have read to get it from, I can publish the citation, and we can all go forward.)

 


ITiCSE 2014, Day 3, Session 6A, “Digital Fluency”, #ITiCSE2014 #ITiCSE

[Image: a portrait in the main hall]

I’m at the Ångström Laboratory of Uppsala, so this portrait hangs in the main hall. Hi, Anders!

The first paper was “A Methodological Approach to Key Competences in Informatics”, presented by Christina Dörge. The motivation for this study is the move in educational standards from input-oriented approaches to output-oriented approaches – how students will use what you teach them in later life. Key competencies are important, but what are they? What are the definitions, terms and real meaning of the words “key competencies”? A certificate of a certain grade or qualification doesn’t actually reflect true competency in many regards. (Bologna focuses on competencies, but what do we really mean by that?) Competencies also vary across disciplines, as skills are used differently in different areas – can we develop a non-normative approach to this?

The author discussed Qualitative Content Analysis (QCA) to look at different educational methods in the German educational system: hardware-oriented approaches, algorithm-oriented, application-oriented, user-oriented, information-oriented and, finally, system-oriented. The paradigm of teaching has shifted a lot over time (including the idea-oriented approach which is subsumed in system-oriented approaches). Looking across the development of the paradigms and trying to work out which categories developed requires a coding system over a review of textbooks in the field. If new competencies were added, then they were included in the category system and the coding started again. The resulting material could be referred to as “Possible candidates of Competencies in Informatics”, but those that are found in all of the previous approaches should be included as Competencies in Informatics. What about the key ones? Which of these are found in every part of informatics: theoretical, technical, practical and applied (under the German partitioning)? A key competency should be fundamental and ubiquitous.

The most important key competencies, by ranking, were algorithmic thinking, followed by design thinking, then analytic thinking (I must look up the subtle difference here). (The paper contains all of the details.) How can we gain competencies, especially these key ones, outside of a normative model that we have to apply to all contexts? We would like to be able to build on competencies, regardless of entry point, but taking into account prior learning, so that we can build towards a professional end point regardless of starting point. What do we want to teach in the universities and to what degree?

The author finished on this point and it’s a good question: if we view our progression in terms of competencies, then how can we use these as building blocks to higher-level competencies? This will help us in designing prerequisites and entry and exit points for all of our educational design.

The next talk was “Weaving Computing into all Middle School Disciplines”, presented by Susan Rodger from Duke. There were a lot of co-authors who were undergraduates (always good to see). The motivation for this project was that there are problems with CS in the K-12 grades. It’s not taught in many schools and is definitely missing in many high schools – not all Unis teach CS (?!?). Students don’t actually know what it is (the classic CS identity problem). There are also under-represented groups (women and minorities). Why should we teach it? 21st-century skills, rewarding careers and many useful skills – from NCWIT.org.

Schools are already content-heavy, so how do we convince people to add new courses? We can’t really, so how about trying to weave it into the existing project framework? Instead of doing a poster or a PowerPoint presentation, why not provide an animation that’s interactive in some way and that involves computing? One way to achieve this is to use Alice, creating interactive stories or games and learning programming and computational concepts in a drag-and-drop code approach. Why Alice? There are many other good tools (Greenfoot, Lego, Scratch, etc.) – well, it’s drag-and-drop, story-based and works well for women. The introductory Alice course in 2005 started to attract more women and now the class is more than 50% women. However, many people couldn’t come in because they didn’t have the prerequisites, so the initiative moved out to grades 4-6 to develop these skills earlier. Alice Virtual Worlds excited kids about computing, even at the younger ages.

The course “Adventures in Alice Programming” is aimed at grades 5-12 as outreach, without having to rely on computing teachers (which would be a major restriction). There are two-week teacher workshops where, initially, the teachers are taught Alice for a week and then, in the following week, they develop lesson plans. There’s a one-week follow-up workshop the following summer. This initiative has run since 2008 and is funded until Summer 2015. There are sites in Durham, Charleston and Southern California. The teachers coming in are from a variety of disciplines.

How is this used in middle and high schools by teachers? Demonstrations, examples, interactive quizzes and making worlds for students to view. The students may be able to undertake projects, take and build quizzes, and view and answer questions about a world – and the older the student, the more they can do.

Recruitment of teachers has been interesting. Starting from mailing lists and asking the teachers who come, the advertising has spread out across other conferences. It really helps to give them education credits and hours – but if we’re going to pay people to do this, how much do we need to pay? In the first workshop, paying $500 attracted a lot of teachers (some of whom were interested in Alice). For the next workshop, they got gas money ($50/week) and this reduced the numbers to the more interested teachers.

There are a lot of curriculum materials available for free (over 90 tutorials), with getting-started material as a one-hour tutorial showing basic set-up, placing objects, camera views and so on. There are also longer tutorials covering several different stories. (Editor’s note: could we get away from the Princess/Dragon motif? The Princess says “Help!” and waits there to be rescued and then says “My Sweet Prince. I am saved.” Can we please arm the Princess or save the Knight?) There are also tutorial topics on inheritance, lists and parameter usage. The presenter demonstrated a lot of different things you can do with Alice, including book reports and tying Alice animations into the real world – such as boat trips that didn’t actually happen.

It was weird looking at the examples, and I’m not sure if it was just because of the gender of the authors, but the kitchen example for cooking with Spanish-language instruction used female characters, the Princess/Dragon example had a woman in a very passive role, and the adventure-game example had a male character standing in the boat. It was a small sample of the materials, so for the time being I’m assuming that this was just a coincidence, or that it reflects the gender of the creator. Hmm. Another example, and this time the Punnett Squares example has a grey-haired male scientist standing there. Oh dear.

Moving on: lots of helper objects are available for teachers to use, which saves on development time and is really handy if you want to get things going quickly.

Finally, on the impact: around 200 teachers have attended the workshops since 2008, and they have gone on to teach some 2,900 students (over 2012-2013). From Google Analytics, over 20,000 users have accessed the materials. A number of small outreach activities, Alice for an hour, have also been run across a range of schools.

The final talk in this session was “Early validation of Computational Thinking Pattern Analysis”, presented by Hilarie Nickerson, from the University of Colorado at Boulder. Computational thinking is important and, in the US, there have been both scope and pedagogy discussions, as well as instructional standards. We don’t have as much teacher education as we’d like. Assuming that we want the students to understand it, how can we help the teachers? Scalable Game Design integrates game and simulation design into public-school curricula. The intention is to broaden participation for all kinds of schools, as after-school classes had identified a lot of differences between the groups.

What’s the expectation of computational thinking? Administrators and industry want us to be able to take game knowledge and potentially use it for scientific simulation. A good game of a piece of ocean is also a predator-prey model, after all. Does it work? Well, it’s spread across a wide range of areas and communities, with more than 10,000 students (and a lot of different Frogger games). Do they like it? There’s a perception that programming is cognitively hard and boring (on the cognitive/affective graph ranging from easy-hard/exciting-boring). We want it to be easy and exciting. We can make it easier with syntactic and semantic support, but making it exciting requires the students to feel ownership and to be able to express their creativity. And now they’re looking at the zone of proximal flow, which I’ve written about here. It’s good to see this working in a project-first, principles-first model for these authors. (Here’s that picture again.)

[Figure from A. Repenning, “Programming Goes to School”, CACM, 55(5), May 2012.]

The results? The study spanned 10,000 students, 45% girls and 55% boys (pretty good numbers!), 48% from underrepresented groups, with some middle schools exposing 350 students per year. The motivation starts by making things achievable but challenging – starting from 2D basics and moving up to more sophisticated 3D games. For those who wish to continue: 74% of boys, 64% of girls and 69% of minority students want to continue. There are other aspects that can raise motivation.

What about the issue of computing Computational Thinking? The authors have created a Computational Thinking Pattern Analysis (CTPA) instrument that can track student learning trajectories and outcomes. Guided discovery, as a pedagogy, is very effective in raising motivation for both genders, whereas direct instruction is far less effective for girls (and is also less effective for boys).

How do we validate this? There are several computational thinking patterns, grouped using latent semantic analysis. One of the simpler patterns for a game is the generation and absorption pair, where we add things to the game world (trucks in Frogger or fish in predator/prey) and then remove them (the truck gets off the screen, the fish gets eaten). We also need collision detection. Measuring development across these skills allows you to measure a student’s work in comparison to the tutorial and to other students. What does CTPA actually measure? The presence of code patterns that correspond to computational thinking constructs suggests student skill with computational thinking (but doesn’t prove it), and this is different from measuring learning. The graphs produced from this can be represented as a single number, which is used for validation. (See the paper for the calculation!)

This has been running for two years now, with 39 student grades for 136 games, and the two human graders were shown to have good inter-rater consistency. Frogger was not very heavily correlated (Spearman rank correlation) but Sokoban, Centipede and The Sims weren’t bad, and removing the design aspects of the rubrics may improve this.

Was there predictive validity in the project? Did the CTPA correlate with the skill score of the final game produced? Yes, it appears to be significant, although this is early work. CTPA does appear to be capable of measuring CT patterns in code that correlate with human-assessed skill development. Future work includes the refinement of CTPA by dealing with the issue of non-orthogonal constructs (collisions that include generative and absorptive aspects), using more information about the rules, and examining alternative calculations. The group are also working on tools for teachers, including REACT (real-time visualisations for progress assessment) and recommending possible skill trajectories based on skill progression.