Is It Called Ranking Because It Smells Funny?

Years ago, I was a professional winemaker, which is an awesome job but with very long hours (that seems to be a trend for me). One of the things that we did a lot in winemaking was to assess the quality of wine to work out if we’d made what we wanted to but also to allow us to blend this parcel with that parcel and come up with a better wine. Wine judging, for wine shows, is an important part of getting feedback on the quality of your wine as it’s perceived by other professionals. Wine is judged on a 20 point scale most of the time, although some 100 point schemes are in operation. The problem is that this scale is not actually as wide as it might look. Wines below 12/20 are usually regarded as faulty or not at commercial level – so, in reality, most wine shows are working in the range 12-19.5 (20 was relatively rare but I don’t know what it’s like now). This gets worse for the “100 point” ranges, where Wine Spectator claim to go from 50-100, but James Halliday (a prominent wine critic) rates from 75-100, where ‘Good’ starts at 80. This is really quite unnecessarily confusing, because it means that James Halliday is effectively using a version of the 16 available ranks (12-19.5 at 0.5 interval) of the 20 point scale, mapped into a higher range.

Of course, the numbers are highly subjective, even to a very well trained palate, because the difference between an 87 and an 88 could be colour, or bouquet, or flavour – so saying that the wine at 88 is better doesn’t mean anything unless you know what the rater actually means by that kind of ranking. I used to really enjoy the wine selections of a wine writer called Mark Shields, because he used a straightforward rating system and our palates were fairly well aligned. If Mark liked it, I’d probably like it. This is the dirty secret of any ranking mechanism that has any aspect of subjectivity or weighting built into it – it needs to agree with your interpretation of reasonable or you will always be at odds with it.

In terms of wine, the medal system that is in use really does give us four basic categories: commercially sound (no medal), better than usual (bronze), much better than usual (silver) and “please pass me another bottle” (gold). On top of that, you have the ‘best in show’ effectively which says that, in this place and from these tasters, this was the best overall in this category. To be frank, the gold wines normally blow away the bronzes and the no-awards, but the line between silver and gold is a little more blurred. However, the show medals have one advantage in that a given class has been inspected by the same people and the wines have actually been compared (in one sense) and ranked. However, if nothing is outstanding then no medals will be awarded because it is based on the marks on the 20 point scale, so if all the wines come in at 13, there will be no gongs – there doesn’t have to be a gold or a silver, or even a bronze, although that would be highly unusual. More subtly, gold at one show may not even get bronze at another – another dirty little secret of subjective ranking, sometimes what you are comparing things to makes a very big difference.

Which brings me to my point, which is the ranking of Universities. You’re probably aware that there are national and international rankings of Universities across a range of metrics, often related to funding and research, but that the different rankings have broad agreement rather than exact agreement as to who is ‘top’, top 5 and so on. The Times Higher Education supplement provides a stunning array of snazzy looking graphics, with their weightings as to what makes a great University. But, when we look at this, and we appear to have accuracy to one decimal place (ahem), is it significant that Caltech is 1.8 points higher than Stanford? Is this actually useful information in terms of which university a student might wish to attend? Well, teaching (learning environment) makes up 30% of the score, international outlook makes up 7.5%, industry income makes up 2.5%, research is 30% (volume, income and reputation) and citations (research influence) make up the last 30%. If we sort by learning environment (because I am a curious undergraduate, say) then the order starts shifting, not at the very top of the list, but certainly further down – Yale would leap to 4th in the US instead of 9th. Once we get out of the top 200, suddenly we have very broad bands and, honestly, you have to wonder why we are still putting together the numbers if the thing that people appear to be worrying about is the top 200. (When you deem worthiness on a rating scale as only being a subset of the available scale, you rapidly turn something that could be considered continuous into something with an increasingly constrained categorical basis.) But let’s go to the Shanghai rankings, where Caltech is dropped from number 1 to number 6. Or the QS World Rankings, who rate Caltech as #10.
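
To see how much the choice of weighting drives the order, here is a minimal sketch in Python. The weights are the ones quoted above; the three universities and their sub-scores are invented purely for illustration and are not real ranking data.

```python
# Weights as quoted in the text (they sum to 100%).
WEIGHTS = {
    "teaching": 0.30,
    "international": 0.075,
    "industry": 0.025,
    "research": 0.30,
    "citations": 0.30,
}

# Hypothetical sub-scores out of 100. These are NOT real ranking data.
SCORES = {
    "Uni A": {"teaching": 85.0, "international": 60.0, "industry": 90.0,
              "research": 98.0, "citations": 99.0},
    "Uni B": {"teaching": 95.0, "international": 70.0, "industry": 40.0,
              "research": 90.0, "citations": 92.0},
    "Uni C": {"teaching": 90.0, "international": 95.0, "industry": 50.0,
              "research": 88.0, "citations": 97.0},
}


def overall(sub_scores):
    """Weighted sum of the sub-scores using WEIGHTS."""
    return sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS)


def rank_by(key_func):
    """Return university names sorted best-first by key_func(sub_scores)."""
    return sorted(SCORES, key=lambda name: key_func(SCORES[name]), reverse=True)


overall_order = rank_by(overall)
teaching_order = rank_by(lambda s: s["teaching"])

print("By overall score:", overall_order)
print("By teaching only:", teaching_order)
```

With these made-up numbers the overall order is the exact reverse of the teaching-only order, which is the point: the same data can crown a different ‘winner’ depending on what the reader actually cares about.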

Obviously, there is no doubt about the general class of these universities, but it does appear that the judges are having some difficulty in consistently awarding best in class medals. This would be of minor interest, were it not for the fact that these ratings do actually matter in terms of industry confidence in partnership, in terms of attracting students from outside of your home educational system and in terms of who gets to be the voices who decide what constitutes a good University. It strikes me that broad classes are something that could apply quite well here. Who really cares whether Caltech is 1, 6 or 10 – it’s obviously rating well across the board and, barring catastrophe, always will.

So why keep ranking it? What we’re currently doing is polishing the door knob on the ranking system, devoting effort to ranking Universities like Caltech, Stanford, Harvard, Cambridge and Oxford, who could not, with any credibility, be ranked low – or we’d immediately think that the ranking mechanism was suspect. So let’s stop ranking them, because it’s compressing the ranking at a point where the ranking is not even vaguely informational. What would be interesting is more devotion to the bands further down, where a University can assess its global progress against its peers to find out if it’s cutting the mustard.

If I put a bottle of Grange (one of Australia’s best red wines and it is pretty much worthy of its reputation if not its price) into a wine show and it came back with less than 17/20, I’d immediately suspect the rating system and the professionalism of the judges. The question is, of course, why would I put it in other than to win gold medals – what am I actually achieving in this sense? If it’s a commercial decision to sell more wine then I get this but wine is, after all, just wine and you and I drink it the same way. Universities, especially when ranked across complex weighted metrics and by different people, are very different products to different people. The single figure ranking may carry prestige and probably attracts both students and money but should it? Does it make any sense to be so detailed (one decimal place, indeed) about how one stacks up against another, when in reality you have almost exponentially separated groups – my University will never ‘challenge’ Caltech, and if Caltech ‘drops’ to the graded level of the University of Melbourne (one of our most highly ranked Unis), I’m not sure that the experience will tell Caltech anything other than “Ruh oh!”

The Scooby Gang, stunned that Caltech was now in the range 50-100.

If I could summarise all of this, it would be to say that our leader board and ranking obsession would be fine, were it not for the amount of time spent on these things, the weight placed upon what is ultimately highly subjective even in terms of the weighting, and the lack of any clear definition of how these rankings can be used to make sensible decisions. Perhaps there is something more useful we could be doing with our time?

Taught for a Result or Developing a Passion

According to a story on the website of the Australian national broadcaster, the ABC, Australian school children are now ranked 27th out of 48 countries in reading, according to the Progress in International Reading Literacy Study (PIRLS), and a quarter of Australia’s Year 4 students have failed to meet the minimum standard defined for reading at their age. As expected, the Australian government has said “something must be done” and the Australian Federal Opposition has said “you did the wrong thing”. Ho hum. Reading the document itself is fascinating because our fourth graders apparently struggle once we move into the area of interpretation and integration of ideas and information, but do quite well on simple inference. There is a lot of scope for thought about how we are teaching, given that we appear to have a reasonably Bloom-like breakdown on the data but I’ll leave that to the (other) professionals. Another international test, the Programme for International Student Assessment (PISA), which is applied to 15 year olds and measures reading, mathematics and science, is something that we rank relatively highly in. (And, for the record, we’re top 10 on the PISA rankings after a Year 4 ranking of 27th. Either something has gone dramatically wrong in the last 7 years of Australian Education, or Year 4 results on PIRLS don’t have as much influence as we might have expected on the PISA.) We don’t yet have the results for this but we expect them out soon.

The PISA report front cover (C) OECD.

What is of greatest interest to me from the linked article on the ABC is the Oslo University professor, Svein Sjoberg, who points out that comparing educational systems around the globe is potentially too difficult to be meaningful – which is a refreshingly honest assessment in these performance-ridden and leaderboard-focused days. As he says:

“I think that is a trap. The PISA test does not address the curricular test or the syllabus that is set in each country.”

Like all of these tests, PIRLS and PISA measure a student’s ability to perform on a particular test and, regrettably, we’re all pretty much aware, or should be by now, that using a test like this will give you the results that you built the test to give you. But one thing that really struck me from his analysis of the PISA was that the countries who perform better on the PISA Science ranking generally had a lower measure of interest in science. Professor Sjoberg noted that this might be because the students had been encouraged to become result-focused rather than to develop a passion.

If Professor Sjoberg is right, then this is not just a tragedy, it’s an educational catastrophe – we have now started optimising our students to do well in tests but be less likely to go and pursue the subjects in which they can get these ‘good’ marks. If this nasty little correlation holds, then we will have an educational system that dominates in the performance of science in the classroom, but turns out fewer actual scientists – our assessment is no longer aligned to our desired outcomes. Of course, what it is important to remember is that the vast majority of these rankings are relative rather than absolute. We are not saying that one group is competent or incompetent, we are saying that one group can perform better or worse on a given test.

Like anything, to excel at a particular task, you need to focus on it, practise it, and (most likely) prioritise it above something else. What Professor Sjoberg’s analysis might indicate, and I realise that I am making some pretty wild conjecture on shaky evidence, is that certain schools have focused the effort on test taking, rather than actual science. (I know, I know, shock, horror) Science is not always going to fit into neat multiple choice questions or simple automatically marked answers to questions. Science is one of the areas where the viva comes into its own because we wish to explore someone’s answer to determine exactly how much they understand. The questions in PISA theoretically fall into roughly the same categories (MCQ, short answer) as the PIRLS so we would expect to see similar problems in dealing with these questions, if students were actually having a fundamental problem with the questions. But, despite this, the questions in PISA are never going to be capable of gauging the depth of scientific knowledge, the passion for science or the degree to which a student already thinks within the discipline. A bigger problem is the one which always dogs standardised testing of any sort, and that is the risk that answering the question correctly and getting the question right may actually be two different things.

Years ago, I looked at the examination for a large company’s offering in a certain area (I have no wish to get sued so I’m being deliberately vague) and it became rapidly apparent that on occasion there was a company answer that was not the same as the technically correct answer. The best way to prepare for the test was not to study the established base of the discipline but to read the corporate tracts and practise the skills on the approved training platforms, which often involved a non-trivial fee for training attendance. This was something that was tangential to my role and I was neither of a sufficiently impressionable age nor strongly bothered enough by it for it to affect me. Time was a factor and incorrect answers cost you marks – so I sat down and learned the ‘right way’ so that I could achieve the correct results in the right time and then go on to do the work using the actual knowledge in my head.

However, let us imagine someone who is 14 or 15 and, on doing the practice tests for ‘test X’, discovers that what is important is hitting precisely the right answer in the shortest time – thinking about the problem in depth is not really on the table for a two-hour exam, unless it’s highly constrained and students are very well prepared. How does this hypothetical student retain respect for teachers who talk about what science is, the purity of mathematics, or the importance of scholarship, when the correct optimising behaviour is to rote-learn the right answers, or the safe and acceptable answers, and reproduce those on demand? (Looking at some of the tables in the PISA document, we see that the best performing nations in the top band of mathematical thinking are those with amazing educational systems – the desired range – and those who reputedly place great value in high power-distance classrooms with large volumes of memorisation and received wisdom – which is probably not the desired range.)

Professor Sjoberg makes an excellent point, which is that we are not going to work out what is in need of fixing, and what is good, about the Australian education system by looking at single figure representations of our international rankings, especially when the rankings contradict each other on occasion! Not all countries are the same, pedagogically, in terms of their educational processes or their power distances, and adjacency of rank is no guarantee that the two educational systems are the same (Finland, next to Shanghai-China for instance). What is needed is reflection upon what we think constitutes a good education, followed by meaningful local measures that allow us to work out how we are doing with our educational system. If we get the educational system right then, if we keep a bleary eye on the tests we use, we should then test well. Optimising for the tests takes the effort off the education and puts it all onto the implementation of the test – if that is the case, then no wonder people are less interested in a career of learning the right phrase for a short answer or the correct multiple-choice answer.

Ebb and Flow – Monitoring Systems Without Intrusion

I’ve been wishing a lot of people “Happy Thanksgiving” today because, despite being frightfully Antipodean, I have a lot of friends and family who are Thanksgiving observers in the US. However, I would know that something was up in the US anyway because I am missing about 40% of my standard viewers on my blog. Today is an honorary Sunday – hooray, sleep-ins all round! More seriously, this illustrates one of the most interesting things about measurement, which is measuring long enough to be able to determine when something out of the ordinary occurs. As I’ve already discussed, I can tell when I’ve been linked to a higher profile blog because my read count surges. I also can tell when I haven’t been using attractive pictures because the count drops by about 30%.

A fruit bat, in recovery, about to drink its special fruit smoothie. (Yes, this is shameless manipulation.)

This is because I know what the day-to-day operation of the blog looks like and I can spot anomalies. When I was a network admin, I could often tell when something was going wrong on the network just because of the way that certain network operations started to feel, and often well before these problems reached the level where they would trigger any sort of alarm. It’s the same for people who’ve lived by the same patch of sea for thirty years. They’ll look at what appears to be a flat sea on a calm day and tell you not to go out – because they can read a number of things from the system and those things mean ‘danger’.
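
The idea of knowing the baseline well enough to spot an anomaly can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the daily view counts are invented, the function name `find_anomalies` is hypothetical, and the cut-off of two standard deviations is an arbitrary choice rather than anything I actually use on the blog.

```python
from statistics import mean, stdev


def find_anomalies(baseline, recent, threshold=2.0):
    """Return (day_index, count, z_score) for recent counts far from the baseline.

    baseline:  counts from a period assumed to be normal.
    recent:    counts to check against that baseline.
    threshold: how many standard deviations counts as unusual (an assumption).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation, so z-scores are undefined")
    flagged = []
    for i, count in enumerate(recent):
        z = (count - mu) / sigma
        if abs(z) >= threshold:
            flagged.append((i, count, round(z, 2)))
    return flagged


# Hypothetical daily view counts, invented for illustration.
normal_days = [100, 96, 104, 98, 102, 101, 99, 97, 103, 100]
this_week = [101, 98, 60, 140, 99]  # a holiday-style dip and a linked-post surge

anomalies = find_anomalies(normal_days, this_week)
print(anomalies)
```

Note that nothing here probes the system: the check only uses counts that were being collected anyway, and a longer baseline gives a steadier idea of what ‘ordinary’ looks like.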

One of the reasons that the network example is useful is that any time you send data through the network to see what happens, you’re actually using the network to do it. So network probes will actually consume network bandwidth and this may either mask or exacerbate your problems, depending on how unlucky you are. However, using the network for day-to-day operations, and sensing that something is off, then gives you a reason to run those probes or to check the counters on your networking gear to find out exactly why the hair on the back of your neck is going up.

I observe the behaviour of my students a lot and I try to gain as much information as I can from what they already give me. That’s one of the reasons that I’m so interested in assignment submissions, because students are going to submit assignments anyway and any extra information I can get from this is a giant bonus! I am running a follow-up Piazza activity on our remote campus and I’m fascinated to be able to watch the developing activity because it tells me who is participating and how they are participating. For those who haven’t heard about Piazza, it’s like a Wiki but instead of the Wiki model of “edit first, then argue into shape”, Piazza encourages a “discuss first and write after consensus” model. I put up the Piazza assignment for the class, with a mid-December deadline, and I’ve already had tens of registered discussions, some of which are leading to edits. Of course, not all groups are active yet and, come Monday, I’ll send out a reminder e-mail and chat to them privately. Instead of sending a blanket mail to everyone saying “HAVE YOU STARTED PIAZZA”, I can refine my contact based on passive observation.

The other thing about Piazza is that, once all of the assignment is over, I can still see all of their discussions, because that’s where I’ve told them to have the discussion! As a result, we can code their answers and track the development of their answers, classifying them in terms of their group role, their level of function and so on. For an open-ended team-based problem, this allows me a great deal of insight into how much understanding my students have of the area and allows me to fine-tune my teaching. Being me, I’m really looking for ways to improve self-regulation mechanisms, as well as uncovering any new threshold concepts, but this nonintrusive monitoring has more advantages than this. I can measure participation by briefly looking at my mailbox to see how many mail messages are foldered under a particular group’s ID, from anywhere, or I can go to Piazza and see it unfolding there. I can step in where I have to, but only when I have to, to get things back on track but I don’t have to probe or deconstruct a team-formed artefact to see what is going on.

In terms of ebb and flow, the Piazza groups are still unpredictable because I don’t have enough data to be able to tell you what the working pattern is for a successful group. I can tell you that no activity is undesirable but, even early on, I could tell you some interesting things about the people who post the most! (There are some upcoming publications that will deal with things along these lines and I will post more on these later.) We’ve been lucky enough to secure some Summer students and I’m hoping that at least some of their work will involve looking at dependencies in communication and ebb and flow across these systems.

As you may have guessed, I like simple. I like the idea of a single dashboard that has a green light (healthy course), an orange light (sick course) and a red light (time to go back to playing guitar on the street corner) although I know it will never be that easy. However, anything that brings me closer to that is doing me a huge favour, because the less time I have to spend actively probing in the course, the less of my students’ time I take up with probes and the less of my own time I spend not knowing what is going on!

Oh well, the good news is that I think that there are only three more papers to write before the Mayan Apocalypse occurs and at least one of them will be on this. I’ll see if I can sneak in a picture of a fruit bat. 🙂

The Hips Don’t Lie – Assuming That By Hips You Mean Numbers

For those who missed it, the United States went to the polls to elect a new President. Some people were surprised by the outcome.

Even Benedict Cumberbatch, seen here between takes on Sherlock Series 3.

Some people were not, including the new King of Quants, Nate Silver. Silver studied economics at the University of Chicago but really came to prominence in his predictions of baseball outcomes, based on his analysis of the associated statistics and sabermetrics. He correctly predicted, back in 2008, what would happen between Obama and Clinton, and he predicted, to the state, what the outcome would be in this year’s election, even in the notoriously fickle swing states. Silver’s approach isn’t secret. He looks at all of the polls and then generates a weighted average of them (very, very simplified) in order to value certain polls over others. You rerun some of the models, change some parameters, look at it all again and work out what the most likely scenario is. Nate’s been publishing this regularly on his FiveThirtyEight blog (that’s the number of electors in the electoral college, by the way, and I had to look that up because I am not an American) which is now a feature of the New York Times.

So, throughout the entire election, as journalists and the official voices have been ranting and railing, predicting victory for this candidate or that candidate, Nate’s been looking at the polls, adjusting his model and publishing his predictions. Understandably, when someone is predicting a Democratic victory, the opposing party is going to jump up and down a bit and accuse Nate of some pretty serious bias and poll fixing. However, unless young Mr Silver has powers beyond those of mortal men, fixing all 538 electors in order to ensure an exact match to his predictions does seem to be taking fixing to a new level – and, of course, we’re joking because Nate Silver was right. Why was he right? Because he worked out a correct mathematical model and method that took into account how accurate each poll was likely to be in predicting the final voter behaviour and that reliable, scientific and analytic approach allowed him to make a pretty conclusive set of predictions.

There are notorious examples of what happens when you listen to the wrong set of polls, or poll in the wrong areas, or carry out a phone poll at a time when (a) only rich people have phones or (b) only older people have landlines. Any information you get from such biased polls has to be taken with a grain of salt and weighted to reduce a skewing impact, but you have to be smart in how you weight things. Plain averaging most definitely does not work because this assumes equal sized populations or that (mysteriously) each poll should be treated as having equal weight. Here’s the other thing, though, ignoring the numbers is not going to help you if those same numbers are going to count against you.
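
A toy comparison in Python makes the difference between plain and weighted averaging concrete. To be clear, this is not Nate Silver’s model: the polls are invented, and the weighting scheme (sample size multiplied by a made-up `reliability` factor) is just one simple assumption chosen to illustrate the idea.

```python
def plain_average(polls):
    """Unweighted mean of each poll's share: treats every poll as equal."""
    return sum(p["share"] for p in polls) / len(polls)


def weighted_average(polls):
    """Mean weighted by sample size times an assumed reliability factor."""
    weights = [p["sample_size"] * p["reliability"] for p in polls]
    total = sum(weights)
    return sum(w * p["share"] for w, p in zip(weights, polls)) / total


# Hypothetical polls: 'share' is the fraction backing one candidate,
# 'reliability' is an invented score between 0 and 1.
polls = [
    {"share": 0.60, "sample_size": 100, "reliability": 0.5},
    {"share": 0.51, "sample_size": 3000, "reliability": 0.9},
    {"share": 0.45, "sample_size": 500, "reliability": 0.3},
]

print(f"plain:    {plain_average(polls):.3f}")
print(f"weighted: {weighted_average(polls):.3f}")
```

In this made-up case the small, enthusiastic poll drags the plain average upwards, while the weighted figure stays close to the large, more reliable poll. The hard part in practice is, of course, deciding what the weights should be.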

Example: You’re a student and you do a mock exam. You get 30% because you didn’t study. You assume that the main exam will be really different. You go along. It’s not. In fact, it’s the same exam. You get 35%. You ignored the feedback that you should have used to predict what your final numbers were going to be. The big difference here is that a student can change their destiny through their own efforts. Changing the mind of the American people from June to November (Nate published his first predictions in June) is going to be nearly impossible so you’re left with one option, apparently, and that’s to pretend that it’s not happening.

I can pretend that my car isn’t running out of gas but, if the gauge is even vaguely accurate, somewhere along the way the car is going to stop. Ignoring Nate’s indications of what the final result would be was only ever going to work if his model was absolutely terrible but, of course, it was based on the polling data and the people being polled were voters. Assuming that there was any accuracy to the polls, then it’s the combination of the polls that was very clever and that’s all down to careful thought and good modelling. There is no doubt that a vast amount of work has gone into producing such a good model because you have to carefully work out how much each vote is worth in which context. Someone in a blue-skewed poll votes blue? Not as important as an increasing number of blue voters in a red-skewed polling area. One hundred people polled in a group to be weighted differently from three thousand people in another – and the absence of certain outliers possibly just down to having too small a sample population. Then, just to make it more difficult, you have to work out how these voting patterns are going to turn into electoral college votes. Now you have one vote that doesn’t mean the difference between having Idaho and not having Idaho, you have a vote that means the difference between “Hail to the Chief” and “Former Presidential Candidate and Your Host Tonight”.

Nate Silver’s work has brought a very important issue to light. The numbers, when you are thorough, don’t lie. He didn’t create the President’s re-election, he merely told everyone that, according to the American people, this was what was going to happen. What is astounding to me, and really shouldn’t be, is how many commentators and politicians seemed to take Silver’s predictions personally, as if he was trying to change reality by lying about the numbers. Well, someone was trying to change public perception of reality by talking about numbers, but I don’t think it was Nate Silver.

This is, fundamentally, a victory for science, thinking and solid statistics. Nate put up his predictions in a public space and said “Well, let’s see” and, with a small margin for error in terms of the final percentages, he got it right. That’s how science is supposed to work. Look at stuff, work out what’s going on, make predictions, see if you’re right, modify model as required and repeat until you have worked out how it really works. There is no shortage of Monday morning quarterbacks who can tell you in great detail why something happened a certain way when the game is over. Thanks, Nate, for giving me something to show my students to say “This is what it looks like when you get data science right.”

Remind me, however, never to bet against you at a sporting event!

Industry Speaks! (May The Better Idea Win)

Alan Noble, Director of Engineering for Google Australia and an Adjunct Professor with my Uni, generously gave up a day today to give a two-hour lecture on distributed systems and scale to our third-year Distributed Systems course, and another two-hour lecture on entrepreneurship to my Grand Challenge students. Industry contact is crucial for my students because the world inside the Uni and the world outside the Uni can be very, very different. While we try to keep industry contact high in later years, and we’re very keen on authentic assignments that tackle real-world problems, we really need the people who are working for the bigger companies to come in and tell our students what life would be like working for Google, Microsoft, Saab, IBM…

My GC students have had a weird mix of lectures that have been designed to advance their maturity in the community and as scientists, rather than their programming skills (although that’s an indirect requirement), but I’ve been talking from a position of social benefit and community-focused ethics. It is essential that they be exposed to companies, commercialisation and entrepreneurship as it is not my job to tell them who to be. I can give them skills and knowledge but the places that they take those are part of an intensely personal journey and so it’s great to have an opportunity for Alan, a man with well-established industry and research credentials, to talk to them about how to make things happen in business terms.

The students I spoke to afterwards were very excited and definitely saw the value of it. (Alan, if they all leave at the end of this year and go to Google, you’re off the Christmas Card list.) Alan focused on three things: problems, users and people.

Problems: Most great companies find a problem and solve it but, first, you have to recognise that there is a problem. This sometimes just requires putting the right people in front of something to find out what these new users see as a problem. You have to be attentive to the world around you but being inventive can be just as important. Something Alan said really resonated with me in that people in the engineering (and CS) world tend to solve the problems that they encounter (do it once manually and then set things up so it’s automatic thereafter) and don’t necessarily think “Oh, I could solve this for everyone”. There are problems everywhere but, unless we’re looking for them, we may just adapt and move on, instead of fixing the problem.

Users: Users don’t always know what they want yet (the classic Steve Jobs approach), they may not ask for it or, if they do ask for something, what they want may not yet be available for them. We talked here about a lot of current solutions to problems but there are so many problems to fix that would help users. Simultaneous translation, for example, over telephone. 100% accurate OCR (while we’re at it). The risk is always that when you offer the users the idea of a car, all they ask for is a faster horse (after Henry Ford). The best thing for you is a happy user because they’re the best form of marketing – but they’re also fickle. So it’s a balancing act between genuine user focus and telling them what they need.

People: Surround yourself with people who are equally passionate! Strive for a culture of innovation and getting things done. Treasure your agility as a company and foster it if you get too big. Keep your units of work (teams) smaller if you can and match work to the team size. Use structures that encourage a short distance from top to bottom of the hierarchy, which allows for ideas to move up, down and sideways. Be meritocratic and encourage people to contest ideas, using facts and articulating their ideas well. May the Better Idea Win! Motivating people is easier when you’re open and transparent about what they’re doing and what you want.

Alan then went on to speak a lot about execution, the crucial step in taking an idea and having a successful outcome. Alan had two key tips.

Experiment: Experiment, experiment, experiment. Measure, measure, measure. Analyse. Take it into account. Change what you’re doing if you need to. It’s ok to fail but it’s better to fail earlier. Learn to recognise when your experiment is failing – and don’t guess, experiment! Here’s a quote that I really liked:

When you fail a little every day, it’s not failing, it’s learning.

Risk goes hand-in-hand with failure and success. Entrepreneurs have to learn when to call an experiment and change direction (pivot). Pivot too soon, you might miss out on something good. Pivot too late, you’re in trouble. Learning how to be agile is crucial.

Data: Collect and scrutinise all of the data that you get – your data will keep you honest if you measure the right things. Be smart about your data and never copy it when you can analyse it in situ.

(Alan said a lot more than this over 2 hours but I’m trying to give you the core.)

Alan finished by summarising all of this as his Three As of Entrepreneurship, then why we seem to be hitting an entrepreneurship growth spurt in Australia at the moment. The Three As are:

  • Audit your data
  • Having Audited, Admit when things aren’t working
  • Once admitted, you can Adapt (or pivot)

As to why we’re seeing a growth of entrepreneurship, Australia has a population who are some of the highest early adopters on the planet. We have a high technical penetration, over 20,000,000 potential users, a high GDP and we love tech. 52% of Australians have smart phones and we had so many mobile phones, pre-smart, that it was just plain crazy. Get the tech right and we will buy it. Good tech, however, is hardware+software+user requirement+getting it all right.

It’s always a pleasure to host Alan because he communicates his passion for the area well but he also puts a passionate and committed face onto industry, which is what my students need to see in order to understand where they could sit in their soon-to-be professional community.

Brief Stats Update: I appear to have written two more books

On May 6th, I congratulated Mark Guzdial on his 1000th post and I noted that I had written 102,136 words, an average of 676 words per post, with 151 posts over 126 days. I commented that, at that rate, I could expect to produce about 180,000 more words by the end of the year, for a total of about 280,000. So, to summarise, my average posting level was at a rate of 1.2 posts per day, and 676 words per post.

Today, I reanalysed the blog to see how I was going. This post will be published on Tuesday the 9th, my time, and the analysis here does not include itself. So, up until all activity on Monday the 8th, Central Australian Daylight Saving Time, here are the stats.

Total word count: 273,639. Total number of posts: 343. Number of words per post: 798. Number of posts per day: 1.23. I will reach my end of year projected word count in about 9 days.

I knew that I had been writing longer posts (you may remember that I’ve deliberately tried to keep the posts to around 1,000 words where possible) but it’s obvious that I’m just not that capable of writing a short post! In the long term, I’d expect this to approach 1,000 words/post because of my goal to limit myself to that, with the occasional overshoot. I’m surprised by the consistency in number of posts per day. The previous average was a smidgen under 1.2 but I wanted to clarify that there has been a minor increase. Given that my goal was not to necessarily hit exactly 1/day but to set aside time to think about learning and teaching every day, I’m happy with that.
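For anyone who wants to check the averages, here is a minimal Python sketch. The variable names are illustrative, and it only uses the totals stated in the two snapshots. The day count behind the second posts-per-day figure is not given, so that one is left out rather than guessed.

```python
# Totals as quoted in the two snapshots (May 6th and October 8th).
may_words, may_posts, may_days = 102_136, 151, 126
oct_words, oct_posts = 273_639, 343

may_words_per_post = may_words / may_posts   # about 676
oct_words_per_post = oct_words / oct_posts   # about 798
may_posts_per_day = may_posts / may_days     # a little under 1.2

print(f"May: {may_words_per_post:.0f} words/post, {may_posts_per_day:.3f} posts/day")
print(f"October: {oct_words_per_post:.0f} words/post")
```

The May posts-per-day value comes out just below 1.2, which matches the "smidgen under 1.2" remark.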

The word count, however, is terrifying. One of the reasons that I wanted to talk about this is to identify how much work something like this is, not to either over inflate myself or to put you off, but to help anyone out there who is considering such a venture. Let me explain some things first.

  1. I have been typing in one form or another since 1977. I was exposed to computers early on and, while I’ve never been trained to touch type, I have that nasty hybrid version where I don’t use all of my fingers but still don’t have to look at the keyboard.
  2. I can sustain a typing speed of about 2,500 words/hour for fiction for quite a long time. That includes the aspects of creativity required, not dictation or transcription. It is very tiring, however, and too much of it makes me amusingly incoherent.
  3. I do not have any problems with repetitive strain injury and I have a couple of excellent working spaces with fast computers and big screens.
  4. I love to write.

So, I’m starting from a good basis and, let me stress, I love to write. Now let me tell you about the problems that this project has revealed.

  1. I produce two kinds of posts: research focused and the more anecdotal. Anecdotal posts can be written up quickly but the moment any research, pre-reading or reformulation is required, it will take me about an hour or two to get a post together. So that cute high speed production drops to about 500-1000 words/hour.
  2. Research posts are the result of hours of reading and quite a lot of associated thought. My best posts start from a set of papers that I read, I then mull on it for a few days and finally it all comes together. I often ask someone else to look at the work to see how it sits in the queue.
  3. I’m always better when I don’t have to produce something for tomorrow. When the post queue is dry, I don’t have the time to read in detail or mull so I have to either pull a previous draft from the queue and see if I can fix it (and I’ve pretty much run out of those) or I have to come up with an idea now and write it now. All too often, these end up being relatively empty opinion pieces.
  4. If you are already tired, writing can be very tiring and you lose a lot of the fiero and inspiration from writing a good post.

I have probably spent, by all of these figures and time estimates, somewhere around 274 hours on this project. That’s just under 7 working weeks at 40 hours/week. No wonder I feel tired sometimes!
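As a rough sanity check on that estimate, the arithmetic can be sketched in a few lines of Python. How the 274 hours was derived is not spelled out here, so this only works out what the figure implies; the variable names are assumptions for illustration.

```python
total_words = 273_639      # word count quoted earlier in this post
estimated_hours = 274      # the estimate given above
hours_per_week = 40

working_weeks = estimated_hours / hours_per_week   # just under 7
implied_rate = total_words / estimated_hours       # roughly 1,000 words/hour

print(f"{working_weeks:.2f} working weeks")
print(f"{implied_rate:.0f} words/hour implied")
```

An implied rate of about 1,000 words/hour sits at the top of the 500-1000 words/hour range mentioned for posts that need research.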

I am already, as you know, looking to change the posting frequency next year because I wish to focus on the quality of my work rather than the volume of my output. I still plan to have that hour or so put aside every day to contemplate and carry out research on learning and teaching but it will no longer be tied to an associated posting deadline. My original plan had an output requirement to force me to carry out the work. Unsurprisingly, oh brave new world that has such extrinsic motivating factors in it, I have become focused on the post, rather than the underlying research. My word count indicates that I am writing but, once this year is over, the review that I carry out will be to make sure that every word written from that point on is both valuable and necessary. My satisfaction in the contribution and utility of those posts I do make will replace any other quantitative measures of output.

My experience in this can be summarised quite simply. Setting a posting schedule that is too restrictive risks you putting the emphasis on the wrong component, where setting aside a regular time to study and contemplate the issues that lead to a good post is a far wiser investment. If you want to write this much, then it cannot be too much of a chore and, honestly, loving writing is almost essential, I feel. Fortunately, I have more than enough to keep the post queue going to the end of the year, as I’m working on a number of papers and ideas that will naturally end up here but I feel that I have, very much, achieved what I originally set out to do. I now deeply value the scholarship of learning and teaching and have learned enough to know that I have a great deal more to learn.

From a personal perspective, I believe that all of the words written have been valuable to me but, from next year, I have to make sure that the words I write are equally valuable to other people.

I’ll finish with something amusing. Someone asked me the other day how many words I’d written and, off the top of my head, I said “about 140,000” and thought that I was possibly over-claiming. The fact that I was under claiming by almost a factor of two never would have occurred to me, nor the fact that I had written more words than can be found in Order of the Phoenix. While I may wish to reclaim my reading time once this is over, for any fiction publishers reading this, I will have some free time next year! 🙂

Time Banking: Aiming for the 40 hour week.

I was reading an article on metafilter on the perception of future leisure from earlier last century and one of the commenters linked to a great article on “Why Crunch Mode Doesn’t Work: Six Lessons” via the International Game Designers Association. This article was partially in response to the quality of life discussions that ensued after ea_spouse outed the lifestyle (LiveJournal link) caused by her spouse’s ludicrous hours working for Electronic Arts, a game company. One of the key quotes from ea_spouse was this:

Now, it seems, is the “real” crunch, the one that the producers of this title so wisely prepared their team for by running them into the ground ahead of time. The current mandatory hours are 9am to 10pm — seven days a week — with the occasional Saturday evening off for good behavior (at 6:30pm). This averages out to an eighty-five hour work week. Complaints that these once more extended hours combined with the team’s existing fatigue would result in a greater number of mistakes made and an even greater amount of wasted energy were ignored.

The badge is fastened with two pins that go straight into your chest.

This is an incredible workload and, as Evan Robinson notes in the “Crunch Mode” article, this is not only incredible but it’s downright stupid because every serious investigation into the effect of working more than 40 hours a week, for extended periods, and for reducing sleep and accumulating sleep deficit has come to the same conclusion: hours worked after a certain point are not just worthless, they reduce worth from hours already worked.

Robinson cites studies and practices coming from industrialists such as Henry Ford, who reduced shift length to a 40-hour work week in 1926, attracting huge criticism, because 12 years of research had shown that the shorter work week meant more output, not less. These studies have been going on since the 18th century and well into the 60’s at least and they all show the same thing: working eight hours a day, five days a week gives you more productivity because you get fewer mistakes, you get less fatigue accumulation and you have workers that are producing during their optimal production times (first 4-6 hours of work) without sliding into their negatively productive zones.

As Robinson notes, the games industry doesn’t seem to have got the memo. The crunch is a common feature in many software production facilities and the ability to work such back-breaking and soul-destroying shifts is often seen as a badge of honour or mark of toughness. The fact that you can get fired for having the audacity to try and work otherwise also helps a great deal in motivating people to adopt the strategy.

Why spend so many hours in the office? Remember when I said that it’s sometimes hard for people to see what I’m doing because, when I’m thinking or planning, I can look like I’m sitting in the office doing nothing? Imagine what it looks like if, two weeks before a big deadline, someone walks into the office at 5:30pm and everyone’s gone home. What does this look like? Because of our conditioning, which I’ll talk about shortly, it looks like we’ve all decided to put our lives before the work – it looks like less than total commitment.

As a manager, if you can tell everyone above you that you have people at their desks 80+ hours a week and will have for the next three months, then you’re saying that “this work is important and we can’t do any more.” The fact that people were probably only useful for the first 6 hours of every day, and even then only for the first couple of months, doesn’t matter because it’s hard to see what someone is doing if all you focus on is the output. Those 80+ hour weeks are probably only now necessary because everyone is so tired, so overworked and so cognitively impaired, that they are taking 4 times as long to achieve anything.

Yes, that’s right. All the evidence says that more than 2 months of overtime and you would have been better off staying at 40 hours/week in terms of measurable output and quality of productivity.

Robinson lists six lessons, which I’ll summarise here because I want to talk about them in terms of students and why forward planning for assignments is good practice for better smoothing of time management in the future. Here are the six lessons:

  1. Productivity varies over the course of the workday, with greatest productivity in the first 4-6 hours. After enough hours, you become unproductive and, eventually, destructive in terms of your output.
  2. Productivity is hard to quantify for knowledge workers.
  3. Five day weeks of eight hour days maximise long-term output in every industry that has been studied in the past century.
  4. At 60 hours per week, the loss of productivity caused by working longer hours overwhelms the extra hours worked within a couple of months.
  5. Continuous work reduces cognitive function 25% for every 24 hours. Multiple consecutive overnighters have a severe cumulative effect.
  6. Error rates climb with hours worked and especially with loss of sleep.

My students have approximately 40 hours of assigned work a week, consisting of contact time and assignments, but many of them never really think about that. Most plan in other things around their ‘free time’ (they may need to work, they may play in a band, they may be looking after families or they may have an active social life) and they fit the assignment work and other study into the gaps that are left. Immediately, they will be over the 40 hour marker for work. If they have a part-time job, the three months of one of my semesters will, if not managed correctly, give them a lumpy time schedule alternating between some work and far too much work.

Many of my students don’t know how they are spending their time. They switch on the computer, look at the assignment, Skype, browse, try something, compile, walk away, grab a bite, web surf, try something else – wow, three hours of programming! This assignment is really hard! That’s not all of them but it’s enough of them that we spend time on process awareness: working out what you do so you know how to improve it.

Many of my students see sports drinks, energy drinks and caffeine as a licence to not sleep. It doesn’t work long term as most of us know, for exactly the reasons that long term overwork and sleeplessness don’t work. Stimulants can keep you awake but you will still be carrying most if not all of your cognitive impairment.

Finally, and most importantly, enough of my students don’t realise that everything I’ve said up until now means that they are trying to sit my course with half a brain after about the halfway point, if not sooner if they didn’t rest much between semesters.

I’ve talked about the theoretical basis for time banking and the pedagogical basis for time banking: this is the industrial basis for time banking. One day I hope that at least some of my students will be running parts of their industries and that we have taught them enough about sensible time management and work/life balance that, as people in control of a company, they look at real measures of productivity, they look at all of the masses of data supporting sensible ongoing work rates and that they champion and adopt these practices.

As Robinson says towards the end of the article:

Managers decide to crunch because they want to be able to tell their bosses “I did everything I could.” They crunch because they value the butts in the chairs more than the brains creating games. They crunch because they haven’t really thought about the job being done or the people doing it. They crunch because they have learned only the importance of appearing to do their best instead of really doing their best. And they crunch because, back when they were programmers or artists or testers or assistant producers or associate producers, that was the way they were taught to get things done. (Emphasis mine.)

If my students can see all of their requirements ahead of time, know what is expected, have been given enough process awareness, and have the will and the skill to undertake the activities, then we can potentially teach them a better way to get things done if we focus on time management in a self-regulated framework, rather than imposed deadlines in a rigid authority-based framework. Of course, I still have a lot of work to do to demonstrate that this will work but, from industrial experience, we have yet another very good reason to try.

Flow, Happiness and the Pursuit of Significance

I’ve just been reading Deirdre McCloskey’s article on “Happyism” in The New Republic. While there are a number of points I could pick at in the article (I question her specific example of statistical significance and I think she’s oversimplified a number of the philosophical points), there are a lot of interesting thoughts and arguments within it.

One of my challenges in connecting with my students is that of making them understand what the benefit is to them of adopting, or accepting, suggestions from me as to how to become better as discipline practitioners, as students and, to some extent, as people. It would be nice if doing the right thing in this regard could give the students a tangible and measurable benefit that they could accumulate on some sort of meter – I have performed well, my “success” meter has gone up by three units. As McCloskey points out, this effectively requires us to have a meter for something that we could call happiness, but it is then tied directly to events that give us pleasure, rather than a sequence of events that could give us happiness. Workflows (chains of actions that lead to an eventual outcome) can be assessed for accuracy and then the outcome measured, but it is only when the workflow is complete that we can assess the ‘success’ of the workflow and then derive pleasure, and hence happiness, from the completion of the workflow. Yes, we can compose a workflow from sub-workflows but we will hit the same problem if we focus on an outcome-based model – at some stage, we are likely to be carrying out an action that can lead to an event from which we can derive a notion of success, but this requires us to be foresighted and see the events as a chain that results in this outcome.

And this is very hard to meter and display in a way that says anything other than “Keep going!” Unsurprisingly, this is not really the best way to provide useful feedback, reward or fodder for self-actualisation.

I have a standing joke that, as a runner, I go to a sports doctor because if I go to a General Practitioner and say “My leg hurts after I run”, the GP will just say “Stop running.” I am enough of a doctor to say that to myself – so I seek someone who is trained to deal with my specific problems and who can give me a range of feedback that may include “stop running” because my injuries are serious or chronic, but can provide me with far more useful information from which I can make an informed choice. The happiness meter must be able to work with workflow in some way that is useful – keep going is not enough. We therefore need to look at the happiness meter.

McCloskey identifies Bentham, founder of utilitarianism, as the original “pleasure meter” proponent and implicitly addresses the felicific calculus as subverting our assessment of “happiness units” (utils) into a form that assumes that we can reasonably compare utils between different people and that we can assemble all of our life’s experiences in a meaningful way in terms of utils in the first place!

To address the issue of workflow itself, McCloskey refers to the work of Mihály Csíkszentmihályi on flow: “the absorption in a task just within our competence”. I have talked about this before, in terms of Vygotsky’s zone of proximal development and the use of a group to assist people who are just outside of the zone of flow. The string of activities can now be measured in terms of satisfaction or immersion, as well as the outcomes of this process. Of course, we have the outcomes of the process in terms of direct products and we have outcomes in terms of personal achievement at producing those products. Which of these go onto the util meter, given that they are utterly self-assessed, subjective and, arguably, orthogonal in some cases? (If you have ever done your best, been proud of what you did, but failed in your objective, you know what I’m talking about.)

My reading of McCloskey is probably a little generous because I find her overall argument appealing. I believe that her argument may be distilled to:

  • If we are going to measure, we must measure sensibly and be very clear in our context and the interpretation of significance.
  • If we are going to base any activity on our measurement, then the activity we create or change must be related to the field of measurement.

Looking at the student experience in this light, asking students if they are happy with something is, ultimately, a pointless activity unless I either provide well-defined training in my measurement system and scale, or I am looking for a measurement of better or worse. This is confounded by simple cognitive biases including, but not limited to, the Hawthorne Effect and confirmation bias. However, measuring what my students are doing, as Csíkszentmihályi did in the flow experiments, will show me if they are so engaged with their activities that they are staying in the flow zone. Similarly, looking at participation and measuring outputs in collaborative activities where I would expect the zone of proximal development to be in effect is going to be far more revealing than asking students if they liked something or not.

As McCloskey discusses, there is a point at which we don’t seem to get any happier but it is very hard to tell if this is a fault in our measurement and our presumption of a three-point non-interval scale. This then often degenerates into a form of intellectual snobbery that, unsurprisingly, favours the elites who will be studying the non-elites. (As an aside, I learnt a new word. Clerisy: “A distinct class of learned or literary people”. If you’re going to talk about the literate elites, it’s nice to have a single word to do so!) In student terms, does this mean that there is a point at which even the most keen of our best and brightest will not try some of our new approaches? The question, of course, is whether the pursuit of happiness is paralleling the quest for knowledge, or whether this is all one long endured workflow that results in a pleasure quantum labelled ‘graduation’.

As I said, I found it to be an interesting and thoughtful piece, despite some problems, and I recommend it to you, even if we must then start a large debate in the comments on how much I misled you!

Speaking of measurement

In a delightfully serendipitous alignment of the planets, today marks my 200th post and my 10,000th view. Given that posting something new every day, which strives if not succeeds at being useful and interesting, is sometimes a very demanding commitment, the knowledge that people are reading does help me to keep it going. However, it’s the comments, both here and on FB, that show that people can sometimes actually make use of what I’m talking about that is the real motivator for me.

Thank you, everyone, for your continued reading and support, and to everyone else out there blogging who is showing me how it can be done better (and there are a lot of people who are doing it much better than I am).

Have a great day, wherever you are!

Your love is like bad measurement.

(This is my 200th post. I’ve allowed myself a little more latitude on the opinionated scale. Educational content is still present but you may find some of the content slightly more confronting than usual. I’ve also allowed myself an awful pun in the title.)

People like numbers. They like solid figures, percentages, clear statements and certainty. It’s a great shame that mis-measurement is so easy to do, when you search for these figures, and so much a part of our lives. Today, I’m going to discuss precision and recall, because I eventually want to talk about bad measurement. It’s very easy to get measurement wrong but, even when it’s conducted correctly, the way that we measure or the reasons that we have for measuring can make even the most precise and delicate measurements useless to us for an objective scientific purpose. This is still bad measurement.

I’m going to give you a big bag of stones. Some of the stones have diamonds hidden inside them. Some of the stones are red on the outside. Let’s say that you decide that you are going to assume that all stones that have been coloured red contain diamonds. You pull out all of the red stones, but what you actually want is diamonds. The number of red stones is referred to as the number of retrieved instances – the things that you have selected out of that original bag of stones. Now, you get to crack them open and find out how many of them have diamonds. Let’s say you have R red stones and D1 diamonds that you found once you opened up the red stones. The precision is the fraction D1/R: what percentage of the stones that you selected (Red) were actually the ones that you wanted (Diamonds). Now let’s say that there are D2 diamonds (where D2 is greater than or equal to zero) left back in the bag. The total number of diamonds in that original bag was D1+D2, right? The recall is the fraction of the total number of things that you wanted (Diamonds, given by D1+D2) that you actually got (Diamonds that were also painted Red, which is D1). So this fraction is D1/(D1+D2), the number you got divided by the number that there were there for you to actually get.
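If you prefer to see that as code, here is a minimal Python sketch of the two fractions. The function names and the example counts are made up purely for illustration; only the formulas D1/R and D1/(D1+D2) come from the description above.

```python
def precision(d1: int, r: int) -> float:
    """Fraction of the selected (red) stones that held diamonds: D1 / R."""
    if r == 0:
        raise ValueError("no stones were selected, so precision is undefined")
    return d1 / r


def recall(d1: int, d2: int) -> float:
    """Fraction of all diamonds that ended up in the selection: D1 / (D1 + D2)."""
    total = d1 + d2
    if total == 0:
        raise ValueError("the bag held no diamonds, so recall is undefined")
    return d1 / total


# Hypothetical bag: 20 red stones, 15 of them with diamonds,
# and 10 more diamonds left behind in stones that were not red.
print(precision(d1=15, r=20))   # 0.75
print(recall(d1=15, d2=10))     # 0.6
```

With these invented counts, three quarters of the red stones paid off, but 40% of the diamonds are still sitting in the bag.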

Sorry, Logan5, your time is up.

If I don’t have any other mechanism that I can rely upon for picking diamonds out of the bag (assuming no-one has conveniently painted them red), and I want all of the diamonds, then I need to take all of them out. This will give me a recall of 100% (D2 will be 0 as there will be nothing left in the bag and the fraction will be D1/D1). Hooray! I have all of the diamonds! There’s only one problem – there are still only so many diamonds in that bag and (maybe) a lot more stones, so my precision may be terrible. More importantly, my technique sucks (to use an official term) and I have no actual way of finding diamonds. I just happen to have used a mechanism that gets me everything so it must, as a side effect, get me all of the diamonds. I haven’t actually done anything except move everything from one bag to another.

One of the things about selection mechanisms is that people often seem happy to talk about one side of the precision/recall issue. “I got all of them” is fine but not if you haven’t actually reduced your problem at all. “All the ones I picked were the right ones” sounds fantastic until you realise that you don’t know how many were left behind that were also the ones that you wanted. If we can specify solutions (or selection strategies) in terms of their precision and their recall, we can start to compare them. This is an example of how something that appears to be straightforward can actually be a bad measurement – leave out one side of precision or recall and you have no real way of assessing the utility of what it is that you’re talking about, despite having some concrete numbers to fall back on.
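To make the comparison concrete, here is a small Python sketch that scores two selection strategies on the same bag: "pick the red ones" and "take everything". The bag contents, the `Stone` type and the `evaluate` helper are all invented for this illustration, and treating an empty selection as precision 0 is just a convention chosen for the sketch.

```python
from typing import Callable, NamedTuple


class Stone(NamedTuple):
    red: bool
    diamond: bool


def evaluate(bag: list[Stone], select: Callable[[Stone], bool]) -> tuple[float, float]:
    """Return (precision, recall) for a selection rule applied to the bag."""
    picked = [s for s in bag if select(s)]
    found = sum(s.diamond for s in picked)           # D1
    all_diamonds = sum(s.diamond for s in bag)       # D1 + D2
    precision = found / len(picked) if picked else 0.0
    recall = found / all_diamonds if all_diamonds else 0.0
    return precision, recall


# A hypothetical bag of 100 stones holding 20 diamonds.
bag = (
    [Stone(red=True, diamond=True)] * 12
    + [Stone(red=True, diamond=False)] * 4
    + [Stone(red=False, diamond=True)] * 8
    + [Stone(red=False, diamond=False)] * 76
)

red_only = evaluate(bag, lambda s: s.red)       # (0.75, 0.6)
take_all = evaluate(bag, lambda s: True)        # (0.2, 1.0)

print("red only:", red_only)
print("take everything:", take_all)
```

Quoting only the 100% recall of "take everything", or only the 75% precision of "red only", hides exactly the half of the picture that tells you whether the strategy is any good.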

You may have heard this expressed in another way. Let’s assume that you can have a mechanism for determining if people are innocent or guilty of a crime. If it was a perfect mechanism, then only innocent people would go free and only guilty people would go to jail. (Let’s assume it’s a crime for which a custodial sentence is appropriate.) Now, let’s assume that we don’t have a perfect mechanism so we have to make a choice – either we set up our system so that no innocent person goes to jail, or we set up our system so that no guilty person is set free. It’s fairly easy to see how our interpretation of the presumption of innocence, the notion of reasonable doubt and even evidentiary laws would be constructed in different ways under either of these assumptions. Ultimately, this is an issue of precision and recall and by understanding these concepts we can define what we are actually trying to achieve. (The foundation of most modern law is that innocent people don’t go to jail. A number of changes in certain areas are moving more towards a ‘no one who may be guilty of crimes of a certain type will escape us’ model and, unsurprisingly, this is causing problems due to inconsistent applications of our simple definitions from above.)

The reason that I brought all of this up was to talk about bad measurement, where we measure things and then over-interpret (torture the data) or over-assume (the only way that this could have happened was…) or over-claim (this always means that). It is possible to have a precise measurement of something and still be completely wrong about why it is occurring. It is possible that all of the data that we collect is the wrong data – collected because our fundamental hypothesis is in error. Data gives us information but our interpretative framework is crucial in determining what use we can make of this data. I talked about this yesterday and stressed the importance of having enough data, but you really have to know what your data means in order to be sure that you can even start to understand what ‘enough data’ means.

One example is the miasma theory of disease – the idea that bad smells caused disease outbreaks. You could construct a gadget that measured smells and then, say in 18th Century England, correlate this with disease outbreaks – and get quite a good correlation. This is still a bad measurement because we’re actually measuring two effects, rather than a cause (dead mammals introducing decaying matter/faecal bacteria etc into water or food pathways) and the effects (smell of decomposition, and diseases like cholera, E. coli contamination, and so on). We can collect as much ‘smell’ data as we like, but we’re unlikely to learn much more because any techniques that focus on the smell and reducing it will only work if we do things like remove the odiferous elements, rather than just using scent bags and pomanders to mask the smell.
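A toy Python sketch shows how two effects of a hidden cause can correlate strongly with each other. Every number here is invented for illustration (no historical data is involved), and the linear relationships are an assumption made only to keep the example small.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Invented numbers: contamination is the hidden cause in six districts.
contamination = [0, 1, 2, 3, 4, 5]
smell_noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0]
disease_noise = [-0.5, 0.4, 0.0, 0.3, -0.2, 0.1]

# Both effects are driven by contamination, not by each other.
smell = [1 + 2 * c + e for c, e in zip(contamination, smell_noise)]
disease = [1 + 3 * c + e for c, e in zip(contamination, disease_noise)]

r = pearson(smell, disease)
print(f"smell vs disease correlation: {r:.3f}")

# Masking the smell (pomanders) changes the reading but not the cause,
# so disease is exactly what it was before.
masked_smell = [0.0 for _ in smell]
disease_after_masking = disease

# Removing the source lowers contamination, so disease falls to baseline.
cleaned = [0 for _ in contamination]
disease_after_cleaning = [1 + 3 * c for c in cleaned]

print("disease total, smell masked:  ", sum(disease_after_masking))
print("disease total, source removed:", sum(disease_after_cleaning))
```

In this toy model the smell gadget reports a near-perfect correlation, yet only the intervention on the cause changes the disease numbers.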

To look at another example, let’s talk about the number of women in Computer Science at the tertiary level. In Australia, it’s certainly pretty low in many Universities. Now, we can measure the number of women in Computer Science and we can tell you exactly how many are in a given class, what their average marks are, and all sorts of statistical data about them. The risk here is that, from the measurements alone, I may have no real idea of what has led to the low enrolments for women in Computer Science.

I have heard, far too many times, that there are too few women in Computer Science because women are ‘not good at maths/computer science/non-humanities courses’ and, as I also mentioned recently when talking about the work of Professor Seron, this doesn’t appear to be the reason at all. When we look at female academic performance, reasons for doing the degree and try to separate men and women, we don’t get the clear separation that would support this assertion. In fact, what we see is that the representation of women in Computer Science is far lower than we would expect to see from the (marginally small) difference that does appear at the very top end of the data. Interesting. Once we actually start measuring, we have to question our hypothesis.

Or we can abandon our principles and our heritage as scientists and just measure something else that agrees with us.

You don’t have to get your measurement methods wrong to conduct bad measurement. You can also measure the wrong thing very precisely, because you are looking for data that verifies your hypothesis rather than being open to change if you find a contradiction. You can twist your measurements to meet your hypothesis, you can collect only the data that supports your assumptions, and you can over-generalise from a small scale or from another area.

When we look at the data, and survey people to find out the reasons behind the numbers, we reduce the risk that our measurements don’t actually serve a clear scientific purpose. For example, and as I’ve mentioned before, the reason that there are too few women studying Computer Science appears to be unpleasantly circular and relates to the fact that there are too few women in the discipline overall, reducing support in the workplace and development opportunities, and producing a two-speed system that excludes the ‘newcomers’. Sorry, Ada and Grace (to name but two), it turns out that we seem to have very short memories.

Too often, measurement is conducted to reassure ourselves of our confirmed and immutable beliefs – people measure to say that ‘this race of people are all criminals/cheats/have this characteristic’ or ‘women cannot carry out this action’ or ‘poor people always perform this set of actions’ without necessarily asking themselves if the measurement is going to be useful, or if this is a useful pursuit as part of something larger. Measuring in a way that really doesn’t provide any more information is just an empty and disingenuous confirmation. This is forcing people into a ghetto, then declaring that “all of these people live in a ghetto so they must like living in a ghetto”.

Presented a certain way, poor and misleading measurement can only lead to questionable interpretation, usually to serve a less than noble and utterly non-scientific goal. It’s bad enough when the media does it but it’s terrible when scientists, educators and academics do it.

Without valid data, collected on the understanding that a world-changing piece of data could actually change our world, all our work is worthless. A world based on data collection purely for the sake of propping up what we already believe, with no possibility of discovery and adaptation, is a world of very bad measurement.