Your love is like bad measurement.Posted: June 19, 2012
(This is my 200th post. I’ve allowed myself a little more latitude on the opinionated scale. Educational content is still present but you may find some of the content slightly more confronting than usual. I’ve also allowed myself an awful pun in the title.)
People like numbers. They like solid figures, percentages, clear statements and certainty. It’s a great shame that mis-measurement is so easy to do, when you search for these figures, and so much a part of our lives. Today, I’m going to discuss precision and recall, because I eventually want to talk about bad measurement. It’s very easy to get measurement wrong but, even when it’s conducted correctly, the way that we measure or the reasons that we have for measuring can make even the most precise and delicate measurements useless to us for an objective scientific purpose. This is still bad measurement.
I’m going to give you a big bag of stones. Some of the stones have diamonds hidden inside them. Some of the stones are red on the outside. Let’s say that you decide that you are going to assume that all stones that have been coloured red contain diamonds. You pull out all of the red stones, but what you actually want is diamonds. The number of red stones are referred to as the number of retrieved instances – the things that you have selected out of that original bag of stones. Now, you get to crack them open and find out how many of them have diamonds. Let’s say you have R red stones and D1 diamonds that you found once you opened up the red stones. The precision is the fraction D1/R: what percentage of the stones that you selected (Red) were actually the ones that you wanted (Diamonds). Now let’s say that there are D2 diamonds (where D2 is greater than or equal to zero) left back in the bag. The total number of diamonds in that original bag was D1+D2, right? The recall is the fraction of the total number of things that you wanted (Diamonds, given by D1+D2) that you actually got (Diamonds that were also painted Red, which is D1). So this fraction is D1/(D1+D2),the number you got divided by the number that there were there for you to actually get.
If I don’t have any other mechanism that I can rely upon for picking diamonds out of the bag (assuming no-one has conveniently painted them red), and I want all of the diamonds, then I need to take all of them out. This will give me a recall of 100% (D2 will be 0 as there will be nothing left in the bag and the fraction will be D1/D1). Hooray! I have all of the diamonds! There’s only one problem – there are still only so many diamonds in that bag and (maybe) a lot more stones, so my precision may be terrible. More importantly, my technique sucks (to use an official term) and I have no actual way of finding diamonds. I just happen to have used a mechanism that gets me everything so it must, as a side effect, get me all of the diamonds. I haven’t actually done anything except move everything from one bag to another.
One of the things about selection mechanisms is that people often seem happy to talk about one side of the precision/recall issue. “I got all of them” is fine but not if you haven’t actually reduced your problem at all. “All the ones I picked were the right ones” sounds fantastic until you realise that you don’t know how many were left behind that were also the ones that you wanted. If we can specify solutions (or selection strategies) in terms of their precision and their recall, we can start to compare them. This is an example of how something that appears to be straightforward can actually be a bad measurement – leave out one side of precision or recall and you have no real way of assessing the utility of what it is that you’re talking about, despite having some concrete numbers to fall back on.
You may have heard this expressed in another way. Let’s assume that you can have a mechanism for determining if people are innocent or guilty of a crime. If it was a perfect mechanism, then only innocent people would go free and only guilt people would go to jail. (Let’s assume it’s a crime for which a custodial sentence is appropriate.) Now, let’s assume that we don’t have a perfect mechanism so we have to make a choice – either we set up our system so that no innocent person goes to jail, or we set up our system so that no guilty person is set free. It’s fairly easy to see how our interpretation of the presumption of innocence, the notion of reasonable doubt and even evidentiary laws would be constructed in different ways under either of these assumptions. Ultimately, this is an issue of precision and recall and by understanding these concepts we can define what we are actually trying to achieve. (The foundation of most modern law is that innocent people don’t go to jail. A number of changes in certain areas are moving more towards a ‘no one who may be guilty of crimes of a certain type will escape us’ model and, unsurprisingly, this is causing problems due to inconsistent applications of our simple definitions from above.)
The reason that I brought all of this up was to talk about bad measurement, where we measure things and then over-interpet (torture the data) or over-assume (the only way that this could have happened was…) or over-claim (this always means that). It is possible to have a precise measurement of something and still be completely wrong about why it is occurring. It is possible that all of the data that we collect is the wrong data – collected because our fundamental hypothesis is in error. Data gives us information but our interpretative framework is crucial in determining what use we can make of this data. I talked about this yesterday and stressed the importance of having enough data, but you really have to know what your data means in order to be sure that you can even start to understand what ‘enough data’ means.
One example is the miasma theory of disease – the idea that bad smells caused disease outbreaks. You could construct a gadget that measured smells and then, say in 18th Century England, correlate this with disease outbreaks – and get quite a good correlation. This is still a bad measurement because we’re actually measuring two effects, rather than a cause (dead mammals introducing decaying matter/faecal bacteria etc into water or food pathways) and the effects (smell of decomposition, and diseases like cholera, E. Coli contamination, and so on). We can collect as much ‘smell’ data as we like, but we’re unlikely to learn much more because any techniques that focus on the smell and reducing it will only work if we do things like remove the odiferous elements, rather than just using scent bags and pomanders to mask the smell.
To look at another example, let’s talk about the number of women in Computer Science at the tertiary level. In Australia, it’s certainly pretty low in many Universities. Now, we can measure the number of women in Computer Science and we can tell you exactly how many are in a given class, what their average marks are, and all sorts of statistical data about them. The risk here is that, from the measurements alone, I may have no real idea of what has led to the low enrolments for women in Computer Science.
I have heard, far too many times, that there are too few women in Computer Science because women are ‘not good at maths/computer science/non-humanities courses’ and, as I also mentioned recently when talking about the work of Professor Seron, this doesn’t appear to the reason at all. When we look at female academic performance, reasons for doing the degree and try to separate men and women, we don’t get the clear separation that would support this assertion. In fact, what we see is that the representation of women in Computer Science is far lower than we would expect to see from the (marginally small) difference that does appear at the very top end of the data. Interesting. Once we actually start measuring, we have to question our hypothesis.
Or we can abandon our principles and our heritage as scientists and just measure something else that agrees with us.
You don’t have to get your measurement methods wrong to conduct bad measurement. You can also be looking for the wrong thing and measure it precisely, because you are attempting to find data that verifies your hypothesis, but rather than being open to change if you find contradiction, you can twist your measurements to meet your hypothesis, you can only collect the data that supports your assumptions and you can over-generalise from a small scale, or from another area.
When we look at the data, and survey people to find out the reasons behind the numbers, we reduce the risk that our measurements don’t actually serve a clear scientific purpose. For example, and as I’ve mentioned before, the reason that there are too few women studying Computer Science appears to be unpleasantly circular and relates to the fact that there are too few women in the discipline over all, reducing support in the workplace, development opportunities and producing a two-speed system that excludes the ‘newcomers’. Sorry, Ada and Grace (to name but two), it turns out that we seem to have very short memories.
Too often, measurement is conducted to reassure ourselves of our confirmed and immutable beliefs – people measure to say that ‘this race of people are all criminals/cheats/have this characteristic’ or ‘women cannot carry out this action’ or ‘poor people always perform this set of actions’ without necessarily asking themselves if the measurement is going to be useful, or if this is useful pursuit as part of something larger. Measuring in a way that really doesn’t provide any more information is just an empty and disingenuous confirmation. This is forcing people into a ghetto, then declaring that “all of these people live in a ghetto so they must like living in a ghetto”.
Presented a certain way, poor and misleading measurement can only lead to questionable interpretation, usually to serve a less than noble and utterly non-scientific goal. It’s bad enough when the media does it but it’s terrible when scientists, educators and academics do it.
Without valid data, collected on the understanding that a world-changing piece of data could actually change our data, all our work is worthless. A world based on data collection purely for the sake of propping up, with no possibility of discovery and adaptation, is a world of very bad measurement.