January 2018

First post – a lesson in confounding variables…

I decided to start this blog to highlight some of the misuses of statistical data found in journalistic articles, and to give some ideas on how to spot them. While I don’t have as much experience as other writers such as the Bad Science columnist Ben Goldacre and the authors of numerous other blogs (links to which will gradually appear on this site in due course), I am inspired enough by reading his book, and enraged enough by the consequences and the symptoms of poor statistics use in the mainstream media, that I want to join in and add my voice to the cause.

I have two major topics of interest:

  1. Public health (my professional interest – I’m currently doing a PhD on the social causes of differences in life expectancy)
  2. Free and open source software (a hobby of mine – though something I feel particularly strongly about).

I have noticed a number of parallels between the two subjects, in terms of the misuse of data to mislead the general public, along with a rather alarming wider societal inability to critically assess what numbers actually mean, and what they can tell us. Arguments are frequently made on the basis that “the numbers say it all” or “the stats speak for themselves”. This is infrequently (quite possibly, never) the case. I believe this has come about from the generally held belief that anything to do with numbers involves lots of complicated maths that the average man on the street couldn’t possibly understand, but which is somehow magically authoritative. This, again, is usually not the case.

The open source software world has an acronym – FUD – to describe the Fear, Uncertainty and Doubt instilled by convincingly-worded and authoritatively-voiced articles based on anecdotes and misrepresented data, designed to reflect negatively on the Linux operating system (a free alternative to closed-source systems such as Microsoft Windows and Apple Mac OS). The term could equally apply to media representation of public health issues, and as this blog progresses I fully intend to apply it there, where appropriate. I think there is a lot of crossover between these two areas, and a lot of value in looking at both together. There’s certainly some scope for “cross-pollination of ideas” – a dreadful manager-speak term if ever there was one, but nevertheless still apt here.

Which leads me on to the article of the day – a piece from an IT newspaper about the use of Linux on netbook PCs, and a common misuse of statistics that I first learned about in an epidemiology course during my master’s: confounding variables. Confounding variables, I suspect, will be a recurring feature of this blog. It’s arguably the simplest error to make, and among the most frequently made. In fact, the concept is so simple that I’m amazed it isn’t in the school curriculum – it shocks me that I didn’t encounter it until my master’s education.

Here’s an excerpt from the start of the article:

Uh-oh. Hazmat time. 0.001% of 1.04%* of the world’s Linux die-hard supergeeks are about to descend and call me all kinds of names, but facts are facts: one of the world’s biggest makers of netbooks says Linux models are being returned at a rate at least four times higher than XP netbooks. Could it be that Linux sucks?

Ok, open source, free loving, caring, sharing Linux users: what does it mean when everyday consumers are returning Linux netbooks at a four to one ratio – at least – over XP netbooks?

It means Linux hasn’t matured enough yet to cater to the needs of everyday users, despite having made its best efforts with the latest Linux distros, although hope certainly does exist for future versions to get better.

Essentially, the author has noted that in one company, the rate at which Linux netbooks are being returned is four times that of XP netbooks, and used this to conclude that Linux is not yet a suitable platform for the average computer user. The “evidence” comes from a single company, though the author doesn’t actually state the source of the data (figures from actual reports, or just an anecdote from the interviewee?), nor does he say how the rates were measured (number of returns, ratio of returns to units sold, something else?). But, be that as it may, the main problem is in presenting this conclusion as the only possible explanation for the data, ignoring all the confounding variables.

Loosely speaking, a confounding variable is a factor which offers an alternative explanation for an association between two variables. In other words, it is a “third-party” variable which causes an association to appear between two other variables which otherwise would not have occurred. In this example, several alternative explanations are plausible:

  • Customers who bought the machines with Linux on them mis-bought, expecting Windows simply because it was what they were used to, and therefore didn’t try Linux before sending it back
  • The Linux computers may have been manufactured with more hardware defects than the Windows computers (i.e. the problems weren’t related to the operating system)
  • Windows is a more familiar operating system – most customers will have observed it working (after a fashion) somewhere beforehand. Customers switching to Linux may well never have seen Linux running before and may be more inclined to give up as a result.
  • We don’t know what happened with the customers afterwards. Did the Linux customers returning their laptops get a replacement Linux laptop, a replacement Windows laptop, or no replacement at all? What about the Windows returners?
  • Finally, the freedom of Linux means that if a Windows customer wished to change operating systems, they could do so without needing to physically return the machine, as Linux is freely downloadable from the internet. The same cannot be said the other way around – to give up on Linux and install Windows, a license must be purchased. This means that those who were unsatisfied with their Linux computers were more likely to return them to the manufacturer than those who were unsatisfied with their Windows machines.
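To see how just one of these confounders could manufacture the headline ratio on its own, here’s a minimal simulation sketch. All the numbers are invented for illustration: I assume a baseline defect-driven return rate shared by both platforms, plus a “mis-buy” effect that applies only to the Linux machines. The operating systems themselves are equally good in this model, yet the return ratio still comes out at roughly four to one.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n = 100_000  # netbooks of each type sold (invented figure)

# Both platforms share the same defect-driven return rate; on top of
# that, some Linux buyers return the machine purely because they were
# expecting Windows. Both rates are assumptions, not real data.
baseline_return_rate = 0.01  # genuine faults, identical for both
mis_buy_return_rate = 0.03   # Linux-only: bought by mistake

linux_returns = sum(
    1 for _ in range(n)
    if random.random() < baseline_return_rate + mis_buy_return_rate
)
xp_returns = sum(
    1 for _ in range(n)
    if random.random() < baseline_return_rate
)

print(f"Linux returns: {linux_returns}")
print(f"XP returns:    {xp_returns}")
print(f"Ratio: {linux_returns / xp_returns:.1f}x")  # roughly 4x
```

The point of the sketch is not that this is what actually happened – we have no idea – but that a four-fold difference in returns is entirely compatible with the two operating systems being equally reliable.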

None of these alternative explanations leads to the conclusion that “Linux is not ready for everyday users”. I could go on: we don’t know whether “four times higher” means “4 computers per year compared to 1” or “400,000 computers per year compared to 100,000”; we don’t even know the unit in which the rate was measured. It may even have been the average length of time between purchase and return, regardless of the actual number of computers sent back. Of course, the author could be right, and the higher number of returns could be because Linux isn’t very good compared to Windows, but this is only one of at least six or seven possible explanations, and there may well be several others that I haven’t even thought of.

A basic lack of comprehension of confounding variables is something I find particularly frightening. The Wikipedia page on confounding is particularly soul-destroying – a terrible mess of mostly impenetrable epidemiology and statistics jargon. As shown here, understanding confounding variables requires no complicated mathematics, just a little patience not to be taken in by seemingly shocking numbers (“four times higher” in this case) and a few minutes’ consideration of what else might be involved. And it applies to fields well beyond epidemiology.

About a year ago, I had a rather painful email conversation with a colleague who dismissed quantitative methods as rubbish because you could (and I’m paraphrasing here) “get a statistical correlation between obesity rates and crime rates but you can’t say that fat people are more likely to be criminals as it’s just a coincidence!” I patiently explained to her that it wasn’t a coincidence, but that there were some confounding variables involved. Now, I’m not going to detail here what these might have been. Instead, I’m going to use this story to introduce a little game I just made up called “Spot the Confounder(s)”. Do please join in! The rules are simple: if you can identify what might have been a confounding variable in this relationship (just a reminder, it’s obesity rates vs. crime rates), please feel free to leave a comment below with your suggestion. I expect there’ll be many more games of “Spot the Confounder” to come!


1 comment to First post – a lesson in confounding variables…

  • Welcome! A great first post.

    Confounders, just off the top of my head:

    * Poverty – more likely to commit crime & poor diet
    * The lack of micronutrients in a poor diet leads to behavioural problems
    * Urban dwellers get less exercise and are more likely to be criminals (more potential victims)
    * Fat criminals less able to run away so get caught more often, inflating the crime statistics

    Interestingly, although in this case, it’s fairly obvious that obesity probably doesn’t cause crime per se, there is also a very strong correlation between IQ and (most types of) crime. The same confounders could apply in that case too.
