## Paradoxes of probability and other statistical strangeness

Statistics is a useful tool for understanding the patterns in the world around us. But our intuition often lets us down when it comes to interpreting those patterns. In this series we look at some of the common mistakes we make and how to avoid them when thinking about statistics, probability and risk.

You don’t have to wait long to see a headline proclaiming that some food or behaviour is associated with either an increased or a decreased health risk, or often both. How can it be that seemingly rigorous scientific studies can produce opposite conclusions?

Nowadays, researchers can access a wealth of software packages that readily analyse data and output the results of complex statistical tests. While these are powerful resources, they also make it easy for people without a full statistical understanding to overlook the subtleties within a dataset and to draw wildly incorrect conclusions.

Here are a few common statistical fallacies and paradoxes and how they can lead to results that are counterintuitive and, in many cases, simply wrong.

## Simpson’s paradox

### What is it?

This is where trends that appear within different groups disappear when data for those groups are combined. When this happens, the overall trend might even appear to be the opposite of the trends in each group.

One example of this paradox is where a treatment can be detrimental in all groups of patients, yet can appear beneficial overall once the groups are combined.

### How does it happen?

This can happen when the sizes of the groups are uneven. A trial with careless (or unscrupulous) selection of the numbers of patients could conclude that a harmful treatment appears beneficial.

### Example

Consider the following double blind trial of a proposed medical treatment. A group of 120 patients (split into subgroups of sizes 10, 20, 30 and 60) receive the treatment, and 120 patients (split into subgroups of corresponding sizes 60, 30, 20 and 10) receive no treatment.

The overall results make it look like the treatment was beneficial to patients, with a higher recovery rate for patients with the treatment than for those without it.

However, when you drill down into the various groups that made up the cohort in the study, you see that in every group of patients the recovery rate was 50% higher for those who had no treatment.

But note that the size and age distribution of each group is different between those who took the treatment and those who didn’t. This is what distorts the numbers. In this case, the treatment group is disproportionately stacked with children, whose recovery rates are typically higher, with or without treatment.
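The trial above can be sketched in a few lines of Python. The subgroup sizes come from the text; the recovery counts are invented for illustration, chosen so that the untreated rate is 50% higher in every subgroup and yet the treated rate looks higher overall:

```python
# Hypothetical recovery counts illustrating Simpson's paradox.
# Subgroup sizes mirror the trial in the text; recovery numbers are invented.
groups = {
    # name: (treated_n, treated_recovered, untreated_n, untreated_recovered)
    "A": (10, 2, 60, 18),
    "B": (20, 4, 30, 9),
    "C": (30, 12, 20, 12),
    "D": (60, 36, 10, 9),
}

for name, (tn, tr, un, ur) in groups.items():
    # In every subgroup the untreated recovery rate is 50% higher
    print(f"{name}: treated {tr/tn:.0%}, untreated {ur/un:.0%}")

total_t = sum(tr for _, tr, _, _ in groups.values()) / sum(tn for tn, *_ in groups.values())
total_u = sum(ur for *_, ur in groups.values()) / sum(un for _, _, un, _ in groups.values())
# Yet combined, the treatment appears beneficial: 45% vs 40%
print(f"Overall: treated {total_t:.0%}, untreated {total_u:.0%}")
```

Group D plays the role of the children: it dominates the treated arm, and its recovery rates are high with or without treatment, which is what drags the combined treated rate up.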

## Base rate fallacy

### What is it?

This fallacy occurs when we disregard important information when making a judgement on how likely something is.

If, for example, we hear that someone loves music, we might think it’s more likely they’re a professional musician than an accountant. However, there are many more accountants than there are professional musicians. Here we have neglected that the base rate for the number of accountants is far higher than the number of musicians, so we were unduly swayed by the information that the person likes music.

### How does it happen?

The base rate fallacy occurs when we ignore how common each option is in the population. When the base rate for one option is substantially higher than for another, overlooking it can badly skew our judgement of which is more likely.

### Example

Consider testing for a rare medical condition, such as one that affects only 4% (1 in 25) of a population.

Let’s say there is a test for the condition, but it’s not perfect. If someone has the condition, the test will correctly identify them as being ill around 92% of the time. If someone doesn’t have the condition, the test will correctly identify them as being healthy 75% of the time.

So if we test a group of people, and find that over a quarter of them are diagnosed as being ill, we might expect that most of these people really do have the condition. But we’d be wrong.

According to our numbers above, of the 4% of patients who are ill, 92% will be correctly diagnosed as ill (that is, 3.68% of the overall population). But of the 96% of patients who are not ill, 25% will be incorrectly diagnosed as ill (that’s 24% of the overall population).

What this means is that of the approximately 27.68% of the population who are diagnosed as ill, only around 3.68% actually are. So of the people who were diagnosed as ill, only around 13% (that is, 3.68%/27.68%) actually are unwell.
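The calculation is just Bayes’ rule, and can be checked directly with the numbers from the text:

```python
# Bayes' rule with the numbers from the text
p_ill = 0.04          # base rate: 1 in 25
sensitivity = 0.92    # P(test positive | ill)
specificity = 0.75    # P(test negative | healthy)

true_pos = p_ill * sensitivity                # 3.68% of the population
false_pos = (1 - p_ill) * (1 - specificity)   # 24% of the population
p_positive = true_pos + false_pos             # 27.68% test positive

p_ill_given_positive = true_pos / p_positive
print(f"{p_ill_given_positive:.1%}")  # about 13.3%
```

The false positives swamp the true positives precisely because the healthy group is 24 times larger than the ill group.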

Worryingly, when a famous study asked general practitioners to perform a similar calculation to inform patients of the correct risks associated with mammogram results, just 15% of them did so correctly.

## Will Rogers paradox

### What is it?

This occurs when moving something from one group to another raises the average of both groups, even though no values actually increase.

The name comes from the American comedian Will Rogers, who joked that “when the Okies left Oklahoma and moved to California, they raised the average intelligence in both states”.

Former New Zealand Prime Minister Rob Muldoon provided a local variant on the joke in the 1980s, regarding migration from his nation into Australia.

### How does it happen?

When a datapoint is reclassified from one group to another, if the point is below the average of the group it is leaving, but above the average of the one it is joining, both groups’ averages will increase.

### Example

Consider the case of six patients whose life expectancies (in years) have been assessed as being 40, 50, 60, 70, 80 and 90.

The patients who have life expectancies of 40 and 50 have been diagnosed with a medical condition; the other four have not. This gives an average life expectancy within diagnosed patients of 45 years and within non-diagnosed patients of 75 years.

If an improved diagnostic tool is developed that detects the condition in the patient with the 60-year life expectancy, then the average within both groups rises by 5 years.
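The reclassification can be verified in a few lines, using the six life expectancies from the example:

```python
def avg(xs):
    return sum(xs) / len(xs)

diagnosed = [40, 50]
undiagnosed = [60, 70, 80, 90]
print(avg(diagnosed), avg(undiagnosed))   # 45.0 75.0

# The improved test reclassifies the patient with the 60-year expectancy
undiagnosed.remove(60)
diagnosed.append(60)
print(avg(diagnosed), avg(undiagnosed))   # 50.0 80.0 -- both averages rose by 5
```

No individual’s prognosis changed; only the group boundaries moved, yet both averages went up.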

## Berkson’s paradox

### What is it?

Berkson’s paradox can make it look like there’s an association between two independent variables when there isn’t one.

### How does it happen?

This happens when we have a set with two independent variables, which means they should be entirely unrelated. But if we only look at a subset of the whole population, it can look like there is a negative trend between the two variables.

This can occur when the subset is not an unbiased sample of the whole population. It has been frequently cited in medical statistics. For example, if patients only present at a clinic with disease A, disease B or both, then even if the two diseases are independent, a negative association between them may be observed.

### Example

Consider the case of a school that recruits students based on both academic and sporting ability. Assume that these two skills are totally independent of each other. That is, in the whole population, an excellent sportsperson is just as likely to be strong or weak academically as is someone who’s poor at sport.

If the school admits only students who are excellent academically, excellent at sport or excellent at both, then within this group it would appear that sporting ability is negatively correlated with academic ability.

To illustrate, assume that every potential student is ranked on both academic and sporting ability from 1 to 10. There are an equal proportion of people in each band for each skill. Knowing a person’s band in either skill does not tell you anything about their likely band in the other.

Assume now that the school only admits students who are at band 9 or 10 in at least one of the skills.

If we look at the whole population, the average academic rank of the weakest sportsperson and the best sportsperson are both equal (5.5).

However, within the set of admitted students, the average academic rank of the elite sportsperson is still that of the whole population (5.5), but the average academic rank of the weakest sportsperson is 9.5, wrongly implying a negative correlation between the two abilities.
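This selection effect can be reproduced exactly by enumerating a uniform population of (academic, sporting) band pairs and applying the school’s admission rule:

```python
import itertools
from statistics import mean

# Every (academic, sport) band pair from 1..10, equally represented
# and independent, as in the example
population = list(itertools.product(range(1, 11), repeat=2))

# The school admits anyone at band 9 or 10 in at least one skill
admitted = [(a, s) for a, s in population if a >= 9 or s >= 9]

# Academic average of band-10 vs band-1 sportspeople among admitted students
print(mean(a for a, s in admitted if s == 10))  # 5.5
print(mean(a for a, s in admitted if s == 1))   # 9.5
```

A band-10 sportsperson gets in regardless of academics, so all academic bands survive selection; a band-1 sportsperson only gets in with band 9 or 10 academics, which manufactures the apparent negative correlation.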

## Multiple comparisons fallacy

### What is it?

This is where unexpected trends can occur through random chance alone in a data set with a large number of variables.

### How does it happen?

When looking at many variables and mining for trends, it is easy to overlook how many possible trends you are testing. For example, with 1,000 variables, there are almost half a million (1,000×999/2) potential pairs of variables that might appear correlated by pure chance alone.

While each pair is extremely unlikely to look dependent, the chances are that from the half million pairs, quite a few will look dependent.

### Example

The Birthday paradox is a classic example of the multiple comparisons fallacy.

In a group of 23 people (assuming each of their birthdays is an independently chosen day of the year with all days equally likely), it is more likely than not that at least two of the group have the same birthday.

People often disbelieve this, recalling that it is rare that they meet someone who shares their own birthday. If you just pick two people, the chance they share a birthday is, of course, low (roughly 1 in 365, which is less than 0.3%).

However, with 23 people there are 253 (23×22/2) pairs of people who might have a common birthday. So by looking across the whole group you are testing to see if any one of these 253 pairings, each of which independently has a 0.3% chance of coinciding, does indeed match. These many possibilities of a pair actually make it statistically very likely for coincidental matches to arise.

For a group of as few as 40 people, it is about eight times as likely as not that there is a shared birthday.
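The standard way to compute this is via the complement: the probability that all n birthdays are distinct is a product of shrinking fractions.

```python
from math import prod

def p_shared_birthday(n):
    """Probability that at least two of n people share a birthday,
    assuming 365 equally likely, independent days."""
    return 1 - prod((365 - i) / 365 for i in range(n))

print(f"{p_shared_birthday(23):.3f}")  # 0.507 -- already better than even
print(f"{p_shared_birthday(40):.3f}")  # 0.891
```

At n = 40 the odds of a match versus no match are roughly 0.891 to 0.109, about eight to one.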

Stephen Woodcock, Senior Lecturer in Mathematics, University of Technology Sydney

## Smoking: new Australian data to die (or live) for

A new study of deaths from all causes in New South Wales published today in BMC Medicine (open access) reports both some very bad and very good news about smoking.

Up until now, Australian estimates of the death and disease risks of smoking have been modelled from large-scale British and United States cohort studies, where researchers followed very large groups of people across many years and compared the death records of “never smokers” with smokers and ex-smokers.

Now, for the first time, we have local cohort data. The 45 and Up study commenced in 2006 and tracked 204,953 people for an average 4.26 years (a total of 874,120 person years).

Researchers recorded participants’ smoking status (when the study started and at various follow-ups) and, where applicable, hospitalisations and death.

Overall, there were 5,593 deaths from all causes, with current smokers nearly three times (2.96 times) as likely to die as never smokers or former smokers.

The two stand-out results are that up to two-thirds of the deaths in current smokers were due to smoking (the bad news) and that death rates in former smokers who had quit before turning 45 were not different from those in the study who had never smoked (the very welcome news).

As other studies have reported, the smokers in this study died, on average, ten years earlier than the never smokers. With the life expectancy in Australia at 82.1 years, smokers are losing an average of one day in eight off their lives.

So, a person who started smoking at 15, who smoked an average of 15 cigarettes a day and died at 72, would have smoked 312,288 cigarettes in their lifetime. These each take about six minutes to smoke.

Across 57 years of smoking, this translates to 3.56 years of continual smoking, meaning that each cigarette on average can be expected to shave about 2.8 times the time it takes to smoke it off the end of smokers’ lives.
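The arithmetic above can be checked directly. The ten-year loss of life comes from the cohort results; the smoking habits (start at 15, 15 a day, six minutes a cigarette, death at 72) are the article’s illustrative assumptions:

```python
DAYS_PER_YEAR = 365.25

years_smoked = 72 - 15                           # 57 years of smoking
cigarettes = 15 * DAYS_PER_YEAR * years_smoked   # ~312,289 (the article rounds to 312,288)
minutes_smoking = cigarettes * 6
years_spent_smoking = minutes_smoking / (60 * 24 * DAYS_PER_YEAR)  # ~3.56 years

years_of_life_lost = 10
print(round(years_of_life_lost / years_spent_smoking, 1))  # 2.8
```

So each cigarette costs, on average, about 2.8 times the six minutes it takes to smoke.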

We’ve known for some time that smoking adversely affects almost every body organ and bodily system, from the eyes to the toes.

Big population health data sets now allow us to understand that the list of previous diseases caused and exacerbated by smoking was very conservative. A major US study published this month pooled five contemporary US cohort studies including 421,378 men and 532,651 women followed from 2000 to 2011. It found:

17% of the excess mortality among current smokers was due to associations with causes that are not currently established as attributable to smoking.

These associations included deaths from renal failure, hypertensive heart disease, infections, breast and prostate cancer.

When Sir Richard Doll’s 40 year follow-up of his historic British doctors study was published in 1994, the take-home message was “half of all regular cigarette smokers will eventually be killed by their habit”.

We can now say with confidence that up to two-thirds of smokers will die from their smoking, on average ten years early.

Stopping smoking before age 45 appears to eliminate most of this risk.

Only about one in ten smokers do not regret having started, and today there are twice as many ex-smokers than there are daily smokers. Most have quit without any professional or pharmacological assistance.

When American surgeon Dr Alton Ochsner* was a medical student in 1919, he was summoned to see a lung cancer operation, and told that this was a rare disease that he might never see again. He didn’t see another case for 17 years. Then he saw eight in six months – all smokers who had picked up the habit in WWI.

Today, lung cancer is the biggest cause of cancer death in the world. It is an epidemic spread by the tobacco industry, facilitated by government inaction. An article in the journal Nature in 2001 forecast that a billion deaths will be caused by tobacco this century.

Nations that have taken tobacco control seriously, such as Australia, Canada, Britain and the United States, are leading the way in dramatically reducing smoking rates. This new data will strengthen that resolve.

* This article originally named the American surgeon as William Osler. This has now been corrected.


## The census matters – making it less frequent is a risky idea

If reports are to be believed, both the Australian Bureau of Statistics (ABS) and the federal government are strongly considering moving from a five-year to a ten-year census cycle.

This move has been on the cards for a little while, given major changes to the census in comparable countries (such as the UK, Canada, New Zealand and the US) over recent years. Australia is a bit of an outlier in how often we conduct a census.

So, what might Australia gain from such a change? And what would it lose?

## What is the census used for?

Ultimately, Australia uses the census for the allocation of seats in the lower house of federal parliament. We need to make sure that each MP represents roughly the same number of people. For that, we need population estimates.

But the census is also used to determine how the Commonwealth distributes funds to state and territory governments. For example, the number of Indigenous Australians in a given jurisdiction is used to allocate GST revenue. We can do this because the census provides reliable information about small population groups. The most recent Closing the Gap report relies heavily on census data to understand Indigenous employment and early childhood education.

The census is a vital resource for research purposes. For example, the ABS has recently developed the Australian Census Longitudinal Dataset by linking censuses through time. This is a resource that is only just starting to be utilised and can shed light on dynamics and trends that aren’t available in smaller sample surveys.

The census is also great for marketing and planning purposes for businesses. Where is the market for a new café, or a new car cleaning service? The census can help with that.

One of the census’ key advantages is that it provides information about the population and their characteristics for very small geographic areas. This means that census data can be used by state/territory and local governments to plan for and deliver services. Is the population in an area ageing, or is it turning into a nappy valley? Do we need more aged care places, more childcare services or more primary schools?

We can get some of this information from administrative data – but not the detailed demographic and socioeconomic information. Ten years is a long time to have to wait.

## Why Australia might consider changing the census

The census is expensive – very expensive. The 2011 Census cost about A\$440 million to complete. While it would appear that the ABS has pushed for legislative change, it is also true that this is in the context of reduced ABS budgets. More needs to be done with less.

The census also imposes a burden on the population. It has been argued that the census is coercive and involves the collection of personal data. In part, this motivated the decision in Canada to make the census optional, though that move has been highly controversial.

It is also true that the census isn’t great at collecting information on all population groups. Mobile populations and those who live in gated apartments are notoriously hard to get information on. Also, because of its sheer scale, processing and publishing the census data takes time and results may be out of date by the time they are released.

There is also the growth in alternative sources of data. The UK considered dropping its census as it thought its administrative data combined with household surveys could do a good enough job. However, it announced in 2014 that it would proceed with a national census (it is ten-yearly) in 2021 after reviewing its options.

## Is there scope to make other sensible changes?

I have argued in other contexts that Australia’s current data needs for Indigenous policy aren’t being met in the current statistical environment. The same is true undoubtedly in other policy domains. The census isn’t the only game in town, or even always the best one. So, are there other ways to redirect scarce resources?

The census is currently undergoing one of the greatest revamps in its 100-year history. After relying on pen and paper for most of that history, it is anticipated that nearly two-thirds of Australians will fill in the 2016 census online. To support this, the ABS will take advantage of recent technological developments.

Questions are also relatively easy and painless to get put onto the census, but then very hard to take off. There is certainly scope to trim the census back to its core purposes and save money and people’s time.

## On balance, is it worth keeping?

The census is a very rich source of information. Everyone knows the census counts people, but it yields information about other types of statistical units, including families and dwellings. It covers a wide range of topics including some that are very infrequently covered by surveys such as unpaid work.

Alternatives such as the use of administrative data from population registers, possibly supplemented by sample surveys, are also expensive. Issues such as the public acceptability of alternatives like population registers would need to be considered.

Ultimately, one positive is that the news is out there way before the budget or any legislative changes. Australians can have a debate about whether we are willing to give up such a resource, and what it means for our democracy to have less rather than more information.

This article was prepared with assistance by Heather Crawford at the ANU.