Outliers – When should we eliminate them from our data?

As this week’s blog is another wildcard, I figured I would write about something wild. Outliers are wild because they stray from the rest of the group: they don’t follow the norm of the rest of the data. But just because something stands out, should it be excluded from the rest?

Outliers are often the result of measurement error, but they can also arise by chance. When an outlier comes from measurement error, it may be appropriate to remove it so that it does not distort the end results. However, outliers can also appear because the underlying distribution is heavy-tailed, so we need to be careful not to assume a normal distribution when judging which values in our data are outliers. With any large sample we can expect a few outliers, those that stand out from the crowd; but just like in society, if too many people stand out we start to wonder why.
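To put a rough number on "a few outliers", here is a minimal sketch in Python. It assumes the scores really are normally distributed and treats anything more than 3 standard deviations from the mean as extreme; both the cutoff and the sample size of 1,000 are illustrative assumptions, not figures from this post.

```python
from statistics import NormalDist

# Assumptions for illustration only: a standard normal distribution,
# a cutoff of 3 standard deviations, and a sample of 1,000 scores.
cutoff_sd = 3
sample_size = 1000

# Probability of landing more than `cutoff_sd` SDs from the mean, in either tail.
tail_probability = 2 * (1 - NormalDist().cdf(cutoff_sd))

# Expected number of such extreme values in a sample of this size.
expected_extremes = sample_size * tail_probability

print(f"P(|z| > {cutoff_sd}) = {tail_probability:.4f}")
print(f"Expected extreme values in {sample_size} scores: {expected_extremes:.1f}")
```

Under these assumptions we would expect roughly 3 such scores per 1,000 purely by chance. Seeing many more than that hints that something else is going on, such as measurement error or a distribution with heavier tails than the normal.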

So, to remove or not to remove? Sometimes removing outliers is an essential part of research, as they may not have been caused by chance. Take IQ, for example. If we conducted a study to measure the IQ of, say, our stats seminar group, we could reasonably expect the majority of people to have a decent score… that’s why we’re at university. However, say we got these scores from the IQ test:

100, 108, 97, 112, 115, 139, 105, 92, 94 and 59

We can see straight away that two scores stand out: 139 (the sample maximum) and 59 (the sample minimum). So what should we do with them? When deciding whether or not to remove a score, we should take several things into account. Firstly, was the person with a score of 59 having a really bad day? Or are they just not as clever as the rest of the group? Similarly, we should consider whether the person with a score of 139 is super intelligent or has just taken an IQ test before and knows how to work them (it is possible). Once we have established this, we can decide what we are going to do. In this case we should consult the Stanford-Binet chart (1) to determine where these scores would be categorised. The person with a score of 59 would be categorised as having a ‘borderline deficiency’, so we can assume they were having a bad day or were bored and couldn’t be bothered to do the test, as otherwise they probably wouldn’t be at university. Therefore it would be acceptable to remove them from the data set to avoid them skewing the data.
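Spotting the two extreme scores by eye can be backed up with a rule of thumb. The sketch below uses the common 1.5 × IQR (interquartile range) rule, which is chosen here purely for illustration and is not something the chart relies on. Quartiles can be computed in several ways; this assumes Python's `statistics.quantiles` with `method="inclusive"`, and a different convention can move the cutoffs enough to change whether 139 is flagged.

```python
from statistics import quantiles

# IQ scores from the example above.
scores = [100, 108, 97, 112, 115, 139, 105, 92, 94, 59]

# Quartiles by linear interpolation over the observed range
# (the "inclusive" method is an assumption; other conventions exist).
q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# 1.5 x IQR rule of thumb: anything outside these fences is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

flagged = sorted(s for s in scores if s < lower_fence or s > upper_fence)

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Flagged scores: {flagged}")
```

A flag from a rule like this only says that a score is unusual. It does not say why, which is where the questions about bad days and test practice come in.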

Terman’s Stanford-Binet Fourth Revision classification

IQ Range (“Deviation IQ”)    Intelligence Classification
152 +                        Genius or near genius
148 – 151                    Very superior intelligence
132 – 148                    Super intelligence
116 – 132                    Above average intelligence
84 – 116                     Normal or average intelligence
68 – 84                      Dullness
52 – 68                      Borderline deficiency
Below 52                     Mental deficiency

However, now we’re left with the higher score of 139. We know we had a controlled environment, so participants couldn’t cheat, and as background we have looked at their grades for the year: with all A*s, we can assume that their score is correct. As it is a natural reflection of their ‘super intelligence’ (according to the chart), we should leave the outlier in our data set, as it is a true representation of the intelligence of that person within the sample.

So to conclude, as the example above shows, it is sometimes necessary to remove extreme scores that skew the data for no valid reason, as otherwise our entire results can be distorted by one participant who couldn’t be bothered to do the test. However, we must take into account various factors, as discussed, before removing an outlier, as sometimes they can be a true representation of the natural differences that occur in human behaviour.
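To see how much one score can move the results, this small Python sketch recomputes the mean and standard deviation of the example scores with and without the score of 59. It is only an illustration of the comparison, using the standard library's `statistics` module.

```python
from statistics import mean, stdev

# IQ scores from the example above.
scores = [100, 108, 97, 112, 115, 139, 105, 92, 94, 59]

# Same data with the score we decided to remove (59) taken out.
scores_without_59 = [s for s in scores if s != 59]

mean_all, sd_all = mean(scores), stdev(scores)
mean_without, sd_without = mean(scores_without_59), stdev(scores_without_59)

print(f"With 59:    mean = {mean_all:.1f}, SD = {sd_all:.1f}")
print(f"Without 59: mean = {mean_without:.1f}, SD = {sd_without:.1f}")
```

Removing this single participant shifts the mean by almost five points and shrinks the standard deviation, which is why the decision to remove a score needs a justification.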

Also…

One more point I forgot to mention earlier: the mean is not considered a very robust statistic when working with outliers, as it is easily pulled towards extreme values. The median is much more robust, as it takes the middle value and is not affected by the outliers at either end. We could also use the trimmed mean, which removes a fixed percentage of scores from each end (for example the top and bottom 5%), essentially removing any outliers; even so, it is still easier to use the median. (2)
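As a rough illustration of the point about robustness, the sketch below compares the mean, the median and a trimmed mean on the example scores. `trimmed_mean` is a hypothetical helper written for this post, not a standard-library function. With only ten scores, trimming 5% from each end would remove no scores at all (5% of 10 is half a score, rounded down here), so the example trims 10% instead, which drops one score from each end.

```python
from statistics import mean, median

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end.

    Hypothetical helper for illustration; the number of values dropped
    per end is rounded down.
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values to drop at each end
    kept = ordered[k:len(ordered) - k]
    return mean(kept)

# IQ scores from the example above.
scores = [100, 108, 97, 112, 115, 139, 105, 92, 94, 59]

print("Mean:            ", mean(scores))                # 102.1
print("Median:          ", median(scores))              # 102.5
print("10% trimmed mean:", trimmed_mean(scores, 0.10))  # 102.875
```

Here the two extreme scores pull in opposite directions, so the three figures end up close together; with a single extreme score, the mean would move much more than the median.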

(1) Don’t judge, but the best and most readable table I could find came from Wikipedia: http://en.wikipedia.org/wiki/IQ_reference_chart

(2) http://www.ltcconline.net/greenl/courses/201/descstat/mean.htm


5 thoughts on “Outliers – When should we eliminate them from our data?”

  1. giggles20 says:

    I completely agree when you say that the mean is of limited use when dealing with outliers. However, I think that outliers should be removed before any analysis of the data takes place. That said, there may be an underlying reason for the outliers occurring, so some analysis needs to take place to find out why they occurred in the first place (http://itl.nist.gov/div898/handbook/prc/section1/prc16.htm).

    Therefore, it is important first to investigate why the data contain outliers, and then to transform the data to remove the outliers, so that the results give a true representation of the population.

  2. rtanti says:

    Outliers can be present for many different reasons: the participant may not have grasped the purpose of the experiment and be doing it wrong (therefore affecting the data), or they could be ruining the data on purpose by just clicking the button repeatedly and not listening to the task instructions.
    The best idea, in my opinion, is to do as you have done and look at the effect on the mean and other statistics. But after looking at the means, it is probably a good idea to create box-plots of the data to show how far the outlier is removed from the rest of the scores. By creating one box-plot with the outliers removed and one with them left in, the difference can be seen; you can then conclude whether taking the outlier out of your data will affect your results, and therefore the relationship shown overall.

  3. psud6e says:

    I think you’ve made very good points, and argued your opinion well. I definitely like your use of examples and applications to the real world – it makes your blog very down-to-earth and understandable, so people who are less confident in statistics can understand and not feel intimidated.
    You definitely know about outliers, and you mention looking at each case individually, to decipher whether it is a normal, random outlier, and should be left in, or whether it’s a ‘malicious’ outlier, whereby the individual has deliberately completed the experiment wrong. I agree with @rtanti’s comment about using box plots to find outliers. It’s very hard looking at a data set to see the outliers, apart from maybe a few obvious ones. The box plot reveals data that maybe you should be concerned about, and a close look at the individual data means the decisions can be made.
    Sometimes, however, it’s difficult to tell whether an outlier is normal or malicious. If this is the case, should you leave it in the data set, or should you remove it and carry out the analyses as normal? I would argue that you should do the data analysis both with the outlier included and with the outlier excluded. If the outlier doesn’t make a huge difference to the conclusion, then the outlier is likely to not be so much of an outlier. However, if there is a big difference between the two analyses, e.g. you would conclude a significant effect with one and a non-significant effect with the other, then arguably you would remove it. You would then also have the issue of the ethics of this: is removing data that changes your outcome really ethical? I’ll leave you to ponder 🙂

