As this week’s blog is another wildcard, I figured I would write about something wild. Outliers are wild: they stray from the rest of the group and don’t follow the norm of the rest of the data. But just because something stands out, should it be excluded from the rest?
Outliers are often the result of measurement error, but they can sometimes occur by chance. When outliers occur through measurement error it may be appropriate to remove them so that they do not significantly affect the end results. However, outliers can also result from the distribution being heavy-tailed, so we need to be careful not to assume a normal distribution when working with statistical data. With any large sample we can expect a few outliers, those that stand out from the crowd; but just like in society, if too many people stand out we start to wonder why.
So, to remove or not to remove? Sometimes removing outliers is an essential part of research, as they may not have been caused by chance. Take IQ, for example. If we conducted a study to measure the IQ of, say, our stats seminar group, we can reasonably say that the majority of people will have a decent score… hence why we’re at university. However, say we got these scores from the IQ test:
100, 108, 97, 112, 115, 139, 105, 92, 94 and 59
We can immediately see two outliers: 139 (the sample maximum) and 59 (the sample minimum). So what should we do with them? When deciding whether or not to remove a score we should take several things into account. Firstly, was the person with a score of 59 having a really bad day, or are they just not as clever as the rest of the group? Similarly, we should consider whether the person with a score of 139 is super intelligent or has simply taken an IQ test before and knows how to work them (it is possible). Once we have established this we can decide what to do. In this case we should consult the Stanford-Binet chart (1) below to see how these scores would be categorised. The person with a score of 59 would be categorised as having a ‘borderline deficiency’, so we can assume they were having a bad day or were bored and couldn’t be bothered to do the test, as otherwise they probably wouldn’t be at university. Therefore it would be acceptable to remove them from the data set to avoid them skewing the data.
Terman’s Stanford-Binet Fourth Revision classification

| IQ Range (“Deviation IQ”) | Intelligence Classification |
| --- | --- |
| 152+ | Genius or near genius |
| 148–151 | Very superior intelligence |
| 132–148 | Superior intelligence |
| 116–132 | Above average intelligence |
| 84–116 | Normal or average intelligence |
| 68–84 | Dullness |
| 52–68 | Borderline deficiency |
| Below 52 | Mental deficiency |
That leaves us with the higher score of 139. We know we had a controlled environment, so participants couldn’t cheat, and as background we have looked at their grades for the year: all A*s. We can therefore assume that the score is genuine and, as it is a natural reflection of their ‘superior intelligence’ (according to the chart), we should leave the outlier in our data set, because it is a true representation of the intelligence of that person within the sample.
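To put a number on that skew, here is a quick Python sketch (standard library only, and purely illustrative) using the ten scores above:

```python
from statistics import mean

# IQ scores from the example above
scores = [100, 108, 97, 112, 115, 139, 105, 92, 94, 59]

print(mean(scores))  # 102.1 -- the full sample

# Drop the suspect low score and recompute
cleaned = [s for s in scores if s != 59]
print(mean(cleaned))  # ~106.9 -- almost five points higher
```

One half-hearted participant shifts the whole group’s mean by nearly five points, which is exactly the kind of distortion we were worried about.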
So, to conclude: as the example above shows, it is sometimes necessary to remove extreme scores that skew the data for no valid reason, as otherwise our entire results can be distorted by one participant who couldn’t be bothered to do the test. However, we must take the factors discussed above into account before removing an outlier, because sometimes outliers are a true representation of the natural differences that occur in human behaviour.
Also….
One more point I forgot to mention earlier: the mean is not considered a very robust statistic when working with outliers, as it is easily skewed by extreme values. The median is much more robust: it takes the middle number and is not affected by the outliers at either end. We could also use the trimmed mean, which removes the top and bottom 5% of scores and so strips out most outliers; even so, the median is usually easier to use. (2)
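To see the difference in practice, here is a minimal sketch comparing the three estimators on the IQ scores from earlier (the trimmed mean is hand-rolled rather than taken from a library, and note that with only ten scores a 5% cut rounds down to removing nothing, so the example trims 10% instead):

```python
from statistics import mean, median

scores = [100, 108, 97, 112, 115, 139, 105, 92, 94, 59]

def trimmed_mean(data, proportion):
    """Drop the top and bottom `proportion` of scores, then average the rest."""
    data = sorted(data)
    k = int(len(data) * proportion)  # number of scores cut from each end
    return mean(data[k:len(data) - k])

print(mean(scores))                # 102.1   -- pulled around by 59 and 139
print(median(scores))              # 102.5   -- unaffected by either extreme
print(trimmed_mean(scores, 0.10))  # 102.875 -- both extremes trimmed away
```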
(1) Don’t judge, but the best and most readable table I could find came from Wikipedia: http://en.wikipedia.org/wiki/IQ_reference_chart
(2) http://www.ltcconline.net/greenl/courses/201/descstat/mean.htm
I completely agree when you say that the mean is of limited use when dealing with outliers. However, I think that outliers should be removed before any analysis of the data takes place. That said, there may be an underlying reason for outliers occurring, so some analysis needs to take place first to find out why these outliers occurred (http://itl.nist.gov/div898/handbook/prc/section1/prc16.htm).
Therefore, it is important to first investigate why the data contains outliers, and then to transform the data to remove them, generating results which are a true representation of the population.
Outliers can be present for many different reasons: the participant may not have grasped the purpose of the experiment and be doing it wrong (therefore affecting the data), or they could be ruining the data on purpose by just clicking the button repeatedly and not listening to the task instructions.
The best idea, in my opinion, is to do as you have done and look at the effect on the mean. But after looking at the means, it is probably a good idea to create box plots of the data with standard errors/deviations to show how far the outlier sits from the mean. By creating one box plot with the outliers removed and one where they aren’t, the difference can be seen, and you can therefore judge whether taking the outlier out of your data will affect your results and the relationship shown overall; a sketch of this follows below.
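Here is one way that side-by-side comparison could look, assuming matplotlib and using the chart’s ‘normal-ish’ band (68–132) purely as an illustrative cut-off for what counts as an outlier:

```python
import matplotlib.pyplot as plt

scores = [100, 108, 97, 112, 115, 139, 105, 92, 94, 59]
trimmed = [s for s in scores if 68 <= s <= 132]  # illustrative cut-off, not a rule

# One box plot with the outliers in, one with them taken out
fig, ax = plt.subplots()
ax.boxplot([scores, trimmed])
ax.set_xticklabels(["All scores", "Outliers removed"])
ax.set_ylabel("IQ score")
plt.show()
```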
I think you’ve made very good points and argued your opinion well. I definitely like your use of examples and applications to the real world; it makes your blog very down-to-earth and understandable, so people who are less confident in statistics can follow it and not feel intimidated.
You clearly know about outliers, and you mention looking at each case individually to decipher whether it is a normal, random outlier that should be left in, or a ‘malicious’ outlier, whereby the individual has deliberately completed the experiment wrong. I agree with @rtanti’s comment about using box plots to find outliers. It’s very hard to spot outliers just by looking at a data set, apart from maybe a few obvious ones. The box plot reveals data points you should perhaps be concerned about, and a close look at the individual data means informed decisions can be made.
Sometimes, however, it’s difficult to tell whether an outlier is normal or malicious. If this is the case, should you leave it in the data set, or should you remove it and carry out the analyses as normal? I would argue that you should run the analysis both with the outlier included and with it excluded. If the outlier doesn’t make much difference to the conclusion, then it is probably not very influential. However, if there is a big difference between the two analyses, e.g. you would conclude a significant effect with one and a non-significant effect with the other, then arguably you would remove it. But that raises an ethical issue: is removing data that changes your outcome really ethical? I’ll leave you to ponder 🙂
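As a sketch of what running it both ways could look like (assuming SciPy, a one-sample t-test against the conventional population mean of 100, and the IQ scores from the original post; whether the conclusion actually flips depends entirely on your data):

```python
from scipy import stats

scores = [100, 108, 97, 112, 115, 139, 105, 92, 94, 59]

# Run the same test twice: once with the suspect score, once without
for label, data in [("with outlier", scores),
                    ("without outlier", [s for s in scores if s != 59])]:
    t, p = stats.ttest_1samp(data, popmean=100)
    print(f"{label}: t = {t:.2f}, p = {p:.3f}")
```

If the p-value crosses your significance threshold only when the outlier is excluded, you are in exactly the grey area described above, and the honest move is usually to report both analyses.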