Monthly Archives: October 2011

Outliers – When should we eliminate them from our data?

As this week’s blog is another wildcard I figured I would write about something wild. Outliers are wild as they stray from the rest of the group. They don’t follow the norm of the rest of the data. But just because something stands out should it be excluded from the rest?

 Outliers are often a result of measurement error but can sometimes be due to chance. When outliers occur through measurement error it may be appropriate to remove them so that they do not affect the end results significantly. However, outliers can also result from the distribution being heavy-tailed. Therefore we need to be careful not to assume a normal distribution when working with outliers in statistical data. With any large sample we can expect a few outliers, those that stand out from the crowd; but just like in society, if too many people stand out we start to wonder why.

So, to remove or not to remove? Sometimes removing outliers is an essential part of research, as they may not have been caused by chance. Take IQ for example. If we conducted a study to measure the IQ of, say, our stats seminar group, we could reasonably expect the majority of people to have a decent score… hence why we’re at university. However, say we got these scores from the IQ test:

100, 108, 97, 112, 115, 139, 105, 92, 94 and 59

We can see straight away that we have two potential outliers: 139 (the sample maximum) and 59 (the sample minimum). So what should we do with them? When deciding whether to remove a score or not we should take several things into account. Firstly, was the person with a score of 59 having a really bad day, or are they just not as clever as the rest of the group? Similarly, we should consider whether the person with a score of 139 is super intelligent or has just taken an IQ test before and knows how to work them (it is possible). Once we have established this we can decide what we are going to do. In this case we should consult the Stanford-Binet chart(1) to determine where these scores would be categorised. The person with a score of 59 would be categorised as having a ‘borderline deficiency’, so we can assume they were having a bad day or were bored and couldn’t be bothered to do the test, as otherwise they probably wouldn’t be at university. Therefore it would be acceptable to remove them from the data set to avoid them skewing the data.

Terman’s Stanford-Binet Fourth Revision classification

IQ Range (“Deviation IQ”)     Intelligence Classification
152 +                         Genius or near genius
148 – 151                     Very superior intelligence
132 – 148                     Super intelligence
116 – 132                     Above average intelligence
84 – 116                      Normal or average intelligence
68 – 84
52 – 68                       Borderline deficiency
Below 52                      Mental deficiency
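The example picks out 59 and 139 by eye. As a rough cross-check, here is a minimal Python sketch of one common rule of thumb: flag anything more than 1.5 × the interquartile range beyond the quartiles. This rule is not part of the original example, so treat it as an assumption, and the function name `iqr_fences` is made up for illustration.

```python
import statistics

# IQ scores from the example above.
scores = [100, 108, 97, 112, 115, 139, 105, 92, 94, 59]


def iqr_fences(data, k=1.5):
    """Return (lower, upper) cut-offs using the k * IQR rule of thumb."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = iqr_fences(scores)
flagged = [x for x in scores if x < lower or x > upper]

print(f"fences: {lower:.2f} to {upper:.2f}")
print("flagged:", flagged)
```

With these ten scores, 59 falls well below the lower fence. 139 sits close to the upper fence, and whether it gets flagged depends on how the quartiles are interpolated (the `method` argument of `statistics.quantiles`). So a rule like this is a prompt to look more closely at a score, not a decision about removing it.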

However, we’re now left with the higher score of 139. We know we had a controlled environment, so participants couldn’t cheat, and as background we have looked at their grades for the year. With all A*s, we can assume that their score is correct. As it is a natural reflection of their ‘super intelligence’ (according to the chart), we should leave the outlier in our data set, as it is a true representation of that person’s intelligence within the sample.

So to conclude, as the example above shows, it is sometimes necessary to remove extreme scores that skew the data for no valid reason, as otherwise our entire results can be distorted by one participant who couldn’t be bothered to do the test. However, we must take into account various factors, as discussed, before removing an outlier, as sometimes they can be a true representation of the natural differences that occur in human behaviour.


One more point I forgot to mention earlier: the mean is not considered a very robust statistic when working with outliers, as it is easily pulled towards extreme values. The median is much more robust, as it takes the middle value and is largely unaffected by outliers at either end. We could also use a trimmed mean, which discards a fixed percentage of scores (say, the top and bottom 5%) before averaging and so removes the most extreme values; even so, it is still easier to use the median. (2)
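To make the comparison concrete, here is a small Python sketch using the ten IQ scores from the example. The `trimmed_mean` helper is written just for this illustration (it is not a standard-library function), and the 10% trim is an assumption chosen because, with only ten scores, a 5% trim rounds down to removing nothing at all.

```python
import statistics

# IQ scores from the example above.
scores = [100, 108, 97, 112, 115, 139, 105, 92, 94, 59]


def trimmed_mean(data, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(data)
    k = int(len(ordered) * proportion)  # how many values to cut from each tail
    return statistics.mean(ordered[k:len(ordered) - k])


without_59 = [x for x in scores if x != 59]

print("mean (all scores):   ", statistics.mean(scores))
print("mean (59 removed):   ", round(statistics.mean(without_59), 2))
print("median (all scores): ", statistics.median(scores))
print("median (59 removed): ", statistics.median(without_59))
print("10% trimmed mean:    ", trimmed_mean(scores, 0.10))
```

Dropping the single score of 59 moves the mean by several points, while the median moves by much less. The 10% trimmed mean drops both 59 and 139 and lands close to the median.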

(1)       Don’t judge but the best and most readable table I could find came from Wikipedia


Reliability … The good and the bad

Reliability is an essential part of research as without it how would we know which results to trust? For the purpose of this blog I’m only going to talk about it within research terms as otherwise I will end up writing some massive essay and going off on a tangent.

There are so many different types of reliability within research that you’d have thought all published research would have high reliability from trying to fit in with all of the guidelines. However, if you look into it, no piece of research is going to be perfect. I think there will always be some unreliable aspects of research, particularly when studying humans. For example, how can we possibly account for every type of variable? Is the person hungry? Are they nervous? Or are they tired?

 So, to define: reliability is when we are able to repeat a measure and gain the same (or similar) result time and time again. But how do we know if an experiment is reliable? Well there are several different methods that can be used to determine reliability.

First, the test-retest reliability method can be useful in determining how reliable a measure is. For example, if a class of psychology students take part in a reaction-time study in which they have to respond to certain stimuli, and then perform the same task a week later, we would hope for similar results. However, one of the main flaws of this method is that participants are likely to show a testing effect. For instance, if the students do the same test twice there may be an issue of practice effects. By this I mean they will be more familiar with the test, and because of this their reaction times may get faster, which may in turn reduce the apparent reliability of the measure. This is why it is best to use this method of testing reliability with things that remain stable over time, such as intelligence or personality.
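Test-retest reliability is usually summarised as the correlation between the two sets of scores. Here is a minimal Python sketch of that idea; the reaction times are made-up numbers for illustration only, and `pearson_r` is a hand-written helper, not something taken from a particular study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical reaction times (ms) for the same six students, one week apart.
week_1 = [310, 295, 350, 402, 288, 330]
week_2 = [298, 290, 335, 380, 280, 322]

r = pearson_r(week_1, week_2)
mean_change = sum(b - a for a, b in zip(week_1, week_2)) / len(week_1)

print(f"test-retest correlation: {r:.3f}")
print(f"mean change in reaction time: {mean_change:.1f} ms")
```

In this made-up data everyone gets a little faster in week two (a practice effect), yet the students stay in much the same order, so the correlation stays high. The correlation alone would not reveal the shift, which is why it helps to look at the mean change as well.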

Another measure of reliability is inter-rater reliability. This is used when more than one researcher takes measurements at the same time, and is often used when observing behaviour. Having two or more observers makes an observation more reliable, as it is less likely that something will be missed. I can remember learning about one study, but I can’t remember who did it. In the study two observers went out into the real world and studied children’s aggression by counting how many aggressive acts the children demonstrated. By using two observers the reliability of the study was improved, as it would have provided more accurate results. Cohen’s Kappa coefficient is a measure of agreement between two raters for categorical (qualitative) data, such as observational studies, and is an effective measure as it also takes into account that an agreement between observers may be due to chance*.
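Cohen’s Kappa compares the agreement we actually observe with the agreement we would expect by chance: kappa = (observed - expected) / (1 - expected). Here is a minimal Python sketch for two observers; the ratings are invented for illustration and the function is a hand-rolled version, not the one from any particular stats package.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical codings of ten playground events by two observers.
observer_1 = ["agg", "agg", "agg", "agg", "non", "non", "non", "non", "non", "non"]
observer_2 = ["agg", "agg", "agg", "non", "non", "non", "non", "non", "non", "agg"]

kappa = cohens_kappa(observer_1, observer_2)
print(f"kappa = {kappa:.2f}")
```

Here the observers agree on 8 out of 10 events, but because some of that agreement would happen by chance anyway, kappa comes out noticeably lower than 0.8.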

 There are other factors that can affect the reliability of a research study, and as I don’t want to waffle on forever I will briefly mention two of them.

The first factor is observer effect. It has been suggested by Eagly and Carli (1983)** that characteristics of the experimenter, such as age or sex, or behaviours such as body language, can affect the participant during a study, which can lead to a loss of reliability. For instance, Bickman (1974)*** conducted a study in which three confederates asked randomly chosen people on the street to, for example, “pick up that bag”. They were all dressed differently: one confederate was dressed as a milkman, another as a civilian and the third as a guard. The study found that people were more likely to obey the guard, as they saw him as an authority figure. We could suggest from this that participants in research studies may react differently than they normally would because they view the experimenter as an authority figure, particularly if the experimenter is wearing a white lab coat. They may try extra hard to please the experimenter or may do the complete opposite, thus reducing the reliability of the study.

The second factor I want to briefly mention is environmental changes. Whilst researchers make every effort to keep the conditions the same for all participants, it would be extremely difficult to account for everything. Changes in the time of day or time of year can affect how a participant will respond in an experiment or study. Even a slight change in temperature could affect how likely a person is to complete a task compared to another participant. If it’s hot then the participant may feel tired, or if it’s too cold a participant may not be able to concentrate#.

 So, to conclude, reliability in research is always important as it helps us to ensure that our measures are consistent. Unfortunately when working with people it is difficult to account for every possible factor that could affect the reliability of a study. Most of the time researchers try to account for the most likely variables and understand that they will never have the perfect experiment.





Do you need statistics to understand your data?

I think it depends on the type of data we have as to whether we need statistics in order to understand it. For example, whether we have qualitative data or quantitative data can make a difference as to whether we need to use statistics to interpret and understand our findings.

In the case of qualitative data it may not be necessary to use statistics, as this type of data often comes from interviews or case studies. We need qualitative research in psychology as a basis for direct experience. As a result we are often not measuring group means, and therefore would not need to use statistics to understand our data. When using data such as interview transcripts it would be easier to go through the information in front of you and evaluate it to gain an understanding. However, the problem with this method is that it is open to a lot of biases. For example, the researcher may be biased towards their hypothesis and so may only take into account the parts of the data that are in line with it. If a researcher did want to use statistics with qualitative data, they could operationalise the variables, making the data quantitative. Research into aggression and pro-social behaviour has used this method by categorising different types of behaviour into groups and giving certain behaviours certain scores. For example, ‘hitting’ would be categorised as aggressive behaviour and given a score of 5 (Ihori et al, 2007)*. The mean score could then be calculated and statistical tests used to test the hypothesis. Even so, in the case of qualitative data I don’t really think we need to use statistics to understand our data, but it may make the analysis simpler for the researcher and is also a more scientific method to use.
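As a rough sketch of what operationalising might look like in practice, the Python snippet below turns a list of coded behaviours into scores and takes the mean. Only the score of 5 for ‘hitting’ comes from the example above; the other behaviours, their scores and the observation list are hypothetical and are there purely to make the example run.

```python
import statistics

# Coding scheme: 'hitting' = 5 is from the text; the other entries are hypothetical.
AGGRESSION_SCORE = {
    "hitting": 5,
    "shouting": 3,   # hypothetical
    "sharing": 0,    # hypothetical (pro-social, so no aggression score)
}

# Hypothetical observations recorded for one child.
observed_behaviours = ["hitting", "sharing", "shouting", "sharing", "hitting"]

# Operationalise: replace each qualitative label with its numeric score.
behaviour_scores = [AGGRESSION_SCORE[b] for b in observed_behaviours]
mean_aggression = statistics.mean(behaviour_scores)

print("scores:", behaviour_scores)
print("mean aggression score:", mean_aggression)
```

Once each child has a number like this, group means can be compared with an ordinary statistical test.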

Quantitative data, however, is different in that statistics is an important part of understanding the data. If we just presented the raw data then it would be very difficult to comprehend. However, by performing the correct statistical test we can summarise a lot of confusing numbers, making them simpler to understand. For example, the p value produced (e.g. p<.01) is a lot quicker and easier to understand than looking through hundreds of numbers and trying to decide from the raw data whether a result is significant. It may also be useful to present the results of statistical tests while making the raw data readily available, so that people can see the bigger picture. Kling et al (1999)** conducted a study in which they analysed the relationship between self-esteem and gender. From this they were able to compare the mean self-esteem scores of the two genders to see how they differed. Without the use of statistics this study would have produced a lot of different numbers which would have been more difficult to understand. However, by statistically analysing the data they were able to produce graphs to accompany the final results, making them more understandable.
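To show where a p value can come from, here is a minimal Python sketch of an exact permutation test on the difference between two group means. This is only one way of getting a p value and is not the analysis used in the study mentioned above; the function name and the scores are made up for illustration.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def mean_gap(sum_a):
        # Absolute difference between the two group means for one split.
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = mean_gap(sum(group_a))
    as_extreme = 0
    n_splits = 0
    # Try every way of relabelling the pooled scores into groups of the same sizes.
    for indices in combinations(range(len(pooled)), n_a):
        n_splits += 1
        if mean_gap(sum(pooled[i] for i in indices)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_splits


# Hypothetical self-esteem scores for two small groups.
group_1 = [32, 35, 30, 38, 34]
group_2 = [29, 31, 28, 33, 30]

p = permutation_p_value(group_1, group_2)
print(f"p = {p:.3f}")
```

The p value is the proportion of relabellings that give a gap between the group means at least as big as the one actually observed, so a small value means the observed gap would be unusual if the group labels made no difference.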

So, to move on towards concluding, my opinion is that statistics can be important for understanding your data. However, I’m going to sit on the fence on this one and say that it completely depends on the type of data you have as to whether statistics will be helpful or not.


** (abstract) … Or … It’s in the research methods book from last year.