In writing about data and statistics, your first step is to choose the right statistic to present to your audience.
Think about the main message
Any number of values can be derived from a set of data, so part of your job as a writer is to decide what you should quote. Usually, state the quantity that best sums up the main message of the data.
To capture the typical standard of living in a town, you could quote the average (or mean) income. That is, add up all the incomes and divide by the number of people or households earning them.
But what if there is 1 extremely rich person in town? That could result in an average that does not give a good indication of the typical experience of someone in that town.
In this case, you could find the median. To do so, sort all the incomes from smallest to biggest, then find the entry halfway down the list. Half the people earn less and half earn more than this income. One huge income is still just 1 person in the list, so it does not affect the median very much.
The median is a better statistic to quote to capture the typical experience of someone in that town.
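The effect of a single extreme value can be checked with a short Python sketch. The income figures below are hypothetical, invented purely for illustration, and the variable names are assumptions rather than part of any real dataset.

```python
from statistics import mean, median

# Hypothetical annual incomes (in dollars) for a small town; invented for illustration.
incomes = [42_000, 48_000, 51_000, 55_000, 60_000, 63_000, 70_000]

# The same town after 1 extremely rich person moves in.
incomes_with_outlier = incomes + [5_000_000]

for label, data in [("Without outlier", incomes), ("With outlier", incomes_with_outlier)]:
    # The mean uses every value, so 1 huge income drags it upwards.
    # The median depends only on the middle of the sorted list.
    print(f"{label}: mean = {mean(data):,.0f}, median = {median(data):,.0f}")
```

With these made-up figures, the mean jumps from about 55,600 to 673,625, while the median moves only from 55,000 to 57,500.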
Draw on expert advice from statisticians or mathematicians, if necessary, to make sure your message is accurate.
Make sure the statistic is valid
Being able to do a calculation does not make it valid:
Evaluating participant experience
Let’s say we’ve run a training workshop and we want to evaluate participant experience. We ask participants some questions with answers from a Likert scale. For example:
1. Overall, how would you rate the usefulness of course material?
❑ Very poor ❑ Poor ❑ Average ❑ Good ❑ Very good
To sum this up, we could say very poor = 1, poor = 2, average = 3, good = 4 and very good = 5, and then average the responses to get (say) 3.4.
This calculation is not valid, because Likert-scale questions give ordinal data. The ‘distances’ between consecutive categories are purely psychological and not actually measured, so they cannot be assumed to be equal. For example, the difference between ‘very poor’ and ‘poor’ need not be the same as the difference between ‘good’ and ‘very good’, so we cannot average the scores. The calculation can be done, but the underlying assumptions are invalid. This average is not meaningful.
To show aggregated Likert-scale responses, you can use the median (the middle value in the set) or the mode (the most frequent value in the set). Mode is generally preferred.
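As a minimal sketch of how the median and mode could be found for Likert-scale responses, the Python below uses made-up workshop responses. The names `SCALE`, `responses`, `median_label` and `mode_labels` are assumptions for illustration. Category positions are used only to sort the responses, never to do arithmetic on them.

```python
from collections import Counter
from statistics import median_low, multimode

# Response categories in order, from lowest to highest.
SCALE = ["Very poor", "Poor", "Average", "Good", "Very good"]

# Hypothetical workshop responses, invented for illustration.
responses = [
    "Good", "Average", "Very good", "Good", "Poor",
    "Good", "Average", "Very good", "Good", "Average",
]

# Positions in SCALE are used only to put responses in order.
ranks = [SCALE.index(r) for r in responses]

# median_low returns an observed value, so 2 categories are never averaged.
median_label = SCALE[median_low(ranks)]

# multimode returns every most-frequent value, in case of a tie.
mode_labels = [SCALE[rank] for rank in multimode(ranks)]

counts = Counter(responses)
for label in SCALE:
    print(f"{label}: {counts[label]}")
print("Median:", median_label)
print("Mode:", ", ".join(mode_labels))
```

Reporting the count for each category alongside the median or mode lets readers see the full spread of responses.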
Likert scales are most commonly constructed with 5 or 7 anchor points. Scales with an odd number of possible responses allow the respondent to choose a ‘neutral’ answer because the scale has a midpoint. A Likert scale with an even number of anchor points, most commonly a 4-point scale, is known as a ‘forced choice’ scale because it does not allow for a neutral answer: the respondent must take a position.
Similarly, the correct formula to use when calculating correlations depends on the type of data – are they measurements (eg height, weight) or chosen from a list of categories (eg eye colour, educational level)?
Do not quote a statistic without understanding it, and knowing that it is valid.
Make sure the number of decimal places included is reasonable
The number of decimal places (or, more generally, significant figures) to which a result is quoted implies the precision of the number. Do not quote the output of a calculation in a way that implies that it is more precise than the inputs:
If we travel 103 km in 1.7 h, our average speed is:
$$\frac{103\mathrm{~km}}{1.7\mathrm{~h}}=60.5882353\mathrm{~km/h}$$
The distance, 103 km, has 3 significant figures, and the time, 1.7 h, has 2 significant figures. The least precise input has 2 significant figures, so quote the average speed as 61 km/h.
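The rounding step can be sketched in Python. The helper `round_sig` is a hypothetical function written for this example, not part of the standard library. It assumes you have already decided how many significant figures the inputs justify.

```python
from math import floor, log10


def round_sig(value: float, sig_figs: int) -> float:
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0.0
    # Position of the leading digit: 60.588... has exponent 1 (tens place).
    exponent = floor(log10(abs(value)))
    # Keep sig_figs digits, counting from the leading digit.
    return round(value, sig_figs - 1 - exponent)


distance_km = 103  # 3 significant figures
time_h = 1.7       # 2 significant figures (the least precise input)

speed = distance_km / time_h
print(round_sig(speed, 2), "km/h")
```

Here the raw quotient is rounded to 2 significant figures, which matches the precision of the least precise input.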