Monday, October 16, 2017

Don't be mean


I'm not talking about mean as in "offensive, selfish, or unaccommodating; nasty; malicious", although it's a good idea to not be those things.

I'm talking about mean in the mathematical sense, as in one of the "measures of central tendency". In this case mean means average. So, don't be average, which is also good advice.

Mean is what we're referring to when we talk about average, although it's not always the best measure of central tendency. Since we're most familiar with the mean, we'll start our conversation there.

In this example from Laerd Statistics, the mean, or average, is 59:
65 55 89 56 35 14 56 55 87 45 92
Or in order from smallest to largest: 
14 35 45 55 55 56 56 65 87 89 92
These could be the scores on a math test. To find the mean, you add up all of the values (here the total is 649) and divide by the number of values (here 11): 649 / 11 = 59. The formula (again from Laerd) is:
The mean (or average) is the most popular and well known measure of central tendency. It can be used with both discrete and continuous data, although its use is most often with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have $n$ values in a data set and they have values $x_1, x_2, \ldots, x_n$, the sample mean, usually denoted by $\bar{x}$ (pronounced x bar), is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
This formula is usually written in a slightly different manner using the Greek capital letter, $\Sigma$, pronounced "sigma", which means "sum of...":
$$\bar{x} = \frac{\sum x}{n}$$
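The arithmetic for the test scores can be checked with a few lines of Python. This is only a minimal sketch; the variable names (`scores`, `total`, `count`, `average`) are my own and not from Laerd.

```python
from statistics import mean

# Test scores from the example above.
scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

total = sum(scores)      # sum of all the values
count = len(scores)      # number of values
average = total / count  # the mean: sum divided by count

print(total, count, average)

# The standard library's statistics.mean should agree with the manual result.
print(mean(scores) == average)
```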
But the mean only gives you half the answer. We also want to know how spread out the values are from the mean. For example 99, 100 and 101, and 0, 100 and 200 both have a mean of 100...but the values in the second set are more spread out than the first set. The simplest measure of the spread is the range, which is the difference between the highest and lowest values. In the first set, the range is 2 (101 - 99) while the range in the second set is 200 (200 - 0). In the test scores example above, the range is 78 (92 - 14).
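To see how two data sets can share a mean but differ in spread, here is a small sketch. The helper `value_range` is a hypothetical name chosen for illustration.

```python
def value_range(values):
    """Return the difference between the highest and lowest value."""
    return max(values) - min(values)

tight = [99, 100, 101]
wide = [0, 100, 200]
scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

# Print the mean and the range of each data set side by side.
for name, data in (("tight", tight), ("wide", wide), ("scores", scores)):
    print(name, sum(data) / len(data), value_range(data))
```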

The range is ok in simple examples, but a better measure is the standard deviation:

Standard Deviation

The Standard Deviation is a measure of how spread out numbers are.
Its symbol is σ (the Greek letter sigma).
The formula is easy: it is the square root of the Variance. So now you ask, "What is the Variance?"

Variance

The Variance is defined as: The average of the squared differences from the Mean.
To calculate the variance follow these steps:

  • Work out the Mean (the simple average of the numbers)
  • Then for each number: subtract the Mean and square the result (the squared difference).
  • Then work out the average of those squared differences.
For our test score example above, where the mean is 59, the variance is:

[(65-59)^2 + (55-59)^2 + (89-59)^2 + (56-59)^2 + (35-59)^2 + (14-59)^2 + (56-59)^2 + (55-59)^2 + (87-59)^2 + (45-59)^2 + (92-59)^2] / 11, which is (36 + 16 + 900 + 9 + 576 + 2025 + 9 + 16 + 784 + 196 + 1089) / 11, or 5656 / 11, which is 514.181818... The standard deviation is the square root of that, or about 22.67558.
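The three variance steps translate almost directly into Python. The sketch below divides by the number of values (the "population" variance, which is what the steps above describe). Python's `statistics.pvariance` and `statistics.pstdev` do the same, while `statistics.variance` and `statistics.stdev` divide by n - 1 and would give slightly larger numbers. The variable names are mine.

```python
from math import sqrt
from statistics import pstdev, pvariance

scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

# Step 1: work out the mean.
mu = sum(scores) / len(scores)

# Step 2: for each number, subtract the mean and square the result.
squared_diffs = [(x - mu) ** 2 for x in scores]

# Step 3: average the squared differences to get the variance.
variance = sum(squared_diffs) / len(scores)

# The standard deviation is the square root of the variance.
std_dev = sqrt(variance)

print(sum(squared_diffs), round(variance, 6), round(std_dev, 5))

# Cross-check against the standard library's population versions.
print(round(pvariance(scores), 6), round(pstdev(scores), 5))
```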

Now, what can we do with that? For data that is roughly normally distributed (a bell curve), approximately two-thirds of the values (68%) will lie within 1 standard deviation of the mean, 95% within 2 sd, and almost all (99.7%) within 3 sd.

In our example, 1 standard deviation either side of the mean is 36.32 to 81.68, 2 sd is 13.65 to 104.35 and 3 sd is -9.03 to 127.03. So about 7 or 8 of our 11 values should lie within 1 sd; we have 6 (45, 55, 55, 56, 56, 65). All of the rest lie within 2 sd; none are outside that range.
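Counting the values inside each band by hand is easy to get wrong, so here is a short sketch that does it. The helper `within` is a made-up name for illustration, and the population standard deviation is assumed, matching the calculation above.

```python
from statistics import mean, pstdev

scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
mu = mean(scores)       # 59
sigma = pstdev(scores)  # population standard deviation, about 22.68

def within(values, center, spread, k):
    """Return the values lying within k spreads either side of the center."""
    low, high = center - k * spread, center + k * spread
    return [x for x in values if low <= x <= high]

# Show the band limits, the count, and the values for 1, 2 and 3 sd.
for k in (1, 2, 3):
    inside = within(scores, mu, sigma, k)
    print(k, round(mu - k * sigma, 2), round(mu + k * sigma, 2),
          len(inside), sorted(inside))
```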

If we had 1,000 values we should expect 680 within 1 standard deviation, 950 within 2 sd and 997 within 3 sd (or all but 3).
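The 68 / 95 / 99.7 figures are rounded shares of a normal (bell-shaped) distribution. As a rough sketch, assuming the data really is normal, Python's `statistics.NormalDist` can reproduce them. The function name `share_within` is my own.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Share and expected count out of 1,000 values for 1, 2 and 3 sd.
for k in (1, 2, 3):
    share = share_within(k)
    print(k, round(share, 4), round(1000 * share))
```

The unrounded shares are about 68.27%, 95.45% and 99.73%, so the expected counts out of 1,000 are a touch higher than the rounded 680 and 950.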

The mean works well if all of the values are fairly close to each other, but it can be thrown off by outliers, or values that are far away from the others. That's why you'll usually see median incomes or median home prices. A CEO's income can throw off a salary table, or a very expensive home can do the same thing to home prices. Going back to Laerd:
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set by being especially small or large in numerical value. For example, consider the wages of staff at a factory below:
Staff 1 2 3 4 5 6 7 8 9 10
Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k
The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean value might not be the best way to accurately reflect the typical salary of a worker, as most workers have salaries in the $12k to 18k range. The mean is being skewed by the two large salaries. Therefore, in this situation, we would like to have a better measure of central tendency. As we will find out later, taking the median would be a better measure of central tendency in this situation.
To take the median, the values have to be sorted from lowest to highest:
12k, 14k, 15k, 15k, 15k, 16k, 17k, 18k, 90k, 95k
The median is the middle value. Since there is an even number of values, you take the average of the 2 middle values, which are 15k and 16k in this example, so the median is 15.5k. That makes more sense than the $30.7k mean. Half the values (5) are below the median and half are above it.
You can take the inter-quartile range by finding the lower and upper quartiles (which are basically the medians of the bottom half and the top half of the data). Since there are 5 values in each half, each quartile is the middle value of its half: the lower quartile (Q1) is 15k and the upper quartile (Q3) is 18k. The inter-quartile range (IQR) is 18k - 15k, or 3k. Roughly half of the data values fall in that range; in our example, that is between $15k and $18k. If any value falls more than 1.5 times the IQR above Q3 or below Q1, it is considered an outlier. 1.5 times the IQR is 4.5k, so anything below $10.5k (15k - 4.5k) or above $22.5k (18k + 4.5k) is an outlier. We can see that our 2 highest values, $90k and $95k, are outliers. You can use a box plot to show these values:

The blue box is the middle 50%, the white line is the median, the vertical bar is the lowest value
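The median, quartile and outlier steps for the salary data can be sketched in Python as follows. This uses the "median of each half" method described above. Other tools (including Python's own `statistics.quantiles`) use different quartile conventions and may give slightly different values. The variable names are mine, and salaries are in thousands of dollars.

```python
from statistics import median

# Salaries in thousands of dollars, from the Laerd example.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

ordered = sorted(salaries)
half = len(ordered) // 2
lower_half = ordered[:half]                 # bottom 5 values
upper_half = ordered[len(ordered) - half:]  # top 5 values

mean_salary = sum(ordered) / len(ordered)
mid = median(ordered)    # average of the 2 middle values
q1 = median(lower_half)  # lower quartile
q3 = median(upper_half)  # upper quartile
iqr = q3 - q1

# Anything more than 1.5 * IQR beyond the quartiles counts as an outlier.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < low_fence or x > high_fence]

print(mean_salary, mid, q1, q3, iqr)
print(low_fence, high_fence, outliers)
```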
Last but not least is the mode. The mode shows the "popularity" of something, such as the most popular color of car in the parking lot or the most prevalent color of M&M's in a bag.

When I got to work at 6:45 on a Sunday evening, there were 2 black cars, 2 gray cars, 2 white cars, 1 silver car and 1 blue car (mine). In this case, the data is multi-modal because 3 colors are tied for the highest count. If there had been one more company car (white), then the mode would have been white, with 3 cars.
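The parking lot example can be reproduced with Python's `statistics.multimode`, which returns every value tied for the highest count. This is a small sketch; the list `cars` is just my encoding of the counts above.

```python
from collections import Counter
from statistics import multimode

# 2 black, 2 gray, 2 white, 1 silver, 1 blue.
cars = ["black"] * 2 + ["gray"] * 2 + ["white"] * 2 + ["silver", "blue"]

print(Counter(cars).most_common())  # each color with its count
print(multimode(cars))              # all colors tied for the top count

# With one more white car there is a single mode.
cars_plus_one = cars + ["white"]
print(multimode(cars_plus_one))
```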

I bought a bag of plain M&M's and the color breakdown was as follows:

Brown - 7
Yellow - 12
Red - 6
Orange - 11
Green - 13
Blue - 5.5
Total - 54.5

The mode is green, with a count of 13, although yellow (12) and orange (11) are quite close.
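When the data is already a table of counts, the mode is simply the category with the largest count. Here is a minimal sketch using my bag's numbers; the dictionary `counts` and the other names are my own.

```python
# Color counts from my bag (the half is a broken blue one).
counts = {
    "Brown": 7,
    "Yellow": 12,
    "Red": 6,
    "Orange": 11,
    "Green": 13,
    "Blue": 5.5,
}

total = sum(counts.values())

# The mode is the color with the largest count.
mode_color = max(counts, key=counts.get)

# List the colors from most to least common with their share of the bag.
for color, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{color:<7}{n:>5}  {n / total:6.1%}")

print("Mode:", mode_color, counts[mode_color])
```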

There is actually a breakdown of colors in a bag, according to Deal News:
On average, the mix of each variety of M&Ms follows these color percentage breakdown: M&M'S Milk Chocolate: 30% brown, 20% yellow, 20% red, 10% orange, 10% green, and 10% blue. M&M'S Peanut: 20% brown, 20% yellow, 20% blue, and 20% red, 10% orange, and 10% green.

Interesting days


Today - Steve Jobs Day, Department Store Day, Feral Cat Day, Boss' Day, Clean Your Virtual Desktop Day and Dictionary Day

Tomorrow - International Day for the Eradication of Poverty, Wear Something Gaudy Day, Spreadsheet Day, Mulligan Day, Ada Lovelace Day and Playing Card Collection Day

Next Monday - iPod Day, Mole Day, TV Talk Show Host Day and Boston Cream Pie Day

November 16 - Beaujolais Nouveau Day, Social Enterprise Day, Button Day, Have a Party with Your Bear Day, Fast Food Day and International Day for Tolerance

