Descriptive Statistics

Aleksandr Michuda

Agenda

Measures of Location
Measures of Variability
Measures of Distribution Shape, Relative Location, and Detecting Outliers
Five-Number Summaries and Box Plots
Measures of Association Between Two Variables
Data Dashboards: Adding Numerical Measures to Improve Effectiveness

Parameters vs. Statistics

If the measures are computed for data from a sample, they are called sample statistics.
If the measures are computed for data from a population, they are called population parameters.
A sample statistic is referred to as the point estimator of the corresponding population parameter.

Measures of Location

Mean
Median
Mode
Weighted Mean
Percentiles
- Quartiles

The Mean

Perhaps the most important measure of location is the mean.
The mean provides a measure of central location.
The mean of a data set is the average of all the data values.
The sample mean $\bar{x}$ is the point estimator of the population mean, $\mu$.

\[ \bar{x} = \frac{\sum_i^n x_i}{n} \]

Note: $\sum_i^n x_i = x_1 + x_2 + x_3 + ... + x_n$

Looking at the Mean

What do you see?

Example

Calculate the mean:

$x = \{10,20,4,5\}$

Median

The median of a data set is the value in the middle when the data items are arranged in ascending order.
Whenever a data set has extreme values, the median is the preferred measure of central location.
The median is the measure of location most often reported for annual income and property value data.
A few extremely large incomes or property values can inflate the mean.

Median

Here we have an odd number of observations: 7 observations: 26, 18, 27, 12, 14, 27, and 19. Rewritten in ascending order: 12, 14, 18, 19, 26, 27, and 27.

What is the median?

The median is the middle value in this list, so the median = 19.

Median

Median

Mode

The mode of a data set is the value that occurs with greatest frequency.
The greatest frequency can occur at two or more different values.
If the data have exactly two modes, the data are bimodal.
If the data have more than two modes, the data are multimodal.

Bimodal Distributions

Weighted Means

Weighted means make a lot more sense when you think of the mean as a weighted mean, itself.
What assumption is the simple mean making when divide by $N$?

Weighted Means

In some instances the mean is computed by giving each observation a weight that reflects its relative importance.
The choice of weights depends on the application.
The weights might be the number of credit hours earned for each grade, as in GPA.
In other weighted mean computations, quantities such as pounds, dollars, or volume are frequently used.

The “Simple” Mean is a Weighted mean

What is the weight of a simple mean?

In other words, what is $w_i$ for the simple mean?

\[ \bar{x} = \frac{1}{N} \sum_i^N x_i = \frac{\sum_i^N \frac{1}{N} x_i}{\sum_i^N \frac{1}{N}} \]

Example

Ron Butler, a home builder, is looking over the expenses he incurred for a house he just built. For the purpose of pricing future projects, he would like to know the average wage ($/hour) he paid the workers he employed. Listed below are the categories of workers he employed, along with their respective wage and total hours worked. What is the weighted mean?

Example

FYI: The simple mean would be $21.21

Percentiles

A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.
Admission test scores for colleges and universities are frequently reported in terms of percentiles. The $p$th percentile of a data set is a value such that at least p percent of the items take on this value or less and at least $(100-p)$ percent of the items take on this value or more.

Arrange the data in ascending order.
Compute 𝐿_𝑝, the location of the $p$th percentile.

Percentiles

The median is a percentile! It’s the 50th percentile

Example

“At least 80% of the items take on a value of 646.2 or less.”

Where does this method come from?

The fancy term for this is “linear interpolation”
Since the position (56.8) is not a whole number, we can’t just pick a value from the table.
- We have to “interpolate,” which is a fancy way of saying we take a weighted average between the two surrounding points.
- The formula for the 80th percentile is:
\[ \text{Value at Integer Part} + \text{Decimal Part} \times (\text{Next Value} - \text{Current Value}) \]
Integer Part (56th value): 635
Decimal Part (0.8): This represents how much further we need to go past the 56th value.
The Difference ($649 - 635$): This is the distance between the 56th and 57th values.
Several ways to calculate percentiles (different software like Excel, R, or Stata use slightly different versions)
Why do we use this one:
- It is precise: It provides a unique value even when the percentile falls between two data points.
- It mirrors the CDF: It treats the data as if it represents a continuous underlying distribution rather than just discrete steps

Percentiles by any other name…

Often we use commonly used sets of percentiles and give them other names

Quartiles: 0, 25, 50, 75, 100
Quintiles: 0, 20, 40, 60, 80, 100
Deciles: 0,10,20,30,40,50,60,70,80, 90, 100

Measures of Variability

In statistics, you usually need two pieces of information:

A measure of location or some estimate of an impact
A measure of variability or some measure of uncertainty about the impact

Range

\[ \text{Max Value} - \text{Min Value} \]

Very simple measure of variability
But ONLY sensitive to min and max

IQR (Interquartile Range)

\[ \text{Third Quartile} - \text{First Quartile} \]

The range for the middle 50% of the data
More sensitive to data, and not so dependent on min and max

Outliers

An outlier is an unusually small or unusually large value in a data set.

It might be:

an incorrectly recorded data value
a data value that was incorrectly included in the data set
a correctly recorded data value that belongs in the data set

How to detect outliers using IQR

Calculate Q1 and Q3
Calculate IQR = Q3 - Q1
Calculate Lower Bound = Q1 - 1.5*IQR
Calculate Upper Bound = Q3 + 1.5*IQR
Any data point outside of these bounds is considered an outlier
This is a heuristic, not a hard rule
There are other methods to detect outliers as well

Example

Here’s some data:

[12, 15, 14, 10, 18, 20, 22, 30, 100]

Find the outliers using the IQR method

Arrange in ascending order: [10, 12, 14, 15, 18, 20, 22, 30, 100]
Find Q1 and Q3:
- Q1 = 13 (25th percentile)
- $(25/100)(9+1) = 2.5$ → between 2nd and 3rd values → $12 + 0.5(14-12) = 13$
- Q3 = 26 (75th percentile)
- $(75/100)(9+1) = 7.5$ → between 7th and 8th values → $22 + 0.5(30-22) = 26$
IQR = Q3 - Q1 = 26 - 13 = 13
Lower Bound = $Q1 - 1.5*IQR = 13 - 1.5*13 = -6.5$
Upper Bound = $Q3 + 1.5*IQR = 26 + 1.5*13 = 45.5$
Any data point outside of [-6.5, 45.5] is an outlier. Here, 100 is an outlier.

Other methods

Sometimes the IQR method is too sensitive
You may also just decide to cut off or “winsorize” data at a certain percentile
For example, you may decide to set all data above the 95th percentile to be equal to the 95th percentile
This is common in income data, where you may not want the top 5% of earners to skew your analysis
More sophisticated methods exist such as machine learning techniques
- The crux of those methods to try to learn the “normal” pattern of data and flag anything that deviates significantly from that pattern

Variance

The variance is a measure of variability that utilizes all the data.
It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).
The variance is useful in comparing the variability of two or more variables.
The variance is the average of the squared deviations between each data value and the mean.

Variance

The variance of a sample is:

\[ s^2 = \frac{1}{N-1} \sum_i^N (x_i - \bar{x})^2 \]

The variance of a population is:

\[ \sigma^2 = \frac{1}{N} \sum_i^N (x_i - \mu)^2 \]

Sample vs. Population

It seems strange that since I told you that you can never know the DGP, that we’re talking about population variance
Although it’s useful theoretically, the population variance is never going to be a known thing, we can only estimate it
Why is the sample variance $N-1$, not $N$?

“Degrees of Freedom”

Think of it as a penalty for the fact that we are estimating

Standard Deviation

The standard deviation of a data set is the positive square root of the variance.
It is measured in the same units as the data, making it more easily interpreted than the variance. The standard deviation of a sample is:

\[ s = \sqrt{s^2} \]

Coefficient of Variation

The coefficient of variation indicates how large the standard deviation is in relation to the mean.
- High CV: More variability relative to the mean
- Low CV: Less variability relative to the mean
The coefficient of variation of a sample is:

\[ \frac{s}{\bar{x}} \cdot 100 \]

Examples:

Finance: Comparing risk (volatility) of different investments
Lab Science: A CV of less than 5% often indicates high “repeatability” of an experiment.

When to use and when not to use CV

Ratio Scale Only: It only makes sense for data with a “true zero” (like height, weight, or income).
- It does not work for things like temperature in Celsius or Fahrenheit because $0^\circ\text{C}$ is an arbitrary point.
Means Near Zero: If the mean is very close to zero, the CV will explode toward infinity, making the result meaningless.

Variability Measures in Action!

Text(0.5, 0.98, 's^2=0.833, s:0.913, CV:-25.545')

More Variable

Text(0.5, 1.0, 's^2=4.445, s:2.108, CV:2.467')

Measures of Distribution Shape, Relative Location, and Detecting Outliers

z-Scores
Chebyshev’s Theorem
Empirical Rule
Detecting Outliers

Z-scores

The z-score is often called the standardized value.
It denotes the number of standard deviations a data value xi is from the mean.
An observation’s z-score is a measure of the relative location of the observation in a data set.
A data value less than the sample mean will have a z-score less than zero.
A data value greater than the sample mean will have a z-score greater than zero.
A data value equal to the sample mean will have a z-score of zero.

Example

Chebyshev’s Theorem

At least $(1 – 1/a^2)$ of the data values must be within z standard deviations of the mean, where $a$ is any value greater than 1.
Chebyshev’s theorem requires $a > 1$; but $a$ need not be an integer.
At least 75% of the data values must be within $a = 2$ standard deviations of the mean.
At least 89% of the data values must be within $a = 3$ standard deviations of the mean.
At least 94% of the data values must be within $a = 4$ standard deviations of the mean.

Example

Empirical Rule

Covariance and Correlation

Finds the relationships between two variables
The covariance and correlation measure the linear association between two variables.
- Positive values indicate a positive relationship.
- Negative values indicate a negative relationship. The covariance is computed as follows:

\[ s_{xy} = \frac{1}{N} \sum_i^N (x_i - \bar{x})(y_i - \bar{y}) \]

Correlation is:

\[ r_{xy} = \frac{s_{xy}}{s_x \cdot s_y} \]