Mode
- The mode of a data set is the value that occurs with greatest frequency.
- The greatest frequency can occur at two or more different values.
- If the data have exactly two modes, the data are bimodal.
- If the data have more than two modes, the data are multimodal.
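A minimal sketch of finding every mode, using only the Python standard library. The helper name `modes` and the data are made up for illustration.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the greatest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())  # the greatest frequency
    return sorted(v for v, c in counts.items() if c == top)

# Made-up data for illustration only.
print(modes([1, 2, 2, 3]))        # [2]        -> one mode
print(modes([1, 1, 2, 3, 3]))     # [1, 3]     -> bimodal
print(modes([1, 1, 2, 2, 3, 3]))  # [1, 2, 3]  -> multimodal
```

The standard library also provides `statistics.multimode`, which returns the same values in order of first appearance.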
Weighted Means
- Weighted means make a lot more sense once you see that the simple mean is itself a weighted mean.
- What assumption does the simple mean make when we divide by \(N\)?
Weighted Means
- In some instances the mean is computed by giving each observation a weight that reflects its relative importance.
- The choice of weights depends on the application.
- The weights might be the number of credit hours earned for each grade, as in GPA.
- In other weighted mean computations, quantities such as pounds, dollars, or volume are frequently used.
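As a sketch of the GPA case, the weighted mean is the sum of weight times value, divided by the sum of the weights. The function name `weighted_mean` and the transcript below are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical transcript: grade points and credit hours (made up).
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 1]

gpa = weighted_mean(grade_points, credit_hours)
print(gpa)                                   # (12 + 12 + 2) / 8 = 3.25
print(sum(grade_points) / len(grade_points)) # simple mean: 3.0
```

The two answers differ because the 4-credit course counts for more than the 1-credit course.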
The “Simple” Mean is a Weighted Mean
What is the weight of a simple mean?
In other words, what is \(w_i\) for the simple mean?
\[
\bar{x} = \frac{1}{N} \sum_i^N x_i = \frac{\sum_i^N \frac{1}{N} x_i}{\sum_i^N \frac{1}{N}}
\]
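A quick numerical check of the identity above, on made-up numbers: giving every observation the same weight \(w_i = 1/N\) reproduces the simple mean.

```python
import math

x = [2.0, 4.0, 6.0, 8.0]  # made-up observations
N = len(x)
w = [1 / N] * N           # every observation gets the same weight

simple = sum(x) / N
weighted = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

print(simple, weighted)
print(math.isclose(simple, weighted))  # True
```

Any constant weight works (for example, all weights equal to 1), because the constant cancels between numerator and denominator.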
Example
Ron Butler, a home builder, is looking over the expenses he incurred for a house he just built. For the purpose of pricing future projects, he would like to know the average wage ($/hour) he paid the workers he employed. Listed below are the categories of workers he employed, along with their respective wages and total hours worked. What is the weighted mean?
![]()
Example
![]()
FYI: The simple mean would be $21.21
Percentiles
The median is a percentile! It’s the 50th percentile
Example
![]()
“At least 80% of the items take on a value of 646.2 or less.”
Where does this method come from?
The fancy term for this is “linear interpolation”
Since the position (56.8) is not a whole number, we can’t just pick a value from the table.
- We have to “interpolate,” which is a fancy way of saying we take a weighted average between the two surrounding points.
- The formula for the 80th percentile is:
\[
\text{Value at Integer Part} + \text{Decimal Part} \times (\text{Next Value} - \text{Value at Integer Part})
\]
Integer Part (56th value): 635
Decimal Part (0.8): This represents how much further we need to go past the 56th value.
The Difference (\(649 - 635 = 14\)): This is the distance between the 56th and 57th values.
Putting it together: \(635 + 0.8 \times 14 = 646.2\)
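The interpolation step can be sketched in a few lines of Python. Only the two values quoted above (635 and 649) come from the example; the helper `value_at_position` and the toy list are assumptions for illustration.

```python
import math

def value_at_position(sorted_values, position):
    """Linearly interpolate at a 1-based, possibly fractional, position."""
    if not 1 <= position <= len(sorted_values):
        raise ValueError("position must lie between 1 and the number of values")
    lower = math.floor(position)          # integer part
    fraction = position - lower           # decimal part
    current = sorted_values[lower - 1]    # value at the integer part
    if fraction == 0:
        return current
    next_value = sorted_values[lower]     # the next value in sorted order
    return current + fraction * (next_value - current)

# The two values quoted above: the 56th is 635 and the 57th is 649.
p80 = 635 + 0.8 * (649 - 635)
print(round(p80, 1))  # 646.2

# Small made-up list to exercise the general helper.
print(value_at_position([10, 20, 30, 40], 2.5))  # 25.0
```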
There are several ways to calculate percentiles; different software (Excel, R, Stata) uses slightly different versions.
Why do we use this one?
- It is precise: It provides a unique value even when the percentile falls between two data points.
- It mirrors the CDF: It treats the data as if it represents a continuous underlying distribution rather than just discrete steps
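To see that percentile rules really can disagree, here is a small sketch with Python's standard `statistics.quantiles`, which offers two interpolation rules. The data are made up. The "exclusive" rule places the p-th quantile at position \(p(n+1)\) in the sorted data, which appears to be the rule behind a fractional position like 56.8 (an inference; the position calculation is not shown in this section).

```python
import statistics

data = [10, 20, 30, 40, 50]  # made-up data

# n=4 asks for the three quartile cut points (25th, 50th, 75th percentiles).
exclusive = statistics.quantiles(data, n=4, method="exclusive")
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(exclusive)  # [15.0, 30.0, 45.0]
print(inclusive)  # [20.0, 30.0, 40.0]
```

Both rules agree on the median here but give different first and third quartiles, so it is worth checking which rule your software uses before comparing results.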
Range
\[
\text{Max Value} - \text{Min Value}
\]
- Very simple measure of variability
- But ONLY sensitive to min and max
IQR (Interquartile Range)
\[
\text{Third Quartile} - \text{First Quartile}
\]
- The range for the middle 50% of the data
- Uses more of the data, and does not depend on the min and max
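A sketch comparing the two measures on made-up data. The helper `spread` is hypothetical, and the quartiles come from `statistics.quantiles` with its default "exclusive" rule, which is one of several possible quartile conventions.

```python
import statistics

def spread(values):
    """Return (range, IQR) for a list of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # default "exclusive" rule
    return max(values) - min(values), q3 - q1

with_extreme = [1, 2, 3, 4, 5, 6, 7, 100]   # made up; 100 is an extreme value
without_extreme = [1, 2, 3, 4, 5, 6, 7, 8]  # same data with the extreme value replaced

print(spread(with_extreme))     # (99, 4.5)
print(spread(without_extreme))  # (7, 4.5)
```

Changing a single extreme value moves the range from 7 to 99, while the IQR stays at 4.5.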
How to detect outliers using IQR
- Calculate Q1 and Q3
- Calculate IQR = Q3 - Q1
- Calculate Lower Bound = Q1 - 1.5*IQR
- Calculate Upper Bound = Q3 + 1.5*IQR
- Any data point outside of these bounds is considered an outlier
- This is a heuristic, not a hard rule
- There are other methods to detect outliers as well
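The steps above can be sketched as follows. The function name `iqr_outliers`, the multiplier argument `k`, and the data are assumptions for illustration; the quartiles again use the default rule of `statistics.quantiles`.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]; return (outliers, bounds)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)

data = [1, 2, 3, 4, 5, 6, 7, 100]  # made-up data
flagged, bounds = iqr_outliers(data)

print(bounds)   # (-4.5, 13.5)
print(flagged)  # [100]
```

Because this is a heuristic, `k` is left as a parameter rather than hard-coded.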
Other methods
- Sometimes the IQR method is too sensitive
- You may also just decide to cut off or “winsorize” data at a certain percentile
- For example, you may decide to set all data above the 95th percentile to be equal to the 95th percentile
- This is common in income data, where you may not want the top 5% of earners to skew your analysis
- More sophisticated methods exist such as machine learning techniques
- The crux of those methods is to learn the “normal” pattern of the data and flag anything that deviates significantly from that pattern
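A minimal sketch of winsorizing the top of a distribution at the 95th percentile, as described above. The helper `winsorize_top` and the income figures are made up. The percentile is taken with the "inclusive" rule of `statistics.quantiles` so that the cutoff stays inside the observed range; other percentile rules would give a somewhat different cutoff.

```python
import statistics

def winsorize_top(values, percentile=95):
    """Cap every value above the given percentile at that percentile."""
    # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
    cutoffs = statistics.quantiles(values, n=100, method="inclusive")
    cap = cutoffs[percentile - 1]
    return [min(v, cap) for v in values], cap

# Made-up incomes (in thousands); the last one is far above the rest.
incomes = [30, 35, 40, 45, 50, 55, 60, 65, 70, 1000]
capped, cap = winsorize_top(incomes)

print(cap)     # 581.5
print(capped)  # the 1000 becomes 581.5; every other value is unchanged
```

With only ten observations the 95th percentile is still pulled up by the extreme value; on larger data sets the cap sits much closer to the bulk of the data.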
Variance
- The variance is a measure of variability that utilizes all the data.
- It is based on the difference between the value of each observation (\(x_i\)) and the mean (\(\bar{x}\) for a sample, \(\mu\) for a population).
- The variance is useful in comparing the variability of two or more variables.
- The variance is the average of the squared deviations between each data value and the mean.
Variance
The variance of a sample is:
\[
s^2 = \frac{1}{N-1} \sum_i^N (x_i - \bar{x})^2
\]
The variance of a population is:
\[
\sigma^2 = \frac{1}{N} \sum_i^N (x_i - \mu)^2
\]
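Both formulas, written out directly and checked against the standard library (`statistics.variance` divides by \(N-1\), `statistics.pvariance` divides by \(N\)). The data are made up.

```python
import math
import statistics

def sample_variance(x):
    """Squared deviations from the sample mean, divided by N - 1."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / (n - 1)

def population_variance(x):
    """Squared deviations from the mean, divided by N."""
    n = len(x)
    mu = sum(x) / n
    return sum((xi - mu) ** 2 for xi in x) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data; mean is 5, squared deviations sum to 32

print(sample_variance(data))      # 32 / 7
print(population_variance(data))  # 32 / 8 = 4.0

# The hand-written versions agree with the library.
print(math.isclose(sample_variance(data), statistics.variance(data)))       # True
print(math.isclose(population_variance(data), statistics.pvariance(data)))  # True
```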
Sample vs. Population
- It may seem strange to talk about population variance when I told you that you can never know the DGP (data-generating process)
- Population variance is useful theoretically, but it is never known; we can only estimate it
- Why does the sample variance divide by \(N-1\), not \(N\)?
“Degrees of Freedom”
Think of it as a penalty for the fact that we are estimating the mean (\(\bar{x}\)) from the same data
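One way to see the penalty at work is a toy experiment, sketched below under stated assumptions: a tiny made-up population whose variance we can compute exactly, and every possible sample of size 2 drawn with replacement. Averaging over all samples, dividing by \(N\) lands below the population variance, while dividing by \(N-1\) hits it.

```python
from itertools import product

population = [1, 2, 3, 4]  # toy population, made up
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

n = 2
samples = list(product(population, repeat=n))  # all 16 ordered samples, with replacement

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)

avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2)             # 1.25  (population variance)
print(avg_div_n)          # 0.625 (too small on average)
print(avg_div_n_minus_1)  # 1.25  (matches on average)
```

Deviations are measured from each sample's own mean, which sits closer to the sample's points than the population mean does, so the raw squared deviations run small; dividing by \(N-1\) corrects for that.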
Standard Deviation
- The standard deviation of a data set is the positive square root of the variance.
- It is measured in the same units as the data, making it more easily interpreted than the variance.
- The standard deviation of a sample is:
\[
s = \sqrt{s^2}
\]
Coefficient of Variation
- The coefficient of variation indicates how large the standard deviation is in relation to the mean.
- High CV: More variability relative to the mean
- Low CV: Less variability relative to the mean
- The coefficient of variation of a sample is:
\[
\text{CV} = \frac{s}{\bar{x}} \cdot 100\%
\]
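A short sketch computing the sample standard deviation and the coefficient of variation on made-up data, using the standard library (`statistics.stdev` is the square root of the sample variance).

```python
import math
import statistics

data = [8, 10, 12]  # made-up data: mean 10, sample variance 4

s2 = statistics.variance(data)  # sample variance (divides by N - 1)
s = statistics.stdev(data)      # sample standard deviation
xbar = statistics.mean(data)
cv = 100 * s / xbar             # coefficient of variation, in percent

print(s2, s, cv)                # variance 4, standard deviation 2.0, CV 20.0
print(math.isclose(s, math.sqrt(s2)))  # True
```

Here the standard deviation is 20% of the mean.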
Examples:
- Finance: Comparing risk (volatility) of different investments
- Lab Science: A CV of less than 5% often indicates high “repeatability” of an experiment.
When to use and when not to use CV
- Ratio Scale Only: It only makes sense for data with a “true zero” (like height, weight, or income).
- It does not work for things like temperature in Celsius or Fahrenheit because \(0^\circ\text{C}\) is an arbitrary point.
- Means Near Zero: If the mean is very close to zero, the CV will explode toward infinity, making the result meaningless.
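Both caveats can be illustrated with made-up numbers. In the sketch below, the same three temperatures give three different CVs depending on the scale, and a data set whose mean is close to zero produces an enormous CV. The helper `cv` is hypothetical.

```python
import statistics

def cv(values):
    """Coefficient of variation in percent (undefined when the mean is zero)."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# The same three temperatures on three scales (made-up readings).
celsius = [10.0, 20.0, 30.0]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
kelvin = [c + 273.15 for c in celsius]

print(round(cv(celsius), 2))     # 50.0
print(round(cv(fahrenheit), 2))  # 26.47
print(round(cv(kelvin), 2))      # 3.41

# A mean very close to zero makes the CV blow up.
near_zero = [-1.0, 0.0, 1.03]
print(cv(near_zero) > 1000)      # True
```

Only the Kelvin figure is meaningful as a CV, since Kelvin has a true zero; the Celsius and Fahrenheit values depend on where each scale happens to put its zero.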
Variability Measures in Action!
Figure: \(s^2 = 0.833\), \(s = 0.913\), \(\text{CV} = -25.545\)