Describing Data

Aleksandr Michuda

Agenda

  • First Stata Workshop is on the 22nd and 23rd of January
  • Make sure to reach out to Tao Wang (twang1@swarthmore.edu) about registering
  • PS1 Due next Friday at 11:59pm!
  • Check out the website

Agenda

  • What is econometrics?
  • What is causality?
  • What is data?
  • How do we visualize different data?

What is econometrics?

  • Econometrics revolves around statistical methods to answer economic problems
    • What is the demand for laptops?
    • How do taxes affect GDP?
    • What is the elasticity of labor supply?
  • EC001 taught you a lot about the theory of economics
    • But are any of these things actually true?
    • And if so, what IS the elasticity of X or the multiplier of X?
    • How big is it? Does it even matter?

What is the difference between econometrics and statistics?

  • To answer many of the questions in the last slide, you need to know the impact
  • How does one thing affect another?
  • How does one thing causally affect another?
  • What is causation?
  • What is correlation?

Motivating Example: Supply and Demand

  • Suppose you want to estimate demand for a product, tomatoes in Swarthmore
  • You collect data every day from each supermarket, farmers market, and local grocery store over the course of a month.
  • These organizations give you data on how many tomatoes they sold that day (quantity, Q) and the price they sold them at (P)
  • In the end, you have this data:
[Figure: scatterplot of quantity (Q) against price (P)]

  • What can you say about this data?
  • Can you say something about demand or supply?

Supply and Demand

  • Let’s say I draw a line through the data.
  • The “best-fit” line.
[Figure: scatterplot of Q against P with a best-fit line]

Supply and Demand

  • What does that line represent?
  • What does the slope tell you?
  • If I asked you: “What is the effect of increasing quantity sold by 1 unit on price?”
  • What if I asked you, instead: “What is the effect of increasing price by 1 unit on quantity demanded?”

Supply and Demand

  • Remember that the intersection of supply and demand determines the price and quantity sold
  • So each point on this graph is not supply or demand, it’s actually an equilibrium point!
  • Can we really learn about demand or supply from this data alone?
  • What would we need in order to isolate demand or supply?
  • Take 5 minutes to think about this with your neighbor.

Correlation

  • Correlation is about data moving together
  • It doesn’t necessarily mean that one causes the other
  • Correlation is still useful, especially for prediction
  • Machine learning/neural networks
  • You can also think of the best-fit line here as an estimate of “the conditional expectation function” (more later)
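A best-fit line like this can be computed with ordinary least squares. Here is a minimal numpy sketch; the (Q, P) points are made up to stand in for the tomato data:

```python
import numpy as np

# Hypothetical (Q, P) observations, standing in for the tomato data.
Q = np.array([10.0, 12.0, 15.0, 18.0, 20.0])  # quantity sold
P = np.array([3.0, 2.8, 2.5, 2.2, 2.0])       # price

# Fit the best-fit line P = a + b*Q by ordinary least squares.
b, a = np.polyfit(Q, P, deg=1)  # polyfit returns highest-degree coefficient first

print(f"P = {a:.2f} + ({b:.2f})*Q")  # P = 4.00 + (-0.10)*Q
```

The slope b is exactly the sample estimate of the conditional expectation function's slope, which is why the "what is the effect of one more unit" question from the previous slides maps directly onto it.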

Correlation

What is causation?

That’s actually a pretty philosophical question…

What is causation?

  • Hume’s Belief on Knowledge and Experience: Our beliefs about general principles (e.g., “the sun will rise tomorrow” or “bread is nourishing”) are derived from experience. This justification is inadequate, as previous experiences do not logically guarantee future occurrences.
  • Impact on Philosophy and Practical Fields: Hume’s skepticism about causation influenced subsequent philosophy and posed challenges in science, law, policy, and other fields.

Causality in the Social Sciences

  • In physics, the question of causality isn’t as big of a problem
  • The causal relationships between atoms and forces seem much more apparent because atoms can’t choose to be affected by a force, or even change the force
  • But what does causation mean with social phenomena?
  • A treatment (force) can be chosen, or denied
    • Selection
  • People adapt and can “game” the system
  • How do you isolate causality?
  • In economics, we often think about “threats to identification”
    • What could be confounding my causal relationship?
    • Everything might be…

Selection

  • Suppose you want to understand the causal impact of providing mosquito nets to households in countries with high malaria incidence (perhaps, Ghana)
  • You decide to bring some mosquito nets to various villages in Ghana and let anyone who wants one take one.
  • Then you collect data from those who took mosquito nets and those who didn’t, and write down whether they got malaria in the next two weeks.
  • Let’s look at some data…

What happened?

  • This should be easy.
  • All we have to do is take the average risk of those who took the nets and the average risk of those who didn’t.
  • We subtract the two and voila! Causal impact.
Did you get malaria? (Yes = 1 / No = 0)

Takers   Non-takers
  1          1
  0          0
  0          0
  0          0
  0          0
  1          0
  0          0
  0          0
  0          0
  0          0
  • Mean of takers: 20%
  • Mean of non-takers: 10%

Impact: 20-10 = +10% ?!
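The naive calculation above takes a few lines of Python; the 0/1 outcomes are copied from the table:

```python
# 0/1 outcomes copied from the table (1 = got malaria).
takers =     [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
non_takers = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

mean_takers = sum(takers) / len(takers)              # 0.2
mean_non_takers = sum(non_takers) / len(non_takers)  # 0.1

# The naive "impact": a difference in means, which here makes nets look harmful.
naive_impact = mean_takers - mean_non_takers
print(f"{naive_impact:+.0%}")  # +10%
```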

What is a counterfactual?

  • This isn’t some math trick or contrived example.
  • Many econometricians like to talk about “counterfactuals.”
  • What would that world look like?
  • What would have been the chances of getting malaria in the absence of getting the nets?

Back to the example

  • What is the counterfactual in the mosquito nets example?
  • What would have happened had mosquito nets not been provided?
  • Selection confounded the impact
Group                             Counterfactual malaria risk   Effect of nets          Observed malaria risk
High-risk area (take nets)        40%                           Nets cut risk in half   20%
Low-risk area (don’t take nets)   10%                           No nets → 10%           10%
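A small sketch of the arithmetic behind the table, using the counterfactual risks shown there:

```python
# Risks from the table (all percentages from the slide).
counterfactual_taker_risk = 0.40  # risk takers WOULD have faced without nets
observed_taker_risk = 0.20        # nets cut that risk in half
observed_non_taker_risk = 0.10    # low-risk group, unaffected either way

# True causal effect for takers: observed risk minus their OWN counterfactual.
true_effect = observed_taker_risk - counterfactual_taker_risk    # -0.20: nets help

# Naive estimate: comparing takers to a DIFFERENT, lower-risk group.
naive_estimate = observed_taker_risk - observed_non_taker_risk   # +0.10: nets "hurt"

print(true_effect, naive_estimate)
```

The two numbers disagree even in sign, which is exactly what selection does: the comparison group is not a valid counterfactual for the takers.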

Back to econometrics vs. statistics

  • Econometricians really delve deeply into causation
  • They care about what assumptions need to be made about the data generating process that will allow them to say that some model’s measure is a causal impact/measure/effect
  • Often econometrics involves thinking about counterfactuals and that involves simplification and modeling
  • Statistics concerns the collection, organization, analysis, interpretation, and presentation of data.

All models are wrong, but some are useful

  • All models are a simplification
  • But they give us the benefit of being able to generalize and say something about the data generating process
  • But at the cost of reducing a truly complex thing into a less “true” form
  • That’s why in statistics/econometrics, our estimates always come with a measure of uncertainty: usually a point estimate together with a confidence interval or standard error.

DGPs

Suppose we now want to look at the effect of a cash transfer to a low-income household on educational outcomes 5 years later.

  • The true data-generating process for that is incredibly complex.
  • It includes things like:
    • Household characteristics
    • Local labor market conditions
    • Quality of schools
    • Parental education
    • Peer effects
    • The amount of cash transfer
    • The effect of random shocks (weather, health, etc) in the next 5 years

DGPs

  • We can make a simplification, by adding structure:
    • This complicated social phenomenon can be represented as a probability distribution of various random variables
  • This can be represented as some joint distribution of all these variables, interrelated with each other in complicated and non-linear ways.
  • Imagine a multi-dimensional surface where each axis is one of these variables.

Simplifying the DGP

  • Let’s make an assumption: education in five years only depends on the cash transfer and your parent’s education.
  • What would that joint distribution look like?
  • What do we lose by making this assumption?
  • This is now a model.
  • We can visualize this in two dimensions with a pairplot.
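Seaborn’s pairplot would draw the pairwise scatterplots; as a non-graphical stand-in, this sketch simulates the simplified DGP (every coefficient and scale is invented for illustration) and prints the pairwise correlation matrix that a pairplot would depict visually:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Simulate the simplified DGP: education depends only on the cash transfer
# and parental education. All coefficients and scales are invented.
cash_transfer = rng.uniform(0, 1000, n)   # dollars
parental_educ = rng.normal(12, 2, n)      # years
education = (8 + 0.002 * cash_transfer + 0.3 * parental_educ
             + rng.normal(0, 1, n))       # years, with noise

# The pairwise relationships a pairplot would draw, as a correlation matrix.
corr = np.corrcoef([cash_transfer, parental_educ, education])
print(np.round(corr, 2))
```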

Simplifying the DGP Further

  • The degree of simplification is dependent on how you want to answer a particular question.
  • With the pairplot above, we can see the rich interactions between cash transfer, parental education, and education, at the cost of losing out on the complexity of the full DGP.
  • But what if what you really wanted to answer: “What is the average impact of a cash transfer on education, holding parental education constant?”
  • Then you can add even more structure to the DGP, by also assuming how cash transfer affects education.

Linear Model for the DGP

\[ Educ = \beta_0 + \beta_1 \cdot CT + \beta_2 \cdot PE + \varepsilon, \qquad \varepsilon \sim \text{Normal} \]

  • The cash transfer (CT) is linearly related to education.
  • The impact of the cash transfer is \(\beta_1\)
  • This means that as we increase the cash transfer by $1, education will go up by \(\beta_1\) units
  • More on this later.
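To see how \(\beta_1\) could be recovered from data, here is a simulation sketch: we invent "true" coefficient values (they are assumptions for illustration, not course data), generate data from the linear model, and estimate the betas by least squares:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Invented "true" coefficients for the linear model (illustrative assumptions).
beta0, beta1, beta2 = 9.0, 0.001, 0.25

CT = rng.uniform(0, 1000, n)   # cash transfer in dollars
PE = rng.normal(12, 2, n)      # parental education in years
eps = rng.normal(0, 1, n)      # the Normal error term

Educ = beta0 + beta1 * CT + beta2 * PE + eps

# OLS: regress Educ on a constant, CT, and PE; lstsq minimizes ||Xb - y||.
X = np.column_stack([np.ones(n), CT, PE])
coef, *_ = np.linalg.lstsq(X, Educ, rcond=None)
print(np.round(coef, 3))  # estimates close to (9.0, 0.001, 0.25)
```

With enough data the estimated second coefficient lands near the true \(\beta_1\), which is the "effect of one more dollar of cash transfer, holding parental education constant".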

What is Data?

  • Data is assumed to come from a data generating process
    • The DGP generates elements of data of a variable, which we see as observations.
  • A data generating process is not usually known
  • All we can do is try to understand it better through the data it generates
  • But data generating processes have observable characteristics
    • Distributions
    • Models
  • A model is some simplified representation of the data generating process
  • It provides us with a way to make claims about the data in a more formal, structured way

Data Collection

  • Suppose we wanted to answer this question and decided to collect data
  • What are the different data types we could collect for this?
  • The first level: Quantitative vs. Categorical Data

Quantitative Data

  • Quantitative data indicate how many or how much.
  • Ordinary arithmetic operations are meaningful for quantitative data.

i.e. “5 years of education, $1000 in CT”

Categorical Data

  • Labels or names are used to identify an attribute of each element
  • Often referred to as qualitative data

“High Education, Low Education, No education”

Ordinal vs. Nominal

  • The categories in categorical data can have ordering or just be a set
  • Nominal Data: Labels or names used to identify an attribute
  • Ordinal Data: Categories where the rank and ordering matters.

Example

Economics, Business, Physics… -> Nominal

Freshman, Sophomore, Junior, Senior -> Ordinal

Data Collection

Now that you know what kind of data types there are, what kind of data would you want to collect to answer this question:

What’s the ideal type? Why?

  • Education -> ?
  • Cash Transfer -> ?

Observation Levels

  • What does each row or observation represent in your data? A person, a time period, a person at some time period in some region, etc…
  • Most data can be broken down into:
    • Cross-section
    • Time Series
    • For the future: panel data

Cross-sectional Data

Data collected at the same or approximately the same point in time.

Example Data detailing the number of building permits issued in November 2013 in each of the counties of Ohio.

Cross-sectional Data

Time Series

Data collected over several time periods.

Example

Data detailing the number of building permits issued in Lucas County, Ohio, in each of the last 36 months.

Time Series

Data Sources/Collection

  • Data can be collected from administrative data:
    • Internal company records
    • Business databases
    • Government agencies
  • Publicly available surveys
    • BLS
    • Census
    • LSMS
  • Or collect it yourself!
    • Popular today with RCTs

Types of Statistical Studies

Statistical Studies – Observational

  • In observational (nonexperimental) studies no attempt is made to control or influence the variables of interest.

Example – Survey

Studies of smokers and nonsmokers are observational studies because researchers do not determine or control who will smoke and who will not smoke.

Types of Statistical Studies

Statistical Studies – Experimental

  • In experimental studies the variable of interest is first identified. Then one or more other variables are identified and controlled so that data can be obtained about how they influence the variable of interest.
  • The largest experimental study ever conducted is believed to be the 1954 Public Health Service experiment for the Salk polio vaccine. Nearly two million U.S. children (grades 1 through 3) were selected.

Observational vs. Experimental

  • This is a particularly important distinction in the causal inference world
  • Experiments allow you to isolate the effects of your intervention so that you know that the channel is causal
  • Observational studies often have confounders, which are factors that are related both to your outcome of interest and to the intervention you are studying.
    • How could you know whether it was your intervention or something else?

Our example

  • What would be an example of how to make our study of cash transfers on education experimental?
  • What would be an example of that study being observational?

Big Data and Data Mining

Big data: Large and complex data set.

  • Three V’s of Big data:
    • Volume: Amount of available data
    • Velocity: Speed at which data is collected and processed
    • Variety: Different data types

Data Mining

  • Methods for developing useful decision-making information from large databases.
  • Using a combination of procedures from statistics, mathematics, and computer science, analysts “mine the data” to convert it into useful information.
  • The most effective data mining systems use automated procedures to discover relationships in the data and predict future outcomes prompted by general and even vague queries by the user.

Data Mining Applications

  • The major applications of data mining have been made by companies with a strong consumer focus such as retail, financial, and communication firms.
  • Data mining is used to identify related products that customers who have already purchased a specific product are also likely to purchase (and then pop-ups are used to draw attention to those related products).
  • Data mining is also used to identify customers who should receive special discount offers based on their past purchasing volumes.

Data Mining Reliability

  • Finding a statistical model that works well for a particular sample of data does not necessarily mean that it can be reliably applied to other data.
  • With the enormous amount of data available, the data set can be partitioned into a training set (for model development) and a test set (for validating the model).
  • There is, however, a danger of overfitting the model to the point that misleading associations and conclusions appear to exist.
  • Careful interpretation of results and extensive testing is important.

Ethical Guidelines for Statistical Practice

  • In a statistical study, unethical behavior can take a variety of forms including:
    • Improper sampling
    • Inappropriate analysis of the data
    • Development of misleading graphs
    • Use of inappropriate summary statistics
    • Biased interpretation of the statistical results
  • One should strive to be fair, thorough, objective, and neutral as you collect, analyze, and present data.
  • As a consumer of statistics, one should also be aware of the possibility of unethical behavior by others.

Summarizing Categorical Data

  • Frequency Distributions
  • Relative Frequency Distributions
  • Percent Frequency Distributions
  • Bar Charts
  • Pie Charts
  • Histograms

Frequency Distributions

A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes.

Example: Marada Inn Guests staying at Marada Inn were asked to rate the quality of their accommodations as being excellent, above average, average, below average, or poor.

Rating          Frequency
Poor                    2
Below Average           3
Average                 5
Above Average           9
Excellent               1
Total                  20

Relative Frequency and Percent Frequency Distributions

  • The relative frequency is the proportion of the total number of data items that belong to a given class

\[ \text{Relative Frequency} = \frac{\text{Frequency}}{n} \]

  • The percent frequency is the relative frequency multiplied by 100
Rating          Relative Frequency   Percent Frequency
Poor            0.10                 10%
Below Average   0.15                 15%
Average         0.25                 25%
Above Average   0.45                 45%
Excellent       0.05                  5%
Total           1.00                 100%
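Both distributions are straightforward to compute. A sketch using Python's Counter, on a hypothetical raw list that reproduces the Marada Inn counts:

```python
from collections import Counter

# Hypothetical raw responses reproducing the Marada Inn counts.
ratings = (["Poor"] * 2 + ["Below Average"] * 3 + ["Average"] * 5
           + ["Above Average"] * 9 + ["Excellent"] * 1)

freq = Counter(ratings)   # frequency distribution
n = sum(freq.values())    # 20 guests

relative = {k: v / n for k, v in freq.items()}       # proportions
percent = {k: 100 * r for k, r in relative.items()}  # percentages

print(relative["Above Average"])  # 0.45
```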

Bar Chart

  • A bar chart is a graphical display for depicting qualitative data.
  • A frequency, relative frequency, or percent frequency scale can be used for the other axis (usually the vertical axis).
  • Using a bar of fixed width drawn above each class label, we extend the height appropriately.
  • The bars are separated to emphasize the fact that each class is a separate category.

Pie Chart

  • The pie chart is a commonly used graphical display for presenting relative frequency and percent frequency distributions for categorical data.
  • First draw a circle; then use the relative frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class.
  • Because there are 360 degrees in a circle, a class with a relative frequency of 0.25 would consume 0.25(360) = 90 degrees of the circle.

Summarizing Quantitative Data

  • Frequency Distribution
  • Relative Frequency and Percent Frequency Distributions
  • Dot Plot
  • Histogram
  • Cumulative Distributions
  • Stem-and-Leaf Display

Dot Plots

  • One of the simplest graphical summaries of data is a dot plot.
  • A horizontal axis shows the range of data values.
  • Then each data value is represented by a dot placed above the axis.

Cross-tabulation

  • Many times, you want to look at how two or more variables can be summarized
  • For this you can use cross-tabulation

Example: Finger Lakes Homes

The number of Finger Lakes homes sold for each style and price for the past two years is shown below.

Example: Finger Lakes Homes

Insights

  • The greatest number of homes (19) in the sample are a split-level style and priced at less than $250,000.
  • Only three homes in the sample are an A-Frame style and priced at $250,000 or more.

Row and Column Percentages

  • Converting the entries in the table into row percentages or column percentages can provide additional insight about the relationship between the two variables.

Histogram

The variable of interest is placed on the horizontal axis. A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency. Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.

Histogram Skewness

Cumulative Distributions

Cumulative Distribution

So let’s take the table for car repairs:

Parts Cost ($)   Frequency   Percent Frequency
50-59                    2                  4%
60-69                   13                 26%
70-79                   16                 32%
80-89                    7                 14%
90-99                    7                 14%
100-109                  5                 10%
TOTAL                   50                100%

How would we turn this into a cumulative distribution?

Cumulative Distribution

Parts Cost ($)   Frequency   % Frequency   Cumulative Frequency   Cumulative % Frequency
50-59                    2            4%   2                      4%
60-69                   13           26%   2 + 13 = 15            4 + 26 = 30%
70-79                   16           32%   15 + 16 = 31           30 + 32 = 62%
80-89                    7           14%   31 + 7 = 38            62 + 14 = 76%
90-99                    7           14%   38 + 7 = 45            76 + 14 = 90%
100-109                  5           10%   45 + 5 = 50            90 + 10 = 100%
TOTAL                   50          100%
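The running sums in the table can be computed directly, for example with itertools.accumulate:

```python
from itertools import accumulate

classes = ["50-59", "60-69", "70-79", "80-89", "90-99", "100-109"]
freq = [2, 13, 16, 7, 7, 5]

# Running totals of frequency, then convert to cumulative percentages.
cum_freq = list(accumulate(freq))                     # [2, 15, 31, 38, 45, 50]
cum_pct = [100 * c / cum_freq[-1] for c in cum_freq]  # ends at 100.0

for cls, cf, cp in zip(classes, cum_freq, cum_pct):
    print(f"{cls:>7}: {cf:2d} ({cp:.0f}%)")
```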

Simpson’s Paradox

  • Data in two or more crosstabulations are often aggregated to produce a summary crosstabulation.
  • We must be careful in drawing conclusions about the relationship between the two variables in the aggregated crosstabulation.
  • In some cases the conclusions based upon an aggregated crosstabulation can be completely reversed if we look at the unaggregated data. The reversal of conclusions based on aggregate and unaggregated data is called Simpson’s paradox.
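A compact numeric illustration of the reversal, using the classic kidney-stone figures often cited for this paradox (the numbers and group labels are illustrative, not from this course's data):

```python
# (successes, cases) for two hypothetical treatments across two subgroups --
# the classic kidney-stone numbers often used to illustrate Simpson's paradox.
data = {
    "A": {"easy": (81, 87), "hard": (192, 263)},
    "B": {"easy": (234, 270), "hard": (55, 80)},
}

def rate(successes, cases):
    return successes / cases

# Within EACH subgroup, A has the higher success rate...
assert rate(*data["A"]["easy"]) > rate(*data["B"]["easy"])
assert rate(*data["A"]["hard"]) > rate(*data["B"]["hard"])

# ...but after aggregating the subgroups, B looks better: Simpson's paradox.
agg = {t: tuple(map(sum, zip(*groups.values())))  # sum successes and cases
       for t, groups in data.items()}
assert rate(*agg["B"]) > rate(*agg["A"])

print(agg)
```

The reversal happens because treatment A was assigned the "hard" cases far more often, so aggregation mixes up treatment choice with case difficulty.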

Scatterplots and Trendlines

  • Scatterplots can show us a clearer sense of the relationships between variables
  • “Ocular econometrics”
  • It’s also easier to spot strangeness that can be investigated further
  • ex: Simpson’s paradox

Scatterplots and Simpson’s Paradox

Creating Effective Visualizations

Creating effective graphical displays is as much art as it is science.

Here are some guidelines:

  • Give the display a clear and concise title.
  • Keep the display simple.
  • Clearly label each axis and provide the units of measure.
  • If colors are used, make sure they are distinct.
  • If multiple colors or line types are used, provide a legend.

Misleading graphs

  • Each graph you ever make is trying to convey some point
  • What are you trying to say?
  • There is a balance between trying to convey some point, and risking misleading graphs to make it

Scales in bar graphs

  • Bar graphs are great because you can convey information in a digestible way
  • But changing the scales to suit your needs is misleading

Unconventional Visualizations

Unclear Plots

  • Each plot you make should be self-contained
  • I shouldn’t have to listen to a broadcast, read a book, or listen to a lecture to understand a particular plot
  • Make visualizations that are clear in what they are showing
  • Title, axis labels, scales, even notes if needed