Describing Data

Aleksandr Michuda

Agenda

  • First Stata Workshop is on the 22nd and 23rd of January
  • Make sure to reach out to Tao Wang (twang1@swarthmore.edu) about registering
  • PS1 Due next Friday at 11:59pm!
  • Check out the website

Agenda

  • What is econometrics?
  • What is causality?
  • What is data?
  • How do we visualize different data?

What is econometrics?

  • Econometrics revolves around statistical methods to answer economic problems
    • What is the demand for laptops?
    • How do taxes affect GDP?
    • What is the elasticity of labor supply?
  • EC001 taught you a lot about the theory of economics
    • But are any of these things actually true?
    • And if so, what IS the elasticity of X or the multiplier of X?
    • How big is it? Does it even matter?

What is the difference between econometrics and statistics?

  • To answer many of the questions in the last slide, you need to know the impact
  • How does one thing affect another?
  • How does one thing causally affect another?
  • What is causation?
  • What is correlation?

Motivating Example: Supply and Demand

  • Suppose you want to estimate demand for a product, tomatoes in Swarthmore
  • You collect data every day from each supermarket, farmers market, and local grocery store over the course of a month.
  • These organizations give you data on how many tomatoes they sold that day (quantity, Q) and the price they sold them at (P)
  • In the end, you have this data:
[Figure: scatterplot of quantity (Q) against price (P)]

  • What can you say about this data?
  • Can you say something about demand or supply?

Supply and Demand

  • Let’s say I draw a line through the data.
  • The “best-fit” line.
[Figure: scatterplot of Q against P with a best-fit line]

Supply and Demand

  • What does that line represent?
  • What does the slope tell you?
  • If I asked you: “What is the effect of increasing quantity sold by 1 unit on price?”
  • What if I asked you, instead: “What is the effect of increasing price by 1 unit on quantity demanded?”

Supply and Demand

  • Remember that the intersection of supply and demand determines the price and quantity sold
  • So each point on this graph is not supply or demand, it’s actually an equilibrium point!
  • Can we really learn about demand or supply from this data alone?
  • What would we need in order to isolate demand or supply?
  • Take 5 minutes to think about this with your neighbor.

Correlation

  • Correlation is about data moving together
  • It doesn’t necessarily mean that one causes the other
  • Correlation is still useful, especially for prediction
  • Machine learning/neural networks
  • You can also think of the best-fit line here as an estimate of “the conditional expectation function” (more later)
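A best-fit line like this can be computed with ordinary least squares. Here is a minimal numpy sketch; the (Q, P) points are made up to stand in for the tomato data:

```python
import numpy as np

# Hypothetical (Q, P) observations, standing in for the tomato data.
Q = np.array([10.0, 12.0, 15.0, 18.0, 20.0])  # quantity sold
P = np.array([3.0, 2.8, 2.5, 2.2, 2.0])       # price

# Fit the best-fit line P = a + b*Q by ordinary least squares.
b, a = np.polyfit(Q, P, deg=1)  # polyfit returns highest-degree coefficient first

print(f"P = {a:.2f} + ({b:.2f})*Q")  # P = 4.00 + (-0.10)*Q
```

The slope b is exactly the sample estimate of the conditional expectation function's slope, which is why the "what is the effect of one more unit" question from the previous slides maps directly onto it.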

Correlation

What is causation?

That’s actually a pretty philosophical question…

What is causation?

  • Hume’s Belief on Knowledge and Experience: Our beliefs about general principles (e.g., “the sun will rise tomorrow” or “bread is nourishing”) are derived from experience. This justification is inadequate, as previous experiences do not logically guarantee future occurrences.
  • Impact on Philosophy and Practical Fields: Hume’s skepticism about causation influenced subsequent philosophy and posed challenges in science, law, policy, and other fields.

Causality in the Social Sciences

  • In physics, the question of causality isn’t as big of a problem
  • The causal relationships between atoms and forces seem much more apparent because atoms can’t choose to be affected by a force, or even change the force
  • But what does causation mean with social phenomena?
  • A treatment (force) can be chosen, or denied
    • Selection
  • People adapt and can “game” the system
  • How do you isolate causality?
  • In economics, we often think about “threats to identification”
    • What could be confounding my causal relationship?
    • Everything might be…

Selection

  • Suppose you want to understand the causal impact of providing mosquito nets to households in countries with high malaria incidence (perhaps, Ghana)
  • You decide to bring some mosquito nets to various villages in Ghana and let anyone who wants one take one.
  • Then you collect data from those who took mosquito nets and those who didn’t, and write down whether they got malaria in the next two weeks.
  • Let’s look at some data…

What happened?

  • This should be easy.
  • All we have to do is take the average risk of those who took the nets and the average risk of those who didn’t.
  • We subtract the two and voila! Causal impact.
Did you get malaria? (Yes = 1 / No = 0)

Takers   Non-takers
  1          1
  0          0
  0          0
  0          0
  0          0
  1          0
  0          0
  0          0
  0          0
  0          0
  • Mean of takers: 20%
  • Mean of non-takers: 10%

Impact: 20-10 = +10% ?!
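The naive calculation above takes a few lines of Python; the 0/1 outcomes are copied from the table:

```python
# 0/1 outcomes copied from the table (1 = got malaria).
takers =     [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
non_takers = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

mean_takers = sum(takers) / len(takers)              # 0.2
mean_non_takers = sum(non_takers) / len(non_takers)  # 0.1

# The naive "impact": a difference in means, which here makes nets look harmful.
naive_impact = mean_takers - mean_non_takers
print(f"{naive_impact:+.0%}")  # +10%
```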

What is a counterfactual?

  • This isn’t some math trick or contrived example.
  • Many econometricians like to talk about “counterfactuals.”
  • What would that world look like?
  • What would have been the chances of getting malaria in the absence of getting the nets?

Back to the example

  • What is the counterfactual in the mosquito nets example?
  • What would have happened had mosquito nets not been provided?
  • Selection confounded the impact
Group                             Counterfactual malaria risk   Effect of nets          Observed malaria risk
High-risk area (take nets)        40%                           Nets cut risk in half   20%
Low-risk area (don’t take nets)   10%                           No nets → 10%           10%
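A small sketch of the arithmetic behind the table, using the counterfactual risks shown there:

```python
# Risks from the table (all percentages from the slide).
counterfactual_taker_risk = 0.40  # risk takers WOULD have faced without nets
observed_taker_risk = 0.20        # nets cut that risk in half
observed_non_taker_risk = 0.10    # low-risk group, unaffected either way

# True causal effect for takers: observed risk minus their OWN counterfactual.
true_effect = observed_taker_risk - counterfactual_taker_risk    # -0.20: nets help

# Naive estimate: comparing takers to a DIFFERENT, lower-risk group.
naive_estimate = observed_taker_risk - observed_non_taker_risk   # +0.10: nets "hurt"

print(true_effect, naive_estimate)
```

The two numbers disagree even in sign, which is exactly what selection does: the comparison group is not a valid counterfactual for the takers.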

Back to econometrics vs. statistics

  • Econometricians really delve deeply into causation
  • They care about what assumptions need to be made about the data generating process that will allow them to say that some model’s measure is a causal impact/measure/effect
  • Often econometrics involves thinking about counterfactuals and that involves simplification and modeling
  • Statistics concerns the collection, organization, analysis, interpretation, and presentation of data.

All models are wrong, but some are useful

  • All models are a simplification
  • But they give us the benefit of being able to generalize and say something about the data generating process
  • But at the cost of reducing a truly complex thing into a less “true” form
  • That’s why in statistics/econometrics, our estimates always come with a measure of uncertainty: usually a point estimate together with a confidence interval or standard error.

DGPs

Suppose we now want to look at the effect of a cash transfer to a low-income household on educational outcomes 5 years later.

  • The true data-generating process for that is incredibly complex.
  • It includes things like:
    • Household characteristics
    • Local labor market conditions
    • Quality of schools
    • Parental education
    • Peer effects
    • The amount of cash transfer
    • The effect of random shocks (weather, health, etc) in the next 5 years

DGPs

  • We can make a simplification, by adding structure:
    • This complicated social phenomenon can be represented as a probability distribution of various random variables
  • This can be represented as some joint distribution of all these variables, interrelated with each other in complicated and non-linear ways.
  • Imagine a multi-dimensional surface where each axis is one of these variables.

Simplifying the DGP

  • Let’s make an assumption: education in five years only depends on the cash transfer and your parent’s education.
  • What would that joint distribution look like?
  • What do we lose by making this assumption?
  • This is now a model.
  • We can visualize this in two dimensions with a pairplot.
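Seaborn’s pairplot would draw the pairwise scatterplots; as a non-graphical stand-in, this sketch simulates the simplified DGP (every coefficient and scale is invented for illustration) and prints the pairwise correlation matrix that a pairplot would depict visually:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Simulate the simplified DGP: education depends only on the cash transfer
# and parental education. All coefficients and scales are invented.
cash_transfer = rng.uniform(0, 1000, n)   # dollars
parental_educ = rng.normal(12, 2, n)      # years
education = (8 + 0.002 * cash_transfer + 0.3 * parental_educ
             + rng.normal(0, 1, n))       # years, with noise

# The pairwise relationships a pairplot would draw, as a correlation matrix.
corr = np.corrcoef([cash_transfer, parental_educ, education])
print(np.round(corr, 2))
```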

Simplifying the DGP Further

  • The degree of simplification is dependent on how you want to answer a particular question.
  • With the pairplot above, we can see the rich interactions between cash transfer, parental education, and education, at the cost of losing out on the complexity of the full DGP.
  • But what if what you really wanted to answer: “What is the average impact of a cash transfer on education, holding parental education constant?”
  • Then you can add even more structure to the DGP, by also assuming how cash transfer affects education.

Linear Model for the DGP

\[ Educ = \beta_0 + \beta_1 \cdot CT + \beta_2 \cdot PE + \varepsilon, \qquad \varepsilon \sim \text{Normal} \]

  • The cash transfer (CT) is linearly related to education.
  • The impact of the cash transfer is \(\beta_1\)
  • This means that as we increase the cash transfer by $1, education will go up by \(\beta_1\) units
  • More on this later.
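To see how \(\beta_1\) could be recovered from data, here is a simulation sketch: we invent "true" coefficient values (they are assumptions for illustration, not course data), generate data from the linear model, and estimate the betas by least squares:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Invented "true" coefficients for the linear model (illustrative assumptions).
beta0, beta1, beta2 = 9.0, 0.001, 0.25

CT = rng.uniform(0, 1000, n)   # cash transfer in dollars
PE = rng.normal(12, 2, n)      # parental education in years
eps = rng.normal(0, 1, n)      # the Normal error term

Educ = beta0 + beta1 * CT + beta2 * PE + eps

# OLS: regress Educ on a constant, CT, and PE; lstsq minimizes ||Xb - y||.
X = np.column_stack([np.ones(n), CT, PE])
coef, *_ = np.linalg.lstsq(X, Educ, rcond=None)
print(np.round(coef, 3))  # estimates close to (9.0, 0.001, 0.25)
```

With enough data the estimated second coefficient lands near the true \(\beta_1\), which is the "effect of one more dollar of cash transfer, holding parental education constant".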

What is Data?

  • Data is assumed to come from a data generating process
    • The DGP generates elements of data of a variable, which we see as observations.
  • A data generating process is not usually known
  • All we can do is try to understand it better through the data it generates
  • But data generating processes have observable characteristics
    • Distributions
    • Models
  • A model is some simplified representation of the data generating process
  • It provides us with a way to make claims about the data in a more formal, structured way

Data Collection

  • Suppose we wanted to answer this question and decided to collect data
  • What are the different data types we could collect for this?
  • The first level: Quantitative vs. Categorical Data

Quantitative Data

  • Quantitative data indicate how many or how much.
  • Ordinary arithmetic operations are meaningful for quantitative data.

i.e. “5 years of education, $1000 in CT”

Categorical Data

  • Labels or names are used to identify an attribute of each element
  • Often referred to as qualitative data

“High Education, Low Education, No education”

Ordinal vs. Nominal

  • The categories in categorical data can have ordering or just be a set
  • Nominal Data: Labels or names used to identify an attribute
  • Ordinal Data: Categories where the rank and ordering matters.

Example

Economics, Business, Physics… -> Nominal

Freshman, Sophomore, Junior, Senior -> Ordinal

Data Collection

Now that you know what kind of data types there are, what kind of data would you want to collect to answer this question:

What’s the ideal type? Why?

  • Education -> ?
  • Cash Transfer -> ?

Observation Levels

  • What does each row or observation represent in your data? A person, a time period, a person at some time period in some region, etc…
  • Most data can be broken down into:
    • Cross-section
    • Time Series
    • For the future: panel data

Cross-sectional Data

Data collected at the same or approximately the same point in time.

Example Data detailing the number of building permits issued in November 2013 in each of the counties of Ohio.

Cross-sectional Data

Time Series

Data collected over several time periods.

Example

Data detailing the number of building permits issued in Lucas County, Ohio, in each of the last 36 months.

Time Series

Data Sources/Collection

  • Data can be collected from administrative data:
    • Internal company records
    • Business databases
    • Government agencies
  • Publicly available surveys
    • BLS
    • Census
    • LSMS
  • Or collect it yourself!
    • Popular today with RCTs

Types of Statistical Studies

Statistical Studies – Observational

  • In observational (nonexperimental) studies no attempt is made to control or influence the variables of interest.

Example – Survey

Studies of smokers and nonsmokers are observational studies because researchers do not determine or control who will smoke and who will not smoke.

Types of Statistical Studies

Statistical Studies – Experimental

  • In experimental studies the variable of interest is first identified. Then one or more other variables are identified and controlled so that data can be obtained about how they influence the variable of interest.
  • The largest experimental study ever conducted is believed to be the 1954 Public Health Service experiment for the Salk polio vaccine. Nearly two million U.S. children (grades 1 through 3) were selected.

Observational vs. Experimental

  • This is a particularly important distinction in the causal inference world
  • Experiments allow you to isolate the effects of your intervention so that you know that the channel is causal
  • Observational studies often have confounders, which are factors that are related both to your outcome of interest and to the intervention you are studying.
    • How could you know whether it was your intervention or something else?

Our example

  • What would be an example of how to make our study of cash transfers on education experimental?
  • What would be an example of that study being observational?

Big Data and Data Mining

Big data: Large and complex data set.

  • Three V’s of Big data:
    • Volume: Amount of available data
    • Velocity: Speed at which data is collected and processed
    • Variety: Different data types

Data Mining

  • Methods for developing useful decision-making information from large databases.
  • Using a combination of procedures from statistics, mathematics, and computer science, analysts “mine the data” to convert it into useful information.
  • The most effective data mining systems use automated procedures to discover relationships in the data and predict future outcomes prompted by general and even vague queries by the user.

Data Mining Applications

  • The major applications of data mining have been made by companies with a strong consumer focus such as retail, financial, and communication firms.
  • Data mining is used to identify related products that customers who have already purchased a specific product are also likely to purchase (and then pop-ups are used to draw attention to those related products).
  • Data mining is also used to identify customers who should receive special discount offers based on their past purchasing volumes.

Data Mining Reliability

  • Finding a statistical model that works well for a particular sample of data does not necessarily mean that it can be reliably applied to other data.
  • With the enormous amount of data available, the data set can be partitioned into a training set (for model development) and a test set (for validating the model).
  • There is, however, a danger of overfitting the model to the point that misleading associations and conclusions appear to exist.
  • Careful interpretation of results and extensive testing is important.

Ethical Guidelines for Statistical Practice

  • In a statistical study, unethical behavior can take a variety of forms including:
    • Improper sampling
    • Inappropriate analysis of the data
    • Development of misleading graphs
    • Use of inappropriate summary statistics
    • Biased interpretation of the statistical results
  • One should strive to be fair, thorough, objective, and neutral as you collect, analyze, and present data.
  • As a consumer of statistics, one should also be aware of the possibility of unethical behavior by others.

Summarizing Categorical Data

  • Frequency Distributions
  • Relative Frequency Distributions
  • Percent Frequency Distributions
  • Bar Charts
  • Pie Charts
  • Histograms

Frequency Distributions

A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes.

Example: Marada Inn Guests staying at Marada Inn were asked to rate the quality of their accommodations as being excellent, above average, average, below average, or poor.

Rating          Frequency
Poor                    2
Below Average           3
Average                 5
Above Average           9
Excellent               1
Total                  20

Relative Frequency and Percent Frequency Distributions

  • The relative frequency is the proportion of the total number of data items that belong to a given class

\[ \text{Relative Frequency} = \frac{\text{Frequency}}{n} \]

  • The percent frequency is the relative frequency multiplied by 100
Rating          Relative Frequency   Percent Frequency
Poor            0.10                 10%
Below Average   0.15                 15%
Average         0.25                 25%
Above Average   0.45                 45%
Excellent       0.05                  5%
Total           1.00                 100%
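Both distributions are straightforward to compute. A sketch using Python's Counter, on a hypothetical raw list that reproduces the Marada Inn counts:

```python
from collections import Counter

# Hypothetical raw responses reproducing the Marada Inn counts.
ratings = (["Poor"] * 2 + ["Below Average"] * 3 + ["Average"] * 5
           + ["Above Average"] * 9 + ["Excellent"] * 1)

freq = Counter(ratings)   # frequency distribution
n = sum(freq.values())    # 20 guests

relative = {k: v / n for k, v in freq.items()}       # proportions
percent = {k: 100 * r for k, r in relative.items()}  # percentages

print(relative["Above Average"])  # 0.45
```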

Bar Chart

  • A bar chart is a graphical display for depicting qualitative data.
  • A frequency, relative frequency, or percent frequency scale can be used for the other axis (usually the vertical axis).
  • Using a bar of fixed width drawn above each class label, we extend the height appropriately.
  • The bars are separated to emphasize the fact that each class is a separate category.

Pie Chart

  • The pie chart is a commonly used graphical display for presenting relative frequency and percent frequency distributions for categorical data.
  • First draw a circle; then use the relative frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class.
  • Because there are 360 degrees in a circle, a class with a relative frequency of 0.25 would consume 0.25(360) = 90 degrees of the circle.

Summarizing Quantitative Data

  • Frequency Distribution
  • Relative Frequency and Percent Frequency Distributions
  • Dot Plot
  • Histogram
  • Cumulative Distributions
  • Stem-and-Leaf Display

Dot Plots

  • One of the simplest graphical summaries of data is a dot plot.
  • A horizontal axis shows the range of data values.
  • Then each data value is represented by a dot placed above the axis.

Cross-tabulation

  • Many times, you want to look at how two or more variables can be summarized
  • For this you can use cross-tabulation

Example: Finger Lakes Homes

The number of Finger Lakes homes sold for each style and price for the past two years is shown below.

Example: Finger Lakes Homes

Insights

  • The greatest number of homes (19) in the sample are a split-level style and priced at less than $250,000.
  • Only three homes in the sample are an A-Frame style and priced at $250,000 or more.

Row and Column Percentages

  • Converting the entries in the table into row percentages or column percentages can provide additional insight about the relationship between the two variables.

Histogram

The variable of interest is placed on the horizontal axis. A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency. Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.

Histogram Skewness

Cumulative Distributions

Cumulative Distribution

So let’s take the table for car repairs:

Parts Cost ($)   Frequency   Percent Frequency
50-59                    2                  4%
60-69                   13                 26%
70-79                   16                 32%
80-89                    7                 14%
90-99                    7                 14%
100-109                  5                 10%
TOTAL                   50                100%

How would we turn this into a cumulative distribution?

Cumulative Distribution

Parts Cost ($)   Frequency   % Frequency   Cumulative Frequency   Cumulative % Frequency
50-59                    2            4%   2                      4%
60-69                   13           26%   2 + 13 = 15            4 + 26 = 30%
70-79                   16           32%   15 + 16 = 31           30 + 32 = 62%
80-89                    7           14%   31 + 7 = 38            62 + 14 = 76%
90-99                    7           14%   38 + 7 = 45            76 + 14 = 90%
100-109                  5           10%   45 + 5 = 50            90 + 10 = 100%
TOTAL                   50          100%
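The running sums in the table can be computed directly, for example with itertools.accumulate:

```python
from itertools import accumulate

classes = ["50-59", "60-69", "70-79", "80-89", "90-99", "100-109"]
freq = [2, 13, 16, 7, 7, 5]

# Running totals of frequency, then convert to cumulative percentages.
cum_freq = list(accumulate(freq))                     # [2, 15, 31, 38, 45, 50]
cum_pct = [100 * c / cum_freq[-1] for c in cum_freq]  # ends at 100.0

for cls, cf, cp in zip(classes, cum_freq, cum_pct):
    print(f"{cls:>7}: {cf:2d} ({cp:.0f}%)")
```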

Simpson’s Paradox

  • Data in two or more crosstabulations are often aggregated to produce a summary crosstabulation.
  • We must be careful in drawing conclusions about the relationship between the two variables in the aggregated crosstabulation.
  • In some cases the conclusions based upon an aggregated crosstabulation can be completely reversed if we look at the unaggregated data. The reversal of conclusions based on aggregate and unaggregated data is called Simpson’s paradox.
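A compact numeric illustration of the reversal, using the classic kidney-stone figures often cited for this paradox (the numbers and group labels are illustrative, not from this course's data):

```python
# (successes, cases) for two hypothetical treatments across two subgroups --
# the classic kidney-stone numbers often used to illustrate Simpson's paradox.
data = {
    "A": {"easy": (81, 87), "hard": (192, 263)},
    "B": {"easy": (234, 270), "hard": (55, 80)},
}

def rate(successes, cases):
    return successes / cases

# Within EACH subgroup, A has the higher success rate...
assert rate(*data["A"]["easy"]) > rate(*data["B"]["easy"])
assert rate(*data["A"]["hard"]) > rate(*data["B"]["hard"])

# ...but after aggregating the subgroups, B looks better: Simpson's paradox.
agg = {t: tuple(map(sum, zip(*groups.values())))  # sum successes and cases
       for t, groups in data.items()}
assert rate(*agg["B"]) > rate(*agg["A"])

print(agg)
```

The reversal happens because treatment A was assigned the "hard" cases far more often, so aggregation mixes up treatment choice with case difficulty.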

Scatterplots and Trendlines

  • Scatterplots can show us a clearer sense of the relationships between variables
  • “Ocular econometrics”
  • It’s also easier to spot strangeness that can be investigated further
  • ex: Simpson’s paradox

Scatterplots and Simpson’s Paradox

Creating Effective Visualizations

Creating effective graphical displays is as much art as it is science.

Here are some guidelines:

  • Give the display a clear and concise title.
  • Keep the display simple.
  • Clearly label each axis and provide the units of measure.
  • If colors are used, make sure they are distinct.
  • If multiple colors or line types are used, provide a legend.

Misleading graphs

  • Each graph you ever make is trying to convey some point
  • What are you trying to say?
  • There is a balance between trying to convey some point, and risking misleading graphs to make it

Scales in bar graphs

  • Bar graphs are great because you can convey information in a digestible way
  • But changing the scales to suit your needs is misleading

Unconventional Visualizations

Unclear Plots

  • Each plot you make should be self-contained
  • I shouldn’t have to listen to a broadcast, read a book, or listen to a lecture to understand a particular plot
  • Make visualizations that are clear in what they are showing
  • Title, axis labels, scales, even notes if needed