Today's Application: Regression
Today's Software: Stata
We want to model the outcome, Y, as a function of the predictor, X
$$Y_i=f(X_i)$$
If we assume the relationship is linear, we can write our model as
$$Y_i =\beta_0 + \beta_1 X_i + \varepsilon_i$$
Our job today is to estimate $\beta_0$ and $\beta_1$ using data
We use 'hats' to denote the estimates of $\beta_0$ and $\beta_1$
$$\hat{\beta}_0: \text{estimate of the intercept, } \beta_0$$
$$\hat{\beta}_1: \text{estimate of the slope coefficient, } \beta_1$$
$\hat{\beta}_0$ and $\hat{\beta}_1$ are the parameter estimates that best fit our data
How do we estimate $\beta_0$ and $\beta_1$?
Choose $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum of squared errors, $\sum_i \varepsilon_i^2$
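Written out, the minimization problem and its standard closed-form solution for a single predictor are shown below. The number of observations $n$ and the sample means $\bar{X}$ and $\bar{Y}$ are notation introduced here for illustration; they are not used elsewhere in these notes.

$$\min_{\hat{\beta}_0,\, \hat{\beta}_1} \; \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

In practice we let Stata do this calculation for us.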
Both Rotten Tomatoes and Metacritic aggregate movie reviews and give a score to a movie that ranges from 0 to 100
How related are the two?
If I gave you a Rotten Tomatoes score, what would you predict the Metacritic score to be?
Movie Reviews Data:
Data comes from a FiveThirtyEight article about being suspicious of online movie reviews (especially Fandango) [https://fivethirtyeight.com/features/fandango-movies-ratings/]
We are going to focus on movies released in 2015
Box Office Data
From Box Office Mojo (you can look at the cleaning file if you are interested in how the two data sets were merged)
* Setup
cd "/Users/Brian/Dropbox/Grad School/Sixth Year/Econ:Poli 5/Lectures/Week 5"
* Load Data
use ./data/movie_ratings_rev.dta, clear
d
/Users/Brian/Dropbox/Grad School/Sixth Year/Econ:Poli 5/Lectures/Week 5

Contains data from ./data/movie_ratings_rev.dta
  obs:           127
 vars:             8                          22 Dec 2020 18:42
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
film            str70   %70s                  Title of Movie
rottentomatoes  byte    %8.0g                 Rottent Tomatoes Critic Score
metacritic      byte    %8.0g                 Metacritic Score
genre           str22   %22s
box_office      double  %10.0g                Box Office Revenue in Millions
tickets         double  %10.0g                Tickets Sold in Millions
subsample       float   %9.0g                 Subsample For Illustrative Purposes
n               float   %9.0g
--------------------------------------------------------------------------------
Sorted by: rottentomatoes
Let's look at a scatterplot of the data to see if the two aggregators generally agree on what movies are "good"
twoway scatter metacritic rottentomatoes, msymbol(circle_hollow) ///
title("Relation Between Metacritic and Rotten Tomatoes Scores") ///
xtitle("Rotten Tomatoes Score") ///
ytitle("Metacritic Score") ///
graphregion(color(white) fcolor(white))
We want to fit a line that best fits our data
In Stata, the reg command estimates regression coefficients
Syntax:
reg Y-variable X-variable(s), [options]
The best fit line for the data
The equation for that best fit line
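For the movie data, the call would presumably look like the sketch below. The regression command itself does not appear in these notes, so treating metacritic as the Y-variable and rottentomatoes as the X-variable is an assumption, chosen because it matches the scatterplot above and the coefficient names used in the code that follows.

```stata
* Assumed regression for this section:
* Metacritic score (Y) on Rotten Tomatoes score (X)
reg metacritic rottentomatoes
```

The Coef. column of the output holds the slope (the row for rottentomatoes) and the intercept (the row for _cons).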
Let's add a best fit line to our movie reviews data
To do so we will take advantage of the fact that Stata stores estimates for the last regression that has been run
* Create a variable named points that goes from 0 to 100
range points 0 100 101
* Generate line of best fit
gen best_fit = _b[_cons] + _b[rottentomatoes]*points
(26 missing values generated)
(26 missing values generated)
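One way to draw this line over the scatterplot is to overlay a line plot of best_fit against points. This is a sketch that assumes the regression of metacritic on rottentomatoes was the last one run when best_fit was generated; the legend labels are illustrative choices.

```stata
* Overlay the hand-built best fit line on the scatterplot
* (sort makes sure the line is drawn in order of points)
twoway (scatter metacritic rottentomatoes, msymbol(circle_hollow)) ///
       (line best_fit points, sort), ///
       xtitle("Rotten Tomatoes Score") ///
       ytitle("Metacritic Score") ///
       legend(order(1 "Movies" 2 "Best fit line")) ///
       graphregion(color(white) fcolor(white))
```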
Stata keeps the coefficient estimates from the most recent regression in memory
_b[varname] returns the coefficient for the variable named varname, and _b[_cons] returns the intercept
di _b[rottentomatoes]
.59964528
Slope Interpretation: If I change X by one unit, then I predict Y will change by $\hat{\beta}_1$
In our example:
For every 1-point increase in the Rotten Tomatoes score, I expect the Metacritic score to go up by about 0.6 points
Ant-Man received a Rotten Tomatoes score of 80. What is the predicted Metacritic score?
$$\text{Predicted $Y_i$} = \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Rotten Tomatoes}_i$$
Plugging in our estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$, we find
$$\hat{Y} = 21.78 + 0.60 \cdot 80 = 69.78$$
Therefore, given the Rotten Tomatoes score of 80, we predict based on our linear regression that Ant-Man will receive a Metacritic score of about 70
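The same calculation can be done in Stata with the stored coefficients, which avoids rounding the estimates by hand. This sketch assumes the last regression run was metacritic on rottentomatoes; the result should be close to the 69.78 computed above, differing slightly because the slope is not rounded to 0.60.

```stata
* Predicted Metacritic score for a Rotten Tomatoes score of 80
di _b[_cons] + _b[rottentomatoes]*80
```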
These are predictions. Any model will have error associated with it
$$\text{Error}_i = \text{Actual}_i - \text{Predicted}_i$$
For Ant-Man the actual Metacritic score was 64, but we predicted it would be 70 given its Rotten Tomatoes score
Therefore, the error (or residual) is given by:
$$\text{Error}_i = \varepsilon_i = \text{Actual}_i - \text{Predicted}_i = 64-70 = -6$$
In Stata, we use the predict command to make predictions after a linear regression
Syntax:
predict newvar
Will create a new variable named newvar that is equal to $\hat{Y}_i=\hat{\beta}_0+ \hat{\beta}_1 X_i$
In other words, it creates a variable containing predicted values
*predicted values
predict yhat
*residuals
gen residuals = metacritic-yhat
(option xb assumed; fitted values)
%head film metacritic yhat residuals
| | film | metacritic | yhat | residuals |
|---|---|---|---|---|
| 1 | Paul Blart: Mall Cop 2 | 13 | 24.779369 | -11.779369 |
| 2 | Hitman: Agent 47 | 28 | 25.978659 | 2.0213413 |
| 3 | Hot Pursuit | 31 | 26.578304 | 4.4216957 |
| 4 | Fantastic Four | 27 | 27.17795 | -.17794991 |
| 5 | Taken 3 | 26 | 27.17795 | -1.1779499 |
| 6 | The Boy Next Door | 30 | 27.777596 | 2.2224045 |
| 7 | The Loft | 24 | 28.377239 | -4.3772392 |
| 8 | Unfinished Business | 32 | 28.377239 | 3.6227608 |
| 9 | Seventh Son | 30 | 28.976885 | 1.0231152 |
| 10 | Mortdecai | 27 | 28.976885 | -1.9768848 |
Now let's consider how reviews are related to box office revenue
Suppose we hypothesize that better-rated films make more money at the box office
Hypothesis: If a movie has a higher rating (independent variable), it will make more money (dependent variable)
First let's look at a scatterplot
twoway scatter box_office rottentomatoes, msymbol(circle_hollow) ///
title("Relation Between Box Office Receipts and Rotten Tomatoes") ///
xtitle("Rotten Tomatoes Score") ///
ytitle("Box Office Revenue (in Millions)") ///
graphregion(color(white) fcolor(white))
Let's add in a line of best fit
twoway (scatter box_office rottentomatoes) (lfit box_office rottentomatoes), ///
title("Relation Between Box Office Receipts and Rotten Tomatoes") ///
xtitle("Rotten Tomatoes Score") ///
ytitle("Box Office Revenue (in Millions)") ///
graphregion(color(white) fcolor(white))
Interpretation: If the Rotten Tomatoes score goes up by 1 point, I expect the movie to earn 0.27 million dollars more (270,000 dollars)
But the standard error is large relative to this estimate, so we cannot reject the null hypothesis of no effect
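The regression output behind this interpretation is not reproduced in these notes. A minimal sketch of how it could be obtained, assuming the model is box_office regressed on rottentomatoes, is below. The t statistic and p-value lines only recompute by hand what the reg output table already reports.

```stata
* Assumed regression: box office revenue (Y) on Rotten Tomatoes score (X)
reg box_office rottentomatoes

* t statistic for the slope: coefficient divided by its standard error
di _b[rottentomatoes] / _se[rottentomatoes]

* Two-sided p-value for the null hypothesis that the slope is zero
di 2*ttail(e(df_r), abs(_b[rottentomatoes] / _se[rottentomatoes]))
```

A p-value above the chosen significance level (commonly 0.05) means we cannot reject the null hypothesis.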
For fun, let's see for which movies there is a large error
There are two ways we can get the residuals using the predict command
**** Option 1
*form predicted value
predict yhat_box_office, xb
*form residual
gen resid1 = box_office-yhat_box_office
**** Option 2 -- use residual option
predict resid2, residuals
**** Sometimes commands have useful options
**** that will save you time if you remember
**** to read the documentation
%head resid1 resid2
| | resid1 | resid2 |
|---|---|---|
| 1 | 31.383801 | 31.383801 |
| 2 | -17.77504 | -17.77504 |
| 3 | -5.929637 | -5.9296379 |
| 4 | 15.340361 | 15.340361 |
| 5 | 48.479237 | 48.479237 |
| 6 | -5.0384717 | -5.0384712 |
| 7 | -35.3092 | -35.3092 |
| 8 | -31.092384 | -31.092384 |
| 9 | -23.853451 | -23.853449 |
| 10 | -33.883102 | -33.883099 |
sum resid1
list film rottentomatoes box_office yhat resid1 if resid1 == `r(max)'
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      resid1 |        127    4.64e-07    96.97567  -64.37617   594.9178

     +-------------------------------------------------------------+
     |           film   rotten~s   box_off~e       yhat     resid1 |
     |-------------------------------------------------------------|
 76. | Jurassic World         71   652.27062   64.35596   594.9178 |
     +-------------------------------------------------------------+
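Regression will often have a hard time predicting outliers. Jurassic World made a lot of money but had average ratings, so the prediction based solely on ratings is far too low
Note that the yhat column listed above is the predicted Metacritic score from the earlier regression. The box office prediction is stored in yhat_box_office and equals 652.27 - 594.92, or about 57.35 million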
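To look at more than the single largest residual, one option is to sort by the residual and list the top few. This is a sketch that assumes resid1 and yhat_box_office were created as in Option 1; note that gsort changes the sort order of the data in memory.

```stata
* Sort from largest to smallest residual and list the five biggest misses
gsort -resid1
list film rottentomatoes box_office yhat_box_office resid1 in 1/5
```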
For fun, let's see which genre of films has the highest average box office (graph bar plots the mean by default)
graph bar box_office, over(genre) graphregion(color(white) fcolor(white)) ///
ytitle(Box Office in Millions)
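The same comparison can be read as numbers rather than bar heights. A sketch using tabstat, which here reports the mean box office and the number of films in each genre:

```stata
* Mean box office revenue and number of films, by genre
tabstat box_office, by(genre) statistics(mean n)
```

The film counts help flag genres whose average rests on only a handful of movies.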
Regression is a very important technique
Next class we will continue with regression in Stata