Stata V

Econ 5/Poli 5D Lecture 9

Announcements

  • Third quiz due Friday at midnight
  • Second homework assignment due next week

Today's Application: Regression

  • Extremely important statistical technique
  • We will use regression today to study the relationship between different types of movie reviews
  • We will also study the relationship between movie reviews and box office revenue, taking care to distinguish correlation from causation

Today's Software: Stata

  • Learn how to run regressions (using ``reg``) in Stata
  • Construct a best fit line and interpret its key elements

We want to model the outcome, Y, as a function of the predictor, X

$$Y_i=f(X_i)$$

If we assume the relationship is linear, we can write our model as

$$Y_i =\beta_0 + \beta_1 X_i + \varepsilon_i$$

Our job today is to estimate $\beta_0$ and $\beta_1$ using data

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
$$\text{Y-value} = \text{Intercept} + \text{Slope} \times \text{X-value} + \text{error}$$

We use 'hats' to denote the estimates of $\beta_0$ and $\beta_1$

$$\hat{\beta}_0: \text{estimate of the intercept, } \beta_0$$
$$\hat{\beta}_1: \text{estimate of the slope coefficient, } \beta_1$$

$\hat{\beta}_0$ and $\hat{\beta}_1$ are the parameter estimates that best fit our data

How do we estimate $\beta_0$ and $\beta_1$?

Choose the $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum of squared errors, $\sum_i \varepsilon_i^2$
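As a sketch of what "minimize" means here (a standard result for the one-predictor model, not derived in this lecture; $\bar{X}$ and $\bar{Y}$ denote the sample means of X and Y), the problem is

$$\min_{\beta_0,\, \beta_1} \sum_{i=1}^{n} \left(Y_i - \beta_0 - \beta_1 X_i\right)^2$$

Setting the derivatives with respect to $\beta_0$ and $\beta_1$ equal to zero and solving gives

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

Stata computes these for us, so we will not need to apply the formulas by hand.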

Application -- Movie Reviews

  • We will consider the relationship between different movie review websites
  • Both Rotten Tomatoes and Metacritic aggregate movie reviews and give a score to a movie that ranges from 0 to 100

    • How related are the two?

    • If I gave you a Rotten Tomatoes score, what would your prediction for the Metacritic score be?

  • Then we will see if there is a relationship between movie reviews and box office revenue/ticket sales

Data

  • Movie Reviews Data:

    • Data comes from a FiveThirtyEight article about being suspicious of online movie reviews (especially Fandango) [https://fivethirtyeight.com/features/fandango-movies-ratings/]

    • We are going to focus on movies released in 2015

  • Box Office Data

    • From Box Office Mojo (you can look at the cleaning file if you are interested in how the two data sets were merged)

      • Box office revenue and ticket sales are reported in millions
In [1]:
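* Setup: change to the lecture folder (machine-specific path)
cd "/Users/Brian/Dropbox/Grad School/Sixth Year/Econ:Poli 5/Lectures/Week 5"

* Load the data, discarding any data already in memory
use ./data/movie_ratings_rev.dta, clear

* Describe the variables (d is short for describe)
d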
/Users/Brian/Dropbox/Grad School/Sixth Year/Econ:Poli 5/Lectures/Week 5



Contains data from ./data/movie_ratings_rev.dta
  obs:           127                          
 vars:             8                          22 Dec 2020 18:42
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
film            str70   %70s                  Title of Movie
rottentomatoes  byte    %8.0g                 Rottent Tomatoes Critic Score
metacritic      byte    %8.0g                 Metacritic Score
genre           str22   %22s                  
box_office      double  %10.0g                Box Office Revenue in Millions
tickets         double  %10.0g                Tickets Sold in Millions
subsample       float   %9.0g                 Subsample For Illustrative
                                                Purposes
n               float   %9.0g                 
--------------------------------------------------------------------------------
Sorted by: rottentomatoes

Let's look at a scatterplot of the data to see if the two aggregators generally agree on what movies are "good"

In [2]:
twoway scatter metacritic rottentomatoes, msymbol(circle_hollow) ///
    title("Relation Between Metacritic and Rotten Tomatoes Scores") ///
    xtitle("Rotten Tomatoes Score") ///
    ytitle("Metacritic Score") ///
    graphregion(color(white) fcolor(white))

We want to find the line that best fits our data

In Stata, the reg command (short for regress) estimates regression coefficients

Syntax:

reg Y-variable X-variable(s) [, options]
In [3]:
reg metacritic rottentomatoes
      Source |       SS           df       MS      Number of obs   =       127
-------------+----------------------------------   F(1, 125)       =   1414.98
       Model |  41425.9232         1  41425.9232   Prob > F        =    0.0000
    Residual |  3659.58865       125  29.2767092   R-squared       =    0.9188
-------------+----------------------------------   Adj R-squared   =    0.9182
       Total |  45085.5118       126  357.821522   Root MSE        =    5.4108

-------------------------------------------------------------------------------
   metacritic |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
rottentomat~s |   .5996453   .0159411    37.62   0.000     .5680958    .6311948
        _cons |   21.78114   1.039368    20.96   0.000      19.7241    23.83818
-------------------------------------------------------------------------------

$$\hat{\beta}_0 = 21.78$$
$$\hat{\beta}_1 = 0.60$$
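Putting the two estimates together gives the fitted line, and the sums of squares in the top-left panel of the output reproduce the reported R-squared. This is a worked reading of the table above, with coefficients rounded to two decimals:

$$\widehat{\text{Metacritic}}_i = 21.78 + 0.60 \cdot \text{Rotten Tomatoes}_i$$

$$R^2 = \frac{\text{Model SS}}{\text{Total SS}} = \frac{41425.9232}{45085.5118} \approx 0.9188$$

In this sample, Rotten Tomatoes scores account for about 92% of the variation in Metacritic scores.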

What does Linear Regression tell us?

The best fit line for the data

  • The equation for that best fit line

    • A predicted Y for any value of X

Let's add a best fit line to our movie reviews data

To do so we will take advantage of the fact that Stata stores estimates for the last regression that has been run

In [4]:
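* Create a variable named points with 101 evenly spaced values from 0 to 100
* (the data has 127 observations, so the remaining 26 are missing)
range points 0 100 101

* Generate line of best fit from the stored intercept and slope
gen best_fit = _b[_cons] + _b[rottentomatoes]*points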
(26 missing values generated)

(26 missing values generated)

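Stata stores the coefficient estimates from the most recent regression (in the matrix e(b)) until another estimation command is run

_b[varname] returns the estimated coefficient on the variable varname, and _b[_cons] returns the intercept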

In [5]:
di _b[rottentomatoes]
.59964528
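A minimal sketch of how stored results behave, using Stata's built-in auto dataset as a stand-in so it runs without the movie data. The variable name price_hat is made up for illustration.

```stata
* Stand-in data: Stata's built-in auto dataset, not the movie data
sysuse auto, clear

* Regress price on weight; results stay stored until the next estimation command
regress price weight
display _b[weight]
display _b[_cons]

* The full coefficient vector is stored in the matrix e(b)
matrix list e(b)

* Stored coefficients can be used in any expression
generate double price_hat = _b[_cons] + _b[weight]*weight

* Running another regression replaces the stored results
regress mpg weight
display _b[weight]
```

Because the second regression overwrites the first, any variable that depends on _b[] should be generated right after the regression it belongs to.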

Interpretation of Slope Coefficient

Slope coefficient: If X increases by one unit, the predicted value of Y changes by $\hat{\beta}_1$

In our example:

  • $\hat{\beta}_1= 0.60$
  • Y $=$ Metacritic Score
  • X $=$ Rotten Tomatoes Score

For every 1-point increase in the Rotten Tomatoes score, I expect the Metacritic score to go up by 0.6 points, on average

Predicted Value

Ant-Man received a Rotten Tomatoes score of 80. What is the predicted Metacritic score?

$$\text{Predicted $Y_i$} = \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Rotten Tomatoes}_i$$

Plugging in our estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$, we find

$$\hat{Y} = 21.78 + 0.60 \cdot 80 = 69.78$$

Therefore, given its Rotten Tomatoes score of 80, our linear regression predicts that Ant-Man will receive a Metacritic score of about 70

Model Error (Residuals)

These are predictions. Any model will have error associated with it

$$\text{Error}_i = \text{Actual}_i - \text{Predicted}_i$$

For Ant-Man, the actual Metacritic score was 64, but we predicted it would be about 70 given its Rotten Tomatoes score

Therefore, the error (or residual) is given by:

$$\text{Error}_i = \varepsilon_i = \text{Actual}_i - \text{Predicted}_i = 64 - 70 = -6$$
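The arithmetic above can be checked with Stata's display command. This is a minimal sketch that types in the unrounded coefficients from the regression table rather than reading them from a live session, so the numbers differ slightly from the rounded hand calculation (a prediction of about 69.75 and a residual of about -5.75). The scalar names are made up for illustration.

```stata
* Coefficients copied from the regression output above
scalar b0 = 21.78114
scalar b1 = 0.5996453

* Predicted Metacritic score for a Rotten Tomatoes score of 80
scalar yhat_antman = b0 + b1*80
display yhat_antman

* Residual = actual - predicted (actual Metacritic score was 64)
scalar resid_antman = 64 - yhat_antman
display resid_antman
```

Right after running the regression, the same prediction could be written with the stored coefficients instead of typed numbers.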

Predict Command

In Stata, we use the predict command to make predictions after a linear regression

Syntax:

predict newvar

This creates a new variable named newvar that is equal to $\hat{Y}_i=\hat{\beta}_0+ \hat{\beta}_1 X_i$

In other words, it creates a variable containing predicted values

In [6]:
* Predicted values (predict defaults to xb, the fitted values)
predict yhat

* Residuals: actual minus predicted
gen residuals = metacritic - yhat
(option xb assumed; fitted values)

In [7]:
%head film metacritic yhat residuals
    film                    metacritic       yhat   residuals
 1  Paul Blart: Mall Cop 2          13  24.779369  -11.779369
 2  Hitman: Agent 47                28  25.978659   2.0213413
 3  Hot Pursuit                     31  26.578304   4.4216957
 4  Fantastic Four                  27   27.17795  -.17794991
 5  Taken 3                         26   27.17795  -1.1779499
 6  The Boy Next Door               30  27.777596   2.2224045
 7  The Loft                        24  28.377239  -4.3772392
 8  Unfinished Business             32  28.377239   3.6227608
 9  Seventh Son                     30  28.976885   1.0231152
10  Mortdecai                       27  28.976885  -1.9768848

Now let's consider how reviews are related to box office revenue

Suppose we hypothesize that better-rated films make more money at the box office

Hypothesis: If a movie has a higher rating (independent variable), it will make more money (dependent variable)

First let's look at a scatterplot

In [8]:
twoway scatter box_office rottentomatoes, msymbol(circle_hollow) ///
    title("Relation Between Box Office Receipts and Rotten Tomatoes") ///
    xtitle("Rotten Tomatoes Score") ///
    ytitle("Box Office Revenue (in Millions)") ///
    graphregion(color(white) fcolor(white))

Let's add in a line of best fit

In [9]:
twoway (scatter box_office rottentomatoes) (lfit box_office rottentomatoes), ///
    title("Relation Between Box Office Receipts and Rotten Tomatoes") ///
    xtitle("Rotten Tomatoes Score") ///
    ytitle("Box Office Revenue (in Millions)") ///
    graphregion(color(white) fcolor(white))
In [10]:
reg box_office rottentomatoes
      Source |       SS           df       MS      Number of obs   =       127
-------------+----------------------------------   F(1, 125)       =      0.87
       Model |  8234.53875         1  8234.53875   Prob > F        =    0.3531
    Residual |  1184939.29       125  9479.51432   R-squared       =    0.0069
-------------+----------------------------------   Adj R-squared   =   -0.0010
       Total |  1193173.83       126  9469.63357   Root MSE        =    97.363

-------------------------------------------------------------------------------
   box_office |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
rottentomat~s |   .2673486   .2868477     0.93   0.353    -.3003586    .8350558
        _cons |   38.37105   18.70257     2.05   0.042     1.356338    75.38576
-------------------------------------------------------------------------------

Interpretation: If a movie's Rotten Tomatoes score is 1 point higher, I expect it to earn about 0.27 million dollars more (roughly $270,000)

But the standard error (0.29) is large relative to the coefficient (t = 0.93, p = 0.353) and the 95% confidence interval includes zero, so we cannot reject the null hypothesis of no relationship
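To see where the t statistic, p-value, and confidence interval in the table come from, here is a minimal sketch that recomputes them from the reported coefficient and standard error. The scalar names are made up for illustration; ttail() and invttail() are Stata's built-in Student's t functions, and 125 is the residual degrees of freedom from the table.

```stata
* Coefficient and standard error copied from the regression table
scalar coef   = 0.2673486
scalar stderr = 0.2868477

* t statistic: coefficient divided by its standard error
scalar tstat = coef/stderr
display tstat

* Two-sided p-value with 125 residual degrees of freedom
scalar pval = 2*ttail(125, abs(tstat))
display pval

* 95% confidence interval: coefficient +/- critical value * standard error
scalar tcrit = invttail(125, 0.025)
display coef - tcrit*stderr
display coef + tcrit*stderr
```

The results should match the t, P>|t|, and confidence interval columns of the table up to rounding.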

For fun, let's see for which movies there is a large error

There are two ways to get residuals using the predict command

In [11]:
**** Option 1
*form predicted value
predict yhat_box_office, xb
*form residual
gen resid1 = box_office-yhat_box_office

**** Option 2 -- use residual option
predict resid2, residuals

**** Sometimes commands have useful options 
**** that will save you time if you remember
**** to read the documentation
In [12]:
%head resid1 resid2
        resid1      resid2
 1   31.383801   31.383801
 2   -17.77504   -17.77504
 3   -5.929637  -5.9296379
 4   15.340361   15.340361
 5   48.479237   48.479237
 6  -5.0384717  -5.0384712
 7    -35.3092    -35.3092
 8  -31.092384  -31.092384
 9  -23.853451  -23.853449
10  -33.883102  -33.883099
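The two columns agree except for small differences in the last digits (rows 3, 6, 9, and 10). A likely explanation is storage precision: gen and predict create float variables by default, so Option 1 subtracts a fitted value that has already been rounded to float. A minimal sketch of one way around this, using Stata's built-in auto dataset as a stand-in for the movie data, is to store everything in double. The variable names are made up for illustration.

```stata
* Stand-in data: Stata's built-in auto dataset, not the movie data
sysuse auto, clear
regress price weight

* Option 1 in double precision: fitted values, then subtract by hand
predict double fit_d, xb
generate double res_hand = price - fit_d

* Option 2 in double precision: ask predict for residuals directly
predict double res_pred, residuals

* The two versions now agree to within rounding error
assert reldif(res_hand, res_pred) < 1e-10

* OLS residuals average to zero when the model includes a constant
summarize res_pred
```

For most purposes the float differences are harmless, but double storage avoids surprises when comparing values with ==.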
In [13]:
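* Find the largest residual, then list the movie that has it
* (yhat is the earlier Metacritic prediction; the box office prediction is yhat_box_office)
sum resid1
list film rottentomatoes box_office yhat resid1 if resid1 == `r(max)'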

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      resid1 |        127    4.64e-07    96.97567  -64.37617   594.9178


     +-------------------------------------------------------------+
     |           film   rotten~s   box_off~e       yhat     resid1 |
     |-------------------------------------------------------------|
 76. | Jurassic World         71   652.27062   64.35596   594.9178 |
     +-------------------------------------------------------------+

Regression will often have a hard time predicting outliers. Jurassic World made a lot of money but had average ratings, so the prediction based solely on ratings (about 57 million dollars, against actual revenue of about 652 million) is far too low
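Listing only the single largest residual hides how the other big misses compare. As a hedged sketch, again using the built-in auto dataset as a stand-in for the movie data, sorting by the residual shows the largest under-predictions and over-predictions side by side. The variable name res is made up for illustration.

```stata
* Stand-in data: Stata's built-in auto dataset, not the movie data
sysuse auto, clear
regress price weight
predict double res, residuals

* Sort from the largest positive residual (most under-predicted) downward
gsort -res
list make price weight res in 1/5

* The most over-predicted observations are at the other end
list make price weight res in -5/l
```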

For fun, let's see which genre of films has the highest average box office revenue (graph bar plots the mean by default)

In [14]:
graph bar box_office, over(genre) graphregion(color(white) fcolor(white)) ///
ytitle(Box Office in Millions)

Conclusion

Regression is a very important technique

  • Gives us a way to form predictions
  • Need to understand regression to move on to more sophisticated prediction techniques
  • Regression is often taught as a machine learning method

Next class we will continue with regression in Stata