How to Include Dummy Variables into a Regression
Originally published: 26/11/2018 09:19
Publication number: ELQ-28959-1

Learn how to include Dummy Variables into a Regression.

Introduction

Learning how to include dummy variables in a regression is a good way to round off your introduction to the world of linear regressions. Another useful concept you can learn is Ordinary Least Squares. But now, on to dummy variables. Apart from its offensive sense, the word “dummy” has another meaning: an imitation or a copy that stands in as a substitute.

  • Step n°1 |

    What We Are About to Learn

    In regression analysis, a dummy is a variable that is used to include categorical data in a regression model. In previous tutorials, we have only used numerical data. We did that when we first introduced linear regressions and again when we were exploring the adjusted R-squared. Numbers fall naturally on a scale, whereas categories like gender or season do not, so they cannot be entered into a model directly. It’s time to find out how to include such variables in a regression we are working with.
  • Step n°2 |

    Including Categorical Data for the First Time

    First, make sure that you check the article where we took our first steps into the world of linear regressions. We will be using the SAT-GPA example from there. If you don’t have time to read it, here is a brief explanation: based on the SAT score of a student, we can predict their GPA. Now, we can improve our prediction by adding another regressor: attendance.
    In the picture below, you can see a dataset that includes a variable that measures if a student attended more than 75% of their university lectures.
    Keep in mind that this is categorical data, so we cannot simply put it in the regression.
  • Step n°3 |

    Using a Dummy Variable

    The time has come to write some code. We can begin by importing the relevant libraries by writing:
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()
    After that, let’s load the file '1.03. Dummies.csv' into the variable raw_data. You can download the file from here. If you don’t know how to load it, here’s what you need to type:
    raw_data = pd.read_csv('1.03. Dummies.csv')
    Now, let’s simply write

    raw_data
    and see what happens.
    As you can tell from the picture, there is a third column named ‘Attendance’. It records whether a student attended more than 75% of the lessons, with two possible values: Yes and No.
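    If you want to confirm this in code rather than by eye, a quick check along the following lines may help. This is only a sketch: it uses a small made-up table called sample instead of the real CSV file, and the column names 'SAT' and 'GPA' are assumptions (only 'Attendance' is named in the text above).

```python
import pandas as pd

# Hypothetical stand-in for raw_data; the numbers are invented for illustration
# and the 'SAT' and 'GPA' column names are assumed, not taken from the CSV.
sample = pd.DataFrame({
    "SAT": [1700, 1850, 1620, 1900],
    "GPA": [2.9, 3.4, 2.7, 3.6],
    "Attendance": ["No", "Yes", "No", "Yes"],
})

# List the distinct categories stored in the column.
categories = sorted(sample["Attendance"].unique())
print(categories)

# Check whether pandas treats the column as numeric.
is_numeric = pd.api.types.is_numeric_dtype(sample["Attendance"])
print(is_numeric)
```

    With the real data you would run the same two checks on raw_data['Attendance'] instead of the made-up table.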
  • Step n°4 |

    Mapping Values

    What we would usually do in such cases is map the Yes/No values to 1s and 0s. In this way, if the student attended more than 75% of the lessons, the dummy will be equal to 1. Otherwise, it will be 0.

    So, we will have transformed our yes/no question into 0s and 1s. That’s what the dummy name stands for – we are imitating the categories with numbers.
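    One possible way to carry out this mapping in pandas is sketched below. It is not necessarily the exact code used later in the tutorial: the table sample is made up for illustration, the 'SAT' and 'GPA' column names are assumptions, and data is simply a name chosen here for the transformed copy.

```python
import pandas as pd

# Hypothetical stand-in for raw_data; values and the 'SAT'/'GPA' column
# names are assumptions made for this illustration.
sample = pd.DataFrame({
    "SAT": [1700, 1850, 1620, 1900],
    "GPA": [2.9, 3.4, 2.7, 3.6],
    "Attendance": ["No", "Yes", "No", "Yes"],
})

# Work on a copy so the original table keeps its Yes/No labels.
data = sample.copy()

# Replace each category with its dummy value: Yes becomes 1, No becomes 0.
data["Attendance"] = data["Attendance"].map({"Yes": 1, "No": 0})

print(data)
```

    Working on a copy is a design choice, not a requirement: it keeps the untouched labels available in case you need to compare them with the dummy later.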