
Publication number: ELQ-28959-1

How to Include Dummy Variables in a Regression
Learn how to include dummy variables in a regression.
Introduction
Learning how to include dummy variables in a regression is a good way to round off your introduction to the world of linear regressions. Another useful concept you can learn is Ordinary Least Squares, but here we focus on dummy variables. Apart from its use as an insult, the word “dummy” has another meaning: an imitation or a copy that stands in as a substitute.
What We Are About to Learn
In regression analysis, a dummy is a variable that is used to include categorical data in a regression model. In previous tutorials, we have only used numerical data. We did that when we first introduced linear regressions and again when we were exploring the adjusted R-squared. However, while numbers sit naturally on a scale, categories such as gender or season do not. It’s time to find out how to include such variables in a regression we are working with.
Including Categorical Data for the First Time
First, make sure you have read the article in which we took our first steps into the world of linear regressions, because we will be using the SAT-GPA example from there. If you don’t have time to read it, here is a brief explanation: based on a student’s SAT score, we can predict their GPA. Now we can improve our prediction by adding another regressor, attendance.
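To make the idea concrete, the two models can be written side by side. This is only a sketch: the coefficient symbols b_0, b_1 and b_2 are our own notation, and Attendance here stands for a numeric version of the attendance variable, not the raw category.

```latex
\widehat{\text{GPA}} = b_0 + b_1 \cdot \text{SAT}
\qquad\longrightarrow\qquad
\widehat{\text{GPA}} = b_0 + b_1 \cdot \text{SAT} + b_2 \cdot \text{Attendance}
```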
In the picture below, you can see a dataset that includes a variable measuring whether a student attended more than 75% of their university lectures. Keep in mind that this is categorical data, so we cannot simply put it in the regression.
Using a Dummy Variable
The time has come to write some code. We can begin by importing the relevant libraries:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
After that, let’s load the file ‘1.03. Dummies.csv’ into the variable raw_data. You can download the file from here. If you don’t know how to load it, here’s what you need to type:
raw_data = pd.read_csv('1.03. Dummies.csv')
Now, let’s simply write
raw_data
and see what happens.
As you can tell from the picture, there is a third column named ‘Attendance’. It shows whether a student attended more than 75% of the lessons and takes one of two values: Yes or No.
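If you would rather confirm the categories in code than read them off the picture, a quick check along these lines may help. It is a minimal sketch: the small data frame is a made-up stand-in for the CSV file, and the column names SAT and GPA are assumed from the SAT-GPA example.

```python
import pandas as pd

# Hypothetical stand-in for '1.03. Dummies.csv'; the values are made up for illustration.
raw_data = pd.DataFrame({
    "SAT": [1700, 1650, 1800, 1750],
    "GPA": [2.9, 3.1, 3.5, 3.4],
    "Attendance": ["No", "Yes", "No", "Yes"],
})

# List the distinct categories in the column and count how often each one occurs.
categories = sorted(raw_data["Attendance"].unique())
counts = raw_data["Attendance"].value_counts()

print(categories)
print(counts)
```

With the real file, the same two calls would reveal any unexpected labels (for example, a lowercase yes) before you go any further.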
Mapping Values
What we usually do in such cases is map the Yes/No values to 1s and 0s. That way, if the student attended more than 75% of the lessons, the dummy is equal to 1; otherwise, it is 0.
In other words, we transform our yes/no question into 0s and 1s. That is where the name “dummy” comes from: we are imitating the categories with numbers.
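One way to carry out this mapping in pandas is Series.map with a dictionary. The sketch below assumes the column is named ‘Attendance’ and holds only the strings Yes and No. The small data frame is a made-up stand-in for the CSV file, and the variable name data is our own choice.

```python
import pandas as pd

# Hypothetical stand-in for '1.03. Dummies.csv'; the values are made up for illustration.
raw_data = pd.DataFrame({
    "SAT": [1700, 1650, 1800, 1750],
    "GPA": [2.9, 3.1, 3.5, 3.4],
    "Attendance": ["No", "Yes", "No", "Yes"],
})

# Work on a copy so the original Yes/No column stays available in raw_data.
data = raw_data.copy()

# Replace each category with its numeric stand-in: Yes -> 1, No -> 0.
data["Attendance"] = data["Attendance"].map({"Yes": 1, "No": 0})

# Labels missing from the dictionary would turn into NaN, so make sure none did.
assert data["Attendance"].notna().all()

print(data)
```

Keeping raw_data untouched makes it easy to compare the mapped column with the original labels if something looks off.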