Originally published: 26/11/2018 09:19
Publication number: ELQ-28959-1

How to Include Dummy Variables into a Regression

Learn how to include Dummy Variables into a Regression.

by 365 Data Science
Data Science Education Platform



Introduction

Learning how to include dummy variables in a regression is a good way to round off your introduction to linear regression. Another useful concept you can learn is ordinary least squares (OLS). But now, on to dummy variables. Apart from its offensive use, the word “dummy” has another meaning: an imitation or a copy that stands in as a substitute.

Step n°1 |
What are we About to Learn
In regression analysis, a dummy is a variable used to include categorical data in a regression model. In previous tutorials, we have only used numerical data. We did that when we first introduced linear regressions and again when we were exploring the adjusted R-squared. Numbers fit naturally on a scale, but categories like gender or season do not. It’s time to find out how to include such variables in a regression.
Step n°2 |
Including Categorical Data for the First Time
First, make sure you check the article where we took our first steps into the world of linear regressions. We will be using the SAT-GPA example from there. If you don’t have time to read it, here is a brief explanation: based on a student’s SAT score, we can predict their GPA. Now, we can improve our prediction by adding another regressor: attendance.
In the picture below, you can see a dataset that includes a variable that measures if a student attended more than 75% of their university lectures.
Keep in mind that this is categorical data stored as text, so we cannot put it into the regression as it is.
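To see why the text labels cannot go into the model directly, here is a minimal sketch. It assumes the attendance values are stored as the strings Yes and No; the short series below is made up for illustration and is not the article's dataset.

```python
import pandas as pd

# Hypothetical attendance labels, assumed to be stored as text.
attendance = pd.Series(["No", "Yes", "Yes"], name="Attendance")

try:
    # A regression needs numbers; the labels have no numeric reading.
    pd.to_numeric(attendance)
    converted = True
except ValueError as err:
    converted = False
    print("Cannot treat the labels as numbers:", err)
```

The conversion fails, which is why the categories first need a numeric stand-in.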
Step n°3 |
Using a Dummy Variable
The time has come to write some code. We can begin by importing the relevant libraries:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
After that, let’s load the file '1.03. Dummies.csv' into the variable raw_data. You can download the file from here. If you don’t know how to load it, here’s what you need to type:
raw_data = pd.read_csv('1.03. Dummies.csv')
Now, let’s simply write
raw_data
and see what happens.
As you can tell from the picture, there is a third column named ‘Attendance’. It records whether a student attended more than 75% of the lessons and takes two values: Yes and No.
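Before transforming the column, it can help to confirm which labels it actually contains. The sketch below uses a small made-up frame in place of the CSV; the column names SAT and GPA and all the values are assumptions, and only ‘Attendance’ with its Yes/No labels comes from the text.

```python
import pandas as pd

# Hypothetical stand-in for raw_data; values are made up for illustration.
raw_data = pd.DataFrame({
    "SAT": [1700, 1650, 1800, 1750],
    "GPA": [2.4, 2.5, 3.3, 3.1],
    "Attendance": ["No", "No", "Yes", "Yes"],
})

# List the distinct labels and how often each one occurs.
labels = sorted(raw_data["Attendance"].unique())
counts = raw_data["Attendance"].value_counts()
print(labels)
print(counts)
```

If anything other than Yes and No shows up here (for example a misspelled label or a missing value), it is worth cleaning that up before creating the dummy.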
Step n°4 |
Mapping Values
What we usually do in such cases is map the Yes/No values to 1s and 0s. This way, if the student attended more than 75% of the lessons, the dummy will be equal to 1. Otherwise, it will be 0.

So, we will have transformed our yes/no question into 0s and 1s. That’s what the dummy name stands for – we are imitating the categories with numbers.
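One way to do this mapping in pandas is with Series.map and a dictionary. This is a minimal sketch, not necessarily the code used later in the tutorial. The small frame stands in for raw_data, and the SAT and GPA column names and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for raw_data; values are made up for illustration.
raw_data = pd.DataFrame({
    "SAT": [1700, 1650, 1800, 1750],
    "GPA": [2.4, 2.5, 3.3, 3.1],
    "Attendance": ["No", "No", "Yes", "Yes"],
})

# Work on a copy so the original labels stay available.
data = raw_data.copy()

# Replace each label with its numeric stand-in: Yes -> 1, No -> 0.
data["Attendance"] = data["Attendance"].map({"Yes": 1, "No": 0})
print(data)
```

Any label missing from the dictionary would become NaN, so a quick check that the new column contains only 0s and 1s is a sensible follow-up.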
