Categorical, Dummy Variables And One-Hot Encoding

Today's topic is Categorical Variables, Dummy Variables & One-Hot Encoding. Let us understand these with the pandas & sklearn libraries.

Categorical Variables : Categorical Variables are variables whose values fall into a fixed set of categories. In a Categorical Variable the categories have no order.

Example : The variable is Eye Color & its categories are Black, Blue, Brown, Grey.

Now take another example : the variable is House_Price & its categories are Low, Medium & High. Here the categories have a natural order, so this is not a plain Categorical Variable, it is an Ordinal Variable.
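To make the difference concrete, here is a minimal sketch of an Ordinal Variable in pandas. The price bands are made-up values, and `price_band` is just an illustrative name; the explicit `categories` list is what tells pandas the order.

```python
import pandas as pd

# Hypothetical price bands; the explicit category list defines the order.
price_band = pd.Categorical(
    ["Low", "High", "Medium", "Low"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)

# Integer position of each value in the ordered category list.
codes = price_band.codes.tolist()

# Order-aware operations such as min() are allowed because ordered=True.
lowest = price_band.min()

print(codes, lowest)
```

For a variable like Eye Color we would leave `ordered` at its default of False, because no category is "smaller" than another.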

For the purpose of Data Analysis we can assign a number to each category of our variable, like this :

CODE	GENDER
0	MALE
1	FEMALE
2	OTHER

Assigning numbers to the categories makes the data easier to analyze & manipulate, but it doesn't change the fact that the variable is still categorical. Even though the numbers have an order and might look like another type of variable called quantitative (variables on which we can perform mathematical operations such as 1 + 2 + 3 = 6), the variable remains a Categorical Variable.
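A small sketch of this idea, assuming a made-up list of values and the same codes as the table (the names `gender`, `code_of` and `gender_code` are only for illustration):

```python
import pandas as pd

# Made-up sample values, coded with the same numbers as the table above.
gender = pd.Series(["MALE", "FEMALE", "OTHER", "FEMALE"])
code_of = {"MALE": 0, "FEMALE": 1, "OTHER": 2}
gender_code = gender.map(code_of)

# Counting how often each code appears is a meaningful operation.
counts = gender_code.value_counts().sort_index()

# The mean can be computed, but the number says nothing about gender:
# the codes are labels, not quantities.
meaningless_mean = gender_code.mean()

print(gender_code.tolist())
print(counts.tolist())
```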

Dummy Variables :

In many situations we must work with categorical independent variables.

In regression analysis the 0/1 variables used to represent them are called Dummy Variables or Indicator Variables.

For a variable with n categories we need n-1 Dummy Variables.

Eg1 : We have Convent School & Non Convent School. These are 2 categories, so 2 - 1 = 1 Dummy Variable, which takes the values 0 & 1.

Eg2 : We have North/South/East/West. These are 4 categories, so 4 - 1 = 3 Dummy Variables. Three of the regions each get their own Dummy Variable set to 1, and the fourth region is represented by all of them being 0.

Here is the tabular representation of the North/South/East/West example.

Q.) Let's say we are looking at housing data across these 4 regions. How can we apply Dummy Variables to them?

REGION	MUMBAI	PUNE	DELHI
North	1	0	0
South	0	1	0
East	0	0	1
West	0	0	0

Above we have 4 categories, so we need 4 - 1 = 3 Dummy Variables. We put Mumbai, Pune & Delhi on the top; those are our 3 Dummy Variables. North, the 1st region, is represented by Mumbai = 1 & Pune, Delhi = 0. South, the 2nd region, is represented by Pune = 1 & Mumbai, Delhi = 0. East, the 3rd region, is represented by Delhi = 1 & Mumbai, Pune = 0. For West, the 4th region, Mumbai, Pune & Delhi are all 0, so West needs no column of its own. This is how we represent 4 Categories with 3 Dummy Variables.
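The same 4-categories-to-3-columns idea can be sketched in pandas. This is only an illustration: here the dummy columns are named after the regions themselves (pandas does this by default) instead of Mumbai/Pune/Delhi, and West is chosen as the all-zero baseline to match the table.

```python
import pandas as pd

regions = pd.Series(["North", "South", "East", "West"], name="REGION")

# One 0/1 column per category (columns come out in sorted order).
all_columns = pd.get_dummies(regions, dtype=int)

# Keep n - 1 = 3 columns; West becomes the all-zero baseline row.
dummies = all_columns.drop(columns="West")
dummies.index = regions

print(dummies)
```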

One-Hot Encoding : One-Hot Encoding transforms our Categorical Variables into vectors of 0's & 1's. The length of each vector is equal to the number of classes or categories that our model is expected to classify. If we are classifying whether images are of a Horse or a Donkey, then the One-Hot Encoded vectors for these classes are of length 2, since there are 2 categories in total. If we add another category such as Zebra, so that we classify images as Horse, Donkey or Zebra, then the One-Hot Encoded vectors are each of length 3, since we now have 3 categories. You can relate this to the example above, where we used Mumbai, Pune & Delhi as columns and placed a 1 in a different column as we went down the regions; the only difference is that the table above uses 3 columns for 4 regions, while One-Hot Encoding would use 4.
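Here is a minimal sketch of that Horse/Donkey/Zebra example with NumPy. The helper `one_hot` and the order of the `classes` list are assumptions made only for this illustration.

```python
import numpy as np

classes = ["Horse", "Donkey", "Zebra"]
position = {name: i for i, name in enumerate(classes)}

def one_hot(label):
    """Return a 0/1 vector with a single 1 at the position of `label`."""
    vector = np.zeros(len(classes), dtype=int)
    vector[position[label]] = 1
    return vector

for name in classes:
    print(name, one_hot(name))
```

Each vector has length 3 because there are 3 classes, and exactly one entry is "hot".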

Difference between Dummy Variable And One Hot Encoding :

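One-Hot Encoding & Dummy Variable Encoding are two ways of encoding Categorical Data as columns of 0's & 1's, and they do almost the same thing. One-Hot Encoding keeps one column for each of the n categories, while Dummy Variable Encoding keeps only n-1 columns and represents the remaining category by all 0's.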
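A quick way to see the two side by side is the `drop_first` option of `pd.get_dummies`. This is a sketch on a made-up list of cities, not the article's CSV.

```python
import pandas as pd

city = pd.Series(["Mumbai", "Pune", "Delhi", "Pune"])

# One-Hot Encoding: one column per category (n columns).
one_hot = pd.get_dummies(city, dtype=int)

# Dummy Variable Encoding: the first category (Delhi) is dropped (n - 1 columns).
dummy = pd.get_dummies(city, drop_first=True, dtype=int)

print(one_hot)
print(dummy)
```

In the one-hot table every row contains exactly one 1; in the dummy table the Delhi row is all 0's.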

Dummy Variable Encoding vs One Hot Encoding :

Dummy Variable Example :

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

df = pd.read_csv(r"E:\Git_repo\One_Hot_Data.csv")


Step1 : Create Dummy Variable Columns

dummies = pd.get_dummies(df.City) #pd.get_dummies is a pandas function

dummies

output :

Step2 : Concatenate/Append these Dummy Variables to the Original DataSet

mr_data = pd.concat([df,dummies],axis="columns")

mr_data

output :

Step3 : Drop the City column from mr_data, which is our merged dataset

final = mr_data.drop(['City'],axis='columns')

Step4 : Drop one of the Dummy Variable Columns because of the Dummy Variable Trap (I will explain that concept in the next blog)

final_data = final.drop(['Pune'],axis='columns')

final_data

output :
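As a rough sketch of why one column is dropped in Step4: with all three city columns, the columns add up to 1 in every row, which is exactly the intercept column of a regression, so one column carries no new information. The city list below is made up, and `X_full` / `X_dropped` are illustrative names.

```python
import numpy as np
import pandas as pd

city = pd.Series(["Delhi", "Mumbai", "Pune", "Delhi", "Pune"])
all_dummies = pd.get_dummies(city, dtype=float)

# A regression with an intercept implicitly adds a column of 1's.
intercept = np.ones((len(city), 1))

# Intercept + all 3 dummy columns: 4 columns in total.
X_full = np.hstack([intercept, all_dummies.to_numpy()])

# Intercept + 2 dummy columns (Pune dropped): 3 columns in total.
X_dropped = np.hstack([intercept, all_dummies.drop(columns="Pune").to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)

print(X_full.shape[1], rank_full)        # more columns than independent columns
print(X_dropped.shape[1], rank_dropped)  # every column is independent
```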

#Note : sklearn's LinearRegression will still fit without an error even if you don't drop the column, because its least-squares solver can cope with perfectly collinear columns, but it is good practice to drop it

from sklearn.linear_model import LinearRegression

model=LinearRegression()

#Now give X & Y for training

X=final_data[['Area_in_sqft','Delhi','Mumbai']]

Y=final_data['Price_in_dollars']

#Now train our model using fit

model.fit(X,Y)

output :

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

#Now predict

model.predict([[2600,0,1]]) #['Area_in_sqft',Delhi,Mumbai]

output :

array([11337668.71651687])

#For Pune just put 0 for both Delhi & Mumbai

model.predict([[3100,0,0]])

output :

array([14180478.15940978])

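#To check how well the model fits: score returns R^2 (the coefficient of determination) on the given data, not a classification accuracy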

model.score(X,Y)

output :

0.6453501530089485
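To see what the number returned by `model.score` means for a regression model, here is a small sketch that computes R^2 by hand and compares it with `score`. The arrays `X` and `y` are made-up numbers, not the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data, roughly on a straight line.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

model = LinearRegression().fit(X, y)
predicted = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X, y))
```

A value near 1 means the model explains most of the variation in the prices; a value near 0 means it explains very little.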

This is Dummy Variable Encoding using the pandas get_dummies method.
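One thing worth seeing is what the fitted coefficients of the dummy columns mean. The sketch below uses made-up data that I built as price = 100 * area + a city offset (Pune +0, Delhi +5000, Mumbai +10000), so we know what the model should recover. It is not the article's CSV, and the column names are reused only for familiarity.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data built as price = 100 * area + city offset
# (Pune +0, Delhi +5000, Mumbai +10000).
houses = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Pune", "Pune"],
    "Area_in_sqft": [1000, 1500, 1200, 2000, 900, 1800],
    "Price_in_dollars": [105000, 155000, 130000, 210000, 90000, 180000],
})

# Same steps as above: create dummies, drop Pune, join with the area column.
dummies = pd.get_dummies(houses["City"], dtype=int).drop(columns="Pune")
X = pd.concat([houses[["Area_in_sqft"]], dummies], axis="columns")
y = houses["Price_in_dollars"]

model = LinearRegression().fit(X, y)
coefficients = dict(zip(X.columns, model.coef_))

print(coefficients)      # roughly Area_in_sqft: 100, Delhi: 5000, Mumbai: 10000
print(model.intercept_)  # roughly 0, the offset of the Pune baseline
```

So each dummy coefficient is the price difference between that city and the dropped city (Pune), for the same area.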

One Hot Encoding Example :

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

df = pd.read_csv(r"E:\Git_repo\One_Hot_Data.csv")


#To use One-Hot Encoder

Step1 : Use Label Encoding on the City Column

from sklearn.preprocessing import LabelEncoder

label_Enc=LabelEncoder()

#now apply this encoder to our original data frame

data_lbl = df

#data_lbl refers to the same object as df, so this changes the values in the original dataframe

data_lbl.City=label_Enc.fit_transform(data_lbl.City) #fit_transform takes the City column as input & returns an integer label for each value

data_lbl

output :
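To see what LabelEncoder does on its own, here is a small sketch on a made-up list of cities. The fitted `classes_` attribute shows which city got which number: the classes are sorted, so the order is alphabetical.

```python
from sklearn.preprocessing import LabelEncoder

cities = ["Pune", "Delhi", "Mumbai", "Delhi"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ holds the sorted categories; a city's position is its code.
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the codes back to the city names.
print(list(encoder.inverse_transform(codes)))
```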

X=data_lbl[['City','Area_in_sqft']].values #.values returns a 2D NumPy array instead of a DataFrame

Y=data_lbl['Price_in_dollars']

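#now we have to create dummy variable columns here so we will use sklearn

from sklearn.preprocessing import OneHotEncoder

from sklearn.compose import ColumnTransformer

#the categorical_features argument was removed from OneHotEncoder in later scikit-learn releases, so ColumnTransformer selects the column instead

one_hot=ColumnTransformer([('city',OneHotEncoder(),[0])],remainder='passthrough',sparse_threshold=0)

#the 0th column of the X I am supplying is my categorical feature; the remaining column is passed through after the encoded columns

X=one_hot.fit_transform(X) #sparse_threshold=0 makes this return a dense array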

#now to avoid the Dummy Variable Trap I am going to drop one column

X=X[:,1:] #Take all the rows, drop the 0th column

output :
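As a variation on these steps, recent scikit-learn versions let OneHotEncoder take the city names directly and drop the first category itself via `drop="first"`, so the LabelEncoder step and the manual slicing are not needed. This is only a sketch on a made-up DataFrame, using `ColumnTransformer` to pick the City column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows with the same column names as the article's data.
houses = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Pune", "Mumbai"],
    "Area_in_sqft": [1000, 1200, 900, 2000],
})

encoder = ColumnTransformer(
    [("city", OneHotEncoder(drop="first"), ["City"])],  # encode City, drop Delhi
    remainder="passthrough",                            # keep Area_in_sqft as-is
    sparse_threshold=0,                                 # always return a dense array
)

# Resulting columns: [Mumbai, Pune, Area_in_sqft]
X = encoder.fit_transform(houses)
print(X)
```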

#Now Train

from sklearn.linear_model import LinearRegression

model=LinearRegression()

model.fit(X,Y)

output :

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

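model.predict([[1 , 0 , 2800]]) #[Mumbai,Pune,'Area_in_sqft'] : LabelEncoder orders the cities alphabetically & the Delhi column was dropped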

output :

array([13050224.95535457])

model.predict([[0 , 0 , 3100]]) #For Delhi put 0 for both Mumbai & Pune

output :

array([14667200.76041244])

This is One-Hot Encoding using the OneHotEncoder from sklearn.preprocessing.

Now compare the predicted values from both approaches, Dummy Variable & One-Hot Encoding.
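If you want to compare the two approaches directly, here is a sketch that fits both on the same made-up data (not the article's CSV) and predicts the same house. The pandas route drops Pune and the sklearn route drops Delhi, as in the two examples; with an intercept in the model, the choice of dropped city should not change the prediction.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Made-up data with the same column names as the article's CSV.
houses = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Pune", "Pune"],
    "Area_in_sqft": [1000, 1500, 1200, 2000, 900, 1800],
    "Price_in_dollars": [110000, 150000, 135000, 205000, 88000, 183000],
})
y = houses["Price_in_dollars"]
area = houses[["Area_in_sqft"]].to_numpy()

# Route 1: pandas dummies with Pune dropped -> columns [Area, Delhi, Mumbai]
dummies = pd.get_dummies(houses["City"], dtype=float).drop(columns="Pune")
X_dummy = np.hstack([area, dummies.to_numpy()])
model_dummy = LinearRegression().fit(X_dummy, y)

# Route 2: OneHotEncoder with Delhi dropped -> columns [Mumbai, Pune, Area]
encoded = OneHotEncoder(drop="first").fit_transform(houses[["City"]]).toarray()
X_onehot = np.hstack([encoded, area])
model_onehot = LinearRegression().fit(X_onehot, y)

# Same house in both encodings: Mumbai, 2600 sqft
pred_dummy = model_dummy.predict([[2600, 0, 1]])
pred_onehot = model_onehot.predict([[1, 0, 2600]])

print(pred_dummy, pred_onehot)
```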

Below I have shared my Jupyter Notebook file with the same code as above, in case you need it.

Here is my .csv file :

https://github.com/Vegadhardik7/Git_Prac_Repo/blob/master/One_Hot_Data.csv

Here is my code file :

https://github.com/Vegadhardik7/Git_Prac_Repo/blob/master/One-Hot_Encoding.ipynb



InfinityCodeX