Categorical, Dummy Variables And One-Hot Encoding
So today's topic is Categorical Variables, Dummy Variables
& One-Hot Encoding. Let us understand these with examples using the pandas & sklearn
libraries.
Categorical Variables :
Categorical Variables are variables whose values fall into a fixed set of categories. In
a Categorical Variable there is no order among the categories.
Example : Category
is Eye Color & the Categorical Variable takes the values Black, Blue, Brown, Grey.
Now take another example : Category is House_Price
& the values are Low, Medium & High. Here the values have a natural order, so this
is not a plain Categorical Variable, it is an Ordinal Variable.
For the purpose of Data Analysis we could assign a
number to each value of our variable, like :
CODE | GENDER
0 | MALE
1 | FEMALE
2 | OTHER
Assigning values to the variable like this makes it easier to analyze & manipulate
the data, but it doesn’t change the fact that the variable is still categorical.
Even though we can order these numbers, & even though they might look like another
type of variable called quantitative (variables on which we can perform mathematical
operations such as 1+2+3 = 6), the variable is still a Categorical Variable.
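To make this concrete, here is a small sketch using pandas. The sample values are made up for illustration; only the MALE/FEMALE/OTHER coding comes from the example above.

```python
import pandas as pd

# Hypothetical sample data, made up for this illustration
gender = pd.Series(["MALE", "FEMALE", "OTHER", "FEMALE"])

# Fix the category order so that MALE -> 0, FEMALE -> 1, OTHER -> 2
gender_cat = pd.Categorical(gender, categories=["MALE", "FEMALE", "OTHER"])
codes = pd.Series(gender_cat.codes)

print(codes.tolist())  # integer codes standing in for the labels

# pandas will happily compute a mean of the codes,
# but the result says nothing meaningful about gender
print(codes.mean())
```

The codes are just labels, so arithmetic on them runs without error but should not be interpreted.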
Dummy Variables :
In many situations
we must work with categorical independent variables.
In regression
analysis we encode these as Dummy Variables, also called Indicator Variables.
For a variable with n categories we need
n-1 Dummy Variables.
Eg1 : We
have Convent School & Non Convent School. There are 2 categories, so 2 - 1 =
1 Dummy Variable, which takes the values 0 & 1.
Eg2 : We
have North/South/East/West. These are 4 categories, so 4 - 1 = 3 Dummy
Variables. Each Dummy Variable is 1 for exactly one region, & the remaining region is coded as all 0's.
Here is the tabular representation of the North/South/East/West example.
Q.) Let's say we are looking at housing
data across these 4 regions. Now the question is : how can we apply Dummy Variables
to them?
REGION | MUMBAI | PUNE | DELHI
North | 1 | 0 | 0
South | 0 | 1 | 0
East | 0 | 0 | 1
West | 0 | 0 | 0
Above we have 4 categories, so we need 4-1=3 Dummy
Variables. We put Mumbai, Pune & Delhi on the top; those are our 3 Dummy
Variables. North, the 1st region, is represented with Mumbai as 1 & Pune, Delhi as 0.
South, the 2nd region, has Pune as 1 & Mumbai, Delhi as 0.
East, the 3rd region, has Delhi as 1 & Mumbai, Pune as 0.
For West, the 4th region, Mumbai, Pune & Delhi are all 0, so West does not need
a column of its own; it acts as the baseline. This is how we
represent 4 Categories with 3 Dummy Variables.
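The same coding can be sketched in a few lines of plain Python. The helper name `dummy_code` is hypothetical (it is not a pandas or sklearn function), and treating West as the left-out baseline simply follows the table above.

```python
def dummy_code(value, categories, baseline):
    """Return n-1 indicator values for `value`, leaving out the baseline category."""
    kept = [c for c in categories if c != baseline]
    return [1 if value == c else 0 for c in kept]


regions = ["North", "South", "East", "West"]

# Print one row per region: three 0/1 values, West gets all zeros
for region in regions:
    print(region, dummy_code(region, regions, baseline="West"))
```

Any category could serve as the baseline; the choice only changes which region ends up as the all-zeros row.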
One-Hot Encoding :
One-Hot Encoding transforms our Categorical Variables into
vectors of 0’s & 1’s. The length of these vectors is equal to the number of
classes or categories that our model is expected to classify. So if we are
classifying whether images are of a Horse or a Donkey, then the One-Hot
Encoded vectors corresponding to these classes would be of length 2, since
there are 2 categories in total. If we add another category such as Zebra, so that we
classify whether images are of a Horse, Donkey or Zebra, then the
corresponding One-Hot Encoded vectors would each be of length 3, since we
now have 3 categories. You can relate this to the above example, where we took
Mumbai, Pune & Delhi & put a 1 under one of them as we went down the regions.
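Here is a minimal sketch of the Horse/Donkey/Zebra idea with NumPy. The helper `one_hot_vectors` is a made-up name for this illustration; it just picks rows out of an identity matrix.

```python
import numpy as np


def one_hot_vectors(classes):
    """Map each class name to a 0/1 vector whose length is the number of classes."""
    identity = np.eye(len(classes), dtype=int)
    return {name: identity[i] for i, name in enumerate(classes)}


two = one_hot_vectors(["Horse", "Donkey"])
three = one_hot_vectors(["Horse", "Donkey", "Zebra"])

print(two["Donkey"])   # vector of length 2
print(three["Zebra"])  # vector of length 3
```

Adding a class makes every vector one element longer, which is exactly the length rule described above.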
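Difference between Dummy Variable And One-Hot
Encoding :
One-Hot Encoding &
Dummy Variables are two ways of encoding Categorical
Data, & they do nearly the same thing. The difference is that One-Hot Encoding keeps one column for each of the n categories, while Dummy Variable encoding keeps only n-1 columns & treats the left-out category as the baseline.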
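A quick way to see the two side by side is `pd.get_dummies` with and without `drop_first=True`. The city values below are made up for illustration; note that `drop_first` drops the first category in sorted order, which may not be the one you would pick as baseline yourself.

```python
import pandas as pd

# Made-up values, only for illustration
city = pd.Series(["Mumbai", "Pune", "Delhi", "Pune"], name="City")

# One column per category (n columns)
one_hot = pd.get_dummies(city, dtype=int)

# First category dropped (n-1 columns)
dummy = pd.get_dummies(city, dtype=int, drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
```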
Dummy Variable Encoding vs One-Hot
Encoding :
Dummy Variable Example :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv(r"E:\Git_repo\One_Hot_Data.csv")
df
output :
Step1 : Create Dummy Variables Columns
dummies = pd.get_dummies(df.City) #pd.get_dummies is a pandas function
dummies
output :
Step2 : Concatenate/Append these Dummy Variables to the Original DataSet
mr_data = pd.concat([df,dummies],axis="columns")
mr_data
output :
Step3 : Drop the City Column From mr_data, which is our merged dataset
final = mr_data.drop(['City'],axis='columns')
Step4 : Drop One of the Dummy Variable Columns because of the Dummy Variable Trap (I will
explain that concept in the next blog)
final_data = final.drop(['Pune'],axis='columns')
final_data
output :
#Note
: when we are using the sklearn linear regression model it will still fit even if you
don't drop it, because its least-squares solver can cope with the redundant column, but it's good
practice to drop it
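As a small preview of why one column is redundant, here is a sketch on made-up city values (not the rows from the .csv file): with all three dummy columns present, each row contains exactly one 1, so any one column can be rebuilt from the other two.

```python
import pandas as pd

# Made-up values, only for illustration
cities = pd.Series(["Delhi", "Mumbai", "Pune", "Mumbai", "Pune"])
dummies = pd.get_dummies(cities, dtype=int)

# Every row has exactly one 1, so the three columns always add up to 1
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())

# That means the Pune column carries no extra information
rebuilt_pune = 1 - dummies["Delhi"] - dummies["Mumbai"]
print((rebuilt_pune == dummies["Pune"]).all())
```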
from sklearn.linear_model import LinearRegression
model=LinearRegression()
#Now give X & Y for training
X=final_data[['Area_in_sqft','Delhi','Mumbai']]
Y=final_data['Price_in_dollars']
#Now train our model using fit
model.fit(X,Y)
output :
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
#Now predict
model.predict([[2600,0,1]]) #['Area_in_sqft',Delhi,Mumbai]
output :
array([11337668.71651687])
#For Pune Just put 0 at Delhi & Mumbai
model.predict([[3100,0,0]])
output :
array([14180478.15940978])
#To check how well the model fits the training data (for LinearRegression, score returns R^2, not classification accuracy)
model.score(X,Y)
output :
0.6453501530089485
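If you want to convince yourself what `score` measures, here is a small sketch on made-up numbers (not the data from the .csv file). It compares `score` with `r2_score` from `sklearn.metrics`; the name `demo_model` is only used here so it does not clash with `model` above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up numbers, only to have something to fit
X_demo = np.array([[1000.0], [1500.0], [2000.0], [2500.0], [3000.0]])
y_demo = np.array([52.0, 71.0, 104.0, 118.0, 155.0])

demo_model = LinearRegression().fit(X_demo, y_demo)

# score() and r2_score() should give the same value
score = demo_model.score(X_demo, y_demo)
r2 = r2_score(y_demo, demo_model.predict(X_demo))
print(score, r2)
```

Because the score is computed on the same rows the model was trained on, it tells you about fit, not about how well the model will do on new houses.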
This is Dummy Variable encoding using the pandas
get_dummies method.
One-Hot Encoding Example :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv(r"E:\Git_repo\One_Hot_Data.csv")
df
output :
#To use One-Hot Encoder
Step1 : Use Label Encoding on the City Column
from sklearn.preprocessing import LabelEncoder
label_Enc=LabelEncoder()
#now use this encoder on our original data frame
data_lbl = df
#now change the values in the original dataframe (data_lbl is the same object as df, not a copy)
data_lbl.City=label_Enc.fit_transform(data_lbl.City)
#fit_transform takes the City column as input & returns an integer label for each value
data_lbl
output :
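To see which number each city gets, here is a small sketch on a made-up list of cities (not the rows from the .csv file). `LabelEncoder` sorts the labels, so the codes follow alphabetical order rather than the order of appearance.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values, only for illustration
cities = ["Mumbai", "Pune", "Delhi", "Pune"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ lists the labels in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Knowing this mapping matters later, because it decides which city ends up in which one-hot column.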
X=data_lbl[['City','Area_in_sqft']].values #.values converts it into a 2D array instead of a DataFrame
Y=data_lbl['Price_in_dollars']
#now we have to create the dummy variable columns here, so we will use sklearn
from sklearn.preprocessing import OneHotEncoder
one_hot=OneHotEncoder(categorical_features=[0]) #Always specify categorical_features
#Note : categorical_features was removed in scikit-learn 0.22; on newer versions use ColumnTransformer instead
#whatever X I am supplying, the 0th column in that X is my categorical feature
X=one_hot.fit_transform(X).toarray()
#now to avoid the Dummy Variable Trap I am going to drop one column
X=X[:,1:] #Take all the rows, drop the 0th column
X
output :
#Now Train
model.fit(X,Y)
output :
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
model.predict([[1 , 0 , 2800]]) #[Mumbai, Pune, Area_in_sqft] (the dropped 0th column was Delhi)
output :
array([13050224.95535457])
model.predict([[0 , 0 , 3100]]) #0 at Mumbai & Pune means Delhi, the dropped column
output :
array([14667200.76041244])
This is One-Hot Encoding using sklearn's preprocessing OneHotEncoder.
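On newer scikit-learn versions the `categorical_features` argument is no longer available, so here is a hedged sketch of the same idea with `ColumnTransformer`. It assumes scikit-learn 0.21 or later (for `drop="first"`), and the rows in `demo` are made up; they only reuse the column names from this post.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Made-up rows that only mimic the column names used in this post
demo = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"],
    "Area_in_sqft": [2600, 2800, 3000, 3200, 3400, 3600],
    "Price_in_dollars": [550000, 600000, 620000, 680000, 700000, 740000],
})

encode_city = ColumnTransformer(
    transformers=[("city", OneHotEncoder(drop="first"), ["City"])],
    remainder="passthrough",  # keep Area_in_sqft as it is
    sparse_threshold=0,       # always return a dense array
)

# Resulting columns: Mumbai, Pune, Area_in_sqft (Delhi is the dropped category)
X_enc = encode_city.fit_transform(demo[["City", "Area_in_sqft"]])

demo_model = LinearRegression().fit(X_enc, demo["Price_in_dollars"])

# New rows go through the same fitted transformer before predicting
new_home = pd.DataFrame({"City": ["Mumbai"], "Area_in_sqft": [2800]})
prediction = demo_model.predict(encode_city.transform(new_home))
print(prediction)
```

Here the string column is encoded directly, so the separate LabelEncoder step and the manual `X[:,1:]` slice are not needed.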
Now compare the values from both of them, Dummy Variable & One-Hot Encoding.
Below I have given my Jupyter Notebook file with the same code as above, in case you need it.
Here is my .csv file :
https://github.com/Vegadhardik7/Git_Prac_Repo/blob/master/One_Hot_Data.csv
Here is my code file :
https://github.com/Vegadhardik7/Git_Prac_Repo/blob/master/One-Hot_Encoding.ipynb
Hey guys, do you like our content? Feel free to comment below & share it with your friends.