InfinityCodeX


Categorical, Dummy Variables And One-Hot Encoding



Today's topic is Categorical Variables, Dummy Variables & One-Hot Encoding. Let us understand them with examples using the pandas & sklearn libraries.

Categorical Variables : Categorical Variables are variables whose values fall into a fixed set of categories. In a plain (nominal) Categorical Variable the categories have no natural order.

Example : The variable is Eye Color & its categories are Black, Blue, Brown, Grey.
Now take House_Price as the variable, with the categories Low, Medium & High. These categories have a natural order, so this is not a plain Categorical Variable; it is an Ordinal Variable.
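The difference between the two kinds of variable can be made explicit in pandas. The snippet below is only a small sketch: the sample values are made up, and the order Low < Medium < High is an assumption written out in the code.

```python
import pandas as pd

# Hypothetical sample values; the category order is stated explicitly.
price_band = pd.Categorical(
    ["Low", "High", "Medium", "Low"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)
# Eye colors have no ranking, so the categorical is left unordered.
eye_color = pd.Categorical(["Brown", "Blue", "Black", "Brown"])

print(price_band.ordered)                   # True
print(eye_color.ordered)                    # False
print(price_band.min(), price_band.max())   # only meaningful for ordered data
```

Asking for the minimum or maximum only makes sense for the ordered (ordinal) variable.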

For the purpose of Data Analysis we can assign a number to each category, like this :


CODE    GENDER
0       MALE
1       FEMALE
2       OTHER

 Assigning numbers to the categories makes the data easier to analyze & manipulate, but it does not change the fact that the variable is still categorical. Even though the numbers can be ordered & look like another type of variable called quantitative (variables on which mathematical operations such as 1+2+3 = 6 make sense), the variable remains a Categorical Variable.
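As a rough illustration of this point, the sketch below applies the coding from the table to a few made-up rows. The DataFrame and its column names are assumptions used only for this example.

```python
import pandas as pd

# Hypothetical sample data using the 0/1/2 coding from the table above.
gender_codes = {"MALE": 0, "FEMALE": 1, "OTHER": 2}
people = pd.DataFrame({"Gender": ["MALE", "OTHER", "FEMALE", "MALE"]})
people["Gender_code"] = people["Gender"].map(gender_codes)

# The mean can be computed, but it has no meaning for a categorical variable.
mean_code = people["Gender_code"].mean()
# Counting rows per category is a summary that does make sense.
counts = people["Gender"].value_counts()

print(people)
print(mean_code)   # 0.75, a number that describes nothing real
print(counts)
```

The codes are just labels: pandas will happily average them, but the result says nothing about the data.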

Dummy Variables :

 In many situations we must work with categorical independent variables.
 In regression analysis these are represented by Dummy Variables, also called Indicator Variables.
A variable with n categories needs n-1 Dummy Variables.
         Eg1 : Convent School & Non Convent School are 2 categories, so 2 - 1 = 1 Dummy Variable, which takes the values 0 & 1.
         Eg2 : North/South/East/West are 4 categories, so 4 - 1 = 3 Dummy Variables. Each Dummy Variable is 1 for its own region & 0 otherwise; the one remaining region is coded as all 0s.

The North/South/East/West example can be written out as a table.

Q.) Suppose we are looking at housing data across these 4 regions. How can we apply Dummy Variables to them?


                
REGION   MUMBAI   PUNE   DELHI
North    1        0      0
South    0        1      0
East     0        0      1
West     0        0      0


We have 4 categories, so we need 4-1=3 Dummy Variables. These are the columns Mumbai, Pune & Delhi at the top of the table. North, the 1st region, is coded with Mumbai = 1 & Pune, Delhi = 0. South, the 2nd region, is coded with Pune = 1 & Mumbai, Delhi = 0. East, the 3rd region, is coded with Delhi = 1 & Mumbai, Pune = 0. West, the 4th region, has Mumbai, Pune & Delhi all equal to 0, so West needs no column of its own: it is the baseline. This is how 4 Categories are represented with 3 Dummy Variables.
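The same coding can be sketched in pandas. Note one assumption: the dummy columns here are named after the regions themselves rather than Mumbai, Pune & Delhi as in the table, and the sample Series is made up for the example.

```python
import pandas as pd

# Hypothetical region values, one per row of housing data.
regions = pd.Series(["North", "South", "East", "West", "North"], name="Region")

# Build one 0/1 column per region, then drop West so it becomes
# the all-zero baseline, as in the table above.
all_indicators = pd.get_dummies(regions, dtype=int)
region_dummies = all_indicators.drop(columns="West")

print(region_dummies)
```

The row for West contains only 0s, while every other row has exactly one 1.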

One-Hot Encoding : One-Hot Encoding transforms our Categorical Variables into vectors of 0's & 1's. The length of these vectors is equal to the number of classes or categories that our model is expected to classify. If we are classifying whether images are of a Horse or a Donkey, the One-Hot Encoded vectors for these classes have length 2, since there are 2 categories in total. If we add another category such as Zebra, so that we classify images as Horse, Donkey or Zebra, the corresponding One-Hot Encoded vectors each have length 3, since we now have 3 categories. This is the same idea as the table above, where Mumbai, Pune & Delhi each get a 1 in turn as we go down the regions; the difference is that One-Hot Encoding keeps one position for every category, so no category is coded as all 0s.
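A minimal sketch of these vectors with NumPy follows. The helper `one_hot` is a hypothetical function written for this example, and the class order is an arbitrary choice.

```python
import numpy as np

classes = ["Horse", "Donkey", "Zebra"]  # class order is an arbitrary choice

def one_hot(label, classes):
    """Return a 0/1 vector with a single 1 at the position of `label`."""
    vector = np.zeros(len(classes), dtype=int)
    vector[classes.index(label)] = 1
    return vector

for name in classes:
    print(name, one_hot(name, classes))
```

Each vector has length 3 and contains exactly one 1, so adding a fourth class would make every vector one element longer.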

Difference between Dummy Variable And One Hot Encoding :


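One-Hot Encoding & Dummy Variables are two closely related ways of encoding Categorical Data: both turn the categories into 0/1 columns. The practical difference is the number of columns. One-Hot Encoding keeps one column per category (n columns), while Dummy Variable encoding drops one of them (n-1 columns) & treats the dropped category as the baseline.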
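The two encodings can be compared side by side with `pd.get_dummies`, whose `drop_first` option removes the first category. This is only a sketch on a made-up Series of city names.

```python
import pandas as pd

# Hypothetical city values for illustration.
cities = pd.Series(["Mumbai", "Pune", "Delhi", "Pune"], name="City")

# One column per category: n = 3 columns.
one_hot = pd.get_dummies(cities, dtype=int)
# First category (alphabetically, Delhi) dropped: n - 1 = 2 columns.
dummy = pd.get_dummies(cities, drop_first=True, dtype=int)

print(one_hot)
print(dummy)
```

In the second frame a Delhi row is the one with 0 in both remaining columns.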

Dummy Variable Encoding vs One Hot Encoding :


Dummy Variable Example :


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv(r"E:\Git_repo\One_Hot_Data.csv")
df

output :



Step1 : Create the Dummy Variable Columns
dummies = pd.get_dummies(df.City)      #pd.get_dummies is a pandas function that returns one indicator column per city
dummies

output :



Step2 : Concatenate/Append these Dummy Variables to the Original DataSet
mr_data = pd.concat([df,dummies],axis="columns")
mr_data

output :



Step3 : Drop the City column from mr_data, which is our merged dataset
final = mr_data.drop(['City'],axis='columns')
Step4 : Drop one of the Dummy Variable Columns because of the Dummy Variable Trap (I will explain that concept in the next blog)
final_data = final.drop(['Pune'],axis='columns')
final_data

output :



#Note : sklearn's LinearRegression will still fit even if you don't drop the column, because its least-squares solver can handle perfectly collinear columns, but dropping one column is good practice
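Why one column is redundant can be sketched with NumPy. The city labels below are made up for the example. With an intercept column present, the three city indicators always add up to that intercept column, so the design matrix has fewer independent columns than it appears to.

```python
import numpy as np

# Hypothetical city labels for six rows: 0 = Delhi, 1 = Mumbai, 2 = Pune.
city = np.array([0, 1, 2, 0, 1, 2])
indicators = np.eye(3, dtype=int)[city]            # one 0/1 column per city
intercept = np.ones((len(city), 1), dtype=int)

with_all = np.hstack([intercept, indicators])          # intercept + 3 dummies
with_drop = np.hstack([intercept, indicators[:, :2]])  # intercept + 2 dummies

# Rank lower than the column count means the columns are linearly dependent.
print(np.linalg.matrix_rank(with_all), with_all.shape[1])
print(np.linalg.matrix_rank(with_drop), with_drop.shape[1])
```

After dropping one dummy column the rank equals the number of columns, so each coefficient is uniquely determined.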

from sklearn.linear_model import LinearRegression
model=LinearRegression()

#Now give X & Y for training

X=final_data[['Area_in_sqft','Delhi','Mumbai']]
Y=final_data['Price_in_dollars']

#Now train our model using fit
model.fit(X,Y)

output :

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

#Now predict
model.predict([[2600,0,1]])   #['Area_in_sqft',Delhi,Mumbai]

output :

array([11337668.71651687])

#For Pune Just put 0 at Delhi & Mumbai
model.predict([[3100,0,0]])

output :

array([14180478.15940978])
 
#To check how well the model fits: score returns R² (coefficient of determination), computed here on the training data
model.score(X,Y)

output :

0.6453501530089485
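For a regression model, `score` returns the R² value rather than a classification accuracy. The sketch below recomputes it by hand on made-up toy data (not the CSV used in this post) to show what the number means.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, not the CSV used in this post.
X_toy = np.array([[1000], [1500], [2000], [2500], [3000]])
y_toy = np.array([210, 290, 405, 480, 620])

toy_model = LinearRegression().fit(X_toy, y_toy)
predicted = toy_model.predict(X_toy)

ss_res = np.sum((y_toy - predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Both numbers should agree up to floating point error.
print(r2_manual, toy_model.score(X_toy, y_toy))
```

A value close to 1 means the model explains most of the variation in the target. Scoring on the training data, as done above, tends to be optimistic.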

This is Dummy Variable encoding using the pandas get_dummies method.


One Hot Encoding Example :


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv(r"E:\Git_repo\One_Hot_Data.csv")
df

output :



#To use One-Hot Encoder
Step1 : Use Label Encoding on the City Column

from sklearn.preprocessing import LabelEncoder
label_Enc=LabelEncoder()

#data_lbl is another name for df (not a copy), so the original DataFrame is changed too
data_lbl = df
#Replace the city names with integer labels
data_lbl.City=label_Enc.fit_transform(data_lbl.City) #fit_transform takes the City column as input & returns one integer label per row
data_lbl

output :



X=data_lbl[['City','Area_in_sqft']].values     #.values converts the DataFrame into a 2D NumPy array
Y=data_lbl['Price_in_dollars']

#Now create the dummy variable columns with sklearn
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#OneHotEncoder(categorical_features=[0]) only works on old scikit-learn releases (the argument was removed in 0.22),
#so ColumnTransformer is used here to say that column 0 of X is the categorical feature
one_hot=ColumnTransformer([('city',OneHotEncoder(),[0])],remainder='passthrough',sparse_threshold=0)
X=one_hot.fit_transform(X)     #encoded City columns come first, followed by Area_in_sqft
#To avoid the Dummy Variable Trap, drop one dummy column
X=X[:,1:] #Keep all rows, drop the 0th column
X

output :



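#Now train (LinearRegression is created again here so that this example runs on its own)
from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(X,Y)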

output :

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

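model.predict([[1 , 0 , 2800]])   #[Mumbai, Pune, Area_in_sqft]; LabelEncoder sorts the cities alphabetically, so the dropped 0th column is Delhi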

output :

array([13050224.95535457])

model.predict([[0 , 0 , 3100]])   #0 at Mumbai & Pune means the dropped city, Delhi

output :

array([14667200.76041244])
 

This is One-Hot Encoding using the sklearn preprocessing OneHotEncoder.
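On recent scikit-learn versions the label encoding, one-hot encoding and column dropping can be combined in a single pipeline. The following is only a sketch under stated assumptions: the rows are made up (they reuse the column names from this post but are not the real CSV), and `drop="first"` is used to remove one dummy column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows that reuse the column names from this post; not the real CSV.
toy = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"],
    "Area_in_sqft": [2600, 2800, 3100, 3000, 3200, 2700],
    "Price_in_dollars": [550000, 620000, 600000, 610000, 690000, 540000],
})

encode_city = ColumnTransformer(
    [("city", OneHotEncoder(drop="first"), ["City"])],  # drops the first city
    remainder="passthrough",   # keep Area_in_sqft unchanged
    sparse_threshold=0,        # always return a dense array
)
pipeline = make_pipeline(encode_city, LinearRegression())
pipeline.fit(toy[["City", "Area_in_sqft"]], toy["Price_in_dollars"])

# New rows are passed with the city name; the pipeline encodes it.
new_home = pd.DataFrame({"City": ["Pune"], "Area_in_sqft": [3100]})
print(pipeline.predict(new_home))
```

With this layout there is no need to remember which 0/1 position belongs to which city when predicting.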

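Now compare the predicted values from the two approaches, Dummy Variables & One Hot Encoding. Keep in mind that the dropped (baseline) city is not the same in both: Pune was dropped in the pandas example, while the sklearn example drops the 0th encoded column, which is Delhi.

Below are my Jupyter Notebook file with the same code as above and the data file, in case you need them.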

Here is my .csv file : 

https://github.com/Vegadhardik7/Git_Prac_Repo/blob/master/One_Hot_Data.csv

Here is my code file :

https://github.com/Vegadhardik7/Git_Prac_Repo/blob/master/One-Hot_Encoding.ipynb


Hey guys, do you like our content? Feel free to comment below & share it with your friends.
