Dummy Variable Trap
What is the Dummy Variable Trap?
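The Dummy Variable Trap is a scenario in which the independent variables are multicollinear because of the way categorical data is encoded: the dummy columns of one categorical variable always add up to 1 in every row, so any one of them can be computed exactly from the others.
Multicollinearity is a scenario in which 2 or more independent variables are highly correlated. It occurs when your model includes multiple factors that are correlated not just with your response variable, but also with each other. In other words, it results when you have factors that are redundant, i.e. they add no information beyond what the other factors already provide.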
In our previous blog (Data Science ss : 10.8) we took an example of regions with Mumbai, Pune and Delhi, where our primary dataset had the columns City, Area_in_sqft and Price_in_dollars. This dataset is transformed into dummy columns, which gives a matrix of 8 rows & 6 columns (excluding the index column).
Steps to Check for the Dummy Variable Trap :
In order to check whether the above dataset has the Dummy Variable Trap scenario, we will execute the following steps :
X = the above matrix, i.e. (8 rows & 6 columns)
(i) Take the above matrix as X. (Shape 8 x 6)
(ii) Get the transpose of X as X`. (Shape 6 x 8)
(iii) Calculate the matrix product of X` & X [X`X, or X.T.dot(X) using Python].
(iv) X`X will be a square matrix. (Shape 6 x 6)
(v) If the determinant of X`X is 0, then we have the Dummy Variable Trap scenario in our dataset.
(vi) Alternatively, if you observe that 2 columns of X are identical (or, more generally, that one column is a linear combination of the others), then the determinant of X`X will be 0 & hence we cannot calculate the inverse of X`X, written (X`X)^-1. This means we have hit the Dummy Variable Trap scenario.
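The steps above can be sketched in NumPy as follows. This is only an illustration: the 8 rows of data are made up (they are not the values from the previous blog), and the matrix here is intercept + 3 city dummies + area, so its shape is 8 x 5 rather than 8 x 6. Because a floating-point determinant is rarely exactly 0, the sketch checks the rank of X instead, which answers the same question: is some column a combination of the others?

```python
import numpy as np

# Hypothetical data, made up for illustration only.
cities = ["Mumbai", "Pune", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"]
area_in_sqft = np.array([1000, 1200, 900, 1500, 1100, 1300, 800, 1400], dtype=float)

# One dummy column per category (this is what leads to the trap).
categories = ["Mumbai", "Pune", "Delhi"]
dummies = np.array([[1.0 if city == cat else 0.0 for cat in categories]
                    for city in cities])

# Design matrix: intercept column + all 3 dummies + area.
intercept = np.ones((len(cities), 1))
X = np.hstack([intercept, dummies, area_in_sqft.reshape(-1, 1)])

# Steps (ii) to (iv): transpose, multiply, get a square matrix.
XtX = X.T.dot(X)

# Steps (v) and (vi): X`X is singular exactly when X has fewer
# independent columns than total columns.
rank = np.linalg.matrix_rank(X)
in_trap = rank < X.shape[1]

print("X shape:", X.shape)
print("X`X shape:", XtX.shape)
print("rank of X:", rank, "out of", X.shape[1], "columns")
print("Dummy Variable Trap:", in_trap)
```

Here the intercept column equals the sum of the three dummy columns, so the rank is one less than the number of columns.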
What is the Solution to the Dummy Variable Trap?
The solution for the Dummy Variable Trap is to drop one of the dummy columns of the categorical variable (or alternatively, drop the intercept constant). If there are M categories, use M-1 dummy columns in the regression model. The category left out can be thought of as the reference value, & the fitted coefficients of the remaining categories represent the change from this reference.
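A minimal sketch of both fixes, again on made-up data (the values and the choice of Mumbai as the reference category are assumptions for illustration). It compares the rank of the design matrix with all 3 dummies, with one dummy dropped, and with the intercept dropped instead.

```python
import numpy as np

# Hypothetical data, made up for illustration only.
cities = ["Mumbai", "Pune", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"]
area_in_sqft = np.array([1000, 1200, 900, 1500, 1100, 1300, 800, 1400], dtype=float)

categories = ["Mumbai", "Pune", "Delhi"]
dummies = np.array([[1.0 if city == cat else 0.0 for cat in categories]
                    for city in cities])
intercept = np.ones((len(cities), 1))
area = area_in_sqft.reshape(-1, 1)

# Trap: intercept + all M = 3 dummy columns.
X_trap = np.hstack([intercept, dummies, area])

# Fix 1: keep the intercept, use M - 1 = 2 dummies (Mumbai is the reference).
X_fixed = np.hstack([intercept, dummies[:, 1:], area])

# Fix 2: keep all 3 dummies, drop the intercept instead.
X_no_intercept = np.hstack([dummies, area])

for name, M in [("trap", X_trap), ("drop one dummy", X_fixed),
                ("drop intercept", X_no_intercept)]:
    rank = np.linalg.matrix_rank(M)
    print(name, "-> columns:", M.shape[1], "rank:", rank,
          "full rank:", rank == M.shape[1])
```

With either fix the matrix has full column rank, so X`X can be inverted.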
Mathematics behind the Dummy Variable Trap
In a multiple regression problem we want to create a function that maps the input data to outcome values. Each data point is a feature vector (X1, X2, X3, …, Xn) composed of 2 or more data values that capture various features of the input. In order to represent all of the input data along with the output values, we set up an input matrix X (one row per data point) & an output vector y.
In a simple least-squares linear regression model we seek a vector P such that the product XP most closely approximates the outcome vector y, i.e. the sum of squared differences between XP & y is as small as possible.
Once we have constructed the P vector we can use it to map input data to predicted outcomes. Given an input vector of the form
x = [1 X1 X2 … Xn]
we can compute a predicted outcome value
y = x.P = P0 + P1X1 + P2X2 + P3X3 + … + PnXn
The formula to compute the P vector is
P = (X`X)^-1 X`y
This formula needs the inverse of X`X, which is exactly what does not exist when we hit the Dummy Variable Trap.
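For readers who want to see where this formula comes from, here is a short sketch of the standard least-squares derivation. It writes the transpose X` as X^T; everything else uses the same symbols as above.

```latex
\begin{aligned}
\min_{P}\ & \lVert y - XP \rVert^{2} = (y - XP)^{\top}(y - XP) \\
\text{gradient w.r.t. } P:\ & -2\,X^{\top}(y - XP) = 0 \\
\Rightarrow\ & X^{\top}X\,P = X^{\top}y \\
\Rightarrow\ & P = (X^{\top}X)^{-1}X^{\top}y
  \quad \text{(only when } X^{\top}X \text{ is invertible)}
\end{aligned}
```

The last step is where the trap bites: with redundant dummy columns, X^T X has determinant 0 and the final line cannot be carried out.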
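In Python we calculate the above expression P as follows :
import numpy as np
# Solve (X`X) P = X`y for P; this avoids computing the inverse explicitly
P = np.linalg.solve(X.T.dot(X), X.T.dot(y))
Where : X.T is X`
a.dot(b) is the matrix (dot) product of a & b.
If X`X is singular, as it is in the Dummy Variable Trap, there is no unique solution & np.linalg.solve may raise a LinAlgError.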
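To see the formula working end to end, here is a small sketch on made-up data. The prices are generated from a known linear rule (base 50, +20 for Pune, -10 for Delhi, 0.1 per sqft); these numbers are assumptions chosen only so that the result can be checked, not values from the previous blog. One dummy (Mumbai) is dropped so that X`X is invertible.

```python
import numpy as np

# Hypothetical data, made up for illustration only.
cities = ["Mumbai", "Pune", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"]
area_in_sqft = np.array([1000, 1200, 900, 1500, 1100, 1300, 800, 1400], dtype=float)

# M - 1 dummies: Mumbai is the reference category.
is_pune = np.array([1.0 if c == "Pune" else 0.0 for c in cities])
is_delhi = np.array([1.0 if c == "Delhi" else 0.0 for c in cities])

# Prices built from an assumed rule, so the true coefficients are known.
true_P = np.array([50.0, 20.0, -10.0, 0.1])
X = np.column_stack([np.ones(len(cities)), is_pune, is_delhi, area_in_sqft])
y = X.dot(true_P)

# P = (X`X)^-1 X`y, computed by solving (X`X) P = X`y.
P = np.linalg.solve(X.T.dot(X), X.T.dot(y))
print("estimated P:", P)

# Predict for a new input: x = [1, Pune?, Delhi?, area]
x_new = np.array([1.0, 0.0, 1.0, 1000.0])  # Delhi, 1000 sqft
y_pred = x_new.dot(P)
print("predicted price:", y_pred)
```

Since the made-up prices follow the rule exactly, the estimated P should match the assumed coefficients up to floating-point error, and the Pune and Delhi coefficients read as changes relative to Mumbai.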
I covered the practical for this topic in my previous blog, so do check it out (Data Science ss : 10.8). I covered the example first because the main motive of my blogging site is to cover almost all the complex topics, and as I mentioned earlier, to be a Data Scientist there is no need to learn the complete math behind everything. I covered the Dummy Variable Trap problem here because there is a lot of confusion when we try to learn it.
Hey guys, if you liked our content, don't forget to comment below & share it with your friends.