
Dummy Variable Trap

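What is the Dummy Variable Trap?

The Dummy Variable Trap is a scenario in which the dummy columns created from a categorical variable are perfectly multicollinear: when every category gets its own dummy column and the model also has an intercept, one of these columns can be predicted exactly from the others.

Multicollinearity is a scenario in which 2 or more independent variables are highly correlated with each other.

Multicollinearity occurs when your model includes multiple factors that are correlated not just with your response variable, but also with each other. In other words, it results when some of your factors are redundant: they carry information that the other factors already provide.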

In our previous blog (Data Science ss : 10.8) we took an example of regions with Mumbai, Pune and Delhi, where our primary dataset had the columns City, Area_in_sqft and Price_in_dollars. Once the City column is transformed into dummy columns, the dataset becomes a matrix of 8 rows & 6 columns (excluding the index column).

Steps to check for the Dummy Variable Trap :

To check whether the above dataset falls into the Dummy Variable Trap scenario, we will execute the following steps :

X = the above matrix, i.e. (8 rows & 6 columns)

(i) Take the above matrix as X. (Shape 8 x 6)

(ii) Get the transpose of X as X`. (Shape 6 x 8)

(iii) Calculate the matrix product of X` & X [X`X, or X.T.dot(X) using Python].

(iv) X`X will be a square matrix. (Shape 6 x 6)

(v) If the determinant of X`X is 0 then we have the Dummy Variable Trap scenario in our dataset.

(vi) Alternatively, if you observe that one column of X is an exact linear combination of the others (for example, 2 columns are identical, or the dummy columns add up to the constant column), then the determinant of X`X will be 0 & hence we cannot calculate the inverse of X`X, i.e. (X`X)^-1. This means we have hit the Dummy Variable Trap scenario.
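The steps above can be sketched in NumPy as follows. This is only an illustration under assumptions: the 8 rows of city labels and areas are made up, and the design matrix here has a constant column, one dummy column per city and the area column, so its shape differs from the 8 x 6 matrix of the previous blog. Because a determinant computed in floating point is rarely exactly 0, the sketch also reports the rank of X, which is a more reliable check.

```python
import numpy as np

# Hypothetical 8-row dataset: city labels and areas are made up for illustration.
cities = ["Mumbai", "Pune", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"]
area = np.array([1.0, 1.2, 0.9, 1.5, 1.1, 1.3, 0.8, 1.4])  # thousands of sqft

# One dummy column per city (all three categories are kept on purpose).
categories = sorted(set(cities))  # ['Delhi', 'Mumbai', 'Pune']
dummies = np.array([[1.0 if c == cat else 0.0 for cat in categories]
                    for c in cities])

# Design matrix: constant column, the three dummy columns, then the area column.
X = np.column_stack([np.ones(len(cities)), dummies, area])

XtX = X.T.dot(X)                  # steps (ii) and (iii): transpose, then multiply
det = np.linalg.det(XtX)          # step (v): a value at or near 0 signals the trap
rank = np.linalg.matrix_rank(X)   # fewer than X.shape[1] means a redundant column

print("shape of X'X:", XtX.shape)
print("determinant of X'X:", det)
print("rank of X:", rank, "out of", X.shape[1], "columns")
```

Here the three dummy columns add up to the constant column, so the rank of X is one less than its number of columns.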

What is the Solution to the Dummy Variable Trap?

The solution to the Dummy Variable Trap is to drop one of the dummy columns (or alternatively, drop the intercept constant). If there are M categories, use M-1 dummy columns in the regression model. The category that is left out can be thought of as the reference value, & the fitted coefficients of the remaining categories represent the change from this reference.
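A minimal sketch of this fix with pandas, assuming made-up rows (only the column names City and Area_in_sqft come from the dataset described earlier). pd.get_dummies with drop_first=True keeps M-1 dummy columns, and the rank check confirms that no column is redundant once the intercept is added.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the article.
df = pd.DataFrame({
    "City": ["Mumbai", "Pune", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"],
    "Area_in_sqft": [1000, 1200, 900, 1500, 1100, 1300, 800, 1400],
})

# drop_first=True keeps M-1 dummy columns; the dropped category (Delhi, the
# first in sorted order) becomes the reference value.
encoded = pd.get_dummies(df, columns=["City"], drop_first=True, dtype=float)

# Add the intercept column and check that every column is now needed.
X = np.column_stack([np.ones(len(encoded)), encoded.to_numpy(dtype=float)])
full_rank = np.linalg.matrix_rank(X) == X.shape[1]

print(list(encoded.columns))
print("full column rank:", full_rank)
```

With Delhi as the reference, the coefficients fitted for City_Mumbai and City_Pune would be read as the difference from Delhi.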

Mathematics behind Dummy Variable Trap

In a multiple regression problem we want to create a function that maps the input data to outcome values. Each data point is a feature vector (X1, X2, X3, …, Xn) composed of 2 or more data values that capture various features of the input. To represent all of the input data along with the vector of output values, we set up an input matrix X & an output vector Y.

In a simple least-squares linear regression model we seek a vector P such that the product XP most closely approximates the outcome vector Y.

Once we have constructed the P vector we can use it to map input data to predicted outcomes. Suppose we are given an input vector of the form

X = [1 X₁ X₂ … X_n]

We can compute a predicted outcome value

y = X.P = P₀ + P₁X₁ + P₂X₂ + … + P_nX_n

The Formula to compute the P vector is

P = (X`X)^-1X`y
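To connect this formula back to the trap, here is a short worked argument, written with X^T for the transpose X`. It assumes, purely for illustration, that the columns of X are ordered as the constant column, the three city dummy columns d_M, d_P, d_D (Mumbai, Pune, Delhi) and the area column a.

```latex
% Assumed column order of X: [1, d_M, d_P, d_D, a]
% Every row belongs to exactly one city, so the dummy columns sum to the constant column.
\[
d_M + d_P + d_D = \mathbf{1}
\quad\Longrightarrow\quad
Xc = 0 \ \text{ for } \ c = (1, -1, -1, -1, 0)^{\top} \neq 0 .
\]
\[
X^{\top}X\,c = X^{\top}(Xc) = 0
\quad\Longrightarrow\quad
\det(X^{\top}X) = 0 ,
\]
% A singular X^T X has no inverse, so the formula for P cannot be evaluated.
so $(X^{\top}X)^{-1}$ does not exist and $P = (X^{\top}X)^{-1}X^{\top}y$ is not uniquely defined.
```

Dropping one dummy column (or the constant column) removes this exact dependency, which is why the determinant is no longer forced to 0.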

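In Python we calculate P as follows. np.linalg.solve solves the system (X`X)P = X`y directly instead of forming the inverse, & it fails or gives unstable results when X`X is singular, which is exactly the Dummy Variable Trap case :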

import numpy as np

np.linalg.solve(X.T.dot(X) , X.T.dot(y))

Where : X.T is X`

a.dot(b) is the matrix product of a & b (the dot product when both are vectors).
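As a small end-to-end sketch of the expression above, the following fits P on made-up data and then predicts for a new input vector. The feature values and the outcome rule y = 2 + 3*X1 are assumptions chosen only so that the result is easy to verify by eye; np.linalg.lstsq is used as a cross-check.

```python
import numpy as np

# Hypothetical inputs: one feature column plus a leading column of 1s.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x1), x1])
y = 2.0 + 3.0 * x1  # made-up outcomes that lie exactly on a line

# Solve the normal equations (X'X) P = X'y.
P = np.linalg.solve(X.T.dot(X), X.T.dot(y))

# Cross-check with NumPy's least-squares routine.
P_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict for a new input vector [1, X1].
x_new = np.array([1.0, 6.0])
y_pred = x_new.dot(P)

print("P:", P)
print("prediction:", y_pred)
```

Since the made-up outcomes lie exactly on a line, P should come out close to [2, 3] up to floating-point rounding.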

I covered the practical for this topic in my previous blog, so do check it out (Data Science ss : 10.8). I covered the example first because the main motive of my blogging site is to cover almost all the complex topics and, as I mentioned earlier, to be a Data Scientist there is no need to learn the complete math behind everything. I covered the Dummy Variable Trap here because there is a lot of confusion when we try to learn it.

Hey guys, If you liked our content don't forget to comment below & share it with your friends.
