TOP 10 JUTSUS OF FEATURE ENGINEERING EVERY DATA SCIENCE NINJA SHOULD KNOW RIGHT NOW
If you want to be a successful data scientist and make your models predict the most accurate results, then the following article is for you.
What is Feature Engineering?
As per Wikipedia, feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.
Okay, so we saw what Wikipedia says about it, but we still want to know more. So let’s divide this term into two parts for better understanding, i.e. Feature and Engineering.
What are the Features in a dataset?
Basically, all machine learning algorithms use some input data (independent data) to generate output. These input data are features, which are usually in the form of structured columns.
So, why do we engineer it?
To get the most accurate, or I should say the most precise, output from our machine learning algorithm, we need to give it the cleanest data possible, in a form that is compatible with the algorithm.
So the process of preparing the proper input dataset for our machine learning model is known as Feature Engineering.
Link of the complete survey by Forbes.
Data scientists spend 60% of their time cleaning and organizing data. Collecting data sets come second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis.
76% of data scientists view data preparation as the least enjoyable part of their work.
57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work and 19% say this about collecting data sets.
List of Techniques you can find in this blog:
1.) Techniques of Imputation of numerical and categorical data
2.) Dealing with outliers
3.) Binning
4.) Techniques of dealing with Gaussian-Distribution / Skewness
5.) OneHotEncoding and OrdinalEncoder
6.) Feature splitting & extraction
7.) Group By
8.) Concat, Merge, Join
9.) Scaling
10.) Extracting Date
1.) Techniques of Imputation of numerical and categorical data
In this step we deal with the missing values that are present in our dataset.
The simplest way to deal with missing values is to drop the affected row or column; for example, if more than 75% of the values are missing, you can usually drop that row or column.
threshold = 0.75
# Columns: keep columns where less than 75% of the values are missing
data = data[data.columns[data.isnull().mean() < threshold]]
#------------------------------------------------------------------------
# Rows: keep rows where less than 75% of the values are missing
data = data.loc[data.isnull().mean(axis=1) < threshold]
# fill NaN values with 0
data = data.fillna(0)
#------------------------------------------------
# fill NaN values with mean
data = data.fillna(data.mean())
#------------------------------------------------
# fill NaN values with median: Better than Mean
data = data.fillna(data.median())
#------------------------------------------------
# fill NaN values with mode (mode() returns a DataFrame, so take its first row)
data = data.fillna(data.mode().iloc[0])
(i) Simple Imputer
-A Simple Imputer is univariate, which means it takes only a single feature into account.
# using simple imputer
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median') # can be 'mean', 'median', 'most_frequent' or 'constant'
imputer = imputer.fit(x)
data = imputer.transform(x)
print('Imputed Data:',data)
-What if the feature “x” that has NaN values is very well correlated with features such as “y” or “z”? Let’s say people with a higher “Age” pay more “Rent” and people with a lower “Age” pay less “Rent”. That suggests a multivariate approach, which simply means taking multiple features into account.
-So the Iterative Imputer and the KNN Imputer come into the picture.
(ii) Iterative Imputer
-In the iterative imputer, sklearn trains a regression model on all the rows in which “Age” is not missing.
-The other features, such as “Gender” and “Rent”, are treated as independent input features and “Age” is treated as the target output feature.
-Then, for all rows in which “Age” is missing, it predicts “Age” by feeding the “Gender” and “Rent” values to the trained model.
# IterativeImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer_it = IterativeImputer()
imputer_it.fit_transform(x)
NOTE: The model used internally by the iterative imputer is totally independent of the model you will later train on this dataset.
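To make the note above concrete, here is a minimal sketch (the column names and values below are invented for illustration): the estimator you pass to IterativeImputer only drives the imputation itself, and you are free to train a completely different model on the imputed output afterwards.
# Minimal sketch: the imputer's internal estimator is separate from your final model
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

x = pd.DataFrame({
    'Gender': [0, 1, 0, 1, 0],
    'Rent':   [12000, 8000, 15000, 9000, 11000],
    'Age':    [35, np.nan, 42, 24, np.nan],
})

# BayesianRidge is the default internal estimator; any sklearn regressor could be used here
imputer_it = IterativeImputer(estimator=BayesianRidge(), random_state=0)
x_imputed = imputer_it.fit_transform(x)
print(x_imputed)
# Any unrelated model (e.g. a tree-based classifier) can now be trained on x_imputed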
(iii) KNN Imputer
-Let’s pretend we have a row in which “Age” is missing.
-Then sklearn finds the 2 most similar rows, measured by how close their “Gender” and “Rent” values are to this row, and fills in “Age” from those neighbours.
# KNNImputer
from sklearn.impute import KNNImputer
imputer_knn = KNNImputer(n_neighbors=2)
imputer_knn.fit_transform(x)
Dealing with categorical / nominal data with the help of the mode:
data['categorical_col'] = data['categorical_col'].fillna(data['categorical_col'].mode()[0])
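As an alternative sketch (assuming a DataFrame with a categorical column, as above), SimpleImputer with strategy='most_frequent' performs the same mode-based fill and fits neatly into scikit-learn pipelines:
# Mode-based imputation of a categorical column with SimpleImputer
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame({'categorical_col': ['A', 'B', None, 'A', None]})
cat_imputer = SimpleImputer(strategy='most_frequent')  # fills missing entries with the most frequent value
data[['categorical_col']] = cat_imputer.fit_transform(data[['categorical_col']])
print(data)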
2.) Dealing with outliers
Detecting outliers:
Using Z-score:
Data that fall outside of the 3rd standard deviation are considered outliers. Using the Z-score, if a value falls outside of μ + 3σ or μ − 3σ, then we consider it an outlier.
import numpy as np

box = [10, 20, 15, 25, 11, 12, 16, 26, 19, 29, 3000]
outliers = []

def detect_outliers(data):
    threshold = 3
    mean = np.mean(data)
    std = np.std(data)
    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers

out = detect_outliers(box)
out
Using IQR:
Data points that fall more than 1.5 times the Inter Quartile Range (IQR) below the 1st quartile or above the 3rd quartile are considered outliers.
import numpy as np
box = [10,20,15,25,11,12,16,26,19,29,3000]
box = sorted(box)
Q1,Q3 = np.percentile(box, [25,75])
print(Q1,Q3)
iqr = Q3 - Q1
print(iqr)
# Lower bound and higher bound values
lower_bound_val = Q1 - (1.5*iqr)
higher_bound_val = Q3 + (1.5*iqr)
print(lower_bound_val,higher_bound_val)
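As a small follow-up (reusing the box list and the bounds computed above), the values outside the IQR bounds can then be filtered out:
# Keep only the values inside the IQR-based bounds; 3000 is dropped here
box_no_outliers = [value for value in box if lower_bound_val <= value <= higher_bound_val]
print(box_no_outliers)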
Using Box-plot:
The values which lie beyond the Max or Min whiskers are considered as outliers.
import numpy as np
import matplotlib.pyplot as plt

for feature in data:
    dataset = data.copy()
    if 0 in dataset[feature].unique():
        pass
    else:
        dataset[feature] = np.log(dataset[feature])
        dataset.boxplot(column=feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()
# Three independent ways of trimming outliers (each snippet assumes its own data):

# (a) Trim using mean ± 3 standard deviations
fact = 3
upper_bound_val = data['column'].mean() + data['column'].std() * fact
lower_bound_val = data['column'].mean() - data['column'].std() * fact
data = data[(data['column'] < upper_bound_val) & (data['column'] > lower_bound_val)]

# (b) Trim using the Z-score across every numerical column
data = pd.DataFrame(np.random.randn(500, 4))
from scipy import stats
data = data[(np.abs(stats.zscore(data)) < 3).all(axis=1)]

# (c) Trim using percentiles (drop the bottom and top 5%)
upper_bound_val = data['column'].quantile(0.95)
lower_bound_val = data['column'].quantile(0.05)
data = data[(data['column'] < upper_bound_val) & (data['column'] > lower_bound_val)]
3.) Binning
-Binning can be applied to both categorical and numerical data.
-The main reason behind binning is to make the model more robust and prevent overfitting; however, it has a performance cost.
# Numerical Binning Example
Value       Bin
0-30    =>  Fail
31-70   =>  Average
71-100  =>  Excellent
# Categorical Binning Example
Value       Bin
Mumbai  =>  Maharashtra
Pune    =>  Maharashtra
Bikaner =>  Rajasthan
Jaipur  =>  Rajasthan
Numerical Binning Example:
data['bin'] = pd.cut(data['value'], bins=[0,30,70,100], labels=["Fail", "Average", "Excellent"])
value | bin
0 | 2 | Fail
1 | 45 | Average
2 | 7 | Fail
3 | 85 | Excellent
4 | 28 | Fail
Categorical Binning Example:
conditions = [
    data['City'].str.contains('Mumbai'),
    data['City'].str.contains('Pune'),
    data['City'].str.contains('Bikaner'),
    data['City'].str.contains('Jaipur')]
choices = ['Maharashtra', 'Maharashtra', 'Rajasthan', 'Rajasthan']
data['State'] = np.select(conditions, choices, default='Other')
City | State
0 | Mumbai | Maharashtra
1 | Pune | Maharashtra
2 | Bikaner | Rajasthan
3 | Delhi | Other
4 | Jaipur | Rajasthan
4.) Techniques of dealing with Gaussian-Distribution / Skewness
-Helps to handle skewed data; after transformation, the distribution becomes a closer approximation to the normal distribution.
-Decreases the effect of outliers due to the normalization of magnitude differences, and the model becomes more robust.
(i) Log Transformation:
Log transformation is a data transformation method in which each value x is replaced by log(x) with base 10, base 2, or the natural log.
data['log_column'] = np.log(data['column'] + 1)  # +1 avoids taking the log of zero
(ii) Reciprocal Transformation:
In the Reciprocal Transformation, x is replaced by its inverse (1/x). The reciprocal transformation has little effect on the shape of the distribution, and it can only be used for non-zero values. It also reverses the order of the values: large values become small and vice versa.
data['Reciprocal_column'] = 1/(data['column'] + 1)  # +1 avoids division by zero
(iii) Square-Root Transformation:
This transformation has a moderate effect on the distribution. The main advantage of the square-root transformation is that it can be applied to zero values. Here x is replaced by sqrt(x). It is weaker than the Log Transformation.
data['sqr_column'] = data['column']**(1/2)
5.) OneHotEncoding and OrdinalEncoder
Our computer systems only understand numerical data, so categorical data makes no sense to them unless it is converted into numerical data; for that we use techniques such as OneHotEncoding and OrdinalEncoder.
# For a single column
encoded_col = pd.get_dummies(data['column'], drop_first=True)
# For multiple columns
encoded_multi_col = pd.get_dummies(data, columns=['column1', 'column2', 'columnN'], drop_first=True)
For unordered (Nominal) data we will be using OneHotEncoding.
Example: Male, Female
# OneHotEncoding is for unordered (Nominal) data
# Example: Male, Female
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
ohe.fit_transform(data[['column']])
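If you want the encoded array back as a labelled DataFrame, here is a small sketch (the Male/Female data is assumed, matching the example above) that uses the categories learned by the encoder as column names:
# Turn the one-hot encoded array into a readable DataFrame
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'column': ['Male', 'Female', 'Female', 'Male']})
ohe = OneHotEncoder()
encoded = ohe.fit_transform(data[['column']]).toarray()  # default output is a sparse matrix
encoded_df = pd.DataFrame(encoded, columns=ohe.categories_[0])
print(encoded_df)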
For ordered (Ordinal) data we will be using OrdinalEncoder.
Example: First, Second, Third
# OrdinalEncoder is for ordered (Ordinal) data
# Example: First, Second, Third
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['First', 'Second', 'Third'],
['O','A','B','C']])
oe.fit_transform(data[['Rank', 'Grade']])
6.) Feature splitting & extraction
-By extracting the usable part of a column into a new feature:
*We enable the machine learning algorithms to comprehend it.
*We make it possible to bin and group the data.
*We improve model performance by uncovering potential information.
Splitting a feature is a good option; however, there is no single way of splitting features. How to split a column depends on its characteristics.
Feature Splitting:
import pandas as pd
import numpy as np
data = [('Arnold Schwarzenegger','M'),
('Natasha Romanova','F'),
('Sylvester Stallone','M'),
('Gal Gadot','F'),
('Dwayne Johnson','M')]
df = pd.DataFrame(data, columns=['name', 'gender'])
df
# Extracting First Name
df.name.str.split(" ").map(lambda X : X[0])
# Extracting Last Name
df.name.str.split(" ").map(lambda X : X[-1])
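To keep these pieces as new features (a small sketch using the same df defined above), assign them to new columns:
# Store the split parts as new features on the same DataFrame
df['first_name'] = df['name'].str.split(' ').str[0]
df['last_name'] = df['name'].str.split(' ').str[-1]
df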
Feature Extraction:
weather_data = [('1/1/2017', 32, 6, 'Rain'),
('1/2/2017', 30, 7, 'Sunny'),
('1/3/2017', 32, 2, 'Snow'),
('1/4/2017', 34, 6, 'Snow'),
('1/5/2017', 32, 4, 'Rain'),
('1/6/2017', 32, 2, 'Sunny')
]
df = pd.DataFrame(weather_data, columns=['day', 'temp', 'windspeed', 'weather'])
df
# Days on which it rained
df['day'][df['weather']=='Rain']
# The day with the maximum temperature
df.day[df.temp == df.temp.max()]
7.) Group By
The groupby() function is used to split the data into groups based on some criteria. Pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.
weather_data = [('1/1/2017', 'Mumbai' , 32, 6, 'Rain'),
('1/2/2017', 'Pune' , 30, 7, 'Sunny'),
('1/3/2017', 'Mumbai' , 32, 2, 'Snow'),
('1/4/2017', 'Pune' , 34, 6, 'Snow'),
('1/5/2017', 'Mumbai' , 32, 4, 'Rain'),
('1/6/2017', 'Delhi' , 32, 2, 'Sunny')
]
data = pd.DataFrame(weather_data, columns=['day', 'city', 'temp', 'windspeed', 'weather'])
data
grp_city = data.groupby('city')
for city, city_data in grp_city:
    print(city)
    print(city_data)
# Get specific group
grp_city.get_group('Mumbai')
print(grp_city.max())
print(grp_city.mean())
print(grp_city.describe())
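A further sketch (reusing grp_city from above): agg() lets you compute several summary statistics per group in one call.
# Multiple aggregations per city in a single call
grp_city.agg({'temp': ['mean', 'max'], 'windspeed': 'mean'})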
8.) Concat, Merge, Join
These work much like the corresponding operations in SQL.
Maharashtra_weather_data = pd.DataFrame({
'city': ['Mumbai', 'Pune', 'Thane'],
'temperature': [32, 45, 30],
'humidity': [80, 60, 78]
})
Maharashtra_weather_data
Gujarat_weather_data = pd.DataFrame({
'city': ['Surat', 'Rajkot', 'Mehsana'],
'temperature': [21, 24, 35],
'humidity': [68, 65, 75]
})
Gujarat_weather_data
(i) Concat:
-Row Wise Concatenation (stack one DataFrame below the other)
df = pd.concat([Maharashtra_weather_data, Gujarat_weather_data])
df
-Column Wise Concatenation (place the DataFrames side by side)
df = pd.concat([Maharashtra_weather_data, Gujarat_weather_data], axis=1)
df
(ii) Merge
temp_data = pd.DataFrame({
'city':['Mumbai','Delhi','Banglore','Hydrabad'],
'temp':[32,45,30,40]
})
temp_data
humidity_data = pd.DataFrame({
'city':['Mumbai','Delhi','Banglore'],
'humidity':[62,65,70]
})
humidity_data
# Merge the two DataFrames on the 'city' column (no need to set an index); inner join by default
df = pd.merge(temp_data, humidity_data, on='city')
df
(iii) Join
# Outer join: keep all cities from both DataFrames, filling missing values with NaN
df = pd.merge(temp_data, humidity_data, on='city', how='outer')
df
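For completeness, here is a small sketch of the DataFrame.join() method, which (unlike merge) combines DataFrames on their indexes; it reuses temp_data and humidity_data from above:
# join() works on the index, so set 'city' as the index first
df = temp_data.set_index('city').join(humidity_data.set_index('city'), how='left')
df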
9.) Scaling
Scaling is basically done in 2 ways: the first is Normalization and the second is Standardization.
data = pd.DataFrame({
'name':['Ram', 'Lakhan', 'Shiva', 'Ria', 'Lucy', 'Suraj', 'Rohan', 'Anny', 'Priya', 'Niraj'],
'age':[25,40,33,26,30,35,28,43,36,50],
'Salary':[25000, 32000, 50000, 33000, 20000, 29000, 22000, 52000, 23000, 26000],
'purchased':[1,0,1,1,0,0,0,1,0,1]
})
data
(i) Normalization (MinMaxScaler)
* X_norm = (X − X_min) / (X_max − X_min)
* Range: 0 to 1
* Due to the decreased standard deviation, the effect of outliers increases.
* Before Normalization, it is recommended to handle the outliers.
from sklearn.model_selection import train_test_split
X = data.drop(['name','purchased'], axis=1)
y = data['purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.preprocessing import MinMaxScaler
Min_Max_scaler = MinMaxScaler()
Min_Max_X_train = Min_Max_scaler.fit_transform(X_train)
Min_Max_X_test = Min_Max_scaler.transform(X_test)
Min_Max_X_train
Min_Max_X_test
(ii) Standardization (StandardScaler)
* Z = (X − μ) / σ
* μ = 0 and σ = 1 after scaling
* Values are not bounded to a fixed range (for roughly normal data, most fall between −3 and 3)
* If the standard deviations of the features differ, their scaled ranges will also differ from each other.
from sklearn.preprocessing import StandardScaler
StandardScaler_scaler = StandardScaler()
StandardScaler_X_train = StandardScaler_scaler.fit_transform(X_train)
StandardScaler_X_test = StandardScaler_scaler.transform(X_test)
StandardScaler_X_train
StandardScaler_X_test
10.) Extracting Date
There are 3 ways we can preprocess dates:
*Extracting the parts of the date into different columns: year, month, day.
*Extracting the time period between the current date and the date column, in terms of years, months, or days.
*Extracting specific features from the date, such as the name of the weekday, whether it is a weekend, etc. (a sketch of this appears after the code below).
By dealing with date features like this, our machine learning model can understand and work with the data much more easily.
from datetime import date
data = pd.DataFrame({
'date': ['01/01/2017', '04/12/2000', '23/04/2011', '11/02/2008', '08/08/2018']
})
data
data.info()
# Transform String to Date
data['date'] = pd.to_datetime(data.date, format='%d/%m/%Y')
data.info()
# Extract year
data['year'] = data['date'].dt.year
data
# Extract month
data['month'] = data['date'].dt.month
data
# Extract day
data['day'] = data['date'].dt.day
data
# Extracting passed year since the date
data['passed_years'] = date.today().year - data['date'].dt.year
data
# Extracting passed month since the date
data['passed_months'] = (date.today().year - data['date'].dt.year)*12 + (date.today().month - data['date'].dt.month)
data
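Finally, a small sketch of the third idea mentioned at the start of this section (extracting the weekday name and a weekend flag), reusing the same data DataFrame:
# Extract the weekday name and a weekend flag
data['weekday'] = data['date'].dt.day_name()
data['is_weekend'] = (data['date'].dt.dayofweek >= 5).astype(int)  # 5 = Saturday, 6 = Sunday
data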
Here is the Github link of all the code which is mentioned here:
Github Link of the Notebook: Link
Use all the tips and tricks mentioned above and you will surely get amazing data to train your model on.
We hope that you enjoyed this session. If you did, then please share it with your friends and spread this knowledge.
Follow us at :
Instagram :
https://www.instagram.com/infinitycode_x/
Facebook :
https://www.facebook.com/InfinitycodeX/
Twitter :
https://twitter.com/InfinityCodeX1