__TOP 10 JUTSUS OF FEATURE ENGINEERING EVERY DATA SCIENTISTS NINJA SHOULD KNOW RIGHT NOW__

If you want to be a successful data scientist and to make your model predict the most accurate result, then the following article is for you.

**What is Feature Engineering?**

As per Wikipedia, *feature engineering* is the process of using domain knowledge to extract *features* from raw data via data mining techniques. These *features* can be used to improve the performance of machine learning algorithms. *Feature engineering* can be considered as applied machine learning itself.

Okay, so we saw what Wikipedia says about it, but we still want to know more. So let’s split the term into 2 parts for better understanding, i.e. Feature and Engineering.

**What are the Features in a dataset?**

Basically, all machine learning algorithms use some input data (independent data) to generate output. These input data are features, which are usually in the form of structured columns.

**So, why do we engineer it?**

To get the most accurate output from our machine learning algorithm, we need to give it the cleanest data possible, in a form that is compatible with the algorithm.

So the process of preparing the proper input dataset for our machine learning model is known as Feature Engineering.

The following figures come from a survey reported by Forbes:

Data scientists spend 60% of their time cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time preparing and managing data for analysis.

76% of data scientists view data preparation as the least enjoyable part of their work.

57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work, and 19% say this about collecting data sets.

**List of Techniques you can find in this blog:**

*1.) Techniques of Imputation of numerical and categorical data*

*2.) Dealing with outliers*

*3.) Binning*

*4.) Techniques of dealing with Gaussian-Distribution / Skewness*

*5.) OneHotEncoding and OrdinalEncoder*

*6.) Feature splitting & extraction*

*7.) Group By*

*8.) Concat, Merge, Join*

*9.) Scaling*

*10.) Extracting Date*

__1.) Techniques of Imputation of numerical and categorical data__

In this section we deal with the missing values that are present in our dataset.

The simplest way to deal with missing values is to drop the affected row or column. For example, if more than 75% of the values in a row or column are missing, you can drop that row or column.

```
threshold = 0.75
# Columns: keep those whose fraction of missing values is below the threshold
data = data[data.columns[data.isnull().mean() < threshold]]
#------------------------------------------------------------------------
# Rows: keep those whose fraction of missing values is below the threshold
data = data.loc[data.isnull().mean(axis=1) < threshold]
```
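To see what the threshold does in practice, here is a minimal sketch on a made-up DataFrame. The name `toy` and its values are assumptions chosen purely for illustration: column `c` is 80% missing and the last row is completely missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the article's dataset)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, np.nan],
    "b": [10.0, np.nan, 30.0, 40.0, np.nan],
    "c": [np.nan, np.nan, np.nan, 1.0, np.nan],
})
threshold = 0.75

# Fraction of missing values per column: a=0.2, b=0.4, c=0.8
col_missing = toy.isnull().mean()
kept_cols = toy.loc[:, col_missing < threshold]

# Fraction of missing values per row, computed on the remaining columns
row_missing = kept_cols.isnull().mean(axis=1)
cleaned = kept_cols.loc[row_missing < threshold]

print(cleaned)
```

Column `c` and the last row are dropped, while the row that is only half missing is kept and can still be imputed later.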

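Instead of dropping them, we can also fill the missing values with a constant or with a summary statistic of the column:

```
# fill NaN values with 0
data = data.fillna(0)
#------------------------------------------------
# fill NaN values with the column mean (numeric columns only)
data = data.fillna(data.mean(numeric_only=True))
#------------------------------------------------
# fill NaN values with the column median: more robust to outliers than the mean
data = data.fillna(data.median(numeric_only=True))
#------------------------------------------------
# fill NaN values with the column mode (mode() returns a DataFrame, so take its first row)
data = data.fillna(data.mode().iloc[0])
```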

**(i) Simple Imputer**

-A simple imputer is univariate, which means it takes only a single feature into account.

```
# using simple imputer (x is the feature matrix that contains missing values)
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median') # can be 'mean', 'median', 'most_frequent' or 'constant'
imputer = imputer.fit(x)
data = imputer.transform(x)
print('Imputed Data:',data)
```
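As a rough, self-contained sketch of the same idea, the snippet here imputes one numeric and one categorical column. The DataFrame `toy` and its column names are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": pd.Series(["Pune", "Mumbai", np.nan, "Pune"], dtype=object),
})

# Numeric column: replace NaN with the median of the observed values (31.0)
num_imputer = SimpleImputer(strategy="median")
toy["age"] = num_imputer.fit_transform(toy[["age"]]).ravel()

# Categorical column: replace NaN with the most frequent value ("Pune")
cat_imputer = SimpleImputer(strategy="most_frequent")
toy["city"] = cat_imputer.fit_transform(toy[["city"]]).ravel()

print(toy)
```

Each column is imputed on its own here, which is exactly what "univariate" means.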

-What if the feature “x”, which has NaN values, is strongly correlated with other features such as “y” or “z”? Let’s say people with a higher “Age” pay more “Rent” and people with a lower “Age” pay less “Rent”. That suggests a multivariate approach, which simply means taking multiple features into account.

-This is where the Iterative Imputer and the KNN Imputer come into the picture.

**(ii) Iterative Imputer**

-In the iterative imputer, sklearn trains a regression model on all the rows in which “Age” is not missing.

-Here the other features, such as “Gender” and “Rent”, are treated as independent input features and “Age” is treated as the target output feature.

-Then, for all the rows in which “Age” is missing, the trained model predicts “Age” from the features “Gender” and “Rent”.

```
# IterativeImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer_it = IterativeImputer()
imputer_it.fit_transform(x)
```
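Here is a small sketch of that behaviour on invented data, assuming a toy relationship in which "age" is exactly twice "rent_k" (rent in thousands). The column names and numbers are hypothetical, and the default estimator is left unchanged.

```python
import numpy as np
import pandas as pd
# enable_iterative_imputer must be imported before IterativeImputer,
# because the class is still marked experimental in scikit-learn
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: age == 2 * rent_k, with one age missing
toy = pd.DataFrame({
    "rent_k": [10.0, 15.0, 20.0, 25.0, 30.0],
    "age": [20.0, 30.0, 40.0, np.nan, 60.0],
})

imputer_it = IterativeImputer(random_state=0)
imputed = imputer_it.fit_transform(toy)  # NumPy array, same column order as toy

# The missing age is predicted from rent_k by the fitted regression model
filled_age = imputed[3, 1]
print(round(filled_age, 1))
```

Because the toy relationship is linear, the filled value should land close to 50, whereas a mean-based simple imputer would have filled in 37.5.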

**NOTE**: The model used by the iterative imputer is completely independent of the model that you will train using this dataset.

**(iii) KNN Imputer**

-Let’s pretend we have a row in which “Age” is missing.

-Then sklearn finds the 2 most similar rows, measured by how close their “Gender” and “Rent” values are to this row, and fills in “Age” with the average “Age” of those 2 rows.

```
# KNNImputer (no experimental import is needed for this one)
from sklearn.impute import KNNImputer
imputer_knn = KNNImputer(n_neighbors=2)
imputer_knn.fit_transform(x)
```
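A tiny worked example makes the neighbour logic concrete. The array `rows` is a made-up assumption with the columns (age, rent in thousands).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: columns are (age, rent_k); the third row has no age
rows = np.array([
    [20.0, 10.0],
    [22.0, 11.0],
    [np.nan, 10.5],
    [60.0, 40.0],
])

imputer_knn = KNNImputer(n_neighbors=2)
imputed = imputer_knn.fit_transform(rows)

# The two rows closest in rent are the first two, so the missing age
# becomes the average of 20 and 22
print(imputed[2, 0])  # 21.0
```

The far-away row with rent 40 does not influence the result at all, which is the main difference from filling with a global mean.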

Categorical and nominal data can be imputed with the help of the mode:

```
data['categorical_col'] = data['categorical_col'].fillna(data['categorical_col'].mode()[0])
```

__2.) Dealing with outliers__

__Detecting outliers__:

**Using Z-score:**

A data point that lies more than 3 standard deviations away from the mean is considered an outlier. The Z-score of a value x is z = (x - μ) / σ, so a value is flagged as an outlier when it falls outside the range μ - 3σ to μ + 3σ, i.e. when |z| > 3.

```
import numpy as np
box = [10,20,15,25,11,12,16,26,19,29,3000]
def detect_outliers(data, threshold=3):
    # Collect every value whose absolute Z-score exceeds the threshold
    outliers = []
    mean = np.mean(data)
    std = np.std(data)
    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers
out = detect_outliers(box)
out
```

**Using IQR:**

Data points that fall more than 1.5 times the Inter Quartile Range (IQR) below the 1st quartile or above the 3rd quartile are considered outliers.

```
import numpy as np
box = [10,20,15,25,11,12,16,26,19,29,3000]
box = sorted(box)
Q1,Q3 = np.percentile(box, [25,75])
print(Q1,Q3)
iqr = Q3 - Q1
print(iqr)
# Lower bound and higher bound values
lower_bound_val = Q1 - (1.5*iqr)
higher_bound_val = Q3 + (1.5*iqr)
print(lower_bound_val,higher_bound_val)
```
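The bounds can then be used to actually pick out the outliers. This is a minimal sketch on the same `box` values; the variable names `values`, `mask`, `outliers` and `inliers` are just illustrative.

```python
import numpy as np

box = [10, 20, 15, 25, 11, 12, 16, 26, 19, 29, 3000]
values = np.array(box)

# Quartiles and interquartile range (Q1 = 13.5, Q3 = 25.5, IQR = 12.0)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_bound_val = q1 - 1.5 * iqr
higher_bound_val = q3 + 1.5 * iqr

# Boolean mask: True where a value lies outside the two bounds
mask = (values < lower_bound_val) | (values > higher_bound_val)
outliers = values[mask].tolist()
inliers = values[~mask].tolist()

print(outliers)  # [3000]
```

Only 3000 lies outside the range -4.5 to 43.5, so it is the single value flagged here.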

**Using Box-plot:**

In a box plot, the values that lie beyond the Max or Min whiskers are considered outliers.

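```
import numpy as np
import matplotlib.pyplot as plt
# One box plot per numeric feature; features that contain zeros are skipped (log(0) is undefined)
for feature in data:
    dataset = data.copy()
    if 0 in dataset[feature].unique():
        pass
    else:
        dataset[feature] = np.log(dataset[feature])
        dataset.boxplot(column=feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()
```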

__Dropping outliers with the Standard Deviation__:

```
fact = 3
upper_bound_val = data['column'].mean() + data['column'].std() * fact
lower_bound_val = data['column'].mean() - data['column'].std() * fact
data = data[(data['column'] < upper_bound_val) & (data['column'] > lower_bound_val)]
```

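```
import numpy as np
import pandas as pd
from scipy import stats
# Keep only the rows whose absolute Z-score is below 3 in every column
data = pd.DataFrame(np.random.randn(500,4))
data[(np.abs(stats.zscore(data)) < 3).all(axis=1)]
```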

__Dropping the outlier rows with percentile__:

```
upper_bound_val = data['column'].quantile(0.95)
lower_bound_val = data['column'].quantile(0.05)
data = data[(data['column'] < upper_bound_val) & (data['column'] > lower_bound_val)]
```

__3.) Binning__

**Image Credit**: https://wisdomschema.com/data-binning/

-Binning can be applied to both categorical and numerical data.

-The main reason behind binning is to make the model more robust and prevent overfitting. However, it comes at a cost to performance, because every bin throws away some of the detail in the original values.

Numerical Binning Example

| Value | Bin |
|---|---|
| 0-30 | Fail |
| 31-70 | Average |
| 71-100 | Excellent |

Categorical Binning Example

| Value | Bin |
|---|---|
| Mumbai | Maharashtra |
| Pune | Maharashtra |
| Bikaner | Rajasthan |
| Jaipur | Rajasthan |

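__Numerical Binning Example__:

```
# include_lowest=True makes the first bin contain 0 itself (by default the left edge is excluded)
data['bin'] = pd.cut(data['value'], bins=[0,30,70,100], labels=["Fail", "Average", "Excellent"], include_lowest=True)
```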

| | value | bin |
|---|---|---|
| 0 | 2 | Fail |
| 1 | 45 | Average |
| 2 | 7 | Fail |
| 3 | 85 | Excellent |
| 4 | 28 | Fail |
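It is worth checking how the bin edges themselves are handled. In this sketch the DataFrame `scores` is a made-up assumption whose values sit exactly on and next to the edges.

```python
import pandas as pd

# Hypothetical scores chosen to sit on and around the bin edges
scores = pd.DataFrame({"value": [0, 30, 31, 70, 71, 100]})

# Bins are right-inclusive: (0, 30], (30, 70], (70, 100].
# include_lowest=True additionally puts 0 itself into the first bin.
scores["bin"] = pd.cut(
    scores["value"],
    bins=[0, 30, 70, 100],
    labels=["Fail", "Average", "Excellent"],
    include_lowest=True,
)

print(scores)
```

Without `include_lowest=True`, a score of exactly 0 would fall outside every bin and end up as NaN.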

__Categorical Binning Example__:

```
import numpy as np
# The 'City' column holds city names; each city is binned into its state
conditions = [
    data['City'].str.contains('Mumbai'),
    data['City'].str.contains('Pune'),
    data['City'].str.contains('Bikaner'),
    data['City'].str.contains('Jaipur')]
choices = ['Maharashtra', 'Maharashtra', 'Rajasthan', 'Rajasthan']
data['State'] = np.select(conditions, choices, default='Other')
```

| | value | bin |
|---|---|---|
| 0 | Mumbai | Maharashtra |
| 1 | Pune | Maharashtra |
| 2 | Bikaner | Rajasthan |
| 3 | Delhi | Other |
| 4 | Jaipur | Rajasthan |
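When the categories match exactly, a plain dictionary lookup is a possible alternative to a list of conditions. This is only a sketch: the names `cities` and `city_to_state` are assumptions for illustration, and unlike a substring match it requires the full value to match.

```python
import pandas as pd

# Hypothetical input column with the same five cities as in the table
cities = pd.DataFrame({"city": ["Mumbai", "Pune", "Bikaner", "Delhi", "Jaipur"]})

# Lookup table from city to its bin (the state)
city_to_state = {
    "Mumbai": "Maharashtra",
    "Pune": "Maharashtra",
    "Bikaner": "Rajasthan",
    "Jaipur": "Rajasthan",
}

# Cities without an entry in the lookup table become NaN, then "Other"
cities["state"] = cities["city"].map(city_to_state).fillna("Other")

print(cities)
```

The lookup table keeps the mapping in one place, which tends to be easier to extend when new cities show up.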

__4.) Techniques of dealing with Gaussian-Distribution / Skewness__

-Helps to handle skewed data; after the transformation, the distribution becomes closer to normal.

-Decreases the effect of the outliers, due to the normalization of magnitude differences, so the model becomes more robust.

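**(i) Log Transformation:**

Log transformation is a data transformation method in which each value x is replaced by log(x), with base 10, base 2, or the natural log. The log is only defined for positive values, so a common trick is to add 1 to the column before taking the log, which lets zeros be handled as well.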

```
data['log_column'] = np.log(data['column']+1)
```

**(ii) Reciprocal Transformation:**

In the reciprocal transformation, x is replaced by its inverse, 1/x. The reciprocal transformation has little effect on the shape of the distribution. This transformation can only be used for non-zero values. The skewness of the transformed data is increased.

```
data['Reciprocal_column'] = 1/(data['column']+1)
```

**(iii) Square-Root Transformation:**

This *transformation* has a moderate effect on the *distribution*. The main advantage of the *square root transformation* is that it can be applied to zero values. Here x is replaced by the *square root* of x. It is weaker than the log *transformation*.

```
data['sqr_column'] = data['column']**(1/2)
```
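One way to check whether a transformation helped is to compare the skewness before and after. The sketch here uses synthetic right-skewed data, so the series `s` and the random seed are assumptions, and the exact numbers will differ on real data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal), used only for illustration
rng = np.random.default_rng(0)
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

# pandas' skew() is about 0 for a symmetric distribution
# and positive for a long right tail
skews = {
    "raw": s.skew(),
    "sqrt": np.sqrt(s).skew(),
    "log": np.log(s + 1).skew(),
}

for name, value in skews.items():
    print(f"{name}: {value:.2f}")
```

On data like this, both transformed columns should report a noticeably smaller skewness than the raw column.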

__5.) OneHotEncoding and OrdinalEncoder__

Computer systems only understand numerical data, so categorical data makes no sense to them unless it is converted into numerical form. For that we use techniques such as OneHotEncoding and OrdinalEncoder.

```
# For a single column
encoded_col = pd.get_dummies(data['column'], drop_first = True)
```

```
# For multiple columns
encoded_multi_col = pd.get_dummies(data,columns = ['column1', 'column2', 'columnN'], drop_first=True)
```

For unordered (Nominal) data we will be using OneHotEncoding.

Example: Male, Female

```
# OneHotEncoding is for unordered (Nominal) data
# Example: Male, Female
from sklearn.preprocessing import OneHotEncoder
# scikit-learn 1.2+ uses sparse_output; older versions used sparse=False instead
ohe = OneHotEncoder(sparse_output=False)
ohe.fit_transform(data[['column']])
```

For ordered (Ordinal) data we will be using OrdinalEncoder.

Example: First, Second, Third

```
# OrdinalEncoder is for ordered (Ordinal) data
# Example: First, Second, Third
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['First', 'Second', 'Third'],
['O','A','B','C']])
oe.fit_transform(data[['Rank', 'Grade']])
```
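To make the difference between the two encodings visible, here is a short sketch on invented data. The frame `people` and its columns are assumptions; `pd.get_dummies` stands in for the one-hot step.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: "Rank" has a natural order, "Gender" does not
people = pd.DataFrame({
    "Rank": ["Third", "First", "Second", "First"],
    "Gender": ["Male", "Female", "Female", "Male"],
}, dtype=object)

# Ordinal data: the given category order decides the codes
# (First -> 0, Second -> 1, Third -> 2)
oe = OrdinalEncoder(categories=[["First", "Second", "Third"]])
rank_codes = oe.fit_transform(people[["Rank"]]).ravel().tolist()

# Nominal data: one indicator column per category, with no implied order
gender_dummies = pd.get_dummies(people["Gender"])

print(rank_codes)  # [2.0, 0.0, 1.0, 0.0]
print(list(gender_dummies.columns))  # ['Female', 'Male']
```

Encoding "Gender" as 0 and 1 in a single ordinal column would suggest an order that does not exist, which is why one-hot columns are preferred for nominal data.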

__6.) Feature splitting & extraction__

-By extracting the utilizable part of a column into a new feature, we:

* Enable the machine-learning algorithms to comprehend it.

* Make it possible to bin and group it.

* Improve model performance by uncovering potential information.

Splitting is a good option; however, there is no single way of splitting features. How to split a column depends on its characteristics.

__Feature Splitting__:

```
import pandas as pd
import numpy as np
data = [('Arnold Schwarzenegger','M'),
('Natasha Romanova','F'),
('Sylvester Stallone','M'),
('Gal Gadot','F'),
('Dwayne Johnson','M')]
df = pd.DataFrame(data, columns=['name', 'gender'])
df
```

```
# Extracting First Name
df.name.str.split(" ").map(lambda X : X[0])
```

```
# Extracting Last Name
df.name.str.split(" ").map(lambda X : X[-1])
```
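To turn the split into actual feature columns in one step, `str.split` can expand its result into a DataFrame. A small sketch follows; the frame `people` and the new column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical names, a subset of the ones used above
people = pd.DataFrame({"name": ["Arnold Schwarzenegger", "Gal Gadot", "Dwayne Johnson"]})

# n=1 splits only on the first space; expand=True returns one column per part
parts = people["name"].str.split(" ", n=1, expand=True)
people["first_name"] = parts[0]
people["last_name"] = parts[1]

print(people)
```

Splitting only once keeps multi-word last names together, although names with middle names would still need their own rule.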

__Feature Extraction__:

```
weather_data = [('1/1/2017', 32, 6, 'Rain'),
('1/2/2017', 30, 7, 'Sunny'),
('1/3/2017', 32, 2, 'Snow'),
('1/4/2017', 34, 6, 'Snow'),
('1/5/2017', 32, 4, 'Rain'),
('1/6/2017', 32, 2, 'Sunny')
]
df = pd.DataFrame(weather_data, columns=['day', 'temp', 'windspeed', 'weather'])
df
```

```
# Extract the days on which it rained
df['day'][df['weather']=='Rain']
```

```
# Extract the day(s) with the highest temperature
df.day[df.temp == df.temp.max()]
```

__7.) Group By__

The groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.

```
weather_data = [('1/1/2017', 'Mumbai' , 32, 6, 'Rain'),
('1/2/2017', 'Pune' , 30, 7, 'Sunny'),
('1/3/2017', 'Mumbai' , 32, 2, 'Snow'),
('1/4/2017', 'Pune' , 34, 6, 'Snow'),
('1/5/2017', 'Mumbai' , 32, 4, 'Rain'),
('1/6/2017', 'Delhi' , 32, 2, 'Sunny')
]
data = pd.DataFrame(weather_data, columns=['day', 'city', 'temp', 'windspeed', 'weather'])
data
```

```
grp_city = data.groupby('city')
# Iterate over the groups: each item is (group name, rows of that group)
for city, city_data in grp_city:
    print(city)
    print(city_data)
```

```
# Get specific group
grp_city.get_group('Mumbai')
```

```
print(grp_city.max())
```

```
# numeric_only=True skips the text columns ('day', 'weather'), which have no mean
print(grp_city.mean(numeric_only=True))
```

```
print(grp_city.describe())
```
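For feature engineering, the group statistics are usually written back as new columns. The sketch here assumes a trimmed-down copy of the weather data called `weather`; the new column names are invented for the example.

```python
import pandas as pd

# Hypothetical, trimmed-down copy of the weather data
weather = pd.DataFrame({
    "city": ["Mumbai", "Pune", "Mumbai", "Pune", "Mumbai", "Delhi"],
    "temp": [32, 30, 32, 34, 32, 32],
    "windspeed": [6, 7, 2, 6, 4, 2],
})

# One summary row per city, with named aggregations
city_stats = weather.groupby("city").agg(
    mean_temp=("temp", "mean"),
    max_wind=("windspeed", "max"),
)

# transform() returns one value per original row, so it can become a feature
weather["city_mean_wind"] = weather.groupby("city")["windspeed"].transform("mean")

print(city_stats)
print(weather)
```

`agg` shrinks the data to one row per group, while `transform` keeps the original shape, which is what a model input needs.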

__8.) Concat, Merge, Join__

These operations work much like their counterparts in SQL.

```
Maharashtra_weather_data = pd.DataFrame({
'city': ['Mumbai', 'Pune', 'Thane'],
'temperature': [32, 45, 30],
'humidity': [80, 60, 78]
})
Maharashtra_weather_data
```

```
Gujarat_weather_data = pd.DataFrame({
'city': ['Surat', 'Rajkot', 'Mehsana'],
'temperature': [21, 24, 35],
'humidity': [68, 65, 75]
})
Gujarat_weather_data
```

**(i) Concat:**

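*-Row Wise Concatenation (the default, axis=0: the rows of the second DataFrame are stacked below the first)*

```
df = pd.concat([Maharashtra_weather_data, Gujarat_weather_data])
df
```

*-Column Wise Concatenation (axis=1: the DataFrames are placed side by side)*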

```
df = pd.concat([Maharashtra_weather_data, Gujarat_weather_data], axis=1)
df
```

**(ii) Merge**

```
temp_data = pd.DataFrame({
'city':['Mumbai','Delhi','Banglore','Hydrabad'],
'temp':[32,45,30,40]
})
temp_data
```

```
humidity_data = pd.DataFrame({
'city':['Mumbai','Delhi','Banglore'],
'humidity':[62,65,70]
})
humidity_data
```

```
# Merge 2 dataframes on the 'city' column (inner join by default: only cities present in both are kept)
df = pd.merge(temp_data, humidity_data, on='city')
df
```

**(iii) Join**

```
# Outer join: keep the cities from both dataframes; values missing on one side become NaN
df = pd.merge(temp_data, humidity_data, on='city',how='outer')
df
```
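A quick side-by-side sketch of the join types on the same two frames. The `indicator=True` flag and the index-based `DataFrame.join` call are additions for illustration and are not part of the original notebook.

```python
import pandas as pd

temp_data = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Banglore", "Hydrabad"],
    "temp": [32, 45, 30, 40],
})
humidity_data = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Banglore"],
    "humidity": [62, 65, 70],
})

# Inner join: only the 3 cities present in both frames
inner = pd.merge(temp_data, humidity_data, on="city", how="inner")

# Outer join: all 4 cities; indicator=True adds a _merge column with the row's origin
outer = pd.merge(temp_data, humidity_data, on="city", how="outer", indicator=True)

# DataFrame.join works on the index, so 'city' is moved into the index first
joined = temp_data.set_index("city").join(humidity_data.set_index("city"), how="left")

print(len(inner), len(outer), len(joined))  # 3 4 4
```

"Hydrabad" has no humidity reading, so it disappears in the inner join and shows up with NaN in the outer and left joins.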

__9.) Scaling__

Scaling is basically done in 2 ways: the first is Normalization and the second is Standardization.

```
data = pd.DataFrame({
'name':['Ram', 'Lakhan', 'Shiva', 'Ria', 'Lucy', 'Suraj', 'Rohan', 'Anny', 'Priya', 'Niraj'],
'age':[25,40,33,26,30,35,28,43,36,50],
'Salary':[25000, 32000, 50000, 33000, 20000, 29000, 22000, 52000, 23000, 26000],
'purchased':[1,0,1,1,0,0,0,1,0,1]
})
data
```

**(i) Normalization (MinMaxScaler)**

* X_norm = (X - X_min) / (X_max - X_min)

* Range 0 to 1

* Due to the decrease in standard deviation, the effect of outliers increases.

* Before Normalization, it is recommended to handle the outliers.

```
from sklearn.model_selection import train_test_split
X = data.drop(['name','purchased'], axis=1)
y = data['purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

```
from sklearn.preprocessing import MinMaxScaler
Min_Max_scaler = MinMaxScaler()
Min_Max_X_train = Min_Max_scaler.fit_transform(X_train)
Min_Max_X_test = Min_Max_scaler.transform(X_test)
Min_Max_X_train
```

```
Min_Max_X_test
```

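**(ii) Standardization (StandardScaler)**

* Z = (X - μ) / σ

* After scaling, each feature has μ = 0 and σ = 1

* The values are not bounded to a fixed range, so they can fall outside -1 to 1

* If the standard deviations of the features differ, their ranges will also differ from each other.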

```
from sklearn.preprocessing import StandardScaler
StandardScaler_scaler = StandardScaler()
StandardScaler_X_train = StandardScaler_scaler.fit_transform(X_train)
StandardScaler_X_test = StandardScaler_scaler.transform(X_test)
StandardScaler_X_train
```

```
StandardScaler_X_test
```
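Both formulas are easy to verify by hand. The sketch here recomputes them with NumPy on a few of the ages and compares the result with the scikit-learn scalers; the array `ages` is an assumption taken from the first five rows of the example data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# First five ages from the example data, as a single feature column
ages = np.array([[25.0], [40.0], [33.0], [26.0], [30.0]])

# Normalization: (X - X_min) / (X_max - X_min)
manual_minmax = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization: (X - mean) / std, using the population std (ddof=0) like StandardScaler
manual_standard = (ages - ages.mean()) / ages.std()

sk_minmax = MinMaxScaler().fit_transform(ages)
sk_standard = StandardScaler().fit_transform(ages)

print(np.allclose(manual_minmax, sk_minmax))      # True
print(np.allclose(manual_standard, sk_standard))  # True
print(manual_standard.max() > 1)                  # True
```

The last line shows that standardized values are not limited to the range -1 to 1: the age of 40 ends up about 1.7 standard deviations above the mean.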

__10.) Extracting Date__

There are 3 ways we can preprocess dates:

* Extracting the parts of the date into different columns: YEAR, MONTH, DAY.

* Extracting the time period between the current date and the column in terms of YEAR, MONTH, DAY.

* Extracting some specific features from the date, such as the name of the weekday, whether it is a weekend, etc.

By handling date features like this, our machine learning model can understand and work with the data more easily.

```
from datetime import date
data = pd.DataFrame({
'date': ['01/01/2017', '04/12/2000', '23/04/2011', '11/02/2008', '08/08/2018']
})
data
```

```
data.info()
```

```
# Transform String to Date
data['date'] = pd.to_datetime(data.date, format='%d/%m/%Y')
data.info()
```

```
# Extract year
data['year'] = data['date'].dt.year
data
```

```
# Extract month
data['month'] = data['date'].dt.month
data
```

```
# Extract day
data['day'] = data['date'].dt.day
data
```

```
# Extracting the number of years passed since the date
data['passed_years'] = date.today().year - data['date'].dt.year
data
```

```
# Extracting the number of months passed since the date
data['passed_months'] = (date.today().year - data['date'].dt.year)*12 + (date.today().month - data['date'].dt.month)
data
```
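The third approach, extracting specific features such as the weekday name or a weekend flag, can be sketched as follows. The frame `dates` and the new column names are assumptions for illustration.

```python
import pandas as pd

# Three of the example dates, parsed as day/month/year
dates = pd.DataFrame({
    "date": pd.to_datetime(["01/01/2017", "04/12/2000", "23/04/2011"], format="%d/%m/%Y")
})

# Name of the weekday, e.g. "Sunday"
dates["weekday_name"] = dates["date"].dt.day_name()

# dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday
dates["is_weekend"] = dates["date"].dt.dayofweek >= 5

print(dates)
```

1 January 2017 was a Sunday, 4 December 2000 a Monday and 23 April 2011 a Saturday, so the first and last rows are flagged as weekend.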

Here is the Github link of all the code which is mentioned here:

Github Link of the Notebook: Link

Use all the tips and tricks which are mentioned above and you will surely get amazing data to train your model on.

So we hope that you enjoyed this session. If you did then please share it with your friends and spread this knowledge.

