Predicting Diabetes using Machine Learning | Real Life Project : 2

Predict Diabetes using Machine Learning

As the title says, today we will discuss predicting diabetes using Machine Learning. For this we will use Logistic Regression.

Why Logistic Regression?

Logistic Regression is a classification model that will help us determine whether a person has diabetes or not. It predicts outcomes that can be expressed in a YES/NO, TRUE/FALSE, or 0/1 format.
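To make the YES/NO idea concrete, here is a minimal sketch of how Logistic Regression turns a weighted sum of inputs into a probability and then into a 0/1 answer. The weights, bias, and the 0.5 threshold are made-up illustrative values, not values learned from the diabetes dataset, and the helper names are hypothetical.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for a single sample."""
    # Weighted sum of the inputs plus a bias term
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    # Probabilities at or above the threshold become 1 (YES), others 0 (NO)
    label = 1 if probability >= threshold else 0
    return probability, label

# Hypothetical weights for two features (illustration only)
weights = [0.8, 0.5]
bias = -1.0

prob, label = predict_label([2.0, 1.0], weights, bias)
print(round(prob, 3), label)
```

In practice the weights and bias are learned from the training data; sklearn does this for us when we fit the model.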

How will we predict diabetes using Logistic Regression?

First, we will take certain independent variables, i.e. factors that help us decide whether a person has diabetes or not. Then we will train the model on the majority of the data & test it on the remaining data, which the model has not seen during training.

To learn more about Logistic Regression, check out this blog first:

Logistic Regression

NOTE :

Meaning of all the attributes present in our dataset:

pregnant: Number of times the woman has been pregnant
glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
bp (BloodPressure): Diastolic blood pressure (mm Hg)
skin (SkinThickness): Triceps skin fold thickness (mm)
insulin_level: 2-Hour serum insulin (mu U/ml)
bmi (Body Mass Index): Body mass index (weight in kg / (height in m)^2)
pedgiree (DiabetesPedigreeFunction): Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
age: Age (years)
diabetes_label (Outcome): Class variable (0 if non-diabetic, 1 if diabetic)

So today's agenda is:

1.) Import Libraries & Dataset

At this step we will import all the important libraries that help us work with our data as needed. All the steps below can only be executed once the libraries they depend on have been imported.

Such as:

pandas for Data manipulation.

numpy for efficient multi-dimensional arrays & numerical computation.

matplotlib & seaborn for data visualization.

2.) Analyzing the Data

The data visualization libraries we imported, matplotlib & seaborn, will help us analyze the data in a graphical format.

3.) Data Wrangling

This step is crucial because here we select & prepare the data that will be used to build, train & test our model. The pandas & numpy libraries will help us with the data wrangling process. How well we prepare our data will decide how accurately our model predicts.
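As a rough sketch of what data wrangling can look like, the snippet below works on a tiny made-up DataFrame that reuses a few column names from the attribute list. It assumes that a 0 in glucose, bp, or bmi means "not measured" (such readings are not physiologically plausible), and filling gaps with the column median is just one possible choice, not necessarily the one used later in this project.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that reuses some column names from the attribute list
df = pd.DataFrame({
    "glucose": [148, 85, 0, 89],
    "bp": [72, 66, 64, 0],
    "bmi": [33.6, 26.6, 23.3, 28.1],
    "diabetes_label": [1, 0, 1, 0],
})

# Count true missing values (NaN) per column
print(df.isnull().sum())

# Assumption: a 0 in these columns means "not measured", so mark it as NaN
cols_with_invalid_zeros = ["glucose", "bp", "bmi"]
df[cols_with_invalid_zeros] = df[cols_with_invalid_zeros].replace(0, np.nan)

# Fill each gap with the median of its own column
df[cols_with_invalid_zeros] = df[cols_with_invalid_zeros].fillna(
    df[cols_with_invalid_zeros].median()
)

# Separate the independent variables (X) from the label (y)
X = df.drop(columns="diabetes_label")
y = df["diabetes_label"]
print(X)
```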

4.) Test & Train Data

In this step we will split our data, then train the model on one part & test it on the other.

We will split the data with the train_test_split function from sklearn (Scikit-learn) to create a training dataset & a testing dataset in an 80-20 ratio, meaning 80% of the data is used for training & 20% for testing.
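A minimal sketch of the 80-20 split on made-up data could look like this. The arrays are placeholders for our features and labels, and random_state=42 is an arbitrary choice that only makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, plus a 0/1 label
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 keeps 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```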

Training & testing take only a few lines of code with the sklearn (Scikit-learn) library mentioned above. After splitting, we create our model using Logistic Regression, which is also available in sklearn. Don't worry, it is hardly 4 to 5 lines of code.
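To give a feel for how short that code is, here is a sketch that trains Logistic Regression on synthetic data. The data comes from sklearn's make_classification as a stand-in for the diabetes dataset, and max_iter=1000 is an assumed setting to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data: 8 features, binary label
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Create the model and train it on the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)          # 0/1 labels
probabilities = model.predict_proba(X_test)  # one probability per class

print(predictions[:5])
print(model.score(X_test, y_test))  # mean accuracy on the test data
```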

5.) Check Accuracy

At this step we will check how accurate our model is & how well it is likely to perform when it is given a larger amount of data.

How will we check the accuracy of our model?

* By predicting on the values we kept aside for testing.
* By inserting our own values & seeing what result our model gives.
* Then we will print a classification report. (will be explained in upcoming blog)
* Then we will compare actual values vs predicted values.
* Then we will use a confusion matrix. (will be explained in upcoming blog)
* At last we will compute our recall score.
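The checks listed above can be sketched with sklearn.metrics. The actual & predicted labels below are made up by hand so the numbers are easy to verify; in the real project they would come from the test data and the trained model.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    recall_score,
)

# Made-up actual and predicted labels (1 = diabetic, 0 = non-diabetic)
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Compare actual vs predicted values side by side
for actual, predicted in zip(y_test, y_pred):
    print(actual, predicted)

# Fraction of predictions that match the actual label
print(accuracy_score(y_test, y_pred))    # 0.75

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))  # [[3 1] [1 3]]

# Of all actual diabetics, the fraction the model caught
print(recall_score(y_test, y_pred))      # 0.75

# Precision, recall and f1-score per class in one table
print(classification_report(y_test, y_pred))
```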

Now let's get started...

Download Dataset from here : Diabetes_dataset

1.) Importing all the Libraries & Dataset

#Import all the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Call the data

data=pd.read_csv(r"D:/Dig/pima.csv")
data.head()

output :

#Let's describe our data

data.describe()

output :

2.) Analyzing the data

#How many people have diabetes ( 0 = Not Diabetic & 1 = Diabetic )

plt.figure(figsize=(8,7))
sns.countplot(x="diabetes_label",data=data)

output :
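As a numeric complement to the count plot, the class balance can also be read with pandas value_counts. The labels below are made up and only stand in for data["diabetes_label"].

```python
import pandas as pd

# Made-up labels standing in for data["diabetes_label"]
labels = pd.Series([0, 1, 0, 0, 1, 0], name="diabetes_label")

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

If one class is much larger than the other, accuracy alone can be misleading, which is one reason to also look at recall and the confusion matrix.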