InfinityCodeX



Predict Diabetes using Machine Learning



As the title says, today we will discuss predicting diabetes using Machine Learning. For this topic we will use Logistic Regression.


Why Logistic Regression?


Logistic Regression is a classification model, which will help us determine whether a person has diabetes or not. Logistic Regression predicts an outcome that has only two possible answers, such as YES/NO, TRUE/FALSE, or 0/1.
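To make the YES/NO idea concrete, here is a minimal sketch of how a score is squeezed into a probability and then turned into 0 or 1. The function names, the 0.5 cut-off, and the sample z values are assumptions chosen for illustration; this is not the internal code of any library.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Turn the probability into a 0/1 answer using a cut-off (0.5 assumed here)."""
    return 1 if sigmoid(z) >= threshold else 0

# z stands for the weighted sum of the input factors (hypothetical values)
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), predict_label(z))
```

A large negative score gives a probability close to 0 (label 0), and a large positive score gives a probability close to 1 (label 1).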

How will we predict diabetes using Logistic Regression?


First of all, we will take certain independent variables, i.e. factors that help us decide whether a person has diabetes or not. Then we will train the model on the majority of the data & test it on the remaining data, which the model has not seen during training.
To know more about Logistic Regression, do check out our blog on that topic first.

NOTE : Meaning of all the attributes present in our dataset:
  • pregnant: Number of times the woman has been pregnant
  • glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • bp (BloodPressure): Diastolic blood pressure (mm Hg)
  • skin (SkinThickness): Triceps skin fold thickness (mm)
  • insulin_level: 2-Hour serum insulin (mu U/ml)
  • bmi (Body Mass Index): Body mass index (weight in kg / (height in m)^2)
  • pedgiree (DiabetesPedigreeFunction): Diabetes pedigree function (a function that scores the likelihood of diabetes based on family history)
  • age: Age (years)
  • diabetes_label (Outcome): Class variable (0 if non-diabetic, 1 if diabetic)


So today's agenda is:


1.) Import Libraries & Dataset

At this step we will import all the important libraries that help us work with our data. All the steps below can only be executed once the libraries they need have been imported.

Such as:

pandas for data manipulation.
numpy, which provides an efficient multi-dimensional container for generic data.
matplotlib & seaborn for data visualization.


2.) Analyzing the Data

The data visualization libraries we imported, matplotlib & seaborn, will help us analyze the data in graphical form.


3.) Data Wrangling

This step is crucial because here we select the data that will be used to build, train & test our model. The pandas & numpy libraries will help us with the data wrangling process. How we prepare our data affects how accurately our model will predict.


4.) Test & Train Data

In this step we will split our data, then train the model on one part & test it on the other.
The split will be done with the train_test_split method from sklearn (Scikit-learn), which creates a training dataset & a testing dataset in an 80-20 ratio: 80% of the data is used for training & 20% for testing.
Training & testing take only a few lines of code with the sklearn library mentioned above. After splitting, we create our model using Logistic Regression, which is also available in sklearn. Don't worry, it's hardly 4 to 5 lines of code.

5.) Check Accuracy

At this step we will check how accurate our model is & how well it would perform if it were given a larger amount of data.

How will we check the accuracy of our model?

* By scoring the model on the values we kept for testing purposes.
* By inserting our own values & seeing what result our model gives.
* Then we will print a classification report. (will be explained in an upcoming blog)
* Then we will compare actual values vs predicted values.
* Then we will use a confusion matrix. (will be explained in an upcoming blog)
* Finally, we will compute the recall score.


Now Let's get Started...

Download Dataset from here : Diabetes_dataset

1.) Importing all the Libraries & Dataset


#Import all the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


#Call the data

data=pd.read_csv(r"D:/Dig/pima.csv")
data.head()

output :


[Image: first rows of the dataset returned by data.head()]




#Let's describe our data

data.describe()

output :



[Image: summary statistics table returned by data.describe()]



2.) Analyzing the data



#How many people have diabetes ( 0 = Not Diabetic & 1 = Diabetic )

plt.figure(figsize=(8,7))
sns.countplot(x="diabetes_label",data=data)

output :


[Image: count plot of diabetes_label (0 vs 1)]




3.) Data Wrangling



#Separate the independent variables & the dependent variable

X=data[['pregnant','glucose','bp','skin','insulin_level','bmi','pedgiree','age']]
y=data['diabetes_label']


#Check if there are any null values

data.isnull().sum()

output :


[Image: null count per column returned by data.isnull().sum()]



Since there are no null values, that's enough for this step.
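One caveat: isnull() only finds values stored as NaN. The sample row we predict on later in this post has insulin_level = 0, so a 0 in some columns may really stand for a reading that was never taken. A minimal sketch for counting such zeros is below. The tiny DataFrame is made up, and the list of columns where 0 is treated as implausible is an assumption, so check it against your own copy of the data.

```python
import pandas as pd

# Tiny made-up sample using the same column names as the dataset
sample = pd.DataFrame({
    "glucose": [148, 85, 0, 89],
    "bp": [72, 66, 64, 0],
    "skin": [35, 29, 0, 23],
    "insulin_level": [0, 0, 0, 94],
    "bmi": [33.6, 26.6, 23.3, 0.0],
})

# Columns where a value of 0 is not physically plausible (assumption)
zero_as_missing = ["glucose", "bp", "skin", "insulin_level", "bmi"]

# Count the zeros per column
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)
```

On the real data you would run the same count on data instead of sample and then decide how to treat those rows.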


4.) Test & Train Data



#Split our data into an 80-20 ratio i.e. (80% to train & 20% to test)

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
print(X_test.shape," ",X_train.shape)

output :

(154, 8) (614, 8)
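Because train_test_split shuffles the rows randomly, every run produces a different split, so your accuracy may not match the numbers in this post exactly. A minimal sketch of a repeatable split is below. X_demo and y_demo are made-up stand-ins with the same shape as our data, and the choice of random_state=42 and stratify is an assumption, not what was used for the results shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as our dataset: 768 rows, 8 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))
y_demo = rng.integers(0, 2, size=768)

# random_state fixes the shuffle; stratify keeps the 0/1 ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_te.shape, " ", X_tr.shape)
```

With a fixed random_state the same rows land in the test set on every run, which makes results easier to compare.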


#Calling Logistic Regression

from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(X_train,y_train)

output :


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
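The solver='warn' in the output above comes from an older scikit-learn release. On newer releases the default solver is lbfgs, which may print a ConvergenceWarning with max_iter=100 when the features are on very different scales, as ours are. One possible way around it, sketched below on made-up data, is to scale the features first and allow more iterations. The pipeline, the max_iter value, and the stand-in data are assumptions for illustration, not the setup used for the results in this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (made up): 200 rows, 8 features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8)) * np.array([3, 30, 12, 10, 100, 7, 0.3, 12])
y_demo = (X_demo[:, 1] + rng.normal(scale=10, size=200) > 0).astype(int)

# Scale the features first, then fit Logistic Regression
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)
print(scaled_model.score(X_demo, y_demo))
```

The pipeline can be used like the plain model: call fit on the training data, then score or predict on the test data.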


5.) Check Accuracy



#Check the accuracy of model

model.score(X_test,y_test)

output :


0.7792207792207793


#Let's check our model's predictions

model.predict([[6,148,72,35,0,33.6,0.627,50]])

output :


array([1], dtype=int64) # yes, the model predicts that this person has diabetes


#Let's print classification report

pred=model.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,pred))

output :

[Image: classification report for the test data]


#Now we will use metrics for in-depth analysis

from sklearn import metrics
y_test.value_counts()

output :

[Image: counts of 0 and 1 in y_test]


#Mean of the test labels (share of diabetic cases in the test data)

y_test.mean()

output :


0.33116883116883117
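This mean is useful as a yardstick. A "model" that always predicts the most common class would already be right most of the time, so our accuracy should be compared against that baseline. A minimal sketch using the two numbers printed in this post is below; the variable names are made up for illustration.

```python
# Values printed earlier in this post
positive_rate = 0.33116883116883117   # y_test.mean(): share of diabetic cases
model_accuracy = 0.7792207792207793   # model.score(X_test, y_test)

# Accuracy of a "model" that always predicts the most common class
baseline_accuracy = max(positive_rate, 1 - positive_rate)

print("Baseline accuracy :", round(baseline_accuracy, 4))
print("Model accuracy    :", round(model_accuracy, 4))
print("Improvement       :", round(model_accuracy - baseline_accuracy, 4))
```

Always answering "not diabetic" would be right about 67% of the time on this test data, so our model's roughly 78% is an improvement of about 11 percentage points.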


#Actual v/s Predicted

print("Actual diabetes labels : ",y_test.values[0:15])
print("Predicted diabetes labels : ",pred[0:15])

output :


Actual diabetes labels : [0 1 1 0 1 0 0 0 0 1 0 1 0 1 0]
Predicted diabetes labels : [0 0 0 0 1 0 0 0 0 0 0 0 0 1 0]


#Confusion Matrix

print(metrics.confusion_matrix(y_test,pred))

output :


[[95  8]
 [26 25]]
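To read this matrix: each row is an actual class and each column is a predicted class, so the first row holds the non-diabetic people and the second row holds the diabetic people. A minimal sketch of how the main scores fall out of these four numbers is below; the variable names are made up for illustration.

```python
# Confusion matrix printed above; rows = actual class, columns = predicted class
tn, fp = 95, 8     # actual 0: predicted 0, predicted 1
fn, tp = 26, 25    # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total     # share of all predictions that were right
recall = tp / (tp + fn)          # share of diabetic people the model caught
precision = tp / (tp + fp)       # share of "diabetic" predictions that were right

print("Accuracy  :", accuracy)
print("Recall    :", recall)
print("Precision :", precision)
```

The accuracy here matches the value from model.score, and the recall matches what metrics.recall_score reports. A recall of about 0.49 means the model misses roughly half of the diabetic cases in the test data, which accuracy alone does not show.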


#Recall Score

print(metrics.recall_score(y_test,pred))

output :


0.49019607843137253


So guys, this is how we create a model for detecting diabetes.

Thank you guys! If you like it, share it with your friends & if you have any suggestions, drop a comment below.
