Diabetes Prediction using Machine Learning Project
DATA SETS:
https://www.kaggle.com/uciml/pima-indians-diabetes-database
Introduction:
The main objective of this Diabetes Prediction using Machine Learning project is to predict whether a person has diabetes, based on features such as number of pregnancies, insulin level, age, and BMI. The dataset used in this project was taken from Kaggle. “This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database.” The dataset has 9 columns, as shown below:
Pregnancies
– Number of times pregnant
Glucose
– Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Blood Pressure
– Diastolic blood pressure (mm Hg)
Skin Thickness
– Triceps skinfold thickness (mm)
Insulin
– 2-Hour serum insulin (mu U/ml)
BMI
– Body mass index (weight in kg/(height in m)^2)
Diabetes Pedigree Function
– Diabetes pedigree function
Age
– Age (years)
Outcome
– Class variable (0 or 1); 268 of 768 are 1, the others are 0
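The class counts given for Outcome mean the two classes are not balanced, which gives a reference point for judging any accuracy figure. A quick check of the arithmetic, using only the counts stated above (the variable names are illustrative):

```python
# Counts taken from the Outcome description: 268 of 768 rows are 1.
total = 768
positive = 268
negative = total - positive  # rows with Outcome = 0

positive_share = positive / total
# Accuracy of a trivial model that always predicts the majority class (0).
majority_baseline = max(positive, negative) / total

print(f"Outcome = 1: {positive_share:.1%}")                    # 34.9%
print(f"Always predicting 0: {majority_baseline:.1%} accuracy")  # 65.1%
```

A classifier therefore needs to score clearly above roughly 65% accuracy before it can be said to have learned anything from the features.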
Technical Aspect : -
- Training a machine learning model using scikit-learn.
- Building and hosting a Flask web app.
- The user enters details such as number of pregnancies, insulin level, age, BMI, etc.
- Once all the fields have been filled in, the prediction is displayed on a new page.
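The form-then-result flow described in this list could look roughly like the sketch below. It is not the project's actual app: the route names, the inline HTML, and the field names (taken from the Kaggle CSV column names) are assumptions, and the model is trained on random placeholder numbers only so that the example runs on its own.

```python
import numpy as np
from flask import Flask, request
from sklearn.ensemble import RandomForestClassifier

# Assumed field names, mirroring the eight feature columns of the dataset.
FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Placeholder model fitted on random numbers; a real app would load a
# model trained on the diabetes dataset instead.
rng = np.random.default_rng(0)
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(rng.random((40, len(FEATURES))), rng.integers(0, 2, size=40))

app = Flask(__name__)

@app.route("/")
def form():
    # One text input per feature; submitting posts to /predict.
    inputs = "".join(f'{name}: <input name="{name}"><br>' for name in FEATURES)
    return (f'<form method="post" action="/predict">{inputs}'
            '<button type="submit">Predict</button></form>')

@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted fields in a fixed order and classify them.
    values = [float(request.form[name]) for name in FEATURES]
    label = int(model.predict([values])[0])
    return "Presence of Diabetes" if label == 1 else "Absence of Diabetes"
```

Calling `app.run()` would start Flask's development server; the result page here is a plain string, where a real app would more likely render a template.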
In this notebook I attempt a more organized approach, based on statistical techniques, to find out two things:
- A way to methodically select the most appropriate features, such that the chosen subset results in maximum accuracy.
- Whether the accuracy of the models can be pushed beyond 76%.
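One possible way to make feature selection methodical is to score every subset size with cross-validation and keep the best one. The sketch below uses scikit-learn's `SelectKBest` with the ANOVA F-test (`f_classif`); the text does not say which technique was actually used, so this choice is an assumption, and the data is a synthetic stand-in generated with `make_classification` rather than the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in with the same shape as the dataset: 768 rows, 8 features.
X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           n_redundant=2, random_state=0)

scores = {}
for k in range(1, X.shape[1] + 1):
    # Selection happens inside the pipeline, so each cross-validation fold
    # picks its features from its own training part only.
    pipeline = make_pipeline(
        SelectKBest(score_func=f_classif, k=k),
        RandomForestClassifier(n_estimators=50, random_state=0),
    )
    scores[k] = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy").mean()

best_k = max(scores, key=scores.get)
print(f"best number of features: {best_k}, "
      f"mean CV accuracy: {scores[best_k]:.3f}")
```

Keeping the selector inside the pipeline matters: selecting features on the full dataset first would leak information from the validation folds into the choice.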
Learning Objectives : -
The 7 Steps of Machine Learning provides the following general framework of steps in supervised machine learning:
- Data Collection
- Data Preparation
- Choosing a model
- Training the model
- Evaluating the model
- Parameter tuning
- Making predictions
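Of the steps above, parameter tuning is the one the rest of this write-up does not expand on. A minimal sketch of how it might be done with scikit-learn's `GridSearchCV` follows; the parameter grid is an arbitrary illustration and the data is synthetic, so the numbers it prints say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with 8 features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Illustrative grid: every combination is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# The best combination is refitted on the whole training set and then
# evaluated once on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, f"test accuracy: {test_accuracy:.3f}")
```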
Problem Statement
This is a classification problem of supervised machine learning. The objective is to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
0
– Absence of Diabetes
1
– Presence of Diabetes
Libraries Used : -
- Python 3.7
- pandas
- numpy
- seaborn
- matplotlib
- scikit-learn
Package installation:
pip install numpy
pip install pandas
pip install seaborn
pip install scikit-learn
pip install matplotlib
STEPS IN THIS PROJECT
All the required import statements are placed in a single block, together with the statements that load the dataset.
Import the required libraries, then import the diabetes dataset.
pandas is used to read the dataset, numpy (numerical Python) is used for mathematical calculations, and matplotlib and seaborn are used to plot graphs.
Then read the dataset into a new variable, df. Calling df.head() displays the first rows of the dataset as a first look at the data. Aim: to predict whether or not a person has diabetes.
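A minimal sketch of this loading step is shown below. So that it runs without the download, it reads three made-up rows from an in-memory string; with the real data the call would instead be something like `pd.read_csv("diabetes.csv")`, where the file name is an assumption. The column names follow the Kaggle CSV header.

```python
import io

import pandas as pd

# Made-up rows standing in for the downloaded CSV file (not real records).
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,120,70,30,80,28.5,0.45,35,0
5,160,82,33,150,34.1,0.60,47,1
1,95,64,22,60,23.9,0.25,24,0
"""

# Read the data into a DataFrame named df, as in the text.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first rows (five by default) for a quick look.
print(df.head())
```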
DATA ANALYSIS :
The shape attribute shows how many rows and columns the dataset has. info() is used to check whether each column's data type is int or float. isnull().sum() is used to check whether null values are present: it counts the null values in each column, so every attribute is checked.
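These three checks can be sketched as follows. The small DataFrame is random placeholder data with only three of the columns, used here so the example is self-contained; on the real dataset the same calls would be applied to the loaded df.

```python
import numpy as np
import pandas as pd

# Random placeholder data, not real patient records.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Glucose": rng.integers(70, 200, size=10),
    "BMI": rng.uniform(18, 45, size=10).round(1),
    "Outcome": rng.integers(0, 2, size=10),
})

# (rows, columns)
print(df.shape)

# Column names, non-null counts and data types (int or float).
df.info()

# Number of null values in each column.
null_counts = df.isnull().sum()
print(null_counts)
```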
TRAIN AND TEST FITTING:
From the scikit-learn package, import the train/test splitting function from the model_selection module. df.iloc is used to store the feature data (Pregnancies, BMI, Insulin, ...) in x and the label, Outcome, in y. Splitting then gives x train and y train, and x test and y test. Perform a percentage split that assigns 80% of the dataset to the training set and the rest to the test set. Calling x_train.head() shows how the feature data is stored, and y_train.head() shows how the label (outcome) is stored.
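A sketch of this step, assuming an 80/20 split and that Outcome is the last column. The DataFrame is filled with random placeholder values so the example runs on its own, and the `random_state` is an arbitrary choice that only makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]

# Random placeholder values standing in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 9)), columns=columns)
df["Outcome"] = rng.integers(0, 2, size=100)

# All columns except the last are features; the last column is the label.
x = df.iloc[:, :-1]
y = df.iloc[:, -1]

# test_size=0.2 keeps 80% of the rows for training.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

print(x_train.shape, x_test.shape)
print(x_train.head())
print(y_train.head())
```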
ALGORITHM:
From the scikit-learn package, import the random forest classifier (RFC) for making predictions. model.fit trains the algorithm on x train and y train. The trained model is then used to predict the output variable for the test data. Finally, import metrics to check the accuracy score.
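The train, predict and score sequence might look like the sketch below. The data is synthetic (`make_classification`), and `n_estimators=100` and `random_state=0` are illustrative settings, so the printed accuracy is not the project's result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 768 rows, 8 features, binary label.
x, y = make_classification(n_samples=768, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# Train the random forest on the training split.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(x_train, y_train)

# Predict labels for the unseen test split and compare with the truth.
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")
```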
SUMMARIZE THE RESULTS:
In this notebook we predicted diabetes from medical records with an accuracy of approximately 76%.