DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING
A PROJECT REPORT ON
“PREDICTION OF DIABETES USING LOGISTIC REGRESSION”
Submitted in partial fulfillment of the requirements in the 8th semester of
BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING
By
MADUKARA 1NH15IS057
Under the guidance of
Mrs. K. K. KAVITHA Assistant Professor
Dept. of ISE, NHCE
DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING
CERTIFICATE
Certified that the project work entitled “Prediction of Diabetes using Logistic
Regression”, carried out by Mr. Madukara (1NH15IS057), a bonafide student of NEW HORIZON
COLLEGE OF ENGINEERING, Bengaluru, in partial fulfillment for the award of Bachelor of
Engineering in Information Science and Engineering of the Visvesvaraya Technological
University, Belgaum, during the year 2018-19. It is certified that all corrections/suggestions
indicated for Internal Assessment have been incorporated in the report deposited in the
departmental library.
The project report has been approved as it satisfies the academic requirements in respect of
Project work prescribed for the said Degree.
-------------------------------- --------------------------------- -----------------------------------
Prof. K. K. Kavitha Dr. R J Anandhi Dr. Manjunatha
Internal Guide HOD, Dept. of ISE Principal, NHCE
External Viva
Name of the Examiners Signature with Date
1. ______________________
2. ______________________
DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING
DECLARATION
I hereby declare that I have followed the guidelines provided by the institution in
preparing the project report and have presented the report of the project titled “Diabetes
Prediction”, and that it is uniquely prepared by me after the completion of the project
work. I also confirm that the report is prepared only for my academic requirement,
and the results embodied in this report have not been submitted to any other
university or institution for the award of any degree.
Signature of the Student Name: Madukara
USN: 1NH15IS057
ABSTRACT
Diabetes is a disease caused by an elevated level of blood glucose. It is a chronic
disease with the potential to cause a worldwide health care crisis. According to the International
Diabetes Federation, 382 million people are living with diabetes worldwide; by
2035 this number is expected to nearly double, to 592 million. Various traditional methods, based on physical and
chemical tests, are available for diagnosing diabetes. However, early prediction of diabetes is
quite a challenging task for medical practitioners because of the complex interdependence of various
factors, as diabetes affects human organs such as the kidneys, eyes, heart, nerves, and feet. Data
science methods have the potential to benefit other scientific fields by shedding new light on
common questions. One such task is to help make predictions on medical data. Machine
learning is an emerging scientific field in data science dealing with the ways in which machines
learn from experience. The aim of this project is to develop a system which can perform early
prediction of diabetes for a patient with higher accuracy using machine
learning techniques. Specifically, this project predicts diabetes via a supervised machine learning
method, logistic regression, and also aims to propose an effective technique for earlier
detection of the disease.
ACKNOWLEDGEMENT
Any achievement, be it scholastic or otherwise does not depend solely on the individual efforts
but on the guidance, encouragement and cooperation of intellectuals, elders and friends. A
number of personalities, in their own capacities have helped me in carrying out this project. I
would like to take an opportunity to thank them all.
First and foremost, I thank the management and Dr. Mohan Manghnani, Chairman, New Horizon
Educational Institutions, for providing us the opportunity to do the internship.
I would like to thank Dr. Manjunatha, Principal, New Horizon College of Engineering,
Bengaluru, for his constant encouragement and the support extended towards completing my
internship.
I extend my sincere gratitude to Dr. R J Anandhi, Head of the Department, Information Science
and Engineering, New Horizon College of Engineering, Bengaluru, for her valuable suggestions
and expert advice.
I deeply express my sincere gratitude to my guide, Mrs. K. K. Kavitha, Assistant Professor,
Department of ISE, NHCE, Bengaluru, for her able guidance and regular encouragement
and assistance throughout my internship period.
Last, but not the least, I would like to thank my peers and friends who provided me with valuable
suggestions to improve my project.
Madukara
(1NH15IS057)
TABLE OF CONTENTS
Abstract i
Acknowledgement ii
Table of Contents iii
List of Figures v
1. Preamble 1
1.1 Introduction 1
1.2 Relevance of the Project 3
1.3 Purpose 3
1.4 Scope of the Project 4
1.5 Problem Statement and Definition 4
1.6 Objective of study 5
1.7 Existing System 6
1.8 Proposed system 6
2. Literature Survey 8
3. System Requirements and specifications 13
3.1 General Description of the system 13
3.1.1 Overview of Functional requirements 13
3.1.2 Overview of Data requirements 15
3.2 Technical Requirements of the System 17
3.2.1 Hardware Requirements 17
3.2.2 Software Requirements 17
3.3 Language specification 17
3.3.1 Python 17
3.3.2 Machine Learning Features 18
3.3.3 Logistic Regression 19
4. System Design and Analysis 21
4.1 Preliminary Design 21
4.2 System Architecture 22
4.3 Data Flow Diagram 23
4.3.1 DFD for Data Extraction 24
4.3.2 DFD for Classification of Data 24
4.3 Use Case Diagram 25
5. Implementation 26
5.1 Different modules of the project 31
5.2 Flow Chart Of The Proposed System 32
6. Experimental Results 33
6.1 Outcomes of the Proposed System 33
7. Testing 34
7.1 Testing and Validations 34
7.2 Testing Levels 31
7.2.1 Functional Testing 35
7.2.2 Non-Functional Testing 35
7.3 White Box Testing 37
7.4 Different Stages of Testing 38
7.4.1 Unit Testing 38
7.4.2 Integration Testing 40
7.4.3 System Testing 41
7.4.4 Acceptance Testing 41
8. Conclusion and Future Enhancement 42
8.1 Conclusion 42
LIST OF FIGURES
Figure no Figure Name Page no
Figure 1.1 Predicting Diabetes Using Machine Learning 1
Figure 1.2 Representation of Diabetes Level Checking 2
Figure 3.1 Training and test set 14
Figure 4.1 Preliminary design of Diabetes Prediction system 21
Figure 4.2 Architecture of Diabetes Prediction System 22
Figure 4.3 Data Flow Diagram 23
Figure 4.4 DFD for Data Extraction 24
Figure 4.5 DFD for Classification of Data 24
Figure 4.6 Use Case Diagram 25
Figure 5.1 Confusion matrix 27
Figure 5.2 ndarray 29
Figure 5.3 Dataset 31
Figure 5.4 Flowchart of the proposed system 32
Figure 6.1 Data distribution 33
Figure 7.1 Non-Functional Testing 36
Diabetes Prediction
Dept of ISE, NHCE Page 1
Chapter 1
Preamble
Diabetes is one of the deadliest diseases in the world. It is not only a disease itself but also
a cause of other diseases such as heart attack, blindness, and kidney disease. In the
usual identification process, patients need to visit a diagnostic center, consult
their doctor, and wait for a day or more to get their reports. But with the rise of
machine learning approaches, we have the ability to find a better solution to this issue.
1.1 Introduction
Diabetes mellitus is a chronic disease characterized by hyperglycemia. It may cause
many complications. Given the growing morbidity in recent years, by 2040 the
world's diabetic population is expected to reach 642 million, which implies that one in ten
adults will be living with diabetes. There is no doubt that this alarming
figure needs serious attention. The World Health Organization has estimated that 12
million deaths occur worldwide every year due to heart diseases. Most of
the deaths in the United States and other developed nations are
due to cardiovascular diseases. The early prognosis of cardiovascular
diseases can help in making decisions on lifestyle changes for high-risk
patients and hence reduce complications. This work aims to pinpoint the
most significant risk factors of heart disease as well as to predict the overall
risk using logistic regression. Machine learning has been applied to
many areas of medical health. In this project, we used logistic regression
to predict diabetes mellitus.
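At its core, logistic regression maps a weighted sum of a patient's feature values through the sigmoid function to obtain a probability between 0 and 1. The following minimal sketch illustrates that computation; the feature values and weights shown are illustrative, not fitted model parameters:

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability that one patient record belongs to the positive
    (diabetic) class, given per-feature weights and a bias term."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Illustrative call: two standardized feature values and made-up weights.
p = predict_probability([1.2, -0.5], [0.8, 0.3], -0.1)
```

A probability above a chosen threshold (commonly 0.5) would then be labeled as diabetic.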
Figure 1.1 : Predicting Diabetes Using Machine Learning
The dataset is originally from the National Institute of Diabetes and Digestive and
Kidney Diseases. The objective of the dataset is to diagnostically predict whether or
not a patient has diabetes, based on certain diagnostic measurements included in
the dataset.
Several constraints were placed on the selection of these instances from a larger
database. In particular, all patients here are females at least 21 years old of Pima
Indian heritage.
The datasets consist of several medical predictor (independent) variables and one
target (dependent) variable, Outcome. Independent variables include the number of
pregnancies the patient has had, their BMI, insulin level, age, and so on.
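Assuming the dataset has been saved as a CSV file, it can be loaded with pandas and separated into the predictor variables and the Outcome target. The inline table below is a small illustrative sample with a subset of the columns, used so the sketch is self-contained:

```python
import io
import pandas as pd

# In practice this would be pd.read_csv("diabetes.csv"); a tiny inline
# sample with a subset of the columns keeps the sketch self-contained.
sample = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,BMI,Age,Outcome\n"
    "6,148,72,33.6,50,1\n"
    "1,85,66,26.6,31,0\n"
)
df = pd.read_csv(sample)

X = df.drop(columns=["Outcome"])  # independent (predictor) variables
y = df["Outcome"]                 # dependent (target) variable
```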
The proposed model predicts diabetes and can be valuable to specialists as an aid
in forecasting the disease. In this study, we examined the connection between
complications in diabetic patients and their attributes, for example blood glucose,
blood pressure, height, weight, hemoglobin, and body mass index. The aim of
this study is to predict complications based on these symptoms.
Figure 1.2 : Representation of Diabetes Level Checking
The above Figure 1.2 shows how the diabetes level is checked and how an
alert is given when the level of diabetes rises.
1.2 Relevance of the Project
In recent years, machine learning has been used with increasing frequency to
predict the possibility of disease. Many algorithms and toolkits have been created
and studied by researchers, and these have highlighted the tremendous potential of
this research field.
Based on several studies, we found that a commonly used dataset is the Pima
Indians Diabetes Dataset from the University of California, Irvine (UCI) Machine
Learning Repository. In this project, logistic regression is aimed at validating a chosen
class label of the given data and at building the final classifier model. All of the
studies surveyed use this same Pima Indians Diabetes Dataset as their
experimental material.
1.3 Purpose
A large amount of data is continuously generated in the fields of engineering and
science. Recent advances in technology have resulted in big electronic data that
allow data to be captured, processed, analyzed, and stored rather inexpensively.
This change leads to new trends in markets and industries such as Internet
banking, e-commerce, insurance, financial transactions, supermarkets, healthcare,
and communications, all of which generate huge amounts of electronic data. The
need to understand huge, complex, information-rich data sets is important to
virtually all fields in business, science, engineering, and medicine. Knowledge hidden
in the data held in data warehouses and data marts can be extracted, and this
knowledge is becoming vital in today’s increasingly competitive world. A
great problem of today is how to teach people to ignore the irrelevant data.
With the rise of machine learning approaches, we have the ability to find a solution
for diabetes prediction: we have developed a system using machine learning which has
the ability to predict whether a patient has diabetes or not. Furthermore,
predicting the disease early leads to treating the patients before it becomes critical.
Machine Learning has the ability to extract hidden knowledge from a huge amount
of diabetes-related data. Because of that, it has a significant role in diabetes
research, now more than ever. The aim of this research is to develop a system which
can predict the diabetic risk level of a patient with a higher accuracy. This research
has focused on developing a system based on logistic regression.
1.4 Scope of the Project
Early intervention in diabetes can reduce its prevalence and hence
the economic burden due to it. Machine learning techniques play an important role
in working out treatment plans, rehabilitation, chronic disease management plans, and so on.
Long-term follow-up may be easily guided, and close supervision is possible. Such
systems can definitely help reduce the cost of patient management by
avoiding unnecessary investigations and follow-ups. These prediction
systems improve accuracy and save time. Computer-based patient support
systems benefit patients by providing informational support that increases their
participation in health care.
1.5 Problem Statement and Definition
The task is to identify whether a given person in the dataset is diabetic, non-diabetic, or pre-
diabetic on the basis of attribute values. Values exceeding specific thresholds
may contribute to identifying whether a person is diabetic, non-diabetic, or pre-diabetic.
The aim of diabetes prediction is to make people aware of diabetes, what
it takes to treat it, and to give them the power to control it.
The model can be used by endocrinologists, dietitians, ophthalmologists, and
podiatrists to predict whether or not a patient is likely to suffer from diabetes and,
if so, how intense it could be.
The dataset consists of features comprising the medical details of the patients that
are useful in determining the health condition of the patient.
Goals of the System:-
Convert the manual process to a computerized one. Previously, the majority of the process was done
manually; after computerization, prediction becomes easier.
By computerizing, it is easier to understand the doctor's report, which is otherwise hard
to read with different and complicated handwriting.
Easy to maintain records.
Able to predict various types of diabetes.
Ensure the system is useful to both doctor and patient.
1.6 Objective of the study
Easy to use: The primary objective of this project is to develop a
platform that is simple and easy to use. Here one has only to
provide the patient’s medical details, and based on the features
extracted the algorithm will detect diabetes and its type. Since
the algorithm does the task, a well-trained model is less likely to
make errors in predicting diabetes and its type; accuracy is therefore
improved, time is saved, and it becomes easier for doctors as well as
patients to determine whether or not they are vulnerable to any type of
diabetes, which is otherwise difficult to do without a health practitioner’s
involvement.
No human intervention required: To predict diabetes, one must provide
medical details such as age, BMI, and so on. The algorithm then produces
results based on the features extracted, and the chances of mistakes
are minimal since there is no human intervention. It also saves a lot of time for
patients and doctors, who can proceed sooner to treatment or
other procedures because results are provided to them more quickly.
This in turn makes precaution and prevention for diabetes faster,
saving doctors and patients the essential time needed to move on
to further treatment and to the precautions to be taken to minimize
the effect of the disease.
Not only detect the diabetes type but also suggest precautions:
In this project our goal is not only to find and predict the type of
diabetes but also to point towards the precautions to be taken to minimize the
impact of the diabetes. Getting hints on precautions to be taken will help the
doctors and patients to progress without problems to further steps of their
treatment.
1.7 Existing system
The healthcare industry collects huge quantities of healthcare data which, unfortunately,
are not “mined” to discover hidden information. Clinical decisions are frequently
made based on doctors’ intuition and experience rather than on the knowledge-rich
data hidden in the database. This practice leads to unwanted biases, errors, and
excessive medical costs. The existing process is very slow to give a result, and it is very
difficult to determine whether a patient has diabetes or not.
1.8 Proposed system
Diabetes prediction is a web-based machine learning application, trained on a
Pima Indian dataset. The user inputs their particular clinical
details to get a prediction of diabetes. The algorithm calculates the
probability of the presence of diabetes, thus minimizing the cost and time required to
predict the disease. Data format plays an essential part in this
application: at the time of uploading the user information, the application checks for the
correct record format, and if it is not as required an ERROR dialog box
is triggered. Our system implements the Logistic
Regression algorithm, trained using the dataset obtained
from the University of California, Irvine. 75% of the entries in the dataset are
used for training and the remaining 25% for testing the accuracy of the algorithm.
Furthermore, a few steps can be taken to optimize the algorithm, thereby
enhancing the accuracy.
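The record-format check described above can be sketched as a simple validation function. The field names follow the Pima dataset's attributes, but the exact checks and error messages here are assumptions for illustration, not the report's actual implementation:

```python
# Expected numeric fields for one uploaded record (the Pima attributes).
EXPECTED_FIELDS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

def validate_record(record):
    """Return an ERROR message for a malformed record, or None if it is
    acceptable; mirrors the idea of raising an ERROR dialog on upload."""
    missing = [f for f in EXPECTED_FIELDS if f not in record]
    if missing:
        return f"ERROR: missing fields {missing}"
    non_numeric = [f for f in EXPECTED_FIELDS
                   if not isinstance(record[f], (int, float))]
    if non_numeric:
        return f"ERROR: non-numeric fields {non_numeric}"
    return None

ok = validate_record({f: 1.0 for f in EXPECTED_FIELDS})
err = validate_record({"Glucose": "high"})
```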
Advantages of the system:-
Our proposed system has the following benefits:
Powerful, flexible, and easy to use.
Increased efficiency for doctors.
Improved patient satisfaction.
Reduced use of paper.
Simple and quick.
More accurate results.
Chapter 2
Literature Survey
The Pima Indians are genetically predisposed to diabetes, and it was noted that
their diabetes rate was 19 times that of a typical town in Minnesota. The National
Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes
of Health (NIH) originally owned the Pima Indian Diabetes Database (PIDD) [18].
The database contains n = 768 patients, each with 9 attribute variables.
Of the nine attributes, six come from physical examination and the rest
from chemical examination; eight are inputs and the last one is the output.
The goal is to use the first 8 variables to predict the value of the 9th variable.
One study aimed to develop a hybrid model for classifying the Pima Indian
Diabetes Database. Incorrectly classified instances were identified and
eliminated using k-means clustering; classification was then done on the
correctly clustered instances using a decision tree classifier. The resultant
dataset was used to train and test the model using two methods:
dividing training and test data in a 60:40 ratio, and 10-fold cross-validation.
Experimental results show an improvement in classification accuracy on the
diabetes dataset of 19.50% using the proposed cascaded method (k-means with
decision tree) compared to a decision tree alone on unprocessed data.
Weka is a popular machine learning software suite developed in Java at the University of
Waikato, New Zealand. It is open-source software available under the GNU General Public
License. It consists of visualization tools and algorithms used in data
analysis and predictive modeling, with a graphical user interface for easy
access to this functionality. Weka supports several data mining tasks such as data pre-
processing, clustering, regression, visualization, and feature selection. The
attributes available in Weka are of one of these types: Nominal (one of a predefined
list of values), Numeric (a real or integer number), String, Date, or Relational. Key
features of this tool are that it is open source and platform independent, and it
provides various algorithms for data mining and machine learning.
Diabetes, a non-communicable disease, leads to long-term complications and
serious health problems. A report from the World Health Organisation addresses
diabetes and its complications, which impact individuals physically, financially,
and economically, as well as their families. The survey reports about 1.2 million
deaths due to uncontrolled diabetes, and about 2.2 million deaths due to
diabetes-related risk factors such as cardiovascular and other diseases.
Diabetes is an ailment caused by an elevated concentration of sugar in the
blood. One paper discussed various classifiers and proposed a decision support
system that uses the AdaBoost algorithm with Decision Stump as the base classifier
for classification. Moreover, Support Vector Machine, Naive Bayes, and Decision Tree
were additionally executed as base classifiers for AdaBoost for accuracy
verification. The accuracy obtained for AdaBoost with Decision
Stump as the base classifier is 80.72%, which is greater than that obtained
with Support Vector Machine, Naive Bayes, or Decision Tree.
An area where artificial intelligence is having ever more effect is machine learning, which
develops algorithms able to learn patterns and decision rules from data.
Machine learning algorithms have been embedded into data mining
pipelines, which can combine them with established statistical techniques to
extract knowledge from data. Within the EU-funded MOSAIC project, a
data mining pipeline was used to derive a set of
predictive models of type 2 diabetes mellitus (T2DM) complications based on
electronic health record data of nearly one thousand patients. The
pipeline includes clinical center profiling, predictive model targeting, predictive
model construction, and model validation. After handling missing
data by means of random forests (RF) and applying suitable
strategies to deal with class imbalance, the authors used logistic regression
with stepwise feature selection to predict the onset of retinopathy,
neuropathy, or nephropathy at different time scenarios: at 3, 5, and 7 years from the
first visit at the Hospital Center for Diabetes (not from diagnosis). Considered
variables are gender, age, time from diagnosis, body mass index (BMI), glycated
haemoglobin (HbA1c), hypertension, and smoking habit. Final models, tailored
according to the complications, provided an accuracy of up to 0.838. Different variables were
selected for each complication and time scenario, leading to specific models easy
to translate to clinical practice.
In another paper, analysis of the Pima Indian dataset is done using various classification
techniques such as Naïve Bayes, ZeroR, J48, random forest, MLP, and logistic regression,
comparing them in predicting positive and negative diabetes cases. Diagnosing
diabetes through the WEKA data mining tool, MLP performs best in terms of accuracy and
performance.
Patients with diabetes must continuously monitor their blood glucose levels and
adjust insulin doses, endeavouring to keep blood glucose levels as near
normal as possible. Blood glucose levels that deviate from the normal range
can lead to serious short-term and long-term complications. An automatic
prediction model that warned people of imminent changes in their
blood glucose levels would enable them to take preventive action. One paper
describes a solution that uses a generic physiological model of blood glucose
dynamics to generate informative features for a support vector regression
model that is trained with patient-specific data. The new model outperforms
diabetes experts at predicting blood glucose levels and could be used to
anticipate almost a quarter of hypoglycaemic events 30 minutes in advance. Although
the corresponding precision is currently only 42%, most false
alarms fall in near-hypoglycaemic regions, and hence patients responding to these
hypoglycaemia alerts would not be harmed by the intervention.
Diabetes mellitus is one of the most serious health challenges in
both developing and developed countries. According to the International Diabetes Federation,
there are 285 million diabetic individuals worldwide, a total expected
to rise to 380 million within 20 years. Because of its importance, the design of a
classifier for the detection of diabetes with optimal cost and better performance
is the need of the age. The Pima Indian diabetes database at the UCI machine
learning repository has become a standard for testing data mining
algorithms and their prediction accuracy in diabetes data classification.
The proposed method uses SVM, a machine learning technique, as the classifier
for diagnosis of diabetes. The machine learning approach focuses on classifying
diabetes disease from a high-dimensional medical dataset. The experimental results
obtained demonstrate that a support vector machine can be effectively used for diagnosing
diabetes.
The aim of another study is the diagnosis of diabetes, one of the most
important diseases in the medical field, using Generalized
Discriminant Analysis (GDA) and Least Squares Support Vector Machine (LS-SVM).
The authors also proposed a new cascade learning system based on Generalized
Discriminant Analysis and Least Squares Support Vector Machine. The proposed
system consists of two stages. In the first stage, GDA is used to
discriminate feature variables between healthy
and patient (diabetes) data as a pre-processing step. In the second stage,
LS-SVM is used to classify the diabetes dataset. While LS-SVM obtained
78.21% classification accuracy using 10-fold cross-validation, the proposed
system, called GDA–LS-SVM, obtained 82.05% classification accuracy using 10-fold
cross-validation. The robustness of the proposed system was examined using
classification accuracy, the k-fold cross-validation method, and the confusion matrix. The
obtained classification accuracy of 82.05% is very promising compared
to previously reported classification methods.
A proper and thorough literature survey concludes that there are various methods
that can be used to predict diabetes. Some of these approaches are:
Artificial Neural Network
Bayesian Network
Neural Network
Hidden Markov Method
Genetic Algorithm
Chapter 3
System Requirements and specifications
The System Requirements Specification document describes all data, functional and
behavioral requirements of the software under production or development. A
functional requirement document defines the functionality of a system or one of its
subsystems. It also depends upon the type of software, expected users and the type
of system where the software is used. Non-functional requirement is
a requirement that specifies criteria that can be used to judge the operation of a
system, rather than specific behaviors.
3.1 General Description of the System
3.1.1 Overview of Functional requirements
Preprocess Data
Data Preprocessing is a technique that is used to convert the raw data into a clean
data set. In other words, whenever the data is gathered from different sources it is
collected in raw format which is not feasible for the analysis.
Therefore, certain steps are executed to convert the data into a smaller, clean data set
before iterative analysis begins. This set of
steps is known as Data Preprocessing. It includes
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Preprocessing is necessary because of the presence of unformatted real-world
data. Most real-world data is composed of:
Inaccurate data (missing data) - There are many reasons for missing data, such as
data not being collected continuously, mistakes in data entry, technical problems with
biometrics, and more.
Noisy data (erroneous data and outliers) - Reasons for the
existence of noisy data include a technical problem with the device that gathers the data,
human mistakes during data entry, and more.
Inconsistent data - Inconsistencies arise from causes such as
duplication within the data, errors in human data entry, and mistakes in codes
or names, i.e., violations of data constraints.
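A sketch of how such preprocessing might be done with pandas: in the Pima dataset, zeros in columns like Glucose and BMI are commonly treated as missing values, so one option is median imputation followed by duplicate removal. The tiny inline table is illustrative, not real patient data:

```python
import numpy as np
import pandas as pd

# Tiny illustrative table; the real dataset has 768 rows and more columns.
df = pd.DataFrame({
    "Glucose":       [148, 0, 85, 85],
    "BloodPressure": [72, 66, 66, 66],
    "BMI":           [33.6, 0.0, 26.6, 26.6],
    "Outcome":       [1, 0, 0, 0],
})

# Zeros in these columns are physiologically impossible, so treat them as
# missing values and impute with the column median (one common choice).
for col in ["Glucose", "BloodPressure", "BMI"]:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].median())

# Drop exact duplicate rows (inconsistent data, e.g. double entry).
df = df.drop_duplicates().reset_index(drop=True)
```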
Train and Test Data Creation
The data we use is usually split into training data and test data. The training set
contains a known output and the model learns on this data in order to be
generalized to other data later on. We have the test dataset (or subset) in order to
test our model’s prediction on this subset.
Figure 3.1: Training and test set
The above figure shows the splitting of data into a training set and a test set. The
training set is larger than the test set. The model trained on the training set is then
tested against the test set. This dataset is originally from the National Institute of Diabetes
and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically
predict whether or not a patient has diabetes, based on certain diagnostic
measurements included in the dataset.
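The split described above can be performed with scikit-learn's train_test_split using the 75/25 proportion stated earlier; the synthetic array below merely stands in for the real 768-row table so the sketch is self-contained:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 768-row Pima table: 8 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 2, size=100)

# 75% of rows for training, 25% for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```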
Model Creation
The process of training an ML model involves providing an ML algorithm (that
is, the learning algorithm) with training data to learn from. The term ML
model refers to the model artifact that is created by the training process.
The training data must contain the correct answer, which is known as a target
or target attribute. The learning algorithm finds patterns in the training data
that map the input data attributes to the target (the answer that you want to
predict), and it outputs an ML model that captures these patterns.
You can use the ML model to get predictions on new data for which you do
not know the target. For example, suppose you want to train an ML
model to predict whether an email is spam or not spam. You would provide training
data that contains emails for which you know the target; the machine would
train an ML model using this data, resulting in a model that attempts to
predict the label of new emails. In the same way, our model is trained on
labeled patient records so that it can predict whether or not a patient has diabetes.
In our project we are using Logistic Regression to build our Model on Pima
Indian Dataset.
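A minimal sketch of fitting a logistic regression model with scikit-learn; synthetic data stands in for the actual Pima training set so the example runs on its own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the Pima training set (8 features).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(80, 8))
# Tie the label to the first feature so there is a pattern to learn.
y_train = (X_train[:, 0] + 0.3 * rng.normal(size=80) > 0).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The fitted model has one coefficient per feature plus an intercept, and
# predict_proba returns per-class probabilities for new records.
probs = model.predict_proba(X_train[:2])
```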
Result Analysis:
In this final phase, we test our model on the prepared dataset and measure
its diabetes prediction performance. To evaluate the performance of
our classifier and make it comparable to current approaches, we use
accuracy to measure its effectiveness.
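One way to compute the accuracy (and, optionally, a confusion matrix) with scikit-learn; the data here is a synthetic stand-in so the sketch is self-contained:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, linearly separable stand-in data (8 features, binary target).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

y_pred = model.predict(X_te)
acc = accuracy_score(y_te, y_pred)   # fraction of test rows classified right
cm = confusion_matrix(y_te, y_pred)  # 2x2 table of correct/incorrect calls
```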
3.1.2 Overview of Data requirements
Reliability
The system must be reliable and robust in providing its functionalities.
When a user has reported some changes, the actions taken must
be made visible by the system. Changes made by the
programmer must be visible both to the project leader and to the test engineer.
Maintainability
System monitoring and maintenance should be simple and focused in its
approach. There should not be an excess of jobs running on diverse
machines, such that it becomes hard to monitor whether the jobs are
running without errors.
Performance
The system will be used by numerous users simultaneously.
Since the system will be hosted on a single web server with a single
database server behind the scenes, performance becomes
a significant concern. The system should not give way when multiple
users use it at the same time. It should allow quick access
to every one of its users. For instance, if two test engineers
are simultaneously trying to report the presence of a bug, there should
not be any inconsistency.
Portability
The framework should to be effectively versatile to another framework. This
is obliged when the web server, which s facilitating the framework gets
adhered because of a few issues, which requires the framework to be taken
to another framework.
Scalability
The framework should be sufficiently adaptable to include new
functionalities at a later stage. There should be a run of the mill channel,
which can oblige the new functionalities.
Flexibility
Flexibility is the capacity of a framework to adjust to changing situations and
circumstances, and to adapt to changes to business approaches and rules. An
adaptable framework is one that is anything but difficult to reconfigure or
adjust because of diverse client and framework prerequisites. The deliberate
division of concerns between the trough and motor parts helps adaptability
Diabetes Prediction
Dept of ISE, NHCE Page 17
as just a little bit of the framework is influenced when strategies or principles
change.
3.2 Technical Requirements of the System
3.2.1 Hardware Requirements
System Processor : Core i3.
Hard Disk : 500 GB.
RAM : 4 GB.
Any desktop / Laptop system with above configuration or higher level.
3.2.2 Software Requirements
Operating system : Windows 8 / 10
Programming Language : Python
ML Libraries : NumPy, Pandas, scikit-learn, Matplotlib
3.3 Language specification
3.3.1 Python
Python is an easy to learn, powerful programming language. It has efficient high-
level data structures and a simple but effective approach to object-oriented
programming. Python’s elegant syntax and dynamic typing, together with its
interpreted nature, make it an ideal language for scripting and rapid application
development in many areas on most platforms.
The Python interpreter and the extensive standard library are freely available in
source or binary form for all major platforms from the Python Web
site, https://www.python.org/, and may be freely distributed. The same site also
contains distributions of and pointers to many free third party Python modules,
programs and tools, and additional documentation.
The Python interpreter is easily extended with new functions and data types
implemented in C or C++ (or other languages callable from C). Python is also suitable
as an extension language for customizable applications.
Python is a high-level, interpreted, interactive and object-oriented scripting
language. Python is designed to be highly readable. It uses English keywords
frequently, whereas other languages use punctuation, and it has fewer
syntactic constructions than other languages.
Python is Interpreted − Python is processed at runtime by the interpreter.
You do not need to compile your program before executing it. This is similar
to Perl and PHP.
Python is Interactive − You can actually sit at a Python prompt and interact
with the interpreter directly to write your programs.
Python is Object-Oriented − Python supports Object-Oriented style or
technique of programming that encapsulates code within objects.
Python is a Beginner's Language − Python is a great language for the
beginner-level programmers and supports the development of a wide range
of applications from simple text processing to WWW browsers to games.
3.3.2 Machine Learning Features
Machine Learning is a method of statistical learning where each instance in a
dataset is described by a set of features or attributes. In contrast, the term “Deep
Learning” is a method of statistical learning that extracts features or attributes from
raw data. Deep Learning does this by utilizing neural networks with many hidden
layers, big data, and powerful computational resources. The terms may seem
somewhat interchangeable; however, with Deep Learning the algorithm constructs
representations of the data automatically. In contrast, data representations
are hand-coded as a set of features in classical Machine Learning algorithms,
requiring further processes such as feature selection and extraction (for
example, PCA).
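As an illustration of classical feature extraction, the core computation behind PCA can be sketched with NumPy alone. The data below is synthetic and the choice of two components is arbitrary; this is a minimal sketch, not a production implementation:

```python
import numpy as np

# Toy feature-extraction sketch: project 3-D data onto its top two
# principal components via the SVD (the computation underlying PCA).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # synthetic data: 100 samples, 3 features
Xc = X - X.mean(axis=0)                  # centre each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T                # keep the two strongest directions
print(X_reduced.shape)                   # (100, 2)
```

In practice a library routine (e.g. scikit-learn's PCA) would be used instead; the sketch only shows what such a routine computes.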
Both of these terms are in dramatic contrast with another class of classical artificial
intelligence algorithms known as Rule-Based Systems where each decision is
manually programmed in such a way that it resembles a statistical model.
In Machine Learning and Deep Learning, there are many different models that fall
into two different categories, supervised and unsupervised. In unsupervised
learning, algorithms such as k-Means, hierarchical clustering, and Gaussian mixture
models attempt to learn meaningful structures in the data. Supervised learning
involves an output label associated with each instance in the dataset. This output
can be discrete/categorical or real-valued. Regression models estimate real-valued
outputs, whereas classification models estimate discrete-valued outputs. Simple
binary classification models have just two output labels, 1 (positive) and 0 (negative).
Some popular supervised learning algorithms that are considered classical
Machine Learning are linear regression, logistic regression, decision trees,
support vector machines, and neural networks, as well as non-parametric models
such as k-Nearest Neighbors.
3.3.3 Logistic Regression
Logistic regression is a statistical method for analysing a dataset in which there are
one or more independent variables that determine an outcome. The outcome is
measured with a dichotomous variable (in which there are only two possible
outcomes). It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a
set of independent variables. To represent binary / categorical outcome, we use
dummy variables. We can also think of logistic regression as a special case of linear
regression when the outcome variable is categorical, where we are using log of odds
as dependent variable. In simple words, it predicts the probability of occurrence of an
event by fitting data to a logit function.
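The logit (sigmoid) function mentioned above can be sketched in a few lines of Python. The coefficients b0 and b1 below are purely illustrative, not fitted to any real data:

```python
import math

def sigmoid(z):
    # Maps any real-valued log-odds z to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: log-odds = b0 + b1 * glucose
b0, b1 = -5.0, 0.04
p = sigmoid(b0 + b1 * 140)   # predicted probability of diabetes at glucose = 140
print(round(p, 3))
```

A fitted model would learn b0 and b1 from data; the sketch only shows how the logit function turns a linear combination of predictors into a probability.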
Logistic regression was developed by statistician David Cox in 1958. This binary
logistic model is used to estimate the probability of a binary response based on one
or more predictor (or independent) variables (features). It allows one to say that the
presence of a risk factor increases the probability of a given outcome by a specific
percentage.
Like all regression analyses, the logistic regression is a predictive analysis. Logistic
regression is used to describe data and to explain the relationship between one
dependent binary variable and one or more nominal, ordinal, interval or ratio-level
independent variables.
It is used in healthcare, the social sciences, and various machine-learning
applications for advanced research and analytics.
The Trauma and Injury Severity Score, which is widely used to predict
mortality in injured patients, was originally developed by Boyd et al. using
logistic regression. Many other medical scales used to assess the severity of
a patient's condition have been developed using logistic regression.
Binary logistic regression is estimated using Maximum Likelihood Estimation (MLE),
unlike linear regression which uses the Ordinary Least Squares (OLS) approach. MLE is
an iterative procedure, meaning that it starts with a guess as to the best weight for
each predictor variable (that is, each coefficient in the model) and then adjusts these
coefficients repeatedly until there is no additional improvement in the ability to
predict the value of the outcome variable (either 0 or 1) for each case.
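The iterative adjustment of coefficients described above can be sketched with plain gradient ascent on the log-likelihood. This is a toy stand-in for the faster solvers real libraries use, run on made-up one-feature data:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    # Iteratively adjust the coefficients, stepping in the direction that
    # improves the log-likelihood (a fixed number of gradient-ascent steps
    # stands in for "until no additional improvement").
    X = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    w = np.zeros(X.shape[1])                    # initial guess for all coefficients
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # current predicted probabilities
        w += lr * X.T @ (y - p) / len(y)        # gradient of the log-likelihood
    return w

# Made-up data: the outcome flips from 0 to 1 around x = 2.5
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(X, y)
```

After fitting, the model should assign a low probability to x = 0 and a high probability to x = 5, matching the labels above.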
Life insurance actuaries use logistic regression to predict, based on given data on a
policy holder (e.g. age, gender, results from a physical examination) the chances that
the policy holder will die before the term of the policy expires.
Political campaigns try to predict the chances that a voter will vote for their candidate
(or do something else desirable, such as donate to the campaign).
Chapter 4
System Design and Analysis
System Design is a process of planning a new business system or replacing an
existing system by defining its components or modules to satisfy the specific
requirements. Before planning, you need to understand the old system thoroughly
and determine how computers can best be used in order to operate efficiently.
System analysis is conducted for the purpose of studying a system or its parts in
order to identify its objectives. It is a problem solving technique that improves the
system and ensures that all the components of the system work efficiently to
accomplish their purpose.
4.1 Preliminary Design
Figure 4.1: Preliminary design of Diabetes Prediction system
The above figure 4.1 depicts the preliminary design of the diabetes prediction
system. The system initially considers the plasma glucose level of an
individual along with the person's age. Based on the glucose level, it
categorizes whether the person is diabetic or non-diabetic. The design shows
that a person whose glucose level is > 66.5 is considered diabetic, and a
person whose glucose value is ≤ 66.5 is considered non-diabetic. If the
diastolic blood pressure (DBP) is greater than 107, the patient is classified
as non-diabetic. If the DBP is less than or equal to 107, the plasma glucose
level is greater than 28, and the patient is pregnant, the patient is
classified as diabetic; otherwise, non-diabetic.
4.2 System Architecture
The below figure 4.2 shows the architecture of the Diabetes Prediction System.
The training data set is fed to the system as input and is initially
pre-processed. Data pre-processing is the phase in which raw data is
transformed into a meaningful and understandable format.
Figure 4.2: Architecture of Diabetes Prediction System
The pre-processed data is later classified using the best classification
mechanism. The classified data is then compared with the test data in order to
classify it accurately using some distance measures. The final classified data
is converted into data patterns using intelligent methods. The obtained
patterns are evaluated for accuracy and correctness, and the identified
patterns are represented as knowledge in the required form as output.
4.3 Data Flow Diagram
The data flow diagram of proposed Diabetes Prediction system is as shown in figure
4.3.
Figure 4.3: Data Flow Diagram
The above figure 4.3 shows the Data Flow Diagram of Diabetes Prediction. The
user enters preliminary symptoms; those symptoms are compared with the
symptoms database and mapped to a disease. The predicted disease is displayed
with its associated symptoms, after which the system checks for and suggests
suitable medicines from the medicine database.
4.3.1 DFD for Data Extraction
Figure 4.4: DFD for Data Extraction
The above figure 4.4 shows the DFD for Data Extraction for the proposed
system. Raw data is pre-processed. The processed file is divided into 70%
training data and 30% test data. A model is created from the training data and
saved. Results are then obtained for the test data using the saved model.
4.3.2 DFD for Classification of Data
Figure 4.5: DFD for Classification of Data
The above figure 4.5 shows the DFD for Classification of Data for the proposed
system. For the problem definition, data is collected and pre-processed. A
model is created for the gathered data, and data access, data sampling, and
data transformation are performed. After the model is created, it is evaluated
and interpreted. The created model is then applied to external applications.
4.4 Use Case Diagram
Figure 4.6: Use Case Diagram
The figure 4.6 above shows the use case diagram of the proposed system. Values
for each attribute are assigned based on their importance in the input
dataset. The system processes these values and generates a result: if the
result equals one, the patient is diabetic; otherwise, the patient is
non-diabetic.
Chapter 5
Implementation
In this work, a business-intelligence model has been developed to classify
diabetes types, based on attribute values, using a suitable machine learning
technique. The model was evaluated using a scientific approach to measure
accuracy. We use logistic regression to build our model.
Analysis:
In this final phase, we test our classification model on the prepared dataset
and measure its performance. To evaluate the created classifier and make it
comparable to current approaches, we use accuracy to measure its
effectiveness.
After model building, knowing the predictive power of the model on a new
instance is a very important issue. Once a predictive model is developed using
historical data, one would be curious how the model will perform on data it
has not seen during the model building process. One might even try multiple
model types for the same prediction problem and then want to know which model
to use for real-world decision making, simply by comparing them on their
prediction performance (e.g., accuracy). To measure the performance of a
predictor, there are commonly used performance metrics, such as accuracy and
recall. First, the most commonly used performance metrics are described, and
then some well-known estimation methodologies are explained and compared.
In classification problems, the primary source of performance measurements is
a coincidence matrix (also called a classification matrix or a contingency
table). Figure 5.1 below shows such a matrix for a two-class classification
problem; the most commonly used metrics can be calculated directly from it.
Confusion matrix:
Figure 5.1: Confusion matrix
The above figure 5.1 shows the confusion matrix for the proposed system. The
numbers along the diagonal from upper-left to lower-right represent the
correct decisions made, and the numbers outside this diagonal represent the
errors. The true positive rate (also called hit rate or recall) of a
classifier is estimated by dividing the correctly classified positives (the
true positive count) by the total positive count. The false positive rate
(also called the false alarm rate) of the classifier is estimated by dividing
the incorrectly classified negatives (the false positive count) by the total
negative count. The overall accuracy of a classifier is estimated by dividing
the total of correctly classified positives and negatives by the total number
of samples.
Numpy
NumPy, which stands for Numerical Python, is a library consisting of
multidimensional array objects and a collection of routines for processing
those arrays. Using NumPy, mathematical and logical operations on arrays can
be performed. This section explains the basics of NumPy, such as its
architecture and environment, along with the various array functions and types
of indexing, all illustrated with examples for better understanding.
Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package,
Numarray, was also developed, with some additional functionality. In 2005,
Travis Oliphant created the NumPy package by incorporating the features of
Numarray into the Numeric package. There are many contributors to this
open-source project.
Operations using NumPy
Using NumPy, a developer can perform the following operations −
Mathematical and logical operations on arrays.
Fourier transforms and routines for shape manipulation.
Operations related to linear algebra. NumPy has in-built functions for linear
algebra and random number generation.
NumPy – A Replacement for MatLab
NumPy is often used along with packages like SciPy (Scientific Python) and
Matplotlib (a plotting library). This combination is widely used as a
replacement for MatLab, a popular platform for technical computing. However,
this Python alternative to MatLab is now seen as a more modern and complete
programming language, and it is open source, which is an added advantage of
NumPy.
The most important object defined in NumPy is an N-dimensional array type
called ndarray. It describes the collection of items of the same type. Items in the
collection can be accessed using a zero-based index.
Every item in an ndarray takes the same size of block in the memory. Each element
in ndarray is an object of data-type object (called dtype).
Any item extracted from ndarray object (by slicing) is represented by a Python
object of one of array scalar types. The following diagram shows a relationship
between ndarray, data type object (dtype) and array scalar type −
Figure 5.2: ndarray
The above figure 5.2 shows the ndarray. An instance of the ndarray class can
be constructed by the different array creation routines described below. The
basic ndarray is created using the array function in NumPy as follows −
numpy.array
It creates an ndarray from any object exposing array interface, or from any method
that returns an array.
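A minimal example of the array function (the exact integer dtype chosen is platform-dependent):

```python
import numpy as np

# Build a 2-D ndarray from a nested Python list
a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.shape)    # (2, 3) -- two rows, three columns
print(a.dtype)    # an integer dtype; the exact width is platform-dependent
print(a[0, 2])    # 3 -- items are accessed with a zero-based index
```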
The ndarray objects can be saved to and loaded from the disk files. The IO functions
available are −
load() and save() functions handle NumPy binary files (with the .npy extension)
loadtxt() and savetxt() functions handle normal text files
NumPy introduces a simple file format for ndarray objects. This .npy file
stores the data, shape, dtype and other information required to reconstruct
the ndarray in a disk file, such that the array is correctly retrieved even if
the file is on another machine with a different architecture.
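A small round-trip sketch of these IO functions (the file names here are arbitrary):

```python
import os
import tempfile
import numpy as np

a = np.arange(6).reshape(2, 3)
tmp = tempfile.mkdtemp()

# Binary .npy round-trip: preserves shape and dtype exactly
npy_path = os.path.join(tmp, "demo.npy")
np.save(npy_path, a)            # save() writes the .npy binary format
b = np.load(npy_path)

# Plain-text round-trip: human-readable; values load back as floats
txt_path = os.path.join(tmp, "demo.txt")
np.savetxt(txt_path, a)
c = np.loadtxt(txt_path)

print((b == a).all(), c.shape)   # True (2, 3)
```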
Steps involved in predicting diabetes:
# import standard libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# import libraries for preparing the model
from sklearn.model_selection import train_test_split
from sklearn import metrics

# import library for logistic regression
from sklearn.linear_model import LogisticRegression

# load dataset
pimaDataset = pd.read_csv("diabetes.csv")

# show summary statistics for the fields
print(pimaDataset.describe())

# show the distribution of the various columns
pimaDataset.hist()
plt.show()

# split dataset into features and target variable
y = pimaDataset['Outcome']
X = pimaDataset.drop(['Outcome'], axis=1)

# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_train, y_train)

# predict values
y_pred = logreg.predict(X_test)

# print the confusion matrix to determine model accuracy
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
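The overall accuracy can then be read off the confusion matrix. The matrix below is a hypothetical example standing in for the one printed above, not our actual output:

```python
import numpy as np

# Hypothetical confusion matrix (rows: actual class, columns: predicted class)
cnf_matrix = np.array([[115, 15],
                       [ 25, 37]])

# Correct decisions lie on the main diagonal
accuracy = np.trace(cnf_matrix) / cnf_matrix.sum()
print(round(accuracy, 4))
```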
5.1 Different modules of the project
Training Data
Figure 5.3: Dataset
Figure 5.3 shows the dataset used to predict diabetes. This dataset is originally
from the National Institute of Diabetes and Digestive and Kidney Diseases. The
objective of the dataset is to diagnostically predict whether or not a patient has
diabetes, based on certain diagnostic measurements included in the dataset. Several
constraints were placed on the selection of these instances from a larger database.
In particular, all patients here are females at least 21 years old of Pima Indian
heritage. The dataset consists of several medical predictor (independent) variables
and one target (dependent) variable, Outcome. Independent variables include the
number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
5.2 Flowchart of the proposed system
Figure 5.4: Flowchart of the proposed system
The above figure 5.4 shows the flowchart of the proposed system. The data from
the patient database is used as training data; a model is created for it and
trained. When new patient details are provided, they are classified based on
the given details, and the recommended score and details are obtained.
Chapter 6
Experimental Results
6.1 Outcomes of the Proposed System
Diabetes prediction is a web-based machine learning application, trained on a
Pima Indians dataset. The user inputs their particular clinical information to
get a prediction of diabetes. The algorithm calculates the probability that
diabetes is present.
Data distribution:
Figure 6.1: Data distribution
The dataset consists of several medical predictor (independent) variables and one
target (dependent) variable, Outcome. Independent variables include the number of
pregnancies the patient has had, their BMI, insulin level, age, and so on.
Chapter 7
Testing
Software testing is defined as an activity to check whether the actual results
match the expected results and to ensure that the software system is
defect-free. It involves executing a software or system component to evaluate
one or more properties of interest.
7.1 Testing and Validations
Validation is a complex process with many possible variations and options, so
specifics vary from database to database, but the general outline is:
Requirement Gathering
o The Sponsor decides what the database is required to do based on
regulations, company needs, and any other important factors.
o The requirements are documented and approved.
System Testing
o Procedures to test the requirements are created and documented.
o The version of the database that will be used for validation is set up.
o The Sponsor approves the test procedures.
o The tests are performed and documented.
o Any needed changes are made. This may require another, shorter
round of testing and documentation.
System Release
o The validation documentation is finalized.
o The database is put into production.
7.2 Testing Levels
7.2.1 Functional Testing:
This type of testing is done against the functional requirements of the project.
Types:
Unit testing: Each unit/module of the project is individually tested to check
for bugs. Any bugs found by the testing team are reported to the developer for
fixing.
Integration testing: All the units are now integrated as one single unit and
checked for bugs. This also checks that all the modules work properly with
each other.
System testing: This testing checks for operating system compatibility. It
includes both functional and non-functional requirements.
Sanity testing: It ensures that a change in the code doesn't affect the
working of the project.
Smoke testing: This type of testing is a set of small tests designed for each
build.
Interface testing: Testing of the interface and its proper functioning.
Regression testing: Testing the software repeatedly when a new requirement is
added, when a bug is fixed, etc.
Beta/Acceptance testing: User-level testing to obtain user feedback on the
product.
7.2.2 Non-Functional Testing:
This type of testing is mainly concerned with the non-functional requirements such
as performance of the system under various scenarios.
Performance testing: Checks for speed, stability and reliability of the software,
hardware or even the network of the system under test.
Compatibility testing: This type of testing checks for compatibility of the system with
different operating systems, different networks etc.
Localization testing: This checks the localized version of the product, mainly
concerned with the UI.
Security testing: Checks whether the software has vulnerabilities and, if any,
fixes them.
Reliability testing: Checks the reliability of the software.
Stress testing: This testing checks the performance of the system when it is
exposed to different stress levels.
Usability testing: This type of testing checks how easily the software can be
used by the customers.
Compliance testing: A type of testing to determine the compliance of a system
with internal or external standards.
Figure 7.1 Non-Functional Testing
The above figure 7.1 gives an overview of non-functional testing.
7.3 White Box Testing
White Box Testing is defined as the testing of a software solution's internal structure,
design, and coding. In this type of testing, the code is visible to the tester. It focuses
primarily on verifying the flow of inputs and outputs through the application,
improving design and usability, strengthening security. White box testing is also
known as Clear Box testing, Open Box testing, Structural testing, Transparent Box
testing, Code-Based testing, and Glass Box testing. It is usually performed by
developers.
It is one of two parts of the "Box Testing" approach to software testing. Its
counterpart, Blackbox testing, involves testing from an external or end-user type
perspective. On the other hand, Whitebox testing is based on the inner workings of
an application and revolves around internal testing.
The term "WhiteBox" was used because of the see-through box concept. The clear
box or WhiteBox name symbolizes the ability to see through the software's outer
shell (or "box") into its inner workings. Likewise, the "black box" in "Black Box
Testing" symbolizes not being able to see the inner workings of the software so that
only the end-user experience can be tested.
White box testing verifies the following aspects of the code:
Internal security holes
Broken or poorly structured paths in the coding process
The flow of specific inputs through the code
Expected output
The functionality of conditional loops
Testing of each statement, object, and function on an individual basis
The testing can be done at system, integration and unit levels of software
development. One of the basic goals of whitebox testing is to verify a working flow
for an application. It involves testing a series of predefined inputs against expected
or desired outputs so that when a specific input does not result in the expected
output, you have encountered a bug.
7.4 Different Stages of Testing
7.4.1 Unit Testing
Unit Testing is a level of software testing where individual units/ components of a
software are tested. The purpose is to validate that each unit of the software
performs as designed. A unit is the smallest testable part of any software. It usually
has one or a few inputs and usually a single output. In procedural programming, a
unit may be an individual program, function, procedure, etc. In object-oriented
programming, the smallest unit is a method, which may belong to a base/ super
class, abstract class or derived/ child class. (Some treat a module of an application as
a unit. This is to be discouraged as there will probably be many individual units
within that module.) Unit testing frameworks, drivers, stubs, and mock/ fake objects
are used to assist in unit testing.
Unit Test Plan:
Unit Test Plan
o Prepare
o Review
o Rework
o Baseline
Unit Test Cases/Scripts
o Prepare
o Review
o Rework
o Baseline
Unit Test
o Perform
Benefits
Unit testing increases confidence in changing/ maintaining code. If good unit
tests are written and if they are run every time any code is changed, we will
be able to promptly catch any defects introduced due to the change. Also, if
codes are already made less interdependent to make unit testing possible,
the unintended impact of changes to any code is less.
Codes are more reusable. In order to make unit testing possible, codes need
to be modular. This means that codes are easier to reuse.
Development is faster. How? If you do not have unit testing in place, you
write your code and perform that fuzzy ‘developer test’ (You set some
breakpoints, fire up the GUI, provide a few inputs that hopefully hit your
code and hope that you are all set.) But, if you have unit testing in place, you
write the test, write the code and run the test. Writing tests takes time but
the time is compensated by the less amount of time it takes to run the tests;
You need not fire up the GUI and provide all those inputs. And, of course, unit
tests are more reliable than ‘developer tests’. Development is faster in the
long run too. How? The effort required to find and fix defects found during
unit testing is much less than the effort required to fix defects found during
system testing or acceptance testing.
The cost of fixing a defect detected during unit testing is lower than that of
defects detected at higher levels. Compare the cost (time, effort,
destruction, humiliation) of a defect detected during acceptance testing or
when the software is live.
Debugging is easy. When a test fails, only the latest changes need to be
debugged. With testing at higher levels, changes made over the span of
several days/weeks/months need to be scanned.
7.4.2 Integration Testing
Integration Testing is a level of software testing where individual units are combined
and tested as a group. The purpose of this level of testing is to expose faults in the
interaction between integrated units. Test drivers and test stubs are used to assist in
Integration Testing.
• Integration testing: testing performed to expose defects in the interfaces
and in the interactions between integrated components or systems. See also
component integration testing, system integration testing.
• Component integration testing: testing performed to expose defects in the
interfaces and interactions between integrated components.
• System integration testing: testing the integration of systems and
packages; testing interfaces to external organizations (e.g. Electronic Data
Interchange, Internet).
Tasks
• Integration Test Plan
o Prepare
o Review
o Rework
o Baseline
• Integration Test Cases/Scripts
o Prepare
o Review
o Rework
o Baseline
• Integration Test
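A test stub stands in for a unit that is not yet ready, while a test driver calls the units under test. As a hedged sketch of how this could look in a system like the one described here, the module and function names below are illustrative assumptions, not the report's actual code:

```python
class StubModel:
    """Test stub replacing the real trained classifier, so the
    reporting unit can be integration-tested before the model
    training code is finished."""
    def predict(self, features):
        # Always report "diabetic" so the caller's handling of a
        # positive prediction is exercised deterministically.
        return 1

def build_report(model, features):
    """Unit under test: formats the model's prediction for display."""
    label = "Diabetic" if model.predict(features) == 1 else "Not diabetic"
    return f"Prediction: {label}"

# Test driver: exercises the interaction between the two units.
def test_report_uses_model_output():
    assert build_report(StubModel(), [6, 148, 72]) == "Prediction: Diabetic"

test_report_uses_model_output()
```

When the real model is ready, the stub is swapped out and the same driver re-run, exposing any defects in the interface between the two units.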
7.4.3 System Testing
System Testing is a level of software testing in which a complete, integrated
software system is tested. The purpose of this test is to evaluate the
system’s compliance with the specified requirements: the integrated system is
tested as a whole to verify that it meets them.
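As an illustrative sketch of system testing for a prediction pipeline like this one, the whole chain (preprocessing, model, decision) is driven end to end against a specified requirement. The function names and the simple threshold rule standing in for the trained classifier are assumptions made for illustration:

```python
def preprocess(record):
    # Impute a missing (zero) glucose reading with a nominal value.
    return record["glucose"] or 100.0

def predict(glucose, threshold=126.0):
    # Stand-in decision rule for the trained model; 126 mg/dL is
    # the conventional fasting-glucose cutoff for diabetes.
    return 1 if glucose >= threshold else 0

def run_pipeline(record):
    # The complete, integrated system under test.
    return predict(preprocess(record))

# System-level checks against the specified requirement.
assert run_pipeline({"glucose": 150.0}) == 1
assert run_pipeline({"glucose": 95.0}) == 0
assert run_pipeline({"glucose": 0.0}) == 0  # missing value imputed to 100
```

Unlike a unit test, a failure here points at the integrated behaviour of the whole pipeline rather than at one isolated function.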
7.4.4 Acceptance Testing
Acceptance Testing is a level of software testing in which a system is tested
for acceptability. The purpose of this test is to evaluate the system’s
compliance with the business requirements and to assess whether it is
acceptable for delivery. It is formal testing with respect to user needs,
requirements, and business processes, conducted to determine whether the
system satisfies the acceptance criteria and to enable the user, customers,
or other authorized entity to decide whether or not to accept the system.
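For a predictive system, an acceptance criterion is often phrased as a measurable threshold, for example "the classifier must reach at least 70% accuracy on the held-out set". The sketch below checks such a criterion; the threshold and the label vectors are made-up illustrations, not this project's measured results:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical held-out labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

ACCEPTANCE_THRESHOLD = 0.70  # assumed business requirement
acc = accuracy(y_true, y_pred)
assert acc >= ACCEPTANCE_THRESHOLD, f"accuracy {acc:.2f} below threshold"
```

Passing this check would be one input into the acceptance decision; the formal sign-off still rests with the user or customer.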
Chapter 8
Conclusion and Future Enhancement
Project introduction is the first step in building a system. Basically, it
states what application or system we intend to build: a brief description of
the proposed project, the project scope, the project objectives, the problem
statement, and the expected outcome. This stage is then used as a reference
to ensure that the system meets the project scope and objectives.
8.1 Conclusion
Diabetes is a vital health problem in human society. This report has
summarised state-of-the-art techniques and available methods for prediction
of this disease. Deep learning, an emerging area of Machine Learning, has
shown promising results with high accuracy in other areas of clinical
diagnosis. It remains an open area waiting to be applied to diabetes
prediction. Some deep learning strategies that may be applied for diabetes
prediction have been discussed, alongside pioneering machine learning
algorithms. An analytical assessment has been carried out to find the best
available algorithm for the clinical dataset. In future our aim is to carry
this work forward to temporal clinical datasets, where the data vary with
time and retraining of the model is needed.
8.2 Future Enhancement
The proposed system predicts diabetes. It can be enhanced to predict diseases
efficiently and to suggest medications using machine learning. Further, the
system can be extended to any number of existing diseases, with appropriate
medications for each.
LIMITATIONS:
• The size of the database increases day by day, increasing the load on
database backup and data-maintenance activity.
• Training in basic computer operations is necessary for the users working on
the system.
References
• https://www.academia.edu
• https://en.wikipedia.org/wiki/Clinical_data_management
• http://shodh.inflibnet.ac.in:8080/jspui/bitstream/123456789/4170/3/03_lite
rature%20review.pdf
• World Health Organization. Available online: http://www.who.int (accessed
on 14 September 2018).
• Baldwin, D. Wayfinding technology: A road map to the future. J. Vis. Impair.
Blind. 2003, 97, 612–620.
• Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., &
Chouvarda, I. (2017). Machine learning and data mining methods in diabetes
research. Computational and structural biotechnology journal.
• Zheng, T., Xie, W., Xu, L., He, X., Zhang, Y., You, M., ... & Chen, Y. (2017). A
machine learning-based framework to identify type 2 diabetes through
electronic health records. International journal of medical informatics, 97,
120-127.
• Xu, W., Zhang, J., Zhang, Q., & Wei, X. (2017, February). Risk prediction of
type II diabetes based on random forest model. In Advances in Electrical,
Electronics, Information, Communication and BioInformatics (AEEICB), 2017
Third International Conference on (pp. 382-386). IEEE.
• Song, Y., Liang, J., Lu, J., & Zhao, X. (2017). An efficient instance selection
algorithm for k nearest neighbour regression. Neurocomputing, 251, 26-34.
• Komi, M., Li, J., Zhai, Y., & Zhang, X. (2017, June). Application of data mining
methods in diabetes prediction. In Image, Vision and Computing (ICIVC), 2017
2nd International Conference on (pp. 1006-1010). IEEE.
• Meza-Palacios, R., Aguilar-Lasserre, A. A., Ureña-Bogarín, E. L.,
Vázquez-Rodríguez, C. F., Posada-Gómez, R., & Trujillo-Mata, A. (2017).
Development of a fuzzy expert system for the nephropathy control assessment
in patients with type 2 diabetes mellitus. Expert Systems with Applications,
72, 335-343.
• Rani, A. S., & Jyothi, S. (2016, March). Performance analysis of classification
algorithms under different datasets. In Computing for Sustainable Global
Development (INDIACom), 2016 3rd International Conference on (pp. 1584-
1589). IEEE.
• Pradeep, K. R., & Naveen, N. C. (2016, December). Predictive analysis of
diabetes using J48 algorithm of classification techniques. In Contemporary
Computing and Informatics (IC3I), 2016 2nd International Conference on (pp.
347-352). IEEE.
• Perveen, S., Shahbaz, M., Guergachi, A., & Keshavjee, K. (2016). Performance
analysis of data mining classification techniques to predict diabetes. Procedia
Computer Science, 82, 115-121.
• Santhanam, T., & Padmavathi, M. S. (2015). Application of K-means and
genetic algorithms for dimension reduction by integrating SVM for diabetes
diagnosis. Procedia Computer Science, 47, 76-83.
• Kandhasamy, J. P., & Balamurali, S. (2015). Performance analysis of classifier
models to predict diabetes mellitus. Procedia Computer Science, 47, 45-51.
• Bashir, S., Qamar, U., Khan, F. H., & Javed, M. Y. (2014, December). An
Efficient Rule-Based Classification of Diabetes Using ID3, C4.5, & CART
Ensembles. In Frontiers of Information Technology (FIT), 2014 12th
International Conference on (pp. 226-231). IEEE.
• Meng, X. H., Huang, Y. X., Rao, D. P., Zhang, Q., & Liu, Q. (2013). Comparison
of three data mining models for predicting diabetes or prediabetes by risk
factors. The Kaohsiung journal of medical sciences, 29(2), 93-99.
• Krati Saxena, D., Khan, Z., & Singh, S. (2014). Diagnosis of Diabetes
Mellitus using K Nearest Neighbor Algorithm.
• Saravananathan, K., & Velmurugan, T. (2016). Analyzing Diabetic Data using
Classification Algorithms in Data Mining. Indian Journal of Science and
Technology, 9(43).
• Guo, Y., Bai, G., & Hu, Y. (2012, December). Using bayes network for
prediction of type-2 diabetes. In Internet Technology And Secured
Transactions, 2012 International Conference for (pp. 471-472). IEEE.