Disease Prediction From Various Symptoms Using Machine Learning

7 Pages Posted: 8 Oct 2020

Rinkal Keniya, Aman Khakharia, Vruddhi Shah, Vrushabh Gada, Ruchi Manjalkar, Tirth Thaker, Mahesh Warang, and Ninad Mehendale

K. J. Somaiya College of Engineering, University of Mumbai; Ninad's Research Lab

Date Written: July 27, 2020

Accurate and timely diagnosis of any health-related problem is important for the prevention and treatment of illness. The traditional way of diagnosis may not be sufficient in the case of a serious ailment. Developing a medical diagnosis system based on machine learning (ML) algorithms for the prediction of any disease can help achieve a more accurate diagnosis than the conventional method. We have designed a disease prediction system using multiple ML algorithms. The dataset used contained more than 230 diseases. Based on the symptoms, age, and gender of an individual, the diagnosis system outputs the disease that the individual might be suffering from. The weighted KNN algorithm gave the best results compared to the other algorithms, with a prediction accuracy of 93.5%. Our diagnosis model can act as a doctor for the early diagnosis of a disease, ensuring that treatment can take place on time and lives can be saved.

Keywords: Disease Prediction, Machine Learning, Symptoms

JEL Classification: I

K. J. Somaiya College of Engineering, University of Mumbai, Mumbai, Maharashtra 400007, India

Ninad Mehendale (Contact Author)

Ninad's Research Lab, M.G. Road, Naupada, Thane 400602, India


  • Research article
  • Open access
  • Published: 21 December 2019

Comparing different supervised machine learning algorithms for disease prediction

Shahadat Uddin (ORCID: 0000-0003-0091-6919), Arif Khan, Md Ekramul Hossain & Mohammad Ali Moni

BMC Medical Informatics and Decision Making, volume 19, Article number: 281 (2019)


Supervised machine learning algorithms have been a dominant method in the data mining field. Disease prediction using health data has recently emerged as a potential application area for these methods. This study aims to identify the key trends among different types of supervised machine learning algorithms and their performance and usage for disease risk prediction.

In this study, extensive research efforts were made to identify studies that applied more than one supervised machine learning algorithm to the prediction of a single disease. Two databases (Scopus and PubMed) were searched using several search terms. In total, we selected 48 articles for the comparison among different supervised machine learning algorithms for disease prediction.

We found that the Support Vector Machine (SVM) algorithm was applied most frequently (in 29 studies), followed by the Naïve Bayes algorithm (in 23 studies). However, the Random Forest (RF) algorithm showed comparatively superior accuracy: of the 17 studies where it was applied, RF showed the highest accuracy in 9 (53%). This was followed by SVM, which topped 41% of the studies in which it was considered.

This study provides a wide overview of the relative performance of different variants of supervised machine learning algorithms for disease prediction. This important information of relative performance can be used to aid researchers in the selection of an appropriate supervised machine learning algorithm for their studies.


Machine learning algorithms employ a variety of statistical, probabilistic and optimisation methods to learn from past experience and detect useful patterns from large, unstructured and complex datasets [ 1 ]. These algorithms have a wide range of applications, including automated text categorisation [ 2 ], network intrusion detection [ 3 ], junk e-mail filtering [ 4 ], detection of credit card fraud [ 5 ], customer purchase behaviour detection [ 6 ], optimising manufacturing process [ 7 ] and disease modelling [ 8 ]. Most of these applications have been implemented using supervised variants [ 4 , 5 , 8 ] of the machine learning algorithms rather than unsupervised ones. In the supervised variant, a prediction model is developed by learning a dataset where the label is known and accordingly the outcome of unlabelled examples can be predicted [ 9 ].

The scope of this research is primarily the performance analysis of disease prediction approaches using different variants of supervised machine learning algorithms. Disease prediction, and in a broader context medical informatics, has gained significant attention from the data science research community in recent years. This is primarily due to the wide adoption of computer-based technology in the health sector in different forms (e.g., electronic health records and administrative data) and the subsequent availability of large health databases for researchers. These electronic data are being utilised in a wide range of healthcare research areas, such as the analysis of healthcare utilisation [ 10 ], measuring the performance of a hospital care network [ 11 ], exploring patterns and cost of care [ 12 ], developing disease risk prediction models [ 13 , 14 ], chronic disease surveillance [ 15 ], and comparing disease prevalence and drug outcomes [ 16 ]. Our research focuses on disease risk prediction models involving machine learning algorithms (e.g., support vector machine, logistic regression and artificial neural network), specifically supervised learning algorithms. Models based on these algorithms use labelled training data of patients for training [ 8 , 17 , 18 ]. For the test set, patients are classified into several groups such as low risk and high risk.

Given the growing applicability and effectiveness of supervised machine learning algorithms in predictive disease modelling, research in this area still seems to be at an early stage. Specifically, we found little research that comprehensively reviews published articles employing different supervised learning algorithms for disease prediction. Therefore, this research aims to identify key trends among different types of supervised machine learning algorithms, their performance accuracies and the types of diseases being studied. In addition, the advantages and limitations of different supervised machine learning algorithms are summarised. The results of this study will help scholars better understand current trends and hotspots of disease prediction models using supervised machine learning algorithms and formulate their research goals accordingly.

In making comparisons among different supervised machine learning algorithms, this study reviewed, by following the PRISMA guidelines [ 19 ], existing studies from the literature that used such algorithms for disease prediction. More specifically, this article considered only those studies that used more than one supervised machine learning algorithm for a single disease prediction in the same research setting. This made the principal contribution of this study (i.e., comparison among different supervised machine learning algorithms) more accurate and comprehensive since the comparison of the performance of a single algorithm across different study settings can be biased and generate erroneous results [ 20 ].

Traditionally, standard statistical methods and the doctor's intuition, knowledge and experience have been used for prognosis and disease risk prediction. This practice often leads to unwanted biases, errors and high expenses, and negatively affects the quality of service provided to patients [ 21 ]. With the increasing availability of electronic health data, more robust and advanced computational approaches such as machine learning have become practical to apply and explore in the disease prediction area. In the literature, most related studies utilised one or more machine learning algorithms for a particular disease prediction. For this reason, the performance comparison of different supervised machine learning algorithms for disease prediction is the primary focus of this study.

In the following sections, we discuss different variants of supervised machine learning algorithm, followed by presenting the methods of this study. In the subsequent sections, we present the results and discussion of the study.

  • Supervised machine learning algorithm

In its most basic sense, machine learning uses programmed algorithms that learn and optimise their operations by analysing input data to make predictions within an acceptable range. As they are fed new data, these algorithms tend to make more accurate predictions. Although there are some variations in how machine learning algorithms are grouped, they can be divided into three broad categories according to their purposes and the way the underlying machine is taught: supervised, unsupervised and semi-supervised.

In supervised machine learning algorithms, a labelled training dataset is first used to train the underlying algorithm. The trained algorithm is then applied to the unlabelled test dataset to categorise those samples into groups. Using an abstract dataset for three diabetic patients, Fig. 1 illustrates how supervised machine learning algorithms work to categorise diabetic and non-diabetic patients. Supervised learning algorithms are well suited to two types of problems: classification and regression. In classification problems, the underlying output variable is discrete and categorised into different groups or categories, such as ‘red’ or ‘black’, or ‘diabetic’ and ‘non-diabetic’. In regression problems, the corresponding output variable is a real value, such as an individual’s risk of developing cardiovascular disease. In the following subsections, we briefly describe the supervised machine learning algorithms commonly used for disease prediction.

Figure 1: An illustration of how supervised machine learning algorithms work to categorise diabetic and non-diabetic patients based on abstract data.

Logistic regression

Logistic regression (LR) is a powerful and well-established method for supervised classification [ 22 ]. It can be considered an extension of ordinary regression, but it models only a dichotomous variable, which usually represents the occurrence or non-occurrence of an event. LR helps find the probability that a new instance belongs to a certain class. Since the output is a probability, it lies between 0 and 1. Therefore, to use LR as a binary classifier, a threshold needs to be assigned to differentiate the two classes. For example, a probability value higher than 0.50 for an input instance will classify it as ‘class A’; otherwise, ‘class B’. The LR model can be generalised to model a categorical variable with more than two values; this generalised version is known as multinomial logistic regression.
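The thresholding step can be sketched in a few lines of Python. The weights and bias below are purely illustrative stand-ins for a fitted model, not values from any study:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lr_classify(x, weights, bias, threshold=0.50):
    """Return the class label and the modelled probability of 'class A'."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    p = sigmoid(z)
    return ("class A" if p > threshold else "class B"), p

# Hypothetical two-feature instance and illustrative (not fitted) parameters
label, p = lr_classify([2.0, 1.0], weights=[1.2, -0.4], bias=-1.0)
print(label, round(p, 3))  # class A 0.731
```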

Support vector machine

The support vector machine (SVM) algorithm can classify both linear and non-linear data. It first maps each data item into an n-dimensional feature space, where n is the number of features and the value of each feature is the value of a specific coordinate. It then identifies the hyperplane that separates the data items into two classes while maximising the marginal distance for both classes and minimising the classification errors [ 23 ]. The marginal distance for a class is the distance between the decision hyperplane and its nearest instance that is a member of that class. Figure 2 provides a simplified illustration of an SVM classifier.

Figure 2: A simplified illustration of how the support vector machine works. The SVM has identified a hyperplane (actually a line) which maximises the separation between the ‘star’ and ‘circle’ classes.
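A toy decision rule for a linear SVM can be sketched as follows. The hyperplane parameters w and b are chosen by hand for illustration; in practice they come out of the margin-maximising optimisation described above:

```python
def svm_predict(x, w, b):
    """Classify by which side of the hyperplane w·x + b = 0 the point lies on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "star" if score >= 0 else "circle"

def signed_distance(x, w, b):
    """Signed distance from a point to the hyperplane (used to measure margins)."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hand-picked separating line x + y = 3 in a 2-feature space
w, b = [1.0, 1.0], -3.0
print(svm_predict([3.0, 2.0], w, b))   # star
print(svm_predict([0.5, 0.5], w, b))   # circle
```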

Decision tree

The decision tree (DT) is one of the earliest and most prominent machine learning algorithms. A decision tree models the decision logic, i.e., tests and corresponding outcomes, for classifying data items in a tree-like structure. The nodes of a DT normally have multiple levels, where the first or top-most node is called the root node. All internal nodes (i.e., nodes having at least one child) represent tests on input variables or attributes. Depending on the test outcome, the classification algorithm branches towards the appropriate child node, where the process of testing and branching repeats until it reaches a leaf node [ 24 ]. The leaf or terminal nodes correspond to the decision outcomes. DTs are easy to interpret and quick to learn, and are a common component of many medical diagnostic protocols [ 25 ]. When traversing the tree to classify a sample, the outcomes of all tests at each node along the path provide sufficient information to conjecture about its class. An illustration of a DT with its elements and rules is depicted in Fig. 3.

Figure 3: An illustration of a decision tree. Each variable (C1, C2, and C3) is represented by a circle and the decision outcomes (Class A and Class B) are shown by rectangles. In order to successfully classify a sample, each branch is labelled with either ‘True’ or ‘False’ based on the outcome value from the test of its ancestor node.
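The traversal logic can be made concrete with a small hand-built tree mirroring Fig. 3 (the test outcomes and class assignments below are invented for illustration):

```python
# Each internal node tests one Boolean variable; branches are labelled
# True/False; leaves hold the decision outcome (Class A or Class B).
tree = {
    "test": "C1",
    True:  {"test": "C2", True: "Class A", False: "Class B"},
    False: {"test": "C3", True: "Class B", False: "Class A"},
}

def dt_classify(sample, node):
    """Walk from the root, following the branch matching each test outcome."""
    while isinstance(node, dict):
        node = node[bool(sample[node["test"]])]
    return node

print(dt_classify({"C1": True, "C2": False, "C3": True}, tree))  # Class B
```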

Random forest

A random forest (RF) is an ensemble classifier consisting of many DTs, similar to the way a forest is a collection of many trees [ 26 ]. DTs that are grown very deep often overfit the training data, resulting in high variation in the classification outcome for a small change in the input data. They are very sensitive to their training data, which makes them error-prone on the test dataset. The different DTs of an RF are trained using different parts of the training dataset. To classify a new sample, its input vector is passed down each DT of the forest. Each DT then considers a different part of that input vector and gives a classification outcome. The forest then chooses the classification having the most ‘votes’ (for a discrete classification outcome) or the average of all trees (for a numeric classification outcome). Since the RF algorithm considers the outcomes from many different DTs, it can reduce the variance that results from considering a single DT for the same dataset. Figure 4 shows an illustration of the RF algorithm.

Figure 4: An illustration of a random forest consisting of three different decision trees, each trained using a random subset of the training data.
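Majority voting over several trees can be sketched as below. The three one-rule ‘trees’ and their thresholds are invented for illustration and are far simpler than real DTs grown on data:

```python
from collections import Counter

def forest_predict(sample, trees):
    """Each tree votes on the sample; the forest returns the majority class."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three toy 'trees', each looking at a different part of the input vector
trees = [
    lambda s: "diabetic" if s["glucose"] > 126 else "non-diabetic",
    lambda s: "diabetic" if s["bmi"] > 30 else "non-diabetic",
    lambda s: "diabetic" if s["age"] > 50 else "non-diabetic",
]

print(forest_predict({"glucose": 140, "bmi": 27, "age": 62}, trees))  # diabetic
```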

Naïve Bayes

Naïve Bayes (NB) is a classification technique based on Bayes’ theorem [ 27 ], which describes the probability of an event based on prior knowledge of conditions related to that event. This classifier assumes that a particular feature in a class is not directly related to any other feature, although features for that class could have interdependence among themselves [ 28 ]. Considering the task of classifying a new object (a white circle) into either the ‘green’ class or the ‘red’ class, Fig. 5 illustrates how the NB technique works. According to this figure, it is reasonable to believe that any new object is twice as likely to have ‘green’ membership as ‘red’, since there are twice as many ‘green’ objects (40) as ‘red’ (20). In Bayesian analysis, this belief is known as the prior probability. Therefore, the prior probabilities of ‘green’ and ‘red’ are 0.67 (40 ÷ 60) and 0.33 (20 ÷ 60), respectively. Now, to classify the ‘white’ object, we draw a circle around it encompassing several points (the number chosen in advance) irrespective of their class labels. Four points (three ‘red’ and one ‘green’) were considered in this figure. Thus, the likelihood of ‘white’ given ‘green’ is 0.025 (1 ÷ 40) and the likelihood of ‘white’ given ‘red’ is 0.15 (3 ÷ 20). Although the prior probability indicates that the new ‘white’ object is more likely to have ‘green’ membership, the likelihood shows that it is more likely to be in the ‘red’ class. In Bayesian analysis, the final classification is produced by combining both sources of information (i.e., the prior probability and the likelihood value). The ‘multiplication’ function is used to combine these two types of information, and the product is called the ‘posterior’ probability. Finally, the posterior probability of ‘white’ being ‘green’ is 0.017 (0.67 × 0.025) and the posterior probability of ‘white’ being ‘red’ is 0.049 (0.33 × 0.15). Thus, the new ‘white’ object should be classified as a member of the ‘red’ class according to the NB technique.

Figure 5: An illustration of the Naïve Bayes algorithm. The ‘white’ circle is the new sample instance, which needs to be classified into either the ‘red’ class or the ‘green’ class.
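The worked example above reduces to a few lines of arithmetic:

```python
# Counts from Fig. 5: 40 'green' and 20 'red' objects; the circle drawn
# around the new 'white' object contains 1 'green' and 3 'red' points.
counts = {"green": 40, "red": 20}
in_circle = {"green": 1, "red": 3}
total = sum(counts.values())

posterior = {}
for cls in counts:
    prior = counts[cls] / total                # P(green) = 40/60 ≈ 0.67
    likelihood = in_circle[cls] / counts[cls]  # P(white | green) = 1/40 = 0.025
    posterior[cls] = prior * likelihood        # unnormalised posterior

prediction = max(posterior, key=posterior.get)
print(prediction)  # red
```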

K-nearest neighbour

The K-nearest neighbour (KNN) algorithm is one of the simplest and earliest classification algorithms [ 29 ]. It can be thought of as a simpler version of an NB classifier; unlike the NB technique, the KNN algorithm does not require probability values. The ‘K’ in the KNN algorithm is the number of nearest neighbours from which to take a ‘vote’. Selecting different values for ‘K’ can generate different classification results for the same sample object. Figure 6 illustrates how KNN classifies a new object: for K = 3, the new object (star) is classified as ‘black’; however, it is classified as ‘red’ when K = 5.

Figure 6: A simplified illustration of the K-nearest neighbour algorithm. When K = 3, the sample object (‘star’) is classified as ‘black’ since it gets more ‘votes’ from the ‘black’ class. However, for K = 5 the same sample object is classified as ‘red’ since it now gets more ‘votes’ from the ‘red’ class.
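The vote-flipping behaviour in Fig. 6 can be reproduced with a minimal KNN on made-up points (coordinates and labels are invented so that K = 3 and K = 5 disagree):

```python
from collections import Counter

def knn_classify(sample, labelled_points, k):
    """Vote among the k nearest labelled points (squared Euclidean distance)."""
    nearest = sorted(
        labelled_points,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], sample)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Two 'black' points close to the sample, three 'red' points further out
points = [((1, 1), "black"), ((1, 2), "black"),
          ((2, 1), "red"), ((3, 2), "red"), ((3, 3), "red")]
sample = (1.2, 1.2)
print(knn_classify(sample, points, k=3))  # black
print(knn_classify(sample, points, k=5))  # red
```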

Artificial neural network

Artificial neural networks (ANNs) are a set of machine learning algorithms inspired by the functioning of the neural networks of the human brain. They were first proposed by McCulloch and Pitts [ 30 ] and later popularised by the works of Rumelhart et al. in the 1980s [ 31 ]. In the biological brain, neurons are connected to each other through multiple axon junctions, forming a graph-like architecture. These interconnections can be rewired (e.g., through neuroplasticity), which helps the brain to adapt, process and store information. Likewise, ANN algorithms can be represented as an interconnected group of nodes. The output of one node goes as input to another node for subsequent processing according to the interconnections. Nodes are normally grouped into a matrix called a layer, depending on the transformation they perform. Apart from the input and output layers, there can be one or more hidden layers in an ANN framework. Nodes and edges have weights that adjust the signal strength of a connection, which can be amplified or weakened through repeated training. Based on the training and the subsequent adaptation of the node and edge weights, ANNs can make predictions for the test data. Figure 7 shows an illustration of an ANN (with two hidden layers) and its interconnected group of nodes.

Figure 7: An illustration of the artificial neural network structure with two hidden layers. The arrows connect the output of nodes from one layer to the input of nodes of another layer.
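A forward pass through such a network, with two hidden layers and sigmoid nodes, can be sketched as below. All weights are arbitrary placeholders; in a real ANN they would be learned through training:

```python
import math

def layer(inputs, weights, biases):
    """One fully connected layer; each node applies a sigmoid to its weighted sum."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
        for row, b in zip(weights, biases)
    ]

def forward(x, network):
    """Feed the input through each layer in turn."""
    for weights, biases in network:
        x = layer(x, weights, biases)
    return x

# 2 inputs -> hidden layer of 2 nodes -> hidden layer of 2 nodes -> 1 output
network = [
    ([[0.5, -0.2], [0.3, 0.8]], [0.1, -0.1]),
    ([[1.0, -1.0], [0.4, 0.6]], [0.0, 0.2]),
    ([[0.7, -0.3]], [0.05]),
]
output = forward([1.0, 0.0], network)  # a single value in (0, 1)
```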

Data source and data extraction

Extensive research efforts were made to identify articles employing more than one supervised machine learning algorithm for disease prediction. Two databases were searched (October 2018): Scopus and PubMed. Scopus is an online bibliometric database developed by Elsevier. It has been chosen because of its high level of accuracy and consistency [ 32 ]. PubMed is a free publication search engine and incorporates citation information mostly for biomedical and life science literature. It comprises more than 28 million citations from MEDLINE, life science journals and online books [ 33 ]. MEDLINE is a bibliographic database that includes bibliographic information for articles from academic journals covering medicine, nursing, pharmacy, dentistry, veterinary medicine, and health care [ 33 ].

A comprehensive search strategy was followed to find all related articles. The search terms used in this strategy were:

“disease prediction” AND “machine learning”;

“disease prediction” AND “data mining”;

“disease risk prediction” AND “machine learning”; and

“disease risk prediction” AND “data mining”.

In the scientific literature, the generic term “machine learning” is often used for both “supervised” and “unsupervised” machine learning algorithms. In addition, there is a close relationship between the terms “machine learning” and “data mining”, with the latter commonly used in place of the former [ 34 ]. For these reasons, we used both “machine learning” and “data mining” in the search terms, although the focus of this study is on supervised machine learning algorithms. The four search terms were then used to search the titles, abstracts and keywords of articles in both Scopus and PubMed. This resulted in 305 and 83 articles from Scopus and PubMed, respectively. After combining these two lists and removing articles written in languages other than English, we found 336 unique articles.

Since the aim of this study was to compare the performance of different supervised machine learning algorithms, the next step was to select, from these 336, the articles that used more than one supervised machine learning algorithm for disease prediction. For this reason, we wrote a computer program in the Python programming language [ 35 ] that checked for the presence of the names of more than one supervised machine learning algorithm in the title, abstract and keyword list of each of the 336 articles. It found 55 articles that used more than one supervised machine learning algorithm for the prediction of different diseases. Of the remaining 281 articles, only 155 used one of the seven supervised machine learning algorithms considered in this study. The remaining 126 used either other machine learning algorithms (e.g., unsupervised or semi-supervised) or data mining methods other than machine learning. ANN was found most frequently (30.32%) in the 155 articles, followed by Naïve Bayes (19.35%).
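A minimal sketch of how such a filter might work is below; this is not the authors’ actual script, and the name list and matching rule are illustrative only:

```python
# Illustrative (not the study's) list of algorithm names to search for
ALGORITHM_NAMES = [
    "logistic regression", "support vector machine", "decision tree",
    "random forest", "naive bayes", "k-nearest neighbour", "neural network",
]

def algorithms_mentioned(text):
    """Return the set of listed algorithm names appearing in a text."""
    text = text.lower()
    return {name for name in ALGORITHM_NAMES if name in text}

abstract = ("We compare a support vector machine with a random forest "
            "and a decision tree for diabetes prediction.")
# An article passes the filter when more than one algorithm is mentioned
print(len(algorithms_mentioned(abstract)) > 1)  # True
```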

The next step was the manual inspection of all recovered articles. We noticed that four groups of authors reported their study results in two publication outlets (e.g., book chapter, conference and journal) using the same or different titles. For these four publications, we considered the most recent one. We further excluded another three articles because the reported prediction accuracies for all supervised machine learning algorithms used in those articles were the same. For each of the remaining 48 articles, the performance outcomes of the supervised machine learning algorithms used for disease prediction were gathered. Two diseases were predicted in one article [ 17 ], and two algorithms were found to show the best accuracy outcomes for a disease in one article [ 36 ], in which five different algorithms were used for prediction analysis. The number of publications per year is depicted in Fig. 8. The overall data collection procedure, along with the number of articles selected for different diseases, is shown in Fig. 9.

Figure 8: Number of articles published in different years.

Figure 9: The overall data collection procedure, showing the number of articles considered for each disease.

Figure 10 shows a comparison of the composition of the initially selected 329 articles with respect to the seven supervised machine learning algorithms considered in this study. ANN shows the highest percentage difference (16%) between the 48 selected articles of this study and the initially selected 155 articles that used only one supervised machine learning algorithm for disease prediction, followed by LR. The remaining five supervised machine learning algorithms show a percentage difference between 1 and 5%.

Figure 10: Composition of the initially selected 329 articles with respect to the seven supervised learning algorithms.

Classifier performance index

The diagnostic ability of classifiers is usually determined by the confusion matrix and the receiver operating characteristic (ROC) curve [ 37 ]. In the machine learning research domain, the confusion matrix is also known as the error or contingency matrix. The basic framework of the confusion matrix is provided in Fig. 11a. In this framework, true positives (TP) are the positive cases that the classifier correctly identified, and true negatives (TN) are the negative cases that the classifier correctly identified. False positives (FP) are the negative cases that the classifier incorrectly identified as positive, and false negatives (FN) are the positive cases that the classifier incorrectly identified as negative. The following measures, which are based on the confusion matrix, are commonly used to analyse the performance of classifiers, including those based on supervised machine learning algorithms.

Figure 11: (a) The basic framework of the confusion matrix; and (b) a presentation of the ROC curve.
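For example, accuracy, precision, recall and the F1 score all derive directly from the four cells of the matrix. The counts below are made up for illustration:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Common performance measures derived from the confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of predicted positives, how many were correct
    recall = tp / (tp + fn)      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

m = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
print(round(m["accuracy"], 2), round(m["recall"], 2))  # 0.85 0.8
```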

An ROC curve is one of the fundamental tools for diagnostic test evaluation and is created by plotting the true positive rate against the false positive rate at various threshold settings [ 37 ]. The area under the ROC curve (AUC) is also commonly used to determine the predictability of a classifier: a higher AUC value represents a superior classifier, and vice versa. Figure 11b illustrates three ROC curves based on an abstract dataset. The area under the blue ROC curve is half of the shaded rectangle, so its AUC value is 0.5. Because it covers a larger area, the AUC value for the red ROC curve is higher than that of the black ROC curve. Hence, the classifier that produced the red ROC curve shows higher predictive accuracy than the other two classifiers, which generated the blue and black ROC curves.

A few other measures are also used to assess the performance of different classifiers. One such measure is the root mean square error (RMSE): for a set of pairs of actual and predicted values, RMSE is the square root of the mean of the squared errors, where an error is the difference between an actual value and its corresponding predicted value. Another such measure is the mean absolute error (MAE), the mean of the absolute values of these differences.
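Both error measures can be computed directly from paired actual and predicted values (the numbers here are arbitrary):

```python
def rmse(actual, predicted):
    """Root mean square error: square root of the mean of the squared errors."""
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute errors."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]
print(round(mae(actual, predicted), 3))   # 1.0
print(round(rmse(actual, predicted), 3))  # 1.291
```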

The final dataset contained 48 articles, each of which implemented more than one variant of supervised machine learning algorithm for a single disease prediction. All implemented variants, as well as the most frequently used performance measures, were discussed in the previous sections. Based on these, we reviewed the finally selected 48 articles in terms of the methods used, the performance measures and the diseases they targeted.

Table 1 lists the names and references of the diseases and the corresponding supervised machine learning algorithms used to predict them. For each disease model, the better-performing algorithm is also described. This study considered 48 articles, which in total made predictions for 49 diseases or conditions (one article predicted two diseases [ 17 ]). For these 49 diseases, 50 algorithms were found to show superior accuracy, because for one disease two algorithms (out of five) showed the same highest accuracy [ 36 ]. The advantages and limitations of different supervised machine learning algorithms are shown in Table 2.

The comparison of the usage frequency and accuracy of different supervised learning algorithms is shown in Table 3. SVM was used most frequently (for 29 of the 49 diseases that were predicted), followed by NB, which was used in 23 articles. Although RF was considered the second-fewest number of times, it showed the highest percentage (53%) of superior accuracy, followed by SVM (41%).

Table 4 shows the performance comparison of different supervised machine learning algorithms for the most frequently modelled diseases. SVM showed the superior accuracy most often for three diseases (heart disease, diabetes and Parkinson’s disease). For breast cancer, ANN showed the superior accuracy most often.

A close investigation of Table 1 reveals an interesting result regarding the performance of different supervised learning algorithms, which is also reported in Table 4. Considering only the articles that used clinical and demographic data (15 articles), DT showed the superior result most often (6 times). Interestingly, SVM showed the superior result only once in these articles, although it showed the superior accuracy most often for heart disease, diabetes and Parkinson’s disease (Table 4). In the other 33 articles, which used research data other than the ‘clinical and demographic’ type, SVM and RF showed the superior accuracy most often (12 times) and second most often (7 times), respectively. In articles where 10-fold and 5-fold validation methods were used, SVM showed the superior accuracy most often (5 and 3 times, respectively). On the other hand, in articles where no validation method was used, ANN showed the superior accuracy most often. Figure 12 further illustrates the superior performance of SVM. Performance statistics from Table 4 were used in a normalised way to draw the two graphs: Fig. 12a illustrates the ROC graph for the four diseases (heart disease, diabetes, breast cancer and Parkinson’s disease) under the ‘disease names that were modelled’ criterion, while the ROC graph based on the ‘validation method followed’ criterion is presented in Fig. 12b.

Figure 12: Illustration of the superior performance of the support vector machine using ROC graphs (based on the data from Table 4): (a) for disease names that were modelled; and (b) for validation methods that were followed.

To avoid the risk of selection bias, from the literature we extracted those articles that used more than one supervised machine learning algorithm. The same supervised learning algorithm can generate different results across various study settings. There is a chance that a performance comparison between two supervised learning algorithms can generate imprecise results if they were employed in different studies separately. On the other side, the results of this study could suffer a variable selection bias from individual articles considered in this study. These articles used different variables or measures for disease prediction. We noticed that the authors of these articles did not consider all available variables from the corresponding research datasets. The inclusion of a new variable could improve the accuracy of an underperformed algorithm considered in the underlying study, and vice versa. This is one of the limitations of this study. Another limitation of this study is that we considered a broader level classification of supervised machine learning algorithms to make a comparison among them for disease prediction. We did not consider any sub-classifications or variants of any of the algorithms considered in this study. For example, we did not make any performance comparison between least-square and sparse SVMs; instead of considering them under the SVM algorithm. A third limitation of this study is that we did not consider the hyperparameters that were chosen in different articles of this study in comparing multiple supervised machine learning algorithms. It has been argued that the same machine learning algorithm can generate different accuracy results for the same data set with the selection of different values for the underlying hyperparameters [ 81 , 82 ]. The selection of different kernels for support vector machines can result a variation in accuracy outcomes for the same data set. 
Similarly, a random forest can generate different results as the number of decision trees within the underlying forest changes.
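The hyperparameter sensitivity described above is easy to demonstrate numerically. The sketch below is not taken from any of the reviewed studies; it trains a plain k-nearest-neighbour classifier on entirely synthetic two-class data and varies only the hyperparameter k, with all data and helper names hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: two overlapping Gaussian clusters (entirely synthetic).
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Hold out every fourth sample for testing.
test = np.arange(len(y)) % 4 == 0
X_tr, y_tr, X_te, y_te = X[~test], y[~test], X[test], y[test]

def knn_accuracy(k):
    """Plain k-NN: majority vote among the k nearest training points."""
    correct = 0
    for point, label in zip(X_te, y_te):
        d = np.linalg.norm(X_tr - point, axis=1)
        votes = y_tr[np.argsort(d)[:k]]
        correct += int(np.bincount(votes, minlength=2).argmax() == label)
    return correct / len(y_te)

for k in (1, 5, 15):
    print(f"k={k:2d}  accuracy={knn_accuracy(k):.3f}")
```

The same pattern applies to an SVM's kernel choice or a random forest's tree count: accuracy, and therefore the ranking between algorithms, can shift purely on such settings.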

This research studied the comparative performance of different supervised machine learning algorithms in disease prediction. Since clinical data and research scope vary widely between disease prediction studies, a comparison is only possible when a common benchmark on dataset and scope is established. Therefore, we chose only studies that applied multiple machine learning methods to the same data for the same disease prediction task. Regardless of the variations in frequency and performance, the results show the potential of these families of algorithms in disease prediction.

Availability of data and materials

The data used in this study can be extracted from online databases. The details of this extraction are described within the manuscript.

Abbreviations

AUC: Area under the ROC curve

DT: Decision tree

FN: False negative

FP: False positive

MAE: Mean absolute error

RMSE: Root mean square error

ROC: Receiver operating characteristic

TN: True negative

TP: True positive
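For reference, the confusion-matrix counts and error measures abbreviated above can be computed directly from a vector of predictions; the labels below are hypothetical:

```python
import numpy as np

# Hypothetical binary predictions vs. ground truth (1 = disease present).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives

accuracy = (tp + tn) / len(y_true)
mae = float(np.mean(np.abs(y_pred - y_true)))           # mean absolute error
rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))  # root mean square error

print(tp, tn, fp, fn, accuracy, mae, rmse)  # → 3 3 1 1 0.75 0.25 0.5
```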

Mitchell TM. Machine learning. Boston, MA: WCB/McGraw-Hill; 1997.

Sebastiani F. Machine learning in automated text categorization. ACM Comput Surveys (CSUR). 2002;34(1):1–47.

Sinclair C, Pierce L, Matzner S. An application of machine learning to network intrusion detection. In: Computer Security Applications Conference, 1999. (ACSAC’99) Proceedings. 15th Annual; 1999. p. 371–7. IEEE.

Sahami M, Dumais S, Heckerman D, Horvitz E. A Bayesian approach to filtering junk e-mail. In: Learning for Text Categorization: Papers from the 1998 workshop, vol. 62; 1998. p. 98–105. Madison, Wisconsin.

Aleskerov E, Freisleben B, Rao B. Cardwatch: A neural network based database mining system for credit card fraud detection. In: Computational Intelligence for Financial Engineering (CIFEr), 1997., Proceedings of the IEEE/IAFE 1997; 1997. p. 220–6. IEEE.

Kim E, Kim W, Lee Y. Combination of multiple classifiers for the customer's purchase behavior prediction. Decis Support Syst. 2003;34(2):167–75.

Mahadevan S, Theocharous G. “Optimizing Production Manufacturing Using Reinforcement Learning,” in FLAIRS Conference; 1998. p. 372–7.

Yao D, Yang J, Zhan X. A novel method for disease prediction: hybrid of random forest and multivariate adaptive regression splines. J Comput. 2013;8(1):170–7.

Michalski RS, Carbonell JG, Mitchell TM. Machine learning: an artificial intelligence approach. Springer Science & Business Media; 2013.

Culler SD, Parchman ML, Przybylski M. Factors related to potentially preventable hospitalizations among the elderly. Med Care. 1998;1:804–17.

Uddin MS, Hossain L. Social networks enabled coordination model for cost Management of Patient Hospital Admissions. J Healthc Qual. 2011;33(5):37–48.

Lee PP, et al. Cost of patients with primary open-angle glaucoma: a retrospective study of commercial insurance claims data. Ophthalmology. 2007;114(7):1241–7.

Davis DA, Chawla NV, Christakis NA, Barabási A-L. Time to CARE: a collaborative engine for practical disease prediction. Data Min Knowl Disc. 2010;20(3):388–415.

McCormick T, Rudin C, Madigan D. A hierarchical model for association rule mining of sequential events: an approach to automated medical symptom prediction; 2011.

Yiannakoulias N, Schopflocher D, Svenson L. Using administrative data to understand the geography of case ascertainment. Chron Dis Can. 2009;30(1):20–8.

Fisher ES, Malenka DJ, Wennberg JE, Roos NP. Technology assessment using insurance claims: example of prostatectomy. Int J Technol Assess Health Care. 1990;6(02):194–202.

Farran B, Channanath AM, Behbehani K, Thanaraj TA. Predictive models to assess risk of type 2 diabetes, hypertension and comorbidity: machine-learning algorithms and validation using national health data from Kuwait-a cohort study. BMJ Open. 2013;3(5):e002457.

Ahmad LG, Eshlaghy A, Poorebrahimi A, Ebrahimi M, Razavi A. Using three machine learning techniques for predicting breast cancer recurrence. J Health Med Inform. 2013;4(124):3.

Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Ann Intern Med. 2009;151(4):264–9.

Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006;7:1–30.

Palaniappan S, Awang R. Intelligent heart disease prediction system using data mining techniques. In: Computer Systems and Applications, 2008. AICCSA 2008. IEEE/ACS International Conference on; 2008. p. 108–15. IEEE.

Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression. Wiley; 2013.

Joachims T. Making large-scale SVM learning practical. Technical Report, SFB 475, University of Dortmund; 1998. p. 28.

Quinlan JR. Induction of decision trees. Mach Learn. 1986;1(1):81–106.

Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and prognosis. Cancer Informat. 2006;2:59–77.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Lindley DV. Fiducial distributions and Bayes’ theorem. J Royal Stat Soc. Series B (Methodological). 1958;1:102–7.

Rish I. An empirical study of the naive Bayes classifier. In: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3(22); 2001. p. 41–6. IBM, New York.

Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13(1):21–7.

McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys. 1943;5(4):115–33.

Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533.

Falagas ME, Pitsouni EI, Malietzis GA, Pappas G. Comparison of PubMed, Scopus, web of science, and Google scholar: strengths and weaknesses. FASEB J. 2008;22(2):338–42.

PubMed. (2018). https://www.ncbi.nlm.nih.gov/pubmed/ .

Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I. Machine learning and data mining methods in diabetes research. Comput Struct Biotechnol J. 2017;15:104–16.

Pedregosa F, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.

Borah MS, Bhuyan BP, Pathak MS, Bhattacharya P. Machine learning in predicting hemoglobin variants. Int J Mach Learn Comput. 2018;8(2):140–3.

Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27(8):861–74.

Aneja S, Lal S. Effective asthma disease prediction using naive Bayes—Neural network fusion technique. In: International Conference on Parallel, Distributed and Grid Computing (PDGC); 2014. p. 137–40. IEEE.

Ayer T, Chhatwal J, Alagoz O, Kahn CE Jr, Woods RW, Burnside ES. Comparison of logistic regression and artificial neural network models in breast cancer risk estimation. Radiographics. 2010;30(1):13–22.

Lundin M, Lundin J, Burke H, Toikkanen S, Pylkkänen L, Joensuu H. Artificial neural networks applied to survival prediction in breast cancer. Oncology. 1999;57(4):281–6.

Delen D, Walker G, Kadam A. Predicting breast cancer survivability: a comparison of three data mining methods. Artif Intell Med. 2005;34(2):113–27.

Chen M, Hao Y, Hwang K, Wang L, Wang L. Disease prediction by machine learning over big data from healthcare communities. IEEE Access. 2017;5:8869–79.

Cai L, Wu H, Li D, Zhou K, Zou F. Type 2 diabetes biomarkers of human gut microbiota selected via iterative sure independent screening method. PLoS One. 2015;10(10):e0140827.

Malik S, Khadgawat R, Anand S, Gupta S. Non-invasive detection of fasting blood glucose level via electrochemical measurement of saliva. SpringerPlus. 2016;5(1):701.

Mani S, Chen Y, Elasy T, Clayton W, Denny J. Type 2 diabetes risk forecasting from EMR data using machine learning. In: AMIA annual symposium proceedings, vol. 2012; 2012. p. 606. American Medical Informatics Association.

Tapak L, Mahjub H, Hamidi O, Poorolajal J. Real-data comparison of data mining methods in prediction of diabetes in Iran. Healthc Inform Res. 2013;19(3):177–85.

Sisodia D, Sisodia DS. Prediction of diabetes using classification algorithms. Procedia Comput Sci. 2018;132:1578–85.

Yang J, Yao D, Zhan X, Zhan X. Predicting disease risks using feature selection based on random forest and support vector machine. In: International Symposium on Bioinformatics Research and Applications; 2014. p. 1–11. Springer.

Juhola M, Joutsijoki H, Penttinen K, Aalto-Setälä K. Detection of genetic cardiac diseases by Ca 2+ transient profiles using machine learning methods. Sci Rep. 2018;8(1):9355.

Long NC, Meesad P, Unger H. A highly accurate firefly based algorithm for heart disease prediction. Expert Syst Appl. 2015;42(21):8221–31.

Jin B, Che C, Liu Z, Zhang S, Yin X, Wei X. Predicting the risk of heart failure with ehr sequential data modeling. IEEE Access. 2018;6:9256–61.

Puyalnithi T, Viswanatham VM. Preliminary cardiac disease risk prediction based on medical and behavioural data set using supervised machine learning techniques. Indian J Sci Technol. 2016;9(31):1–5.

Forssen H, et al. Evaluation of Machine Learning Methods to Predict Coronary Artery Disease Using Metabolomic Data. Stud Health Technol Inform. 2017;235: IOS Press:111–5.

Tang Z-H, Liu J, Zeng F, Li Z, Yu X, Zhou L. Comparison of prediction model for cardiovascular autonomic dysfunction using artificial neural network and logistic regression analysis. PLoS One. 2013;8(8):e70571.

Toshniwal D, Goel B, Sharma H. Multistage Classification for Cardiovascular Disease Risk Prediction. In: International Conference on Big Data Analytics; 2015. p. 258–66. Springer.

Alonso DH, Wernick MN, Yang Y, Germano G, Berman DS, Slomka P. Prediction of cardiac death after adenosine myocardial perfusion SPECT based on machine learning. J Nucl Cardiol. 2018;1:1–9.

Mustaqeem A, Anwar SM, Majid M, Khan AR. Wrapper method for feature selection to classify cardiac arrhythmia. In: Engineering in Medicine and Biology Society (EMBC), 39th Annual International Conference of the IEEE; 2017. p. 3656–9. IEEE.

Mansoor H, Elgendy IY, Segal R, Bavry AA, Bian J. Risk prediction model for in-hospital mortality in women with ST-elevation myocardial infarction: a machine learning approach. Heart Lung. 2017;46(6):405–11.

Kim J, Lee J, Lee Y. Data-mining-based coronary heart disease risk prediction model using fuzzy logic and decision tree. Healthc Inform Res. 2015;21(3):167–74.

Taslimitehrani V, Dong G, Pereira NL, Panahiazar M, Pathak J. Developing EHR-driven heart failure risk prediction models using CPXR (log) with the probabilistic loss function. J Biomed Inform. 2016;60:260–9.

Anbarasi M, Anupriya E, Iyengar N. Enhanced prediction of heart disease with feature subset selection using genetic algorithm. Int J Eng Sci Technol. 2010;2(10):5370–6.

Bhatla N, Jyoti K. An analysis of heart disease prediction using different data mining techniques. Int J Eng. 2012;1(8):1–4.

Thenmozhi K, Deepika P. Heart disease prediction using classification with different decision tree techniques. Int J Eng Res Gen Sci. 2014;2(6):6–11.

Tamilarasi R, Porkodi DR. A study and analysis of disease prediction techniques in data mining for healthcare. Int J Emerg Res Manag Technoly ISSN. 2015;1:2278–9359.

Marikani T, Shyamala K. Prediction of heart disease using supervised learning algorithms. Int J Comput Appl. 2017;165(5):41–4.

Lu P, et al. Research on improved depth belief network-based prediction of cardiovascular diseases. J Healthc Eng. 2018;2018:1–9.

Khateeb N, Usman M. Efficient Heart Disease Prediction System using K-Nearest Neighbor Classification Technique. In: Proceedings of the International Conference on Big Data and Internet of Thing; 2017. p. 21–6. ACM.

Patel SB, Yadav PK, Shukla DD. Predict the diagnosis of heart disease patients using classification mining techniques. IOSR J Agri Vet Sci (IOSR-JAVS). 2013;4(2):61–4.

Venkatalakshmi B, Shivsankar M. Heart disease diagnosis using predictive data mining. Int J Innovative Res Sci Eng Technol. 2014;3(3):1873–7.

Ani R, Sasi G, Sankar UR, Deepa O. Decision support system for diagnosis and prediction of chronic renal failure using random subspace classification. In: Advances in Computing, Communications and Informatics (ICACCI), 2016 International Conference on; 2016. p. 1287–92. IEEE.

Islam MM, Wu CC, Poly TN, Yang HC, Li YC. Applications of Machine Learning in Fatty Liver Disease Prediction. In: 40th Medical Informatics in Europe Conference, MIE 2018; 2018. p. 166–70. IOS Press.

Lynch CM, et al. Prediction of lung cancer patient survival via supervised machine learning classification techniques. Int J Med Inform. 2017;108:1–8.

Chen C-Y, Su C-H, Chung I-F, Pal NR. Prediction of mammalian microRNA binding sites using random forests. In: System Science and Engineering (ICSSE), 2012 International Conference on; 2012. p. 91–5. IEEE.

Eskidere Ö, Ertaş F, Hanilçi C. A comparison of regression methods for remote tracking of Parkinson’s disease progression. Expert Syst Appl. 2012;39(5):5523–8.

Chen H-L, et al. An efficient diagnosis system for detection of Parkinson’s disease using fuzzy k-nearest neighbor approach. Expert Syst Appl. 2013;40(1):263–71.

Behroozi M, Sami A. A multiple-classifier framework for Parkinson’s disease detection based on various vocal tests. Int J Telemed Appl. 2016;2016:1–9.

Hussain L, et al. Prostate cancer detection using machine learning techniques by employing combination of features extracting strategies. Cancer Biomarkers. 2018;21(2):393–413.

Zupan B, DemšAr J, Kattan MW, Beck JR, Bratko I. Machine learning for survival analysis: a case study on recurrence of prostate cancer. Artif Intell Med. 2000;20(1):59–75.

Hung C-Y, Chen W-C, Lai P-T, Lin C-H, Lee C-C. Comparing deep neural network and other machine learning algorithms for stroke prediction in a large-scale population-based electronic medical claims database. In: Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE, vol. 1; 2017. p. 3110–3. IEEE.

Atlas L, et al. A performance comparison of trained multilayer perceptrons and trained classification trees. Proc IEEE. 1990;78(10):1614–9.

Lucic M, Kurach K, Michalski M, Bousquet O, Gelly S. Are GANs created equal? a large-scale study. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems; 2018. p. 698–707. Curran Associates Inc.

Levy O, Goldberg Y, Dagan I. Improving distributional similarity with lessons learned from word embeddings. Trans Assoc Comput Linguistics. 2015;3:211–25.

Acknowledgements

Not applicable.

Funding

This study did not receive any funding.

Author information

Authors and Affiliations

Complex Systems Research Group, Faculty of Engineering, The University of Sydney, Room 524, SIT Building (J12), Darlington, NSW, 2008, Australia

Shahadat Uddin, Arif Khan & Md Ekramul Hossain

Health Market Quality Research Stream, Capital Markets CRC, Level 3, 55 Harrington Street, Sydney, NSW, Australia

Faculty of Medicine and Health, School of Medical Sciences, The University of Sydney, Camperdown, NSW, 2006, Australia

Mohammad Ali Moni

Contributions

SU: Originator of the idea, data analysis and writing. AK: Data analysis and writing. MEH: Data analysis and writing. MAM: Data analysis and critical review of the manuscript. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Shahadat Uddin.

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

The authors declare that they do not have any competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article.

Uddin, S., Khan, A., Hossain, M. et al. Comparing different supervised machine learning algorithms for disease prediction. BMC Med Inform Decis Mak 19 , 281 (2019). https://doi.org/10.1186/s12911-019-1004-8

Received : 28 January 2019

Accepted : 11 December 2019

Published : 21 December 2019

DOI : https://doi.org/10.1186/s12911-019-1004-8


  • Machine learning
  • Medical data
  • Disease prediction

BMC Medical Informatics and Decision Making

ISSN: 1472-6947

Published: 27 October 2021

Disease variant prediction with deep generative models of evolutionary data

  • Jonathan Frazer 1   na1 ,
  • Pascal Notin 2   na1 ,
  • Mafalda Dias 1   na1 ,
  • Aidan Gomez 2 ,
  • Joseph K. Min   ORCID: orcid.org/0000-0002-2781-7390 1 ,
  • Kelly Brock 1 ,
  • Yarin Gal   ORCID: orcid.org/0000-0002-2733-2078 2 &
  • Debora S. Marks   ORCID: orcid.org/0000-0001-9388-2281 1 , 3  

Nature volume  599 ,  pages 91–95 ( 2021 ) Cite this article


  • Computational models
  • Disease genetics
  • Genetic variation
  • Genetics research
  • Machine learning

A Publisher Correction to this article was published on 17 December 2021

This article has been updated

Quantifying the pathogenicity of protein variants in human disease-related genes would have a marked effect on clinical decisions, yet the overwhelming majority (over 98%) of these variants still have unknown consequences 1 , 2 , 3 . In principle, computational methods could support the large-scale interpretation of genetic variants. However, state-of-the-art methods 4 , 5 , 6 , 7 , 8 , 9 , 10 have relied on training machine learning models on known disease labels. As these labels are sparse, biased and of variable quality, the resulting models have been considered insufficiently reliable 11 . Here we propose an approach that leverages deep generative models to predict variant pathogenicity without relying on labels. By modelling the distribution of sequence variation across organisms, we implicitly capture constraints on the protein sequences that maintain fitness. Our model EVE (evolutionary model of variant effect) not only outperforms computational approaches that rely on labelled data but also performs on par with, if not better than, predictions from high-throughput experiments, which are increasingly used as evidence for variant classification 12 , 13 , 14 , 15 , 16 . We predict the pathogenicity of more than 36 million variants across 3,219 disease genes and provide evidence for the classification of more than 256,000 variants of unknown significance. Our work suggests that models of evolutionary information can provide valuable independent evidence for variant interpretation that will be widely useful in research and clinical settings.



Data availability

The data analysed and generated in this study, including multiple sequence alignments used in training, ClinVar annotations used for validation, population frequencies and predictions from our model, are available in  Supplementary Information and at evemodel.org. Predictions from other computational models are available through http://database.liulab.science/dbNSFP .  Source data are provided with this paper.

Code availability

The model code is available at https://github.com/OATML-Markslab/EVE , https://doi.org/10.5281/zenodo.5389490 .

Change history

17 December 2021

A Correction to this paper has been published: https://doi.org/10.1038/s41586-021-04207-6

Van Hout, C. V. et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature 586 , 749–756 (2020).

Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581 , 434–443 (2020).

Landrum, M. J. & Kattman, B. L. ClinVar at five years: delivering on the promise. Hum. Mutat. 39 , 1623–1630 (2018).

Raimondi, D. et al. DEOGEN2: prediction and interactive visualization of single amino acid variant deleteriousness in human proteins. Nucleic Acids Res. 45 , W201-W206 (2017).

Feng, B. J. PERCH: a unified framework for disease gene prioritization. Hum. Mutat. 38 , 243–251 (2017).

Ioannidis, N. M. et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99 , 877-885 (2016).

Ionita-Laza, I., McCallum, K., Xu, B. & Buxbaum, J. D. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet. 48 , 214–220 (2016).

Jagadeesh, K. A. et al. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat. Genet. 48 , 1581-1586 (2016).

Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47 , D886–D894 (2019).

Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7 , 248–249 (2010).

Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17 , 405–424 (2015).

Findlay, G. M. et al. Accurate classification of BRCA1 variants with saturation genome editing. Nature 562 , 217–222 (2018).

Glazer, A. M. et al. High-throughput reclassification of SCN5A variants. Am. J. Hum. Genet. 107 , 111–123 (2020).

Giacomelli, A. O. et al. Mutational processes shape the landscape of TP53 mutations in human cancer. Nat. Genet. 50 , 1381–1387 (2018).

Mighell, T. L., Evans-Dutson, S. & O’Roak, B. J. A saturation mutagenesis approach to understanding PTEN lipid phosphatase activity and genotype–phenotype relationships. Am. J. Hum. Genet. 102 , 943–955 (2018).

Jia, X. et al. Massively parallel functional testing of MSH2 missense variants conferring Lynch syndrome risk. Am. J. Hum. Genet. 108 , 163–175 (2021).

Cao, Y. et al. The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell Res. 30 , 717–731 (2020).

Gudbjartsson, D. F. et al. Large-scale whole-genome sequencing of the Icelandic population. Nat. Genet. 47 , 435–444 (2015).

Esposito, D. et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol. 20 , 223 (2019).

Trenkmann, M. Putting genetic variants to a fitness test. Nat. Rev. Genet. 19 , 667 (2018).

Rehm, H. L. et al. ClinGen—the Clinical Genome Resource. N. Engl. J. Med. 372 , 2235–2242 (2015).

Grimm, D. G. et al. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Hum. Mutat. 36 , 513–523 (2015).

Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35 , 128–135 (2017).

Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6 , e28766 (2011).

Hopf, T. A. et al. Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife 3 , e03430 (2014).

Lapedes, A., Giraud, B. & Jarzynski, C. Using sequence alignments to predict protein structure and stability with high accuracy. Preprint at https://arxiv.org/abs/1207.2484v1 (2012).

Vaser, R., Adusumalli, S., Leng, S. N., Sikic, M. & Ng, P. C. SIFT missense predictions for genomes. Nat. Protoc. 11 , 1–9 (2016).

Reva, B., Antipin, Y. & Sander, C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 39 , e118 (2011).

Rezende, D. J., Mohamed, S. & Wierstra, D. in Proceedings of the 31st International Conference on Machine Learning vol. 32 (eds Xing, E. P. & Jebara, T.) 1278–1286 (PMLR, 2014).

Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).

Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15 , 816–822 (2018).

Suzek, B. E., Wang, Y., Huang, H., McGarvey, P. B. & Wu, C. H. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31 , 926–932 (2015).

Kalia, S. S. et al. Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics. Genet. Med. 19 , 249–255 (2017).

Frigo, G. et al. Homozygous SCN5A mutation in Brugada syndrome with monomorphic ventricular tachycardia and structural heart abnormalities. Europace 9 , 391–397 (2007).

Itoh, H. et al. Asymmetry of parental origin in long QT syndrome: preferential maternal transmission of KCNQ1 variants linked to channel dysfunction. Eur. J. Hum. Genet. 24 , 1160–1166 (2016).

Glazer, A. M. et al. Deep mutational scan of an SCN5A voltage sensor. Circ. Genom. Precis. Med. 13 , e002786 (2020).

Bouvet, D. et al. Methylation tolerance-based functional assay to assess variants of unknown significance in the MLH1 and MSH2 genes and identify patients with Lynch syndrome. Gastroenterology 157 , 421–431 (2019).

Pan, X. et al. Structure of the human voltage-gated sodium channel Na v 1.4 in complex with β1. Science 362 , eaau2486 (2018).

Fishel, R. et al. The human mutator gene homolog MSH2 and its association with hereditary nonpolyposis colon cancer. Cell 75 , 1027–1038 (1993).

Peltomaki, P. Role of DNA mismatch repair defects in the pathogenesis of human cancer. J. Clin. Oncol. 21 , 1174-1179 (2003).

Warren, J. J. et al. Structure of the human MutSα DNA lesion recognition complex. Mol. Cell 26 , 579–592 (2007).

Brnich, S. E. et al. Recommendations for application of the functional evidence PS3/BS3 criterion using the ACMG/AMP sequence variant interpretation framework. Genome Med. 12 , 3 (2019).

Lewontin, R. C. The Genetic Basis of Evolutionary Change (Columbia Univ. Press, 1974).

Kreitman, M. Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster . Nature 304 , 412-417 (1983).

Sunyaev, S. et al. Prediction of deleterious human alleles. Hum. Mol. Genet. 10 , 591–597 (2001).

IUCN. The IUCN red list of threatened species. IUCN https://www.iucnredlist.org (2020).

Acknowledgements

We thank members of the Marks laboratory, OATML and C. Sander for many valuable discussions. J.F., M.D. and K.B. are supported by the Chan Zuckerberg Initiative CZI2018-191853. K.B. is also supported by the US National Institutes of Health (R01 R01GM120574). P.N. is supported by GSK and the UK Engineering and Physical Sciences Research Council (EPSRC ICASE award no. 18000077). A.G. is a Clarendon Scholar and Open Philanthropy AI Fellow. Y.G. holds a Turing AI Fellowship (Phase 1) at the Alan Turing Institute, which is supported by EPSRC grant reference V030302/1. D.S.M. holds a Ben Barres Early Career Award by the Chan Zuckerberg Initiative as part of the Neurodegeneration Challenge Network, CZI2018-191853.

Author information

These authors contributed equally: Jonathan Frazer, Pascal Notin, Mafalda Dias

Authors and Affiliations

Marks Group, Department of Systems Biology, Harvard Medical School, Boston, MA, USA

Jonathan Frazer, Mafalda Dias, Joseph K. Min, Kelly Brock & Debora S. Marks

OATML Group, Department of Computer Science, University of Oxford, Oxford, UK

Pascal Notin, Aidan Gomez & Yarin Gal

Broad Institute of Harvard and MIT, Cambridge, MA, USA

Debora S. Marks

Contributions

D.S.M. and Y.G. led the research. J.F., P.N. and M.D. conceived and implemented the end-to-end approach. A.G. contributed technical advice. K.B. supported with data preparation. J.K.M. developed the website. J.F., P.N., M.D., Y.G. and D.S.M. wrote the manuscript.

Corresponding authors

Correspondence to Yarin Gal or Debora S. Marks .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Peer review information Nature thanks Martin Kircher and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Bayesian VAE architecture details.

The Bayesian VAE architecture in EVE comprises a symmetric three-layer encoder and decoder (with 2,000–1,000–300 and 300–1,000–2,000 units, respectively) and a latent space of dimension 50. After performing a one-hot encoding of the input sequence across amino acids (zeros in white, ones in green), we flatten the input before performing the forward pass through the network. We use a single set of parameters for the encoder (ϕ) and learn a fully factorized Gaussian distribution over the weights of the decoder (θ): weight samples for the decoder are obtained by sampling a random normal variable (rnv), multiplying that sample by the standard-deviation parameters, and then adding the mean parameters. A one-dimensional convolution is applied to the un-flattened output of the decoder to capture potential correlations in amino-acid usage. Finally, a softmax activation turns the final output into probabilities over amino acids at each position of the sequence (low values in white, high values in dark green). The overall network is trained by maximizing the evidence lower bound (ELBO), which forms a tractable lower bound to the log-marginal likelihood ( Supplementary Methods and Fig. 1 ).
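The weight-sampling (reparameterisation) step described in the caption can be sketched numerically. This is a hedged illustration, not EVE's actual implementation: all dimensions and parameter values below are hypothetical toy choices (EVE's latent dimension is 50 and its layers are far wider), and only the decoder-weight sampling, softmax output, and the KL term of the ELBO are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
AA = 20       # amino-acid alphabet size
L_SEQ = 5     # toy sequence length (hypothetical; EVE uses the full protein)
Z_DIM = 8     # toy decoder-input width (hypothetical)

# Hypothetical learned variational parameters for the decoder weights.
w_mean = rng.normal(size=(Z_DIM, L_SEQ * AA)) * 0.1
w_logstd = np.full((Z_DIM, L_SEQ * AA), -2.0)

# Reparameterisation, as in the caption: sample a random normal variable,
# multiply by the standard deviation, then add the mean.
eps = rng.normal(size=w_mean.shape)
w_sample = w_mean + np.exp(w_logstd) * eps

# Decode a toy latent vector; a softmax turns the logits into probabilities
# over amino acids at each position of the sequence.
z = rng.normal(size=Z_DIM)
logits = (z @ w_sample).reshape(L_SEQ, AA)
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# KL term of the ELBO for the factorized Gaussian weight posterior
# against a standard normal prior.
kl = 0.5 * np.sum(np.exp(2 * w_logstd) + w_mean ** 2 - 1.0 - 2 * w_logstd)
```

The full ELBO additionally includes the reconstruction log-likelihood and the KL term for the latent encoding, which are omitted here.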

Extended Data Fig. 2 Comparison of performance of Bayesian VAE and DeepSequence against 38 deep mutation scans.

Comparison between the performance of the Bayesian VAE architecture in EVE and DeepSequence 46 which achieves state-of-the-art performance on the protein function prediction task. “Evolutionary indices” were computed by sampling 2k times from the approximate posterior distribution and by ensembling the obtained indices over 5 independently trained VAEs ( Supplementary Methods ).

Extended Data Fig. 3 Evolutionary index separates pathogenic and benign variants.

a , Average evolutionary index per protein, and corresponding standard deviations, for variants with known Benign and Pathogenic ClinVar labels across 3,219 proteins (sorted by alphabetical order). On the right, marginal distributions of the means over the 3,219 proteins. Evolutionary index separates pathogenic and benign labels consistently across proteins. b , Two-component Gaussian Mixture Models (GMM) over the distributions of the evolutionary indices (histograms) for all the single amino acid variants of 3,219 proteins combined (top, left) and for P53, PTEN and SCN5A separately (top right, bottom left and right, respectively). The dashed black line is the marginal likelihood for the GMM model, i.e. the likelihood of a variant sequence after marginalizing the latent variable that corresponds to the mixture assignment; the dashed blue and red lines represent the relative share of the marginal likelihood from the benign and pathogenic clusters respectively ( i.e . the product of the marginal likelihood by each cluster).

Extended Data Fig. 4 EVE prediction for actionable genes and EVE comparison to other computational methods, including meta-predictors.

a , EVE AUCs versus ClinVar labels for set of ACMG “actionable genes” 33 that have 15 or more labels (shown in parentheses). AUCs are computed both for EVE scores of all variants (pale blue), and of the 75% variants with most confident scores (dark blue) ( Supplementary Methods ). b , Performance comparison of EVE to state-of-the-art computational variant effect predictors: 7 unsupervised, 8 supervised, and 8 supervised meta-prediction methods. Size of marker indicates how many genes for which the method would be relevant (on a per-protein basis validation) ( Supplementary Methods , Supplementary Notes 2 , Fig. 2 , Supplementary Tables 3 , 4 ).

Extended Data Fig. 5 Computational model EVE as good as high-throughput experiments for clinical labels.

(Companion to Fig. 3 ) a , Comparison of computational model predictions (upper panels EVE score) and experimental assay predictions (lower panels, experimental assay metric) to ClinVar labels (dots) and VUS (crosses) and where pale red and pale blue crosses indicate EVE assignments of VUS. Dashed red and blue lines correspond to EVE predictions after removing the 25% most uncertain variants (computed on all variants across all proteins; see  Supplementary Methods ). x-axes are position in protein. Experimental measurements data from deep mutational scans of P53 14 , from left (WT_Nutlin-3, A549_p53NULL_Nutlin-3, A549_p53NULL_Etoposide), SCN5A 13 , and BRCA1 12 . b , Scatter plots of experiment scores (y-axis) against EVE scores (x-axis). Experimental measurements data from deep mutational scans same as a ( Supplementary Methods , Supplementary Table 6 ).

Extended Data Fig. 6 Comparison of label policies, and comparison of EVE and experimental predictions of clinical labels.

a , The y-axis is the subset of the ACMG actionable protein list with at least 5 benign and 5 pathogenic labels with at least a one-star review status in ClinVar, mean for the 3,219 proteins and mean for this subset. x-axis is AUCs computed using these labels (deep blue), labels with at least a two-star review status (light grey) and a more lenient labelling policy (sky blue), as defined in  Supplementary Methods . b , AUC of EVE predictions (blue circle) and experimental predictions (blue cross) computed on ClinVar labels. Whilst most of the papers that provide these experimental results refer to the goal of predicting association to human disease, the assays vary in their relevance to disease phenotype. Results use high-quality labels whenever they are sufficient for robust validation (MSH2, P53, BRCA1) and lenient labels for all other cases, and 2017-release ClinVar data whenever experimental results were used in defining labels reported in 2021 (P53 and BRCA1). Reported are averages of all displayed AUC values, and of AUCs computed exclusively on 2017- and 2021-release high-quality labels ( Supplementary Methods , Supplementary Table 5 , 6 ).

Extended Data Fig. 7 EVE has many more genes that can be validated on, compared to supervised methods.

Mean number of genes, for EVE (dark blue) and a supervised method (light blue), that have sufficient labels for validation (5 (left), 10 (middle) and 20 labels (right)). We assume a 90% train 10% test random split of all labels in ClinVar for the supervised methods.

Extended Data Fig. 8 Data provided on our server evemodel.org.

Screenshot of evemodel.org for the example of KCNQ2. Our server provides information about each protein: aggregate AUC/Accuracy, performance curves (ROC & Precision-recall), variant-level EVE scores, classification and uncertainties, as well as the multiple sequence alignments used for training. All data is available to download both in bulk and for individual genes.

Extended Data Fig. 9 Fraction of genes per person with more than one variant.

Density function of the fraction of total genes per person with at least two variants, though not necessarily in the same chromosome. Data extracted from 50k genomes of the UK Biobank with self-reported ethnicity backgrounds ( Supplementary Methods ).

Extended Data Fig. 10 Performance as a function of alignment depth.

Average AUC of EVE scores as a function of N/L cov for the subset of genes with at least 10 known clinical labels (5 benign and 5 pathogenic). For this subset of genes, the performance of the model can be carefully validated using AUCs. There is no strong correlation between alignment depth and performance: while models with very deep alignments tend to have good performance, models with very low N/L cov can also have AUC close to 1.

Supplementary information

Supplementary Information

This file contains Supplementary Methods, Supplementary Note 1 on the limitations of supervised modeling methods and Supplementary Note 2 with comments on meta-predictors.

Reporting Summary

Supplementary Table 1

Statistics of multiple sequence alignments used in training.

Supplementary Table 2

Summary of class assignments and combination with other sources of evidence.

Supplementary Table 3

Comparison of performance of EVE and other computational models at predicting ClinVar labels.

Supplementary Table 4

Comparison of performance of EVE and other computational models at predicting results from high-throughput functional assays.

Supplementary Table 5

Sensitivity of predictions to label policy.

Supplementary Table 6

Comparison of performance of high-throughput experiments and EVE at predicting clinical labels.

Supplementary Table 7

Genes with at least 10 labels sorted by EVE performance.

Supplementary Table 8

Variants for which there is disagreement between EVE predictions and ClinVar labels.

Supplementary Table 9

All pairs of variants occurring in the same gene over the UK Biobank population for the actionable genes defined by ACMG.

Source data

Source Data Fig. 2

Source Data Fig. 3

Source Data Fig. 4

Rights and permissions

Reprints and permissions

About this article

Cite this article

Frazer, J., Notin, P., Dias, M. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599 , 91–95 (2021). https://doi.org/10.1038/s41586-021-04043-8

Download citation

Received : 18 December 2020

Accepted : 20 September 2021

Published : 27 October 2021

Issue Date : 04 November 2021

DOI : https://doi.org/10.1038/s41586-021-04043-8





  • Open access
  • Published: 10 October 2023

Early prediction of heart disease with data analysis using supervised learning with stochastic gradient boosting

  • Anil Pandurang Jawalkar 1 ,
  • Pandla Swetcha 1 ,
  • Nuka Manasvi 1 ,
  • Pakki Sreekala 1 ,
  • Samudrala Aishwarya 1 ,
  • Potru Kanaka Durga Bhavani 1 &
  • Pendem Anjani 1  

Journal of Engineering and Applied Science volume  70 , Article number:  122 ( 2023 ) Cite this article


Heart diseases are consistently ranked among the top causes of mortality on a global scale. Early detection and accurate heart disease prediction can help effectively manage and prevent the disease. However, traditional methods have failed to improve heart disease classification performance. This article therefore proposes a machine learning approach for heart disease prediction (HDP) using a decision tree-based random forest (DTRF) classifier with loss optimization. First, a dataset of patient records with known labels indicating the presence or absence of heart disease is preprocessed. A DTRF classifier is then trained on the dataset using the stochastic gradient boosting (SGB) loss optimization technique, and its performance is evaluated on a separate test dataset. The results demonstrate that the proposed HDP-DTRF approach achieved 86% precision, 86% recall, an 85% F1-score, and 96% accuracy on publicly available real-world datasets, which is higher than traditional methods.

Introduction

In the United States, one person dies of cardiovascular disease roughly every 36 seconds. Coronary heart disease is the leading cause of mortality in the USA, accounting for one out of every four fatalities each year and claiming the lives of about 0.66 million people annually [ 1 ]. The expenditures associated with cardiovascular disease place a significant burden on the US healthcare system: in 2021 and 2022, it resulted in annual costs of around $219 billion, owing to the increased demand for medical treatment and medication and the productivity lost to deaths. Table 1 provides the statistics of the heart disease dataset with total heart disease cases, deaths, case fatality rate, and total vaccinations. A prompt diagnosis also aids in preventing heart failure, which is another potential cause of mortality in certain cases [ 2 ]. Because many traits put a person at risk of acquiring the ailment, heart disease is difficult to diagnose in its early stages. Diabetes, hypertension, elevated cholesterol levels, an irregular pulse rhythm, and a wide variety of other conditions are risk factors that can contribute to it [ 3 ]. These ailments are grouped and discussed under the umbrella term "heart disease." The symptoms of cardiac disease can differ considerably from one individual to the next and from one condition to another within the same patient [ 4 ]. Identifying and classifying cardiovascular diseases is a continuous process that has a chance of being fruitful only when carried out by a qualified professional with appropriate knowledge and skill in the relevant sector. Many variables and criteria, such as age, diabetes, smoking, being overweight, and eating a diet high in junk food, have been shown to either cause heart disease or raise the risk of developing it [ 5 ].

Most hospitals use management software to monitor the clinical and patient data they collect, and such systems generate a vast quantity of information on patients. Yet these data are seldom used for decision-making support in clinical settings; they are precious, but a significant portion of the knowledge they contain goes unused [ 6 ]. Because of the sheer volume of data involved, translating acquired clinical data into information that intelligent systems can use to assist healthcare practitioners in making decisions is a process fraught with difficulties [ 7 ]. Intelligent systems put this knowledge to use to enhance the quality of treatment provided to patients. This issue motivated research on the processing of medical images: because there were not enough specialists and too many cases were misdiagnosed, an automated detection method that was both quick and effective was necessary [ 8 ].

The primary objective of the research centers on the effective use of a classifier model that categorizes and identifies vital components within complex medical data. This categorization is a critical step towards enabling early diagnosis of cardiovascular diseases, potentially contributing to improved patient outcomes and healthcare management [ 9 ]. However, predicting disease at an early stage is not without its challenges. One significant factor is the inherent complexity of the predictive methods employed in the classification process [ 10 ]. The intricate nature of these methods can make the underlying decision-making difficult to interpret, which might impede the integration of these models into clinical practice. Furthermore, the efficiency of disease prediction models is affected by the time they take to execute. Swift diagnosis and intervention are crucial in medical conditions, and time-intensive models might not align with the urgency required for timely medical decisions. Researchers [ 11 ] have investigated various alternative strategies to forecast cardiovascular diseases; accurate treatment and diagnosis have the potential to save countless lives. The novel contributions of this work are as follows:

Preprocessing of HDP dataset with normalization, exploratory data analysis (EDA), data visualization, and extraction of top correlated features.

Implementation of DTRF classifier for training preprocessed dataset, which can accurately predict the presence or absence of heart disease.

The SGB loss optimization is used to reduce the losses generated during the training process, which tunes the hyperparameters of DTRF.

The rest of the article is organized as follows: Section 2 gives a detailed literature survey. Section 3 presents a detailed analysis of the proposed HDP-DTRF with its multiple modules. Section 4 provides a detailed simulation analysis of the proposed HDP-DTRF. Section 5 concludes the article.

Literature survey

Rani et al. [ 12 ] designed a novel hybrid decision support system to diagnose cardiac ailments early. They effectively addressed the missing data challenge by employing multivariate imputations through chained equations. Additionally, their unique approach to feature selection involved a fusion of genetic algorithms (GA) and recursive feature reduction. Notably, the integration of random forest classifiers played a pivotal role in significantly enhancing the accuracy of their system. However, despite these advancements, their hybrid approach’s complexity might have posed challenges in terms of interpretability and practical implementation. Kavitha et al. [ 13 ] embraced machine learning techniques to forecast cardiac diseases. They introduced a hybrid model by incorporating random forest as the base classifier. This hybridization aimed to enhance prediction accuracy; however, their decision to capture and store user input parameters for future use was intriguing but yielded suboptimal classification performance. This unique approach could be viewed as an innovative attempt to integrate patient-specific information, yet the exact impact on overall performance warrants further investigation.

Mohan et al. [ 14 ] further advanced the field by employing a hybrid model that combined random forest with a linear model to predict cardiovascular diseases. Through this amalgamation of different classification approaches and feature combinations, they achieved commendable performance with an accuracy of 88.7%. However, it is worth noting that while hybrid models show promise, the trade-offs between complexity and interpretability could influence their practical utility in real-world clinical settings. To predict heart diseases, Shah et al. [ 15 ] adopted supervised learning techniques, including Naive Bayes, decision trees, K-nearest neighbor (KNN), and random forest algorithms. Their choice of the Cleveland database from the UCI repository as their data source added a sense of universality to their findings. However, the lack of customization in data sources might limit the applicability of their model to diverse patient populations with varying characteristics. Guo et al. [ 16 ] contributed to the field by harnessing an improved learning machine (ILM) model in conjunction with machine learning techniques. Integrating novel feature combinations and categorization methods showcased their dedication to enhancing performance and accuracy. Nonetheless, while their approach exhibits promising results, the precise impact of specific feature combinations on prediction accuracy could have been further explored. Hager Ahmed et al. [ 17 ] presented an innovative real-time prediction system for cardiac diseases using Apache Spark and Apache Kafka. This system, characterized by its three-tier architecture—offline model building, online prediction, and stream processing pipeline—highlighted its commitment to harnessing cutting-edge technologies for practical medical applications. However, the scalability and resource requirements of such real-time systems, especially in healthcare settings with limited computational resources, could be an area of concern.

Kataria et al. [ 18 ] comprehensively analyzed and compared various machine learning algorithms for predicting heart disease. Their focus on analyzing the algorithms’ ability to predict heart disease effectively sheds light on their dedication to identifying the most suitable model. However, their study’s outcome might have been further enriched by addressing the unique challenges posed by individual attributes, such as high blood pressure and diabetes, in a more customized manner. Kannan et al. [ 19 ] meticulously evaluated machine learning algorithms to predict and diagnose cardiac sickness. By selecting 14 criteria from the UCI Cardiac Datasets, they showcased their dedication to designing a comprehensive study. Nevertheless, a deeper analysis of how these algorithms perform with specific criteria and their contributions to accurate predictions could provide more actionable insights.

Ali et al. [ 20 ] conducted a detailed analysis of supervised machine-learning algorithms for predicting cardiac disease. Their thorough evaluation of decision trees, k-nearest neighbors, and logistic regression classifiers (LRC) provided a well-rounded perspective on the strengths and limitations of each method. However, a more fine-grained analysis of how these algorithms perform under various parameter configurations and feature combinations might offer additional insights into their potential use cases. Mienye et al. [ 21 ] introduced an enhanced technique for ensemble learning, utilizing decision trees, random forests, and support vector machine classifiers. The voting system they employed to aggregate results showcased their innovative approach to combining various methods. However, the potential trade-offs between ensemble complexity and the robustness of predictions could be considered for future refinement. Dutta et al. [ 22 ] revolutionized the field by introducing convolutional neural networks (CNNs) for predicting coronary heart disease. Their approach, leveraging the power of CNNs on a large dataset of ECG signals, showcased the potential for deep learning techniques in healthcare. However, the requirement for extensive computational resources and potential challenges in model interpretability could be areas warranting further attention. Latha et al. [ 23 ] demonstrated ensemble classification approaches. Combined with a bagging technique, their utilization of decision trees, naive Bayes, and random forest exemplified their determination to achieve robust results. Nevertheless, the potential interplay between different ensemble techniques and their effectiveness under various scenarios could be explored further.

Ishaq et al. [ 24 ] introduced the concept of using the synthetic minority oversampling technique (SMOTE) in conjunction with efficient data mining methods to improve survival prediction for heart failure patients. Their emphasis on addressing class imbalance through SMOTE showcased their awareness of real-world challenges in healthcare datasets. However, the potential impact of the SMOTE method on individual patient subgroups and its implications for model fairness could be areas of future exploration. Asadi et al. [ 25 ] proposed a unique cardiac disease detection technique based on random forest swarm optimization. Their use of a large dataset for evaluation underscored their dedication to robust testing. However, the potential influence of dataset characteristics and the algorithm’s sensitivity to various parameters on prediction performance could be investigated further.

Proposed methodology

Heart disease is a significant health problem worldwide and is responsible for many deaths every year. Traditional methods for diagnosing heart disease are often time-consuming, expensive, and inaccurate. Therefore, there is a need for more accurate and efficient methods for predicting and diagnosing heart disease. The article aims to provide a detailed analysis of the proposed HDP-DTRF approach and its performance in accurately predicting the presence or absence of heart disease. The results demonstrate the effectiveness of the proposed approach, which can lead to improved diagnosis and treatment of heart disease, ultimately leading to better health outcomes for patients.

Figure  1 shows the proposed HDP-DTRF block diagram. The initial step in the proposed approach is the preprocessing of a dataset consisting of patient records with known labels indicating the presence or absence of heart disease. The dataset is then used to train a DTRF classifier with the SGB loss optimization technique. The performance of the trained classifier is evaluated using a separate publicly available real-world test dataset, and the results show that the proposed HDP-DTRF approach can accurately predict the presence or absence of heart disease. Using decision trees in the random forest classifier enables the algorithm to handle nonlinear data and make accurate predictions even with missing or noisy data. Applying the SGB loss optimization technique further enhances the algorithm’s performance by improving the convergence rate and avoiding overfitting. The proposed approach can be useful in clinical decision-making processes, enabling medical professionals to predict the likelihood of heart disease in patients and take appropriate preventive measures.

figure 1

Block diagram for the proposed HDP-DTRF system

The detailed operation of the proposed HDP-DTRF system is illustrated as follows:

Step 1: Data preprocessing: Gather a dataset containing patient records, where each record includes features such as age, blood pressure, and cholesterol levels, along with labels indicating whether the patient has heart disease. Remove duplicate records, handle missing values (e.g., imputing missing data or removing instances with missing values), and eliminate irrelevant or redundant features. Encode categorical variables (like gender) into numerical values using techniques like one-hot encoding. Scale numerical features to bring them to a common scale, which can prevent features with larger ranges from dominating the model.

Step 2: Training the DTRF classifier: Initialize an empty random forest ensemble. For each tree in the ensemble, randomly sample the training data with replacement. It creates a bootstrapped dataset for training each tree, ensuring diversity in the data subsets. Construct a decision tree using the bootstrapped dataset. At each node of the tree, split the data based on the feature that provides the best separation, determined using metrics like Gini impurity or information gain. Add the constructed decision tree to the random forest ensemble. Repeat the process to create the ensemble’s desired number of decision trees.
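The Gini impurity and the impurity decrease used to pick splits in Step 2 can be sketched in a few lines. This is a minimal pure-Python illustration; the function names are hypothetical:

```python
def gini(labels):
    """Gini impurity of a label multiset: 1 - sum_c p_c^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for v in labels:
        counts[v] = counts.get(v, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gain(parent, left, right):
    """Impurity decrease achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) \
                        - (len(right) / n) * gini(right)

# A perfectly balanced parent has impurity 0.5; a split that
# separates the classes removes all of it.
# gini([0, 0, 1, 1]) == 0.5
# split_gain([0, 0, 1, 1], [0, 0], [1, 1]) == 0.5
```

At each node, the tree grower evaluates candidate (feature, threshold) pairs and keeps the one with the largest impurity decrease.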

Step 3: SGB optimization: Initialize the model by setting the initial prediction to the mean of the target labels. Calculate the negative gradient of the loss function (such as mean squared error or log loss) concerning the current model’s predictions. This gradient represents the direction in which the model’s predictions need to be adjusted to minimize the loss. Train a new decision tree using the negative gradient as the target. This new tree will help correct the errors made by the previous model iterations. Update the model’s predictions by adding the predictions of the new tree, scaled by a learning rate. This step moves the model closer to the correct predictions. Repeat the process for a predefined number of iterations. Each iteration focuses on improving the model’s predictions based on the errors made in the previous iterations.

Step 4: Performance evaluation: Use a separate real-world test dataset that was not used during training to evaluate the performance of the trained HDP-DTRF classifier.
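The four steps above might be sketched end-to-end as a toy, pure-Python illustration. The data, helper names, and the use of one-split "stumps" in place of full decision trees are all simplifying assumptions for illustration, not the authors' implementation:

```python
import random

def minmax_scale(rows):
    # Step 1 (toy preprocessing): scale each feature column to [0, 1].
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(r, lo, hi)] for r in rows]

def fit_stump(X, y):
    # A one-split "tree": pick the (feature, threshold, sign) with fewest errors.
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for sign in (1, -1):
                pred = [1 if sign * (x[f] - t) > 0 else 0 for x in X]
                err = sum(p != v for p, v in zip(pred, y))
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    _, f, t, sign = best
    return lambda x: 1 if sign * (x[f] - t) > 0 else 0

def fit_forest(X, y, n_trees, rng):
    # Step 2: train each tree on a bootstrap sample (drawn with replacement).
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees

def predict(trees, x):
    # Majority vote over the ensemble.
    votes = sum(t(x) for t in trees)
    return 1 if 2 * votes > len(trees) else 0

# Hypothetical toy records: [age, systolic BP, cholesterol] -> disease label.
X = minmax_scale([[63, 145, 233], [37, 130, 250], [41, 130, 204],
                  [56, 120, 236], [67, 160, 286], [62, 140, 268],
                  [57, 120, 354], [63, 130, 254]])
y = [1, 0, 0, 0, 1, 1, 0, 1]
forest = fit_forest(X, y, n_trees=25, rng=random.Random(0))
# Step 4 (here evaluated on the training set only, for brevity).
acc = sum(predict(forest, x) == v for x, v in zip(X, y)) / len(y)
```

A real evaluation would, as Step 4 says, use a held-out test set rather than the training records.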

DTRF classifier

The DTRF classifier, an ensemble learning model, centers around the decision tree as its core component. As illustrated in Fig.  2 , the DTRF block diagram depicts a framework comprising multiple trained decision trees employing the bagging technique. During classification, when a sample requiring classification is input, the ultimate outcome is determined through a majority vote over the outputs of the individual decision trees [ 26 ]. In classifying high-dimensional data, the DTRF model outperforms standalone decision trees by effectively addressing overfitting, displaying robust resistance to noise and outliers, and demonstrating exceptional scalability and parallel-processing capabilities. Notably, the strength of DTRF stems from its inherent parameter-free nature, embodying a data-driven approach: the model requires no prior classification knowledge from the user and trains classification rules from observed instances. This data-centric attribute enhances the model’s adaptability to various data scenarios. The essence of the DTRF model lies in utilizing K decision trees. Each of these trees contributes a single “vote” towards the category it deems most fitting, thereby participating in determining the class to which the independent variable X, under consideration, should be allocated. This approach harnesses the collective wisdom of multiple trees, facilitating accurate and robust classification outcomes that capitalize on the diverse insights provided by each decision tree. The mathematical analysis of DTRF is as follows:

figure 2

Block diagram of DTRF

Here, \(K\) represents the number of decision trees present in the DTRF. In this context, \({\theta }_{k}\) is a collection of independent random vectors uniformly distributed amongst themselves. Here, \(K\) individual decision trees are generated. Each tree provides its prediction for the category that best fits the independent variable \(X\) . The predictions made by the \(K\) decision trees are combined through a voting mechanism to determine the final category assignment for the independent variable \(X\) . It is important to note that the given Eq. ( 1 ) indicates the ensemble nature of the DTRF model, where multiple decision trees work collectively to enhance predictive accuracy and robustness. The collection of \({\theta }_{k}\) represents the varied parameter sets for each decision tree within the ensemble.
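The displayed Eq. ( 1 ) referenced in this paragraph appears to have been lost during text extraction; based on the surrounding definitions (and the standard random-forest formulation), it presumably reads:

```latex
\{\, h(X, \theta_k),\quad k = 1, 2, \ldots, K \,\} \tag{1}
```

that is, the ensemble of \(K\) trees \(h(X,\theta_k)\), whose individual predictions are aggregated by majority vote.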

The following procedures must be followed to produce a DTRF:

Step 1: The \(K\) classification-and-regression trees are generated by drawing \(K\) bootstrap sample sets from the original training set via repeated random sampling with replacement. This procedure is repeated until all \(K\) sample sets have been extracted, and one tree is grown on each.

Step 2: Each node in the trees considers m characteristics randomly selected from the full set of n features in the original training set (m ≪ n). Only one of the m traits is employed in the node-splitting procedure: the one with the greatest classification potential, which DTRF identifies by calculating how much information is contained in each feature.

Step 3: Each tree is grown to its full depth without pruning.

Step 4: The generated trees together constitute the DTRF, and freshly received data is categorized by the DTRF. The number of votes from the tree classifiers determines the classification outcome.

DTRFs possess several important markers of generalization performance, including the similarity and correlation between different decision trees, the generalization error, and the system’s ability to generalize. A system’s decision-making efficacy is determined by how well it can generalize its results to fresh information that follows the same distribution as the training set [ 27 ]. The system’s performance and generalizability benefit from reducing the generalization error, which is defined as follows:
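The displayed equation here appears to have been dropped during extraction; following the definitions in the next paragraph (and the standard random-forest analysis), it plausibly reads:

```latex
PE^{*} = P_{X,Y}\big(\, mr(X, Y) < 0 \,\big)
```

i.e., the generalization error is the probability, over the data distribution, that the margin is negative.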

Here, \(P{E}^{*}\) denotes the generalization error, the subscripts \(X\) and \(Y\) point to the space where the probability is defined, and \(Mr (X, Y)\) is the margin function. The following is a definition of the margin function:
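The margin-function equation itself seems to have been lost in extraction; from the symbols defined in the following paragraph, it presumably reads:

```latex
mr(X, Y) = \mathrm{avg}_{k}\, I\big(h(X,\theta_k) = Y\big)
         \;-\; \max_{J \neq Y}\, \mathrm{avg}_{k}\, I\big(h(X,\theta_k) = J\big)
```

the average vote for the correct class minus the largest average vote for any incorrect class.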

Here, \(X\) stands for the input sample, \(Y\) indicates the correct classification, and \(J\) indicates an incorrect one. Specifically, \(h(g)\) is a representation of a sequence model for classification, \(I(g)\) indicates an indicator function, and \({avg}_{k}(g)\) means averaging. The margin function determines how many more votes the correct classification for sample X receives than any possible incorrect classification; as the value of the margin function grows, so does the classifier’s confidence in its accuracy. The convergence formulation of the generalization error is as follows [ 28 ]:
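The convergence equation referred to here is missing from the extracted text; it plausibly reads (reconstructed from the surrounding derivation):

```latex
\lim_{K \to \infty} PE^{*}
  = P_{X,Y}\Big(\, P_{\theta}\big(h(X,\theta) = Y\big)
    - \max_{J \neq Y} P_{\theta}\big(h(X,\theta) = J\big) < 0 \,\Big)
```

i.e., as the number of trees grows, the ensemble error converges almost surely to a limiting value determined by the tree distribution.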

As the number of decision trees grows, the generalization error tends toward a limiting value, as predicted by the preceding formulation, and the model will not over-fit. The classification power of each tree and the correlation between trees are used to estimate the maximum allowed generalization error. The DTRF model aims to produce a forest with a small correlation coefficient and strong classification power. The classification intensity ( \(S\) ) is the sample-space-wide mathematical expectation of the variable \(mr(X, Y)\) .
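Written out, the classification-intensity equation (missing from the extracted text) is presumably:

```latex
S = E_{X,Y}\big[\, mr(X, Y) \,\big]
```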

Here, \(\theta\) and \(\theta {\prime}\) are independent and identically distributed random vectors, \({E}_{X, Y}\) denotes the expectation over the data, and \(\overline{\rho }\) is the mean correlation coefficient between \(mr(\theta , X, Y)\) and \(mr(\theta {\prime}, X, Y)\) :
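The mean-correlation equation itself was lost in extraction; in the standard random-forest analysis it takes the sd-weighted form (a reconstruction, not the authors' exact display):

```latex
\overline{\rho} =
  \frac{E_{\theta,\theta'}\big[\, \rho(\theta,\theta')\, sd(\theta)\, sd(\theta') \,\big]}
       {E_{\theta,\theta'}\big[\, sd(\theta)\, sd(\theta') \,\big]}
```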

Among them, \(sd(\theta )\) can be expressed as follows:
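The expression for \(sd(\theta)\) is missing from the extracted text; consistent with the margin notation above, it is presumably the standard deviation of the per-tree margin:

```latex
sd(\theta) = \sqrt{\, E_{X,Y}\big[ mr(\theta, X, Y)^{2} \big]
  - \big( E_{X,Y}\big[ mr(\theta, X, Y) \big] \big)^{2} \,}
```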

Equation ( 7 ) is a metric that quantifies the degree to which the trees \(h(X,\theta )\) and \(h(X,\theta {\prime})\) are correlated with one another on the dataset consisting of X , Y ; the larger \(\overline{\rho }\) , the stronger the inter-tree correlation. The upper limit of generalization error is obtained using the following formula, which is based on the Chebyshev inequality:
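The Chebyshev-based bound referred to here is missing from the extracted text; it presumably reads (reconstructed from the surrounding discussion of \(\overline{\rho}\) and \(S\)):

```latex
PE^{*} \;\leq\; \frac{\overline{\rho}\,\big(1 - S^{2}\big)}{S^{2}}
```

so the bound shrinks as the strength \(S\) of individual trees grows and as the mean correlation \(\overline{\rho}\) falls.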

The generalization error limit of a DTRF is inversely proportional to the strength of the correlation P between individual decision trees and positively correlated with the classification intensity S of a single tree. That is to say, the stricter the category \(S\) , the lower the degree of linkage \(P\) . If the DTRF is to improve its classification accuracy, the threshold for generalization error must be lowered.
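The margin, strength, and error-bound quantities above can be sketched numerically. This is an illustrative example with a synthetic vote matrix and an assumed mean correlation value, not the paper's implementation:

```python
import numpy as np

# Synthetic ensemble votes: votes[i, c] = fraction of trees voting class c for sample i.
votes = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.1, 0.3, 0.6],
    [0.6, 0.3, 0.1],
])
y_true = np.array([0, 0, 2, 0])  # correct class of each sample

# Margin mr(X, Y): vote share of the correct class minus the best wrong class.
idx = np.arange(len(y_true))
correct = votes[idx, y_true]
masked = votes.copy()
masked[idx, y_true] = -np.inf          # exclude the correct class
best_wrong = masked.max(axis=1)
margin = correct - best_wrong

# Generalization error estimate: P(mr(X, Y) < 0).
pe = np.mean(margin < 0)

# Strength S = E[mr(X, Y)]; with an assumed mean correlation rho_bar,
# the bound reads PE* <= rho_bar * (1 - S^2) / S^2.
S = margin.mean()
rho_bar = 0.3                          # illustrative assumption
bound = rho_bar * (1 - S**2) / S**2

print(margin, pe, S, bound)
```

With these toy numbers the bound is larger than 1 (vacuous), which is common for weak ensembles; the point is only to show how strength and correlation enter the formula.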

SGB loss optimization

The SGB optimization approach has recently seen increased use in various deep-learning applications that demand more from the optimizer than conventional methods provide. In its basic form, SGB uses a single learning rate, \(\alpha\), that does not change during training. To improve performance in scenarios with sparse gradients (for example, computer vision tasks) and in online or non-stationary (noisy) settings, the algorithm can instead maintain per-parameter learning rates that are adapted using an average of recent gradient magnitudes for each weight (that is, how rapidly each weight is changing). The chain rule of calculus is applied to compute the partial derivatives of the loss with respect to the weights and biases, which tells us how the loss varies as a function of those parameters. Let us assume a training dataset with \(N\) samples, denoted as { \({x}_{i}, {y}_{i}\) } for \(i = 1, 2, \dots, N\), where \({x}_{i}\) is the input and \({y}_{i}\) is the true label or target value. A decision tree with parameters \(\theta\) predicts the output \({\widehat{y}}_{i}\) for input \({x}_{i}\); the output can be any function of the parameters and the input, represented as \({\widehat{y}}_{i} = f({x}_{i}, \theta )\). The goal is to minimize the difference between the predicted output \({\widehat{y}}_{i}\) and the true label \({y}_{i}\), which is typically done by defining a loss function \(L({\widehat{y}}_{i}, {y}_{i})\) that quantifies this difference. The total loss over the entire dataset is then the sum of the individual losses over all samples:

\({L}_{total} = \sum_{i=1}^{N} L\left({\widehat{y}}_{i}, {y}_{i}\right)\)
The optimization algorithm estimates the parameter values \(\theta\) that minimize this total loss. This is typically done using gradient descent, which updates \(\theta\) in the direction opposite to the gradient of the total loss with respect to the parameters:

\(\theta \leftarrow \theta - \alpha {\nabla }_{\theta }{L}_{total}\)
Here, \(\alpha\) is the learning rate, which controls the size of the parameter update, and \({\nabla }_{\theta }{L}_{total}\) is the gradient of the total loss with respect to the parameters \(\theta\). Because of noisy gradients, SGB can oscillate and take a long time to converge. Momentum helps it converge faster by adding a fraction of the previous update to the current one:

\({v}_{t} = \beta {v}_{t-1} + \left(1 - \beta \right){\nabla }_{\theta }{L}_{total}, \qquad \theta \leftarrow \theta - \alpha {v}_{t}\)
Here, \({v}_{t}\) is the momentum term at iteration \(t\), \(\beta\) is the momentum coefficient, typically set to 0.9 or 0.99, and the other terms are as previously defined.
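A minimal sketch of the update rules above, applied to a least-squares fit. The data, learning rate, and momentum coefficient are illustrative choices, not values from the paper:

```python
import numpy as np

# Toy regression data generated from y = 2x + 1.
X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * X + 1.0

theta = np.zeros(2)        # parameters [slope, intercept]
v = np.zeros(2)            # momentum term v_t
alpha, beta = 0.02, 0.9    # learning rate and momentum coefficient

def grad(theta):
    # Gradient of the total squared-error loss L_total = sum_i (y_hat_i - y_i)^2.
    err = theta[0] * X + theta[1] - y
    return np.array([2 * np.sum(err * X), 2 * np.sum(err)])

for _ in range(500):
    v = beta * v + (1 - beta) * grad(theta)   # momentum accumulation
    theta = theta - alpha * v                 # parameter update

print(theta)  # converges toward [2, 1]
```

Setting `beta = 0` recovers plain gradient descent; with momentum the noisy per-step directions are smoothed, which is the convergence benefit described above.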

Results and discussion

This section gives a detailed performance analysis of the proposed HDP-DTRF. Its performance is measured using multiple metrics, each of which is also measured for the existing methods, and all methods are evaluated on the same publicly available real-world dataset.

The Cleveland Heart Disease dataset contains data on 303 patients who were evaluated for heart disease. It is available from open-access sources such as the UCI Machine Learning repository. Each patient is represented by 14 attributes, which include demographic and clinical information such as age, sex, chest pain type, resting blood pressure, serum cholesterol level, and exercise test results. Each of the 303 records corresponds to a unique patient and includes values for all 14 attributes together with the diagnosis of heart disease (present or absent). Table 2 provides a detailed description of the dataset. Researchers and data scientists use this dataset to develop predictive models for heart disease diagnosis and to explore relationships between its variables. With 303 records, it is relatively small compared to other medical datasets, but it remains widely used in heart disease research because of its rich attributes and long history of use in research studies.
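As a sketch, the 14 Cleveland attributes can be arranged for modeling as follows. The column names follow the common UCI convention and the synthetic rows stand in for the real file; both are assumptions for illustration, not details given in the paper:

```python
import pandas as pd

# Attribute names commonly used for the 14-column Cleveland data (assumed).
columns = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]

# A few synthetic rows in place of the real file (the real dataset has 303 records).
rows = [
    [63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1],
    [57, 0, 0, 140, 241, 0, 1, 123, 1, 0.2, 1, 0, 3, 0],
    [41, 0, 1, 130, 204, 0, 0, 172, 0, 1.4, 2, 0, 2, 1],
]
df = pd.DataFrame(rows, columns=columns)

# Split into features and diagnosis label (1 = heart disease present, 0 = absent).
X, y = df.drop(columns="target"), df["target"]
print(X.shape, y.value_counts().to_dict())
```

For the real data, the `rows` literal would be replaced by `pd.read_csv(...)` on the downloaded file.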

EDA is essential for understanding and analyzing any dataset, including the Cleveland Heart Disease dataset. EDA involves examining the dataset’s basic properties, identifying missing values, checking data distributions, and exploring relationships between variables. Figure 3 shows the EDA of the dataset. Figure 3(a) shows the count for each target class: the no-heart-disease class contains 138 records, and the heart-disease-present class contains 165 records. Figure 3(b) shows the percentage of male and female records in the dataset: 68.32% male and 31.68% female. Figure 3(c) shows the distribution of chest pain types experienced by the patients: 47.19% of records fall in the typical angina class, 16.50% in atypical angina, 28.71% in non-anginal pain, and 7.59% in the asymptomatic class. Figure 3(d) shows the distribution of fasting blood sugar: 85.15% of records have fasting blood sugar of 120 mg/dl or less, and 14.85% have fasting blood sugar above 120 mg/dl. Figure 4 shows heart disease frequency by age for both the no-disease and disease classes as overlaid histogram bars, with the counts of patients with and without heart disease shown in red and green, respectively. The overlap between the bars shows how the frequency of heart disease varies with age; patient ages in the dataset range from 29 to 77 years.
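The class-count percentages reported above can be reproduced in a few lines. The counts are the ones stated in the text (138 without disease, 165 with disease); the computation is a generic sketch rather than the authors' code:

```python
from collections import Counter

# Target-class labels matching the stated counts: 138 absent (0), 165 present (1).
target = [0] * 138 + [1] * 165

counts = Counter(target)
total = sum(counts.values())
percentages = {cls: round(100 * n / total, 2) for cls, n in counts.items()}
print(counts, percentages)
```

The same pattern applied to the sex, chest pain, and fasting blood sugar columns yields the other panel percentages of Fig. 3.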

figure 3

EDA of the dataset. a Count for each target class. b Male–female distribution. c Chest pain experienced by patient distribution. d Fasting blood sugar distribution

figure 4

Heart disease frequency by age

Figure  5 shows the frequencies for different columns of the dataset, which contains the frequencies of chest pain type, fasting blood sugar, rest ECG, exercise-induced angina, st_slope, and number of major vessel columns. Exploring the frequencies of different variables in a dataset is crucial in understanding the data and gaining insights about the underlying phenomena. By analyzing the frequency of values in each variable, we can better understand the data distribution and identify potential patterns, relationships, and outliers that are important for further analysis. For example, understanding the frequency of different chest pain types in a heart disease dataset reveals whether certain types of chest pain are more strongly associated with the disease than others. Similarly, analyzing the frequency of different fasting blood sugar levels helps to identify potential risk factors for heart disease. Overall, exploring the frequencies of variables is an important step in the EDA process, as it provides a starting point for identifying potential relationships and patterns in the data.

figure 5

Frequencies for different columns of the dataset

Performance evaluation

Table 3 shows the class-specific performance evaluation of HDP-DTRF, measured for the class-0 (no heart disease) and class-1 (heart disease present) classes. Macro-average and weighted-average performances were also measured. The macro average treats all classes equally, regardless of their size: it averages the performance metric across all classes with equal weight, so smaller classes influence the metric as much as larger ones. The weighted average instead accounts for the size of each class: it averages the metric across all classes with each class weighted in proportion to its size, so larger classes have a greater impact on the metric than smaller ones.
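The difference between the two averages can be made concrete with a small example; the per-class precision scores and class sizes below are invented for illustration:

```python
# Per-class precision and class sizes (support) -- illustrative values.
precision = {0: 0.90, 1: 0.80}
support = {0: 30, 1: 70}

# Macro average: every class weighted equally.
macro = sum(precision.values()) / len(precision)

# Weighted average: each class weighted by its share of the samples.
total = sum(support.values())
weighted = sum(precision[c] * support[c] / total for c in precision)

print(macro, weighted)  # macro is 0.85; weighted leans toward the larger class
```

Here the weighted average (0.83) sits closer to the score of the larger class-1, while the macro average (0.85) is the plain midpoint, which is exactly the distinction drawn in the text.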

Table 4 shows the class-0 performance comparison of various methods. Here, the proposed HDP-DTRF improved precision by 5.75%, recall by 1.37%, F1-score by 6%, and accuracy by 2.45% compared to KNN [ 15 ]. Then, the proposed HDP-DTRF improved precision by 3.45%, recall by 0.63%, F1-score by 3.61%, and accuracy by 1.45% compared to ILM [ 16 ]. Then, the proposed HDP-DTRF improved precision by 2.30%, recall by 1.27%, F1-score by 3.61%, and accuracy by 1.03% compared to LRC [ 20 ]. Table 5 shows the class-1 performance comparison of various methods. Here, KNN [ 15 ] shows a 2.35% lower precision, a 4.40% lower recall, a 3.53% lower F1-score, and a 1.03% lower accuracy than the proposed HDP-DTRF method. Then, ILM shows a 2.35% lower precision, a 5.49% lower recall, a 1.14% lower F1-score, and a 1.03% lower accuracy than the proposed HDP-DTRF method. Then, LRC [ 20 ] shows a 4.71% lower precision, an 11.11% lower recall, a 2.27% lower F1-score, and a 1.03% lower accuracy than the proposed HDP-DTRF method.

Table 6 shows the macro average performance comparison of various methods. Compared to KNN [ 15 ], the proposed method improves precision by 7.5%, recall by 13.3%, F1-score by 10.4%, and accuracy by 6.7%. Compared to ILM [ 16 ], it improves precision by 2.4%, recall by 6.1%, F1-score by 6.0%, and accuracy by 3.2%. Compared to LRC [ 20 ], it improves precision by 3.4%, recall by 10.0%, F1-score by 6.0%, and accuracy by 4.3%. Table 7 shows the weighted average performance comparison of various methods. Compared to KNN [ 15 ], the proposed method improves precision by 6.5%, recall by 3.3%, F1-score by 1.4%, and accuracy by 6.7%. Compared to ILM [ 16 ], it improves precision by 2.4%, recall by 5.1%, F1-score by 6.0%, and accuracy by 3.2%. Compared to LRC [ 20 ], it improves precision by 1.4%, recall by 1.0%, F1-score by 6.0%, and accuracy by 4.3%.

The ROC curve of the proposed HDP-DTRF is shown in Fig. 6. It plots the true positive rate (TPR) against the false positive rate (FPR) across a range of threshold values. In the context of the HDP-DTRF technique, the ROC curve illustrates how well the model differentiates between positive and negative heart disease instances: performance is better when the TPR is higher and the FPR is lower. The curve is also used to find the classification threshold that best balances sensitivity and specificity in the diagnostic process; points closer to the top-left corner indicate better performance.

figure 6

ROC curve of proposed HDP-DTRF
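The TPR/FPR sweep behind an ROC curve can be sketched directly; the scores and labels below are invented for illustration and are not the paper's data:

```python
import numpy as np

# Illustrative classifier scores and true labels (1 = disease present).
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   0,   1,   0,   0])

def roc_points(scores, labels):
    # Sweep each distinct score as a threshold; predict positive when score >= t.
    P = labels.sum()
    N = len(labels) - P
    pts = []
    for t in sorted(set(scores), reverse=True):
        pred = scores >= t
        tpr = np.sum(pred & (labels == 1)) / P   # sensitivity
        fpr = np.sum(pred & (labels == 0)) / N   # 1 - specificity
        pts.append((fpr, tpr))
    return pts

curve = roc_points(scores, labels)
print(curve)  # points closer to the top-left corner indicate a better model
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1); picking the threshold whose point lies nearest the top-left corner is the sensitivity/specificity trade-off described above.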

Conclusions

This article proposes a machine-learning approach for heart disease prediction. The approach uses a DTRF classifier with loss optimization and involves preprocessing a dataset of patient records to determine the presence or absence of heart disease. The DTRF classifier is trained with SGB loss optimization and evaluated on a separate test dataset. The proposed HDP-DTRF improved class-specific performance as well as macro-average and weighted-average performance measures. Overall, it improved precision by 2.30%, recall by 1.27%, F1-score by 3.61%, and accuracy by 1.03% compared to traditional methodologies. This work can be extended with deep-learning-based classification combined with machine-learning feature analysis.

Availability of data and materials

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Abbreviations

HDP: Heart disease prediction

DTRF: Decision tree-based random forest

SGB: Stochastic gradient boosting

FP: False positive

FN: False negative

TN: True negative

TP: True positive

Bhatt CM et al (2023) Effective heart disease prediction using machine learning techniques. Algorithms 16(2):88


Dileep P et al (2023) An automatic heart disease prediction using cluster-based bi-directional LSTM (C-BiLSTM) algorithm. Neural Comput Appl 35(10):7253–7266

Jain A et al (2023) Optimized levy flight model for heart disease prediction using CNN framework in big data application. Exp Syst Appl 223:119859

Nandy S et al (2023) An intelligent heart disease prediction system based on swarm-artificial neural network. Neural Comput Appl 35(20):14723–14737

Hassan D et al (2023) Heart disease prediction based on pre-trained deep neural networks combined with principal component analysis. Biomed Signal Proc Contr 79:104019

Ozcan M et al (2023) A classification and regression tree algorithm for heart disease modeling and prediction. Healthc Anal 3:100130

Saranya G et al (2023) A novel feature selection approach with integrated feature sensitivity and feature correlation for improved heart disease prediction. J Ambient Intell Humaniz Comput 14(9):12005–12019

Sudha VK et al (2023) Hybrid CNN and LSTM network for heart disease prediction. SN Comp Sc 4(2):172

Chaurasia V et al (2023) Novel method of characterization of heart disease prediction using sequential feature selection-based ensemble technique. Biomed Mat Dev 1–10. https://doi.org/10.1007/s44174-022-00060-x

Ogundepo EA et al (2023) Performance analysis of supervised classification models on heart disease prediction. Innov Syst Software Eng 19(1):129–144

de Vries S et al (2023) Development and validation of risk prediction models for coronary heart disease and heart failure after treatment for Hodgkin lymphoma. J Clin Oncol 41(1):86–95

Vijaya Kishore V, Kalpana V (2020) Effect of Noise on Segmentation Evaluation Parameters. In: Pant, M., Kumar Sharma, T., Arya, R., Sahana, B., Zolfagharinia, H. (eds) Soft Computing: Theories and Applications. Advances in Intelligent Systems and Computing, vol 1154. Springer, Singapore. https://doi.org/10.1007/978-981-15-4032-5_41 .

Kalpana V, Vijaya Kishore V, Praveena K (2020) A Common Framework for the Extraction of ILD Patterns from CT Image. In: Hitendra Sarma, T., Sankar, V., Shaik, R. (eds) Emerging Trends in Electrical, Communications, and Information Technologies. Lecture Notes in Electrical Engineering, vol 569. Springer, Singapore. https://doi.org/10.1007/978-981-13-8942-9_42

Annamalai M, Muthiah P (2022) An Early Prediction of Tumor in Heart by Cardiac Masses Classification in Echocardiogram Images Using Robust Back Propagation Neural Network Classifier. Brazilian Archives of Biology and Technology. 65. https://doi.org/10.1590/1678-4324-2022210316

Shah D et al (2020) Heart disease prediction using machine learning techniques. SN Comput Sci 1:345

Guo C et al (2020) Recursion enhanced random forest with an improved linear model (RERF-ILM) for heart disease detection on the internet of medical things platform. IEEE Access 8:59247–59256

Ahmed H et al (2020) Heart disease identification from patients’ social posts, machine learning solution on Spark. Future Gen Comp Syst 111:714–722

Katarya R et al (2021) Machine learning techniques for heart disease prediction: a comparative study and analysis. Health Technol 11:87–97

Kannan R et al (2019) Machine learning algorithms with ROC curve for predicting and diagnosing the heart disease. In: Soft Computing and Medical Bioinformatics. Springer, Singapore


Ali MM et al (2021) Heart disease prediction using supervised machine learning algorithms: Performance analysis and comparison. Comput Biol Med 136:104672

Mienye ID et al (2020) An improved ensemble learning approach for the prediction of heart disease risk. Inform Med Unlocked 20:100402

Dutta A et al (2020) An efficient convolutional neural network for coronary heart disease prediction. Expert Syst Appl 159:113408

Latha CBC et al (2019) Improving the accuracy of heart disease risk prediction based on ensemble classification techniques. Inform Med Unlocked 16:100203

Ishaq A et al (2021) Improving the prediction of heart failure patients’ survival using SMOTE and effective data mining techniques. IEEE Access 9:39707–39716

Asadi S et al (2021) Random forest swarm optimization-based for heart diseases diagnosis. J Biomed Inform 115:103690

Asif D et al (2023) Enhancing heart disease prediction through ensemble learning techniques with hyperparameter optimization. Algorithms 16(6):308

David VAR S, Govinda E, Ganapriya K, Dhanapal R, Manikandan A (2023) "An Automatic Brain Tumors Detection and Classification Using Deep Convolutional Neural Network with VGG-19," 2023 2nd International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), Coimbatore, India, 2023, pp. 1-5. https://doi.org/10.1109/ICAECA56562.2023.10200949

Radwan M et al (2023) MLHeartDisPrediction: heart disease prediction using machine learning. J Comp Commun 2(1):50-65


Acknowledgements

Not applicable.

No funding was received by any government or private concern.

Author information

Authors and affiliations

Department of Information Technology, Malla Reddy Engineering College for Women (UGC-Autonomous), Maisammaguda, Hyderabad, India

Anil Pandurang Jawalkar, Pandla Swetcha, Nuka Manasvi, Pakki Sreekala, Samudrala Aishwarya, Potru Kanaka Durga Bhavani & Pendem Anjani


Contributions

A.P.J, P.S., and N.M. contributed to the technical content of the paper, and P.S. and S.A. contributed to the conceptual content and architectural design. P.K., D.B., and P.A. contributed to the guidance and counseling on the writing of the paper.

Corresponding author

Correspondence to Anil Pandurang Jawalkar .

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Jawalkar, A.P., Swetcha, P., Manasvi, N. et al. Early prediction of heart disease with data analysis using supervised learning with stochastic gradient boosting. J. Eng. Appl. Sci. 70 , 122 (2023). https://doi.org/10.1186/s44147-023-00280-y


Received : 31 May 2023

Accepted : 05 September 2023

Published : 10 October 2023

DOI : https://doi.org/10.1186/s44147-023-00280-y


  • Heart disease
  • Machine learning
  • Decision tree
  • Random forest
  • Loss optimization


Effective Heart Disease Prediction Using Machine Learning Techniques


1. Introduction

2. Literature Survey

3. Methodology

3.1. Data Source

3.2. Removing Outliers

3.3. Feature Selection and Reduction

3.4. Clustering

3.5. Correlation Table

3.6. Modeling

3.6.1. Decision Tree Classifier

3.6.2. Random Forest

3.6.3. Multilayer Perceptron

3.6.4. XGBoost

5. Conclusions

Author Contributions

Conflicts of Interest

  • Estes, C.; Anstee, Q.M.; Arias-Loste, M.T.; Bantel, H.; Bellentani, S.; Caballeria, J.; Colombo, M.; Craxi, A.; Crespo, J.; Day, C.P.; et al. Modeling NAFLD disease burden in China, France, Germany, Italy, Japan, Spain, United Kingdom, and United States for the period 2016–2030. J. Hepatol. 2018 , 69 , 896–904. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Drożdż, K.; Nabrdalik, K.; Kwiendacz, H.; Hendel, M.; Olejarz, A.; Tomasik, A.; Bartman, W.; Nalepa, J.; Gumprecht, J.; Lip, G.Y.H. Risk factors for cardiovascular disease in patients with metabolic-associated fatty liver disease: A machine learning approach. Cardiovasc. Diabetol. 2022 , 21 , 240. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Murthy, H.S.N.; Meenakshi, M. Dimensionality reduction using neuro-genetic approach for early prediction of coronary heart disease. In Proceedings of the International Conference on Circuits, Communication, Control and Computing, Bangalore, India, 21–22 November 2014; pp. 329–332. [ Google Scholar ] [ CrossRef ]
  • Benjamin, E.J.; Muntner, P.; Alonso, A.; Bittencourt, M.S.; Callaway, C.W.; Carson, A.P.; Chamberlain, A.M.; Chang, A.R.; Cheng, S.; Das, S.R.; et al. Heart disease and stroke statistics—2019 update: A report from the American heart association. Circulation 2019 , 139 , e56–e528. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Shorewala, V. Early detection of coronary heart disease using ensemble techniques. Inform. Med. Unlocked 2021 , 26 , 100655. [ Google Scholar ] [ CrossRef ]
  • Mozaffarian, D.; Benjamin, E.J.; Go, A.S.; Arnett, D.K.; Blaha, M.J.; Cushman, M.; de Ferranti, S.; Després, J.-P.; Fullerton, H.J.; Howard, V.J.; et al. Heart disease and stroke statistics—2015 update: A report from the American Heart Association. Circulation 2015 , 131 , e29–e322. [ Google Scholar ] [ CrossRef ]
  • Maiga, J.; Hungilo, G.G.; Pranowo. Comparison of Machine Learning Models in Prediction of Cardiovascular Disease Using Health Record Data. In Proceedings of the 2019 International Conference on Informatics, Multimedia, Cyber and Information System (ICIMCIS), Jakarta, Indonesia, 24–25 October 2019; pp. 45–48. [ Google Scholar ] [ CrossRef ]
  • Li, J.; Loerbroks, A.; Bosma, H.; Angerer, P. Work stress and cardiovascular disease: A life course perspective. J. Occup. Health 2016 , 58 , 216–219. [ Google Scholar ] [ CrossRef ]
  • Purushottam; Saxena, K.; Sharma, R. Efficient Heart Disease Prediction System. Procedia Comput. Sci. 2016 , 85 , 962–969. [ Google Scholar ] [ CrossRef ]
  • Soni, J.; Ansari, U.; Sharma, D.; Soni, S. Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction. Int. J. Comput. Appl. 2011 , 17 , 43–48. [ Google Scholar ] [ CrossRef ]
  • Mohan, S.; Thirumalai, C.; Srivastava, G. Effective Heart Disease Prediction Using Hybrid Machine Learning Techniques. IEEE Access 2019 , 7 , 81542–81554. [ Google Scholar ] [ CrossRef ]
  • Waigi, R.; Choudhary, S.; Fulzele, P.; Mishra, G. Predicting the risk of heart disease using advanced machine learning approach. Eur. J. Mol. Clin. Med. 2020 , 7 , 1638–1645. [ Google Scholar ]
  • Breiman, L. Random forests. Mach. Learn. 2001 , 45 , 5–32. [ Google Scholar ] [ CrossRef ]
  • Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the KDD ’16: 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [ Google Scholar ] [ CrossRef ]
  • Gietzelt, M.; Wolf, K.-H.; Marschollek, M.; Haux, R. Performance comparison of accelerometer calibration algorithms based on 3D-ellipsoid fitting methods. Comput. Methods Programs Biomed. 2013 , 111 , 62–71. [ Google Scholar ] [ CrossRef ]
  • K, V.; Singaraju, J. Decision Support System for Congenital Heart Disease Diagnosis based on Signs and Symptoms using Neural Networks. Int. J. Comput. Appl. 2011 , 19 , 6–12. [ Google Scholar ] [ CrossRef ]
  • Narin, A.; Isler, Y.; Ozer, M. Early prediction of Paroxysmal Atrial Fibrillation using frequency domain measures of heart rate variability. In Proceedings of the 2016 Medical Technologies National Congress (TIPTEKNO), Antalya, Turkey, 27–29 October 2016. [ Google Scholar ] [ CrossRef ]
  • Shah, D.; Patel, S.; Bharti, S.K. Heart Disease Prediction using Machine Learning Techniques. SN Comput. Sci. 2020 , 1 , 345. [ Google Scholar ] [ CrossRef ]
  • Alotaibi, F.S. Implementation of Machine Learning Model to Predict Heart Failure Disease. Int. J. Adv. Comput. Sci. Appl. 2019 , 10 , 261–268. [ Google Scholar ] [ CrossRef ]
  • Hasan, N.; Bao, Y. Comparing different feature selection algorithms for cardiovascular disease prediction. Health Technol. 2020 , 11 , 49–62. [ Google Scholar ] [ CrossRef ]
  • Ouf, S.; ElSeddawy, A.I.B. A proposed paradigm for intelligent heart disease prediction system using data mining techniques. J. Southwest Jiaotong Univ. 2021 , 56 , 220–240. [ Google Scholar ] [ CrossRef ]
  • Khan, I.H.; Mondal, M.R.H. Data-Driven Diagnosis of Heart Disease. Int. J. Comput. Appl. 2020 , 176 , 46–54. [ Google Scholar ] [ CrossRef ]
  • Kaggle Cardiovascular Disease Dataset. Available online: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset (accessed on 1 November 2022).
  • Han, J.A.; Kamber, M. Data Mining: Concepts and Techniques , 3rd ed.; Morgan Kaufmann Publishers: San Francisco, CA, USA, 2011. [ Google Scholar ]
  • Rivero, R.; Garcia, P. A Comparative Study of Discretization Techniques for Naive Bayes Classifiers. IEEE Trans. Knowl. Data Eng. 2009 , 21 , 674–688. [ Google Scholar ]
  • Khan, S.S.; Ning, H.; Wilkins, J.T.; Allen, N.; Carnethon, M.; Berry, J.D.; Sweis, R.N.; Lloyd-Jones, D.M. Association of body mass index with lifetime risk of cardiovascular disease and compression of morbidity. JAMA Cardiol. 2018 , 3 , 280–287. [ Google Scholar ] [ CrossRef ]
  • Kengne, A.-P.; Czernichow, S.; Huxley, R.; Grobbee, D.; Woodward, M.; Neal, B.; Zoungas, S.; Cooper, M.; Glasziou, P.; Hamet, P.; et al. Blood Pressure Variables and Cardiovascular Risk. Hypertension 2009 , 54 , 399–404. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Yu, D.; Zhao, Z.; Simmons, D. Interaction between Mean Arterial Pressure and HbA1c in Prediction of Cardiovascular Disease Hospitalisation: A Population-Based Case-Control Study. J. Diabetes Res. 2016 , 2016 , 8714745. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Huang, Z. A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. DMKD 1997 , 3 , 34–39. [ Google Scholar ]
  • Maas, A.H.; Appelman, Y.E. Gender differences in coronary heart disease. Neth. Heart J. 2010 , 18 , 598–602. [ Google Scholar ] [ CrossRef ]
  • Bhunia, P.K.; Debnath, A.; Mondal, P.; D E, M.; Ganguly, K.; Rakshit, P. Heart Disease Prediction using Machine Learning. Int. J. Eng. Res. Technol. 2021 , 9 . [ Google Scholar ]
  • Mohanty, M.D.; Mohanty, M.N. Verbal sentiment analysis and detection using recurrent neural network. In Advanced Data Mining Tools and Methods for Social Computing ; Academic Press: Cambridge, MA, USA, 2022; pp. 85–106. [ Google Scholar ] [ CrossRef ]
  • Menzies, T.; Kocagüneli, E.; Minku, L.; Peters, F.; Turhan, B. Using Goals in Model-Based Reasoning. In Sharing Data and Models in Software Engineering ; Morgan Kaufmann: San Francisco, CA, USA, 2015; pp. 321–353. [ Google Scholar ] [ CrossRef ]
  • Fayez, M.; Kurnaz, S. Novel method for diagnosis diseases using advanced high-performance machine learning system. Appl. Nanosci. 2021 . [ Google Scholar ] [ CrossRef ]
  • Hassan, C.A.U.; Iqbal, J.; Irfan, R.; Hussain, S.; Algarni, A.D.; Bukhari, S.S.H.; Alturki, N.; Ullah, S.S. Effectively Predicting the Presence of Coronary Heart Disease Using Machine Learning Classifiers. Sensors 2022 , 22 , 7227. [ Google Scholar ] [ CrossRef ]
  • Subahi, A.F.; Khalaf, O.I.; Alotaibi, Y.; Natarajan, R.; Mahadev, N.; Ramesh, T. Modified Self-Adaptive Bayesian Algorithm for Smart Heart Disease Prediction in IoT System. Sustainability 2022 , 14 , 14208. [ Google Scholar ] [ CrossRef ]


| Authors | Novel Approach | Best Accuracy | Dataset |
| --- | --- | --- | --- |
| Shorewala, 2021 [ ] | Stacking of KNN, random forest, and SVM outputs with logistic regression as the metaclassifier | 75.1% (stacked model) | Kaggle cardiovascular disease dataset (70,000 patients, 12 attributes) |
| Maiga et al., 2019 [ ] | Random forest; naive Bayes; logistic regression; KNN | 70% | Kaggle cardiovascular disease dataset (70,000 patients, 12 attributes) |
| Waigi et al., 2020 [ ] | Decision tree | 72.77% (decision tree) | Kaggle cardiovascular disease dataset (70,000 patients, 12 attributes) |
| Ouf and ElSeddawy, 2021 [ ] | Repeated random sampling with random forest | 89.01% (random forest classifier) | UCI cardiovascular dataset (303 patients, 14 attributes) |
| Khan and Mondal, 2020 [ ] | Holdout cross-validation with the neural network for the Kaggle dataset | 71.82% (neural networks) | Kaggle cardiovascular disease dataset (70,000 patients, 12 attributes) |
| | Cross-validation method with logistic regression (solver: lbfgs), k = 30 | 72.72% | Kaggle cardiovascular disease dataset 1 (462 patients, 12 attributes) |
| | Cross-validation method with linear SVM, k = 10 | 72.22% | Kaggle cardiovascular disease dataset (70,000 patients, 12 attributes) |
| Feature | Variable | Min and max values |
| --- | --- | --- |
| Age | Age | Min: 10,798 and max: 23,713 (age in days) |
| Height | Height | Min: 55 and max: 250 |
| Weight | Weight | Min: 10 and max: 200 |
| Gender | Gender | 1: female, 2: male |
| Systolic blood pressure | ap_hi | Min: −150 and max: 16,020 |
| Diastolic blood pressure | ap_lo | Min: −70 and max: 11,000 |
| Cholesterol | Chol | Categorical value = 1 (min) to 3 (max) |
| Glucose | Gluc | Categorical value = 1 (min) to 3 (max) |
| Smoking | Smoke | 1: yes, 0: no |
| Alcohol intake | Alco | 1: yes, 0: no |
| Physical activity | Active | 1: yes, 0: no |
| Presence or absence of cardiovascular disease | Cardio | 1: yes, 0: no |
| MAP values | Category |
| --- | --- |
| ≥70 and <80 | 1 |
| ≥80 and <90 | 2 |
| ≥90 and <100 | 3 |
| ≥100 and <110 | 4 |
| ≥110 and <120 | 5 |
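Table values aside, MAP itself is conventionally approximated from the systolic (ap_hi) and diastolic (ap_lo) pressures as DBP + (SBP − DBP)/3. A minimal sketch of that formula and of one consistent reading of the 10-mmHg-wide categories above (the function names are ours, not from the paper):

```python
def mean_arterial_pressure(ap_hi, ap_lo):
    """Standard approximation: MAP = DBP + (SBP - DBP) / 3."""
    return ap_lo + (ap_hi - ap_lo) / 3

def map_category(map_value):
    """Bucket MAP into 10-mmHg-wide categories: 70s -> 1, 80s -> 2, ... 110s -> 5."""
    if 70 <= map_value < 120:
        return int(map_value // 10) - 6
    return None

map_val = mean_arterial_pressure(120, 80)   # a normal 120/80 reading, ~93.3 mmHg
```

For a 120/80 reading this gives a MAP of roughly 93.3 mmHg, which falls in category 3.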
| Feature | Variable | Min and max values |
| --- | --- | --- |
| Gender | gender | 1: male, 2: female |
| Age | Age | Categorical values = 0 (min) to 6 (max) |
| BMI | BMI_Class | Categorical values = 0 (min) to 5 (max) |
| Mean arterial pressure | MAP_Class | Categorical values = 0 (min) to 5 (max) |
| Cholesterol | Cholesterol | Categorical values = 1 (min) to 3 (max) |
| Glucose | Gluc | Categorical values = 1 (min) to 3 (max) |
| Smoking | Smoke | 1: yes, 0: no |
| Alcohol intake | Alco | 1: yes, 0: no |
| Physical activity | Active | 1: yes, 0: no |
| Presence or absence of cardiovascular disease | Cardio | 1: yes, 0: no |
| Model | Accuracy (without CV) | Accuracy (CV) | Precision (without CV) | Precision (CV) | Recall (without CV) | Recall (CV) | F1-score (without CV) | F1-score (CV) | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MLP | 86.94 | 87.28 | 89.03 | 88.70 | 82.95 | 84.85 | 85.88 | 86.71 | 0.95 |
| RF | 86.92 | 87.05 | 88.52 | 89.42 | 83.46 | 83.43 | 85.91 | 86.32 | 0.95 |
| DT | 86.53 | 86.37 | 90.10 | 89.58 | 81.17 | 81.61 | 85.40 | 85.42 | 0.94 |
| XGB | 87.02 | 86.87 | 89.62 | 88.93 | 82.11 | 83.57 | 86.30 | 86.16 | 0.95 |


Bhatt, C.M.; Patel, P.; Ghetia, T.; Mazzeo, P.L. Effective Heart Disease Prediction Using Machine Learning Techniques. Algorithms 2023 , 16 , 88. https://doi.org/10.3390/a16020088




Med J Armed Forces India, vol. 77, no. 3, July 2021

Machine learning–based heart disease prediction system for Indian population: An exploratory study done in South India

a Research Scholar (Computer Science & Engineering), Dayananda Sagar University, Bengaluru, India

Bondu Venkateswarlu

b Associate Professor (Computer Science & Engineering), Dayananda Sagar University, Bengaluru, India

Baljeet Maini

c Professor Pediatrics, Teerthanker Mahaveer Medical College & Research Centre, Moradabad, India

Dheeraj Marwaha

d Senior Software Engineer, Microsoft India, Hyderabad, India

In India, cardiovascular diseases (CVDs) cause high mortality because these diseases are often not diagnosed in their early stages. Machine learning (ML) algorithms can be used to build an efficient and economical prediction system for early diagnosis of CVDs in India.

A total of 1670 anonymized medical records were collected from a tertiary hospital in South India. Seventy percent of the collected data were used to train the prediction system. Five state-of-the-art ML algorithms (k-Nearest Neighbours, Naïve Bayes, Logistic Regression, AdaBoost and Random Forest [RF]) were applied using the Python programming language to develop the prediction system. The performance was evaluated on the remaining 30% of the data. The prediction system was later deployed in the cloud for easy accessibility via the Internet.

ML effectively predicted the risk of heart disease. The best-performing prediction system (RF) correctly classified 470 of the 501 medical records, attaining a diagnostic accuracy of 93.8%. Sensitivity and specificity were observed to be 92.8% and 94.6%, respectively. The prediction system attained a positive predictive value of 94% and a negative predictive value of 93.6%. The prediction model developed in this study can be accessed at http://das.southeastasia.cloudapp.azure.com/predict/

Conclusions

The ML-based prediction system developed in this study performs well in the early diagnosis of CVDs and can be accessed via the Internet. This study offers promising results, suggesting the potential use of an ML-based heart disease prediction system as a screening tool in primary healthcare centres in India to diagnose heart diseases that would otherwise go undetected.

Introduction

Cardiovascular diseases (CVDs) are the leading cause of disease burden and mortality all over the world. Approximately 30% of total deaths (17.9 million) globally in 2016 were due to CVDs. 1 The situation is critically serious in low- and middle-income countries like India. During the past three decades, the share of deaths due to CVDs in India has increased significantly from 15.2% to 28.1%. 2 The prevalence of CVDs was observed to be as high as 54.5 million cases in 2016. 3 CVDs are often detected in advanced stages amongst underprivileged patients. For various reasons, the Indian public healthcare system is still not capable of effectively preventing non-communicable diseases like CVDs. Efficient healthcare in terms of affordability, accessibility and quality is still far from being within reach of many. 4 A shortage of facilities in rural areas hampers medical diagnostic and therapeutic help in the initial stage of disease. Despite government health insurance initiatives for poor people (which are mainly for therapeutic care only), a major section of the Indian population does not have preventive health check-up benefits. 5 All these reasons lead to delayed treatment and an increase in morbidity and mortality. 6

Integrating ML-based prediction systems into primary healthcare centres can potentially aid hugely in the prevention of CVDs in India. Recent advancements in computer science have proved that machine learning (ML) algorithms can extract meaningful information from the immense data generated by the healthcare sector. 7 This information can be used for the diagnosis of diseases at an initial stage, which can thus aid in the prevention of diseases. ML-based tools for efficient healthcare are attracting huge attention globally. As an example, ML-based tools have recently been successfully implemented in the fields of ophthalmology and oncology in the United States. 8 , 9 Available literature reveals the development of highly accurate prediction systems for CVDs. 10 , 11 , 12 , 13 All the research studies for early detection of CVDs reported in the literature so far are based on the freely available online data set provided by the ML repository of the University of California, Irvine. 14 This data set provides information about 76 medical attributes of 303 medical records gathered from hospitals in Western countries. Information obtained from diagnostic tests like electrocardiogram, treadmill test, fluoroscopy etc. is available in the above-mentioned data set. However, these medical tests are neither accessible nor affordable for a major section of the Indian population. 15 Thus, the prediction systems developed so far are not suitable for the Indian population. Moreover, some of the primary risk factors responsible for heart diseases in India, like obesity, lack of physical activity, physiological stress, smoking and alcohol consumption, have not been considered in ML-based studies so far.

As the importance of early detection of CVD is increasingly being realized, there is a definite need of developing ML-based prediction system for CVDs specifically suitable for Indian scenario.

This study was carried out with the following objectives: a) Development of a high-performance and cost-effective ML-based heart disease prediction system using routine clinical data specifically suited for Indian population and b) Deployment of the prediction system in public cloud to ensure easy accessibility via Internet particularly beneficial for rural areas in India.

Materials and methods

Study setting.

This study is an interdisciplinary research work carried out through collaboration between data scientists and specialist doctors. The study was approved by the institutional ethics committee. Members of the ethics committee deemed that data privacy was ensured by using anonymized medical records of an existing/retrospective cohort.

Data collection

Through random selection after applying the exclusion criteria, anonymized medical records of heart patients as well as of healthy persons were collected from a tertiary hospital in South India. Anonymization ensured data privacy, as the personal details of the patients were not collected for the study.

Exclusion criteria: 1) Medical data sets corresponding to pregnant females, 2) patients reporting chronic kidney disease, severe mental illness, atrial fibrillation, 3) patients who reported the prolonged use of anti-depressants, antibiotics and medicines for asthma, tuberculosis and cancer, 4) patients who are prescribed oral corticosteroids, antipsychotic drugs and immunosuppressants and 5) patients younger than 20 years or older than 100 years.

After applying these exclusion criteria, the final data set comprised 1670 medical records belonging to people between the ages of 30 and 79 years. The study population included 881 males and 789 females. The ethnicity of all records in this study was Asian. Eight hundred and seventy-four records did not have hypertension, while the remaining 796 reported hypertension. Of the 1670 records, 928 reported consuming alcohol, 828 belonged to smokers and 920 complained of stress and anxiety in life. Of the 1670 records, 893 (53.47%) were diagnosed with CVDs and the remaining 777 (46.53%) were of healthy persons with no CVDs. Persons who visited the hospital for routine check-ups and were not diagnosed with any heart disease are referred to as healthy persons (CVD risk: low) in this study.

Risk factor attributes

People living in rural parts of the country are usually unaware of the potential risk factors of heart diseases. They usually neglect the early signs of heart disease. Since the study has been carried out especially for rural areas, the clinical attributes already known to be the potential risk factors of CVDs along with lifestyle attributes associated with heart disease were chosen for this study. These attributes include age, gender, weight, height, total cholesterol levels, smoking habits, alcohol, diabetes, hypertension, family history of CVDs, intake of healthy diet, physical activity/exercise habits and stress/anxiety in life. Body mass index (BMI) was calculated internally by the software. Table 1 represents the details of risk factors considered in this study.

Description of attributes used in the study.

| Attribute | Description | Categorical/numeric |
| --- | --- | --- |
| Age | Years | Numeric |
| Weight | Kilograms | Numeric |
| Height | Centimetres | Numeric |
| Total cholesterol levels | mg/dL | Numeric |
| Gender | Male/female | Categorical |
| Hypertension | Yes/no | Categorical |
| Diabetes | Yes/no | Categorical |
| Alcohol | Yes/no | Categorical |
| Smoking | Yes/no | Categorical |
| Exercise | Yes/no | Categorical |
| Stress | Yes/no | Categorical |
| Family history of cardiovascular disease (CVD) | Yes/no | Categorical |
| Healthy diet | Yes/no | Categorical |
| Risk of CVD | High/low | Categorical |

Diagnostic procedures like the treadmill test and fluoroscopy (used extensively in similar studies done so far) were not considered relevant for this study, to ensure cost-effectiveness. Tests for triglycerides, serum creatinine, C-reactive protein, serum fibrinogen, gamma glutamyl transferase, lipoprotein, apolipoprotein B, homocysteine, insulin etc., although associated with the risk of heart diseases, were also not considered in this study, as these medical tests are not feasible/affordable for the rural population at which the research is aimed.

Study population characteristics

Of the 1670 records, 893 were positive cases of CVD while the remaining 777 were negative cases, ensuring that the data set is balanced and not skewed in favour of either class. The mean age of patients with heart disease was 66.2 years, while the mean age of healthy people was 57.3 years. Mean total cholesterol was 188 mg/dL for healthy people and 237.7 mg/dL for heart patients. The mean weight of heart patients was 85.4 kg, while that of healthy people was 69.4 kg. Only 27.3% of heart patients were females. Nearly 95% of healthy people reported exercising regularly. The chi-square test of independence and the t-test were carried out on the study subjects on a 'prior basis' to determine the statistical significance of the categorical and numeric input attributes, respectively, in determining heart disease. 16 These tests were used to ensure the validity of the study variables, since the performance of AI algorithms is affected by the data used to train them. Descriptive characteristics of the study population variables are presented in Table 2 .

Study Population Descriptive Characteristics.

Total records (n = 1670)

| Risk factor attribute | Unit | Cardiovascular disease (CVD) (n = 893) | No CVD (n = 777) | p-value |
| --- | --- | --- | --- | --- |
| Age | Years (SD) | 66.2 (11.2) | 57.3 (12.4) | <0.001 |
| Weight | Kilograms (SD) | 85.4 (9.2) | 69.4 (10.1) | <0.001 |
| Height | Centimetres (SD) | 165.7 (9.1) | 162.3 (13.4) | 0.23 |
| Total cholesterol levels | mg/dL (SD) | 267.7 (14.1) | 218.4 (13.9) | <0.001 |
| Gender (female) | N (%) | 244 (27.3) | 545 (70.1) | <0.001 |
| Hypertension (yes) | N (%) | 614 (68.7) | 182 (23.4) | <0.001 |
| Diabetes (yes) | N (%) | 630 (70.5) | 318 (40.9) | <0.001 |
| Alcohol (yes) | N (%) | 623 (69.7) | 305 (39.2) | <0.001 |
| Smoking (yes) | N (%) | 570 (63.8) | 258 (33.2) | <0.001 |
| Exercise (yes) | N (%) | 412 (46) | 737 (94.8) | <0.001 |
| Stress (yes) | N (%) | 568 (63.6) | 352 (45.3) | <0.001 |
| Family history of CVD (yes) | N (%) | 592 (66.2) | 299 (38.4) | <0.001 |
| Healthy diet (yes) | N (%) | 496 (55.5) | 398 (51.2) | 0.077 |

∗p-value < 0.05 is statistically significant.
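The t-test comparisons reported above can be reproduced from the summary statistics alone. A minimal sketch of the Welch t-statistic (assuming unequal group variances; the paper does not state which t-test variant was used) applied to the age row of Table 2:

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t-statistic for two independent groups, from summary statistics."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se

# Age: CVD group 66.2 (SD 11.2, n = 893) vs. no-CVD group 57.3 (SD 12.4, n = 777)
t_age = welch_t(66.2, 11.2, 893, 57.3, 12.4, 777)
```

A |t| of this size (around 15) with these sample sizes corresponds to p < 0.001, consistent with the table.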

Methodology

The Python 3.7 programming language was used for building the ML-based heart disease prediction system. Powerful software libraries supported by Python, namely NumPy, Pandas, Seaborn, statsmodels.api, SciPy and sklearn, were used for exploratory analysis of the data 17 and for implementing five ML algorithms, namely k-Nearest Neighbours (k-NN), Naïve Bayes (NB), Logistic Regression (LR), AdaBoost (AB) and Random Forest (RF). This study is a typical binary classification problem in which 13 input attributes are observed to determine whether a patient is at high risk of heart disease (risk of CVD = high) or not (risk of CVD = low). Fig. 1 shows the workflow diagram of the complete project.

Fig. 1

Workflow diagram of the study. This figure depicts the complete workflow of the study. The medical data set of 1670 records was gathered (in random fashion). Seventy percent of the data samples were used to train the models; the test subset comprised the remaining 30% of the medical records. Five machine learning algorithms were applied to the training subset. The prediction system was hosted on the public cloud for easy accessibility.

Data pre-processing

It was observed that there were no missing values or outliers in the data.

Since the ML algorithms can process only numerical data, the categorical attributes were label encoded. Gender female was encoded as 1 while male as 0. For all the other categorical variables like diabetes, stress, exercise etc., the presence (yes) was encoded as 1 while absence (no) was encoded as 0. High risk of CVD was encoded as 1 while low risk of CVD was encoded as 0.
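The encoding described above amounts to a simple dictionary mapping. A minimal sketch (the column names are ours for illustration; the paper implemented this step with Python data libraries):

```python
def encode_record(rec):
    """Label-encode one record as described: female -> 1, yes -> 1, high risk -> 1."""
    yn = {"yes": 1, "no": 0}
    return {
        "gender": 1 if rec["gender"] == "female" else 0,
        "diabetes": yn[rec["diabetes"]],
        "exercise": yn[rec["exercise"]],
        "risk_of_cvd": 1 if rec["risk_of_cvd"] == "high" else 0,
    }

encoded = encode_record(
    {"gender": "female", "diabetes": "yes", "exercise": "no", "risk_of_cvd": "high"}
)
```

The remaining categorical attributes (stress, smoking, alcohol etc.) follow the same yes/no pattern.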

Building the model

Using the train_test_split function supported by the scikit-learn library, the complete medical data set was randomly split into two portions in the ratio 70:30, referred to as the training and test/validation subsets, respectively. Of the total 1670 records, the training subset had 1169 records while the test subset had 501 records. Detailed information about the training and test subsets is provided in Table 3 . Of the 1169 records in the training data set, 656 correspond to CVDs while 513 belonged to healthy people not diagnosed with CVDs.
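The paper uses scikit-learn's train_test_split for this step; the same 70:30 random split can be sketched with the standard library alone (the seed is arbitrary, chosen only to make the sketch reproducible):

```python
import random

records = list(range(1670))        # stand-ins for the 1670 medical records
random.seed(42)                    # fixed seed for a reproducible split
random.shuffle(records)

n_train = len(records) * 7 // 10   # 70% of 1670 -> 1169 records
train, test = records[:n_train], records[n_train:]
```

This reproduces the reported subset sizes: 1169 training records and 501 test records, with no record appearing in both subsets.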

Details of training and test subsets.

| Class | Training subset (70%) | Test subset (30%) | Total records |
| --- | --- | --- | --- |
| High-risk cardiovascular disease (CVD) | 656 | 237 | 893 |
| Low-risk CVD | 513 | 264 | 777 |
| Total records | 1169 | 501 | 1670 |

ML algorithms with well demonstrated performance for classification namely NB, LR and k-NN were applied to build the prediction model.

Applying ensembling algorithms for better performance

Research has proved that the performance of an ML-based prediction system can be improved using ensembling techniques. 18 Ensembling is a union of individual classifying algorithms. A bagging ensemble algorithm, namely RF, and a boosting ensemble algorithm, namely adaptive boosting (AB), were also implemented for enhanced performance.

Testing the performance of the model

The performance of the prediction models developed using the k-NN, NB and LR algorithms was analysed using the validation subset of 501 records, as shown in Table 4 . Of these records, 237 were confirmed cases of CVDs while the remaining 264 correspond to healthy people not diagnosed with CVDs. The prevalence of disease in the validation subset was therefore 237/501 = 47.3%.

Performance of machine learning algorithms on validation set of 501 records.

| Algorithm | True positive | True negative | False negative | False positive | Sensitivity | Specificity | Positive predictive value | Negative predictive value | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| k-Nearest Neighbours | 211 | 230 | 26 | 34 | 89% | 87.1% | 86.1% | 89.8% | 88% |
| Naïve Bayes | 210 | 232 | 27 | 32 | 88.6% | 87.8% | 86.7% | 89.5% | 88.2% |
| Logistic Regression | 215 | 240 | 22 | 24 | 90.7% | 90.9% | 89.9% | 91.6% | 90.8% |
| AdaBoost | 218 | 246 | 19 | 18 | 91.9% | 93.1% | 92.3% | 92.8% | 92.6% |
| Random Forest | 220 | 250 | 17 | 14 | 92.8% | 94.6% | 94% | 93.6% | 93.8% |

Analysis of confusion matrix is a standard way to check the performance of ML-based prediction system. Confusion matrix has four components namely true positives (TPs), true negatives (TNs), false positives (FPs) and false negatives (FNs).

TPs: Heart patients who are predicted correctly to have heart diseases.

TNs: Healthy persons who are predicted correctly to be healthy.

FPs: Healthy persons predicted incorrectly to have heart diseases (Type 1 error).

FNs: Heart patient predicted incorrectly to be healthy (Type 2 error).

These values are used to calculate accuracy, specificity, sensitivity, positive predictive value (PPV) and negative predictive value (NPV). PPV and NPV depend on the prevalence of disease.

A brief description of these parameters is given below.

  • i. Classification accuracy: this parameter represents the fraction of total predictions that were correct. Accuracy = (TP + TN)/(TP + TN + FP + FN)
  • ii. Sensitivity: the ratio of cases accurately predicted with heart disease to the total number of actual cases of heart disease. Sensitivity = TP/(TP + FN)
  • iii. Specificity: the ratio of cases correctly predicted with no heart disease to the entire count of actual cases with no heart disease. Specificity = TN/(TN + FP)
  • iv. PPV: the ratio of cases correctly predicted with heart disease to the total count of cases predicted to have heart disease. PPV = TP/(TP + FP)
  • v. NPV: the ratio of cases correctly predicted to be healthy to the total count of cases predicted to be healthy. NPV = TN/(TN + FN)
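These definitions can be checked directly against the Random Forest row of Table 4 (TP = 220, TN = 250, FN = 17, FP = 14):

```python
# Recomputing the RF row of Table 4 from its confusion matrix.
tp, tn, fn, fp = 220, 250, 17, 14

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 470/501 ≈ 0.938
sensitivity = tp / (tp + fn)                    # 220/237 ≈ 0.928
specificity = tn / (tn + fp)                    # 250/264 ≈ 0.947
ppv         = tp / (tp + fp)                    # 220/234 ≈ 0.940
npv         = tn / (tn + fn)                    # 250/267 ≈ 0.936
```

The recomputed values match the reported 93.8% accuracy, 92.8% sensitivity, 94% PPV and 93.6% NPV.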

Fine tuning of hyperparameters

Grid search with cross-validation was used to identify the best hyperparameters for the learning algorithms. The GridSearchCV class from the sklearn library was used for this purpose.
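GridSearchCV exhaustively evaluates every combination in a parameter grid and keeps the one with the best cross-validated score. The idea can be sketched without scikit-learn; the grid and the scoring function below are illustrative stand-ins, not the study's actual values:

```python
import itertools

# Hypothetical hyperparameter grid for a random forest.
param_grid = {"n_estimators": [50, 100, 150], "criterion": ["gini", "entropy"]}

def cv_score(params):
    # Stand-in for "fit the model, return mean cross-validated accuracy".
    return 0.9 + 0.0001 * params["n_estimators"] \
        - (0.001 if params["criterion"] == "entropy" else 0)

# Enumerate every grid point and keep the best-scoring combination.
candidates = [dict(zip(param_grid, values))
              for values in itertools.product(*param_grid.values())]
best = max(candidates, key=cv_score)
```

In the real pipeline, `cv_score` is replaced by k-fold cross-validation of the actual estimator over the training subset.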

Deployment on the public cloud

The best-performing prediction system, built using the RF model, was deployed in the Microsoft Azure cloud for better accessibility. 19 The ' Pickle ' and ' Flask ' software libraries of the Python programming language were used for this purpose. 20 Hosting the prediction system on the cloud enables it to be easily accessed from anywhere in the world via the Internet. This is a highly useful feature for the healthcare sector of India, which faces a major shortage of medical facilities, especially in rural areas. Accessing this prediction system is as easy as accessing an e-mail via the Internet.
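The persistence step is plain pickle serialization: the trained model object is dumped once and re-loaded by the Flask app at startup, which then exposes a prediction route. A minimal sketch of the pickle round trip with a stand-in object (the real object would be the fitted RandomForestClassifier):

```python
import io
import pickle

# Stand-in for the trained RF model object.
model = {"name": "rf", "n_estimators": 150}

buf = io.BytesIO()
pickle.dump(model, buf)      # at training time: serialize the fitted model
buf.seek(0)
restored = pickle.load(buf)  # at app startup: restore it for serving predictions
```

In deployment, the buffer would be a file on disk, and the restored object's predict method would back the web form shown in Fig. 4.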

Results

The CVD prediction system was developed by applying five well-established ML algorithms to the training data set. The performance was tested on the validation set of 501 records, in which the prevalence of disease was 237/501 = 47.3%. Performance metrics, namely accuracy, sensitivity, specificity, PPV and NPV, were calculated for each algorithm. The performance results of all classifiers are given in Table 4 .

The best hyperparameters for k-NN (n_neighbors = 12) resulted in a sensitivity of 89%, specificity of 87.1%, PPV of 86.1% and NPV of 89.8%. The performance of NB was found to be better than that of k-NN: NB achieved a sensitivity of 88.6%, specificity of 87.8%, PPV of 86.7% and NPV of 89.5%.

LR with hyperparameters (C = 1, penalty = l2) performed well in classifying people with low or high risk of CVDs. LR correctly classified 455 out of 501 records, thus attaining a classification accuracy of 90.8%. Sensitivity and specificity were 90.7% and 90.9%, respectively. PPV was observed to be 89.9%, while NPV was 91.6%.

Models built using ensemble techniques (RF and AB) performed better than LR. The AB model was trained with Stage-wise Adaptive Modelling using a Multi-class Exponential loss function (n_estimators = 30), while RF based on the 'gini index' with n_estimators = 150 resulted in the best performance. The sensitivity and specificity of the AB model were 91.9% and 93.1%, respectively, while RF reported 92.8% sensitivity and 94.6% specificity. A PPV of 94% and an NPV of 93.6% were achieved by the RF-based prediction model.

Interpretation of ML-based models is not easy, and these are usually considered 'black boxes'. However, logistic regression-based models are quite interpretable. Logistic regression was implemented using the Logit function (Binomial family) based on the maximum likelihood estimation method to predict CVD risk using the statsmodels.api library of Python. Fig. 2 shows the summary of the results obtained.

Fig. 2

Study population characteristics mean (standard deviation) of numerical attributes along with p-values of t -test to indicate the statistical significance for two groups: high risk/low risk of cardiovascular disease (CVDs). Count (%) of categorical attributes in two groups: high risk/low risk of CVDs.

Male gender, diabetes, hypertension, high cholesterol level, smoking and alcohol were significantly associated with CVD. Lack of exercise and stress were observed to be more prevalent in CVD group (p value < 0.05).

The Estimate column in the summary reflects the natural logarithm of the odds ratio of being diagnosed with high risk of heart disease, keeping all other features constant. From the negative values of log(odds ratio), it is inferred that females had a lower risk of CVDs compared with males. Regular exercise and intake of a healthy diet were observed to be associated with low risk of CVDs; on the other hand, diabetes, hypertension, stress, smoking and family history tend to result in high risk of CVDs.

The odds ratio column in the summary shows how the odds of being detected with high risk of CVD change with each attribute when all other attributes are kept constant. Hypertension increases the odds of high risk of CVDs by a factor of 1.573, while the odds ratio drops significantly to 0.328 with regular physical exercise. The odds ratio of high risk of CVD for females is 0.788 compared with males.
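The odds ratios are simply the exponentials of the Estimate column. A sketch using log-odds values back-derived from the reported ratios (these coefficients are illustrative, not the paper's exact estimates):

```python
import math

# Illustrative log-odds (beta) values chosen to match the reported odds ratios.
log_odds = {"hypertension": 0.453, "exercise": -1.115, "female": -0.238}

# Odds ratio = exp(beta): > 1 raises the odds of high CVD risk, < 1 lowers them.
odds_ratios = {k: round(math.exp(v), 3) for k, v in log_odds.items()}
```

Exponentiating these coefficients recovers the reported values: 1.573 for hypertension, 0.328 for exercise and 0.788 for female gender.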

Ensemble algorithms (RF and AB) are based on decision trees, and attribute importance is graded according to how frequently an attribute is selected as a decision node, which is decided based on information gain and entropy. Variable importance for the boosting algorithm was computed from impurity-based scores using feature_importances_ from the sklearn library of Python. Exercise, weight, total cholesterol, hypertension and age were the top five important attributes for the AB algorithm. In the case of the RF prediction system, variable importance scores for weight, exercise, total cholesterol, hypertension and gender were found to be maximal for predicting CVDs. Variable importance for the AB and RF algorithms is represented graphically in Fig. 3 (a) and (b), respectively.

Fig. 3

Variable importance. (a) Variable importance for AdaBoost-based prediction model. (b) Variable importance for Random Forest–based prediction model.

RF-based CVD prediction model (trained on 1169 records and tested on 501 records) is hosted on cloud and can be easily accessed at das.southeastasia.cloudapp.azure.com/predict/

The input attributes of the patient are entered into the system. The system predicts if the patient has low risk of CVDs or high risk. Sample screenshots of the result obtained using the prediction system are shown in Fig. 4 .

Fig. 4

Using cardiovascular disease (CVD) prediction model to test the risk of CVDs. The medical practitioner enters the patient's clinical parameters as well as attributes related to his lifestyle to predict the risk of CVD.

Discussion

In recent years, substantial research studies have been carried out to build methods for diagnosing heart diseases in early stages. Various feature selection techniques were applied in the research carried out by Takci (2018), 21 and the resulting prediction system attained an accuracy of 84.81%. A similar study was carried out by Kausar et al., and an accuracy of 88.41% was obtained. 22 The prediction system developed by Khalid Raza using an ensembling technique (2019) attained an accuracy of 88.88%. 23 A similar accuracy level of 89% was achieved by the prediction system developed by Haq et al. in 2019. 24 Using an artificial neural network to design a prediction system, Alic et al. achieved an accuracy of 91% in their research study. 25 Importantly, however, the prediction systems developed in all of these studies do not work effectively for the Indian population, as these models are based on data collected from Western countries and do not take into consideration lifestyle-related risk factors responsible for CVDs (lack of physical activity, family history, alcohol etc.). Moreover, these systems rely on the results of medical tests like ECG, treadmill test and fluoroscopy, which are not feasible in Indian primary health centres in the existing scenario.

The accuracy attained in the present study is 93.8%. The prediction system developed in this research uses 13 clinical parameters and identifies the risk of a person having heart disease. Compared with the studies done so far, this study has been carried out on the Indian population, and potential risk factors like high body weight, lack of exercise, psychological stress, family history, and smoking and alcohol consumption habits have been considered (unlike the studies quoted previously). It is worth noting that the system developed in this study is highly cost-effective compared with earlier studies, as expensive tests like fluoroscopy and treadmill tests have not been taken into consideration. Easy accessibility of the prediction system via the Internet is also a remarkable added feature of this study, not reported by earlier studies. It is worth mentioning that the prediction model developed in this pilot study predicts output depending on the study population attribute trends it was trained on. Once ML models are trained and tested on voluminous data sets, they can be used as a screening tool in rural India and can help in the prevention of CVDs.

Cost-effectiveness, excellent performance and easy accessibility via the Internet support the use of an ML-based prediction system as a screening tool for CVD detection in India.

To the best of our knowledge, this study is the first of its kind in the Indian context. Developed countries like the United Kingdom and the United States are investing their resources in carrying out research to develop ML-based prediction models for diagnosing heart diseases in primary healthcare centres. 26 , 27 It is recommended that similar studies be promoted in India. The current national health policy (2017) of our government, laying stress on preventive health, will be more meaningful and fruitful if advancement in this field is made as early as possible. 28 We propose larger studies of a multicentric nature for the development of AI prediction systems for CVD screening in our country, which is facing an ever-increasing load of morbidity and mortality due to CVDs being detected in late, advanced stages. Premier institutes of medicine and technology can collaborate in this regard to diagnose other lifestyle diseases and non-communicable diseases like malignancies. The Cardiological Society of India (CSI) can help in this regard. Other modern techniques like artificial neural networks can be applied to further improve the performance of the system.

Limitations of the study

This study used a data set of 1670 patients reporting to a tertiary care private setup in a South Indian metropolitan city, where largely the higher-income group seeks medical care. This may appear to introduce bias, but the study was aimed only at testing the robustness of an ML-based prediction model. The results obtained from the prediction system developed in this study are based on the attribute trends of the study population on which the model was trained. In future, the model needs to be trained on large data sets collected from diverse regions before being used as a screening tool.

The study portrays the capability of ML algorithms to predict CVDs in the Indian population. Issues of affordability and accessibility in the healthcare sector of India can be addressed using ML-based models, which can be easily accessed via the Internet even in rural parts of the country. It is proposed to build and test the performance of similar systems using voluminous cardiac data sets, belonging to all economic sections of society, collected from various regions of India. We recommend similar multicentric studies across the entire country. To achieve the sustainable development goals laid down by the World Health Organization, it is high time our country took advantage of ML-based prediction systems to improve the preventive-care aspect of the public healthcare system. 29

What is already known?

ML-based tools have shown remarkable performance in diagnosing various serious diseases in initial stages in healthcare centres of developed countries.

What does this study add?

An indigenous, high-performance, ML-based CVD prediction system easily accessible via the Internet is proposed for the existing Indian healthcare system. Healthcare in India can be made more affordable and accessible using ML-based prediction systems.

Disclosure of competing interest

The authors have none to declare.

Acknowledgements

The authors express their heartfelt gratitude to Sagar Hospitals, Jayanagar, Bengaluru, for providing anonymized information of patients' health parameters for carrying out this study. No funding was received for this project.


Early prediction of grape disease attack using a hybrid classifier in association with IoT sensors

Heliyon
  • Agriculture
  • Machine Learning
  • Disease classification
  • Weather database

1 Introduction

2 Role and Impact of Temperature, Humidity and Leaf Wetness on Grape Plant

3 Literature Survey

| Reference | Plant | Data analysis | Technique used | Diseases covered | Accuracy |
|---|---|---|---|---|---|
| 6 | Grape | HMM | Zigbee | Downey, Powdery | Downey 90.9%, Powdery 90.9% |
| 9 | Tomato | SVM, KNN, Random forest | IoT | Disorder detection | 99.6% using Random forest |
| 42 | Tea | Multiple linear regression | IoT | Blister disease detection | 91% |
| 15 | Grape | N/A | IoT | Downey, Powdery | Downey 94.4%, Powdery 96% |
| 8 | Multiple | KNN, RF, LR | IoT | Plant disease | 91% using KNN |
| 29 | Grapevine | ANN | IP | Downey, Powdery, Black Rot | Downey 90.47%, Powdery 92.85% |
| 32 | Grape | SVM, RF, AdaBoost | — | Leaf disease | 93% using SVM |
| 33 | Crop | Fuzzy logic | Wi-Fi | Climate prediction | — |

4 Objective

5 Methodology

Fig. 1

Performance Parameters

Fig. 2

6 System Architecture

Fig. 3

| Sr. No | Component description | Specification |
|---|---|---|
| 1 | Power supply: battery | 40 A, 14.8 V–16.8 V |
| 2 | Temperature sensor | DHT11 |
| 3 | Leaf wetness sensor | — |
| 4 | GSM module | SIM800L GSM/GPRS, 4 V |
| 5 | Node MCU | ESP8266, 16 digital pins, 1 analog pin |
| 6 | LCD display | 16x2 |

Temperature sensor

Calibration

6.1 Experimental Setup

Fig. 4

7 Results and Discussions

7.1 Dataset Generated by System

Fig. 6

| Date | Time | Temperature | Humidity | Leaf wetness |
|---|---|---|---|---|
| 7/14/2023 | 9:11:01 | 28.1 | 67 | 0 |
| 7/14/2023 | 9:11:05 | 28.1 | 67 | 0 |
| 7/14/2023 | 9:11:10 | 28.1 | 67 | 0 |
| 7/14/2023 | 9:11:17 | 28.1 | 66 | 0 |
| 7/14/2023 | 9:11:22 | 28.1 | 66 | 0 |
| 7/14/2023 | 9:11:26 | 28.1 | 66 | 0 |
| 7/14/2023 | 9:11:31 | 28.2 | 66 | 0 |
| 7/14/2023 | 9:11:35 | 28.2 | 66 | 0 |
| 7/14/2023 | 9:11:39 | 28.2 | 66 | 0 |
| 7/14/2023 | 9:11:44 | 28.2 | 66 | 0 |
| 7/14/2023 | 9:11:48 | 28.2 | 66 | 0 |
| 7/14/2023 | 9:11:52 | 28.2 | 66 | 0 |
| 7/14/2023 | 9:11:57 | 28.3 | 65 | 0 |
| 7/14/2023 | 9:12:02 | 28.3 | 65 | 0 |
| 7/14/2023 | 9:12:06 | 28.3 | 65 | 0 |
| 7/14/2023 | 9:12:11 | 28.3 | 65 | 0 |
| 7/14/2023 | 9:12:16 | 28.3 | 65 | 0 |
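Readings like these are typically aggregated before being fed to a classifier. A small plain-Python sketch, using a handful of rows copied from the log, that averages the sensor values:

```python
# A few rows from the log, as (time, temperature °C, humidity %, leaf wetness):
readings = [
    ("9:11:01", 28.1, 67, 0),
    ("9:11:17", 28.1, 66, 0),
    ("9:11:31", 28.2, 66, 0),
    ("9:11:57", 28.3, 65, 0),
    ("9:12:16", 28.3, 65, 0),
]

# Per-window averages, as a classifier might consume them.
mean_temp = sum(r[1] for r in readings) / len(readings)
mean_hum = sum(r[2] for r in readings) / len(readings)
print(f"mean temperature {mean_temp:.2f} °C, mean humidity {mean_hum:.1f} %")
```

The window length and any further smoothing are design choices of the deployed system, not something this subset of the log determines.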

7.2 Results

| Measure (%) | Powdery mildew | Downey mildew | Bacterial leaf spot |
|---|---|---|---|
| Accuracy | 98.25 | 98.85 | 93.95 |
| Recall | 98.3 | 98.9 | 94.0 |
| Precision | 98.3 | 97.7 | 94.4 |
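The accuracy, recall, and precision figures above follow their standard definitions. As a minimal illustration — with hypothetical counts, not the paper's data — each can be computed from a binary confusion matrix:

```python
# Toy binary confusion matrix for one disease class (hypothetical counts):
# rows = actual, columns = predicted.
tp, fn = 90, 10    # diseased samples: correctly flagged / missed
fp, tn = 5, 95     # healthy samples: falsely flagged / correctly cleared

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 185 / 200
recall    = tp / (tp + fn)                    # 90 / 100
precision = tp / (tp + fp)                    # 90 / 95
```

Per-class scores such as those for Powdery, Downey, and Bacterial Leaf Spot would each come from their own one-vs-rest confusion matrix.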

Fig. 7

7.3 Comparison with existing systems

| Parameters / Authors | Patil & Thorat | K. Sanghavi | Proposed |
|---|---|---|---|
| Powdery mildew | Yes | Yes | Yes |
| Downey mildew | Yes | Yes | Yes |
| Bacterial leaf spot | Yes | No | Yes |
| Technique used | IoT | IoT | IoT |
| Cloud based | No | Yes | Yes |
| Accuracy | Downey 90.9%, Powdery 90.9% | Downey 94.4%, Powdery 96% | Downey 98.85%, Powdery 98.25%, Bacterial leaf spot 93.95% |

8 Various Operation Scenarios

Comparative Analysis

9 Limitations

10 Conclusion

11 Future Work

Data Availability

Additional Information

Uncited Reference

Declaration of Competing Interest


Image Processing and Machine Learning for Plant Disease Detection

  • Conference paper
  • First Online: 19 September 2024


  • Dattatray G. Takale 39 ,
  • Parishit N. Mahalle 40 ,
  • Vivek Deshpande 41 ,
  • Chitrakant B. Banchhor 39 ,
  • Piyush P. Gawali 39 ,
  • Gopal Deshmukh 42 ,
  • Vajid Khan 39 &
  • Vikas B. Maral 39  

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 1191)

Included in the following conference series:

  • International Conference on Artificial-Business Analytics, Quantum and Machine Learning

Agriculture must ensure that everyone has enough to eat even if the global population suddenly doubles. Early-stage prediction of plant diseases is therefore crucial for providing food for the general population; unfortunately, early disease forecasting is not yet routine for crops. The purpose of this study is to inform agriculturalists about recent advances in the fight against plant leaf diseases. An accurate methodology for identifying leaf diseases in tomato plants was developed using machine learning and image-processing approaches. The authors propose a method for detecting illness in plants and crops that employs digital image processing and machine learning. The system distinguishes between healthy and unhealthy plant photographs using a Support Vector Machine (SVM) model. Informational properties of leaf samples are extracted using several descriptors, including the Discrete Wavelet Transform (DWT), Principal Component Analysis (PCA), and the Grey-Level Co-occurrence Matrix (GLCM). Integrating DWT, PCA, GLCM, and Convolutional Neural Networks (CNN), the suggested technique achieves its best performance with an F1 score of 99%, 99% accuracy, 98% precision, and 99% recall.
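The feature extractors named in the abstract are standard texture descriptors. As a hedged illustration, here is a minimal numpy sketch of a grey-level co-occurrence matrix and a few texture features derived from it; the paper's full pipeline additionally uses DWT, PCA, and an SVM/CNN classifier, all omitted here.

```python
import numpy as np

def glcm(img, levels, dx=1, dy=0):
    """Normalized grey-level co-occurrence matrix for one pixel offset."""
    m = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            m[img[y, x], img[y + dy, x + dx]] += 1   # count the pixel pair
    return m / m.sum()

def glcm_features(m):
    """Three classic Haralick-style texture features of a GLCM."""
    i, j = np.indices(m.shape)
    contrast = float((m * (i - j) ** 2).sum())
    homogeneity = float((m / (1.0 + np.abs(i - j))).sum())
    energy = float((m ** 2).sum())
    return contrast, homogeneity, energy

# A toy 3-level "leaf patch"; a real pipeline would quantize a grey image.
leaf_patch = np.array([[0, 0, 1],
                       [0, 0, 1],
                       [0, 2, 2]])
contrast, homogeneity, energy = glcm_features(glcm(leaf_patch, levels=3))
```

In practice these per-patch feature vectors (often computed for several offsets and angles) are what PCA would compress before the classifier sees them.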



Author information

Authors and Affiliations

Department of Computer Engineering, Vishwakarma Institute of Information Technology, SPPU Pune, Pune, India

Dattatray G. Takale, Chitrakant B. Banchhor, Piyush P. Gawali, Vajid Khan & Vikas B. Maral

Department of AI & DS, Vishwakarma Institute of Information Technology, SPPU Pune, Pune, India

Parishit N. Mahalle

Vishwakarma Institute of Information Technology, SPPU Pune, Pune, India

Vivek Deshpande

Department of Computer Engineering, KJ College of Engineering and Management Research, SPPU Pune, Pune, India

Gopal Deshmukh


Corresponding author

Correspondence to Dattatray G. Takale .

Editor information

Editors and Affiliations

Department of Computer Science, University of South Dakota, Vermillion, SD, USA

K. C. Santosh

Department of Computer Applications, National Institute of Technology Kurukshetra, Kurukshetra, Haryana, India

Sandeep Kumar Sood

School of Science and Technology, Bournemouth University, Poole, UK

Hari Mohan Pandey

Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India

Charu Virmani


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Takale, D.G. et al. (2024). Image Processing and Machine Learning for Plant Disease Detection. In: Santosh, K.C., Sood, S.K., Pandey, H.M., Virmani, C. (eds) Advances in Artificial-Business Analytics and Quantum Machine Learning. COMITCON 2023. Lecture Notes in Electrical Engineering, vol 1191. Springer, Singapore. https://doi.org/10.1007/978-981-97-2508-3_45

Download citation

DOI : https://doi.org/10.1007/978-981-97-2508-3_45

Published : 19 September 2024

Publisher Name : Springer, Singapore

Print ISBN : 978-981-97-2507-6

Online ISBN : 978-981-97-2508-3

eBook Packages : Computer Science Computer Science (R0)


