REVIEW ARTICLE

Year: 2019 | Volume: 5 | Issue: 1 | Page: 12-13
Machine learning and medical research data analysis
Rajiv Narang1, Jaya Deva2, Sada Nand Dwivedi3
1 Department of Cardiology, All India Institute of Medical Sciences, New Delhi, India
2 Department of Electrical Engineering, Indian Institute of Technology, New Delhi, India
3 Department of Biostatistics, All India Institute of Medical Sciences, New Delhi, India
Date of Web Publication: 2-May-2019
Correspondence Address: Dr. Rajiv Narang, Department of Cardiology, All India Institute of Medical Sciences, Ansari Nagar, New Delhi - 110 029, India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/jpcs.jpcs_20_19
How to cite this article: Narang R, Deva J, Dwivedi SN. Machine learning and medical research data analysis. J Pract Cardiovasc Sci 2019;5:12-3.
Developments over the last few years may change (statistically significantly!) the way we analyze our data. These include the wide availability of powerful computers (especially those with graphics processing units, or GPUs, which allow large-scale, parallelized computation), open-source programming languages such as R (https://cran.r-project.org) and Python (https://www.python.org), and machine learning (ML) software (e.g., Scikit-Learn, Theano, TensorFlow, Caffe, Weka, and Apache Spark). As a result, ML is increasingly being used for data analysis in medicine.[1],[2],[3]
ML algorithms can be supervised or unsupervised, depending on whether a class or outcome variable is available. In addition to the commonly used linear and logistic regression, many generalized linear models are available, such as Ridge regression, Lasso, Elastic net, Least-angle regression, Bayesian regression, the Perceptron, Random sample consensus (RANSAC), the Theil–Sen estimator, and Huber regression. The last three have the advantage of being robust to outliers. At present, only linear and logistic regression analyses are widely used in medical studies.
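As an illustration of the robust alternatives mentioned above, the following is a minimal sketch, using Scikit-Learn on synthetic data invented purely for demonstration, of how ordinary least squares and Huber regression behave when a few outliers contaminate the data:

import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

# Synthetic data: y = 2x + noise, with five gross outliers added.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=100)
y[:5] += 50  # contaminate five observations

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

# The robust estimator stays close to the true slope of 2.0,
# while ordinary least squares is pulled toward the outliers.
print("OLS slope:  ", ols.coef_[0])
print("Huber slope:", huber.coef_[0])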
Supervised learning also includes many nonlinear techniques, such as linear and quadratic discriminant analysis, kernel ridge regression, support vector machines, stochastic gradient descent, nearest neighbors, Gaussian processes, cross decomposition, Naive Bayes (e.g., Bernoulli, Gaussian, and multinomial), decision trees (decision tree, extra tree), ensemble methods (including bagging, Random Forest, AdaBoost, gradient tree boosting, and voting classifiers), and supervised neural network models (e.g., the multilayer perceptron). Cross-decomposition techniques, including partial least squares and canonical correlation analysis, can find relationships between two matrices and hence can be used when the group or outcome variable is itself multivariate, like the predictor variables. Most of the techniques mentioned above can be used both for regression (as an alternative to linear regression) and for classification (as an alternative to logistic regression).
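For instance, a minimal cross-decomposition sketch (again with Scikit-Learn, on data invented here for illustration) in which the outcome itself is multivariate might look like this:

import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Invented example: 200 subjects, 10 predictors, and a
# multivariate outcome with 3 correlated components.
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 10))
Y = X[:, :3] + rng.normal(scale=0.3, size=(200, 3))

# Partial least squares relates the two matrices through
# a small number of shared latent components.
pls = PLSRegression(n_components=3).fit(X, Y)
print("R^2 on training data:", pls.score(X, Y))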
Unsupervised ML methods are helpful for dimensionality reduction and for analyzing data with a large number of features but a small number of cases, a scenario in which linear and logistic regression are not reliable. They include decompositions (e.g., principal component analysis [PCA] and its variants, such as kernel PCA, incremental PCA, robust PCA, sparse PCA, weighted PCA, entropy component analysis, truncated singular value decomposition, nonnegative matrix factorization, independent component analysis, and factor analysis), Gaussian mixture models, manifold learning (e.g., Isomap, locally linear embedding, spectral embedding, and multidimensional scaling), clustering (e.g., K-means, spectral clustering, hierarchical clustering, DBSCAN, and BIRCH), covariance estimation, density estimation, and unsupervised neural network models (e.g., restricted Boltzmann machines).
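A minimal sketch (Scikit-Learn, with data invented here) of the "many features, few cases" situation, in which PCA compresses the feature space before any regression is attempted:

import numpy as np
from sklearn.decomposition import PCA

# Invented example: 30 patients, 500 measured features --
# far too wide for ordinary linear or logistic regression.
rng = np.random.RandomState(2)
X = rng.normal(size=(30, 500))

# Reduce to 5 components; these can then serve as inputs
# to a conventional regression model.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                       # (30, 5)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained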
Many of these algorithms are computationally intensive and were traditionally time-consuming, but they can now be run easily on modern fast computers. All of these techniques have different assumptions, advantages, disadvantages, and situations in which they are most useful. These algorithms may produce models and results that differ from those of linear/logistic regression, yet they may actually be closer to the truth. The average values of coefficients obtained from multiple algorithms are also likely to be indicative of true relationships. It would be prudent to make use of this variety of available techniques in medical research data analysis.
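As a rough illustration of this idea (a sketch of the general approach, not the authors' actual analysis), one could fit several linear-type regressors from Scikit-Learn to the same synthetic data and average their coefficients:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso, ElasticNet, HuberRegressor)

# Synthetic regression data generated here purely for demonstration.
X, y = make_regression(n_samples=200, n_features=5,
                       noise=10.0, random_state=3)

models = [LinearRegression(), Ridge(), Lasso(),
          ElasticNet(), HuberRegressor(max_iter=1000)]

# Collect the coefficient vector from each fitted model.
coefs = np.array([m.fit(X, y).coef_ for m in models])

# Averaging across algorithms gives a rough consensus estimate.
print("Per-algorithm coefficients:\n", coefs.round(2))
print("Average coefficients:", coefs.mean(axis=0).round(2))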
Several feature selection methods are also available to select features responsible for high variance while rejecting features with low variance. These include sequential feature selection, minimum redundancy maximum relevance, correlation-based feature selection, regularized trees, Relief, and information gain-based feature selection, among others. They are broadly categorized into filter, wrapper, and embedded methods (the last incorporating both feature selection and learning). Individual feature importance can also be determined by many methods, such as gradient boosting, AdaBoost, extra trees, decision trees, and Random Forest. In addition, many techniques for model selection and evaluation are available, including cross-validation, model persistence, and validation curves. Comparison of different algorithms using cross-validation is especially popular, as sketched below.
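A minimal sketch (Scikit-Learn, with its bundled breast cancer dataset standing in for any clinical dataset) of comparing algorithms by cross-validation and inspecting embedded, tree-based feature importances:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# Compare three classifiers by 5-fold cross-validated accuracy.
for model in (LogisticRegression(max_iter=5000),
              DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean().round(3))

# Embedded feature importance from the random forest.
forest = RandomForestClassifier(random_state=0).fit(X, y)
top = forest.feature_importances_.argsort()[::-1][:5]
print("Top features:", [data.feature_names[i] for i in top])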
Unconventional approaches used by ML techniques may yield unexpected benefits. For example, Simjanoska et al. were able to accurately determine blood pressure from raw electrocardiographic data![4] Moreover, multiple techniques can now easily be applied to the same medical data.[5] For example, [Figure 1] shows the results of regression analysis of factors associated with low birth weight from the publicly available "birthwt" dataset using 20 different regression algorithms. Similarly, on running 13 classification algorithms for feature selection on the publicly available South African Heart Disease ("sahd") dataset, age, tobacco use, low-density lipoprotein, family history, and Type A personality were selected by 92%, 85%, 77%, 69%, and 62% of the algorithms, respectively, whereas obesity, adiposity, systolic blood pressure, and alcohol intake were selected by only 23%, 15%, 8%, and 8%, respectively. A meta-analysis of this kind, combining results from different algorithms applied to the same data, is likely to produce answers as correct as a meta-analysis of data from different studies analyzed with the same single algorithm.

Figure 1: Comparison of coefficients for different variables obtained from 20 different algorithms. It is clear that uterine irritability ("ui"), history of hypertension ("ht"), and smoking status ("smoke") have the most consistent association.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References
1. Baum A, Scarpa J, Bruzelius E, Tamler R, Basu S, Faghmous J. Targeting weight loss interventions to reduce cardiovascular complications of type 2 diabetes: A machine learning-based post hoc analysis of heterogeneous treatment effects in the Look AHEAD trial. Lancet Diabetes Endocrinol 2017;5:808-15.
2. Motwani M, Dey D, Berman DS, Germano G, Achenbach S, Al-Mallah MH, et al. Machine learning for prediction of all-cause mortality in patients with suspected coronary artery disease: A 5-year multicentre prospective registry analysis. Eur Heart J 2017;38:500-7.
3. Ahmad T, Lund LH, Rao P, Ghosh R, Warier P, Vaccaro B, et al. Machine learning methods improve prognostication, identify clinically distinct phenotypes, and detect heterogeneity in response to therapy in a large cohort of heart failure patients. J Am Heart Assoc 2018;7:e008081.
4. Simjanoska M, Gjoreski M, Gams M, Madevska Bogdanova A. Non-invasive blood pressure estimation from ECG using machine learning techniques. Sensors (Basel) 2018;18:1160.
5. Sakr S, Elshawi R, Ahmed A, Qureshi WT, Brawner C, Keteyian S, et al. Using machine learning on cardiorespiratory fitness data for predicting hypertension: The Henry Ford ExercIse Testing (FIT) project. PLoS One 2018;13:e0195344.