
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

research articles on machine learning

We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output.

FLUX that Plays Music

This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic.


WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.


rerankers: A Lightweight Python Library to Unify Ranking Methods

This paper presents rerankers, a Python library which provides an easy-to-use interface to the most commonly used re-ranking approaches.
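The unifying idea behind such a library — one constructor that selects a scoring backend, exposed behind a single `rank` method — can be sketched in plain Python. This is an illustrative sketch only; the class, method, and backend names below are hypothetical and are not the rerankers library's actual API.

```python
# Hypothetical sketch of a unified re-ranking interface; the names
# (Reranker, rank, "lexical") are illustrative, not the rerankers API.

class Reranker:
    """Dispatch to a scoring backend behind one shared interface."""

    def __init__(self, backend="lexical"):
        # A real library would load a cross-encoder, a T5 ranker, or an
        # API client here; we register a toy lexical scorer instead.
        backends = {"lexical": self._lexical_overlap}
        self.score = backends[backend]

    def _lexical_overlap(self, query, doc):
        # Toy relevance score: fraction of query words found in the doc.
        q = set(query.lower().split())
        d = set(doc.lower().split())
        return len(q & d) / len(q) if q else 0.0

    def rank(self, query, docs):
        # Return documents sorted by descending relevance score.
        scored = [(self.score(query, doc), doc) for doc in docs]
        return [doc for _, doc in sorted(scored, key=lambda t: -t[0])]

ranker = Reranker("lexical")
results = ranker.rank("machine learning ranking",
                      ["a paper on ranking methods for machine learning",
                       "a recipe for banana bread"])
```

The point of the design is that swapping backends changes only the constructor argument, never the calling code.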

DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification

Second, we propose two mechanisms to enforce the diversity among the global vectors to be more descriptive of the entire bag: (i) positive instance alignment and (ii) a novel, efficient, and theoretically guaranteed diversification learning paradigm.

Sapiens: Foundation for Human Vision Models

We present Sapiens, a family of models for four fundamental human-centric vision tasks -- 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction.


LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents.

NTIRE 2024 Challenge on Low Light Image Enhancement: Methods and Results

This paper reviews the NTIRE 2024 low light image enhancement challenge, highlighting the proposed solutions and results.


Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies.
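That fusion strategy — per-token, channel-wise concatenation of the outputs of several encoders — can be sketched with stand-in data. The toy "encoder outputs" below are plain lists standing in for real model tensors; nothing here is Eagle's actual code.

```python
# Sketch of channel-wise concatenation of visual tokens from two
# complementary encoders. The lists below are stand-ins for real
# encoder outputs (e.g. a CLIP-style and a segmentation-style model).

def fuse_by_concat(tokens_a, tokens_b):
    """Concatenate per-token feature vectors along the channel axis.

    Both inputs are lists of equal length; token i of the fused output
    is token i from each encoder joined channel-wise.
    """
    assert len(tokens_a) == len(tokens_b), "token counts must align"
    return [a + b for a, b in zip(tokens_a, tokens_b)]

# Two encoders emitting 3 tokens each, with 2 and 4 channels respectively.
enc_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
enc_b = [[1.0, 1.1, 1.2, 1.3]] * 3
fused = fuse_by_concat(enc_a, enc_b)
# Each fused token now carries 2 + 4 = 6 channels.
```

The appeal the abstract points to is exactly this simplicity: no learned gating or cross-attention is needed to combine the encoders.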

VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

Surprisingly, without further adaptation in the time-series domain, the proposed VisionTS could achieve superior zero-shot forecasting performance compared to existing TSF foundation models.
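The underlying reformulation — treating a 1D series as a 2D "image" so that a visual masked autoencoder can cast forecasting as patch reconstruction — can be illustrated with a toy folding step. This is a hypothetical sketch of the general idea, not the paper's exact preprocessing.

```python
# Toy illustration of folding a 1D time series into a 2D grid so that
# rows align the same phase of a recurring cycle. This is a sketch of
# the general series-as-image idea, not VisionTS's actual pipeline.

def series_to_image(series, period):
    """Fold a 1D series into rows of length `period` (one cycle per row)."""
    rows = len(series) // period
    return [series[r * period:(r + 1) * period] for r in range(rows)]

# A toy series with cycle length 4: after folding, each column holds
# the same phase across cycles, which is what makes image-style
# masking and reconstruction meaningful for forecasting.
series = [0, 1, 2, 3, 10, 11, 12, 13]
image = series_to_image(series, period=4)
```

Forecasting then amounts to masking the final rows (future cycles) and letting the image model reconstruct them.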



Journal of Machine Learning Research

The Journal of Machine Learning Research (JMLR), established in 2000, provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online.

  • 2024.02.18: Volume 24 completed; Volume 25 began.
  • 2023.01.20: Volume 23 completed; Volume 24 began.
  • 2022.07.20: New special issue on climate change.
  • 2022.02.18: New blog post: Retrospectives from 20 Years of JMLR.
  • 2022.01.25: Volume 22 completed; Volume 23 began.
  • 2021.12.02: Message from outgoing co-EiC Bernhard Schölkopf.
  • 2021.02.10: Volume 21 completed; Volume 22 began.
  • More news...

Latest papers

Rethinking Discount Regularization: New Interpretations, Unintended Consequences, and Solutions for Regularization in Reinforcement Learning Sarah Rathnam, Sonali Parbhoo, Siddharth Swaroop, Weiwei Pan, Susan A. Murphy, Finale Doshi-Velez, 2024. [ abs ][ pdf ][ bib ] [ code ]

PromptBench: A Unified Library for Evaluation of Large Language Models Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie, 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ] [ code ]

Gaussian Interpolation Flows Yuan Gao, Jian Huang, Yuling Jiao, 2024. [ abs ][ pdf ][ bib ]

Gaussian Mixture Models with Rare Events Xuetong Li, Jing Zhou, Hansheng Wang, 2024. [ abs ][ pdf ][ bib ]

On the Concentration of the Minimizers of Empirical Risks Paul Escande, 2024. [ abs ][ pdf ][ bib ]

Variance estimation in graphs with the fused lasso Oscar Hernan Madrid Padilla, 2024. [ abs ][ pdf ][ bib ]

Random measure priors in Bayesian recovery from sketches Mario Beraha, Stefano Favaro, Matteo Sesia, 2024. [ abs ][ pdf ][ bib ] [ code ]

From continuous-time formulations to discretization schemes: tensor trains and robust regression for BSDEs and parabolic PDEs Lorenz Richter, Leon Sallandt, Nikolas Nüsken, 2024. [ abs ][ pdf ][ bib ] [ code ]

Label Alignment Regularization for Distribution Shift Ehsan Imani, Guojun Zhang, Runjia Li, Jun Luo, Pascal Poupart, Philip H.S. Torr, Yangchen Pan, 2024. [ abs ][ pdf ][ bib ] [ code ]

Fairness in Survival Analysis with Distributionally Robust Optimization Shu Hu, George H. Chen, 2024. [ abs ][ pdf ][ bib ] [ code ]

FineMorphs: Affine-Diffeomorphic Sequences for Regression Michele Lohr, Laurent Younes, 2024. [ abs ][ pdf ][ bib ]

Tensor-train methods for sequential state and parameter learning in state-space models Yiran Zhao, Tiangang Cui, 2024. [ abs ][ pdf ][ bib ] [ code ]

Memory of recurrent networks: Do we compute it right? Giovanni Ballarin, Lyudmila Grigoryeva, Juan-Pablo Ortega, 2024. [ abs ][ pdf ][ bib ] [ code ]

The Loss Landscape of Deep Linear Neural Networks: a Second-order Analysis El Mehdi Achour, François Malgouyres, Sébastien Gerchinovitz, 2024. [ abs ][ pdf ][ bib ]

High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise Liam Madden, Emiliano Dall'Anese, Stephen Becker, 2024. [ abs ][ pdf ][ bib ] [ code ]

Euler Characteristic Tools for Topological Data Analysis Olympio Hacquard, Vadim Lebovici, 2024. [ abs ][ pdf ][ bib ] [ code ]

Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization Cameron Jakub, Mihai Nica, 2024. [ abs ][ pdf ][ bib ] [ code ]

Fortuna: A Library for Uncertainty Quantification in Deep Learning Gianluca Detommaso, Alberto Gasparin, Michele Donini, Matthias Seeger, Andrew Gordon Wilson, Cedric Archambeau, 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ] [ code ]

Characterization of translation invariant MMD on R^d and connections with Wasserstein distances Thibault Modeste, Clément Dombry, 2024. [ abs ][ pdf ][ bib ]

On the Hyperparameters in Stochastic Gradient Descent with Momentum Bin Shi, 2024. [ abs ][ pdf ][ bib ]

Improved Random Features for Dot Product Kernels Jonas Wacker, Motonobu Kanagawa, Maurizio Filippone, 2024. [ abs ][ pdf ][ bib ] [ code ]

Regret Analysis of Bilateral Trade with a Smoothed Adversary Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Federico Fusco, Stefano Leonardi, 2024. [ abs ][ pdf ][ bib ]

Invariant Physics-Informed Neural Networks for Ordinary Differential Equations Shivam Arora, Alex Bihlo, Francis Valiquette, 2024. [ abs ][ pdf ][ bib ]

Distribution Learning via Neural Differential Equations: A Nonparametric Statistical Perspective Youssef Marzouk, Zhi (Robert) Ren, Sven Wang, Jakob Zech, 2024. [ abs ][ pdf ][ bib ]

Variation Spaces for Multi-Output Neural Networks: Insights on Multi-Task Learning and Network Compression Joseph Shenouda, Rahul Parhi, Kangwook Lee, Robert D. Nowak, 2024. [ abs ][ pdf ][ bib ]

Individual-centered Partial Information in Social Networks Xiao Han, Y. X. Rachel Wang, Qing Yang, Xin Tong, 2024. [ abs ][ pdf ][ bib ]

Data-driven Automated Negative Control Estimation (DANCE): Search for, Validation of, and Causal Inference with Negative Controls Erich Kummerfeld, Jaewon Lim, Xu Shi, 2024. [ abs ][ pdf ][ bib ] [ code ]

Continuous Prediction with Experts' Advice Nicholas J. A. Harvey, Christopher Liaw, Victor S. Portella, 2024. [ abs ][ pdf ][ bib ]

Memory-Efficient Sequential Pattern Mining with Hybrid Tries Amin Hosseininasab, Willem-Jan van Hoeve, Andre A. Cire, 2024. [ abs ][ pdf ][ bib ] [ code ]

Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds Zhenghao Xu, Xiang Ji, Minshuo Chen, Mengdi Wang, Tuo Zhao, 2024. [ abs ][ pdf ][ bib ]

Split Conformal Prediction and Non-Exchangeable Data Roberto I. Oliveira, Paulo Orenstein, Thiago Ramos, João Vitor Romano, 2024. [ abs ][ pdf ][ bib ] [ code ]

Structured Dynamic Pricing: Optimal Regret in a Global Shrinkage Model Rashmi Ranjan Bhuyan, Adel Javanmard, Sungchul Kim, Gourab Mukherjee, Ryan A. Rossi, Tong Yu, Handong Zhao, 2024. [ abs ][ pdf ][ bib ]

Sparse Graphical Linear Dynamical Systems Emilie Chouzenoux, Victor Elvira, 2024. [ abs ][ pdf ][ bib ]

Statistical analysis for a penalized EM algorithm in high-dimensional mixture linear regression model Ning Wang, Xin Zhang, Qing Mai, 2024. [ abs ][ pdf ][ bib ]

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds Hao Liang, Zhi-Quan Luo, 2024. [ abs ][ pdf ][ bib ]

Low-Rank Matrix Estimation in the Presence of Change-Points Lei Shi, Guanghui Wang, Changliang Zou, 2024. [ abs ][ pdf ][ bib ]

A Framework for Improving the Reliability of Black-box Variational Inference Manushi Welandawe, Michael Riis Andersen, Aki Vehtari, Jonathan H. Huggins, 2024. [ abs ][ pdf ][ bib ] [ code ]

Understanding Entropic Regularization in GANs Daria Reshetova, Yikun Bai, Xiugang Wu, Ayfer Özgür, 2024. [ abs ][ pdf ][ bib ]

BenchMARL: Benchmarking Multi-Agent Reinforcement Learning Matteo Bettini, Amanda Prorok, Vincent Moens, 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ] [ code ]

Learning from many trajectories Stephen Tu, Roy Frostig, Mahdi Soltanolkotabi, 2024. [ abs ][ pdf ][ bib ]

Interpretable algorithmic fairness in structured and unstructured data Hari Bandi, Dimitris Bertsimas, Thodoris Koukouvinos, Sofie Kupiec, 2024. [ abs ][ pdf ][ bib ]

FedCBO: Reaching Group Consensus in Clustered Federated Learning through Consensus-based Optimization José A. Carrillo, Nicolás García Trillos, Sixu Li, Yuhua Zhu, 2024. [ abs ][ pdf ][ bib ]

On the Connection between Lp- and Risk Consistency and its Implications on Regularized Kernel Methods Hannes Köhler, 2024. [ abs ][ pdf ][ bib ]

Pre-trained Gaussian Processes for Bayesian Optimization Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zachary Nado, Justin Gilmer, Jasper Snoek, Zoubin Ghahramani, 2024. [ abs ][ pdf ][ bib ] [ code ]

Heterogeneity-aware Clustered Distributed Learning for Multi-source Data Analysis Yuanxing Chen, Qingzhao Zhang, Shuangge Ma, Kuangnan Fang, 2024. [ abs ][ pdf ][ bib ]

From Small Scales to Large Scales: Distance-to-Measure Density based Geometric Analysis of Complex Data Katharina Proksch, Christoph Alexander Weikamp, Thomas Staudt, Benoit Lelandais, Christophe Zimmer, 2024. [ abs ][ pdf ][ bib ] [ code ]

PAMI: An Open-Source Python Library for Pattern Mining Uday Kiran Rage, Veena Pamalla, Masashi Toyoda, Masaru Kitsuregawa, 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ] [ code ]

Law of Large Numbers and Central Limit Theorem for Wide Two-layer Neural Networks: The Mini-Batch and Noisy Case Arnaud Descours, Arnaud Guillin, Manon Michel, Boris Nectoux, 2024. [ abs ][ pdf ][ bib ]

Risk Measures and Upper Probabilities: Coherence and Stratification Christian Fröhlich, Robert C. Williamson, 2024. [ abs ][ pdf ][ bib ]

Parallel-in-Time Probabilistic Numerical ODE Solvers Nathanael Bosch, Adrien Corenflos, Fatemeh Yaghoobi, Filip Tronarp, Philipp Hennig, Simo Särkkä, 2024. [ abs ][ pdf ][ bib ] [ code ]

Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data Shuo-Chieh Huang, Ruey S. Tsay, 2024. [ abs ][ pdf ][ bib ]

Dropout Regularization Versus l2-Penalization in the Linear Model Gabriel Clara, Sophie Langer, Johannes Schmidt-Hieber, 2024. [ abs ][ pdf ][ bib ]

Efficient Convex Algorithms for Universal Kernel Learning Aleksandr Talitckii, Brendon Colbert, Matthew M. Peet, 2024. [ abs ][ pdf ][ bib ] [ code ]

Manifold Learning by Mixture Models of VAEs for Inverse Problems Giovanni S. Alberti, Johannes Hertrich, Matteo Santacesaria, Silvia Sciutto, 2024. [ abs ][ pdf ][ bib ] [ code ]

An Algorithmic Framework for the Optimization of Deep Neural Networks Architectures and Hyperparameters Julie Keisler, El-Ghazali Talbi, Sandra Claudel, Gilles Cabriel, 2024. [ abs ][ pdf ][ bib ] [ code ]

Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity Laixi Shi, Yuejie Chi, 2024. [ abs ][ pdf ][ bib ] [ code ]

Grokking phase transitions in learning local rules with gradient descent Bojan Žunkovič, Enej Ilievski, 2024. [ abs ][ pdf ][ bib ] [ code ]

Unsupervised Tree Boosting for Learning Probability Distributions Naoki Awaya, Li Ma, 2024. [ abs ][ pdf ][ bib ] [ code ]

Linear Regression With Unmatched Data: A Deconvolution Perspective Mona Azadkia, Fadoua Balabdaoui, 2024. [ abs ][ pdf ][ bib ]

Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit Karl Hajjar, Lénaïc Chizat, Christophe Giraud, 2024. [ abs ][ pdf ][ bib ] [ code ]

Sharp analysis of power iteration for tensor PCA Yuchen Wu, Kangjie Zhou, 2024. [ abs ][ pdf ][ bib ]

On the Intrinsic Structures of Spiking Neural Networks Shao-Qun Zhang, Jia-Yi Chen, Jin-Hui Wu, Gao Zhang, Huan Xiong, Bin Gu, Zhi-Hua Zhou, 2024. [ abs ][ pdf ][ bib ]

Three-Way Trade-Off in Multi-Objective Learning: Optimization, Generalization and Conflict-Avoidance Lisha Chen, Heshan Fernando, Yiming Ying, Tianyi Chen, 2024. [ abs ][ pdf ][ bib ] [ code ]

Neural Collapse for Unconstrained Feature Model under Cross-entropy Loss with Imbalanced Data Wanli Hong, Shuyang Ling, 2024. [ abs ][ pdf ][ bib ] [ code ]

Generalized Independent Noise Condition for Estimating Causal Structure with Latent Variables Feng Xie, Biwei Huang, Zhengming Chen, Ruichu Cai, Clark Glymour, Zhi Geng, Kun Zhang, 2024. [ abs ][ pdf ][ bib ]

Classification of Data Generated by Gaussian Mixture Models Using Deep ReLU Networks Tian-Yi Zhou, Xiaoming Huo, 2024. [ abs ][ pdf ][ bib ]

Differentially Private Topological Data Analysis Taegyu Kang, Sehwan Kim, Jinwon Sohn, Jordan Awan, 2024. [ abs ][ pdf ][ bib ] [ code ]

On the Optimality of Misspecified Spectral Algorithms Haobo Zhang, Yicheng Li, Qian Lin, 2024. [ abs ][ pdf ][ bib ]

An Entropy-Based Model for Hierarchical Learning Amir R. Asadi, 2024. [ abs ][ pdf ][ bib ]

Optimal Clustering with Bandit Feedback Junwen Yang, Zixin Zhong, Vincent Y. F. Tan, 2024. [ abs ][ pdf ][ bib ]

A flexible empirical Bayes approach to multiple linear regression and connections with penalized regression Youngseok Kim, Wei Wang, Peter Carbonetto, Matthew Stephens, 2024. [ abs ][ pdf ][ bib ] [ code ]

Spectral Analysis of the Neural Tangent Kernel for Deep Residual Networks Yuval Belfer, Amnon Geifman, Meirav Galun, Ronen Basri, 2024. [ abs ][ pdf ][ bib ]

Permuted and Unlinked Monotone Regression in R^d: an approach based on mixture modeling and optimal transport Martin Slawski, Bodhisattva Sen, 2024. [ abs ][ pdf ][ bib ]

Volterra Neural Networks (VNNs) Siddharth Roheda, Hamid Krim, Bo Jiang, 2024. [ abs ][ pdf ][ bib ] [ code ]

Towards Optimal Sobolev Norm Rates for the Vector-Valued Regularized Least-Squares Algorithm Zhu Li, Dimitri Meunier, Mattes Mollenhauer, Arthur Gretton, 2024. [ abs ][ pdf ][ bib ]

Bayesian Regression Markets Thomas Falconer, Jalal Kazempour, Pierre Pinson, 2024. [ abs ][ pdf ][ bib ] [ code ]

Sharpness-Aware Minimization and the Edge of Stability Philip M. Long, Peter L. Bartlett, 2024. [ abs ][ pdf ][ bib ] [ code ]

Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization Sijia Chen, Yu-Jie Zhang, Wei-Wei Tu, Peng Zhao, Lijun Zhang, 2024. [ abs ][ pdf ][ bib ]

Multi-Objective Neural Architecture Search by Learning Search Space Partitions Yiyang Zhao, Linnan Wang, Tian Guo, 2024. [ abs ][ pdf ][ bib ] [ code ]

Fermat Distances: Metric Approximation, Spectral Convergence, and Clustering Algorithms Nicolás García Trillos, Anna Little, Daniel McKenzie, James M. Murphy, 2024. [ abs ][ pdf ][ bib ] [ code ]

Spherical Rotation Dimension Reduction with Geometric Loss Functions Hengrui Luo, Jeremy E. Purvis, Didong Li, 2024. [ abs ][ pdf ][ bib ]

A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks Yuxin Sun, Dong Lao, Anthony Yezzi, Ganesh Sundaramoorthi, 2024. [ abs ][ pdf ][ bib ] [ code ]

Two is Better Than One: Regularized Shrinkage of Large Minimum Variance Portfolios Taras Bodnar, Nestor Parolya, Erik Thorsen, 2024. [ abs ][ pdf ][ bib ]

Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning Jinchi Chen, Jie Feng, Weiguo Gao, Ke Wei, 2024. [ abs ][ pdf ][ bib ]

Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning Ilnura Usmanova, Yarden As, Maryam Kamgarpour, Andreas Krause, 2024. [ abs ][ pdf ][ bib ] [ code ]

Cluster-Adaptive Network A/B Testing: From Randomization to Estimation Yang Liu, Yifan Zhou, Ping Li, Feifang Hu, 2024. [ abs ][ pdf ][ bib ]

On the Computational and Statistical Complexity of Over-parameterized Matrix Sensing Jiacheng Zhuo, Jeongyeol Kwon, Nhat Ho, Constantine Caramanis, 2024. [ abs ][ pdf ][ bib ]

Optimization-based Causal Estimation from Heterogeneous Environments Mingzhang Yin, Yixin Wang, David M. Blei, 2024. [ abs ][ pdf ][ bib ] [ code ]

Optimal Locally Private Nonparametric Classification with Public Data Yuheng Ma, Hanfang Yang, 2024. [ abs ][ pdf ][ bib ] [ code ]

Learning to Warm-Start Fixed-Point Optimization Algorithms Rajiv Sambharya, Georgina Hall, Brandon Amos, Bartolomeo Stellato, 2024. [ abs ][ pdf ][ bib ] [ code ]

Nonparametric Regression Using Over-parameterized Shallow ReLU Neural Networks Yunfei Yang, Ding-Xuan Zhou, 2024. [ abs ][ pdf ][ bib ]

Nonparametric Copula Models for Multivariate, Mixed, and Missing Data Joseph Feldman, Daniel R. Kowal, 2024. [ abs ][ pdf ][ bib ] [ code ]

An Analysis of Quantile Temporal-Difference Learning Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney, 2024. [ abs ][ pdf ][ bib ]

Conformal Inference for Online Prediction with Arbitrary Distribution Shifts Isaac Gibbs, Emmanuel J. Candès, 2024. [ abs ][ pdf ][ bib ] [ code ]

More Efficient Estimation of Multivariate Additive Models Based on Tensor Decomposition and Penalization Xu Liu, Heng Lian, Jian Huang, 2024. [ abs ][ pdf ][ bib ]

A Kernel Test for Causal Association via Noise Contrastive Backdoor Adjustment Robert Hu, Dino Sejdinovic, Robin J. Evans, 2024. [ abs ][ pdf ][ bib ] [ code ]

Assessing the Overall and Partial Causal Well-Specification of Nonlinear Additive Noise Models Christoph Schultheiss, Peter Bühlmann, 2024. [ abs ][ pdf ][ bib ] [ code ]

Simple Cycle Reservoirs are Universal Boyu Li, Robert Simon Fong, Peter Tino, 2024. [ abs ][ pdf ][ bib ]

On the Computational Complexity of Metropolis-Adjusted Langevin Algorithms for Bayesian Posterior Sampling Rong Tang, Yun Yang, 2024. [ abs ][ pdf ][ bib ]

Generalization and Stability of Interpolating Neural Networks with Minimal Width Hossein Taheri, Christos Thrampoulidis, 2024. [ abs ][ pdf ][ bib ]

Statistical Optimality of Divide and Conquer Kernel-based Functional Linear Regression Jiading Liu, Lei Shi, 2024. [ abs ][ pdf ][ bib ]

Identifiability and Asymptotics in Learning Homogeneous Linear ODE Systems from Discrete Observations Yuanyuan Wang, Wei Huang, Mingming Gong, Xi Geng, Tongliang Liu, Kun Zhang, Dacheng Tao, 2024. [ abs ][ pdf ][ bib ]

Robust Black-Box Optimization for Stochastic Search and Episodic Reinforcement Learning Maximilian Hüttenrauch, Gerhard Neumann, 2024. [ abs ][ pdf ][ bib ]

Kernel Thinning Raaz Dwivedi, Lester Mackey, 2024. [ abs ][ pdf ][ bib ] [ code ]

Optimal Algorithms for Stochastic Bilevel Optimization under Relaxed Smoothness Conditions Xuxing Chen, Tesi Xiao, Krishnakumar Balasubramanian, 2024. [ abs ][ pdf ][ bib ]

Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks Yunpeng Zhao, Ning Hao, Ji Zhu, 2024. [ abs ][ pdf ][ bib ]

Statistical Inference for Fairness Auditing John J. Cherian, Emmanuel J. Candès, 2024. [ abs ][ pdf ][ bib ] [ code ]

Adjusted Wasserstein Distributionally Robust Estimator in Statistical Learning Yiling Xie, Xiaoming Huo, 2024. [ abs ][ pdf ][ bib ]

DoWhy-GCM: An Extension of DoWhy for Causal Inference in Graphical Causal Models Patrick Blöbaum, Peter Götz, Kailash Budhathoki, Atalanti A. Mastakouri, Dominik Janzing, 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ] [ code ]

Flexible Bayesian Product Mixture Models for Vector Autoregressions Suprateek Kundu, Joshua Lukemire, 2024. [ abs ][ pdf ][ bib ]

A Variational Approach to Bayesian Phylogenetic Inference Cheng Zhang, Frederick A. Matsen IV, 2024. [ abs ][ pdf ][ bib ] [ code ]

Fat-Shattering Dimension of k-fold Aggregations Idan Attias, Aryeh Kontorovich, 2024. [ abs ][ pdf ][ bib ]

Unified Binary and Multiclass Margin-Based Classification Yutong Wang, Clayton Scott, 2024. [ abs ][ pdf ][ bib ]

Neural Feature Learning in Function Space Xiangxiang Xu, Lizhong Zheng, 2024. [ abs ][ pdf ][ bib ] [ code ]

PyGOD: A Python Library for Graph Outlier Detection Kay Liu, Yingtong Dou, Xueying Ding, Xiyang Hu, Ruitong Zhang, Hao Peng, Lichao Sun, Philip S. Yu, 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ] [ code ]

Blessings and Curses of Covariate Shifts: Adversarial Learning Dynamics, Directional Convergence, and Equilibria Tengyuan Liang, 2024. [ abs ][ pdf ][ bib ]

Fixed points of nonnegative neural networks Tomasz J. Piotrowski, Renato L. G. Cavalcante, Mateusz Gabor, 2024. [ abs ][ pdf ][ bib ] [ code ]

Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks Fanghui Liu, Leello Dadi, Volkan Cevher, 2024. [ abs ][ pdf ][ bib ]

A Survey on Multi-player Bandits Etienne Boursier, Vianney Perchet, 2024. [ abs ][ pdf ][ bib ]

Transport-based Counterfactual Models Lucas De Lara, Alberto González-Sanz, Nicholas Asher, Laurent Risser, Jean-Michel Loubes, 2024. [ abs ][ pdf ][ bib ] [ code ]

Adaptive Latent Feature Sharing for Piecewise Linear Dimensionality Reduction Adam Farooq, Yordan P. Raykov, Petar Raykov, Max A. Little, 2024. [ abs ][ pdf ][ bib ] [ code ]

Topological Node2vec: Enhanced Graph Embedding via Persistent Homology Yasuaki Hiraoka, Yusuke Imoto, Théo Lacombe, Killian Meehan, Toshiaki Yachimura, 2024. [ abs ][ pdf ][ bib ] [ code ]

Granger Causal Inference in Multivariate Hawkes Processes by Minimum Message Length Katerina Hlaváčková-Schindler, Anna Melnykova, Irene Tubikanec, 2024. [ abs ][ pdf ][ bib ] [ code ]

Representation Learning via Manifold Flattening and Reconstruction Michael Psenka, Druv Pai, Vishal Raman, Shankar Sastry, Yi Ma, 2024. [ abs ][ pdf ][ bib ] [ code ]

Bagging Provides Assumption-free Stability Jake A. Soloff, Rina Foygel Barber, Rebecca Willett, 2024. [ abs ][ pdf ][ bib ] [ code ]

Fairness guarantees in multi-class classification with demographic parity Christophe Denis, Romuald Elie, Mohamed Hebiri, François Hu, 2024. [ abs ][ pdf ][ bib ]

Regimes of No Gain in Multi-class Active Learning Gan Yuan, Yunfan Zhao, Samory Kpotufe, 2024. [ abs ][ pdf ][ bib ]

Learning Optimal Dynamic Treatment Regimens Subject to Stagewise Risk Controls Mochuan Liu, Yuanjia Wang, Haoda Fu, Donglin Zeng, 2024. [ abs ][ pdf ][ bib ]

Margin-Based Active Learning of Classifiers Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, Andrea Paudice, 2024. [ abs ][ pdf ][ bib ]

Random Subgraph Detection Using Queries Wasim Huleihel, Arya Mazumdar, Soumyabrata Pal, 2024. [ abs ][ pdf ][ bib ]

Classification with Deep Neural Networks and Logistic Loss Zihan Zhang, Lei Shi, Ding-Xuan Zhou, 2024. [ abs ][ pdf ][ bib ]

Spectral learning of multivariate extremes Marco Avella Medina, Richard A Davis, Gennady Samorodnitsky, 2024. [ abs ][ pdf ][ bib ]

Sum-of-norms clustering does not separate nearby balls Alexander Dunlap, Jean-Christophe Mourrat, 2024. [ abs ][ pdf ][ bib ] [ code ]

An Algorithm with Optimal Dimension-Dependence for Zero-Order Nonsmooth Nonconvex Stochastic Optimization Guy Kornowski, Ohad Shamir, 2024. [ abs ][ pdf ][ bib ]

Linear Distance Metric Learning with Noisy Labels Meysam Alishahi, Anna Little, Jeff M. Phillips, 2024. [ abs ][ pdf ][ bib ] [ code ]

OpenBox: A Python Toolkit for Generalized Black-box Optimization Huaijun Jiang, Yu Shen, Yang Li, Beicheng Xu, Sixian Du, Wentao Zhang, Ce Zhang, Bin Cui, 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ] [ code ]

Generative Adversarial Ranking Nets Yinghua Yao, Yuangang Pan, Jing Li, Ivor W. Tsang, Xin Yao, 2024. [ abs ][ pdf ][ bib ] [ code ]

Predictive Inference with Weak Supervision Maxime Cauchois, Suyash Gupta, Alnur Ali, John C. Duchi, 2024. [ abs ][ pdf ][ bib ]

Functions with average smoothness: structure, algorithms, and learning Yair Ashlagi, Lee-Ad Gottlieb, Aryeh Kontorovich, 2024. [ abs ][ pdf ][ bib ]

Differentially Private Data Release for Mixed-type Data via Latent Factor Models Yanqing Zhang, Qi Xu, Niansheng Tang, Annie Qu, 2024. [ abs ][ pdf ][ bib ]

The Non-Overlapping Statistical Approximation to Overlapping Group Lasso Mingyu Qi, Tianxi Li, 2024. [ abs ][ pdf ][ bib ] [ code ]

Faster Rates of Differentially Private Stochastic Convex Optimization Jinyan Su, Lijie Hu, Di Wang, 2024. [ abs ][ pdf ][ bib ]

Nonasymptotic analysis of Stochastic Gradient Hamiltonian Monte Carlo under local conditions for nonconvex optimization O. Deniz Akyildiz, Sotirios Sabanis, 2024. [ abs ][ pdf ][ bib ]

Finite-time Analysis of Globally Nonstationary Multi-Armed Bandits Junpei Komiyama, Edouard Fouché, Junya Honda, 2024. [ abs ][ pdf ][ bib ] [ code ]

Stable Implementation of Probabilistic ODE Solvers Nicholas Krämer, Philipp Hennig, 2024. [ abs ][ pdf ][ bib ]

More PAC-Bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime validity Borja Rodríguez-Gálvez, Ragnar Thobaben, Mikael Skoglund, 2024. [ abs ][ pdf ][ bib ]

Neural Hilbert Ladders: Multi-Layer Neural Networks in Function Space Zhengdao Chen, 2024. [ abs ][ pdf ][ bib ]

QDax: A Library for Quality-Diversity and Population-based Algorithms with Hardware Acceleration Felix Chalumeau, Bryan Lim, Raphaël Boige, Maxime Allard, Luca Grillotti, Manon Flageat, Valentin Macé, Guillaume Richard, Arthur Flajolet, Thomas Pierrot, Antoine Cully, 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ] [ code ]

Random Forest Weighted Local Fréchet Regression with Random Objects Rui Qiu, Zhou Yu, Ruoqing Zhu, 2024. [ abs ][ pdf ][ bib ] [ code ]

PhAST: Physics-Aware, Scalable, and Task-Specific GNNs for Accelerated Catalyst Design Alexandre Duval, Victor Schmidt, Santiago Miret, Yoshua Bengio, Alex Hernández-García, David Rolnick, 2024. [ abs ][ pdf ][ bib ] [ code ]

Unsupervised Anomaly Detection Algorithms on Real-world Data: How Many Do We Need? Roel Bouman, Zaharah Bukhsh, Tom Heskes, 2024. [ abs ][ pdf ][ bib ] [ code ]

Multi-class Probabilistic Bounds for Majority Vote Classifiers with Partially Labeled Data Vasilii Feofanov, Emilie Devijver, Massih-Reza Amini, 2024. [ abs ][ pdf ][ bib ]

Information Processing Equalities and the Information–Risk Bridge Robert C. Williamson, Zac Cranko, 2024. [ abs ][ pdf ][ bib ]

Nonparametric Regression for 3D Point Cloud Learning Xinyi Li, Shan Yu, Yueying Wang, Guannan Wang, Li Wang, Ming-Jun Lai, 2024. [ abs ][ pdf ][ bib ] [ code ]

AMLB: an AutoML Benchmark Pieter Gijsbers, Marcos L. P. Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, Joaquin Vanschoren, 2024. [ abs ][ pdf ][ bib ] [ code ]

Materials Discovery using Max K-Armed Bandit Nobuaki Kikkawa, Hiroshi Ohno, 2024. [ abs ][ pdf ][ bib ]

Semi-supervised Inference for Block-wise Missing Data without Imputation Shanshan Song, Yuanyuan Lin, Yong Zhou, 2024. [ abs ][ pdf ][ bib ]

Adaptivity and Non-stationarity: Problem-dependent Dynamic Regret for Online Convex Optimization Peng Zhao, Yu-Jie Zhang, Lijun Zhang, Zhi-Hua Zhou, 2024. [ abs ][ pdf ][ bib ]

Scaling Speech Technology to 1,000+ Languages Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli, 2024. [ abs ][ pdf ][ bib ] [ code ]

MAP- and MLE-Based Teaching Hans Ulrich Simon, Jan Arne Telle, 2024. [ abs ][ pdf ][ bib ]

A General Framework for the Analysis of Kernel-based Tests Tamara Fernández, Nicolás Rivera, 2024. [ abs ][ pdf ][ bib ]

Overparametrized Multi-layer Neural Networks: Uniform Concentration of Neural Tangent Kernel and Convergence of Stochastic Gradient Descent Jiaming Xu, Hanjing Zhu, 2024. [ abs ][ pdf ][ bib ]

Sparse Representer Theorems for Learning in Reproducing Kernel Banach Spaces Rui Wang, Yuesheng Xu, Mingsong Yan, 2024. [ abs ][ pdf ][ bib ]

Exploration of the Search Space of Gaussian Graphical Models for Paired Data Alberto Roverato, Dung Ngoc Nguyen, 2024. [ abs ][ pdf ][ bib ]

The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective Chi-Heng Lin, Chiraag Kaushik, Eva L. Dyer, Vidya Muthukumar, 2024. [ abs ][ pdf ][ bib ] [ code ]

Stochastic Approximation with Decision-Dependent Distributions: Asymptotic Normality and Optimality Joshua Cutler, Mateo Díaz, Dmitriy Drusvyatskiy, 2024. [ abs ][ pdf ][ bib ]

Minimax Rates for High-Dimensional Random Tessellation Forests Eliza O'Reilly, Ngoc Mai Tran, 2024. [ abs ][ pdf ][ bib ]

Nonparametric Estimation of Non-Crossing Quantile Regression Process with Deep ReQU Neural Networks Guohao Shen, Yuling Jiao, Yuanyuan Lin, Joel L. Horowitz, Jian Huang, 2024. [ abs ][ pdf ][ bib ]

Spatial meshing for general Bayesian multivariate models Michele Peruzzi, David B. Dunson, 2024. [ abs ][ pdf ][ bib ] [ code ]

A Semi-parametric Estimation of Personalized Dose-response Function Using Instrumental Variables Wei Luo, Yeying Zhu, Xuekui Zhang, Lin Lin, 2024. [ abs ][ pdf ][ bib ]

Learning Non-Gaussian Graphical Models via Hessian Scores and Triangular Transport Ricardo Baptista, Rebecca Morrison, Olivier Zahm, Youssef Marzouk, 2024. [ abs ][ pdf ][ bib ] [ code ]

On the Learnability of Out-of-distribution Detection Zhen Fang, Yixuan Li, Feng Liu, Bo Han, Jie Lu, 2024. [ abs ][ pdf ][ bib ]

Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training Pan Zhou, Xingyu Xie, Zhouchen Lin, Kim-Chuan Toh, Shuicheng Yan, 2024. [ abs ][ pdf ][ bib ] [ code ]

On the Eigenvalue Decay Rates of a Class of Neural-Network Related Kernel Functions Defined on General Domains Yicheng Li, Zixiong Yu, Guhan Chen, Qian Lin, 2024. [ abs ][ pdf ][ bib ]

Tight Convergence Rate Bounds for Optimization Under Power Law Spectral Conditions Maksim Velikanov, Dmitry Yarotsky, 2024. [ abs ][ pdf ][ bib ]

ptwt - The PyTorch Wavelet Toolbox Moritz Wolter, Felix Blanke, Jochen Garcke, Charles Tapley Hoyt, 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ] [ code ]

Choosing the Number of Topics in LDA Models – A Monte Carlo Comparison of Selection Criteria Victor Bystrov, Viktoriia Naboka-Krell, Anna Staszewska-Bystrova, Peter Winker, 2024. [ abs ][ pdf ][ bib ] [ code ]

Functional Directed Acyclic Graphs Kuang-Yao Lee, Lexin Li, Bing Li, 2024. [ abs ][ pdf ][ bib ]

Unlabeled Principal Component Analysis and Matrix Completion Yunzhen Yao, Liangzu Peng, Manolis C. Tsakiris, 2024. [ abs ][ pdf ][ bib ] [ code ]

Distributed Estimation on Semi-Supervised Generalized Linear Model Jiyuan Tu, Weidong Liu, Xiaojun Mao, 2024. [ abs ][ pdf ][ bib ]

Towards Explainable Evaluation Metrics for Machine Translation Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger, 2024. [ abs ][ pdf ][ bib ]

Differentially private methods for managing model uncertainty in linear regression Víctor Peña, Andrés F. Barrientos, 2024. [ abs ][ pdf ][ bib ]

Data Summarization via Bilevel Optimization Zalán Borsos, Mojmír Mutný, Marco Tagliasacchi, Andreas Krause, 2024. [ abs ][ pdf ][ bib ]

Pareto Smoothed Importance Sampling Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, Jonah Gabry, 2024. [ abs ][ pdf ][ bib ] [ code ]

Policy Gradient Methods in the Presence of Symmetries and State Abstractions Prakash Panangaden, Sahand Rezaei-Shoshtari, Rosie Zhao, David Meger, Doina Precup, 2024. [ abs ][ pdf ][ bib ] [ code ]

Scaling Instruction-Finetuned Language Models Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei, 2024. [ abs ][ pdf ][ bib ]

Tangential Wasserstein Projections Florian Gunsilius, Meng Hsuan Hsieh, Myung Jin Lee, 2024. [ abs ][ pdf ][ bib ] [ code ]

Learnability of Linear Port-Hamiltonian Systems Juan-Pablo Ortega, Daiying Yin , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Off-Policy Action Anticipation in Multi-Agent Reinforcement Learning Ariyan Bighashdel, Daan de Geus, Pavol Jancura, Gijs Dubbelman , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On Unbiased Estimation for Partially Observed Diffusions Jeremy Heng, Jeremie Houssineau, Ajay Jasra , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Improving Lipschitz-Constrained Neural Networks by Learning Activation Functions Stanislas Ducotterd, Alexis Goujon, Pakshal Bohra, Dimitris Perdios, Sebastian Neumayer, Michael Unser , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Mathematical Framework for Online Social Media Auditing Wasim Huleihel, Yehonathan Refael , 2024. [ abs ][ pdf ][ bib ]

An Embedding Framework for the Design and Analysis of Consistent Polyhedral Surrogates Jessie Finocchiaro, Rafael M. Frongillo, Bo Waggoner , 2024. [ abs ][ pdf ][ bib ]

Low-rank Variational Bayes correction to the Laplace method Janet van Niekerk, Haavard Rue , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Scaling the Convex Barrier with Sparse Dual Algorithms Alessandro De Palma, Harkirat Singh Behl, Rudy Bunel, Philip H.S. Torr, M. Pawan Kumar , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Causal-learn: Causal Discovery in Python Yujia Zheng, Biwei Huang, Wei Chen, Joseph Ramsey, Mingming Gong, Ruichu Cai, Shohei Shimizu, Peter Spirtes, Kun Zhang , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Decomposed Linear Dynamical Systems (dLDS) for learning the latent components of neural dynamics Noga Mudrik, Yenho Chen, Eva Yezerets, Christopher J. Rozell, Adam S. Charles , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Existence and Minimax Theorems for Adversarial Surrogate Risks in Binary Classification Natalie S. Frank, Jonathan Niles-Weed , 2024. [ abs ][ pdf ][ bib ]

Data Thinning for Convolution-Closed Distributions Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, Daniela Witten , 2024. [ abs ][ pdf ][ bib ]      [ code ]

A projected semismooth Newton method for a class of nonconvex composite programs with strong prox-regularity Jiang Hu, Kangkang Deng, Jiayuan Wu, Quanzheng Li , 2024. [ abs ][ pdf ][ bib ]

Revisiting RIP Guarantees for Sketching Operators on Mixture Models Ayoub Belhadji, Rémi Gribonval , 2024. [ abs ][ pdf ][ bib ]

Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization Daniel LeJeune, Jiayu Liu, Reinhard Heckel , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Polygonal Unadjusted Langevin Algorithms: Creating stable and efficient adaptive algorithms for neural networks Dong-Young Lim, Sotirios Sabanis , 2024. [ abs ][ pdf ][ bib ]

Axiomatic effect propagation in structural causal models Raghav Singal, George Michailidis , 2024. [ abs ][ pdf ][ bib ]

Optimal First-Order Algorithms as a Function of Inequalities Chanwoo Park, Ernest K. Ryu , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Resource-Efficient Neural Networks for Embedded Systems Wolfgang Roth, Günther Schindler, Bernhard Klein, Robert Peharz, Sebastian Tschiatschek, Holger Fröning, Franz Pernkopf, Zoubin Ghahramani , 2024. [ abs ][ pdf ][ bib ]

Trained Transformers Learn Linear Models In-Context Ruiqi Zhang, Spencer Frei, Peter L. Bartlett , 2024. [ abs ][ pdf ][ bib ]

Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees Nachuan Xiao, Xiaoyin Hu, Xin Liu, Kim-Chuan Toh , 2024. [ abs ][ pdf ][ bib ]

Efficient Modality Selection in Multimodal Learning Yifei He, Runxiang Cheng, Gargi Balasubramaniam, Yao-Hung Hubert Tsai, Han Zhao , 2024. [ abs ][ pdf ][ bib ]

A Multilabel Classification Framework for Approximate Nearest Neighbor Search Ville Hyvönen, Elias Jääsaari, Teemu Roos , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization Lorenzo Pacchiardi, Rilwan A. Adewoyin, Peter Dueben, Ritabrata Dutta , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Multiple Descent in the Multiple Random Feature Model Xuran Meng, Jianfeng Yao, Yuan Cao , 2024. [ abs ][ pdf ][ bib ]

Mean-Square Analysis of Discretized Itô Diffusions for Heavy-tailed Sampling Ye He, Tyler Farghly, Krishnakumar Balasubramanian, Murat A. Erdogdu , 2024. [ abs ][ pdf ][ bib ]

Invariant and Equivariant Reynolds Networks Akiyoshi Sannai, Makoto Kawano, Wataru Kumagai , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Personalized PCA: Decoupling Shared and Unique Features Naichen Shi, Raed Al Kontar , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee George H. Chen , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, Alec Koppel , 2024. [ abs ][ pdf ][ bib ]

Convergence for nonconvex ADMM, with applications to CT imaging Rina Foygel Barber, Emil Y. Sidky , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms T. Tony Cai, Hongji Wei , 2024. [ abs ][ pdf ][ bib ]

Sparse NMF with Archetypal Regularization: Computational and Robustness Properties Kayhan Behdin, Rahul Mazumder , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Deep Network Approximation: Beyond ReLU to Diverse Activation Functions Shijun Zhang, Jianfeng Lu, Hongkai Zhao , 2024. [ abs ][ pdf ][ bib ]

Effect-Invariant Mechanisms for Policy Generalization Sorawit Saengkyongam, Niklas Pfister, Predrag Klasnja, Susan Murphy, Jonas Peters , 2024. [ abs ][ pdf ][ bib ]

Pygmtools: A Python Graph Matching Toolkit Runzhong Wang, Ziao Guo, Wenzheng Pan, Jiale Ma, Yikai Zhang, Nan Yang, Qi Liu, Longxuan Wei, Hanxue Zhang, Chang Liu, Zetian Jiang, Xiaokang Yang, Junchi Yan , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Heterogeneous-Agent Reinforcement Learning Yifan Zhong, Jakub Grudzien Kuba, Xidong Feng, Siyi Hu, Jiaming Ji, Yaodong Yang , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Sample-efficient Adversarial Imitation Learning Dahuin Jung, Hyungyu Lee, Sungroh Yoon , 2024. [ abs ][ pdf ][ bib ]

Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent Benjamin Gess, Sebastian Kassing, Vitalii Konarovskyi , 2024. [ abs ][ pdf ][ bib ]

Rates of convergence for density estimation with generative adversarial networks Nikita Puchkin, Sergey Samsonov, Denis Belomestny, Eric Moulines, Alexey Naumov , 2024. [ abs ][ pdf ][ bib ]

Additive smoothing error in backward variational inference for general state-space models Mathis Chagneux, Elisabeth Gassiat, Pierre Gloaguen, Sylvain Le Corff , 2024. [ abs ][ pdf ][ bib ]

Optimal Bump Functions for Shallow ReLU networks: Weight Decay, Depth Separation, Curse of Dimensionality Stephan Wojtowytsch , 2024. [ abs ][ pdf ][ bib ]

Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der Wilk, Carl Edward Rasmussen, Hong Ge , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On Tail Decay Rate Estimation of Loss Function Distributions Etrit Haxholli, Marco Lorenzi , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces Hao Liu, Haizhao Yang, Minshuo Chen, Tuo Zhao, Wenjing Liao , 2024. [ abs ][ pdf ][ bib ]

Post-Regularization Confidence Bands for Ordinary Differential Equations Xiaowu Dai, Lexin Li , 2024. [ abs ][ pdf ][ bib ]

On the Generalization of Stochastic Gradient Descent with Momentum Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish Khisti, Ben Liang , 2024. [ abs ][ pdf ][ bib ]

Pursuit of the Cluster Structure of Network Lasso: Recovery Condition and Non-convex Extension Shotaro Yagishita, Jun-ya Gotoh , 2024. [ abs ][ pdf ][ bib ]

Iterate Averaging in the Quest for Best Test Error Diego Granziol, Nicholas P. Baskerville, Xingchen Wan, Samuel Albanie, Stephen Roberts , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Nonparametric Inference under B-bits Quantization Kexuan Li, Ruiqi Liu, Ganggang Xu, Zuofeng Shang , 2024. [ abs ][ pdf ][ bib ]

Black Box Variational Inference with a Deterministic Objective: Faster, More Accurate, and Even More Black Box Ryan Giordano, Martin Ingram, Tamara Broderick , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On Sufficient Graphical Models Bing Li, Kyongwon Kim , 2024. [ abs ][ pdf ][ bib ]

Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond Nathan Kallus, Xiaojie Mao, Masatoshi Uehara , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks Sebastian Neumayer, Lénaïc Chizat, Michael Unser , 2024. [ abs ][ pdf ][ bib ]

Improving physics-informed neural networks with meta-learned optimization Alex Bihlo , 2024. [ abs ][ pdf ][ bib ]

A Comparison of Continuous-Time Approximations to Stochastic Gradient Descent Stefan Ankirchner, Stefan Perko , 2024. [ abs ][ pdf ][ bib ]

Critically Assessing the State of the Art in Neural Network Verification Matthias König, Annelot W. Bosman, Holger H. Hoos, Jan N. van Rijn , 2024. [ abs ][ pdf ][ bib ]

Estimating the Minimizer and the Minimum Value of a Regression Function under Passive Design Arya Akhavan, Davit Gogolashvili, Alexandre B. Tsybakov , 2024. [ abs ][ pdf ][ bib ]

Modeling Random Networks with Heterogeneous Reciprocity Daniel Cirkovic, Tiandong Wang , 2024. [ abs ][ pdf ][ bib ]

Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment Zixian Yang, Xin Liu, Lei Ying , 2024. [ abs ][ pdf ][ bib ]

On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models Yangjing Zhang, Ying Cui, Bodhisattva Sen, Kim-Chuan Toh , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Decorrelated Variable Importance Isabella Verdinelli, Larry Wasserman , 2024. [ abs ][ pdf ][ bib ]

Model-Free Representation Learning and Exploration in Low-Rank MDPs Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal , 2024. [ abs ][ pdf ][ bib ]

Seeded Graph Matching for the Correlated Gaussian Wigner Model via the Projected Power Method Ernesto Araya, Guillaume Braun, Hemant Tyagi , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization Shicong Cen, Yuting Wei, Yuejie Chi , 2024. [ abs ][ pdf ][ bib ]

Power of knockoff: The impact of ranking algorithm, augmented design, and symmetric statistic Zheng Tracy Ke, Jun S. Liu, Yucong Ma , 2024. [ abs ][ pdf ][ bib ]

Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction Yuze Han, Guangzeng Xie, Zhihua Zhang , 2024. [ abs ][ pdf ][ bib ]

On Truthing Issues in Supervised Classification Jonathan K. Su , 2024. [ abs ][ pdf ][ bib ]


MIT News | Massachusetts Institute of Technology

Machine learning.



A fast and flexible approach to help doctors annotate medical scans

“ScribblePrompt” is an interactive AI framework that can efficiently highlight anatomical structures across different medical scans, assisting medical workers to delineate regions of interest and abnormalities.

September 9, 2024



Study: Transparency is often lacking in datasets used to train large language models

Researchers developed an easy-to-use tool that enables an AI practitioner to find data that suits the purpose of their model, which could improve accuracy and reduce bias.

August 30, 2024


First AI + Education Summit is an international push for “AI fluency”

The three-day, hands-on conference hosted by the MIT RAISE Initiative welcomed youths and adults from nearly 30 countries.

August 27, 2024


3 Questions: How to prove humanity online

AI agents could soon become indistinguishable from humans online. Could “personhood credentials” protect people against digital imposters?

August 16, 2024


New open-source tool helps to detangle the brain

The software tool NeuroTrALE is designed to quickly and efficiently process large amounts of brain imaging data semi-automatically.

August 14, 2024


LLMs develop their own understanding of reality as their language abilities improve

In controlled experiments, MIT CSAIL researchers discover simulations of reality developing deep within LLMs, indicating an understanding of language beyond simple mimicry.


MIT researchers use large language models to flag problems in complex systems

The approach can detect anomalies in data recorded over time, without the need for any training.


Helping robots practice skills independently to adapt to unfamiliar environments

A new algorithm helps robots practice skills like sweeping and placing objects, potentially helping them improve at important tasks in houses, hospitals, and factories.

August 8, 2024


Precision home robots learn with real-to-sim-to-real

CSAIL researchers introduce a novel approach allowing robots to be trained in simulations of scanned home environments, paving the way for customized household automation accessible to anyone.

July 31, 2024


Method prevents an AI model from being overconfident about wrong answers

More efficient than other approaches, the “Thermometer” technique could help someone know when they should trust a large language model.


Study: When allocating scarce resources with AI, randomization can improve fairness

Introducing structured randomization into decisions based on machine-learning model predictions can address inherent uncertainties while maintaining efficiency.

July 24, 2024


MIT researchers advance automated interpretability in AI models

MAIA is a multimodal agent that can iteratively design experiments to better understand various components of AI systems.

July 23, 2024


Proton-conducting materials could enable new green energy technologies

Analysis and materials identified by MIT engineers could lead to more energy-efficient fuel cells, electrolyzers, batteries, or computing devices.


Large language models don’t behave like people, even though we may expect them to

A new study shows someone’s beliefs about an LLM play a significant role in the model’s performance and are important for how it is deployed.


AI model identifies certain breast tumor stages likely to progress to invasive cancer

The model could help clinicians assess breast cancer stage and ultimately help in reducing overtreatment.

July 22, 2024


Springer Nature - PMC COVID-19 Collection

Machine Learning: Algorithms, Real-World Applications and Research Directions

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122 Australia

2 Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, 4349 Chattogram, Bangladesh

In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world holds a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To analyze these data intelligently and to develop the corresponding smart and automated applications, knowledge of artificial intelligence (AI), and particularly machine learning (ML), is the key. Various types of machine learning algorithms exist in this area, such as supervised, unsupervised, semi-supervised, and reinforcement learning. In addition, deep learning, which is part of a broader family of machine learning methods, can intelligently analyze data on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms, which can be applied to enhance the intelligence and capabilities of an application. Thus, this study’s key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, agriculture, and many more. We also highlight the challenges and potential research directions that emerge from our study. Overall, this paper aims to serve as a reference point for academia and industry professionals, as well as for decision-makers, in various real-world situations and application areas, particularly from a technical point of view.

Introduction

We live in the age of data, where everything around us is connected to a data source and everything in our lives is digitally recorded [ 21 , 103 ]. For instance, the current electronic world holds a wealth of various kinds of data, such as Internet of Things (IoT) data, cybersecurity data, smart city data, business data, smartphone data, social media data, health data, COVID-19 data, and many more. These data can be structured, semi-structured, or unstructured, as discussed briefly in Sect. “ Types of Real-World Data and Machine Learning Techniques ”, and their volume is increasing day by day. Extracting insights from these data makes it possible to build various intelligent applications in the relevant domains. For instance, the relevant cybersecurity data can be used to build a data-driven automated and intelligent cybersecurity system [ 105 ], the relevant mobile data can be used to build personalized context-aware smart mobile applications [ 103 ], and so on. Thus, data management tools and techniques capable of extracting insights or useful knowledge from data in a timely and intelligent way are urgently needed, as they form the basis of real-world applications.

Artificial intelligence (AI), and particularly machine learning (ML), has grown rapidly in recent years in the context of data analysis and computing, typically allowing applications to function in an intelligent manner [ 95 ]. ML usually provides systems with the ability to learn and improve from experience automatically, without being explicitly programmed, and is generally regarded as one of the most popular recent technologies of the fourth industrial revolution (4IR or Industry 4.0) [ 103 , 105 ]. “Industry 4.0” [ 114 ] typically refers to the ongoing automation of conventional manufacturing and industrial practices, including exploratory data processing, using new smart technologies such as machine learning automation. Thus, to intelligently analyze these data and to develop the corresponding real-world applications, machine learning algorithms are the key. The learning algorithms can be categorized into four major types: supervised, unsupervised, semi-supervised, and reinforcement learning [ 75 ], discussed briefly in Sect. “ Types of Real-World Data and Machine Learning Techniques ”. The popularity of these learning approaches is increasing day by day, as shown in Fig. 1, based on data collected from Google Trends [ 4 ] over the last five years. The x-axis of the figure indicates the specific dates, and the corresponding popularity score, within the range of 0 (minimum) to 100 (maximum), is shown on the y-axis. According to Fig. 1, the popularity scores for these learning types were low in 2015 and have been increasing day by day. These statistics motivate us to study machine learning in this paper, as it can play an important role in the real world through Industry 4.0 automation.

Fig. 1: The worldwide popularity score of various types of ML algorithms (supervised, unsupervised, semi-supervised, and reinforcement) on a scale of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp information and the y-axis represents the corresponding score

In general, the effectiveness and efficiency of a machine learning solution depend on the nature and characteristics of the data and on the performance of the learning algorithms. In the area of machine learning algorithms, classification analysis, regression, data clustering, feature engineering and dimensionality reduction, association rule learning, and reinforcement learning techniques exist to effectively build data-driven systems [ 41 , 125 ]. In addition, deep learning, which originated from the artificial neural network, can be used to intelligently analyze data and is known as part of a wider family of machine learning approaches [ 96 ]. Thus, selecting a proper learning algorithm that is suitable for the target application in a particular domain is challenging. The reason is that different learning algorithms serve different purposes, and even the outcomes of different learning algorithms within a similar category may vary depending on the data characteristics [ 106 ]. Thus, it is important to understand the principles of various machine learning algorithms and their applicability in various real-world application areas, such as IoT systems, cybersecurity services, business and recommendation systems, smart cities, healthcare and COVID-19, context-aware systems, sustainable agriculture, and many more, which are explained briefly in Sect. “ Applications of Machine Learning ”.

Based on the importance and potential of “Machine Learning” to analyze the data mentioned above, in this paper we provide a comprehensive view of various types of machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. Thus, the key contribution of this study is explaining the principles and potential of different machine learning techniques and their applicability in the various real-world application areas mentioned earlier. The purpose of this paper is, therefore, to provide a basic guide for those in academia and industry who want to study, research, and develop data-driven automated and intelligent systems in the relevant areas based on machine learning techniques.

The key contributions of this paper are listed as follows:

  • To define the scope of our study by taking into account the nature and characteristics of various types of real-world data and the capabilities of various learning techniques.
  • To provide a comprehensive view on machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.
  • To discuss the applicability of machine learning-based solutions in various real-world application domains.
  • To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services.

The rest of the paper is organized as follows. The next section presents the types of data and machine learning algorithms in a broader sense and defines the scope of our study. We briefly discuss and explain different machine learning algorithms in the subsequent section, after which various real-world application areas based on machine learning algorithms are discussed and summarized. In the penultimate section, we highlight several research issues and potential future directions, and the final section concludes this paper.

Types of Real-World Data and Machine Learning Techniques

Machine learning algorithms typically consume and process data to learn the related patterns about individuals, business processes, transactions, events, and so on. In the following, we discuss various types of real-world data as well as categories of machine learning algorithms.

Types of Real-World Data

Usually, the availability of data is considered the key to constructing a machine learning model or a data-driven real-world system [ 103 , 105 ]. Data can come in various forms, such as structured, semi-structured, or unstructured [ 41 , 72 ]. In addition, “metadata” is another type, which typically represents data about the data. In the following, we briefly discuss these types of data.

  • Structured: Structured data has a well-defined structure and conforms to a data model following a standard order; it is highly organized, easily accessed, and readily used by an entity or a computer program. Structured data are typically stored in well-defined schemas, such as relational databases, i.e., in a tabular format. For instance, names, dates, addresses, credit card numbers, stock information, geolocation, etc. are examples of structured data.
  • Unstructured: On the other hand, unstructured data has no pre-defined format or organization, which makes it much more difficult to capture, process, and analyze; it mostly consists of text and multimedia material. For example, sensor data, emails, blog entries, wikis, word processing documents, PDF files, audio files, videos, images, presentations, web pages, and many other types of business documents can be considered unstructured data.
  • Semi-structured: Semi-structured data are not stored in a relational database like the structured data mentioned above, but they have certain organizational properties that make them easier to analyze. HTML, XML, JSON documents, NoSQL databases, etc., are some examples of semi-structured data.
  • Metadata: Metadata is not a normal form of data but rather “data about data”. The primary difference between “data” and “metadata” is that data are simply the material that can classify, measure, or document something relative to an organization’s data properties, whereas metadata describes the relevant data information, giving it more significance for data users. Basic examples of a document’s metadata include the author, file size, creation date, and keywords describing the document.
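These distinctions can be made concrete in a few lines of code. The sketch below is a minimal illustration using only the Python standard library, with hypothetical record values (the names, dates, and sizes are invented for this example): it shows information held as structured tabular data, as semi-structured JSON, and as metadata describing a document.

```python
import csv
import io
import json

# Structured: tabular rows with a fixed schema, as in a relational database.
structured = "name,date,address\nAlice,2021-03-29,22 Main St\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: JSON has organizational properties (keys, nesting)
# but no rigid schema; fields may vary from record to record.
semi_structured = json.loads(
    '{"name": "Alice", "contact": {"email": "alice@example.com"}}'
)

# Metadata: "data about data"; it describes the document, not its content.
metadata = {"author": "Alice", "file_size_bytes": 2048, "created": "2021-03-29"}

print(rows[0]["name"])                      # access via the fixed schema
print(semi_structured["contact"]["email"])  # access via nested keys
print(sorted(metadata))                     # descriptive attributes only
```

The practical consequence is the one described above: the structured row can be queried by column name, the JSON record must be navigated key by key, and the metadata says nothing about the record's content at all.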

In the area of machine learning and data science, researchers use various widely used datasets for different purposes. These are, for example, cybersecurity datasets such as NSL-KDD [ 119 ], UNSW-NB15 [ 76 ], ISCX’12 [ 1 ], CIC-DDoS2019 [ 2 ], Bot-IoT [ 59 ], etc., smartphone datasets such as phone call logs [ 84 , 101 ], SMS logs [ 29 ], mobile application usage logs [ 137 , 117 ], mobile phone notification logs [ 73 ], etc., IoT data [ 16 , 57 , 62 ], agriculture and e-commerce data [ 120 , 138 ], health data such as heart disease [ 92 ], diabetes mellitus [ 83 , 134 ], COVID-19 [ 43 , 74 ], etc., and many more in various application domains. The data can be of the different types discussed above, which may vary from application to application in the real world. To analyze such data in a particular problem domain, and to extract insights or useful knowledge from the data for building real-world intelligent applications, different types of machine learning techniques can be used according to their learning capabilities, as discussed in the following.

Types of Machine Learning Techniques

Machine Learning algorithms are mainly divided into four categories: Supervised learning, Unsupervised learning, Semi-supervised learning, and Reinforcement learning [ 75 ], as shown in Fig. 2. In the following, we briefly discuss each type of learning technique with the scope of their applicability to solve real-world problems.

Fig. 2: Various types of machine learning techniques

  • Supervised: Supervised learning is typically the task of machine learning to learn a function that maps an input to an output based on sample input-output pairs [ 41 ]. It uses labeled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are identified to be accomplished from a certain set of inputs [ 105 ], i.e., a task-driven approach . The most common supervised tasks are “classification” that separates the data, and “regression” that fits the data. For instance, predicting the class label or sentiment of a piece of text, like a tweet or a product review, i.e., text classification, is an example of supervised learning.
  • Unsupervised: Unsupervised learning analyzes unlabeled datasets without the need for human interference, i.e., a data-driven process [ 41 ]. This is widely used for extracting generative features, identifying meaningful trends and structures, groupings in results, and exploratory purposes. The most common unsupervised learning tasks are clustering, density estimation, feature learning, dimensionality reduction, finding association rules, anomaly detection, etc.
  • Semi-supervised: Semi-supervised learning can be defined as a hybridization of the above-mentioned supervised and unsupervised methods, as it operates on both labeled and unlabeled data [ 41 , 105 ]. Thus, it falls between learning “without supervision” and learning “with supervision”. In the real world, labeled data could be rare in several contexts, and unlabeled data are numerous, where semi-supervised learning is useful [ 75 ]. The ultimate goal of a semi-supervised learning model is to provide a better outcome for prediction than that produced using the labeled data alone from the model. Some application areas where semi-supervised learning is used include machine translation, fraud detection, labeling data and text classification.
  • Reinforcement: Reinforcement learning is a type of machine learning algorithm that enables software agents and machines to automatically evaluate the optimal behavior in a particular context or environment to improve their efficiency [ 52 ], i.e., an environment-driven approach. This type of learning is based on reward or penalty, and its ultimate goal is to use the insights obtained from environmental activities to take actions that increase the reward or minimize the risk [ 75 ]. It is a powerful tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving, manufacturing, and supply-chain logistics; however, it is not preferable for solving basic or straightforward problems.
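The practical difference between the supervised and unsupervised paradigms above can be seen directly in code. The following is a minimal sketch using scikit-learn; the tiny two-feature dataset is synthetic and purely illustrative.

```python
# Minimal contrast between supervised and unsupervised learning with
# scikit-learn; the tiny dataset below is synthetic and illustrative only.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[0.0, 0.1], [0.2, 0.1], [1.0, 0.9], [1.1, 1.0]]  # feature vectors
y = [0, 0, 1, 1]                                      # labels (supervised only)

# Supervised (task-driven): learns a mapping from X to the given labels y.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.1, 0.0]]))   # predicted class label for a new point

# Unsupervised (data-driven): groups X without using any labels at all.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                  # cluster assignment per training point
```

Note that the unsupervised call receives only `X`; the cluster identifiers it returns are arbitrary, unlike the meaningful class labels in the supervised case.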

Thus, to build effective models in various application areas, different types of machine learning techniques can play a significant role according to their learning capabilities, depending on the nature of the data discussed earlier and the target outcome. In Table 1, we summarize the various types of machine learning techniques with examples. In the following, we provide a comprehensive view of machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

Various types of machine learning techniques with examples

Learning type   | Model building                                                         | Examples
Supervised      | Algorithms or models learn from labeled data (task-driven approach)    | Classification, regression
Unsupervised    | Algorithms or models learn from unlabeled data (data-driven approach)  | Clustering, associations, dimensionality reduction
Semi-supervised | Models are built using combined data (labeled + unlabeled)             | Classification, clustering
Reinforcement   | Models are based on reward or penalty (environment-driven approach)    | Classification, control

Machine Learning Tasks and Algorithms

In this section, we discuss various machine learning algorithms, including classification analysis, regression analysis, data clustering, association rule learning, feature engineering for dimensionality reduction, and deep learning methods. A general structure of a machine learning-based predictive model is shown in Fig. 3, where the model is trained from historical data in phase 1 and the outcome is generated for new test data in phase 2.

Fig. 3: A general structure of a machine learning-based predictive model considering both the training and testing phases

Classification Analysis

Classification is regarded as a supervised learning method in machine learning, referring to a problem of predictive modeling, where a class label is predicted for a given example [ 41 ]. Mathematically, it learns a function f that maps input variables X to output variables Y, i.e., targets, labels, or categories. It can be carried out on structured or unstructured data to predict the class of given data points. For example, spam detection, with the classes “spam” and “not spam”, in email service providers can be a classification problem. In the following, we summarize the common classification problems.

  • Binary classification: It refers to the classification tasks having two class labels such as “true and false” or “yes and no” [ 41 ]. In such binary classification tasks, one class could be the normal state, while the abnormal state could be another class. For instance, “cancer not detected” is the normal state of a task that involves a medical test, and “cancer detected” could be considered as the abnormal state. Similarly, “spam” and “not spam” in the above example of email service providers are considered as binary classification.
  • Multiclass classification: Traditionally, this refers to classification tasks having more than two class labels [ 41 ]. Unlike binary classification tasks, multiclass classification has no notion of normal and abnormal outcomes; instead, each example is classified as belonging to one of a range of specified classes. For example, classifying the various types of network attacks in the NSL-KDD dataset [ 119 ] is a multiclass classification task, where the attack categories are grouped into four class labels: DoS (Denial of Service Attack), U2R (User to Root Attack), R2L (Root to Local Attack), and Probing Attack.
  • Multi-label classification: In machine learning, multi-label classification is an important consideration where an example is associated with several classes or labels. Thus, it is a generalization of multiclass classification, in which the classes involved in the problem are hierarchically structured and each example may simultaneously belong to more than one class at each hierarchical level, e.g., multi-level text classification. For instance, a Google News article can be presented under the categories of a “city name”, “technology”, “latest news”, etc. Multi-label classification requires advanced machine learning algorithms that support predicting multiple mutually non-exclusive classes or labels, unlike traditional classification tasks where class labels are mutually exclusive [ 82 ].
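The three classification settings above can be sketched with scikit-learn. The tiny one-feature arrays below are made up for illustration; a decision tree serves as an arbitrary base classifier, and the multi-label case uses one indicator column per label.

```python
# Illustrative sketch of binary, multiclass, and multi-label classification
# with scikit-learn; data and label encodings are made up for demonstration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.multioutput import MultiOutputClassifier

X = [[0], [1], [2], [3]]

# Binary: exactly two class labels, e.g. "not spam" (0) vs "spam" (1).
y_binary = [0, 0, 1, 1]
print(DecisionTreeClassifier().fit(X, y_binary).predict([[3]]))

# Multiclass: more than two mutually exclusive labels (e.g. attack types).
y_multi = [0, 1, 2, 2]
print(DecisionTreeClassifier().fit(X, y_multi).predict([[1]]))

# Multi-label: each example may carry several non-exclusive labels at once,
# encoded as one 0/1 indicator column per label (e.g. "technology", "news").
y_labels = [[1, 0], [1, 1], [0, 1], [0, 1]]
clf = MultiOutputClassifier(DecisionTreeClassifier()).fit(X, y_labels)
print(clf.predict([[1]]))   # one 0/1 prediction per label column
```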

Many classification algorithms have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the most common and popular methods that are used widely in various application areas.

  • Naive Bayes (NB): The naive Bayes algorithm is based on Bayes’ theorem with the assumption of independence between each pair of features [ 51 ]. It works well and can be used for both binary and multi-class categories in many real-world situations, such as document or text classification, spam filtering, etc. The NB classifier can be used to effectively classify noisy instances in the data and to construct a robust prediction model [ 94 ]. The key benefit is that, compared to more sophisticated approaches, it needs only a small amount of training data to estimate the necessary parameters, and it does so quickly [ 82 ]. However, its performance may be affected by its strong assumption of feature independence. Gaussian, Multinomial, Complement, Bernoulli, and Categorical are the common variants of the NB classifier [ 82 ].
  • Linear Discriminant Analysis (LDA): Linear Discriminant Analysis (LDA) is a classifier with a linear decision boundary, created by fitting class-conditional densities to the data and applying Bayes’ rule [ 51 , 82 ]. This method is also known as a generalization of Fisher’s linear discriminant, which projects a given dataset into a lower-dimensional space, i.e., a dimensionality reduction that minimizes the complexity of the model or reduces the resulting model’s computational cost. The standard LDA model fits each class with a Gaussian density, assuming that all classes share the same covariance matrix [ 82 ]. LDA is closely related to ANOVA (analysis of variance) and regression analysis, which likewise seek to express one dependent variable as a linear combination of other features or measurements.
  • Logistic regression (LR): Another common probabilistic statistical model used to solve classification problems in machine learning is logistic regression (LR) [ 64 ]. Logistic regression typically estimates probabilities using a logistic function, also known as the sigmoid function, defined mathematically in Eq. (1). It works well when the dataset can be separated linearly, but it may overfit high-dimensional datasets; the regularization (L1 and L2) techniques [ 82 ] can be used to avoid over-fitting in such scenarios. The assumption of linearity between the dependent and independent variables is considered a major drawback of logistic regression. It can be used for both classification and regression problems, but it is more commonly used for classification.

    g(z) = 1 / (1 + exp(-z)).   (1)
  • K-nearest neighbors (KNN): K-Nearest Neighbors (KNN) [ 9 ] is an “instance-based learning” or non-generalizing learning, also known as a “lazy learning” algorithm. It does not focus on constructing a general internal model; instead, it stores all instances corresponding to training data in n -dimensional space. KNN uses data and classifies new data points based on similarity measures (e.g., Euclidean distance function) [ 82 ]. Classification is computed from a simple majority vote of the k nearest neighbors of each point. It is quite robust to noisy training data, and accuracy depends on the data quality. The biggest issue with KNN is to choose the optimal number of neighbors to be considered. KNN can be used both for classification as well as regression.
  • Support vector machine (SVM): In machine learning, another common technique that can be used for classification, regression, or other tasks is a support vector machine (SVM) [ 56 ]. In high- or infinite-dimensional space, a support vector machine constructs a hyper-plane or set of hyper-planes. Intuitively, the hyper-plane, which has the greatest distance from the nearest training data points in any class, achieves a strong separation since, in general, the greater the margin, the lower the classifier’s generalization error. It is effective in high-dimensional spaces and can behave differently based on different mathematical functions known as the kernel. Linear, polynomial, radial basis function (RBF), sigmoid, etc., are the popular kernel functions used in SVM classifier [ 82 ]. However, when the data set contains more noise, such as overlapping target classes, SVM does not perform well.
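The five classifiers above can be compared side by side in a few lines of scikit-learn. The sketch below uses the built-in iris dataset; the split ratio and hyperparameters (k = 5 neighbors, RBF kernel) are arbitrary illustrative choices, and the scores will vary with the split.

```python
# Small, illustrative comparison of the classifiers discussed above on the
# built-in iris dataset; split and hyperparameters are arbitrary choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "NB":  GaussianNB(),                       # Gaussian naive Bayes
    "LDA": LinearDiscriminantAnalysis(),
    "LR":  LogisticRegression(max_iter=1000),  # sigmoid-based classifier
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),                  # RBF-kernel support vectors
}
for name, model in models.items():
    acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {acc:.3f}")
```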

Fig. 4: An example of a decision tree structure

Fig. 5: An example of a random forest structure considering multiple decision trees

  • Adaptive Boosting (AdaBoost): Adaptive Boosting (AdaBoost) is an ensemble learning process that employs an iterative approach to improve poor classifiers by learning from their errors. It was developed by Freund et al. [ 35 ] and is also known as “meta-learning”. Unlike the random forest, which uses parallel ensembling, AdaBoost uses “sequential ensembling”. It creates a powerful classifier of high accuracy by combining many poorly performing classifiers. In that sense, AdaBoost is called an adaptive classifier, significantly improving classifier efficiency, although in some instances it can trigger overfitting. AdaBoost is best used to boost the performance of decision trees (its base estimator [ 82 ]) on binary classification problems; however, it is sensitive to noisy data and outliers.
  • Extreme gradient boosting (XGBoost): Gradient Boosting, like Random Forests [ 19 ] above, is an ensemble learning algorithm that generates a final model based on a series of individual models, typically decision trees. The gradient is used to minimize the loss function, similar to how neural networks [ 41 ] use gradient descent to optimize weights. Extreme Gradient Boosting (XGBoost) is a form of gradient boosting that takes more detailed approximations into account when determining the best model [ 82 ]. It computes second-order gradients of the loss function to minimize loss and advanced regularization (L1 and L2) [ 82 ], which reduces over-fitting, and improves model generalization and performance. XGBoost is fast to interpret and can handle large-sized datasets well.
  • Stochastic gradient descent (SGD): Stochastic gradient descent (SGD) [ 41 ] is an iterative method for optimizing an objective function with suitable smoothness properties, where the word “stochastic” refers to random probability. This reduces the computational burden, particularly in high-dimensional optimization problems, allowing faster iterations in exchange for a lower convergence rate. A gradient is the slope of a function, measuring the degree to which one variable changes in response to changes in another. Mathematically, gradient descent computes partial derivatives of the objective with respect to its input parameters. Let α be the learning rate and J_i the cost of the i-th training example; then Eq. (4) gives the stochastic gradient descent update of weight w_j at the j-th iteration. In large-scale and sparse machine learning, SGD has been successfully applied to problems often encountered in text classification and natural language processing [ 82 ]. However, SGD is sensitive to feature scaling and requires a range of hyperparameters, such as the regularization parameter and the number of iterations.

    w_j := w_j - α (∂J_i / ∂w_j).   (4)
  • Rule-based classification: The term rule-based classification can be used to refer to any classification scheme that makes use of IF-THEN rules for class prediction. Several classification algorithms with the ability to generate rules exist, such as Zero-R [ 125 ], One-R [ 47 ], decision trees [ 87 , 88 ], DTNB [ 110 ], Ripple Down Rule learner (RIDOR) [ 125 ], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [ 126 ]. The decision tree is one of the most common rule-based classification algorithms among these techniques because it has several advantages, such as being easier to interpret, the ability to handle high-dimensional data, simplicity and speed, good accuracy, and the capability to produce rules that are clear and understandable to humans [ 127 , 128 ]. The decision tree-based rules also provide significant accuracy in a prediction model for unseen test cases [ 106 ]. Since the rules are easily interpretable, these rule-based classifiers are often used to produce descriptive models that can describe a system, including the entities and their relationships.
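The weight update in Eq. (4) can be sketched in a few lines of NumPy for a linear model with a squared-error per-example cost. The data, learning rate, and epoch count below are arbitrary illustrative choices, and the noise-free target y = 2x lets the learned weight be checked against its known value.

```python
# Minimal NumPy sketch of the stochastic gradient descent update in Eq. (4):
# w_j := w_j - alpha * dJ_i/dw_j, here for a linear model with the per-example
# squared-error cost J_i = (w·x_i - y_i)^2 / 2, on synthetic data y = 2x.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 2.0 * X[:, 0]                        # true weight is 2.0, no noise

w = np.zeros(1)                          # initial weight
alpha = 0.1                              # learning rate
for epoch in range(50):
    for i in rng.permutation(len(X)):    # one example at a time ("stochastic")
        error = X[i] @ w - y[i]          # residual for example i
        grad = error * X[i]              # dJ_i/dw for the squared-error cost
        w -= alpha * grad                # the Eq. (4) weight update
print(w)                                 # should approach [2.0]
```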

Regression Analysis

Regression analysis comprises several machine learning methods that allow predicting a continuous outcome variable (y) based on the value of one or more predictor variables (x) [ 41 ]. The most significant distinction between classification and regression is that classification predicts distinct class labels, while regression facilitates the prediction of a continuous quantity. Figure 6 shows an example of how classification differs from regression. Some overlap is often found between the two types of machine learning algorithms. Regression models are now widely used in a variety of fields, including financial forecasting or prediction, cost estimation, trend analysis, marketing, time-series estimation, drug response modeling, and many more. Some of the familiar types of regression algorithms are linear, polynomial, LASSO, and ridge regression, which are explained briefly in the following.

  • Simple and multiple linear regression: This is one of the most popular ML modeling techniques, as well as a well-known regression technique. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the form of the regression line is linear. Linear regression creates a relationship between the dependent variable (Y) and one or more independent variables (X), known as the regression line, using the best-fit straight line [ 41 ]. It is defined by the following equations:

    y = a + bx + e,   (5)
    y = a + b_1 x_1 + b_2 x_2 + ... + b_n x_n + e,   (6)

where a is the intercept, b is the slope of the line, and e is the error term. These equations can be used to predict the value of the target variable from the given predictor variable(s). Multiple linear regression, defined in Eq. (6), is an extension of simple linear regression that allows two or more predictor variables to model the response variable y as a linear function [ 41 ], whereas simple linear regression, defined in Eq. (5), has only one independent variable.
  • Polynomial regression: Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled not as linear but as an n-th degree polynomial in x [ 82 ]. The equation for polynomial regression is derived from the linear regression (polynomial regression of degree 1) equation and is defined as:

    y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + ... + b_n x^n + e.   (7)

Here, y is the predicted/target output, b_0, b_1, ..., b_n are the regression coefficients, and x is the independent input variable. In short, when the data are not distributed linearly but instead follow an n-th degree polynomial, polynomial regression is used to obtain the desired output.
  • LASSO and ridge regression: LASSO and ridge regression are well known as powerful techniques typically used for building learning models in the presence of a large number of features, owing to their capability to prevent over-fitting and to reduce the complexity of the model. The LASSO (least absolute shrinkage and selection operator) regression model uses the L1 regularization technique [ 82 ], applying shrinkage that penalizes the absolute value of the magnitude of the coefficients (the L1 penalty). As a result, LASSO tends to shrink some coefficients exactly to zero. Thus, LASSO regression aims to find the subset of predictors that minimizes the prediction error for a quantitative response variable. On the other hand, ridge regression uses L2 regularization [ 82 ], which penalizes the squared magnitude of the coefficients (the L2 penalty). Thus, ridge regression forces the weights to be small but never sets a coefficient value to zero, yielding a non-sparse solution. Overall, LASSO regression is useful for obtaining a subset of predictors by eliminating less important features, and ridge regression is useful when a dataset has “multicollinearity”, i.e., predictors that are correlated with other predictors.
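The regression variants above can be sketched with scikit-learn on a synthetic one-dimensional dataset; the quadratic ground truth and the regularization strengths (alpha values) are arbitrary choices for illustration.

```python
# Illustrative sketch of the regression variants discussed above with
# scikit-learn; the one-dimensional quadratic data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(80, 1))
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2   # quadratic ground truth

# Simple linear regression (Eq. 5): y = a + b*x.
lin = LinearRegression().fit(x, y)

# Polynomial regression (Eq. 7): linear regression on polynomial features.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

# LASSO (L1 penalty, can zero out coefficients) and ridge (L2 penalty).
lasso = Lasso(alpha=0.1).fit(x, y)
ridge = Ridge(alpha=1.0).fit(x, y)

print(f"linear R^2:     {lin.score(x, y):.3f}")   # misses the curvature
print(f"polynomial R^2: {poly.score(x, y):.3f}")  # captures the curvature
```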

Fig. 6: Classification vs. regression. In classification, the dotted line represents a linear boundary that separates the two classes; in regression, the dotted line models the linear relationship between the two variables

Cluster Analysis

Cluster analysis, also known as clustering, is an unsupervised machine learning technique for identifying and grouping related data points in large datasets, without concern for a specific outcome. It groups a collection of objects in such a way that objects in the same category, called a cluster, are in some sense more similar to each other than to objects in other groups [ 41 ]. It is often used as a data analysis technique to discover interesting trends or patterns in data, e.g., groups of consumers based on their behavior. Clustering can be used in a broad range of application areas, such as cybersecurity, e-commerce, mobile data processing, health analytics, user modeling, and behavioral analytics. In the following, we briefly discuss and summarize various types of clustering methods.

  • Partitioning methods: Based on the features and similarities in the data, this clustering approach categorizes the data into multiple groups or clusters. Data scientists or analysts typically determine the number of clusters to produce, either dynamically or statically, depending on the nature of the target application. The most common clustering algorithms based on partitioning methods are K-means [ 69 ], K-medoids [ 80 ], CLARA [ 55 ], etc.
  • Density-based methods: To identify distinct groups or clusters, these methods use the concept that a cluster in the data space is a contiguous region of high point density, isolated from other such clusters by contiguous regions of low point density. Points that are not part of a cluster are considered noise. The typical density-based clustering algorithms are DBSCAN [ 32 ], OPTICS [ 12 ], etc. Density-based methods typically struggle with clusters of varying density and with high-dimensional data.

Fig. 7: A graphical interpretation of the widely used hierarchical clustering (bottom-up and top-down) technique

  • Grid-based methods: Grid-based clustering is especially suitable for dealing with massive datasets. The principle is first to summarize the dataset with a grid representation and then to combine grid cells to obtain clusters. STING [ 122 ], CLIQUE [ 6 ], etc. are standard grid-based clustering algorithms.
  • Model-based methods: There are mainly two types of model-based clustering algorithms: one based on statistical learning and the other based on neural network learning [ 130 ]. For instance, GMM [ 89 ] is an example of a statistical learning method, and SOM [ 22 , 96 ] is an example of a neural network learning method.
  • Constraint-based methods: Constraint-based clustering is a semi-supervised approach to data clustering that uses constraints to incorporate domain knowledge. Application- or user-oriented constraints are incorporated to perform the clustering. Typical algorithms of this kind are COP K-means [ 121 ], CMWK-Means [ 27 ], etc.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

  • K-means clustering: K-means clustering [ 69 ] is a fast, robust, and simple algorithm that provides reliable results when the data sets are well separated from each other. The data points are allocated to a cluster in such a way that the sum of the squared distances between the data points and the centroid is as small as possible. In other words, the K-means algorithm identifies k centroids and then assigns each data point to the nearest cluster, keeping the clusters as compact as possible. Since it begins with a random selection of cluster centers, the results can be inconsistent. Since extreme values can easily affect a mean, the K-means clustering algorithm is sensitive to outliers. K-medoids clustering [ 91 ] is a variant of K-means that is more robust to noise and outliers.
  • Mean-shift clustering: Mean-shift clustering [ 37 ] is a nonparametric clustering technique that does not require prior knowledge of the number of clusters or constraints on cluster shape. It aims to discover “blobs” in a smooth distribution or density of samples [ 82 ]. It is a centroid-based algorithm that works by updating centroid candidates to be the mean of the points in a given region. To form the final set of centroids, these candidates are filtered in a post-processing stage to remove near-duplicates. Cluster analysis in computer vision and image processing is an example application domain. Mean shift has the disadvantage of being computationally expensive. Moreover, in high-dimensional cases, where the number of clusters shifts abruptly, the mean-shift algorithm does not work well.
  • DBSCAN: Density-based spatial clustering of applications with noise (DBSCAN) [ 32 ] is a base algorithm for density-based clustering that is widely used in data mining and machine learning. It is a non-parametric density-based clustering technique for separating high-density clusters from low-density clusters in model building. DBSCAN’s main idea is that a point belongs to a cluster if it is close to many points from that cluster. It can find clusters of various shapes and sizes in a vast volume of noisy data containing outliers. Unlike k-means, DBSCAN does not require a priori specification of the number of clusters in the data and can find arbitrarily shaped clusters. Although k-means is much faster, DBSCAN is efficient at finding high-density regions and outliers, i.e., it is robust to outliers.
  • GMM clustering: Gaussian mixture models (GMMs) are often used for data clustering, which is a distribution-based clustering algorithm. A Gaussian mixture model is a probabilistic model in which all the data points are produced by a mixture of a finite number of Gaussian distributions with unknown parameters [ 82 ]. To find the Gaussian parameters for each cluster, an optimization algorithm called expectation-maximization (EM) [ 82 ] can be used. EM is an iterative method that uses a statistical model to estimate the parameters. In contrast to k-means, Gaussian mixture models account for uncertainty and return the likelihood that a data point belongs to one of the k clusters. GMM clustering is more robust than k-means and works well even with non-linear data distributions.
  • Agglomerative hierarchical clustering: The most common method of hierarchical clustering used to group objects into clusters based on their similarity is agglomerative clustering. This technique uses a bottom-up approach, where the algorithm first treats each object as a singleton cluster. Pairs of clusters are then merged one by one until all clusters have been merged into a single large cluster containing all objects. The result is a dendrogram, a tree-based representation of the elements. Single linkage [ 115 ], complete linkage [ 116 ], BOTS [ 102 ], etc. are some examples of such techniques. The main advantage of agglomerative hierarchical clustering over k-means is that the tree-structured hierarchy it generates is more informative than the unstructured collection of flat clusters returned by k-means, which can help make better decisions in the relevant application areas.
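Several of the clustering algorithms above can be run side by side with scikit-learn. The sketch below uses synthetic Gaussian blobs; the number of clusters, the DBSCAN radius (eps), and the minimum sample count are arbitrary demonstration values.

```python
# Illustrative run of several clustering methods discussed above on synthetic
# blob data; cluster labels are arbitrary identifiers, not class labels.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # partitioning
db = DBSCAN(eps=0.5, min_samples=5).fit(X)       # density-based; -1 = noise
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)  # model-based
agg = AgglomerativeClustering(n_clusters=3).fit(X)            # hierarchical

print(sorted(set(km.labels_)))    # three k-means cluster identifiers
print(sorted(set(agg.labels_)))   # three agglomerative cluster identifiers
```

Note that only DBSCAN needed no cluster count up front, and its label -1 marks points it considers noise rather than members of any cluster.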

Dimensionality Reduction and Feature Learning

In machine learning and data science, high-dimensional data processing is a challenging task for both researchers and application developers. Thus, dimensionality reduction, which is an unsupervised learning technique, is important because it leads to better human interpretation, lower computational cost, and less overfitting and redundancy by simplifying models. Both feature selection and feature extraction can be used for dimensionality reduction. The primary distinction between them is that “feature selection” keeps a subset of the original features [ 97 ], while “feature extraction” creates brand-new ones [ 98 ]. In the following, we briefly discuss these techniques.

  • Feature selection: The selection of features, also known as the selection of variables or attributes in the data, is the process of choosing a subset of unique features (variables, predictors) to use in building a machine learning and data science model. It decreases a model’s complexity by eliminating irrelevant or less important features and allows for faster training of machine learning algorithms. A right and optimal subset of selected features in a problem domain is capable of minimizing the overfitting problem by simplifying and generalizing the model, as well as increasing the model’s accuracy [ 97 ]. Thus, “feature selection” [ 66 , 99 ] is considered one of the primary concepts in machine learning, greatly affecting the effectiveness and efficiency of the target machine learning model. The chi-squared test, analysis of variance (ANOVA) test, Pearson’s correlation coefficient, and recursive feature elimination are some popular techniques that can be used for feature selection.
  • Feature extraction: In a machine learning-based model or system, feature extraction techniques usually provide a better understanding of the data, a way to improve prediction accuracy, and a way to reduce computational cost or training time. The aim of “feature extraction” [ 66 , 99 ] is to reduce the number of features in a dataset by generating new ones from the existing ones and then discarding the original features. The majority of the information found in the original set of features can then be summarized using this new, reduced set of features. For instance, principal component analysis (PCA) is often used as a dimensionality-reduction technique to extract a lower-dimensional space by creating brand-new components from the existing features in a dataset [ 98 ].
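Feature extraction via PCA, mentioned above, can be sketched in a few lines with scikit-learn on the built-in iris dataset; the choice of two components is an arbitrary illustrative one.

```python
# Short PCA sketch for feature extraction: the four original iris features
# are projected onto two newly created principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)      # learn the projection from the data
X_reduced = pca.transform(X)          # new, lower-dimensional feature space

print(X.shape, "->", X_reduced.shape)          # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_.sum())     # variance kept by 2 components
```

Note that the two new components are linear combinations of all four original features, unlike feature selection, which would keep two of the originals unchanged.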

Many algorithms have been proposed to reduce data dimensions in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

  • Variance threshold: A simple baseline approach to feature selection is the variance threshold [ 82 ]. This excludes all features of low variance, i.e., all features whose variance does not exceed the threshold. By default, it eliminates all zero-variance features, i.e., features that have the same value in all samples. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can therefore be used for unsupervised learning.
  • Pearson correlation: Pearson’s correlation is another method to understand a feature’s relation to the response variable and can be used for feature selection [ 99 ]. This method is also used for finding the association between features in a dataset. The resulting value lies in [-1, 1], where -1 means perfect negative correlation, +1 means perfect positive correlation, and 0 means the two variables have no linear correlation. If the two random variables are X and Y, then the correlation coefficient between X and Y is defined as [ 41 ]

    r(X, Y) = Σ_{i=1}^{n} (X_i - X̄)(Y_i - Ȳ) / sqrt( Σ_{i=1}^{n} (X_i - X̄)^2 · Σ_{i=1}^{n} (Y_i - Ȳ)^2 ).   (8)
  • ANOVA: Analysis of variance (ANOVA) is a statistical tool used to test whether the mean values of two or more groups differ significantly from each other. ANOVA assumes a linear relationship between the variables and the target, and that the variables are normally distributed. To statistically test the equality of means, the ANOVA method utilizes F tests. For feature selection, the resulting ‘ANOVA F value’ [ 82 ] of this test can be used to omit certain features that are independent of the target variable.
  • Chi square: The chi-square ($\chi^2$) statistic [ 82 ] is an estimate of the difference between the observed and expected frequencies of a series of events or variables. The value of $\chi^2$ depends on the magnitude of the difference between the observed and expected values, the degrees of freedom, and the sample size. The chi-square test is commonly used for testing relationships between categorical variables. If $O_i$ represents an observed value and $E_i$ the corresponding expected value, then

    $$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}. \tag{9}$$
  • Recursive feature elimination (RFE): Recursive feature elimination (RFE) is a brute-force approach to feature selection. RFE [ 82 ] repeatedly fits the model and removes the weakest feature until the specified number of features is reached. Features are ranked by the model’s coefficients or feature importances. By recursively removing a small number of features per iteration, RFE aims to eliminate dependencies and collinearity in the model.
  • Model-based selection: To reduce the dimensionality of the data, linear models penalized with L1 regularization can be used. Least absolute shrinkage and selection operator (Lasso) regression is a type of linear regression that has the property of shrinking some of the coefficients to zero [ 82 ]; such features can then be removed from the model. Thus, the penalized lasso regression method is often used in machine learning to select a subset of variables. The Extra Trees classifier [ 82 ] is an example of a tree-based estimator that can be used to compute impurity-based feature importances, which can then be used to discard irrelevant features.
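Two of the filters above, the variance threshold and the Pearson-correlation filter, can be sketched in a few lines of plain Python. The thresholds and the toy data below are arbitrary choices for illustration, not recommended defaults.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) ** 0.5
           * sum((y - my) ** 2 for y in ys) ** 0.5)
    return num / den

def select_features(X, y, var_threshold=0.0, r_threshold=0.3):
    """Keep feature columns whose variance exceeds var_threshold and whose
    absolute Pearson correlation with the target exceeds r_threshold."""
    kept = []
    n = len(X)
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        if var <= var_threshold:
            continue  # variance-threshold filter (unsupervised, ignores y)
        if abs(pearson_r(col, y)) <= r_threshold:
            continue  # correlation filter (supervised, uses y)
        kept.append(j)
    return kept

# Toy data: column 0 is constant, column 1 tracks the target,
# column 2 is weakly related noise.
X = [[7, 1, 5], [7, 2, 1], [7, 3, 4], [7, 4, 2]]
y = [10, 20, 30, 40]
kept = select_features(X, y, r_threshold=0.5)  # only column 1 survives
```

Note that the variance check runs first, which also avoids a division by zero when correlating a constant column with the target.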

Fig. 8 An example of a principal component analysis (PCA) and created principal components PC1 and PC2 in different dimension space

Association Rule Learning

Association rule learning is a rule-based machine learning approach to discover interesting relationships, “IF-THEN” statements, between variables in large datasets [ 7 ]. One example is: “if a customer buys a computer or laptop (an item), s/he is likely to also buy anti-virus software (another item) at the same time”. Association rules are employed today in many application areas, including IoT services, medical diagnosis, usage behavior analytics, web usage mining, smartphone applications, cybersecurity applications, and bioinformatics. In comparison to sequence mining, association rule learning does not usually take into account the order of items within or across transactions. A common way of measuring the usefulness of association rules is to use their ‘support’ and ‘confidence’ parameters, introduced in [ 7 ].
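Support and confidence are simple to compute directly. The sketch below, on an invented basket dataset mirroring the laptop/anti-virus example above, measures the support of a rule's itemset and the confidence of the rule.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by support of the antecedent."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Invented purchase data for illustration.
baskets = [
    {"laptop", "antivirus"},
    {"laptop", "antivirus"},
    {"laptop"},
    {"antivirus"},
]
s = support(baskets, {"laptop", "antivirus"})       # 2 of 4 baskets
c = confidence(baskets, {"laptop"}, {"antivirus"})  # 2 of 3 laptop baskets
```

A rule "laptop => antivirus" is typically reported only when both measures exceed user-specified minimum thresholds.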

In the data mining literature, many association rule learning methods have been proposed, such as logic dependent [ 34 ], frequent pattern based [ 8 , 49 , 68 ], and tree-based [ 42 ]. The most popular association rule learning algorithms are summarized below.

  • AIS and SETM: AIS is the first algorithm proposed by Agrawal et al. [ 7 ] for association rule mining. The AIS algorithm’s main downside is that too many candidate itemsets are generated, requiring more space and wasting a lot of effort. The algorithm also requires too many passes over the entire dataset to produce the rules. Another approach, SETM [ 49 ], exhibits good performance and stable behavior in terms of execution time; however, it suffers from the same flaw as the AIS algorithm.
  • Apriori: For generating association rules from a given dataset, Agrawal et al. [ 8 ] proposed the Apriori, Apriori-TID, and Apriori-Hybrid algorithms. These later algorithms outperform AIS and SETM, mentioned above, due to the Apriori property of frequent itemsets [ 8 ]. The term ‘Apriori’ usually refers to having prior knowledge of frequent itemset properties. Apriori uses a “bottom-up” approach to generate the candidate itemsets. To reduce the search space, Apriori uses the property that “all subsets of a frequent itemset must be frequent; and if an itemset is infrequent, then all its supersets must also be infrequent”. Another approach, predictive Apriori [ 108 ], can also generate rules; however, it can produce unexpected results because it combines support and confidence into a single measure. Apriori [ 8 ] is the most widely applied technique in mining association rules.
  • ECLAT: This technique was proposed by Zaki et al. [ 131 ] and stands for Equivalence Class Clustering and bottom-up Lattice Traversal. ECLAT uses a depth-first search to find frequent itemsets. In contrast to the Apriori [ 8 ] algorithm, which represents data in a horizontal pattern, it represents data vertically. Hence, the ECLAT algorithm is more efficient and scalable in the area of association rule learning. This algorithm is better suited for small and medium datasets whereas the Apriori algorithm is used for large datasets.
  • FP-Growth: Another common association rule learning technique, based on the frequent-pattern tree (FP-tree) proposed by Han et al. [ 42 ], is Frequent Pattern Growth, known as FP-Growth. The key difference from Apriori is that while generating rules, the Apriori algorithm [ 8 ] generates frequent candidate itemsets, whereas the FP-Growth algorithm [ 42 ] avoids candidate generation and instead builds a tree using a successful ‘divide and conquer’ strategy. Due to its sophistication, however, the FP-tree is challenging to use in an interactive mining environment [ 133 ]. Moreover, the FP-tree may not fit into memory for massive datasets, making it challenging to process big data as well. Another solution is RARM (Rapid Association Rule Mining), proposed by Das et al. [ 26 ], but it faces a related FP-tree issue [ 133 ].
  • ABC-RuleMiner: ABC-RuleMiner is a rule-based machine learning method, proposed in our earlier paper (Sarker et al. [ 104 ]), that discovers interesting non-redundant rules to provide real-world intelligent services. This algorithm effectively identifies redundancy in associations by taking into account the impact or precedence of the related contextual features, and discovers a set of non-redundant association rules. It first constructs an association generation tree (AGT) in a top-down fashion and then extracts the association rules by traversing the tree. Thus, ABC-RuleMiner is more potent than traditional rule-based methods in terms of both non-redundant rule generation and intelligent decision-making, particularly in a context-aware smart computing environment where human or user preferences are involved.
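The level-wise "bottom-up" idea behind Apriori, generating candidate k-itemsets from frequent (k-1)-itemsets and pruning with the "all subsets of a frequent itemset must be frequent" property, can be sketched as follows. This is a minimal illustration on invented data, not an optimized implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: level-wise generation of frequent itemsets.
    Returns a dict mapping each frequent itemset to its support."""
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    frequent = {}
    level = {s for s in items if support(s) >= min_support}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) >= min_support}
    return frequent

# Toy transactions; with min_support=0.6, all pairs are frequent
# but the triple {a, b, c} is not.
baskets = [frozenset(t) for t in
           [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
freq = apriori(baskets, min_support=0.6)
```

Association rules would then be derived from each frequent itemset by checking the confidence of its possible antecedent/consequent splits.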

Among the association rule learning techniques discussed above, Apriori [ 8 ] is the most widely used algorithm for discovering association rules from a given dataset [ 133 ]. The main strength of the association learning technique is its comprehensiveness, as it generates all associations that satisfy the user-specified constraints, such as minimum support and confidence value. The ABC-RuleMiner approach [ 104 ] discussed earlier could give significant results in terms of non-redundant rule generation and intelligent decision-making for the relevant application areas in the real world.

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that allows an agent to learn by trial and error in an interactive environment, using feedback from its own actions and experiences. Unlike supervised learning, which is based on given sample data or examples, the RL method is based on interacting with the environment. The problem to be solved in reinforcement learning is defined as a Markov decision process (MDP) [ 86 ], i.e., it is all about making decisions sequentially. An RL problem typically includes four elements: agent, environment, rewards, and policy.

RL can be split roughly into model-based and model-free techniques. Model-based RL infers optimal behavior from a model of the environment by performing actions and observing the results, which include the next state and the immediate reward [ 85 ]. AlphaGo and AlphaZero [ 113 ] are examples of model-based approaches. On the other hand, a model-free approach does not use the transition probability distribution and the reward function associated with the MDP. Q-learning, Deep Q Network, Monte Carlo control, SARSA (State-Action-Reward-State-Action), etc. are some examples of model-free algorithms [ 52 ]. The key difference between the two is the model of the environment, which model-based RL requires and model-free RL does not. In the following, we discuss the popular RL algorithms.

  • Monte Carlo methods: Monte Carlo techniques, or Monte Carlo experiments, are a broad category of computational algorithms that rely on repeated random sampling to obtain numerical results [ 52 ]. The underlying concept is to use randomness to solve problems that are deterministic in principle. Optimization, numerical integration, and drawing samples from a probability distribution are the three problem classes where Monte Carlo techniques are most commonly used.
  • Q-learning: Q-learning is a model-free reinforcement learning algorithm that learns the quality of actions, telling an agent what action to take under what circumstances [ 52 ]. It does not need a model of the environment (hence the term “model-free”), and it can handle stochastic transitions and rewards without adaptations. The ‘Q’ in Q-learning stands for quality, as the algorithm calculates the maximum expected reward for a given action in a given state.
  • Deep Q-learning: When the environment is reasonably simple, plain Q-learning works well. However, when the number of states and actions becomes large, a deep neural network can be used as a function approximator. The basic working step in deep Q-learning [ 52 ] is that the current state is fed into the neural network, which returns the Q-values of all possible actions as an output.
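The tabular Q-learning update described above can be demonstrated on a toy problem. The sketch below uses an invented five-state "corridor" environment with epsilon-greedy exploration; all hyperparameters are arbitrary illustrative choices, not tuned values.

```python
import random

# A tiny deterministic "corridor": states 0..4, the agent starts at 0
# and receives reward 1 only upon reaching the goal state 4.
GOAL, ACTIONS = 4, ("left", "right")

def step(state, action):
    nxt = max(0, state - 1) if action == "left" else min(GOAL, state + 1)
    return nxt, 1.0 if nxt == GOAL else 0.0

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    random.seed(seed)
    Q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda a: Q[(s, a)])
            nxt, r = step(s, a)
            # Q-learning update: bootstrap on the best next action
            best_next = 0.0 if nxt == GOAL else max(Q[(nxt, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = nxt
    return Q

Q = q_learning()
# Greedy policy extracted from the learned Q-table: always move right.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

After training, the Q-values approach gamma raised to the distance from the goal, and the greedy policy moves right in every state.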

Reinforcement learning, along with supervised and unsupervised learning, is one of the basic machine learning paradigms. RL can be used to solve numerous real-world problems in various fields, such as game theory, control theory, operations analysis, information theory, simulation-based optimization, manufacturing, supply chain logistics, multi-agent systems, swarm intelligence, aircraft control, robot motion control, and many more.

Artificial Neural Network and Deep Learning

Deep learning is part of a wider family of artificial neural network (ANN)-based machine learning approaches with representation learning. Deep learning provides a computational architecture by combining several processing layers, such as input, hidden, and output layers, to learn from data [ 41 ]. The main advantage of deep learning over traditional machine learning methods is its better performance in several cases, particularly when learning from large datasets [ 105 , 129 ]. Figure 9 shows the general performance of deep learning over machine learning as the amount of data increases. However, performance may vary depending on the data characteristics and experimental setup.

Fig. 9 Machine learning and deep learning performance in general with the amount of data

The most common deep learning algorithms are the Multi-layer Perceptron (MLP), the Convolutional Neural Network (CNN, or ConvNet), and the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) [ 96 ]. In the following, we discuss various types of deep learning methods that can be used to build effective data-driven models for various purposes.

Fig. 10 A structure of an artificial neural network modeling with multiple processing layers

Fig. 11 An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers

  • LSTM-RNN: Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the area of deep learning [ 38 ]. Unlike normal feed-forward neural networks, LSTM has feedback links. LSTM networks are well-suited for analyzing and learning from sequential data, e.g., classifying, processing, and predicting data based on time series, which differentiates them from other conventional networks. Thus, LSTM can be used when the data are in a sequential format, such as time series or sentences, and is commonly applied in time-series analysis, natural language processing, speech recognition, etc.
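The feedback links of an LSTM can be made concrete by writing out one step of a single-unit cell. The sketch below uses scalar gates and arbitrary illustrative weights; real implementations use vector-valued gates with learned parameters, but the gate equations are the same.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(x, h_prev, c_prev, W):
    """One step of a single-unit LSTM cell. W maps each gate name to
    (input weight, recurrent weight, bias); all values are scalars."""
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])    # forget gate
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])    # input gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])    # output gate
    c = f * c_prev + i * g  # cell state carries information across steps
    h = o * math.tanh(c)    # hidden state emitted at this step
    return h, c

# Arbitrary illustrative weights; run the cell over a short sequence.
W = {k: (0.5, 0.25, 0.0) for k in ("f", "i", "g", "o")}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.8]:
    h, c = lstm_cell(x, h, c, W)
```

Because `h` and `c` are fed back at every step, the output at each position depends on the whole sequence seen so far, which is what makes the architecture suited to sequential data.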

In addition to the most common deep learning methods discussed above, several other deep learning approaches [ 96 ] exist for various purposes. For instance, the self-organizing map (SOM) [ 58 ] uses unsupervised learning to represent high-dimensional data by a 2D grid map, thus achieving dimensionality reduction. The autoencoder (AE) [ 15 ] is another learning technique widely used for dimensionality reduction and feature extraction in unsupervised learning tasks. Restricted Boltzmann machines (RBM) [ 46 ] can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. A deep belief network (DBN) is typically composed of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, and a backpropagation neural network (BPNN) [ 123 ]. A generative adversarial network (GAN) [ 39 ] is a form of deep learning network that can generate data with characteristics close to the actual input data. Transfer learning, which typically re-uses a pre-trained model on a new problem, is currently very common because it can train deep neural networks with comparatively little data [ 124 ]. A brief discussion of these artificial neural network (ANN) and deep learning (DL) models is given in our earlier paper, Sarker et al. [ 96 ].

Overall, based on the learning techniques discussed above, we can conclude that various types of machine learning techniques, such as classification analysis, regression, data clustering, feature selection and extraction, dimensionality reduction, association rule learning, reinforcement learning, and deep learning, can play a significant role for various purposes according to their capabilities. In the following section, we discuss several application areas based on machine learning algorithms.

Applications of Machine Learning

In the current age of the Fourth Industrial Revolution (4IR), machine learning has become popular in various application areas because of its ability to learn from past data and make intelligent decisions. In the following, we summarize and discuss ten popular application areas of machine learning technology.

  • Predictive analytics and intelligent decision-making: A major application field of machine learning is intelligent decision-making through data-driven predictive analytics [ 21 , 70 ]. The basis of predictive analytics is capturing and exploiting relationships between explanatory variables and predicted variables from previous events to predict the unknown outcome [ 41 ], for instance, identifying suspects or criminals after a crime has been committed, or detecting credit card fraud as it happens. In another application, machine learning algorithms can assist retailers in better understanding consumer preferences and behavior, managing inventory, avoiding out-of-stock situations, and optimizing logistics and warehousing in e-commerce. Various machine learning algorithms, such as decision trees, support vector machines, and artificial neural networks [ 106 , 125 ], are commonly used in the area. Since accurate predictions provide insight into the unknown, they can improve the decisions of industries, businesses, and almost any organization, including government agencies, e-commerce, telecommunications, banking and financial services, healthcare, sales and marketing, transportation, social networking, and many others.
  • Cybersecurity and threat intelligence: Cybersecurity is one of the most essential areas of Industry 4.0 [ 114 ]; it is typically the practice of protecting networks, systems, hardware, and data from digital attacks [ 114 ]. Machine learning has become a crucial cybersecurity technology that constantly learns by analyzing data to identify patterns, better detect malware in encrypted traffic, find insider threats, predict where bad neighborhoods are online, keep people safe while browsing, or secure data in the cloud by uncovering suspicious activity. For instance, clustering techniques can be used to identify cyber-anomalies, policy violations, etc. Machine learning classification models that take into account the impact of security features are useful for detecting various types of cyber-attacks or intrusions [ 97 ]. Various deep learning-based security models can also be used on large-scale security datasets [ 96 , 129 ]. Moreover, security policy rules generated by association rule learning techniques can play a significant role in building a rule-based security system [ 105 ]. Thus, the various learning techniques discussed in Sect. “ Machine Learning Tasks and Algorithms ” can enable cybersecurity professionals to be more proactive in efficiently preventing threats and cyber-attacks.
  • Internet of things (IoT) and smart cities: The Internet of Things (IoT) is another essential area of Industry 4.0 [ 114 ]; it turns everyday objects into smart objects by allowing them to transmit data and automate tasks without the need for human interaction. IoT is, therefore, considered to be the big frontier that can enhance almost all activities in our lives, such as smart governance, smart homes, education, communication, transportation, retail, agriculture, health care, business, and many more [ 70 ]. The smart city is one of IoT’s core fields of application, using technologies to enhance city services and residents’ living experiences [ 132 , 135 ]. As machine learning utilizes experience to recognize trends and create models that help predict future behavior and events, it has become a crucial technology for IoT applications [ 103 ]. For example, predicting traffic in smart cities, predicting parking availability, estimating citizens’ total energy usage over a particular period, and making context-aware and timely decisions for people are some tasks that can be solved using machine learning techniques according to current needs.
  • Traffic prediction and transportation: Transportation systems have become a crucial component of every country’s economic development. Nonetheless, several cities around the world are experiencing an excessive rise in traffic volume, resulting in serious issues such as delays, traffic congestion, higher fuel prices, increased CO2 pollution, accidents, emergencies, and a decline in modern society’s quality of life [ 40 ]. Thus, an intelligent transportation system that predicts future traffic is important and is an indispensable part of a smart city. Accurate traffic prediction based on machine and deep learning modeling can help to minimize these issues [ 17 , 30 , 31 ]. For example, based on travel history and the trend of traveling through various routes, machine learning can assist transportation companies in predicting possible issues that may occur on specific routes and recommending that their customers take a different path. Ultimately, these learning-based data-driven models help improve traffic flow, increase the usage and efficiency of sustainable modes of transportation, and limit real-world disruption by modeling and visualizing future changes.
  • Healthcare and COVID-19 pandemic: Machine learning can help to solve diagnostic and prognostic problems in a variety of medical domains, such as disease prediction, medical knowledge extraction, detecting regularities in data, patient management, etc. [ 33 , 77 , 112 ]. Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus, according to the World Health Organization (WHO) [ 3 ]. Recently, learning techniques have become popular in the battle against COVID-19 [ 61 , 63 ]. In the COVID-19 pandemic, learning techniques are used to classify patients at high risk, estimate their mortality rate, and detect other anomalies [ 61 ]. They can also be used to better understand the virus’s origin, predict the COVID-19 outbreak, and assist in disease diagnosis and treatment [ 14 , 50 ]. With the help of machine learning, researchers can forecast where and when COVID-19 is likely to spread and notify those regions to make the required arrangements. Deep learning also provides exciting solutions to the problems of medical image processing and is seen as a crucial technique for potential applications, particularly for the COVID-19 pandemic [ 10 , 78 , 111 ]. Overall, machine and deep learning techniques can help to fight the COVID-19 virus and the pandemic, as well as support intelligent clinical decision-making in the healthcare domain.
  • E-commerce and product recommendations: Product recommendation is one of the most well known and widely used applications of machine learning, and it is one of the most prominent features of almost any e-commerce website today. Machine learning technology can assist businesses in analyzing their consumers’ purchasing histories and making customized product suggestions for their next purchase based on their behavior and preferences. E-commerce companies, for example, can easily position product suggestions and offers by analyzing browsing trends and click-through rates of specific items. Using predictive modeling based on machine learning techniques, many online retailers, such as Amazon [ 71 ], can better manage inventory, prevent out-of-stock situations, and optimize logistics and warehousing. The future of sales and marketing is the ability to capture, evaluate, and use consumer data to provide a customized shopping experience. Furthermore, machine learning techniques enable companies to create packages and content that are tailored to the needs of their customers, allowing them to maintain existing customers while attracting new ones.
  • NLP and sentiment analysis: Natural language processing (NLP) involves the reading and understanding of spoken or written language through the medium of a computer [ 79 , 103 ]. Thus, NLP helps computers, for instance, to read a text, hear speech, interpret it, analyze sentiment, and decide which aspects are significant, where machine learning techniques can be used. Virtual personal assistants, chatbots, speech recognition, document description, and language or machine translation are some examples of NLP-related tasks. Sentiment analysis [ 90 ] (also referred to as opinion mining or emotion AI) is an NLP sub-field that seeks to identify and extract public mood and views within a given text, through blogs, reviews, social media, forums, news, etc. For instance, businesses and brands use sentiment analysis to understand the social sentiment of their brand, product, or service through social media platforms or the web as a whole. Overall, sentiment analysis is considered a machine learning task that analyzes texts for polarity, such as “positive”, “negative”, or “neutral”, along with more intense emotions like very happy, happy, sad, very sad, angry, interested, or not interested.
  • Image, speech and pattern recognition: Image recognition [ 36 ] is a well-known and widespread example of machine learning in the real world, which can identify an object in a digital image. Labeling an x-ray as cancerous or not, character recognition, face detection in an image, and tagging suggestions on social media, e.g., Facebook, are common examples of image recognition. Speech recognition [ 23 ] is also very popular; it typically uses sound and linguistic models, e.g., in Google Assistant, Cortana, Siri, and Alexa [ 67 ], where machine learning methods are used. Pattern recognition [ 13 ] is defined as the automated recognition of patterns and regularities in data, e.g., image analysis. Several machine learning techniques, such as classification, feature selection, clustering, and sequence labeling methods, are used in the area.
  • Sustainable agriculture: Agriculture is essential to the survival of all human activities [ 109 ]. Sustainable agriculture practices help to improve agricultural productivity while also reducing negative impacts on the environment [ 5 , 25 , 109 ]. Sustainable agriculture supply chains are knowledge-intensive and based on information, skills, technologies, etc., where knowledge transfer encourages farmers to enhance their decisions to adopt sustainable agriculture practices, utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT), mobile technologies and devices, etc. [ 5 , 53 , 54 ]. Machine learning can be applied in various phases of sustainable agriculture: in the pre-production phase, for the prediction of crop yield, soil properties, irrigation requirements, etc.; in the production phase, for weather prediction, disease detection, weed detection, soil nutrient management, livestock management, etc.; in the processing phase, for demand estimation, production planning, etc.; and in the distribution phase, for inventory management, consumer analysis, etc.
  • User behavior analytics and context-aware smartphone applications: Context-awareness is a system’s ability to capture knowledge about its surroundings at any moment and modify its behavior accordingly [ 28 , 93 ]. Context-aware computing uses software and hardware to automatically collect and interpret data for direct responses. The mobile app development environment has changed greatly with the power of AI, particularly machine learning techniques, through their ability to learn from contextual data [ 103 , 136 ]. Thus, the developers of mobile apps can rely on machine learning to create smart apps that can understand human behavior, and support and entertain users [ 107 , 137 , 140 ]. Machine learning techniques are applicable for building various personalized data-driven context-aware systems, such as smart interruption management, smart mobile recommendation, context-aware smart searching, and decision-making, that intelligently assist end mobile phone users in a pervasive computing environment. For example, context-aware association rules can be used to build an intelligent phone call application [ 104 ]. Clustering approaches are useful in capturing users’ diverse behavioral activities by taking time-series data into account [ 102 ]. To predict future events in various contexts, classification methods can be used [ 106 , 139 ]. Thus, the various learning techniques discussed in Sect. “ Machine Learning Tasks and Algorithms ” can help to build context-aware, adaptive, and smart applications according to the preferences of mobile phone users.

In addition to these application areas, machine learning-based models can also apply to several other domains such as bioinformatics, cheminformatics, computer networks, DNA sequence classification, economics and banking, robotics, advanced engineering, and many more.

Challenges and Research Directions

Our study on machine learning algorithms for intelligent data analysis and applications opens several research issues in the area. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions.

In general, the effectiveness and the efficiency of a machine learning-based solution depend on the nature and characteristics of the data and the performance of the learning algorithms. Collecting data in a relevant domain, such as cybersecurity, IoT, healthcare, or agriculture, discussed in Sect. “ Applications of Machine Learning ”, is not straightforward, although current cyberspace enables the production of a huge amount of data at very high frequency. Thus, collecting useful data for the target machine learning-based applications, e.g., smart city applications, and managing it are important for further analysis. Therefore, a more in-depth investigation of data collection methods is needed when working with real-world data. Moreover, historical data may contain many ambiguous values, missing values, outliers, and meaningless entries. The machine learning algorithms discussed in Sect. “ Machine Learning Tasks and Algorithms ” are highly affected by the quality and availability of the training data, and consequently so is the resultant model. Thus, accurately cleaning and pre-processing the diverse data collected from diverse sources is a challenging task. Therefore, effectively modifying or enhancing existing pre-processing methods, or proposing new data preparation techniques, is required to effectively use the learning algorithms in the associated application domain.

To analyze the data and extract insights, there exist many machine learning algorithms, summarized in Sect. “ Machine Learning Tasks and Algorithms ”. Thus, selecting a proper learning algorithm that is suitable for the target application is challenging, because the outcome of different learning algorithms may vary depending on the data characteristics [ 106 ]. Selecting the wrong learning algorithm can produce unexpected outcomes, leading to a loss of effort as well as reduced model effectiveness and accuracy. In terms of model building, the techniques discussed in Sect. “ Machine Learning Tasks and Algorithms ” can directly be used to solve many real-world issues in diverse domains, such as cybersecurity, smart cities, and healthcare, summarized in Sect. “ Applications of Machine Learning ”. However, hybrid learning models, e.g., ensembles of methods, the modification or enhancement of existing learning techniques, or the design of new learning methods, could be potential future work in the area.

Thus, the ultimate success of a machine learning-based solution and its corresponding applications mainly depends on both the data and the learning algorithms. If the data are unsuitable for learning, e.g., non-representative, of poor quality, containing irrelevant features, or insufficient in quantity for training, then the machine learning models may become useless or produce lower accuracy. Therefore, effectively processing the data and handling the diverse learning algorithms are important for a machine learning-based solution and, eventually, for building intelligent applications.

Conclusion

In this paper, we have conducted a comprehensive overview of machine learning algorithms for intelligent data analysis and applications. According to our goal, we have briefly discussed how various types of machine learning methods can be used to solve various real-world issues. A successful machine learning model depends on both the data and the performance of the learning algorithms. Sophisticated learning algorithms must be trained on collected real-world data and knowledge related to the target application before the system can assist with intelligent decision-making. We also discussed several popular application areas based on machine learning techniques to highlight their applicability to various real-world issues. Finally, we summarized and discussed the challenges faced and the potential research opportunities and future directions in the area. The identified challenges create promising research opportunities in the field, which must be addressed with effective solutions in various application areas. Overall, we believe that our study on machine learning-based solutions opens up a promising direction and can serve as a reference guide for potential research and applications for academia, industry professionals, and decision-makers, from a technical point of view.

Declaration

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Ideas Made to Matter

Artificial Intelligence

Machine learning, explained

Apr 21, 2021

Machine learning is behind chatbots and predictive text, language translation apps, the shows Netflix suggests to you, and how your social media feeds are presented. It powers autonomous vehicles and machines that can diagnose medical conditions based on images. 

When companies today deploy artificial intelligence programs, they are most likely using machine learning — so much so that the terms are often used interchangeably, and sometimes ambiguously. Machine learning is a subfield of artificial intelligence that gives computers the ability to learn without explicitly being programmed.

“In just the last five or 10 years, machine learning has become a critical way, arguably the most important way, most parts of AI are done,” said MIT Sloan professor Thomas W. Malone, the founding director of the MIT Center for Collective Intelligence . “So that's why some people use the terms AI and machine learning almost as synonymous … most of the current advances in AI have involved machine learning.”

With the growing ubiquity of machine learning, everyone in business is likely to encounter it and will need some working knowledge about this field. A 2020 Deloitte survey found that 67% of companies are using machine learning, and 97% are using or planning to use it in the next year.

From manufacturing to retail and banking to bakeries, even legacy companies are using machine learning to unlock new value or boost efficiency. “Machine learning is changing, or will change, every industry, and leaders need to understand the basic principles, the potential, and the limitations,” said MIT computer science professor Aleksander Madry , director of the MIT Center for Deployable Machine Learning .

While not everyone needs to know the technical details, they should understand what the technology does and what it can and cannot do, Madry added. “I don’t think anyone can afford not to be aware of what’s happening.”

That includes being aware of the social, societal, and ethical implications of machine learning. “It's important to engage and begin to understand these tools, and then think about how you're going to use them well. We have to use these [tools] for the good of everybody,” said Dr. Joan LaRovere , MBA ’16, a pediatric cardiac intensive care physician and co-founder of the nonprofit The Virtue Foundation. “AI has so much potential to do good, and we need to really keep that in our lenses as we're thinking about this. How do we use this to do good and better the world?”

What is machine learning?

Machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior. Artificial intelligence systems are used to perform complex tasks in a way that is similar to how humans solve problems.

The goal of AI is to create computer models that exhibit “intelligent behaviors” like humans, according to Boris Katz , a principal research scientist and head of the InfoLab Group at CSAIL. This means machines that can recognize a visual scene, understand a text written in natural language, or perform an action in the physical world.

Machine learning is one way to use AI. It was defined in the 1950s by AI pioneer Arthur Samuel as “the field of study that gives computers the ability to learn without explicitly being programmed.”

The definition holds true, according to Mikey Shulman, a lecturer at MIT Sloan and head of machine learning at Kensho, which specializes in artificial intelligence for the finance and U.S. intelligence communities. He compared the traditional way of programming computers, or "software 1.0," to baking, where a recipe calls for precise amounts of ingredients and tells the baker to mix for an exact amount of time. Traditional programming similarly requires creating detailed instructions for the computer to follow.

But in some cases, writing a program for the machine to follow is time-consuming or impossible, such as training a computer to recognize pictures of different people. While humans can do this task easily, it’s difficult to tell a computer how to do it. Machine learning takes the approach of letting computers learn to program themselves through experience. 

Machine learning starts with data — numbers, photos, or text, like bank transactions, pictures of people or even bakery items , repair records, time series data from sensors, or sales reports. The data is gathered and prepared to be used as training data, or the information the machine learning model will be trained on. The more data, the better the program.

From there, programmers choose a machine learning model to use, supply the data, and let the computer model train itself to find patterns or make predictions. Over time the human programmer can also tweak the model, including changing its parameters, to help push it toward more accurate results. (Research scientist Janelle Shane’s website AI Weirdness is an entertaining look at how machine learning algorithms learn and how they can get things wrong — as happened when an algorithm tried to generate recipes and created Chocolate Chicken Chicken Cake.)

Some data is held out from the training data to be used as evaluation data, which tests how accurate the machine learning model is when it is shown new data. The result is a model that can be used in the future with different sets of data.
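The workflow described above — train on most of the data, hold some out for evaluation — can be sketched with a toy 1-nearest-neighbor classifier. The points and labels below are invented for illustration; a real system would use far more data and a tuned model:

```python
import math

# Toy labeled data: (feature_1, feature_2) -> label (hypothetical values)
data = [((1.0, 1.0), "dog"), ((1.2, 0.9), "dog"),
        ((5.0, 5.1), "cat"), ((4.8, 5.3), "cat"),
        ((1.1, 1.2), "dog"), ((5.2, 4.9), "cat")]

# Hold out the last examples as evaluation data the model never trains on
train, held_out = data[:4], data[4:]

def predict(point):
    """1-nearest-neighbor: return the label of the closest training example."""
    nearest = min(train, key=lambda ex: math.dist(point, ex[0]))
    return nearest[1]

# Accuracy on the held-out data estimates performance on new, unseen data
correct = sum(predict(p) == label for p, label in held_out)
accuracy = correct / len(held_out)
print(f"held-out accuracy: {accuracy:.0%}")
```

The key idea is that accuracy is measured only on examples kept out of training, which is what lets the number generalize to future data.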

Successful machine learning algorithms can do different things, Malone wrote in a recent research brief about AI and the future of work that was co-authored by MIT professor and CSAIL director Daniela Rus and Robert Laubacher, the associate director of the MIT Center for Collective Intelligence.

“The function of a machine learning system can be descriptive , meaning that the system uses the data to explain what happened; predictive , meaning the system uses the data to predict what will happen; or prescriptive , meaning the system will use the data to make suggestions about what action to take,” the researchers wrote.  

There are three subcategories of machine learning:

Supervised machine learning models are trained with labeled data sets, which allow the models to learn and grow more accurate over time. For example, an algorithm would be trained with pictures of dogs and other things, all labeled by humans, and the machine would learn ways to identify pictures of dogs on its own. Supervised machine learning is the most common type used today.

In unsupervised machine learning, a program looks for patterns in unlabeled data. Unsupervised machine learning can find patterns or trends that people aren’t explicitly looking for. For example, an unsupervised machine learning program could look through online sales data and identify different types of clients making purchases.

Reinforcement machine learning trains machines through trial and error to take the best action by establishing a reward system. Reinforcement learning can train models to play games or train autonomous vehicles to drive by telling the machine when it made the right decisions, which helps it learn over time what actions it should take.
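As a concrete sketch of the unsupervised case, a tiny k-means-style loop can group hypothetical customer purchase totals into two clusters with no labels given; the data, the two-cluster choice, and the iteration count are all illustrative assumptions:

```python
# Hypothetical purchase totals; no labels are provided to the algorithm.
spend = [12, 15, 11, 14, 210, 195, 220, 13, 205]

centers = [min(spend), max(spend)]  # start with two rough guesses
for _ in range(10):
    groups = [[], []]
    for s in spend:
        # assign each customer to the nearest cluster center
        idx = 0 if abs(s - centers[0]) <= abs(s - centers[1]) else 1
        groups[idx].append(s)
    # move each center to the mean of its assigned customers
    centers = [sum(g) / len(g) for g in groups]

print(sorted(centers))  # two "types of clients" emerge from the data alone
```

The program discovers the low-spend and high-spend groups by itself, which is the essence of unsupervised pattern finding.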

Infographic entitled "What do you want your machine learning system to do?"

Source: Thomas Malone | MIT Sloan. See: https://bit.ly/3gvRho2, Figure 2.

In the Work of the Future brief, Malone noted that machine learning is best suited for situations with lots of data — thousands or millions of examples, like recordings from previous conversations with customers, sensor logs from machines, or ATM transactions. For example, Google Translate was possible because it “trained” on the vast amount of information on the web, in different languages.

In some cases, machine learning can gain insight or automate decision-making in cases where humans would not be able to, Madry said. “It may not only be more efficient and less costly to have an algorithm do this, but sometimes humans just literally are not able to do it,” he said.

Google search is an example of something that humans can do, but never at the scale and speed at which the Google models are able to show potential answers every time a person types in a query, Malone said. “That’s not an example of computers putting people out of work. It's an example of computers doing things that would not have been remotely economically feasible if they had to be done by humans.”

Machine learning is also associated with several other artificial intelligence subfields:

Natural language processing

Natural language processing is a field of machine learning in which machines learn to understand natural language as spoken and written by humans, instead of the data and numbers normally used to program computers. This allows machines to recognize language, understand it, and respond to it, as well as create new text and translate between languages. Natural language processing enables familiar technology like chatbots and digital assistants like Siri or Alexa.

Neural networks

Neural networks are a commonly used, specific class of machine learning algorithms. Artificial neural networks are modeled on the human brain, in which thousands or millions of processing nodes are interconnected and organized into layers.

In an artificial neural network, cells, or nodes, are connected, with each cell processing inputs and producing an output that is sent to other neurons. Labeled data moves through the nodes, or cells, with each cell performing a different function. In a neural network trained to identify whether a picture contains a cat or not, the different nodes would assess the information and arrive at an output that indicates whether a picture features a cat.
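The node computation just described can be sketched in a few lines of plain Python. The weights and bias values here are arbitrary stand-ins for illustration, not learned parameters:

```python
import math

def node(inputs, weights, bias):
    """One artificial node: weighted sum of inputs plus bias,
    passed through a sigmoid activation that squashes to (0, 1)."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))

# Two nodes in a hidden layer each process the same inputs...
hidden = [node([0.5, 0.8], [1.0, -0.5], 0.1),
          node([0.5, 0.8], [0.3, 0.9], -0.2)]
# ...and their outputs feed a single output node in the next layer.
output = node(hidden, [1.2, -0.7], 0.0)
print(round(output, 3))
```

Training a real network consists of adjusting those weights and biases so the final output matches the labeled data; the forward computation itself is just this chain of weighted sums and activations.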

Deep learning

Deep learning networks are neural networks with many layers. The layered network can process extensive amounts of data and determine the “weight” of each link in the network — for example, in an image recognition system, some layers of the neural network might detect individual features of a face, like eyes, nose, or mouth, while another layer would be able to tell whether those features appear in a way that indicates a face.  

Like neural networks, deep learning is modeled on the way the human brain works and powers many machine learning uses, like autonomous vehicles, chatbots, and medical diagnostics.

“The more layers you have, the more potential you have for doing complex things well,” Malone said.

Deep learning requires a great deal of computing power, which raises concerns about its economic and environmental sustainability.

How businesses are using machine learning

Machine learning is the core of some companies’ business models, like in the case of Netflix’s suggestions algorithm or Google’s search engine . Other companies are engaging deeply with machine learning, though it’s not their main business proposition.

Others are still trying to determine how to use machine learning in a beneficial way. “In my opinion, one of the hardest problems in machine learning is figuring out what problems I can solve with machine learning,” Shulman said. “There’s still a gap in the understanding.” 


In a 2018 paper, researchers from the MIT Initiative on the Digital Economy outlined a 21-question rubric to determine whether a task is suitable for machine learning. The researchers found that no occupation will be untouched by machine learning, but no occupation is likely to be completely taken over by it. The way to unleash machine learning success, the researchers found, was to reorganize jobs into discrete tasks, some of which can be done by machine learning and others of which require a human.

Companies are already using machine learning in several ways, including:

Recommendation algorithms. The recommendation engines behind Netflix and YouTube suggestions, what information appears on your Facebook feed, and product recommendations are fueled by machine learning. “[The algorithms] are trying to learn our preferences,” Madry said. “They want to learn, like on Twitter, what tweets we want them to show us, on Facebook, what ads to display, what posts or liked content to share with us.”

Image analysis and object detection. Machine learning can analyze images for different information, like learning to identify people and tell them apart — though facial recognition algorithms are controversial. Business uses for this vary. Shulman noted that hedge funds famously use machine learning to analyze the number of cars  in parking lots, which helps them learn how companies are performing and make good bets.

Fraud detection . Machines can analyze patterns, like how someone normally spends or where they normally shop, to identify potentially fraudulent credit card transactions , log-in attempts, or spam emails.

Automatic helplines or chatbots. Many companies are deploying online chatbots, in which customers or clients don’t speak to humans, but instead interact with a machine. These algorithms use machine learning and natural language processing, with the bots learning from records of past conversations to come up with appropriate responses.

Self-driving cars. Much of the technology behind self-driving cars is based on machine learning, deep learning in particular .

Medical imaging and diagnostics. Machine learning programs can be trained to examine medical images or other information and look for certain markers of illness, like a tool that can predict cancer risk based on a mammogram.
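The pattern-based fraud-flagging idea above can be caricatured in a few lines: flag any transaction that sits far from a customer's typical spending. This is a statistical sketch of the intuition, not a production fraud model, and the history, threshold, and amounts are hypothetical:

```python
import statistics

# Hypothetical history of a customer's normal transaction amounts
history = [23.0, 41.5, 18.0, 35.0, 29.0, 22.5, 38.0, 27.0]
mean = statistics.mean(history)
stdev = statistics.stdev(history)

def suspicious(amount, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations
    from this customer's historical mean."""
    return abs(amount - mean) / stdev > threshold

print(suspicious(31.0), suspicious(950.0))  # typical vs. outlier purchase
```

Real systems learn far richer patterns (merchants, locations, timing) with trained models, but the core move is the same: model "normal" from past data and flag deviations.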

Read report: Artificial Intelligence and the Future of Work

How machine learning works: promises and challenges

While machine learning is fueling technology that can help workers or open new possibilities for businesses, there are several things business leaders should know about machine learning and its limits.

Explainability

One area of concern is what some experts call explainability, or the ability to be clear about what the machine learning models are doing and how they make decisions. “Understanding why a model does what it does is actually a very difficult question, and you always have to ask yourself that,” Madry said. “You should never treat this as a black box, that just comes as an oracle … yes, you should use it, but then try to get a feeling of what are the rules of thumb that it came up with? And then validate them.”


This is especially important because systems can be fooled and undermined, or just fail on certain tasks, even those humans can perform easily. For example, making small adjustments to the pixel values of an image, imperceptible to humans, can confuse computers: with a few such adjustments, a machine identifies a picture of a dog as an ostrich .

Madry pointed out another example in which a machine learning algorithm examining X-rays seemed to outperform physicians. But it turned out the algorithm was correlating results with the machines that took the image, not necessarily the image itself. Tuberculosis is more common in developing countries, which tend to have older machines. The machine learning program learned that if the X-ray was taken on an older machine, the patient was more likely to have tuberculosis. It completed the task, but not in the way the programmers intended or would find useful.

The importance of explaining how a model is working — and its accuracy — can vary depending on how it’s being used, Shulman said. While most well-posed problems can be solved through machine learning, he said, people should assume right now that the models only perform to about 95% of human accuracy. It might be okay with the programmer and the viewer if an algorithm recommending movies is 95% accurate, but that level of accuracy wouldn’t be enough for a self-driving vehicle or a program designed to find serious flaws in machinery.   

Bias and unintended outcomes

Machines are trained by humans, and human biases can be incorporated into algorithms — if biased information, or data that reflects existing inequities, is fed to a machine learning program, the program will learn to replicate it and perpetuate forms of discrimination. Chatbots trained on how people converse on Twitter can pick up on offensive and racist language , for example.

In some cases, machine learning models create or exacerbate social problems. For example, Facebook has used machine learning as a tool to show users ads and content that will interest and engage them, which has led to models showing people incendiary, partisan, or inaccurate content that drives polarization and the spread of conspiracy theories.

Ways to fight bias in machine learning include carefully vetting training data and putting organizational support behind ethical artificial intelligence efforts, such as making sure your organization embraces human-centered AI , the practice of seeking input from people of different backgrounds, experiences, and lifestyles when designing AI systems. Initiatives working on this issue include the Algorithmic Justice League and The Moral Machine project.

Putting machine learning to work

Shulman said executives tend to struggle with understanding where machine learning can actually add value to their company. What’s gimmicky for one company is core to another, and businesses should avoid trends and find business use cases that work for them.

The way machine learning works for Amazon is probably not going to translate at a car company, Shulman said — while Amazon has found success with voice assistants and voice-operated speakers, that doesn’t mean car companies should prioritize adding speakers to cars. More likely, he said, the car company might find a way to use machine learning on the factory line that saves or makes a great deal of money.

“The field is moving so quickly, and that's awesome, but it makes it hard for executives to make decisions about it and to decide how much resourcing to pour into it,” Shulman said.

It’s also best to avoid looking at machine learning as a solution in search of a problem, Shulman said. Some companies might end up trying to backport machine learning into a business use. Instead of starting with a focus on technology, businesses should start with a focus on a business problem or customer need that could be met with machine learning. 

A basic understanding of machine learning is important, LaRovere said, but finding the right machine learning use ultimately rests on people with different expertise working together. “I'm not a data scientist. I'm not doing the actual data engineering work — all the data acquisition, processing, and wrangling to enable machine learning applications — but I understand it well enough to be able to work with those teams to get the answers we need and have the impact we need,” she said. “You really have to work in a team.”

Learn more:  

Sign-up for a  Machine Learning in Business Course .

Watch an  Introduction to Machine Learning through MIT OpenCourseWare .

Read about how  an AI pioneer thinks companies can use machine learning to transform .

Watch a discussion with two AI experts about  machine learning strides and limitations .

Take a look at  the seven steps of machine learning .

Read next: 7 lessons for successful machine learning projects 



Analytics India Magazine (AIM)

Top Machine Learning Research Papers Released In 2021


  • Published on November 18, 2021
  • by Dr. Nivash Jeevanandam


Advances in machine learning and deep learning research are reshaping our technology. Machine learning and deep learning accomplished various astounding feats in 2021, and key research articles resulted in technical advances used by billions of people. Research in this field is advancing at a breakneck pace; to help you keep up, here is a collection of the year's most important research papers.

Rebooting ACGAN: Auxiliary Classifier GANs with Stable Training

The authors of this work examined why ACGAN training becomes unstable as the number of classes in the dataset grows. They found that the instability arises from a gradient explosion problem caused by the unboundedness of the input feature vectors and the classifier's poor classification ability during the early training stage. To alleviate the instability, the researchers presented the Data-to-Data Cross-Entropy loss (D2D-CE) and the Rebooted Auxiliary Classifier Generative Adversarial Network (ReACGAN). Additionally, extensive tests demonstrate that ReACGAN is robust to hyperparameter selection and compatible with a variety of architectures and differentiable augmentations.

This article is ranked #1 on CIFAR-10 for Conditional Image Generation.

For the research paper, read here .

For code, see here .

Dense Unsupervised Learning for Video Segmentation

The authors presented a straightforward and computationally fast unsupervised strategy for learning dense spacetime representations from unlabeled videos. The approach demonstrates rapid training convergence and a high degree of data efficiency. Furthermore, the researchers obtain VOS accuracy superior to previous results despite employing a fraction of the previously necessary training data. The researchers acknowledge that the findings could be misused, such as for unlawful surveillance. They also plan to investigate how this capability might be used to learn a broader spectrum of invariances by exploiting larger temporal windows in videos with complex (ego-)motion, which is more prone to disocclusions.

This study is ranked #1 on DAVIS 2017 for Unsupervised Video Object Segmentation (val).

Temporally-Consistent Surface Reconstruction using Metrically-Consistent Atlases

The authors offer an atlas-based technique for producing unsupervised temporally consistent surface reconstructions by requiring a point on the canonical shape representation to translate to metrically consistent 3D locations on the reconstructed surfaces. Finally, the researchers envisage a plethora of potential applications for the method. For example, by substituting an image-based loss for the Chamfer distance, one may apply the method to RGB video sequences, which the researchers feel will spur development in video-based 3D reconstruction.

This article is ranked #1 on ANIM in the category of Surface Reconstruction. 

EdgeFlow: Achieving Practical Interactive Segmentation with Edge-Guided Flow

The researchers propose a revolutionary interactive architecture called EdgeFlow that uses user interaction data without resorting to post-processing or iterative optimisation. The suggested technique achieves state-of-the-art performance on common benchmarks due to its coarse-to-fine network design. Additionally, the researchers create an effective interactive segmentation tool that enables the user to improve the segmentation result through flexible options incrementally.

This paper is ranked #1 on Interactive Segmentation on PASCAL VOC

Learning Transferable Visual Models From Natural Language Supervision

The authors of this work examined whether it is possible to transfer the success of task-agnostic web-scale pre-training in natural language processing to another domain. The findings indicate that adopting this formula resulted in the emergence of similar behaviours in the field of computer vision, and the authors examine the social ramifications of this line of research. CLIP models learn to accomplish a range of tasks during pre-training to optimise their training objective. Using natural language prompting, CLIP can then use this task learning to enable zero-shot transfer to many existing datasets. When applied at a large scale, this technique can compete with task-specific supervised models, while there is still much space for improvement.

This research is ranked #1 on Zero-Shot Transfer Image Classification on SUN
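The zero-shot transfer mechanism can be illustrated conceptually: embed an image and a set of natural-language label prompts in a shared space, then predict the label whose embedding is most similar to the image's. The vectors below are toy stand-ins, not real CLIP embeddings or the CLIP API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy stand-ins for an image embedding and text-prompt embeddings
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.3],
    "a photo of a car": [0.2, 0.1, 0.9],
}

# Zero-shot prediction: the most similar prompt wins, with no
# task-specific training on these labels.
best = max(label_embeddings,
           key=lambda lbl: cosine(image_embedding, label_embeddings[lbl]))
print(best)
```

Because the labels are just text prompts, swapping in a new set of class names requires no retraining, which is what makes the transfer "zero-shot."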

CoAtNet: Marrying Convolution and Attention for All Data Sizes

The researchers in this article conduct a thorough examination of the features of convolutions and transformers, resulting in a principled approach for combining them into a new family of models dubbed CoAtNet. Extensive experiments demonstrate that CoAtNet combines the advantages of ConvNets and Transformers, achieving state-of-the-art performance across a range of data sizes and compute budgets. Take note that this article is currently concentrating on ImageNet classification for model construction. However, the researchers believe their approach is relevant to a broader range of applications, such as object detection and semantic segmentation.

This paper is ranked #1 on Image Classification on ImageNet (using extra training data).

SwinIR: Image Restoration Using Swin Transformer

The authors of this article propose the SwinIR image restoration model, which is based on the Swin Transformer . The model comprises three modules: shallow feature extraction, deep feature extraction, and high-quality image reconstruction. For deep feature extraction, the researchers employ a stack of residual Swin Transformer blocks (RSTB), each composed of Swin Transformer layers, a convolution layer, and a residual connection.

This research article is ranked #1 on Image Super-Resolution on Manga109 – 4x upscaling.


American Journal of Neuroradiology

Assessing the Emergence and Evolution of Artificial Intelligence and Machine Learning Research in Neuroradiology

Alexandre Boutet, Samuel S. Haile, Hyo Jin Son, Mikail Malik, Mehran Nasralla, Jurgen Germann, Farzad Khalvati, Birgit B. Ertl-Wagner

BACKGROUND AND PURPOSE: Interest in artificial intelligence (AI) and machine learning (ML) has been growing in neuroradiology, but there is limited knowledge on how this interest has manifested into research and specifically, its qualities and characteristics. This study aims to characterize the emergence and evolution of AI/ML articles within neuroradiology and provide a comprehensive overview of the trends, challenges, and future directions of the field.

MATERIALS AND METHODS: We performed a bibliometric analysis of the American Journal of Neuroradiology ; the journal was queried for original research articles published since inception (January 1, 1980) to December 3, 2022 that contained any of the following key terms: “machine learning,” “artificial intelligence,” “radiomics,” “deep learning,” “neural network,” “generative adversarial network,” “object detection,” or “natural language processing.” Articles were screened by 2 independent reviewers, and categorized into statistical modeling (type 1), AI/ML development (type 2), both representing developmental research work but without a direct clinical integration, or end-user application (type 3), which is the closest surrogate of potential AI/ML integration into day-to-day practice. To better understand the limiting factors to type 3 articles being published, we analyzed type 2 articles as they should represent the precursor work leading to type 3.

RESULTS: A total of 182 articles were identified, with 79% being nonintegration-focused (type 1, n = 53; type 2, n = 90) and 21% (n = 39) being type 3. The total number of articles published grew roughly 5-fold in the last 5 years, with the nonintegration-focused articles mainly driving this growth. Additionally, only a minority of type 2 articles addressed bias (22%) and explainability (16%). These articles were primarily led by radiologists (63%), most of whom (60%) had additional postgraduate degrees.

CONCLUSIONS: AI/ML publications have been rapidly increasing in neuroradiology with only a minority of this growth being attributable to end-user application. Areas identified for improvement include enhancing the quality of type 2 articles, namely external validation, and addressing both bias and explainability. These results ultimately provide authors, editors, clinicians, and policymakers important insights to promote a shift toward integrating practical AI/ML solutions in neuroradiology.

  • © 2024 by American Journal of Neuroradiology

American Journal of Neuroradiology: 45 (9)






A Machine Learning Approach to Well-Being in Late Childhood and Early Adolescence: The Children’s Worlds Data Case

  • Original Research
  • Open access
  • Published: 11 September 2024


  • Mònica González-Carrasco (ORCID: orcid.org/0000-0003-3677-8175) 1,
  • Silvana Aciar 2 ,
  • Ferran Casas 3 ,
  • Xavier Oriol 1 ,
  • Ramon Fabregat 4 &
  • Sara Malo 1  

Explaining what leads to higher or lower levels of subjective well-being (SWB) in childhood and adolescence is one of the cornerstones of this field of study, since it can lead to the development of more focused prevention and promotion actions. Although many indicators of SWB have been identified, selecting some over others to obtain a reasonably short list poses a challenge, given that models are particularly sensitive to the indicators considered. Two Machine Learning (ML) algorithms, Extreme Gradient Boosting and Random Forest, together with Linear Regression, were applied to 77 indicators included in the 3rd wave of the Children’s Worlds project and then compared. Extreme Gradient Boosting outperforms the other two, while Linear Regression outperforms Random Forest. Moreover, the Extreme Gradient Boosting algorithm was used to compare models for each of the 35 participating countries with that of the pooled sample, on the basis of responses from 93,349 children and adolescents in the 10- and 12-year-old age groups, collected through representative sampling. Large differences were detected by country with regard to the importance of these 77 indicators in explaining scores for the five-item version of the CW-SWBS5 (Children’s Worlds Subjective Well-Being Scale). The process followed highlights the greater capacity of some ML techniques to provide models with higher explanatory power and less error, and to differentiate more clearly between the contributions of the different indicators in explaining children’s and adolescents’ SWB. This finding is useful when it comes to designing shorter but more reliable questionnaires (a selection of 29 indicators was used in this case).


1 Introduction

Subjective well-being (SWB) has been conceptualized as the way in which people evaluate their lives, regardless of age, both in general and in relation to specific life domains (family, friends, leisure time, etc.) (Campbell et al., 1976 ). It comprises a cognitive component (life satisfaction), but also an affective component with two dimensions (positive and negative affect), reflecting the so-called tripartite structure theory of SWB (Arthaud-Day et al., 2005 ; Diener, 1984 ; Metler & Busseri, 2017 ), which has been taken as a reference for many years. As for SWB measurement instruments, according to Holte et al. ( 2014 ), they can be classified into the following: 1) Single-item scales based on the response to a global or generic question; 2) One-dimensional multiple-item scales, based on the assumption that all items load onto a single component. The latter can be of two types: context-free , which include generic items on overall life satisfaction, and scales based on satisfaction domains ; and 3) Multidimensional multiple-item scales that refer to different components of the SWB construct. The contribution made by each of these scales in explaining the overall SWB construct has yet to be fully elucidated, however. More research is therefore needed in this direction.

Being able to explain what leads to a higher or lower SWB in childhood and adolescence has become one of the cornerstones within this field of studies, insofar as it can lead to the development of better preventive and promotion actions. In recent years, numerous attempts have been made to explain children’s and adolescents’ SWB globally. Different indicators have been used to this end, including the aforementioned satisfaction with specific life domains, and affect (both positive and negative) indicators, but also perception of control, self-esteem and values, among many other psychosocial constructs (Casas et al., 2007 , 2015 ).

Over time, a notable number of factors determining child and adolescent SWB have been identified – such as safety and social participation, for example – although most studies have focused on only a few (Marjanen et al., 2017 ; Moore, 2020 ), and commonly indicators from the cognitive dimension, such as satisfaction with life domains. We agree with the aforementioned authors that a broader range of factors needs to be taken into account through the use of large-scale surveys. In parallel, several debates have emerged in relation to which determining factors should be considered, these including the role of objective versus subjective indicators (see Voukelatou et al., 2021 ), the convenience of using more generic versus more specific indicators, and variations in these determining factors according to age, gender and sociocultural context.

Although these debates remain open, the scientific community has reached some relevant conclusions: that alongside generic indicators it is important to use indicators that refer to specific life contexts (home, school, and neighbourhood) (Campbell et al., 1976); that the contribution of different indicators to explaining SWB is unequal (Hsieh, 2022); and that the importance of each indicator may vary with age, gender and sociocultural context (Casas & González-Carrasco, 2019; González-Carrasco, 2020). In this respect, each instrument used as an SWB indicator may even display a different degree of sensitivity to each context, leading different authors to recommend that more than one SWB indicator be used with each population. To this we should add that subjective indicators are now considered to play a much more important role, relative to objective indicators, than was the case years ago (Casas, 2011; Margolis et al., 2021).

Lately, the debate has intensified over the extent to which indicators of psychological well-being (PWB) may also contribute to explaining SWB, even though SWB and PWB have traditionally been considered very distinct constructs, since they derive from different philosophical traditions. Specifically, SWB derives from the hedonic tradition revolving around the concept of pleasure (what is pleasurable generates well-being), whereas PWB derives from the eudaimonic tradition, according to which what is important is to feel fulfilled as a human being, regardless of the degree of pleasure this may be associated with. At present, there is an increasing trend to incorporate both hedonic and eudaimonic indicators to measure children’s and adolescents’ well-being, since a growing number of researchers argue that these are complementary approaches to the same broader construct of well-being (Herd, 2022 ; Ryan & Deci, 2001 ; Strelhow et al., 2020 ; Symonds et al., 2022 ).

Among the most outstanding findings to date are that of the many indicators identified over time, some are considered as the core of the SWB construct due to their higher contribution—especially those measuring its cognitive and affective dimensions—and others are viewed as more peripheral for just the opposite reason (see the interesting debates on this issue raised by Casas, 2011 ). Differentiating one from the other is no easy task, however, since models are particularly sensitive to the indicators considered and vary significantly with the introduction of some and exclusion of others, beyond theoretical or conceptual reasons. Another important finding is that the explanatory capacity associated with the models is quite low (Wilckens & Hall, 2015 ), a limitation that persists irrespective of how many indicators are added.

Although all of the above has allowed great progress to be made in knowledge regarding child and adolescent SWB, it has also led to a dead end. Effect sizes for determinants of SWB are difficult to compare across studies, for example, and some indicators may interact with one another; testing these interactions exhaustively requires a large enough dataset (Margolis et al., 2021). Moreover, while exploring non-linear relationships and interaction effects within linear models clearly improves their explanatory capacity (González et al., 2007, 2008, 2010; González-Carrasco, 2020), it is still insufficient. More robust and powerful techniques are therefore needed to better differentiate the indicators that make a greater contribution from those with less “rank ordering”, to borrow Margolis et al.’s (2021) term. This would allow prevention and promotion efforts to be concentrated on more specific actions and on the design of data-collection instruments with fewer indicators.

The advantages of meeting this challenge are notable, starting with helping to increase the quality of the collected data. It is well-known that respondents, especially children and adolescents, get tired as they go through a questionnaire, and therefore the more things they are asked for, the less attention they pay to them. It would clearly be very helpful to keep the list of indicators short, then; but this is not possible unless there is some certainty that nothing essential will be left out. Shorter instruments would also help broaden monitoring of children’s and adolescents’ SWB by reducing the huge economic and time-costs associated with multiple data collections. Cross-cultural studies would also strongly benefit from having shorter instruments at their disposal, since it is difficult to reach a consensus on which indicators to use in a given questionnaire when many countries are involved, the numerous participating researchers having their own criteria on what is more important to ask.

An example of this is the Children’s Worlds project (https://isciweb.org), a worldwide research survey of 8-, 10- and 12-year-olds’ SWB that has been collecting solid and representative data on children’s lives and daily activities, their time use and, in particular, their perceptions and evaluations of their own well-being. Its purpose is to improve children’s well-being by raising awareness among children, their parents and their communities, but also among opinion leaders, decision makers, professionals and the general public. In the context of this project, three waves of representative data collection have already been carried out, the third involving more than 128,000 children from 35 countries on four continents. That makes it one of the biggest data collections on children’s well-being worldwide, and certainly the one that includes the most multi-item psychometric scales, using broad numeric response scales to capture more variance in a phenomenon that does not display a normal statistical distribution.

In addition to all of the countries in the study committing to recruiting a minimum sample for the three age groups considered and to collecting data through a representative sampling procedure, whether at the regional or national level, a complex and time-consuming agreement process also took place to decide which indicators to include in the questionnaire and how they should be formulated. The questionnaire, which contains compulsory questions and some optional ones, is translated from English into several other languages, and a common database is created so that results can be compared afterwards. This provides the scientific community with a great amount of data that makes it possible to examine in depth how children belonging to different cultures perceive their main life contexts (family, school, and neighbourhood), but at the same time it raises the question of whether so many indicators are strictly necessary. Even traditional statistical techniques such as linear regression seem to indicate that a smaller number of indicators would probably suffice, since some of them are ultimately left out of the equation in favour of others; at the same time, however, there is the paradox that the variance explained by these models is low, so dropping many indicators means compromising the explanatory power of the model.

The conclusion seems clear: new avenues need to be explored that will allow us to advance in the objective of better understanding which factors contribute most to the SWB of children and adolescents from their own perspective. It is here that much more computationally powerful techniques considered to be part of artificial intelligence, such as machine learning (ML), make particular sense. This same opinion was also expressed by Oparina et al. ( 2022 ) when referring to human well-being in general, and not specifically that of children or adolescents. To this they also added how important SWB data have recently become for international organizations such as the OECD and national governments as a key tool in policy analysis.

ML “involves applying a performance algorithm to a large data set to produce a prediction model and using this model to predict an outcome. Repeating this process iteratively allows for a ‘perfected’ model and accurate predictions of psychological constructs” (Marinucci et al., 2018, p. 2). For this reason, ML techniques are experiencing exponential growth in many scientific fields, helping researchers analyse huge amounts of data and offering new perspectives on the results. However, they have seldom been used in the field of children’s and adolescents’ SWB (Wang et al., 2022). Among their most important advantages is the fact that, in contrast to traditional statistics, they rely on minimal a priori assumptions, such as error distribution and additivity of parameters. ML can also be used “to analyse complex multivariate relationships related to high-dimensional data with known interdependencies” (Dehghan et al., 2022, p. 3). Linear regression is much less capable in this regard, since it assumes a linear relationship between independent and dependent variables. Such an assumption may be unrealistic in large and complex data sets, such as those generated by the Children’s Worlds project, as already shown in González-Carrasco et al. (2007, 2008, 2010). ML can be used in either a supervised or an unsupervised way: with the former, the dependent variable is defined and used with both the training and the test data, while with the latter it is not. Unsupervised learning is used to interpret complex data structures, whereas supervised learning is generally used for predictions (Wilckens & Hall, 2015), as is the case in the present article.

Taking all of the above into consideration, the general objective of this article is to apply ML methodology to data from the 3rd wave of the Children’s Worlds project to determine which indicators of SWB and PWB are most relevant to children’s and adolescents’ well-being. This will be done with the aim of being able to select a limited but statistically sound set of indicators, without ignoring the current important debate around the extent to which ML outperforms more traditional data analysis techniques (see Froud et al., 2021 ; Margolis et al., 2021 ). To the best of our knowledge, this is the first attempt to evaluate the use of supervised ML algorithms to study children’s and adolescents’ SWB on an international scale and based on a very large dataset, as opposed to using a conventional technique such as linear regression. The fact that the questionnaires used in the Children's Worlds project have been developed on the basis of strong empirical evidence on SWB and successfully tested with children and adolescents in different countries on numerous occasions makes the indicators analysed here very robust and particularly suitable for the general objective of this article. To achieve this general objective, three specific objectives have also been formulated. They are described below.

The article also takes as a starting point the work conducted by Zhang et al. ( 2019 ), who aimed to “predict” (using their own words) undergraduate Chinese students’ SWB by applying ML to 298 indicators. Their analysis showed that 90% of the 1,518 participants could be correctly classified and that the sensitivity and specificity of the model were around 92% and 90%, respectively. The present article also compares a more sophisticated version of the analytical technique used by Zhang et al. ( 2019 ) ( Gradient Boosting Classifier ) with another one commonly used within machine learning ( Random Forest ), following the work by Wang et al. ( 2022 ) ( Specific objective 1 ). This allows us to test which of the two analytical techniques best explains the available data. As in Froud et al. ( 2021 ) and Oparina et al. ( 2022 ), our results are also compared to those computed through linear regression in order to determine whether using more complex analysis techniques delivers a substantial advantage without overfitting the data ( Specific objective 2 ). Finally, once the most appropriate algorithm for the available data has been identified, separate models are compared for each country in terms of suitability and explanatory power ( Specific objective 3 ), since important country differences are expected to be found.

2.1 Participants

Leaving aside the 2.3% of cases for which gender was not reported, 49.3% of the participants were boys and 50.7% girls. Boys and girls were almost identically distributed within each age group: 1) a late childhood group (mostly 10-year-olds) and 2) an early adolescence group (mostly 12-year-olds) (Table 1). The 8-year-olds were not considered in this article, since the number of indicators included in their questionnaire was very limited with the aim of avoiding fatigue. The mean age for the 10-year-old age group was 10.07 (SD = 0.733, with 58.7% of participants being 10-year-olds), while for the 12-year-old group it was 12.02 (SD = 0.766, with 54.7% being 12-year-olds). Table 2 displays the number of participants per country and age group. With only some exceptions (England, France, Greece, Malaysia and Switzerland), all 35 participating countries collected data from both 10- and 12-year-olds.

2.2 Instruments

The Children’s Worlds questionnaire is divided into different sections reflecting different domains of children’s and adolescents’ lives (Rees et al., 2020). All sections were considered in the present analysis, with the exception of the one related to country and children’s rights, since its indicators referred to very specific dynamics taking place in each country. The questionnaire includes several indicators as independent variables, some taken from the following psychometric scales: two measuring the cognitive dimension of SWB, namely 1) the CW-DBSWBS (Children’s Worlds Domain Based Subjective Well-Being Scale), a multiple-item scale based on the Brief Multidimensional Student Life Satisfaction Scale by Seligson et al. (2003), and 2) the single-item OLS (Overall Life Satisfaction Scale) by Campbell et al. (1976); one scale measuring the affective dimension of SWB, the CW-PNAS (Children’s Worlds Positive and Negative Affects Scale), based on Feldman Barrett and Russell (1998); and finally, one scale measuring PWB, the CW-PSWBS (Children’s Worlds Psychological Subjective Well-Being Scale), based on Ryff’s (1989) theoretical background and only used in the 12-year-old questionnaire. Table s3 (Supplementary Materials) shows all indicators used, whether belonging to specific scales or not, their correspondence to the different life domains assessed and their respective measurement scales.

The mean score for the CW-SWBS5 (Children’s Worlds Subjective Well-Being Scale), which is the arithmetic sum of the scores obtained for its five items divided by five (pooled sample: M = 8.19, SD = 1.712; 10-year-olds: M = 8.50, SD = 1.843; 12-year-olds: M = 8.15, SD = 1.982), was used as the dependent variable, because this version displays better cross-cultural comparability than the original six-item one (Casas & González-Carrasco, 2021). This scale is in fact an improved version based on advice from children in different countries, who were asked to suggest new wordings where items did not work properly in an earlier version. It is therefore one of the Children’s Worlds project’s most recommended scales for international comparisons, as its metric invariance has been clearly supported across the 35 countries included in the 10-year-old sample and the 30 countries included in the 12-year-old sample. The scale appraises the cognitive dimension of SWB and is context-free; that is, it does not focus on specific life domains. The items, measured on an 11-point scale ranging from 0 = Do not agree at all to 10 = Totally agree, are as follows: I enjoy my life, My life is going well, I have a good life, The things that happen in my life are excellent and I am happy with my life.
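As a minimal illustration of this scoring rule, the five-item mean can be written out in plain Python (the function name and validity checks are ours, for illustration only, not part of the Children’s Worlds materials):

```python
def cw_swbs5_score(items):
    """CW-SWBS5 score: the sum of the five item scores divided by five.

    Each item is rated on an 11-point scale from 0 (= Do not agree at all)
    to 10 (= Totally agree).
    """
    if len(items) != 5:
        raise ValueError("the CW-SWBS5 uses exactly five items")
    if not all(0 <= v <= 10 for v in items):
        raise ValueError("items are rated from 0 to 10")
    return sum(items) / 5

# A respondent answering 9, 8, 10, 7 and 9 obtains a score of 8.6.
print(cw_swbs5_score([9, 8, 10, 7, 9]))  # 8.6
```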

2.3 Procedure

Participants responded to an anonymous questionnaire, which was self-administered in their regular classroom during school hours with the support of the researchers involved. As a study involving human beings, it followed the ethical norms of the 1964 Declaration of Helsinki and its subsequent modifications, which also implied the voluntary collaboration of the schools and the children themselves. The schools were selected in such a way as to form a representative sample at the country or regional level according to the parameters considered most relevant by each national research team, such as territorial distribution or school characteristics. Within each selected school, a second sampling unit, the class corresponding to the target age group, was used, meaning the procedure entailed two-stage probability sampling.

2.4 Data analysis

Following the steps outlined by Froud et al. ( 2021 ) and Oparina et al. ( 2022 ), in this article we have attempted to elucidate whether ML algorithms perform better than conventional linear regression when explaining SWB, measured via the CW-SWBS5. A further aim was to determine whether the variables identified by each ML algorithm as important for explaining SWB were the same as those yielded by the linear regression model when all models displayed equivalent explanatory capacity and error level. In this study, two ML algorithms were used to estimate scores for the CW-SWBS5: Extreme Gradient Boosting (XGBoost) and Random Forest , XGBoost being an enhanced and optimized version of Gradient Boosting that improves model generalization capabilities (Bentéjac et al., 2021 ). According to Shwartz-Ziv and Armon (2022, cited in Oparina et al., 2022 ), Random Forest and Gradient Boosting are tree-based algorithms that perform well with tabular data, meaning data that is displayed in columns or tables.

Given the nature and volume of the questionnaires conducted in the different countries, an analysis of null or missing data was required, since missing values make predictions less reliable. These values must be identified and replaced by estimated values through a data imputation process. In this work, the K-nearest neighbours (KNN) technique was used to impute a numerical value for each missing entry. This technique has proven effective in several ML applications (Keerin & Boongoen, 2021; Malarvizhi & Thanamani, 2012).

KNN imputation assigns a value to each missing piece of data based on the k most similar observed cases, often called neighbours. In this work, the similarity between neighbours was established using the Euclidean distance, where a smaller distance means a higher similarity. Since K = 5 was used to establish the neighbourhood of the missing data, the imputed value was the arithmetic mean of the values of the five nearest neighbours.
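The imputation step described above can be sketched in plain Python. This is a simplified illustration, not the study’s actual code: it searches for neighbours only among fully observed rows, whereas production implementations (e.g. scikit-learn’s KNNImputer) handle partially observed neighbours more generally.

```python
import math

def knn_impute(rows, k=5):
    """Replace each missing value (None) with the arithmetic mean of that
    column among the k complete rows closest in Euclidean distance,
    computed over the columns observed in the incomplete row."""
    complete = [r for r in rows if None not in r]
    result = []
    for r in rows:
        if None not in r:
            result.append(list(r))
            continue
        observed = [j for j, v in enumerate(r) if v is not None]
        # Sort the complete rows by distance over the observed columns only.
        neighbours = sorted(
            complete,
            key=lambda c: math.sqrt(sum((r[j] - c[j]) ** 2 for j in observed)),
        )[:k]
        filled = list(r)
        for j, v in enumerate(r):
            if v is None:
                filled[j] = sum(n[j] for n in neighbours) / len(neighbours)
        result.append(filled)
    return result

# With k = 2, the missing value in the last row is imputed as the mean
# of the corresponding column in the two nearest complete rows.
data = [[1, 1], [1, 2], [10, 10], [10, 11], [1, None]]
print(knn_impute(data, k=2)[-1])  # [1, 1.5]
```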

Once the data had been imputed, the ML algorithms were applied. The training process was carried out with both algorithms: 70% of the data was used for training and obtaining the models, and the remaining 30% for their evaluation. The R² and the Standard Error of the Estimate (SEE) were used for the linear regression model, while the Root Mean Square Error (RMSE) and the coefficient of determination (R²) were used to evaluate the effectiveness of the ML models. Both SEE and RMSE measure the error between the actual and the predicted values and are expressed in the units of the dependent variable, so they are directly comparable. The linear regression model and the ML algorithms were calculated using SPSS and Python, respectively.
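The 70/30 split and the two evaluation measures can be made explicit. The sketch below uses plain Python with our own helper names (not the authors’ SPSS/Python code) to show the formulas behind RMSE and R²:

```python
import math
import random

def train_test_split(rows, train_frac=0.7, seed=0):
    """Shuffle the rows and split them: 70% for training, 30% for evaluation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def rmse(y_true, y_pred):
    """Root Mean Square Error: the square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

train, test = train_test_split(range(10))
print(len(train), len(test))  # 7 3
```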

3.1 Linear Regression Model

The adjusted R² for the linear regression model was 0.644 with an SEE of 1.143, meaning a low explanatory capacity for the dependent variable and a high error, but a good fit (F(77, 93343) = 2194.423, p < 0.001). As Table 3 shows, 11 of the 77 indicators considered here were not statistically significant in explaining the dependent variable (CW-SWBS5). Standardized beta coefficients for the significant indicators ranged from -0.002 (teacherslisten) to 0.145 (satisfiedlifeaswhole, satisfaction with life as a whole), with only two indicators having a standardized beta coefficient higher than 0.1 (satisfiedlifeaswhole and feelinghappy).

3.2 Random Forest Model

The model calculated by means of the Random Forest algorithm yielded an R² of 0.634 and an RMSE of 0.988. The high error together with the low R² are clear signs that this algorithm did not perform well with the available data. The contributions of each of the 77 indicators considered here are shown in Fig. 1 and Table 4, with “having equipment/things you need for school” being the least contributing indicator and “satisfaction with life as a whole” the most in explaining scores for the CW-SWBS5.

Figure 1. Visual representation of the contributions made by the included indicators in explaining scores for the CW-SWBS5 for the Random Forest algorithm (highest contributions at the bottom of the graph). Note: the complete wording of the indicators is reproduced in Table s3.

3.3 Extreme Gradient Boosting model

The model calculated through the Extreme Gradient Boosting (XGBoost) algorithm yielded an R² of 0.765 and an RMSE of 0.899, meaning good explanatory power and a reasonable error. More specifically, this model outperformed both the linear regression model (R² of 0.765 versus 0.644) and the Random Forest model (0.765 versus 0.634), without showing signs of overfitting the data. It also displayed a lower error than both the linear model (0.899 versus 1.143) and the Random Forest model (0.899 versus 0.988). The contributions of each of the 77 indicators considered here are shown in Fig. 2 and Table 5, with “gender” being the least contributing indicator and “satisfaction with life as a whole” the most in explaining scores for the CW-SWBS5.

Figure 2. Visual representation of the contributions made by the included indicators in explaining scores for the CW-SWBS5 for the Extreme Gradient Boosting algorithm (highest contributions at the bottom of the graph). Note: the complete wording of the indicators is reproduced in Table s3.

3.4 Selection of the Short List of Indicators

Having determined that the Extreme Gradient Boosting (XGBoost) algorithm was the most suitable one for our data, the following step was to identify the indicators that best explained the scores for the CW-SWBS5. Since no specific cut-point has been defined within ML applications, we took a coefficient above 0.01 as the sign that a given indicator made a substantial contribution, which resulted in the selection of 29 indicators from the initial list of 77 (see Table 5). These indicators came from the following instruments: the six items of the CW-PSWBS (Children’s Worlds Psychological Subjective Well-Being Scale), the five items of the CW-DBSWBS (Children’s Worlds Domain Based Subjective Well-Being Scale), two items (feelinghappy and feelingsad) from the CW-PNAS (Children’s Worlds Positive and Negative Affects Scale), eight items on satisfaction with different life domains (satisfiedthingslearned, satisfiedhouse, satisfiedlaterinlife, satisfiedthingshave, satisfiedtimeuse, satisfiedfreedom, satisfiedlistenedto, satisfiedsafety), two items on perceptions related to school (teacherscare, schoolsafe) and five items measuring family-related perceptions (parentslisten, familygoodtimetogether, homesafe, familyhelpproblem and parentsjointdecisions). Finally, the indicator that contributed most was the OLS (Overall Life Satisfaction Scale), with a coefficient above 0.2.
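The selection rule itself is simple to express. In the sketch below (our own illustration, with invented importance values and only a few of the 77 indicator names), any indicator whose coefficient exceeds the 0.01 cut-point is kept, ordered by contribution:

```python
def select_indicators(importances, threshold=0.01):
    """Keep the indicators whose importance coefficient exceeds the cut-point,
    ordered from the highest to the lowest contribution."""
    kept = [(name, coef) for name, coef in importances.items() if coef > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical coefficients for four indicators (values invented for illustration).
example = {
    "satisfiedlifeaswhole": 0.21,
    "feelinghappy": 0.05,
    "teacherscare": 0.013,
    "gender": 0.001,
}
print([name for name, _ in select_indicators(example)])
# ['satisfiedlifeaswhole', 'feelinghappy', 'teacherscare']
```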

3.5 Extreme Gradient Boosting Models by Country

In the next step, we proceeded to calculate one model for each country, again using the Extreme Gradient Boosting ( XGBoost ) algorithm, with the aim of exploring whether there were differences between countries in the role played by the 77 indicators in explaining CW-SWBS5 scores. Table s6 (Supplementary Materials) shows the coefficients obtained for each indicator and country, while Table s7 (Supplementary Materials) displays the ordering of these indicators according to their coefficients. This facilitates comparisons between countries regarding how far they are from the ordering in the pooled sample.

Table s6 shows that the R² of the different models ranged from a low-to-medium value in England (0.53) to a high value in Greece (0.93), with twelve countries displaying an R² above that of the pooled sample (0.765) and twenty-three below this value. Regarding the errors contained in the models, twenty-three countries displayed an RMSE below that of the pooled sample (0.899), which represents an acceptable level of error; these values ranged between 0.41 (Croatia) and 0.81 (Bangladesh). However, in thirteen countries this error was above 0.899 (from 0.97 in Switzerland to 1.67 in England), which is undesirably high.
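The per-country screening described here reduces to comparing each country's RMSE against the pooled-sample value. A minimal sketch using only the four country values quoted in the text:

```python
POOLED_RMSE = 0.899  # RMSE of the pooled-sample XGBoost model

# The four per-country RMSE values quoted in the text
country_rmse = {
    "Croatia": 0.41,
    "Bangladesh": 0.81,
    "Switzerland": 0.97,
    "England": 1.67,
}

# Countries whose model error exceeds the pooled benchmark
high_error = sorted(c for c, e in country_rmse.items() if e > POOLED_RMSE)
print(high_error)
```

Run over all 35 countries, this comparison is what identifies the thirteen models with undesirably high error.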

Considerable variability was observed when the coefficients of the 77 indicators were analysed by country, with seven countries showing coefficients of 0.0000 (Bangladesh, Belgium, England, France, Greece, Malaysia and Switzerland). These corresponded to thirteen indicators, only three of which belonged to the 29 on the list mentioned previously: parentsjointdecisions, schooldecisions and peoplefriendy. In contrast, three countries displayed coefficients above 0.3, corresponding to frequencyschoolfights (Poland) and satisfiedlifeaswhole (South Korea and Germany), with only the latter belonging to the list of 29 selected indicators.

Consistent with the variability in the coefficients commented on above, the results displayed in Table s7 also highlight the great variability among countries regarding the importance awarded to the 77 indicators considered here. Wales, followed by Brazil, displayed the ordering most similar to the pooled sample, while Israel, followed by Bangladesh, displayed the least similar. In terms of specific indicators, gender was the indicator with the greatest consensus among countries in terms of ordering (from the 68th position in Bangladesh to the 77th position in 16 countries: Algeria, Chile, Croatia, Hong Kong, Hungary, Indonesia, Israel, Italy, Malta, Poland, South Korea, Spain, Sri Lanka, Taiwan, Vietnam and Wales), while familyhelpproblem had the lowest (from the 1st position in India to the 75th position in Indonesia and Israel). Turning our attention to the list of 29 indicators, familyhelpproblem continued to be the indicator receiving the lowest consensus among those included in this shorter list, while satisfiedlifeaswhole was the one receiving the greatest (from the 1st position in England, Finland, France, Germany, Greece, Hong Kong, Hungary, Norway, Russia, South Korea, Spain, Taiwan and Wales, to the 29th in Switzerland, being among the top seven positions in most countries).

4 Discussion

The premise for this article is the importance of measuring and monitoring children's and adolescents' SWB and the consequent need for accurate and manageable measures of how children and adolescents perceive their life conditions, considered a prerequisite for having good data and implementing efficient and accountability-based public policies targeted at these age groups (see Ben-Arieh, 2008). Having a good selection of indicators at their disposal is useful for researchers for the following reasons: both time and economic resources are limited; survey questionnaires cannot be too long because respondents get tired and reliability declines; and the longer a questionnaire, the greater the complexity of translating it into different languages and the more difficult it is to infer practical consequences from the results. Determining which indicators are most salient to children's and adolescents' well-being is a key issue that, in our view, has only been partially resolved, hence our attempt to fill this gap through the general objective of this article.

Although conducted with older participants (university students) than the ones considered here, the 2019 study by Zhang et al. has been used as a starting point for this article, since it shares the same objective of explaining SWB through ML methodology and identifying the most efficient indicators in this explanation. In contrast to the present article, however, Zhang et al. (2019) considered not only a combination of subjective and socioeconomic indicators, but also biological ones, including blood type and weight, to explain SWB measured through the SWLS (Satisfaction with Life Scale) and the PANAS (Positive and Negative Affect Schedule).

Using the Gradient Boost Classifier method, they identified the 20 indicators that contributed most to "predicting" participants' SWB. According to the authors, the method can help detect people at risk with a very limited number of indicators, including items taken from the CES-D (Center for Epidemiologic Studies Depression Scale), the BFI (Big Five Inventory), sleep quantity, the ASLEC (Adolescent Self-Rating Life Events Checklist), the DFS (Dispositional Flow Scale) and the MMCS (Multidimensional-Multiattributional Causality Scale), the latter as a measure of achievement and affiliation locus of control. One important limitation of the study is that all participants belonged to the same college and country. It was therefore necessary to extend this type of analysis to more culturally diversified samples, as we have done here.

The analysis performed in this article, using a more sophisticated version of the Gradient Boost Classifier, specifically the Extreme Gradient Boosting (XGBoost) algorithm (specific objective 1), has allowed us to identify 29 of the 77 indicators that make a substantial contribution to explaining a general measure of SWB. In terms of their content, these 29 represent a set of indicators that covers both the cognitive (measured through the CW-DBSWBS, the OLS, and satisfaction with different life domains) and the affective dimensions of SWB (although only two of the items from the CW-PNAS are included in this list of 29 indicators). They also contain all of the items that measure PWB through the CW-PSWBS. The other indicators are related to children's and adolescents' opinions, perceptions and evaluations regarding different contexts of their lives: neighbourhood, school and family, with a stronger presence of family-related indicators. Interestingly, some of these indicators have been highlighted in recent years as making more of a contribution to children's and adolescents' well-being than was previously thought, namely: satisfiedsafety and satisfiedlistenedto (González-Carrasco et al., 2023), and satisfiedtimeuse (Casas et al., 2015). It is also worth mentioning that, generally speaking, and unlike classical linear regression, ML methods are not stepwise in nature, meaning more advanced methods are used to select the most relevant explanatory variables for the model. The models used here (Random Forest and XGBoost) are tree-based models, in which variables are automatically evaluated to determine their ability to divide the data into the branches of the tree, with the variables that predominate in these models being the most important (Yilmazer & Kocaman, 2020).

As theoretically anticipated, the OLS is the indicator that contributes by far the most to explaining CW-SWBS5 scores. Although its exclusion from the analysis might be argued as a way of reducing multicollinearity (both are general cognitive measures of SWB), the reason for not doing so is threefold. Firstly, the authors needed to verify whether results obtained through ML algorithms were coherent with the scientific knowledge in the field, given the scarce literature available using both SWB and PWB indicators in childhood and adolescence. Secondly, decision tree-based algorithms like Random Forest and Gradient Boosting are more flexible than linear models and can better manage non-linear relationships without being as affected by multicollinearity. Since, unlike linear models, ML algorithms do not assume a strict linear relationship between the independent and dependent variables, they can also automatically select the most relevant characteristics and eliminate the redundant ones, thereby decreasing multicollinearity (Chan et al., 2022 ; Garg & Tai, 2013 ). This approach therefore yielded sufficiently robust models, as was the case for the Gradient Boosting ( XGBoost ) algorithm. And, thirdly, despite linear regression being much more sensitive to multicollinearity, when it was performed without including the OLS, the explanatory power of the other indicators was only slightly higher, the number of statistically significant indicators did not display significant changes and the degree of error was very similar.

The process followed here paves the way to consider the use of fewer indicators in this type of study, and the advantages that this would entail, as described in the Introduction section. It also invites further discussion on relevant conceptual issues, such as the differences between the constructs of SWB and PWB in the case of children and adolescents, given that, as our results show, the boundaries between the two are not easy to establish (see the work by Symonds et al., 2022 on this topic), or the importance of the affective dimension of SWB, previously highlighted by Blore et al ( 2011 ).

Although the aims of this article did not include investigating the structure of children’s and adolescents’ well-being, given the important role played by the CW-PSWBS indicators as a measure of PWB in all of the constructed models, the results seem to suggest that a tetrapartite model of well-being would be feasible, as previously suggested by Moreta-Herrera et al. ( 2023 ). This would mean taking Savahl et al.’s ( 2021 ) quadripartite model as a basis, which includes positive affect, negative affect, cognitive life satisfaction domain-based and cognitive life satisfaction context-free components, and then adding a measure for PWB, as in the model proposed by Strelhow et al. ( 2020 ).

Wang et al. (2022) also applied the Random Forest technique to data collected from 12,058 Chinese 15-year-olds using the PISA 2018 Chinese dataset. This allowed them to identify nine out of a total of 35 indicators with the greatest capacity to separately explain positive affect, negative affect, OLS and PWB; the accuracy of these results ranged from 76 to 78%. Among the background predictors, socioeconomic status was the only key factor, specifically explaining negative affect. The top non-cognitive/metacognitive factors included resilience, self-concept of competence, work mastery, mastery goal orientation, competitiveness, fear of failure and enjoyment, while the schooling factors most influencing the students' well-being included a sense of school belonging, parental emotional support, perceived cooperation at school, teacher stimulation, the experience of being bullied, teacher feedback and teacher interests. In the case of our study, the Random Forest algorithm did not provide a good model, since it displayed a high error and low explanatory power (Specific objective 1). The weaker performance of Random Forest can be attributed to the fact that its trees are built in parallel with random samples from the training set, while in Gradient Boosting the trees are built sequentially, each one decreasing the error made by the previous tree.

Framed around Axford's (2009) argument that constant critical reflection is required to ensure selection of the best indicators, given that this has direct implications for children's policy, we have compared the results obtained in this article using two widely used ML techniques with those obtained by means of linear regression (Specific objective 2), following the steps outlined by Froud et al. (2021). Aiming to explain the academic performance and quality of life of Norwegian children aged 11–12, these authors concluded that linear regression was less prone to overfitting and that it outperformed four machine learning techniques (K-nearest Neighbours, Neural Networks, Random Forest and Support Vector Machine) for continuous health outcome variables. They therefore recommend ML techniques only in those cases where there are non-linear and heteroscedastic relationships between variables and few missing cases.

Wilckens and Hall ( 2015 ) also found that the different ML algorithms they used in their study ( Kernel Smoothing Algorithms, Neural Network Algorithms and Feature Selection Algorithms ) aimed at explaining the scores on the Human Flourishing Index using a combination of demographic and personality indicators among adults from four data collections over four weeks did not provide higher prediction accuracies than the general linear model when appropriately tested with sufficient cross-validation. The authors considered three possible explanations for these results: 1) the algorithms used were not able to sufficiently fit the existing structure within the data; 2) the dataset was too small, so cross-validation does not allow existing structures to be found; and 3) the links between personality, demographics and well-being were linear, and therefore well-described by the generalized linear model.

In contrast with the above works by Froud et al. ( 2021 ) and Wilckens and Hall ( 2015 ), in the present article the conventional linear model only outperforms one, rather than two, of the ML techniques used. Specifically, the Extreme Gradient Boosting ( XGBoost ) algorithm performed better than the Random Forest algorithm and the linear regression model ( Specific objective 1 and Specific objective 2 ). This might be explained by the fact that Random Forest first combines the prediction of multiple parallel trees before then combining the results of these trees into one mean or mode value. If one of the trees has a high error value because it selected the least significant variables, then that tree contributes to a high error value. With the Extreme Gradient Boosting ( XGBoost ) algorithm , by sequentially improving the prediction of each tree, the error of one tree is decreased in the next, and so on, until a value with the lowest prediction error is obtained.
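The sequential error reduction described above can be shown with a deliberately tiny sketch, in which each "tree" is reduced to a depth-0 stump (the mean of the current residuals). This is a toy illustration of the boosting loop, not the study's implementation:

```python
import numpy as np

def boosted_predictions(y, n_rounds, lr=0.5):
    """Gradient boosting reduced to its core loop: each round fits the
    residuals left by the previous rounds (here each 'tree' is just the
    mean of the current residuals) and adds a damped correction."""
    pred = np.zeros_like(y, dtype=float)
    errors = []
    for _ in range(n_rounds):
        residual = y - pred
        pred = pred + lr * residual.mean()   # this round's "tree"
        errors.append(float(np.sqrt(np.mean((y - pred) ** 2))))
    return pred, errors

pred, errors = boosted_predictions(np.array([1.0, 2.0, 3.0]), n_rounds=4)
print(errors)
```

With y = [1, 2, 3] the RMSE shrinks round after round, which is the behaviour that distinguishes boosting's sequential correction from the parallel averaging of Random Forest.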

Compared to the study by Oparina et al. (2022), the R² obtained through the Extreme Gradient Boosting (XGBoost) algorithm was around 0.7, far above the roughly 0.3 achieved with data from three other surveys (the German Socio-Economic Panel, or SOEP; the UK Longitudinal Household Survey, or UKHLS; and the US Gallup Daily Poll). One explanation for this result is that the Gradient Boosting Classifier does not have a native implementation of regularization, which can make the model prone to overfitting in certain cases. In contrast, Extreme Gradient Boosting (XGBoost) incorporates L1 (Lasso) and L2 (Ridge) regularization into its algorithm, which helps prevent overfitting while improving the generalizability of the model (Bentéjac et al., 2021).
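XGBoost's L2 penalty enters through its closed-form optimal leaf weight, w* = -G / (H + λ), where G and H are the sums of the loss's first and second derivatives over the instances in a leaf and λ is the L2 regularization strength. A minimal sketch of that one formula (not of the full algorithm), showing how a larger λ shrinks a leaf's contribution:

```python
def optimal_leaf_weight(grad_sum, hess_sum, reg_lambda):
    """XGBoost's closed-form leaf weight w* = -G / (H + lambda):
    the L2 term inflates the denominator and shrinks the leaf's
    contribution toward zero, which curbs overfitting."""
    return -grad_sum / (hess_sum + reg_lambda)

# The same leaf, with and without L2 regularization
unregularized = optimal_leaf_weight(-10.0, 5.0, 0.0)
regularized = optimal_leaf_weight(-10.0, 5.0, 5.0)
print(unregularized, regularized)
```

The shrunken weight is what gives XGBoost its built-in defence against overfitting relative to an unregularized gradient boosting implementation.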

None of the works mentioned above included such an array of countries as the one considered here, this being an unprecedented contribution of the current study. Specifically, this has made it possible to test the fit and explanatory capacity of the models for each of the 35 countries involved, after establishing the best algorithm for doing so in accordance with Specific objective 3. For most of the countries, the R² and RMSE were found to be adequate, although for twelve of them (Brazil, England, France, Hong Kong, Namibia, Nepal, South Africa, Sri Lanka, Switzerland, Taiwan, Vietnam and Wales) the combination of a low R² and a high RMSE suggests their corresponding models should be treated with special caution. Limitations of this kind when trying to compare different countries in terms of children's and adolescents' well-being (related to different ways of understanding the indicators and constructs under study) have already been reported using different means of analysis but the same dataset (see Casas & González-Carrasco, 2021).

As expected, another important finding was the huge variability observed among countries, both when the coefficients corresponding to each indicator were compared in terms of their higher or lower contribution to explaining scores on the CW-SWBS5, and when they were ordered according to weight. This variability did not seem to be reduced when the focus was placed on the 29 highest-contributing indicators, which suggests that caution should be exercised in interpreting the results from the pooled sample, since they do not represent all participating countries equally.

4.1 Limitations and Future Research

As with the study by Oparina et al. ( 2022 ), we focused our analysis on identifying the variables key to explaining children’s and adolescents’ SWB, rather than on clusters of individuals selected according to specific variables, such as the country participants belong to, this being a second step to be explored in the future. Neither were gender differences analysed here, since this would have exceeded the scope of this article, although this could be an interesting future avenue to explore.

Furthermore, the approach adopted here was supervised ML. It is therefore necessary to verify the feasibility of an unsupervised ML approach using the data available from the Children’s Worlds project. One limitation of ML methods is that they are not thought to analyse the theoretical mechanisms connecting the different explanatory factors in any depth. In the future, more classical statistical analyses such as mediational models and structural equation modelling would be of great help in this regard (Wang et al., 2022 ).

5 References

Arthaud-Day, M. L., Rode, J. C., Mooney, C. H., & Near, J. P. (2005). The subjective well-being construct: A test of its convergent, discriminant, and factorial validity. Social Indicators Research, 74 (3), 445–476. https://doi.org/10.1007/s11205-004-8209-6


Axford, N. (2009). Child well- being through different lenses: Why concept matters. Child and Family Social Work, 14 , 372–383. https://doi.org/10.1111/j.1365-2206.2009.00611.x

Barrett, L., & Russell, J. A. (1998). Independence and bipolarity in the structure of current affect. Journal of Personality and Social Psychology, 74 (4), 967–984. https://doi.org/10.1037/0022-3514.74.4.967

Ben‐Arieh, A. (2008). Indicators and Indices of children’s well‐being: Towards a more policy‐oriented perspective. European Journal of Education, 43 (1), 37–50. https://doi.org/10.1111/j.1465-3435.2007.00332.x

Bentéjac, C., Csörgő, A., & Martínez-Muñoz, G. (2021). A comparative analysis of gradient boosting algorithms. Artificial Intelligence Review, 54 , 1937–1967. https://doi.org/10.1007/s10462-020-09896-5

Blore, J. D., Stokes, M. A., Mellor, D., Frith, L., & Cummins, R. A. (2011). Comparing multiple discrepancies theory to affective models of subjective wellbeing. Social Indicators Research, 100 , 1–16. https://doi.org/10.1007/s11205-010-9599-2

Campbell, A., Converse, P.E., & Rodgers, W.L. (1976). The quality of American life: Perceptions, evaluations, and satisfactions. Russell Sage.

Casas, F. (2011). Subjective social indicators and child and adolescent well-being. Child Indicators Research, 4 , 555–575. https://doi.org/10.1007/s12187-010-9093-z

Casas, F., Figuer, C., González, M., & Malo, S. (2007). The values adolescents aspire to, their well-being and the values parents aspire to for their children. Social Indicators Research, 84 , 271–290. https://doi.org/10.1007/s11205-007-9141-3

Casas, F., & González-Carrasco, M. (2019). Subjective well-being decreasing with age: New research on children over 8. Child Development, 90 (2), 375–394. https://doi.org/10.1111/cdev.13133

Casas, F., & González-Carrasco, M. (2021). Analysing comparability of four multi-item well-being psychometric scales among 35 countries using Children’s Worlds 3 rd  Wave 10 and 12-year-olds samples. Child Indicators Research, 14 (5), 1829–1861. https://doi.org/10.1007/s12187-021-09825-0

Casas, F., Sarriera, J. C., Alfaro, J., González, M., Bedin, L., Abs, D., Figuer, C., & Valdenegro, B. (2015). Reconsidering life domains that contribute to subjective well-being among adolescents with data from three countries. Journal of Happiness Studies, 16 , 491–513.

Chan, J. Y. L., Leow, S. M. H., Bea, K. T., Cheng, W. K., Phoong, S. W., Hong, Z. W., & Chen, Y. L. (2022). Mitigating the multicollinearity problem and its machine learning approach: A review. Mathematics, 10 (8), 1283. https://doi.org/10.3390/math10081283

Rees, G., Savahl, S., Lee, B.J., & Casas, F. (Eds.). (2020). Children’s views on their lives and well-being in 35 countries: A report on the Children’s Worlds project, 2016–19 . Jerusalem, Israel: Children’s Worlds Project (ISCWeb). https://isciweb.org/wp-content/uploads/2020/07/Childrens-Worlds-Comparative-Report-2020.pd

Dehghan, P., Alashwal, H., & Moustafa, A. A. (2022). Applications of machine learning to behavioral sciences: Focus on categorical data. Discover Psychology . https://doi.org/10.1007/s44202-022-00027-5

Diener, E. (1984). Subjective well-being. Psychological Bulletin, 95 , 542–575.

Froud, R., Hansen, S. H., Ruud, H. K., Foss, J., Ferguson, L., & Fredriksen, P. M. (2021). Relative performance of machine learning and linear regression in predicting quality of life and academic performance of school children in Norway: Data analysis of a quasi-experimental study. Journal of Medical Internet Research, 23 (7), e22021. https://doi.org/10.2196/22021

Garg, A., & Tai, K. (2013). Comparison of statistical and machine learning methods in modelling of data with multicollinearity. International Journal of Modelling, Identification and Control, 18 (4), 295–312. https://doi.org/10.1504/IJMIC.2013.053535

González, M., Casas, F., & Coenders, G. (2007). A complexity approach to psychological well-being in adolescence: Major strengths and methodological issues. Social Indicators Research, 80 , 267–295. https://doi.org/10.1007/s11205-005-5073-y

González, M., Coenders, G., & Casas, F. (2008). Using non-linear models for a complexity approach to psychological well-being. Quality & Quantity, 42 , 1–21. https://doi.org/10.1007/s11135-006-9032-8

González, M., Coenders, G., Saez, M., & Casas, F. (2010). Non-linearity, complexity and limited measurement in the relationship between satisfaction with specific life domains and satisfaction with life as a whole. Journal of Happiness Studies, 11 , 335–352. https://doi.org/10.1007/s10902-009-9143-8

González-Carrasco, M., Bedin, L., Casas, F., Alfaro, J., & Castellá Sarriera, J. (2023). Safety, perceptions of good treatment and subjective well-being in 10- and 12-year-old children in three countries. Applied Research Quality Life, 18 , 1521–1544. https://doi.org/10.1007/s11482-023-10151-6

González-Carrasco, M., Sáez, M., & Casas, F. (2020). Subjective well-being in early adolescence: Observations from a five-year longitudinal study. International Journal of Environmental Research and Public Health, 17 , 8249. https://doi.org/10.3390/ijerph17218249

Herd, S. M. (2022). Synthesis in hedonic and eudaimonic approaches: A culturally responsive four-factor model of aggregate subjective well-being for Hong Kong children. Child Indicators Research, 15 , 1103–1129. https://doi.org/10.1007/s12187-021-09901-5

Holte, A., Berry, M. M., Bekkhus, M., Borge, A. I. H., Bowes, L., Casas, F., et al. (2014). Psychology of child well-being. In A. Ben-Arieh, F. Casas, I. Frønes, & J. E. Korbin (Eds.), Handbook of Child Well-Being (pp. 555–631). Springer.


Hsieh, Cm. (2022). Are all life domains created equal? Domain importance weighting in subjective well-being research. Applied Research Quality Life, 17 , 1909–1925. https://doi.org/10.1007/s11482-021-10016-w

Keerin, P., & Boongoen, T. (2021). Improved KNN imputation for missing values in gene expression data. Computers, Materials and Continua, 70 (2), 4009–4025.  https://doi.org/10.32604/cmc.2022.020261

Malarvizhi, R., & Thanamani, A. S. (2012). K-nearest neighbor in missing data imputation. International Journal of Engineering Research & Development, 5 (1), 5–7.


Margolis, S., Elder, J., Hughes, B., & Lyubomirsky, S. (2021). What are the most important predictors of subjective well-being? Insights from a machine learning and linear regression approaches on the MIDUS Research. PsyArXiv Preprints. https://doi.org/10.31234/osf.io/ugfjs

Marinucci, A., Kraska, J., & Costello, S. (2018). Recreating the relationship between subjective wellbeing and personality using machine learning: An investigation into Facebook online behaviours. Big Data and Cognitive Computing, 2 (3), 29. https://doi.org/10.3390/bdcc2030029

Marjanen, P., Ornellas, A., & Mäntynen, L. (2017). Determining holistic child well-being: Critical reflections on theory and dominant models. Child Indicators Research, 10 (3), 633–647. https://doi.org/10.1007/s12187-016-9399-6

Metler, S. J., & Busseri, M. A. (2015). Further evaluation of the tripartite structure of subjective well‐being: Evidence from longitudinal and experimental studies. Journal of Personality, 85 (2), 192–206. https://doi.org/10.1111/jopy.12233

Moore, K. A. (2020). Developing an indicator system to measure child well-being: Lessons learned over time. Child Indicators Research, 13 , 729–739. https://doi.org/10.1007/s12187-019-09644-4

Moreta-Herrera, R., Oriol-Granado, X., & González-Carrasco, M. (2023). Examining the relationship between subjective well-being and psychological well-being among 12-year-old-children from 30 countries. Child Indicators Research . https://doi.org/10.1007/s12187-023-10042-0

Oparina, E., Kaiser, C., Gentile, N., Tkatchenko, Clark, A.E., De Neve, J-E., & D’Ambrosio, C. (2022). Human wellbeing and machine learning. Discussion Paper No. 1863. Centre for Economic Performance. ISSN 2042–2695.

Ryan, R. M., & Deci, E. L. (2001). On happiness and human potentials: A review of research on hedonic and eudaimonic well-being. Annual Review of Psychology, 52 , 141–166. https://doi.org/10.1146/annurev.psych.52.1.141

Ryff, C. D. (1989). Happiness is everything, or is it? Explorations on the meaning of psychological well-being. Journal of Personality and Social Psychology, 57 (6), 1069.

Savahl, S., Casas, F., & Adams, S. (2021). The structure of children’s subjective well-being. Frontiers in Psychology . https://doi.org/10.3389/fpsyg.2021.650691

Seligson, J. L., Huebner, E. S., & Valois, R. F. (2003). Preliminary validation of the Brief Multidimensional Students’ Life Satisfaction Scale (BMSLSS). Social Indicators Research, 61 , 121–145.  https://doi.org/10.1023/A:1021326822957

Strelhow, M. R. W., Sarriera, J. C., & Casas, F. (2020). Evaluation of well-being in adolescence: Proposal of an integrative model with hedonic and eudemonic aspects. Child Indicators Research, 13 , 1439–1452. https://doi.org/10.1007/s12187-019-09708-5

Symonds, J. E., Sloan, S., Kearns, M., Devine, D., Sugrue, C., Suryanaryan, S., Capistrano, D., & Samonova, E. (2021). Developing a social evolutionary measure of child and adolescent hedonic and eudaimonic wellbeing in Rural Sierra Leone. Journal of Happiness Studies, 23 (4), 1433–1467. https://doi.org/10.1007/s10902-021-00456-4

Voukelatou, V., Gabrielli, L., Miliou, I., Cresci, S., Sharma, R., Tesconi, M., & Pappalardo, L. (2021). Measuring objective and subjective well-being: Dimensions and data sources. International Journal of Data Science and Analytics, 11 , 279–309. https://doi.org/10.1007/s41060-020-00224-2

Wang, Y., King, R., & Leung, S. O. (2022). Understanding Chinese students’ well-being: A machine learning study. Child Indicators Research . https://doi.org/10.1007/s12187-022-09997-2

Wilckens, M., & Hall, M. (2015). Can well-being be predicted? A machine-learning approach. SSRN. https://doi.org/10.2139/ssrn.2562051

Yilmazer, S., & Kocaman, S. (2020). A mass appraisal assessment study using machine learning based on multiple regression and random forest. Land Use Policy, 99 , 104889. https://doi.org/10.1016/j.landusepol.2020.104889

Zhang, N., Liu, C., Chen, Z. et al. (2019). Prediction of adolescent subjective well-being: A machine learning approach. General Psychiatry, 32 . https://gpsych.bmj.com/content/32/5/e100096


Acknowledgements

Thanks are due to all of the children who kindly agreed to answer the questionnaire, all principal investigators and all research team members who participated in data collection in the 35 countries included in the sample used here. Also to the coordinating team of the Children’s Worlds project for kindly allowing us to use the database, the Jacobs Foundation for supporting the project, and to Barnaby Griffiths for the English editing of this paper.

Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

Author information

Authors and Affiliations

Research Institute on Quality of Life, University of Girona, Girona, Spain

Mònica González-Carrasco, Xavier Oriol & Sara Malo

Institute of Astronomical Sciences of Earth and Space (ICATE-CONICET), National University of San Juan, San Juan, Argentina

Silvana Aciar

Doctoral Program on Education and Society, Faculty of Education and Social Sciences, University Andrés Bello, Santiago de Chile, Chile

Ferran Casas

Institute of Informatics and Applications, University of Girona, Girona, Spain

Ramon Fabregat


Ethics declarations

Conflict of interest.

On behalf of all authors, the corresponding author states that there are no conflicts of interest.

Ethical Statement

The researchers declare that they have complied with all the ethical requirements for this type of study, having received authorization from each school’s education authority.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 133 KB)

Rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

González-Carrasco, M., Aciar, S., Casas, F. et al. A Machine Learning Approach to Well-Being in Late Childhood and Early Adolescence: The Children’s Worlds Data Case. Soc Indic Res (2024). https://doi.org/10.1007/s11205-024-03429-1


Accepted : 26 August 2024

Published : 11 September 2024

DOI : https://doi.org/10.1007/s11205-024-03429-1


  • Subjective well-being
  • Psychological well-being
  • Adolescence
  • Children’s Worlds
  • Machine Learning (ML)
  • Open access
  • Published: 10 September 2024

Predictive etiological classification of acute ischemic stroke through interpretable machine learning algorithms: a multicenter, prospective cohort study

  • Siding Chen 1 , 2 , 3 ,
  • Xiaomeng Yang 1 ,
  • Hongqiu Gu 1 , 2 ,
  • Yanzhao Wang 4 ,
  • Zhe Xu 1 , 2 ,
  • Yong Jiang 1 , 2 , 3 , 5 &
  • Yongjun Wang 1 , 2 , 3 , 6 , 7 , 8  

BMC Medical Research Methodology volume  24 , Article number:  199 ( 2024 )


Background

The prognosis, recurrence rates, and secondary prevention strategies vary significantly among the subtypes of acute ischemic stroke (AIS). Machine learning (ML) techniques can uncover intricate, non-linear relationships within medical data, enabling the identification of factors associated with etiological classification. However, there is currently a lack of research applying ML algorithms to predict AIS etiology.

We aimed to use interpretable ML algorithms to develop AIS etiology prediction models, identify critical factors in etiology classification, and enhance existing clinical categorization.

Methods

This study involved patients from the Third China National Stroke Registry (CNSR-III). Nine models, which included Natural Gradient Boosting (NGBoost), Categorical Boosting (CatBoost), Extreme Gradient Boosting (XGBoost), Random Forest (RF), Light Gradient Boosting Machine (LGBM), Gradient Boosting Decision Tree (GBDT), Adaptive Boosting (AdaBoost), Support Vector Machine (SVM), and logistic regression (LR), were employed to predict large artery atherosclerosis (LAA), small vessel occlusion (SVO), and cardioembolism (CE) using an 80:20 randomly split training and test set. We designed a sequential forward selection procedure based on XGBoost (SFS-XGB) with 10-fold cross-validation for feature selection. The primary evaluation metrics for the models included the area under the receiver operating characteristic curve (AUC) for discrimination and the Brier score (with calibration plots) for calibration.

Results

A total of 5,213 patients were included, comprising 2,471 (47.4%) with LAA, 2,153 (41.3%) with SVO, and 589 (11.3%) with CE. In both the LAA and SVO models, the AUC values of the ML models were significantly higher than those of the LR model ( P  < 0.001). The optimal model for predicting SVO (AUC [RF model] = 0.932) outperformed the optimal LAA model (AUC [NGB model] = 0.917) and the optimal CE model (AUC [LGBM model] = 0.846). Each model displayed relatively satisfactory calibration. Further analysis showed that the optimal CE model could identify potential CE patients in the undetermined etiology (SUE) group, accounting for 1,900 out of 4,156 (45.7%).

Conclusions

The ML algorithm effectively classified patients with LAA, SVO, and CE, demonstrating superior classification performance compared to the LR model. The optimal ML model can identify potential CE patients among SUE patients. These newly identified predictive factors may complement the existing etiological classification system, enabling clinicians to promptly categorize stroke patients’ etiology and initiate optimal strategies for secondary prevention.


Introduction

Stroke is the second leading cause of global mortality and the primary contributor to both morbidity and disability in China. Acute ischemic stroke (AIS) represents a prevalent form of stroke [ 1 , 2 , 3 ]. Different subtypes of AIS have varying prognostic trajectories, recurrence patterns, and strategies for secondary prevention. Accurate identification of AIS subtypes is pivotal for developing effective secondary prevention strategies and alleviating the burden associated with AIS.

The most widely accepted AIS subtyping system is the Trial of ORG 10,172 in Acute Stroke Treatment (TOAST) classification scheme [ 4 ]. However, the initial assessment of AIS is often time-consuming and uncertain, requiring expert reviewers to thoroughly interpret clinical indicators, conduct laboratory tests, and analyze electrocardiography and imaging results [ 5 , 6 ]. This process is highly dependent on the expertise and experience of the doctors [ 7 , 8 ]. Despite rigorous training, physicians frequently encounter challenges in identifying AIS subtypes. Reports also indicate that primary physicians achieve low accuracy in assessing AIS etiology, because accurate assessment requires extensive accumulated experience [ 5 , 6 , 9 ]. Therefore, it is crucial to develop rapid etiological classification prediction models that can accurately identify AIS subtypes during the acute stage after admission.

Recent advancements in machine learning (ML) applications across various healthcare domains have sparked innovations in developing novel ML-based etiological classification technologies [ 10 , 11 ]. The non-parametric nature of ML and its ability to capture non-linear relationships make it well-suited for identifying AIS subtypes, given the complex and non-normal nature of most medical data. Studies have demonstrated the potential of ML in this field. For instance, one study used ML algorithms to automatically identify and quantify carotid artery plaques in MRI scans, achieving 91.41% accuracy in LAA classification using Random Forest (RF) [ 12 ]. Wang et al. developed a predictive model for patients with large vessel occlusion using an RF model, achieving an area under the receiver operating characteristic curve (AUC) of 0.831 [ 13 ]. This study concluded that ML outperformed logistic regression (LR) in identifying patients with large vessel occlusion. Sun et al. employed various ML algorithms to develop an etiological prediction model for large artery atherosclerosis (LAA) using 62 features [ 14 ]. However, these studies have limitations such as small sample sizes, single-center retrospective designs, and poor interpretability.

To address these limitations, it is imperative to use large prospective cohort data and advanced ML algorithms to develop more accurate etiological prediction models with fewer predictive variables. This study aims to develop predictive models for LAA, small vessel occlusion (SVO), and cardioembolism (CE) using interpretable ML algorithms based on high-quality, prospective cohort studies, to provide explanations for predictive factors and complement existing etiological classifications.

Study design and participants

We extracted data from the Third China National Stroke Registry (CNSR-III), a large-scale nationwide prospective registry of acute ischemic cerebrovascular events in China. The study design and patient identification details for CNSR-III have been reported previously [ 2 ]. Imaging data were collected in the Digital Imaging and Communications in Medicine (DICOM) format on discs and interpreted by trained professional physicians. Stroke subtypes were classified into five major categories according to TOAST classification: LAA, SVO, CE, other determined etiology (SOE), and undetermined etiology (SUE) [ 4 ]. Additionally, our data included the Causative Classification System (CCS), which integrates etiological and phenotypic classifications [ 15 ].

A total of 44 biomarkers identified in this study were measured from the collected plasma and serum samples. We excluded 4,190 patients without baseline plasma, serum, or imaging data and also excluded patients with SOE from the analysis. Ultimately, we enrolled 5,213 patients (including LAA, SVO, and CE) for the main study and included 4,156 SUE patients for subsequent analysis (Fig.  1 ).

Figure 1. Study flowchart

Data information

Based on published literature and pathophysiological considerations, the candidate variables included in our study comprised demographic characteristics, medical history, family history, and imaging and laboratory data. Detailed information can be found in the supplementary materials (Table S1 ).

ML algorithms

In this study, we developed and compared eight ML predictive models to assess their performance against the LR model. The models included Natural Gradient Boosting (NGBoost [NGB]) [ 16 ], Light Gradient Boosting Machine (LGBM) [ 17 ], Categorical Boosting (CatBoost [CAT]) [ 18 ], Extreme Gradient Boosting (XGBoost [XGB]) [ 19 ], Gradient Boosting Decision Tree (GBDT) [ 20 , 21 ], Random Forest (RF) [ 22 , 23 ], Adaptive Boosting (AdaBoost [Ada]) [ 24 , 25 ], and Support Vector Machine (SVM) [ 26 ]. The dataset was randomly divided into a training set (80%) and a testing set (20%). We used 10-fold cross-validation for parameter optimization in the training set. Details of the parameters for different algorithms can be found in the supplementary materials (Table S6 ).
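The split-and-compare protocol described above can be sketched with scikit-learn alone. This illustrative snippet uses synthetic data, not the CNSR-III cohort, and includes only the estimators that ship with scikit-learn; NGBoost, LightGBM, CatBoost, and XGBoost live in their own packages (ngboost, lightgbm, catboost, xgboost) but would be added to the `models` dictionary the same way. The 10-fold cross-validated hyperparameter search is omitted for brevity.

```python
# Sketch of the 80:20 train/test protocol: fit each candidate model on the
# training split and compare test-set AUCs against the LR baseline.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the clinical dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
# 80:20 random split, as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
    "Ada": AdaBoostClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
}
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(aucs)
```

In practice each entry in `models` would first be tuned with 10-fold cross-validation on the training split (e.g. via `GridSearchCV`) before the test-set comparison.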

NGB: NGB is a novel algorithm for regression prediction tasks. It extends conventional gradient boosting algorithms by incorporating natural gradients to optimize model parameters, enhancing its ability to adapt to the probability distribution characteristics of the data. The primary objective of NGB is to directly model the predictive distribution, moving beyond mere predictions of expected values [ 16 ].

LGBM: LGBM is a robust gradient-boosting framework known for its computational efficiency. Compared to traditional gradient-boosting decision tree algorithms, LGBM offers faster training speeds and lower memory consumption [ 17 ].

CAT: CAT is an innovative ordered gradient boosting algorithm that utilizes ordered target-based statistics to handle categorical features and employs permutation strategies to prevent prediction shifts [ 18 ].

XGB: XGB is a robust ML algorithm used for classification and regression problems. It enhances gradient-boosting trees by combining multiple decision trees to improve predictive capabilities [ 19 ].

GBDT: GBDT employs the gradient descent method to reduce error. This model can automatically capture interactions between features without the need for manually specifying interaction terms and is relatively robust to outliers and noisy data [ 20 ].

RF: RF is an ensemble supervised learning method consisting of multiple decision trees, each trained on different subsets of the data. The results from each tree are averaged, which reduces variance and improves predictive performance [ 22 ].

Ada: Ada is an iterative ensemble learning method. Its core idea is to combine multiple weak learners, typically weak classifiers like decision trees or Naive Bayes, to create a strong learner [ 24 ].

SVM: SVM is a robust algorithm used for classification tasks. It finds the optimal hyperplane that maximizes the margin between classes, ensuring effective separation of data points. SVM is particularly effective in high-dimensional spaces and can handle non-linear classification using various kernel functions [ 26 ].

Feature selection

We employed our custom-designed Sequential Forward Selection with XGB (SFS-XGB), utilizing 10-fold cross-validation to maximize performance. Within the training set, we implemented 10-fold cross-validation with SFS, varying the parameter k from 3 to 10. The optimal feature set was evaluated based on AUC values. From the SFS-XGB results, we identified the top 10 variables as candidates. Our objective was to pinpoint the optimal feature set with the highest AUC values while minimizing the number of variables. This approach was applied separately to identify the best predictive feature subsets for LAA, SVO, and CE. Notably, to ensure specificity in CE models—especially concerning conditions like atrial fibrillation (AF)—we excluded medical histories of AF and heart valve disease (HVD) during feature selection for CE models.
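The SFS-XGB idea, forward feature selection scored by cross-validated AUC, maps directly onto scikit-learn's `SequentialFeatureSelector`. In this sketch, `GradientBoostingClassifier` stands in for XGBoost (an `XGBClassifier` from the xgboost package would drop in unchanged), and the fold count and feature budget are reduced to keep the demo fast; the study swept k from 3 to 10 with 10-fold cross-validation.

```python
# Forward sequential feature selection with cross-validated AUC as the
# selection criterion, the core of the SFS-XGB procedure described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SequentialFeatureSelector

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
est = GradientBoostingClassifier(n_estimators=25, random_state=0)
sfs = SequentialFeatureSelector(est, n_features_to_select=3,
                                direction="forward", scoring="roc_auc", cv=5)
sfs.fit(X, y)
selected = sfs.get_support(indices=True)
print("selected feature indices:", selected)
```

Repeating this for each k in the sweep and keeping the smallest feature set with the highest cross-validated AUC reproduces the selection logic applied separately to the LAA, SVO, and CE targets.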

Data preprocessing

For data preprocessing, we employed multiple imputations to complete missing values in continuous variables from laboratory data, while categorical variables were imputed using the mode. The distribution of laboratory data was evaluated both before and after imputation, with detailed statistics provided in Table S2 .

To standardize laboratory data, we utilized MinMaxScaler, which linearly transformed the features, scaling them to fit within the [0, 1] range.

Our dataset was imbalanced, particularly for classifying CE against the other categories (LAA and SVO), where the sample ratio was 589:4,624 (roughly 1:7.85). We therefore applied random undersampling to the training set using the imblearn library, with the sampling strategy set to achieve a minority-to-majority ratio of 0.5.
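The preprocessing steps above, mode imputation for categorical variables, [0, 1] scaling of laboratory values, and undersampling of the majority class to a 0.5 minority:majority ratio, can be sketched as follows. The study used `imblearn.under_sampling.RandomUnderSampler(sampling_strategy=0.5)`; the undersampling is re-implemented with NumPy here to keep the sketch dependency-light, and the multiple-imputation step for continuous variables is omitted.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Mode imputation of a categorical column (np.nan marks missing values).
cat = np.array([[0.0], [1.0], [1.0], [np.nan], [1.0]])
cat = SimpleImputer(strategy="most_frequent").fit_transform(cat)

# Min-max scaling of continuous laboratory values into [0, 1].
lab = rng.normal(size=(100, 3))
lab = MinMaxScaler().fit_transform(lab)

# Random undersampling: keep every minority sample and draw majority
# samples without replacement until minority/majority = 0.5.
y = np.array([1] * 50 + [0] * 400)          # 1 = minority class (e.g. CE)
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
keep_major = rng.choice(majority, size=len(minority) * 2, replace=False)
kept = np.concatenate([minority, keep_major])
print(len(minority), len(keep_major))       # 50 minority, 100 majority kept
```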

Subsequent analysis

Our subsequent analysis comprised two parts: (1) We used our best models for LAA, SVO, and CE to identify potential patients within the SUE group. Using an 80% probability cutoff, patients below this threshold were categorized as UND, while those above it were selected; the etiology with the highest predicted probability was taken as the final prediction. (2) We extended the analysis by applying the LAA, SVO, and CE models to the CCS subtypes. Based on their AUC performance, we assigned weights ranging from 0 to 1 to the top four models in each etiology prediction category. ML scores were calculated by multiplying these weights by a total score of 12, as detailed in Table S10 ; where a model appeared in more than one category, its scores were summed. By combining the predictors from the LAA, SVO, and CE models, we created a comprehensive set of predictors.
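The SUE re-classification rule from part (1) can be sketched in a few lines: run the three optimal models, take the highest predicted probability per patient, and assign that etiology only if it exceeds the 0.8 cutoff, otherwise leave the patient undetermined (UND). The probabilities below are illustrative, not study outputs.

```python
import numpy as np

labels = np.array(["LAA", "SVO", "CE"])
# One row per SUE patient: P(LAA), P(SVO), P(CE) from the three models.
probs = np.array([
    [0.10, 0.05, 0.92],   # confident CE
    [0.85, 0.40, 0.30],   # confident LAA
    [0.55, 0.60, 0.50],   # nothing above 0.8 -> stays UND
])
best = probs.argmax(axis=1)
assigned = np.where(probs.max(axis=1) >= 0.8, labels[best], "UND")
print(assigned.tolist())  # ['CE', 'LAA', 'UND']
```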

Definitions of metrics

We evaluated our models’ performance using discrimination and calibration as primary measures. Discrimination, measured by the AUC, indicates the model’s ability to distinguish, with higher values indicating better performance. Calibration was assessed using the Brier score, which ranges from 0 to 1, with a lower score indicating better calibration [ 27 ]. Calibration plots were also utilized for visual assessment.

Additional metrics used in this study included accuracy, sensitivity, specificity, Youden’s index, and the F1-score. These metrics are defined in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), which describe correct and incorrect predictions of etiology type:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \]

\[ \text{Youden's index} = \text{Sensitivity} + \text{Specificity} - 1, \qquad \text{F1} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}. \]
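The confusion-count metrics named above reduce to a few lines of arithmetic; the functions below are a plain-Python spot check, with illustrative counts at the bottom.

```python
# Standard classification metrics computed from confusion-matrix counts.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):          # recall / true-positive rate
    return tp / (tp + fn)

def specificity(tn, fp):          # true-negative rate
    return tn / (tn + fp)

def youden_index(tp, tn, fp, fn):
    return sensitivity(tp, fn) + specificity(tn, fp) - 1

def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

tp, tn, fp, fn = 80, 90, 10, 20   # illustrative counts
print(accuracy(tp, tn, fp, fn))   # 0.85
print(sensitivity(tp, fn))        # 0.8
print(f1_score(tp, fp, fn))       # 160/190 ~ 0.842
```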

Statistical analysis

Baseline characteristics were presented using means and standard deviations or medians and interquartile ranges for continuous variables, and frequencies and percentages for categorical variables. The chi-square test or Fisher’s exact test was used to compare baseline characteristics among categorical variables, while analysis of variance (ANOVA) or the Kruskal-Wallis test was employed for continuous variables. Differences in AUC values among various models were assessed using the DeLong test [ 28 ], and model interpretations were facilitated using SHapley Additive exPlanations (SHAP) [ 29 ]. The suitability of ML research was evaluated based on TRIPOD and PROBAST guidelines [ 30 , 31 ]. Data analysis was performed using SAS software (version 9.4) and Python (version 3.9.7). All comparisons were two-sided, with statistical significance defined as P  < 0.05.

Baseline characteristics

From an initial cohort of 15,166 patients with AIS or transient ischaemic attack (TIA), 4,190 patients without serum, plasma, or imaging data were excluded, leaving 9,485 patients for analysis (Fig.  1 ). Among these, 5,213 patients diagnosed with LAA, SVO, and CE were included. The distribution among these groups was as follows: 47.4% ( n  = 2,471) were LAA, 41.3% ( n  = 2,153) were SVO, and 11.3% ( n  = 589) were CE. The average age was 62.9 ± 11.1 years, with 30.0% ( n  = 1,563) females. The median (IQR) admission NIHSS score was 3.0 (2.0–6.0). The most prevalent medical history was hypertension (64.8%, n  = 3,380), followed by diabetes (24.4%, n  = 1,273) and prior stroke (23.4%, n  = 1,218). More than half of the patients presented with a single infarction (51.7%, n  = 2,697) or anterior circulation infarction (56.6%, n  = 2,951). Demographic details for the LAA, SVO, and CE groups can be found in Table S1 .

Feature selection of LAA, SVO, and CE models

The dataset was randomly divided into training and testing sets at an 80:20 ratio. The training set initially comprised 70 variables from Table S1 . As shown in Table S2 , there was no statistically significant difference between the data before and after multiple imputation of missing values ( P  > 0.1).

Feature selection was exclusively conducted in the training set, and the process is detailed in Fig.  2 and Tables S3 - S5 . For the LAA models, seven features were selected: number of acute infarctions, history of AF, blood glucose level, age, longitude, admission NIHSS score, and total cholesterol (CHOL). The SVO models utilized ten variables: age, number of acute infarctions, history of AF, infarction circulation, admission NIHSS score, high-sensitivity C-reactive protein (hs-CRP), absolute lymphocyte count (LYM), low-density lipoprotein cholesterol (LDL-C), smoking history, and history of diabetes. In the CE models, AF and history of HVD demonstrated strong discriminatory power (Figure S1 ). Optimal performance for the CE models was achieved with six features: history of heart disease (HD), age, history of coronary heart disease (CHD), direct bilirubin (DBIL), adiponectin, and international normalized ratio (INR).

Figure 2. Feature selection plots of the LAA, SVO and CE models (in the training set). A, the feature selection diagram of the LAA etiology classification prediction models; B, the feature selection diagram of the SVO etiology classification prediction models; C, the feature selection diagram of the CE etiology classification prediction models

Model construction and evaluation

In the training set, we optimized parameters for constructing the nine prediction models, detailed in Table S6 . Table  1 presents the performance metrics for each model evaluated on the test set. Among the nine models, the ML models outperformed the LR model. Specifically, the NGB model excelled in predicting LAA, the RF model performed best in predicting SVO, and the LGBM model demonstrated superior efficacy in predicting CE. The AUC values of the ML models for LAA and SVO predictions were significantly better than those of the LR model (DeLong’s test, P  < 0.001). For CE predictions, the AUC performance of the LGBM, XGB, GBDT, CAT, and NGB models surpassed that of the LR model, although the difference did not reach statistical significance (DeLong’s test, P  > 0.05). The ROC curves presented in Fig.  3 illustrate the performance of the prediction models for LAA, SVO, and CE, respectively. Among them, the SVO models exhibited superior performance, followed by the LAA models and then the CE models. All these predictive models demonstrated excellent calibration, as evidenced by the calibration curves shown in Fig.  4 and Figures S8 - S10 .

Figure 3. ROC curve plots of the LAA, SVO and CE models (in the test set). A, the ROC curves of the LAA etiology classification prediction models; B, the ROC curves of the SVO etiology classification prediction models; C, the ROC curves of the CE etiology classification prediction models

Figure 4. Calibration curves for the top four models in each etiology classification model (in the test set). A, the calibration curves of the LAA etiology classification prediction models; B, the calibration curves of the SVO etiology classification prediction models

Visualization of feature importance

SHAP was employed to illustrate our LAA, SVO, and CE models (Fig.  5 ). This plot visualized the relationship between feature values and SHAP values in the test set, highlighting higher SHAP values as indicators of greater influence on classification under each etiology. Dependence plots (Figure S2 -S4) further elucidated the impact of the single feature on the output of the etiological classification models.

As shown in Fig.  5 , the contributions of each predictor variable to LAA, SVO, and CE models were highlighted. The number of infarctions and history of AF were identified as the most significant variables in LAA and SVO models. Admission NIHSS score and age were found to be common predictors for both models. Optimal variable combinations for classifying LAA, SVO, and CE patients were illustrated in partial dependence plots (Figure S2 - S4 ).

Figure 5. SHAP summary plots of the LAA, SVO and CE models (in the test set). A, the SHAP summary plot of the LAA etiology classification prediction models; B, the SHAP summary plot of the CE etiology classification prediction models; C, the SHAP summary plot of the SVO etiology classification prediction models

Additional analysis for SUE: A total of 4156 SUE patients were included (Fig.  1 ), with a mean age of 61.9 ± 11.4 years and 1369 females (32.9%). The distribution of different genders in SUE was presented in Table S8 . The established optimal LAA (NGB), SVO (RF), and CE (LGBM) models were utilized to identify potential LAA, SVO, or CE patients in SUE. Results indicated that 1900 (45.7%) potential CE patients could be identified in SUE, with 2256 SUE patients (UND) having a predicted probability below 80%. Detailed results can be found in Table S7 and Figure S5 . We compared 1900 potential CE and UND within SUE and observed statistically significant differences ( P  < 0.05) in several heart-related variables between these two groups (Table S9 ).

Extended analysis for CCS: Out of 4,642 patients in the CCS classification, there were 2,163 LAA patients, 2,020 SVO patients, and 459 CE patients. In the test set, there were 433 LAA patients, 404 SVO patients, and 92 CE patients. For CCS analysis, predictors such as age, smoking, admission NIHSS score, longitude, diabetes, AF, HD, CHD, infarction circulation, number of acute infarctions, blood glucose, INR, CHOL, LDL-C, hs-CRP, DBIL, Adiponectin, and LYM were used. Models were scored as follows: RF (18), LGBM (16.8), NGB (15.6), GBDT (10.2), CAT (4.92), and XGB (1.2) (Table S10 ). Subsequent analysis combined the top 2 models (RF, LGBM) with these 18 variables. In the test set, the RF model accurately predicted 392 LAA (90.5%), 72 CE (78.3%), and 404 SVO (100.0%). LGBM correctly predicted 393 LAA (90.8%), 72 CE (78.3%), and 404 SVO (100.0%). The highest accuracy in predicting SVO was demonstrated by ML models, followed by LAA and CE (Figure S6 - S7 ).

To our knowledge, this study was the first application of ML algorithms for classifying AIS etiology within a prospective high-quality Chinese AIS cohort. Notably, this study also marks the initial utilization of the NGB algorithm in this specific field. We comprehensively integrated clinical, imaging, and laboratory data to accurately classify AIS subtypes. The ML algorithms successfully constructed predictive models for LAA, SVO, and CE, demonstrating robustness consistent with findings in other fields [ 10 , 11 , 13 , 32 , 33 ]. Among the developed models, the SVO model showed superior performance, followed by the LAA and CE models. It is worth noting that the top-performing models for etiological classification were LAA-NGB, SVO-RF, and CE-LGBM. Our CE-LGBM model successfully identified 1,900 (45.7%) potential CE patients in SUE. Our study revealed the following clinical findings:

Individuals aged 57–70 with multiple infarcts, high NIHSS scores (> 8), no history of AF, elevated blood glucose (> 6 mmol/L), and high CHOL levels (≥ 5 mmol/L) in regions like Northeast China, North China, and East China were more likely to develop LAA than CE or SVO.

Those without a history of AF, under 61 years old, with low NIHSS scores (< 7), a single infarct in one circulation (anterior or posterior), and smokers (maintaining low lymphocyte, hs-CRP, and LDL-C levels) were more likely to develop SVO instead of LAA or CE.

Besides strong CE indicators like AF and HVD, older age (> 69 years), history of HD, impaired coagulation (INR > 1.15), no CHD history, and elevated DBIL (> 5µmol/L) and adiponectin (> 2.5 mg/ml) levels indicated a higher likelihood of developing CE rather than LAA or SVO.

LAA significantly contributes to disability and mortality in China [ 34 ], and is commonly associated with risk factors such as high cholesterol, hypertension, smoking, diabetes, and older age [ 35 ]. SVO results from occlusion or narrowing of small blood vessels, limiting blood supply to the brain. It typically presents with mild symptoms such as dizziness and limb numbness. Compared to LAA and CE, SVO has a better prognosis due to its limited impact on smaller brain areas [ 36 ]. CE is characterized by heart-related clots that travel to the brain, causing cerebral vascular embolism and subsequent ischemic injury. It is associated with conditions such as AF, rheumatic heart disease, and heart valve disease. CE has a less favorable prognosis and a higher disability rate than other stroke types. Accurate identification of CE is crucial for personalized treatment, which often involves anticoagulant medications. This study attempted to identify additional factors that distinguish CE from the other subtypes, utilizing interpretable ML models to aid clinical decision-making. This is one of the reasons this study constructed separate etiological prediction models for LAA, SVO, and CE.

Age, NIHSS score, AF, and CHD history were common factors influencing stroke classification [ 4 ]. It is worth noting that a fasting blood glucose level below 5.6 mmol/L is considered normal, but our study suggests that LAA risk increases with blood glucose levels exceeding 6 mmol/L. Elevated blood glucose levels may contribute to LAA stroke risk through several mechanisms: (1) they promote atherosclerosis by damaging arterial endothelial cells, inducing inflammation, and encouraging cholesterol and lipid accumulation in arterial walls, leading to plaque formation and narrowing of arteries; (2) hyperglycemia increases the risk of platelet aggregation and coagulation, promoting thrombus formation and embolism in narrowed arteries, contributing to LAA; (3) high blood glucose levels affect arterial wall elasticity, inflammation, and oxidative stress, potentially damaging arterial endothelial cells, accelerating atherosclerosis, and increasing LAA risk [ 37 , 38 , 39 ]. The NIHSS score provides valuable insights into the clinical symptoms and neurological conditions of LAA and SVO, but it does not specifically indicate the etiological subtype; combining the NIHSS score with other clinical and imaging data is therefore crucial for a comprehensive evaluation of LAA. Individuals born in Northeast China, North China, and East China showed higher susceptibility to LAA compared to the other subtypes. Various factors, such as dietary habits, natural environmental factors, and genetic influences, may contribute to this heightened susceptibility. For instance, the prevalence of high-fat and high-salt diets in Northeast China could elevate the risk of hypertension and hyperlipidemia, potentially leading to arterial wall damage and lipid deposition and thereby increasing the likelihood of atherosclerosis [ 40 , 41 , 42 , 43 ]. However, further research and validation are needed to fully understand the specific impact of different regions on stroke etiology classification.

To our knowledge, our study was the first to use ML to identify adiponectin and DBIL as potential novel biomarkers associated with CE. Adiponectin is a peptide secreted by adipose tissue with anti-inflammatory and anti-atherosclerotic effects [ 44 ]. Previous studies have linked it to various processes and conditions, including energy metabolism [ 45 ], immune response, chronic inflammatory conditions [ 46 ], and atherosclerosis [ 47 , 48 ]. Low adiponectin levels may make individuals more susceptible to LAA rather than CE. Additionally, elevated adiponectin levels could serve as a biomarker for CE, indicating underlying biological mechanisms that warrant further investigation. Elevated DBIL levels are indicative of liver and biliary system disorders, and have been associated with increased stroke severity and poorer prognosis [ 49 ]. However, the role of DBIL as a stroke risk factor or prognostic indicator remains uncertain due to potential confounding factors [ 50 ]; further research is needed to establish a causal relationship between DBIL and CE. An elevated INR indicates prolonged coagulation time, likely reflecting the frequent use of anticoagulant medications in individuals with CE, which explains this distinctive characteristic.

Our feature selection utilized SFS-XGB with 10-fold cross-validation, optimizing predictor selection based on AUC performance. This method effectively removed irrelevant features, reduced dimensionality, and enhanced model accuracy. Addressing the imbalance between CE and other stroke subtypes (LAA and SVO) through random undersampling ensured reliable model training, mitigating bias towards the majority class and improving prediction reliability. The existing reports and guidelines provide support for AF and HVD as high-risk cardiogenic sources of CE [ 4 , 51 , 52 , 53 ], which aligns with our results (Table S1 and Fig.  1 ). To uncover other variables linked to CE etiology classification, a strategy was adopted that excluded atrial fibrillation and heart valve disease.

Among our LAA prediction models, NGB demonstrated the highest performance (AUC = 0.917), closely followed by RF (AUC = 0.916). Importantly, all ML models significantly outperformed the LR model ( P  < 0.001). Notably, the NGB model exhibited superior calibration compared to RF (Brier score: NGB = 0.096, RF = 0.112), further supporting its predictive accuracy. To our knowledge, this study also represents the first attempt to use the NGB model for predicting stroke etiological classification. NGB, developed by the ML team at Stanford in 2019, is a boosting algorithm designed to provide probabilistic forecasts through a full probability distribution rather than point predictions [ 16 ]. Briefly, the formalism is as follows. In many ML models, we optimize the parameters \(\theta\) to minimize a loss \(L(y, f(x; \theta))\), where \(y\) is the target and \(f(x; \theta)\) the prediction; gradient boosting improves the model iteratively by fitting new base learners to the negative gradient of the loss with respect to the current prediction. Instead of a point prediction, consider a distribution \(P_{\theta}(y \mid x)\) parameterized by \(\theta\). The objective is to minimize the expected value of a scoring rule \(S(y, P_{\theta}(y \mid x))\); a common choice for \(S\) is the negative log-likelihood, \(S(y, P_{\theta}(y \mid x)) = -\log P_{\theta}(y \mid x)\). NGB generalizes gradient boosting to parameterized probability distributions [ 16 ]: updates to \(\theta\) use the natural gradient rather than the ordinary gradient. Given the scoring rule \(S\), the steepest descent direction (the natural gradient) is

\[ \tilde{\nabla} S\left(y, P_{\theta}(y \mid x)\right) \propto I(\theta)^{-1} \, \nabla_{\theta} S\left(y, P_{\theta}(y \mid x)\right), \]

where \(I(\theta)\) is the Fisher information matrix.
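As a concrete numeric illustration (not part of the study's code), consider the natural-gradient step for a Normal predictive distribution \(N(\mu, \sigma^2)\) parameterized by \(\theta = (\mu, \log\sigma)\) and scored by the negative log-likelihood. For this parameterization the Fisher information is \(\operatorname{diag}(1/\sigma^2, 2)\), so the natural gradient is simply the ordinary gradient preconditioned by its inverse.

```python
import numpy as np

def nll_grad(y, mu, log_sigma):
    """Ordinary gradient of -log N(y | mu, sigma^2) w.r.t. (mu, log sigma)."""
    sigma2 = np.exp(2 * log_sigma)
    d_mu = -(y - mu) / sigma2
    d_log_sigma = 1 - (y - mu) ** 2 / sigma2
    return np.array([d_mu, d_log_sigma])

def natural_grad(y, mu, log_sigma):
    """Fisher-preconditioned gradient: I(theta)^{-1} @ grad."""
    fisher = np.diag([1 / np.exp(2 * log_sigma), 2.0])
    return np.linalg.solve(fisher, nll_grad(y, mu, log_sigma))

y, mu, log_sigma = 2.0, 0.0, 0.0          # sigma = 1
g = nll_grad(y, mu, log_sigma)            # ordinary gradient: [-2.0, -3.0]
ng = natural_grad(y, mu, log_sigma)       # natural gradient:  [-2.0, -1.5]
print(g, ng)
```

Note how the natural gradient rescales each coordinate according to the curvature of the distribution space rather than the raw parameter space, which is what makes the boosting updates invariant to the choice of parameterization.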

NGB overcomes the challenge of producing probabilistic predictions with gradient boosting, shows high accuracy on structured and tabular data, and excelled in LAA etiology prediction here. Despite these advantages, NGB does not guarantee superior performance in all scenarios; the RF model's classification performance was only slightly lower (a difference of 0.001). The RF model performed best in predicting SVO (AUC = 0.932), closely followed by LGBM (AUC = 0.930), and all ML models significantly outperformed the LR model ( P  < 0.001). RF's robustness in the LAA and SVO models stems from constructing multiple decision trees based on Gini impurity or information gain, using random sampling to prevent overfitting, and exhibiting good noise resistance and fast training speed.

LGBM showed the best predictive performance for the CE model (AUC = 0.846). Known for its speed and suitability for large-scale datasets, LGBM employs efficient strategies such as leaf-wise tree growth and histogram-based training. Future improvements could involve integrating heart-related examination variables to enhance CE differentiation. Although our study did not fully exploit LGBM's potential because of feature-selection constraints, it holds promise for improved performance on larger datasets. Additionally, excluding correlated factors such as AF and HVD from the initial CE model might have reduced its overall performance compared with the SVO and LAA models. Diagnosing CE is complex, requiring thorough identification of cardiogenic sources and consideration of high-risk factors. Subsequent research could explore LGBM further for enhanced results.
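The histogram-based training mentioned above can be sketched in a few lines: bin a continuous feature into a small number of histogram bins and search for the best split only at bin boundaries rather than at every sample value. This toy illustration (for squared loss, with hypothetical data; not LGBM's actual code) captures why the strategy is fast:

```python
import numpy as np

def best_histogram_split(x, y, n_bins=16):
    """Toy histogram-based split search: bin the feature, then scan only
    the interior bin edges for the split that minimises the summed
    within-child variance (squared-loss impurity)."""
    edges = np.histogram_bin_edges(x, bins=n_bins)
    best_thr, best_score = None, np.inf
    for thr in edges[1:-1]:  # candidate splits = interior bin edges only
        left, right = y[x <= thr], y[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        score = left.var() * len(left) + right.var() * len(right)
        if score < best_score:
            best_thr, best_score = thr, score
    return best_thr

x = np.linspace(0.0, 1.0, 200)
y = (x > 0.5).astype(float)      # a clean step in the target at x = 0.5
thr = best_histogram_split(x, y)  # recovers a threshold at 0.5
```

Scanning at most `n_bins` candidates per feature, instead of one candidate per sample, is what makes histogram-based boosting scale to large datasets.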

Our analysis identified 45.7% (1900) of SUE patients as potential CE patients (Table S7), who showed significant differences in heart disease-related variables compared with the remaining UND group (P < 0.05, Table S9). Precise ML models effectively identified LAA, SVO, and CE patients among SUE, easing diagnostic challenges and improving treatment accuracy. We ranked RF, LGBM, NGB, GBDT, CAT, and XGB as the top performers. RF and LGBM, particularly in SVO predictions, demonstrated high accuracy within the CCS system. This aligns with our initial models' performance, indicating their robustness. Regarding the choice among the three leading algorithms (NGB, RF, and LGBM), we offer the following recommendations:

Each algorithm comes with its own set of hyperparameters, and proper tuning is crucial for optimal performance: an improperly tuned NGB may underperform a well-tuned LGBM or RF.

RF combines predictions from multiple decision trees to produce a more robust and accurate outcome.

NGB focuses on probabilistic predictions and uses natural gradients; if the problem does not require probabilistic forecasting, this added complexity may not be beneficial.

LGBM is efficient and scalable, especially for large datasets; when dealing with substantial data, LGBM is advisable.

The choice of algorithm should be based on the nature of the data, the specific problem context, and thorough experimentation and validation.
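A minimal sketch of the tuning loop implied by these recommendations; the parameter grid and the scoring function below are hypothetical stand-ins (a real pipeline would fit each candidate model and score it by cross-validated AUC), not values from this study:

```python
import itertools

# Hypothetical grid: learning rate and tree complexity.
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "num_leaves": [15, 31, 63],
}

def cv_score(params):
    """Stand-in for cross-validated AUC; in practice this would fit the
    model with `params` and evaluate it on held-out folds."""
    return (0.9
            - (params["learning_rate"] - 0.05) ** 2
            - ((params["num_leaves"] - 31) / 100) ** 2)

# Exhaustive grid search: evaluate every combination, keep the best.
best_params, best_score = None, float("-inf")
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    score = cv_score(params)
    if score > best_score:
        best_params, best_score = params, score
```

For larger grids, randomized or Bayesian search over the same loop structure is usually preferable to exhaustive enumeration.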

Despite data normalization efforts, the SVM model underperformed the other ML models in predicting LAA, SVO, and CE. SVM is best suited to settings with limited samples and numerous features, which hindered its effectiveness given our larger sample size. Additionally, the difficulty of finding a suitable kernel function and SVM's reliance on boundary (support) data points resulted in lower AUC performance, highlighting its limitations relative to the other models.
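To illustrate the kernel-selection issue noted above, the RBF (Gaussian) kernel, a common default choice, maps each pair of samples to a similarity in (0, 1] controlled by a bandwidth parameter gamma; a small numpy sketch (illustrative only, with made-up points):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Pairwise RBF kernel matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 3.0]])
K = rbf_kernel(X, gamma=0.5)  # symmetric, with ones on the diagonal
```

Because both the kernel family and gamma must be chosen well for the decision boundary to fit the data, kernel selection adds a tuning burden that tree ensembles largely avoid.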

Our ML models demonstrated exceptional predictive efficiency while maintaining precision. Given the simplicity and accessibility of the variables used in this study’s prediction model, these three etiological prediction models can be easily integrated into web pages or clinical decision support systems (CDSS) for practical application. This will enable clinicians to efficiently classify patients’ etiological factors. By utilizing ML algorithms to identify new variables, we have filled gaps in existing clinical knowledge regarding variable selection. We firmly believe that there is no universally superior method; the key lies in selecting the appropriate algorithm and variables for specific clinical challenges. Although using three etiological prediction models may seem more complex than using a single model in clinical practice, it is important to note that core predictors differ among patients with different etiological subtypes. Therefore, segregating the prediction of these subtypes could lead to improved accuracy for each subtype. Additionally, our three prediction models can also be employed to forecast potential LAA, SVO, and CE patients within the SUE population. The ability of our model to identify potential CE patients among those with SUE has significant implications in clinical practice as it addresses the challenge of delayed anticoagulation treatment due to ambiguous etiological diagnoses. In our subsequent analysis section, we found that the results of our LAA, SVO, and CE etiological prediction models were in good agreement with actual classifications. Future studies seeking to establish a singular predictive model can reference our discovered predictor variables when building their models. Both the TOAST classification and our models provide valuable insights for clinicians, facilitating precise patient assessment. 
By integrating our models with guidelines and clinical expertise, clinicians can thoroughly evaluate patients and implement optimal preventive or intervention measures that ultimately improve patient prognosis.

However, this study had several limitations. First, the predictors used to establish the CE models had limited ability to identify CE patients (AUC ≤ 0.846). Future research should incorporate additional brain imaging, ECG, and echocardiographic data to uncover more relevant variables. Second, while ML algorithms, especially RF, LGBM, and NGB, showed high accuracy and AUC performance in the SVO and LAA models, further external validation is essential. Suitable external validation data were not available in our existing databases because the predictor variables originate from clinical, imaging, and laboratory data; we plan to establish an appropriate cohort for external validation. Currently, we recommend applying the etiological prediction models to retrospective data, while prospective application remains to be evaluated. Third, this analysis used multiple imputation to handle missing values. Most missing variables accounted for less than 5%, but we could not confidently assert that variables with more than 5% missing values were missing at random, as there was no direct method to test this [ 54 ]. Nonetheless, we minimized selection bias, and the data distribution after multiple imputation did not significantly differ from the distribution before imputation.

In conclusion, our interpretable ML models, which combine clinical, imaging, and lab data, successfully classify patients with LAA, SVO, and CE, outperforming traditional LR models. Additionally, our model can identify potential CE patients within the SUE group, supplementing existing classifications. This potentially enables clinicians to promptly categorize stroke patients based on their etiologies and initiate optimal prevention and treatment strategies.

Data availability

No datasets were generated or analysed during the current study.

Roth GA, Abate D, Abate KH, Abay SM, Abbafati C, Abbasi N, et al. Global, regional, and national age-sex-specific mortality for 282 causes of death in 195 countries and territories, 1980–2017: a systematic analysis for the global burden of Disease Study 2017. Lancet. 2018;392:1736–88.

Wang Y, Jing J, Meng X, Pan Y, Wang Y, Zhao X, et al. The third China National Stroke Registry (CNSR-III) for patients with acute ischaemic stroke or transient ischaemic attack: design, rationale and baseline patient characteristics. Stroke Vasc Neurol. 2019;4:158–64.

Wang Y-J, Li Z-X, Gu H-Q, Zhai Y, Jiang Y, Zhao X-Q, et al. China Stroke Statistics 2019: a report from the National Center for Healthcare Quality Management in Neurological Diseases, China National Clinical Research Center for Neurological Diseases, the Chinese Stroke Association, National Center for Chronic and Non-communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention and Institute for Global Neuroscience and Stroke Collaborations. Stroke Vasc Neurol. 2020;5:211–39.

Adams HP, Bendixen BH, Kappelle LJ, Biller J, Love BB, Gordon DL, et al. Classification of subtype of acute ischemic stroke. Definitions for use in a multicenter clinical trial. TOAST. Trial of Org 10172 in Acute Stroke Treatment. Stroke. 1993;24:35–41.

Yang X-L, Zhu D-S, Lv H-H, Huang X-X, Han Y, Wu S, et al. Etiological classification of cerebral ischemic stroke by the TOAST, SSS-TOAST, and ASCOD systems: the impact of Observer’s experience on reliability. Neurologist. 2019;24:111–4.

Goldstein LB, Jones MR, Matchar DB, Edwards LJ, Hoff J, Chilukuri V, et al. Improving the reliability of stroke subgroup classification using the trial of ORG 10172 in Acute Stroke Treatment (TOAST) criteria. Stroke. 2001;32:1091–7.

Jauch EC, Barreto AD, Broderick JP, Char DM, Cucchiara BL, Devlin TG, et al. Biomarkers of Acute Stroke etiology (BASE) study methodology. Transl Stroke Res. 2017;8:424–8.

Hankey GJ. Secondary stroke prevention. Lancet Neurol. 2014;13:178–94.

Pandian JD, Kalkonde Y, Sebastian IA, Felix C, Urimubenshi G, Bosch J. Stroke systems of care in low-income and middle-income countries: challenges and opportunities. Lancet. 2020;396:1443–51.

Heo J, Yoon JG, Park H, Kim YD, Nam HS, Heo JH. Machine learning-based model for prediction of outcomes in Acute Stroke. Stroke. 2019;50:1263–5.

Kamel H, Navi BB, Parikh NS, Merkler AE, Okin PM, Devereux RB, et al. Machine learning prediction of stroke mechanism in Embolic strokes of undetermined source. Stroke. 2020;51:e203–10.

Latha S, Muthu P, Lai KW, Khalil A, Dhanalakshmi S. Performance Analysis of Machine Learning and Deep Learning architectures on early stroke detection using carotid artery ultrasound images. Front Aging Neurosci. 2022;13:828214.

Wang J, Zhang J, Gong X, Zhang W, Zhou Y, Lou M. Prediction of large vessel occlusion for ischaemic stroke by using the machine learning model random forests. Stroke Vasc Neurol. 2022;7:e001096.

Sun T-H, Wang C-C, Wu Y-L, Hsu K-C, Lee T-H. Machine learning approaches for biomarker discovery to predict large-artery atherosclerosis. Sci Rep. 2023;13:15139.

Ay H, Benner T, Murat Arsava E, Furie KL, Singhal AB, Jensen MB, et al. A computerized algorithm for etiologic classification of ischemic stroke: the causative classification of Stroke System. Stroke. 2007;38:2979–84.

Duan T, Anand A, Ding DY, Thai KK, Basu S, Ng A et al. Ngboost: Natural gradient boosting for probabilistic prediction. In: International conference on machine learning. PMLR; 2020. pp. 2690–700.

Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W et al. Lightgbm: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30.

Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. arXiv preprint; 2018.

Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. pp. 785–94.

Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29:1189–232.

Peng T, Chen X, Wan M, Jin L, Wang X, Du X, et al. The prediction of Hepatitis E through Ensemble Learning. IJERPH. 2020;18:159.

Breiman L. Random forests. Mach Learn. 2001;45:5–32.

Wang C, Chen X, Du L, Zhan Q, Yang T, Fang Z. Comparison of machine learning algorithms for the identification of acute exacerbations in chronic obstructive pulmonary disease. Comput Methods Programs Biomed. 2020;188:105267.

Hastie T, Rosset S, Zhu J, Zou H. Multi-class AdaBoost. Stat Its Interface. 2009;2:349–60.

Tran BX, Ha GH, Nguyen LH, Vu GT, Hoang MT, Le HT, et al. Studies of Novel Coronavirus Disease 19 (COVID-19) pandemic: A Global Analysis of Literature. IJERPH. 2020;17:4095.

Aruna S, Rajagopalan S. A novel SVM based CSSFFS feature selection algorithm for detecting breast cancer. Int J Comput Appl. 2011;31:14–20.

Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78:1–3.

DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–45.

Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30.

Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162:W1–73.

Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS, et al. PROBAST: A Tool to assess the risk of Bias and Applicability of Prediction Model studies. Ann Intern Med. 2019;170:51.

Miceli G, Basso MG, Rizzo G, Pintus C, Cocciola E, Pennacchio AR, et al. Artificial Intelligence in Acute ischemic stroke subtypes according to Toast classification: a Comprehensive Narrative Review. Biomedicines. 2023;11:1138.

Wang J, Gong X, Chen H, Zhong W, Chen Y, Zhou Y, et al. Causative classification of ischemic stroke by the machine learning Algorithm Random forests. Front Aging Neurosci. 2022;14:788637.

GBD 2016 Neurology Collaborators. Global, regional, and national burden of neurological disorders, 1990–2016: a systematic analysis for the global burden of Disease Study 2016. Lancet Neurol. 2019;18:459–80.

Ma Q, Li R, Wang L, Yin P, Wang Y, Yan C, et al. Temporal trend and attributable risk factors of stroke burden in China, 1990–2019: an analysis for the global burden of Disease Study 2019. Lancet Public Health. 2021;6:e897–906.

Pantoni L. Cerebral small vessel disease: from pathogenesis and clinical characteristics to therapeutic challenges. Lancet Neurol. 2010;9:689–701.

Xu S, Ilyas I, Little PJ, Li H, Kamato D, Zheng X, et al. Endothelial dysfunction in atherosclerotic Cardiovascular diseases and Beyond: from mechanism to Pharmacotherapies. Pharmacol Rev. 2021;73:924–67.

Cai H, Harrison DG. Endothelial dysfunction in cardiovascular diseases: the role of oxidant stress. Circ Res. 2000;87:840–4.

Papaharalambus CA, Griendling KK. Basic mechanisms of oxidative stress and reactive oxygen species in cardiovascular injury. Trends Cardiovasc Med. 2007;17:48–54.

Wu Z, Yao C, Zhao D, Wu G, Wang W, Liu J, et al. Sino-MONICA project: a collaborative study on trends and determinants in cardiovascular diseases in China, Part I: morbidity and mortality monitoring. Circulation. 2001;103:462–8.

Xu G, Ma M, Liu X, Hankey GJ. Is there a stroke belt in China and why? Stroke. 2013;44:1775–83.

Li Y, He Y, Lai J, Wang D, Zhang J, Fu P, et al. Dietary patterns are associated with stroke in Chinese adults. J Nutr. 2011;141:1834–9.

Liu LS, Tao SC, Lai SH. Relationship between salt excretion and blood pressure in various regions of China. Bull World Health Organ. 1984;62:255–60.

Kanhai DA, Kranendonk ME, Uiterwaal CSPM, Van Der Graaf Y, Kappelle LJ, Visseren FLJ. Adiponectin and incident coronary heart disease and stroke. A systematic review and meta-analysis of prospective studies: Adiponectin and risk for future CHD/stroke. Obes Rev. 2013;14:555–67.

Straub LG, Scherer PE. Metabolic messengers: Adiponectin. Nat Metab. 2019;1:334–9.

Jang AY, Scherer PE, Kim JY, Lim S, Koh KK. Adiponectin and cardiometabolic trait and mortality: where do we go? Cardiovascular Res. 2022;118:2074–84.

Chandran M, Phillips SA, Ciaraldi T, Henry RR. Adiponectin: more than just another Fat cell hormone? Diabetes Care. 2003;26:2442–50.

Becic T, Studenik C, Hoffmann G. Exercise increases Adiponectin and reduces leptin levels in Prediabetic and Diabetic individuals: systematic review and Meta-analysis of Randomized controlled trials. Med Sci. 2018;6:97.

Arsalan, Ismail M, Khattak MB, Khan F, Anwar MJ, Murtaza Z, et al. Prognostic significance of serum bilirubin in stroke. J Ayub Med Coll Abbottabad. 2011;23:104–7.

Lee SJ, Jee YH, Jung KJ, Hong S, Shin ES, Jee SH. Bilirubin and Stroke Risk using a mendelian randomization design. Stroke. 2017;48:1154–60.

Ibrahim F, Murr N. Embolic Stroke. StatPearls. Treasure Island (FL). StatPearls Publishing; 2023.

Hart RG. Cardiogenic stroke. Am Fam Physician. 1989;40(5 Suppl):S35–8.

Arsava EM, Ballabio E, Benner T, Cole JW, Delgado-Martinez MP, Dichgans M, et al. The causative classification of stroke system: an international reliability and optimization study. Neurology. 2010;75:1277–84.

Potthoff RF, Tudor GE, Pieper KS, Hasselblad V. Can one assess whether missing data are missing at random in medical studies? Stat Methods Med Res. 2006;15:213–34.

Acknowledgements

We express our gratitude to the Changping Laboratory for their invaluable support. We extend our sincere appreciation to all the participating hospitals, doctors and nurses, and the members of the Third China National Stroke Registry Steering Committee members, particularly Dr Yongjun Wang and Dr Yong Jiang, for their unwavering support and assistance. Additionally, we would like to acknowledge the diligent efforts of the editors and reviewers who provided meticulous feedback and constructive comments during the review process.

This study was supported by grants from National Natural Science Foundation of China (U20A20358), the Capital’s Funds for Health Improvement and Research (2020-1-2041) and Chinese Academy of Medical Sciences Innovation Fund for Medical Sciences (2019-I2M-5-029).The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Authors and affiliations.

Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, No.119 South 4th Ring West Road, Fengtai District, Beijing, 100070, China

Siding Chen, Xiaomeng Yang, Hongqiu Gu, Zhe Xu, Yong Jiang & Yongjun Wang

China National Clinical Research Center for Neurological Diseases, No.119 South 4th Ring West Road, Fengtai District, Beijing, 100070, China

Siding Chen, Hongqiu Gu, Zhe Xu, Yong Jiang & Yongjun Wang

Changping Laboratory, Beijing, China

Siding Chen, Yong Jiang & Yongjun Wang

School of Statistics, Renmin University of China, No. 59 Zhongguancun Street, Haidian District, Beijing, 100872, China

Yanzhao Wang

Beijing Advanced Innovation Center for Big Data-Based Precision Medicine, Beihang University & Capital Medical University, Beijing, 100091, China

Advanced Innovation Center for Human Brain Protection, Capital Medical University, Beijing, China

Yongjun Wang

Clinical Center for Precision Medicine in Stroke, Capital Medical University, Beijing, China

Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, China

Contributions

SDC wrote the main text of the manuscript, performed the data analysis, and prepared the supplementary materials. YJ and YJW were responsible for supervision and data provision. XMY reviewed the article and provided clinical consultation. YZW contributed to the discussion of machine learning algorithms and offered theoretical assistance on machine learning algorithms. XMY, ZX, and HQG participated in the clinical discussion. All authors read and approved the manuscript for submission.

Corresponding authors

Correspondence to Yong Jiang or Yongjun Wang .

Ethics declarations

Ethics approval and consent to participate.

This study was approved by the Ethics Committees of Beijing Tiantan Hospital (IRB number: KY2015-001-01). Written informed consent was obtained from all participants or their representatives.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article.

Chen, S., Yang, X., Gu, H. et al. Predictive etiological classification of acute ischemic stroke through interpretable machine learning algorithms: a multicenter, prospective cohort study. BMC Med Res Methodol 24 , 199 (2024). https://doi.org/10.1186/s12874-024-02331-1

Received : 24 November 2023

Accepted : 05 September 2024

Published : 10 September 2024

DOI : https://doi.org/10.1186/s12874-024-02331-1

  • Acute ischemic stroke
  • Clinical prediction
  • Etiological classification
  • Prospective cohort study
  • Machine learning

BMC Medical Research Methodology

ISSN: 1471-2288




IMAGES

  1. (PDF) An Overview of Machine Learning and its Applications
  2. Write research articles on artificial intelligence and machine learning
  3. (PDF) Research of Machine Learning Algorithms for the Development of
  4. (PDF) Research on education management system based on machine learning
  5. (PDF) A Research on Machine Learning Methods and Its Applications
  6. Latest 15 machine learning research topics

COMMENTS

  1. Machine learning

    Machine learning articles from across Nature Portfolio. Machine learning is the ability of a machine to improve its performance based on previous results. Machine learning methods enable computers ...

  2. Machine Learning: Algorithms, Real-World Applications and Research

    To discuss the applicability of machine learning-based solutions in various real-world application domains. To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services. The rest of the paper is organized as follows.

  3. The latest in Machine Learning

DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification. chongqingnosubway/dgr-mil, 4 Jul 2024. Second, we propose two mechanisms to enforce the diversity among the global vectors to be more descriptive of the entire bag: (i) positive instance alignment and (ii) a novel, efficient, and theoretically guaranteed diversification ...

  4. Journal of Machine Learning Research

Journal of Machine Learning Research. The Journal of Machine Learning Research (JMLR), established in 2000, provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online. JMLR has a commitment to rigorous yet rapid reviewing.

  5. Machine learning articles within

    Weather and climate predicted accurately — without using a supercomputer. A cutting-edge global model of the atmosphere combines machine learning with a numerical model based on the laws of ...

  6. A comprehensive study of groundbreaking machine learning research

Introduction. Machine learning (ML) has undergone a transformative evolution within the field of artificial intelligence, bringing about significant changes across numerous industries and scientific domains [1], [9], [50], [49], [30], [34] (Ezugwu et al., 2020). The rapid progress of ML techniques and algorithms has resulted in a proliferation of research publications in this field.

  7. Machine learning: Trends, perspectives, and prospects

    Machine learning is having a substantial effect on many areas of technology and science; examples of recent applied success stories include robotics and autonomous vehicle control (top left), speech processing and natural language processing (top right), neuroscience research (middle), and applications in computer vision (bottom).

  8. Eight ways machine learning is assisting medicine

    Machine learning will soon be applied to many other medical conditions, from cardiology to neurodegenerative diseases and beyond. 6. Improving prognostics. In addition to using it to diagnose ...

  9. The Journal of Machine Learning Research

    Benjamin Recht. Article No.: 20, Pages 724-750. This paper provides elementary analyses of the regret and generalization of minimum-norm interpolating classifiers (MNIC). The MNIC is the function of smallest Reproducing Kernel Hilbert Space norm that perfectly interpolates a label pattern on a finite ...

  10. Machine learning

    LLMs develop their own understanding of reality as their language abilities improve. In controlled experiments, MIT CSAIL researchers discover simulations of reality developing deep within LLMs, indicating an understanding of language beyond simple mimicry. August 14, 2024. Read full story.

  11. The Journal of Machine Learning Research

The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. JMLR seeks previously unpublished papers that contain: new algorithms with empirical, theoretical, psychological, or biological justification; experimental and/or theoretical studies yielding new insight into ...

  12. Machine Learning: Algorithms, Real-World Applications and Research

    Challenges and Research Directions. Our study on machine learning algorithms for intelligent data analysis and applications opens several research issues in the area. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions.

  13. Machine learning-based approach: global trends, research directions

    Temporal evolution of machine learning-related publications. a) Temporal patterns of machine learning-related articles; b) relative percentage estimation in 2020. Table 3. Number of papers related to machine learning published in 2020. ... The research was conducted by a working group composed of jurists, computer scientists, and social ...

  14. Home

    Overview. Machine Learning is an international forum focusing on computational approaches to learning. Reports substantive results on a wide range of learning methods applied to various learning problems. Provides robust support through empirical studies, theoretical analysis, or comparison to psychological phenomena.

  15. Machine Learning

    Machine learning is a research area of artificial intelligence that enables computers to learn and improve from large datasets without being explicitly programmed. It involves creating algorithms that can analyze patterns in data and generate models for specific tasks, allowing for accurate predictions and intelligent behavior. ...

  16. 777306 PDFs

    Explore the latest full-text research PDFs, articles, conference papers, preprints and more on MACHINE LEARNING. Find methods information, sources, references or conduct a literature review on ...

  17. Machine learning, explained

Machine learning is a subfield of artificial intelligence that gives computers the ability to learn without explicitly being programmed. "In just the last five or 10 years, machine learning has become a critical way, arguably the most important way, most parts of AI are done," said MIT Sloan professor Thomas W. Malone.

  18. Machine learning articles within Scientific Reports

    Read the latest Research articles in Machine learning from Scientific Reports. ... Machine learning articles within Scientific Reports. Featured. Article 09 September 2024 | Open Access.

  19. Articles

Methodology and evaluation in sports analytics: challenges, approaches, and lessons learned. Jesse Davis, Lotte Bransen, Maaike Van Roy. Original paper, open access, 17 July 2024, pages 6977-7010. Part of collection: Special Issue on Machine Learning for Soccer.

  20. Machine Learning

    Machine Learning. Abstract: In machine learning, a computer first learns to perform a task by studying a training set of examples. The computer then performs the same task with data it hasn't encountered before. This article presents a brief overview of machine-learning technologies, with a concrete case study from code analysis.

  21. Top Machine Learning Research Papers Released In 2021

    Machine learning and deep learning have accomplished various astounding feats this year in 2021, and key research articles have resulted in technical advances used by billions of people. The research in this sector is advancing at a breakneck pace and assisting you to keep up. Here is a collection of the most important recent scientific study ...

  22. Assessing the Emergence and Evolution of Artificial Intelligence and

    BACKGROUND AND PURPOSE: Interest in artificial intelligence (AI) and machine learning (ML) has been growing in neuroradiology, but there is limited knowledge on how this interest has manifested into research and specifically, its qualities and characteristics. This study aims to characterize the emergence and evolution of AI/ML articles within neuroradiology and provide a comprehensive ...

  23. Forecasting the future of artificial intelligence with machine learning

    The corpus of scientific literature grows at an ever-increasing speed. Specifically, in the field of artificial intelligence (AI) and machine learning (ML), the number of papers every month is ...

  24. A Machine Learning Approach to Well-Being in Late Childhood ...

    The premise for this article is the importance of measuring and monitoring children's and adolescents' SWB and the consequent need to have accurate and manageable measures of how children and adolescents perceive their life conditions, considered a prerequisite for having good data and implementing efficient and accountability-based public policies targeted at these age groups (see Ben ...

  25. Predictive etiological classification of acute ischemic stroke through

    The prognosis, recurrence rates, and secondary prevention strategies varied significantly among different subtypes of acute ischemic stroke (AIS). Machine learning (ML) techniques can uncover intricate, non-linear relationships within medical data, enabling the identification of factors associated with etiological classification. However, there is currently a lack of research utilizing ML ...

  26. Machine learning enables comprehensive prediction of the relative

Machine learning enables comprehensive prediction of the relative protein abundance of multiple proteins on the protein corona. Research. DOI: 10.34133/research.0487

  27. Supervised machine learning for understanding and predicting the status

1. Introduction. Rivers provide a wide range of essential ecosystem services; however, rapid social development and urbanization have profoundly disrupted these services through various anthropogenic activities (Best 2018, Liang et al. 2018). The discharge of municipal sewage and application of agricultural fertilizers have led to increased nutrient concentrations and turbidity, resulting in ...

  28. Seeking a quantum advantage for machine learning

An area of active research is speeding up machine learning with NISQ devices [7]. One of the first experimental implementations of quantum supervised machine learning used a chip of five qubits made ...

  29. Physical Layer Authentication and Security Design in the Machine

    Security at the physical layer (PHY) is a salient research topic in wireless systems, and machine learning (ML) is emerging as a powerful tool for providing new data-driven security solutions. Therefore, the application of ML techniques to the PHY security is of crucial importance in the landscape of more and more data-driven wireless services.

  30. Machine Learning‐Guided Design and Synthesis of Eco‐Friendly Poly

Research Article: Machine Learning-Guided Design and Synthesis of Eco-Friendly Poly(ethylene oxide) Membranes for High-Efficacy CO₂/N₂ Separation. Guangtai Zheng, ...
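Several of the entries above (e.g. items 17 and 20) describe the same core supervised-learning loop: a model first learns from a training set of labelled examples, then performs the same task on data it has not encountered before. A minimal sketch of that loop in plain Python, using a 1-nearest-neighbour classifier on made-up toy data (the classifier choice and the data are illustrative assumptions, not taken from any of the articles listed):

```python
# Minimal 1-nearest-neighbour classifier illustrating the train/predict
# split: memorise labelled examples, then classify unseen points.

def fit(train_points, train_labels):
    # "Training" for 1-NN is simply storing the labelled examples.
    return list(zip(train_points, train_labels))

def predict(model, x):
    # Return the label of the stored point closest to x
    # (squared Euclidean distance, no square root needed for argmin).
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(p, x))
    point, label = min(model, key=lambda pl: dist2(pl[0]))
    return label

# Toy training set: two clusters in 2-D, labelled "a" and "b".
train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = ["a", "a", "b", "b"]

model = fit(train, labels)
print(predict(model, (0.05, 0.1)))  # near the first cluster -> "a"
print(predict(model, (0.95, 1.0)))  # near the second cluster -> "b"
```

The point of the sketch is the two-phase structure itself (fit, then predict on unseen inputs), which is the pattern every supervised method in the listings above follows, whatever the underlying model.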