Introduction
Model-based clustering is a statistical approach to grouping data into homogeneous clusters based on probabilistic models, typically finite mixture models. Unlike heuristic methods such as k-means or hierarchical clustering, model-based clustering provides a principled framework grounded in statistical theory, allowing for robust parameter estimation, objective determination of the number of clusters, and uncertainty quantification.
Key Features
Finite Mixture Models: The data are assumed to arise from a mixture of subpopulations, each following a specific probability distribution (e.g., multivariate normal, Poisson, Bernoulli). Each subpopulation corresponds to a cluster.
Model Selection: Statistical criteria such as the Bayesian Information Criterion (BIC) or Integrated Completed Likelihood (ICL) are used to determine the optimal number of clusters and the best-fitting model.
Parameter Estimation: Parameters are estimated using methods like the Expectation-Maximization (EM) algorithm or Bayesian inference. These methods ensure that clusters are well-defined and interpretable.
Flexibility: Model-based clustering can handle various data types (e.g., continuous, categorical, or mixed) and complex data structures like networks.
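The features above can be illustrated with a short sketch. The example below fits Gaussian mixture models with varying numbers of components and selects the number of clusters by BIC. It uses scikit-learn's GaussianMixture as a stand-in for the R tools mentioned later in this page; the simulated data and all variable names are our own illustration, not part of any specific study.

```python
# Sketch: model-based clustering with a Gaussian mixture, selecting the
# number of clusters by BIC (lower BIC is better in scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated data: two well-separated bivariate normal subpopulations.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(150, 2)),
    rng.normal(loc=[4, 4], scale=0.5, size=(150, 2)),
])

# Fit mixtures with 1..5 components and record the BIC of each fit.
bics = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gm.bic(X)

best_k = min(bics, key=bics.get)  # the model minimising BIC
print("Selected number of clusters:", best_k)
```

Beyond the point estimate of cluster membership, the fitted model's posterior probabilities (via `predict_proba`) quantify the uncertainty of each assignment, which heuristic methods such as k-means do not provide.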
Applications
Model-based clustering has been applied across diverse fields:
Network Analysis: Clustering nodes in large-scale networks using scalable algorithms.
Person-Oriented Research: Identifying subpopulations in psychological or sociological studies.
Healthcare: Diagnosing diseases by clustering patient data based on clinical measurements.
Marketing: Modelling customer behaviour and customer segmentation.
FinOps: Modelling client spend and its interaction with the operational aspects of a business.
Tools and Software
Popular tools include SAS and the R packages mclust (for continuous data) and flexmix (for mixed data). These tools implement model-based clustering algorithms with user-friendly interfaces.
Ongoing developments in this research area include handling high-dimensional data, feature selection, incorporating covariates into models, and improving computational efficiency.
Model-based clustering extends naturally to mixture regression models, identifying latent subgroups with distinct relationships between features and outcomes. These models integrate clustering and regression analysis by allowing regression coefficients to vary across unobserved classes, providing insights into heterogeneous effects within populations.
Key Aspects of Mixture Regression
1. Model Structure:
a. Data are assumed to arise from a mixture of K regression models, each with unique parameters.
b. Covariates can influence both class membership and outcomes, enabling nuanced subgroup analysis.
2. Estimation:
a. Parameters are typically estimated via the Expectation-Maximization (EM) algorithm, with adaptations for high-dimensional data (e.g., penalised likelihood methods).
b. Bayesian approaches using Markov Chain Monte Carlo (MCMC) are common for complex models, such as those with random effects.
3. Challenges:
a. Model selection: Overestimation of clusters may occur if Gaussian components approximate non-Gaussian subgroups. Criteria like BIC or entropy-based metrics help balance fit and parsimony.
b. Identifiability: Mixture models may not be identifiable without suitable constraints, and estimation performance degrades when the true cluster structure or covariance structure is unknown.
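The model structure and EM estimation steps above can be sketched for the simplest case, a two-component mixture of linear regressions. The following is a minimal illustrative implementation under simplifying assumptions (fixed K = 2, a crude initialisation by the median of the response); the function name `em_mixreg` and the simulated data are our own, not from any library or study.

```python
# Minimal EM sketch for a two-component mixture of linear regressions:
# y = a_k + b_k * x + noise within latent class k.
import numpy as np

def em_mixreg(x, y, n_iter=200):
    n = len(x)
    X = np.column_stack([np.ones(n), x])     # design matrix with intercept
    # Crude initialisation: split observations at the median of y.
    resp = np.column_stack([(y < np.median(y)).astype(float),
                            (y >= np.median(y)).astype(float)])
    for _ in range(n_iter):
        # M-step: weighted least squares and residual variance per class.
        betas, sigmas = [], []
        pis = resp.mean(axis=0)              # mixing proportions
        for k in range(2):
            w = resp[:, k]
            beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
            resid = y - X @ beta
            sigmas.append(np.sqrt((w * resid**2).sum() / w.sum()))
            betas.append(beta)
        # E-step: posterior class probabilities (responsibilities).
        dens = np.column_stack([
            pis[k] / (sigmas[k] * np.sqrt(2 * np.pi))
            * np.exp(-0.5 * ((y - X @ betas[k]) / sigmas[k]) ** 2)
            for k in range(2)
        ])
        resp = dens / dens.sum(axis=1, keepdims=True)
    return betas, resp

# Simulated data from two parallel lines: y = x and y = x + 5.
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 300)
z = rng.integers(0, 2, 300)                  # latent class labels
y = x + 5 * z + rng.normal(0, 0.3, 300)

betas, resp = em_mixreg(x, y)
print("Recovered (intercept, slope) per class:",
      [np.round(b, 2) for b in betas])
```

In practice the initialisation strategy matters considerably, since the EM algorithm only converges to a local optimum; packages such as flexmix address this with multiple random restarts.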
Applications
Social Sciences: Analysing voting patterns by clustering municipalities with similar electoral behaviour over time.
Marketing: Segmenting consumers based on heterogeneous responses to promotions.
Finance: Identifying anomalies and modelling financial patterns.
Healthcare: Identifying patient subgroups with distinct treatment-outcome relationships.
Environmental modelling: Modelling environmental impacts on operational aspects of businesses.
List of recent projects:
1. Identifying cloud spending anomalies in the automotive industry: A FinOps approach.
2. Predicting cloud spending in the automotive industry: A FinOps approach.
3. Initial value determination in Gaussian mixture regression.
4. Model-based clustering for mixed data.
5. Investigation into variants of distributed Expectation-Maximisation algorithms for Gaussian mixture modelling.
6. Modal regression: parametric and non-parametric views.
7. Modelling transaction-based money laundering using a mixture of logistic regressions.
8. Mixture modelling for credit risk management.
9. Determining the number of clusters using penalised k-means clustering.
10. A semiparametric mixture of single-index models.
11. A penalised approach for feature selection in finite mixture of regressions.
12. Fairness first: A case study of fair credit modelling in South Africa.
13. A mixture modelling approach to extreme value analysis of heavy-tailed processes.
14. Modelling behavioural structures of recycler automated teller machines.
15. Robust parameter estimation of finite mixture models with self-paced learning.
Our research links to the following SDGs:
SDG 3: Good Health and Well-being
SDG 4: Quality Education
SDG 8: Decent Work and Economic Growth
SDG 9: Industry, Innovation and Infrastructure
SDG 13: Climate Action
Copyright © University of Pretoria 2025. All rights reserved.