Topic modelling for short text

Latent Dirichlet Allocation (LDA) is one of the most popular topic models; its generative assumption is that a document is a mixture of several topics. In contrast, the Multinomial Mixture (MM) model, another topic model, assumes that a document belongs to a single topic, which we believe is an intuitively sensible assumption for short text. Based on this key difference, we posit that the MM model should perform better than LDA on short text. Initial results are promising, and current research includes a systematic and thorough evaluation framework for topic models as well as computational optimisation of the algorithms.
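The difference between the two generative assumptions can be sketched in a few lines of Python. This is only an illustrative sketch, not the group's implementation: the function names, the toy vocabulary size and the fixed (rather than Dirichlet-distributed) MM topic weights are all assumptions made for the example.

```python
import random


def sample_dirichlet(rng, alpha):
    """Draw one sample from a Dirichlet distribution via normalised gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_mm_document(rng, topic_weights, topic_word_probs, doc_length):
    """Multinomial Mixture: draw ONE topic per document, then every word from it."""
    topic = rng.choices(range(len(topic_weights)), weights=topic_weights, k=1)[0]
    vocab = range(len(topic_word_probs[topic]))
    words = rng.choices(vocab, weights=topic_word_probs[topic], k=doc_length)
    return [topic] * doc_length, words


def generate_lda_document(rng, alpha, topic_word_probs, doc_length):
    """LDA: draw per-document topic proportions, then a topic for EACH word."""
    theta = sample_dirichlet(rng, alpha)
    topics = rng.choices(range(len(alpha)), weights=theta, k=doc_length)
    words = [
        rng.choices(range(len(topic_word_probs[z])), weights=topic_word_probs[z], k=1)[0]
        for z in topics
    ]
    return topics, words


# Hypothetical toy setting: 3 topics over a vocabulary of 8 word ids.
rng = random.Random(0)
num_topics, vocab_size, doc_length = 3, 8, 6
topic_word_probs = [sample_dirichlet(rng, [1.0] * vocab_size) for _ in range(num_topics)]

mm_topics, mm_words = generate_mm_document(
    rng, [1.0 / num_topics] * num_topics, topic_word_probs, doc_length
)
lda_topics, lda_words = generate_lda_document(
    rng, [1.0] * num_topics, topic_word_probs, doc_length
)
```

In the MM sketch every word in a document shares the same topic label, whereas in the LDA sketch the topic label may change from word to word. For a document of only a handful of words, estimating a full set of per-document topic proportions is hard, which is the intuition behind preferring the single-topic assumption for short text.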


Honours projects can be summarised as follows:

1) Language ID & Learning

a) Language ID: classify unseen text as English or isiZulu. Student: Siyabonga Mjali

b) Build a German language model using a GRU (gated recurrent unit) neural network.

2) Semi-supervised learning of text data

a) Anomaly detection

b) Sentiment analysis

3) Short text topic modelling

4) Using random forests on metadata for sentiment classification


PhD topic:

Discovering product weaknesses using topic models on product reviews

Whilst websites such as, and have created a platform on which consumers can easily comment on various products, they have also created vast stores of potentially useful data which are typically impractical to analyse manually. If it is assumed that the weaknesses of a product are the features/aspects of the product about which users most frequently post negative reviews, then the weaknesses (and strengths) of the product could potentially be inferred by identifying these aspects in reviews. Consequently, the focus of this research is to develop a novel method of automatically extracting product weaknesses (and strengths) from user reviews. Most methods that attempt to address this issue are based on sentiment analysis techniques. In this research, we propose an alternative method that is based on topic modelling instead.
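The working assumption above, that a weakness is an aspect which appears most often in negative reviews, can be illustrated with a small counting sketch. Everything in it is hypothetical: the aspect labels stand in for whatever a topic model would assign to each review, the star ratings and the threshold of two stars are an assumed signal of negativity, and the function name is invented for the example. It is not the method developed in this research.

```python
from collections import Counter

# Hypothetical input: one (aspect label, star rating) pair per review.
# In practice the aspect label would come from a topic model.
reviews = [
    ("battery", 1),
    ("battery", 2),
    ("screen", 5),
    ("battery", 5),
    ("screen", 1),
    ("camera", 4),
    ("battery", 1),
]


def rank_weaknesses(reviews, negative_threshold=2):
    """Count how often each aspect occurs in reviews rated at or below the threshold."""
    negative_counts = Counter(
        aspect for aspect, stars in reviews if stars <= negative_threshold
    )
    # most_common() returns aspects ordered from most to least frequent.
    return negative_counts.most_common()


weaknesses = rank_weaknesses(reviews)
print(weaknesses)  # [('battery', 3), ('screen', 1)]
```

Applying the same count to highly rated reviews would give a corresponding ranking of candidate strengths.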

The Statistics Department works closely with researchers at the CSIR in the field of Human Language Analytics.


Dr Alta de Waal - Principal investigator

Department of Statistics, University of Pretoria, Pretoria

South Africa

E-mail: [email protected]

Telephone: +27 (0) 12 420 3441

Ms J. Mazarura - PhD student

Copyright © University of Pretoria 2024. All rights reserved.
