Online Multi-label Text Classification using Topic Models
Abstract
Every day, an enormous amount of text data is produced. Sources include news, social media, emails, text messages, medical reports, scientific publications and fiction. As the volume of data grows, so does the need for scalable and interpretable models that help analyze it. Such models may generally be divided into supervised and unsupervised models. In the supervised case, the task is to classify the data given an existing set of labels. For unlabeled data, unsupervised models may be learned that cluster the data to reveal hidden similarities and regularities.
For both of these tasks, existing methods often lack either scalability or interpretability. Scalability is especially important as dataset sizes grow, and it is often necessary to process streaming data online without storing it. Interpretability helps to understand a given result and becomes increasingly important when actions have to be taken based on the modeling result. Such actions often need to be justified to customers or other stakeholders based on information extracted from the model. In this thesis, both scalability and interpretability are achieved by focusing on generative Bayesian topic models for text data. These models are applicable in the supervised as well as the unsupervised setting and remain interpretable in both.
Overall, four novel topic models are proposed in this thesis. These models make it possible not only to cluster and classify the data but also to assign a semantic interpretation to each cluster that helps to understand its content. This way, it is possible to understand why a text document was assigned a certain topic. At the same time, the proposed models scale to large datasets and are able to handle streams of data.
The first model is trained online and used for multi-label classification of text, meaning that each document may be assigned several labels that may depend on one another. The second model is a nonparametric multi-label topic model that uses a novel sampling method to improve efficiency. Its nonparametric nature allows it to model different label frequencies. The third model is also nonparametric and is trained with a hybrid variational-Gibbs sampling algorithm that takes advantage of sparsity. The last model is trained online and tracks changes in topics over time; it is used to analyze German media with respect to the refugee crisis. In conclusion, this thesis demonstrates the many possibilities and the flexibility of the topic model framework for complex settings such as multi-label classification by exploring different learning and sampling strategies.