Fast tuning of topic models: an application of Rényi entropy and renormalization theory
1  National Research University Higher School of Economics

Abstract:

In practice, the critical step in building machine learning models for big data (BD) is often parameter tuning with grid search, a procedure that is costly in terms of time and computing resources. The sizes of BD are comparable to those of mesoscopic physical systems; hence, methods of statistical physics can be applied to BD. The paper shows that topic modeling (a clustering method for large document collections) demonstrates self-similar behavior when the number of clusters varies. Such behavior allows a renormalization technique to be applied. Combining the renormalization procedure with the Rényi entropy approach enables a fast search for the optimal number of clusters. In this paper, the renormalization procedure is developed for the Latent Dirichlet Allocation (LDA) model with the variational Expectation-Maximization algorithm. The experiments were conducted on two document collections with a known number of clusters, in Russian and English, respectively. The paper presents results for three versions of the renormalization procedure: (1) renormalization with random merging of clusters, (2) renormalization based on minimal values of Kullback-Leibler divergence, and (3) renormalization that merges clusters with minimal values of Rényi entropy, where entropy is computed for each topic separately. The paper shows that the renormalization procedure finds the optimal number of topics ten times faster than grid search, without significant loss of quality.
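The renormalization step described above reduces the number of topics by merging two topics into one. The abstract does not specify how the merged word distribution is formed, so the sketch below assumes the simplest choice for the random-merging variant (1): the merged topic's word distribution is the renormalized sum of the two originals. The function name `merge_topics` and the unweighted sum are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def merge_topics(phi, i, j):
    """One renormalization step on a topic-word matrix phi (shape W x K).

    Topics i and j are replaced by a single topic whose word distribution
    is the renormalized sum of the two originals (an assumption; the paper
    may weight the topics, e.g. by their probabilities).
    Returns a matrix of shape W x (K - 1).
    """
    W, K = phi.shape
    merged = phi[:, i] + phi[:, j]
    merged /= merged.sum()                      # renormalize to a probability vector
    keep = [k for k in range(K) if k not in (i, j)]
    return np.column_stack([phi[:, keep], merged])

# Toy example: a random topic-word matrix (not a trained LDA model).
rng = np.random.default_rng(1)
phi = rng.dirichlet(np.ones(50), size=4).T      # W = 50 words, K = 4 topics
i, j = rng.choice(4, size=2, replace=False)     # random-merging variant (1)
phi_reduced = merge_topics(phi, i, j)           # K -> K - 1
```

Repeating this step takes the model from a large initial K down through successively coarser models, and the Rényi entropy of each reduced model can then be inspected for a minimum without retraining at every K.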

Keywords: Rényi entropy; renormalization; topic modeling; big data
Comments on this paper
Feiyan Liu
Self-similar behavior
Thank you for your interesting paper; it is very helpful to me, since I am working on how to choose the number of topics automatically when performing LDA.

I have a question: in the abstract, you mention that "the paper shows that topic modeling demonstrates self-similar behavior under the condition of a varying number of clusters". Can you explain exactly what this means?

Sergei Koltcov
Self-similar behavior
In our calculations, we measured the word density distribution while changing the number of topics.
The word density distribution function is the ratio of the number of high-probability words to the total number of words across topics, that is, p = N / (W * K), where N is the number of high-probability words, W is the number of unique words in the collection, and K is the number of topics. In effect, this function measures the sparseness of the topic model in terms of entropy.
If you plot ln(p) versus p, you can see several distinct straight lines. On each straight line the density function behaves in the same way (fractal behavior), so the density is reproduced. Since the word density distribution function defines the Rényi entropy, we observe the same behavior in the Rényi entropy.
Details are given in the following work.
Ignatenko, V., Koltcov, S., Staab, S., & Boukhers, Z. (2019). Fractal approach for determining the optimal number of topics in the field of topic modeling. Journal of Physics: Conference Series, 1163(1), 012025. doi: 10.1088/1742-6596/1163/1/012025
https://iopscience.iop.org/article/10.1088/1742-6596/1163/1/012025/meta
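The density function described above can be sketched in a few lines. The threshold for a "high-probability" word is not stated in this reply, so the sketch below assumes the natural cutoff 1/W (a word counts if its probability in a topic exceeds the uniform level); the function name `word_density` is likewise illustrative.

```python
import numpy as np

def word_density(phi, threshold=None):
    """Density p = N / (W * K) for a topic-word matrix phi (shape W x K).

    N counts word-topic entries whose probability exceeds the threshold;
    1/W is assumed as the default cutoff for "high probability".
    Columns of phi are word distributions, each summing to 1.
    """
    W, K = phi.shape
    if threshold is None:
        threshold = 1.0 / W
    N = np.count_nonzero(phi > threshold)   # number of high-probability entries
    return N / (W * K)

# Toy example: a random topic-word matrix (not a trained LDA model).
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.full(1000, 0.1), size=20).T  # W = 1000 words, K = 20 topics
p = word_density(phi)                               # sparseness: 0 <= p <= 1
```

A sparse model (few high-probability words per topic) yields small p; a perfectly uniform model yields p = 0 under the strict 1/W cutoff, since no word rises above the uniform level.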


