Topic Discovery

Background:

Topic Modeling is a popular statistical tool for modeling the latent semantic structure of data such as text corpora. In a typical topic model, each document is assumed to be generated from a small set of topics - each being a distribution over a word vocabulary. An important goal in topic modeling is to estimate the latent topics.

Main Results:

We have developed a suite of novel algorithms that can consistently estimate the topic matrix with polynomial-time computational complexity. Our work is most closely related to the recent works for Nonnegative Matrix Factorization under the key structural assumption of Separability – the assumption of the existence of  the “novel words” that are unique to each topic.

Our approach exploits a key geometric structure embedded in the problem, namely, novel words correspond to extreme points in a suitable representation space.  We develop efficient and robust extreme point finding algorithms under various additional conditions on the topic prior. The proposed data dependent and random projection approaches can achieve the state-of-the-art statistical and computational efficiency.

We have shown that the simplicial condition proposed by Arora et al.(2012) is not only sufficient but also necessary for novel word detection in separable topic models. We further show that our random projections based algorithms consistently recover the novel words under this assumption.

We also consider topic-discovery in the distributed setting, e.g., when the documents are archived on a system of distributed servers. It turns out that our random projection based algorithm can be naturally parallelized in a distributed setting with a moderate communication cost while retaining the statistical efficiency of the centralized state-of-the-art .

Publications:

  • W. Ding, M. H. Rohban, P. Ishwar, and V. Saligrama, “Efficient Distributed Topic Modeling with Provable Guarantees,” (to appear) in Proc. International Conference on Artificial Intelligence and Statistics (AISTATS’14),  Reykjavik, Iceland, 22-25, Apr., 2014, JMLR W&CP  33 .

Source Code:

Thank you for your interest in our project. For questions, comments, and bugs related to the code, please send email to Weicong Ding <dingwc@bu.edu>. In your email, please indicate your full name and affiliation, whether you are a faculty member, researcher, or a student, and whether you are already using or planning to use the code and its intended purpose. Please note that while we will make every effort to respond, sometimes we may be slow in replying.

Open-source copyright notice:

Copyright (c) 2013, Weicong Ding, Mohammad H. Rohban, Prakash Ishwar, and Venkatesh Saligrama. Allrights reserved.
Redistribution and use in source and binary forms, with or withoutmodification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright   notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright   notice, this list of conditions and the following disclaimer in the   documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS”AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOTLIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FORA PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHTHOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOTLIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USEOF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Link to the source code (in Matlab) (A python version will be available soon. )