Project : Topic Modeling through Data Dependent and Random Projection

Background:

Topic Modeling is a popular statistic tool in modeling the latent semantic structure of, e.g., text corpora. In a typical topic model, each document is assumed to be generated from a small set of topics – each being a distribution over a vocabulary. An important goal in topic modeling is to estimate the latent topics.

Main Results:

We develop a suite of novel approaches that consistently estimates the topic matrix with polynomial time complexity. Our approaches are closely related to the recent subsequent works based on provable Nonnegative Matrix Factorization under a key structural assumption – Separability. This separability assumption assumes the existence of  the “novel words” that are unique to each topic.

Our approaches exploit a geometric structure of the problem, i.e.,  novel words correspond to extreme points in some representation space.  We develop efficient and robust extreme point finding algorithms under various additional conditions on the topic prior. The proposed data dependent and random projection approaches could achieve the state-of-the-art statistical and computational efficiency.

We identified that the simplicial condition proposed by Arora et al.(2012) is the necessary and sufficient condition for novel words detection in separable topic models. We further show that our random projections based approach is applicable under such assumption.

We also consider the topic discovery in the distributed setting, i.e., the documents are archived on a system of distributed servers. It turns out that the random projection based approach is also distributable and parallelizable. We show that while retains the statistical efficiency of the centralized state-of-the-arts, our computation is parallelized with a moderate communication cost.

Publication :

  • W. Ding, M. H. Rohban, P. Ishwar, and V. Saligrama, “Efficient Distributed Topic Modeling with Provable Guarantees,” (to appear) in Proc. International Conference on Artificial Intelligence and Statistics (AISTATS’14),  Reykjavik, Iceland, 22-25, Apr., 2014, JMLR W&CP  33 .

Source Code :

Thank you for your interest in our project. For questions, comments, and bugs related to the code, please send email to Weicong Ding <dingwc@bu.edu>. In your email, please indicate your full name and affiliation, whether you are a faculty member, researcher, or a student, and whether you are already using or planning to use the code and its intended purpose. Please note that while we will make every effort to respond, sometimes we may be slow in replying.

Open-source copyright notice:

Copyright (c) 2013, Weicong Ding, Mohammad H. Rohban, Prakash Ishwar, and Venkatesh Saligrama. Allrights reserved.
Redistribution and use in source and binary forms, with or withoutmodification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright   notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright   notice, this list of conditions and the following disclaimer in the   documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS”AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOTLIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FORA PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHTHOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOTLIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USEOF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Link to the source code (in Matlab) (A python version will be available soon. )

Post a Comment

Your email address is never shared. Required fields are marked *