Discrete Component Analysis
The Discrete Component Analysis (DCA) software is being developed both as a stand-alone package and as a plug-in to the Elefant system, a machine learning toolbox from NICTA. Currently the software runs in stand-alone mode, using the data streaming libraries from the older and now unsupported MPCA system developed at the Helsinki Institute for IT. The software itself is written in C and compiles on Linux and Mac OS X.
The models presented here are known under many names, such as latent Dirichlet allocation, multi-aspect models, multinomial PCA, and non-negative matrix factorisation.
The following reports are available for the first public release, version 0.202:
- "DCA 0.202 Software," software guide for installers and developers.
- "DCA 0.202 User Guide," for users.
- "An Approach to Discrete Component Analysis," theory paper.
- "Estimating Likelihoods for Topic Models," paper describing the evaluation of topic models. It first appeared at ACML'09, and the original publication is available at www.springerlink.com.
- Downloading and converting Wikipedia for use, using the WEX format.
- Using the UCI data sets for topic modelling.
- Using Reuters RCV1 collection.
- Using the Caltech256 or MSRC2 image collections based on SIFT features, available from Tinne Tuytelaars' webpage on Unsupervised Object Discovery.
- Using PubMed XML data.
The software itself, published under the MPL license, can be downloaded in "tar.gz" format. It is built using GNU configure and autotools. Many of the examples require installing several Perl-based scripts for data massaging and HTML reporting. Otherwise, installation mainly requires the GNU Scientific Library (GSL) and the Judy library, both of which are easy to install on most distributions.
