Software and Data

Contents:  

Non-parametric Topic Modelling (hca) software, Generalised Stirling Numbers (libstb) software, Document/Text preprocessing for topic models (DCA-bags), Data sets for DCA-bags

Non-parametric Topic Modelling (hca) software

This software is now available.  Please contact wray.buntine at nicta.com.au with questions, support requests or suggestions.  The software is available as a tar.gz file in the source folder of the MLOSS archive.  A simple user manual is here.  Features are as follows:

  • Coded in C with no dependencies so easy to compile on most systems.
  • Input can be in LdaC format, docword format, or various Matlab-style formats.
  • Implements HDP-LDA and HPYP-LDA with symmetric-symmetric, symmetric-asymmetric, asymmetric-symmetric, and asymmetric-asymmetric priors, using Pitman-Yor or Dirichlet processes.
  • Full hyper-parameter fitting, or hyper-parameters can simply be set initially.
  • Special "burstiness" function using the trick from Doyle and Elkan for (DCMLDA, 2009) for even better performance.
  • No Chinese restaurant processes or stick breaking, so sampling is fast (1.0-3.0 times slower than regular LDA with Gibbs sampling).  Includes part of the libstb library below to handle Pitman-Yor and Dirichlet processes.
  • Estimation of document and topic vectors.
  • Diagnostics, control, restarts, test likelihood via document completion.
  • Coherence calculations on results using PMI and normalised PMI (a rough sketch follows this list).
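
As a rough illustration of the coherence scores in the last item, here is a minimal C sketch of pairwise NPMI over a topic's top words, assuming the marginal and joint word probabilities have already been estimated from a reference corpus.  The function and variable names are made up for the example; this is not the code used in hca.

    /* Illustrative sketch only: average normalised PMI over pairs of a
     * topic's top words.
     *   p[i]     marginal probability of top word i in a reference corpus
     *   pj[i][j] joint probability of top words i and j co-occurring
     *   n        number of top words scored (e.g. 10)
     */
    #include <math.h>

    double topic_npmi(int n, const double *p, double **pj) {
        double total = 0.0;
        int pairs = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (pj[i][j] <= 0.0)
                    continue;                        /* pair never co-occurs: skip it */
                double pmi  = log(pj[i][j] / (p[i] * p[j]));
                double npmi = pmi / -log(pj[i][j]);  /* normalised into [-1, 1] */
                total += npmi;
                pairs++;
            }
        }
        return pairs > 0 ? total / pairs : 0.0;
    }

Plain PMI coherence is the same average without the normalising division.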

Note that to achieve this performance, the code is quite complex in parts.  If you wish to edit or modify the code, contact the author for support.

Generalised Stirling Numbers (libstb) software

This source directory contains library routines that provide alternative ways of computing the generalised second-order Stirling numbers used when working with Pitman-Yor and Dirichlet processes (PYP and DP).  Included are library routines for posterior sampling of the discount and concentration parameters of the PYP/DP, and some simple demo programmes to illustrate usage.  The system is used for Pitman-Yor processing with topic models, to allow easy scaling to gigabytes of text.  Tested on a few versions of Ubuntu Linux and Mac OS X.  The default build is independent of any external libraries, using the rand48() samplers in the standard C library and copying some GPL'd code from Mathlib and GSL for a few key routines.  Instructions on how to remove the GSL code (removing use of the slice sampler) are included.
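
For orientation, these Stirling numbers satisfy a simple recurrence, and a table of their logarithms can be built directly from it.  The following is a minimal sketch of that direct computation, not the libstb API; the function names are invented for the example.

    /* Illustrative sketch only: generalised second-order Stirling numbers
     * S_a(n,m) used with Pitman-Yor/Dirichlet processes, tabulated in log
     * space via the standard recurrence
     *     S_a(n,m) = S_a(n-1,m-1) + (n-1 - m*a) * S_a(n-1,m)
     * with S_a(0,0)=1, S_a(n,0)=0 for n>0, and S_a(n,m)=0 for m>n.
     * Here a is the PYP discount; a=0 gives the Dirichlet process case.
     */
    #include <math.h>
    #include <stdlib.h>

    static double logadd(double x, double y) {      /* log(exp(x)+exp(y)) */
        if (x < y) { double t = x; x = y; y = t; }
        return (y == -HUGE_VAL) ? x : x + log1p(exp(y - x));
    }

    /* Returns an (N+1)x(N+1) row-major table of log S_a(n,m); caller frees it. */
    double *log_stirling_table(int N, double a) {
        double *S = malloc((size_t)(N + 1) * (N + 1) * sizeof(double));
        if (!S) return NULL;
    #define LS(n, m) S[(n) * (N + 1) + (m)]
        for (int n = 0; n <= N; n++)
            for (int m = 0; m <= N; m++)
                LS(n, m) = -HUGE_VAL;               /* log(0) */
        LS(0, 0) = 0.0;                             /* log(1) */
        for (int n = 1; n <= N; n++)
            for (int m = 1; m <= n; m++) {
                double prev = LS(n - 1, m);         /* -inf when m > n-1 */
                double term = (prev == -HUGE_VAL) ? -HUGE_VAL
                              : log(n - 1 - m * a) + prev;
                LS(n, m) = logadd(LS(n - 1, m - 1), term);
            }
    #undef LS
        return S;
    }

An O(N^2) table like this is fine for small counts but not for gigabyte-scale corpora, which is why libstb provides the alternative ways of computing these numbers mentioned above.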


Document/Text preprocessing for topic models (DCA-bags)

This is my workhorse for preprocessing text, the DCA-Bags suite of Perl routines.  It automates the task of producing bags and lists as input to topic modelling software.  I've created output for 5 different programmes or data set types so far (Ldac, Matlab toolbox, docwords, ...), and input processing includes all sorts of tricks: fancy tokenisation, stop words, optional stemming, and collocations (n-grams).  I've used it so far for Wikipedia (the lot), PubMed and Reuters RCV1 (the lot), and all the usual smaller stuff.  Look below for prepared data sets.
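
As a small concrete illustration of one of those outputs, here is a C sketch that reads a bag-of-words file in a docword-style layout (assumed here to be the common UCI style: three header lines giving the number of documents D, the vocabulary size W and the number of non-zero entries NNZ, then one "docID wordID count" triple per line with 1-based IDs).  This is purely illustrative and not part of the DCA-bags scripts.

    /* Illustrative sketch only: read a docword-style bag-of-words file
     * (UCI-style layout assumed) and report some corpus statistics. */
    #include <stdio.h>

    int main(int argc, char **argv) {
        FILE *fp = fopen(argc > 1 ? argv[1] : "docword.txt", "r");
        if (!fp) { perror("open"); return 1; }
        long D, W, NNZ;
        if (fscanf(fp, "%ld %ld %ld", &D, &W, &NNZ) != 3) {
            fprintf(stderr, "bad header\n");
            fclose(fp);
            return 1;
        }
        long d, w, c, total = 0;
        while (fscanf(fp, "%ld %ld %ld", &d, &w, &c) == 3)
            total += c;                 /* accumulate the total token count */
        printf("%ld docs, %ld word types, %ld tokens\n", D, W, total);
        fclose(fp);
        return 0;
    }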

Data sets that can be used with the DCA-bags software

  • Examples directory with XSL scripts and documentation for the data sets below, including example XSL scripts for Reuters RCV1 and Medline XML data.
  • 20 News Group train and test.
  • Reuters-21578 ModLewis split with train and test in one file.
  • BitterLemons corpus
  • NIPS 1987-2012, cleaned with references and header information removed, in years 87-98, 99-08, 08-12.
  • Abstracts and titles for JMLR (vols 1-12), IEEE Trans PAMI (2006-2011) and ICML (2007-2011).
  • Different subsets of Wikipedia from 11/12/11: words only; words+URLs; words+URLs+categories; words with frequency>1000; words with frequency>10,000; random 100k articles; random 500k articles; PMI matrices.  These are pretty big, so email wray.buntine at nicta.com.au if interested, to arrange access.