Software




Gossamer

A Space-Efficient Genome Assembler
Gossamer is an application for the de novo assembly of genomes from fragments of DNA, designed specifically to address scalability. Its advantage is that large data sets can be assembled on computers with small amounts of memory.


is-rSNP

in silico regulatory SNP detection
is-rSNP is a software tool that predicts whether a SNP is a regulatory SNP (rSNP). For a given SNP, is-rSNP uses a statistical framework to predict the set of transcription factors (TFs) whose binding is affected. Recent enhancements to the algorithm provide the statistical power to scan large numbers of SNPs, making it suitable for screening all associated SNPs output by a typical GWAS.


Xenome

A tool for classifying reads from xenograft samples
Shotgun sequence read data derived from xenograft material contains a mixture of reads arising from the host and reads arising from the graft. Xenome is an application that classifies the reads in this mixture to separate the two, allowing more precise downstream analysis.


GWIS

Multivariate GWAS analysis
GWIS cuts the computational time for analyzing all pairwise SNP interactions from months to minutes on commodity computers. The tool can also handle ternary interactions, providing an unprecedented capability for investigating complex diseases.
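To give a sense of why exhaustive interaction analysis is expensive, the short sketch below counts the pairwise and ternary combinations for a given number of SNPs. The function name `interaction_counts` and the figure of 500,000 SNPs are illustrative assumptions, not values taken from GWIS.

```python
from math import comb


def interaction_counts(n_snps):
    """Return (number of SNP pairs, number of SNP triples) for n_snps SNPs."""
    return comb(n_snps, 2), comb(n_snps, 3)


# Hypothetical study size, chosen only for illustration.
pairs, triples = interaction_counts(500_000)
print(f"pairs:   {pairs:,}")    # 124,999,750,000
print(f"triples: {triples:,}")
```

The pair count grows quadratically and the triple count cubically with the number of SNPs, which is why exhaustive pairwise and ternary scans are dominated by raw combinatorics.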


GAT

Genome Annotation Test with Validation on Transcription Start Site and ChIP-Seq for Pol-II Binding Data


Quarc

Quality Analysis and Read Control


SparSNP

Sparse SNP Analysis
SparSNP fits lasso (and naive elastic net) penalized linear models to SNP data. Its main features are:
  • fits squared hinge loss models for classification (case/control) and linear regression models for quantitative phenotypes
  • takes PLINK BED/FAM files as input
  • uses a bounded amount of memory, so it can work with large datasets using very little memory (typically <100MB)
  • fits a model over a grid of penalties and writes the estimated coefficients to disk
  • can also do cross-validation, using the estimated coefficients to predict outputs for other datasets
  • is efficient: it uses warm restarts plus an active-set approach. The model-fitting part of 3-fold cross-validation takes ~5 min for a dataset of 2000 samples by 300,000 SNPs, and about 25 min for ~6800 samples by ~516,000 SNPs
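The idea of fitting over a grid of penalties with warm restarts can be sketched as follows. This is a minimal pure-Python illustration, not SparSNP's implementation: it assumes squared-error loss only, no intercept, plain cyclic coordinate descent rather than an active-set strategy, and in-memory lists instead of PLINK files. The names `soft_threshold` and `lasso_path` are hypothetical.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_path(X, y, penalties, n_sweeps=200, tol=1e-9):
    """Fit lasso linear regression over a grid of penalties with warm restarts.

    X is a list of rows (samples x features); y is a list of responses.
    For each penalty lam this minimises
        (1 / 2n) * sum((y - X b)^2) + lam * sum(|b_j|)
    and returns a list of (lam, coefficients) pairs, largest penalty first.
    """
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    col_sq = [sum(v * v for v in c) / n for c in cols]
    beta = [0.0] * p      # carried over between penalties (the warm restart)
    resid = list(y)       # residual y - X beta; beta starts at zero
    path = []
    for lam in sorted(penalties, reverse=True):
        for _ in range(n_sweeps):
            max_change = 0.0
            for j in range(p):
                if col_sq[j] == 0.0:
                    continue  # constant-zero feature carries no information
                c = cols[j]
                # Covariance of feature j with the partial residual.
                rho = sum(c[i] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
                new = soft_threshold(rho, lam) / col_sq[j]
                delta = new - beta[j]
                if delta != 0.0:
                    for i in range(n):
                        resid[i] -= delta * c[i]
                    beta[j] = new
                    max_change = max(max_change, abs(delta))
            if max_change < tol:
                break
        path.append((lam, list(beta)))
    return path


# Toy data in which the response depends only on the first feature.
X = [[1, 0], [2, 1], [3, 0], [4, 1]]
y = [2, 4, 6, 8]
for lam, coefs in lasso_path(X, y, [20.0, 1.0, 0.0]):
    print(lam, coefs)
```

Starting each fit from the solution at the previous, larger penalty means most coefficients are already close to their final values, so few sweeps are needed per grid point.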


RLZ

Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
Self-indexes are data structures that simultaneously provide fast search of and access to compressed text. They are promising for genomic data, but in their usual form they cannot exploit the high level of replication present in a collection of related genomes. Our RLZ approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of $r$ sequences totalling $N$ bases, with a total of $s$ point mutations from a base sequence of length $n$, this representation requires just $nH_k(T) + s\log n + s\log(N/s) + O(s)$ bits. At the cost of negligible extra space, access to $\ell$ consecutive symbols requires $O(\ell + \log n)$ time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.
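The relative encoding step can be illustrated with a small greedy factorisation, in which each sequence is expressed as a list of (position, length) copies from the base. This is a sketch under simplifying assumptions, not the RLZ implementation: it uses Python's `str.find` as a naive stand-in for the self-index, stores factors as plain tuples rather than a compact bit encoding, and the names `rlz_encode` and `rlz_decode` are hypothetical.

```python
def rlz_encode(base, target):
    """Greedily factorise target into copies from base.

    Each factor is (position, length), meaning base[position:position + length].
    A symbol that never occurs in base is emitted as a literal (symbol, 0).
    """
    factors = []
    i = 0
    while i < len(target):
        pos, length = -1, 0
        # Extend the match one symbol at a time while it still occurs in base.
        while i + length < len(target):
            hit = base.find(target[i:i + length + 1])
            if hit < 0:
                break
            pos, length = hit, length + 1
        if length == 0:
            factors.append((target[i], 0))
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def rlz_decode(base, factors):
    """Rebuild the original sequence from its factors and the base."""
    parts = []
    for first, length in factors:
        parts.append(first if length == 0 else base[first:first + length])
    return "".join(parts)


base = "ACGTACGTTTGA"
target = "ACGTACCTTTGA"   # differs from base by one point mutation
factors = rlz_encode(base, target)
print(factors)            # [(0, 6), (1, 1), (7, 5)]
assert rlz_decode(base, factors) == target
```

A single point mutation splits the sequence into only a few factors, which is the intuition behind the space bound growing with the number of mutations $s$ rather than with the total number of bases $N$.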