Bioinformatics Tools

ERANGE - A package of Python scripts designed to analyze ultra-high-throughput sequencing data from the Illumina/Solexa platform for RNA-seq and ChIP-seq in metazoan genomes. The RNA-seq portion of ERANGE is described in our Nature Methods paper "Mapping and quantifying mammalian transcriptomes by RNA-Seq" (Mortazavi, 2008). ERANGE is built on top of Cistematic.

ChIPSeq Peak Finder - A subset of the scripts in ERANGE that can be used to analyze ChIP-seq data as originally described in our Science paper "Genome-wide mapping of in vivo protein-DNA interactions" (Johnson, 2007).

Cistematic - The core of Cistematic is a Python package with a rich set of APIs that simplify the collection and analysis of candidate cis-elements from a number of different motif-finding and phylogenetic footprinting programs such as MEME, AlignACE, Co-Bind, and FootPrinter. Cistematic assesses the significance of each motif against its genome-wide prevalence. The original version of Cistematic is described in our Genome Research paper "Comparative genomics modeling of the NRSF/REST repressor network: from single conserved sites to genome-wide repertoire" (Mortazavi, 2006).

BioHub - The BioHub is a relational database and Python API developed at Caltech that manages associations between numerous genomic-sequence-based and transcript-based datasets in order to provide centralized query services and uniform data access. The BioHub was designed to let biologists draw on and combine many disparate data sources for integrative analyses such as gene network modeling. The central feature of the BioHub design is the Sequence Registry, which relates diverse data and annotations to individual genomic sequence features, usually genes.

BHUtils - A Python package containing useful utilities that were developed for the BioHub project but can be used independently. These include a disk-based multi-record FASTA reader (capable of pulling subsequences out of entire chromosomes), download utilities, file decompression utilities, reverse complementing, batch BLAST runs (currently requires BioPython), handling of BLAST results, batch BLAT runs, BLAST database creation, BLAT nib creation, etc.
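
As an illustration of the reverse-complementing utility mentioned above, here is a minimal sketch in plain Python. It is not the BHUtils code: the function name reverse_complement and the handling of the ambiguity code N are assumptions made for this example.

```python
# Hypothetical helper, not part of BHUtils: maps each base to its complement.
# Only A, C, G, T and N (upper and lower case) are handled; other characters
# pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence string."""
    # Complement every base, then reverse the string.
    return seq.translate(_COMPLEMENT)[::-1]


example = reverse_complement("ATGCCN")
print(example)
```

Applying the function twice returns the original sequence, which is a convenient sanity check for any implementation of this kind.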

CompClust - CompClust is a Python package written using the pyMLX and IPlot APIs. It provides software tools to explore and quantify relationships between clustering results. Its development has been driven largely by the needs of microarray data analysis, but it could easily be used in other domains. Briefly, pyMLX provides for efficient and convenient execution of many clustering algorithms using an extensible library of algorithms. It also provides many-to-many linkages between data features and annotations (such as cluster labels, gene names, gene ontology information, etc.). These linkages are persistent through data manipulations. IPlot provides an abstraction of the plotting process in which any arbitrary feature or derived feature of the data can be projected onto any feature of the plot, including the X,Y coordinates of points, marker symbol, marker size, marker/line color, etc. These plots are intrinsically linked to the dataset, the View and the Labeling classes found within pyMLX.

Mussa - Mussa is an N-way version of the FamilyRelations/secomp 2-way comparative sequence analysis programs. Given DNA sequence from N species, Mussa uses all possible pairwise comparisons to derive an N-wise comparison. For example, given sequences 1, 2, 3, and 4, Mussa makes six 2-way comparisons: 1vs2, 1vs3, 1vs4, 2vs3, 2vs4, and 3vs4. It then compares all the links between these comparisons, saving those that satisfy a transitivity requirement. The saved paths are then displayed in an interactive viewer.
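
The pairwise scheme described for Mussa can be sketched in a few lines of Python. This is not Mussa's code. The function names, the representation of links as sets of position pairs, and the reading of the "transitivity requirement" as "every pair of positions on a path must itself be linked" are all assumptions made for illustration.

```python
from itertools import combinations


def pairwise_comparisons(n_sequences):
    """Return every unordered pair of 1-based sequence indices."""
    return list(combinations(range(1, n_sequences + 1), 2))


def transitive_paths(links, n_sequences):
    """Return position tuples (one position per sequence) in which every
    pair of positions is linked.

    links maps (i, j) with i < j to a set of (pos_i, pos_j) matches.
    ASSUMPTION: this all-pairs check stands in for the transitivity
    requirement; the real criterion may differ.
    """
    paths = sorted(links[(1, 2)])
    for k in range(3, n_sequences + 1):
        extended = []
        for path in paths:
            # Positions in sequence k that are linked from sequence 1.
            candidates = sorted(
                pos_k for pos_1, pos_k in links[(1, k)] if pos_1 == path[0]
            )
            for pos_k in candidates:
                # Keep pos_k only if sequences 2..k-1 also link to it.
                if all((path[i - 1], pos_k) in links[(i, k)]
                       for i in range(2, k)):
                    extended.append(path + (pos_k,))
        paths = extended
    return paths


labels = ["%dvs%d" % pair for pair in pairwise_comparisons(4)]
print(labels)

# Toy data: the match at (10, 12, 15) is supported by all three comparisons,
# while (50, 70, 90) lacks a 2vs3 link and is dropped.
toy_links = {
    (1, 2): {(10, 12), (50, 70)},
    (1, 3): {(10, 15), (50, 90)},
    (2, 3): {(12, 15)},
}
print(transitive_paths(toy_links, 3))
```

For N sequences the number of 2-way comparisons is N(N-1)/2, which gives the six comparisons listed for N = 4.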

Pymerase - Pymerase is a tool intended to generate a Python object model, a relational database, and an object-relational model connecting the two. It has since been extended to also output web pages, GUI widgets, tab-delimited text parsers, etc., and it can easily be extended to produce other outputs. We are currently using Pymerase for BioHub development and other projects.

Sigmoid - The Sigmoid project is intended to produce a database of cellular signaling pathways and models thereof, to marshal the major forms of data and knowledge required as input to cellular modeling software, and to organize the outputs. Such cellular signaling and regulatory pathways are commonly hand-drawn in the biological literature as an aid to intuitive understanding. Pathway databases can provide the same assistance in attempts to achieve a quantitative understanding of cellular processes by numerical simulation. They can also serve as an aid to capturing and querying both expert knowledge and heterogeneous data sets pertaining to pathways. Cell model databases are a subject of current research. Sigmoid works at the interface of these two areas.