Caltech BioHub

The Caltech BioHub: Unified access to diverse bioinformatics datasets

The number and diversity of bioinformatics data sources, as well as their ever-increasing sizes, pose numerous challenges to investigators wanting to perform integrative data analyses in their research. The need to combine myriad data sources, formats and qualities ranging from well-vetted annotations to mere hypotheses often impedes such efforts.

The BioHub is a relational database and Python API developed at Caltech that manages associations between numerous genomic-sequence-based and transcript-based datasets in order to provide centralized query services and uniform data access. The Biohub was designed to permit biologists to draw on and combine many disparate data sources for integrative analyses such as gene network modeling. The central feature in Biohub design is the Sequence Registry which relates diverse data and annotations to individual genomic sequence features - usually genes. Key BioHub design features include:

Maintains associations between transcript-based results (microarray, ChIP-array, and in situ expression) and genomic-sequence-based results (sequence annotations, motif instances, conserved elements)
Sequence Registry provides unique IDs for a variety of annotations to build stable associations to peripheral data stores, e.g. Rosetta Resolver, GeneX, sequence annotation databases, motif databases, conserved element collections, probe libraries, etc.
Query entry-points and results are based on common representations for genes, probes, expression data, motifs and other annotations, all designed to be integrated and analyzed together, e.g. in the PyMLX CompClust environment
Offers a rich collection of sequence operations, e.g. BLAST, BLAT and regular expression searches, contains/is-contained-in queries, and neighborhood searches, for any specific (or all) sequence types
Permits transient, hypothetical annotations to be registered to support an individual's integrative analyses but hidden from default query results until they are promoted (or easily removed)
Automated import of genomes (as assemblies or contigs) including construction of blast and blat databases
New genome build releases trigger updates so that transcript and sequence locations can be tracked across multiple builds
Supports broad queries, e.g. "For gene X, get me all Mm and Hs expression data, and all upstream motifs that are conserved in both species"

The poster presented at PSB 2004 described the status of the current BioHub prototype, demonstrated the ways it has been used to date (e.g. in assessing the quality of oligonucleotide probe libraries, for identifying common regulatory elements, etc.), and described design plans for future BioHub development.