Wold Lab - BioHub

Current BioHub Version:

pre-0.1 (Revision Control Only)

About BioHub

The goal of the BioHub database project is twofold. The biological goal is to provide our biologists with a tool that allows them to ask questions of very different kinds of large-scale biological data, which are tied together by their spatial relationship to a gene or DNA sequence feature in one or more genomes. The computational goal is to provide a rich API (Application Programming Interface) that allows computer scientists to easily write custom large-scale analysis programs, which can then be turned into web applications or other GUIs for easy-to-use large-scale analysis.

In its current form, BioHub is a spatial annotation PostgreSQL database with a Python API for writing applications. It works by registering sequences (annotations of sequences) in the BioHub core database. Upon registering an annotation at a location on a genome, the user or program receives an SID (Sequence ID) that can later be used as a handle to the 'Registered Sequence' when using the BioHub API. An SID will always be the same for the exact same location on a genome. This means that if two different programs or people register a sequence with the exact same location, both will be given the exact same SID. This feature is important because it allows a wide variety of biological data to be associated simply by having the same location on the genome.

For example, if one were to register a sequence found in a publication as a 'conserved regulatory motif', and a motif-finding program later finds the exact same motif, both will have the exact same SID. The SID will also have two descriptions and two users attached to it, one saying "found in paper x..." and the other "discovered by motif finding program y." Simply by registering the two sequences, the published motif has now been connected to all sites in BioHub. Current work is expanding the hub to support the next obvious query: recovering all expression data associated with this custom set of instances. Through BioHub, the user can specify and collect, via SIDs, only those genes associated with motif instances that have a particular positional relationship to the user's gene models.
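The registration behaviour described above can be illustrated with a small in-memory sketch. This is not the BioHub API: the names `Location`, `ToyHub`, `register`, and `annotations` are hypothetical, and the real system stores registrations in PostgreSQL rather than in Python dictionaries. The sketch only shows the idea that one exact location maps to one SID, while each registration adds its own user and description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical description of an exact location on a genome."""
    genome: str
    chromosome: str
    start: int
    end: int
    strand: str


class ToyHub:
    """In-memory stand-in for the core database (illustration only)."""

    def __init__(self):
        self._sid_by_location = {}   # Location -> SID
        self._annotations = {}       # SID -> list of (user, description)

    def register(self, location, user, description):
        """Return the SID for this location, creating it on first use."""
        sid = self._sid_by_location.get(location)
        if sid is None:
            sid = len(self._sid_by_location) + 1
            self._sid_by_location[location] = sid
            self._annotations[sid] = []
        # Every registration keeps its own user and description.
        self._annotations[sid].append((user, description))
        return sid

    def annotations(self, sid):
        """Return all (user, description) pairs attached to an SID."""
        return list(self._annotations[sid])


hub = ToyHub()
motif = Location("human", "chr1", 1000, 1011, "+")
sid_paper = hub.register(motif, "biologist", "found in paper x...")
sid_program = hub.register(
    Location("human", "chr1", 1000, 1011, "+"),
    "motif_finder",
    "discovered by motif finding program y.",
)
print(sid_paper == sid_program)        # same location, same SID
print(hub.annotations(sid_paper))      # both descriptions are attached
```

Using the location itself as the lookup key is what makes the SID stable across users and programs; any data later attached to that SID is automatically shared by everyone who registered the same location.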

Actual Use Cases

Queried BioHub for the lengths of intergenic regions to compute statistics on the human genome.
Used to confirm the quality of miRNA precursor predictions by asking for the overlap of all predicted and known miRNA precursors in five datasets. This allowed us to check whether predictions matched known precursors. When a precursor prediction overlapped many other predictions, it is more likely that only one of those predictions is real. When a prediction overlapped the CDS or UTR of a gene, we flagged it as unlikely to be real. Registration allowed us to see whether each precursor was unique in the genome or had multiple copies. With a few quick queries we knew whether precursors were located in intergenic or intronic regions.
BioHub was used to design a custom gene chip that discriminates hundreds of related zinc-finger transcription factors in the human genome. These are not well represented with non-cross-reacting probes in current commercial array collections.
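The kinds of spatial questions behind the first two use cases can be sketched in plain Python. This is only an illustration under stated assumptions, not BioHub code: the function names `overlaps`, `flag_predictions`, and `intergenic_lengths` are hypothetical, features are assumed to be `(chromosome, start, end)` tuples with inclusive coordinates, and strand is ignored.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) features share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def flag_predictions(predictions, known, coding):
    """Summarise each predicted precursor against known precursors and CDS/UTR features."""
    report = []
    for i, pred in enumerate(predictions):
        report.append({
            "prediction": pred,
            # Does the prediction match a known precursor?
            "matches_known": any(overlaps(pred, k) for k in known),
            # How many other predictions pile up on the same spot?
            "overlapping_predictions": sum(
                1 for j, other in enumerate(predictions)
                if j != i and overlaps(pred, other)
            ),
            # Overlap with a CDS/UTR makes the prediction unlikely to be real.
            "unlikely": any(overlaps(pred, c) for c in coding),
        })
    return report


def intergenic_lengths(genes):
    """Lengths of the gaps between (start, end) genes on one chromosome."""
    lengths = []
    prev_end = None
    for start, end in sorted(genes):
        if prev_end is not None and start > prev_end + 1:
            lengths.append(start - prev_end - 1)
        # Track the furthest end seen so overlapping genes are merged.
        prev_end = end if prev_end is None else max(prev_end, end)
    return lengths


predictions = [("chr1", 100, 180), ("chr1", 150, 230), ("chr2", 500, 580)]
known = [("chr1", 90, 175)]
coding = [("chr2", 550, 700)]
for row in flag_predictions(predictions, known, coding):
    print(row)
print(intergenic_lengths([(100, 200), (150, 300), (401, 500)]))
```

In a database-backed system, the pairwise comparisons shown here would normally be expressed as queries over indexed coordinates rather than Python loops; the loops are used only to keep the sketch self-contained.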