BioHub

Author: Diane Trout

This presentation and associated material are available at http://woldlab.caltech.edu/biohub/scipy2006/

Motivation: Chr 1

>chr1
taaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
accctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac
cctaacccaaccctaaccctaaccctaaccctaaccctaaccctaacccc
taaccctaaccctaaccctaaccctaacctaaccctaaccctaaccctaa
ccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaa
ccctaaaccctaaccctaaccctaaccctaaccctaaccccaaccccaac
cccaaccccaaccccaaccccaaccctaacccctaaccctaaccctaacc
ctaccctaaccctaaccctaaccctaaccctaaccctaacccctaacccc
taaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccct
aaccctaaccctaaccctcgcggtaccctcagccggcccgcccgcccggg

Motivation: Rest of human

That was the first 500 DNA bases of human chromosome 1, and there are about 3.2 billion more just like those.

And that's for one species.

Motivation: Genome sizes

Order-of-magnitude genome sizes, in millions of bases (Mb).

Class              Genome size (Mb)
Wheat              15000
Mammals            1000
Worm, Fruit fly    100
Yeast              10
Bacteria           1
M. genitalium      0.580

M. genitalium has about 470 genes.

Motivation: sequence growth

genbankgrowth.jpg

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Motivation: Genome Browser

ucsc.png

Introducing: BioHub

The purpose of BioHub is to

Core Concepts

A DNA molecule has two strands, but one strand completely determines the other, so when storing sequence we need only store one strand.

However, when storing a location we need to indicate whether it lies on the "forward" or the "reverse" strand.
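
As a minimal sketch of why one strand is enough, the reverse strand can be recomputed from the stored forward strand by reverse complementing. The helper names and the 0-based, half-open coordinate convention below are assumptions for illustration only; they are not part of the BioHub API.

```python
# Map each base to its complement (A<->T, C<->G), preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]

def read_location(forward, start, stop, reverse=False):
    """Read the bases at [start, stop) from the stored forward strand.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    When reverse is True the same span is reported as it reads on the
    reverse strand.
    """
    piece = forward[start:stop]
    return reverse_complement(piece) if reverse else piece

# First ten bases of the chr1 excerpt shown earlier.
forward = "taaccctaac"
on_forward = read_location(forward, 0, 6)
on_reverse = read_location(forward, 0, 6, reverse=True)
```

Only the forward string is stored; the strand flag on the location decides how the span is read back.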

BioHub Genome

Human 34.0

sid.png

Example

To illustrate how BioHub can be used for a large scale genomic analysis we'll construct a simple example.

Create Sample Database

Imagine we have the results from some microRNA scanning programs:

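class QueryFasta:
  def __init__(self, query_fasta):
    # FASTA-formatted text that was fed to the scanner
    self.query_fasta = query_fasta

class miRNAResult:
  def __init__(self, mirna_name, binding,
               query_seq, target_seq):
    self.mirna_name = mirna_name  # e.g. "hsa-mir-140"
    self.binding = binding        # binding strength
    self.query_seq = query_seq    # QueryFasta that was searched
    self.target_seq = target_seq  # target sequence found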

microRNAs are one regulator of gene expression.

Create Sample Database

Start with a simple "database":

import cPickle

# sequence fed to microRNA target finder
gene2995 = QueryFasta(""">GeneID:2995
CACTGCATTTCCCTTTACCAACTAGCGCTGGGAGCACTGGACACTTAAA
TCCTCATCTGTCCTCCTTTCCTGTAAATAAAAGCCCTTCTATCCA""")
results2995 = []
results2995.append(
  # arguments: name, binding strength,
  # query sequence, and miRNA target
  miRNAResult("hsa-mir-140", # result name
              -21.2,         # binding strength
              gene2995,      # sequence we searched
              # sequence we found
              "CAAGATATTACCATGTACATGGTACCACCATC"
              ))
# store another result
results2995.append(
  miRNAResult("hsa-mir-206", -20.4, gene2995,
              "TCTTACCCATGAATGTGCACTACCTACATTTT"))
# dump() needs an open file object, not a filename
pickle_file = open("gene2995_mirna.pickle", "wb")
cPickle.dump(results2995, pickle_file)
pickle_file.close()

Registration

registerSequenceGivenLocation(self,
  locationList,   # (start,stop)
  description,    # desc. of run
  user,           # username
  sequenceType,   # e.g. CDS, Intron
  contig=None,    # Contig object
  species=None,   # Name of species
  build=None,     # Genome version
  accession=None, # NCBI accession
  gi=None,        # NCBI gi
  id=None,        # chromosome names
  reverseComplement=False)

You don't need to specify all of these at the same time.

BLAST

BLAST is a sequence (string) search tool that allows for mismatches, insertions and deletions.

Because of this inexact matching, in larger genomes you need a query of 20-40 bases to avoid purely random matches.
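
A rough back-of-the-envelope sketch of where that length requirement comes from: in a random sequence with equal base frequencies, a given k-base query is expected to match exactly at about N / 4^k positions per strand. This is a simplified model and not how BLAST scores hits. It ignores mismatches, insertions and deletions (which make chance matches more likely) as well as repeats in real genomes, and the function name is hypothetical.

```python
def expected_random_hits(genome_size, k, both_strands=True):
    """Expected number of chance exact matches for one k-base query.

    Assumes independent bases with uniform frequencies (1/4 each),
    which real genomes do not satisfy; treat the result as a rough guide.
    """
    positions = genome_size - k + 1      # possible start positions
    strands = 2 if both_strands else 1   # search forward and reverse
    return strands * positions / 4.0 ** k

HUMAN_GENOME = 3200000000  # about 3.2 billion bases

for k in (12, 16, 20, 24):
    print(k, expected_random_hits(HUMAN_GENOME, k))
```

Under this model the expected count only drops well below one once the query is around 20 bases long, and tolerating inexact matches pushes the safe length higher still.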

Link Annotations

from BioHub.BioHubAPI.RegisterSequence \
     import RegisterSequence

registrar = RegisterSequence()
biohub_link = {}  # maps each miRNA result to its BioHub sequence id
for mirna_result in results2995:
  target_seq = mirna_result.target_seq
  sid = registrar.registerSequenceByBlast(
    target_seq,
    "Human",            # species
    34,                 # genome build
    "miRNA",            # sequence type
    "found a microRNA", # description
    "diane")            # user
  biohub_link[mirna_result] = sid

Perform Query

A microRNA always targets part of a gene, so let's check that each registered target is actually contained in a gene.

# Find what gene we're contained in
from BioHub.BioHubAPI import SpatialQuery
for result, registered_sid in biohub_link.items():
  sidList = SpatialQuery.getSidsContainingSid(
              registered_sid,
              seqTypes=['gene'],
              strand=SpatialQuery.STRAND_BOTH)
  # check every target, not just the last one
  assert len(sidList) != 0
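
To make the idea of a containment query concrete, here is a small stand-alone sketch over plain in-memory intervals. It does not use or reproduce BioHub's implementation: the Feature tuple, the function name, and the coordinates are all hypothetical and chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a registered location; not a BioHub type.
Feature = namedtuple("Feature", "name start stop strand seq_type")

def features_containing(query, features, seq_types=None, same_strand=False):
    """Return the features whose span fully contains the query span."""
    hits = []
    for feat in features:
        if seq_types is not None and feat.seq_type not in seq_types:
            continue  # wrong annotation type
        if same_strand and feat.strand != query.strand:
            continue  # caller asked for same-strand hits only
        if feat.start <= query.start and query.stop <= feat.stop:
            hits.append(feat)
    return hits

# Made-up annotations on one chromosome.
annotations = [
    Feature("geneA", 1000, 5000, "+", "gene"),
    Feature("geneB", 8000, 9000, "-", "gene"),
    Feature("motif1", 1200, 1210, "+", "motif"),
]
target = Feature("mirna_target", 4200, 4232, "+", "miRNA")
containing = features_containing(target, annotations, seq_types=["gene"])
```

A linear scan like this is fine for a handful of features; at genome scale a database index or interval tree would normally replace the loop.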

Perhaps we might want to find the next upstream gene:

nextGeneSid = SpatialQuery.getSidNextToSid(
                result_sid,
                searchDirection = SpatialQuery.DIRECTION_UPSTREAM,
                seqTypes=['gene'],
                maxBP=10000,
                strand=SpatialQuery.STRAND_SAME)

Or we might want to find all annotations within 10 kbp downstream:

downstreamSIDs = SpatialQuery.getSidsNextToSid(
                   result_sid,
                   downstreamBp=10000)

Perform Query

Search for motifs (regulatory elements) near our microRNA.

regSIDs = SpatialQuery.getSidsNextToSid(
            result_sid,
            upstreamBp=10000, downstreamBp=10000,
            seqTypes=['motif', 'binding_site',
                      'conserved'],
            strand=SpatialQuery.STRAND_BOTH,
            inclusive=True, # include us
            # only include hits fully inside the window
            overlap=SpatialQuery.OVERLAP_EXACTLY_IN)
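
The upstream and downstream distances in a query like this are relative to the strand of the feature. The sketch below shows one common way to turn them into a coordinate window and to test whether a hit lies fully inside it. The function names and the "+"/"-" strand encoding are assumptions of this sketch, and whether BioHub computes its windows this way is not confirmed here.

```python
def flanking_window(start, stop, strand, upstream_bp=0, downstream_bp=0):
    """Return (window_start, window_stop) around a feature.

    By the usual convention, upstream means lower coordinates for a
    forward ("+") feature and higher coordinates for a reverse ("-") one.
    """
    if strand == "+":
        return max(0, start - upstream_bp), stop + downstream_bp
    if strand == "-":
        return max(0, start - downstream_bp), stop + upstream_bp
    raise ValueError("strand must be '+' or '-'")

def fully_inside(feat_start, feat_stop, window):
    """True when the feature lies entirely within the window."""
    window_start, window_stop = window
    return window_start <= feat_start and feat_stop <= window_stop

# Same feature coordinates, opposite strands, 10 kbp upstream only.
plus_window = flanking_window(20000, 20032, "+", upstream_bp=10000)
minus_window = flanking_window(20000, 20032, "-", upstream_bp=10000)
```

The two windows extend in opposite directions along the chromosome even though both were requested as "upstream".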

Advanced BioHub Workflow

Future Work

Acknowledgments

Core Development
  • Brandon King
  • Joe Roden
  • Cory Tobin
  • Matthew Goldsbury
  • Diane Trout

PI
  • Barbara Wold

Project Page: http://woldlab.caltech.edu/biohub/

Support

  • Department of Energy
  • NIH GMS
  • NASA
  • Moore Minority Scholars Program