Author: | Diane Trout |
---|
This presentation and associated material is available at http://woldlab.caltech.edu/biohub/scipy2006/
>chr1 taaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta accctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac cctaacccaaccctaaccctaaccctaaccctaaccctaaccctaacccc taaccctaaccctaaccctaaccctaacctaaccctaaccctaaccctaa ccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaa ccctaaaccctaaccctaaccctaaccctaaccctaaccccaaccccaac cccaaccccaaccccaaccccaaccctaacccctaaccctaaccctaacc ctaccctaaccctaaccctaaccctaaccctaaccctaacccctaacccc taaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccct aaccctaaccctaaccctcgcggtaccctcagccggcccgcccgcccggg
That was the first 500 DNA bases of human chromosome 1, and there are about 3.2 billion more just like those.
And that's for one species.
Order of genome size in millions of bases (Mb).
Class | Genome size |
---|---|
Wheat | 15000 |
Mammals | 1000 |
Worm, Fruit fly | 100 |
Yeast | 10 |
Bacteria | 1 |
|
0.580 |
The purpose of BioHub is to
Since there are two strands to a DNA molecule but one strand completely determines the other, when storing sequence we need only store one strand.
However when storing a location we need to indicate if we are stored on the "forward" or "reverse" strand.
Human 34.0
To illustrate how BioHub can be used for a large scale genomic analysis we'll construct a simple example.
Imagine we have the results from some microRNA scanning programs:
class QueryFasta: def __init__(self, query_fasta): class miRNATarget: def __init__(self, mirna_name, binding, query_seq, target_seq):
microRNAs are one regulator of gene expression.
Start with a simple "database":
# sequence fed to target finder gene2995 = QueryFasta(">GeneID:2995\nCAC...") results2995 = [] results2995.append( miRNAResult("hsa-mir-140", #name -21.2, # strength gene2995, # seq searched "CAAGA..." # seq found ))
# sequence fed to microRNA target finder gene2995 = QueryFasta(""">GeneID:2995 CACTGCATTTCCCTTTACCAACTAGCGCTGGGAGCACTGGACACTTAAA TCCTCATCTGTCCTCCTTTCCTGTAAATAAAAGCCCTTCTATCCA""") results2995=[] results2995.append( # result, name, binding strength, # query sequence, and miRNA target miRNAResult("hsa-mir-140", # result name -21.2, # binding strength gene2995, # sequence we searched # sequence we found "CAAGATATTACCATGTACATGGTACCACCATC" )) # store another sequence results2995.append( miRNAResult("hsa-mir-206", -20.4, gene2995, "TCTTACCCATGAATGTGCACTACCTACATTTT")) cPickle.dump(results2995, "gene2995_mirna.pickle")
registerSequenceGivenLocation(self, locationList, # (start,stop) description, # desc. of run user, # username sequenceType, # e.g. CDS, Intron contig=None, # Contig object species=None, # Name of species build=None, # Genome version accession=None, # NCBI accession gi=None, # NCBI gi id=None, # chromosome names reverseComplement=False)
You don't need specify all of these at the same time.
Register by BLAST
registerSequenceByBlast(self, sequenceString, # search sequence species, # Name of species build, # Genome version seqType, # Desc of seq desc, # Desc of Registration user, # username header=None) # optional fasta header
Throws an exception when more than one location is found.
BLAST is a sequence (string) search tool that allows for mismatches, insertions and deletions.
Because of this inexact matching in larger genomes you need 20-40 bases to avoid purely random matches.
this code will attempt to find the exact genomic location for our miRNAs:
from BioHub.Util.BlastDB import BlastDB from BioHub.Util.BlastHandler import NMismatchesBlastHandler blast_filter = NMismatchesBlastHandler(0) # create a reference to a blast database human_db = BlastDB("Homo sapiens build 34") # go run blast blast_handle = human_db.blastBySequence(gene2995.sequence) # parse the miserable BioPython blast handler results matches = blast_filter.getMatchesGivenBlastFileHandle(blast_handle) if len(matches) != 1: raise SequenceNotFound("too many copies of query sequence found") # convert high scoring pair to location # FIXME: this example got to complicated and this hasn't been # FIXME: implemented yet. Check for updates on our website. contig, start, stop, strand = hspToLocation(matches[0][1]) seq2995 = gene2995.sequence for mirna_result in results2995: target_seq = mirna_result.target_seq reverseCompliment = false target_start = seq2995.find(target_seq) # figure out which strand we're on if target_start == -1: target_reverse_seq = reverse_compliment(target_seq) target_start = seq2995.find(target_reverse_seq) reverseCompliment = true if target_start == -1: raise SequenceNotFound("I give up") target_start = start+target_start target_stop = start+target_start+len(target_seq) sid = genome.RegisterSequenceByLocation() biohub_link[mirna_result] = sid
registrar = RegisterSequence() for mirna_result in results2995: target_seq = mirna_result.target_seq sid = registrar.registerSequenceByBlast( target_seq, "Human", 34, "miRNA", "found a microRNA", "diane") biohub_link[mirna_result] = sid
from BioHub.BioHubAPI.RegisterSequence \ import RegisterSequence registrar = RegisterSequence() for mirna_result in results2995: target_seq = mirna_result.target_seq sid = registrar.registerSequenceByBlast( target_seq, "Human", 34, "miRNA", "found a microRNA", "diane") biohub_link[mirna_result] = sid
a microRNA always targets part of a gene, so lets check to make sure we actually are contained in a gene.
sidList = SpatialQuery.getSidsContainingSid( registered_sid, seqTypes=['gene'], strand=SpatialQuery.STRAND_BOTH) assert( len(sidList) != 0 )
# Find what gene we're contained in from BioHub.BioHubAPI import SpatialQuery for result, registered_sid in biohub_link.items(): sidList = SpatialQuery.getSidsContainingSid( registered_sid, seqTypes=['gene'], strand=SpatialQuery.STRAND_BOTH) assert( len(sidList) != 0 )
Perhaps we might want to find the next upstream gene:
nextGeneSid = SpatialQuery.getSidNextTOSid( result_sid, searchDirection = SpatialQuery.DIRECTION_UPSTREAM, seqTypes=['gene'], maxBP=10000, strand=SpatialQuery.STRAND_SAME)
Or we might want to find all annotations 10Kbps downstream:
downstreamSIDs = SpatialQuery.getSidsNextToSid( result_sid, downstreamBp=10000)
Search for Motifs (regulatory elements) near our microRNA.
regSIDs = getSidsNextToSid( result_sid, upstreamBp=10000, downstreamBp=10000, seqTypes=['motif', 'binding_site', 'conserved'], strand=STRAND_BOTH, inclusive=True, # include us # only include fully in overlap=OVERLAP_EXACTLY_IN)
Core Development | PI |
---|---|
|
|
|
|
|
|
|
|
|
Project Page: http://woldlab.caltech.edu/biohub/
Support
- Department of Energy
- NIH GMS
- NASA
- Moore Minority Scholars Program