.. include:: .. footer: Caltech - SciPy 2006 ====== BioHub ====== :Author: Diane Trout .. class:: small This presentation and associated material is available at http://woldlab.caltech.edu/biohub/scipy2006/ .. things to fix .. color code, syntax highlighting .. DNA .. miRNA, many to one mapping .. What's blast? .. why does short sequence fail? it uses gapped alignment which .. matches loosely enough that you'll find matches by random chance .. for 20..40 bp .. advantage: there is a common reference frame to attach meaning to Motivation: Chr 1 ------------------------ :: >chr1 taaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta accctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac cctaacccaaccctaaccctaaccctaaccctaaccctaaccctaacccc taaccctaaccctaaccctaaccctaacctaaccctaaccctaaccctaa ccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaa ccctaaaccctaaccctaaccctaaccctaaccctaaccccaaccccaac cccaaccccaaccccaaccccaaccctaacccctaaccctaaccctaacc ctaccctaaccctaaccctaaccctaaccctaaccctaacccctaacccc taaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccct aaccctaaccctaaccctcgcggtaccctcagccggcccgcccgcccggg Motivation: Rest of human ------------------------- That was the first 500 DNA bases of human chromosome 1, and there are about 3.2 billion more just like those. And that's for one species. .. class:: incremental * Current estimates put humans at about 25,000 genes Motivation: Genome sizes ------------------------ Order of genome size in millions of bases (Mb). .. class:: small =============== ============= Class Genome size =============== ============= Wheat 15000 Mammals 1000 Worm, Fruit fly 100 Yeast 10 Bacteria 1 M. Genitalium 0.580 =============== ============= .. class:: incremental M. Genitalium has about 470 genes Motivation: sequence growth --------------------------- .. fixme not sure if this is really needed? .. image:: genbankgrowth.jpg :height: 450px :width: 90% :target: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html .. class:: tiny http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html Motivation: Genome Browser -------------------------- .. image:: ucsc.png :align: center :target: http://genome.ucsc.edu/cgi-bin/hgGateway Introducing: BioHub ------------------- The purpose of BioHub is to * provide a python interface for large-scale genomic analyses * provide a database to link diverse annotation sources to specific genome locations. * (AKA local or remote databases) Core Concepts ------------- * A BioHub Genome is a collection of `fasta `_ files storing each chromosome. * A genome is named by species and version number. * A location is a genome + chromosome + start + stop + strand * We always return the same sequence ID (SID) for the same location. .. class:: handout Since there are two strands to a DNA molecule but one strand completely determines the other, when storing sequence we need only store one strand. However when storing a location we need to indicate if we are stored on the "forward" or "reverse" strand. BioHub Genome ------------- Human 34.0 .. image:: sid.png :align: center Example ------- To illustrate how BioHub can be used for a large scale genomic analysis we'll construct a simple example. * Create an annotation database * Link the annotations via BioHub * Locate sequence and other annotations "near" our annotations. Create Sample Database ----------------------- Imagine we have the results from some microRNA scanning programs:: class QueryFasta: def __init__(self, query_fasta): class miRNATarget: def __init__(self, mirna_name, binding, query_seq, target_seq): .. class:: small microRNAs are one regulator of gene expression. Create Sample Database ---------------------- Start with a simple "database":: # sequence fed to target finder gene2995 = QueryFasta(">GeneID:2995\nCAC...") results2995 = [] results2995.append( miRNAResult("hsa-mir-140", #name -21.2, # strength gene2995, # seq searched "CAAGA..." # seq found )) .. class:: handout :: # sequence fed to microRNA target finder gene2995 = QueryFasta(""">GeneID:2995 CACTGCATTTCCCTTTACCAACTAGCGCTGGGAGCACTGGACACTTAAA TCCTCATCTGTCCTCCTTTCCTGTAAATAAAAGCCCTTCTATCCA""") results2995=[] results2995.append( # result, name, binding strength, # query sequence, and miRNA target miRNAResult("hsa-mir-140", # result name -21.2, # binding strength gene2995, # sequence we searched # sequence we found "CAAGATATTACCATGTACATGGTACCACCATC" )) # store another sequence results2995.append( miRNAResult("hsa-mir-206", -20.4, gene2995, "TCTTACCCATGAATGTGCACTACCTACATTTT")) cPickle.dump(results2995, "gene2995_mirna.pickle") Registration ------------ * Registration is how we tell BioHub to generate an ID for a particular chunk of genome. * registerSequenceGivenLocation * registerSequenceByBlast Registration ------------ .. container:: tiny :: registerSequenceGivenLocation(self, locationList, # (start,stop) description, # desc. of run user, # username sequenceType, # e.g. CDS, Intron contig=None, # Contig object species=None, # Name of species build=None, # Genome version accession=None, # NCBI accession gi=None, # NCBI gi id=None, # chromosome names reverseComplement=False) :small:`You don't need specify all of these at the same time.` Registration ------------ .. class:: small * Register by `BLAST `_ :: registerSequenceByBlast(self, sequenceString, # search sequence species, # Name of species build, # Genome version seqType, # Desc of seq desc, # Desc of Registration user, # username header=None) # optional fasta header Throws an exception when more than one location is found. BLAST ----- BLAST is a sequence (string) search tool that allows for mismatches, insertions and deletions. Because of this inexact matching in larger genomes you need 20-40 bases to avoid purely random matches. Link Annotations ---------------- * One problem with microRNAs is that they are typically the length where one would start to expect matches by random chance. * Thus we should search for the location of our longer "query" sequence first, and then find the offset for our miRNA target .. container:: handout this code will attempt to find the exact genomic location for our miRNAs:: from BioHub.Util.BlastDB import BlastDB from BioHub.Util.BlastHandler import NMismatchesBlastHandler blast_filter = NMismatchesBlastHandler(0) # create a reference to a blast database human_db = BlastDB("Homo sapiens build 34") # go run blast blast_handle = human_db.blastBySequence(gene2995.sequence) # parse the miserable BioPython blast handler results matches = blast_filter.getMatchesGivenBlastFileHandle(blast_handle) if len(matches) != 1: raise SequenceNotFound("too many copies of query sequence found") # convert high scoring pair to location # FIXME: this example got to complicated and this hasn't been # FIXME: implemented yet. Check for updates on our website. contig, start, stop, strand = hspToLocation(matches[0][1]) seq2995 = gene2995.sequence for mirna_result in results2995: target_seq = mirna_result.target_seq reverseCompliment = false target_start = seq2995.find(target_seq) # figure out which strand we're on if target_start == -1: target_reverse_seq = reverse_compliment(target_seq) target_start = seq2995.find(target_reverse_seq) reverseCompliment = true if target_start == -1: raise SequenceNotFound("I give up") target_start = start+target_start target_stop = start+target_start+len(target_seq) sid = genome.RegisterSequenceByLocation() biohub_link[mirna_result] = sid .. class:: incremental * but for simplicity we'll just pretend there is enough specificity Link Annotations ---------------- :: registrar = RegisterSequence() for mirna_result in results2995: target_seq = mirna_result.target_seq sid = registrar.registerSequenceByBlast( target_seq, "Human", 34, "miRNA", "found a microRNA", "diane") biohub_link[mirna_result] = sid .. class:: handout :: from BioHub.BioHubAPI.RegisterSequence \ import RegisterSequence registrar = RegisterSequence() for mirna_result in results2995: target_seq = mirna_result.target_seq sid = registrar.registerSequenceByBlast( target_seq, "Human", 34, "miRNA", "found a microRNA", "diane") biohub_link[mirna_result] = sid Perform Query ------------- a microRNA always targets part of a gene, so lets check to make sure we actually are contained in a gene. .. class:: small :: sidList = SpatialQuery.getSidsContainingSid( registered_sid, seqTypes=['gene'], strand=SpatialQuery.STRAND_BOTH) assert( len(sidList) != 0 ) .. container:: handout :: # Find what gene we're contained in from BioHub.BioHubAPI import SpatialQuery for result, registered_sid in biohub_link.items(): sidList = SpatialQuery.getSidsContainingSid( registered_sid, seqTypes=['gene'], strand=SpatialQuery.STRAND_BOTH) assert( len(sidList) != 0 ) Perhaps we might want to find the next upstream gene:: nextGeneSid = SpatialQuery.getSidNextTOSid( result_sid, searchDirection = SpatialQuery.DIRECTION_UPSTREAM, seqTypes=['gene'], maxBP=10000, strand=SpatialQuery.STRAND_SAME) Or we might want to find all annotations 10Kbps downstream:: downstreamSIDs = SpatialQuery.getSidsNextToSid( result_sid, downstreamBp=10000) Perform Query ------------- Search for Motifs (regulatory elements) near our microRNA. .. class:: small :: regSIDs = getSidsNextToSid( result_sid, upstreamBp=10000, downstreamBp=10000, seqTypes=['motif', 'binding_site', 'conserved'], strand=STRAND_BOTH, inclusive=True, # include us # only include fully in overlap=OVERLAP_EXACTLY_IN) Advanced BioHub Workflow ------------------------ .. class:: normal * Search genome for genes with specific regulatory binding site (like a microRNA) * Find a set of genes that interact * Extract likely microRNA target region * Feed likely targets to multiple microRNA prediction tools. * Anything look promising? * Put down the computer and perform wet bench experiments * Record results in BioHub * Next lab user begins a search with validated microRNA targets... Future Work ----------- * More unit tests * Streamline API * Consider using SQLAlchemy instead of our own ORM * Support migrating annotations between genome builds * Homology database * Gene ontology database * Link to expression database * BioHub Trac - http://woldlab.caltech.edu/cgi-bin/biohub/ticket/ Acknowledgments ---------------- .. class:: small .. class:: borderless ====================== ================= Core Development PI ====================== ================= * Brandon King * Barbara Wold * Joe Roden * Cory Tobin * Matthew Goldsbury * Diane Trout ====================== ================= Project Page: http://woldlab.caltech.edu/biohub/ **Support** * Department of Energy * NIH GMS * NASA * Moore Minority Scholars Program