doc/bioinfo_jc/bioinfo-presentation.rst

   1 .. include:: <s5defs.txt>
   2
   3 =====
   4 Mussa
   5 =====
   6
   7 :Authors: Diane Trout
   8
   9 .. The contents of this directory contain the source
  10    for a presentation for the Caltech Bioinformatics Journal club.
  11
  12 .. footer:: Caltech Bioinformatics Journal Club
  13
  14 What is Mussa
  15 -------------
  16
  17 .. class:: small
  18
  19   Mussa is a tool to search for conserved regions between several
  20   sequences. Hopefully regions detected as conserved will
  21   highlight potentially important DNA sequence features such as
  22   cis-regulatory modules, microRNA genes, and exons.
  23
  24   Mussa extends previous 2-way sequence comparison to N sequences.
  25
  26 Family Tree
  27 -----------
  28
  29 .. class:: small
  30
  31   Family Relations and Mussa started using the same sequence
  32   comparison algorithm but developed in different directions.
  33
  34     .. image:: familytree.png
  35         :alt: Gratuitous software family tree
  36
  37   `Family Relations`_ focused on providing a robust usable piece
  38   of software.
  39
  40   Mussa focused on the N-way algorithm.
  41
  42   .. _`Family Relations`: http://cartwheel.caltech.edu/
  43
  44 Motivation
  45 ----------
  46
  47 .. class:: small
  48
  49   The hope is that conservation will highlight elements that are important.
  50   However, it (by definition) only shows elements in common.
  51
  52   For instance, although a two-sequence comparison between a Human and Fugu
  53   muscle gene might show important elements of muscle, it would lose any
  54   mammal-specific elements.
  55
  56   But a two sequence comparison between Mouse and Human might have too
  57   much in common to be useful.
  58
  59
  60 Motivation: Human vs. Fugu
  61 --------------------------
  62
  63 .. class:: small
  64
  65   .. image:: HuFu.png
  66
  67 Motivation: Human vs. Mouse
  68 ---------------------------
  69
  70 .. class:: small
  71
  72   .. image:: HuMo.png
  73
  74 Motivation
  75 ----------
  76
  77 .. class:: small
  78
  79   The hope is that by requiring conservation in multiple, more closely related
  80   species one can achieve the purification of the long distance comparison
  81   while still allowing elements that are important to those more closely
  82   related species to remain.
  83
  84 Motivation: Mammals
  85 -------------------
  86
  87 .. class:: small
  88
  89   .. image:: HuCoDoMoRa.png
  90
  91 Algorithm
  92 ---------
  93
  94 .. class:: small
  95
  96   To compute a result Mussa uses these algorithms to perform the N-way
  97   filtering.
  98
  99     * Seqcomp (determines the pairwise list of "matches")
 100     * Transitivity Test (filters the matches)
 101
 102 Seqcomp
 103 -------
 104
 105 .. class:: small
 106
 107   The original seqcomp comparison uses a refinement of a fairly simple
 108   algorithm to compare two sequences.
 109
 110   Given a window of size W, a match threshold, and sequences S[0] and S[1]::
 111
 112      for x in range(len(S[0])-W+1):
 113        for y in range(len(S[1])-W+1):
 114          match = 0  # matching bases in this window pair
 115          for i in range(W):
 116            if S[0][x+i] == S[1][y+i]:
 117              match = match + 1
 118          if match >= threshold:
 119            save_indices(x,y)
 120
 121   The algorithm actually being used only needs to compare the base that
 122   "slid into" the window and account for the base that "slid out".
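
A minimal sketch of the "slid in / slid out" idea, next to the naive version it refines. This is not the Mussa source: the function names and the diagonal-by-diagonal walk are assumptions made for illustration. Both functions return the same list of window start positions.

```python
def seqcomp_naive(s0, s1, window, threshold):
    """Recount every window pair from scratch."""
    hits = []
    for x in range(len(s0) - window + 1):
        for y in range(len(s1) - window + 1):
            matches = sum(s0[x + i] == s1[y + i] for i in range(window))
            if matches >= threshold:
                hits.append((x, y))
    return hits


def seqcomp_sliding(s0, s1, window, threshold):
    """Walk each diagonal, updating the match count as the window slides."""
    n0, n1 = len(s0), len(s1)
    if window > n0 or window > n1:
        return []
    hits = []
    # A diagonal is the set of window starts (x, y) with y - x constant.
    for diag in range(-(n0 - window), n1 - window + 1):
        x, y = (-diag, 0) if diag < 0 else (0, diag)
        # Count the first window on this diagonal the slow way.
        matches = sum(s0[x + i] == s1[y + i] for i in range(window))
        while True:
            if matches >= threshold:
                hits.append((x, y))
            if x + window >= n0 or y + window >= n1:
                break
            # Drop the base pair that slid out, add the one that slid in.
            matches -= int(s0[x] == s1[y])
            matches += int(s0[x + window] == s1[y + window])
            x += 1
            y += 1
    return sorted(hits)
```

After the first window on a diagonal, each step costs two comparisons instead of W.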
 123
 124 Seqcomp
 125 -------
 126
 127 .. class:: small
 128
 129   Assume that in this case we need 3 matches out of 4
 130
 131     .. image:: 4bp_window_no_match.png
 132
 133   In this case there is only one.
 134
 135 Seqcomp
 136 -------
 137
 138 .. class:: small
 139
 140    Assume that in this case we need 3 matches out of 4
 141
 142      .. image:: 4bp_window_match.png
 143
 144    However, now that we have slid over one position there are 3 matches,
 145    and so we would record 0, 5.
 146
 147 Seqcomp
 148 -------
 149
 150 .. class:: small
 151
 152
 153   Once one pass is complete one of the sequences is reverse complemented
 154   and the process is repeated.
 155
 156   .. container:: incremental
 157
 158      When extending to more than two sequences, mussa needs to compare
 159
 160      (N * (N-1)) sequences
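
A small sketch of the reverse-complement step and of one way to read the N * (N-1) figure. The assumption here, not confirmed by the Mussa source, is that every unordered pair of sequences gets one forward pass and one reverse-complement pass. The names `reverse_complement` and `comparison_passes` are made up for this example.

```python
from itertools import combinations

# Complement table for unambiguous DNA bases; other characters pass through.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def comparison_passes(names):
    """List one forward and one reverse-complement pass per unordered pair."""
    passes = []
    for a, b in combinations(names, 2):
        passes.append((a, b, "forward"))
        passes.append((a, b, "reverse-complement"))
    return passes
```

Under that assumption, four sequences give 6 pairs and 12 passes, which is 4 * 3.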
 161
 162 Transitivity Test
 163 -----------------
 164
 165 .. class:: small
 166
 167   There are several algorithms for comparing multiple sequences.
 168
 169   * Require transitivity, e.g. if A = B, and B = C, then A = C
 170   * "Radial" only tests matches between any number of query sequences
 171     and a single reference sequence. A = B, A = C, but B ?= C
 172   * "Entropy" (an experimental comparison that Tristan was working on)
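
To make the difference between the first two options concrete, here is a hedged three-sequence sketch. It assumes pairwise matches are stored as sets of (position, position) tuples, which is an illustrative data layout and not necessarily the one Mussa uses. The function names are hypothetical.

```python
def radial_filter(ab, ac):
    """Keep (a, b, c) where reference window a matches both b and c."""
    return sorted((a, b, c) for (a, b) in ab for (a2, c) in ac if a == a2)


def transitive_filter(ab, ac, bc):
    """Like radial_filter, but also require that b and c match each other."""
    return [(a, b, c) for (a, b, c) in radial_filter(ab, ac) if (b, c) in bc]


# Hypothetical pairwise matches between sequences A, B and C.
ab = {(0, 5), (10, 20)}
ac = {(0, 7), (10, 30)}
bc = {(5, 7)}

radial = radial_filter(ab, ac)
transitive = transitive_filter(ab, ac, bc)
```

With this made-up data the radial test keeps both paths, while the transitive test drops the one where B and C do not match each other.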
 173
 174 Test Transitivity
 175 -----------------
 176
 177 .. class:: small
 178
 179   .. image:: 4way_trans.png
 180
 181
 182 Limits
 183 ------
 184
 185 .. class:: small
 186
 187   One of the weaknesses with the current implementation is that the
 188   transitivity filtering step involves a combinatorial explosion as it
 189   compares every possible path.
 190
 191   The parameters that influence the number of matches found are:
 192   repeat masking the sequence, how closely related the two sequences
 193   are, the length of the sequence, and the stringency of the seqcomp
 194   threshold.
 195
 196 Limits
 197 ------
 198
 199 .. class:: small
 200
 201   Additionally the types of elements found are influenced by the
 202   window size and base-pair threshold.
 203
 204   For instance a 6 base pair binding site won't be detected when using
 205   a 30 base pair window size.
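
A toy illustration of that window-size effect. The sequences, the 6-base site, and the 20-of-30 threshold are all invented for the example. The `seqcomp` function is the naive window comparison, not Mussa's implementation.

```python
def seqcomp(s0, s1, window, threshold):
    """Naive window comparison: return (x, y) starts meeting the threshold."""
    hits = []
    for x in range(len(s0) - window + 1):
        for y in range(len(s1) - window + 1):
            matches = sum(s0[x + i] == s1[y + i] for i in range(window))
            if matches >= threshold:
                hits.append((x, y))
    return hits


# A shared 6-base site surrounded by flanks that never match each other.
site = "TGGTGT"
s0 = "A" * 12 + site + "A" * 12
s1 = "C" * 12 + site + "C" * 12

short_hits = seqcomp(s0, s1, window=6, threshold=6)
long_hits = seqcomp(s0, s1, window=30, threshold=20)
```

The 6-base window finds the site at (12, 12). The 30-base window sees only 6 matching bases out of 30, so it reports nothing.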
 206
 207 Usage
 208 -----
 209
 210 .. class:: small
 211
 212   Currently I have two classes of target user for mussa.
 213
 214     * Computationally savvy user (AKA me)
 215     * The "typical" biologist (AKA my PI)
 216
 217 Tutorial
 218 --------
 219
 220   Brandon has been working on a tutorial for the GUI
 221   which includes a section on how we extract sequence out of UCSC.
 222
 223
 224 Command-Line Features
 225 ---------------------
 226
 227 .. class:: small
 228
 229   * Command line::
 230
 231       $ mussagl --help
 232       --run-analysis  arg   run an analysis
 233                             defined by the mussa
 234                             parameter file
 235       --view-analysis arg   load a previously run
 236                             analysis
 237       --no-gui              terminate without viewing
 238                             an analysis
 239
 240 Command-Line Features
 241 ---------------------
 242
 243 .. class:: small
 244
 245    * Parameter file::
 246
 247        ANA_NAME mck3test
 248        APPEND_WIN true
 249        APPEND_THRES true
 250
 251        SEQUENCE seq/mouse_mck_pro.fa
 252        ANNOTATION mm_mck3test.annot
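
For the computationally savvy user, a parameter file like the one above is easy to inspect from a script. This is a sketch that assumes each non-blank line is a keyword followed by a value. It is not Mussa's own parser, and `parse_parameters` is a hypothetical helper.

```python
def parse_parameters(text):
    """Split each non-blank line into a (keyword, value) pair, in file order."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        entries.append((key, value.strip()))
    return entries


example = """\
ANA_NAME mck3test
APPEND_WIN true
APPEND_THRES true

SEQUENCE seq/mouse_mck_pro.fa
ANNOTATION mm_mck3test.annot
"""

entries = parse_parameters(example)
```

A list of pairs is used instead of a dictionary so that the order of the lines is kept.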
 253
 254 Command-Line Features
 255 ---------------------
 256
 257 .. class:: small
 258
 259   * Annotation File::
 260
 261       [Seq name]
 262       start stop name type
 263       >name
 264       AGCGAAA
 265
 266   * [Seq name] is an optional name specifier.
 267   * The "alignment" algorithm used for sequence-specified annotations
 268     is currently just using the motif search, so it only accepts
 269     IUPAC codes and doesn't handle in-dels.
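
A rough sketch of what an IUPAC-aware search without in-dels looks like. The code table is the standard IUPAC nucleotide alphabet. The `find_motif` function is an illustration under that assumption and not Mussa's motif search.

```python
# Standard IUPAC nucleotide codes and the bases each one accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_motif(motif, seq):
    """Return start positions where every motif code accepts the base.

    Positions are compared one-to-one, so insertions and deletions
    are never considered.
    """
    motif, seq = motif.upper(), seq.upper()
    hits = []
    for start in range(len(seq) - len(motif) + 1):
        if all(seq[start + i] in IUPAC[code] for i, code in enumerate(motif)):
            hits.append(start)
    return hits
```

For example, `find_motif("RGCG", "AGCGTGGCG")` accepts both `AGCG` and `GGCG` because `R` stands for A or G.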
 270
 271 GUI Features
 272 ------------
 273
 274 .. class:: small
 275
 276    * The Create Analysis menu option provides the same options
 277      as the parameter file.
 278
 279    .. image:: ../manual/images/define_analysis.png
 280
 281 GUI Features
 282 ------------
 283
 284 .. class:: small
 285
 286    There isn't a GUI for describing large annotations, though.
 287    (The motif editor can be used this way, but there are issues.)
 288
 289
 290 GUI Features
 291 ------------
 292
 293 .. class:: small
 294
 295    The Mussa GUI can:
 296
 297      * Display sequence with highlighted annotation regions
 298      * Search for motifs in these sequences
 299      * Show a base-pair alignment of a seqcomp "match"
 300      * Copy sequence regions
 301      * Create a new analysis using a subselection of one analysis
 302        and different parameters.
 303
 304 GUI
 305 ---
 306
 307 .. class:: small
 308
 309   <demo>
 310
 311 Finish
 312 ------
 313
 314 .. class:: small
 315
 316 Mussa has been developed by:
 317
 318   * Tristan DeBuysscher
 319   * Diane Trout
 320   * Brandon King
 321   * Nora Mullaney
 322
 323 And has been influenced by:
 324
 325   * C. Titus Brown
 326   * Erich Schwarz
 327   * and Barbara Wold
 328
 329   :tiny:`and as I stepped in fairly late in Mussa's life, there could easily
 330   be others.`