1 .. include:: <s5defs.txt>
9 .. The contents of this directory contain the source
10 for a presentation for the Caltech Bioinformatics Journal club.
12 .. footer:: Caltech Bioinformatics Journal Club
Mussa is a tool to search for conserved regions between several
sequences. Hopefully, regions detected as conserved will
highlight potentially important DNA sequence features such as
cis-regulatory modules, microRNA genes, and exons.
24 Mussa extends previous 2-way sequence comparison to N sequences.
31 Family Relations and Mussa started using the same sequence
32 comparison algorithm but developed in different directions.
.. image:: familytree.png
   :alt: Gratuitous software family tree
37 `Family Relations`_ focused on providing a robust usable piece
40 Mussa focused on the N-way algorithm.
42 .. _`Family Relations`: http://cartwheel.caltech.edu/
The hope is that conservation will highlight elements that are important.
However, it (by definition) only shows elements in common.
For instance, though a two-sequence comparison between a Human and a Fugu
muscle gene might show important elements of muscle, it would lose any
mammal-specific elements.
56 But a two sequence comparison between Mouse and Human might have too
57 much in common to be useful.
60 Motivation: Human vs. Fugu
61 --------------------------
67 Motivation: Human vs. Mouse
68 ---------------------------
The hope is that by requiring conservation in multiple, more closely related
species one can achieve the purification of the long-distance comparison
while still allowing elements that are important to those more closely
related species to remain.
89 .. image:: HuCoDoMoRa.png
96 To compute a result Mussa uses these algorithms to perform the N-way
* Seqcomp (determines the pairwise list of "matches")
100 * Transitivity Test (filters the matches)
The original seqcomp comparison uses a refinement of a fairly simple
algorithm to compare two sequences.
Given a window of size W and sequences S[0] and S[1]::

  for x in range(len(S[0]) - W + 1):
    for y in range(len(S[1]) - W + 1):
      match = 0
      for i in range(W):
        if S[0][x+i] == S[1][y+i]:
          match += 1
      if match >= threshold:
        record(x, y)  # placeholder: store the window start positions
The algorithm actually being used only needs to compare the base that
"slid in" to the window and account for the base that "slid out".
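As a rough illustration of the "slid in" / "slid out" idea, here is a minimal Python sketch. It is not Mussa's actual code: the function names ``naive_matches`` and ``sliding_matches`` are made up for this slide, and I am assuming that a match is recorded as the pair of window start positions. The sliding version walks each diagonal and updates the score with one subtraction and one addition per step instead of rescoring the whole window.

```python
def naive_matches(a, b, window, threshold):
    """Score every window pair from scratch (the simple algorithm)."""
    hits = []
    for x in range(len(a) - window + 1):
        for y in range(len(b) - window + 1):
            score = sum(a[x + i] == b[y + i] for i in range(window))
            if score >= threshold:
                hits.append((x, y))
    return hits


def sliding_matches(a, b, window, threshold):
    """Walk each diagonal, updating the score as one base slides in and out."""
    if len(a) < window or len(b) < window:
        return []
    hits = []
    # A diagonal is identified by its offset d = y - x.
    for d in range(-(len(a) - window), len(b) - window + 1):
        x = max(0, -d)
        y = x + d
        # Score the first window on this diagonal in full.
        score = sum(a[x + i] == b[y + i] for i in range(window))
        while True:
            if score >= threshold:
                hits.append((x, y))
            if x + window >= len(a) or y + window >= len(b):
                break  # the next window would run off a sequence
            score -= a[x] == b[y]                    # base that slid out
            score += a[x + window] == b[y + window]  # base that slid in
            x += 1
            y += 1
    return sorted(hits)
```

Both functions should return the same list of positions; the sliding version just does far less work per window when W is large.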
129 Assume that in this case we need 3 matches out of 4
131 .. image:: 4bp_window_no_match.png
133 In this case there is only one.
140 Assume that in this case we need 3 matches out of 4
142 .. image:: 4bp_window_match.png
However, now that we have slid over one position there are 3 matches,
and so we would record 0, 5.
Once one pass is complete, one of the sequences is reverse complemented
and the process is repeated.
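For anyone who has not written one, a reverse complement can be sketched in a couple of lines of Python. This is only an illustration, not the routine Mussa uses; the name ``reverse_complement`` is my own, and I am assuming plain A/C/G/T input, with any other character (such as N) passed through unchanged.

```python
# Translation table for the four standard bases, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the whole string."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```

The second seqcomp pass would then compare one sequence against the reverse complement of the other.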
156 .. container:: incremental
158 When extending to more than two sequences, mussa needs to compare
160 (N * (N-1)) sequences
167 There are several algorithms for comparing multiple sequences.
* Require transitivity, e.g. if A = B, and B = C, then A = C
* "Radial" only tests matches between any number of query sequences
  and a single reference sequence. A = B, A = C, but B ?= C
* "Entropy" (an experimental comparison that Tristan was working on)
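To make the transitivity requirement concrete, here is a small Python sketch. It is my own simplification, not the Mussa implementation: the function name ``transitive_paths`` and the data layout (a dict mapping a sequence pair ``(i, j)`` with ``i < j`` to a set of matched window positions) are assumptions for illustration. A path of positions is kept only if every pair of sequences along it has a recorded match.

```python
def transitive_paths(pairwise, n):
    """Return position tuples (p0, ..., p[n-1]) that match in every pair.

    pairwise[(i, j)] is a set of (pos_i, pos_j) matches for i < j.
    """
    paths = sorted(pairwise[(0, 1)])
    for k in range(2, n):
        # Candidate positions in sequence k: anything matching sequence 0.
        candidates = sorted({pk for (_, pk) in pairwise[(0, k)]})
        extended = []
        for path in paths:
            for pk in candidates:
                # Transitivity: the new position must match every
                # earlier sequence on the path, not just sequence 0.
                if all((path[i], pk) in pairwise[(i, k)] for i in range(k)):
                    extended.append(path + (pk,))
        paths = extended
    return paths


matches = {
    (0, 1): {(0, 5), (10, 20)},
    (0, 2): {(0, 7), (10, 30)},
    (1, 2): {(5, 7)},
}
print(transitive_paths(matches, 3))  # [(0, 5, 7)]
```

In this sketch, a "radial" test would correspond to checking only ``i == 0`` in the ``all(...)`` condition, which would also keep the path (10, 20, 30) in the example above.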
179 .. image:: 4way_trans.png
187 One of the weaknesses with the current implementation is that the
188 transitivity filtering step involves a combinatorial explosion as it
189 compares every possible path.
The parameters that influence the number of matches found are:
repeat masking of the sequence, how closely related the two sequences
are, the length of the sequence, and the stringency of the seqcomp
201 Additionally the types of elements found are influenced by the
202 window size and base-pair threshold.
For instance, a 6-base-pair binding site won't be detected when using
a 30-base-pair window size.
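A quick back-of-the-envelope check of that claim, in Python. Everything here is made up for illustration: the site ``TGACGT``, the flanking sequence, and the 20-of-30 threshold are hypothetical values, not Mussa defaults.

```python
def window_score(a, b):
    """Count identical positions between two equal-length windows."""
    return sum(x == y for x, y in zip(a, b))


site = "TGACGT"                    # hypothetical 6 bp binding site
a = "A" * 12 + site + "A" * 12     # 30 bp window from sequence A
b = "C" * 12 + site + "C" * 12     # 30 bp window from sequence B

score_30 = window_score(a, b)      # only the site itself matches
score_6 = window_score(site, site)

print(score_30, score_30 >= 20)    # 6 False
print(score_6, score_6 >= 6)       # 6 True
```

A perfectly conserved 6 bp site contributes at most 6 matches, so on its own it cannot reach any threshold above 6 out of 30; the flanking sequence has to be conserved as well.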
Currently I have two classes of target user for Mussa.
214 * Computationally savvy user (AKA me)
215 * The "typical" biologist (AKA my PI)
220 Brandon has been working on a tutorial for the GUI
221 which includes a section on how we extract sequence out of UCSC.
224 Command-Line Features
225 ---------------------
232 --run-analysis arg run an analysis
235 --view-analysis arg load a previously run
237 --no-gui terminate without viewing
240 Command-Line Features
241 ---------------------
251 SEQUENCE seq/mouse_mck_pro.fa
252 ANNOTATION mm_mck3test.annot
254 Command-Line Features
255 ---------------------
* [Seq name] is an optional name specifier.
* The "alignment" algorithm used for sequence-specified annotations
  currently just uses the motif search, so it only accepts
  IUPAC codes and doesn't handle indels.
276 * The Create Analysis menu option provides the same options
277 as the parameter file.
279 .. image:: ../manual/images/define_analysis.png
However, there isn't a GUI for describing large annotations.
(The motif editor can be used this way, but there are issues.)
297 * Display sequence with highlighted annotation regions
298 * Search for motifs in these sequences
299 * Show a base-pair alignment of a seqcomp "match"
300 * Copy sequence regions
301 * Create a new analysis using a subselection of one analysis
302 and different parameters.
316 Mussa has been developed by:
318 * Tristan DeBuysscher
And has been influenced by:
329 :tiny:`and as I stepped in fairly late in Mussa's life, there could easily