Last updated: May 23rd, 2006

Updated to Mussagl build: 200 (update to 230 in progress)

Short History of Mussa
----------------------

Mussa Python/PMW Prototype
~~~~~~~~~~~~~~~~~~~~~~~~~~

Mussagl has been released open source under the `GPL v2
You have the option of building from source or downloading prebuilt
binaries. Most people will want the prebuilt versions.

* Mac OS X (binary or source)
* Windows XP (binary or source)

Mussagl binaries for OS X and Windows, as well as the source code, can
be downloaded from http://mussa.caltech.edu/.

Once you have downloaded the .dmg file, double-click it and follow
the install instructions.

FIXME: Mention how to launch the program.

Once you have downloaded the Mussagl installer, double-click the
installer and follow the install instructions.

To start Mussagl, launch the program from Start > Programs > Mussagl >

Currently we do not have a binary installer for Linux. You will have
to build from source. See the 'build from source' section below.

Instructions for building from source can be found `build page
<http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild>`_ on the

If you already have your data, you can skip ahead to the `Using

Let's say you have a gene of interest called 'SMN1' and you want to
know how well the sequence surrounding the gene is conserved across
multiple species. That is what we will do here: retrieve the DNA
sequence for SMN1 and prepare it for use in Mussa.

For more information about SMN1 visit `NCBI's OMIM
<http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=609682>`_.

UCSC Genome Browser Method
--------------------------

There are many methods of retrieving DNA sequence, but for this
example we will retrieve SMN1 through the UCSC genome browser located
at http://genome.ucsc.edu/.

.. image:: images/ucsc_genome_browser_home.png
   :alt: UCSC Genome Browser

The first step in finding SMN1 is to use the **Gene Sorter** menu
option, which I have highlighted in orange below:

.. image:: images/ucsc_menu_bar_gene_sorter.png
   :alt: Gene Sorter Menu Option

.. image:: images/ucsc_gene_sorter.png

We will start by looking for SMN1 in the **Human Genome** and **sorting by name similarity**.

.. image:: images/ucsc_gs_sort_name_sim.png
   :alt: Gene Sorter - Name Similarity

After you have selected **Human Genome** and **sorting by name similarity**, type *SMN1* into the search box.

.. image:: images/ucsc_gs_smn1.png

Press **Go!** and you should see the following page:

.. image:: images/ucsc_gs_found.png

Click on **SMN1** and you will be taken to the gene expression atlas

.. image:: images/ucsc_gs_genome_position.png
   :alt: Gene expression atlas

Click on **chr5 70,270,558** found in the **SMN1 row**, **Genome

Now we have found the location of SMN1 in the human genome!

.. image:: images/ucsc_gb_smn1_human.png
   :alt: Genome Browser - SMN1 (human)

FIXME: continue here.

Launch Mussagl. It should look similar to the screenshot below.

.. image:: images/opened.png

----------------------

Currently there are three ways to load a Mussa experiment.

1. `Create a new analysis`_
2. `Load a mussa parameter file`_ (.mupa)
3. `Load an analysis`_

Create a new analysis
~~~~~~~~~~~~~~~~~~~~~

To create a new analysis, select 'Define analysis' from the 'File'
menu. You should see a dialog box similar to the one below. For this
demo we will use the example sequences that come with Mussagl.

.. image:: images/define_analysis.png
   :alt: Define Analysis

1. **Give the experiment a name**. For this demo, we'll use
   'demo_w30_t20'. Mussa will create a folder with this name to store
   the analysis files in once it has been run.

2. Choose a `window size`_. For this demo **choose 30**.

3. Choose a threshold_. For this demo **choose 20**. See the
   Threshold_ section for more detailed information.

4. Choose the number of sequences_ you would like. For this demo

.. image:: images/define_analysis_step1a.png

Now click on the 'Browse' button next to the sequence input box and
then select the /examples/seq/human_mck_pro.fa file. Do the same in the
next two sequence input boxes, selecting mouse_mck_pro.fa and
rabbit_mck_pro.fa as shown below. Note that you can create annotation
files using the mussa `Annotation File Format`_ to add annotations to

.. image:: images/define_analysis_step2.png
   :alt: Choose sequences

Click the **create** button and in a few moments you should see
something similar to the following screenshot.

.. image:: images/demo.png

This analysis is now saved in a directory called **demo_w30_t20** in
the current working directory. If you close and reopen Mussagl, you
can reload the saved analysis. See `Load an analysis`_ section below

Load a mussa parameter file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you prefer, you can define your Mussa analysis using the Mussa
parameter file. See the `Parameter File Format`_ section for details
on creating a .mupa file.

Once you have created a .mupa file, load Mussagl and select the **File >
Load Mussa Parameters** menu option. Select the .mupa file and click

.. image:: images/load_mupa_menu.png
   :alt: Load Mussa Parameters

If you would like to see an example, you can load the
**mck3test.mupa** file in the examples directory that comes with

.. image:: images/load_mupa_dialog.png
   :alt: Load Mussa Parameters Dialog

To load a previously run analysis, open Mussagl and select the **File >
Load Analysis** menu option. Select an analysis **directory** and

.. image:: images/load_analysis_menu.png
   :alt: Load Analysis Menu

.. Screen-shot with numbers showing features.

.. image:: images/window_overview.png

1. `DNA Sequence (Black bars)`_

4. `Conservation tracks`_

6. `Zoom Factor`_ (Base pairs per pixel)

7. `Dynamic Threshold`_

8. `Sequence Information Bar`_

9. `Sequence Scroll Bar`_

DNA Sequence (black bars)
~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: images/sequence_bar.png

Each of the black bars represents one of the loaded sequences, in this
case the sequence around the gene 'MCK' in human, mouse, and rabbit.

FIXME: Should I mention the repeats here?

.. figure:: images/annotation.png

   Annotation shown in green on sequence bar.

Annotations can be included on any of the sequences using the `Load a
mussa parameter file`_ method of loading your sequences. You can
define annotations by location or by an exact sub-sequence, and you
may choose any color for display of the annotation; see the
`Annotation File Format`_ section for details.

Note: Currently there is no way to add annotations using the GUI (only
via the .mupa file). We plan to add this feature in the future, but it
likely will not make it into the first release.

.. figure:: images/motif.png

   Motif shown in light blue on sequence bar.

The only real difference between an annotation and a motif in Mussagl is
that you can define motifs from within the GUI. See the `Motifs`_
section for more information.

.. figure:: images/conservation_tracks.png
   :alt: Conservation Tracks

   Conservation tracks shown as red and blue lines between sequence

The **red lines** between the sequence bars represent conservation
between the sequences and **blue lines** represent **reverse
complement** conservation. The amount of sequence conservation shown
will depend on the relatedness of your sequences and the `dynamic
threshold`_ you are using. Sequences with lots of repeats will cause
major slowdowns in calculating the matches.

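
To make the distinction between red (forward) and blue (reverse
complement) tracks concrete, here is a minimal Python sketch of what a
reverse complement is. It is an illustration only and is not taken from
the Mussa source; the function name ``reverse_complement`` is
hypothetical.

```python
# Illustrative only: not Mussa's implementation.
# Map each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    # Complement every base, then reverse the result.
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```

A blue line therefore indicates that a window in one sequence matches
the reverse complement of a window in the other, rather than the window
as read left to right.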
.. image:: images/motif_toggle.png

Toggles motifs on and off. This will not turn annotations on and off.

Note: As of the current build (#200), this feature hasn't been

.. image:: images/zoom_factor.png

The zoom factor is the number of base pairs represented per
pixel. When you zoom in far enough, the display switches from a
black bar representing the sequence to the actual sequence (an ASCII
representation of it).

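
As a rough feel for the numbers, the arithmetic behind a
base-pairs-per-pixel zoom factor can be sketched as follows. This is
only the definition restated as code; the function name is hypothetical
and how Mussagl lays out its display internally is not described here.

```python
def pixels_for_sequence(length_bp: int, bp_per_pixel: float) -> float:
    """Width in pixels needed to draw length_bp at a given zoom factor."""
    if bp_per_pixel <= 0:
        raise ValueError("bp_per_pixel must be positive")
    return length_bp / bp_per_pixel


# A 30,000 bp sequence drawn at 10 bp per pixel needs 3,000 pixels.
print(pixels_for_sequence(30_000, 10))  # 3000.0
```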
.. image:: images/dynamic_threshold.png
   :alt: Dynamic Threshold

You can dynamically change the threshold, that is, how strong a match
must be before it is shown as conservation, using one of two options:

1. Number of base pair matches out of window size.

2. Percent base pair conservation.

See the Threshold_ section for more information.

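
The two options describe the same quantity in different units. A small
sketch of the conversion, assuming the percentage is simply matches
divided by window size (the function names are hypothetical, and the
exact rounding Mussagl applies is not documented here):

```python
import math


def matches_to_percent(matches: int, window: int) -> float:
    """Express 'matches out of window' as a percentage."""
    return 100.0 * matches / window


def percent_to_matches(percent: float, window: int) -> int:
    """Smallest whole number of matches that reaches the given percentage.

    Rounding up is an assumption made for this sketch.
    """
    return math.ceil(percent * window / 100.0)


print(round(matches_to_percent(20, 30), 1))  # 66.7
print(percent_to_matches(50, 30))            # 15
```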
Sequence Information Bar
~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: images/seq_info_bar.png
   :alt: Sequence Information Bar

The sequence information bars can be found on the left and right sides
of Mussagl. Next to each sequence you will find the following

1. Species (if it has been defined)
2. Total size of sequence
3. Current base pair position

.. image:: images/scroll_bar.png
   :alt: Sequence Scroll Bar

The scroll bar allows you to scroll through the sequence, which is
useful when you have zoomed in using the `zoom factor`_.

Currently annotations can be added to a sequence using the mussa
`annotation file format`_ and can be loaded by selecting the
annotation file when defining a new analysis (see `Create a new
analysis`_ section) or by defining a .mupa file pointing to your
annotation file (see `Load a mussa parameter file`_ section).

Load Motifs from File
*********************

It is possible to load motifs from a file, either one saved from a
previous run or a motif file you define yourself. See the `Motif File
Format`_ section for details.

To load a motif file, select the **Load Motif List** item from the
**File** menu and select a motif list file.

.. image:: images/load_motif.png
   :alt: Load Motif List

Note: Currently not implemented

Mussa has the ability to find lab motifs using the `IUPAC Nucleotide
Code`_ for defining a motif. To define a motif, select the **View > Edit
Motifs** menu item as shown below.

.. image:: images/view_edit_motifs.png
   :alt: "View > Edit Motifs" Menu

You will see a dialog box appear with a "set motifs" button and 10
rows for defining motifs and the color in which each will be displayed
on the sequence. By default, all 10 motifs start with white as their
color. In the image below, I changed the color from white to blue to
make it easier to see.

.. image:: images/motif_dialog_start.png

Now let's make a motif **'AT[C or G]CT'**. Using the `IUPAC Nucleotide
Code`_, type **'ATSCT'** into the first box as shown below.

.. image:: images/motif_dialog_enter_motif.png

Now choose a color for your motif by clicking on the colored area to
the left of the motif. In the image above, you would click on the blue
square, but by default the squares will be white. Remember to choose a
color that will show up well against the black bar in the background.

.. image:: images/color_chooser.png

Once you have selected the color for your motif, click on the 'set
motifs' button. Any matches Mussa finds for your motif will now show
up in the main Mussagl window.

.. image:: images/motif_dialog_bar_before.png
   :alt: Sequence bar before motif

.. image:: images/motif_dialog_bar_after.png
   :alt: Sequence bar after motif

View Mussa Alignments
----------------------

Mussagl allows you to zoom in on Mussa alignments by selecting the set
of alignment(s) of interest. To do this, move the mouse near the
alignment you are interested in viewing and then **PRESS** and
**HOLD** the **LEFT mouse button** and **drag the mouse** to the other
side of the conservation track so that you see a bounding box
overlapping the alignment(s) of interest and then **let go** of the

In the example below, I started by left clicking on the area marked by
a red dot (upper left corner of the bounding box) and dragging the mouse
to the area marked by a blue dot (lower right corner of the bounding
box) and letting go of the left mouse button.

.. image:: images/select_sequence.png
   :alt: Select Sequence

All of the lines which were not selected should be washed out as shown

.. image:: images/washed_out.png
   :alt: Tracks washed out

With a selection made, go to the **View** menu and select **View mussa alignment**.

.. image:: images/view_mussa_alignment.png
   :alt: View mussa alignment

You should see the alignment at the base-pair level as shown below.

.. image:: images/mussa_alignment.png
   :alt: Mussa alignment

---------------------------------

FIXME: Need to write this section

The threshold of an analysis is the minimum number of base-pair matches
that a window must meet in order to be kept as a match. Note that you
can vary the threshold from within Mussagl. For example, if you choose
a `window size`_ of **30** and a **threshold** of **20**, the mussa
n-way transitive algorithm will store all matches that are 20 out of
30 bp or better and pass them on to Mussagl. Mussagl will then allow
you to dynamically choose a threshold from 20 to 30 base pairs. A
threshold of 30 bp would only show 30 out of 30 bp matches. A
threshold of 20 bp would show all matches of 20 out of 30 bp or
better. If you would like to see results for matches lower than 20 out
of 30, you will need to rerun the analysis with a lower threshold.

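
The relationship between window size, stored threshold, and dynamic
threshold can be sketched with a deliberately naive pairwise
comparison. This is not the mussa n-way transitive algorithm; it only
compares two sequences, ignores reverse complements, and uses
hypothetical function names. It is meant to show why results below the
stored threshold cannot be recovered without rerunning.

```python
def window_matches(seq_a, seq_b, window, threshold):
    """Return (start_a, start_b, score) for every pair of ungapped
    windows that share at least `threshold` identical positions."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count positions where the two windows agree.
            score = sum(a == b for a, b in
                        zip(seq_a[i:i + window], seq_b[j:j + window]))
            if score >= threshold:
                hits.append((i, j, score))
    return hits


def apply_dynamic_threshold(hits, threshold):
    """Filter stored hits; only thresholds at or above the one used
    when the hits were computed can give complete results."""
    return [hit for hit in hits if hit[2] >= threshold]


stored = window_matches("ACGT", "ACGA", window=4, threshold=3)
print(stored)                              # [(0, 0, 3)]
print(apply_dynamic_threshold(stored, 4))  # []
```

Raising the dynamic threshold only discards hits that were already
stored, which is cheap. Lowering it below the stored threshold would
require hits that were never kept.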
The typical sizes people tend to choose are between 20 and 30. You
will likely need to experiment with this setting depending on your
needs and input sequence.

Mussa reads in sequences which are formatted in the fasta_
format. Mussa may take a long time to run (>10 minutes) if the total
sequence length is near 280 kb. Once mussa has run once, you can reload
previously run analyses.

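
Before starting a run it can help to check how much sequence you are
about to load. The sketch below counts bases per record in fasta text.
It is a standalone helper, not part of Mussa, and the function name is
hypothetical; the 280 kb figure is the one quoted above.

```python
def fasta_lengths(text):
    """Return {record name: number of sequence characters} for fasta text."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new record.
            name = line[1:].strip()
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


example = ">human\nACGTACGT\nACGT\n>mouse\nACGTAC\n"
lengths = fasta_lengths(example)
total = sum(lengths.values())
print(lengths, total)
if total >= 280_000:
    print("Total length is near or above 280 kb; expect a long run.")
```

To check real files, read each one with ``open(path).read()`` and sum
the totals across all sequences in the analysis.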
FIXME: We have learned more about how much sequence and how many to
put in Mussagl, this information should be documented here.

Parameter File Format
~~~~~~~~~~~~~~~~~~~~~

**File Format (.mupa):**

::

   # name of analysis directory and stem for associated files
   ANA_NAME <analysis_name>

   # if APPEND vars true, a _wXX and/or _tYY added to analysis name
   # where XX = WINDOW and YY = THRESHOLD
   # Highly recommended with use of command line override of WINDOW or THRESHOLD
   APPEND_WIN <true/false>
   APPEND_THRES <true/false>

   # how many sequences are being analyzed

   # first sequence info
   SEQUENCE <fasta_file_path>
   ANNOTATION <annotation_file_path>
   SEQ_START <sequence_start>

   # the second sequence info
   SEQUENCE <fasta_file_path>
   # ANNOTATION <annotation_file_path>
   SEQ_START <sequence_start>
   # SEQ_END <sequence_end>

   # third sequence info
   SEQUENCE <fasta_file_path>
   # ANNOTATION <annotation_file_path>

   # analysis parameters: command line args -w -t will override these

.. csv-table:: Parameter File Options:
   :header: "Option Name", "Value", "Default", "Required", "Description"
   :widths: 30 30 30 30 60

   "ANA_NAME", "string", "N/A", "true", "Name of analysis (also the name of the directory where the analysis will be saved)."
   "APPEND_WIN", "true/false", "?", "?", "Appends _w## to ANA_NAME"
   "APPEND_THRES", "true/false", "?", "?", "Appends _t## to ANA_NAME"
   "SEQUENCE_NUM", "integer", "N/A", "true", "The number of sequences"
   "SEQUENCE", "/fasta/filepath.fa", "N/A", "true", "Must define one sequence per SEQUENCE_NUM."
   "ANNOTATION", "/annotation/filepath.txt", "N/A", "false", "Optional annotation file. See the `annotation file format`_ section."
   "SEQ_START", "integer", "1", "false", "Optional index into fasta file"
   "SEQ_END", "integer", "1", "false", "Optional index into fasta file"
   "WINDOW", "integer", "N/A", "true", "`Window Size`_"
   "THRESHOLD", "integer", "N/A", "true", "`Threshold`_"

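
If you generate many analyses, writing the parameter file by hand gets
tedious. The sketch below assembles .mupa text from the option names in
the table above. It is an assumption-laden illustration: the function
name is hypothetical, the key order simply follows the format example,
the fasta paths are placeholders, and the optional APPEND, SEQ_START
and SEQ_END options are left out.

```python
def build_mupa(ana_name, fasta_paths, window, threshold, annotations=None):
    """Return .mupa text using one 'KEY value' pair per line.

    `annotations` optionally maps a fasta path to an annotation file path.
    """
    annotations = annotations or {}
    lines = [
        f"ANA_NAME {ana_name}",
        f"SEQUENCE_NUM {len(fasta_paths)}",
    ]
    for path in fasta_paths:
        lines.append(f"SEQUENCE {path}")
        if path in annotations:
            lines.append(f"ANNOTATION {annotations[path]}")
    lines.append(f"WINDOW {window}")
    lines.append(f"THRESHOLD {threshold}")
    return "\n".join(lines) + "\n"


# Placeholder paths; substitute the location of your own fasta files.
text = build_mupa(
    "demo",
    ["human_mck_pro.fa", "mouse_mck_pro.fa", "rabbit_mck_pro.fa"],
    window=30,
    threshold=20,
)
print(text)
```

Check the generated file against the format example before relying on
it, since this sketch has not been validated against Mussagl itself.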
Annotation File Format
~~~~~~~~~~~~~~~~~~~~~~

The first line in the file is the sequence name. Each line thereafter
is a **space** separated annotation.

* The annotation format now supports fasta sequences embedded in the
  annotation file as shown in the format example below. Mussagl will
  take this sequence and look for an exact match of this sequence in
  your sequences. If a match is found, it will label it with the name
  from the fasta header.

::

   <species/sequence_name>
   <start> <stop> <annotation_name> <annotation_type>
   <start> <stop> <annotation_name> <annotation_type>
   <start> <stop> <annotation_name> <annotation_type>
   <start> <stop> <annotation_name> <annotation_type>
   ACTGACTGACGTACGTAGCTAGCTAGCTAGCACG
   ACGTACGTACGTACGTAGCTGTCATACGCTAGCA
   TGCGTAGAGGATCTCGGATGCTAGCGCTATCGAT
   ACGTACGGCAGTACGCGGTCAGA
   <start> <stop> <annotation_name> <annotation_type>

::

   251 500 Glorp Glorptype
   751 1000 Glorp Glorptype
   1251 1500 Glorp Glorptype
   >My favorite DNA sequence
   1751 2000 Glorp Glorptype

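
For readers who want to check an annotation file before loading it,
here is a small parsing sketch. It is not Mussagl's parser. It assumes
that a line whose first two fields are integers is an interval
annotation, that a line starting with ``>`` opens an embedded fasta
record, and that any other line continues the current fasta sequence.
The function name, the sequence name ``Human``, and the short embedded
sequence are made up for the example.

```python
def parse_annotations(text):
    """Return (sequence_name, intervals, fasta_records).

    intervals: list of (start, stop, name, type)
    fasta_records: list of (header, sequence)
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    name = lines[0]
    intervals = []
    fasta = []  # each entry is [header, sequence]
    for line in lines[1:]:
        fields = line.split()
        if line.startswith(">"):
            fasta.append([line[1:].strip(), ""])
        elif len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            label = fields[2] if len(fields) > 2 else ""
            kind = fields[3] if len(fields) > 3 else ""
            intervals.append((int(fields[0]), int(fields[1]), label, kind))
        elif fasta:
            # Sequence data belonging to the most recent fasta header.
            fasta[-1][1] += line
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return name, intervals, [(h, s) for h, s in fasta]


sample = """Human
251 500 Glorp Glorptype
>My favorite DNA sequence
GATTACA
1751 2000 Glorp Glorptype
"""
print(parse_annotations(sample))
```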
.. _motif_file_format:

::

   <motif> <red> <green> <blue>

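
A motif line can be read and written with a few lines of Python. This
sketch assumes the four fields are separated by whitespace and that the
color components are numeric. The numeric scale of the colors is not
stated in this section, so the values are kept as floats without range
checking; the function names and the example values are hypothetical.

```python
def parse_motif_line(line):
    """Split '<motif> <red> <green> <blue>' into (motif, (r, g, b))."""
    motif, red, green, blue = line.split()
    return motif, (float(red), float(green), float(blue))


def format_motif_line(motif, rgb):
    """Inverse of parse_motif_line."""
    red, green, blue = rgb
    return f"{motif} {red} {green} {blue}"


motif, color = parse_motif_line("ATSCT 0.0 0.0 1.0")
print(motif, color)
print(format_motif_line(motif, color))
```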
IUPAC Nucleotide Code
~~~~~~~~~~~~~~~~~~~~~~

For your convenience, below is a table of the IUPAC Nucleotide Code.

The following table is table 1 from "Nomenclature for Incompletely
Specified Bases in Nucleic Acid Sequences" which can be found at
http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html.

====== ================= ===================================
Symbol Meaning           Origin of designation
====== ================= ===================================
S      G or C            Strong interaction (3 H bonds)
W      A or T            Weak interaction (2 H bonds)
H      A or C or T       not-G, H follows G in the alphabet
B      G or T or C       not-A, B follows A
V      G or C or A       not-T (not-U), V follows U
D      G or A or T       not-C, D follows C
N      G or A or T or C  aNy
====== ================= ===================================

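
To see how an IUPAC motif such as ``ATSCT`` expands, the sketch below
turns a motif into a regular expression and reports where it occurs.
The mapping uses the standard IUPAC nucleotide codes. The function
names are hypothetical, only the forward strand is searched, positions
are 0-based, and this is not how Mussagl itself performs motif
searches.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def motif_to_regex(motif):
    """Expand each IUPAC symbol into a character class."""
    return "".join(f"[{IUPAC[symbol]}]" for symbol in motif.upper())


def find_motif(motif, sequence):
    """Return 0-based start positions, including overlapping matches."""
    # The lookahead lets the regex engine report overlapping hits.
    pattern = re.compile(f"(?=({motif_to_regex(motif)}))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


print(motif_to_regex("ATSCT"))              # [A][T][CG][C][T]
print(find_motif("ATSCT", "GGATCCTAATGCTA"))  # [2, 8]
```

Here ``S`` matches either ``C`` or ``G``, so both ``ATCCT`` at position
2 and ``ATGCT`` at position 8 are reported.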
.. Define links below

.. _GPL: http://www.opensource.org/licenses/gpl-license.php
.. _wiki: http://mussa.caltech.edu
.. _build: http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild
.. _fasta: http://en.wikipedia.org/wiki/FASTA_format
.. _wpDnaMotif: http://en.wikipedia.org/wiki/DNA_motif