doc/manual/mussagl_manual.rst

   1 ==============
   2 Mussagl Manual
   3 ==============
   4 ---------------
   5 Brandon W. King
   6 ---------------
   7
   8 Last updated: Oct 27th, 2006
   9
  10 Documentation for Mussagl v1.0
  11
  12
  13 .. Things to add
  14         * New features / change log
  15         * (DONE) Comment out anything that isn't implemented yet.
  16         * (DONE) List of features that will be implemented in the future.
  17         * Look into the homology mapping of UCSC.
  18         * Add toggle to genomes.
  19         * Document why one FASTA record per region.
  20         * How to deal with the hazards of small utrs vis motif finder. (Add warning)
  21         * Add warning about saving FASTA file.
  22         * Add a general principles section near the top
  23                 * Using comparison algorithm which will pickup all repeats
  24                 * Add info about repeatmasking
  25                 * Checking upstream and downstream genes to make sure you are in the right regions.
  26         * Later on: look into Ensembl
  27         * Look into method of homology instead of blating.
  28         * Mention advantages of using mupa.
  29         * Mention the difference between using arrows and scroll bar
  30         * Document the color for motifs
  31         * Update for Mac user left-click
  32
  33         * Wormbase/Flybase/mirBASE tutorials
  34
  35
  36
  37 .. contents::
  38
  39 Status
  40 ======
  41
  42 ..
  43
  44   Major New Features
  45   .. ------------------
  46
  47   Change Log
  48   .. ----------
  49
  50   .. INSERT CHANGE LOG HERE
  51   .. END INSERT CHANGE LOG
  52
  53 Features to be Implemented
  54 --------------------------
  55
  56 For an up-to-date list of features to be implemented visit:
  57 http://woldlab.caltech.edu/cgi-bin/mussa/roadmap
  58
  59 Introduction
  60 ============
  61
  62
  63 What is Mussagl?
  64 ----------------
  65
  66 Mussa is an N-way version of FamilyRelations, the 2-way comparative
  67 sequence analysis software that is part of the Cartwheel project.
  68 Given DNA sequence from N species, Mussa uses all possible
  69 pairwise comparisons to derive an N-wise comparison. For example, given
  70 sequences 1,2,3, and 4, Mussa makes 6 2-way comparisons: 1vs2, 1vs3,
  71 1vs4, 2vs3, 2vs4, and 3vs4. It then compares all the links between
  72 these comparisons, saving those that satisfy a transitivity
  73 requirement. The saved paths are then displayed in an interactive
  74 viewer.
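
As an illustration of the pairing step only, the Python sketch below
enumerates the 2-way comparisons for N sequences. It is not Mussa's own
code (Mussa is written in C++), the function name
``pairwise_comparisons`` is made up for this example, and the
transitivity filtering is not shown.

```python
from itertools import combinations


def pairwise_comparisons(sequence_ids):
    """Return every unordered pair of sequence identifiers."""
    return list(combinations(sequence_ids, 2))


# Four sequences give 4 * 3 / 2 = 6 two-way comparisons.
pairs = pairwise_comparisons([1, 2, 3, 4])
print(len(pairs))
print(", ".join(f"{a}vs{b}" for a, b in pairs))
```

In general, N sequences require N * (N - 1) / 2 pairwise comparisons,
so the amount of work grows quickly as sequences are added.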
  75
  76 Short History of Mussa
  77 ----------------------
  78
  79 Mussa Python/PMW Prototype
  80 ~~~~~~~~~~~~~~~~~~~~~~~~~~
  81
  82 First Python/PMW based prototype.
  83
  84 Mussa C++/FLTK
  85 ~~~~~~~~~~~~~~
  86
  87 A rewrite for speed using C++ and the FLTK GUI toolkit.
  88
  89 Mussagl C++/Qt/OpenGL
  90 ~~~~~~~~~~~~~~~~~~~~~
  91
  92 Refactored version using the more elegant Qt GUI framework and
  93 OpenGL for hardware acceleration for those who have better graphics
  94 cards.
  95
  96 Getting Mussagl
  97 ===============
  98
  99 License
 100 -------
 101
 102 Mussagl has been released open source under the `GPL v2
 103 license`__.
 104
 105 __ GPL_
 106
 107 Platforms
 108 ---------
 109
 110 You have the option of building from source or downloading prebuilt
 111 binaries. Most people will want the prebuilt versions.
 112
 113 Supported Platforms:
 114
 115  * Mac OS X (binary or source)
 116  * Windows XP (binary or source)
 117  * Linux (source)
 118
 119 Download
 120 --------
 121
 122 Mussagl binaries for OS X and Windows, as well as the source code, can be
 123 downloaded from http://mussa.caltech.edu/.
 124
 125 Install
 126 -------
 127
 128 Mac OS X
 129 ~~~~~~~~
 130
 131  * Download .dmg file.
 132  * Double click on .dmg file.
 133  * Drag Mussa icon to your /Applications folder.
 134  * Double Click on Mussa icon to open program.
 135
 136 Windows XP
 137 ~~~~~~~~~~
 138 Once you have downloaded the Mussagl installer, double click on the
 139 installer and follow the install instructions.
 140
 141 To start Mussagl, launch the program from Start > Programs > Mussagl >
 142 Mussagl.
 143
 144
 145 Linux
 146 ~~~~~
 147 Currently we do not have a binary installer for Linux. You will have
 148 to build from source. See the 'build from source' section below.
 149
 150
 151 Build from Source
 152 ~~~~~~~~~~~~~~~~~
 153
 154 Instructions for building from source can be found on the `build page
 155 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild>`_ of the
 156 `Mussa wiki`__.
 157
 158 __ wiki_
 159
 160
 161 Obtaining Input Data
 162 ====================
 163
 164 If you would like help obtaining data for use with Mussagl, you can
 165 skip ahead to the `Obtaining Input Data - Continued`_ section.
 166
 167 If you would like a tour of the software, continue with the `Using
 168 Mussagl`_ section.
 169
 170
 171 Using Mussagl
 172 =============
 173
 174
 175 Launch Mussagl
 176 --------------
 177 Launch Mussagl. It should look similar to the screen shot below.
 178
 179 .. image:: images/opened.png
 180    :alt: Launch Mussa
 181    :align: center
 182
 183
 184
 185 Create/Load Analysis
 186 ----------------------
 187
 188 Currently there are three ways to load a Mussa experiment.
 189
 190  1. `Create a new analysis`_
 191  2. `Load a mussa parameter file`_ (.mupa)
 192  3. `Load an analysis`_
 193
 194 .. _createnew:
 195
 196 Create a new analysis
 197 ~~~~~~~~~~~~~~~~~~~~~
 198
 199 To create a new analysis select 'Define analysis' from the 'File'
 200 menu. You should see a dialog box similar to the one below. For this
 201 demo we will use the example sequences that come with Mussagl.
 202
 203 .. image:: images/define_analysis.png
 204    :alt: Define Analysis
 205    :align: center
 206
 207 Instructions:
 208
 209  1. **Give the experiment a name**. For this demo, we'll use
 210     'demo_w30_t20'. Mussa will create a folder with this name to store
 211     the analysis files in once it has been run.
 212
 213  2. Choose a threshold_. For this demo **choose 20**. See the
 214     Threshold_ section for more detailed information.
 215
 216  3. Choose a `window size`_. For this demo **choose 30**.
 217
 218
 219  4. Choose the number of sequences_ you would like. For this demo
 220     **choose 3**.
 221
 222 .. image:: images/define_analysis_step1a.png
 223    :alt: Steps 1-4
 224    :align: center
 225
 226 First enter the species name of "Human" in the first "Species" text
 227 box. Now click on the 'Browse' button next to the sequence input box
 228 and then select the /examples/seq/human_mck_pro.fa file. Do the same in
 229 the next two sequence input boxes selecting mouse_mck_pro.fa and
 230 rabbit_mck_pro.fa as shown below. Make sure to give them a species
 231 name as well. Note that you can create annotation files using the
 232 mussa `Annotation File Format`_ to add annotations to your sequence.
 233
 234 .. image:: images/define_analysis_step2.png
 235    :alt: Choose sequences
 236    :align: center
 237
 238 Click the **create** button and in a few moments you should see
 239 something similar to the following screen shot.
 240
 241 .. image:: images/demo.png
 242    :alt: Mussagl Demo
 243    :align: center
 244
 245 By default your analysis is NOT saved. If you try to close an analysis
 246 without saving, you will be prompted with a dialog box asking you if
 247 you would like to save your analysis. See the `Saving`_ section for
 248 details on saving your analysis. When saving, choose a directory and
 249 give the analysis the name **demo_w30_t20**. If you close and reopen
 250 Mussagl, you will then be able to load the saved analysis. See the `Load
 251 an analysis`_ section below for details.
 252
 253
 254 Load a mussa parameter file
 255 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 256
 257 If you prefer, you can define your Mussa analysis using the Mussa
 258 parameter file. See the `Parameter File Format`_ section for details
 259 on creating a .mupa file.
 260
 261 Once you have a .mupa file created, load Mussagl and select the **File >
 262 Create Analysis from File** menu option. Select the .mupa file and click
 263 open.
 264
 265 .. image:: images/load_mupa_menu.png
 266    :alt: Load Mussa Parameters
 267    :align: center
 268
 269 If you would like to see an example, you can load the
 270 **mck3test.mupa** file in the examples directory that comes with
 271 Mussagl or read the `Step 5 - Create Analysis` section from the
 272 `Obtaining Input Data - Continued`_ section.
 273
 274 .. image:: images/load_mupa_dialog.png
 275    :alt: Load Mussa Parameters Dialog
 276    :align: center
 277
 278
 279
 280 Load an analysis
 281 ~~~~~~~~~~~~~~~~
 282
 283 To load a previously run analysis, open Mussagl and select the **File >
 284 Open Existing Analysis** menu option. Select an analysis **directory** and
 285 click open.
 286
 287 .. image:: images/load_analysis_menu.png
 288    :alt: Load Analysis Menu
 289    :align: center
 290
 291
 292 Main Window
 293 -----------
 294
 295 Overview
 296 ~~~~~~~~
 297 .. Screen-shot with numbers showing features.
 298
 299 .. image:: images/window_overview.png
 300    :alt: Mussa Window
 301    :align: center
 302
 303 Legend:
 304
 305  1. `DNA Sequence (Black bars)`_
 306
 307  2. Annotation_
 308
 309  3. Motif_
 310
 311  4. `Red conservation tracks`_
 312
 313  5. `Blue conservation tracks`_
 314
 315  6. `Zoom Factor`_ (Base pairs per pixel)
 316
 317  7. `Dynamic Threshold`_
 318
 319  8. `Sequence Information Bar`_
 320
 321  9. `Sequence Scroll Bar`_
 322
 323
 324 DNA Sequence (black bars)
 325 ~~~~~~~~~~~~~~~~~~~~~~~~~
 326
 327 .. image:: images/sequence_bar.png
 328    :alt: Sequence Bar
 329    :align: center
 330
 331 Each of the black bars represents one of the loaded sequences, in this
 332 case the sequence around the gene 'MCK' in human, mouse, and rabbit.
 333
 334
 335 Annotation
 336 ~~~~~~~~~~
 337
 338 .. figure:: images/annotation.png
 339    :alt: Annotation
 340    :align: center
 341
 342    Annotation shown in green on sequence bar.
 343
 344
 345 Annotations can be included on any of the sequences using the `Load a
 346 mussa parameter file`_ or `Create a new analysis`_ method of loading
 347 your sequences. You can define annotations by location or using an
 348 exact sub-sequence or a FASTA sequence of the section of DNA you wish
 349 to annotate. See the `Annotation File Format`_ section for details.
 350
 351
 352 Motif
 353 ~~~~~
 354
 355 .. figure:: images/motif.png
 356    :alt: Motif
 357    :align: center
 358
 359    Motif shown in light blue on sequence bar.
 360
 361 The only real difference between an annotation and motif in Mussagl is
 362 that you can define motifs and choose a color from within the GUI. See
 363 the `Motifs`_ section for more information.
 364
 365
 366 Red conservation tracks
 367 ~~~~~~~~~~~~~~~~~~~~~~~
 368
 369 .. figure:: images/conservation_tracks.png
 370    :alt: Conservation Tracks
 371    :align: center
 372
 373    Conservation tracks shown as red and blue lines between sequence
 374    bars.
 375
 376 The **red lines** between the sequence bars represent conservation
 377 between the sequences (i.e. not reverse complement matches).
 378
 379 The amount of sequence conservation shown will depend on how closely your
 380 sequences are related and the `dynamic threshold`_ you are using.
 381
 382 To **deselect**, click and drag over any white area and then release
 383 the mouse button.
 384
 385
 386 Blue conservation tracks
 387 ~~~~~~~~~~~~~~~~~~~~~~~~
 388
 389 .. figure:: images/conservation_tracks.png
 390    :alt: Conservation Tracks
 391    :align: center
 392
 393    Conservation tracks shown as red and blue lines between sequence
 394    bars.
 395
 396 **Blue lines** represent **reverse complement** conservation relative
 397 to the sequence attached to the top of the blue line.
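
For readers unfamiliar with the term, the following Python sketch shows
what a reverse complement is. It only illustrates the concept and is
not taken from Mussa; the name ``reverse_complement`` is chosen for
this example, and only the four unambiguous bases are handled.

```python
# Translation table mapping each base to its complement (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("GATTACA"))  # prints TGTAATC
```

A blue line therefore joins a region of one sequence to a region of
another that matches when read from the opposite strand.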
 398
 399 The amount of sequence conservation shown will depend on how closely your
 400 sequences are related and the `dynamic threshold`_ you are using.
 401
 402 To **deselect**, click and drag over any white area and then release
 403 the mouse button.
 404
 405 Zoom Factor
 406 ~~~~~~~~~~~
 407
 408 .. image:: images/zoom_factor.png
 409    :alt: Zoom Factor
 410    :align: center
 411
 412 The zoom factor represents the number of base pairs represented per
 413 pixel. When you zoom in far enough, the display will switch from
 414 a black bar representing the sequence to the actual sequence
 415 (well, an ASCII representation of the sequence).
 416
 417
 418 Dynamic Threshold
 419 ~~~~~~~~~~~~~~~~~
 420
 421 .. image:: images/dynamic_threshold.png
 422    :alt: Dynamic Threshold
 423    :align: center
 424
 425 You can dynamically change the threshold for how strong a match you
 426 consider the conservation to be by changing the value in the dynamic
 427 threshold box.
 428
 429 The value you enter is the minimum number of base pairs that have to
 430 be matched in order to be considered conserved. The second number, which
 431 you can't change, is the `window size`_ you used when creating the
 432 experiment. The last number is the percent match.
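
The relationship between the three numbers can be sketched in Python as
follows. This assumes the percent match is simply the threshold divided
by the window size; how Mussagl rounds or formats the displayed value
is not stated here, and ``percent_match`` is a name invented for the
example.

```python
def percent_match(threshold, window_size):
    """Percentage of the window that must match at this threshold."""
    if window_size <= 0:
        raise ValueError("window size must be positive")
    return 100.0 * threshold / window_size


# A threshold of 20 with a window size of 30.
print(round(percent_match(20, 30), 1))  # prints 66.7
```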
 433
 434 Below is an animation of the dynamic threshold being increased over
 435 time.
 436
 437 .. image:: images/threshold_change.gif
 438    :alt: Animated Dynamic Threshold
 439    :align: center
 440
 441 See the Threshold_ section for more information.
 442
 443
 444 Sequence Information Bar
 445 ~~~~~~~~~~~~~~~~~~~~~~~~
 446
 447 .. image:: images/seq_info_bar.png
 448    :alt: Sequence Information Bar
 449    :align: center
 450
 451 The sequence information bars can be found on the left and right sides
 452 of the Mussagl window. Next to each sequence you will find the following
 453 information:
 454
 455  1. Species (If it has been defined)
 456  2. Total Size of Sequence
 457  3. Current base pair position
 458
 459 Note that you can **update the species** text box. Make sure to **save your
 460 experiment** after making this change by selecting **File > Save
 461 Analysis** from the menu.
 462
 463 Sequence Scroll Bar
 464 ~~~~~~~~~~~~~~~~~~~
 465
 466 .. image:: images/scroll_bar.png
 467    :alt: Sequence Scroll Bar
 468    :align: center
 469
 470 The scroll bar allows you to scroll through the sequence which is
 471 useful when you have zoomed in using the `zoom factor`_.
 472
 473
 474 Saving
 475 ------
 476
 477 Save on Close
 478 ~~~~~~~~~~~~~
 479
 480 Whenever you create a new analysis or make a change such as
 481 adding/editing a motif or changing a species name, an asterisk (*)
 482 will appear in the title of the window showing that there are changes
 483 that have not been saved. If you close a Mussa window without saving
 484 changes, Mussa will ask you if you would like to save the changes that
 485 have been made.
 486
 487 Save Analysis
 488 ~~~~~~~~~~~~~
 489
 490 After making changes, such as updating species names or adding/editing
 491 motifs, you can save these changes by selecting the **File > Save
 492 analysis** menu option or pressing **CTRL + S** (PC) or
 493 **Apple/Command Key + S** (on Mac).
 494
 495 .. image:: images/save_analysis.png
 496    :alt: Save analysis
 497    :align: center
 498
 499 Save Analysis As
 500 ~~~~~~~~~~~~~~~~
 501
 502 To save a copy of your analysis to a new location, select the **File >
 503 Save analysis as** menu option and choose a new location and name for
 504 your analysis.
 505
 506 .. image:: images/save_analysis_as.png
 507    :alt: Save analysis
 508    :align: center
 509
 510 Save Motif List
 511 ~~~~~~~~~~~~~~~
 512
 513 See `Save Motifs to File`_ in the `Motifs`_ section.
 514
 515
 516 Viewing Multiple Analyses
 517 -------------------------
 518
 519 Sometimes it is useful to view more than one analysis at a time. To
 520 accomplish this, Mussa allows you to open a new Mussa window by
 521 selecting the **File > New Mussa Window** menu option.
 522
 523 .. image:: images/new_mussa_window_menu.png
 524    :alt: New Mussa Window Menu Option
 525    :align: center
 526
 527 A new Mussa window will pop up.
 528
 529 .. figure:: images/new_mussa_window.png
 530    :alt: New Mussa Window
 531    :align: center
 532
 533    A new Mussa window on the right, in which a second analysis has
 534    been loaded.
 535
 536 Now you can create a new analysis or load an existing one in this new window,
 537 as described in the `Create/Load Analysis`_ section.
 538
 539 You can view as many analyses as you can fit on your screen or until
 540 you run out of available RAM. If you notice a rapid decrease in
 541 performance and hear lots of noise coming from your hard drive, you
 542 probably ran out of RAM and are now using virtual memory (i.e. much
 543 much slower). If this happens, you may need to avoid opening as many
 544 analyses at one time.
 545
 546
 547 Annotations / Motifs
 548 --------------------
 549
 550 Annotations
 551 ~~~~~~~~~~~
 552
 553 Currently annotations can be added to a sequence using the mussa
 554 `annotation file format`_ and can be loaded by selecting the
 555 annotation file when defining a new analysis (see `Create a new
 556 analysis`_ section) or by defining a .mupa file pointing to your
 557 annotation file (see `Load a mussa parameter file`_ section).
 558
 559 Motifs
 560 ~~~~~~
 561
 562 Load Motifs from File
 563 *********************
 564
 565 It is possible to load motifs from a file which was saved from a
 566 previous run or by defining your own motif file. See the `Motif File
 567 Format`_ section for details.
 568
 569 NOTE: Valid motif list file extensions are:
 570
 571   * .mtl
 572   * .txt
 573
 574 To load a motif file, select the **Load Motif List** item from the
 575 **File** menu and select a motif list file.
 576
 577 .. image:: images/load_motif.png
 578    :alt: Load Motif List
 579    :align: center
 580
 581
 582 Save Motifs to File
 583 *******************
 584
 585 Motifs from the `Motif Dialog`_ can be saved to file for use with
 586 other analyses. If you just want your motifs to be saved with your
 587 analysis, see the `save analysis`_ section for details.
 588
 589 To save a motif list, select the **File > Save Motifs** menu option. By
 590 default, Mussa will append .mtl if you do not provide a file extension
 591 (valid file extensions: .mtl & .txt).
 592
 593 .. image:: images/save_motifs.png
 594    :alt: Save Motifs
 595    :align: center
 596
 597
 598 Motif Dialog
 599 ************
 600
 601 Mussa has the ability to find lab motifs using the `IUPAC Nucleotide
 602 Code`_ for defining a motif. To define a motif, select the **Edit > Edit
 603 Motifs** menu item as shown below.
 604
 605 .. image:: images/view_edit_motifs.png
 606    :alt: "View > Edit Motifs" Menu
 607    :align: center
 608
 609 You will see a dialog box appear with an "apply" button in the bottom
 610 right and one row for defining a motif and the color that will be
 611 displayed on the sequence. When you start adding your first motif, an
 612 additional row will be added. The check box in the first column
 613 defines whether the motif is displayed or not. The second column is
 614 the motif display color. The third column is for the name of your
 615 motif and finally, the fourth column is the motif itself.
 616
 617 .. image:: images/motif_dialog_start.png
 618    :alt: Motif Dialog
 619    :align: center
 620
 621 Now let's make a motif **'AT[C or G]CT'**. Using the `IUPAC Nucleotide
 622 Code`_, type in **'ATSCT'** into the motif field and **'My Motif'** for
 623 the name in the name field as shown below.
 624
 625 Notice how a second row appeared when you started to add the first
 626 motif. Every time you add a new motif, a new row will appear allowing
 627 you to add as many motifs as you need.
 628
 629 .. image:: images/motif_dialog_enter_motif.png
 630    :alt: Enter Motif
 631    :align: center
 632
 633 Now choose a color for your motif by clicking on the colored area to
 634 the left of the name field. Remember to choose a color that will show
 635 up well with a black bar as the background. A good tool for picking a
 636 color is the `Colour Contrast Analyser
 637 <http://juicystudio.com/services/colourcontrast.php>`_ by
 638 `juicystudio.com <http://juicystudio.com/>`_.
 639
 640 .. image:: images/color_chooser.png
 641    :alt: Color Chooser
 642    :align: center
 643
 644 Once you have selected the color for your motif, click on the
 645 **'apply'** button. If Mussa finds matches to your motif, they
 646 will now show up in the main Mussa window.
 647
 648 Before Motif:
 649
 650 .. image:: images/motif_dialog_bar_before.png
 651    :alt: Sequence bar before motif
 652    :align: center
 653
 654 After Motif:
 655
 656 .. image:: images/motif_dialog_bar_after.png
 657    :alt: Sequence bar after motif
 658    :align: center
 659
 660 To save your motifs with your analysis, see the `save analysis`_
 661 section. To save your motifs to a file, see the `save motifs to file`_
 662 section.
 663
 664 Deleting a Motif
 665 ^^^^^^^^^^^^^^^^
 666
 667 To delete a motif, remove all text from the name and sequence columns
 668 and close the motif editor.
 669
 670 View Mussa Alignments
 671 ---------------------
 672
 673 Mussagl allows you to zoom in on Mussa alignments by selecting the set
 674 of alignment(s) of interest. To do this, move the mouse near the
 675 alignment you are interested in viewing and then **PRESS** and
 676 **HOLD** the **LEFT mouse button** and **drag the mouse** to the other
 677 side of the conservation track so that you see a bounding box
 678 overlapping the alignment(s) of interest and then **let go** of the
 679 *left mouse button*.
 680
 681 In the example below, I started by left-clicking on the area marked by
 682 a red dot (upper left corner of bounding box) and dragging the mouse to
 683 the area marked by a blue dot (lower right corner of the bounding box)
 684 and letting go of the left mouse button.
 685
 686 .. image:: images/select_sequence.png
 687    :alt: Select Sequence
 688    :align: center
 689
 690 All of the lines which were not selected should be washed out as shown
 691 below:
 692
 693 .. image:: images/washed_out.png
 694    :alt: Tracks washed out
 695    :align: center
 696
 697 With a selection made, go to the **View** menu and select **View mussa alignment**.
 698
 699 .. image:: images/view_mussa_alignment.png
 700    :alt: View mussa alignment
 701    :align: center
 702
 703 You should see the alignment at the base-pair level as shown below.
 704
 705 .. image:: images/mussa_alignment.png
 706    :alt: Mussa alignment
 707    :align: center
 708
 709
 710 Sub-analysis
 711 ------------
 712
 713 Sub-analysis was created to allow you to analyze a sub-region using
 714 different parameters. This may allow you to find matches which may not
 715 have shown up with your initial settings.
 716
 717 To run a sub-analysis **highlight** a section of sequence and *right
 718 click* on it and select **Add to subanalysis**. Do the same for the
 719 sequences shown in orange in the screenshot below. Note that you **are
 720 NOT limited** to selecting only one subsequence from the same
 721 sequence.
 722
 723 .. image:: images/subanalysis_select_seqs.png
 724    :alt: Subanalysis sequence selection
 725    :align: center
 726
 727 Once you have added your sequences for subanalysis, choose a `window size`_ and `threshold`_ and click **Ok**.
 728
 729 .. image:: images/subanalysis_dialog.png
 730    :alt: Subanalysis Dialog
 731    :align: center
 732
 733 A new Mussa window will appear with the subanalysis of your sequences
 734 once it's done running. This may take a while if you selected large
 735 chunks of sequence with a loose threshold.
 736
 737 .. image:: images/subanalysis_done.png
 738    :alt: Subalaysis complete
 739    :align: center
 740
 741
 742 Copying sequence to clipboard
 743 -----------------------------
 744
 745 To copy a sequence to the clipboard, highlight a section of sequence,
 746 as shown in the screen shot below, and do one of the following:
 747
 748  * Select **Copy as FASTA** from the **Edit** menu.
 749  * **Right-Click (Left-click + Apple/Command Key on Mac)** on the highlighted sequence and select **Copy as FASTA**.
 750  * Press **Ctrl + C (on PC)** or **Apple/Command Key + C (on Mac)** on the keyboard.
 751
 752 .. image:: images/copy_sequence.png
 753    :alt: Copy sequence
 754    :align: center
 755
 756
 757 Saving to an Image
 758 ---------------------------------
 759
 760 To save your current mussa view to an image, select **File > Save to
 761 image...** as shown below.
 762
 763 .. image:: images/save_to_image_menu.png
 764    :alt: File > Save to image...
 765    :align: center
 766
 767 You can define the width and the height of the image to save. By
 768 default it will use the same size as your current view. Since the
 769 Mussa view is implemented using vectors, if you choose a larger size
 770 than your current view, Mussa will redraw at the higher resolution
 771 when saving. In other words, you get higher quality images when saving
 772 at a higher resolution.
 773
 774 If you check the "Lock aspect ratio" check box, which I have circled
 775 in red, then when you change one value, say width, the other, height,
 776 will update automatically to keep the same aspect ratio.
 777
 778 .. image:: images/save_to_image_dialog.png
 779    :alt: Save to image dialog
 780    :align: center
 781
 782 Click save and choose a location and filename for your file.
 783
 784 The valid image formats are:
 785
 786   * .png (default if no extension specified.)
 787   * .jpg
 788
 789
 790 Detailed Information
 791 --------------------
 792
 793 Threshold
 794 ~~~~~~~~~
 795
 796 The threshold of an analysis is the minimum number of base pair matches
 797 that must be met in order to be kept as a match. Note that you can vary
 798 the threshold from within Mussagl. For example, if you choose a
 799 `window size`_ of **30** and a **threshold** of **20** the mussa nway
 800 transitive algorithm will store all matches that are 20 out of 30 bp
 801 matches or better and pass it on to Mussagl. Mussagl will then allow
 802 you to dynamically choose a threshold from 20 to 30 base pairs. A
 803 threshold of 30 bps would only show 30 out of 30 bp matches. A
 804 threshold of 20 bps would show all matches of 20 out of 30 bps or
 805 better. If you would like to see results for matches lower than 20 out
 806 of 30, you will need to rerun the analysis with a lower threshold.
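
To make the window and threshold idea concrete, here is a deliberately
naive Python sketch for two sequences. It is not Mussa's algorithm or
code: it compares every window against every other window by brute
force, ignores reverse complements and the N-way transitivity step, and
all function names are invented for this example.

```python
def count_matches(window_a, window_b):
    """Count positions at which two equal-length windows agree."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must have the same length")
    return sum(1 for a, b in zip(window_a, window_b) if a == b)


def conserved_windows(seq_a, seq_b, window_size, threshold):
    """Return (pos_a, pos_b, matches) for window pairs meeting the threshold."""
    hits = []
    for i in range(len(seq_a) - window_size + 1):
        for j in range(len(seq_b) - window_size + 1):
            matches = count_matches(seq_a[i:i + window_size],
                                    seq_b[j:j + window_size])
            if matches >= threshold:
                hits.append((i, j, matches))
    return hits


def apply_dynamic_threshold(hits, dynamic_threshold):
    """Keep only stored hits that also meet a stricter threshold."""
    return [hit for hit in hits if hit[2] >= dynamic_threshold]


# Tiny example: window size 4, threshold 3 (3 out of 4 bases must match).
stored = conserved_windows("ACGT", "ACGA", window_size=4, threshold=3)
print(stored)
print(apply_dynamic_threshold(stored, 4))
```

The second function mirrors the point made above: raising the threshold
later only filters what was stored, while lowering it below the
original value requires rerunning the analysis.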
 807
 808 Window Size
 809 ~~~~~~~~~~~
 810
 811 The typical sizes people tend to choose are between 20 and 30. You
 812 will likely need to experiment with this setting depending on your
 813 needs and input sequence.
 814
 815
 816 Sequences
 817 ~~~~~~~~~
 818
 819 Mussa reads in sequences which are formatted in the FASTA_
 820 format. Mussa may take a long time to run (>10 minutes) if the total
 821 bp length is near 280 kb. Once Mussa has run, you can reload
 822 previously run analyses.
 823
 824 FIXME: We have learned more about how much sequence and how many to
 825 put in Mussagl, this information should be documented here.
 826
 827
 828 Mussa File Formats
 829 ------------------
 830
 831 .. _param:
 832
 833 Parameter File Format
 834 ~~~~~~~~~~~~~~~~~~~~~
 835
 836 Note that for the comment character '#' to work, it must be followed by a
 837 space (i.e. '# ').
 838
 839 **File Format (.mupa):**
 840
 841 ::
 842
 843   # name of analysis directory and stem for associated files
 844   ANA_NAME <analysis_name>
 845
 846   # if APPEND vars true, a _wXX and/or _tYY added to analysis name
 847   # where XX = WINDOW and YY = THRESHOLD
 848   # Highly recommended with use of command line override of WINDOW or THRESHOLD
 849   APPEND_WIN <true/false>
 850   APPEND_THRES <true/false>
 851
 852   # first sequence info
 853   SEQUENCE <FASTA_file_path>
 854   ANNOTATION <annotation_file_path>
 855   SEQ_START <sequence_start>
 856
 857   # the second sequence info
 858   SEQUENCE <FASTA_file_path>
 859   # ANNOTATION <annotation_file_path>
 860   SEQ_START <sequence_start>
 861   # SEQ_END <sequence_end>
 862
 863   # third sequence info
 864   SEQUENCE <FASTA_file_path>
 865   # ANNOTATION <annotation_file_path>
 866
 867   # analysis parameters: command line args -w -t will override these
 868   WINDOW <num>
 869   THRESHOLD <num>
 870
 871 .. csv-table:: Parameter File Options:
 872    :header: "Option Name", "Value", "Default", "Required", "Description"
 873    :widths: 30 30 30 30 60
 874
 875    "ANA_NAME", "string", "N/A", "true", "Name of analysis (also the
 876    name of the directory where the analysis will be saved)."
 877    "APPEND_WIN", "true/false", "?", "?", "Appends _w## to ANA_NAME"
 878    "APPEND_THRES", "true/false", "?", "?", "Appends _t## to ANA_NAME"
 879    "SEQUENCE", "/FASTA/filepath.fa", "N/A", "true", "Absolute/Relative file
 880    path to sequence."
 881    "ANNOTATION", "/annotation/filepath.txt", "N/A", "false", "Optional
 882    annotation file. See `annotation file format`_ section for more
 883    information."
 884    "SEQ_START", "integer", "1", "false", "Optional start index into FASTA file"
 885    "SEQ_END", "integer", "1", "false", "Optional end index into FASTA file"
 886    "WINDOW", "integer", "N/A", "true", "`Window Size`_"
 887    "THRESHOLD", "integer", "N/A", "true", "`Threshold`_"
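
The following Python sketch shows one way to read a parameter file laid
out as described above. It is not Mussa's parser: the function name
``parse_mupa`` is invented, the relative FASTA paths are hypothetical,
and it assumes that ANNOTATION, SEQ_START and SEQ_END lines belong to
the most recent SEQUENCE line, as the layout of the format listing
suggests.

```python
def parse_mupa(text):
    """Split .mupa text into global parameters and per-sequence entries."""
    params = {}
    sequences = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments ('#' followed by a space).
        if not line or line == "#" or line.startswith("# "):
            continue
        key, _, value = line.partition(" ")
        value = value.strip()
        if key == "SEQUENCE":
            sequences.append({"SEQUENCE": value})
        elif key in ("ANNOTATION", "SEQ_START", "SEQ_END"):
            if not sequences:
                raise ValueError(key + " given before any SEQUENCE")
            sequences[-1][key] = value
        else:
            params[key] = value
    return params, sequences


EXAMPLE = """\
# demo analysis using the example sequences (paths are hypothetical)
ANA_NAME demo
APPEND_WIN true
APPEND_THRES true
SEQUENCE examples/seq/human_mck_pro.fa
SEQUENCE examples/seq/mouse_mck_pro.fa
SEQ_START 1
SEQUENCE examples/seq/rabbit_mck_pro.fa
WINDOW 30
THRESHOLD 20
"""

params, sequences = parse_mupa(EXAMPLE)
print(params["ANA_NAME"], params["WINDOW"], params["THRESHOLD"])
print(len(sequences))
```

All values are kept as strings here; a real reader would also convert
WINDOW, THRESHOLD and the sequence indexes to integers.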
 888
 889 .. _annot:
 890
 891 Annotation File Format
 892 ~~~~~~~~~~~~~~~~~~~~~~
 893
 894 The first line in the file is the sequence name. Each line thereafter
 895 is a **space** separated annotation.
 896
 897 Update:
 898
 899  * The annotation format now supports FASTA sequences embedded in the
 900    annotation file as shown in the format example below. Mussagl will
 901    take this sequence and look for an exact match of this sequence in
 902    your sequences. If a match is found, it will label it with the name
 903    from the FASTA header.
 904
 905 Format:
 906
 907 ::
 908
 909   <species/sequence_name>
 910   <start> <stop> <annotation_name> <annotation_type>
 911   <start> <stop> <annotation_name> <annotation_type>
 912   <start> <stop> <annotation_name> <annotation_type>
 913   <start> <stop> <annotation_name> <annotation_type>
 914   >FASTA Header
 915   ACTGACTGACGTACGTAGCTAGCTAGCTAGCACG
 916   ACGTACGTACGTACGTAGCTGTCATACGCTAGCA
 917   TGCGTAGAGGATCTCGGATGCTAGCGCTATCGAT
 918   ACGTACGGCAGTACGCGGTCAGA
 919   <start> <stop> <annotation_name> <annotation_type>
 920   ...
 921
 922 Example:
 923
 924 ::
 925
 926   Mouse
 927   251 500 Glorp Glorptype
 928   751 1000 Glorp Glorptype
 929   1251 1500 Glorp Glorptype
 930   >My favorite DNA sequence
 931   GATTACA
 932   1751 2000 Glorp Glorptype
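
As a rough illustration of how such a file could be read, here is a
Python sketch. It is not Mussa's parser. The name
``parse_annotation_file`` is invented, and the sketch assumes that a
line starting with a digit is an interval annotation, that a line
starting with ``>`` opens an embedded FASTA record, and that the lines
following a FASTA header belong to that record until the next interval
or header.

```python
def parse_annotation_file(text):
    """Return (sequence_name, intervals, fasta_records) from annotation text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty annotation file")
    sequence_name = lines[0]
    intervals = []
    fasta_records = []
    current = None  # FASTA record currently being filled, if any
    for line in lines[1:]:
        if line.startswith(">"):
            current = {"header": line[1:].strip(), "sequence": ""}
            fasta_records.append(current)
        elif line[0].isdigit():
            current = None
            fields = line.split()
            start, stop = int(fields[0]), int(fields[1])
            # Treating name and type as optional is an assumption.
            name = fields[2] if len(fields) > 2 else ""
            kind = fields[3] if len(fields) > 3 else ""
            intervals.append((start, stop, name, kind))
        elif current is not None:
            current["sequence"] += line
        else:
            raise ValueError("unexpected line: " + line)
    return sequence_name, intervals, fasta_records


EXAMPLE = """\
Mouse
251 500 Glorp Glorptype
751 1000 Glorp Glorptype
1251 1500 Glorp Glorptype
>My favorite DNA sequence
GATTACA
1751 2000 Glorp Glorptype
"""

name, intervals, fasta_records = parse_annotation_file(EXAMPLE)
print(name, len(intervals), fasta_records[0]["sequence"])
```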
 933
 934
 935 .. _motif_file_format:
 936
 937 Motif File Format
 938 ~~~~~~~~~~~~~~~~~
 939
 940 Format::
 941
 942   <motif> <red> <green> <blue>
 943
 944 Example::
 945
 946   GGCC 0.0 1 1
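
A single motif line can be read with a few lines of Python. This is a
sketch only: ``parse_motif_line`` and ``to_8bit`` are invented names,
and the reading of the color values as fractions between 0 and 1 is an
assumption based on the example line above.

```python
def parse_motif_line(line):
    """Split '<motif> <red> <green> <blue>' into a motif and an RGB tuple."""
    motif, red, green, blue = line.split()
    return motif, (float(red), float(green), float(blue))


def to_8bit(rgb):
    """Convert color fractions (assumed 0..1) to 0..255 integers."""
    return tuple(round(channel * 255) for channel in rgb)


motif, color = parse_motif_line("GGCC 0.0 1 1")
print(motif, color, to_8bit(color))
```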
 947
 948
 949
 950 IUPAC Nucleotide Code
 951 ~~~~~~~~~~~~~~~~~~~~~~
 952
 953 For your convenience, below is a table of the IUPAC Nucleotide Code.
 954
 955 The following table is table 1 from "Nomenclature for Incompletely
 956 Specified Bases in Nucleic Acid Sequences" which can be found at
 957 http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html.
 958
 959 ======  =================  ===================================
 960 Symbol  Meaning            Origin of designation
 961 ======  =================  ===================================
 962 G       G                  Guanine
 963 A       A                  Adenine
 964 T       T                  Thymine
 965 C       C                  Cytosine
 966 R       G or A             puRine
 967 Y       T or C             pYrimidine
 968 M       A or C             aMino
 969 K       G or T             Keto
 970 S       G or C             Strong interaction (3 H bonds)
 971 W       A or T             Weak interaction (2 H bonds)
 972 H       A or C or T        not-G, H follows G in the alphabet
 973 B       G or T or C        not-A, B follows A
 974 V       G or C or A        not-T (not-U), V follows U
 975 D       G or A or T        not-C, D follows C
 976 N       G or A or T or C   aNy
 977 ======  =================  ===================================
 978
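The table can be read as a mapping from each symbol to the set of bases
it stands for. The Python sketch below uses that mapping to expand an
ambiguous motif into a regular expression. It only illustrates the code
table; it is not how Mussa itself searches for motifs, it looks at the
forward strand only, and the names ``IUPAC`` and ``motif_to_regex`` are
hypothetical.

.. code-block:: python

   import re

   # Symbol -> bases it stands for, copied from the table above.
   IUPAC = {
       "G": "G", "A": "A", "T": "T", "C": "C",
       "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
       "S": "GC", "W": "AT", "H": "ACT", "B": "GTC",
       "V": "GCA", "D": "GAT", "N": "GATC",
   }

   def motif_to_regex(motif):
       """Build a compiled regex matching every sequence the motif allows."""
       parts = []
       for symbol in motif.upper():
           bases = IUPAC[symbol]  # KeyError for symbols outside the table
           parts.append(bases if len(bases) == 1 else "[" + bases + "]")
       return re.compile("".join(parts))

   # GGRCC matches both GGACC and GGGCC.
   pattern = motif_to_regex("GGRCC")
   hits = [m.start() for m in pattern.finditer("TTGGACCAGGGCCT")]
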
 979
 980 Obtaining Input Data - Continued
 981 --------------------------------
 982
 983 If you already have your data, you may want to skip to the `Using Mussagl`_
 984 section of the manual.
 985
 986 Let's say you have a gene of interest called 'SMN1' and you want to
 987 know how well the sequence surrounding the gene is conserved across
 988 multiple species. That is what we will do in this section: retrieve
 989 the DNA sequence for SMN1 and prepare it for use in Mussa.
 990
 991 For more information about SMN1 visit `NCBI's OMIM
 992 <http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=609682>`_.
 993
 994 The SMN1 data retrieved in this section can be downloaded from the
 995 `Mussa Example Data
 996 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/ExampleData>`_ page if
 997 you prefer to skip this section of the manual.
 998
 999 UCSC Genome Browser Method
1000 --------------------------
1001
1002 There are many methods of retrieving DNA sequence, but for this
1003 example we will retrieve SMN1 through the UCSC genome browser located
1004 at http://genome.ucsc.edu/.
1005
1006
1007 .. image:: images/ucsc_genome_browser_home.png
1008    :alt: UCSC Genome Browser
1009    :align: center
1010
1011 Step 1 - Find SMN1
1012 ~~~~~~~~~~~~~~~~~~
1013
1014 The first step in finding SMN1 is to use the **Gene Sorter** menu
1015 option which I have highlighted in orange below:
1016
1017 .. image:: images/ucsc_menu_bar_gene_sorter.png
1018    :alt: Gene Sorter Menu Option
1019    :align: center
1020
1021 Gene Sorter page:
1022
1023 .. image:: images/ucsc_gene_sorter.png
1024    :alt: Gene Sorter
1025    :align: center
1026
1027 We will start by looking for SMN1 in the **Human Genome** and **sorting by name similarity**.
1028
1029 .. image:: images/ucsc_gs_sort_name_sim.png
1030    :alt: Gene Sorter - Name Similarity
1031    :align: center
1032
1033 After you have selected **Human Genome** and **sorting by name similarity**, type *SMN1* into the search box.
1034
1035 .. image:: images/ucsc_gs_smn1.png
1036    :alt: Gene
1037    :align: center
1038
1039 Press **Go!** and you should see the following page:
1040
1041 .. image:: images/ucsc_gs_found.png
1042    :alt: Found SMN1
1043    :align: center
1044
1045 Click on **SMN1** and you will be taken to the gene expression atlas
1046 page.
1047
1048 .. image:: images/ucsc_gs_genome_position.png
1049    :alt: Gene expression atlas
1050    :align: center
1051
1052 Click on **chr5 70,270,558** found in the **SMN1 row**, **Genome
1053 position column**.
1054
1055 Now we have found the location of SMN1 in the human genome!
1056
1057 .. image:: images/ucsc_gb_smn1_human.png
1058    :alt: Genome Browser - SMN1 (human)
1059    :align: center
1060
1061
1062 Step 2 - Download CDS/UTR sequence for annotations
1063 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1064
1065 Since we have found **SMN1**, this would be a convenient time to extract
1066 the DNA sequence for the CDS and UTRs of the gene to use it as an
1067 annotation_ in Mussa.
1068
1069 **Click on SMN1**, located **between** the **two orange arrows** shown
1070 below.
1071
1072 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
1073    :alt: Genome Browser - SMN1 (human) - Orange Arrows
1074    :align: center
1075
1076 You should find yourself at the SMN1 description page.
1077
1078 .. image:: images/ucsc_gb_smn1_description_page.png
1079    :alt: Genome Browser - SMN1 (human) - Description page
1080    :align: center
1081
1082 **Scroll down** until you get to the **Sequence section** and click on
1083 **Genomic (chr5:70,256,524-70,284,592)**.
1084
1085 .. image:: images/ucsc_gb_smn1_human_sequence.png
1086    :alt: Genome Browser - SMN1 (human) - Sequence
1087    :align: center
1088
1089 You should now be at the **Genomic sequence near gene** page:
1090
1091 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence.png
1092    :alt: Genome Browser - SMN1 (human) - Get genomic sequence
1093    :align: center
1094
1095 Make the following changes (highlighted in orange in the screenshot
1096 below):
1097
1098  1. Uncheck **introns**.
1099     (We only want to annotate CDS and UTRs.)
1100  2. Select **one FASTA record** per **region**.
1101     (Mussa needs each CDS and each UTR in its own FASTA record.)
1102  3. Select **CDS in upper case, UTR in lower case.**
1103
1104 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_diff.png
1105    :alt: Genome Browser - SMN1 (human) - Get genomic sequence setup
1106    :align: center
1107
1108 Now click the **submit** button. You will then see a FASTA file with
1109 many FASTA records representing the CDS and UTRs.
1110
1111 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_submit.png
1112    :alt: Genome Browser - SMN1 (human) - CDS/UTR sequence
1113    :align: center
1114
1115 Now you need to save the FASTA records to a **text file**. If you are
1116 using **Firefox** or **Internet Explorer 6+** click on the **File >
1117 Save As** menu option.
1118
1119 **IMPORTANT:** Make sure you select **Text Files** and **NOT**, I
1120 repeat, **NOT Webpage Complete** (see screenshot below).
1121
1122 Type in **smn1_human_annot.txt** for the file name.
1123
1124 .. image:: images/smn1_human_annot.png
1125    :alt: Genome Browser - SMN1 (human) - sequence annotation file
1126    :align: center
1127
1128 **IMPORTANT:** You should open the file with a text editor and make
1129 sure **no HTML** was saved. If you find any HTML markup, delete
1130 the markup and save the file.
1131
1132 Now we are going to **modify the file** you just saved to **add the
1133 name of the species** to the **annotation file**. All you have to do
1134 is **add a new line** at the **top of the file** with the word **'Human'** as
1135 shown below:
1136
1137 .. image:: images/smn1_human_annot_plus_human.png
1138    :alt: Genome Browser - SMN1 (human) - sequence annotation file
1139    :align: center
1140
1141 You can add more annotations to this file if you wish. See the
1142 `annotation file format`_ section for details of the file format. By
1143 including FASTA records in the annotation_ file, Mussa searches your
1144 DNA sequence for an exact match of the sequence in the annotation_
1145 file. If found, it will be marked as an annotation_ within Mussa.
1146
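The manual clean-up in this step (removing stray HTML and adding the
species line) can also be scripted. The Python sketch below is one
possible way to do it, assuming the saved page contains only simple
tags such as ``<PRE>``. The function name ``clean_annotation_text`` and
the FASTA header in the sample text are made up for illustration, and
the regular expression is a rough heuristic, not a full HTML parser.

.. code-block:: python

   import html
   import re

   TAG = re.compile(r"<[^>]+>")  # rough heuristic, not a full HTML parser

   def clean_annotation_text(raw_text, species):
       """Drop HTML tags and blank lines, then put the species name first."""
       kept = []
       for line in raw_text.splitlines():
           # Remove tags first, then decode entities such as &gt;.
           line = html.unescape(TAG.sub("", line)).strip()
           if line:
               kept.append(line)
       return "\n".join([species] + kept) + "\n"

   # Placeholder input: the header name is hypothetical.
   raw = "<HTML><BODY><PRE>\n>example_record\nGATTACAg\n</PRE></BODY></HTML>\n"
   cleaned = clean_annotation_text(raw, "Human")

It is still worth opening the result in a text editor to confirm that
only the species line and the FASTA records remain.
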
1147
1148 Step 3 - Download gene and upstream/downstream sequence
1149 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1150
1151 Use the back button in your web browser to get back to the **genome
1152 browser view** of **SMN1** as shown below.
1153
1154 .. image:: images/ucsc_gb_smn1_human.png
1155    :alt: Genome Browser - SMN1 (human)
1156    :align: center
1157
1158 There are two options for getting additional sequence around your
1159 gene. The more complex way is to zoom out until the genome browser
1160 shows all of the sequence you want, and then follow the directions
1161 for the second option.
1162
1163 The second option, which we will choose, is to leave the genome
1164 browser zoomed exactly at the location of SMN1 and click on the
1165 **DNA** option on the menu bar (shown with orange arrows in the
1166 screenshot below.)
1167
1168 .. image:: images/ucsc_gb_smn1_human_dna_option.png
1169    :alt: Genome Browser - SMN1 (human) - DNA Option
1170    :align: center
1171
1172 Now, on the **get dna in window** page, let's add some extra sequence
1173 to each end of the gene, say 5000 base pairs.
1174
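To check which coordinates the padded sequence should cover, the
arithmetic can be sketched in Python. This assumes the browser window
matches the genomic range shown in step 2
(chr5:70,256,524-70,284,592); the function name ``pad_region`` is
hypothetical.

.. code-block:: python

   def pad_region(region, padding):
       """Return the region string widened by `padding` bases on each side."""
       chrom, span = region.split(":")
       start, end = (int(x.replace(",", "")) for x in span.split("-"))
       start = max(1, start - padding)  # never run off the chromosome start
       end = end + padding
       return "%s:%s-%s" % (chrom, format(start, ","), format(end, ","))

   padded = pad_region("chr5:70,256,524-70,284,592", 5000)
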
1175 .. image:: images/ucsc_gb_smn1_human_get_dna.png
1176    :alt: Genome Browser - SMN1 (human) - Get DNA
1177    :align: center
1178
1179 Click the **get DNA** button.
1180
1181 .. image:: images/ucsc_gb_smn1_human_dna.png
1182    :alt: Genome Browser - SMN1 (human) - DNA
1183    :align: center
1184
1185 Save the DNA sequence to a text file called 'smn1_human_dna.fa' as we
1186 did in step 2 with the annotation file.
1187
1188 **IMPORTANT:** Make sure the file is saved as a text file and not an
1189 HTML file. Open the file with a text editor and remove any HTML markup
1190 you find.
1191
1192
1193 Step 4 - Same/similar/related gene in other species
1194 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1195
1196 What good is a multiple sequence alignment viewer without multiple
1197 sequences? Let's find a similar gene in a few more species.
1198
1199 Use the back button on your web browser until you get to the **genome
1200 browser view** of **SMN1** as shown below.
1201
1202 .. image:: images/ucsc_gb_smn1_human.png
1203    :alt: Genome Browser - SMN1 (human)
1204    :align: center
1205
1206 **Click on SMN1**, located **between** the **two orange arrows** shown
1207 below.
1208
1209 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
1210    :alt: Genome Browser - SMN1 (human) - Orange Arrows
1211    :align: center
1212
1213 You should find yourself at the SMN1 description page.
1214
1215 .. image:: images/ucsc_gb_smn1_description_page.png
1216    :alt: Genome Browser - SMN1 (human) - Description page
1217    :align: center
1218
1219 **Scroll down** until you get to the **Sequence section** and click on
1220 **Protein (262 aa)**.
1221
1222 .. image:: images/ucsc_gb_smn1_human_sequence.png
1223    :alt: Genome Browser - SMN1 (human) - Sequence
1224    :align: center
1225
1226 Copy the SMN1 protein sequence by highlighting it and selecting the
1227 **Edit > Copy** option from the menu.
1228
1229 .. image:: images/smn1_human_protein.png
1230    :alt: Genome Browser - SMN1 (human) - Protein
1231    :align: center
1232
1233 Press the back button on the web browser once and then scroll to the
1234 top of the page and click on the **BLAT** option on the menu bar
1235 (shown below with orange arrows).
1236
1237 .. image:: images/ucsc_gb_smn1_human_blat.png
1238    :alt: Genome Browser - SMN1 (human) - Blat
1239    :align: center
1240
1241 **Paste** in the **protein sequence** and **change** the **genome** to
1242 **mouse** as shown below and then click **submit**.
1243
1244 .. image:: images/ucsc_gb_smn1_human_blat_paste.png
1245    :alt: Genome Browser - SMN1 (human) - Blat paste protein
1246    :align: center
1247
1248 Notice that we have two hits, one of which looks pretty good at 89.9%
1249 match.
1250
1251 .. image:: images/ucsc_gb_smn1_human_blat_hits.png
1252    :alt: Genome Browser - SMN1 (human) - Blat hits
1253    :align: center
1254
1255 **Click** on the **browser** link next to the 89.9% match. Notice in
1256 the genome browser (shown below) that there is an annotated gene
1257 called SMN1 for mouse which matches the line called **your sequence
1258 from blat search**. This means we can be fairly confident that we found the
1259 right location in the mouse genome.
1260
1261 .. image:: images/ucsc_gb_smn1_human_blat_to_browser.png
1262    :alt: Genome Browser - SMN1 (human) - Blat to browser
1263    :align: center
1264
1265 Follow steps 1 through 3 for mouse and then repeat step 4 with the
1266 human protein sequence to find **SMN1** in the following species (if
1267 you find a match):
1268
1269  1. Rat
1270  2. Rabbit
1271  3. Dog
1272  4. Armadillo
1273  5. Elephant
1274  6. Opossum
1275  7. x_tropicalis
1276
1277 Make sure to save the extended DNA sequence and annotation file for
1278 each one.
1279
1280
1281 Step 5 - Create Analysis
1282 ~~~~~~~~~~~~~~~~~~~~~~~~
1283
1284 At this point you should have the annotation and FASTA files for each
1285 species. If you skipped the first four steps or are having trouble,
1286 you can download the example data from the `Mussa Example Data
1287 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/ExampleData>`_ page.
1288
1289 There are two methods for creating an analysis. You can create a MUssa
1290 PArameter file (.mupa), or you can use the create analysis dialog. To
1291 use the analysis dialog, see the `create a new analysis`_ section.
1292
1293 If you are planning to do many analyses using the same sets of DNA
1294 sequence but with different parameters, annotations, and/or species,
1295 it is often best to set up a `mupa`_ file, so you can:
1296
1297   * Change parameters and rerun an analysis easily.
1298   * Use the Mussa command line options to run batch analyses.
1299   * Define an analysis for someone else to run.
1300
1301 Now, we will create a `mupa`_ file for SMN1 for an analysis with
1302 Human, Mouse, and Cow. I'll start by showing you the `mupa`_ file and
1303 then walk you through it line by line.
1304
1305 Start by creating a new text file called *smn1_human_mouse_cow.mupa*,
1306 in your smn1 directory. I decided to put the FASTA and annotation
1307 files for each species in their own directory, so I will use
1308 that setup (see screenshot below).
1309
1310 .. image:: images/smn1_dir_structure.png
1311    :alt: SMN1 directory structure
1312    :align: center
1313
1314 smn1_human_mouse_cow.mupa:
1315 ::
1316
1317   # Analysis name
1318   ANA_NAME smn1_human_mouse_cow
1319
1320   # Appending to analysis name
1321   APPEND_WIN true
1322   APPEND_THRES true
1323
1324   # Human sequence
1325   SEQUENCE human/smn1_human_dna.fasta
1326   ANNOTATION human/smn1_human_annotations.txt
1327
1328   SEQUENCE mouse/smn1_mouse_dna.fasta
1329   ANNOTATION mouse/smn1_mouse_annotations.txt
1330
1331   SEQUENCE cow/smn1_cow_dna.fasta
1332   ANNOTATION cow/smn1_cow_annotations.txt
1333
1334   # Window size / Threshold
1335   WINDOW 30
1336   THRESHOLD 24
1337
1338 The first line is the analysis name. This will be the name of the
1339 directory the results will be saved in when using the Mussa `command
1340 line`_ option --no-gui to run an analysis. If you are using the Mussa
1341 GUI, then you will be prompted for a directory name as mentioned in
1342 the `saving`_ section.
1343
1344 ::
1345
1346   # Analysis name
1347   ANA_NAME smn1_human_mouse_cow
1348
1349 If you provide APPEND_WIN and/or APPEND_THRES, and set them to
1350 true, the window size and threshold will be appended to the analysis
1351 name. In this example, using the --no-gui `command line`_ option, our
1352 directory name would be *smn1_human_mouse_cow_w30_t24*.
1353
1354 ::
1355
1356   # Appending to analysis name
1357   APPEND_WIN true
1358   APPEND_THRES true
1359
1360 The following six lines provide Mussa with the location of the
1361 sequence files and annotation files. The files can be provided with
1362 relative paths from the .mupa file. In other words, this .mupa file
1363 will provide the proper path to the human sequence only if there
1364 exists a directory called *human* in the same directory as this .mupa
1365 file.
1366
1367 To provide the species name for each species, you have to put the
1368 species name in the annotation files. See the `annotation file
1369 format`_ section for more details.
1370
1371 ::
1372
1373   # Human sequence
1374   SEQUENCE human/smn1_human_dna.fasta
1375   ANNOTATION human/smn1_human_annotations.txt
1376
1377   SEQUENCE mouse/smn1_mouse_dna.fasta
1378   ANNOTATION mouse/smn1_mouse_annotations.txt
1379
1380   SEQUENCE cow/smn1_cow_dna.fasta
1381   ANNOTATION cow/smn1_cow_annotations.txt
1382
1383 And finally, the `window size`_ and `threshold`_ parameters.
1384
1385 ::
1386
1387   # Window size / Threshold
1388   WINDOW 30
1389   THRESHOLD 24
1390
1391 Next, open Mussagl and select the **File > Create Analysis from File**
1392 menu option. Mussagl should run your analysis if everything was set up
1393 properly.
1394
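If you want several `mupa`_ files that differ only in window size and
threshold, they can be generated rather than typed by hand. The Python
sketch below builds the text of a parameter file using only the keywords
shown in this step. The function name ``make_mupa`` is hypothetical, and
the output is a convenience for this tutorial, not an official Mussa
tool.

.. code-block:: python

   def make_mupa(name, species_files, window, threshold):
       """Return .mupa text; species_files is a list of (sequence, annotation)."""
       lines = [
           "# Analysis name",
           "ANA_NAME %s" % name,
           "",
           "# Appending to analysis name",
           "APPEND_WIN true",
           "APPEND_THRES true",
           "",
       ]
       for sequence_path, annotation_path in species_files:
           # Paths are relative to the directory holding the .mupa file.
           lines.append("SEQUENCE %s" % sequence_path)
           lines.append("ANNOTATION %s" % annotation_path)
           lines.append("")
       lines += [
           "# Window size / Threshold",
           "WINDOW %d" % window,
           "THRESHOLD %d" % threshold,
       ]
       return "\n".join(lines) + "\n"

   files = [
       ("human/smn1_human_dna.fasta", "human/smn1_human_annotations.txt"),
       ("mouse/smn1_mouse_dna.fasta", "mouse/smn1_mouse_annotations.txt"),
       ("cow/smn1_cow_dna.fasta", "cow/smn1_cow_annotations.txt"),
   ]
   mupa_text = make_mupa("smn1_human_mouse_cow", files, 30, 24)

Writing ``mupa_text`` to *smn1_human_mouse_cow.mupa* in the smn1
directory reproduces the parameter file walked through above, minus the
``# Human sequence`` comment.
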
1395
1396
1397 Understanding Mussa
1398 ===================
1399
1400 Command Line
1401 ------------
1402
1403 Mussa has some very useful command line options that allow for
1404 loading an existing analysis or running a new analysis with or without
1405 launching the GUI.
1406
1407 Mussa options:
1408   --help                     help message
1409   -p, --run-analysis arg     run an analysis defined by the mussa parameter file
1410   --view-analysis arg        load a previously run analysis
1411   --motifs arg               annotate analysis with motifs from this file
1412   --no-gui                   run the analysis without launching the GUI
1413   --python                   launch as a `python interpreter`_
1414
1415 Running an analysis using the --no-gui option is useful when you want
1416 to run many analyses on a compute server and save the results for
1417 viewing in the future.
1418
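A batch of analyses can be driven from a small script. The Python sketch
below builds one command line per parameter file from the
``--run-analysis`` and ``--no-gui`` options listed above. The executable
name ``mussagl`` and the parameter file names are assumptions; adjust
them to match your installation.

.. code-block:: python

   import subprocess

   def batch_commands(mupa_paths, executable="mussagl"):
       """Build one command line per parameter file, with the GUI disabled."""
       return [[executable, "--run-analysis", path, "--no-gui"]
               for path in mupa_paths]

   def run_batch(mupa_paths, executable="mussagl"):
       """Run each analysis in turn and stop at the first failure."""
       for command in batch_commands(mupa_paths, executable):
           subprocess.run(command, check=True)

   # Hypothetical parameter files; nothing is executed here.
   commands = batch_commands(["smn1_w30_t24.mupa", "smn1_w20_t16.mupa"])

Calling ``run_batch`` with the same list would launch the analyses one
after another.
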
1419
1420 Performance
1421 -----------
1422
1423 Algorithm Behavior
1424 ~~~~~~~~~~~~~~~~~~
1425
1426 FIXME: Include seqcomp algorithm info.
1427
1428 FIXME: Include transitivity info.
1429
1430 Repeats
1431 ~~~~~~~
1432
1433 Repeat masking of all input sequences, or at least of the "reference"
1434 genome, can be important for reducing compute time and for simplifying
1435 subsequent visual interpretation. Larger loci generally contain more
1436 repeat elements, and as their number grows so will the number of Mussa
1437 connections among them. If repeats are not masked, connectivity between
1438 shared repeat elements can obscure important relationships between
1439 single-copy features.
1440
1441 The formula for the number of connections, C, that will be made for R
1442 instances of a single repeat (meaning R copies of one repeat in each
1443 sequence) and S sequences is:
1444
1445 C = (R^2)[S(S-1)/2]
1446
1447 Table of example situations:
1448
1449 =====  =====  =====
1450   C      R      S
1451 =====  =====  =====
1452    16     4     2
1453    48     4     3
1454    96     4     4
1455   160     4     5
1456   240     4     6
1457   336     4     7
1458   448     4     8
1459    24     2     4
1460    54     3     4
1461    96     4     4
1462   150     5     4
1463   216     6     4
1464   294     7     4
1465   384     8     4
1466  2500    50     2
1467  7500    50     3
1468 15000    50     4
1469 10000   100     2
1470 30000   100     3
1471 60000   100     4
1472 =====  =====  =====
1473
1474 After the connections, C, are found, they are passed on to the
1475 transitivity filter, which is a C^2 algorithm (FIXME: confirm
1476 algorithm is C^2). This means that 50 repeats in each of 2 sequences,
1477 which gives a C of 2500, result in a C^2 of 6,250,000.
1478
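The connection formula is easy to evaluate for your own data before
starting a long run. The Python sketch below computes C for a given
number of repeats and sequences and, assuming the C^2 estimate for the
transitivity filter holds, the size of that second term. The function
name ``connections`` is hypothetical.

.. code-block:: python

   def connections(repeats, sequences):
       """C = R^2 * S(S-1)/2 for R repeat copies in each of S sequences."""
       pairs = sequences * (sequences - 1) // 2  # number of sequence pairs
       return repeats ** 2 * pairs

   # A few rows of the table: (C, R, S).
   rows = [(connections(r, s), r, s)
           for r, s in [(4, 2), (4, 8), (50, 2), (100, 4)]]

   # Cost estimate for the transitivity filter with 50 repeats, 2 sequences.
   filter_cost = connections(50, 2) ** 2
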
1479 **Conclusion: repeats cause the processing time of Mussa to skyrocket.**
1480
1481 To deal with a situation where you have many repeats in your sequences,
1482 do any of the following:
1483
1484  * Use shorter sequence lengths.
1485  * Repeat mask one or more of your sequences.
1486  * Increase the threshold.
1487
1488
1489 Details
1490 -------
1491
1492 Case: Conservation track suddenly stops
1493 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1494
1495 Details about this potentially confusing case can be found `here
1496 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/OverlappingWindows>`_.
1497
1498 Python Interpreter
1499 ------------------
1500
1501 Mussagl has some functionality for running a python interpreter for
1502 interacting with the internals of Mussagl and/or executing Python
1503 code. This feature is mostly experimental at this point in time. If
1504 you have interest in this feature or would like to know more about it,
1505 contact us using the contact information found at
1506 http://mussa.caltech.edu/.
1507
1508 .. Define links below
1509    ------------------
1510
1511 .. _GPL: http://www.opensource.org/licenses/gpl-license.php
1512 .. _wiki: http://mussa.caltech.edu
1513 .. _build: http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild
1514 .. _FASTA: http://en.wikipedia.org/wiki/fasta_format
1515 .. _wpDnaMotif: http://en.wikipedia.org/wiki/DNA_motif
1516 .. _mupa: `Parameter File Format`_