   1 ===================
   2 Mussagl Manual v1.0
   3 ===================
   4 ---------------
   5 Brandon W. King
   6 ---------------
   7
   8 Last updated: Oct 31st, 2006
   9
  10 .. Things to add
  11         * New features / change log
  12         * (DONE) Comment out anything that isn't implemented yet.
  13         * (DONE) List of features that will be implemented in the future.
  14         * Look into the homology mapping of UCSC.
  15         * Add toggle to genomes.
  16         * Document why one fast record per region.
  17         * How to deal with the hazards of small utrs vis motif finder. (Add warning)
  18         * Add warning about saving FASTA file.
  19         * Add a general principles section near the top
  20                 * Using comparison algorithm which will pickup all repeats
  21                 * Add info about repeatmasking
  22                 * Checking upstream and downstream genes to make sure you are in the right regions.
  23         * Later on: look into Ensembl
  24         * Look into method of homology instead of blating.
  25         * Mention advantages of using mupa.
  26         * Mention the difference between using arrows and scroll bar
  27         * Document the color for motifs
  28         * Update for Mac user left-click
  29
  30         * Wormbase/Flybase/mirBASE tutorials
  31
  32
  33
  34 .. contents::
  35
  36 Status
  37 ======
  38
  39
  40 ..
  41
  42   .. Major New Features
  43   .. ------------------
  44
  45   .. Change Log
  46   .. ----------
  47
  48   .. INSERT CHANGE LOG HERE
  49   .. END INSERT CHANGE LOG
  50
  51 Features to be Implemented
  52 --------------------------
  53
  54 For an up-to-date list of features to be implemented visit:
  55 http://woldlab.caltech.edu/cgi-bin/mussa/roadmap
  56
  57 Introduction
  58 ============
  59
  60
  61 What is Mussagl?
  62 ----------------
  63
  64 Mussa is an N-way version of the FamilyRelations (which is a part of
  65 the Cartwheel project) 2-way comparative sequence analysis
  66 software. Given DNA sequence from N species, Mussa uses all possible
  67 pairwise comparisons to derive an N-wise comparison. For example, given
  68 sequences 1,2,3, and 4, Mussa makes 6 2-way comparisons: 1vs2, 1vs3,
  69 1vs4, 2vs3, 2vs4, and 3vs4. It then compares all the links between
  70 these comparisons, saving those that satisfy a transitivity
  71 requirement. The saved paths are then displayed in an interactive
  72 viewer.
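
The pairing step described above can be sketched in a few lines of Python. This only illustrates how the 2-way comparisons are counted; it is not Mussa's implementation, and the function name ``pairwise_comparisons`` is made up for this example.

```python
from itertools import combinations


def pairwise_comparisons(n_sequences):
    """Return every unordered pair of 1-based sequence numbers."""
    return list(combinations(range(1, n_sequences + 1), 2))


pairs = pairwise_comparisons(4)
print(len(pairs))  # 6
print(", ".join(f"{a}vs{b}" for a, b in pairs))
# 1vs2, 1vs3, 1vs4, 2vs3, 2vs4, 3vs4
```

In general, N sequences give N * (N - 1) / 2 pairwise comparisons, so the amount of work grows quickly as sequences are added.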
  73
  74 Short History of Mussa
  75 ----------------------
  76
  77 Mussa Python/PMW Prototype
  78 ~~~~~~~~~~~~~~~~~~~~~~~~~~
  79
  80 First Python/PMW based prototype.
  81
  82 Mussa C++/FLTK
  83 ~~~~~~~~~~~~~~
  84
  85 A rewrite for speed purposes using C++ and FLTK GUI toolkit.
  86
  87 Mussagl C++/Qt/OpenGL
  88 ~~~~~~~~~~~~~~~~~~~~~
  89
  90 Refactored version using the more elegant Qt GUI framework and
  91 OpenGL for hardware acceleration for those who have better graphics
  92 cards.
  93
  94 Getting Mussagl
  95 ===============
  96
  97 License
  98 -------
  99
 100 Mussagl has been released open source under the `GPL v2
 101 license`__.
 102
 103 __ GPL_
 104
 105 Platforms
 106 ---------
 107
 108 You have the option of building from source or downloading prebuilt
 109 binaries. Most people will want the prebuilt versions.
 110
 111 Supported Platforms:
 112
 113  * Mac OS X (binary or source)
 114  * Windows XP (binary or source)
 115  * Linux (source)
 116
 117 Download
 118 --------
 119
 120 Mussagl in binary form for OS X and Windows and/or source can be
 121 downloaded from http://mussa.caltech.edu/.
 122
 123 Install
 124 -------
 125
 126 Mac OS X
 127 ~~~~~~~~
 128
 129  * Download .dmg file.
 130  * Double click on .dmg file.
 131  * Drag Mussa icon to your /Applications folder.
 132  * Double Click on Mussa icon to open program.
 133
 134 Windows XP
 135 ~~~~~~~~~~
 136 Once you have downloaded the Mussagl installer, double click on the
 137 installer and follow the install instructions.
 138
 139 To start Mussagl, launch the program from Start > Programs > Mussagl >
 140 Mussagl.
 141
 142
 143 Linux
 144 ~~~~~
 145 Currently we do not have a binary installer for Linux. You will have
 146 to build from source. See the 'build from source' section below.
 147
 148
 149 Build from Source
 150 ~~~~~~~~~~~~~~~~~
 151
 152 Instructions for building from source can be found on the `build page
 153 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild>`_ of the
 154 `Mussa wiki`__.
 155
 156 __ wiki_
 157
 158
 159 Obtaining Input Data
 160 ====================
 161
 162 If you would like help obtaining data for use with Mussagl, you can
 163 skip ahead to the `Obtaining Input Data - Continued`_ section.
 164
 165 If you would like a tour of the software, continue with the `Using
 166 Mussagl`_ section.
 167
 168
 169 Using Mussagl
 170 =============
 171
 172
 173 Launch Mussagl
 174 --------------
 175 Launch Mussagl. It should look similar to the screen shot below.
 176
 177 .. image:: images/opened.png
 178    :alt: Launch Mussa
 179    :align: center
 180
 181
 182
 183 Create/Load Analysis
 184 ----------------------
 185
 186 Currently there are three ways to load a Mussa experiment.
 187
 188  1. `Create a new analysis`_
 189  2. `Load a mussa parameter file`_ (.mupa)
 190  3. `Load an analysis`_
 191
 192 .. _createnew:
 193
 194 Create a new analysis
 195 ~~~~~~~~~~~~~~~~~~~~~
 196
 197 To create a new analysis select 'Define analysis' from the 'File'
 198 menu. You should see a dialog box similar to the one below. For this
 199 demo we will use the example sequences that come with Mussagl.
 200
 201 .. image:: images/define_analysis.png
 202    :alt: Define Analysis
 203    :align: center
 204
 205 Instructions:
 206
 207  1. **Give the experiment a name**, for this demo, we'll use
 208     'demo_w30_t20'. Mussa will create a folder with this name to store
 209     the analysis files in once it has been run.
 210
 211  2. Choose a threshold_. For this demo **choose 20**. See the
 212     Threshold_ section for more detailed information.
 213
 214  3. Choose a `window size`_. For this demo **choose 30**.
 215
 216
 217  4. Choose the number of sequences_ you would like. For this demo
 218     **choose 3**.
 219
 220 .. image:: images/define_analysis_step1a.png
 221    :alt: Steps 1-4
 222    :align: center
 223
 224 First enter the species name of "Human" in the first "Species" text
 225 box. Now click on the 'Browse' button next to the sequence input box
 226 and then select /examples/seq/human_mck_pro.fa file. Do the same in
 227 the next two sequence input boxes selecting mouse_mck_pro.fa and
 228 rabbit_mck_pro.fa as shown below. Make sure to give them a species
 229 name as well. Note that you can create annotation files using the
 230 mussa `Annotation File Format`_ to add annotations to your sequence.
 231
 232 .. image:: images/define_analysis_step2.png
 233    :alt: Choose sequences
 234    :align: center
 235
 236 Click the **create** button and in a few moments you should see
 237 something similar to the following screen shot.
 238
 239 .. image:: images/demo.png
 240    :alt: Mussagl Demo
 241    :align: center
 242
 243 By default your analysis is NOT saved. If you try to close an analysis
 244 without saving, you will be prompted with a dialog box asking you if
 245 you would like to save your analysis. See the `Saving`_ section for
 246 details on saving your analysis. When saving, choose a directory and
 247 give the analysis the name **demo_w30_t20**. If you close and reopen
 248 Mussagl, you will then be able to load the saved analysis. See the
 249 `Load an analysis`_ section below for details.
 250
 251
 252 Load a mussa parameter file
 253 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 254
 255 If you prefer, you can define your Mussa analysis using the Mussa
 256 parameter file. See the `Parameter File Format`_ section for details
 257 on creating a .mupa file.
 258
 259 Once you have a .mupa file created, load Mussagl and select the **File >
 260 Create Analysis from File** menu option. Select the .mupa file and click
 261 open.
 262
 263 .. image:: images/load_mupa_menu.png
 264    :alt: Load Mussa Parameters
 265    :align: center
 266
 267 If you would like to see an example, you can load the
 268 **mck3test.mupa** file in the examples directory that comes with
 269 Mussagl or read the `Step 5 - Create Analysis` section from the
 270 `Obtaining Input Data - Continued`_ section.
 271
 272 .. image:: images/load_mupa_dialog.png
 273    :alt: Load Mussa Parameters Dialog
 274    :align: center
 275
 276
 277
 278 Load an analysis
 279 ~~~~~~~~~~~~~~~~
 280
 281 To load a previously run analysis open Mussagl and select the **File >
 282 Open Existing Analysis** menu option. Select an analysis **directory** and
 283 click open.
 284
 285 .. image:: images/load_analysis_menu.png
 286    :alt: Load Analysis Menu
 287    :align: center
 288
 289
 290 Main Window
 291 -----------
 292
 293 Overview
 294 ~~~~~~~~
 295 .. Screen-shot with numbers showing features.
 296
 297 .. image:: images/window_overview.png
 298    :alt: Mussa Window
 299    :align: center
 300
 301 Legend:
 302
 303  1. `DNA Sequence (Black bars)`_
 304
 305  2. Annotation_
 306
 307  3. Motif_
 308
 309  4. `Red conservation tracks`_
 310
 311  5. `Blue conservation tracks`_
 312
 313  6. `Zoom Factor`_ (Base pairs per pixel)
 314
 315  7. `Dynamic Threshold`_
 316
 317  8. `Sequence Information Bar`_
 318
 319  9. `Sequence Scroll Bar`_
 320
 321
 322 DNA Sequence (black bars)
 323 ~~~~~~~~~~~~~~~~~~~~~~~~~
 324
 325 .. image:: images/sequence_bar.png
 326    :alt: Sequence Bar
 327    :align: center
 328
 329 Each of the black bars represents one of the loaded sequences, in this
 330 case the sequence around the gene 'MCK' in human, mouse, and rabbit.
 331
 332
 333 Annotation
 334 ~~~~~~~~~~
 335
 336 .. figure:: images/annotation.png
 337    :alt: Annotation
 338    :align: center
 339
 340    Annotation shown in green on sequence bar.
 341
 342
 343 Annotations can be included on any of the sequences using the `Load a
 344 mussa parameter file`_ or `Create a new analysis`_ method of loading
 345 your sequences. You can define annotations by location or using an
 346 exact sub-sequence or a FASTA sequence of the section of DNA you wish
 347 to annotate. See the `Annotation File Format`_ section for details.
 348
 349
 350 Motif
 351 ~~~~~
 352
 353 .. figure:: images/motif.png
 354    :alt: Motif
 355    :align: center
 356
 357    Motif shown in light blue on sequence bar.
 358
 359 The only real difference between an annotation and motif in Mussagl is
 360 that you can define motifs and choose a color from within the GUI. See
 361 the `Motifs`_ section for more information.
 362
 363
 364 Red conservation tracks
 365 ~~~~~~~~~~~~~~~~~~~~~~~
 366
 367 .. figure:: images/conservation_tracks.png
 368    :alt: Conservation Tracks
 369    :align: center
 370
 371    Conservation tracks shown as red and blue lines between sequence
 372    bars.
 373
 374 The **red lines** between the sequence bars represent conservation
 375 between the sequences (i.e. not reverse complement matches).
 376
 377 The amount of sequence conservation shown will depend on how closely
 378 your sequences are related and the `dynamic threshold`_ you are using.
 379
 380 To **deselect**, click and drag over any white area and then release
 381 the mouse button.
 382
 383
 384 Blue conservation tracks
 385 ~~~~~~~~~~~~~~~~~~~~~~~~
 386
 387 .. figure:: images/conservation_tracks.png
 388    :alt: Conservation Tracks
 389    :align: center
 390
 391    Conservation tracks shown as red and blue lines between sequence
 392    bars.
 393
 394 **Blue lines** represent **reverse complement** conservation relative
 395 to the sequence attached to the top of the blue line.
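
For readers unfamiliar with the term, a reverse complement is the sequence read from the opposite strand. The short Python sketch below shows the idea; the helper name ``reverse_complement`` is hypothetical and the code is not taken from Mussagl.

```python
# Map each base to its complement; lower case is handled as well.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("GATTACA"))  # TGTAATC
```

A blue line therefore connects a region in one sequence to a region that matches when read this way in the other sequence.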
 396
 397 The amount of sequence conservation shown will depend on how closely
 398 your sequences are related and the `dynamic threshold`_ you are using.
 399
 400 To **deselect**, click and drag over any white area and then release
 401 the mouse button.
 402
 403 Zoom Factor
 404 ~~~~~~~~~~~
 405
 406 .. image:: images/zoom_factor.png
 407    :alt: Zoom Factor
 408    :align: center
 409
 410 The zoom factor represents the number of base pairs represented per
 411 pixel. When you zoom in far enough, the display will switch from a
 412 black bar representing the sequence to the actual sequence (well, an
 413 ASCII representation of the sequence).
 414
 415
 416 Dynamic Threshold
 417 ~~~~~~~~~~~~~~~~~
 418
 419 .. image:: images/dynamic_threshold.png
 420    :alt: Dynamic Threshold
 421    :align: center
 422
 423 You can dynamically change the threshold for how strong a match you
 424 consider the conservation to be by changing the value in the dynamic
 425 threshold box.
 426
 427 The value you enter is the minimum number of base pairs that have to
 428 match for a region to be considered conserved. The second number, which
 429 you can't change, is the `window size`_ you used when creating the
 430 experiment. The last number is the percent match.
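
As a rough illustration, and assuming the percent match is simply the threshold divided by the window size (the manual does not spell out the formula or the rounding Mussagl uses), the three numbers relate as follows. The function name ``percent_match`` is made up for this sketch.

```python
def percent_match(threshold, window):
    """Percent of the window that must match, assuming threshold / window."""
    if not 0 < threshold <= window:
        raise ValueError("threshold must be between 1 and the window size")
    return 100.0 * threshold / window


print(f"{percent_match(20, 30):.1f}%")  # 66.7%
print(f"{percent_match(30, 30):.1f}%")  # 100.0%
```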
 431
 432 Below is an animation of the dynamic threshold being increased over
 433 time.
 434
 435 .. image:: images/threshold_change.gif
 436    :alt: Animated Dynamic Threshold
 437    :align: center
 438
 439 See the Threshold_ section for more information.
 440
 441
 442 Sequence Information Bar
 443 ~~~~~~~~~~~~~~~~~~~~~~~~
 444
 445 .. image:: images/seq_info_bar.png
 446    :alt: Sequence Information Bar
 447    :align: center
 448
 449 The sequence information bars can be found to the left and right sides
 450 of Mussagl. Next to each sequence you will find the following
 451 information:
 452
 453  1. Species (If it has been defined)
 454  2. Total Size of Sequence
 455  3. Current base pair position
 456
 457 Note that you can **update the species** text box. Make sure to **save your
 458 experiment** after making this change by selecting **File > Save
 459 Analysis** from the menu.
 460
 461 Sequence Scroll Bar
 462 ~~~~~~~~~~~~~~~~~~~
 463
 464 .. image:: images/scroll_bar.png
 465    :alt: Sequence Scroll Bar
 466    :align: center
 467
 468 The scroll bar allows you to scroll through the sequence which is
 469 useful when you have zoomed in using the `zoom factor`_.
 470
 471
 472 Saving
 473 ------
 474
 475 Save on Close
 476 ~~~~~~~~~~~~~
 477
 478 Whenever you create a new analysis or make a change such as
 479 adding/editing a motif or changing a species name, an asterisk (*)
 480 will appear in the title of the window showing that there are changes
 481 that have not been saved. If you close a Mussa window without saving
 482 changes, Mussa will ask you if you would like to save the changes that
 483 have been made.
 484
 485 Save Analysis
 486 ~~~~~~~~~~~~~
 487
 488 After making changes, such as updating species names or adding/editing
 489 motifs, you can save these changes by selecting the **File > Save
 490 analysis** menu option or pressing **CTRL + S** (PC) or
 491 **Apple/Command Key + S** (on Mac).
 492
 493 .. image:: images/save_analysis.png
 494    :alt: Save analysis
 495    :align: center
 496
 497 Save Analysis As
 498 ~~~~~~~~~~~~~~~~
 499
 500 To save a copy of your analysis to a new location, select the **File >
 501 Save analysis as** menu option and choose a new location and name for
 502 your analysis.
 503
 504 .. image:: images/save_analysis_as.png
 505    :alt: Save analysis
 506    :align: center
 507
 508 Save Motif List
 509 ~~~~~~~~~~~~~~~
 510
 511 See `Save Motifs to File`_ in the `Motifs`_ section.
 512
 513
 514 Viewing Multiple Analyses
 515 -------------------------
 516
 517 Sometimes it is useful to view more than one analysis at a time. To
 518 accomplish this, Mussa allows you to open a new Mussa window by
 519 selecting the **File > New Mussa Window** menu option.
 520
 521 .. image:: images/new_mussa_window_menu.png
 522    :alt: New Mussa Window Menu Option
 523    :align: center
 524
 525 A new Mussa window will pop up.
 526
 527 .. figure:: images/new_mussa_window.png
 528    :alt: New Mussa Window
 529    :align: center
 530
 531    A new Mussa window on the right, in which a second analysis has
 532    been loaded.
 533
 534 Now you can create a new analysis or load an existing one in this new
 535 window, as described in the `Create/Load Analysis`_ section.
 536
 537 You can view as many analyses as you can fit on your screen or until
 538 you run out of available RAM. If you notice a rapid decrease in
 539 performance and hear lots of noise coming from your hard drive, you
 540 probably ran out of RAM and are now using virtual memory (i.e. much
 541 much slower). If this happens, you may need to avoid opening as many
 542 analyses at one time.
 543
 544
 545 Annotations / Motifs
 546 --------------------
 547
 548 Annotations
 549 ~~~~~~~~~~~
 550
 551 Currently annotations can be added to a sequence using the mussa
 552 `annotation file format`_ and can be loaded by selecting the
 553 annotation file when defining a new analysis (see `Create a new
 554 analysis`_ section) or by defining a .mupa file pointing to your
 555 annotation file (see `Load a mussa parameter file`_ section).
 556
 557 Motifs
 558 ~~~~~~
 559
 560 Load Motifs from File
 561 *********************
 562
 563 It is possible to load motifs from a file which was saved from a
 564 previous run or by defining your own motif file. See the `Motif File
 565 Format`_ section for details.
 566
 567 NOTE: Valid motif list file extensions are:
 568
 569   * .mtl
 570   * .txt
 571
 572 To load a motif file, select **Load Motif List** item from the
 573 **File** menu and select a motif list file.
 574
 575 .. image:: images/load_motif.png
 576    :alt: Load Motif List
 577    :align: center
 578
 579
 580 Save Motifs to File
 581 *******************
 582
 583 Motifs from the `Motif Dialog`_ can be saved to file for use with
 584 other analyses. If you just want your motifs to be saved with your
 585 analysis, see the `save analysis`_ section for details.
 586
 587 To save a motif list, select **File > Save Motifs** menu option. By
 588 default, Mussa will append .mtl if you do not provide a file extension
 589 (valid file extensions: .mtl & .txt).
 590
 591 .. image:: images/save_motifs.png
 592    :alt: Save Motifs
 593    :align: center
 594
 595
 596 Motif Dialog
 597 ************
 598
 599 Mussa has the ability to find lab motifs using the `IUPAC Nucleotide
 600 Code`_ for defining a motif. To define a motif, select **Edit > Edit
 601 Motifs** menu item as shown below.
 602
 603 .. image:: images/view_edit_motifs.png
 604    :alt: "View > Edit Motifs" Menu
 605    :align: center
 606
 607 You will see a dialog box appear with an "apply" button in the bottom
 608 right and one row for defining a motif and the color that will be
 609 displayed on the sequence. When you start adding your first motif, an
 610 additional row will be added. The check box in the first column
 611 defines whether the motif is displayed or not. The second column is
 612 the motif display color. The third column is for the name of your
 613 motif and finally, the fourth column is the motif itself.
 614
 615 .. image:: images/motif_dialog_start.png
 616    :alt: Motif Dialog
 617    :align: center
 618
 619 Now let's make a motif **'AT[C or G]CT'**. Using the `IUPAC Nucleotide
 620 Code`_, type in **'ATSCT'** into the motif field and **'My Motif'** for
 621 the name in the name field as shown below.
 622
 623 Notice how a second row appeared when you started to add the first
 624 motif. Every time you add a new motif, a new row will appear allowing
 625 you to add as many motifs as you need.
 626
 627 .. image:: images/motif_dialog_enter_motif.png
 628    :alt: Enter Motif
 629    :align: center
 630
 631 Now choose a color for your motif by clicking on the colored area to
 632 the left of the name field. Remember to choose a color that will show
 633 up well with a black bar as the background. A good tool for picking a
 634 color is the `Colour Contrast Analyser
 635 <http://juicystudio.com/services/colourcontrast.php>`_ by
 636 `juicystudio.com <http://juicystudio.com/>`_.
 637
 638 .. image:: images/color_chooser.png
 639    :alt: Color Chooser
 640    :align: center
 641
 642 Once you have selected the color for your motif, click on the
 643 **'apply'** button. If Mussa finds matches to your motif, they
 644 will now show up in the main Mussa window.
 645
 646 Before Motif:
 647
 648 .. image:: images/motif_dialog_bar_before.png
 649    :alt: Sequence bar before motif
 650    :align: center
 651
 652 After Motif:
 653
 654 .. image:: images/motif_dialog_bar_after.png
 655    :alt: Sequence bar after motif
 656    :align: center
 657
 658 To save your motifs with your analysis, see the `save analysis`_
 659 section. To save your motifs to a file, see the `save motifs to file`_
 660 section.
 661
 662 Deleting a Motif
 663 ^^^^^^^^^^^^^^^^
 664
 665 To delete a motif, remove all text from the name and sequence columns
 666 and close the motif editor.
 667
 668 View Mussa Alignments
 669 ---------------------
 670
 671 Mussagl allows you to zoom in on Mussa alignments by selecting the set
 672 of alignment(s) of interest. To do this, move the mouse near the
 673 alignment you are interested in viewing and then **PRESS** and
 674 **HOLD** the **LEFT mouse button** and **drag the mouse** to the other
 675 side of the conservation track so that you see a bounding box
 676 overlapping the alignment(s) of interest and then **let go** of the
 677 *left mouse button*.
 678
 679 In the example below, I started by left-clicking on the area marked by
 680 a red dot (upper left corner of bounding box) and dragging the mouse to
 681 the area marked by a blue dot (lower right corner of the bounding box)
 682 and letting go of the left mouse button.
 683
 684 .. image:: images/select_sequence.png
 685    :alt: Select Sequence
 686    :align: center
 687
 688 All of the lines which were not selected should be washed out as shown
 689 below:
 690
 691 .. image:: images/washed_out.png
 692    :alt: Tracks washed out
 693    :align: center
 694
 695 With a selection made, go to the **View** menu and select **View mussa alignment**.
 696
 697 .. image:: images/view_mussa_alignment.png
 698    :alt: View mussa alignment
 699    :align: center
 700
 701 You should see the alignment at the base-pair level as shown below.
 702
 703 .. image:: images/mussa_alignment.png
 704    :alt: Mussa alignment
 705    :align: center
 706
 707
 708 Sub-analysis
 709 ------------
 710
 711 Sub-analysis was created to allow you to analyze a sub-region using
 712 different parameters. This may allow you to find matches which may not
 713 have shown up with your initial settings.
 714
 715 To run a sub-analysis **highlight** a section of sequence and *right
 716 click* on it and select **Add to subanalysis**. Do the same for the
 717 sequences shown in orange in the screenshot below. Note that you **are
 718 NOT limited** to selecting only one subsequence from the same
 719 sequence.
 720
 721 .. image:: images/subanalysis_select_seqs.png
 722    :alt: Subanalysis sequence selection
 723    :align: center
 724
 725 Once you have added your sequences for subanalysis, choose a `window size`_ and `threshold`_ and click **Ok**.
 726
 727 .. image:: images/subanalysis_dialog.png
 728    :alt: Subanalysis Dialog
 729    :align: center
 730
 731 A new Mussa window will appear with the subanalysis of your sequences
 732 once it's done running. This may take a while if you selected large
 733 chunks of sequence with a loose threshold.
 734
 735 .. image:: images/subanalysis_done.png
 736    :alt: Subanalysis complete
 737    :align: center
 738
 739
 740 Copying sequence to clipboard
 741 -----------------------------
 742
 743 To copy a sequence to the clipboard, highlight a section of sequence,
 744 as shown in the screen shot below, and do one of the following:
 745
 746  * Select **Copy as FASTA** from the **Edit** menu.
 747  * **Right-Click (Left-click + Apple/Command Key on Mac)** on the highlighted sequence and select **Copy as FASTA**.
 748  * Press **Ctrl + C (on PC)** or **Apple/Command Key + C (on Mac)** on the keyboard.
 749
 750 .. image:: images/copy_sequence.png
 751    :alt: Copy sequence
 752    :align: center
 753
 754
 755 Saving to an Image
 756 ---------------------------------
 757
 758 To save your current mussa view to an image, select **File > Save to
 759 image...** as shown below.
 760
 761 .. image:: images/save_to_image_menu.png
 762    :alt: File > Save to image...
 763    :align: center
 764
 765 You can define the width and the height of the image to save. By
 766 default it will use the same size of your current view. Since the
 767 Mussa view is implemented using vectors, if you choose a larger size
 768 than your current view, Mussa will redraw at the higher resolution
 769 when saving. In other words, you get higher quality images when saving
 770 at a higher resolution.
 771
 772 If you check the "Lock aspect ratio" check box, which I have circled
 773 in red, then when you change one value, say width, the other, height,
 774 will update automatically to keep the same aspect ratio.
 775
 776 .. image:: images/save_to_image_dialog.png
 777    :alt: Save to image dialog
 778    :align: center
 779
 780 Click save and choose a location and filename for your file.
 781
 782 The valid image formats are:
 783
 784   * .png (default if no extension specified.)
 785   * .jpg
 786
 787
 788 Detailed Information
 789 --------------------
 790
 791 Threshold
 792 ~~~~~~~~~
 793
 794 The threshold of an analysis is the minimum number of base pair matches
 795 that must be met in order to be kept as a match. Note that you can vary
 796 the threshold from within Mussagl. For example, if you choose a
 797 `window size`_ of **30** and a **threshold** of **20** the mussa nway
 798 transitive algorithm will store all matches that are 20 out of 30 bp
 799 matches or better and pass it on to Mussagl. Mussagl will then allow
 800 you to dynamically choose a threshold from 20 to 30 base pairs. A
 801 threshold of 30 bps would only show 30 out of 30 bp matches. A
 802 threshold of 20 bps would show all matches of 20 out of 30 bps or
 803 better. If you would like to see results for matches lower than 20 out
 804 of 30, you will need to rerun the analysis with a lower threshold.
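
The relationship between the stored threshold and the dynamic threshold can be sketched in Python. This is a naive, ungapped window comparison written for illustration only. It is not the Mussa N-way transitive algorithm, and the function names are hypothetical.

```python
def window_matches(seq_a, seq_b, window, threshold):
    """Return (start_a, start_b, score) for every pair of windows in which
    at least `threshold` of the `window` positions are identical."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            score = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if score >= threshold:
                hits.append((i, j, score))
    return hits


def apply_dynamic_threshold(hits, dynamic_threshold):
    """Filter stored hits without recomputing them, as the GUI control does
    conceptually. Hits below the original threshold were never stored."""
    return [hit for hit in hits if hit[2] >= dynamic_threshold]


stored = window_matches("AAAA", "AAAT", window=4, threshold=3)
print(stored)                               # [(0, 0, 3)]
print(apply_dynamic_threshold(stored, 4))   # []
```

The sketch shows why a lower threshold requires a rerun: raising the dynamic threshold only filters hits that were already stored, while hits scoring below the original threshold were discarded during the analysis.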
 805
 806 Window Size
 807 ~~~~~~~~~~~
 808
 809 The typical sizes people tend to choose are between 20 and 30. You
 810 will likely need to experiment with this setting depending on your
 811 needs and input sequence.
 812
 813
 814 Sequences
 815 ~~~~~~~~~
 816
 817 Mussa reads in sequences which are formatted in the FASTA_
 818 format. Mussa may take a long time to run (>10 minutes) if the total
 819 bp length nears 280 kb. Once Mussa has run, you can reload
 820 previously run analyses.
 821
 822 FIXME: We have learned more about how much sequence and how many to
 823 put in Mussagl, this information should be documented here.
 824
 825
 826 Mussa File Formats
 827 ------------------
 828
 829 .. _param:
 830
 831 Parameter File Format
 832 ~~~~~~~~~~~~~~~~~~~~~
 833
 834 Note that for the comment character '#' to work, it must be followed
 835 by a space (i.e. '# ').
 836
 837 **File Format (.mupa):**
 838
 839 ::
 840
 841   # name of analysis directory and stem for associated files
 842   ANA_NAME <analysis_name>
 843
 844   # if APPEND vars true, a _wXX and/or _tYY added to analysis name
 845   # where XX = WINDOW and YY = THRESHOLD
 846   # Highly recommended with use of command line override of WINDOW or THRESHOLD
 847   APPEND_WIN <true/false>
 848   APPEND_THRES <true/false>
 849
 850   # first sequence info
 851   SEQUENCE <FASTA_file_path>
 852   ANNOTATION <annotation_file_path>
 853   SEQ_START <sequence_start>
 854
 855   # the second sequence info
 856   SEQUENCE <FASTA_file_path>
 857   # ANNOTATION <annotation_file_path>
 858   SEQ_START <sequence_start>
 859   # SEQ_END <sequence_end>
 860
 861   # third sequence info
 862   SEQUENCE <FASTA_file_path>
 863   # ANNOTATION <annotation_file_path>
 864
 865   # analysis parameters: command line args -w -t will override these
 866   WINDOW <num>
 867   THRESHOLD <num>
 868
 869 .. csv-table:: Parameter File Options:
 870    :header: "Option Name", "Value", "Default", "Required", "Description"
 871    :widths: 30 30 30 30 60
 872
 873    "ANA_NAME", "string", "N/A", "true", "Name of analysis (also the
 874    name of the directory where the analysis will be saved)."
 875    "APPEND_WIN", "true/false", "?", "?", "Appends _w## to ANA_NAME"
 876    "APPEND_THRES", "true/false", "?", "?", "Appends _t## to ANA_NAME"
 877    "SEQUENCE", "/FASTA/filepath.fa", "N/A", "true", "Absolute/Relative file
 878    path to sequence."
 879    "ANNOTATION", "/annotation/filepath.txt", "N/A", "false", "Optional
 880    annotation file. See `annotation file format`_ section for more
 881    information."
 882    "SEQ_START", "integer", "1", "false", "Optional index into FASTA file"
 883    "SEQ_END", "integer", "1", "false", "Optional index into FASTA file"
 884    "WINDOW", "integer", "N/A", "true", "`Window Size`_"
 885    "THRESHOLD", "integer", "N/A", "true", "`Threshold`_"
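
If you generate many analyses, a small script can write the parameter file for you. The Python sketch below uses only the option names listed in the table above; the helper ``build_mupa`` and the file paths in the usage example are assumptions made for illustration, so adjust them to your own layout.

```python
def build_mupa(ana_name, fasta_paths, window, threshold, annotations=None):
    """Return the text of a minimal .mupa file.

    `annotations` optionally maps a FASTA path to its annotation file.
    """
    annotations = annotations or {}
    # Comments use '# ' (hash plus space), as the format requires.
    lines = [
        "# name of analysis directory and stem for associated files",
        f"ANA_NAME {ana_name}",
        "",
    ]
    for path in fasta_paths:
        lines.append(f"SEQUENCE {path}")
        if path in annotations:
            lines.append(f"ANNOTATION {annotations[path]}")
        lines.append("")
    lines.append(f"WINDOW {window}")
    lines.append(f"THRESHOLD {threshold}")
    return "\n".join(lines) + "\n"


# Hypothetical relative paths to the demo sequences.
text = build_mupa(
    "demo",
    [
        "examples/seq/human_mck_pro.fa",
        "examples/seq/mouse_mck_pro.fa",
        "examples/seq/rabbit_mck_pro.fa",
    ],
    window=30,
    threshold=20,
)
print(text)
```

The sketch leaves out APPEND_WIN, APPEND_THRES, SEQ_START and SEQ_END; add them in the same way if you need them.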
 886
 887 .. _annot:
 888
 889 Annotation File Format
 890 ~~~~~~~~~~~~~~~~~~~~~~
 891
 892 The first line in the file is the sequence name. Each line thereafter
 893 is a **space** separated annotation.
 894
 895 Update:
 896
 897  * The annotation format now supports FASTA sequences embedded in the
 898    annotation file as shown in the format example below. Mussagl will
 899    take this sequence and look for an exact match of this sequence in
 900    your sequences. If a match is found, it will label it with the name
 901    from the FASTA header.
 902
 903 Format:
 904
 905 ::
 906
 907   <species/sequence_name>
 908   <start> <stop> <annotation_name> <annotation_type>
 909   <start> <stop> <annotation_name> <annotation_type>
 910   <start> <stop> <annotation_name> <annotation_type>
 911   <start> <stop> <annotation_name> <annotation_type>
 912   >FASTA Header
 913   ACTGACTGACGTACGTAGCTAGCTAGCTAGCACG
 914   ACGTACGTACGTACGTAGCTGTCATACGCTAGCA
 915   TGCGTAGAGGATCTCGGATGCTAGCGCTATCGAT
 916   ACGTACGGCAGTACGCGGTCAGA
 917   <start> <stop> <annotation_name> <annotation_type>
 918   ...
 919
 920 Example:
 921
 922 ::
 923
 924   Mouse
 925   251 500 Glorp Glorptype
 926   751 1000 Glorp Glorptype
 927   1251 1500 Glorp Glorptype
 928   >My favorite DNA sequence
 929   GATTACA
 930   1751 2000 Glorp Glorptype
 931
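To check an annotation file before loading it, you can parse it yourself. The Python sketch below reads the format described above. It is an independent illustration and not Mussagl's own parser: the function name is hypothetical, and it assumes a line belongs to an embedded FASTA record unless it begins with two integers.

```python
def parse_annotation_file(text):
    """Parse annotation text into (sequence_name, intervals, fasta_records)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty annotation file")
    sequence_name = lines[0]
    intervals = []      # (start, stop, annotation_name, annotation_type)
    fasta_records = []  # [header, sequence]
    in_fasta = False
    for line in lines[1:]:
        fields = line.split()
        if line.startswith(">"):
            # Start of an embedded FASTA record.
            fasta_records.append([line[1:].strip(), ""])
            in_fasta = True
        elif len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            name = fields[2] if len(fields) > 2 else ""
            kind = fields[3] if len(fields) > 3 else ""
            intervals.append((int(fields[0]), int(fields[1]), name, kind))
            in_fasta = False
        elif in_fasta:
            # Continuation of the current FASTA sequence.
            fasta_records[-1][1] += line
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return sequence_name, intervals, [tuple(r) for r in fasta_records]


EXAMPLE = """\
Mouse
251 500 Glorp Glorptype
751 1000 Glorp Glorptype
1251 1500 Glorp Glorptype
>My favorite DNA sequence
GATTACA
1751 2000 Glorp Glorptype
"""

name, intervals, fasta = parse_annotation_file(EXAMPLE)
print(name, len(intervals), fasta)
```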
 932
 933 .. _motif_file_format:
 934
 935 Motif File Format
 936 ~~~~~~~~~~~~~~~~~
 937
 938 Format:
 939
 940 ::
 941
 942   <motif> <optional name> <red> <green> <blue> <optional alpha>
 943
 944 Example:
 945
 946 ::
 947
 948   AGTGAG "My First Motif" 0.333333 0.588235 1 1
 949   ATGAT "2nd Motif" 1 0 0 1
 950
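A motif list line can be read with Python's standard ``shlex`` module, which handles the quoted name. The sketch below is an illustration, not Mussagl's parser: the function name is hypothetical, it assumes color values are floats between 0 and 1 as in the example, and it assumes a missing alpha means fully opaque (1.0).

```python
import shlex


def parse_motif_line(line):
    """Return (motif, name, (red, green, blue, alpha)) for one line."""
    fields = shlex.split(line)
    if not fields:
        raise ValueError("empty motif line")
    motif, rest = fields[0], fields[1:]
    name = ""
    if rest:
        # The name is optional: treat the next field as a name only if it
        # is not a number. A purely numeric name would be ambiguous.
        try:
            float(rest[0])
        except ValueError:
            name = rest.pop(0)
    if len(rest) not in (3, 4):
        raise ValueError(f"expected 3 or 4 color values, got {len(rest)}")
    red, green, blue = (float(value) for value in rest[:3])
    alpha = float(rest[3]) if len(rest) == 4 else 1.0  # assumed default
    return motif, name, (red, green, blue, alpha)


print(parse_motif_line('AGTGAG "My First Motif" 0.333333 0.588235 1 1'))
print(parse_motif_line('ATGAT "2nd Motif" 1 0 0 1'))
```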
 951
 952 IUPAC Nucleotide Code
 953 ~~~~~~~~~~~~~~~~~~~~~~
 954
 955 For your convenience, below is a table of the IUPAC Nucleotide Code.
 956
 957 The following table is table 1 from "Nomenclature for Incompletely
 958 Specified Bases in Nucleic Acid Sequences" which can be found at
 959 http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html.
 960
 961 ======  =================  ===================================
 962 Symbol  Meaning            Origin of designation
 963 ======  =================  ===================================
 964 G       G                  Guanine
 965 A       A                  Adenine
 966 T       T                  Thymine
 967 C       C                  Cytosine
 968 R       G or A             puRine
 969 Y       T or C             pYrimidine
 970 M       A or C             aMino
 971 K       G or T             Keto
 972 S       G or C             Strong interaction (3 H bonds)
 973 W       A or T             Weak interaction (2 H bonds)
 974 H       A or C or T        not-G, H follows G in the alphabet
 975 B       G or T or C        not-A, B follows A
 976 V       G or C or A        not-T (not-U), V follows U
 977 D       G or A or T        not-C, D follows C
 978 N       G or A or T or C   aNy
 979 ======  =================  ===================================
 980
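To see how the ambiguity codes in the table expand, the sketch below turns an IUPAC motif into a regular expression and searches a sequence with it. This illustrates the code table only; it does not describe how Mussa searches for motifs, it looks at the forward strand only, and the name ``iupac_to_regex`` is an assumption for this example.

```python
import re

# Symbol -> bases, copied from the table above.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT", "H": "ACT", "B": "GTC",
    "V": "GCA", "D": "GAT", "N": "GATC",
}


def iupac_to_regex(motif):
    """Turn an IUPAC motif such as 'AGTGAR' into a compiled regex."""
    parts = []
    for symbol in motif.upper():
        bases = IUPAC[symbol]            # KeyError for non-IUPAC symbols
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts), re.IGNORECASE)


pattern = iupac_to_regex("AGTGAR")       # R matches G or A
hits = [m.start() for m in pattern.finditer("ccAGTGAGttAGTGAAgg")]
print(hits)                              # 0-based start positions of matches
```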
 981
 982 Obtaining Input Data - Continued
 983 --------------------------------
 984
 985 If you already have your data, you may want to go to the `Using Mussagl`_
 986 section of the manual.
 987
 988 Let's say you have a gene of interest called 'SMN1' and you want to
 989 know how the sequence surrounding the gene is conserved across
 990 multiple species. In this section we will retrieve the DNA sequence
 991 for SMN1 and prepare it for use in Mussa.
 992
 993 For more information about SMN1 visit `NCBI's OMIM
 994 <http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=609682>`_.
 995
 996 The SMN1 data retrieved in this section can be downloaded from the
 997 `Mussa Example Data
 998 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/ExampleData>`_ page if
 999 you prefer to skip this section of the manual.
1000
1001 UCSC Genome Browser Method
1002 ~~~~~~~~~~~~~~~~~~~~~~~~~~
1003
1004 There are many methods of retrieving DNA sequence, but for this
1005 example we will retrieve SMN1 through the UCSC genome browser located
1006 at http://genome.ucsc.edu/.
1007
1008
1009 .. image:: images/ucsc_genome_browser_home.png
1010    :alt: UCSC Genome Browser
1011    :align: center
1012
1013 Step 1 - Find SMN1
1014 ******************
1015
1016 The first step in finding SMN1 is to use the **Gene Sorter** menu
1017 option which I have highlighted in orange below:
1018
1019 .. image:: images/ucsc_menu_bar_gene_sorter.png
1020    :alt: Gene Sorter Menu Option
1021    :align: center
1022
1023 Gene Sorter page:
1024
1025 .. image:: images/ucsc_gene_sorter.png
1026    :alt: Gene Sorter
1027    :align: center
1028
1029 We will start by looking for SMN1 in the **Human Genome** and **sorting by name similarity**.
1030
1031 .. image:: images/ucsc_gs_sort_name_sim.png
1032    :alt: Gene Sorter - Name Similarity
1033    :align: center
1034
1035 After you have selected **Human Genome** and **sorting by name similarity**, type *SMN1* into the search box.
1036
1037 .. image:: images/ucsc_gs_smn1.png
1038    :alt: Gene
1039    :align: center
1040
1041 Press **Go!** and you should see the following page:
1042
1043 .. image:: images/ucsc_gs_found.png
1044    :alt: Found SMN1
1045    :align: center
1046
1047 Click on **SMN1** and you will be taken to the gene expression atlas
1048 page.
1049
1050 .. image:: images/ucsc_gs_genome_position.png
1051    :alt: Gene expression atlas
1052    :align: center
1053
1054 Click on **chr5 70,270,558** found in the **SMN1 row**, **Genome
1055 position column**.
1056
1057 Now we have found the location of SMN1 in the human genome!
1058
1059 .. image:: images/ucsc_gb_smn1_human.png
1060    :alt: Genome Browser - SMN1 (human)
1061    :align: center
1062
1063
1064 Step 2 - Download CDS/UTR sequence for annotations
1065 **************************************************
1066
1067 Since we have found **SMN1**, this would be a convenient time to extract
1068 the DNA sequence for the CDS and UTRs of the gene to use it as an
1069 annotation_ in Mussa.
1070
1071 **Click on SMN1**, found **between** the **two orange arrows** shown
1072 below.
1073
1074 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
1075    :alt: Genome Browser - SMN1 (human) - Orange Arrows
1076    :align: center
1077
1078 You should find yourself at the SMN1 description page.
1079
1080 .. image:: images/ucsc_gb_smn1_description_page.png
1081    :alt: Genome Browser - SMN1 (human) - Description page
1082    :align: center
1083
1084 **Scroll down** until you get to the **Sequence section** and click on
1085 **Genomic (chr5:70,256,524-70,284,592)**.
1086
1087 .. image:: images/ucsc_gb_smn1_human_sequence.png
1088    :alt: Genome Browser - SMN1 (human) - Sequence
1089    :align: center
1090
1091 You should now be at the **Genomic sequence near gene** page:
1092
1093 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence.png
1094    :alt: Genome Browser - SMN1 (human) - Get genomic sequence
1095    :align: center
1096
1097 Make the following changes (highlighted in orange in the screenshot
1098 below):
1099
1100  1. UNcheck **introns**.
1101     (We only want to annotate CDS and UTRs.)
1102  2. Select **one FASTA record** per **region**.
1103     (Mussa needs each CDS and UTR represented by one FASTA record per CDS/UTR).
1104  3. Select **CDS in upper case, UTR in lower case.**
1105
1106 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_diff.png
1107    :alt: Genome Browser - SMN1 (human) - Get genomic sequence setup
1108    :align: center
1109
1110 Now click the **submit** button. You will then see a FASTA file with
1111 many FASTA records representing the CDS and UTRs.
1112
1113 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_submit.png
1114    :alt: Genome Browser - SMN1 (human) - CDS/UTR sequence
1115    :align: center
1116
1117 Now you need to save the FASTA records to a **text file**. If you are
1118 using **Firefox** or **Internet Explorer 6+** click on the **File >
1119 Save As** menu option.
1120
1121 **IMPORTANT:** Make sure you select **Text Files** and **NOT**, I
1122 repeat, **NOT Webpage Complete** (see screenshot below).
1123
1124 Type in **smn1_human_annot.txt** for the file name.
1125
1126 .. image:: images/smn1_human_annot.png
1127    :alt: Genome Browser - SMN1 (human) - sequence annotation file
1128    :align: center
1129
1130 **IMPORTANT:** You should open the file with a text editor and make
1131 sure **no HTML** was saved. If you find any HTML markup, delete
1132 the markup and save the file.
1133
1134 Now we are going to **modify the file** you just saved to **add the
1135 name of the species** to the **annotation file**. All you have to do
1136 is **add a new line** at the **top of the file** with the word **'Human'** as
1137 shown below:
1138
1139 .. image:: images/smn1_human_annot_plus_human.png
1140    :alt: Genome Browser - SMN1 (human) - sequence annotation file
1141    :align: center
1142
1143 You can add more annotations to this file if you wish. See the
1144 `annotation file format`_ section for details of the file format. By
1145 including FASTA records in the annotation_ file, Mussa searches your
1146 DNA sequence for an exact match of the sequence in the annotation_
1147 file. If found, it will be marked as an annotation_ within Mussa.
1148
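The two manual clean-up tasks in this step (removing stray HTML and adding the species line) can also be scripted. The sketch below is one possible approach, not part of Mussa: ``prepare_annotation_text`` is a hypothetical helper, it assumes any ``<...>`` run on a single line is HTML markup, and the sample input is made up for illustration.

```python
import html
import re


def prepare_annotation_text(raw, species):
    """Strip leftover HTML tags and put the species name on the first line.

    Hypothetical helper; assumes any '<...>' run on one line is HTML markup.
    """
    no_tags = re.sub(r"<[^>\n]*>", "", raw)
    # Browsers that save as HTML may also escape '>' in FASTA headers as '&gt;'.
    plain = html.unescape(no_tags)
    lines = [line.rstrip() for line in plain.splitlines() if line.strip()]
    return "\n".join([species] + lines) + "\n"


# Made-up sample of a saved page that still contains <PRE> tags.
raw = "<PRE>\n&gt;example record\nGATTACA\n</PRE>\n"
print(prepare_annotation_text(raw, "Human"))
```

It is still worth opening the result in a text editor, since a script like this cannot tell whether the remaining text is a valid set of FASTA records.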
1149
1150 Step 3 - Download gene and upstream/downstream sequence
1151 *******************************************************
1152
1153 Use the back button in your web browser to get back to the **genome
1154 browser view** of **SMN1** as shown below.
1155
1156 .. image:: images/ucsc_gb_smn1_human.png
1157    :alt: Genome Browser - SMN1 (human)
1158    :align: center
1159
1160 There are two options for getting additional sequence around your
1161 gene. The more complex option is to zoom out until the genome browser
1162 shows all of the sequence you want and then follow the directions
1163 for the second option.
1164
1165 The second option, which we will choose, is to leave the genome
1166 browser zoomed exactly at the location of SMN1 and click on the
1167 **DNA** option on the menu bar (shown with orange arrows in the
1168 screenshot below).
1169
1170 .. image:: images/ucsc_gb_smn1_human_dna_option.png
1171    :alt: Genome Browser - SMN1 (human) - DNA Option
1172    :align: center
1173
1174 Now, on the **get dna in window** page, add an arbitrary amount of
1175 extra sequence to each end of the gene, let's say 5000 base pairs.
1176
1177 .. image:: images/ucsc_gb_smn1_human_get_dna.png
1178    :alt: Genome Browser - SMN1 (human) - Get DNA
1179    :align: center
1180
1181 Click the **get DNA** button.
1182
1183 .. image:: images/ucsc_gb_smn1_human_dna.png
1184    :alt: Genome Browser - SMN1 (human) - DNA
1185    :align: center
1186
1187 Save the DNA sequence to a text file called 'smn1_human_dna.fa' as we
1188 did in step 2 with the annotation file.
1189
1190 **IMPORTANT:** Make sure the file is saved as a text file and not an
1191 HTML file. Open the file with a text editor and remove any HTML markup
1192 you find.
1193
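If you prefer to work out the extended coordinates yourself (for example, to record them alongside the saved file), the arithmetic is simple. The sketch below assumes 1-based coordinates and uses the SMN1 genomic range shown on the description page in step 2; ``extend_region`` is a hypothetical helper, and it does not clip the upper bound to the chromosome length.

```python
def extend_region(start, stop, flank=5000):
    """Return (start, stop) widened by `flank` bases on each side."""
    return max(1, start - flank), stop + flank


# Genomic range from the SMN1 description page: chr5:70,256,524-70,284,592
start, stop = extend_region(70256524, 70284592)
print("chr5:{:,}-{:,}".format(start, stop))
```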
1194
1195 Step 4 - Same/similar/related gene in other species
1196 ***************************************************
1197
1198 What good is a multiple sequence alignment viewer without multiple
1199 sequences? Let's find a similar gene in a few more species.
1200
1201 Use the back button on your web browser until you get back to the **genome
1202 browser view** of **SMN1** as shown below.
1203
1204 .. image:: images/ucsc_gb_smn1_human.png
1205    :alt: Genome Browser - SMN1 (human)
1206    :align: center
1207
1208 **Click on SMN1**, found **between** the **two orange arrows** shown
1209 below.
1210
1211 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
1212    :alt: Genome Browser - SMN1 (human) - Orange Arrows
1213    :align: center
1214
1215 You should find yourself at the SMN1 description page.
1216
1217 .. image:: images/ucsc_gb_smn1_description_page.png
1218    :alt: Genome Browser - SMN1 (human) - Description page
1219    :align: center
1220
1221 **Scroll down** until you get to the **Sequence section** and click on
1222 **Protein (262 aa)**.
1223
1224 .. image:: images/ucsc_gb_smn1_human_sequence.png
1225    :alt: Genome Browser - SMN1 (human) - Sequence
1226    :align: center
1227
1228 Copy the SMN1 protein sequence by highlighting it and selecting the
1229 **Edit > Copy** option from the menu.
1230
1231 .. image:: images/smn1_human_protein.png
1232    :alt: Genome Browser - SMN1 (human) - Protein
1233    :align: center
1234
1235 Press the back button on the web browser once and then scroll to the
1236 top of the page and click on the **BLAT** option on the menu bar
1237 (shown below with orange arrows).
1238
1239 .. image:: images/ucsc_gb_smn1_human_blat.png
1240    :alt: Genome Browser - SMN1 (human) - Blat
1241    :align: center
1242
1243 **Paste** in the **protein sequence** and **change** the **genome** to
1244 **mouse** as shown below and then click **submit**.
1245
1246 .. image:: images/ucsc_gb_smn1_human_blat_paste.png
1247    :alt: Genome Browser - SMN1 (human) - Blat paste protein
1248    :align: center
1249
1250 Notice that we have two hits, one of which looks pretty good at 89.9%
1251 match.
1252
1253 .. image:: images/ucsc_gb_smn1_human_blat_hits.png
1254    :alt: Genome Browser - SMN1 (human) - Blat hits
1255    :align: center
1256
1257 **Click** on the **browser** link next to the 89.9% match. Notice in
1258 the genome browser (shown below) that there is an annotated gene
1259 called SMN1 for mouse which matches the line called **your sequence
1260 from blat search**. This means we can be fairly confident we found the
1261 right location in the mouse genome.
1262
1263 .. image:: images/ucsc_gb_smn1_human_blat_to_browser.png
1264    :alt: Genome Browser - SMN1 (human) - Blat to browser
1265    :align: center
1266
1267 Follow steps 1 through 3 for mouse and then repeat step 4 with the
1268 human protein sequence to find **SMN1** in the following species (if
1269 you find a match):
1270
1271  1. Rat
1272  2. Rabbit
1273  3. Dog
1274  4. Armadillo
1275  5. Elephant
1276  6. Opossum
1277  7. x_tropicalis
1278
1279 Make sure to save the extended DNA sequence and annotation file for
1280 each one.
1281
1282
1283 Step 5 - Create Analysis
1284 ************************
1285
1286 At this point you should have the annotations and fasta files for each
1287 species. If you skipped the first four steps or are having trouble,
1288 you can download the example data from the `Mussa Example Data
1289 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/ExampleData>`_ page.
1290
1291 There are two methods for creating an analysis. You can create a MUssa
1292 PArameter file (.mupa), or you can use the create analysis dialog. To
1293 use the analysis dialog, see the `create a new analysis`_ section.
1294
1295 If you are planning to do many analyses using the same sets of DNA
1296 sequence but with different parameters, annotations, and/or species,
1297 it is often best to set up a `mupa`_ file, so you can:
1298
1299   * Change parameters and rerun analysis easily.
1300   * Use the Mussa command line options to run batch analyses.
1301   * Define an analysis for someone else to run.
1302
1303 Now, we will create a `mupa`_ file for SMN1 for an analysis with
1304 Human, Mouse, and Cow. I'll start by showing you the `mupa`_ file and
1305 then walk you through it line by line.
1306
1307 Start by creating a new text file called *smn1_human_mouse_cow.mupa*
1308 in your smn1 directory. I decided to put the fasta and
1309 annotation files for each species in its own directory, so I will use
1310 that setup (see screenshot below).
1311
1312 .. image:: images/smn1_dir_structure.png
1313    :alt: SMN1 directory structure
1314    :align: center
1315
1316 smn1_human_mouse_cow.mupa:
1317 ::
1318
1319   # Analysis name
1320   ANA_NAME smn1_human_mouse_cow
1321
1322   # Appending to analysis name
1323   APPEND_WIN true
1324   APPEND_THRES true
1325
1326   # Human sequence
1327   SEQUENCE human/smn1_human_dna.fasta
1328   ANNOTATION human/smn1_human_annotations.txt
1329
1330   SEQUENCE mouse/smn1_mouse_dna.fasta
1331   ANNOTATION mouse/smn1_mouse_annotations.txt
1332
1333   SEQUENCE cow/smn1_cow_dna.fasta
1334   ANNOTATION cow/smn1_cow_annotations.txt
1335
1336   # Window size / Threshold
1337   WINDOW 30
1338   THRESHOLD 24
1339
1340 The first line is the analysis name. This will be the name of the
1341 directory the results will be saved in when using the Mussa `command
1342 line`_ option --no-gui to run an analysis. If you are using the Mussa
1343 GUI, then you will be prompted for a directory name as mentioned in
1344 the `saving`_ section.
1345
1346 ::
1347
1348   # Analysis name
1349   ANA_NAME smn1_human_mouse_cow
1350
1351 If you provide APPEND_WIN and/or APPEND_THRES, and set them to
1352 true, the window size and threshold will be appended to the analysis
1353 name. In this example, using the --no-gui `command line`_ option, our
1354 directory name would be *smn1_human_mouse_cow_w30_t24*.
1355
1356 ::
1357
1358   # Appending to analysis name
1359   APPEND_WIN true
1360   APPEND_THRES true
1361
1362 The following six lines provide Mussa with the locations of the
1363 sequence files and annotation files. The files can be given as paths
1364 relative to the .mupa file. In other words, this .mupa file
1365 will find the human sequence only if there
1366 is a directory called *human* in the same directory as the .mupa
1367 file.
1368
1369 To provide the species name for each species, you have to put the
1370 species name in the annotation files. See the `annotation file
1371 format`_ section for more details.
1372
1373 ::
1374
1375   # Human sequence
1376   SEQUENCE human/smn1_human_dna.fasta
1377   ANNOTATION human/smn1_human_annotations.txt
1378
1379   SEQUENCE mouse/smn1_mouse_dna.fasta
1380   ANNOTATION mouse/smn1_mouse_annotations.txt
1381
1382   SEQUENCE cow/smn1_cow_dna.fasta
1383   ANNOTATION cow/smn1_cow_annotations.txt
1384
1385 And finally, the `window size`_ and `threshold`_ parameters.
1386
1387 ::
1388
1389   # Window size / Threshold
1390   WINDOW 30
1391   THRESHOLD 24
1392
1393 Next, open Mussagl and select the **File > Create Analysis from File**
1394 menu option. Mussagl should run your analysis if everything was set up
1395 properly.
1396
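Because a `mupa`_ file is plain text, a set of them can be generated for a parameter sweep. The sketch below uses only the keywords shown in this section (``ANA_NAME``, ``APPEND_WIN``, ``APPEND_THRES``, ``SEQUENCE``, ``ANNOTATION``, ``WINDOW``, ``THRESHOLD``). The helper names and the threshold values in the loop are assumptions for illustration, and the directory-name pattern is inferred from the single example given above (*smn1_human_mouse_cow_w30_t24*).

```python
def mupa_text(name, species_files, window, threshold):
    """Build .mupa file text using the keywords shown in this section."""
    lines = ["ANA_NAME " + name, "APPEND_WIN true", "APPEND_THRES true", ""]
    for sequence, annotation in species_files:
        lines += ["SEQUENCE " + sequence, "ANNOTATION " + annotation, ""]
    lines += ["WINDOW " + str(window), "THRESHOLD " + str(threshold)]
    return "\n".join(lines) + "\n"


def result_dir_name(name, window, threshold):
    """Directory name expected when both APPEND options are true (inferred)."""
    return "{}_w{}_t{}".format(name, window, threshold)


files = [
    ("human/smn1_human_dna.fasta", "human/smn1_human_annotations.txt"),
    ("mouse/smn1_mouse_dna.fasta", "mouse/smn1_mouse_annotations.txt"),
    ("cow/smn1_cow_dna.fasta", "cow/smn1_cow_annotations.txt"),
]

# Hypothetical threshold sweep at a fixed window size of 30.
for threshold in (20, 24, 28):
    text = mupa_text("smn1_human_mouse_cow", files, 30, threshold)
    print(result_dir_name("smn1_human_mouse_cow", 30, threshold))
```

To use the generated text, write each ``text`` value to its own .mupa file in the directory that contains the *human*, *mouse*, and *cow* directories, so that the relative paths resolve.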
1397
1398
1399 Understanding Mussa
1400 ===================
1401
1402 Command Line
1403 ------------
1404
1405 Mussa has some very useful command line options that allow for
1406 loading an existing analysis or running a new analysis with or without
1407 launching the GUI.
1408
1409 Mussa options:
1410   --help                     help message
1411   -p, --run-analysis arg     run an analysis defined by the mussa parameter file
1412   --view-analysis arg        load a previously run analysis
1413   --motifs arg               annotate analysis with motifs from this file
1414   --no-gui                   terminate without running an analysis
1415   --python                   launch as a `python interpreter`_
1416
1417 Running an analysis using the --no-gui option is useful when you want
1418 to run many analyses on a compute server and save the results for
1419 viewing in the future.
1420
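A batch of analyses can be driven from a short script that calls Mussa once per parameter file with the ``--run-analysis`` and ``--no-gui`` options listed above. This is a sketch only: the executable name ``mussagl`` and the .mupa file names are assumptions, so substitute the name or path of your own Mussa binary and your own files. With ``dry_run`` left at its default, the script only prints the commands.

```python
import subprocess


def mussa_command(mupa_path, executable="mussagl"):
    """Build the argument list for one non-interactive run.

    `executable` is an assumption; use the name or path of your Mussa binary.
    """
    return [executable, "--run-analysis", mupa_path, "--no-gui"]


def run_batch(mupa_paths, executable="mussagl", dry_run=True):
    """Run (or, with dry_run, just print) one analysis per .mupa file."""
    for path in mupa_paths:
        command = mussa_command(path, executable)
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the batch if a run exits with an error.
            subprocess.run(command, check=True)


# Hypothetical parameter files.
run_batch(["smn1_w30_t20.mupa", "smn1_w30_t24.mupa"])
```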
1421
1422 Performance
1423 -----------
1424
1425 Algorithm Behavior
1426 ~~~~~~~~~~~~~~~~~~
1427
1428 FIXME: Include seqcomp algorithm info.
1429
1430 FIXME: Include transitivity info.
1431
1432 Repeats
1433 ~~~~~~~
1434
1435 Repeat masking of all input sequences, or at least of the "reference"
1436 genome, can be important for reducing compute time and for simplifying
1437 subsequent visual interpretation. Larger loci generally contain more
1438 repeat elements, and as their number grows so will the number of Mussa
1439 connections among them. If not repeat filtered, connectivity between
1440 shared repeat elements can obscure important relationships between
1441 single copy features.
1442
1443 The formula for the number of connections, C, that will be made for R
1444 instances of a single repeat (meaning R copies of one repeat in each
1445 sequence) and S sequences is:
1446
1447 C = (R^2)[S(S-1)/2]
1448
1449 Table of example situations:
1450
1451 =====  =====  =====
1452   C      R      S
1453 =====  =====  =====
1454    16     4     2
1455    48     4     3
1456    96     4     4
1457   160     4     5
1458   240     4     6
1459   336     4     7
1460   448     4     8
1461    24     2     4
1462    54     3     4
1463    96     4     4
1464   150     5     4
1465   216     6     4
1466   294     7     4
1467   384     8     4
1468  2500    50     2
1469  7500    50     3
1470 15000    50     4
1471 10000   100     2
1472 30000   100     3
1473 60000   100     4
1474 =====  =====  =====
1475
1476 After the connections, C, are found, they are passed on to the
1477 transitivity filter, which is a C^2 algorithm (FIXME: confirm
1478 algorithm is C^2). This means that 50 repeats in each of 2 sequences,
1479 which gives a C of 2,500, results in a C^2 of 6,250,000.
1480
1481 **Conclusion: repeats cause the processing time of Mussa to skyrocket.**
1482
1483 To deal with a situation where you have many repeats in your sequences,
1484 do any of the following:
1485
1486  * Use shorter sequence lengths.
1487  * Repeat mask one or more of your sequences.
1488  * Increase the threshold.
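The connection formula from this section, C = (R^2)[S(S-1)/2], is easy to evaluate for your own data before starting a long run. The sketch below reproduces a few rows of the table and also prints C^2, which is only a rough cost estimate because the manual itself marks the C^2 behavior of the transitivity filter as unconfirmed. The function name ``connections`` is an assumption for this example.

```python
def connections(repeats, sequences):
    """C = R^2 * S(S-1)/2: R^2 links for each of the S(S-1)/2 sequence pairs."""
    return repeats ** 2 * (sequences * (sequences - 1) // 2)


for r, s in [(4, 2), (4, 8), (50, 2), (100, 4)]:
    c = connections(r, s)
    # c * c is the rough transitivity-filter cost, if that step is C^2.
    print("R={} S={} C={} C^2={}".format(r, s, c, c * c))
```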
1489
1490
1491 Details
1492 -------
1493
1494 Case: Conservation track suddenly stops
1495 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1496
1497 Details about this potentially confusing case can be found `here
1498 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/OverlappingWindows>`_.
1499
1500 Python Interpreter
1501 ------------------
1502
1503 Mussagl has some functionality for running a Python interpreter for
1504 interacting with the internals of Mussagl and/or executing Python
1505 code. This feature is mostly experimental at this point in time. If
1506 you have interest in this feature or would like to know more about it,
1507 contact us using the contact information found at
1508 http://mussa.caltech.edu/.
1509
1510 .. Define links below
1511    ------------------
1512
1513 .. _GPL: http://www.opensource.org/licenses/gpl-license.php
1514 .. _wiki: http://mussa.caltech.edu
1515 .. _build: http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild
1516 .. _FASTA: http://en.wikipedia.org/wiki/fasta_format
1517 .. _wpDnaMotif: http://en.wikipedia.org/wiki/DNA_motif
1518 .. _mupa: `Parameter File Format`_