doc/manual/mussagl_manual.rst

   1 ==============
   2 Mussagl Manual
   3 ==============
   4 ---------------
   5 Brandon W. King
   6 ---------------
   7
   8 Last updated: Oct 27th, 2006
   9
  10 Documentation for Mussagl v1.0
  11
  12
  13 .. Things to add
  14         * New features / change log
  15         * (DONE) Comment out anything that isn't implemented yet.
  16         * (DONE) List of features that will be implemented in the future.
  17         * Look into the homology mapping of UCSC.
  18         * Add toggle to genomes.
  19         * Document why one FASTA record per region.
  20         * How to deal with the hazards of small utrs vis motif finder. (Add warning)
  21         * Add warning about saving FASTA file.
  22         * Add a general principles section near the top
  23                 * Using comparison algorithm which will pickup all repeats
  24                 * Add info about repeatmasking
  25                 * Checking upstream and downstream genes to make sure you are in the right regions.
  26         * Later on: look into Ensembl
  27         * Look into method of homology instead of blating.
  28         * Mention advantages of using mupa.
  29         * Mention the difference between using arrows and scroll bar
  30         * Document the color for motifs
  31         * Update for Mac user left-click
  32
  33         * Wormbase/Flybase/mirBASE tutorials
  34
  35
  36
  37 .. contents::
  38
  39 Status
  40 ======
  41
  42 Major New Features
  43 ------------------
  44
  45  * Build 381
  46    * Analysis "Save As" feature
  47
  48 Change Log
  49 ----------
  50
  51 .. INSERT CHANGE LOG HERE
  52 .. END INSERT CHANGE LOG
  53
  54 Features to be Implemented
  55 --------------------------
  56
  57 For an up-to-date list of features to be implemented visit:
  58 http://woldlab.caltech.edu/cgi-bin/mussa/roadmap
  59
  60 Introduction
  61 ============
  62
  63
  64 What is Mussagl?
  65 ----------------
  66
  67 Mussa is an N-way version of FamilyRelations, the 2-way comparative
  68 sequence analysis software that is part of the Cartwheel
  69 project. Given DNA sequence from N species, Mussa uses all possible
  70 pairwise comparisons to derive an N-way comparison. For example, given
  71 sequences 1,2,3, and 4, Mussa makes 6 2-way comparisons: 1vs2, 1vs3,
  72 1vs4, 2vs3, 2vs4, and 3vs4. It then compares all the links between
  73 these comparisons, saving those that satisfy a transitivity
  74 requirement. The saved paths are then displayed in an interactive
  75 viewer.
  76
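The pairwise step described above can be illustrated with a short Python sketch. It only enumerates which 2-way comparisons are needed for N sequences; the function name ``pairwise_comparisons`` is hypothetical and this is not Mussa's own code, nor does it attempt the transitivity filtering.

```python
from itertools import combinations


def pairwise_comparisons(n_sequences):
    """Return every unordered pair (i, j), i < j, of 1-based sequence numbers."""
    return list(combinations(range(1, n_sequences + 1), 2))


pairs = pairwise_comparisons(4)
labels = [f"{a}vs{b}" for a, b in pairs]
print(labels)      # the six 2-way comparisons for four sequences
print(len(pairs))  # in general there are N * (N - 1) / 2 comparisons
```

Because the number of comparisons grows as N * (N - 1) / 2, adding a fifth sequence raises the count from 6 to 10.
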
  77 Short History of Mussa
  78 ----------------------
  79
  80 Mussa Python/PMW Prototype
  81 ~~~~~~~~~~~~~~~~~~~~~~~~~~
  82
  83 First Python/PMW based prototype.
  84
  85 Mussa C++/FLTK
  86 ~~~~~~~~~~~~~~
  87
  88 A rewrite for speed purposes using C++ and the FLTK GUI toolkit.
  89
  90 Mussagl C++/Qt/OpenGL
  91 ~~~~~~~~~~~~~~~~~~~~~
  92
  93 Refactored version using the more elegant Qt GUI framework and
  94 OpenGL for hardware acceleration for those who have better graphics
  95 cards.
  96
  97 Getting Mussagl
  98 ===============
  99
 100 License
 101 -------
 102
 103 Mussagl has been released open source under the `GPL v2
 104 license`__.
 105
 106 __ GPL_
 107
 108 Platforms
 109 ---------
 110
 111 You have the option of building from source or downloading prebuilt
 112 binaries. Most people will want the prebuilt versions.
 113
 114 Supported Platforms:
 115
 116  * Mac OS X (binary or source)
 117  * Windows XP (binary or source)
 118  * Linux (source)
 119
 120 Download
 121 --------
 122
 123 Mussagl in binary form for OS X and Windows and/or source can be
 124 downloaded from http://mussa.caltech.edu/.
 125
 126 Install
 127 -------
 128
 129 Mac OS X
 130 ~~~~~~~~
 131 Once you have downloaded the .dmg file, double click on it and follow
 132 the install instructions.
 133
 134 FIXME: Mention how to launch the program.
 135
 136
 137 Windows XP
 138 ~~~~~~~~~~
 139 Once you have downloaded the Mussagl installer, double click on the
 140 installer and follow the install instructions.
 141
 142 To start Mussagl, launch the program from Start > Programs > Mussagl >
 143 Mussagl.
 144
 145
 146 Linux
 147 ~~~~~
 148 Currently we do not have a binary installer for Linux. You will have
 149 to build from source. See the 'build from source' section below.
 150
 151
 152 Build from Source
 153 ~~~~~~~~~~~~~~~~~
 154
 155 Instructions for building from source can be found on the `build page
 156 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild>`_ of the
 157 `Mussa wiki`__.
 158
 159 __ wiki_
 160
 161
 162 Obtaining Input Data
 163 ====================
 164
 165 If you would like help obtaining data for use with Mussagl, you can
 166 skip ahead to the `Obtaining Input Data - Continued`_ section.
 167
 168 If you would like a tour of the software, continue with the `Using
 169 Mussagl`_ section.
 170
 171
 172 Using Mussagl
 173 =============
 174
 175
 176 Launch Mussagl
 177 --------------
 178 Launch Mussagl... It should look similar to the screen shot below.
 179
 180 .. image:: images/opened.png
 181    :alt: Launch Mussa
 182    :align: center
 183
 184
 185
 186 Create/Load Analysis
 187 ----------------------
 188
 189 Currently there are three ways to load a Mussa experiment.
 190
 191  1. `Create a new analysis`_
 192  2. `Load a mussa parameter file`_ (.mupa)
 193  3. `Load an analysis`_
 194
 195 .. _createnew:
 196
 197 Create a new analysis
 198 ~~~~~~~~~~~~~~~~~~~~~
 199
 200 To create a new analysis select 'Define analysis' from the 'File'
 201 menu. You should see a dialog box similar to the one below. For this
 202 demo we will use the example sequences that come with Mussagl.
 203
 204 .. image:: images/define_analysis.png
 205    :alt: Define Analysis
 206    :align: center
 207
 208 Instructions:
 209
 210  1. **Give the experiment a name**, for this demo, we'll use
 211     'demo_w30_t20'. Mussa will create a folder with this name to store
 212     the analysis files in once it has been run.
 213
 214  2. Choose a threshold_. For this demo **choose 20**. See the
 215     Threshold_ section for more detailed information.
 216
 217  3. Choose a `window size`_. For this demo **choose 30**.
 218
 219
 220  4. Choose the number of sequences_ you would like. For this demo
 221     **choose 3**.
 222
 223 .. image:: images/define_analysis_step1a.png
 224    :alt: Steps 1-4
 225    :align: center
 226
 227 First enter the species name "Human" in the first "Species" text
 228 box. Now click on the 'Browse' button next to the sequence input box
 229 and select the /examples/seq/human_mck_pro.fa file. Do the same in
 230 the next two sequence input boxes selecting mouse_mck_pro.fa and
 231 rabbit_mck_pro.fa as shown below. Make sure to give them a species
 232 name as well. Note that you can create annotation files using the
 233 mussa `Annotation File Format`_ to add annotations to your sequence.
 234
 235 .. image:: images/define_analysis_step2.png
 236    :alt: Choose sequences
 237    :align: center
 238
 239 Click the **create** button and in a few moments you should see
 240 something similar to the following screen shot.
 241
 242 .. image:: images/demo.png
 243    :alt: Mussagl Demo
 244    :align: center
 245
 246 By default your analysis is NOT saved. If you try to close an analysis
 247 without saving, you will be prompted with a dialog box asking you if
 248 you would like to save your analysis. See the `Saving`_ section for
 249 details on saving your analysis. When saving, choose a directory and
 250 give the analysis the name **demo_w30_t20**. If you close and reopen
 251 Mussagl, you will then be able to load the saved analysis. See `Load
 252 an analysis`_ section below for details.
 253
 254
 255 Load a mussa parameter file
 256 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 257
 258 If you prefer, you can define your Mussa analysis using the Mussa
 259 parameter file. See the `Parameter File Format`_ section for details
 260 on creating a .mupa file.
 261
 262 Once you have a .mupa file created, load Mussagl and select the **File >
 263 Create Analysis from File** menu option. Select the .mupa file and click
 264 open.
 265
 266 .. image:: images/load_mupa_menu.png
 267    :alt: Load Mussa Parameters
 268    :align: center
 269
 270 If you would like to see an example, you can load the
 271 **mck3test.mupa** file in the examples directory that comes with
 272 Mussagl.
 273
 274 .. image:: images/load_mupa_dialog.png
 275    :alt: Load Mussa Parameters Dialog
 276    :align: center
 277
 278
 279 Load an analysis
 280 ~~~~~~~~~~~~~~~~
 281
 282 To load a previously run analysis open Mussagl and select the **File >
 283 Open Existing Analysis** menu option. Select an analysis **directory** and
 284 click open.
 285
 286 .. image:: images/load_analysis_menu.png
 287    :alt: Load Analysis Menu
 288    :align: center
 289
 290
 291 Main Window
 292 -----------
 293
 294 Overview
 295 ~~~~~~~~
 296 .. Screen-shot with numbers showing features.
 297
 298 .. image:: images/window_overview.png
 299    :alt: Mussa Window
 300    :align: center
 301
 302 Legend:
 303
 304  1. `DNA Sequence (Black bars)`_
 305
 306  2. Annotation_
 307
 308  3. Motif_
 309
 310  4. `Red conservation tracks`_
 311
 312  5. `Blue conservation tracks`_
 313
 314  6. `Zoom Factor`_ (Base pairs per pixel)
 315
 316  7. `Dynamic Threshold`_
 317
 318  8. `Sequence Information Bar`_
 319
 320  9. `Sequence Scroll Bar`_
 321
 322
 323 DNA Sequence (black bars)
 324 ~~~~~~~~~~~~~~~~~~~~~~~~~
 325
 326 .. image:: images/sequence_bar.png
 327    :alt: Sequence Bar
 328    :align: center
 329
 330 Each of the black bars represents one of the loaded sequences, in this
 331 case the sequence around the gene 'MCK' in human, mouse, and rabbit.
 332
 333
 334 Annotation
 335 ~~~~~~~~~~
 336
 337 .. figure:: images/annotation.png
 338    :alt: Annotation
 339    :align: center
 340
 341    Annotation shown in green on sequence bar.
 342
 343
 344 Annotations can be included on any of the sequences using the `Load a
 345 mussa parameter file`_ or `Create a new analysis`_ method of loading
 346 your sequences. You can define annotations by location or using an
 347 exact sub-sequence or a FASTA sequence of the section of DNA you wish
 348 to annotate. See the `Annotation File Format`_ section for details.
 349
 350
 351 Motif
 352 ~~~~~
 353
 354 .. figure:: images/motif.png
 355    :alt: Motif
 356    :align: center
 357
 358    Motif shown in light blue on sequence bar.
 359
 360 The only real difference between an annotation and motif in Mussagl is
 361 that you can define motifs and choose a color from within the GUI. See
 362 the `Motifs`_ section for more information.
 363
 364
 365 Red conservation tracks
 366 ~~~~~~~~~~~~~~~~~~~~~~~
 367
 368 .. figure:: images/conservation_tracks.png
 369    :alt: Conservation Tracks
 370    :align: center
 371
 372    Conservation tracks shown as red and blue lines between sequence
 373    bars.
 374
 375 The **red lines** between the sequence bars represent direct
 376 conservation between the sequences (i.e. not reverse complement matches).
 377
 378 The amount of sequence conservation shown will depend on how closely your
 379 sequences are related and the `dynamic threshold`_ you are using.
 380
 381
 382 Blue conservation tracks
 383 ~~~~~~~~~~~~~~~~~~~~~~~~
 384
 385 .. figure:: images/conservation_tracks.png
 386    :alt: Conservation Tracks
 387    :align: center
 388
 389    Conservation tracks shown as red and blue lines between sequence
 390    bars.
 391
 392 **Blue lines** represent **reverse complement** conservation relative
 393 to the sequence attached to the top of the blue line.
 394
 395 The amount of sequence conservation shown will depend on how closely your
 396 sequences are related and the `dynamic threshold`_ you are using.
 397
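For readers unfamiliar with the term, the reverse complement of a sequence is obtained by complementing each base and reversing the result. The following is a minimal Python sketch of that operation; the helper name ``reverse_complement`` is hypothetical and the code is not taken from Mussagl.

```python
# Translation table mapping each base to its complement (case preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the strand direction."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("GATTACA"))
```

A blue line therefore indicates that a window in one sequence resembles the reverse complement of a window in the other, rather than the window as written.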
 398
 399 Zoom Factor
 400 ~~~~~~~~~~~
 401
 402 .. image:: images/zoom_factor.png
 403    :alt: Zoom Factor
 404    :align: center
 405
 406 The zoom factor is the number of base pairs represented per
 407 pixel. When you zoom in far enough, the display will switch from
 408 a black bar representing the sequence to the actual sequence
 409 (that is, an ASCII representation of the bases).
 410
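The base-pairs-per-pixel relationship can be made concrete with a little arithmetic. This Python sketch uses hypothetical function names and made-up view sizes; it is only meant to show how the zoom factor relates sequence length to screen width.

```python
def visible_base_pairs(zoom_factor, view_width_px):
    """Number of base pairs that fit in a view of the given pixel width."""
    return zoom_factor * view_width_px


def pixels_for_sequence(sequence_length_bp, zoom_factor):
    """Approximate pixel width needed to draw a whole sequence."""
    return sequence_length_bp / zoom_factor


print(visible_base_pairs(10, 800))    # 10 bp per pixel across 800 pixels
print(pixels_for_sequence(5000, 10))  # width of a 5000 bp sequence
```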
 411
 412 Dynamic Threshold
 413 ~~~~~~~~~~~~~~~~~
 414
 415 .. image:: images/dynamic_threshold.png
 416    :alt: Dynamic Threshold
 417    :align: center
 418
 419 You can dynamically change the threshold for how strong a match you
 420 consider the conservation to be by changing the value in the dynamic
 421 threshold box.
 422
 423 The value you enter is the minimum number of base pairs that have to
 424 be matched in order to be considered conserved. The second number that
 425 you can't change is the `window size`_ you used when creating the
 426 experiment. The last number is the percent match.
 427
 428 Below is an animation of the dynamic threshold being increased over
 429 time.
 430
 431 .. image:: images/threshold_change.gif
 432    :alt: Animated Dynamic Threshold
 433    :align: center
 434
 435 See the Threshold_ section for more information.
 436
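The percent match shown next to the threshold follows directly from the two other numbers. A minimal Python sketch, assuming the percentage is simply the threshold divided by the window size (the function name ``percent_match`` is hypothetical):

```python
def percent_match(threshold, window):
    """Percentage of positions in a window that must match."""
    if not 0 < threshold <= window:
        raise ValueError("threshold must be between 1 and the window size")
    return 100.0 * threshold / window


print(round(percent_match(20, 30), 1))  # demo settings: 20 out of 30
print(percent_match(30, 30))            # perfect matches only
```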
 437
 438 Sequence Information Bar
 439 ~~~~~~~~~~~~~~~~~~~~~~~~
 440
 441 .. image:: images/seq_info_bar.png
 442    :alt: Sequence Information Bar
 443    :align: center
 444
 445 The sequence information bars can be found to the left and right sides
 446 of Mussagl. Next to each sequence you will find the following
 447 information:
 448
 449  1. Species (If it has been defined)
 450  2. Total Size of Sequence
 451  3. Current base pair position
 452
 453 Note that you can **update the species** text box. Make sure to **save your
 454 experiment** after making this change by selecting **File > Save
 455 Analysis** from the menu.
 456
 457 Sequence Scroll Bar
 458 ~~~~~~~~~~~~~~~~~~~
 459
 460 .. image:: images/scroll_bar.png
 461    :alt: Sequence Scroll Bar
 462    :align: center
 463
 464 The scroll bar allows you to scroll through the sequence which is
 465 useful when you have zoomed in using the `zoom factor`_.
 466
 467
 468 Saving
 469 ------
 470
 471 Save on Close
 472 ~~~~~~~~~~~~~
 473
 474 Whenever you create a new analysis or make a change such as
 475 adding/editing a motif or changing a species name, an asterisk (*)
 476 will appear in the title of the window showing that there are changes
 477 that have not been saved. If you close a Mussa window without saving
 478 changes, Mussa will ask you if you would like to save the changes that
 479 have been made.
 480
 481 Save Analysis
 482 ~~~~~~~~~~~~~
 483
 484 After making changes, such as updating species names or adding/editing
 485 motifs, you can save these changes by selecting the **File > Save
 486 analysis** menu option or pressing **CTRL + S** (PC) or
 487 **Apple/Command Key + S** (on Mac).
 488
 489 .. image:: images/save_analysis.png
 490    :alt: Save analysis
 491    :align: center
 492
 493 Save Analysis As
 494 ~~~~~~~~~~~~~~~~
 495
 496 To save a copy of your analysis to a new location, select the **File >
 497 Save analysis as** menu option and choose a new location and name for
 498 your analysis.
 499
 500 .. image:: images/save_analysis_as.png
 501    :alt: Save analysis
 502    :align: center
 503
 504 Save Motif List
 505 ~~~~~~~~~~~~~~~
 506
 507 See `Save Motifs to File`_ in the `Motifs`_ section.
 508
 509
 510 Viewing Multiple Analyses
 511 -------------------------
 512
 513 Sometimes it is useful to view more than one analysis at a time. To
 514 accomplish this, Mussa allows you to open a new Mussa window by
 515 selecting the **File > New Mussa Window** menu option.
 516
 517 .. image:: images/new_mussa_window_menu.png
 518    :alt: New Mussa Window Menu Option
 519    :align: center
 520
 521 A new Mussa window will pop up.
 522
 523 .. figure:: images/new_mussa_window.png
 524    :alt: New Mussa Window
 525    :align: center
 526
 527    A new Mussa window on the right, in which a second analysis has
 528    been loaded.
 529
 530 Now you can create a new analysis or load an existing one in this new
 531 window, as described in the `Create/Load Analysis`_ section.
 532
 533 You can view as many analyses as you can fit on your screen or until
 534 you run out of available RAM. If you notice a rapid decrease in
 535 performance and hear lots of noise coming from your hard drive, you
 536 probably ran out of RAM and are now using virtual memory (i.e. much
 537 much slower). If this happens, you may need to avoid opening as many
 538 analyses at one time.
 539
 540
 541 Annotations / Motifs
 542 --------------------
 543
 544 Annotations
 545 ~~~~~~~~~~~
 546
 547 Currently annotations can be added to a sequence using the mussa
 548 `annotation file format`_ and can be loaded by selecting the
 549 annotation file when defining a new analysis (see `Create a new
 550 analysis`_ section) or by defining a .mupa file pointing to your
 551 annotation file (see `Load a mussa parameter file`_ section).
 552
 553 Motifs
 554 ~~~~~~
 555
 556 Load Motifs from File
 557 *********************
 558
 559 It is possible to load motifs from a file which was saved from a
 560 previous run or by defining your own motif file. See the `Motif File
 561 Format`_ section for details.
 562
 563 NOTE: Valid motif list file extensions are:
 564
 565   * .mtl
 566   * .txt
 567
 568 To load a motif file, select **Load Motif List** item from the
 569 **File** menu and select a motif list file.
 570
 571 .. image:: images/load_motif.png
 572    :alt: Load Motif List
 573    :align: center
 574
 575
 576 Save Motifs to File
 577 *******************
 578
 579 Motifs from the `Motif Dialog`_ can be saved to file for use with
 580 other analyses. If you just want your motifs to be saved with your
 581 analysis, see the `save analysis`_ section for details.
 582
 583 To save a motif list, select **File > Save Motifs** menu option. By
 584 default, Mussa will append .mtl if you do not provide a file extension
 585 (valid file extensions: .mtl & .txt).
 586
 587 .. image:: images/save_motifs.png
 588    :alt: Save Motifs
 589    :align: center
 590
 591
 592 Motif Dialog
 593 ************
 594
 595 Mussa has the ability to find lab motifs using the `IUPAC Nucleotide
 596 Code`_ for defining a motif. To define a motif, select **Edit > Edit
 597 Motifs** menu item as shown below.
 598
 599 .. image:: images/view_edit_motifs.png
 600    :alt: "View > Edit Motifs" Menu
 601    :align: center
 602
 603 You will see a dialog box appear with an "apply" button in the bottom
 604 right and one row for defining a motif and the color that will be
 605 displayed on the sequence. When you start adding your first motif, an
 606 additional row will be added. The check box in the first column
 607 defines whether the motif is displayed or not. The second column is
 608 the motif display color. The third column is for the name of your
 609 motif and finally, the fourth column is the motif itself.
 610
 611 .. image:: images/motif_dialog_start.png
 612    :alt: Motif Dialog
 613    :align: center
 614
 615 Now let's make a motif **'AT[C or G]CT'**. Using the `IUPAC Nucleotide
 616 Code`_, type in **'ATSCT'** into the motif field and **'My Motif'** for
 617 the name in the name field as shown below.
 618
 619 Notice how a second row appeared when you started to add the first
 620 motif. Every time you add a new motif, a new row will appear allowing
 621 you to add as many motifs as you need.
 622
 623 .. image:: images/motif_dialog_enter_motif.png
 624    :alt: Enter Motif
 625    :align: center
 626
 627 Now choose a color for your motif by clicking on the colored area to
 628 the left of the name field. Remember to choose a color that will show
 629 up well with a black bar as the background. A good tool for picking a
 630 color is the `Colour Contrast Analyser
 631 <http://juicystudio.com/services/colourcontrast.php>`_ by
 632 `juicystudio.com <http://juicystudio.com/>`_.
 633
 634 .. image:: images/color_chooser.png
 635    :alt: Color Chooser
 636    :align: center
 637
 638 Once you have selected the color for your motif, click on the
 639 **'apply'** button. Notice that any matches Mussa finds to your motif
 640 will now show up in the main Mussa window.
 641
 642 Before Motif:
 643
 644 .. image:: images/motif_dialog_bar_before.png
 645    :alt: Sequence bar before motif
 646    :align: center
 647
 648 After Motif:
 649
 650 .. image:: images/motif_dialog_bar_after.png
 651    :alt: Sequence bar after motif
 652    :align: center
 653
 654 To save your motifs with your analysis, see the `save analysis`_
 655 section. To save your motifs to a file, see the `save motifs to file`_
 656 section.
 657
 658 Deleting a Motif
 659 ^^^^^^^^^^^^^^^^
 660
 661 To delete a motif, remove all text from the name and sequence columns
 662 and close the motif editor.
 663
 664 View Mussa Alignments
 665 ---------------------
 666
 667 Mussagl allows you to zoom in on Mussa alignments by selecting the set
 668 of alignment(s) of interest. To do this, move the mouse near the
 669 alignment you are interested in viewing and then **PRESS** and
 670 **HOLD** the **LEFT mouse button** and **drag the mouse** to the other
 671 side of the conservation track so that you see a bounding box
 672 overlapping the alignment(s) of interest and then **let go** of the
 673 *left mouse button*.
 674
 675 In the example below, I started by left-clicking on the area marked by
 676 a red dot (upper left corner of bounding box) and dragging the mouse to
 677 the area marked by a blue dot (lower right corner of the bounding box)
 678 and letting go of the left mouse button.
 679
 680 .. image:: images/select_sequence.png
 681    :alt: Select Sequence
 682    :align: center
 683
 684 All of the lines which were not selected should be washed out as shown
 685 below:
 686
 687 .. image:: images/washed_out.png
 688    :alt: Tracks washed out
 689    :align: center
 690
 691 With a selection made, go to the **View** menu and select **View mussa alignment**.
 692
 693 .. image:: images/view_mussa_alignment.png
 694    :alt: View mussa alignment
 695    :align: center
 696
 697 You should see the alignment at the base-pair level as shown below.
 698
 699 .. image:: images/mussa_alignment.png
 700    :alt: Mussa alignment
 701    :align: center
 702
 703
 704 Sub-analysis
 705 ------------
 706
 707 To run a sub-analysis **highlight** a section of sequence and *right
 708 click* on it and select **Add to subanalysis**. Do the same for the
 709 sequences shown in orange in the screenshot below. Note that you **are
 710 NOT limited** to selecting only one subsequence from the same
 711 sequence.
 712
 713 .. image:: images/subanalysis_select_seqs.png
 714    :alt: Subanalysis sequence selection
 715    :align: center
 716
 717 Once you have added your sequences for subanalysis, choose a `window size`_ and `threshold`_ and click **Ok**.
 718
 719 .. image:: images/subanalysis_dialog.png
 720    :alt: Subanalysis Dialog
 721    :align: center
 722
 723 A new Mussa window will appear with the subanalysis of your sequences
 724 once it's done running. This may take a while if you selected large
 725 chunks of sequence with a loose threshold.
 726
 727 .. image:: images/subanalysis_done.png
 728    :alt: Subanalysis complete
 729    :align: center
 730
 731
 732 Copying sequence to clipboard
 733 -----------------------------
 734
 735 To copy a sequence to the clipboard, highlight a section of sequence,
 736 as shown in the screen shot below, and do one of the following:
 737
 738  * Select **Copy as FASTA** from the **Edit** menu.
 739  * **Right-Click (Left-click + Apple/Command Key on Mac)** on the highlighted sequence and select **Copy as FASTA**.
 740  * Press **Ctrl + C (on PC)** or **Apple/Command Key + C (on Mac)** on the keyboard.
 741
 742 .. image:: images/copy_sequence.png
 743    :alt: Copy sequence
 744    :align: center
 745
 746
 747 Saving to an Image
 748 ---------------------------------
 749
 750 To save your current mussa view to an image, select **File > Save to
 751 image...** as shown below.
 752
 753 .. image:: images/save_to_image_menu.png
 754    :alt: File > Save to image...
 755    :align: center
 756
 757 You can define the width and the height of the image to save. By
 758 default it will use the same size of your current view. Since the
 759 Mussa view is implemented using vectors, if you choose a larger size
 760 than your current view, Mussa will redraw at the higher resolution
 761 when saving. In other words, you get higher quality images when saving
 762 at a higher resolution.
 763
 764 If you check the "Lock aspect ratio" check box, which I have circled
 765 in red, then when you change one value, say width, the other, height,
 766 will update automatically to keep the same aspect ratio.
 767
 768 .. image:: images/save_to_image_dialog.png
 769    :alt: Save to image dialog
 770    :align: center
 771
 772 Click save and choose a location and filename for your file.
 773
 774 The valid image formats are:
 775
 776   * .png (default if no extension specified.)
 777   * .jpg
 778
 779
 780 Detailed Information
 781 --------------------
 782
 783 Threshold
 784 ~~~~~~~~~
 785
 786 The threshold of an analysis is the minimum number of base pair matches
 787 a window must contain in order to be kept as a match. Note that you can vary
 788 the threshold from within Mussagl. For example, if you choose a
 789 `window size`_ of **30** and a **threshold** of **20** the mussa nway
 790 transitive algorithm will store all matches that are 20 out of 30 bp
 791 matches or better and pass it on to Mussagl. Mussagl will then allow
 792 you to dynamically choose a threshold from 20 to 30 base pairs. A
 793 threshold of 30 bps would only show 30 out of 30 bp matches. A
 794 threshold of 20 bps would show all matches of 20 out of 30 bps or
 795 better. If you would like to see results for matches lower than 20 out
 796 of 30, you will need to rerun the analysis with a lower threshold.
 797
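The interplay between window size and threshold can be sketched in a few lines of Python. This is a deliberately naive illustration of scoring fixed-size windows and keeping those at or above a threshold; it is an assumption about the general idea, not Mussa's actual algorithm, and the names ``window_matches`` and ``filter_hits`` are hypothetical.

```python
def window_matches(seq_a, seq_b, window, threshold):
    """Return (start_a, start_b, score) for every pair of windows whose
    number of identical positions is at least `threshold`."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            score = sum(a == b for a, b in
                        zip(seq_a[i:i + window], seq_b[j:j + window]))
            if score >= threshold:
                hits.append((i, j, score))
    return hits


def filter_hits(hits, dynamic_threshold):
    """Mimic raising the threshold after the analysis has been run."""
    return [hit for hit in hits if hit[2] >= dynamic_threshold]


# Toy example: one 5 bp window per sequence, 4 of 5 positions identical.
hits = window_matches("AAAAC", "AAAAG", window=5, threshold=4)
print(hits)
print(filter_hits(hits, 5))  # raising the threshold to 5 of 5 hides the hit
```

The second function shows why the stored threshold is a lower bound: hits scoring below it were never saved, so they cannot be recovered without rerunning the analysis.
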
 798 Window Size
 799 ~~~~~~~~~~~
 800
 801 The typical sizes people tend to choose are between 20 and 30. You
 802 will likely need to experiment with this setting depending on your
 803 needs and input sequence.
 804
 805
 806 Sequences
 807 ~~~~~~~~~
 808
 809 Mussa reads in sequences which are formatted in the FASTA_
 810 format. Mussa may take a long time to run (>10 minutes) if the total
 811 sequence length nears 280 Kb. Once mussa has run once, you can reload
 812 previously run analyses.
 813
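To check how much sequence you are about to load, a small FASTA reader can total the base pairs. The following Python sketch is a minimal reader written for illustration (the names ``read_fasta`` and ``total_length`` are hypothetical); it is not part of Mussa.

```python
def read_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def total_length(records):
    """Total number of bases across all records."""
    return sum(len(seq) for _, seq in records)


records = read_fasta(">human\nACGT\nAC\n>mouse\nGGG\n")
print(records)
print(total_length(records))
```
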
 814 FIXME: We have learned more about how much sequence and how many to
 815 put in Mussagl, this information should be documented here.
 816
 817
 818 Mussa File Formats
 819 ------------------
 820
 821 .. _param:
 822
 823 Parameter File Format
 824 ~~~~~~~~~~~~~~~~~~~~~
 825
 826 Note that for the comment character '#' to work, it must be followed
 827 by a space (i.e. '# ').
 828
 829 **File Format (.mupa):**
 830
 831 ::
 832
 833   # name of analysis directory and stem for associated files
 834   ANA_NAME <analysis_name>
 835
 836   # if APPEND vars true, a _wXX and/or _tYY added to analysis name
 837   # where XX = WINDOW and YY = THRESHOLD
 838   # Highly recommended with use of command line override of WINDOW or THRESHOLD
 839   APPEND_WIN <true/false>
 840   APPEND_THRES <true/false>
 841
 842   # first sequence info
 843   SEQUENCE <FASTA_file_path>
 844   ANNOTATION <annotation_file_path>
 845   SEQ_START <sequence_start>
 846
 847   # the second sequence info
 848   SEQUENCE <FASTA_file_path>
 849   # ANNOTATION <annotation_file_path>
 850   SEQ_START <sequence_start>
 851   # SEQ_END <sequence_end>
 852
 853   # third sequence info
 854   SEQUENCE <FASTA_file_path>
 855   # ANNOTATION <annotation_file_path>
 856
 857   # analysis parameters: command line args -w -t will override these
 858   WINDOW <num>
 859   THRESHOLD <num>
 860
 861 .. csv-table:: Parameter File Options:
 862    :header: "Option Name", "Value", "Default", "Required", "Description"
 863    :widths: 30 30 30 30 60
 864
 865    "ANA_NAME", "string", "N/A", "true", "Name of analysis (also
 866    name of directory where analysis will be saved)."
 867    "APPEND_WIN", "true/false", "?", "?", "Appends _w## to ANA_NAME"
 868    "APPEND_THRES", "true or false", "?", "?", "Appends _t## to ANA_NAME"
 869    "SEQUENCE", "/FASTA/filepath.fa", "N/A", "true", "Absolute/Relative file
 870    path to sequence."
 871    "ANNOTATION", "/annotation/filepath.txt", "N/A", "false", "Optional
 872    annotation file. See `annotation file format`_ section for more
 873    information."
 874    "SEQ_START", "integer", "1", "false", "Optional index into FASTA file"
 875    "SEQ_END", "integer", "1", "false", "Optional index into FASTA file"
 876    "WINDOW", "integer", "N/A", "true", "`Window Size`_"
 877    "THRESHOLD", "integer", "N/A", "true", "`Threshold`_"
 878
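If you generate many analyses, it may be convenient to write .mupa files from a script. The Python sketch below builds the text using only the option names from the table above. The function ``render_mupa`` is hypothetical, and the relative example paths are an assumption about where the bundled MCK sequences live on your system.

```python
def render_mupa(name, sequences, window, threshold,
                append_win=False, append_thres=False):
    """Build .mupa text. `sequences` is a list of dicts with a required
    'fasta' path and optional 'annotation', 'start', and 'end' entries."""
    lines = [
        "# name of analysis directory and stem for associated files",
        f"ANA_NAME {name}",
        f"APPEND_WIN {'true' if append_win else 'false'}",
        f"APPEND_THRES {'true' if append_thres else 'false'}",
    ]
    for seq in sequences:
        lines.append(f"SEQUENCE {seq['fasta']}")
        if "annotation" in seq:
            lines.append(f"ANNOTATION {seq['annotation']}")
        if "start" in seq:
            lines.append(f"SEQ_START {seq['start']}")
        if "end" in seq:
            lines.append(f"SEQ_END {seq['end']}")
    lines.append(f"WINDOW {window}")
    lines.append(f"THRESHOLD {threshold}")
    return "\n".join(lines) + "\n"


# Assumed paths; adjust to the location of your own FASTA files.
demo = render_mupa(
    "demo_w30_t20",
    [{"fasta": "examples/seq/human_mck_pro.fa"},
     {"fasta": "examples/seq/mouse_mck_pro.fa"},
     {"fasta": "examples/seq/rabbit_mck_pro.fa"}],
    window=30,
    threshold=20,
)
print(demo)
```

Note that the comment line is written with a space after the '#', as the format requires.
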
 879 .. _annot:
 880
 881 Annotation File Format
 882 ~~~~~~~~~~~~~~~~~~~~~~
 883
 884 The first line in the file is the sequence name. Each line thereafter
 885 is a **space** separated annotation.
 886
 887 Update:
 888
 889  * The annotation format now supports FASTA sequences embedded in the
 890    annotation file as shown in the format example below. Mussagl will
 891    take this sequence and look for an exact match of this sequence in
 892    your sequences. If a match is found, it will label it with the name
 893    from the FASTA header.
 894
 895 Format:
 896
 897 ::
 898
 899   <species/sequence_name>
 900   <start> <stop> <annotation_name> <annotation_type>
 901   <start> <stop> <annotation_name> <annotation_type>
 902   <start> <stop> <annotation_name> <annotation_type>
 903   <start> <stop> <annotation_name> <annotation_type>
 904   >FASTA Header
 905   ACTGACTGACGTACGTAGCTAGCTAGCTAGCACG
 906   ACGTACGTACGTACGTAGCTGTCATACGCTAGCA
 907   TGCGTAGAGGATCTCGGATGCTAGCGCTATCGAT
 908   ACGTACGGCAGTACGCGGTCAGA
 909   <start> <stop> <annotation_name> <annotation_type>
 910   ...
 911
 912 Example:
 913
 914 ::
 915
 916   Mouse
 917   251 500 Glorp Glorptype
 918   751 1000 Glorp Glorptype
 919   1251 1500 Glorp Glorptype
 920   >My favorite DNA sequence
 921   GATTACA
 922   1751 2000 Glorp Glorptype
 923
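To sanity-check an annotation file before loading it, a small parser can help. The Python sketch below reads the format shown above. It assumes that any all-letter line following a '>' header is sequence data and that every other line has the four space-separated fields; these rules and the name ``parse_annotation_file`` are illustrative guesses, not Mussagl's actual parser.

```python
def parse_annotation_file(text):
    """Parse annotation text into (sequence_name, intervals, fasta_records).

    Assumption: a line made up only of letters after a '>' header is
    sequence data; any other line is '<start> <stop> <name> <type>'.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    sequence_name, intervals, fasta_records = lines[0], [], []
    current = None  # FASTA record currently being filled, if any
    for ln in lines[1:]:
        if ln.startswith(">"):
            current = {"header": ln[1:].strip(), "sequence": ""}
            fasta_records.append(current)
        elif current is not None and ln.isalpha():
            current["sequence"] += ln
        else:
            current = None
            start, stop, name, kind = ln.split(None, 3)
            intervals.append((int(start), int(stop), name, kind))
    return sequence_name, intervals, fasta_records


example = """Mouse
251 500 Glorp Glorptype
751 1000 Glorp Glorptype
1251 1500 Glorp Glorptype
>My favorite DNA sequence
GATTACA
1751 2000 Glorp Glorptype
"""
name, intervals, fasta_records = parse_annotation_file(example)
print(name, len(intervals), fasta_records)
```
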
 924
 925 .. _motif_file_format:
 926
 927 Motif File Format
 928 ~~~~~~~~~~~~~~~~~
 929
 930 Format::
 931
 932   <motif> <red> <green> <blue>
 933
 934 Example::
 935
 936   GGCC 0.0 1 1
 937
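A motif list in this format is easy to read from a script. The Python sketch below parses each line into a motif and a color triple. It assumes the color components are numbers in the range 0 to 1, as the example line suggests; the name ``parse_motif_list`` is hypothetical.

```python
def parse_motif_list(text):
    """Return a list of (motif, (red, green, blue)) from motif list text."""
    motifs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        motif, red, green, blue = line.split()
        motifs.append((motif, (float(red), float(green), float(blue))))
    return motifs


motifs = parse_motif_list("GGCC 0.0 1 1\n")
print(motifs)
```
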
 938
 939
 940 IUPAC Nucleotide Code
 941 ~~~~~~~~~~~~~~~~~~~~~~
 942
 943 For your convenience, below is a table of the IUPAC Nucleotide Code.
 944
 945 The following table is table 1 from "Nomenclature for Incompletely
 946 Specified Bases in Nucleic Acid Sequences" which can be found at
 947 http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html.
 948
 949 ======  =================  ===================================
 950 Symbol  Meaning            Origin of designation
 951 ======  =================  ===================================
 952 G       G                  Guanine
 953 A       A                  Adenine
 954 T       T                  Thymine
 955 C       C                  Cytosine
 956 R       G or A             puRine
 957 Y       T or C             pYrimidine
 958 M       A or C             aMino
 959 K       G or T             Keto
 960 S       G or C             Strong interaction (3 H bonds)
 961 W       A or T             Weak interaction (2 H bonds)
 962 H       A or C or T        not-G, H follows G in the alphabet
 963 B       G or T or C        not-A, B follows A
 964 V       G or C or A        not-T (not-U), V follows U
 965 D       G or A or T        not-C, D follows C
 966 N       G or A or T or C   aNy
 967 ======  =================  ===================================
 968
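To see what a motif written in the IUPAC code will match, it can be expanded into a regular expression. The Python sketch below does this using the table above and searches the forward strand only. It illustrates the code itself and is not how Mussagl implements motif search; the names ``motif_to_regex`` and ``find_motif`` are hypothetical.

```python
import re

# Each IUPAC symbol mapped to the bases it stands for (from the table above).
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT", "H": "ACT", "B": "GTC",
    "V": "GCA", "D": "GAT", "N": "GATC",
}


def motif_to_regex(motif):
    """Expand an IUPAC motif into a regex such as [A][T][GC][C][T]."""
    return "".join(f"[{IUPAC[symbol]}]" for symbol in motif.upper())


def find_motif(motif, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    pattern = re.compile(f"(?=({motif_to_regex(motif)}))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


print(motif_to_regex("ATSCT"))
print(find_motif("ATSCT", "GGATCCTAATGCTT"))
```

The lookahead ``(?=...)`` is used so that overlapping occurrences are all reported rather than only the first of each overlapping group.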
 969
 970 Obtaining Input Data - Continued
 971 --------------------------------
 972
 973 If you already have your data, you may want to go to the `Using Mussagl`_
 974 section of the manual.
 975
Let's say you have a gene of interest called 'SMN1' and you want to
know how well the sequence surrounding the gene is conserved across
multiple species. That is what we will do here: retrieve the DNA
sequence for SMN1 and prepare it for use in Mussa.
 980
 981 For more information about SMN1 visit `NCBI's OMIM
 982 <http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=609682>`_.
 983
 984 The SMN1 data retrieved in this section can be downloaded from the
 985 `Mussa Example Data
 986 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/ExampleData>`_ page if
 987 you prefer to skip this section of the manual.
 988
 989 UCSC Genome Browser Method
 990 --------------------------
 991
 992 There are many methods of retrieving DNA sequence, but for this
 993 example we will retrieve SMN1 through the UCSC genome browser located
 994 at http://genome.ucsc.edu/.
 995
 996
 997 .. image:: images/ucsc_genome_browser_home.png
 998    :alt: UCSC Genome Browser
 999    :align: center
1000
1001 Step 1 - Find SMN1
1002 ~~~~~~~~~~~~~~~~~~
1003
1004 The first step in finding SMN1 is to use the **Gene Sorter** menu
1005 option which I have highlighted in orange below:
1006
1007 .. image:: images/ucsc_menu_bar_gene_sorter.png
1008    :alt: Gene Sorter Menu Option
1009    :align: center
1010
1011 Gene Sorter page:
1012
1013 .. image:: images/ucsc_gene_sorter.png
1014    :alt: Gene Sorter
1015    :align: center
1016
1017 We will start by looking for SMN1 in the **Human Genome** and **sorting by name similarity**.
1018
1019 .. image:: images/ucsc_gs_sort_name_sim.png
1020    :alt: Gene Sorter - Name Similarity
1021    :align: center
1022
1023 After you have selected **Human Genome** and **sorting by name similarity**, type *SMN1* into the search box.
1024
1025 .. image:: images/ucsc_gs_smn1.png
1026    :alt: Gene
1027    :align: center
1028
1029 Press **Go!** and you should see the following page:
1030
1031 .. image:: images/ucsc_gs_found.png
1032    :alt: Found SMN1
1033    :align: center
1034
Click on **SMN1** and you will be taken to the gene expression atlas
page.
1037
1038 .. image:: images/ucsc_gs_genome_position.png
1039    :alt: Gene expression atlas
1040    :align: center
1041
1042 Click on **chr5 70,270,558** found in the **SMN1 row**, **Genome
1043 position column**.
1044
Now we have found the location of SMN1 in the human genome!
1046
1047 .. image:: images/ucsc_gb_smn1_human.png
1048    :alt: Genome Browser - SMN1 (human)
1049    :align: center
1050
1051
1052 Step 2 - Download CDS/UTR sequence for annotations
1053 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1054
1055 Since we have found **SMN1**, this would be a convenient time to extract
1056 the DNA sequence for the CDS and UTRs of the gene to use it as an
1057 annotation_ in Mussa.
1058
**Click on SMN1**, located **between** the **two orange arrows** shown
below.
1061
1062 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
1063    :alt: Genome Browser - SMN1 (human) - Orange Arrows
1064    :align: center
1065
1066 You should find yourself at the SMN1 description page.
1067
1068 .. image:: images/ucsc_gb_smn1_description_page.png
1069    :alt: Genome Browser - SMN1 (human) - Description page
1070    :align: center
1071
1072 **Scroll down** until you get to the **Sequence section** and click on
1073 **Genomic (chr5:70,256,524-70,284,592)**.
1074
1075 .. image:: images/ucsc_gb_smn1_human_sequence.png
1076    :alt: Genome Browser - SMN1 (human) - Sequence
1077    :align: center
1078
1079 You should now be at the **Genomic sequence near gene** page:
1080
1081 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence.png
1082    :alt: Genome Browser - SMN1 (human) - Get genomic sequence
1083    :align: center
1084
1085 Make the following changes (highlighted in orange in the screenshot
1086 below):
1087
1088  1. UNcheck **introns**.
1089     (We only want to annotate CDS and UTRs.)
 2. Select **one FASTA record** per **region**.
    (Mussa needs each CDS and UTR to be in its own FASTA record.)
1092  3. Select **CDS in upper case, UTR in lower case.**
1093
1094 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_diff.png
1095    :alt: Genome Browser - SMN1 (human) - Get genomic sequence setup
1096    :align: center
1097
Now click the **submit** button. You will then see a FASTA file with
many FASTA records representing the CDS and UTRs.
1100
1101 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_submit.png
1102    :alt: Genome Browser - SMN1 (human) - CDS/UTR sequence
1103    :align: center
1104
1105 Now you need to save the FASTA records to a **text file**. If you are
1106 using **Firefox** or **Internet Explorer 6+** click on the **File >
1107 Save As** menu option.
1108
1109 **IMPORTANT:** Make sure you select **Text Files** and **NOT**, I
1110 repeat **NOT Webpage Complete** (see screenshot below.)
1111
1112 Type in **smn1_human_annot.txt** for the file name.
1113
1114 .. image:: images/smn1_human_annot.png
1115    :alt: Genome Browser - SMN1 (human) - sequence annotation file
1116    :align: center
1117
**IMPORTANT:** You should open the file with a text editor and make
sure **no HTML** was saved. If you find any HTML markup, delete
the markup and save the file.
1121
1122 Now we are going to **modify the file** you just saved to **add the
1123 name of the species** to the **annotation file**. All you have to do
1124 is **add a new line** at the **top of the file** with the word **'Human'** as
1125 shown below:
1126
1127 .. image:: images/smn1_human_annot_plus_human.png
1128    :alt: Genome Browser - SMN1 (human) - sequence annotation file
1129    :align: center
1130
1131 You can add more annotations to this file if you wish. See the
1132 `annotation file format`_ section for details of the file format. By
1133 including FASTA records in the annotation_ file, Mussa searches your
1134 DNA sequence for an exact match of the sequence in the annotation_
1135 file. If found, it will be marked as an annotation_ within Mussa.
1136
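The two manual checks in this step (no HTML in the saved file, species name
on the first line) can also be scripted. The following is a minimal sketch,
assuming the saved page is available as a string; the function name
``prepare_annotation_text`` and the record name in the usage example are
hypothetical.

.. code-block:: python

   import re

   # Matches things that look like HTML tags, e.g. <html> or </pre>.
   # FASTA headers start with '>' and are not matched by this pattern.
   HTML_TAG = re.compile(r"</?[A-Za-z][^<>\n]*>")

   def prepare_annotation_text(fasta_text, species):
       """Check for HTML markup, then put the species name on the first line."""
       if HTML_TAG.search(fasta_text):
           raise ValueError("HTML markup found; save the page as plain text")
       return species + "\n" + fasta_text.lstrip()

   # Usage with a made-up FASTA record:
   annotated = prepare_annotation_text(">example_record_1\nacgtACGT\n", "Human")
   assert annotated.splitlines()[0] == "Human"

The result can then be written to the annotation file, for example
**smn1_human_annot.txt**.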
1137
1138 Step 3 - Download gene and upstream/downstream sequence
1139 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1140
Use the back button in your web browser to get back to the **genome
browser view** of **SMN1** as shown below.
1143
1144 .. image:: images/ucsc_gb_smn1_human.png
1145    :alt: Genome Browser - SMN1 (human)
1146    :align: center
1147
There are two options for getting additional sequence around your
gene. The more complex option is to zoom out until the genome browser
shows all of the sequence you want, and then follow the directions for
the second option.
1152
1153 The second option, which we will choose, is to leave the genome
1154 browser zoomed exactly at the location of SMN1 and click on the
1155 **DNA** option on the menu bar (shown with orange arrows in the
1156 screenshot below.)
1157
1158 .. image:: images/ucsc_gb_smn1_human_dna_option.png
1159    :alt: Genome Browser - SMN1 (human) - DNA Option
1160    :align: center
1161
Now, on the **get dna in window** page, add some extra sequence to each
end of the gene. The amount is arbitrary; let's say 5000 base pairs.
1164
1165 .. image:: images/ucsc_gb_smn1_human_get_dna.png
1166    :alt: Genome Browser - SMN1 (human) - Get DNA
1167    :align: center
1168
1169 Click the **get DNA** button.
1170
1171 .. image:: images/ucsc_gb_smn1_human_dna.png
1172    :alt: Genome Browser - SMN1 (human) - DNA
1173    :align: center
1174
1175 Save the DNA sequence to a text file called 'smn1_human_dna.fa' as we
1176 did in step 2 with the annotation file.
1177
1178 **IMPORTANT:** Make sure the file is saved as a text file and not an
1179 HTML file. Open the file with a text editor and remove any HTML markup
1180 you find.
1181
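To see which region the padded sequence should cover, the arithmetic can be
written out as below. This is a sketch that assumes the browser window
matches the genomic range shown in step 2 (chr5:70,256,524-70,284,592); the
function name ``pad_region`` is made up for this example.

.. code-block:: python

   def pad_region(start, stop, padding):
       """Extend a 1-based, inclusive region by `padding` bases on each side."""
       # Never go below position 1 at the start of the chromosome.
       return max(1, start - padding), stop + padding

   start, stop = pad_region(70256524, 70284592, 5000)
   assert (start, stop) == (70251524, 70289592)
   # 28,069 bases of gene plus 2 x 5,000 bases of padding.
   assert stop - start + 1 == 38069

If the length of the saved sequence is very different from what you expect,
check that the browser window and the padding values were set as intended.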
1182
Step 4 - Same/similar/related gene in other species
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1185
1186 What good is a multiple sequence alignment viewer without multiple
sequences? Let's find a similar gene in a few more species.
1188
Use the back button on your web browser until you get back to the
**genome browser view** of **SMN1** as shown below.
1191
.. image:: images/ucsc_gb_smn1_human.png
   :alt: Genome Browser - SMN1 (human)
1194    :align: center
1195
**Click on SMN1**, located **between** the **two orange arrows** shown
below.
1198
1199 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
1200    :alt: Genome Browser - SMN1 (human) - Orange Arrows
1201    :align: center
1202
1203 You should find yourself at the SMN1 description page.
1204
1205 .. image:: images/ucsc_gb_smn1_description_page.png
1206    :alt: Genome Browser - SMN1 (human) - Description page
1207    :align: center
1208
1209 **Scroll down** until you get to the **Sequence section** and click on
1210 **Protein (262 aa)**.
1211
1212 .. image:: images/ucsc_gb_smn1_human_sequence.png
1213    :alt: Genome Browser - SMN1 (human) - Sequence
1214    :align: center
1215
Copy the SMN1 protein sequence by highlighting it and selecting the
**Edit > Copy** option from the menu.
1218
1219 .. image:: images/smn1_human_protein.png
1220    :alt: Genome Browser - SMN1 (human) - Protein
1221    :align: center
1222
1223 Press the back button on the web browser once and then scroll to the
1224 top of the page and click on the **BLAT** option on the menu bar
1225 (shown below with orange arrows).
1226
1227 .. image:: images/ucsc_gb_smn1_human_blat.png
1228    :alt: Genome Browser - SMN1 (human) - Blat
1229    :align: center
1230
1231 **Paste** in the **protein sequence** and **change** the **genome** to
1232 **mouse** as shown below and then click **submit**.
1233
1234 .. image:: images/ucsc_gb_smn1_human_blat_paste.png
1235    :alt: Genome Browser - SMN1 (human) - Blat paste protein
1236    :align: center
1237
1238 Notice that we have two hits, one of which looks pretty good at 89.9%
1239 match.
1240
1241 .. image:: images/ucsc_gb_smn1_human_blat_hits.png
1242    :alt: Genome Browser - SMN1 (human) - Blat hits
1243    :align: center
1244
**Click** on the **browser** link next to the 89.9% match. Notice in
the genome browser (shown below) that there is an annotated gene
called SMN1 for mouse which matches the line called **your sequence
from blat search**. This means we can be fairly confident that we found
the right location in the mouse genome.
1250
1251 .. image:: images/ucsc_gb_smn1_human_blat_to_browser.png
1252    :alt: Genome Browser - SMN1 (human) - Blat to browser
1253    :align: center
1254
1255 Follow steps 1 through 3 for mouse and then repeat step 4 with the
1256 human protein sequence to find **SMN1** in the following species (if
1257 you find a match):
1258
1259  1. Rat
1260  2. Rabbit
1261  3. Dog
1262  4. Armadillo
1263  5. Elephant
 6. Opossum
1265  7. x_tropicalis
1266
1267 Make sure to save the extended DNA sequence and annotation file for
1268 each one.
1269
1270
1271 Step 5 - Create Analysis
1272 ~~~~~~~~~~~~~~~~~~~~~~~~
1273
At this point you should have the annotation and FASTA files for each
1275 species. If you skipped the first four steps or are having trouble,
1276 you can download the example data from the `Mussa Example Data
1277 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/ExampleData>`_ page.
1278
There are two methods for creating an analysis. You can create a MUssa
PArameter file (.mupa), or you can use the create analysis dialog. To
use the analysis dialog, see the `create a new analysis`_ section.
1282
If you are planning to do many analyses using the same sets of DNA
sequence but with different parameters, annotations, and/or species,
it is often best to set up a `mupa`_ file, so you can:
1286
1287   * Change parameters and rerun analysis easily.
  * Use the Mussa command line option to run batch analyses.
1289   * Define an analysis for someone else to run.
1290
Now, we will create a `mupa`_ file for SMN1 for an analysis with
Human, Mouse, and Cow. I'll start by showing you the `mupa`_ file and
then walk you through it line by line.
1294
Start by creating a new text file called *smn1_human_mouse_cow.mupa*
in your smn1 directory. I decided to put the FASTA and annotation
files for each species in its own directory, so I will use that setup
(see screenshot below).
1299
1300 .. image:: images/smn1_dir_structure.png
1301    :alt: SMN1 directory structure
1302    :align: center
1303
1304 smn1_human_mouse_cow.mupa:
1305 ::
1306
1307   # Analysis name
1308   ANA_NAME smn1_human_mouse_cow
1309
1310   # Appending to analysis name
1311   APPEND_WIN true
1312   APPEND_THRES true
1313
1314   # Human sequence
1315   SEQUENCE human/smn1_human_dna.fasta
1316   ANNOTATION human/smn1_human_annotations.txt
1317
1318   SEQUENCE mouse/smn1_mouse_dna.fasta
1319   ANNOTATION mouse/smn1_mouse_annotations.txt
1320
1321   SEQUENCE cow/smn1_cow_dna.fasta
1322   ANNOTATION cow/smn1_cow_annotations.txt
1323
1324   # Window size / Threshold
1325   WINDOW 30
1326   THRESHOLD 24
1327
1328 The first line is the analysis name. This will be the name of the
1329 directory the results will be saved in when using the Mussa `command
1330 line`_ option --no-gui to run an analysis. If you are using the Mussa
1331 GUI, then you will be prompted for a directory name as mentioned in
1332 the `saving`_ section.
1333
1334 ::
1335
1336   # Analysis name
1337   ANA_NAME smn1_human_mouse_cow
1338
If you provide APPEND_WIN and/or APPEND_THRES and set them to
true, the window size and threshold will be appended to the analysis
name. In this example, using the --no-gui `command line`_ option, our
directory name would be *smn1_human_mouse_cow_w30_t24*.
1343
1344 ::
1345
1346   # Appending to analysis name
1347   APPEND_WIN true
1348   APPEND_THRES true
1349
The following six lines provide Mussa with the locations of the
sequence files and annotation files. The files can be given as paths
relative to the .mupa file. In other words, this .mupa file will find
the human sequence only if a directory called *human* exists in the
same directory as the .mupa file.
1356
1357 To provide the species name for each species, you have to put the
1358 species name in the annotation files. See the `annotation file
1359 format`_ section for more details.
1360
1361 ::
1362
1363   # Human sequence
1364   SEQUENCE human/smn1_human_dna.fasta
1365   ANNOTATION human/smn1_human_annotations.txt
1366
1367   SEQUENCE mouse/smn1_mouse_dna.fasta
1368   ANNOTATION mouse/smn1_mouse_annotations.txt
1369
1370   SEQUENCE cow/smn1_cow_dna.fasta
1371   ANNOTATION cow/smn1_cow_annotations.txt
1372
1373 And finally, the `window size`_ and `threshold`_ parameters.
1374
1375 ::
1376
1377   # Window size / Threshold
1378   WINDOW 30
1379   THRESHOLD 24
1380
Next, open Mussagl and select the **File > Create Analysis from File**
menu option. Mussagl should run your analysis if everything was set up
properly.
1384
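When the same sequences are analysed with several window sizes and
thresholds, the .mupa files described in this step can be generated instead
of typed by hand. The sketch below writes the same keywords used in the
example file; the function names, the ``gene`` parameter, and the file
naming pattern are assumptions for illustration. The ``_w30_t24`` suffix
follows the directory name given above, and the form of the suffix when only
one of APPEND_WIN or APPEND_THRES is set is a guess.

.. code-block:: python

   def analysis_dir_name(name, window, threshold,
                         append_win=True, append_thres=True):
       """Predict the results directory name used with --no-gui."""
       if append_win:
           name += "_w%d" % window
       if append_thres:
           name += "_t%d" % threshold
       return name

   def make_mupa(name, gene, species, window, threshold):
       """Build the text of a .mupa file laid out like the example above."""
       lines = [
           "# Analysis name",
           "ANA_NAME " + name,
           "",
           "# Appending to analysis name",
           "APPEND_WIN true",
           "APPEND_THRES true",
           "",
       ]
       for sp in species:
           # One directory per species, relative to the .mupa file.
           lines.append("SEQUENCE %s/%s_%s_dna.fasta" % (sp, gene, sp))
           lines.append("ANNOTATION %s/%s_%s_annotations.txt" % (sp, gene, sp))
           lines.append("")
       lines += [
           "# Window size / Threshold",
           "WINDOW %d" % window,
           "THRESHOLD %d" % threshold,
       ]
       return "\n".join(lines) + "\n"

   text = make_mupa("smn1_human_mouse_cow", "smn1",
                    ["human", "mouse", "cow"], 30, 24)
   assert analysis_dir_name("smn1_human_mouse_cow", 30, 24) == \
       "smn1_human_mouse_cow_w30_t24"

Writing ``text`` to *smn1_human_mouse_cow.mupa* should give a file
equivalent to the one shown in this step, apart from the ``# Human
sequence`` comment.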
1385
1386
1387 Understanding Mussa
1388 ===================
1389
1390 Command Line
1391 ------------
1392
1393 Mussa has some very useful command line options that allow for
1394 loading an existing analysis or running a new analysis with or without
1395 launching the GUI.
1396
1397 Mussa options:
1398   --help                     help message
1399   -p, --run-analysis arg     run an analysis defined by the mussa parameter file
1400   --view-analysis arg        load a previously run analysis
1401   --motifs arg               annotate analysis with motifs from this file
1402   --no-gui                   terminate without running an analysis
1403   --python                   launch as a `python interpreter`_
1404
1405 Running an analysis using the --no-gui option is useful when you want
1406 to run many analyses on a compute server and save the results for
1407 viewing in the future.
1408
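A batch of analyses can be driven from a short script. The sketch below
builds one command per parameter file from the ``--run-analysis`` and
``--no-gui`` options listed above. The executable name ``mussagl`` and the
helper names are assumptions; adjust them to match your installation.

.. code-block:: python

   import subprocess

   def mussa_batch_command(mupa_path, executable="mussagl"):
       """Build the argument list for one analysis run without the GUI."""
       # Assumption: the program is on the PATH under the name 'mussagl'.
       return [executable, "--run-analysis", mupa_path, "--no-gui"]

   def run_batch(mupa_paths, executable="mussagl"):
       """Run each parameter file in turn, stopping at the first failure."""
       for path in mupa_paths:
           subprocess.run(mussa_batch_command(path, executable), check=True)

   # Building a command does not start Mussa; run_batch() does.
   print(mussa_batch_command("smn1_human_mouse_cow.mupa"))

Passing the arguments as a list avoids quoting problems when a path
contains spaces.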
1409
1410 Performance
1411 -----------
1412
1413 Algorithm Behavior
1414 ~~~~~~~~~~~~~~~~~~
1415
1416 FIXME: Include seqcomp algorithm info.
1417
1418 FIXME: Include transitivity info.
1419
1420 Repeats
1421 ~~~~~~~
1422
1423 Repeat masking of all input sequences, or at least of the "reference"
1424 genome, can be important for reducing compute time and for simplifying
1425 subsequent visual interpretation. Larger loci generally contain more
1426 repeat elements, and as their number grows so will the number of Mussa
1427 connections among them. If not repeat filtered, connectivity between
1428 shared repeat elements can obscure important relationships between
1429 single copy features.
1430
1431 The formula for the number of connections, C, that will be made for R
1432 instances of a single repeat (meaning R copies of one repeat in each
1433 sequence) and S sequences is:
1434
1435 C = (R^2)[S(S-1)/2]
1436
1437 Table of example situations:
1438
1439 =====  =====  =====
1440   C      R      S
1441 =====  =====  =====
1442    16     4     2
1443    48     4     3
1444    96     4     4
1445   160     4     5
1446   240     4     6
1447   336     4     7
1448   448     4     8
1449    24     2     4
1450    54     3     4
1451    96     4     4
1452   150     5     4
1453   216     6     4
1454   294     7     4
1455   384     8     4
1456  2500    50     2
1457  7500    50     3
1458 15000    50     4
1459 10000   100     2
1460 30000   100     3
1461 60000   100     4
1462 =====  =====  =====
1463
After the connections, C, are found, they are passed on to the
transitivity filter, which is a C^2 algorithm (FIXME: confirm
algorithm is C^2). This means that 50 repeats in each of 2 sequences,
which gives a C of 2500, results in a C^2 of 6,250,000.
1468
1469 **Conclusion: repeats cause the processing time of Mussa to skyrocket.**
1470
1471 To deal with a situation where you have many repeats in your sequences
1472 do any of the following:
1473
1474  * Use shorter sequence lengths.
1475  * Repeat mask one or more of your sequences.
1476  * Increase the threshold.
1477
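The numbers in this section can be reproduced with a few lines of Python.
The function names are made up for this sketch, and the squared cost of the
transitivity filter is taken from the text above, where it is still marked
as unconfirmed.

.. code-block:: python

   def connections(repeats, sequences):
       """C = (R^2)[S(S-1)/2] for R repeat copies in each of S sequences."""
       # S(S-1) is always even, so integer division is exact.
       return repeats ** 2 * (sequences * (sequences - 1) // 2)

   def transitivity_cost(repeats, sequences):
       """Rough work estimate for the filter, assuming it scales as C^2."""
       return connections(repeats, sequences) ** 2

   assert connections(4, 2) == 16
   assert connections(50, 2) == 2500
   assert connections(100, 4) == 60000
   assert transitivity_cost(50, 2) == 6250000

   # Masking half of the repeats (R = 25 instead of 50) cuts C by a factor
   # of 4 and the C^2 estimate by a factor of 16.
   assert connections(25, 2) == 625
   assert transitivity_cost(25, 2) == 390625

Because C grows with the square of R, removing repeats usually reduces the
work more than removing a sequence does.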
1478
1479 Details
1480 -------
1481
1482 Case: Conservation track suddenly stops
1483 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1484
1485 Details about this potentially confusing case can be found `here
1486 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/OverlappingWindows>`_.
1487
1488 Python Interpreter
1489 ------------------
1490
Mussagl has some functionality for running a Python interpreter for
interacting with the internals of Mussagl and/or executing Python
1493 code. This feature is mostly experimental at this point in time. If
1494 you have interest in this feature or would like to know more about it,
1495 contact us using the contact information found at
1496 http://mussa.caltech.edu/.
1497
1498 .. Define links below
1499    ------------------
1500
1501 .. _GPL: http://www.opensource.org/licenses/gpl-license.php
1502 .. _wiki: http://mussa.caltech.edu
1503 .. _build: http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild
1504 .. _FASTA: http://en.wikipedia.org/wiki/fasta_format
1505 .. _wpDnaMotif: http://en.wikipedia.org/wiki/DNA_motif
1506 .. _mupa: `Parameter File Format`_