doc/manual/mussagl_manual.rst

   1 ==============
   2 Mussagl Manual
   3 ==============
   4 ---------------
   5 Brandon W. King
   6 ---------------
   7
   8 Last updated: Oct 20th, 2006
   9
  10 Updated to Mussagl build: (In process to 424)
  11
  12
  13 .. Things to add
  14         * New features / change log
  15         * (DONE) Comment out anything that isn't implemented yet.
  16         * (DONE) List of features that will be implemented in the future.
  17         * Look into the homology mapping of UCSC.
  18         * Add toggle to genomes.
  19         * Document why one fast record per region.
  20         * How to deal with the hazards of small utrs vis motif finder. (Add warning)
  21         * Add warning about saving FASTA file.
  22         * Add a general principles section near the top
  23                 * Using comparison algorithm which will pickup all repeats
  24                 * Add info about repeatmasking
  25                 * Checking upstream and downstream genes to make sure you are in the right regions.
  26         * Later on: look into Ensembl
  27         * Look into method of homology instead of blating.
  28         * Mention advantages of using mupa.
  29         * Mention the difference between using arrows and scroll bar
  30         * Document the color for motifs
  31         * Update for Mac user left-click
  32
  33         * Wormbase/Flybase/mirBASE tutorials
  34
  35
  36
  37 .. contents::
  38
  39 Status
  40 ======
  41
  42 Major New Features
  43 ------------------
  44
  45  * Build 381
  46    * Analysis "Save As" feature
  47
  48 Change Log
  49 ----------
  50
  51 .. INSERT CHANGE LOG HERE
  52 .. END INSERT CHANGE LOG
  53
  54 Features to be Implemented
  55 --------------------------
  56
  57 For an up-to-date list of features to be implemented visit:
  58 http://woldlab.caltech.edu/cgi-bin/mussa/roadmap
  59
  60 Introduction
  61 ============
  62
  63
  64 What is Mussagl?
  65 ----------------
  66
  67 Mussa is an N-way version of the FamilyRelations (which is a part of
  68 the Cartwheel project) 2-way comparative sequence analysis
  69 software. Given DNA sequence from N species, Mussa uses all possible
  70 pairwise comparisons to derive an N-wise comparison. For example, given
  71 sequences 1,2,3, and 4, Mussa makes 6 2-way comparisons: 1vs2, 1vs3,
  72 1vs4, 2vs3, 2vs4, and 3vs4. It then compares all the links between
  73 these comparisons, saving those that satisfy a transitivity
  74 requirement. The saved paths are then displayed in an interactive
  75 viewer.
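
As a rough illustration of how the number of 2-way comparisons grows with N, the following Python sketch enumerates the pairs for the four-sequence example above. It only lists the pairs; the comparison and transitivity steps themselves are not shown, and the function name ``pairwise_comparisons`` is made up for this example rather than taken from Mussa.

```python
from itertools import combinations


def pairwise_comparisons(n_sequences):
    """Return every unordered pair of 1-based sequence numbers."""
    return list(combinations(range(1, n_sequences + 1), 2))


pairs = pairwise_comparisons(4)
print(len(pairs))  # 6
print(", ".join(f"{a}vs{b}" for a, b in pairs))
# 1vs2, 1vs3, 1vs4, 2vs3, 2vs4, 3vs4
```

In general, N sequences require ``N * (N - 1) / 2`` pairwise comparisons, so the number of comparisons grows quickly as sequences are added.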
  76
  77 Short History of Mussa
  78 ----------------------
  79
  80 Mussa Python/PMW Prototype
  81 ~~~~~~~~~~~~~~~~~~~~~~~~~~
  82
  83 First Python/PMW-based prototype.
  84
  85 Mussa C++/FLTK
  86 ~~~~~~~~~~~~~~
  87
  88 A rewrite for speed, using C++ and the FLTK GUI toolkit.
  89
  90 Mussagl C++/Qt/OpenGL
  91 ~~~~~~~~~~~~~~~~~~~~~
  92
  93 Refactored version using the more elegant Qt GUI framework and
  94 OpenGL for hardware acceleration for those who have better graphics
  95 cards.
  96
  97 Getting Mussagl
  98 ===============
  99
 100 License
 101 -------
 102
 103 Mussagl has been released open source under the `GPL v2
 104 license`__.
 105
 106 __ GPL_
 107
 108 Platforms
 109 ---------
 110
 111 You have the option of building from source or downloading prebuilt
 112 binaries. Most people will want the prebuilt versions.
 113
 114 Supported Platforms:
 115
 116  * Mac OS X (binary or source)
 117  * Windows XP (binary or source)
 118  * Linux (source)
 119
 120 Download
 121 --------
 122
 123 Mussagl in binary form for OS X and Windows and/or source can be
 124 downloaded from http://mussa.caltech.edu/.
 125
 126 Install
 127 -------
 128
 129 Mac OS X
 130 ~~~~~~~~
 131 Once you have downloaded the .dmg file, double click on it and follow
 132 the install instructions.
 133
 134 FIXME: Mention how to launch the program.
 135
 136
 137 Windows XP
 138 ~~~~~~~~~~
 139 Once you have downloaded the Mussagl installer, double click on the
 140 installer and follow the install instructions.
 141
 142 To start Mussagl, launch the program from Start > Programs > Mussagl >
 143 Mussagl.
 144
 145
 146 Linux
 147 ~~~~~
 148 Currently we do not have a binary installer for Linux. You will have
 149 to build from source. See the 'build from source' section below.
 150
 151
 152 Build from Source
 153 ~~~~~~~~~~~~~~~~~
 154
 155 Instructions for building from source can be found on the `build page
 156 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild>`_ of the
 157 `Mussa wiki`__.
 158
 159 __ wiki_
 160
 161
 162 Obtaining Input Data
 163 ====================
 164
 165 If you already have your data, you can skip ahead to the `Using
 166 Mussagl`_ section.
 167
 168 Let's say you have a gene of interest called 'SMN1' and you want to
 169 know how well the sequence surrounding the gene is conserved across
 170 multiple species. That is exactly what we are going to do: retrieve the
 171 DNA sequence for SMN1 and prepare it for use in Mussa.
 172
 173 For more information about SMN1 visit `NCBI's OMIM
 174 <http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=609682>`_.
 175
 176 The SMN1 data retrieved in this section can be downloaded from the
 177 `Mussa Example Data
 178 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/ExampleData>`_ page if
 179 you prefer to skip this section of the manual.
 180
 181
 182 UCSC Genome Browser Method
 183 --------------------------
 184
 185 There are many methods of retrieving DNA sequence, but for this
 186 example we will retrieve SMN1 through the UCSC genome browser located
 187 at http://genome.ucsc.edu/.
 188
 189
 190 .. image:: images/ucsc_genome_browser_home.png
 191    :alt: UCSC Genome Browser
 192    :align: center
 193
 194 Step 1 - Find SMN1
 195 ~~~~~~~~~~~~~~~~~~
 196
 197 The first step in finding SMN1 is to use the **Gene Sorter** menu
 198 option which I have highlighted in orange below:
 199
 200 .. image:: images/ucsc_menu_bar_gene_sorter.png
 201    :alt: Gene Sorter Menu Option
 202    :align: center
 203
 204 Gene Sorter page:
 205
 206 .. image:: images/ucsc_gene_sorter.png
 207    :alt: Gene Sorter
 208    :align: center
 209
 210 We will start by looking for SMN1 in the **Human Genome** and **sorting by name similarity**.
 211
 212 .. image:: images/ucsc_gs_sort_name_sim.png
 213    :alt: Gene Sorter - Name Similarity
 214    :align: center
 215
 216 After you have selected **Human Genome** and **sorting by name similarity**, type *SMN1* into the search box.
 217
 218 .. image:: images/ucsc_gs_smn1.png
 219    :alt: Gene
 220    :align: center
 221
 222 Press **Go!** and you should see the following page:
 223
 224 .. image:: images/ucsc_gs_found.png
 225    :alt: Found SMN1
 226    :align: center
 227
 228 Click on **SMN1** and you will be taken to the gene expression atlas
 229 page.
 230
 231 .. image:: images/ucsc_gs_genome_position.png
 232    :alt: Gene expression atlas
 233    :align: center
 234
 235 Click on **chr5 70,270,558** found in the **SMN1 row**, **Genome
 236 position column**.
 237
 238 Now we have found the location of SMN1 in human!
 239
 240 .. image:: images/ucsc_gb_smn1_human.png
 241    :alt: Genome Browser - SMN1 (human)
 242    :align: center
 243
 244
 245 Step 2 - Download CDS/UTR sequence for annotations
 246 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 247
 248 Since we have found **SMN1**, this would be a convenient time to extract
 249 the DNA sequence for the CDS and UTRs of the gene to use it as an
 250 annotation_ in Mussa.
 251
 252 **Click on SMN1**, located **between** the **two orange arrows** shown
 253 below.
 254
 255 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
 256    :alt: Genome Browser - SMN1 (human) - Orange Arrows
 257    :align: center
 258
 259 You should find yourself at the SMN1 description page.
 260
 261 .. image:: images/ucsc_gb_smn1_description_page.png
 262    :alt: Genome Browser - SMN1 (human) - Description page
 263    :align: center
 264
 265 **Scroll down** until you get to the **Sequence section** and click on
 266 **Genomic (chr5:70,256,524-70,284,592)**.
 267
 268 .. image:: images/ucsc_gb_smn1_human_sequence.png
 269    :alt: Genome Browser - SMN1 (human) - Sequence
 270    :align: center
 271
 272 You should now be at the **Genomic sequence near gene** page:
 273
 274 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence.png
 275    :alt: Genome Browser - SMN1 (human) - Get genomic sequence
 276    :align: center
 277
 278 Make the following changes (highlighted in orange in the screenshot
 279 below):
 280
 281  1. UNcheck **introns**.
 282     (We only want to annotate CDS and UTRs.)
 283  2. Select **one FASTA record** per **region**.
 284     (Mussa needs each CDS and UTR to be represented by its own FASTA record.)
 285  3. Select **CDS in upper case, UTR in lower case.**
 286
 287 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_diff.png
 288    :alt: Genome Browser - SMN1 (human) - Get genomic sequence setup
 289    :align: center
 290
 291 Now click the **submit** button. You will then see a FASTA file with
 292 many FASTA records representing the CDS and UTRs.
 293
 294 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_submit.png
 295    :alt: Genome Browser - SMN1 (human) - CDS/UTR sequence
 296    :align: center
 297
 298 Now you need to save the FASTA records to a **text file**. If you are
 299 using **Firefox** or **Internet Explorer 6+** click on the **File >
 300 Save As** menu option.
 301
 302 **IMPORTANT:** Make sure you select **Text Files** and **NOT**, I
 303 repeat **NOT Webpage Complete** (see screenshot below).
 304
 305 Type in **smn1_human_annot.txt** for the file name.
 306
 307 .. image:: images/smn1_human_annot.png
 308    :alt: Genome Browser - SMN1 (human) - sequence annotation file
 309    :align: center
 310
 311 **IMPORTANT:** You should open the file with a text editor and make
 312 sure **no HTML** was saved. If you find any HTML markup, delete
 313 the markup and save the file.
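
If you would rather not inspect the file by eye, a small script can flag suspicious lines. The Python sketch below is only an illustration: the helper name ``lines_with_html`` and the regular expression are assumptions made for this example, and the check simply looks for anything shaped like an HTML tag. FASTA header lines, which start with ``>``, are not flagged.

```python
import re

# Matches text shaped like an HTML tag, e.g. <html>, </pre>, <a href="...">.
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def lines_with_html(text):
    """Return (line_number, line) pairs for lines containing HTML-like tags."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if HTML_TAG.search(line)
    ]


# A made-up sample of what a badly saved file might contain.
sample = "<PRE>\n>record_1\nACGT\n</PRE>\n"
print(lines_with_html(sample))  # [(1, '<PRE>'), (4, '</PRE>')]
```

To check a saved file, read its contents into a string first and pass that string to ``lines_with_html``; an empty list means no HTML-like markup was found.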
 314
 315 Now we are going to **modify the file** you just saved to **add the
 316 name of the species** to the **annotation file**. All you have to do
 317 is **add a new line** at the **top of the file** with the word **'Human'** as
 318 shown below:
 319
 320 .. image:: images/smn1_human_annot_plus_human.png
 321    :alt: Genome Browser - SMN1 (human) - sequence annotation file
 322    :align: center
 323
 324 You can add more annotations to this file if you wish. See the
 325 `annotation file format`_ section for details of the file format. By
 326 including FASTA records in the annotation_ file, Mussa searches your
 327 DNA sequence for an exact match of the sequence in the annotation_
 328 file. If found, it will be marked as an annotation_ within Mussa.
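
To make the exact-match idea concrete, here is a minimal Python sketch. It assumes only the layout described in this section (a species name on the first line, followed by FASTA records) and is not Mussa's actual implementation: the function names are invented for the example, ignoring upper/lower case is an assumption, and only the first occurrence of each record is reported.

```python
def read_fasta_records(lines):
    """Yield (header, sequence) pairs; text before the first '>' is skipped."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_exact_matches(dna, records):
    """Return (header, start, end) for records found verbatim in dna.

    Positions are 0-based with an exclusive end. Case is ignored, which is
    an assumption of this sketch.
    """
    dna = dna.upper()
    hits = []
    for header, seq in records:
        start = dna.find(seq.upper())
        if start != -1:
            hits.append((header, start, start + len(seq)))
    return hits


# Made-up data: a species line followed by three FASTA records.
annotation_text = "Human\n>utr5\nacgtac\n>cds1\nGGGCCC\nTTT\n>absent\nAAAAAAAAAA\n"
dna = "TTACGTACTTGGGCCCTTTAA"
records = list(read_fasta_records(annotation_text.splitlines()))
print(find_exact_matches(dna, records))
# [('utr5', 2, 8), ('cds1', 10, 19)]
```

The record named ``absent`` does not occur in the sequence, so it produces no annotation.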
 329
 330
 331 Step 3 - Download gene and upstream/downstream sequence
 332 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 333
 334 Use the back button in your web browser to get back to the **genome
 335 browser view** of **SMN1** as shown below.
 336
 337 .. image:: images/ucsc_gb_smn1_human.png
 338    :alt: Genome Browser - SMN1 (human)
 339    :align: center
 340
 341 There are two options for getting additional sequence around your
 342 gene. The more complex way is to zoom out until the sequence you
 343 want is shown in the genome browser and then follow the directions
 344 for the second option below.
 345
 346 The second option, which we will choose, is to leave the genome
 347 browser zoomed exactly at the location of SMN1 and click on the
 348 **DNA** option on the menu bar (shown with orange arrows in the
 349 screenshot below.)
 350
 351 .. image:: images/ucsc_gb_smn1_human_dna_option.png
 352    :alt: Genome Browser - SMN1 (human) - DNA Option
 353    :align: center
 354
 355 Now in the **get dna in window** page, let's add some extra sequence
 356 onto each end of the gene, say 5000 base pairs.
 357
 358 .. image:: images/ucsc_gb_smn1_human_get_dna.png
 359    :alt: Genome Browser - SMN1 (human) - Get DNA
 360    :align: center
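
As a quick sanity check on how much sequence to expect, the arithmetic can be sketched in Python. This example assumes the browser window matches the genomic range shown on the SMN1 description page (chr5:70,256,524-70,284,592) and that coordinates are 1-based and inclusive; the helper ``pad_region`` is made up for this illustration and does not check the end of the chromosome.

```python
def pad_region(start, end, padding):
    """Extend a 1-based, inclusive region by `padding` bases on each side."""
    return max(1, start - padding), end + padding


start, end = 70_256_524, 70_284_592
padded_start, padded_end = pad_region(start, end, 5000)

print(end - start + 1)                # 28069 bases in the original region
print(padded_start, padded_end)       # 70251524 70289592
print(padded_end - padded_start + 1)  # 38069 bases after padding
```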
 361
 362 Click the **get DNA** button.
 363
 364 .. image:: images/ucsc_gb_smn1_human_dna.png
 365    :alt: Genome Browser - SMN1 (human) - DNA
 366    :align: center
 367
 368 Save the DNA sequence to a text file called 'smn1_human_dna.fa' as we
 369 did in step 2 with the annotation file.
 370
 371 **IMPORTANT:** Make sure the file is saved as a text file and not an
 372 HTML file. Open the file with a text editor and remove any HTML markup
 373 you find.
 374
 375
 376 Step 4 - Same/similar/related gene in other species
 377 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 378
 379 What good is a multiple sequence alignment viewer without multiple
 380 sequences? Let's find a similar gene in a few more species.
 381
 382 Use the back button on your web browser until you get back to the **genome
 383 browser view** of **SMN1** as shown below.
 384
 385 .. image:: images/ucsc_genome_browser_home.png
 386    :alt: UCSC Genome Browser
 387    :align: center
 388
 389 **Click on SMN1**, located **between** the **two orange arrows** shown
 390 below.
 391
 392 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
 393    :alt: Genome Browser - SMN1 (human) - Orange Arrows
 394    :align: center
 395
 396 You should find yourself at the SMN1 description page.
 397
 398 .. image:: images/ucsc_gb_smn1_description_page.png
 399    :alt: Genome Browser - SMN1 (human) - Description page
 400    :align: center
 401
 402 **Scroll down** until you get to the **Sequence section** and click on
 403 **Protein (262 aa)**.
 404
 405 .. image:: images/ucsc_gb_smn1_human_sequence.png
 406    :alt: Genome Browser - SMN1 (human) - Sequence
 407    :align: center
 408
 409 Copy the SMN1 protein sequence by highlighting it and selecting the
 410 **Edit > Copy** option from the menu.
 411
 412 .. image:: images/smn1_human_protein.png
 413    :alt: Genome Browser - SMN1 (human) - Protein
 414    :align: center
 415
 416 Press the back button on the web browser once and then scroll to the
 417 top of the page and click on the **BLAT** option on the menu bar
 418 (shown below with orange arrows).
 419
 420 .. image:: images/ucsc_gb_smn1_human_blat.png
 421    :alt: Genome Browser - SMN1 (human) - Blat
 422    :align: center
 423
 424 **Paste** in the **protein sequence** and **change** the **genome** to
 425 **mouse** as shown below and then click **submit**.
 426
 427 .. image:: images/ucsc_gb_smn1_human_blat_paste.png
 428    :alt: Genome Browser - SMN1 (human) - Blat paste protein
 429    :align: center
 430
 431 Notice that we have two hits, one of which looks pretty good at 89.9%
 432 match.
 433
 434 .. image:: images/ucsc_gb_smn1_human_blat_hits.png
 435    :alt: Genome Browser - SMN1 (human) - Blat hits
 436    :align: center
 437
 438 **Click** on the **browser** link next to the 89.9% match. Notice in
 439 the genome browser (shown below) that there is an annotated gene
 440 called SMN1 for mouse which matches the line called **your sequence
 441 from blat search**. This means we are fairly confident we found the
 442 right location in the mouse genome.
 443
 444 .. image:: images/ucsc_gb_smn1_human_blat_to_browser.png
 445    :alt: Genome Browser - SMN1 (human) - Blat to browser
 446    :align: center
 447
 448 Follow steps 1 through 3 for mouse and then repeat step 4 with the
 449 human protein sequence to find **SMN1** in the following species (if
 450 you find a match):
 451
 452  1. Rat
 453  2. Rabbit
 454  3. Dog
 455  4. Armadillo
 456  5. Elephant
 457  6. Opossum
 458  7. x_tropicalis
 459
 460 Make sure to save the extended DNA sequence and annotation file for
 461 each one.
 462
 463 Using Mussagl
 464 =============
 465
 466
 467 Launch Mussagl
 468 --------------
 469 Launch Mussagl. It should look similar to the screenshot below.
 470
 471 .. image:: images/opened.png
 472    :alt: Launch Mussa
 473    :align: center
 474
 475
 476
 477 Create/Load Analysis
 478 ----------------------
 479
 480 Currently there are three ways to load a Mussa experiment.
 481
 482  1. `Create a new analysis`_
 483  2. `Load a mussa parameter file`_ (.mupa)
 484  3. `Load an analysis`_
 485
 486 .. _createnew:
 487
 488 Create a new analysis
 489 ~~~~~~~~~~~~~~~~~~~~~
 490
 491 To create a new analysis select 'Define analysis' from the 'File'
 492 menu. You should see a dialog box similar to the one below. For this
 493 demo we will use the example sequences that come with Mussagl.
 494
 495 .. image:: images/define_analysis.png
 496    :alt: Define Analysis
 497    :align: center
 498
 499 Instructions:
 500
 501  1. **Give the experiment a name**. For this demo, we'll use
 502     'demo_w30_t20'. Mussa will create a folder with this name to store
 503     the analysis files in once it has been run.
 504
 505  2. Choose a threshold_. For this demo **choose 20**. See the
 506     Threshold_ section for more detailed information.
 507
 508  3. Choose a `window size`_. For this demo **choose 30**.
 509
 510
 511  4. Choose the number of sequences_ you would like. For this demo
 512     **choose 3**.
 513
 514 .. image:: images/define_analysis_step1a.png
 515    :alt: Steps 1-4
 516    :align: center
 517
 518 First enter the species name "Human" in the first "Species" text
 519 box. Now click on the 'Browse' button next to the sequence input box
 520 and then select the /examples/seq/human_mck_pro.fa file. Do the same in
 521 the next two sequence input boxes, selecting mouse_mck_pro.fa and
 522 rabbit_mck_pro.fa as shown below. Make sure to give them a species
 523 name as well. Note that you can create annotation files using the
 524 mussa `Annotation File Format`_ to add annotations to your sequence.
 525
 526 .. image:: images/define_analysis_step2.png
 527    :alt: Choose sequences
 528    :align: center
 529
 530 Click the **create** button and in a few moments you should see
 531 something similar to the following screen shot.
 532
 533 .. image:: images/demo.png
 534    :alt: Mussagl Demo
 535    :align: center
 536
 537 By default your analysis is NOT saved. If you try to close an analysis
 538 without saving, you will be prompted with a dialog box asking you if
 539 you would like to save your analysis. See the `Saving`_ section for
 540 details on saving your analysis. When saving, choose a directory and
 541 give the analysis the name **demo_w30_t20**. If you close and reopen
 542 Mussagl, you will then be able to load the saved analysis. See the `Load
 543 an analysis`_ section below for details.
 544
 545
 546 Load a mussa parameter file
 547 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 548
 549 If you prefer, you can define your Mussa analysis using the Mussa
 550 parameter file. See the `Parameter File Format`_ section for details
 551 on creating a .mupa file.
 552
 553 Once you have a .mupa file created, load Mussagl and select the **File >
 554 Create Analysis from File** menu option. Select the .mupa file and click
 555 open.
 556
 557 .. image:: images/load_mupa_menu.png
 558    :alt: Load Mussa Parameters
 559    :align: center
 560
 561 If you would like to see an example, you can load the
 562 **mck3test.mupa** file in the examples directory that comes with
 563 Mussagl.
 564
 565 .. image:: images/load_mupa_dialog.png
 566    :alt: Load Mussa Parameters Dialog
 567    :align: center
 568
 569
 570 Load an analysis
 571 ~~~~~~~~~~~~~~~~
 572
 573 To load a previously run analysis open Mussagl and select the **File >
 574 Open Existing Analysis** menu option. Select an analysis **directory** and
 575 click open.
 576
 577 .. image:: images/load_analysis_menu.png
 578    :alt: Load Analysis Menu
 579    :align: center
 580
 581
 582 Main Window
 583 -----------
 584
 585 Overview
 586 ~~~~~~~~
 587 .. Screen-shot with numbers showing features.
 588
 589 .. image:: images/window_overview.png
 590    :alt: Mussa Window
 591    :align: center
 592
 593 Legend:
 594
 595  1. `DNA Sequence (Black bars)`_
 596
 597  2. Annotation_
 598
 599  3. Motif_
 600
 601  4. `Red conservation tracks`_
 602
 603  5. `Blue conservation tracks`_
 604
 605  6. `Zoom Factor`_ (Base pairs per pixel)
 606
 607  7. `Dynamic Threshold`_
 608
 609  8. `Sequence Information Bar`_
 610
 611  9. `Sequence Scroll Bar`_
 612
 613
 614 DNA Sequence (black bars)
 615 ~~~~~~~~~~~~~~~~~~~~~~~~~
 616
 617 .. image:: images/sequence_bar.png
 618    :alt: Sequence Bar
 619    :align: center
 620
 621 Each of the black bars represents one of the loaded sequences, in this
 622 case the sequence around the gene 'MCK' in human, mouse, and rabbit.
 623
 624
 625 Annotation
 626 ~~~~~~~~~~
 627
 628 .. figure:: images/annotation.png
 629    :alt: Annotation
 630    :align: center
 631
 632    Annotation shown in green on sequence bar.
 633
 634
 635 Annotations can be included on any of the sequences using the `Load a
 636 mussa parameter file`_ or `Create a new analysis`_ method of loading
 637 your sequences. You can define annotations by location or using an
 638 exact sub-sequence or a FASTA sequence of the section of DNA you wish
 639 to annotate. See the `Annotation File Format`_ section for details.
 640
 641
 642 Motif
 643 ~~~~~
 644
 645 .. figure:: images/motif.png
 646    :alt: Motif
 647    :align: center
 648
 649    Motif shown in light blue on sequence bar.
 650
 651 The only real difference between an annotation and motif in Mussagl is
 652 that you can define motifs and choose a color from within the GUI. See
 653 the `Motifs`_ section for more information.
 654
 655
 656 Red conservation tracks
 657 ~~~~~~~~~~~~~~~~~~~~~~~
 658
 659 .. figure:: images/conservation_tracks.png
 660    :alt: Conservation Tracks
 661    :align: center
 662
 663    Conservation tracks shown as red and blue lines between sequence
 664    bars.
 665
 666 The **red lines** between the sequence bars represent conservation
 667 between the sequences (i.e. not reverse complement matches).
 668
 669 The amount of sequence conservation shown will depend on how closely your
 670 sequences are related and the `dynamic threshold`_ you are using.
 671
 672
 673 Blue conservation tracks
 674 ~~~~~~~~~~~~~~~~~~~~~~~~
 675
 676 .. figure:: images/conservation_tracks.png
 677    :alt: Conservation Tracks
 678    :align: center
 679
 680    Conservation tracks shown as red and blue lines between sequence
 681    bars.
 682
 683 **Blue lines** represent **reverse complement** conservation relative
 684 to the sequence attached to the top of the blue line.
 685
 686 The amount of sequence conservation shown will depend on how closely your
 687 sequences are related and the `dynamic threshold`_ you are using.
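
For readers unfamiliar with the term, the reverse complement of a DNA sequence is obtained by swapping each base for its partner (A with T, C with G) and reading the result backwards. The short Python sketch below shows only that definition; the function name is chosen for this example and it handles just the four unambiguous bases.

```python
# Translation table that swaps each base for its complement, keeping case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCC"))  # GGCAT
```

A blue line therefore connects a stretch of one sequence to a stretch of another whose reverse complement it resembles.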
 688
 689
 690 Zoom Factor
 691 ~~~~~~~~~~~
 692
 693 .. image:: images/zoom_factor.png
 694    :alt: Zoom Factor
 695    :align: center
 696
 697 The zoom factor is the number of base pairs represented per
 698 pixel. When you zoom in far enough, the display will switch from
 699 a black bar representing the sequence to the actual sequence
 700 (well, an ASCII representation of the sequence).
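
To get a feel for the numbers, the following Python sketch converts a sequence length and a zoom factor into an approximate bar width. It is plain arithmetic under the assumption that partial pixels are rounded up; the function name is invented for this example and how Mussagl actually rounds is not documented here.

```python
import math


def bar_width_in_pixels(sequence_length, bp_per_pixel):
    """Approximate on-screen width of a sequence at a given zoom factor."""
    if bp_per_pixel <= 0:
        raise ValueError("bp_per_pixel must be positive")
    return math.ceil(sequence_length / bp_per_pixel)


print(bar_width_in_pixels(38069, 50))   # 762
print(bar_width_in_pixels(1000, 0.5))   # 2000
```

A zoom factor below 1 means that each base pair spans more than one pixel.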
 701
 702
 703 Dynamic Threshold
 704 ~~~~~~~~~~~~~~~~~
 705
 706 .. image:: images/dynamic_threshold.png
 707    :alt: Dynamic Threshold
 708    :align: center
 709
 710 You can dynamically change the threshold for how strong a match you
 711 consider the conservation to be by changing the value in the dynamic
 712 threshold box.
 713
 714 The value you enter is the minimum number of base pairs that have to
 715 be matched in order to be considered conserved. The second number that
 716 you can't change is the `window size`_ you used when creating the
 717 experiment. The last number is the percent match.
 718
 719 See the Threshold_ section for more information.
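
The relationship between the three numbers can be sketched in a few lines of Python. This is an illustration based only on the description above (a window counts as conserved when at least the threshold number of bases match); the function names are made up and the code is not taken from Mussa.

```python
def percent_match(threshold, window_size):
    """Percent of a window that must match at the given threshold."""
    return 100.0 * threshold / window_size


def matching_bases(window_a, window_b):
    """Count positions at which two equal-length windows agree."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must have the same length")
    return sum(a == b for a, b in zip(window_a, window_b))


def is_conserved(window_a, window_b, threshold):
    """True when at least `threshold` bases match between the windows."""
    return matching_bases(window_a, window_b) >= threshold


# The demo analysis uses a threshold of 20 and a window size of 30.
print(round(percent_match(20, 30), 1))                 # 66.7

# Two made-up 10-base windows that agree at 8 positions.
print(matching_bases("ACGTACGTAC", "ACGTTCGTAA"))      # 8
print(is_conserved("ACGTACGTAC", "ACGTTCGTAA", 8))     # True
print(is_conserved("ACGTACGTAC", "ACGTTCGTAA", 9))     # False
```

Raising the dynamic threshold toward the window size therefore hides weaker matches and leaves only the most strongly conserved tracks.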
 720
 721
 722 Sequence Information Bar
 723 ~~~~~~~~~~~~~~~~~~~~~~~~
 724
 725 .. image:: images/seq_info_bar.png
 726    :alt: Sequence Information Bar
 727    :align: center
 728
 729 The sequence information bars can be found on the left and right sides
 730 of the Mussagl window. Next to each sequence you will find the following
 731 information:
 732
 733  1. Species (If it has been defined)
 734  2. Total Size of Sequence
 735  3. Current base pair position
 736
 737 Note that you can **update the species** text box. Make sure to **save your
 738 experiment** after making this change by selecting **File > Save
 739 Analysis** from the menu.
 740
 741 Sequence Scroll Bar
 742 ~~~~~~~~~~~~~~~~~~~
 743
 744 .. image:: images/scroll_bar.png
 745    :alt: Sequence Scroll Bar
 746    :align: center
 747
 748 The scroll bar allows you to scroll through the sequence which is
 749 useful when you have zoomed in using the `zoom factor`_.
 750
 751
 752 Saving
 753 ------
 754
 755 Save on Close
 756 ~~~~~~~~~~~~~
 757
 758 Whenever you create a new analysis or make a change such as
 759 adding/editing a motif or changing a species name, an asterisk (*)
 760 will appear in the title of the window showing that there are changes
 761 that have not been saved. If you close a Mussa window without saving
 762 changes, Mussa will ask you if you would like to save the changes that
 763 have been made.
 764
 765 Save Analysis
 766 ~~~~~~~~~~~~~
 767
 768 After making changes, such as updating species names or adding/editing
 769 motifs, you can save these changes by selecting the **File > Save
 770 analysis** menu option or pressing **CTRL + S** (PC) or
 771 **Apple/Command Key + S** (on Mac).
 772
 773 .. image:: images/save_analysis.png
 774    :alt: Save analysis
 775    :align: center
 776
 777 Save Analysis As
 778 ~~~~~~~~~~~~~~~~
 779
 780 To save a copy of your analysis to a new location, select the **File >
 781 Save analysis as** menu option and choose a new location and name for
 782 your analysis.
 783
 784 .. image:: images/save_analysis_as.png
 785    :alt: Save analysis
 786    :align: center
 787
 788 Save Motif List
 789 ~~~~~~~~~~~~~~~
 790
 791 See `Save Motifs to File`_ in the `Motifs`_ section.
 792
 793
 794 Viewing Multiple Analyses
 795 -------------------------
 796
 797 Sometimes it is useful to view more than one analysis at a time. To
 798 accomplish this, Mussa allows you to open a new Mussa window by
 799 selecting the **File > New Mussa Window** menu option.
 800
 801 .. image:: images/new_mussa_window_menu.png
 802    :alt: New Mussa Window Menu Option
 803    :align: center
 804
 805 A new Mussa window will pop up.
 806
 807 .. figure:: images/new_mussa_window.png
 808    :alt: New Mussa Window
 809    :align: center
 810
 811    A new Mussa window on the right, in which a second analysis has
 812    been loaded.
 813
 814 Now you can create a new analysis or load an existing one in this new
 815 window, as described in the `Create/Load Analysis`_ section.
 816
 817 You can view as many analyses as you can fit on your screen or until
 818 you run out of available RAM. If you notice a rapid decrease in
 819 performance and hear lots of noise coming from your hard drive, you
 820 probably ran out of RAM and are now using virtual memory (i.e. much
 821 much slower). If this happens, you may need to avoid opening as many
 822 analyses at one time.
 823
 824
 825 Annotations / Motifs
 826 --------------------
 827
 828 Annotations
 829 ~~~~~~~~~~~
 830
 831 Currently annotations can be added to a sequence using the mussa
 832 `annotation file format`_ and can be loaded by selecting the
 833 annotation file when defining a new analysis (see `Create a new
 834 analysis`_ section) or by defining a .mupa file pointing to your
 835 annotation file (see `Load a mussa parameter file`_ section).
 836
 837 Motifs
 838 ~~~~~~
 839
 840 Load Motifs from File
 841 *********************
 842
 843 It is possible to load motifs from a file which was saved from a
 844 previous run or by defining your own motif file. See the `Motif File
 845 Format`_ section for details.
 846
 847 NOTE: Valid motif list file extensions are:
 848
 849   * .mtl
 850   * .txt
 851
 852 To load a motif file, select the **Load Motif List** item from the
 853 **File** menu and select a motif list file.
 854
 855 .. image:: images/load_motif.png
 856    :alt: Load Motif List
 857    :align: center
 858
 859
 860 Save Motifs to File
 861 *******************
 862
 863 Motifs from the `Motif Dialog`_ can be saved to file for use with
 864 other analyses. If you just want your motifs to be saved with your
 865 analysis, see the `save analysis`_ section for details.
 866
 867 To save a motif list, select the **File > Save Motifs** menu option. By
 868 default, Mussa will append .mtl if you do not provide a file extension
 869 (valid file extensions: .mtl & .txt).
 870
 871 .. image:: images/save_motifs.png
 872    :alt: Save Motifs
 873    :align: center
 874
 875
 876 Motif Dialog
 877 ************
 878
 879 Mussa has the ability to find lab motifs using the `IUPAC Nucleotide
 880 Code`_ for defining a motif. To define a motif, select the **Edit > Edit
 881 Motifs** menu item as shown below.
 882
 883 .. image:: images/view_edit_motifs.png
 884    :alt: "View > Edit Motifs" Menu
 885    :align: center
 886
 887 You will see a dialog box appear with an "apply" button in the bottom
 888 right and one row for defining a motif and the color that will be
 889 displayed on the sequence. When you start adding your first motif, an
 890 additional row will be added. The check box in the first column
 891 defines whether the motif is displayed or not. The second column is
 892 the motif display color. The third column is for the name of your
 893 motif and, finally, the fourth column is the motif itself.
 894
 895 .. image:: images/motif_dialog_start.png
 896    :alt: Motif Dialog
 897    :align: center
 898
 899 Now let's make a motif **'AT[C or G]CT'**. Using the `IUPAC Nucleotide
 900 Code`_, type **'ATSCT'** into the motif field and **'My Motif'** into
 901 the name field as shown below.
 902
 903 Notice how a second row appeared when you started to add the first
 904 motif. Every time you add a new motif, a new row will appear allowing
 905 you to add as many motifs as you need.
 906
 907 .. image:: images/motif_dialog_enter_motif.png
 908    :alt: Enter Motif
 909    :align: center
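
To illustrate how an IUPAC motif such as 'ATSCT' expands into the concrete sequences it can match, here is a minimal Python sketch that turns a motif into a regular expression. It uses the standard IUPAC nucleotide codes but is not Mussa's own search code: the function names are invented for the example, and it searches the forward strand only.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif into an equivalent regular expression."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]  # raises KeyError for an unknown code
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def find_motif(motif, sequence):
    """Return 0-based start positions of all matches, including overlaps."""
    pattern = re.compile(f"(?={motif_to_regex(motif)})")
    return [match.start() for match in pattern.finditer(sequence.upper())]


print(motif_to_regex("ATSCT"))                 # AT[CG]CT
print(find_motif("ATSCT", "GGATCCTAATGCTT"))   # [2, 8]
```

In the made-up sequence above, the motif matches 'ATCCT' at position 2 and 'ATGCT' at position 8.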
 910
 911 Now choose a color for your motif by clicking on the colored area to
 912 the left of the name field. Remember to choose a color that will show
 913 up well with a black bar as the background. A good tool for picking a
 914 color is the `Colour Contrast Analyser
 915 <http://juicystudio.com/services/colourcontrast.php>`_ by
 916 `juicystudio.com <http://juicystudio.com/>`_.
 917
 918 .. image:: images/color_chooser.png
 919    :alt: Color Chooser
 920    :align: center
 921
 922 Once you have selected the color for your motif, click on the
 923 **'apply'** button. If Mussa finds matches to your motif, they
 924 will now show up in the main Mussa window.
 925
 926 Before Motif:
 927
 928 .. image:: images/motif_dialog_bar_before.png
 929    :alt: Sequence bar before motif
 930    :align: center
 931
 932 After Motif:
 933
 934 .. image:: images/motif_dialog_bar_after.png
 935    :alt: Sequence bar after motif
 936    :align: center
 937
 938 To save your motifs with your analysis, see the `save analysis`_
 939 section. To save your motifs to a file, see the `save motifs to file`_
 940 section.
 941
 942 Deleting a Motif
 943 ^^^^^^^^^^^^^^^^
 944
 945 To delete a motif, remove all text from the name and sequence columns
 946 and close the motif editor.
 947
 948 View Mussa Alignments
 949 ---------------------
 950
Mussagl allows you to zoom in on Mussa alignments by selecting the set
of alignment(s) of interest. To do this, move the mouse near the
alignment you are interested in viewing, then **PRESS** and
**HOLD** the **LEFT mouse button** and **drag the mouse** to the other
side of the conservation track so that you see a bounding box
overlapping the alignment(s) of interest, and then **let go** of the
*left mouse button*.
 958
 959 In the example below, I started by left-clicking on the area marked by
 960 a red dot (upper left corner of bounding box) and dragging the mouse to
 961 the area marked by a blue dot (lower right corner of the bounding box)
 962 and letting go of the left mouse button.
 963
 964 .. image:: images/select_sequence.png
 965    :alt: Select Sequence
 966    :align: center
 967
 968 All of the lines which were not selected should be washed out as shown
 969 below:
 970
 971 .. image:: images/washed_out.png
 972    :alt: Tracks washed out
 973    :align: center
 974
With a selection made, go to the **View** menu and select **View mussa alignment**.
 976
 977 .. image:: images/view_mussa_alignment.png
 978    :alt: View mussa alignment
 979    :align: center
 980
 981 You should see the alignment at the base-pair level as shown below.
 982
 983 .. image:: images/mussa_alignment.png
 984    :alt: Mussa alignment
 985    :align: center
 986
 987
 988 Sub-analysis
 989 ------------
 990
To run a sub-analysis, **highlight** a section of sequence, *right
click* on it, and select **Add to subanalysis**. Do the same for the
sequences shown in orange in the screenshot below. Note that you **are
NOT limited** to selecting only one subsequence from the same
sequence.
 996
 997 .. image:: images/subanalysis_select_seqs.png
 998    :alt: Subanalysis sequence selection
 999    :align: center
1000
1001 Once you have added your sequences for subanalysis, choose a `window size`_ and `threshold`_ and click **Ok**.
1002
1003 .. image:: images/subanalysis_dialog.png
1004    :alt: Subanalysis Dialog
1005    :align: center
1006
1007 A new Mussa window will appear with the subanalysis of your sequences
1008 once it's done running. This may take a while if you selected large
1009 chunks of sequence with a loose threshold.
1010
1011 .. image:: images/subanalysis_done.png
   :alt: Subanalysis complete
1013    :align: center
1014
1015
1016 Copying sequence to clipboard
1017 -----------------------------
1018
1019 To copy a sequence to the clipboard, highlight a section of sequence,
1020 as shown in the screen shot below, and do one of the following:
1021
1022  * Select **Copy as FASTA** from the **Edit** menu.
1023  * **Right-Click (Left-click + Apple/Command Key on Mac)** on the highlighted sequence and select **Copy as FASTA**.
1024  * Press **Ctrl + C (on PC)** or **Apple/Command Key + C (on Mac)** on the keyboard.
1025
1026 .. image:: images/copy_sequence.png
1027    :alt: Copy sequence
1028    :align: center
1029
1030
1031 Saving to an Image
1032 ---------------------------------
1033
1034 To save your current mussa view to an image, select **File > Save to
1035 image...** as shown below.
1036
1037 .. image:: images/save_to_image_menu.png
1038    :alt: File > Save to image...
1039    :align: center
1040
You can define the width and the height of the image to save. By
default it will use the same size as your current view. Since the
Mussa view is implemented using vectors, if you choose a size larger
than your current view, Mussa will redraw at the higher resolution
when saving. In other words, you get higher quality images when saving
at a higher resolution.
1047
1048 If you check the "Lock aspect ratio" check box, which I have circled
1049 in red, then when you change one value, say width, the other, height,
1050 will update automatically to keep the same aspect ratio.
1051
1052 .. image:: images/save_to_image_dialog.png
1053    :alt: Save to image dialog
1054    :align: center
1055
1056 Click save and choose a location and filename for your file.
1057
1058 The valid image formats are:
1059
1060   * .png (default if no extension specified.)
1061   * .jpg
1062
1063
1064 Detailed Information
1065 --------------------
1066
1067 Threshold
1068 ~~~~~~~~~
1069
The threshold of an analysis is the minimum number of base pair matches
that must be met within a window in order to be kept as a match. Note
that you can vary the threshold from within Mussagl. For example, if
you choose a `window size`_ of **30** and a **threshold** of **20**,
the mussa nway transitive algorithm will store all matches that are 20
out of 30 bp or better and pass them on to Mussagl. Mussagl will then
allow you to dynamically choose a threshold from 20 to 30 base pairs. A
threshold of 30 bps would only show 30 out of 30 bp matches. A
threshold of 20 bps would show all matches of 20 out of 30 bps or
better. If you would like to see results for matches lower than 20 out
of 30, you will need to rerun the analysis with a lower threshold.
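
The relationship between the stored threshold and the threshold chosen
later in Mussagl can be illustrated with a small Python sketch. This is
not Mussa's implementation: the function names are hypothetical, only
two sequences are compared, only the forward strand is considered, and
the nway transitive step is left out. It only shows that matches are
kept once at the analysis threshold and can then be filtered upward
without rerunning the comparison.

.. code-block:: python

   def window_matches(seq_a, seq_b, window, threshold):
       """Return (pos_a, pos_b, score) for every window pair that
       has at least `threshold` identical bases (forward strand only)."""
       kept = []
       for i in range(len(seq_a) - window + 1):
           for j in range(len(seq_b) - window + 1):
               score = sum(
                   1
                   for x, y in zip(seq_a[i:i + window], seq_b[j:j + window])
                   if x == y
               )
               if score >= threshold:
                   kept.append((i, j, score))
       return kept


   def raise_threshold(matches, threshold):
       """Filter already stored matches with a stricter threshold."""
       return [m for m in matches if m[2] >= threshold]


   # Toy example: window of 4, analysis threshold of 3.
   stored = window_matches("ACGTAC", "ACGTTC", window=4, threshold=3)
   strict = raise_threshold(stored, 4)

Lowering the threshold below the value used for ``stored`` is not
possible in this sketch without calling ``window_matches`` again, which
mirrors the need to rerun the analysis.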
1081
1082 Window Size
1083 ~~~~~~~~~~~
1084
1085 The typical sizes people tend to choose are between 20 and 30. You
1086 will likely need to experiment with this setting depending on your
1087 needs and input sequence.
1088
1089
1090 Sequences
1091 ~~~~~~~~~
1092
Mussa reads in sequences which are formatted in the FASTA_
format. Mussa may take a long time to run (>10 minutes) if the total
bp length nears 280 Kb. Once Mussa has run once, you can reload
previously run analyses.
1097
1098 FIXME: We have learned more about how much sequence and how many to
1099 put in Mussagl, this information should be documented here.
1100
1101
1102 Mussa File Formats
1103 ------------------
1104
1105 .. _param:
1106
1107 Parameter File Format
1108 ~~~~~~~~~~~~~~~~~~~~~
1109
1110 **File Format (.mupa):**
1111
1112 ::
1113
1114   # name of analysis directory and stem for associated files
1115   ANA_NAME <analysis_name>
1116
1117   # if APPEND vars true, a _wXX and/or _tYY added to analysis name
1118   # where XX = WINDOW and YY = THRESHOLD
  # Highly recommended with use of command line override of WINDOW or THRESHOLD
1120   APPEND_WIN <true/false>
1121   APPEND_THRES <true/false>
1122
1123   # how many sequences are being analyzed
1124   SEQUENCE_NUM <num>
1125
1126   # first sequence info
1127   SEQUENCE <FASTA_file_path>
1128   ANNOTATION <annotation_file_path>
1129   SEQ_START <sequence_start>
1130
1131   # the second sequence info
1132   SEQUENCE <FASTA_file_path>
1133   # ANNOTATION <annotation_file_path>
1134   SEQ_START <sequence_start>
1135   # SEQ_END <sequence_end>
1136
1137   # third sequence info
1138   SEQUENCE <FASTA_file_path>
1139   # ANNOTATION <annotation_file_path>
1140
  # analysis parameters: command line args -w -t will override these
1142   WINDOW <num>
1143   THRESHOLD <num>
1144
1145 .. csv-table:: Parameter File Options:
1146    :header: "Option Name", "Value", "Default", "Required", "Description"
1147    :widths: 30 30 30 30 60
1148
   "ANA_NAME", "string", "N/A", "true", "Name of analysis (also the
   name of the directory where the analysis will be saved)."
   "APPEND_WIN", "true/false", "?", "?", "Appends _w## to ANA_NAME"
   "APPEND_THRES", "true/false", "?", "?", "Appends _t## to ANA_NAME"
1153    "SEQUENCE_NUM", "integer", "N/A", "true", "The number of sequences
1154    to analyze"
1155    "SEQUENCE", "/FASTA/filepath.fa", "N/A", "true", "Must define one
1156    sequence per SEQUENCE_NUM."
1157    "ANNOTATION", "/annotation/filepath.txt", "N/A", "false", "Optional
1158    annotation file. See `annotation file format`_ section for more
1159    information."
1160    "SEQ_START", "integer", "1", "false", "Optional index into FASTA file"
1161    "SEQ_END", "integer", "1", "false", "Optional index into FASTA file"
1162    "WINDOW", "integer", "N/A", "true", "`Window Size`_"
1163    "THRESHOLD", "integer", "N/A", "true", "`Threshold`_"
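
As an illustration of the layout above, the following Python sketch
renders a parameter file from a few values. The function name, the
dictionary keys and the file names are hypothetical, and the option
order simply follows the template shown earlier; whether Mussa requires
that exact order is an assumption.

.. code-block:: python

   def render_mupa(name, sequences, window, threshold,
                   append_win=False, append_thres=False):
       """Build the text of a .mupa file following the template layout."""
       def flag(value):
           return "true" if value else "false"

       lines = [
           f"ANA_NAME {name}",
           f"APPEND_WIN {flag(append_win)}",
           f"APPEND_THRES {flag(append_thres)}",
           f"SEQUENCE_NUM {len(sequences)}",
       ]
       for seq in sequences:
           # SEQUENCE is required; the other per-sequence options are optional.
           lines.append(f"SEQUENCE {seq['fasta']}")
           if "annotation" in seq:
               lines.append(f"ANNOTATION {seq['annotation']}")
           if "start" in seq:
               lines.append(f"SEQ_START {seq['start']}")
           if "end" in seq:
               lines.append(f"SEQ_END {seq['end']}")
       lines += [f"WINDOW {window}", f"THRESHOLD {threshold}"]
       return "\n".join(lines) + "\n"


   # Hypothetical file names, for illustration only.
   text = render_mupa(
       "example",
       [
           {"fasta": "mouse.fa", "annotation": "mouse.annot", "start": 1},
           {"fasta": "human.fa"},
       ],
       window=30,
       threshold=20,
   )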
1164
1165 .. _annot:
1166
1167 Annotation File Format
1168 ~~~~~~~~~~~~~~~~~~~~~~
1169
The first line in the file is the sequence name. Each line thereafter
is a **space** separated annotation.
1172
1173 New as of build 198:
1174
 * The annotation format now supports FASTA sequences embedded in the
   annotation file as shown in the format example below. Mussagl will
   take this sequence and look for an exact match of this sequence in
   your sequences. If a match is found, it will label it with the name
   from the FASTA header.
1180
1181 Format:
1182
1183 ::
1184
1185   <species/sequence_name>
1186   <start> <stop> <annotation_name> <annotation_type>
1187   <start> <stop> <annotation_name> <annotation_type>
1188   <start> <stop> <annotation_name> <annotation_type>
1189   <start> <stop> <annotation_name> <annotation_type>
1190   >FASTA Header
1191   ACTGACTGACGTACGTAGCTAGCTAGCTAGCACG
1192   ACGTACGTACGTACGTAGCTGTCATACGCTAGCA
1193   TGCGTAGAGGATCTCGGATGCTAGCGCTATCGAT
1194   ACGTACGGCAGTACGCGGTCAGA
1195   <start> <stop> <annotation_name> <annotation_type>
1196   ...
1197
1198 Example:
1199
1200 ::
1201
1202   Mouse
1203   251 500 Glorp Glorptype
1204   751 1000 Glorp Glorptype
1205   1251 1500 Glorp Glorptype
1206   >My favorite DNA sequence
1207   GATTACA
1208   1751 2000 Glorp Glorptype
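
To make the structure of the format concrete, here is a minimal Python
sketch of a reader for it. It is not Mussa's parser. It assumes that
an embedded FASTA record ends at the next line whose first field is an
integer, and that every annotation line has all four fields; the
function name is hypothetical.

.. code-block:: python

   def parse_annotations(text):
       """Split annotation text into (name, annotations, embedded FASTA)."""
       lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
       name = lines[0]
       annotations = []   # (start, stop, annotation_name, annotation_type)
       embedded = []      # {"header": ..., "sequence": ...}
       current = None     # FASTA record currently being read, if any
       for line in lines[1:]:
           if line.startswith(">"):
               current = {"header": line[1:].strip(), "sequence": ""}
               embedded.append(current)
               continue
           fields = line.split()
           if fields[0].isdigit():
               # Assumption: an integer first field means a new annotation.
               if len(fields) < 4:
                   raise ValueError(f"incomplete annotation line: {line!r}")
               current = None
               annotations.append(
                   (int(fields[0]), int(fields[1]), fields[2], fields[3])
               )
           elif current is not None:
               current["sequence"] += line
           else:
               raise ValueError(f"unexpected line: {line!r}")
       return name, annotations, embedded


   example = """Mouse
   251 500 Glorp Glorptype
   751 1000 Glorp Glorptype
   1251 1500 Glorp Glorptype
   >My favorite DNA sequence
   GATTACA
   1751 2000 Glorp Glorptype
   """
   name, annotations, embedded = parse_annotations(example)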
1209
1210
1211 .. _motif_file_format:
1212
1213 Motif File Format
1214 ~~~~~~~~~~~~~~~~~
1215
Format:

::
1217
1218   <motif> <red> <green> <blue>
1219
Example:

::
1221
1222   GGCC 0.0 1 1
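
A line in this format can be read with a few lines of Python. The
sketch below assumes, based only on the example above, that the three
color values are numbers between 0 and 1; the function name is
hypothetical and this is not the code Mussagl uses.

.. code-block:: python

   def parse_motif_line(line):
       """Return (motif, (red, green, blue)) from one motif file line."""
       motif, red, green, blue = line.split()
       return motif, (float(red), float(green), float(blue))


   motif, color = parse_motif_line("GGCC 0.0 1 1")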
1223
1224
1225
1226 IUPAC Nucleotide Code
1227 ~~~~~~~~~~~~~~~~~~~~~~
1228
1229 For your convenience, below is a table of the IUPAC Nucleotide Code.
1230
1231 The following table is table 1 from "Nomenclature for Incompletely
1232 Specified Bases in Nucleic Acid Sequences" which can be found at
1233 http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html.
1234
1235 ======  =================  ===================================
1236 Symbol  Meaning            Origin of designation
1237 ======  =================  ===================================
1238 G       G                  Guanine
1239 A       A                  Adenine
1240 T       T                  Thymine
1241 C       C                  Cytosine
1242 R       G or A             puRine
1243 Y       T or C             pYrimidine
1244 M       A or C             aMino
1245 K       G or T             Keto
1246 S       G or C             Strong interaction (3 H bonds)
1247 W       A or T             Weak interaction (2 H bonds)
1248 H       A or C or T        not-G, H follows G in the alphabet
1249 B       G or T or C        not-A, B follows A
1250 V       G or C or A        not-T (not-U), V follows U
1251 D       G or A or T        not-C, D follows C
1252 N       G or A or T or C   aNy
1253 ======  =================  ===================================
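
The table can also be expressed as a lookup that expands a motif
written with IUPAC symbols into a regular expression. The Python sketch
below is only an illustration of how the codes expand: the names are
hypothetical, it searches the forward strand only, and it is not a
description of how Mussa matches motifs internally.

.. code-block:: python

   import re

   # Symbol -> bases it stands for, copied from the table above.
   IUPAC = {
       "G": "G", "A": "A", "T": "T", "C": "C",
       "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
       "S": "GC", "W": "AT",
       "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
       "N": "GATC",
   }


   def motif_to_regex(motif):
       """Expand each IUPAC symbol into a regex character class."""
       parts = []
       for symbol in motif.upper():
           bases = IUPAC[symbol]
           parts.append(bases if len(bases) == 1 else "[" + bases + "]")
       return "".join(parts)


   def find_motif(motif, sequence):
       """Return 0-based start positions of all (overlapping) matches."""
       pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
       return [m.start() for m in pattern.finditer(sequence.upper())]


   pattern_text = motif_to_regex("GGYC")
   positions = find_motif("GGYC", "AGGCCGGTC")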
1254
1255
1256
1257 Understanding Mussa
1258 ===================
1259
1260
1261 Performance
1262 -----------
1263
1264 Algorithm Behavior
1265 ~~~~~~~~~~~~~~~~~~
1266
1267 FIXME: Include seqcomp algorithm info.
1268
1269 FIXME: Include transitivity info.
1270
1271 Repeats
1272 ~~~~~~~
1273
The algorithm Mussa uses to find conserved sequences is sensitive to
repeated DNA segments, which occur frequently in most
genomes. The problem with repeats is that one repeat from one
sequence can show up many times in another sequence. Every connection
Mussa makes takes up memory and CPU time to process.
1279
1280 The formula for the number of connections, C, that will be made for R
1281 instances of a single repeat (meaning R copies of one repeat in each
1282 sequence) and S sequences is:
1283
1284 C = (R^2)[S(S-1)/2]
1285
1286 Table of example situations:
1287
1288 =====  =====  =====
1289   C      R      S
1290 =====  =====  =====
1291    16     4     2
1292    48     4     3
1293    96     4     4
1294   160     4     5
1295   240     4     6
1296   336     4     7
1297   448     4     8
1298    24     2     4
1299    54     3     4
1300    96     4     4
1301   150     5     4
1302   216     6     4
1303   294     7     4
1304   384     8     4
1305  2500    50     2
1306  7500    50     3
1307 15000    50     4
1308 10000   100     2
1309 30000   100     3
1310 60000   100     4
1311 =====  =====  =====
1312
After the connections, C, are found, they are passed on to the
transitivity filter, which is a C^2 algorithm (FIXME: confirm
algorithm is C^2). This means that 50 repeats in 2 sequences, which
gives a C of 2500, ends up with a C^2 of 6,250,000.
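
The numbers above can be reproduced with a short Python sketch of the
formula. The function name is hypothetical, and the squared value is
only the rough cost estimate described in the text, which itself still
needs to be confirmed.

.. code-block:: python

   def connections(repeats, sequences):
       """C = (R^2)[S(S-1)/2] for R copies of a repeat in each of S sequences."""
       pairs = sequences * (sequences - 1) // 2   # number of sequence pairs
       return repeats ** 2 * pairs


   c = connections(50, 2)   # 2500 connections
   cost = c ** 2            # rough C^2 estimate for the transitivity filter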
1317
1318 **Conclusion: repeats cause the processing time of Mussa to skyrocket.**
1319
If you have many repeats in your sequences, you can do any of the
following: use shorter sequences; repeat mask one or more of your
sequences; or increase the threshold.
1323
1324 Details
1325 -------
1326
1327 Case: Conservation track suddenly stops
1328 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1329
1330
1331 .. Define links below
1332    ------------------
1333
1334 .. _GPL: http://www.opensource.org/licenses/gpl-license.php
1335 .. _wiki: http://mussa.caltech.edu
1336 .. _build: http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild
1337 .. _FASTA: http://en.wikipedia.org/wiki/fasta_format
1338 .. _wpDnaMotif: http://en.wikipedia.org/wiki/DNA_motif