doc/manual/mussagl_manual.rst

   1 ==============
   2 Mussagl Manual
   3 ==============
   4 ---------------
   5 Brandon W. King
   6 ---------------
   7
   8 Last updated: Sept 20th, 2006
   9
  10 Updated to Mussagl build: 287 (In process to 419)
  11
  12
  13 .. Things to add
  14         * New features / change log
  15         * Comment out anything that isn't implemented yet.
  16         * (DONE) List of features that will be implemented in the future.
  17         * Look into the homology mapping of UCSC.
  18         * Add toggle to genomes.
  19         * Document why one fast record per region.
  20         * How to deal with the hazards of small utrs vis motif finder. (Add warning)
  21         * Add warning about saving FASTA file.
  22         * Add a general principles section near the top
  23                 * Using comparison algorithm which will pickup all repeats
  24                 * Add info about repeatmasking
  25                 * Checking upstream and downstream genes to make sure you are in the right regions.
  26         * Later on: look into Ensembl
  27         * Look into method of homology instead of blating.
  28         * Mention advantages of using mupa.
  29         * Mention the difference between using arrows and scroll bar
  30         * Document the color for motifs
  31         * Update for Mac user left-click
  32
  33         * Wormbase/Flybase/mirBASE tutorials
  34
  35
  36
  37 .. contents::
  38
  39 Status
  40 ======
  41
  42 Major New Features
  43 ------------------
  44
  45  * Build 381
  46    * Analysis "Save As" feature
  47
  48 Change Log
  49 ----------
  50
  51 .. INSERT CHANGE LOG HERE
  52 .. END INSERT CHANGE LOG
  53
  54 Features to be Implemented
  55 --------------------------
  56
  57  * Motif editor supporting more than 10 motifs
  58    (Status: http://woldlab.caltech.edu/cgi-bin/mussa/ticket/122)
  59  * Save motifs from Mussagl
  60    (Status: http://woldlab.caltech.edu/cgi-bin/mussa/ticket/133)
  61
  62 For an up-to-date list of features to be implemented visit:
  63 http://woldlab.caltech.edu/cgi-bin/mussa/roadmap
  64
  65 Introduction
  66 ============
  67
  68
  69 What is Mussagl?
  70 ----------------
  71
  72 Mussa is an N-way version of FamilyRelations, the 2-way comparative
  73 sequence analysis software that is part of the Cartwheel
  74 project. Given DNA sequence from N species, Mussa uses all possible
  75 pairwise comparisons to derive an N-way comparison. For example, given
  76 sequences 1,2,3, and 4, Mussa makes 6 2-way comparisons: 1vs2, 1vs3,
  77 1vs4, 2vs3, 2vs4, and 3vs4. It then compares all the links between
  78 these comparisons, saving those that satisfy a transitivity
  79 requirement. The saved paths are then displayed in an interactive
  80 viewer.
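
The pairwise step described above can be sketched in a few lines of
Python. This only illustrates how the 2-way comparisons are enumerated;
the sequence labels are placeholders and the code is not taken from
Mussa itself.

```python
from itertools import combinations

# Placeholder labels for four input sequences.
sequences = ["1", "2", "3", "4"]

# Every unordered pair of sequences gets one 2-way comparison.
pairs = [f"{a}vs{b}" for a, b in combinations(sequences, 2)]

print(pairs)       # ['1vs2', '1vs3', '1vs4', '2vs3', '2vs4', '3vs4']
print(len(pairs))  # 6, i.e. N * (N - 1) / 2 for N = 4
```

The number of pairwise comparisons grows quadratically with the number
of sequences, which is one reason run time increases quickly as more
species are added.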
  81
  82 Short History of Mussa
  83 ----------------------
  84
  85 Mussa Python/PMW Prototype
  86 ~~~~~~~~~~~~~~~~~~~~~~~~~~
  87
  88 First Python/PMW-based prototype.
  89
  90 Mussa C++/FLTK
  91 ~~~~~~~~~~~~~~
  92
  93 A rewrite for speed purposes using C++ and FLTK GUI toolkit.
  94
  95 Mussagl C++/Qt/OpenGL
  96 ~~~~~~~~~~~~~~~~~~~~~
  97
  98 Refactored version using the more elegant Qt GUI framework and
  99 OpenGL for hardware acceleration for those who have better graphics
 100 cards.
 101
 102 Getting Mussagl
 103 ===============
 104
 105 License
 106 -------
 107
 108 Mussagl has been released open source under the `GPL v2
 109 license`__.
 110
 111 __ GPL_
 112
 113 Platforms
 114 ---------
 115
 116 You have the option of building from source or downloading prebuilt
 117 binaries. Most people will want the prebuilt versions.
 118
 119 Supported Platforms:
 120
 121  * Mac OS X (binary or source)
 122  * Windows XP (binary or source)
 123  * Linux (source)
 124
 125 Download
 126 --------
 127
 128 Mussagl in binary form for OS X and Windows and/or source can be
 129 downloaded from http://mussa.caltech.edu/.
 130
 131 Install
 132 -------
 133
 134 Mac OS X
 135 ~~~~~~~~
 136 Once you have downloaded the .dmg file, double click on it and follow
 137 the install instructions.
 138
 139 FIXME: Mention how to launch the program.
 140
 141
 142 Windows XP
 143 ~~~~~~~~~~
 144 Once you have downloaded the Mussagl installer, double click on the
 145 installer and follow the install instructions.
 146
 147 To start Mussagl, launch the program from Start > Programs > Mussagl >
 148 Mussagl.
 149
 150
 151 Linux
 152 ~~~~~
 153 Currently we do not have a binary installer for Linux. You will have
 154 to build from source. See the 'build from source' section below.
 155
 156
 157 Build from Source
 158 ~~~~~~~~~~~~~~~~~
 159
 160 Instructions for building from source can be found `build page
 161 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild>`_ on the
 162 `Mussa wiki`__.
 163
 164 __ wiki_
 165
 166
 167 Obtaining Input Data
 168 ====================
 169
 170 If you already have your data, you can skip ahead to the `Using
 171 Mussagl`_ section.
 172
 173 Let's say you have a gene of interest called 'SMN1' and you want to
 174 know how the sequence surrounding the gene is conserved across
 175 multiple species. In this section we will retrieve the DNA sequence
 176 for SMN1 and prepare it for use in Mussa.
 177
 178 For more information about SMN1 visit `NCBI's OMIM
 179 <http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=609682>`_.
 180
 181 UCSC Genome Browser Method
 182 --------------------------
 183
 184 There are many methods of retrieving DNA sequence, but for this
 185 example we will retrieve SMN1 through the UCSC genome browser located
 186 at http://genome.ucsc.edu/.
 187
 188 .. image:: images/ucsc_genome_browser_home.png
 189    :alt: UCSC Genome Browser
 190    :align: center
 191
 192 Step 1 - Find SMN1
 193 ~~~~~~~~~~~~~~~~~~
 194
 195 The first step in finding SMN1 is to use the **Gene Sorter** menu
 196 option which I have highlighted in orange below:
 197
 198 .. image:: images/ucsc_menu_bar_gene_sorter.png
 199    :alt: Gene Sorter Menu Option
 200    :align: center
 201
 202 Gene Sorter page:
 203
 204 .. image:: images/ucsc_gene_sorter.png
 205    :alt: Gene Sorter
 206    :align: center
 207
 208 We will start by looking for SMN1 in the **Human Genome** and **sorting by name similarity**.
 209
 210 .. image:: images/ucsc_gs_sort_name_sim.png
 211    :alt: Gene Sorter - Name Similarity
 212    :align: center
 213
 214 After you have selected **Human Genome** and **sorting by name similarity**, type *SMN1* into the search box.
 215
 216 .. image:: images/ucsc_gs_smn1.png
 217    :alt: Gene
 218    :align: center
 219
 220 Press **Go!** and you should see the following page:
 221
 222 .. image:: images/ucsc_gs_found.png
 223    :alt: Found SMN1
 224    :align: center
 225
 226 Click on **SMN1** and you will be taken to the gene expression atlas
 227 page.
 228
 229 .. image:: images/ucsc_gs_genome_position.png
 230    :alt: Gene expression atlas
 231    :align: center
 232
 233 Click on **chr5 70,270,558** found in the **SMN1 row**, **Genome
 234 position column**.
 235
 236 Now we have found the location of SMN1 in the human genome!
 237
 238 .. image:: images/ucsc_gb_smn1_human.png
 239    :alt: Genome Browser - SMN1 (human)
 240    :align: center
 241
 242
 243 Step 2 - Download CDS/UTR sequence for annotations
 244 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 245
 246 Since we have found **SMN1**, this would be a convenient time to extract
 247 the DNA sequence for the CDS and UTRs of the gene to use it as an
 248 annotation_ in Mussa.
 249
 250 **Click on SMN1** shown **between** the **two orange arrows** shown
 251 below.
 252
 253 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
 254    :alt: Genome Browser - SMN1 (human) - Orange Arrows
 255    :align: center
 256
 257 You should find yourself at the SMN1 description page.
 258
 259 .. image:: images/ucsc_gb_smn1_description_page.png
 260    :alt: Genome Browser - SMN1 (human) - Description page
 261    :align: center
 262
 263 **Scroll down** until you get to the **Sequence section** and click on
 264 **Genomic (chr5:70,256,524-70,284,592)**.
 265
 266 .. image:: images/ucsc_gb_smn1_human_sequence.png
 267    :alt: Genome Browser - SMN1 (human) - Sequence
 268    :align: center
 269
 270 You should now be at the **Genomic sequence near gene** page:
 271
 272 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence.png
 273    :alt: Genome Browser - SMN1 (human) - Get genomic sequence
 274    :align: center
 275
 276 Make the following changes (highlighted in orange in the screenshot
 277 below):
 278
 279  1. UNcheck **introns**.
 280     (We only want to annotate CDS and UTRs.)
 281  2. Select **one FASTA record** per **region**.
 282     (Mussa needs each CDS and UTR represented by one FASTA record per CDS/UTR).
 283  3. Select **CDS in upper case, UTR in lower case.**
 284
 285 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_diff.png
 286    :alt: Genome Browser - SMN1 (human) - Get genomic sequence setup
 287    :align: center
 288
 289 Now click the **submit** button. You will then see a FASTA file with
 290 many FASTA records representing the CDS and UTRs.
 291
 292 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_submit.png
 293    :alt: Genome Browser - SMN1 (human) - CDS/UTR sequence
 294    :align: center
 295
 296 Now you need to save the FASTA records to a **text file**. If you are
 297 using **Firefox** or **Internet Explorer 6+** click on the **File >
 298 Save As** menu option.
 299
 300 **IMPORTANT:** Make sure you select **Text Files** and **NOT**, I
 301 repeat **NOT Webpage Complete** (see screenshot below.)
 302
 303 Type in **smn1_human_annot.txt** for the file name.
 304
 305 .. image:: images/smn1_human_annot.png
 306    :alt: Genome Browser - SMN1 (human) - sequence annotation file
 307    :align: center
 308
 309 **IMPORTANT:** You should open the file with a text editor and make
 310   sure **no HTML** was saved... If you find any HTML markup, delete
 311   the markup and save the file.
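
If you prefer to check the saved file with a script rather than by eye,
a minimal sketch along the following lines may help. The function name
and the sample text are illustrative assumptions; the pattern only looks
for tag-like markup and is not a full HTML parser.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <pre>, </body>, <a href="...">.
# A tag must start with "<", so a FASTA header line (which starts with ">")
# is not flagged on its own.
HTML_TAG = re.compile(r"</?[A-Za-z][^<>]*>")

def find_html_lines(text):
    """Return (line_number, line) pairs for lines that contain HTML-like tags."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if HTML_TAG.search(line)
    ]

# Made-up sample: a FASTA record followed by leftover markup.
sample = ">example_record\nacgtACGT\n<pre>\nACGT</pre>\n"
print(find_html_lines(sample))  # [(3, '<pre>'), (4, 'ACGT</pre>')]
```

To check a real file, read its contents (for example
``open("smn1_human_annot.txt").read()``) and pass the text to
``find_html_lines``; an empty list means no tag-like markup was found.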
 312
 313 Now we are going to **modify the file** you just saved to **add the
 314 name of the species** to the **annotation file**. All you have to do
 315 is **add a new line** at the **top of the file** with the word **'Human'** as
 316 shown below:
 317
 318 .. image:: images/smn1_human_annot_plus_human.png
 319    :alt: Genome Browser - SMN1 (human) - sequence annotation file
 320    :align: center
 321
 322 You can add more annotations to this file if you wish. See the
 323 `annotation file format`_ section for details of the file format. By
 324 including FASTA records in the annotation_ file, Mussa searches your
 325 DNA sequence for an exact match of the sequence in the annotation_
 326 file. If found, it will be marked as an annotation_ within Mussa.
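
The exact-match idea can be illustrated with a short Python sketch. It
is not Mussa's implementation: the function names and sample data are
made up, and comparing case-insensitively (so that the upper/lower case
CDS/UTR marking does not matter) is an assumption.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            # Lines before the first header (such as a species name) are skipped.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def find_exact_matches(dna, annotation_fasta):
    """Map each annotation header to the 0-based positions where it occurs in dna."""
    dna = dna.upper()  # assumption: ignore the upper/lower case CDS/UTR marking
    hits = {}
    for header, seq in parse_fasta(annotation_fasta):
        seq = seq.upper()
        positions = []
        start = dna.find(seq)
        while seq and start != -1:
            positions.append(start)
            start = dna.find(seq, start + 1)
        hits[header] = positions
    return hits

# Made-up data: a species line, one CDS record and one UTR record.
dna = "ttgacATGGCGATGcctagATGGC"
annotation = "Human\n>cds_1\nATGGCG\n>utr_1\ncctag\n"
print(find_exact_matches(dna, annotation))  # {'cds_1': [5], 'utr_1': [14]}
```

Because the match is exact, a single base difference between the
annotation record and the DNA sequence means the annotation will not be
placed.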
 327
 328
 329 Step 3 - Download gene and upstream/downstream sequence
 330 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 331
 332 Use the back button in your web browser to get back to the **genome
 333 browser view** of **SMN1** as shown below.
 334
 335 .. image:: images/ucsc_gb_smn1_human.png
 336    :alt: Genome Browser - SMN1 (human)
 337    :align: center
 338
 339 There are two options for getting additional sequence around your
 340 gene. The more complex way is to zoom out so that you have the
 341 sequence you want being shown in the genome browser and then follow
 342 the directions for the following method.
 343
 344 The second option, which we will choose, is to leave the genome
 345 browser zoomed exactly at the location of SMN1 and click on the
 346 **DNA** option on the menu bar (shown with orange arrows in the
 347 screenshot below.)
 348
 349 .. image:: images/ucsc_gb_smn1_human_dna_option.png
 350    :alt: Genome Browser - SMN1 (human) - DNA Option
 351    :align: center
 352
 353 Now in the **get dna in window** page, let's add an arbitrary amount of
 354 extra sequence on to each end of the gene, let's say 5000 base pairs.
 355
 356 .. image:: images/ucsc_gb_smn1_human_get_dna.png
 357    :alt: Genome Browser - SMN1 (human) - Get DNA
 358    :align: center
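
As a quick sanity check on the size of the region you are about to
download, the arithmetic can be sketched as follows. The coordinates are
the SMN1 genomic range quoted in step 2; treating them as 1-based and
inclusive, and the helper name, are assumptions made for this example.

```python
def extend_region(start, end, flank):
    """Extend an inclusive 1-based region by `flank` bases on each side."""
    # Clamp at 1 so the start never runs off the beginning of the chromosome.
    return max(1, start - flank), end + flank

start, end = 70_256_524, 70_284_592  # SMN1 genomic range quoted in step 2
new_start, new_end = extend_region(start, end, 5000)

print(new_start, new_end)       # 70251524 70289592
print(new_end - new_start + 1)  # 38069 bases in the extended region
```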
 359
 360 Click the **get DNA** button.
 361
 362 .. image:: images/ucsc_gb_smn1_human_dna.png
 363    :alt: Genome Browser - SMN1 (human) - DNA
 364    :align: center
 365
 366 Save the DNA sequence to a text file called 'smn1_human_dna.fa' as we
 367 did in step 2 with the annotation file.
 368
 369 **IMPORTANT:** Make sure the file is saved as a text file and not an
 370 HTML file. Open the file with a text editor and remove any HTML markup
 371 you find.
 372
 373
 374 Step 4 - Same/similar/related gene in other species
 375 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 376
 377 What good is a multiple sequence alignment viewer without multiple
 378 sequences? Let's find a similar gene in a few more species.
 379
 380 Use the back button on your web browser until you get the **genome
 381 browser view** of **SMN1** as shown below.
 382
 383 .. image:: images/ucsc_genome_browser_home.png
 384    :alt: UCSC Genome Browser
 385    :align: center
 386
 387 **Click on SMN1** shown **between** the **two orange arrows** shown
 388 below.
 389
 390 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
 391    :alt: Genome Browser - SMN1 (human) - Orange Arrows
 392    :align: center
 393
 394 You should find yourself at the SMN1 description page.
 395
 396 .. image:: images/ucsc_gb_smn1_description_page.png
 397    :alt: Genome Browser - SMN1 (human) - Description page
 398    :align: center
 399
 400 **Scroll down** until you get to the **Sequence section** and click on
 401 **Protein (262 aa)**.
 402
 403 .. image:: images/ucsc_gb_smn1_human_sequence.png
 404    :alt: Genome Browser - SMN1 (human) - Sequence
 405    :align: center
 406
 407 Copy the SMN1 protein sequence by highlighting it and selecting the
 408 **Edit > Copy** option from the menu.
 409
 410 .. image:: images/smn1_human_protein.png
 411    :alt: Genome Browser - SMN1 (human) - Protein
 412    :align: center
 413
 414 Press the back button on the web browser once and then scroll to the
 415 top of the page and click on the **BLAT** option on the menu bar
 416 (shown below with orange arrows).
 417
 418 .. image:: images/ucsc_gb_smn1_human_blat.png
 419    :alt: Genome Browser - SMN1 (human) - Blat
 420    :align: center
 421
 422 **Paste** in the **protein sequence** and **change** the **genome** to
 423 **mouse** as shown below and then click **submit**.
 424
 425 .. image:: images/ucsc_gb_smn1_human_blat_paste.png
 426    :alt: Genome Browser - SMN1 (human) - Blat paste protein
 427    :align: center
 428
 429 Notice that we have two hits, one of which looks pretty good at 89.9%
 430 match.
 431
 432 .. image:: images/ucsc_gb_smn1_human_blat_hits.png
 433    :alt: Genome Browser - SMN1 (human) - Blat hits
 434    :align: center
 435
 436 **Click** on the **browser** link next to the 89.9% match. Notice in
 437 the genome browser (shown below) that there is an annotated gene
 438 called SMN1 for mouse which matches the line called **your sequence
 439 from blat search**. This means we can be fairly confident that we
 440 found the right location in the mouse genome.
 441
 442 .. image:: images/ucsc_gb_smn1_human_blat_to_browser.png
 443    :alt: Genome Browser - SMN1 (human) - Blat to browser
 444    :align: center
 445
 446 Follow steps 1 through 3 for mouse and then repeat step 4 with the
 447 human protein sequence to find **SMN1** in the following species (if
 448 you find a match):
 449
 450  1. Rat
 451  2. Rabbit
 452  3. Dog
 453  4. Armadillo
 454  5. Elephant
 455  6. Opossum
 456  7. x_tropicalis
 457
 458 Make sure to save the extended DNA sequence and annotation file for
 459 each one.
 460
 461 Using Mussagl
 462 =============
 463
 464
 465 Launch Mussagl
 466 --------------
 467 Launch Mussagl... It should look similar to the screen shot below.
 468
 469 .. image:: images/opened.png
 470    :alt: Launch Mussa
 471    :align: center
 472
 473
 474
 475 Create/Load Analysis
 476 ----------------------
 477
 478 Currently there are three ways to load a Mussa experiment.
 479
 480  1. `Create a new analysis`_
 481  2. `Load a mussa parameter file`_ (.mupa)
 482  3. `Load an analysis`_
 483
 484 .. _createnew:
 485
 486 Create a new analysis
 487 ~~~~~~~~~~~~~~~~~~~~~
 488
 489 To create a new analysis select 'Define analysis' from the 'File'
 490 menu. You should see a dialog box similar to the one below. For this
 491 demo we will use the example sequences that come with Mussagl.
 492
 493 .. image:: images/define_analysis.png
 494    :alt: Define Analysis
 495    :align: center
 496
 497 Instructions:
 498
 499  1. **Give the experiment a name**, for this demo, we'll use
 500     'demo_w30_t20'. Mussa will create a folder with this name to store
 501     the analysis files in once it has been run.
 502
 503  2. Choose a `window size`_. For this demo **choose 30**.
 504
 505  3. Choose a threshold_... for this demo **choose 20**. See the
 506     Threshold_ section for more detailed information.
 507
 508  4. Choose the number of sequences_ you would like. For this demo
 509     **choose 3**.
 510
 511 .. image:: images/define_analysis_step1a.png
 512    :alt: Steps 1-4
 513    :align: center
 514
 515 Now click on the 'Browse' button next to the sequence input box and
 516 then select the /examples/seq/human_mck_pro.fa file. Do the same in the
 517 next two sequence input boxes selecting mouse_mck_pro.fa and
 518 rabbit_mck_pro.fa as shown below. Note that you can create annotation
 519 files using the mussa `Annotation File Format`_ to add annotations to
 520 your sequence.
 521
 522 .. image:: images/define_analysis_step2.png
 523    :alt: Choose sequences
 524    :align: center
 525
 526 Click the **create** button and in a few moments you should see
 527 something similar to the following screen shot.
 528
 529 .. image:: images/demo.png
 530    :alt: Mussagl Demo
 531    :align: center
 532
 533 This analysis is now saved in a directory called **demo_w30_t20** in
 534 the current working directory. If you close and reopen Mussagl, you
 535 can reload the saved analysis. See `Load an analysis`_ section below
 536 for details.
 537
 538
 539 Load a mussa parameter file
 540 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 541
 542 If you prefer, you can define your Mussa analysis using the Mussa
 543 parameter file. See the `Parameter File Format`_ section for details
 544 on creating a .mupa file.
 545
 546 Once you have a .mupa file created, load Mussagl and select the **File >
 547 Load Mussa Parameters** menu option. Select the .mupa file and click
 548 open.
 549
 550 .. image:: images/load_mupa_menu.png
 551    :alt: Load Mussa Parameters
 552    :align: center
 553
 554 If you would like to see an example, you can load the
 555 **mck3test.mupa** file in the examples directory that comes with
 556 Mussagl.
 557
 558 .. image:: images/load_mupa_dialog.png
 559    :alt: Load Mussa Parameters Dialog
 560    :align: center
 561
 562
 563 Load an analysis
 564 ~~~~~~~~~~~~~~~~
 565
 566 To load a previously run analysis open Mussagl and select the **File >
 567 Load Analysis** menu option. Select an analysis **directory** and
 568 click open.
 569
 570 .. image:: images/load_analysis_menu.png
 571    :alt: Load Analysis Menu
 572    :align: center
 573
 574
 575 Main Window
 576 -----------
 577
 578 Overview
 579 ~~~~~~~~
 580 .. Screen-shot with numbers showing features.
 581
 582 .. image:: images/window_overview.png
 583    :alt: Mussa Window
 584    :align: center
 585
 586 Legend:
 587
 588  1. `DNA Sequence (Black bars)`_
 589
 590  2. Annotation_
 591
 592  3. Motif_
 593
 594  4. `Conservation tracks`_
 595
 596  5. `Motif Toggle`_
 597
 598  6. `Zoom Factor`_ (Base pairs per pixel)
 599
 600  7. `Dynamic Threshold`_
 601
 602  8. `Sequence Information Bar`_
 603
 604  9. `Sequence Scroll Bar`_
 605
 606
 607 DNA Sequence (black bars)
 608 ~~~~~~~~~~~~~~~~~~~~~~~~~
 609
 610 .. image:: images/sequence_bar.png
 611    :alt: Sequence Bar
 612    :align: center
 613
 614 Each of the black bars represents one of the loaded sequences, in this
 615 case the sequence around the gene 'MCK' in human, mouse, and rabbit.
 616
 617 FIXME: Should I mention the repeats here?
 618
 619
 620 Annotation
 621 ~~~~~~~~~~
 622
 623 .. figure:: images/annotation.png
 624    :alt: Annotation
 625    :align: center
 626
 627    Annotation shown in green on sequence bar.
 628
 629
 630 Annotations can be included on any of the sequences using the `Load a
 631 mussa parameter file`_ method of loading your sequences. You can
 632 define annotations by location or using an exact sub-sequence and you
 633 may also choose any color for display of the annotation; see the
 634 `Annotation File Format`_ section for details.
 635
 636 Note: Currently there is no way to add annotations using the GUI (only
 637 via the .mupa file). We plan to add this feature in the future, but it
 638 likely will not make it into the first release.
 639
 640
 641 Motif
 642 ~~~~~
 643
 644 .. figure:: images/motif.png
 645    :alt: Motif
 646    :align: center
 647
 648    Motif shown in light blue on sequence bar.
 649
 650 The only real difference between an annotation and motif in Mussagl is
 651 that you can define motifs from within the GUI. See the `Motifs`_
 652 section for more information.
 653
 654
 655 Conservation tracks
 656 ~~~~~~~~~~~~~~~~~~~
 657
 658 .. figure:: images/conservation_tracks.png
 659    :alt: Conservation Tracks
 660    :align: center
 661
 662    Conservation tracks shown as red and blue lines between sequence
 663    bars.
 664
 665 The **red lines** between the sequence bars represent conservation
 666 between the sequences and **blue lines** represent **reverse
 667 complement** conservation. The amount of sequence conservation shown
 668 will depend on the relatedness of your sequences and the `dynamic
 669 threshold`_ you are using. Sequences with lots of repeats will cause
 670 major slowdowns in calculating the matches.
 671
 672
 673 Motif Toggle
 674 ~~~~~~~~~~~~
 675
 676 .. image:: images/motif_toggle.png
 677    :alt: Motif Toggle
 678    :align: center
 679
 680 Toggles motifs on and off. This will not turn on and off annotations.
 681
 682 Note: As of the current build (#200), this feature hasn't been
 683 implemented.
 684
 685
 686 Zoom Factor
 687 ~~~~~~~~~~~
 688
 689 .. image:: images/zoom_factor.png
 690    :alt: Zoom Factor
 691    :align: center
 692
 693 The zoom factor is the number of base pairs represented per
 694 pixel. When you zoom in far enough, the display switches from
 695 a black bar representing the sequence to the actual sequence
 696 (well, an ASCII representation of the sequence).
 697
 698
 699 Dynamic Threshold
 700 ~~~~~~~~~~~~~~~~~
 701
 702 .. image:: images/dynamic_threshold.png
 703    :alt: Dynamic Threshold
 704    :align: center
 705
 706 You can dynamically change the threshold for how strong a match you
 707 consider the conservation to be with one of two options:
 708
 709  1. Number of base pair matches out of window size.
 710
 711  2. Percent base pair conservation.
 712
 713 See the Threshold_ section for more information.
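
The two options express the same cutoff in different units. A small
sketch of the conversion, assuming a window size of 30 as in the demo,
is shown here. The function names are illustrative, and rounding the
percentage up to the next whole base pair is an assumption rather than
documented Mussagl behaviour.

```python
import math

def matches_to_percent(matches, window_size):
    """Convert a base-pair match count into percent conservation."""
    return 100.0 * matches / window_size

def percent_to_matches(percent, window_size):
    """Smallest whole match count that reaches at least `percent` conservation."""
    return math.ceil(percent * window_size / 100.0)

print(round(matches_to_percent(20, 30), 1))  # 66.7
print(percent_to_matches(80, 30))            # 24
```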
 714
 715
 716 Sequence Information Bar
 717 ~~~~~~~~~~~~~~~~~~~~~~~~
 718
 719 .. image:: images/seq_info_bar.png
 720    :alt: Sequence Information Bar
 721    :align: center
 722
 723 The sequence information bars can be found to the left and right sides
 724 of Mussagl. Next to each sequence you will find the following
 725 information:
 726
 727  1. Species (If it has been defined)
 728  2. Total Size of Sequence
 729  3. Current base pair position
 730
 731
 732 Sequence Scroll Bar
 733 ~~~~~~~~~~~~~~~~~~~
 734
 735 .. image:: images/scroll_bar.png
 736    :alt: Sequence Scroll Bar
 737    :align: center
 738
 739 The scroll bar allows you to scroll through the sequence which is
 740 useful when you have zoomed in using the `zoom factor`_.
 741
 742
 743 Annotations / Motifs
 744 --------------------
 745
 746 Annotations
 747 ~~~~~~~~~~~
 748
 749 Currently annotations can be added to a sequence using the mussa
 750 `annotation file format`_ and can be loaded by selecting the
 751 annotation file when defining a new analysis (see `Create a new
 752 analysis`_ section) or by defining a .mupa file pointing to your
 753 annotation file (see `Load a mussa parameter file`_ section).
 754
 755 Motifs
 756 ~~~~~~
 757
 758 Load Motifs from File
 759 *********************
 760
 761 It is possible to load motifs from a file which was saved from a
 762 previous run or by defining your own motif file. See the `Motif File
 763 Format`_ section for details.
 764
 765 NOTE: Valid motif list file extensions are:
 766
 767   * .mtl
 768   * .txt
 769
 770 To load a motif file, select **Load Motif List** item from the
 771 **File** menu and select a motif list file.
 772
 773 .. image:: images/load_motif.png
 774    :alt: Load Motif List
 775    :align: center
 776
 777
 778 Save Motifs to File
 779 *******************
 780
 781 Note: Currently not implemented
 782
 783
 784 Motif Dialog
 785 ************
 786
 787 **New Features:**
 788
 789 Build 276
 790  * Allow for toggling individual motifs on and off.
 791
 792 Build 269
 793  * Field added for naming motifs.
 794
 795 Mussa has the ability to find lab motifs using the `IUPAC Nucleotide
 796 Code`_ for defining a motif. To define a motif, select **Edit > Edit
 797 Motifs** menu item as shown below.
 798
 799 .. image:: images/view_edit_motifs.png
 800    :alt: "View > Edit Motifs" Menu
 801    :align: center
 802
 803 You will see a dialog box appear with a "set motifs" button and 10
 804 rows for defining motifs and the color that will be displayed on the
 805 sequence. By default all 10 motifs start off with white as the
 806 color. In the image below, I changed the color from white to blue to
 807 make it easier to see. The first text box is for the motif and the
 808 second box is for the name of the motif. The check box defines whether
 809 the motif is displayed or not.
 810
 811 .. image:: images/motif_dialog_start.png
 812    :alt: Motif Dialog
 813    :align: center
 814
 815 Now let's make a motif **'AT[C or G]CT'**. Using the `IUPAC Nucleotide
 816 Code`_, type in **'ATSCT'** into the first box and 'My Motif' for the
 817 name in the second box as shown below.
 818
 819 .. image:: images/motif_dialog_enter_motif.png
 820    :alt: Enter Motif
 821    :align: center
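
To see what a motif such as 'ATSCT' will match, the IUPAC codes can be
expanded into a regular expression. The sketch below is an independent
illustration in Python, not the search code used by Mussa; it scans the
forward strand only, and the test sequence is made up.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Expand an IUPAC motif into an equivalent regular expression."""
    return "".join(IUPAC[base] for base in motif.upper())

def find_motif(motif, sequence):
    """Return 0-based start positions of every (possibly overlapping) match."""
    # The lookahead lets matches that overlap each other all be reported.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

print(motif_to_regex("ATSCT"))                  # AT[CG]CT
print(find_motif("ATSCT", "ggATCCTaaATGCTcc"))  # [2, 9]
```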
 822
 823 Now choose a color for your motif by clicking on the colored area to
 824 the left of the motif. In the image above, you would click on the blue
 825 square, but by default the squares will be white. Remember to choose a
 826 color that will show up well with a black bar as the background.
 827
 828 .. image:: images/color_chooser.png
 829    :alt: Color Chooser
 830    :align: center
 831
 832 Once you have selected the color for your motif, click on the 'set
 833 motifs' button. Any matches that Mussa finds to your motif will
 834 now show up in the main Mussagl window.
 835
 836 Before Motif:
 837
 838 .. image:: images/motif_dialog_bar_before.png
 839    :alt: Sequence bar before motif
 840    :align: center
 841
 842 After Motif:
 843
 844 .. image:: images/motif_dialog_bar_after.png
 845    :alt: Sequence bar after motif
 846    :align: center
 847
 848
 849 View Mussa Alignments
 850 ---------------------
 851
 852 Mussagl allows you to zoom in on Mussa alignments by selecting the set
 853 of alignment(s) of interest. To do this, move the mouse near the
 854 alignment you are interested in viewing and then **PRESS** and
 855 **HOLD** the **LEFT mouse button** and **drag the mouse** to the other
 856 side of the conservation track so that you see a bounding box
 857 overlapping the alignment(s) of interest and then **let go** of the
 858 *left mouse button*.
 859
 860 In the example below, I started by left-clicking on the area marked by
 861 a red dot (upper left corner of bounding box) and dragging the mouse to
 862 the area marked by a blue dot (lower right corner of the bounding box)
 863 and letting go of the left mouse button.
 864
 865 .. image:: images/select_sequence.png
 866    :alt: Select Sequence
 867    :align: center
 868
 869 All of the lines which were not selected should be washed out as shown
 870 below:
 871
 872 .. image:: images/washed_out.png
 873    :alt: Tracks washed out
 874    :align: center
 875
 876 With a selection made, go to the **View** menu and select **View mussa alignment**.
 877
 878 .. image:: images/view_mussa_alignment.png
 879    :alt: View mussa alignment
 880    :align: center
 881
 882 You should see the alignment at the base-pair level as shown below.
 883
 884 .. image:: images/mussa_alignment.png
 885    :alt: Mussa alignment
 886    :align: center
 887
 888
 889 Sub-analysis
 890 ------------
 891
 892 To run a sub-analysis **highlight** a section of sequence, *right
 893 click* on it and select **Add to subanalysis**. Do the same for the
 894 sequences shown in orange in the screenshot below. Note that you **are
 895 NOT limited** to one subsequence per sequence; you may select more
 896 than one subsequence from the same sequence.
 897
 898 .. image:: images/subanalysis_select_seqs.png
 899    :alt: Subanalysis sequence selection
 900    :align: center
 901
 902 Once you have added your sequences for subanalysis, choose a `window size`_ and `threshold`_ and click **Ok**.
 903
 904 .. image:: images/subanalysis_dialog.png
 905    :alt: Subanalysis Dialog
 906    :align: center
 907
 908 A new Mussa window will appear with the subanalysis of your sequences
 909 once it's done running. This may take a while if you selected large
 910 chunks of sequence with a loose threshold.
 911
 912 .. image:: images/subanalysis_done.png
 913    :alt: Subanalysis complete
 914    :align: center
 915
 916
 917 Copying sequence to clipboard
 918 -----------------------------
 919
 920 To copy a sequence to the clipboard, highlight a section of sequence,
 921 as shown in the screen shot below, and do one of the following:
 922
 923  * Select **Copy as FASTA** from the **Edit** menu.
 924  * **Right-Click (Left-click + Apple/Command Key on Mac)** on the highlighted sequence and select **Copy as FASTA**.
 925  * Press **Ctrl + C (on PC)** or **Apple/Command Key + C (on Mac)** on the keyboard.
 926
 927 .. image:: images/copy_sequence.png
 928    :alt: Copy sequence
 929    :align: center
 930
 931 Saving to an Image
 932 ---------------------------------
 933
 934 FIXME: Need to write this section
 935
 936
 937 Detailed Information
 938 --------------------
 939
 940 Threshold
 941 ~~~~~~~~~
 942
 943 The threshold of an analysis is the minimum number of base pair matches
 944 that must be met for a window to be kept as a match. Note that you can vary
 945 the threshold from within Mussagl. For example, if you choose a
 946 `window size`_ of **30** and a **threshold** of **20** the mussa nway
 947 transitive algorithm will store all matches that are 20 out of 30 bp
 948 matches or better and pass it on to Mussagl. Mussagl will then allow
 949 you to dynamically choose a threshold from 20 to 30 base pairs. A
 950 threshold of 30 bps would only show 30 out of 30 bp matches. A
 951 threshold of 20 bps would show all matches of 20 out of 30 bps or
 952 better. If you would like to see results for matches lower than 20 out
 953 of 30, you will need to rerun the analysis with a lower threshold.
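
The window/threshold relationship can be illustrated with a small
Python sketch. It scores a single pair of equal-length windows and then
filters a list of stored scores; the function names and sample windows
are assumptions made for this example, and the sketch does not
reproduce the Mussa N-way transitive algorithm.

```python
def count_matches(window_a, window_b):
    """Number of positions at which two equal-length windows agree."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must have the same length")
    return sum(a == b for a, b in zip(window_a, window_b))

def visible_matches(stored_scores, threshold):
    """Keep only the stored scores that meet the chosen dynamic threshold."""
    return [score for score in stored_scores if score >= threshold]

a = "ACGT" * 7 + "AC"   # 30 bp window from one sequence
b = a[:20] + "T" * 10   # same first 20 bp, then a run of T

score = count_matches(a, b)
print(score)        # 22
print(score >= 20)  # True: kept by an analysis run with threshold 20

# Scores stored by a run with threshold 20, viewed at a threshold of 24.
print(visible_matches([20, 22, 25, 30], 24))  # [25, 30]
```

Scores below the threshold used for the run are never stored, which is
why lowering the cutoff below that value requires rerunning the
analysis.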
 954
 955 Window Size
 956 ~~~~~~~~~~~
 957
 958 The typical sizes people tend to choose are between 20 and 30. You
 959 will likely need to experiment with this setting depending on your
 960 needs and input sequence.
 961
 962
 963 Sequences
 964 ~~~~~~~~~
 965
Mussa reads in sequences in the FASTA_ format. Mussa may take a long
time to run (>10 minutes) if the total sequence length nears 280 Kb.
Once mussa has run once, you can reload previously run analyses.
 970
 971 FIXME: We have learned more about how much sequence and how many to
 972 put in Mussagl, this information should be documented here.
 973
 974
 975 Mussa File Formats
 976 ------------------
 977
 978 .. _param:
 979
 980 Parameter File Format
 981 ~~~~~~~~~~~~~~~~~~~~~
 982
 983 **File Format (.mupa):**
 984
 985 ::
 986
  # name of analysis directory and stem for associated files
  ANA_NAME <analysis_name>

  # if APPEND vars true, a _wXX and/or _tYY added to analysis name
  # where XX = WINDOW and YY = THRESHOLD
  # Highly recommended with use of command line override of WINDOW or THRESHOLD
  APPEND_WIN <true/false>
  APPEND_THRES <true/false>

  # how many sequences are being analyzed
  SEQUENCE_NUM <num>

  # first sequence info
  SEQUENCE <FASTA_file_path>
  ANNOTATION <annotation_file_path>
  SEQ_START <sequence_start>

  # the second sequence info
  SEQUENCE <FASTA_file_path>
  # ANNOTATION <annotation_file_path>
  SEQ_START <sequence_start>
  # SEQ_END <sequence_end>

  # third sequence info
  SEQUENCE <FASTA_file_path>
  # ANNOTATION <annotation_file_path>

  # analysis parameters: command line args -w -t will override these
  WINDOW <num>
  THRESHOLD <num>
1017
.. csv-table:: Parameter File Options:
   :header: "Option Name", "Value", "Default", "Required", "Description"
   :widths: 30 30 30 30 60

   "ANA_NAME", "string", "N/A", "true", "Name of analysis (also the
   name of the directory where the analysis will be saved)."
   "APPEND_WIN", "true/false", "?", "?", "Appends _w## to ANA_NAME"
   "APPEND_THRES", "true/false", "?", "?", "Appends _t## to ANA_NAME"
   "SEQUENCE_NUM", "integer", "N/A", "true", "The number of sequences
   to analyze"
   "SEQUENCE", "/FASTA/filepath.fa", "N/A", "true", "Must define one
   sequence per SEQUENCE_NUM."
   "ANNOTATION", "/annotation/filepath.txt", "N/A", "false", "Optional
   annotation file. See `annotation file format`_ section for more
   information."
   "SEQ_START", "integer", "1", "false", "Optional index into FASTA file"
   "SEQ_END", "integer", "1", "false", "Optional index into FASTA file"
   "WINDOW", "integer", "N/A", "true", "`Window Size`_"
   "THRESHOLD", "integer", "N/A", "true", "`Threshold`_"
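
For readers who generate or check parameter files with scripts, the
layout above can be read with a short Python sketch. It is not part of
Mussa; ``parse_mupa`` is a hypothetical helper, and it assumes that
``ANNOTATION``, ``SEQ_START`` and ``SEQ_END`` apply to the most recent
``SEQUENCE`` line, as the sample layout suggests. The file names in the
example are made up.

```python
def parse_mupa(text):
    """Parse .mupa text into a dict; per-sequence options are grouped
    under the 'sequences' key, one dict per SEQUENCE line."""
    params = {"sequences": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "SEQUENCE":
            params["sequences"].append({"SEQUENCE": value})
        elif key in ("ANNOTATION", "SEQ_START", "SEQ_END"):
            if not params["sequences"]:
                raise ValueError(key + " appears before any SEQUENCE line")
            # Assumption: these options describe the most recent SEQUENCE.
            current = params["sequences"][-1]
            current[key] = value if key == "ANNOTATION" else int(value)
        elif key in ("SEQUENCE_NUM", "WINDOW", "THRESHOLD"):
            params[key] = int(value)
        elif key in ("APPEND_WIN", "APPEND_THRES"):
            params[key] = value.lower() == "true"
        else:
            params[key] = value  # e.g. ANA_NAME
    return params


example = """
# hypothetical two-sequence analysis
ANA_NAME demo
APPEND_WIN true
SEQUENCE_NUM 2
SEQUENCE mouse.fa
ANNOTATION mouse_annot.txt
SEQ_START 1
SEQUENCE human.fa
WINDOW 30
THRESHOLD 20
"""
params = parse_mupa(example)
print(params["ANA_NAME"], len(params["sequences"]))  # demo 2
```

A simple consistency check is to compare ``len(params["sequences"])``
with ``params["SEQUENCE_NUM"]`` before starting a run.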
1037
1038 .. _annot:
1039
1040 Annotation File Format
1041 ~~~~~~~~~~~~~~~~~~~~~~
1042
The first line in the file is the sequence name. Each line thereafter
is a **space**-separated annotation.

New as of build 198:

 * The annotation format now supports FASTA sequences embedded in the
   annotation file, as shown in the format example below. Mussagl will
   take this sequence and look for an exact match of it in your
   sequences. If a match is found, it will be labeled with the name
   from the FASTA header.
1053
1054 Format:
1055
1056 ::
1057
1058   <species/sequence_name>
1059   <start> <stop> <annotation_name> <annotation_type>
1060   <start> <stop> <annotation_name> <annotation_type>
1061   <start> <stop> <annotation_name> <annotation_type>
1062   <start> <stop> <annotation_name> <annotation_type>
1063   >FASTA Header
1064   ACTGACTGACGTACGTAGCTAGCTAGCTAGCACG
1065   ACGTACGTACGTACGTAGCTGTCATACGCTAGCA
1066   TGCGTAGAGGATCTCGGATGCTAGCGCTATCGAT
1067   ACGTACGGCAGTACGCGGTCAGA
1068   <start> <stop> <annotation_name> <annotation_type>
1069   ...
1070
1071 Example:
1072
1073 ::
1074
1075   Mouse
1076   251 500 Glorp Glorptype
1077   751 1000 Glorp Glorptype
1078   1251 1500 Glorp Glorptype
1079   >My favorite DNA sequence
1080   GATTACA
1081   1751 2000 Glorp Glorptype
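
The format can be read with a small Python sketch like the one below.
It is an illustration only, not Mussagl's own reader;
``parse_annotation_file`` is a hypothetical name, and the sketch assumes
that annotation names and types contain no spaces and that embedded
sequence lines never look like a four-field annotation line.

```python
def parse_annotation_file(text):
    """Return (sequence_name, intervals, embedded_fasta).

    intervals      -- list of (start, stop, name, type) tuples
    embedded_fasta -- list of (header, sequence) tuples
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty annotation file")
    name = lines[0]  # first line is the species/sequence name
    intervals = []
    embedded = []
    in_fasta = False
    for line in lines[1:]:
        fields = line.split()
        if line.startswith(">"):
            # Start of an embedded FASTA record.
            embedded.append([line[1:].strip(), ""])
            in_fasta = True
        elif (len(fields) == 4 and fields[0].isdigit()
              and fields[1].isdigit()):
            intervals.append(
                (int(fields[0]), int(fields[1]), fields[2], fields[3]))
            in_fasta = False
        elif in_fasta:
            embedded[-1][1] += line  # continuation of the sequence
        else:
            raise ValueError("unrecognized line: " + line)
    return name, intervals, [tuple(rec) for rec in embedded]


example = """Mouse
251 500 Glorp Glorptype
751 1000 Glorp Glorptype
1251 1500 Glorp Glorptype
>My favorite DNA sequence
GATTACA
1751 2000 Glorp Glorptype
"""
name, intervals, embedded = parse_annotation_file(example)
print(name, len(intervals), embedded)
```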
1082
1083
1084 .. _motif_file_format:
1085
1086 Motif File Format
1087 ~~~~~~~~~~~~~~~~~
1088
Format:

::

  <motif> <red> <green> <blue>

Example:

::

  GGCC 0.0 1 1
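
A line in this format can be split into its parts as sketched below.
``parse_motif_line`` is a hypothetical helper, and treating the color
components as floating point numbers is an assumption based on the
example above.

```python
def parse_motif_line(line):
    """Split '<motif> <red> <green> <blue>' into (motif, (r, g, b))."""
    motif, red, green, blue = line.split()
    return motif, (float(red), float(green), float(blue))


print(parse_motif_line("GGCC 0.0 1 1"))  # ('GGCC', (0.0, 1.0, 1.0))
```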
1096
1097
1098
1099 IUPAC Nucleotide Code
1100 ~~~~~~~~~~~~~~~~~~~~~~
1101
1102 For your convenience, below is a table of the IUPAC Nucleotide Code.
1103
1104 The following table is table 1 from "Nomenclature for Incompletely
1105 Specified Bases in Nucleic Acid Sequences" which can be found at
1106 http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html.
1107
1108 ======  =================  ===================================
1109 Symbol  Meaning            Origin of designation
1110 ======  =================  ===================================
1111 G       G                  Guanine
1112 A       A                  Adenine
1113 T       T                  Thymine
1114 C       C                  Cytosine
1115 R       G or A             puRine
1116 Y       T or C             pYrimidine
1117 M       A or C             aMino
1118 K       G or T             Keto
1119 S       G or C             Strong interaction (3 H bonds)
1120 W       A or T             Weak interaction (2 H bonds)
1121 H       A or C or T        not-G, H follows G in the alphabet
1122 B       G or T or C        not-A, B follows A
1123 V       G or C or A        not-T (not-U), V follows U
1124 D       G or A or T        not-C, D follows C
1125 N       G or A or T or C   aNy
1126 ======  =================  ===================================
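
To see how an ambiguity code expands, the table can be turned into a
regular expression with a short Python sketch. This is an illustration
of the table only, not a description of how Mussagl searches for
motifs; ``motif_to_regex`` and ``find_motif`` are hypothetical names,
only the forward strand is searched, and positions are 0-based.

```python
import re

# Symbol -> bases, copied from the table above.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT", "H": "ACT", "B": "GTC",
    "V": "GCA", "D": "GAT", "N": "GATC",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif into a regex with one character
    class per symbol."""
    return "".join("[" + IUPAC[symbol] + "]" for symbol in motif.upper())


def find_motif(motif, sequence):
    """Return the 0-based start of every (possibly overlapping) match."""
    # The lookahead lets matches that overlap each other be reported.
    pattern = "(?=" + motif_to_regex(motif) + ")"
    return [m.start() for m in re.finditer(pattern, sequence.upper())]


print(motif_to_regex("GGCC"))          # [G][G][C][C]
print(find_motif("GGCC", "AAGGCCTT"))  # [2]
print(find_motif("RY", "AT"))          # [0]
```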
1127
1128
1129 .. Define links below
1130    ------------------
1131
1132 .. _GPL: http://www.opensource.org/licenses/gpl-license.php
1133 .. _wiki: http://mussa.caltech.edu
1134 .. _build: http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild
1135 .. _FASTA: http://en.wikipedia.org/wiki/fasta_format
1136 .. _wpDnaMotif: http://en.wikipedia.org/wiki/DNA_motif