doc/manual/mussagl_manual.rst

   1 ==============
   2 Mussagl Manual
   3 ==============
   4 ---------------
   5 Brandon W. King
   6 ---------------
   7
   8 Last updated: Oct 16th, 2006
   9
  10 Updated to Mussagl build: 287? (In process to 424)
  11
  12
  13 .. Things to add
  14         * New features / change log
  15         * Comment out anything that isn't implemented yet.
  16         * (DONE) List of features that will be implemented in the future.
  17         * Look into the homology mapping of UCSC.
  18         * Add toggle to genomes.
  19         * Document why one FASTA record per region.
  20         * How to deal with the hazards of small utrs vis motif finder. (Add warning)
  21         * Add warning about saving FASTA file.
  22         * Add a general principles section near the top
  23                 * Using comparison algorithm which will pickup all repeats
  24                 * Add info about repeatmasking
  25                 * Checking upstream and downstream genes to make sure you are in the right regions.
  26         * Later on: look into Ensembl
  27         * Look into method of homology instead of blating.
  28         * Mention advantages of using mupa.
  29         * Mention the difference between using arrows and scroll bar
  30         * Document the color for motifs
  31         * Update for Mac user left-click
  32
  33         * Wormbase/Flybase/mirBASE tutorials
  34
  35
  36
  37 .. contents::
  38
  39 Status
  40 ======
  41
  42 Major New Features
  43 ------------------
  44
  45  * Build 381
  46    * Analysis "Save As" feature
  47
  48 Change Log
  49 ----------
  50
  51 .. INSERT CHANGE LOG HERE
  52 .. END INSERT CHANGE LOG
  53
  54 Features to be Implemented
  55 --------------------------
  56
  57  * Motif editor supporting more than 10 motifs
  58    (Status: http://woldlab.caltech.edu/cgi-bin/mussa/ticket/122)
  59  * Save motifs from Mussagl
  60    (Status: http://woldlab.caltech.edu/cgi-bin/mussa/ticket/133)
  61
  62 For an up-to-date list of features to be implemented visit:
  63 http://woldlab.caltech.edu/cgi-bin/mussa/roadmap
  64
  65 Introduction
  66 ============
  67
  68
  69 What is Mussagl?
  70 ----------------
  71
  72 Mussa is an N-way version of the FamilyRelations (which is a part of
  73 the Cartwheel project) 2-way comparative sequence analysis
  74 software. Given DNA sequence from N species, Mussa uses all possible
  75 pairwise comparisons to derive an N-wise comparison. For example, given
  76 sequences 1,2,3, and 4, Mussa makes 6 2-way comparisons: 1vs2, 1vs3,
  77 1vs4, 2vs3, 2vs4, and 3vs4. It then compares all the links between
  78 these comparisons, saving those that satisfy a transitivity
  79 requirement. The saved paths are then displayed in an interactive
  80 viewer.
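
As a rough illustration of the pairwise step only (not Mussa's actual code), the following Python sketch enumerates every 2-way comparison for N sequences. The function name ``pairwise_comparisons`` is hypothetical and chosen for this example; the transitivity filtering is not shown.

```python
from itertools import combinations


def pairwise_comparisons(n_sequences):
    """Return every unordered pair of 1-based sequence indices."""
    return list(combinations(range(1, n_sequences + 1), 2))


# Four sequences give the six 2-way comparisons listed above.
pairs = pairwise_comparisons(4)
labels = [f"{a}vs{b}" for a, b in pairs]
print(len(pairs), labels)
```

In general N sequences need N * (N - 1) / 2 pairwise comparisons, so the amount of work grows quickly as sequences are added.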
  81
  82 Short History of Mussa
  83 ----------------------
  84
  85 Mussa Python/PMW Prototype
  86 ~~~~~~~~~~~~~~~~~~~~~~~~~~
  87
  88 First Python/PMW-based prototype.
  89
  90 Mussa C++/FLTK
  91 ~~~~~~~~~~~~~~
  92
  93 A rewrite for speed purposes using C++ and FLTK GUI toolkit.
  94
  95 Mussagl C++/Qt/OpenGL
  96 ~~~~~~~~~~~~~~~~~~~~~
  97
  98 Refactored version using the more elegant Qt GUI framework and
  99 OpenGL for hardware acceleration for those who have better graphics
 100 cards.
 101
 102 Getting Mussagl
 103 ===============
 104
 105 License
 106 -------
 107
 108 Mussagl has been released open source under the `GPL v2
 109 license`__.
 110
 111 __ GPL_
 112
 113 Platforms
 114 ---------
 115
 116 You have the option of building from source or downloading prebuilt
 117 binaries. Most people will want the prebuilt versions.
 118
 119 Supported Platforms:
 120
 121  * Mac OS X (binary or source)
 122  * Windows XP (binary or source)
 123  * Linux (source)
 124
 125 Download
 126 --------
 127
 128 Mussagl in binary form for OS X and Windows and/or source can be
 129 downloaded from http://mussa.caltech.edu/.
 130
 131 Install
 132 -------
 133
 134 Mac OS X
 135 ~~~~~~~~
 136 Once you have downloaded the .dmg file, double click on it and follow
 137 the install instructions.
 138
 139 FIXME: Mention how to launch the program.
 140
 141
 142 Windows XP
 143 ~~~~~~~~~~
 144 Once you have downloaded the Mussagl installer, double click on the
 145 installer and follow the install instructions.
 146
 147 To start Mussagl, launch the program from Start > Programs > Mussagl >
 148 Mussagl.
 149
 150
 151 Linux
 152 ~~~~~
 153 Currently we do not have a binary installer for Linux. You will have
 154 to build from source. See the 'build from source' section below.
 155
 156
 157 Build from Source
 158 ~~~~~~~~~~~~~~~~~
 159
 160 Instructions for building from source can be found on the `build page
 161 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild>`_ of the
 162 `Mussa wiki`__.
 163
 164 __ wiki_
 165
 166
 167 Obtaining Input Data
 168 ====================
 169
 170 If you already have your data, you can skip ahead to the `Using
 171 Mussagl`_ section.
 172
 173 Let's say you have a gene of interest called 'SMN1' and you want to
 174 know how well the sequence surrounding the gene is conserved across
 175 multiple species. That is what we are going to do here: retrieve the
 176 DNA sequence for SMN1 and prepare it for use in Mussa.
 177
 178 For more information about SMN1 visit `NCBI's OMIM
 179 <http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=609682>`_.
 180
 181 The SMN1 data retrieved in this section can be downloaded from the
 182 `Mussa Example Data
 183 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/ExampleData>`_ page if
 184 you prefer to skip this section of the manual.
 185
 186
 187 UCSC Genome Browser Method
 188 --------------------------
 189
 190 There are many methods of retrieving DNA sequence, but for this
 191 example we will retrieve SMN1 through the UCSC genome browser located
 192 at http://genome.ucsc.edu/.
 193
 194
 195 .. image:: images/ucsc_genome_browser_home.png
 196    :alt: UCSC Genome Browser
 197    :align: center
 198
 199 Step 1 - Find SMN1
 200 ~~~~~~~~~~~~~~~~~~
 201
 202 The first step in finding SMN1 is to use the **Gene Sorter** menu
 203 option which I have highlighted in orange below:
 204
 205 .. image:: images/ucsc_menu_bar_gene_sorter.png
 206    :alt: Gene Sorter Menu Option
 207    :align: center
 208
 209 Gene Sorter page:
 210
 211 .. image:: images/ucsc_gene_sorter.png
 212    :alt: Gene Sorter
 213    :align: center
 214
 215 We will start by looking for SMN1 in the **Human Genome** and **sorting by name similarity**.
 216
 217 .. image:: images/ucsc_gs_sort_name_sim.png
 218    :alt: Gene Sorter - Name Similarity
 219    :align: center
 220
 221 After you have selected **Human Genome** and **sorting by name similarity**, type *SMN1* into the search box.
 222
 223 .. image:: images/ucsc_gs_smn1.png
 224    :alt: Gene
 225    :align: center
 226
 227 Press **Go!** and you should see the following page:
 228
 229 .. image:: images/ucsc_gs_found.png
 230    :alt: Found SMN1
 231    :align: center
 232
 233 Click on **SMN1** and you will be taken to the gene expression atlas
 234 page.
 235
 236 .. image:: images/ucsc_gs_genome_position.png
 237    :alt: Gene expression atlas
 238    :align: center
 239
 240 Click on **chr5 70,270,558** found in the **SMN1 row**, **Genome
 241 position column**.
 242
 243 Now we have found the location of SMN1 on human!
 244
 245 .. image:: images/ucsc_gb_smn1_human.png
 246    :alt: Genome Browser - SMN1 (human)
 247    :align: center
 248
 249
 250 Step 2 - Download CDS/UTR sequence for annotations
 251 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 252
 253 Since we have found **SMN1**, this would be a convenient time to extract
 254 the DNA sequence for the CDS and UTRs of the gene to use it as an
 255 annotation_ in Mussa.
 256
 257 **Click on SMN1** shown **between** the **two orange arrows** shown
 258 below.
 259
 260 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
 261    :alt: Genome Browser - SMN1 (human) - Orange Arrows
 262    :align: center
 263
 264 You should find yourself at the SMN1 description page.
 265
 266 .. image:: images/ucsc_gb_smn1_description_page.png
 267    :alt: Genome Browser - SMN1 (human) - Description page
 268    :align: center
 269
 270 **Scroll down** until you get to the **Sequence section** and click on
 271 **Genomic (chr5:70,256,524-70,284,592)**.
 272
 273 .. image:: images/ucsc_gb_smn1_human_sequence.png
 274    :alt: Genome Browser - SMN1 (human) - Sequence
 275    :align: center
 276
 277 You should now be at the **Genomic sequence near gene** page:
 278
 279 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence.png
 280    :alt: Genome Browser - SMN1 (human) - Get genomic sequence
 281    :align: center
 282
 283 Make the following changes (highlighted in orange in the screenshot
 284 below):
 285
 286  1. UNcheck **introns**.
 287     (We only want to annotate CDS and UTRs.)
 288  2. Select **one FASTA record** per **region**.
 289     (Mussa needs each CDS and UTR represented by one FASTA record per CDS/UTR).
 290  3. Select **CDS in upper case, UTR in lower case.**
 291
 292 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_diff.png
 293    :alt: Genome Browser - SMN1 (human) - Get genomic sequence setup
 294    :align: center
 295
 296 Now click the **submit** button. You will then see a FASTA file with
 297 many FASTA records representing the CDS and UTRs.
 298
 299 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_submit.png
 300    :alt: Genome Browser - SMN1 (human) - CDS/UTR sequence
 301    :align: center
 302
 303 Now you need to save the FASTA records to a **text file**. If you are
 304 using **Firefox** or **Internet Explorer 6+** click on the **File >
 305 Save As** menu option.
 306
 307 **IMPORTANT:** Make sure you select **Text Files** and **NOT**, I
 308 repeat **NOT Webpage Complete** (see screenshot below.)
 309
 310 Type in **smn1_human_annot.txt** for the file name.
 311
 312 .. image:: images/smn1_human_annot.png
 313    :alt: Genome Browser - SMN1 (human) - sequence annotation file
 314    :align: center
 315
 316 **IMPORTANT:** You should open the file with a text editor and make
 317 sure **no HTML** was saved. If you find any HTML markup, delete
 318 the markup and save the file.
 319
 320 Now we are going to **modify the file** you just saved to **add the
 321 name of the species** to the **annotation file**. All you have to do
 322 is **add a new line** at the **top of the file** with the word **'Human'** as
 323 shown below:
 324
 325 .. image:: images/smn1_human_annot_plus_human.png
 326    :alt: Genome Browser - SMN1 (human) - sequence annotation file
 327    :align: center
 328
 329 You can add more annotations to this file if you wish. See the
 330 `annotation file format`_ section for details of the file format. By
 331 including FASTA records in the annotation_ file, Mussa searches your
 332 DNA sequence for an exact match of the sequence in the annotation_
 333 file. If found, it will be marked as an annotation_ within Mussa.
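
If you prefer to script the two manual edits from this step, the sketch below shows one possible way to do it in Python. It is only an illustration: the function name ``prepare_annotation_text`` and the simple HTML-tag pattern are assumptions made for this example and are not part of Mussa.

```python
import re

# Assumed heuristic: anything that looks like an HTML tag, e.g. <PRE> or </html>.
_HTML_TAG = re.compile(r"</?[A-Za-z!][^>]*>")


def prepare_annotation_text(saved_text, species):
    """Refuse text that still contains HTML, then put the species on line 1."""
    if _HTML_TAG.search(saved_text):
        raise ValueError("HTML markup found; remove it before using the file")
    return f"{species}\n{saved_text.lstrip()}"


example = ">record_1\nACGTacgt\n"  # made-up FASTA record, for illustration only
print(prepare_annotation_text(example, "Human"))
```

You would read the saved file into a string, pass it through the function, and write the result back out. Opening the file in a text editor afterwards is still a good idea.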
 334
 335
 336 Step 3 - Download gene and upstream/downstream sequence
 337 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 338
 339 Use the back button in your web browser to get back the **genome
 340 browser view** of **SMN1** as shown below.
 341
 342 .. image:: images/ucsc_gb_smn1_human.png
 343    :alt: Genome Browser - SMN1 (human)
 344    :align: center
 345
 346 There are two options for getting additional sequence around your
 347 gene. The more complex way is to zoom out so that you have the
 348 sequence you want being shown in the genome browser and then follow
 349 the directions for the following method.
 350
 351 The second option, which we will choose, is to leave the genome
 352 browser zoomed exactly at the location of SMN1 and click on the
 353 **DNA** option on the menu bar (shown with orange arrows in the
 354 screenshot below.)
 355
 356 .. image:: images/ucsc_gb_smn1_human_dna_option.png
 357    :alt: Genome Browser - SMN1 (human) - DNA Option
 358    :align: center
 359
 360 Now in the **get dna in window** page, let's add an arbitrary amount of
 361 extra sequence on to each end of the gene, let's say 5000 base pairs.
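
To make the arithmetic concrete, here is a small Python sketch of what adding 5000 base pairs to each end does to a region. It assumes the window matches the genomic coordinates quoted in step 2 (chr5:70,256,524-70,284,592); your browser window may differ. The name ``pad_region`` is hypothetical.

```python
def pad_region(start, end, padding=5000):
    """Extend a 1-based region by `padding` bases on each side."""
    # Clamp at 1 so the start never runs off the beginning of the chromosome.
    return max(1, start - padding), end + padding


start, end = pad_region(70_256_524, 70_284_592)
print(f"chr5:{start:,}-{end:,}")
```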
 362
 363 .. image:: images/ucsc_gb_smn1_human_get_dna.png
 364    :alt: Genome Browser - SMN1 (human) - Get DNA
 365    :align: center
 366
 367 Click the **get DNA** button.
 368
 369 .. image:: images/ucsc_gb_smn1_human_dna.png
 370    :alt: Genome Browser - SMN1 (human) - DNA
 371    :align: center
 372
 373 Save the DNA sequence to a text file called 'smn1_human_dna.fa' as we
 374 did in step 2 with the annotation file.
 375
 376 **IMPORTANT:** Make sure the file is saved as a text file and not an
 377 HTML file. Open the file with a text editor and remove any HTML markup
 378 you find.
 379
 380
 381 Step 4 - Same/similar/related gene other species.
 382 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 383
 384 What good is a multiple sequence alignment viewer without multiple
 385 sequences? Let's find a similar gene in a few more species.
 386
 387 Use the back button on your web browser until you get the **genome
 388 browser view** of **SMN1** as shown below.
 389
 390 .. image:: images/ucsc_genome_browser_home.png
 391    :alt: UCSC Genome Browser
 392    :align: center
 393
 394 **Click on SMN1** shown **between** the **two orange arrows** shown
 395 below.
 396
 397 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
 398    :alt: Genome Browser - SMN1 (human) - Orange Arrows
 399    :align: center
 400
 401 You should find yourself at the SMN1 description page.
 402
 403 .. image:: images/ucsc_gb_smn1_description_page.png
 404    :alt: Genome Browser - SMN1 (human) - Description page
 405    :align: center
 406
 407 **Scroll down** until you get to the **Sequence section** and click on
 408 **Protein (262 aa)**.
 409
 410 .. image:: images/ucsc_gb_smn1_human_sequence.png
 411    :alt: Genome Browser - SMN1 (human) - Sequence
 412    :align: center
 413
 414 Copy the SMN1 protein sequence by highlighting it and selecting the
 415 **Edit > Copy** option from the menu.
 416
 417 .. image:: images/smn1_human_protein.png
 418    :alt: Genome Browser - SMN1 (human) - Protein
 419    :align: center
 420
 421 Press the back button on the web browser once and then scroll to the
 422 top of the page and click on the **BLAT** option on the menu bar
 423 (shown below with orange arrows).
 424
 425 .. image:: images/ucsc_gb_smn1_human_blat.png
 426    :alt: Genome Browser - SMN1 (human) - Blat
 427    :align: center
 428
 429 **Paste** in the **protein sequence** and **change** the **genome** to
 430 **mouse** as shown below and then click **submit**.
 431
 432 .. image:: images/ucsc_gb_smn1_human_blat_paste.png
 433    :alt: Genome Browser - SMN1 (human) - Blat paste protein
 434    :align: center
 435
 436 Notice that we have two hits, one of which looks pretty good at 89.9%
 437 match.
 438
 439 .. image:: images/ucsc_gb_smn1_human_blat_hits.png
 440    :alt: Genome Browser - SMN1 (human) - Blat hits
 441    :align: center
 442
 443 **Click** on the **browser** link next to the 89.9% match. Notice in
 444 the genome browser (shown below) that there is an annotated gene
 445 called SMN1 for mouse which matches the line called **your sequence
 446 from blat search**. This means we are fairly confident we found the
 447 right location in the mouse genome.
 448
 449 .. image:: images/ucsc_gb_smn1_human_blat_to_browser.png
 450    :alt: Genome Browser - SMN1 (human) - Blat to browser
 451    :align: center
 452
 453 Follow steps 1 through 3 for mouse and then repeat step 4 with the
 454 human protein sequence to find **SMN1** in the following species (if
 455 you find a match):
 456
 457  1. Rat
 458  2. Rabbit
 459  3. Dog
 460  4. Armadillo
 461  5. Elephant
 462  6. Opossum
 463  7. x_tropicalis
 464
 465 Make sure to save the extended DNA sequence and annotation file for
 466 each one.
 467
 468 Using Mussagl
 469 =============
 470
 471
 472 Launch Mussagl
 473 --------------
 474 Launch Mussagl... It should look similar to the screen shot below.
 475
 476 .. image:: images/opened.png
 477    :alt: Launch Mussa
 478    :align: center
 479
 480
 481
 482 Create/Load Analysis
 483 ----------------------
 484
 485 Currently there are three ways to load a Mussa experiment.
 486
 487  1. `Create a new analysis`_
 488  2. `Load a mussa parameter file`_ (.mupa)
 489  3. `Load an analysis`_
 490
 491 .. _createnew:
 492
 493 Create a new analysis
 494 ~~~~~~~~~~~~~~~~~~~~~
 495
 496 To create a new analysis select 'Define analysis' from the 'File'
 497 menu. You should see a dialog box similar to the one below. For this
 498 demo we will use the example sequences that come with Mussagl.
 499
 500 .. image:: images/define_analysis.png
 501    :alt: Define Analysis
 502    :align: center
 503
 504 Instructions:
 505
 506  1. **Give the experiment a name**, for this demo, we'll use
 507     'demo_w30_t20'. Mussa will create a folder with this name to store
 508     the analysis files in once it has been run.
 509
 510  2. Choose a `window size`_. For this demo **choose 30**.
 511
 512  3. Choose a threshold_... for this demo **choose 20**. See the
 513     Threshold_ section for more detailed information.
 514
 515  4. Choose the number of sequences_ you would like. For this demo
 516     **choose 3**.
 517
 518 .. image:: images/define_analysis_step1a.png
 519    :alt: Steps 1-4
 520    :align: center
 521
 522 Now click on the 'Browse' button next to the sequence input box and
 523 then select the /examples/seq/human_mck_pro.fa file. Do the same in the
 524 next two sequence input boxes selecting mouse_mck_pro.fa and
 525 rabbit_mck_pro.fa as shown below. Note that you can create annotation
 526 files using the mussa `Annotation File Format`_ to add annotations to
 527 your sequence.
 528
 529 .. image:: images/define_analysis_step2.png
 530    :alt: Choose sequences
 531    :align: center
 532
 533 Click the **create** button and in a few moments you should see
 534 something similar to the following screen shot.
 535
 536 .. image:: images/demo.png
 537    :alt: Mussagl Demo
 538    :align: center
 539
 540 This analysis is now saved in a directory called **demo_w30_t20** in
 541 the current working directory. If you close and reopen Mussagl, you
 542 can reload the saved analysis. See `Load an analysis`_ section below
 543 for details.
 544
 545
 546 Load a mussa parameter file
 547 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 548
 549 If you prefer, you can define your Mussa analysis using the Mussa
 550 parameter file. See the `Parameter File Format`_ section for details
 551 on creating a .mupa file.
 552
 553 Once you have a .mupa file created, load Mussagl and select the **File >
 554 Load Mussa Parameters** menu option. Select the .mupa file and click
 555 open.
 556
 557 .. image:: images/load_mupa_menu.png
 558    :alt: Load Mussa Parameters
 559    :align: center
 560
 561 If you would like to see an example, you can load the
 562 **mck3test.mupa** file in the examples directory that comes with
 563 Mussagl.
 564
 565 .. image:: images/load_mupa_dialog.png
 566    :alt: Load Mussa Parameters Dialog
 567    :align: center
 568
 569
 570 Load an analysis
 571 ~~~~~~~~~~~~~~~~
 572
 573 To load a previously run analysis open Mussagl and select the **File >
 574 Load Analysis** menu option. Select an analysis **directory** and
 575 click open.
 576
 577 .. image:: images/load_analysis_menu.png
 578    :alt: Load Analysis Menu
 579    :align: center
 580
 581
 582 Main Window
 583 -----------
 584
 585 Overview
 586 ~~~~~~~~
 587 .. Screen-shot with numbers showing features.
 588
 589 .. image:: images/window_overview.png
 590    :alt: Mussa Window
 591    :align: center
 592
 593 Legend:
 594
 595  1. `DNA Sequence (Black bars)`_
 596
 597  2. Annotation_
 598
 599  3. Motif_
 600
 601  4. `Red conservation tracks`_
 602
 603  5. `Blue conservation tracks`_
 604
 605  6. `Zoom Factor`_ (Base pairs per pixel)
 606
 607  7. `Dynamic Threshold`_
 608
 609  8. `Sequence Information Bar`_
 610
 611  9. `Sequence Scroll Bar`_
 612
 613
 614 DNA Sequence (black bars)
 615 ~~~~~~~~~~~~~~~~~~~~~~~~~
 616
 617 .. image:: images/sequence_bar.png
 618    :alt: Sequence Bar
 619    :align: center
 620
 621 Each of the black bars represents one of the loaded sequences, in this
 622 case the sequence around the gene 'MCK' in human, mouse, and rabbit.
 623
 624
 625 Annotation
 626 ~~~~~~~~~~
 627
 628 .. figure:: images/annotation.png
 629    :alt: Annotation
 630    :align: center
 631
 632    Annotation shown in green on sequence bar.
 633
 634
 635 Annotations can be included on any of the sequences using the `Load a
 636 mussa parameter file`_ or `Create a new analysis`_ method of loading
 637 your sequences. You can define annotations by location or using an
 638 exact sub-sequence or a FASTA sequence of the section of DNA you wish
 639 to annotate. See the `Annotation File Format`_ section for details.
 640
 641
 642 Motif
 643 ~~~~~
 644
 645 .. figure:: images/motif.png
 646    :alt: Motif
 647    :align: center
 648
 649    Motif shown in light blue on sequence bar.
 650
 651 The only real difference between an annotation and motif in Mussagl is
 652 that you can define motifs and choose a color from within the GUI. See
 653 the `Motifs`_ section for more information.
 654
 655
 656 Red conservation tracks
 657 ~~~~~~~~~~~~~~~~~~~~~~~
 658
 659 .. figure:: images/conservation_tracks.png
 660    :alt: Conservation Tracks
 661    :align: center
 662
 663    Conservation tracks shown as red and blue lines between sequence
 664    bars.
 665
 666 The **red lines** between the sequence bars represent conservation
 667 between the sequences (i.e. not reverse complement matches).
 668
 669 The amount of sequence conservation shown will depend on how much your
 670 sequences are related and the `dynamic threshold`_ you are using.
 671
 672
 673 Blue conservation tracks
 674 ~~~~~~~~~~~~~~~~~~~~~~~~
 675
 676 .. figure:: images/conservation_tracks.png
 677    :alt: Conservation Tracks
 678    :align: center
 679
 680    Conservation tracks shown as red and blue lines between sequence
 681    bars.
 682
 683 **Blue lines** represent **reverse complement** conservation relative
 684 to the sequence attached to the top of the blue line.
 685
 686 The amount of sequence conservation shown will depend on how much your
 687 sequences are related and the `dynamic threshold`_ you are using.
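
For readers unfamiliar with the term, the reverse complement of a DNA sequence is the sequence read from the opposite strand. The Python sketch below only illustrates that concept; it is not how Mussa computes its blue tracks, and the function name is made up for this example.

```python
# Translation table mapping each base to its complement, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```

A blue line therefore indicates that a stretch of one sequence resembles the reverse complement of a stretch of the other, rather than the stretch as written.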
 688
 689
 690 Zoom Factor
 691 ~~~~~~~~~~~
 692
 693 .. image:: images/zoom_factor.png
 694    :alt: Zoom Factor
 695    :align: center
 696
 697 The zoom factor is the number of base pairs represented per
 698 pixel. When you zoom in far enough, the display switches from a
 699 black bar representing the sequence to the actual sequence
 700 (well, an ASCII representation of the sequence).
 701
 702
 703 Dynamic Threshold
 704 ~~~~~~~~~~~~~~~~~
 705
 706 .. image:: images/dynamic_threshold.png
 707    :alt: Dynamic Threshold
 708    :align: center
 709
 710 You can dynamically change the threshold for how strong a match you
 711 consider the conservation to be by changing the value in the dynamic
 712 threshold box.
 713
 714 The value you enter is the minimum number of base pairs that have to
 715 be matched in order to be considered conserved. The second number that
 716 you can't change is the `window size`_ you used when creating the
 717 experiment. The last number is the percent match.
 718
 719 See the Threshold_ section for more information.
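
The relationship between the three numbers can be sketched in a few lines of Python. This assumes the percent match is simply the threshold divided by the window size, which is the natural reading of the text above but is not confirmed against the Mussagl source.

```python
def percent_match(threshold, window):
    """Percentage of window positions that must match (assumed definition)."""
    return 100.0 * threshold / window


# A threshold of 20 with a window of 30 corresponds to roughly 66.7%.
print(round(percent_match(20, 30), 1))
```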
 720
 721
 722 Sequence Information Bar
 723 ~~~~~~~~~~~~~~~~~~~~~~~~
 724
 725 .. image:: images/seq_info_bar.png
 726    :alt: Sequence Information Bar
 727    :align: center
 728
 729 The sequence information bars can be found to the left and right sides
 730 of Mussagl. Next to each sequence you will find the following
 731 information:
 732
 733  1. Species (If it has been defined)
 734  2. Total Size of Sequence
 735  3. Current base pair position
 736
 737 Note that you can **update the species** text box. Make sure to **save your
 738 experiment** after making this change by selecting **File > Save
 739 Analysis** from the menu.
 740
 741 Sequence Scroll Bar
 742 ~~~~~~~~~~~~~~~~~~~
 743
 744 .. image:: images/scroll_bar.png
 745    :alt: Sequence Scroll Bar
 746    :align: center
 747
 748 The scroll bar allows you to scroll through the sequence which is
 749 useful when you have zoomed in using the `zoom factor`_.
 750
 751
 752 Annotations / Motifs
 753 --------------------
 754
 755 Annotations
 756 ~~~~~~~~~~~
 757
 758 Currently annotations can be added to a sequence using the mussa
 759 `annotation file format`_ and can be loaded by selecting the
 760 annotation file when defining a new analysis (see `Create a new
 761 analysis`_ section) or by defining a .mupa file pointing to your
 762 annotation file (see `Load a mussa parameter file`_ section).
 763
 764 Motifs
 765 ~~~~~~
 766
 767 Load Motifs from File
 768 *********************
 769
 770 It is possible to load motifs from a file which was saved from a
 771 previous run or by defining your own motif file. See the `Motif File
 772 Format`_ section for details.
 773
 774 NOTE: Valid motif list file extensions are:
 775
 776   * .mtl
 777   * .txt
 778
 779 To load a motif file, select **Load Motif List** item from the
 780 **File** menu and select a motif list file.
 781
 782 .. image:: images/load_motif.png
 783    :alt: Load Motif List
 784    :align: center
 785
 786
 787 Save Motifs to File
 788 *******************
 789
 790 Note: Currently not implemented
 791
 792
 793 Motif Dialog
 794 ************
 795
 796 **New Features:**
 797
 798 Build 276
 799  * Allow for toggling individual motifs on and off.
 800
 801 Build 269
 802  * Field added for naming motifs.
 803
 804 Mussa has the ability to find lab motifs using the `IUPAC Nucleotide
 805 Code`_ for defining a motif. To define a motif, select **Edit > Edit
 806 Motifs** menu item as shown below.
 807
 808 .. image:: images/view_edit_motifs.png
 809    :alt: "View > Edit Motifs" Menu
 810    :align: center
 811
 812 You will see a dialog box appear with a "set motifs" button and 10
 813 rows for defining motifs and the color that will be displayed on the
 814 sequence. By default all 10 motifs start off with white as the
 815 color. In the image below, I changed the color from white to blue to
 816 make it easier to see. The first text box is for the motif and the
 817 second box is for the name of the motif. The check box defines whether
 818 the motif is displayed or not.
 819
 820 .. image:: images/motif_dialog_start.png
 821    :alt: Motif Dialog
 822    :align: center
 823
 824 Now let's make a motif **'AT[C or G]CT'**. Using the `IUPAC Nucleotide
 825 Code`_, type in **'ATSCT'** into the first box and 'My Motif' for the
 826 name in the second box as shown below.
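
To see which sequences a motif such as 'ATSCT' stands for, the following Python sketch expands the IUPAC codes into a regular expression and searches a string. It is an illustration only: it scans the forward strand of a made-up sequence and does not reflect how Mussa implements motif searching. The names ``IUPAC``, ``motif_to_regex`` and ``find_motif`` are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes expressed as regular-expression fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Expand an IUPAC motif into an equivalent regular expression."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(motif, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # The lookahead lets the search report overlapping occurrences.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))", re.IGNORECASE)
    return [m.start() for m in pattern.finditer(sequence)]


print(motif_to_regex("ATSCT"))
print(find_motif("ATSCT", "GGATCCTAATGCTATACT"))
```

In the example sequence, 'ATCCT' and 'ATGCT' match, while 'ATACT' does not because S only stands for C or G.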
 827
 828 .. image:: images/motif_dialog_enter_motif.png
 829    :alt: Enter Motif
 830    :align: center
 831
 832 Now choose a color for your motif by clicking on the colored area to
 833 the left of the motif. In the image above, you would click on the blue
 834 square, but by default the squares will be white. Remember to choose a
 835 color that will show up well with a black bar as the background.
 836
 837 .. image:: images/color_chooser.png
 838    :alt: Color Chooser
 839    :align: center
 840
 841 Once you have selected the color for your motif, click on the 'set
 842 motifs' button. Any matches Mussa finds to your motif will
 843 now show up in the main Mussagl window.
 844
 845 Before Motif:
 846
 847 .. image:: images/motif_dialog_bar_before.png
 848    :alt: Sequence bar before motif
 849    :align: center
 850
 851 After Motif:
 852
 853 .. image:: images/motif_dialog_bar_after.png
 854    :alt: Sequence bar after motif
 855    :align: center
 856
 857
 858 View Mussa Alignments
 859 ---------------------
 860
 861 Mussagl allows you to zoom in on Mussa alignments by selecting the set
 862 of alignment(s) of interest. To do this, move the mouse near the
 863 alignment you are interested in viewing and then **PRESS** and
 864 **HOLD** the **LEFT mouse button** and **drag the mouse** to the other
 865 side of the conservation track so that you see a bounding box
 866 overlapping the alignment(s) of interest and then **let go** of the
 867 *left mouse button*.
 868
 869 In the example below, I started by left-clicking on the area marked by
 870 a red dot (upper left corner of bounding box) and dragging the mouse to
 871 the area marked by a blue dot (lower right corner of the bounding box)
 872 and letting go of the left mouse button.
 873
 874 .. image:: images/select_sequence.png
 875    :alt: Select Sequence
 876    :align: center
 877
 878 All of the lines which were not selected should be washed out as shown
 879 below:
 880
 881 .. image:: images/washed_out.png
 882    :alt: Tracks washed out
 883    :align: center
 884
 885 With a selection made, go to the **View** menu and select **View mussa alignment**.
 886
 887 .. image:: images/view_mussa_alignment.png
 888    :alt: View mussa alignment
 889    :align: center
 890
 891 You should see the alignment at the base-pair level as shown below.
 892
 893 .. image:: images/mussa_alignment.png
 894    :alt: Mussa alignment
 895    :align: center
 896
 897
 898 Sub-analysis
 899 ------------
 900
 901 To run a sub-analysis, **highlight** a section of sequence, *right
 902 click* on it, and select **Add to subanalysis**. Do the same for the
 903 sequences shown in orange in the screenshot below. Note that you **are
 904 NOT limited** to one subsequence per sequence; you may select more
 905 than one subsequence from the same sequence.
 906
 907 .. image:: images/subanalysis_select_seqs.png
 908    :alt: Subanalysis sequence selection
 909    :align: center
 910
 911 Once you have added your sequences for subanalysis, choose a `window size`_ and `threshold`_ and click **Ok**.
 912
 913 .. image:: images/subanalysis_dialog.png
 914    :alt: Subanalysis Dialog
 915    :align: center
 916
 917 A new Mussa window will appear with the subanalysis of your sequences
 918 once it's done running. This may take a while if you selected large
 919 chunks of sequence with a loose threshold.
 920
 921 .. image:: images/subanalysis_done.png
 922    :alt: Subanalysis complete
 923    :align: center
 924
 925
 926 Copying sequence to clipboard
 927 -----------------------------
 928
 929 To copy a sequence to the clipboard, highlight a section of sequence,
 930 as shown in the screen shot below, and do one of the following:
 931
 932  * Select **Copy as FASTA** from the **Edit** menu.
 933  * **Right-Click (Left-click + Apple/Command Key on Mac)** on the highlighted sequence and select **Copy as FASTA**.
 934  * Press **Ctrl + C (on PC)** or **Apple/Command Key + C (on Mac)** on the keyboard.
 935
 936 .. image:: images/copy_sequence.png
 937    :alt: Copy sequence
 938    :align: center
 939
 940 Saving to an Image
 941 ---------------------------------
 942
 943  * Updated to build 419.
 944
 945 To save your current mussa view to an image, select **File > Save to
 946 image...** as shown below.
 947
 948 .. image:: images/save_to_image_menu.png
 949    :alt: File > Save to image...
 950    :align: center
 951
 952 You can define the width and the height of the image to save. By
 953 default it will use the same size as your current view. Since the
 954 Mussa view is implemented using vectors, if you choose a larger size
 955 than your current view, Mussa will redraw at the higher resolution
 956 when saving. In other words, you get higher quality images when saving
 957 at a higher resolution.
 958
 959 If you check the "Lock aspect ratio" check box, which I have circled
 960 in red, then when you change one value, say width, the other, height,
 961 will update automatically to keep the same aspect ratio.
 962
 963 .. image:: images/save_to_image_dialog.png
 964    :alt: Save to image dialog
 965    :align: center
 966
Click **Save** and choose a location and filename for your file.
 968
 969 The valid image formats are:
 970
 971   * .png (default if no extension specified.)
 972   * .jpg
 973
 974
 975 Detailed Information
 976 --------------------
 977
 978 Threshold
 979 ~~~~~~~~~
 980
The threshold of an analysis is the minimum number of base pair matches
that a window must contain in order to be kept as a match. Note that you
can vary the threshold from within Mussagl. For example, if you choose a
`window size`_ of **30** and a **threshold** of **20**, the Mussa n-way
transitive algorithm will store all matches that are 20 out of 30 bp or
better and pass them on to Mussagl. Mussagl will then allow you to
dynamically choose a threshold from 20 to 30 base pairs. A threshold of
30 bp would only show 30 out of 30 bp matches. A threshold of 20 bp
would show all matches of 20 out of 30 bp or better. If you would like
to see results for matches lower than 20 out of 30, you will need to
rerun the analysis with a lower threshold.
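
To make the "N out of window size" arithmetic concrete, here is a
minimal Python sketch that counts matching bases between two
equal-length windows and compares the count against a threshold. It is
only an illustration and not the Mussa n-way transitive algorithm
itself; the function names and the two example windows are made up for
this example.

.. code-block:: python

   def count_matches(window_a, window_b):
       """Count positions where two equal-length windows share a base."""
       if len(window_a) != len(window_b):
           raise ValueError("windows must be the same length")
       pairs = zip(window_a.upper(), window_b.upper())
       return sum(1 for a, b in pairs if a == b)


   def passes_threshold(window_a, window_b, threshold):
       """Return True if the windows match at `threshold` or more positions."""
       return count_matches(window_a, window_b) >= threshold


   # Two hypothetical 30 bp windows that differ at 5 positions.
   seq_a = "ACGTACGTACGTACGTACGTACGTACGTAC"
   seq_b = "ACGTTCGTACGAACGTACCTACGTTCGTAG"

   print(count_matches(seq_a, seq_b))         # 25
   print(passes_threshold(seq_a, seq_b, 20))  # True
   print(passes_threshold(seq_a, seq_b, 30))  # False

Under this simple model, a 25 out of 30 match like this one would be
kept at any threshold up to 25 and dropped at 26 or higher.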
 992
 993 Window Size
 994 ~~~~~~~~~~~
 995
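The window size is the number of base pairs compared at a time; the
`threshold`_ is counted out of this number. The typical sizes people
tend to choose are between 20 and 30. You will likely need to
experiment with this setting depending on your needs and input
sequence.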
 999
1000
1001 Sequences
1002 ~~~~~~~~~
1003
Mussa reads in sequences in the FASTA_ format. Mussa may take a long
time to run (>10 minutes) if the total sequence length is near 280 Kb.
Once Mussa has run, you can reload previously run analyses.
1008
1009 FIXME: We have learned more about how much sequence and how many to
1010 put in Mussagl, this information should be documented here.
1011
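Because run time depends on the total sequence length, it can be useful
to check that length before starting an analysis. The following Python
sketch adds up the sequence characters in FASTA-formatted text. The
function name ``fasta_total_length`` is made up for this example, and
the sketch assumes plain FASTA where header lines start with ``>``:

.. code-block:: python

   def fasta_total_length(text):
       """Return the number of sequence characters in FASTA text."""
       total = 0
       for line in text.splitlines():
           line = line.strip()
           # Header lines start with '>' and are not sequence data.
           if line and not line.startswith(">"):
               total += len(line)
       return total


   example = ">seq_one\nACGT\nAC\n>seq_two\nGG\n"
   print(fasta_total_length(example))  # 8

To check real files, read each FASTA file into a string, pass it to the
function, and sum the results.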
1012
1013 Mussa File Formats
1014 ------------------
1015
1016 .. _param:
1017
1018 Parameter File Format
1019 ~~~~~~~~~~~~~~~~~~~~~
1020
1021 **File Format (.mupa):**
1022
1023 ::
1024
1025   # name of analysis directory and stem for associated files
1026   ANA_NAME <analysis_name>
1027
  # if APPEND vars are true, a _wXX and/or _tYY is added to the analysis name
  # where XX = WINDOW and YY = THRESHOLD
  # Highly recommended with use of command line override of WINDOW or THRESHOLD
1031   APPEND_WIN <true/false>
1032   APPEND_THRES <true/false>
1033
1034   # how many sequences are being analyzed
1035   SEQUENCE_NUM <num>
1036
1037   # first sequence info
1038   SEQUENCE <FASTA_file_path>
1039   ANNOTATION <annotation_file_path>
1040   SEQ_START <sequence_start>
1041
1042   # the second sequence info
1043   SEQUENCE <FASTA_file_path>
1044   # ANNOTATION <annotation_file_path>
1045   SEQ_START <sequence_start>
1046   # SEQ_END <sequence_end>
1047
1048   # third sequence info
1049   SEQUENCE <FASTA_file_path>
1050   # ANNOTATION <annotation_file_path>
1051
  # analysis parameters: command line args -w -t will override these
1053   WINDOW <num>
1054   THRESHOLD <num>
1055
1056 .. csv-table:: Parameter File Options:
1057    :header: "Option Name", "Value", "Default", "Required", "Description"
1058    :widths: 30 30 30 30 60
1059
   "ANA_NAME", "string", "N/A", "true", "Name of analysis (also the
   name of the directory where the analysis will be saved)."
   "APPEND_WIN", "true/false", "?", "?", "Appends _w## to ANA_NAME"
   "APPEND_THRES", "true/false", "?", "?", "Appends _t## to ANA_NAME"
1064    "SEQUENCE_NUM", "integer", "N/A", "true", "The number of sequences
1065    to analyze"
1066    "SEQUENCE", "/FASTA/filepath.fa", "N/A", "true", "Must define one
1067    sequence per SEQUENCE_NUM."
1068    "ANNOTATION", "/annotation/filepath.txt", "N/A", "false", "Optional
1069    annotation file. See `annotation file format`_ section for more
1070    information."
1071    "SEQ_START", "integer", "1", "false", "Optional index into FASTA file"
1072    "SEQ_END", "integer", "1", "false", "Optional index into FASTA file"
1073    "WINDOW", "integer", "N/A", "true", "`Window Size`_"
1074    "THRESHOLD", "integer", "N/A", "true", "`Threshold`_"
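
As a rough illustration of the layout above, the following Python
sketch reads parameter file text into a dictionary. It is not the
parser Mussa uses: the function name ``read_mupa``, the returned
structure, and the file names in the example are all made up, and the
sketch assumes that ANNOTATION, SEQ_START and SEQ_END lines apply to
the most recent SEQUENCE line.

.. code-block:: python

   def read_mupa(text):
       """Parse .mupa-style text into a dict (illustrative sketch only)."""
       params = {"sequences": []}
       per_sequence = {"ANNOTATION", "SEQ_START", "SEQ_END"}
       for raw_line in text.splitlines():
           line = raw_line.strip()
           if not line or line.startswith("#"):
               continue  # skip blank lines and comments
           parts = line.split(None, 1)
           key = parts[0]
           value = parts[1].strip() if len(parts) > 1 else ""
           if key == "SEQUENCE":
               # Each SEQUENCE line starts a new per-sequence record.
               params["sequences"].append({"SEQUENCE": value})
           elif key in per_sequence:
               if not params["sequences"]:
                   raise ValueError(key + " appears before any SEQUENCE line")
               params["sequences"][-1][key] = value
           else:
               params[key] = value
       return params


   # Hypothetical parameter file for two sequences.
   example = """\
   # example analysis
   ANA_NAME example_analysis
   APPEND_WIN true
   APPEND_THRES true
   SEQUENCE_NUM 2
   SEQUENCE mouse.fa
   ANNOTATION mouse_annot.txt
   SEQ_START 1000
   SEQUENCE human.fa
   WINDOW 30
   THRESHOLD 20
   """
   params = read_mupa(example)
   print(params["ANA_NAME"], len(params["sequences"]))

All values are kept as strings; a stricter reader would also check that
the number of SEQUENCE lines agrees with SEQUENCE_NUM.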
1075
1076 .. _annot:
1077
1078 Annotation File Format
1079 ~~~~~~~~~~~~~~~~~~~~~~
1080
The first line in the file is the sequence name. Each line thereafter
is a **space** separated annotation.
1083
1084 New as of build 198:
1085
 * The annotation format now supports FASTA sequences embedded in the
   annotation file as shown in the format example below. Mussagl will
   take this sequence and look for an exact match of it in your
   sequences. If a match is found, it will label it with the name
   from the FASTA header.
1091
1092 Format:
1093
1094 ::
1095
1096   <species/sequence_name>
1097   <start> <stop> <annotation_name> <annotation_type>
1098   <start> <stop> <annotation_name> <annotation_type>
1099   <start> <stop> <annotation_name> <annotation_type>
1100   <start> <stop> <annotation_name> <annotation_type>
1101   >FASTA Header
1102   ACTGACTGACGTACGTAGCTAGCTAGCTAGCACG
1103   ACGTACGTACGTACGTAGCTGTCATACGCTAGCA
1104   TGCGTAGAGGATCTCGGATGCTAGCGCTATCGAT
1105   ACGTACGGCAGTACGCGGTCAGA
1106   <start> <stop> <annotation_name> <annotation_type>
1107   ...
1108
1109 Example:
1110
1111 ::
1112
1113   Mouse
1114   251 500 Glorp Glorptype
1115   751 1000 Glorp Glorptype
1116   1251 1500 Glorp Glorptype
1117   >My favorite DNA sequence
1118   GATTACA
1119   1751 2000 Glorp Glorptype
1120
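The format above can be read with a few lines of Python. The sketch
below is an illustration only and not Mussagl's own parser. The
function name ``read_annotations`` is made up, and it assumes that an
annotation line can be recognized by its first two fields being
integers, with every other line after a ``>`` header treated as
sequence data:

.. code-block:: python

   def read_annotations(text):
       """Parse annotation-file text (illustrative sketch only).

       Returns (sequence_name, annotations, fasta_records).
       """
       lines = [line.strip() for line in text.splitlines() if line.strip()]
       if not lines:
           raise ValueError("annotation text is empty")
       name = lines[0]
       annotations = []    # (start, stop, annotation_name, annotation_type)
       fasta_records = []  # [header, sequence]
       in_fasta = False
       for line in lines[1:]:
           fields = line.split()
           if line.startswith(">"):
               fasta_records.append([line[1:].strip(), ""])
               in_fasta = True
           elif len(fields) >= 4 and fields[0].isdigit() and fields[1].isdigit():
               annotations.append(
                   (int(fields[0]), int(fields[1]), fields[2], fields[3]))
               in_fasta = False
           elif in_fasta:
               # Sequence data may span several lines; join them.
               fasta_records[-1][1] += line
           else:
               raise ValueError("unrecognized line: " + line)
       return name, annotations, fasta_records


   example = """\
   Mouse
   251 500 Glorp Glorptype
   751 1000 Glorp Glorptype
   1251 1500 Glorp Glorptype
   >My favorite DNA sequence
   GATTACA
   1751 2000 Glorp Glorptype
   """
   name, annotations, fasta_records = read_annotations(example)
   print(name, len(annotations), fasta_records)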
1121
1122 .. _motif_file_format:
1123
1124 Motif File Format
1125 ~~~~~~~~~~~~~~~~~
1126
Format:

::

  <motif> <red> <green> <blue>

Example:

::

  GGCC 0.0 1 1

1134
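A line in this format can be split into the motif and its color with a
short Python sketch. The function name ``parse_motif_line`` is made up
for this example, and the sketch simply reads the three color fields as
floating point numbers without assuming any particular valid range:

.. code-block:: python

   def parse_motif_line(line):
       """Split a motif line into the motif and an (r, g, b) tuple."""
       fields = line.split()
       if len(fields) != 4:
           raise ValueError("expected: <motif> <red> <green> <blue>")
       motif = fields[0]
       red, green, blue = (float(value) for value in fields[1:])
       return motif, (red, green, blue)


   motif, color = parse_motif_line("GGCC 0.0 1 1")
   print(motif, color)  # GGCC (0.0, 1.0, 1.0)
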
1135
1136
1137 IUPAC Nucleotide Code
1138 ~~~~~~~~~~~~~~~~~~~~~~
1139
For your convenience, below is a table of the IUPAC Nucleotide Code.

The following table is Table 1 from "Nomenclature for Incompletely
Specified Bases in Nucleic Acid Sequences", which can be found at
http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html.
1145
1146 ======  =================  ===================================
1147 Symbol  Meaning            Origin of designation
1148 ======  =================  ===================================
1149 G       G                  Guanine
1150 A       A                  Adenine
1151 T       T                  Thymine
1152 C       C                  Cytosine
1153 R       G or A             puRine
1154 Y       T or C             pYrimidine
1155 M       A or C             aMino
1156 K       G or T             Keto
1157 S       G or C             Strong interaction (3 H bonds)
1158 W       A or T             Weak interaction (2 H bonds)
1159 H       A or C or T        not-G, H follows G in the alphabet
1160 B       G or T or C        not-A, B follows A
1161 V       G or C or A        not-T (not-U), V follows U
1162 D       G or A or T        not-C, D follows C
1163 N       G or A or T or C   aNy
1164 ======  =================  ===================================
1165
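To show how the codes in the table expand, the following Python sketch
turns an IUPAC motif into a regular expression and searches a sequence
with it. This is an illustration and not how Mussagl itself searches
for motifs; the function names are made up, and only the given strand
is searched (no reverse complement):

.. code-block:: python

   import re

   # IUPAC symbol -> bases it stands for, transcribed from the table above.
   IUPAC = {
       "G": "G", "A": "A", "T": "T", "C": "C",
       "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
       "S": "GC", "W": "AT", "H": "ACT", "B": "GTC",
       "V": "GCA", "D": "GAT", "N": "GATC",
   }


   def iupac_to_regex(motif):
       """Translate an IUPAC motif such as 'GRY' into a regular expression."""
       return "".join(
           IUPAC[s] if len(IUPAC[s]) == 1 else "[" + IUPAC[s] + "]"
           for s in motif.upper()
       )


   def find_motif(motif, sequence):
       """Return 0-based start positions of all matches, overlaps included."""
       # A lookahead lets overlapping occurrences be reported as well.
       pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
       return [m.start() for m in pattern.finditer(sequence.upper())]


   print(iupac_to_regex("GRY"))          # G[GA][TC]
   print(find_motif("GRY", "AGATGGCA"))  # [1, 4]
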
1166
1167 .. Define links below
1168    ------------------
1169
1170 .. _GPL: http://www.opensource.org/licenses/gpl-license.php
1171 .. _wiki: http://mussa.caltech.edu
1172 .. _build: http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild
1173 .. _FASTA: http://en.wikipedia.org/wiki/fasta_format
1174 .. _wpDnaMotif: http://en.wikipedia.org/wiki/DNA_motif