doc/manual/mussagl_manual.rst

   1 ==============
   2 Mussagl Manual
   3 ==============
   4 ---------------
   5 Brandon W. King
   6 ---------------
   7
   8 Last updated: Sept 20th, 2006
   9
  10 Updated to Mussagl build: 287
  11
  12
  13 .. Things to add
  14         * New features / change log
  15         * Comment out anything that isn't implemented yet.
  16         * List of features that will be implemented in the future.
  17         * Look into the homology mapping of UCSC.
  18         * Add toggle to genomes.
  19         * Document why one FASTA record per region.
  20         * How to deal with the hazards of small utrs vis motif finder. (Add warning)
  21         * Add warning about saving FASTA file.
  22         * Add a general principles section near the top
  23                 * Using comparison algorithm which will pickup all repeats
  24                 * Add info about repeatmasking
  25                 * Checking upstream and downstream genes to make sure you are in the right regions.
  26         * Later on: look into Ensembl
  27         * Look into method of homology instead of blating.
  28         * Mention advantages of using mupa.
  29         * Mention the difference between using arrows and scroll bar
  30         * Document the color for motifs
  31         * Update for Mac user left-click
  32
  33         * Wormbase/Flybase/mirBASE tutorials
  34
  35
  36
  37 .. contents::
  38
  39 Introduction
  40 ============
  41
  42
  43 What is Mussagl?
  44 ----------------
  45
  46 Mussa is an N-way version of the FamilyRelations (which is a part of
  47 the Cartwheel project) 2-way comparative sequence analysis
  48 software. Given DNA sequence from N species, Mussa uses all possible
  49 pairwise comparisons to derive an N-wise comparison. For example, given
  50 sequences 1,2,3, and 4, Mussa makes 6 2-way comparisons: 1vs2, 1vs3,
  51 1vs4, 2vs3, 2vs4, and 3vs4. It then compares all the links between
  52 these comparisons, saving those that satisfy a transitivity
  53 requirement. The saved paths are then displayed in an interactive
  54 viewer.
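
As a rough illustration of the counting described above (this is not
Mussa's own code), the pairwise comparisons for N sequences can be
enumerated with the Python standard library. The function name
``pairwise_comparisons`` is made up for this sketch.

```python
from itertools import combinations


def pairwise_comparisons(n_sequences):
    """Return all unordered pairs (i, j) with i < j, numbering sequences from 1."""
    return list(combinations(range(1, n_sequences + 1), 2))


if __name__ == "__main__":
    # For 4 sequences this prints the six pairs listed in the text.
    for a, b in pairwise_comparisons(4):
        print(f"{a}vs{b}")
```

In general, N sequences give N*(N-1)/2 pairwise comparisons, so the
amount of work grows quickly as sequences are added.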
  55
  56 Short History of Mussa
  57 ----------------------
  58
  59
  60 Mussa Python/PMW Prototype
  61 ~~~~~~~~~~~~~~~~~~~~~~~~~~
  62
  63 First Python/PMW based prototype.
  64
  65 Mussa C++/FLTK
  66 ~~~~~~~~~~~~~~
  67
  68 A rewrite for speed, using C++ and the FLTK GUI toolkit.
  69
  70 Mussagl C++/Qt/OpenGL
  71 ~~~~~~~~~~~~~~~~~~~~~
  72
  73 Refactored version using the more elegant Qt GUI framework and
  74 OpenGL for hardware acceleration for those who have better graphics
  75 cards.
  76
  77 Getting Mussagl
  78 ===============
  79
  80 License
  81 -------
  82
  83 Mussagl has been released open source under the `GPL v2
  84 license`__.
  85
  86 __ GPL_
  87
  88 Platforms
  89 ---------
  90
  91 You have the option of building from source or downloading prebuilt
  92 binaries. Most people will want the prebuilt versions.
  93
  94 Supported Platforms:
  95
  96  * Mac OS X (binary or source)
  97  * Windows XP (binary or source)
  98  * Linux (source)
  99
 100 Download
 101 --------
 102
 103 Mussagl in binary form for OS X and Windows and/or source can be
 104 downloaded from http://mussa.caltech.edu/.
 105
 106 Install
 107 -------
 108
 109 Mac OS X
 110 ~~~~~~~~
 111 Once you have downloaded the .dmg file, double click on it and follow
 112 the install instructions.
 113
 114 FIXME: Mention how to launch the program.
 115
 116
 117 Windows XP
 118 ~~~~~~~~~~
 119 Once you have downloaded the Mussagl installer, double click on the
 120 installer and follow the install instructions.
 121
 122 To start Mussagl, launch the program from Start > Programs > Mussagl >
 123 Mussagl.
 124
 125
 126 Linux
 127 ~~~~~
 128 Currently we do not have a binary installer for Linux. You will have
 129 to build from source. See the 'build from source' section below.
 130
 131
 132 Build from Source
 133 ~~~~~~~~~~~~~~~~~
 134
 135 Instructions for building from source can be found on the `build page
 136 <http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild>`_ of the
 137 `Mussa wiki`__.
 138
 139 __ wiki_
 140
 141
 142 Obtaining Input Data
 143 ====================
 144
 145 If you already have your data, you can skip ahead to the `Using
 146 Mussagl`_ section.
 147
 148 Let's say you have a gene of interest called 'SMN1' and you want to
 149 know how well the sequence surrounding the gene is conserved across
 150 multiple species. That is what we are going to do here: retrieve the
 151 DNA sequence for SMN1 and prepare it for use in Mussa.
 152
 153 For more information about SMN1 visit `NCBI's OMIM
 154 <http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=609682>`_.
 155
 156 UCSC Genome Browser Method
 157 --------------------------
 158
 159 There are many methods of retrieving DNA sequence, but for this
 160 example we will retrieve SMN1 through the UCSC genome browser located
 161 at http://genome.ucsc.edu/.
 162
 163 .. image:: images/ucsc_genome_browser_home.png
 164    :alt: UCSC Genome Browser
 165    :align: center
 166
 167 Step 1 - Find SMN1
 168 ~~~~~~~~~~~~~~~~~~
 169
 170 The first step in finding SMN1 is to use the **Gene Sorter** menu
 171 option which I have highlighted in orange below:
 172
 173 .. image:: images/ucsc_menu_bar_gene_sorter.png
 174    :alt: Gene Sorter Menu Option
 175    :align: center
 176
 177 Gene Sorter page:
 178
 179 .. image:: images/ucsc_gene_sorter.png
 180    :alt: Gene Sorter
 181    :align: center
 182
 183 We will start by looking for SMN1 in the **Human Genome** and **sorting by name similarity**.
 184
 185 .. image:: images/ucsc_gs_sort_name_sim.png
 186    :alt: Gene Sorter - Name Similarity
 187    :align: center
 188
 189 After you have selected **Human Genome** and **sorting by name similarity**, type *SMN1* into the search box.
 190
 191 .. image:: images/ucsc_gs_smn1.png
 192    :alt: Gene
 193    :align: center
 194
 195 Press **Go!** and you should see the following page:
 196
 197 .. image:: images/ucsc_gs_found.png
 198    :alt: Found SMN1
 199    :align: center
 200
 201 Click on **SMN1** and you will be taken to the gene expression atlas
 202 page.
 203
 204 .. image:: images/ucsc_gs_genome_position.png
 205    :alt: Gene expression atlas
 206    :align: center
 207
 208 Click on **chr5 70,270,558** found in the **SMN1 row**, **Genome
 209 position column**.
 210
 211 Now we have found the location of SMN1 in human!
 212
 213 .. image:: images/ucsc_gb_smn1_human.png
 214    :alt: Genome Browser - SMN1 (human)
 215    :align: center
 216
 217
 218 Step 2 - Download CDS/UTR sequence for annotations
 219 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 220
 221 Since we have found **SMN1**, this would be a convenient time to extract
 222 the DNA sequence for the CDS and UTRs of the gene to use it as an
 223 annotation_ in Mussa.
 224
 225 **Click on SMN1** shown **between** the **two orange arrows** shown
 226 below.
 227
 228 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
 229    :alt: Genome Browser - SMN1 (human) - Orange Arrows
 230    :align: center
 231
 232 You should find yourself at the SMN1 description page.
 233
 234 .. image:: images/ucsc_gb_smn1_description_page.png
 235    :alt: Genome Browser - SMN1 (human) - Description page
 236    :align: center
 237
 238 **Scroll down** until you get to the **Sequence section** and click on
 239 **Genomic (chr5:70,256,524-70,284,592)**.
 240
 241 .. image:: images/ucsc_gb_smn1_human_sequence.png
 242    :alt: Genome Browser - SMN1 (human) - Sequence
 243    :align: center
 244
 245 You should now be at the **Genomic sequence near gene** page:
 246
 247 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence.png
 248    :alt: Genome Browser - SMN1 (human) - Get genomic sequence
 249    :align: center
 250
 251 Make the following changes (highlighted in orange in the screenshot
 252 below):
 253
 254  1. UNcheck **introns**.
 255     (We only want to annotate CDS and UTRs.)
 256  2. Select **one FASTA record** per **region**.
 257     (Mussa needs each CDS and UTR represented by its own FASTA record.)
 258  3. Select **CDS in upper case, UTR in lower case.**
 259
 260 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_diff.png
 261    :alt: Genome Browser - SMN1 (human) - Get genomic sequence setup
 262    :align: center
 263
 264 Now click the **submit** button. You will then see a FASTA file with
 265 many FASTA records representing the CDS and UTRs.
 266
 267 .. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_submit.png
 268    :alt: Genome Browser - SMN1 (human) - CDS/UTR sequence
 269    :align: center
 270
 271 Now you need to save the FASTA records to a **text file**. If you are
 272 using **Firefox** or **Internet Explorer 6+** click on the **File >
 273 Save As** menu option.
 274
 275 **IMPORTANT:** Make sure you select **Text Files** and **NOT**, I
 276 repeat **NOT Webpage Complete** (see screenshot below.)
 277
 278 Type in **smn1_human_annot.txt** for the file name.
 279
 280 .. image:: images/smn1_human_annot.png
 281    :alt: Genome Browser - SMN1 (human) - sequence annotation file
 282    :align: center
 283
 284 **IMPORTANT:** You should open the file with a text editor and make
 285 sure **no HTML** was saved. If you find any HTML markup, delete
 286 the markup and save the file.
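
If you would rather not inspect the file by eye, a small script can
flag leftover markup. The following is only a sketch: the function
name ``find_html_lines`` is hypothetical, and the check simply looks
for anything that resembles an HTML tag such as ``<PRE>`` or
``</BODY>``. FASTA header lines begin with ``>``, which this pattern
does not match on its own.

```python
import re

# Anything that looks like an opening or closing HTML tag, e.g. <PRE> or </BODY>.
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def find_html_lines(text):
    """Return (line_number, line) pairs for lines containing HTML-like tags."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if HTML_TAG.search(line)
    ]
```

Read the contents of the saved file and pass them to the function. An
empty list means no HTML-like tags were found.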
 287
 288 Now we are going to **modify the file** you just saved to **add the
 289 name of the species** to the **annotation file**. All you have to do
 290 is **add a new line** at the **top of the file** with the word **'Human'** as
 291 shown below:
 292
 293 .. image:: images/smn1_human_annot_plus_human.png
 294    :alt: Genome Browser - SMN1 (human) - sequence annotation file
 295    :align: center
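
If you have many annotation files to prepare, the same edit can be
scripted. This is a minimal sketch that assumes the file contents are
already in memory as a string. ``add_species_line`` is a hypothetical
helper, not part of Mussa.

```python
def add_species_line(annotation_text, species):
    """Put the species name on its own first line, unless it is already there."""
    first_line = annotation_text.splitlines()[0] if annotation_text else ""
    if first_line.strip() == species:
        # Already prepared; return the text unchanged.
        return annotation_text
    return f"{species}\n{annotation_text}"
```

Calling the function twice on the same text adds the species line
only once.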
 296
 297 You can add more annotations to this file if you wish. See the
 298 `annotation file format`_ section for details of the file format. By
 299 including FASTA records in the annotation_ file, Mussa searches your
 300 DNA sequence for an exact match of the sequence in the annotation_
 301 file. If found, it will be marked as an annotation_ within Mussa.
 302
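The exact-match behaviour described above can be pictured with plain
string searching. This is not Mussa's implementation, only a sketch.
``find_exact_matches`` is a hypothetical helper, and the
case-insensitive comparison is an assumption, made because the UCSC
export mixes upper-case CDS with lower-case UTR.

```python
def find_exact_matches(sequence, annotation):
    """Return 0-based start positions where annotation occurs exactly in sequence.

    Both strings are upper-cased first (an assumption for this sketch).
    """
    if not annotation:
        return []
    sequence = sequence.upper()
    annotation = annotation.upper()
    positions = []
    start = sequence.find(annotation)
    while start != -1:
        positions.append(start)
        # Step by one so overlapping occurrences are also reported.
        start = sequence.find(annotation, start + 1)
    return positions
```

A single mismatch anywhere in the annotation sequence is enough for
this kind of search to find nothing, which is why the annotation and
DNA files should come from the same genome assembly.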
 303
 304 Step 3 - Download gene and upstream/downstream sequence
 305 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 306
 307 Use the back button in your web browser to get back to the **genome
 308 browser view** of **SMN1** as shown below.
 309
 310 .. image:: images/ucsc_gb_smn1_human.png
 311    :alt: Genome Browser - SMN1 (human)
 312    :align: center
 313
 314 There are two options for getting additional sequence around your
 315 gene. The more complex way is to zoom out so that you have the
 316 sequence you want being shown in the genome browser and then follow
 317 the directions for the following method.
 318
 319 The second option, which we will choose, is to leave the genome
 320 browser zoomed exactly at the location of SMN1 and click on the
 321 **DNA** option on the menu bar (shown with orange arrows in the
 322 screenshot below.)
 323
 324 .. image:: images/ucsc_gb_smn1_human_dna_option.png
 325    :alt: Genome Browser - SMN1 (human) - DNA Option
 326    :align: center
 327
 328 Now in the **get dna in window** page, let's add an arbitrary amount of
 329 extra sequence on to each end of the gene, let's say 5000 base pairs.
 330
 331 .. image:: images/ucsc_gb_smn1_human_get_dna.png
 332    :alt: Genome Browser - SMN1 (human) - Get DNA
 333    :align: center
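
The UCSC form does the arithmetic for you, but it can help to know
which coordinates to expect. The sketch below assumes the browser
window coincides with the genomic range quoted in step 2
(chr5:70,256,524-70,284,592). ``pad_region`` is a hypothetical helper
name.

```python
def pad_region(start, end, padding=5000):
    """Extend a 1-based, inclusive region by `padding` bases on each side.

    The start is clamped at 1; the chromosome end is not checked here.
    """
    if start > end:
        raise ValueError("start must not be greater than end")
    return max(1, start - padding), end + padding


if __name__ == "__main__":
    # SMN1 genomic range quoted earlier in this manual.
    new_start, new_end = pad_region(70_256_524, 70_284_592)
    print(f"chr5:{new_start}-{new_end}")
```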
 334
 335 Click the **get DNA** button.
 336
 337 .. image:: images/ucsc_gb_smn1_human_dna.png
 338    :alt: Genome Browser - SMN1 (human) - DNA
 339    :align: center
 340
 341 Save the DNA sequence to a text file called 'smn1_human_dna.fa' as we
 342 did in step 2 with the annotation file.
 343
 344 **IMPORTANT:** Make sure the file is saved as a text file and not an
 345 HTML file. Open the file with a text editor and remove any HTML markup
 346 you find.
 347
 348
 349 Step 4 - Same/similar/related gene in other species
 350 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 351
 352 What good is a multiple sequence alignment viewer without multiple
 353 sequences? Let's find a similar gene in a few more species.
 354
 355 Use the back button on your web browser until you get to the **genome
 356 browser view** of **SMN1** as shown below.
 357
 358 .. image:: images/ucsc_genome_browser_home.png
 359    :alt: UCSC Genome Browser
 360    :align: center
 361
 362 **Click on SMN1** shown **between** the **two orange arrows** shown
 363 below.
 364
 365 .. image:: images/ucsc_gb_smn1_human_click_smn1.png
 366    :alt: Genome Browser - SMN1 (human) - Orange Arrows
 367    :align: center
 368
 369 You should find yourself at the SMN1 description page.
 370
 371 .. image:: images/ucsc_gb_smn1_description_page.png
 372    :alt: Genome Browser - SMN1 (human) - Description page
 373    :align: center
 374
 375 **Scroll down** until you get to the **Sequence section** and click on
 376 **Protein (262 aa)**.
 377
 378 .. image:: images/ucsc_gb_smn1_human_sequence.png
 379    :alt: Genome Browser - SMN1 (human) - Sequence
 380    :align: center
 381
 382 Copy the SMN1 protein sequence by highlighting it and selecting the
 383 **Edit > Copy** option from the menu.
 384
 385 .. image:: images/smn1_human_protein.png
 386    :alt: Genome Browser - SMN1 (human) - Protein
 387    :align: center
 388
 389 Press the back button on the web browser once and then scroll to the
 390 top of the page and click on the **BLAT** option on the menu bar
 391 (shown below with orange arrows).
 392
 393 .. image:: images/ucsc_gb_smn1_human_blat.png
 394    :alt: Genome Browser - SMN1 (human) - Blat
 395    :align: center
 396
 397 **Paste** in the **protein sequence** and **change** the **genome** to
 398 **mouse** as shown below and then click **submit**.
 399
 400 .. image:: images/ucsc_gb_smn1_human_blat_paste.png
 401    :alt: Genome Browser - SMN1 (human) - Blat paste protein
 402    :align: center
 403
 404 Notice that we have two hits, one of which looks pretty good at 89.9%
 405 match.
 406
 407 .. image:: images/ucsc_gb_smn1_human_blat_hits.png
 408    :alt: Genome Browser - SMN1 (human) - Blat hits
 409    :align: center
 410
 411 **Click** on the **browser** link next to the 89.9% match. Notice in
 412 the genome browser (shown below) that there is an annotated gene
 413 called SMN1 for mouse which matches the line called **your sequence
 414 from blat search**. This means we are fairly confident we found the
 415 right location in the mouse genome.
 416
 417 .. image:: images/ucsc_gb_smn1_human_blat_to_browser.png
 418    :alt: Genome Browser - SMN1 (human) - Blat to browser
 419    :align: center
 420
 421 Follow steps 1 through 3 for mouse and then repeat step 4 with the
 422 human protein sequence to find **SMN1** in the following species (if
 423 you find a match):
 424
 425  1. Rat
 426  2. Rabbit
 427  3. Dog
 428  4. Armadillo
 429  5. Elephant
 430  6. Opossum
 431  7. x_tropicalis
 432
 433 Make sure to save the extended DNA sequence and annotation file for
 434 each one.
 435
 436 Using Mussagl
 437 =============
 438
 439
 440 Launch Mussagl
 441 --------------
 442 Launch Mussagl... It should look similar to the screen shot below.
 443
 444 .. image:: images/opened.png
 445    :alt: Launch Mussa
 446    :align: center
 447
 448
 449
 450 Create/Load Analysis
 451 ----------------------
 452
 453 Currently there are three ways to load a Mussa experiment.
 454
 455  1. `Create a new analysis`_
 456  2. `Load a mussa parameter file`_ (.mupa)
 457  3. `Load an analysis`_
 458
 459 .. _createnew:
 460
 461 Create a new analysis
 462 ~~~~~~~~~~~~~~~~~~~~~
 463
 464 To create a new analysis select 'Define analysis' from the 'File'
 465 menu. You should see a dialog box similar to the one below. For this
 466 demo we will use the example sequences that come with Mussagl.
 467
 468 .. image:: images/define_analysis.png
 469    :alt: Define Analysis
 470    :align: center
 471
 472 Instructions:
 473
 474  1. **Give the experiment a name**, for this demo, we'll use
 475     'demo_w30_t20'. Mussa will create a folder with this name to store
 476     the analysis files in once it has been run.
 477
 478  2. Choose a `window size`_. For this demo **choose 30**.
 479
 480  3. Choose a threshold_... for this demo **choose 20**. See the
 481     Threshold_ section for more detailed information.
 482
 483  4. Choose the number of sequences_ you would like. For this demo
 484     **choose 3**.
 485
 486 .. image:: images/define_analysis_step1a.png
 487    :alt: Steps 1-4
 488    :align: center
 489
 490 Now click on the 'Browse' button next to the sequence input box and
 491 then select the /examples/seq/human_mck_pro.fa file. Do the same in the
 492 next two sequence input boxes selecting mouse_mck_pro.fa and
 493 rabbit_mck_pro.fa as shown below. Note that you can create annotation
 494 files using the mussa `Annotation File Format`_ to add annotations to
 495 your sequence.
 496
 497 .. image:: images/define_analysis_step2.png
 498    :alt: Choose sequences
 499    :align: center
 500
 501 Click the **create** button and in a few moments you should see
 502 something similar to the following screen shot.
 503
 504 .. image:: images/demo.png
 505    :alt: Mussagl Demo
 506    :align: center
 507
 508 This analysis is now saved in a directory called **demo_w30_t20** in
 509 the current working directory. If you close and reopen Mussagl, you
 510 can reload the saved analysis. See `Load an analysis`_ section below
 511 for details.
 512
 513
 514 Load a mussa parameter file
 515 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 516
 517 If you prefer, you can define your Mussa analysis using the Mussa
 518 parameter file. See the `Parameter File Format`_ section for details
 519 on creating a .mupa file.
 520
 521 Once you have a .mupa file created, load Mussagl and select the **File >
 522 Load Mussa Parameters** menu option. Select the .mupa file and click
 523 open.
 524
 525 .. image:: images/load_mupa_menu.png
 526    :alt: Load Mussa Parameters
 527    :align: center
 528
 529 If you would like to see an example, you can load the
 530 **mck3test.mupa** file in the examples directory that comes with
 531 Mussagl.
 532
 533 .. image:: images/load_mupa_dialog.png
 534    :alt: Load Mussa Parameters Dialog
 535    :align: center
 536
 537
 538 Load an analysis
 539 ~~~~~~~~~~~~~~~~
 540
 541 To load a previously run analysis open Mussagl and select the **File >
 542 Load Analysis** menu option. Select an analysis **directory** and
 543 click open.
 544
 545 .. image:: images/load_analysis_menu.png
 546    :alt: Load Analysis Menu
 547    :align: center
 548
 549
 550 Main Window
 551 -----------
 552
 553 Overview
 554 ~~~~~~~~
 555 .. Screen-shot with numbers showing features.
 556
 557 .. image:: images/window_overview.png
 558    :alt: Mussa Window
 559    :align: center
 560
 561 Legend:
 562
 563  1. `DNA Sequence (Black bars)`_
 564
 565  2. Annotation_
 566
 567  3. Motif_
 568
 569  4. `Conservation tracks`_
 570
 571  5. `Motif Toggle`_
 572
 573  6. `Zoom Factor`_ (Base pairs per pixel)
 574
 575  7. `Dynamic Threshold`_
 576
 577  8. `Sequence Information Bar`_
 578
 579  9. `Sequence Scroll Bar`_
 580
 581
 582 DNA Sequence (black bars)
 583 ~~~~~~~~~~~~~~~~~~~~~~~~~
 584
 585 .. image:: images/sequence_bar.png
 586    :alt: Sequence Bar
 587    :align: center
 588
 589 Each of the black bars represents one of the loaded sequences, in this
 590 case the sequence around the gene 'MCK' in human, mouse, and rabbit.
 591
 592 FIXME: Should I mention the repeats here?
 593
 594
 595 Annotation
 596 ~~~~~~~~~~
 597
 598 .. figure:: images/annotation.png
 599    :alt: Annotation
 600    :align: center
 601
 602    Annotation shown in green on sequence bar.
 603
 604
 605 Annotations can be included on any of the sequences using the `Load a
 606 mussa parameter file`_ method of loading your sequences. You can
 607 define annotations by location or using an exact sub-sequence and you
 608 may also choose any color for display of the annotation; see the
 609 `Annotation File Format`_ section for details.
 610
 611 Note: Currently there is no way to add annotations using the GUI (only
 612 via the .mupa file). We plan to add this feature in the future, but it
 613 likely will not make it into the first release.
 614
 615
 616 Motif
 617 ~~~~~
 618
 619 .. figure:: images/motif.png
 620    :alt: Motif
 621    :align: center
 622
 623    Motif shown in light blue on sequence bar.
 624
 625 The only real difference between an annotation and motif in Mussagl is
 626 that you can define motifs from within the GUI. See the `Motifs`_
 627 section for more information.
 628
 629
 630 Conservation tracks
 631 ~~~~~~~~~~~~~~~~~~~
 632
 633 .. figure:: images/conservation_tracks.png
 634    :alt: Conservation Tracks
 635    :align: center
 636
 637    Conservation tracks shown as red and blue lines between sequence
 638    bars.
 639
 640 The **red lines** between the sequence bars represent conservation
 641 between the sequences and **blue lines** represent **reverse
 642 complement** conservation. The amount of sequence conservation shown
 643 will depend on the relatedness of your sequences and the `dynamic
 644 threshold`_ you are using. Sequences with lots of repeats will cause
 645 major slowdowns in calculating the matches.
 646
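For readers unfamiliar with the term, the reverse complement of a
sequence is obtained by complementing each base and reversing the
result. The helper below is a generic sketch restricted to A, C, G
and T. It is not taken from the Mussa source.

```python
# Watson-Crick complements for the four unambiguous bases, both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return sequence.translate(_COMPLEMENT)[::-1]
```

A blue line therefore connects a region in one sequence to a region
that matches it on the opposite strand of the other sequence.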
 647
 648 Motif Toggle
 649 ~~~~~~~~~~~~
 650
 651 .. image:: images/motif_toggle.png
 652    :alt: Motif Toggle
 653    :align: center
 654
 655 Toggles motifs on and off. This will not turn on and off annotations.
 656
 657 Note: As of the current build (#200), this feature hasn't been
 658 implemented.
 659
 660
 661 Zoom Factor
 662 ~~~~~~~~~~~
 663
 664 .. image:: images/zoom_factor.png
 665    :alt: Zoom Factor
 666    :align: center
 667
 668 The zoom factor is the number of base pairs represented per pixel.
 669 When you zoom in far enough, the display will switch from showing a
 670 black bar, representing the sequence, to the actual sequence
 671 (well, an ASCII representation of the sequence).
 672
 673
 674 Dynamic Threshold
 675 ~~~~~~~~~~~~~~~~~
 676
 677 .. image:: images/dynamic_threshold.png
 678    :alt: Dynamic Threshold
 679    :align: center
 680
 681 You can dynamically change the threshold for how strong a match you
 682 consider the conservation to be with one of two options:
 683
 684  1. Number of base pair matches out of window size.
 685
 686  2. Percent base pair conservation.
 687
 688 See the Threshold_ section for more information.
 689
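The two options express the same cut-off in different units. The
sketch below shows the conversion for a given window size. The
function names are hypothetical, and rounding up to a whole number of
matches is an assumption; Mussagl may round differently.

```python
import math


def matches_to_percent(matches, window_size):
    """Express a match count as percent identity within the window."""
    return 100.0 * matches / window_size


def percent_to_matches(percent, window_size):
    """Smallest whole number of matches that reaches the given percentage."""
    return math.ceil(percent * window_size / 100.0)
```

With a window size of 30, a threshold of 20 matches corresponds to
roughly 66.7 percent conservation.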
 690
 691 Sequence Information Bar
 692 ~~~~~~~~~~~~~~~~~~~~~~~~
 693
 694 .. image:: images/seq_info_bar.png
 695    :alt: Sequence Information Bar
 696    :align: center
 697
 698 The sequence information bars can be found on the left and right sides
 699 of the Mussagl window. Next to each sequence you will find the following
 700 information:
 701
 702  1. Species (If it has been defined)
 703  2. Total Size of Sequence
 704  3. Current base pair position
 705
 706
 707 Sequence Scroll Bar
 708 ~~~~~~~~~~~~~~~~~~~
 709
 710 .. image:: images/scroll_bar.png
 711    :alt: Sequence Scroll Bar
 712    :align: center
 713
 714 The scroll bar allows you to scroll through the sequence which is
 715 useful when you have zoomed in using the `zoom factor`_.
 716
 717
 718 Annotations / Motifs
 719 --------------------
 720
 721 Annotations
 722 ~~~~~~~~~~~
 723
 724 Currently annotations can be added to a sequence using the mussa
 725 `annotation file format`_ and can be loaded by selecting the
 726 annotation file when defining a new analysis (see `Create a new
 727 analysis`_ section) or by defining a .mupa file pointing to your
 728 annotation file (see `Load a mussa parameter file`_ section).
 729
 730 Motifs
 731 ~~~~~~
 732
 733 Load Motifs from File
 734 *********************
 735
 736 It is possible to load motifs from a file which was saved from a
 737 previous run or by defining your own motif file. See the `Motif File
 738 Format`_ section for details.
 739
 740 To load a motif file, select **Load Motif List** item from the
 741 **File** menu and select a motif list file.
 742
 743 .. image:: images/load_motif.png
 744    :alt: Load Motif List
 745    :align: center
 746
 747
 748 Save Motifs to File
 749 *******************
 750
 751 Note: Currently not implemented
 752
 753
 754 Motif Dialog
 755 ************
 756
 757 **New Features:**
 758
 759 Build 276
 760  * Allow for toggling individual motifs on and off.
 761
 762 Build 269
 763  * Field added for naming motifs.
 764
 765 Mussa has the ability to find lab motifs using the `IUPAC Nucleotide
 766 Code`_ for defining a motif. To define a motif, select **Edit > Edit
 767 Motifs** menu item as shown below.
 768
 769 .. image:: images/view_edit_motifs.png
 770    :alt: "View > Edit Motifs" Menu
 771    :align: center
 772
 773 You will see a dialog box appear with a "set motifs" button and 10
 774 rows for defining motifs and the color that will be displayed on the
 775 sequence. By default all 10 motifs start off with white as the
 776 color. In the image below, I changed the color from white to blue to
 777 make it easier to see. The first text box is for the motif and the
 778 second box is for the name of the motif. The check box defines whether
 779 the motif is displayed or not.
 780
 781 .. image:: images/motif_dialog_start.png
 782    :alt: Motif Dialog
 783    :align: center
 784
 785 Now let's make a motif **'AT[C or G]CT'**. Using the `IUPAC Nucleotide
 786 Code`_, type in **'ATSCT'** into the first box and 'My Motif' for the
 787 name in the second box as shown below.
 788
 789 .. image:: images/motif_dialog_enter_motif.png
 790    :alt: Enter Motif
 791    :align: center
 792
 793 Now choose a color for your motif by clicking on the colored area to
 794 the left of the motif. In the image above, you would click on the blue
 795 square, but by default the squares will be white. Remember to choose a
 796 color that will show up well with a black bar as the background.
 797
 798 .. image:: images/color_chooser.png
 799    :alt: Color Chooser
 800    :align: center
 801
 802 Once you have selected the color for your motif, click on the 'set
 803 motifs' button. Notice that any matches Mussa finds to your motif will
 804 now show up in the main Mussagl window.
 805
 806 Before Motif:
 807
 808 .. image:: images/motif_dialog_bar_before.png
 809    :alt: Sequence bar before motif
 810    :align: center
 811
 812 After Motif:
 813
 814 .. image:: images/motif_dialog_bar_after.png
 815    :alt: Sequence bar after motif
 816    :align: center
 817
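To see which sequences a motif such as 'ATSCT' stands for, the IUPAC
codes can be expanded into a regular expression. The table below
lists the standard IUPAC nucleotide codes; the function names are
hypothetical, and the sketch searches the forward strand only. How
Mussa itself performs the search is not shown here.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif such as 'ATSCT' into a compiled regex."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]  # raises KeyError for characters outside the table
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


def find_motif(motif, sequence):
    """Return 0-based start positions of every (possibly overlapping) match."""
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile(f"(?=({motif_to_regex(motif).pattern}))")
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

For example, 'ATSCT' expands to ``AT[CG]CT``, which matches both
ATCCT and ATGCT.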
 818
 819 View Mussa Alignments
 820 ---------------------
 821
 822 Mussagl allows you to zoom in on Mussa alignments by selecting the set
 823 of alignment(s) of interest. To do this, move the mouse near the
 824 alignment you are interested in viewing and then **PRESS** and
 825 **HOLD** the **LEFT mouse button** and **drag the mouse** to the other
 826 side of the conservation track so that you see a bounding box
 827 overlapping the alignment(s) of interest and then **let go** of the
 828 *left mouse button*.
 829
 830 In the example below, I started by left-clicking on the area marked by
 831 a red dot (upper left corner of bounding box) and dragging the mouse to
 832 the area marked by a blue dot (lower right corner of the bounding box)
 833 and letting go of the left mouse button.
 834
 835 .. image:: images/select_sequence.png
 836    :alt: Select Sequence
 837    :align: center
 838
 839 All of the lines which were not selected should be washed out as shown
 840 below:
 841
 842 .. image:: images/washed_out.png
 843    :alt: Tracks washed out
 844    :align: center
 845
 846 With a selection made, go to the **View** menu and select **View mussa alignment**.
 847
 848 .. image:: images/view_mussa_alignment.png
 849    :alt: View mussa alignment
 850    :align: center
 851
 852 You should see the alignment at the base-pair level as shown below.
 853
 854 .. image:: images/mussa_alignment.png
 855    :alt: Mussa alignment
 856    :align: center
 857
 858
 859 Sub-analysis
 860 ------------
 861
 862 To run a sub-analysis **highlight** a section of sequence and *right
 863 click* on it and select **Add to subanalysis**. Do the same for the
 864 sequences shown in orange in the screenshot below. Note that you **are
 865 NOT limited** to one subsequence per sequence; you may select several
 866 subsequences from the same sequence.
 867
 868 .. image:: images/subanalysis_select_seqs.png
 869    :alt: Subanalysis sequence selection
 870    :align: center
 871
 872 Once you have added your sequences for subanalysis, choose a `window size`_ and `threshold`_ and click **Ok**.
 873
 874 .. image:: images/subanalysis_dialog.png
 875    :alt: Subanalysis Dialog
 876    :align: center
 877
 878 A new Mussa window will appear with the subanalysis of your sequences
 879 once it's done running. This may take a while if you selected large
 880 chunks of sequence with a loose threshold.
 881
 882 .. image:: images/subanalysis_done.png
 883    :alt: Subanalysis complete
 884    :align: center
 885
 886
 887 Copying sequence to clipboard
 888 -----------------------------
 889
 890 To copy a sequence to the clipboard, highlight a section of sequence,
 891 as shown in the screen shot below, and do one of the following:
 892
 893  * Select **Copy as FASTA** from the **Edit** menu.
 894  * **Right-Click (Left-click + Apple/Command Key on Mac)** on the highlighted sequence and select **Copy as FASTA**.
 895  * Press **Ctrl + C (on PC)** or **Apple/Command Key + C (on Mac)** on the keyboard.
 896
 897 .. image:: images/copy_sequence.png
 898    :alt: Copy sequence
 899    :align: center
 900
 901 Saving to an Image
 902 ---------------------------------
 903
 904 FIXME: Need to write this section
 905
 906
 907 Detailed Information
 908 --------------------
 909
 910 Threshold
 911 ~~~~~~~~~
 912
 913 The threshold of an analysis is the minimum number of base pair matches
 914 that a window must meet in order to be kept as a match. Note that you can
 915 vary the threshold from within Mussagl. For example, if you choose a
 916 `window size`_ of **30** and a **threshold** of **20** the mussa nway
 917 transitive algorithm will store all matches that are 20 out of 30 bp
 918 matches or better and pass it on to Mussagl. Mussagl will then allow
 919 you to dynamically choose a threshold from 20 to 30 base pairs. A
 920 threshold of 30 bps would only show 30 out of 30 bp matches. A
 921 threshold of 20 bps would show all matches of 20 out of 30 bps or
 922 better. If you would like to see results for matches lower than 20 out
 923 of 30, you will need to rerun the analysis with a lower threshold.
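
The relationship between window size, stored threshold and dynamic
threshold can be sketched as follows. This is a brute-force
illustration of a single pairwise, forward-strand, ungapped window
comparison. It is not Mussa's N-way transitive algorithm, and the
function names are hypothetical.

```python
def window_matches(seq_a, seq_b, window=30, threshold=20):
    """Compare every window of seq_a against every window of seq_b.

    Returns (pos_a, pos_b, score) for each ungapped window pair whose number
    of identical positions is at least `threshold`. Brute force: fine for a
    demonstration, far too slow for real sequences.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            score = sum(
                1 for x, y in zip(seq_a[i:i + window], seq_b[j:j + window])
                if x == y
            )
            if score >= threshold:
                hits.append((i, j, score))
    return hits


def apply_dynamic_threshold(hits, threshold):
    """Keep only stored hits that still pass a stricter threshold."""
    return [hit for hit in hits if hit[2] >= threshold]
```

Because only hits at or above the analysis threshold are stored,
``apply_dynamic_threshold`` can tighten the cut-off but never loosen
it, which mirrors the need to rerun the analysis for a lower
threshold.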
 924
 925 Window Size
 926 ~~~~~~~~~~~
 927
 928 The typical sizes people tend to choose are between 20 and 30. You
 929 will likely need to experiment with this setting depending on your
 930 needs and input sequence.
 931
 932
 933 Sequences
 934 ~~~~~~~~~
 935
 936 Mussa reads in sequences which are formatted in the FASTA_
 937 format. Mussa may take a long time to run (>10 minutes) if the total
 938 bp length nears 280 Kb. Once Mussa has run, you can reload
 939 previously run analyses.
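
Before starting a long run it can be useful to check how much
sequence you are about to load. The sketch below counts the bases in
FASTA-formatted text. ``fasta_lengths`` is a hypothetical helper, and
it takes the first word of each header line as the record name.

```python
def fasta_lengths(text):
    """Return {record_name: sequence_length} for FASTA-formatted text."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the record name is the first word after '>'.
            name = line[1:].split()[0] if len(line) > 1 else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths
```

Summing the values over all of your input files gives the total
length to compare against the figure mentioned above.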
 940
 941 FIXME: We have learned more about how much sequence and how many to
 942 put in Mussagl, this information should be documented here.
 943
 944
 945 Mussa File Formats
 946 ------------------
 947
 948 .. _param:
 949
 950 Parameter File Format
 951 ~~~~~~~~~~~~~~~~~~~~~
 952
 953 **File Format (.mupa):**
 954
 955 ::
 956
 957   # name of analysis directory and stem for associated files
 958   ANA_NAME <analysis_name>
 959
 960   # if APPEND vars true, a _wXX and/or _tYY added to analysis name
 961   # where XX = WINDOW and YY = THRESHOLD
 962   # Highly recommended with use of command line override of WINDOW or THRESHOLD
 963   APPEND_WIN <true/false>
 964   APPEND_THRES <true/false>
 965
 966   # how many sequences are being analyzed
 967   SEQUENCE_NUM <num>
 968
 969   # first sequence info
 970   SEQUENCE <FASTA_file_path>
 971   ANNOTATION <annotation_file_path>
 972   SEQ_START <sequence_start>
 973
 974   # the second sequence info
 975   SEQUENCE <FASTA_file_path>
 976   # ANNOTATION <annotation_file_path>
 977   SEQ_START <sequence_start>
 978   # SEQ_END <sequence_end>
 979
 980   # third sequence info
 981   SEQUENCE <FASTA_file_path>
 982   # ANNOTATION <annotation_file_path>
 983
 984   # analyzes parameters: command line args -w -t will override these
 985   WINDOW <num>
 986   THRESHOLD <num>
 987
 988 .. csv-table:: Parameter File Options:
 989    :header: "Option Name", "Value", "Default", "Required", "Description"
 990    :widths: 30 30 30 30 60
 991
 992    "ANA_NAME", "string", "N/A", "true", "Name of analysis (Also
 993    name of directory where analysis will be saved."
 994    "APPEND_WIN", "true/false", "?", "?", "Appends _w## to ANA_NAME"
 995    "APPEND_THRES", "true or false", "?", "?", "Appends _t## to ANA_NAME"
 996    "SEQUENCE_NUM", "integer", "N/A", "true", "The number of sequences
 997    to analyze"
 998    "SEQUENCE", "/FASTA/filepath.fa", "N/A", "true", "Must define one
 999    sequence per SEQUENCE_NUM."
1000    "ANNOTATION", "/annotation/filepath.txt", "N/A", "false", "Optional
1001    annotation file. See `annotation file format`_ section for more
1002    information."
1003    "SEQ_START", "integer", "1", "false", "Optional index into FASTA file"
1004    "SEQ_END", "integer", "1", "false", "Optional index into FASTA file"
1005    "WINDOW", "integer", "N/A", "true", "`Window Size`_"
1006    "THRESHOLD", "integer", "N/A", "true", "`Threshold`_"
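
As an illustration of the layout above, the following Python sketch
reads a parameter file into a dictionary. It is not the parser used by
Mussa. It assumes, based only on the example layout, that ``ANNOTATION``,
``SEQ_START`` and ``SEQ_END`` apply to the most recent ``SEQUENCE`` line;
the function names and the file names used in any test input are
hypothetical.

.. code-block:: python

   PER_SEQUENCE_KEYS = ("ANNOTATION", "SEQ_START", "SEQ_END")


   def parse_mupa(text):
       """Parse .mupa text into a dict. Values are kept as strings.
       Per-sequence options are grouped under the 'sequences' key."""
       params = {"sequences": []}
       for raw in text.splitlines():
           line = raw.strip()
           if not line or line.startswith("#"):
               continue  # skip blank lines and comments
           parts = line.split(None, 1)
           key = parts[0]
           value = parts[1].strip() if len(parts) > 1 else ""
           if key == "SEQUENCE":
               params["sequences"].append({"SEQUENCE": value})
           elif key in PER_SEQUENCE_KEYS:
               if not params["sequences"]:
                   raise ValueError(key + " appears before any SEQUENCE")
               params["sequences"][-1][key] = value
           else:
               params[key] = value
       return params


   def check_sequence_count(params):
       """True if SEQUENCE_NUM agrees with the number of SEQUENCE lines."""
       return int(params["SEQUENCE_NUM"]) == len(params["sequences"])

Running ``check_sequence_count`` before starting an analysis catches
the common mistake of adding a ``SEQUENCE`` entry without updating
``SEQUENCE_NUM``.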

.. _annot:

Annotation File Format
~~~~~~~~~~~~~~~~~~~~~~

The first line in the file is the sequence name. Each line thereafter
is a **space**-separated annotation.

New as of build 198:

 * The annotation format now supports FASTA sequences embedded in the
   annotation file, as shown in the format example below. Mussagl will
   take each embedded sequence and look for an exact match of it in
   your sequences. If a match is found, it will be labeled with the
   name from the FASTA header.

Format:

::

  <species/sequence_name>
  <start> <stop> <annotation_name> <annotation_type>
  <start> <stop> <annotation_name> <annotation_type>
  <start> <stop> <annotation_name> <annotation_type>
  <start> <stop> <annotation_name> <annotation_type>
  >FASTA Header
  ACTGACTGACGTACGTAGCTAGCTAGCTAGCACG
  ACGTACGTACGTACGTAGCTGTCATACGCTAGCA
  TGCGTAGAGGATCTCGGATGCTAGCGCTATCGAT
  ACGTACGGCAGTACGCGGTCAGA
  <start> <stop> <annotation_name> <annotation_type>
  ...

Example:

::

  Mouse
  251 500 Glorp Glorptype
  751 1000 Glorp Glorptype
  1251 1500 Glorp Glorptype
  >My favorite DNA sequence
  GATTACA
  1751 2000 Glorp Glorptype
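
The format can be read with a few lines of Python, sketched below as an
illustration rather than as Mussagl's own parser. The sketch assumes
that a line with four fields whose first two fields are integers is an
annotation, and that any other line following a ``>`` header is part of
the embedded sequence. The function name is hypothetical.

.. code-block:: python

   def parse_annotations(text):
       """Return (sequence_name, annotations, embedded_sequences).

       annotations is a list of (start, stop, name, type) tuples;
       embedded_sequences is a list of (header, sequence) tuples."""
       lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
       if not lines:
           raise ValueError("empty annotation file")
       sequence_name = lines[0]
       annotations = []
       embedded = []      # each entry is [header, sequence]
       in_fasta = False
       for line in lines[1:]:
           if line.startswith(">"):
               embedded.append([line[1:].strip(), ""])
               in_fasta = True
               continue
           fields = line.split()
           if (len(fields) == 4 and fields[0].isdigit()
                   and fields[1].isdigit()):
               annotations.append(
                   (int(fields[0]), int(fields[1]), fields[2], fields[3]))
               in_fasta = False
           elif in_fasta:
               embedded[-1][1] += line  # sequence may span several lines
           else:
               raise ValueError("unrecognized line: " + repr(line))
       return sequence_name, annotations, [tuple(e) for e in embedded]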


.. _motif_file_format:

Motif File Format
~~~~~~~~~~~~~~~~~

Format:

::

  <motif> <red> <green> <blue>

Example:

::

  GGCC 0.0 1 1
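
A line in this format can be split into its parts with the Python
sketch below. The example above suggests that the color components are
numbers between 0 and 1, but this manual does not state the allowed
range, so the sketch only converts them to floats and does not validate
them. The function name is hypothetical.

.. code-block:: python

   def parse_motif_line(line):
       """Split '<motif> <red> <green> <blue>' into
       (motif, (red, green, blue)) with float color components."""
       fields = line.split()
       if len(fields) != 4:
           raise ValueError("expected 4 fields, got %d" % len(fields))
       motif, red, green, blue = fields
       return motif, (float(red), float(green), float(blue))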



IUPAC Nucleotide Code
~~~~~~~~~~~~~~~~~~~~~

For your convenience, below is a table of the IUPAC Nucleotide Code.

The following table is Table 1 from "Nomenclature for Incompletely
Specified Bases in Nucleic Acid Sequences", which can be found at
http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html.

======  =================  ===================================
Symbol  Meaning            Origin of designation
======  =================  ===================================
G       G                  Guanine
A       A                  Adenine
T       T                  Thymine
C       C                  Cytosine
R       G or A             puRine
Y       T or C             pYrimidine
M       A or C             aMino
K       G or T             Keto
S       G or C             Strong interaction (3 H bonds)
W       A or T             Weak interaction (2 H bonds)
H       A or C or T        not-G, H follows G in the alphabet
B       G or T or C        not-A, B follows A
V       G or C or A        not-T (not-U), V follows U
D       G or A or T        not-C, D follows C
N       G or A or T or C   aNy
======  =================  ===================================
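
To show how the codes in this table expand, the Python sketch below
turns an IUPAC motif into a regular expression and reports where it
occurs in a sequence. It only illustrates the table and is not a
description of how Mussagl searches for motifs. It looks at the given
strand only, and the function names are hypothetical.

.. code-block:: python

   import re

   # Symbol -> set of bases, taken from the table above.
   IUPAC = {
       "G": "G", "A": "A", "T": "T", "C": "C",
       "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
       "S": "GC", "W": "AT", "H": "ACT", "B": "GTC",
       "V": "GCA", "D": "GAT", "N": "GATC",
   }


   def motif_to_pattern(motif):
       """Expand an IUPAC motif into a regular expression string."""
       parts = []
       for symbol in motif.upper():
           bases = IUPAC[symbol]  # KeyError for an unknown symbol
           parts.append(bases if len(bases) == 1 else "[" + bases + "]")
       return "".join(parts)


   def find_motif(motif, sequence):
       """Return the 0-based start of every occurrence of the motif,
       including overlapping ones."""
       # The lookahead lets the regex engine report overlapping hits.
       lookahead = "(?=" + motif_to_pattern(motif) + ")"
       return [m.start() for m in re.finditer(lookahead, sequence.upper())]

For example, ``motif_to_pattern("GN")`` gives ``G[GATC]``, so the motif
matches a G followed by any base.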


.. Define links below
   ------------------

.. _GPL: http://www.opensource.org/licenses/gpl-license.php
.. _wiki: http://mussa.caltech.edu
.. _build: http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild
.. _FASTA: http://en.wikipedia.org/wiki/fasta_format
.. _wpDnaMotif: http://en.wikipedia.org/wiki/DNA_motif