From: Brandon King Date: Sat, 28 Oct 2006 00:15:33 +0000 (+0000) Subject: Mussagl Manual: Updates for Docs 1.0 X-Git-Url: http://woldlab.caltech.edu/gitweb/?p=mussa.git;a=commitdiff_plain;h=36baa4926b1b0955a7330ad11cc98e449d0f1ce4 Mussagl Manual: Updates for Docs 1.0 * Moved UCSC tutorial toward end, but linked to from Obtaining Data section. * Added animated gif under dynamic threshold section. * Documented command line options. (tickets: #167, #225) * Mentioned Python existing in Mussagl. (tickets: #167, #227) * Updated repeats documentation. (tickets: #167) * Updated .mupa file format description. * Added note about .mupa comment character. * Added a create .mupa section for UCSC tutorial. (ticket:164) * Linked to Diane's OverlappingWindow wiki page (ticket:137) --- diff --git a/doc/manual/images/smn1_dir_structure.png b/doc/manual/images/smn1_dir_structure.png new file mode 100644 index 0000000..18f6fdc Binary files /dev/null and b/doc/manual/images/smn1_dir_structure.png differ diff --git a/doc/manual/images/threshold_change.gif b/doc/manual/images/threshold_change.gif new file mode 100644 index 0000000..2dba416 Binary files /dev/null and b/doc/manual/images/threshold_change.gif differ diff --git a/doc/manual/mussagl_manual.rst b/doc/manual/mussagl_manual.rst index 5e551e7..ed4642d 100644 --- a/doc/manual/mussagl_manual.rst +++ b/doc/manual/mussagl_manual.rst @@ -5,9 +5,9 @@ Mussagl Manual Brandon W. King --------------- -Last updated: Oct 20th, 2006 +Last updated: Oct 27th, 2006 -Updated to Mussagl build: (In process to 424) +Documentation for Mussagl v1.0 .. Things to add @@ -162,1101 +162,1250 @@ __ wiki_ Obtaining Input Data ==================== -If you already have your data, you can skip ahead to the the `Using +If you would like help obtaining data for use with Mussagl, you can +skip ahead to the `Obtaining Input Data - Continued`_ section. + +If you would like a tour of the software, continue with the `Using Mussagl`_ section. 
-Let's say you have a gene of interest called 'SMN1' and you want to -know how the sequence surrounding the gene in multiple species is -conserved. Guess what, that's what we are going to do, retrieve the -DNA sequence for SMN1 and prepare it for using in Mussa. -For more information about SMN1 visit `NCBI's OMIM -`_. +Using Mussagl +============= -The SMN1 data retrieved in this section can be downloaded from the -`Mussa Example Data -`_ page if -you prefer to skip this section of the manual. +Launch Mussagl +-------------- +Launch Mussagl... It should look similar to the screen shot below. -UCSC Genome Browser Method --------------------------- +.. image:: images/opened.png + :alt: Launch Mussa + :align: center -There are many methods of retrieving DNA sequence, but for this -example we will retrieve SMN1 through the UCSC genome browser located -at http://genome.ucsc.edu/. -.. image:: images/ucsc_genome_browser_home.png - :alt: UCSC Genome Browser - :align: center +Create/Load Analysis +---------------------- -Step 1 - Find SMN1 -~~~~~~~~~~~~~~~~~~ +Currently there are three ways to load a Mussa experiment. -The first step in finding SMN1 is to use the **Gene Sorter** menu -option which I have highlighted in orange below: + 1. `Create a new analysis`_ + 2. `Load a mussa parameter file`_ (.mupa) + 3. `Load an analysis`_ -.. image:: images/ucsc_menu_bar_gene_sorter.png - :alt: Gene Sorter Menu Option - :align: center +.. _createnew: -Gene Sorter page: +Create a new analysis +~~~~~~~~~~~~~~~~~~~~~ -.. image:: images/ucsc_gene_sorter.png - :alt: Gene Sorter +To create a new analysis select 'Define analysis' from the 'File' +menu. You should see a dialog box similar to the one below. For this +demo we will use the example sequences that come with Mussagl. + +.. image:: images/define_analysis.png + :alt: Define Analysis :align: center -We will start by looking for SMN1 in the **Human Genome** and **sorting by name similarity**. +Instructions: -.. 
image:: images/ucsc_gs_sort_name_sim.png - :alt: Gene Sorter - Name Similarity - :align: center + 1. **Give the experiment a name**, for this demo, we'll use + 'demo_w30_t20'. Mussa will create a folder with this name to store + the analysis files in once it has been run. -After you have selected **Human Genome** and **sorting by name similarity**, type *SMN1* into the search box. + 2. Choose a threshold_... for this demo **choose 20**. See the + Threshold_ section for more detailed information. -.. image:: images/ucsc_gs_smn1.png - :alt: Gene - :align: center + 3. Choose a `window size`_. For this demo **choose 30**. -Press **Go!** and you should see the following page: -.. image:: images/ucsc_gs_found.png - :alt: Found SMN1 + 4. Choose the number of sequences_ you would like. For this demo + **choose 3**. + +.. image:: images/define_analysis_step1a.png + :alt: Steps 1-4 :align: center -Click on **SMN1** and you will be taking the gene expression atlas -page. +First enter the species name of "Human" in the first "Species" text +box. Now click on the 'Browse' button next to the sequence input box +and then select /examples/seq/human_mck_pro.fa file. Do the same in +the next two sequence input boxes selecting mouse_mck_pro.fa and +rabbit_mck_pro.fa as shown below. Make sure to give them a species +name as well. Note that you can create annotation files using the +mussa `Annotation File Format`_ to add annotations to your sequence. -.. image:: images/ucsc_gs_genome_position.png - :alt: Gene expression atlas +.. image:: images/define_analysis_step2.png + :alt: Choose sequences :align: center -Click on **chr5 70,270,558** found in the **SMN1 row**, **Genome -position column**. - -Now we have found the location of SMN1 on human! +Click the **create** button and in a few moments you should see +something similar to the following screen shot. -.. image:: images/ucsc_gb_smn1_human.png - :alt: Genome Browser - SMN1 (human) +.. 
image:: images/demo.png + :alt: Mussagl Demo :align: center +By default your analysis is NOT saved. If you try to close an analysis +without saving, you will be prompted with a dialog box asking you if +you would like to save your analysis. See the `Saving`_ section for +details on saving your analysis. When saving, choose a directory and +give the analysis the name **demo_w30_t20**. If you close and reopen +Mussagl, you will then be able to load the saved analysis. See the `Load +an analysis`_ section below for details. -Step 2 - Download CDS/UTR sequence for annotations -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Since we have found **SMN1**, this would be a convenient time to extract -the DNA sequence for the CDS and UTRs of the gene to use it as an -annotation_ in Mussa. +Load a mussa parameter file +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -**Click on SMN1** shown **between** the **two orange arrows** shown -below. +If you prefer, you can define your Mussa analysis using the Mussa +parameter file. See the `Parameter File Format`_ section for details +on creating a .mupa file. -.. image:: images/ucsc_gb_smn1_human_click_smn1.png - :alt: Genome Browser - SMN1 (human) - Orange Arrows +Once you have a .mupa file created, load Mussagl and select the **File > +Create Analysis from File** menu option. Select the .mupa file and click +open. + +.. image:: images/load_mupa_menu.png + :alt: Load Mussa Parameters :align: center -You should find yourself at the SMN1 description page. +If you would like to see an example, you can load the +**mck3test.mupa** file in the examples directory that comes with +Mussagl. -.. image:: images/ucsc_gb_smn1_description_page.png - :alt: Genome Browser - SMN1 (human) - Description page +.. image:: images/load_mupa_dialog.png + :alt: Load Mussa Parameters Dialog :align: center -**Scroll down** until you get to the **Sequence section** and click on -**Genomic (chr5:70,256,524-70,284,592)**. -.. 
image:: images/ucsc_gb_smn1_human_sequence.png - :alt: Genome Browser - SMN1 (human) - Sequence - :align: center +Load an analysis +~~~~~~~~~~~~~~~~ -You should now be at the **Genomic sequence near gene** page: +To load a previously run analysis open Mussagl and select the **File > +Open Existing Analysis** menu option. Select an analysis **directory** and +click open. -.. image:: images/ucsc_gb_smn1_human_get_genomic_sequence.png - :alt: Genome Browser - SMN1 (human) - Get genomic sequence +.. image:: images/load_analysis_menu.png + :alt: Load Analysis Menu :align: center -Make the following changes (highlighted in orange in the screenshot -below): - - 1. UNcheck **introns**. - (We only want to annotate CDS and UTRs.) - 2. Select **one FASTA record** per **region**. - (Mussa needs each CDS and UTR represented by one FASTA record per CDS/UTR). - 3. Select **CDS in upper case, UTR in lower case.** -.. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_diff.png - :alt: Genome Browser - SMN1 (human) - Get genomic sequence setup - :align: center +Main Window +----------- -Now click the **submit** button. You will then see a FASTA file with -many FASTA records representing the CDS and UTRS. +Overview +~~~~~~~~ +.. Screen-shot with numbers showing features. -.. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_submit.png - :alt: Genome Browser - SMN1 (human) - CDS/UTR sequence +.. image:: images/window_overview.png + :alt: Mussa Window :align: center -Now you need to save the FASTA records to a **text file**. If you are -using **Firefox** or **Internet Explorer 6+** click on the **File > -Save As** menu option. +Legend: -**IMPORTANT:** Make sure you select **Text Files** and **NOT**, I -repeat **NOT Webpage Complete** (see screenshot below.) + 1. `DNA Sequence (Black bars)`_ + + 2. Annotation_ -Type in **smn1_human_annot.txt** for the file name. + 3. Motif_ -.. 
image:: images/smn1_human_annot.png - :alt: Genome Browser - SMN1 (human) - sequence annotation file - :align: center + 4. `Red conservation tracks`_ -**IMPORTANT:** You should open the file with a text editor and make - sure **no HTML** was saved... If you find any HTML markup, delete - the markup and save the file. + 5. `Blue conservation tracks`_ -Now we are going to **modify the file** you just saved to **add the -name of the species** to the **annotation file**. All you have to do -is **add a new line** at the **top of the file** with the word **'Human'** as -shown below: + 6. `Zoom Factor`_ (Base pairs per pixel) -.. image:: images/smn1_human_annot_plus_human.png - :alt: Genome Browser - SMN1 (human) - sequence annotation file - :align: center + 7. `Dynamic Threshold`_ -You can add more annotations to this file if you wish. See the -`annotation file format`_ section for details of the file format. By -including FASTA records in the annotation_ file, Mussa searches your -DNA sequence for an exact match of the sequence in the annotation_ -file. If found, it will be marked as an annotation_ within Mussa. + 8. `Sequence Information Bar`_ + 9. `Sequence Scroll Bar`_ -Step 3 - Download gene and upstream/downstream sequence -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Use the back button in your web browser to get back the **genome -browser view** of **SMN1** as shown below. +DNA Sequence (black bars) +~~~~~~~~~~~~~~~~~~~~~~~~~ -.. image:: images/ucsc_gb_smn1_human.png - :alt: Genome Browser - SMN1 (human) +.. image:: images/sequence_bar.png + :alt: Sequence Bar :align: center -There are two options for getting additional sequence around your -gene. The more complex way is to zoom out so that you have the -sequence you want being shown in the genome browser and then follow -the directions for the following method. +Each of the black bars represents one of the loaded sequences, in this +case the sequence around the gene 'MCK' in human, mouse, and rabbit. 
-The second option, which we will choose, is to leave the genome -browser zoomed exactly at the location of SMN1 and click on the -**DNA** option on the menu bar (shown with orange arrows in the -screenshot below.) -.. image:: images/ucsc_gb_smn1_human_dna_option.png - :alt: Genome Browser - SMN1 (human) - DNA Option +Annotation +~~~~~~~~~~ + +.. figure:: images/annotation.png + :alt: Annotation :align: center + + Annotation shown in green on sequence bar. -Now in the **get dna in window** page, let's add an arbitrary amount of -extra sequence on to each end of the gene, let's say 5000 base pairs. -.. image:: images/ucsc_gb_smn1_human_get_dna.png - :alt: Genome Browser - SMN1 (human) - Get DNA - :align: center +Annotations can be included on any of the sequences using the `Load a +mussa parameter file`_ or `Create a new analysis`_ method of loading +your sequences. You can define annotations by location or using an +exact sub-sequence or a FASTA sequence of the section of DNA you wish +to annotate. See the `Annotation File Format`_ section for details. -Click the **get DNA** button. -.. image:: images/ucsc_gb_smn1_human_dna.png - :alt: Genome Browser - SMN1 (human) - DNA +Motif +~~~~~ + +.. figure:: images/motif.png + :alt: Motif :align: center -Save the DNA sequence to a text file called 'smn1_human_dna.fa' as we -did in step 2 with the annotation file. + Motif shown in light blue on sequence bar. -**IMPORTANT:** Make sure the file is saved as a text file and not an -HTML file. Open the file with a text editor and remove any HTML markup -you find. +The only real difference between an annotation and motif in Mussagl is +that you can define motifs and choose a color from within the GUI. See +the `Motifs`_ section for more information. -Step 4 - Same/similar/related gene other species. -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Red conservation tracks +~~~~~~~~~~~~~~~~~~~~~~~ -What good is a multiple sequence alignment viewer without multiple -sequences? 
Let'S find a similar gene in a few more species. +.. figure:: images/conservation_tracks.png + :alt: Conservation Tracks + :align: center + + Conservation tracks shown as red and blue lines between sequence + bars. -Use the back button on your web browser until you get the **genome -browser view** of **SMN1** as shown below. +The **red lines** between the sequence bars represent conservation +between the sequences (i.e. not reverse complement matches). -.. image:: images/ucsc_genome_browser_home.png - :alt: UCSC Genome Browser - :align: center +The amount of sequence conservation shown will depend on how closely your +sequences are related and the `dynamic threshold`_ you are using. -**Click on SMN1** shown **between** the **two orange arrows** shown -below. -.. image:: images/ucsc_gb_smn1_human_click_smn1.png - :alt: Genome Browser - SMN1 (human) - Orange Arrows +Blue conservation tracks +~~~~~~~~~~~~~~~~~~~~~~~~ + +.. figure:: images/conservation_tracks.png + :alt: Conservation Tracks :align: center + + Conservation tracks shown as red and blue lines between sequence + bars. -You should find yourself at the SMN1 description page. +**Blue lines** represent **reverse complement** conservation relative +to the sequence attached to the top of the blue line. -.. image:: images/ucsc_gb_smn1_description_page.png - :alt: Genome Browser - SMN1 (human) - Description page - :align: center +The amount of sequence conservation shown will depend on how closely your +sequences are related and the `dynamic threshold`_ you are using. -**Scroll down** until you get to the **Sequence section** and click on -**Protein (262 aa)**. -.. image:: images/ucsc_gb_smn1_human_sequence.png - :alt: Genome Browser - SMN1 (human) - Sequence +Zoom Factor +~~~~~~~~~~~ + +.. image:: images/zoom_factor.png + :alt: Zoom Factor :align: center -Copy the SMN1 protein seqeunce by highlighting it and selecting **Edit -> Copy** option from the menu. 
+The zoom factor represents the number of base pairs represented per +pixel. When you zoom in far enough the sequence will switch from +seeing a black bar, representing the sequence, to the actual sequence +(well, ASCII representation of sequence). -.. image:: images/smn1_human_protein.png - :alt: Genome Browser - SMN1 (human) - Protein - :align: center -Press the back button on the web browser once and then scroll to the -top of the page and click on the **BLAT** option on the menu bar -(shown below with orange arrows). +Dynamic Threshold +~~~~~~~~~~~~~~~~~ -.. image:: images/ucsc_gb_smn1_human_blat.png - :alt: Genome Browser - SMN1 (human) - Blat +.. image:: images/dynamic_threshold.png + :alt: Dynamic Threshold :align: center -**Paste** in the **protein sequence** and **change** the **genome** to -**mouse** as shown below and then click **submit**. +You can dynamically change the threshold for how strong a match you +consider the conservation to be by changing the value in the dynamic +threshold box. -.. image:: images/ucsc_gb_smn1_human_blat_paste.png - :alt: Genome Browser - SMN1 (human) - Blat paste protein - :align: center +The value you enter is the minimum number of base pairs that have to +be matched in order to be considered conserved. The second number that +you can't change is the `window size`_ you used when creating the +experiment. The last number is the percent match. -Notice that we have two hits, one of which looks pretty good at 89.9% -match. +Below is an animation of the dynamic threshold being increased over +time. -.. image:: images/ucsc_gb_smn1_human_blat_hits.png - :alt: Genome Browser - SMN1 (human) - Blat hits +.. image:: images/threshold_change.gif + :alt: Animated Dynamic Threshold :align: center -**Click** on the **brower** link next to the 89.9% match. Notice in -the genome browser (shown below) that there is an annotated gene -called SMN1 for mouse which matches the line called **your sequence -from blat search**. 
This means we are fairly confidant we found the -right location in the mouse genome. +See the Threshold_ section for more information. -.. image:: images/ucsc_gb_smn1_human_blat_to_browser.png - :alt: Genome Browser - SMN1 (human) - Blat to browser - :align: center -Follow steps 1 through 3 for mouse and then repeat step 4 with the -human protein sequence to find **SMN1** in the following species (if -you find a match): +Sequence Information Bar +~~~~~~~~~~~~~~~~~~~~~~~~ - 1. Rat - 2. Rabbit - 3. Dog - 4. Armadillo - 5. Elephant - 6. Opposum - 7. x_tropicalis +.. image:: images/seq_info_bar.png + :alt: Sequence Information Bar + :align: center -Make sure to save the extended DNA sequence and annotation file for -each one. +The sequence information bars can be found to the left and right sides +of Mussagl. Next to each sequence you will find the following +information: -Using Mussagl -============= + 1. Species (If it has been defined) + 2. Total Size of Sequence + 3. Current base pair position +Note that you can **update the species** text box. Make sure to **save your +experiment** after making this change by selecting **File > Save +Analysis** from the menu. -Launch Mussagl --------------- -Launch Mussagl... It should look similar to the screen shot below. +Sequence Scroll Bar +~~~~~~~~~~~~~~~~~~~ -.. image:: images/opened.png - :alt: Launch Mussa +.. image:: images/scroll_bar.png + :alt: Sequence Scroll Bar :align: center +The scroll bar allows you to scroll through the sequence which is +useful when you have zoomed in using the `zoom factor`_. -Create/Load Analysis ----------------------- - -Currently there are three ways to load a Mussa experiment. +Saving +------ - 1. `Create a new analysis`_ - 2. `Load a mussa parameter file`_ (.mupa) - 3. `Load an analysis`_ +Save on Close +~~~~~~~~~~~~~ -.. 
_createnew: +Whenever you create a new analysis or make a change such as +adding/editing a motif or changing a species name, an asterisk (*) +will appear in the title of the window showing that there are changes +that have not been saved. If you close a Mussa window without saving +changes, Mussa will ask you if you would like to save the changes that +have been made. -Create a new analysis -~~~~~~~~~~~~~~~~~~~~~ +Save Analysis +~~~~~~~~~~~~~ -To create a new analysis select 'Define analysis' from the 'File' -menu. You should see a dialog box similar to the one below. For this -demo we will use the example sequences that come with Mussagl. +After making changes, such as updating species names or adding/editing +motifs, you can save these changes by selecting the **File > Save +analysis** menu option or pressing **CTRL + S** (PC) or +**Apple/Command Key + S** (on Mac). -.. image:: images/define_analysis.png - :alt: Define Analysis +.. image:: images/save_analysis.png + :alt: Save analysis :align: center -Instructions: +Save Analysis As +~~~~~~~~~~~~~~~~ - 1. **Give the experiment a name**, for this demo, we'll use - 'demo_w30_t20'. Mussa will create a folder with this name to store - the analysis files in once it has been run. +To save a copy of your analysis to a new location, select the **File > +Save analysis as** menu option and choose a new location and name for +your analysis. - 2. Choose a threshold_... for this demo **choose 20**. See the - Threshold_ section for more detailed information. +.. image:: images/save_analysis_as.png + :alt: Save analysis + :align: center - 3. Choose a `window size`_. For this demo **choose 30**. +Save Motif List +~~~~~~~~~~~~~~~ +See `Save Motifs to File`_ in the `Motifs`_ section. - 4. Choose the number of sequences_ you would like. For this demo - **choose 3**. -.. 
image:: images/define_analysis_step1a.png - :alt: Steps 1-4 - :align: center +Viewing Multiple Analyses +------------------------- -First enter the species name of "Human" in the first "Species" text -box. Now click on the 'Browse' button next to the sequence input box -and then select /examples/seq/human_mck_pro.fa file. Do the same in -the next two sequence input boxes selecting mouse_mck_pro.fa and -rabbit_mck_pro.fa as shown below. Make sure to give them a species -name as well. Note that you can create annotation files using the -mussa `Annotation File Format`_ to add annotations to your sequence. +Sometimes it is useful to view more than one analysis at a time. To +accomplish this, Mussa allows you to open a new Mussa window by +selecting the **File > New Mussa Window** menu option. -.. image:: images/define_analysis_step2.png - :alt: Choose sequences +.. image:: images/new_mussa_window_menu.png + :alt: New Mussa Window Menu Option :align: center -Click the **create** button and in a few moments you should see -something similar to the following screen shot. +A new Mussa window will pop up. -.. image:: images/demo.png - :alt: Mussagl Demo +.. figure:: images/new_mussa_window.png + :alt: New Mussa Window :align: center -By default your analysis is NOT saved. If you try to close an analysis -without saving, you will be prompted with a dialog box asking you if -you would like to save your analysis. The `Saving`_ section for -details on saving your analysis. When saving, choose directory and -give the analysis the name **demo_w30_t20**. If you close and reopen -Mussagl, you will then be able to load the saved analysis. See `Load -an analysis`_ section below for details. + A new Mussa window on the right, in which a second analysis has + been loaded. +Now you can create a new analysis or load an existing one in this new window, +as described in the `Create/Load Analysis`_ section. 
-Load a mussa parameter file -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +You can view as many analyses as you can fit on your screen or until +you run out of available RAM. If you notice a rapid decrease in +performance and hear lots of noise coming from your hard drive, you +probably ran out of RAM and are now using virtual memory (i.e. much +much slower). If this happens, you may need to avoid opening as many +analyses at one time. -If you prefer, you can define your Mussa analysis using the Mussa -parameter file. See the `Parameter File Format`_ section for details -on creating a .mupa file. -Once you have a .mupa file created, load Mussagl and select the **File > -Create Analysis from File** menu option. Select the .mupa file and click -open. +Annotations / Motifs +-------------------- -.. image:: images/load_mupa_menu.png - :alt: Load Mussa Parameters - :align: center +Annotations +~~~~~~~~~~~ -If you would like to see an example, you can load the -**mck3test.mupa** file in the examples directory that comes with -Mussagl. +Currently annotations can be added to a sequence using the mussa +`annotation file format`_ and can be loaded by selecting the +annotation file when defining a new analysis (see `Create a new +analysis`_ section) or by defining a .mupa file pointing to your +annotation file (see `Load a mussa parameter file`_ section). -.. image:: images/load_mupa_dialog.png - :alt: Load Mussa Parameters Dialog +Motifs +~~~~~~ + +Load Motifs from File +********************* + +It is possible to load motifs from a file which was saved from a +previous run or by defining your own motif file. See the `Motif File +Format`_ section for details. + +NOTE: Valid motif list file extensions are: + + * .mtl + * .txt + +To load a motif file, select **Load Motif List** item from the +**File** menu and select a motif list file. + +.. 
image:: images/load_motif.png + :alt: Load Motif List :align: center -Load an analysis -~~~~~~~~~~~~~~~~ +Save Motifs to File +******************* -To load a previously run analysis open Mussagl and select the **File > -Open Existing Analysis** menu option. Select an analysis **directory** and -click open. +Motifs from the `Motif Dialog`_ can be saved to file for use with +other analyses. If you just want your motifs to be saved with your +analysis, see the `save analysis`_ section for details. -.. image:: images/load_analysis_menu.png - :alt: Load Analysis Menu +To save a motif list, select the **File > Save Motifs** menu option. By +default, Mussa will append .mtl if you do not provide a file extension +(valid file extensions: .mtl & .txt). + +.. image:: images/save_motifs.png + :alt: Save Motifs :align: center -Main Window ------------ +Motif Dialog +************ -Overview -~~~~~~~~ -.. Screen-shot with numbers showing features. +Mussa has the ability to find lab motifs using the `IUPAC Nucleotide +Code`_ for defining a motif. To define a motif, select the **Edit > Edit +Motifs** menu item as shown below. -.. image:: images/window_overview.png - :alt: Mussa Window +.. image:: images/view_edit_motifs.png + :alt: "View > Edit Motifs" Menu :align: center -Legend: +You will see a dialog box appear with an "apply" button in the bottom +right and one row for defining motifs and the color that will be +displayed on the sequence. When you start adding your first motif, an +additional row will be added. The check box in the first column +defines whether the motif is displayed or not. The second column is +the motif display color. The third column is for the name of your +motif and finally, the fourth column is the motif itself. - 1. `DNA Sequence (Black bars)`_ - - 2. Annotation_ +.. image:: images/motif_dialog_start.png + :alt: Motif Dialog + :align: center - 3. Motif_ +Now let's make a motif **'AT[C or G]CT'**. 
Using the `IUPAC Nucleotide +Code`_, type **'ATSCT'** into the motif field and **'My Motif'** for +the name in the name field as shown below. - 4. `Red conservation tracks`_ +Notice how a second row appeared when you started to add the first +motif. Every time you add a new motif, a new row will appear allowing +you to add as many motifs as you need. - 5. `Blue conservation tracks`_ +.. image:: images/motif_dialog_enter_motif.png + :alt: Enter Motif + :align: center - 6. `Zoom Factor`_ (Base pairs per pixel) +Now choose a color for your motif by clicking on the colored area to +the left of the name field. Remember to choose a color that will show +up well with a black bar as the background. A good tool for picking a +color is the `Colour Contrast Analyser +`_ by +`juicystudio.com `_. - 7. `Dynamic Threshold`_ +.. image:: images/color_chooser.png + :alt: Color Chooser + :align: center - 8. `Sequence Information Bar`_ +Once you have selected the color for your motif, click on the +**'apply'** button. Notice that any matches Mussa finds to your motif +will now show up in the main Mussa window. - 9. `Sequence Scroll Bar`_ +Before Motif: +.. image:: images/motif_dialog_bar_before.png + :alt: Sequence bar before motif + :align: center -DNA Sequence (black bars) -~~~~~~~~~~~~~~~~~~~~~~~~~ +After Motif: -.. image:: images/sequence_bar.png - :alt: Sequence Bar +.. image:: images/motif_dialog_bar_after.png + :alt: Sequence bar after motif :align: center -Each of the black bars represents one of the loaded sequences, in this -case the sequence around the gene 'MCK' in human, mouse, and rabbit. +To save your motifs with your analysis, see the `save analysis`_ +section. To save your motifs to a file, see the `save motifs to file`_ +section. +Deleting a Motif +^^^^^^^^^^^^^^^^ -Annotation -~~~~~~~~~~ +To delete a motif, remove all text from the name and sequence columns +and close the motif editor. -.. 
figure:: images/annotation.png - :alt: Annotation +View Mussa Alignments +--------------------- + +Mussagl allows you to zoom in on Mussa alignments by selecting the set +of alignment(s) of interest. To do this, move the mouse near the +alignment you are interested in viewing and then **PRESS** and +**HOLD** the **LEFT mouse button** and **drag the mouse** to the other +side of the conservation track so that you see a bounding box +overlapping the alignment(s) of interest and then **let go** of the +*left mouse button*. + +In the example below, I started by left-clicking on the area marked by +a red dot (upper left corner of bounding box) and dragging the mouse to +the area marked by a blue dot (lower right corner of the bounding box) +and letting go of the left mouse button. + +.. image:: images/select_sequence.png + :alt: Select Sequence :align: center - - Annotation shown in green on sequence bar. +All of the lines which were not selected should be washed out as shown +below: -Annotations can be included on any of the sequences using the `Load a -mussa parameter file`_ or `Create a new analysis`_ method of loading -your sequences. You can define annotations by location or using an -exact sub-sequence or a FASTA sequence of the section of DNA you wish -to annotate. See the `Annotation File Format`_ section for details. +.. image:: images/washed_out.png + :alt: Tracks washed out + :align: center +With a selection made, go to the **View** menu and select **View mussa alignment**. -Motif -~~~~~ +.. image:: images/view_mussa_alignment.png + :alt: View mussa alignment + :align: center -.. figure:: images/motif.png - :alt: Motif +You should see the alignment at the base-pair level as shown below. + +.. image:: images/mussa_alignment.png + :alt: Mussa alignment :align: center - Motif shown in light blue on sequence bar. -The only real difference between an annotation and motif in Mussagl is -that you can define motifs and choose a color from within the GUI. 
See -the `Motifs`_ section for more information. +Sub-analysis +------------ +To run a sub-analysis, **highlight** a section of sequence and *right +click* on it and select **Add to subanalysis**. Do the same for the +sequences shown in orange in the screenshot below. Note that you **are +NOT limited** to selecting only one subsequence from the same +sequence. -Red conservation tracks -~~~~~~~~~~~~~~~~~~~~~~~ +.. image:: images/subanalysis_select_seqs.png + :alt: Subanalysis sequence selection + :align: center -.. figure:: images/conservation_tracks.png - :alt: Conservation Tracks +Once you have added your sequences for subanalysis, choose a `window size`_ and `threshold`_ and click **Ok**. + +.. image:: images/subanalysis_dialog.png + :alt: Subanalysis Dialog :align: center - - Conservations tracks shown as red and blue lines between sequence - bars. -The **red lines** between the sequence bars represent conservation -between the sequences (i.e. not reverse complement matches) +A new Mussa window will appear with the subanalysis of your sequences +once it's done running. This may take a while if you selected large +chunks of sequence with a loose threshold. -The amount of sequence conservation shown will depend on how much your -sequences are related and the `dynamic threshold`_ you are using. +.. image:: images/subanalysis_done.png + :alt: Subanalysis complete + :align: center -Blue conservation tracks -~~~~~~~~~~~~~~~~~~~~~~~~ +Copying sequence to clipboard +----------------------------- -.. figure:: images/conservation_tracks.png - :alt: Conservation Tracks +To copy a sequence to the clipboard, highlight a section of sequence, +as shown in the screen shot below, and do one of the following: + + * Select **Copy as FASTA** from the **Edit** menu. + * **Right-Click (Left-click + Apple/Command Key on Mac)** on the highlighted sequence and select **Copy as FASTA**. + * Press **Ctrl + C (on PC)** or **Apple/Command Key + C (on Mac)** on the keyboard. + +.. 
image:: images/copy_sequence.png + :alt: Copy sequence :align: center - - Conservations tracks shown as red and blue lines between sequence - bars. -**Blue lines** represent **reverse complement** conservation relative -to the sequence attached to the top of the blue line. -The amount of sequence conservation shown will depend on how much your -sequences are related and the `dynamic threshold`_ you are using. +Saving to an Image +--------------------------------- +To save your current mussa view to an image, select **File > Save to +image...** as shown below. -Zoom Factor -~~~~~~~~~~~ +.. image:: images/save_to_image_menu.png + :alt: File > Save to image... + :align: center -.. image:: images/zoom_factor.png - :alt: Zoom Factor +You can define the width and the height of the image to save. By +default it will use the same size of your current view. Since the +Mussa view is implemented using vectors, if you choose a larger size +then your current view, Mussa will redraw at the higher resolution +when saving. In other words, you get higher quality images when saving +at a higher resolution. + +If you check the "Lock aspect ratio" check box, which I have circled +in red, then when you change one value, say width, the other, height, +will update automatically to keep the same aspect ratio. + +.. image:: images/save_to_image_dialog.png + :alt: Save to image dialog :align: center -The zoom factor represents the number of base pairs represented per -pixel. When you zoom in far enough the sequence will switch from -seeing a black bar, representing the sequence, to the actual sequence -(well, ASCII representation of sequence). +Click save and choose a location and filename for your file. +The valid image formats are: -Dynamic Threshold -~~~~~~~~~~~~~~~~~ + * .png (default if no extension specified.) + * .jpg -.. 
image:: images/dynamic_threshold.png - :alt: Dynamic Threshold - :align: center -You can dynamically change the threshold for how strong a match you -consider the conservation to be by changing the value in the dynamic -threshold box. +Detailed Information +-------------------- -The value you enter is the minimum number of base pairs that have to -be matched in order to be considered conserved. The second number that -you can't change is the `window size`_ you used when creating the -experiment. The last number is the percent match. +Threshold +~~~~~~~~~ -See the Threshold_ section for more information. +The threshold of an analysis is in minimum number of base pair matches +must be meet to in order to be kept as a match. Note that you can vary +the threshold from within Mussagl. For example, if you choose a +`window size`_ of **30** and a **threshold** of **20** the mussa nway +transitive algorithm will store all matches that are 20 out of 30 bp +matches or better and pass it on to Mussagl. Mussagl will then allow +you to dynamically choose a threshold from 20 to 30 base pairs. A +threshold of 30 bps would only show 30 out of 30 bp matches. A +threshold of 20 bps would show all matches of 20 out of 30 bps or +better. If you would like to see results for matches lower than 20 out +of 30, you will need to rerun the analysis with a lower threshold. +Window Size +~~~~~~~~~~~ -Sequence Information Bar -~~~~~~~~~~~~~~~~~~~~~~~~ +The typical sizes people tend to choose are between 20 and 30. You +will likely need to experiment with this setting depending on your +needs and input sequence. -.. image:: images/seq_info_bar.png - :alt: Sequence Information Bar - :align: center -The sequence information bars can be found to the left and right sides -of Mussagl. Next to each sequence you will find the following -information: +Sequences +~~~~~~~~~ - 1. Species (If it has been defined) - 2. Total Size of Sequence - 3. 
Current base pair position +Mussa reads in sequences which are formatted in the FASTA_ +format. Mussa may take a long time to run (>10 minutes) if the total +bp length near 280Kb. Once mussa has run once, you can reload +previously run analyzes. -Note that you can **update the species** text box. Make sure to **save your -experiment** after making this change by selecting **File > Save -Analysis** from the menu. +FIXME: We have learned more about how much sequence and how many to +put in Mussagl, this information should be documented here. -Sequence Scroll Bar -~~~~~~~~~~~~~~~~~~~ -.. image:: images/scroll_bar.png - :alt: Sequence Scroll Bar - :align: center +Mussa File Formats +------------------ -The scroll bar allows you to scroll through the sequence which is -useful when you have zoomed in using the `zoom factor`_. +.. _param: +Parameter File Format +~~~~~~~~~~~~~~~~~~~~~ -Saving ------- +Note that for the comment character '#' to work, it must contain a +space after it (i.e. '# '). -Save on Close -~~~~~~~~~~~~~ +**File Format (.mupa):** -When ever you create a new analysis or make a change such as -adding/editing a motif or changing a species name, an asterisk (*) -will appear in the title of the window showing that there are changes -that have not been saved. If you close a Mussa window without saving -changes, Mussa will ask you if you would like to save the changes that -have been made. 
+:: -Save Analysis -~~~~~~~~~~~~~ + # name of analysis directory and stem for associated files + ANA_NAME + + # if APPEND vars true, a _wXX and/or _tYY added to analysis name + # where XX = WINDOW and YY = THRESHOLD + # Highly recommeded with use of command line override of WINDOW or THRESHOLD + APPEND_WIN + APPEND_THRES + + # first sequence info + SEQUENCE + ANNOTATION + SEQ_START + + # the second sequence info + SEQUENCE + # ANNOTATION + SEQ_START + # SEQ_END -After making changes, such as updating species names or adding/editing -motifs, you can save these changes by selecting the **File > Save -analysis** menu option or pressing **CTRL + S** (PC) or -**Apple/Command Key + S** (on Mac). + # third sequence info + SEQUENCE + # ANNOTATION + + # analyzes parameters: command line args -w -t will override these + WINDOW + THRESHOLD -.. image:: images/save_analysis.png - :alt: Save analysis - :align: center +.. csv-table:: Parameter File Options: + :header: "Option Name", "Value", "Default", "Required", "Description" + :widths: 30 30 30 30 60 -Save Analysis As -~~~~~~~~~~~~~~~~ + "ANA_NAME", "string", "N/A", "true", "Name of analysis (Also + name of directory where analysis will be saved." + "APPEND_WIN", "true/false", "?", "?", "Appends _w## to ANA_NAME" + "APPEND_THRES", "true or false", "?", "?", "Appends _t## to ANA_NAME" + "SEQUENCE", "/FASTA/filepath.fa", "N/A", "true", "Absolute/Relative file + path to sequence." + "ANNOTATION", "/annotation/filepath.txt", "N/A", "false", "Optional + annotation file. See `annotation file format`_ section for more + information." 
+ "SEQ_START", "integer", "1", "false", "Optional index into FASTA file" + "SEQ_END", "integer", "1", "false", "Optional index into FASTA file" + "WINDOW", "integer", "N/A", "true", "`Window Size`_" + "THRESHOLD", "integer", "N/A", "true", "`Threshold`_" -To save a copy of your analysis to a new location, select the **File > -Save analysis as** menu option and choose a new location and name for -your analysis. +.. _annot: -.. image:: images/save_analysis_as.png - :alt: Save analysis - :align: center +Annotation File Format +~~~~~~~~~~~~~~~~~~~~~~ -Save Motif List -~~~~~~~~~~~~~~~ +The first line in the file is the sequence name. Each line there after +is a **space** separated annotation. -See `Save Motifs to File`_ in the `Motifs`_ section. +Update: + + * The annotation format now supports FASTA sequences embedded in the + annotation file as shown in the format example below. Mussagl will + take this sequence and look for an exact match of this sequence in + your sequences. If a match is found, it will label it with the name + of from the FASTA header. +Format: -Viewing Multiple Analyses -------------------------- +:: + + + + + + + >FASTA Header + ACTGACTGACGTACGTAGCTAGCTAGCTAGCACG + ACGTACGTACGTACGTAGCTGTCATACGCTAGCA + TGCGTAGAGGATCTCGGATGCTAGCGCTATCGAT + ACGTACGGCAGTACGCGGTCAGA + + ... -Some times it is useful to view more than one analysis at a time. To -do accomplish this, Mussa allows you to open a new Mussa window by -selecting the **File > New Mussa Window** menu option. +Example: -.. image:: images/new_mussa_window_menu.png - :alt: New Mussa Window Menu Option - :align: center +:: -A new Mussa window will pop up. + Mouse + 251 500 Glorp Glorptype + 751 1000 Glorp Glorptype + 1251 1500 Glorp Glorptype + >My favorite DNA sequence + GATTACA + 1751 2000 Glorp Glorptype -.. figure:: images/new_mussa_window.png - :alt: New Mussa Window - :align: center - A new Mussa window on the right, in which a second analysis has - been loaded. +.. 
_motif_file_format: + +Motif File Format +~~~~~~~~~~~~~~~~~ + +Format: + + + +Example: + + GGCC 0.0 1 1 + + + +IUPAC Nucleotide Code +~~~~~~~~~~~~~~~~~~~~~~ + +For your convenience, below is a table of the IUPAC Nucleotide Code. + +The following table is table 1 from "Nomenclature for Incompletely +Specified Bases in Nucleic Acid Sequences" which can be found at +http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html. + +====== ================= =================================== +Symbol Meaning Origin of designation +====== ================= =================================== +G G Guanine +A A Adenine +T T Thymine +C C Cytosine +R G or A puRine +Y T or C pYrimidine +M A or C aMino +K G or T Keto +S G or C Strong interaction (3 H bonds) +W A or T Weak interaction (2 H bonds) +H A or C or T not-G, H follows G in the alphabet +B G or T or C not-A, B follows A +V G or C or A not-T (not-U), V follows U +D G or A or T not-C, D follows C +N G or A or T or C aNy +====== ================= =================================== + + +Obtaining Input Data - Continued +-------------------------------- + +If you already have your data, may want to go to the `Using Mussagl`_ +section of the manual. + +Let's say you have a gene of interest called 'SMN1' and you want to +know how the sequence surrounding the gene in multiple species is +conserved. Guess what, that's what we are going to do, retrieve the +DNA sequence for SMN1 and prepare it for using in Mussa. + +For more information about SMN1 visit `NCBI's OMIM +`_. + +The SMN1 data retrieved in this section can be downloaded from the +`Mussa Example Data +`_ page if +you prefer to skip this section of the manual. + +UCSC Genome Browser Method +-------------------------- + +There are many methods of retrieving DNA sequence, but for this +example we will retrieve SMN1 through the UCSC genome browser located +at http://genome.ucsc.edu/. 
+ -Now you can create or load an existing analysis, in this new window, -as described in the `Create/Load Analysis`_ section. +.. image:: images/ucsc_genome_browser_home.png + :alt: UCSC Genome Browser + :align: center -You can view as many analyses as you can fit on your screen or until -you run out of available RAM. If you notice a rapid decrease in -performance and hear lots of noise coming from your hard drive, you -probably ran out of RAM and are now using virtual memory (i.e. much -much slower). If this happens, you may need to avoid opening as many -analyses at one time. +Step 1 - Find SMN1 +~~~~~~~~~~~~~~~~~~ +The first step in finding SMN1 is to use the **Gene Sorter** menu +option which I have highlighted in orange below: -Annotations / Motifs --------------------- +.. image:: images/ucsc_menu_bar_gene_sorter.png + :alt: Gene Sorter Menu Option + :align: center -Annotations -~~~~~~~~~~~ +Gene Sorter page: -Currently annotations can be added to a sequence using the mussa -`annotation file format`_ and can be loaded by selecting the -annotation file when defining a new analysis (see `Create a new -analysis`_ section) or by defining a .mupa file pointing to your -annotation file (see `Load a mussa parameter file`_ section). +.. image:: images/ucsc_gene_sorter.png + :alt: Gene Sorter + :align: center -Motifs -~~~~~~ +We will start by looking for SMN1 in the **Human Genome** and **sorting by name similarity**. -Load Motifs from File -********************* +.. image:: images/ucsc_gs_sort_name_sim.png + :alt: Gene Sorter - Name Similarity + :align: center -It is possible to load motifs from a file which was saved from a -previous run or by defining your own motif file. See the `Motif File -Format`_ section for details. +After you have selected **Human Genome** and **sorting by name similarity**, type *SMN1* into the search box. -NOTE: Valid motif list file extensions are: - - * .mtl - * .txt +.. 
image:: images/ucsc_gs_smn1.png + :alt: Gene + :align: center -To load a motif file, select **Load Motif List** item from the -**File** menu and select a motif list file. +Press **Go!** and you should see the following page: -.. image:: images/load_motif.png - :alt: Load Motif List +.. image:: images/ucsc_gs_found.png + :alt: Found SMN1 :align: center +Click on **SMN1** and you will be taking the gene expression atlas +page. -Save Motifs to File -******************* +.. image:: images/ucsc_gs_genome_position.png + :alt: Gene expression atlas + :align: center -Motifs from the `Motif Dialog`_ can be saved to file for use with -other analyses. If you just want your motifs to be saved with your -analysis, see the `save analysis`_ section for details. +Click on **chr5 70,270,558** found in the **SMN1 row**, **Genome +position column**. -To save a motif list, select **File > Save Motifs** menu option. By -default, Mussa will append .mtl if you do not provide a file extension -(valid file extensions: .mtl & .txt). +Now we have found the location of SMN1 on human! -.. image:: images/save_motifs.png - :alt: Save Motifs +.. image:: images/ucsc_gb_smn1_human.png + :alt: Genome Browser - SMN1 (human) :align: center -Motif Dialog -************ +Step 2 - Download CDS/UTR sequence for annotations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Mussa has the ability to find lab motifs using the `IUPAC Nucleotide -Code`_ for defining a motif. To define a motif, select **Edit > Edit -Motifs** menu item as shown below. +Since we have found **SMN1**, this would be a convenient time to extract +the DNA sequence for the CDS and UTRs of the gene to use it as an +annotation_ in Mussa. -.. image:: images/view_edit_motifs.png - :alt: "View > Edit Motifs" Menu +**Click on SMN1** shown **between** the **two orange arrows** shown +below. + +.. 
image:: images/ucsc_gb_smn1_human_click_smn1.png + :alt: Genome Browser - SMN1 (human) - Orange Arrows :align: center -You will see a dialog box appear with a "apply" button in the bottom -right and one rows for defining motifs and the color that will be -displayed on the sequence. When you start adding your first motif, an -additional row will be added. The check box in the first column -defines whether the motif is displayed or not. The second column is -the motif display color. The third column is for the name of your -motif and finally, the fourth column is motif itself. +You should find yourself at the SMN1 description page. -.. image:: images/motif_dialog_start.png - :alt: Motif Dialog +.. image:: images/ucsc_gb_smn1_description_page.png + :alt: Genome Browser - SMN1 (human) - Description page :align: center -Now let's make a motif **'AT[C or G]CT'**. Using the `IUPAC Nucleotide -Code`_, type in **'ATSCT'** into the motif field and **'My Motif'** for -the name in the name field as shown below. - -Notice how a second row appeared when you started to add the first -motif. Every time you add a new motif, a new row will appear allowing -you to add as many motifs as you need. +**Scroll down** until you get to the **Sequence section** and click on +**Genomic (chr5:70,256,524-70,284,592)**. -.. image:: images/motif_dialog_enter_motif.png - :alt: Enter Motif +.. image:: images/ucsc_gb_smn1_human_sequence.png + :alt: Genome Browser - SMN1 (human) - Sequence :align: center -Now choose a color for your motif by clicking on the colored area to -the left of the name field. Remember to choose a color that will show -up well with a black bar as the background. A good tool for picking a -color is the `Colour Contrast Analyser -`_ by -`juicystudio.com `_. +You should now be at the **Genomic sequence near gene** page: -.. image:: images/color_chooser.png - :alt: Color Chooser +.. 
image:: images/ucsc_gb_smn1_human_get_genomic_sequence.png + :alt: Genome Browser - SMN1 (human) - Get genomic sequence :align: center -Once you have selected the color for your motif, click on the -**'apply'** button. Notice that if Mussa finds matches to your motif -will now show up in the main Mussa window. +Make the following changes (highlighted in orange in the screenshot +below): -Before Motif: + 1. UNcheck **introns**. + (We only want to annotate CDS and UTRs.) + 2. Select **one FASTA record** per **region**. + (Mussa needs each CDS and UTR represented by one FASTA record per CDS/UTR). + 3. Select **CDS in upper case, UTR in lower case.** -.. image:: images/motif_dialog_bar_before.png - :alt: Sequence bar before motif +.. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_diff.png + :alt: Genome Browser - SMN1 (human) - Get genomic sequence setup :align: center -After Motif: +Now click the **submit** button. You will then see a FASTA file with +many FASTA records representing the CDS and UTRS. -.. image:: images/motif_dialog_bar_after.png - :alt: Sequence bar after motif +.. image:: images/ucsc_gb_smn1_human_get_genomic_sequence_submit.png + :alt: Genome Browser - SMN1 (human) - CDS/UTR sequence :align: center -To save your motifs with your analysis, see the `save analysis`_ -section. To save your motifs to a file, see the `save motifs to file`_ -section. +Now you need to save the FASTA records to a **text file**. If you are +using **Firefox** or **Internet Explorer 6+** click on the **File > +Save As** menu option. -Deleting a Motif -^^^^^^^^^^^^^^^^ +**IMPORTANT:** Make sure you select **Text Files** and **NOT**, I +repeat **NOT Webpage Complete** (see screenshot below.) -To delete a motif, remove all text from the name and sequence columns -and close the motif editor. +Type in **smn1_human_annot.txt** for the file name. -View Mussa Alignments ---------------------- +.. 
image:: images/smn1_human_annot.png + :alt: Genome Browser - SMN1 (human) - sequence annotation file + :align: center -Mussagl allows you to zoom in on Mussa alignments by selecting the set -of alignment(s) of interest. To do this, move the mouse near the -alignment you are interested in viewing and then **PRESS** and -**HOLD** the **LEFT mouse button** and **drag the mouse** to the other -side of the conservation track so that you see a bounding box -overlaping the alienment(s) of interest and then **let go** of the -*left mouse button*. +**IMPORTANT:** You should open the file with a text editor and make + sure **no HTML** was saved... If you find any HTML markup, delete + the markup and save the file. -In the example below, I started by left-clicking on the area marked by -a red dot (upper left corner of bounding box) and dragging the mouse to -the area marked by a blue dot (lower right corner of the bounding box) -and letting go of the left mouse button. +Now we are going to **modify the file** you just saved to **add the +name of the species** to the **annotation file**. All you have to do +is **add a new line** at the **top of the file** with the word **'Human'** as +shown below: -.. image:: images/select_sequence.png - :alt: Select Sequence +.. image:: images/smn1_human_annot_plus_human.png + :alt: Genome Browser - SMN1 (human) - sequence annotation file :align: center -All of the lines which were not selected should be washed out as shown -below: +You can add more annotations to this file if you wish. See the +`annotation file format`_ section for details of the file format. By +including FASTA records in the annotation_ file, Mussa searches your +DNA sequence for an exact match of the sequence in the annotation_ +file. If found, it will be marked as an annotation_ within Mussa. -.. image:: images/washed_out.png - :alt: Tracks washed out - :align: center -With a selection made, goto the **View** menu and select **View mussa alignment**. 
+Step 3 - Download gene and upstream/downstream sequence +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. image:: images/view_mussa_alignment.png - :alt: View mussa alignment +Use the back button in your web browser to get back the **genome +browser view** of **SMN1** as shown below. + +.. image:: images/ucsc_gb_smn1_human.png + :alt: Genome Browser - SMN1 (human) :align: center -You should see the alignment at the base-pair level as shown below. +There are two options for getting additional sequence around your +gene. The more complex way is to zoom out so that you have the +sequence you want being shown in the genome browser and then follow +the directions for the following method. -.. image:: images/mussa_alignment.png - :alt: Mussa alignment +The second option, which we will choose, is to leave the genome +browser zoomed exactly at the location of SMN1 and click on the +**DNA** option on the menu bar (shown with orange arrows in the +screenshot below.) + +.. image:: images/ucsc_gb_smn1_human_dna_option.png + :alt: Genome Browser - SMN1 (human) - DNA Option :align: center +Now in the **get dna in window** page, let's add an arbitrary amount of +extra sequence on to each end of the gene, let's say 5000 base pairs. -Sub-analysis ------------- +.. image:: images/ucsc_gb_smn1_human_get_dna.png + :alt: Genome Browser - SMN1 (human) - Get DNA + :align: center -To run a sub-analysis **highlight** a section of sequence and *right -click* on it and select **Add to subanalysis**. To the same for the -sequences shown in orange in the screenshot below. Note that you **are -NOT limited** to selecting only one subsequence from the same -sequence. +Click the **get DNA** button. -.. image:: images/subanalysis_select_seqs.png - :alt: Subanalysis sequence selection +.. 
image:: images/ucsc_gb_smn1_human_dna.png + :alt: Genome Browser - SMN1 (human) - DNA :align: center -Once you have added your sequences for subanalysis, choose a `window size`_ and `threshold`_ and click **Ok**. +Save the DNA sequence to a text file called 'smn1_human_dna.fa' as we +did in step 2 with the annotation file. -.. image:: images/subanalysis_dialog.png - :alt: Subanalysis Dialog - :align: center +**IMPORTANT:** Make sure the file is saved as a text file and not an +HTML file. Open the file with a text editor and remove any HTML markup +you find. -A new Mussa window will appear with the subanalysis of your sequences -once it's done running. This may take a while if you selected large -chunks of sequence with a loose threshold. -.. image:: images/subanalysis_done.png - :alt: Subalaysis complete - :align: center +Step 4 - Same/similar/related gene other species. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +What good is a multiple sequence alignment viewer without multiple +sequences? Let'S find a similar gene in a few more species. -Copying sequence to clipboard ------------------------------ +Use the back button on your web browser until you get the **genome +browser view** of **SMN1** as shown below. -To copy a sequence to the clipboard, highlight a section of sequence, -as shown in the screen shot below, and do one of the following: +.. image:: images/ucsc_genome_browser_home.png + :alt: UCSC Genome Browser + :align: center - * Select **Copy as FASTA** from the **Edit** menu. - * **Right-Click (Left-click + Apple/Command Key on Mac)** on the highlighted sequence and select **Copy as FASTA**. - * Press **Ctrl + C (on PC)** or **Apple/Command Key + C (on Mac)** on the keyboard. +**Click on SMN1** shown **between** the **two orange arrows** shown +below. -.. image:: images/copy_sequence.png - :alt: Copy sequence +.. 
image:: images/ucsc_gb_smn1_human_click_smn1.png + :alt: Genome Browser - SMN1 (human) - Orange Arrows :align: center +You should find yourself at the SMN1 description page. -Saving to an Image ---------------------------------- +.. image:: images/ucsc_gb_smn1_description_page.png + :alt: Genome Browser - SMN1 (human) - Description page + :align: center -To save your current mussa view to an image, select **File > Save to -image...** as shown below. +**Scroll down** until you get to the **Sequence section** and click on +**Protein (262 aa)**. -.. image:: images/save_to_image_menu.png - :alt: File > Save to image... +.. image:: images/ucsc_gb_smn1_human_sequence.png + :alt: Genome Browser - SMN1 (human) - Sequence :align: center -You can define the width and the height of the image to save. By -default it will use the same size of your current view. Since the -Mussa view is implemented using vectors, if you choose a larger size -then your current view, Mussa will redraw at the higher resolution -when saving. In other words, you get higher quality images when saving -at a higher resolution. +Copy the SMN1 protein seqeunce by highlighting it and selecting **Edit +> Copy** option from the menu. -If you check the "Lock aspect ratio" check box, which I have circled -in red, then when you change one value, say width, the other, height, -will update automatically to keep the same aspect ratio. +.. image:: images/smn1_human_protein.png + :alt: Genome Browser - SMN1 (human) - Protein + :align: center -.. image:: images/save_to_image_dialog.png - :alt: Save to image dialog +Press the back button on the web browser once and then scroll to the +top of the page and click on the **BLAT** option on the menu bar +(shown below with orange arrows). + +.. image:: images/ucsc_gb_smn1_human_blat.png + :alt: Genome Browser - SMN1 (human) - Blat :align: center -Click save and choose a location and filename for your file. 
+**Paste** in the **protein sequence** and **change** the **genome** to +**mouse** as shown below and then click **submit**. -The valid image formats are: +.. image:: images/ucsc_gb_smn1_human_blat_paste.png + :alt: Genome Browser - SMN1 (human) - Blat paste protein + :align: center - * .png (default if no extension specified.) - * .jpg +Notice that we have two hits, one of which looks pretty good at 89.9% +match. +.. image:: images/ucsc_gb_smn1_human_blat_hits.png + :alt: Genome Browser - SMN1 (human) - Blat hits + :align: center -Detailed Information --------------------- +**Click** on the **brower** link next to the 89.9% match. Notice in +the genome browser (shown below) that there is an annotated gene +called SMN1 for mouse which matches the line called **your sequence +from blat search**. This means we are fairly confidant we found the +right location in the mouse genome. -Threshold -~~~~~~~~~ +.. image:: images/ucsc_gb_smn1_human_blat_to_browser.png + :alt: Genome Browser - SMN1 (human) - Blat to browser + :align: center -The threshold of an analysis is in minimum number of base pair matches -must be meet to in order to be kept as a match. Note that you can vary -the threshold from within Mussagl. For example, if you choose a -`window size`_ of **30** and a **threshold** of **20** the mussa nway -transitive algorithm will store all matches that are 20 out of 30 bp -matches or better and pass it on to Mussagl. Mussagl will then allow -you to dynamically choose a threshold from 20 to 30 base pairs. A -threshold of 30 bps would only show 30 out of 30 bp matches. A -threshold of 20 bps would show all matches of 20 out of 30 bps or -better. If you would like to see results for matches lower than 20 out -of 30, you will need to rerun the analysis with a lower threshold. +Follow steps 1 through 3 for mouse and then repeat step 4 with the +human protein sequence to find **SMN1** in the following species (if +you find a match): -Window Size -~~~~~~~~~~~ + 1. Rat + 2. 
Rabbit + 3. Dog + 4. Armadillo + 5. Elephant + 6. Opossum + 7. x_tropicalis -The typical sizes people tend to choose are between 20 and 30. You -will likely need to experiment with this setting depending on your -needs and input sequence. +Make sure to save the extended DNA sequence and annotation file for +each one. -Sequences -~~~~~~~~~ +Step 5 - Create Analysis +~~~~~~~~~~~~~~~~~~~~~~~~ -Mussa reads in sequences which are formatted in the FASTA_ -format. Mussa may take a long time to run (>10 minutes) if the total -bp length near 280Kb. Once mussa has run once, you can reload -previously run analyzes. +At this point you should have the annotations and fasta files for each +species. If you skipped the first four steps or are having trouble, +you can download the example data from the `Mussa Example Data +`_ page. -FIXME: We have learned more about how much sequence and how many to -put in Mussagl, this information should be documented here. +There are two methods for creating an analysis. You can create a MUssa +PArameter file (.mupa), or you can use the create analysis dialog. To +use the analysis dialog, see the `create a new analysis`_ section. +If you are planning on doing lots of analyses using the same sets of DNA +sequence but with different parameters, annotations, and/or species, +it is often best to set up a `mupa`_ file, so you can: -Mussa File Formats ------------------- + * Change parameters and rerun analysis easily. + * Use Mussa command line option to run batch analyses. + * Define an analysis for someone else to run. -.. _param: +Now, we will create a `mupa`_ file for smn1 for an analysis with +Human, Mouse, and Cow. I'll start by showing you the `mupa`_ file and +then walking you through it line by line. -Parameter File Format -~~~~~~~~~~~~~~~~~~~~~ +Start by creating a new text file called *smn1_human_mouse_cow.mupa*, +in your smn1 directory.
I decided to put each of the fasta and +annotation files for each species in its own directory, so I will use +that setup (see screen shot below). -**File Format (.mupa):** +.. image:: images/smn1_dir_structure.png + :alt: SMN1 directory structure + :align: center +smn1_human_mouse_cow.mupa: :: - # name of analysis directory and stem for associated files - ANA_NAME - - # if APPEND vars true, a _wXX and/or _tYY added to analysis name - # where XX = WINDOW and YY = THRESHOLD - # Highly recommeded with use of command line override of WINDOW or THRESHOLD - APPEND_WIN - APPEND_THRES - - # how many sequences are being analyzed - SEQUENCE_NUM - - # first sequence info - SEQUENCE - ANNOTATION - SEQ_START + # Analysis name + ANA_NAME smn1_human_mouse_cow - # the second sequence info - SEQUENCE - # ANNOTATION - SEQ_START - # SEQ_END - - # third sequence info - SEQUENCE - # ANNOTATION + # Appending to analysis name + APPEND_WIN true + APPEND_THRES true - # analyzes parameters: command line args -w -t will override these - WINDOW - THRESHOLD + # Human sequence + SEQUENCE human/smn1_human_dna.fasta + ANNOTATION human/smn1_human_annotations.txt -.. csv-table:: Parameter File Options: - :header: "Option Name", "Value", "Default", "Required", "Description" - :widths: 30 30 30 30 60 - - "ANA_NAME", "string", "N/A", "true", "Name of analysis (Also - name of directory where analysis will be saved." - "APPEND_WIN", "true/false", "?", "?", "Appends _w## to ANA_NAME" - "APPEND_THRES", "true or false", "?", "?", "Appends _t## to ANA_NAME" - "SEQUENCE_NUM", "integer", "N/A", "true", "The number of sequences - to analyze" - "SEQUENCE", "/FASTA/filepath.fa", "N/A", "true", "Must define one - sequence per SEQUENCE_NUM." - "ANNOTATION", "/annotation/filepath.txt", "N/A", "false", "Optional - annotation file. See `annotation file format`_ section for more - information."
- "SEQ_START", "integer", "1", "false", "Optional index into FASTA file" - "SEQ_END", "integer", "1", "false", "Optional index into FASTA file" - "WINDOW", "integer", "N/A", "true", "`Window Size`_" - "THRESHOLD", "integer", "N/A", "true", "`Threshold`_" - -.. _annot: - -Annotation File Format -~~~~~~~~~~~~~~~~~~~~~~ + SEQUENCE mouse/smn1_mouse_dna.fasta + ANNOTATION mouse/smn1_mouse_annotations.txt -The first line in the file is the sequence name. Each line there after -is a **space** separated annotation. + SEQUENCE cow/smn1_cow_dna.fasta + ANNOTATION cow/smn1_cow_annotations.txt -New as of build 198: - - * The annotation format now supports FASTA sequences embedded in the - annotation file as shown in the format example below. Mussagl will - take this sequence and look for an exact match of this sequence in - your sequences. If a match is found, it will label it with the name - of from the FASTA header. + # Window size / Threshold + WINDOW 30 + THRESHOLD 24 -Format: +The first line is the analysis name. This will be the name of the +directory the results will be saved in when using the Mussa `command +line`_ option --no-gui to run an analysis. If you are using the Mussa +GUI, then you will be prompted for a directory name as mentioned in +the `saving`_ section. :: - - - - - - >FASTA Header - ACTGACTGACGTACGTAGCTAGCTAGCTAGCACG - ACGTACGTACGTACGTAGCTGTCATACGCTAGCA - TGCGTAGAGGATCTCGGATGCTAGCGCTATCGAT - ACGTACGGCAGTACGCGGTCAGA - - ... + # Analysis name + ANA_NAME smn1_human_mouse_cow -Example: +If your provide the APPEND_WIN and/or APPEND_THRES, and set them to +true, the window size and threshold will be appended to the analysis +name. In this example, using the --no-gui `command line`_ option, our +directory name would be *smn1_human_mouse_cow_w30_t24*. :: - Mouse - 251 500 Glorp Glorptype - 751 1000 Glorp Glorptype - 1251 1500 Glorp Glorptype - >My favorite DNA sequence - GATTACA - 1751 2000 Glorp Glorptype - - -.. 
_motif_file_format: + # Appending to analysis name + APPEND_WIN true + APPEND_THRES true -Motif File Format -~~~~~~~~~~~~~~~~~ +The following six lines provide Mussa with the location of the +sequence files and annotation files. The files can be provided with +relative paths from the .mupa file. In other words, this .mupa file +will provide the proper path to the human sequence only if there +exists a directory called *human* in the same directory as this .mupa +file. -Format: +To provide the species name for each species, you have to put the +species name in the annotation files. See the `annotation file +format`_ section for more details. - - -Example: +:: - GGCC 0.0 1 1 + # Human sequence + SEQUENCE human/smn1_human_dna.fasta + ANNOTATION human/smn1_human_annotations.txt + SEQUENCE mouse/smn1_mouse_dna.fasta + ANNOTATION mouse/smn1_mouse_annotations.txt + SEQUENCE cow/smn1_cow_dna.fasta + ANNOTATION cow/smn1_cow_annotations.txt -IUPAC Nucleotide Code -~~~~~~~~~~~~~~~~~~~~~~ +And finally, the `window size`_ and `threshold`_ parameters. -For your convenience, below is a table of the IUPAC Nucleotide Code. +:: -The following table is table 1 from "Nomenclature for Incompletely -Specified Bases in Nucleic Acid Sequences" which can be found at -http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html. 
+ # Window size / Threshold + WINDOW 30 + THRESHOLD 24 -====== ================= =================================== -Symbol Meaning Origin of designation -====== ================= =================================== -G G Guanine -A A Adenine -T T Thymine -C C Cytosine -R G or A puRine -Y T or C pYrimidine -M A or C aMino -K G or T Keto -S G or C Strong interaction (3 H bonds) -W A or T Weak interaction (2 H bonds) -H A or C or T not-G, H follows G in the alphabet -B G or T or C not-A, B follows A -V G or C or A not-T (not-U), V follows U -D G or A or T not-C, D follows C -N G or A or T or C aNy -====== ================= =================================== +Next, open Mussagl and select the **File > Create Analysis from File** +menu option. Mussagl should run your analysis if everything was set up +properly. Understanding Mussa =================== +Command Line +------------ + +Mussa has some very useful command line options that allow for +loading an existing analysis or running a new analysis with or without +launching the GUI. + +Mussa options: + --help help message + -p, --run-analysis arg run an analysis defined by the mussa parameter file + --view-analysis arg load a previously run analysis + --motifs arg annotate analysis with motifs from this file + --no-gui terminate without running an analysis + --python launch as a `python interpreter`_ + +Running an analysis using the --no-gui option is useful when you want +to run many analyses on a compute server and save the results for +viewing in the future. + Performance ----------- @@ -1271,11 +1420,13 @@ FIXME: Include transitivity info. Repeats ~~~~~~~ -The algorithm Mussa uses to find conserved sequences is sensitive to -repeated DNA segments, which are frequently occurring in most -genomes. The problem with repeats, is that one repeat from one -sequence can show up many times in another sequence. Every connection -Mussa makes takes up memory and CPU time to process. 
+Repeat masking of all input sequences, or at least of the "reference" +genome, can be important for reducing compute time and for simplifying +subsequent visual interpretation. Larger loci generally contain more +repeat elements, and as their number grows so will the number of Mussa +connections among them. If not repeat filtered, connectivity between +shared repeat elements can obscure important relationships between +single copy features. The formula for the number of connections, C, that will be made for R instances of a single repeat (meaning R copies of one repeat in each @@ -1317,9 +1468,13 @@ you a C of 2500, ends up with a C^2 of 6,250,000. **Conclusion: repeats cause the processing time of Mussa to skyrocket.** -One way to deal with a situation where you have many repeats in your -sequences is do any of the following: user shorter sequence lengths; -repeat mask one or more of your sequences; or increase the threshold. +To deal with a situation where you have many repeats in your sequences +do any of the following: + + * Use shorter sequence lengths. + * Repeat mask one or more of your sequences. + * Increase the threshold. + Details ------- @@ -1327,6 +1482,18 @@ Details Case: Conservation track suddenly stops ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Details about this potentially confusing case can be found `here +`_. + +Python Interpreter +------------------ + +Mussagl has some functionality for running a python interpreter for +interacting with the internals of Mussagl and/or executing Python +code. This feature is mostly experimental at this point in time. If +you have interest in this feature or would like to know more about it, +contact us using the contact information found at +http://mussa.caltech.edu/. .. Define links below ------------------ @@ -1335,4 +1502,5 @@ Case: Conservation track suddenly stops .. _wiki: http://mussa.caltech.edu .. _build: http://woldlab.caltech.edu/cgi-bin/mussa/wiki/MussaglBuild .. 
_FASTA: http://en.wikipedia.org/wiki/fasta_format -.. _wpDnaMotif: http://en.wikipedia.org/wiki/DNA_motif \ No newline at end of file +.. _wpDnaMotif: http://en.wikipedia.org/wiki/DNA_motif +.. _mupa: `Parameter File Format`_ \ No newline at end of file
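The .mupa layout added in this patch (ANA_NAME, APPEND_WIN, APPEND_THRES, one SEQUENCE/ANNOTATION pair per species, then WINDOW and THRESHOLD) is regular enough to generate from a script, which may be handy when preparing several analyses for ``--no-gui`` runs. The following is only a minimal sketch, assuming the key/value layout shown in the ``smn1_human_mouse_cow.mupa`` example above. The helper names ``build_mupa`` and ``analysis_dir_name`` and the ``<species>/<gene>_<species>_dna.fasta`` naming pattern are illustrative assumptions; they are not part of Mussa.

```python
def analysis_dir_name(name, window, threshold, append_win=True, append_thres=True):
    """Directory name as described in the manual text: _w<WINDOW> and
    _t<THRESHOLD> are appended when the matching APPEND_* option is true."""
    if append_win:
        name += f"_w{window}"
    if append_thres:
        name += f"_t{threshold}"
    return name


def build_mupa(name, gene, species, window, threshold, append=True):
    """Return .mupa text for one analysis (hypothetical helper).

    Paths are relative to the .mupa file and assume one directory per
    species, mirroring the SMN1 example layout.
    """
    flag = "true" if append else "false"
    lines = [
        "# Analysis name",
        f"ANA_NAME {name}",
        "",
        "# Appending to analysis name",
        f"APPEND_WIN {flag}",
        f"APPEND_THRES {flag}",
        "",
    ]
    for sp in species:
        lines.append(f"# {sp.capitalize()} sequence")
        lines.append(f"SEQUENCE {sp}/{gene}_{sp}_dna.fasta")
        lines.append(f"ANNOTATION {sp}/{gene}_{sp}_annotations.txt")
        lines.append("")
    lines += [
        "# Window size / Threshold",
        f"WINDOW {window}",
        f"THRESHOLD {threshold}",
    ]
    return "\n".join(lines) + "\n"


mupa_text = build_mupa("smn1_human_mouse_cow", "smn1",
                       ["human", "mouse", "cow"], window=30, threshold=24)
print(mupa_text)
print(analysis_dir_name("smn1_human_mouse_cow", 30, 24))
```

Writing ``mupa_text`` to a file next to the species directories would give a parameter file equivalent to the example in the patch. Whether Mussa accepts the extra comment lines for each species is assumed from the note that .mupa files have a comment character.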