docs/README.rna-seq

   1 The latest version of this software is available at
   2
   3 http://woldlab.caltech.edu/rnaseq
   4
   5 please check the website for updates.
   6
   7 This is the core of the RNA-seq analysis code described in Mortazavi
   8 et al (2008). Please make sure that you have read Figure 3 and the
   9 methods / supplemental methods of that paper before attempting to
  10 use this package for RNA-Seq data analysis.
  11
  12 ERANGE should run on any Unix-like system supporting python 2.5 or
  13 better. The code is developed on Linux and MacOS X on python 2.5.
  14
  15 Historically, the code for ERANGE grew out of the ChIPSeqMini
  16 package from Johnson et al (2007), and some of the key scripts
  17 (findallnocontrol.py and getallgenes.py) are shared between the two.
  18 This is why ERANGE is "dual-use" and is also why the code for both
  19 analyses were kept in common as much as possible. This should be
  20 helpful when someone tries to combine ChIP-seq and RNA-seq
  21 analyses !
  22
  23 This code is made available as open-source, as described in the
  24 copyright file ERANGE.COPYRIGHT.
  25
  26 1. SETTING EXPECTATIONS
  27 2. REQUIREMENTS
  28 3. COMMAND LINE OPTIONS
  29 4. DISPLAYING DATA
  30 5. ANALYSIS
  31 6. PIPELINE
  32 7. CUSTOM CISTEMATIC GENOME ANNOTATIONS
  33 8. PAIRED-END RNA-SEQ ANALYSIS
  34 9. EXPRESSED SNP ANALYSIS
  35
  36 1. SETTING EXPECTATIONS
  37
  38 ERANGE is not a point-and-click, turn-key package.
  39
  40 It is a set of python scripts that, when run in order as a pipeline
  41 on the "right" input, will take read data in RDS format and
  42 calculate gene expression levels in RPKM (Reads Per kb per Million
  43 reads). This pipeline for unpaired reads is embodied in a shell
  44 script called runStandardAnalysis.sh, which only takes a few inputs,
  45 described in the ANALYSIS and PIPELINE section below.
  46
  47 You should be able to download the data from our website and run the
  48 analysis through the pipeline. You will need to map the reads and
  49 import them into an RDS dataset as described in README.build-rds.
  50
  51 Because you will likely want to run this package on other genomes
  52 (or builds) than the one described in our original paper, you will
  53 need to do several additional steps, such as:
  54
  55 - build expanded genomes with splices and spikes
  56 - check overlap of RNAFAR predictions with repeats
  57
  58 This will require some comfort with running and, if necessary,
  59 editing scripts. While the code is sparsely documented, we are
  60 making it available so that you can *read it*. We'll be happy to
  61 help modifying and updating the code within a reasonable extent
  62 and will try to provide more in depth documentation and tutorials
  63 on our web site.
  64
  65 While the scripts produce several forms of RPKM, we suggest that
  66 the "final" RPKM are the values that most people will be interested
  67 in.
  68
  69 *WARNING* A couple of these scripts are pretty memory hungry. If
  70 you are going to analyze datasets with > 20M reads or reads with
  71 high error rates, you will easily need > 8 GB RAM. We'll rewrite
  72 these scripts before releasing 3.0 final to lower the memory
  73 footprint.
  74
  75 2. REQUIREMENTS
  76
  77 1) Python 2.5+ is required because some of the scripts and
  78 Cistematic (see below) need pysqlite, which is now bundled in
  79 Python.
  80
  81 2) You will also need to use Cistematic 3.0 for some of the scripts
  82 marked below that use genes and genomic sequence; in particular, you
  83 will also likely need the Cistematic version of the genomes, unless
  84 providing your own custom genome and annotations.
  85
  86 Cistematic is available at http://cistematic.caltech.edu
  87
  88 3) You will need genomic sequences to build the expanded genome, as
  89 well as gene models from UCSC.
  90
  91 (Optional) Python is very slow on large datasets. Use of the psyco
  92 module (psyco.sf.net) on 32-bit Linux or all Mac Intel machines to
  93 significantly speed up runtime is highly recommended.
  94
  95 (Optional) Several of the ploting scripts also rely on Matplotlib,
  96 which is available at matplotlib.sf.net.
  97
  98
  99 3. COMMAND LINE OPTIONS
 100
 101 You can find out more about the settings for each python script by
 102 typing:
 103
 104 python $ERANGEPATH/<scriptname>
 105
 106 to see the command line options, where ERANGEPATH is the
 107 environmental variable set to the path to the directory
 108 holding the ERANGE scripts.
 109
 110
 111 For example, if you wanted to know the command line options of the
 112 script used to generate supplementary datasets 2-4, combineRPKMs.py ,
 113 you would type:
 114
 115 python $ERANGEPATH/combineRPKMs.py
 116
 117 and get back a version number and all possible command line options:
 118
 119 version 1.0
 120 usage: python $ERANGEPATH/combineRPKMs.py firstRPKM expandedRPKM finalRPKM combinedOutfile [--withmultifraction]
 121
 122 where fields in brackets are optional.
 123
 124
 125 4. DISPLAYING DATA
 126
 127 You can output bed-files of the raw reads in the RDS file
 128 using makebedfromrds.py and  WIG file using makewiggle.py as
 129 described in README.build-rds .
 130
 131
 132 5. ANALYSIS
 133
 134 The main steps of a typical, unpaired analysis using ERANGE
 135 is shown in RNA-seq.analysisSteps.txt, where each script
 136 would be run in order, with the caveat that there are two
 137 ways to do the candidate exon analysis (RNAFAR), creatively
 138 called "alternative 1" and "alternative 2".
 139
 140 In alternative 1, we use reads that did not match an existing gene
 141 model to identify candidate regions:
 142
 143 # Alternative 1: find new regions outside of gene models with reads piled up
 144 python $ERANGEPATH/findall.py RNAFAR LHCN10213.rds LHCN10213.newregions.txt --RNA --minimum 1 --nomulti --flag NM --log rna.log --cache 1
 145
 146 # Alternative 1: filter out new regions that overlap repeats more than a certain fraction
 147 #                use "none" if you don't have a repeatmask database
 148 python $ERANGEPATH/checkrmask.py ../hg19repeats/rmask.db LHCN10213.newregions.txt LHCN10213.newregions.repstatus LHCN10213.newregions.good --log rna.log --startField 1 --cache 1
 149
 150 In alternative 2, we pool multiple RNA-seq datasets into a single
 151 RDS database, run it through the two scripts of alternative 1 above,
 152 and then use these precomputed candidates to count reads falling in
 153 these regions:
 154
 155 # Alternative 2: use a precomputed list of "new" regions (outside of gene models)
 156 python2.5 $ERANGEPATH/regionCounts.py ../RNAFAR/all.newregions.good LHCN10213.rds LHCN10213.newregions.good
 157
 158 Alternative 1 is the one used by the pipeline script described below.
 159
 160 The scripts will generate a set of intermediate files, the most
 161 interesting of which are the final RPKM values. These will be in the
 162 following files for the test example:
 163
 164 test.firstpass.rpkm (the unique reads only)
 165 test.expanded.rpkm (the unique reads + spliced reads  + RNAFAR)
 166 test.final.rpkm (uniques + spliced + RNAFAR + multireads)
 167
 168
 169 6. PIPELINE
 170
 171 IF YOU ARE STORING THE RDS FILE ON A NETWORK-MOUNTED DIRECTORY,
 172 PLEASE ALSO READ SECTION 7.
 173
 174 Most of the analysis steps described in the section above are
 175 automated in a pipeline shell script called runStandardAnalysis.sh .
 176 Note that the pipeline assumes that it will call its own RNAFAR
 177 regions, which is called "alternative 1" in the ANALYSIS section,
 178 which is a good starting point. You can modify the pipeline script
 179 to use alternative 2, if appropriate.
 180
 181 The pipeline assumes that one RDS database containing the appropriate
 182 uniq, multi, and spliced reads exists as desribed in README.build-rds.
 183
 184 We assume that Cistematic 2.3 is installed, including a version of
 185 the appropriate Cistematic genome. You will need to build your own
 186 Cistematic genome for any unsupported genome.
 187
 188 We will also need a radius (e.g. 20000 bp) within which a candidate
 189 exon will be consolidated with an existing gene.
 190
 191 For example, for the test.rds dataset from the ANALYSIS section, we
 192 would run the pipeline as:
 193
 194 . $ERANGEPATH/runStandardAnalysis.sh mouse test ../mm9repeats/rmask.db 20001
 195
 196 where ERANGEPATH is the environmental variable set to the path to
 197 the directory holding the ERANGE scripts. Remember that you can
 198 replace '../mm9repeats/rmask.db' with 'none' if you don't have a
 199 repeatmask database.
 200
 201 This could run from an hour to a whole day depending on how many
 202 reads are involved (1M vs 80M) and how big a consolidation radius
 203 is used.
 204
 205
 206 7. CUSTOM CISTEMATIC GENOME ANNOTATIONS
 207
 208 Cistematic 3.0 added support for generic genomes and loadable
 209 (or alternative) annotations. While this support is still
 210 experimental, the general idea is to take a GTF/GFF3 file,
 211 convert it into the format that cistematic expects using
 212
 213 $ERANGEPATH/gfftocis.py infile.gff outfile.cis
 214
 215 NOTE THAT THIS FILE IS PROVIDED AS AN EXAMPLE ONLY. YOU WILL MOST
 216 LIKELY HAVE TO EDIT THIS FILE TO ACCOMODATE YOUR SPECIFIC GFF
 217 FORMAT TO THE CISTEMATIC FORMAT, WHICH IS
 218
 219 geneID<tab>uniqRef<tab>chrom<tab>start<tab>stop<tab>sense<tab>type<return>
 220
 221 where type is one of 'CDS','5UTR','3UTR'.
 222
 223 You can then run the standard analysis script with the additional
 224 flag " -models outfile.cis ", e.g.
 225
 226 . runStandardAnalysis.sh generic asteph none 1000 -models agambiae.base.cis
 227
 228 Custom annotation support will be extended to other PIPELINE
 229 scripts as part of 3.2 final.
 230
 231
 232 8. PAIRED-END RNA-SEQ ANALYSIS
 233
 234 We are now experimentally supporting paired-end RNA-seq, as
 235 implemented in the pipeline script runRNAPairedAnalysis.sh and
 236 is only provided as a "work-in-progress" snapshot.
 237
 238 This is done primarily by marking all of the reads that map in a
 239 known exon or a novel RNAFAR region in the RDS database, which
 240 is a slow and time-consuming step (and is off by default for
 241 single-ended RNA-seq). This mapping step is done without
 242 accounting for paired-end information.
 243
 244 The paired-end information is then used to connect RNAFAR
 245 regions to known genes or to other RNAFAR regions using
 246 reads with one end in a given region and the other end
 247 in different (known or novel) region, as implemented in
 248 rnafarPairs.py ; note that there is currently a default
 249 limit of 500000 bp maximum distance between the two pairs.
 250
 251
 252 9. EXPRESSED SNP ANALYSIS
 253
 254 ERANGE3 now supports SNP analysis in RNA-seq data as described
 255 in README.rna-esnp .
 256
 257 RELEASE HISTORY
 258
 259 version 3.3    November 2010 - updated command line options
 260 version 3.2    December 2009 - support for custom genome annotations with Cistematic 3.0
 261 version 3.1    April    2009 - modified normalizeFinalExonic.py to remove genome
 262 version 3.0    January  2009 - added logging to shell pipelines
 263 version 3.0rc1 December 2008 - added blat support
 264 version 3.0b2  December 2008 - bug fixes & ERANGEPATH variable
 265 version 3.0b   November 2008 - Support for paired end analysis
 266 version 3.0a    October 2008 - Preview release of ERANGE3.0
 267 version 2.0         May 2008 - First public release of ERANGE
 268