1 The latest version of this software is available at
3 http://woldlab.caltech.edu/rnaseq
5 please check the website for updates.
7 This is the core of the RNA-seq analysis code described in Mortazavi
8 et al (2008). Please make sure that you have read Figure 3 and the
9 methods / supplemental methods of that paper before attempting to
10 use this package for RNA-Seq data analysis.
12 ERANGE should run on any Unix-like system supporting python 2.5 or
13 better. The code is developed on Linux and MacOS X on python 2.5.
15 Historically, the code for ERANGE grew out of the ChIPSeqMini
16 package from Johnson et al (2007), and some of the key scripts
17 (findallnocontrol.py and getallgenes.py) are shared between the two.
18 This is why ERANGE is "dual-use" and is also why the code for both
19 analyses were kept in common as much as possible. This should be
20 helpful when someone tries to combine ChIP-seq and RNA-seq
23 This code is made available as open-source, as described in the
24 copyright file ERANGE.COPYRIGHT.
26 1. SETTING EXPECTATIONS
28 3. COMMAND LINE OPTIONS
32 7. CUSTOM CISTEMATIC GENOME ANNOTATIONS
33 8. PAIRED-END RNA-SEQ ANALYSIS
34 9. EXPRESSED SNP ANALYSIS
36 1. SETTING EXPECTATIONS
38 ERANGE is not a point-and-click, turn-key package.
40 It is a set of python scripts that, when run in order as a pipeline
41 on the "right" input, will take read data in RDS format and
42 calculate gene expression levels in RPKM (Reads Per kb per Million
43 reads). This pipeline for unpaired reads is embodied in a shell
44 script called runStandardAnalysis.sh, which only takes a few inputs,
45 described in the ANALYSIS and PIPELINE section below.
47 You should be able to download the data from our website and run the
48 analysis through the pipeline. You will need to map the reads and
49 import them into an RDS dataset as described in README.build-rds.
51 Because you will likely want to run this package on other genomes
52 (or builds) than the one described in our original paper, you will
53 need to do several additional steps, such as:
55 - build expanded genomes with splices and spikes
56 - check overlap of RNAFAR predictions with repeats
58 This will require some comfort with running and, if necessary,
59 editing scripts. While the code is sparsely documented, we are
60 making it available so that you can *read it*. We'll be happy to
61 help modifying and updating the code within a reasonable extent
62 and will try to provide more in depth documentation and tutorials
65 While the scripts produce several forms of RPKM, we suggest that
66 the "final" RPKM are the values that most people will be interested
69 *WARNING* A couple of these scripts are pretty memory hungry. If
70 you are going to analyze datasets with > 20M reads or reads with
71 high error rates, you will easily need > 8 GB RAM. We'll rewrite
72 these scripts before releasing 3.0 final to lower the memory
77 1) Python 2.5+ is required because some of the scripts and
78 Cistematic (see below) need pysqlite, which is now bundled in
81 2) You will also need to use Cistematic 3.0 for some of the scripts
82 marked below that use genes and genomic sequence; in particular, you
83 will also likely need the Cistematic version of the genomes, unless
84 providing your own custom genome and annotations.
86 Cistematic is available at http://cistematic.caltech.edu
88 3) You will need genomic sequences to build the expanded genome, as
89 well as gene models from UCSC.
91 (Optional) Python is very slow on large datasets. Use of the psyco
92 module (psyco.sf.net) on 32-bit Linux or all Mac Intel machines to
93 significantly speed up runtime is highly recommended.
95 (Optional) Several of the ploting scripts also rely on Matplotlib,
96 which is available at matplotlib.sf.net.
99 3. COMMAND LINE OPTIONS
101 You can find out more about the settings for each python script by
104 python $ERANGEPATH/<scriptname>
106 to see the command line options, where ERANGEPATH is the
107 environmental variable set to the path to the directory
108 holding the ERANGE scripts.
111 For example, if you wanted to know the command line options of the
112 script used to generate supplementary datasets 2-4, combineRPKMs.py ,
115 python $ERANGEPATH/combineRPKMs.py
117 and get back a version number and all possible command line options:
120 usage: python $ERANGEPATH/combineRPKMs.py firstRPKM expandedRPKM finalRPKM combinedOutfile [--withmultifraction]
122 where fields in brackets are optional.
127 You can output bed-files of the raw reads in the RDS file
128 using makebedfromrds.py and WIG file using makewiggle.py as
129 described in README.build-rds .
134 The main steps of a typical, unpaired analysis using ERANGE
135 is shown in RNA-seq.analysisSteps.txt, where each script
136 would be run in order, with the caveat that there are two
137 ways to do the candidate exon analysis (RNAFAR), creatively
138 called "alternative 1" and "alternative 2".
140 In alternative 1, we use reads that did not match an existing gene
141 model to identify candidate regions:
143 # Alternative 1: find new regions outside of gene models with reads piled up
144 python $ERANGEPATH/findall.py RNAFAR LHCN10213.rds LHCN10213.newregions.txt --RNA --minimum 1 --nomulti --flag NM --log rna.log --cache 1
146 # Alternative 1: filter out new regions that overlap repeats more than a certain fraction
147 # use "none" if you don't have a repeatmask database
148 python $ERANGEPATH/checkrmask.py ../hg19repeats/rmask.db LHCN10213.newregions.txt LHCN10213.newregions.repstatus LHCN10213.newregions.good --log rna.log --startField 1 --cache 1
150 In alternative 2, we pool multiple RNA-seq datasets into a single
151 RDS database, run it through the two scripts of alternative 1 above,
152 and then use these precomputed candidates to count reads falling in
155 # Alternative 2: use a precomputed list of "new" regions (outside of gene models)
156 python2.5 $ERANGEPATH/regionCounts.py ../RNAFAR/all.newregions.good LHCN10213.rds LHCN10213.newregions.good
158 Alternative 1 is the one used by the pipeline script described below.
160 The scripts will generate a set of intermediate files, the most
161 interesting of which are the final RPKM values. These will be in the
162 following files for the test example:
164 test.firstpass.rpkm (the unique reads only)
165 test.expanded.rpkm (the unique reads + spliced reads + RNAFAR)
166 test.final.rpkm (uniques + spliced + RNAFAR + multireads)
171 IF YOU ARE STORING THE RDS FILE ON A NETWORK-MOUNTED DIRECTORY,
172 PLEASE ALSO READ SECTION 7.
174 Most of the analysis steps described in the section above are
175 automated in a pipeline shell script called runStandardAnalysis.sh .
176 Note that the pipeline assumes that it will call its own RNAFAR
177 regions, which is called "alternative 1" in the ANALYSIS section,
178 which is a good starting point. You can modify the pipeline script
179 to use alternative 2, if appropriate.
181 The pipeline assumes that one RDS database containing the appropriate
182 uniq, multi, and spliced reads exists as desribed in README.build-rds.
184 We assume that Cistematic 2.3 is installed, including a version of
185 the appropriate Cistematic genome. You will need to build your own
186 Cistematic genome for any unsupported genome.
188 We will also need a radius (e.g. 20000 bp) within which a candidate
189 exon will be consolidated with an existing gene.
191 For example, for the test.rds dataset from the ANALYSIS section, we
192 would run the pipeline as:
194 . $ERANGEPATH/runStandardAnalysis.sh mouse test ../mm9repeats/rmask.db 20001
196 where ERANGEPATH is the environmental variable set to the path to
197 the directory holding the ERANGE scripts. Remember that you can
198 replace '../mm9repeats/rmask.db' with 'none' if you don't have a
201 This could run from an hour to a whole day depending on how many
202 reads are involved (1M vs 80M) and how big a consolidation radius
206 7. CUSTOM CISTEMATIC GENOME ANNOTATIONS
208 Cistematic 3.0 added support for generic genomes and loadable
209 (or alternative) annotations. While this support is still
210 experimental, the general idea is to take a GTF/GFF3 file,
211 convert it into the format that cistematic expects using
213 $ERANGEPATH/gfftocis.py infile.gff outfile.cis
215 NOTE THAT THIS FILE IS PROVIDED AS AN EXAMPLE ONLY. YOU WILL MOST
216 LIKELY HAVE TO EDIT THIS FILE TO ACCOMODATE YOUR SPECIFIC GFF
217 FORMAT TO THE CISTEMATIC FORMAT, WHICH IS
219 geneID<tab>uniqRef<tab>chrom<tab>start<tab>stop<tab>sense<tab>type<return>
221 where type is one of 'CDS','5UTR','3UTR'.
223 You can then run the standard analysis script with the additional
224 flag " -models outfile.cis ", e.g.
226 . runStandardAnalysis.sh generic asteph none 1000 -models agambiae.base.cis
228 Custom annotation support will be extended to other PIPELINE
229 scripts as part of 3.2 final.
232 8. PAIRED-END RNA-SEQ ANALYSIS
234 We are now experimentally supporting paired-end RNA-seq, as
235 implemented in the pipeline script runRNAPairedAnalysis.sh and
236 is only provided as a "work-in-progress" snapshot.
238 This is done primarily by marking all of the reads that map in a
239 known exon or a novel RNAFAR region in the RDS database, which
240 is a slow and time-consuming step (and is off by default for
241 single-ended RNA-seq). This mapping step is done without
242 accounting for paired-end information.
244 The paired-end information is then used to connect RNAFAR
245 regions to known genes or to other RNAFAR regions using
246 reads with one end in a given region and the other end
247 in different (known or novel) region, as implemented in
248 rnafarPairs.py ; note that there is currently a default
249 limit of 500000 bp maximum distance between the two pairs.
252 9. EXPRESSED SNP ANALYSIS
254 ERANGE3 now supports SNP analysis in RNA-seq data as described
259 version 3.3 November 2010 - updated command line options
260 version 3.2 December 2009 - support for custom genome annotations with Cistematic 3.0
261 version 3.1 April 2009 - modified normalizeFinalExonic.py to remove genome
262 version 3.0 January 2009 - added logging to shell pipelines
263 version 3.0rc1 December 2008 - added blat support
264 version 3.0b2 December 2008 - bug fixes & ERANGEPATH variable
265 version 3.0b November 2008 - Support for paired end analysis
266 version 3.0a October 2008 - Preview release of ERANGE3.0
267 version 2.0 May 2008 - First public release of ERANGE