1 This is a description of the sqlite-based read storage
2 files and of the scripts designed to import read
3 mappings from supported short read mappers. The code
4 should run on any Unix-like system supporting python 2.5
5 or better. The code is developed on Linux and MacOS X on
8 This code is made available as open-source, as described
9 in the copyright file ERANGE.COPYRIGHT.
12 2. COMMAND LINE OPTIONS
13 3. CREATING THE NECESSARY INPUT (RDS) FILES
14 4. BUILDING EXPANDED GENOMES
15 5. MAPPING READS WITH ELAND
16 6. MAPPING READS WITH BOWTIE
17 7. MAPPING READS WITH BLAT
18 8. IMPORTING BED FILES
19 9. COMBINING RDS FILES
20 10. MANIPULATING RDS METADATA AND CACHING
21 11. VISUALIZING THE DATA IN RDS FILES
26 See README.chip-seq or README.rna-seq to see the requirements
27 for installing and running ERANGE specific to each
31 2. COMMAND LINE OPTIONS
33 You can find out more about the settings for each script
36 python $ERANGEPATH/<scriptname>
38 to see the command line options, where ERANGEPATH is the
39 environmental variable set to the path to the directory
40 holding the ERANGE scripts. Note that the command line
41 options are case sensitive and that they could well
45 3. CREATING THE NECESSARY INPUT (RDS) FILES
47 Before you can use the rest of the ERANGE scripts to do
48 CHiP-seq or RNA-seq analyses, you will need to first
49 convert your read mappings to the native ERANGE read
50 storage format, which is sqlite-based, and which is
51 called RDS (Read DataSet). RDS files consist of four
53 - metadata (tracks required and optional metadata)
54 - uniqs (stores uniquely mappable-reads)
55 - multi (stores reads that map equally well to multiple
56 locations in the genome)
57 - splices (stores split reads)
59 a readDataset python object (in commoncode.py) provides
60 the encapsulation of the read database which is accessed
61 through specific methods. Since an RDS file is a sqlite3
62 database, you can additionally use any of the sqlite-based
63 tools to look at the reads in the tables, if you wish to
66 You will need to first map your reads with one of the
67 supported read mappers (see next paragraph) against a copy
68 of the appropriate genome. For ChIP-seq, it will be your
69 genome of interest, whereas for RNA-seq reads should be
70 mapped against an expanded genome, which consists of
71 chromosomes + splice junctions which depend on the read
72 length used. Note that several parts of the code assume
73 that your genomic sequences are labelled with the "chr"
74 chromosomes prefix. For more information on creating
75 expanded genomes, see BUILDING EXPANDED GENOMES.
77 The currently supported read mappers are:
78 - Eland (part of the Illumina GA pipeline)
79 - Bowtie (bowtie-bio.sourceforge.net)
82 These are described in the sections on MAPPING READS WITH
83 ELAND, MAPPING READS WITH BOWTIE, MAPPING READS WITH BLAT.
85 For ChIP-seq, you can also import bed files of unique reads
86 only using makerdsfrombed.py .
88 Also see MANIPULATING RDS METADATA AND CACHING to learn about
89 some important aspects of working with RDS files.
92 4. BUILDING EXPANDED GENOMES
94 For RNA-seq using ELAND or BOWTIE mappings, you will need to build
95 an expanded genome consisting of genomic sequences, spike sequences,
96 and splice-spanning sequences in order to run ERANGE on your own
97 datasets. This expanded genome is specific to the read size used,
98 i.e. there will be a different expanded genome for mouse when using
99 25bp reads or 32bp reads. For reads longer than 32 bp, we recommend
100 using BOWTIE. If your reads are longer than 50bp, consider using
103 Download the chromosomes from UCSC, as well as the knownGene.txt (or
104 equivalent table) and a directory of repeatmask annotations for each
105 chromosome (also from UCSC) for your genome of interest.
107 You will need to build a splice fasta file using the script
108 getsplicefa.py, which needs Cistematic, the knownGene table, and a
109 paremeter for splice radius, which is 4 bp shorter than the length
112 Once you have the splice fasta file, drop it into the same directory
113 as well as a fasta file for your spikes. Then use squashGenome
114 (part of Eland) or bowtie-build (part of Bowtie), to build the
115 expanded genome. Please refer to the documentation for each
116 package to run the genome squasher/builder.
118 You will also build a repeat database using buildrmaskdb.py for use
119 in the candidate exon analysis from UCSC repeatmasker annotations.
122 5. MAPPING READS WITH ELAND
124 Please refer to the Illumina documentation for the details on
125 running squashGenome and Eland. If you do not have access to the
126 Illumina pipeline, use bowtie as described in the next section.
128 For ChIP-seq, you could take the output of the Illumina pipeline,
129 e.g. eland_multi.txt or eland_extended.txt and use them as inputs
130 for makerdsfromeland2.py .
132 Once you have run Eland with the --multi option (which we
133 colloquially call "eland2") for each RNA-seq lane against the
134 expanded genome, combine all of the outputs for one sample into a
135 single file e.g. test.comb.eland2
137 The makerdsfromeland2.py script is used to import the reads
140 python makerdsfromeland2.py label infilename outrdsfile [--append] [--RNA ucscGeneModels]
141 [propertyName::propertyValue] [--index] [--paired 1 or 2] [--extended] [--verbose]
142 [--olddelimiter] [--maxlines num] [--cache numPages]
144 The first 3 arguments are required:
145 - label is any label that you wish (a combination flowcell+lane#
147 - infilename is the output of eland in eland_multi format
148 (default) or eland_extended format (with the -extended flag)
149 - outdbname is the name of the rds file, e.g. test.rds
151 If the reads are from paired-end runs, enter each eland_multi
152 (or extended) file separately with the "--paired 1" or "--paired 2"
153 flag, as appropriate.
155 If entering more than one lane, use --append for all subsequent
156 lanes. Upon entering the last lane, use --index to build a read
157 index. Refer to MANIPULATING RDS METADATA AND CACHING for
158 information on the optional property::value pairs and caching.
160 For RNA-seq, you must in addition specify the path to knownGene.txt
161 using the --RNA flag, e.g.
163 python $ERANGEPATH/makerdsfromeland2.py myRNAlabel myRNA.eland_multi.txt rnatest.rds --RNA ../mm9/knownGene.txt [more options]
166 6. MAPPING READS WITH BOWTIE
168 Bowtie (bowtie-bio.sourceforge.net) is a new read-mapper that
169 is very fast and friendly. ERANGE supports version 0.10.X
170 and higher that allow you to control how many multireads
171 are reported. We recommend the following settings:
173 $BOWTIEDIR/bowtie zzz -v 2 -k 11 -m 10 -t --strata --best -f s1.query32.txt --un s1.unm.fa --max s1.max.fa s1.zzz.bowtie.txt
175 where zzz is the genome prefix that you gave when building the
176 genome. In particular, we ask bowtie to map all multireads up
177 to 11 ("-k") with up to 2 mismatches ("-v" and "--best"), however
178 we will only import all multireads up to 10x multiplicity ("-m").
179 Note that bowtie is multithreaded and can use multiple cpu based
180 on the -p flag (e.g. use "-p 4" to use 4 CPUs). Unmapped reads
181 are saved in unmapped.fa for later analysis.
183 Once reads are mapped, they can be imported using:
185 python $ERANGEPATH/makerdsfrombowtie.py testLabel s1.mm9.bowtie.txt bowtietest.rds
187 The options for the script are:
189 python makerdsfrombowtie.py label infilename outrdsfile
190 [--RNA ucscGeneModels] [--append] [--index] [propertyName::propertyValue]
191 [--rawreadID] [--verbose] [--cache numPages]
193 Refer to "MAPPING READS WITH ELAND" for a description of label,
194 infilename, outdbname, '--append', '--index', and '--cache'.
196 ****REMEMBER TO USE --index WHEN LOADING THE LAST LANE OF YOUR
199 The script assumes that the read ID are from Illumina, i.e. that
200 they have multiple fields separated by ':' and that paired-end
201 reads have an additional '/1' or '/2' depending on the end.
202 It will by default strip the first part of the readID (up to the
203 first ':') and replace it with the label. If you want raw readIDs
204 because you mapped raw reads that do not have an associated ID or
205 an ID that doesn't follow Illumina's conventions, use -rawreadID.
207 If not using Illumina readIDs, use any identifier of the format
209 throw_away:uniqueid if unpaired
210 throw_away:uniqueid/1 and throw_away:uniqueid/2 for paired-ends.
212 For RNA-seq, you must in addition specify the path to knownGene.txt
213 using the --RNA flag, e.g.
215 python $ERANGEPATH/makerdsfrombowtie.py myRNAlabel myRNA.bowtie.txt rnatest.rds --RNA ../mm9/knownGene.txt [more options]
218 7. MAPPING READS WITH BLAT
220 BLAT SUPPORT IN ERANGE IS STILL UNDER DEVELOPMENT AND THE
221 SCRIPTS AND SETTINGS BELOW MAY BE OPTIMIZED FURTHER IN
222 FUTURE RELEASES OF ERANGE.
224 Reads longer than 40-50bp can be fruitfully mapped with BLAT
225 against the reference genome without needing to provide the
226 exon junctions. While BLAT is much slower than BOWTIE, it
227 has the great advantage of seeing novel splices (i.e.
228 splices not present in knownGene models).
230 We use the following settings to map 75bp reads with BLAT and
231 filter them with pslReps:
233 $BLATPATH/blat /tmp/hg18.fa s3_1.query75.txt -out=pslx s3_1.hg18.blat
234 $BLATPATH/pslReps -minNearTopSize=70 s3_1.hg18.blat s3_1.hg18.blatbetter s3_1.blatpsr
236 where the binaries are in $BLATPATH anywhere on your system.
238 Once the reads have been filtered, the makerdsfromblat.py
239 script is used to import the mapped reads (in the example
240 above s3_1.hg18.blatbetter) into RDS:
242 python makerdsfromblat.py label infilename outrdsfile [--append] [--index] [propertyName::propertyValue]
243 [--rawreadID] [--forceRNA] [--flag] [--strict minSpliceLen] [--spliceonly] [--verbose] [--cache numPages]
245 If you are using BLAT for RNA-seq, please be sure to use
246 --forceRNA in order to import spliced reads and consider
247 using --strict to require a minimum length of bases on
248 each side of the splice.
250 You can combine BOWTIE and BLAT by mapping reads with BOWTIE
251 first, and then using BLAT to map the unmapped reads. In
252 that case, you may want to only load the spliced reads
253 using the --spliceonly flag. To track those reads in the RDS
254 file, use --flag ; you can then retrieve those reads using
255 the options "--flag blat --flagLike" with the makebedfromrds.py
259 8. IMPORTING BED FILES
261 If you do not have the raw read data, you can import unique
262 reads only using the script makerdsfrombed.py . Note that
263 this is not particularly useful for RNA-seq since you will
264 have neither the multireads nor the spliced reads.
266 The command line options are similar to those for other
267 scripts described in part 5-7:
269 python makerdsfrombed.py label bedfile outrdsfile [--append] [--index] [propertyName::propertyValue] [--cache numPages]
272 9. COMBINING RDS FILES
274 Previously created RDS files can be combined into a new RDS
275 dataset using the combinerds.py command with the granularity
276 of importing all tables or specific ones (e.g. uniqs, splices).
278 The combinerds.py command options are:
280 python combinerds.py destinationRDS inputrds1 [inputrds2 ....] [--table table_name] [--init] [--initrna] [--index] [--cache pages]
283 10. MANIPULATING RDS METADATA AND CACHING
285 One of the advantages of RDS over bed, is the possibility of
286 attaching arbitrary sets of annotations with the data, which
287 are then carried along. Both the makerds* scripts and
288 rdsmetadata.py allows you to both enter key::value
289 combinations. Entering a key multiple times will cause the
290 same instance to be recorded multiple times, which is
291 appropriate in some settings (e.g. to enter flowcell info).
292 In addition rdsmetadata.py allows you to inspect various
293 attributes of your RDS files such as # of reads and size
294 of the default cache size.
296 Sqlite files have a certain amount of RAM set aside as cache
297 for lookups, indexes, etc.... where the amount is measured in
298 1.5kb pages. Each RDS instance come with a default of 100000
299 pages (150MB) of cache, which is needlessly small in most
300 situations. Whenever appropriate, try using more cache (e.g.
301 750000 pages on a 2GB RAM machine, much more if more RAM is
302 available) for a significant speed increase in indexing and
303 lookups. You can change the default value for each RDS file
304 by using the -defaultcache option of rdsmetadata.py.
306 Note that sqlite can be very slow over NFS. Wherever
307 possible, copy your RDS file locally before running an I/O
311 11. VISUALIZING THE DATA IN RDS FILES
313 You can output bed-files of the raw reads using
314 makebedfromrds.py. A more practical way to look at the data
315 might be to ouput it as a bedGraph file using makewiggle.py .
317 Note that UCSC has a hard limit on the size of their files
318 and you will likely need to break the wiggles on a per-chromosome
319 basis for mammalian genomes.
323 version 3.3 November 2010 - updated command line options
324 version 3.2 October 2009 - added combinerds.py
325 version 3.01 February 2009 - bug fixes
326 version 3.0 January 2009 - added logging to buildrdsfrom*
327 version 3.0rc1 December 2008 - added blat support