docs/README.build-rds

   1 This is a description of the sqlite-based read storage
   2 files and of the scripts designed to import read
   3 mappings from supported short read mappers.  The code
   4 should run on any Unix-like system supporting python 2.5
   5 or better. The code is developed on Linux and MacOS X on
   6 python 2.5.
   7
   8 This code is made available as open-source, as described
   9 in the copyright file ERANGE.COPYRIGHT.
  10
  11 1. REQUIREMENTS
  12 2. COMMAND LINE OPTIONS
  13 3. CREATING THE NECESSARY INPUT (RDS) FILES
  14 4. BUILDING EXPANDED GENOMES
  15 5. MAPPING READS WITH ELAND
  16 6. MAPPING READS WITH BOWTIE
  17 7. MAPPING READS WITH BLAT
  18 8. IMPORTING BED FILES
  19 9. COMBINING RDS FILES
  20 10. MANIPULATING RDS METADATA AND CACHING
  21 11. VISUALIZING THE DATA IN RDS FILES
  22
  23
  24 1. REQUIREMENTS
  25
  26 See README.chip-seq or README.rna-seq to see the requirements
  27 for installing and running ERANGE specific to each
  28 application.
  29
  30
  31 2. COMMAND LINE OPTIONS
  32
  33 You can find out more about the settings for each script
  34 by typing:
  35
  36 python $ERANGEPATH/<scriptname>
  37
  38 to see the command line options, where ERANGEPATH is the
  39 environmental variable set to the path to the directory
  40 holding the ERANGE scripts. Note that the command line
  41 options are case sensitive and that they could well
  42 fail silently.
  43
  44
  45 3. CREATING THE NECESSARY INPUT (RDS) FILES
  46
  47 Before you can use the rest of the ERANGE scripts to do
  48 CHiP-seq or RNA-seq analyses, you will need to first
  49 convert your read mappings to the native ERANGE read
  50 storage format, which is sqlite-based, and which is
  51 called RDS (Read DataSet). RDS files consist of four
  52 tables:
  53 - metadata (tracks required and optional metadata)
  54 - uniqs (stores uniquely mappable-reads)
  55 - multi (stores reads that map equally well to multiple
  56 locations in the genome)
  57 - splices (stores split reads)
  58
  59 a readDataset python object (in commoncode.py) provides
  60 the encapsulation of the read database which is accessed
  61 through specific methods. Since an RDS file is a sqlite3
  62 database, you can additionally use any of the sqlite-based
  63 tools to look at the reads in the tables, if you wish to
  64 do so.
  65
  66 You will need to first map your reads with one of the
  67 supported read mappers (see next paragraph) against a copy
  68 of the appropriate genome. For ChIP-seq, it will be your
  69 genome of interest, whereas for RNA-seq reads should be
  70 mapped against an expanded genome, which consists of
  71 chromosomes + splice junctions which depend on the read
  72 length used. Note that several parts of the code assume
  73 that your genomic sequences are labelled with the "chr"
  74 chromosomes prefix. For more information on creating
  75 expanded genomes, see BUILDING EXPANDED GENOMES.
  76
  77 The currently supported read mappers are:
  78 - Eland (part of the Illumina GA pipeline)
  79 - Bowtie (bowtie-bio.sourceforge.net)
  80 - Blat (from UCSC)
  81
  82 These are described in the sections on MAPPING READS WITH
  83 ELAND, MAPPING READS WITH BOWTIE, MAPPING READS WITH BLAT.
  84
  85 For ChIP-seq, you can also import bed files of unique reads
  86 only using makerdsfrombed.py .
  87
  88 Also see MANIPULATING RDS METADATA AND CACHING to learn about
  89 some important aspects of working with RDS files.
  90
  91
  92 4. BUILDING EXPANDED GENOMES
  93
  94 For RNA-seq using ELAND or BOWTIE mappings, you will need to build
  95 an expanded genome consisting of genomic sequences, spike sequences,
  96 and splice-spanning sequences in order to run ERANGE on your own
  97 datasets. This expanded genome is specific to the read size used,
  98 i.e. there will be a different expanded genome for mouse when using
  99 25bp reads or 32bp reads. For reads longer than 32 bp, we recommend
 100 using BOWTIE. If your reads are longer than 50bp, consider using
 101 BLAT instead.
 102
 103 Download the chromosomes from UCSC, as well as the knownGene.txt (or
 104 equivalent table) and a directory of repeatmask annotations for each
 105 chromosome (also from UCSC) for your genome of interest.
 106
 107 You will need to build a splice fasta file using the script
 108 getsplicefa.py, which needs Cistematic, the knownGene table, and a
 109 paremeter for splice radius, which is 4 bp shorter than the length
 110 of the reads.
 111
 112 Once you have the splice fasta file, drop it into the same directory
 113 as well as a fasta file for your spikes. Then use squashGenome
 114 (part of Eland) or bowtie-build (part of Bowtie), to build the
 115 expanded genome. Please refer to the documentation for each
 116 package to run the genome squasher/builder.
 117
 118 You will also build a repeat database using buildrmaskdb.py for use
 119 in the candidate exon analysis from UCSC repeatmasker annotations.
 120
 121
 122 5. MAPPING READS WITH ELAND
 123
 124 Please refer to the Illumina documentation for the details on
 125 running squashGenome and Eland. If you do not have access to the
 126 Illumina pipeline, use bowtie as described in the next section.
 127
 128 For ChIP-seq, you could take the output of the Illumina pipeline,
 129 e.g. eland_multi.txt or eland_extended.txt and use them as inputs
 130 for makerdsfromeland2.py .
 131
 132 Once you have run Eland with the --multi option (which we
 133 colloquially call "eland2") for each RNA-seq lane against the
 134 expanded genome, combine all of the outputs for one sample into a
 135 single file e.g. test.comb.eland2
 136
 137 The makerdsfromeland2.py script is used to import the reads
 138 into RDS:
 139
 140 python makerdsfromeland2.py label infilename outrdsfile [--append] [--RNA ucscGeneModels]
 141 [propertyName::propertyValue] [--index] [--paired 1 or 2] [--extended] [--verbose]
 142 [--olddelimiter] [--maxlines num] [--cache numPages]
 143
 144 The first 3 arguments are required:
 145 - label is any label that you wish (a combination flowcell+lane#
 146 is a good choice)
 147 - infilename is the output of eland in eland_multi format
 148 (default) or eland_extended format (with the -extended flag)
 149 - outdbname is the name of the rds file, e.g. test.rds
 150
 151 If the reads are from paired-end runs, enter each eland_multi
 152 (or extended) file separately with the "--paired 1" or "--paired 2"
 153 flag, as appropriate.
 154
 155 If entering more than one lane, use --append for all subsequent
 156 lanes. Upon entering the last lane, use --index to build a read
 157 index. Refer to MANIPULATING RDS METADATA AND CACHING for
 158 information on the optional property::value pairs and caching.
 159
 160 For RNA-seq, you must in addition specify the path to knownGene.txt
 161 using the --RNA flag, e.g.
 162
 163 python $ERANGEPATH/makerdsfromeland2.py myRNAlabel myRNA.eland_multi.txt rnatest.rds --RNA ../mm9/knownGene.txt [more options]
 164
 165
 166 6. MAPPING READS WITH BOWTIE
 167
 168 Bowtie (bowtie-bio.sourceforge.net) is a new read-mapper that
 169 is very fast and friendly. ERANGE supports version 0.10.X
 170 and higher that allow you to control how many multireads
 171 are reported. We recommend the following settings:
 172
 173 $BOWTIEDIR/bowtie zzz -v 2 -k 11 -m 10 -t --strata --best -f s1.query32.txt --un s1.unm.fa --max s1.max.fa s1.zzz.bowtie.txt
 174
 175 where zzz is the genome prefix that you gave when building the
 176 genome. In particular, we ask bowtie to map all multireads up
 177 to 11 ("-k") with up to 2 mismatches ("-v" and "--best"), however
 178 we will only import all multireads up to 10x multiplicity ("-m").
 179 Note that bowtie is multithreaded and can use multiple cpu based
 180 on the -p flag (e.g. use "-p 4" to use 4 CPUs). Unmapped reads
 181 are saved in unmapped.fa for later analysis.
 182
 183 Once reads are mapped, they can be imported using:
 184
 185 python $ERANGEPATH/makerdsfrombowtie.py testLabel s1.mm9.bowtie.txt bowtietest.rds
 186
 187 The options for the script are:
 188
 189 python makerdsfrombowtie.py label infilename outrdsfile
 190 [--RNA ucscGeneModels]  [--append]  [--index] [propertyName::propertyValue]
 191 [--rawreadID] [--verbose] [--cache numPages]
 192
 193 Refer to "MAPPING READS WITH ELAND" for a description of label,
 194 infilename, outdbname, '--append', '--index', and '--cache'.
 195
 196 ****REMEMBER TO USE --index WHEN LOADING THE LAST LANE OF YOUR
 197 DATASET.****
 198
 199 The script assumes that the read ID are from Illumina, i.e. that
 200 they have multiple fields separated by ':' and that paired-end
 201 reads have an additional '/1' or '/2' depending on the end.
 202 It will by default strip the first part of the readID (up to the
 203 first ':') and replace it with the label. If you want raw readIDs
 204 because you mapped raw reads that do not have an associated ID or
 205 an ID that doesn't follow Illumina's conventions, use -rawreadID.
 206
 207 If not using Illumina readIDs, use any identifier of the format
 208
 209 throw_away:uniqueid if unpaired
 210 throw_away:uniqueid/1 and throw_away:uniqueid/2 for paired-ends.
 211
 212 For RNA-seq, you must in addition specify the path to knownGene.txt
 213 using the --RNA flag, e.g.
 214
 215 python $ERANGEPATH/makerdsfrombowtie.py myRNAlabel myRNA.bowtie.txt rnatest.rds --RNA ../mm9/knownGene.txt [more options]
 216
 217
 218 7. MAPPING READS WITH BLAT
 219
 220 BLAT SUPPORT IN ERANGE IS STILL UNDER DEVELOPMENT AND THE
 221 SCRIPTS AND SETTINGS BELOW MAY BE OPTIMIZED FURTHER IN
 222 FUTURE RELEASES OF ERANGE.
 223
 224 Reads longer than 40-50bp can be fruitfully mapped with BLAT
 225 against the reference genome without needing to provide the
 226 exon junctions. While BLAT is much slower than BOWTIE, it
 227 has the great advantage of seeing novel splices (i.e.
 228 splices not present in knownGene models).
 229
 230 We use the following settings to map 75bp reads with BLAT and
 231 filter them with pslReps:
 232
 233 $BLATPATH/blat /tmp/hg18.fa s3_1.query75.txt -out=pslx s3_1.hg18.blat
 234 $BLATPATH/pslReps -minNearTopSize=70 s3_1.hg18.blat s3_1.hg18.blatbetter s3_1.blatpsr
 235
 236 where the binaries are in $BLATPATH anywhere on your system.
 237
 238 Once the reads have been filtered, the makerdsfromblat.py
 239 script is used to import the mapped reads (in the example
 240 above s3_1.hg18.blatbetter) into RDS:
 241
 242 python makerdsfromblat.py label infilename outrdsfile [--append] [--index] [propertyName::propertyValue]
 243 [--rawreadID] [--forceRNA]  [--flag] [--strict minSpliceLen] [--spliceonly] [--verbose] [--cache numPages]
 244
 245 If you are using BLAT for RNA-seq, please be sure to use
 246 --forceRNA in order to import spliced reads and consider
 247 using --strict to require a minimum length of bases on
 248 each side of the splice.
 249
 250 You can combine BOWTIE and BLAT by mapping reads with BOWTIE
 251 first, and then using BLAT to map the unmapped reads. In
 252 that case, you may want to only load the spliced reads
 253 using the --spliceonly flag. To track those reads in the RDS
 254 file, use --flag ; you can then retrieve those reads using
 255 the options "--flag blat --flagLike" with the makebedfromrds.py
 256 script.
 257
 258
 259 8. IMPORTING BED FILES
 260
 261 If you do not have the raw read data, you can import unique
 262 reads only using the script makerdsfrombed.py . Note that
 263 this is not particularly useful for RNA-seq since you will
 264 have neither the multireads nor the spliced reads.
 265
 266 The command line options are similar to those for other
 267 scripts described in part 5-7:
 268
 269 python makerdsfrombed.py label bedfile outrdsfile [--append] [--index] [propertyName::propertyValue] [--cache numPages]
 270
 271
 272 9.  COMBINING RDS FILES
 273
 274 Previously created RDS files can be combined into a new RDS
 275 dataset using the combinerds.py command with the granularity
 276 of importing all tables or specific ones (e.g. uniqs, splices).
 277
 278 The combinerds.py command options are:
 279
 280 python combinerds.py destinationRDS inputrds1 [inputrds2 ....] [--table table_name] [--init] [--initrna] [--index] [--cache pages]
 281
 282
 283 10. MANIPULATING RDS METADATA AND CACHING
 284
 285 One of the advantages of RDS over bed, is the possibility of
 286 attaching arbitrary sets of annotations with the data, which
 287 are then carried along. Both the makerds* scripts and
 288 rdsmetadata.py allows you to both enter key::value
 289 combinations. Entering a key multiple times will cause the
 290 same instance to be recorded multiple times, which is
 291 appropriate in some settings (e.g. to enter flowcell info).
 292 In addition rdsmetadata.py allows you to inspect various
 293 attributes of your RDS files such as # of reads and size
 294 of the default cache size.
 295
 296 Sqlite files have a certain amount of RAM set aside as cache
 297 for lookups, indexes, etc.... where the amount is measured in
 298 1.5kb pages. Each RDS instance come with a default of 100000
 299 pages (150MB) of cache, which is needlessly small in most
 300 situations. Whenever appropriate, try using more cache (e.g.
 301 750000 pages on a 2GB RAM machine, much more if more RAM is
 302 available) for a significant speed increase in indexing and
 303 lookups. You can change the default value for each RDS file
 304 by using the -defaultcache option of rdsmetadata.py.
 305
 306 Note that sqlite can be very slow over NFS. Wherever
 307 possible, copy your RDS file locally before running an I/O
 308 intensive script.
 309
 310
 311 11. VISUALIZING THE DATA IN RDS FILES
 312
 313 You can output bed-files of the raw reads using
 314 makebedfromrds.py. A more practical way to look at the data
 315 might be to ouput it as a bedGraph file using makewiggle.py .
 316
 317 Note that UCSC has a hard limit on the size of their files
 318 and you will likely need to break the wiggles on a per-chromosome
 319 basis for mammalian genomes.
 320
 321 RELEASE HISTORY
 322
 323 version 3.3    November 2010 - updated command line options
 324 version 3.2    October  2009 - added combinerds.py
 325 version 3.01   February 2009 - bug fixes
 326 version 3.0    January  2009 - added logging to buildrdsfrom*
 327 version 3.0rc1 December 2008 - added blat support
 328
 329