The latest version of this software is available at http://woldlab.caltech.edu please check the website for updates. This is the core of the RNA-seq analysis code described in Mortazavi et al (2008). Please make sure that you have read Figure 3 and the methods / supplemental methods of that paper before attempting to use this package for RNA-Seq data analysis. ERANGE should run on any Unix-like system supporting python 2.5 or better. The code is developed on Linux and MacOS X on python 2.5. Historically, the code for ERANGE grew out of the ChIPSeqMini package from Johnson et al (2007), and some of the key scripts (findallnocontrol.py and getallgenes.py) are shared between the two. This is why ERANGE is "dual-use" and is also why the code for both analyses were kept in common as much as possible. This should be helpful when someone tries to combine ChIP-seq and RNA-seq analyses ! This code is made available as open-source, as described in the copyright file ERANGE.COPYRIGHT. SETTING EXPECTATIONS ERANGE is not a point-and-click, turn-key package. It is a set of python scripts that, when run in order as a pipeline on the "right" input, will take read data in bed format and calculate gene expression levels in RPKM (Reads Per kb per Million reads). This pipeline is embodied in a shell script called runStandardAnalysis.sh, which only takes a few inputs, described in the ANALYSIS and PIPELINE section below. You should be able to download the data from our website and run the analysis through the pipeline.Whether you decide to start from the bed files or the eland files is up to you, but the later will require running the eland to bed conversion scripts manually, described in the MAPPING ELAND OUTPUT INTO BED FILES section below. Because you will likely want to run this package on other genomes (or builds) than the one described in our original paper, you will need to do several additional steps, such as: - build expanded genomes with splices and spikes - check overlap of RNAFAR predictions with repeats This will require some comfort with running and, if necessary, editing scripts. While the code is sparsely documented, we are making it available so that you can *read it*. We'll be happy to help modifying and updating the code within a reasonable extent and will try to provide more in depth documentation and tutorials on our web site. While the scripts produce several forms of RPKM, we suggest that the "final" RPKM are the values that most people will be interested in. REQUIREMENTS 1) reads in bed format. We assume and for now only provide scripts for parsing ELAND (0.3 or higher) output run with the --multi option into bed files. If you decide to use any other read mapper or another platform, you will need to generate 3 bed files for unique, spliced, and multireads. Note that multireads are flagged with an identifier of the form: ReadIDxMultiplicity where every entry in the bed file for that particular read at each possible location has the same identifier. 1) Python 2.5 is required because some of the scripts and Cistematic (see below) need pysqlite, which is now bundled in Python. 2) You will also need to use Cistematic 2.0 for some of the scripts marked below that use genes and genomic sequence; in particular, you will also need the Cistematic version of the genomes. Cistematic is available at http://cistematic.caltech.edu 3) You will need genomic sequences to build the expanded genome, as well as gene models from UCSC. (Optional) Python is very slow on large datasets. Use of the psyco module (psyco.sf.net) on 32-bit Linux or all Mac Intel machines to significantly speed up runtime is highly recommended. (Optional) Several of the ploting scripts also rely on Matplotlib, which is available at matplotlib.sf.net. COMMAND LINE OPTIONS You can find out more about the settings for each python script by typing: python commoncode/ to see the command line options. For example, if you wanted to know the command line options of the script used to generate supplementary datasets 2-4, combineRPKMs.py , you would type: python commoncode/combineRPKMs.py and get back a version number and all possible command line options: version 1.0 usage: python commoncode/combineRPKMs.py firstRPKM expandedRPKM finalRPKM combinedOutfile [-withmultifraction] where fields in brackets are optional. BUILDING EXPANDED GENOMES AND RUNNING ELAND You will need to build an expanded genome consisting of genomic sequences, spike sequences, and splice-spanning sequences in order to run ERANGE on your own datasets. This expanded genome is specific to the read size used, i.e. there will be a different expanded genome for mouse when using 25bp reads or 32bp reads. Download the chromosomes from UCSC, as well as the knownGene.txt (or equivalent table) and a directory of repeatmask annotations for each chromosome (also from UCSC) for your genome of interest. You will need to build a splice fasta file using the script getsplicefa.py, which needs Cistematic, the knownGene table, and a paremeter for splice radius, which is 4 bp shorter than the length of the reads. Once you have the splice fasta file, drop it into the same directory as well as a fasta file for your spikes. Then use squashGenome (part of Eland), to build the expanded genome. You will also build a repeat database using buildrmaskdb.py for use in the candidate exon analysis. Please refer to the Illumina documentation for the details on running squashGenome and ELAND. MAPPING ELAND OUTPUT INTO BED FILES Once you have run ELAND with the --multi option (which we colloquially call "eland2") for each RNA-seq lane against the expanded genome, combine all of the outputs for one sample into a single file e.g. test.comb.eland2 The following scripts are then used to map the reads onto bed files: maketrackfromeland2.py maketrackmulti.py and remapSplicesEland2.py for example, assuming that the UCSC gene model description are in ../mm9splices/knownGene.txt, we would run: python2.5 ../commoncode/maketrackfromeland2.py test.comb.eland2 test test.uniqs.bed python2.5 ../commoncode/maketrackmulti.py test.comb.eland2 testmulti test.multi.bed python2.5 ../commoncode/remapSplicesEland2.py ../mm9splices/knownGene.txt test.comb.eland2 testsplices test.splices.bed It is important to stick to the naming convention of the bed files, as the pipeline assumes that all of the bed files have the same prefix (in this case "test"), and will use the appropriate file extensions (e.g. ".uniqs.bed") with the right scripts. You can also create a bed-formatted WIG file, for display on the UCSC browser: makewiggle.py ANALYSIS The main steps of a typical analysis using ERANGE is shown in analysisSteps.txt, where each script would be run in order, with the caveat that there are two ways to do the candidate exon analysis (RNAFAR), creatively called "alternative 1" and "alternative 2". In alternative 1, we use the .nomatch.bed file computed in the previous step to identify candidate regions: # Alternative 1: find new regions outside of gene models with reads piled up python2.5 ../commoncode/findallnocontrol.py test test.nomatch.bed test.newregions.txt 25 40 # Alternative 1: filter out new regions that overlap repeats more than a certain fraction python2.5 ../commoncode/checkrmask.py ../mm9repeats/rmask.db test.newregions.txt test.newregions.repstatus test.newregions.good 1 In alternative 2, we pool multiple .nomatch.bed into a single nomatch.bed, run it through the two scripts of alternative 1 above, and then use these precomputed candidates to count reads falling in these regions: # Alternative 2: use a precomputed list of "new" regions (outside of gene models) python2.5 ../commoncode/regionCounts.py ../RNAFAR/all.newregions.good test.nomatch.bed test.newregions.good test.stillnomatch.bed Alternative 2 is the one used by the pipeline script described below. The scripts will generate a set of intermediate files, the most interesting of which are the gene RPKM values. These will be in the following files for the test example: test.firstpass.rpkm (the unique reads only) test.expanded.rpkm (the unique reads + spliced reads + RNAFAR) test.final.rpkm (uniques + spliced + RNAFAR + multireads) PIPELINE Most of the analysis steps described in the section above are automated in a pipeline shell script called runStandardAnalysis.sh . Note that the pipeline assumes that candidate exons have already been identified (e.g. from pooled data), which is called "alternative 2" in the ANALYSIS section. The pipeline assumes that the bed files for the uniq, multi, and spliced reads exist andare named as described in the MAPPING ELAND OUTPUT INTO BED FILES section. We assume that Cistematic 2.0 is installed, including a version of the appropriate Cistematic genome. We will also need a radius (e.g. 20000 bp) within which a candidate exon will be consolidated with an existing gene. For example, for the test dataset from the ANALYSIS section, we could run the pipeline as: . ./runStandardAnalysis.sh mouse test ../RNAFAR/all.newregions.good 20001 This could run from an hour to a whole day depending on how many reads are involved (1M vs 80M) and how big a consolidation radius is used. RELEASE HISTORY version 2.0 May 2008 - First public release of ERANGE