This is a description of the pipeline designed to analyze single
nucleotide changes found in the mapped reads. The code should run
on any Unix-like system supporting Python 2.5 or later. The code
is developed on Mac OS X with Python 2.5.

1. COMMAND LINE OPTIONS
2. BUILDING THE SNP DATABASE
3. RUNNING THE SNP PIPELINE


1. COMMAND LINE OPTIONS

To find out more about the settings for each script, type:

    python $ERANGEPATH/<scriptname>

to see the command line options. Note that all ERANGE command-line
options are case-sensitive and that the scripts typically ignore
command-line arguments that they do not recognize!


2. BUILDING THE SNP DATABASE

In order to check the candidate SNPs against known SNPs, you will need
to first download the corresponding dbSNP database file from UCSC and
then build a sqlite version of it using:

    python $ERANGEPATH/buildsnpdb.py ucscSNPfile outdb

For example:

    python buildsnpdb.py snp128.txt dbSNP128
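
For orientation, the conversion from a UCSC text dump to sqlite can be
sketched as below. This is not the code in buildsnpdb.py: the helper name
build_snp_db, the table name "snp" and its columns are hypothetical, and
the sketch assumes a tab-delimited UCSC file with chrom, chromStart,
chromEnd and name in the second through fifth columns.

```python
import sqlite3


def build_snp_db(ucsc_path, db_path):
    """Load chrom, start, stop and SNP name from a UCSC dump into sqlite.

    Hypothetical helper: the table layout is illustrative only and is
    not the schema written by buildsnpdb.py.
    """
    rows = []
    infile = open(ucsc_path)
    try:
        for line in infile:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue  # skip blank or malformed lines
            # Assumed column order: bin, chrom, chromStart, chromEnd, name
            rows.append((fields[1], int(fields[2]), int(fields[3]), fields[4]))
    finally:
        infile.close()

    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS snp "
                 "(chrom TEXT, start INTEGER, stop INTEGER, name TEXT)")
    conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)", rows)
    # Index by position so that lookups of candidate SNPs stay fast.
    conn.execute("CREATE INDEX IF NOT EXISTS snp_pos ON snp (chrom, start)")
    conn.commit()
    conn.close()
    return len(rows)
```

The function returns the number of rows loaded, which gives a quick check
that the input file was parsed as expected.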


3. RUNNING THE SNP PIPELINE

The runSNPAnalysis.sh shell script is designed to retrieve SNPs, filter
them against repeat annotations, cross-check them against known SNPs and
annotate the novel SNPs. It automatically runs the set of python scripts
required for the SNP analysis, using the RDS (Read DataSet) file.
The script assumes the existence of a known SNP database, as described in
the previous section, as well as of a repeatmask database.

Usage: $ERANGEPATH/runSNPAnalysis.sh genome rdsfile label rmaskdbfile dbsnpfile uniqStartMin totalRatio rpkmfile cachepages

where ERANGEPATH is the environment variable set to the path of the directory holding the ERANGE scripts.

- genome: the name of the organism in the analysis.
- rdsfile: Read DataSet file. See README.build-rds for
- label: a file name of your choice for the analysis.
- rmaskdbfile: repeat mask database, a sqlite database file. See
  README.rna-seq for more information on creating the database.
- dbsnpfile: dbSNP database, a sqlite database file built from the
  dbSNP database text file from UCSC. See section 2 for how to build
  the dbSNP sqlite database using buildsnpdb.py.
- uniqStartMin: the ratio of the number of unique reads supporting a
  SNP at base s to the maximum unique read coverage at base s.
  5 is a good number to start with.
- totalRatio: the ratio of the number of reads supporting an
  expressed SNP at s to the total read coverage at s. 0.75 should allow
  you to get the homozygous SNPs.
- rpkmfile: the rpkm file, which can be generated using the RNA-seq
  pipeline as described in README.rna-seq. If you do not have that file, you can
- cachepages: the number of cache pages. Make sure to use as much caching
  as your system will accommodate. See README.build-rds for more information.
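
To make the two thresholds concrete, here is a minimal sketch of how a
candidate SNP might be tested against uniqStartMin and totalRatio. It is
an illustration only, not the logic of getSNPs.py: the function name
passes_thresholds and its arguments are hypothetical, and it assumes that
uniqStartMin is applied as a minimum count of distinct supporting read
starts, which is one reading of the suggested starting value of 5.

```python
def passes_thresholds(uniq_starts, supporting_reads, total_reads,
                      uniq_start_min=5, total_ratio=0.75):
    """Return True if a candidate SNP at one base meets both thresholds.

    Hypothetical helper. Assumption: uniq_start_min is a minimum count of
    distinct read starts supporting the SNP; total_ratio is the minimum
    fraction of all reads covering the base that support the SNP.
    """
    if total_reads <= 0:
        return False  # no coverage, so nothing to call
    ratio = float(supporting_reads) / total_reads
    return uniq_starts >= uniq_start_min and ratio >= total_ratio
```

Under these assumptions, a base covered by 10 reads, 9 of which support
the change from 6 distinct start positions, passes with the suggested
settings, while the same base with only 5 supporting reads does not.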

Example: $ERANGEPATH/runSNPAnalysis.sh mouse 24T4spike.rds 24Tspike rmask.db dbSNP128.db 5 0.75 c2c12rna.24R.final.rpkm 5000000
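
Because the nine arguments are positional, a misplaced or missing value
is easy to overlook. The sketch below shows one way to assemble the
command from named values before running it. The helper build_snp_command
and the ARG_NAMES list are hypothetical conveniences, not part of ERANGE;
only the script name and the argument order are taken from the usage line.

```python
import os

# Argument order copied from the usage line of runSNPAnalysis.sh.
ARG_NAMES = ["genome", "rdsfile", "label", "rmaskdbfile", "dbsnpfile",
             "uniqStartMin", "totalRatio", "rpkmfile", "cachepages"]


def build_snp_command(erange_path, **args):
    """Return the argument list for runSNPAnalysis.sh (hypothetical helper).

    Raises ValueError if any of the nine positional arguments is missing,
    so that an incomplete call fails before the pipeline starts.
    """
    missing = [name for name in ARG_NAMES if name not in args]
    if missing:
        raise ValueError("missing arguments: " + ", ".join(missing))
    script = os.path.join(erange_path, "runSNPAnalysis.sh")
    return [script] + [str(args[name]) for name in ARG_NAMES]
```

The returned list can be passed to subprocess.call, with erange_path read
from os.environ["ERANGEPATH"].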

version 3.0    January 2009  - logging
version 3.0rc1 December 2008 - major rewrite and speed-up of getSNPs.py and chksnp.py
version 3.0b2  December 2008 - bug fixes and ERANGEPATH variable