docs/README.rna-esnp

This document describes the pipeline for analyzing single-nucleotide
changes found in mapped reads. The code should run on any Unix-like
system with Python 2.5 or later. It is developed on Mac OS X with
Python 2.5.

1. COMMAND LINE OPTIONS
2. BUILDING THE SNP DATABASE
3. RUNNING THE SNP PIPELINE


1. COMMAND LINE OPTIONS

To find out more about the settings for a script, type:

python $ERANGEPATH/<scriptname>

This prints the script's command-line options. Note that all ERANGE
command-line options are case-sensitive, and that the scripts typically
ignore command-line arguments that they do not recognize, so a
misspelled option may be dropped without any warning.


2. BUILDING THE SNP DATABASE

To check candidate SNPs against known SNPs, first download the
corresponding dbSNP database file from UCSC, then build a sqlite
version of it using:

python $ERANGEPATH/buildsnpdb.py ucscSNPfile outdb

e.g.

python buildsnpdb.py snp128.txt dbSNP128
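
Once the database is built, a quick sanity check is to confirm that the
output file is a readable sqlite database with non-empty tables. The
sketch below is not part of ERANGE: the function name summarize_sqlite
is hypothetical, and because this README does not describe the schema
that buildsnpdb.py writes, the code only asks sqlite for its table list
and row counts. It is written for a current Python 3 interpreter rather
than the Python 2.5 the pipeline itself targets.

```python
import os
import sqlite3


def summarize_sqlite(path):
    """Return {table_name: row_count} for every table in a sqlite file.

    Hypothetical helper, not part of ERANGE. It makes no assumption
    about the schema and only reads the sqlite_master catalog.
    """
    # sqlite3.connect() would silently create a missing file, so check first.
    if not os.path.exists(path):
        raise IOError("no such database file: " + path)
    conn = sqlite3.connect(path)
    try:
        cursor = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        names = [row[0] for row in cursor]
        counts = {}
        for name in names:
            # Quote the identifier so unusual table names cannot break the query.
            quoted = '"' + name.replace('"', '""') + '"'
            row = conn.execute("SELECT COUNT(*) FROM " + quoted).fetchone()
            counts[name] = row[0]
        return counts
    finally:
        conn.close()
```

For the example above, calling summarize_sqlite("dbSNP128") should
return a dictionary with at least one table and a non-zero row count if
the build succeeded.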


3. RUNNING THE SNP PIPELINE

The runSNPAnalysis.sh shell script retrieves SNPs, filters them against
repeat annotations, cross-checks them against known SNPs, and annotates
the novel SNPs. It automatically runs the set of python scripts required
for the SNP analysis, using the RDS (Read DataSet) file. The script
assumes that a known-SNP database exists, built as described in the
previous section, as well as a repeatmask database.

Usage: $ERANGEPATH/runSNPAnalysis.sh genome rdsfile label rmaskdbfile dbsnpfile uniqStartMin totalRatio rpkmfile cachepages

where ERANGEPATH is the environment variable set to the path of the
directory holding the ERANGE scripts.

Parameters:
- genome: the name of the organism in the analysis.
- rdsfile: Read DataSet file. See README.build-rds for more
information.
- label: a file name of your choice for the analysis.
- rmaskdbfile: repeat mask database, a sqlite database file. See
README.rna-seq for more information on creating the database.
- dbsnpfile: dbSNP database, a sqlite database file built from the
dbSNP database text file from UCSC. See section 2 for how to build
it with buildsnpdb.py.
- uniqStartMin: the ratio of the number of unique reads supporting a
SNP at base s to the maximum unique read coverage at base s.
5 is a good number to start with.
- totalRatio: the ratio of the number of reads supporting an
expressed SNP at s to the total read coverage at s. 0.75 should allow
you to get the homozygous SNPs.
- rpkmfile: RPKM file, which can be generated using the RNA-seq
pipeline as described in README.rna-seq. If you do not have that file,
you can set it to NONE.
- cachepages: cache pages. Make sure to use as much caching as your
system will accommodate. See README.build-rds for more information.
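
As a worked illustration of the totalRatio definition above, the sketch
below computes the fraction of reads at a position that support the SNP
and compares it with the threshold. It illustrates the arithmetic only
and is not the logic used inside the ERANGE scripts: the function name
and the read counts are hypothetical, and treating the threshold as
inclusive (>=) is an assumption. It is written for a current Python 3
interpreter.

```python
def passes_total_ratio(supporting_reads, total_coverage, total_ratio=0.75):
    """Return True if supporting_reads / total_coverage reaches total_ratio.

    Hypothetical helper illustrating the totalRatio definition; the
    inclusive comparison is an assumption, not documented behavior.
    """
    if total_coverage <= 0:
        # No coverage at this position, so there is nothing to evaluate.
        return False
    fraction = float(supporting_reads) / total_coverage
    return fraction >= total_ratio


# 18 of 20 reads support the SNP: fraction 0.9, above the 0.75 threshold.
mostly_supporting = passes_total_ratio(18, 20)

# 11 of 20 reads support the SNP: fraction 0.55, below the threshold,
# as might be seen at a heterozygous position.
roughly_half = passes_total_ratio(11, 20)
```

Lowering totalRatio would let positions like the second one through,
which is why 0.75 is suggested for recovering homozygous SNPs.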

Example: $ERANGEPATH/runSNPAnalysis.sh mouse 24T4spike.rds 24Tspike rmask.db dbSNP128.db 5 0.75 c2c12rna.24R.final.rpkm 5000000
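
Because the script takes nine positional arguments in a fixed order, it
can be convenient to assemble the command from named values when
driving it from another program. The sketch below is one way to do
that, assuming a current Python 3 interpreter; build_snp_command and
PARAMETER_ORDER are hypothetical names, not part of ERANGE. The
argument order is taken from the usage line above.

```python
import os

# Positional order taken from the runSNPAnalysis.sh usage line.
PARAMETER_ORDER = [
    "genome", "rdsfile", "label", "rmaskdbfile", "dbsnpfile",
    "uniqStartMin", "totalRatio", "rpkmfile", "cachepages",
]


def build_snp_command(erange_path, **params):
    """Return the argument list for runSNPAnalysis.sh.

    Hypothetical helper: it checks that every documented parameter is
    present and that no unknown one was passed, then emits the values
    in the documented positional order.
    """
    missing = [name for name in PARAMETER_ORDER if name not in params]
    if missing:
        raise ValueError("missing parameters: " + ", ".join(missing))
    unknown = sorted(set(params) - set(PARAMETER_ORDER))
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))
    script = os.path.join(erange_path, "runSNPAnalysis.sh")
    return [script] + [str(params[name]) for name in PARAMETER_ORDER]
```

With the values from the example above, the returned list could be
passed to subprocess.call() from the standard library:

    cmd = build_snp_command(os.environ["ERANGEPATH"], genome="mouse",
                            rdsfile="24T4spike.rds", label="24Tspike",
                            rmaskdbfile="rmask.db", dbsnpfile="dbSNP128.db",
                            uniqStartMin=5, totalRatio=0.75,
                            rpkmfile="c2c12rna.24R.final.rpkm",
                            cachepages=5000000)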

version 3.0    January  2009 - logging
version 3.0rc1 December 2008 - major rewrite and speed-up of getSNPs.py and chksnp.py
version 3.0b2  December 2008 - bug fixes & ERANGEPATH variable