convert standard analysis pipelines to use bam format natively

diff --git a/docs/README.chip-seq b/docs/README.chip-seq

index 6529a6fa1b9404c8d1d0b42484f83bc48c0f048c..ea7b0a347661d59a40e838cd666156f9c9b8bfe4 100644
--- a/docs/README.chip-seq
+++ b/docs/README.chip-seq
@@ -57,18 +57,24 @@ options are case sensitive and that they could well
  fail silently.
  
  
-3. MAKING THE NECESSARY INPUT (RDS) FILES
-
-You will want to first convert your read mappings to the 
-native ERANGE read store. Please see the file 
-README.build-rds for instructions on how to do this.
-
-Build an RDS file for both the ChIP, and if available and 
-appropriate, the control. Note that we *HIGHLY* recommend 
-the use of a matched control sample to account for some 
-of the general background artifacts that can be present 
-in ChIP-seq samples (e.g. DNAse hypersensitivity, 
-assembly collapse of some sattelite repeats, etc....). 
+3. MAKING THE NECESSARY INPUT FILES
+
+ERANGE uses BAM format files, but a couple of
+modifications need to be made to the header and
+individual entries. The Python script bamPreprocessing.py
+will do the following:
+1. Count the reads by type and write these counts to the
+header as comments.
+2. Verify that every read has a value in the NH tag or add
+it if needed.
+3. Optionally annotate the reads with the geneID using the
+ZG tag.
+
+Note that we *HIGHLY* recommend the use of a matched
+control sample to account for some of the general
+background artifacts that can be present in ChIP-seq
+samples (e.g. DNase hypersensitivity, assembly collapse
+of some satellite repeats, etc.).
  
  
  4. WEIGHING MULTIREADS
@@ -86,7 +92,7 @@ given radius
  
  (a) is the default in the current release of ERANGE. 
  Simply proceed to RUNNING THE PEAK FINDER for (a) and 
-(a). You can ignore multireads (b) by using the -nomulti 
+(a). You can ignore multireads (b) by using the --nomulti 
  flag with findall.py. For (c), use weighMultireads.py 
  to weigh multireads based on a unique reads in the 
  respective radius of each potential location. Once run, 
@@ -98,7 +104,7 @@ proceed to the section below.
  To run the peak finder without read shifting, use the 
  following command:
  
-python $ERANGEPATH/findall.py label chip.rds chip.regions.txt -control control.rds -listPeak -revbackground
+python $ERANGEPATH/findall.py label chip.rds chip.regions.txt --control control.rds --listPeak --revbackground
  
  which will run the peak finder on chip.rds / control.rds , 
  store the enriched region coordinates in chip.regions.txt, 
@@ -119,40 +125,40 @@ fragment sizes, on the order of 40-60 bp.
  You will *NEED* to change some of the default parameters 
  if working in smaller genomes (e.g. use smaller -spacing), 
  if working with certain types of IPs such as histones and 
-polymerases (test with and without -notrim and 
--nodirectionality), if working with rather weak IPs
-(e.g. -minimum and -ratio), or if working with larger 
+polymerases (test with and without --notrim and 
+--nodirectionality), if working with rather weak IPs
+(e.g. --minimum and --ratio), or if working with larger 
  fragment sizes (see the paragraph below discussing read 
  shifting). 
  
  findall.py returns a per-peak p-value. By default, this 
  is calculated using a Poisson distribution of peak RPMs 
-(or counts, if using -raw) for each chromosome in the IP. 
+(or counts, if using --raw) for each chromosome in the IP. 
  P-value calculations can be turned off using 
-'-pvalue none '. Alternatively, the p-value can be 
+'--pvalue none '. Alternatively, the p-value can be 
  calculated from the background using the option 
-'-pvalue back ', which must be combined with the option 
--revbackground.
+'--pvalue back ', which must be combined with the option 
+--revbackground.
  
  By default, findall.py does not try to adjust the location 
  of the reads based on half the size of the expected fragment 
  length (the "shift"). If you believe that you need to shift 
  your peaks, findall.py can try to pick the best shift based 
  on the best shift for strong sites using the parameter 
-'-shift learn '. You can also either manually specify a 
-shift value using '-shift #bp ' or ou can calculate a 
-"best shift" for each region using '-autoshift'. If you 
+'--shift learn '. You can also either manually specify a 
+shift value using '--shift #bp ' or you can calculate a 
+"best shift" for each region using '--autoshift'. If you 
  need to using the shift options, the recommended usage is:
-(i) first run findall.py with '-shift learn ', which will 
+(i) first run findall.py with '--shift learn ', which will 
  peak a shift if there are at least 30 regions that meet 
  its training criteria.
  (ii) if (i) couldn't pick a shift, run findall.py with 
--autoshift and -reportshift
+--autoshift and --reportshift
  (iii) look at the mode (most common #) for the shift
-(iv) rerun findall.py with -shift #bp where #bp is the mode
+(iv) rerun findall.py with --shift #bp where #bp is the mode
    
  If you are storing the RDS files on an network-mounted 
-directory, make sure to use '-cache XXXXX' to enable 
+directory, make sure to use '--cache XXXXX' to enable 
  local caching, where is as large as appropriate as 
  described in section 9 of README.build-rds . 
  
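Reviewer note on steps (iii) and (iv) of the shift workflow in the hunk above: picking the mode of the per-region shifts can be sketched as below. This is only an illustration, not part of ERANGE. It assumes the shift values reported by --reportshift have already been collected into a list of integers (the output format is not described here, so parsing is left out), and the helper name most_common_shift is hypothetical.

```python
from collections import Counter


def most_common_shift(shifts):
    """Return the mode of a list of per-region shift values (in bp)."""
    if not shifts:
        raise ValueError("no shift values supplied")
    counts = Counter(shifts)
    best = max(counts.values())
    # Assumption: when several values tie for most common, take the smallest.
    return min(value for value, n in counts.items() if n == best)


if __name__ == "__main__":
    # Hypothetical per-region shifts; the result would be passed as --shift <value>.
    print(most_common_shift([40, 42, 40, 38, 42, 40]))
```

The returned value is what step (iv) would pass to findall.py as the manual shift.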
@@ -223,6 +229,7 @@ for example.
  
  RELEASE HISTORY
  
+version 3.2    November 2010 - updated command line options
  version 3.1    February 2009 - support for read shifting
  version 3.0    February 2009 - support for UCSC narrowPeak format in regiontobed.py
  version 3.0rc1 December 2008 - added parameter to control peak-trimming
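
Reviewer note on the new section 3: the first two preprocessing steps listed there (count reads by type into header comments, make sure every read carries an NH tag) can be illustrated on SAM text records. The sketch below is not the bamPreprocessing.py implementation. The function name preprocess_sam, the three read types (unique, multi, splice), the rule used to classify them, and the layout of the @CO comment lines are all assumptions made for illustration. It also counts alignment records rather than distinct read names.

```python
from collections import Counter


def preprocess_sam(lines):
    """Fill in missing NH tags and record per-type counts as @CO header lines."""
    lines = [ln.rstrip("\n") for ln in lines]
    header = [ln for ln in lines if ln.startswith("@")]
    reads = [ln.split("\t") for ln in lines if ln and not ln.startswith("@")]

    # Alignment records per read name; used as the NH value when the tag is absent.
    hits = Counter(fields[0] for fields in reads)

    counts = Counter()
    records = []
    for fields in reads:
        nh = None
        for tag in fields[11:]:  # optional fields follow the 11 mandatory columns
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
        if nh is None:
            nh = hits[fields[0]]
            fields = fields + ["NH:i:%d" % nh]

        # Assumed read types: spliced (CIGAR has an N operation), multi, unique.
        if "N" in fields[5]:
            counts["splice"] += 1
        elif nh > 1:
            counts["multi"] += 1
        else:
            counts["unique"] += 1
        records.append("\t".join(fields))

    # Hypothetical comment layout; the real script may format these differently.
    comments = ["@CO\t%s:%d" % (kind, counts[kind])
                for kind in ("unique", "multi", "splice")]
    return header + comments + records
```

Working on SAM text keeps the sketch dependency-free; a real BAM file would need a BAM library to read and write the records.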