docs/README.chip-seq

   1 This is an updated version of the core of the ChIP-seq
   2 analysis code described in Johnson et al (2007).  It
   3 should run on any Unix-like system supporting python 2.5
   4 or better. The code is developed on Linux and MacOS X on
   5 python 2.5.
   6
   7 These scripts in the ChIPSeqMini package are now part of
   8 the ERANGE package, but are still available as a
   9 standalone package for now.
  10
  11 This code is made available as open-source, as described
  12 in the copyright file ERANGE.COPYRIGHT.
  13
  14
  15 1. REQUIREMENTS
  16 2. COMMAND LINE OPTIONS
  17 3. MAKING THE NECESSARY INPUT (RDS) FILES
  18 4. WEIGHING MULTIREADS
  19 5. RUNNING THE PEAK FINDER
  20 6. DISPLAYING DATA ONTO THE  UCSC GENOME BROWSER
  21 7. DOWNSTREAM ANALYSES
  22
  23
  24 1. REQUIREMENTS
  25
  26 1) Python 2.5 is required because some of the scripts and
  27 Cistematic (see below) need pysqlite, which is now bundled in
  28 Python.
  29
  30 2) You will also need to use Cistematic 2.3 (available at
  31 cistematic.caltech.edu) for all of the scripts that are
  32 part of the downstream analyses.
  33
  34 (optional) Use of the psyco module (psyco.sf.net) on 32-bit
  35 Linux or Mac Intel machines is highly recommended.
  36
  37 (optional) Three visualization scripts also depend on the
  38 additional package pylab (matplotlib). These scripts are:
  39 - getgosig.py
  40 - plotbardist.py
  41 - scatterfields.py
  42 You do not need to install pylab if you will be
  43 visualizing some of your analysis results differently.
  44
  45
  46 2. COMMAND LINE OPTIONS
  47
  48 You can find out more about the settings for each script
  49 by typing:
  50
  51 python $ERANGEPATH/<scriptname>
  52
  53 to see the command line options, where ERANGEPATH is the
  54 environmental variable set to the path to the directory
  55 holding the ERANGE scripts. Note that the command line
  56 options are case sensitive and that they could well
  57 fail silently.
  58
  59
  60 3. MAKING THE NECESSARY INPUT FILES
  61
  62 Erange uses BAM format files, but there are a couple of
  63 modifications that need to be made to the header and
  64 individual entries.  The python script bamPreprocessing.py
  65 will do the following:
  66 1. Count the reads by type and write these counts to the
  67 header as comments.
  68 2. Verify that every read has a value in the NH tag or add
  69 it if needed.
  70 3. Optionally annotate the reads with the geneID using the
  71 ZG flag
  72
  73 Note that we *HIGHLY* recommend the use of a matched
  74 control sample to account for some of the general
  75 background artifacts that can be present in ChIP-seq
  76 samples (e.g. DNAse hypersensitivity, assembly collapse
  77 of some sattelite repeats, etc....).
  78
  79
  80 4. WEIGHING MULTIREADS
  81
  82 Version 3.0 of the peak finder can use multireads, i.e.
  83 reads that map equally well to more than one location
  84 in the genome, to find binding sites that are in low
  85 copy-number on-unique regions (typically less than 10).
  86
  87 ERANGE offers 3 ways to analyze these regions:
  88 (a) default weighing of 1/multiplicity
  89 (b) ignoring multireads
  90 (c) weighing of multireads based on unique reads in a
  91 given radius
  92
  93 (a) is the default in the current release of ERANGE.
  94 Simply proceed to RUNNING THE PEAK FINDER for (a) and
  95 (a). You can ignore multireads (b) by using the --nomulti
  96 flag with findall.py. For (c), use weighMultireads.py
  97 to weigh multireads based on a unique reads in the
  98 respective radius of each potential location. Once run,
  99 proceed to the section below.
 100
 101
 102 5. RUNNING THE PEAK FINDER
 103
 104 To run the peak finder without read shifting, use the
 105 following command:
 106
 107 python $ERANGEPATH/findall.py label chip.rds chip.regions.txt --control control.rds --listPeak --revbackground
 108
 109 which will run the peak finder on chip.rds / control.rds ,
 110 store the enriched region coordinates in chip.regions.txt,
 111 also store the actual local maximum in each region in the
 112 same file, and also calculate an FDR by running the
 113 finder on control.rds / chip.rds .
 114
 115 A log file (findall.log by default, change with -log)
 116 tracks the settings used to run the program as well as
 117 some of the summary statistics, which are also stored
 118 at the bottom of the regions.txt output file.
 119
 120 findall.py is tuned to conservative settings for 10-12M
 121 mappable read IPs of static, sequence-specific
 122 transcription factors in mammals with very short
 123 fragment sizes, on the order of 40-60 bp.
 124
 125 You will *NEED* to change some of the default parameters
 126 if working in smaller genomes (e.g. use smaller -spacing),
 127 if working with certain types of IPs such as histones and
 128 polymerases (test with and without --notrim and
 129 --nodirectionality), if working with rather weak IPs
 130 (e.g. --minimum and --ratio), or if working with larger
 131 fragment sizes (see the paragraph below discussing read
 132 shifting).
 133
 134 findall.py returns a per-peak p-value. By default, this
 135 is calculated using a Poisson distribution of peak RPMs
 136 (or counts, if using --raw) for each chromosome in the IP.
 137 P-value calculations can be turned off using
 138 '--pvalue none '. Alternatively, the p-value can be
 139 calculated from the background using the option
 140 '--pvalue back ', which must be combined with the option
 141 --revbackground.
 142
 143 By default, findall.py does not try to adjust the location
 144 of the reads based on half the size of the expected fragment
 145 length (the "shift"). If you believe that you need to shift
 146 your peaks, findall.py can try to pick the best shift based
 147 on the best shift for strong sites using the parameter
 148 '--shift learn '. You can also either manually specify a
 149 shift value using '--shift #bp ' or ou can calculate a
 150 "best shift" for each region using '--autoshift'. If you
 151 need to using the shift options, the recommended usage is:
 152 (i) first run findall.py with '--shift learn ', which will
 153 peak a shift if there are at least 30 regions that meet
 154 its training criteria.
 155 (ii) if (i) couldn't pick a shift, run findall.py with
 156 --autoshift and --reportshift
 157 (iii) look at the mode (most common #) for the shift
 158 (iv) rerun findall.py with --shift #bp where #bp is the mode
 159
 160 If you are storing the RDS files on an network-mounted
 161 directory, make sure to use '--cache XXXXX' to enable
 162 local caching, where is as large as appropriate as
 163 described in section 9 of README.build-rds .
 164
 165 Note that ERANGE will cache by default to /tmp, but this
 166 can be redirected to any directory pointed to by the
 167 environmental variable CISTEMATIC_TEMP.
 168
 169 To find out the current default settings and options,
 170 simply type:
 171
 172 python $ERANGEPATH/findall.py
 173
 174 for more information.
 175
 176
 177 6. DISPLAYING DATA ONTO THE  UCSC GENOME BROWSER
 178
 179 You can output bed-files of the raw reads using
 180 makebedfromrds.py and  BEDGRAPH file using
 181 makewiggle.py as described in README.build-rds .
 182
 183 You can create bed files of regions and sites (see
 184 below) using regiontobed.py and makesitetrack.py .
 185
 186
 187 7. DOWNSTREAM ANALYSES
 188
 189 Recall that Cistematic 2.3 is a required to do motif
 190 and gene-level analyses of the output of findall.py.
 191
 192 Use getallgenes.py to find the nearest gene within a
 193 radius of each binding site.
 194
 195 Use analyzego.py to do a Gene Ontology enrichment
 196 analysis of a gene list (such as from getallgenes.py).
 197 You can look at a heatmap of your GO enrichments using
 198 getgosig.py. You can also use getGOgenes.py to look at
 199 the genes with particular GO annotations.
 200
 201 To do motif-finding, use getfasta.py to get the sequences
 202 centered on the peaks of your regions of interest. For
 203 the sake of a pleasant experience, try limiting yourself
 204 to less than 100kb of combined sequence (the easiest being
 205 by picking your regions with the strongest signals).
 206
 207 Once you have a fasta file of the regions of interest, you
 208 can use findMotifs.py to find motifs using either
 209 cisGreedy (bundled with Cistematic 2.2) which is good for
 210 shorter motifs or Meme (must be installed separately -
 211 refer to the instructions on cistematic.caltech.edu for
 212 more information), which is better for longer motifs.
 213 findmotifs.py will return a set motifs in Cistematic format
 214 with a .mot extension. These motifs can then be used with
 215 getallsites.py to get the coordinates and instances of each
 216 motif in all of the regions found by the peak finder.
 217
 218 The sites can be checked against repeat-masker annotations
 219 (preloaded from UCSC with buildrmaskdb.py) using
 220 checkrmask.py. The sites for each motif can also be fed
 221 back into getallgenes.py to get genes, redo the GO analyses,
 222 etc....
 223
 224 You can use the intersect scripts (intersects.py,
 225 gointersects.py, and siteintersects.py) to compare different
 226 sets of genes/GO/site results across multiple experiments,
 227 for example.
 228
 229
 230 RELEASE HISTORY
 231
 232 version 3.2    November 2010 - updated command line options
 233 version 3.1    February 2009 - support for read shifting
 234 version 3.0    February 2009 - support for UCSC narrowPeak format in regiontobed.py
 235 version 3.0rc1 December 2008 - added parameter to control peak-trimming
 236 version 3.0b2  December 2008 - added per-peak p-value
 237 version 3.0b   November 2008 - initial release of RDS-based code
 238 with support for eland and bowtie.
 239