README for MACS
Time-stamp: <2008-04-28 16:55:52 Tao Liu>

* Introduction

With the improvement of sequencing techniques, chromatin
immunoprecipitation followed by high-throughput sequencing (ChIP-Seq)
is becoming a popular way to study genome-wide protein-DNA
interactions. To address the lack of powerful ChIP-Seq analysis
methods, we present a novel algorithm, named Model-based Analysis of
ChIP-Seq (MACS), for identifying transcription factor binding sites.
MACS captures the influence of genome complexity to evaluate the
significance of enriched ChIP regions, and it improves the spatial
resolution of binding sites by combining the information of both
sequencing tag position and orientation. MACS can be used on ChIP-Seq
data alone, or together with a control sample to increase
specificity.

* Install

Please check the file 'INSTALL' in the distribution.

* Usage

Usage: macs <-t tfile> [options]

MACS -- Model-based Analysis for ChIP-Sequencing

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit.
  -t TFILE, --treatment=TFILE
                        ChIP-seq treatment files. REQUIRED
  -c CFILE, --control=CFILE
                        control files.
  --name=NAME           experiment name, which will be used to generate output
                        file names. DEFAULT: "NA"
  --format=FORMAT       Format of tag file, "ELAND" or "BED". DEFAULT: "BED"
  --gsize=GSIZE         genome size, default:2700000000
  --tsize=TSIZE         tag size. DEFAULT: 25
  --bw=BW               band width. DEFAULT: 300
  --pvalue=PVALUE       pvalue cutoff for peak detection. DEFAULT: 1e-5
  --mfold=MFOLD         select the regions with MFOLD high-confidence
                        enrichment ratio against background to build model.
                        DEFAULT:32
  --verbose=VERBOSE     set verbose level. 0: only show critical message, 1:
                        show additional warning message, 2: show process
                        information, 3: show debug messages. DEFAULT:2

** Parameters:

*** -t/--treatment FILENAME

This is the only REQUIRED parameter for MACS.

The ChIP-seq treatment data file can be in either BED format (refer to:
http://genome.ucsc.edu/FAQ/FAQformat#format1) or ELAND output format.

In an ELAND output file, each line represents one sequence, with the
following fields:

 1. Sequence name (derived from file name and line number if format is not Fasta)
 2. Sequence
 3. Type of match:
    NM - no match found.
    QC - no matching done: QC failure (too many Ns basically).
    RM - no matching done: repeat masked (may be seen if repeatFile.txt was specified).
    U0 - Best match found was a unique exact match.
    U1 - Best match found was a unique 1-error match.
    U2 - Best match found was a unique 2-error match.
    R0 - Multiple exact matches found.
    R1 - Multiple 1-error matches found, no exact matches.
    R2 - Multiple 2-error matches found, no exact or 1-error matches.
 4. Number of exact matches found.
 5. Number of 1-error matches found.
 6. Number of 2-error matches found.
 The remaining fields are only seen if a unique best match was found (i.e. the match code in field 3 begins with "U").
 7. Genome file in which match was found.
 8. Position of match (bases in file are numbered starting at 1).
 9. Direction of match (F=forward strand, R=reverse).
 10. How N characters in read were interpreted: ("."=not applicable, "D"=deletion, "I"=insertion).
 The remaining fields are only seen in the case of a unique inexact match (i.e. the match code was U1 or U2).
 11. Position and type of first substitution error (e.g. 12A: base 12 was A, not whatever it was in the read).
 12. Position and type of second substitution error, as above.

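As an illustration of the field layout above, here is a minimal
Python sketch that reads one ELAND line and keeps it only when the
match type is unique (field 3 begins with "U"). It is not the parser
MACS itself uses. The names ElandTag and parse_eland_line, the
whitespace-separated layout, and the sample line are assumptions made
for this example.

```python
from typing import NamedTuple, Optional

# Match types that indicate a unique best match (field 3).
UNIQUE_MATCH_TYPES = {"U0", "U1", "U2"}


class ElandTag(NamedTuple):
    """Hypothetical container for the fields needed to place a tag."""
    name: str       # field 1, sequence name
    chrom: str      # field 7, genome file in which the match was found
    position: int   # field 8, position of match (numbered starting at 1)
    strand: str     # field 9, "F" (forward) or "R" (reverse)


def parse_eland_line(line: str) -> Optional[ElandTag]:
    """Return the tag for a uniquely matched read, or None otherwise.

    Assumes fields are separated by whitespace (tabs or spaces).
    """
    fields = line.split()
    # Fields 7-9 exist only for unique matches, so check the type first.
    if len(fields) < 9 or fields[2] not in UNIQUE_MATCH_TYPES:
        return None
    return ElandTag(fields[0], fields[6], int(fields[7]), fields[8])


# Made-up sample lines, for illustration only.
unique_line = ">read_1\tACGTACGTACGTACGTACGTACGTA\tU0\t1\t0\t0\tchr1.fa\t12345\tF\t.."
repeat_line = ">read_2\tACGTACGTACGTACGTACGTACGTA\tR0\t5\t0\t0"

print(parse_eland_line(unique_line))
print(parse_eland_line(repeat_line))
```

Reads with match type NM, QC, RM, R0, R1 or R2 lack fields 7 to 9, so
the sketch returns None for them instead of trying to read a position.
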
Notes:

1) For BED format, the 6th column (strand information) is required by
MACS. Please also note that coordinates in BED format are zero-based
and half-open
(http://genome.ucsc.edu/FAQ/FAQtracks#tracks1).

2) For ELAND format, only matches with match type U0, U1 or U2 are
accepted by MACS, i.e. only a unique match for a sequence with fewer
than 3 errors is used in the calculation.

3) For an experiment with several replicates, it is recommended to
concatenate the ChIP-seq treatment files into a single file. To do
this, under Unix/Mac or Cygwin (for Windows), type:

$ cat replicate1.bed replicate2.bed replicate3.bed > all_replicates.bed

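The BED conventions in note 1 can be made concrete with a small
Python sketch. It checks that the strand column is present and
converts the zero-based, half-open BED interval to 1-based, closed
coordinates. This is only an illustration, not how MACS reads BED
files; the function name bed_line_to_tag and the sample line are
hypothetical.

```python
def bed_line_to_tag(line):
    """Convert one BED line to (chrom, start_1based, end_1based, strand).

    Raises ValueError when the 6th (strand) column is missing or invalid.
    """
    fields = line.split()
    if len(fields) < 6 or fields[5] not in ("+", "-"):
        raise ValueError("BED line lacks a valid 6th (strand) column")
    chrom = fields[0]
    start = int(fields[1])   # zero-based, inclusive
    end = int(fields[2])     # zero-based, exclusive
    strand = fields[5]
    # [start, end) in BED covers the same bases as [start + 1, end]
    # in 1-based, closed coordinates.
    return chrom, start + 1, end, strand


# Made-up 25 bp tag, for illustration only.
sample = "chr1\t99\t124\ttag_1\t0\t+"
chrom, first, last, strand = bed_line_to_tag(sample)
print(chrom, first, last, strand, "length:", last - first + 1)
```

The tag length is end - start in BED coordinates and last - first + 1
in 1-based, closed coordinates. Both give 25 for the sample line.
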
*** -c/--control

The control or mock data file, in either BED format or ELAND output
format. Please follow the same guidelines as for -t/--treatment.

*** --name

The name string of the experiment. MACS will use this string NAME to
create output files like 'NAME_peaks.xls', 'NAME_negative_peaks.xls',
'NAME_peaks.bed' and 'NAME_model.r'. Please make sure these filenames
do not conflict with your existing files.

*** --gsize

PLEASE assign this parameter to fit your needs!

This is the mappable (or effective) genome size, defined as the
portion of the genome that can be sequenced. Because of repetitive
features on the chromosomes, the actual mappable genome size is
smaller than the full genome size. The default, 2.7 Gb (2700000000),
is recommended for the UCSC human hg18 assembly.

*** --tsize

The size of the sequencing tags. Default is 25.

*** --bw

The band width used to scan the genome for model building. You can
set this parameter to the ChIP DNA fragment size expected from the
wet-lab experiment. Default is 300.

*** --pvalue

The pvalue cutoff for peak detection. Default is 1e-5.

*** --mfold

This parameter is used to select the regions with MFOLD-fold tag
enrichment against background to build the model. Default is 32. The
higher the MFOLD, the fewer the candidate regions. If you see an
ERROR or CRITICAL message from MACS, it is recommended to lower this
parameter.

*** --verbose

If you don't want to see any messages while MACS is running, set it
to 0. CRITICAL messages are never hidden, though. If you want to see
detailed information, such as how many peaks are called for each
chromosome, set it to 3 or higher.

* Output files

 1. NAME_peaks.xls is a tabular file which contains information about
called peaks. You can open it in Excel and sort/filter it using Excel
functions. The information includes: chromosome name, start position
of peak, end position of peak, length of peak region, peak summit
position relative to the start position of the peak region, number of
tags in the peak region, -10*log10(pvalue) for the peak region (e.g.
if pvalue = 1e-10, this value is 100), fold change for this region
against a random Poisson distribution with local lambda, and FDR in
percentage. Coordinates in the XLS file are 1-based, unlike BED
format, which is zero-based.

 2. NAME_peaks.bed is a BED format file which contains the peak
locations. You can load it into the UCSC genome browser or the
Affymetrix IGB software.

 3. NAME_negative_peaks.xls is a tabular file which contains
information about negative peaks. Negative peaks are called by
swapping the ChIP-seq and control channels.

 4. NAME_model.r is an R script which you can use to produce a PDF
image of the model based on your data. Run it in R with:

$ R --vanilla < NAME_model.r

A PDF file NAME_model.pdf will then be generated in your current
directory. Note that R is required to draw this figure.

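The -10*log10(pvalue) score and the coordinate difference described
for NAME_peaks.xls can be worked through in a few lines of Python.
This is a sketch for readers post-processing the output, not code
taken from MACS. The function names are hypothetical, and the
conversion assumes that the 1-based XLS coordinates form a closed
interval.

```python
import math


def pvalue_to_score(pvalue):
    """Return -10*log10(pvalue), the score reported for each peak."""
    return -10 * math.log10(pvalue)


def score_to_pvalue(score):
    """Invert the score back to a pvalue."""
    return 10 ** (-score / 10)


def xls_to_bed_interval(start, end):
    """Convert a 1-based interval (assumed closed) to BED's
    zero-based, half-open interval."""
    return start - 1, end


# pvalue = 1e-10 corresponds to a score of about 100, as in item 1.
print(pvalue_to_score(1e-10))
# The default --pvalue cutoff of 1e-5 corresponds to a score of about 50.
print(pvalue_to_score(1e-5))
print(score_to_pvalue(50))
print(xls_to_bed_interval(100, 124))
```

Under the closed-interval assumption, a peak cannot pass the default
cutoff unless its score is at least about 50, and a peak listed at
100-124 in the XLS file becomes 99-124 in BED coordinates.
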
* FAQs

None yet.
