Time-stamp: <2008-04-28 16:55:52 Tao Liu>

With the improvement of sequencing techniques, chromatin
immunoprecipitation followed by high-throughput sequencing (ChIP-Seq)
is becoming a popular way to study genome-wide protein-DNA interactions. To
address the lack of powerful ChIP-Seq analysis methods, we present a
novel algorithm, named Model-based Analysis of ChIP-Seq (MACS), for
identifying transcription factor binding sites. MACS captures the
influence of genome complexity to evaluate the significance of
enriched ChIP regions, and it improves the spatial resolution of
binding sites by combining the information of both sequencing tag
position and orientation. MACS can be used on ChIP-Seq data
alone, or with a control sample to increase specificity.

Please check the file 'INSTALL' in the distribution.

Usage: macs <-t tfile> [options]

MACS -- Model-based Analysis for ChIP-Sequencing

  --version             show program's version number and exit
  -h, --help            show this help message and exit.
  -t TFILE, --treatment=TFILE
                        ChIP-seq treatment files. REQUIRED
  -c CFILE, --control=CFILE
  --name=NAME           experiment name, which will be used to generate output
                        file names. DEFAULT: "NA"
  --format=FORMAT       Format of tag file, "ELAND" or "BED". DEFAULT: "BED"
  --gsize=GSIZE         genome size, default:2700000000
  --tsize=TSIZE         tag size. DEFAULT: 25
  --bw=BW               band width. DEFAULT: 300
  --pvalue=PVALUE       pvalue cutoff for peak detection. DEFAULT: 1e-5
  --mfold=MFOLD         select the regions with MFOLD high-confidence
                        enrichment ratio against background to build model.
  --verbose=VERBOSE     set verbose level. 0: only show critical message, 1:
                        show additional warning message, 2: show process
                        information, 3: show debug messages. DEFAULT:2

*** -t/--treatment FILENAME

This is the only REQUIRED parameter for MACS.

The ChIP-seq treatment data file can be in either BED format (refer to:
http://genome.ucsc.edu/FAQ/FAQformat#format1) or ELAND output format.

In an ELAND output file, each line represents one sequence, with the following fields:

1. Sequence name (derived from file name and line number if format is not Fasta)
3. Type of match:
   QC - no matching done: QC failure (too many Ns basically).
   RM - no matching done: repeat masked (may be seen if repeatFile.txt was specified).
   U0 - Best match found was a unique exact match.
   U1 - Best match found was a unique 1-error match.
   U2 - Best match found was a unique 2-error match.
   R0 - Multiple exact matches found.
   R1 - Multiple 1-error matches found, no exact matches.
   R2 - Multiple 2-error matches found, no exact or 1-error matches.
4. Number of exact matches found.
5. Number of 1-error matches found.
6. Number of 2-error matches found.
   The remaining fields are only seen if a unique best match was found (i.e. the match code in field 3 begins with "U").
7. Genome file in which match was found.
8. Position of match (bases in file are numbered starting at 1).
9. Direction of match (F=forward strand, R=reverse).
10. How N characters in read were interpreted: ("."=not applicable, "D"=deletion, "I"=insertion).
   The remaining fields are only seen in the case of a unique inexact match (i.e. the match code was U1 or U2).
11. Position and type of first substitution error (e.g. 12A: base 12 was A, not whatever it was in the read).
12. Position and type of second substitution error, as above.

1) For BED format, the 6th column (strand information) is required by
MACS. Please note that coordinates in BED format are
zero-based and half-open
(http://genome.ucsc.edu/FAQ/FAQtracks#tracks1).

2) For ELAND format, only matches with match type U0, U1 or U2 are
accepted by MACS, i.e. only unique matches with fewer
than 3 errors are used in the calculation.

3) For an experiment with several replicates, it is recommended to
concatenate the ChIP-seq treatment files into a single file. To do
this under Unix/Mac or Cygwin (for Windows), type:

$ cat replicate1.bed replicate2.bed replicate3.bed > all_replicates.bed
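
The ELAND rules in note 2 and the BED conventions in note 1 can be
illustrated with a short Python sketch. It is not part of MACS; the
function name eland_to_bed is hypothetical. It assumes that fields are
whitespace-separated in the order listed above, that the reported
position is the leftmost base of the match on either strand, and that
all tags have the same length.

```python
def eland_to_bed(line, tag_size=25):
    """Convert one ELAND line to a BED line.

    Returns None for reads that are not unique matches (anything other
    than U0, U1 or U2), mirroring the filter described in note 2.
    """
    fields = line.split()
    # Unique matches carry at least 9 fields (up to the match direction).
    if len(fields) < 9:
        return None
    match_code = fields[2]              # field 3: type of match
    if match_code not in ("U0", "U1", "U2"):
        return None
    chrom = fields[6]                   # field 7: genome file of the match
    position = int(fields[7])           # field 8: 1-based position
    strand = "+" if fields[8] == "F" else "-"   # field 9: F or R
    start = position - 1                # BED is zero-based ...
    end = start + tag_size              # ... and half-open
    # BED columns: chrom, start, end, name, score, strand (6th column)
    return "\t".join([chrom, str(start), str(end), fields[0], "0", strand])
```

Note that the first BED column is copied from the genome file field
unchanged, so a value such as "chr1.fa" may need to be renamed to match
the chromosome names used elsewhere in your analysis.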

*** -c/--control

The control or mock data file, in either BED format or ELAND output
format. Please follow the same directions as for -t/--treatment.

*** --name

The name string of the experiment. MACS will use this string NAME to
create output files like 'NAME_peaks.xls', 'NAME_negative_peaks.xls',
'NAME_peaks.bed' and 'NAME_model.r'. Please avoid any conflicts
between these filenames and your existing files.

*** --gsize

PLEASE assign this parameter to fit your needs!

This is the mappable genome size, or effective genome size, which is
defined as the portion of the genome that can be sequenced. Because of
repetitive features on the chromosomes, the actual mappable genome size
will be smaller than the original size. The default of 2.7Gb is
recommended for the UCSC human hg18 assembly.
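
To see why this value matters: if tags were spread uniformly over the
mappable genome, the expected number of tags in a window would be
inversely proportional to the genome size. The Python sketch below
shows only that arithmetic. It is an illustration of the general idea,
not the background model MACS actually uses, and the function name
expected_tags is hypothetical.

```python
def expected_tags(total_tags, window, gsize=2700000000):
    """Expected tag count in a window under a uniform background.

    total_tags: number of mapped tags in the sample
    window:     window width in bases
    gsize:      mappable (effective) genome size in bases
    """
    return float(total_tags) * window / gsize


# Overestimating gsize lowers the expected background, so the same
# region would look more enriched than it really is.
with_default = expected_tags(5000000, 600)                    # gsize = 2.7e9
with_overestimate = expected_tags(5000000, 600, gsize=3100000000)
```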

*** --tsize

The size of sequencing tags.

*** --bw

The band width used to scan the genome for model
building. You can set this parameter to the ChIP DNA fragment size
expected from the wet-lab experiment.

*** --mfold

This parameter is used to select regions with MFOLD-fold tag
enrichment against background to build the model. Default is 32. The
higher the MFOLD, the fewer candidate regions. If you see an ERROR or
CRITICAL message from MACS, it is recommended to lower this parameter.

*** --verbose

If you don't want to see any messages while MACS is running, set
it to 0. CRITICAL messages, however, are never hidden. If you want
to see detailed information, such as how many peaks are called for every
chromosome, set it to 3 or higher.

1. NAME_peaks.xls is a tabular file which contains information about
called peaks. You can open it in Excel and sort/filter using Excel
functions. The information includes: chromosome name, start position of
peak, end position of peak, length of peak region, peak summit
position relative to the start position of the peak region, number of tags
in peak region, -10*log10(pvalue) for the peak region (e.g. if pvalue
=1e-10, then this value should be 100), fold change for this region
against a random Poisson distribution with local lambda, and FDR in
percentage. Coordinates in the XLS file are 1-based, which differs from
the zero-based BED format.

2. NAME_peaks.bed is a BED format file which contains the peak
locations. You can load it into the UCSC genome browser or Affymetrix IGB.

3. NAME_negative_peaks.xls is a tabular file which contains
information about negative peaks. Negative peaks are called by
swapping the ChIP-seq and control channels.

4. NAME_model.r is an R script which you can use to produce a PDF
image of the model based on your data. Run it in R with:

$ R --vanilla < NAME_model.r

A PDF file NAME_model.pdf will then be generated in your current
directory. Note that R is required to draw this figure.
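
As a worked example of the NAME_peaks.xls conventions in item 1, the
Python sketch below turns one data row into BED-style coordinates and
recovers the p-value from the -10*log10(pvalue) column. It assumes the
row is tab-separated with columns in the order listed in item 1; the
function name peak_row_to_bed is hypothetical and not part of MACS.

```python
def peak_row_to_bed(row):
    """Return (chrom, bed_start, bed_end, pvalue) for one peaks.xls row."""
    cols = row.rstrip("\n").split("\t")
    chrom = cols[0]
    start = int(cols[1])        # 1-based start of the peak region
    end = int(cols[2])          # 1-based end, inclusive
    score = float(cols[6])      # -10*log10(pvalue)
    pvalue = 10 ** (-score / 10.0)
    # A 1-based inclusive interval [start, end] is the zero-based
    # half-open interval [start - 1, end) in BED coordinates.
    return chrom, start - 1, end, pvalue


# Example row: a score of 100 corresponds to a p-value of about 1e-10.
chrom, bed_start, bed_end, pvalue = peak_row_to_bed(
    "chr1\t1001\t1500\t500\t250\t40\t100.0\t12.5\t1.2")
```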