1 samtools(1) Bioinformatics tools samtools(1)
6 samtools - Utilities for the Sequence Alignment/Map (SAM) format
9 samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
11 samtools sort aln.bam aln.sorted
13 samtools index aln.sorted.bam
15 samtools idxstats aln.sorted.bam
17 samtools view aln.sorted.bam chr2:20,100,000-20,200,000
19 samtools merge out.bam in1.bam in2.bam in3.bam
21 samtools faidx ref.fasta
23 samtools pileup -vcf ref.fasta aln.sorted.bam
25 samtools mpileup -C50 -agf ref.fasta -r chr3:1,000-2,000 in1.bam
28 samtools tview aln.sorted.bam ref.fasta
32 Samtools is a set of utilities that manipulate alignments in the BAM
33 format. It imports from and exports to the SAM (Sequence Alignment/Map)
34 format, does sorting, merging and indexing, and allows to retrieve
35 reads in any regions swiftly.
37 Samtools is designed to work on a stream. It regards an input file `-'
38 as the standard input (stdin) and an output file `-' as the standard
39 output (stdout). Several commands can thus be combined with Unix pipes.
40 Samtools always output warning and error messages to the standard error
43 Samtools is also able to open a BAM (not SAM) file on a remote FTP or
44 HTTP server if the BAM file name starts with `ftp://' or `http://'.
45 Samtools checks the current working directory for the index file and
46 will download the index upon absence. Samtools does not retrieve the
47 entire alignment file unless it is asked to do so.
51 view samtools view [-bhuHS] [-t in.refList] [-o output] [-f
52 reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read-
53 Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
55 Extract/print all or sub alignments in SAM or BAM format. If
56 no region is specified, all the alignments will be printed;
57 otherwise only alignments overlapping the specified regions
58 will be output. An alignment may be given multiple times if
59 it is overlapping several regions. A region can be presented,
60 for example, in the following format: `chr2' (the whole
61 chr2), `chr2:1000000' (region starting from 1,000,000bp) or
62 `chr2:1,000,000-2,000,000' (region between 1,000,000 and
63 2,000,000bp including the end points). The coordinate is
68 -b Output in the BAM format.
70 -f INT Only output alignments with all bits in INT present
71 in the FLAG field. INT can be in hex in the format of
74 -F INT Skip alignments with bits present in INT [0]
76 -h Include the header in the output.
78 -H Output the header only.
80 -l STR Only output reads in library STR [null]
82 -o FILE Output file [stdout]
84 -q INT Skip alignments with MAPQ smaller than INT [0]
86 -r STR Only output reads in read group STR [null]
88 -R FILE Output reads in read groups listed in FILE [null]
90 -S Input is in SAM. If @SQ header lines are absent, the
91 `-t' option is required.
93 -t FILE This file is TAB-delimited. Each line must contain
94 the reference name and the length of the reference,
95 one line for each distinct reference; additional
96 fields are ignored. This file also defines the order
97 of the reference sequences in sorting. If you run
98 `samtools faidx <ref.fa>', the resultant index file
99 <ref.fa>.fai can be used as this <in.ref_list> file.
101 -u Output uncompressed BAM. This option saves time spent
102 on compression/decomprssion and is thus preferred
103 when the output is piped to another samtools command.
106 tview samtools tview <in.sorted.bam> [ref.fasta]
108 Text alignment viewer (based on the ncurses library). In the
109 viewer, press `?' for help and press `g' to check the align-
110 ment start from a region in the format like
111 `chr10:10,000,000' or `=10,000,000' when viewing the same
115 pileup samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
116 [-l in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N
117 nHap] [-r pairDiffRate] [-m mask] [-d maxIndelDepth] [-G
118 indelPrior] <in.bam>|<in.sam>
120 Print the alignment in the pileup format. In the pileup for-
121 mat, each line represents a genomic position, consisting of
122 chromosome name, coordinate, reference base, read bases, read
123 qualities and alignment mapping qualities. Information on
124 match, mismatch, indel, strand, mapping quality and start and
125 end of a read are all encoded at the read base column. At
126 this column, a dot stands for a match to the reference base
127 on the forward strand, a comma for a match on the reverse
128 strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
129 match on the forward strand and `acgtn' for a mismatch on the
130 reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates
131 there is an insertion between this reference position and the
132 next reference position. The length of the insertion is given
133 by the integer in the pattern, followed by the inserted
134 sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' repre-
135 sents a deletion from the reference. The deleted bases will
136 be presented as `*' in the following lines. Also at the read
137 base column, a symbol `^' marks the start of a read. The
138 ASCII of the character following `^' minus 33 gives the map-
139 ping quality. A symbol `$' marks the end of a read segment.
141 If option -c is applied, the consensus base, Phred-scaled
142 consensus quality, SNP quality (i.e. the Phred-scaled proba-
143 bility of the consensus being identical to the reference) and
144 root mean square (RMS) mapping quality of the reads covering
145 the site will be inserted between the `reference base' and
146 the `read bases' columns. An indel occupies an additional
147 line. Each indel line consists of chromosome name, coordi-
148 nate, a star, the genotype, consensus quality, SNP quality,
149 RMS mapping quality, # covering reads, the first alllele, the
150 second allele, # reads supporting the first allele, # reads
151 supporting the second allele and # reads containing indels
152 different from the top two alleles.
154 The position of indels is offset by -1.
158 -B Disable the BAQ computation. See the mpileup com-
161 -c Call the consensus sequence using SOAPsnp consensus
162 model. Options -T, -N, -I and -r are only effective
163 when -c or -g is in use.
165 -C INT Coefficient for downgrading the mapping quality of
166 poorly mapped reads. See the mpileup command for
169 -d INT Use the first NUM reads in the pileup for indel
170 calling for speed up. Zero for unlimited. [1024]
172 -f FILE The reference sequence in the FASTA format. Index
173 file FILE.fai will be created if absent.
175 -g Generate genotype likelihood in the binary GLFv3
176 format. This option suppresses -c, -i and -s. This
177 option is deprecated by the mpileup command.
179 -i Only output pileup lines containing indels.
181 -I INT Phred probability of an indel in sequencing/prep.
184 -l FILE List of sites at which pileup is output. This file
185 is space delimited. The first two columns are
186 required to be chromosome and 1-based coordinate.
187 Additional columns are ignored. It is recommended
190 -m INT Filter reads with flag containing bits in INT
193 -M INT Cap mapping quality at INT [60]
195 -N INT Number of haplotypes in the sample (>=2) [2]
197 -r FLOAT Expected fraction of differences between a pair of
200 -s Print the mapping quality as the last column. This
201 option makes the output easier to parse, although
202 this format is not space efficient.
204 -S The input file is in SAM.
206 -t FILE List of reference names ane sequence lengths, in
207 the format described for the import command. If
208 this option is present, samtools assumes the input
209 <in.alignment> is in SAM format; otherwise it
210 assumes in BAM format. -s together with -l as in
211 the default format we may not know the mapping
214 -T FLOAT The theta parameter (error dependency coefficient)
215 in the maq consensus calling model [0.85]
218 mpileup samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l
219 list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam
222 Generate BCF or pileup for one or multiple BAM files. Align-
223 ment records are grouped by sample identifiers in @RG header
224 lines. If sample identifiers are absent, each input file is
225 regarded as one sample.
229 -B Disable probabilistic realignment for the computation
230 of base alignment quality (BAQ). BAQ is the Phred-
231 scaled probability of a read base being misaligned.
232 Applying this option greatly helps to reduce false
233 SNPs caused by misalignments.
235 -C INT Coefficient for downgrading mapping quality for reads
236 containing excessive mismatches. Given a read with a
237 phred-scaled probability q of being generated from
238 the mapped position, the new mapping quality is about
239 sqrt((INT-q)/INT)*INT. A zero value disables this
240 functionality; if enabled, the recommended value is
243 -f FILE The reference file [null]
245 -g Compute genotype likelihoods and output them in the
246 binary call format (BCF).
248 -u Similar to -g except that the output is uncompressed
249 BCF, which is preferred for pipeing.
251 -l FILE File containing a list of sites where pileup or BCF
254 -q INT Minimum mapping quality for an alignment to be used
257 -Q INT Minimum base quality for a base to be considered [13]
259 -r STR Only generate pileup in region STR [all sites]
262 reheader samtools reheader <in.header.sam> <in.bam>
264 Replace the header in in.bam with the header in
265 in.header.sam. This command is much faster than replacing
266 the header with a BAM->SAM->BAM conversion.
269 sort samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>
271 Sort alignments by leftmost coordinates. File <out.pre-
272 fix>.bam will be created. This command may also create tempo-
273 rary files <out.prefix>.%d.bam when the whole alignment can-
274 not be fitted into memory (controlled by option -m).
278 -o Output the final alignment to the standard output.
280 -n Sort by read names rather than by chromosomal coordi-
283 -m INT Approximately the maximum required memory.
287 merge samtools merge [-nur] [-h inh.sam] [-R reg] <out.bam>
288 <in1.bam> <in2.bam> [...]
290 Merge multiple sorted alignments. The header reference lists
291 of all the input BAM files, and the @SQ headers of inh.sam,
292 if any, must all refer to the same set of reference
293 sequences. The header reference list and (unless overridden
294 by -h) `@' headers of in1.bam will be copied to out.bam, and
295 the headers of other files will be ignored.
299 -h FILE Use the lines of FILE as `@' headers to be copied to
300 out.bam, replacing any header lines that would other-
301 wise be copied from in1.bam. (FILE is actually in
302 SAM format, though any alignment records it may con-
305 -R STR Merge files in the specified region indicated by STR
307 -r Attach an RG tag to each alignment. The tag value is
308 inferred from file names.
310 -n The input alignments are sorted by read names rather
311 than by chromosomal coordinates
313 -u Uncompressed BAM output
316 index samtools index <aln.bam>
318 Index sorted alignment for fast random access. Index file
319 <aln.bam>.bai will be created.
322 idxstats samtools idxstats <aln.bam>
324 Retrieve and print stats in the index file. The output is TAB
325 delimited with each line consisting of reference sequence
326 name, sequence length, # mapped reads and # unmapped reads.
329 faidx samtools faidx <ref.fasta> [region1 [...]]
331 Index reference sequence in the FASTA format or extract sub-
332 sequence from indexed reference sequence. If no region is
333 specified, faidx will index the file and create
334 <ref.fasta>.fai on the disk. If regions are speficified, the
335 subsequences will be retrieved and printed to stdout in the
336 FASTA format. The input file can be compressed in the RAZF
340 fixmate samtools fixmate <in.nameSrt.bam> <out.bam>
342 Fill in mate coordinates, ISIZE and mate related flags from a
343 name-sorted alignment.
346 rmdup samtools rmdup [-sS] <input.srt.bam> <out.bam>
348 Remove potential PCR duplicates: if multiple read pairs have
349 identical external coordinates, only retain the pair with
350 highest mapping quality. In the paired-end mode, this com-
351 mand ONLY works with FR orientation and requires ISIZE is
352 correctly set. It does not work for unpaired reads (e.g. two
353 ends mapped to different chromosomes or orphan reads).
357 -s Remove duplicate for single-end reads. By default,
358 the command works for paired-end reads only.
360 -S Treat paired-end reads and single-end reads.
363 calmd samtools calmd [-eubSr] [-C capQcoef] <aln.bam> <ref.fasta>
365 Generate the MD tag. If the MD tag is already present, this
366 command will give a warning if the MD tag generated is dif-
367 ferent from the existing tag. Output SAM by default.
371 -e Convert a the read base to = if it is identical to
372 the aligned reference base. Indel caller does not
373 support the = bases at the moment.
375 -u Output uncompressed BAM
377 -b Output compressed BAM
379 -S The input is SAM with header lines
381 -C INT Coefficient to cap mapping quality of poorly mapped
382 reads. See the pileup command for details. [0]
384 -r Perform probabilistic realignment to compute BAQ,
385 which will be used to cap base quality.
389 SAM is TAB-delimited. Apart from the header lines, which are started
390 with the `@' symbol, each alignment line consists of:
393 +----+-------+----------------------------------------------------------+
394 |Col | Field | Description |
395 +----+-------+----------------------------------------------------------+
396 | 1 | QNAME | Query (pair) NAME |
397 | 2 | FLAG | bitwise FLAG |
398 | 3 | RNAME | Reference sequence NAME |
399 | 4 | POS | 1-based leftmost POSition/coordinate of clipped sequence |
400 | 5 | MAPQ | MAPping Quality (Phred-scaled) |
401 | 6 | CIAGR | extended CIGAR string |
402 | 7 | MRNM | Mate Reference sequence NaMe (`=' if same as RNAME) |
403 | 8 | MPOS | 1-based Mate POSistion |
404 | 9 | ISIZE | Inferred insert SIZE |
405 |10 | SEQ | query SEQuence on the same strand as the reference |
406 |11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) |
407 |12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE |
408 +----+-------+----------------------------------------------------------+
410 Each bit in the FLAG field is defined as:
413 +-------+-----+--------------------------------------------------+
414 | Flag | Chr | Description |
415 +-------+-----+--------------------------------------------------+
416 |0x0001 | p | the read is paired in sequencing |
417 |0x0002 | P | the read is mapped in a proper pair |
418 |0x0004 | u | the query sequence itself is unmapped |
419 |0x0008 | U | the mate is unmapped |
420 |0x0010 | r | strand of the query (1 for reverse) |
421 |0x0020 | R | strand of the mate |
422 |0x0040 | 1 | the read is the first read in a pair |
423 |0x0080 | 2 | the read is the second read in a pair |
424 |0x0100 | s | the alignment is not primary |
425 |0x0200 | f | the read fails platform/vendor quality checks |
426 |0x0400 | d | the read is either a PCR or an optical duplicate |
427 +-------+-----+--------------------------------------------------+
430 o Import SAM to BAM when @SQ lines are present in the header:
432 samtools view -bS aln.sam > aln.bam
434 If @SQ lines are absent:
436 samtools faidx ref.fa
437 samtools view -bt ref.fa.fai aln.sam > aln.bam
439 where ref.fa.fai is generated automatically by the faidx command.
442 o Attach the RG tag while merging sorted alignments:
444 perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
445 mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
446 samtools merge -rh rg.txt merged.bam ga.bam 454.bam
448 The value in a RG tag is determined by the file name the read is com-
449 ing from. In this example, in the merged.bam, reads from ga.bam will
450 be attached RG:Z:ga, while reads from 454.bam will be attached
454 o Call SNPs and short indels for one diploid individual:
456 samtools pileup -vcf ref.fa aln.bam > var.raw.plp
457 samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp
458 awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' var.flt.plp >
461 The -D option of varFilter controls the maximum read depth, which
462 should be adjusted to about twice the average read depth. One may
463 consider to add -C50 to pileup if mapping quality is overestimated
464 for reads containing excessive mismatches. Applying this option usu-
465 ally helps BWA-short but may not other mappers. It also potentially
466 increases reference biases.
469 o Call SNPs (not short indels) for multiple diploid individuals:
471 samtools mpileup -augf ref.fa *.bam | bcftools view -bcv - >
473 bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 | bgzip
476 Individuals are identified from the SM tags in the @RG header lines.
477 Individuals can be pooled in one alignment file; one individual can
478 also be separated into multiple files. Similarly, one may consider to
479 apply -C50 to mpileup. SNP calling in this way also works for single
480 sample and has the advantage of enabling more powerful filtering. The
481 drawback is the lack of short indel calling, which may be implemented
485 o Derive the allele frequency spectrum (AFS) on a list of sites from
486 multiple individuals:
488 samtools mpileup -gf ref.fa *.bam > all.bcf
489 bcftools view -bl sites.list all.bcf > sites.bcf
490 bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
491 bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
492 bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs
495 where sites.list contains the list of sites with each line consisting
496 of the reference sequence name and position. The following bcftools
497 commands estimate AFS by EM.
500 o Dump BAQ applied alignment for other SNP callers:
502 samtools calmd -br aln.bam > aln.baq.bam
504 It adds and corrects the NM and MD tags at the same time. The calmd
505 command also comes with the -C option, the same as the on in pileup
506 and mpileup. Apply if it helps.
510 o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
513 o In merging, the input files are required to have the same number of
514 reference sequences. The requirement can be relaxed. In addition,
515 merging does not reconstruct the header dictionaries automatically.
516 Endusers have to provide the correct header. Picard is better at
519 o Samtools paired-end rmdup does not work for unpaired reads (e.g.
520 orphan reads or ends mapped to different chromosomes). If this is a
521 concern, please use Picard's MarkDuplicate which correctly handles
522 these cases, although a little slower.
526 Heng Li from the Sanger Institute wrote the C version of samtools. Bob
527 Handsaker from the Broad Institute implemented the BGZF library and Jue
528 Ruan from Beijing Genomics Institute wrote the RAZF library. Various
529 people in the 1000 Genomes Project contributed to the SAM format speci-
534 Samtools website: <http://samtools.sourceforge.net>
538 samtools-0.1.9 27 October 2010 samtools(1)