samtools(1)                  Bioinformatics tools                  samtools(1)

NAME
       samtools - Utilities for the Sequence Alignment/Map (SAM) format

SYNOPSIS
       samtools view -bt ref_list.txt -o aln.bam aln.sam.gz

       samtools sort aln.bam aln.sorted

       samtools index aln.sorted.bam

       samtools view aln.sorted.bam chr2:20,100,000-20,200,000

       samtools merge out.bam in1.bam in2.bam in3.bam

       samtools faidx ref.fasta

       samtools pileup -f ref.fasta aln.sorted.bam

       samtools tview aln.sorted.bam ref.fasta

DESCRIPTION
       Samtools is a set of utilities that manipulate alignments in the BAM
       format. It imports from and exports to the SAM (Sequence Alignment/Map)
       format, does sorting, merging and indexing, and allows reads in any
       region to be retrieved swiftly.

       Samtools is designed to work on a stream. It regards an input file `-'
       as the standard input (stdin) and an output file `-' as the standard
       output (stdout). Several commands can thus be combined with Unix pipes.
       Samtools always outputs warning and error messages to the standard
       error output (stderr).

       Samtools is also able to open a BAM (not SAM) file on a remote FTP or
       HTTP server if the BAM file name starts with `ftp://' or `http://'.
       Samtools checks the current working directory for the index file and
       downloads the index if it is absent. Samtools does not retrieve the
       entire alignment file unless it is asked to do so.

COMMANDS AND OPTIONS
       import    samtools import <in.ref_list> <in.sam> <out.bam>

                 Since 0.1.4, this command is an alias of:

                 samtools view -bt <in.ref_list> -o <out.bam> <in.sam>

       sort      samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>

                 Sort alignments by leftmost coordinates. File
                 <out.prefix>.bam will be created. This command may also
                 create temporary files <out.prefix>.%d.bam when the whole
                 alignment cannot be fitted into memory (controlled by
                 option -m).

                 OPTIONS:

                 -n      Sort by read names rather than by chromosomal
                         coordinates.

                 -m INT  Approximately the maximum required memory.

       merge     samtools merge [-h inh.sam] [-n] <out.bam> <in1.bam>
                 <in2.bam> [...]

                 Merge multiple sorted alignments. The header reference lists
                 of all the input BAM files, and the @SQ headers of inh.sam,
                 if any, must all refer to the same set of reference
                 sequences. The header reference list and (unless overridden
                 by -h) `@' headers of in1.bam will be copied to out.bam, and
                 the headers of other files will be ignored.

                 OPTIONS:

                 -h FILE Use the lines of FILE as `@' headers to be copied to
                         out.bam, replacing any header lines that would
                         otherwise be copied from in1.bam. (FILE is actually
                         in SAM format, though any alignment records it may
                         contain are ignored.)

                 -n      The input alignments are sorted by read names rather
                         than by chromosomal coordinates.

       index     samtools index <aln.bam>

                 Index sorted alignment for fast random access. Index file
                 <aln.bam>.bai will be created.

       view      samtools view [-bhuHS] [-t in.refList] [-o output] [-f
                 reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r
                 readGroup] <in.bam>|<in.sam> [region1 [...]]

                 Extract/print all or a subset of alignments in SAM or BAM
                 format. If no region is specified, all the alignments will
                 be printed; otherwise only alignments overlapping the
                 specified regions will be output. An alignment may be given
                 multiple times if it overlaps several regions. A region can
                 be presented, for example, in the following format: `chr2'
                 (the whole chr2), `chr2:1000000' (region starting from
                 1,000,000bp) or `chr2:1,000,000-2,000,000' (region between
                 1,000,000 and 2,000,000bp including the end points). The
                 coordinate is 1-based.
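
The three region forms described for view can be illustrated with a small parser. This is only a sketch in Python, not samtools' own parsing code; the function name parse_region and the returned tuple layout are assumptions made for the example.

```python
import re

# Hypothetical helper (not part of samtools): accepts `chr2',
# `chr2:1000000' and `chr2:1,000,000-2,000,000'.
_REGION_RE = re.compile(r"^([^:]+)(?::(\d[\d,]*)(?:-(\d[\d,]*))?)?$")


def parse_region(region):
    """Return (name, start, end) with 1-based, inclusive coordinates.

    start is None when only a reference name is given; end is None when
    the region is open-ended (`chr2:1000000').
    """
    match = _REGION_RE.match(region)
    if match is None:
        raise ValueError("unrecognized region: %r" % region)
    name, start, end = match.groups()
    # Commas are only thousands separators, so drop them before converting.
    start = int(start.replace(",", "")) if start else None
    end = int(end.replace(",", "")) if end else None
    if start is not None and end is not None and end < start:
        raise ValueError("region end precedes start: %r" % region)
    return name, start, end


print(parse_region("chr2:1,000,000-2,000,000"))
```

Returning None for a missing bound, rather than a sentinel number, leaves it to the caller to decide how an open-ended region is treated.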

                 OPTIONS:

                 -b      Output in the BAM format.

                 -u      Output uncompressed BAM. This option saves time spent
                         on compression/decompression and is thus preferred
                         when the output is piped to another samtools command.

                 -h      Include the header in the output.

                 -H      Output the header only.

                 -S      Input is in SAM. If @SQ header lines are absent, the
                         `-t' option is required.

                 -t FILE This file is TAB-delimited. Each line must contain
                         the reference name and the length of the reference,
                         one line for each distinct reference; additional
                         fields are ignored. This file also defines the order
                         of the reference sequences in sorting. If you run
                         `samtools faidx <ref.fa>', the resultant index file
                         <ref.fa>.fai can be used as this <in.ref_list> file.

                 -o FILE Output file [stdout]

                 -f INT  Only output alignments with all bits in INT present
                         in the FLAG field. INT can be given in hex with a
                         `0x' prefix, as in the FLAG table below.

                 -F INT  Skip alignments with bits present in INT [0]

                 -q INT  Skip alignments with MAPQ smaller than INT [0]

                 -l STR  Only output reads in library STR [null]

                 -r STR  Only output reads in read group STR [null]
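
The -f, -F and -q filters can be read as simple bit and threshold tests. The sketch below is one reading of the option descriptions above, written in Python for illustration; the function name passes_filters is hypothetical, and treating -F as "skip if any listed bit is set" is an assumption about the wording rather than a quote from the samtools source.

```python
def passes_filters(flag, mapq, req_flag=0, skip_flag=0, min_mapq=0):
    """Return True if an alignment would be kept under -f/-F/-q.

    flag and mapq are the FLAG and MAPQ fields of one alignment;
    req_flag, skip_flag and min_mapq stand for the -f, -F and -q values.
    """
    # -f: every bit in req_flag must be present in FLAG.
    if (flag & req_flag) != req_flag:
        return False
    # -F (assumed reading): any shared bit causes the alignment to be skipped.
    if (flag & skip_flag) != 0:
        return False
    # -q: skip alignments with MAPQ smaller than the threshold.
    return mapq >= min_mapq


# Hex values can be written directly as Python integer literals.
print(passes_filters(flag=0x0063, mapq=30, req_flag=0x0002, skip_flag=0x0004))
```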

       faidx     samtools faidx <ref.fasta> [region1 [...]]

                 Index reference sequence in the FASTA format or extract a
                 subsequence from an indexed reference sequence. If no region
                 is specified, faidx will index the file and create
                 <ref.fasta>.fai on the disk. If regions are specified, the
                 subsequences will be retrieved and printed to stdout in the
                 FASTA format. The input file can be compressed in the RAZF
                 format.
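
The -t option of view states that the reference list is TAB-delimited, carries the reference name and length in its first two fields, ignores any further fields, and can be the .fai file written by faidx. A minimal Python sketch of reading such a list follows; read_ref_list is a hypothetical name and the sample lines are invented for illustration.

```python
def read_ref_list(lines):
    """Return [(name, length), ...] in file order.

    Only the first two TAB-separated fields are used, so both a plain
    two-column list and a .fai index are accepted.
    """
    refs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines (an assumption of this sketch)
        fields = line.split("\t")
        refs.append((fields[0], int(fields[1])))
    return refs


# Invented sample content; real values come from your own reference.
sample = ["chr1\t1000\t6\t60\t61\n", "chr2\t500\n"]
print(read_ref_list(sample))
```

Keeping the result as an ordered list rather than a dictionary preserves the file order, which the -t description says defines the order of the reference sequences in sorting.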

       pileup    samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l
                 in.site_list] [-iscgS2] [-T theta] [-N nHap] [-r
                 pairDiffRate] <in.bam>|<in.sam>

                 Print the alignment in the pileup format. In the pileup
                 format, each line represents a genomic position, consisting
                 of chromosome name, coordinate, reference base, read bases,
                 read qualities and alignment mapping qualities. Information
                 on match, mismatch, indel, strand, mapping quality and the
                 start and end of a read are all encoded in the read base
                 column. In this column, a dot stands for a match to the
                 reference base on the forward strand, a comma for a match on
                 the reverse strand, `ACGTN' for a mismatch on the forward
                 strand and `acgtn' for a mismatch on the reverse strand. A
                 pattern `\+[0-9]+[ACGTNacgtn]+' indicates there is an
                 insertion between this reference position and the next
                 reference position. The length of the insertion is given by
                 the integer in the pattern, followed by the inserted
                 sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'
                 represents a deletion from the reference. The deleted bases
                 will be presented as `*' in the following lines. Also in the
                 read base column, a symbol `^' marks the start of a read
                 segment, which is a contiguous subsequence on the read
                 separated by `N/S/H' CIGAR operations. The ASCII of the
                 character following `^' minus 33 gives the mapping quality.
                 A symbol `$' marks the end of a read segment.
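
The read base column described above can be split into tokens with a short scanner. The following Python sketch is an illustration built only from that description; the function name parse_read_bases and the (kind, value) token layout are assumptions, not a samtools interface.

```python
import re


def parse_read_bases(bases):
    """Split a pileup read-base string into (kind, value) tokens."""
    tokens = []
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # The character after `^' encodes mapping quality as ASCII - 33.
            tokens.append(("start", ord(bases[i + 1]) - 33))
            i += 2
            continue
        if c in "+-":
            number = re.match(r"[0-9]+", bases[i + 1:])
            if number is None:
                raise ValueError("indel without a length at offset %d" % i)
            length = int(number.group())
            seq_start = i + 1 + len(number.group())
            # Use the stated length to bound the indel sequence, so a
            # mismatch base that follows is not swallowed.
            seq = bases[seq_start:seq_start + length]
            tokens.append(("insertion" if c == "+" else "deletion", seq))
            i = seq_start + length
            continue
        if c == "$":
            tokens.append(("end", None))
        elif c == ".":
            tokens.append(("match", "+"))   # forward strand
        elif c == ",":
            tokens.append(("match", "-"))   # reverse strand
        elif c == "*":
            tokens.append(("deleted", None))
        elif c in "ACGTNacgtn":
            tokens.append(("mismatch", c))  # upper = forward, lower = reverse
        else:
            raise ValueError("unexpected character %r" % c)
        i += 1
    return tokens


print(parse_read_bases("^F.,$"))
```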

                 If option -c is applied, the consensus base, Phred-scaled
                 consensus quality, SNP quality (i.e. the Phred-scaled
                 probability of the consensus being identical to the
                 reference) and root mean square (RMS) mapping quality of the
                 reads covering the site will be inserted between the
                 `reference base' and the `read bases' columns. An indel
                 occupies an additional line. Each indel line consists of
                 chromosome name, coordinate, a star, the genotype, consensus
                 quality, SNP quality, RMS mapping quality, # covering reads,
                 the first allele, the second allele, # reads supporting the
                 first allele, # reads supporting the second allele and #
                 reads containing indels different from the top two alleles.
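
The -c columns rely on two standard quantities: the Phred scale, Q = -10 log10(P), and the root mean square. The Python helpers below are a generic sketch of those two definitions, not the MAQ consensus model itself; the function names are hypothetical.

```python
import math


def phred_to_prob(quality):
    """Convert a Phred-scaled quality to the probability it encodes."""
    return 10.0 ** (-quality / 10.0)


def prob_to_phred(prob):
    """Convert a probability (0 < prob <= 1) to the Phred scale."""
    return -10.0 * math.log10(prob)


def rms(values):
    """Root mean square, as used for the RMS mapping quality column."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# A SNP quality of 30 corresponds to a probability of 0.001 that the
# consensus is identical to the reference.
print(phred_to_prob(30))
```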

                 OPTIONS:

                 -s        Print the mapping quality as the last column. This
                           option makes the output easier to parse, although
                           this format is not space efficient.

                 -S        The input file is in SAM.

                 -i        Only output pileup lines containing indels.

                 -f FILE   The reference sequence in the FASTA format. Index
                           file FILE.fai will be created if absent.

                 -M INT    Cap mapping quality at INT [60]

                 -t FILE   List of reference names and sequence lengths, in
                           the format described for the import command. If
                           this option is present, samtools assumes the input
                           <in.alignment> is in SAM format; otherwise it
                           assumes BAM format.

                 -l FILE   List of sites at which pileup is output. This file
                           is space delimited. The first two columns are
                           required to be chromosome and 1-based coordinate.
                           Additional columns are ignored. It is recommended
                           to use option -s together with -l, as in the
                           default format we may not know the mapping quality.

                 -c        Call the consensus sequence using the MAQ consensus
                           model. Options -T, -N, -I and -r are only effective
                           when -c or -g is in use.

                 -g        Generate genotype likelihood in the binary GLFv3
                           format. This option suppresses -c, -i and -s.

                 -T FLOAT  The theta parameter (error dependency coefficient)
                           in the MAQ consensus calling model [0.85]

                 -N INT    Number of haplotypes in the sample (>=2) [2]

                 -r FLOAT  Expected fraction of differences between a pair of
                           haplotypes.

                 -I INT    Phred probability of an indel in sequencing/prep.

       tview     samtools tview <in.sorted.bam> [ref.fasta]

                 Text alignment viewer (based on the ncurses library). In the
                 viewer, press `?' for help and press `g' to check the
                 alignment starting from a region, given in a format like the
                 regions described for the view command.

       fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>

                 Fill in mate coordinates, ISIZE and mate related flags from a
                 name-sorted alignment.

       rmdup     samtools rmdup <input.srt.bam> <out.bam>

                 Remove potential PCR duplicates: if multiple read pairs have
                 identical external coordinates, only retain the pair with
                 the highest mapping quality. This command ONLY works with FR
                 orientation and requires that ISIZE is correctly set.
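
The retention rule stated for rmdup (same external coordinates, keep the highest mapping quality) can be sketched as a grouping step. This Python example is not samtools' implementation: the dictionary layout with keys chrom, left, right and mapq is hypothetical, and keeping the first pair on a tie is an assumption of the sketch.

```python
def dedup_pairs(pairs):
    """Keep one read pair per (chrom, left, right) external coordinate.

    pairs is an iterable of dicts with the hypothetical keys `chrom',
    `left', `right' and `mapq'. Among pairs sharing coordinates, the one
    with the highest mapq is retained; ties keep the first one seen.
    """
    best = {}
    for pair in pairs:
        key = (pair["chrom"], pair["left"], pair["right"])
        kept = best.get(key)
        if kept is None or pair["mapq"] > kept["mapq"]:
            best[key] = pair
    # dict preserves insertion order of the keys in Python 3.7+.
    return list(best.values())


# Invented example data.
pairs = [
    {"chrom": "chr2", "left": 100, "right": 350, "mapq": 20},
    {"chrom": "chr2", "left": 100, "right": 350, "mapq": 60},
    {"chrom": "chr2", "left": 120, "right": 350, "mapq": 30},
]
print(dedup_pairs(pairs))
```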

       rmdupse   samtools rmdupse <input.srt.bam> <out.bam>

                 Remove potential duplicates for single-ended reads. This
                 command will treat all reads as single-ended even if they are
                 paired.

       fillmd    samtools fillmd [-e] <aln.bam> <ref.fasta>

                 Generate the MD tag. If the MD tag is already present, this
                 command will give a warning if the MD tag generated is
                 different from the existing tag.

                 OPTIONS:

                 -e      Convert a read base to = if it is identical to the
                         aligned reference base. The indel caller does not
                         support the = bases at the moment.
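
The effect of -e can be pictured on a single ungapped stretch of alignment. The Python sketch below assumes the read and reference segments are already aligned base for base and have equal length, and that the comparison ignores case; both are assumptions of the example, and mask_identical_bases is a hypothetical name.

```python
def mask_identical_bases(read_seq, ref_seq):
    """Replace read bases that equal the aligned reference base with `='."""
    if len(read_seq) != len(ref_seq):
        raise ValueError("segments must be aligned and of equal length")
    return "".join(
        "=" if read_base.upper() == ref_base.upper() else read_base
        for read_base, ref_base in zip(read_seq, ref_seq)
    )


# Only the third base differs from the reference, so it alone is kept.
print(mask_identical_bases("ACGT", "ACCT"))
```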

SAM FORMAT
       SAM is TAB-delimited. Apart from the header lines, which start with
       the `@' symbol, each alignment line consists of:

       +----+-------+----------------------------------------------------------+
       |Col | Field |                       Description                        |
       +----+-------+----------------------------------------------------------+
       | 1  | QNAME | Query (pair) NAME                                        |
       | 2  | FLAG  | bitwise FLAG                                             |
       | 3  | RNAME | Reference sequence NAME                                  |
       | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
       | 5  | MAPQ  | MAPping Quality (Phred-scaled)                           |
       | 6  | CIGAR | extended CIGAR string                                    |
       | 7  | MRNM  | Mate Reference sequence NaMe (`=' if same as RNAME)      |
       | 8  | MPOS  | 1-based Mate POSition                                    |
       | 9  | ISIZE | Inferred insert SIZE                                     |
       |10  | SEQ   | query SEQuence on the same strand as the reference       |
       |11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
       |12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |
       +----+-------+----------------------------------------------------------+
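
The column layout above maps directly onto a TAB split. The following Python sketch parses one alignment line into named fields; the function names, the dictionary layout and the sample line are made up for illustration, and header lines starting with `@' are assumed to have been skipped already.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_alignment(line):
    """Parse one TAB-delimited SAM alignment line into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 columns, got %d" % len(cols))
    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    # Optional fields have the form TAG:VTYPE:VALUE; VALUE may contain `:'.
    record["OPT"] = [tuple(field.split(":", 2)) for field in cols[11:]]
    return record


def phred_qualities(qual):
    """Decode a QUAL string: ASCII code minus 33 per base."""
    return [ord(ch) - 33 for ch in qual]


# Invented alignment line for demonstration only.
line = "read1\t99\tchr2\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tNM:i:0"
rec = parse_alignment(line)
print(rec["RNAME"], rec["POS"], phred_qualities(rec["QUAL"]))
```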

       Each bit in the FLAG field is defined as:

       +-------+--------------------------------------------------+
       | Flag  |                   Description                    |
       +-------+--------------------------------------------------+
       |0x0001 | the read is paired in sequencing                 |
       |0x0002 | the read is mapped in a proper pair              |
       |0x0004 | the query sequence itself is unmapped            |
       |0x0008 | the mate is unmapped                             |
       |0x0010 | strand of the query (1 for reverse)              |
       |0x0020 | strand of the mate                               |
       |0x0040 | the read is the first read in a pair             |
       |0x0080 | the read is the second read in a pair            |
       |0x0100 | the alignment is not primary                     |
       |0x0200 | the read fails platform/vendor quality checks    |
       |0x0400 | the read is either a PCR or an optical duplicate |
       +-------+--------------------------------------------------+
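
A FLAG value is the bitwise OR of the entries in this table, so it can be decoded by testing each bit in turn. The Python sketch below does that; the short labels are hypothetical names chosen for the example, and reading 0x0020 as "mate on the reverse strand" is an assumption made by analogy with 0x0010.

```python
# (bit, label) pairs following the FLAG table; labels are illustrative only.
FLAG_BITS = [
    (0x0001, "paired"),
    (0x0002, "proper_pair"),
    (0x0004, "unmapped"),
    (0x0008, "mate_unmapped"),
    (0x0010, "query_reverse"),
    (0x0020, "mate_reverse"),   # assumed: 1 for reverse, as for 0x0010
    (0x0040, "first_in_pair"),
    (0x0080, "second_in_pair"),
    (0x0100, "not_primary"),
    (0x0200, "qc_fail"),
    (0x0400, "duplicate"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a FLAG value, lowest bit first."""
    return [label for bit, label in FLAG_BITS if flag & bit]


# 99 = 0x0001 + 0x0002 + 0x0020 + 0x0040
print(decode_flag(99))
```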

LIMITATIONS
       o Unaligned words used in bam_import.c, bam_endian.h, bam.c and

       o CIGAR operation P is not properly handled at the moment.

       o In merging, the input files are required to have the same number of
         reference sequences. The requirement can be relaxed. In addition,
         merging does not reconstruct the header dictionaries automatically.
         End users have to provide the correct header. Picard is better at
         merging.

       o Samtools' rmdup does not work for single-end data and does not remove
         duplicates across chromosomes. Picard is better.

AUTHOR
       Heng Li from the Sanger Institute wrote the C version of samtools. Bob
       Handsaker from the Broad Institute implemented the BGZF library and Jue
       Ruan from Beijing Genomics Institute wrote the RAZF library. Various
       people in the 1000Genomes Project contributed to the SAM format
       specification.

SEE ALSO
       Samtools website: <http://samtools.sourceforge.net>

samtools-0.1.7                  10 November 2009                   samtools(1)