samtools(1)                  Bioinformatics tools                  samtools(1)

NAME
       samtools - Utilities for the Sequence Alignment/Map (SAM) format

SYNOPSIS
       samtools view -bt ref_list.txt -o aln.bam aln.sam.gz

       samtools sort aln.bam aln.sorted

       samtools index aln.sorted.bam

       samtools view aln.sorted.bam chr2:20,100,000-20,200,000

       samtools merge out.bam in1.bam in2.bam in3.bam

       samtools faidx ref.fasta

       samtools pileup -f ref.fasta aln.sorted.bam

       samtools tview aln.sorted.bam ref.fasta

DESCRIPTION
       Samtools is a set of utilities that manipulate alignments in the BAM
       format. It imports from and exports to the SAM (Sequence Alignment/Map)
       format, sorts, merges and indexes alignments, and allows reads in any
       region to be retrieved swiftly.

       Samtools is designed to work on a stream. It regards an input file `-'
       as the standard input (stdin) and an output file `-' as the standard
       output (stdout). Several commands can thus be combined with Unix pipes.
       Samtools always outputs warning and error messages to the standard
       error output (stderr).

       Samtools is also able to open a BAM (not SAM) file on a remote FTP or
       HTTP server if the BAM file name starts with `ftp://' or `http://'.
       Samtools checks the current working directory for the index file and
       downloads the index if it is absent. Samtools does not retrieve the
       entire alignment file unless it is asked to do so.

COMMANDS AND OPTIONS
       import    samtools import <in.ref_list> <in.sam> <out.bam>

                 Since 0.1.4, this command is an alias of:

                 samtools view -bt <in.ref_list> -o <out.bam> <in.sam>

       sort      samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>

                 Sort alignments by leftmost coordinates. File
                 <out.prefix>.bam will be created. This command may also
                 create temporary files <out.prefix>.%d.bam when the whole
                 alignment cannot be fitted into memory (controlled by
                 option -m).

                 OPTIONS:

                 -n      Sort by read names rather than by chromosomal
                         coordinates.

                 -m INT  Approximately the maximum required memory.

       merge     samtools merge [-h inh.sam] [-n] <out.bam> <in1.bam>
                 <in2.bam> [...]

                 Merge multiple sorted alignments. The header reference lists
                 of all the input BAM files, and the @SQ headers of inh.sam,
                 if any, must all refer to the same set of reference
                 sequences. The header reference list and (unless overridden
                 by -h) `@' headers of in1.bam will be copied to out.bam, and
                 the headers of other files will be ignored.

                 OPTIONS:

                 -h FILE Use the lines of FILE as `@' headers to be copied to
                         out.bam, replacing any header lines that would
                         otherwise be copied from in1.bam. (FILE is actually
                         in SAM format, though any alignment records it may
                         contain are ignored.)

                 -n      The input alignments are sorted by read names rather
                         than by chromosomal coordinates.

       index     samtools index <aln.bam>

                 Index sorted alignment for fast random access. Index file
                 <aln.bam>.bai will be created.

       view      samtools view [-bhuHS] [-t in.refList] [-o output] [-f
                 reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r
                 readGroup] <in.bam>|<in.sam> [region1 [...]]

                 Extract/print all alignments, or a subset of them, in SAM or
                 BAM format. If no region is specified, all the alignments
                 will be printed; otherwise only alignments overlapping the
                 specified regions will be output. An alignment may be output
                 multiple times if it overlaps several regions. A region can
                 be given, for example, in one of the following formats:
                 `chr2', `chr2:1000000' or `chr2:1,000,000-2,000,000'.
                 Coordinates are 1-based.
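
The three region forms above can be read mechanically. The following is a minimal Python sketch of such a parser, not samtools' own code; the function name parse_region is hypothetical, and returning None for a missing start or end is an assumption of this example (the text above does not say how samtools treats an omitted end).

```python
import re

# name, optionally followed by :start, optionally followed by -end;
# start and end may contain thousands separators (commas).
_REGION_RE = re.compile(r"^([^:]+)(?::([\d,]+)(?:-([\d,]+))?)?$")


def parse_region(region):
    """Split a region string into (name, start, end).

    Coordinates are 1-based, as in the text. Parts that are not given
    are returned as None (an assumption of this sketch).
    """
    m = _REGION_RE.match(region)
    if m is None:
        raise ValueError("unrecognised region: %r" % region)
    name, start, end = m.groups()

    def to_int(text):
        # Strip the commas allowed in `chr2:1,000,000-2,000,000'.
        return int(text.replace(",", "")) if text else None

    return name, to_int(start), to_int(end)
```

For example, parse_region("chr2:1,000,000-2,000,000") yields the name "chr2" with integer bounds 1000000 and 2000000.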

                 OPTIONS:

                 -b      Output in the BAM format.

                 -u      Output uncompressed BAM. This option saves time spent
                         on compression/decompression and is thus preferred
                         when the output is piped to another samtools command.

                 -h      Include the header in the output.

                 -H      Output the header only.

                 -S      Input is in SAM. If @SQ header lines are absent, the
                         `-t' option is required.

                 -t FILE This file is TAB-delimited. Each line must contain
                         the reference name and the length of the reference,
                         one line for each distinct reference; additional
                         fields are ignored. This file also defines the order
                         of the reference sequences in sorting. If you run
                         `samtools faidx <ref.fa>', the resultant index file
                         <ref.fa>.fai can be used as this <in.ref_list> file.

                 -o FILE Output file [stdout]

                 -f INT  Only output alignments with all bits in INT present
                         in the FLAG field. INT can be given in hex (e.g.
                         0x0004).

                 -F INT  Skip alignments with bits present in INT [0]

                 -q INT  Skip alignments with MAPQ smaller than INT [0]

                 -l STR  Only output reads in library STR [null]

                 -r STR  Only output reads in read group STR [null]
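
As an illustration of how the -f, -F and -q filters combine, here is a small Python sketch. It is not taken from the samtools source; the function and parameter names are hypothetical, and treating -F as "skip if any bit of INT is set" is this example's reading of the option text above.

```python
def passes_view_filters(flag, mapq, req_flag=0, skip_flag=0, min_mapq=0):
    """Return True if an alignment would be kept under -f/-F/-q.

    req_flag  models -f: every bit of req_flag must be set in FLAG.
    skip_flag models -F: assumed to reject when any of its bits is set.
    min_mapq  models -q: reject when MAPQ is smaller than min_mapq.
    """
    if (flag & req_flag) != req_flag:
        return False
    if flag & skip_flag:
        return False
    if mapq < min_mapq:
        return False
    return True
```

With the defaults of 0 for all three filters, every alignment passes, which matches the [0] defaults listed for -F and -q.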

       faidx     samtools faidx <ref.fasta> [region1 [...]]

                 Index reference sequence in the FASTA format or extract
                 subsequence from indexed reference sequence. If no region is
                 specified, faidx will index the file and create
                 <ref.fasta>.fai on the disk. If regions are specified, the
                 subsequences will be retrieved and printed to stdout in the
                 FASTA format. The input file can be compressed in the RAZF
                 format.

       pileup    samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l
                 in.site_list] [-iscgS2] [-T theta] [-N nHap] [-r
                 pairDiffRate] <in.bam>|<in.sam>

                 Print the alignment in the pileup format. In the pileup
                 format, each line represents a genomic position, consisting
                 of chromosome name, coordinate, reference base, read bases,
                 read qualities and alignment mapping qualities. Information
                 on match, mismatch, indel, strand, mapping quality and start
                 and end of a read is all encoded in the read base column. In
                 this column, a dot stands for a match to the reference base
                 on the forward strand, a comma for a match on the reverse
                 strand, `ACGTN' for a mismatch on the forward strand and
                 `acgtn' for a mismatch on the reverse strand. A pattern
                 `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion
                 between this reference position and the next reference
                 position. The length of the insertion is given by the
                 integer in the pattern, followed by the inserted sequence.
                 Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' represents a
                 deletion from the reference. The deleted bases will be
                 presented as `*' in the following lines. Also in the read
                 base column, a symbol `^' marks the start of a read segment,
                 which is a contiguous subsequence on the read separated by
                 `N/S/H' CIGAR operations. The ASCII code of the character
                 following `^' minus 33 gives the mapping quality. A symbol
                 `$' marks the end of a read segment.
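
The encoding of the read base column described above can be walked character by character. The Python sketch below is an illustrative decoder written for this page, not the samtools implementation; the function name and the keys of the returned dictionary are hypothetical.

```python
def summarize_read_bases(bases):
    """Tally the symbols found in one pileup read-base column."""
    summary = {
        "match_fwd": 0, "match_rev": 0,
        "mismatch_fwd": 0, "mismatch_rev": 0,
        "insertions": [], "deletions": [],
        "deleted_bases": 0,        # `*' placeholders
        "start_mapqs": [],         # mapping quality given after each `^'
        "segment_ends": 0,         # `$' markers
    }
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # The next character encodes the mapping quality as ASCII - 33.
            summary["start_mapqs"].append(ord(bases[i + 1]) - 33)
            i += 2
            continue
        if c in "+-":
            # Read the integer length, then exactly that many bases.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j])
            seq = bases[j:j + length]
            key = "insertions" if c == "+" else "deletions"
            summary[key].append(seq)
            i = j + length
            continue
        if c == "$":
            summary["segment_ends"] += 1
        elif c == ".":
            summary["match_fwd"] += 1
        elif c == ",":
            summary["match_rev"] += 1
        elif c == "*":
            summary["deleted_bases"] += 1
        elif c in "ACGTN":
            summary["mismatch_fwd"] += 1
        elif c in "acgtn":
            summary["mismatch_rev"] += 1
        else:
            raise ValueError("unexpected symbol %r at offset %d" % (c, i))
        i += 1
    return summary
```

Note that the integer after `+' or `-' is used to decide how many letters belong to the indel, so a mismatch base that directly follows an indel is not swallowed into it.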

                 If option -c is applied, the consensus base, consensus
                 quality, SNP quality and RMS mapping quality of the reads
                 covering the site will be inserted between the `reference
                 base' and the `read bases' columns. An indel occupies an
                 additional line. Each indel line consists of chromosome
                 name, coordinate, a star, the genotype, consensus quality,
                 SNP quality, RMS mapping quality, # covering reads, the
                 first allele, the second allele, # reads supporting the
                 first allele, # reads supporting the second allele and
                 # reads containing indels different from the top two
                 alleles.

                 OPTIONS:

                 -s        Print the mapping quality as the last column. This
                           option makes the output easier to parse, although
                           this format is not space efficient.

                 -S        The input file is in SAM.

                 -i        Only output pileup lines containing indels.

                 -f FILE   The reference sequence in the FASTA format. Index
                           file FILE.fai will be created if absent.

                 -M INT    Cap mapping quality at INT [60]

                 -t FILE   List of reference names and sequence lengths, in
                           the format described for the import command. If
                           this option is present, samtools assumes the input
                           <in.alignment> is in SAM format; otherwise it
                           assumes BAM format.

                 -l FILE   List of sites at which pileup is output. This file
                           is space delimited. The first two columns are
                           required to be chromosome and 1-based coordinate.
                           Additional columns are ignored. It is recommended
                           to use option -s together with -l, because in the
                           default format the mapping quality may not be
                           known.

                 -c        Call the consensus sequence using the MAQ consensus
                           model. Options -T, -N, -I and -r are only effective
                           when -c or -g is in use.

                 -g        Generate genotype likelihood in the binary GLFv3
                           format. This option suppresses -c, -i and -s.

                 -T FLOAT  The theta parameter (error dependency coefficient)
                           in the MAQ consensus calling model [0.85]

                 -N INT    Number of haplotypes in the sample (>=2) [2]

                 -r FLOAT  Expected fraction of differences between a pair of
                           haplotypes

                 -I INT    Phred probability of an indel in sequencing/prep.

       tview     samtools tview <in.sorted.bam> [ref.fasta]

                 Text alignment viewer (based on the ncurses library). In the
                 viewer, press `?' for help and press `g' to view the
                 alignment starting from a given region.

       fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>

                 Fill in mate coordinates, ISIZE and mate related flags from
                 a name-sorted alignment.

       rmdup     samtools rmdup <input.srt.bam> <out.bam>

                 Remove potential PCR duplicates: if multiple read pairs have
                 identical external coordinates, only retain the pair with
                 the highest mapping quality. This command ONLY works with FR
                 orientation and requires that ISIZE be correctly set.
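
The selection rule stated above (one pair per set of external coordinates, keeping the highest mapping quality) can be sketched in a few lines of Python. This is only an illustration of the rule, not how samtools rmdup is implemented: the dictionary keys chrom, start, end and mapq are a hypothetical in-memory representation of a read pair, and keeping the first pair seen on a tie is an assumption of this example.

```python
def dedup_pairs(pairs):
    """Keep one read pair per (chrom, start, end), preferring higher mapq.

    `pairs` is an iterable of dicts with the hypothetical keys
    'chrom', 'start', 'end' (external coordinates) and 'mapq'.
    """
    best = {}
    for pair in pairs:
        key = (pair["chrom"], pair["start"], pair["end"])
        kept = best.get(key)
        # Strict '>' means the first pair seen wins a tie.
        if kept is None or pair["mapq"] > kept["mapq"]:
            best[key] = pair
    return list(best.values())
```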

       rmdupse   samtools rmdupse <input.srt.bam> <out.bam>

                 Remove potential duplicates for single-ended reads. This
                 command will treat all reads as single-ended even if they
                 are paired.

       fillmd    samtools fillmd [-e] <aln.bam> <ref.fasta>

                 Generate the MD tag. If the MD tag is already present, this
                 command will give a warning if the MD tag generated is
                 different from the existing tag.

                 OPTIONS:

                 -e      Convert a read base to = if it is identical to the
                         aligned reference base. The indel caller does not
                         support the = bases at the moment.

SAM FORMAT
       SAM is TAB-delimited. Apart from the header lines, which start with
       the `@' symbol, each alignment line consists of:

       +----+-------+----------------------------------------------------------+
       |Col | Field | Description                                              |
       +----+-------+----------------------------------------------------------+
       | 1  | QNAME | Query (pair) NAME                                        |
       | 2  | FLAG  | bitwise FLAG                                             |
       | 3  | RNAME | Reference sequence NAME                                  |
       | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
       | 5  | MAPQ  | MAPping Quality (Phred-scaled)                           |
       | 6  | CIGAR | extended CIGAR string                                    |
       | 7  | MRNM  | Mate Reference sequence NaMe (`=' if same as RNAME)      |
       | 8  | MPOS  | 1-based Mate POSition                                    |
       | 9  | ISIZE | Inferred insert SIZE                                     |
       |10  | SEQ   | query SEQuence on the same strand as the reference       |
       |11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
       |12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |
       +----+-------+----------------------------------------------------------+
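
To make the column layout concrete, the following Python sketch splits one alignment line into the fields of the table above and converts QUAL to Phred values. It is an assumption-laden illustration rather than a full SAM parser: the names SamRecord, parse_sam_line and phred_qualities are hypothetical, header lines starting with `@' are expected to be skipped by the caller, and no validation beyond the field count is done.

```python
from collections import namedtuple

# Field names follow the table: 11 mandatory columns plus optional fields.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar mrnm mpos isize seq qual opt",
)


def parse_sam_line(line):
    """Parse one TAB-delimited alignment line into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 columns, got %d" % len(fields))
    (qname, flag, rname, pos, mapq,
     cigar, mrnm, mpos, isize, seq, qual) = fields[:11]
    # Optional fields have the form TAG:VTYPE:VALUE; VALUE may contain ':'.
    opt = [tuple(f.split(":", 2)) for f in fields[11:]]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, mrnm, int(mpos), int(isize), seq, qual, opt)


def phred_qualities(qual):
    """Convert a QUAL string to Phred base qualities (ASCII code - 33)."""
    return [ord(c) - 33 for c in qual]
```

The example record used to exercise this sketch is made up for illustration and does not come from a real alignment.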

       Each bit in the FLAG field is defined as:

       +-------+--------------------------------------------------+
       | Flag  | Description                                      |
       +-------+--------------------------------------------------+
       |0x0001 | the read is paired in sequencing                 |
       |0x0002 | the read is mapped in a proper pair              |
       |0x0004 | the query sequence itself is unmapped            |
       |0x0008 | the mate is unmapped                             |
       |0x0010 | strand of the query (1 for reverse)              |
       |0x0020 | strand of the mate                               |
       |0x0040 | the read is the first read in a pair             |
       |0x0080 | the read is the second read in a pair            |
       |0x0100 | the alignment is not primary                     |
       |0x0200 | the read fails platform/vendor quality checks    |
       |0x0400 | the read is either a PCR or an optical duplicate |
       +-------+--------------------------------------------------+
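
Because FLAG is a bit field, a single integer can carry several of the properties in the table at once. The Python sketch below decodes a FLAG value into the bits that are set; the short labels are hypothetical names chosen for this example and are not part of the SAM format.

```python
# (bit, label) pairs in the order of the FLAG table; labels are
# illustrative shorthand for the table descriptions.
FLAG_BITS = [
    (0x0001, "paired"),
    (0x0002, "proper_pair"),
    (0x0004, "unmapped"),
    (0x0008, "mate_unmapped"),
    (0x0010, "reverse"),
    (0x0020, "mate_reverse"),
    (0x0040, "first_in_pair"),
    (0x0080, "second_in_pair"),
    (0x0100, "not_primary"),
    (0x0200, "qc_fail"),
    (0x0400, "duplicate"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a FLAG value, lowest bit first."""
    return [label for bit, label in FLAG_BITS if flag & bit]
```

For instance, a FLAG of 99 is 0x0001 + 0x0002 + 0x0020 + 0x0040, so it decodes to a paired read in a proper pair, with the mate on the reverse strand, that is the first read in its pair.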

LIMITATIONS
       o Unaligned words are used in several source files, including
         bam_import.c, bam_endian.h and bam.c.

       o CIGAR operation P is not properly handled at the moment.

       o In merging, the input files are required to have the same number of
         reference sequences. The requirement can be relaxed. In addition,
         merging does not reconstruct the header dictionaries automatically.
         End users have to provide the correct header. Picard is better at
         merging.

       o Samtools' rmdup does not work for single-end data and does not
         remove duplicates across chromosomes. Picard is better.

AUTHOR
       Heng Li from the Sanger Institute wrote the C version of samtools. Bob
       Handsaker from the Broad Institute implemented the BGZF library and
       Jue Ruan from Beijing Genomics Institute wrote the RAZF library.
       Various people in the 1000Genomes Project contributed to the SAM
       format specification.

SEE ALSO
       Samtools website: <http://samtools.sourceforge.net>

samtools-0.1.6                  2 September 2009                   samtools(1)