samtools index aln.sorted.bam
+ samtools idxstats aln.sorted.bam
+
samtools view aln.sorted.bam chr2:20,100,000-20,200,000
samtools merge out.bam in1.bam in2.bam in3.bam
samtools pileup -f ref.fasta aln.sorted.bam
+ samtools mpileup -f ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
+
samtools tview aln.sorted.bam ref.fasta
COMMANDS AND OPTIONS
- import samtools import <in.ref_list> <in.sam> <out.bam>
-
- Since 0.1.4, this command is an alias of:
-
- samtools view -bt <in.ref_list> -o <out.bam> <in.sam>
-
-
- sort samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>
-
- Sort alignments by leftmost coordinates. File <out.pre-
- fix>.bam will be created. This command may also create tempo-
- rary files <out.prefix>.%d.bam when the whole alignment can-
- not be fitted into memory (controlled by option -m).
-
- OPTIONS:
-
- -n Sort by read names rather than by chromosomal coordi-
- nates
-
- -m INT Approximately the maximum required memory.
- [500000000]
-
-
- merge samtools merge [-h inh.sam] [-n] <out.bam> <in1.bam>
- <in2.bam> [...]
-
- Merge multiple sorted alignments. The header reference lists
- of all the input BAM files, and the @SQ headers of inh.sam,
- if any, must all refer to the same set of reference
- sequences. The header reference list and (unless overridden
- by -h) `@' headers of in1.bam will be copied to out.bam, and
- the headers of other files will be ignored.
-
- OPTIONS:
-
- -h FILE Use the lines of FILE as `@' headers to be copied to
- out.bam, replacing any header lines that would other-
- wise be copied from in1.bam. (FILE is actually in
- SAM format, though any alignment records it may con-
- tain are ignored.)
-
- -n The input alignments are sorted by read names rather
- than by chromosomal coordinates
-
-
- index samtools index <aln.bam>
-
- Index sorted alignment for fast random access. Index file
- <aln.bam>.bai will be created.
-
-
view samtools view [-bhuHS] [-t in.refList] [-o output] [-f
- reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read-
- Group] <in.bam>|<in.sam> [region1 [...]]
+ reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read-
+ Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
- Extract/print all or sub alignments in SAM or BAM format. If
- no region is specified, all the alignments will be printed;
- otherwise only alignments overlapping the specified regions
- will be output. An alignment may be given multiple times if
+ Extract/print all or sub alignments in SAM or BAM format. If
+ no region is specified, all the alignments will be printed;
+ otherwise only alignments overlapping the specified regions
+ will be output. An alignment may be given multiple times if
it is overlapping several regions. A region can be presented,
- for example, in the following format: `chr2', `chr2:1000000'
- or `chr2:1,000,000-2,000,000'. The coordinate is 1-based.
+ for example, in the following format: `chr2' (the whole
+ chr2), `chr2:1000000' (region starting from 1,000,000bp) or
+ `chr2:1,000,000-2,000,000' (region between 1,000,000 and
+ 2,000,000bp including the end points). The coordinate is
+ 1-based.
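The three region forms described above can be illustrated with a small parser. This is only a sketch: `parse_region` is a hypothetical helper written for this example, not part of samtools, and it assumes reference names contain no colon.

```python
import re

# Hypothetical helper (not a samtools API): split a region string into
# (name, start, end) using 1-based, end-inclusive coordinates.
_REGION = re.compile(
    r"^(?P<name>[^:]+)(?::(?P<start>[\d,]+)(?:-(?P<end>[\d,]+))?)?$"
)

def parse_region(text):
    match = _REGION.match(text)
    if match is None:
        raise ValueError(f"unrecognised region: {text!r}")
    start = match.group("start")
    end = match.group("end")
    # Commas are thousands separators only, so drop them before converting.
    start = int(start.replace(",", "")) if start else 1
    # None stands for "up to the end of the reference sequence".
    end = int(end.replace(",", "")) if end else None
    return match.group("name"), start, end

whole = parse_region("chr2")
from_start = parse_region("chr2:1000000")
window = parse_region("chr2:1,000,000-2,000,000")
```

Here `whole` covers all of chr2, `from_start` begins at 1,000,000bp, and `window` includes both end points.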
OPTIONS:
-r STR Only output reads in read group STR [null]
+ -R FILE Output reads in read groups listed in FILE [null]
- faidx samtools faidx <ref.fasta> [region1 [...]]
- Index reference sequence in the FASTA format or extract sub-
- sequence from indexed reference sequence. If no region is
- specified, faidx will index the file and create
- <ref.fasta>.fai on the disk. If regions are speficified, the
- subsequences will be retrieved and printed to stdout in the
- FASTA format. The input file can be compressed in the RAZF
- format.
+ tview samtools tview <in.sorted.bam> [ref.fasta]
+
+ Text alignment viewer (based on the ncurses library). In the
+ viewer, press `?' for help and press `g' to jump to the
+ alignment starting at a region given in a format like
+ `chr10:10,000,000', or `=10,000,000' when viewing the same
+ reference sequence.
pileup samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l
mapping quality. A symbol `$' marks the end of a read seg-
ment.
- If option -c is applied, the consensus base, consensus qual-
- ity, SNP quality and RMS mapping quality of the reads cover-
- ing the site will be inserted between the `reference base'
- and the `read bases' columns. An indel occupies an additional
+ If option -c is applied, the consensus base, Phred-scaled
+ consensus quality, SNP quality (i.e. the Phred-scaled proba-
+ bility of the consensus being identical to the reference) and
+ root mean square (RMS) mapping quality of the reads covering
+ the site will be inserted between the `reference base' and
+ the `read bases' columns. An indel occupies an additional
line. Each indel line consists of chromosome name, coordi-
nate, a star, the genotype, consensus quality, SNP quality,
RMS mapping quality, # covering reads, the first allele, the
supporting the second allele and # reads containing indels
different from the top two alleles.
- OPTIONS:
+ The position of indels is offset by -1.
+ OPTIONS:
-s Print the mapping quality as the last column. This
option makes the output easier to parse, although
this format is not space efficient.
-
-S The input file is in SAM.
-
-i Only output pileup lines containing indels.
-
-f FILE The reference sequence in the FASTA format. Index
file FILE.fai will be created if absent.
-
-M INT Cap mapping quality at INT [60]
+ -m INT Filter reads with flag containing bits in INT
+ [1796]
+
+ -d INT Use only the first INT reads in the pileup for indel
+ calling, to speed it up. Zero means unlimited. [0]
-t FILE List of reference names and sequence lengths, in
the format described for the import command. If
<in.alignment> is in SAM format; otherwise it
assumes in BAM format.
-
-l FILE List of sites at which pileup is output. This file
is space delimited. The first two columns are
required to be chromosome and 1-based coordinate.
to use option -s together with -l as in the default
format we may not know the mapping quality.
-
- -c Call the consensus sequence using MAQ consensus
+ -c Call the consensus sequence using SOAPsnp consensus
model. Options -T, -N, -I and -r are only effective
when -c or -g is in use.
-
-g Generate genotype likelihood in the binary GLFv3
format. This option suppresses -c, -i and -s.
-
-T FLOAT The theta parameter (error dependency coefficient)
in the maq consensus calling model [0.85]
-
-N INT Number of haplotypes in the sample (>=2) [2]
-
-r FLOAT Expected fraction of differences between a pair of
haplotypes [0.001]
-
-I INT Phred probability of an indel in sequencing/prep.
[40]
+ mpileup samtools mpileup [-r reg] [-f in.fa] in.bam [in2.bam [...]]
- tview samtools tview <in.sorted.bam> [ref.fasta]
+ Generate pileup for multiple BAM files. Consensus calling is
+ not implemented.
- Text alignment viewer (based on the ncurses library). In the
- viewer, press `?' for help and press `g' to check the align-
- ment start from a region in the format like
- `chr10:10,000,000'.
+ OPTIONS:
+
+ -r STR Only generate pileup in region STR [all sites]
+
+ -f FILE The reference file [null]
+ reheader samtools reheader <in.header.sam> <in.bam>
+
+ Replace the header in in.bam with the header in
+ in.header.sam. This command is much faster than replacing
+ the header with a BAM->SAM->BAM conversion.
+
+
+ sort samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>
+
+ Sort alignments by leftmost coordinates. File <out.pre-
+ fix>.bam will be created. This command may also create tempo-
+ rary files <out.prefix>.%d.bam when the whole alignment can-
+ not be fitted into memory (controlled by option -m).
+
+ OPTIONS:
+
+ -o Output the final alignment to the standard output.
+
+ -n Sort by read names rather than by chromosomal coordi-
+ nates
+
+ -m INT Approximately the maximum required memory.
+ [500000000]
+
+
+ merge samtools merge [-h inh.sam] [-nr] <out.bam> <in1.bam>
+ <in2.bam> [...]
+
+ Merge multiple sorted alignments. The header reference lists
+ of all the input BAM files, and the @SQ headers of inh.sam,
+ if any, must all refer to the same set of reference
+ sequences. The header reference list and (unless overridden
+ by -h) `@' headers of in1.bam will be copied to out.bam, and
+ the headers of other files will be ignored.
+
+ OPTIONS:
+
+ -h FILE Use the lines of FILE as `@' headers to be copied to
+ out.bam, replacing any header lines that would other-
+ wise be copied from in1.bam. (FILE is actually in
+ SAM format, though any alignment records it may con-
+ tain are ignored.)
+
+ -r Attach an RG tag to each alignment. The tag value is
+ inferred from file names.
+
+ -n The input alignments are sorted by read names rather
+ than by chromosomal coordinates
+
+
+ index samtools index <aln.bam>
+
+ Index sorted alignment for fast random access. Index file
+ <aln.bam>.bai will be created.
+
+
+ idxstats samtools idxstats <aln.bam>
+
+ Retrieve and print stats in the index file. The output is TAB
+ delimited with each line consisting of reference sequence
+ name, sequence length, # mapped reads and # unmapped reads.
+
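As a rough illustration of consuming the TAB-delimited idxstats output described above, the sketch below parses four-column lines into records. `parse_idxstats` and the `IdxStat` record are hypothetical names, and the sample text is made up for the example rather than taken from a real run.

```python
from collections import namedtuple

# Hypothetical record type: one field per documented column.
IdxStat = namedtuple("IdxStat", "name length mapped unmapped")

def parse_idxstats(text):
    """Parse idxstats-style text: name, length, # mapped, # unmapped."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length, mapped, unmapped = line.split("\t")
        rows.append(IdxStat(name, int(length), int(mapped), int(unmapped)))
    return rows

# Made-up sample input, for illustration only.
sample = "chr1\t1000\t12\t1\nchr2\t500\t3\t0\n"
rows = parse_idxstats(sample)
total_mapped = sum(row.mapped for row in rows)
```

In practice the text would come from capturing the standard output of `samtools idxstats aln.sorted.bam`.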
+
+ faidx samtools faidx <ref.fasta> [region1 [...]]
+
+ Index reference sequence in the FASTA format or extract sub-
+ sequence from indexed reference sequence. If no region is
+ specified, faidx will index the file and create
+ <ref.fasta>.fai on the disk. If regions are specified, the
+ subsequences will be retrieved and printed to stdout in the
+ FASTA format. The input file can be compressed in the RAZF
+ format.
+
fixmate samtools fixmate <in.nameSrt.bam> <out.bam>
name-sorted alignment.
- rmdup samtools rmdup <input.srt.bam> <out.bam>
+ rmdup samtools rmdup [-sS] <input.srt.bam> <out.bam>
Remove potential PCR duplicates: if multiple read pairs have
identical external coordinates, only retain the pair with
- highest mapping quality. This command ONLY works with FR
- orientation and requires ISIZE is correctly set.
-
+ highest mapping quality. In the paired-end mode, this
+ command ONLY works with FR orientation and requires that
+ ISIZE be correctly set. It does not work for unpaired reads
+ (e.g. two ends mapped to different chromosomes or orphan
+ reads).
+ OPTIONS:
- rmdupse samtools rmdupse <input.srt.bam> <out.bam>
-
- Remove potential duplicates for single-ended reads. This com-
- mand will treat all reads as single-ended even if they are
- paired in fact.
+ -s Remove duplicates for single-end reads. By default,
+ the command works for paired-end reads only.
+ -S Treat paired-end reads as single-end reads.
- fillmd samtools fillmd [-e] <aln.bam> <ref.fasta>
+ calmd samtools calmd [-eubS] <aln.bam> <ref.fasta>
- Generate the MD tag. If the MD tag is already present, this
- command will give a warning if the MD tag generated is dif-
- ferent from the existing tag.
+ Generate the MD tag. If the MD tag is already present, this
+ command will give a warning if the MD tag generated is dif-
+ ferent from the existing tag. Output SAM by default.
OPTIONS:
- -e Convert a the read base to = if it is identical to
- the aligned reference base. Indel caller does not
+ -e Convert the read base to = if it is identical to
+ the aligned reference base. Indel caller does not
support the = bases at the moment.
+ -u Output uncompressed BAM
+
+ -b Output compressed BAM
+
+ -S The input is SAM with header lines
SAM FORMAT
- SAM is TAB-delimited. Apart from the header lines, which are started
+ SAM is TAB-delimited. Apart from the header lines, which are started
with the `@' symbol, each alignment line consists of:
Each bit in the FLAG field is defined as:
- +-------+--------------------------------------------------+
- | Flag | Description |
- +-------+--------------------------------------------------+
- |0x0001 | the read is paired in sequencing |
- |0x0002 | the read is mapped in a proper pair |
- |0x0004 | the query sequence itself is unmapped |
- |0x0008 | the mate is unmapped |
- |0x0010 | strand of the query (1 for reverse) |
- |0x0020 | strand of the mate |
- |0x0040 | the read is the first read in a pair |
- |0x0080 | the read is the second read in a pair |
- |0x0100 | the alignment is not primary |
- |0x0200 | the read fails platform/vendor quality checks |
- |0x0400 | the read is either a PCR or an optical duplicate |
- +-------+--------------------------------------------------+
+ +-------+-----+--------------------------------------------------+
+ | Flag | Chr | Description |
+ +-------+-----+--------------------------------------------------+
+ |0x0001 | p | the read is paired in sequencing |
+ |0x0002 | P | the read is mapped in a proper pair |
+ |0x0004 | u | the query sequence itself is unmapped |
+ |0x0008 | U | the mate is unmapped |
+ |0x0010 | r | strand of the query (1 for reverse) |
+ |0x0020 | R | strand of the mate |
+ |0x0040 | 1 | the read is the first read in a pair |
+ |0x0080 | 2 | the read is the second read in a pair |
+ |0x0100 | s | the alignment is not primary |
+ |0x0200 | f | the read fails platform/vendor quality checks |
+ |0x0400 | d | the read is either a PCR or an optical duplicate |
+ +-------+-----+--------------------------------------------------+
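The FLAG table above maps each bit to a one-character code. The sketch below shows how a numeric FLAG could be decoded into those characters, and how a bit mask such as the pileup -m default of 1796 selects reads. The function names `flag_to_chars` and `is_filtered` are hypothetical and exist only for this example.

```python
# (bit, character) pairs copied from the FLAG table.
FLAG_CHARS = [
    (0x0001, "p"), (0x0002, "P"), (0x0004, "u"), (0x0008, "U"),
    (0x0010, "r"), (0x0020, "R"), (0x0040, "1"), (0x0080, "2"),
    (0x0100, "s"), (0x0200, "f"), (0x0400, "d"),
]

def flag_to_chars(flag):
    """Return the table characters for every bit set in flag."""
    return "".join(ch for bit, ch in FLAG_CHARS if flag & bit)

def is_filtered(flag, mask=1796):
    """True if flag shares at least one bit with mask."""
    return (flag & mask) != 0

proper_pair = flag_to_chars(99)    # 0x1 + 0x2 + 0x20 + 0x40
mask_bits = flag_to_chars(1796)    # 0x4 + 0x100 + 0x200 + 0x400
```

Decoding 1796 this way shows that the mask covers unmapped, non-primary, QC-failed and duplicate reads.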
LIMITATIONS
- o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
+ o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
bam_aux.c.
- o CIGAR operation P is not properly handled at the moment.
-
- o In merging, the input files are required to have the same number of
- reference sequences. The requirement can be relaxed. In addition,
- merging does not reconstruct the header dictionaries automatically.
- Endusers have to provide the correct header. Picard is better at
+ o In merging, the input files are required to have the same number of
+ reference sequences. The requirement can be relaxed. In addition,
+ merging does not reconstruct the header dictionaries automatically.
+ End users have to provide the correct header. Picard is better at
merging.
- o Samtools' rmdup does not work for single-end data and does not remove
- duplicates across chromosomes. Picard is better.
+ o Samtools paired-end rmdup does not work for unpaired reads (e.g.
+ orphan reads or ends mapped to different chromosomes). If this is a
+ concern, please use Picard's MarkDuplicates, which handles these
+ cases correctly, although it is a little slower.
AUTHOR
- Heng Li from the Sanger Institute wrote the C version of samtools. Bob
+ Heng Li from the Sanger Institute wrote the C version of samtools. Bob
Handsaker from the Broad Institute implemented the BGZF library and Jue
- Ruan from Beijing Genomics Institute wrote the RAZF library. Various
- people in the 1000Genomes Project contributed to the SAM format speci-
+ Ruan from Beijing Genomics Institute wrote the RAZF library. Various
+ people in the 1000 Genomes Project contributed to the SAM format speci-
fication.
-samtools-0.1.6 2 September 2009 samtools(1)
+samtools-0.1.8 11 July 2010 samtools(1)