otherwise only alignments overlapping the specified regions
will be output. An alignment may be given multiple times if
it is overlapping several regions. A region can be presented,
- for example, in the following format: `chr2', `chr2:1000000'
- or `chr2:1,000,000-2,000,000'. The coordinate is 1-based.
+ for example, in the following format: `chr2' (the whole
+ chr2), `chr2:1000000' (region starting from 1,000,000bp) or
+ `chr2:1,000,000-2,000,000' (region between 1,000,000 and
+ 2,000,000bp including the end points). The coordinate is
+ 1-based.
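As an illustration of the region syntax described above, the following Python sketch splits a region string into a reference name and 1-based inclusive coordinates. The function name `parse_region` is hypothetical and is not part of samtools. Returning `None` for a missing coordinate is an assumption made for this example, and the sketch also assumes the reference name contains no colon.

```python
def parse_region(region):
    """Split 'chr2', 'chr2:1000000' or 'chr2:1,000,000-2,000,000'
    into (name, start, end). Coordinates are 1-based and inclusive;
    a coordinate that is not given is returned as None."""
    name, colon, coords = region.partition(":")
    if not colon:
        # Whole reference sequence, e.g. 'chr2'.
        return name, None, None
    # Thousands separators are allowed in the coordinates.
    coords = coords.replace(",", "")
    start_text, dash, end_text = coords.partition("-")
    start = int(start_text)
    end = int(end_text) if dash else None
    return name, start, end
```

For example, `parse_region("chr2:1,000,000-2,000,000")` yields the name `chr2` with both end points included in the interval.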
OPTIONS:
-b Output in the BAM format.
-u Output uncompressed BAM. This option saves time spent
- on compression/decomprssion and is thus preferred
+ on compression/decompression and is thus preferred
when the output is piped to another samtools command.
-h Include the header in the output.
-H Output the header only.
- -S Input is in SAM. If @SQ header lines are absent, the
+ -S Input is in SAM. If @SQ header lines are absent, the
`-t' option is required.
- -t FILE This file is TAB-delimited. Each line must contain
- the reference name and the length of the reference,
- one line for each distinct reference; additional
- fields are ignored. This file also defines the order
- of the reference sequences in sorting. If you run
- `samtools faidx <ref.fa>', the resultant index file
- <ref.fa>.fai can be used as this <in.ref_list> file.
+ -t FILE This file is TAB-delimited. Each line must contain
+ the reference name and the length of the reference,
+ one line for each distinct reference; additional
+ fields are ignored. This file also defines the order
+ of the reference sequences in sorting. If you run
+ `samtools faidx <ref.fa>', the resultant index file
+ <ref.fa>.fai can be used as this <in.ref_list> file.
-o FILE Output file [stdout]
- -f INT Only output alignments with all bits in INT present
+ -f INT Only output alignments with all bits in INT present
in the FLAG field. INT can be in hex in the format of
/^0x[0-9A-F]+/ [0]
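The `-f INT' test described above can be sketched in Python as a bit-mask comparison. The helper names `parse_flag_mask` and `passes_f` are hypothetical and do not come from the samtools source. The sketch requires the whole string to match the hex pattern, which is slightly stricter than the pattern quoted above.

```python
import re

# Hex form as described for -f: '0x' followed by upper-case hex digits.
HEX_RE = re.compile(r"0x[0-9A-F]+")

def parse_flag_mask(text):
    """Parse INT given either in decimal or in hex such as '0x10'."""
    if HEX_RE.fullmatch(text):
        return int(text, 16)
    return int(text)

def passes_f(flag, mask):
    """True if every bit set in mask is also set in the FLAG field."""
    return (flag & mask) == mask
```

With the default mask of 0, every alignment passes, because no bits are required.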
faidx samtools faidx <ref.fasta> [region1 [...]]
- Index reference sequence in the FASTA format or extract sub-
- sequence from indexed reference sequence. If no region is
+ Index reference sequence in the FASTA format or extract sub-
+ sequence from indexed reference sequence. If no region is
specified, faidx will index the file and create
- <ref.fasta>.fai on the disk. If regions are speficified, the
- subsequences will be retrieved and printed to stdout in the
- FASTA format. The input file can be compressed in the RAZF
+ <ref.fasta>.fai on the disk. If regions are specified, the
+ subsequences will be retrieved and printed to stdout in the
+ FASTA format. The input file can be compressed in the RAZF
format.
- pileup samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l
- in.site_list] [-iscgS2] [-T theta] [-N nHap] [-r
+ pileup samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l
+ in.site_list] [-iscgS2] [-T theta] [-N nHap] [-r
pairDiffRate] <in.bam>|<in.sam>
- Print the alignment in the pileup format. In the pileup for-
- mat, each line represents a genomic position, consisting of
+ Print the alignment in the pileup format. In the pileup for-
+ mat, each line represents a genomic position, consisting of
chromosome name, coordinate, reference base, read bases, read
- qualities and alignment mapping qualities. Information on
+ qualities and alignment mapping qualities. Information on
match, mismatch, indel, strand, mapping quality and start and
- end of a read are all encoded at the read base column. At
- this column, a dot stands for a match to the reference base
- on the forward strand, a comma for a match on the reverse
- strand, `ACGTN' for a mismatch on the forward strand and
- `acgtn' for a mismatch on the reverse strand. A pattern
- `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion
- between this reference position and the next reference posi-
- tion. The length of the insertion is given by the integer in
- the pattern, followed by the inserted sequence. Similarly, a
+ end of a read are all encoded at the read base column. At
+ this column, a dot stands for a match to the reference base
+ on the forward strand, a comma for a match on the reverse
+ strand, `ACGTN' for a mismatch on the forward strand and
+ `acgtn' for a mismatch on the reverse strand. A pattern
+ `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion
+ between this reference position and the next reference posi-
+ tion. The length of the insertion is given by the integer in
+ the pattern, followed by the inserted sequence. Similarly, a
pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the
- reference. The deleted bases will be presented as `*' in the
- following lines. Also at the read base column, a symbol `^'
- marks the start of a read segment which is a contiguous sub-
- sequence on the read separated by `N/S/H' CIGAR operations.
- The ASCII of the character following `^' minus 33 gives the
- mapping quality. A symbol `$' marks the end of a read seg-
+ reference. The deleted bases will be presented as `*' in the
+ following lines. Also at the read base column, a symbol `^'
+ marks the start of a read segment which is a contiguous sub-
+ sequence on the read separated by `N/S/H' CIGAR operations.
+ The ASCII of the character following `^' minus 33 gives the
+ mapping quality. A symbol `$' marks the end of a read seg-
ment.
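The encoding of the read base column described above can be illustrated with a small Python tokenizer. This is only a sketch based on the description in this section, not the samtools implementation. The function name `decode_read_bases` and the event labels are invented for the example.

```python
# Single-character symbols that carry no extra value.
SINGLE = {".": "match_fwd", ",": "match_rev", "*": "deleted", "$": "end"}

def decode_read_bases(column):
    """Tokenize the read base column of one pileup line into
    a list of (kind, value) tuples."""
    events = []
    i = 0
    while i < len(column):
        c = column[i]
        if c == "^":
            # The character after '^' encodes mapping quality as ASCII - 33.
            events.append(("start", ord(column[i + 1]) - 33))
            i += 2
        elif c in "+-":
            # '+' or '-' is followed by a length and that many bases.
            j = i + 1
            while j < len(column) and column[j].isdigit():
                j += 1
            length = int(column[i + 1:j])
            kind = "insertion" if c == "+" else "deletion"
            events.append((kind, column[j:j + length]))
            i = j + length
        elif c in SINGLE:
            events.append((SINGLE[c], None))
            i += 1
        elif c in "ACGTN":
            events.append(("mismatch_fwd", c))
            i += 1
        elif c in "acgtn":
            events.append(("mismatch_rev", c))
            i += 1
        else:
            raise ValueError("unexpected symbol %r at offset %d" % (c, i))
    return events
```

The indel length is read from the integer rather than by matching letters greedily, so a mismatch base that directly follows an indel is not absorbed into the indel sequence.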
- If option -c is applied, the consensus base, consensus qual-
- ity, SNP quality and RMS mapping quality of the reads cover-
- ing the site will be inserted between the `reference base'
- and the `read bases' columns. An indel occupies an additional
- line. Each indel line consists of chromosome name, coordi-
- nate, a star, the genotype, consensus quality, SNP quality,
+ If option -c is applied, the consensus base, Phred-scaled
+ consensus quality, SNP quality (i.e. the Phred-scaled proba-
+ bility of the consensus being identical to the reference) and
+ root mean square (RMS) mapping quality of the reads covering
+ the site will be inserted between the `reference base' and
+ the `read bases' columns. An indel occupies an additional
+ line. Each indel line consists of chromosome name, coordi-
+ nate, a star, the genotype, consensus quality, SNP quality,
RMS mapping quality, # covering reads, the first allele, the
- second allele, # reads supporting the first allele, # reads
- supporting the second allele and # reads containing indels
+ second allele, # reads supporting the first allele, # reads
+ supporting the second allele and # reads containing indels
different from the top two alleles.
OPTIONS:
- -s Print the mapping quality as the last column. This
- option makes the output easier to parse, although
+ -s Print the mapping quality as the last column. This
+ option makes the output easier to parse, although
this format is not space efficient.
-i Only output pileup lines containing indels.
- -f FILE The reference sequence in the FASTA format. Index
+ -f FILE The reference sequence in the FASTA format. Index
file FILE.fai will be created if absent.
-M INT Cap mapping quality at INT [60]
- -t FILE List of reference names ane sequence lengths, in
- the format described for the import command. If
- this option is present, samtools assumes the input
+ -t FILE List of reference names and sequence lengths, in
+ the format described for the import command. If
+ this option is present, samtools assumes the input
<in.alignment> is in SAM format; otherwise it
assumes BAM format.
- -l FILE List of sites at which pileup is output. This file
- is space delimited. The first two columns are
- required to be chromosome and 1-based coordinate.
- Additional columns are ignored. It is recommended
+ -l FILE List of sites at which pileup is output. This file
+ is space delimited. The first two columns are
+ required to be chromosome and 1-based coordinate.
+ Additional columns are ignored. It is recommended
to use option -s together with -l as in the default
format we may not know the mapping quality.
- -c Call the consensus sequence using MAQ consensus
+ -c Call the consensus sequence using MAQ consensus
model. Options -T, -N, -I and -r are only effective
when -c or -g is in use.
- -g Generate genotype likelihood in the binary GLFv3
+ -g Generate genotype likelihood in the binary GLFv3
format. This option suppresses -c, -i and -s.
- -T FLOAT The theta parameter (error dependency coefficient)
+ -T FLOAT The theta parameter (error dependency coefficient)
in the maq consensus calling model [0.85]
-N INT Number of haplotypes in the sample (>=2) [2]
- -r FLOAT Expected fraction of differences between a pair of
+ -r FLOAT Expected fraction of differences between a pair of
haplotypes [0.001]
- -I INT Phred probability of an indel in sequencing/prep.
+ -I INT Phred probability of an indel in sequencing/prep.
[40]
tview samtools tview <in.sorted.bam> [ref.fasta]
- Text alignment viewer (based on the ncurses library). In the
- viewer, press `?' for help and press `g' to check the align-
- ment start from a region in the format like
+ Text alignment viewer (based on the ncurses library). In the
+ viewer, press `?' for help and press `g' to view the align-
+ ment starting from a region in a format like
`chr10:10,000,000'.
-
fixmate samtools fixmate <in.nameSrt.bam> <out.bam>
Fill in mate coordinates, ISIZE and mate related flags from a
rmdup samtools rmdup <input.srt.bam> <out.bam>
- Remove potential PCR duplicates: if multiple read pairs have
- identical external coordinates, only retain the pair with
- highest mapping quality. This command ONLY works with FR
+ Remove potential PCR duplicates: if multiple read pairs have
+ identical external coordinates, only retain the pair with
+ highest mapping quality. This command ONLY works with FR
orientation and requires that ISIZE is correctly set.
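The rule stated above (among read pairs with identical external coordinates, keep the one with the highest mapping quality) can be sketched in Python as follows. This is an illustration of the rule only, not the samtools algorithm. The function name, the dictionary keys `chrom`, `left`, `right` and `mapq`, the use of a single mapping quality per pair, and the tie-breaking choice (keep the first pair seen) are all assumptions made for the example.

```python
def remove_duplicate_pairs(pairs):
    """Keep one read pair per (chrom, left, right) external coordinate,
    preferring the pair with the highest mapping quality."""
    best = {}
    for pair in pairs:
        key = (pair["chrom"], pair["left"], pair["right"])
        kept = best.get(key)
        # Replace only on strictly higher quality, so ties keep the first pair.
        if kept is None or pair["mapq"] > kept["mapq"]:
            best[key] = pair
    return list(best.values())
```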
-
rmdupse samtools rmdupse <input.srt.bam> <out.bam>
Remove potential duplicates for single-ended reads. This com-
- mand will treat all reads as single-ended even if they are
+ mand will treat all reads as single-ended even if they are
paired in fact.
-
fillmd samtools fillmd [-e] <aln.bam> <ref.fasta>
- Generate the MD tag. If the MD tag is already present, this
- command will give a warning if the MD tag generated is dif-
+ Generate the MD tag. If the MD tag is already present, this
+ command will give a warning if the MD tag generated is dif-
ferent from the existing tag.
OPTIONS:
- -e Convert a the read base to = if it is identical to
- the aligned reference base. Indel caller does not
+ -e Convert the read base to = if it is identical to
+ the aligned reference base. Indel caller does not
support the = bases at the moment.
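The effect of `-e' on the read sequence can be sketched in Python as below. The function name `mask_matches` is hypothetical. The sketch assumes the read and reference strings are already aligned base for base with no indels or clipping, and it compares bases case-insensitively, which is also an assumption.

```python
def mask_matches(read, ref):
    """Replace each read base with '=' where it equals the aligned
    reference base; mismatching bases are left unchanged."""
    return "".join(
        "=" if r.upper() == f.upper() else r
        for r, f in zip(read, ref)
    )
```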
SAM FORMAT
- SAM is TAB-delimited. Apart from the header lines, which are started
+ SAM is TAB-delimited. Apart from the header lines, which are started
with the `@' symbol, each alignment line consists of:
+-------+--------------------------------------------------+
LIMITATIONS
- o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
+ o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
bam_aux.c.
o CIGAR operation P is not properly handled at the moment.
- o In merging, the input files are required to have the same number of
- reference sequences. The requirement can be relaxed. In addition,
- merging does not reconstruct the header dictionaries automatically.
- Endusers have to provide the correct header. Picard is better at
+ o In merging, the input files are required to have the same number of
+ reference sequences. The requirement can be relaxed. In addition,
+ merging does not reconstruct the header dictionaries automatically.
+ End users have to provide the correct header. Picard is better at
merging.
o Samtools' rmdup does not work for single-end data and does not remove
AUTHOR
- Heng Li from the Sanger Institute wrote the C version of samtools. Bob
+ Heng Li from the Sanger Institute wrote the C version of samtools. Bob
Handsaker from the Broad Institute implemented the BGZF library and Jue
- Ruan from Beijing Genomics Institute wrote the RAZF library. Various
- people in the 1000Genomes Project contributed to the SAM format speci-
+ Ruan from Beijing Genomics Institute wrote the RAZF library. Various
+ people in the 1000Genomes Project contributed to the SAM format speci-
fication.
-samtools-0.1.6 2 September 2009 samtools(1)
+samtools-0.1.7 10 November 2009 samtools(1)