samtools pileup -vcf ref.fasta aln.sorted.bam
- samtools mpileup -C50 -agf ref.fasta -r chr3:1,000-2,000 in1.bam
- in2.bam
+ samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
samtools tview aln.sorted.bam ref.fasta
COMMANDS AND OPTIONS
- view samtools view [-bhuHS] [-t in.refList] [-o output] [-f
+ view samtools view [-bchuHS] [-t in.refList] [-o output] [-f
reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read-
Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
-S Input is in SAM. If @SQ header lines are absent, the
`-t' option is required.
+ -c Instead of printing the alignments, only count them
+ and print the total number. All filter options, such
+ as `-f', `-F' and `-q' , are taken into account.
+
-t FILE This file is TAB-delimited. Each line must contain
the reference name and the length of the reference,
one line for each distinct reference; additional
reference sequence.
- pileup samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
- [-l in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N
- nHap] [-r pairDiffRate] [-m mask] [-d maxIndelDepth] [-G
- indelPrior] <in.bam>|<in.sam>
-
- Print the alignment in the pileup format. In the pileup for-
- mat, each line represents a genomic position, consisting of
- chromosome name, coordinate, reference base, read bases, read
- qualities and alignment mapping qualities. Information on
- match, mismatch, indel, strand, mapping quality and start and
- end of a read are all encoded at the read base column. At
- this column, a dot stands for a match to the reference base
- on the forward strand, a comma for a match on the reverse
- strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
- match on the forward strand and `acgtn' for a mismatch on the
- reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates
- there is an insertion between this reference position and the
- next reference position. The length of the insertion is given
- by the integer in the pattern, followed by the inserted
- sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' repre-
- sents a deletion from the reference. The deleted bases will
- be presented as `*' in the following lines. Also at the read
- base column, a symbol `^' marks the start of a read. The
- ASCII of the character following `^' minus 33 gives the map-
- ping quality. A symbol `$' marks the end of a read segment.
-
- If option -c is applied, the consensus base, Phred-scaled
- consensus quality, SNP quality (i.e. the Phred-scaled proba-
- bility of the consensus being identical to the reference) and
- root mean square (RMS) mapping quality of the reads covering
- the site will be inserted between the `reference base' and
- the `read bases' columns. An indel occupies an additional
- line. Each indel line consists of chromosome name, coordi-
- nate, a star, the genotype, consensus quality, SNP quality,
- RMS mapping quality, # covering reads, the first alllele, the
- second allele, # reads supporting the first allele, # reads
- supporting the second allele and # reads containing indels
- different from the top two alleles.
-
- The position of indels is offset by -1.
-
- OPTIONS:
-
- -B Disable the BAQ computation. See the mpileup com-
- mand for details.
-
- -c Call the consensus sequence using SOAPsnp consensus
- model. Options -T, -N, -I and -r are only effective
- when -c or -g is in use.
-
- -C INT Coefficient for downgrading the mapping quality of
- poorly mapped reads. See the mpileup command for
- details. [0]
-
- -d INT Use the first NUM reads in the pileup for indel
- calling for speed up. Zero for unlimited. [1024]
-
- -f FILE The reference sequence in the FASTA format. Index
- file FILE.fai will be created if absent.
-
- -g Generate genotype likelihood in the binary GLFv3
- format. This option suppresses -c, -i and -s. This
- option is deprecated by the mpileup command.
-
- -i Only output pileup lines containing indels.
-
- -I INT Phred probability of an indel in sequencing/prep.
- [40]
-
- -l FILE List of sites at which pileup is output. This file
- is space delimited. The first two columns are
- required to be chromosome and 1-based coordinate.
- Additional columns are ignored. It is recommended
- to use option
-
- -m INT Filter reads with flag containing bits in INT
- [1796]
-
- -M INT Cap mapping quality at INT [60]
-
- -N INT Number of haplotypes in the sample (>=2) [2]
-
- -r FLOAT Expected fraction of differences between a pair of
- haplotypes [0.001]
-
- -s Print the mapping quality as the last column. This
- option makes the output easier to parse, although
- this format is not space efficient.
-
- -S The input file is in SAM.
-
- -t FILE List of reference names ane sequence lengths, in
- the format described for the import command. If
- this option is present, samtools assumes the input
- <in.alignment> is in SAM format; otherwise it
- assumes in BAM format. -s together with -l as in
- the default format we may not know the mapping
- quality.
-
- -T FLOAT The theta parameter (error dependency coefficient)
- in the maq consensus calling model [0.85]
-
-
mpileup samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l
list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam
[...]]
phred-scaled probability q of being generated from
the mapped position, the new mapping quality is about
sqrt((INT-q)/INT)*INT. A zero value disables this
- functionality; if enabled, the recommended value is
- 50. [0]
+ functionality; if enabled, the recommended value for
+ BWA is 50. [0]
+
+ -e INT Phred-scaled gap extension sequencing error probabil-
+ ity. Reducing INT leads to longer indels. [20]
-f FILE The reference file [null]
- -g Compute genotype likelihoods and output them in the
+ -g Compute genotype likelihoods and output them in the
binary call format (BCF).
- -u Similar to -g except that the output is uncompressed
- BCF, which is preferred for pipeing.
+ -h INT Coefficient for modeling homopolymer errors. Given an
+ l-long homopolymer run, the sequencing error of an
+ indel of size s is modeled as INT*s/l. [100]
-l FILE File containing a list of sites where pileup or BCF
is outputted [null]
- -q INT Minimum mapping quality for an alignment to be used
+ -o INT Phred-scaled gap open sequencing error probability.
+ Reducing INT leads to more indel calls. [40]
+
+ -P STR Comma dilimited list of platforms (determined by @RG-
+ PL) from which indel candidates are obtained. It is
+ recommended to collect indel candidates from sequenc-
+ ing technologies that have low indel error rate such
+ as ILLUMINA. [all]
+
+ -q INT Minimum mapping quality for an alignment to be used
[0]
-Q INT Minimum base quality for a base to be considered [13]
-r STR Only generate pileup in region STR [all sites]
+ -u Similar to -g except that the output is uncompressed
+ BCF, which is preferred for piping.
+
reheader samtools reheader <in.header.sam> <in.bam>
OPTIONS:
- -e Convert a the read base to = if it is identical to
- the aligned reference base. Indel caller does not
+ -A When used jointly with -r this option overwrites the
+ original base quality.
+
+ -e Convert a the read base to = if it is identical to
+ the aligned reference base. Indel caller does not
support the = bases at the moment.
-u Output uncompressed BAM
-S The input is SAM with header lines
- -C INT Coefficient to cap mapping quality of poorly mapped
+ -C INT Coefficient to cap mapping quality of poorly mapped
reads. See the pileup command for details. [0]
- -r Perform probabilistic realignment to compute BAQ,
- which will be used to cap base quality.
+ -r Compute the BQ tag without changing the base quality.
+
+
+ pileup samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
+ [-l in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N
+ nHap] [-r pairDiffRate] [-m mask] [-d maxIndelDepth] [-G
+ indelPrior] <in.bam>|<in.sam>
+
+ Print the alignment in the pileup format. In the pileup for-
+ mat, each line represents a genomic position, consisting of
+ chromosome name, coordinate, reference base, read bases, read
+ qualities and alignment mapping qualities. Information on
+ match, mismatch, indel, strand, mapping quality and start and
+ end of a read are all encoded at the read base column. At
+ this column, a dot stands for a match to the reference base
+ on the forward strand, a comma for a match on the reverse
+ strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
+ match on the forward strand and `acgtn' for a mismatch on the
+ reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates
+ there is an insertion between this reference position and the
+ next reference position. The length of the insertion is given
+ by the integer in the pattern, followed by the inserted
+ sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' repre-
+ sents a deletion from the reference. The deleted bases will
+ be presented as `*' in the following lines. Also at the read
+ base column, a symbol `^' marks the start of a read. The
+ ASCII of the character following `^' minus 33 gives the map-
+ ping quality. A symbol `$' marks the end of a read segment.
+
+ If option -c is applied, the consensus base, Phred-scaled
+ consensus quality, SNP quality (i.e. the Phred-scaled proba-
+ bility of the consensus being identical to the reference) and
+ root mean square (RMS) mapping quality of the reads covering
+ the site will be inserted between the `reference base' and
+ the `read bases' columns. An indel occupies an additional
+ line. Each indel line consists of chromosome name, coordi-
+ nate, a star, the genotype, consensus quality, SNP quality,
+ RMS mapping quality, # covering reads, the first alllele, the
+ second allele, # reads supporting the first allele, # reads
+ supporting the second allele and # reads containing indels
+ different from the top two alleles.
+
+ NOTE: Since 0.1.10, the `pileup' command is deprecated by
+ `mpileup'.
+
+ OPTIONS:
+
+ -B Disable the BAQ computation. See the mpileup com-
+ mand for details.
+
+ -c Call the consensus sequence. Options -T, -N, -I and
+ -r are only effective when -c or -g is in use.
+
+ -C INT Coefficient for downgrading the mapping quality of
+ poorly mapped reads. See the mpileup command for
+ details. [0]
+
+ -d INT Use the first NUM reads in the pileup for indel
+ calling for speed up. Zero for unlimited. [1024]
+
+ -f FILE The reference sequence in the FASTA format. Index
+ file FILE.fai will be created if absent.
+
+ -g Generate genotype likelihood in the binary GLFv3
+ format. This option suppresses -c, -i and -s. This
+ option is deprecated by the mpileup command.
+
+ -i Only output pileup lines containing indels.
+
+ -I INT Phred probability of an indel in sequencing/prep.
+ [40]
+
+ -l FILE List of sites at which pileup is output. This file
+ is space delimited. The first two columns are
+ required to be chromosome and 1-based coordinate.
+ Additional columns are ignored. It is recommended
+ to use option
+
+ -m INT Filter reads with flag containing bits in INT
+ [1796]
+
+ -M INT Cap mapping quality at INT [60]
+
+ -N INT Number of haplotypes in the sample (>=2) [2]
+
+ -r FLOAT Expected fraction of differences between a pair of
+ haplotypes [0.001]
+
+ -s Print the mapping quality as the last column. This
+ option makes the output easier to parse, although
+ this format is not space efficient.
+
+ -S The input file is in SAM.
+
+ -t FILE List of reference names ane sequence lengths, in
+ the format described for the import command. If
+ this option is present, samtools assumes the input
+ <in.alignment> is in SAM format; otherwise it
+ assumes in BAM format. -s together with -l as in
+ the default format we may not know the mapping
+ quality.
+
+ -T FLOAT The theta parameter (error dependency coefficient)
+ in the maq consensus calling model [0.85]
SAM FORMAT
- SAM is TAB-delimited. Apart from the header lines, which are started
+ SAM is TAB-delimited. Apart from the header lines, which are started
with the `@' symbol, each alignment line consists of:
o Attach the RG tag while merging sorted alignments:
- perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
+ perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
samtools merge -rh rg.txt merged.bam ga.bam 454.bam
The value in a RG tag is determined by the file name the read is com-
- ing from. In this example, in the merged.bam, reads from ga.bam will
- be attached RG:Z:ga, while reads from 454.bam will be attached
+ ing from. In this example, in the merged.bam, reads from ga.bam will
+ be attached RG:Z:ga, while reads from 454.bam will be attached
RG:Z:454.
o Call SNPs and short indels for one diploid individual:
- samtools pileup -vcf ref.fa aln.bam > var.raw.plp
- samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp
- awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' var.flt.plp >
- var.final.plp
+ samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - >
+ var.raw.bcf
+ bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 >
+ var.flt.vcf
The -D option of varFilter controls the maximum read depth, which
should be adjusted to about twice the average read depth. One may
- consider to add -C50 to pileup if mapping quality is overestimated
+ consider to add -C50 to mpileup if mapping quality is overestimated
for reads containing excessive mismatches. Applying this option usu-
- ally helps BWA-short but may not other mappers. It also potentially
- increases reference biases.
+ ally helps BWA-short but may not other mappers.
- o Call SNPs (not short indels) for multiple diploid individuals:
+ o Call SNPs and short indels for multiple diploid individuals:
- samtools mpileup -augf ref.fa *.bam | bcftools view -bcv - >
- snp.raw.bcf
- bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 | bgzip
- > snp.flt.vcf.gz
+ samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view
+ -bcvg - > var.raw.bcf
+ bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 >
+ var.flt.vcf
- Individuals are identified from the SM tags in the @RG header lines.
- Individuals can be pooled in one alignment file; one individual can
- also be separated into multiple files. Similarly, one may consider to
- apply -C50 to mpileup. SNP calling in this way also works for single
- sample and has the advantage of enabling more powerful filtering. The
- drawback is the lack of short indel calling, which may be implemented
- in future.
+ Individuals are identified from the SM tags in the @RG header lines.
+ Individuals can be pooled in one alignment file; one individual can
+ also be separated into multiple files. The -P option specifies that
+ indel candidates should be collected only from read groups with the
+ @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads
+ sequenced by an indel-prone technology may affect the performance of
+ indel calling.
- o Derive the allele frequency spectrum (AFS) on a list of sites from
+ o Derive the allele frequency spectrum (AFS) on a list of sites from
multiple individuals:
- samtools mpileup -gf ref.fa *.bam > all.bcf
+ samtools mpileup -Igf ref.fa *.bam > all.bcf
bcftools view -bl sites.list all.bcf > sites.bcf
bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
......
where sites.list contains the list of sites with each line consisting
- of the reference sequence name and position. The following bcftools
+ of the reference sequence name and position. The following bcftools
commands estimate AFS by EM.
o Dump BAQ applied alignment for other SNP callers:
- samtools calmd -br aln.bam > aln.baq.bam
+ samtools calmd -bAr aln.bam > aln.baq.bam
- It adds and corrects the NM and MD tags at the same time. The calmd
- command also comes with the -C option, the same as the on in pileup
+ It adds and corrects the NM and MD tags at the same time. The calmd
+ command also comes with the -C option, the same as the one in pileup
and mpileup. Apply if it helps.
LIMITATIONS
- o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
+ o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
bam_aux.c.
- o In merging, the input files are required to have the same number of
- reference sequences. The requirement can be relaxed. In addition,
- merging does not reconstruct the header dictionaries automatically.
- Endusers have to provide the correct header. Picard is better at
+ o In merging, the input files are required to have the same number of
+ reference sequences. The requirement can be relaxed. In addition,
+ merging does not reconstruct the header dictionaries automatically.
+ Endusers have to provide the correct header. Picard is better at
merging.
- o Samtools paired-end rmdup does not work for unpaired reads (e.g.
- orphan reads or ends mapped to different chromosomes). If this is a
- concern, please use Picard's MarkDuplicate which correctly handles
+ o Samtools paired-end rmdup does not work for unpaired reads (e.g.
+ orphan reads or ends mapped to different chromosomes). If this is a
+ concern, please use Picard's MarkDuplicate which correctly handles
these cases, although a little slower.
AUTHOR
- Heng Li from the Sanger Institute wrote the C version of samtools. Bob
+ Heng Li from the Sanger Institute wrote the C version of samtools. Bob
Handsaker from the Broad Institute implemented the BGZF library and Jue
- Ruan from Beijing Genomics Institute wrote the RAZF library. Various
- people in the 1000 Genomes Project contributed to the SAM format speci-
+ Ruan from Beijing Genomics Institute wrote the RAZF library. John Mar-
+ shall and Petr Danecek contribute to the source code and various people
+ from the 1000 Genomes Project have contributed to the SAM format speci-
fication.
-samtools-0.1.9 27 October 2010 samtools(1)
+samtools-0.1.10 15 November 2010 samtools(1)