X-Git-Url: http://woldlab.caltech.edu/gitweb/?p=samtools.git;a=blobdiff_plain;f=samtools.txt;h=41cabe5053986a44f415ae35981c0f61738a7996;hp=09c1ccdedf9be5dd1964022519948a3932d384c8;hb=e3b3a0177339fb8c099346986e965e3bd5b85999;hpb=8d2494d1fb7cd0fa7c63be5ffba8dd1a11457522 diff --git a/samtools.txt b/samtools.txt index 09c1ccd..41cabe5 100644 --- a/samtools.txt +++ b/samtools.txt @@ -22,8 +22,7 @@ SYNOPSIS samtools pileup -vcf ref.fasta aln.sorted.bam - samtools mpileup -C50 -agf ref.fasta -r chr3:1,000-2,000 in1.bam - in2.bam + samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam samtools tview aln.sorted.bam ref.fasta @@ -48,7 +47,7 @@ DESCRIPTION COMMANDS AND OPTIONS - view samtools view [-bhuHS] [-t in.refList] [-o output] [-f + view samtools view [-bchuHS] [-t in.refList] [-o output] [-f reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read- Group] [-R rgFile] | [region1 [...]] @@ -90,6 +89,10 @@ COMMANDS AND OPTIONS -S Input is in SAM. If @SQ header lines are absent, the `-t' option is required. + -c Instead of printing the alignments, only count them + and print the total number. All filter options, such + as `-f', `-F' and `-q' , are taken into account. + -t FILE This file is TAB-delimited. Each line must contain the reference name and the length of the reference, one line for each distinct reference; additional @@ -112,109 +115,6 @@ COMMANDS AND OPTIONS reference sequence. - pileup samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] - [-l in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N - nHap] [-r pairDiffRate] [-m mask] [-d maxIndelDepth] [-G - indelPrior] | - - Print the alignment in the pileup format. In the pileup for- - mat, each line represents a genomic position, consisting of - chromosome name, coordinate, reference base, read bases, read - qualities and alignment mapping qualities. Information on - match, mismatch, indel, strand, mapping quality and start and - end of a read are all encoded at the read base column. At - this column, a dot stands for a match to the reference base - on the forward strand, a comma for a match on the reverse - strand, a '>' or '<' for a reference skip, `ACGTN' for a mis- - match on the forward strand and `acgtn' for a mismatch on the - reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates - there is an insertion between this reference position and the - next reference position. The length of the insertion is given - by the integer in the pattern, followed by the inserted - sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' repre- - sents a deletion from the reference. The deleted bases will - be presented as `*' in the following lines. Also at the read - base column, a symbol `^' marks the start of a read. The - ASCII of the character following `^' minus 33 gives the map- - ping quality. A symbol `$' marks the end of a read segment. - - If option -c is applied, the consensus base, Phred-scaled - consensus quality, SNP quality (i.e. the Phred-scaled proba- - bility of the consensus being identical to the reference) and - root mean square (RMS) mapping quality of the reads covering - the site will be inserted between the `reference base' and - the `read bases' columns. An indel occupies an additional - line. Each indel line consists of chromosome name, coordi- - nate, a star, the genotype, consensus quality, SNP quality, - RMS mapping quality, # covering reads, the first alllele, the - second allele, # reads supporting the first allele, # reads - supporting the second allele and # reads containing indels - different from the top two alleles. - - The position of indels is offset by -1. - - OPTIONS: - - -B Disable the BAQ computation. See the mpileup com- - mand for details. - - -c Call the consensus sequence using SOAPsnp consensus - model. Options -T, -N, -I and -r are only effective - when -c or -g is in use. - - -C INT Coefficient for downgrading the mapping quality of - poorly mapped reads. See the mpileup command for - details. [0] - - -d INT Use the first NUM reads in the pileup for indel - calling for speed up. Zero for unlimited. [1024] - - -f FILE The reference sequence in the FASTA format. Index - file FILE.fai will be created if absent. - - -g Generate genotype likelihood in the binary GLFv3 - format. This option suppresses -c, -i and -s. This - option is deprecated by the mpileup command. - - -i Only output pileup lines containing indels. - - -I INT Phred probability of an indel in sequencing/prep. - [40] - - -l FILE List of sites at which pileup is output. This file - is space delimited. The first two columns are - required to be chromosome and 1-based coordinate. - Additional columns are ignored. It is recommended - to use option - - -m INT Filter reads with flag containing bits in INT - [1796] - - -M INT Cap mapping quality at INT [60] - - -N INT Number of haplotypes in the sample (>=2) [2] - - -r FLOAT Expected fraction of differences between a pair of - haplotypes [0.001] - - -s Print the mapping quality as the last column. This - option makes the output easier to parse, although - this format is not space efficient. - - -S The input file is in SAM. - - -t FILE List of reference names ane sequence lengths, in - the format described for the import command. If - this option is present, samtools assumes the input - is in SAM format; otherwise it - assumes in BAM format. -s together with -l as in - the default format we may not know the mapping - quality. - - -T FLOAT The theta parameter (error dependency coefficient) - in the maq consensus calling model [0.85] - - mpileup samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]] @@ -237,27 +137,43 @@ COMMANDS AND OPTIONS phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this - functionality; if enabled, the recommended value is - 50. [0] + functionality; if enabled, the recommended value for + BWA is 50. [0] + + -e INT Phred-scaled gap extension sequencing error probabil- + ity. Reducing INT leads to longer indels. [20] -f FILE The reference file [null] - -g Compute genotype likelihoods and output them in the + -g Compute genotype likelihoods and output them in the binary call format (BCF). - -u Similar to -g except that the output is uncompressed - BCF, which is preferred for pipeing. + -h INT Coefficient for modeling homopolymer errors. Given an + l-long homopolymer run, the sequencing error of an + indel of size s is modeled as INT*s/l. [100] -l FILE File containing a list of sites where pileup or BCF is outputted [null] - -q INT Minimum mapping quality for an alignment to be used + -o INT Phred-scaled gap open sequencing error probability. + Reducing INT leads to more indel calls. [40] + + -P STR Comma dilimited list of platforms (determined by @RG- + PL) from which indel candidates are obtained. It is + recommended to collect indel candidates from sequenc- + ing technologies that have low indel error rate such + as ILLUMINA. [all] + + -q INT Minimum mapping quality for an alignment to be used [0] -Q INT Minimum base quality for a base to be considered [13] -r STR Only generate pileup in region STR [all sites] + -u Similar to -g except that the output is uncompressed + BCF, which is preferred for piping. + reheader samtools reheader @@ -368,8 +284,11 @@ COMMANDS AND OPTIONS OPTIONS: - -e Convert a the read base to = if it is identical to - the aligned reference base. Indel caller does not + -A When used jointly with -r this option overwrites the + original base quality. + + -e Convert a the read base to = if it is identical to + the aligned reference base. Indel caller does not support the = bases at the moment. -u Output uncompressed BAM @@ -378,15 +297,117 @@ COMMANDS AND OPTIONS -S The input is SAM with header lines - -C INT Coefficient to cap mapping quality of poorly mapped + -C INT Coefficient to cap mapping quality of poorly mapped reads. See the pileup command for details. [0] - -r Perform probabilistic realignment to compute BAQ, - which will be used to cap base quality. + -r Compute the BQ tag without changing the base quality. + + + pileup samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] + [-l in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N + nHap] [-r pairDiffRate] [-m mask] [-d maxIndelDepth] [-G + indelPrior] | + + Print the alignment in the pileup format. In the pileup for- + mat, each line represents a genomic position, consisting of + chromosome name, coordinate, reference base, read bases, read + qualities and alignment mapping qualities. Information on + match, mismatch, indel, strand, mapping quality and start and + end of a read are all encoded at the read base column. At + this column, a dot stands for a match to the reference base + on the forward strand, a comma for a match on the reverse + strand, a '>' or '<' for a reference skip, `ACGTN' for a mis- + match on the forward strand and `acgtn' for a mismatch on the + reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates + there is an insertion between this reference position and the + next reference position. The length of the insertion is given + by the integer in the pattern, followed by the inserted + sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' repre- + sents a deletion from the reference. The deleted bases will + be presented as `*' in the following lines. Also at the read + base column, a symbol `^' marks the start of a read. The + ASCII of the character following `^' minus 33 gives the map- + ping quality. A symbol `$' marks the end of a read segment. + + If option -c is applied, the consensus base, Phred-scaled + consensus quality, SNP quality (i.e. the Phred-scaled proba- + bility of the consensus being identical to the reference) and + root mean square (RMS) mapping quality of the reads covering + the site will be inserted between the `reference base' and + the `read bases' columns. An indel occupies an additional + line. Each indel line consists of chromosome name, coordi- + nate, a star, the genotype, consensus quality, SNP quality, + RMS mapping quality, # covering reads, the first alllele, the + second allele, # reads supporting the first allele, # reads + supporting the second allele and # reads containing indels + different from the top two alleles. + + NOTE: Since 0.1.10, the `pileup' command is deprecated by + `mpileup'. + + OPTIONS: + + -B Disable the BAQ computation. See the mpileup com- + mand for details. + + -c Call the consensus sequence. Options -T, -N, -I and + -r are only effective when -c or -g is in use. + + -C INT Coefficient for downgrading the mapping quality of + poorly mapped reads. See the mpileup command for + details. [0] + + -d INT Use the first NUM reads in the pileup for indel + calling for speed up. Zero for unlimited. [1024] + + -f FILE The reference sequence in the FASTA format. Index + file FILE.fai will be created if absent. + + -g Generate genotype likelihood in the binary GLFv3 + format. This option suppresses -c, -i and -s. This + option is deprecated by the mpileup command. + + -i Only output pileup lines containing indels. + + -I INT Phred probability of an indel in sequencing/prep. + [40] + + -l FILE List of sites at which pileup is output. This file + is space delimited. The first two columns are + required to be chromosome and 1-based coordinate. + Additional columns are ignored. It is recommended + to use option + + -m INT Filter reads with flag containing bits in INT + [1796] + + -M INT Cap mapping quality at INT [60] + + -N INT Number of haplotypes in the sample (>=2) [2] + + -r FLOAT Expected fraction of differences between a pair of + haplotypes [0.001] + + -s Print the mapping quality as the last column. This + option makes the output easier to parse, although + this format is not space efficient. + + -S The input file is in SAM. + + -t FILE List of reference names ane sequence lengths, in + the format described for the import command. If + this option is present, samtools assumes the input + is in SAM format; otherwise it + assumes in BAM format. -s together with -l as in + the default format we may not know the mapping + quality. + + -T FLOAT The theta parameter (error dependency coefficient) + in the maq consensus calling model [0.85] SAM FORMAT - SAM is TAB-delimited. Apart from the header lines, which are started + SAM is TAB-delimited. Apart from the header lines, which are started with the `@' symbol, each alignment line consists of: @@ -441,51 +462,50 @@ EXAMPLES o Attach the RG tag while merging sorted alignments: - perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu- + perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu- mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt samtools merge -rh rg.txt merged.bam ga.bam 454.bam The value in a RG tag is determined by the file name the read is com- - ing from. In this example, in the merged.bam, reads from ga.bam will - be attached RG:Z:ga, while reads from 454.bam will be attached + ing from. In this example, in the merged.bam, reads from ga.bam will + be attached RG:Z:ga, while reads from 454.bam will be attached RG:Z:454. o Call SNPs and short indels for one diploid individual: - samtools pileup -vcf ref.fa aln.bam > var.raw.plp - samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp - awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' var.flt.plp > - var.final.plp + samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > + var.raw.bcf + bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > + var.flt.vcf The -D option of varFilter controls the maximum read depth, which should be adjusted to about twice the average read depth. One may - consider to add -C50 to pileup if mapping quality is overestimated + consider to add -C50 to mpileup if mapping quality is overestimated for reads containing excessive mismatches. Applying this option usu- - ally helps BWA-short but may not other mappers. It also potentially - increases reference biases. + ally helps BWA-short but may not other mappers. - o Call SNPs (not short indels) for multiple diploid individuals: + o Call SNPs and short indels for multiple diploid individuals: - samtools mpileup -augf ref.fa *.bam | bcftools view -bcv - > - snp.raw.bcf - bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 | bgzip - > snp.flt.vcf.gz + samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view + -bcvg - > var.raw.bcf + bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > + var.flt.vcf - Individuals are identified from the SM tags in the @RG header lines. - Individuals can be pooled in one alignment file; one individual can - also be separated into multiple files. Similarly, one may consider to - apply -C50 to mpileup. SNP calling in this way also works for single - sample and has the advantage of enabling more powerful filtering. The - drawback is the lack of short indel calling, which may be implemented - in future. + Individuals are identified from the SM tags in the @RG header lines. + Individuals can be pooled in one alignment file; one individual can + also be separated into multiple files. The -P option specifies that + indel candidates should be collected only from read groups with the + @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads + sequenced by an indel-prone technology may affect the performance of + indel calling. - o Derive the allele frequency spectrum (AFS) on a list of sites from + o Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals: - samtools mpileup -gf ref.fa *.bam > all.bcf + samtools mpileup -Igf ref.fa *.bam > all.bcf bcftools view -bl sites.list all.bcf > sites.bcf bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs @@ -493,40 +513,41 @@ EXAMPLES ...... where sites.list contains the list of sites with each line consisting - of the reference sequence name and position. The following bcftools + of the reference sequence name and position. The following bcftools commands estimate AFS by EM. o Dump BAQ applied alignment for other SNP callers: - samtools calmd -br aln.bam > aln.baq.bam + samtools calmd -bAr aln.bam > aln.baq.bam - It adds and corrects the NM and MD tags at the same time. The calmd - command also comes with the -C option, the same as the on in pileup + It adds and corrects the NM and MD tags at the same time. The calmd + command also comes with the -C option, the same as the one in pileup and mpileup. Apply if it helps. LIMITATIONS - o Unaligned words used in bam_import.c, bam_endian.h, bam.c and + o Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c. - o In merging, the input files are required to have the same number of - reference sequences. The requirement can be relaxed. In addition, - merging does not reconstruct the header dictionaries automatically. - Endusers have to provide the correct header. Picard is better at + o In merging, the input files are required to have the same number of + reference sequences. The requirement can be relaxed. In addition, + merging does not reconstruct the header dictionaries automatically. + Endusers have to provide the correct header. Picard is better at merging. - o Samtools paired-end rmdup does not work for unpaired reads (e.g. - orphan reads or ends mapped to different chromosomes). If this is a - concern, please use Picard's MarkDuplicate which correctly handles + o Samtools paired-end rmdup does not work for unpaired reads (e.g. + orphan reads or ends mapped to different chromosomes). If this is a + concern, please use Picard's MarkDuplicate which correctly handles these cases, although a little slower. AUTHOR - Heng Li from the Sanger Institute wrote the C version of samtools. Bob + Heng Li from the Sanger Institute wrote the C version of samtools. Bob Handsaker from the Broad Institute implemented the BGZF library and Jue - Ruan from Beijing Genomics Institute wrote the RAZF library. Various - people in the 1000 Genomes Project contributed to the SAM format speci- + Ruan from Beijing Genomics Institute wrote the RAZF library. John Mar- + shall and Petr Danecek contribute to the source code and various people + from the 1000 Genomes Project have contributed to the SAM format speci- fication. @@ -535,4 +556,4 @@ SEE ALSO -samtools-0.1.9 27 October 2010 samtools(1) +samtools-0.1.10 15 November 2010 samtools(1)