Merge commit 'upstream/0.1.10'

[samtools.git] / samtools.txt
diff --git a/samtools.txt b/samtools.txt

index 09c1ccdedf9be5dd1964022519948a3932d384c8..41cabe5053986a44f415ae35981c0f61738a7996 100644 (file)
--- a/samtools.txt
+++ b/samtools.txt
@@ -22,8 +22,7 @@ SYNOPSIS
  
         samtools pileup -vcf ref.fasta aln.sorted.bam
  
-       samtools  mpileup  -C50  -agf  ref.fasta  -r  chr3:1,000-2,000  in1.bam
-       in2.bam
+       samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
  
         samtools tview aln.sorted.bam ref.fasta
  
@@ -48,7 +47,7 @@ DESCRIPTION
  
  
  COMMANDS AND OPTIONS
-       view      samtools  view  [-bhuHS]  [-t  in.refList]  [-o  output]  [-f
+       view      samtools  view  [-bchuHS]  [-t  in.refList]  [-o  output] [-f
                   reqFlag] [-F skipFlag] [-q minMapQ] [-l  library]  [-r  read-
                   Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
  
@@ -90,6 +89,10 @@ COMMANDS AND OPTIONS
                   -S      Input is in SAM. If @SQ header lines are absent,  the
                           `-t' option is required.
  
+                 -c      Instead  of  printing the alignments, only count them
+                         and print the total number. All filter options,  such
+                         as `-f', `-F' and `-q' , are taken into account.
+
                   -t FILE This  file  is  TAB-delimited. Each line must contain
                           the reference name and the length of  the  reference,
                           one  line  for  each  distinct  reference; additional
@@ -112,109 +115,6 @@ COMMANDS AND OPTIONS
                   reference sequence.
  
  
-       pileup    samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
-                 [-l in.site_list] [-C capMapQ] [-M maxMapQ]  [-T  theta]  [-N
-                 nHap]  [-r  pairDiffRate]  [-m  mask]  [-d maxIndelDepth] [-G
-                 indelPrior] <in.bam>|<in.sam>
-
-                 Print the alignment in the pileup format. In the pileup  for-
-                 mat,  each  line represents a genomic position, consisting of
-                 chromosome name, coordinate, reference base, read bases, read
-                 qualities  and  alignment  mapping  qualities. Information on
-                 match, mismatch, indel, strand, mapping quality and start and
-                 end  of  a  read  are all encoded at the read base column. At
-                 this column, a dot stands for a match to the  reference  base
-                 on  the  forward  strand,  a comma for a match on the reverse
-                 strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
-                 match on the forward strand and `acgtn' for a mismatch on the
-                 reverse strand. A pattern  `\+[0-9]+[ACGTNacgtn]+'  indicates
-                 there is an insertion between this reference position and the
-                 next reference position. The length of the insertion is given
-                 by  the  integer  in  the  pattern,  followed by the inserted
-                 sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'  repre-
-                 sents  a  deletion from the reference. The deleted bases will
-                 be presented as `*' in the following lines. Also at the  read
-                 base  column,  a  symbol  `^'  marks the start of a read. The
-                 ASCII of the character following `^' minus 33 gives the  map-
-                 ping quality. A symbol `$' marks the end of a read segment.
-
-                 If  option  -c  is  applied, the consensus base, Phred-scaled
-                 consensus quality, SNP quality (i.e. the Phred-scaled  proba-
-                 bility of the consensus being identical to the reference) and
-                 root mean square (RMS) mapping quality of the reads  covering
-                 the  site  will  be inserted between the `reference base' and
-                 the `read bases' columns. An  indel  occupies  an  additional
-                 line.  Each  indel  line consists of chromosome name, coordi-
-                 nate, a star, the genotype, consensus quality,  SNP  quality,
-                 RMS mapping quality, # covering reads, the first alllele, the
-                 second allele, # reads supporting the first allele,  #  reads
-                 supporting  the  second  allele and # reads containing indels
-                 different from the top two alleles.
-
-                 The position of indels is offset by -1.
-
-                 OPTIONS:
-
-                 -B        Disable the BAQ computation. See the  mpileup  com-
-                           mand for details.
-
-                 -c        Call the consensus sequence using SOAPsnp consensus
-                           model. Options -T, -N, -I and -r are only effective
-                           when -c or -g is in use.
-
-                 -C INT    Coefficient  for downgrading the mapping quality of
-                           poorly mapped reads. See the  mpileup  command  for
-                           details. [0]
-
-                 -d INT    Use  the  first  NUM  reads in the pileup for indel
-                           calling for speed up. Zero for unlimited. [1024]
-
-                 -f FILE   The reference sequence in the FASTA  format.  Index
-                           file FILE.fai will be created if absent.
-
-                 -g        Generate  genotype  likelihood  in the binary GLFv3
-                           format. This option suppresses -c, -i and -s.  This
-                           option is deprecated by the mpileup command.
-
-                 -i        Only output pileup lines containing indels.
-
-                 -I INT    Phred  probability  of an indel in sequencing/prep.
-                           [40]
-
-                 -l FILE   List of sites at which pileup is output. This  file
-                           is  space  delimited.  The  first  two  columns are
-                           required to be chromosome and  1-based  coordinate.
-                           Additional  columns  are ignored. It is recommended
-                           to use option
-
-                 -m INT    Filter reads  with  flag  containing  bits  in  INT
-                           [1796]
-
-                 -M INT    Cap mapping quality at INT [60]
-
-                 -N INT    Number of haplotypes in the sample (>=2) [2]
-
-                 -r FLOAT  Expected  fraction of differences between a pair of
-                           haplotypes [0.001]
-
-                 -s        Print the mapping quality as the last column.  This
-                           option  makes  the output easier to parse, although
-                           this format is not space efficient.
-
-                 -S        The input file is in SAM.
-
-                 -t FILE   List of reference names ane  sequence  lengths,  in
-                           the  format  described  for  the import command. If
-                           this option is present, samtools assumes the  input
-                           <in.alignment>  is  in  SAM  format;  otherwise  it
-                           assumes in BAM format.  -s together with -l  as  in
-                           the  default  format  we  may  not know the mapping
-                           quality.
-
-                 -T FLOAT  The theta parameter (error dependency  coefficient)
-                           in the maq consensus calling model [0.85]
-
-
         mpileup   samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l
                   list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam
                   [...]]
@@ -237,27 +137,43 @@ COMMANDS AND OPTIONS
                           phred-scaled probability q of  being  generated  from
                           the mapped position, the new mapping quality is about
                           sqrt((INT-q)/INT)*INT. A  zero  value  disables  this
-                         functionality;  if  enabled, the recommended value is
-                         50. [0]
+                         functionality;  if enabled, the recommended value for
+                         BWA is 50. [0]
+
+                 -e INT  Phred-scaled gap extension sequencing error probabil-
+                         ity. Reducing INT leads to longer indels. [20]
  
                   -f FILE The reference file [null]
  
-                 -g      Compute genotype likelihoods and output them  in  the
+                 -g      Compute  genotype  likelihoods and output them in the
                           binary call format (BCF).
  
-                 -u      Similar  to -g except that the output is uncompressed
-                         BCF, which is preferred for pipeing.
+                 -h INT  Coefficient for modeling homopolymer errors. Given an
+                         l-long  homopolymer  run,  the sequencing error of an
+                         indel of size s is modeled as INT*s/l.  [100]
  
                   -l FILE File containing a list of sites where pileup  or  BCF
                           is outputted [null]
  
-                 -q INT  Minimum  mapping  quality for an alignment to be used
+                 -o INT  Phred-scaled  gap  open sequencing error probability.
+                         Reducing INT leads to more indel calls. [40]
+
+                 -P STR  Comma dilimited list of platforms (determined by @RG-
+                         PL)  from  which indel candidates are obtained. It is
+                         recommended to collect indel candidates from sequenc-
+                         ing  technologies that have low indel error rate such
+                         as ILLUMINA. [all]
+
+                 -q INT  Minimum mapping quality for an alignment to  be  used
                           [0]
  
                   -Q INT  Minimum base quality for a base to be considered [13]
  
                   -r STR  Only generate pileup in region STR [all sites]
  
+                 -u      Similar to -g except that the output is  uncompressed
+                         BCF, which is preferred for piping.
+
  
         reheader  samtools reheader <in.header.sam> <in.bam>
  
@@ -368,8 +284,11 @@ COMMANDS AND OPTIONS
  
                   OPTIONS:
  
-                 -e      Convert a the read base to = if it  is  identical  to
-                         the  aligned  reference  base.  Indel caller does not
+                 -A      When used jointly with -r this option overwrites  the
+                         original base quality.
+
+                 -e      Convert  a  the  read base to = if it is identical to
+                         the aligned reference base.  Indel  caller  does  not
                           support the = bases at the moment.
  
                   -u      Output uncompressed BAM
@@ -378,15 +297,117 @@ COMMANDS AND OPTIONS
  
                   -S      The input is SAM with header lines
  
-                 -C INT  Coefficient to cap mapping quality of  poorly  mapped
+                 -C INT  Coefficient  to  cap mapping quality of poorly mapped
                           reads. See the pileup command for details. [0]
  
-                 -r      Perform  probabilistic  realignment  to  compute BAQ,
-                         which will be used to cap base quality.
+                 -r      Compute the BQ tag without changing the base quality.
+
+
+       pileup    samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
+                 [-l in.site_list] [-C capMapQ] [-M maxMapQ]  [-T  theta]  [-N
+                 nHap]  [-r  pairDiffRate]  [-m  mask]  [-d maxIndelDepth] [-G
+                 indelPrior] <in.bam>|<in.sam>
+
+                 Print the alignment in the pileup format. In the pileup  for-
+                 mat,  each  line represents a genomic position, consisting of
+                 chromosome name, coordinate, reference base, read bases, read
+                 qualities  and  alignment  mapping  qualities. Information on
+                 match, mismatch, indel, strand, mapping quality and start and
+                 end  of  a  read  are all encoded at the read base column. At
+                 this column, a dot stands for a match to the  reference  base
+                 on  the  forward  strand,  a comma for a match on the reverse
+                 strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
+                 match on the forward strand and `acgtn' for a mismatch on the
+                 reverse strand. A pattern  `\+[0-9]+[ACGTNacgtn]+'  indicates
+                 there is an insertion between this reference position and the
+                 next reference position. The length of the insertion is given
+                 by  the  integer  in  the  pattern,  followed by the inserted
+                 sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'  repre-
+                 sents  a  deletion from the reference. The deleted bases will
+                 be presented as `*' in the following lines. Also at the  read
+                 base  column,  a  symbol  `^'  marks the start of a read. The
+                 ASCII of the character following `^' minus 33 gives the  map-
+                 ping quality. A symbol `$' marks the end of a read segment.
+
+                 If  option  -c  is  applied, the consensus base, Phred-scaled
+                 consensus quality, SNP quality (i.e. the Phred-scaled  proba-
+                 bility of the consensus being identical to the reference) and
+                 root mean square (RMS) mapping quality of the reads  covering
+                 the  site  will  be inserted between the `reference base' and
+                 the `read bases' columns. An  indel  occupies  an  additional
+                 line.  Each  indel  line consists of chromosome name, coordi-
+                 nate, a star, the genotype, consensus quality,  SNP  quality,
+                 RMS mapping quality, # covering reads, the first alllele, the
+                 second allele, # reads supporting the first allele,  #  reads
+                 supporting  the  second  allele and # reads containing indels
+                 different from the top two alleles.
+
+                 NOTE: Since 0.1.10, the `pileup'  command  is  deprecated  by
+                 `mpileup'.
+
+                 OPTIONS:
+
+                 -B        Disable  the  BAQ computation. See the mpileup com-
+                           mand for details.
+
+                 -c        Call the consensus sequence. Options -T, -N, -I and
+                           -r are only effective when -c or -g is in use.
+
+                 -C INT    Coefficient  for downgrading the mapping quality of
+                           poorly mapped reads. See the  mpileup  command  for
+                           details. [0]
+
+                 -d INT    Use  the  first  NUM  reads in the pileup for indel
+                           calling for speed up. Zero for unlimited. [1024]
+
+                 -f FILE   The reference sequence in the FASTA  format.  Index
+                           file FILE.fai will be created if absent.
+
+                 -g        Generate  genotype  likelihood  in the binary GLFv3
+                           format. This option suppresses -c, -i and -s.  This
+                           option is deprecated by the mpileup command.
+
+                 -i        Only output pileup lines containing indels.
+
+                 -I INT    Phred  probability  of an indel in sequencing/prep.
+                           [40]
+
+                 -l FILE   List of sites at which pileup is output. This  file
+                           is  space  delimited.  The  first  two  columns are
+                           required to be chromosome and  1-based  coordinate.
+                           Additional  columns  are ignored. It is recommended
+                           to use option
+
+                 -m INT    Filter reads  with  flag  containing  bits  in  INT
+                           [1796]
+
+                 -M INT    Cap mapping quality at INT [60]
+
+                 -N INT    Number of haplotypes in the sample (>=2) [2]
+
+                 -r FLOAT  Expected  fraction of differences between a pair of
+                           haplotypes [0.001]
+
+                 -s        Print the mapping quality as the last column.  This
+                           option  makes  the output easier to parse, although
+                           this format is not space efficient.
+
+                 -S        The input file is in SAM.
+
+                 -t FILE   List of reference names ane  sequence  lengths,  in
+                           the  format  described  for  the import command. If
+                           this option is present, samtools assumes the  input
+                           <in.alignment>  is  in  SAM  format;  otherwise  it
+                           assumes in BAM format.  -s together with -l  as  in
+                           the  default  format  we  may  not know the mapping
+                           quality.
+
+                 -T FLOAT  The theta parameter (error dependency  coefficient)
+                           in the maq consensus calling model [0.85]
  
  
  SAM FORMAT
-       SAM is TAB-delimited. Apart from the header lines,  which  are  started
+       SAM  is  TAB-delimited.  Apart from the header lines, which are started
         with the `@' symbol, each alignment line consists of:
  
  
@@ -441,51 +462,50 @@ EXAMPLES
  
         o Attach the RG tag while merging sorted alignments:
  
-           perl       -e       'print      "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
+           perl      -e       'print       "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
           mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
             samtools merge -rh rg.txt merged.bam ga.bam 454.bam
  
           The value in a RG tag is determined by the file name the read is com-
-         ing  from. In this example, in the merged.bam, reads from ga.bam will
-         be attached RG:Z:ga,  while  reads  from  454.bam  will  be  attached
+         ing from. In this example, in the merged.bam, reads from ga.bam  will
+         be  attached  RG:Z:ga,  while  reads  from  454.bam  will be attached
           RG:Z:454.
  
  
         o Call SNPs and short indels for one diploid individual:
  
-           samtools pileup -vcf ref.fa aln.bam > var.raw.plp
-           samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp
-           awk     '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)'     var.flt.plp    >
-         var.final.plp
+           samtools mpileup -ugf ref.fa aln.bam |  bcftools  view  -bvcg  -  >
+         var.raw.bcf
+           bcftools   view  var.raw.bcf  |  vcfutils.pl  varFilter  -D  100  >
+         var.flt.vcf
  
           The -D option of varFilter controls the  maximum  read  depth,  which
           should  be  adjusted  to about twice the average read depth.  One may
-         consider to add -C50 to pileup if mapping  quality  is  overestimated
+         consider to add -C50 to mpileup if mapping quality  is  overestimated
           for  reads containing excessive mismatches. Applying this option usu-
-         ally helps BWA-short but may not other mappers. It  also  potentially
-         increases reference biases.
+         ally helps BWA-short but may not other mappers.
  
  
-       o Call SNPs (not short indels) for multiple diploid individuals:
+       o Call SNPs and short indels for multiple diploid individuals:
  
-           samtools  mpileup  -augf  ref.fa  *.bam  |  bcftools  view -bcv - >
-         snp.raw.bcf
-           bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 |  bgzip
-         > snp.flt.vcf.gz
+           samtools mpileup -P ILLUMINA -ugf  ref.fa  *.bam  |  bcftools  view
+         -bcvg - > var.raw.bcf
+           bcftools  view  var.raw.bcf  |  vcfutils.pl  varFilter  -D  2000  >
+         var.flt.vcf
  
-         Individuals  are identified from the SM tags in the @RG header lines.
-         Individuals can be pooled in one alignment file; one  individual  can
-         also be separated into multiple files. Similarly, one may consider to
-         apply -C50 to mpileup.  SNP calling in this way also works for single
-         sample and has the advantage of enabling more powerful filtering. The
-         drawback is the lack of short indel calling, which may be implemented
-         in future.
+         Individuals are identified from the SM tags in the @RG header  lines.
+         Individuals  can  be pooled in one alignment file; one individual can
+         also be separated into multiple files. The -P option  specifies  that
+         indel  candidates  should be collected only from read groups with the
+         @RG-PL tag set to ILLUMINA.  Collecting indel candidates  from  reads
+         sequenced  by an indel-prone technology may affect the performance of
+         indel calling.
  
  
-       o Derive  the  allele  frequency spectrum (AFS) on a list of sites from
+       o Derive the allele frequency spectrum (AFS) on a list  of  sites  from
           multiple individuals:
  
-           samtools mpileup -gf ref.fa *.bam > all.bcf
+           samtools mpileup -Igf ref.fa *.bam > all.bcf
             bcftools view -bl sites.list all.bcf > sites.bcf
             bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
             bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
@@ -493,40 +513,41 @@ EXAMPLES
             ......
  
           where sites.list contains the list of sites with each line consisting
-         of  the  reference sequence name and position. The following bcftools
+         of the reference sequence name and position. The  following  bcftools
           commands estimate AFS by EM.
  
  
         o Dump BAQ applied alignment for other SNP callers:
  
-           samtools calmd -br aln.bam > aln.baq.bam
+           samtools calmd -bAr aln.bam > aln.baq.bam
  
-         It adds and corrects the NM and MD tags at the same time.  The  calmd
-         command  also  comes with the -C option, the same as the on in pileup
+         It  adds  and corrects the NM and MD tags at the same time. The calmd
+         command also comes with the -C option, the same as the one in  pileup
           and mpileup.  Apply if it helps.
  
  
  LIMITATIONS
-       o Unaligned  words  used  in  bam_import.c,  bam_endian.h,  bam.c   and
+       o Unaligned   words  used  in  bam_import.c,  bam_endian.h,  bam.c  and
           bam_aux.c.
  
-       o In  merging,  the input files are required to have the same number of
-         reference sequences. The requirement can  be  relaxed.  In  addition,
-         merging  does  not reconstruct the header dictionaries automatically.
-         Endusers have to provide the correct  header.  Picard  is  better  at
+       o In merging, the input files are required to have the same  number  of
+         reference  sequences.  The  requirement  can be relaxed. In addition,
+         merging does not reconstruct the header  dictionaries  automatically.
+         Endusers  have  to  provide  the  correct header. Picard is better at
           merging.
  
-       o Samtools  paired-end  rmdup  does  not  work for unpaired reads (e.g.
-         orphan reads or ends mapped to different chromosomes). If this  is  a
-         concern,  please  use  Picard's MarkDuplicate which correctly handles
+       o Samtools paired-end rmdup does not  work  for  unpaired  reads  (e.g.
+         orphan  reads  or ends mapped to different chromosomes). If this is a
+         concern, please use Picard's MarkDuplicate  which  correctly  handles
           these cases, although a little slower.
  
  
  AUTHOR
-       Heng Li from the Sanger Institute wrote the C version of samtools.  Bob
+       Heng  Li from the Sanger Institute wrote the C version of samtools. Bob
         Handsaker from the Broad Institute implemented the BGZF library and Jue
-       Ruan from Beijing Genomics Institute wrote the  RAZF  library.  Various
-       people in the 1000 Genomes Project contributed to the SAM format speci-
+       Ruan  from Beijing Genomics Institute wrote the RAZF library. John Mar-
+       shall and Petr Danecek contribute to the source code and various people
+       from the 1000 Genomes Project have contributed to the SAM format speci-
         fication.
  
  
@@ -535,4 +556,4 @@ SEE ALSO
  
  
  
-samtools-0.1.9                  27 October 2010                    samtools(1)
+samtools-0.1.10                15 November 2010                    samtools(1)