Added notes on how the package is tested.

diff --git a/samtools.txt b/samtools.txt

index 63e6a25ec72823931cda29bd2ffd9ddad745b2db..20e6c15cd707d11a7d1ed41f15b9d54732cd67ec 100644
--- a/samtools.txt
+++ b/samtools.txt
@@ -12,6 +12,8 @@ SYNOPSIS
  
         samtools index aln.sorted.bam
  
+       samtools idxstats aln.sorted.bam
+
         samtools view aln.sorted.bam chr2:20,100,000-20,200,000
  
         samtools merge out.bam in1.bam in2.bam in3.bam
@@ -20,6 +22,8 @@ SYNOPSIS
  
         samtools pileup -f ref.fasta aln.sorted.bam
  
+       samtools mpileup -f ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
+
         samtools tview aln.sorted.bam ref.fasta
  
  
@@ -43,68 +47,20 @@ DESCRIPTION
  
  
  COMMANDS AND OPTIONS
-       import    samtools import <in.ref_list> <in.sam> <out.bam>
-
-                 Since 0.1.4, this command is an alias of:
-
-                 samtools view -bt <in.ref_list> -o <out.bam> <in.sam>
-
-
-       sort      samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>
-
-                 Sort  alignments  by  leftmost  coordinates.  File  <out.pre-
-                 fix>.bam will be created. This command may also create tempo-
-                 rary files <out.prefix>.%d.bam when the whole alignment  can-
-                 not be fitted into memory (controlled by option -m).
-
-                 OPTIONS:
-
-                 -n      Sort by read names rather than by chromosomal coordi-
-                         nates
-
-                 -m INT  Approximately   the    maximum    required    memory.
-                         [500000000]
-
-
-       merge     samtools   merge   [-h   inh.sam]  [-n]  <out.bam>  <in1.bam>
-                 <in2.bam> [...]
-
-                 Merge multiple sorted alignments.  The header reference lists
-                 of  all  the input BAM files, and the @SQ headers of inh.sam,
-                 if  any,  must  all  refer  to  the  same  set  of  reference
-                 sequences.   The header reference list and (unless overridden
-                 by -h) `@' headers of in1.bam will be copied to out.bam,  and
-                 the headers of other files will be ignored.
-
-                 OPTIONS:
-
-                 -h FILE Use  the lines of FILE as `@' headers to be copied to
-                         out.bam, replacing any header lines that would other-
-                         wise  be  copied  from in1.bam.  (FILE is actually in
-                         SAM format, though any alignment records it may  con-
-                         tain are ignored.)
-
-                 -n      The  input alignments are sorted by read names rather
-                         than by chromosomal coordinates
-
-
-       index     samtools index <aln.bam>
-
-                 Index sorted alignment for fast  random  access.  Index  file
-                 <aln.bam>.bai will be created.
-
-
         view      samtools  view  [-bhuHS]  [-t  in.refList]  [-o  output]  [-f
-                 reqFlag] [-F skipFlag] [-q minMapQ] [-l  library]  [-r  read-
-                 Group] <in.bam>|<in.sam> [region1 [...]]
+                 reqFlag]  [-F  skipFlag]  [-q minMapQ] [-l library] [-r read-
+                 Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
  
-                 Extract/print  all or sub alignments in SAM or BAM format. If
-                 no region is specified, all the alignments will  be  printed;
-                 otherwise  only  alignments overlapping the specified regions
-                 will be output. An alignment may be given multiple  times  if
+                 Extract/print all or sub alignments in SAM or BAM format.  If
+                 no  region  is specified, all the alignments will be printed;
+                 otherwise only alignments overlapping the  specified  regions
+                 will  be  output. An alignment may be given multiple times if
                   it is overlapping several regions. A region can be presented,
-                 for example, in the following format: `chr2',  `chr2:1000000'
-                 or `chr2:1,000,000-2,000,000'. The coordinate is 1-based.
+                 for  example,  in  the  following  format:  `chr2' (the whole
+                 chr2), `chr2:1000000' (region starting from  1,000,000bp)  or
+                 `chr2:1,000,000-2,000,000'   (region  between  1,000,000  and
+                 2,000,000bp including the  end  points).  The  coordinate  is
+                 1-based.
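
The three region forms described above can be illustrated with a small parser. This is only a sketch in Python, not the parser samtools itself uses: the names `Region` and `parse_region` are hypothetical, and it assumes reference names contain no colon.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """A parsed region string (hypothetical helper type, not part of samtools)."""
    name: str
    start: Optional[int]  # 1-based, inclusive; None when no start was given
    end: Optional[int]    # 1-based, inclusive; None when no end was given


# Assumption: reference names do not contain ':'; commas in numbers are optional.
_REGION_RE = re.compile(
    r"^(?P<name>[^:]+)(?::(?P<start>[\d,]+)(?:-(?P<end>[\d,]+))?)?$"
)


def parse_region(text: str) -> Region:
    """Parse `chr2', `chr2:1000000' or `chr2:1,000,000-2,000,000'."""
    m = _REGION_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognised region: {text!r}")

    def to_int(s: Optional[str]) -> Optional[int]:
        # Strip the thousands separators before converting.
        return int(s.replace(",", "")) if s else None

    return Region(m.group("name"), to_int(m.group("start")), to_int(m.group("end")))
```

With this sketch, `parse_region("chr2:1,000,000-2,000,000")` gives a start of 1000000 and an end of 2000000, both end points included, while `parse_region("chr2")` leaves both coordinates unset to mean the whole sequence.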
  
                   OPTIONS:
  
@@ -143,16 +99,16 @@ COMMANDS AND OPTIONS
  
                   -r STR  Only output reads in read group STR [null]
  
+                 -R FILE Output reads in read groups listed in FILE [null]
  
-       faidx     samtools faidx <ref.fasta> [region1 [...]]
  
-                 Index reference sequence in the FASTA format or extract  sub-
-                 sequence  from  indexed  reference  sequence. If no region is
-                 specified,   faidx   will   index   the   file   and   create
-                 <ref.fasta>.fai  on the disk. If regions are speficified, the
-                 subsequences will be retrieved and printed to stdout  in  the
-                 FASTA  format.  The  input file can be compressed in the RAZF
-                 format.
+       tview     samtools tview <in.sorted.bam> [ref.fasta]
+
+                 Text alignment viewer (based on the ncurses library). In  the
+                 viewer, press `?' for help and press `g' to jump to a region
+                 given in a format like `chr10:10,000,000', or in  the  short
+                 form `=10,000,000' when  staying  on  the  same  reference
+                 sequence.
  
  
         pileup    samtools  pileup  [-f  in.ref.fasta]  [-t  in.ref_list]   [-l
@@ -182,10 +138,12 @@ COMMANDS AND OPTIONS
                   mapping quality. A symbol `$' marks the end of  a  read  seg-
                   ment.
  
-                 If  option -c is applied, the consensus base, consensus qual-
-                 ity, SNP quality and RMS mapping quality of the reads  cover-
-                 ing  the  site  will be inserted between the `reference base'
-                 and the `read bases' columns. An indel occupies an additional
+                 If  option  -c  is  applied, the consensus base, Phred-scaled
+                 consensus quality, SNP quality (i.e. the Phred-scaled  proba-
+                 bility of the consensus being identical to the reference) and
+                 root mean square (RMS) mapping quality of the reads  covering
+                 the  site  will  be inserted between the `reference base' and
+                 the `read bases' columns. An  indel  occupies  an  additional
                   line.  Each  indel  line consists of chromosome name, coordi-
                   nate, a star, the genotype, consensus quality,  SNP  quality,
                   RMS mapping quality, # covering reads, the first  allele, the
@@ -193,26 +151,28 @@ COMMANDS AND OPTIONS
                   supporting  the  second  allele and # reads containing indels
                   different from the top two alleles.
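
The qualities in these columns are Phred-scaled, that is, Q = -10 * log10(P). As a rough illustration (the function names below are hypothetical and not part of samtools), the conversion in both directions can be sketched as:

```python
import math


def phred_to_prob(q: float) -> float:
    """Convert a Phred-scaled quality Q into the probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def prob_to_phred(p: float) -> float:
    """Convert a probability P (0 < P <= 1) into Q = -10 * log10(P)."""
    return -10 * math.log10(p)
```

Under the definition given above, a SNP quality of 30 therefore corresponds to a probability of about 0.001 that the consensus is identical to the reference.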
  
-                 OPTIONS:
+                 The position of indels is offset by -1.
  
+                 OPTIONS:
  
                   -s        Print the mapping quality as the last column.  This
                             option  makes  the output easier to parse, although
                             this format is not space efficient.
  
-
                   -S        The input file is in SAM.
  
-
                   -i        Only output pileup lines containing indels.
  
-
                   -f FILE   The reference sequence in the FASTA  format.  Index
                             file FILE.fai will be created if absent.
  
-
                   -M INT    Cap mapping quality at INT [60]
  
+                 -m INT    Filter  reads  with  flag  containing  bits  in INT
+                           [1796]
+
+                 -d INT    Use only the first INT reads in the pileup for indel
+                           calling, for speed. Zero means unlimited. [0]
  
                   -t FILE   List  of  reference  names and sequence lengths, in
                             the format described for  the  import  command.  If
@@ -220,7 +180,6 @@ COMMANDS AND OPTIONS
                             <in.alignment>  is  in  SAM  format;  otherwise  it
                             assumes in BAM format.
  
-
                   -l FILE   List  of sites at which pileup is output. This file
                             is space  delimited.  The  first  two  columns  are
                             required  to  be chromosome and 1-based coordinate.
@@ -228,40 +187,110 @@ COMMANDS AND OPTIONS
                             to use option -s together with -l as in the default
                             format we may not know the mapping quality.
  
-
-                 -c        Call the consensus  sequence  using  MAQ  consensus
+                 -c        Call the consensus sequence using SOAPsnp consensus
                             model. Options -T, -N, -I and -r are only effective
                             when -c or -g is in use.
  
-
                   -g        Generate genotype likelihood in  the  binary  GLFv3
                             format. This option suppresses -c, -i and -s.
  
-
                   -T FLOAT  The  theta parameter (error dependency coefficient)
                             in the maq consensus calling model [0.85]
  
-
                   -N INT    Number of haplotypes in the sample (>=2) [2]
  
-
                   -r FLOAT  Expected fraction of differences between a pair  of
                             haplotypes [0.001]
  
-
                   -I INT    Phred  probability  of an indel in sequencing/prep.
                             [40]
  
  
+       mpileup   samtools mpileup [-r reg] [-f in.fa] in.bam [in2.bam [...]]
  
-       tview     samtools tview <in.sorted.bam> [ref.fasta]
+                 Generate pileup for multiple BAM files. Consensus calling  is
+                 not implemented.
  
-                 Text alignment viewer (based on the ncurses library). In  the
-                 viewer,  press `?' for help and press `g' to check the align-
-                 ment   start   from   a   region   in   the    format    like
-                 `chr10:10,000,000'.
+                 OPTIONS:
+
+                 -r STR  Only generate pileup in region STR [all sites]
+
+                 -f FILE The reference file [null]
  
  
+       reheader  samtools reheader <in.header.sam> <in.bam>
+
+                 Replace   the   header   in   in.bam   with   the  header  in
+                 in.header.sam.  This command is much  faster  than  replacing
+                 the header with a BAM->SAM->BAM conversion.
+
+
+       sort      samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>
+
+                 Sort  alignments  by  leftmost  coordinates.  File  <out.pre-
+                 fix>.bam will be created. This command may also create tempo-
+                 rary  files <out.prefix>.%d.bam when the whole alignment can-
+                 not be fitted into memory (controlled by option -m).
+
+                 OPTIONS:
+
+                 -o      Output the final alignment to the standard output.
+
+                 -n      Sort by read names rather than by chromosomal coordi-
+                         nates
+
+                 -m INT  Approximately    the    maximum    required   memory.
+                         [500000000]
+
+
+       merge     samtools  merge  [-h  inh.sam]  [-nr]   <out.bam>   <in1.bam>
+                 <in2.bam> [...]
+
+                 Merge multiple sorted alignments.  The header reference lists
+                 of all the input BAM files, and the @SQ headers  of  inh.sam,
+                 if  any,  must  all  refer  to  the  same  set  of  reference
+                 sequences.  The header reference list and (unless  overridden
+                 by  -h) `@' headers of in1.bam will be copied to out.bam, and
+                 the headers of other files will be ignored.
+
+                 OPTIONS:
+
+                 -h FILE Use the lines of FILE as `@' headers to be copied  to
+                         out.bam, replacing any header lines that would other-
+                         wise be copied from in1.bam.  (FILE  is  actually  in
+                         SAM  format, though any alignment records it may con-
+                         tain are ignored.)
+
+                 -r      Attach an RG tag to each alignment. The tag value  is
+                         inferred from file names.
+
+                 -n      The  input alignments are sorted by read names rather
+                         than by chromosomal coordinates
+
+
+       index     samtools index <aln.bam>
+
+                 Index sorted alignment for fast  random  access.  Index  file
+                 <aln.bam>.bai will be created.
+
+
+       idxstats  samtools idxstats <aln.bam>
+
+                 Retrieve and print stats in the index file. The output is TAB
+                 delimited with each line  consisting  of  reference  sequence
+                 name, sequence length, # mapped reads and # unmapped reads.
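
Because the idxstats output is plain TAB-delimited text with four fields per line, it is easy to post-process. The following Python sketch assumes exactly the four columns listed above; `IdxStat`, `parse_idxstats` and the sample numbers are made up for illustration.

```python
from typing import List, NamedTuple


class IdxStat(NamedTuple):
    """One idxstats line (hypothetical helper type, not part of samtools)."""
    name: str
    length: int
    mapped: int
    unmapped: int


def parse_idxstats(text: str) -> List[IdxStat]:
    """Parse TAB-delimited lines of: name, length, # mapped, # unmapped."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length, mapped, unmapped = line.split("\t")
        rows.append(IdxStat(name, int(length), int(mapped), int(unmapped)))
    return rows


# Made-up sample data, not real samtools output.
SAMPLE = "chr1\t1000\t40\t2\nchr2\t500\t10\t0\n"
```

For example, `sum(r.mapped for r in parse_idxstats(SAMPLE))` adds up the mapped-read counts over all reference sequences in the sample.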
+
+
+       faidx     samtools faidx <ref.fasta> [region1 [...]]
+
+                 Index  reference sequence in the FASTA format or extract sub-
+                 sequence from indexed reference sequence.  If  no  region  is
+                 specified,   faidx   will   index   the   file   and   create
+                 <ref.fasta>.fai on the disk. If regions are  specified,  the
+                 subsequences  will  be retrieved and printed to stdout in the
+                 FASTA format. The input file can be compressed  in  the  RAZF
+                 format.
+
  
         fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>
  
@@ -269,39 +298,44 @@ COMMANDS AND OPTIONS
                   name-sorted alignment.
  
  
-       rmdup     samtools rmdup <input.srt.bam> <out.bam>
+       rmdup     samtools rmdup [-sS] <input.srt.bam> <out.bam>
  
                   Remove potential PCR duplicates: if multiple read pairs  have
                   identical  external  coordinates,  only  retain the pair with
-                 highest mapping quality.  This command  ONLY  works  with  FR
-                 orientation and requires ISIZE is correctly set.
-
+                 highest mapping quality.  In the paired-end mode,  this  com-
+                 mand ONLY works with FR orientation and requires that ISIZE
+                 be correctly set. It does not work for unpaired reads (e.g.
+                 two ends mapped to different chromosomes or orphan reads).
  
+                 OPTIONS:
  
-       rmdupse   samtools rmdupse <input.srt.bam> <out.bam>
-
-                 Remove potential duplicates for single-ended reads. This com-
-                 mand will treat all reads as single-ended even  if  they  are
-                 paired in fact.
+                 -s      Remove duplicates for single-end reads.  By  default,
+                         the command works for paired-end reads only.
  
+                 -S      Treat paired-end reads as single-end reads.
  
  
-       fillmd    samtools fillmd [-e] <aln.bam> <ref.fasta>
+       calmd     samtools calmd [-eubS] <aln.bam> <ref.fasta>
  
-                 Generate  the  MD tag. If the MD tag is already present, this
-                 command will give a warning if the MD tag generated  is  dif-
-                 ferent from the existing tag.
+                 Generate the MD tag. If the MD tag is already  present,  this
+                 command  will  give a warning if the MD tag generated is dif-
+                 ferent from the existing tag. Output SAM by default.
  
                   OPTIONS:
  
-                 -e      Convert  a  the  read base to = if it is identical to
-                         the aligned reference base.  Indel  caller  does  not
+                 -e      Convert  the  read base to = if it  is  identical  to
+                         the  aligned  reference  base.  Indel caller does not
                           support the = bases at the moment.
  
+                 -u      Output uncompressed BAM
+
+                 -b      Output compressed BAM
+
+                 -S      The input is SAM with header lines
  
  
  SAM FORMAT
-       SAM  is  TAB-delimited.  Apart from the header lines, which are started
+       SAM is TAB-delimited. Apart from the header lines,  which  are  started
         with the `@' symbol, each alignment line consists of:
  
  
@@ -325,43 +359,43 @@ SAM FORMAT
         Each bit in the FLAG field is defined as:
  
  
-             +-------+--------------------------------------------------+
-             | Flag  |                   Description                    |
-             +-------+--------------------------------------------------+
-             |0x0001 | the read is paired in sequencing                 |
-             |0x0002 | the read is mapped in a proper pair              |
-             |0x0004 | the query sequence itself is unmapped            |
-             |0x0008 | the mate is unmapped                             |
-             |0x0010 | strand of the query (1 for reverse)              |
-             |0x0020 | strand of the mate                               |
-             |0x0040 | the read is the first read in a pair             |
-             |0x0080 | the read is the second read in a pair            |
-             |0x0100 | the alignment is not primary                     |
-             |0x0200 | the read fails platform/vendor quality checks    |
-             |0x0400 | the read is either a PCR or an optical duplicate |
-             +-------+--------------------------------------------------+
+          +-------+-----+--------------------------------------------------+
+          | Flag  | Chr |                   Description                    |
+          +-------+-----+--------------------------------------------------+
+          |0x0001 |  p  | the read is paired in sequencing                 |
+          |0x0002 |  P  | the read is mapped in a proper pair              |
+          |0x0004 |  u  | the query sequence itself is unmapped            |
+          |0x0008 |  U  | the mate is unmapped                             |
+          |0x0010 |  r  | strand of the query (1 for reverse)              |
+          |0x0020 |  R  | strand of the mate                               |
+          |0x0040 |  1  | the read is the first read in a pair             |
+          |0x0080 |  2  | the read is the second read in a pair            |
+          |0x0100 |  s  | the alignment is not primary                     |
+          |0x0200 |  f  | the read fails platform/vendor quality checks    |
+          |0x0400 |  d  | the read is either a PCR or an optical duplicate |
+          +-------+-----+--------------------------------------------------+
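
The table above maps each FLAG bit to a single character. As a minimal sketch (the helper `flag_to_chars` is hypothetical, and emitting the characters in table order is an assumption of this example rather than documented samtools behaviour), a numeric FLAG can be decoded as follows:

```python
# (bit, character) pairs copied from the FLAG table above.
FLAG_CHARS = [
    (0x0001, "p"),  # the read is paired in sequencing
    (0x0002, "P"),  # the read is mapped in a proper pair
    (0x0004, "u"),  # the query sequence itself is unmapped
    (0x0008, "U"),  # the mate is unmapped
    (0x0010, "r"),  # strand of the query (1 for reverse)
    (0x0020, "R"),  # strand of the mate
    (0x0040, "1"),  # the read is the first read in a pair
    (0x0080, "2"),  # the read is the second read in a pair
    (0x0100, "s"),  # the alignment is not primary
    (0x0200, "f"),  # the read fails platform/vendor quality checks
    (0x0400, "d"),  # the read is either a PCR or an optical duplicate
]


def flag_to_chars(flag: int) -> str:
    """Return the characters of all bits set in flag, in table order."""
    return "".join(ch for bit, ch in FLAG_CHARS if flag & bit)
```

For instance, the default pileup filter mask 1796 equals 0x0400 + 0x0200 + 0x0100 + 0x0004, so `flag_to_chars(1796)` yields `usfd`: unmapped reads, non-primary alignments, reads failing quality checks and duplicates.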
  
  LIMITATIONS
-       o Unaligned  words  used  in  bam_import.c,  bam_endian.h,  bam.c   and
+       o Unaligned   words  used  in  bam_import.c,  bam_endian.h,  bam.c  and
           bam_aux.c.
  
-       o CIGAR operation P is not properly handled at the moment.
-
-       o In  merging,  the input files are required to have the same number of
-         reference sequences. The requirement can  be  relaxed.  In  addition,
-         merging  does  not reconstruct the header dictionaries automatically.
-         Endusers have to provide the correct  header.  Picard  is  better  at
+       o In merging, the input files are required to have the same  number  of
+         reference  sequences.  The  requirement  can be relaxed. In addition,
+         merging does not reconstruct the header  dictionaries  automatically.
+         Endusers  have  to  provide  the  correct header. Picard is better at
           merging.
  
-       o Samtools' rmdup does not work for single-end data and does not remove
-         duplicates across chromosomes. Picard is better.
+       o Samtools paired-end rmdup does not  work  for  unpaired  reads  (e.g.
+         orphan  reads  or ends mapped to different chromosomes). If this is a
+         concern, please use Picard's MarkDuplicates, which handles these
+         cases correctly, although it is a little slower.
  
  
  AUTHOR
-       Heng Li from the Sanger Institute wrote the C version of samtools.  Bob
+       Heng  Li from the Sanger Institute wrote the C version of samtools. Bob
         Handsaker from the Broad Institute implemented the BGZF library and Jue
-       Ruan from Beijing Genomics Institute wrote the  RAZF  library.  Various
-       people  in the 1000Genomes Project contributed to the SAM format speci-
+       Ruan  from  Beijing  Genomics Institute wrote the RAZF library. Various
+       people in the 1000 Genomes Project contributed to the SAM format speci-
         fication.
  
  
@@ -370,4 +404,4 @@ SEE ALSO
  
  
  
-samtools-0.1.6                 2 September 2009                    samtools(1)
+samtools-0.1.8                   11 July 2010                      samtools(1)