   1 samtools(1)                  Bioinformatics tools                  samtools(1)
   2
   3
   4
   5 NAME
   6        samtools - Utilities for the Sequence Alignment/Map (SAM) format
   7
   8 SYNOPSIS
   9        samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
  10
  11        samtools sort aln.bam aln.sorted
  12
  13        samtools index aln.sorted.bam
  14
  15        samtools view aln.sorted.bam chr2:20,100,000-20,200,000
  16
  17        samtools merge out.bam in1.bam in2.bam in3.bam
  18
  19        samtools faidx ref.fasta
  20
  21        samtools pileup -f ref.fasta aln.sorted.bam
  22
  23        samtools tview aln.sorted.bam ref.fasta
  24
  25
  26 DESCRIPTION
  27        Samtools  is  a  set of utilities that manipulate alignments in the BAM
  28        format. It imports from and exports to the SAM (Sequence Alignment/Map)
  29        format, does sorting, merging and indexing, and allows reads in any
  30        region to be retrieved swiftly.
  31
  32        Samtools is designed to work on a stream. It regards an input file  `-'
  33        as  the  standard  input (stdin) and an output file `-' as the standard
  34        output (stdout). Several commands can thus be combined with Unix pipes.
  35        Samtools always outputs warning and error messages to the standard error
  36        output (stderr).
  37
  38        Samtools is also able to open a BAM (not SAM) file on a remote  FTP  or
  39        HTTP  server  if  the  BAM file name starts with `ftp://' or `http://'.
  40        Samtools checks the current working directory for the index file and
  41        will download the index if it is absent. Samtools does not retrieve the
  42        entire alignment file unless it is asked to do so.
  43
  44
  45 COMMANDS AND OPTIONS
  46        import    samtools import <in.ref_list> <in.sam> <out.bam>
  47
  48                  Since 0.1.4, this command is an alias of:
  49
  50                  samtools view -bt <in.ref_list> -o <out.bam> <in.sam>
  51
  52
  53        sort      samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>
  54
  55                  Sort alignments by leftmost coordinates. File <out.prefix>.bam
  56                  will be created. This command may also create temporary files
  57                  <out.prefix>.%d.bam when the whole alignment cannot fit into
  58                  memory (controlled by option -m).
  59
  60                  OPTIONS:
  61
  62                  -n      Sort by read names rather than by chromosomal coordi-
  63                          nates
  64
  65                  -m INT  Approximately   the    maximum    required    memory.
  66                          [500000000]
  67
  68
  69        merge     samtools   merge   [-h   inh.sam]  [-n]  <out.bam>  <in1.bam>
  70                  <in2.bam> [...]
  71
  72                  Merge multiple sorted alignments.  The header reference lists
  73                  of  all  the input BAM files, and the @SQ headers of inh.sam,
  74                  if  any,  must  all  refer  to  the  same  set  of  reference
  75                  sequences.   The header reference list and (unless overridden
  76                  by -h) `@' headers of in1.bam will be copied to out.bam,  and
  77                  the headers of other files will be ignored.
  78
  79                  OPTIONS:
  80
  81                  -h FILE Use  the lines of FILE as `@' headers to be copied to
  82                          out.bam, replacing any header lines that would other-
  83                          wise  be  copied  from in1.bam.  (FILE is actually in
  84                          SAM format, though any alignment records it may  con-
  85                          tain are ignored.)
  86
  87                  -n      The  input alignments are sorted by read names rather
  88                          than by chromosomal coordinates
  89
  90
  91        index     samtools index <aln.bam>
  92
  93                  Index sorted alignment for fast  random  access.  Index  file
  94                  <aln.bam>.bai will be created.
  95
  96
  97        view      samtools  view  [-bhuHS]  [-t  in.refList]  [-o  output]  [-f
  98                  reqFlag] [-F skipFlag] [-q minMapQ] [-l  library]  [-r  read-
  99                  Group] <in.bam>|<in.sam> [region1 [...]]
 100
 101                  Extract/print all or a subset of the alignments in SAM or BAM
 102                  format. If no region is specified, all the alignments will be
 103                  printed; otherwise only alignments overlapping the specified
 104                  regions will be output. An alignment may be output multiple
 105                  times if it overlaps several regions. A region can be given,
 106                  for example, in one of these formats: `chr2', `chr2:1000000'
 107                  or `chr2:1,000,000-2,000,000'. Coordinates are 1-based.
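
The region strings described above could be parsed along the following
lines. This is a minimal Python sketch, not samtools' own parser:
`parse_region` is a hypothetical helper, and it assumes that reference
names contain no colon.

```python
import re

# name, optionally followed by :start and then -end; commas are allowed in numbers
_REGION_RE = re.compile(r"^([^:]+)(?::([\d,]+)(?:-([\d,]+))?)?$")


def _to_int(text):
    """Strip thousands separators; return None for an omitted coordinate."""
    return int(text.replace(",", "")) if text else None


def parse_region(region):
    """Return (name, start, end). Coordinates are 1-based; None if omitted."""
    m = _REGION_RE.match(region)
    if m is None:
        raise ValueError("unrecognised region: %r" % region)
    name, start, end = m.groups()
    return name, _to_int(start), _to_int(end)


if __name__ == "__main__":
    for text in ("chr2", "chr2:1000000", "chr2:1,000,000-2,000,000"):
        print(text, "->", parse_region(text))
```

The sketch only splits the string into its parts; how an omitted start
or end is interpreted is left to the caller.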
 108
 109                  OPTIONS:
 110
 111                  -b      Output in the BAM format.
 112
 113                  -u      Output uncompressed BAM. This option saves time spent
 114                          on compression/decompression and is thus preferred
 115                          when the output is piped to another samtools command.
 116
 117                  -h      Include the header in the output.
 118
 119                  -H      Output the header only.
 120
 121                  -S      Input is in SAM. If @SQ header lines are absent,  the
 122                          `-t' option is required.
 123
 124                  -t FILE This  file  is  TAB-delimited. Each line must contain
 125                          the reference name and the length of  the  reference,
 126                          one  line  for  each  distinct  reference; additional
 127                          fields are ignored. This file also defines the  order
 128                          of  the  reference  sequences  in sorting. If you run
 129                          `samtools faidx <ref.fa>', the resultant  index  file
 130                          <ref.fa>.fai  can be used as this <in.ref_list> file.
 131
 132                  -o FILE Output file [stdout]
 133
 134                  -f INT  Only output alignments with all bits in  INT  present
 135                          in the FLAG field. INT can be in hex in the format of
 136                          /^0x[0-9A-F]+/ [0]
 137
 138                  -F INT  Skip alignments with bits present in INT [0]
 139
 140                  -q INT  Skip alignments with MAPQ smaller than INT [0]
 141
 142                  -l STR  Only output reads in library STR [null]
 143
 144                  -r STR  Only output reads in read group STR [null]
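
The -f, -F and -q filters above can be pictured as a predicate applied
to each alignment. The sketch below is illustrative Python, not samtools
code: `parse_flag` and `keep_alignment` are hypothetical names, and it
assumes that -F skips a record when ANY of the given bits is set.

```python
def parse_flag(text):
    """Accept a decimal integer or a 0x-prefixed hexadecimal value."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    return int(text)


def keep_alignment(flag, mapq, required=0, skipped=0, min_mapq=0):
    """Mimic the -f / -F / -q filters for one alignment."""
    # -f: every bit in `required` must be present in FLAG.
    if (flag & required) != required:
        return False
    # -F: assumption, reject if any bit in `skipped` is present in FLAG.
    if flag & skipped:
        return False
    # -q: skip alignments with MAPQ smaller than the threshold.
    return mapq >= min_mapq


if __name__ == "__main__":
    # FLAG 99 = 0x63: paired, proper pair, mate reverse, first in pair.
    print(keep_alignment(99, 60, required=parse_flag("0x3")))
    print(keep_alignment(99, 10, min_mapq=20))
```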
 145
 146
 147        faidx     samtools faidx <ref.fasta> [region1 [...]]
 148
 149                  Index a reference sequence in the FASTA format or extract
 150                  subsequences from an indexed reference sequence. If no region
 151                  is specified, faidx will index the file and create
 152                  <ref.fasta>.fai on disk. If regions are specified, the
 153                  subsequences will be retrieved and printed to stdout in the
 154                  FASTA format. The input file can be compressed in the RAZF
 155                  format.
 156
 157
 158        pileup    samtools  pileup  [-f  in.ref.fasta]  [-t  in.ref_list]   [-l
 159                  in.site_list]    [-iscgS2]   [-T   theta]   [-N   nHap]   [-r
 160                  pairDiffRate] <in.bam>|<in.sam>
 161
 162                  Print the alignment in the pileup format. In the pileup  for-
 163                  mat,  each  line represents a genomic position, consisting of
 164                  chromosome name, coordinate, reference base, read bases, read
 165                  qualities and alignment mapping qualities. Information on
 166                  match, mismatch, indel, strand, mapping quality and the start
 167                  and end of a read is all encoded in the read base column. In
 168                  this column, a dot stands for a match to the reference base
 169                  on  the  forward  strand,  a comma for a match on the reverse
 170                  strand, `ACGTN' for a mismatch  on  the  forward  strand  and
 171                  `acgtn'  for  a  mismatch  on  the  reverse strand. A pattern
 172                  `\+[0-9]+[ACGTNacgtn]+'  indicates  there  is  an   insertion
 173                  between  this reference position and the next reference posi-
 174                  tion. The length of the insertion is given by the integer  in
 175                  the  pattern, followed by the inserted sequence. Similarly, a
 176                  pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the
 177                  reference.  The deleted bases will be presented as `*' in the
 178                  following lines. Also at the read base column, a  symbol  `^'
 179                  marks  the start of a read segment which is a contiguous sub-
 180                  sequence on the read separated by `N/S/H'  CIGAR  operations.
 181                  The  ASCII  of the character following `^' minus 33 gives the
 182                  mapping quality. A symbol `$' marks the end of  a  read  seg-
 183                  ment.
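
The encoding of the read base column described above can be sketched as
a small tokenizer. This is illustrative Python under stated assumptions,
not the samtools implementation: `tokenize_read_bases` and the token
names are hypothetical. Indel sequences are consumed by their declared
length rather than greedily, because the patterns given above would
otherwise swallow a mismatch base that directly follows an indel.

```python
import re

_DIGITS = re.compile(r"\d+")


def tokenize_read_bases(column):
    """Split a pileup read base column into (kind, value) tokens."""
    tokens = []
    i = 0
    while i < len(column):
        c = column[i]
        if c == "^":
            # Start of a read segment; the next character encodes MAPQ + 33.
            if i + 1 >= len(column):
                raise ValueError("'^' without a mapping quality character")
            tokens.append(("start", ord(column[i + 1]) - 33))
            i += 2
        elif c == "$":
            tokens.append(("end", None))
            i += 1
        elif c in "+-":
            # Indel: sign, length, then exactly `length` bases.
            m = _DIGITS.match(column, i + 1)
            if m is None:
                raise ValueError("indel without a length at offset %d" % i)
            length = int(m.group())
            seq = column[m.end():m.end() + length]
            tokens.append(("insertion" if c == "+" else "deletion", seq))
            i = m.end() + length
        elif c in ".,":
            # Match on the forward ('.') or reverse (',') strand.
            tokens.append(("match", "+" if c == "." else "-"))
            i += 1
        elif c in "ACGTNacgtn":
            # Mismatch; upper case is forward strand, lower case reverse.
            tokens.append(("mismatch", c))
            i += 1
        elif c == "*":
            # Base deleted from the reference by an earlier deletion.
            tokens.append(("deleted", None))
            i += 1
        else:
            raise ValueError("unexpected character %r at offset %d" % (c, i))
    return tokens


if __name__ == "__main__":
    for token in tokenize_read_bases("^I.,$+2AGt*"):
        print(token)
```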
 184
 185                  If  option -c is applied, the consensus base, consensus qual-
 186                  ity, SNP quality and RMS mapping quality of the reads  cover-
 187                  ing  the  site  will be inserted between the `reference base'
 188                  and the `read bases' columns. An indel occupies an additional
 189                  line.  Each  indel  line consists of chromosome name, coordi-
 190                  nate, a star, the genotype, consensus quality,  SNP  quality,
 191                  RMS mapping quality, # covering reads, the first allele, the
 192                  second allele, # reads supporting the first allele,  #  reads
 193                  supporting  the  second  allele and # reads containing indels
 194                  different from the top two alleles.
 195
 196                  OPTIONS:
 197
 198
 199                  -s        Print the mapping quality as the last column.  This
 200                            option  makes  the output easier to parse, although
 201                            this format is not space efficient.
 202
 203
 204                  -S        The input file is in SAM.
 205
 206
 207                  -i        Only output pileup lines containing indels.
 208
 209
 210                  -f FILE   The reference sequence in the FASTA  format.  Index
 211                            file FILE.fai will be created if absent.
 212
 213
 214                  -M INT    Cap mapping quality at INT [60]
 215
 216
 217                  -t FILE   List of reference names and sequence lengths, in
 218                            the format described for the import command. If
 219                            this option is present, samtools assumes the input
 220                            <in.alignment> is in SAM format; otherwise it
 221                            assumes BAM format.
 222
 223
 224                  -l FILE   List  of sites at which pileup is output. This file
 225                            is space  delimited.  The  first  two  columns  are
 226                            required  to  be chromosome and 1-based coordinate.
 227                            Additional columns are ignored. It  is  recommended
 228                            to use option -s together with -l as in the default
 229                            format we may not know the mapping quality.
 230
 231
 232                  -c        Call the consensus sequence using the MAQ consensus
 233                            model. Options -T, -N, -I and -r are only effective
 234                            when -c or -g is in use.
 235
 236
 237                  -g        Generate genotype likelihood in  the  binary  GLFv3
 238                            format. This option suppresses -c, -i and -s.
 239
 240
 241                  -T FLOAT  The  theta parameter (error dependency coefficient)
 242                            in the MAQ consensus calling model [0.85]
 243
 244
 245                  -N INT    Number of haplotypes in the sample (>=2) [2]
 246
 247
 248                  -r FLOAT  Expected fraction of differences between a pair  of
 249                            haplotypes [0.001]
 250
 251
 252                  -I INT    Phred  probability  of an indel in sequencing/prep.
 253                            [40]
 254
 255
 256
 257        tview     samtools tview <in.sorted.bam> [ref.fasta]
 258
 259                  Text alignment viewer (based on the ncurses library). In the
 260                  viewer, press `?' for help and press `g' to view the alignment
 261                  starting from a region given in a format such as
 262                  `chr10:10,000,000'.
 263
 264
 265
 266        fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>
 267
 268                  Fill in mate coordinates, ISIZE and mate related flags from a
 269                  name-sorted alignment.
 270
 271
 272        rmdup     samtools rmdup <input.srt.bam> <out.bam>
 273
 274                  Remove potential PCR duplicates: if multiple read pairs have
 275                  identical external coordinates, only retain the pair with the
 276                  highest mapping quality. This command ONLY works with FR
 277                  orientation and requires that ISIZE be correctly set.
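
The rule stated above (keep only the best pair among those sharing the
same external coordinates) can be sketched as follows. This is a
simplified Python illustration, not how samtools implements rmdup: the
dictionary keys are hypothetical, "external coordinates" is assumed to
mean the outer start and end of the pair on one reference sequence, and
ties keep the pair seen first.

```python
def dedup_pairs(pairs):
    """Keep, for each (chrom, start, end), the pair with the highest MAPQ."""
    best = {}
    for pair in pairs:
        key = (pair["chrom"], pair["start"], pair["end"])
        kept = best.get(key)
        # Strict '>' means the first pair seen wins a tie.
        if kept is None or pair["mapq"] > kept["mapq"]:
            best[key] = pair
    return list(best.values())


if __name__ == "__main__":
    pairs = [
        {"name": "a", "chrom": "chr2", "start": 100, "end": 350, "mapq": 30},
        {"name": "b", "chrom": "chr2", "start": 100, "end": 350, "mapq": 60},
        {"name": "c", "chrom": "chr2", "start": 120, "end": 350, "mapq": 20},
    ]
    print([p["name"] for p in dedup_pairs(pairs)])
```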
 278
 279
 280
 281        rmdupse   samtools rmdupse <input.srt.bam> <out.bam>
 282
 283                  Remove potential duplicates for single-ended reads. This
 284                  command will treat all reads as single-ended even if they are
 285                  in fact paired.
 286
 287
 288
 289        fillmd    samtools fillmd [-e] <aln.bam> <ref.fasta>
 290
 291                  Generate  the  MD tag. If the MD tag is already present, this
 292                  command will give a warning if the MD tag generated  is  dif-
 293                  ferent from the existing tag.
 294
 295                  OPTIONS:
 296
 297                  -e      Convert a read base to `=' if it is identical to
 298                          the aligned reference base. The indel caller does not
 299                          support `=' bases at the moment.
 300
 301
 302
 303 SAM FORMAT
 304        SAM is TAB-delimited. Apart from the header lines, which start with the
 305        `@' symbol, each alignment line consists of:
 306
 307
 308        +----+-------+----------------------------------------------------------+
 309        |Col | Field |                       Description                        |
 310        +----+-------+----------------------------------------------------------+
 311        | 1  | QNAME | Query (pair) NAME                                        |
 312        | 2  | FLAG  | bitwise FLAG                                             |
 313        | 3  | RNAME | Reference sequence NAME                                  |
 314        | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
 315        | 5  | MAPQ  | MAPping Quality (Phred-scaled)                           |
 316        | 6  | CIGAR | extended CIGAR string                                    |
 317        | 7  | MRNM  | Mate Reference sequence NaMe (`=' if same as RNAME)      |
 318        | 8  | MPOS  | 1-based Mate POSition                                    |
 319        | 9  | ISIZE | Inferred insert SIZE                                     |
 320        |10  | SEQ   | query SEQuence on the same strand as the reference       |
 321        |11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
 322        |12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |
 323        +----+-------+----------------------------------------------------------+
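
As an illustration of the column layout above, the following Python
sketch splits one alignment line into named fields. It is not part of
samtools: `parse_alignment` and `phred_qualities` are hypothetical
helpers, the sample line is made up, and treating a QUAL of `*' as
"no qualities" follows the SAM specification rather than this page.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_alignment(line):
    """Split a TAB-delimited alignment line into a dict keyed by field name."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 columns, got %d" % len(cols))
    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    # Remaining columns are optional fields in TAG:VTYPE:VALUE form.
    record["OPT"] = [tuple(field.split(":", 2)) for field in cols[11:]]
    return record


def phred_qualities(qual):
    """ASCII code minus 33 gives the Phred base quality."""
    if qual == "*":  # assumption taken from the SAM specification
        return []
    return [ord(ch) - 33 for ch in qual]


if __name__ == "__main__":
    line = "r1\t99\tchr2\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tNM:i:0\n"
    rec = parse_alignment(line)
    print(rec["QNAME"], rec["FLAG"], rec["POS"], rec["OPT"])
    print(phred_qualities(rec["QUAL"]))
```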
 324
 325        Each bit in the FLAG field is defined as:
 326
 327
 328              +-------+--------------------------------------------------+
 329              | Flag  |                   Description                    |
 330              +-------+--------------------------------------------------+
 331              |0x0001 | the read is paired in sequencing                 |
 332              |0x0002 | the read is mapped in a proper pair              |
 333              |0x0004 | the query sequence itself is unmapped            |
 334              |0x0008 | the mate is unmapped                             |
 335              |0x0010 | strand of the query (1 for reverse)              |
 336              |0x0020 | strand of the mate                               |
 337              |0x0040 | the read is the first read in a pair             |
 338              |0x0080 | the read is the second read in a pair            |
 339              |0x0100 | the alignment is not primary                     |
 340              |0x0200 | the read fails platform/vendor quality checks    |
 341              |0x0400 | the read is either a PCR or an optical duplicate |
 342              +-------+--------------------------------------------------+
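
The bit definitions in the table above can be turned into a small
decoder. This is a Python sketch for illustration only: `decode_flag`
and the short labels are hypothetical, and the label for 0x0020 assumes
the mate strand bit uses the same convention as 0x0010 (1 for reverse).

```python
FLAG_BITS = (
    (0x0001, "paired"),
    (0x0002, "proper_pair"),
    (0x0004, "query_unmapped"),
    (0x0008, "mate_unmapped"),
    (0x0010, "query_reverse"),
    (0x0020, "mate_reverse"),   # assumed: 1 for reverse, as for 0x0010
    (0x0040, "first_in_pair"),
    (0x0080, "second_in_pair"),
    (0x0100, "not_primary"),
    (0x0200, "qc_fail"),
    (0x0400, "duplicate"),
)


def decode_flag(flag):
    """Return the labels of all bits set in a FLAG value, lowest bit first."""
    return [label for bit, label in FLAG_BITS if flag & bit]


if __name__ == "__main__":
    # 99 = 0x0001 + 0x0002 + 0x0020 + 0x0040
    print(decode_flag(99))
```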
 343
 344 LIMITATIONS
 345        o Unaligned  words  used  in  bam_import.c,  bam_endian.h,  bam.c   and
 346          bam_aux.c.
 347
 348        o CIGAR operation P is not properly handled at the moment.
 349
 350        o In  merging,  the input files are required to have the same number of
 351          reference sequences. The requirement can  be  relaxed.  In  addition,
 352          merging  does  not reconstruct the header dictionaries automatically.
 353          End users have to provide the correct header. Picard is better at
 354          merging.
 355
 356        o Samtools' rmdup does not work for single-end data and does not remove
 357          duplicates across chromosomes. Picard is better.
 358
 359
 360 AUTHOR
 361        Heng Li from the Sanger Institute wrote the C version of samtools.  Bob
 362        Handsaker from the Broad Institute implemented the BGZF library and Jue
 363        Ruan from Beijing Genomics Institute wrote the  RAZF  library.  Various
 364        people  in the 1000Genomes Project contributed to the SAM format speci-
 365        fication.
 366
 367
 368 SEE ALSO
 369        Samtools website: <http://samtools.sourceforge.net>
 370
 371
 372
 373 samtools-0.1.6                 2 September 2009                    samtools(1)