samtools.txt

   1 samtools(1)                  Bioinformatics tools                  samtools(1)
   2
   3
   4
   5 NAME
   6        samtools - Utilities for the Sequence Alignment/Map (SAM) format
   7
   8 SYNOPSIS
   9        samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
  10
  11        samtools sort aln.bam aln.sorted
  12
  13        samtools index aln.sorted.bam
  14
  15        samtools idxstats aln.sorted.bam
  16
  17        samtools view aln.sorted.bam chr2:20,100,000-20,200,000
  18
  19        samtools merge out.bam in1.bam in2.bam in3.bam
  20
  21        samtools faidx ref.fasta
  22
  23        samtools pileup -vcf ref.fasta aln.sorted.bam
  24
  25        samtools  mpileup  -C50  -agf  ref.fasta  -r  chr3:1,000-2,000  in1.bam
  26        in2.bam
  27
  28        samtools tview aln.sorted.bam ref.fasta
  29
  30
  31 DESCRIPTION
  32        Samtools is a set of utilities that manipulate alignments  in  the  BAM
  33        format. It imports from and exports to the SAM (Sequence Alignment/Map)
  34        format, does sorting, merging and  indexing,  and  allows  to  retrieve
  35        reads in any regions swiftly.
  36
  37        Samtools  is designed to work on a stream. It regards an input file `-'
  38        as the standard input (stdin) and an output file `-'  as  the  standard
  39        output (stdout). Several commands can thus be combined with Unix pipes.
  40        Samtools always output warning and error messages to the standard error
  41        output (stderr).
  42
  43        Samtools  is  also able to open a BAM (not SAM) file on a remote FTP or
  44        HTTP server if the BAM file name starts  with  `ftp://'  or  `http://'.
  45        Samtools  checks  the  current working directory for the index file and
  46        will download the index upon absence. Samtools does  not  retrieve  the
  47        entire alignment file unless it is asked to do so.
  48
  49
  50 COMMANDS AND OPTIONS
  51        view      samtools  view  [-bhuHS]  [-t  in.refList]  [-o  output]  [-f
  52                  reqFlag] [-F skipFlag] [-q minMapQ] [-l  library]  [-r  read-
  53                  Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
  54
  55                  Extract/print  all or sub alignments in SAM or BAM format. If
  56                  no region is specified, all the alignments will  be  printed;
  57                  otherwise  only  alignments overlapping the specified regions
  58                  will be output. An alignment may be given multiple  times  if
  59                  it is overlapping several regions. A region can be presented,
  60                  for example, in  the  following  format:  `chr2'  (the  whole
  61                  chr2),  `chr2:1000000'  (region starting from 1,000,000bp) or
  62                  `chr2:1,000,000-2,000,000'  (region  between  1,000,000   and
  63                  2,000,000bp  including  the  end  points).  The coordinate is
  64                  1-based.
  65
  66                  OPTIONS:
  67
  68                  -b      Output in the BAM format.
  69
  70                  -f INT  Only output alignments with all bits in  INT  present
  71                          in the FLAG field. INT can be in hex in the format of
  72                          /^0x[0-9A-F]+/ [0]
  73
  74                  -F INT  Skip alignments with bits present in INT [0]
  75
  76                  -h      Include the header in the output.
  77
  78                  -H      Output the header only.
  79
  80                  -l STR  Only output reads in library STR [null]
  81
  82                  -o FILE Output file [stdout]
  83
  84                  -q INT  Skip alignments with MAPQ smaller than INT [0]
  85
  86                  -r STR  Only output reads in read group STR [null]
  87
  88                  -R FILE Output reads in read groups listed in FILE [null]
  89
  90                  -S      Input is in SAM. If @SQ header lines are absent,  the
  91                          `-t' option is required.
  92
  93                  -t FILE This  file  is  TAB-delimited. Each line must contain
  94                          the reference name and the length of  the  reference,
  95                          one  line  for  each  distinct  reference; additional
  96                          fields are ignored. This file also defines the  order
  97                          of  the  reference  sequences  in sorting. If you run
  98                          `samtools faidx <ref.fa>', the resultant  index  file
  99                          <ref.fa>.fai  can be used as this <in.ref_list> file.
 100
 101                  -u      Output uncompressed BAM. This option saves time spent
 102                          on  compression/decomprssion  and  is  thus preferred
 103                          when the output is piped to another samtools command.
 104
 105
 106        tview     samtools tview <in.sorted.bam> [ref.fasta]
 107
 108                  Text  alignment viewer (based on the ncurses library). In the
 109                  viewer, press `?' for help and press `g' to check the  align-
 110                  ment    start    from   a   region   in   the   format   like
 111                  `chr10:10,000,000' or `=10,000,000'  when  viewing  the  same
 112                  reference sequence.
 113
 114
 115        pileup    samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
 116                  [-l in.site_list] [-C capMapQ] [-M maxMapQ]  [-T  theta]  [-N
 117                  nHap]  [-r  pairDiffRate]  [-m  mask]  [-d maxIndelDepth] [-G
 118                  indelPrior] <in.bam>|<in.sam>
 119
 120                  Print the alignment in the pileup format. In the pileup  for-
 121                  mat,  each  line represents a genomic position, consisting of
 122                  chromosome name, coordinate, reference base, read bases, read
 123                  qualities  and  alignment  mapping  qualities. Information on
 124                  match, mismatch, indel, strand, mapping quality and start and
 125                  end  of  a  read  are all encoded at the read base column. At
 126                  this column, a dot stands for a match to the  reference  base
 127                  on  the  forward  strand,  a comma for a match on the reverse
 128                  strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
 129                  match on the forward strand and `acgtn' for a mismatch on the
 130                  reverse strand. A pattern  `\+[0-9]+[ACGTNacgtn]+'  indicates
 131                  there is an insertion between this reference position and the
 132                  next reference position. The length of the insertion is given
 133                  by  the  integer  in  the  pattern,  followed by the inserted
 134                  sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'  repre-
 135                  sents  a  deletion from the reference. The deleted bases will
 136                  be presented as `*' in the following lines. Also at the  read
 137                  base  column,  a  symbol  `^'  marks the start of a read. The
 138                  ASCII of the character following `^' minus 33 gives the  map-
 139                  ping quality. A symbol `$' marks the end of a read segment.
 140
 141                  If  option  -c  is  applied, the consensus base, Phred-scaled
 142                  consensus quality, SNP quality (i.e. the Phred-scaled  proba-
 143                  bility of the consensus being identical to the reference) and
 144                  root mean square (RMS) mapping quality of the reads  covering
 145                  the  site  will  be inserted between the `reference base' and
 146                  the `read bases' columns. An  indel  occupies  an  additional
 147                  line.  Each  indel  line consists of chromosome name, coordi-
 148                  nate, a star, the genotype, consensus quality,  SNP  quality,
 149                  RMS mapping quality, # covering reads, the first alllele, the
 150                  second allele, # reads supporting the first allele,  #  reads
 151                  supporting  the  second  allele and # reads containing indels
 152                  different from the top two alleles.
 153
 154                  The position of indels is offset by -1.
 155
 156                  OPTIONS:
 157
 158                  -B        Disable the BAQ computation. See the  mpileup  com-
 159                            mand for details.
 160
 161                  -c        Call the consensus sequence using SOAPsnp consensus
 162                            model. Options -T, -N, -I and -r are only effective
 163                            when -c or -g is in use.
 164
 165                  -C INT    Coefficient  for downgrading the mapping quality of
 166                            poorly mapped reads. See the  mpileup  command  for
 167                            details. [0]
 168
 169                  -d INT    Use  the  first  NUM  reads in the pileup for indel
 170                            calling for speed up. Zero for unlimited. [1024]
 171
 172                  -f FILE   The reference sequence in the FASTA  format.  Index
 173                            file FILE.fai will be created if absent.
 174
 175                  -g        Generate  genotype  likelihood  in the binary GLFv3
 176                            format. This option suppresses -c, -i and -s.  This
 177                            option is deprecated by the mpileup command.
 178
 179                  -i        Only output pileup lines containing indels.
 180
 181                  -I INT    Phred  probability  of an indel in sequencing/prep.
 182                            [40]
 183
 184                  -l FILE   List of sites at which pileup is output. This  file
 185                            is  space  delimited.  The  first  two  columns are
 186                            required to be chromosome and  1-based  coordinate.
 187                            Additional  columns  are ignored. It is recommended
 188                            to use option
 189
 190                  -m INT    Filter reads  with  flag  containing  bits  in  INT
 191                            [1796]
 192
 193                  -M INT    Cap mapping quality at INT [60]
 194
 195                  -N INT    Number of haplotypes in the sample (>=2) [2]
 196
 197                  -r FLOAT  Expected  fraction of differences between a pair of
 198                            haplotypes [0.001]
 199
 200                  -s        Print the mapping quality as the last column.  This
 201                            option  makes  the output easier to parse, although
 202                            this format is not space efficient.
 203
 204                  -S        The input file is in SAM.
 205
 206                  -t FILE   List of reference names ane  sequence  lengths,  in
 207                            the  format  described  for  the import command. If
 208                            this option is present, samtools assumes the  input
 209                            <in.alignment>  is  in  SAM  format;  otherwise  it
 210                            assumes in BAM format.  -s together with -l  as  in
 211                            the  default  format  we  may  not know the mapping
 212                            quality.
 213
 214                  -T FLOAT  The theta parameter (error dependency  coefficient)
 215                            in the maq consensus calling model [0.85]
 216
 217
 218        mpileup   samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l
 219                  list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam
 220                  [...]]
 221
 222                  Generate  BCF or pileup for one or multiple BAM files. Align-
 223                  ment records are grouped by sample identifiers in @RG  header
 224                  lines.  If  sample identifiers are absent, each input file is
 225                  regarded as one sample.
 226
 227                  OPTIONS:
 228
 229                  -B      Disable probabilistic realignment for the computation
 230                          of  base  alignment  quality (BAQ). BAQ is the Phred-
 231                          scaled probability of a read base  being  misaligned.
 232                          Applying  this  option  greatly helps to reduce false
 233                          SNPs caused by misalignments.
 234
 235                  -C INT  Coefficient for downgrading mapping quality for reads
 236                          containing  excessive mismatches. Given a read with a
 237                          phred-scaled probability q of  being  generated  from
 238                          the mapped position, the new mapping quality is about
 239                          sqrt((INT-q)/INT)*INT. A  zero  value  disables  this
 240                          functionality;  if  enabled, the recommended value is
 241                          50. [0]
 242
 243                  -f FILE The reference file [null]
 244
 245                  -g      Compute genotype likelihoods and output them  in  the
 246                          binary call format (BCF).
 247
 248                  -u      Similar  to -g except that the output is uncompressed
 249                          BCF, which is preferred for pipeing.
 250
 251                  -l FILE File containing a list of sites where pileup  or  BCF
 252                          is outputted [null]
 253
 254                  -q INT  Minimum  mapping  quality for an alignment to be used
 255                          [0]
 256
 257                  -Q INT  Minimum base quality for a base to be considered [13]
 258
 259                  -r STR  Only generate pileup in region STR [all sites]
 260
 261
 262        reheader  samtools reheader <in.header.sam> <in.bam>
 263
 264                  Replace   the   header   in   in.bam   with   the  header  in
 265                  in.header.sam.  This command is much  faster  than  replacing
 266                  the header with a BAM->SAM->BAM conversion.
 267
 268
 269        sort      samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>
 270
 271                  Sort  alignments  by  leftmost  coordinates.  File  <out.pre-
 272                  fix>.bam will be created. This command may also create tempo-
 273                  rary  files <out.prefix>.%d.bam when the whole alignment can-
 274                  not be fitted into memory (controlled by option -m).
 275
 276                  OPTIONS:
 277
 278                  -o      Output the final alignment to the standard output.
 279
 280                  -n      Sort by read names rather than by chromosomal coordi-
 281                          nates
 282
 283                  -m INT  Approximately    the    maximum    required   memory.
 284                          [500000000]
 285
 286
 287        merge     samtools  merge  [-nur]  [-h  inh.sam]  [-R  reg]   <out.bam>
 288                  <in1.bam> <in2.bam> [...]
 289
 290                  Merge multiple sorted alignments.  The header reference lists
 291                  of all the input BAM files, and the @SQ headers  of  inh.sam,
 292                  if  any,  must  all  refer  to  the  same  set  of  reference
 293                  sequences.  The header reference list and (unless  overridden
 294                  by  -h) `@' headers of in1.bam will be copied to out.bam, and
 295                  the headers of other files will be ignored.
 296
 297                  OPTIONS:
 298
 299                  -h FILE Use the lines of FILE as `@' headers to be copied  to
 300                          out.bam, replacing any header lines that would other-
 301                          wise be copied from in1.bam.  (FILE  is  actually  in
 302                          SAM  format, though any alignment records it may con-
 303                          tain are ignored.)
 304
 305                  -R STR  Merge files in the specified region indicated by STR
 306
 307                  -r      Attach an RG tag to each alignment. The tag value  is
 308                          inferred from file names.
 309
 310                  -n      The  input alignments are sorted by read names rather
 311                          than by chromosomal coordinates
 312
 313                  -u      Uncompressed BAM output
 314
 315
 316        index     samtools index <aln.bam>
 317
 318                  Index sorted alignment for fast  random  access.  Index  file
 319                  <aln.bam>.bai will be created.
 320
 321
 322        idxstats  samtools idxstats <aln.bam>
 323
 324                  Retrieve and print stats in the index file. The output is TAB
 325                  delimited with each line  consisting  of  reference  sequence
 326                  name, sequence length, # mapped reads and # unmapped reads.
 327
 328
 329        faidx     samtools faidx <ref.fasta> [region1 [...]]
 330
 331                  Index  reference sequence in the FASTA format or extract sub-
 332                  sequence from indexed reference sequence.  If  no  region  is
 333                  specified,   faidx   will   index   the   file   and   create
 334                  <ref.fasta>.fai on the disk. If regions are speficified,  the
 335                  subsequences  will  be retrieved and printed to stdout in the
 336                  FASTA format. The input file can be compressed  in  the  RAZF
 337                  format.
 338
 339
 340        fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>
 341
 342                  Fill in mate coordinates, ISIZE and mate related flags from a
 343                  name-sorted alignment.
 344
 345
 346        rmdup     samtools rmdup [-sS] <input.srt.bam> <out.bam>
 347
 348                  Remove potential PCR duplicates: if multiple read pairs  have
 349                  identical  external  coordinates,  only  retain the pair with
 350                  highest mapping quality.  In the paired-end mode,  this  com-
 351                  mand  ONLY  works  with  FR orientation and requires ISIZE is
 352                  correctly set. It does not work for unpaired reads (e.g.  two
 353                  ends mapped to different chromosomes or orphan reads).
 354
 355                  OPTIONS:
 356
 357                  -s      Remove  duplicate  for  single-end reads. By default,
 358                          the command works for paired-end reads only.
 359
 360                  -S      Treat paired-end reads and single-end reads.
 361
 362
 363        calmd     samtools calmd [-eubSr] [-C capQcoef] <aln.bam> <ref.fasta>
 364
 365                  Generate the MD tag. If the MD tag is already  present,  this
 366                  command  will  give a warning if the MD tag generated is dif-
 367                  ferent from the existing tag. Output SAM by default.
 368
 369                  OPTIONS:
 370
 371                  -e      Convert a the read base to = if it  is  identical  to
 372                          the  aligned  reference  base.  Indel caller does not
 373                          support the = bases at the moment.
 374
 375                  -u      Output uncompressed BAM
 376
 377                  -b      Output compressed BAM
 378
 379                  -S      The input is SAM with header lines
 380
 381                  -C INT  Coefficient to cap mapping quality of  poorly  mapped
 382                          reads. See the pileup command for details. [0]
 383
 384                  -r      Perform  probabilistic  realignment  to  compute BAQ,
 385                          which will be used to cap base quality.
 386
 387
 388 SAM FORMAT
 389        SAM is TAB-delimited. Apart from the header lines,  which  are  started
 390        with the `@' symbol, each alignment line consists of:
 391
 392
 393        +----+-------+----------------------------------------------------------+
 394        |Col | Field |                       Description                        |
 395        +----+-------+----------------------------------------------------------+
 396        | 1  | QNAME | Query (pair) NAME                                        |
 397        | 2  | FLAG  | bitwise FLAG                                             |
 398        | 3  | RNAME | Reference sequence NAME                                  |
 399        | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
 400        | 5  | MAPQ  | MAPping Quality (Phred-scaled)                           |
 401        | 6  | CIAGR | extended CIGAR string                                    |
 402        | 7  | MRNM  | Mate Reference sequence NaMe (`=' if same as RNAME)      |
 403        | 8  | MPOS  | 1-based Mate POSistion                                   |
 404        | 9  | ISIZE | Inferred insert SIZE                                     |
 405        |10  | SEQ   | query SEQuence on the same strand as the reference       |
 406        |11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
 407        |12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |
 408        +----+-------+----------------------------------------------------------+
 409
 410        Each bit in the FLAG field is defined as:
 411
 412
 413           +-------+-----+--------------------------------------------------+
 414           | Flag  | Chr |                   Description                    |
 415           +-------+-----+--------------------------------------------------+
 416           |0x0001 |  p  | the read is paired in sequencing                 |
 417           |0x0002 |  P  | the read is mapped in a proper pair              |
 418           |0x0004 |  u  | the query sequence itself is unmapped            |
 419           |0x0008 |  U  | the mate is unmapped                             |
 420           |0x0010 |  r  | strand of the query (1 for reverse)              |
 421           |0x0020 |  R  | strand of the mate                               |
 422           |0x0040 |  1  | the read is the first read in a pair             |
 423           |0x0080 |  2  | the read is the second read in a pair            |
 424           |0x0100 |  s  | the alignment is not primary                     |
 425           |0x0200 |  f  | the read fails platform/vendor quality checks    |
 426           |0x0400 |  d  | the read is either a PCR or an optical duplicate |
 427           +-------+-----+--------------------------------------------------+
 428
 429 EXAMPLES
 430        o Import SAM to BAM when @SQ lines are present in the header:
 431
 432            samtools view -bS aln.sam > aln.bam
 433
 434          If @SQ lines are absent:
 435
 436            samtools faidx ref.fa
 437            samtools view -bt ref.fa.fai aln.sam > aln.bam
 438
 439          where ref.fa.fai is generated automatically by the faidx command.
 440
 441
 442        o Attach the RG tag while merging sorted alignments:
 443
 444            perl       -e       'print      "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
 445          mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
 446            samtools merge -rh rg.txt merged.bam ga.bam 454.bam
 447
 448          The value in a RG tag is determined by the file name the read is com-
 449          ing  from. In this example, in the merged.bam, reads from ga.bam will
 450          be attached RG:Z:ga,  while  reads  from  454.bam  will  be  attached
 451          RG:Z:454.
 452
 453
 454        o Call SNPs and short indels for one diploid individual:
 455
 456            samtools pileup -vcf ref.fa aln.bam > var.raw.plp
 457            samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp
 458            awk     '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)'     var.flt.plp    >
 459          var.final.plp
 460
 461          The -D option of varFilter controls the  maximum  read  depth,  which
 462          should  be  adjusted  to about twice the average read depth.  One may
 463          consider to add -C50 to pileup if mapping  quality  is  overestimated
 464          for  reads containing excessive mismatches. Applying this option usu-
 465          ally helps BWA-short but may not other mappers. It  also  potentially
 466          increases reference biases.
 467
 468
 469        o Call SNPs (not short indels) for multiple diploid individuals:
 470
 471            samtools  mpileup  -augf  ref.fa  *.bam  |  bcftools  view -bcv - >
 472          snp.raw.bcf
 473            bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 |  bgzip
 474          > snp.flt.vcf.gz
 475
 476          Individuals  are identified from the SM tags in the @RG header lines.
 477          Individuals can be pooled in one alignment file; one  individual  can
 478          also be separated into multiple files. Similarly, one may consider to
 479          apply -C50 to mpileup.  SNP calling in this way also works for single
 480          sample and has the advantage of enabling more powerful filtering. The
 481          drawback is the lack of short indel calling, which may be implemented
 482          in future.
 483
 484
 485        o Derive  the  allele  frequency spectrum (AFS) on a list of sites from
 486          multiple individuals:
 487
 488            samtools mpileup -gf ref.fa *.bam > all.bcf
 489            bcftools view -bl sites.list all.bcf > sites.bcf
 490            bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
 491            bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
 492            bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs
 493            ......
 494
 495          where sites.list contains the list of sites with each line consisting
 496          of  the  reference sequence name and position. The following bcftools
 497          commands estimate AFS by EM.
 498
 499
 500        o Dump BAQ applied alignment for other SNP callers:
 501
 502            samtools calmd -br aln.bam > aln.baq.bam
 503
 504          It adds and corrects the NM and MD tags at the same time.  The  calmd
 505          command  also  comes with the -C option, the same as the on in pileup
 506          and mpileup.  Apply if it helps.
 507
 508
 509 LIMITATIONS
 510        o Unaligned  words  used  in  bam_import.c,  bam_endian.h,  bam.c   and
 511          bam_aux.c.
 512
 513        o In  merging,  the input files are required to have the same number of
 514          reference sequences. The requirement can  be  relaxed.  In  addition,
 515          merging  does  not reconstruct the header dictionaries automatically.
 516          Endusers have to provide the correct  header.  Picard  is  better  at
 517          merging.
 518
 519        o Samtools  paired-end  rmdup  does  not  work for unpaired reads (e.g.
 520          orphan reads or ends mapped to different chromosomes). If this  is  a
 521          concern,  please  use  Picard's MarkDuplicate which correctly handles
 522          these cases, although a little slower.
 523
 524
 525 AUTHOR
 526        Heng Li from the Sanger Institute wrote the C version of samtools.  Bob
 527        Handsaker from the Broad Institute implemented the BGZF library and Jue
 528        Ruan from Beijing Genomics Institute wrote the  RAZF  library.  Various
 529        people in the 1000 Genomes Project contributed to the SAM format speci-
 530        fication.
 531
 532
 533 SEE ALSO
 534        Samtools website: <http://samtools.sourceforge.net>
 535
 536
 537
 538 samtools-0.1.9                  27 October 2010                    samtools(1)