--- /dev/null
+samtools(1) Bioinformatics tools samtools(1)
+
+
+
+NAME
+ samtools - Utilities for the Sequence Alignment/Map (SAM) format
+
+SYNOPSIS
+ samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
+
+ samtools sort aln.bam aln.sorted
+
+ samtools index aln.sorted.bam
+
+ samtools view aln.sorted.bam chr2:20,100,000-20,200,000
+
+ samtools merge out.bam in1.bam in2.bam in3.bam
+
+ samtools faidx ref.fasta
+
+ samtools pileup -f ref.fasta aln.sorted.bam
+
+ samtools tview aln.sorted.bam ref.fasta
+
+
+DESCRIPTION
+ Samtools is a set of utilities that manipulate alignments in the BAM
+ format. It imports from and exports to the SAM (Sequence Alignment/Map)
+ format, does sorting, merging and indexing, and allows to retrieve
+ reads in any regions swiftly.
+
+ Samtools is designed to work on a stream. It regards an input file `-'
+ as the standard input (stdin) and an output file `-' as the standard
+ output (stdout). Several commands can thus be combined with Unix pipes.
+ Samtools always output warning and error messages to the standard error
+ output (stderr).
+
+ Samtools is also able to open a BAM (not SAM) file on a remote FTP or
+ HTTP server if the BAM file name starts with `ftp://' or `http://'.
+ Samtools checks the current working directory for the index file and
+ will download the index upon absence. Samtools does not retrieve the
+ entire alignment file unless it is asked to do so.
+
+
+COMMANDS AND OPTIONS
+ import samtools import <in.ref_list> <in.sam> <out.bam>
+
+ Since 0.1.4, this command is an alias of:
+
+ samtools view -bt <in.ref_list> -o <out.bam> <in.sam>
+
+
+ sort samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>
+
+ Sort alignments by leftmost coordinates. File <out.pre-
+ fix>.bam will be created. This command may also create tempo-
+ rary files <out.prefix>.%d.bam when the whole alignment can-
+ not be fitted into memory (controlled by option -m).
+
+ OPTIONS:
+
+ -n Sort by read names rather than by chromosomal coordi-
+ nates
+
+ -m INT Approximately the maximum required memory.
+ [500000000]
+
+
+ merge samtools merge [-h inh.sam] [-n] <out.bam> <in1.bam>
+ <in2.bam> [...]
+
+ Merge multiple sorted alignments. The header reference lists
+ of all the input BAM files, and the @SQ headers of inh.sam,
+ if any, must all refer to the same set of reference
+ sequences. The header reference list and (unless overridden
+ by -h) `@' headers of in1.bam will be copied to out.bam, and
+ the headers of other files will be ignored.
+
+ OPTIONS:
+
+ -h FILE Use the lines of FILE as `@' headers to be copied to
+ out.bam, replacing any header lines that would other-
+ wise be copied from in1.bam. (FILE is actually in
+ SAM format, though any alignment records it may con-
+ tain are ignored.)
+
+ -n The input alignments are sorted by read names rather
+ than by chromosomal coordinates
+
+
+ index samtools index <aln.bam>
+
+ Index sorted alignment for fast random access. Index file
+ <aln.bam>.bai will be created.
+
+
+ view samtools view [-bhuHS] [-t in.refList] [-o output] [-f
+ reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read-
+ Group] <in.bam>|<in.sam> [region1 [...]]
+
+ Extract/print all or sub alignments in SAM or BAM format. If
+ no region is specified, all the alignments will be printed;
+ otherwise only alignments overlapping the specified regions
+ will be output. An alignment may be given multiple times if
+ it is overlapping several regions. A region can be presented,
+ for example, in the following format: `chr2', `chr2:1000000'
+ or `chr2:1,000,000-2,000,000'. The coordinate is 1-based.
+
+ OPTIONS:
+
+ -b Output in the BAM format.
+
+ -u Output uncompressed BAM. This option saves time spent
+ on compression/decomprssion and is thus preferred
+ when the output is piped to another samtools command.
+
+ -h Include the header in the output.
+
+ -H Output the header only.
+
+ -S Input is in SAM. If @SQ header lines are absent, the
+ `-t' option is required.
+
+ -t FILE This file is TAB-delimited. Each line must contain
+ the reference name and the length of the reference,
+ one line for each distinct reference; additional
+ fields are ignored. This file also defines the order
+ of the reference sequences in sorting. If you run
+ `samtools faidx <ref.fa>', the resultant index file
+ <ref.fa>.fai can be used as this <in.ref_list> file.
+
+ -o FILE Output file [stdout]
+
+ -f INT Only output alignments with all bits in INT present
+ in the FLAG field. INT can be in hex in the format of
+ /^0x[0-9A-F]+/ [0]
+
+ -F INT Skip alignments with bits present in INT [0]
+
+ -q INT Skip alignments with MAPQ smaller than INT [0]
+
+ -l STR Only output reads in library STR [null]
+
+ -r STR Only output reads in read group STR [null]
+
+
+ faidx samtools faidx <ref.fasta> [region1 [...]]
+
+ Index reference sequence in the FASTA format or extract sub-
+ sequence from indexed reference sequence. If no region is
+ specified, faidx will index the file and create
+ <ref.fasta>.fai on the disk. If regions are speficified, the
+ subsequences will be retrieved and printed to stdout in the
+ FASTA format. The input file can be compressed in the RAZF
+ format.
+
+
+ pileup samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l
+ in.site_list] [-iscgS2] [-T theta] [-N nHap] [-r
+ pairDiffRate] <in.bam>|<in.sam>
+
+ Print the alignment in the pileup format. In the pileup for-
+ mat, each line represents a genomic position, consisting of
+ chromosome name, coordinate, reference base, read bases, read
+ qualities and alignment mapping qualities. Information on
+ match, mismatch, indel, strand, mapping quality and start and
+ end of a read are all encoded at the read base column. At
+ this column, a dot stands for a match to the reference base
+ on the forward strand, a comma for a match on the reverse
+ strand, `ACGTN' for a mismatch on the forward strand and
+ `acgtn' for a mismatch on the reverse strand. A pattern
+ `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion
+ between this reference position and the next reference posi-
+ tion. The length of the insertion is given by the integer in
+ the pattern, followed by the inserted sequence. Similarly, a
+ pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the
+ reference. The deleted bases will be presented as `*' in the
+ following lines. Also at the read base column, a symbol `^'
+ marks the start of a read segment which is a contiguous sub-
+ sequence on the read separated by `N/S/H' CIGAR operations.
+ The ASCII of the character following `^' minus 33 gives the
+ mapping quality. A symbol `$' marks the end of a read seg-
+ ment.
+
+ If option -c is applied, the consensus base, consensus qual-
+ ity, SNP quality and RMS mapping quality of the reads cover-
+ ing the site will be inserted between the `reference base'
+ and the `read bases' columns. An indel occupies an additional
+ line. Each indel line consists of chromosome name, coordi-
+ nate, a star, the genotype, consensus quality, SNP quality,
+ RMS mapping quality, # covering reads, the first alllele, the
+ second allele, # reads supporting the first allele, # reads
+ supporting the second allele and # reads containing indels
+ different from the top two alleles.
+
+ OPTIONS:
+
+
+ -s Print the mapping quality as the last column. This
+ option makes the output easier to parse, although
+ this format is not space efficient.
+
+
+ -S The input file is in SAM.
+
+
+ -i Only output pileup lines containing indels.
+
+
+ -f FILE The reference sequence in the FASTA format. Index
+ file FILE.fai will be created if absent.
+
+
+ -M INT Cap mapping quality at INT [60]
+
+
+ -t FILE List of reference names ane sequence lengths, in
+ the format described for the import command. If
+ this option is present, samtools assumes the input
+ <in.alignment> is in SAM format; otherwise it
+ assumes in BAM format.
+
+
+ -l FILE List of sites at which pileup is output. This file
+ is space delimited. The first two columns are
+ required to be chromosome and 1-based coordinate.
+ Additional columns are ignored. It is recommended
+ to use option -s together with -l as in the default
+ format we may not know the mapping quality.
+
+
+ -c Call the consensus sequence using MAQ consensus
+ model. Options -T, -N, -I and -r are only effective
+ when -c or -g is in use.
+
+
+ -g Generate genotype likelihood in the binary GLFv3
+ format. This option suppresses -c, -i and -s.
+
+
+ -T FLOAT The theta parameter (error dependency coefficient)
+ in the maq consensus calling model [0.85]
+
+
+ -N INT Number of haplotypes in the sample (>=2) [2]
+
+
+ -r FLOAT Expected fraction of differences between a pair of
+ haplotypes [0.001]
+
+
+ -I INT Phred probability of an indel in sequencing/prep.
+ [40]
+
+
+
+ tview samtools tview <in.sorted.bam> [ref.fasta]
+
+ Text alignment viewer (based on the ncurses library). In the
+ viewer, press `?' for help and press `g' to check the align-
+ ment start from a region in the format like
+ `chr10:10,000,000'.
+
+
+
+ fixmate samtools fixmate <in.nameSrt.bam> <out.bam>
+
+ Fill in mate coordinates, ISIZE and mate related flags from a
+ name-sorted alignment.
+
+
+ rmdup samtools rmdup <input.srt.bam> <out.bam>
+
+ Remove potential PCR duplicates: if multiple read pairs have
+ identical external coordinates, only retain the pair with
+ highest mapping quality. This command ONLY works with FR
+ orientation and requires ISIZE is correctly set.
+
+
+
+ rmdupse samtools rmdupse <input.srt.bam> <out.bam>
+
+ Remove potential duplicates for single-ended reads. This com-
+ mand will treat all reads as single-ended even if they are
+ paired in fact.
+
+
+
+ fillmd samtools fillmd [-e] <aln.bam> <ref.fasta>
+
+ Generate the MD tag. If the MD tag is already present, this
+ command will give a warning if the MD tag generated is dif-
+ ferent from the existing tag.
+
+ OPTIONS:
+
+ -e Convert a the read base to = if it is identical to
+ the aligned reference base. Indel caller does not
+ support the = bases at the moment.
+
+
+
+SAM FORMAT
+ SAM is TAB-delimited. Apart from the header lines, which are started
+ with the `@' symbol, each alignment line consists of:
+
+
+ +----+-------+----------------------------------------------------------+
+ |Col | Field | Description |
+ +----+-------+----------------------------------------------------------+
+ | 1 | QNAME | Query (pair) NAME |
+ | 2 | FLAG | bitwise FLAG |
+ | 3 | RNAME | Reference sequence NAME |
+ | 4 | POS | 1-based leftmost POSition/coordinate of clipped sequence |
+ | 5 | MAPQ | MAPping Quality (Phred-scaled) |
+ | 6 | CIAGR | extended CIGAR string |
+ | 7 | MRNM | Mate Reference sequence NaMe (`=' if same as RNAME) |
+ | 8 | MPOS | 1-based Mate POSistion |
+ | 9 | ISIZE | Inferred insert SIZE |
+ |10 | SEQ | query SEQuence on the same strand as the reference |
+ |11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) |
+ |12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE |
+ +----+-------+----------------------------------------------------------+
+
+ Each bit in the FLAG field is defined as:
+
+
+ +-------+--------------------------------------------------+
+ | Flag | Description |
+ +-------+--------------------------------------------------+
+ |0x0001 | the read is paired in sequencing |
+ |0x0002 | the read is mapped in a proper pair |
+ |0x0004 | the query sequence itself is unmapped |
+ |0x0008 | the mate is unmapped |
+ |0x0010 | strand of the query (1 for reverse) |
+ |0x0020 | strand of the mate |
+ |0x0040 | the read is the first read in a pair |
+ |0x0080 | the read is the second read in a pair |
+ |0x0100 | the alignment is not primary |
+ |0x0200 | the read fails platform/vendor quality checks |
+ |0x0400 | the read is either a PCR or an optical duplicate |
+ +-------+--------------------------------------------------+
+
+LIMITATIONS
+ o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
+ bam_aux.c.
+
+ o CIGAR operation P is not properly handled at the moment.
+
+ o In merging, the input files are required to have the same number of
+ reference sequences. The requirement can be relaxed. In addition,
+ merging does not reconstruct the header dictionaries automatically.
+ Endusers have to provide the correct header. Picard is better at
+ merging.
+
+ o Samtools' rmdup does not work for single-end data and does not remove
+ duplicates across chromosomes. Picard is better.
+
+
+AUTHOR
+ Heng Li from the Sanger Institute wrote the C version of samtools. Bob
+ Handsaker from the Broad Institute implemented the BGZF library and Jue
+ Ruan from Beijing Genomics Institute wrote the RAZF library. Various
+ people in the 1000Genomes Project contributed to the SAM format speci-
+ fication.
+
+
+SEE ALSO
+ Samtools website: <http://samtools.sourceforge.net>
+
+
+
+samtools-0.1.6 2 September 2009 samtools(1)