X-Git-Url: http://woldlab.caltech.edu/gitweb/?p=samtools.git;a=blobdiff_plain;f=samtools.txt;fp=samtools.txt;h=63e6a25ec72823931cda29bd2ffd9ddad745b2db;hp=0000000000000000000000000000000000000000;hb=4a17fa7e1f91b2fe04ad334a63fc2b0d5e859d8a;hpb=b27e00385f41769d03a8cca4dbd71275fc9fa906 diff --git a/samtools.txt b/samtools.txt new file mode 100644 index 0000000..63e6a25 --- /dev/null +++ b/samtools.txt @@ -0,0 +1,373 @@ +samtools(1) Bioinformatics tools samtools(1) + + + +NAME + samtools - Utilities for the Sequence Alignment/Map (SAM) format + +SYNOPSIS + samtools view -bt ref_list.txt -o aln.bam aln.sam.gz + + samtools sort aln.bam aln.sorted + + samtools index aln.sorted.bam + + samtools view aln.sorted.bam chr2:20,100,000-20,200,000 + + samtools merge out.bam in1.bam in2.bam in3.bam + + samtools faidx ref.fasta + + samtools pileup -f ref.fasta aln.sorted.bam + + samtools tview aln.sorted.bam ref.fasta + + +DESCRIPTION + Samtools is a set of utilities that manipulate alignments in the BAM + format. It imports from and exports to the SAM (Sequence Alignment/Map) + format, does sorting, merging and indexing, and allows to retrieve + reads in any regions swiftly. + + Samtools is designed to work on a stream. It regards an input file `-' + as the standard input (stdin) and an output file `-' as the standard + output (stdout). Several commands can thus be combined with Unix pipes. + Samtools always output warning and error messages to the standard error + output (stderr). + + Samtools is also able to open a BAM (not SAM) file on a remote FTP or + HTTP server if the BAM file name starts with `ftp://' or `http://'. + Samtools checks the current working directory for the index file and + will download the index upon absence. Samtools does not retrieve the + entire alignment file unless it is asked to do so. + + +COMMANDS AND OPTIONS + import samtools import + + Since 0.1.4, this command is an alias of: + + samtools view -bt -o + + + sort samtools sort [-n] [-m maxMem] + + Sort alignments by leftmost coordinates. File .bam will be created. This command may also create tempo- + rary files .%d.bam when the whole alignment can- + not be fitted into memory (controlled by option -m). + + OPTIONS: + + -n Sort by read names rather than by chromosomal coordi- + nates + + -m INT Approximately the maximum required memory. + [500000000] + + + merge samtools merge [-h inh.sam] [-n] + [...] + + Merge multiple sorted alignments. The header reference lists + of all the input BAM files, and the @SQ headers of inh.sam, + if any, must all refer to the same set of reference + sequences. The header reference list and (unless overridden + by -h) `@' headers of in1.bam will be copied to out.bam, and + the headers of other files will be ignored. + + OPTIONS: + + -h FILE Use the lines of FILE as `@' headers to be copied to + out.bam, replacing any header lines that would other- + wise be copied from in1.bam. (FILE is actually in + SAM format, though any alignment records it may con- + tain are ignored.) + + -n The input alignments are sorted by read names rather + than by chromosomal coordinates + + + index samtools index + + Index sorted alignment for fast random access. Index file + .bai will be created. + + + view samtools view [-bhuHS] [-t in.refList] [-o output] [-f + reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read- + Group] | [region1 [...]] + + Extract/print all or sub alignments in SAM or BAM format. If + no region is specified, all the alignments will be printed; + otherwise only alignments overlapping the specified regions + will be output. An alignment may be given multiple times if + it is overlapping several regions. A region can be presented, + for example, in the following format: `chr2', `chr2:1000000' + or `chr2:1,000,000-2,000,000'. The coordinate is 1-based. + + OPTIONS: + + -b Output in the BAM format. + + -u Output uncompressed BAM. This option saves time spent + on compression/decomprssion and is thus preferred + when the output is piped to another samtools command. + + -h Include the header in the output. + + -H Output the header only. + + -S Input is in SAM. If @SQ header lines are absent, the + `-t' option is required. + + -t FILE This file is TAB-delimited. Each line must contain + the reference name and the length of the reference, + one line for each distinct reference; additional + fields are ignored. This file also defines the order + of the reference sequences in sorting. If you run + `samtools faidx ', the resultant index file + .fai can be used as this file. + + -o FILE Output file [stdout] + + -f INT Only output alignments with all bits in INT present + in the FLAG field. INT can be in hex in the format of + /^0x[0-9A-F]+/ [0] + + -F INT Skip alignments with bits present in INT [0] + + -q INT Skip alignments with MAPQ smaller than INT [0] + + -l STR Only output reads in library STR [null] + + -r STR Only output reads in read group STR [null] + + + faidx samtools faidx [region1 [...]] + + Index reference sequence in the FASTA format or extract sub- + sequence from indexed reference sequence. If no region is + specified, faidx will index the file and create + .fai on the disk. If regions are speficified, the + subsequences will be retrieved and printed to stdout in the + FASTA format. The input file can be compressed in the RAZF + format. + + + pileup samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l + in.site_list] [-iscgS2] [-T theta] [-N nHap] [-r + pairDiffRate] | + + Print the alignment in the pileup format. In the pileup for- + mat, each line represents a genomic position, consisting of + chromosome name, coordinate, reference base, read bases, read + qualities and alignment mapping qualities. Information on + match, mismatch, indel, strand, mapping quality and start and + end of a read are all encoded at the read base column. At + this column, a dot stands for a match to the reference base + on the forward strand, a comma for a match on the reverse + strand, `ACGTN' for a mismatch on the forward strand and + `acgtn' for a mismatch on the reverse strand. A pattern + `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion + between this reference position and the next reference posi- + tion. The length of the insertion is given by the integer in + the pattern, followed by the inserted sequence. Similarly, a + pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the + reference. The deleted bases will be presented as `*' in the + following lines. Also at the read base column, a symbol `^' + marks the start of a read segment which is a contiguous sub- + sequence on the read separated by `N/S/H' CIGAR operations. + The ASCII of the character following `^' minus 33 gives the + mapping quality. A symbol `$' marks the end of a read seg- + ment. + + If option -c is applied, the consensus base, consensus qual- + ity, SNP quality and RMS mapping quality of the reads cover- + ing the site will be inserted between the `reference base' + and the `read bases' columns. An indel occupies an additional + line. Each indel line consists of chromosome name, coordi- + nate, a star, the genotype, consensus quality, SNP quality, + RMS mapping quality, # covering reads, the first alllele, the + second allele, # reads supporting the first allele, # reads + supporting the second allele and # reads containing indels + different from the top two alleles. + + OPTIONS: + + + -s Print the mapping quality as the last column. This + option makes the output easier to parse, although + this format is not space efficient. + + + -S The input file is in SAM. + + + -i Only output pileup lines containing indels. + + + -f FILE The reference sequence in the FASTA format. Index + file FILE.fai will be created if absent. + + + -M INT Cap mapping quality at INT [60] + + + -t FILE List of reference names ane sequence lengths, in + the format described for the import command. If + this option is present, samtools assumes the input + is in SAM format; otherwise it + assumes in BAM format. + + + -l FILE List of sites at which pileup is output. This file + is space delimited. The first two columns are + required to be chromosome and 1-based coordinate. + Additional columns are ignored. It is recommended + to use option -s together with -l as in the default + format we may not know the mapping quality. + + + -c Call the consensus sequence using MAQ consensus + model. Options -T, -N, -I and -r are only effective + when -c or -g is in use. + + + -g Generate genotype likelihood in the binary GLFv3 + format. This option suppresses -c, -i and -s. + + + -T FLOAT The theta parameter (error dependency coefficient) + in the maq consensus calling model [0.85] + + + -N INT Number of haplotypes in the sample (>=2) [2] + + + -r FLOAT Expected fraction of differences between a pair of + haplotypes [0.001] + + + -I INT Phred probability of an indel in sequencing/prep. + [40] + + + + tview samtools tview [ref.fasta] + + Text alignment viewer (based on the ncurses library). In the + viewer, press `?' for help and press `g' to check the align- + ment start from a region in the format like + `chr10:10,000,000'. + + + + fixmate samtools fixmate + + Fill in mate coordinates, ISIZE and mate related flags from a + name-sorted alignment. + + + rmdup samtools rmdup + + Remove potential PCR duplicates: if multiple read pairs have + identical external coordinates, only retain the pair with + highest mapping quality. This command ONLY works with FR + orientation and requires ISIZE is correctly set. + + + + rmdupse samtools rmdupse + + Remove potential duplicates for single-ended reads. This com- + mand will treat all reads as single-ended even if they are + paired in fact. + + + + fillmd samtools fillmd [-e] + + Generate the MD tag. If the MD tag is already present, this + command will give a warning if the MD tag generated is dif- + ferent from the existing tag. + + OPTIONS: + + -e Convert a the read base to = if it is identical to + the aligned reference base. Indel caller does not + support the = bases at the moment. + + + +SAM FORMAT + SAM is TAB-delimited. Apart from the header lines, which are started + with the `@' symbol, each alignment line consists of: + + + +----+-------+----------------------------------------------------------+ + |Col | Field | Description | + +----+-------+----------------------------------------------------------+ + | 1 | QNAME | Query (pair) NAME | + | 2 | FLAG | bitwise FLAG | + | 3 | RNAME | Reference sequence NAME | + | 4 | POS | 1-based leftmost POSition/coordinate of clipped sequence | + | 5 | MAPQ | MAPping Quality (Phred-scaled) | + | 6 | CIAGR | extended CIGAR string | + | 7 | MRNM | Mate Reference sequence NaMe (`=' if same as RNAME) | + | 8 | MPOS | 1-based Mate POSistion | + | 9 | ISIZE | Inferred insert SIZE | + |10 | SEQ | query SEQuence on the same strand as the reference | + |11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) | + |12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE | + +----+-------+----------------------------------------------------------+ + + Each bit in the FLAG field is defined as: + + + +-------+--------------------------------------------------+ + | Flag | Description | + +-------+--------------------------------------------------+ + |0x0001 | the read is paired in sequencing | + |0x0002 | the read is mapped in a proper pair | + |0x0004 | the query sequence itself is unmapped | + |0x0008 | the mate is unmapped | + |0x0010 | strand of the query (1 for reverse) | + |0x0020 | strand of the mate | + |0x0040 | the read is the first read in a pair | + |0x0080 | the read is the second read in a pair | + |0x0100 | the alignment is not primary | + |0x0200 | the read fails platform/vendor quality checks | + |0x0400 | the read is either a PCR or an optical duplicate | + +-------+--------------------------------------------------+ + +LIMITATIONS + o Unaligned words used in bam_import.c, bam_endian.h, bam.c and + bam_aux.c. + + o CIGAR operation P is not properly handled at the moment. + + o In merging, the input files are required to have the same number of + reference sequences. The requirement can be relaxed. In addition, + merging does not reconstruct the header dictionaries automatically. + Endusers have to provide the correct header. Picard is better at + merging. + + o Samtools' rmdup does not work for single-end data and does not remove + duplicates across chromosomes. Picard is better. + + +AUTHOR + Heng Li from the Sanger Institute wrote the C version of samtools. Bob + Handsaker from the Broad Institute implemented the BGZF library and Jue + Ruan from Beijing Genomics Institute wrote the RAZF library. Various + people in the 1000Genomes Project contributed to the SAM format speci- + fication. + + +SEE ALSO + Samtools website: + + + +samtools-0.1.6 2 September 2009 samtools(1)