X-Git-Url: http://woldlab.caltech.edu/gitweb/?p=samtools.git;a=blobdiff_plain;f=samtools.txt;h=20e6c15cd707d11a7d1ed41f15b9d54732cd67ec;hp=63e6a25ec72823931cda29bd2ffd9ddad745b2db;hb=cefa18095b2479339b08111936313066ec548657;hpb=d363084f0412f3bcdeb0304aeb0974c9a10c7649

diff --git a/samtools.txt b/samtools.txt
index 63e6a25..20e6c15 100644
--- a/samtools.txt
+++ b/samtools.txt
@@ -12,6 +12,8 @@ SYNOPSIS
        samtools index aln.sorted.bam

+       samtools idxstats aln.sorted.bam
+
        samtools view aln.sorted.bam chr2:20,100,000-20,200,000

        samtools merge out.bam in1.bam in2.bam in3.bam
@@ -20,6 +22,8 @@ SYNOPSIS
        samtools pileup -f ref.fasta aln.sorted.bam

+       samtools mpileup -f ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
+
        samtools tview aln.sorted.bam ref.fasta

@@ -43,68 +47,20 @@ DESCRIPTION
 COMMANDS AND OPTIONS
-       import    samtools import
-
-                 Since 0.1.4, this command is an alias of:
-
-                 samtools view -bt -o
-
-
-       sort      samtools sort [-n] [-m maxMem]
-
-                 Sort alignments by leftmost coordinates. File
-                 .bam will be created. This command may also create tempo-
-                 rary files .%d.bam when the whole alignment can-
-                 not be fitted into memory (controlled by option -m).
-
-                 OPTIONS:
-
-                 -n      Sort by read names rather than by chromosomal coordi-
-                         nates
-
-                 -m INT  Approximately the maximum required memory.
-                         [500000000]
-
-
-       merge     samtools merge [-h inh.sam] [-n]
-                 [...]
-
-                 Merge multiple sorted alignments. The header reference lists
-                 of all the input BAM files, and the @SQ headers of inh.sam,
-                 if any, must all refer to the same set of reference
-                 sequences. The header reference list and (unless overridden
-                 by -h) `@' headers of in1.bam will be copied to out.bam, and
-                 the headers of other files will be ignored.
-
-                 OPTIONS:
-
-                 -h FILE Use the lines of FILE as `@' headers to be copied to
-                         out.bam, replacing any header lines that would other-
-                         wise be copied from in1.bam. (FILE is actually in
-                         SAM format, though any alignment records it may con-
-                         tain are ignored.)
-
-                 -n      The input alignments are sorted by read names rather
-                         than by chromosomal coordinates
-
-
-       index     samtools index
-
-                 Index sorted alignment for fast random access. Index file
-                 .bai will be created.
-
-
        view      samtools view [-bhuHS] [-t in.refList] [-o output] [-f
-                 reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read-
-                 Group] | [region1 [...]]
+                 reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read-
+                 Group] [-R rgFile] | [region1 [...]]

-                 Extract/print all or sub alignments in SAM or BAM format. If
-                 no region is specified, all the alignments will be printed;
-                 otherwise only alignments overlapping the specified regions
-                 will be output. An alignment may be given multiple times if
+                 Extract/print all or sub alignments in SAM or BAM format. If
+                 no region is specified, all the alignments will be printed;
+                 otherwise only alignments overlapping the specified regions
+                 will be output. An alignment may be given multiple times if
                  it is overlapping several regions. A region can be presented,
-                 for example, in the following format: `chr2', `chr2:1000000'
-                 or `chr2:1,000,000-2,000,000'. The coordinate is 1-based.
+                 for example, in the following format: `chr2' (the whole
+                 chr2), `chr2:1000000' (region starting from 1,000,000bp) or
+                 `chr2:1,000,000-2,000,000' (region between 1,000,000 and
+                 2,000,000bp including the end points). The coordinate is
+                 1-based.

                  OPTIONS:

@@ -143,16 +99,16 @@ COMMANDS AND OPTIONS
                  -r STR  Only output reads in read group STR [null]

+                 -R FILE Output reads in read groups listed in FILE [null]

-       faidx     samtools faidx [region1 [...]]

-                 Index reference sequence in the FASTA format or extract sub-
-                 sequence from indexed reference sequence. If no region is
-                 specified, faidx will index the file and create
-                 .fai on the disk. If regions are speficified, the
-                 subsequences will be retrieved and printed to stdout in the
-                 FASTA format. The input file can be compressed in the RAZF
-                 format.
+       tview     samtools tview [ref.fasta]
+
+                 Text alignment viewer (based on the ncurses library). In the
+                 viewer, press `?' for help and press `g' to check the align-
+                 ment start from a region in the format like
+                 `chr10:10,000,000' or `=10,000,000' when viewing the same
+                 reference sequence.


        pileup    samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l
@@ -182,10 +138,12 @@ COMMANDS AND OPTIONS
                  mapping quality. A symbol `$' marks the end of a read seg-
                  ment.

-                 If option -c is applied, the consensus base, consensus qual-
-                 ity, SNP quality and RMS mapping quality of the reads cover-
-                 ing the site will be inserted between the `reference base'
-                 and the `read bases' columns. An indel occupies an additional
+                 If option -c is applied, the consensus base, Phred-scaled
+                 consensus quality, SNP quality (i.e. the Phred-scaled proba-
+                 bility of the consensus being identical to the reference) and
+                 root mean square (RMS) mapping quality of the reads covering
+                 the site will be inserted between the `reference base' and
+                 the `read bases' columns. An indel occupies an additional
                  line. Each indel line consists of chromosome name, coordi-
                  nate, a star, the genotype, consensus quality, SNP quality,
                  RMS mapping quality, # covering reads, the first alllele, the
@@ -193,26 +151,28 @@ COMMANDS AND OPTIONS
                  supporting the second allele and # reads containing indels
                  different from the top two alleles.

-                 OPTIONS:
+                 The position of indels is offset by -1.

+                 OPTIONS:

                  -s      Print the mapping quality as the last column. This
                          option makes the output easier to parse, although
                          this format is not space efficient.

-
                  -S      The input file is in SAM.

-
                  -i      Only output pileup lines containing indels.

-
                  -f FILE The reference sequence in the FASTA format. Index
                          file FILE.fai will be created if absent.

-
                  -M INT  Cap mapping quality at INT [60]

+                 -m INT  Filter reads with flag containing bits in INT
+                         [1796]
+
+                 -d INT  Use the first NUM reads in the pileup for indel
+                         calling for speed up. Zero for unlimited. [0]

                  -t FILE List of reference names ane sequence lengths, in the
                          format described for the import command. If
@@ -220,7 +180,6 @@ COMMANDS AND OPTIONS
                          is in SAM format; otherwise it assumes in BAM
                          format.

-
                  -l FILE List of sites at which pileup is output. This file
                          is space delimited. The first two columns are
                          required to be chromosome and 1-based coordinate.
@@ -228,40 +187,110 @@ COMMANDS AND OPTIONS
                          to use option -s together with -l as in the default
                          format we may not know the mapping quality.

-
-                 -c      Call the consensus sequence using MAQ consensus
+                 -c      Call the consensus sequence using SOAPsnp consensus
                          model. Options -T, -N, -I and -r are only effective
                          when -c or -g is in use.

-
                  -g      Generate genotype likelihood in the binary GLFv3
                          format. This option suppresses -c, -i and -s.

-
                  -T FLOAT The theta parameter (error dependency coefficient)
                          in the maq consensus calling model [0.85]

-
                  -N INT  Number of haplotypes in the sample (>=2) [2]

-
                  -r FLOAT Expected fraction of differences between a pair of
                          haplotypes [0.001]

-
                  -I INT  Phred probability of an indel in sequencing/prep. [40]

+       mpileup   samtools mpileup [-r reg] [-f in.fa] in.bam [in2.bam [...]]

-       tview     samtools tview [ref.fasta]
+                 Generate pileup for multiple BAM files. Consensus calling is
+                 not implemented.

-                 Text alignment viewer (based on the ncurses library). In the
-                 viewer, press `?' for help and press `g' to check the align-
-                 ment start from a region in the format like
-                 `chr10:10,000,000'.
+                 OPTIONS:
+
+                 -r STR  Only generate pileup in region STR [all sites]
+
+                 -f FILE The reference file [null]

+       reheader  samtools reheader
+
+                 Replace the header in in.bam with the header in
+                 in.header.sam. This command is much faster than replacing
+                 the header with a BAM->SAM->BAM conversion.
+
+
+       sort      samtools sort [-no] [-m maxMem]
+
+                 Sort alignments by leftmost coordinates. File
+                 .bam will be created. This command may also create tempo-
+                 rary files .%d.bam when the whole alignment can-
+                 not be fitted into memory (controlled by option -m).
+
+                 OPTIONS:
+
+                 -o      Output the final alignment to the standard output.
+
+                 -n      Sort by read names rather than by chromosomal coordi-
+                         nates
+
+                 -m INT  Approximately the maximum required memory.
+                         [500000000]
+
+
+       merge     samtools merge [-h inh.sam] [-nr]
+                 [...]
+
+                 Merge multiple sorted alignments. The header reference lists
+                 of all the input BAM files, and the @SQ headers of inh.sam,
+                 if any, must all refer to the same set of reference
+                 sequences. The header reference list and (unless overridden
+                 by -h) `@' headers of in1.bam will be copied to out.bam, and
+                 the headers of other files will be ignored.
+
+                 OPTIONS:
+
+                 -h FILE Use the lines of FILE as `@' headers to be copied to
+                         out.bam, replacing any header lines that would other-
+                         wise be copied from in1.bam. (FILE is actually in
+                         SAM format, though any alignment records it may con-
+                         tain are ignored.)
+
+                 -r      Attach an RG tag to each alignment. The tag value is
+                         inferred from file names.
+
+                 -n      The input alignments are sorted by read names rather
+                         than by chromosomal coordinates
+
+
+       index     samtools index
+
+                 Index sorted alignment for fast random access. Index file
+                 .bai will be created.
+
+
+       idxstats  samtools idxstats
+
+                 Retrieve and print stats in the index file. The output is TAB
+                 delimited with each line consisting of reference sequence
+                 name, sequence length, # mapped reads and # unmapped reads.
+
+
+       faidx     samtools faidx [region1 [...]]
+
+                 Index reference sequence in the FASTA format or extract sub-
+                 sequence from indexed reference sequence. If no region is
+                 specified, faidx will index the file and create
+                 .fai on the disk. If regions are speficified, the
+                 subsequences will be retrieved and printed to stdout in the
+                 FASTA format. The input file can be compressed in the RAZF
+                 format.
+

        fixmate   samtools fixmate

@@ -269,39 +298,44 @@ COMMANDS AND OPTIONS
                  name-sorted alignment.


-       rmdup     samtools rmdup
+       rmdup     samtools rmdup [-sS]

                  Remove potential PCR duplicates: if multiple read pairs have
                  identical external coordinates, only retain the pair with
-                 highest mapping quality. This command ONLY works with FR
-                 orientation and requires ISIZE is correctly set.
-
+                 highest mapping quality. In the paired-end mode, this com-
+                 mand ONLY works with FR orientation and requires ISIZE is
+                 correctly set. It does not work for unpaired reads (e.g. two
+                 ends mapped to different chromosomes or orphan reads).

+                 OPTIONS:

-       rmdupse   samtools rmdupse
-
-                 Remove potential duplicates for single-ended reads. This com-
-                 mand will treat all reads as single-ended even if they are
-                 paired in fact.
+                 -s      Remove duplicate for single-end reads. By default,
+                         the command works for paired-end reads only.

+                 -S      Treat paired-end reads and single-end reads.


-       fillmd    samtools fillmd [-e]
+       calmd     samtools calmd [-eubS]

-                 Generate the MD tag. If the MD tag is already present, this
-                 command will give a warning if the MD tag generated is dif-
-                 ferent from the existing tag.
+                 Generate the MD tag. If the MD tag is already present, this
+                 command will give a warning if the MD tag generated is dif-
+                 ferent from the existing tag. Output SAM by default.

                  OPTIONS:

-                 -e      Convert a the read base to = if it is identical to
-                         the aligned reference base. Indel caller does not
+                 -e      Convert a the read base to = if it is identical to
+                         the aligned reference base. Indel caller does not
                          support the = bases at the moment.

+                 -u      Output uncompressed BAM
+
+                 -b      Output compressed BAM
+
+                 -S      The input is SAM with header lines

 SAM FORMAT
-       SAM is TAB-delimited. Apart from the header lines, which are started
+       SAM is TAB-delimited. Apart from the header lines, which are started
        with the `@' symbol, each alignment line consists of:

@@ -325,43 +359,43 @@ SAM FORMAT
        Each bit in the FLAG field is defined as:

-          +-------+--------------------------------------------------+
-          | Flag  |                   Description                    |
-          +-------+--------------------------------------------------+
-          |0x0001 | the read is paired in sequencing                 |
-          |0x0002 | the read is mapped in a proper pair              |
-          |0x0004 | the query sequence itself is unmapped            |
-          |0x0008 | the mate is unmapped                             |
-          |0x0010 | strand of the query (1 for reverse)              |
-          |0x0020 | strand of the mate                               |
-          |0x0040 | the read is the first read in a pair             |
-          |0x0080 | the read is the second read in a pair            |
-          |0x0100 | the alignment is not primary                     |
-          |0x0200 | the read fails platform/vendor quality checks    |
-          |0x0400 | the read is either a PCR or an optical duplicate |
-          +-------+--------------------------------------------------+
+          +-------+-----+--------------------------------------------------+
+          | Flag  | Chr |                   Description                    |
+          +-------+-----+--------------------------------------------------+
+          |0x0001 |  p  | the read is paired in sequencing                 |
+          |0x0002 |  P  | the read is mapped in a proper pair              |
+          |0x0004 |  u  | the query sequence itself is unmapped            |
+          |0x0008 |  U  | the mate is unmapped                             |
+          |0x0010 |  r  | strand of the query (1 for reverse)              |
+          |0x0020 |  R  | strand of the mate                               |
+          |0x0040 |  1  | the read is the first read in a pair             |
+          |0x0080 |  2  | the read is the second read in a pair            |
+          |0x0100 |  s  | the alignment is not primary                     |
+          |0x0200 |  f  | the read fails platform/vendor quality checks    |
+          |0x0400 |  d  | the read is either a PCR or an optical duplicate |
+          +-------+-----+--------------------------------------------------+

 LIMITATIONS
-       o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
+       o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
          bam_aux.c.

-       o CIGAR operation P is not properly handled at the moment.
-
-       o In merging, the input files are required to have the same number of
-         reference sequences. The requirement can be relaxed. In addition,
-         merging does not reconstruct the header dictionaries automatically.
-         Endusers have to provide the correct header. Picard is better at
+       o In merging, the input files are required to have the same number of
+         reference sequences. The requirement can be relaxed. In addition,
+         merging does not reconstruct the header dictionaries automatically.
+         Endusers have to provide the correct header. Picard is better at
          merging.

-       o Samtools' rmdup does not work for single-end data and does not remove
-         duplicates across chromosomes. Picard is better.
+       o Samtools paired-end rmdup does not work for unpaired reads (e.g.
+         orphan reads or ends mapped to different chromosomes). If this is a
+         concern, please use Picard's MarkDuplicate which correctly handles
+         these cases, although a little slower.

 AUTHOR
-       Heng Li from the Sanger Institute wrote the C version of samtools. Bob
+       Heng Li from the Sanger Institute wrote the C version of samtools. Bob
        Handsaker from the Broad Institute implemented the BGZF library and Jue
-       Ruan from Beijing Genomics Institute wrote the RAZF library. Various
-       people in the 1000Genomes Project contributed to the SAM format speci-
+       Ruan from Beijing Genomics Institute wrote the RAZF library. Various
+       people in the 1000 Genomes Project contributed to the SAM format speci-
        fication.

@@ -370,4 +404,4 @@ SEE ALSO
-samtools-0.1.6                 2 September 2009                  samtools(1)
+samtools-0.1.8                   11 July 2010                    samtools(1)
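
Note on the FLAG table in the patch above: the new `Chr` column gives each FLAG bit a one-character code, and the new pileup option `-m INT` defaults to 1796. The sketch below is only an illustration of how those two pieces fit together, not code from samtools. The names `FLAG_CHARS`, `flag_to_chars` and `is_filtered` are hypothetical, and the reading of `-m` as "drop a read if its flag shares any bit with INT" is an assumption based on the option text "Filter reads with flag containing bits in INT".

```python
# Bit values and one-character codes copied from the FLAG table in the patch.
FLAG_CHARS = [
    (0x0001, "p"),  # the read is paired in sequencing
    (0x0002, "P"),  # the read is mapped in a proper pair
    (0x0004, "u"),  # the query sequence itself is unmapped
    (0x0008, "U"),  # the mate is unmapped
    (0x0010, "r"),  # strand of the query (1 for reverse)
    (0x0020, "R"),  # strand of the mate
    (0x0040, "1"),  # the read is the first read in a pair
    (0x0080, "2"),  # the read is the second read in a pair
    (0x0100, "s"),  # the alignment is not primary
    (0x0200, "f"),  # the read fails platform/vendor quality checks
    (0x0400, "d"),  # the read is either a PCR or an optical duplicate
]


def flag_to_chars(flag: int) -> str:
    """Return the codes of all bits set in `flag`, in table order."""
    return "".join(ch for bit, ch in FLAG_CHARS if flag & bit)


def is_filtered(flag: int, mask: int = 1796) -> bool:
    """Assumed reading of `-m`: True if `flag` has any bit of `mask` set."""
    return (flag & mask) != 0


if __name__ == "__main__":
    print(flag_to_chars(1796))  # bits behind the default -m value
    print(flag_to_chars(99))    # a typical properly paired first read
    print(is_filtered(99), is_filtered(1024))
```

Under that assumption, the default mask 1796 (0x704) corresponds to the codes `usfd`: unmapped, not primary, failing quality checks, or duplicate.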