X-Git-Url: http://woldlab.caltech.edu/gitweb/?p=samtools.git;a=blobdiff_plain;f=samtools.txt;h=feec2386e7c9694c12164fda385ce33d06d1f9f7;hp=63e6a25ec72823931cda29bd2ffd9ddad745b2db;hb=317f5e8dd22cc9e1e5e05fbcaeb3b9aca7447351;hpb=4a17fa7e1f91b2fe04ad334a63fc2b0d5e859d8a diff --git a/samtools.txt b/samtools.txt index 63e6a25..feec238 100644 --- a/samtools.txt +++ b/samtools.txt @@ -103,35 +103,38 @@ COMMANDS AND OPTIONS otherwise only alignments overlapping the specified regions will be output. An alignment may be given multiple times if it is overlapping several regions. A region can be presented, - for example, in the following format: `chr2', `chr2:1000000' - or `chr2:1,000,000-2,000,000'. The coordinate is 1-based. + for example, in the following format: `chr2' (the whole + chr2), `chr2:1000000' (region starting from 1,000,000bp) or + `chr2:1,000,000-2,000,000' (region between 1,000,000 and + 2,000,000bp including the end points). The coordinate is + 1-based. OPTIONS: -b Output in the BAM format. -u Output uncompressed BAM. This option saves time spent - on compression/decomprssion and is thus preferred + on compression/decomprssion and is thus preferred when the output is piped to another samtools command. -h Include the header in the output. -H Output the header only. - -S Input is in SAM. If @SQ header lines are absent, the + -S Input is in SAM. If @SQ header lines are absent, the `-t' option is required. - -t FILE This file is TAB-delimited. Each line must contain - the reference name and the length of the reference, - one line for each distinct reference; additional - fields are ignored. This file also defines the order - of the reference sequences in sorting. If you run - `samtools faidx ', the resultant index file - .fai can be used as this file. + -t FILE This file is TAB-delimited. Each line must contain + the reference name and the length of the reference, + one line for each distinct reference; additional + fields are ignored. 
This file also defines the order + of the reference sequences in sorting. If you run + `samtools faidx ', the resultant index file + .fai can be used as this file. -o FILE Output file [stdout] - -f INT Only output alignments with all bits in INT present + -f INT Only output alignments with all bits in INT present in the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0] @@ -146,58 +149,60 @@ COMMANDS AND OPTIONS faidx samtools faidx [region1 [...]] - Index reference sequence in the FASTA format or extract sub- - sequence from indexed reference sequence. If no region is + Index reference sequence in the FASTA format or extract sub- + sequence from indexed reference sequence. If no region is specified, faidx will index the file and create - .fai on the disk. If regions are speficified, the - subsequences will be retrieved and printed to stdout in the - FASTA format. The input file can be compressed in the RAZF + .fai on the disk. If regions are speficified, the + subsequences will be retrieved and printed to stdout in the + FASTA format. The input file can be compressed in the RAZF format. - pileup samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l - in.site_list] [-iscgS2] [-T theta] [-N nHap] [-r + pileup samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l + in.site_list] [-iscgS2] [-T theta] [-N nHap] [-r pairDiffRate] | - Print the alignment in the pileup format. In the pileup for- - mat, each line represents a genomic position, consisting of + Print the alignment in the pileup format. In the pileup for- + mat, each line represents a genomic position, consisting of chromosome name, coordinate, reference base, read bases, read - qualities and alignment mapping qualities. Information on + qualities and alignment mapping qualities. Information on match, mismatch, indel, strand, mapping quality and start and - end of a read are all encoded at the read base column. 
At - this column, a dot stands for a match to the reference base - on the forward strand, a comma for a match on the reverse - strand, `ACGTN' for a mismatch on the forward strand and - `acgtn' for a mismatch on the reverse strand. A pattern - `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion - between this reference position and the next reference posi- - tion. The length of the insertion is given by the integer in - the pattern, followed by the inserted sequence. Similarly, a + end of a read are all encoded at the read base column. At + this column, a dot stands for a match to the reference base + on the forward strand, a comma for a match on the reverse + strand, `ACGTN' for a mismatch on the forward strand and + `acgtn' for a mismatch on the reverse strand. A pattern + `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion + between this reference position and the next reference posi- + tion. The length of the insertion is given by the integer in + the pattern, followed by the inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the - reference. The deleted bases will be presented as `*' in the - following lines. Also at the read base column, a symbol `^' - marks the start of a read segment which is a contiguous sub- - sequence on the read separated by `N/S/H' CIGAR operations. - The ASCII of the character following `^' minus 33 gives the - mapping quality. A symbol `$' marks the end of a read seg- + reference. The deleted bases will be presented as `*' in the + following lines. Also at the read base column, a symbol `^' + marks the start of a read segment which is a contiguous sub- + sequence on the read separated by `N/S/H' CIGAR operations. + The ASCII of the character following `^' minus 33 gives the + mapping quality. A symbol `$' marks the end of a read seg- ment. 
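The read-base encoding described above can be illustrated with a small tokenizer. This is only a sketch in Python, not code from samtools: the function name `decode_read_bases` and the event labels are hypothetical, and it assumes the column is well formed and follows exactly the symbols listed in the text.

```python
import re


def decode_read_bases(column):
    """Split a pileup read-base column into (kind, value) events.

    Hypothetical helper for illustration; the event names are not
    part of samtools.
    """
    events = []
    i = 0
    while i < len(column):
        ch = column[i]
        if ch == "^":
            # ASCII of the character following '^' minus 33 is the mapping quality.
            events.append(("start", ord(column[i + 1]) - 33))
            i += 2
        elif ch == "$":
            events.append(("end", None))
            i += 1
        elif ch in "+-":
            # '+' or '-' is followed by an integer length and then that many bases.
            m = re.match(r"[0-9]+", column[i + 1:])
            if m is None:
                raise ValueError("indel marker without a length at offset %d" % i)
            length = int(m.group())
            seq_start = i + 1 + len(m.group())
            seq = column[seq_start:seq_start + length]
            events.append(("insertion" if ch == "+" else "deletion", seq))
            i = seq_start + length
        elif ch == ".":
            events.append(("match", "+"))  # match on the forward strand
            i += 1
        elif ch == ",":
            events.append(("match", "-"))  # match on the reverse strand
            i += 1
        elif ch == "*":
            events.append(("deleted", None))  # base removed by an earlier deletion
            i += 1
        elif ch in "ACGTN":
            events.append(("mismatch", (ch, "+")))
            i += 1
        elif ch in "acgtn":
            events.append(("mismatch", (ch.upper(), "-")))
            i += 1
        else:
            raise ValueError("unexpected symbol %r at offset %d" % (ch, i))
    return events
```

Reading the integer first and then taking exactly that many bases matters: a mismatch base may directly follow an inserted or deleted sequence, so a greedy match on `[ACGTNacgtn]+` would swallow it.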
- If option -c is applied, the consensus base, consensus qual- - ity, SNP quality and RMS mapping quality of the reads cover- - ing the site will be inserted between the `reference base' - and the `read bases' columns. An indel occupies an additional - line. Each indel line consists of chromosome name, coordi- - nate, a star, the genotype, consensus quality, SNP quality, + If option -c is applied, the consensus base, Phred-scaled + consensus quality, SNP quality (i.e. the Phred-scaled proba- + bility of the consensus being identical to the reference) and + root mean square (RMS) mapping quality of the reads covering + the site will be inserted between the `reference base' and + the `read bases' columns. An indel occupies an additional + line. Each indel line consists of chromosome name, coordi- + nate, a star, the genotype, consensus quality, SNP quality, RMS mapping quality, # covering reads, the first alllele, the - second allele, # reads supporting the first allele, # reads - supporting the second allele and # reads containing indels + second allele, # reads supporting the first allele, # reads + supporting the second allele and # reads containing indels different from the top two alleles. OPTIONS: - -s Print the mapping quality as the last column. This - option makes the output easier to parse, although + -s Print the mapping quality as the last column. This + option makes the output easier to parse, although this format is not space efficient. @@ -207,62 +212,61 @@ COMMANDS AND OPTIONS -i Only output pileup lines containing indels. - -f FILE The reference sequence in the FASTA format. Index + -f FILE The reference sequence in the FASTA format. Index file FILE.fai will be created if absent. -M INT Cap mapping quality at INT [60] - -t FILE List of reference names ane sequence lengths, in - the format described for the import command. 
If - this option is present, samtools assumes the input + -t FILE List of reference names ane sequence lengths, in + the format described for the import command. If + this option is present, samtools assumes the input is in SAM format; otherwise it assumes in BAM format. - -l FILE List of sites at which pileup is output. This file - is space delimited. The first two columns are - required to be chromosome and 1-based coordinate. - Additional columns are ignored. It is recommended + -l FILE List of sites at which pileup is output. This file + is space delimited. The first two columns are + required to be chromosome and 1-based coordinate. + Additional columns are ignored. It is recommended to use option -s together with -l as in the default format we may not know the mapping quality. - -c Call the consensus sequence using MAQ consensus + -c Call the consensus sequence using MAQ consensus model. Options -T, -N, -I and -r are only effective when -c or -g is in use. - -g Generate genotype likelihood in the binary GLFv3 + -g Generate genotype likelihood in the binary GLFv3 format. This option suppresses -c, -i and -s. - -T FLOAT The theta parameter (error dependency coefficient) + -T FLOAT The theta parameter (error dependency coefficient) in the maq consensus calling model [0.85] -N INT Number of haplotypes in the sample (>=2) [2] - -r FLOAT Expected fraction of differences between a pair of + -r FLOAT Expected fraction of differences between a pair of haplotypes [0.001] - -I INT Phred probability of an indel in sequencing/prep. + -I INT Phred probability of an indel in sequencing/prep. [40] tview samtools tview [ref.fasta] - Text alignment viewer (based on the ncurses library). In the - viewer, press `?' for help and press `g' to check the align- - ment start from a region in the format like + Text alignment viewer (based on the ncurses library). In the + viewer, press `?' 
for help and press `g' to check the align- + ment start from a region in the format like `chr10:10,000,000'. - fixmate samtools fixmate Fill in mate coordinates, ISIZE and mate related flags from a @@ -271,37 +275,35 @@ COMMANDS AND OPTIONS rmdup samtools rmdup - Remove potential PCR duplicates: if multiple read pairs have - identical external coordinates, only retain the pair with - highest mapping quality. This command ONLY works with FR + Remove potential PCR duplicates: if multiple read pairs have + identical external coordinates, only retain the pair with + highest mapping quality. This command ONLY works with FR orientation and requires ISIZE is correctly set. - rmdupse samtools rmdupse Remove potential duplicates for single-ended reads. This com- - mand will treat all reads as single-ended even if they are + mand will treat all reads as single-ended even if they are paired in fact. - fillmd samtools fillmd [-e] - Generate the MD tag. If the MD tag is already present, this - command will give a warning if the MD tag generated is dif- + Generate the MD tag. If the MD tag is already present, this + command will give a warning if the MD tag generated is dif- ferent from the existing tag. OPTIONS: - -e Convert a the read base to = if it is identical to - the aligned reference base. Indel caller does not + -e Convert a the read base to = if it is identical to + the aligned reference base. Indel caller does not support the = bases at the moment. SAM FORMAT - SAM is TAB-delimited. Apart from the header lines, which are started + SAM is TAB-delimited. Apart from the header lines, which are started with the `@' symbol, each alignment line consists of: @@ -342,15 +344,15 @@ SAM FORMAT +-------+--------------------------------------------------+ LIMITATIONS - o Unaligned words used in bam_import.c, bam_endian.h, bam.c and + o Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c. o CIGAR operation P is not properly handled at the moment. 
- o In merging, the input files are required to have the same number of - reference sequences. The requirement can be relaxed. In addition, - merging does not reconstruct the header dictionaries automatically. - Endusers have to provide the correct header. Picard is better at + o In merging, the input files are required to have the same number of + reference sequences. The requirement can be relaxed. In addition, + merging does not reconstruct the header dictionaries automatically. + Endusers have to provide the correct header. Picard is better at merging. o Samtools' rmdup does not work for single-end data and does not remove @@ -358,10 +360,10 @@ LIMITATIONS AUTHOR - Heng Li from the Sanger Institute wrote the C version of samtools. Bob + Heng Li from the Sanger Institute wrote the C version of samtools. Bob Handsaker from the Broad Institute implemented the BGZF library and Jue - Ruan from Beijing Genomics Institute wrote the RAZF library. Various - people in the 1000Genomes Project contributed to the SAM format speci- + Ruan from Beijing Genomics Institute wrote the RAZF library. Various + people in the 1000Genomes Project contributed to the SAM format speci- fication. @@ -370,4 +372,4 @@ SEE ALSO -samtools-0.1.6 2 September 2009 samtools(1) +samtools-0.1.7 10 November 2009 samtools(1)
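
The region syntax that this patch clarifies for the view command (`chr2', `chr2:1000000', `chr2:1,000,000-2,000,000', 1-based with both end points included) can be sketched as a small parser. This is an illustration in Python, not the samtools implementation: the name `parse_region` is hypothetical, and it assumes the reference name itself contains no colon.

```python
def parse_region(region):
    """Return (name, start, end) for a region string.

    Coordinates are 1-based and inclusive, as in the text; start and
    end are None when the region string does not give them.
    Hypothetical helper for illustration only.
    """
    name, colon, span = region.partition(":")
    if not colon:
        # `chr2': the whole reference sequence.
        return name, None, None
    # Thousands separators are allowed, so drop the commas before parsing.
    start_text, dash, end_text = span.replace(",", "").partition("-")
    start = int(start_text)
    # `chr2:1000000': a region starting at the given position, with no stated end.
    end = int(end_text) if dash else None
    return name, start, end
```

For example, `parse_region("chr2:1,000,000-2,000,000")` yields the name `chr2` with start 1000000 and end 2000000, both inside the region.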