+------------------------------------------------------------------------
+r925 | lh3lh3 | 2011-02-28 15:45:17 -0500 (Mon, 28 Feb 2011) | 2 lines
+Changed paths:
+ M /trunk/samtools/phase.c
+
+minor changes to a heuristic rule
+
+------------------------------------------------------------------------
+r924 | lh3lh3 | 2011-02-28 15:24:04 -0500 (Mon, 28 Feb 2011) | 4 lines
+Changed paths:
+ M /trunk/samtools/bam.h
+ M /trunk/samtools/bcftools/vcfutils.pl
+ M /trunk/samtools/phase.c
+
+ * 0.1.12-r924:126
+ * fixed a bug in phase (due to recent changes)
+ * fixed a bug in vcf2fq
+
+------------------------------------------------------------------------
+r923 | lh3lh3 | 2011-02-28 12:57:39 -0500 (Mon, 28 Feb 2011) | 5 lines
+Changed paths:
+ M /trunk/samtools/Makefile
+ M /trunk/samtools/bam.h
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/phase.c
+
+ * put version number in bam.h
+ * write version to BCF
+ * in phase, change the default -q to 37
+ * output a little more information during phasing
+
+------------------------------------------------------------------------
+r922 | lh3lh3 | 2011-02-25 16:40:09 -0500 (Fri, 25 Feb 2011) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/bcftools/bcf.c
+ M /trunk/samtools/bcftools/bcf.tex
+ M /trunk/samtools/bcftools/bcf2qcall.c
+ M /trunk/samtools/bcftools/bcfutils.c
+ M /trunk/samtools/bcftools/ld.c
+ M /trunk/samtools/bcftools/prob1.c
+ M /trunk/samtools/bcftools/vcf.c
+ M /trunk/samtools/cut_target.c
+
+ * change the order of PL/GL according to the latest VCF spec
+ * change the type of SP to int32_t
+
+------------------------------------------------------------------------
+r921 | lh3lh3 | 2011-02-25 14:40:56 -0500 (Fri, 25 Feb 2011) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/bcf.tex
+
+update the BCF spec
+
+------------------------------------------------------------------------
+r920 | lh3lh3 | 2011-02-25 00:59:27 -0500 (Fri, 25 Feb 2011) | 2 lines
+Changed paths:
+ M /trunk/samtools/Makefile
+ M /trunk/samtools/bam_md.c
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bam_sort.c
+ M /trunk/samtools/bamtk.c
+ A /trunk/samtools/cut_target.c
+ M /trunk/samtools/errmod.h
+ M /trunk/samtools/faidx.c
+ M /trunk/samtools/khash.h
+ M /trunk/samtools/kstring.c
+ M /trunk/samtools/kstring.h
+ A /trunk/samtools/phase.c
+ M /trunk/samtools/samtools.1
+
+added the phase command
+
+------------------------------------------------------------------------
+r918 | lh3lh3 | 2011-02-24 10:05:54 -0500 (Thu, 24 Feb 2011) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/prob1.c
+ M /trunk/samtools/bcftools/prob1.h
+
+added "const" to bcf_p1_cal()
+
+------------------------------------------------------------------------
+r917 | lh3lh3 | 2011-02-24 09:36:30 -0500 (Thu, 24 Feb 2011) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam.c
+
+more meaningful BAM truncation message
+
+------------------------------------------------------------------------
+r916 | lh3lh3 | 2011-02-24 09:35:06 -0500 (Thu, 24 Feb 2011) | 3 lines
+Changed paths:
+ M /trunk/samtools/bcftools/bcf.c
+ M /trunk/samtools/bcftools/vcf.c
+
+ * automatically fix errors in GL
+ * output unrecognized FORMAT as "."
+
+------------------------------------------------------------------------
+r913 | lh3lh3 | 2011-02-10 22:59:47 -0500 (Thu, 10 Feb 2011) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/bcf.h
+ M /trunk/samtools/bcftools/call1.c
+ M /trunk/samtools/bcftools/vcf.c
+
+finished VCF->BCF conversion
+
+------------------------------------------------------------------------
+r910 | petulda | 2011-02-03 03:13:48 -0500 (Thu, 03 Feb 2011) | 1 line
+Changed paths:
+ M /trunk/samtools/bcftools/vcfutils.pl
+
+Prevent division by zero
+------------------------------------------------------------------------
+r909 | lh3lh3 | 2011-02-02 11:29:20 -0500 (Wed, 02 Feb 2011) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/call1.c
+
+fixed a typo in the VCF header
+
+------------------------------------------------------------------------
+r908 | lh3lh3 | 2011-02-02 11:28:24 -0500 (Wed, 02 Feb 2011) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam_index.c
+
+ * fixed an out-of-boundary bug
+ * improved sorting order checking in index
+
+------------------------------------------------------------------------
+r907 | lh3lh3 | 2011-01-29 22:59:20 -0500 (Sat, 29 Jan 2011) | 4 lines
+Changed paths:
+ M /trunk/samtools/INSTALL
+ M /trunk/samtools/bam_tview.c
+ M /trunk/samtools/knetfile.c
+
+ * avoid a segfault when network connect fails
+ * update INSTALL
+ * fixed a bug in tview on big-endian by Nathan Weeks
+
+------------------------------------------------------------------------
+r903 | lh3lh3 | 2011-01-27 14:50:02 -0500 (Thu, 27 Jan 2011) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bam_md.c
+
+ * fixed a rare memory issue in bam_md.c
+ * fixed a bug in indel calling related to unmapped and refskip reads
+
+------------------------------------------------------------------------
+r902 | lh3lh3 | 2011-01-23 21:46:18 -0500 (Sun, 23 Jan 2011) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/fet.c
+
+fixed two minor bugs in Fisher's exact test
+
+------------------------------------------------------------------------
+r899 | petulda | 2011-01-19 09:28:02 -0500 (Wed, 19 Jan 2011) | 1 line
+Changed paths:
+ M /trunk/samtools/bcftools/vcfutils.pl
+
+Skip sites with unknown ref
+------------------------------------------------------------------------
+r898 | lh3lh3 | 2011-01-15 12:56:05 -0500 (Sat, 15 Jan 2011) | 2 lines
+Changed paths:
+ M /trunk/samtools/ChangeLog
+ M /trunk/samtools/bam_maqcns.c
+ M /trunk/samtools/bam_md.c
+
+move bam_nt16_nt4_table[] from bam_maqcns.c to bam_md.c
+
+------------------------------------------------------------------------
+r896 | lh3lh3 | 2011-01-06 10:52:15 -0500 (Thu, 06 Jan 2011) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/bcftools/bcf.h
+ M /trunk/samtools/bcftools/bcfutils.c
+ M /trunk/samtools/bcftools/call1.c
+
+ * samtools-0.1.12-10 (r896)
+ * allow to exclude read groups in mpileup
+
+------------------------------------------------------------------------
+r895 | lh3lh3 | 2011-01-04 11:31:29 -0500 (Tue, 04 Jan 2011) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/bcf.tex
+
+sorry. It is SP not ST
+
+------------------------------------------------------------------------
+r894 | lh3lh3 | 2011-01-04 11:29:06 -0500 (Tue, 04 Jan 2011) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/bcf.tex
+
+added ST
+
+------------------------------------------------------------------------
+r893 | petulda | 2011-01-04 06:55:56 -0500 (Tue, 04 Jan 2011) | 1 line
+Changed paths:
+ M /trunk/samtools/bcftools/call1.c
+
+Fixed a typo in read_samples
+------------------------------------------------------------------------
+r892 | jmarshall | 2010-12-28 08:06:49 -0500 (Tue, 28 Dec 2010) | 9 lines
+Changed paths:
+ M /trunk/samtools/Makefile
+ M /trunk/samtools/bcftools/Makefile
+ M /trunk/samtools/examples/Makefile
+
+System libraries go *after* user libraries in link commands, because
+the user libraries may themselves have dependencies that are satisfied
+by the system libraries. It's not rocket science!
+
+This makes a difference with some linkers; or with -static or --as-needed.
+
+The examples/Makefile fix is from Charles Plessy.
+See also http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=606004
+
+------------------------------------------------------------------------
+r891 | lh3lh3 | 2010-12-21 12:16:33 -0500 (Tue, 21 Dec 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/bcftools/bcf.h
+ M /trunk/samtools/bcftools/bcfutils.c
+ M /trunk/samtools/bcftools/call1.c
+
+ * samtools-0.1.12-9 (r891)
+ * allow to call SNPs from a subset of samples
+
+------------------------------------------------------------------------
+r889 | lh3lh3 | 2010-12-15 11:28:16 -0500 (Wed, 15 Dec 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.12-12 (r889)
+ * set mapQ as 20 if it equals 255
+
+------------------------------------------------------------------------
+r888 | lh3lh3 | 2010-12-14 22:41:09 -0500 (Tue, 14 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+
+When -B is applied to mpileup, still use paired reads only unless -A is flagged.
+
+------------------------------------------------------------------------
+r887 | lh3lh3 | 2010-12-14 22:37:05 -0500 (Tue, 14 Dec 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam_md.c
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.12-6 (r887)
+ * added a hidden option -E to mpileup/calmd. -E triggers an alternative way to apply BAQ.
+
+------------------------------------------------------------------------
+r886 | lh3lh3 | 2010-12-14 12:51:03 -0500 (Tue, 14 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bamtk.c
+
+(Arguably) improved the indel caller a tiny bit for lowCov data.
+
+------------------------------------------------------------------------
+r885 | petulda | 2010-12-14 04:55:46 -0500 (Tue, 14 Dec 2010) | 1 line
+Changed paths:
+ M /trunk/samtools/bcftools/call1.c
+
+Fixed the VCF header to pass validation
+------------------------------------------------------------------------
+r884 | lh3lh3 | 2010-12-12 23:02:19 -0500 (Sun, 12 Dec 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/bcftools/vcfutils.pl
+
+ * samtools-0.1.12-4 (r884)
+ * fixed a long-existing flaw in the INDEL calling model
+
+------------------------------------------------------------------------
+r883 | lh3lh3 | 2010-12-11 20:05:42 -0500 (Sat, 11 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/bcfutils.c
+ M /trunk/samtools/bcftools/call1.c
+ M /trunk/samtools/bcftools/vcfutils.pl
+
+compute max SP and max GQ from sample genotypes
+
+------------------------------------------------------------------------
+r880 | lh3lh3 | 2010-12-10 10:50:54 -0500 (Fri, 10 Dec 2010) | 2 lines
+Changed paths:
+ D /trunk/samtools/bcftools/bcf-fix.pl
+
+drop bcf-fix.pl as it is redundant by the latest changes
+
+------------------------------------------------------------------------
+r879 | lh3lh3 | 2010-12-10 10:50:29 -0500 (Fri, 10 Dec 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bcftools/call1.c
+ M /trunk/samtools/bcftools/vcf.c
+
+ * fixed a minor issue in printing VCFs
+ * write bcftools specific INFO and FORMAT in the header
+
+------------------------------------------------------------------------
+r878 | lh3lh3 | 2010-12-10 10:09:14 -0500 (Fri, 10 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/bcftools/bcfutils.c
+ M /trunk/samtools/bcftools/call1.c
+
+Make sure that the GT genotype field is the first
+
+------------------------------------------------------------------------
+r877 | lh3lh3 | 2010-12-08 17:27:05 -0500 (Wed, 08 Dec 2010) | 7 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf.h
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.12-2 (r877)
+
+ * allow to fine control the selection of indel candidates. The current
+ setting is okay for lowCov and highCov with ~100 samples, but it
+ skips too many indels for highCov with >250 samples.
+
+
+------------------------------------------------------------------------
+r874 | lh3lh3 | 2010-12-07 22:40:35 -0500 (Tue, 07 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam_plcmd.c
+
+a spelling error..
+
+------------------------------------------------------------------------
+r873 | lh3lh3 | 2010-12-07 22:39:57 -0500 (Tue, 07 Dec 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.12-1 (r873)
+ * added a switch to allow anomalous read pairs in calling
+
+------------------------------------------------------------------------
+r872 | lh3lh3 | 2010-12-07 14:43:54 -0500 (Tue, 07 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/vcfutils.pl
+
+fixed a bug in vcf2fq
+
+------------------------------------------------------------------------
+r869 | lh3lh3 | 2010-12-05 01:18:06 -0500 (Sun, 05 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bamtk.c
+
+added a warning for the Windows version
+
+------------------------------------------------------------------------
+r868 | lh3lh3 | 2010-12-05 01:05:51 -0500 (Sun, 05 Dec 2010) | 4 lines
+Changed paths:
+ M /trunk/samtools/bcftools/call1.c
+
+In ksprintf(), change "%lf" and "%lg" to "%f" and "%g", respectively.
+According to the manual page, this change is valid. However, MinGW seems
+to interpret "%lf" as "%Lf".
+
+------------------------------------------------------------------------
+r867 | lh3lh3 | 2010-12-05 00:35:43 -0500 (Sun, 05 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/Makefile.mingw
+ M /trunk/samtools/bam_aux.c
+
+bring back the windows support
+
+------------------------------------------------------------------------
+r866 | lh3lh3 | 2010-12-04 23:33:51 -0500 (Sat, 04 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam_reheader.c
+ M /trunk/samtools/bcftools/vcfutils.pl
+
+Fixed a compiling error when knetfile is not used.
+
+------------------------------------------------------------------------
+r865 | lh3lh3 | 2010-12-04 00:13:22 -0500 (Sat, 04 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/vcfutils.pl
+
+vcf->fastq
+
+------------------------------------------------------------------------
+r864 | lh3lh3 | 2010-12-03 17:12:30 -0500 (Fri, 03 Dec 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bcftools/call1.c
+ M /trunk/samtools/bcftools/prob1.c
+ M /trunk/samtools/bcftools/prob1.h
+
+ * remove "-f". Instead always compute consensus quality
+ * increase the upper limit of quality
+
+------------------------------------------------------------------------
+r863 | lh3lh3 | 2010-12-03 15:28:15 -0500 (Fri, 03 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/bcf.c
+
+more informative error message
+
+------------------------------------------------------------------------
+r862 | lh3lh3 | 2010-12-02 16:16:08 -0500 (Thu, 02 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/NEWS
+ M /trunk/samtools/bamtk.c
+
+Release samtools-0.1.12a
+
+------------------------------------------------------------------------
+r861 | lh3lh3 | 2010-12-02 15:55:06 -0500 (Thu, 02 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/call1.c
+
+a possible fix to DP4=0,0,0,0; have not tested, but should have no side-effect
+
+------------------------------------------------------------------------
+r859 | lh3lh3 | 2010-12-02 11:39:57 -0500 (Thu, 02 Dec 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/NEWS
+ M /trunk/samtools/bam_index.c
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/samtools.1
+
+Release samtools-0.1.12
+
+------------------------------------------------------------------------
+r858 | lh3lh3 | 2010-12-02 11:24:41 -0500 (Thu, 02 Dec 2010) | 4 lines
+Changed paths:
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/bcftools/bcf.c
+
+ * samtools-0.1.11-1 (r858)
+ * fixed a bug in mpileup which causes segfaults
+ * bcftools: do not segfault when BCF contains errors
+
+------------------------------------------------------------------------
+r857 | lh3lh3 | 2010-11-30 23:52:50 -0500 (Tue, 30 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam_index.c
+
+fixed a memory leak in bam_fetch()
+
+------------------------------------------------------------------------
+r856 | lh3lh3 | 2010-11-26 00:07:31 -0500 (Fri, 26 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bcftools/vcfutils.pl
+
+ * fixed a memory violation
+ * added splitchr to vcfutils.pl
+
+------------------------------------------------------------------------
+r854 | lh3lh3 | 2010-11-23 09:05:08 -0500 (Tue, 23 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/ld.c
+
+fixed a typo/bug in r^2 computation
+
+------------------------------------------------------------------------
+r852 | lh3lh3 | 2010-11-21 22:20:20 -0500 (Sun, 21 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bamtk.c
+
+forget to change the version information
+
+------------------------------------------------------------------------
+r851 | lh3lh3 | 2010-11-21 22:16:52 -0500 (Sun, 21 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/ChangeLog
+ M /trunk/samtools/NEWS
+ M /trunk/samtools/bcftools/bcftools.1
+ M /trunk/samtools/samtools.1
+
+Release samtools-0.1.11
+
------------------------------------------------------------------------
r844 | lh3lh3 | 2010-11-19 23:16:08 -0500 (Fri, 19 Nov 2010) | 3 lines
Changed paths:
System Requirements
===================
-SAMtools depends on the zlib library <http://www.zlib.net>. The latest
-version 1.2.3 is preferred and with the latest version you can compile
-razip and use it to compress a FASTA file. SAMtools' faidx is able to
-index a razip-compressed FASTA file to save diskspace. Older zlib also
-works with SAMtools, but razip cannot be compiled.
+SAMtools depends on the zlib library <http://www.zlib.net>. Version 1.2.3+ is
+preferred and with 1.2.3+ you can compile razip and use it to compress a FASTA
+file. SAMtools' faidx is able to index a razip-compressed FASTA file to save
+diskspace. Older zlib also works with SAMtools, but razip cannot be compiled.
The text-based viewer (tview) requires the GNU ncurses library
-<http://www.gnu.org/software/ncurses/>, which comes with Mac OS X and
-most of the modern Linux/Unix distributions. If you do not have this
-library installed, you can still compile the rest of SAMtools by
-manually modifying one line in Makefile.
+<http://www.gnu.org/software/ncurses/>, which comes with Mac OS X and most of
+the modern Linux/Unix distributions. If you do not have this library installed,
+you can still compile the rest of SAMtools by manually changing:
+`-D_CURSES_LIB=1' to `-D_CURSES_LIB=0' at the line starting with `DFLAGS=', and
+comment out the line starting with `LIBCURSES='.
Compilation
===========
-Type `make' to compile samtools. If you have zlib >= 1.2.2.1, you can
-compile razip with `make razip'.
+Type `make' to compile samtools. If you have zlib >= 1.2.2.1, you can compile
+razip with `make razip'.
Installation
============
-Simply copy `samtools' and other executables/scripts in `misc' to a
-location you want (e.g. a directory in your $PATH). No further
-configurations are required.
+Copy `samtools', `bcftools/bcftools' and other executables/scripts in `misc' to
+a location you want (e.g. a directory in your $PATH). You may also copy
+`samtools.1' and `bcftools/bcftools.1' to a directory in your $MANPATH such
+that the `man' command may find the manual.
CC= gcc
CFLAGS= -g -Wall -O2 #-m64 #-arch ppc
-DFLAGS= -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE -D_CURSES_LIB=1
+DFLAGS= -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE -D_CURSES_LIB=1
KNETFILE_O= knetfile.o
LOBJS= bgzf.o kstring.o bam_aux.o bam.o bam_import.o sam.o bam_index.o \
bam_pileup.o bam_lpileup.o bam_md.o glf.o razf.o faidx.o \
$(KNETFILE_O) bam_sort.o sam_header.o bam_reheader.o kprobaln.o
AOBJS= bam_tview.o bam_maqcns.o bam_plcmd.o sam_view.o \
bam_rmdup.o bam_rmdupse.o bam_mate.o bam_stat.o bam_color.o \
- bamtk.o kaln.o bam2bcf.o bam2bcf_indel.o errmod.o sample.o
+ bamtk.o kaln.o bam2bcf.o bam2bcf_indel.o errmod.o sample.o \
+ cut_target.o phase.o
PROG= samtools
INCLUDES= -I.
SUBDIRS= . bcftools misc
$(AR) -cru $@ $(LOBJS)
samtools:lib-recur $(AOBJS)
- $(CC) $(CFLAGS) -o $@ $(AOBJS) libbam.a -lm $(LIBPATH) $(LIBCURSES) -lz -Lbcftools -lbcf
+ $(CC) $(CFLAGS) -o $@ $(AOBJS) -Lbcftools $(LIBPATH) libbam.a -lbcf $(LIBCURSES) -lm -lz
razip:razip.o razf.o $(KNETFILE_O)
$(CC) $(CFLAGS) -o $@ razf.o razip.o $(KNETFILE_O) -lz
bam2bcf.o:bam2bcf.h errmod.h bcftools/bcf.h
bam2bcf_indel.o:bam2bcf.h
errmod.o:errmod.h
+phase.o:bam.h khash.h ksort.h
+bamtk.o:bam.h
faidx.o:faidx.h razf.h khash.h
faidx_main.o:faidx.h razf.h
CC= gcc.exe
AR= ar.exe
CFLAGS= -g -Wall -O2
-DFLAGS= -D_CURSES_LIB=2 -D_USE_KNETFILE
+DFLAGS= -D_USE_KNETFILE -D_CURSES_LIB=2
KNETFILE_O= knetfile.o
LOBJS= bgzf.o kstring.o bam_aux.o bam.o bam_import.o sam.o bam_index.o \
- bam_pileup.o bam_lpileup.o bam_md.o glf.o razf.o faidx.o bam_sort.o \
- $(KNETFILE_O)
+ bam_pileup.o bam_lpileup.o bam_md.o glf.o razf.o faidx.o \
+ $(KNETFILE_O) bam_sort.o sam_header.o bam_reheader.o kprobaln.o
AOBJS= bam_tview.o bam_maqcns.o bam_plcmd.o sam_view.o \
bam_rmdup.o bam_rmdupse.o bam_mate.o bam_stat.o bam_color.o \
- bamtk.o kaln.o sam_header.o
-PROG= samtools
-INCLUDES= -Iwin32
+ bamtk.o kaln.o bam2bcf.o bam2bcf_indel.o errmod.o sample.o
+BCFOBJS= bcftools/bcf.o bcftools/fet.o bcftools/bcf2qcall.o bcftools/bcfutils.o \
+ bcftools/call1.o bcftools/index.o bcftools/kfunc.o bcftools/ld.o \
+ bcftools/prob1.o bcftools/vcf.o
+PROG= samtools.exe bcftools.exe
+INCLUDES= -I. -Iwin32
SUBDIRS= .
-LIBPATH=
+LIBPATH=
.SUFFIXES:.c .o
libbam.a:$(LOBJS)
$(AR) -cru $@ $(LOBJS)
-samtools:$(AOBJS) libbam.a
- $(CC) $(CFLAGS) -o $@ $(AOBJS) $(LIBPATH) -lm -L. -lbam -Lwin32 -lz -lcurses -lws2_32
+samtools.exe:$(AOBJS) libbam.a $(BCFOBJS)
+ $(CC) $(CFLAGS) -o $@ $(AOBJS) $(BCFOBJS) $(LIBPATH) -lm -L. -lbam -Lwin32 -lz -lcurses -lws2_32
-razip:razip.o razf.o $(KNETFILE_O)
- $(CC) $(CFLAGS) -o $@ razf.o razip.o $(KNETFILE_O) -lz
-
-bgzip:bgzip.o bgzf.o $(KNETFILE_O)
- $(CC) $(CFLAGS) -o $@ bgzf.o bgzip.o $(KNETFILE_O) -lz
+bcftools.exe:$(BCFOBJS) bcftools/main.o kstring.o bgzf.o knetfile.o
+ $(CC) $(CFLAGS) -o $@ $(BCFOBJS) bcftools/main.o kstring.o bgzf.o knetfile.o -lm -Lwin32 -lz -lws2_32
razip.o:razf.h
-bam.o:bam.h razf.h bam_endian.h kstring.h
+bam.o:bam.h razf.h bam_endian.h kstring.h sam_header.h
sam.o:sam.h bam.h
bam_import.o:bam.h kseq.h khash.h razf.h
bam_pileup.o:bam.h razf.h ksort.h
-bam_plcmd.o:bam.h faidx.h bam_maqcns.h glf.h
+bam_plcmd.o:bam.h faidx.h bam_maqcns.h glf.h bcftools/bcf.h bam2bcf.h
bam_index.o:bam.h khash.h ksort.h razf.h bam_endian.h
bam_lpileup.o:bam.h ksort.h
bam_tview.o:bam.h faidx.h bam_maqcns.h
-bam_maqcns.o:bam.h ksort.h bam_maqcns.h
+bam_maqcns.o:bam.h ksort.h bam_maqcns.h kaln.h
bam_sort.o:bam.h ksort.h razf.h
bam_md.o:bam.h faidx.h
glf.o:glf.h
+sam_header.o:sam_header.h khash.h
+bcf.o:bcftools/bcf.h
+bam2bcf.o:bam2bcf.h errmod.h bcftools/bcf.h
+bam2bcf_indel.o:bam2bcf.h
+errmod.o:errmod.h
faidx.o:faidx.h razf.h khash.h
faidx_main.o:faidx.h razf.h
clean:
- rm -fr gmon.out *.o *.exe *.dSYM razip bgzip $(PROG) *~ *.a
+ rm -fr gmon.out *.o a.out *.exe *.dSYM razip bgzip $(PROG) *~ *.a *.so.* *.so *.dylib
+Beta release 0.1.13 (1 March, 2011)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The most important though largely invisible modification is the change of the
+order of genotypes in the PL VCF/BCF tag. This is to conform the upcoming VCF
+spec v4.1. The change means that 0.1.13 is not backward compatible with VCF/BCF
+generated by samtools older than r921 inclusive. VCF/BCF generated by the new
+samtools will contain a line `##fileformat=VCFv4.1' as well as the samtools
+version number.
+
+Single Individual Haplotyping (SIH) is added as an experimental feature. It
+originally aims to produce haploid consensus from fosmid pool sequencing, but
+also works with short-read data. For short reads, phased blocks are usually too
+short to be useful in many applications, but they can help to rule out part of
+SNPs close to INDELs or between copies of CNVs.
+
+
+Other notable changes in samtools:
+
+ * Construct per-sample consensus to reduce the effect of nearby SNPs in INDEL
+ calling. This reduces the power but improves specificity.
+
+ * Improved sorting order checking in indexing. Now indexing is the preferred way
+ to check if a BAM is sorted.
+
+ * Added a switch `-E' to mpileup and calmd. This option uses an alternative way
+ to apply BAQ, which increases sensistivity, especially to MNPs, at the cost of
+ a little loss in specificity.
+
+ * Added `mpileup -A' to allow to use reads in anomalous pairs in SNP calling.
+
+ * Added `mpileup -m' to allow fine control of the collection of INDEL candidates.
+
+ * Added `mpileup -S' to compute per-sample strand bias P-value.
+
+ * Added `mpileup -G' to exclude read groups in variant calling.
+
+ * Fixed segfault in indel calling related to unmapped and refskip reads.
+
+ * Fixed an integer overflow in INDEL calling. This bug produces wrong INDEL
+ genotypes for longer short INDELs, typically over 10bp.
+
+ * Fixed a bug in tview on big-endian machines.
+
+ * Fixed a very rare memory issue in bam_md.c
+
+ * Fixed an out-of-boundary bug in mpileup when the read base is `N'.
+
+ * Fixed a compiling error when the knetfile library is not used. Fixed a
+ library compiling error due to the lack of bam_nt16_nt4_table[] table.
+ Suppress a compiling warning related to the latest zlib.
+
+
+Other notable changes in bcftools:
+
+ * Updated the BCF spec.
+
+ * Added the `FQ' VCF INFO field, which gives the phred-scaled probability
+ of all samples being the samei (identical to the reference or all homozygous
+ variants). Option `view -f' has been dropped.
+
+ * Implementated of "vcfutils.pl vcf2fq" to generate a consensus sequence
+ similar to "samtools.pl pileup2fq".
+
+ * Make sure the GT FORMAT field is always the first FORMAT to conform the VCF
+ spec. Drop bcf-fix.pl.
+
+ * Output bcftools specific INFO and FORMAT in the VCF header.
+
+ * Added `view -s' to call variants from a subset of samples.
+
+ * Properly convert VCF to BCF with a user provided sequence dictionary. Nonetheless,
+ custom fields are still unparsed and will be stored as a missing value.
+
+ * Fixed a minor bug in Fisher's exact test; the results are rarely changed.
+
+
+(0.1.13: 1 March 2011, r926:134)
+
+
+
Beta release 0.1.12a (2 December, 2010)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The is the first public release of samtools. For more information,
please check the manual page `samtools.1' and the samtools website
-http://samtools.sourceforge.net
\ No newline at end of file
+http://samtools.sourceforge.net
// with ESPIPE. Suppress the error message in this case.
if (errno != ESPIPE) perror("[bam_header_read] bgzf_check_EOF");
}
- else if (i == 0) fprintf(stderr, "[bam_header_read] EOF marker is absent.\n");
+ else if (i == 0) fprintf(stderr, "[bam_header_read] EOF marker is absent. The input is probably truncated.\n");
// read "BAM1"
magic_len = bam_read(fp, buf, 4);
if (magic_len != 4 || strncmp(buf, "BAM\001", 4) != 0) {
@copyright Genome Research Ltd.
*/
+#define BAM_VERSION "0.1.13 (r926:134)"
+
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#define CALL_ETA 0.03f
#define CALL_MAX 256
#define CALL_DEFTHETA 0.83f
+#define DEF_MAPQ 20
#define CAP_DIST 25
bca->openQ = 40; bca->extQ = 20; bca->tandemQ = 100;
bca->min_baseQ = min_baseQ;
bca->e = errmod_init(1. - theta);
+ bca->min_frac = 0.002;
+ bca->min_support = 1;
return bca;
}
seqQ = is_indel? (p->aux>>8&0xff) : 99;
if (q < bca->min_baseQ) continue;
if (q > seqQ) q = seqQ;
- mapQ = p->b->core.qual < bca->capQ? p->b->core.qual : bca->capQ;
+ mapQ = p->b->core.qual < 255? p->b->core.qual : DEF_MAPQ; // special case for mapQ==255
+ mapQ = mapQ < bca->capQ? mapQ : bca->capQ;
if (q > mapQ) q = mapQ;
if (q > 63) q = 63;
if (q < 4) q = 4;
}
bca->bases[n++] = q<<5 | (int)bam1_strand(p->b)<<4 | b;
// collect annotations
- r->qsum[b] += q;
+ if (b < 4) r->qsum[b] += q;
++r->anno[0<<2|is_diff<<1|bam1_strand(p->b)];
min_dist = p->b->core.l_qseq - 1 - p->qpos;
if (min_dist > p->qpos) min_dist = p->qpos;
x = call->n_alleles * (call->n_alleles + 1) / 2;
// get the possible genotypes
for (i = z = 0; i < call->n_alleles; ++i)
- for (j = i; j < call->n_alleles; ++j)
- g[z++] = call->a[i] * 5 + call->a[j];
+ for (j = 0; j <= i; ++j)
+ g[z++] = call->a[j] * 5 + call->a[i];
for (i = 0; i < n; ++i) {
uint8_t *PL = call->PL + x * i;
const bcf_callret1_t *r = calls + i;
typedef struct __bcf_callaux_t {
int capQ, min_baseQ;
- int openQ, extQ, tandemQ;
+ int openQ, extQ, tandemQ; // for indels
+ int min_support; // for collecting indel candidates
+ double min_frac; // for collecting indel candidates
// for internal uses
int max_bases;
int indel_types[4];
#define MINUS_CONST 0x10000000
#define INDEL_WINDOW_SIZE 50
-#define MIN_SUPPORT_COEF 500
void *bcf_call_add_rg(void *_hash, const char *hdtext, const char *list)
{
extern void ks_introsort_uint32_t(int, uint32_t*);
int i, s, j, k, t, n_types, *types, max_rd_len, left, right, max_ins, *score1, *score2, max_ref2;
int N, K, l_run, ref_type, n_alt;
- char *inscns = 0, *ref2, *query;
+ char *inscns = 0, *ref2, *query, **ref_sample;
khash_t(rg) *hash = (khash_t(rg)*)rghash;
if (ref == 0 || bca == 0) return -1;
// mark filtered reads
// squeeze out identical types
for (i = 1, n_types = 1; i < m; ++i)
if (aux[i] != aux[i-1]) ++n_types;
- if (n_types == 1 || n_alt * MIN_SUPPORT_COEF < n_tot) { // no indels or too few supporting reads
+ if (n_types == 1 || (double)n_alt / n_tot < bca->min_frac || n_alt < bca->min_support) { // then skip
free(aux); return -1;
}
types = (int*)calloc(n_types, sizeof(int));
if (ref[i] == 0) break;
right = i;
}
+ /* The following block fixes a long-existing flaw in the INDEL
+ * calling model: the interference of nearby SNPs. However, it also
+ * reduces the power because sometimes, substitutions caused by
+ * indels are not distinguishable from true mutations. Multiple
+ * sequence realignment helps to increase the power.
+ */
+ { // construct per-sample consensus
+ int L = right - left + 1, max_i, max2_i;
+ uint32_t *cns, max, max2;
+ char *ref0, *r;
+ ref_sample = calloc(n, sizeof(void*));
+ cns = calloc(L, 4);
+ ref0 = calloc(L, 1);
+ for (i = 0; i < right - left; ++i)
+ ref0[i] = bam_nt16_table[(int)ref[i+left]];
+ for (s = 0; s < n; ++s) {
+ r = ref_sample[s] = calloc(L, 1);
+ memset(cns, 0, sizeof(int) * L);
+ // collect ref and non-ref counts
+ for (i = 0; i < n_plp[s]; ++i) {
+ bam_pileup1_t *p = plp[s] + i;
+ bam1_t *b = p->b;
+ uint32_t *cigar = bam1_cigar(b);
+ uint8_t *seq = bam1_seq(b);
+ int x = b->core.pos, y = 0;
+ for (k = 0; k < b->core.n_cigar; ++k) {
+ int op = cigar[k]&0xf;
+ int j, l = cigar[k]>>4;
+ if (op == BAM_CMATCH) {
+ for (j = 0; j < l; ++j)
+ if (x + j >= left && x + j < right)
+ cns[x+j-left] += (bam1_seqi(seq, y+j) == ref0[x+j-left])? 1 : 0x10000;
+ x += l; y += l;
+ } else if (op == BAM_CDEL || op == BAM_CREF_SKIP) x += l;
+ else if (op == BAM_CINS || op == BAM_CSOFT_CLIP) y += l;
+ }
+ }
+ // determine the consensus
+ for (i = 0; i < right - left; ++i) r[i] = ref0[i];
+ max = max2 = 0; max_i = max2_i = -1;
+ for (i = 0; i < right - left; ++i) {
+ if (cns[i]>>16 >= max>>16) max2 = max, max2_i = max_i, max = cns[i], max_i = i;
+ else if (cns[i]>>16 >= max2>>16) max2 = cns[i], max2_i = i;
+ }
+ if ((double)(max&0xffff) / ((max&0xffff) + (max>>16)) >= 0.7) max_i = -1;
+ if ((double)(max2&0xffff) / ((max2&0xffff) + (max2>>16)) >= 0.7) max2_i = -1;
+ if (max_i >= 0) r[max_i] = 15;
+ if (max2_i >= 0) r[max2_i] = 15;
+// for (i = 0; i < right - left; ++i) fputc("=ACMGRSVTWYHKDBN"[(int)r[i]], stderr); fputc('\n', stderr);
+ }
+ free(ref0); free(cns);
+ }
{ // the length of the homopolymer run around the current position
int c = bam_nt16_table[(int)ref[pos + 1]];
if (c == 15) l_run = 1;
else ir = est_indelreg(pos, ref, -types[t], 0);
if (ir > bca->indelreg) bca->indelreg = ir;
// fprintf(stderr, "%d, %d, %d\n", pos, types[t], ir);
- // write ref2
- for (k = 0, j = left; j <= pos; ++j)
- ref2[k++] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[j]]];
- if (types[t] <= 0) j += -types[t];
- else for (l = 0; l < types[t]; ++l)
- ref2[k++] = inscns[t*max_ins + l];
- if (types[0] < 0) { // mask deleted sequences to avoid a particular error in the model.
- int jj, tmp = types[t] >= 0? -types[0] : -types[0] + types[t];
- for (jj = 0; jj < tmp && j < right && ref[j]; ++jj, ++j)
- ref2[k++] = 4;
- }
- for (; j < right && ref[j]; ++j)
- ref2[k++] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[j]]];
- for (; k < max_ref2; ++k) ref2[k] = 4;
- if (j < right) right = j;
- // align each read to ref2
+ // realignment
for (s = K = 0; s < n; ++s) {
+ // write ref2
+ for (k = 0, j = left; j <= pos; ++j)
+ ref2[k++] = bam_nt16_nt4_table[(int)ref_sample[s][j-left]];
+ if (types[t] <= 0) j += -types[t];
+ else for (l = 0; l < types[t]; ++l)
+ ref2[k++] = inscns[t*max_ins + l];
+ for (; j < right && ref[j]; ++j)
+ ref2[k++] = bam_nt16_nt4_table[(int)ref_sample[s][j-left]];
+ for (; k < max_ref2; ++k) ref2[k] = 4;
+ if (j < right) right = j;
+ // align each read to ref2
for (i = 0; i < n_plp[s]; ++i, ++K) {
bam_pileup1_t *p = plp[s] + i;
- int qbeg, qend, tbeg, tend, sc;
+ int qbeg, qend, tbeg, tend, sc, kk;
uint8_t *seq = bam1_seq(p->b);
+ uint32_t *cigar = bam1_cigar(p->b);
+ if (p->b->core.flag&4) continue; // unmapped reads
+ // FIXME: the following loop should be better moved outside; nonetheless, realignment should be much slower anyway.
+ for (kk = 0; kk < p->b->core.n_cigar; ++kk)
+ if ((cigar[kk]&BAM_CIGAR_MASK) == BAM_CREF_SKIP) break;
+ if (kk < p->b->core.n_cigar) continue;
// FIXME: the following skips soft clips, but using them may be more sensitive.
// determine the start and end of sequences for alignment
qbeg = tpos2qpos(&p->b->core, bam1_cigar(p->b), left, 0, &tbeg);
indelQ2 = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ2 + .499);
// pick the smaller between indelQ1 and indelQ2
indelQ = indelQ1 < indelQ2? indelQ1 : indelQ2;
- p->aux = (sc[0]&0x3f)<<16 | seqQ<<8 | indelQ;
+ if (indelQ > 255) indelQ = 255;
+ if (seqQ > 255) seqQ = 255;
+ p->aux = (sc[0]&0x3f)<<16 | seqQ<<8 | indelQ; // use 22 bits in total
sumq[sc[0]&0x3f] += indelQ < seqQ? indelQ : seqQ;
-// fprintf(stderr, "pos=%d read=%d:%d name=%s call=%d q=%d\n", pos, s, i, bam1_qname(p->b), types[sc[0]&0x3f], indelQ);
+// fprintf(stderr, "pos=%d read=%d:%d name=%s call=%d indelQ=%d seqQ=%d\n", pos, s, i, bam1_qname(p->b), types[sc[0]&0x3f], indelQ, seqQ);
}
}
// determine bca->indel_types[] and bca->inscns
}
free(score1); free(score2);
// free
+ for (i = 0; i < n; ++i) free(ref_sample[i]);
+ free(ref_sample);
free(types); free(inscns);
return n_alt > 0? 0 : -1;
}
if (type == 'Z' || type == 'H') return (char*)s;
else return 0;
}
+
+#ifdef _WIN32
+double drand48()
+{
+ return (double)rand() / RAND_MAX;
+}
+#endif
}
merge_chunks(idx);
fill_missing(idx);
- if (ret >= 0)
- while ((ret = bam_read1(fp, b)) >= 0) ++n_no_coor;
+ if (ret >= 0) {
+ while ((ret = bam_read1(fp, b)) >= 0) {
+ ++n_no_coor;
+ if (c->tid >= 0 && n_no_coor) {
+ fprintf(stderr, "[bam_index_core] the alignment is not sorted: reads without coordinates prior to reads with coordinates.\n");
+ exit(1);
+ }
+ }
+ }
if (ret < -1) fprintf(stderr, "[bam_index_core] truncated file? Continue anyway. (%d)\n", ret);
free(b->data); free(b);
idx->n_no_coor = n_no_coor;
uint32_t c[4];
} glf_call_aux_t;
-char bam_nt16_nt4_table[] = { 4, 0, 1, 4, 2, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4 };
-
/*
P(<b1,b2>) = \theta \sum_{i=1}^{N-1} 1/i
P(D|<b1,b2>) = \sum_{k=1}^{N-1} p_k 1/2 [(k/N)^n_2(1-k/N)^n_1 + (k/N)^n1(1-k/N)^n_2]
#include "kaln.h"
#include "kprobaln.h"
+char bam_nt16_nt4_table[] = { 4, 0, 1, 4, 2, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4 };
+
void bam_fillmd1_core(bam1_t *b, char *ref, int is_equal, int max_nm)
{
uint8_t *seq = bam1_seq(b);
return (int)(t + .499);
}
-int bam_prob_realn_core(bam1_t *b, const char *ref, int apply_baq)
+int bam_prob_realn_core(bam1_t *b, const char *ref, int flag)
{
- int k, i, bw, x, y, yb, ye, xb, xe;
+ int k, i, bw, x, y, yb, ye, xb, xe, apply_baq = flag&1, extend_baq = flag>>1&1;
uint32_t *cigar = bam1_cigar(b);
bam1_core_t *c = &b->core;
kpa_par_t conf = kpa_par_def;
uint8_t *bq = 0, *zq = 0, *qual = bam1_qual(b);
- if (c->flag & BAM_FUNMAP) return -1; // do nothing
+ if ((c->flag & BAM_FUNMAP) || b->core.l_qseq == 0) return -1; // do nothing
// test if BQ or ZQ is present
if ((bq = bam_aux_get(b, "BQ")) != 0) ++bq;
if ((zq = bam_aux_get(b, "ZQ")) != 0 && *zq == 'Z') ++zq;
s = calloc(c->l_qseq, 1);
for (i = 0; i < c->l_qseq; ++i) s[i] = bam_nt16_nt4_table[bam1_seqi(seq, i)];
r = calloc(xe - xb, 1);
- for (i = xb; i < xe; ++i)
+ for (i = xb; i < xe; ++i) {
+ if (ref[i] == 0) { xe = i; break; }
r[i-xb] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[i]]];
+ }
state = calloc(c->l_qseq, sizeof(int));
q = calloc(c->l_qseq, 1);
kpa_glocal(r, xe-xb, s, c->l_qseq, qual, &conf, state, q);
- for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) {
- int op = cigar[k]&0xf, l = cigar[k]>>4;
- if (op == BAM_CMATCH) {
- for (i = y; i < y + l; ++i) {
- if ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y)) bq[i] = 0;
- else bq[i] = bq[i] < q[i]? bq[i] : q[i];
- }
- x += l; y += l;
- } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l;
- else if (op == BAM_CDEL) x += l;
+ if (!extend_baq) { // in this block, bq[] is capped by base quality qual[]
+ for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) {
+ int op = cigar[k]&0xf, l = cigar[k]>>4;
+ if (op == BAM_CMATCH) {
+ for (i = y; i < y + l; ++i) {
+ if ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y)) bq[i] = 0;
+ else bq[i] = bq[i] < q[i]? bq[i] : q[i];
+ }
+ x += l; y += l;
+ } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l;
+ else if (op == BAM_CDEL) x += l;
+ }
+ for (i = 0; i < c->l_qseq; ++i) bq[i] = qual[i] - bq[i] + 64; // finalize BQ
+ } else { // in this block, bq[] is BAQ that can be larger than qual[] (different from the above!)
+ uint8_t *left, *rght;
+ left = calloc(c->l_qseq, 1); rght = calloc(c->l_qseq, 1);
+ for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) {
+ int op = cigar[k]&0xf, l = cigar[k]>>4;
+ if (op == BAM_CMATCH) {
+ for (i = y; i < y + l; ++i)
+ bq[i] = ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y))? 0 : q[i];
+ for (left[y] = bq[y], i = y + 1; i < y + l; ++i)
+ left[i] = bq[i] > left[i-1]? bq[i] : left[i-1];
+ for (rght[y+l-1] = bq[y+l-1], i = y + l - 2; i >= y; --i)
+ rght[i] = bq[i] > rght[i+1]? bq[i] : rght[i+1];
+ for (i = y; i < y + l; ++i)
+ bq[i] = left[i] < rght[i]? left[i] : rght[i];
+ x += l; y += l;
+ } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l;
+ else if (op == BAM_CDEL) x += l;
+ }
+ for (i = 0; i < c->l_qseq; ++i) bq[i] = 64 + (qual[i] <= bq[i]? 0 : qual[i] - bq[i]); // finalize BQ
+ free(left); free(rght);
}
- for (i = 0; i < c->l_qseq; ++i) bq[i] = qual[i] - bq[i] + 64; // finalize BQ
if (apply_baq) {
for (i = 0; i < c->l_qseq; ++i) qual[i] -= bq[i] - 64; // modify qual
bam_aux_append(b, "ZQ", 'Z', c->l_qseq + 1, bq);
int bam_fillmd(int argc, char *argv[])
{
- int c, is_equal, tid = -2, ret, len, is_bam_out, is_sam_in, is_uncompressed, max_nm, is_realn, capQ, apply_baq;
+ int c, is_equal, tid = -2, ret, len, is_bam_out, is_sam_in, is_uncompressed, max_nm, is_realn, capQ, baq_flag;
samfile_t *fp, *fpout = 0;
faidx_t *fai;
char *ref = 0, mode_w[8], mode_r[8];
bam1_t *b;
- is_equal = is_bam_out = is_sam_in = is_uncompressed = is_realn = max_nm = apply_baq = capQ = 0;
+ is_equal = is_bam_out = is_sam_in = is_uncompressed = is_realn = max_nm = capQ = baq_flag = 0;
mode_w[0] = mode_r[0] = 0;
strcpy(mode_r, "r"); strcpy(mode_w, "w");
- while ((c = getopt(argc, argv, "reubSC:n:A")) >= 0) {
+ while ((c = getopt(argc, argv, "EreubSC:n:A")) >= 0) {
switch (c) {
case 'r': is_realn = 1; break;
case 'e': is_equal = 1; break;
case 'S': is_sam_in = 1; break;
case 'n': max_nm = atoi(optarg); break;
case 'C': capQ = atoi(optarg); break;
- case 'A': apply_baq = 1; break;
+ case 'A': baq_flag |= 1; break;
+ case 'E': baq_flag |= 2; break;
default: fprintf(stderr, "[bam_fillmd] unrecognized option '-%c'\n", c); return 1;
}
}
fprintf(stderr, " -b compressed BAM output\n");
fprintf(stderr, " -S the input is SAM with header\n");
fprintf(stderr, " -A modify the quality string\n");
- fprintf(stderr, " -r read-independent local realignment\n\n");
+ fprintf(stderr, " -r compute the BQ tag (without -A) or cap baseQ by BAQ (with -A)\n");
+ fprintf(stderr, " -E extended BAQ for better sensitivity but lower specificity\n\n");
return 1;
}
fp = samopen(argv[optind], mode_r, 0);
fprintf(stderr, "[bam_fillmd] fail to find sequence '%s' in the reference.\n",
fp->header->target_name[tid]);
}
- if (is_realn) bam_prob_realn_core(b, ref, apply_baq);
+ if (is_realn) bam_prob_realn_core(b, ref, baq_flag);
if (capQ > 10) {
int q = bam_cap_mapQ(b, ref, capQ);
if (b->core.qual > q) b->core.qual = q;
#define MPLP_FMT_DP 0x100
#define MPLP_FMT_SP 0x200
#define MPLP_NO_INDEL 0x400
+#define MPLP_EXT_BAQ 0x800
typedef struct {
int max_mq, min_mq, flag, min_baseQ, capQ_thres, max_depth;
- int openQ, extQ, tandemQ;
+ int openQ, extQ, tandemQ, min_support; // for indels
+ double min_frac; // for indels
char *reg, *fn_pos, *pl_list;
faidx_t *fai;
kh_64_t *hash;
+ void *rghash;
} mplp_conf_t;
typedef struct {
bamFile fp;
bam_iter_t iter;
- int min_mq, flag, ref_id, capQ_thres;
+ int ref_id;
char *ref;
+ const mplp_conf_t *conf;
} mplp_aux_t;
typedef struct {
int has_ref;
ret = ma->iter? bam_iter_read(ma->fp, ma->iter, b) : bam_read1(ma->fp, b);
if (ret < 0) break;
+ if (ma->conf->rghash) { // exclude read groups
+ uint8_t *rg = bam_aux_get(b, "RG");
+ skip = (rg && bcf_str2id(ma->conf->rghash, (const char*)(rg+1)) >= 0);
+ if (skip) continue;
+ }
has_ref = (ma->ref && ma->ref_id == b->core.tid)? 1 : 0;
skip = 0;
- if (has_ref && (ma->flag&MPLP_REALN)) bam_prob_realn_core(b, ma->ref, 1);
- if (has_ref && ma->capQ_thres > 10) {
- int q = bam_cap_mapQ(b, ma->ref, ma->capQ_thres);
+ if (has_ref && (ma->conf->flag&MPLP_REALN)) bam_prob_realn_core(b, ma->ref, (ma->conf->flag & MPLP_EXT_BAQ)? 3 : 1);
+ if (has_ref && ma->conf->capQ_thres > 10) {
+ int q = bam_cap_mapQ(b, ma->ref, ma->conf->capQ_thres);
if (q < 0) skip = 1;
else if (b->core.qual > q) b->core.qual = q;
} else if (b->core.flag&BAM_FUNMAP) skip = 1;
- else if (b->core.qual < ma->min_mq) skip = 1;
- else if ((ma->flag&MPLP_NO_ORPHAN) && (b->core.flag&1) && !(b->core.flag&2)) skip = 1;
+ else if (b->core.qual < ma->conf->min_mq) skip = 1;
+ else if ((ma->conf->flag&MPLP_NO_ORPHAN) && (b->core.flag&1) && !(b->core.flag&2)) skip = 1;
} while (skip);
return ret;
}
for (i = 0; i < n; ++i) {
bam_header_t *h_tmp;
data[i] = calloc(1, sizeof(mplp_aux_t));
- data[i]->min_mq = conf->min_mq;
- data[i]->flag = conf->flag;
- data[i]->capQ_thres = conf->capQ_thres;
data[i]->fp = strcmp(fn[i], "-") == 0? bam_dopen(fileno(stdin), "r") : bam_open(fn[i], "r");
+ data[i]->conf = conf;
h_tmp = bam_header_read(data[i]->fp);
bam_smpl_add(sm, fn[i], h_tmp->text);
rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list);
bh->l_smpl = s.l;
bh->sname = malloc(s.l);
memcpy(bh->sname, s.s, s.l);
- bh->l_txt = 0;
+ bh->txt = malloc(strlen(BAM_VERSION) + 64);
+ bh->l_txt = 1 + sprintf(bh->txt, "##samtoolsVersion=%s\n", BAM_VERSION);
free(s.s);
bcf_hdr_sync(bh);
bcf_hdr_write(bp, bh);
bcr = calloc(sm->n, sizeof(bcf_callret1_t));
bca->rghash = rghash;
bca->openQ = conf->openQ, bca->extQ = conf->extQ, bca->tandemQ = conf->tandemQ;
+ bca->min_frac = conf->min_frac;
+ bca->min_support = conf->min_support;
}
ref_tid = -1; ref = 0;
iter = bam_mplp_init(n, mplp_func, (void**)data);
}
#define MAX_PATH_LEN 1024
-int read_file_list(const char *file_list,int *n,char **argv[])
+static int read_file_list(const char *file_list,int *n,char **argv[])
{
char buf[MAX_PATH_LEN];
int len, nfiles;
int c;
const char *file_list = NULL;
char **fn = NULL;
- int nfiles = 0;
+ int nfiles = 0, use_orphan = 0;
mplp_conf_t mplp;
memset(&mplp, 0, sizeof(mplp_conf_t));
mplp.max_mq = 60;
mplp.capQ_thres = 0;
mplp.max_depth = 250;
mplp.openQ = 40; mplp.extQ = 20; mplp.tandemQ = 100;
+ mplp.min_frac = 0.002; mplp.min_support = 1;
mplp.flag = MPLP_NO_ORPHAN | MPLP_REALN;
- while ((c = getopt(argc, argv, "gf:r:l:M:q:Q:uaORC:BDSd:b:P:o:e:h:I")) >= 0) {
+ while ((c = getopt(argc, argv, "Agf:r:l:M:q:Q:uaRC:BDSd:b:P:o:e:h:Im:F:EG:")) >= 0) {
switch (c) {
case 'f':
mplp.fai = fai_load(optarg);
case 'g': mplp.flag |= MPLP_GLF; break;
case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_GLF; break;
case 'a': mplp.flag |= MPLP_NO_ORPHAN | MPLP_REALN; break;
- case 'B': mplp.flag &= ~MPLP_REALN & ~MPLP_NO_ORPHAN; break;
- case 'O': mplp.flag |= MPLP_NO_ORPHAN; break;
+ case 'B': mplp.flag &= ~MPLP_REALN; break;
case 'R': mplp.flag |= MPLP_REALN; break;
case 'D': mplp.flag |= MPLP_FMT_DP; break;
case 'S': mplp.flag |= MPLP_FMT_SP; break;
case 'I': mplp.flag |= MPLP_NO_INDEL; break;
+ case 'E': mplp.flag |= MPLP_EXT_BAQ; break;
case 'C': mplp.capQ_thres = atoi(optarg); break;
case 'M': mplp.max_mq = atoi(optarg); break;
case 'q': mplp.min_mq = atoi(optarg); break;
case 'o': mplp.openQ = atoi(optarg); break;
case 'e': mplp.extQ = atoi(optarg); break;
case 'h': mplp.tandemQ = atoi(optarg); break;
+ case 'A': use_orphan = 1; break;
+ case 'F': mplp.min_frac = atof(optarg); break;
+ case 'm': mplp.min_support = atoi(optarg); break;
+ case 'G': {
+ FILE *fp_rg;
+ char buf[1024];
+ mplp.rghash = bcf_str2id_init();
+ if ((fp_rg = fopen(optarg, "r")) == 0)
+ fprintf(stderr, "(%s) Fail to open file %s. Continue anyway.\n", __func__, optarg);
+ while (!feof(fp_rg) && fscanf(fp_rg, "%s", buf) > 0) // this is not a good style, but forgive me...
+ bcf_str2id_add(mplp.rghash, strdup(buf));
+ fclose(fp_rg);
+ }
+ break;
}
}
+ if (use_orphan) mplp.flag &= ~MPLP_NO_ORPHAN;
if (argc == 1) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: samtools mpileup [options] in1.bam [in2.bam [...]]\n\n");
fprintf(stderr, " -o INT Phred-scaled gap open sequencing error probability [%d]\n", mplp.openQ);
fprintf(stderr, " -e INT Phred-scaled gap extension seq error probability [%d]\n", mplp.extQ);
fprintf(stderr, " -h INT coefficient for homopolyer errors [%d]\n", mplp.tandemQ);
+ fprintf(stderr, " -m INT minimum gapped reads for indel candidates [%d]\n", mplp.min_support);
+ fprintf(stderr, " -F FLOAT minimum fraction of gapped reads for candidates [%g]\n", mplp.min_frac);
+ fprintf(stderr, " -G FILE exclude read groups listed in FILE [null]\n");
+ fprintf(stderr, " -A use anomalous read pairs in SNP/INDEL calling\n");
fprintf(stderr, " -g generate BCF output\n");
fprintf(stderr, " -u do not compress BCF output\n");
+ fprintf(stderr, " -E extended BAQ for higher sensitivity but lower specificity\n");
fprintf(stderr, " -B disable BAQ computation\n");
fprintf(stderr, " -D output per-sample DP\n");
fprintf(stderr, " -S output per-sample SP (strand bias P-value, slow)\n");
fprintf(stderr, "Notes: Assuming diploid individuals.\n\n");
return 1;
}
- if ( file_list )
- {
+ if (file_list) {
if ( read_file_list(file_list,&nfiles,&fn) ) return 1;
mpileup(&mplp,nfiles,fn);
for (c=0; c<nfiles; c++) free(fn[c]);
free(fn);
- }
- else
- mpileup(&mplp, argc - optind, argv + optind);
+ } else mpileup(&mplp, argc - optind, argv + optind);
+ if (mplp.rghash) bcf_str2id_thorough_destroy(mplp.rghash);
free(mplp.reg); free(mplp.pl_list);
if (mplp.fai) fai_destroy(mplp.fai);
return 0;
}
#ifdef _USE_KNETFILE
while ((len = knet_read(in->x.fpr, buf, BUF_SIZE)) > 0)
+ fwrite(buf, 1, len, fp->x.fpw);
#else
while (!feof(in->file) && (len = fread(buf, 1, BUF_SIZE, in->file)) > 0)
+ fwrite(buf, 1, len, fp->file);
#endif
- fwrite(buf, 1, len, fp->x.fpw);
free(buf);
fp->block_offset = in->block_offset = 0;
bgzf_close(fp);
ks_heapmake(heap, n, heap);
while (heap->pos != HEAP_EMPTY) {
bam1_t *b = heap->b;
- if ((flag & MERGE_RG) && bam_aux_get(b, "RG") == 0)
+ if (flag & MERGE_RG) {
+ uint8_t *rg = bam_aux_get(b, "RG");
+ if (rg) bam_aux_del(b, rg);
bam_aux_append(b, "RG", 'Z', RG_len[heap->i] + 1, (uint8_t*)RG[heap->i]);
+ }
bam_write1_core(fpout, &b->core, b->data_len, b->data);
if ((j = bam_iter_read(fp[heap->i], iter[heap->i], b)) >= 0) {
heap->pos = ((uint64_t)b->core.tid<<32) | (uint32_t)b->core.pos<<1 | bam1_strand(b);
{
tview_t *tv = (tview_t*)calloc(1, sizeof(tview_t));
tv->is_dot = 1;
- tv->idx = bam_index_load(fn);
- if (tv->idx == 0) exit(1);
tv->fp = bam_open(fn, "r");
bgzf_set_cache_size(tv->fp, 8 * 1024 *1024);
assert(tv->fp);
tv->header = bam_header_read(tv->fp);
+ tv->idx = bam_index_load(fn);
+ if (tv->idx == 0) exit(1);
tv->lplbuf = bam_lplbuf_init(tv_pl_func, tv);
if (fn_fa) tv->fai = fai_load(fn_fa);
tv->bmc = bam_maqcns_init();
#include "knetfile.h"
#endif
-#ifndef PACKAGE_VERSION
-#define PACKAGE_VERSION "0.1.12a (r862)"
-#endif
-
int bam_taf2baf(int argc, char *argv[]);
int bam_pileup(int argc, char *argv[]);
int bam_mpileup(int argc, char *argv[]);
int main_samview(int argc, char *argv[]);
int main_import(int argc, char *argv[]);
int main_reheader(int argc, char *argv[]);
+int main_cut_target(int argc, char *argv[]);
+int main_phase(int argc, char *argv[]);
int faidx_main(int argc, char *argv[]);
int glf3_view_main(int argc, char *argv[]);
{
fprintf(stderr, "\n");
fprintf(stderr, "Program: samtools (Tools for alignments in the SAM format)\n");
- fprintf(stderr, "Version: %s\n\n", PACKAGE_VERSION);
+ fprintf(stderr, "Version: %s\n\n", BAM_VERSION);
fprintf(stderr, "Usage: samtools <command> [options]\n\n");
fprintf(stderr, "Command: view SAM<->BAM conversion\n");
fprintf(stderr, " sort sort alignment file\n");
fprintf(stderr, " merge merge sorted alignments\n");
fprintf(stderr, " rmdup remove PCR duplicates\n");
fprintf(stderr, " reheader replace BAM header\n");
+ fprintf(stderr, " targetcut cut fosmid regions (for fosmid pool only)\n");
+ fprintf(stderr, " phase phase heterozygotes\n");
fprintf(stderr, "\n");
+#ifdef _WIN32
+ fprintf(stderr, "\
+Note: The Windows version of SAMtools is mainly designed for read-only\n\
+ operations, such as viewing the alignments and generating the pileup.\n\
+ Binary files generated by the Windows version may be buggy.\n\n");
+#endif
return 1;
}
else if (strcmp(argv[1], "calmd") == 0) return bam_fillmd(argc-1, argv+1);
else if (strcmp(argv[1], "fillmd") == 0) return bam_fillmd(argc-1, argv+1);
else if (strcmp(argv[1], "reheader") == 0) return main_reheader(argc-1, argv+1);
+ else if (strcmp(argv[1], "targetcut") == 0) return main_cut_target(argc-1, argv+1);
+ else if (strcmp(argv[1], "phase") == 0) return main_phase(argc-1, argv+1);
#if _CURSES_LIB != 0
else if (strcmp(argv[1], "tview") == 0) return bam_tview_main(argc-1, argv+1);
#endif
$(AR) -cru $@ $(LOBJS)
bcftools:lib $(AOBJS)
- $(CC) $(CFLAGS) -o $@ $(AOBJS) -lm $(LIBPATH) -lz -L. -lbcf
+ $(CC) $(CFLAGS) -o $@ $(AOBJS) -L. $(LIBPATH) -lbcf -lm -lz
bcf.o:bcf.h
vcf.o:bcf.h
+++ /dev/null
-#!/usr/bin/perl -w
-
-use strict;
-use warnings;
-use Carp;
-
-my $opts = parse_params();
-bcf_fix();
-
-exit;
-
-#--------------------------------
-
-sub error
-{
- my (@msg) = @_;
- if ( scalar @msg ) { confess @msg; }
- die
- "Usage: bcftools view test.bcf | bcf-fix.pl > test.vcf\n",
- "Options:\n",
- " -h, -?, --help This help message.\n",
- "\n";
-}
-
-
-sub parse_params
-{
- my $opts = {};
- while (my $arg=shift(@ARGV))
- {
- if ( $arg eq '-?' || $arg eq '-h' || $arg eq '--help' ) { error(); }
- error("Unknown parameter \"$arg\". Run -h for help.\n");
- }
- return $opts;
-}
-
-sub bcf_fix
-{
- while (my $line=<STDIN>)
- {
- if ( $line=~/^#CHROM/ )
- {
- print
-qq[##INFO=<ID=DP4,Number=4,Type=Integer,Description="Read depth for 1) forward reference bases; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref">
-##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for 1) strand bias (exact test); 2) baseQ bias (t-test); 3) mapQ bias (t); 4) tail distance bias (t)">
-##INFO=<ID=AF1,Number=1,Type=Float,Description="EM estimate of site allele frequency without prior">
-##INFO=<ID=AFE,Number=1,Type=Float,Description="Posterior expectation of site allele frequency (with prior)">
-##INFO=<ID=HWE,Number=1,Type=Float,Description="P-value for Hardy-Weinberg equillibrium (chi-square test)">
-##INFO=<ID=MQ,Number=1,Type=Integer,Descriptin="RMS mapping quality">
-##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
-##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
-##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
-];
- print $line;
- }
- elsif ( $line=~/^#/ )
- {
- print $line;
- }
- else
- {
- my @items = split(/\t/,$line);
- my @tags = split(/:/,$items[8]); # FORMAT tags
-
- my $nidx=2;
- my @idxs; # Mapping which defines new ordering: $idxs[$inew]=$iold; GT comes first, PL second
- for (my $i=0; $i<@tags; $i++)
- {
- if ( $tags[$i] eq 'GT' ) { $idxs[0]=$i; }
- elsif ( $tags[$i] eq 'PL' ) { $idxs[1]=$i; }
- else { $idxs[$nidx++]=$i; }
- }
- if ( !exists($tags[0]) or !exists($tags[1]) ) { error("FIXME: expected GT and PL in the format field.\n"); }
-
- # First fix the FORMAT column
- $items[8] = 'GT:GL';
- for (my $i=2; $i<@tags; $i++)
- {
- $items[8] .= ':'.$tags[$idxs[$i]];
- }
-
- # Now all the genotype columns
- for (my $iitem=9; $iitem<@items; $iitem++)
- {
- @tags = split(/:/,$items[$iitem]);
- $items[$iitem] = $tags[$idxs[0]] .':';
-
- # GL=-PL/10
- my ($a,$b,$c) = split(/,/,$tags[$idxs[1]]);
- $items[$iitem] .= sprintf "%.2f,%.2f,%.2f",-$a/10.,-$b/10.,-$c/10.;
-
- for (my $itag=2; $itag<@tags; $itag++)
- {
- $items[$iitem] .= ':'.$tags[$idxs[$itag]];
- }
- }
- print join("\t",@items);
- }
- }
-}
-
for (p = b->str, n = 0; p < b->str + b->l_str; ++p)
if (*p == 0 && p+1 != b->str + b->l_str) tmp[n++] = p + 1;
if (n != 5) {
- fprintf(stderr, "[%s] incorrect number of fields (%d != 5). Corrupted file?\n", __func__, n);
+ fprintf(stderr, "[%s] incorrect number of fields (%d != 5) at %d:%d\n", __func__, n, b->tid, b->pos);
return -1;
}
b->ref = tmp[0]; b->alt = tmp[1]; b->flt = tmp[2]; b->info = tmp[3]; b->fmt = tmp[4];
b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2;
} else if (b->gi[i].fmt == bcf_str2int("DP", 2) || b->gi[i].fmt == bcf_str2int("HQ", 2)) {
b->gi[i].len = 2;
- } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("GT", 2)
- || b->gi[i].fmt == bcf_str2int("SP", 2))
- {
+ } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("GT", 2)) {
b->gi[i].len = 1;
+ } else if (b->gi[i].fmt == bcf_str2int("SP", 2)) {
+ b->gi[i].len = 4;
} else if (b->gi[i].fmt == bcf_str2int("GL", 2)) {
b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2 * 4;
}
}
} else if (b->gi[i].fmt == bcf_str2int("DP", 2)) {
kputw(((uint16_t*)b->gi[i].data)[j], s);
- } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) {
+ } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) {
kputw(((uint8_t*)b->gi[i].data)[j], s);
+ } else if (b->gi[i].fmt == bcf_str2int("SP", 2)) {
+ kputw(((int32_t*)b->gi[i].data)[j], s);
} else if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
int y = ((uint8_t*)b->gi[i].data)[j];
if (y>>7&1) {
if (k > 0) kputc(',', s);
ksprintf(s, "%.2f", d[k]);
}
- }
+ } else kputc('.', s); // custom fields
}
}
}
int vcf_close(bcf_t *bp);
// read the VCF/BCF header
bcf_hdr_t *vcf_hdr_read(bcf_t *bp);
+ // read the sequence dictionary from a separate file; required for VCF->BCF conversion
+ int vcf_dictread(bcf_t *bp, bcf_hdr_t *h, const char *fn);
// read a VCF/BCF record; return -1 on end-of-file and <-1 for errors
int vcf_read(bcf_t *bp, bcf_hdr_t *h, bcf1_t *b);
// write the VCF header
int bcf_gl2pl(bcf1_t *b);
// if the site is an indel
int bcf_is_indel(const bcf1_t *b);
+ bcf_hdr_t *bcf_hdr_subsam(const bcf_hdr_t *h0, int n, char *const* samples, int *list);
+ int bcf_subsam(int n_smpl, int *list, bcf1_t *b);
// string hash table
void *bcf_build_refhash(bcf_hdr_t *h);
void bcf_str2id_destroy(void *_hash);
+ void bcf_str2id_thorough_destroy(void *_hash);
int bcf_str2id_add(void *_hash, const char *str);
int bcf_str2id(void *_hash, const char *str);
void *bcf_str2id_init();
\begin{tabular}{|l|l|l|l|l|}
\hline
\multicolumn{2}{|c|}{\bf Field} & \multicolumn{1}{c|}{\bf Descrption} & \multicolumn{1}{c|}{\bf Type} & \multicolumn{1}{c|}{\bf Value} \\\hline\hline
-\multicolumn{2}{|l|}{\tt magic} & Magic string & {\tt char[4]} & {\tt BCF\char92 4} \\\hline
-\multicolumn{2}{|l|}{\tt l\_nm} & Length of concatenated sequence names & {\tt int32\_t} & \\\hline
-\multicolumn{2}{|l|}{\tt name} & Concatenated names, {\tt NULL} padded & {\tt char[l\_nm]} & \\\hline
-\multicolumn{2}{|l|}{\tt l\_smpl} & Length of concatenated sample names & {\tt int32\_t} & \\\hline
-\multicolumn{2}{|l|}{\tt sname} & Concatenated sample names & {\tt char[l\_smpl]} & \\\hline
-\multicolumn{2}{|l|}{\tt l\_txt} & Length of the meta text (double-hash lines)& {\tt int32\_t} & \\\hline
-\multicolumn{2}{|l|}{\tt text} & Meta text, {\tt NULL} terminated & {\tt char[l\_txt]} & \\\hline
+\multicolumn{2}{|l|}{\sf magic} & Magic string & {\tt char[4]} & {\tt BCF\char92 4} \\\hline
+\multicolumn{2}{|l|}{\sf l\_seqnm} & Length of concatenated sequence names & {\tt int32\_t} & \\\hline
+\multicolumn{2}{|l|}{\sf seqnm} & Concatenated names, {\tt NULL} padded & {\tt char[{\sf l\_seqnm}]} & \\\hline
+\multicolumn{2}{|l|}{\sf l\_smpl} & Length of concatenated sample names & {\tt int32\_t} & \\\hline
+\multicolumn{2}{|l|}{\sf smpl} & Concatenated sample names & {\tt char[{\sf l\_smpl}]} & \\\hline
+\multicolumn{2}{|l|}{\sf l\_meta} & Length of the meta text (double-hash lines)& {\tt int32\_t} & \\\hline
+\multicolumn{2}{|l|}{\sf meta} & Meta text, {\tt NULL} terminated & {\tt char[{\sf l\_meta}]} & \\\hline
\multicolumn{5}{|c|}{\it \color{gray}{List of records until the end of the file}}\\\cline{2-5}
-& {\tt seq\_id} & Reference sequence ID & {\tt int32\_t} & \\\cline{2-5}
-& {\tt pos} & Position & {\tt int32\_t} & \\\cline{2-5}
-& {\tt qual} & Variant quality & {\tt float} & \\\cline{2-5}
-& {\tt l\_str} & Length of str & {\tt int32\_t} & \\\cline{2-5}
-& {\tt str} & {\tt ID+REF+ALT+FILTER+INFO+FORMAT}, {\tt NULL} padded & {\tt char[slen]} &\\\cline{2-5}
+& {\sf seq\_id} & Reference sequence ID & {\tt int32\_t} & \\\cline{2-5}
+& {\sf pos} & Position & {\tt int32\_t} & \\\cline{2-5}
+& {\sf qual} & Variant quality & {\tt float} & \\\cline{2-5}
+& {\sf l\_str} & Length of {\sf str} & {\tt int32\_t} & \\\cline{2-5}
+& {\sf str} & {\tt ID+REF+ALT+FILTER+INFO+FORMAT}, {\tt NULL} padded & {\tt char[{\sf l\_str}]} &\\\cline{2-5}
& \multicolumn{4}{c|}{Blocks of data; \#blocks and formats defined by {\tt FORMAT} (table below)}\\
\hline
\end{tabular}
\hline
\multicolumn{1}{l}{\bf Field} & \multicolumn{1}{l}{\bf Type} & \multicolumn{1}{l}{\bf Description} \\\hline
{\tt DP} & {\tt uint16\_t[n]} & Read depth \\
-{\tt GL} & {\tt float[n*x]} & Log10 likelihood of data; $x=\frac{m(m+1)}{2}$, $m=\#\{alleles\}$\\
-{\tt GT} & {\tt uint8\_t[n]} & {\tt phase\char60\char60 6 | allele1\char60\char60 3 | allele2} \\
+{\tt GL} & {\tt float[n*G]} & Log10 likelihood of data; $G=\frac{A(A+1)}{2}$, $A=\#\{alleles\}$\\
+{\tt GT} & {\tt uint8\_t[n]} & {\tt missing\char60\char60 7 | phased\char60\char60 6 | allele1\char60\char60 3 | allele2} \\
+{\tt \_GT} & {\tt uint8\_t+uint8\_t[n*P]} & {Generic GT; the first int equals the max ploidy $P$} \\
{\tt GQ} & {\tt uint8\_t[n]} & {Genotype quality}\\
{\tt HQ} & {\tt uint8\_t[n*2]} & {Haplotype quality}\\
-{\tt PL} & {\tt uint8\_t[n*x]} & {Phred-scaled likelihood of data}\\
-\emph{misc} & {\tt int32\_t+char*} & {\tt NULL} padded concatenated strings (integer equal to the length) \\
+{\tt \_HQ} & {\tt uint8\_t+uint8\_t[n*P]} & {Generic HQ}\\
+{\tt IBD} & {\tt uint32\_t[n*2]} & {IBD}\\
+{\tt \_IBD} & {\tt uint8\_t+uint32\_t[n*P]} & {Generic IBD}\\
+{\tt PL} & {\tt uint8\_t[n*G]} & {Phred-scaled likelihood of data}\\
+{\tt PS} & {\tt uint32\_t[n]} & {Phase set}\\
+%{\tt SP} & {\tt uint8\_t[n]} & {Strand bias P-value (bcftools only)}\\
+\emph{Integer} & {\tt int32\_t[n*X]} & {Fix-sized custom Integer; $X$ defined in the header}\\
+\emph{Numeric} & {\tt double[n*X]} & {Fix-sized custom Numeric}\\
+\emph{String} & {\tt uint32\_t+char*} & {\tt NULL} padded concat. strings (int equals to the length) \\
\hline
\end{tabular}
\end{center}
\begin{itemize}
-\item The file is {\tt BGZF} compressed.
-\item All integers are little-endian.
+\item A BCF file is in the {\tt BGZF} format.
+\item All multi-byte numbers are little-endian.
\item In a string, a missing value `.' is an empty C string ``{\tt
\char92 0}'' (not ``{\tt .\char92 0}'')
\item For {\tt GL} and {\tt PL}, likelihoods of genotypes appear in the
order of alleles in {\tt REF} and then {\tt ALT}. For example, if {\tt
REF=C}, {\tt ALT=T,A}, likelihoods appear in the order of {\tt
- CC,CT,CA,TT,TA,AA}.
-\item {\tt GL} is an extension to and is backward compatible with the
- {\tt GL} genotype field in {\tt VCFv4.0}.
+ CC,CT,TT,CA,TA,AA} (NB: the ordering is different from the one in the original
+ BCF proposal).
+\item Predefined {\tt FORMAT} fields can be missing from VCF headers, but custom {\tt FORMAT} fields
+ are required to be explicitly defined in the headers.
+\item A {\tt FORMAT} field with its name starting with `{\tt \_}' gives an alternative
+ binary representation of the corresponding VCF field. The alternative representation
+ is used when the default representation is unable to keep the genotype information,
+ for example, when the ploidy is over 2 or there are more than 8 alleles.
\end{itemize}
-\end{document}
\ No newline at end of file
+\end{document}
for (k = j = 0; k < 4; ++k) {
for (l = k; l < 4; ++l) {
int t, x = map[k], y = map[l];
- if (x > y) t = x, x = y, y = t;
- g[j++] = p[x * b->n_alleles - x * (x-1) / 2 + (y - x)];
+ if (x > y) t = x, x = y, y = t; // swap
+ g[j++] = p[y * (y+1) / 2 + x];
}
}
printf("%s\t%d\t%c", h->ns[b->tid], b->pos+1, *b->ref);
+#include <string.h>
+#include <math.h>
#include "bcf.h"
#include "kstring.h"
#include "khash.h"
if (hash) kh_destroy(str2id, hash); // Note that strings are not freed.
}
+void bcf_str2id_thorough_destroy(void *_hash)
+{
+ khash_t(str2id) *hash = (khash_t(str2id)*)_hash;
+ khint_t k;
+ if (hash == 0) return;
+ for (k = 0; k < kh_end(hash); ++k)
+ if (kh_exist(hash, k)) free((char*)kh_key(hash, k));
+ kh_destroy(str2id, hash);
+}
+
int bcf_str2id(void *_hash, const char *str)
{
khash_t(str2id) *hash = (khash_t(str2id)*)_hash;
int bcf_shrink_alt(bcf1_t *b, int n)
{
char *p;
- int i, j, k, *z, n_smpl = b->n_smpl;
+ int i, j, k, n_smpl = b->n_smpl;
if (b->n_alleles <= n) return -1;
+ // update ALT
if (n > 1) {
for (p = b->alt, k = 1; *p; ++p)
if (*p == ',' && ++k == n) break;
++p;
memmove(p, b->flt, b->str + b->l_str - b->flt);
b->l_str -= b->flt - p;
- z = alloca(sizeof(int) / 2 * n * (n+1));
- for (i = k = 0; i < n; ++i)
- for (j = 0; j < n - i; ++j)
- z[k++] = i * b->n_alleles + j;
+ // update PL
for (i = 0; i < b->n_gi; ++i) {
bcf_ginfo_t *g = b->gi + i;
if (g->fmt == bcf_str2int("PL", 2)) {
g->len = n * (n + 1) / 2;
for (l = k = 0; l < n_smpl; ++l) {
uint8_t *dl = d + l * x;
- for (j = 0; j < g->len; ++j) d[k++] = dl[z[j]];
+ for (j = 0; j < g->len; ++j) d[k++] = dl[j];
}
} // FIXME: to add GL
}
}
return 0;
}
+/* FIXME: this function will fail given AB:GTX:GT. BCFtools never
+ * produces such FMT, but others may do. */
+int bcf_fix_gt(bcf1_t *b)
+{
+ char *s;
+ int i;
+ uint32_t tmp;
+ bcf_ginfo_t gt;
+ // check the presence of the GT FMT
+ if ((s = strstr(b->fmt, ":GT")) == 0) return 0; // no GT or GT is already the first
+ if (s[3] != '\0' && s[3] != ':') return 0; // :GTX in fact
+ tmp = bcf_str2int("GT", 2);
+ for (i = 0; i < b->n_gi; ++i)
+ if (b->gi[i].fmt == tmp) break;
+ if (i == b->n_gi) return 0; // no GT in b->gi; probably a bug...
+ gt = b->gi[i];
+ // move GT to the first
+ for (; i > 0; --i) b->gi[i] = b->gi[i-1];
+ b->gi[0] = gt;
+ memmove(b->fmt + 3, b->fmt, s + 1 - b->fmt);
+ b->fmt[0] = 'G'; b->fmt[1] = 'T'; b->fmt[2] = ':';
+ return 0;
+}
+
+static void *locate_field(const bcf1_t *b, const char *fmt, int l)
+{
+ int i;
+ uint32_t tmp;
+ tmp = bcf_str2int(fmt, l);
+ for (i = 0; i < b->n_gi; ++i)
+ if (b->gi[i].fmt == tmp) break;
+ return i == b->n_gi? 0 : b->gi[i].data;
+}
+
+int bcf_anno_max(bcf1_t *b)
+{
+ int k, max_gq, max_sp, n_het;
+ kstring_t str;
+ uint8_t *gt, *gq;
+ int32_t *sp;
+ max_gq = max_sp = n_het = 0;
+ gt = locate_field(b, "GT", 2);
+ if (gt == 0) return -1;
+ gq = locate_field(b, "GQ", 2);
+ sp = locate_field(b, "SP", 2);
+ if (sp)
+ for (k = 0; k < b->n_smpl; ++k)
+ if (gt[k]&0x3f)
+ max_sp = max_sp > (int)sp[k]? max_sp : sp[k];
+ if (gq)
+ for (k = 0; k < b->n_smpl; ++k)
+ if (gt[k]&0x3f)
+ max_gq = max_gq > (int)gq[k]? max_gq : gq[k];
+ for (k = 0; k < b->n_smpl; ++k) {
+ int a1, a2;
+ a1 = gt[k]&7; a2 = gt[k]>>3&7;
+ if ((!a1 && a2) || (!a2 && a1)) { // a het
+ if (gq == 0) ++n_het;
+ else if (gq[k] >= 20) ++n_het;
+ }
+ }
+ if (n_het) max_sp -= (int)(4.343 * log(n_het) + .499);
+ if (max_sp < 0) max_sp = 0;
+ memset(&str, 0, sizeof(kstring_t));
+ if (*b->info) kputc(';', &str);
+ ksprintf(&str, "MXSP=%d;MXGQ=%d", max_sp, max_gq);
+ bcf_append_info(b, str.s, str.l);
+ free(str.s);
+ return 0;
+}
+
+bcf_hdr_t *bcf_hdr_subsam(const bcf_hdr_t *h0, int n, char *const* samples, int *list)
+{
+ int i, ret, j;
+ khint_t k;
+ bcf_hdr_t *h;
+ khash_t(str2id) *hash;
+ kstring_t s;
+ s.l = s.m = 0; s.s = 0;
+ hash = kh_init(str2id);
+ for (i = 0; i < n; ++i)
+ k = kh_put(str2id, hash, samples[i], &ret);
+ for (i = j = 0; i < h0->n_smpl; ++i) {
+ if (kh_get(str2id, hash, h0->sns[i]) != kh_end(hash)) {
+ list[j++] = i;
+ kputs(h0->sns[i], &s); kputc('\0', &s);
+ }
+ }
+ if (j < n) fprintf(stderr, "<%s> %d samples in the list but not in BCF.", __func__, n - j);
+ kh_destroy(str2id, hash);
+ h = calloc(1, sizeof(bcf_hdr_t));
+ *h = *h0;
+ h->ns = 0; h->sns = 0;
+ h->name = malloc(h->l_nm); memcpy(h->name, h0->name, h->l_nm);
+ h->txt = calloc(1, h->l_txt + 1); memcpy(h->txt, h0->txt, h->l_txt);
+ h->l_smpl = s.l; h->sname = s.s;
+ bcf_hdr_sync(h);
+ return h;
+}
+
+int bcf_subsam(int n_smpl, int *list, bcf1_t *b) // list MUST BE sorted
+{
+ int i, j;
+ for (j = 0; j < b->n_gi; ++j) {
+ bcf_ginfo_t *gi = b->gi + j;
+ for (i = 0; i < n_smpl; ++i)
+ memcpy((uint8_t*)gi->data + i * gi->len, (uint8_t*)gi->data + list[i] * gi->len, gi->len);
+ }
+ b->n_smpl = n_smpl;
+ return 0;
+}
#define VC_CALL_GT 2048
#define VC_ADJLD 4096
#define VC_NO_INDEL 8192
-#define VC_FOLDED 16384
+#define VC_ANNO_MAX 16384
typedef struct {
- int flag, prior_type, n1;
- char *fn_list, *prior_file;
+ int flag, prior_type, n1, n_sub, *sublist;
+ char *fn_list, *prior_file, **subsam, *fn_dict;
double theta, pref, indel_frac;
} viewconf_t;
{
kstring_t s;
int is_var = (pr->p_ref < pref);
- double p_hwe, r = is_var? pr->p_ref : 1. - pr->p_ref;
+ double p_hwe, r = is_var? pr->p_ref : pr->p_var, fq;
anno16_t a;
p_hwe = pr->g[0] >= 0.? test_hwe(pr->g) : 1.0; // only do HWE g[] is calculated
kputs(b->info, &s);
if (b->info[0]) kputc(';', &s);
// ksprintf(&s, "AF1=%.4lg;AFE=%.4lg;CI95=%.4lg,%.4lg", 1.-pr->f_em, 1.-pr->f_exp, pr->cil, pr->cih);
- ksprintf(&s, "AF1=%.4lg;CI95=%.4lg,%.4lg", 1.-pr->f_em, pr->cil, pr->cih);
+ ksprintf(&s, "AF1=%.4g;CI95=%.4g,%.4g", 1.-pr->f_em, pr->cil, pr->cih);
ksprintf(&s, ";DP4=%d,%d,%d,%d;MQ=%d", a.d[0], a.d[1], a.d[2], a.d[3], a.mq);
+ fq = pr->p_ref_folded < 0.5? -4.343 * log(pr->p_ref_folded) : 4.343 * log(pr->p_var_folded);
+ if (fq < -999) fq = -999;
+ if (fq > 999) fq = 999;
+ ksprintf(&s, ";FQ=%.3g", fq);
if (a.is_tested) {
- if (pr->pc[0] >= 0.) ksprintf(&s, ";PC4=%lg,%lg,%lg,%lg", pr->pc[0], pr->pc[1], pr->pc[2], pr->pc[3]);
- ksprintf(&s, ";PV4=%.2lg,%.2lg,%.2lg,%.2lg", a.p[0], a.p[1], a.p[2], a.p[3]);
+ if (pr->pc[0] >= 0.) ksprintf(&s, ";PC4=%g,%g,%g,%g", pr->pc[0], pr->pc[1], pr->pc[2], pr->pc[3]);
+ ksprintf(&s, ";PV4=%.2g,%.2g,%.2g,%.2g", a.p[0], a.p[1], a.p[2], a.p[3]);
}
if (pr->g[0] >= 0. && p_hwe <= .2)
- ksprintf(&s, ";GC=%.2lf,%.2lf,%.2lf;HWE=%.3lf", pr->g[2], pr->g[1], pr->g[0], p_hwe);
+ ksprintf(&s, ";GC=%.2f,%.2f,%.2f;HWE=%.3f", pr->g[2], pr->g[1], pr->g[0], p_hwe);
kputc('\0', &s);
kputs(b->fmt, &s); kputc('\0', &s);
free(b->str);
b->m_str = s.m; b->l_str = s.l; b->str = s.s;
- b->qual = r < 1e-100? 99 : -4.343 * log(r);
- if (b->qual > 99) b->qual = 99;
+ b->qual = r < 1e-100? 999 : -4.343 * log(r);
+ if (b->qual > 999) b->qual = 999;
bcf_sync(b);
if (!is_var) bcf_shrink_alt(b, 1);
else if (!(flag&VC_KEEPALT))
return is_var;
}
+static char **read_samples(const char *fn, int *_n)
+{
+ gzFile fp;
+ kstream_t *ks;
+ kstring_t s;
+ int dret, n = 0, max = 0;
+ char **sam = 0;
+ *_n = 0;
+ s.l = s.m = 0; s.s = 0;
+ fp = gzopen(fn, "r");
+ if (fp == 0) return 0; // fail to open file
+ ks = ks_init(fp);
+ while (ks_getuntil(ks, 0, &s, &dret) >= 0) {
+ if (max == n) {
+ max = max? max<<1 : 4;
+ sam = realloc(sam, sizeof(void*)*max);
+ }
+ sam[n++] = strdup(s.s);
+ }
+ ks_destroy(ks);
+ gzclose(fp);
+ free(s.s);
+ *_n = n;
+ return sam;
+}
+
+static void write_header(bcf_hdr_t *h)
+{
+ kstring_t str;
+ str.l = h->l_txt? h->l_txt - 1 : 0;
+ str.m = str.l + 1; str.s = h->txt;
+ if (!strstr(str.s, "##INFO=<ID=DP,"))
+ kputs("##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Raw read depth\">\n", &str);
+ if (!strstr(str.s, "##INFO=<ID=DP4,"))
+ kputs("##INFO=<ID=DP4,Number=4,Type=Integer,Description=\"# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases\">\n", &str);
+ if (!strstr(str.s, "##INFO=<ID=MQ,"))
+ kputs("##INFO=<ID=MQ,Number=1,Type=Integer,Description=\"Root-mean-square mapping quality of covering reads\">\n", &str);
+ if (!strstr(str.s, "##INFO=<ID=FQ,"))
+ kputs("##INFO=<ID=FQ,Number=1,Type=Float,Description=\"Phred probability that sample chromosomes are not all the same\">\n", &str);
+ if (!strstr(str.s, "##INFO=<ID=AF1,"))
+ kputs("##INFO=<ID=AF1,Number=1,Type=Float,Description=\"Max-likelihood estimate of the site allele frequency of the first ALT allele\">\n", &str);
+ if (!strstr(str.s, "##INFO=<ID=CI95,"))
+ kputs("##INFO=<ID=CI95,Number=2,Type=Float,Description=\"Equal-tail Bayesian credible interval of the site allele frequency at the 95% level\">\n", &str);
+ if (!strstr(str.s, "##INFO=<ID=PV4,"))
+ kputs("##INFO=<ID=PV4,Number=4,Type=Float,Description=\"P-values for strand bias, baseQ bias, mapQ bias and tail distance bias\">\n", &str);
+ if (!strstr(str.s, "##INFO=<ID=INDEL,"))
+ kputs("##INFO=<ID=INDEL,Number=0,Type=Flag,Description=\"Indicates that the variant is an INDEL.\">\n", &str);
+ if (!strstr(str.s, "##INFO=<ID=GT,"))
+ kputs("##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n", &str);
+ if (!strstr(str.s, "##FORMAT=<ID=GQ,"))
+ kputs("##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">\n", &str);
+ if (!strstr(str.s, "##FORMAT=<ID=GL,"))
+ kputs("##FORMAT=<ID=GL,Number=3,Type=Float,Description=\"Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)\">\n", &str);
+ if (!strstr(str.s, "##FORMAT=<ID=DP,"))
+ kputs("##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"# high-quality bases\">\n", &str);
+ if (!strstr(str.s, "##FORMAT=<ID=SP,"))
+ kputs("##FORMAT=<ID=SP,Number=1,Type=Integer,Description=\"Phred-scaled strand bias P-value\">\n", &str);
+ if (!strstr(str.s, "##FORMAT=<ID=PL,"))
+ kputs("##FORMAT=<ID=PL,Number=-1,Type=Integer,Description=\"List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2\">\n", &str);
+ h->l_txt = str.l + 1; h->txt = str.s;
+}
+
double bcf_ld_freq(const bcf1_t *b0, const bcf1_t *b1, double f[4]);
int bcfview(int argc, char *argv[])
{
extern int bcf_2qcall(bcf_hdr_t *h, bcf1_t *b);
extern void bcf_p1_indel_prior(bcf_p1aux_t *ma, double x);
+ extern int bcf_fix_gt(bcf1_t *b);
+ extern int bcf_anno_max(bcf1_t *b);
bcf_t *bp, *bout = 0;
bcf1_t *b, *blast;
int c;
uint64_t n_processed = 0;
viewconf_t vc;
bcf_p1aux_t *p1 = 0;
- bcf_hdr_t *h;
+ bcf_hdr_t *hin, *hout;
int tid, begin, end;
char moder[4], modew[4];
khash_t(set64) *hash = 0;
tid = begin = end = -1;
memset(&vc, 0, sizeof(viewconf_t));
vc.prior_type = vc.n1 = -1; vc.theta = 1e-3; vc.pref = 0.5; vc.indel_frac = -1.;
- while ((c = getopt(argc, argv, "fN1:l:cHAGvbSuP:t:p:QgLi:I")) >= 0) {
+ vc.n_sub = 0; vc.subsam = 0; vc.sublist = 0;
+ while ((c = getopt(argc, argv, "N1:l:cHAGvbSuP:t:p:QgLi:IMs:D:")) >= 0) {
switch (c) {
- case 'f': vc.flag |= VC_FOLDED; break;
case '1': vc.n1 = atoi(optarg); break;
case 'l': vc.fn_list = strdup(optarg); break;
+ case 'D': vc.fn_dict = strdup(optarg); break;
case 'N': vc.flag |= VC_ACGT_ONLY; break;
case 'G': vc.flag |= VC_NO_GENO; break;
case 'A': vc.flag |= VC_KEEPALT; break;
case 'H': vc.flag |= VC_HWE; break;
case 'g': vc.flag |= VC_CALL_GT | VC_CALL; break;
case 'I': vc.flag |= VC_NO_INDEL; break;
+ case 'M': vc.flag |= VC_ANNO_MAX; break;
case 't': vc.theta = atof(optarg); break;
case 'p': vc.pref = atof(optarg); break;
case 'i': vc.indel_frac = atof(optarg); break;
case 'Q': vc.flag |= VC_QCALL; break;
case 'L': vc.flag |= VC_ADJLD; break;
+ case 's': vc.subsam = read_samples(optarg, &vc.n_sub); break;
case 'P':
if (strcmp(optarg, "full") == 0) vc.prior_type = MC_PTYPE_FULL;
else if (strcmp(optarg, "cond2") == 0) vc.prior_type = MC_PTYPE_COND2;
fprintf(stderr, " -Q output the QCALL likelihood format\n");
fprintf(stderr, " -L calculate LD for adjacent sites\n");
fprintf(stderr, " -I skip indels\n");
- fprintf(stderr, " -f reference-free variant calling\n");
+ fprintf(stderr, " -D FILE sequence dictionary for VCF->BCF conversion [null]\n");
fprintf(stderr, " -1 INT number of group-1 samples [0]\n");
fprintf(stderr, " -l FILE list of sites to output [all sites]\n");
- fprintf(stderr, " -t FLOAT scaled substitution mutation rate [%.4lg]\n", vc.theta);
- fprintf(stderr, " -i FLOAT indel-to-substitution ratio [%.4lg]\n", vc.indel_frac);
- fprintf(stderr, " -p FLOAT variant if P(ref|D)<FLOAT [%.3lg]\n", vc.pref);
+ fprintf(stderr, " -t FLOAT scaled substitution mutation rate [%.4g]\n", vc.theta);
+ fprintf(stderr, " -i FLOAT indel-to-substitution ratio [%.4g]\n", vc.indel_frac);
+ fprintf(stderr, " -p FLOAT variant if P(ref|D)<FLOAT [%.3g]\n", vc.pref);
fprintf(stderr, " -P STR type of prior: full, cond2, flat [full]\n");
+ fprintf(stderr, " -s FILE list of samples to use [all samples]\n");
fprintf(stderr, "\n");
return 1;
}
+ if ((vc.flag & VC_VCFIN) && (vc.flag & VC_BCFOUT) && vc.fn_dict == 0) {
+ fprintf(stderr, "[%s] For VCF->BCF conversion please specify the sequence dictionary with -D\n", __func__);
+ return 1;
+ }
b = calloc(1, sizeof(bcf1_t));
blast = calloc(1, sizeof(bcf1_t));
strcpy(moder, "r");
if (vc.flag & VC_BCFOUT) strcat(modew, "b");
if (vc.flag & VC_UNCOMP) strcat(modew, "u");
bp = vcf_open(argv[optind], moder);
- h = vcf_hdr_read(bp);
+ hin = hout = vcf_hdr_read(bp);
+ if (vc.fn_dict && (vc.flag & VC_VCFIN))
+ vcf_dictread(bp, hin, vc.fn_dict);
bout = vcf_open("-", modew);
- if (!(vc.flag & VC_QCALL)) vcf_hdr_write(bout, h);
+ if (!(vc.flag & VC_QCALL)) {
+ if (vc.n_sub) {
+ vc.sublist = calloc(vc.n_sub, sizeof(int));
+ hout = bcf_hdr_subsam(hin, vc.n_sub, vc.subsam, vc.sublist);
+ }
+ if (vc.flag & VC_CALL) write_header(hout);
+ vcf_hdr_write(bout, hout);
+ }
if (vc.flag & VC_CALL) {
- p1 = bcf_p1_init(h->n_smpl);
+ p1 = bcf_p1_init(hout->n_smpl);
if (vc.prior_file) {
if (bcf_p1_read_prior(p1, vc.prior_file) < 0) {
fprintf(stderr, "[%s] fail to read the prior AFS.\n", __func__);
bcf_p1_init_subprior(p1, vc.prior_type, vc.theta);
}
if (vc.indel_frac > 0.) bcf_p1_indel_prior(p1, vc.indel_frac); // otherwise use the default indel_frac
- if (vc.flag & VC_FOLDED) bcf_p1_set_folded(p1);
}
- if (vc.fn_list) hash = bcf_load_pos(vc.fn_list, h);
+ if (vc.fn_list) hash = bcf_load_pos(vc.fn_list, hin);
if (optind + 1 < argc && !(vc.flag&VC_VCFIN)) {
- void *str2id = bcf_build_refhash(h);
+ void *str2id = bcf_build_refhash(hout);
if (bcf_parse_region(str2id, argv[optind+1], &tid, &begin, &end) >= 0) {
bcf_idx_t *idx;
idx = bcf_idx_load(argv[optind]);
}
}
}
- while (vcf_read(bp, h, b) > 0) {
- int is_indel = bcf_is_indel(b);
+ while (vcf_read(bp, hin, b) > 0) {
+ int is_indel;
+ if (vc.n_sub) bcf_subsam(vc.n_sub, vc.sublist, b);
+ is_indel = bcf_is_indel(b);
if ((vc.flag & VC_NO_INDEL) && is_indel) continue;
if ((vc.flag & VC_ACGT_ONLY) && !is_indel) {
int x;
}
++n_processed;
if (vc.flag & VC_QCALL) { // output QCALL format; STOP here
- bcf_2qcall(h, b);
+ bcf_2qcall(hout, b);
continue;
}
if (vc.flag & (VC_CALL|VC_ADJLD)) bcf_gl2pl(b);
bcf_p1_dump_afs(p1);
}
if (pr.p_ref >= vc.pref && (vc.flag & VC_VARONLY)) continue;
- update_bcf1(h->n_smpl, b, p1, &pr, vc.pref, vc.flag);
+ update_bcf1(hout->n_smpl, b, p1, &pr, vc.pref, vc.flag);
}
if (vc.flag & VC_ADJLD) { // compute LD
double f[4], r2;
kstring_t s;
s.m = s.l = 0; s.s = 0;
if (*b->info) kputc(';', &s);
- ksprintf(&s, "NEIR=%.3lf;NEIF=%.3lf,%.3lf", r2, f[0]+f[2], f[0]+f[1]);
+ ksprintf(&s, "NEIR=%.3f;NEIF=%.3f,%.3f", r2, f[0]+f[2], f[0]+f[1]);
bcf_append_info(b, s.s, s.l);
free(s.s);
}
bcf_cpy(blast, b);
}
+ if (vc.flag & VC_ANNO_MAX) bcf_anno_max(b);
if (vc.flag & VC_NO_GENO) { // do not output GENO fields
b->n_gi = 0;
b->fmt[0] = '\0';
- }
- vcf_write(bout, h, b);
+ b->l_str = b->fmt - b->str + 1;
+ } else bcf_fix_gt(b);
+ vcf_write(bout, hout, b);
}
if (vc.prior_file) free(vc.prior_file);
if (vc.flag & VC_CALL) bcf_p1_dump_afs(p1);
- bcf_hdr_destroy(h);
+ if (hin != hout) bcf_hdr_destroy(hout);
+ bcf_hdr_destroy(hin);
bcf_destroy(b); bcf_destroy(blast);
vcf_close(bp); vcf_close(bout);
if (hash) kh_destroy(set64, hash);
if (vc.fn_list) free(vc.fn_list);
+ if (vc.fn_dict) free(vc.fn_dict);
+ if (vc.n_sub) {
+ int i;
+ for (i = 0; i < vc.n_sub; ++i) free(vc.subsam[i]);
+ free(vc.subsam); free(vc.sublist);
+ }
if (p1) bcf_p1_destroy(p1);
return 0;
}
n1_ = n11 + n12; n_1 = n11 + n21; n = n11 + n12 + n21 + n22; // calculate n1_, n_1 and n
max = (n_1 < n1_) ? n_1 : n1_; // max n11, for right tail
- min = (n1_ + n_1 - n < 0) ? 0 : (n1_ + n_1 - n < 0); // min n11, for left tail
+ min = n1_ + n_1 - n;
+ if (min < 0) min = 0; // min n11, for left tail
*two = *_left = *_right = 1.;
if (min == max) return 1.; // no need to do test
q = hypergeo_acc(n11, n1_, n_1, n, &aux); // the probability of the current table
p = hypergeo_acc(max, 0, 0, 0, &aux);
for (right = 0., j = max - 1; p < 0.99999999 * q; --j) // loop until underflow
right += p, p = hypergeo_acc(j, 0, 0, 0, &aux);
+ ++j;
if (p < 1.00000001 * q) right += p;
else ++j;
// two-tail
for (i = 0; i < n_smpl; ++i) {
const uint8_t *pi = PL[j] + i * PL_len[j];
double *p = pdg[j] + i * 3;
- p[0] = g_q2p[pi[b[j]->n_alleles]]; p[1] = g_q2p[pi[1]]; p[2] = g_q2p[pi[0]];
+ p[0] = g_q2p[pi[2]]; p[1] = g_q2p[pi[1]]; p[2] = g_q2p[pi[0]];
}
}
// iteration
};
struct __bcf_p1aux_t {
- int n, M, n1, is_indel, is_folded;
+ int n, M, n1, is_indel;
double *q2p, *pdg; // pdg -> P(D|g)
double *phi, *phi_indel;
double *z, *zswap; // aux for afs
int PL_len;
};
-static void fold_array(int M, double *x)
-{
- int k;
- for (k = 0; k < M/2; ++k)
- x[k] = x[M-k] = (x[k] + x[M-k]) / 2.;
-}
-
void bcf_p1_indel_prior(bcf_p1aux_t *ma, double x)
{
int i;
return 0;
}
-void bcf_p1_set_folded(bcf_p1aux_t *p1a)
-{
- if (p1a->n1 < 0) {
- p1a->is_folded = 1;
- fold_array(p1a->M, p1a->phi);
- fold_array(p1a->M, p1a->phi_indel);
- }
-}
-
void bcf_p1_destroy(bcf_p1aux_t *ma)
{
if (ma) {
static int cal_pdg(const bcf1_t *b, bcf_p1aux_t *ma)
{
- int i, j, k;
+ int i, j;
long *p, tmp;
p = alloca(b->n_alleles * sizeof(long));
memset(p, 0, sizeof(long) * b->n_alleles);
for (j = 0; j < ma->n; ++j) {
const uint8_t *pi = ma->PL + j * ma->PL_len;
double *pdg = ma->pdg + j * 3;
- pdg[0] = ma->q2p[pi[b->n_alleles]]; pdg[1] = ma->q2p[pi[1]]; pdg[2] = ma->q2p[pi[0]];
- for (i = k = 0; i < b->n_alleles; ++i) {
- p[i] += (int)pi[k];
- k += b->n_alleles - i;
- }
+ pdg[0] = ma->q2p[pi[2]]; pdg[1] = ma->q2p[pi[1]]; pdg[2] = ma->q2p[pi[0]];
+ for (i = 0; i < b->n_alleles; ++i)
+ p[i] += (int)pi[(i+1)*(i+2)/2-1];
}
for (i = 0; i < b->n_alleles; ++i) p[i] = p[i]<<4 | i;
for (i = 1; i < b->n_alleles; ++i) // insertion sort
pc[1] = pc[1] == 1.? 99 : (int)(-4.343 * log(1. - pc[1]) + .499);
}
-static double mc_cal_afs(bcf_p1aux_t *ma)
+static double mc_cal_afs(bcf_p1aux_t *ma, double *p_ref_folded, double *p_var_folded)
{
int k;
- long double sum = 0.;
+ long double sum = 0., sum2;
double *phi = ma->is_indel? ma->phi_indel : ma->phi;
memset(ma->afs1, 0, sizeof(double) * (ma->M + 1));
mc_cal_y(ma);
+ // compute AFS
for (k = 0, sum = 0.; k <= ma->M; ++k)
sum += (long double)phi[k] * ma->z[k];
for (k = 0; k <= ma->M; ++k) {
ma->afs1[k] = phi[k] * ma->z[k] / sum;
if (isnan(ma->afs1[k]) || isinf(ma->afs1[k])) return -1.;
}
+ // compute folded variant probability
+ for (k = 0, sum = 0.; k <= ma->M; ++k)
+ sum += (long double)(phi[k] + phi[ma->M - k]) / 2. * ma->z[k];
+ for (k = 1, sum2 = 0.; k < ma->M; ++k)
+ sum2 += (long double)(phi[k] + phi[ma->M - k]) / 2. * ma->z[k];
+ *p_var_folded = sum2 / sum;
+ *p_ref_folded = (phi[k] + phi[ma->M - k]) / 2. * (ma->z[ma->M] + ma->z[0]) / sum;
+ // the expected frequency
for (k = 0, sum = 0.; k <= ma->M; ++k) {
ma->afs[k] += ma->afs1[k];
sum += k * ma->afs1[k];
return pd;
}
-int bcf_p1_cal(bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst)
+int bcf_p1_cal(const bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst)
{
int i, k;
long double sum = 0.;
if (b->n_alleles < 2) return -1; // FIXME: find a better solution
//
rst->rank0 = cal_pdg(b, ma);
- rst->f_exp = mc_cal_afs(ma);
- rst->p_ref = ma->is_folded? ma->afs1[ma->M] + ma->afs1[0] : ma->afs1[ma->M];
+ rst->f_exp = mc_cal_afs(ma, &rst->p_ref_folded, &rst->p_var_folded);
+ rst->p_ref = ma->afs1[ma->M];
+ for (k = 0, sum = 0.; k < ma->M; ++k)
+ sum += ma->afs1[k];
+ rst->p_var = (double)sum;
// calculate f_flat and f_em
for (k = 0, sum = 0.; k <= ma->M; ++k)
sum += (long double)ma->z[k];
void bcf_p1_dump_afs(bcf_p1aux_t *ma)
{
int k;
- if (ma->is_folded) fold_array(ma->M, ma->afs);
fprintf(stderr, "[afs]");
for (k = 0; k <= ma->M; ++k)
fprintf(stderr, " %d:%.3lf", k, ma->afs[ma->M - k]);
typedef struct {
int rank0;
- double f_em, f_exp, f_flat, p_ref;
+ double f_em, f_exp, f_flat, p_ref_folded, p_ref, p_var_folded, p_var;
double cil, cih;
double pc[4];
double g[3];
void bcf_p1_init_prior(bcf_p1aux_t *ma, int type, double theta);
void bcf_p1_init_subprior(bcf_p1aux_t *ma, int type, double theta);
void bcf_p1_destroy(bcf_p1aux_t *ma);
- int bcf_p1_cal(bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst);
+ int bcf_p1_cal(const bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst);
int bcf_p1_call_gt(const bcf_p1aux_t *ma, double f0, int k);
void bcf_p1_dump_afs(bcf_p1aux_t *ma);
int bcf_p1_read_prior(bcf_p1aux_t *ma, const char *fn);
return bp;
}
+int vcf_dictread(bcf_t *bp, bcf_hdr_t *h, const char *fn)
+{
+ vcf_t *v;
+ gzFile fp;
+ kstream_t *ks;
+ kstring_t s, rn;
+ int dret;
+ if (bp == 0) return -1;
+ if (!bp->is_vcf) return 0;
+ s.l = s.m = 0; s.s = 0;
+ rn.m = rn.l = h->l_nm; rn.s = h->name;
+ v = (vcf_t*)bp->v;
+ fp = gzopen(fn, "r");
+ ks = ks_init(fp);
+ while (ks_getuntil(ks, 0, &s, &dret) >= 0) {
+ bcf_str2id_add(v->refhash, strdup(s.s));
+ kputs(s.s, &rn); kputc('\0', &rn);
+ if (dret != '\n') ks_getuntil(ks, '\n', &s, &dret);
+ }
+ ks_destroy(ks);
+ gzclose(fp);
+ h->l_nm = rn.l; h->name = rn.s;
+ bcf_hdr_sync(h);
+ free(s.s);
+ return 0;
+}
+
int vcf_close(bcf_t *bp)
{
vcf_t *v;
}
if (v->fpout) fclose(v->fpout);
free(v->line.s);
- bcf_str2id_destroy(v->refhash);
+ bcf_str2id_thorough_destroy(v->refhash);
free(v);
free(bp);
return 0;
int vcf_hdr_write(bcf_t *bp, const bcf_hdr_t *h)
{
vcf_t *v = (vcf_t*)bp->v;
- int i, has_ref = 0, has_ver = 0;
+ int i, has_ver = 0;
if (!bp->is_vcf) return bcf_hdr_write(bp, h);
if (h->l_txt > 0) {
if (strstr(h->txt, "##fileformat=")) has_ver = 1;
- if (has_ver == 0) fprintf(v->fpout, "##fileformat=VCFv4.0\n");
+ if (has_ver == 0) fprintf(v->fpout, "##fileformat=VCFv4.1\n");
fwrite(h->txt, 1, h->l_txt - 1, v->fpout);
- if (strstr(h->txt, "##SQ=")) has_ref = 1;
}
- if (has_ver == 0) fprintf(v->fpout, "##fileformat=VCFv4.0\n");
+ if (h->l_txt == 0) fprintf(v->fpout, "##fileformat=VCFv4.1\n");
fprintf(v->fpout, "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT");
for (i = 0; i < h->n_smpl; ++i)
fprintf(v->fpout, "\t%s", h->sns[i]);
if (k == 0) { // ref
int tid = bcf_str2id(v->refhash, p);
if (tid < 0) {
- tid = bcf_str2id_add(v->refhash, p);
+ tid = bcf_str2id_add(v->refhash, strdup(p));
kputs(p, &rn); kputc('\0', &rn);
sync = 1;
}
for (i = 0; i < b->n_gi; ++i) {
if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
((uint8_t*)b->gi[i].data)[k-9] = 1<<7;
- } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) {
+ } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) {
((uint8_t*)b->gi[i].data)[k-9] = 0;
+ } else if (b->gi[i].fmt == bcf_str2int("SP", 2)) {
+ ((int32_t*)b->gi[i].data)[k-9] = 0;
} else if (b->gi[i].fmt == bcf_str2int("DP", 2)) {
((uint16_t*)b->gi[i].data)[k-9] = 0;
} else if (b->gi[i].fmt == bcf_str2int("PL", 2)) {
for (q = kstrtok(p, ":", &a2), i = 0; q && i < b->n_gi; q = kstrtok(0, 0, &a2), ++i) {
if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
((uint8_t*)b->gi[i].data)[k-9] = (q[0] - '0')<<3 | (q[2] - '0') | (q[1] == '/'? 0 : 1) << 6;
- } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) {
+ } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) {
double _x = strtod(q, &q);
int x = (int)(_x + .499);
if (x > 255) x = 255;
((uint8_t*)b->gi[i].data)[k-9] = x;
+ } else if (b->gi[i].fmt == bcf_str2int("SP", 2)) {
+ int x = strtol(q, &q, 10);
+ if (x > 0xffff) x = 0xffff;
+ ((uint32_t*)b->gi[i].data)[k-9] = x;
} else if (b->gi[i].fmt == bcf_str2int("DP", 2)) {
int x = strtol(q, &q, 10);
if (x > 0xffff) x = 0xffff;
y = b->n_alleles * (b->n_alleles + 1) / 2;
for (j = 0; j < y; ++j) {
x = strtod(q, &q);
- data[(k-9) * y + j] = x;
+ data[(k-9) * y + j] = x > 0? -x/10. : x;
++q;
}
}
my $command = shift(@ARGV);
my %func = (subsam=>\&subsam, listsam=>\&listsam, fillac=>\&fillac, qstats=>\&qstats, varFilter=>\&varFilter,
hapmap2vcf=>\&hapmap2vcf, ucscsnp2vcf=>\&ucscsnp2vcf, filter4vcf=>\&varFilter, ldstats=>\&ldstats,
- gapstats=>\&gapstats, splitchr=>\&splitchr);
+ gapstats=>\&gapstats, splitchr=>\&splitchr, vcf2fq=>\&vcf2fq);
die("Unknown command \"$command\".\n") if (!defined($func{$command}));
&{$func{$command}};
}
}
sub varFilter {
- my %opts = (d=>2, D=>10000, a=>2, W=>10, Q=>10, w=>10, p=>undef, 1=>1e-4, 2=>1e-100, 3=>0, 4=>1e-4);
- getopts('pd:D:W:Q:w:a:1:2:3:4:', \%opts);
+ my %opts = (d=>2, D=>10000000, a=>2, W=>10, Q=>10, w=>10, p=>undef, 1=>1e-4, 2=>1e-100, 3=>0, 4=>1e-4, G=>0, S=>1000);
+ getopts('pd:D:W:Q:w:a:1:2:3:4:G:S:', \%opts);
die(qq/
Usage: vcfutils.pl varFilter [options] <in.vcf>
print; next;
}
next if ($t[4] eq '.'); # skip non-var sites
+ next if ($t[3] eq 'N'); # skip sites with unknown ref ('N')
# check if the site is a SNP
my $type = 1; # SNP
if (length($t[3]) > 1) {
$flt = 1 if ($flt == 0 && $mq >= 0 && $mq < $opts{Q});
$flt = 7 if ($flt == 0 && /PV4=([^,]+),([^,]+),([^,]+),([^,;\t]+)/
&& ($1<$opts{1} || $2<$opts{2} || $3<$opts{3} || $4<$opts{4}));
+ $flt = 8 if ($flt == 0 && ((/MXGQ=(\d+)/ && $1 < $opts{G}) || (/MXSP=(\d+)/ && $1 >= $opts{S})));
my $score = $t[5] * 100 + $dp_alt;
my $rlen = length($t[3]) - 1; # $indel_score<0 for SNPs
} else { # SNP or MNP
for my $x (@staging) {
next if (($x->[0]&3) != 3 || $x->[4] + $x->[2] + $ow < $t[1]);
- $flt = 5;
+ if ($x->[4] + length($x->[7]) - 1 == $t[1] && substr($x->[7], -1, 1) eq substr($t[4], 0, 1)
+ && length($x->[7]) - length($x->[6]) == 1) {
+ $x->[1] = 5;
+ } else { $flt = 5; }
last;
}
# check MNP
if ($first->[1] == 0) {
print join("\t", @$first[3 .. @$first-1]), "\n";
} elsif ($is_print) {
- print STDERR join("\t", substr("UQdDaGgPM", $first->[1], 1), @$first[3 .. @$first-1]), "\n";
+ print STDERR join("\t", substr("UQdDaGgPMS", $first->[1], 1), @$first[3 .. @$first-1]), "\n";
}
}
}
}
+sub vcf2fq {
+ my %opts = (d=>3, D=>100000, Q=>10, l=>5);
+ getopts('d:D:Q:l:', \%opts);
+ die(qq/
+Usage: vcfutils.pl vcf2fq [options] <all-site.vcf>
+
+Options: -d INT minimum depth [$opts{d}]
+ -D INT maximum depth [$opts{D}]
+ -Q INT min RMS mapQ [$opts{Q}]
+ -l INT INDEL filtering window [$opts{l}]
+\n/) if (@ARGV == 0 && -t STDIN);
+
+ my ($last_chr, $seq, $qual, $last_pos, @gaps);
+ my $_Q = $opts{Q};
+ my $_d = $opts{d};
+ my $_D = $opts{D};
+
+ my %het = (AC=>'M', AG=>'R', AT=>'W', CA=>'M', CG=>'S', CT=>'Y',
+ GA=>'R', GC=>'S', GT=>'K', TA=>'W', TC=>'Y', TG=>'K');
+
+ $last_chr = '';
+ while (<>) {
+ next if (/^#/);
+ my @t = split;
+ if ($last_chr ne $t[0]) {
+ &v2q_post_process($last_chr, \$seq, \$qual, \@gaps, $opts{l}) if ($last_chr);
+ ($last_chr, $last_pos) = ($t[0], 0);
+ $seq = $qual = '';
+ @gaps = ();
+ }
+ die("[vcf2fq] unsorted input\n") if ($t[1] - $last_pos < 0);
+ if ($t[1] - $last_pos > 1) {
+ $seq .= 'n' x ($t[1] - $last_pos - 1);
+ $qual .= '!' x ($t[1] - $last_pos - 1);
+ }
+ if (length($t[3]) == 1 && $t[7] !~ /INDEL/ && $t[4] =~ /^([A-Za-z.])(,[A-Za-z])*$/) { # a SNP or reference
+ my ($ref, $alt) = ($t[3], $1);
+ my ($b, $q);
+ $q = $1 if ($t[7] =~ /FQ=(-?[\d\.]+)/);
+ if ($q < 0) {
+ $_ = $1 if ($t[7] =~ /AF1=([\d\.]+)/);
+ $b = ($_ < .5 || $alt eq '.')? $ref : $alt;
+ $q = -$q;
+ } else {
+ $b = $het{"$ref$alt"};
+ $b ||= 'N';
+ }
+ $b = lc($b);
+ $b = uc($b) if (($t[7] =~ /MQ=(\d+)/ && $1 >= $_Q) && ($t[7] =~ /DP=(\d+)/ && $1 >= $_d && $1 <= $_D));
+ $q = int($q + 33 + .499);
+ $q = chr($q <= 126? $q : 126);
+ $seq .= $b;
+ $qual .= $q;
+ } elsif ($t[4] ne '.') { # an INDEL
+ push(@gaps, [$t[1], length($t[3])]);
+ }
+ $last_pos = $t[1];
+ }
+ &v2q_post_process($last_chr, \$seq, \$qual, \@gaps, $opts{l});
+}
+
+sub v2q_post_process {
+ my ($chr, $seq, $qual, $gaps, $l) = @_;
+ for my $g (@$gaps) {
+ my $beg = $g->[0] > $l? $g->[0] - $l : 0;
+ my $end = $g->[0] + $g->[1] + $l;
+ $end = length($$seq) if ($end > length($$seq));
+ substr($$seq, $beg, $end - $beg) = lc(substr($$seq, $beg, $end - $beg));
+ }
+ print "\@$chr\n"; &v2q_print_str($seq);
+ print "+\n"; &v2q_print_str($qual);
+}
+
+sub v2q_print_str {
+ my ($s) = @_;
+ my $l = length($$s);
+ for (my $i = 0; $i < $l; $i += 60) {
+ print substr($$s, $i, 60), "\n";
+ }
+}
+
sub usage {
die(qq/
Usage: vcfutils.pl <command> [<arguments>]\n
listsam list the samples
fillac fill the allele count field
qstats SNP stats stratified by QUAL
- varFilter filtering short variants
+
hapmap2vcf convert the hapmap format to VCF
ucscsnp2vcf convert UCSC SNP SQL dump to VCF
+
+ varFilter filtering short variants (*)
+ vcf2fq VCF->fastq (**)
+
+Notes: Commands with description endting with (*) may need bcftools
+ specific annotations.
\n/);
}
--- /dev/null
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include "bam.h"
+#include "errmod.h"
+#include "faidx.h"
+
+#define ERR_DEP 0.83f
+
+typedef struct {
+ int e[2][3], p[2][2];
+} score_param_t;
+
+/* Note that although the two matrics have 10 parameters in total, only 4
+ * (probably 3) are free. Changing the scoring matrices in a sort of symmetric
+ * way will not change the result. */
+static score_param_t g_param = { {{0,0,0},{-4,1,6}}, {{0,-14000}, {0,0}} };
+
+typedef struct {
+ int min_baseQ, tid, max_bases;
+ uint16_t *bases;
+ bamFile fp;
+ bam_header_t *h;
+ char *ref;
+ faidx_t *fai;
+ errmod_t *em;
+} ct_t;
+
+static uint16_t gencns(ct_t *g, int n, const bam_pileup1_t *plp)
+{
+ int i, j, ret, tmp, k, sum[4], qual;
+ float q[16];
+ if (n > g->max_bases) { // enlarge g->bases
+ g->max_bases = n;
+ kroundup32(g->max_bases);
+ g->bases = realloc(g->bases, g->max_bases * 2);
+ }
+ for (i = k = 0; i < n; ++i) {
+ const bam_pileup1_t *p = plp + i;
+ uint8_t *seq;
+ int q, baseQ, b;
+ if (p->is_refskip || p->is_del) continue;
+ baseQ = bam1_qual(p->b)[p->qpos];
+ if (baseQ < g->min_baseQ) continue;
+ seq = bam1_seq(p->b);
+ b = bam_nt16_nt4_table[bam1_seqi(seq, p->qpos)];
+ if (b > 3) continue;
+ q = baseQ < p->b->core.qual? baseQ : p->b->core.qual;
+ if (q < 4) q = 4;
+ if (q > 63) q = 63;
+ g->bases[k++] = q<<5 | bam1_strand(p->b)<<4 | b;
+ }
+ if (k == 0) return 0;
+ errmod_cal(g->em, k, 4, g->bases, q);
+ for (i = 0; i < 4; ++i) sum[i] = (int)(q[i<<2|i] + .499) << 2 | i;
+ for (i = 1; i < 4; ++i) // insertion sort
+ for (j = i; j > 0 && sum[j] < sum[j-1]; --j)
+ tmp = sum[j], sum[j] = sum[j-1], sum[j-1] = tmp;
+ qual = (sum[1]>>2) - (sum[0]>>2);
+ k = k < 256? k : 255;
+ ret = (qual < 63? qual : 63) << 2 | (sum[0]&3);
+ return ret<<8|k;
+}
+
+static void process_cns(bam_header_t *h, int tid, int l, uint16_t *cns)
+{
+ int i, f[2][2], *prev, *curr, *swap_tmp, s;
+ uint8_t *b; // backtrack array
+ b = calloc(l, 1);
+ f[0][0] = f[0][1] = 0;
+ prev = f[0]; curr = f[1];
+ // fill the backtrack matrix
+ for (i = 0; i < l; ++i) {
+ int c = (cns[i] == 0)? 0 : (cns[i]>>8 == 0)? 1 : 2;
+ int tmp0, tmp1;
+ // compute f[0]
+ tmp0 = prev[0] + g_param.e[0][c] + g_param.p[0][0]; // (s[i+1],s[i])=(0,0)
+ tmp1 = prev[1] + g_param.e[0][c] + g_param.p[1][0]; // (0,1)
+ if (tmp0 > tmp1) curr[0] = tmp0, b[i] = 0;
+ else curr[0] = tmp1, b[i] = 1;
+ // compute f[1]
+ tmp0 = prev[0] + g_param.e[1][c] + g_param.p[0][1]; // (s[i+1],s[i])=(1,0)
+ tmp1 = prev[1] + g_param.e[1][c] + g_param.p[1][1]; // (1,1)
+ if (tmp0 > tmp1) curr[1] = tmp0, b[i] |= 0<<1;
+ else curr[1] = tmp1, b[i] |= 1<<1;
+ // swap
+ swap_tmp = prev; prev = curr; curr = swap_tmp;
+ }
+ // backtrack
+ s = prev[0] > prev[1]? 0 : 1;
+ for (i = l - 1; i > 0; --i) {
+ b[i] |= s<<2;
+ s = b[i]>>s&1;
+ }
+ // print
+ for (i = 0, s = -1; i <= l; ++i) {
+ if (i == l || ((b[i]>>2&3) == 0 && s >= 0)) {
+ if (s >= 0) {
+ int j;
+ printf("%s:%d-%d\t0\t%s\t%d\t60\t%dM\t*\t0\t0\t", h->target_name[tid], s+1, i, h->target_name[tid], s+1, i-s);
+ for (j = s; j < i; ++j) {
+ int c = cns[j]>>8;
+ if (c == 0) putchar('N');
+ else putchar("ACGT"[c&3]);
+ }
+ putchar('\t');
+ for (j = s; j < i; ++j)
+ putchar(33 + (cns[j]>>8>>2));
+ putchar('\n');
+ }
+ //if (s >= 0) printf("%s\t%d\t%d\t%d\n", h->target_name[tid], s, i, i - s);
+ s = -1;
+ } else if ((b[i]>>2&3) && s < 0) s = i;
+ }
+ free(b);
+}
+
+static int read_aln(void *data, bam1_t *b)
+{
+ extern int bam_prob_realn_core(bam1_t *b, const char *ref, int flag);
+ ct_t *g = (ct_t*)data;
+ int ret, len;
+ ret = bam_read1(g->fp, b);
+ if (ret >= 0 && g->fai && b->core.tid >= 0 && (b->core.flag&4) == 0) {
+ if (b->core.tid != g->tid) { // then load the sequence
+ free(g->ref);
+ g->ref = fai_fetch(g->fai, g->h->target_name[b->core.tid], &len);
+ g->tid = b->core.tid;
+ }
+ bam_prob_realn_core(b, g->ref, 1<<1|1);
+ }
+ return ret;
+}
+
+int main_cut_target(int argc, char *argv[])
+{
+ int c, tid, pos, n, lasttid = -1, lastpos = -1, l, max_l;
+ const bam_pileup1_t *p;
+ bam_plp_t plp;
+ uint16_t *cns;
+ ct_t g;
+
+ memset(&g, 0, sizeof(ct_t));
+ g.min_baseQ = 13; g.tid = -1;
+ while ((c = getopt(argc, argv, "f:Q:i:o:0:1:2:")) >= 0) {
+ switch (c) {
+ case 'Q': g.min_baseQ = atoi(optarg); break; // quality cutoff
+ case 'i': g_param.p[0][1] = -atoi(optarg); break; // 0->1 transition (in) PENALTY
+ case '0': g_param.e[1][0] = atoi(optarg); break; // emission SCORE
+ case '1': g_param.e[1][1] = atoi(optarg); break;
+ case '2': g_param.e[1][2] = atoi(optarg); break;
+ case 'f': g.fai = fai_load(optarg);
+ if (g.fai == 0) fprintf(stderr, "[%s] fail to load the fasta index.\n", __func__);
+ break;
+ }
+ }
+ if (argc == optind) {
+ fprintf(stderr, "Usage: samtools targetcut [-Q minQ] [-i inPen] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam>\n");
+ return 1;
+ }
+ l = max_l = 0; cns = 0;
+ g.fp = strcmp(argv[optind], "-")? bam_open(argv[optind], "r") : bam_dopen(fileno(stdin), "r");
+ g.h = bam_header_read(g.fp);
+ g.em = errmod_init(1 - ERR_DEP);
+ plp = bam_plp_init(read_aln, &g);
+ while ((p = bam_plp_auto(plp, &tid, &pos, &n)) != 0) {
+ if (tid < 0) break;
+ if (tid != lasttid) { // change of chromosome
+ if (cns) process_cns(g.h, lasttid, l, cns);
+ if (max_l < g.h->target_len[tid]) {
+ max_l = g.h->target_len[tid];
+ kroundup32(max_l);
+ cns = realloc(cns, max_l * 2);
+ }
+ l = g.h->target_len[tid];
+ memset(cns, 0, max_l * 2);
+ lasttid = tid;
+ }
+ cns[pos] = gencns(&g, n, p);
+ lastpos = pos;
+ }
+ process_cns(g.h, lasttid, l, cns);
+ free(cns);
+ bam_header_destroy(g.h);
+ bam_plp_destroy(plp);
+ bam_close(g.fp);
+ if (g.fai) {
+ fai_destroy(g.fai); free(g.ref);
+ }
+ errmod_destroy(g.em);
+ free(g.bases);
+ return 0;
+}
errmod_t *errmod_init(float depcorr);
void errmod_destroy(errmod_t *em);
+
+/*
+ n: number of bases
+ m: maximum base
+ bases[i]: qual:6, strand:1, base:4
+ q[i*m+j]: phred-scaled likelihood of (i,j)
+ */
int errmod_cal(const errmod_t *em, int n, int m, uint16_t *bases, float *q);
#endif
(cd ..; make libbam.a)
calDepth:../libbam.a calDepth.c
- gcc -g -Wall -O2 -I.. calDepth.c -o $@ -lm -lz -L.. -lbam
+ gcc -g -Wall -O2 -I.. calDepth.c -o $@ -L.. -lbam -lm -lz
clean:
rm -fr *.bam *.bai *.glf* *.fai *.pileup* *~ calDepth *.dSYM ex1*.rg ex1.bcf
# ../samtools pileup ex1.bam|perl -ape '$_=$F[4];s/(\d+)(??{".{$1}"})|\^.//g;@_=(tr/A-Z//,tr/a-z//);$_=join("\t",@F[0,1],@_)."\n"'
-# ../samtools pileup -cf ex1.fa ex1.bam|perl -ape '$_=$F[8];s/\^.//g;s/(\d+)(??{".{$1}"})|\^.//g;@_=(tr/A-Za-z//,tr/,.//);$_=join("\t",@F[0,1],@_)."\n"'
\ No newline at end of file
+# ../samtools pileup -cf ex1.fa ex1.bam|perl -ape '$_=$F[8];s/\^.//g;s/(\d+)(??{".{$1}"})|\^.//g;@_=(tr/A-Za-z//,tr/,.//);$_=join("\t",@F[0,1],@_)."\n"'
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
+#include <stdint.h>
#include "faidx.h"
#include "khash.h"
/* The MIT License
- Copyright (c) 2008 Genome Research Ltd (GRL).
+ Copyright (c) 2008, 2009, 2011 by Attractive Chaos <attractor@live.co.uk>
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
SOFTWARE.
*/
-/* Contact: Heng Li <lh3@sanger.ac.uk> */
-
/*
An example:
*/
/*
+ 2011-02-14 (0.2.5):
+
+ * Allow to declare global functions.
+
+ 2009-09-26 (0.2.4):
+
+ * Improve portability
+
2008-09-19 (0.2.3):
* Corrected the example
@copyright Heng Li
*/
-#define AC_VERSION_KHASH_H "0.2.2"
+#define AC_VERSION_KHASH_H "0.2.5"
-#include <stdint.h>
#include <stdlib.h>
#include <string.h>
+#include <limits.h>
+
+/* compipler specific configuration */
+
+#if UINT_MAX == 0xffffffffu
+typedef unsigned int khint32_t;
+#elif ULONG_MAX == 0xffffffffu
+typedef unsigned long khint32_t;
+#endif
+
+#if ULONG_MAX == ULLONG_MAX
+typedef unsigned long khint64_t;
+#else
+typedef unsigned long long khint64_t;
+#endif
+
+#ifdef _MSC_VER
+#define inline __inline
+#endif
-typedef uint32_t khint_t;
+typedef khint32_t khint_t;
typedef khint_t khiter_t;
#define __ac_HASH_PRIME_SIZE 32
-static const uint32_t __ac_prime_list[__ac_HASH_PRIME_SIZE] =
+static const khint32_t __ac_prime_list[__ac_HASH_PRIME_SIZE] =
{
0ul, 3ul, 11ul, 23ul, 53ul,
97ul, 193ul, 389ul, 769ul, 1543ul,
static const double __ac_HASH_UPPER = 0.77;
-#define KHASH_INIT(name, khkey_t, khval_t, kh_is_map, __hash_func, __hash_equal) \
+#define KHASH_DECLARE(name, khkey_t, khval_t) \
+ typedef struct { \
+ khint_t n_buckets, size, n_occupied, upper_bound; \
+ khint32_t *flags; \
+ khkey_t *keys; \
+ khval_t *vals; \
+ } kh_##name##_t; \
+ extern kh_##name##_t *kh_init_##name(); \
+ extern void kh_destroy_##name(kh_##name##_t *h); \
+ extern void kh_clear_##name(kh_##name##_t *h); \
+ extern khint_t kh_get_##name(const kh_##name##_t *h, khkey_t key); \
+ extern void kh_resize_##name(kh_##name##_t *h, khint_t new_n_buckets); \
+ extern khint_t kh_put_##name(kh_##name##_t *h, khkey_t key, int *ret); \
+ extern void kh_del_##name(kh_##name##_t *h, khint_t x);
+
+#define KHASH_INIT2(name, SCOPE, khkey_t, khval_t, kh_is_map, __hash_func, __hash_equal) \
typedef struct { \
khint_t n_buckets, size, n_occupied, upper_bound; \
- uint32_t *flags; \
+ khint32_t *flags; \
khkey_t *keys; \
khval_t *vals; \
} kh_##name##_t; \
- static inline kh_##name##_t *kh_init_##name() { \
+ SCOPE kh_##name##_t *kh_init_##name() { \
return (kh_##name##_t*)calloc(1, sizeof(kh_##name##_t)); \
} \
- static inline void kh_destroy_##name(kh_##name##_t *h) \
+ SCOPE void kh_destroy_##name(kh_##name##_t *h) \
{ \
if (h) { \
free(h->keys); free(h->flags); \
free(h); \
} \
} \
- static inline void kh_clear_##name(kh_##name##_t *h) \
+ SCOPE void kh_clear_##name(kh_##name##_t *h) \
{ \
if (h && h->flags) { \
- memset(h->flags, 0xaa, ((h->n_buckets>>4) + 1) * sizeof(uint32_t)); \
+ memset(h->flags, 0xaa, ((h->n_buckets>>4) + 1) * sizeof(khint32_t)); \
h->size = h->n_occupied = 0; \
} \
} \
- static inline khint_t kh_get_##name(const kh_##name##_t *h, khkey_t key) \
+ SCOPE khint_t kh_get_##name(const kh_##name##_t *h, khkey_t key) \
{ \
if (h->n_buckets) { \
khint_t inc, k, i, last; \
return __ac_iseither(h->flags, i)? h->n_buckets : i; \
} else return 0; \
} \
- static inline void kh_resize_##name(kh_##name##_t *h, khint_t new_n_buckets) \
+ SCOPE void kh_resize_##name(kh_##name##_t *h, khint_t new_n_buckets) \
{ \
- uint32_t *new_flags = 0; \
+ khint32_t *new_flags = 0; \
khint_t j = 1; \
{ \
khint_t t = __ac_HASH_PRIME_SIZE - 1; \
new_n_buckets = __ac_prime_list[t+1]; \
if (h->size >= (khint_t)(new_n_buckets * __ac_HASH_UPPER + 0.5)) j = 0; \
else { \
- new_flags = (uint32_t*)malloc(((new_n_buckets>>4) + 1) * sizeof(uint32_t)); \
- memset(new_flags, 0xaa, ((new_n_buckets>>4) + 1) * sizeof(uint32_t)); \
+ new_flags = (khint32_t*)malloc(((new_n_buckets>>4) + 1) * sizeof(khint32_t)); \
+ memset(new_flags, 0xaa, ((new_n_buckets>>4) + 1) * sizeof(khint32_t)); \
if (h->n_buckets < new_n_buckets) { \
h->keys = (khkey_t*)realloc(h->keys, new_n_buckets * sizeof(khkey_t)); \
if (kh_is_map) \
h->upper_bound = (khint_t)(h->n_buckets * __ac_HASH_UPPER + 0.5); \
} \
} \
- static inline khint_t kh_put_##name(kh_##name##_t *h, khkey_t key, int *ret) \
+ SCOPE khint_t kh_put_##name(kh_##name##_t *h, khkey_t key, int *ret) \
{ \
khint_t x; \
if (h->n_occupied >= h->upper_bound) { \
} else *ret = 0; \
return x; \
} \
- static inline void kh_del_##name(kh_##name##_t *h, khint_t x) \
+ SCOPE void kh_del_##name(kh_##name##_t *h, khint_t x) \
{ \
if (x != h->n_buckets && !__ac_iseither(h->flags, x)) { \
__ac_set_isdel_true(h->flags, x); \
} \
}
+#define KHASH_INIT(name, khkey_t, khval_t, kh_is_map, __hash_func, __hash_equal) \
+ KHASH_INIT2(name, static inline, khkey_t, khval_t, kh_is_map, __hash_func, __hash_equal)
+
/* --- BEGIN OF HASH FUNCTIONS --- */
/*! @function
@abstract Integer hash function
- @param key The integer [uint32_t]
+ @param key The integer [khint32_t]
@return The hash value [khint_t]
*/
-#define kh_int_hash_func(key) (uint32_t)(key)
+#define kh_int_hash_func(key) (khint32_t)(key)
/*! @function
@abstract Integer comparison function
*/
#define kh_int_hash_equal(a, b) ((a) == (b))
/*! @function
@abstract 64-bit integer hash function
- @param key The integer [uint64_t]
+ @param key The integer [khint64_t]
@return The hash value [khint_t]
*/
-#define kh_int64_hash_func(key) (uint32_t)((key)>>33^(key)^(key)<<11)
+#define kh_int64_hash_func(key) (khint32_t)((key)>>33^(key)^(key)<<11)
/*! @function
@abstract 64-bit integer comparison function
*/
@param name Name of the hash table [symbol]
*/
#define KHASH_SET_INIT_INT(name) \
- KHASH_INIT(name, uint32_t, char, 0, kh_int_hash_func, kh_int_hash_equal)
+ KHASH_INIT(name, khint32_t, char, 0, kh_int_hash_func, kh_int_hash_equal)
/*! @function
@abstract Instantiate a hash map containing integer keys
@param khval_t Type of values [type]
*/
#define KHASH_MAP_INIT_INT(name, khval_t) \
- KHASH_INIT(name, uint32_t, khval_t, 1, kh_int_hash_func, kh_int_hash_equal)
+ KHASH_INIT(name, khint32_t, khval_t, 1, kh_int_hash_func, kh_int_hash_equal)
/*! @function
@abstract Instantiate a hash map containing 64-bit integer keys
@param name Name of the hash table [symbol]
*/
#define KHASH_SET_INIT_INT64(name) \
- KHASH_INIT(name, uint64_t, char, 0, kh_int64_hash_func, kh_int64_hash_equal)
+ KHASH_INIT(name, khint64_t, char, 0, kh_int64_hash_func, kh_int64_hash_equal)
/*! @function
@abstract Instantiate a hash map containing 64-bit integer keys
@param khval_t Type of values [type]
*/
#define KHASH_MAP_INIT_INT64(name, khval_t) \
- KHASH_INIT(name, uint64_t, khval_t, 1, kh_int64_hash_func, kh_int64_hash_equal)
+ KHASH_INIT(name, khint64_t, khval_t, 1, kh_int64_hash_func, kh_int64_hash_equal)
typedef const char *kh_cstr_t;
/*! @function
/* The MIT License
- Copyright (c) 2008 Genome Research Ltd (GRL).
+ Copyright (c) 2008 by Genome Research Ltd (GRL).
+ 2010 by Attractive Chaos <attractor@live.co.uk>
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
SOFTWARE.
*/
-/* Contact: Heng Li <lh3@sanger.ac.uk> */
-
/* Probably I will not do socket programming in the next few years and
therefore I decide to heavily annotate this file, for Linux and
- Windows as well. -lh3 */
+ Windows as well. -ac */
#include <time.h>
#include <stdio.h>
int on = 1, fd;
struct linger lng = { 0, 0 };
- struct addrinfo hints, *res;
+ struct addrinfo hints, *res = 0;
memset(&hints, 0, sizeof(struct addrinfo));
hints.ai_family = AF_UNSPEC;
hints.ai_socktype = SOCK_STREAM;
const char *p, *start;
if (sep) { // set up the table
if (str == 0 && (aux->tab[0]&1)) return 0; // no need to set up if we have finished
- aux->tab[0] = aux->tab[1] = aux->tab[2] = aux->tab[3] = 0;
- for (p = sep; *p; ++p)
- aux->tab[*p/64] |= 1ull<<(*p%64);
+ aux->finished = 0;
+ if (sep[1]) {
+ aux->sep = -1;
+ aux->tab[0] = aux->tab[1] = aux->tab[2] = aux->tab[3] = 0;
+ for (p = sep; *p; ++p) aux->tab[*p>>6] |= 1ull<<(*p&0x3f);
+ } else aux->sep = sep[0];
+ }
+ if (aux->finished) return 0;
+ else if (str) aux->p = str - 1, aux->finished = 0;
+ if (aux->sep < 0) {
+ for (p = start = aux->p + 1; *p; ++p)
+ if (aux->tab[*p>>6]>>(*p&0x3f)&1) break;
+ } else {
+ for (p = start = aux->p + 1; *p; ++p)
+ if (*p == aux->sep) break;
}
- if (str) aux->p = str - 1, aux->tab[0] &= ~1ull;
- else if (aux->tab[0]&1) return 0;
- for (p = start = aux->p + 1; *p; ++p)
- if (aux->tab[*p/64]>>(*p%64)&1) break;
aux->p = p; // end of token
- if (*p == 0) aux->tab[0] |= 1; // no more tokens
+ if (*p == 0) aux->finished = 1; // no more tokens
return (char*)start;
}
typedef struct {
uint64_t tab[4];
+ int sep, finished;
const char *p; // end of the current token
} ks_tokaux_t;
--- /dev/null
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <math.h>
+#include <zlib.h>
+#include "bam.h"
+#include "errmod.h"
+
+#include "kseq.h"
+KSTREAM_INIT(gzFile, gzread, 16384)
+
+#define MAX_VARS 256
+#define FLIP_PENALTY 2
+#define FLIP_THRES 4
+#define MASK_THRES 3
+
+#define FLAG_FIX_CHIMERA 0x1
+#define FLAG_LIST_EXCL 0x4
+#define FLAG_DROP_AMBI 0x8
+
+typedef struct {
+ // configurations, initialized in the main function
+ int flag, k, min_baseQ, min_varLOD, max_depth;
+ // other global variables
+ int vpos_shift;
+ bamFile fp;
+ char *pre;
+ bamFile out[3];
+ // alignment queue
+ int n, m;
+ bam1_t **b;
+} phaseg_t;
+
+typedef struct {
+ int8_t seq[MAX_VARS]; // TODO: change to dynamic memory allocation!
+ int vpos, beg, end;
+ uint32_t vlen:16, single:1, flip:1, phase:1, phased:1, ambig:1;
+ uint32_t in:16, out:16; // in-phase and out-phase
+} frag_t, *frag_p;
+
+#define rseq_lt(a,b) ((a)->vpos < (b)->vpos)
+
+#include "khash.h"
+KHASH_SET_INIT_INT64(set64)
+KHASH_MAP_INIT_INT64(64, frag_t)
+
+typedef khash_t(64) nseq_t;
+
+#include "ksort.h"
+KSORT_INIT(rseq, frag_p, rseq_lt)
+
+static char nt16_nt4_table[] = { 4, 0, 1, 4, 2, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4 };
+
+static inline uint64_t X31_hash_string(const char *s)
+{
+ uint64_t h = *s;
+ if (h) for (++s ; *s; ++s) h = (h << 5) - h + *s;
+ return h;
+}
+
+static void count1(int l, const uint8_t *seq, int *cnt)
+{
+ int i, j, n_ambi;
+ uint32_t z, x;
+ if (seq[l-1] == 0) return; // do nothing is the last base is ambiguous
+ for (i = n_ambi = 0; i < l; ++i) // collect ambiguous bases
+ if (seq[i] == 0) ++n_ambi;
+ if (l - n_ambi <= 1) return; // only one SNP
+ for (x = 0; x < 1u<<n_ambi; ++x) { // count
+ for (i = j = 0, z = 0; i < l; ++i) {
+ int c;
+ if (seq[i]) c = seq[i] - 1;
+ else {
+ c = x>>j&1;
+ ++j;
+ }
+ z = z<<1 | c;
+ }
+ ++cnt[z];
+ }
+}
+
+static int **count_all(int l, int vpos, nseq_t *hash)
+{
+ khint_t k;
+ int i, j, **cnt;
+ uint8_t *seq;
+ seq = calloc(l, 1);
+ cnt = calloc(vpos, sizeof(void*));
+ for (i = 0; i < vpos; ++i) cnt[i] = calloc(1<<l, sizeof(int));
+ for (k = 0; k < kh_end(hash); ++k) {
+ if (kh_exist(hash, k)) {
+ frag_t *f = &kh_val(hash, k);
+ if (f->vpos >= vpos || f->single) continue; // out of region; or singleton
+ if (f->vlen == 1) { // such reads should be flagged as deleted previously if everything is right
+ f->single = 1;
+ continue;
+ }
+ for (j = 1; j < f->vlen; ++j) {
+ for (i = 0; i < l; ++i)
+ seq[i] = j < l - 1 - i? 0 : f->seq[j - (l - 1 - i)];
+ count1(l, seq, cnt[f->vpos + j]);
+ }
+ }
+ }
+ free(seq);
+ return cnt;
+}
+
+// phasing
+static int8_t *dynaprog(int l, int vpos, int **w)
+{
+ int *f[2], *curr, *prev, max, i;
+ int8_t **b, *h = 0;
+ uint32_t x, z = 1u<<(l-1), mask = (1u<<l) - 1;
+ f[0] = calloc(z, sizeof(int));
+ f[1] = calloc(z, sizeof(int));
+ b = calloc(vpos, sizeof(void*));
+ prev = f[0]; curr = f[1];
+ // fill the backtrack matrix
+ for (i = 0; i < vpos; ++i) {
+ int *wi = w[i], *tmp;
+ int8_t *bi;
+ bi = b[i] = calloc(z, 1);
+ /* In the following, x is the current state, which is the
+ * lexicographically smaller local haplotype. xc is the complement of
+ * x, or the larger local haplotype; y0 and y1 are the two predecessors
+ * of x. */
+ for (x = 0; x < z; ++x) { // x0 is the smaller
+ uint32_t y0, y1, xc;
+ int c0, c1;
+ xc = ~x&mask; y0 = x>>1; y1 = xc>>1;
+ c0 = prev[y0] + wi[x] + wi[xc];
+ c1 = prev[y1] + wi[x] + wi[xc];
+ if (c0 > c1) bi[x] = 0, curr[x] = c0;
+ else bi[x] = 1, curr[x] = c1;
+ }
+ tmp = prev; prev = curr; curr = tmp; // swap
+ }
+ { // backtrack
+ uint32_t max_x = 0;
+ int which = 0;
+ h = calloc(vpos, 1);
+ for (x = 0, max = 0, max_x = 0; x < z; ++x)
+ if (prev[x] > max) max = prev[x], max_x = x;
+ for (i = vpos - 1, x = max_x; i >= 0; --i) {
+ h[i] = which? (~x&1) : (x&1);
+ which = b[i][x]? !which : which;
+ x = b[i][x]? (~x&mask)>>1 : x>>1;
+ }
+ }
+ // free
+ for (i = 0; i < vpos; ++i) free(b[i]);
+ free(f[0]); free(f[1]); free(b);
+ return h;
+}
+
+// phase each fragment
+static uint64_t *fragphase(int vpos, const int8_t *path, nseq_t *hash, int flip)
+{
+ khint_t k;
+ uint64_t *pcnt;
+ uint32_t *left, *rght, max;
+ left = rght = 0; max = 0;
+ pcnt = calloc(vpos, 8);
+ for (k = 0; k < kh_end(hash); ++k) {
+ if (kh_exist(hash, k)) {
+ int i, c[2];
+ frag_t *f = &kh_val(hash, k);
+ if (f->vpos >= vpos) continue;
+ // get the phase
+ c[0] = c[1] = 0;
+ for (i = 0; i < f->vlen; ++i) {
+ if (f->seq[i] == 0) continue;
+ ++c[f->seq[i] == path[f->vpos + i] + 1? 0 : 1];
+ }
+ f->phase = c[0] > c[1]? 0 : 1;
+ f->in = c[f->phase]; f->out = c[1 - f->phase];
+ f->phased = f->in == f->out? 0 : 1;
+ f->ambig = (f->in && f->out && f->out < 3 && f->in <= f->out + 1)? 1 : 0;
+ // fix chimera
+ f->flip = 0;
+ if (flip && c[0] >= 3 && c[1] >= 3) {
+ int sum[2], m, mi, md;
+ if (f->vlen > max) { // enlarge the array
+ max = f->vlen;
+ kroundup32(max);
+ left = realloc(left, max * 4);
+ rght = realloc(rght, max * 4);
+ }
+ for (i = 0, sum[0] = sum[1] = 0; i < f->vlen; ++i) { // get left counts
+ if (f->seq[i]) {
+ int c = f->phase? 2 - f->seq[i] : f->seq[i] - 1;
+ ++sum[c == path[f->vpos + i]? 0 : 1];
+ }
+ left[i] = sum[1]<<16 | sum[0];
+ }
+ for (i = f->vlen - 1, sum[0] = sum[1] = 0; i >= 0; --i) { // get right counts
+ if (f->seq[i]) {
+ int c = f->phase? 2 - f->seq[i] : f->seq[i] - 1;
+ ++sum[c == path[f->vpos + i]? 0 : 1];
+ }
+ rght[i] = sum[1]<<16 | sum[0];
+ }
+ // find the best flip point
+ for (i = m = 0, mi = -1, md = -1; i < f->vlen - 1; ++i) {
+ int a[2];
+ a[0] = (left[i]&0xffff) + (rght[i+1]>>16&0xffff) - (rght[i+1]&0xffff) * FLIP_PENALTY;
+ a[1] = (left[i]>>16&0xffff) + (rght[i+1]&0xffff) - (rght[i+1]>>16&0xffff) * FLIP_PENALTY;
+ if (a[0] > a[1]) {
+ if (a[0] > m) m = a[0], md = 0, mi = i;
+ } else {
+ if (a[1] > m) m = a[1], md = 1, mi = i;
+ }
+ }
+ if (m - c[0] >= FLIP_THRES && m - c[1] >= FLIP_THRES) { // then flip
+ f->flip = 1;
+ if (md == 0) { // flip the tail
+ for (i = mi + 1; i < f->vlen; ++i)
+ if (f->seq[i] == 1) f->seq[i] = 2;
+ else if (f->seq[i] == 2) f->seq[i] = 1;
+ } else { // flip the head
+ for (i = 0; i <= mi; ++i)
+ if (f->seq[i] == 1) f->seq[i] = 2;
+ else if (f->seq[i] == 2) f->seq[i] = 1;
+ }
+ }
+ }
+ // update pcnt[]
+ if (!f->single) {
+ for (i = 0; i < f->vlen; ++i) {
+ int c;
+ if (f->seq[i] == 0) continue;
+ c = f->phase? 2 - f->seq[i] : f->seq[i] - 1;
+ if (c == path[f->vpos + i]) {
+ if (f->phase == 0) ++pcnt[f->vpos + i];
+ else pcnt[f->vpos + i] += 1ull<<32;
+ } else {
+ if (f->phase == 0) pcnt[f->vpos + i] += 1<<16;
+ else pcnt[f->vpos + i] += 1ull<<48;
+ }
+ }
+ }
+ }
+ }
+ free(left); free(rght);
+ return pcnt;
+}
+
+static uint64_t *genmask(int vpos, const uint64_t *pcnt, int *_n)
+{
+ int i, max = 0, max_i = -1, m = 0, n = 0, beg = 0, score = 0;
+ uint64_t *list = 0;
+ for (i = 0; i < vpos; ++i) {
+ uint64_t x = pcnt[i];
+ int c[4], pre = score, s;
+ c[0] = x&0xffff; c[1] = x>>16&0xffff; c[2] = x>>32&0xffff; c[3] = x>>48&0xffff;
+ s = (c[1] + c[3] == 0)? -(c[0] + c[2]) : (c[1] + c[3] - 1);
+ if (c[3] > c[2]) s += c[3] - c[2];
+ if (c[1] > c[0]) s += c[1] - c[0];
+ score += s;
+ if (score < 0) score = 0;
+ if (pre == 0 && score > 0) beg = i; // change from zero to non-zero
+ if ((i == vpos - 1 || score == 0) && max >= MASK_THRES) {
+ if (n == m) {
+ m = m? m<<1 : 4;
+ list = realloc(list, m * 8);
+ }
+ list[n++] = (uint64_t)beg<<32 | max_i;
+ i = max_i; // reset i to max_i
+ score = 0;
+ } else if (score > max) max = score, max_i = i;
+ if (score == 0) max = 0;
+ }
+ *_n = n;
+ return list;
+}
+
+// trim heading and tailing ambiguous bases; mark deleted and remove sequence
+static int clean_seqs(int vpos, nseq_t *hash)
+{
+ khint_t k;
+ int ret = 0;
+ for (k = 0; k < kh_end(hash); ++k) {
+ if (kh_exist(hash, k)) {
+ frag_t *f = &kh_val(hash, k);
+ int beg, end, i;
+ if (f->vpos >= vpos) {
+ ret = 1;
+ continue;
+ }
+ for (i = 0; i < f->vlen; ++i)
+ if (f->seq[i] != 0) break;
+ beg = i;
+ for (i = f->vlen - 1; i >= 0; --i)
+ if (f->seq[i] != 0) break;
+ end = i + 1;
+ if (end - beg <= 0) kh_del(64, hash, k);
+ else {
+ if (beg != 0) memmove(f->seq, f->seq + beg, end - beg);
+ f->vpos += beg; f->vlen = end - beg;
+ f->single = f->vlen == 1? 1 : 0;
+ }
+ }
+ }
+ return ret;
+}
+
+static void dump_aln(phaseg_t *g, int min_pos, const nseq_t *hash)
+{
+ int i, is_flip, drop_ambi;
+ drop_ambi = g->flag & FLAG_DROP_AMBI;
+ is_flip = (drand48() < 0.5);
+ for (i = 0; i < g->n; ++i) {
+ int end, which;
+ uint64_t key;
+ khint_t k;
+ bam1_t *b = g->b[i];
+ key = X31_hash_string(bam1_qname(b));
+ end = bam_calend(&b->core, bam1_cigar(b));
+ if (end > min_pos) break;
+ k = kh_get(64, hash, key);
+ if (k == kh_end(hash)) which = 3;
+ else {
+ frag_t *f = &kh_val(hash, k);
+ if (f->ambig) which = drop_ambi? 2 : 3;
+ else if (f->phased && f->flip) which = 2;
+ else if (f->phased == 0) which = 3;
+ else { // phased and not flipped
+ char c = 'Y';
+ which = f->phase;
+ bam_aux_append(b, "ZP", 'A', 1, (uint8_t*)&c);
+ }
+ if (which < 2 && is_flip) which = 1 - which; // increase the randomness
+ }
+ if (which == 3) which = (drand48() < 0.5);
+ bam_write1(g->out[which], b);
+ bam_destroy1(b);
+ g->b[i] = 0;
+ }
+ memmove(g->b, g->b + i, (g->n - i) * sizeof(void*));
+ g->n -= i;
+}
+
+static int phase(phaseg_t *g, const char *chr, int vpos, uint64_t *cns, nseq_t *hash)
+{
+ int i, j, n_seqs = kh_size(hash), n_masked = 0, min_pos;
+ khint_t k;
+ frag_t **seqs;
+ int8_t *path, *sitemask;
+ uint64_t *pcnt, *regmask;
+
+ if (vpos == 0) return 0;
+ i = clean_seqs(vpos, hash); // i is true if hash has an element with its vpos >= vpos
+ min_pos = i? cns[vpos]>>32 : 0x7fffffff;
+ if (vpos == 1) {
+ printf("PS\t%s\t%d\t%d\n", chr, (int)(cns[0]>>32) + 1, (int)(cns[0]>>32) + 1);
+ printf("M0\t%s\t%d\t%d\t%c\t%c\t%d\t0\t0\t0\t0\n//\n", chr, (int)(cns[0]>>32) + 1, (int)(cns[0]>>32) + 1,
+ "ACGTX"[cns[0]&3], "ACGTX"[cns[0]>>16&3], g->vpos_shift + 1);
+ for (k = 0; k < kh_end(hash); ++k) {
+ if (kh_exist(hash, k)) {
+ frag_t *f = &kh_val(hash, k);
+ if (f->vpos) continue;
+ f->flip = 0;
+ if (f->seq[0] == 0) f->phased = 0;
+ else f->phased = 1, f->phase = f->seq[0] - 1;
+ }
+ }
+ dump_aln(g, min_pos, hash);
+ ++g->vpos_shift;
+ return 1;
+ }
+ { // phase
+ int **cnt;
+ uint64_t *mask;
+ printf("PS\t%s\t%d\t%d\n", chr, (int)(cns[0]>>32) + 1, (int)(cns[vpos-1]>>32) + 1);
+ sitemask = calloc(vpos, 1);
+ cnt = count_all(g->k, vpos, hash);
+ path = dynaprog(g->k, vpos, cnt);
+ for (i = 0; i < vpos; ++i) free(cnt[i]);
+ free(cnt);
+ pcnt = fragphase(vpos, path, hash, 0); // do not fix chimeras when masking
+ mask = genmask(vpos, pcnt, &n_masked);
+ regmask = calloc(n_masked, 8);
+ for (i = 0; i < n_masked; ++i) {
+ regmask[i] = cns[mask[i]>>32]>>32<<32 | cns[(uint32_t)mask[i]]>>32;
+ for (j = mask[i]>>32; j <= (int32_t)mask[i]; ++j)
+ sitemask[j] = 1;
+ }
+ free(mask);
+ if (g->flag & FLAG_FIX_CHIMERA) {
+ free(pcnt);
+ pcnt = fragphase(vpos, path, hash, 1);
+ }
+ }
+ for (i = 0; i < n_masked; ++i)
+ printf("FL\t%s\t%d\t%d\n", chr, (int)(regmask[i]>>32) + 1, (int)regmask[i] + 1);
+ for (i = 0; i < vpos; ++i) {
+ uint64_t x = pcnt[i];
+ int8_t c[2];
+ c[0] = (cns[i]&0xffff)>>2 == 0? 4 : (cns[i]&3);
+ c[1] = (cns[i]>>16&0xffff)>>2 == 0? 4 : (cns[i]>>16&3);
+ printf("M%d\t%s\t%d\t%d\t%c\t%c\t%d\t%d\t%d\t%d\t%d\n", sitemask[i]+1, chr, (int)(cns[0]>>32) + 1, (int)(cns[i]>>32) + 1, "ACGTX"[c[path[i]]], "ACGTX"[c[1-path[i]]],
+ i + g->vpos_shift + 1, (int)(x&0xffff), (int)(x>>16&0xffff), (int)(x>>32&0xffff), (int)(x>>48&0xffff));
+ }
+ free(path); free(pcnt); free(regmask); free(sitemask);
+ seqs = calloc(n_seqs, sizeof(void*));
+ for (k = 0, i = 0; k < kh_end(hash); ++k)
+ if (kh_exist(hash, k) && kh_val(hash, k).vpos < vpos && !kh_val(hash, k).single)
+ seqs[i++] = &kh_val(hash, k);
+ n_seqs = i;
+ ks_introsort_rseq(n_seqs, seqs);
+ for (i = 0; i < n_seqs; ++i) {
+ frag_t *f = seqs[i];
+ printf("EV\t0\t%s\t%d\t40\t%dM\t*\t0\t0\t", chr, f->vpos + 1 + g->vpos_shift, f->vlen);
+ for (j = 0; j < f->vlen; ++j) {
+ uint32_t c = cns[f->vpos + j];
+ if (f->seq[j] == 0) putchar('N');
+ else putchar("ACGT"[f->seq[j] == 1? (c&3) : (c>>16&3)]);
+ }
+ printf("\t*\tYP:i:%d\tYF:i:%d\tYI:i:%d\tYO:i:%d\tYS:i:%d\n", f->phase, f->flip, f->in, f->out, f->beg+1);
+ }
+ free(seqs);
+ printf("//\n");
+ fflush(stdout);
+ g->vpos_shift += vpos;
+ dump_aln(g, min_pos, hash);
+ return vpos;
+}
+
+static void update_vpos(int vpos, nseq_t *hash)
+{
+ khint_t k;
+ for (k = 0; k < kh_end(hash); ++k) {
+ if (kh_exist(hash, k)) {
+ frag_t *f = &kh_val(hash, k);
+ if (f->vpos < vpos) kh_del(64, hash, k); // TODO: if frag_t::seq is allocated dynamically, free it
+ else f->vpos -= vpos;
+ }
+ }
+}
+
+static nseq_t *shrink_hash(nseq_t *hash) // TODO: to implement
+{
+ return hash;
+}
+
+static int readaln(void *data, bam1_t *b)
+{
+ phaseg_t *g = (phaseg_t*)data;
+ int ret;
+ ret = bam_read1(g->fp, b);
+ if (ret < 0) return ret;
+ if (!(b->core.flag & (BAM_FUNMAP|BAM_FSECONDARY|BAM_FQCFAIL|BAM_FDUP)) && g->pre) {
+ if (g->n == g->m) {
+ g->m = g->m? g->m<<1 : 16;
+ g->b = realloc(g->b, g->m * sizeof(void*));
+ }
+ g->b[g->n++] = bam_dup1(b);
+ }
+ return ret;
+}
+
+static khash_t(set64) *loadpos(const char *fn, bam_header_t *h)
+{
+ gzFile fp;
+ kstream_t *ks;
+ int ret, dret;
+ kstring_t *str;
+ khash_t(set64) *hash;
+
+ hash = kh_init(set64);
+ str = calloc(1, sizeof(kstring_t));
+ fp = strcmp(fn, "-")? gzopen(fn, "r") : gzdopen(fileno(stdin), "r");
+ ks = ks_init(fp);
+ while (ks_getuntil(ks, 0, str, &dret) >= 0) {
+ int tid = bam_get_tid(h, str->s);
+ if (tid >= 0 && dret != '\n') {
+ if (ks_getuntil(ks, 0, str, &dret) >= 0) {
+ uint64_t x = (uint64_t)tid<<32 | (atoi(str->s) - 1);
+ kh_put(set64, hash, x, &ret);
+ } else break;
+ }
+ if (dret != '\n') while ((dret = ks_getc(ks)) > 0 && dret != '\n');
+ if (dret < 0) break;
+ }
+ ks_destroy(ks);
+ gzclose(fp);
+ free(str->s); free(str);
+ return hash;
+}
+
+static int gl2cns(float q[16])
+{
+ int i, j, min_ij;
+ float min, min2;
+ min = min2 = 1e30; min_ij = -1;
+ for (i = 0; i < 4; ++i) {
+ for (j = i; j < 4; ++j) {
+ if (q[i<<2|j] < min) min_ij = i<<2|j, min2 = min, min = q[i<<2|j];
+ else if (q[i<<2|j] < min2) min2 = q[i<<2|j];
+ }
+ }
+ return (min_ij>>2&3) == (min_ij&3)? 0 : 1<<18 | (min_ij>>2&3)<<16 | (min_ij&3) | (int)(min2 - min + .499) << 2;
+}
+
+int main_phase(int argc, char *argv[])
+{
+ extern void bam_init_header_hash(bam_header_t *header);
+ int c, tid, pos, vpos = 0, n, lasttid = -1, max_vpos = 0;
+ const bam_pileup1_t *plp;
+ bam_plp_t iter;
+ bam_header_t *h;
+ nseq_t *seqs;
+ uint64_t *cns = 0;
+ phaseg_t g;
+ char *fn_list = 0;
+ khash_t(set64) *set = 0;
+ errmod_t *em;
+ uint16_t *bases;
+
+ memset(&g, 0, sizeof(phaseg_t));
+ g.flag = FLAG_FIX_CHIMERA;
+ g.min_varLOD = 37; g.k = 13; g.min_baseQ = 13; g.max_depth = 256;
+ while ((c = getopt(argc, argv, "Q:eFq:k:b:l:D:A:")) >= 0) {
+ switch (c) {
+ case 'D': g.max_depth = atoi(optarg); break;
+ case 'q': g.min_varLOD = atoi(optarg); break;
+ case 'Q': g.min_baseQ = atoi(optarg); break;
+ case 'k': g.k = atoi(optarg); break;
+ case 'F': g.flag &= ~FLAG_FIX_CHIMERA; break;
+ case 'e': g.flag |= FLAG_LIST_EXCL; break;
+ case 'A': g.flag |= FLAG_DROP_AMBI; break;
+ case 'b': g.pre = strdup(optarg); break;
+ case 'l': fn_list = strdup(optarg); break;
+ }
+ }
+ if (argc == optind) {
+ fprintf(stderr, "\n");
+ fprintf(stderr, "Usage: samtools phase [options] <in.bam>\n\n");
+ fprintf(stderr, "Options: -k INT block length [%d]\n", g.k);
+ fprintf(stderr, " -b STR prefix of BAMs to output [null]\n");
+ fprintf(stderr, " -q INT min het phred-LOD [%d]\n", g.min_varLOD);
+ fprintf(stderr, " -Q INT min base quality in het calling [%d]\n", g.min_baseQ);
+ fprintf(stderr, " -D INT max read depth [%d]\n", g.max_depth);
+// fprintf(stderr, " -l FILE list of sites to phase [null]\n");
+ fprintf(stderr, " -F do not attempt to fix chimeras\n");
+ fprintf(stderr, " -A drop reads with ambiguous phase\n");
+// fprintf(stderr, " -e do not discover SNPs (effective with -l)\n");
+ fprintf(stderr, "\n");
+ return 1;
+ }
+ g.fp = strcmp(argv[optind], "-")? bam_open(argv[optind], "r") : bam_dopen(fileno(stdin), "r");
+ h = bam_header_read(g.fp);
+ if (fn_list) { // read the list of sites to phase
+ bam_init_header_hash(h);
+ set = loadpos(fn_list, h);
+ free(fn_list);
+ } else g.flag &= ~FLAG_LIST_EXCL;
+ if (g.pre) { // open BAMs to write
+ char *s = malloc(strlen(g.pre) + 20);
+ strcpy(s, g.pre); strcat(s, ".0.bam"); g.out[0] = bam_open(s, "w");
+ strcpy(s, g.pre); strcat(s, ".1.bam"); g.out[1] = bam_open(s, "w");
+ strcpy(s, g.pre); strcat(s, ".chimera.bam"); g.out[2] = bam_open(s, "w");
+ for (c = 0; c <= 2; ++c) bam_header_write(g.out[c], h);
+ free(s);
+ }
+
+ iter = bam_plp_init(readaln, &g);
+ g.vpos_shift = 0;
+ seqs = kh_init(64);
+ em = errmod_init(1. - 0.83);
+ bases = calloc(g.max_depth, 2);
+ printf("CC\n");
+ printf("CC\tDescriptions:\nCC\n");
+ printf("CC\t CC comments\n");
+ printf("CC\t PS start of a phase set\n");
+ printf("CC\t FL filtered region\n");
+ printf("CC\t M[012] markers; 0 for singletons, 1 for phased and 2 for filtered\n");
+ printf("CC\t EV supporting reads; SAM format\n");
+ printf("CC\t // end of a phase set\nCC\n");
+ printf("CC\tFormats of PS, FL and M[012] lines (1-based coordinates):\nCC\n");
+ printf("CC\t PS chr phaseSetStart phaseSetEnd\n");
+ printf("CC\t FL chr filterStart filterEnd\n");
+ printf("CC\t M? chr PS pos allele0 allele1 hetIndex #supports0 #errors0 #supp1 #err1\n");
+ printf("CC\nCC\n");
+ fflush(stdout);
+ while ((plp = bam_plp_auto(iter, &tid, &pos, &n)) != 0) {
+ int i, k, c, tmp, dophase = 1, in_set = 0;
+ float q[16];
+ if (tid < 0) break;
+ if (tid != lasttid) { // change of chromosome
+ g.vpos_shift = 0;
+ if (lasttid >= 0) {
+ seqs = shrink_hash(seqs);
+ phase(&g, h->target_name[lasttid], vpos, cns, seqs);
+ update_vpos(0x7fffffff, seqs);
+ }
+ lasttid = tid;
+ vpos = 0;
+ }
+ if (set && kh_get(set64, set, (uint64_t)tid<<32 | pos) != kh_end(set)) in_set = 1;
+ if (n > g.max_depth) continue; // do not proceed if the depth is too high
+ // fill the bases array and check if there is a variant
+ for (i = k = 0; i < n; ++i) {
+ const bam_pileup1_t *p = plp + i;
+ uint8_t *seq;
+ int q, baseQ, b;
+ if (p->is_del || p->is_refskip) continue;
+ baseQ = bam1_qual(p->b)[p->qpos];
+ if (baseQ < g.min_baseQ) continue;
+ seq = bam1_seq(p->b);
+ b = bam_nt16_nt4_table[bam1_seqi(seq, p->qpos)];
+ if (b > 3) continue;
+ q = baseQ < p->b->core.qual? baseQ : p->b->core.qual;
+ if (q < 4) q = 4;
+ if (q > 63) q = 63;
+ bases[k++] = q<<5 | (int)bam1_strand(p->b)<<4 | b;
+ }
+ if (k == 0) continue;
+ errmod_cal(em, k, 4, bases, q); // compute genotype likelihood
+ c = gl2cns(q); // get the consensus
+ // tell if to proceed
+ if (set && (g.flag&FLAG_LIST_EXCL) && !in_set) continue; // not in the list
+ if (!in_set && (c&0xffff)>>2 < g.min_varLOD) continue; // not a variant
+ // add the variant
+ if (vpos == max_vpos) {
+ max_vpos = max_vpos? max_vpos<<1 : 128;
+ cns = realloc(cns, max_vpos * 8);
+ }
+ cns[vpos] = (uint64_t)pos<<32 | c;
+ for (i = 0; i < n; ++i) {
+ const bam_pileup1_t *p = plp + i;
+ uint64_t key;
+ khint_t k;
+ uint8_t *seq = bam1_seq(p->b);
+ frag_t *f;
+ if (p->is_del || p->is_refskip) continue;
+ if (p->b->core.qual == 0) continue;
+ // get the base code
+ c = nt16_nt4_table[(int)bam1_seqi(seq, p->qpos)];
+ if (c == (cns[vpos]&3)) c = 1;
+ else if (c == (cns[vpos]>>16&3)) c = 2;
+ else c = 0;
+ // write to seqs
+ key = X31_hash_string(bam1_qname(p->b));
+ k = kh_put(64, seqs, key, &tmp);
+ f = &kh_val(seqs, k);
+ if (tmp == 0) { // present in the hash table
+ if (vpos - f->vpos + 1 < MAX_VARS) {
+ f->vlen = vpos - f->vpos + 1;
+ f->seq[f->vlen-1] = c;
+ f->end = bam_calend(&p->b->core, bam1_cigar(p->b));
+ }
+ dophase = 0;
+ } else { // absent
+ memset(f->seq, 0, MAX_VARS);
+ f->beg = p->b->core.pos;
+ f->end = bam_calend(&p->b->core, bam1_cigar(p->b));
+ f->vpos = vpos, f->vlen = 1, f->seq[0] = c, f->single = f->phased = f->flip = f->ambig = 0;
+ }
+ }
+ if (dophase) {
+ seqs = shrink_hash(seqs);
+ phase(&g, h->target_name[tid], vpos, cns, seqs);
+ update_vpos(vpos, seqs);
+ cns[0] = cns[vpos];
+ vpos = 0;
+ }
+ ++vpos;
+ }
+ if (tid >= 0) phase(&g, h->target_name[tid], vpos, cns, seqs);
+ bam_header_destroy(h);
+ bam_plp_destroy(iter);
+ bam_close(g.fp);
+ kh_destroy(64, seqs);
+ kh_destroy(set64, set);
+ free(cns);
+ errmod_destroy(em);
+ free(bases);
+ if (g.pre) {
+ for (c = 0; c <= 2; ++c) bam_close(g.out[c]);
+ free(g.pre); free(g.b);
+ }
+ return 0;
+}
-.TH samtools 1 "2 December 2010" "samtools-0.1.12" "Bioinformatics tools"
+.TH samtools 1 "1 March 2011" "samtools-0.1.13" "Bioinformatics tools"
.SH NAME
.PP
samtools - Utilities for the Sequence Alignment/Map (SAM) format
.TP
.B mpileup
-samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
+samtools mpileup [-EBug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
Generate BCF or pileup for one or multiple BAM files. Alignment records
are grouped by sample identifiers in @RG header lines. If sample
.I INT
leads to longer indels. [20]
.TP
+.B -E
+Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt
+specificity a little bit.
+.TP
.BI -f \ FILE
The reference file [null]
.TP
.TP
.B calmd
-samtools calmd [-eubSr] [-C capQcoef] <aln.bam> <ref.fasta>
+samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta>
Generate the MD tag. If the MD tag is already present, this command will
give a warning if the MD tag generated is different from the existing
command for details. [0]
.TP
.B -r
-Compute the BQ tag without changing the base quality.
+Compute the BQ tag (without -A) or cap base quality by BAQ (with -A).
+.TP
+.B -E
+Extended BAQ calculation. This option trades specificity for sensitivity, though the
+effect is minor.
+.RE
+
+.TP
+.B targetcut
+samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam>
+
+This command identifies target regions by examining the continuity of read depth, computes
+haploid consensus sequences of targets and outputs a SAM with each sequence corresponding
+to a target. When option
+.B -f
+is in use, BAQ will be applied. This command is
+.B only
+designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)].
+.RE
+
+.TP
+.B phase
+samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] <in.bam>
+
+Call and phase heterozygous SNPs.
+.B OPTIONS:
+.RS
+.TP 8
+.B -A
+Drop reads with ambiguous phase.
+.TP 8
+.BI -b \ STR
+Prefix of BAM output. When this option is in use, phase-0 reads will be saved in file
+.BR STR .0.bam
+and phase-1 reads in
+.BR STR .1.bam.
+Phase unknown reads will be randomly allocated to one of the two files. Chimeric reads
+with switch errors will be saved in
+.BR STR .chimeric.bam.
+[null]
+.TP
+.B -F
+Do not attempt to fix chimeric reads.
+.TP
+.BI -k \ INT
+Maximum length for local phasing. [13]
+.TP
+.BI -q \ INT
+Minimum Phred-scaled LOD to call a heterozygote. [40]
+.TP
+.BI -Q \ INT
+Minimum base quality to be used in het calling. [13]
.RE
.TP
.B BWA-short
but may not other mappers.
+.IP o 2
+Generate the consensus sequence for one diploid individual:
+
+ samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
+
+.IP o 2
+Phase one individual:
+
+ samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out
+
+The
+.B calmd
+command is used to reduce false heterozygotes around INDELs.
+
.IP o 2
Call SNPs and short indels for multiple diploid individuals:
+++ /dev/null
-samtools(1) Bioinformatics tools samtools(1)
-
-
-
-NAME
- samtools - Utilities for the Sequence Alignment/Map (SAM) format
-
-SYNOPSIS
- samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
-
- samtools sort aln.bam aln.sorted
-
- samtools index aln.sorted.bam
-
- samtools idxstats aln.sorted.bam
-
- samtools view aln.sorted.bam chr2:20,100,000-20,200,000
-
- samtools merge out.bam in1.bam in2.bam in3.bam
-
- samtools faidx ref.fasta
-
- samtools pileup -vcf ref.fasta aln.sorted.bam
-
- samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
-
- samtools tview aln.sorted.bam ref.fasta
-
-
-DESCRIPTION
- Samtools is a set of utilities that manipulate alignments in the BAM
- format. It imports from and exports to the SAM (Sequence Alignment/Map)
- format, does sorting, merging and indexing, and allows to retrieve
- reads in any regions swiftly.
-
- Samtools is designed to work on a stream. It regards an input file `-'
- as the standard input (stdin) and an output file `-' as the standard
- output (stdout). Several commands can thus be combined with Unix pipes.
- Samtools always output warning and error messages to the standard error
- output (stderr).
-
- Samtools is also able to open a BAM (not SAM) file on a remote FTP or
- HTTP server if the BAM file name starts with `ftp://' or `http://'.
- Samtools checks the current working directory for the index file and
- will download the index upon absence. Samtools does not retrieve the
- entire alignment file unless it is asked to do so.
-
-
-COMMANDS AND OPTIONS
- view samtools view [-bchuHS] [-t in.refList] [-o output] [-f
- reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read-
- Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
-
- Extract/print all or sub alignments in SAM or BAM format. If
- no region is specified, all the alignments will be printed;
- otherwise only alignments overlapping the specified regions
- will be output. An alignment may be given multiple times if
- it is overlapping several regions. A region can be presented,
- for example, in the following format: `chr2' (the whole
- chr2), `chr2:1000000' (region starting from 1,000,000bp) or
- `chr2:1,000,000-2,000,000' (region between 1,000,000 and
- 2,000,000bp including the end points). The coordinate is
- 1-based.
-
- OPTIONS:
-
- -b Output in the BAM format.
-
- -f INT Only output alignments with all bits in INT present
- in the FLAG field. INT can be in hex in the format of
- /^0x[0-9A-F]+/ [0]
-
- -F INT Skip alignments with bits present in INT [0]
-
- -h Include the header in the output.
-
- -H Output the header only.
-
- -l STR Only output reads in library STR [null]
-
- -o FILE Output file [stdout]
-
- -q INT Skip alignments with MAPQ smaller than INT [0]
-
- -r STR Only output reads in read group STR [null]
-
- -R FILE Output reads in read groups listed in FILE [null]
-
- -S Input is in SAM. If @SQ header lines are absent, the
- `-t' option is required.
-
- -c Instead of printing the alignments, only count them
- and print the total number. All filter options, such
- as `-f', `-F' and `-q' , are taken into account.
-
- -t FILE This file is TAB-delimited. Each line must contain
- the reference name and the length of the reference,
- one line for each distinct reference; additional
- fields are ignored. This file also defines the order
- of the reference sequences in sorting. If you run
- `samtools faidx <ref.fa>', the resultant index file
- <ref.fa>.fai can be used as this <in.ref_list> file.
-
- -u Output uncompressed BAM. This option saves time spent
- on compression/decomprssion and is thus preferred
- when the output is piped to another samtools command.
-
-
- tview samtools tview <in.sorted.bam> [ref.fasta]
-
- Text alignment viewer (based on the ncurses library). In the
- viewer, press `?' for help and press `g' to check the align-
- ment start from a region in the format like
- `chr10:10,000,000' or `=10,000,000' when viewing the same
- reference sequence.
-
-
- mpileup samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l
- list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam
- [...]]
-
- Generate BCF or pileup for one or multiple BAM files. Align-
- ment records are grouped by sample identifiers in @RG header
- lines. If sample identifiers are absent, each input file is
- regarded as one sample.
-
- OPTIONS:
-
- -B Disable probabilistic realignment for the computation
- of base alignment quality (BAQ). BAQ is the Phred-
- scaled probability of a read base being misaligned.
- Applying this option greatly helps to reduce false
- SNPs caused by misalignments.
-
- -C INT Coefficient for downgrading mapping quality for reads
- containing excessive mismatches. Given a read with a
- phred-scaled probability q of being generated from
- the mapped position, the new mapping quality is about
- sqrt((INT-q)/INT)*INT. A zero value disables this
- functionality; if enabled, the recommended value for
- BWA is 50. [0]
-
- -e INT Phred-scaled gap extension sequencing error probabil-
- ity. Reducing INT leads to longer indels. [20]
-
- -f FILE The reference file [null]
-
- -g Compute genotype likelihoods and output them in the
- binary call format (BCF).
-
- -h INT Coefficient for modeling homopolymer errors. Given an
- l-long homopolymer run, the sequencing error of an
- indel of size s is modeled as INT*s/l. [100]
-
- -l FILE File containing a list of sites where pileup or BCF
- is outputted [null]
-
- -o INT Phred-scaled gap open sequencing error probability.
- Reducing INT leads to more indel calls. [40]
-
- -P STR Comma dilimited list of platforms (determined by @RG-
- PL) from which indel candidates are obtained. It is
- recommended to collect indel candidates from sequenc-
- ing technologies that have low indel error rate such
- as ILLUMINA. [all]
-
- -q INT Minimum mapping quality for an alignment to be used
- [0]
-
- -Q INT Minimum base quality for a base to be considered [13]
-
- -r STR Only generate pileup in region STR [all sites]
-
- -u Similar to -g except that the output is uncompressed
- BCF, which is preferred for piping.
-
-
- reheader samtools reheader <in.header.sam> <in.bam>
-
- Replace the header in in.bam with the header in
- in.header.sam. This command is much faster than replacing
- the header with a BAM->SAM->BAM conversion.
-
-
- sort samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>
-
- Sort alignments by leftmost coordinates. File <out.pre-
- fix>.bam will be created. This command may also create tempo-
- rary files <out.prefix>.%d.bam when the whole alignment can-
- not be fitted into memory (controlled by option -m).
-
- OPTIONS:
-
- -o Output the final alignment to the standard output.
-
- -n Sort by read names rather than by chromosomal coordi-
- nates
-
- -m INT Approximately the maximum required memory.
- [500000000]
-
-
- merge samtools merge [-nur] [-h inh.sam] [-R reg] <out.bam>
- <in1.bam> <in2.bam> [...]
-
- Merge multiple sorted alignments. The header reference lists
- of all the input BAM files, and the @SQ headers of inh.sam,
- if any, must all refer to the same set of reference
- sequences. The header reference list and (unless overridden
- by -h) `@' headers of in1.bam will be copied to out.bam, and
- the headers of other files will be ignored.
-
- OPTIONS:
-
- -h FILE Use the lines of FILE as `@' headers to be copied to
- out.bam, replacing any header lines that would other-
- wise be copied from in1.bam. (FILE is actually in
- SAM format, though any alignment records it may con-
- tain are ignored.)
-
- -R STR Merge files in the specified region indicated by STR
-
- -r Attach an RG tag to each alignment. The tag value is
- inferred from file names.
-
- -n The input alignments are sorted by read names rather
- than by chromosomal coordinates
-
- -u Uncompressed BAM output
-
-
- index samtools index <aln.bam>
-
- Index sorted alignment for fast random access. Index file
- <aln.bam>.bai will be created.
-
-
- idxstats samtools idxstats <aln.bam>
-
- Retrieve and print stats in the index file. The output is TAB
- delimited with each line consisting of reference sequence
- name, sequence length, # mapped reads and # unmapped reads.
-
-
- faidx samtools faidx <ref.fasta> [region1 [...]]
-
- Index reference sequence in the FASTA format or extract sub-
- sequence from indexed reference sequence. If no region is
- specified, faidx will index the file and create
- <ref.fasta>.fai on the disk. If regions are speficified, the
- subsequences will be retrieved and printed to stdout in the
- FASTA format. The input file can be compressed in the RAZF
- format.
-
-
- fixmate samtools fixmate <in.nameSrt.bam> <out.bam>
-
- Fill in mate coordinates, ISIZE and mate related flags from a
- name-sorted alignment.
-
-
- rmdup samtools rmdup [-sS] <input.srt.bam> <out.bam>
-
- Remove potential PCR duplicates: if multiple read pairs have
- identical external coordinates, only retain the pair with
- highest mapping quality. In the paired-end mode, this com-
- mand ONLY works with FR orientation and requires ISIZE is
- correctly set. It does not work for unpaired reads (e.g. two
- ends mapped to different chromosomes or orphan reads).
-
- OPTIONS:
-
- -s Remove duplicate for single-end reads. By default,
- the command works for paired-end reads only.
-
- -S Treat paired-end reads and single-end reads.
-
-
- calmd samtools calmd [-eubSr] [-C capQcoef] <aln.bam> <ref.fasta>
-
- Generate the MD tag. If the MD tag is already present, this
- command will give a warning if the MD tag generated is dif-
- ferent from the existing tag. Output SAM by default.
-
- OPTIONS:
-
- -A When used jointly with -r this option overwrites the
- original base quality.
-
- -e Convert a the read base to = if it is identical to
- the aligned reference base. Indel caller does not
- support the = bases at the moment.
-
- -u Output uncompressed BAM
-
- -b Output compressed BAM
-
- -S The input is SAM with header lines
-
- -C INT Coefficient to cap mapping quality of poorly mapped
- reads. See the pileup command for details. [0]
-
- -r Compute the BQ tag without changing the base quality.
-
-
- pileup samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
- [-l in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N
- nHap] [-r pairDiffRate] [-m mask] [-d maxIndelDepth] [-G
- indelPrior] <in.bam>|<in.sam>
-
- Print the alignment in the pileup format. In the pileup for-
- mat, each line represents a genomic position, consisting of
- chromosome name, coordinate, reference base, read bases, read
- qualities and alignment mapping qualities. Information on
- match, mismatch, indel, strand, mapping quality and start and
- end of a read are all encoded at the read base column. At
- this column, a dot stands for a match to the reference base
- on the forward strand, a comma for a match on the reverse
- strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
- match on the forward strand and `acgtn' for a mismatch on the
- reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates
- there is an insertion between this reference position and the
- next reference position. The length of the insertion is given
- by the integer in the pattern, followed by the inserted
- sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' repre-
- sents a deletion from the reference. The deleted bases will
- be presented as `*' in the following lines. Also at the read
- base column, a symbol `^' marks the start of a read. The
- ASCII of the character following `^' minus 33 gives the map-
- ping quality. A symbol `$' marks the end of a read segment.
-
- If option -c is applied, the consensus base, Phred-scaled
- consensus quality, SNP quality (i.e. the Phred-scaled proba-
- bility of the consensus being identical to the reference) and
- root mean square (RMS) mapping quality of the reads covering
- the site will be inserted between the `reference base' and
- the `read bases' columns. An indel occupies an additional
- line. Each indel line consists of chromosome name, coordi-
- nate, a star, the genotype, consensus quality, SNP quality,
- RMS mapping quality, # covering reads, the first alllele, the
- second allele, # reads supporting the first allele, # reads
- supporting the second allele and # reads containing indels
- different from the top two alleles.
-
- NOTE: Since 0.1.10, the `pileup' command is deprecated by
- `mpileup'.
-
- OPTIONS:
-
- -B Disable the BAQ computation. See the mpileup com-
- mand for details.
-
- -c Call the consensus sequence. Options -T, -N, -I and
- -r are only effective when -c or -g is in use.
-
- -C INT Coefficient for downgrading the mapping quality of
- poorly mapped reads. See the mpileup command for
- details. [0]
-
- -d INT Use the first NUM reads in the pileup for indel
- calling for speed up. Zero for unlimited. [1024]
-
- -f FILE The reference sequence in the FASTA format. Index
- file FILE.fai will be created if absent.
-
- -g Generate genotype likelihood in the binary GLFv3
- format. This option suppresses -c, -i and -s. This
- option is deprecated by the mpileup command.
-
- -i Only output pileup lines containing indels.
-
- -I INT Phred probability of an indel in sequencing/prep.
- [40]
-
- -l FILE List of sites at which pileup is output. This file
- is space delimited. The first two columns are
- required to be chromosome and 1-based coordinate.
- Additional columns are ignored. It is recommended
- to use option
-
- -m INT Filter reads with flag containing bits in INT
- [1796]
-
- -M INT Cap mapping quality at INT [60]
-
- -N INT Number of haplotypes in the sample (>=2) [2]
-
- -r FLOAT Expected fraction of differences between a pair of
- haplotypes [0.001]
-
- -s Print the mapping quality as the last column. This
- option makes the output easier to parse, although
- this format is not space efficient.
-
- -S The input file is in SAM.
-
- -t FILE List of reference names ane sequence lengths, in
- the format described for the import command. If
- this option is present, samtools assumes the input
- <in.alignment> is in SAM format; otherwise it
- assumes in BAM format. -s together with -l as in
- the default format we may not know the mapping
- quality.
-
- -T FLOAT The theta parameter (error dependency coefficient)
- in the maq consensus calling model [0.85]
-
-
-SAM FORMAT
- SAM is TAB-delimited. Apart from the header lines, which are started
- with the `@' symbol, each alignment line consists of:
-
-
- +----+-------+----------------------------------------------------------+
- |Col | Field | Description |
- +----+-------+----------------------------------------------------------+
- | 1 | QNAME | Query (pair) NAME |
- | 2 | FLAG | bitwise FLAG |
- | 3 | RNAME | Reference sequence NAME |
- | 4 | POS | 1-based leftmost POSition/coordinate of clipped sequence |
- | 5 | MAPQ | MAPping Quality (Phred-scaled) |
- | 6 | CIAGR | extended CIGAR string |
- | 7 | MRNM | Mate Reference sequence NaMe (`=' if same as RNAME) |
- | 8 | MPOS | 1-based Mate POSistion |
- | 9 | ISIZE | Inferred insert SIZE |
- |10 | SEQ | query SEQuence on the same strand as the reference |
- |11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) |
- |12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE |
- +----+-------+----------------------------------------------------------+
-
- Each bit in the FLAG field is defined as:
-
-
- +-------+-----+--------------------------------------------------+
- | Flag | Chr | Description |
- +-------+-----+--------------------------------------------------+
- |0x0001 | p | the read is paired in sequencing |
- |0x0002 | P | the read is mapped in a proper pair |
- |0x0004 | u | the query sequence itself is unmapped |
- |0x0008 | U | the mate is unmapped |
- |0x0010 | r | strand of the query (1 for reverse) |
- |0x0020 | R | strand of the mate |
- |0x0040 | 1 | the read is the first read in a pair |
- |0x0080 | 2 | the read is the second read in a pair |
- |0x0100 | s | the alignment is not primary |
- |0x0200 | f | the read fails platform/vendor quality checks |
- |0x0400 | d | the read is either a PCR or an optical duplicate |
- +-------+-----+--------------------------------------------------+
-
-EXAMPLES
- o Import SAM to BAM when @SQ lines are present in the header:
-
- samtools view -bS aln.sam > aln.bam
-
- If @SQ lines are absent:
-
- samtools faidx ref.fa
- samtools view -bt ref.fa.fai aln.sam > aln.bam
-
- where ref.fa.fai is generated automatically by the faidx command.
-
-
- o Attach the RG tag while merging sorted alignments:
-
- perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
- mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
- samtools merge -rh rg.txt merged.bam ga.bam 454.bam
-
- The value in a RG tag is determined by the file name the read is com-
- ing from. In this example, in the merged.bam, reads from ga.bam will
- be attached RG:Z:ga, while reads from 454.bam will be attached
- RG:Z:454.
-
-
- o Call SNPs and short indels for one diploid individual:
-
- samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - >
- var.raw.bcf
- bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 >
- var.flt.vcf
-
- The -D option of varFilter controls the maximum read depth, which
- should be adjusted to about twice the average read depth. One may
- consider to add -C50 to mpileup if mapping quality is overestimated
- for reads containing excessive mismatches. Applying this option usu-
- ally helps BWA-short but may not other mappers.
-
-
- o Call SNPs and short indels for multiple diploid individuals:
-
- samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view
- -bcvg - > var.raw.bcf
- bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 >
- var.flt.vcf
-
- Individuals are identified from the SM tags in the @RG header lines.
- Individuals can be pooled in one alignment file; one individual can
- also be separated into multiple files. The -P option specifies that
- indel candidates should be collected only from read groups with the
- @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads
- sequenced by an indel-prone technology may affect the performance of
- indel calling.
-
-
- o Derive the allele frequency spectrum (AFS) on a list of sites from
- multiple individuals:
-
- samtools mpileup -Igf ref.fa *.bam > all.bcf
- bcftools view -bl sites.list all.bcf > sites.bcf
- bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
- bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
- bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs
- ......
-
- where sites.list contains the list of sites with each line consisting
- of the reference sequence name and position. The following bcftools
- commands estimate AFS by EM.
-
-
- o Dump BAQ applied alignment for other SNP callers:
-
- samtools calmd -bAr aln.bam > aln.baq.bam
-
- It adds and corrects the NM and MD tags at the same time. The calmd
- command also comes with the -C option, the same as the one in pileup
- and mpileup. Apply if it helps.
-
-
-LIMITATIONS
- o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
- bam_aux.c.
-
- o In merging, the input files are required to have the same number of
- reference sequences. The requirement can be relaxed. In addition,
- merging does not reconstruct the header dictionaries automatically.
- Endusers have to provide the correct header. Picard is better at
- merging.
-
- o Samtools paired-end rmdup does not work for unpaired reads (e.g.
- orphan reads or ends mapped to different chromosomes). If this is a
- concern, please use Picard's MarkDuplicate which correctly handles
- these cases, although a little slower.
-
-
-AUTHOR
- Heng Li from the Sanger Institute wrote the C version of samtools. Bob
- Handsaker from the Broad Institute implemented the BGZF library and Jue
- Ruan from Beijing Genomics Institute wrote the RAZF library. John Mar-
- shall and Petr Danecek contribute to the source code and various people
- from the 1000 Genomes Project have contributed to the SAM format speci-
- fication.
-
-
-SEE ALSO
- Samtools website: <http://samtools.sourceforge.net>
-
-
-
-samtools-0.1.12 2 December 2010 samtools(1)