From: Charles Plessy Date: Wed, 2 Mar 2011 04:30:55 +0000 (+0900) Subject: Imported Upstream version 0.1.13 X-Git-Url: http://woldlab.caltech.edu/gitweb/?p=samtools.git;a=commitdiff_plain;h=c34624801b980425af68c3c431423c72b18c14fe Imported Upstream version 0.1.13 --- diff --git a/ChangeLog b/ChangeLog index dd62b49..a471838 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,490 @@ +------------------------------------------------------------------------ +r925 | lh3lh3 | 2011-02-28 15:45:17 -0500 (Mon, 28 Feb 2011) | 2 lines +Changed paths: + M /trunk/samtools/phase.c + +minor changes to a heuristic rule + +------------------------------------------------------------------------ +r924 | lh3lh3 | 2011-02-28 15:24:04 -0500 (Mon, 28 Feb 2011) | 4 lines +Changed paths: + M /trunk/samtools/bam.h + M /trunk/samtools/bcftools/vcfutils.pl + M /trunk/samtools/phase.c + + * 0.1.12-r924:126 + * fixed a bug in phase (due to recent changes) + * fixed a bug in vcf2fq + +------------------------------------------------------------------------ +r923 | lh3lh3 | 2011-02-28 12:57:39 -0500 (Mon, 28 Feb 2011) | 5 lines +Changed paths: + M /trunk/samtools/Makefile + M /trunk/samtools/bam.h + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + M /trunk/samtools/phase.c + + * put version number in bam.h + * write version to BCF + * in phase, change the default -q to 37 + * output a little more information during phasing + +------------------------------------------------------------------------ +r922 | lh3lh3 | 2011-02-25 16:40:09 -0500 (Fri, 25 Feb 2011) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bamtk.c + M /trunk/samtools/bcftools/bcf.c + M /trunk/samtools/bcftools/bcf.tex + M /trunk/samtools/bcftools/bcf2qcall.c + M /trunk/samtools/bcftools/bcfutils.c + M /trunk/samtools/bcftools/ld.c + M /trunk/samtools/bcftools/prob1.c + M /trunk/samtools/bcftools/vcf.c + M /trunk/samtools/cut_target.c + + * change the order of PL/GL according to the latest VCF spec + * change the type of SP to int32_t + +------------------------------------------------------------------------ +r921 | lh3lh3 | 2011-02-25 14:40:56 -0500 (Fri, 25 Feb 2011) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/bcf.tex + +update the BCF spec + +------------------------------------------------------------------------ +r920 | lh3lh3 | 2011-02-25 00:59:27 -0500 (Fri, 25 Feb 2011) | 2 lines +Changed paths: + M /trunk/samtools/Makefile + M /trunk/samtools/bam_md.c + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bam_sort.c + M /trunk/samtools/bamtk.c + A /trunk/samtools/cut_target.c + M /trunk/samtools/errmod.h + M /trunk/samtools/faidx.c + M /trunk/samtools/khash.h + M /trunk/samtools/kstring.c + M /trunk/samtools/kstring.h + A /trunk/samtools/phase.c + M /trunk/samtools/samtools.1 + +added the phase command + +------------------------------------------------------------------------ +r918 | lh3lh3 | 2011-02-24 10:05:54 -0500 (Thu, 24 Feb 2011) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/prob1.c + M /trunk/samtools/bcftools/prob1.h + +added "const" to bcf_p1_cal() + +------------------------------------------------------------------------ +r917 | lh3lh3 | 2011-02-24 09:36:30 -0500 (Thu, 24 Feb 2011) | 2 lines +Changed paths: + M /trunk/samtools/bam.c + +more meaningful BAM truncation message + +------------------------------------------------------------------------ +r916 | lh3lh3 | 2011-02-24 09:35:06 -0500 (Thu, 24 Feb 2011) | 3 lines +Changed paths: + M /trunk/samtools/bcftools/bcf.c + M /trunk/samtools/bcftools/vcf.c + + * automatically fix errors in GL + * output unrecognized FORMAT as "." + +------------------------------------------------------------------------ +r913 | lh3lh3 | 2011-02-10 22:59:47 -0500 (Thu, 10 Feb 2011) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/bcf.h + M /trunk/samtools/bcftools/call1.c + M /trunk/samtools/bcftools/vcf.c + +finished VCF->BCF conversion + +------------------------------------------------------------------------ +r910 | petulda | 2011-02-03 03:13:48 -0500 (Thu, 03 Feb 2011) | 1 line +Changed paths: + M /trunk/samtools/bcftools/vcfutils.pl + +Prevent division by zero +------------------------------------------------------------------------ +r909 | lh3lh3 | 2011-02-02 11:29:20 -0500 (Wed, 02 Feb 2011) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/call1.c + +fixed a typo in the VCF header + +------------------------------------------------------------------------ +r908 | lh3lh3 | 2011-02-02 11:28:24 -0500 (Wed, 02 Feb 2011) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam_index.c + + * fixed an out-of-boundary bug + * improved sorting order checking in index + +------------------------------------------------------------------------ +r907 | lh3lh3 | 2011-01-29 22:59:20 -0500 (Sat, 29 Jan 2011) | 4 lines +Changed paths: + M /trunk/samtools/INSTALL + M /trunk/samtools/bam_tview.c + M /trunk/samtools/knetfile.c + + * avoid a segfault when network connect fails + * update INSTALL + * fixed a bug in tview on big-endian by Nathan Weeks + +------------------------------------------------------------------------ +r903 | lh3lh3 | 2011-01-27 14:50:02 -0500 (Thu, 27 Jan 2011) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bam_md.c + + * fixed a rare memory issue in bam_md.c + * fixed a bug in indel calling related to unmapped and refskip reads + +------------------------------------------------------------------------ +r902 | lh3lh3 | 2011-01-23 21:46:18 -0500 (Sun, 23 Jan 2011) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/fet.c + +fixed two minor bugs in Fisher's exact test + +------------------------------------------------------------------------ +r899 | petulda | 2011-01-19 09:28:02 -0500 (Wed, 19 Jan 2011) | 1 line +Changed paths: + M /trunk/samtools/bcftools/vcfutils.pl + +Skip sites with unknown ref +------------------------------------------------------------------------ +r898 | lh3lh3 | 2011-01-15 12:56:05 -0500 (Sat, 15 Jan 2011) | 2 lines +Changed paths: + M /trunk/samtools/ChangeLog + M /trunk/samtools/bam_maqcns.c + M /trunk/samtools/bam_md.c + +move bam_nt16_nt4_table[] from bam_maqcns.c to bam_md.c + +------------------------------------------------------------------------ +r896 | lh3lh3 | 2011-01-06 10:52:15 -0500 (Thu, 06 Jan 2011) | 3 lines +Changed paths: + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + M /trunk/samtools/bcftools/bcf.h + M /trunk/samtools/bcftools/bcfutils.c + M /trunk/samtools/bcftools/call1.c + + * samtools-0.1.12-10 (r896) + * allow to exclude read groups in mpileup + +------------------------------------------------------------------------ +r895 | lh3lh3 | 2011-01-04 11:31:29 -0500 (Tue, 04 Jan 2011) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/bcf.tex + +sorry. It is SP not ST + +------------------------------------------------------------------------ +r894 | lh3lh3 | 2011-01-04 11:29:06 -0500 (Tue, 04 Jan 2011) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/bcf.tex + +added ST + +------------------------------------------------------------------------ +r893 | petulda | 2011-01-04 06:55:56 -0500 (Tue, 04 Jan 2011) | 1 line +Changed paths: + M /trunk/samtools/bcftools/call1.c + +Fixed a typo in read_samples +------------------------------------------------------------------------ +r892 | jmarshall | 2010-12-28 08:06:49 -0500 (Tue, 28 Dec 2010) | 9 lines +Changed paths: + M /trunk/samtools/Makefile + M /trunk/samtools/bcftools/Makefile + M /trunk/samtools/examples/Makefile + +System libraries go *after* user libraries in link commands, because +the user libraries may themselves have dependencies that are satisfied +by the system libraries. It's not rocket science! + +This makes a difference with some linkers; or with -static or --as-needed. + +The examples/Makefile fix is from Charles Plessy. +See also http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=606004 + +------------------------------------------------------------------------ +r891 | lh3lh3 | 2010-12-21 12:16:33 -0500 (Tue, 21 Dec 2010) | 3 lines +Changed paths: + M /trunk/samtools/bamtk.c + M /trunk/samtools/bcftools/bcf.h + M /trunk/samtools/bcftools/bcfutils.c + M /trunk/samtools/bcftools/call1.c + + * samtools-0.1.12-9 (r891) + * allow to call SNPs from a subset of samples + +------------------------------------------------------------------------ +r889 | lh3lh3 | 2010-12-15 11:28:16 -0500 (Wed, 15 Dec 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.12-12 (r889) + * set mapQ as 20 if it equals 255 + +------------------------------------------------------------------------ +r888 | lh3lh3 | 2010-12-14 22:41:09 -0500 (Tue, 14 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + +When -B is applied to mpileup, still use paired reads only unless -A is flagged. + +------------------------------------------------------------------------ +r887 | lh3lh3 | 2010-12-14 22:37:05 -0500 (Tue, 14 Dec 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam_md.c + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.12-6 (r887) + * added a hidden option -E to mpileup/calmd. -E triggers an alternative way to apply BAQ. + +------------------------------------------------------------------------ +r886 | lh3lh3 | 2010-12-14 12:51:03 -0500 (Tue, 14 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bamtk.c + +(Arguably) improved the indel caller a tiny bit for lowCov data. + +------------------------------------------------------------------------ +r885 | petulda | 2010-12-14 04:55:46 -0500 (Tue, 14 Dec 2010) | 1 line +Changed paths: + M /trunk/samtools/bcftools/call1.c + +Fixed the VCF header to pass validation +------------------------------------------------------------------------ +r884 | lh3lh3 | 2010-12-12 23:02:19 -0500 (Sun, 12 Dec 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bamtk.c + M /trunk/samtools/bcftools/vcfutils.pl + + * samtools-0.1.12-4 (r884) + * fixed a long-existing flaw in the INDEL calling model + +------------------------------------------------------------------------ +r883 | lh3lh3 | 2010-12-11 20:05:42 -0500 (Sat, 11 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/bcfutils.c + M /trunk/samtools/bcftools/call1.c + M /trunk/samtools/bcftools/vcfutils.pl + +compute max SP and max GQ from sample genotypes + +------------------------------------------------------------------------ +r880 | lh3lh3 | 2010-12-10 10:50:54 -0500 (Fri, 10 Dec 2010) | 2 lines +Changed paths: + D /trunk/samtools/bcftools/bcf-fix.pl + +drop bcf-fix.pl as it is redundant by the latest changes + +------------------------------------------------------------------------ +r879 | lh3lh3 | 2010-12-10 10:50:29 -0500 (Fri, 10 Dec 2010) | 3 lines +Changed paths: + M /trunk/samtools/bcftools/call1.c + M /trunk/samtools/bcftools/vcf.c + + * fixed a minor issue in printing VCFs + * write bcftools specific INFO and FORMAT in the header + +------------------------------------------------------------------------ +r878 | lh3lh3 | 2010-12-10 10:09:14 -0500 (Fri, 10 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/bamtk.c + M /trunk/samtools/bcftools/bcfutils.c + M /trunk/samtools/bcftools/call1.c + +Make sure that the GT genotype field is the first + +------------------------------------------------------------------------ +r877 | lh3lh3 | 2010-12-08 17:27:05 -0500 (Wed, 08 Dec 2010) | 7 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf.h + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.12-2 (r877) + + * allow to fine control the selection of indel candidates. The current + setting is okay for lowCov and highCov with ~100 samples, but it + skips too many indels for highCov with >250 samples. + + +------------------------------------------------------------------------ +r874 | lh3lh3 | 2010-12-07 22:40:35 -0500 (Tue, 07 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam_plcmd.c + +a spelling error.. + +------------------------------------------------------------------------ +r873 | lh3lh3 | 2010-12-07 22:39:57 -0500 (Tue, 07 Dec 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.12-1 (r873) + * added a switch to allow anomalous read pairs in calling + +------------------------------------------------------------------------ +r872 | lh3lh3 | 2010-12-07 14:43:54 -0500 (Tue, 07 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/vcfutils.pl + +fixed a bug in vcf2fq + +------------------------------------------------------------------------ +r869 | lh3lh3 | 2010-12-05 01:18:06 -0500 (Sun, 05 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/bamtk.c + +added a warning for the Windows version + +------------------------------------------------------------------------ +r868 | lh3lh3 | 2010-12-05 01:05:51 -0500 (Sun, 05 Dec 2010) | 4 lines +Changed paths: + M /trunk/samtools/bcftools/call1.c + +In ksprintf(), change "%lf" and "%lg" to "%f" and "%g", respectively. +According to the manual page, this change is valid. However, MinGW seems +to interpret "%lf" as "%Lf". + +------------------------------------------------------------------------ +r867 | lh3lh3 | 2010-12-05 00:35:43 -0500 (Sun, 05 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/Makefile.mingw + M /trunk/samtools/bam_aux.c + +bring back the windows support + +------------------------------------------------------------------------ +r866 | lh3lh3 | 2010-12-04 23:33:51 -0500 (Sat, 04 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam_reheader.c + M /trunk/samtools/bcftools/vcfutils.pl + +Fixed a compiling error when knetfile is not used. + +------------------------------------------------------------------------ +r865 | lh3lh3 | 2010-12-04 00:13:22 -0500 (Sat, 04 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/vcfutils.pl + +vcf->fastq + +------------------------------------------------------------------------ +r864 | lh3lh3 | 2010-12-03 17:12:30 -0500 (Fri, 03 Dec 2010) | 3 lines +Changed paths: + M /trunk/samtools/bcftools/call1.c + M /trunk/samtools/bcftools/prob1.c + M /trunk/samtools/bcftools/prob1.h + + * remove "-f". Instead always compute consensus quality + * increase the upper limit of quality + +------------------------------------------------------------------------ +r863 | lh3lh3 | 2010-12-03 15:28:15 -0500 (Fri, 03 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/bcf.c + +more informative error message + +------------------------------------------------------------------------ +r862 | lh3lh3 | 2010-12-02 16:16:08 -0500 (Thu, 02 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/NEWS + M /trunk/samtools/bamtk.c + +Release samtools-0.1.12a + +------------------------------------------------------------------------ +r861 | lh3lh3 | 2010-12-02 15:55:06 -0500 (Thu, 02 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/call1.c + +a possible fix to DP4=0,0,0,0; have not tested, but should have no side-effect + +------------------------------------------------------------------------ +r859 | lh3lh3 | 2010-12-02 11:39:57 -0500 (Thu, 02 Dec 2010) | 2 lines +Changed paths: + M /trunk/samtools/NEWS + M /trunk/samtools/bam_index.c + M /trunk/samtools/bamtk.c + M /trunk/samtools/samtools.1 + +Release samtools-0.1.12 + +------------------------------------------------------------------------ +r858 | lh3lh3 | 2010-12-02 11:24:41 -0500 (Thu, 02 Dec 2010) | 4 lines +Changed paths: + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + M /trunk/samtools/bcftools/bcf.c + + * samtools-0.1.11-1 (r858) + * fixed a bug in mpileup which causes segfaults + * bcftools: do not segfault when BCF contains errors + +------------------------------------------------------------------------ +r857 | lh3lh3 | 2010-11-30 23:52:50 -0500 (Tue, 30 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam_index.c + +fixed a memory leak in bam_fetch() + +------------------------------------------------------------------------ +r856 | lh3lh3 | 2010-11-26 00:07:31 -0500 (Fri, 26 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bcftools/vcfutils.pl + + * fixed a memory violation + * added splitchr to vcfutils.pl + +------------------------------------------------------------------------ +r854 | lh3lh3 | 2010-11-23 09:05:08 -0500 (Tue, 23 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/ld.c + +fixed a typo/bug in r^2 computation + +------------------------------------------------------------------------ +r852 | lh3lh3 | 2010-11-21 22:20:20 -0500 (Sun, 21 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bamtk.c + +forget to change the version information + +------------------------------------------------------------------------ +r851 | lh3lh3 | 2010-11-21 22:16:52 -0500 (Sun, 21 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/ChangeLog + M /trunk/samtools/NEWS + M /trunk/samtools/bcftools/bcftools.1 + M /trunk/samtools/samtools.1 + +Release samtools-0.1.11 + ------------------------------------------------------------------------ r844 | lh3lh3 | 2010-11-19 23:16:08 -0500 (Fri, 19 Nov 2010) | 3 lines Changed paths: diff --git a/INSTALL b/INSTALL index f1cf7aa..37d84a9 100644 --- a/INSTALL +++ b/INSTALL @@ -1,29 +1,30 @@ System Requirements =================== -SAMtools depends on the zlib library . The latest -version 1.2.3 is preferred and with the latest version you can compile -razip and use it to compress a FASTA file. SAMtools' faidx is able to -index a razip-compressed FASTA file to save diskspace. Older zlib also -works with SAMtools, but razip cannot be compiled. +SAMtools depends on the zlib library . Version 1.2.3+ is +preferred and with 1.2.3+ you can compile razip and use it to compress a FASTA +file. SAMtools' faidx is able to index a razip-compressed FASTA file to save +diskspace. Older zlib also works with SAMtools, but razip cannot be compiled. The text-based viewer (tview) requires the GNU ncurses library -, which comes with Mac OS X and -most of the modern Linux/Unix distributions. If you do not have this -library installed, you can still compile the rest of SAMtools by -manually modifying one line in Makefile. +, which comes with Mac OS X and most of +the modern Linux/Unix distributions. If you do not have this library installed, +you can still compile the rest of SAMtools by manually changing: +`-D_CURSES_LIB=1' to `-D_CURSES_LIB=0' at the line starting with `DFLAGS=', and +comment out the line starting with `LIBCURSES='. Compilation =========== -Type `make' to compile samtools. If you have zlib >= 1.2.2.1, you can -compile razip with `make razip'. +Type `make' to compile samtools. If you have zlib >= 1.2.2.1, you can compile +razip with `make razip'. Installation ============ -Simply copy `samtools' and other executables/scripts in `misc' to a -location you want (e.g. a directory in your $PATH). No further -configurations are required. +Copy `samtools', `bcftools/bcftools' and other executables/scripts in `misc' to +a location you want (e.g. a directory in your $PATH). You may also copy +`samtools.1' and `bcftools/bcftools.1' to a directory in your $MANPATH such +that the `man' command may find the manual. diff --git a/Makefile b/Makefile index 13d4a76..af93d9c 100644 --- a/Makefile +++ b/Makefile @@ -1,13 +1,14 @@ CC= gcc CFLAGS= -g -Wall -O2 #-m64 #-arch ppc -DFLAGS= -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE -D_CURSES_LIB=1 +DFLAGS= -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE -D_CURSES_LIB=1 KNETFILE_O= knetfile.o LOBJS= bgzf.o kstring.o bam_aux.o bam.o bam_import.o sam.o bam_index.o \ bam_pileup.o bam_lpileup.o bam_md.o glf.o razf.o faidx.o \ $(KNETFILE_O) bam_sort.o sam_header.o bam_reheader.o kprobaln.o AOBJS= bam_tview.o bam_maqcns.o bam_plcmd.o sam_view.o \ bam_rmdup.o bam_rmdupse.o bam_mate.o bam_stat.o bam_color.o \ - bamtk.o kaln.o bam2bcf.o bam2bcf_indel.o errmod.o sample.o + bamtk.o kaln.o bam2bcf.o bam2bcf_indel.o errmod.o sample.o \ + cut_target.o phase.o PROG= samtools INCLUDES= -I. SUBDIRS= . bcftools misc @@ -40,7 +41,7 @@ libbam.a:$(LOBJS) $(AR) -cru $@ $(LOBJS) samtools:lib-recur $(AOBJS) - $(CC) $(CFLAGS) -o $@ $(AOBJS) libbam.a -lm $(LIBPATH) $(LIBCURSES) -lz -Lbcftools -lbcf + $(CC) $(CFLAGS) -o $@ $(AOBJS) -Lbcftools $(LIBPATH) libbam.a -lbcf $(LIBCURSES) -lm -lz razip:razip.o razf.o $(KNETFILE_O) $(CC) $(CFLAGS) -o $@ razf.o razip.o $(KNETFILE_O) -lz @@ -66,6 +67,8 @@ bcf.o:bcftools/bcf.h bam2bcf.o:bam2bcf.h errmod.h bcftools/bcf.h bam2bcf_indel.o:bam2bcf.h errmod.o:errmod.h +phase.o:bam.h khash.h ksort.h +bamtk.o:bam.h faidx.o:faidx.h razf.h khash.h faidx_main.o:faidx.h razf.h diff --git a/Makefile.mingw b/Makefile.mingw index 9df4b9a..836360f 100644 --- a/Makefile.mingw +++ b/Makefile.mingw @@ -1,18 +1,21 @@ CC= gcc.exe AR= ar.exe CFLAGS= -g -Wall -O2 -DFLAGS= -D_CURSES_LIB=2 -D_USE_KNETFILE +DFLAGS= -D_USE_KNETFILE -D_CURSES_LIB=2 KNETFILE_O= knetfile.o LOBJS= bgzf.o kstring.o bam_aux.o bam.o bam_import.o sam.o bam_index.o \ - bam_pileup.o bam_lpileup.o bam_md.o glf.o razf.o faidx.o bam_sort.o \ - $(KNETFILE_O) + bam_pileup.o bam_lpileup.o bam_md.o glf.o razf.o faidx.o \ + $(KNETFILE_O) bam_sort.o sam_header.o bam_reheader.o kprobaln.o AOBJS= bam_tview.o bam_maqcns.o bam_plcmd.o sam_view.o \ bam_rmdup.o bam_rmdupse.o bam_mate.o bam_stat.o bam_color.o \ - bamtk.o kaln.o sam_header.o -PROG= samtools -INCLUDES= -Iwin32 + bamtk.o kaln.o bam2bcf.o bam2bcf_indel.o errmod.o sample.o +BCFOBJS= bcftools/bcf.o bcftools/fet.o bcftools/bcf2qcall.o bcftools/bcfutils.o \ + bcftools/call1.o bcftools/index.o bcftools/kfunc.o bcftools/ld.o \ + bcftools/prob1.o bcftools/vcf.o +PROG= samtools.exe bcftools.exe +INCLUDES= -I. -Iwin32 SUBDIRS= . -LIBPATH= +LIBPATH= .SUFFIXES:.c .o @@ -29,31 +32,33 @@ lib:libbam.a libbam.a:$(LOBJS) $(AR) -cru $@ $(LOBJS) -samtools:$(AOBJS) libbam.a - $(CC) $(CFLAGS) -o $@ $(AOBJS) $(LIBPATH) -lm -L. -lbam -Lwin32 -lz -lcurses -lws2_32 +samtools.exe:$(AOBJS) libbam.a $(BCFOBJS) + $(CC) $(CFLAGS) -o $@ $(AOBJS) $(BCFOBJS) $(LIBPATH) -lm -L. -lbam -Lwin32 -lz -lcurses -lws2_32 -razip:razip.o razf.o $(KNETFILE_O) - $(CC) $(CFLAGS) -o $@ razf.o razip.o $(KNETFILE_O) -lz - -bgzip:bgzip.o bgzf.o $(KNETFILE_O) - $(CC) $(CFLAGS) -o $@ bgzf.o bgzip.o $(KNETFILE_O) -lz +bcftools.exe:$(BCFOBJS) bcftools/main.o kstring.o bgzf.o knetfile.o + $(CC) $(CFLAGS) -o $@ $(BCFOBJS) bcftools/main.o kstring.o bgzf.o knetfile.o -lm -Lwin32 -lz -lws2_32 razip.o:razf.h -bam.o:bam.h razf.h bam_endian.h kstring.h +bam.o:bam.h razf.h bam_endian.h kstring.h sam_header.h sam.o:sam.h bam.h bam_import.o:bam.h kseq.h khash.h razf.h bam_pileup.o:bam.h razf.h ksort.h -bam_plcmd.o:bam.h faidx.h bam_maqcns.h glf.h +bam_plcmd.o:bam.h faidx.h bam_maqcns.h glf.h bcftools/bcf.h bam2bcf.h bam_index.o:bam.h khash.h ksort.h razf.h bam_endian.h bam_lpileup.o:bam.h ksort.h bam_tview.o:bam.h faidx.h bam_maqcns.h -bam_maqcns.o:bam.h ksort.h bam_maqcns.h +bam_maqcns.o:bam.h ksort.h bam_maqcns.h kaln.h bam_sort.o:bam.h ksort.h razf.h bam_md.o:bam.h faidx.h glf.o:glf.h +sam_header.o:sam_header.h khash.h +bcf.o:bcftools/bcf.h +bam2bcf.o:bam2bcf.h errmod.h bcftools/bcf.h +bam2bcf_indel.o:bam2bcf.h +errmod.o:errmod.h faidx.o:faidx.h razf.h khash.h faidx_main.o:faidx.h razf.h clean: - rm -fr gmon.out *.o *.exe *.dSYM razip bgzip $(PROG) *~ *.a + rm -fr gmon.out *.o a.out *.exe *.dSYM razip bgzip $(PROG) *~ *.a *.so.* *.so *.dylib diff --git a/NEWS b/NEWS index 6b4d8aa..8455b48 100644 --- a/NEWS +++ b/NEWS @@ -1,3 +1,84 @@ +Beta release 0.1.13 (1 March, 2011) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The most important though largely invisible modification is the change of the +order of genotypes in the PL VCF/BCF tag. This is to conform the upcoming VCF +spec v4.1. The change means that 0.1.13 is not backward compatible with VCF/BCF +generated by samtools older than r921 inclusive. VCF/BCF generated by the new +samtools will contain a line `##fileformat=VCFv4.1' as well as the samtools +version number. + +Single Individual Haplotyping (SIH) is added as an experimental feature. It +originally aims to produce haploid consensus from fosmid pool sequencing, but +also works with short-read data. For short reads, phased blocks are usually too +short to be useful in many applications, but they can help to rule out part of +SNPs close to INDELs or between copies of CNVs. + + +Other notable changes in samtools: + + * Construct per-sample consensus to reduce the effect of nearby SNPs in INDEL + calling. This reduces the power but improves specificity. + + * Improved sorting order checking in indexing. Now indexing is the preferred way + to check if a BAM is sorted. + + * Added a switch `-E' to mpileup and calmd. This option uses an alternative way + to apply BAQ, which increases sensistivity, especially to MNPs, at the cost of + a little loss in specificity. + + * Added `mpileup -A' to allow to use reads in anomalous pairs in SNP calling. + + * Added `mpileup -m' to allow fine control of the collection of INDEL candidates. + + * Added `mpileup -S' to compute per-sample strand bias P-value. + + * Added `mpileup -G' to exclude read groups in variant calling. + + * Fixed segfault in indel calling related to unmapped and refskip reads. + + * Fixed an integer overflow in INDEL calling. This bug produces wrong INDEL + genotypes for longer short INDELs, typically over 10bp. + + * Fixed a bug in tview on big-endian machines. + + * Fixed a very rare memory issue in bam_md.c + + * Fixed an out-of-boundary bug in mpileup when the read base is `N'. + + * Fixed a compiling error when the knetfile library is not used. Fixed a + library compiling error due to the lack of bam_nt16_nt4_table[] table. + Suppress a compiling warning related to the latest zlib. + + +Other notable changes in bcftools: + + * Updated the BCF spec. + + * Added the `FQ' VCF INFO field, which gives the phred-scaled probability + of all samples being the samei (identical to the reference or all homozygous + variants). Option `view -f' has been dropped. + + * Implementated of "vcfutils.pl vcf2fq" to generate a consensus sequence + similar to "samtools.pl pileup2fq". + + * Make sure the GT FORMAT field is always the first FORMAT to conform the VCF + spec. Drop bcf-fix.pl. + + * Output bcftools specific INFO and FORMAT in the VCF header. + + * Added `view -s' to call variants from a subset of samples. + + * Properly convert VCF to BCF with a user provided sequence dictionary. Nonetheless, + custom fields are still unparsed and will be stored as a missing value. + + * Fixed a minor bug in Fisher's exact test; the results are rarely changed. + + +(0.1.13: 1 March 2011, r926:134) + + + Beta release 0.1.12a (2 December, 2010) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -532,4 +613,4 @@ Beta Release 0.1.1 (22 December, 2008) The is the first public release of samtools. For more information, please check the manual page `samtools.1' and the samtools website -http://samtools.sourceforge.net \ No newline at end of file +http://samtools.sourceforge.net diff --git a/bam.c b/bam.c index 521c1dd..96aace2 100644 --- a/bam.c +++ b/bam.c @@ -79,7 +79,7 @@ bam_header_t *bam_header_read(bamFile fp) // with ESPIPE. Suppress the error message in this case. if (errno != ESPIPE) perror("[bam_header_read] bgzf_check_EOF"); } - else if (i == 0) fprintf(stderr, "[bam_header_read] EOF marker is absent.\n"); + else if (i == 0) fprintf(stderr, "[bam_header_read] EOF marker is absent. The input is probably truncated.\n"); // read "BAM1" magic_len = bam_read(fp, buf, 4); if (magic_len != 4 || strncmp(buf, "BAM\001", 4) != 0) { diff --git a/bam.h b/bam.h index eef2ea9..4ad2644 100644 --- a/bam.h +++ b/bam.h @@ -40,6 +40,8 @@ @copyright Genome Research Ltd. */ +#define BAM_VERSION "0.1.13 (r926:134)" + #include #include #include diff --git a/bam2bcf.c b/bam2bcf.c index 088635c..bd5503d 100644 --- a/bam2bcf.c +++ b/bam2bcf.c @@ -11,6 +11,7 @@ extern void ks_introsort_uint32_t(size_t n, uint32_t a[]); #define CALL_ETA 0.03f #define CALL_MAX 256 #define CALL_DEFTHETA 0.83f +#define DEF_MAPQ 20 #define CAP_DIST 25 @@ -23,6 +24,8 @@ bcf_callaux_t *bcf_call_init(double theta, int min_baseQ) bca->openQ = 40; bca->extQ = 20; bca->tandemQ = 100; bca->min_baseQ = min_baseQ; bca->e = errmod_init(1. - theta); + bca->min_frac = 0.002; + bca->min_support = 1; return bca; } @@ -61,7 +64,8 @@ int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base, bcf_callaux_t seqQ = is_indel? (p->aux>>8&0xff) : 99; if (q < bca->min_baseQ) continue; if (q > seqQ) q = seqQ; - mapQ = p->b->core.qual < bca->capQ? p->b->core.qual : bca->capQ; + mapQ = p->b->core.qual < 255? p->b->core.qual : DEF_MAPQ; // special case for mapQ==255 + mapQ = mapQ < bca->capQ? mapQ : bca->capQ; if (q > mapQ) q = mapQ; if (q > 63) q = 63; if (q < 4) q = 4; @@ -75,7 +79,7 @@ int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base, bcf_callaux_t } bca->bases[n++] = q<<5 | (int)bam1_strand(p->b)<<4 | b; // collect annotations - r->qsum[b] += q; + if (b < 4) r->qsum[b] += q; ++r->anno[0<<2|is_diff<<1|bam1_strand(p->b)]; min_dist = p->b->core.l_qseq - 1 - p->qpos; if (min_dist > p->qpos) min_dist = p->qpos; @@ -140,8 +144,8 @@ int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/, x = call->n_alleles * (call->n_alleles + 1) / 2; // get the possible genotypes for (i = z = 0; i < call->n_alleles; ++i) - for (j = i; j < call->n_alleles; ++j) - g[z++] = call->a[i] * 5 + call->a[j]; + for (j = 0; j <= i; ++j) + g[z++] = call->a[j] * 5 + call->a[i]; for (i = 0; i < n; ++i) { uint8_t *PL = call->PL + x * i; const bcf_callret1_t *r = calls + i; diff --git a/bam2bcf.h b/bam2bcf.h index 26b022c..9585672 100644 --- a/bam2bcf.h +++ b/bam2bcf.h @@ -9,7 +9,9 @@ typedef struct __bcf_callaux_t { int capQ, min_baseQ; - int openQ, extQ, tandemQ; + int openQ, extQ, tandemQ; // for indels + int min_support; // for collecting indel candidates + double min_frac; // for collecting indel candidates // for internal uses int max_bases; int indel_types[4]; diff --git a/bam2bcf_indel.c b/bam2bcf_indel.c index 16241d0..899428f 100644 --- a/bam2bcf_indel.c +++ b/bam2bcf_indel.c @@ -11,7 +11,6 @@ KHASH_SET_INIT_STR(rg) #define MINUS_CONST 0x10000000 #define INDEL_WINDOW_SIZE 50 -#define MIN_SUPPORT_COEF 500 void *bcf_call_add_rg(void *_hash, const char *hdtext, const char *list) { @@ -114,7 +113,7 @@ int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_calla extern void ks_introsort_uint32_t(int, uint32_t*); int i, s, j, k, t, n_types, *types, max_rd_len, left, right, max_ins, *score1, *score2, max_ref2; int N, K, l_run, ref_type, n_alt; - char *inscns = 0, *ref2, *query; + char *inscns = 0, *ref2, *query, **ref_sample; khash_t(rg) *hash = (khash_t(rg)*)rghash; if (ref == 0 || bca == 0) return -1; // mark filtered reads @@ -165,7 +164,7 @@ int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_calla // squeeze out identical types for (i = 1, n_types = 1; i < m; ++i) if (aux[i] != aux[i-1]) ++n_types; - if (n_types == 1 || n_alt * MIN_SUPPORT_COEF < n_tot) { // no indels or too few supporting reads + if (n_types == 1 || (double)n_alt / n_tot < bca->min_frac || n_alt < bca->min_support) { // then skip free(aux); return -1; } types = (int*)calloc(n_types, sizeof(int)); @@ -189,6 +188,58 @@ int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_calla if (ref[i] == 0) break; right = i; } + /* The following block fixes a long-existing flaw in the INDEL + * calling model: the interference of nearby SNPs. However, it also + * reduces the power because sometimes, substitutions caused by + * indels are not distinguishable from true mutations. Multiple + * sequence realignment helps to increase the power. + */ + { // construct per-sample consensus + int L = right - left + 1, max_i, max2_i; + uint32_t *cns, max, max2; + char *ref0, *r; + ref_sample = calloc(n, sizeof(void*)); + cns = calloc(L, 4); + ref0 = calloc(L, 1); + for (i = 0; i < right - left; ++i) + ref0[i] = bam_nt16_table[(int)ref[i+left]]; + for (s = 0; s < n; ++s) { + r = ref_sample[s] = calloc(L, 1); + memset(cns, 0, sizeof(int) * L); + // collect ref and non-ref counts + for (i = 0; i < n_plp[s]; ++i) { + bam_pileup1_t *p = plp[s] + i; + bam1_t *b = p->b; + uint32_t *cigar = bam1_cigar(b); + uint8_t *seq = bam1_seq(b); + int x = b->core.pos, y = 0; + for (k = 0; k < b->core.n_cigar; ++k) { + int op = cigar[k]&0xf; + int j, l = cigar[k]>>4; + if (op == BAM_CMATCH) { + for (j = 0; j < l; ++j) + if (x + j >= left && x + j < right) + cns[x+j-left] += (bam1_seqi(seq, y+j) == ref0[x+j-left])? 1 : 0x10000; + x += l; y += l; + } else if (op == BAM_CDEL || op == BAM_CREF_SKIP) x += l; + else if (op == BAM_CINS || op == BAM_CSOFT_CLIP) y += l; + } + } + // determine the consensus + for (i = 0; i < right - left; ++i) r[i] = ref0[i]; + max = max2 = 0; max_i = max2_i = -1; + for (i = 0; i < right - left; ++i) { + if (cns[i]>>16 >= max>>16) max2 = max, max2_i = max_i, max = cns[i], max_i = i; + else if (cns[i]>>16 >= max2>>16) max2 = cns[i], max2_i = i; + } + if ((double)(max&0xffff) / ((max&0xffff) + (max>>16)) >= 0.7) max_i = -1; + if ((double)(max2&0xffff) / ((max2&0xffff) + (max2>>16)) >= 0.7) max2_i = -1; + if (max_i >= 0) r[max_i] = 15; + if (max2_i >= 0) r[max2_i] = 15; +// for (i = 0; i < right - left; ++i) fputc("=ACMGRSVTWYHKDBN"[(int)r[i]], stderr); fputc('\n', stderr); + } + free(ref0); free(cns); + } { // the length of the homopolymer run around the current position int c = bam_nt16_table[(int)ref[pos + 1]]; if (c == 15) l_run = 1; @@ -252,27 +303,29 @@ int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_calla else ir = est_indelreg(pos, ref, -types[t], 0); if (ir > bca->indelreg) bca->indelreg = ir; // fprintf(stderr, "%d, %d, %d\n", pos, types[t], ir); - // write ref2 - for (k = 0, j = left; j <= pos; ++j) - ref2[k++] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[j]]]; - if (types[t] <= 0) j += -types[t]; - else for (l = 0; l < types[t]; ++l) - ref2[k++] = inscns[t*max_ins + l]; - if (types[0] < 0) { // mask deleted sequences to avoid a particular error in the model. - int jj, tmp = types[t] >= 0? -types[0] : -types[0] + types[t]; - for (jj = 0; jj < tmp && j < right && ref[j]; ++jj, ++j) - ref2[k++] = 4; - } - for (; j < right && ref[j]; ++j) - ref2[k++] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[j]]]; - for (; k < max_ref2; ++k) ref2[k] = 4; - if (j < right) right = j; - // align each read to ref2 + // realignment for (s = K = 0; s < n; ++s) { + // write ref2 + for (k = 0, j = left; j <= pos; ++j) + ref2[k++] = bam_nt16_nt4_table[(int)ref_sample[s][j-left]]; + if (types[t] <= 0) j += -types[t]; + else for (l = 0; l < types[t]; ++l) + ref2[k++] = inscns[t*max_ins + l]; + for (; j < right && ref[j]; ++j) + ref2[k++] = bam_nt16_nt4_table[(int)ref_sample[s][j-left]]; + for (; k < max_ref2; ++k) ref2[k] = 4; + if (j < right) right = j; + // align each read to ref2 for (i = 0; i < n_plp[s]; ++i, ++K) { bam_pileup1_t *p = plp[s] + i; - int qbeg, qend, tbeg, tend, sc; + int qbeg, qend, tbeg, tend, sc, kk; uint8_t *seq = bam1_seq(p->b); + uint32_t *cigar = bam1_cigar(p->b); + if (p->b->core.flag&4) continue; // unmapped reads + // FIXME: the following loop should be better moved outside; nonetheless, realignment should be much slower anyway. + for (kk = 0; kk < p->b->core.n_cigar; ++kk) + if ((cigar[kk]&BAM_CIGAR_MASK) == BAM_CREF_SKIP) break; + if (kk < p->b->core.n_cigar) continue; // FIXME: the following skips soft clips, but using them may be more sensitive. // determine the start and end of sequences for alignment qbeg = tpos2qpos(&p->b->core, bam1_cigar(p->b), left, 0, &tbeg); @@ -367,9 +420,11 @@ int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_calla indelQ2 = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ2 + .499); // pick the smaller between indelQ1 and indelQ2 indelQ = indelQ1 < indelQ2? indelQ1 : indelQ2; - p->aux = (sc[0]&0x3f)<<16 | seqQ<<8 | indelQ; + if (indelQ > 255) indelQ = 255; + if (seqQ > 255) seqQ = 255; + p->aux = (sc[0]&0x3f)<<16 | seqQ<<8 | indelQ; // use 22 bits in total sumq[sc[0]&0x3f] += indelQ < seqQ? indelQ : seqQ; -// fprintf(stderr, "pos=%d read=%d:%d name=%s call=%d q=%d\n", pos, s, i, bam1_qname(p->b), types[sc[0]&0x3f], indelQ); +// fprintf(stderr, "pos=%d read=%d:%d name=%s call=%d indelQ=%d seqQ=%d\n", pos, s, i, bam1_qname(p->b), types[sc[0]&0x3f], indelQ, seqQ); } } // determine bca->indel_types[] and bca->inscns @@ -407,6 +462,8 @@ int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_calla } free(score1); free(score2); // free + for (i = 0; i < n; ++i) free(ref_sample[i]); + free(ref_sample); free(types); free(inscns); return n_alt > 0? 0 : -1; } diff --git a/bam_aux.c b/bam_aux.c index fbcd982..4fd3190 100644 --- a/bam_aux.c +++ b/bam_aux.c @@ -180,3 +180,10 @@ char *bam_aux2Z(const uint8_t *s) if (type == 'Z' || type == 'H') return (char*)s; else return 0; } + +#ifdef _WIN32 +double drand48() +{ + return (double)rand() / RAND_MAX; +} +#endif diff --git a/bam_index.c b/bam_index.c index 328f011..51c2701 100644 --- a/bam_index.c +++ b/bam_index.c @@ -217,8 +217,15 @@ bam_index_t *bam_index_core(bamFile fp) } merge_chunks(idx); fill_missing(idx); - if (ret >= 0) - while ((ret = bam_read1(fp, b)) >= 0) ++n_no_coor; + if (ret >= 0) { + while ((ret = bam_read1(fp, b)) >= 0) { + ++n_no_coor; + if (c->tid >= 0 && n_no_coor) { + fprintf(stderr, "[bam_index_core] the alignment is not sorted: reads without coordinates prior to reads with coordinates.\n"); + exit(1); + } + } + } if (ret < -1) fprintf(stderr, "[bam_index_core] truncated file? Continue anyway. (%d)\n", ret); free(b->data); free(b); idx->n_no_coor = n_no_coor; diff --git a/bam_maqcns.c b/bam_maqcns.c index 4fbc6c6..9931036 100644 --- a/bam_maqcns.c +++ b/bam_maqcns.c @@ -22,8 +22,6 @@ typedef struct { uint32_t c[4]; } glf_call_aux_t; -char bam_nt16_nt4_table[] = { 4, 0, 1, 4, 2, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4 }; - /* P() = \theta \sum_{i=1}^{N-1} 1/i P(D|) = \sum_{k=1}^{N-1} p_k 1/2 [(k/N)^n_2(1-k/N)^n_1 + (k/N)^n1(1-k/N)^n_2] diff --git a/bam_md.c b/bam_md.c index 44d46a4..20b1913 100644 --- a/bam_md.c +++ b/bam_md.c @@ -9,6 +9,8 @@ #include "kaln.h" #include "kprobaln.h" +char bam_nt16_nt4_table[] = { 4, 0, 1, 4, 2, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4 }; + void bam_fillmd1_core(bam1_t *b, char *ref, int is_equal, int max_nm) { uint8_t *seq = bam1_seq(b); @@ -162,14 +164,14 @@ int bam_cap_mapQ(bam1_t *b, char *ref, int thres) return (int)(t + .499); } -int bam_prob_realn_core(bam1_t *b, const char *ref, int apply_baq) +int bam_prob_realn_core(bam1_t *b, const char *ref, int flag) { - int k, i, bw, x, y, yb, ye, xb, xe; + int k, i, bw, x, y, yb, ye, xb, xe, apply_baq = flag&1, extend_baq = flag>>1&1; uint32_t *cigar = bam1_cigar(b); bam1_core_t *c = &b->core; kpa_par_t conf = kpa_par_def; uint8_t *bq = 0, *zq = 0, *qual = bam1_qual(b); - if (c->flag & BAM_FUNMAP) return -1; // do nothing + if ((c->flag & BAM_FUNMAP) || b->core.l_qseq == 0) return -1; // do nothing // test if BQ or ZQ is present if ((bq = bam_aux_get(b, "BQ")) != 0) ++bq; if ((zq = bam_aux_get(b, "ZQ")) != 0 && *zq == 'Z') ++zq; @@ -221,23 +223,47 @@ int bam_prob_realn_core(bam1_t *b, const char *ref, int apply_baq) s = calloc(c->l_qseq, 1); for (i = 0; i < c->l_qseq; ++i) s[i] = bam_nt16_nt4_table[bam1_seqi(seq, i)]; r = calloc(xe - xb, 1); - for (i = xb; i < xe; ++i) + for (i = xb; i < xe; ++i) { + if (ref[i] == 0) { xe = i; break; } r[i-xb] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[i]]]; + } state = calloc(c->l_qseq, sizeof(int)); q = calloc(c->l_qseq, 1); kpa_glocal(r, xe-xb, s, c->l_qseq, qual, &conf, state, q); - for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) { - int op = cigar[k]&0xf, l = cigar[k]>>4; - if (op == BAM_CMATCH) { - for (i = y; i < y + l; ++i) { - if ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y)) bq[i] = 0; - else bq[i] = bq[i] < q[i]? bq[i] : q[i]; - } - x += l; y += l; - } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l; - else if (op == BAM_CDEL) x += l; + if (!extend_baq) { // in this block, bq[] is capped by base quality qual[] + for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) { + int op = cigar[k]&0xf, l = cigar[k]>>4; + if (op == BAM_CMATCH) { + for (i = y; i < y + l; ++i) { + if ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y)) bq[i] = 0; + else bq[i] = bq[i] < q[i]? bq[i] : q[i]; + } + x += l; y += l; + } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l; + else if (op == BAM_CDEL) x += l; + } + for (i = 0; i < c->l_qseq; ++i) bq[i] = qual[i] - bq[i] + 64; // finalize BQ + } else { // in this block, bq[] is BAQ that can be larger than qual[] (different from the above!) + uint8_t *left, *rght; + left = calloc(c->l_qseq, 1); rght = calloc(c->l_qseq, 1); + for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) { + int op = cigar[k]&0xf, l = cigar[k]>>4; + if (op == BAM_CMATCH) { + for (i = y; i < y + l; ++i) + bq[i] = ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y))? 0 : q[i]; + for (left[y] = bq[y], i = y + 1; i < y + l; ++i) + left[i] = bq[i] > left[i-1]? bq[i] : left[i-1]; + for (rght[y+l-1] = bq[y+l-1], i = y + l - 2; i >= y; --i) + rght[i] = bq[i] > rght[i+1]? bq[i] : rght[i+1]; + for (i = y; i < y + l; ++i) + bq[i] = left[i] < rght[i]? left[i] : rght[i]; + x += l; y += l; + } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l; + else if (op == BAM_CDEL) x += l; + } + for (i = 0; i < c->l_qseq; ++i) bq[i] = 64 + (qual[i] <= bq[i]? 0 : qual[i] - bq[i]); // finalize BQ + free(left); free(rght); } - for (i = 0; i < c->l_qseq; ++i) bq[i] = qual[i] - bq[i] + 64; // finalize BQ if (apply_baq) { for (i = 0; i < c->l_qseq; ++i) qual[i] -= bq[i] - 64; // modify qual bam_aux_append(b, "ZQ", 'Z', c->l_qseq + 1, bq); @@ -254,16 +280,16 @@ int bam_prob_realn(bam1_t *b, const char *ref) int bam_fillmd(int argc, char *argv[]) { - int c, is_equal, tid = -2, ret, len, is_bam_out, is_sam_in, is_uncompressed, max_nm, is_realn, capQ, apply_baq; + int c, is_equal, tid = -2, ret, len, is_bam_out, is_sam_in, is_uncompressed, max_nm, is_realn, capQ, baq_flag; samfile_t *fp, *fpout = 0; faidx_t *fai; char *ref = 0, mode_w[8], mode_r[8]; bam1_t *b; - is_equal = is_bam_out = is_sam_in = is_uncompressed = is_realn = max_nm = apply_baq = capQ = 0; + is_equal = is_bam_out = is_sam_in = is_uncompressed = is_realn = max_nm = capQ = baq_flag = 0; mode_w[0] = mode_r[0] = 0; strcpy(mode_r, "r"); strcpy(mode_w, "w"); - while ((c = getopt(argc, argv, "reubSC:n:A")) >= 0) { + while ((c = getopt(argc, argv, "EreubSC:n:A")) >= 0) { switch (c) { case 'r': is_realn = 1; break; case 'e': is_equal = 1; break; @@ -272,7 +298,8 @@ int bam_fillmd(int argc, char *argv[]) case 'S': is_sam_in = 1; break; case 'n': max_nm = atoi(optarg); break; case 'C': capQ = atoi(optarg); break; - case 'A': apply_baq = 1; break; + case 'A': baq_flag |= 1; break; + case 'E': baq_flag |= 2; break; default: fprintf(stderr, "[bam_fillmd] unrecognized option '-%c'\n", c); return 1; } } @@ -288,7 +315,8 @@ int bam_fillmd(int argc, char *argv[]) fprintf(stderr, " -b compressed BAM output\n"); fprintf(stderr, " -S the input is SAM with header\n"); fprintf(stderr, " -A modify the quality string\n"); - fprintf(stderr, " -r read-independent local realignment\n\n"); + fprintf(stderr, " -r compute the BQ tag (without -A) or cap baseQ by BAQ (with -A)\n"); + fprintf(stderr, " -E extended BAQ for better sensitivity but lower specificity\n\n"); return 1; } fp = samopen(argv[optind], mode_r, 0); @@ -311,7 +339,7 @@ int bam_fillmd(int argc, char *argv[]) fprintf(stderr, "[bam_fillmd] fail to find sequence '%s' in the reference.\n", fp->header->target_name[tid]); } - if (is_realn) bam_prob_realn_core(b, ref, apply_baq); + if (is_realn) bam_prob_realn_core(b, ref, baq_flag); if (capQ > 10) { int q = bam_cap_mapQ(b, ref, capQ); if (b->core.qual > q) b->core.qual = q; diff --git a/bam_plcmd.c b/bam_plcmd.c index 002297a..bba62cb 100644 --- a/bam_plcmd.c +++ b/bam_plcmd.c @@ -533,20 +533,24 @@ int bam_pileup(int argc, char *argv[]) #define MPLP_FMT_DP 0x100 #define MPLP_FMT_SP 0x200 #define MPLP_NO_INDEL 0x400 +#define MPLP_EXT_BAQ 0x800 typedef struct { int max_mq, min_mq, flag, min_baseQ, capQ_thres, max_depth; - int openQ, extQ, tandemQ; + int openQ, extQ, tandemQ, min_support; // for indels + double min_frac; // for indels char *reg, *fn_pos, *pl_list; faidx_t *fai; kh_64_t *hash; + void *rghash; } mplp_conf_t; typedef struct { bamFile fp; bam_iter_t iter; - int min_mq, flag, ref_id, capQ_thres; + int ref_id; char *ref; + const mplp_conf_t *conf; } mplp_aux_t; typedef struct { @@ -566,16 +570,21 @@ static int mplp_func(void *data, bam1_t *b) int has_ref; ret = ma->iter? bam_iter_read(ma->fp, ma->iter, b) : bam_read1(ma->fp, b); if (ret < 0) break; + if (ma->conf->rghash) { // exclude read groups + uint8_t *rg = bam_aux_get(b, "RG"); + skip = (rg && bcf_str2id(ma->conf->rghash, (const char*)(rg+1)) >= 0); + if (skip) continue; + } has_ref = (ma->ref && ma->ref_id == b->core.tid)? 1 : 0; skip = 0; - if (has_ref && (ma->flag&MPLP_REALN)) bam_prob_realn_core(b, ma->ref, 1); - if (has_ref && ma->capQ_thres > 10) { - int q = bam_cap_mapQ(b, ma->ref, ma->capQ_thres); + if (has_ref && (ma->conf->flag&MPLP_REALN)) bam_prob_realn_core(b, ma->ref, (ma->conf->flag & MPLP_EXT_BAQ)? 3 : 1); + if (has_ref && ma->conf->capQ_thres > 10) { + int q = bam_cap_mapQ(b, ma->ref, ma->conf->capQ_thres); if (q < 0) skip = 1; else if (b->core.qual > q) b->core.qual = q; } else if (b->core.flag&BAM_FUNMAP) skip = 1; - else if (b->core.qual < ma->min_mq) skip = 1; - else if ((ma->flag&MPLP_NO_ORPHAN) && (b->core.flag&1) && !(b->core.flag&2)) skip = 1; + else if (b->core.qual < ma->conf->min_mq) skip = 1; + else if ((ma->conf->flag&MPLP_NO_ORPHAN) && (b->core.flag&1) && !(b->core.flag&2)) skip = 1; } while (skip); return ret; } @@ -638,10 +647,8 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn) for (i = 0; i < n; ++i) { bam_header_t *h_tmp; data[i] = calloc(1, sizeof(mplp_aux_t)); - data[i]->min_mq = conf->min_mq; - data[i]->flag = conf->flag; - data[i]->capQ_thres = conf->capQ_thres; data[i]->fp = strcmp(fn[i], "-") == 0? bam_dopen(fileno(stdin), "r") : bam_open(fn[i], "r"); + data[i]->conf = conf; h_tmp = bam_header_read(data[i]->fp); bam_smpl_add(sm, fn[i], h_tmp->text); rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list); @@ -694,7 +701,8 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn) bh->l_smpl = s.l; bh->sname = malloc(s.l); memcpy(bh->sname, s.s, s.l); - bh->l_txt = 0; + bh->txt = malloc(strlen(BAM_VERSION) + 64); + bh->l_txt = 1 + sprintf(bh->txt, "##samtoolsVersion=%s\n", BAM_VERSION); free(s.s); bcf_hdr_sync(bh); bcf_hdr_write(bp, bh); @@ -702,6 +710,8 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn) bcr = calloc(sm->n, sizeof(bcf_callret1_t)); bca->rghash = rghash; bca->openQ = conf->openQ, bca->extQ = conf->extQ, bca->tandemQ = conf->tandemQ; + bca->min_frac = conf->min_frac; + bca->min_support = conf->min_support; } ref_tid = -1; ref = 0; iter = bam_mplp_init(n, mplp_func, (void**)data); @@ -797,7 +807,7 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn) } #define MAX_PATH_LEN 1024 -int read_file_list(const char *file_list,int *n,char **argv[]) +static int read_file_list(const char *file_list,int *n,char **argv[]) { char buf[MAX_PATH_LEN]; int len, nfiles; @@ -850,7 +860,7 @@ int bam_mpileup(int argc, char *argv[]) int c; const char *file_list = NULL; char **fn = NULL; - int nfiles = 0; + int nfiles = 0, use_orphan = 0; mplp_conf_t mplp; memset(&mplp, 0, sizeof(mplp_conf_t)); mplp.max_mq = 60; @@ -858,8 +868,9 @@ int bam_mpileup(int argc, char *argv[]) mplp.capQ_thres = 0; mplp.max_depth = 250; mplp.openQ = 40; mplp.extQ = 20; mplp.tandemQ = 100; + mplp.min_frac = 0.002; mplp.min_support = 1; mplp.flag = MPLP_NO_ORPHAN | MPLP_REALN; - while ((c = getopt(argc, argv, "gf:r:l:M:q:Q:uaORC:BDSd:b:P:o:e:h:I")) >= 0) { + while ((c = getopt(argc, argv, "Agf:r:l:M:q:Q:uaRC:BDSd:b:P:o:e:h:Im:F:EG:")) >= 0) { switch (c) { case 'f': mplp.fai = fai_load(optarg); @@ -872,12 +883,12 @@ int bam_mpileup(int argc, char *argv[]) case 'g': mplp.flag |= MPLP_GLF; break; case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_GLF; break; case 'a': mplp.flag |= MPLP_NO_ORPHAN | MPLP_REALN; break; - case 'B': mplp.flag &= ~MPLP_REALN & ~MPLP_NO_ORPHAN; break; - case 'O': mplp.flag |= MPLP_NO_ORPHAN; break; + case 'B': mplp.flag &= ~MPLP_REALN; break; case 'R': mplp.flag |= MPLP_REALN; break; case 'D': mplp.flag |= MPLP_FMT_DP; break; case 'S': mplp.flag |= MPLP_FMT_SP; break; case 'I': mplp.flag |= MPLP_NO_INDEL; break; + case 'E': mplp.flag |= MPLP_EXT_BAQ; break; case 'C': mplp.capQ_thres = atoi(optarg); break; case 'M': mplp.max_mq = atoi(optarg); break; case 'q': mplp.min_mq = atoi(optarg); break; @@ -886,8 +897,23 @@ int bam_mpileup(int argc, char *argv[]) case 'o': mplp.openQ = atoi(optarg); break; case 'e': mplp.extQ = atoi(optarg); break; case 'h': mplp.tandemQ = atoi(optarg); break; + case 'A': use_orphan = 1; break; + case 'F': mplp.min_frac = atof(optarg); break; + case 'm': mplp.min_support = atoi(optarg); break; + case 'G': { + FILE *fp_rg; + char buf[1024]; + mplp.rghash = bcf_str2id_init(); + if ((fp_rg = fopen(optarg, "r")) == 0) + fprintf(stderr, "(%s) Fail to open file %s. Continue anyway.\n", __func__, optarg); + while (!feof(fp_rg) && fscanf(fp_rg, "%s", buf) > 0) // this is not a good style, but forgive me... + bcf_str2id_add(mplp.rghash, strdup(buf)); + fclose(fp_rg); + } + break; } } + if (use_orphan) mplp.flag &= ~MPLP_NO_ORPHAN; if (argc == 1) { fprintf(stderr, "\n"); fprintf(stderr, "Usage: samtools mpileup [options] in1.bam [in2.bam [...]]\n\n"); @@ -903,8 +929,13 @@ int bam_mpileup(int argc, char *argv[]) fprintf(stderr, " -o INT Phred-scaled gap open sequencing error probability [%d]\n", mplp.openQ); fprintf(stderr, " -e INT Phred-scaled gap extension seq error probability [%d]\n", mplp.extQ); fprintf(stderr, " -h INT coefficient for homopolyer errors [%d]\n", mplp.tandemQ); + fprintf(stderr, " -m INT minimum gapped reads for indel candidates [%d]\n", mplp.min_support); + fprintf(stderr, " -F FLOAT minimum fraction of gapped reads for candidates [%g]\n", mplp.min_frac); + fprintf(stderr, " -G FILE exclude read groups listed in FILE [null]\n"); + fprintf(stderr, " -A use anomalous read pairs in SNP/INDEL calling\n"); fprintf(stderr, " -g generate BCF output\n"); fprintf(stderr, " -u do not compress BCF output\n"); + fprintf(stderr, " -E extended BAQ for higher sensitivity but lower specificity\n"); fprintf(stderr, " -B disable BAQ computation\n"); fprintf(stderr, " -D output per-sample DP\n"); fprintf(stderr, " -S output per-sample SP (strand bias P-value, slow)\n"); @@ -913,15 +944,13 @@ int bam_mpileup(int argc, char *argv[]) fprintf(stderr, "Notes: Assuming diploid individuals.\n\n"); return 1; } - if ( file_list ) - { + if (file_list) { if ( read_file_list(file_list,&nfiles,&fn) ) return 1; mpileup(&mplp,nfiles,fn); for (c=0; cx.fpr, buf, BUF_SIZE)) > 0) + fwrite(buf, 1, len, fp->x.fpw); #else while (!feof(in->file) && (len = fread(buf, 1, BUF_SIZE, in->file)) > 0) + fwrite(buf, 1, len, fp->file); #endif - fwrite(buf, 1, len, fp->x.fpw); free(buf); fp->block_offset = in->block_offset = 0; bgzf_close(fp); diff --git a/bam_sort.c b/bam_sort.c index 01f7016..38f15d6 100644 --- a/bam_sort.c +++ b/bam_sort.c @@ -222,8 +222,11 @@ int bam_merge_core(int by_qname, const char *out, const char *headers, int n, ch ks_heapmake(heap, n, heap); while (heap->pos != HEAP_EMPTY) { bam1_t *b = heap->b; - if ((flag & MERGE_RG) && bam_aux_get(b, "RG") == 0) + if (flag & MERGE_RG) { + uint8_t *rg = bam_aux_get(b, "RG"); + if (rg) bam_aux_del(b, rg); bam_aux_append(b, "RG", 'Z', RG_len[heap->i] + 1, (uint8_t*)RG[heap->i]); + } bam_write1_core(fpout, &b->core, b->data_len, b->data); if ((j = bam_iter_read(fp[heap->i], iter[heap->i], b)) >= 0) { heap->pos = ((uint64_t)b->core.tid<<32) | (uint32_t)b->core.pos<<1 | bam1_strand(b); diff --git a/bam_tview.c b/bam_tview.c index e48afa7..bf01e15 100644 --- a/bam_tview.c +++ b/bam_tview.c @@ -183,12 +183,12 @@ tview_t *tv_init(const char *fn, const char *fn_fa) { tview_t *tv = (tview_t*)calloc(1, sizeof(tview_t)); tv->is_dot = 1; - tv->idx = bam_index_load(fn); - if (tv->idx == 0) exit(1); tv->fp = bam_open(fn, "r"); bgzf_set_cache_size(tv->fp, 8 * 1024 *1024); assert(tv->fp); tv->header = bam_header_read(tv->fp); + tv->idx = bam_index_load(fn); + if (tv->idx == 0) exit(1); tv->lplbuf = bam_lplbuf_init(tv_pl_func, tv); if (fn_fa) tv->fai = fai_load(fn_fa); tv->bmc = bam_maqcns_init(); diff --git a/bamtk.c b/bamtk.c index 79635d6..1d11519 100644 --- a/bamtk.c +++ b/bamtk.c @@ -8,10 +8,6 @@ #include "knetfile.h" #endif -#ifndef PACKAGE_VERSION -#define PACKAGE_VERSION "0.1.12a (r862)" -#endif - int bam_taf2baf(int argc, char *argv[]); int bam_pileup(int argc, char *argv[]); int bam_mpileup(int argc, char *argv[]); @@ -27,6 +23,8 @@ int bam_idxstats(int argc, char *argv[]); int main_samview(int argc, char *argv[]); int main_import(int argc, char *argv[]); int main_reheader(int argc, char *argv[]); +int main_cut_target(int argc, char *argv[]); +int main_phase(int argc, char *argv[]); int faidx_main(int argc, char *argv[]); int glf3_view_main(int argc, char *argv[]); @@ -75,7 +73,7 @@ static int usage() { fprintf(stderr, "\n"); fprintf(stderr, "Program: samtools (Tools for alignments in the SAM format)\n"); - fprintf(stderr, "Version: %s\n\n", PACKAGE_VERSION); + fprintf(stderr, "Version: %s\n\n", BAM_VERSION); fprintf(stderr, "Usage: samtools [options]\n\n"); fprintf(stderr, "Command: view SAM<->BAM conversion\n"); fprintf(stderr, " sort sort alignment file\n"); @@ -94,7 +92,15 @@ static int usage() fprintf(stderr, " merge merge sorted alignments\n"); fprintf(stderr, " rmdup remove PCR duplicates\n"); fprintf(stderr, " reheader replace BAM header\n"); + fprintf(stderr, " targetcut cut fosmid regions (for fosmid pool only)\n"); + fprintf(stderr, " phase phase heterozygotes\n"); fprintf(stderr, "\n"); +#ifdef _WIN32 + fprintf(stderr, "\ +Note: The Windows version of SAMtools is mainly designed for read-only\n\ + operations, such as viewing the alignments and generating the pileup.\n\ + Binary files generated by the Windows version may be buggy.\n\n"); +#endif return 1; } @@ -125,6 +131,8 @@ int main(int argc, char *argv[]) else if (strcmp(argv[1], "calmd") == 0) return bam_fillmd(argc-1, argv+1); else if (strcmp(argv[1], "fillmd") == 0) return bam_fillmd(argc-1, argv+1); else if (strcmp(argv[1], "reheader") == 0) return main_reheader(argc-1, argv+1); + else if (strcmp(argv[1], "targetcut") == 0) return main_cut_target(argc-1, argv+1); + else if (strcmp(argv[1], "phase") == 0) return main_phase(argc-1, argv+1); #if _CURSES_LIB != 0 else if (strcmp(argv[1], "tview") == 0) return bam_tview_main(argc-1, argv+1); #endif diff --git a/bcftools/Makefile b/bcftools/Makefile index 8b890ba..e579a9e 100644 --- a/bcftools/Makefile +++ b/bcftools/Makefile @@ -31,7 +31,7 @@ libbcf.a:$(LOBJS) $(AR) -cru $@ $(LOBJS) bcftools:lib $(AOBJS) - $(CC) $(CFLAGS) -o $@ $(AOBJS) -lm $(LIBPATH) -lz -L. -lbcf + $(CC) $(CFLAGS) -o $@ $(AOBJS) -L. $(LIBPATH) -lbcf -lm -lz bcf.o:bcf.h vcf.o:bcf.h diff --git a/bcftools/bcf-fix.pl b/bcftools/bcf-fix.pl deleted file mode 100755 index 61c6136..0000000 --- a/bcftools/bcf-fix.pl +++ /dev/null @@ -1,101 +0,0 @@ -#!/usr/bin/perl -w - -use strict; -use warnings; -use Carp; - -my $opts = parse_params(); -bcf_fix(); - -exit; - -#-------------------------------- - -sub error -{ - my (@msg) = @_; - if ( scalar @msg ) { confess @msg; } - die - "Usage: bcftools view test.bcf | bcf-fix.pl > test.vcf\n", - "Options:\n", - " -h, -?, --help This help message.\n", - "\n"; -} - - -sub parse_params -{ - my $opts = {}; - while (my $arg=shift(@ARGV)) - { - if ( $arg eq '-?' || $arg eq '-h' || $arg eq '--help' ) { error(); } - error("Unknown parameter \"$arg\". Run -h for help.\n"); - } - return $opts; -} - -sub bcf_fix -{ - while (my $line=) - { - if ( $line=~/^#CHROM/ ) - { - print -qq[##INFO= -##INFO= -##INFO= -##INFO= -##INFO= -##INFO= -##FORMAT= -##FORMAT= -##FORMAT= -]; - print $line; - } - elsif ( $line=~/^#/ ) - { - print $line; - } - else - { - my @items = split(/\t/,$line); - my @tags = split(/:/,$items[8]); # FORMAT tags - - my $nidx=2; - my @idxs; # Mapping which defines new ordering: $idxs[$inew]=$iold; GT comes first, PL second - for (my $i=0; $i<@tags; $i++) - { - if ( $tags[$i] eq 'GT' ) { $idxs[0]=$i; } - elsif ( $tags[$i] eq 'PL' ) { $idxs[1]=$i; } - else { $idxs[$nidx++]=$i; } - } - if ( !exists($tags[0]) or !exists($tags[1]) ) { error("FIXME: expected GT and PL in the format field.\n"); } - - # First fix the FORMAT column - $items[8] = 'GT:GL'; - for (my $i=2; $i<@tags; $i++) - { - $items[8] .= ':'.$tags[$idxs[$i]]; - } - - # Now all the genotype columns - for (my $iitem=9; $iitem<@items; $iitem++) - { - @tags = split(/:/,$items[$iitem]); - $items[$iitem] = $tags[$idxs[0]] .':'; - - # GL=-PL/10 - my ($a,$b,$c) = split(/,/,$tags[$idxs[1]]); - $items[$iitem] .= sprintf "%.2f,%.2f,%.2f",-$a/10.,-$b/10.,-$c/10.; - - for (my $itag=2; $itag<@tags; $itag++) - { - $items[$iitem] .= ':'.$tags[$idxs[$itag]]; - } - } - print join("\t",@items); - } - } -} - diff --git a/bcftools/bcf.c b/bcftools/bcf.c index 6e45695..080b6fe 100644 --- a/bcftools/bcf.c +++ b/bcftools/bcf.c @@ -106,7 +106,7 @@ int bcf_sync(bcf1_t *b) for (p = b->str, n = 0; p < b->str + b->l_str; ++p) if (*p == 0 && p+1 != b->str + b->l_str) tmp[n++] = p + 1; if (n != 5) { - fprintf(stderr, "[%s] incorrect number of fields (%d != 5). Corrupted file?\n", __func__, n); + fprintf(stderr, "[%s] incorrect number of fields (%d != 5) at %d:%d\n", __func__, n, b->tid, b->pos); return -1; } b->ref = tmp[0]; b->alt = tmp[1]; b->flt = tmp[2]; b->info = tmp[3]; b->fmt = tmp[4]; @@ -136,10 +136,10 @@ int bcf_sync(bcf1_t *b) b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2; } else if (b->gi[i].fmt == bcf_str2int("DP", 2) || b->gi[i].fmt == bcf_str2int("HQ", 2)) { b->gi[i].len = 2; - } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("GT", 2) - || b->gi[i].fmt == bcf_str2int("SP", 2)) - { + } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("GT", 2)) { b->gi[i].len = 1; + } else if (b->gi[i].fmt == bcf_str2int("SP", 2)) { + b->gi[i].len = 4; } else if (b->gi[i].fmt == bcf_str2int("GL", 2)) { b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2 * 4; } @@ -240,8 +240,10 @@ void bcf_fmt_core(const bcf_hdr_t *h, bcf1_t *b, kstring_t *s) } } else if (b->gi[i].fmt == bcf_str2int("DP", 2)) { kputw(((uint16_t*)b->gi[i].data)[j], s); - } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) { + } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) { kputw(((uint8_t*)b->gi[i].data)[j], s); + } else if (b->gi[i].fmt == bcf_str2int("SP", 2)) { + kputw(((int32_t*)b->gi[i].data)[j], s); } else if (b->gi[i].fmt == bcf_str2int("GT", 2)) { int y = ((uint8_t*)b->gi[i].data)[j]; if (y>>7&1) { @@ -259,7 +261,7 @@ void bcf_fmt_core(const bcf_hdr_t *h, bcf1_t *b, kstring_t *s) if (k > 0) kputc(',', s); ksprintf(s, "%.2f", d[k]); } - } + } else kputc('.', s); // custom fields } } } diff --git a/bcftools/bcf.h b/bcftools/bcf.h index f87ac1e..f545a91 100644 --- a/bcftools/bcf.h +++ b/bcftools/bcf.h @@ -129,6 +129,8 @@ extern "C" { int vcf_close(bcf_t *bp); // read the VCF/BCF header bcf_hdr_t *vcf_hdr_read(bcf_t *bp); + // read the sequence dictionary from a separate file; required for VCF->BCF conversion + int vcf_dictread(bcf_t *bp, bcf_hdr_t *h, const char *fn); // read a VCF/BCF record; return -1 on end-of-file and <-1 for errors int vcf_read(bcf_t *bp, bcf_hdr_t *h, bcf1_t *b); // write the VCF header @@ -142,10 +144,13 @@ extern "C" { int bcf_gl2pl(bcf1_t *b); // if the site is an indel int bcf_is_indel(const bcf1_t *b); + bcf_hdr_t *bcf_hdr_subsam(const bcf_hdr_t *h0, int n, char *const* samples, int *list); + int bcf_subsam(int n_smpl, int *list, bcf1_t *b); // string hash table void *bcf_build_refhash(bcf_hdr_t *h); void bcf_str2id_destroy(void *_hash); + void bcf_str2id_thorough_destroy(void *_hash); int bcf_str2id_add(void *_hash, const char *str); int bcf_str2id(void *_hash, const char *str); void *bcf_str2id_init(); diff --git a/bcftools/bcf.tex b/bcftools/bcf.tex index 5ca1e28..6f2171f 100644 --- a/bcftools/bcf.tex +++ b/bcftools/bcf.tex @@ -14,19 +14,19 @@ \begin{tabular}{|l|l|l|l|l|} \hline \multicolumn{2}{|c|}{\bf Field} & \multicolumn{1}{c|}{\bf Descrption} & \multicolumn{1}{c|}{\bf Type} & \multicolumn{1}{c|}{\bf Value} \\\hline\hline -\multicolumn{2}{|l|}{\tt magic} & Magic string & {\tt char[4]} & {\tt BCF\char92 4} \\\hline -\multicolumn{2}{|l|}{\tt l\_nm} & Length of concatenated sequence names & {\tt int32\_t} & \\\hline -\multicolumn{2}{|l|}{\tt name} & Concatenated names, {\tt NULL} padded & {\tt char[l\_nm]} & \\\hline -\multicolumn{2}{|l|}{\tt l\_smpl} & Length of concatenated sample names & {\tt int32\_t} & \\\hline -\multicolumn{2}{|l|}{\tt sname} & Concatenated sample names & {\tt char[l\_smpl]} & \\\hline -\multicolumn{2}{|l|}{\tt l\_txt} & Length of the meta text (double-hash lines)& {\tt int32\_t} & \\\hline -\multicolumn{2}{|l|}{\tt text} & Meta text, {\tt NULL} terminated & {\tt char[l\_txt]} & \\\hline +\multicolumn{2}{|l|}{\sf magic} & Magic string & {\tt char[4]} & {\tt BCF\char92 4} \\\hline +\multicolumn{2}{|l|}{\sf l\_seqnm} & Length of concatenated sequence names & {\tt int32\_t} & \\\hline +\multicolumn{2}{|l|}{\sf seqnm} & Concatenated names, {\tt NULL} padded & {\tt char[{\sf l\_seqnm}]} & \\\hline +\multicolumn{2}{|l|}{\sf l\_smpl} & Length of concatenated sample names & {\tt int32\_t} & \\\hline +\multicolumn{2}{|l|}{\sf smpl} & Concatenated sample names & {\tt char[{\sf l\_smpl}]} & \\\hline +\multicolumn{2}{|l|}{\sf l\_meta} & Length of the meta text (double-hash lines)& {\tt int32\_t} & \\\hline +\multicolumn{2}{|l|}{\sf meta} & Meta text, {\tt NULL} terminated & {\tt char[{\sf l\_meta}]} & \\\hline \multicolumn{5}{|c|}{\it \color{gray}{List of records until the end of the file}}\\\cline{2-5} -& {\tt seq\_id} & Reference sequence ID & {\tt int32\_t} & \\\cline{2-5} -& {\tt pos} & Position & {\tt int32\_t} & \\\cline{2-5} -& {\tt qual} & Variant quality & {\tt float} & \\\cline{2-5} -& {\tt l\_str} & Length of str & {\tt int32\_t} & \\\cline{2-5} -& {\tt str} & {\tt ID+REF+ALT+FILTER+INFO+FORMAT}, {\tt NULL} padded & {\tt char[slen]} &\\\cline{2-5} +& {\sf seq\_id} & Reference sequence ID & {\tt int32\_t} & \\\cline{2-5} +& {\sf pos} & Position & {\tt int32\_t} & \\\cline{2-5} +& {\sf qual} & Variant quality & {\tt float} & \\\cline{2-5} +& {\sf l\_str} & Length of {\sf str} & {\tt int32\_t} & \\\cline{2-5} +& {\sf str} & {\tt ID+REF+ALT+FILTER+INFO+FORMAT}, {\tt NULL} padded & {\tt char[{\sf l\_str}]} &\\\cline{2-5} & \multicolumn{4}{c|}{Blocks of data; \#blocks and formats defined by {\tt FORMAT} (table below)}\\ \hline \end{tabular} @@ -37,27 +37,40 @@ \hline \multicolumn{1}{l}{\bf Field} & \multicolumn{1}{l}{\bf Type} & \multicolumn{1}{l}{\bf Description} \\\hline {\tt DP} & {\tt uint16\_t[n]} & Read depth \\ -{\tt GL} & {\tt float[n*x]} & Log10 likelihood of data; $x=\frac{m(m+1)}{2}$, $m=\#\{alleles\}$\\ -{\tt GT} & {\tt uint8\_t[n]} & {\tt phase\char60\char60 6 | allele1\char60\char60 3 | allele2} \\ +{\tt GL} & {\tt float[n*G]} & Log10 likelihood of data; $G=\frac{A(A+1)}{2}$, $A=\#\{alleles\}$\\ +{\tt GT} & {\tt uint8\_t[n]} & {\tt missing\char60\char60 7 | phased\char60\char60 6 | allele1\char60\char60 3 | allele2} \\ +{\tt \_GT} & {\tt uint8\_t+uint8\_t[n*P]} & {Generic GT; the first int equals the max ploidy $P$} \\ {\tt GQ} & {\tt uint8\_t[n]} & {Genotype quality}\\ {\tt HQ} & {\tt uint8\_t[n*2]} & {Haplotype quality}\\ -{\tt PL} & {\tt uint8\_t[n*x]} & {Phred-scaled likelihood of data}\\ -\emph{misc} & {\tt int32\_t+char*} & {\tt NULL} padded concatenated strings (integer equal to the length) \\ +{\tt \_HQ} & {\tt uint8\_t+uint8\_t[n*P]} & {Generic HQ}\\ +{\tt IBD} & {\tt uint32\_t[n*2]} & {IBD}\\ +{\tt \_IBD} & {\tt uint8\_t+uint32\_t[n*P]} & {Generic IBD}\\ +{\tt PL} & {\tt uint8\_t[n*G]} & {Phred-scaled likelihood of data}\\ +{\tt PS} & {\tt uint32\_t[n]} & {Phase set}\\ +%{\tt SP} & {\tt uint8\_t[n]} & {Strand bias P-value (bcftools only)}\\ +\emph{Integer} & {\tt int32\_t[n*X]} & {Fix-sized custom Integer; $X$ defined in the header}\\ +\emph{Numeric} & {\tt double[n*X]} & {Fix-sized custom Numeric}\\ +\emph{String} & {\tt uint32\_t+char*} & {\tt NULL} padded concat. strings (int equals to the length) \\ \hline \end{tabular} \end{center} \begin{itemize} -\item The file is {\tt BGZF} compressed. -\item All integers are little-endian. +\item A BCF file is in the {\tt BGZF} format. +\item All multi-byte numbers are little-endian. \item In a string, a missing value `.' is an empty C string ``{\tt \char92 0}'' (not ``{\tt .\char92 0}'') \item For {\tt GL} and {\tt PL}, likelihoods of genotypes appear in the order of alleles in {\tt REF} and then {\tt ALT}. For example, if {\tt REF=C}, {\tt ALT=T,A}, likelihoods appear in the order of {\tt - CC,CT,CA,TT,TA,AA}. -\item {\tt GL} is an extension to and is backward compatible with the - {\tt GL} genotype field in {\tt VCFv4.0}. + CC,CT,TT,CA,TA,AA} (NB: the ordering is different from the one in the original + BCF proposal). +\item Predefined {\tt FORMAT} fields can be missing from VCF headers, but custom {\tt FORMAT} fields + are required to be explicitly defined in the headers. +\item A {\tt FORMAT} field with its name starting with `{\tt \_}' gives an alternative + binary representation of the corresponding VCF field. The alternative representation + is used when the default representation is unable to keep the genotype information, + for example, when the ploidy is over 2 or there are more than 8 alleles. \end{itemize} -\end{document} \ No newline at end of file +\end{document} diff --git a/bcftools/bcf2qcall.c b/bcftools/bcf2qcall.c index 8634c9e..a86bac2 100644 --- a/bcftools/bcf2qcall.c +++ b/bcftools/bcf2qcall.c @@ -77,8 +77,8 @@ int bcf_2qcall(bcf_hdr_t *h, bcf1_t *b) for (k = j = 0; k < 4; ++k) { for (l = k; l < 4; ++l) { int t, x = map[k], y = map[l]; - if (x > y) t = x, x = y, y = t; - g[j++] = p[x * b->n_alleles - x * (x-1) / 2 + (y - x)]; + if (x > y) t = x, x = y, y = t; // swap + g[j++] = p[y * (y+1) / 2 + x]; } } printf("%s\t%d\t%c", h->ns[b->tid], b->pos+1, *b->ref); diff --git a/bcftools/bcfutils.c b/bcftools/bcfutils.c index 4d6835d..ae7ec0f 100644 --- a/bcftools/bcfutils.c +++ b/bcftools/bcfutils.c @@ -1,3 +1,5 @@ +#include +#include #include "bcf.h" #include "kstring.h" #include "khash.h" @@ -27,6 +29,16 @@ void bcf_str2id_destroy(void *_hash) if (hash) kh_destroy(str2id, hash); // Note that strings are not freed. } +void bcf_str2id_thorough_destroy(void *_hash) +{ + khash_t(str2id) *hash = (khash_t(str2id)*)_hash; + khint_t k; + if (hash == 0) return; + for (k = 0; k < kh_end(hash); ++k) + if (kh_exist(hash, k)) free((char*)kh_key(hash, k)); + kh_destroy(str2id, hash); +} + int bcf_str2id(void *_hash, const char *str) { khash_t(str2id) *hash = (khash_t(str2id)*)_hash; @@ -51,8 +63,9 @@ int bcf_str2id_add(void *_hash, const char *str) int bcf_shrink_alt(bcf1_t *b, int n) { char *p; - int i, j, k, *z, n_smpl = b->n_smpl; + int i, j, k, n_smpl = b->n_smpl; if (b->n_alleles <= n) return -1; + // update ALT if (n > 1) { for (p = b->alt, k = 1; *p; ++p) if (*p == ',' && ++k == n) break; @@ -61,10 +74,7 @@ int bcf_shrink_alt(bcf1_t *b, int n) ++p; memmove(p, b->flt, b->str + b->l_str - b->flt); b->l_str -= b->flt - p; - z = alloca(sizeof(int) / 2 * n * (n+1)); - for (i = k = 0; i < n; ++i) - for (j = 0; j < n - i; ++j) - z[k++] = i * b->n_alleles + j; + // update PL for (i = 0; i < b->n_gi; ++i) { bcf_ginfo_t *g = b->gi + i; if (g->fmt == bcf_str2int("PL", 2)) { @@ -73,7 +83,7 @@ int bcf_shrink_alt(bcf1_t *b, int n) g->len = n * (n + 1) / 2; for (l = k = 0; l < n_smpl; ++l) { uint8_t *dl = d + l * x; - for (j = 0; j < g->len; ++j) d[k++] = dl[z[j]]; + for (j = 0; j < g->len; ++j) d[k++] = dl[j]; } } // FIXME: to add GL } @@ -107,3 +117,114 @@ int bcf_gl2pl(bcf1_t *b) } return 0; } +/* FIXME: this function will fail given AB:GTX:GT. BCFtools never + * produces such FMT, but others may do. */ +int bcf_fix_gt(bcf1_t *b) +{ + char *s; + int i; + uint32_t tmp; + bcf_ginfo_t gt; + // check the presence of the GT FMT + if ((s = strstr(b->fmt, ":GT")) == 0) return 0; // no GT or GT is already the first + if (s[3] != '\0' && s[3] != ':') return 0; // :GTX in fact + tmp = bcf_str2int("GT", 2); + for (i = 0; i < b->n_gi; ++i) + if (b->gi[i].fmt == tmp) break; + if (i == b->n_gi) return 0; // no GT in b->gi; probably a bug... + gt = b->gi[i]; + // move GT to the first + for (; i > 0; --i) b->gi[i] = b->gi[i-1]; + b->gi[0] = gt; + memmove(b->fmt + 3, b->fmt, s + 1 - b->fmt); + b->fmt[0] = 'G'; b->fmt[1] = 'T'; b->fmt[2] = ':'; + return 0; +} + +static void *locate_field(const bcf1_t *b, const char *fmt, int l) +{ + int i; + uint32_t tmp; + tmp = bcf_str2int(fmt, l); + for (i = 0; i < b->n_gi; ++i) + if (b->gi[i].fmt == tmp) break; + return i == b->n_gi? 0 : b->gi[i].data; +} + +int bcf_anno_max(bcf1_t *b) +{ + int k, max_gq, max_sp, n_het; + kstring_t str; + uint8_t *gt, *gq; + int32_t *sp; + max_gq = max_sp = n_het = 0; + gt = locate_field(b, "GT", 2); + if (gt == 0) return -1; + gq = locate_field(b, "GQ", 2); + sp = locate_field(b, "SP", 2); + if (sp) + for (k = 0; k < b->n_smpl; ++k) + if (gt[k]&0x3f) + max_sp = max_sp > (int)sp[k]? max_sp : sp[k]; + if (gq) + for (k = 0; k < b->n_smpl; ++k) + if (gt[k]&0x3f) + max_gq = max_gq > (int)gq[k]? max_gq : gq[k]; + for (k = 0; k < b->n_smpl; ++k) { + int a1, a2; + a1 = gt[k]&7; a2 = gt[k]>>3&7; + if ((!a1 && a2) || (!a2 && a1)) { // a het + if (gq == 0) ++n_het; + else if (gq[k] >= 20) ++n_het; + } + } + if (n_het) max_sp -= (int)(4.343 * log(n_het) + .499); + if (max_sp < 0) max_sp = 0; + memset(&str, 0, sizeof(kstring_t)); + if (*b->info) kputc(';', &str); + ksprintf(&str, "MXSP=%d;MXGQ=%d", max_sp, max_gq); + bcf_append_info(b, str.s, str.l); + free(str.s); + return 0; +} + +bcf_hdr_t *bcf_hdr_subsam(const bcf_hdr_t *h0, int n, char *const* samples, int *list) +{ + int i, ret, j; + khint_t k; + bcf_hdr_t *h; + khash_t(str2id) *hash; + kstring_t s; + s.l = s.m = 0; s.s = 0; + hash = kh_init(str2id); + for (i = 0; i < n; ++i) + k = kh_put(str2id, hash, samples[i], &ret); + for (i = j = 0; i < h0->n_smpl; ++i) { + if (kh_get(str2id, hash, h0->sns[i]) != kh_end(hash)) { + list[j++] = i; + kputs(h0->sns[i], &s); kputc('\0', &s); + } + } + if (j < n) fprintf(stderr, "<%s> %d samples in the list but not in BCF.", __func__, n - j); + kh_destroy(str2id, hash); + h = calloc(1, sizeof(bcf_hdr_t)); + *h = *h0; + h->ns = 0; h->sns = 0; + h->name = malloc(h->l_nm); memcpy(h->name, h0->name, h->l_nm); + h->txt = calloc(1, h->l_txt + 1); memcpy(h->txt, h0->txt, h->l_txt); + h->l_smpl = s.l; h->sname = s.s; + bcf_hdr_sync(h); + return h; +} + +int bcf_subsam(int n_smpl, int *list, bcf1_t *b) // list MUST BE sorted +{ + int i, j; + for (j = 0; j < b->n_gi; ++j) { + bcf_ginfo_t *gi = b->gi + j; + for (i = 0; i < n_smpl; ++i) + memcpy((uint8_t*)gi->data + i * gi->len, (uint8_t*)gi->data + list[i] * gi->len, gi->len); + } + b->n_smpl = n_smpl; + return 0; +} diff --git a/bcftools/call1.c b/bcftools/call1.c index f293a6c..286c014 100644 --- a/bcftools/call1.c +++ b/bcftools/call1.c @@ -26,11 +26,11 @@ KSTREAM_INIT(gzFile, gzread, 16384) #define VC_CALL_GT 2048 #define VC_ADJLD 4096 #define VC_NO_INDEL 8192 -#define VC_FOLDED 16384 +#define VC_ANNO_MAX 16384 typedef struct { - int flag, prior_type, n1; - char *fn_list, *prior_file; + int flag, prior_type, n1, n_sub, *sublist; + char *fn_list, *prior_file, **subsam, *fn_dict; double theta, pref, indel_frac; } viewconf_t; @@ -151,7 +151,7 @@ static int update_bcf1(int n_smpl, bcf1_t *b, const bcf_p1aux_t *pa, const bcf_p { kstring_t s; int is_var = (pr->p_ref < pref); - double p_hwe, r = is_var? pr->p_ref : 1. - pr->p_ref; + double p_hwe, r = is_var? pr->p_ref : pr->p_var, fq; anno16_t a; p_hwe = pr->g[0] >= 0.? test_hwe(pr->g) : 1.0; // only do HWE g[] is calculated @@ -164,20 +164,24 @@ static int update_bcf1(int n_smpl, bcf1_t *b, const bcf_p1aux_t *pa, const bcf_p kputs(b->info, &s); if (b->info[0]) kputc(';', &s); // ksprintf(&s, "AF1=%.4lg;AFE=%.4lg;CI95=%.4lg,%.4lg", 1.-pr->f_em, 1.-pr->f_exp, pr->cil, pr->cih); - ksprintf(&s, "AF1=%.4lg;CI95=%.4lg,%.4lg", 1.-pr->f_em, pr->cil, pr->cih); + ksprintf(&s, "AF1=%.4g;CI95=%.4g,%.4g", 1.-pr->f_em, pr->cil, pr->cih); ksprintf(&s, ";DP4=%d,%d,%d,%d;MQ=%d", a.d[0], a.d[1], a.d[2], a.d[3], a.mq); + fq = pr->p_ref_folded < 0.5? -4.343 * log(pr->p_ref_folded) : 4.343 * log(pr->p_var_folded); + if (fq < -999) fq = -999; + if (fq > 999) fq = 999; + ksprintf(&s, ";FQ=%.3g", fq); if (a.is_tested) { - if (pr->pc[0] >= 0.) ksprintf(&s, ";PC4=%lg,%lg,%lg,%lg", pr->pc[0], pr->pc[1], pr->pc[2], pr->pc[3]); - ksprintf(&s, ";PV4=%.2lg,%.2lg,%.2lg,%.2lg", a.p[0], a.p[1], a.p[2], a.p[3]); + if (pr->pc[0] >= 0.) ksprintf(&s, ";PC4=%g,%g,%g,%g", pr->pc[0], pr->pc[1], pr->pc[2], pr->pc[3]); + ksprintf(&s, ";PV4=%.2g,%.2g,%.2g,%.2g", a.p[0], a.p[1], a.p[2], a.p[3]); } if (pr->g[0] >= 0. && p_hwe <= .2) - ksprintf(&s, ";GC=%.2lf,%.2lf,%.2lf;HWE=%.3lf", pr->g[2], pr->g[1], pr->g[0], p_hwe); + ksprintf(&s, ";GC=%.2f,%.2f,%.2f;HWE=%.3f", pr->g[2], pr->g[1], pr->g[0], p_hwe); kputc('\0', &s); kputs(b->fmt, &s); kputc('\0', &s); free(b->str); b->m_str = s.m; b->l_str = s.l; b->str = s.s; - b->qual = r < 1e-100? 99 : -4.343 * log(r); - if (b->qual > 99) b->qual = 99; + b->qual = r < 1e-100? 999 : -4.343 * log(r); + if (b->qual > 999) b->qual = 999; bcf_sync(b); if (!is_var) bcf_shrink_alt(b, 1); else if (!(flag&VC_KEEPALT)) @@ -197,19 +201,83 @@ static int update_bcf1(int n_smpl, bcf1_t *b, const bcf_p1aux_t *pa, const bcf_p return is_var; } +static char **read_samples(const char *fn, int *_n) +{ + gzFile fp; + kstream_t *ks; + kstring_t s; + int dret, n = 0, max = 0; + char **sam = 0; + *_n = 0; + s.l = s.m = 0; s.s = 0; + fp = gzopen(fn, "r"); + if (fp == 0) return 0; // fail to open file + ks = ks_init(fp); + while (ks_getuntil(ks, 0, &s, &dret) >= 0) { + if (max == n) { + max = max? max<<1 : 4; + sam = realloc(sam, sizeof(void*)*max); + } + sam[n++] = strdup(s.s); + } + ks_destroy(ks); + gzclose(fp); + free(s.s); + *_n = n; + return sam; +} + +static void write_header(bcf_hdr_t *h) +{ + kstring_t str; + str.l = h->l_txt? h->l_txt - 1 : 0; + str.m = str.l + 1; str.s = h->txt; + if (!strstr(str.s, "##INFO=\n", &str); + if (!strstr(str.s, "##INFO=\n", &str); + if (!strstr(str.s, "##INFO=\n", &str); + if (!strstr(str.s, "##INFO=\n", &str); + if (!strstr(str.s, "##INFO=\n", &str); + if (!strstr(str.s, "##INFO=\n", &str); + if (!strstr(str.s, "##INFO=\n", &str); + if (!strstr(str.s, "##INFO=\n", &str); + if (!strstr(str.s, "##INFO=\n", &str); + if (!strstr(str.s, "##FORMAT=\n", &str); + if (!strstr(str.s, "##FORMAT=\n", &str); + if (!strstr(str.s, "##FORMAT=\n", &str); + if (!strstr(str.s, "##FORMAT=\n", &str); + if (!strstr(str.s, "##FORMAT=\n", &str); + h->l_txt = str.l + 1; h->txt = str.s; +} + double bcf_ld_freq(const bcf1_t *b0, const bcf1_t *b1, double f[4]); int bcfview(int argc, char *argv[]) { extern int bcf_2qcall(bcf_hdr_t *h, bcf1_t *b); extern void bcf_p1_indel_prior(bcf_p1aux_t *ma, double x); + extern int bcf_fix_gt(bcf1_t *b); + extern int bcf_anno_max(bcf1_t *b); bcf_t *bp, *bout = 0; bcf1_t *b, *blast; int c; uint64_t n_processed = 0; viewconf_t vc; bcf_p1aux_t *p1 = 0; - bcf_hdr_t *h; + bcf_hdr_t *hin, *hout; int tid, begin, end; char moder[4], modew[4]; khash_t(set64) *hash = 0; @@ -217,11 +285,12 @@ int bcfview(int argc, char *argv[]) tid = begin = end = -1; memset(&vc, 0, sizeof(viewconf_t)); vc.prior_type = vc.n1 = -1; vc.theta = 1e-3; vc.pref = 0.5; vc.indel_frac = -1.; - while ((c = getopt(argc, argv, "fN1:l:cHAGvbSuP:t:p:QgLi:I")) >= 0) { + vc.n_sub = 0; vc.subsam = 0; vc.sublist = 0; + while ((c = getopt(argc, argv, "N1:l:cHAGvbSuP:t:p:QgLi:IMs:D:")) >= 0) { switch (c) { - case 'f': vc.flag |= VC_FOLDED; break; case '1': vc.n1 = atoi(optarg); break; case 'l': vc.fn_list = strdup(optarg); break; + case 'D': vc.fn_dict = strdup(optarg); break; case 'N': vc.flag |= VC_ACGT_ONLY; break; case 'G': vc.flag |= VC_NO_GENO; break; case 'A': vc.flag |= VC_KEEPALT; break; @@ -233,11 +302,13 @@ int bcfview(int argc, char *argv[]) case 'H': vc.flag |= VC_HWE; break; case 'g': vc.flag |= VC_CALL_GT | VC_CALL; break; case 'I': vc.flag |= VC_NO_INDEL; break; + case 'M': vc.flag |= VC_ANNO_MAX; break; case 't': vc.theta = atof(optarg); break; case 'p': vc.pref = atof(optarg); break; case 'i': vc.indel_frac = atof(optarg); break; case 'Q': vc.flag |= VC_QCALL; break; case 'L': vc.flag |= VC_ADJLD; break; + case 's': vc.subsam = read_samples(optarg, &vc.n_sub); break; case 'P': if (strcmp(optarg, "full") == 0) vc.prior_type = MC_PTYPE_FULL; else if (strcmp(optarg, "cond2") == 0) vc.prior_type = MC_PTYPE_COND2; @@ -262,17 +333,22 @@ int bcfview(int argc, char *argv[]) fprintf(stderr, " -Q output the QCALL likelihood format\n"); fprintf(stderr, " -L calculate LD for adjacent sites\n"); fprintf(stderr, " -I skip indels\n"); - fprintf(stderr, " -f reference-free variant calling\n"); + fprintf(stderr, " -D FILE sequence dictionary for VCF->BCF conversion [null]\n"); fprintf(stderr, " -1 INT number of group-1 samples [0]\n"); fprintf(stderr, " -l FILE list of sites to output [all sites]\n"); - fprintf(stderr, " -t FLOAT scaled substitution mutation rate [%.4lg]\n", vc.theta); - fprintf(stderr, " -i FLOAT indel-to-substitution ratio [%.4lg]\n", vc.indel_frac); - fprintf(stderr, " -p FLOAT variant if P(ref|D)BCF conversion please specify the sequence dictionary with -D\n", __func__); + return 1; + } b = calloc(1, sizeof(bcf1_t)); blast = calloc(1, sizeof(bcf1_t)); strcpy(moder, "r"); @@ -281,11 +357,20 @@ int bcfview(int argc, char *argv[]) if (vc.flag & VC_BCFOUT) strcat(modew, "b"); if (vc.flag & VC_UNCOMP) strcat(modew, "u"); bp = vcf_open(argv[optind], moder); - h = vcf_hdr_read(bp); + hin = hout = vcf_hdr_read(bp); + if (vc.fn_dict && (vc.flag & VC_VCFIN)) + vcf_dictread(bp, hin, vc.fn_dict); bout = vcf_open("-", modew); - if (!(vc.flag & VC_QCALL)) vcf_hdr_write(bout, h); + if (!(vc.flag & VC_QCALL)) { + if (vc.n_sub) { + vc.sublist = calloc(vc.n_sub, sizeof(int)); + hout = bcf_hdr_subsam(hin, vc.n_sub, vc.subsam, vc.sublist); + } + if (vc.flag & VC_CALL) write_header(hout); + vcf_hdr_write(bout, hout); + } if (vc.flag & VC_CALL) { - p1 = bcf_p1_init(h->n_smpl); + p1 = bcf_p1_init(hout->n_smpl); if (vc.prior_file) { if (bcf_p1_read_prior(p1, vc.prior_file) < 0) { fprintf(stderr, "[%s] fail to read the prior AFS.\n", __func__); @@ -297,11 +382,10 @@ int bcfview(int argc, char *argv[]) bcf_p1_init_subprior(p1, vc.prior_type, vc.theta); } if (vc.indel_frac > 0.) bcf_p1_indel_prior(p1, vc.indel_frac); // otherwise use the default indel_frac - if (vc.flag & VC_FOLDED) bcf_p1_set_folded(p1); } - if (vc.fn_list) hash = bcf_load_pos(vc.fn_list, h); + if (vc.fn_list) hash = bcf_load_pos(vc.fn_list, hin); if (optind + 1 < argc && !(vc.flag&VC_VCFIN)) { - void *str2id = bcf_build_refhash(h); + void *str2id = bcf_build_refhash(hout); if (bcf_parse_region(str2id, argv[optind+1], &tid, &begin, &end) >= 0) { bcf_idx_t *idx; idx = bcf_idx_load(argv[optind]); @@ -317,8 +401,10 @@ int bcfview(int argc, char *argv[]) } } } - while (vcf_read(bp, h, b) > 0) { - int is_indel = bcf_is_indel(b); + while (vcf_read(bp, hin, b) > 0) { + int is_indel; + if (vc.n_sub) bcf_subsam(vc.n_sub, vc.sublist, b); + is_indel = bcf_is_indel(b); if ((vc.flag & VC_NO_INDEL) && is_indel) continue; if ((vc.flag & VC_ACGT_ONLY) && !is_indel) { int x; @@ -341,7 +427,7 @@ int bcfview(int argc, char *argv[]) } ++n_processed; if (vc.flag & VC_QCALL) { // output QCALL format; STOP here - bcf_2qcall(h, b); + bcf_2qcall(hout, b); continue; } if (vc.flag & (VC_CALL|VC_ADJLD)) bcf_gl2pl(b); @@ -354,7 +440,7 @@ int bcfview(int argc, char *argv[]) bcf_p1_dump_afs(p1); } if (pr.p_ref >= vc.pref && (vc.flag & VC_VARONLY)) continue; - update_bcf1(h->n_smpl, b, p1, &pr, vc.pref, vc.flag); + update_bcf1(hout->n_smpl, b, p1, &pr, vc.pref, vc.flag); } if (vc.flag & VC_ADJLD) { // compute LD double f[4], r2; @@ -362,25 +448,34 @@ int bcfview(int argc, char *argv[]) kstring_t s; s.m = s.l = 0; s.s = 0; if (*b->info) kputc(';', &s); - ksprintf(&s, "NEIR=%.3lf;NEIF=%.3lf,%.3lf", r2, f[0]+f[2], f[0]+f[1]); + ksprintf(&s, "NEIR=%.3f;NEIF=%.3f,%.3f", r2, f[0]+f[2], f[0]+f[1]); bcf_append_info(b, s.s, s.l); free(s.s); } bcf_cpy(blast, b); } + if (vc.flag & VC_ANNO_MAX) bcf_anno_max(b); if (vc.flag & VC_NO_GENO) { // do not output GENO fields b->n_gi = 0; b->fmt[0] = '\0'; - } - vcf_write(bout, h, b); + b->l_str = b->fmt - b->str + 1; + } else bcf_fix_gt(b); + vcf_write(bout, hout, b); } if (vc.prior_file) free(vc.prior_file); if (vc.flag & VC_CALL) bcf_p1_dump_afs(p1); - bcf_hdr_destroy(h); + if (hin != hout) bcf_hdr_destroy(hout); + bcf_hdr_destroy(hin); bcf_destroy(b); bcf_destroy(blast); vcf_close(bp); vcf_close(bout); if (hash) kh_destroy(set64, hash); if (vc.fn_list) free(vc.fn_list); + if (vc.fn_dict) free(vc.fn_dict); + if (vc.n_sub) { + int i; + for (i = 0; i < vc.n_sub; ++i) free(vc.subsam[i]); + free(vc.subsam); free(vc.sublist); + } if (p1) bcf_p1_destroy(p1); return 0; } diff --git a/bcftools/fet.c b/bcftools/fet.c index 845f8c2..5812517 100644 --- a/bcftools/fet.c +++ b/bcftools/fet.c @@ -64,7 +64,8 @@ double kt_fisher_exact(int n11, int n12, int n21, int n22, double *_left, double n1_ = n11 + n12; n_1 = n11 + n21; n = n11 + n12 + n21 + n22; // calculate n1_, n_1 and n max = (n_1 < n1_) ? n_1 : n1_; // max n11, for right tail - min = (n1_ + n_1 - n < 0) ? 0 : (n1_ + n_1 - n < 0); // min n11, for left tail + min = n1_ + n_1 - n; + if (min < 0) min = 0; // min n11, for left tail *two = *_left = *_right = 1.; if (min == max) return 1.; // no need to do test q = hypergeo_acc(n11, n1_, n_1, n, &aux); // the probability of the current table @@ -79,6 +80,7 @@ double kt_fisher_exact(int n11, int n12, int n21, int n22, double *_left, double p = hypergeo_acc(max, 0, 0, 0, &aux); for (right = 0., j = max - 1; p < 0.99999999 * q; --j) // loop until underflow right += p, p = hypergeo_acc(j, 0, 0, 0, &aux); + ++j; if (p < 1.00000001 * q) right += p; else ++j; // two-tail diff --git a/bcftools/ld.c b/bcftools/ld.c index dc84d4b..11474ed 100644 --- a/bcftools/ld.c +++ b/bcftools/ld.c @@ -70,7 +70,7 @@ double bcf_ld_freq(const bcf1_t *b0, const bcf1_t *b1, double f[4]) for (i = 0; i < n_smpl; ++i) { const uint8_t *pi = PL[j] + i * PL_len[j]; double *p = pdg[j] + i * 3; - p[0] = g_q2p[pi[b[j]->n_alleles]]; p[1] = g_q2p[pi[1]]; p[2] = g_q2p[pi[0]]; + p[0] = g_q2p[pi[2]]; p[1] = g_q2p[pi[1]]; p[2] = g_q2p[pi[0]]; } } // iteration diff --git a/bcftools/prob1.c b/bcftools/prob1.c index 8bf968f..193c4a0 100644 --- a/bcftools/prob1.c +++ b/bcftools/prob1.c @@ -32,7 +32,7 @@ unsigned char seq_nt4_table[256] = { }; struct __bcf_p1aux_t { - int n, M, n1, is_indel, is_folded; + int n, M, n1, is_indel; double *q2p, *pdg; // pdg -> P(D|g) double *phi, *phi_indel; double *z, *zswap; // aux for afs @@ -43,13 +43,6 @@ struct __bcf_p1aux_t { int PL_len; }; -static void fold_array(int M, double *x) -{ - int k; - for (k = 0; k < M/2; ++k) - x[k] = x[M-k] = (x[k] + x[M-k]) / 2.; -} - void bcf_p1_indel_prior(bcf_p1aux_t *ma, double x) { int i; @@ -162,15 +155,6 @@ int bcf_p1_set_n1(bcf_p1aux_t *b, int n1) return 0; } -void bcf_p1_set_folded(bcf_p1aux_t *p1a) -{ - if (p1a->n1 < 0) { - p1a->is_folded = 1; - fold_array(p1a->M, p1a->phi); - fold_array(p1a->M, p1a->phi_indel); - } -} - void bcf_p1_destroy(bcf_p1aux_t *ma) { if (ma) { @@ -184,18 +168,16 @@ void bcf_p1_destroy(bcf_p1aux_t *ma) static int cal_pdg(const bcf1_t *b, bcf_p1aux_t *ma) { - int i, j, k; + int i, j; long *p, tmp; p = alloca(b->n_alleles * sizeof(long)); memset(p, 0, sizeof(long) * b->n_alleles); for (j = 0; j < ma->n; ++j) { const uint8_t *pi = ma->PL + j * ma->PL_len; double *pdg = ma->pdg + j * 3; - pdg[0] = ma->q2p[pi[b->n_alleles]]; pdg[1] = ma->q2p[pi[1]]; pdg[2] = ma->q2p[pi[0]]; - for (i = k = 0; i < b->n_alleles; ++i) { - p[i] += (int)pi[k]; - k += b->n_alleles - i; - } + pdg[0] = ma->q2p[pi[2]]; pdg[1] = ma->q2p[pi[1]]; pdg[2] = ma->q2p[pi[0]]; + for (i = 0; i < b->n_alleles; ++i) + p[i] += (int)pi[(i+1)*(i+2)/2-1]; } for (i = 0; i < b->n_alleles; ++i) p[i] = p[i]<<4 | i; for (i = 1; i < b->n_alleles; ++i) // insertion sort @@ -326,19 +308,28 @@ static void contrast(bcf_p1aux_t *ma, double pc[4]) // mc_cal_y() must be called pc[1] = pc[1] == 1.? 99 : (int)(-4.343 * log(1. - pc[1]) + .499); } -static double mc_cal_afs(bcf_p1aux_t *ma) +static double mc_cal_afs(bcf_p1aux_t *ma, double *p_ref_folded, double *p_var_folded) { int k; - long double sum = 0.; + long double sum = 0., sum2; double *phi = ma->is_indel? ma->phi_indel : ma->phi; memset(ma->afs1, 0, sizeof(double) * (ma->M + 1)); mc_cal_y(ma); + // compute AFS for (k = 0, sum = 0.; k <= ma->M; ++k) sum += (long double)phi[k] * ma->z[k]; for (k = 0; k <= ma->M; ++k) { ma->afs1[k] = phi[k] * ma->z[k] / sum; if (isnan(ma->afs1[k]) || isinf(ma->afs1[k])) return -1.; } + // compute folded variant probability + for (k = 0, sum = 0.; k <= ma->M; ++k) + sum += (long double)(phi[k] + phi[ma->M - k]) / 2. * ma->z[k]; + for (k = 1, sum2 = 0.; k < ma->M; ++k) + sum2 += (long double)(phi[k] + phi[ma->M - k]) / 2. * ma->z[k]; + *p_var_folded = sum2 / sum; + *p_ref_folded = (phi[k] + phi[ma->M - k]) / 2. * (ma->z[ma->M] + ma->z[0]) / sum; + // the expected frequency for (k = 0, sum = 0.; k <= ma->M; ++k) { ma->afs[k] += ma->afs1[k]; sum += k * ma->afs1[k]; @@ -372,7 +363,7 @@ long double bcf_p1_cal_g3(bcf_p1aux_t *p1a, double g[3]) return pd; } -int bcf_p1_cal(bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst) +int bcf_p1_cal(const bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst) { int i, k; long double sum = 0.; @@ -388,8 +379,11 @@ int bcf_p1_cal(bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst) if (b->n_alleles < 2) return -1; // FIXME: find a better solution // rst->rank0 = cal_pdg(b, ma); - rst->f_exp = mc_cal_afs(ma); - rst->p_ref = ma->is_folded? ma->afs1[ma->M] + ma->afs1[0] : ma->afs1[ma->M]; + rst->f_exp = mc_cal_afs(ma, &rst->p_ref_folded, &rst->p_var_folded); + rst->p_ref = ma->afs1[ma->M]; + for (k = 0, sum = 0.; k < ma->M; ++k) + sum += ma->afs1[k]; + rst->p_var = (double)sum; // calculate f_flat and f_em for (k = 0, sum = 0.; k <= ma->M; ++k) sum += (long double)ma->z[k]; @@ -428,7 +422,6 @@ int bcf_p1_cal(bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst) void bcf_p1_dump_afs(bcf_p1aux_t *ma) { int k; - if (ma->is_folded) fold_array(ma->M, ma->afs); fprintf(stderr, "[afs]"); for (k = 0; k <= ma->M; ++k) fprintf(stderr, " %d:%.3lf", k, ma->afs[ma->M - k]); diff --git a/bcftools/prob1.h b/bcftools/prob1.h index 3827534..b4f6a30 100644 --- a/bcftools/prob1.h +++ b/bcftools/prob1.h @@ -8,7 +8,7 @@ typedef struct __bcf_p1aux_t bcf_p1aux_t; typedef struct { int rank0; - double f_em, f_exp, f_flat, p_ref; + double f_em, f_exp, f_flat, p_ref_folded, p_ref, p_var_folded, p_var; double cil, cih; double pc[4]; double g[3]; @@ -26,7 +26,7 @@ extern "C" { void bcf_p1_init_prior(bcf_p1aux_t *ma, int type, double theta); void bcf_p1_init_subprior(bcf_p1aux_t *ma, int type, double theta); void bcf_p1_destroy(bcf_p1aux_t *ma); - int bcf_p1_cal(bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst); + int bcf_p1_cal(const bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst); int bcf_p1_call_gt(const bcf_p1aux_t *ma, double f0, int k); void bcf_p1_dump_afs(bcf_p1aux_t *ma); int bcf_p1_read_prior(bcf_p1aux_t *ma, const char *fn); diff --git a/bcftools/vcf.c b/bcftools/vcf.c index 9b661ff..9daa845 100644 --- a/bcftools/vcf.c +++ b/bcftools/vcf.c @@ -72,6 +72,33 @@ bcf_t *vcf_open(const char *fn, const char *mode) return bp; } +int vcf_dictread(bcf_t *bp, bcf_hdr_t *h, const char *fn) +{ + vcf_t *v; + gzFile fp; + kstream_t *ks; + kstring_t s, rn; + int dret; + if (bp == 0) return -1; + if (!bp->is_vcf) return 0; + s.l = s.m = 0; s.s = 0; + rn.m = rn.l = h->l_nm; rn.s = h->name; + v = (vcf_t*)bp->v; + fp = gzopen(fn, "r"); + ks = ks_init(fp); + while (ks_getuntil(ks, 0, &s, &dret) >= 0) { + bcf_str2id_add(v->refhash, strdup(s.s)); + kputs(s.s, &rn); kputc('\0', &rn); + if (dret != '\n') ks_getuntil(ks, '\n', &s, &dret); + } + ks_destroy(ks); + gzclose(fp); + h->l_nm = rn.l; h->name = rn.s; + bcf_hdr_sync(h); + free(s.s); + return 0; +} + int vcf_close(bcf_t *bp) { vcf_t *v; @@ -84,7 +111,7 @@ int vcf_close(bcf_t *bp) } if (v->fpout) fclose(v->fpout); free(v->line.s); - bcf_str2id_destroy(v->refhash); + bcf_str2id_thorough_destroy(v->refhash); free(v); free(bp); return 0; @@ -93,15 +120,14 @@ int vcf_close(bcf_t *bp) int vcf_hdr_write(bcf_t *bp, const bcf_hdr_t *h) { vcf_t *v = (vcf_t*)bp->v; - int i, has_ref = 0, has_ver = 0; + int i, has_ver = 0; if (!bp->is_vcf) return bcf_hdr_write(bp, h); if (h->l_txt > 0) { if (strstr(h->txt, "##fileformat=")) has_ver = 1; - if (has_ver == 0) fprintf(v->fpout, "##fileformat=VCFv4.0\n"); + if (has_ver == 0) fprintf(v->fpout, "##fileformat=VCFv4.1\n"); fwrite(h->txt, 1, h->l_txt - 1, v->fpout); - if (strstr(h->txt, "##SQ=")) has_ref = 1; } - if (has_ver == 0) fprintf(v->fpout, "##fileformat=VCFv4.0\n"); + if (h->l_txt == 0) fprintf(v->fpout, "##fileformat=VCFv4.1\n"); fprintf(v->fpout, "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"); for (i = 0; i < h->n_smpl; ++i) fprintf(v->fpout, "\t%s", h->sns[i]); @@ -138,7 +164,7 @@ int vcf_read(bcf_t *bp, bcf_hdr_t *h, bcf1_t *b) if (k == 0) { // ref int tid = bcf_str2id(v->refhash, p); if (tid < 0) { - tid = bcf_str2id_add(v->refhash, p); + tid = bcf_str2id_add(v->refhash, strdup(p)); kputs(p, &rn); kputc('\0', &rn); sync = 1; } @@ -156,8 +182,10 @@ int vcf_read(bcf_t *bp, bcf_hdr_t *h, bcf1_t *b) for (i = 0; i < b->n_gi; ++i) { if (b->gi[i].fmt == bcf_str2int("GT", 2)) { ((uint8_t*)b->gi[i].data)[k-9] = 1<<7; - } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) { + } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) { ((uint8_t*)b->gi[i].data)[k-9] = 0; + } else if (b->gi[i].fmt == bcf_str2int("SP", 2)) { + ((int32_t*)b->gi[i].data)[k-9] = 0; } else if (b->gi[i].fmt == bcf_str2int("DP", 2)) { ((uint16_t*)b->gi[i].data)[k-9] = 0; } else if (b->gi[i].fmt == bcf_str2int("PL", 2)) { @@ -173,11 +201,15 @@ int vcf_read(bcf_t *bp, bcf_hdr_t *h, bcf1_t *b) for (q = kstrtok(p, ":", &a2), i = 0; q && i < b->n_gi; q = kstrtok(0, 0, &a2), ++i) { if (b->gi[i].fmt == bcf_str2int("GT", 2)) { ((uint8_t*)b->gi[i].data)[k-9] = (q[0] - '0')<<3 | (q[2] - '0') | (q[1] == '/'? 0 : 1) << 6; - } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) { + } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) { double _x = strtod(q, &q); int x = (int)(_x + .499); if (x > 255) x = 255; ((uint8_t*)b->gi[i].data)[k-9] = x; + } else if (b->gi[i].fmt == bcf_str2int("SP", 2)) { + int x = strtol(q, &q, 10); + if (x > 0xffff) x = 0xffff; + ((uint32_t*)b->gi[i].data)[k-9] = x; } else if (b->gi[i].fmt == bcf_str2int("DP", 2)) { int x = strtol(q, &q, 10); if (x > 0xffff) x = 0xffff; @@ -198,7 +230,7 @@ int vcf_read(bcf_t *bp, bcf_hdr_t *h, bcf1_t *b) y = b->n_alleles * (b->n_alleles + 1) / 2; for (j = 0; j < y; ++j) { x = strtod(q, &q); - data[(k-9) * y + j] = x; + data[(k-9) * y + j] = x > 0? -x/10. : x; ++q; } } diff --git a/bcftools/vcfutils.pl b/bcftools/vcfutils.pl index cd86b0f..6ced66a 100755 --- a/bcftools/vcfutils.pl +++ b/bcftools/vcfutils.pl @@ -14,7 +14,7 @@ sub main { my $command = shift(@ARGV); my %func = (subsam=>\&subsam, listsam=>\&listsam, fillac=>\&fillac, qstats=>\&qstats, varFilter=>\&varFilter, hapmap2vcf=>\&hapmap2vcf, ucscsnp2vcf=>\&ucscsnp2vcf, filter4vcf=>\&varFilter, ldstats=>\&ldstats, - gapstats=>\&gapstats, splitchr=>\&splitchr); + gapstats=>\&gapstats, splitchr=>\&splitchr, vcf2fq=>\&vcf2fq); die("Unknown command \"$command\".\n") if (!defined($func{$command})); &{$func{$command}}; } @@ -215,8 +215,8 @@ Note: This command discards indels. Output: QUAL #non-indel #SNPs #transitions # } sub varFilter { - my %opts = (d=>2, D=>10000, a=>2, W=>10, Q=>10, w=>10, p=>undef, 1=>1e-4, 2=>1e-100, 3=>0, 4=>1e-4); - getopts('pd:D:W:Q:w:a:1:2:3:4:', \%opts); + my %opts = (d=>2, D=>10000000, a=>2, W=>10, Q=>10, w=>10, p=>undef, 1=>1e-4, 2=>1e-100, 3=>0, 4=>1e-4, G=>0, S=>1000); + getopts('pd:D:W:Q:w:a:1:2:3:4:G:S:', \%opts); die(qq/ Usage: vcfutils.pl varFilter [options] @@ -246,6 +246,7 @@ Note: Some of the filters rely on annotations generated by SAMtools\/BCFtools. print; next; } next if ($t[4] eq '.'); # skip non-var sites + next if ($t[3] eq 'N'); # skip sites with unknown ref ('N') # check if the site is a SNP my $type = 1; # SNP if (length($t[3]) > 1) { @@ -289,6 +290,7 @@ Note: Some of the filters rely on annotations generated by SAMtools\/BCFtools. $flt = 1 if ($flt == 0 && $mq >= 0 && $mq < $opts{Q}); $flt = 7 if ($flt == 0 && /PV4=([^,]+),([^,]+),([^,]+),([^,;\t]+)/ && ($1<$opts{1} || $2<$opts{2} || $3<$opts{3} || $4<$opts{4})); + $flt = 8 if ($flt == 0 && ((/MXGQ=(\d+)/ && $1 < $opts{G}) || (/MXSP=(\d+)/ && $1 >= $opts{S}))); my $score = $t[5] * 100 + $dp_alt; my $rlen = length($t[3]) - 1; # $indel_score<0 for SNPs @@ -311,7 +313,10 @@ Note: Some of the filters rely on annotations generated by SAMtools\/BCFtools. } else { # SNP or MNP for my $x (@staging) { next if (($x->[0]&3) != 3 || $x->[4] + $x->[2] + $ow < $t[1]); - $flt = 5; + if ($x->[4] + length($x->[7]) - 1 == $t[1] && substr($x->[7], -1, 1) eq substr($t[4], 0, 1) + && length($x->[7]) - length($x->[6]) == 1) { + $x->[1] = 5; + } else { $flt = 5; } last; } # check MNP @@ -338,7 +343,7 @@ sub varFilter_aux { if ($first->[1] == 0) { print join("\t", @$first[3 .. @$first-1]), "\n"; } elsif ($is_print) { - print STDERR join("\t", substr("UQdDaGgPM", $first->[1], 1), @$first[3 .. @$first-1]), "\n"; + print STDERR join("\t", substr("UQdDaGgPMS", $first->[1], 1), @$first[3 .. @$first-1]), "\n"; } } @@ -454,6 +459,87 @@ sub hapmap2vcf { } } +sub vcf2fq { + my %opts = (d=>3, D=>100000, Q=>10, l=>5); + getopts('d:D:Q:l:', \%opts); + die(qq/ +Usage: vcfutils.pl vcf2fq [options] + +Options: -d INT minimum depth [$opts{d}] + -D INT maximum depth [$opts{D}] + -Q INT min RMS mapQ [$opts{Q}] + -l INT INDEL filtering window [$opts{l}] +\n/) if (@ARGV == 0 && -t STDIN); + + my ($last_chr, $seq, $qual, $last_pos, @gaps); + my $_Q = $opts{Q}; + my $_d = $opts{d}; + my $_D = $opts{D}; + + my %het = (AC=>'M', AG=>'R', AT=>'W', CA=>'M', CG=>'S', CT=>'Y', + GA=>'R', GC=>'S', GT=>'K', TA=>'W', TC=>'Y', TG=>'K'); + + $last_chr = ''; + while (<>) { + next if (/^#/); + my @t = split; + if ($last_chr ne $t[0]) { + &v2q_post_process($last_chr, \$seq, \$qual, \@gaps, $opts{l}) if ($last_chr); + ($last_chr, $last_pos) = ($t[0], 0); + $seq = $qual = ''; + @gaps = (); + } + die("[vcf2fq] unsorted input\n") if ($t[1] - $last_pos < 0); + if ($t[1] - $last_pos > 1) { + $seq .= 'n' x ($t[1] - $last_pos - 1); + $qual .= '!' x ($t[1] - $last_pos - 1); + } + if (length($t[3]) == 1 && $t[7] !~ /INDEL/ && $t[4] =~ /^([A-Za-z.])(,[A-Za-z])*$/) { # a SNP or reference + my ($ref, $alt) = ($t[3], $1); + my ($b, $q); + $q = $1 if ($t[7] =~ /FQ=(-?[\d\.]+)/); + if ($q < 0) { + $_ = $1 if ($t[7] =~ /AF1=([\d\.]+)/); + $b = ($_ < .5 || $alt eq '.')? $ref : $alt; + $q = -$q; + } else { + $b = $het{"$ref$alt"}; + $b ||= 'N'; + } + $b = lc($b); + $b = uc($b) if (($t[7] =~ /MQ=(\d+)/ && $1 >= $_Q) && ($t[7] =~ /DP=(\d+)/ && $1 >= $_d && $1 <= $_D)); + $q = int($q + 33 + .499); + $q = chr($q <= 126? $q : 126); + $seq .= $b; + $qual .= $q; + } elsif ($t[4] ne '.') { # an INDEL + push(@gaps, [$t[1], length($t[3])]); + } + $last_pos = $t[1]; + } + &v2q_post_process($last_chr, \$seq, \$qual, \@gaps, $opts{l}); +} + +sub v2q_post_process { + my ($chr, $seq, $qual, $gaps, $l) = @_; + for my $g (@$gaps) { + my $beg = $g->[0] > $l? $g->[0] - $l : 0; + my $end = $g->[0] + $g->[1] + $l; + $end = length($$seq) if ($end > length($$seq)); + substr($$seq, $beg, $end - $beg) = lc(substr($$seq, $beg, $end - $beg)); + } + print "\@$chr\n"; &v2q_print_str($seq); + print "+\n"; &v2q_print_str($qual); +} + +sub v2q_print_str { + my ($s) = @_; + my $l = length($$s); + for (my $i = 0; $i < $l; $i += 60) { + print substr($$s, $i, 60), "\n"; + } +} + sub usage { die(qq/ Usage: vcfutils.pl []\n @@ -461,8 +547,14 @@ Command: subsam get a subset of samples listsam list the samples fillac fill the allele count field qstats SNP stats stratified by QUAL - varFilter filtering short variants + hapmap2vcf convert the hapmap format to VCF ucscsnp2vcf convert UCSC SNP SQL dump to VCF + + varFilter filtering short variants (*) + vcf2fq VCF->fastq (**) + +Notes: Commands with description endting with (*) may need bcftools + specific annotations. \n/); } diff --git a/cut_target.c b/cut_target.c new file mode 100644 index 0000000..26f434f --- /dev/null +++ b/cut_target.c @@ -0,0 +1,193 @@ +#include +#include +#include +#include "bam.h" +#include "errmod.h" +#include "faidx.h" + +#define ERR_DEP 0.83f + +typedef struct { + int e[2][3], p[2][2]; +} score_param_t; + +/* Note that although the two matrics have 10 parameters in total, only 4 + * (probably 3) are free. Changing the scoring matrices in a sort of symmetric + * way will not change the result. */ +static score_param_t g_param = { {{0,0,0},{-4,1,6}}, {{0,-14000}, {0,0}} }; + +typedef struct { + int min_baseQ, tid, max_bases; + uint16_t *bases; + bamFile fp; + bam_header_t *h; + char *ref; + faidx_t *fai; + errmod_t *em; +} ct_t; + +static uint16_t gencns(ct_t *g, int n, const bam_pileup1_t *plp) +{ + int i, j, ret, tmp, k, sum[4], qual; + float q[16]; + if (n > g->max_bases) { // enlarge g->bases + g->max_bases = n; + kroundup32(g->max_bases); + g->bases = realloc(g->bases, g->max_bases * 2); + } + for (i = k = 0; i < n; ++i) { + const bam_pileup1_t *p = plp + i; + uint8_t *seq; + int q, baseQ, b; + if (p->is_refskip || p->is_del) continue; + baseQ = bam1_qual(p->b)[p->qpos]; + if (baseQ < g->min_baseQ) continue; + seq = bam1_seq(p->b); + b = bam_nt16_nt4_table[bam1_seqi(seq, p->qpos)]; + if (b > 3) continue; + q = baseQ < p->b->core.qual? baseQ : p->b->core.qual; + if (q < 4) q = 4; + if (q > 63) q = 63; + g->bases[k++] = q<<5 | bam1_strand(p->b)<<4 | b; + } + if (k == 0) return 0; + errmod_cal(g->em, k, 4, g->bases, q); + for (i = 0; i < 4; ++i) sum[i] = (int)(q[i<<2|i] + .499) << 2 | i; + for (i = 1; i < 4; ++i) // insertion sort + for (j = i; j > 0 && sum[j] < sum[j-1]; --j) + tmp = sum[j], sum[j] = sum[j-1], sum[j-1] = tmp; + qual = (sum[1]>>2) - (sum[0]>>2); + k = k < 256? k : 255; + ret = (qual < 63? qual : 63) << 2 | (sum[0]&3); + return ret<<8|k; +} + +static void process_cns(bam_header_t *h, int tid, int l, uint16_t *cns) +{ + int i, f[2][2], *prev, *curr, *swap_tmp, s; + uint8_t *b; // backtrack array + b = calloc(l, 1); + f[0][0] = f[0][1] = 0; + prev = f[0]; curr = f[1]; + // fill the backtrack matrix + for (i = 0; i < l; ++i) { + int c = (cns[i] == 0)? 0 : (cns[i]>>8 == 0)? 1 : 2; + int tmp0, tmp1; + // compute f[0] + tmp0 = prev[0] + g_param.e[0][c] + g_param.p[0][0]; // (s[i+1],s[i])=(0,0) + tmp1 = prev[1] + g_param.e[0][c] + g_param.p[1][0]; // (0,1) + if (tmp0 > tmp1) curr[0] = tmp0, b[i] = 0; + else curr[0] = tmp1, b[i] = 1; + // compute f[1] + tmp0 = prev[0] + g_param.e[1][c] + g_param.p[0][1]; // (s[i+1],s[i])=(1,0) + tmp1 = prev[1] + g_param.e[1][c] + g_param.p[1][1]; // (1,1) + if (tmp0 > tmp1) curr[1] = tmp0, b[i] |= 0<<1; + else curr[1] = tmp1, b[i] |= 1<<1; + // swap + swap_tmp = prev; prev = curr; curr = swap_tmp; + } + // backtrack + s = prev[0] > prev[1]? 0 : 1; + for (i = l - 1; i > 0; --i) { + b[i] |= s<<2; + s = b[i]>>s&1; + } + // print + for (i = 0, s = -1; i <= l; ++i) { + if (i == l || ((b[i]>>2&3) == 0 && s >= 0)) { + if (s >= 0) { + int j; + printf("%s:%d-%d\t0\t%s\t%d\t60\t%dM\t*\t0\t0\t", h->target_name[tid], s+1, i, h->target_name[tid], s+1, i-s); + for (j = s; j < i; ++j) { + int c = cns[j]>>8; + if (c == 0) putchar('N'); + else putchar("ACGT"[c&3]); + } + putchar('\t'); + for (j = s; j < i; ++j) + putchar(33 + (cns[j]>>8>>2)); + putchar('\n'); + } + //if (s >= 0) printf("%s\t%d\t%d\t%d\n", h->target_name[tid], s, i, i - s); + s = -1; + } else if ((b[i]>>2&3) && s < 0) s = i; + } + free(b); +} + +static int read_aln(void *data, bam1_t *b) +{ + extern int bam_prob_realn_core(bam1_t *b, const char *ref, int flag); + ct_t *g = (ct_t*)data; + int ret, len; + ret = bam_read1(g->fp, b); + if (ret >= 0 && g->fai && b->core.tid >= 0 && (b->core.flag&4) == 0) { + if (b->core.tid != g->tid) { // then load the sequence + free(g->ref); + g->ref = fai_fetch(g->fai, g->h->target_name[b->core.tid], &len); + g->tid = b->core.tid; + } + bam_prob_realn_core(b, g->ref, 1<<1|1); + } + return ret; +} + +int main_cut_target(int argc, char *argv[]) +{ + int c, tid, pos, n, lasttid = -1, lastpos = -1, l, max_l; + const bam_pileup1_t *p; + bam_plp_t plp; + uint16_t *cns; + ct_t g; + + memset(&g, 0, sizeof(ct_t)); + g.min_baseQ = 13; g.tid = -1; + while ((c = getopt(argc, argv, "f:Q:i:o:0:1:2:")) >= 0) { + switch (c) { + case 'Q': g.min_baseQ = atoi(optarg); break; // quality cutoff + case 'i': g_param.p[0][1] = -atoi(optarg); break; // 0->1 transition (in) PENALTY + case '0': g_param.e[1][0] = atoi(optarg); break; // emission SCORE + case '1': g_param.e[1][1] = atoi(optarg); break; + case '2': g_param.e[1][2] = atoi(optarg); break; + case 'f': g.fai = fai_load(optarg); + if (g.fai == 0) fprintf(stderr, "[%s] fail to load the fasta index.\n", __func__); + break; + } + } + if (argc == optind) { + fprintf(stderr, "Usage: samtools targetcut [-Q minQ] [-i inPen] [-0 em0] [-1 em1] [-2 em2] [-f ref] \n"); + return 1; + } + l = max_l = 0; cns = 0; + g.fp = strcmp(argv[optind], "-")? bam_open(argv[optind], "r") : bam_dopen(fileno(stdin), "r"); + g.h = bam_header_read(g.fp); + g.em = errmod_init(1 - ERR_DEP); + plp = bam_plp_init(read_aln, &g); + while ((p = bam_plp_auto(plp, &tid, &pos, &n)) != 0) { + if (tid < 0) break; + if (tid != lasttid) { // change of chromosome + if (cns) process_cns(g.h, lasttid, l, cns); + if (max_l < g.h->target_len[tid]) { + max_l = g.h->target_len[tid]; + kroundup32(max_l); + cns = realloc(cns, max_l * 2); + } + l = g.h->target_len[tid]; + memset(cns, 0, max_l * 2); + lasttid = tid; + } + cns[pos] = gencns(&g, n, p); + lastpos = pos; + } + process_cns(g.h, lasttid, l, cns); + free(cns); + bam_header_destroy(g.h); + bam_plp_destroy(plp); + bam_close(g.fp); + if (g.fai) { + fai_destroy(g.fai); free(g.ref); + } + errmod_destroy(g.em); + free(g.bases); + return 0; +} diff --git a/errmod.h b/errmod.h index e3e9a90..32c07b6 100644 --- a/errmod.h +++ b/errmod.h @@ -12,6 +12,13 @@ typedef struct { errmod_t *errmod_init(float depcorr); void errmod_destroy(errmod_t *em); + +/* + n: number of bases + m: maximum base + bases[i]: qual:6, strand:1, base:4 + q[i*m+j]: phred-scaled likelihood of (i,j) + */ int errmod_cal(const errmod_t *em, int n, int m, uint16_t *bases, float *q); #endif diff --git a/examples/Makefile b/examples/Makefile index ec976ae..309399f 100644 --- a/examples/Makefile +++ b/examples/Makefile @@ -40,11 +40,11 @@ ex1.bcf:ex1.bam ex1.fa.fai (cd ..; make libbam.a) calDepth:../libbam.a calDepth.c - gcc -g -Wall -O2 -I.. calDepth.c -o $@ -lm -lz -L.. -lbam + gcc -g -Wall -O2 -I.. calDepth.c -o $@ -L.. -lbam -lm -lz clean: rm -fr *.bam *.bai *.glf* *.fai *.pileup* *~ calDepth *.dSYM ex1*.rg ex1.bcf # ../samtools pileup ex1.bam|perl -ape '$_=$F[4];s/(\d+)(??{".{$1}"})|\^.//g;@_=(tr/A-Z//,tr/a-z//);$_=join("\t",@F[0,1],@_)."\n"' -# ../samtools pileup -cf ex1.fa ex1.bam|perl -ape '$_=$F[8];s/\^.//g;s/(\d+)(??{".{$1}"})|\^.//g;@_=(tr/A-Za-z//,tr/,.//);$_=join("\t",@F[0,1],@_)."\n"' \ No newline at end of file +# ../samtools pileup -cf ex1.fa ex1.bam|perl -ape '$_=$F[8];s/\^.//g;s/(\d+)(??{".{$1}"})|\^.//g;@_=(tr/A-Za-z//,tr/,.//);$_=join("\t",@F[0,1],@_)."\n"' diff --git a/faidx.c b/faidx.c index dbd8b3e..38292a1 100644 --- a/faidx.c +++ b/faidx.c @@ -2,6 +2,7 @@ #include #include #include +#include #include "faidx.h" #include "khash.h" diff --git a/khash.h b/khash.h index 1d583ef..a7e8056 100644 --- a/khash.h +++ b/khash.h @@ -1,6 +1,6 @@ /* The MIT License - Copyright (c) 2008 Genome Research Ltd (GRL). + Copyright (c) 2008, 2009, 2011 by Attractive Chaos Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the @@ -23,8 +23,6 @@ SOFTWARE. */ -/* Contact: Heng Li */ - /* An example: @@ -49,6 +47,14 @@ int main() { */ /* + 2011-02-14 (0.2.5): + + * Allow to declare global functions. + + 2009-09-26 (0.2.4): + + * Improve portability + 2008-09-19 (0.2.3): * Corrected the example @@ -88,17 +94,35 @@ int main() { @copyright Heng Li */ -#define AC_VERSION_KHASH_H "0.2.2" +#define AC_VERSION_KHASH_H "0.2.5" -#include #include #include +#include + +/* compipler specific configuration */ + +#if UINT_MAX == 0xffffffffu +typedef unsigned int khint32_t; +#elif ULONG_MAX == 0xffffffffu +typedef unsigned long khint32_t; +#endif + +#if ULONG_MAX == ULLONG_MAX +typedef unsigned long khint64_t; +#else +typedef unsigned long long khint64_t; +#endif + +#ifdef _MSC_VER +#define inline __inline +#endif -typedef uint32_t khint_t; +typedef khint32_t khint_t; typedef khint_t khiter_t; #define __ac_HASH_PRIME_SIZE 32 -static const uint32_t __ac_prime_list[__ac_HASH_PRIME_SIZE] = +static const khint32_t __ac_prime_list[__ac_HASH_PRIME_SIZE] = { 0ul, 3ul, 11ul, 23ul, 53ul, 97ul, 193ul, 389ul, 769ul, 1543ul, @@ -119,17 +143,32 @@ static const uint32_t __ac_prime_list[__ac_HASH_PRIME_SIZE] = static const double __ac_HASH_UPPER = 0.77; -#define KHASH_INIT(name, khkey_t, khval_t, kh_is_map, __hash_func, __hash_equal) \ +#define KHASH_DECLARE(name, khkey_t, khval_t) \ + typedef struct { \ + khint_t n_buckets, size, n_occupied, upper_bound; \ + khint32_t *flags; \ + khkey_t *keys; \ + khval_t *vals; \ + } kh_##name##_t; \ + extern kh_##name##_t *kh_init_##name(); \ + extern void kh_destroy_##name(kh_##name##_t *h); \ + extern void kh_clear_##name(kh_##name##_t *h); \ + extern khint_t kh_get_##name(const kh_##name##_t *h, khkey_t key); \ + extern void kh_resize_##name(kh_##name##_t *h, khint_t new_n_buckets); \ + extern khint_t kh_put_##name(kh_##name##_t *h, khkey_t key, int *ret); \ + extern void kh_del_##name(kh_##name##_t *h, khint_t x); + +#define KHASH_INIT2(name, SCOPE, khkey_t, khval_t, kh_is_map, __hash_func, __hash_equal) \ typedef struct { \ khint_t n_buckets, size, n_occupied, upper_bound; \ - uint32_t *flags; \ + khint32_t *flags; \ khkey_t *keys; \ khval_t *vals; \ } kh_##name##_t; \ - static inline kh_##name##_t *kh_init_##name() { \ + SCOPE kh_##name##_t *kh_init_##name() { \ return (kh_##name##_t*)calloc(1, sizeof(kh_##name##_t)); \ } \ - static inline void kh_destroy_##name(kh_##name##_t *h) \ + SCOPE void kh_destroy_##name(kh_##name##_t *h) \ { \ if (h) { \ free(h->keys); free(h->flags); \ @@ -137,14 +176,14 @@ static const double __ac_HASH_UPPER = 0.77; free(h); \ } \ } \ - static inline void kh_clear_##name(kh_##name##_t *h) \ + SCOPE void kh_clear_##name(kh_##name##_t *h) \ { \ if (h && h->flags) { \ - memset(h->flags, 0xaa, ((h->n_buckets>>4) + 1) * sizeof(uint32_t)); \ + memset(h->flags, 0xaa, ((h->n_buckets>>4) + 1) * sizeof(khint32_t)); \ h->size = h->n_occupied = 0; \ } \ } \ - static inline khint_t kh_get_##name(const kh_##name##_t *h, khkey_t key) \ + SCOPE khint_t kh_get_##name(const kh_##name##_t *h, khkey_t key) \ { \ if (h->n_buckets) { \ khint_t inc, k, i, last; \ @@ -158,9 +197,9 @@ static const double __ac_HASH_UPPER = 0.77; return __ac_iseither(h->flags, i)? h->n_buckets : i; \ } else return 0; \ } \ - static inline void kh_resize_##name(kh_##name##_t *h, khint_t new_n_buckets) \ + SCOPE void kh_resize_##name(kh_##name##_t *h, khint_t new_n_buckets) \ { \ - uint32_t *new_flags = 0; \ + khint32_t *new_flags = 0; \ khint_t j = 1; \ { \ khint_t t = __ac_HASH_PRIME_SIZE - 1; \ @@ -168,8 +207,8 @@ static const double __ac_HASH_UPPER = 0.77; new_n_buckets = __ac_prime_list[t+1]; \ if (h->size >= (khint_t)(new_n_buckets * __ac_HASH_UPPER + 0.5)) j = 0; \ else { \ - new_flags = (uint32_t*)malloc(((new_n_buckets>>4) + 1) * sizeof(uint32_t)); \ - memset(new_flags, 0xaa, ((new_n_buckets>>4) + 1) * sizeof(uint32_t)); \ + new_flags = (khint32_t*)malloc(((new_n_buckets>>4) + 1) * sizeof(khint32_t)); \ + memset(new_flags, 0xaa, ((new_n_buckets>>4) + 1) * sizeof(khint32_t)); \ if (h->n_buckets < new_n_buckets) { \ h->keys = (khkey_t*)realloc(h->keys, new_n_buckets * sizeof(khkey_t)); \ if (kh_is_map) \ @@ -218,7 +257,7 @@ static const double __ac_HASH_UPPER = 0.77; h->upper_bound = (khint_t)(h->n_buckets * __ac_HASH_UPPER + 0.5); \ } \ } \ - static inline khint_t kh_put_##name(kh_##name##_t *h, khkey_t key, int *ret) \ + SCOPE khint_t kh_put_##name(kh_##name##_t *h, khkey_t key, int *ret) \ { \ khint_t x; \ if (h->n_occupied >= h->upper_bound) { \ @@ -256,7 +295,7 @@ static const double __ac_HASH_UPPER = 0.77; } else *ret = 0; \ return x; \ } \ - static inline void kh_del_##name(kh_##name##_t *h, khint_t x) \ + SCOPE void kh_del_##name(kh_##name##_t *h, khint_t x) \ { \ if (x != h->n_buckets && !__ac_iseither(h->flags, x)) { \ __ac_set_isdel_true(h->flags, x); \ @@ -264,24 +303,27 @@ static const double __ac_HASH_UPPER = 0.77; } \ } +#define KHASH_INIT(name, khkey_t, khval_t, kh_is_map, __hash_func, __hash_equal) \ + KHASH_INIT2(name, static inline, khkey_t, khval_t, kh_is_map, __hash_func, __hash_equal) + /* --- BEGIN OF HASH FUNCTIONS --- */ /*! @function @abstract Integer hash function - @param key The integer [uint32_t] + @param key The integer [khint32_t] @return The hash value [khint_t] */ -#define kh_int_hash_func(key) (uint32_t)(key) +#define kh_int_hash_func(key) (khint32_t)(key) /*! @function @abstract Integer comparison function */ #define kh_int_hash_equal(a, b) ((a) == (b)) /*! @function @abstract 64-bit integer hash function - @param key The integer [uint64_t] + @param key The integer [khint64_t] @return The hash value [khint_t] */ -#define kh_int64_hash_func(key) (uint32_t)((key)>>33^(key)^(key)<<11) +#define kh_int64_hash_func(key) (khint32_t)((key)>>33^(key)^(key)<<11) /*! @function @abstract 64-bit integer comparison function */ @@ -442,7 +484,7 @@ static inline khint_t __ac_X31_hash_string(const char *s) @param name Name of the hash table [symbol] */ #define KHASH_SET_INIT_INT(name) \ - KHASH_INIT(name, uint32_t, char, 0, kh_int_hash_func, kh_int_hash_equal) + KHASH_INIT(name, khint32_t, char, 0, kh_int_hash_func, kh_int_hash_equal) /*! @function @abstract Instantiate a hash map containing integer keys @@ -450,14 +492,14 @@ static inline khint_t __ac_X31_hash_string(const char *s) @param khval_t Type of values [type] */ #define KHASH_MAP_INIT_INT(name, khval_t) \ - KHASH_INIT(name, uint32_t, khval_t, 1, kh_int_hash_func, kh_int_hash_equal) + KHASH_INIT(name, khint32_t, khval_t, 1, kh_int_hash_func, kh_int_hash_equal) /*! @function @abstract Instantiate a hash map containing 64-bit integer keys @param name Name of the hash table [symbol] */ #define KHASH_SET_INIT_INT64(name) \ - KHASH_INIT(name, uint64_t, char, 0, kh_int64_hash_func, kh_int64_hash_equal) + KHASH_INIT(name, khint64_t, char, 0, kh_int64_hash_func, kh_int64_hash_equal) /*! @function @abstract Instantiate a hash map containing 64-bit integer keys @@ -465,7 +507,7 @@ static inline khint_t __ac_X31_hash_string(const char *s) @param khval_t Type of values [type] */ #define KHASH_MAP_INIT_INT64(name, khval_t) \ - KHASH_INIT(name, uint64_t, khval_t, 1, kh_int64_hash_func, kh_int64_hash_equal) + KHASH_INIT(name, khint64_t, khval_t, 1, kh_int64_hash_func, kh_int64_hash_equal) typedef const char *kh_cstr_t; /*! @function diff --git a/knetfile.c b/knetfile.c index 1e2c042..af09146 100644 --- a/knetfile.c +++ b/knetfile.c @@ -1,6 +1,7 @@ /* The MIT License - Copyright (c) 2008 Genome Research Ltd (GRL). + Copyright (c) 2008 by Genome Research Ltd (GRL). + 2010 by Attractive Chaos Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the @@ -23,11 +24,9 @@ SOFTWARE. */ -/* Contact: Heng Li */ - /* Probably I will not do socket programming in the next few years and therefore I decide to heavily annotate this file, for Linux and - Windows as well. -lh3 */ + Windows as well. -ac */ #include #include @@ -90,7 +89,7 @@ static int socket_connect(const char *host, const char *port) int on = 1, fd; struct linger lng = { 0, 0 }; - struct addrinfo hints, *res; + struct addrinfo hints, *res = 0; memset(&hints, 0, sizeof(struct addrinfo)); hints.ai_family = AF_UNSPEC; hints.ai_socktype = SOCK_STREAM; diff --git a/kstring.c b/kstring.c index 43d524c..b2a0dab 100644 --- a/kstring.c +++ b/kstring.c @@ -29,16 +29,24 @@ char *kstrtok(const char *str, const char *sep, ks_tokaux_t *aux) const char *p, *start; if (sep) { // set up the table if (str == 0 && (aux->tab[0]&1)) return 0; // no need to set up if we have finished - aux->tab[0] = aux->tab[1] = aux->tab[2] = aux->tab[3] = 0; - for (p = sep; *p; ++p) - aux->tab[*p/64] |= 1ull<<(*p%64); + aux->finished = 0; + if (sep[1]) { + aux->sep = -1; + aux->tab[0] = aux->tab[1] = aux->tab[2] = aux->tab[3] = 0; + for (p = sep; *p; ++p) aux->tab[*p>>6] |= 1ull<<(*p&0x3f); + } else aux->sep = sep[0]; + } + if (aux->finished) return 0; + else if (str) aux->p = str - 1, aux->finished = 0; + if (aux->sep < 0) { + for (p = start = aux->p + 1; *p; ++p) + if (aux->tab[*p>>6]>>(*p&0x3f)&1) break; + } else { + for (p = start = aux->p + 1; *p; ++p) + if (*p == aux->sep) break; } - if (str) aux->p = str - 1, aux->tab[0] &= ~1ull; - else if (aux->tab[0]&1) return 0; - for (p = start = aux->p + 1; *p; ++p) - if (aux->tab[*p/64]>>(*p%64)&1) break; aux->p = p; // end of token - if (*p == 0) aux->tab[0] |= 1; // no more tokens + if (*p == 0) aux->finished = 1; // no more tokens return (char*)start; } diff --git a/kstring.h b/kstring.h index c46a62b..ec5775b 100644 --- a/kstring.h +++ b/kstring.h @@ -19,6 +19,7 @@ typedef struct __kstring_t { typedef struct { uint64_t tab[4]; + int sep, finished; const char *p; // end of the current token } ks_tokaux_t; diff --git a/phase.c b/phase.c new file mode 100644 index 0000000..ef4eff9 --- /dev/null +++ b/phase.c @@ -0,0 +1,687 @@ +#include +#include +#include +#include +#include +#include +#include "bam.h" +#include "errmod.h" + +#include "kseq.h" +KSTREAM_INIT(gzFile, gzread, 16384) + +#define MAX_VARS 256 +#define FLIP_PENALTY 2 +#define FLIP_THRES 4 +#define MASK_THRES 3 + +#define FLAG_FIX_CHIMERA 0x1 +#define FLAG_LIST_EXCL 0x4 +#define FLAG_DROP_AMBI 0x8 + +typedef struct { + // configurations, initialized in the main function + int flag, k, min_baseQ, min_varLOD, max_depth; + // other global variables + int vpos_shift; + bamFile fp; + char *pre; + bamFile out[3]; + // alignment queue + int n, m; + bam1_t **b; +} phaseg_t; + +typedef struct { + int8_t seq[MAX_VARS]; // TODO: change to dynamic memory allocation! + int vpos, beg, end; + uint32_t vlen:16, single:1, flip:1, phase:1, phased:1, ambig:1; + uint32_t in:16, out:16; // in-phase and out-phase +} frag_t, *frag_p; + +#define rseq_lt(a,b) ((a)->vpos < (b)->vpos) + +#include "khash.h" +KHASH_SET_INIT_INT64(set64) +KHASH_MAP_INIT_INT64(64, frag_t) + +typedef khash_t(64) nseq_t; + +#include "ksort.h" +KSORT_INIT(rseq, frag_p, rseq_lt) + +static char nt16_nt4_table[] = { 4, 0, 1, 4, 2, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4 }; + +static inline uint64_t X31_hash_string(const char *s) +{ + uint64_t h = *s; + if (h) for (++s ; *s; ++s) h = (h << 5) - h + *s; + return h; +} + +static void count1(int l, const uint8_t *seq, int *cnt) +{ + int i, j, n_ambi; + uint32_t z, x; + if (seq[l-1] == 0) return; // do nothing is the last base is ambiguous + for (i = n_ambi = 0; i < l; ++i) // collect ambiguous bases + if (seq[i] == 0) ++n_ambi; + if (l - n_ambi <= 1) return; // only one SNP + for (x = 0; x < 1u<>j&1; + ++j; + } + z = z<<1 | c; + } + ++cnt[z]; + } +} + +static int **count_all(int l, int vpos, nseq_t *hash) +{ + khint_t k; + int i, j, **cnt; + uint8_t *seq; + seq = calloc(l, 1); + cnt = calloc(vpos, sizeof(void*)); + for (i = 0; i < vpos; ++i) cnt[i] = calloc(1<vpos >= vpos || f->single) continue; // out of region; or singleton + if (f->vlen == 1) { // such reads should be flagged as deleted previously if everything is right + f->single = 1; + continue; + } + for (j = 1; j < f->vlen; ++j) { + for (i = 0; i < l; ++i) + seq[i] = j < l - 1 - i? 0 : f->seq[j - (l - 1 - i)]; + count1(l, seq, cnt[f->vpos + j]); + } + } + } + free(seq); + return cnt; +} + +// phasing +static int8_t *dynaprog(int l, int vpos, int **w) +{ + int *f[2], *curr, *prev, max, i; + int8_t **b, *h = 0; + uint32_t x, z = 1u<<(l-1), mask = (1u<>1; y1 = xc>>1; + c0 = prev[y0] + wi[x] + wi[xc]; + c1 = prev[y1] + wi[x] + wi[xc]; + if (c0 > c1) bi[x] = 0, curr[x] = c0; + else bi[x] = 1, curr[x] = c1; + } + tmp = prev; prev = curr; curr = tmp; // swap + } + { // backtrack + uint32_t max_x = 0; + int which = 0; + h = calloc(vpos, 1); + for (x = 0, max = 0, max_x = 0; x < z; ++x) + if (prev[x] > max) max = prev[x], max_x = x; + for (i = vpos - 1, x = max_x; i >= 0; --i) { + h[i] = which? (~x&1) : (x&1); + which = b[i][x]? !which : which; + x = b[i][x]? (~x&mask)>>1 : x>>1; + } + } + // free + for (i = 0; i < vpos; ++i) free(b[i]); + free(f[0]); free(f[1]); free(b); + return h; +} + +// phase each fragment +static uint64_t *fragphase(int vpos, const int8_t *path, nseq_t *hash, int flip) +{ + khint_t k; + uint64_t *pcnt; + uint32_t *left, *rght, max; + left = rght = 0; max = 0; + pcnt = calloc(vpos, 8); + for (k = 0; k < kh_end(hash); ++k) { + if (kh_exist(hash, k)) { + int i, c[2]; + frag_t *f = &kh_val(hash, k); + if (f->vpos >= vpos) continue; + // get the phase + c[0] = c[1] = 0; + for (i = 0; i < f->vlen; ++i) { + if (f->seq[i] == 0) continue; + ++c[f->seq[i] == path[f->vpos + i] + 1? 0 : 1]; + } + f->phase = c[0] > c[1]? 0 : 1; + f->in = c[f->phase]; f->out = c[1 - f->phase]; + f->phased = f->in == f->out? 0 : 1; + f->ambig = (f->in && f->out && f->out < 3 && f->in <= f->out + 1)? 1 : 0; + // fix chimera + f->flip = 0; + if (flip && c[0] >= 3 && c[1] >= 3) { + int sum[2], m, mi, md; + if (f->vlen > max) { // enlarge the array + max = f->vlen; + kroundup32(max); + left = realloc(left, max * 4); + rght = realloc(rght, max * 4); + } + for (i = 0, sum[0] = sum[1] = 0; i < f->vlen; ++i) { // get left counts + if (f->seq[i]) { + int c = f->phase? 2 - f->seq[i] : f->seq[i] - 1; + ++sum[c == path[f->vpos + i]? 0 : 1]; + } + left[i] = sum[1]<<16 | sum[0]; + } + for (i = f->vlen - 1, sum[0] = sum[1] = 0; i >= 0; --i) { // get right counts + if (f->seq[i]) { + int c = f->phase? 2 - f->seq[i] : f->seq[i] - 1; + ++sum[c == path[f->vpos + i]? 0 : 1]; + } + rght[i] = sum[1]<<16 | sum[0]; + } + // find the best flip point + for (i = m = 0, mi = -1, md = -1; i < f->vlen - 1; ++i) { + int a[2]; + a[0] = (left[i]&0xffff) + (rght[i+1]>>16&0xffff) - (rght[i+1]&0xffff) * FLIP_PENALTY; + a[1] = (left[i]>>16&0xffff) + (rght[i+1]&0xffff) - (rght[i+1]>>16&0xffff) * FLIP_PENALTY; + if (a[0] > a[1]) { + if (a[0] > m) m = a[0], md = 0, mi = i; + } else { + if (a[1] > m) m = a[1], md = 1, mi = i; + } + } + if (m - c[0] >= FLIP_THRES && m - c[1] >= FLIP_THRES) { // then flip + f->flip = 1; + if (md == 0) { // flip the tail + for (i = mi + 1; i < f->vlen; ++i) + if (f->seq[i] == 1) f->seq[i] = 2; + else if (f->seq[i] == 2) f->seq[i] = 1; + } else { // flip the head + for (i = 0; i <= mi; ++i) + if (f->seq[i] == 1) f->seq[i] = 2; + else if (f->seq[i] == 2) f->seq[i] = 1; + } + } + } + // update pcnt[] + if (!f->single) { + for (i = 0; i < f->vlen; ++i) { + int c; + if (f->seq[i] == 0) continue; + c = f->phase? 2 - f->seq[i] : f->seq[i] - 1; + if (c == path[f->vpos + i]) { + if (f->phase == 0) ++pcnt[f->vpos + i]; + else pcnt[f->vpos + i] += 1ull<<32; + } else { + if (f->phase == 0) pcnt[f->vpos + i] += 1<<16; + else pcnt[f->vpos + i] += 1ull<<48; + } + } + } + } + } + free(left); free(rght); + return pcnt; +} + +static uint64_t *genmask(int vpos, const uint64_t *pcnt, int *_n) +{ + int i, max = 0, max_i = -1, m = 0, n = 0, beg = 0, score = 0; + uint64_t *list = 0; + for (i = 0; i < vpos; ++i) { + uint64_t x = pcnt[i]; + int c[4], pre = score, s; + c[0] = x&0xffff; c[1] = x>>16&0xffff; c[2] = x>>32&0xffff; c[3] = x>>48&0xffff; + s = (c[1] + c[3] == 0)? -(c[0] + c[2]) : (c[1] + c[3] - 1); + if (c[3] > c[2]) s += c[3] - c[2]; + if (c[1] > c[0]) s += c[1] - c[0]; + score += s; + if (score < 0) score = 0; + if (pre == 0 && score > 0) beg = i; // change from zero to non-zero + if ((i == vpos - 1 || score == 0) && max >= MASK_THRES) { + if (n == m) { + m = m? m<<1 : 4; + list = realloc(list, m * 8); + } + list[n++] = (uint64_t)beg<<32 | max_i; + i = max_i; // reset i to max_i + score = 0; + } else if (score > max) max = score, max_i = i; + if (score == 0) max = 0; + } + *_n = n; + return list; +} + +// trim heading and tailing ambiguous bases; mark deleted and remove sequence +static int clean_seqs(int vpos, nseq_t *hash) +{ + khint_t k; + int ret = 0; + for (k = 0; k < kh_end(hash); ++k) { + if (kh_exist(hash, k)) { + frag_t *f = &kh_val(hash, k); + int beg, end, i; + if (f->vpos >= vpos) { + ret = 1; + continue; + } + for (i = 0; i < f->vlen; ++i) + if (f->seq[i] != 0) break; + beg = i; + for (i = f->vlen - 1; i >= 0; --i) + if (f->seq[i] != 0) break; + end = i + 1; + if (end - beg <= 0) kh_del(64, hash, k); + else { + if (beg != 0) memmove(f->seq, f->seq + beg, end - beg); + f->vpos += beg; f->vlen = end - beg; + f->single = f->vlen == 1? 1 : 0; + } + } + } + return ret; +} + +static void dump_aln(phaseg_t *g, int min_pos, const nseq_t *hash) +{ + int i, is_flip, drop_ambi; + drop_ambi = g->flag & FLAG_DROP_AMBI; + is_flip = (drand48() < 0.5); + for (i = 0; i < g->n; ++i) { + int end, which; + uint64_t key; + khint_t k; + bam1_t *b = g->b[i]; + key = X31_hash_string(bam1_qname(b)); + end = bam_calend(&b->core, bam1_cigar(b)); + if (end > min_pos) break; + k = kh_get(64, hash, key); + if (k == kh_end(hash)) which = 3; + else { + frag_t *f = &kh_val(hash, k); + if (f->ambig) which = drop_ambi? 2 : 3; + else if (f->phased && f->flip) which = 2; + else if (f->phased == 0) which = 3; + else { // phased and not flipped + char c = 'Y'; + which = f->phase; + bam_aux_append(b, "ZP", 'A', 1, (uint8_t*)&c); + } + if (which < 2 && is_flip) which = 1 - which; // increase the randomness + } + if (which == 3) which = (drand48() < 0.5); + bam_write1(g->out[which], b); + bam_destroy1(b); + g->b[i] = 0; + } + memmove(g->b, g->b + i, (g->n - i) * sizeof(void*)); + g->n -= i; +} + +static int phase(phaseg_t *g, const char *chr, int vpos, uint64_t *cns, nseq_t *hash) +{ + int i, j, n_seqs = kh_size(hash), n_masked = 0, min_pos; + khint_t k; + frag_t **seqs; + int8_t *path, *sitemask; + uint64_t *pcnt, *regmask; + + if (vpos == 0) return 0; + i = clean_seqs(vpos, hash); // i is true if hash has an element with its vpos >= vpos + min_pos = i? cns[vpos]>>32 : 0x7fffffff; + if (vpos == 1) { + printf("PS\t%s\t%d\t%d\n", chr, (int)(cns[0]>>32) + 1, (int)(cns[0]>>32) + 1); + printf("M0\t%s\t%d\t%d\t%c\t%c\t%d\t0\t0\t0\t0\n//\n", chr, (int)(cns[0]>>32) + 1, (int)(cns[0]>>32) + 1, + "ACGTX"[cns[0]&3], "ACGTX"[cns[0]>>16&3], g->vpos_shift + 1); + for (k = 0; k < kh_end(hash); ++k) { + if (kh_exist(hash, k)) { + frag_t *f = &kh_val(hash, k); + if (f->vpos) continue; + f->flip = 0; + if (f->seq[0] == 0) f->phased = 0; + else f->phased = 1, f->phase = f->seq[0] - 1; + } + } + dump_aln(g, min_pos, hash); + ++g->vpos_shift; + return 1; + } + { // phase + int **cnt; + uint64_t *mask; + printf("PS\t%s\t%d\t%d\n", chr, (int)(cns[0]>>32) + 1, (int)(cns[vpos-1]>>32) + 1); + sitemask = calloc(vpos, 1); + cnt = count_all(g->k, vpos, hash); + path = dynaprog(g->k, vpos, cnt); + for (i = 0; i < vpos; ++i) free(cnt[i]); + free(cnt); + pcnt = fragphase(vpos, path, hash, 0); // do not fix chimeras when masking + mask = genmask(vpos, pcnt, &n_masked); + regmask = calloc(n_masked, 8); + for (i = 0; i < n_masked; ++i) { + regmask[i] = cns[mask[i]>>32]>>32<<32 | cns[(uint32_t)mask[i]]>>32; + for (j = mask[i]>>32; j <= (int32_t)mask[i]; ++j) + sitemask[j] = 1; + } + free(mask); + if (g->flag & FLAG_FIX_CHIMERA) { + free(pcnt); + pcnt = fragphase(vpos, path, hash, 1); + } + } + for (i = 0; i < n_masked; ++i) + printf("FL\t%s\t%d\t%d\n", chr, (int)(regmask[i]>>32) + 1, (int)regmask[i] + 1); + for (i = 0; i < vpos; ++i) { + uint64_t x = pcnt[i]; + int8_t c[2]; + c[0] = (cns[i]&0xffff)>>2 == 0? 4 : (cns[i]&3); + c[1] = (cns[i]>>16&0xffff)>>2 == 0? 4 : (cns[i]>>16&3); + printf("M%d\t%s\t%d\t%d\t%c\t%c\t%d\t%d\t%d\t%d\t%d\n", sitemask[i]+1, chr, (int)(cns[0]>>32) + 1, (int)(cns[i]>>32) + 1, "ACGTX"[c[path[i]]], "ACGTX"[c[1-path[i]]], + i + g->vpos_shift + 1, (int)(x&0xffff), (int)(x>>16&0xffff), (int)(x>>32&0xffff), (int)(x>>48&0xffff)); + } + free(path); free(pcnt); free(regmask); free(sitemask); + seqs = calloc(n_seqs, sizeof(void*)); + for (k = 0, i = 0; k < kh_end(hash); ++k) + if (kh_exist(hash, k) && kh_val(hash, k).vpos < vpos && !kh_val(hash, k).single) + seqs[i++] = &kh_val(hash, k); + n_seqs = i; + ks_introsort_rseq(n_seqs, seqs); + for (i = 0; i < n_seqs; ++i) { + frag_t *f = seqs[i]; + printf("EV\t0\t%s\t%d\t40\t%dM\t*\t0\t0\t", chr, f->vpos + 1 + g->vpos_shift, f->vlen); + for (j = 0; j < f->vlen; ++j) { + uint32_t c = cns[f->vpos + j]; + if (f->seq[j] == 0) putchar('N'); + else putchar("ACGT"[f->seq[j] == 1? (c&3) : (c>>16&3)]); + } + printf("\t*\tYP:i:%d\tYF:i:%d\tYI:i:%d\tYO:i:%d\tYS:i:%d\n", f->phase, f->flip, f->in, f->out, f->beg+1); + } + free(seqs); + printf("//\n"); + fflush(stdout); + g->vpos_shift += vpos; + dump_aln(g, min_pos, hash); + return vpos; +} + +static void update_vpos(int vpos, nseq_t *hash) +{ + khint_t k; + for (k = 0; k < kh_end(hash); ++k) { + if (kh_exist(hash, k)) { + frag_t *f = &kh_val(hash, k); + if (f->vpos < vpos) kh_del(64, hash, k); // TODO: if frag_t::seq is allocated dynamically, free it + else f->vpos -= vpos; + } + } +} + +static nseq_t *shrink_hash(nseq_t *hash) // TODO: to implement +{ + return hash; +} + +static int readaln(void *data, bam1_t *b) +{ + phaseg_t *g = (phaseg_t*)data; + int ret; + ret = bam_read1(g->fp, b); + if (ret < 0) return ret; + if (!(b->core.flag & (BAM_FUNMAP|BAM_FSECONDARY|BAM_FQCFAIL|BAM_FDUP)) && g->pre) { + if (g->n == g->m) { + g->m = g->m? g->m<<1 : 16; + g->b = realloc(g->b, g->m * sizeof(void*)); + } + g->b[g->n++] = bam_dup1(b); + } + return ret; +} + +static khash_t(set64) *loadpos(const char *fn, bam_header_t *h) +{ + gzFile fp; + kstream_t *ks; + int ret, dret; + kstring_t *str; + khash_t(set64) *hash; + + hash = kh_init(set64); + str = calloc(1, sizeof(kstring_t)); + fp = strcmp(fn, "-")? gzopen(fn, "r") : gzdopen(fileno(stdin), "r"); + ks = ks_init(fp); + while (ks_getuntil(ks, 0, str, &dret) >= 0) { + int tid = bam_get_tid(h, str->s); + if (tid >= 0 && dret != '\n') { + if (ks_getuntil(ks, 0, str, &dret) >= 0) { + uint64_t x = (uint64_t)tid<<32 | (atoi(str->s) - 1); + kh_put(set64, hash, x, &ret); + } else break; + } + if (dret != '\n') while ((dret = ks_getc(ks)) > 0 && dret != '\n'); + if (dret < 0) break; + } + ks_destroy(ks); + gzclose(fp); + free(str->s); free(str); + return hash; +} + +static int gl2cns(float q[16]) +{ + int i, j, min_ij; + float min, min2; + min = min2 = 1e30; min_ij = -1; + for (i = 0; i < 4; ++i) { + for (j = i; j < 4; ++j) { + if (q[i<<2|j] < min) min_ij = i<<2|j, min2 = min, min = q[i<<2|j]; + else if (q[i<<2|j] < min2) min2 = q[i<<2|j]; + } + } + return (min_ij>>2&3) == (min_ij&3)? 0 : 1<<18 | (min_ij>>2&3)<<16 | (min_ij&3) | (int)(min2 - min + .499) << 2; +} + +int main_phase(int argc, char *argv[]) +{ + extern void bam_init_header_hash(bam_header_t *header); + int c, tid, pos, vpos = 0, n, lasttid = -1, max_vpos = 0; + const bam_pileup1_t *plp; + bam_plp_t iter; + bam_header_t *h; + nseq_t *seqs; + uint64_t *cns = 0; + phaseg_t g; + char *fn_list = 0; + khash_t(set64) *set = 0; + errmod_t *em; + uint16_t *bases; + + memset(&g, 0, sizeof(phaseg_t)); + g.flag = FLAG_FIX_CHIMERA; + g.min_varLOD = 37; g.k = 13; g.min_baseQ = 13; g.max_depth = 256; + while ((c = getopt(argc, argv, "Q:eFq:k:b:l:D:A:")) >= 0) { + switch (c) { + case 'D': g.max_depth = atoi(optarg); break; + case 'q': g.min_varLOD = atoi(optarg); break; + case 'Q': g.min_baseQ = atoi(optarg); break; + case 'k': g.k = atoi(optarg); break; + case 'F': g.flag &= ~FLAG_FIX_CHIMERA; break; + case 'e': g.flag |= FLAG_LIST_EXCL; break; + case 'A': g.flag |= FLAG_DROP_AMBI; break; + case 'b': g.pre = strdup(optarg); break; + case 'l': fn_list = strdup(optarg); break; + } + } + if (argc == optind) { + fprintf(stderr, "\n"); + fprintf(stderr, "Usage: samtools phase [options] \n\n"); + fprintf(stderr, "Options: -k INT block length [%d]\n", g.k); + fprintf(stderr, " -b STR prefix of BAMs to output [null]\n"); + fprintf(stderr, " -q INT min het phred-LOD [%d]\n", g.min_varLOD); + fprintf(stderr, " -Q INT min base quality in het calling [%d]\n", g.min_baseQ); + fprintf(stderr, " -D INT max read depth [%d]\n", g.max_depth); +// fprintf(stderr, " -l FILE list of sites to phase [null]\n"); + fprintf(stderr, " -F do not attempt to fix chimeras\n"); + fprintf(stderr, " -A drop reads with ambiguous phase\n"); +// fprintf(stderr, " -e do not discover SNPs (effective with -l)\n"); + fprintf(stderr, "\n"); + return 1; + } + g.fp = strcmp(argv[optind], "-")? bam_open(argv[optind], "r") : bam_dopen(fileno(stdin), "r"); + h = bam_header_read(g.fp); + if (fn_list) { // read the list of sites to phase + bam_init_header_hash(h); + set = loadpos(fn_list, h); + free(fn_list); + } else g.flag &= ~FLAG_LIST_EXCL; + if (g.pre) { // open BAMs to write + char *s = malloc(strlen(g.pre) + 20); + strcpy(s, g.pre); strcat(s, ".0.bam"); g.out[0] = bam_open(s, "w"); + strcpy(s, g.pre); strcat(s, ".1.bam"); g.out[1] = bam_open(s, "w"); + strcpy(s, g.pre); strcat(s, ".chimera.bam"); g.out[2] = bam_open(s, "w"); + for (c = 0; c <= 2; ++c) bam_header_write(g.out[c], h); + free(s); + } + + iter = bam_plp_init(readaln, &g); + g.vpos_shift = 0; + seqs = kh_init(64); + em = errmod_init(1. - 0.83); + bases = calloc(g.max_depth, 2); + printf("CC\n"); + printf("CC\tDescriptions:\nCC\n"); + printf("CC\t CC comments\n"); + printf("CC\t PS start of a phase set\n"); + printf("CC\t FL filtered region\n"); + printf("CC\t M[012] markers; 0 for singletons, 1 for phased and 2 for filtered\n"); + printf("CC\t EV supporting reads; SAM format\n"); + printf("CC\t // end of a phase set\nCC\n"); + printf("CC\tFormats of PS, FL and M[012] lines (1-based coordinates):\nCC\n"); + printf("CC\t PS chr phaseSetStart phaseSetEnd\n"); + printf("CC\t FL chr filterStart filterEnd\n"); + printf("CC\t M? chr PS pos allele0 allele1 hetIndex #supports0 #errors0 #supp1 #err1\n"); + printf("CC\nCC\n"); + fflush(stdout); + while ((plp = bam_plp_auto(iter, &tid, &pos, &n)) != 0) { + int i, k, c, tmp, dophase = 1, in_set = 0; + float q[16]; + if (tid < 0) break; + if (tid != lasttid) { // change of chromosome + g.vpos_shift = 0; + if (lasttid >= 0) { + seqs = shrink_hash(seqs); + phase(&g, h->target_name[lasttid], vpos, cns, seqs); + update_vpos(0x7fffffff, seqs); + } + lasttid = tid; + vpos = 0; + } + if (set && kh_get(set64, set, (uint64_t)tid<<32 | pos) != kh_end(set)) in_set = 1; + if (n > g.max_depth) continue; // do not proceed if the depth is too high + // fill the bases array and check if there is a variant + for (i = k = 0; i < n; ++i) { + const bam_pileup1_t *p = plp + i; + uint8_t *seq; + int q, baseQ, b; + if (p->is_del || p->is_refskip) continue; + baseQ = bam1_qual(p->b)[p->qpos]; + if (baseQ < g.min_baseQ) continue; + seq = bam1_seq(p->b); + b = bam_nt16_nt4_table[bam1_seqi(seq, p->qpos)]; + if (b > 3) continue; + q = baseQ < p->b->core.qual? baseQ : p->b->core.qual; + if (q < 4) q = 4; + if (q > 63) q = 63; + bases[k++] = q<<5 | (int)bam1_strand(p->b)<<4 | b; + } + if (k == 0) continue; + errmod_cal(em, k, 4, bases, q); // compute genotype likelihood + c = gl2cns(q); // get the consensus + // tell if to proceed + if (set && (g.flag&FLAG_LIST_EXCL) && !in_set) continue; // not in the list + if (!in_set && (c&0xffff)>>2 < g.min_varLOD) continue; // not a variant + // add the variant + if (vpos == max_vpos) { + max_vpos = max_vpos? max_vpos<<1 : 128; + cns = realloc(cns, max_vpos * 8); + } + cns[vpos] = (uint64_t)pos<<32 | c; + for (i = 0; i < n; ++i) { + const bam_pileup1_t *p = plp + i; + uint64_t key; + khint_t k; + uint8_t *seq = bam1_seq(p->b); + frag_t *f; + if (p->is_del || p->is_refskip) continue; + if (p->b->core.qual == 0) continue; + // get the base code + c = nt16_nt4_table[(int)bam1_seqi(seq, p->qpos)]; + if (c == (cns[vpos]&3)) c = 1; + else if (c == (cns[vpos]>>16&3)) c = 2; + else c = 0; + // write to seqs + key = X31_hash_string(bam1_qname(p->b)); + k = kh_put(64, seqs, key, &tmp); + f = &kh_val(seqs, k); + if (tmp == 0) { // present in the hash table + if (vpos - f->vpos + 1 < MAX_VARS) { + f->vlen = vpos - f->vpos + 1; + f->seq[f->vlen-1] = c; + f->end = bam_calend(&p->b->core, bam1_cigar(p->b)); + } + dophase = 0; + } else { // absent + memset(f->seq, 0, MAX_VARS); + f->beg = p->b->core.pos; + f->end = bam_calend(&p->b->core, bam1_cigar(p->b)); + f->vpos = vpos, f->vlen = 1, f->seq[0] = c, f->single = f->phased = f->flip = f->ambig = 0; + } + } + if (dophase) { + seqs = shrink_hash(seqs); + phase(&g, h->target_name[tid], vpos, cns, seqs); + update_vpos(vpos, seqs); + cns[0] = cns[vpos]; + vpos = 0; + } + ++vpos; + } + if (tid >= 0) phase(&g, h->target_name[tid], vpos, cns, seqs); + bam_header_destroy(h); + bam_plp_destroy(iter); + bam_close(g.fp); + kh_destroy(64, seqs); + kh_destroy(set64, set); + free(cns); + errmod_destroy(em); + free(bases); + if (g.pre) { + for (c = 0; c <= 2; ++c) bam_close(g.out[c]); + free(g.pre); free(g.b); + } + return 0; +} diff --git a/samtools.1 b/samtools.1 index 57f1aff..e0c7743 100644 --- a/samtools.1 +++ b/samtools.1 @@ -1,4 +1,4 @@ -.TH samtools 1 "2 December 2010" "samtools-0.1.12" "Bioinformatics tools" +.TH samtools 1 "1 March 2011" "samtools-0.1.13" "Bioinformatics tools" .SH NAME .PP samtools - Utilities for the Sequence Alignment/Map (SAM) format @@ -137,7 +137,7 @@ viewing the same reference sequence. .TP .B mpileup -samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]] +samtools mpileup [-EBug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]] Generate BCF or pileup for one or multiple BAM files. Alignment records are grouped by sample identifiers in @RG header lines. If sample @@ -164,6 +164,10 @@ Phred-scaled gap extension sequencing error probability. Reducing .I INT leads to longer indels. [20] .TP +.B -E +Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt +specificity a little bit. +.TP .BI -f \ FILE The reference file [null] .TP @@ -355,7 +359,7 @@ Treat paired-end reads and single-end reads. .TP .B calmd -samtools calmd [-eubSr] [-C capQcoef] +samtools calmd [-EeubSr] [-C capQcoef] Generate the MD tag. If the MD tag is already present, this command will give a warning if the MD tag generated is different from the existing @@ -388,7 +392,58 @@ Coefficient to cap mapping quality of poorly mapped reads. See the command for details. [0] .TP .B -r -Compute the BQ tag without changing the base quality. +Compute the BQ tag (without -A) or cap base quality by BAQ (with -A). +.TP +.B -E +Extended BAQ calculation. This option trades specificity for sensitivity, though the +effect is minor. +.RE + +.TP +.B targetcut +samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] + +This command identifies target regions by examining the continuity of read depth, computes +haploid consensus sequences of targets and outputs a SAM with each sequence corresponding +to a target. When option +.B -f +is in use, BAQ will be applied. This command is +.B only +designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)]. +.RE + +.TP +.B phase +samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] + +Call and phase heterozygous SNPs. +.B OPTIONS: +.RS +.TP 8 +.B -A +Drop reads with ambiguous phase. +.TP 8 +.BI -b \ STR +Prefix of BAM output. When this option is in use, phase-0 reads will be saved in file +.BR STR .0.bam +and phase-1 reads in +.BR STR .1.bam. +Phase unknown reads will be randomly allocated to one of the two files. Chimeric reads +with switch errors will be saved in +.BR STR .chimeric.bam. +[null] +.TP +.B -F +Do not attempt to fix chimeric reads. +.TP +.BI -k \ INT +Maximum length for local phasing. [13] +.TP +.BI -q \ INT +Minimum Phred-scaled LOD to call a heterozygote. [40] +.TP +.BI -Q \ INT +Minimum base quality to be used in het calling. [13] .RE .TP @@ -629,6 +684,20 @@ mismatches. Applying this option usually helps .B BWA-short but may not other mappers. +.IP o 2 +Generate the consensus sequence for one diploid individual: + + samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq + +.IP o 2 +Phase one individual: + + samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out + +The +.B calmd +command is used to reduce false heterozygotes around INDELs. + .IP o 2 Call SNPs and short indels for multiple diploid individuals: diff --git a/samtools.txt b/samtools.txt deleted file mode 100644 index 4cbcef7..0000000 --- a/samtools.txt +++ /dev/null @@ -1,559 +0,0 @@ -samtools(1) Bioinformatics tools samtools(1) - - - -NAME - samtools - Utilities for the Sequence Alignment/Map (SAM) format - -SYNOPSIS - samtools view -bt ref_list.txt -o aln.bam aln.sam.gz - - samtools sort aln.bam aln.sorted - - samtools index aln.sorted.bam - - samtools idxstats aln.sorted.bam - - samtools view aln.sorted.bam chr2:20,100,000-20,200,000 - - samtools merge out.bam in1.bam in2.bam in3.bam - - samtools faidx ref.fasta - - samtools pileup -vcf ref.fasta aln.sorted.bam - - samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam - - samtools tview aln.sorted.bam ref.fasta - - -DESCRIPTION - Samtools is a set of utilities that manipulate alignments in the BAM - format. It imports from and exports to the SAM (Sequence Alignment/Map) - format, does sorting, merging and indexing, and allows to retrieve - reads in any regions swiftly. - - Samtools is designed to work on a stream. It regards an input file `-' - as the standard input (stdin) and an output file `-' as the standard - output (stdout). Several commands can thus be combined with Unix pipes. - Samtools always output warning and error messages to the standard error - output (stderr). - - Samtools is also able to open a BAM (not SAM) file on a remote FTP or - HTTP server if the BAM file name starts with `ftp://' or `http://'. - Samtools checks the current working directory for the index file and - will download the index upon absence. Samtools does not retrieve the - entire alignment file unless it is asked to do so. - - -COMMANDS AND OPTIONS - view samtools view [-bchuHS] [-t in.refList] [-o output] [-f - reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read- - Group] [-R rgFile] | [region1 [...]] - - Extract/print all or sub alignments in SAM or BAM format. If - no region is specified, all the alignments will be printed; - otherwise only alignments overlapping the specified regions - will be output. An alignment may be given multiple times if - it is overlapping several regions. A region can be presented, - for example, in the following format: `chr2' (the whole - chr2), `chr2:1000000' (region starting from 1,000,000bp) or - `chr2:1,000,000-2,000,000' (region between 1,000,000 and - 2,000,000bp including the end points). The coordinate is - 1-based. - - OPTIONS: - - -b Output in the BAM format. - - -f INT Only output alignments with all bits in INT present - in the FLAG field. INT can be in hex in the format of - /^0x[0-9A-F]+/ [0] - - -F INT Skip alignments with bits present in INT [0] - - -h Include the header in the output. - - -H Output the header only. - - -l STR Only output reads in library STR [null] - - -o FILE Output file [stdout] - - -q INT Skip alignments with MAPQ smaller than INT [0] - - -r STR Only output reads in read group STR [null] - - -R FILE Output reads in read groups listed in FILE [null] - - -S Input is in SAM. If @SQ header lines are absent, the - `-t' option is required. - - -c Instead of printing the alignments, only count them - and print the total number. All filter options, such - as `-f', `-F' and `-q' , are taken into account. - - -t FILE This file is TAB-delimited. Each line must contain - the reference name and the length of the reference, - one line for each distinct reference; additional - fields are ignored. This file also defines the order - of the reference sequences in sorting. If you run - `samtools faidx ', the resultant index file - .fai can be used as this file. - - -u Output uncompressed BAM. This option saves time spent - on compression/decomprssion and is thus preferred - when the output is piped to another samtools command. - - - tview samtools tview [ref.fasta] - - Text alignment viewer (based on the ncurses library). In the - viewer, press `?' for help and press `g' to check the align- - ment start from a region in the format like - `chr10:10,000,000' or `=10,000,000' when viewing the same - reference sequence. - - - mpileup samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l - list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam - [...]] - - Generate BCF or pileup for one or multiple BAM files. Align- - ment records are grouped by sample identifiers in @RG header - lines. If sample identifiers are absent, each input file is - regarded as one sample. - - OPTIONS: - - -B Disable probabilistic realignment for the computation - of base alignment quality (BAQ). BAQ is the Phred- - scaled probability of a read base being misaligned. - Applying this option greatly helps to reduce false - SNPs caused by misalignments. - - -C INT Coefficient for downgrading mapping quality for reads - containing excessive mismatches. Given a read with a - phred-scaled probability q of being generated from - the mapped position, the new mapping quality is about - sqrt((INT-q)/INT)*INT. A zero value disables this - functionality; if enabled, the recommended value for - BWA is 50. [0] - - -e INT Phred-scaled gap extension sequencing error probabil- - ity. Reducing INT leads to longer indels. [20] - - -f FILE The reference file [null] - - -g Compute genotype likelihoods and output them in the - binary call format (BCF). - - -h INT Coefficient for modeling homopolymer errors. Given an - l-long homopolymer run, the sequencing error of an - indel of size s is modeled as INT*s/l. [100] - - -l FILE File containing a list of sites where pileup or BCF - is outputted [null] - - -o INT Phred-scaled gap open sequencing error probability. - Reducing INT leads to more indel calls. [40] - - -P STR Comma dilimited list of platforms (determined by @RG- - PL) from which indel candidates are obtained. It is - recommended to collect indel candidates from sequenc- - ing technologies that have low indel error rate such - as ILLUMINA. [all] - - -q INT Minimum mapping quality for an alignment to be used - [0] - - -Q INT Minimum base quality for a base to be considered [13] - - -r STR Only generate pileup in region STR [all sites] - - -u Similar to -g except that the output is uncompressed - BCF, which is preferred for piping. - - - reheader samtools reheader - - Replace the header in in.bam with the header in - in.header.sam. This command is much faster than replacing - the header with a BAM->SAM->BAM conversion. - - - sort samtools sort [-no] [-m maxMem] - - Sort alignments by leftmost coordinates. File .bam will be created. This command may also create tempo- - rary files .%d.bam when the whole alignment can- - not be fitted into memory (controlled by option -m). - - OPTIONS: - - -o Output the final alignment to the standard output. - - -n Sort by read names rather than by chromosomal coordi- - nates - - -m INT Approximately the maximum required memory. - [500000000] - - - merge samtools merge [-nur] [-h inh.sam] [-R reg] - [...] - - Merge multiple sorted alignments. The header reference lists - of all the input BAM files, and the @SQ headers of inh.sam, - if any, must all refer to the same set of reference - sequences. The header reference list and (unless overridden - by -h) `@' headers of in1.bam will be copied to out.bam, and - the headers of other files will be ignored. - - OPTIONS: - - -h FILE Use the lines of FILE as `@' headers to be copied to - out.bam, replacing any header lines that would other- - wise be copied from in1.bam. (FILE is actually in - SAM format, though any alignment records it may con- - tain are ignored.) - - -R STR Merge files in the specified region indicated by STR - - -r Attach an RG tag to each alignment. The tag value is - inferred from file names. - - -n The input alignments are sorted by read names rather - than by chromosomal coordinates - - -u Uncompressed BAM output - - - index samtools index - - Index sorted alignment for fast random access. Index file - .bai will be created. - - - idxstats samtools idxstats - - Retrieve and print stats in the index file. The output is TAB - delimited with each line consisting of reference sequence - name, sequence length, # mapped reads and # unmapped reads. - - - faidx samtools faidx [region1 [...]] - - Index reference sequence in the FASTA format or extract sub- - sequence from indexed reference sequence. If no region is - specified, faidx will index the file and create - .fai on the disk. If regions are speficified, the - subsequences will be retrieved and printed to stdout in the - FASTA format. The input file can be compressed in the RAZF - format. - - - fixmate samtools fixmate - - Fill in mate coordinates, ISIZE and mate related flags from a - name-sorted alignment. - - - rmdup samtools rmdup [-sS] - - Remove potential PCR duplicates: if multiple read pairs have - identical external coordinates, only retain the pair with - highest mapping quality. In the paired-end mode, this com- - mand ONLY works with FR orientation and requires ISIZE is - correctly set. It does not work for unpaired reads (e.g. two - ends mapped to different chromosomes or orphan reads). - - OPTIONS: - - -s Remove duplicate for single-end reads. By default, - the command works for paired-end reads only. - - -S Treat paired-end reads and single-end reads. - - - calmd samtools calmd [-eubSr] [-C capQcoef] - - Generate the MD tag. If the MD tag is already present, this - command will give a warning if the MD tag generated is dif- - ferent from the existing tag. Output SAM by default. - - OPTIONS: - - -A When used jointly with -r this option overwrites the - original base quality. - - -e Convert a the read base to = if it is identical to - the aligned reference base. Indel caller does not - support the = bases at the moment. - - -u Output uncompressed BAM - - -b Output compressed BAM - - -S The input is SAM with header lines - - -C INT Coefficient to cap mapping quality of poorly mapped - reads. See the pileup command for details. [0] - - -r Compute the BQ tag without changing the base quality. - - - pileup samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] - [-l in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N - nHap] [-r pairDiffRate] [-m mask] [-d maxIndelDepth] [-G - indelPrior] | - - Print the alignment in the pileup format. In the pileup for- - mat, each line represents a genomic position, consisting of - chromosome name, coordinate, reference base, read bases, read - qualities and alignment mapping qualities. Information on - match, mismatch, indel, strand, mapping quality and start and - end of a read are all encoded at the read base column. At - this column, a dot stands for a match to the reference base - on the forward strand, a comma for a match on the reverse - strand, a '>' or '<' for a reference skip, `ACGTN' for a mis- - match on the forward strand and `acgtn' for a mismatch on the - reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates - there is an insertion between this reference position and the - next reference position. The length of the insertion is given - by the integer in the pattern, followed by the inserted - sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' repre- - sents a deletion from the reference. The deleted bases will - be presented as `*' in the following lines. Also at the read - base column, a symbol `^' marks the start of a read. The - ASCII of the character following `^' minus 33 gives the map- - ping quality. A symbol `$' marks the end of a read segment. - - If option -c is applied, the consensus base, Phred-scaled - consensus quality, SNP quality (i.e. the Phred-scaled proba- - bility of the consensus being identical to the reference) and - root mean square (RMS) mapping quality of the reads covering - the site will be inserted between the `reference base' and - the `read bases' columns. An indel occupies an additional - line. Each indel line consists of chromosome name, coordi- - nate, a star, the genotype, consensus quality, SNP quality, - RMS mapping quality, # covering reads, the first alllele, the - second allele, # reads supporting the first allele, # reads - supporting the second allele and # reads containing indels - different from the top two alleles. - - NOTE: Since 0.1.10, the `pileup' command is deprecated by - `mpileup'. - - OPTIONS: - - -B Disable the BAQ computation. See the mpileup com- - mand for details. - - -c Call the consensus sequence. Options -T, -N, -I and - -r are only effective when -c or -g is in use. - - -C INT Coefficient for downgrading the mapping quality of - poorly mapped reads. See the mpileup command for - details. [0] - - -d INT Use the first NUM reads in the pileup for indel - calling for speed up. Zero for unlimited. [1024] - - -f FILE The reference sequence in the FASTA format. Index - file FILE.fai will be created if absent. - - -g Generate genotype likelihood in the binary GLFv3 - format. This option suppresses -c, -i and -s. This - option is deprecated by the mpileup command. - - -i Only output pileup lines containing indels. - - -I INT Phred probability of an indel in sequencing/prep. - [40] - - -l FILE List of sites at which pileup is output. This file - is space delimited. The first two columns are - required to be chromosome and 1-based coordinate. - Additional columns are ignored. It is recommended - to use option - - -m INT Filter reads with flag containing bits in INT - [1796] - - -M INT Cap mapping quality at INT [60] - - -N INT Number of haplotypes in the sample (>=2) [2] - - -r FLOAT Expected fraction of differences between a pair of - haplotypes [0.001] - - -s Print the mapping quality as the last column. This - option makes the output easier to parse, although - this format is not space efficient. - - -S The input file is in SAM. - - -t FILE List of reference names ane sequence lengths, in - the format described for the import command. If - this option is present, samtools assumes the input - is in SAM format; otherwise it - assumes in BAM format. -s together with -l as in - the default format we may not know the mapping - quality. - - -T FLOAT The theta parameter (error dependency coefficient) - in the maq consensus calling model [0.85] - - -SAM FORMAT - SAM is TAB-delimited. Apart from the header lines, which are started - with the `@' symbol, each alignment line consists of: - - - +----+-------+----------------------------------------------------------+ - |Col | Field | Description | - +----+-------+----------------------------------------------------------+ - | 1 | QNAME | Query (pair) NAME | - | 2 | FLAG | bitwise FLAG | - | 3 | RNAME | Reference sequence NAME | - | 4 | POS | 1-based leftmost POSition/coordinate of clipped sequence | - | 5 | MAPQ | MAPping Quality (Phred-scaled) | - | 6 | CIAGR | extended CIGAR string | - | 7 | MRNM | Mate Reference sequence NaMe (`=' if same as RNAME) | - | 8 | MPOS | 1-based Mate POSistion | - | 9 | ISIZE | Inferred insert SIZE | - |10 | SEQ | query SEQuence on the same strand as the reference | - |11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) | - |12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE | - +----+-------+----------------------------------------------------------+ - - Each bit in the FLAG field is defined as: - - - +-------+-----+--------------------------------------------------+ - | Flag | Chr | Description | - +-------+-----+--------------------------------------------------+ - |0x0001 | p | the read is paired in sequencing | - |0x0002 | P | the read is mapped in a proper pair | - |0x0004 | u | the query sequence itself is unmapped | - |0x0008 | U | the mate is unmapped | - |0x0010 | r | strand of the query (1 for reverse) | - |0x0020 | R | strand of the mate | - |0x0040 | 1 | the read is the first read in a pair | - |0x0080 | 2 | the read is the second read in a pair | - |0x0100 | s | the alignment is not primary | - |0x0200 | f | the read fails platform/vendor quality checks | - |0x0400 | d | the read is either a PCR or an optical duplicate | - +-------+-----+--------------------------------------------------+ - -EXAMPLES - o Import SAM to BAM when @SQ lines are present in the header: - - samtools view -bS aln.sam > aln.bam - - If @SQ lines are absent: - - samtools faidx ref.fa - samtools view -bt ref.fa.fai aln.sam > aln.bam - - where ref.fa.fai is generated automatically by the faidx command. - - - o Attach the RG tag while merging sorted alignments: - - perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu- - mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt - samtools merge -rh rg.txt merged.bam ga.bam 454.bam - - The value in a RG tag is determined by the file name the read is com- - ing from. In this example, in the merged.bam, reads from ga.bam will - be attached RG:Z:ga, while reads from 454.bam will be attached - RG:Z:454. - - - o Call SNPs and short indels for one diploid individual: - - samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > - var.raw.bcf - bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > - var.flt.vcf - - The -D option of varFilter controls the maximum read depth, which - should be adjusted to about twice the average read depth. One may - consider to add -C50 to mpileup if mapping quality is overestimated - for reads containing excessive mismatches. Applying this option usu- - ally helps BWA-short but may not other mappers. - - - o Call SNPs and short indels for multiple diploid individuals: - - samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view - -bcvg - > var.raw.bcf - bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > - var.flt.vcf - - Individuals are identified from the SM tags in the @RG header lines. - Individuals can be pooled in one alignment file; one individual can - also be separated into multiple files. The -P option specifies that - indel candidates should be collected only from read groups with the - @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads - sequenced by an indel-prone technology may affect the performance of - indel calling. - - - o Derive the allele frequency spectrum (AFS) on a list of sites from - multiple individuals: - - samtools mpileup -Igf ref.fa *.bam > all.bcf - bcftools view -bl sites.list all.bcf > sites.bcf - bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs - bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs - bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs - ...... - - where sites.list contains the list of sites with each line consisting - of the reference sequence name and position. The following bcftools - commands estimate AFS by EM. - - - o Dump BAQ applied alignment for other SNP callers: - - samtools calmd -bAr aln.bam > aln.baq.bam - - It adds and corrects the NM and MD tags at the same time. The calmd - command also comes with the -C option, the same as the one in pileup - and mpileup. Apply if it helps. - - -LIMITATIONS - o Unaligned words used in bam_import.c, bam_endian.h, bam.c and - bam_aux.c. - - o In merging, the input files are required to have the same number of - reference sequences. The requirement can be relaxed. In addition, - merging does not reconstruct the header dictionaries automatically. - Endusers have to provide the correct header. Picard is better at - merging. - - o Samtools paired-end rmdup does not work for unpaired reads (e.g. - orphan reads or ends mapped to different chromosomes). If this is a - concern, please use Picard's MarkDuplicate which correctly handles - these cases, although a little slower. - - -AUTHOR - Heng Li from the Sanger Institute wrote the C version of samtools. Bob - Handsaker from the Broad Institute implemented the BGZF library and Jue - Ruan from Beijing Genomics Institute wrote the RAZF library. John Mar- - shall and Petr Danecek contribute to the source code and various people - from the 1000 Genomes Project have contributed to the SAM format speci- - fication. - - -SEE ALSO - Samtools website: - - - -samtools-0.1.12 2 December 2010 samtools(1)