Imported Upstream version 0.1.10

author Charles Plessy <plessy@debian.org>

Wed, 17 Nov 2010 11:23:16 +0000 (20:23 +0900)

committer Charles Plessy <plessy@debian.org>

Wed, 17 Nov 2010 11:23:16 +0000 (20:23 +0900)
author Charles Plessy <plessy@debian.org>
Wed, 17 Nov 2010 11:23:16 +0000 (20:23 +0900)
committer Charles Plessy <plessy@debian.org>
Wed, 17 Nov 2010 11:23:16 +0000 (20:23 +0900)
diff --git a/ChangeLog b/ChangeLog

index 12908c968c54e264baff1562e7cea623fb145dcb..f8456a3e61f1e731343c53fbb0aa859751332282 100644 (file)
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,453 @@
+------------------------------------------------------------------------
+r828 | lh3lh3 | 2010-11-16 20:48:49 -0500 (Tue, 16 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bamtk.c
+
+update version information: samtools-0.1.9-20 (r828)
+
+------------------------------------------------------------------------
+r827 | lh3lh3 | 2010-11-16 15:32:50 -0500 (Tue, 16 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bcftools/call1.c
+
+bcftools: allow to skip indels
+
+------------------------------------------------------------------------
+r826 | lh3lh3 | 2010-11-16 14:11:58 -0500 (Tue, 16 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam_md.c
+
+remove ZQ if both BQ and ZQ are present
+
+------------------------------------------------------------------------
+r825 | lh3lh3 | 2010-11-16 13:51:33 -0500 (Tue, 16 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_md.c
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/samtools.1
+
+ * samtools-0.1.9-18 (r825)
+ * change the behaviour of calmd such that by default it does not change the base quality
+
+------------------------------------------------------------------------
+r824 | lh3lh3 | 2010-11-15 23:31:53 -0500 (Mon, 15 Nov 2010) | 4 lines
+Changed paths:
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/bcftools/call1.c
+   M /trunk/samtools/samtools.1
+
+ * samtools-0.1.9-17 (r824)
+ * added command line options to change the default parameters in indel calling
+ * update the manual
+
+------------------------------------------------------------------------
+r823 | lh3lh3 | 2010-11-15 12:20:13 -0500 (Mon, 15 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_md.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-r823
+ * the BQ tag is now 64 shifted, not 33 shifted
+
+------------------------------------------------------------------------
+r822 | lh3lh3 | 2010-11-15 00:30:18 -0500 (Mon, 15 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/bcftools/vcfutils.pl
+   M /trunk/samtools/misc/samtools.pl
+
+ * samtools-0.1.9-16 (r822)
+ * keep the raw depth because in indel calling, DP4 may be way off the true depth
+
+------------------------------------------------------------------------
+r821 | lh3lh3 | 2010-11-13 01:18:31 -0500 (Sat, 13 Nov 2010) | 4 lines
+Changed paths:
+   M /trunk/samtools/bam_md.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-15 (r821)
+ * calmd: write BQ
+ * skip realignment if BQ is present
+
+------------------------------------------------------------------------
+r820 | lh3lh3 | 2010-11-13 01:08:26 -0500 (Sat, 13 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-14 (r820)
+ * penalize reads with excessive differences in indel calling
+
+------------------------------------------------------------------------
+r819 | lh3lh3 | 2010-11-12 21:36:27 -0500 (Fri, 12 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam_maqcns.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-13 (r819)
+ * fixed a bug in pileup given refskip
+
+------------------------------------------------------------------------
+r818 | lh3lh3 | 2010-11-12 13:04:53 -0500 (Fri, 12 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-r818
+ * for indel calling, do two rounds of probabilistic realignments
+
+------------------------------------------------------------------------
+r817 | lh3lh3 | 2010-11-11 20:04:07 -0500 (Thu, 11 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/bcftools/vcfutils.pl
+
+ * samtools-0.1.19-11 (r817)
+ * only initiate indel calling when 0.2% of reads contain a gap
+
+------------------------------------------------------------------------
+r816 | lh3lh3 | 2010-11-11 01:22:59 -0500 (Thu, 11 Nov 2010) | 7 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-10 (r816)
+
+ * I know why the forward method fails. it is because of zero base
+   qualities. when that is fixed, the forward method seems to give
+   better results than Viterbi, as it should be. I am tired...
+
+
+------------------------------------------------------------------------
+r815 | lh3lh3 | 2010-11-11 00:57:15 -0500 (Thu, 11 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/ChangeLog
+   M /trunk/samtools/bam2bcf_indel.c
+
+effectively revert to the viterbi version. The forward realignment gives too many false positives.
+
+------------------------------------------------------------------------
+r814 | lh3lh3 | 2010-11-11 00:18:02 -0500 (Thu, 11 Nov 2010) | 4 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_md.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-9 (r810)
+ * use forward, instead of viterbi, for realignment
+ * realignment is now quality aware
+
+------------------------------------------------------------------------
+r813 | lh3lh3 | 2010-11-10 22:45:24 -0500 (Wed, 10 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/kprobaln.c
+   M /trunk/samtools/kprobaln.h
+
+ * prepare to replace kaln with kprobaln in realignment
+
+------------------------------------------------------------------------
+r812 | lh3lh3 | 2010-11-10 17:28:50 -0500 (Wed, 10 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bcftools/bcf.c
+
+fixed a typo
+
+------------------------------------------------------------------------
+r811 | lh3lh3 | 2010-11-10 16:54:46 -0500 (Wed, 10 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bcftools/bcf.c
+   M /trunk/samtools/bcftools/bcf.h
+
+use zlib for direct reading when BCF_LITE is in use
+
+------------------------------------------------------------------------
+r810 | lh3lh3 | 2010-11-10 16:32:13 -0500 (Wed, 10 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+
+ * do not use reads containing too many mismatches for indel calling
+ * fixed a trivial bug in case of multi-allelic indels
+
+------------------------------------------------------------------------
+r809 | lh3lh3 | 2010-11-10 13:23:02 -0500 (Wed, 10 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-8 (r809)
+ * fixed a bug in the indel caller
+
+------------------------------------------------------------------------
+r808 | lh3lh3 | 2010-11-10 12:24:10 -0500 (Wed, 10 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/Makefile
+
+minor change to makefile
+
+------------------------------------------------------------------------
+r807 | lh3lh3 | 2010-11-10 12:10:21 -0500 (Wed, 10 Nov 2010) | 4 lines
+Changed paths:
+   M /trunk/samtools/Makefile
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/bcftools/vcfutils.pl
+
+ * samtools-0.1.9-8 (r807)
+ * collect indel candidates only from specified platforms (@RG-PL)
+ * merge varFilter and filter4vcf in vcfutils.pl 
+
+------------------------------------------------------------------------
+r806 | lh3lh3 | 2010-11-09 22:05:46 -0500 (Tue, 09 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bcftools/call1.c
+   M /trunk/samtools/bcftools/prob1.c
+   M /trunk/samtools/bcftools/prob1.h
+
+bcftools: compute equal-tail (Bayesian) credible interval
+
+------------------------------------------------------------------------
+r805 | lh3lh3 | 2010-11-09 16:28:39 -0500 (Tue, 09 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bcftools/vcfutils.pl
+
+added a double-hit filter to avoid overestimated indel likelihood
+
+------------------------------------------------------------------------
+r804 | lh3lh3 | 2010-11-09 14:12:06 -0500 (Tue, 09 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-7 (r804)
+ * fixed a bug in the gap caller
+
+------------------------------------------------------------------------
+r803 | lh3lh3 | 2010-11-09 10:45:33 -0500 (Tue, 09 Nov 2010) | 4 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/bcftools/bcf.c
+   M /trunk/samtools/bcftools/bcf.h
+   M /trunk/samtools/bcftools/prob1.c
+
+ * samtools-0.1.9-6 (r803)
+ * mpileup: apply homopolymer correction when calculating GL, instead of before
+ * bcftools: apply a different prior to indels
+
+------------------------------------------------------------------------
+r802 | lh3lh3 | 2010-11-08 23:53:15 -0500 (Mon, 08 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-5 (r802)
+ * relax tandem penalty. this will be made a command-line option in future.
+
+------------------------------------------------------------------------
+r801 | lh3lh3 | 2010-11-08 23:35:52 -0500 (Mon, 08 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-4 (r801)
+ * fixed a minor issue in printing indel VCF
+
+------------------------------------------------------------------------
+r800 | lh3lh3 | 2010-11-08 15:28:14 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bcftools/vcfutils.pl
+
+fixed another silly bug in mpileup's indel caller
+
+------------------------------------------------------------------------
+r799 | lh3lh3 | 2010-11-08 14:28:27 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+
+fixed a silly bug in the indel caller
+
+------------------------------------------------------------------------
+r798 | lh3lh3 | 2010-11-08 14:07:33 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/ChangeLog
+   M /trunk/samtools/sam_view.c
+   M /trunk/samtools/samtools.1
+
+Incorporate patches by Marcel Martin for read counting.
+
+------------------------------------------------------------------------
+r797 | lh3lh3 | 2010-11-08 13:39:52 -0500 (Mon, 08 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-2 (r797)
+ * mpileup: indel calling seems to be working
+
+------------------------------------------------------------------------
+r796 | lh3lh3 | 2010-11-08 10:54:46 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/kaln.c
+
+indel calling is apparently working, but more information needs to be collected
+
+------------------------------------------------------------------------
+r795 | lh3lh3 | 2010-11-08 00:39:18 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf_indel.c
+
+fixed a few bugs in the indel caller. Probably there are more.
+
+------------------------------------------------------------------------
+r794 | lh3lh3 | 2010-11-07 22:23:16 -0500 (Sun, 07 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/Makefile
+   M /trunk/samtools/bam.h
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   A /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/kaln.c
+   M /trunk/samtools/kaln.h
+
+prepare for the indel caller. It is not ready yet.
+
+------------------------------------------------------------------------
+r793 | lh3lh3 | 2010-11-05 11:28:23 -0400 (Fri, 05 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam_plcmd.c
+
+Revert to r790. The recent changes are not good...
+
+------------------------------------------------------------------------
+r792 | lh3lh3 | 2010-11-05 00:19:14 -0400 (Fri, 05 Nov 2010) | 6 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam_plcmd.c
+
+ * this revision is UNSTABLE
+
+ * indel caller seems working, but it is very insensitive and has
+   several things I do not quite understand.
+
+
+------------------------------------------------------------------------
+r791 | lh3lh3 | 2010-11-04 22:58:43 -0400 (Thu, 04 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam_plcmd.c
+
+for backup. no effective changes
+
+------------------------------------------------------------------------
+r790 | lh3lh3 | 2010-11-03 15:51:24 -0400 (Wed, 03 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bcftools/vcfutils.pl
+   M /trunk/samtools/kprobaln.c
+
+fixed a minor problem in the example coming with kprobaln.c
+
+------------------------------------------------------------------------
+r789 | lh3lh3 | 2010-11-02 15:41:27 -0400 (Tue, 02 Nov 2010) | 4 lines
+Changed paths:
+   M /trunk/samtools/Makefile
+   M /trunk/samtools/bam_md.c
+   M /trunk/samtools/kaln.c
+   M /trunk/samtools/kaln.h
+   A /trunk/samtools/kprobaln.c
+   A /trunk/samtools/kprobaln.h
+
+Separate kaln and kprobaln as I am preparing further changes. At
+present, the results should be identical to the previous.
+
+
+------------------------------------------------------------------------
+r788 | petulda | 2010-11-02 12:19:04 -0400 (Tue, 02 Nov 2010) | 1 line
+Changed paths:
+   M /trunk/samtools/bam_plcmd.c
+
+Added -b option: read file names from a file
+------------------------------------------------------------------------
+r787 | lh3lh3 | 2010-10-29 23:17:22 -0400 (Fri, 29 Oct 2010) | 7 lines
+Changed paths:
+   M /trunk/samtools/bam.h
+   M /trunk/samtools/bam_pileup.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-2 (r787)
+
+ * Allow to set a maximum per-sample depth to reduce memory. However,
+   BAQ computation is still applied to every read. The speed is not
+   improved.
+
+
+------------------------------------------------------------------------
+r786 | lh3lh3 | 2010-10-29 12:10:40 -0400 (Fri, 29 Oct 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/bcftools/bcf.c
+   M /trunk/samtools/bcftools/vcf.c
+
+ * samtools-0.1.9-1 (r786)
+ * samtools: optionally perform exact test for each sample
+
+------------------------------------------------------------------------
+r785 | lh3lh3 | 2010-10-29 09:42:25 -0400 (Fri, 29 Oct 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bcftools/bcf.c
+
+Optionally output "DP", the individual read depth
+
+------------------------------------------------------------------------
+r784 | lh3lh3 | 2010-10-27 23:10:27 -0400 (Wed, 27 Oct 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/samtools.1
+
+acknowledge Petr and John who have greatly contributed to the project.
+
+------------------------------------------------------------------------
+r783 | lh3lh3 | 2010-10-27 22:47:47 -0400 (Wed, 27 Oct 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/ChangeLog
+   M /trunk/samtools/NEWS
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/samtools.1
+
+Release samtools-0.1.9 (r783)
+
  ------------------------------------------------------------------------
  r782 | lh3lh3 | 2010-10-27 19:58:54 -0400 (Wed, 27 Oct 2010) | 2 lines
  Changed paths:
  ------------------------------------------------------------------------
  r782 | lh3lh3 | 2010-10-27 19:58:54 -0400 (Wed, 27 Oct 2010) | 2 lines
  Changed paths:
diff --git a/Makefile b/Makefile

index 77cda865c4b38df76fdafe338feae873dabde0e6..13d4a7698e1ded63520b3c8c7a117a45d91af1a1 100644 (file)
--- a/Makefile
+++ b/Makefile
@@ -4,10 +4,10 @@ DFLAGS=               -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE -D_CURSES_LIB=1
  KNETFILE_O=    knetfile.o
  LOBJS=         bgzf.o kstring.o bam_aux.o bam.o bam_import.o sam.o bam_index.o \
                         bam_pileup.o bam_lpileup.o bam_md.o glf.o razf.o faidx.o \
  KNETFILE_O=    knetfile.o
  LOBJS=         bgzf.o kstring.o bam_aux.o bam.o bam_import.o sam.o bam_index.o \
                         bam_pileup.o bam_lpileup.o bam_md.o glf.o razf.o faidx.o \
-                       $(KNETFILE_O) bam_sort.o sam_header.o bam_reheader.o
+                       $(KNETFILE_O) bam_sort.o sam_header.o bam_reheader.o kprobaln.o
  AOBJS=         bam_tview.o bam_maqcns.o bam_plcmd.o sam_view.o \
                         bam_rmdup.o bam_rmdupse.o bam_mate.o bam_stat.o bam_color.o     \
  AOBJS=         bam_tview.o bam_maqcns.o bam_plcmd.o sam_view.o \
                         bam_rmdup.o bam_rmdupse.o bam_mate.o bam_stat.o bam_color.o     \
-                       bamtk.o kaln.o bam2bcf.o errmod.o sample.o
+                       bamtk.o kaln.o bam2bcf.o bam2bcf_indel.o errmod.o sample.o
  PROG=          samtools
  INCLUDES=      -I.
  SUBDIRS=       . bcftools misc
  PROG=          samtools
  INCLUDES=      -I.
  SUBDIRS=       . bcftools misc
@@ -64,6 +64,7 @@ glf.o:glf.h
  sam_header.o:sam_header.h khash.h
  bcf.o:bcftools/bcf.h
  bam2bcf.o:bam2bcf.h errmod.h bcftools/bcf.h
  sam_header.o:sam_header.h khash.h
  bcf.o:bcftools/bcf.h
  bam2bcf.o:bam2bcf.h errmod.h bcftools/bcf.h
+bam2bcf_indel.o:bam2bcf.h
  errmod.o:errmod.h
  
  faidx.o:faidx.h razf.h khash.h
  errmod.o:errmod.h
  
  faidx.o:faidx.h razf.h khash.h
diff --git a/NEWS b/NEWS

index 82646ba812fa66f90c1ca4cf2473645f87809b31..6c2195e5929fab1dbe96b5fafc734a62443ec99c 100644 (file)
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,38 @@
+Beta Release 0.1.10 (16 November, 2010)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This release is featured as the first major improvement to the indel
+caller. The method is similar to the old one implemented in the pileup
+command, but the details are handled more carefully both in theory and
+in practice. As a result, the new indel caller usually gives more
+accurate indel calls, though at the cost of sensitivity. The caller is
+implemented in the mpileup command and is invoked by default. It works
+with multiple samples.
+
+Other notable changes:
+
+ * With the -r option, the calmd command writes the difference between
+   the original base quality and the BAQ capped base quality at the BQ
+   tag but does not modify the base quality. Please use -Ar to overwrite
+   the original base quality (the 0.1.9 behavior).
+
+ * Allow to set a maximum per-sample read depth to reduce memory. In
+   0.1.9, most of memory is wasted for the ultra high read depth in some
+   regions (e.g. the chr1 centromere).
+
+ * Optionally write per-sample read depth and per-sample strand bias
+   P-value.
+
+ * Compute equal-tail (Bayesian) credible interval of site allele
+   frequency at the CI95 VCF annotation.
+
+ * Merged the vcfutils.pl varFilter and filter4vcf for better SNP/indel
+   filtering.
+
+(0.1.10: 16 November 2010, r829)
+
+
+
  Beta Release 0.1.9 (27 October, 2010)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
  Beta Release 0.1.9 (27 October, 2010)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
diff --git a/bam.h b/bam.h

index 4c8536fd33366883028ff9b77bb95049682c5f9f..eef2ea9853f25966c5a40c45a86272b78a507fb2 100644 (file)
--- a/bam.h
+++ b/bam.h
@@ -495,7 +495,7 @@ extern "C" {
                 bam1_t *b;
                 int32_t qpos;
                 int indel, level;
                 bam1_t *b;
                 int32_t qpos;
                 int indel, level;
-               uint32_t is_del:1, is_head:1, is_tail:1, is_refskip:1;
+               uint32_t is_del:1, is_head:1, is_tail:1, is_refskip:1, aux:28;
         } bam_pileup1_t;
  
         typedef int (*bam_plp_auto_f)(void *data, bam1_t *b);
         } bam_pileup1_t;
  
         typedef int (*bam_plp_auto_f)(void *data, bam1_t *b);
@@ -508,6 +508,7 @@ extern "C" {
         const bam_pileup1_t *bam_plp_next(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp);
         const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp);
         void bam_plp_set_mask(bam_plp_t iter, int mask);
         const bam_pileup1_t *bam_plp_next(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp);
         const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp);
         void bam_plp_set_mask(bam_plp_t iter, int mask);
+       void bam_plp_set_maxcnt(bam_plp_t iter, int maxcnt);
         void bam_plp_reset(bam_plp_t iter);
         void bam_plp_destroy(bam_plp_t iter);
  
         void bam_plp_reset(bam_plp_t iter);
         void bam_plp_destroy(bam_plp_t iter);
  
@@ -516,6 +517,7 @@ extern "C" {
  
         bam_mplp_t bam_mplp_init(int n, bam_plp_auto_f func, void **data);
         void bam_mplp_destroy(bam_mplp_t iter);
  
         bam_mplp_t bam_mplp_init(int n, bam_plp_auto_f func, void **data);
         void bam_mplp_destroy(bam_mplp_t iter);
+       void bam_mplp_set_maxcnt(bam_mplp_t iter, int maxcnt);
         int bam_mplp_auto(bam_mplp_t iter, int *_tid, int *_pos, int *n_plp, const bam_pileup1_t **plp);
  
         /*! @typedef
         int bam_mplp_auto(bam_mplp_t iter, int *_tid, int *_pos, int *n_plp, const bam_pileup1_t **plp);
  
         /*! @typedef
@@ -708,8 +710,8 @@ static inline bam1_t *bam_copy1(bam1_t *bdst, const bam1_t *bsrc)
  {
         uint8_t *data = bdst->data;
         int m_data = bdst->m_data;   // backup data and m_data
  {
         uint8_t *data = bdst->data;
         int m_data = bdst->m_data;   // backup data and m_data
-       if (m_data < bsrc->m_data) { // double the capacity
-               m_data = bsrc->m_data; kroundup32(m_data);
+       if (m_data < bsrc->data_len) { // double the capacity
+               m_data = bsrc->data_len; kroundup32(m_data);
                 data = (uint8_t*)realloc(data, m_data);
         }
         memcpy(data, bsrc->data, bsrc->data_len); // copy var-len data
                 data = (uint8_t*)realloc(data, m_data);
         }
         memcpy(data, bsrc->data, bsrc->data_len); // copy var-len data
diff --git a/bam2bcf.c b/bam2bcf.c

index e55212ca98cbb28041fb6a6dfc46619ff595ee09..088635c0b04a549b762d1a3b28efad82f4bb5c80 100644 (file)
--- a/bam2bcf.c
+++ b/bam2bcf.c
@@ -14,18 +14,13 @@ extern      void ks_introsort_uint32_t(size_t n, uint32_t a[]);
  
  #define CAP_DIST 25
  
  
  #define CAP_DIST 25
  
-struct __bcf_callaux_t {
-       int max_bases, capQ, min_baseQ;
-       uint16_t *bases;
-       errmod_t *e;
-};
-
  bcf_callaux_t *bcf_call_init(double theta, int min_baseQ)
  {
         bcf_callaux_t *bca;
         if (theta <= 0.) theta = CALL_DEFTHETA;
         bca = calloc(1, sizeof(bcf_callaux_t));
         bca->capQ = 60;
  bcf_callaux_t *bcf_call_init(double theta, int min_baseQ)
  {
         bcf_callaux_t *bca;
         if (theta <= 0.) theta = CALL_DEFTHETA;
         bca = calloc(1, sizeof(bcf_callaux_t));
         bca->capQ = 60;
+       bca->openQ = 40; bca->extQ = 20; bca->tandemQ = 100;
         bca->min_baseQ = min_baseQ;
         bca->e = errmod_init(1. - theta);
         return bca;
         bca->min_baseQ = min_baseQ;
         bca->e = errmod_init(1. - theta);
         return bca;
@@ -35,16 +30,19 @@ void bcf_call_destroy(bcf_callaux_t *bca)
  {
         if (bca == 0) return;
         errmod_destroy(bca->e);
  {
         if (bca == 0) return;
         errmod_destroy(bca->e);
-       free(bca->bases); free(bca);
+       free(bca->bases); free(bca->inscns); free(bca);
  }
  }
-
-int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf_callaux_t *bca, bcf_callret1_t *r)
+/* ref_base is the 4-bit representation of the reference base. It is
+ * negative if we are looking at an indel. */
+int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base, bcf_callaux_t *bca, bcf_callret1_t *r)
  {
  {
-       int i, n, ref4;
+       int i, n, ref4, is_indel, ori_depth = 0;
         memset(r, 0, sizeof(bcf_callret1_t));
         memset(r, 0, sizeof(bcf_callret1_t));
-       ref4 = bam_nt16_nt4_table[ref_base];
+       if (ref_base >= 0) {
+               ref4 = bam_nt16_nt4_table[ref_base];
+               is_indel = 0;
+       } else ref4 = 4, is_indel = 1;
         if (_n == 0) return -1;
         if (_n == 0) return -1;
-
         // enlarge the bases array if necessary
         if (bca->max_bases < _n) {
                 bca->max_bases = _n;
         // enlarge the bases array if necessary
         if (bca->max_bases < _n) {
                 bca->max_bases = _n;
@@ -55,21 +53,29 @@ int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf
         memset(r, 0, sizeof(bcf_callret1_t));
         for (i = n = 0; i < _n; ++i) {
                 const bam_pileup1_t *p = pl + i;
         memset(r, 0, sizeof(bcf_callret1_t));
         for (i = n = 0; i < _n; ++i) {
                 const bam_pileup1_t *p = pl + i;
-               int q, b, mapQ, baseQ, is_diff, min_dist;
+               int q, b, mapQ, baseQ, is_diff, min_dist, seqQ;
                 // set base
                 // set base
-               if (p->is_del || (p->b->core.flag&BAM_FUNMAP)) continue; // skip unmapped reads and deleted bases
-               baseQ = q = (int)bam1_qual(p->b)[p->qpos]; // base quality
+               if (p->is_del || p->is_refskip || (p->b->core.flag&BAM_FUNMAP)) continue;
+               ++ori_depth;
+               baseQ = q = is_indel? p->aux&0xff : (int)bam1_qual(p->b)[p->qpos]; // base/indel quality
+               seqQ = is_indel? (p->aux>>8&0xff) : 99;
                 if (q < bca->min_baseQ) continue;
                 if (q < bca->min_baseQ) continue;
+               if (q > seqQ) q = seqQ;
                 mapQ = p->b->core.qual < bca->capQ? p->b->core.qual : bca->capQ;
                 if (q > mapQ) q = mapQ;
                 if (q > 63) q = 63;
                 if (q < 4) q = 4;
                 mapQ = p->b->core.qual < bca->capQ? p->b->core.qual : bca->capQ;
                 if (q > mapQ) q = mapQ;
                 if (q > 63) q = 63;
                 if (q < 4) q = 4;
-               b = bam1_seqi(bam1_seq(p->b), p->qpos); // base
-               b = bam_nt16_nt4_table[b? b : ref_base]; // b is the 2-bit base
+               if (!is_indel) {
+                       b = bam1_seqi(bam1_seq(p->b), p->qpos); // base
+                       b = bam_nt16_nt4_table[b? b : ref_base]; // b is the 2-bit base
+                       is_diff = (ref4 < 4 && b == ref4)? 0 : 1;
+               } else {
+                       b = p->aux>>16&0x3f;
+                       is_diff = (b != 0);
+               }
                 bca->bases[n++] = q<<5 | (int)bam1_strand(p->b)<<4 | b;
                 // collect annotations
                 r->qsum[b] += q;
                 bca->bases[n++] = q<<5 | (int)bam1_strand(p->b)<<4 | b;
                 // collect annotations
                 r->qsum[b] += q;
-               is_diff = (ref4 < 4 && b == ref4)? 0 : 1;
                 ++r->anno[0<<2|is_diff<<1|bam1_strand(p->b)];
                 min_dist = p->b->core.l_qseq - 1 - p->qpos;
                 if (min_dist > p->qpos) min_dist = p->qpos;
                 ++r->anno[0<<2|is_diff<<1|bam1_strand(p->b)];
                 min_dist = p->b->core.l_qseq - 1 - p->qpos;
                 if (min_dist > p->qpos) min_dist = p->qpos;
@@ -81,7 +87,7 @@ int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf
                 r->anno[3<<2|is_diff<<1|0] += min_dist;
                 r->anno[3<<2|is_diff<<1|1] += min_dist * min_dist;
         }
                 r->anno[3<<2|is_diff<<1|0] += min_dist;
                 r->anno[3<<2|is_diff<<1|1] += min_dist * min_dist;
         }
-       r->depth = n;
+       r->depth = n; r->ori_depth = ori_depth;
         // glfgen
         errmod_cal(bca->e, n, 5, bca->bases, r->p);
         return r->depth;
         // glfgen
         errmod_cal(bca->e, n, 5, bca->bases, r->p);
         return r->depth;
@@ -91,8 +97,10 @@ int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/,
  {
         int ref4, i, j, qsum[4];
         int64_t tmp;
  {
         int ref4, i, j, qsum[4];
         int64_t tmp;
-       call->ori_ref = ref4 = bam_nt16_nt4_table[ref_base];
-       if (ref4 > 4) ref4 = 4;
+       if (ref_base >= 0) {
+               call->ori_ref = ref4 = bam_nt16_nt4_table[ref_base];
+               if (ref4 > 4) ref4 = 4;
+       } else call->ori_ref = -1, ref4 = 0;
         // calculate qsum
         memset(qsum, 0, 4 * sizeof(int));
         for (i = 0; i < n; ++i)
         // calculate qsum
         memset(qsum, 0, 4 * sizeof(int));
         for (i = 0; i < n; ++i)
@@ -113,9 +121,14 @@ int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/,
                         else break;
                 }
         }
                         else break;
                 }
         }
-       if (((ref4 < 4 && j < 4) || (ref4 == 4 && j < 5)) && i >= 0)
-               call->unseen = j, call->a[j++] = qsum[i]&3;
-       call->n_alleles = j;
+       if (ref_base >= 0) { // for SNPs, find the "unseen" base
+               if (((ref4 < 4 && j < 4) || (ref4 == 4 && j < 5)) && i >= 0)
+                       call->unseen = j, call->a[j++] = qsum[i]&3;
+               call->n_alleles = j;
+       } else {
+               call->n_alleles = j;
+               if (call->n_alleles == 1) return -1; // no reliable supporting read. stop doing anything
+       }
         // set the PL array
         if (call->n < n) {
                 call->n = n;
         // set the PL array
         if (call->n < n) {
                 call->n = n;
@@ -143,44 +156,101 @@ int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/,
                                 PL[j] = y;
                         }
                 }
                                 PL[j] = y;
                         }
                 }
+//             if (ref_base < 0) fprintf(stderr, "%d,%d,%f,%d\n", call->n_alleles, x, sum_min, call->unseen);
                 call->shift = (int)(sum_min + .499);
         }
         // combine annotations
         memset(call->anno, 0, 16 * sizeof(int));
                 call->shift = (int)(sum_min + .499);
         }
         // combine annotations
         memset(call->anno, 0, 16 * sizeof(int));
-       for (i = call->depth = 0, tmp = 0; i < n; ++i) {
+       for (i = call->depth = call->ori_depth = 0, tmp = 0; i < n; ++i) {
                 call->depth += calls[i].depth;
                 call->depth += calls[i].depth;
+               call->ori_depth += calls[i].ori_depth;
                 for (j = 0; j < 16; ++j) call->anno[j] += calls[i].anno[j];
         }
         return 0;
  }
  
                 for (j = 0; j < 16; ++j) call->anno[j] += calls[i].anno[j];
         }
         return 0;
  }
  
-int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b)
+int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b, bcf_callret1_t *bcr, int is_SP,
+                                const bcf_callaux_t *bca, const char *ref)
  {
  {
+       extern double kt_fisher_exact(int n11, int n12, int n21, int n22, double *_left, double *_right, double *two);
         kstring_t s;
         kstring_t s;
-       int i;
+       int i, j;
         b->n_smpl = bc->n;
         b->tid = tid; b->pos = pos; b->qual = 0;
         s.s = b->str; s.m = b->m_str; s.l = 0;
         kputc('\0', &s);
         b->n_smpl = bc->n;
         b->tid = tid; b->pos = pos; b->qual = 0;
         s.s = b->str; s.m = b->m_str; s.l = 0;
         kputc('\0', &s);
-       kputc("ACGTN"[bc->ori_ref], &s); kputc('\0', &s);
-       for (i = 1; i < 5; ++i) {
-               if (bc->a[i] < 0) break;
-               if (i > 1) kputc(',', &s);
-               kputc(bc->unseen == i? 'X' : "ACGT"[bc->a[i]], &s);
+       if (bc->ori_ref < 0) { // an indel
+               // write REF
+               kputc(ref[pos], &s);
+               for (j = 0; j < bca->indelreg; ++j) kputc(ref[pos+1+j], &s);
+               kputc('\0', &s);
+               // write ALT
+               kputc(ref[pos], &s);
+               for (i = 1; i < 4; ++i) {
+                       if (bc->a[i] < 0) break;
+                       if (i > 1) {
+                               kputc(',', &s); kputc(ref[pos], &s);
+                       }
+                       if (bca->indel_types[bc->a[i]] < 0) { // deletion
+                               for (j = -bca->indel_types[bc->a[i]]; j < bca->indelreg; ++j)
+                                       kputc(ref[pos+1+j], &s);
+                       } else { // insertion; cannot be a reference unless a bug
+                               char *inscns = &bca->inscns[bc->a[i] * bca->maxins];
+                               for (j = 0; j < bca->indel_types[bc->a[i]]; ++j)
+                                       kputc("ACGTN"[(int)inscns[j]], &s);
+                               for (j = 0; j < bca->indelreg; ++j) kputc(ref[pos+1+j], &s);
+                       }
+               }
+               kputc('\0', &s);
+       } else { // a SNP
+               kputc("ACGTN"[bc->ori_ref], &s); kputc('\0', &s);
+               for (i = 1; i < 5; ++i) {
+                       if (bc->a[i] < 0) break;
+                       if (i > 1) kputc(',', &s);
+                       kputc(bc->unseen == i? 'X' : "ACGT"[bc->a[i]], &s);
+               }
+               kputc('\0', &s);
         }
         kputc('\0', &s);
         }
         kputc('\0', &s);
-       kputc('\0', &s);
         // INFO
         // INFO
-       kputs("I16=", &s);
+       if (bc->ori_ref < 0) kputs("INDEL;", &s);
+       kputs("DP=", &s); kputw(bc->ori_depth, &s); kputs(";I16=", &s);
         for (i = 0; i < 16; ++i) {
                 if (i) kputc(',', &s);
                 kputw(bc->anno[i], &s);
         }
         kputc('\0', &s);
         // FMT
         for (i = 0; i < 16; ++i) {
                 if (i) kputc(',', &s);
                 kputw(bc->anno[i], &s);
         }
         kputc('\0', &s);
         // FMT
-       kputs("PL", &s); kputc('\0', &s);
+       kputs("PL", &s);
+       if (bcr) {
+               kputs(":DP", &s);
+               if (is_SP) kputs(":SP", &s);
+       }
+       kputc('\0', &s);
         b->m_str = s.m; b->str = s.s; b->l_str = s.l;
         bcf_sync(b);
         memcpy(b->gi[0].data, bc->PL, b->gi[0].len * bc->n);
         b->m_str = s.m; b->str = s.s; b->l_str = s.l;
         bcf_sync(b);
         memcpy(b->gi[0].data, bc->PL, b->gi[0].len * bc->n);
+       if (bcr) {
+               uint16_t *dp = (uint16_t*)b->gi[1].data;
+               uint8_t *sp = is_SP? b->gi[2].data : 0;
+               for (i = 0; i < bc->n; ++i) {
+                       bcf_callret1_t *p = bcr + i;
+                       dp[i] = p->depth < 0xffff? p->depth : 0xffff;
+                       if (is_SP) {
+                               if (p->anno[0] + p->anno[1] < 2 || p->anno[2] + p->anno[3] < 2
+                                       || p->anno[0] + p->anno[2] < 2 || p->anno[1] + p->anno[3] < 2)
+                               {
+                                       sp[i] = 0;
+                               } else {
+                                       double left, right, two;
+                                       int x;
+                                       kt_fisher_exact(p->anno[0], p->anno[1], p->anno[2], p->anno[3], &left, &right, &two);
+                                       x = (int)(-4.343 * log(two) + .499);
+                                       if (x > 255) x = 255;
+                                       sp[i] = x;
+                               }
+                       }
+               }
+       }
         return 0;
  }
         return 0;
  }
diff --git a/bam2bcf.h b/bam2bcf.h

index 0dddb924dcf62fee4f1352a9a20eecb7001da450..26b022c08e89e3597bd6e400690d42bd7ecabe38 100644 (file)
--- a/bam2bcf.h
+++ b/bam2bcf.h
@@ -2,13 +2,26 @@
  #define BAM2BCF_H
  
  #include <stdint.h>
  #define BAM2BCF_H
  
  #include <stdint.h>
+#include "errmod.h"
  #include "bcftools/bcf.h"
  
  #include "bcftools/bcf.h"
  
-struct __bcf_callaux_t;
-typedef struct __bcf_callaux_t bcf_callaux_t;
+#define B2B_INDEL_NULL 10000
+
+typedef struct __bcf_callaux_t {
+       int capQ, min_baseQ;
+       int openQ, extQ, tandemQ;
+       // for internal uses
+       int max_bases;
+       int indel_types[4];
+       int maxins, indelreg;
+       char *inscns;
+       uint16_t *bases;
+       errmod_t *e;
+       void *rghash;
+} bcf_callaux_t;
  
  typedef struct {
  
  typedef struct {
-       int depth, qsum[4];
+       int depth, ori_depth, qsum[4];
         int anno[16];
         float p[25];
  } bcf_callret1_t;
         int anno[16];
         float p[25];
  } bcf_callret1_t;
@@ -16,7 +29,7 @@ typedef struct {
  typedef struct {
         int a[5]; // alleles: ref, alt, alt2, alt3
         int n, n_alleles, shift, ori_ref, unseen;
  typedef struct {
         int a[5]; // alleles: ref, alt, alt2, alt3
         int n, n_alleles, shift, ori_ref, unseen;
-       int anno[16], depth;
+       int anno[16], depth, ori_depth;
         uint8_t *PL;
  } bcf_call_t;
  
         uint8_t *PL;
  } bcf_call_t;
  
@@ -26,9 +39,12 @@ extern "C" {
  
         bcf_callaux_t *bcf_call_init(double theta, int min_baseQ);
         void bcf_call_destroy(bcf_callaux_t *bca);
  
         bcf_callaux_t *bcf_call_init(double theta, int min_baseQ);
         void bcf_call_destroy(bcf_callaux_t *bca);
-       int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf_callaux_t *bca, bcf_callret1_t *r);
+       int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base, bcf_callaux_t *bca, bcf_callret1_t *r);
         int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/, bcf_call_t *call);
         int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/, bcf_call_t *call);
-       int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b);
+       int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b, bcf_callret1_t *bcr, int is_SP,
+                                        const bcf_callaux_t *bca, const char *ref);
+       int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_callaux_t *bca, const char *ref,
+                                                 const void *rghash);
  
  #ifdef __cplusplus
  }
  
  #ifdef __cplusplus
  }
diff --git a/bam2bcf_indel.c b/bam2bcf_indel.c

new file mode 100644 (file)

index 0000000..819c1e7
--- /dev/null
+++ b/bam2bcf_indel.c
@@ -0,0 +1,410 @@
+#include <assert.h>
+#include <ctype.h>
+#include <string.h>
+#include "bam.h"
+#include "bam2bcf.h"
+#include "ksort.h"
+#include "kaln.h"
+#include "kprobaln.h"
+#include "khash.h"
+KHASH_SET_INIT_STR(rg)
+
+#define MINUS_CONST 0x10000000
+#define INDEL_WINDOW_SIZE 50
+#define MIN_SUPPORT_COEF 500
+
+void *bcf_call_add_rg(void *_hash, const char *hdtext, const char *list)
+{
+       const char *s, *p, *q, *r, *t;
+       khash_t(rg) *hash;
+       if (list == 0 || hdtext == 0) return _hash;
+       if (_hash == 0) _hash = kh_init(rg);
+       hash = (khash_t(rg)*)_hash;
+       if ((s = strstr(hdtext, "@RG\t")) == 0) return hash;
+       do {
+               t = strstr(s + 4, "@RG\t"); // the next @RG
+               if ((p = strstr(s, "\tID:")) != 0) p += 4;
+               if ((q = strstr(s, "\tPL:")) != 0) q += 4;
+               if (p && q && (t == 0 || (p < t && q < t))) { // ID and PL are both present
+                       int lp, lq;
+                       char *x;
+                       for (r = p; *r && *r != '\t' && *r != '\n'; ++r); lp = r - p;
+                       for (r = q; *r && *r != '\t' && *r != '\n'; ++r); lq = r - q;
+                       x = calloc((lp > lq? lp : lq) + 1, 1);
+                       for (r = q; *r && *r != '\t' && *r != '\n'; ++r) x[r-q] = *r;
+                       if (strstr(list, x)) { // insert ID to the hash table
+                               khint_t k;
+                               int ret;
+                               for (r = p; *r && *r != '\t' && *r != '\n'; ++r) x[r-p] = *r;
+                               x[r-p] = 0;
+                               k = kh_get(rg, hash, x);
+                               if (k == kh_end(hash)) k = kh_put(rg, hash, x, &ret);
+                               else free(x);
+                       } else free(x);
+               }
+               s = t;
+       } while (s);
+       return hash;
+}
+
+void bcf_call_del_rghash(void *_hash)
+{
+       khint_t k;
+       khash_t(rg) *hash = (khash_t(rg)*)_hash;
+       if (hash == 0) return;
+       for (k = kh_begin(hash); k < kh_end(hash); ++k)
+               if (kh_exist(hash, k))
+                       free((char*)kh_key(hash, k));
+       kh_destroy(rg, hash);
+}
+
+static int tpos2qpos(const bam1_core_t *c, const uint32_t *cigar, int32_t tpos, int is_left, int32_t *_tpos)
+{
+       int k, x = c->pos, y = 0, last_y = 0;
+       *_tpos = c->pos;
+       for (k = 0; k < c->n_cigar; ++k) {
+               int op = cigar[k] & BAM_CIGAR_MASK;
+               int l = cigar[k] >> BAM_CIGAR_SHIFT;
+               if (op == BAM_CMATCH) {
+                       if (c->pos > tpos) return y;
+                       if (x + l > tpos) {
+                               *_tpos = tpos;
+                               return y + (tpos - x);
+                       }
+                       x += l; y += l;
+                       last_y = y;
+               } else if (op == BAM_CINS || op == BAM_CSOFT_CLIP) y += l;
+               else if (op == BAM_CDEL || op == BAM_CREF_SKIP) {
+                       if (x + l > tpos) {
+                               *_tpos = is_left? x : x + l;
+                               return y;
+                       }
+                       x += l;
+               }
+       }
+       *_tpos = x;
+       return last_y;
+}
+// FIXME: check if the inserted sequence is consistent with the homopolymer run
+// l is the relative gap length and l_run is the length of the homopolymer on the reference
+static inline int est_seqQ(const bcf_callaux_t *bca, int l, int l_run)
+{
+       int q, qh;
+       q = bca->openQ + bca->extQ * (abs(l) - 1);
+       qh = l_run >= 3? (int)(bca->tandemQ * (double)abs(l) / l_run + .499) : 1000;
+       return q < qh? q : qh;
+}
+
+static inline int est_indelreg(int pos, const char *ref, int l, char *ins4)
+{
+       int i, j, max = 0, max_i = pos, score = 0;
+       l = abs(l);
+       for (i = pos + 1, j = 0; ref[i]; ++i, ++j) {
+               if (ins4) score += (toupper(ref[i]) != "ACGTN"[(int)ins4[j%l]])? -10 : 1;
+               else score += (toupper(ref[i]) != toupper(ref[pos+1+j%l]))? -10 : 1;
+               if (score < 0) break;
+               if (max < score) max = score, max_i = i;
+       }
+       return max_i - pos;
+}
+
+int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_callaux_t *bca, const char *ref,
+                                         const void *rghash)
+{
+       extern void ks_introsort_uint32_t(int, uint32_t*);
+       int i, s, j, k, t, n_types, *types, max_rd_len, left, right, max_ins, *score1, *score2;
+       int N, K, l_run, ref_type, n_alt;
+       char *inscns = 0, *ref2, *query;
+       khash_t(rg) *hash = (khash_t(rg)*)rghash;
+       if (ref == 0 || bca == 0) return -1;
+       // mark filtered reads
+       if (rghash) {
+               N = 0;
+               for (s = N = 0; s < n; ++s) {
+                       for (i = 0; i < n_plp[s]; ++i) {
+                               bam_pileup1_t *p = plp[s] + i;
+                               const uint8_t *rg = bam_aux_get(p->b, "RG");
+                               p->aux = 1; // filtered by default
+                               if (rg) {
+                                       khint_t k = kh_get(rg, hash, (const char*)(rg + 1));
+                                       if (k != kh_end(hash)) p->aux = 0, ++N; // not filtered
+                               }
+                       }
+               }
+               if (N == 0) return -1; // no reads left
+       }
+       // determine if there is a gap
+       for (s = N = 0; s < n; ++s) {
+               for (i = 0; i < n_plp[s]; ++i)
+                       if (plp[s][i].indel != 0) break;
+               if (i < n_plp[s]) break;
+       }
+       if (s == n) return -1; // there is no indel at this position.
+       for (s = N = 0; s < n; ++s) N += n_plp[s]; // N is the total number of reads
+       { // find out how many types of indels are present
+               int m, n_alt = 0, n_tot = 0;
+               uint32_t *aux;
+               aux = calloc(N + 1, 4);
+               m = max_rd_len = 0;
+               aux[m++] = MINUS_CONST; // zero indel is always a type
+               for (s = 0; s < n; ++s) {
+                       for (i = 0; i < n_plp[s]; ++i) {
+                               const bam_pileup1_t *p = plp[s] + i;
+                               if (rghash == 0 || p->aux == 0) {
+                                       ++n_tot;
+                                       if (p->indel != 0) {
+                                               ++n_alt;
+                                               aux[m++] = MINUS_CONST + p->indel;
+                                       }
+                               }
+                               j = bam_cigar2qlen(&p->b->core, bam1_cigar(p->b));
+                               if (j > max_rd_len) max_rd_len = j;
+                       }
+               }
+               ks_introsort(uint32_t, m, aux);
+               // squeeze out identical types
+               for (i = 1, n_types = 1; i < m; ++i)
+                       if (aux[i] != aux[i-1]) ++n_types;
+               if (n_types == 1 || n_alt * MIN_SUPPORT_COEF < n_tot) { // no indels or too few supporting reads
+                       free(aux); return -1;
+               }
+               types = (int*)calloc(n_types, sizeof(int));
+               t = 0;
+               types[t++] = aux[0] - MINUS_CONST; 
+               for (i = 1; i < m; ++i)
+                       if (aux[i] != aux[i-1])
+                               types[t++] = aux[i] - MINUS_CONST;
+               free(aux);
+               for (t = 0; t < n_types; ++t)
+                       if (types[t] == 0) break;
+               ref_type = t; // the index of the reference type (0)
+               assert(n_types < 64);
+       }
+       { // calculate left and right boundary
+               left = pos > INDEL_WINDOW_SIZE? pos - INDEL_WINDOW_SIZE : 0;
+               right = pos + INDEL_WINDOW_SIZE;
+               if (types[0] < 0) right -= types[0];
+               // in case the alignments stand out the reference
+               for (i = pos; i < right; ++i)
+                       if (ref[i] == 0) break;
+               right = i;
+       }
+       { // the length of the homopolymer run around the current position
+               int c = bam_nt16_table[(int)ref[pos + 1]];
+               if (c == 15) l_run = 1;
+               else {
+                       for (i = pos + 2; ref[i]; ++i)
+                               if (bam_nt16_table[(int)ref[i]] != c) break;
+                       l_run = i;
+                       for (i = pos; i >= 0; --i)
+                               if (bam_nt16_table[(int)ref[i]] != c) break;
+                       l_run -= i + 1;
+               }
+       }
+       // construct the consensus sequence
+       max_ins = types[n_types - 1]; // max_ins is at least 0
+       if (max_ins > 0) {
+               int *inscns_aux = calloc(4 * n_types * max_ins, sizeof(int));
+               // count the number of occurrences of each base at each position for each type of insertion
+               for (t = 0; t < n_types; ++t) {
+                       if (types[t] > 0) {
+                               for (s = 0; s < n; ++s) {
+                                       for (i = 0; i < n_plp[s]; ++i) {
+                                               bam_pileup1_t *p = plp[s] + i;
+                                               if (p->indel == types[t]) {
+                                                       uint8_t *seq = bam1_seq(p->b);
+                                                       for (k = 1; k <= p->indel; ++k) {
+                                                               int c = bam_nt16_nt4_table[bam1_seqi(seq, p->qpos + k)];
+                                                               if (c < 4) ++inscns_aux[(t*max_ins+(k-1))*4 + c];
+                                                       }
+                                               }
+                                       }
+                               }
+                       }
+               }
+               // use the majority rule to construct the consensus
+               inscns = calloc(n_types * max_ins, 1);
+               for (t = 0; t < n_types; ++t) {
+                       for (j = 0; j < types[t]; ++j) {
+                               int max = 0, max_k = -1, *ia = &inscns_aux[(t*max_ins+j)*4];
+                               for (k = 0; k < 4; ++k)
+                                       if (ia[k] > max)
+                                               max = ia[k], max_k = k;
+                               inscns[t*max_ins + j] = max? max_k : 4;
+                       }
+               }
+               free(inscns_aux);
+       }
+       // compute the likelihood given each type of indel for each read
+       ref2  = calloc(right - left + max_ins + 2, 1);
+       query = calloc(right - left + max_rd_len + max_ins + 2, 1);
+       score1 = calloc(N * n_types, sizeof(int));
+       score2 = calloc(N * n_types, sizeof(int));
+       bca->indelreg = 0;
+       for (t = 0; t < n_types; ++t) {
+               int l, ir;
+               kpa_par_t apf1 = { 1e-4, 1e-2, 10 }, apf2 = { 1e-6, 1e-3, 10 };
+               apf1.bw = apf2.bw = abs(types[t]) + 3;
+               // compute indelreg
+               if (types[t] == 0) ir = 0;
+               else if (types[t] > 0) ir = est_indelreg(pos, ref, types[t], &inscns[t*max_ins]);
+               else ir = est_indelreg(pos, ref, -types[t], 0);
+               if (ir > bca->indelreg) bca->indelreg = ir;
+//             fprintf(stderr, "%d, %d, %d\n", pos, types[t], ir);
+               // write ref2
+               for (k = 0, j = left; j <= pos; ++j)
+                       ref2[k++] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[j]]];
+               if (types[t] <= 0) j += -types[t];
+               else for (l = 0; l < types[t]; ++l)
+                                ref2[k++] = inscns[t*max_ins + l];
+               if (types[0] < 0) { // mask deleted sequences to avoid a particular error in the model.
+                       int jj, tmp = types[t] >= 0? -types[0] : -types[0] + types[t];
+                       for (jj = 0; jj < tmp && j < right && ref[j]; ++jj, ++j)
+                               ref2[k++] = 4;
+               }
+               for (; j < right && ref[j]; ++j)
+                       ref2[k++] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[j]]];
+               if (j < right) right = j;
+               // align each read to ref2
+               for (s = K = 0; s < n; ++s) {
+                       for (i = 0; i < n_plp[s]; ++i, ++K) {
+                               bam_pileup1_t *p = plp[s] + i;
+                               int qbeg, qend, tbeg, tend, sc;
+                               uint8_t *seq = bam1_seq(p->b);
+                               // FIXME: the following skips soft clips, but using them may be more sensitive.
+                               // determine the start and end of sequences for alignment
+                               qbeg = tpos2qpos(&p->b->core, bam1_cigar(p->b), left,  0, &tbeg);
+                               qend = tpos2qpos(&p->b->core, bam1_cigar(p->b), right, 1, &tend);
+                               if (types[t] < 0) {
+                                       int l = -types[t];
+                                       tbeg = tbeg - l > left?  tbeg - l : left;
+                               }
+                               // write the query sequence
+                               for (l = qbeg; l < qend; ++l)
+                                       query[l - qbeg] = bam_nt16_nt4_table[bam1_seqi(seq, l)];
+                               { // do realignment; this is the bottleneck
+                                       const uint8_t *qual = bam1_qual(p->b), *bq;
+                                       uint8_t *qq;
+                                       qq = calloc(qend - qbeg, 1);
+                                       bq = (uint8_t*)bam_aux_get(p->b, "ZQ");
+                                       if (bq) ++bq; // skip type
+                                       for (l = qbeg; l < qend; ++l) {
+                                               qq[l - qbeg] = bq? qual[l] + (bq[l] - 64) : qual[l];
+                                               if (qq[l - qbeg] > 30) qq[l - qbeg] = 30;
+                                               if (qq[l - qbeg] < 7) qq[l - qbeg] = 7;
+                                       }
+                                       sc = kpa_glocal((uint8_t*)ref2 + tbeg - left, tend - tbeg + abs(types[t]),
+                                                                       (uint8_t*)query, qend - qbeg, qq, &apf1, 0, 0);
+                                       l = (int)(100. * sc / (qend - qbeg) + .499); // used for adjusting indelQ below
+                                       if (l > 255) l = 255;
+                                       score1[K*n_types + t] = score2[K*n_types + t] = sc<<8 | l;
+                                       if (sc > 5) {
+                                               sc = kpa_glocal((uint8_t*)ref2 + tbeg - left, tend - tbeg + abs(types[t]),
+                                                                               (uint8_t*)query, qend - qbeg, qq, &apf2, 0, 0);
+                                               l = (int)(100. * sc / (qend - qbeg) + .499);
+                                               if (l > 255) l = 255;
+                                               score2[K*n_types + t] = sc<<8 | l;
+                                       }
+                                       free(qq);
+                               }
+/*
+                               for (l = 0; l < tend - tbeg + abs(types[t]); ++l)
+                                       fputc("ACGTN"[(int)ref2[tbeg-left+l]], stderr);
+                               fputc('\n', stderr);
+                               for (l = 0; l < qend - qbeg; ++l) fputc("ACGTN"[(int)query[l]], stderr);
+                               fputc('\n', stderr);
+                               fprintf(stderr, "pos=%d type=%d read=%d:%d name=%s qbeg=%d tbeg=%d score=%d\n", pos, types[t], s, i, bam1_qname(p->b), qbeg, tbeg, sc);
+*/
+                       }
+               }
+       }
+       free(ref2); free(query);
+       { // compute indelQ
+               int *sc, tmp, *sumq;
+               sc   = alloca(n_types * sizeof(int));
+               sumq = alloca(n_types * sizeof(int));
+               memset(sumq, 0, sizeof(int) * n_types);
+               for (s = K = 0; s < n; ++s) {
+                       for (i = 0; i < n_plp[s]; ++i, ++K) {
+                               bam_pileup1_t *p = plp[s] + i;
+                               int *sct = &score1[K*n_types], indelQ1, indelQ2, seqQ, indelQ;
+                               for (t = 0; t < n_types; ++t) sc[t] = sct[t]<<6 | t;
+                               for (t = 1; t < n_types; ++t) // insertion sort
+                                       for (j = t; j > 0 && sc[j] < sc[j-1]; --j)
+                                               tmp = sc[j], sc[j] = sc[j-1], sc[j-1] = tmp;
+                               /* errmod_cal() assumes that if the call is wrong, the
+                                * likelihoods of other events are equal. This is about
+                                * right for substitutions, but is not desired for
+                                * indels. To reuse errmod_cal(), I have to make
+                                * compromise for multi-allelic indels.
+                                */
+                               if ((sc[0]&0x3f) == ref_type) {
+                                       indelQ1 = (sc[1]>>14) - (sc[0]>>14);
+                                       seqQ = est_seqQ(bca, types[sc[1]&0x3f], l_run);
+                               } else {
+                                       for (t = 0; t < n_types; ++t) // look for the reference type
+                                               if ((sc[t]&0x3f) == ref_type) break;
+                                       indelQ1 = (sc[t]>>14) - (sc[0]>>14);
+                                       seqQ = est_seqQ(bca, types[sc[0]&0x3f], l_run);
+                               }
+                               tmp = sc[0]>>6 & 0xff;
+                               indelQ1 = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ1 + .499); // reduce indelQ
+                               sct = &score2[K*n_types];
+                               for (t = 0; t < n_types; ++t) sc[t] = sct[t]<<6 | t;
+                               for (t = 1; t < n_types; ++t) // insertion sort
+                                       for (j = t; j > 0 && sc[j] < sc[j-1]; --j)
+                                               tmp = sc[j], sc[j] = sc[j-1], sc[j-1] = tmp;
+                               if ((sc[0]&0x3f) == ref_type) {
+                                       indelQ2 = (sc[1]>>14) - (sc[0]>>14);
+                               } else {
+                                       for (t = 0; t < n_types; ++t) // look for the reference type
+                                               if ((sc[t]&0x3f) == ref_type) break;
+                                       indelQ2 = (sc[t]>>14) - (sc[0]>>14);
+                               }
+                               tmp = sc[0]>>6 & 0xff;
+                               indelQ2 = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ2 + .499);
+                               // pick the smaller between indelQ1 and indelQ2
+                               indelQ = indelQ1 < indelQ2? indelQ1 : indelQ2;
+                               p->aux = (sc[0]&0x3f)<<16 | seqQ<<8 | indelQ;
+                               sumq[sc[0]&0x3f] += indelQ < seqQ? indelQ : seqQ;
+//                             fprintf(stderr, "pos=%d read=%d:%d name=%s call=%d q=%d\n", pos, s, i, bam1_qname(p->b), types[sc[0]&0x3f], indelQ);
+                       }
+               }
+               // determine bca->indel_types[] and bca->inscns
+               bca->maxins = max_ins;
+               bca->inscns = realloc(bca->inscns, bca->maxins * 4);
+               for (t = 0; t < n_types; ++t)
+                       sumq[t] = sumq[t]<<6 | t;
+               for (t = 1; t < n_types; ++t) // insertion sort
+                       for (j = t; j > 0 && sumq[j] > sumq[j-1]; --j)
+                               tmp = sumq[j], sumq[j] = sumq[j-1], sumq[j-1] = tmp;
+               for (t = 0; t < n_types; ++t) // look for the reference type
+                       if ((sumq[t]&0x3f) == ref_type) break;
+               if (t) { // then move the reference type to the first
+                       tmp = sumq[t];
+                       for (; t > 0; --t) sumq[t] = sumq[t-1];
+                       sumq[0] = tmp;
+               }
+               for (t = 0; t < 4; ++t) bca->indel_types[t] = B2B_INDEL_NULL;
+               for (t = 0; t < 4 && t < n_types; ++t) {
+                       bca->indel_types[t] = types[sumq[t]&0x3f];
+                       memcpy(&bca->inscns[t * bca->maxins], &inscns[(sumq[t]&0x3f) * max_ins], bca->maxins);
+               }
+               // update p->aux
+               for (s = n_alt = 0; s < n; ++s) {
+                       for (i = 0; i < n_plp[s]; ++i) {
+                               bam_pileup1_t *p = plp[s] + i;
+                               int x = types[p->aux>>16&0x3f];
+                               for (j = 0; j < 4; ++j)
+                                       if (x == bca->indel_types[j]) break;
+                               p->aux = j<<16 | (j == 4? 0 : (p->aux&0xffff));
+                               if ((p->aux>>16&0x3f) > 0) ++n_alt;
+//                             fprintf(stderr, "X pos=%d read=%d:%d name=%s call=%d type=%d q=%d seqQ=%d\n", pos, s, i, bam1_qname(p->b), p->aux>>16&63, bca->indel_types[p->aux>>16&63], p->aux&0xff, p->aux>>8&0xff);
+                       }
+               }               
+       }
+       free(score1); free(score2);
+       // free
+       free(types); free(inscns);
+       return n_alt > 0? 0 : -1;
+}
diff --git a/bam_maqcns.c b/bam_maqcns.c

index 2f3fc082be79aa0fa3feae9c1d9a03ad44592877..4fbc6c6d0cd14f54ac7d75f07f789cbdb73e4362 100644 (file)
--- a/bam_maqcns.c
+++ b/bam_maqcns.c
@@ -154,7 +154,7 @@ glf1_t *bam_maqcns_glfgen(int _n, const bam_pileup1_t *pl, uint8_t ref_base, bam
                 const bam_pileup1_t *p = pl + i;
                 uint32_t q, x = 0, qq;
                 uint16_t y = 0;
                 const bam_pileup1_t *p = pl + i;
                 uint32_t q, x = 0, qq;
                 uint16_t y = 0;
-               if (p->is_del || (p->b->core.flag&BAM_FUNMAP)) continue;
+               if (p->is_del || p->is_refskip || (p->b->core.flag&BAM_FUNMAP)) continue;
                 q = (uint32_t)bam1_qual(p->b)[p->qpos];
                 if (q < bm->min_baseQ) continue;
                 x |= (uint32_t)bam1_strand(p->b) << 18 | q << 8 | p->b->core.qual;
                 q = (uint32_t)bam1_qual(p->b)[p->qpos];
                 if (q < bm->min_baseQ) continue;
                 x |= (uint32_t)bam1_strand(p->b) << 18 | q << 8 | p->b->core.qual;
@@ -166,7 +166,7 @@ glf1_t *bam_maqcns_glfgen(int _n, const bam_pileup1_t *pl, uint8_t ref_base, bam
                 y |= q << 5;
                 qq = bam1_seqi(bam1_seq(p->b), p->qpos);
                 q = bam_nt16_nt4_table[qq? qq : ref_base];
                 y |= q << 5;
                 qq = bam1_seqi(bam1_seq(p->b), p->qpos);
                 q = bam_nt16_nt4_table[qq? qq : ref_base];
-               if (!p->is_del && q < 4) x |= 1 << 21 | q << 16, y |= q;
+               if (!p->is_del && !p->is_refskip && q < 4) x |= 1 << 21 | q << 16, y |= q;
                 bm->aux->info16[n] = y;
                 bm->aux->info[n++] = x;
         }
                 bm->aux->info16[n] = y;
                 bm->aux->info[n++] = x;
         }
diff --git a/bam_md.c b/bam_md.c

index f65b845b5d594980d4b4dba36122672c71b19cc9..44d46a492438beecae74fc648c9564a566565be7 100644 (file)
--- a/bam_md.c
+++ b/bam_md.c
@@ -7,6 +7,7 @@
  #include "sam.h"
  #include "kstring.h"
  #include "kaln.h"
  #include "sam.h"
  #include "kstring.h"
  #include "kaln.h"
+#include "kprobaln.h"
  
  void bam_fillmd1_core(bam1_t *b, char *ref, int is_equal, int max_nm)
  {
  
  void bam_fillmd1_core(bam1_t *b, char *ref, int is_equal, int max_nm)
  {
@@ -161,14 +162,35 @@ int bam_cap_mapQ(bam1_t *b, char *ref, int thres)
         return (int)(t + .499);
  }
  
         return (int)(t + .499);
  }
  
-int bam_prob_realn(bam1_t *b, const char *ref)
+int bam_prob_realn_core(bam1_t *b, const char *ref, int apply_baq)
  {
         int k, i, bw, x, y, yb, ye, xb, xe;
         uint32_t *cigar = bam1_cigar(b);
         bam1_core_t *c = &b->core;
  {
         int k, i, bw, x, y, yb, ye, xb, xe;
         uint32_t *cigar = bam1_cigar(b);
         bam1_core_t *c = &b->core;
-       ka_probpar_t conf = ka_probpar_def;
-       // find the start and end of the alignment
-       if (c->flag & BAM_FUNMAP) return -1;
+       kpa_par_t conf = kpa_par_def;
+       uint8_t *bq = 0, *zq = 0, *qual = bam1_qual(b);
+       if (c->flag & BAM_FUNMAP) return -1; // do nothing
+       // test if BQ or ZQ is present
+       if ((bq = bam_aux_get(b, "BQ")) != 0) ++bq;
+       if ((zq = bam_aux_get(b, "ZQ")) != 0 && *zq == 'Z') ++zq;
+       if (bq && zq) { // remove the ZQ tag
+               bam_aux_del(b, zq-1);
+               zq = 0;
+       }
+       if (bq || zq) {
+               if ((apply_baq && zq) || (!apply_baq && bq)) return -3; // in both cases, do nothing
+               if (bq && apply_baq) { // then convert BQ to ZQ
+                       for (i = 0; i < c->l_qseq; ++i)
+                               qual[i] = qual[i] + 64 < bq[i]? 0 : qual[i] - ((int)bq[i] - 64);
+                       *(bq - 3) = 'Z';
+               } else if (zq && !apply_baq) { // then convert ZQ to BQ
+                       for (i = 0; i < c->l_qseq; ++i)
+                               qual[i] += (int)zq[i] - 64;
+                       *(zq - 3) = 'B';
+               }
+               return 0;
+       }
+       // find the start and end of the alignment      
         x = c->pos, y = 0, yb = ye = xb = xe = -1;
         for (k = 0; k < c->n_cigar; ++k) {
                 int op, l;
         x = c->pos, y = 0, yb = ye = xb = xe = -1;
         for (k = 0; k < c->n_cigar; ++k) {
                 int op, l;
@@ -180,7 +202,7 @@ int bam_prob_realn(bam1_t *b, const char *ref)
                         x += l; y += l;
                 } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l;
                 else if (op == BAM_CDEL) x += l;
                         x += l; y += l;
                 } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l;
                 else if (op == BAM_CDEL) x += l;
-               else if (op == BAM_CREF_SKIP) return -1;
+               else if (op == BAM_CREF_SKIP) return -1; // do nothing if there is a reference skip
         }
         // set bandwidth and the start and the end
         bw = 7;
         }
         // set bandwidth and the start and the end
         bw = 7;
@@ -192,8 +214,10 @@ int bam_prob_realn(bam1_t *b, const char *ref)
         if (xe - xb - c->l_qseq > bw)
                 xb += (xe - xb - c->l_qseq - bw) / 2, xe -= (xe - xb - c->l_qseq - bw) / 2;
         { // glocal
         if (xe - xb - c->l_qseq > bw)
                 xb += (xe - xb - c->l_qseq - bw) / 2, xe -= (xe - xb - c->l_qseq - bw) / 2;
         { // glocal
-               uint8_t *s, *r, *q, *seq = bam1_seq(b), *qual = bam1_qual(b);
+               uint8_t *s, *r, *q, *seq = bam1_seq(b), *bq;
                 int *state;
                 int *state;
+               bq = calloc(c->l_qseq + 1, 1);
+               memcpy(bq, qual, c->l_qseq);
                 s = calloc(c->l_qseq, 1);
                 for (i = 0; i < c->l_qseq; ++i) s[i] = bam_nt16_nt4_table[bam1_seqi(seq, i)];
                 r = calloc(xe - xb, 1);
                 s = calloc(c->l_qseq, 1);
                 for (i = 0; i < c->l_qseq; ++i) s[i] = bam_nt16_nt4_table[bam1_seqi(seq, i)];
                 r = calloc(xe - xb, 1);
@@ -201,35 +225,45 @@ int bam_prob_realn(bam1_t *b, const char *ref)
                         r[i-xb] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[i]]];
                 state = calloc(c->l_qseq, sizeof(int));
                 q = calloc(c->l_qseq, 1);
                         r[i-xb] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[i]]];
                 state = calloc(c->l_qseq, sizeof(int));
                 q = calloc(c->l_qseq, 1);
-               ka_prob_glocal(r, xe-xb, s, c->l_qseq, qual, &conf, state, q);
+               kpa_glocal(r, xe-xb, s, c->l_qseq, qual, &conf, state, q);
                 for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) {
                         int op = cigar[k]&0xf, l = cigar[k]>>4;
                         if (op == BAM_CMATCH) {
                                 for (i = y; i < y + l; ++i) {
                 for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) {
                         int op = cigar[k]&0xf, l = cigar[k]>>4;
                         if (op == BAM_CMATCH) {
                                 for (i = y; i < y + l; ++i) {
-                                       if ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y)) qual[i] = 0;
-                                       else qual[i] = qual[i] < q[i]? qual[i] : q[i];
+                                       if ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y)) bq[i] = 0;
+                                       else bq[i] = bq[i] < q[i]? bq[i] : q[i];
                                 }
                                 x += l; y += l;
                         } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l;
                         else if (op == BAM_CDEL) x += l;
                 }
                                 }
                                 x += l; y += l;
                         } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l;
                         else if (op == BAM_CDEL) x += l;
                 }
-               free(s); free(r); free(q); free(state);
+               for (i = 0; i < c->l_qseq; ++i) bq[i] = qual[i] - bq[i] + 64; // finalize BQ
+               if (apply_baq) {
+                       for (i = 0; i < c->l_qseq; ++i) qual[i] -= bq[i] - 64; // modify qual
+                       bam_aux_append(b, "ZQ", 'Z', c->l_qseq + 1, bq);
+               } else bam_aux_append(b, "BQ", 'Z', c->l_qseq + 1, bq);
+               free(bq); free(s); free(r); free(q); free(state);
         }
         return 0;
  }
  
         }
         return 0;
  }
  
+int bam_prob_realn(bam1_t *b, const char *ref)
+{
+       return bam_prob_realn_core(b, ref, 1);
+}
+
  int bam_fillmd(int argc, char *argv[])
  {
  int bam_fillmd(int argc, char *argv[])
  {
-       int c, is_equal = 0, tid = -2, ret, len, is_bam_out, is_sam_in, is_uncompressed, max_nm = 0, is_realn, capQ = 0;
+       int c, is_equal, tid = -2, ret, len, is_bam_out, is_sam_in, is_uncompressed, max_nm, is_realn, capQ, apply_baq;
         samfile_t *fp, *fpout = 0;
         faidx_t *fai;
         char *ref = 0, mode_w[8], mode_r[8];
         bam1_t *b;
  
         samfile_t *fp, *fpout = 0;
         faidx_t *fai;
         char *ref = 0, mode_w[8], mode_r[8];
         bam1_t *b;
  
-       is_bam_out = is_sam_in = is_uncompressed = is_realn = 0;
+       is_equal = is_bam_out = is_sam_in = is_uncompressed = is_realn = max_nm = apply_baq = capQ = 0;
         mode_w[0] = mode_r[0] = 0;
         strcpy(mode_r, "r"); strcpy(mode_w, "w");
         mode_w[0] = mode_r[0] = 0;
         strcpy(mode_r, "r"); strcpy(mode_w, "w");
-       while ((c = getopt(argc, argv, "reubSC:n:")) >= 0) {
+       while ((c = getopt(argc, argv, "reubSC:n:A")) >= 0) {
                 switch (c) {
                 case 'r': is_realn = 1; break;
                 case 'e': is_equal = 1; break;
                 switch (c) {
                 case 'r': is_realn = 1; break;
                 case 'e': is_equal = 1; break;
@@ -238,6 +272,7 @@ int bam_fillmd(int argc, char *argv[])
                 case 'S': is_sam_in = 1; break;
                 case 'n': max_nm = atoi(optarg); break;
                 case 'C': capQ = atoi(optarg); break;
                 case 'S': is_sam_in = 1; break;
                 case 'n': max_nm = atoi(optarg); break;
                 case 'C': capQ = atoi(optarg); break;
+               case 'A': apply_baq = 1; break;
                 default: fprintf(stderr, "[bam_fillmd] unrecognized option '-%c'\n", c); return 1;
                 }
         }
                 default: fprintf(stderr, "[bam_fillmd] unrecognized option '-%c'\n", c); return 1;
                 }
         }
@@ -252,6 +287,7 @@ int bam_fillmd(int argc, char *argv[])
                 fprintf(stderr, "         -u       uncompressed BAM output (for piping)\n");
                 fprintf(stderr, "         -b       compressed BAM output\n");
                 fprintf(stderr, "         -S       the input is SAM with header\n");
                 fprintf(stderr, "         -u       uncompressed BAM output (for piping)\n");
                 fprintf(stderr, "         -b       compressed BAM output\n");
                 fprintf(stderr, "         -S       the input is SAM with header\n");
+               fprintf(stderr, "         -A       modify the quality string\n");
                 fprintf(stderr, "         -r       read-independent local realignment\n\n");
                 return 1;
         }
                 fprintf(stderr, "         -r       read-independent local realignment\n\n");
                 return 1;
         }
@@ -275,8 +311,7 @@ int bam_fillmd(int argc, char *argv[])
                                         fprintf(stderr, "[bam_fillmd] fail to find sequence '%s' in the reference.\n",
                                                         fp->header->target_name[tid]);
                         }
                                         fprintf(stderr, "[bam_fillmd] fail to find sequence '%s' in the reference.\n",
                                                         fp->header->target_name[tid]);
                         }
-//                     if (is_realn) bam_realn(b, ref);
-                       if (is_realn) bam_prob_realn(b, ref);
+                       if (is_realn) bam_prob_realn_core(b, ref, apply_baq);
                         if (capQ > 10) {
                                 int q = bam_cap_mapQ(b, ref, capQ);
                                 if (b->core.qual > q) b->core.qual = q;
                         if (capQ > 10) {
                                 int q = bam_cap_mapQ(b, ref, capQ);
                                 if (b->core.qual > q) b->core.qual = q;
diff --git a/bam_pileup.c b/bam_pileup.c

index a49f795d73d4f0ed1ad3a9592108f3be2a917579..55e51e27804cfc58fc5576b40ae7ba10416a3e8f 100644 (file)
--- a/bam_pileup.c
+++ b/bam_pileup.c
@@ -152,7 +152,7 @@ struct __bam_plp_t {
         mempool_t *mp;
         lbnode_t *head, *tail, *dummy;
         int32_t tid, pos, max_tid, max_pos;
         mempool_t *mp;
         lbnode_t *head, *tail, *dummy;
         int32_t tid, pos, max_tid, max_pos;
-       int is_eof, flag_mask, max_plp, error;
+       int is_eof, flag_mask, max_plp, error, maxcnt;
         bam_pileup1_t *plp;
         // for the "auto" interface only
         bam1_t *b;
         bam_pileup1_t *plp;
         // for the "auto" interface only
         bam1_t *b;
@@ -169,6 +169,7 @@ bam_plp_t bam_plp_init(bam_plp_auto_f func, void *data)
         iter->dummy = mp_alloc(iter->mp);
         iter->max_tid = iter->max_pos = -1;
         iter->flag_mask = BAM_DEF_MASK;
         iter->dummy = mp_alloc(iter->mp);
         iter->max_tid = iter->max_pos = -1;
         iter->flag_mask = BAM_DEF_MASK;
+       iter->maxcnt = 8000;
         if (func) {
                 iter->func = func;
                 iter->data = data;
         if (func) {
                 iter->func = func;
                 iter->data = data;
@@ -240,6 +241,7 @@ int bam_plp_push(bam_plp_t iter, const bam1_t *b)
         if (b) {
                 if (b->core.tid < 0) return 0;
                 if (b->core.flag & iter->flag_mask) return 0;
         if (b) {
                 if (b->core.tid < 0) return 0;
                 if (b->core.flag & iter->flag_mask) return 0;
+               if (iter->tid == b->core.tid && iter->pos == b->core.pos && iter->mp->cnt > iter->maxcnt) return 0;
                 bam_copy1(&iter->tail->b, b);
                 iter->tail->beg = b->core.pos; iter->tail->end = bam_calend(&b->core, bam1_cigar(b));
                 iter->tail->s = g_cstate_null; iter->tail->s.end = iter->tail->end - 1; // initialize cstate_t
                 bam_copy1(&iter->tail->b, b);
                 iter->tail->beg = b->core.pos; iter->tail->end = bam_calend(&b->core, bam1_cigar(b));
                 iter->tail->s = g_cstate_null; iter->tail->s.end = iter->tail->end - 1; // initialize cstate_t
@@ -267,7 +269,7 @@ const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_
         const bam_pileup1_t *plp;
         if (iter->func == 0 || iter->error) { *_n_plp = -1; return 0; }
         if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp;
         const bam_pileup1_t *plp;
         if (iter->func == 0 || iter->error) { *_n_plp = -1; return 0; }
         if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp;
-       else {
+       else { // no pileup line can be obtained; read alignments
                 *_n_plp = 0;
                 if (iter->is_eof) return 0;
                 while (iter->func(iter->data, iter->b) >= 0) {
                 *_n_plp = 0;
                 if (iter->is_eof) return 0;
                 while (iter->func(iter->data, iter->b) >= 0) {
@@ -276,6 +278,7 @@ const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_
                                 return 0;
                         }
                         if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp;
                                 return 0;
                         }
                         if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp;
+                       // otherwise no pileup line can be returned; read the next alignment.
                 }
                 bam_plp_push(iter, 0);
                 if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp;
                 }
                 bam_plp_push(iter, 0);
                 if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp;
@@ -302,6 +305,11 @@ void bam_plp_set_mask(bam_plp_t iter, int mask)
         iter->flag_mask = mask < 0? BAM_DEF_MASK : (BAM_FUNMAP | mask);
  }
  
         iter->flag_mask = mask < 0? BAM_DEF_MASK : (BAM_FUNMAP | mask);
  }
  
+void bam_plp_set_maxcnt(bam_plp_t iter, int maxcnt)
+{
+       iter->maxcnt = maxcnt;
+}
+
  /*****************
   * callback APIs *
   *****************/
  /*****************
   * callback APIs *
   *****************/
@@ -389,6 +397,13 @@ bam_mplp_t bam_mplp_init(int n, bam_plp_auto_f func, void **data)
         return iter;
  }
  
         return iter;
  }
  
+void bam_mplp_set_maxcnt(bam_mplp_t iter, int maxcnt)
+{
+       int i;
+       for (i = 0; i < iter->n; ++i)
+               iter->iter[i]->maxcnt = maxcnt;
+}
+
  void bam_mplp_destroy(bam_mplp_t iter)
  {
         int i;
  void bam_mplp_destroy(bam_mplp_t iter)
  {
         int i;
diff --git a/bam_plcmd.c b/bam_plcmd.c

index dca85181f99bf1c7343650846039153a019e7ea9..e3f73aaddd03a14343a82084ee060d9c41ba42d4 100644 (file)
--- a/bam_plcmd.c
+++ b/bam_plcmd.c
@@ -2,6 +2,8 @@
  #include <stdio.h>
  #include <unistd.h>
  #include <ctype.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <ctype.h>
+#include <string.h>
+#include <errno.h>
  #include "sam.h"
  #include "faidx.h"
  #include "bam_maqcns.h"
  #include "sam.h"
  #include "faidx.h"
  #include "bam_maqcns.h"
@@ -528,10 +530,14 @@ int bam_pileup(int argc, char *argv[])
  #define MPLP_NO_COMP 0x20
  #define MPLP_NO_ORPHAN 0x40
  #define MPLP_REALN   0x80
  #define MPLP_NO_COMP 0x20
  #define MPLP_NO_ORPHAN 0x40
  #define MPLP_REALN   0x80
+#define MPLP_FMT_DP 0x100
+#define MPLP_FMT_SP 0x200
+#define MPLP_NO_INDEL 0x400
  
  typedef struct {
  
  typedef struct {
-       int max_mq, min_mq, flag, min_baseQ, capQ_thres;
-       char *reg, *fn_pos;
+       int max_mq, min_mq, flag, min_baseQ, capQ_thres, max_depth;
+       int openQ, extQ, tandemQ;
+       char *reg, *fn_pos, *pl_list;
         faidx_t *fai;
         kh_64_t *hash;
  } mplp_conf_t;
         faidx_t *fai;
         kh_64_t *hash;
  } mplp_conf_t;
@@ -552,7 +558,7 @@ typedef struct {
  static int mplp_func(void *data, bam1_t *b)
  {
         extern int bam_realn(bam1_t *b, const char *ref);
  static int mplp_func(void *data, bam1_t *b)
  {
         extern int bam_realn(bam1_t *b, const char *ref);
-       extern int bam_prob_realn(bam1_t *b, const char *ref);
+       extern int bam_prob_realn_core(bam1_t *b, const char *ref, int);
         extern int bam_cap_mapQ(bam1_t *b, char *ref, int thres);
         mplp_aux_t *ma = (mplp_aux_t*)data;
         int ret, skip = 0;
         extern int bam_cap_mapQ(bam1_t *b, char *ref, int thres);
         mplp_aux_t *ma = (mplp_aux_t*)data;
         int ret, skip = 0;
@@ -561,7 +567,7 @@ static int mplp_func(void *data, bam1_t *b)
                 ret = ma->iter? bam_iter_read(ma->fp, ma->iter, b) : bam_read1(ma->fp, b);
                 if (ret < 0) break;
                 skip = 0;
                 ret = ma->iter? bam_iter_read(ma->fp, ma->iter, b) : bam_read1(ma->fp, b);
                 if (ret < 0) break;
                 skip = 0;
-               if (has_ref && (ma->flag&MPLP_REALN)) bam_prob_realn(b, ma->ref);
+               if (has_ref && (ma->flag&MPLP_REALN)) bam_prob_realn_core(b, ma->ref, 1);
                 if (has_ref && ma->capQ_thres > 10) {
                         int q = bam_cap_mapQ(b, ma->ref, ma->capQ_thres);
                         if (q < 0) skip = 1;
                 if (has_ref && ma->capQ_thres > 10) {
                         int q = bam_cap_mapQ(b, ma->ref, ma->capQ_thres);
                         if (q < 0) skip = 1;
@@ -598,13 +604,16 @@ static void group_smpl(mplp_pileup_t *m, bam_sample_t *sm, kstring_t *buf,
  
  static int mpileup(mplp_conf_t *conf, int n, char **fn)
  {
  
  static int mpileup(mplp_conf_t *conf, int n, char **fn)
  {
+       extern void *bcf_call_add_rg(void *rghash, const char *hdtext, const char *list);
+       extern void bcf_call_del_rghash(void *rghash);
         mplp_aux_t **data;
         mplp_aux_t **data;
-       int i, tid, pos, *n_plp, beg0 = 0, end0 = 1u<<29, ref_len, ref_tid;
+       int i, tid, pos, *n_plp, beg0 = 0, end0 = 1u<<29, ref_len, ref_tid, max_depth;
         const bam_pileup1_t **plp;
         bam_mplp_t iter;
         bam_header_t *h = 0;
         char *ref;
         khash_t(64) *hash = 0;
         const bam_pileup1_t **plp;
         bam_mplp_t iter;
         bam_header_t *h = 0;
         char *ref;
         khash_t(64) *hash = 0;
+       void *rghash = 0;
  
         bcf_callaux_t *bca = 0;
         bcf_callret1_t *bcr = 0;
  
         bcf_callaux_t *bca = 0;
         bcf_callret1_t *bcr = 0;
@@ -634,6 +643,7 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
                 data[i]->fp = strcmp(fn[i], "-") == 0? bam_dopen(fileno(stdin), "r") : bam_open(fn[i], "r");
                 h_tmp = bam_header_read(data[i]->fp);
                 bam_smpl_add(sm, fn[i], h_tmp->text);
                 data[i]->fp = strcmp(fn[i], "-") == 0? bam_dopen(fileno(stdin), "r") : bam_open(fn[i], "r");
                 h_tmp = bam_header_read(data[i]->fp);
                 bam_smpl_add(sm, fn[i], h_tmp->text);
+               rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list);
                 if (conf->reg) {
                         int beg, end;
                         bam_index_t *idx;
                 if (conf->reg) {
                         int beg, end;
                         bam_index_t *idx;
@@ -689,9 +699,19 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
                 bcf_hdr_write(bp, bh);
                 bca = bcf_call_init(-1., conf->min_baseQ);
                 bcr = calloc(sm->n, sizeof(bcf_callret1_t));
                 bcf_hdr_write(bp, bh);
                 bca = bcf_call_init(-1., conf->min_baseQ);
                 bcr = calloc(sm->n, sizeof(bcf_callret1_t));
+               bca->rghash = rghash;
+               bca->openQ = conf->openQ, bca->extQ = conf->extQ, bca->tandemQ = conf->tandemQ;
         }
         ref_tid = -1; ref = 0;
         iter = bam_mplp_init(n, mplp_func, (void**)data);
         }
         ref_tid = -1; ref = 0;
         iter = bam_mplp_init(n, mplp_func, (void**)data);
+       max_depth = conf->max_depth;
+       if (max_depth * sm->n > 1<<20)
+               fprintf(stderr, "(%s) Max depth is above 1M. Potential memory hog!\n", __func__);
+       if (max_depth * sm->n < 8000) {
+               max_depth = 8000 / sm->n;
+               fprintf(stderr, "<%s> Set max per-sample depth to %d\n", __func__, max_depth);
+       }
+       bam_mplp_set_maxcnt(iter, max_depth);
         while (bam_mplp_auto(iter, &tid, &pos, n_plp, plp) > 0) {
                 if (conf->reg && (pos < beg0 || pos >= end0)) continue; // out of the region requested
                 if (hash) {
         while (bam_mplp_auto(iter, &tid, &pos, n_plp, plp) > 0) {
                 if (conf->reg && (pos < beg0 || pos >= end0)) continue; // out of the region requested
                 if (hash) {
@@ -714,9 +734,22 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
                         for (i = 0; i < gplp.n; ++i)
                                 bcf_call_glfgen(gplp.n_plp[i], gplp.plp[i], ref16, bca, bcr + i);
                         bcf_call_combine(gplp.n, bcr, ref16, &bc);
                         for (i = 0; i < gplp.n; ++i)
                                 bcf_call_glfgen(gplp.n_plp[i], gplp.plp[i], ref16, bca, bcr + i);
                         bcf_call_combine(gplp.n, bcr, ref16, &bc);
-                       bcf_call2bcf(tid, pos, &bc, b);
+                       bcf_call2bcf(tid, pos, &bc, b, (conf->flag&(MPLP_FMT_DP|MPLP_FMT_SP))? bcr : 0,
+                                                (conf->flag&MPLP_FMT_SP), 0, 0);
                         bcf_write(bp, bh, b);
                         bcf_destroy(b);
                         bcf_write(bp, bh, b);
                         bcf_destroy(b);
+                       // call indels
+                       if (!(conf->flag&MPLP_NO_INDEL) && bcf_call_gap_prep(gplp.n, gplp.n_plp, gplp.plp, pos, bca, ref, rghash) >= 0) {
+                               for (i = 0; i < gplp.n; ++i)
+                                       bcf_call_glfgen(gplp.n_plp[i], gplp.plp[i], -1, bca, bcr + i);
+                               if (bcf_call_combine(gplp.n, bcr, -1, &bc) >= 0) {
+                                       b = calloc(1, sizeof(bcf1_t));
+                                       bcf_call2bcf(tid, pos, &bc, b, (conf->flag&(MPLP_FMT_DP|MPLP_FMT_SP))? bcr : 0,
+                                                                (conf->flag&MPLP_FMT_SP), bca, ref);
+                                       bcf_write(bp, bh, b);
+                                       bcf_destroy(b);
+                               }
+                       }
                 } else {
                         printf("%s\t%d\t%c", h->target_name[tid], pos + 1, (ref && pos < ref_len)? ref[pos] : 'N');
                         for (i = 0; i < n; ++i) {
                 } else {
                         printf("%s\t%d\t%c", h->target_name[tid], pos + 1, (ref && pos < ref_len)? ref[pos] : 'N');
                         for (i = 0; i < n; ++i) {
@@ -743,6 +776,7 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
         bam_smpl_destroy(sm); free(buf.s);
         for (i = 0; i < gplp.n; ++i) free(gplp.plp[i]);
         free(gplp.plp); free(gplp.n_plp); free(gplp.m_plp);
         bam_smpl_destroy(sm); free(buf.s);
         for (i = 0; i < gplp.n; ++i) free(gplp.plp[i]);
         free(gplp.plp); free(gplp.n_plp); free(gplp.m_plp);
+       bcf_call_del_rghash(rghash);
         if (hash) { // free the hash table
                 khint_t k;
                 for (k = kh_begin(hash); k < kh_end(hash); ++k)
         if (hash) { // free the hash table
                 khint_t k;
                 for (k = kh_begin(hash); k < kh_end(hash); ++k)
@@ -761,33 +795,96 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
         return 0;
  }
  
         return 0;
  }
  
+#define MAX_PATH_LEN 1024
+int read_file_list(const char *file_list,int *n,char **argv[])
+{
+    char buf[MAX_PATH_LEN];
+    int len, nfiles;
+    char **files;
+
+    FILE *fh = fopen(file_list,"r");
+    if ( !fh )
+    {
+        fprintf(stderr,"%s: %s\n", file_list,strerror(errno));
+        return 1;
+    }
+
+    // Speed is not an issue here, determine the number of files by reading the file twice
+    nfiles = 0;
+    while ( fgets(buf,MAX_PATH_LEN,fh) ) nfiles++;
+
+    if ( fseek(fh, 0L, SEEK_SET) )
+    {
+        fprintf(stderr,"%s: %s\n", file_list,strerror(errno));
+        return 1;
+    }
+
+    files = calloc(nfiles,sizeof(char*));
+    nfiles = 0;
+    while ( fgets(buf,MAX_PATH_LEN,fh) ) 
+    {
+        len = strlen(buf);
+        while ( len>0 && isspace(buf[len-1]) ) len--;
+        if ( !len ) continue;
+
+        files[nfiles] = malloc(sizeof(char)*(len+1)); 
+        strncpy(files[nfiles],buf,len);
+        files[nfiles][len] = 0;
+        nfiles++;
+    }
+    fclose(fh);
+    if ( !nfiles )
+    {
+        fprintf(stderr,"No files read from %s\n", file_list);
+        return 1;
+    }
+    *argv = files;
+    *n    = nfiles;
+    return 0;
+}
+#undef MAX_PATH_LEN
+
  int bam_mpileup(int argc, char *argv[])
  {
         int c;
  int bam_mpileup(int argc, char *argv[])
  {
         int c;
+    const char *file_list = NULL;
+    char **fn = NULL;
+    int nfiles = 0;
         mplp_conf_t mplp;
         memset(&mplp, 0, sizeof(mplp_conf_t));
         mplp.max_mq = 60;
         mplp.min_baseQ = 13;
         mplp.capQ_thres = 0;
         mplp_conf_t mplp;
         memset(&mplp, 0, sizeof(mplp_conf_t));
         mplp.max_mq = 60;
         mplp.min_baseQ = 13;
         mplp.capQ_thres = 0;
+       mplp.max_depth = 250;
+       mplp.openQ = 40; mplp.extQ = 20; mplp.tandemQ = 100;
         mplp.flag = MPLP_NO_ORPHAN | MPLP_REALN;
         mplp.flag = MPLP_NO_ORPHAN | MPLP_REALN;
-       while ((c = getopt(argc, argv, "gf:r:l:M:q:Q:uaORC:B")) >= 0) {
+       while ((c = getopt(argc, argv, "gf:r:l:M:q:Q:uaORC:BDSd:b:P:o:e:h:I")) >= 0) {
                 switch (c) {
                 case 'f':
                         mplp.fai = fai_load(optarg);
                         if (mplp.fai == 0) return 1;
                         break;
                 switch (c) {
                 case 'f':
                         mplp.fai = fai_load(optarg);
                         if (mplp.fai == 0) return 1;
                         break;
+               case 'd': mplp.max_depth = atoi(optarg); break;
                 case 'r': mplp.reg = strdup(optarg); break;
                 case 'l': mplp.fn_pos = strdup(optarg); break;
                 case 'r': mplp.reg = strdup(optarg); break;
                 case 'l': mplp.fn_pos = strdup(optarg); break;
+               case 'P': mplp.pl_list = strdup(optarg); break;
                 case 'g': mplp.flag |= MPLP_GLF; break;
                 case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_GLF; break;
                 case 'a': mplp.flag |= MPLP_NO_ORPHAN | MPLP_REALN; break;
                 case 'B': mplp.flag &= ~MPLP_REALN & ~MPLP_NO_ORPHAN; break;
                 case 'O': mplp.flag |= MPLP_NO_ORPHAN; break;
                 case 'R': mplp.flag |= MPLP_REALN; break;
                 case 'g': mplp.flag |= MPLP_GLF; break;
                 case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_GLF; break;
                 case 'a': mplp.flag |= MPLP_NO_ORPHAN | MPLP_REALN; break;
                 case 'B': mplp.flag &= ~MPLP_REALN & ~MPLP_NO_ORPHAN; break;
                 case 'O': mplp.flag |= MPLP_NO_ORPHAN; break;
                 case 'R': mplp.flag |= MPLP_REALN; break;
+               case 'D': mplp.flag |= MPLP_FMT_DP; break;
+               case 'S': mplp.flag |= MPLP_FMT_SP; break;
+               case 'I': mplp.flag |= MPLP_NO_INDEL; break;
                 case 'C': mplp.capQ_thres = atoi(optarg); break;
                 case 'M': mplp.max_mq = atoi(optarg); break;
                 case 'q': mplp.min_mq = atoi(optarg); break;
                 case 'Q': mplp.min_baseQ = atoi(optarg); break;
                 case 'C': mplp.capQ_thres = atoi(optarg); break;
                 case 'M': mplp.max_mq = atoi(optarg); break;
                 case 'q': mplp.min_mq = atoi(optarg); break;
                 case 'Q': mplp.min_baseQ = atoi(optarg); break;
+        case 'b': file_list = optarg; break;
+               case 'o': mplp.openQ = atoi(optarg); break;
+               case 'e': mplp.extQ = atoi(optarg); break;
+               case 'h': mplp.tandemQ = atoi(optarg); break;
                 }
         }
         if (argc == 1) {
                 }
         }
         if (argc == 1) {
@@ -796,18 +893,35 @@ int bam_mpileup(int argc, char *argv[])
                 fprintf(stderr, "Options: -f FILE     reference sequence file [null]\n");
                 fprintf(stderr, "         -r STR      region in which pileup is generated [null]\n");
                 fprintf(stderr, "         -l FILE     list of positions (format: chr pos) [null]\n");
                 fprintf(stderr, "Options: -f FILE     reference sequence file [null]\n");
                 fprintf(stderr, "         -r STR      region in which pileup is generated [null]\n");
                 fprintf(stderr, "         -l FILE     list of positions (format: chr pos) [null]\n");
+               fprintf(stderr, "         -b FILE     list of input BAM files [null]\n");
                 fprintf(stderr, "         -M INT      cap mapping quality at INT [%d]\n", mplp.max_mq);
                 fprintf(stderr, "         -Q INT      min base quality [%d]\n", mplp.min_baseQ);
                 fprintf(stderr, "         -q INT      filter out alignment with MQ smaller than INT [%d]\n", mplp.min_mq);
                 fprintf(stderr, "         -M INT      cap mapping quality at INT [%d]\n", mplp.max_mq);
                 fprintf(stderr, "         -Q INT      min base quality [%d]\n", mplp.min_baseQ);
                 fprintf(stderr, "         -q INT      filter out alignment with MQ smaller than INT [%d]\n", mplp.min_mq);
+               fprintf(stderr, "         -d INT      max per-sample depth [%d]\n", mplp.max_depth);
+               fprintf(stderr, "         -P STR      comma separated list of platforms for indels [all]\n");
+               fprintf(stderr, "         -o INT      Phred-scaled gap open sequencing error probability [%d]\n", mplp.openQ);
+               fprintf(stderr, "         -e INT      Phred-scaled gap extension seq error probability [%d]\n", mplp.extQ);
+               fprintf(stderr, "         -h INT      coefficient for homopolyer errors [%d]\n", mplp.tandemQ);
                 fprintf(stderr, "         -g          generate BCF output\n");
                 fprintf(stderr, "         -u          do not compress BCF output\n");
                 fprintf(stderr, "         -B          disable BAQ computation\n");
                 fprintf(stderr, "         -g          generate BCF output\n");
                 fprintf(stderr, "         -u          do not compress BCF output\n");
                 fprintf(stderr, "         -B          disable BAQ computation\n");
+               fprintf(stderr, "         -D          output per-sample DP\n");
+               fprintf(stderr, "         -S          output per-sample SP (strand bias P-value, slow)\n");
+               fprintf(stderr, "         -I          do not perform indel calling\n");
                 fprintf(stderr, "\n");
                 fprintf(stderr, "Notes: Assuming diploid individuals.\n\n");
                 return 1;
         }
                 fprintf(stderr, "\n");
                 fprintf(stderr, "Notes: Assuming diploid individuals.\n\n");
                 return 1;
         }
-       mpileup(&mplp, argc - optind, argv + optind);
-       free(mplp.reg);
+    if ( file_list )
+    {
+        if ( read_file_list(file_list,&nfiles,&fn) ) return 1;
+        mpileup(&mplp,nfiles,fn);
+        for (c=0; c<nfiles; c++) free(fn[c]);
+        free(fn);
+    }
+    else
+           mpileup(&mplp, argc - optind, argv + optind);
+       free(mplp.reg); free(mplp.pl_list);
         if (mplp.fai) fai_destroy(mplp.fai);
         return 0;
  }
         if (mplp.fai) fai_destroy(mplp.fai);
         return 0;
  }
diff --git a/bamtk.c b/bamtk.c

index 082767c8d6bf0a29be8f1b73101b8a7c43e29a08..88328d47c24fc6bfe6f602ffce721fb4ee64e56c 100644 (file)
--- a/bamtk.c
+++ b/bamtk.c
@@ -9,7 +9,7 @@
  #endif
  
  #ifndef PACKAGE_VERSION
  #endif
  
  #ifndef PACKAGE_VERSION
-#define PACKAGE_VERSION "0.1.9 (r783)"
+#define PACKAGE_VERSION "0.1.10 (r829)"
  #endif
  
  int bam_taf2baf(int argc, char *argv[]);
  #endif
  
  int bam_taf2baf(int argc, char *argv[]);
diff --git a/bcftools/bcf.c b/bcftools/bcf.c

index 9044b93b0b514fd9566f037a705f8769f60ecc54..fd5c0c9ae24971f15168f488452afe2077258dd0 100644 (file)
--- a/bcftools/bcf.c
+++ b/bcftools/bcf.c
@@ -13,7 +13,9 @@ bcf_t *bcf_open(const char *fn, const char *mode)
         } else {
                 b->fp = strcmp(fn, "-")? bgzf_open(fn, mode) : bgzf_fdopen(fileno(stdin), mode);
         }
         } else {
                 b->fp = strcmp(fn, "-")? bgzf_open(fn, mode) : bgzf_fdopen(fileno(stdin), mode);
         }
+#ifndef BCF_LITE
         b->fp->owned_file = 1;
         b->fp->owned_file = 1;
+#endif
         return b;
  }
  
         return b;
  }
  
@@ -134,7 +136,9 @@ int bcf_sync(bcf1_t *b)
                         b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2;
                 } else if (b->gi[i].fmt == bcf_str2int("DP", 2) || b->gi[i].fmt == bcf_str2int("HQ", 2)) {
                         b->gi[i].len = 2;
                         b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2;
                 } else if (b->gi[i].fmt == bcf_str2int("DP", 2) || b->gi[i].fmt == bcf_str2int("HQ", 2)) {
                         b->gi[i].len = 2;
-               } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("GT", 2)) {
+               } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("GT", 2)
+                                  || b->gi[i].fmt == bcf_str2int("SP", 2))
+               {
                         b->gi[i].len = 1;
                 } else if (b->gi[i].fmt == bcf_str2int("GL", 2)) {
                         b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2 * 4;
                         b->gi[i].len = 1;
                 } else if (b->gi[i].fmt == bcf_str2int("GL", 2)) {
                         b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2 * 4;
@@ -236,7 +240,7 @@ void bcf_fmt_core(const bcf_hdr_t *h, bcf1_t *b, kstring_t *s)
                                 }
                         } else if (b->gi[i].fmt == bcf_str2int("DP", 2)) {
                                 kputw(((uint16_t*)b->gi[i].data)[j], s);
                                 }
                         } else if (b->gi[i].fmt == bcf_str2int("DP", 2)) {
                                 kputw(((uint16_t*)b->gi[i].data)[j], s);
-                       } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) {
+                       } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) {
                                 kputw(((uint8_t*)b->gi[i].data)[j], s);
                         } else if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
                                 int y = ((uint8_t*)b->gi[i].data)[j];
                                 kputw(((uint8_t*)b->gi[i].data)[j], s);
                         } else if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
                                 int y = ((uint8_t*)b->gi[i].data)[j];
@@ -304,3 +308,13 @@ int bcf_cpy(bcf1_t *r, const bcf1_t *b)
                 memcpy(r->gi[i].data, b->gi[i].data, r->n_smpl * r->gi[i].len);
         return 0;
  }
                 memcpy(r->gi[i].data, b->gi[i].data, r->n_smpl * r->gi[i].len);
         return 0;
  }
+
+int bcf_is_indel(const bcf1_t *b)
+{
+       char *p;
+       if (strlen(b->ref) > 1) return 1;
+       for (p = b->alt; *p; ++p)
+               if (*p != ',' && p[1] != ',' && p[1] != '\0')
+                       return 1;
+       return 0;
+}
diff --git a/bcftools/bcf.h b/bcftools/bcf.h

index 36578956bbae06d4c216c121a2ddb3a94cad94d8..f87ac1e6fa3021964753f051a944fdc95a59549d 100644 (file)
--- a/bcftools/bcf.h
+++ b/bcftools/bcf.h
@@ -30,7 +30,19 @@
  
  #include <stdint.h>
  #include <zlib.h>
  
  #include <stdint.h>
  #include <zlib.h>
+
+#ifndef BCF_LITE
  #include "bgzf.h"
  #include "bgzf.h"
+typedef BGZF *bcfFile;
+#else
+typedef gzFile bcfFile;
+#define bgzf_open(fn, mode) gzopen(fn, mode)
+#define bgzf_fdopen(fd, mode) gzdopen(fd, mode)
+#define bgzf_close(fp) gzclose(fp)
+#define bgzf_read(fp, buf, len) gzread(fp, buf, len)
+#define bgzf_write(fp, buf, len)
+#define bgzf_flush(fp)
+#endif
  
  /*
    A member in the structs below is said to "primary" if its content
  
  /*
    A member in the structs below is said to "primary" if its content
@@ -74,7 +86,7 @@ typedef struct {
  typedef struct {
         int is_vcf; // if the file in operation is a VCF
         void *v; // auxillary data structure for VCF
  typedef struct {
         int is_vcf; // if the file in operation is a VCF
         void *v; // auxillary data structure for VCF
-       BGZF *fp; // file handler for BCF
+       bcfFile fp; // file handler for BCF
  } bcf_t;
  
  struct __bcf_idx_t;
  } bcf_t;
  
  struct __bcf_idx_t;
@@ -128,6 +140,8 @@ extern "C" {
         int bcf_shrink_alt(bcf1_t *b, int n);
         // convert GL to PL
         int bcf_gl2pl(bcf1_t *b);
         int bcf_shrink_alt(bcf1_t *b, int n);
         // convert GL to PL
         int bcf_gl2pl(bcf1_t *b);
+       // if the site is an indel
+       int bcf_is_indel(const bcf1_t *b);
  
         // string hash table
         void *bcf_build_refhash(bcf_hdr_t *h);
  
         // string hash table
         void *bcf_build_refhash(bcf_hdr_t *h);
diff --git a/bcftools/call1.c b/bcftools/call1.c

index 2b284527185c0e12747ca1dfb5387aa0bd615c36..b0b14fc5307817cad907ecc6f7126716e80599a5 100644 (file)
--- a/bcftools/call1.c
+++ b/bcftools/call1.c
@@ -25,11 +25,12 @@ KSTREAM_INIT(gzFile, gzread, 16384)
  #define VC_QCALL   1024
  #define VC_CALL_GT 2048
  #define VC_ADJLD   4096
  #define VC_QCALL   1024
  #define VC_CALL_GT 2048
  #define VC_ADJLD   4096
+#define VC_NO_INDEL 8192
  
  typedef struct {
         int flag, prior_type, n1;
         char *fn_list, *prior_file;
  
  typedef struct {
         int flag, prior_type, n1;
         char *fn_list, *prior_file;
-       double theta, pref;
+       double theta, pref, indel_frac;
  } viewconf_t;
  
  khash_t(set64) *bcf_load_pos(const char *fn, bcf_hdr_t *_h)
  } viewconf_t;
  
  khash_t(set64) *bcf_load_pos(const char *fn, bcf_hdr_t *_h)
@@ -161,7 +162,8 @@ static int update_bcf1(int n_smpl, bcf1_t *b, const bcf_p1aux_t *pa, const bcf_p
         kputs(b->alt, &s); kputc('\0', &s); kputc('\0', &s);
         kputs(b->info, &s);
         if (b->info[0]) kputc(';', &s);
         kputs(b->alt, &s); kputc('\0', &s); kputc('\0', &s);
         kputs(b->info, &s);
         if (b->info[0]) kputc(';', &s);
-       ksprintf(&s, "AF1=%.3lf;AFE=%.3lf", 1.-pr->f_em, 1.-pr->f_exp);
+//     ksprintf(&s, "AF1=%.4lg;AFE=%.4lg;CI95=%.4lg,%.4lg", 1.-pr->f_em, 1.-pr->f_exp, pr->cil, pr->cih);
+       ksprintf(&s, "AF1=%.4lg;CI95=%.4lg,%.4lg", 1.-pr->f_em, pr->cil, pr->cih);
         ksprintf(&s, ";DP4=%d,%d,%d,%d;MQ=%d", a.d[0], a.d[1], a.d[2], a.d[3], a.mq);
         if (a.is_tested) {
                 if (pr->pc[0] >= 0.) ksprintf(&s, ";PC4=%lg,%lg,%lg,%lg", pr->pc[0], pr->pc[1], pr->pc[2], pr->pc[3]);
         ksprintf(&s, ";DP4=%d,%d,%d,%d;MQ=%d", a.d[0], a.d[1], a.d[2], a.d[3], a.mq);
         if (a.is_tested) {
                 if (pr->pc[0] >= 0.) ksprintf(&s, ";PC4=%lg,%lg,%lg,%lg", pr->pc[0], pr->pc[1], pr->pc[2], pr->pc[3]);
@@ -199,6 +201,7 @@ double bcf_ld_freq(const bcf1_t *b0, const bcf1_t *b1, double f[4]);
  int bcfview(int argc, char *argv[])
  {
         extern int bcf_2qcall(bcf_hdr_t *h, bcf1_t *b);
  int bcfview(int argc, char *argv[])
  {
         extern int bcf_2qcall(bcf_hdr_t *h, bcf1_t *b);
+       extern void bcf_p1_indel_prior(bcf_p1aux_t *ma, double x);
         bcf_t *bp, *bout = 0;
         bcf1_t *b, *blast;
         int c;
         bcf_t *bp, *bout = 0;
         bcf1_t *b, *blast;
         int c;
@@ -212,8 +215,8 @@ int bcfview(int argc, char *argv[])
  
         tid = begin = end = -1;
         memset(&vc, 0, sizeof(viewconf_t));
  
         tid = begin = end = -1;
         memset(&vc, 0, sizeof(viewconf_t));
-       vc.prior_type = vc.n1 = -1; vc.theta = 1e-3; vc.pref = 0.5;
-       while ((c = getopt(argc, argv, "N1:l:cHAGvbSuP:t:p:QgL")) >= 0) {
+       vc.prior_type = vc.n1 = -1; vc.theta = 1e-3; vc.pref = 0.5; vc.indel_frac = -1.;
+       while ((c = getopt(argc, argv, "N1:l:cHAGvbSuP:t:p:QgLi:I")) >= 0) {
                 switch (c) {
                 case '1': vc.n1 = atoi(optarg); break;
                 case 'l': vc.fn_list = strdup(optarg); break;
                 switch (c) {
                 case '1': vc.n1 = atoi(optarg); break;
                 case 'l': vc.fn_list = strdup(optarg); break;
@@ -227,8 +230,10 @@ int bcfview(int argc, char *argv[])
                 case 'u': vc.flag |= VC_UNCOMP | VC_BCFOUT; break;
                 case 'H': vc.flag |= VC_HWE; break;
                 case 'g': vc.flag |= VC_CALL_GT | VC_CALL; break;
                 case 'u': vc.flag |= VC_UNCOMP | VC_BCFOUT; break;
                 case 'H': vc.flag |= VC_HWE; break;
                 case 'g': vc.flag |= VC_CALL_GT | VC_CALL; break;
+               case 'I': vc.flag |= VC_NO_INDEL; break;
                 case 't': vc.theta = atof(optarg); break;
                 case 'p': vc.pref = atof(optarg); break;
                 case 't': vc.theta = atof(optarg); break;
                 case 'p': vc.pref = atof(optarg); break;
+               case 'i': vc.indel_frac = atof(optarg); break;
                 case 'Q': vc.flag |= VC_QCALL; break;
                 case 'L': vc.flag |= VC_ADJLD; break;
                 case 'P':
                 case 'Q': vc.flag |= VC_QCALL; break;
                 case 'L': vc.flag |= VC_ADJLD; break;
                 case 'P':
@@ -254,9 +259,11 @@ int bcfview(int argc, char *argv[])
                 fprintf(stderr, "         -N        skip sites where REF is not A/C/G/T\n");
                 fprintf(stderr, "         -Q        output the QCALL likelihood format\n");
                 fprintf(stderr, "         -L        calculate LD for adjacent sites\n");
                 fprintf(stderr, "         -N        skip sites where REF is not A/C/G/T\n");
                 fprintf(stderr, "         -Q        output the QCALL likelihood format\n");
                 fprintf(stderr, "         -L        calculate LD for adjacent sites\n");
+               fprintf(stderr, "         -I        skip indels\n");
                 fprintf(stderr, "         -1 INT    number of group-1 samples [0]\n");
                 fprintf(stderr, "         -l FILE   list of sites to output [all sites]\n");
                 fprintf(stderr, "         -1 INT    number of group-1 samples [0]\n");
                 fprintf(stderr, "         -l FILE   list of sites to output [all sites]\n");
-               fprintf(stderr, "         -t FLOAT  scaled mutation rate [%.4lg]\n", vc.theta);
+               fprintf(stderr, "         -t FLOAT  scaled substitution mutation rate [%.4lg]\n", vc.theta);
+               fprintf(stderr, "         -i FLOAT  indel-to-substitution ratio [%.4lg]\n", vc.indel_frac);
                 fprintf(stderr, "         -p FLOAT  variant if P(ref|D)<FLOAT [%.3lg]\n", vc.pref);
                 fprintf(stderr, "         -P STR    type of prior: full, cond2, flat [full]\n");
                 fprintf(stderr, "\n");
                 fprintf(stderr, "         -p FLOAT  variant if P(ref|D)<FLOAT [%.3lg]\n", vc.pref);
                 fprintf(stderr, "         -P STR    type of prior: full, cond2, flat [full]\n");
                 fprintf(stderr, "\n");
@@ -286,6 +293,7 @@ int bcfview(int argc, char *argv[])
                         bcf_p1_set_n1(p1, vc.n1);
                         bcf_p1_init_subprior(p1, vc.prior_type, vc.theta);
                 }
                         bcf_p1_set_n1(p1, vc.n1);
                         bcf_p1_init_subprior(p1, vc.prior_type, vc.theta);
                 }
+               if (vc.indel_frac > 0.) bcf_p1_indel_prior(p1, vc.indel_frac);
         }
         if (vc.fn_list) hash = bcf_load_pos(vc.fn_list, h);
         if (optind + 1 < argc && !(vc.flag&VC_VCFIN)) {
         }
         if (vc.fn_list) hash = bcf_load_pos(vc.fn_list, h);
         if (optind + 1 < argc && !(vc.flag&VC_VCFIN)) {
@@ -306,7 +314,9 @@ int bcfview(int argc, char *argv[])
                 }
         }
         while (vcf_read(bp, h, b) > 0) {
                 }
         }
         while (vcf_read(bp, h, b) > 0) {
-               if (vc.flag & VC_ACGT_ONLY) {
+               int is_indel = bcf_is_indel(b);
+               if ((vc.flag & VC_NO_INDEL) && is_indel) continue;
+               if ((vc.flag & VC_ACGT_ONLY) && !is_indel) {
                         int x;
                         if (b->ref[0] == 0 || b->ref[1] != 0) continue;
                         x = toupper(b->ref[0]);
                         int x;
                         if (b->ref[0] == 0 || b->ref[1] != 0) continue;
                         x = toupper(b->ref[0]);
diff --git a/bcftools/prob1.c b/bcftools/prob1.c

index e3b0f7344d096f22c37048b2131219378bcb7ba5..176a0fc96d4e0b12f14172bc7c0d68297e01a50a 100644 (file)
--- a/bcftools/prob1.c
+++ b/bcftools/prob1.c
@@ -8,9 +8,9 @@
  #include "kseq.h"
  KSTREAM_INIT(gzFile, gzread, 16384)
  
  #include "kseq.h"
  KSTREAM_INIT(gzFile, gzread, 16384)
  
-#define MC_AVG_ERR 0.007
  #define MC_MAX_EM_ITER 16
  #define MC_EM_EPS 1e-4
  #define MC_MAX_EM_ITER 16
  #define MC_EM_EPS 1e-4
+#define MC_DEF_INDEL 0.15
  
  unsigned char seq_nt4_table[256] = {
         4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
  
  unsigned char seq_nt4_table[256] = {
         4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
@@ -32,9 +32,9 @@ unsigned char seq_nt4_table[256] = {
  };
  
  struct __bcf_p1aux_t {
  };
  
  struct __bcf_p1aux_t {
-       int n, M, n1;
+       int n, M, n1, is_indel;
         double *q2p, *pdg; // pdg -> P(D|g)
         double *q2p, *pdg; // pdg -> P(D|g)
-       double *phi;
+       double *phi, *phi_indel;
         double *z, *zswap; // aux for afs
         double *z1, *z2, *phi1, *phi2; // only calculated when n1 is set
         double t, t1, t2;
         double *z, *zswap; // aux for afs
         double *z1, *z2, *phi1, *phi2; // only calculated when n1 is set
         double t, t1, t2;
@@ -43,6 +43,14 @@ struct __bcf_p1aux_t {
         int PL_len;
  };
  
         int PL_len;
  };
  
+void bcf_p1_indel_prior(bcf_p1aux_t *ma, double x)
+{
+       int i;
+       for (i = 0; i < ma->M; ++i)
+               ma->phi_indel[i] = ma->phi[i] * x;
+       ma->phi_indel[ma->M] = 1. - ma->phi[ma->M] * x;
+}
+
  static void init_prior(int type, double theta, int M, double *phi)
  {
         int i;
  static void init_prior(int type, double theta, int M, double *phi)
  {
         int i;
@@ -63,6 +71,7 @@ static void init_prior(int type, double theta, int M, double *phi)
  void bcf_p1_init_prior(bcf_p1aux_t *ma, int type, double theta)
  {
         init_prior(type, theta, ma->M, ma->phi);
  void bcf_p1_init_prior(bcf_p1aux_t *ma, int type, double theta)
  {
         init_prior(type, theta, ma->M, ma->phi);
+       bcf_p1_indel_prior(ma, MC_DEF_INDEL);
  }
  
  void bcf_p1_init_subprior(bcf_p1aux_t *ma, int type, double theta)
  }
  
  void bcf_p1_init_subprior(bcf_p1aux_t *ma, int type, double theta)
@@ -110,6 +119,7 @@ int bcf_p1_read_prior(bcf_p1aux_t *ma, const char *fn)
         fprintf(stderr, "[%s] heterozygosity=%lf, ", __func__, (double)sum);
         for (sum = 0., k = 1; k <= ma->M; ++k) sum += k * ma->phi[ma->M - k] / ma->M;
         fprintf(stderr, "theta=%lf\n", (double)sum);
         fprintf(stderr, "[%s] heterozygosity=%lf, ", __func__, (double)sum);
         for (sum = 0., k = 1; k <= ma->M; ++k) sum += k * ma->phi[ma->M - k] / ma->M;
         fprintf(stderr, "theta=%lf\n", (double)sum);
+       bcf_p1_indel_prior(ma, MC_DEF_INDEL);
         return 0;
  }
  
         return 0;
  }
  
@@ -123,6 +133,7 @@ bcf_p1aux_t *bcf_p1_init(int n)
         ma->q2p = calloc(256, sizeof(double));
         ma->pdg = calloc(3 * ma->n, sizeof(double));
         ma->phi = calloc(ma->M + 1, sizeof(double));
         ma->q2p = calloc(256, sizeof(double));
         ma->pdg = calloc(3 * ma->n, sizeof(double));
         ma->phi = calloc(ma->M + 1, sizeof(double));
+       ma->phi_indel = calloc(ma->M + 1, sizeof(double));
         ma->phi1 = calloc(ma->M + 1, sizeof(double));
         ma->phi2 = calloc(ma->M + 1, sizeof(double));
         ma->z = calloc(2 * ma->n + 1, sizeof(double));
         ma->phi1 = calloc(ma->M + 1, sizeof(double));
         ma->phi2 = calloc(ma->M + 1, sizeof(double));
         ma->z = calloc(2 * ma->n + 1, sizeof(double));
@@ -148,7 +159,7 @@ void bcf_p1_destroy(bcf_p1aux_t *ma)
  {
         if (ma) {
                 free(ma->q2p); free(ma->pdg);
  {
         if (ma) {
                 free(ma->q2p); free(ma->pdg);
-               free(ma->phi); free(ma->phi1); free(ma->phi2);
+               free(ma->phi); free(ma->phi_indel); free(ma->phi1); free(ma->phi2);
                 free(ma->z); free(ma->zswap); free(ma->z1); free(ma->z2);
                 free(ma->afs); free(ma->afs1);
                 free(ma);
                 free(ma->z); free(ma->zswap); free(ma->z1); free(ma->z2);
                 free(ma->afs); free(ma->afs1);
                 free(ma);
@@ -303,12 +314,13 @@ static double mc_cal_afs(bcf_p1aux_t *ma)
  {
         int k;
         long double sum = 0.;
  {
         int k;
         long double sum = 0.;
+       double *phi = ma->is_indel? ma->phi_indel : ma->phi;
         memset(ma->afs1, 0, sizeof(double) * (ma->M + 1));
         mc_cal_y(ma);
         for (k = 0, sum = 0.; k <= ma->M; ++k)
         memset(ma->afs1, 0, sizeof(double) * (ma->M + 1));
         mc_cal_y(ma);
         for (k = 0, sum = 0.; k <= ma->M; ++k)
-               sum += (long double)ma->phi[k] * ma->z[k];
+               sum += (long double)phi[k] * ma->z[k];
         for (k = 0; k <= ma->M; ++k) {
         for (k = 0; k <= ma->M; ++k) {
-               ma->afs1[k] = ma->phi[k] * ma->z[k] / sum;
+               ma->afs1[k] = phi[k] * ma->z[k] / sum;
                 if (isnan(ma->afs1[k]) || isinf(ma->afs1[k])) return -1.;
         }
         for (k = 0, sum = 0.; k <= ma->M; ++k) {
                 if (isnan(ma->afs1[k]) || isinf(ma->afs1[k])) return -1.;
         }
         for (k = 0, sum = 0.; k <= ma->M; ++k) {
@@ -348,6 +360,7 @@ int bcf_p1_cal(bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst)
  {
         int i, k;
         long double sum = 0.;
  {
         int i, k;
         long double sum = 0.;
+       ma->is_indel = bcf_is_indel(b);
         // set PL and PL_len
         for (i = 0; i < b->n_gi; ++i) {
                 if (b->gi[i].fmt == bcf_str2int("PL", 2)) {
         // set PL and PL_len
         for (i = 0; i < b->n_gi; ++i) {
                 if (b->gi[i].fmt == bcf_str2int("PL", 2)) {
@@ -378,6 +391,19 @@ int bcf_p1_cal(bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst)
                         flast = rst->f_em;
                 }
         }
                         flast = rst->f_em;
                 }
         }
+       { // estimate equal-tail credible interval (95% level)
+               int l, h;
+               double p;
+               for (i = 0, p = 0.; i < ma->M; ++i)
+                       if (p + ma->afs1[i] > 0.025) break;
+                       else p += ma->afs1[i];
+               l = i;
+               for (i = ma->M-1, p = 0.; i >= 0; --i)
+                       if (p + ma->afs1[i] > 0.025) break;
+                       else p += ma->afs1[i];
+               h = i;
+               rst->cil = (double)(ma->M - h) / ma->M; rst->cih = (double)(ma->M - l) / ma->M;
+       }
         rst->g[0] = rst->g[1] = rst->g[2] = -1.;
         contrast(ma, rst->pc);
         return 0;
         rst->g[0] = rst->g[1] = rst->g[2] = -1.;
         contrast(ma, rst->pc);
         return 0;
diff --git a/bcftools/prob1.h b/bcftools/prob1.h

index 7158fe29f89bee73dda2f2bdc9426af038a363c6..6d9d28e3cee6c7d0a66efbd5314912ac511cd384 100644 (file)
--- a/bcftools/prob1.h
+++ b/bcftools/prob1.h
@@ -9,6 +9,7 @@ typedef struct __bcf_p1aux_t bcf_p1aux_t;
  typedef struct {
         int rank0;
         double f_em, f_exp, f_flat, p_ref;
  typedef struct {
         int rank0;
         double f_em, f_exp, f_flat, p_ref;
+       double cil, cih;
         double pc[4];
         double g[3];
  } bcf_p1rst_t;
         double pc[4];
         double g[3];
  } bcf_p1rst_t;
diff --git a/bcftools/vcf.c b/bcftools/vcf.c

index ebca869ed999faaed7e782f3a3c410badc062c1b..9b661ff5b4fbc29dde6a53b5e71c027f131a81f4 100644 (file)
--- a/bcftools/vcf.c
+++ b/bcftools/vcf.c
@@ -156,7 +156,7 @@ int vcf_read(bcf_t *bp, bcf_hdr_t *h, bcf1_t *b)
                                 for (i = 0; i < b->n_gi; ++i) {
                                         if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
                                                 ((uint8_t*)b->gi[i].data)[k-9] = 1<<7;
                                 for (i = 0; i < b->n_gi; ++i) {
                                         if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
                                                 ((uint8_t*)b->gi[i].data)[k-9] = 1<<7;
-                                       } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) {
+                                       } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) {
                                                 ((uint8_t*)b->gi[i].data)[k-9] = 0;
                                         } else if (b->gi[i].fmt == bcf_str2int("DP", 2)) {
                                                 ((uint16_t*)b->gi[i].data)[k-9] = 0;
                                                 ((uint8_t*)b->gi[i].data)[k-9] = 0;
                                         } else if (b->gi[i].fmt == bcf_str2int("DP", 2)) {
                                                 ((uint16_t*)b->gi[i].data)[k-9] = 0;
@@ -173,7 +173,7 @@ int vcf_read(bcf_t *bp, bcf_hdr_t *h, bcf1_t *b)
                         for (q = kstrtok(p, ":", &a2), i = 0; q && i < b->n_gi; q = kstrtok(0, 0, &a2), ++i) {
                                 if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
                                         ((uint8_t*)b->gi[i].data)[k-9] = (q[0] - '0')<<3 | (q[2] - '0') | (q[1] == '/'? 0 : 1) << 6;
                         for (q = kstrtok(p, ":", &a2), i = 0; q && i < b->n_gi; q = kstrtok(0, 0, &a2), ++i) {
                                 if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
                                         ((uint8_t*)b->gi[i].data)[k-9] = (q[0] - '0')<<3 | (q[2] - '0') | (q[1] == '/'? 0 : 1) << 6;
-                               } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) {
+                               } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) {
                                         double _x = strtod(q, &q);
                                         int x = (int)(_x + .499);
                                         if (x > 255) x = 255;
                                         double _x = strtod(q, &q);
                                         int x = (int)(_x + .499);
                                         if (x > 255) x = 255;
diff --git a/bcftools/vcfutils.pl b/bcftools/vcfutils.pl

index d0b797143cf56fe0a67383ec4f610096eaecf2c0..bbc479bfffa47aa2b1089b5b6e56414b696be413 100755 (executable)
--- a/bcftools/vcfutils.pl
+++ b/bcftools/vcfutils.pl
@@ -10,11 +10,11 @@ use Getopt::Std;
  exit;
  
  sub main {
  exit;
  
  sub main {
-  my $version = '0.1.0';
    &usage if (@ARGV < 1);
    my $command = shift(@ARGV);
    my %func = (subsam=>\&subsam, listsam=>\&listsam, fillac=>\&fillac, qstats=>\&qstats, varFilter=>\&varFilter,
    &usage if (@ARGV < 1);
    my $command = shift(@ARGV);
    my %func = (subsam=>\&subsam, listsam=>\&listsam, fillac=>\&fillac, qstats=>\&qstats, varFilter=>\&varFilter,
-                         hapmap2vcf=>\&hapmap2vcf, ucscsnp2vcf=>\&ucscsnp2vcf, filter4vcf=>\&filter4vcf, ldstats=>\&ldstats);
+                         hapmap2vcf=>\&hapmap2vcf, ucscsnp2vcf=>\&ucscsnp2vcf, filter4vcf=>\&varFilter, ldstats=>\&ldstats,
+                         gapstats=>\&gapstats);
    die("Unknown command \"$command\".\n") if (!defined($func{$command}));
    &{$func{$command}};
  }
    die("Unknown command \"$command\".\n") if (!defined($func{$command}));
    &{$func{$command}};
  }
@@ -155,7 +155,7 @@ Note: This command discards indels. Output: QUAL #non-indel #SNPs #transitions #
         next if (length($t[3]) != 1 || uc($t[3]) eq 'N');
         $t[3] = uc($t[3]); $t[4] = uc($t[4]);
         my @s = split(',', $t[4]);
         next if (length($t[3]) != 1 || uc($t[3]) eq 'N');
         $t[3] = uc($t[3]); $t[4] = uc($t[4]);
         my @s = split(',', $t[4]);
-       $t[5] = 3 if ($t[5] < 0);
+       $t[5] = 3 if ($t[5] eq '.' || $t[5] < 0);
         next if (length($s[0]) != 1);
         my $hit;
         if ($is_vcf) {
         next if (length($s[0]) != 1);
         my $hit;
         if ($is_vcf) {
@@ -199,37 +199,38 @@ Note: This command discards indels. Output: QUAL #non-indel #SNPs #transitions #
  }
  
  sub varFilter {
  }
  
  sub varFilter {
-  my %opts = (d=>1, D=>10000, l=>30, Q=>25, q=>10, G=>25, s=>100, w=>10, W=>10, N=>2, p=>undef, F=>.001);
-  getopts('pq:d:D:l:Q:w:W:N:G:F:', \%opts);
+  my %opts = (d=>2, D=>10000, a=>2, W=>10, Q=>10, w=>10, p=>undef, 1=>1e-4, 2=>1e-100, 3=>0, 4=>1e-4);
+  getopts('pd:D:W:Q:w:a:1:2:3:4:', \%opts);
    die(qq/
  Usage:   vcfutils.pl varFilter [options] <in.vcf>
  
  Options: -Q INT    minimum RMS mapping quality for SNPs [$opts{Q}]
    die(qq/
  Usage:   vcfutils.pl varFilter [options] <in.vcf>
  
  Options: -Q INT    minimum RMS mapping quality for SNPs [$opts{Q}]
-         -q INT    minimum RMS mapping quality for gaps [$opts{q}]
           -d INT    minimum read depth [$opts{d}]
           -D INT    maximum read depth [$opts{D}]
           -d INT    minimum read depth [$opts{d}]
           -D INT    maximum read depth [$opts{D}]
-
-         -G INT    min indel score for nearby SNP filtering [$opts{G}]
+         -a INT    minimum number of alternate bases [$opts{a}]
           -w INT    SNP within INT bp around a gap to be filtered [$opts{w}]
           -w INT    SNP within INT bp around a gap to be filtered [$opts{w}]
-
-         -W INT    window size for filtering dense SNPs [$opts{W}]
-         -N INT    max number of SNPs in a window [$opts{N}]
-
-         -l INT    window size for filtering adjacent gaps [$opts{l}]
-
+         -W INT    window size for filtering adjacent gaps [$opts{W}]
+         -1 FLOAT  min P-value for strand bias (given PV4) [$opts{1}]
+         -2 FLOAT  min P-value for baseQ bias [$opts{2}]
+         -3 FLOAT  min P-value for mapQ bias [$opts{3}]
+         -4 FLOAT  min P-value for end distance bias [$opts{4}]
           -p        print filtered variants
           -p        print filtered variants
+
+Note: Some of the filters rely on annotations generated by SAMtools\/BCFtools.
  \n/) if (@ARGV == 0 && -t STDIN);
  
    # calculate the window size
  \n/) if (@ARGV == 0 && -t STDIN);
  
    # calculate the window size
-  my ($ol, $ow, $oW) = ($opts{l}, $opts{w}, $opts{W});
+  my ($ol, $ow) = ($opts{W}, $opts{w});
    my $max_dist = $ol > $ow? $ol : $ow;
    my $max_dist = $ol > $ow? $ol : $ow;
-  $max_dist = $oW if ($max_dist < $oW);
    # the core loop
    # the core loop
-  my @staging; # (indel_filtering_score, flt_tag)
+  my @staging; # (indel_filtering_score, flt_tag, indel_span; chr, pos, ...)
    while (<>) {
         my @t = split;
    while (<>) {
         my @t = split;
-       next if (/^#/);
+    if (/^#/) {
+         print; next;
+       }
         next if ($t[4] eq '.'); # skip non-var sites
         next if ($t[4] eq '.'); # skip non-var sites
+       # check if the site is a SNP
         my $is_snp = 1;
         if (length($t[3]) > 1) {
           $is_snp = 0;
         my $is_snp = 1;
         if (length($t[3]) > 1) {
           $is_snp = 0;
@@ -245,24 +246,18 @@ Options: -Q INT    minimum RMS mapping quality for SNPs [$opts{Q}]
           last if ($staging[0][3] eq $t[0] && $staging[0][4] + $staging[0][2] + $max_dist >= $t[1]);
           varFilter_aux(shift(@staging), $opts{p}); # calling a function is a bit slower, not much
         }
           last if ($staging[0][3] eq $t[0] && $staging[0][4] + $staging[0][2] + $max_dist >= $t[1]);
           varFilter_aux(shift(@staging), $opts{p}); # calling a function is a bit slower, not much
         }
-       my ($flt, $score) = (0, -1);
-
-       # collect key annotations
-       my ($dp, $mq, $af) = (-1, -1, 1);
-       if ($t[7] =~ /DP=(\d+)/i) {
-         $dp = $1;
-       } elsif ($t[7] =~ /DP4=(\d+),(\d+),(\d+),(\d+)/i) {
+       my $flt = 0;
+       # parse annotations
+       my ($dp, $mq, $dp_alt) = (-1, -1, -1);
+       if ($t[7] =~ /DP4=(\d+),(\d+),(\d+),(\d+)/i) {
           $dp = $1 + $2 + $3 + $4;
           $dp = $1 + $2 + $3 + $4;
+         $dp_alt = $3 + $4;
         }
         }
-       if ($t[7] =~ /MQ=(\d+)/i) {
-         $mq = $1;
-       }
-       if ($t[7] =~ /AF=([^\s;=]+)/i) {
-         $af = $1;
-       } elsif ($t[7] =~ /AF1=([^\s;=]+)/i) {
-         $af = $1;
+       if ($t[7] =~ /DP=(\d+)/i) {
+         $dp = $1;
         }
         }
-       # the depth filter
+       $mq = $1 if ($t[7] =~ /MQ=(\d+)/i);
+       # the depth and mapQ filter
         if ($dp >= 0) {
           if ($dp < $opts{d}) {
                 $flt = 2;
         if ($dp >= 0) {
           if ($dp < $opts{d}) {
                 $flt = 2;
@@ -270,58 +265,40 @@ Options: -Q INT    minimum RMS mapping quality for SNPs [$opts{Q}]
                 $flt = 3;
           }
         }
                 $flt = 3;
           }
         }
+       $flt = 4 if ($dp_alt >= 0 && $dp_alt < $opts{a});
+       $flt = 1 if ($flt == 0 && $mq >= 0 && $mq < $opts{Q});
+       $flt = 7 if ($flt == 0 && /PV4=([^,]+),([^,]+),([^,]+),([^,;\t]+)/
+                                && ($1<$opts{1} || $2<$opts{2} || $3<$opts{3} || $4<$opts{4}));
  
         # site dependent filters
  
         # site dependent filters
-       my $dlen = 0;
+       my ($rlen, $indel_score) = (0, -1); # $indel_score<0 for SNPs
         if ($flt == 0) {
           if (!$is_snp) { # an indel
         if ($flt == 0) {
           if (!$is_snp) { # an indel
-        # If deletion, remember the length of the deletion
-               $dlen = length($t[3]) - 1;
-               $flt = 1 if ($mq < $opts{q});
+               $rlen = length($t[3]) - 1;
+               $indel_score = $t[5] * 100 + $dp_alt;
                 # filtering SNPs
                 # filtering SNPs
-               if ($t[5] >= $opts{G}) {
-                 for my $x (@staging) {
-            # Is it a SNP and is it outside the SNP filter window?
-                       next if ($x->[0] >= 0 || $x->[4] + $x->[2] + $ow < $t[1]);
-                       $x->[1] = 5 if ($x->[1] == 0);
-                 }
+               for my $x (@staging) {
+                 next if ($x->[0] >= 0 || $x->[1] || $x->[4] + $x->[2] + $ow < $t[1]);
+                 $x->[1] = 5;
                 }
                 }
-               # the indel filtering score
-               $score = $t[5];
                 # check the staging list for indel filtering
                 for my $x (@staging) {
                 # check the staging list for indel filtering
                 for my $x (@staging) {
-          # Is it a SNP and is it outside the gap filter window
-                 next if ($x->[0] < 0 || $x->[4] + $x->[2] + $ol < $t[1]);
-                 if ($x->[0] < $score) {
+                 next if ($x->[0] < 0 || $x->[1] || $x->[4] + $x->[2] + $ol < $t[1]);
+                 if ($x->[0] < $indel_score) {
                         $x->[1] = 6;
                   } else {
                         $flt = 6; last;
                   }
                 }
           } else { # a SNP
                         $x->[1] = 6;
                   } else {
                         $flt = 6; last;
                   }
                 }
           } else { # a SNP
-               $flt = 1 if ($mq < $opts{Q});
-               # check adjacent SNPs
-               my $k = 1;
                 for my $x (@staging) {
                 for my $x (@staging) {
-                 ++$k if ($x->[0] < 0 && -($x->[0] + 1) > $opts{F} && $x->[4] + $x->[2] + $oW >= $t[1] && ($x->[1] == 0 || $x->[1] == 4 || $x->[1] == 5));
-               }
-               # filtering is necessary
-               if ($k > $opts{N}) {
-                 $flt = 4;
-                 for my $x (@staging) {
-                        $x->[1] = 4 if ($x->[0] < 0 && $x->[4] + $x->[2] + $oW >= $t[1] && $x->[1] == 0);
-                 }
-               } else { # then check gap filter
-                 for my $x (@staging) {
-                       next if ($x->[0] < 0 || $x->[4] + $x->[2] + $ow < $t[1]);
-                       if ($x->[0] >= $opts{G}) {
-                         $flt = 5; last;
-                       }
-                 }
+                 next if ($x->[0] < 0 || $x->[1] || $x->[4] + $x->[2] + $ow < $t[1]);
+                 $flt = 5;
+                 last;
                 }
           }
         }
                 }
           }
         }
-       push(@staging, [$score < 0? -$af-1 : $score, $flt, $dlen, @t]);
+       push(@staging, [$indel_score, $flt, $rlen, @t]);
    }
    # output the last few elements in the staging list
    while (@staging) {
    }
    # output the last few elements in the staging list
    while (@staging) {
@@ -334,47 +311,35 @@ sub varFilter_aux {
    if ($first->[1] == 0) {
         print join("\t", @$first[3 .. @$first-1]), "\n";
    } elsif ($is_print) {
    if ($first->[1] == 0) {
         print join("\t", @$first[3 .. @$first-1]), "\n";
    } elsif ($is_print) {
-       print STDERR join("\t", substr("UQdDWGgsiX", $first->[1], 1), @$first[3 .. @$first-1]), "\n";
+       print STDERR join("\t", substr("UQdDaGgP", $first->[1], 1), @$first[3 .. @$first-1]), "\n";
    }
  }
  
    }
  }
  
-sub filter4vcf {
-  my %opts = (d=>3, D=>2000, 1=>1e-4, 2=>1e-100, 3=>0, 4=>1e-4, Q=>10, q=>3);
-  getopts('d:D:1:2:3:4:Q:q:', \%opts);
-  die(qq/
-Usage:   vcfutils.pl filter4vcf [options] <in.vcf>
-
-Options: -d INT     min total depth (given DP or DP4) [$opts{d}]
-         -D INT     max total depth [$opts{D}]
-         -q INT     min SNP quality [$opts{q}]
-         -Q INT     min RMS mapQ (given MQ) [$opts{Q}]
-         -1 FLOAT   min P-value for strand bias (given PV4) [$opts{1}]
-         -2 FLOAT   min P-value for baseQ bias [$opts{2}]
-         -3 FLOAT   min P-value for mapQ bias [$opts{3}]
-         -4 FLOAT   min P-value for end distance bias [$opts{4}]\n
-/) if (@ARGV == 0 && -t STDIN);
-
-  my %ts = (AG=>1, GA=>1, CT=>1, TC=>1);
-
-  my @n = (0, 0);
+sub gapstats {
+  my (@c0, @c1);
+  $c0[$_] = $c1[$_] = 0 for (0 .. 10000);
    while (<>) {
    while (<>) {
-       if (/^#/) {
-         print;
-         next;
-       }
-       next if (/PV4=([^,]+),([^,]+),([^,]+),([^,;\t]+)/ && ($1<$opts{1} || $2<$opts{2} || $3<$opts{3} || $4<$opts{4}));
-       my $depth = -1;
-       $depth = $1 if (/DP=(\d+)/);
-       $depth = $1+$2+$3+$4 if (/DP4=(\d+),(\d+),(\d+),(\d+)/);
-       next if ($depth > 0 && ($depth < $opts{d} || $depth > $opts{D}));
-       next if (/MQ=(\d+)/ && $1 < $opts{Q});
+       next if (/^#/);
         my @t = split;
         my @t = split;
-       next if ($t[5] >= 0 && $t[5] < $opts{q});
-       ++$n[0];
+       next if (length($t[3]) == 1 && $t[4] =~ /^[A-Za-z](,[A-Za-z])*$/); # not an indel
         my @s = split(',', $t[4]);
         my @s = split(',', $t[4]);
-       ++$n[1] if ($ts{$t[3].$s[0]});
-       print;
+       for my $x (@s) {
+         my $l = length($x) - length($t[3]) + 5000;
+         if ($x =~ /^-/) {
+               $l = -(length($x) - 1) + 5000;
+         } elsif ($x =~ /^\+/) {
+               $l = length($x) - 1 + 5000;
+         }
+         $c0[$l] += 1 / @s;
+       }
+  }
+  for (my $i = 0; $i < 10000; ++$i) {
+       next if ($c0[$i] == 0);
+       $c1[0] += $c0[$i];
+       $c1[1] += $c0[$i] if (($i-5000)%3 == 0);
+       printf("C\t%d\t%.2f\n", ($i-5000), $c0[$i]);
    }
    }
+  printf("3\t%d\t%d\t%.3f\n", $c1[0], $c1[1], $c1[1]/$c1[0]);
  }
  
  sub ucscsnp2vcf {
  }
  
  sub ucscsnp2vcf {
@@ -470,7 +435,6 @@ Command: subsam       get a subset of samples
           fillac       fill the allele count field
           qstats       SNP stats stratified by QUAL
           varFilter    filtering short variants
           fillac       fill the allele count field
           qstats       SNP stats stratified by QUAL
           varFilter    filtering short variants
-         filter4vcf   filtering VCFs produced by samtools+bcftools
           hapmap2vcf   convert the hapmap format to VCF
           ucscsnp2vcf  convert UCSC SNP SQL dump to VCF
  \n/);
           hapmap2vcf   convert the hapmap format to VCF
           ucscsnp2vcf  convert UCSC SNP SQL dump to VCF
  \n/);
diff --git a/kaln.c b/kaln.c

index 680779113ce7c2220434727cf578845d1e840d40..9c0bbaa437ff5654d7a181b496bb9d58c9c30558 100644 (file)
--- a/kaln.c
+++ b/kaln.c
@@ -73,9 +73,19 @@ int aln_sm_blast[] = {
         -2, -2, -2, -2, -2
  };
  
         -2, -2, -2, -2, -2
  };
  
+int aln_sm_qual[] = {
+         0, -23, -23, -23, 0,
+       -23,   0, -23, -23, 0,
+       -23, -23,   0, -23, 0,
+       -23, -23, -23,   0, 0,
+         0,   0,   0,   0, 0
+};
+
  ka_param_t ka_param_blast = {  5,  2,   5, 2, aln_sm_blast, 5, 50 };
  ka_param_t ka_param_aa2aa = { 10,  2,  10, 2, aln_sm_blosum62, 22, 50 };
  
  ka_param_t ka_param_blast = {  5,  2,   5, 2, aln_sm_blast, 5, 50 };
  ka_param_t ka_param_aa2aa = { 10,  2,  10, 2, aln_sm_blosum62, 22, 50 };
  
+ka_param2_t ka_param2_qual  = { 37, 11, 37, 11, 37, 11, 0, 0, aln_sm_qual, 5, 50 };
+
  static uint32_t *ka_path2cigar32(const path_t *path, int path_len, int *n_cigar)
  {
         int i, n;
  static uint32_t *ka_path2cigar32(const path_t *path, int path_len, int *n_cigar)
  {
         int i, n;
@@ -176,7 +186,7 @@ static uint32_t *ka_path2cigar32(const path_t *path, int path_len, int *n_cigar)
  }
  
  typedef struct {
  }
  
  typedef struct {
-       uint8_t Mt:3, It:2, Dt:2;
+       uint8_t Mt:3, It:2, Dt:3;
  } dpcell_t;
  
  typedef struct {
  } dpcell_t;
  
  typedef struct {
@@ -208,7 +218,7 @@ uint32_t *ka_global_core(uint8_t *seq1, int len1, uint8_t *seq2, int len2, const
         score_matrix = ap->matrix;
         N_MATRIX_ROW = ap->row;
  
         score_matrix = ap->matrix;
         N_MATRIX_ROW = ap->row;
  
-       *n_cigar = 0;
+       if (n_cigar) *n_cigar = 0;
         if (len1 == 0 || len2 == 0) return 0;
  
         /* calculate b1 and b2 */
         if (len1 == 0 || len2 == 0) return 0;
  
         /* calculate b1 and b2 */
@@ -371,211 +381,106 @@ uint32_t *ka_global_core(uint8_t *seq1, int len1, uint8_t *seq2, int len2, const
         return cigar;
  }
  
         return cigar;
  }
  
-/*****************************************
- * Probabilistic banded glocal alignment *
- *****************************************/
-
-static float g_qual2prob[256];
-
-#define EI .25
-#define EM .3333333333333
-#define set_u(u, b, i, k) { int x=(i)-(b); x=x>0?x:0; (u)=((k)-x+1)*3; }
-
-ka_probpar_t ka_probpar_def = { 0.001, 0.1, 10 };
-
-/*
-  The topology of the profile HMM:
-
-           /\             /\        /\             /\
-           I[1]           I[k-1]    I[k]           I[L]
-            ^   \      \    ^    \   ^   \      \   ^
-            |    \      \   |     \  |    \      \  |
-    M[0]   M[1] -> ... -> M[k-1] -> M[k] -> ... -> M[L]   M[L+1]
-                \      \/        \/      \/      /
-                 \     /\        /\      /\     /
-                       -> D[k-1] -> D[k] ->
-
-   M[0] points to every {M,I}[k] and every {M,I}[k] points M[L+1].
-
-   On input, _ref is the reference sequence and _query is the query
-   sequence. Both are sequences of 0/1/2/3/4 where 4 stands for an
-   ambiguous residue. iqual is the base quality. c sets the gap open
-   probability, gap extension probability and band width.
-
-   On output, state and q are arrays of length l_query. The higher 30
-   bits give the reference position the query base is matched to and the
-   lower two bits can be 0 (an alignment match) or 1 (an
-   insertion). q[i] gives the phred scaled posterior probability of
-   state[i] being wrong.
- */
-int ka_prob_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual,
-                                  const ka_probpar_t *c, int *state, uint8_t *q)
+typedef struct {
+       int M, I, D;
+} score_aux_t;
+
+#define MINUS_INF -0x40000000
+
+// matrix: len2 rows and len1 columns
+int ka_global_score(const uint8_t *_seq1, int len1, const uint8_t *_seq2, int len2, const ka_param2_t *ap)
  {
  {
-       double **f, **b, *s, m[9], sI, sM, bI, bM, pb;
-       float *qual, *_qual;
-       const uint8_t *ref, *query;
-       int bw, bw2, i, k, is_diff = 0;
-
-       /*** initialization ***/
-       ref = _ref - 1; query = _query - 1; // change to 1-based coordinate
-       bw = l_ref > l_query? l_ref : l_query;
-       if (bw > c->bw) bw = c->bw;
-       if (bw < abs(l_ref - l_query)) bw = abs(l_ref - l_query);
-       bw2 = bw * 2 + 1;
-       // allocate the forward and backward matrices f[][] and b[][] and the scaling array s[]
-       f = calloc(l_query+1, sizeof(void*));
-       b = calloc(l_query+1, sizeof(void*));
-       for (i = 0; i <= l_query; ++i) {
-               f[i] = calloc(bw2 * 3 + 6, sizeof(double)); // FIXME: this is over-allocated for very short seqs
-               b[i] = calloc(bw2 * 3 + 6, sizeof(double));
-       }
-       s = calloc(l_query+2, sizeof(double)); // s[] is the scaling factor to avoid underflow
-       // initialize qual
-       _qual = calloc(l_query, sizeof(float));
-       if (g_qual2prob[0] == 0)
-               for (i = 0; i < 256; ++i)
-                       g_qual2prob[i] = pow(10, -i/10.);
-       for (i = 0; i < l_query; ++i) _qual[i] = g_qual2prob[iqual? iqual[i] : 30];
-       qual = _qual - 1;
-       // initialize transition probability
-       sM = sI = 1. / (2 * l_query + 2); // the value here seems not to affect results; FIXME: need proof
-       m[0*3+0] = (1 - c->d - c->d) * (1 - sM); m[0*3+1] = m[0*3+2] = c->d * (1 - sM);
-       m[1*3+0] = (1 - c->e) * (1 - sI); m[1*3+1] = c->e * (1 - sI); m[1*3+2] = 0.;
-       m[2*3+0] = 1 - c->e; m[2*3+1] = 0.; m[2*3+2] = c->e;
-       bM = (1 - c->d) / l_query; bI = c->d / l_query; // (bM+bI)*l_query==1
-       /*** forward ***/
-       // f[0]
-       set_u(k, bw, 0, 0);
-       f[0][k] = s[0] = 1.;
-       { // f[1]
-               double *fi = f[1], sum;
-               int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1, _beg, _end;
-               for (k = beg, sum = 0.; k <= end; ++k) {
-                       int u;
-                       double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM;
-                       set_u(u, bw, 1, k);
-                       fi[u+0] = e * bM; fi[u+1] = EI * bI;
-                       sum += fi[u] + fi[u+1];
-               }
-               // rescale
-               s[1] = sum;
-               set_u(_beg, bw, 1, beg); set_u(_end, bw, 1, end); _end += 2;
-               for (k = _beg; k <= _end; ++k) fi[k] /= sum;
-       }
-       // f[2..l_query]
-       for (i = 2; i <= l_query; ++i) {
-               double *fi = f[i], *fi1 = f[i-1], sum, qli = qual[i];
-               int beg = 1, end = l_ref, x, _beg, _end;
-               uint8_t qyi = query[i];
-               x = i - bw; beg = beg > x? beg : x; // band start
-               x = i + bw; end = end < x? end : x; // band end
-               for (k = beg, sum = 0.; k <= end; ++k) {
-                       int u, v11, v01, v10;
-                       double e;
-                       e = (ref[k] > 3 || qyi > 3)? 1. : ref[k] == qyi? 1. - qli : qli * EM;
-                       set_u(u, bw, i, k); set_u(v11, bw, i-1, k-1); set_u(v10, bw, i-1, k); set_u(v01, bw, i, k-1);
-                       fi[u+0] = e * (m[0] * fi1[v11+0] + m[3] * fi1[v11+1] + m[6] * fi1[v11+2]);
-                       fi[u+1] = EI * (m[1] * fi1[v10+0] + m[4] * fi1[v10+1]);
-                       fi[u+2] = m[2] * fi[v01+0] + m[8] * fi[v01+2];
-                       sum += fi[u] + fi[u+1] + fi[u+2];
-//                     fprintf(stderr, "F (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, fi[u], fi[u+1], fi[u+2]); // DEBUG
-               }
-               // rescale
-               s[i] = sum;
-               set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2;
-               for (k = _beg, sum = 1./sum; k <= _end; ++k) fi[k] *= sum;
-       }
-       { // f[l_query+1]
-               double sum;
-               for (k = 1, sum = 0.; k <= l_ref; ++k) {
-                       int u;
-                       set_u(u, bw, l_query, k);
-                       if (u < 3 || u >= bw2*3+3) continue;
-                   sum += f[l_query][u+0] * sM + f[l_query][u+1] * sI;
-               }
-               s[l_query+1] = sum; // the last scaling factor
+       
+#define __score_aux(_p, _q0, _sc, _io, _ie, _do, _de) {                                        \
+               int t1, t2;                                                                                                             \
+               score_aux_t *_q;                                                                                                \
+               _q = _q0;                                                                                                               \
+               _p->M = _q->M >= _q->I? _q->M : _q->I;                                                  \
+               _p->M = _p->M >= _q->D? _p->M : _q->D;                                                  \
+               _p->M += (_sc);                                                                                                 \
+               ++_q;      t1 = _q->M - _io - _ie; t2 = _q->I - _ie; _p->I = t1 >= t2? t1 : t2; \
+               _q = _p-1; t1 = _q->M - _do - _de; t2 = _q->D - _de; _p->D = t1 >= t2? t1 : t2; \
         }
         }
-       /*** backward ***/
-       // b[l_query] (b[l_query+1][0]=1 and thus \tilde{b}[][]=1/s[l_query+1]; this is where s[l_query+1] comes from)
-       for (k = 1; k <= l_ref; ++k) {
-               int u;
-               double *bi = b[l_query];
-               set_u(u, bw, l_query, k);
-               if (u < 3 || u >= bw2*3+3) continue;
-               bi[u+0] = sM / s[l_query] / s[l_query+1]; bi[u+1] = sI / s[l_query] / s[l_query+1];
+
+       int i, j, bw, scmat_size = ap->row, *scmat = ap->matrix, ret;
+       const uint8_t *seq1, *seq2;
+       score_aux_t *curr, *last, *swap;
+       bw = abs(len1 - len2) + ap->band_width;
+       i = len1 > len2? len1 : len2;
+       if (bw > i + 1) bw = i + 1;
+       seq1 = _seq1 - 1; seq2 = _seq2 - 1;
+       curr = calloc(len1 + 2, sizeof(score_aux_t));
+       last = calloc(len1 + 2, sizeof(score_aux_t));
+       { // the zero-th row
+               int x, end = len1;
+               score_aux_t *p;
+               j = 0;
+               x = j + bw; end = len1 < x? len1 : x; // band end
+               p = curr;
+               p->M = 0; p->I = p->D = MINUS_INF;
+               for (i = 1, p = &curr[1]; i <= end; ++i, ++p)
+                       p->M = p->I = MINUS_INF, p->D = -(ap->edo + ap->ede * i);
+               p->M = p->I = p->D = MINUS_INF;
+               swap = curr; curr = last; last = swap;
         }
         }
-       // b[l_query-1..1]
-       for (i = l_query - 1; i >= 1; --i) {
-               int beg = 1, end = l_ref, x, _beg, _end;
-               double *bi = b[i], *bi1 = b[i+1], y = (i > 1), qli1 = qual[i+1];
-               uint8_t qyi1 = query[i+1];
-               x = i - bw; beg = beg > x? beg : x;
-               x = i + bw; end = end < x? end : x;
-               for (k = end; k >= beg; --k) {
-                       int u, v11, v01, v10;
-                       double e;
-                       set_u(u, bw, i, k); set_u(v11, bw, i+1, k+1); set_u(v10, bw, i+1, k); set_u(v01, bw, i, k+1);
-                       e = (k >= l_ref? 0 : (ref[k+1] > 3 || qyi1 > 3)? 1. : ref[k+1] == qyi1? 1. - qli1 : qli1 * EM) * bi1[v11];
-                       bi[u+0] = e * m[0] + EI * m[1] * bi1[v10+1] + m[2] * bi[v01+2]; // bi1[v11] has been foled into e.
-                       bi[u+1] = e * m[3] + EI * m[4] * bi1[v10+1];
-                       bi[u+2] = (e * m[6] + m[8] * bi[v01+2]) * y;
-//                     fprintf(stderr, "B (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, bi[u], bi[u+1], bi[u+2]); // DEBUG
+       for (j = 1; j < len2; ++j) {
+               int x, beg = 0, end = len1, *scrow, col_end;
+               score_aux_t *p;
+               x = j - bw; beg =    0 > x?    0 : x; // band start
+               x = j + bw; end = len1 < x? len1 : x; // band end
+               if (beg == 0) { // from zero-th column
+                       p = curr;
+                       p->M = p->D = MINUS_INF; p->I = -(ap->eio + ap->eie * j);
+                       ++beg; // then beg = 1
                 }
                 }
-               // rescale
-               set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2;
-               for (k = _beg, y = 1./s[i]; k <= _end; ++k) bi[k] *= y;
-       }
-       { // b[0]
-               int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1;
-               double sum = 0.;
-               for (k = end; k >= beg; --k) {
-                       int u;
-                       double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM;
-                       set_u(u, bw, 1, k);
-                       if (u < 3 || u >= bw2*3+3) continue;
-                   sum += e * b[1][u+0] * bM + EI * b[1][u+1] * bI;
+               scrow = scmat + seq2[j] * scmat_size;
+               if (end == len1) col_end = 1, --end;
+               else col_end = 0;
+               for (i = beg, p = &curr[beg]; i <= end; ++i, ++p)
+                       __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->iio, ap->iie, ap->ido, ap->ide);
+               if (col_end) {
+                       __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->eio, ap->eie, ap->ido, ap->ide);
+                       ++p;
                 }
                 }
-               set_u(k, bw, 0, 0);
-               pb = b[0][k] = sum / s[0]; // if everything works as is expected, pb == 1.0
+               p->M = p->I = p->D = MINUS_INF;
+//             for (i = 0; i <= len1; ++i) printf("(%d,%d,%d) ", curr[i].M, curr[i].I, curr[i].D); putchar('\n');
+               swap = curr; curr = last; last = swap;
         }
         }
-       is_diff = fabs(pb - 1.) > 1e-7? 1 : 0;
-       /*** MAP ***/
-       for (i = 1; i <= l_query; ++i) {
-               double sum = 0., *fi = f[i], *bi = b[i], max = 0.;
-               int beg = 1, end = l_ref, x, max_k = -1;
-               x = i - bw; beg = beg > x? beg : x;
-               x = i + bw; end = end < x? end : x;
-               for (k = beg; k <= end; ++k) {
-                       int u;
-                       double z;
-                       set_u(u, bw, i, k);
-                       z = fi[u+0] * bi[u+0]; if (z > max) max = z, max_k = (k-1)<<2 | 0; sum += z;
-                       z = fi[u+1] * bi[u+1]; if (z > max) max = z, max_k = (k-1)<<2 | 1; sum += z;
+       { // the last row
+               int x, beg = 0, *scrow;
+               score_aux_t *p;
+               j = len2;
+               x = j - bw; beg = 0 > x?    0 : x; // band start
+               if (beg == 0) { // from zero-th column
+                       p = curr;
+                       p->M = p->D = MINUS_INF; p->I = -(ap->eio + ap->eie * j);
+                       ++beg; // then beg = 1
                 }
                 }
-               max /= sum; sum *= s[i]; // if everything works as is expected, sum == 1.0
-               if (state) state[i-1] = max_k;
-               if (q) k = (int)(-4.343 * log(1. - max) + .499), q[i-1] = k > 100? 99 : k;
-#ifdef _MAIN
-               fprintf(stderr, "(%.10lg,%.10lg) (%d,%d:%d)~%lg\n", pb, sum, i-1, max_k>>2, max_k&3, max); // DEBUG
-#endif
-       }
-       /*** free ***/
-       for (i = 0; i <= l_query; ++i) {
-               free(f[i]); free(b[i]);
+               scrow = scmat + seq2[j] * scmat_size;
+               for (i = beg, p = &curr[beg]; i < len1; ++i, ++p)
+                       __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->iio, ap->iie, ap->edo, ap->ede);
+               __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->eio, ap->eie, ap->edo, ap->ede);
+//             for (i = 0; i <= len1; ++i) printf("(%d,%d,%d) ", curr[i].M, curr[i].I, curr[i].D); putchar('\n');
         }
         }
-       free(f); free(b); free(s); free(_qual);
-       return 0;
+       ret = curr[len1].M >= curr[len1].I? curr[len1].M : curr[len1].I;
+       ret = ret >= curr[len1].D? ret : curr[len1].D;
+       free(curr); free(last);
+       return ret;
  }
  
  #ifdef _MAIN
  }
  
  #ifdef _MAIN
-int main()
+int main(int argc, char *argv[])
  {
  {
-       int l_ref = 5, l_query = 4;
-       uint8_t *ref = (uint8_t*)"\0\1\3\3\1";
-       uint8_t *query = (uint8_t*)"\0\3\3\1";
-//     uint8_t *query = (uint8_t*)"\1\3\3\1";
-       static uint8_t qual[4] = {20, 20, 20, 20};
-       ka_prob_glocal(ref, l_ref, query, l_query, qual, &ka_probpar_def, 0, 0);
+//     int len1 = 35, len2 = 35;
+//     uint8_t *seq1 = (uint8_t*)"\0\0\3\3\2\0\0\0\1\0\2\1\2\1\3\2\3\3\3\0\2\3\2\1\1\3\3\3\2\3\3\1\0\0\1";
+//     uint8_t *seq2 = (uint8_t*)"\0\0\3\3\2\0\0\0\1\0\2\1\2\1\3\2\3\3\3\0\2\3\2\1\1\3\3\3\2\3\3\1\0\1\0";
+       int len1 = 4, len2 = 4;
+       uint8_t *seq1 = (uint8_t*)"\1\0\0\1";
+       uint8_t *seq2 = (uint8_t*)"\1\0\1\0";
+       int sc;
+//     ka_global_core(seq1, 2, seq2, 1, &ka_param_qual, &sc, 0);
+       sc = ka_global_score(seq1, len1, seq2, len2, &ka_param2_qual);
+       printf("%d\n", sc);
         return 0;
  }
  #endif
         return 0;
  }
  #endif
diff --git a/kaln.h b/kaln.h

index b24597cfc2ee159529c5746916597591ecababe2..1ece132755e51212ab1aee8e56b7b316484e6d37 100644 (file)
--- a/kaln.h
+++ b/kaln.h
@@ -42,9 +42,12 @@ typedef struct {
  } ka_param_t;
  
  typedef struct {
  } ka_param_t;
  
  typedef struct {
-       float d, e;
-       int bw;
-} ka_probpar_t;
+       int iio, iie, ido, ide;
+       int eio, eie, edo, ede;
+       int *matrix;
+       int row;
+       int band_width;
+} ka_param2_t;
  
  #ifdef __cplusplus
  extern "C" {
  
  #ifdef __cplusplus
  extern "C" {
@@ -52,14 +55,13 @@ extern "C" {
  
         uint32_t *ka_global_core(uint8_t *seq1, int len1, uint8_t *seq2, int len2, const ka_param_t *ap,
                                                          int *_score, int *n_cigar);
  
         uint32_t *ka_global_core(uint8_t *seq1, int len1, uint8_t *seq2, int len2, const ka_param_t *ap,
                                                          int *_score, int *n_cigar);
-       int ka_prob_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual,
-                                          const ka_probpar_t *c, int *state, uint8_t *q);
-
+       int ka_global_score(const uint8_t *_seq1, int len1, const uint8_t *_seq2, int len2, const ka_param2_t *ap);
  #ifdef __cplusplus
  }
  #endif
  
  extern ka_param_t ka_param_blast; /* = { 5, 2, 5, 2, aln_sm_blast, 5, 50 }; */
  #ifdef __cplusplus
  }
  #endif
  
  extern ka_param_t ka_param_blast; /* = { 5, 2, 5, 2, aln_sm_blast, 5, 50 }; */
-extern ka_probpar_t ka_probpar_def; /* { 0.0001, 0.1, 10 } */
+extern ka_param_t ka_param_qual; // only use this for global alignment!!!
+extern ka_param2_t ka_param2_qual; // only use this for global alignment!!!
  
  #endif
  
  #endif
diff --git a/kprobaln.c b/kprobaln.c

new file mode 100644 (file)

index 0000000..5201c1a
--- /dev/null
+++ b/kprobaln.c
@@ -0,0 +1,278 @@
+/* The MIT License
+
+   Copyright (c) 2003-2006, 2008-2010, by Heng Li <lh3lh3@live.co.uk>
+
+   Permission is hereby granted, free of charge, to any person obtaining
+   a copy of this software and associated documentation files (the
+   "Software"), to deal in the Software without restriction, including
+   without limitation the rights to use, copy, modify, merge, publish,
+   distribute, sublicense, and/or sell copies of the Software, and to
+   permit persons to whom the Software is furnished to do so, subject to
+   the following conditions:
+
+   The above copyright notice and this permission notice shall be
+   included in all copies or substantial portions of the Software.
+
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+   SOFTWARE.
+*/
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <math.h>
+#include "kprobaln.h"
+
+/*****************************************
+ * Probabilistic banded glocal alignment *
+ *****************************************/
+
+#define EI .25
+#define EM .33333333333
+
+static float g_qual2prob[256];
+
+#define set_u(u, b, i, k) { int x=(i)-(b); x=x>0?x:0; (u)=((k)-x+1)*3; }
+
+kpa_par_t kpa_par_def = { 0.001, 0.1, 10 };
+kpa_par_t kpa_par_alt = { 0.0001, 0.01, 10 };
+
+/*
+  The topology of the profile HMM:
+
+           /\             /\        /\             /\
+           I[1]           I[k-1]    I[k]           I[L]
+            ^   \      \    ^    \   ^   \      \   ^
+            |    \      \   |     \  |    \      \  |
+    M[0]   M[1] -> ... -> M[k-1] -> M[k] -> ... -> M[L]   M[L+1]
+                \      \/        \/      \/      /
+                 \     /\        /\      /\     /
+                       -> D[k-1] -> D[k] ->
+
+   M[0] points to every {M,I}[k] and every {M,I}[k] points M[L+1].
+
+   On input, _ref is the reference sequence and _query is the query
+   sequence. Both are sequences of 0/1/2/3/4 where 4 stands for an
+   ambiguous residue. iqual is the base quality. c sets the gap open
+   probability, gap extension probability and band width.
+
+   On output, state and q are arrays of length l_query. The higher 30
+   bits give the reference position the query base is matched to and the
+   lower two bits can be 0 (an alignment match) or 1 (an
+   insertion). q[i] gives the phred scaled posterior probability of
+   state[i] being wrong.
+ */
+int kpa_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual,
+                          const kpa_par_t *c, int *state, uint8_t *q)
+{
+       double **f, **b = 0, *s, m[9], sI, sM, bI, bM, pb;
+       float *qual, *_qual;
+       const uint8_t *ref, *query;
+       int bw, bw2, i, k, is_diff = 0, is_backward = 1, Pr;
+
+       /*** initialization ***/
+       is_backward = state && q? 1 : 0;
+       ref = _ref - 1; query = _query - 1; // change to 1-based coordinate
+       bw = l_ref > l_query? l_ref : l_query;
+       if (bw > c->bw) bw = c->bw;
+       if (bw < abs(l_ref - l_query)) bw = abs(l_ref - l_query);
+       bw2 = bw * 2 + 1;
+       // allocate the forward and backward matrices f[][] and b[][] and the scaling array s[]
+       f = calloc(l_query+1, sizeof(void*));
+       if (is_backward) b = calloc(l_query+1, sizeof(void*));
+       for (i = 0; i <= l_query; ++i) {
+               f[i] = calloc(bw2 * 3 + 6, sizeof(double)); // FIXME: this is over-allocated for very short seqs
+               if (is_backward) b[i] = calloc(bw2 * 3 + 6, sizeof(double));
+       }
+       s = calloc(l_query+2, sizeof(double)); // s[] is the scaling factor to avoid underflow
+       // initialize qual
+       _qual = calloc(l_query, sizeof(float));
+       if (g_qual2prob[0] == 0)
+               for (i = 0; i < 256; ++i)
+                       g_qual2prob[i] = pow(10, -i/10.);
+       for (i = 0; i < l_query; ++i) _qual[i] = g_qual2prob[iqual? iqual[i] : 30];
+       qual = _qual - 1;
+       // initialize transition probability
+       sM = sI = 1. / (2 * l_query + 2); // the value here seems not to affect results; FIXME: need proof
+       m[0*3+0] = (1 - c->d - c->d) * (1 - sM); m[0*3+1] = m[0*3+2] = c->d * (1 - sM);
+       m[1*3+0] = (1 - c->e) * (1 - sI); m[1*3+1] = c->e * (1 - sI); m[1*3+2] = 0.;
+       m[2*3+0] = 1 - c->e; m[2*3+1] = 0.; m[2*3+2] = c->e;
+       bM = (1 - c->d) / l_ref; bI = c->d / l_ref; // (bM+bI)*l_ref==1
+       /*** forward ***/
+       // f[0]
+       set_u(k, bw, 0, 0);
+       f[0][k] = s[0] = 1.;
+       { // f[1]
+               double *fi = f[1], sum;
+               int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1, _beg, _end;
+               for (k = beg, sum = 0.; k <= end; ++k) {
+                       int u;
+                       double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM;
+                       set_u(u, bw, 1, k);
+                       fi[u+0] = e * bM; fi[u+1] = EI * bI;
+                       sum += fi[u] + fi[u+1];
+               }
+               // rescale
+               s[1] = sum;
+               set_u(_beg, bw, 1, beg); set_u(_end, bw, 1, end); _end += 2;
+               for (k = _beg; k <= _end; ++k) fi[k] /= sum;
+       }
+       // f[2..l_query]
+       for (i = 2; i <= l_query; ++i) {
+               double *fi = f[i], *fi1 = f[i-1], sum, qli = qual[i];
+               int beg = 1, end = l_ref, x, _beg, _end;
+               uint8_t qyi = query[i];
+               x = i - bw; beg = beg > x? beg : x; // band start
+               x = i + bw; end = end < x? end : x; // band end
+               for (k = beg, sum = 0.; k <= end; ++k) {
+                       int u, v11, v01, v10;
+                       double e;
+                       e = (ref[k] > 3 || qyi > 3)? 1. : ref[k] == qyi? 1. - qli : qli * EM;
+                       set_u(u, bw, i, k); set_u(v11, bw, i-1, k-1); set_u(v10, bw, i-1, k); set_u(v01, bw, i, k-1);
+                       fi[u+0] = e * (m[0] * fi1[v11+0] + m[3] * fi1[v11+1] + m[6] * fi1[v11+2]);
+                       fi[u+1] = EI * (m[1] * fi1[v10+0] + m[4] * fi1[v10+1]);
+                       fi[u+2] = m[2] * fi[v01+0] + m[8] * fi[v01+2];
+                       sum += fi[u] + fi[u+1] + fi[u+2];
+//                     fprintf(stderr, "F (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, fi[u], fi[u+1], fi[u+2]); // DEBUG
+               }
+               // rescale
+               s[i] = sum;
+               set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2;
+               for (k = _beg, sum = 1./sum; k <= _end; ++k) fi[k] *= sum;
+       }
+       { // f[l_query+1]
+               double sum;
+               for (k = 1, sum = 0.; k <= l_ref; ++k) {
+                       int u;
+                       set_u(u, bw, l_query, k);
+                       if (u < 3 || u >= bw2*3+3) continue;
+                   sum += f[l_query][u+0] * sM + f[l_query][u+1] * sI;
+               }
+               s[l_query+1] = sum; // the last scaling factor
+       }
+       { // compute likelihood
+               double p = 1., Pr1 = 0.;
+               for (i = 0; i <= l_query + 1; ++i) {
+                       p *= s[i];
+                       if (p < 1e-100) Pr += -4.343 * log(p), p = 1.;
+               }
+               Pr1 += -4.343 * log(p * l_ref * l_query);
+               Pr = (int)(Pr1 + .499);
+               if (!is_backward) { // skip backward and MAP
+                       for (i = 0; i <= l_query; ++i) free(f[i]);
+                       free(f); free(s); free(_qual);
+                       return Pr;
+               }
+       }
+       /*** backward ***/
+       // b[l_query] (b[l_query+1][0]=1 and thus \tilde{b}[][]=1/s[l_query+1]; this is where s[l_query+1] comes from)
+       for (k = 1; k <= l_ref; ++k) {
+               int u;
+               double *bi = b[l_query];
+               set_u(u, bw, l_query, k);
+               if (u < 3 || u >= bw2*3+3) continue;
+               bi[u+0] = sM / s[l_query] / s[l_query+1]; bi[u+1] = sI / s[l_query] / s[l_query+1];
+       }
+       // b[l_query-1..1]
+       for (i = l_query - 1; i >= 1; --i) {
+               int beg = 1, end = l_ref, x, _beg, _end;
+               double *bi = b[i], *bi1 = b[i+1], y = (i > 1), qli1 = qual[i+1];
+               uint8_t qyi1 = query[i+1];
+               x = i - bw; beg = beg > x? beg : x;
+               x = i + bw; end = end < x? end : x;
+               for (k = end; k >= beg; --k) {
+                       int u, v11, v01, v10;
+                       double e;
+                       set_u(u, bw, i, k); set_u(v11, bw, i+1, k+1); set_u(v10, bw, i+1, k); set_u(v01, bw, i, k+1);
+                       e = (k >= l_ref? 0 : (ref[k+1] > 3 || qyi1 > 3)? 1. : ref[k+1] == qyi1? 1. - qli1 : qli1 * EM) * bi1[v11];
+                       bi[u+0] = e * m[0] + EI * m[1] * bi1[v10+1] + m[2] * bi[v01+2]; // bi1[v11] has been foled into e.
+                       bi[u+1] = e * m[3] + EI * m[4] * bi1[v10+1];
+                       bi[u+2] = (e * m[6] + m[8] * bi[v01+2]) * y;
+//                     fprintf(stderr, "B (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, bi[u], bi[u+1], bi[u+2]); // DEBUG
+               }
+               // rescale
+               set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2;
+               for (k = _beg, y = 1./s[i]; k <= _end; ++k) bi[k] *= y;
+       }
+       { // b[0]
+               int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1;
+               double sum = 0.;
+               for (k = end; k >= beg; --k) {
+                       int u;
+                       double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM;
+                       set_u(u, bw, 1, k);
+                       if (u < 3 || u >= bw2*3+3) continue;
+                   sum += e * b[1][u+0] * bM + EI * b[1][u+1] * bI;
+               }
+               set_u(k, bw, 0, 0);
+               pb = b[0][k] = sum / s[0]; // if everything works as is expected, pb == 1.0
+       }
+       is_diff = fabs(pb - 1.) > 1e-7? 1 : 0;
+       /*** MAP ***/
+       for (i = 1; i <= l_query; ++i) {
+               double sum = 0., *fi = f[i], *bi = b[i], max = 0.;
+               int beg = 1, end = l_ref, x, max_k = -1;
+               x = i - bw; beg = beg > x? beg : x;
+               x = i + bw; end = end < x? end : x;
+               for (k = beg; k <= end; ++k) {
+                       int u;
+                       double z;
+                       set_u(u, bw, i, k);
+                       z = fi[u+0] * bi[u+0]; if (z > max) max = z, max_k = (k-1)<<2 | 0; sum += z;
+                       z = fi[u+1] * bi[u+1]; if (z > max) max = z, max_k = (k-1)<<2 | 1; sum += z;
+               }
+               max /= sum; sum *= s[i]; // if everything works as is expected, sum == 1.0
+               if (state) state[i-1] = max_k;
+               if (q) k = (int)(-4.343 * log(1. - max) + .499), q[i-1] = k > 100? 99 : k;
+#ifdef _MAIN
+               fprintf(stderr, "(%.10lg,%.10lg) (%d,%d:%c,%c:%d) %lg\n", pb, sum, i-1, max_k>>2,
+                               "ACGT"[query[i]], "ACGT"[ref[(max_k>>2)+1]], max_k&3, max); // DEBUG
+#endif
+       }
+       /*** free ***/
+       for (i = 0; i <= l_query; ++i) {
+               free(f[i]); free(b[i]);
+       }
+       free(f); free(b); free(s); free(_qual);
+       return Pr;
+}
+
+#ifdef _MAIN
+#include <unistd.h>
+int main(int argc, char *argv[])
+{
+       uint8_t conv[256], *iqual, *ref, *query;
+       int c, l_ref, l_query, i, q = 30, b = 10, P;
+       while ((c = getopt(argc, argv, "b:q:")) >= 0) {
+               switch (c) {
+               case 'b': b = atoi(optarg); break;
+               case 'q': q = atoi(optarg); break;
+               }
+       }
+       if (optind + 2 > argc) {
+               fprintf(stderr, "Usage: %s [-q %d] [-b %d] <ref> <query>\n", argv[0], q, b); // example: acttc attc
+               return 1;
+       }
+       memset(conv, 4, 256);
+       conv['a'] = conv['A'] = 0; conv['c'] = conv['C'] = 1;
+       conv['g'] = conv['G'] = 2; conv['t'] = conv['T'] = 3;
+       ref = (uint8_t*)argv[optind]; query = (uint8_t*)argv[optind+1];
+       l_ref = strlen((char*)ref); l_query = strlen((char*)query);
+       for (i = 0; i < l_ref; ++i) ref[i] = conv[ref[i]];
+       for (i = 0; i < l_query; ++i) query[i] = conv[query[i]];
+       iqual = malloc(l_query);
+       memset(iqual, q, l_query);
+       kpa_par_def.bw = b;
+       P = kpa_glocal(ref, l_ref, query, l_query, iqual, &kpa_par_alt, 0, 0);
+       fprintf(stderr, "%d\n", P);
+       free(iqual);
+       return 0;
+}
+#endif
diff --git a/kprobaln.h b/kprobaln.h

new file mode 100644 (file)

index 0000000..0357dcc
--- /dev/null
+++ b/kprobaln.h
@@ -0,0 +1,49 @@
+/* The MIT License
+
+   Copyright (c) 2003-2006, 2008, 2009 by Heng Li <lh3@live.co.uk>
+
+   Permission is hereby granted, free of charge, to any person obtaining
+   a copy of this software and associated documentation files (the
+   "Software"), to deal in the Software without restriction, including
+   without limitation the rights to use, copy, modify, merge, publish,
+   distribute, sublicense, and/or sell copies of the Software, and to
+   permit persons to whom the Software is furnished to do so, subject to
+   the following conditions:
+
+   The above copyright notice and this permission notice shall be
+   included in all copies or substantial portions of the Software.
+
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+   SOFTWARE.
+*/
+
+#ifndef LH3_KPROBALN_H_
+#define LH3_KPROBALN_H_
+
+#include <stdint.h>
+
+typedef struct {
+       float d, e;
+       int bw;
+} kpa_par_t;
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+       int kpa_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual,
+                                  const kpa_par_t *c, int *state, uint8_t *q);
+
+#ifdef __cplusplus
+}
+#endif
+
+extern kpa_par_t kpa_par_def, kpa_par_alt;
+
+#endif
diff --git a/misc/samtools.pl b/misc/samtools.pl

index ce8449de622884847003ba538a2cf60c57118c99..d03c1c7f148dcca835946e7ccd9ee52f3b6dc00b 100755 (executable)
--- a/misc/samtools.pl
+++ b/misc/samtools.pl
@@ -10,7 +10,7 @@ my $version = '0.3.3';
  &usage if (@ARGV < 1);
  
  my $command = shift(@ARGV);
  &usage if (@ARGV < 1);
  
  my $command = shift(@ARGV);
-my %func = (showALEN=>\&showALEN, pileup2fq=>\&pileup2fq, varFilter=>\&varFilter,
+my %func = (showALEN=>\&showALEN, pileup2fq=>\&pileup2fq, varFilter=>\&varFilter, plp2vcf=>\&plp2vcf,
                         unique=>\&unique, uniqcmp=>\&uniqcmp, sra2hdr=>\&sra2hdr, sam2fq=>\&sam2fq);
  
  die("Unknown command \"$command\".\n") if (!defined($func{$command}));
                         unique=>\&unique, uniqcmp=>\&uniqcmp, sra2hdr=>\&sra2hdr, sam2fq=>\&sam2fq);
  
  die("Unknown command \"$command\".\n") if (!defined($func{$command}));
@@ -473,6 +473,44 @@ sub uniqcmp_aux {
    close($fh);
  }
  
    close($fh);
  }
  
+sub plp2vcf {
+  while (<>) {
+       my @t = split;
+       next if ($t[3] eq '*/*');
+       if ($t[2] eq '*') { # indel
+         my @s = split("/", $t[3]);
+         my (@a, @b);
+         my ($ref, $alt);
+         for (@s) {
+               next if ($_ eq '*');
+               if (/^-/) {
+                 push(@a, 'N'.substr($_, 1));
+                 push(@b, 'N');
+               } elsif (/^\+/) {
+                 push(@a, 'N');
+                 push(@b, 'N'.substr($_, 1));
+               }
+         }
+         if ($a[0] && $a[1]) {
+               if (length($a[0]) < length($a[1])) {
+                 $ref = $a[1];
+                 $alt = ($b[0] . ('N' x (length($a[1]) - length($a[0])))) . ",$b[1]";
+               } elsif (length($a[0]) > length($a[1])) {
+                 $ref = $a[0];
+                 $alt = ($b[1] . ('N' x (length($a[0]) - length($a[1])))) . ",$b[0]";
+               } else {
+                 $ref = $a[0];
+                 $alt = ($b[0] eq $b[1])? $b[0] : "$b[0],$b[1]";
+               }
+         } else {
+               $ref = $a[0]; $alt = $b[0];
+         }
+         print join("\t", @t[0,1], '.', $ref, $alt, $t[5], '.', '.'), "\n";
+       } else { # SNP
+       }
+  }
+}
+
  #
  # Usage
  #
  #
  # Usage
  #
diff --git a/sam_view.c b/sam_view.c

index d0fdad25445508e90052970a90de92e1f0f61ae5..eb69449f9e44d643ef43f31638fd8534c454c940 100644 (file)
--- a/sam_view.c
+++ b/sam_view.c
@@ -9,6 +9,13 @@
  #include "khash.h"
  KHASH_SET_INIT_STR(rg)
  
  #include "khash.h"
  KHASH_SET_INIT_STR(rg)
  
+// When counting records instead of printing them,
+// data passed to the bam_fetch callback is encapsulated in this struct.
+typedef struct {
+       bam_header_t *header;
+       int *count;
+} count_func_data_t;
+
  typedef khash_t(rg) *rghash_t;
  
  rghash_t g_rghash = 0;
  typedef khash_t(rg) *rghash_t;
  
  rghash_t g_rghash = 0;
@@ -54,7 +61,7 @@ static inline int __g_skip_aln(const bam_header_t *h, const bam1_t *b)
         return 0;
  }
  
         return 0;
  }
  
-// callback function for bam_fetch()
+// callback function for bam_fetch() that prints nonskipped records
  static int view_func(const bam1_t *b, void *data)
  {
         if (!__g_skip_aln(((samfile_t*)data)->header, b))
  static int view_func(const bam1_t *b, void *data)
  {
         if (!__g_skip_aln(((samfile_t*)data)->header, b))
@@ -62,19 +69,30 @@ static int view_func(const bam1_t *b, void *data)
         return 0;
  }
  
         return 0;
  }
  
+// callback function for bam_fetch() that counts nonskipped records
+static int count_func(const bam1_t *b, void *data)
+{
+       if (!__g_skip_aln(((count_func_data_t*)data)->header, b)) {
+               (*((count_func_data_t*)data)->count)++;
+       }
+       return 0;
+}
+
  static int usage(int is_long_help);
  
  int main_samview(int argc, char *argv[])
  {
  static int usage(int is_long_help);
  
  int main_samview(int argc, char *argv[])
  {
-       int c, is_header = 0, is_header_only = 0, is_bamin = 1, ret = 0, is_uncompressed = 0, is_bamout = 0, slx2sngr = 0;
+       int c, is_header = 0, is_header_only = 0, is_bamin = 1, ret = 0, is_uncompressed = 0, is_bamout = 0, slx2sngr = 0, is_count = 0;
         int of_type = BAM_OFDEC, is_long_help = 0;
         int of_type = BAM_OFDEC, is_long_help = 0;
+       int count = 0;
         samfile_t *in = 0, *out = 0;
         char in_mode[5], out_mode[5], *fn_out = 0, *fn_list = 0, *fn_ref = 0, *fn_rg = 0;
  
         /* parse command-line options */
         strcpy(in_mode, "r"); strcpy(out_mode, "w");
         samfile_t *in = 0, *out = 0;
         char in_mode[5], out_mode[5], *fn_out = 0, *fn_list = 0, *fn_ref = 0, *fn_rg = 0;
  
         /* parse command-line options */
         strcpy(in_mode, "r"); strcpy(out_mode, "w");
-       while ((c = getopt(argc, argv, "Sbt:hHo:q:f:F:ul:r:xX?T:CR:")) >= 0) {
+       while ((c = getopt(argc, argv, "Sbct:hHo:q:f:F:ul:r:xX?T:CR:")) >= 0) {
                 switch (c) {
                 switch (c) {
+               case 'c': is_count = 1; break;
                 case 'C': slx2sngr = 1; break;
                 case 'S': is_bamin = 0; break;
                 case 'b': is_bamout = 1; break;
                 case 'C': slx2sngr = 1; break;
                 case 'S': is_bamin = 0; break;
                 case 'b': is_bamout = 1; break;
@@ -133,7 +151,7 @@ int main_samview(int argc, char *argv[])
                 ret = 1;
                 goto view_end;
         }
                 ret = 1;
                 goto view_end;
         }
-       if ((out = samopen(fn_out? fn_out : "-", out_mode, in->header)) == 0) {
+       if (!is_count && (out = samopen(fn_out? fn_out : "-", out_mode, in->header)) == 0) {
                 fprintf(stderr, "[main_samview] fail to open \"%s\" for writing.\n", fn_out? fn_out : "standard output");
                 ret = 1;
                 goto view_end;
                 fprintf(stderr, "[main_samview] fail to open \"%s\" for writing.\n", fn_out? fn_out : "standard output");
                 ret = 1;
                 goto view_end;
@@ -146,7 +164,8 @@ int main_samview(int argc, char *argv[])
                 while ((r = samread(in, b)) >= 0) { // read one alignment from `in'
                         if (!__g_skip_aln(in->header, b)) {
                                 if (slx2sngr) sol2sanger(b);
                 while ((r = samread(in, b)) >= 0) { // read one alignment from `in'
                         if (!__g_skip_aln(in->header, b)) {
                                 if (slx2sngr) sol2sanger(b);
-                               samwrite(out, b); // write the alignment to `out'
+                               if (!is_count) samwrite(out, b); // write the alignment to `out'
+                               count++;
                         }
                 }
                 if (r < -1) {
                         }
                 }
                 if (r < -1) {
@@ -164,14 +183,19 @@ int main_samview(int argc, char *argv[])
                         goto view_end;
                 }
                 for (i = optind + 1; i < argc; ++i) {
                         goto view_end;
                 }
                 for (i = optind + 1; i < argc; ++i) {
-                       int tid, beg, end;
+                       int tid, beg, end, result;
                         bam_parse_region(in->header, argv[i], &tid, &beg, &end); // parse a region in the format like `chr2:100-200'
                         if (tid < 0) { // reference name is not found
                                 fprintf(stderr, "[main_samview] region \"%s\" specifies an unknown reference name. Continue anyway.\n", argv[i]);
                                 continue;
                         }
                         // fetch alignments
                         bam_parse_region(in->header, argv[i], &tid, &beg, &end); // parse a region in the format like `chr2:100-200'
                         if (tid < 0) { // reference name is not found
                                 fprintf(stderr, "[main_samview] region \"%s\" specifies an unknown reference name. Continue anyway.\n", argv[i]);
                                 continue;
                         }
                         // fetch alignments
-                       if (bam_fetch(in->x.bam, idx, tid, beg, end, out, view_func) < 0) {
+                       if (is_count) {
+                               count_func_data_t count_data = { in->header, &count };
+                               result = bam_fetch(in->x.bam, idx, tid, beg, end, &count_data, count_func);
+                       } else
+                               result = bam_fetch(in->x.bam, idx, tid, beg, end, out, view_func);
+                       if (result < 0) {
                                 fprintf(stderr, "[main_samview] retrieval of region \"%s\" failed due to truncated file or corrupt BAM index file\n", argv[i]);
                                 ret = 1;
                                 break;
                                 fprintf(stderr, "[main_samview] retrieval of region \"%s\" failed due to truncated file or corrupt BAM index file\n", argv[i]);
                                 ret = 1;
                                 break;
@@ -181,6 +205,9 @@ int main_samview(int argc, char *argv[])
         }
  
  view_end:
         }
  
  view_end:
+       if (is_count && ret == 0) {
+               printf("%d\n", count);
+       }
         // close files, free and return
         free(fn_list); free(fn_ref); free(fn_out); free(g_library); free(g_rg); free(fn_rg);
         if (g_rghash) {
         // close files, free and return
         free(fn_list); free(fn_ref); free(fn_out); free(g_library); free(g_rg); free(fn_rg);
         if (g_rghash) {
@@ -190,7 +217,8 @@ view_end:
                 kh_destroy(rg, g_rghash);
         }
         samclose(in);
                 kh_destroy(rg, g_rghash);
         }
         samclose(in);
-       samclose(out);
+       if (!is_count)
+               samclose(out);
         return ret;
  }
  
         return ret;
  }
  
@@ -205,6 +233,7 @@ static int usage(int is_long_help)
         fprintf(stderr, "         -u       uncompressed BAM output (force -b)\n");
         fprintf(stderr, "         -x       output FLAG in HEX (samtools-C specific)\n");
         fprintf(stderr, "         -X       output FLAG in string (samtools-C specific)\n");
         fprintf(stderr, "         -u       uncompressed BAM output (force -b)\n");
         fprintf(stderr, "         -x       output FLAG in HEX (samtools-C specific)\n");
         fprintf(stderr, "         -X       output FLAG in string (samtools-C specific)\n");
+       fprintf(stderr, "         -c       print only the count of matching records\n");
         fprintf(stderr, "         -t FILE  list of reference names and lengths (force -S) [null]\n");
         fprintf(stderr, "         -T FILE  reference sequence file (force -S) [null]\n");
         fprintf(stderr, "         -o FILE  output file name [stdout]\n");
         fprintf(stderr, "         -t FILE  list of reference names and lengths (force -S) [null]\n");
         fprintf(stderr, "         -T FILE  reference sequence file (force -S) [null]\n");
         fprintf(stderr, "         -o FILE  output file name [stdout]\n");
diff --git a/samtools.1 b/samtools.1

index cc421d4bc34fd53efa123b4630fcbccb4ca58708..e87bef7e788f09d1c0c38219f231062868a010d4 100644 (file)
--- a/samtools.1
+++ b/samtools.1
@@ -1,4 +1,4 @@
-.TH samtools 1 "27 October 2010" "samtools-0.1.9" "Bioinformatics tools"
+.TH samtools 1 "15 November 2010" "samtools-0.1.10" "Bioinformatics tools"
  .SH NAME
  .PP
  samtools - Utilities for the Sequence Alignment/Map (SAM) format
  .SH NAME
  .PP
  samtools - Utilities for the Sequence Alignment/Map (SAM) format
@@ -20,7 +20,7 @@ samtools faidx ref.fasta
  .PP
  samtools pileup -vcf ref.fasta aln.sorted.bam
  .PP
  .PP
  samtools pileup -vcf ref.fasta aln.sorted.bam
  .PP
-samtools mpileup -C50 -agf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
+samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
  .PP
  samtools tview aln.sorted.bam ref.fasta
  
  .PP
  samtools tview aln.sorted.bam ref.fasta
  
@@ -47,7 +47,7 @@ entire alignment file unless it is asked to do so.
  
  .TP 10
  .B view
  
  .TP 10
  .B view
-samtools view [-bhuHS] [-t in.refList] [-o output] [-f reqFlag] [-F
+samtools view [-bchuHS] [-t in.refList] [-o output] [-f reqFlag] [-F
  skipFlag] [-q minMapQ] [-l library] [-r readGroup] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
  
  Extract/print all or sub alignments in SAM or BAM format. If no region
  skipFlag] [-q minMapQ] [-l library] [-r readGroup] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
  
  Extract/print all or sub alignments in SAM or BAM format. If no region
@@ -65,11 +65,11 @@ format: `chr2' (the whole chr2), `chr2:1000000' (region starting from
  .B -b
  Output in the BAM format.
  .TP
  .B -b
  Output in the BAM format.
  .TP
-.BI -f " INT"
+.BI -f \ INT
  Only output alignments with all bits in INT present in the FLAG
  field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
  .TP
  Only output alignments with all bits in INT present in the FLAG
  field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
  .TP
-.BI -F " INT"
+.BI -F \ INT
  Skip alignments with bits present in INT [0]
  .TP
  .B -h
  Skip alignments with bits present in INT [0]
  .TP
  .B -h
@@ -78,19 +78,19 @@ Include the header in the output.
  .B -H
  Output the header only.
  .TP
  .B -H
  Output the header only.
  .TP
-.BI -l " STR"
+.BI -l \ STR
  Only output reads in library STR [null]
  .TP
  Only output reads in library STR [null]
  .TP
-.BI -o " FILE"
+.BI -o \ FILE
  Output file [stdout]
  .TP
  Output file [stdout]
  .TP
-.BI -q " INT"
+.BI -q \ INT
  Skip alignments with MAPQ smaller than INT [0]
  .TP
  Skip alignments with MAPQ smaller than INT [0]
  .TP
-.BI -r " STR"
+.BI -r \ STR
  Only output reads in read group STR [null]
  .TP
  Only output reads in read group STR [null]
  .TP
-.BI -R " FILE"
+.BI -R \ FILE
  Output reads in read groups listed in
  .I FILE
  [null]
  Output reads in read groups listed in
  .I FILE
  [null]
@@ -100,7 +100,16 @@ Input is in SAM. If @SQ header lines are absent, the
  .B `-t'
  option is required.
  .TP
  .B `-t'
  option is required.
  .TP
-.BI -t " FILE"
+.B -c
+Instead of printing the alignments, only count them and print the
+total number. All filter options, such as
+.B `-f',
+.B `-F'
+and
+.B `-q'
+, are taken into account.
+.TP
+.BI -t \ FILE
  This file is TAB-delimited. Each line must contain the reference name
  and the length of the reference, one line for each distinct reference;
  additional fields are ignored. This file also defines the order of the
  This file is TAB-delimited. Each line must contain the reference name
  and the length of the reference, one line for each distinct reference;
  additional fields are ignored. This file also defines the order of the
@@ -126,135 +135,6 @@ press `?' for help and press `g' to check the alignment start from a
  region in the format like `chr10:10,000,000' or `=10,000,000' when
  viewing the same reference sequence.
  
  region in the format like `chr10:10,000,000' or `=10,000,000' when
  viewing the same reference sequence.
  
-.TP
-.B pileup
-samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] [-l
-in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N nHap] [-r
-pairDiffRate] [-m mask] [-d maxIndelDepth] [-G indelPrior]
-<in.bam>|<in.sam>
-
-Print the alignment in the pileup format. In the pileup format, each
-line represents a genomic position, consisting of chromosome name,
-coordinate, reference base, read bases, read qualities and alignment
-mapping qualities. Information on match, mismatch, indel, strand,
-mapping quality and start and end of a read are all encoded at the read
-base column. At this column, a dot stands for a match to the reference
-base on the forward strand, a comma for a match on the reverse strand,
-a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward
-strand and `acgtn' for a mismatch on the reverse strand. A pattern
-`\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this
-reference position and the next reference position. The length of the
-insertion is given by the integer in the pattern, followed by the
-inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'
-represents a deletion from the reference. The deleted bases will be
-presented as `*' in the following lines. Also at the read base column, a
-symbol `^' marks the start of a read. The ASCII of the character
-following `^' minus 33 gives the mapping quality. A symbol `$' marks the
-end of a read segment.
-
-If option
-.B -c
-is applied, the consensus base, Phred-scaled consensus quality, SNP
-quality (i.e. the Phred-scaled probability of the consensus being
-identical to the reference) and root mean square (RMS) mapping quality
-of the reads covering the site will be inserted between the `reference
-base' and the `read bases' columns. An indel occupies an additional
-line. Each indel line consists of chromosome name, coordinate, a star,
-the genotype, consensus quality, SNP quality, RMS mapping quality, #
-covering reads, the first alllele, the second allele, # reads supporting
-the first allele, # reads supporting the second allele and # reads
-containing indels different from the top two alleles.
-
-The position of indels is offset by -1.
-
-.B OPTIONS:
-.RS
-.TP 10
-.B -B
-Disable the BAQ computation. See the
-.B mpileup
-command for details.
-.TP
-.B -c
-Call the consensus sequence using SOAPsnp consensus model. Options
-.BR -T ", " -N ", " -I " and " -r
-are only effective when
-.BR -c " or " -g
-is in use.
-.TP
-.BI -C " INT"
-Coefficient for downgrading the mapping quality of poorly mapped
-reads. See the
-.B mpileup
-command for details. [0]
-.TP
-.BI -d " INT"
-Use the first
-.I NUM
-reads in the pileup for indel calling for speed up. Zero for unlimited. [1024]
-.TP
-.BI -f " FILE"
-The reference sequence in the FASTA format. Index file
-.I FILE.fai
-will be created if
-absent.
-.TP
-.B -g
-Generate genotype likelihood in the binary GLFv3 format. This option
-suppresses -c, -i and -s. This option is deprecated by the
-.B mpileup
-command.
-.TP
-.B -i
-Only output pileup lines containing indels.
-.TP
-.BI -I " INT"
-Phred probability of an indel in sequencing/prep. [40]
-.TP
-.BI -l " FILE"
-List of sites at which pileup is output. This file is space
-delimited. The first two columns are required to be chromosome and
-1-based coordinate. Additional columns are ignored. It is
-recommended to use option
-.TP
-.BI -m " INT"
-Filter reads with flag containing bits in
-.I INT
-[1796]
-.TP
-.BI -M " INT"
-Cap mapping quality at INT [60]
-.TP
-.BI -N " INT"
-Number of haplotypes in the sample (>=2) [2]
-.TP
-.BI -r " FLOAT"
-Expected fraction of differences between a pair of haplotypes [0.001]
-.TP
-.B -s
-Print the mapping quality as the last column. This option makes the
-output easier to parse, although this format is not space efficient.
-.TP
-.B -S
-The input file is in SAM.
-.TP
-.BI -t " FILE"
-List of reference names ane sequence lengths, in the format described
-for the
-.B import
-command. If this option is present, samtools assumes the input
-.I <in.alignment>
-is in SAM format; otherwise it assumes in BAM format.
-.B -s
-together with
-.B -l
-as in the default format we may not know the mapping quality.
-.TP
-.BI -T " FLOAT"
-The theta parameter (error dependency coefficient) in the maq consensus
-calling model [0.85]
-.RE
-
  .TP
  .B mpileup
  samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
  .TP
  .B mpileup
  samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
@@ -272,37 +152,64 @@ quality (BAQ). BAQ is the Phred-scaled probability of a read base being
  misaligned. Applying this option greatly helps to reduce false SNPs
  caused by misalignments.
  .TP
  misaligned. Applying this option greatly helps to reduce false SNPs
  caused by misalignments.
  .TP
-.BI -C " INT"
+.BI -C \ INT
  Coefficient for downgrading mapping quality for reads containing
  excessive mismatches. Given a read with a phred-scaled probability q of
  being generated from the mapped position, the new mapping quality is
  about sqrt((INT-q)/INT)*INT. A zero value disables this
  Coefficient for downgrading mapping quality for reads containing
  excessive mismatches. Given a read with a phred-scaled probability q of
  being generated from the mapped position, the new mapping quality is
  about sqrt((INT-q)/INT)*INT. A zero value disables this
-functionality; if enabled, the recommended value is 50. [0]
+functionality; if enabled, the recommended value for BWA is 50. [0]
+.TP
+.BI -e \ INT
+Phred-scaled gap extension sequencing error probability. Reducing
+.I INT
+leads to longer indels. [20]
  .TP
  .TP
-.BI -f " FILE"
+.BI -f \ FILE
  The reference file [null]
  .TP
  .B -g
  Compute genotype likelihoods and output them in the binary call format (BCF).
  .TP
  The reference file [null]
  .TP
  .B -g
  Compute genotype likelihoods and output them in the binary call format (BCF).
  .TP
-.B -u
-Similar to
-.B -g
-except that the output is uncompressed BCF, which is preferred for pipeing.
-.TP
-.BI -l " FILE"
+.BI -h \ INT
+Coefficient for modeling homopolymer errors. Given an
+.IR l -long
+homopolymer
+run, the sequencing error of an indel of size
+.I s
+is modeled as
+.IR INT * s / l .
+[100]
+.TP
+.BI -l \ FILE
  File containing a list of sites where pileup or BCF is outputted [null]
  .TP
  File containing a list of sites where pileup or BCF is outputted [null]
  .TP
-.BI -q " INT"
+.BI -o \ INT
+Phred-scaled gap open sequencing error probability. Reducing
+.I INT
+leads to more indel calls. [40]
+.TP
+.BI -P \ STR
+Comma dilimited list of platforms (determined by
+.BR @RG-PL )
+from which indel candidates are obtained. It is recommended to collect
+indel candidates from sequencing technologies that have low indel error
+rate such as ILLUMINA. [all]
+.TP
+.BI -q \ INT
  Minimum mapping quality for an alignment to be used [0]
  .TP
  Minimum mapping quality for an alignment to be used [0]
  .TP
-.BI -Q " INT"
+.BI -Q \ INT
  Minimum base quality for a base to be considered [13]
  .TP
  Minimum base quality for a base to be considered [13]
  .TP
-.BI -r " STR"
+.BI -r \ STR
  Only generate pileup in region
  .I STR
  [all sites]
  Only generate pileup in region
  .I STR
  [all sites]
+.TP
+.B -u
+Similar to
+.B -g
+except that the output is uncompressed BCF, which is preferred for piping.
  .RE
  
  .TP
  .RE
  
  .TP
@@ -336,7 +243,7 @@ Output the final alignment to the standard output.
  .B -n
  Sort by read names rather than by chromosomal coordinates
  .TP
  .B -n
  Sort by read names rather than by chromosomal coordinates
  .TP
-.B -m INT
+.BI -m \ INT
  Approximately the maximum required memory. [500000000]
  .RE
  
  Approximately the maximum required memory. [500000000]
  .RE
  
@@ -359,7 +266,7 @@ and the headers of other files will be ignored.
  .B OPTIONS:
  .RS
  .TP 8
  .B OPTIONS:
  .RS
  .TP 8
-.BI -h " FILE"
+.BI -h \ FILE
  Use the lines of
  .I FILE
  as `@' headers to be copied to
  Use the lines of
  .I FILE
  as `@' headers to be copied to
@@ -370,7 +277,7 @@ replacing any header lines that would otherwise be copied from
  is actually in SAM format, though any alignment records it may contain
  are ignored.)
  .TP
  is actually in SAM format, though any alignment records it may contain
  are ignored.)
  .TP
-.BI -R " STR"
+.BI -R \ STR
  Merge files in the specified region indicated by
  .I STR
  .TP
  Merge files in the specified region indicated by
  .I STR
  .TP
@@ -457,6 +364,11 @@ tag. Output SAM by default.
  .B OPTIONS:
  .RS
  .TP 8
  .B OPTIONS:
  .RS
  .TP 8
+.B -A
+When used jointly with
+.B -r
+this option overwrites the original base quality.
+.TP 8
  .B -e
  Convert a the read base to = if it is identical to the aligned reference
  base. Indel caller does not support the = bases at the moment.
  .B -e
  Convert a the read base to = if it is identical to the aligned reference
  base. Indel caller does not support the = bases at the moment.
@@ -470,14 +382,143 @@ Output compressed BAM
  .B -S
  The input is SAM with header lines
  .TP
  .B -S
  The input is SAM with header lines
  .TP
-.BI -C " INT"
+.BI -C \ INT
  Coefficient to cap mapping quality of poorly mapped reads. See the
  .B pileup
  command for details. [0]
  .TP
  .B -r
  Coefficient to cap mapping quality of poorly mapped reads. See the
  .B pileup
  command for details. [0]
  .TP
  .B -r
-Perform probabilistic realignment to compute BAQ, which will be used to
-cap base quality.
+Compute the BQ tag without changing the base quality.
+.RE
+
+.TP
+.B pileup
+samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] [-l
+in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N nHap] [-r
+pairDiffRate] [-m mask] [-d maxIndelDepth] [-G indelPrior]
+<in.bam>|<in.sam>
+
+Print the alignment in the pileup format. In the pileup format, each
+line represents a genomic position, consisting of chromosome name,
+coordinate, reference base, read bases, read qualities and alignment
+mapping qualities. Information on match, mismatch, indel, strand,
+mapping quality and start and end of a read are all encoded at the read
+base column. At this column, a dot stands for a match to the reference
+base on the forward strand, a comma for a match on the reverse strand,
+a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward
+strand and `acgtn' for a mismatch on the reverse strand. A pattern
+`\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this
+reference position and the next reference position. The length of the
+insertion is given by the integer in the pattern, followed by the
+inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'
+represents a deletion from the reference. The deleted bases will be
+presented as `*' in the following lines. Also at the read base column, a
+symbol `^' marks the start of a read. The ASCII of the character
+following `^' minus 33 gives the mapping quality. A symbol `$' marks the
+end of a read segment.
+
+If option
+.B -c
+is applied, the consensus base, Phred-scaled consensus quality, SNP
+quality (i.e. the Phred-scaled probability of the consensus being
+identical to the reference) and root mean square (RMS) mapping quality
+of the reads covering the site will be inserted between the `reference
+base' and the `read bases' columns. An indel occupies an additional
+line. Each indel line consists of chromosome name, coordinate, a star,
+the genotype, consensus quality, SNP quality, RMS mapping quality, #
+covering reads, the first alllele, the second allele, # reads supporting
+the first allele, # reads supporting the second allele and # reads
+containing indels different from the top two alleles.
+
+.B NOTE:
+Since 0.1.10, the `pileup' command is deprecated by `mpileup'.
+
+.B OPTIONS:
+.RS
+.TP 10
+.B -B
+Disable the BAQ computation. See the
+.B mpileup
+command for details.
+.TP
+.B -c
+Call the consensus sequence. Options
+.BR -T ", " -N ", " -I " and " -r
+are only effective when
+.BR -c " or " -g
+is in use.
+.TP
+.BI -C \ INT
+Coefficient for downgrading the mapping quality of poorly mapped
+reads. See the
+.B mpileup
+command for details. [0]
+.TP
+.BI -d \ INT
+Use the first
+.I NUM
+reads in the pileup for indel calling for speed up. Zero for unlimited. [1024]
+.TP
+.BI -f \ FILE
+The reference sequence in the FASTA format. Index file
+.I FILE.fai
+will be created if
+absent.
+.TP
+.B -g
+Generate genotype likelihood in the binary GLFv3 format. This option
+suppresses -c, -i and -s. This option is deprecated by the
+.B mpileup
+command.
+.TP
+.B -i
+Only output pileup lines containing indels.
+.TP
+.BI -I \ INT
+Phred probability of an indel in sequencing/prep. [40]
+.TP
+.BI -l \ FILE
+List of sites at which pileup is output. This file is space
+delimited. The first two columns are required to be chromosome and
+1-based coordinate. Additional columns are ignored. It is
+recommended to use option
+.TP
+.BI -m \ INT
+Filter reads with flag containing bits in
+.I INT
+[1796]
+.TP
+.BI -M \ INT
+Cap mapping quality at INT [60]
+.TP
+.BI -N \ INT
+Number of haplotypes in the sample (>=2) [2]
+.TP
+.BI -r \ FLOAT
+Expected fraction of differences between a pair of haplotypes [0.001]
+.TP
+.B -s
+Print the mapping quality as the last column. This option makes the
+output easier to parse, although this format is not space efficient.
+.TP
+.B -S
+The input file is in SAM.
+.TP
+.BI -t \ FILE
+List of reference names ane sequence lengths, in the format described
+for the
+.B import
+command. If this option is present, samtools assumes the input
+.I <in.alignment>
+is in SAM format; otherwise it assumes in BAM format.
+.B -s
+together with
+.B -l
+as in the default format we may not know the mapping quality.
+.TP
+.BI -T \ FLOAT
+The theta parameter (error dependency coefficient) in the maq consensus
+calling model [0.85]
  .RE
  
  .SH SAM FORMAT
  .RE
  
  .SH SAM FORMAT
@@ -573,9 +614,8 @@ will be attached
  .IP o 2
  Call SNPs and short indels for one diploid individual:
  
  .IP o 2
  Call SNPs and short indels for one diploid individual:
  
-  samtools pileup -vcf ref.fa aln.bam > var.raw.plp
-  samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp
-  awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' var.flt.plp > var.final.plp
+  samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf
+  bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf
  
  The
  .B -D
  
  The
  .B -D
@@ -583,37 +623,37 @@ option of varFilter controls the maximum read depth, which should be
  adjusted to about twice the average read depth.  One may consider to add
  .B -C50
  to
  adjusted to about twice the average read depth.  One may consider to add
  .B -C50
  to
-.B pileup
+.B mpileup
  if mapping quality is overestimated for reads containing excessive
  mismatches. Applying this option usually helps
  .B BWA-short
  if mapping quality is overestimated for reads containing excessive
  mismatches. Applying this option usually helps
  .B BWA-short
-but may not other mappers. It also potentially increases reference
-biases.
+but may not other mappers.
  
  .IP o 2
  
  .IP o 2
-Call SNPs (not short indels) for multiple diploid individuals:
+Call SNPs and short indels for multiple diploid individuals:
  
  
-  samtools mpileup -augf ref.fa *.bam | bcftools view -bcv - > snp.raw.bcf
-  bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 | bgzip > snp.flt.vcf.gz
+  samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view -bcvg - > var.raw.bcf
+  bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > var.flt.vcf
  
  Individuals are identified from the
  .B SM
  tags in the
  .B @RG
  header lines. Individuals can be pooled in one alignment file; one
  
  Individuals are identified from the
  .B SM
  tags in the
  .B @RG
  header lines. Individuals can be pooled in one alignment file; one
-individual can also be separated into multiple files. Similarly, one may
-consider to apply
-.B -C50
-to
-.BR mpileup .
-SNP calling in this way also works for single sample and has the
-advantage of enabling more powerful filtering. The drawback is the lack
-of short indel calling, which may be implemented in future.
+individual can also be separated into multiple files. The
+.B -P
+option specifies that indel candidates should be collected only from
+read groups with the
+.B @RG-PL
+tag set to
+.IR ILLUMINA .
+Collecting indel candidates from reads sequenced by an indel-prone
+technology may affect the performance of indel calling.
  
  .IP o 2
  Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals:
  
  
  .IP o 2
  Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals:
  
-  samtools mpileup -gf ref.fa *.bam > all.bcf
+  samtools mpileup -Igf ref.fa *.bam > all.bcf
    bcftools view -bl sites.list all.bcf > sites.bcf
    bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
    bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
    bcftools view -bl sites.list all.bcf > sites.bcf
    bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
    bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
@@ -630,7 +670,7 @@ commands estimate AFS by EM.
  .IP o 2
  Dump BAQ applied alignment for other SNP callers:
  
  .IP o 2
  Dump BAQ applied alignment for other SNP callers:
  
-  samtools calmd -br aln.bam > aln.baq.bam
+  samtools calmd -bAr aln.bam > aln.baq.bam
  
  It adds and corrects the
  .B NM
  
  It adds and corrects the
  .B NM
@@ -640,7 +680,7 @@ tags at the same time. The
  .B calmd
  command also comes with the
  .B -C
  .B calmd
  command also comes with the
  .B -C
-option, the same as the on in
+option, the same as the one in
  .B pileup
  and
  .BR mpileup .
  .B pileup
  and
  .BR mpileup .
@@ -666,8 +706,9 @@ although a little slower.
  .PP
  Heng Li from the Sanger Institute wrote the C version of samtools. Bob
  Handsaker from the Broad Institute implemented the BGZF library and Jue
  .PP
  Heng Li from the Sanger Institute wrote the C version of samtools. Bob
  Handsaker from the Broad Institute implemented the BGZF library and Jue
-Ruan from Beijing Genomics Institute wrote the RAZF library. Various
-people in the 1000 Genomes Project contributed to the SAM format
+Ruan from Beijing Genomics Institute wrote the RAZF library. John
+Marshall and Petr Danecek contribute to the source code and various
+people from the 1000 Genomes Project have contributed to the SAM format
  specification.
  
  .SH SEE ALSO
  specification.
  
  .SH SEE ALSO
diff --git a/samtools.txt b/samtools.txt

index 09c1ccdedf9be5dd1964022519948a3932d384c8..41cabe5053986a44f415ae35981c0f61738a7996 100644 (file)
--- a/samtools.txt
+++ b/samtools.txt
@@ -22,8 +22,7 @@ SYNOPSIS
  
         samtools pileup -vcf ref.fasta aln.sorted.bam
  
  
         samtools pileup -vcf ref.fasta aln.sorted.bam
  
-       samtools  mpileup  -C50  -agf  ref.fasta  -r  chr3:1,000-2,000  in1.bam
-       in2.bam
+       samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
  
         samtools tview aln.sorted.bam ref.fasta
  
  
         samtools tview aln.sorted.bam ref.fasta
  
@@ -48,7 +47,7 @@ DESCRIPTION
  
  
  COMMANDS AND OPTIONS
  
  
  COMMANDS AND OPTIONS
-       view      samtools  view  [-bhuHS]  [-t  in.refList]  [-o  output]  [-f
+       view      samtools  view  [-bchuHS]  [-t  in.refList]  [-o  output] [-f
                   reqFlag] [-F skipFlag] [-q minMapQ] [-l  library]  [-r  read-
                   Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
  
                   reqFlag] [-F skipFlag] [-q minMapQ] [-l  library]  [-r  read-
                   Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
  
@@ -90,6 +89,10 @@ COMMANDS AND OPTIONS
                   -S      Input is in SAM. If @SQ header lines are absent,  the
                           `-t' option is required.
  
                   -S      Input is in SAM. If @SQ header lines are absent,  the
                           `-t' option is required.
  
+                 -c      Instead  of  printing the alignments, only count them
+                         and print the total number. All filter options,  such
+                         as `-f', `-F' and `-q' , are taken into account.
+
                   -t FILE This  file  is  TAB-delimited. Each line must contain
                           the reference name and the length of  the  reference,
                           one  line  for  each  distinct  reference; additional
                   -t FILE This  file  is  TAB-delimited. Each line must contain
                           the reference name and the length of  the  reference,
                           one  line  for  each  distinct  reference; additional
@@ -112,109 +115,6 @@ COMMANDS AND OPTIONS
                   reference sequence.
  
  
                   reference sequence.
  
  
-       pileup    samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
-                 [-l in.site_list] [-C capMapQ] [-M maxMapQ]  [-T  theta]  [-N
-                 nHap]  [-r  pairDiffRate]  [-m  mask]  [-d maxIndelDepth] [-G
-                 indelPrior] <in.bam>|<in.sam>
-
-                 Print the alignment in the pileup format. In the pileup  for-
-                 mat,  each  line represents a genomic position, consisting of
-                 chromosome name, coordinate, reference base, read bases, read
-                 qualities  and  alignment  mapping  qualities. Information on
-                 match, mismatch, indel, strand, mapping quality and start and
-                 end  of  a  read  are all encoded at the read base column. At
-                 this column, a dot stands for a match to the  reference  base
-                 on  the  forward  strand,  a comma for a match on the reverse
-                 strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
-                 match on the forward strand and `acgtn' for a mismatch on the
-                 reverse strand. A pattern  `\+[0-9]+[ACGTNacgtn]+'  indicates
-                 there is an insertion between this reference position and the
-                 next reference position. The length of the insertion is given
-                 by  the  integer  in  the  pattern,  followed by the inserted
-                 sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'  repre-
-                 sents  a  deletion from the reference. The deleted bases will
-                 be presented as `*' in the following lines. Also at the  read
-                 base  column,  a  symbol  `^'  marks the start of a read. The
-                 ASCII of the character following `^' minus 33 gives the  map-
-                 ping quality. A symbol `$' marks the end of a read segment.
-
-                 If  option  -c  is  applied, the consensus base, Phred-scaled
-                 consensus quality, SNP quality (i.e. the Phred-scaled  proba-
-                 bility of the consensus being identical to the reference) and
-                 root mean square (RMS) mapping quality of the reads  covering
-                 the  site  will  be inserted between the `reference base' and
-                 the `read bases' columns. An  indel  occupies  an  additional
-                 line.  Each  indel  line consists of chromosome name, coordi-
-                 nate, a star, the genotype, consensus quality,  SNP  quality,
-                 RMS mapping quality, # covering reads, the first alllele, the
-                 second allele, # reads supporting the first allele,  #  reads
-                 supporting  the  second  allele and # reads containing indels
-                 different from the top two alleles.
-
-                 The position of indels is offset by -1.
-
-                 OPTIONS:
-
-                 -B        Disable the BAQ computation. See the  mpileup  com-
-                           mand for details.
-
-                 -c        Call the consensus sequence using SOAPsnp consensus
-                           model. Options -T, -N, -I and -r are only effective
-                           when -c or -g is in use.
-
-                 -C INT    Coefficient  for downgrading the mapping quality of
-                           poorly mapped reads. See the  mpileup  command  for
-                           details. [0]
-
-                 -d INT    Use  the  first  NUM  reads in the pileup for indel
-                           calling for speed up. Zero for unlimited. [1024]
-
-                 -f FILE   The reference sequence in the FASTA  format.  Index
-                           file FILE.fai will be created if absent.
-
-                 -g        Generate  genotype  likelihood  in the binary GLFv3
-                           format. This option suppresses -c, -i and -s.  This
-                           option is deprecated by the mpileup command.
-
-                 -i        Only output pileup lines containing indels.
-
-                 -I INT    Phred  probability  of an indel in sequencing/prep.
-                           [40]
-
-                 -l FILE   List of sites at which pileup is output. This  file
-                           is  space  delimited.  The  first  two  columns are
-                           required to be chromosome and  1-based  coordinate.
-                           Additional  columns  are ignored. It is recommended
-                           to use option
-
-                 -m INT    Filter reads  with  flag  containing  bits  in  INT
-                           [1796]
-
-                 -M INT    Cap mapping quality at INT [60]
-
-                 -N INT    Number of haplotypes in the sample (>=2) [2]
-
-                 -r FLOAT  Expected  fraction of differences between a pair of
-                           haplotypes [0.001]
-
-                 -s        Print the mapping quality as the last column.  This
-                           option  makes  the output easier to parse, although
-                           this format is not space efficient.
-
-                 -S        The input file is in SAM.
-
-                 -t FILE   List of reference names ane  sequence  lengths,  in
-                           the  format  described  for  the import command. If
-                           this option is present, samtools assumes the  input
-                           <in.alignment>  is  in  SAM  format;  otherwise  it
-                           assumes in BAM format.  -s together with -l  as  in
-                           the  default  format  we  may  not know the mapping
-                           quality.
-
-                 -T FLOAT  The theta parameter (error dependency  coefficient)
-                           in the maq consensus calling model [0.85]
-
-
         mpileup   samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l
                   list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam
                   [...]]
         mpileup   samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l
                   list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam
                   [...]]
@@ -237,27 +137,43 @@ COMMANDS AND OPTIONS
                           phred-scaled probability q of  being  generated  from
                           the mapped position, the new mapping quality is about
                           sqrt((INT-q)/INT)*INT. A  zero  value  disables  this
                           phred-scaled probability q of  being  generated  from
                           the mapped position, the new mapping quality is about
                           sqrt((INT-q)/INT)*INT. A  zero  value  disables  this
-                         functionality;  if  enabled, the recommended value is
-                         50. [0]
+                         functionality;  if enabled, the recommended value for
+                         BWA is 50. [0]
+
+                 -e INT  Phred-scaled gap extension sequencing error probabil-
+                         ity. Reducing INT leads to longer indels. [20]
  
                   -f FILE The reference file [null]
  
  
                   -f FILE The reference file [null]
  
-                 -g      Compute genotype likelihoods and output them  in  the
+                 -g      Compute  genotype  likelihoods and output them in the
                           binary call format (BCF).
  
                           binary call format (BCF).
  
-                 -u      Similar  to -g except that the output is uncompressed
-                         BCF, which is preferred for pipeing.
+                 -h INT  Coefficient for modeling homopolymer errors. Given an
+                         l-long  homopolymer  run,  the sequencing error of an
+                         indel of size s is modeled as INT*s/l.  [100]
  
                   -l FILE File containing a list of sites where pileup  or  BCF
                           is outputted [null]
  
  
                   -l FILE File containing a list of sites where pileup  or  BCF
                           is outputted [null]
  
-                 -q INT  Minimum  mapping  quality for an alignment to be used
+                 -o INT  Phred-scaled  gap  open sequencing error probability.
+                         Reducing INT leads to more indel calls. [40]
+
+                 -P STR  Comma dilimited list of platforms (determined by @RG-
+                         PL)  from  which indel candidates are obtained. It is
+                         recommended to collect indel candidates from sequenc-
+                         ing  technologies that have low indel error rate such
+                         as ILLUMINA. [all]
+
+                 -q INT  Minimum mapping quality for an alignment to  be  used
                           [0]
  
                   -Q INT  Minimum base quality for a base to be considered [13]
  
                   -r STR  Only generate pileup in region STR [all sites]
  
                           [0]
  
                   -Q INT  Minimum base quality for a base to be considered [13]
  
                   -r STR  Only generate pileup in region STR [all sites]
  
+                 -u      Similar to -g except that the output is  uncompressed
+                         BCF, which is preferred for piping.
+
  
         reheader  samtools reheader <in.header.sam> <in.bam>
  
  
         reheader  samtools reheader <in.header.sam> <in.bam>
  
@@ -368,8 +284,11 @@ COMMANDS AND OPTIONS
  
                   OPTIONS:
  
  
                   OPTIONS:
  
-                 -e      Convert a the read base to = if it  is  identical  to
-                         the  aligned  reference  base.  Indel caller does not
+                 -A      When used jointly with -r this option overwrites  the
+                         original base quality.
+
+                 -e      Convert  a  the  read base to = if it is identical to
+                         the aligned reference base.  Indel  caller  does  not
                           support the = bases at the moment.
  
                   -u      Output uncompressed BAM
                           support the = bases at the moment.
  
                   -u      Output uncompressed BAM
@@ -378,15 +297,117 @@ COMMANDS AND OPTIONS
  
                   -S      The input is SAM with header lines
  
  
                   -S      The input is SAM with header lines
  
-                 -C INT  Coefficient to cap mapping quality of  poorly  mapped
+                 -C INT  Coefficient  to  cap mapping quality of poorly mapped
                           reads. See the pileup command for details. [0]
  
                           reads. See the pileup command for details. [0]
  
-                 -r      Perform  probabilistic  realignment  to  compute BAQ,
-                         which will be used to cap base quality.
+                 -r      Compute the BQ tag without changing the base quality.
+
+
+       pileup    samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
+                 [-l in.site_list] [-C capMapQ] [-M maxMapQ]  [-T  theta]  [-N
+                 nHap]  [-r  pairDiffRate]  [-m  mask]  [-d maxIndelDepth] [-G
+                 indelPrior] <in.bam>|<in.sam>
+
+                 Print the alignment in the pileup format. In the pileup  for-
+                 mat,  each  line represents a genomic position, consisting of
+                 chromosome name, coordinate, reference base, read bases, read
+                 qualities  and  alignment  mapping  qualities. Information on
+                 match, mismatch, indel, strand, mapping quality and start and
+                 end  of  a  read  are all encoded at the read base column. At
+                 this column, a dot stands for a match to the  reference  base
+                 on  the  forward  strand,  a comma for a match on the reverse
+                 strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
+                 match on the forward strand and `acgtn' for a mismatch on the
+                 reverse strand. A pattern  `\+[0-9]+[ACGTNacgtn]+'  indicates
+                 there is an insertion between this reference position and the
+                 next reference position. The length of the insertion is given
+                 by  the  integer  in  the  pattern,  followed by the inserted
+                 sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'  repre-
+                 sents  a  deletion from the reference. The deleted bases will
+                 be presented as `*' in the following lines. Also at the  read
+                 base  column,  a  symbol  `^'  marks the start of a read. The
+                 ASCII of the character following `^' minus 33 gives the  map-
+                 ping quality. A symbol `$' marks the end of a read segment.
+
+                 If  option  -c  is  applied, the consensus base, Phred-scaled
+                 consensus quality, SNP quality (i.e. the Phred-scaled  proba-
+                 bility of the consensus being identical to the reference) and
+                 root mean square (RMS) mapping quality of the reads  covering
+                 the  site  will  be inserted between the `reference base' and
+                 the `read bases' columns. An  indel  occupies  an  additional
+                 line.  Each  indel  line consists of chromosome name, coordi-
+                 nate, a star, the genotype, consensus quality,  SNP  quality,
+                 RMS mapping quality, # covering reads, the first alllele, the
+                 second allele, # reads supporting the first allele,  #  reads
+                 supporting  the  second  allele and # reads containing indels
+                 different from the top two alleles.
+
+                 NOTE: Since 0.1.10, the `pileup'  command  is  deprecated  by
+                 `mpileup'.
+
+                 OPTIONS:
+
+                 -B        Disable  the  BAQ computation. See the mpileup com-
+                           mand for details.
+
+                 -c        Call the consensus sequence. Options -T, -N, -I and
+                           -r are only effective when -c or -g is in use.
+
+                 -C INT    Coefficient  for downgrading the mapping quality of
+                           poorly mapped reads. See the  mpileup  command  for
+                           details. [0]
+
+                 -d INT    Use  the  first  NUM  reads in the pileup for indel
+                           calling for speed up. Zero for unlimited. [1024]
+
+                 -f FILE   The reference sequence in the FASTA  format.  Index
+                           file FILE.fai will be created if absent.
+
+                 -g        Generate  genotype  likelihood  in the binary GLFv3
+                           format. This option suppresses -c, -i and -s.  This
+                           option is deprecated by the mpileup command.
+
+                 -i        Only output pileup lines containing indels.
+
+                 -I INT    Phred  probability  of an indel in sequencing/prep.
+                           [40]
+
+                 -l FILE   List of sites at which pileup is output. This  file
+                           is  space  delimited.  The  first  two  columns are
+                           required to be chromosome and  1-based  coordinate.
+                           Additional  columns  are ignored. It is recommended
+                           to use option
+
+                 -m INT    Filter reads  with  flag  containing  bits  in  INT
+                           [1796]
+
+                 -M INT    Cap mapping quality at INT [60]
+
+                 -N INT    Number of haplotypes in the sample (>=2) [2]
+
+                 -r FLOAT  Expected  fraction of differences between a pair of
+                           haplotypes [0.001]
+
+                 -s        Print the mapping quality as the last column.  This
+                           option  makes  the output easier to parse, although
+                           this format is not space efficient.
+
+                 -S        The input file is in SAM.
+
+                 -t FILE   List of reference names ane  sequence  lengths,  in
+                           the  format  described  for  the import command. If
+                           this option is present, samtools assumes the  input
+                           <in.alignment>  is  in  SAM  format;  otherwise  it
+                           assumes in BAM format.  -s together with -l  as  in
+                           the  default  format  we  may  not know the mapping
+                           quality.
+
+                 -T FLOAT  The theta parameter (error dependency  coefficient)
+                           in the maq consensus calling model [0.85]
  
  
  SAM FORMAT
  
  
  SAM FORMAT
-       SAM is TAB-delimited. Apart from the header lines,  which  are  started
+       SAM  is  TAB-delimited.  Apart from the header lines, which are started
         with the `@' symbol, each alignment line consists of:
  
  
         with the `@' symbol, each alignment line consists of:
  
  
@@ -441,51 +462,50 @@ EXAMPLES
  
         o Attach the RG tag while merging sorted alignments:
  
  
         o Attach the RG tag while merging sorted alignments:
  
-           perl       -e       'print      "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
+           perl      -e       'print       "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
           mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
             samtools merge -rh rg.txt merged.bam ga.bam 454.bam
  
           The value in a RG tag is determined by the file name the read is com-
           mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
             samtools merge -rh rg.txt merged.bam ga.bam 454.bam
  
           The value in a RG tag is determined by the file name the read is com-
-         ing  from. In this example, in the merged.bam, reads from ga.bam will
-         be attached RG:Z:ga,  while  reads  from  454.bam  will  be  attached
+         ing from. In this example, in the merged.bam, reads from ga.bam  will
+         be  attached  RG:Z:ga,  while  reads  from  454.bam  will be attached
           RG:Z:454.
  
  
         o Call SNPs and short indels for one diploid individual:
  
           RG:Z:454.
  
  
         o Call SNPs and short indels for one diploid individual:
  
-           samtools pileup -vcf ref.fa aln.bam > var.raw.plp
-           samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp
-           awk     '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)'     var.flt.plp    >
-         var.final.plp
+           samtools mpileup -ugf ref.fa aln.bam |  bcftools  view  -bvcg  -  >
+         var.raw.bcf
+           bcftools   view  var.raw.bcf  |  vcfutils.pl  varFilter  -D  100  >
+         var.flt.vcf
  
           The -D option of varFilter controls the  maximum  read  depth,  which
           should  be  adjusted  to about twice the average read depth.  One may
  
           The -D option of varFilter controls the  maximum  read  depth,  which
           should  be  adjusted  to about twice the average read depth.  One may
-         consider to add -C50 to pileup if mapping  quality  is  overestimated
+         consider to add -C50 to mpileup if mapping quality  is  overestimated
           for  reads containing excessive mismatches. Applying this option usu-
           for  reads containing excessive mismatches. Applying this option usu-
-         ally helps BWA-short but may not other mappers. It  also  potentially
-         increases reference biases.
+         ally helps BWA-short but may not other mappers.
  
  
  
  
-       o Call SNPs (not short indels) for multiple diploid individuals:
+       o Call SNPs and short indels for multiple diploid individuals:
  
  
-           samtools  mpileup  -augf  ref.fa  *.bam  |  bcftools  view -bcv - >
-         snp.raw.bcf
-           bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 |  bgzip
-         > snp.flt.vcf.gz
+           samtools mpileup -P ILLUMINA -ugf  ref.fa  *.bam  |  bcftools  view
+         -bcvg - > var.raw.bcf
+           bcftools  view  var.raw.bcf  |  vcfutils.pl  varFilter  -D  2000  >
+         var.flt.vcf
  
  
-         Individuals  are identified from the SM tags in the @RG header lines.
-         Individuals can be pooled in one alignment file; one  individual  can
-         also be separated into multiple files. Similarly, one may consider to
-         apply -C50 to mpileup.  SNP calling in this way also works for single
-         sample and has the advantage of enabling more powerful filtering. The
-         drawback is the lack of short indel calling, which may be implemented
-         in future.
+         Individuals are identified from the SM tags in the @RG header  lines.
+         Individuals  can  be pooled in one alignment file; one individual can
+         also be separated into multiple files. The -P option  specifies  that
+         indel  candidates  should be collected only from read groups with the
+         @RG-PL tag set to ILLUMINA.  Collecting indel candidates  from  reads
+         sequenced  by an indel-prone technology may affect the performance of
+         indel calling.
  
  
  
  
-       o Derive  the  allele  frequency spectrum (AFS) on a list of sites from
+       o Derive the allele frequency spectrum (AFS) on a list  of  sites  from
           multiple individuals:
  
           multiple individuals:
  
-           samtools mpileup -gf ref.fa *.bam > all.bcf
+           samtools mpileup -Igf ref.fa *.bam > all.bcf
             bcftools view -bl sites.list all.bcf > sites.bcf
             bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
             bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
             bcftools view -bl sites.list all.bcf > sites.bcf
             bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
             bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
@@ -493,40 +513,41 @@ EXAMPLES
             ......
  
           where sites.list contains the list of sites with each line consisting
             ......
  
           where sites.list contains the list of sites with each line consisting
-         of  the  reference sequence name and position. The following bcftools
+         of the reference sequence name and position. The  following  bcftools
           commands estimate AFS by EM.
  
  
         o Dump BAQ applied alignment for other SNP callers:
  
           commands estimate AFS by EM.
  
  
         o Dump BAQ applied alignment for other SNP callers:
  
-           samtools calmd -br aln.bam > aln.baq.bam
+           samtools calmd -bAr aln.bam > aln.baq.bam
  
  
-         It adds and corrects the NM and MD tags at the same time.  The  calmd
-         command  also  comes with the -C option, the same as the on in pileup
+         It  adds  and corrects the NM and MD tags at the same time. The calmd
+         command also comes with the -C option, the same as the one in  pileup
           and mpileup.  Apply if it helps.
  
  
  LIMITATIONS
           and mpileup.  Apply if it helps.
  
  
  LIMITATIONS
-       o Unaligned  words  used  in  bam_import.c,  bam_endian.h,  bam.c   and
+       o Unaligned   words  used  in  bam_import.c,  bam_endian.h,  bam.c  and
           bam_aux.c.
  
           bam_aux.c.
  
-       o In  merging,  the input files are required to have the same number of
-         reference sequences. The requirement can  be  relaxed.  In  addition,
-         merging  does  not reconstruct the header dictionaries automatically.
-         Endusers have to provide the correct  header.  Picard  is  better  at
+       o In merging, the input files are required to have the same  number  of
+         reference  sequences.  The  requirement  can be relaxed. In addition,
+         merging does not reconstruct the header  dictionaries  automatically.
+         Endusers  have  to  provide  the  correct header. Picard is better at
           merging.
  
           merging.
  
-       o Samtools  paired-end  rmdup  does  not  work for unpaired reads (e.g.
-         orphan reads or ends mapped to different chromosomes). If this  is  a
-         concern,  please  use  Picard's MarkDuplicate which correctly handles
+       o Samtools paired-end rmdup does not  work  for  unpaired  reads  (e.g.
+         orphan  reads  or ends mapped to different chromosomes). If this is a
+         concern, please use Picard's MarkDuplicate  which  correctly  handles
           these cases, although a little slower.
  
  
  AUTHOR
           these cases, although a little slower.
  
  
  AUTHOR
-       Heng Li from the Sanger Institute wrote the C version of samtools.  Bob
+       Heng  Li from the Sanger Institute wrote the C version of samtools. Bob
         Handsaker from the Broad Institute implemented the BGZF library and Jue
         Handsaker from the Broad Institute implemented the BGZF library and Jue
-       Ruan from Beijing Genomics Institute wrote the  RAZF  library.  Various
-       people in the 1000 Genomes Project contributed to the SAM format speci-
+       Ruan  from Beijing Genomics Institute wrote the RAZF library. John Mar-
+       shall and Petr Danecek contribute to the source code and various people
+       from the 1000 Genomes Project have contributed to the SAM format speci-
         fication.
  
  
         fication.
  
  
@@ -535,4 +556,4 @@ SEE ALSO
  
  
  
  
  
  
-samtools-0.1.9                  27 October 2010                    samtools(1)
+samtools-0.1.10                15 November 2010                    samtools(1)
author	Charles Plessy <plessy@debian.org>
	Wed, 17 Nov 2010 11:23:16 +0000 (20:23 +0900)
committer	Charles Plessy <plessy@debian.org>
	Wed, 17 Nov 2010 11:23:16 +0000 (20:23 +0900)
ChangeLog		patch \| blob \| history
Makefile		patch \| blob \| history
NEWS		patch \| blob \| history
bam.h		patch \| blob \| history
bam2bcf.c		patch \| blob \| history
bam2bcf.h		patch \| blob \| history
bam2bcf_indel.c	[new file with mode: 0644]	patch \| blob
bam_maqcns.c		patch \| blob \| history
bam_md.c		patch \| blob \| history
bam_pileup.c		patch \| blob \| history
bam_plcmd.c		patch \| blob \| history
bamtk.c		patch \| blob \| history
bcftools/bcf.c		patch \| blob \| history
bcftools/bcf.h		patch \| blob \| history
bcftools/call1.c		patch \| blob \| history
bcftools/prob1.c		patch \| blob \| history
bcftools/prob1.h		patch \| blob \| history
bcftools/vcf.c		patch \| blob \| history
bcftools/vcfutils.pl		patch \| blob \| history
kaln.c		patch \| blob \| history
kaln.h		patch \| blob \| history
kprobaln.c	[new file with mode: 0644]	patch \| blob
kprobaln.h	[new file with mode: 0644]	patch \| blob
misc/samtools.pl		patch \| blob \| history
sam_view.c		patch \| blob \| history
samtools.1		patch \| blob \| history
samtools.txt		patch \| blob \| history