Imported Upstream version 0.1.10

author Charles Plessy <plessy@debian.org>

Wed, 17 Nov 2010 11:23:16 +0000 (20:23 +0900)

committer Charles Plessy <plessy@debian.org>

Wed, 17 Nov 2010 11:23:16 +0000 (20:23 +0900)
author Charles Plessy <plessy@debian.org>
Wed, 17 Nov 2010 11:23:16 +0000 (20:23 +0900)
committer Charles Plessy <plessy@debian.org>
Wed, 17 Nov 2010 11:23:16 +0000 (20:23 +0900)
diff --git a/ChangeLog b/ChangeLog

index 12908c968c54e264baff1562e7cea623fb145dcb..f8456a3e61f1e731343c53fbb0aa859751332282 100644 (file)
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,453 @@
+------------------------------------------------------------------------
+r828 | lh3lh3 | 2010-11-16 20:48:49 -0500 (Tue, 16 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bamtk.c
+
+update version information: samtools-0.1.9-20 (r828)
+
+------------------------------------------------------------------------
+r827 | lh3lh3 | 2010-11-16 15:32:50 -0500 (Tue, 16 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bcftools/call1.c
+
+bcftools: allow to skip indels
+
+------------------------------------------------------------------------
+r826 | lh3lh3 | 2010-11-16 14:11:58 -0500 (Tue, 16 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam_md.c
+
+remove ZQ if both BQ and ZQ are present
+
+------------------------------------------------------------------------
+r825 | lh3lh3 | 2010-11-16 13:51:33 -0500 (Tue, 16 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_md.c
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/samtools.1
+
+ * samtools-0.1.9-18 (r825)
+ * change the behaviour of calmd such that by default it does not change the base quality
+
+------------------------------------------------------------------------
+r824 | lh3lh3 | 2010-11-15 23:31:53 -0500 (Mon, 15 Nov 2010) | 4 lines
+Changed paths:
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/bcftools/call1.c
+   M /trunk/samtools/samtools.1
+
+ * samtools-0.1.9-17 (r824)
+ * added command line options to change the default parameters in indel calling
+ * update the manual
+
+------------------------------------------------------------------------
+r823 | lh3lh3 | 2010-11-15 12:20:13 -0500 (Mon, 15 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_md.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-r823
+ * the BQ tag is now 64 shifted, not 33 shifted
+
+------------------------------------------------------------------------
+r822 | lh3lh3 | 2010-11-15 00:30:18 -0500 (Mon, 15 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/bcftools/vcfutils.pl
+   M /trunk/samtools/misc/samtools.pl
+
+ * samtools-0.1.9-16 (r822)
+ * keep the raw depth because in indel calling, DP4 may be way off the true depth
+
+------------------------------------------------------------------------
+r821 | lh3lh3 | 2010-11-13 01:18:31 -0500 (Sat, 13 Nov 2010) | 4 lines
+Changed paths:
+   M /trunk/samtools/bam_md.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-15 (r821)
+ * calmd: write BQ
+ * skip realignment if BQ is present
+
+------------------------------------------------------------------------
+r820 | lh3lh3 | 2010-11-13 01:08:26 -0500 (Sat, 13 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-14 (r820)
+ * penalize reads with excessive differences in indel calling
+
+------------------------------------------------------------------------
+r819 | lh3lh3 | 2010-11-12 21:36:27 -0500 (Fri, 12 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam_maqcns.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-13 (r819)
+ * fixed a bug in pileup given refskip
+
+------------------------------------------------------------------------
+r818 | lh3lh3 | 2010-11-12 13:04:53 -0500 (Fri, 12 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-r818
+ * for indel calling, do two rounds of probabilistic realignments
+
+------------------------------------------------------------------------
+r817 | lh3lh3 | 2010-11-11 20:04:07 -0500 (Thu, 11 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/bcftools/vcfutils.pl
+
+ * samtools-0.1.19-11 (r817)
+ * only initiate indel calling when 0.2% of reads contain a gap
+
+------------------------------------------------------------------------
+r816 | lh3lh3 | 2010-11-11 01:22:59 -0500 (Thu, 11 Nov 2010) | 7 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-10 (r816)
+
+ * I know why the forward method fails. it is because of zero base
+   qualities. when that is fixed, the forward method seems to give
+   better results than Viterbi, as it should be. I am tired...
+
+
+------------------------------------------------------------------------
+r815 | lh3lh3 | 2010-11-11 00:57:15 -0500 (Thu, 11 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/ChangeLog
+   M /trunk/samtools/bam2bcf_indel.c
+
+effectively revert to the viterbi version. The forward realignment gives too many false positives.
+
+------------------------------------------------------------------------
+r814 | lh3lh3 | 2010-11-11 00:18:02 -0500 (Thu, 11 Nov 2010) | 4 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_md.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-9 (r810)
+ * use forward, instead of viterbi, for realignment
+ * realignment is now quality aware
+
+------------------------------------------------------------------------
+r813 | lh3lh3 | 2010-11-10 22:45:24 -0500 (Wed, 10 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/kprobaln.c
+   M /trunk/samtools/kprobaln.h
+
+ * prepare to replace kaln with kprobaln in realignment
+
+------------------------------------------------------------------------
+r812 | lh3lh3 | 2010-11-10 17:28:50 -0500 (Wed, 10 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bcftools/bcf.c
+
+fixed a typo
+
+------------------------------------------------------------------------
+r811 | lh3lh3 | 2010-11-10 16:54:46 -0500 (Wed, 10 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bcftools/bcf.c
+   M /trunk/samtools/bcftools/bcf.h
+
+use zlib for direct reading when BCF_LITE is in use
+
+------------------------------------------------------------------------
+r810 | lh3lh3 | 2010-11-10 16:32:13 -0500 (Wed, 10 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+
+ * do not use reads containing too many mismatches for indel calling
+ * fixed a trivial bug in case of multi-allelic indels
+
+------------------------------------------------------------------------
+r809 | lh3lh3 | 2010-11-10 13:23:02 -0500 (Wed, 10 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-8 (r809)
+ * fixed a bug in the indel caller
+
+------------------------------------------------------------------------
+r808 | lh3lh3 | 2010-11-10 12:24:10 -0500 (Wed, 10 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/Makefile
+
+minor change to makefile
+
+------------------------------------------------------------------------
+r807 | lh3lh3 | 2010-11-10 12:10:21 -0500 (Wed, 10 Nov 2010) | 4 lines
+Changed paths:
+   M /trunk/samtools/Makefile
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/bcftools/vcfutils.pl
+
+ * samtools-0.1.9-8 (r807)
+ * collect indel candidates only from specified platforms (@RG-PL)
+ * merge varFilter and filter4vcf in vcfutils.pl 
+
+------------------------------------------------------------------------
+r806 | lh3lh3 | 2010-11-09 22:05:46 -0500 (Tue, 09 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bcftools/call1.c
+   M /trunk/samtools/bcftools/prob1.c
+   M /trunk/samtools/bcftools/prob1.h
+
+bcftools: compute equal-tail (Bayesian) credible interval
+
+------------------------------------------------------------------------
+r805 | lh3lh3 | 2010-11-09 16:28:39 -0500 (Tue, 09 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bcftools/vcfutils.pl
+
+added a double-hit filter to avoid overestimated indel likelihood
+
+------------------------------------------------------------------------
+r804 | lh3lh3 | 2010-11-09 14:12:06 -0500 (Tue, 09 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-7 (r804)
+ * fixed a bug in the gap caller
+
+------------------------------------------------------------------------
+r803 | lh3lh3 | 2010-11-09 10:45:33 -0500 (Tue, 09 Nov 2010) | 4 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/bcftools/bcf.c
+   M /trunk/samtools/bcftools/bcf.h
+   M /trunk/samtools/bcftools/prob1.c
+
+ * samtools-0.1.9-6 (r803)
+ * mpileup: apply homopolymer correction when calculating GL, instead of before
+ * bcftools: apply a different prior to indels
+
+------------------------------------------------------------------------
+r802 | lh3lh3 | 2010-11-08 23:53:15 -0500 (Mon, 08 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-5 (r802)
+ * relax tandem penalty. this will be made a command-line option in future.
+
+------------------------------------------------------------------------
+r801 | lh3lh3 | 2010-11-08 23:35:52 -0500 (Mon, 08 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-4 (r801)
+ * fixed a minor issue in printing indel VCF
+
+------------------------------------------------------------------------
+r800 | lh3lh3 | 2010-11-08 15:28:14 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bcftools/vcfutils.pl
+
+fixed another silly bug in mpileup's indel caller
+
+------------------------------------------------------------------------
+r799 | lh3lh3 | 2010-11-08 14:28:27 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+
+fixed a silly bug in the indel caller
+
+------------------------------------------------------------------------
+r798 | lh3lh3 | 2010-11-08 14:07:33 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/ChangeLog
+   M /trunk/samtools/sam_view.c
+   M /trunk/samtools/samtools.1
+
+Incorporate patches by Marcel Martin for read counting.
+
+------------------------------------------------------------------------
+r797 | lh3lh3 | 2010-11-08 13:39:52 -0500 (Mon, 08 Nov 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-2 (r797)
+ * mpileup: indel calling seems to be working
+
+------------------------------------------------------------------------
+r796 | lh3lh3 | 2010-11-08 10:54:46 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/kaln.c
+
+indel calling is apparently working, but more information needs to be collected
+
+------------------------------------------------------------------------
+r795 | lh3lh3 | 2010-11-08 00:39:18 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf_indel.c
+
+fixed a few bugs in the indel caller. Probably there are more.
+
+------------------------------------------------------------------------
+r794 | lh3lh3 | 2010-11-07 22:23:16 -0500 (Sun, 07 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/Makefile
+   M /trunk/samtools/bam.h
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   A /trunk/samtools/bam2bcf_indel.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/kaln.c
+   M /trunk/samtools/kaln.h
+
+prepare for the indel caller. It is not ready yet.
+
+------------------------------------------------------------------------
+r793 | lh3lh3 | 2010-11-05 11:28:23 -0400 (Fri, 05 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam_plcmd.c
+
+Revert to r790. The recent changes are not good...
+
+------------------------------------------------------------------------
+r792 | lh3lh3 | 2010-11-05 00:19:14 -0400 (Fri, 05 Nov 2010) | 6 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam_plcmd.c
+
+ * this revision is UNSTABLE
+
+ * indel caller seems working, but it is very insensitive and has
+   several things I do not quite understand.
+
+
+------------------------------------------------------------------------
+r791 | lh3lh3 | 2010-11-04 22:58:43 -0400 (Thu, 04 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam_plcmd.c
+
+for backup. no effective changes
+
+------------------------------------------------------------------------
+r790 | lh3lh3 | 2010-11-03 15:51:24 -0400 (Wed, 03 Nov 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bcftools/vcfutils.pl
+   M /trunk/samtools/kprobaln.c
+
+fixed a minor problem in the example coming with kprobaln.c
+
+------------------------------------------------------------------------
+r789 | lh3lh3 | 2010-11-02 15:41:27 -0400 (Tue, 02 Nov 2010) | 4 lines
+Changed paths:
+   M /trunk/samtools/Makefile
+   M /trunk/samtools/bam_md.c
+   M /trunk/samtools/kaln.c
+   M /trunk/samtools/kaln.h
+   A /trunk/samtools/kprobaln.c
+   A /trunk/samtools/kprobaln.h
+
+Separate kaln and kprobaln as I am preparing further changes. At
+present, the results should be identical to the previous.
+
+
+------------------------------------------------------------------------
+r788 | petulda | 2010-11-02 12:19:04 -0400 (Tue, 02 Nov 2010) | 1 line
+Changed paths:
+   M /trunk/samtools/bam_plcmd.c
+
+Added -b option: read file names from a file
+------------------------------------------------------------------------
+r787 | lh3lh3 | 2010-10-29 23:17:22 -0400 (Fri, 29 Oct 2010) | 7 lines
+Changed paths:
+   M /trunk/samtools/bam.h
+   M /trunk/samtools/bam_pileup.c
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-2 (r787)
+
+ * Allow to set a maximum per-sample depth to reduce memory. However,
+   BAQ computation is still applied to every read. The speed is not
+   improved.
+
+
+------------------------------------------------------------------------
+r786 | lh3lh3 | 2010-10-29 12:10:40 -0400 (Fri, 29 Oct 2010) | 3 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/bcftools/bcf.c
+   M /trunk/samtools/bcftools/vcf.c
+
+ * samtools-0.1.9-1 (r786)
+ * samtools: optionally perform exact test for each sample
+
+------------------------------------------------------------------------
+r785 | lh3lh3 | 2010-10-29 09:42:25 -0400 (Fri, 29 Oct 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/bam2bcf.c
+   M /trunk/samtools/bam2bcf.h
+   M /trunk/samtools/bam_plcmd.c
+   M /trunk/samtools/bcftools/bcf.c
+
+Optionally output "DP", the individual read depth
+
+------------------------------------------------------------------------
+r784 | lh3lh3 | 2010-10-27 23:10:27 -0400 (Wed, 27 Oct 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/samtools.1
+
+acknowledge Petr and John who have greatly contributed to the project.
+
+------------------------------------------------------------------------
+r783 | lh3lh3 | 2010-10-27 22:47:47 -0400 (Wed, 27 Oct 2010) | 2 lines
+Changed paths:
+   M /trunk/samtools/ChangeLog
+   M /trunk/samtools/NEWS
+   M /trunk/samtools/bamtk.c
+   M /trunk/samtools/samtools.1
+
+Release samtools-0.1.9 (r783)
+
  ------------------------------------------------------------------------
  r782 | lh3lh3 | 2010-10-27 19:58:54 -0400 (Wed, 27 Oct 2010) | 2 lines
  Changed paths:
diff --git a/Makefile b/Makefile

index 77cda865c4b38df76fdafe338feae873dabde0e6..13d4a7698e1ded63520b3c8c7a117a45d91af1a1 100644 (file)
--- a/Makefile
+++ b/Makefile
@@ -4,10 +4,10 @@ DFLAGS=               -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE -D_CURSES_LIB=1
  KNETFILE_O=    knetfile.o
  LOBJS=         bgzf.o kstring.o bam_aux.o bam.o bam_import.o sam.o bam_index.o \
                         bam_pileup.o bam_lpileup.o bam_md.o glf.o razf.o faidx.o \
-                       $(KNETFILE_O) bam_sort.o sam_header.o bam_reheader.o
+                       $(KNETFILE_O) bam_sort.o sam_header.o bam_reheader.o kprobaln.o
  AOBJS=         bam_tview.o bam_maqcns.o bam_plcmd.o sam_view.o \
                         bam_rmdup.o bam_rmdupse.o bam_mate.o bam_stat.o bam_color.o     \
-                       bamtk.o kaln.o bam2bcf.o errmod.o sample.o
+                       bamtk.o kaln.o bam2bcf.o bam2bcf_indel.o errmod.o sample.o
  PROG=          samtools
  INCLUDES=      -I.
  SUBDIRS=       . bcftools misc
@@ -64,6 +64,7 @@ glf.o:glf.h
  sam_header.o:sam_header.h khash.h
  bcf.o:bcftools/bcf.h
  bam2bcf.o:bam2bcf.h errmod.h bcftools/bcf.h
+bam2bcf_indel.o:bam2bcf.h
  errmod.o:errmod.h
  
  faidx.o:faidx.h razf.h khash.h
diff --git a/NEWS b/NEWS

index 82646ba812fa66f90c1ca4cf2473645f87809b31..6c2195e5929fab1dbe96b5fafc734a62443ec99c 100644 (file)
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,38 @@
+Beta Release 0.1.10 (16 November, 2010)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This release is featured as the first major improvement to the indel
+caller. The method is similar to the old one implemented in the pileup
+command, but the details are handled more carefully both in theory and
+in practice. As a result, the new indel caller usually gives more
+accurate indel calls, though at the cost of sensitivity. The caller is
+implemented in the mpileup command and is invoked by default. It works
+with multiple samples.
+
+Other notable changes:
+
+ * With the -r option, the calmd command writes the difference between
+   the original base quality and the BAQ capped base quality at the BQ
+   tag but does not modify the base quality. Please use -Ar to overwrite
+   the original base quality (the 0.1.9 behavior).
+
+ * Allow to set a maximum per-sample read depth to reduce memory. In
+   0.1.9, most of memory is wasted for the ultra high read depth in some
+   regions (e.g. the chr1 centromere).
+
+ * Optionally write per-sample read depth and per-sample strand bias
+   P-value.
+
+ * Compute equal-tail (Bayesian) credible interval of site allele
+   frequency at the CI95 VCF annotation.
+
+ * Merged the vcfutils.pl varFilter and filter4vcf for better SNP/indel
+   filtering.
+
+(0.1.10: 16 November 2010, r829)
+
+
+
  Beta Release 0.1.9 (27 October, 2010)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
diff --git a/bam.h b/bam.h

index 4c8536fd33366883028ff9b77bb95049682c5f9f..eef2ea9853f25966c5a40c45a86272b78a507fb2 100644 (file)
--- a/bam.h
+++ b/bam.h
@@ -495,7 +495,7 @@ extern "C" {
                 bam1_t *b;
                 int32_t qpos;
                 int indel, level;
-               uint32_t is_del:1, is_head:1, is_tail:1, is_refskip:1;
+               uint32_t is_del:1, is_head:1, is_tail:1, is_refskip:1, aux:28;
         } bam_pileup1_t;
  
         typedef int (*bam_plp_auto_f)(void *data, bam1_t *b);
@@ -508,6 +508,7 @@ extern "C" {
         const bam_pileup1_t *bam_plp_next(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp);
         const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp);
         void bam_plp_set_mask(bam_plp_t iter, int mask);
+       void bam_plp_set_maxcnt(bam_plp_t iter, int maxcnt);
         void bam_plp_reset(bam_plp_t iter);
         void bam_plp_destroy(bam_plp_t iter);
  
@@ -516,6 +517,7 @@ extern "C" {
  
         bam_mplp_t bam_mplp_init(int n, bam_plp_auto_f func, void **data);
         void bam_mplp_destroy(bam_mplp_t iter);
+       void bam_mplp_set_maxcnt(bam_mplp_t iter, int maxcnt);
         int bam_mplp_auto(bam_mplp_t iter, int *_tid, int *_pos, int *n_plp, const bam_pileup1_t **plp);
  
         /*! @typedef
@@ -708,8 +710,8 @@ static inline bam1_t *bam_copy1(bam1_t *bdst, const bam1_t *bsrc)
  {
         uint8_t *data = bdst->data;
         int m_data = bdst->m_data;   // backup data and m_data
-       if (m_data < bsrc->m_data) { // double the capacity
-               m_data = bsrc->m_data; kroundup32(m_data);
+       if (m_data < bsrc->data_len) { // double the capacity
+               m_data = bsrc->data_len; kroundup32(m_data);
                 data = (uint8_t*)realloc(data, m_data);
         }
         memcpy(data, bsrc->data, bsrc->data_len); // copy var-len data
diff --git a/bam2bcf.c b/bam2bcf.c

index e55212ca98cbb28041fb6a6dfc46619ff595ee09..088635c0b04a549b762d1a3b28efad82f4bb5c80 100644 (file)
--- a/bam2bcf.c
+++ b/bam2bcf.c
@@ -14,18 +14,13 @@ extern      void ks_introsort_uint32_t(size_t n, uint32_t a[]);
  
  #define CAP_DIST 25
  
-struct __bcf_callaux_t {
-       int max_bases, capQ, min_baseQ;
-       uint16_t *bases;
-       errmod_t *e;
-};
-
  bcf_callaux_t *bcf_call_init(double theta, int min_baseQ)
  {
         bcf_callaux_t *bca;
         if (theta <= 0.) theta = CALL_DEFTHETA;
         bca = calloc(1, sizeof(bcf_callaux_t));
         bca->capQ = 60;
+       bca->openQ = 40; bca->extQ = 20; bca->tandemQ = 100;
         bca->min_baseQ = min_baseQ;
         bca->e = errmod_init(1. - theta);
         return bca;
@@ -35,16 +30,19 @@ void bcf_call_destroy(bcf_callaux_t *bca)
  {
         if (bca == 0) return;
         errmod_destroy(bca->e);
-       free(bca->bases); free(bca);
+       free(bca->bases); free(bca->inscns); free(bca);
  }
-
-int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf_callaux_t *bca, bcf_callret1_t *r)
+/* ref_base is the 4-bit representation of the reference base. It is
+ * negative if we are looking at an indel. */
+int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base, bcf_callaux_t *bca, bcf_callret1_t *r)
  {
-       int i, n, ref4;
+       int i, n, ref4, is_indel, ori_depth = 0;
         memset(r, 0, sizeof(bcf_callret1_t));
-       ref4 = bam_nt16_nt4_table[ref_base];
+       if (ref_base >= 0) {
+               ref4 = bam_nt16_nt4_table[ref_base];
+               is_indel = 0;
+       } else ref4 = 4, is_indel = 1;
         if (_n == 0) return -1;
-
         // enlarge the bases array if necessary
         if (bca->max_bases < _n) {
                 bca->max_bases = _n;
@@ -55,21 +53,29 @@ int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf
         memset(r, 0, sizeof(bcf_callret1_t));
         for (i = n = 0; i < _n; ++i) {
                 const bam_pileup1_t *p = pl + i;
-               int q, b, mapQ, baseQ, is_diff, min_dist;
+               int q, b, mapQ, baseQ, is_diff, min_dist, seqQ;
                 // set base
-               if (p->is_del || (p->b->core.flag&BAM_FUNMAP)) continue; // skip unmapped reads and deleted bases
-               baseQ = q = (int)bam1_qual(p->b)[p->qpos]; // base quality
+               if (p->is_del || p->is_refskip || (p->b->core.flag&BAM_FUNMAP)) continue;
+               ++ori_depth;
+               baseQ = q = is_indel? p->aux&0xff : (int)bam1_qual(p->b)[p->qpos]; // base/indel quality
+               seqQ = is_indel? (p->aux>>8&0xff) : 99;
                 if (q < bca->min_baseQ) continue;
+               if (q > seqQ) q = seqQ;
                 mapQ = p->b->core.qual < bca->capQ? p->b->core.qual : bca->capQ;
                 if (q > mapQ) q = mapQ;
                 if (q > 63) q = 63;
                 if (q < 4) q = 4;
-               b = bam1_seqi(bam1_seq(p->b), p->qpos); // base
-               b = bam_nt16_nt4_table[b? b : ref_base]; // b is the 2-bit base
+               if (!is_indel) {
+                       b = bam1_seqi(bam1_seq(p->b), p->qpos); // base
+                       b = bam_nt16_nt4_table[b? b : ref_base]; // b is the 2-bit base
+                       is_diff = (ref4 < 4 && b == ref4)? 0 : 1;
+               } else {
+                       b = p->aux>>16&0x3f;
+                       is_diff = (b != 0);
+               }
                 bca->bases[n++] = q<<5 | (int)bam1_strand(p->b)<<4 | b;
                 // collect annotations
                 r->qsum[b] += q;
-               is_diff = (ref4 < 4 && b == ref4)? 0 : 1;
                 ++r->anno[0<<2|is_diff<<1|bam1_strand(p->b)];
                 min_dist = p->b->core.l_qseq - 1 - p->qpos;
                 if (min_dist > p->qpos) min_dist = p->qpos;
@@ -81,7 +87,7 @@ int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf
                 r->anno[3<<2|is_diff<<1|0] += min_dist;
                 r->anno[3<<2|is_diff<<1|1] += min_dist * min_dist;
         }
-       r->depth = n;
+       r->depth = n; r->ori_depth = ori_depth;
         // glfgen
         errmod_cal(bca->e, n, 5, bca->bases, r->p);
         return r->depth;
@@ -91,8 +97,10 @@ int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/,
  {
         int ref4, i, j, qsum[4];
         int64_t tmp;
-       call->ori_ref = ref4 = bam_nt16_nt4_table[ref_base];
-       if (ref4 > 4) ref4 = 4;
+       if (ref_base >= 0) {
+               call->ori_ref = ref4 = bam_nt16_nt4_table[ref_base];
+               if (ref4 > 4) ref4 = 4;
+       } else call->ori_ref = -1, ref4 = 0;
         // calculate qsum
         memset(qsum, 0, 4 * sizeof(int));
         for (i = 0; i < n; ++i)
@@ -113,9 +121,14 @@ int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/,
                         else break;
                 }
         }
-       if (((ref4 < 4 && j < 4) || (ref4 == 4 && j < 5)) && i >= 0)
-               call->unseen = j, call->a[j++] = qsum[i]&3;
-       call->n_alleles = j;
+       if (ref_base >= 0) { // for SNPs, find the "unseen" base
+               if (((ref4 < 4 && j < 4) || (ref4 == 4 && j < 5)) && i >= 0)
+                       call->unseen = j, call->a[j++] = qsum[i]&3;
+               call->n_alleles = j;
+       } else {
+               call->n_alleles = j;
+               if (call->n_alleles == 1) return -1; // no reliable supporting read. stop doing anything
+       }
         // set the PL array
         if (call->n < n) {
                 call->n = n;
@@ -143,44 +156,101 @@ int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/,
                                 PL[j] = y;
                         }
                 }
+//             if (ref_base < 0) fprintf(stderr, "%d,%d,%f,%d\n", call->n_alleles, x, sum_min, call->unseen);
                 call->shift = (int)(sum_min + .499);
         }
         // combine annotations
         memset(call->anno, 0, 16 * sizeof(int));
-       for (i = call->depth = 0, tmp = 0; i < n; ++i) {
+       for (i = call->depth = call->ori_depth = 0, tmp = 0; i < n; ++i) {
                 call->depth += calls[i].depth;
+               call->ori_depth += calls[i].ori_depth;
                 for (j = 0; j < 16; ++j) call->anno[j] += calls[i].anno[j];
         }
         return 0;
  }
  
-int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b)
+int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b, bcf_callret1_t *bcr, int is_SP,
+                                const bcf_callaux_t *bca, const char *ref)
  {
+       extern double kt_fisher_exact(int n11, int n12, int n21, int n22, double *_left, double *_right, double *two);
         kstring_t s;
-       int i;
+       int i, j;
         b->n_smpl = bc->n;
         b->tid = tid; b->pos = pos; b->qual = 0;
         s.s = b->str; s.m = b->m_str; s.l = 0;
         kputc('\0', &s);
-       kputc("ACGTN"[bc->ori_ref], &s); kputc('\0', &s);
-       for (i = 1; i < 5; ++i) {
-               if (bc->a[i] < 0) break;
-               if (i > 1) kputc(',', &s);
-               kputc(bc->unseen == i? 'X' : "ACGT"[bc->a[i]], &s);
+       if (bc->ori_ref < 0) { // an indel
+               // write REF
+               kputc(ref[pos], &s);
+               for (j = 0; j < bca->indelreg; ++j) kputc(ref[pos+1+j], &s);
+               kputc('\0', &s);
+               // write ALT
+               kputc(ref[pos], &s);
+               for (i = 1; i < 4; ++i) {
+                       if (bc->a[i] < 0) break;
+                       if (i > 1) {
+                               kputc(',', &s); kputc(ref[pos], &s);
+                       }
+                       if (bca->indel_types[bc->a[i]] < 0) { // deletion
+                               for (j = -bca->indel_types[bc->a[i]]; j < bca->indelreg; ++j)
+                                       kputc(ref[pos+1+j], &s);
+                       } else { // insertion; cannot be a reference unless a bug
+                               char *inscns = &bca->inscns[bc->a[i] * bca->maxins];
+                               for (j = 0; j < bca->indel_types[bc->a[i]]; ++j)
+                                       kputc("ACGTN"[(int)inscns[j]], &s);
+                               for (j = 0; j < bca->indelreg; ++j) kputc(ref[pos+1+j], &s);
+                       }
+               }
+               kputc('\0', &s);
+       } else { // a SNP
+               kputc("ACGTN"[bc->ori_ref], &s); kputc('\0', &s);
+               for (i = 1; i < 5; ++i) {
+                       if (bc->a[i] < 0) break;
+                       if (i > 1) kputc(',', &s);
+                       kputc(bc->unseen == i? 'X' : "ACGT"[bc->a[i]], &s);
+               }
+               kputc('\0', &s);
         }
         kputc('\0', &s);
-       kputc('\0', &s);
         // INFO
-       kputs("I16=", &s);
+       if (bc->ori_ref < 0) kputs("INDEL;", &s);
+       kputs("DP=", &s); kputw(bc->ori_depth, &s); kputs(";I16=", &s);
         for (i = 0; i < 16; ++i) {
                 if (i) kputc(',', &s);
                 kputw(bc->anno[i], &s);
         }
         kputc('\0', &s);
         // FMT
-       kputs("PL", &s); kputc('\0', &s);
+       kputs("PL", &s);
+       if (bcr) {
+               kputs(":DP", &s);
+               if (is_SP) kputs(":SP", &s);
+       }
+       kputc('\0', &s);
         b->m_str = s.m; b->str = s.s; b->l_str = s.l;
         bcf_sync(b);
         memcpy(b->gi[0].data, bc->PL, b->gi[0].len * bc->n);
+       if (bcr) {
+               uint16_t *dp = (uint16_t*)b->gi[1].data;
+               uint8_t *sp = is_SP? b->gi[2].data : 0;
+               for (i = 0; i < bc->n; ++i) {
+                       bcf_callret1_t *p = bcr + i;
+                       dp[i] = p->depth < 0xffff? p->depth : 0xffff;
+                       if (is_SP) {
+                               if (p->anno[0] + p->anno[1] < 2 || p->anno[2] + p->anno[3] < 2
+                                       || p->anno[0] + p->anno[2] < 2 || p->anno[1] + p->anno[3] < 2)
+                               {
+                                       sp[i] = 0;
+                               } else {
+                                       double left, right, two;
+                                       int x;
+                                       kt_fisher_exact(p->anno[0], p->anno[1], p->anno[2], p->anno[3], &left, &right, &two);
+                                       x = (int)(-4.343 * log(two) + .499);
+                                       if (x > 255) x = 255;
+                                       sp[i] = x;
+                               }
+                       }
+               }
+       }
         return 0;
  }
diff --git a/bam2bcf.h b/bam2bcf.h

index 0dddb924dcf62fee4f1352a9a20eecb7001da450..26b022c08e89e3597bd6e400690d42bd7ecabe38 100644 (file)
--- a/bam2bcf.h
+++ b/bam2bcf.h
@@ -2,13 +2,26 @@
  #define BAM2BCF_H
  
  #include <stdint.h>
+#include "errmod.h"
  #include "bcftools/bcf.h"
  
-struct __bcf_callaux_t;
-typedef struct __bcf_callaux_t bcf_callaux_t;
+#define B2B_INDEL_NULL 10000
+
+typedef struct __bcf_callaux_t {
+       int capQ, min_baseQ;
+       int openQ, extQ, tandemQ;
+       // for internal uses
+       int max_bases;
+       int indel_types[4];
+       int maxins, indelreg;
+       char *inscns;
+       uint16_t *bases;
+       errmod_t *e;
+       void *rghash;
+} bcf_callaux_t;
  
  typedef struct {
-       int depth, qsum[4];
+       int depth, ori_depth, qsum[4];
         int anno[16];
         float p[25];
  } bcf_callret1_t;
@@ -16,7 +29,7 @@ typedef struct {
  typedef struct {
         int a[5]; // alleles: ref, alt, alt2, alt3
         int n, n_alleles, shift, ori_ref, unseen;
-       int anno[16], depth;
+       int anno[16], depth, ori_depth;
         uint8_t *PL;
  } bcf_call_t;
  
@@ -26,9 +39,12 @@ extern "C" {
  
         bcf_callaux_t *bcf_call_init(double theta, int min_baseQ);
         void bcf_call_destroy(bcf_callaux_t *bca);
-       int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf_callaux_t *bca, bcf_callret1_t *r);
+       int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base, bcf_callaux_t *bca, bcf_callret1_t *r);
         int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/, bcf_call_t *call);
-       int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b);
+       int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b, bcf_callret1_t *bcr, int is_SP,
+                                        const bcf_callaux_t *bca, const char *ref);
+       int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_callaux_t *bca, const char *ref,
+                                                 const void *rghash);
  
  #ifdef __cplusplus
  }
diff --git a/bam2bcf_indel.c b/bam2bcf_indel.c

new file mode 100644 (file)

index 0000000..819c1e7
--- /dev/null
+++ b/bam2bcf_indel.c
@@ -0,0 +1,410 @@
+#include <assert.h>
+#include <ctype.h>
+#include <string.h>
+#include "bam.h"
+#include "bam2bcf.h"
+#include "ksort.h"
+#include "kaln.h"
+#include "kprobaln.h"
+#include "khash.h"
+KHASH_SET_INIT_STR(rg)
+
+#define MINUS_CONST 0x10000000
+#define INDEL_WINDOW_SIZE 50
+#define MIN_SUPPORT_COEF 500
+
+void *bcf_call_add_rg(void *_hash, const char *hdtext, const char *list)
+{
+       const char *s, *p, *q, *r, *t;
+       khash_t(rg) *hash;
+       if (list == 0 || hdtext == 0) return _hash;
+       if (_hash == 0) _hash = kh_init(rg);
+       hash = (khash_t(rg)*)_hash;
+       if ((s = strstr(hdtext, "@RG\t")) == 0) return hash;
+       do {
+               t = strstr(s + 4, "@RG\t"); // the next @RG
+               if ((p = strstr(s, "\tID:")) != 0) p += 4;
+               if ((q = strstr(s, "\tPL:")) != 0) q += 4;
+               if (p && q && (t == 0 || (p < t && q < t))) { // ID and PL are both present
+                       int lp, lq;
+                       char *x;
+                       for (r = p; *r && *r != '\t' && *r != '\n'; ++r); lp = r - p;
+                       for (r = q; *r && *r != '\t' && *r != '\n'; ++r); lq = r - q;
+                       x = calloc((lp > lq? lp : lq) + 1, 1);
+                       for (r = q; *r && *r != '\t' && *r != '\n'; ++r) x[r-q] = *r;
+                       if (strstr(list, x)) { // insert ID to the hash table
+                               khint_t k;
+                               int ret;
+                               for (r = p; *r && *r != '\t' && *r != '\n'; ++r) x[r-p] = *r;
+                               x[r-p] = 0;
+                               k = kh_get(rg, hash, x);
+                               if (k == kh_end(hash)) k = kh_put(rg, hash, x, &ret);
+                               else free(x);
+                       } else free(x);
+               }
+               s = t;
+       } while (s);
+       return hash;
+}
+
+void bcf_call_del_rghash(void *_hash)
+{
+       khint_t k;
+       khash_t(rg) *hash = (khash_t(rg)*)_hash;
+       if (hash == 0) return;
+       for (k = kh_begin(hash); k < kh_end(hash); ++k)
+               if (kh_exist(hash, k))
+                       free((char*)kh_key(hash, k));
+       kh_destroy(rg, hash);
+}
+
+static int tpos2qpos(const bam1_core_t *c, const uint32_t *cigar, int32_t tpos, int is_left, int32_t *_tpos)
+{
+       int k, x = c->pos, y = 0, last_y = 0;
+       *_tpos = c->pos;
+       for (k = 0; k < c->n_cigar; ++k) {
+               int op = cigar[k] & BAM_CIGAR_MASK;
+               int l = cigar[k] >> BAM_CIGAR_SHIFT;
+               if (op == BAM_CMATCH) {
+                       if (c->pos > tpos) return y;
+                       if (x + l > tpos) {
+                               *_tpos = tpos;
+                               return y + (tpos - x);
+                       }
+                       x += l; y += l;
+                       last_y = y;
+               } else if (op == BAM_CINS || op == BAM_CSOFT_CLIP) y += l;
+               else if (op == BAM_CDEL || op == BAM_CREF_SKIP) {
+                       if (x + l > tpos) {
+                               *_tpos = is_left? x : x + l;
+                               return y;
+                       }
+                       x += l;
+               }
+       }
+       *_tpos = x;
+       return last_y;
+}
+// FIXME: check if the inserted sequence is consistent with the homopolymer run
+// l is the relative gap length and l_run is the length of the homopolymer on the reference
+static inline int est_seqQ(const bcf_callaux_t *bca, int l, int l_run)
+{
+       int q, qh;
+       q = bca->openQ + bca->extQ * (abs(l) - 1);
+       qh = l_run >= 3? (int)(bca->tandemQ * (double)abs(l) / l_run + .499) : 1000;
+       return q < qh? q : qh;
+}
+
+static inline int est_indelreg(int pos, const char *ref, int l, char *ins4)
+{
+       int i, j, max = 0, max_i = pos, score = 0;
+       l = abs(l);
+       for (i = pos + 1, j = 0; ref[i]; ++i, ++j) {
+               if (ins4) score += (toupper(ref[i]) != "ACGTN"[(int)ins4[j%l]])? -10 : 1;
+               else score += (toupper(ref[i]) != toupper(ref[pos+1+j%l]))? -10 : 1;
+               if (score < 0) break;
+               if (max < score) max = score, max_i = i;
+       }
+       return max_i - pos;
+}
+
+int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_callaux_t *bca, const char *ref,
+                                         const void *rghash)
+{
+       extern void ks_introsort_uint32_t(int, uint32_t*);
+       int i, s, j, k, t, n_types, *types, max_rd_len, left, right, max_ins, *score1, *score2;
+       int N, K, l_run, ref_type, n_alt;
+       char *inscns = 0, *ref2, *query;
+       khash_t(rg) *hash = (khash_t(rg)*)rghash;
+       if (ref == 0 || bca == 0) return -1;
+       // mark filtered reads
+       if (rghash) {
+               N = 0;
+               for (s = N = 0; s < n; ++s) {
+                       for (i = 0; i < n_plp[s]; ++i) {
+                               bam_pileup1_t *p = plp[s] + i;
+                               const uint8_t *rg = bam_aux_get(p->b, "RG");
+                               p->aux = 1; // filtered by default
+                               if (rg) {
+                                       khint_t k = kh_get(rg, hash, (const char*)(rg + 1));
+                                       if (k != kh_end(hash)) p->aux = 0, ++N; // not filtered
+                               }
+                       }
+               }
+               if (N == 0) return -1; // no reads left
+       }
+       // determine if there is a gap
+       for (s = N = 0; s < n; ++s) {
+               for (i = 0; i < n_plp[s]; ++i)
+                       if (plp[s][i].indel != 0) break;
+               if (i < n_plp[s]) break;
+       }
+       if (s == n) return -1; // there is no indel at this position.
+       for (s = N = 0; s < n; ++s) N += n_plp[s]; // N is the total number of reads
+       { // find out how many types of indels are present
+               int m, n_alt = 0, n_tot = 0;
+               uint32_t *aux;
+               aux = calloc(N + 1, 4);
+               m = max_rd_len = 0;
+               aux[m++] = MINUS_CONST; // zero indel is always a type
+               for (s = 0; s < n; ++s) {
+                       for (i = 0; i < n_plp[s]; ++i) {
+                               const bam_pileup1_t *p = plp[s] + i;
+                               if (rghash == 0 || p->aux == 0) {
+                                       ++n_tot;
+                                       if (p->indel != 0) {
+                                               ++n_alt;
+                                               aux[m++] = MINUS_CONST + p->indel;
+                                       }
+                               }
+                               j = bam_cigar2qlen(&p->b->core, bam1_cigar(p->b));
+                               if (j > max_rd_len) max_rd_len = j;
+                       }
+               }
+               ks_introsort(uint32_t, m, aux);
+               // squeeze out identical types
+               for (i = 1, n_types = 1; i < m; ++i)
+                       if (aux[i] != aux[i-1]) ++n_types;
+               if (n_types == 1 || n_alt * MIN_SUPPORT_COEF < n_tot) { // no indels or too few supporting reads
+                       free(aux); return -1;
+               }
+               types = (int*)calloc(n_types, sizeof(int));
+               t = 0;
+               types[t++] = aux[0] - MINUS_CONST; 
+               for (i = 1; i < m; ++i)
+                       if (aux[i] != aux[i-1])
+                               types[t++] = aux[i] - MINUS_CONST;
+               free(aux);
+               for (t = 0; t < n_types; ++t)
+                       if (types[t] == 0) break;
+               ref_type = t; // the index of the reference type (0)
+               assert(n_types < 64);
+       }
+       { // calculate left and right boundary
+               left = pos > INDEL_WINDOW_SIZE? pos - INDEL_WINDOW_SIZE : 0;
+               right = pos + INDEL_WINDOW_SIZE;
+               if (types[0] < 0) right -= types[0];
+               // in case the alignments stand out the reference
+               for (i = pos; i < right; ++i)
+                       if (ref[i] == 0) break;
+               right = i;
+       }
+       { // the length of the homopolymer run around the current position
+               int c = bam_nt16_table[(int)ref[pos + 1]];
+               if (c == 15) l_run = 1;
+               else {
+                       for (i = pos + 2; ref[i]; ++i)
+                               if (bam_nt16_table[(int)ref[i]] != c) break;
+                       l_run = i;
+                       for (i = pos; i >= 0; --i)
+                               if (bam_nt16_table[(int)ref[i]] != c) break;
+                       l_run -= i + 1;
+               }
+       }
+       // construct the consensus sequence
+       max_ins = types[n_types - 1]; // max_ins is at least 0
+       if (max_ins > 0) {
+               int *inscns_aux = calloc(4 * n_types * max_ins, sizeof(int));
+               // count the number of occurrences of each base at each position for each type of insertion
+               for (t = 0; t < n_types; ++t) {
+                       if (types[t] > 0) {
+                               for (s = 0; s < n; ++s) {
+                                       for (i = 0; i < n_plp[s]; ++i) {
+                                               bam_pileup1_t *p = plp[s] + i;
+                                               if (p->indel == types[t]) {
+                                                       uint8_t *seq = bam1_seq(p->b);
+                                                       for (k = 1; k <= p->indel; ++k) {
+                                                               int c = bam_nt16_nt4_table[bam1_seqi(seq, p->qpos + k)];
+                                                               if (c < 4) ++inscns_aux[(t*max_ins+(k-1))*4 + c];
+                                                       }
+                                               }
+                                       }
+                               }
+                       }
+               }
+               // use the majority rule to construct the consensus
+               inscns = calloc(n_types * max_ins, 1);
+               for (t = 0; t < n_types; ++t) {
+                       for (j = 0; j < types[t]; ++j) {
+                               int max = 0, max_k = -1, *ia = &inscns_aux[(t*max_ins+j)*4];
+                               for (k = 0; k < 4; ++k)
+                                       if (ia[k] > max)
+                                               max = ia[k], max_k = k;
+                               inscns[t*max_ins + j] = max? max_k : 4;
+                       }
+               }
+               free(inscns_aux);
+       }
+       // compute the likelihood given each type of indel for each read
+       ref2  = calloc(right - left + max_ins + 2, 1);
+       query = calloc(right - left + max_rd_len + max_ins + 2, 1);
+       score1 = calloc(N * n_types, sizeof(int));
+       score2 = calloc(N * n_types, sizeof(int));
+       bca->indelreg = 0;
+       for (t = 0; t < n_types; ++t) {
+               int l, ir;
+               kpa_par_t apf1 = { 1e-4, 1e-2, 10 }, apf2 = { 1e-6, 1e-3, 10 };
+               apf1.bw = apf2.bw = abs(types[t]) + 3;
+               // compute indelreg
+               if (types[t] == 0) ir = 0;
+               else if (types[t] > 0) ir = est_indelreg(pos, ref, types[t], &inscns[t*max_ins]);
+               else ir = est_indelreg(pos, ref, -types[t], 0);
+               if (ir > bca->indelreg) bca->indelreg = ir;
+//             fprintf(stderr, "%d, %d, %d\n", pos, types[t], ir);
+               // write ref2
+               for (k = 0, j = left; j <= pos; ++j)
+                       ref2[k++] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[j]]];
+               if (types[t] <= 0) j += -types[t];
+               else for (l = 0; l < types[t]; ++l)
+                                ref2[k++] = inscns[t*max_ins + l];
+               if (types[0] < 0) { // mask deleted sequences to avoid a particular error in the model.
+                       int jj, tmp = types[t] >= 0? -types[0] : -types[0] + types[t];
+                       for (jj = 0; jj < tmp && j < right && ref[j]; ++jj, ++j)
+                               ref2[k++] = 4;
+               }
+               for (; j < right && ref[j]; ++j)
+                       ref2[k++] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[j]]];
+               if (j < right) right = j;
+               // align each read to ref2
+               for (s = K = 0; s < n; ++s) {
+                       for (i = 0; i < n_plp[s]; ++i, ++K) {
+                               bam_pileup1_t *p = plp[s] + i;
+                               int qbeg, qend, tbeg, tend, sc;
+                               uint8_t *seq = bam1_seq(p->b);
+                               // FIXME: the following skips soft clips, but using them may be more sensitive.
+                               // determine the start and end of sequences for alignment
+                               qbeg = tpos2qpos(&p->b->core, bam1_cigar(p->b), left,  0, &tbeg);
+                               qend = tpos2qpos(&p->b->core, bam1_cigar(p->b), right, 1, &tend);
+                               if (types[t] < 0) {
+                                       int l = -types[t];
+                                       tbeg = tbeg - l > left?  tbeg - l : left;
+                               }
+                               // write the query sequence
+                               for (l = qbeg; l < qend; ++l)
+                                       query[l - qbeg] = bam_nt16_nt4_table[bam1_seqi(seq, l)];
+                               { // do realignment; this is the bottleneck
+                                       const uint8_t *qual = bam1_qual(p->b), *bq;
+                                       uint8_t *qq;
+                                       qq = calloc(qend - qbeg, 1);
+                                       bq = (uint8_t*)bam_aux_get(p->b, "ZQ");
+                                       if (bq) ++bq; // skip type
+                                       for (l = qbeg; l < qend; ++l) {
+                                               qq[l - qbeg] = bq? qual[l] + (bq[l] - 64) : qual[l];
+                                               if (qq[l - qbeg] > 30) qq[l - qbeg] = 30;
+                                               if (qq[l - qbeg] < 7) qq[l - qbeg] = 7;
+                                       }
+                                       sc = kpa_glocal((uint8_t*)ref2 + tbeg - left, tend - tbeg + abs(types[t]),
+                                                                       (uint8_t*)query, qend - qbeg, qq, &apf1, 0, 0);
+                                       l = (int)(100. * sc / (qend - qbeg) + .499); // used for adjusting indelQ below
+                                       if (l > 255) l = 255;
+                                       score1[K*n_types + t] = score2[K*n_types + t] = sc<<8 | l;
+                                       if (sc > 5) {
+                                               sc = kpa_glocal((uint8_t*)ref2 + tbeg - left, tend - tbeg + abs(types[t]),
+                                                                               (uint8_t*)query, qend - qbeg, qq, &apf2, 0, 0);
+                                               l = (int)(100. * sc / (qend - qbeg) + .499);
+                                               if (l > 255) l = 255;
+                                               score2[K*n_types + t] = sc<<8 | l;
+                                       }
+                                       free(qq);
+                               }
+/*
+                               for (l = 0; l < tend - tbeg + abs(types[t]); ++l)
+                                       fputc("ACGTN"[(int)ref2[tbeg-left+l]], stderr);
+                               fputc('\n', stderr);
+                               for (l = 0; l < qend - qbeg; ++l) fputc("ACGTN"[(int)query[l]], stderr);
+                               fputc('\n', stderr);
+                               fprintf(stderr, "pos=%d type=%d read=%d:%d name=%s qbeg=%d tbeg=%d score=%d\n", pos, types[t], s, i, bam1_qname(p->b), qbeg, tbeg, sc);
+*/
+                       }
+               }
+       }
+       free(ref2); free(query);
+       { // compute indelQ
+               int *sc, tmp, *sumq;
+               sc   = alloca(n_types * sizeof(int));
+               sumq = alloca(n_types * sizeof(int));
+               memset(sumq, 0, sizeof(int) * n_types);
+               for (s = K = 0; s < n; ++s) {
+                       for (i = 0; i < n_plp[s]; ++i, ++K) {
+                               bam_pileup1_t *p = plp[s] + i;
+                               int *sct = &score1[K*n_types], indelQ1, indelQ2, seqQ, indelQ;
+                               for (t = 0; t < n_types; ++t) sc[t] = sct[t]<<6 | t;
+                               for (t = 1; t < n_types; ++t) // insertion sort
+                                       for (j = t; j > 0 && sc[j] < sc[j-1]; --j)
+                                               tmp = sc[j], sc[j] = sc[j-1], sc[j-1] = tmp;
+                               /* errmod_cal() assumes that if the call is wrong, the
+                                * likelihoods of other events are equal. This is about
+                                * right for substitutions, but is not desired for
+                                * indels. To reuse errmod_cal(), I have to make
+                                * compromise for multi-allelic indels.
+                                */
+                               if ((sc[0]&0x3f) == ref_type) {
+                                       indelQ1 = (sc[1]>>14) - (sc[0]>>14);
+                                       seqQ = est_seqQ(bca, types[sc[1]&0x3f], l_run);
+                               } else {
+                                       for (t = 0; t < n_types; ++t) // look for the reference type
+                                               if ((sc[t]&0x3f) == ref_type) break;
+                                       indelQ1 = (sc[t]>>14) - (sc[0]>>14);
+                                       seqQ = est_seqQ(bca, types[sc[0]&0x3f], l_run);
+                               }
+                               tmp = sc[0]>>6 & 0xff;
+                               indelQ1 = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ1 + .499); // reduce indelQ
+                               sct = &score2[K*n_types];
+                               for (t = 0; t < n_types; ++t) sc[t] = sct[t]<<6 | t;
+                               for (t = 1; t < n_types; ++t) // insertion sort
+                                       for (j = t; j > 0 && sc[j] < sc[j-1]; --j)
+                                               tmp = sc[j], sc[j] = sc[j-1], sc[j-1] = tmp;
+                               if ((sc[0]&0x3f) == ref_type) {
+                                       indelQ2 = (sc[1]>>14) - (sc[0]>>14);
+                               } else {
+                                       for (t = 0; t < n_types; ++t) // look for the reference type
+                                               if ((sc[t]&0x3f) == ref_type) break;
+                                       indelQ2 = (sc[t]>>14) - (sc[0]>>14);
+                               }
+                               tmp = sc[0]>>6 & 0xff;
+                               indelQ2 = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ2 + .499);
+                               // pick the smaller between indelQ1 and indelQ2
+                               indelQ = indelQ1 < indelQ2? indelQ1 : indelQ2;
+                               p->aux = (sc[0]&0x3f)<<16 | seqQ<<8 | indelQ;
+                               sumq[sc[0]&0x3f] += indelQ < seqQ? indelQ : seqQ;
+//                             fprintf(stderr, "pos=%d read=%d:%d name=%s call=%d q=%d\n", pos, s, i, bam1_qname(p->b), types[sc[0]&0x3f], indelQ);
+                       }
+               }
+               // determine bca->indel_types[] and bca->inscns
+               bca->maxins = max_ins;
+               bca->inscns = realloc(bca->inscns, bca->maxins * 4);
+               for (t = 0; t < n_types; ++t)
+                       sumq[t] = sumq[t]<<6 | t;
+               for (t = 1; t < n_types; ++t) // insertion sort
+                       for (j = t; j > 0 && sumq[j] > sumq[j-1]; --j)
+                               tmp = sumq[j], sumq[j] = sumq[j-1], sumq[j-1] = tmp;
+               for (t = 0; t < n_types; ++t) // look for the reference type
+                       if ((sumq[t]&0x3f) == ref_type) break;
+               if (t) { // then move the reference type to the first
+                       tmp = sumq[t];
+                       for (; t > 0; --t) sumq[t] = sumq[t-1];
+                       sumq[0] = tmp;
+               }
+               for (t = 0; t < 4; ++t) bca->indel_types[t] = B2B_INDEL_NULL;
+               for (t = 0; t < 4 && t < n_types; ++t) {
+                       bca->indel_types[t] = types[sumq[t]&0x3f];
+                       memcpy(&bca->inscns[t * bca->maxins], &inscns[(sumq[t]&0x3f) * max_ins], bca->maxins);
+               }
+               // update p->aux
+               for (s = n_alt = 0; s < n; ++s) {
+                       for (i = 0; i < n_plp[s]; ++i) {
+                               bam_pileup1_t *p = plp[s] + i;
+                               int x = types[p->aux>>16&0x3f];
+                               for (j = 0; j < 4; ++j)
+                                       if (x == bca->indel_types[j]) break;
+                               p->aux = j<<16 | (j == 4? 0 : (p->aux&0xffff));
+                               if ((p->aux>>16&0x3f) > 0) ++n_alt;
+//                             fprintf(stderr, "X pos=%d read=%d:%d name=%s call=%d type=%d q=%d seqQ=%d\n", pos, s, i, bam1_qname(p->b), p->aux>>16&63, bca->indel_types[p->aux>>16&63], p->aux&0xff, p->aux>>8&0xff);
+                       }
+               }               
+       }
+       free(score1); free(score2);
+       // free
+       free(types); free(inscns);
+       return n_alt > 0? 0 : -1;
+}
diff --git a/bam_maqcns.c b/bam_maqcns.c

index 2f3fc082be79aa0fa3feae9c1d9a03ad44592877..4fbc6c6d0cd14f54ac7d75f07f789cbdb73e4362 100644 (file)
--- a/bam_maqcns.c
+++ b/bam_maqcns.c
@@ -154,7 +154,7 @@ glf1_t *bam_maqcns_glfgen(int _n, const bam_pileup1_t *pl, uint8_t ref_base, bam
                 const bam_pileup1_t *p = pl + i;
                 uint32_t q, x = 0, qq;
                 uint16_t y = 0;
-               if (p->is_del || (p->b->core.flag&BAM_FUNMAP)) continue;
+               if (p->is_del || p->is_refskip || (p->b->core.flag&BAM_FUNMAP)) continue;
                 q = (uint32_t)bam1_qual(p->b)[p->qpos];
                 if (q < bm->min_baseQ) continue;
                 x |= (uint32_t)bam1_strand(p->b) << 18 | q << 8 | p->b->core.qual;
@@ -166,7 +166,7 @@ glf1_t *bam_maqcns_glfgen(int _n, const bam_pileup1_t *pl, uint8_t ref_base, bam
                 y |= q << 5;
                 qq = bam1_seqi(bam1_seq(p->b), p->qpos);
                 q = bam_nt16_nt4_table[qq? qq : ref_base];
-               if (!p->is_del && q < 4) x |= 1 << 21 | q << 16, y |= q;
+               if (!p->is_del && !p->is_refskip && q < 4) x |= 1 << 21 | q << 16, y |= q;
                 bm->aux->info16[n] = y;
                 bm->aux->info[n++] = x;
         }
diff --git a/bam_md.c b/bam_md.c

index f65b845b5d594980d4b4dba36122672c71b19cc9..44d46a492438beecae74fc648c9564a566565be7 100644 (file)
--- a/bam_md.c
+++ b/bam_md.c
@@ -7,6 +7,7 @@
  #include "sam.h"
  #include "kstring.h"
  #include "kaln.h"
+#include "kprobaln.h"
  
  void bam_fillmd1_core(bam1_t *b, char *ref, int is_equal, int max_nm)
  {
@@ -161,14 +162,35 @@ int bam_cap_mapQ(bam1_t *b, char *ref, int thres)
         return (int)(t + .499);
  }
  
-int bam_prob_realn(bam1_t *b, const char *ref)
+int bam_prob_realn_core(bam1_t *b, const char *ref, int apply_baq)
  {
         int k, i, bw, x, y, yb, ye, xb, xe;
         uint32_t *cigar = bam1_cigar(b);
         bam1_core_t *c = &b->core;
-       ka_probpar_t conf = ka_probpar_def;
-       // find the start and end of the alignment
-       if (c->flag & BAM_FUNMAP) return -1;
+       kpa_par_t conf = kpa_par_def;
+       uint8_t *bq = 0, *zq = 0, *qual = bam1_qual(b);
+       if (c->flag & BAM_FUNMAP) return -1; // do nothing
+       // test if BQ or ZQ is present
+       if ((bq = bam_aux_get(b, "BQ")) != 0) ++bq;
+       if ((zq = bam_aux_get(b, "ZQ")) != 0 && *zq == 'Z') ++zq;
+       if (bq && zq) { // remove the ZQ tag
+               bam_aux_del(b, zq-1);
+               zq = 0;
+       }
+       if (bq || zq) {
+               if ((apply_baq && zq) || (!apply_baq && bq)) return -3; // in both cases, do nothing
+               if (bq && apply_baq) { // then convert BQ to ZQ
+                       for (i = 0; i < c->l_qseq; ++i)
+                               qual[i] = qual[i] + 64 < bq[i]? 0 : qual[i] - ((int)bq[i] - 64);
+                       *(bq - 3) = 'Z';
+               } else if (zq && !apply_baq) { // then convert ZQ to BQ
+                       for (i = 0; i < c->l_qseq; ++i)
+                               qual[i] += (int)zq[i] - 64;
+                       *(zq - 3) = 'B';
+               }
+               return 0;
+       }
+       // find the start and end of the alignment      
         x = c->pos, y = 0, yb = ye = xb = xe = -1;
         for (k = 0; k < c->n_cigar; ++k) {
                 int op, l;
@@ -180,7 +202,7 @@ int bam_prob_realn(bam1_t *b, const char *ref)
                         x += l; y += l;
                 } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l;
                 else if (op == BAM_CDEL) x += l;
-               else if (op == BAM_CREF_SKIP) return -1;
+               else if (op == BAM_CREF_SKIP) return -1; // do nothing if there is a reference skip
         }
         // set bandwidth and the start and the end
         bw = 7;
@@ -192,8 +214,10 @@ int bam_prob_realn(bam1_t *b, const char *ref)
         if (xe - xb - c->l_qseq > bw)
                 xb += (xe - xb - c->l_qseq - bw) / 2, xe -= (xe - xb - c->l_qseq - bw) / 2;
         { // glocal
-               uint8_t *s, *r, *q, *seq = bam1_seq(b), *qual = bam1_qual(b);
+               uint8_t *s, *r, *q, *seq = bam1_seq(b), *bq;
                 int *state;
+               bq = calloc(c->l_qseq + 1, 1);
+               memcpy(bq, qual, c->l_qseq);
                 s = calloc(c->l_qseq, 1);
                 for (i = 0; i < c->l_qseq; ++i) s[i] = bam_nt16_nt4_table[bam1_seqi(seq, i)];
                 r = calloc(xe - xb, 1);
@@ -201,35 +225,45 @@ int bam_prob_realn(bam1_t *b, const char *ref)
                         r[i-xb] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[i]]];
                 state = calloc(c->l_qseq, sizeof(int));
                 q = calloc(c->l_qseq, 1);
-               ka_prob_glocal(r, xe-xb, s, c->l_qseq, qual, &conf, state, q);
+               kpa_glocal(r, xe-xb, s, c->l_qseq, qual, &conf, state, q);
                 for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) {
                         int op = cigar[k]&0xf, l = cigar[k]>>4;
                         if (op == BAM_CMATCH) {
                                 for (i = y; i < y + l; ++i) {
-                                       if ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y)) qual[i] = 0;
-                                       else qual[i] = qual[i] < q[i]? qual[i] : q[i];
+                                       if ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y)) bq[i] = 0;
+                                       else bq[i] = bq[i] < q[i]? bq[i] : q[i];
                                 }
                                 x += l; y += l;
                         } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l;
                         else if (op == BAM_CDEL) x += l;
                 }
-               free(s); free(r); free(q); free(state);
+               for (i = 0; i < c->l_qseq; ++i) bq[i] = qual[i] - bq[i] + 64; // finalize BQ
+               if (apply_baq) {
+                       for (i = 0; i < c->l_qseq; ++i) qual[i] -= bq[i] - 64; // modify qual
+                       bam_aux_append(b, "ZQ", 'Z', c->l_qseq + 1, bq);
+               } else bam_aux_append(b, "BQ", 'Z', c->l_qseq + 1, bq);
+               free(bq); free(s); free(r); free(q); free(state);
         }
         return 0;
  }
  
+int bam_prob_realn(bam1_t *b, const char *ref)
+{
+       return bam_prob_realn_core(b, ref, 1);
+}
+
  int bam_fillmd(int argc, char *argv[])
  {
-       int c, is_equal = 0, tid = -2, ret, len, is_bam_out, is_sam_in, is_uncompressed, max_nm = 0, is_realn, capQ = 0;
+       int c, is_equal, tid = -2, ret, len, is_bam_out, is_sam_in, is_uncompressed, max_nm, is_realn, capQ, apply_baq;
         samfile_t *fp, *fpout = 0;
         faidx_t *fai;
         char *ref = 0, mode_w[8], mode_r[8];
         bam1_t *b;
  
-       is_bam_out = is_sam_in = is_uncompressed = is_realn = 0;
+       is_equal = is_bam_out = is_sam_in = is_uncompressed = is_realn = max_nm = apply_baq = capQ = 0;
         mode_w[0] = mode_r[0] = 0;
         strcpy(mode_r, "r"); strcpy(mode_w, "w");
-       while ((c = getopt(argc, argv, "reubSC:n:")) >= 0) {
+       while ((c = getopt(argc, argv, "reubSC:n:A")) >= 0) {
                 switch (c) {
                 case 'r': is_realn = 1; break;
                 case 'e': is_equal = 1; break;
@@ -238,6 +272,7 @@ int bam_fillmd(int argc, char *argv[])
                 case 'S': is_sam_in = 1; break;
                 case 'n': max_nm = atoi(optarg); break;
                 case 'C': capQ = atoi(optarg); break;
+               case 'A': apply_baq = 1; break;
                 default: fprintf(stderr, "[bam_fillmd] unrecognized option '-%c'\n", c); return 1;
                 }
         }
@@ -252,6 +287,7 @@ int bam_fillmd(int argc, char *argv[])
                 fprintf(stderr, "         -u       uncompressed BAM output (for piping)\n");
                 fprintf(stderr, "         -b       compressed BAM output\n");
                 fprintf(stderr, "         -S       the input is SAM with header\n");
+               fprintf(stderr, "         -A       modify the quality string\n");
                 fprintf(stderr, "         -r       read-independent local realignment\n\n");
                 return 1;
         }
@@ -275,8 +311,7 @@ int bam_fillmd(int argc, char *argv[])
                                         fprintf(stderr, "[bam_fillmd] fail to find sequence '%s' in the reference.\n",
                                                         fp->header->target_name[tid]);
                         }
-//                     if (is_realn) bam_realn(b, ref);
-                       if (is_realn) bam_prob_realn(b, ref);
+                       if (is_realn) bam_prob_realn_core(b, ref, apply_baq);
                         if (capQ > 10) {
                                 int q = bam_cap_mapQ(b, ref, capQ);
                                 if (b->core.qual > q) b->core.qual = q;
diff --git a/bam_pileup.c b/bam_pileup.c

index a49f795d73d4f0ed1ad3a9592108f3be2a917579..55e51e27804cfc58fc5576b40ae7ba10416a3e8f 100644 (file)
--- a/bam_pileup.c
+++ b/bam_pileup.c
@@ -152,7 +152,7 @@ struct __bam_plp_t {
         mempool_t *mp;
         lbnode_t *head, *tail, *dummy;
         int32_t tid, pos, max_tid, max_pos;
-       int is_eof, flag_mask, max_plp, error;
+       int is_eof, flag_mask, max_plp, error, maxcnt;
         bam_pileup1_t *plp;
         // for the "auto" interface only
         bam1_t *b;
@@ -169,6 +169,7 @@ bam_plp_t bam_plp_init(bam_plp_auto_f func, void *data)
         iter->dummy = mp_alloc(iter->mp);
         iter->max_tid = iter->max_pos = -1;
         iter->flag_mask = BAM_DEF_MASK;
+       iter->maxcnt = 8000;
         if (func) {
                 iter->func = func;
                 iter->data = data;
@@ -240,6 +241,7 @@ int bam_plp_push(bam_plp_t iter, const bam1_t *b)
         if (b) {
                 if (b->core.tid < 0) return 0;
                 if (b->core.flag & iter->flag_mask) return 0;
+               if (iter->tid == b->core.tid && iter->pos == b->core.pos && iter->mp->cnt > iter->maxcnt) return 0;
                 bam_copy1(&iter->tail->b, b);
                 iter->tail->beg = b->core.pos; iter->tail->end = bam_calend(&b->core, bam1_cigar(b));
                 iter->tail->s = g_cstate_null; iter->tail->s.end = iter->tail->end - 1; // initialize cstate_t
@@ -267,7 +269,7 @@ const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_
         const bam_pileup1_t *plp;
         if (iter->func == 0 || iter->error) { *_n_plp = -1; return 0; }
         if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp;
-       else {
+       else { // no pileup line can be obtained; read alignments
                 *_n_plp = 0;
                 if (iter->is_eof) return 0;
                 while (iter->func(iter->data, iter->b) >= 0) {
@@ -276,6 +278,7 @@ const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_
                                 return 0;
                         }
                         if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp;
+                       // otherwise no pileup line can be returned; read the next alignment.
                 }
                 bam_plp_push(iter, 0);
                 if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp;
@@ -302,6 +305,11 @@ void bam_plp_set_mask(bam_plp_t iter, int mask)
         iter->flag_mask = mask < 0? BAM_DEF_MASK : (BAM_FUNMAP | mask);
  }
  
+void bam_plp_set_maxcnt(bam_plp_t iter, int maxcnt)
+{
+       iter->maxcnt = maxcnt;
+}
+
  /*****************
   * callback APIs *
   *****************/
@@ -389,6 +397,13 @@ bam_mplp_t bam_mplp_init(int n, bam_plp_auto_f func, void **data)
         return iter;
  }
  
+void bam_mplp_set_maxcnt(bam_mplp_t iter, int maxcnt)
+{
+       int i;
+       for (i = 0; i < iter->n; ++i)
+               iter->iter[i]->maxcnt = maxcnt;
+}
+
  void bam_mplp_destroy(bam_mplp_t iter)
  {
         int i;
diff --git a/bam_plcmd.c b/bam_plcmd.c

index dca85181f99bf1c7343650846039153a019e7ea9..e3f73aaddd03a14343a82084ee060d9c41ba42d4 100644 (file)
--- a/bam_plcmd.c
+++ b/bam_plcmd.c
@@ -2,6 +2,8 @@
  #include <stdio.h>
  #include <unistd.h>
  #include <ctype.h>
+#include <string.h>
+#include <errno.h>
  #include "sam.h"
  #include "faidx.h"
  #include "bam_maqcns.h"
@@ -528,10 +530,14 @@ int bam_pileup(int argc, char *argv[])
  #define MPLP_NO_COMP 0x20
  #define MPLP_NO_ORPHAN 0x40
  #define MPLP_REALN   0x80
+#define MPLP_FMT_DP 0x100
+#define MPLP_FMT_SP 0x200
+#define MPLP_NO_INDEL 0x400
  
  typedef struct {
-       int max_mq, min_mq, flag, min_baseQ, capQ_thres;
-       char *reg, *fn_pos;
+       int max_mq, min_mq, flag, min_baseQ, capQ_thres, max_depth;
+       int openQ, extQ, tandemQ;
+       char *reg, *fn_pos, *pl_list;
         faidx_t *fai;
         kh_64_t *hash;
  } mplp_conf_t;
@@ -552,7 +558,7 @@ typedef struct {
  static int mplp_func(void *data, bam1_t *b)
  {
         extern int bam_realn(bam1_t *b, const char *ref);
-       extern int bam_prob_realn(bam1_t *b, const char *ref);
+       extern int bam_prob_realn_core(bam1_t *b, const char *ref, int);
         extern int bam_cap_mapQ(bam1_t *b, char *ref, int thres);
         mplp_aux_t *ma = (mplp_aux_t*)data;
         int ret, skip = 0;
@@ -561,7 +567,7 @@ static int mplp_func(void *data, bam1_t *b)
                 ret = ma->iter? bam_iter_read(ma->fp, ma->iter, b) : bam_read1(ma->fp, b);
                 if (ret < 0) break;
                 skip = 0;
-               if (has_ref && (ma->flag&MPLP_REALN)) bam_prob_realn(b, ma->ref);
+               if (has_ref && (ma->flag&MPLP_REALN)) bam_prob_realn_core(b, ma->ref, 1);
                 if (has_ref && ma->capQ_thres > 10) {
                         int q = bam_cap_mapQ(b, ma->ref, ma->capQ_thres);
                         if (q < 0) skip = 1;
@@ -598,13 +604,16 @@ static void group_smpl(mplp_pileup_t *m, bam_sample_t *sm, kstring_t *buf,
  
  static int mpileup(mplp_conf_t *conf, int n, char **fn)
  {
+       extern void *bcf_call_add_rg(void *rghash, const char *hdtext, const char *list);
+       extern void bcf_call_del_rghash(void *rghash);
         mplp_aux_t **data;
-       int i, tid, pos, *n_plp, beg0 = 0, end0 = 1u<<29, ref_len, ref_tid;
+       int i, tid, pos, *n_plp, beg0 = 0, end0 = 1u<<29, ref_len, ref_tid, max_depth;
         const bam_pileup1_t **plp;
         bam_mplp_t iter;
         bam_header_t *h = 0;
         char *ref;
         khash_t(64) *hash = 0;
+       void *rghash = 0;
  
         bcf_callaux_t *bca = 0;
         bcf_callret1_t *bcr = 0;
@@ -634,6 +643,7 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
                 data[i]->fp = strcmp(fn[i], "-") == 0? bam_dopen(fileno(stdin), "r") : bam_open(fn[i], "r");
                 h_tmp = bam_header_read(data[i]->fp);
                 bam_smpl_add(sm, fn[i], h_tmp->text);
+               rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list);
                 if (conf->reg) {
                         int beg, end;
                         bam_index_t *idx;
@@ -689,9 +699,19 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
                 bcf_hdr_write(bp, bh);
                 bca = bcf_call_init(-1., conf->min_baseQ);
                 bcr = calloc(sm->n, sizeof(bcf_callret1_t));
+               bca->rghash = rghash;
+               bca->openQ = conf->openQ, bca->extQ = conf->extQ, bca->tandemQ = conf->tandemQ;
         }
         ref_tid = -1; ref = 0;
         iter = bam_mplp_init(n, mplp_func, (void**)data);
+       max_depth = conf->max_depth;
+       if (max_depth * sm->n > 1<<20)
+               fprintf(stderr, "(%s) Max depth is above 1M. Potential memory hog!\n", __func__);
+       if (max_depth * sm->n < 8000) {
+               max_depth = 8000 / sm->n;
+               fprintf(stderr, "<%s> Set max per-sample depth to %d\n", __func__, max_depth);
+       }
+       bam_mplp_set_maxcnt(iter, max_depth);
         while (bam_mplp_auto(iter, &tid, &pos, n_plp, plp) > 0) {
                 if (conf->reg && (pos < beg0 || pos >= end0)) continue; // out of the region requested
                 if (hash) {
@@ -714,9 +734,22 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
                         for (i = 0; i < gplp.n; ++i)
                                 bcf_call_glfgen(gplp.n_plp[i], gplp.plp[i], ref16, bca, bcr + i);
                         bcf_call_combine(gplp.n, bcr, ref16, &bc);
-                       bcf_call2bcf(tid, pos, &bc, b);
+                       bcf_call2bcf(tid, pos, &bc, b, (conf->flag&(MPLP_FMT_DP|MPLP_FMT_SP))? bcr : 0,
+                                                (conf->flag&MPLP_FMT_SP), 0, 0);
                         bcf_write(bp, bh, b);
                         bcf_destroy(b);
+                       // call indels
+                       if (!(conf->flag&MPLP_NO_INDEL) && bcf_call_gap_prep(gplp.n, gplp.n_plp, gplp.plp, pos, bca, ref, rghash) >= 0) {
+                               for (i = 0; i < gplp.n; ++i)
+                                       bcf_call_glfgen(gplp.n_plp[i], gplp.plp[i], -1, bca, bcr + i);
+                               if (bcf_call_combine(gplp.n, bcr, -1, &bc) >= 0) {
+                                       b = calloc(1, sizeof(bcf1_t));
+                                       bcf_call2bcf(tid, pos, &bc, b, (conf->flag&(MPLP_FMT_DP|MPLP_FMT_SP))? bcr : 0,
+                                                                (conf->flag&MPLP_FMT_SP), bca, ref);
+                                       bcf_write(bp, bh, b);
+                                       bcf_destroy(b);
+                               }
+                       }
                 } else {
                         printf("%s\t%d\t%c", h->target_name[tid], pos + 1, (ref && pos < ref_len)? ref[pos] : 'N');
                         for (i = 0; i < n; ++i) {
@@ -743,6 +776,7 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
         bam_smpl_destroy(sm); free(buf.s);
         for (i = 0; i < gplp.n; ++i) free(gplp.plp[i]);
         free(gplp.plp); free(gplp.n_plp); free(gplp.m_plp);
+       bcf_call_del_rghash(rghash);
         if (hash) { // free the hash table
                 khint_t k;
                 for (k = kh_begin(hash); k < kh_end(hash); ++k)
@@ -761,33 +795,96 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
         return 0;
  }
  
+#define MAX_PATH_LEN 1024
+int read_file_list(const char *file_list,int *n,char **argv[])
+{
+    char buf[MAX_PATH_LEN];
+    int len, nfiles;
+    char **files;
+
+    FILE *fh = fopen(file_list,"r");
+    if ( !fh )
+    {
+        fprintf(stderr,"%s: %s\n", file_list,strerror(errno));
+        return 1;
+    }
+
+    // Speed is not an issue here, determine the number of files by reading the file twice
+    nfiles = 0;
+    while ( fgets(buf,MAX_PATH_LEN,fh) ) nfiles++;
+
+    if ( fseek(fh, 0L, SEEK_SET) )
+    {
+        fprintf(stderr,"%s: %s\n", file_list,strerror(errno));
+        return 1;
+    }
+
+    files = calloc(nfiles,sizeof(char*));
+    nfiles = 0;
+    while ( fgets(buf,MAX_PATH_LEN,fh) ) 
+    {
+        len = strlen(buf);
+        while ( len>0 && isspace(buf[len-1]) ) len--;
+        if ( !len ) continue;
+
+        files[nfiles] = malloc(sizeof(char)*(len+1)); 
+        strncpy(files[nfiles],buf,len);
+        files[nfiles][len] = 0;
+        nfiles++;
+    }
+    fclose(fh);
+    if ( !nfiles )
+    {
+        fprintf(stderr,"No files read from %s\n", file_list);
+        return 1;
+    }
+    *argv = files;
+    *n    = nfiles;
+    return 0;
+}
+#undef MAX_PATH_LEN
+
  int bam_mpileup(int argc, char *argv[])
  {
         int c;
+    const char *file_list = NULL;
+    char **fn = NULL;
+    int nfiles = 0;
         mplp_conf_t mplp;
         memset(&mplp, 0, sizeof(mplp_conf_t));
         mplp.max_mq = 60;
         mplp.min_baseQ = 13;
         mplp.capQ_thres = 0;
+       mplp.max_depth = 250;
+       mplp.openQ = 40; mplp.extQ = 20; mplp.tandemQ = 100;
         mplp.flag = MPLP_NO_ORPHAN | MPLP_REALN;
-       while ((c = getopt(argc, argv, "gf:r:l:M:q:Q:uaORC:B")) >= 0) {
+       while ((c = getopt(argc, argv, "gf:r:l:M:q:Q:uaORC:BDSd:b:P:o:e:h:I")) >= 0) {
                 switch (c) {
                 case 'f':
                         mplp.fai = fai_load(optarg);
                         if (mplp.fai == 0) return 1;
                         break;
+               case 'd': mplp.max_depth = atoi(optarg); break;
                 case 'r': mplp.reg = strdup(optarg); break;
                 case 'l': mplp.fn_pos = strdup(optarg); break;
+               case 'P': mplp.pl_list = strdup(optarg); break;
                 case 'g': mplp.flag |= MPLP_GLF; break;
                 case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_GLF; break;
                 case 'a': mplp.flag |= MPLP_NO_ORPHAN | MPLP_REALN; break;
                 case 'B': mplp.flag &= ~MPLP_REALN & ~MPLP_NO_ORPHAN; break;
                 case 'O': mplp.flag |= MPLP_NO_ORPHAN; break;
                 case 'R': mplp.flag |= MPLP_REALN; break;
+               case 'D': mplp.flag |= MPLP_FMT_DP; break;
+               case 'S': mplp.flag |= MPLP_FMT_SP; break;
+               case 'I': mplp.flag |= MPLP_NO_INDEL; break;
                 case 'C': mplp.capQ_thres = atoi(optarg); break;
                 case 'M': mplp.max_mq = atoi(optarg); break;
                 case 'q': mplp.min_mq = atoi(optarg); break;
                 case 'Q': mplp.min_baseQ = atoi(optarg); break;
+        case 'b': file_list = optarg; break;
+               case 'o': mplp.openQ = atoi(optarg); break;
+               case 'e': mplp.extQ = atoi(optarg); break;
+               case 'h': mplp.tandemQ = atoi(optarg); break;
                 }
         }
         if (argc == 1) {
@@ -796,18 +893,35 @@ int bam_mpileup(int argc, char *argv[])
                 fprintf(stderr, "Options: -f FILE     reference sequence file [null]\n");
                 fprintf(stderr, "         -r STR      region in which pileup is generated [null]\n");
                 fprintf(stderr, "         -l FILE     list of positions (format: chr pos) [null]\n");
+               fprintf(stderr, "         -b FILE     list of input BAM files [null]\n");
                 fprintf(stderr, "         -M INT      cap mapping quality at INT [%d]\n", mplp.max_mq);
                 fprintf(stderr, "         -Q INT      min base quality [%d]\n", mplp.min_baseQ);
                 fprintf(stderr, "         -q INT      filter out alignment with MQ smaller than INT [%d]\n", mplp.min_mq);
+               fprintf(stderr, "         -d INT      max per-sample depth [%d]\n", mplp.max_depth);
+               fprintf(stderr, "         -P STR      comma separated list of platforms for indels [all]\n");
+               fprintf(stderr, "         -o INT      Phred-scaled gap open sequencing error probability [%d]\n", mplp.openQ);
+               fprintf(stderr, "         -e INT      Phred-scaled gap extension seq error probability [%d]\n", mplp.extQ);
+               fprintf(stderr, "         -h INT      coefficient for homopolyer errors [%d]\n", mplp.tandemQ);
                 fprintf(stderr, "         -g          generate BCF output\n");
                 fprintf(stderr, "         -u          do not compress BCF output\n");
                 fprintf(stderr, "         -B          disable BAQ computation\n");
+               fprintf(stderr, "         -D          output per-sample DP\n");
+               fprintf(stderr, "         -S          output per-sample SP (strand bias P-value, slow)\n");
+               fprintf(stderr, "         -I          do not perform indel calling\n");
                 fprintf(stderr, "\n");
                 fprintf(stderr, "Notes: Assuming diploid individuals.\n\n");
                 return 1;
         }
-       mpileup(&mplp, argc - optind, argv + optind);
-       free(mplp.reg);
+    if ( file_list )
+    {
+        if ( read_file_list(file_list,&nfiles,&fn) ) return 1;
+        mpileup(&mplp,nfiles,fn);
+        for (c=0; c<nfiles; c++) free(fn[c]);
+        free(fn);
+    }
+    else
+           mpileup(&mplp, argc - optind, argv + optind);
+       free(mplp.reg); free(mplp.pl_list);
         if (mplp.fai) fai_destroy(mplp.fai);
         return 0;
  }
diff --git a/bamtk.c b/bamtk.c

index 082767c8d6bf0a29be8f1b73101b8a7c43e29a08..88328d47c24fc6bfe6f602ffce721fb4ee64e56c 100644 (file)
--- a/bamtk.c
+++ b/bamtk.c
@@ -9,7 +9,7 @@
  #endif
  
  #ifndef PACKAGE_VERSION
-#define PACKAGE_VERSION "0.1.9 (r783)"
+#define PACKAGE_VERSION "0.1.10 (r829)"
  #endif
  
  int bam_taf2baf(int argc, char *argv[]);
diff --git a/bcftools/bcf.c b/bcftools/bcf.c

index 9044b93b0b514fd9566f037a705f8769f60ecc54..fd5c0c9ae24971f15168f488452afe2077258dd0 100644 (file)
--- a/bcftools/bcf.c
+++ b/bcftools/bcf.c
@@ -13,7 +13,9 @@ bcf_t *bcf_open(const char *fn, const char *mode)
         } else {
                 b->fp = strcmp(fn, "-")? bgzf_open(fn, mode) : bgzf_fdopen(fileno(stdin), mode);
         }
+#ifndef BCF_LITE
         b->fp->owned_file = 1;
+#endif
         return b;
  }
  
@@ -134,7 +136,9 @@ int bcf_sync(bcf1_t *b)
                         b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2;
                 } else if (b->gi[i].fmt == bcf_str2int("DP", 2) || b->gi[i].fmt == bcf_str2int("HQ", 2)) {
                         b->gi[i].len = 2;
-               } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("GT", 2)) {
+               } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("GT", 2)
+                                  || b->gi[i].fmt == bcf_str2int("SP", 2))
+               {
                         b->gi[i].len = 1;
                 } else if (b->gi[i].fmt == bcf_str2int("GL", 2)) {
                         b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2 * 4;
@@ -236,7 +240,7 @@ void bcf_fmt_core(const bcf_hdr_t *h, bcf1_t *b, kstring_t *s)
                                 }
                         } else if (b->gi[i].fmt == bcf_str2int("DP", 2)) {
                                 kputw(((uint16_t*)b->gi[i].data)[j], s);
-                       } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) {
+                       } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) {
                                 kputw(((uint8_t*)b->gi[i].data)[j], s);
                         } else if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
                                 int y = ((uint8_t*)b->gi[i].data)[j];
@@ -304,3 +308,13 @@ int bcf_cpy(bcf1_t *r, const bcf1_t *b)
                 memcpy(r->gi[i].data, b->gi[i].data, r->n_smpl * r->gi[i].len);
         return 0;
  }
+
+int bcf_is_indel(const bcf1_t *b)
+{
+       char *p;
+       if (strlen(b->ref) > 1) return 1;
+       for (p = b->alt; *p; ++p)
+               if (*p != ',' && p[1] != ',' && p[1] != '\0')
+                       return 1;
+       return 0;
+}
diff --git a/bcftools/bcf.h b/bcftools/bcf.h

index 36578956bbae06d4c216c121a2ddb3a94cad94d8..f87ac1e6fa3021964753f051a944fdc95a59549d 100644 (file)
--- a/bcftools/bcf.h
+++ b/bcftools/bcf.h
@@ -30,7 +30,19 @@
  
  #include <stdint.h>
  #include <zlib.h>
+
+#ifndef BCF_LITE
  #include "bgzf.h"
+typedef BGZF *bcfFile;
+#else
+typedef gzFile bcfFile;
+#define bgzf_open(fn, mode) gzopen(fn, mode)
+#define bgzf_fdopen(fd, mode) gzdopen(fd, mode)
+#define bgzf_close(fp) gzclose(fp)
+#define bgzf_read(fp, buf, len) gzread(fp, buf, len)
+#define bgzf_write(fp, buf, len)
+#define bgzf_flush(fp)
+#endif
  
  /*
    A member in the structs below is said to "primary" if its content
@@ -74,7 +86,7 @@ typedef struct {
  typedef struct {
         int is_vcf; // if the file in operation is a VCF
         void *v; // auxillary data structure for VCF
-       BGZF *fp; // file handler for BCF
+       bcfFile fp; // file handler for BCF
  } bcf_t;
  
  struct __bcf_idx_t;
@@ -128,6 +140,8 @@ extern "C" {
         int bcf_shrink_alt(bcf1_t *b, int n);
         // convert GL to PL
         int bcf_gl2pl(bcf1_t *b);
+       // if the site is an indel
+       int bcf_is_indel(const bcf1_t *b);
  
         // string hash table
         void *bcf_build_refhash(bcf_hdr_t *h);
diff --git a/bcftools/call1.c b/bcftools/call1.c

index 2b284527185c0e12747ca1dfb5387aa0bd615c36..b0b14fc5307817cad907ecc6f7126716e80599a5 100644 (file)
--- a/bcftools/call1.c
+++ b/bcftools/call1.c
@@ -25,11 +25,12 @@ KSTREAM_INIT(gzFile, gzread, 16384)
  #define VC_QCALL   1024
  #define VC_CALL_GT 2048
  #define VC_ADJLD   4096
+#define VC_NO_INDEL 8192
  
  typedef struct {
         int flag, prior_type, n1;
         char *fn_list, *prior_file;
-       double theta, pref;
+       double theta, pref, indel_frac;
  } viewconf_t;
  
  khash_t(set64) *bcf_load_pos(const char *fn, bcf_hdr_t *_h)
@@ -161,7 +162,8 @@ static int update_bcf1(int n_smpl, bcf1_t *b, const bcf_p1aux_t *pa, const bcf_p
         kputs(b->alt, &s); kputc('\0', &s); kputc('\0', &s);
         kputs(b->info, &s);
         if (b->info[0]) kputc(';', &s);
-       ksprintf(&s, "AF1=%.3lf;AFE=%.3lf", 1.-pr->f_em, 1.-pr->f_exp);
+//     ksprintf(&s, "AF1=%.4lg;AFE=%.4lg;CI95=%.4lg,%.4lg", 1.-pr->f_em, 1.-pr->f_exp, pr->cil, pr->cih);
+       ksprintf(&s, "AF1=%.4lg;CI95=%.4lg,%.4lg", 1.-pr->f_em, pr->cil, pr->cih);
         ksprintf(&s, ";DP4=%d,%d,%d,%d;MQ=%d", a.d[0], a.d[1], a.d[2], a.d[3], a.mq);
         if (a.is_tested) {
                 if (pr->pc[0] >= 0.) ksprintf(&s, ";PC4=%lg,%lg,%lg,%lg", pr->pc[0], pr->pc[1], pr->pc[2], pr->pc[3]);
@@ -199,6 +201,7 @@ double bcf_ld_freq(const bcf1_t *b0, const bcf1_t *b1, double f[4]);
  int bcfview(int argc, char *argv[])
  {
         extern int bcf_2qcall(bcf_hdr_t *h, bcf1_t *b);
+       extern void bcf_p1_indel_prior(bcf_p1aux_t *ma, double x);
         bcf_t *bp, *bout = 0;
         bcf1_t *b, *blast;
         int c;
@@ -212,8 +215,8 @@ int bcfview(int argc, char *argv[])
  
         tid = begin = end = -1;
         memset(&vc, 0, sizeof(viewconf_t));
-       vc.prior_type = vc.n1 = -1; vc.theta = 1e-3; vc.pref = 0.5;
-       while ((c = getopt(argc, argv, "N1:l:cHAGvbSuP:t:p:QgL")) >= 0) {
+       vc.prior_type = vc.n1 = -1; vc.theta = 1e-3; vc.pref = 0.5; vc.indel_frac = -1.;
+       while ((c = getopt(argc, argv, "N1:l:cHAGvbSuP:t:p:QgLi:I")) >= 0) {
                 switch (c) {
                 case '1': vc.n1 = atoi(optarg); break;
                 case 'l': vc.fn_list = strdup(optarg); break;
@@ -227,8 +230,10 @@ int bcfview(int argc, char *argv[])
                 case 'u': vc.flag |= VC_UNCOMP | VC_BCFOUT; break;
                 case 'H': vc.flag |= VC_HWE; break;
                 case 'g': vc.flag |= VC_CALL_GT | VC_CALL; break;
+               case 'I': vc.flag |= VC_NO_INDEL; break;
                 case 't': vc.theta = atof(optarg); break;
                 case 'p': vc.pref = atof(optarg); break;
+               case 'i': vc.indel_frac = atof(optarg); break;
                 case 'Q': vc.flag |= VC_QCALL; break;
                 case 'L': vc.flag |= VC_ADJLD; break;
                 case 'P':
@@ -254,9 +259,11 @@ int bcfview(int argc, char *argv[])
                 fprintf(stderr, "         -N        skip sites where REF is not A/C/G/T\n");
                 fprintf(stderr, "         -Q        output the QCALL likelihood format\n");
                 fprintf(stderr, "         -L        calculate LD for adjacent sites\n");
+               fprintf(stderr, "         -I        skip indels\n");
                 fprintf(stderr, "         -1 INT    number of group-1 samples [0]\n");
                 fprintf(stderr, "         -l FILE   list of sites to output [all sites]\n");
-               fprintf(stderr, "         -t FLOAT  scaled mutation rate [%.4lg]\n", vc.theta);
+               fprintf(stderr, "         -t FLOAT  scaled substitution mutation rate [%.4lg]\n", vc.theta);
+               fprintf(stderr, "         -i FLOAT  indel-to-substitution ratio [%.4lg]\n", vc.indel_frac);
                 fprintf(stderr, "         -p FLOAT  variant if P(ref|D)<FLOAT [%.3lg]\n", vc.pref);
                 fprintf(stderr, "         -P STR    type of prior: full, cond2, flat [full]\n");
                 fprintf(stderr, "\n");
@@ -286,6 +293,7 @@ int bcfview(int argc, char *argv[])
                         bcf_p1_set_n1(p1, vc.n1);
                         bcf_p1_init_subprior(p1, vc.prior_type, vc.theta);
                 }
+               if (vc.indel_frac > 0.) bcf_p1_indel_prior(p1, vc.indel_frac);
         }
         if (vc.fn_list) hash = bcf_load_pos(vc.fn_list, h);
         if (optind + 1 < argc && !(vc.flag&VC_VCFIN)) {
@@ -306,7 +314,9 @@ int bcfview(int argc, char *argv[])
                 }
         }
         while (vcf_read(bp, h, b) > 0) {
-               if (vc.flag & VC_ACGT_ONLY) {
+               int is_indel = bcf_is_indel(b);
+               if ((vc.flag & VC_NO_INDEL) && is_indel) continue;
+               if ((vc.flag & VC_ACGT_ONLY) && !is_indel) {
                         int x;
                         if (b->ref[0] == 0 || b->ref[1] != 0) continue;
                         x = toupper(b->ref[0]);
diff --git a/bcftools/prob1.c b/bcftools/prob1.c

index e3b0f7344d096f22c37048b2131219378bcb7ba5..176a0fc96d4e0b12f14172bc7c0d68297e01a50a 100644 (file)
--- a/bcftools/prob1.c
+++ b/bcftools/prob1.c
@@ -8,9 +8,9 @@
  #include "kseq.h"
  KSTREAM_INIT(gzFile, gzread, 16384)
  
-#define MC_AVG_ERR 0.007
  #define MC_MAX_EM_ITER 16
  #define MC_EM_EPS 1e-4
+#define MC_DEF_INDEL 0.15
  
  unsigned char seq_nt4_table[256] = {
         4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
@@ -32,9 +32,9 @@ unsigned char seq_nt4_table[256] = {
  };
  
  struct __bcf_p1aux_t {
-       int n, M, n1;
+       int n, M, n1, is_indel;
         double *q2p, *pdg; // pdg -> P(D|g)
-       double *phi;
+       double *phi, *phi_indel;
         double *z, *zswap; // aux for afs
         double *z1, *z2, *phi1, *phi2; // only calculated when n1 is set
         double t, t1, t2;
@@ -43,6 +43,14 @@ struct __bcf_p1aux_t {
         int PL_len;
  };
  
+void bcf_p1_indel_prior(bcf_p1aux_t *ma, double x)
+{
+       int i;
+       for (i = 0; i < ma->M; ++i)
+               ma->phi_indel[i] = ma->phi[i] * x;
+       ma->phi_indel[ma->M] = 1. - ma->phi[ma->M] * x;
+}
+
  static void init_prior(int type, double theta, int M, double *phi)
  {
         int i;
@@ -63,6 +71,7 @@ static void init_prior(int type, double theta, int M, double *phi)
  void bcf_p1_init_prior(bcf_p1aux_t *ma, int type, double theta)
  {
         init_prior(type, theta, ma->M, ma->phi);
+       bcf_p1_indel_prior(ma, MC_DEF_INDEL);
  }
  
  void bcf_p1_init_subprior(bcf_p1aux_t *ma, int type, double theta)
@@ -110,6 +119,7 @@ int bcf_p1_read_prior(bcf_p1aux_t *ma, const char *fn)
         fprintf(stderr, "[%s] heterozygosity=%lf, ", __func__, (double)sum);
         for (sum = 0., k = 1; k <= ma->M; ++k) sum += k * ma->phi[ma->M - k] / ma->M;
         fprintf(stderr, "theta=%lf\n", (double)sum);
+       bcf_p1_indel_prior(ma, MC_DEF_INDEL);
         return 0;
  }
  
@@ -123,6 +133,7 @@ bcf_p1aux_t *bcf_p1_init(int n)
         ma->q2p = calloc(256, sizeof(double));
         ma->pdg = calloc(3 * ma->n, sizeof(double));
         ma->phi = calloc(ma->M + 1, sizeof(double));
+       ma->phi_indel = calloc(ma->M + 1, sizeof(double));
         ma->phi1 = calloc(ma->M + 1, sizeof(double));
         ma->phi2 = calloc(ma->M + 1, sizeof(double));
         ma->z = calloc(2 * ma->n + 1, sizeof(double));
@@ -148,7 +159,7 @@ void bcf_p1_destroy(bcf_p1aux_t *ma)
  {
         if (ma) {
                 free(ma->q2p); free(ma->pdg);
-               free(ma->phi); free(ma->phi1); free(ma->phi2);
+               free(ma->phi); free(ma->phi_indel); free(ma->phi1); free(ma->phi2);
                 free(ma->z); free(ma->zswap); free(ma->z1); free(ma->z2);
                 free(ma->afs); free(ma->afs1);
                 free(ma);
@@ -303,12 +314,13 @@ static double mc_cal_afs(bcf_p1aux_t *ma)
  {
         int k;
         long double sum = 0.;
+       double *phi = ma->is_indel? ma->phi_indel : ma->phi;
         memset(ma->afs1, 0, sizeof(double) * (ma->M + 1));
         mc_cal_y(ma);
         for (k = 0, sum = 0.; k <= ma->M; ++k)
-               sum += (long double)ma->phi[k] * ma->z[k];
+               sum += (long double)phi[k] * ma->z[k];
         for (k = 0; k <= ma->M; ++k) {
-               ma->afs1[k] = ma->phi[k] * ma->z[k] / sum;
+               ma->afs1[k] = phi[k] * ma->z[k] / sum;
                 if (isnan(ma->afs1[k]) || isinf(ma->afs1[k])) return -1.;
         }
         for (k = 0, sum = 0.; k <= ma->M; ++k) {
@@ -348,6 +360,7 @@ int bcf_p1_cal(bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst)
  {
         int i, k;
         long double sum = 0.;
+       ma->is_indel = bcf_is_indel(b);
         // set PL and PL_len
         for (i = 0; i < b->n_gi; ++i) {
                 if (b->gi[i].fmt == bcf_str2int("PL", 2)) {
@@ -378,6 +391,19 @@ int bcf_p1_cal(bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst)
                         flast = rst->f_em;
                 }
         }
+       { // estimate equal-tail credible interval (95% level)
+               int l, h;
+               double p;
+               for (i = 0, p = 0.; i < ma->M; ++i)
+                       if (p + ma->afs1[i] > 0.025) break;
+                       else p += ma->afs1[i];
+               l = i;
+               for (i = ma->M-1, p = 0.; i >= 0; --i)
+                       if (p + ma->afs1[i] > 0.025) break;
+                       else p += ma->afs1[i];
+               h = i;
+               rst->cil = (double)(ma->M - h) / ma->M; rst->cih = (double)(ma->M - l) / ma->M;
+       }
         rst->g[0] = rst->g[1] = rst->g[2] = -1.;
         contrast(ma, rst->pc);
         return 0;
diff --git a/bcftools/prob1.h b/bcftools/prob1.h

index 7158fe29f89bee73dda2f2bdc9426af038a363c6..6d9d28e3cee6c7d0a66efbd5314912ac511cd384 100644 (file)
--- a/bcftools/prob1.h
+++ b/bcftools/prob1.h
@@ -9,6 +9,7 @@ typedef struct __bcf_p1aux_t bcf_p1aux_t;
  typedef struct {
         int rank0;
         double f_em, f_exp, f_flat, p_ref;
+       double cil, cih;
         double pc[4];
         double g[3];
  } bcf_p1rst_t;
diff --git a/bcftools/vcf.c b/bcftools/vcf.c

index ebca869ed999faaed7e782f3a3c410badc062c1b..9b661ff5b4fbc29dde6a53b5e71c027f131a81f4 100644 (file)
--- a/bcftools/vcf.c
+++ b/bcftools/vcf.c
@@ -156,7 +156,7 @@ int vcf_read(bcf_t *bp, bcf_hdr_t *h, bcf1_t *b)
                                 for (i = 0; i < b->n_gi; ++i) {
                                         if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
                                                 ((uint8_t*)b->gi[i].data)[k-9] = 1<<7;
-                                       } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) {
+                                       } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) {
                                                 ((uint8_t*)b->gi[i].data)[k-9] = 0;
                                         } else if (b->gi[i].fmt == bcf_str2int("DP", 2)) {
                                                 ((uint16_t*)b->gi[i].data)[k-9] = 0;
@@ -173,7 +173,7 @@ int vcf_read(bcf_t *bp, bcf_hdr_t *h, bcf1_t *b)
                         for (q = kstrtok(p, ":", &a2), i = 0; q && i < b->n_gi; q = kstrtok(0, 0, &a2), ++i) {
                                 if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
                                         ((uint8_t*)b->gi[i].data)[k-9] = (q[0] - '0')<<3 | (q[2] - '0') | (q[1] == '/'? 0 : 1) << 6;
-                               } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) {
+                               } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) {
                                         double _x = strtod(q, &q);
                                         int x = (int)(_x + .499);
                                         if (x > 255) x = 255;
diff --git a/bcftools/vcfutils.pl b/bcftools/vcfutils.pl

index d0b797143cf56fe0a67383ec4f610096eaecf2c0..bbc479bfffa47aa2b1089b5b6e56414b696be413 100755 (executable)
--- a/bcftools/vcfutils.pl
+++ b/bcftools/vcfutils.pl
@@ -10,11 +10,11 @@ use Getopt::Std;
  exit;
  
  sub main {
-  my $version = '0.1.0';
    &usage if (@ARGV < 1);
    my $command = shift(@ARGV);
    my %func = (subsam=>\&subsam, listsam=>\&listsam, fillac=>\&fillac, qstats=>\&qstats, varFilter=>\&varFilter,
-                         hapmap2vcf=>\&hapmap2vcf, ucscsnp2vcf=>\&ucscsnp2vcf, filter4vcf=>\&filter4vcf, ldstats=>\&ldstats);
+                         hapmap2vcf=>\&hapmap2vcf, ucscsnp2vcf=>\&ucscsnp2vcf, filter4vcf=>\&varFilter, ldstats=>\&ldstats,
+                         gapstats=>\&gapstats);
    die("Unknown command \"$command\".\n") if (!defined($func{$command}));
    &{$func{$command}};
  }
@@ -155,7 +155,7 @@ Note: This command discards indels. Output: QUAL #non-indel #SNPs #transitions #
         next if (length($t[3]) != 1 || uc($t[3]) eq 'N');
         $t[3] = uc($t[3]); $t[4] = uc($t[4]);
         my @s = split(',', $t[4]);
-       $t[5] = 3 if ($t[5] < 0);
+       $t[5] = 3 if ($t[5] eq '.' || $t[5] < 0);
         next if (length($s[0]) != 1);
         my $hit;
         if ($is_vcf) {
@@ -199,37 +199,38 @@ Note: This command discards indels. Output: QUAL #non-indel #SNPs #transitions #
  }
  
  sub varFilter {
-  my %opts = (d=>1, D=>10000, l=>30, Q=>25, q=>10, G=>25, s=>100, w=>10, W=>10, N=>2, p=>undef, F=>.001);
-  getopts('pq:d:D:l:Q:w:W:N:G:F:', \%opts);
+  my %opts = (d=>2, D=>10000, a=>2, W=>10, Q=>10, w=>10, p=>undef, 1=>1e-4, 2=>1e-100, 3=>0, 4=>1e-4);
+  getopts('pd:D:W:Q:w:a:1:2:3:4:', \%opts);
    die(qq/
  Usage:   vcfutils.pl varFilter [options] <in.vcf>
  
  Options: -Q INT    minimum RMS mapping quality for SNPs [$opts{Q}]
-         -q INT    minimum RMS mapping quality for gaps [$opts{q}]
           -d INT    minimum read depth [$opts{d}]
           -D INT    maximum read depth [$opts{D}]
-
-         -G INT    min indel score for nearby SNP filtering [$opts{G}]
+         -a INT    minimum number of alternate bases [$opts{a}]
           -w INT    SNP within INT bp around a gap to be filtered [$opts{w}]
-
-         -W INT    window size for filtering dense SNPs [$opts{W}]
-         -N INT    max number of SNPs in a window [$opts{N}]
-
-         -l INT    window size for filtering adjacent gaps [$opts{l}]
-
+         -W INT    window size for filtering adjacent gaps [$opts{W}]
+         -1 FLOAT  min P-value for strand bias (given PV4) [$opts{1}]
+         -2 FLOAT  min P-value for baseQ bias [$opts{2}]
+         -3 FLOAT  min P-value for mapQ bias [$opts{3}]
+         -4 FLOAT  min P-value for end distance bias [$opts{4}]
           -p        print filtered variants
+
+Note: Some of the filters rely on annotations generated by SAMtools\/BCFtools.
  \n/) if (@ARGV == 0 && -t STDIN);
  
    # calculate the window size
-  my ($ol, $ow, $oW) = ($opts{l}, $opts{w}, $opts{W});
+  my ($ol, $ow) = ($opts{W}, $opts{w});
    my $max_dist = $ol > $ow? $ol : $ow;
-  $max_dist = $oW if ($max_dist < $oW);
    # the core loop
-  my @staging; # (indel_filtering_score, flt_tag)
+  my @staging; # (indel_filtering_score, flt_tag, indel_span; chr, pos, ...)
    while (<>) {
         my @t = split;
-       next if (/^#/);
+    if (/^#/) {
+         print; next;
+       }
         next if ($t[4] eq '.'); # skip non-var sites
+       # check if the site is a SNP
         my $is_snp = 1;
         if (length($t[3]) > 1) {
           $is_snp = 0;
@@ -245,24 +246,18 @@ Options: -Q INT    minimum RMS mapping quality for SNPs [$opts{Q}]
           last if ($staging[0][3] eq $t[0] && $staging[0][4] + $staging[0][2] + $max_dist >= $t[1]);
           varFilter_aux(shift(@staging), $opts{p}); # calling a function is a bit slower, not much
         }
-       my ($flt, $score) = (0, -1);
-
-       # collect key annotations
-       my ($dp, $mq, $af) = (-1, -1, 1);
-       if ($t[7] =~ /DP=(\d+)/i) {
-         $dp = $1;
-       } elsif ($t[7] =~ /DP4=(\d+),(\d+),(\d+),(\d+)/i) {
+       my $flt = 0;
+       # parse annotations
+       my ($dp, $mq, $dp_alt) = (-1, -1, -1);
+       if ($t[7] =~ /DP4=(\d+),(\d+),(\d+),(\d+)/i) {
           $dp = $1 + $2 + $3 + $4;
+         $dp_alt = $3 + $4;
         }
-       if ($t[7] =~ /MQ=(\d+)/i) {
-         $mq = $1;
-       }
-       if ($t[7] =~ /AF=([^\s;=]+)/i) {
-         $af = $1;
-       } elsif ($t[7] =~ /AF1=([^\s;=]+)/i) {
-         $af = $1;
+       if ($t[7] =~ /DP=(\d+)/i) {
+         $dp = $1;
         }
-       # the depth filter
+       $mq = $1 if ($t[7] =~ /MQ=(\d+)/i);
+       # the depth and mapQ filter
         if ($dp >= 0) {
           if ($dp < $opts{d}) {
                 $flt = 2;
@@ -270,58 +265,40 @@ Options: -Q INT    minimum RMS mapping quality for SNPs [$opts{Q}]
                 $flt = 3;
           }
         }
+       $flt = 4 if ($dp_alt >= 0 && $dp_alt < $opts{a});
+       $flt = 1 if ($flt == 0 && $mq >= 0 && $mq < $opts{Q});
+       $flt = 7 if ($flt == 0 && /PV4=([^,]+),([^,]+),([^,]+),([^,;\t]+)/
+                                && ($1<$opts{1} || $2<$opts{2} || $3<$opts{3} || $4<$opts{4}));
  
         # site dependent filters
-       my $dlen = 0;
+       my ($rlen, $indel_score) = (0, -1); # $indel_score<0 for SNPs
         if ($flt == 0) {
           if (!$is_snp) { # an indel
-        # If deletion, remember the length of the deletion
-               $dlen = length($t[3]) - 1;
-               $flt = 1 if ($mq < $opts{q});
+               $rlen = length($t[3]) - 1;
+               $indel_score = $t[5] * 100 + $dp_alt;
                 # filtering SNPs
-               if ($t[5] >= $opts{G}) {
-                 for my $x (@staging) {
-            # Is it a SNP and is it outside the SNP filter window?
-                       next if ($x->[0] >= 0 || $x->[4] + $x->[2] + $ow < $t[1]);
-                       $x->[1] = 5 if ($x->[1] == 0);
-                 }
+               for my $x (@staging) {
+                 next if ($x->[0] >= 0 || $x->[1] || $x->[4] + $x->[2] + $ow < $t[1]);
+                 $x->[1] = 5;
                 }
-               # the indel filtering score
-               $score = $t[5];
                 # check the staging list for indel filtering
                 for my $x (@staging) {
-          # Is it a SNP and is it outside the gap filter window
-                 next if ($x->[0] < 0 || $x->[4] + $x->[2] + $ol < $t[1]);
-                 if ($x->[0] < $score) {
+                 next if ($x->[0] < 0 || $x->[1] || $x->[4] + $x->[2] + $ol < $t[1]);
+                 if ($x->[0] < $indel_score) {
                         $x->[1] = 6;
                   } else {
                         $flt = 6; last;
                   }
                 }
           } else { # a SNP
-               $flt = 1 if ($mq < $opts{Q});
-               # check adjacent SNPs
-               my $k = 1;
                 for my $x (@staging) {
-                 ++$k if ($x->[0] < 0 && -($x->[0] + 1) > $opts{F} && $x->[4] + $x->[2] + $oW >= $t[1] && ($x->[1] == 0 || $x->[1] == 4 || $x->[1] == 5));
-               }
-               # filtering is necessary
-               if ($k > $opts{N}) {
-                 $flt = 4;
-                 for my $x (@staging) {
-                        $x->[1] = 4 if ($x->[0] < 0 && $x->[4] + $x->[2] + $oW >= $t[1] && $x->[1] == 0);
-                 }
-               } else { # then check gap filter
-                 for my $x (@staging) {
-                       next if ($x->[0] < 0 || $x->[4] + $x->[2] + $ow < $t[1]);
-                       if ($x->[0] >= $opts{G}) {
-                         $flt = 5; last;
-                       }
-                 }
+                 next if ($x->[0] < 0 || $x->[1] || $x->[4] + $x->[2] + $ow < $t[1]);
+                 $flt = 5;
+                 last;
                 }
           }
         }
-       push(@staging, [$score < 0? -$af-1 : $score, $flt, $dlen, @t]);
+       push(@staging, [$indel_score, $flt, $rlen, @t]);
    }
    # output the last few elements in the staging list
    while (@staging) {
@@ -334,47 +311,35 @@ sub varFilter_aux {
    if ($first->[1] == 0) {
         print join("\t", @$first[3 .. @$first-1]), "\n";
    } elsif ($is_print) {
-       print STDERR join("\t", substr("UQdDWGgsiX", $first->[1], 1), @$first[3 .. @$first-1]), "\n";
+       print STDERR join("\t", substr("UQdDaGgP", $first->[1], 1), @$first[3 .. @$first-1]), "\n";
    }
  }
  
-sub filter4vcf {
-  my %opts = (d=>3, D=>2000, 1=>1e-4, 2=>1e-100, 3=>0, 4=>1e-4, Q=>10, q=>3);
-  getopts('d:D:1:2:3:4:Q:q:', \%opts);
-  die(qq/
-Usage:   vcfutils.pl filter4vcf [options] <in.vcf>
-
-Options: -d INT     min total depth (given DP or DP4) [$opts{d}]
-         -D INT     max total depth [$opts{D}]
-         -q INT     min SNP quality [$opts{q}]
-         -Q INT     min RMS mapQ (given MQ) [$opts{Q}]
-         -1 FLOAT   min P-value for strand bias (given PV4) [$opts{1}]
-         -2 FLOAT   min P-value for baseQ bias [$opts{2}]
-         -3 FLOAT   min P-value for mapQ bias [$opts{3}]
-         -4 FLOAT   min P-value for end distance bias [$opts{4}]\n
-/) if (@ARGV == 0 && -t STDIN);
-
-  my %ts = (AG=>1, GA=>1, CT=>1, TC=>1);
-
-  my @n = (0, 0);
+sub gapstats {
+  my (@c0, @c1);
+  $c0[$_] = $c1[$_] = 0 for (0 .. 10000);
    while (<>) {
-       if (/^#/) {
-         print;
-         next;
-       }
-       next if (/PV4=([^,]+),([^,]+),([^,]+),([^,;\t]+)/ && ($1<$opts{1} || $2<$opts{2} || $3<$opts{3} || $4<$opts{4}));
-       my $depth = -1;
-       $depth = $1 if (/DP=(\d+)/);
-       $depth = $1+$2+$3+$4 if (/DP4=(\d+),(\d+),(\d+),(\d+)/);
-       next if ($depth > 0 && ($depth < $opts{d} || $depth > $opts{D}));
-       next if (/MQ=(\d+)/ && $1 < $opts{Q});
+       next if (/^#/);
         my @t = split;
-       next if ($t[5] >= 0 && $t[5] < $opts{q});
-       ++$n[0];
+       next if (length($t[3]) == 1 && $t[4] =~ /^[A-Za-z](,[A-Za-z])*$/); # not an indel
         my @s = split(',', $t[4]);
-       ++$n[1] if ($ts{$t[3].$s[0]});
-       print;
+       for my $x (@s) {
+         my $l = length($x) - length($t[3]) + 5000;
+         if ($x =~ /^-/) {
+               $l = -(length($x) - 1) + 5000;
+         } elsif ($x =~ /^\+/) {
+               $l = length($x) - 1 + 5000;
+         }
+         $c0[$l] += 1 / @s;
+       }
+  }
+  for (my $i = 0; $i < 10000; ++$i) {
+       next if ($c0[$i] == 0);
+       $c1[0] += $c0[$i];
+       $c1[1] += $c0[$i] if (($i-5000)%3 == 0);
+       printf("C\t%d\t%.2f\n", ($i-5000), $c0[$i]);
    }
+  printf("3\t%d\t%d\t%.3f\n", $c1[0], $c1[1], $c1[1]/$c1[0]);
  }
  
  sub ucscsnp2vcf {
@@ -470,7 +435,6 @@ Command: subsam       get a subset of samples
           fillac       fill the allele count field
           qstats       SNP stats stratified by QUAL
           varFilter    filtering short variants
-         filter4vcf   filtering VCFs produced by samtools+bcftools
           hapmap2vcf   convert the hapmap format to VCF
           ucscsnp2vcf  convert UCSC SNP SQL dump to VCF
  \n/);
diff --git a/kaln.c b/kaln.c

index 680779113ce7c2220434727cf578845d1e840d40..9c0bbaa437ff5654d7a181b496bb9d58c9c30558 100644 (file)
--- a/kaln.c
+++ b/kaln.c
@@ -73,9 +73,19 @@ int aln_sm_blast[] = {
         -2, -2, -2, -2, -2
  };
  
+int aln_sm_qual[] = {
+         0, -23, -23, -23, 0,
+       -23,   0, -23, -23, 0,
+       -23, -23,   0, -23, 0,
+       -23, -23, -23,   0, 0,
+         0,   0,   0,   0, 0
+};
+
  ka_param_t ka_param_blast = {  5,  2,   5, 2, aln_sm_blast, 5, 50 };
  ka_param_t ka_param_aa2aa = { 10,  2,  10, 2, aln_sm_blosum62, 22, 50 };
  
+ka_param2_t ka_param2_qual  = { 37, 11, 37, 11, 37, 11, 0, 0, aln_sm_qual, 5, 50 };
+
  static uint32_t *ka_path2cigar32(const path_t *path, int path_len, int *n_cigar)
  {
         int i, n;
@@ -176,7 +186,7 @@ static uint32_t *ka_path2cigar32(const path_t *path, int path_len, int *n_cigar)
  }
  
  typedef struct {
-       uint8_t Mt:3, It:2, Dt:2;
+       uint8_t Mt:3, It:2, Dt:3;
  } dpcell_t;
  
  typedef struct {
@@ -208,7 +218,7 @@ uint32_t *ka_global_core(uint8_t *seq1, int len1, uint8_t *seq2, int len2, const
         score_matrix = ap->matrix;
         N_MATRIX_ROW = ap->row;
  
-       *n_cigar = 0;
+       if (n_cigar) *n_cigar = 0;
         if (len1 == 0 || len2 == 0) return 0;
  
         /* calculate b1 and b2 */
@@ -371,211 +381,106 @@ uint32_t *ka_global_core(uint8_t *seq1, int len1, uint8_t *seq2, int len2, const
         return cigar;
  }
  
-/*****************************************
- * Probabilistic banded glocal alignment *
- *****************************************/
-
-static float g_qual2prob[256];
-
-#define EI .25
-#define EM .3333333333333
-#define set_u(u, b, i, k) { int x=(i)-(b); x=x>0?x:0; (u)=((k)-x+1)*3; }
-
-ka_probpar_t ka_probpar_def = { 0.001, 0.1, 10 };
-
-/*
-  The topology of the profile HMM:
-
-           /\             /\        /\             /\
-           I[1]           I[k-1]    I[k]           I[L]
-            ^   \      \    ^    \   ^   \      \   ^
-            |    \      \   |     \  |    \      \  |
-    M[0]   M[1] -> ... -> M[k-1] -> M[k] -> ... -> M[L]   M[L+1]
-                \      \/        \/      \/      /
-                 \     /\        /\      /\     /
-                       -> D[k-1] -> D[k] ->
-
-   M[0] points to every {M,I}[k] and every {M,I}[k] points M[L+1].
-
-   On input, _ref is the reference sequence and _query is the query
-   sequence. Both are sequences of 0/1/2/3/4 where 4 stands for an
-   ambiguous residue. iqual is the base quality. c sets the gap open
-   probability, gap extension probability and band width.
-
-   On output, state and q are arrays of length l_query. The higher 30
-   bits give the reference position the query base is matched to and the
-   lower two bits can be 0 (an alignment match) or 1 (an
-   insertion). q[i] gives the phred scaled posterior probability of
-   state[i] being wrong.
- */
-int ka_prob_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual,
-                                  const ka_probpar_t *c, int *state, uint8_t *q)
+typedef struct {
+       int M, I, D;
+} score_aux_t;
+
+#define MINUS_INF -0x40000000
+
+// matrix: len2 rows and len1 columns
+int ka_global_score(const uint8_t *_seq1, int len1, const uint8_t *_seq2, int len2, const ka_param2_t *ap)
  {
-       double **f, **b, *s, m[9], sI, sM, bI, bM, pb;
-       float *qual, *_qual;
-       const uint8_t *ref, *query;
-       int bw, bw2, i, k, is_diff = 0;
-
-       /*** initialization ***/
-       ref = _ref - 1; query = _query - 1; // change to 1-based coordinate
-       bw = l_ref > l_query? l_ref : l_query;
-       if (bw > c->bw) bw = c->bw;
-       if (bw < abs(l_ref - l_query)) bw = abs(l_ref - l_query);
-       bw2 = bw * 2 + 1;
-       // allocate the forward and backward matrices f[][] and b[][] and the scaling array s[]
-       f = calloc(l_query+1, sizeof(void*));
-       b = calloc(l_query+1, sizeof(void*));
-       for (i = 0; i <= l_query; ++i) {
-               f[i] = calloc(bw2 * 3 + 6, sizeof(double)); // FIXME: this is over-allocated for very short seqs
-               b[i] = calloc(bw2 * 3 + 6, sizeof(double));
-       }
-       s = calloc(l_query+2, sizeof(double)); // s[] is the scaling factor to avoid underflow
-       // initialize qual
-       _qual = calloc(l_query, sizeof(float));
-       if (g_qual2prob[0] == 0)
-               for (i = 0; i < 256; ++i)
-                       g_qual2prob[i] = pow(10, -i/10.);
-       for (i = 0; i < l_query; ++i) _qual[i] = g_qual2prob[iqual? iqual[i] : 30];
-       qual = _qual - 1;
-       // initialize transition probability
-       sM = sI = 1. / (2 * l_query + 2); // the value here seems not to affect results; FIXME: need proof
-       m[0*3+0] = (1 - c->d - c->d) * (1 - sM); m[0*3+1] = m[0*3+2] = c->d * (1 - sM);
-       m[1*3+0] = (1 - c->e) * (1 - sI); m[1*3+1] = c->e * (1 - sI); m[1*3+2] = 0.;
-       m[2*3+0] = 1 - c->e; m[2*3+1] = 0.; m[2*3+2] = c->e;
-       bM = (1 - c->d) / l_query; bI = c->d / l_query; // (bM+bI)*l_query==1
-       /*** forward ***/
-       // f[0]
-       set_u(k, bw, 0, 0);
-       f[0][k] = s[0] = 1.;
-       { // f[1]
-               double *fi = f[1], sum;
-               int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1, _beg, _end;
-               for (k = beg, sum = 0.; k <= end; ++k) {
-                       int u;
-                       double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM;
-                       set_u(u, bw, 1, k);
-                       fi[u+0] = e * bM; fi[u+1] = EI * bI;
-                       sum += fi[u] + fi[u+1];
-               }
-               // rescale
-               s[1] = sum;
-               set_u(_beg, bw, 1, beg); set_u(_end, bw, 1, end); _end += 2;
-               for (k = _beg; k <= _end; ++k) fi[k] /= sum;
-       }
-       // f[2..l_query]
-       for (i = 2; i <= l_query; ++i) {
-               double *fi = f[i], *fi1 = f[i-1], sum, qli = qual[i];
-               int beg = 1, end = l_ref, x, _beg, _end;
-               uint8_t qyi = query[i];
-               x = i - bw; beg = beg > x? beg : x; // band start
-               x = i + bw; end = end < x? end : x; // band end
-               for (k = beg, sum = 0.; k <= end; ++k) {
-                       int u, v11, v01, v10;
-                       double e;
-                       e = (ref[k] > 3 || qyi > 3)? 1. : ref[k] == qyi? 1. - qli : qli * EM;
-                       set_u(u, bw, i, k); set_u(v11, bw, i-1, k-1); set_u(v10, bw, i-1, k); set_u(v01, bw, i, k-1);
-                       fi[u+0] = e * (m[0] * fi1[v11+0] + m[3] * fi1[v11+1] + m[6] * fi1[v11+2]);
-                       fi[u+1] = EI * (m[1] * fi1[v10+0] + m[4] * fi1[v10+1]);
-                       fi[u+2] = m[2] * fi[v01+0] + m[8] * fi[v01+2];
-                       sum += fi[u] + fi[u+1] + fi[u+2];
-//                     fprintf(stderr, "F (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, fi[u], fi[u+1], fi[u+2]); // DEBUG
-               }
-               // rescale
-               s[i] = sum;
-               set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2;
-               for (k = _beg, sum = 1./sum; k <= _end; ++k) fi[k] *= sum;
-       }
-       { // f[l_query+1]
-               double sum;
-               for (k = 1, sum = 0.; k <= l_ref; ++k) {
-                       int u;
-                       set_u(u, bw, l_query, k);
-                       if (u < 3 || u >= bw2*3+3) continue;
-                   sum += f[l_query][u+0] * sM + f[l_query][u+1] * sI;
-               }
-               s[l_query+1] = sum; // the last scaling factor
+       
+#define __score_aux(_p, _q0, _sc, _io, _ie, _do, _de) {                                        \
+               int t1, t2;                                                                                                             \
+               score_aux_t *_q;                                                                                                \
+               _q = _q0;                                                                                                               \
+               _p->M = _q->M >= _q->I? _q->M : _q->I;                                                  \
+               _p->M = _p->M >= _q->D? _p->M : _q->D;                                                  \
+               _p->M += (_sc);                                                                                                 \
+               ++_q;      t1 = _q->M - _io - _ie; t2 = _q->I - _ie; _p->I = t1 >= t2? t1 : t2; \
+               _q = _p-1; t1 = _q->M - _do - _de; t2 = _q->D - _de; _p->D = t1 >= t2? t1 : t2; \
         }
-       /*** backward ***/
-       // b[l_query] (b[l_query+1][0]=1 and thus \tilde{b}[][]=1/s[l_query+1]; this is where s[l_query+1] comes from)
-       for (k = 1; k <= l_ref; ++k) {
-               int u;
-               double *bi = b[l_query];
-               set_u(u, bw, l_query, k);
-               if (u < 3 || u >= bw2*3+3) continue;
-               bi[u+0] = sM / s[l_query] / s[l_query+1]; bi[u+1] = sI / s[l_query] / s[l_query+1];
+
+       int i, j, bw, scmat_size = ap->row, *scmat = ap->matrix, ret;
+       const uint8_t *seq1, *seq2;
+       score_aux_t *curr, *last, *swap;
+       bw = abs(len1 - len2) + ap->band_width;
+       i = len1 > len2? len1 : len2;
+       if (bw > i + 1) bw = i + 1;
+       seq1 = _seq1 - 1; seq2 = _seq2 - 1;
+       curr = calloc(len1 + 2, sizeof(score_aux_t));
+       last = calloc(len1 + 2, sizeof(score_aux_t));
+       { // the zero-th row
+               int x, end = len1;
+               score_aux_t *p;
+               j = 0;
+               x = j + bw; end = len1 < x? len1 : x; // band end
+               p = curr;
+               p->M = 0; p->I = p->D = MINUS_INF;
+               for (i = 1, p = &curr[1]; i <= end; ++i, ++p)
+                       p->M = p->I = MINUS_INF, p->D = -(ap->edo + ap->ede * i);
+               p->M = p->I = p->D = MINUS_INF;
+               swap = curr; curr = last; last = swap;
         }
-       // b[l_query-1..1]
-       for (i = l_query - 1; i >= 1; --i) {
-               int beg = 1, end = l_ref, x, _beg, _end;
-               double *bi = b[i], *bi1 = b[i+1], y = (i > 1), qli1 = qual[i+1];
-               uint8_t qyi1 = query[i+1];
-               x = i - bw; beg = beg > x? beg : x;
-               x = i + bw; end = end < x? end : x;
-               for (k = end; k >= beg; --k) {
-                       int u, v11, v01, v10;
-                       double e;
-                       set_u(u, bw, i, k); set_u(v11, bw, i+1, k+1); set_u(v10, bw, i+1, k); set_u(v01, bw, i, k+1);
-                       e = (k >= l_ref? 0 : (ref[k+1] > 3 || qyi1 > 3)? 1. : ref[k+1] == qyi1? 1. - qli1 : qli1 * EM) * bi1[v11];
-                       bi[u+0] = e * m[0] + EI * m[1] * bi1[v10+1] + m[2] * bi[v01+2]; // bi1[v11] has been foled into e.
-                       bi[u+1] = e * m[3] + EI * m[4] * bi1[v10+1];
-                       bi[u+2] = (e * m[6] + m[8] * bi[v01+2]) * y;
-//                     fprintf(stderr, "B (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, bi[u], bi[u+1], bi[u+2]); // DEBUG
+       for (j = 1; j < len2; ++j) {
+               int x, beg = 0, end = len1, *scrow, col_end;
+               score_aux_t *p;
+               x = j - bw; beg =    0 > x?    0 : x; // band start
+               x = j + bw; end = len1 < x? len1 : x; // band end
+               if (beg == 0) { // from zero-th column
+                       p = curr;
+                       p->M = p->D = MINUS_INF; p->I = -(ap->eio + ap->eie * j);
+                       ++beg; // then beg = 1
                 }
-               // rescale
-               set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2;
-               for (k = _beg, y = 1./s[i]; k <= _end; ++k) bi[k] *= y;
-       }
-       { // b[0]
-               int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1;
-               double sum = 0.;
-               for (k = end; k >= beg; --k) {
-                       int u;
-                       double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM;
-                       set_u(u, bw, 1, k);
-                       if (u < 3 || u >= bw2*3+3) continue;
-                   sum += e * b[1][u+0] * bM + EI * b[1][u+1] * bI;
+               scrow = scmat + seq2[j] * scmat_size;
+               if (end == len1) col_end = 1, --end;
+               else col_end = 0;
+               for (i = beg, p = &curr[beg]; i <= end; ++i, ++p)
+                       __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->iio, ap->iie, ap->ido, ap->ide);
+               if (col_end) {
+                       __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->eio, ap->eie, ap->ido, ap->ide);
+                       ++p;
                 }
-               set_u(k, bw, 0, 0);
-               pb = b[0][k] = sum / s[0]; // if everything works as is expected, pb == 1.0
+               p->M = p->I = p->D = MINUS_INF;
+//             for (i = 0; i <= len1; ++i) printf("(%d,%d,%d) ", curr[i].M, curr[i].I, curr[i].D); putchar('\n');
+               swap = curr; curr = last; last = swap;
         }
-       is_diff = fabs(pb - 1.) > 1e-7? 1 : 0;
-       /*** MAP ***/
-       for (i = 1; i <= l_query; ++i) {
-               double sum = 0., *fi = f[i], *bi = b[i], max = 0.;
-               int beg = 1, end = l_ref, x, max_k = -1;
-               x = i - bw; beg = beg > x? beg : x;
-               x = i + bw; end = end < x? end : x;
-               for (k = beg; k <= end; ++k) {
-                       int u;
-                       double z;
-                       set_u(u, bw, i, k);
-                       z = fi[u+0] * bi[u+0]; if (z > max) max = z, max_k = (k-1)<<2 | 0; sum += z;
-                       z = fi[u+1] * bi[u+1]; if (z > max) max = z, max_k = (k-1)<<2 | 1; sum += z;
+       { // the last row
+               int x, beg = 0, *scrow;
+               score_aux_t *p;
+               j = len2;
+               x = j - bw; beg = 0 > x?    0 : x; // band start
+               if (beg == 0) { // from zero-th column
+                       p = curr;
+                       p->M = p->D = MINUS_INF; p->I = -(ap->eio + ap->eie * j);
+                       ++beg; // then beg = 1
                 }
-               max /= sum; sum *= s[i]; // if everything works as is expected, sum == 1.0
-               if (state) state[i-1] = max_k;
-               if (q) k = (int)(-4.343 * log(1. - max) + .499), q[i-1] = k > 100? 99 : k;
-#ifdef _MAIN
-               fprintf(stderr, "(%.10lg,%.10lg) (%d,%d:%d)~%lg\n", pb, sum, i-1, max_k>>2, max_k&3, max); // DEBUG
-#endif
-       }
-       /*** free ***/
-       for (i = 0; i <= l_query; ++i) {
-               free(f[i]); free(b[i]);
+               scrow = scmat + seq2[j] * scmat_size;
+               for (i = beg, p = &curr[beg]; i < len1; ++i, ++p)
+                       __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->iio, ap->iie, ap->edo, ap->ede);
+               __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->eio, ap->eie, ap->edo, ap->ede);
+//             for (i = 0; i <= len1; ++i) printf("(%d,%d,%d) ", curr[i].M, curr[i].I, curr[i].D); putchar('\n');
         }
-       free(f); free(b); free(s); free(_qual);
-       return 0;
+       ret = curr[len1].M >= curr[len1].I? curr[len1].M : curr[len1].I;
+       ret = ret >= curr[len1].D? ret : curr[len1].D;
+       free(curr); free(last);
+       return ret;
  }
  
  #ifdef _MAIN
-int main()
+int main(int argc, char *argv[])
  {
-       int l_ref = 5, l_query = 4;
-       uint8_t *ref = (uint8_t*)"\0\1\3\3\1";
-       uint8_t *query = (uint8_t*)"\0\3\3\1";
-//     uint8_t *query = (uint8_t*)"\1\3\3\1";
-       static uint8_t qual[4] = {20, 20, 20, 20};
-       ka_prob_glocal(ref, l_ref, query, l_query, qual, &ka_probpar_def, 0, 0);
+//     int len1 = 35, len2 = 35;
+//     uint8_t *seq1 = (uint8_t*)"\0\0\3\3\2\0\0\0\1\0\2\1\2\1\3\2\3\3\3\0\2\3\2\1\1\3\3\3\2\3\3\1\0\0\1";
+//     uint8_t *seq2 = (uint8_t*)"\0\0\3\3\2\0\0\0\1\0\2\1\2\1\3\2\3\3\3\0\2\3\2\1\1\3\3\3\2\3\3\1\0\1\0";
+       int len1 = 4, len2 = 4;
+       uint8_t *seq1 = (uint8_t*)"\1\0\0\1";
+       uint8_t *seq2 = (uint8_t*)"\1\0\1\0";
+       int sc;
+//     ka_global_core(seq1, 2, seq2, 1, &ka_param_qual, &sc, 0);
+       sc = ka_global_score(seq1, len1, seq2, len2, &ka_param2_qual);
+       printf("%d\n", sc);
         return 0;
  }
  #endif
diff --git a/kaln.h b/kaln.h

index b24597cfc2ee159529c5746916597591ecababe2..1ece132755e51212ab1aee8e56b7b316484e6d37 100644 (file)
--- a/kaln.h
+++ b/kaln.h
@@ -42,9 +42,12 @@ typedef struct {
  } ka_param_t;
  
  typedef struct {
-       float d, e;
-       int bw;
-} ka_probpar_t;
+       int iio, iie, ido, ide;
+       int eio, eie, edo, ede;
+       int *matrix;
+       int row;
+       int band_width;
+} ka_param2_t;
  
  #ifdef __cplusplus
  extern "C" {
@@ -52,14 +55,13 @@ extern "C" {
  
         uint32_t *ka_global_core(uint8_t *seq1, int len1, uint8_t *seq2, int len2, const ka_param_t *ap,
                                                          int *_score, int *n_cigar);
-       int ka_prob_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual,
-                                          const ka_probpar_t *c, int *state, uint8_t *q);
-
+       int ka_global_score(const uint8_t *_seq1, int len1, const uint8_t *_seq2, int len2, const ka_param2_t *ap);
  #ifdef __cplusplus
  }
  #endif
  
  extern ka_param_t ka_param_blast; /* = { 5, 2, 5, 2, aln_sm_blast, 5, 50 }; */
-extern ka_probpar_t ka_probpar_def; /* { 0.0001, 0.1, 10 } */
+extern ka_param_t ka_param_qual; // only use this for global alignment!!!
+extern ka_param2_t ka_param2_qual; // only use this for global alignment!!!
  
  #endif
diff --git a/kprobaln.c b/kprobaln.c

new file mode 100644 (file)

index 0000000..5201c1a
--- /dev/null
+++ b/kprobaln.c
@@ -0,0 +1,278 @@
+/* The MIT License
+
+   Copyright (c) 2003-2006, 2008-2010, by Heng Li <lh3lh3@live.co.uk>
+
+   Permission is hereby granted, free of charge, to any person obtaining
+   a copy of this software and associated documentation files (the
+   "Software"), to deal in the Software without restriction, including
+   without limitation the rights to use, copy, modify, merge, publish,
+   distribute, sublicense, and/or sell copies of the Software, and to
+   permit persons to whom the Software is furnished to do so, subject to
+   the following conditions:
+
+   The above copyright notice and this permission notice shall be
+   included in all copies or substantial portions of the Software.
+
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+   SOFTWARE.
+*/
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <math.h>
+#include "kprobaln.h"
+
+/*****************************************
+ * Probabilistic banded glocal alignment *
+ *****************************************/
+
+#define EI .25
+#define EM .33333333333
+
+static float g_qual2prob[256];
+
+#define set_u(u, b, i, k) { int x=(i)-(b); x=x>0?x:0; (u)=((k)-x+1)*3; }
+
+kpa_par_t kpa_par_def = { 0.001, 0.1, 10 };
+kpa_par_t kpa_par_alt = { 0.0001, 0.01, 10 };
+
+/*
+  The topology of the profile HMM:
+
+           /\             /\        /\             /\
+           I[1]           I[k-1]    I[k]           I[L]
+            ^   \      \    ^    \   ^   \      \   ^
+            |    \      \   |     \  |    \      \  |
+    M[0]   M[1] -> ... -> M[k-1] -> M[k] -> ... -> M[L]   M[L+1]
+                \      \/        \/      \/      /
+                 \     /\        /\      /\     /
+                       -> D[k-1] -> D[k] ->
+
+   M[0] points to every {M,I}[k] and every {M,I}[k] points M[L+1].
+
+   On input, _ref is the reference sequence and _query is the query
+   sequence. Both are sequences of 0/1/2/3/4 where 4 stands for an
+   ambiguous residue. iqual is the base quality. c sets the gap open
+   probability, gap extension probability and band width.
+
+   On output, state and q are arrays of length l_query. The higher 30
+   bits give the reference position the query base is matched to and the
+   lower two bits can be 0 (an alignment match) or 1 (an
+   insertion). q[i] gives the phred scaled posterior probability of
+   state[i] being wrong.
+ */
+int kpa_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual,
+                          const kpa_par_t *c, int *state, uint8_t *q)
+{
+       double **f, **b = 0, *s, m[9], sI, sM, bI, bM, pb;
+       float *qual, *_qual;
+       const uint8_t *ref, *query;
+       int bw, bw2, i, k, is_diff = 0, is_backward = 1, Pr;
+
+       /*** initialization ***/
+       is_backward = state && q? 1 : 0;
+       ref = _ref - 1; query = _query - 1; // change to 1-based coordinate
+       bw = l_ref > l_query? l_ref : l_query;
+       if (bw > c->bw) bw = c->bw;
+       if (bw < abs(l_ref - l_query)) bw = abs(l_ref - l_query);
+       bw2 = bw * 2 + 1;
+       // allocate the forward and backward matrices f[][] and b[][] and the scaling array s[]
+       f = calloc(l_query+1, sizeof(void*));
+       if (is_backward) b = calloc(l_query+1, sizeof(void*));
+       for (i = 0; i <= l_query; ++i) {
+               f[i] = calloc(bw2 * 3 + 6, sizeof(double)); // FIXME: this is over-allocated for very short seqs
+               if (is_backward) b[i] = calloc(bw2 * 3 + 6, sizeof(double));
+       }
+       s = calloc(l_query+2, sizeof(double)); // s[] is the scaling factor to avoid underflow
+       // initialize qual
+       _qual = calloc(l_query, sizeof(float));
+       if (g_qual2prob[0] == 0)
+               for (i = 0; i < 256; ++i)
+                       g_qual2prob[i] = pow(10, -i/10.);
+       for (i = 0; i < l_query; ++i) _qual[i] = g_qual2prob[iqual? iqual[i] : 30];
+       qual = _qual - 1;
+       // initialize transition probability
+       sM = sI = 1. / (2 * l_query + 2); // the value here seems not to affect results; FIXME: need proof
+       m[0*3+0] = (1 - c->d - c->d) * (1 - sM); m[0*3+1] = m[0*3+2] = c->d * (1 - sM);
+       m[1*3+0] = (1 - c->e) * (1 - sI); m[1*3+1] = c->e * (1 - sI); m[1*3+2] = 0.;
+       m[2*3+0] = 1 - c->e; m[2*3+1] = 0.; m[2*3+2] = c->e;
+       bM = (1 - c->d) / l_ref; bI = c->d / l_ref; // (bM+bI)*l_ref==1
+       /*** forward ***/
+       // f[0]
+       set_u(k, bw, 0, 0);
+       f[0][k] = s[0] = 1.;
+       { // f[1]
+               double *fi = f[1], sum;
+               int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1, _beg, _end;
+               for (k = beg, sum = 0.; k <= end; ++k) {
+                       int u;
+                       double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM;
+                       set_u(u, bw, 1, k);
+                       fi[u+0] = e * bM; fi[u+1] = EI * bI;
+                       sum += fi[u] + fi[u+1];
+               }
+               // rescale
+               s[1] = sum;
+               set_u(_beg, bw, 1, beg); set_u(_end, bw, 1, end); _end += 2;
+               for (k = _beg; k <= _end; ++k) fi[k] /= sum;
+       }
+       // f[2..l_query]
+       for (i = 2; i <= l_query; ++i) {
+               double *fi = f[i], *fi1 = f[i-1], sum, qli = qual[i];
+               int beg = 1, end = l_ref, x, _beg, _end;
+               uint8_t qyi = query[i];
+               x = i - bw; beg = beg > x? beg : x; // band start
+               x = i + bw; end = end < x? end : x; // band end
+               for (k = beg, sum = 0.; k <= end; ++k) {
+                       int u, v11, v01, v10;
+                       double e;
+                       e = (ref[k] > 3 || qyi > 3)? 1. : ref[k] == qyi? 1. - qli : qli * EM;
+                       set_u(u, bw, i, k); set_u(v11, bw, i-1, k-1); set_u(v10, bw, i-1, k); set_u(v01, bw, i, k-1);
+                       fi[u+0] = e * (m[0] * fi1[v11+0] + m[3] * fi1[v11+1] + m[6] * fi1[v11+2]);
+                       fi[u+1] = EI * (m[1] * fi1[v10+0] + m[4] * fi1[v10+1]);
+                       fi[u+2] = m[2] * fi[v01+0] + m[8] * fi[v01+2];
+                       sum += fi[u] + fi[u+1] + fi[u+2];
+//                     fprintf(stderr, "F (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, fi[u], fi[u+1], fi[u+2]); // DEBUG
+               }
+               // rescale
+               s[i] = sum;
+               set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2;
+               for (k = _beg, sum = 1./sum; k <= _end; ++k) fi[k] *= sum;
+       }
+       { // f[l_query+1]
+               double sum;
+               for (k = 1, sum = 0.; k <= l_ref; ++k) {
+                       int u;
+                       set_u(u, bw, l_query, k);
+                       if (u < 3 || u >= bw2*3+3) continue;
+                   sum += f[l_query][u+0] * sM + f[l_query][u+1] * sI;
+               }
+               s[l_query+1] = sum; // the last scaling factor
+       }
+       { // compute likelihood
+               double p = 1., Pr1 = 0.;
+               for (i = 0; i <= l_query + 1; ++i) {
+                       p *= s[i];
+                       if (p < 1e-100) Pr += -4.343 * log(p), p = 1.;
+               }
+               Pr1 += -4.343 * log(p * l_ref * l_query);
+               Pr = (int)(Pr1 + .499);
+               if (!is_backward) { // skip backward and MAP
+                       for (i = 0; i <= l_query; ++i) free(f[i]);
+                       free(f); free(s); free(_qual);
+                       return Pr;
+               }
+       }
+       /*** backward ***/
+       // b[l_query] (b[l_query+1][0]=1 and thus \tilde{b}[][]=1/s[l_query+1]; this is where s[l_query+1] comes from)
+       for (k = 1; k <= l_ref; ++k) {
+               int u;
+               double *bi = b[l_query];
+               set_u(u, bw, l_query, k);
+               if (u < 3 || u >= bw2*3+3) continue;
+               bi[u+0] = sM / s[l_query] / s[l_query+1]; bi[u+1] = sI / s[l_query] / s[l_query+1];
+       }
+       // b[l_query-1..1]
+       for (i = l_query - 1; i >= 1; --i) {
+               int beg = 1, end = l_ref, x, _beg, _end;
+               double *bi = b[i], *bi1 = b[i+1], y = (i > 1), qli1 = qual[i+1];
+               uint8_t qyi1 = query[i+1];
+               x = i - bw; beg = beg > x? beg : x;
+               x = i + bw; end = end < x? end : x;
+               for (k = end; k >= beg; --k) {
+                       int u, v11, v01, v10;
+                       double e;
+                       set_u(u, bw, i, k); set_u(v11, bw, i+1, k+1); set_u(v10, bw, i+1, k); set_u(v01, bw, i, k+1);
+                       e = (k >= l_ref? 0 : (ref[k+1] > 3 || qyi1 > 3)? 1. : ref[k+1] == qyi1? 1. - qli1 : qli1 * EM) * bi1[v11];
+                       bi[u+0] = e * m[0] + EI * m[1] * bi1[v10+1] + m[2] * bi[v01+2]; // bi1[v11] has been foled into e.
+                       bi[u+1] = e * m[3] + EI * m[4] * bi1[v10+1];
+                       bi[u+2] = (e * m[6] + m[8] * bi[v01+2]) * y;
+//                     fprintf(stderr, "B (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, bi[u], bi[u+1], bi[u+2]); // DEBUG
+               }
+               // rescale
+               set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2;
+               for (k = _beg, y = 1./s[i]; k <= _end; ++k) bi[k] *= y;
+       }
+       { // b[0]
+               int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1;
+               double sum = 0.;
+               for (k = end; k >= beg; --k) {
+                       int u;
+                       double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM;
+                       set_u(u, bw, 1, k);
+                       if (u < 3 || u >= bw2*3+3) continue;
+                   sum += e * b[1][u+0] * bM + EI * b[1][u+1] * bI;
+               }
+               set_u(k, bw, 0, 0);
+               pb = b[0][k] = sum / s[0]; // if everything works as is expected, pb == 1.0
+       }
+       is_diff = fabs(pb - 1.) > 1e-7? 1 : 0;
+       /*** MAP ***/
+       for (i = 1; i <= l_query; ++i) {
+               double sum = 0., *fi = f[i], *bi = b[i], max = 0.;
+               int beg = 1, end = l_ref, x, max_k = -1;
+               x = i - bw; beg = beg > x? beg : x;
+               x = i + bw; end = end < x? end : x;
+               for (k = beg; k <= end; ++k) {
+                       int u;
+                       double z;
+                       set_u(u, bw, i, k);
+                       z = fi[u+0] * bi[u+0]; if (z > max) max = z, max_k = (k-1)<<2 | 0; sum += z;
+                       z = fi[u+1] * bi[u+1]; if (z > max) max = z, max_k = (k-1)<<2 | 1; sum += z;
+               }
+               max /= sum; sum *= s[i]; // if everything works as is expected, sum == 1.0
+               if (state) state[i-1] = max_k;
+               if (q) k = (int)(-4.343 * log(1. - max) + .499), q[i-1] = k > 100? 99 : k;
+#ifdef _MAIN
+               fprintf(stderr, "(%.10lg,%.10lg) (%d,%d:%c,%c:%d) %lg\n", pb, sum, i-1, max_k>>2,
+                               "ACGT"[query[i]], "ACGT"[ref[(max_k>>2)+1]], max_k&3, max); // DEBUG
+#endif
+       }
+       /*** free ***/
+       for (i = 0; i <= l_query; ++i) {
+               free(f[i]); free(b[i]);
+       }
+       free(f); free(b); free(s); free(_qual);
+       return Pr;
+}
+
+#ifdef _MAIN
+#include <unistd.h>
+int main(int argc, char *argv[])
+{
+       uint8_t conv[256], *iqual, *ref, *query;
+       int c, l_ref, l_query, i, q = 30, b = 10, P;
+       while ((c = getopt(argc, argv, "b:q:")) >= 0) {
+               switch (c) {
+               case 'b': b = atoi(optarg); break;
+               case 'q': q = atoi(optarg); break;
+               }
+       }
+       if (optind + 2 > argc) {
+               fprintf(stderr, "Usage: %s [-q %d] [-b %d] <ref> <query>\n", argv[0], q, b); // example: acttc attc
+               return 1;
+       }
+       memset(conv, 4, 256);
+       conv['a'] = conv['A'] = 0; conv['c'] = conv['C'] = 1;
+       conv['g'] = conv['G'] = 2; conv['t'] = conv['T'] = 3;
+       ref = (uint8_t*)argv[optind]; query = (uint8_t*)argv[optind+1];
+       l_ref = strlen((char*)ref); l_query = strlen((char*)query);
+       for (i = 0; i < l_ref; ++i) ref[i] = conv[ref[i]];
+       for (i = 0; i < l_query; ++i) query[i] = conv[query[i]];
+       iqual = malloc(l_query);
+       memset(iqual, q, l_query);
+       kpa_par_def.bw = b;
+       P = kpa_glocal(ref, l_ref, query, l_query, iqual, &kpa_par_alt, 0, 0);
+       fprintf(stderr, "%d\n", P);
+       free(iqual);
+       return 0;
+}
+#endif
diff --git a/kprobaln.h b/kprobaln.h

new file mode 100644 (file)

index 0000000..0357dcc
--- /dev/null
+++ b/kprobaln.h
@@ -0,0 +1,49 @@
+/* The MIT License
+
+   Copyright (c) 2003-2006, 2008, 2009 by Heng Li <lh3@live.co.uk>
+
+   Permission is hereby granted, free of charge, to any person obtaining
+   a copy of this software and associated documentation files (the
+   "Software"), to deal in the Software without restriction, including
+   without limitation the rights to use, copy, modify, merge, publish,
+   distribute, sublicense, and/or sell copies of the Software, and to
+   permit persons to whom the Software is furnished to do so, subject to
+   the following conditions:
+
+   The above copyright notice and this permission notice shall be
+   included in all copies or substantial portions of the Software.
+
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+   SOFTWARE.
+*/
+
+#ifndef LH3_KPROBALN_H_
+#define LH3_KPROBALN_H_
+
+#include <stdint.h>
+
+typedef struct {
+       float d, e;
+       int bw;
+} kpa_par_t;
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+       int kpa_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual,
+                                  const kpa_par_t *c, int *state, uint8_t *q);
+
+#ifdef __cplusplus
+}
+#endif
+
+extern kpa_par_t kpa_par_def, kpa_par_alt;
+
+#endif
diff --git a/misc/samtools.pl b/misc/samtools.pl

index ce8449de622884847003ba538a2cf60c57118c99..d03c1c7f148dcca835946e7ccd9ee52f3b6dc00b 100755 (executable)
--- a/misc/samtools.pl
+++ b/misc/samtools.pl
@@ -10,7 +10,7 @@ my $version = '0.3.3';
  &usage if (@ARGV < 1);
  
  my $command = shift(@ARGV);
-my %func = (showALEN=>\&showALEN, pileup2fq=>\&pileup2fq, varFilter=>\&varFilter,
+my %func = (showALEN=>\&showALEN, pileup2fq=>\&pileup2fq, varFilter=>\&varFilter, plp2vcf=>\&plp2vcf,
                         unique=>\&unique, uniqcmp=>\&uniqcmp, sra2hdr=>\&sra2hdr, sam2fq=>\&sam2fq);
  
  die("Unknown command \"$command\".\n") if (!defined($func{$command}));
@@ -473,6 +473,44 @@ sub uniqcmp_aux {
    close($fh);
  }
  
+sub plp2vcf {
+  while (<>) {
+       my @t = split;
+       next if ($t[3] eq '*/*');
+       if ($t[2] eq '*') { # indel
+         my @s = split("/", $t[3]);
+         my (@a, @b);
+         my ($ref, $alt);
+         for (@s) {
+               next if ($_ eq '*');
+               if (/^-/) {
+                 push(@a, 'N'.substr($_, 1));
+                 push(@b, 'N');
+               } elsif (/^\+/) {
+                 push(@a, 'N');
+                 push(@b, 'N'.substr($_, 1));
+               }
+         }
+         if ($a[0] && $a[1]) {
+               if (length($a[0]) < length($a[1])) {
+                 $ref = $a[1];
+                 $alt = ($b[0] . ('N' x (length($a[1]) - length($a[0])))) . ",$b[1]";
+               } elsif (length($a[0]) > length($a[1])) {
+                 $ref = $a[0];
+                 $alt = ($b[1] . ('N' x (length($a[0]) - length($a[1])))) . ",$b[0]";
+               } else {
+                 $ref = $a[0];
+                 $alt = ($b[0] eq $b[1])? $b[0] : "$b[0],$b[1]";
+               }
+         } else {
+               $ref = $a[0]; $alt = $b[0];
+         }
+         print join("\t", @t[0,1], '.', $ref, $alt, $t[5], '.', '.'), "\n";
+       } else { # SNP
+       }
+  }
+}
+
  #
  # Usage
  #
diff --git a/sam_view.c b/sam_view.c

index d0fdad25445508e90052970a90de92e1f0f61ae5..eb69449f9e44d643ef43f31638fd8534c454c940 100644 (file)
--- a/sam_view.c
+++ b/sam_view.c
@@ -9,6 +9,13 @@
  #include "khash.h"
  KHASH_SET_INIT_STR(rg)
  
+// When counting records instead of printing them,
+// data passed to the bam_fetch callback is encapsulated in this struct.
+typedef struct {
+       bam_header_t *header;
+       int *count;
+} count_func_data_t;
+
  typedef khash_t(rg) *rghash_t;
  
  rghash_t g_rghash = 0;
@@ -54,7 +61,7 @@ static inline int __g_skip_aln(const bam_header_t *h, const bam1_t *b)
         return 0;
  }
  
-// callback function for bam_fetch()
+// callback function for bam_fetch() that prints nonskipped records
  static int view_func(const bam1_t *b, void *data)
  {
         if (!__g_skip_aln(((samfile_t*)data)->header, b))
@@ -62,19 +69,30 @@ static int view_func(const bam1_t *b, void *data)
         return 0;
  }
  
+// callback function for bam_fetch() that counts nonskipped records
+static int count_func(const bam1_t *b, void *data)
+{
+       if (!__g_skip_aln(((count_func_data_t*)data)->header, b)) {
+               (*((count_func_data_t*)data)->count)++;
+       }
+       return 0;
+}
+
  static int usage(int is_long_help);
  
  int main_samview(int argc, char *argv[])
  {
-       int c, is_header = 0, is_header_only = 0, is_bamin = 1, ret = 0, is_uncompressed = 0, is_bamout = 0, slx2sngr = 0;
+       int c, is_header = 0, is_header_only = 0, is_bamin = 1, ret = 0, is_uncompressed = 0, is_bamout = 0, slx2sngr = 0, is_count = 0;
         int of_type = BAM_OFDEC, is_long_help = 0;
+       int count = 0;
         samfile_t *in = 0, *out = 0;
         char in_mode[5], out_mode[5], *fn_out = 0, *fn_list = 0, *fn_ref = 0, *fn_rg = 0;
  
         /* parse command-line options */
         strcpy(in_mode, "r"); strcpy(out_mode, "w");
-       while ((c = getopt(argc, argv, "Sbt:hHo:q:f:F:ul:r:xX?T:CR:")) >= 0) {
+       while ((c = getopt(argc, argv, "Sbct:hHo:q:f:F:ul:r:xX?T:CR:")) >= 0) {
                 switch (c) {
+               case 'c': is_count = 1; break;
                 case 'C': slx2sngr = 1; break;
                 case 'S': is_bamin = 0; break;
                 case 'b': is_bamout = 1; break;
@@ -133,7 +151,7 @@ int main_samview(int argc, char *argv[])
                 ret = 1;
                 goto view_end;
         }
-       if ((out = samopen(fn_out? fn_out : "-", out_mode, in->header)) == 0) {
+       if (!is_count && (out = samopen(fn_out? fn_out : "-", out_mode, in->header)) == 0) {
                 fprintf(stderr, "[main_samview] fail to open \"%s\" for writing.\n", fn_out? fn_out : "standard output");
                 ret = 1;
                 goto view_end;
@@ -146,7 +164,8 @@ int main_samview(int argc, char *argv[])
                 while ((r = samread(in, b)) >= 0) { // read one alignment from `in'
                         if (!__g_skip_aln(in->header, b)) {
                                 if (slx2sngr) sol2sanger(b);
-                               samwrite(out, b); // write the alignment to `out'
+                               if (!is_count) samwrite(out, b); // write the alignment to `out'
+                               count++;
                         }
                 }
                 if (r < -1) {
@@ -164,14 +183,19 @@ int main_samview(int argc, char *argv[])
                         goto view_end;
                 }
                 for (i = optind + 1; i < argc; ++i) {
-                       int tid, beg, end;
+                       int tid, beg, end, result;
                         bam_parse_region(in->header, argv[i], &tid, &beg, &end); // parse a region in the format like `chr2:100-200'
                         if (tid < 0) { // reference name is not found
                                 fprintf(stderr, "[main_samview] region \"%s\" specifies an unknown reference name. Continue anyway.\n", argv[i]);
                                 continue;
                         }
                         // fetch alignments
-                       if (bam_fetch(in->x.bam, idx, tid, beg, end, out, view_func) < 0) {
+                       if (is_count) {
+                               count_func_data_t count_data = { in->header, &count };
+                               result = bam_fetch(in->x.bam, idx, tid, beg, end, &count_data, count_func);
+                       } else
+                               result = bam_fetch(in->x.bam, idx, tid, beg, end, out, view_func);
+                       if (result < 0) {
                                 fprintf(stderr, "[main_samview] retrieval of region \"%s\" failed due to truncated file or corrupt BAM index file\n", argv[i]);
                                 ret = 1;
                                 break;
@@ -181,6 +205,9 @@ int main_samview(int argc, char *argv[])
         }
  
  view_end:
+       if (is_count && ret == 0) {
+               printf("%d\n", count);
+       }
         // close files, free and return
         free(fn_list); free(fn_ref); free(fn_out); free(g_library); free(g_rg); free(fn_rg);
         if (g_rghash) {
@@ -190,7 +217,8 @@ view_end:
                 kh_destroy(rg, g_rghash);
         }
         samclose(in);
-       samclose(out);
+       if (!is_count)
+               samclose(out);
         return ret;
  }
  
@@ -205,6 +233,7 @@ static int usage(int is_long_help)
         fprintf(stderr, "         -u       uncompressed BAM output (force -b)\n");
         fprintf(stderr, "         -x       output FLAG in HEX (samtools-C specific)\n");
         fprintf(stderr, "         -X       output FLAG in string (samtools-C specific)\n");
+       fprintf(stderr, "         -c       print only the count of matching records\n");
         fprintf(stderr, "         -t FILE  list of reference names and lengths (force -S) [null]\n");
         fprintf(stderr, "         -T FILE  reference sequence file (force -S) [null]\n");
         fprintf(stderr, "         -o FILE  output file name [stdout]\n");
diff --git a/samtools.1 b/samtools.1

index cc421d4bc34fd53efa123b4630fcbccb4ca58708..e87bef7e788f09d1c0c38219f231062868a010d4 100644 (file)
--- a/samtools.1
+++ b/samtools.1
@@ -1,4 +1,4 @@
-.TH samtools 1 "27 October 2010" "samtools-0.1.9" "Bioinformatics tools"
+.TH samtools 1 "15 November 2010" "samtools-0.1.10" "Bioinformatics tools"
  .SH NAME
  .PP
  samtools - Utilities for the Sequence Alignment/Map (SAM) format
@@ -20,7 +20,7 @@ samtools faidx ref.fasta
  .PP
  samtools pileup -vcf ref.fasta aln.sorted.bam
  .PP
-samtools mpileup -C50 -agf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
+samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
  .PP
  samtools tview aln.sorted.bam ref.fasta
  
@@ -47,7 +47,7 @@ entire alignment file unless it is asked to do so.
  
  .TP 10
  .B view
-samtools view [-bhuHS] [-t in.refList] [-o output] [-f reqFlag] [-F
+samtools view [-bchuHS] [-t in.refList] [-o output] [-f reqFlag] [-F
  skipFlag] [-q minMapQ] [-l library] [-r readGroup] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
  
  Extract/print all or sub alignments in SAM or BAM format. If no region
@@ -65,11 +65,11 @@ format: `chr2' (the whole chr2), `chr2:1000000' (region starting from
  .B -b
  Output in the BAM format.
  .TP
-.BI -f " INT"
+.BI -f \ INT
  Only output alignments with all bits in INT present in the FLAG
  field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
  .TP
-.BI -F " INT"
+.BI -F \ INT
  Skip alignments with bits present in INT [0]
  .TP
  .B -h
@@ -78,19 +78,19 @@ Include the header in the output.
  .B -H
  Output the header only.
  .TP
-.BI -l " STR"
+.BI -l \ STR
  Only output reads in library STR [null]
  .TP
-.BI -o " FILE"
+.BI -o \ FILE
  Output file [stdout]
  .TP
-.BI -q " INT"
+.BI -q \ INT
  Skip alignments with MAPQ smaller than INT [0]
  .TP
-.BI -r " STR"
+.BI -r \ STR
  Only output reads in read group STR [null]
  .TP
-.BI -R " FILE"
+.BI -R \ FILE
  Output reads in read groups listed in
  .I FILE
  [null]
@@ -100,7 +100,16 @@ Input is in SAM. If @SQ header lines are absent, the
  .B `-t'
  option is required.
  .TP
-.BI -t " FILE"
+.B -c
+Instead of printing the alignments, only count them and print the
+total number. All filter options, such as
+.B `-f',
+.B `-F'
+and
+.B `-q'
+, are taken into account.
+.TP
+.BI -t \ FILE
  This file is TAB-delimited. Each line must contain the reference name
  and the length of the reference, one line for each distinct reference;
  additional fields are ignored. This file also defines the order of the
@@ -126,135 +135,6 @@ press `?' for help and press `g' to check the alignment start from a
  region in the format like `chr10:10,000,000' or `=10,000,000' when
  viewing the same reference sequence.
  
-.TP
-.B pileup
-samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] [-l
-in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N nHap] [-r
-pairDiffRate] [-m mask] [-d maxIndelDepth] [-G indelPrior]
-<in.bam>|<in.sam>
-
-Print the alignment in the pileup format. In the pileup format, each
-line represents a genomic position, consisting of chromosome name,
-coordinate, reference base, read bases, read qualities and alignment
-mapping qualities. Information on match, mismatch, indel, strand,
-mapping quality and start and end of a read are all encoded at the read
-base column. At this column, a dot stands for a match to the reference
-base on the forward strand, a comma for a match on the reverse strand,
-a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward
-strand and `acgtn' for a mismatch on the reverse strand. A pattern
-`\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this
-reference position and the next reference position. The length of the
-insertion is given by the integer in the pattern, followed by the
-inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'
-represents a deletion from the reference. The deleted bases will be
-presented as `*' in the following lines. Also at the read base column, a
-symbol `^' marks the start of a read. The ASCII of the character
-following `^' minus 33 gives the mapping quality. A symbol `$' marks the
-end of a read segment.
-
-If option
-.B -c
-is applied, the consensus base, Phred-scaled consensus quality, SNP
-quality (i.e. the Phred-scaled probability of the consensus being
-identical to the reference) and root mean square (RMS) mapping quality
-of the reads covering the site will be inserted between the `reference
-base' and the `read bases' columns. An indel occupies an additional
-line. Each indel line consists of chromosome name, coordinate, a star,
-the genotype, consensus quality, SNP quality, RMS mapping quality, #
-covering reads, the first alllele, the second allele, # reads supporting
-the first allele, # reads supporting the second allele and # reads
-containing indels different from the top two alleles.
-
-The position of indels is offset by -1.
-
-.B OPTIONS:
-.RS
-.TP 10
-.B -B
-Disable the BAQ computation. See the
-.B mpileup
-command for details.
-.TP
-.B -c
-Call the consensus sequence using SOAPsnp consensus model. Options
-.BR -T ", " -N ", " -I " and " -r
-are only effective when
-.BR -c " or " -g
-is in use.
-.TP
-.BI -C " INT"
-Coefficient for downgrading the mapping quality of poorly mapped
-reads. See the
-.B mpileup
-command for details. [0]
-.TP
-.BI -d " INT"
-Use the first
-.I NUM
-reads in the pileup for indel calling for speed up. Zero for unlimited. [1024]
-.TP
-.BI -f " FILE"
-The reference sequence in the FASTA format. Index file
-.I FILE.fai
-will be created if
-absent.
-.TP
-.B -g
-Generate genotype likelihood in the binary GLFv3 format. This option
-suppresses -c, -i and -s. This option is deprecated by the
-.B mpileup
-command.
-.TP
-.B -i
-Only output pileup lines containing indels.
-.TP
-.BI -I " INT"
-Phred probability of an indel in sequencing/prep. [40]
-.TP
-.BI -l " FILE"
-List of sites at which pileup is output. This file is space
-delimited. The first two columns are required to be chromosome and
-1-based coordinate. Additional columns are ignored. It is
-recommended to use option
-.TP
-.BI -m " INT"
-Filter reads with flag containing bits in
-.I INT
-[1796]
-.TP
-.BI -M " INT"
-Cap mapping quality at INT [60]
-.TP
-.BI -N " INT"
-Number of haplotypes in the sample (>=2) [2]
-.TP
-.BI -r " FLOAT"
-Expected fraction of differences between a pair of haplotypes [0.001]
-.TP
-.B -s
-Print the mapping quality as the last column. This option makes the
-output easier to parse, although this format is not space efficient.
-.TP
-.B -S
-The input file is in SAM.
-.TP
-.BI -t " FILE"
-List of reference names ane sequence lengths, in the format described
-for the
-.B import
-command. If this option is present, samtools assumes the input
-.I <in.alignment>
-is in SAM format; otherwise it assumes in BAM format.
-.B -s
-together with
-.B -l
-as in the default format we may not know the mapping quality.
-.TP
-.BI -T " FLOAT"
-The theta parameter (error dependency coefficient) in the maq consensus
-calling model [0.85]
-.RE
-
  .TP
  .B mpileup
  samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
@@ -272,37 +152,64 @@ quality (BAQ). BAQ is the Phred-scaled probability of a read base being
  misaligned. Applying this option greatly helps to reduce false SNPs
  caused by misalignments.
  .TP
-.BI -C " INT"
+.BI -C \ INT
  Coefficient for downgrading mapping quality for reads containing
  excessive mismatches. Given a read with a phred-scaled probability q of
  being generated from the mapped position, the new mapping quality is
  about sqrt((INT-q)/INT)*INT. A zero value disables this
-functionality; if enabled, the recommended value is 50. [0]
+functionality; if enabled, the recommended value for BWA is 50. [0]
+.TP
+.BI -e \ INT
+Phred-scaled gap extension sequencing error probability. Reducing
+.I INT
+leads to longer indels. [20]
  .TP
-.BI -f " FILE"
+.BI -f \ FILE
  The reference file [null]
  .TP
  .B -g
  Compute genotype likelihoods and output them in the binary call format (BCF).
  .TP
-.B -u
-Similar to
-.B -g
-except that the output is uncompressed BCF, which is preferred for pipeing.
-.TP
-.BI -l " FILE"
+.BI -h \ INT
+Coefficient for modeling homopolymer errors. Given an
+.IR l -long
+homopolymer
+run, the sequencing error of an indel of size
+.I s
+is modeled as
+.IR INT * s / l .
+[100]
+.TP
+.BI -l \ FILE
  File containing a list of sites where pileup or BCF is outputted [null]
  .TP
-.BI -q " INT"
+.BI -o \ INT
+Phred-scaled gap open sequencing error probability. Reducing
+.I INT
+leads to more indel calls. [40]
+.TP
+.BI -P \ STR
+Comma dilimited list of platforms (determined by
+.BR @RG-PL )
+from which indel candidates are obtained. It is recommended to collect
+indel candidates from sequencing technologies that have low indel error
+rate such as ILLUMINA. [all]
+.TP
+.BI -q \ INT
  Minimum mapping quality for an alignment to be used [0]
  .TP
-.BI -Q " INT"
+.BI -Q \ INT
  Minimum base quality for a base to be considered [13]
  .TP
-.BI -r " STR"
+.BI -r \ STR
  Only generate pileup in region
  .I STR
  [all sites]
+.TP
+.B -u
+Similar to
+.B -g
+except that the output is uncompressed BCF, which is preferred for piping.
  .RE
  
  .TP
@@ -336,7 +243,7 @@ Output the final alignment to the standard output.
  .B -n
  Sort by read names rather than by chromosomal coordinates
  .TP
-.B -m INT
+.BI -m \ INT
  Approximately the maximum required memory. [500000000]
  .RE
  
@@ -359,7 +266,7 @@ and the headers of other files will be ignored.
  .B OPTIONS:
  .RS
  .TP 8
-.BI -h " FILE"
+.BI -h \ FILE
  Use the lines of
  .I FILE
  as `@' headers to be copied to
@@ -370,7 +277,7 @@ replacing any header lines that would otherwise be copied from
  is actually in SAM format, though any alignment records it may contain
  are ignored.)
  .TP
-.BI -R " STR"
+.BI -R \ STR
  Merge files in the specified region indicated by
  .I STR
  .TP
@@ -457,6 +364,11 @@ tag. Output SAM by default.
  .B OPTIONS:
  .RS
  .TP 8
+.B -A
+When used jointly with
+.B -r
+this option overwrites the original base quality.
+.TP 8
  .B -e
  Convert a the read base to = if it is identical to the aligned reference
  base. Indel caller does not support the = bases at the moment.
@@ -470,14 +382,143 @@ Output compressed BAM
  .B -S
  The input is SAM with header lines
  .TP
-.BI -C " INT"
+.BI -C \ INT
  Coefficient to cap mapping quality of poorly mapped reads. See the
  .B pileup
  command for details. [0]
  .TP
  .B -r
-Perform probabilistic realignment to compute BAQ, which will be used to
-cap base quality.
+Compute the BQ tag without changing the base quality.
+.RE
+
+.TP
+.B pileup
+samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] [-l
+in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N nHap] [-r
+pairDiffRate] [-m mask] [-d maxIndelDepth] [-G indelPrior]
+<in.bam>|<in.sam>
+
+Print the alignment in the pileup format. In the pileup format, each
+line represents a genomic position, consisting of chromosome name,
+coordinate, reference base, read bases, read qualities and alignment
+mapping qualities. Information on match, mismatch, indel, strand,
+mapping quality and start and end of a read are all encoded at the read
+base column. At this column, a dot stands for a match to the reference
+base on the forward strand, a comma for a match on the reverse strand,
+a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward
+strand and `acgtn' for a mismatch on the reverse strand. A pattern
+`\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this
+reference position and the next reference position. The length of the
+insertion is given by the integer in the pattern, followed by the
+inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'
+represents a deletion from the reference. The deleted bases will be
+presented as `*' in the following lines. Also at the read base column, a
+symbol `^' marks the start of a read. The ASCII of the character
+following `^' minus 33 gives the mapping quality. A symbol `$' marks the
+end of a read segment.
+
+If option
+.B -c
+is applied, the consensus base, Phred-scaled consensus quality, SNP
+quality (i.e. the Phred-scaled probability of the consensus being
+identical to the reference) and root mean square (RMS) mapping quality
+of the reads covering the site will be inserted between the `reference
+base' and the `read bases' columns. An indel occupies an additional
+line. Each indel line consists of chromosome name, coordinate, a star,
+the genotype, consensus quality, SNP quality, RMS mapping quality, #
+covering reads, the first alllele, the second allele, # reads supporting
+the first allele, # reads supporting the second allele and # reads
+containing indels different from the top two alleles.
+
+.B NOTE:
+Since 0.1.10, the `pileup' command is deprecated by `mpileup'.
+
+.B OPTIONS:
+.RS
+.TP 10
+.B -B
+Disable the BAQ computation. See the
+.B mpileup
+command for details.
+.TP
+.B -c
+Call the consensus sequence. Options
+.BR -T ", " -N ", " -I " and " -r
+are only effective when
+.BR -c " or " -g
+is in use.
+.TP
+.BI -C \ INT
+Coefficient for downgrading the mapping quality of poorly mapped
+reads. See the
+.B mpileup
+command for details. [0]
+.TP
+.BI -d \ INT
+Use the first
+.I NUM
+reads in the pileup for indel calling for speed up. Zero for unlimited. [1024]
+.TP
+.BI -f \ FILE
+The reference sequence in the FASTA format. Index file
+.I FILE.fai
+will be created if
+absent.
+.TP
+.B -g
+Generate genotype likelihood in the binary GLFv3 format. This option
+suppresses -c, -i and -s. This option is deprecated by the
+.B mpileup
+command.
+.TP
+.B -i
+Only output pileup lines containing indels.
+.TP
+.BI -I \ INT
+Phred probability of an indel in sequencing/prep. [40]
+.TP
+.BI -l \ FILE
+List of sites at which pileup is output. This file is space
+delimited. The first two columns are required to be chromosome and
+1-based coordinate. Additional columns are ignored. It is
+recommended to use option
+.TP
+.BI -m \ INT
+Filter reads with flag containing bits in
+.I INT
+[1796]
+.TP
+.BI -M \ INT
+Cap mapping quality at INT [60]
+.TP
+.BI -N \ INT
+Number of haplotypes in the sample (>=2) [2]
+.TP
+.BI -r \ FLOAT
+Expected fraction of differences between a pair of haplotypes [0.001]
+.TP
+.B -s
+Print the mapping quality as the last column. This option makes the
+output easier to parse, although this format is not space efficient.
+.TP
+.B -S
+The input file is in SAM.
+.TP
+.BI -t \ FILE
+List of reference names ane sequence lengths, in the format described
+for the
+.B import
+command. If this option is present, samtools assumes the input
+.I <in.alignment>
+is in SAM format; otherwise it assumes in BAM format.
+.B -s
+together with
+.B -l
+as in the default format we may not know the mapping quality.
+.TP
+.BI -T \ FLOAT
+The theta parameter (error dependency coefficient) in the maq consensus
+calling model [0.85]
  .RE
  
  .SH SAM FORMAT
@@ -573,9 +614,8 @@ will be attached
  .IP o 2
  Call SNPs and short indels for one diploid individual:
  
-  samtools pileup -vcf ref.fa aln.bam > var.raw.plp
-  samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp
-  awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' var.flt.plp > var.final.plp
+  samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf
+  bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf
  
  The
  .B -D
@@ -583,37 +623,37 @@ option of varFilter controls the maximum read depth, which should be
  adjusted to about twice the average read depth.  One may consider to add
  .B -C50
  to
-.B pileup
+.B mpileup
  if mapping quality is overestimated for reads containing excessive
  mismatches. Applying this option usually helps
  .B BWA-short
-but may not other mappers. It also potentially increases reference
-biases.
+but may not other mappers.
  
  .IP o 2
-Call SNPs (not short indels) for multiple diploid individuals:
+Call SNPs and short indels for multiple diploid individuals:
  
-  samtools mpileup -augf ref.fa *.bam | bcftools view -bcv - > snp.raw.bcf
-  bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 | bgzip > snp.flt.vcf.gz
+  samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view -bcvg - > var.raw.bcf
+  bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > var.flt.vcf
  
  Individuals are identified from the
  .B SM
  tags in the
  .B @RG
  header lines. Individuals can be pooled in one alignment file; one
-individual can also be separated into multiple files. Similarly, one may
-consider to apply
-.B -C50
-to
-.BR mpileup .
-SNP calling in this way also works for single sample and has the
-advantage of enabling more powerful filtering. The drawback is the lack
-of short indel calling, which may be implemented in future.
+individual can also be separated into multiple files. The
+.B -P
+option specifies that indel candidates should be collected only from
+read groups with the
+.B @RG-PL
+tag set to
+.IR ILLUMINA .
+Collecting indel candidates from reads sequenced by an indel-prone
+technology may affect the performance of indel calling.
  
  .IP o 2
  Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals:
  
-  samtools mpileup -gf ref.fa *.bam > all.bcf
+  samtools mpileup -Igf ref.fa *.bam > all.bcf
    bcftools view -bl sites.list all.bcf > sites.bcf
    bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
    bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
@@ -630,7 +670,7 @@ commands estimate AFS by EM.
  .IP o 2
  Dump BAQ applied alignment for other SNP callers:
  
-  samtools calmd -br aln.bam > aln.baq.bam
+  samtools calmd -bAr aln.bam > aln.baq.bam
  
  It adds and corrects the
  .B NM
@@ -640,7 +680,7 @@ tags at the same time. The
  .B calmd
  command also comes with the
  .B -C
-option, the same as the on in
+option, the same as the one in
  .B pileup
  and
  .BR mpileup .
@@ -666,8 +706,9 @@ although a little slower.
  .PP
  Heng Li from the Sanger Institute wrote the C version of samtools. Bob
  Handsaker from the Broad Institute implemented the BGZF library and Jue
-Ruan from Beijing Genomics Institute wrote the RAZF library. Various
-people in the 1000 Genomes Project contributed to the SAM format
+Ruan from Beijing Genomics Institute wrote the RAZF library. John
+Marshall and Petr Danecek contribute to the source code and various
+people from the 1000 Genomes Project have contributed to the SAM format
  specification.
  
  .SH SEE ALSO
diff --git a/samtools.txt b/samtools.txt

index 09c1ccdedf9be5dd1964022519948a3932d384c8..41cabe5053986a44f415ae35981c0f61738a7996 100644 (file)
--- a/samtools.txt
+++ b/samtools.txt
@@ -22,8 +22,7 @@ SYNOPSIS
  
         samtools pileup -vcf ref.fasta aln.sorted.bam
  
-       samtools  mpileup  -C50  -agf  ref.fasta  -r  chr3:1,000-2,000  in1.bam
-       in2.bam
+       samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
  
         samtools tview aln.sorted.bam ref.fasta
  
@@ -48,7 +47,7 @@ DESCRIPTION
  
  
  COMMANDS AND OPTIONS
-       view      samtools  view  [-bhuHS]  [-t  in.refList]  [-o  output]  [-f
+       view      samtools  view  [-bchuHS]  [-t  in.refList]  [-o  output] [-f
                   reqFlag] [-F skipFlag] [-q minMapQ] [-l  library]  [-r  read-
                   Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
  
@@ -90,6 +89,10 @@ COMMANDS AND OPTIONS
                   -S      Input is in SAM. If @SQ header lines are absent,  the
                           `-t' option is required.
  
+                 -c      Instead  of  printing the alignments, only count them
+                         and print the total number. All filter options,  such
+                         as `-f', `-F' and `-q' , are taken into account.
+
                   -t FILE This  file  is  TAB-delimited. Each line must contain
                           the reference name and the length of  the  reference,
                           one  line  for  each  distinct  reference; additional
@@ -112,109 +115,6 @@ COMMANDS AND OPTIONS
                   reference sequence.
  
  
-       pileup    samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
-                 [-l in.site_list] [-C capMapQ] [-M maxMapQ]  [-T  theta]  [-N
-                 nHap]  [-r  pairDiffRate]  [-m  mask]  [-d maxIndelDepth] [-G
-                 indelPrior] <in.bam>|<in.sam>
-
-                 Print the alignment in the pileup format. In the pileup  for-
-                 mat,  each  line represents a genomic position, consisting of
-                 chromosome name, coordinate, reference base, read bases, read
-                 qualities  and  alignment  mapping  qualities. Information on
-                 match, mismatch, indel, strand, mapping quality and start and
-                 end  of  a  read  are all encoded at the read base column. At
-                 this column, a dot stands for a match to the  reference  base
-                 on  the  forward  strand,  a comma for a match on the reverse
-                 strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
-                 match on the forward strand and `acgtn' for a mismatch on the
-                 reverse strand. A pattern  `\+[0-9]+[ACGTNacgtn]+'  indicates
-                 there is an insertion between this reference position and the
-                 next reference position. The length of the insertion is given
-                 by  the  integer  in  the  pattern,  followed by the inserted
-                 sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'  repre-
-                 sents  a  deletion from the reference. The deleted bases will
-                 be presented as `*' in the following lines. Also at the  read
-                 base  column,  a  symbol  `^'  marks the start of a read. The
-                 ASCII of the character following `^' minus 33 gives the  map-
-                 ping quality. A symbol `$' marks the end of a read segment.
-
-                 If  option  -c  is  applied, the consensus base, Phred-scaled
-                 consensus quality, SNP quality (i.e. the Phred-scaled  proba-
-                 bility of the consensus being identical to the reference) and
-                 root mean square (RMS) mapping quality of the reads  covering
-                 the  site  will  be inserted between the `reference base' and
-                 the `read bases' columns. An  indel  occupies  an  additional
-                 line.  Each  indel  line consists of chromosome name, coordi-
-                 nate, a star, the genotype, consensus quality,  SNP  quality,
-                 RMS mapping quality, # covering reads, the first alllele, the
-                 second allele, # reads supporting the first allele,  #  reads
-                 supporting  the  second  allele and # reads containing indels
-                 different from the top two alleles.
-
-                 The position of indels is offset by -1.
-
-                 OPTIONS:
-
-                 -B        Disable the BAQ computation. See the  mpileup  com-
-                           mand for details.
-
-                 -c        Call the consensus sequence using SOAPsnp consensus
-                           model. Options -T, -N, -I and -r are only effective
-                           when -c or -g is in use.
-
-                 -C INT    Coefficient  for downgrading the mapping quality of
-                           poorly mapped reads. See the  mpileup  command  for
-                           details. [0]
-
-                 -d INT    Use  the  first  NUM  reads in the pileup for indel
-                           calling for speed up. Zero for unlimited. [1024]
-
-                 -f FILE   The reference sequence in the FASTA  format.  Index
-                           file FILE.fai will be created if absent.
-
-                 -g        Generate  genotype  likelihood  in the binary GLFv3
-                           format. This option suppresses -c, -i and -s.  This
-                           option is deprecated by the mpileup command.
-
-                 -i        Only output pileup lines containing indels.
-
-                 -I INT    Phred  probability  of an indel in sequencing/prep.
-                           [40]
-
-                 -l FILE   List of sites at which pileup is output. This  file
-                           is  space  delimited.  The  first  two  columns are
-                           required to be chromosome and  1-based  coordinate.
-                           Additional  columns  are ignored. It is recommended
-                           to use option
-
-                 -m INT    Filter reads  with  flag  containing  bits  in  INT
-                           [1796]
-
-                 -M INT    Cap mapping quality at INT [60]
-
-                 -N INT    Number of haplotypes in the sample (>=2) [2]
-
-                 -r FLOAT  Expected  fraction of differences between a pair of
-                           haplotypes [0.001]
-
-                 -s        Print the mapping quality as the last column.  This
-                           option  makes  the output easier to parse, although
-                           this format is not space efficient.
-
-                 -S        The input file is in SAM.
-
-                 -t FILE   List of reference names ane  sequence  lengths,  in
-                           the  format  described  for  the import command. If
-                           this option is present, samtools assumes the  input
-                           <in.alignment>  is  in  SAM  format;  otherwise  it
-                           assumes in BAM format.  -s together with -l  as  in
-                           the  default  format  we  may  not know the mapping
-                           quality.
-
-                 -T FLOAT  The theta parameter (error dependency  coefficient)
-                           in the maq consensus calling model [0.85]
-
-
         mpileup   samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l
                   list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam
                   [...]]
@@ -237,27 +137,43 @@ COMMANDS AND OPTIONS
                           phred-scaled probability q of  being  generated  from
                           the mapped position, the new mapping quality is about
                           sqrt((INT-q)/INT)*INT. A  zero  value  disables  this
-                         functionality;  if  enabled, the recommended value is
-                         50. [0]
+                         functionality;  if enabled, the recommended value for
+                         BWA is 50. [0]
+
+                 -e INT  Phred-scaled gap extension sequencing error probabil-
+                         ity. Reducing INT leads to longer indels. [20]
  
                   -f FILE The reference file [null]
  
-                 -g      Compute genotype likelihoods and output them  in  the
+                 -g      Compute  genotype  likelihoods and output them in the
                           binary call format (BCF).
  
-                 -u      Similar  to -g except that the output is uncompressed
-                         BCF, which is preferred for pipeing.
+                 -h INT  Coefficient for modeling homopolymer errors. Given an
+                         l-long  homopolymer  run,  the sequencing error of an
+                         indel of size s is modeled as INT*s/l.  [100]
  
                   -l FILE File containing a list of sites where pileup  or  BCF
                           is outputted [null]
  
-                 -q INT  Minimum  mapping  quality for an alignment to be used
+                 -o INT  Phred-scaled  gap  open sequencing error probability.
+                         Reducing INT leads to more indel calls. [40]
+
+                 -P STR  Comma dilimited list of platforms (determined by @RG-
+                         PL)  from  which indel candidates are obtained. It is
+                         recommended to collect indel candidates from sequenc-
+                         ing  technologies that have low indel error rate such
+                         as ILLUMINA. [all]
+
+                 -q INT  Minimum mapping quality for an alignment to  be  used
                           [0]
  
                   -Q INT  Minimum base quality for a base to be considered [13]
  
                   -r STR  Only generate pileup in region STR [all sites]
  
+                 -u      Similar to -g except that the output is  uncompressed
+                         BCF, which is preferred for piping.
+
  
         reheader  samtools reheader <in.header.sam> <in.bam>
  
@@ -368,8 +284,11 @@ COMMANDS AND OPTIONS
  
                   OPTIONS:
  
-                 -e      Convert a the read base to = if it  is  identical  to
-                         the  aligned  reference  base.  Indel caller does not
+                 -A      When used jointly with -r this option overwrites  the
+                         original base quality.
+
+                 -e      Convert  a  the  read base to = if it is identical to
+                         the aligned reference base.  Indel  caller  does  not
                           support the = bases at the moment.
  
                   -u      Output uncompressed BAM
@@ -378,15 +297,117 @@ COMMANDS AND OPTIONS
  
                   -S      The input is SAM with header lines
  
-                 -C INT  Coefficient to cap mapping quality of  poorly  mapped
+                 -C INT  Coefficient  to  cap mapping quality of poorly mapped
                           reads. See the pileup command for details. [0]
  
-                 -r      Perform  probabilistic  realignment  to  compute BAQ,
-                         which will be used to cap base quality.
+                 -r      Compute the BQ tag without changing the base quality.
+
+
+       pileup    samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
+                 [-l in.site_list] [-C capMapQ] [-M maxMapQ]  [-T  theta]  [-N
+                 nHap]  [-r  pairDiffRate]  [-m  mask]  [-d maxIndelDepth] [-G
+                 indelPrior] <in.bam>|<in.sam>
+
+                 Print the alignment in the pileup format. In the pileup  for-
+                 mat,  each  line represents a genomic position, consisting of
+                 chromosome name, coordinate, reference base, read bases, read
+                 qualities  and  alignment  mapping  qualities. Information on
+                 match, mismatch, indel, strand, mapping quality and start and
+                 end  of  a  read  are all encoded at the read base column. At
+                 this column, a dot stands for a match to the  reference  base
+                 on  the  forward  strand,  a comma for a match on the reverse
+                 strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
+                 match on the forward strand and `acgtn' for a mismatch on the
+                 reverse strand. A pattern  `\+[0-9]+[ACGTNacgtn]+'  indicates
+                 there is an insertion between this reference position and the
+                 next reference position. The length of the insertion is given
+                 by  the  integer  in  the  pattern,  followed by the inserted
+                 sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'  repre-
+                 sents  a  deletion from the reference. The deleted bases will
+                 be presented as `*' in the following lines. Also at the  read
+                 base  column,  a  symbol  `^'  marks the start of a read. The
+                 ASCII of the character following `^' minus 33 gives the  map-
+                 ping quality. A symbol `$' marks the end of a read segment.
+
+                 If  option  -c  is  applied, the consensus base, Phred-scaled
+                 consensus quality, SNP quality (i.e. the Phred-scaled  proba-
+                 bility of the consensus being identical to the reference) and
+                 root mean square (RMS) mapping quality of the reads  covering
+                 the  site  will  be inserted between the `reference base' and
+                 the `read bases' columns. An  indel  occupies  an  additional
+                 line.  Each  indel  line consists of chromosome name, coordi-
+                 nate, a star, the genotype, consensus quality,  SNP  quality,
+                 RMS mapping quality, # covering reads, the first alllele, the
+                 second allele, # reads supporting the first allele,  #  reads
+                 supporting  the  second  allele and # reads containing indels
+                 different from the top two alleles.
+
+                 NOTE: Since 0.1.10, the `pileup'  command  is  deprecated  by
+                 `mpileup'.
+
+                 OPTIONS:
+
+                 -B        Disable  the  BAQ computation. See the mpileup com-
+                           mand for details.
+
+                 -c        Call the consensus sequence. Options -T, -N, -I and
+                           -r are only effective when -c or -g is in use.
+
+                 -C INT    Coefficient  for downgrading the mapping quality of
+                           poorly mapped reads. See the  mpileup  command  for
+                           details. [0]
+
+                 -d INT    Use  the  first  NUM  reads in the pileup for indel
+                           calling for speed up. Zero for unlimited. [1024]
+
+                 -f FILE   The reference sequence in the FASTA  format.  Index
+                           file FILE.fai will be created if absent.
+
+                 -g        Generate  genotype  likelihood  in the binary GLFv3
+                           format. This option suppresses -c, -i and -s.  This
+                           option is deprecated by the mpileup command.
+
+                 -i        Only output pileup lines containing indels.
+
+                 -I INT    Phred  probability  of an indel in sequencing/prep.
+                           [40]
+
+                 -l FILE   List of sites at which pileup is output. This  file
+                           is  space  delimited.  The  first  two  columns are
+                           required to be chromosome and  1-based  coordinate.
+                           Additional  columns  are ignored. It is recommended
+                           to use option
+
+                 -m INT    Filter reads  with  flag  containing  bits  in  INT
+                           [1796]
+
+                 -M INT    Cap mapping quality at INT [60]
+
+                 -N INT    Number of haplotypes in the sample (>=2) [2]
+
+                 -r FLOAT  Expected  fraction of differences between a pair of
+                           haplotypes [0.001]
+
+                 -s        Print the mapping quality as the last column.  This
+                           option  makes  the output easier to parse, although
+                           this format is not space efficient.
+
+                 -S        The input file is in SAM.
+
+                 -t FILE   List of reference names ane  sequence  lengths,  in
+                           the  format  described  for  the import command. If
+                           this option is present, samtools assumes the  input
+                           <in.alignment>  is  in  SAM  format;  otherwise  it
+                           assumes in BAM format.  -s together with -l  as  in
+                           the  default  format  we  may  not know the mapping
+                           quality.
+
+                 -T FLOAT  The theta parameter (error dependency  coefficient)
+                           in the maq consensus calling model [0.85]
  
  
  SAM FORMAT
-       SAM is TAB-delimited. Apart from the header lines,  which  are  started
+       SAM  is  TAB-delimited.  Apart from the header lines, which are started
         with the `@' symbol, each alignment line consists of:
  
  
@@ -441,51 +462,50 @@ EXAMPLES
  
         o Attach the RG tag while merging sorted alignments:
  
-           perl       -e       'print      "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
+           perl      -e       'print       "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
           mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
             samtools merge -rh rg.txt merged.bam ga.bam 454.bam
  
           The value in a RG tag is determined by the file name the read is com-
-         ing  from. In this example, in the merged.bam, reads from ga.bam will
-         be attached RG:Z:ga,  while  reads  from  454.bam  will  be  attached
+         ing from. In this example, in the merged.bam, reads from ga.bam  will
+         be  attached  RG:Z:ga,  while  reads  from  454.bam  will be attached
           RG:Z:454.
  
  
         o Call SNPs and short indels for one diploid individual:
  
-           samtools pileup -vcf ref.fa aln.bam > var.raw.plp
-           samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp
-           awk     '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)'     var.flt.plp    >
-         var.final.plp
+           samtools mpileup -ugf ref.fa aln.bam |  bcftools  view  -bvcg  -  >
+         var.raw.bcf
+           bcftools   view  var.raw.bcf  |  vcfutils.pl  varFilter  -D  100  >
+         var.flt.vcf
  
           The -D option of varFilter controls the  maximum  read  depth,  which
           should  be  adjusted  to about twice the average read depth.  One may
-         consider to add -C50 to pileup if mapping  quality  is  overestimated
+         consider to add -C50 to mpileup if mapping quality  is  overestimated
           for  reads containing excessive mismatches. Applying this option usu-
-         ally helps BWA-short but may not other mappers. It  also  potentially
-         increases reference biases.
+         ally helps BWA-short but may not other mappers.
  
  
-       o Call SNPs (not short indels) for multiple diploid individuals:
+       o Call SNPs and short indels for multiple diploid individuals:
  
-           samtools  mpileup  -augf  ref.fa  *.bam  |  bcftools  view -bcv - >
-         snp.raw.bcf
-           bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 |  bgzip
-         > snp.flt.vcf.gz
+           samtools mpileup -P ILLUMINA -ugf  ref.fa  *.bam  |  bcftools  view
+         -bcvg - > var.raw.bcf
+           bcftools  view  var.raw.bcf  |  vcfutils.pl  varFilter  -D  2000  >
+         var.flt.vcf
  
-         Individuals  are identified from the SM tags in the @RG header lines.
-         Individuals can be pooled in one alignment file; one  individual  can
-         also be separated into multiple files. Similarly, one may consider to
-         apply -C50 to mpileup.  SNP calling in this way also works for single
-         sample and has the advantage of enabling more powerful filtering. The
-         drawback is the lack of short indel calling, which may be implemented
-         in future.
+         Individuals are identified from the SM tags in the @RG header  lines.
+         Individuals  can  be pooled in one alignment file; one individual can
+         also be separated into multiple files. The -P option  specifies  that
+         indel  candidates  should be collected only from read groups with the
+         @RG-PL tag set to ILLUMINA.  Collecting indel candidates  from  reads
+         sequenced  by an indel-prone technology may affect the performance of
+         indel calling.
  
  
-       o Derive  the  allele  frequency spectrum (AFS) on a list of sites from
+       o Derive the allele frequency spectrum (AFS) on a list  of  sites  from
           multiple individuals:
  
-           samtools mpileup -gf ref.fa *.bam > all.bcf
+           samtools mpileup -Igf ref.fa *.bam > all.bcf
             bcftools view -bl sites.list all.bcf > sites.bcf
             bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
             bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
@@ -493,40 +513,41 @@ EXAMPLES
             ......
  
           where sites.list contains the list of sites with each line consisting
-         of  the  reference sequence name and position. The following bcftools
+         of the reference sequence name and position. The  following  bcftools
           commands estimate AFS by EM.
  
  
         o Dump BAQ applied alignment for other SNP callers:
  
-           samtools calmd -br aln.bam > aln.baq.bam
+           samtools calmd -bAr aln.bam > aln.baq.bam
  
-         It adds and corrects the NM and MD tags at the same time.  The  calmd
-         command  also  comes with the -C option, the same as the on in pileup
+         It  adds  and corrects the NM and MD tags at the same time. The calmd
+         command also comes with the -C option, the same as the one in  pileup
           and mpileup.  Apply if it helps.
  
  
  LIMITATIONS
-       o Unaligned  words  used  in  bam_import.c,  bam_endian.h,  bam.c   and
+       o Unaligned   words  used  in  bam_import.c,  bam_endian.h,  bam.c  and
           bam_aux.c.
  
-       o In  merging,  the input files are required to have the same number of
-         reference sequences. The requirement can  be  relaxed.  In  addition,
-         merging  does  not reconstruct the header dictionaries automatically.
-         Endusers have to provide the correct  header.  Picard  is  better  at
+       o In merging, the input files are required to have the same  number  of
+         reference  sequences.  The  requirement  can be relaxed. In addition,
+         merging does not reconstruct the header  dictionaries  automatically.
+         Endusers  have  to  provide  the  correct header. Picard is better at
           merging.
  
-       o Samtools  paired-end  rmdup  does  not  work for unpaired reads (e.g.
-         orphan reads or ends mapped to different chromosomes). If this  is  a
-         concern,  please  use  Picard's MarkDuplicate which correctly handles
+       o Samtools paired-end rmdup does not  work  for  unpaired  reads  (e.g.
+         orphan  reads  or ends mapped to different chromosomes). If this is a
+         concern, please use Picard's MarkDuplicate  which  correctly  handles
           these cases, although a little slower.
  
  
  AUTHOR
-       Heng Li from the Sanger Institute wrote the C version of samtools.  Bob
+       Heng  Li from the Sanger Institute wrote the C version of samtools. Bob
         Handsaker from the Broad Institute implemented the BGZF library and Jue
-       Ruan from Beijing Genomics Institute wrote the  RAZF  library.  Various
-       people in the 1000 Genomes Project contributed to the SAM format speci-
+       Ruan  from Beijing Genomics Institute wrote the RAZF library. John Mar-
+       shall and Petr Danecek contribute to the source code and various people
+       from the 1000 Genomes Project have contributed to the SAM format speci-
         fication.
  
  
@@ -535,4 +556,4 @@ SEE ALSO
  
  
  
-samtools-0.1.9                  27 October 2010                    samtools(1)
+samtools-0.1.10                15 November 2010                    samtools(1)
author	Charles Plessy <plessy@debian.org>
	Wed, 17 Nov 2010 11:23:16 +0000 (20:23 +0900)
committer	Charles Plessy <plessy@debian.org>
	Wed, 17 Nov 2010 11:23:16 +0000 (20:23 +0900)
ChangeLog		patch \| blob \| history
Makefile		patch \| blob \| history
NEWS		patch \| blob \| history
bam.h		patch \| blob \| history
bam2bcf.c		patch \| blob \| history
bam2bcf.h		patch \| blob \| history
bam2bcf_indel.c	[new file with mode: 0644]	patch \| blob
bam_maqcns.c		patch \| blob \| history
bam_md.c		patch \| blob \| history
bam_pileup.c		patch \| blob \| history
bam_plcmd.c		patch \| blob \| history
bamtk.c		patch \| blob \| history
bcftools/bcf.c		patch \| blob \| history
bcftools/bcf.h		patch \| blob \| history
bcftools/call1.c		patch \| blob \| history
bcftools/prob1.c		patch \| blob \| history
bcftools/prob1.h		patch \| blob \| history
bcftools/vcf.c		patch \| blob \| history
bcftools/vcfutils.pl		patch \| blob \| history
kaln.c		patch \| blob \| history
kaln.h		patch \| blob \| history
kprobaln.c	[new file with mode: 0644]	patch \| blob
kprobaln.h	[new file with mode: 0644]	patch \| blob
misc/samtools.pl		patch \| blob \| history
sam_view.c		patch \| blob \| history
samtools.1		patch \| blob \| history
samtools.txt		patch \| blob \| history