From: Charles Plessy Date: Wed, 17 Nov 2010 11:23:16 +0000 (+0900) Subject: Imported Upstream version 0.1.10 X-Git-Url: http://woldlab.caltech.edu/gitweb/?p=samtools.git;a=commitdiff_plain;h=e3b3a0177339fb8c099346986e965e3bd5b85999 Imported Upstream version 0.1.10 --- diff --git a/ChangeLog b/ChangeLog index 12908c9..f8456a3 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,453 @@ +------------------------------------------------------------------------ +r828 | lh3lh3 | 2010-11-16 20:48:49 -0500 (Tue, 16 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bamtk.c + +update version information: samtools-0.1.9-20 (r828) + +------------------------------------------------------------------------ +r827 | lh3lh3 | 2010-11-16 15:32:50 -0500 (Tue, 16 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/call1.c + +bcftools: allow to skip indels + +------------------------------------------------------------------------ +r826 | lh3lh3 | 2010-11-16 14:11:58 -0500 (Tue, 16 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam_md.c + +remove ZQ if both BQ and ZQ are present + +------------------------------------------------------------------------ +r825 | lh3lh3 | 2010-11-16 13:51:33 -0500 (Tue, 16 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bam_md.c + M /trunk/samtools/bamtk.c + M /trunk/samtools/samtools.1 + + * samtools-0.1.9-18 (r825) + * change the behaviour of calmd such that by default it does not change the base quality + +------------------------------------------------------------------------ +r824 | lh3lh3 | 2010-11-15 23:31:53 -0500 (Mon, 15 Nov 2010) | 4 lines +Changed paths: + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + M /trunk/samtools/bcftools/call1.c + M /trunk/samtools/samtools.1 + + * samtools-0.1.9-17 (r824) + * added command line options to change the default parameters in indel calling + * update the manual + +------------------------------------------------------------------------ +r823 | lh3lh3 | 2010-11-15 12:20:13 -0500 (Mon, 15 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bam_md.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.9-r823 + * the BQ tag is now 64 shifted, not 33 shifted + +------------------------------------------------------------------------ +r822 | lh3lh3 | 2010-11-15 00:30:18 -0500 (Mon, 15 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf.h + M /trunk/samtools/bamtk.c + M /trunk/samtools/bcftools/vcfutils.pl + M /trunk/samtools/misc/samtools.pl + + * samtools-0.1.9-16 (r822) + * keep the raw depth because in indel calling, DP4 may be way off the true depth + +------------------------------------------------------------------------ +r821 | lh3lh3 | 2010-11-13 01:18:31 -0500 (Sat, 13 Nov 2010) | 4 lines +Changed paths: + M /trunk/samtools/bam_md.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.9-15 (r821) + * calmd: write BQ + * skip realignment if BQ is present + +------------------------------------------------------------------------ +r820 | lh3lh3 | 2010-11-13 01:08:26 -0500 (Sat, 13 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.9-14 (r820) + * penalize reads with excessive differences in indel calling + +------------------------------------------------------------------------ +r819 | lh3lh3 | 2010-11-12 21:36:27 -0500 (Fri, 12 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam_maqcns.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.9-13 (r819) + * fixed a bug in pileup given refskip + +------------------------------------------------------------------------ +r818 | lh3lh3 | 2010-11-12 13:04:53 -0500 (Fri, 12 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bamtk.c + + * samtools-r818 + * for indel calling, do two rounds of probabilistic realignments + +------------------------------------------------------------------------ +r817 | lh3lh3 | 2010-11-11 20:04:07 -0500 (Thu, 11 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bamtk.c + M /trunk/samtools/bcftools/vcfutils.pl + + * samtools-0.1.19-11 (r817) + * only initiate indel calling when 0.2% of reads contain a gap + +------------------------------------------------------------------------ +r816 | lh3lh3 | 2010-11-11 01:22:59 -0500 (Thu, 11 Nov 2010) | 7 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.9-10 (r816) + + * I know why the forward method fails. it is because of zero base + qualities. when that is fixed, the forward method seems to give + better results than Viterbi, as it should be. I am tired... + + +------------------------------------------------------------------------ +r815 | lh3lh3 | 2010-11-11 00:57:15 -0500 (Thu, 11 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/ChangeLog + M /trunk/samtools/bam2bcf_indel.c + +effectively revert to the viterbi version. The forward realignment gives too many false positives. + +------------------------------------------------------------------------ +r814 | lh3lh3 | 2010-11-11 00:18:02 -0500 (Thu, 11 Nov 2010) | 4 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bam_md.c + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.9-9 (r810) + * use forward, instead of viterbi, for realignment + * realignment is now quality aware + +------------------------------------------------------------------------ +r813 | lh3lh3 | 2010-11-10 22:45:24 -0500 (Wed, 10 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/kprobaln.c + M /trunk/samtools/kprobaln.h + + * prepare to replace kaln with kprobaln in realignment + +------------------------------------------------------------------------ +r812 | lh3lh3 | 2010-11-10 17:28:50 -0500 (Wed, 10 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/bcf.c + +fixed a typo + +------------------------------------------------------------------------ +r811 | lh3lh3 | 2010-11-10 16:54:46 -0500 (Wed, 10 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/bcf.c + M /trunk/samtools/bcftools/bcf.h + +use zlib for direct reading when BCF_LITE is in use + +------------------------------------------------------------------------ +r810 | lh3lh3 | 2010-11-10 16:32:13 -0500 (Wed, 10 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + + * do not use reads containing too many mismatches for indel calling + * fixed a trivial bug in case of multi-allelic indels + +------------------------------------------------------------------------ +r809 | lh3lh3 | 2010-11-10 13:23:02 -0500 (Wed, 10 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.9-8 (r809) + * fixed a bug in the indel caller + +------------------------------------------------------------------------ +r808 | lh3lh3 | 2010-11-10 12:24:10 -0500 (Wed, 10 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/Makefile + +minor change to makefile + +------------------------------------------------------------------------ +r807 | lh3lh3 | 2010-11-10 12:10:21 -0500 (Wed, 10 Nov 2010) | 4 lines +Changed paths: + M /trunk/samtools/Makefile + M /trunk/samtools/bam2bcf.h + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + M /trunk/samtools/bcftools/vcfutils.pl + + * samtools-0.1.9-8 (r807) + * collect indel candidates only from specified platforms (@RG-PL) + * merge varFilter and filter4vcf in vcfutils.pl + +------------------------------------------------------------------------ +r806 | lh3lh3 | 2010-11-09 22:05:46 -0500 (Tue, 09 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/call1.c + M /trunk/samtools/bcftools/prob1.c + M /trunk/samtools/bcftools/prob1.h + +bcftools: compute equal-tail (Bayesian) credible interval + +------------------------------------------------------------------------ +r805 | lh3lh3 | 2010-11-09 16:28:39 -0500 (Tue, 09 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/vcfutils.pl + +added a double-hit filter to avoid overestimated indel likelihood + +------------------------------------------------------------------------ +r804 | lh3lh3 | 2010-11-09 14:12:06 -0500 (Tue, 09 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.9-7 (r804) + * fixed a bug in the gap caller + +------------------------------------------------------------------------ +r803 | lh3lh3 | 2010-11-09 10:45:33 -0500 (Tue, 09 Nov 2010) | 4 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bamtk.c + M /trunk/samtools/bcftools/bcf.c + M /trunk/samtools/bcftools/bcf.h + M /trunk/samtools/bcftools/prob1.c + + * samtools-0.1.9-6 (r803) + * mpileup: apply homopolymer correction when calculating GL, instead of before + * bcftools: apply a different prior to indels + +------------------------------------------------------------------------ +r802 | lh3lh3 | 2010-11-08 23:53:15 -0500 (Mon, 08 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.9-5 (r802) + * relax tandem penalty. this will be made a command-line option in future. + +------------------------------------------------------------------------ +r801 | lh3lh3 | 2010-11-08 23:35:52 -0500 (Mon, 08 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.9-4 (r801) + * fixed a minor issue in printing indel VCF + +------------------------------------------------------------------------ +r800 | lh3lh3 | 2010-11-08 15:28:14 -0500 (Mon, 08 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bcftools/vcfutils.pl + +fixed another silly bug in mpileup's indel caller + +------------------------------------------------------------------------ +r799 | lh3lh3 | 2010-11-08 14:28:27 -0500 (Mon, 08 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + +fixed a silly bug in the indel caller + +------------------------------------------------------------------------ +r798 | lh3lh3 | 2010-11-08 14:07:33 -0500 (Mon, 08 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/ChangeLog + M /trunk/samtools/sam_view.c + M /trunk/samtools/samtools.1 + +Incorporate patches by Marcel Martin for read counting. + +------------------------------------------------------------------------ +r797 | lh3lh3 | 2010-11-08 13:39:52 -0500 (Mon, 08 Nov 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf.h + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.9-2 (r797) + * mpileup: indel calling seems to be working + +------------------------------------------------------------------------ +r796 | lh3lh3 | 2010-11-08 10:54:46 -0500 (Mon, 08 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf.h + M /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/kaln.c + +indel calling is apparently working, but more information needs to be collected + +------------------------------------------------------------------------ +r795 | lh3lh3 | 2010-11-08 00:39:18 -0500 (Mon, 08 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf_indel.c + +fixed a few bugs in the indel caller. Probably there are more. + +------------------------------------------------------------------------ +r794 | lh3lh3 | 2010-11-07 22:23:16 -0500 (Sun, 07 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/Makefile + M /trunk/samtools/bam.h + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf.h + A /trunk/samtools/bam2bcf_indel.c + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/kaln.c + M /trunk/samtools/kaln.h + +prepare for the indel caller. It is not ready yet. + +------------------------------------------------------------------------ +r793 | lh3lh3 | 2010-11-05 11:28:23 -0400 (Fri, 05 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf.h + M /trunk/samtools/bam_plcmd.c + +Revert to r790. The recent changes are not good... + +------------------------------------------------------------------------ +r792 | lh3lh3 | 2010-11-05 00:19:14 -0400 (Fri, 05 Nov 2010) | 6 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf.h + M /trunk/samtools/bam_plcmd.c + + * this revision is UNSTABLE + + * indel caller seems working, but it is very insensitive and has + several things I do not quite understand. + + +------------------------------------------------------------------------ +r791 | lh3lh3 | 2010-11-04 22:58:43 -0400 (Thu, 04 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf.h + M /trunk/samtools/bam_plcmd.c + +for backup. no effective changes + +------------------------------------------------------------------------ +r790 | lh3lh3 | 2010-11-03 15:51:24 -0400 (Wed, 03 Nov 2010) | 2 lines +Changed paths: + M /trunk/samtools/bcftools/vcfutils.pl + M /trunk/samtools/kprobaln.c + +fixed a minor problem in the example coming with kprobaln.c + +------------------------------------------------------------------------ +r789 | lh3lh3 | 2010-11-02 15:41:27 -0400 (Tue, 02 Nov 2010) | 4 lines +Changed paths: + M /trunk/samtools/Makefile + M /trunk/samtools/bam_md.c + M /trunk/samtools/kaln.c + M /trunk/samtools/kaln.h + A /trunk/samtools/kprobaln.c + A /trunk/samtools/kprobaln.h + +Separate kaln and kprobaln as I am preparing further changes. At +present, the results should be identical to the previous. + + +------------------------------------------------------------------------ +r788 | petulda | 2010-11-02 12:19:04 -0400 (Tue, 02 Nov 2010) | 1 line +Changed paths: + M /trunk/samtools/bam_plcmd.c + +Added -b option: read file names from a file +------------------------------------------------------------------------ +r787 | lh3lh3 | 2010-10-29 23:17:22 -0400 (Fri, 29 Oct 2010) | 7 lines +Changed paths: + M /trunk/samtools/bam.h + M /trunk/samtools/bam_pileup.c + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + + * samtools-0.1.9-2 (r787) + + * Allow to set a maximum per-sample depth to reduce memory. However, + BAQ computation is still applied to every read. The speed is not + improved. + + +------------------------------------------------------------------------ +r786 | lh3lh3 | 2010-10-29 12:10:40 -0400 (Fri, 29 Oct 2010) | 3 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf.h + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bamtk.c + M /trunk/samtools/bcftools/bcf.c + M /trunk/samtools/bcftools/vcf.c + + * samtools-0.1.9-1 (r786) + * samtools: optionally perform exact test for each sample + +------------------------------------------------------------------------ +r785 | lh3lh3 | 2010-10-29 09:42:25 -0400 (Fri, 29 Oct 2010) | 2 lines +Changed paths: + M /trunk/samtools/bam2bcf.c + M /trunk/samtools/bam2bcf.h + M /trunk/samtools/bam_plcmd.c + M /trunk/samtools/bcftools/bcf.c + +Optionally output "DP", the individual read depth + +------------------------------------------------------------------------ +r784 | lh3lh3 | 2010-10-27 23:10:27 -0400 (Wed, 27 Oct 2010) | 2 lines +Changed paths: + M /trunk/samtools/samtools.1 + +acknowledge Petr and John who have greatly contributed to the project. + +------------------------------------------------------------------------ +r783 | lh3lh3 | 2010-10-27 22:47:47 -0400 (Wed, 27 Oct 2010) | 2 lines +Changed paths: + M /trunk/samtools/ChangeLog + M /trunk/samtools/NEWS + M /trunk/samtools/bamtk.c + M /trunk/samtools/samtools.1 + +Release samtools-0.1.9 (r783) + ------------------------------------------------------------------------ r782 | lh3lh3 | 2010-10-27 19:58:54 -0400 (Wed, 27 Oct 2010) | 2 lines Changed paths: diff --git a/Makefile b/Makefile index 77cda86..13d4a76 100644 --- a/Makefile +++ b/Makefile @@ -4,10 +4,10 @@ DFLAGS= -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE -D_CURSES_LIB=1 KNETFILE_O= knetfile.o LOBJS= bgzf.o kstring.o bam_aux.o bam.o bam_import.o sam.o bam_index.o \ bam_pileup.o bam_lpileup.o bam_md.o glf.o razf.o faidx.o \ - $(KNETFILE_O) bam_sort.o sam_header.o bam_reheader.o + $(KNETFILE_O) bam_sort.o sam_header.o bam_reheader.o kprobaln.o AOBJS= bam_tview.o bam_maqcns.o bam_plcmd.o sam_view.o \ bam_rmdup.o bam_rmdupse.o bam_mate.o bam_stat.o bam_color.o \ - bamtk.o kaln.o bam2bcf.o errmod.o sample.o + bamtk.o kaln.o bam2bcf.o bam2bcf_indel.o errmod.o sample.o PROG= samtools INCLUDES= -I. SUBDIRS= . bcftools misc @@ -64,6 +64,7 @@ glf.o:glf.h sam_header.o:sam_header.h khash.h bcf.o:bcftools/bcf.h bam2bcf.o:bam2bcf.h errmod.h bcftools/bcf.h +bam2bcf_indel.o:bam2bcf.h errmod.o:errmod.h faidx.o:faidx.h razf.h khash.h diff --git a/NEWS b/NEWS index 82646ba..6c2195e 100644 --- a/NEWS +++ b/NEWS @@ -1,3 +1,38 @@ +Beta Release 0.1.10 (16 November, 2010) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This release is featured as the first major improvement to the indel +caller. The method is similar to the old one implemented in the pileup +command, but the details are handled more carefully both in theory and +in practice. As a result, the new indel caller usually gives more +accurate indel calls, though at the cost of sensitivity. The caller is +implemented in the mpileup command and is invoked by default. It works +with multiple samples. + +Other notable changes: + + * With the -r option, the calmd command writes the difference between + the original base quality and the BAQ capped base quality at the BQ + tag but does not modify the base quality. Please use -Ar to overwrite + the original base quality (the 0.1.9 behavior). + + * Allow to set a maximum per-sample read depth to reduce memory. In + 0.1.9, most of memory is wasted for the ultra high read depth in some + regions (e.g. the chr1 centromere). + + * Optionally write per-sample read depth and per-sample strand bias + P-value. + + * Compute equal-tail (Bayesian) credible interval of site allele + frequency at the CI95 VCF annotation. + + * Merged the vcfutils.pl varFilter and filter4vcf for better SNP/indel + filtering. + +(0.1.10: 16 November 2010, r829) + + + Beta Release 0.1.9 (27 October, 2010) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/bam.h b/bam.h index 4c8536f..eef2ea9 100644 --- a/bam.h +++ b/bam.h @@ -495,7 +495,7 @@ extern "C" { bam1_t *b; int32_t qpos; int indel, level; - uint32_t is_del:1, is_head:1, is_tail:1, is_refskip:1; + uint32_t is_del:1, is_head:1, is_tail:1, is_refskip:1, aux:28; } bam_pileup1_t; typedef int (*bam_plp_auto_f)(void *data, bam1_t *b); @@ -508,6 +508,7 @@ extern "C" { const bam_pileup1_t *bam_plp_next(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp); const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp); void bam_plp_set_mask(bam_plp_t iter, int mask); + void bam_plp_set_maxcnt(bam_plp_t iter, int maxcnt); void bam_plp_reset(bam_plp_t iter); void bam_plp_destroy(bam_plp_t iter); @@ -516,6 +517,7 @@ extern "C" { bam_mplp_t bam_mplp_init(int n, bam_plp_auto_f func, void **data); void bam_mplp_destroy(bam_mplp_t iter); + void bam_mplp_set_maxcnt(bam_mplp_t iter, int maxcnt); int bam_mplp_auto(bam_mplp_t iter, int *_tid, int *_pos, int *n_plp, const bam_pileup1_t **plp); /*! @typedef @@ -708,8 +710,8 @@ static inline bam1_t *bam_copy1(bam1_t *bdst, const bam1_t *bsrc) { uint8_t *data = bdst->data; int m_data = bdst->m_data; // backup data and m_data - if (m_data < bsrc->m_data) { // double the capacity - m_data = bsrc->m_data; kroundup32(m_data); + if (m_data < bsrc->data_len) { // double the capacity + m_data = bsrc->data_len; kroundup32(m_data); data = (uint8_t*)realloc(data, m_data); } memcpy(data, bsrc->data, bsrc->data_len); // copy var-len data diff --git a/bam2bcf.c b/bam2bcf.c index e55212c..088635c 100644 --- a/bam2bcf.c +++ b/bam2bcf.c @@ -14,18 +14,13 @@ extern void ks_introsort_uint32_t(size_t n, uint32_t a[]); #define CAP_DIST 25 -struct __bcf_callaux_t { - int max_bases, capQ, min_baseQ; - uint16_t *bases; - errmod_t *e; -}; - bcf_callaux_t *bcf_call_init(double theta, int min_baseQ) { bcf_callaux_t *bca; if (theta <= 0.) theta = CALL_DEFTHETA; bca = calloc(1, sizeof(bcf_callaux_t)); bca->capQ = 60; + bca->openQ = 40; bca->extQ = 20; bca->tandemQ = 100; bca->min_baseQ = min_baseQ; bca->e = errmod_init(1. - theta); return bca; @@ -35,16 +30,19 @@ void bcf_call_destroy(bcf_callaux_t *bca) { if (bca == 0) return; errmod_destroy(bca->e); - free(bca->bases); free(bca); + free(bca->bases); free(bca->inscns); free(bca); } - -int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf_callaux_t *bca, bcf_callret1_t *r) +/* ref_base is the 4-bit representation of the reference base. It is + * negative if we are looking at an indel. */ +int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base, bcf_callaux_t *bca, bcf_callret1_t *r) { - int i, n, ref4; + int i, n, ref4, is_indel, ori_depth = 0; memset(r, 0, sizeof(bcf_callret1_t)); - ref4 = bam_nt16_nt4_table[ref_base]; + if (ref_base >= 0) { + ref4 = bam_nt16_nt4_table[ref_base]; + is_indel = 0; + } else ref4 = 4, is_indel = 1; if (_n == 0) return -1; - // enlarge the bases array if necessary if (bca->max_bases < _n) { bca->max_bases = _n; @@ -55,21 +53,29 @@ int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf memset(r, 0, sizeof(bcf_callret1_t)); for (i = n = 0; i < _n; ++i) { const bam_pileup1_t *p = pl + i; - int q, b, mapQ, baseQ, is_diff, min_dist; + int q, b, mapQ, baseQ, is_diff, min_dist, seqQ; // set base - if (p->is_del || (p->b->core.flag&BAM_FUNMAP)) continue; // skip unmapped reads and deleted bases - baseQ = q = (int)bam1_qual(p->b)[p->qpos]; // base quality + if (p->is_del || p->is_refskip || (p->b->core.flag&BAM_FUNMAP)) continue; + ++ori_depth; + baseQ = q = is_indel? p->aux&0xff : (int)bam1_qual(p->b)[p->qpos]; // base/indel quality + seqQ = is_indel? (p->aux>>8&0xff) : 99; if (q < bca->min_baseQ) continue; + if (q > seqQ) q = seqQ; mapQ = p->b->core.qual < bca->capQ? p->b->core.qual : bca->capQ; if (q > mapQ) q = mapQ; if (q > 63) q = 63; if (q < 4) q = 4; - b = bam1_seqi(bam1_seq(p->b), p->qpos); // base - b = bam_nt16_nt4_table[b? b : ref_base]; // b is the 2-bit base + if (!is_indel) { + b = bam1_seqi(bam1_seq(p->b), p->qpos); // base + b = bam_nt16_nt4_table[b? b : ref_base]; // b is the 2-bit base + is_diff = (ref4 < 4 && b == ref4)? 0 : 1; + } else { + b = p->aux>>16&0x3f; + is_diff = (b != 0); + } bca->bases[n++] = q<<5 | (int)bam1_strand(p->b)<<4 | b; // collect annotations r->qsum[b] += q; - is_diff = (ref4 < 4 && b == ref4)? 0 : 1; ++r->anno[0<<2|is_diff<<1|bam1_strand(p->b)]; min_dist = p->b->core.l_qseq - 1 - p->qpos; if (min_dist > p->qpos) min_dist = p->qpos; @@ -81,7 +87,7 @@ int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf r->anno[3<<2|is_diff<<1|0] += min_dist; r->anno[3<<2|is_diff<<1|1] += min_dist * min_dist; } - r->depth = n; + r->depth = n; r->ori_depth = ori_depth; // glfgen errmod_cal(bca->e, n, 5, bca->bases, r->p); return r->depth; @@ -91,8 +97,10 @@ int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/, { int ref4, i, j, qsum[4]; int64_t tmp; - call->ori_ref = ref4 = bam_nt16_nt4_table[ref_base]; - if (ref4 > 4) ref4 = 4; + if (ref_base >= 0) { + call->ori_ref = ref4 = bam_nt16_nt4_table[ref_base]; + if (ref4 > 4) ref4 = 4; + } else call->ori_ref = -1, ref4 = 0; // calculate qsum memset(qsum, 0, 4 * sizeof(int)); for (i = 0; i < n; ++i) @@ -113,9 +121,14 @@ int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/, else break; } } - if (((ref4 < 4 && j < 4) || (ref4 == 4 && j < 5)) && i >= 0) - call->unseen = j, call->a[j++] = qsum[i]&3; - call->n_alleles = j; + if (ref_base >= 0) { // for SNPs, find the "unseen" base + if (((ref4 < 4 && j < 4) || (ref4 == 4 && j < 5)) && i >= 0) + call->unseen = j, call->a[j++] = qsum[i]&3; + call->n_alleles = j; + } else { + call->n_alleles = j; + if (call->n_alleles == 1) return -1; // no reliable supporting read. stop doing anything + } // set the PL array if (call->n < n) { call->n = n; @@ -143,44 +156,101 @@ int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/, PL[j] = y; } } +// if (ref_base < 0) fprintf(stderr, "%d,%d,%f,%d\n", call->n_alleles, x, sum_min, call->unseen); call->shift = (int)(sum_min + .499); } // combine annotations memset(call->anno, 0, 16 * sizeof(int)); - for (i = call->depth = 0, tmp = 0; i < n; ++i) { + for (i = call->depth = call->ori_depth = 0, tmp = 0; i < n; ++i) { call->depth += calls[i].depth; + call->ori_depth += calls[i].ori_depth; for (j = 0; j < 16; ++j) call->anno[j] += calls[i].anno[j]; } return 0; } -int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b) +int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b, bcf_callret1_t *bcr, int is_SP, + const bcf_callaux_t *bca, const char *ref) { + extern double kt_fisher_exact(int n11, int n12, int n21, int n22, double *_left, double *_right, double *two); kstring_t s; - int i; + int i, j; b->n_smpl = bc->n; b->tid = tid; b->pos = pos; b->qual = 0; s.s = b->str; s.m = b->m_str; s.l = 0; kputc('\0', &s); - kputc("ACGTN"[bc->ori_ref], &s); kputc('\0', &s); - for (i = 1; i < 5; ++i) { - if (bc->a[i] < 0) break; - if (i > 1) kputc(',', &s); - kputc(bc->unseen == i? 'X' : "ACGT"[bc->a[i]], &s); + if (bc->ori_ref < 0) { // an indel + // write REF + kputc(ref[pos], &s); + for (j = 0; j < bca->indelreg; ++j) kputc(ref[pos+1+j], &s); + kputc('\0', &s); + // write ALT + kputc(ref[pos], &s); + for (i = 1; i < 4; ++i) { + if (bc->a[i] < 0) break; + if (i > 1) { + kputc(',', &s); kputc(ref[pos], &s); + } + if (bca->indel_types[bc->a[i]] < 0) { // deletion + for (j = -bca->indel_types[bc->a[i]]; j < bca->indelreg; ++j) + kputc(ref[pos+1+j], &s); + } else { // insertion; cannot be a reference unless a bug + char *inscns = &bca->inscns[bc->a[i] * bca->maxins]; + for (j = 0; j < bca->indel_types[bc->a[i]]; ++j) + kputc("ACGTN"[(int)inscns[j]], &s); + for (j = 0; j < bca->indelreg; ++j) kputc(ref[pos+1+j], &s); + } + } + kputc('\0', &s); + } else { // a SNP + kputc("ACGTN"[bc->ori_ref], &s); kputc('\0', &s); + for (i = 1; i < 5; ++i) { + if (bc->a[i] < 0) break; + if (i > 1) kputc(',', &s); + kputc(bc->unseen == i? 'X' : "ACGT"[bc->a[i]], &s); + } + kputc('\0', &s); } kputc('\0', &s); - kputc('\0', &s); // INFO - kputs("I16=", &s); + if (bc->ori_ref < 0) kputs("INDEL;", &s); + kputs("DP=", &s); kputw(bc->ori_depth, &s); kputs(";I16=", &s); for (i = 0; i < 16; ++i) { if (i) kputc(',', &s); kputw(bc->anno[i], &s); } kputc('\0', &s); // FMT - kputs("PL", &s); kputc('\0', &s); + kputs("PL", &s); + if (bcr) { + kputs(":DP", &s); + if (is_SP) kputs(":SP", &s); + } + kputc('\0', &s); b->m_str = s.m; b->str = s.s; b->l_str = s.l; bcf_sync(b); memcpy(b->gi[0].data, bc->PL, b->gi[0].len * bc->n); + if (bcr) { + uint16_t *dp = (uint16_t*)b->gi[1].data; + uint8_t *sp = is_SP? b->gi[2].data : 0; + for (i = 0; i < bc->n; ++i) { + bcf_callret1_t *p = bcr + i; + dp[i] = p->depth < 0xffff? p->depth : 0xffff; + if (is_SP) { + if (p->anno[0] + p->anno[1] < 2 || p->anno[2] + p->anno[3] < 2 + || p->anno[0] + p->anno[2] < 2 || p->anno[1] + p->anno[3] < 2) + { + sp[i] = 0; + } else { + double left, right, two; + int x; + kt_fisher_exact(p->anno[0], p->anno[1], p->anno[2], p->anno[3], &left, &right, &two); + x = (int)(-4.343 * log(two) + .499); + if (x > 255) x = 255; + sp[i] = x; + } + } + } + } return 0; } diff --git a/bam2bcf.h b/bam2bcf.h index 0dddb92..26b022c 100644 --- a/bam2bcf.h +++ b/bam2bcf.h @@ -2,13 +2,26 @@ #define BAM2BCF_H #include +#include "errmod.h" #include "bcftools/bcf.h" -struct __bcf_callaux_t; -typedef struct __bcf_callaux_t bcf_callaux_t; +#define B2B_INDEL_NULL 10000 + +typedef struct __bcf_callaux_t { + int capQ, min_baseQ; + int openQ, extQ, tandemQ; + // for internal uses + int max_bases; + int indel_types[4]; + int maxins, indelreg; + char *inscns; + uint16_t *bases; + errmod_t *e; + void *rghash; +} bcf_callaux_t; typedef struct { - int depth, qsum[4]; + int depth, ori_depth, qsum[4]; int anno[16]; float p[25]; } bcf_callret1_t; @@ -16,7 +29,7 @@ typedef struct { typedef struct { int a[5]; // alleles: ref, alt, alt2, alt3 int n, n_alleles, shift, ori_ref, unseen; - int anno[16], depth; + int anno[16], depth, ori_depth; uint8_t *PL; } bcf_call_t; @@ -26,9 +39,12 @@ extern "C" { bcf_callaux_t *bcf_call_init(double theta, int min_baseQ); void bcf_call_destroy(bcf_callaux_t *bca); - int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf_callaux_t *bca, bcf_callret1_t *r); + int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base, bcf_callaux_t *bca, bcf_callret1_t *r); int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/, bcf_call_t *call); - int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b); + int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b, bcf_callret1_t *bcr, int is_SP, + const bcf_callaux_t *bca, const char *ref); + int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_callaux_t *bca, const char *ref, + const void *rghash); #ifdef __cplusplus } diff --git a/bam2bcf_indel.c b/bam2bcf_indel.c new file mode 100644 index 0000000..819c1e7 --- /dev/null +++ b/bam2bcf_indel.c @@ -0,0 +1,410 @@ +#include +#include +#include +#include "bam.h" +#include "bam2bcf.h" +#include "ksort.h" +#include "kaln.h" +#include "kprobaln.h" +#include "khash.h" +KHASH_SET_INIT_STR(rg) + +#define MINUS_CONST 0x10000000 +#define INDEL_WINDOW_SIZE 50 +#define MIN_SUPPORT_COEF 500 + +void *bcf_call_add_rg(void *_hash, const char *hdtext, const char *list) +{ + const char *s, *p, *q, *r, *t; + khash_t(rg) *hash; + if (list == 0 || hdtext == 0) return _hash; + if (_hash == 0) _hash = kh_init(rg); + hash = (khash_t(rg)*)_hash; + if ((s = strstr(hdtext, "@RG\t")) == 0) return hash; + do { + t = strstr(s + 4, "@RG\t"); // the next @RG + if ((p = strstr(s, "\tID:")) != 0) p += 4; + if ((q = strstr(s, "\tPL:")) != 0) q += 4; + if (p && q && (t == 0 || (p < t && q < t))) { // ID and PL are both present + int lp, lq; + char *x; + for (r = p; *r && *r != '\t' && *r != '\n'; ++r); lp = r - p; + for (r = q; *r && *r != '\t' && *r != '\n'; ++r); lq = r - q; + x = calloc((lp > lq? lp : lq) + 1, 1); + for (r = q; *r && *r != '\t' && *r != '\n'; ++r) x[r-q] = *r; + if (strstr(list, x)) { // insert ID to the hash table + khint_t k; + int ret; + for (r = p; *r && *r != '\t' && *r != '\n'; ++r) x[r-p] = *r; + x[r-p] = 0; + k = kh_get(rg, hash, x); + if (k == kh_end(hash)) k = kh_put(rg, hash, x, &ret); + else free(x); + } else free(x); + } + s = t; + } while (s); + return hash; +} + +void bcf_call_del_rghash(void *_hash) +{ + khint_t k; + khash_t(rg) *hash = (khash_t(rg)*)_hash; + if (hash == 0) return; + for (k = kh_begin(hash); k < kh_end(hash); ++k) + if (kh_exist(hash, k)) + free((char*)kh_key(hash, k)); + kh_destroy(rg, hash); +} + +static int tpos2qpos(const bam1_core_t *c, const uint32_t *cigar, int32_t tpos, int is_left, int32_t *_tpos) +{ + int k, x = c->pos, y = 0, last_y = 0; + *_tpos = c->pos; + for (k = 0; k < c->n_cigar; ++k) { + int op = cigar[k] & BAM_CIGAR_MASK; + int l = cigar[k] >> BAM_CIGAR_SHIFT; + if (op == BAM_CMATCH) { + if (c->pos > tpos) return y; + if (x + l > tpos) { + *_tpos = tpos; + return y + (tpos - x); + } + x += l; y += l; + last_y = y; + } else if (op == BAM_CINS || op == BAM_CSOFT_CLIP) y += l; + else if (op == BAM_CDEL || op == BAM_CREF_SKIP) { + if (x + l > tpos) { + *_tpos = is_left? x : x + l; + return y; + } + x += l; + } + } + *_tpos = x; + return last_y; +} +// FIXME: check if the inserted sequence is consistent with the homopolymer run +// l is the relative gap length and l_run is the length of the homopolymer on the reference +static inline int est_seqQ(const bcf_callaux_t *bca, int l, int l_run) +{ + int q, qh; + q = bca->openQ + bca->extQ * (abs(l) - 1); + qh = l_run >= 3? (int)(bca->tandemQ * (double)abs(l) / l_run + .499) : 1000; + return q < qh? q : qh; +} + +static inline int est_indelreg(int pos, const char *ref, int l, char *ins4) +{ + int i, j, max = 0, max_i = pos, score = 0; + l = abs(l); + for (i = pos + 1, j = 0; ref[i]; ++i, ++j) { + if (ins4) score += (toupper(ref[i]) != "ACGTN"[(int)ins4[j%l]])? -10 : 1; + else score += (toupper(ref[i]) != toupper(ref[pos+1+j%l]))? -10 : 1; + if (score < 0) break; + if (max < score) max = score, max_i = i; + } + return max_i - pos; +} + +int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_callaux_t *bca, const char *ref, + const void *rghash) +{ + extern void ks_introsort_uint32_t(int, uint32_t*); + int i, s, j, k, t, n_types, *types, max_rd_len, left, right, max_ins, *score1, *score2; + int N, K, l_run, ref_type, n_alt; + char *inscns = 0, *ref2, *query; + khash_t(rg) *hash = (khash_t(rg)*)rghash; + if (ref == 0 || bca == 0) return -1; + // mark filtered reads + if (rghash) { + N = 0; + for (s = N = 0; s < n; ++s) { + for (i = 0; i < n_plp[s]; ++i) { + bam_pileup1_t *p = plp[s] + i; + const uint8_t *rg = bam_aux_get(p->b, "RG"); + p->aux = 1; // filtered by default + if (rg) { + khint_t k = kh_get(rg, hash, (const char*)(rg + 1)); + if (k != kh_end(hash)) p->aux = 0, ++N; // not filtered + } + } + } + if (N == 0) return -1; // no reads left + } + // determine if there is a gap + for (s = N = 0; s < n; ++s) { + for (i = 0; i < n_plp[s]; ++i) + if (plp[s][i].indel != 0) break; + if (i < n_plp[s]) break; + } + if (s == n) return -1; // there is no indel at this position. + for (s = N = 0; s < n; ++s) N += n_plp[s]; // N is the total number of reads + { // find out how many types of indels are present + int m, n_alt = 0, n_tot = 0; + uint32_t *aux; + aux = calloc(N + 1, 4); + m = max_rd_len = 0; + aux[m++] = MINUS_CONST; // zero indel is always a type + for (s = 0; s < n; ++s) { + for (i = 0; i < n_plp[s]; ++i) { + const bam_pileup1_t *p = plp[s] + i; + if (rghash == 0 || p->aux == 0) { + ++n_tot; + if (p->indel != 0) { + ++n_alt; + aux[m++] = MINUS_CONST + p->indel; + } + } + j = bam_cigar2qlen(&p->b->core, bam1_cigar(p->b)); + if (j > max_rd_len) max_rd_len = j; + } + } + ks_introsort(uint32_t, m, aux); + // squeeze out identical types + for (i = 1, n_types = 1; i < m; ++i) + if (aux[i] != aux[i-1]) ++n_types; + if (n_types == 1 || n_alt * MIN_SUPPORT_COEF < n_tot) { // no indels or too few supporting reads + free(aux); return -1; + } + types = (int*)calloc(n_types, sizeof(int)); + t = 0; + types[t++] = aux[0] - MINUS_CONST; + for (i = 1; i < m; ++i) + if (aux[i] != aux[i-1]) + types[t++] = aux[i] - MINUS_CONST; + free(aux); + for (t = 0; t < n_types; ++t) + if (types[t] == 0) break; + ref_type = t; // the index of the reference type (0) + assert(n_types < 64); + } + { // calculate left and right boundary + left = pos > INDEL_WINDOW_SIZE? pos - INDEL_WINDOW_SIZE : 0; + right = pos + INDEL_WINDOW_SIZE; + if (types[0] < 0) right -= types[0]; + // in case the alignments stand out the reference + for (i = pos; i < right; ++i) + if (ref[i] == 0) break; + right = i; + } + { // the length of the homopolymer run around the current position + int c = bam_nt16_table[(int)ref[pos + 1]]; + if (c == 15) l_run = 1; + else { + for (i = pos + 2; ref[i]; ++i) + if (bam_nt16_table[(int)ref[i]] != c) break; + l_run = i; + for (i = pos; i >= 0; --i) + if (bam_nt16_table[(int)ref[i]] != c) break; + l_run -= i + 1; + } + } + // construct the consensus sequence + max_ins = types[n_types - 1]; // max_ins is at least 0 + if (max_ins > 0) { + int *inscns_aux = calloc(4 * n_types * max_ins, sizeof(int)); + // count the number of occurrences of each base at each position for each type of insertion + for (t = 0; t < n_types; ++t) { + if (types[t] > 0) { + for (s = 0; s < n; ++s) { + for (i = 0; i < n_plp[s]; ++i) { + bam_pileup1_t *p = plp[s] + i; + if (p->indel == types[t]) { + uint8_t *seq = bam1_seq(p->b); + for (k = 1; k <= p->indel; ++k) { + int c = bam_nt16_nt4_table[bam1_seqi(seq, p->qpos + k)]; + if (c < 4) ++inscns_aux[(t*max_ins+(k-1))*4 + c]; + } + } + } + } + } + } + // use the majority rule to construct the consensus + inscns = calloc(n_types * max_ins, 1); + for (t = 0; t < n_types; ++t) { + for (j = 0; j < types[t]; ++j) { + int max = 0, max_k = -1, *ia = &inscns_aux[(t*max_ins+j)*4]; + for (k = 0; k < 4; ++k) + if (ia[k] > max) + max = ia[k], max_k = k; + inscns[t*max_ins + j] = max? max_k : 4; + } + } + free(inscns_aux); + } + // compute the likelihood given each type of indel for each read + ref2 = calloc(right - left + max_ins + 2, 1); + query = calloc(right - left + max_rd_len + max_ins + 2, 1); + score1 = calloc(N * n_types, sizeof(int)); + score2 = calloc(N * n_types, sizeof(int)); + bca->indelreg = 0; + for (t = 0; t < n_types; ++t) { + int l, ir; + kpa_par_t apf1 = { 1e-4, 1e-2, 10 }, apf2 = { 1e-6, 1e-3, 10 }; + apf1.bw = apf2.bw = abs(types[t]) + 3; + // compute indelreg + if (types[t] == 0) ir = 0; + else if (types[t] > 0) ir = est_indelreg(pos, ref, types[t], &inscns[t*max_ins]); + else ir = est_indelreg(pos, ref, -types[t], 0); + if (ir > bca->indelreg) bca->indelreg = ir; +// fprintf(stderr, "%d, %d, %d\n", pos, types[t], ir); + // write ref2 + for (k = 0, j = left; j <= pos; ++j) + ref2[k++] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[j]]]; + if (types[t] <= 0) j += -types[t]; + else for (l = 0; l < types[t]; ++l) + ref2[k++] = inscns[t*max_ins + l]; + if (types[0] < 0) { // mask deleted sequences to avoid a particular error in the model. + int jj, tmp = types[t] >= 0? -types[0] : -types[0] + types[t]; + for (jj = 0; jj < tmp && j < right && ref[j]; ++jj, ++j) + ref2[k++] = 4; + } + for (; j < right && ref[j]; ++j) + ref2[k++] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[j]]]; + if (j < right) right = j; + // align each read to ref2 + for (s = K = 0; s < n; ++s) { + for (i = 0; i < n_plp[s]; ++i, ++K) { + bam_pileup1_t *p = plp[s] + i; + int qbeg, qend, tbeg, tend, sc; + uint8_t *seq = bam1_seq(p->b); + // FIXME: the following skips soft clips, but using them may be more sensitive. + // determine the start and end of sequences for alignment + qbeg = tpos2qpos(&p->b->core, bam1_cigar(p->b), left, 0, &tbeg); + qend = tpos2qpos(&p->b->core, bam1_cigar(p->b), right, 1, &tend); + if (types[t] < 0) { + int l = -types[t]; + tbeg = tbeg - l > left? tbeg - l : left; + } + // write the query sequence + for (l = qbeg; l < qend; ++l) + query[l - qbeg] = bam_nt16_nt4_table[bam1_seqi(seq, l)]; + { // do realignment; this is the bottleneck + const uint8_t *qual = bam1_qual(p->b), *bq; + uint8_t *qq; + qq = calloc(qend - qbeg, 1); + bq = (uint8_t*)bam_aux_get(p->b, "ZQ"); + if (bq) ++bq; // skip type + for (l = qbeg; l < qend; ++l) { + qq[l - qbeg] = bq? qual[l] + (bq[l] - 64) : qual[l]; + if (qq[l - qbeg] > 30) qq[l - qbeg] = 30; + if (qq[l - qbeg] < 7) qq[l - qbeg] = 7; + } + sc = kpa_glocal((uint8_t*)ref2 + tbeg - left, tend - tbeg + abs(types[t]), + (uint8_t*)query, qend - qbeg, qq, &apf1, 0, 0); + l = (int)(100. * sc / (qend - qbeg) + .499); // used for adjusting indelQ below + if (l > 255) l = 255; + score1[K*n_types + t] = score2[K*n_types + t] = sc<<8 | l; + if (sc > 5) { + sc = kpa_glocal((uint8_t*)ref2 + tbeg - left, tend - tbeg + abs(types[t]), + (uint8_t*)query, qend - qbeg, qq, &apf2, 0, 0); + l = (int)(100. * sc / (qend - qbeg) + .499); + if (l > 255) l = 255; + score2[K*n_types + t] = sc<<8 | l; + } + free(qq); + } +/* + for (l = 0; l < tend - tbeg + abs(types[t]); ++l) + fputc("ACGTN"[(int)ref2[tbeg-left+l]], stderr); + fputc('\n', stderr); + for (l = 0; l < qend - qbeg; ++l) fputc("ACGTN"[(int)query[l]], stderr); + fputc('\n', stderr); + fprintf(stderr, "pos=%d type=%d read=%d:%d name=%s qbeg=%d tbeg=%d score=%d\n", pos, types[t], s, i, bam1_qname(p->b), qbeg, tbeg, sc); +*/ + } + } + } + free(ref2); free(query); + { // compute indelQ + int *sc, tmp, *sumq; + sc = alloca(n_types * sizeof(int)); + sumq = alloca(n_types * sizeof(int)); + memset(sumq, 0, sizeof(int) * n_types); + for (s = K = 0; s < n; ++s) { + for (i = 0; i < n_plp[s]; ++i, ++K) { + bam_pileup1_t *p = plp[s] + i; + int *sct = &score1[K*n_types], indelQ1, indelQ2, seqQ, indelQ; + for (t = 0; t < n_types; ++t) sc[t] = sct[t]<<6 | t; + for (t = 1; t < n_types; ++t) // insertion sort + for (j = t; j > 0 && sc[j] < sc[j-1]; --j) + tmp = sc[j], sc[j] = sc[j-1], sc[j-1] = tmp; + /* errmod_cal() assumes that if the call is wrong, the + * likelihoods of other events are equal. This is about + * right for substitutions, but is not desired for + * indels. To reuse errmod_cal(), I have to make + * compromise for multi-allelic indels. + */ + if ((sc[0]&0x3f) == ref_type) { + indelQ1 = (sc[1]>>14) - (sc[0]>>14); + seqQ = est_seqQ(bca, types[sc[1]&0x3f], l_run); + } else { + for (t = 0; t < n_types; ++t) // look for the reference type + if ((sc[t]&0x3f) == ref_type) break; + indelQ1 = (sc[t]>>14) - (sc[0]>>14); + seqQ = est_seqQ(bca, types[sc[0]&0x3f], l_run); + } + tmp = sc[0]>>6 & 0xff; + indelQ1 = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ1 + .499); // reduce indelQ + sct = &score2[K*n_types]; + for (t = 0; t < n_types; ++t) sc[t] = sct[t]<<6 | t; + for (t = 1; t < n_types; ++t) // insertion sort + for (j = t; j > 0 && sc[j] < sc[j-1]; --j) + tmp = sc[j], sc[j] = sc[j-1], sc[j-1] = tmp; + if ((sc[0]&0x3f) == ref_type) { + indelQ2 = (sc[1]>>14) - (sc[0]>>14); + } else { + for (t = 0; t < n_types; ++t) // look for the reference type + if ((sc[t]&0x3f) == ref_type) break; + indelQ2 = (sc[t]>>14) - (sc[0]>>14); + } + tmp = sc[0]>>6 & 0xff; + indelQ2 = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ2 + .499); + // pick the smaller between indelQ1 and indelQ2 + indelQ = indelQ1 < indelQ2? indelQ1 : indelQ2; + p->aux = (sc[0]&0x3f)<<16 | seqQ<<8 | indelQ; + sumq[sc[0]&0x3f] += indelQ < seqQ? indelQ : seqQ; +// fprintf(stderr, "pos=%d read=%d:%d name=%s call=%d q=%d\n", pos, s, i, bam1_qname(p->b), types[sc[0]&0x3f], indelQ); + } + } + // determine bca->indel_types[] and bca->inscns + bca->maxins = max_ins; + bca->inscns = realloc(bca->inscns, bca->maxins * 4); + for (t = 0; t < n_types; ++t) + sumq[t] = sumq[t]<<6 | t; + for (t = 1; t < n_types; ++t) // insertion sort + for (j = t; j > 0 && sumq[j] > sumq[j-1]; --j) + tmp = sumq[j], sumq[j] = sumq[j-1], sumq[j-1] = tmp; + for (t = 0; t < n_types; ++t) // look for the reference type + if ((sumq[t]&0x3f) == ref_type) break; + if (t) { // then move the reference type to the first + tmp = sumq[t]; + for (; t > 0; --t) sumq[t] = sumq[t-1]; + sumq[0] = tmp; + } + for (t = 0; t < 4; ++t) bca->indel_types[t] = B2B_INDEL_NULL; + for (t = 0; t < 4 && t < n_types; ++t) { + bca->indel_types[t] = types[sumq[t]&0x3f]; + memcpy(&bca->inscns[t * bca->maxins], &inscns[(sumq[t]&0x3f) * max_ins], bca->maxins); + } + // update p->aux + for (s = n_alt = 0; s < n; ++s) { + for (i = 0; i < n_plp[s]; ++i) { + bam_pileup1_t *p = plp[s] + i; + int x = types[p->aux>>16&0x3f]; + for (j = 0; j < 4; ++j) + if (x == bca->indel_types[j]) break; + p->aux = j<<16 | (j == 4? 0 : (p->aux&0xffff)); + if ((p->aux>>16&0x3f) > 0) ++n_alt; +// fprintf(stderr, "X pos=%d read=%d:%d name=%s call=%d type=%d q=%d seqQ=%d\n", pos, s, i, bam1_qname(p->b), p->aux>>16&63, bca->indel_types[p->aux>>16&63], p->aux&0xff, p->aux>>8&0xff); + } + } + } + free(score1); free(score2); + // free + free(types); free(inscns); + return n_alt > 0? 0 : -1; +} diff --git a/bam_maqcns.c b/bam_maqcns.c index 2f3fc08..4fbc6c6 100644 --- a/bam_maqcns.c +++ b/bam_maqcns.c @@ -154,7 +154,7 @@ glf1_t *bam_maqcns_glfgen(int _n, const bam_pileup1_t *pl, uint8_t ref_base, bam const bam_pileup1_t *p = pl + i; uint32_t q, x = 0, qq; uint16_t y = 0; - if (p->is_del || (p->b->core.flag&BAM_FUNMAP)) continue; + if (p->is_del || p->is_refskip || (p->b->core.flag&BAM_FUNMAP)) continue; q = (uint32_t)bam1_qual(p->b)[p->qpos]; if (q < bm->min_baseQ) continue; x |= (uint32_t)bam1_strand(p->b) << 18 | q << 8 | p->b->core.qual; @@ -166,7 +166,7 @@ glf1_t *bam_maqcns_glfgen(int _n, const bam_pileup1_t *pl, uint8_t ref_base, bam y |= q << 5; qq = bam1_seqi(bam1_seq(p->b), p->qpos); q = bam_nt16_nt4_table[qq? qq : ref_base]; - if (!p->is_del && q < 4) x |= 1 << 21 | q << 16, y |= q; + if (!p->is_del && !p->is_refskip && q < 4) x |= 1 << 21 | q << 16, y |= q; bm->aux->info16[n] = y; bm->aux->info[n++] = x; } diff --git a/bam_md.c b/bam_md.c index f65b845..44d46a4 100644 --- a/bam_md.c +++ b/bam_md.c @@ -7,6 +7,7 @@ #include "sam.h" #include "kstring.h" #include "kaln.h" +#include "kprobaln.h" void bam_fillmd1_core(bam1_t *b, char *ref, int is_equal, int max_nm) { @@ -161,14 +162,35 @@ int bam_cap_mapQ(bam1_t *b, char *ref, int thres) return (int)(t + .499); } -int bam_prob_realn(bam1_t *b, const char *ref) +int bam_prob_realn_core(bam1_t *b, const char *ref, int apply_baq) { int k, i, bw, x, y, yb, ye, xb, xe; uint32_t *cigar = bam1_cigar(b); bam1_core_t *c = &b->core; - ka_probpar_t conf = ka_probpar_def; - // find the start and end of the alignment - if (c->flag & BAM_FUNMAP) return -1; + kpa_par_t conf = kpa_par_def; + uint8_t *bq = 0, *zq = 0, *qual = bam1_qual(b); + if (c->flag & BAM_FUNMAP) return -1; // do nothing + // test if BQ or ZQ is present + if ((bq = bam_aux_get(b, "BQ")) != 0) ++bq; + if ((zq = bam_aux_get(b, "ZQ")) != 0 && *zq == 'Z') ++zq; + if (bq && zq) { // remove the ZQ tag + bam_aux_del(b, zq-1); + zq = 0; + } + if (bq || zq) { + if ((apply_baq && zq) || (!apply_baq && bq)) return -3; // in both cases, do nothing + if (bq && apply_baq) { // then convert BQ to ZQ + for (i = 0; i < c->l_qseq; ++i) + qual[i] = qual[i] + 64 < bq[i]? 0 : qual[i] - ((int)bq[i] - 64); + *(bq - 3) = 'Z'; + } else if (zq && !apply_baq) { // then convert ZQ to BQ + for (i = 0; i < c->l_qseq; ++i) + qual[i] += (int)zq[i] - 64; + *(zq - 3) = 'B'; + } + return 0; + } + // find the start and end of the alignment x = c->pos, y = 0, yb = ye = xb = xe = -1; for (k = 0; k < c->n_cigar; ++k) { int op, l; @@ -180,7 +202,7 @@ int bam_prob_realn(bam1_t *b, const char *ref) x += l; y += l; } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l; else if (op == BAM_CDEL) x += l; - else if (op == BAM_CREF_SKIP) return -1; + else if (op == BAM_CREF_SKIP) return -1; // do nothing if there is a reference skip } // set bandwidth and the start and the end bw = 7; @@ -192,8 +214,10 @@ int bam_prob_realn(bam1_t *b, const char *ref) if (xe - xb - c->l_qseq > bw) xb += (xe - xb - c->l_qseq - bw) / 2, xe -= (xe - xb - c->l_qseq - bw) / 2; { // glocal - uint8_t *s, *r, *q, *seq = bam1_seq(b), *qual = bam1_qual(b); + uint8_t *s, *r, *q, *seq = bam1_seq(b), *bq; int *state; + bq = calloc(c->l_qseq + 1, 1); + memcpy(bq, qual, c->l_qseq); s = calloc(c->l_qseq, 1); for (i = 0; i < c->l_qseq; ++i) s[i] = bam_nt16_nt4_table[bam1_seqi(seq, i)]; r = calloc(xe - xb, 1); @@ -201,35 +225,45 @@ int bam_prob_realn(bam1_t *b, const char *ref) r[i-xb] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[i]]]; state = calloc(c->l_qseq, sizeof(int)); q = calloc(c->l_qseq, 1); - ka_prob_glocal(r, xe-xb, s, c->l_qseq, qual, &conf, state, q); + kpa_glocal(r, xe-xb, s, c->l_qseq, qual, &conf, state, q); for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) { int op = cigar[k]&0xf, l = cigar[k]>>4; if (op == BAM_CMATCH) { for (i = y; i < y + l; ++i) { - if ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y)) qual[i] = 0; - else qual[i] = qual[i] < q[i]? qual[i] : q[i]; + if ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y)) bq[i] = 0; + else bq[i] = bq[i] < q[i]? bq[i] : q[i]; } x += l; y += l; } else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l; else if (op == BAM_CDEL) x += l; } - free(s); free(r); free(q); free(state); + for (i = 0; i < c->l_qseq; ++i) bq[i] = qual[i] - bq[i] + 64; // finalize BQ + if (apply_baq) { + for (i = 0; i < c->l_qseq; ++i) qual[i] -= bq[i] - 64; // modify qual + bam_aux_append(b, "ZQ", 'Z', c->l_qseq + 1, bq); + } else bam_aux_append(b, "BQ", 'Z', c->l_qseq + 1, bq); + free(bq); free(s); free(r); free(q); free(state); } return 0; } +int bam_prob_realn(bam1_t *b, const char *ref) +{ + return bam_prob_realn_core(b, ref, 1); +} + int bam_fillmd(int argc, char *argv[]) { - int c, is_equal = 0, tid = -2, ret, len, is_bam_out, is_sam_in, is_uncompressed, max_nm = 0, is_realn, capQ = 0; + int c, is_equal, tid = -2, ret, len, is_bam_out, is_sam_in, is_uncompressed, max_nm, is_realn, capQ, apply_baq; samfile_t *fp, *fpout = 0; faidx_t *fai; char *ref = 0, mode_w[8], mode_r[8]; bam1_t *b; - is_bam_out = is_sam_in = is_uncompressed = is_realn = 0; + is_equal = is_bam_out = is_sam_in = is_uncompressed = is_realn = max_nm = apply_baq = capQ = 0; mode_w[0] = mode_r[0] = 0; strcpy(mode_r, "r"); strcpy(mode_w, "w"); - while ((c = getopt(argc, argv, "reubSC:n:")) >= 0) { + while ((c = getopt(argc, argv, "reubSC:n:A")) >= 0) { switch (c) { case 'r': is_realn = 1; break; case 'e': is_equal = 1; break; @@ -238,6 +272,7 @@ int bam_fillmd(int argc, char *argv[]) case 'S': is_sam_in = 1; break; case 'n': max_nm = atoi(optarg); break; case 'C': capQ = atoi(optarg); break; + case 'A': apply_baq = 1; break; default: fprintf(stderr, "[bam_fillmd] unrecognized option '-%c'\n", c); return 1; } } @@ -252,6 +287,7 @@ int bam_fillmd(int argc, char *argv[]) fprintf(stderr, " -u uncompressed BAM output (for piping)\n"); fprintf(stderr, " -b compressed BAM output\n"); fprintf(stderr, " -S the input is SAM with header\n"); + fprintf(stderr, " -A modify the quality string\n"); fprintf(stderr, " -r read-independent local realignment\n\n"); return 1; } @@ -275,8 +311,7 @@ int bam_fillmd(int argc, char *argv[]) fprintf(stderr, "[bam_fillmd] fail to find sequence '%s' in the reference.\n", fp->header->target_name[tid]); } -// if (is_realn) bam_realn(b, ref); - if (is_realn) bam_prob_realn(b, ref); + if (is_realn) bam_prob_realn_core(b, ref, apply_baq); if (capQ > 10) { int q = bam_cap_mapQ(b, ref, capQ); if (b->core.qual > q) b->core.qual = q; diff --git a/bam_pileup.c b/bam_pileup.c index a49f795..55e51e2 100644 --- a/bam_pileup.c +++ b/bam_pileup.c @@ -152,7 +152,7 @@ struct __bam_plp_t { mempool_t *mp; lbnode_t *head, *tail, *dummy; int32_t tid, pos, max_tid, max_pos; - int is_eof, flag_mask, max_plp, error; + int is_eof, flag_mask, max_plp, error, maxcnt; bam_pileup1_t *plp; // for the "auto" interface only bam1_t *b; @@ -169,6 +169,7 @@ bam_plp_t bam_plp_init(bam_plp_auto_f func, void *data) iter->dummy = mp_alloc(iter->mp); iter->max_tid = iter->max_pos = -1; iter->flag_mask = BAM_DEF_MASK; + iter->maxcnt = 8000; if (func) { iter->func = func; iter->data = data; @@ -240,6 +241,7 @@ int bam_plp_push(bam_plp_t iter, const bam1_t *b) if (b) { if (b->core.tid < 0) return 0; if (b->core.flag & iter->flag_mask) return 0; + if (iter->tid == b->core.tid && iter->pos == b->core.pos && iter->mp->cnt > iter->maxcnt) return 0; bam_copy1(&iter->tail->b, b); iter->tail->beg = b->core.pos; iter->tail->end = bam_calend(&b->core, bam1_cigar(b)); iter->tail->s = g_cstate_null; iter->tail->s.end = iter->tail->end - 1; // initialize cstate_t @@ -267,7 +269,7 @@ const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_ const bam_pileup1_t *plp; if (iter->func == 0 || iter->error) { *_n_plp = -1; return 0; } if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp; - else { + else { // no pileup line can be obtained; read alignments *_n_plp = 0; if (iter->is_eof) return 0; while (iter->func(iter->data, iter->b) >= 0) { @@ -276,6 +278,7 @@ const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_ return 0; } if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp; + // otherwise no pileup line can be returned; read the next alignment. } bam_plp_push(iter, 0); if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp; @@ -302,6 +305,11 @@ void bam_plp_set_mask(bam_plp_t iter, int mask) iter->flag_mask = mask < 0? BAM_DEF_MASK : (BAM_FUNMAP | mask); } +void bam_plp_set_maxcnt(bam_plp_t iter, int maxcnt) +{ + iter->maxcnt = maxcnt; +} + /***************** * callback APIs * *****************/ @@ -389,6 +397,13 @@ bam_mplp_t bam_mplp_init(int n, bam_plp_auto_f func, void **data) return iter; } +void bam_mplp_set_maxcnt(bam_mplp_t iter, int maxcnt) +{ + int i; + for (i = 0; i < iter->n; ++i) + iter->iter[i]->maxcnt = maxcnt; +} + void bam_mplp_destroy(bam_mplp_t iter) { int i; diff --git a/bam_plcmd.c b/bam_plcmd.c index dca8518..e3f73aa 100644 --- a/bam_plcmd.c +++ b/bam_plcmd.c @@ -2,6 +2,8 @@ #include #include #include +#include +#include #include "sam.h" #include "faidx.h" #include "bam_maqcns.h" @@ -528,10 +530,14 @@ int bam_pileup(int argc, char *argv[]) #define MPLP_NO_COMP 0x20 #define MPLP_NO_ORPHAN 0x40 #define MPLP_REALN 0x80 +#define MPLP_FMT_DP 0x100 +#define MPLP_FMT_SP 0x200 +#define MPLP_NO_INDEL 0x400 typedef struct { - int max_mq, min_mq, flag, min_baseQ, capQ_thres; - char *reg, *fn_pos; + int max_mq, min_mq, flag, min_baseQ, capQ_thres, max_depth; + int openQ, extQ, tandemQ; + char *reg, *fn_pos, *pl_list; faidx_t *fai; kh_64_t *hash; } mplp_conf_t; @@ -552,7 +558,7 @@ typedef struct { static int mplp_func(void *data, bam1_t *b) { extern int bam_realn(bam1_t *b, const char *ref); - extern int bam_prob_realn(bam1_t *b, const char *ref); + extern int bam_prob_realn_core(bam1_t *b, const char *ref, int); extern int bam_cap_mapQ(bam1_t *b, char *ref, int thres); mplp_aux_t *ma = (mplp_aux_t*)data; int ret, skip = 0; @@ -561,7 +567,7 @@ static int mplp_func(void *data, bam1_t *b) ret = ma->iter? bam_iter_read(ma->fp, ma->iter, b) : bam_read1(ma->fp, b); if (ret < 0) break; skip = 0; - if (has_ref && (ma->flag&MPLP_REALN)) bam_prob_realn(b, ma->ref); + if (has_ref && (ma->flag&MPLP_REALN)) bam_prob_realn_core(b, ma->ref, 1); if (has_ref && ma->capQ_thres > 10) { int q = bam_cap_mapQ(b, ma->ref, ma->capQ_thres); if (q < 0) skip = 1; @@ -598,13 +604,16 @@ static void group_smpl(mplp_pileup_t *m, bam_sample_t *sm, kstring_t *buf, static int mpileup(mplp_conf_t *conf, int n, char **fn) { + extern void *bcf_call_add_rg(void *rghash, const char *hdtext, const char *list); + extern void bcf_call_del_rghash(void *rghash); mplp_aux_t **data; - int i, tid, pos, *n_plp, beg0 = 0, end0 = 1u<<29, ref_len, ref_tid; + int i, tid, pos, *n_plp, beg0 = 0, end0 = 1u<<29, ref_len, ref_tid, max_depth; const bam_pileup1_t **plp; bam_mplp_t iter; bam_header_t *h = 0; char *ref; khash_t(64) *hash = 0; + void *rghash = 0; bcf_callaux_t *bca = 0; bcf_callret1_t *bcr = 0; @@ -634,6 +643,7 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn) data[i]->fp = strcmp(fn[i], "-") == 0? bam_dopen(fileno(stdin), "r") : bam_open(fn[i], "r"); h_tmp = bam_header_read(data[i]->fp); bam_smpl_add(sm, fn[i], h_tmp->text); + rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list); if (conf->reg) { int beg, end; bam_index_t *idx; @@ -689,9 +699,19 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn) bcf_hdr_write(bp, bh); bca = bcf_call_init(-1., conf->min_baseQ); bcr = calloc(sm->n, sizeof(bcf_callret1_t)); + bca->rghash = rghash; + bca->openQ = conf->openQ, bca->extQ = conf->extQ, bca->tandemQ = conf->tandemQ; } ref_tid = -1; ref = 0; iter = bam_mplp_init(n, mplp_func, (void**)data); + max_depth = conf->max_depth; + if (max_depth * sm->n > 1<<20) + fprintf(stderr, "(%s) Max depth is above 1M. Potential memory hog!\n", __func__); + if (max_depth * sm->n < 8000) { + max_depth = 8000 / sm->n; + fprintf(stderr, "<%s> Set max per-sample depth to %d\n", __func__, max_depth); + } + bam_mplp_set_maxcnt(iter, max_depth); while (bam_mplp_auto(iter, &tid, &pos, n_plp, plp) > 0) { if (conf->reg && (pos < beg0 || pos >= end0)) continue; // out of the region requested if (hash) { @@ -714,9 +734,22 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn) for (i = 0; i < gplp.n; ++i) bcf_call_glfgen(gplp.n_plp[i], gplp.plp[i], ref16, bca, bcr + i); bcf_call_combine(gplp.n, bcr, ref16, &bc); - bcf_call2bcf(tid, pos, &bc, b); + bcf_call2bcf(tid, pos, &bc, b, (conf->flag&(MPLP_FMT_DP|MPLP_FMT_SP))? bcr : 0, + (conf->flag&MPLP_FMT_SP), 0, 0); bcf_write(bp, bh, b); bcf_destroy(b); + // call indels + if (!(conf->flag&MPLP_NO_INDEL) && bcf_call_gap_prep(gplp.n, gplp.n_plp, gplp.plp, pos, bca, ref, rghash) >= 0) { + for (i = 0; i < gplp.n; ++i) + bcf_call_glfgen(gplp.n_plp[i], gplp.plp[i], -1, bca, bcr + i); + if (bcf_call_combine(gplp.n, bcr, -1, &bc) >= 0) { + b = calloc(1, sizeof(bcf1_t)); + bcf_call2bcf(tid, pos, &bc, b, (conf->flag&(MPLP_FMT_DP|MPLP_FMT_SP))? bcr : 0, + (conf->flag&MPLP_FMT_SP), bca, ref); + bcf_write(bp, bh, b); + bcf_destroy(b); + } + } } else { printf("%s\t%d\t%c", h->target_name[tid], pos + 1, (ref && pos < ref_len)? ref[pos] : 'N'); for (i = 0; i < n; ++i) { @@ -743,6 +776,7 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn) bam_smpl_destroy(sm); free(buf.s); for (i = 0; i < gplp.n; ++i) free(gplp.plp[i]); free(gplp.plp); free(gplp.n_plp); free(gplp.m_plp); + bcf_call_del_rghash(rghash); if (hash) { // free the hash table khint_t k; for (k = kh_begin(hash); k < kh_end(hash); ++k) @@ -761,33 +795,96 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn) return 0; } +#define MAX_PATH_LEN 1024 +int read_file_list(const char *file_list,int *n,char **argv[]) +{ + char buf[MAX_PATH_LEN]; + int len, nfiles; + char **files; + + FILE *fh = fopen(file_list,"r"); + if ( !fh ) + { + fprintf(stderr,"%s: %s\n", file_list,strerror(errno)); + return 1; + } + + // Speed is not an issue here, determine the number of files by reading the file twice + nfiles = 0; + while ( fgets(buf,MAX_PATH_LEN,fh) ) nfiles++; + + if ( fseek(fh, 0L, SEEK_SET) ) + { + fprintf(stderr,"%s: %s\n", file_list,strerror(errno)); + return 1; + } + + files = calloc(nfiles,sizeof(char*)); + nfiles = 0; + while ( fgets(buf,MAX_PATH_LEN,fh) ) + { + len = strlen(buf); + while ( len>0 && isspace(buf[len-1]) ) len--; + if ( !len ) continue; + + files[nfiles] = malloc(sizeof(char)*(len+1)); + strncpy(files[nfiles],buf,len); + files[nfiles][len] = 0; + nfiles++; + } + fclose(fh); + if ( !nfiles ) + { + fprintf(stderr,"No files read from %s\n", file_list); + return 1; + } + *argv = files; + *n = nfiles; + return 0; +} +#undef MAX_PATH_LEN + int bam_mpileup(int argc, char *argv[]) { int c; + const char *file_list = NULL; + char **fn = NULL; + int nfiles = 0; mplp_conf_t mplp; memset(&mplp, 0, sizeof(mplp_conf_t)); mplp.max_mq = 60; mplp.min_baseQ = 13; mplp.capQ_thres = 0; + mplp.max_depth = 250; + mplp.openQ = 40; mplp.extQ = 20; mplp.tandemQ = 100; mplp.flag = MPLP_NO_ORPHAN | MPLP_REALN; - while ((c = getopt(argc, argv, "gf:r:l:M:q:Q:uaORC:B")) >= 0) { + while ((c = getopt(argc, argv, "gf:r:l:M:q:Q:uaORC:BDSd:b:P:o:e:h:I")) >= 0) { switch (c) { case 'f': mplp.fai = fai_load(optarg); if (mplp.fai == 0) return 1; break; + case 'd': mplp.max_depth = atoi(optarg); break; case 'r': mplp.reg = strdup(optarg); break; case 'l': mplp.fn_pos = strdup(optarg); break; + case 'P': mplp.pl_list = strdup(optarg); break; case 'g': mplp.flag |= MPLP_GLF; break; case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_GLF; break; case 'a': mplp.flag |= MPLP_NO_ORPHAN | MPLP_REALN; break; case 'B': mplp.flag &= ~MPLP_REALN & ~MPLP_NO_ORPHAN; break; case 'O': mplp.flag |= MPLP_NO_ORPHAN; break; case 'R': mplp.flag |= MPLP_REALN; break; + case 'D': mplp.flag |= MPLP_FMT_DP; break; + case 'S': mplp.flag |= MPLP_FMT_SP; break; + case 'I': mplp.flag |= MPLP_NO_INDEL; break; case 'C': mplp.capQ_thres = atoi(optarg); break; case 'M': mplp.max_mq = atoi(optarg); break; case 'q': mplp.min_mq = atoi(optarg); break; case 'Q': mplp.min_baseQ = atoi(optarg); break; + case 'b': file_list = optarg; break; + case 'o': mplp.openQ = atoi(optarg); break; + case 'e': mplp.extQ = atoi(optarg); break; + case 'h': mplp.tandemQ = atoi(optarg); break; } } if (argc == 1) { @@ -796,18 +893,35 @@ int bam_mpileup(int argc, char *argv[]) fprintf(stderr, "Options: -f FILE reference sequence file [null]\n"); fprintf(stderr, " -r STR region in which pileup is generated [null]\n"); fprintf(stderr, " -l FILE list of positions (format: chr pos) [null]\n"); + fprintf(stderr, " -b FILE list of input BAM files [null]\n"); fprintf(stderr, " -M INT cap mapping quality at INT [%d]\n", mplp.max_mq); fprintf(stderr, " -Q INT min base quality [%d]\n", mplp.min_baseQ); fprintf(stderr, " -q INT filter out alignment with MQ smaller than INT [%d]\n", mplp.min_mq); + fprintf(stderr, " -d INT max per-sample depth [%d]\n", mplp.max_depth); + fprintf(stderr, " -P STR comma separated list of platforms for indels [all]\n"); + fprintf(stderr, " -o INT Phred-scaled gap open sequencing error probability [%d]\n", mplp.openQ); + fprintf(stderr, " -e INT Phred-scaled gap extension seq error probability [%d]\n", mplp.extQ); + fprintf(stderr, " -h INT coefficient for homopolyer errors [%d]\n", mplp.tandemQ); fprintf(stderr, " -g generate BCF output\n"); fprintf(stderr, " -u do not compress BCF output\n"); fprintf(stderr, " -B disable BAQ computation\n"); + fprintf(stderr, " -D output per-sample DP\n"); + fprintf(stderr, " -S output per-sample SP (strand bias P-value, slow)\n"); + fprintf(stderr, " -I do not perform indel calling\n"); fprintf(stderr, "\n"); fprintf(stderr, "Notes: Assuming diploid individuals.\n\n"); return 1; } - mpileup(&mplp, argc - optind, argv + optind); - free(mplp.reg); + if ( file_list ) + { + if ( read_file_list(file_list,&nfiles,&fn) ) return 1; + mpileup(&mplp,nfiles,fn); + for (c=0; cfp = strcmp(fn, "-")? bgzf_open(fn, mode) : bgzf_fdopen(fileno(stdin), mode); } +#ifndef BCF_LITE b->fp->owned_file = 1; +#endif return b; } @@ -134,7 +136,9 @@ int bcf_sync(bcf1_t *b) b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2; } else if (b->gi[i].fmt == bcf_str2int("DP", 2) || b->gi[i].fmt == bcf_str2int("HQ", 2)) { b->gi[i].len = 2; - } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("GT", 2)) { + } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("GT", 2) + || b->gi[i].fmt == bcf_str2int("SP", 2)) + { b->gi[i].len = 1; } else if (b->gi[i].fmt == bcf_str2int("GL", 2)) { b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2 * 4; @@ -236,7 +240,7 @@ void bcf_fmt_core(const bcf_hdr_t *h, bcf1_t *b, kstring_t *s) } } else if (b->gi[i].fmt == bcf_str2int("DP", 2)) { kputw(((uint16_t*)b->gi[i].data)[j], s); - } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) { + } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) { kputw(((uint8_t*)b->gi[i].data)[j], s); } else if (b->gi[i].fmt == bcf_str2int("GT", 2)) { int y = ((uint8_t*)b->gi[i].data)[j]; @@ -304,3 +308,13 @@ int bcf_cpy(bcf1_t *r, const bcf1_t *b) memcpy(r->gi[i].data, b->gi[i].data, r->n_smpl * r->gi[i].len); return 0; } + +int bcf_is_indel(const bcf1_t *b) +{ + char *p; + if (strlen(b->ref) > 1) return 1; + for (p = b->alt; *p; ++p) + if (*p != ',' && p[1] != ',' && p[1] != '\0') + return 1; + return 0; +} diff --git a/bcftools/bcf.h b/bcftools/bcf.h index 3657895..f87ac1e 100644 --- a/bcftools/bcf.h +++ b/bcftools/bcf.h @@ -30,7 +30,19 @@ #include #include + +#ifndef BCF_LITE #include "bgzf.h" +typedef BGZF *bcfFile; +#else +typedef gzFile bcfFile; +#define bgzf_open(fn, mode) gzopen(fn, mode) +#define bgzf_fdopen(fd, mode) gzdopen(fd, mode) +#define bgzf_close(fp) gzclose(fp) +#define bgzf_read(fp, buf, len) gzread(fp, buf, len) +#define bgzf_write(fp, buf, len) +#define bgzf_flush(fp) +#endif /* A member in the structs below is said to "primary" if its content @@ -74,7 +86,7 @@ typedef struct { typedef struct { int is_vcf; // if the file in operation is a VCF void *v; // auxillary data structure for VCF - BGZF *fp; // file handler for BCF + bcfFile fp; // file handler for BCF } bcf_t; struct __bcf_idx_t; @@ -128,6 +140,8 @@ extern "C" { int bcf_shrink_alt(bcf1_t *b, int n); // convert GL to PL int bcf_gl2pl(bcf1_t *b); + // if the site is an indel + int bcf_is_indel(const bcf1_t *b); // string hash table void *bcf_build_refhash(bcf_hdr_t *h); diff --git a/bcftools/call1.c b/bcftools/call1.c index 2b28452..b0b14fc 100644 --- a/bcftools/call1.c +++ b/bcftools/call1.c @@ -25,11 +25,12 @@ KSTREAM_INIT(gzFile, gzread, 16384) #define VC_QCALL 1024 #define VC_CALL_GT 2048 #define VC_ADJLD 4096 +#define VC_NO_INDEL 8192 typedef struct { int flag, prior_type, n1; char *fn_list, *prior_file; - double theta, pref; + double theta, pref, indel_frac; } viewconf_t; khash_t(set64) *bcf_load_pos(const char *fn, bcf_hdr_t *_h) @@ -161,7 +162,8 @@ static int update_bcf1(int n_smpl, bcf1_t *b, const bcf_p1aux_t *pa, const bcf_p kputs(b->alt, &s); kputc('\0', &s); kputc('\0', &s); kputs(b->info, &s); if (b->info[0]) kputc(';', &s); - ksprintf(&s, "AF1=%.3lf;AFE=%.3lf", 1.-pr->f_em, 1.-pr->f_exp); +// ksprintf(&s, "AF1=%.4lg;AFE=%.4lg;CI95=%.4lg,%.4lg", 1.-pr->f_em, 1.-pr->f_exp, pr->cil, pr->cih); + ksprintf(&s, "AF1=%.4lg;CI95=%.4lg,%.4lg", 1.-pr->f_em, pr->cil, pr->cih); ksprintf(&s, ";DP4=%d,%d,%d,%d;MQ=%d", a.d[0], a.d[1], a.d[2], a.d[3], a.mq); if (a.is_tested) { if (pr->pc[0] >= 0.) ksprintf(&s, ";PC4=%lg,%lg,%lg,%lg", pr->pc[0], pr->pc[1], pr->pc[2], pr->pc[3]); @@ -199,6 +201,7 @@ double bcf_ld_freq(const bcf1_t *b0, const bcf1_t *b1, double f[4]); int bcfview(int argc, char *argv[]) { extern int bcf_2qcall(bcf_hdr_t *h, bcf1_t *b); + extern void bcf_p1_indel_prior(bcf_p1aux_t *ma, double x); bcf_t *bp, *bout = 0; bcf1_t *b, *blast; int c; @@ -212,8 +215,8 @@ int bcfview(int argc, char *argv[]) tid = begin = end = -1; memset(&vc, 0, sizeof(viewconf_t)); - vc.prior_type = vc.n1 = -1; vc.theta = 1e-3; vc.pref = 0.5; - while ((c = getopt(argc, argv, "N1:l:cHAGvbSuP:t:p:QgL")) >= 0) { + vc.prior_type = vc.n1 = -1; vc.theta = 1e-3; vc.pref = 0.5; vc.indel_frac = -1.; + while ((c = getopt(argc, argv, "N1:l:cHAGvbSuP:t:p:QgLi:I")) >= 0) { switch (c) { case '1': vc.n1 = atoi(optarg); break; case 'l': vc.fn_list = strdup(optarg); break; @@ -227,8 +230,10 @@ int bcfview(int argc, char *argv[]) case 'u': vc.flag |= VC_UNCOMP | VC_BCFOUT; break; case 'H': vc.flag |= VC_HWE; break; case 'g': vc.flag |= VC_CALL_GT | VC_CALL; break; + case 'I': vc.flag |= VC_NO_INDEL; break; case 't': vc.theta = atof(optarg); break; case 'p': vc.pref = atof(optarg); break; + case 'i': vc.indel_frac = atof(optarg); break; case 'Q': vc.flag |= VC_QCALL; break; case 'L': vc.flag |= VC_ADJLD; break; case 'P': @@ -254,9 +259,11 @@ int bcfview(int argc, char *argv[]) fprintf(stderr, " -N skip sites where REF is not A/C/G/T\n"); fprintf(stderr, " -Q output the QCALL likelihood format\n"); fprintf(stderr, " -L calculate LD for adjacent sites\n"); + fprintf(stderr, " -I skip indels\n"); fprintf(stderr, " -1 INT number of group-1 samples [0]\n"); fprintf(stderr, " -l FILE list of sites to output [all sites]\n"); - fprintf(stderr, " -t FLOAT scaled mutation rate [%.4lg]\n", vc.theta); + fprintf(stderr, " -t FLOAT scaled substitution mutation rate [%.4lg]\n", vc.theta); + fprintf(stderr, " -i FLOAT indel-to-substitution ratio [%.4lg]\n", vc.indel_frac); fprintf(stderr, " -p FLOAT variant if P(ref|D) 0.) bcf_p1_indel_prior(p1, vc.indel_frac); } if (vc.fn_list) hash = bcf_load_pos(vc.fn_list, h); if (optind + 1 < argc && !(vc.flag&VC_VCFIN)) { @@ -306,7 +314,9 @@ int bcfview(int argc, char *argv[]) } } while (vcf_read(bp, h, b) > 0) { - if (vc.flag & VC_ACGT_ONLY) { + int is_indel = bcf_is_indel(b); + if ((vc.flag & VC_NO_INDEL) && is_indel) continue; + if ((vc.flag & VC_ACGT_ONLY) && !is_indel) { int x; if (b->ref[0] == 0 || b->ref[1] != 0) continue; x = toupper(b->ref[0]); diff --git a/bcftools/prob1.c b/bcftools/prob1.c index e3b0f73..176a0fc 100644 --- a/bcftools/prob1.c +++ b/bcftools/prob1.c @@ -8,9 +8,9 @@ #include "kseq.h" KSTREAM_INIT(gzFile, gzread, 16384) -#define MC_AVG_ERR 0.007 #define MC_MAX_EM_ITER 16 #define MC_EM_EPS 1e-4 +#define MC_DEF_INDEL 0.15 unsigned char seq_nt4_table[256] = { 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, @@ -32,9 +32,9 @@ unsigned char seq_nt4_table[256] = { }; struct __bcf_p1aux_t { - int n, M, n1; + int n, M, n1, is_indel; double *q2p, *pdg; // pdg -> P(D|g) - double *phi; + double *phi, *phi_indel; double *z, *zswap; // aux for afs double *z1, *z2, *phi1, *phi2; // only calculated when n1 is set double t, t1, t2; @@ -43,6 +43,14 @@ struct __bcf_p1aux_t { int PL_len; }; +void bcf_p1_indel_prior(bcf_p1aux_t *ma, double x) +{ + int i; + for (i = 0; i < ma->M; ++i) + ma->phi_indel[i] = ma->phi[i] * x; + ma->phi_indel[ma->M] = 1. - ma->phi[ma->M] * x; +} + static void init_prior(int type, double theta, int M, double *phi) { int i; @@ -63,6 +71,7 @@ static void init_prior(int type, double theta, int M, double *phi) void bcf_p1_init_prior(bcf_p1aux_t *ma, int type, double theta) { init_prior(type, theta, ma->M, ma->phi); + bcf_p1_indel_prior(ma, MC_DEF_INDEL); } void bcf_p1_init_subprior(bcf_p1aux_t *ma, int type, double theta) @@ -110,6 +119,7 @@ int bcf_p1_read_prior(bcf_p1aux_t *ma, const char *fn) fprintf(stderr, "[%s] heterozygosity=%lf, ", __func__, (double)sum); for (sum = 0., k = 1; k <= ma->M; ++k) sum += k * ma->phi[ma->M - k] / ma->M; fprintf(stderr, "theta=%lf\n", (double)sum); + bcf_p1_indel_prior(ma, MC_DEF_INDEL); return 0; } @@ -123,6 +133,7 @@ bcf_p1aux_t *bcf_p1_init(int n) ma->q2p = calloc(256, sizeof(double)); ma->pdg = calloc(3 * ma->n, sizeof(double)); ma->phi = calloc(ma->M + 1, sizeof(double)); + ma->phi_indel = calloc(ma->M + 1, sizeof(double)); ma->phi1 = calloc(ma->M + 1, sizeof(double)); ma->phi2 = calloc(ma->M + 1, sizeof(double)); ma->z = calloc(2 * ma->n + 1, sizeof(double)); @@ -148,7 +159,7 @@ void bcf_p1_destroy(bcf_p1aux_t *ma) { if (ma) { free(ma->q2p); free(ma->pdg); - free(ma->phi); free(ma->phi1); free(ma->phi2); + free(ma->phi); free(ma->phi_indel); free(ma->phi1); free(ma->phi2); free(ma->z); free(ma->zswap); free(ma->z1); free(ma->z2); free(ma->afs); free(ma->afs1); free(ma); @@ -303,12 +314,13 @@ static double mc_cal_afs(bcf_p1aux_t *ma) { int k; long double sum = 0.; + double *phi = ma->is_indel? ma->phi_indel : ma->phi; memset(ma->afs1, 0, sizeof(double) * (ma->M + 1)); mc_cal_y(ma); for (k = 0, sum = 0.; k <= ma->M; ++k) - sum += (long double)ma->phi[k] * ma->z[k]; + sum += (long double)phi[k] * ma->z[k]; for (k = 0; k <= ma->M; ++k) { - ma->afs1[k] = ma->phi[k] * ma->z[k] / sum; + ma->afs1[k] = phi[k] * ma->z[k] / sum; if (isnan(ma->afs1[k]) || isinf(ma->afs1[k])) return -1.; } for (k = 0, sum = 0.; k <= ma->M; ++k) { @@ -348,6 +360,7 @@ int bcf_p1_cal(bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst) { int i, k; long double sum = 0.; + ma->is_indel = bcf_is_indel(b); // set PL and PL_len for (i = 0; i < b->n_gi; ++i) { if (b->gi[i].fmt == bcf_str2int("PL", 2)) { @@ -378,6 +391,19 @@ int bcf_p1_cal(bcf1_t *b, bcf_p1aux_t *ma, bcf_p1rst_t *rst) flast = rst->f_em; } } + { // estimate equal-tail credible interval (95% level) + int l, h; + double p; + for (i = 0, p = 0.; i < ma->M; ++i) + if (p + ma->afs1[i] > 0.025) break; + else p += ma->afs1[i]; + l = i; + for (i = ma->M-1, p = 0.; i >= 0; --i) + if (p + ma->afs1[i] > 0.025) break; + else p += ma->afs1[i]; + h = i; + rst->cil = (double)(ma->M - h) / ma->M; rst->cih = (double)(ma->M - l) / ma->M; + } rst->g[0] = rst->g[1] = rst->g[2] = -1.; contrast(ma, rst->pc); return 0; diff --git a/bcftools/prob1.h b/bcftools/prob1.h index 7158fe2..6d9d28e 100644 --- a/bcftools/prob1.h +++ b/bcftools/prob1.h @@ -9,6 +9,7 @@ typedef struct __bcf_p1aux_t bcf_p1aux_t; typedef struct { int rank0; double f_em, f_exp, f_flat, p_ref; + double cil, cih; double pc[4]; double g[3]; } bcf_p1rst_t; diff --git a/bcftools/vcf.c b/bcftools/vcf.c index ebca869..9b661ff 100644 --- a/bcftools/vcf.c +++ b/bcftools/vcf.c @@ -156,7 +156,7 @@ int vcf_read(bcf_t *bp, bcf_hdr_t *h, bcf1_t *b) for (i = 0; i < b->n_gi; ++i) { if (b->gi[i].fmt == bcf_str2int("GT", 2)) { ((uint8_t*)b->gi[i].data)[k-9] = 1<<7; - } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) { + } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) { ((uint8_t*)b->gi[i].data)[k-9] = 0; } else if (b->gi[i].fmt == bcf_str2int("DP", 2)) { ((uint16_t*)b->gi[i].data)[k-9] = 0; @@ -173,7 +173,7 @@ int vcf_read(bcf_t *bp, bcf_hdr_t *h, bcf1_t *b) for (q = kstrtok(p, ":", &a2), i = 0; q && i < b->n_gi; q = kstrtok(0, 0, &a2), ++i) { if (b->gi[i].fmt == bcf_str2int("GT", 2)) { ((uint8_t*)b->gi[i].data)[k-9] = (q[0] - '0')<<3 | (q[2] - '0') | (q[1] == '/'? 0 : 1) << 6; - } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) { + } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) { double _x = strtod(q, &q); int x = (int)(_x + .499); if (x > 255) x = 255; diff --git a/bcftools/vcfutils.pl b/bcftools/vcfutils.pl index d0b7971..bbc479b 100755 --- a/bcftools/vcfutils.pl +++ b/bcftools/vcfutils.pl @@ -10,11 +10,11 @@ use Getopt::Std; exit; sub main { - my $version = '0.1.0'; &usage if (@ARGV < 1); my $command = shift(@ARGV); my %func = (subsam=>\&subsam, listsam=>\&listsam, fillac=>\&fillac, qstats=>\&qstats, varFilter=>\&varFilter, - hapmap2vcf=>\&hapmap2vcf, ucscsnp2vcf=>\&ucscsnp2vcf, filter4vcf=>\&filter4vcf, ldstats=>\&ldstats); + hapmap2vcf=>\&hapmap2vcf, ucscsnp2vcf=>\&ucscsnp2vcf, filter4vcf=>\&varFilter, ldstats=>\&ldstats, + gapstats=>\&gapstats); die("Unknown command \"$command\".\n") if (!defined($func{$command})); &{$func{$command}}; } @@ -155,7 +155,7 @@ Note: This command discards indels. Output: QUAL #non-indel #SNPs #transitions # next if (length($t[3]) != 1 || uc($t[3]) eq 'N'); $t[3] = uc($t[3]); $t[4] = uc($t[4]); my @s = split(',', $t[4]); - $t[5] = 3 if ($t[5] < 0); + $t[5] = 3 if ($t[5] eq '.' || $t[5] < 0); next if (length($s[0]) != 1); my $hit; if ($is_vcf) { @@ -199,37 +199,38 @@ Note: This command discards indels. Output: QUAL #non-indel #SNPs #transitions # } sub varFilter { - my %opts = (d=>1, D=>10000, l=>30, Q=>25, q=>10, G=>25, s=>100, w=>10, W=>10, N=>2, p=>undef, F=>.001); - getopts('pq:d:D:l:Q:w:W:N:G:F:', \%opts); + my %opts = (d=>2, D=>10000, a=>2, W=>10, Q=>10, w=>10, p=>undef, 1=>1e-4, 2=>1e-100, 3=>0, 4=>1e-4); + getopts('pd:D:W:Q:w:a:1:2:3:4:', \%opts); die(qq/ Usage: vcfutils.pl varFilter [options] Options: -Q INT minimum RMS mapping quality for SNPs [$opts{Q}] - -q INT minimum RMS mapping quality for gaps [$opts{q}] -d INT minimum read depth [$opts{d}] -D INT maximum read depth [$opts{D}] - - -G INT min indel score for nearby SNP filtering [$opts{G}] + -a INT minimum number of alternate bases [$opts{a}] -w INT SNP within INT bp around a gap to be filtered [$opts{w}] - - -W INT window size for filtering dense SNPs [$opts{W}] - -N INT max number of SNPs in a window [$opts{N}] - - -l INT window size for filtering adjacent gaps [$opts{l}] - + -W INT window size for filtering adjacent gaps [$opts{W}] + -1 FLOAT min P-value for strand bias (given PV4) [$opts{1}] + -2 FLOAT min P-value for baseQ bias [$opts{2}] + -3 FLOAT min P-value for mapQ bias [$opts{3}] + -4 FLOAT min P-value for end distance bias [$opts{4}] -p print filtered variants + +Note: Some of the filters rely on annotations generated by SAMtools\/BCFtools. \n/) if (@ARGV == 0 && -t STDIN); # calculate the window size - my ($ol, $ow, $oW) = ($opts{l}, $opts{w}, $opts{W}); + my ($ol, $ow) = ($opts{W}, $opts{w}); my $max_dist = $ol > $ow? $ol : $ow; - $max_dist = $oW if ($max_dist < $oW); # the core loop - my @staging; # (indel_filtering_score, flt_tag) + my @staging; # (indel_filtering_score, flt_tag, indel_span; chr, pos, ...) while (<>) { my @t = split; - next if (/^#/); + if (/^#/) { + print; next; + } next if ($t[4] eq '.'); # skip non-var sites + # check if the site is a SNP my $is_snp = 1; if (length($t[3]) > 1) { $is_snp = 0; @@ -245,24 +246,18 @@ Options: -Q INT minimum RMS mapping quality for SNPs [$opts{Q}] last if ($staging[0][3] eq $t[0] && $staging[0][4] + $staging[0][2] + $max_dist >= $t[1]); varFilter_aux(shift(@staging), $opts{p}); # calling a function is a bit slower, not much } - my ($flt, $score) = (0, -1); - - # collect key annotations - my ($dp, $mq, $af) = (-1, -1, 1); - if ($t[7] =~ /DP=(\d+)/i) { - $dp = $1; - } elsif ($t[7] =~ /DP4=(\d+),(\d+),(\d+),(\d+)/i) { + my $flt = 0; + # parse annotations + my ($dp, $mq, $dp_alt) = (-1, -1, -1); + if ($t[7] =~ /DP4=(\d+),(\d+),(\d+),(\d+)/i) { $dp = $1 + $2 + $3 + $4; + $dp_alt = $3 + $4; } - if ($t[7] =~ /MQ=(\d+)/i) { - $mq = $1; - } - if ($t[7] =~ /AF=([^\s;=]+)/i) { - $af = $1; - } elsif ($t[7] =~ /AF1=([^\s;=]+)/i) { - $af = $1; + if ($t[7] =~ /DP=(\d+)/i) { + $dp = $1; } - # the depth filter + $mq = $1 if ($t[7] =~ /MQ=(\d+)/i); + # the depth and mapQ filter if ($dp >= 0) { if ($dp < $opts{d}) { $flt = 2; @@ -270,58 +265,40 @@ Options: -Q INT minimum RMS mapping quality for SNPs [$opts{Q}] $flt = 3; } } + $flt = 4 if ($dp_alt >= 0 && $dp_alt < $opts{a}); + $flt = 1 if ($flt == 0 && $mq >= 0 && $mq < $opts{Q}); + $flt = 7 if ($flt == 0 && /PV4=([^,]+),([^,]+),([^,]+),([^,;\t]+)/ + && ($1<$opts{1} || $2<$opts{2} || $3<$opts{3} || $4<$opts{4})); # site dependent filters - my $dlen = 0; + my ($rlen, $indel_score) = (0, -1); # $indel_score<0 for SNPs if ($flt == 0) { if (!$is_snp) { # an indel - # If deletion, remember the length of the deletion - $dlen = length($t[3]) - 1; - $flt = 1 if ($mq < $opts{q}); + $rlen = length($t[3]) - 1; + $indel_score = $t[5] * 100 + $dp_alt; # filtering SNPs - if ($t[5] >= $opts{G}) { - for my $x (@staging) { - # Is it a SNP and is it outside the SNP filter window? - next if ($x->[0] >= 0 || $x->[4] + $x->[2] + $ow < $t[1]); - $x->[1] = 5 if ($x->[1] == 0); - } + for my $x (@staging) { + next if ($x->[0] >= 0 || $x->[1] || $x->[4] + $x->[2] + $ow < $t[1]); + $x->[1] = 5; } - # the indel filtering score - $score = $t[5]; # check the staging list for indel filtering for my $x (@staging) { - # Is it a SNP and is it outside the gap filter window - next if ($x->[0] < 0 || $x->[4] + $x->[2] + $ol < $t[1]); - if ($x->[0] < $score) { + next if ($x->[0] < 0 || $x->[1] || $x->[4] + $x->[2] + $ol < $t[1]); + if ($x->[0] < $indel_score) { $x->[1] = 6; } else { $flt = 6; last; } } } else { # a SNP - $flt = 1 if ($mq < $opts{Q}); - # check adjacent SNPs - my $k = 1; for my $x (@staging) { - ++$k if ($x->[0] < 0 && -($x->[0] + 1) > $opts{F} && $x->[4] + $x->[2] + $oW >= $t[1] && ($x->[1] == 0 || $x->[1] == 4 || $x->[1] == 5)); - } - # filtering is necessary - if ($k > $opts{N}) { - $flt = 4; - for my $x (@staging) { - $x->[1] = 4 if ($x->[0] < 0 && $x->[4] + $x->[2] + $oW >= $t[1] && $x->[1] == 0); - } - } else { # then check gap filter - for my $x (@staging) { - next if ($x->[0] < 0 || $x->[4] + $x->[2] + $ow < $t[1]); - if ($x->[0] >= $opts{G}) { - $flt = 5; last; - } - } + next if ($x->[0] < 0 || $x->[1] || $x->[4] + $x->[2] + $ow < $t[1]); + $flt = 5; + last; } } } - push(@staging, [$score < 0? -$af-1 : $score, $flt, $dlen, @t]); + push(@staging, [$indel_score, $flt, $rlen, @t]); } # output the last few elements in the staging list while (@staging) { @@ -334,47 +311,35 @@ sub varFilter_aux { if ($first->[1] == 0) { print join("\t", @$first[3 .. @$first-1]), "\n"; } elsif ($is_print) { - print STDERR join("\t", substr("UQdDWGgsiX", $first->[1], 1), @$first[3 .. @$first-1]), "\n"; + print STDERR join("\t", substr("UQdDaGgP", $first->[1], 1), @$first[3 .. @$first-1]), "\n"; } } -sub filter4vcf { - my %opts = (d=>3, D=>2000, 1=>1e-4, 2=>1e-100, 3=>0, 4=>1e-4, Q=>10, q=>3); - getopts('d:D:1:2:3:4:Q:q:', \%opts); - die(qq/ -Usage: vcfutils.pl filter4vcf [options] - -Options: -d INT min total depth (given DP or DP4) [$opts{d}] - -D INT max total depth [$opts{D}] - -q INT min SNP quality [$opts{q}] - -Q INT min RMS mapQ (given MQ) [$opts{Q}] - -1 FLOAT min P-value for strand bias (given PV4) [$opts{1}] - -2 FLOAT min P-value for baseQ bias [$opts{2}] - -3 FLOAT min P-value for mapQ bias [$opts{3}] - -4 FLOAT min P-value for end distance bias [$opts{4}]\n -/) if (@ARGV == 0 && -t STDIN); - - my %ts = (AG=>1, GA=>1, CT=>1, TC=>1); - - my @n = (0, 0); +sub gapstats { + my (@c0, @c1); + $c0[$_] = $c1[$_] = 0 for (0 .. 10000); while (<>) { - if (/^#/) { - print; - next; - } - next if (/PV4=([^,]+),([^,]+),([^,]+),([^,;\t]+)/ && ($1<$opts{1} || $2<$opts{2} || $3<$opts{3} || $4<$opts{4})); - my $depth = -1; - $depth = $1 if (/DP=(\d+)/); - $depth = $1+$2+$3+$4 if (/DP4=(\d+),(\d+),(\d+),(\d+)/); - next if ($depth > 0 && ($depth < $opts{d} || $depth > $opts{D})); - next if (/MQ=(\d+)/ && $1 < $opts{Q}); + next if (/^#/); my @t = split; - next if ($t[5] >= 0 && $t[5] < $opts{q}); - ++$n[0]; + next if (length($t[3]) == 1 && $t[4] =~ /^[A-Za-z](,[A-Za-z])*$/); # not an indel my @s = split(',', $t[4]); - ++$n[1] if ($ts{$t[3].$s[0]}); - print; + for my $x (@s) { + my $l = length($x) - length($t[3]) + 5000; + if ($x =~ /^-/) { + $l = -(length($x) - 1) + 5000; + } elsif ($x =~ /^\+/) { + $l = length($x) - 1 + 5000; + } + $c0[$l] += 1 / @s; + } + } + for (my $i = 0; $i < 10000; ++$i) { + next if ($c0[$i] == 0); + $c1[0] += $c0[$i]; + $c1[1] += $c0[$i] if (($i-5000)%3 == 0); + printf("C\t%d\t%.2f\n", ($i-5000), $c0[$i]); } + printf("3\t%d\t%d\t%.3f\n", $c1[0], $c1[1], $c1[1]/$c1[0]); } sub ucscsnp2vcf { @@ -470,7 +435,6 @@ Command: subsam get a subset of samples fillac fill the allele count field qstats SNP stats stratified by QUAL varFilter filtering short variants - filter4vcf filtering VCFs produced by samtools+bcftools hapmap2vcf convert the hapmap format to VCF ucscsnp2vcf convert UCSC SNP SQL dump to VCF \n/); diff --git a/kaln.c b/kaln.c index 6807791..9c0bbaa 100644 --- a/kaln.c +++ b/kaln.c @@ -73,9 +73,19 @@ int aln_sm_blast[] = { -2, -2, -2, -2, -2 }; +int aln_sm_qual[] = { + 0, -23, -23, -23, 0, + -23, 0, -23, -23, 0, + -23, -23, 0, -23, 0, + -23, -23, -23, 0, 0, + 0, 0, 0, 0, 0 +}; + ka_param_t ka_param_blast = { 5, 2, 5, 2, aln_sm_blast, 5, 50 }; ka_param_t ka_param_aa2aa = { 10, 2, 10, 2, aln_sm_blosum62, 22, 50 }; +ka_param2_t ka_param2_qual = { 37, 11, 37, 11, 37, 11, 0, 0, aln_sm_qual, 5, 50 }; + static uint32_t *ka_path2cigar32(const path_t *path, int path_len, int *n_cigar) { int i, n; @@ -176,7 +186,7 @@ static uint32_t *ka_path2cigar32(const path_t *path, int path_len, int *n_cigar) } typedef struct { - uint8_t Mt:3, It:2, Dt:2; + uint8_t Mt:3, It:2, Dt:3; } dpcell_t; typedef struct { @@ -208,7 +218,7 @@ uint32_t *ka_global_core(uint8_t *seq1, int len1, uint8_t *seq2, int len2, const score_matrix = ap->matrix; N_MATRIX_ROW = ap->row; - *n_cigar = 0; + if (n_cigar) *n_cigar = 0; if (len1 == 0 || len2 == 0) return 0; /* calculate b1 and b2 */ @@ -371,211 +381,106 @@ uint32_t *ka_global_core(uint8_t *seq1, int len1, uint8_t *seq2, int len2, const return cigar; } -/***************************************** - * Probabilistic banded glocal alignment * - *****************************************/ - -static float g_qual2prob[256]; - -#define EI .25 -#define EM .3333333333333 -#define set_u(u, b, i, k) { int x=(i)-(b); x=x>0?x:0; (u)=((k)-x+1)*3; } - -ka_probpar_t ka_probpar_def = { 0.001, 0.1, 10 }; - -/* - The topology of the profile HMM: - - /\ /\ /\ /\ - I[1] I[k-1] I[k] I[L] - ^ \ \ ^ \ ^ \ \ ^ - | \ \ | \ | \ \ | - M[0] M[1] -> ... -> M[k-1] -> M[k] -> ... -> M[L] M[L+1] - \ \/ \/ \/ / - \ /\ /\ /\ / - -> D[k-1] -> D[k] -> - - M[0] points to every {M,I}[k] and every {M,I}[k] points M[L+1]. - - On input, _ref is the reference sequence and _query is the query - sequence. Both are sequences of 0/1/2/3/4 where 4 stands for an - ambiguous residue. iqual is the base quality. c sets the gap open - probability, gap extension probability and band width. - - On output, state and q are arrays of length l_query. The higher 30 - bits give the reference position the query base is matched to and the - lower two bits can be 0 (an alignment match) or 1 (an - insertion). q[i] gives the phred scaled posterior probability of - state[i] being wrong. - */ -int ka_prob_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual, - const ka_probpar_t *c, int *state, uint8_t *q) +typedef struct { + int M, I, D; +} score_aux_t; + +#define MINUS_INF -0x40000000 + +// matrix: len2 rows and len1 columns +int ka_global_score(const uint8_t *_seq1, int len1, const uint8_t *_seq2, int len2, const ka_param2_t *ap) { - double **f, **b, *s, m[9], sI, sM, bI, bM, pb; - float *qual, *_qual; - const uint8_t *ref, *query; - int bw, bw2, i, k, is_diff = 0; - - /*** initialization ***/ - ref = _ref - 1; query = _query - 1; // change to 1-based coordinate - bw = l_ref > l_query? l_ref : l_query; - if (bw > c->bw) bw = c->bw; - if (bw < abs(l_ref - l_query)) bw = abs(l_ref - l_query); - bw2 = bw * 2 + 1; - // allocate the forward and backward matrices f[][] and b[][] and the scaling array s[] - f = calloc(l_query+1, sizeof(void*)); - b = calloc(l_query+1, sizeof(void*)); - for (i = 0; i <= l_query; ++i) { - f[i] = calloc(bw2 * 3 + 6, sizeof(double)); // FIXME: this is over-allocated for very short seqs - b[i] = calloc(bw2 * 3 + 6, sizeof(double)); - } - s = calloc(l_query+2, sizeof(double)); // s[] is the scaling factor to avoid underflow - // initialize qual - _qual = calloc(l_query, sizeof(float)); - if (g_qual2prob[0] == 0) - for (i = 0; i < 256; ++i) - g_qual2prob[i] = pow(10, -i/10.); - for (i = 0; i < l_query; ++i) _qual[i] = g_qual2prob[iqual? iqual[i] : 30]; - qual = _qual - 1; - // initialize transition probability - sM = sI = 1. / (2 * l_query + 2); // the value here seems not to affect results; FIXME: need proof - m[0*3+0] = (1 - c->d - c->d) * (1 - sM); m[0*3+1] = m[0*3+2] = c->d * (1 - sM); - m[1*3+0] = (1 - c->e) * (1 - sI); m[1*3+1] = c->e * (1 - sI); m[1*3+2] = 0.; - m[2*3+0] = 1 - c->e; m[2*3+1] = 0.; m[2*3+2] = c->e; - bM = (1 - c->d) / l_query; bI = c->d / l_query; // (bM+bI)*l_query==1 - /*** forward ***/ - // f[0] - set_u(k, bw, 0, 0); - f[0][k] = s[0] = 1.; - { // f[1] - double *fi = f[1], sum; - int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1, _beg, _end; - for (k = beg, sum = 0.; k <= end; ++k) { - int u; - double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM; - set_u(u, bw, 1, k); - fi[u+0] = e * bM; fi[u+1] = EI * bI; - sum += fi[u] + fi[u+1]; - } - // rescale - s[1] = sum; - set_u(_beg, bw, 1, beg); set_u(_end, bw, 1, end); _end += 2; - for (k = _beg; k <= _end; ++k) fi[k] /= sum; - } - // f[2..l_query] - for (i = 2; i <= l_query; ++i) { - double *fi = f[i], *fi1 = f[i-1], sum, qli = qual[i]; - int beg = 1, end = l_ref, x, _beg, _end; - uint8_t qyi = query[i]; - x = i - bw; beg = beg > x? beg : x; // band start - x = i + bw; end = end < x? end : x; // band end - for (k = beg, sum = 0.; k <= end; ++k) { - int u, v11, v01, v10; - double e; - e = (ref[k] > 3 || qyi > 3)? 1. : ref[k] == qyi? 1. - qli : qli * EM; - set_u(u, bw, i, k); set_u(v11, bw, i-1, k-1); set_u(v10, bw, i-1, k); set_u(v01, bw, i, k-1); - fi[u+0] = e * (m[0] * fi1[v11+0] + m[3] * fi1[v11+1] + m[6] * fi1[v11+2]); - fi[u+1] = EI * (m[1] * fi1[v10+0] + m[4] * fi1[v10+1]); - fi[u+2] = m[2] * fi[v01+0] + m[8] * fi[v01+2]; - sum += fi[u] + fi[u+1] + fi[u+2]; -// fprintf(stderr, "F (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, fi[u], fi[u+1], fi[u+2]); // DEBUG - } - // rescale - s[i] = sum; - set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2; - for (k = _beg, sum = 1./sum; k <= _end; ++k) fi[k] *= sum; - } - { // f[l_query+1] - double sum; - for (k = 1, sum = 0.; k <= l_ref; ++k) { - int u; - set_u(u, bw, l_query, k); - if (u < 3 || u >= bw2*3+3) continue; - sum += f[l_query][u+0] * sM + f[l_query][u+1] * sI; - } - s[l_query+1] = sum; // the last scaling factor + +#define __score_aux(_p, _q0, _sc, _io, _ie, _do, _de) { \ + int t1, t2; \ + score_aux_t *_q; \ + _q = _q0; \ + _p->M = _q->M >= _q->I? _q->M : _q->I; \ + _p->M = _p->M >= _q->D? _p->M : _q->D; \ + _p->M += (_sc); \ + ++_q; t1 = _q->M - _io - _ie; t2 = _q->I - _ie; _p->I = t1 >= t2? t1 : t2; \ + _q = _p-1; t1 = _q->M - _do - _de; t2 = _q->D - _de; _p->D = t1 >= t2? t1 : t2; \ } - /*** backward ***/ - // b[l_query] (b[l_query+1][0]=1 and thus \tilde{b}[][]=1/s[l_query+1]; this is where s[l_query+1] comes from) - for (k = 1; k <= l_ref; ++k) { - int u; - double *bi = b[l_query]; - set_u(u, bw, l_query, k); - if (u < 3 || u >= bw2*3+3) continue; - bi[u+0] = sM / s[l_query] / s[l_query+1]; bi[u+1] = sI / s[l_query] / s[l_query+1]; + + int i, j, bw, scmat_size = ap->row, *scmat = ap->matrix, ret; + const uint8_t *seq1, *seq2; + score_aux_t *curr, *last, *swap; + bw = abs(len1 - len2) + ap->band_width; + i = len1 > len2? len1 : len2; + if (bw > i + 1) bw = i + 1; + seq1 = _seq1 - 1; seq2 = _seq2 - 1; + curr = calloc(len1 + 2, sizeof(score_aux_t)); + last = calloc(len1 + 2, sizeof(score_aux_t)); + { // the zero-th row + int x, end = len1; + score_aux_t *p; + j = 0; + x = j + bw; end = len1 < x? len1 : x; // band end + p = curr; + p->M = 0; p->I = p->D = MINUS_INF; + for (i = 1, p = &curr[1]; i <= end; ++i, ++p) + p->M = p->I = MINUS_INF, p->D = -(ap->edo + ap->ede * i); + p->M = p->I = p->D = MINUS_INF; + swap = curr; curr = last; last = swap; } - // b[l_query-1..1] - for (i = l_query - 1; i >= 1; --i) { - int beg = 1, end = l_ref, x, _beg, _end; - double *bi = b[i], *bi1 = b[i+1], y = (i > 1), qli1 = qual[i+1]; - uint8_t qyi1 = query[i+1]; - x = i - bw; beg = beg > x? beg : x; - x = i + bw; end = end < x? end : x; - for (k = end; k >= beg; --k) { - int u, v11, v01, v10; - double e; - set_u(u, bw, i, k); set_u(v11, bw, i+1, k+1); set_u(v10, bw, i+1, k); set_u(v01, bw, i, k+1); - e = (k >= l_ref? 0 : (ref[k+1] > 3 || qyi1 > 3)? 1. : ref[k+1] == qyi1? 1. - qli1 : qli1 * EM) * bi1[v11]; - bi[u+0] = e * m[0] + EI * m[1] * bi1[v10+1] + m[2] * bi[v01+2]; // bi1[v11] has been foled into e. - bi[u+1] = e * m[3] + EI * m[4] * bi1[v10+1]; - bi[u+2] = (e * m[6] + m[8] * bi[v01+2]) * y; -// fprintf(stderr, "B (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, bi[u], bi[u+1], bi[u+2]); // DEBUG + for (j = 1; j < len2; ++j) { + int x, beg = 0, end = len1, *scrow, col_end; + score_aux_t *p; + x = j - bw; beg = 0 > x? 0 : x; // band start + x = j + bw; end = len1 < x? len1 : x; // band end + if (beg == 0) { // from zero-th column + p = curr; + p->M = p->D = MINUS_INF; p->I = -(ap->eio + ap->eie * j); + ++beg; // then beg = 1 } - // rescale - set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2; - for (k = _beg, y = 1./s[i]; k <= _end; ++k) bi[k] *= y; - } - { // b[0] - int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1; - double sum = 0.; - for (k = end; k >= beg; --k) { - int u; - double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM; - set_u(u, bw, 1, k); - if (u < 3 || u >= bw2*3+3) continue; - sum += e * b[1][u+0] * bM + EI * b[1][u+1] * bI; + scrow = scmat + seq2[j] * scmat_size; + if (end == len1) col_end = 1, --end; + else col_end = 0; + for (i = beg, p = &curr[beg]; i <= end; ++i, ++p) + __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->iio, ap->iie, ap->ido, ap->ide); + if (col_end) { + __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->eio, ap->eie, ap->ido, ap->ide); + ++p; } - set_u(k, bw, 0, 0); - pb = b[0][k] = sum / s[0]; // if everything works as is expected, pb == 1.0 + p->M = p->I = p->D = MINUS_INF; +// for (i = 0; i <= len1; ++i) printf("(%d,%d,%d) ", curr[i].M, curr[i].I, curr[i].D); putchar('\n'); + swap = curr; curr = last; last = swap; } - is_diff = fabs(pb - 1.) > 1e-7? 1 : 0; - /*** MAP ***/ - for (i = 1; i <= l_query; ++i) { - double sum = 0., *fi = f[i], *bi = b[i], max = 0.; - int beg = 1, end = l_ref, x, max_k = -1; - x = i - bw; beg = beg > x? beg : x; - x = i + bw; end = end < x? end : x; - for (k = beg; k <= end; ++k) { - int u; - double z; - set_u(u, bw, i, k); - z = fi[u+0] * bi[u+0]; if (z > max) max = z, max_k = (k-1)<<2 | 0; sum += z; - z = fi[u+1] * bi[u+1]; if (z > max) max = z, max_k = (k-1)<<2 | 1; sum += z; + { // the last row + int x, beg = 0, *scrow; + score_aux_t *p; + j = len2; + x = j - bw; beg = 0 > x? 0 : x; // band start + if (beg == 0) { // from zero-th column + p = curr; + p->M = p->D = MINUS_INF; p->I = -(ap->eio + ap->eie * j); + ++beg; // then beg = 1 } - max /= sum; sum *= s[i]; // if everything works as is expected, sum == 1.0 - if (state) state[i-1] = max_k; - if (q) k = (int)(-4.343 * log(1. - max) + .499), q[i-1] = k > 100? 99 : k; -#ifdef _MAIN - fprintf(stderr, "(%.10lg,%.10lg) (%d,%d:%d)~%lg\n", pb, sum, i-1, max_k>>2, max_k&3, max); // DEBUG -#endif - } - /*** free ***/ - for (i = 0; i <= l_query; ++i) { - free(f[i]); free(b[i]); + scrow = scmat + seq2[j] * scmat_size; + for (i = beg, p = &curr[beg]; i < len1; ++i, ++p) + __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->iio, ap->iie, ap->edo, ap->ede); + __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->eio, ap->eie, ap->edo, ap->ede); +// for (i = 0; i <= len1; ++i) printf("(%d,%d,%d) ", curr[i].M, curr[i].I, curr[i].D); putchar('\n'); } - free(f); free(b); free(s); free(_qual); - return 0; + ret = curr[len1].M >= curr[len1].I? curr[len1].M : curr[len1].I; + ret = ret >= curr[len1].D? ret : curr[len1].D; + free(curr); free(last); + return ret; } #ifdef _MAIN -int main() +int main(int argc, char *argv[]) { - int l_ref = 5, l_query = 4; - uint8_t *ref = (uint8_t*)"\0\1\3\3\1"; - uint8_t *query = (uint8_t*)"\0\3\3\1"; -// uint8_t *query = (uint8_t*)"\1\3\3\1"; - static uint8_t qual[4] = {20, 20, 20, 20}; - ka_prob_glocal(ref, l_ref, query, l_query, qual, &ka_probpar_def, 0, 0); +// int len1 = 35, len2 = 35; +// uint8_t *seq1 = (uint8_t*)"\0\0\3\3\2\0\0\0\1\0\2\1\2\1\3\2\3\3\3\0\2\3\2\1\1\3\3\3\2\3\3\1\0\0\1"; +// uint8_t *seq2 = (uint8_t*)"\0\0\3\3\2\0\0\0\1\0\2\1\2\1\3\2\3\3\3\0\2\3\2\1\1\3\3\3\2\3\3\1\0\1\0"; + int len1 = 4, len2 = 4; + uint8_t *seq1 = (uint8_t*)"\1\0\0\1"; + uint8_t *seq2 = (uint8_t*)"\1\0\1\0"; + int sc; +// ka_global_core(seq1, 2, seq2, 1, &ka_param_qual, &sc, 0); + sc = ka_global_score(seq1, len1, seq2, len2, &ka_param2_qual); + printf("%d\n", sc); return 0; } #endif diff --git a/kaln.h b/kaln.h index b24597c..1ece132 100644 --- a/kaln.h +++ b/kaln.h @@ -42,9 +42,12 @@ typedef struct { } ka_param_t; typedef struct { - float d, e; - int bw; -} ka_probpar_t; + int iio, iie, ido, ide; + int eio, eie, edo, ede; + int *matrix; + int row; + int band_width; +} ka_param2_t; #ifdef __cplusplus extern "C" { @@ -52,14 +55,13 @@ extern "C" { uint32_t *ka_global_core(uint8_t *seq1, int len1, uint8_t *seq2, int len2, const ka_param_t *ap, int *_score, int *n_cigar); - int ka_prob_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual, - const ka_probpar_t *c, int *state, uint8_t *q); - + int ka_global_score(const uint8_t *_seq1, int len1, const uint8_t *_seq2, int len2, const ka_param2_t *ap); #ifdef __cplusplus } #endif extern ka_param_t ka_param_blast; /* = { 5, 2, 5, 2, aln_sm_blast, 5, 50 }; */ -extern ka_probpar_t ka_probpar_def; /* { 0.0001, 0.1, 10 } */ +extern ka_param_t ka_param_qual; // only use this for global alignment!!! +extern ka_param2_t ka_param2_qual; // only use this for global alignment!!! #endif diff --git a/kprobaln.c b/kprobaln.c new file mode 100644 index 0000000..5201c1a --- /dev/null +++ b/kprobaln.c @@ -0,0 +1,278 @@ +/* The MIT License + + Copyright (c) 2003-2006, 2008-2010, by Heng Li + + Permission is hereby granted, free of charge, to any person obtaining + a copy of this software and associated documentation files (the + "Software"), to deal in the Software without restriction, including + without limitation the rights to use, copy, modify, merge, publish, + distribute, sublicense, and/or sell copies of the Software, and to + permit persons to whom the Software is furnished to do so, subject to + the following conditions: + + The above copyright notice and this permission notice shall be + included in all copies or substantial portions of the Software. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. +*/ + +#include +#include +#include +#include +#include +#include "kprobaln.h" + +/***************************************** + * Probabilistic banded glocal alignment * + *****************************************/ + +#define EI .25 +#define EM .33333333333 + +static float g_qual2prob[256]; + +#define set_u(u, b, i, k) { int x=(i)-(b); x=x>0?x:0; (u)=((k)-x+1)*3; } + +kpa_par_t kpa_par_def = { 0.001, 0.1, 10 }; +kpa_par_t kpa_par_alt = { 0.0001, 0.01, 10 }; + +/* + The topology of the profile HMM: + + /\ /\ /\ /\ + I[1] I[k-1] I[k] I[L] + ^ \ \ ^ \ ^ \ \ ^ + | \ \ | \ | \ \ | + M[0] M[1] -> ... -> M[k-1] -> M[k] -> ... -> M[L] M[L+1] + \ \/ \/ \/ / + \ /\ /\ /\ / + -> D[k-1] -> D[k] -> + + M[0] points to every {M,I}[k] and every {M,I}[k] points M[L+1]. + + On input, _ref is the reference sequence and _query is the query + sequence. Both are sequences of 0/1/2/3/4 where 4 stands for an + ambiguous residue. iqual is the base quality. c sets the gap open + probability, gap extension probability and band width. + + On output, state and q are arrays of length l_query. The higher 30 + bits give the reference position the query base is matched to and the + lower two bits can be 0 (an alignment match) or 1 (an + insertion). q[i] gives the phred scaled posterior probability of + state[i] being wrong. + */ +int kpa_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual, + const kpa_par_t *c, int *state, uint8_t *q) +{ + double **f, **b = 0, *s, m[9], sI, sM, bI, bM, pb; + float *qual, *_qual; + const uint8_t *ref, *query; + int bw, bw2, i, k, is_diff = 0, is_backward = 1, Pr; + + /*** initialization ***/ + is_backward = state && q? 1 : 0; + ref = _ref - 1; query = _query - 1; // change to 1-based coordinate + bw = l_ref > l_query? l_ref : l_query; + if (bw > c->bw) bw = c->bw; + if (bw < abs(l_ref - l_query)) bw = abs(l_ref - l_query); + bw2 = bw * 2 + 1; + // allocate the forward and backward matrices f[][] and b[][] and the scaling array s[] + f = calloc(l_query+1, sizeof(void*)); + if (is_backward) b = calloc(l_query+1, sizeof(void*)); + for (i = 0; i <= l_query; ++i) { + f[i] = calloc(bw2 * 3 + 6, sizeof(double)); // FIXME: this is over-allocated for very short seqs + if (is_backward) b[i] = calloc(bw2 * 3 + 6, sizeof(double)); + } + s = calloc(l_query+2, sizeof(double)); // s[] is the scaling factor to avoid underflow + // initialize qual + _qual = calloc(l_query, sizeof(float)); + if (g_qual2prob[0] == 0) + for (i = 0; i < 256; ++i) + g_qual2prob[i] = pow(10, -i/10.); + for (i = 0; i < l_query; ++i) _qual[i] = g_qual2prob[iqual? iqual[i] : 30]; + qual = _qual - 1; + // initialize transition probability + sM = sI = 1. / (2 * l_query + 2); // the value here seems not to affect results; FIXME: need proof + m[0*3+0] = (1 - c->d - c->d) * (1 - sM); m[0*3+1] = m[0*3+2] = c->d * (1 - sM); + m[1*3+0] = (1 - c->e) * (1 - sI); m[1*3+1] = c->e * (1 - sI); m[1*3+2] = 0.; + m[2*3+0] = 1 - c->e; m[2*3+1] = 0.; m[2*3+2] = c->e; + bM = (1 - c->d) / l_ref; bI = c->d / l_ref; // (bM+bI)*l_ref==1 + /*** forward ***/ + // f[0] + set_u(k, bw, 0, 0); + f[0][k] = s[0] = 1.; + { // f[1] + double *fi = f[1], sum; + int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1, _beg, _end; + for (k = beg, sum = 0.; k <= end; ++k) { + int u; + double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM; + set_u(u, bw, 1, k); + fi[u+0] = e * bM; fi[u+1] = EI * bI; + sum += fi[u] + fi[u+1]; + } + // rescale + s[1] = sum; + set_u(_beg, bw, 1, beg); set_u(_end, bw, 1, end); _end += 2; + for (k = _beg; k <= _end; ++k) fi[k] /= sum; + } + // f[2..l_query] + for (i = 2; i <= l_query; ++i) { + double *fi = f[i], *fi1 = f[i-1], sum, qli = qual[i]; + int beg = 1, end = l_ref, x, _beg, _end; + uint8_t qyi = query[i]; + x = i - bw; beg = beg > x? beg : x; // band start + x = i + bw; end = end < x? end : x; // band end + for (k = beg, sum = 0.; k <= end; ++k) { + int u, v11, v01, v10; + double e; + e = (ref[k] > 3 || qyi > 3)? 1. : ref[k] == qyi? 1. - qli : qli * EM; + set_u(u, bw, i, k); set_u(v11, bw, i-1, k-1); set_u(v10, bw, i-1, k); set_u(v01, bw, i, k-1); + fi[u+0] = e * (m[0] * fi1[v11+0] + m[3] * fi1[v11+1] + m[6] * fi1[v11+2]); + fi[u+1] = EI * (m[1] * fi1[v10+0] + m[4] * fi1[v10+1]); + fi[u+2] = m[2] * fi[v01+0] + m[8] * fi[v01+2]; + sum += fi[u] + fi[u+1] + fi[u+2]; +// fprintf(stderr, "F (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, fi[u], fi[u+1], fi[u+2]); // DEBUG + } + // rescale + s[i] = sum; + set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2; + for (k = _beg, sum = 1./sum; k <= _end; ++k) fi[k] *= sum; + } + { // f[l_query+1] + double sum; + for (k = 1, sum = 0.; k <= l_ref; ++k) { + int u; + set_u(u, bw, l_query, k); + if (u < 3 || u >= bw2*3+3) continue; + sum += f[l_query][u+0] * sM + f[l_query][u+1] * sI; + } + s[l_query+1] = sum; // the last scaling factor + } + { // compute likelihood + double p = 1., Pr1 = 0.; + for (i = 0; i <= l_query + 1; ++i) { + p *= s[i]; + if (p < 1e-100) Pr += -4.343 * log(p), p = 1.; + } + Pr1 += -4.343 * log(p * l_ref * l_query); + Pr = (int)(Pr1 + .499); + if (!is_backward) { // skip backward and MAP + for (i = 0; i <= l_query; ++i) free(f[i]); + free(f); free(s); free(_qual); + return Pr; + } + } + /*** backward ***/ + // b[l_query] (b[l_query+1][0]=1 and thus \tilde{b}[][]=1/s[l_query+1]; this is where s[l_query+1] comes from) + for (k = 1; k <= l_ref; ++k) { + int u; + double *bi = b[l_query]; + set_u(u, bw, l_query, k); + if (u < 3 || u >= bw2*3+3) continue; + bi[u+0] = sM / s[l_query] / s[l_query+1]; bi[u+1] = sI / s[l_query] / s[l_query+1]; + } + // b[l_query-1..1] + for (i = l_query - 1; i >= 1; --i) { + int beg = 1, end = l_ref, x, _beg, _end; + double *bi = b[i], *bi1 = b[i+1], y = (i > 1), qli1 = qual[i+1]; + uint8_t qyi1 = query[i+1]; + x = i - bw; beg = beg > x? beg : x; + x = i + bw; end = end < x? end : x; + for (k = end; k >= beg; --k) { + int u, v11, v01, v10; + double e; + set_u(u, bw, i, k); set_u(v11, bw, i+1, k+1); set_u(v10, bw, i+1, k); set_u(v01, bw, i, k+1); + e = (k >= l_ref? 0 : (ref[k+1] > 3 || qyi1 > 3)? 1. : ref[k+1] == qyi1? 1. - qli1 : qli1 * EM) * bi1[v11]; + bi[u+0] = e * m[0] + EI * m[1] * bi1[v10+1] + m[2] * bi[v01+2]; // bi1[v11] has been foled into e. + bi[u+1] = e * m[3] + EI * m[4] * bi1[v10+1]; + bi[u+2] = (e * m[6] + m[8] * bi[v01+2]) * y; +// fprintf(stderr, "B (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, bi[u], bi[u+1], bi[u+2]); // DEBUG + } + // rescale + set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2; + for (k = _beg, y = 1./s[i]; k <= _end; ++k) bi[k] *= y; + } + { // b[0] + int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1; + double sum = 0.; + for (k = end; k >= beg; --k) { + int u; + double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM; + set_u(u, bw, 1, k); + if (u < 3 || u >= bw2*3+3) continue; + sum += e * b[1][u+0] * bM + EI * b[1][u+1] * bI; + } + set_u(k, bw, 0, 0); + pb = b[0][k] = sum / s[0]; // if everything works as is expected, pb == 1.0 + } + is_diff = fabs(pb - 1.) > 1e-7? 1 : 0; + /*** MAP ***/ + for (i = 1; i <= l_query; ++i) { + double sum = 0., *fi = f[i], *bi = b[i], max = 0.; + int beg = 1, end = l_ref, x, max_k = -1; + x = i - bw; beg = beg > x? beg : x; + x = i + bw; end = end < x? end : x; + for (k = beg; k <= end; ++k) { + int u; + double z; + set_u(u, bw, i, k); + z = fi[u+0] * bi[u+0]; if (z > max) max = z, max_k = (k-1)<<2 | 0; sum += z; + z = fi[u+1] * bi[u+1]; if (z > max) max = z, max_k = (k-1)<<2 | 1; sum += z; + } + max /= sum; sum *= s[i]; // if everything works as is expected, sum == 1.0 + if (state) state[i-1] = max_k; + if (q) k = (int)(-4.343 * log(1. - max) + .499), q[i-1] = k > 100? 99 : k; +#ifdef _MAIN + fprintf(stderr, "(%.10lg,%.10lg) (%d,%d:%c,%c:%d) %lg\n", pb, sum, i-1, max_k>>2, + "ACGT"[query[i]], "ACGT"[ref[(max_k>>2)+1]], max_k&3, max); // DEBUG +#endif + } + /*** free ***/ + for (i = 0; i <= l_query; ++i) { + free(f[i]); free(b[i]); + } + free(f); free(b); free(s); free(_qual); + return Pr; +} + +#ifdef _MAIN +#include +int main(int argc, char *argv[]) +{ + uint8_t conv[256], *iqual, *ref, *query; + int c, l_ref, l_query, i, q = 30, b = 10, P; + while ((c = getopt(argc, argv, "b:q:")) >= 0) { + switch (c) { + case 'b': b = atoi(optarg); break; + case 'q': q = atoi(optarg); break; + } + } + if (optind + 2 > argc) { + fprintf(stderr, "Usage: %s [-q %d] [-b %d] \n", argv[0], q, b); // example: acttc attc + return 1; + } + memset(conv, 4, 256); + conv['a'] = conv['A'] = 0; conv['c'] = conv['C'] = 1; + conv['g'] = conv['G'] = 2; conv['t'] = conv['T'] = 3; + ref = (uint8_t*)argv[optind]; query = (uint8_t*)argv[optind+1]; + l_ref = strlen((char*)ref); l_query = strlen((char*)query); + for (i = 0; i < l_ref; ++i) ref[i] = conv[ref[i]]; + for (i = 0; i < l_query; ++i) query[i] = conv[query[i]]; + iqual = malloc(l_query); + memset(iqual, q, l_query); + kpa_par_def.bw = b; + P = kpa_glocal(ref, l_ref, query, l_query, iqual, &kpa_par_alt, 0, 0); + fprintf(stderr, "%d\n", P); + free(iqual); + return 0; +} +#endif diff --git a/kprobaln.h b/kprobaln.h new file mode 100644 index 0000000..0357dcc --- /dev/null +++ b/kprobaln.h @@ -0,0 +1,49 @@ +/* The MIT License + + Copyright (c) 2003-2006, 2008, 2009 by Heng Li + + Permission is hereby granted, free of charge, to any person obtaining + a copy of this software and associated documentation files (the + "Software"), to deal in the Software without restriction, including + without limitation the rights to use, copy, modify, merge, publish, + distribute, sublicense, and/or sell copies of the Software, and to + permit persons to whom the Software is furnished to do so, subject to + the following conditions: + + The above copyright notice and this permission notice shall be + included in all copies or substantial portions of the Software. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. +*/ + +#ifndef LH3_KPROBALN_H_ +#define LH3_KPROBALN_H_ + +#include + +typedef struct { + float d, e; + int bw; +} kpa_par_t; + +#ifdef __cplusplus +extern "C" { +#endif + + int kpa_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual, + const kpa_par_t *c, int *state, uint8_t *q); + +#ifdef __cplusplus +} +#endif + +extern kpa_par_t kpa_par_def, kpa_par_alt; + +#endif diff --git a/misc/samtools.pl b/misc/samtools.pl index ce8449d..d03c1c7 100755 --- a/misc/samtools.pl +++ b/misc/samtools.pl @@ -10,7 +10,7 @@ my $version = '0.3.3'; &usage if (@ARGV < 1); my $command = shift(@ARGV); -my %func = (showALEN=>\&showALEN, pileup2fq=>\&pileup2fq, varFilter=>\&varFilter, +my %func = (showALEN=>\&showALEN, pileup2fq=>\&pileup2fq, varFilter=>\&varFilter, plp2vcf=>\&plp2vcf, unique=>\&unique, uniqcmp=>\&uniqcmp, sra2hdr=>\&sra2hdr, sam2fq=>\&sam2fq); die("Unknown command \"$command\".\n") if (!defined($func{$command})); @@ -473,6 +473,44 @@ sub uniqcmp_aux { close($fh); } +sub plp2vcf { + while (<>) { + my @t = split; + next if ($t[3] eq '*/*'); + if ($t[2] eq '*') { # indel + my @s = split("/", $t[3]); + my (@a, @b); + my ($ref, $alt); + for (@s) { + next if ($_ eq '*'); + if (/^-/) { + push(@a, 'N'.substr($_, 1)); + push(@b, 'N'); + } elsif (/^\+/) { + push(@a, 'N'); + push(@b, 'N'.substr($_, 1)); + } + } + if ($a[0] && $a[1]) { + if (length($a[0]) < length($a[1])) { + $ref = $a[1]; + $alt = ($b[0] . ('N' x (length($a[1]) - length($a[0])))) . ",$b[1]"; + } elsif (length($a[0]) > length($a[1])) { + $ref = $a[0]; + $alt = ($b[1] . ('N' x (length($a[0]) - length($a[1])))) . ",$b[0]"; + } else { + $ref = $a[0]; + $alt = ($b[0] eq $b[1])? $b[0] : "$b[0],$b[1]"; + } + } else { + $ref = $a[0]; $alt = $b[0]; + } + print join("\t", @t[0,1], '.', $ref, $alt, $t[5], '.', '.'), "\n"; + } else { # SNP + } + } +} + # # Usage # diff --git a/sam_view.c b/sam_view.c index d0fdad2..eb69449 100644 --- a/sam_view.c +++ b/sam_view.c @@ -9,6 +9,13 @@ #include "khash.h" KHASH_SET_INIT_STR(rg) +// When counting records instead of printing them, +// data passed to the bam_fetch callback is encapsulated in this struct. +typedef struct { + bam_header_t *header; + int *count; +} count_func_data_t; + typedef khash_t(rg) *rghash_t; rghash_t g_rghash = 0; @@ -54,7 +61,7 @@ static inline int __g_skip_aln(const bam_header_t *h, const bam1_t *b) return 0; } -// callback function for bam_fetch() +// callback function for bam_fetch() that prints nonskipped records static int view_func(const bam1_t *b, void *data) { if (!__g_skip_aln(((samfile_t*)data)->header, b)) @@ -62,19 +69,30 @@ static int view_func(const bam1_t *b, void *data) return 0; } +// callback function for bam_fetch() that counts nonskipped records +static int count_func(const bam1_t *b, void *data) +{ + if (!__g_skip_aln(((count_func_data_t*)data)->header, b)) { + (*((count_func_data_t*)data)->count)++; + } + return 0; +} + static int usage(int is_long_help); int main_samview(int argc, char *argv[]) { - int c, is_header = 0, is_header_only = 0, is_bamin = 1, ret = 0, is_uncompressed = 0, is_bamout = 0, slx2sngr = 0; + int c, is_header = 0, is_header_only = 0, is_bamin = 1, ret = 0, is_uncompressed = 0, is_bamout = 0, slx2sngr = 0, is_count = 0; int of_type = BAM_OFDEC, is_long_help = 0; + int count = 0; samfile_t *in = 0, *out = 0; char in_mode[5], out_mode[5], *fn_out = 0, *fn_list = 0, *fn_ref = 0, *fn_rg = 0; /* parse command-line options */ strcpy(in_mode, "r"); strcpy(out_mode, "w"); - while ((c = getopt(argc, argv, "Sbt:hHo:q:f:F:ul:r:xX?T:CR:")) >= 0) { + while ((c = getopt(argc, argv, "Sbct:hHo:q:f:F:ul:r:xX?T:CR:")) >= 0) { switch (c) { + case 'c': is_count = 1; break; case 'C': slx2sngr = 1; break; case 'S': is_bamin = 0; break; case 'b': is_bamout = 1; break; @@ -133,7 +151,7 @@ int main_samview(int argc, char *argv[]) ret = 1; goto view_end; } - if ((out = samopen(fn_out? fn_out : "-", out_mode, in->header)) == 0) { + if (!is_count && (out = samopen(fn_out? fn_out : "-", out_mode, in->header)) == 0) { fprintf(stderr, "[main_samview] fail to open \"%s\" for writing.\n", fn_out? fn_out : "standard output"); ret = 1; goto view_end; @@ -146,7 +164,8 @@ int main_samview(int argc, char *argv[]) while ((r = samread(in, b)) >= 0) { // read one alignment from `in' if (!__g_skip_aln(in->header, b)) { if (slx2sngr) sol2sanger(b); - samwrite(out, b); // write the alignment to `out' + if (!is_count) samwrite(out, b); // write the alignment to `out' + count++; } } if (r < -1) { @@ -164,14 +183,19 @@ int main_samview(int argc, char *argv[]) goto view_end; } for (i = optind + 1; i < argc; ++i) { - int tid, beg, end; + int tid, beg, end, result; bam_parse_region(in->header, argv[i], &tid, &beg, &end); // parse a region in the format like `chr2:100-200' if (tid < 0) { // reference name is not found fprintf(stderr, "[main_samview] region \"%s\" specifies an unknown reference name. Continue anyway.\n", argv[i]); continue; } // fetch alignments - if (bam_fetch(in->x.bam, idx, tid, beg, end, out, view_func) < 0) { + if (is_count) { + count_func_data_t count_data = { in->header, &count }; + result = bam_fetch(in->x.bam, idx, tid, beg, end, &count_data, count_func); + } else + result = bam_fetch(in->x.bam, idx, tid, beg, end, out, view_func); + if (result < 0) { fprintf(stderr, "[main_samview] retrieval of region \"%s\" failed due to truncated file or corrupt BAM index file\n", argv[i]); ret = 1; break; @@ -181,6 +205,9 @@ int main_samview(int argc, char *argv[]) } view_end: + if (is_count && ret == 0) { + printf("%d\n", count); + } // close files, free and return free(fn_list); free(fn_ref); free(fn_out); free(g_library); free(g_rg); free(fn_rg); if (g_rghash) { @@ -190,7 +217,8 @@ view_end: kh_destroy(rg, g_rghash); } samclose(in); - samclose(out); + if (!is_count) + samclose(out); return ret; } @@ -205,6 +233,7 @@ static int usage(int is_long_help) fprintf(stderr, " -u uncompressed BAM output (force -b)\n"); fprintf(stderr, " -x output FLAG in HEX (samtools-C specific)\n"); fprintf(stderr, " -X output FLAG in string (samtools-C specific)\n"); + fprintf(stderr, " -c print only the count of matching records\n"); fprintf(stderr, " -t FILE list of reference names and lengths (force -S) [null]\n"); fprintf(stderr, " -T FILE reference sequence file (force -S) [null]\n"); fprintf(stderr, " -o FILE output file name [stdout]\n"); diff --git a/samtools.1 b/samtools.1 index cc421d4..e87bef7 100644 --- a/samtools.1 +++ b/samtools.1 @@ -1,4 +1,4 @@ -.TH samtools 1 "27 October 2010" "samtools-0.1.9" "Bioinformatics tools" +.TH samtools 1 "15 November 2010" "samtools-0.1.10" "Bioinformatics tools" .SH NAME .PP samtools - Utilities for the Sequence Alignment/Map (SAM) format @@ -20,7 +20,7 @@ samtools faidx ref.fasta .PP samtools pileup -vcf ref.fasta aln.sorted.bam .PP -samtools mpileup -C50 -agf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam +samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam .PP samtools tview aln.sorted.bam ref.fasta @@ -47,7 +47,7 @@ entire alignment file unless it is asked to do so. .TP 10 .B view -samtools view [-bhuHS] [-t in.refList] [-o output] [-f reqFlag] [-F +samtools view [-bchuHS] [-t in.refList] [-o output] [-f reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r readGroup] [-R rgFile] | [region1 [...]] Extract/print all or sub alignments in SAM or BAM format. If no region @@ -65,11 +65,11 @@ format: `chr2' (the whole chr2), `chr2:1000000' (region starting from .B -b Output in the BAM format. .TP -.BI -f " INT" +.BI -f \ INT Only output alignments with all bits in INT present in the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0] .TP -.BI -F " INT" +.BI -F \ INT Skip alignments with bits present in INT [0] .TP .B -h @@ -78,19 +78,19 @@ Include the header in the output. .B -H Output the header only. .TP -.BI -l " STR" +.BI -l \ STR Only output reads in library STR [null] .TP -.BI -o " FILE" +.BI -o \ FILE Output file [stdout] .TP -.BI -q " INT" +.BI -q \ INT Skip alignments with MAPQ smaller than INT [0] .TP -.BI -r " STR" +.BI -r \ STR Only output reads in read group STR [null] .TP -.BI -R " FILE" +.BI -R \ FILE Output reads in read groups listed in .I FILE [null] @@ -100,7 +100,16 @@ Input is in SAM. If @SQ header lines are absent, the .B `-t' option is required. .TP -.BI -t " FILE" +.B -c +Instead of printing the alignments, only count them and print the +total number. All filter options, such as +.B `-f', +.B `-F' +and +.B `-q' +, are taken into account. +.TP +.BI -t \ FILE This file is TAB-delimited. Each line must contain the reference name and the length of the reference, one line for each distinct reference; additional fields are ignored. This file also defines the order of the @@ -126,135 +135,6 @@ press `?' for help and press `g' to check the alignment start from a region in the format like `chr10:10,000,000' or `=10,000,000' when viewing the same reference sequence. -.TP -.B pileup -samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] [-l -in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N nHap] [-r -pairDiffRate] [-m mask] [-d maxIndelDepth] [-G indelPrior] -| - -Print the alignment in the pileup format. In the pileup format, each -line represents a genomic position, consisting of chromosome name, -coordinate, reference base, read bases, read qualities and alignment -mapping qualities. Information on match, mismatch, indel, strand, -mapping quality and start and end of a read are all encoded at the read -base column. At this column, a dot stands for a match to the reference -base on the forward strand, a comma for a match on the reverse strand, -a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward -strand and `acgtn' for a mismatch on the reverse strand. A pattern -`\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this -reference position and the next reference position. The length of the -insertion is given by the integer in the pattern, followed by the -inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' -represents a deletion from the reference. The deleted bases will be -presented as `*' in the following lines. Also at the read base column, a -symbol `^' marks the start of a read. The ASCII of the character -following `^' minus 33 gives the mapping quality. A symbol `$' marks the -end of a read segment. - -If option -.B -c -is applied, the consensus base, Phred-scaled consensus quality, SNP -quality (i.e. the Phred-scaled probability of the consensus being -identical to the reference) and root mean square (RMS) mapping quality -of the reads covering the site will be inserted between the `reference -base' and the `read bases' columns. An indel occupies an additional -line. Each indel line consists of chromosome name, coordinate, a star, -the genotype, consensus quality, SNP quality, RMS mapping quality, # -covering reads, the first alllele, the second allele, # reads supporting -the first allele, # reads supporting the second allele and # reads -containing indels different from the top two alleles. - -The position of indels is offset by -1. - -.B OPTIONS: -.RS -.TP 10 -.B -B -Disable the BAQ computation. See the -.B mpileup -command for details. -.TP -.B -c -Call the consensus sequence using SOAPsnp consensus model. Options -.BR -T ", " -N ", " -I " and " -r -are only effective when -.BR -c " or " -g -is in use. -.TP -.BI -C " INT" -Coefficient for downgrading the mapping quality of poorly mapped -reads. See the -.B mpileup -command for details. [0] -.TP -.BI -d " INT" -Use the first -.I NUM -reads in the pileup for indel calling for speed up. Zero for unlimited. [1024] -.TP -.BI -f " FILE" -The reference sequence in the FASTA format. Index file -.I FILE.fai -will be created if -absent. -.TP -.B -g -Generate genotype likelihood in the binary GLFv3 format. This option -suppresses -c, -i and -s. This option is deprecated by the -.B mpileup -command. -.TP -.B -i -Only output pileup lines containing indels. -.TP -.BI -I " INT" -Phred probability of an indel in sequencing/prep. [40] -.TP -.BI -l " FILE" -List of sites at which pileup is output. This file is space -delimited. The first two columns are required to be chromosome and -1-based coordinate. Additional columns are ignored. It is -recommended to use option -.TP -.BI -m " INT" -Filter reads with flag containing bits in -.I INT -[1796] -.TP -.BI -M " INT" -Cap mapping quality at INT [60] -.TP -.BI -N " INT" -Number of haplotypes in the sample (>=2) [2] -.TP -.BI -r " FLOAT" -Expected fraction of differences between a pair of haplotypes [0.001] -.TP -.B -s -Print the mapping quality as the last column. This option makes the -output easier to parse, although this format is not space efficient. -.TP -.B -S -The input file is in SAM. -.TP -.BI -t " FILE" -List of reference names ane sequence lengths, in the format described -for the -.B import -command. If this option is present, samtools assumes the input -.I -is in SAM format; otherwise it assumes in BAM format. -.B -s -together with -.B -l -as in the default format we may not know the mapping quality. -.TP -.BI -T " FLOAT" -The theta parameter (error dependency coefficient) in the maq consensus -calling model [0.85] -.RE - .TP .B mpileup samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]] @@ -272,37 +152,64 @@ quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments. .TP -.BI -C " INT" +.BI -C \ INT Coefficient for downgrading mapping quality for reads containing excessive mismatches. Given a read with a phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this -functionality; if enabled, the recommended value is 50. [0] +functionality; if enabled, the recommended value for BWA is 50. [0] +.TP +.BI -e \ INT +Phred-scaled gap extension sequencing error probability. Reducing +.I INT +leads to longer indels. [20] .TP -.BI -f " FILE" +.BI -f \ FILE The reference file [null] .TP .B -g Compute genotype likelihoods and output them in the binary call format (BCF). .TP -.B -u -Similar to -.B -g -except that the output is uncompressed BCF, which is preferred for pipeing. -.TP -.BI -l " FILE" +.BI -h \ INT +Coefficient for modeling homopolymer errors. Given an +.IR l -long +homopolymer +run, the sequencing error of an indel of size +.I s +is modeled as +.IR INT * s / l . +[100] +.TP +.BI -l \ FILE File containing a list of sites where pileup or BCF is outputted [null] .TP -.BI -q " INT" +.BI -o \ INT +Phred-scaled gap open sequencing error probability. Reducing +.I INT +leads to more indel calls. [40] +.TP +.BI -P \ STR +Comma dilimited list of platforms (determined by +.BR @RG-PL ) +from which indel candidates are obtained. It is recommended to collect +indel candidates from sequencing technologies that have low indel error +rate such as ILLUMINA. [all] +.TP +.BI -q \ INT Minimum mapping quality for an alignment to be used [0] .TP -.BI -Q " INT" +.BI -Q \ INT Minimum base quality for a base to be considered [13] .TP -.BI -r " STR" +.BI -r \ STR Only generate pileup in region .I STR [all sites] +.TP +.B -u +Similar to +.B -g +except that the output is uncompressed BCF, which is preferred for piping. .RE .TP @@ -336,7 +243,7 @@ Output the final alignment to the standard output. .B -n Sort by read names rather than by chromosomal coordinates .TP -.B -m INT +.BI -m \ INT Approximately the maximum required memory. [500000000] .RE @@ -359,7 +266,7 @@ and the headers of other files will be ignored. .B OPTIONS: .RS .TP 8 -.BI -h " FILE" +.BI -h \ FILE Use the lines of .I FILE as `@' headers to be copied to @@ -370,7 +277,7 @@ replacing any header lines that would otherwise be copied from is actually in SAM format, though any alignment records it may contain are ignored.) .TP -.BI -R " STR" +.BI -R \ STR Merge files in the specified region indicated by .I STR .TP @@ -457,6 +364,11 @@ tag. Output SAM by default. .B OPTIONS: .RS .TP 8 +.B -A +When used jointly with +.B -r +this option overwrites the original base quality. +.TP 8 .B -e Convert a the read base to = if it is identical to the aligned reference base. Indel caller does not support the = bases at the moment. @@ -470,14 +382,143 @@ Output compressed BAM .B -S The input is SAM with header lines .TP -.BI -C " INT" +.BI -C \ INT Coefficient to cap mapping quality of poorly mapped reads. See the .B pileup command for details. [0] .TP .B -r -Perform probabilistic realignment to compute BAQ, which will be used to -cap base quality. +Compute the BQ tag without changing the base quality. +.RE + +.TP +.B pileup +samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] [-l +in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N nHap] [-r +pairDiffRate] [-m mask] [-d maxIndelDepth] [-G indelPrior] +| + +Print the alignment in the pileup format. In the pileup format, each +line represents a genomic position, consisting of chromosome name, +coordinate, reference base, read bases, read qualities and alignment +mapping qualities. Information on match, mismatch, indel, strand, +mapping quality and start and end of a read are all encoded at the read +base column. At this column, a dot stands for a match to the reference +base on the forward strand, a comma for a match on the reverse strand, +a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward +strand and `acgtn' for a mismatch on the reverse strand. A pattern +`\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this +reference position and the next reference position. The length of the +insertion is given by the integer in the pattern, followed by the +inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' +represents a deletion from the reference. The deleted bases will be +presented as `*' in the following lines. Also at the read base column, a +symbol `^' marks the start of a read. The ASCII of the character +following `^' minus 33 gives the mapping quality. A symbol `$' marks the +end of a read segment. + +If option +.B -c +is applied, the consensus base, Phred-scaled consensus quality, SNP +quality (i.e. the Phred-scaled probability of the consensus being +identical to the reference) and root mean square (RMS) mapping quality +of the reads covering the site will be inserted between the `reference +base' and the `read bases' columns. An indel occupies an additional +line. Each indel line consists of chromosome name, coordinate, a star, +the genotype, consensus quality, SNP quality, RMS mapping quality, # +covering reads, the first alllele, the second allele, # reads supporting +the first allele, # reads supporting the second allele and # reads +containing indels different from the top two alleles. + +.B NOTE: +Since 0.1.10, the `pileup' command is deprecated by `mpileup'. + +.B OPTIONS: +.RS +.TP 10 +.B -B +Disable the BAQ computation. See the +.B mpileup +command for details. +.TP +.B -c +Call the consensus sequence. Options +.BR -T ", " -N ", " -I " and " -r +are only effective when +.BR -c " or " -g +is in use. +.TP +.BI -C \ INT +Coefficient for downgrading the mapping quality of poorly mapped +reads. See the +.B mpileup +command for details. [0] +.TP +.BI -d \ INT +Use the first +.I NUM +reads in the pileup for indel calling for speed up. Zero for unlimited. [1024] +.TP +.BI -f \ FILE +The reference sequence in the FASTA format. Index file +.I FILE.fai +will be created if +absent. +.TP +.B -g +Generate genotype likelihood in the binary GLFv3 format. This option +suppresses -c, -i and -s. This option is deprecated by the +.B mpileup +command. +.TP +.B -i +Only output pileup lines containing indels. +.TP +.BI -I \ INT +Phred probability of an indel in sequencing/prep. [40] +.TP +.BI -l \ FILE +List of sites at which pileup is output. This file is space +delimited. The first two columns are required to be chromosome and +1-based coordinate. Additional columns are ignored. It is +recommended to use option +.TP +.BI -m \ INT +Filter reads with flag containing bits in +.I INT +[1796] +.TP +.BI -M \ INT +Cap mapping quality at INT [60] +.TP +.BI -N \ INT +Number of haplotypes in the sample (>=2) [2] +.TP +.BI -r \ FLOAT +Expected fraction of differences between a pair of haplotypes [0.001] +.TP +.B -s +Print the mapping quality as the last column. This option makes the +output easier to parse, although this format is not space efficient. +.TP +.B -S +The input file is in SAM. +.TP +.BI -t \ FILE +List of reference names ane sequence lengths, in the format described +for the +.B import +command. If this option is present, samtools assumes the input +.I +is in SAM format; otherwise it assumes in BAM format. +.B -s +together with +.B -l +as in the default format we may not know the mapping quality. +.TP +.BI -T \ FLOAT +The theta parameter (error dependency coefficient) in the maq consensus +calling model [0.85] .RE .SH SAM FORMAT @@ -573,9 +614,8 @@ will be attached .IP o 2 Call SNPs and short indels for one diploid individual: - samtools pileup -vcf ref.fa aln.bam > var.raw.plp - samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp - awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' var.flt.plp > var.final.plp + samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf + bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf The .B -D @@ -583,37 +623,37 @@ option of varFilter controls the maximum read depth, which should be adjusted to about twice the average read depth. One may consider to add .B -C50 to -.B pileup +.B mpileup if mapping quality is overestimated for reads containing excessive mismatches. Applying this option usually helps .B BWA-short -but may not other mappers. It also potentially increases reference -biases. +but may not other mappers. .IP o 2 -Call SNPs (not short indels) for multiple diploid individuals: +Call SNPs and short indels for multiple diploid individuals: - samtools mpileup -augf ref.fa *.bam | bcftools view -bcv - > snp.raw.bcf - bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 | bgzip > snp.flt.vcf.gz + samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view -bcvg - > var.raw.bcf + bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > var.flt.vcf Individuals are identified from the .B SM tags in the .B @RG header lines. Individuals can be pooled in one alignment file; one -individual can also be separated into multiple files. Similarly, one may -consider to apply -.B -C50 -to -.BR mpileup . -SNP calling in this way also works for single sample and has the -advantage of enabling more powerful filtering. The drawback is the lack -of short indel calling, which may be implemented in future. +individual can also be separated into multiple files. The +.B -P +option specifies that indel candidates should be collected only from +read groups with the +.B @RG-PL +tag set to +.IR ILLUMINA . +Collecting indel candidates from reads sequenced by an indel-prone +technology may affect the performance of indel calling. .IP o 2 Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals: - samtools mpileup -gf ref.fa *.bam > all.bcf + samtools mpileup -Igf ref.fa *.bam > all.bcf bcftools view -bl sites.list all.bcf > sites.bcf bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs @@ -630,7 +670,7 @@ commands estimate AFS by EM. .IP o 2 Dump BAQ applied alignment for other SNP callers: - samtools calmd -br aln.bam > aln.baq.bam + samtools calmd -bAr aln.bam > aln.baq.bam It adds and corrects the .B NM @@ -640,7 +680,7 @@ tags at the same time. The .B calmd command also comes with the .B -C -option, the same as the on in +option, the same as the one in .B pileup and .BR mpileup . @@ -666,8 +706,9 @@ although a little slower. .PP Heng Li from the Sanger Institute wrote the C version of samtools. Bob Handsaker from the Broad Institute implemented the BGZF library and Jue -Ruan from Beijing Genomics Institute wrote the RAZF library. Various -people in the 1000 Genomes Project contributed to the SAM format +Ruan from Beijing Genomics Institute wrote the RAZF library. John +Marshall and Petr Danecek contribute to the source code and various +people from the 1000 Genomes Project have contributed to the SAM format specification. .SH SEE ALSO diff --git a/samtools.txt b/samtools.txt index 09c1ccd..41cabe5 100644 --- a/samtools.txt +++ b/samtools.txt @@ -22,8 +22,7 @@ SYNOPSIS samtools pileup -vcf ref.fasta aln.sorted.bam - samtools mpileup -C50 -agf ref.fasta -r chr3:1,000-2,000 in1.bam - in2.bam + samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam samtools tview aln.sorted.bam ref.fasta @@ -48,7 +47,7 @@ DESCRIPTION COMMANDS AND OPTIONS - view samtools view [-bhuHS] [-t in.refList] [-o output] [-f + view samtools view [-bchuHS] [-t in.refList] [-o output] [-f reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read- Group] [-R rgFile] | [region1 [...]] @@ -90,6 +89,10 @@ COMMANDS AND OPTIONS -S Input is in SAM. If @SQ header lines are absent, the `-t' option is required. + -c Instead of printing the alignments, only count them + and print the total number. All filter options, such + as `-f', `-F' and `-q' , are taken into account. + -t FILE This file is TAB-delimited. Each line must contain the reference name and the length of the reference, one line for each distinct reference; additional @@ -112,109 +115,6 @@ COMMANDS AND OPTIONS reference sequence. - pileup samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] - [-l in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N - nHap] [-r pairDiffRate] [-m mask] [-d maxIndelDepth] [-G - indelPrior] | - - Print the alignment in the pileup format. In the pileup for- - mat, each line represents a genomic position, consisting of - chromosome name, coordinate, reference base, read bases, read - qualities and alignment mapping qualities. Information on - match, mismatch, indel, strand, mapping quality and start and - end of a read are all encoded at the read base column. At - this column, a dot stands for a match to the reference base - on the forward strand, a comma for a match on the reverse - strand, a '>' or '<' for a reference skip, `ACGTN' for a mis- - match on the forward strand and `acgtn' for a mismatch on the - reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates - there is an insertion between this reference position and the - next reference position. The length of the insertion is given - by the integer in the pattern, followed by the inserted - sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' repre- - sents a deletion from the reference. The deleted bases will - be presented as `*' in the following lines. Also at the read - base column, a symbol `^' marks the start of a read. The - ASCII of the character following `^' minus 33 gives the map- - ping quality. A symbol `$' marks the end of a read segment. - - If option -c is applied, the consensus base, Phred-scaled - consensus quality, SNP quality (i.e. the Phred-scaled proba- - bility of the consensus being identical to the reference) and - root mean square (RMS) mapping quality of the reads covering - the site will be inserted between the `reference base' and - the `read bases' columns. An indel occupies an additional - line. Each indel line consists of chromosome name, coordi- - nate, a star, the genotype, consensus quality, SNP quality, - RMS mapping quality, # covering reads, the first alllele, the - second allele, # reads supporting the first allele, # reads - supporting the second allele and # reads containing indels - different from the top two alleles. - - The position of indels is offset by -1. - - OPTIONS: - - -B Disable the BAQ computation. See the mpileup com- - mand for details. - - -c Call the consensus sequence using SOAPsnp consensus - model. Options -T, -N, -I and -r are only effective - when -c or -g is in use. - - -C INT Coefficient for downgrading the mapping quality of - poorly mapped reads. See the mpileup command for - details. [0] - - -d INT Use the first NUM reads in the pileup for indel - calling for speed up. Zero for unlimited. [1024] - - -f FILE The reference sequence in the FASTA format. Index - file FILE.fai will be created if absent. - - -g Generate genotype likelihood in the binary GLFv3 - format. This option suppresses -c, -i and -s. This - option is deprecated by the mpileup command. - - -i Only output pileup lines containing indels. - - -I INT Phred probability of an indel in sequencing/prep. - [40] - - -l FILE List of sites at which pileup is output. This file - is space delimited. The first two columns are - required to be chromosome and 1-based coordinate. - Additional columns are ignored. It is recommended - to use option - - -m INT Filter reads with flag containing bits in INT - [1796] - - -M INT Cap mapping quality at INT [60] - - -N INT Number of haplotypes in the sample (>=2) [2] - - -r FLOAT Expected fraction of differences between a pair of - haplotypes [0.001] - - -s Print the mapping quality as the last column. This - option makes the output easier to parse, although - this format is not space efficient. - - -S The input file is in SAM. - - -t FILE List of reference names ane sequence lengths, in - the format described for the import command. If - this option is present, samtools assumes the input - is in SAM format; otherwise it - assumes in BAM format. -s together with -l as in - the default format we may not know the mapping - quality. - - -T FLOAT The theta parameter (error dependency coefficient) - in the maq consensus calling model [0.85] - - mpileup samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]] @@ -237,27 +137,43 @@ COMMANDS AND OPTIONS phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this - functionality; if enabled, the recommended value is - 50. [0] + functionality; if enabled, the recommended value for + BWA is 50. [0] + + -e INT Phred-scaled gap extension sequencing error probabil- + ity. Reducing INT leads to longer indels. [20] -f FILE The reference file [null] - -g Compute genotype likelihoods and output them in the + -g Compute genotype likelihoods and output them in the binary call format (BCF). - -u Similar to -g except that the output is uncompressed - BCF, which is preferred for pipeing. + -h INT Coefficient for modeling homopolymer errors. Given an + l-long homopolymer run, the sequencing error of an + indel of size s is modeled as INT*s/l. [100] -l FILE File containing a list of sites where pileup or BCF is outputted [null] - -q INT Minimum mapping quality for an alignment to be used + -o INT Phred-scaled gap open sequencing error probability. + Reducing INT leads to more indel calls. [40] + + -P STR Comma dilimited list of platforms (determined by @RG- + PL) from which indel candidates are obtained. It is + recommended to collect indel candidates from sequenc- + ing technologies that have low indel error rate such + as ILLUMINA. [all] + + -q INT Minimum mapping quality for an alignment to be used [0] -Q INT Minimum base quality for a base to be considered [13] -r STR Only generate pileup in region STR [all sites] + -u Similar to -g except that the output is uncompressed + BCF, which is preferred for piping. + reheader samtools reheader @@ -368,8 +284,11 @@ COMMANDS AND OPTIONS OPTIONS: - -e Convert a the read base to = if it is identical to - the aligned reference base. Indel caller does not + -A When used jointly with -r this option overwrites the + original base quality. + + -e Convert a the read base to = if it is identical to + the aligned reference base. Indel caller does not support the = bases at the moment. -u Output uncompressed BAM @@ -378,15 +297,117 @@ COMMANDS AND OPTIONS -S The input is SAM with header lines - -C INT Coefficient to cap mapping quality of poorly mapped + -C INT Coefficient to cap mapping quality of poorly mapped reads. See the pileup command for details. [0] - -r Perform probabilistic realignment to compute BAQ, - which will be used to cap base quality. + -r Compute the BQ tag without changing the base quality. + + + pileup samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] + [-l in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N + nHap] [-r pairDiffRate] [-m mask] [-d maxIndelDepth] [-G + indelPrior] | + + Print the alignment in the pileup format. In the pileup for- + mat, each line represents a genomic position, consisting of + chromosome name, coordinate, reference base, read bases, read + qualities and alignment mapping qualities. Information on + match, mismatch, indel, strand, mapping quality and start and + end of a read are all encoded at the read base column. At + this column, a dot stands for a match to the reference base + on the forward strand, a comma for a match on the reverse + strand, a '>' or '<' for a reference skip, `ACGTN' for a mis- + match on the forward strand and `acgtn' for a mismatch on the + reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates + there is an insertion between this reference position and the + next reference position. The length of the insertion is given + by the integer in the pattern, followed by the inserted + sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' repre- + sents a deletion from the reference. The deleted bases will + be presented as `*' in the following lines. Also at the read + base column, a symbol `^' marks the start of a read. The + ASCII of the character following `^' minus 33 gives the map- + ping quality. A symbol `$' marks the end of a read segment. + + If option -c is applied, the consensus base, Phred-scaled + consensus quality, SNP quality (i.e. the Phred-scaled proba- + bility of the consensus being identical to the reference) and + root mean square (RMS) mapping quality of the reads covering + the site will be inserted between the `reference base' and + the `read bases' columns. An indel occupies an additional + line. Each indel line consists of chromosome name, coordi- + nate, a star, the genotype, consensus quality, SNP quality, + RMS mapping quality, # covering reads, the first alllele, the + second allele, # reads supporting the first allele, # reads + supporting the second allele and # reads containing indels + different from the top two alleles. + + NOTE: Since 0.1.10, the `pileup' command is deprecated by + `mpileup'. + + OPTIONS: + + -B Disable the BAQ computation. See the mpileup com- + mand for details. + + -c Call the consensus sequence. Options -T, -N, -I and + -r are only effective when -c or -g is in use. + + -C INT Coefficient for downgrading the mapping quality of + poorly mapped reads. See the mpileup command for + details. [0] + + -d INT Use the first NUM reads in the pileup for indel + calling for speed up. Zero for unlimited. [1024] + + -f FILE The reference sequence in the FASTA format. Index + file FILE.fai will be created if absent. + + -g Generate genotype likelihood in the binary GLFv3 + format. This option suppresses -c, -i and -s. This + option is deprecated by the mpileup command. + + -i Only output pileup lines containing indels. + + -I INT Phred probability of an indel in sequencing/prep. + [40] + + -l FILE List of sites at which pileup is output. This file + is space delimited. The first two columns are + required to be chromosome and 1-based coordinate. + Additional columns are ignored. It is recommended + to use option + + -m INT Filter reads with flag containing bits in INT + [1796] + + -M INT Cap mapping quality at INT [60] + + -N INT Number of haplotypes in the sample (>=2) [2] + + -r FLOAT Expected fraction of differences between a pair of + haplotypes [0.001] + + -s Print the mapping quality as the last column. This + option makes the output easier to parse, although + this format is not space efficient. + + -S The input file is in SAM. + + -t FILE List of reference names ane sequence lengths, in + the format described for the import command. If + this option is present, samtools assumes the input + is in SAM format; otherwise it + assumes in BAM format. -s together with -l as in + the default format we may not know the mapping + quality. + + -T FLOAT The theta parameter (error dependency coefficient) + in the maq consensus calling model [0.85] SAM FORMAT - SAM is TAB-delimited. Apart from the header lines, which are started + SAM is TAB-delimited. Apart from the header lines, which are started with the `@' symbol, each alignment line consists of: @@ -441,51 +462,50 @@ EXAMPLES o Attach the RG tag while merging sorted alignments: - perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu- + perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu- mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt samtools merge -rh rg.txt merged.bam ga.bam 454.bam The value in a RG tag is determined by the file name the read is com- - ing from. In this example, in the merged.bam, reads from ga.bam will - be attached RG:Z:ga, while reads from 454.bam will be attached + ing from. In this example, in the merged.bam, reads from ga.bam will + be attached RG:Z:ga, while reads from 454.bam will be attached RG:Z:454. o Call SNPs and short indels for one diploid individual: - samtools pileup -vcf ref.fa aln.bam > var.raw.plp - samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp - awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' var.flt.plp > - var.final.plp + samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > + var.raw.bcf + bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > + var.flt.vcf The -D option of varFilter controls the maximum read depth, which should be adjusted to about twice the average read depth. One may - consider to add -C50 to pileup if mapping quality is overestimated + consider to add -C50 to mpileup if mapping quality is overestimated for reads containing excessive mismatches. Applying this option usu- - ally helps BWA-short but may not other mappers. It also potentially - increases reference biases. + ally helps BWA-short but may not other mappers. - o Call SNPs (not short indels) for multiple diploid individuals: + o Call SNPs and short indels for multiple diploid individuals: - samtools mpileup -augf ref.fa *.bam | bcftools view -bcv - > - snp.raw.bcf - bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 | bgzip - > snp.flt.vcf.gz + samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view + -bcvg - > var.raw.bcf + bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > + var.flt.vcf - Individuals are identified from the SM tags in the @RG header lines. - Individuals can be pooled in one alignment file; one individual can - also be separated into multiple files. Similarly, one may consider to - apply -C50 to mpileup. SNP calling in this way also works for single - sample and has the advantage of enabling more powerful filtering. The - drawback is the lack of short indel calling, which may be implemented - in future. + Individuals are identified from the SM tags in the @RG header lines. + Individuals can be pooled in one alignment file; one individual can + also be separated into multiple files. The -P option specifies that + indel candidates should be collected only from read groups with the + @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads + sequenced by an indel-prone technology may affect the performance of + indel calling. - o Derive the allele frequency spectrum (AFS) on a list of sites from + o Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals: - samtools mpileup -gf ref.fa *.bam > all.bcf + samtools mpileup -Igf ref.fa *.bam > all.bcf bcftools view -bl sites.list all.bcf > sites.bcf bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs @@ -493,40 +513,41 @@ EXAMPLES ...... where sites.list contains the list of sites with each line consisting - of the reference sequence name and position. The following bcftools + of the reference sequence name and position. The following bcftools commands estimate AFS by EM. o Dump BAQ applied alignment for other SNP callers: - samtools calmd -br aln.bam > aln.baq.bam + samtools calmd -bAr aln.bam > aln.baq.bam - It adds and corrects the NM and MD tags at the same time. The calmd - command also comes with the -C option, the same as the on in pileup + It adds and corrects the NM and MD tags at the same time. The calmd + command also comes with the -C option, the same as the one in pileup and mpileup. Apply if it helps. LIMITATIONS - o Unaligned words used in bam_import.c, bam_endian.h, bam.c and + o Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c. - o In merging, the input files are required to have the same number of - reference sequences. The requirement can be relaxed. In addition, - merging does not reconstruct the header dictionaries automatically. - Endusers have to provide the correct header. Picard is better at + o In merging, the input files are required to have the same number of + reference sequences. The requirement can be relaxed. In addition, + merging does not reconstruct the header dictionaries automatically. + Endusers have to provide the correct header. Picard is better at merging. - o Samtools paired-end rmdup does not work for unpaired reads (e.g. - orphan reads or ends mapped to different chromosomes). If this is a - concern, please use Picard's MarkDuplicate which correctly handles + o Samtools paired-end rmdup does not work for unpaired reads (e.g. + orphan reads or ends mapped to different chromosomes). If this is a + concern, please use Picard's MarkDuplicate which correctly handles these cases, although a little slower. AUTHOR - Heng Li from the Sanger Institute wrote the C version of samtools. Bob + Heng Li from the Sanger Institute wrote the C version of samtools. Bob Handsaker from the Broad Institute implemented the BGZF library and Jue - Ruan from Beijing Genomics Institute wrote the RAZF library. Various - people in the 1000 Genomes Project contributed to the SAM format speci- + Ruan from Beijing Genomics Institute wrote the RAZF library. John Mar- + shall and Petr Danecek contribute to the source code and various people + from the 1000 Genomes Project have contributed to the SAM format speci- fication. @@ -535,4 +556,4 @@ SEE ALSO -samtools-0.1.9 27 October 2010 samtools(1) +samtools-0.1.10 15 November 2010 samtools(1)