+------------------------------------------------------------------------
+r828 | lh3lh3 | 2010-11-16 20:48:49 -0500 (Tue, 16 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bamtk.c
+
+update version information: samtools-0.1.9-20 (r828)
+
+------------------------------------------------------------------------
+r827 | lh3lh3 | 2010-11-16 15:32:50 -0500 (Tue, 16 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/call1.c
+
+bcftools: allow to skip indels
+
+------------------------------------------------------------------------
+r826 | lh3lh3 | 2010-11-16 14:11:58 -0500 (Tue, 16 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam_md.c
+
+remove ZQ if both BQ and ZQ are present
+
+------------------------------------------------------------------------
+r825 | lh3lh3 | 2010-11-16 13:51:33 -0500 (Tue, 16 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bam_md.c
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/samtools.1
+
+ * samtools-0.1.9-18 (r825)
+ * change the behaviour of calmd such that by default it does not change the base quality
+
+------------------------------------------------------------------------
+r824 | lh3lh3 | 2010-11-15 23:31:53 -0500 (Mon, 15 Nov 2010) | 4 lines
+Changed paths:
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/bcftools/call1.c
+ M /trunk/samtools/samtools.1
+
+ * samtools-0.1.9-17 (r824)
+ * added command line options to change the default parameters in indel calling
+ * update the manual
+
+------------------------------------------------------------------------
+r823 | lh3lh3 | 2010-11-15 12:20:13 -0500 (Mon, 15 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bam_md.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-r823
+ * the BQ tag is now 64 shifted, not 33 shifted
+
+------------------------------------------------------------------------
+r822 | lh3lh3 | 2010-11-15 00:30:18 -0500 (Mon, 15 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf.h
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/bcftools/vcfutils.pl
+ M /trunk/samtools/misc/samtools.pl
+
+ * samtools-0.1.9-16 (r822)
+ * keep the raw depth because in indel calling, DP4 may be way off the true depth
+
+------------------------------------------------------------------------
+r821 | lh3lh3 | 2010-11-13 01:18:31 -0500 (Sat, 13 Nov 2010) | 4 lines
+Changed paths:
+ M /trunk/samtools/bam_md.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-15 (r821)
+ * calmd: write BQ
+ * skip realignment if BQ is present
+
+------------------------------------------------------------------------
+r820 | lh3lh3 | 2010-11-13 01:08:26 -0500 (Sat, 13 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-14 (r820)
+ * penalize reads with excessive differences in indel calling
+
+------------------------------------------------------------------------
+r819 | lh3lh3 | 2010-11-12 21:36:27 -0500 (Fri, 12 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam_maqcns.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-13 (r819)
+ * fixed a bug in pileup given refskip
+
+------------------------------------------------------------------------
+r818 | lh3lh3 | 2010-11-12 13:04:53 -0500 (Fri, 12 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-r818
+ * for indel calling, do two rounds of probabilistic realignments
+
+------------------------------------------------------------------------
+r817 | lh3lh3 | 2010-11-11 20:04:07 -0500 (Thu, 11 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/bcftools/vcfutils.pl
+
+ * samtools-0.1.19-11 (r817)
+ * only initiate indel calling when 0.2% of reads contain a gap
+
+------------------------------------------------------------------------
+r816 | lh3lh3 | 2010-11-11 01:22:59 -0500 (Thu, 11 Nov 2010) | 7 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-10 (r816)
+
+ * I know why the forward method fails. it is because of zero base
+ qualities. when that is fixed, the forward method seems to give
+ better results than Viterbi, as it should be. I am tired...
+
+
+------------------------------------------------------------------------
+r815 | lh3lh3 | 2010-11-11 00:57:15 -0500 (Thu, 11 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/ChangeLog
+ M /trunk/samtools/bam2bcf_indel.c
+
+effectively revert to the viterbi version. The forward realignment gives too many false positives.
+
+------------------------------------------------------------------------
+r814 | lh3lh3 | 2010-11-11 00:18:02 -0500 (Thu, 11 Nov 2010) | 4 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bam_md.c
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-9 (r810)
+ * use forward, instead of viterbi, for realignment
+ * realignment is now quality aware
+
+------------------------------------------------------------------------
+r813 | lh3lh3 | 2010-11-10 22:45:24 -0500 (Wed, 10 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/kprobaln.c
+ M /trunk/samtools/kprobaln.h
+
+ * prepare to replace kaln with kprobaln in realignment
+
+------------------------------------------------------------------------
+r812 | lh3lh3 | 2010-11-10 17:28:50 -0500 (Wed, 10 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/bcf.c
+
+fixed a typo
+
+------------------------------------------------------------------------
+r811 | lh3lh3 | 2010-11-10 16:54:46 -0500 (Wed, 10 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/bcf.c
+ M /trunk/samtools/bcftools/bcf.h
+
+use zlib for direct reading when BCF_LITE is in use
+
+------------------------------------------------------------------------
+r810 | lh3lh3 | 2010-11-10 16:32:13 -0500 (Wed, 10 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+
+ * do not use reads containing too many mismatches for indel calling
+ * fixed a trivial bug in case of multi-allelic indels
+
+------------------------------------------------------------------------
+r809 | lh3lh3 | 2010-11-10 13:23:02 -0500 (Wed, 10 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-8 (r809)
+ * fixed a bug in the indel caller
+
+------------------------------------------------------------------------
+r808 | lh3lh3 | 2010-11-10 12:24:10 -0500 (Wed, 10 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/Makefile
+
+minor change to makefile
+
+------------------------------------------------------------------------
+r807 | lh3lh3 | 2010-11-10 12:10:21 -0500 (Wed, 10 Nov 2010) | 4 lines
+Changed paths:
+ M /trunk/samtools/Makefile
+ M /trunk/samtools/bam2bcf.h
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/bcftools/vcfutils.pl
+
+ * samtools-0.1.9-8 (r807)
+ * collect indel candidates only from specified platforms (@RG-PL)
+ * merge varFilter and filter4vcf in vcfutils.pl
+
+------------------------------------------------------------------------
+r806 | lh3lh3 | 2010-11-09 22:05:46 -0500 (Tue, 09 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/call1.c
+ M /trunk/samtools/bcftools/prob1.c
+ M /trunk/samtools/bcftools/prob1.h
+
+bcftools: compute equal-tail (Bayesian) credible interval
+
+------------------------------------------------------------------------
+r805 | lh3lh3 | 2010-11-09 16:28:39 -0500 (Tue, 09 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/vcfutils.pl
+
+added a double-hit filter to avoid overestimated indel likelihood
+
+------------------------------------------------------------------------
+r804 | lh3lh3 | 2010-11-09 14:12:06 -0500 (Tue, 09 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-7 (r804)
+ * fixed a bug in the gap caller
+
+------------------------------------------------------------------------
+r803 | lh3lh3 | 2010-11-09 10:45:33 -0500 (Tue, 09 Nov 2010) | 4 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/bcftools/bcf.c
+ M /trunk/samtools/bcftools/bcf.h
+ M /trunk/samtools/bcftools/prob1.c
+
+ * samtools-0.1.9-6 (r803)
+ * mpileup: apply homopolymer correction when calculating GL, instead of before
+ * bcftools: apply a different prior to indels
+
+------------------------------------------------------------------------
+r802 | lh3lh3 | 2010-11-08 23:53:15 -0500 (Mon, 08 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-5 (r802)
+ * relax tandem penalty. this will be made a command-line option in future.
+
+------------------------------------------------------------------------
+r801 | lh3lh3 | 2010-11-08 23:35:52 -0500 (Mon, 08 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-4 (r801)
+ * fixed a minor issue in printing indel VCF
+
+------------------------------------------------------------------------
+r800 | lh3lh3 | 2010-11-08 15:28:14 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bcftools/vcfutils.pl
+
+fixed another silly bug in mpileup's indel caller
+
+------------------------------------------------------------------------
+r799 | lh3lh3 | 2010-11-08 14:28:27 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+
+fixed a silly bug in the indel caller
+
+------------------------------------------------------------------------
+r798 | lh3lh3 | 2010-11-08 14:07:33 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/ChangeLog
+ M /trunk/samtools/sam_view.c
+ M /trunk/samtools/samtools.1
+
+Incorporate patches by Marcel Martin for read counting.
+
+------------------------------------------------------------------------
+r797 | lh3lh3 | 2010-11-08 13:39:52 -0500 (Mon, 08 Nov 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf.h
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-2 (r797)
+ * mpileup: indel calling seems to be working
+
+------------------------------------------------------------------------
+r796 | lh3lh3 | 2010-11-08 10:54:46 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf.h
+ M /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/kaln.c
+
+indel calling is apparently working, but more information needs to be collected
+
+------------------------------------------------------------------------
+r795 | lh3lh3 | 2010-11-08 00:39:18 -0500 (Mon, 08 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf_indel.c
+
+fixed a few bugs in the indel caller. Probably there are more.
+
+------------------------------------------------------------------------
+r794 | lh3lh3 | 2010-11-07 22:23:16 -0500 (Sun, 07 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/Makefile
+ M /trunk/samtools/bam.h
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf.h
+ A /trunk/samtools/bam2bcf_indel.c
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/kaln.c
+ M /trunk/samtools/kaln.h
+
+prepare for the indel caller. It is not ready yet.
+
+------------------------------------------------------------------------
+r793 | lh3lh3 | 2010-11-05 11:28:23 -0400 (Fri, 05 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf.h
+ M /trunk/samtools/bam_plcmd.c
+
+Revert to r790. The recent changes are not good...
+
+------------------------------------------------------------------------
+r792 | lh3lh3 | 2010-11-05 00:19:14 -0400 (Fri, 05 Nov 2010) | 6 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf.h
+ M /trunk/samtools/bam_plcmd.c
+
+ * this revision is UNSTABLE
+
+ * indel caller seems working, but it is very insensitive and has
+ several things I do not quite understand.
+
+
+------------------------------------------------------------------------
+r791 | lh3lh3 | 2010-11-04 22:58:43 -0400 (Thu, 04 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf.h
+ M /trunk/samtools/bam_plcmd.c
+
+for backup. no effective changes
+
+------------------------------------------------------------------------
+r790 | lh3lh3 | 2010-11-03 15:51:24 -0400 (Wed, 03 Nov 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bcftools/vcfutils.pl
+ M /trunk/samtools/kprobaln.c
+
+fixed a minor problem in the example coming with kprobaln.c
+
+------------------------------------------------------------------------
+r789 | lh3lh3 | 2010-11-02 15:41:27 -0400 (Tue, 02 Nov 2010) | 4 lines
+Changed paths:
+ M /trunk/samtools/Makefile
+ M /trunk/samtools/bam_md.c
+ M /trunk/samtools/kaln.c
+ M /trunk/samtools/kaln.h
+ A /trunk/samtools/kprobaln.c
+ A /trunk/samtools/kprobaln.h
+
+Separate kaln and kprobaln as I am preparing further changes. At
+present, the results should be identical to the previous.
+
+
+------------------------------------------------------------------------
+r788 | petulda | 2010-11-02 12:19:04 -0400 (Tue, 02 Nov 2010) | 1 line
+Changed paths:
+ M /trunk/samtools/bam_plcmd.c
+
+Added -b option: read file names from a file
+------------------------------------------------------------------------
+r787 | lh3lh3 | 2010-10-29 23:17:22 -0400 (Fri, 29 Oct 2010) | 7 lines
+Changed paths:
+ M /trunk/samtools/bam.h
+ M /trunk/samtools/bam_pileup.c
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+
+ * samtools-0.1.9-2 (r787)
+
+ * Allow to set a maximum per-sample depth to reduce memory. However,
+ BAQ computation is still applied to every read. The speed is not
+ improved.
+
+
+------------------------------------------------------------------------
+r786 | lh3lh3 | 2010-10-29 12:10:40 -0400 (Fri, 29 Oct 2010) | 3 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf.h
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/bcftools/bcf.c
+ M /trunk/samtools/bcftools/vcf.c
+
+ * samtools-0.1.9-1 (r786)
+ * samtools: optionally perform exact test for each sample
+
+------------------------------------------------------------------------
+r785 | lh3lh3 | 2010-10-29 09:42:25 -0400 (Fri, 29 Oct 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/bam2bcf.c
+ M /trunk/samtools/bam2bcf.h
+ M /trunk/samtools/bam_plcmd.c
+ M /trunk/samtools/bcftools/bcf.c
+
+Optionally output "DP", the individual read depth
+
+------------------------------------------------------------------------
+r784 | lh3lh3 | 2010-10-27 23:10:27 -0400 (Wed, 27 Oct 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/samtools.1
+
+acknowledge Petr and John who have greatly contributed to the project.
+
+------------------------------------------------------------------------
+r783 | lh3lh3 | 2010-10-27 22:47:47 -0400 (Wed, 27 Oct 2010) | 2 lines
+Changed paths:
+ M /trunk/samtools/ChangeLog
+ M /trunk/samtools/NEWS
+ M /trunk/samtools/bamtk.c
+ M /trunk/samtools/samtools.1
+
+Release samtools-0.1.9 (r783)
+
------------------------------------------------------------------------
r782 | lh3lh3 | 2010-10-27 19:58:54 -0400 (Wed, 27 Oct 2010) | 2 lines
Changed paths:
KNETFILE_O= knetfile.o
LOBJS= bgzf.o kstring.o bam_aux.o bam.o bam_import.o sam.o bam_index.o \
bam_pileup.o bam_lpileup.o bam_md.o glf.o razf.o faidx.o \
- $(KNETFILE_O) bam_sort.o sam_header.o bam_reheader.o
+ $(KNETFILE_O) bam_sort.o sam_header.o bam_reheader.o kprobaln.o
AOBJS= bam_tview.o bam_maqcns.o bam_plcmd.o sam_view.o \
bam_rmdup.o bam_rmdupse.o bam_mate.o bam_stat.o bam_color.o \
- bamtk.o kaln.o bam2bcf.o errmod.o sample.o
+ bamtk.o kaln.o bam2bcf.o bam2bcf_indel.o errmod.o sample.o
PROG= samtools
INCLUDES= -I.
SUBDIRS= . bcftools misc
sam_header.o:sam_header.h khash.h
bcf.o:bcftools/bcf.h
bam2bcf.o:bam2bcf.h errmod.h bcftools/bcf.h
+bam2bcf_indel.o:bam2bcf.h
errmod.o:errmod.h
faidx.o:faidx.h razf.h khash.h
+Beta Release 0.1.10 (16 November, 2010)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This release is featured as the first major improvement to the indel
+caller. The method is similar to the old one implemented in the pileup
+command, but the details are handled more carefully both in theory and
+in practice. As a result, the new indel caller usually gives more
+accurate indel calls, though at the cost of sensitivity. The caller is
+implemented in the mpileup command and is invoked by default. It works
+with multiple samples.
+
+Other notable changes:
+
+ * With the -r option, the calmd command writes the difference between
+ the original base quality and the BAQ capped base quality at the BQ
+ tag but does not modify the base quality. Please use -Ar to overwrite
+ the original base quality (the 0.1.9 behavior).
+
+ * Allow to set a maximum per-sample read depth to reduce memory. In
+ 0.1.9, most of memory is wasted for the ultra high read depth in some
+ regions (e.g. the chr1 centromere).
+
+ * Optionally write per-sample read depth and per-sample strand bias
+ P-value.
+
+ * Compute equal-tail (Bayesian) credible interval of site allele
+ frequency at the CI95 VCF annotation.
+
+ * Merged the vcfutils.pl varFilter and filter4vcf for better SNP/indel
+ filtering.
+
+(0.1.10: 16 November 2010, r829)
+
+
+
Beta Release 0.1.9 (27 October, 2010)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bam1_t *b;
int32_t qpos;
int indel, level;
- uint32_t is_del:1, is_head:1, is_tail:1, is_refskip:1;
+ uint32_t is_del:1, is_head:1, is_tail:1, is_refskip:1, aux:28;
} bam_pileup1_t;
typedef int (*bam_plp_auto_f)(void *data, bam1_t *b);
const bam_pileup1_t *bam_plp_next(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp);
const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp);
void bam_plp_set_mask(bam_plp_t iter, int mask);
+ void bam_plp_set_maxcnt(bam_plp_t iter, int maxcnt);
void bam_plp_reset(bam_plp_t iter);
void bam_plp_destroy(bam_plp_t iter);
bam_mplp_t bam_mplp_init(int n, bam_plp_auto_f func, void **data);
void bam_mplp_destroy(bam_mplp_t iter);
+ void bam_mplp_set_maxcnt(bam_mplp_t iter, int maxcnt);
int bam_mplp_auto(bam_mplp_t iter, int *_tid, int *_pos, int *n_plp, const bam_pileup1_t **plp);
/*! @typedef
{
uint8_t *data = bdst->data;
int m_data = bdst->m_data; // backup data and m_data
- if (m_data < bsrc->m_data) { // double the capacity
- m_data = bsrc->m_data; kroundup32(m_data);
+ if (m_data < bsrc->data_len) { // double the capacity
+ m_data = bsrc->data_len; kroundup32(m_data);
data = (uint8_t*)realloc(data, m_data);
}
memcpy(data, bsrc->data, bsrc->data_len); // copy var-len data
#define CAP_DIST 25
-struct __bcf_callaux_t {
- int max_bases, capQ, min_baseQ;
- uint16_t *bases;
- errmod_t *e;
-};
-
bcf_callaux_t *bcf_call_init(double theta, int min_baseQ)
{
bcf_callaux_t *bca;
if (theta <= 0.) theta = CALL_DEFTHETA;
bca = calloc(1, sizeof(bcf_callaux_t));
bca->capQ = 60;
+ bca->openQ = 40; bca->extQ = 20; bca->tandemQ = 100;
bca->min_baseQ = min_baseQ;
bca->e = errmod_init(1. - theta);
return bca;
{
if (bca == 0) return;
errmod_destroy(bca->e);
- free(bca->bases); free(bca);
+ free(bca->bases); free(bca->inscns); free(bca);
}
-
-int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf_callaux_t *bca, bcf_callret1_t *r)
+/* ref_base is the 4-bit representation of the reference base. It is
+ * negative if we are looking at an indel. */
+int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base, bcf_callaux_t *bca, bcf_callret1_t *r)
{
- int i, n, ref4;
+ int i, n, ref4, is_indel, ori_depth = 0;
memset(r, 0, sizeof(bcf_callret1_t));
- ref4 = bam_nt16_nt4_table[ref_base];
+ if (ref_base >= 0) {
+ ref4 = bam_nt16_nt4_table[ref_base];
+ is_indel = 0;
+ } else ref4 = 4, is_indel = 1;
if (_n == 0) return -1;
-
// enlarge the bases array if necessary
if (bca->max_bases < _n) {
bca->max_bases = _n;
memset(r, 0, sizeof(bcf_callret1_t));
for (i = n = 0; i < _n; ++i) {
const bam_pileup1_t *p = pl + i;
- int q, b, mapQ, baseQ, is_diff, min_dist;
+ int q, b, mapQ, baseQ, is_diff, min_dist, seqQ;
// set base
- if (p->is_del || (p->b->core.flag&BAM_FUNMAP)) continue; // skip unmapped reads and deleted bases
- baseQ = q = (int)bam1_qual(p->b)[p->qpos]; // base quality
+ if (p->is_del || p->is_refskip || (p->b->core.flag&BAM_FUNMAP)) continue;
+ ++ori_depth;
+ baseQ = q = is_indel? p->aux&0xff : (int)bam1_qual(p->b)[p->qpos]; // base/indel quality
+ seqQ = is_indel? (p->aux>>8&0xff) : 99;
if (q < bca->min_baseQ) continue;
+ if (q > seqQ) q = seqQ;
mapQ = p->b->core.qual < bca->capQ? p->b->core.qual : bca->capQ;
if (q > mapQ) q = mapQ;
if (q > 63) q = 63;
if (q < 4) q = 4;
- b = bam1_seqi(bam1_seq(p->b), p->qpos); // base
- b = bam_nt16_nt4_table[b? b : ref_base]; // b is the 2-bit base
+ if (!is_indel) {
+ b = bam1_seqi(bam1_seq(p->b), p->qpos); // base
+ b = bam_nt16_nt4_table[b? b : ref_base]; // b is the 2-bit base
+ is_diff = (ref4 < 4 && b == ref4)? 0 : 1;
+ } else {
+ b = p->aux>>16&0x3f;
+ is_diff = (b != 0);
+ }
bca->bases[n++] = q<<5 | (int)bam1_strand(p->b)<<4 | b;
// collect annotations
r->qsum[b] += q;
- is_diff = (ref4 < 4 && b == ref4)? 0 : 1;
++r->anno[0<<2|is_diff<<1|bam1_strand(p->b)];
min_dist = p->b->core.l_qseq - 1 - p->qpos;
if (min_dist > p->qpos) min_dist = p->qpos;
r->anno[3<<2|is_diff<<1|0] += min_dist;
r->anno[3<<2|is_diff<<1|1] += min_dist * min_dist;
}
- r->depth = n;
+ r->depth = n; r->ori_depth = ori_depth;
// glfgen
errmod_cal(bca->e, n, 5, bca->bases, r->p);
return r->depth;
{
int ref4, i, j, qsum[4];
int64_t tmp;
- call->ori_ref = ref4 = bam_nt16_nt4_table[ref_base];
- if (ref4 > 4) ref4 = 4;
+ if (ref_base >= 0) {
+ call->ori_ref = ref4 = bam_nt16_nt4_table[ref_base];
+ if (ref4 > 4) ref4 = 4;
+ } else call->ori_ref = -1, ref4 = 0;
// calculate qsum
memset(qsum, 0, 4 * sizeof(int));
for (i = 0; i < n; ++i)
else break;
}
}
- if (((ref4 < 4 && j < 4) || (ref4 == 4 && j < 5)) && i >= 0)
- call->unseen = j, call->a[j++] = qsum[i]&3;
- call->n_alleles = j;
+ if (ref_base >= 0) { // for SNPs, find the "unseen" base
+ if (((ref4 < 4 && j < 4) || (ref4 == 4 && j < 5)) && i >= 0)
+ call->unseen = j, call->a[j++] = qsum[i]&3;
+ call->n_alleles = j;
+ } else {
+ call->n_alleles = j;
+ if (call->n_alleles == 1) return -1; // no reliable supporting read. stop doing anything
+ }
// set the PL array
if (call->n < n) {
call->n = n;
PL[j] = y;
}
}
+// if (ref_base < 0) fprintf(stderr, "%d,%d,%f,%d\n", call->n_alleles, x, sum_min, call->unseen);
call->shift = (int)(sum_min + .499);
}
// combine annotations
memset(call->anno, 0, 16 * sizeof(int));
- for (i = call->depth = 0, tmp = 0; i < n; ++i) {
+ for (i = call->depth = call->ori_depth = 0, tmp = 0; i < n; ++i) {
call->depth += calls[i].depth;
+ call->ori_depth += calls[i].ori_depth;
for (j = 0; j < 16; ++j) call->anno[j] += calls[i].anno[j];
}
return 0;
}
-int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b)
+int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b, bcf_callret1_t *bcr, int is_SP,
+ const bcf_callaux_t *bca, const char *ref)
{
+ extern double kt_fisher_exact(int n11, int n12, int n21, int n22, double *_left, double *_right, double *two);
kstring_t s;
- int i;
+ int i, j;
b->n_smpl = bc->n;
b->tid = tid; b->pos = pos; b->qual = 0;
s.s = b->str; s.m = b->m_str; s.l = 0;
kputc('\0', &s);
- kputc("ACGTN"[bc->ori_ref], &s); kputc('\0', &s);
- for (i = 1; i < 5; ++i) {
- if (bc->a[i] < 0) break;
- if (i > 1) kputc(',', &s);
- kputc(bc->unseen == i? 'X' : "ACGT"[bc->a[i]], &s);
+ if (bc->ori_ref < 0) { // an indel
+ // write REF
+ kputc(ref[pos], &s);
+ for (j = 0; j < bca->indelreg; ++j) kputc(ref[pos+1+j], &s);
+ kputc('\0', &s);
+ // write ALT
+ kputc(ref[pos], &s);
+ for (i = 1; i < 4; ++i) {
+ if (bc->a[i] < 0) break;
+ if (i > 1) {
+ kputc(',', &s); kputc(ref[pos], &s);
+ }
+ if (bca->indel_types[bc->a[i]] < 0) { // deletion
+ for (j = -bca->indel_types[bc->a[i]]; j < bca->indelreg; ++j)
+ kputc(ref[pos+1+j], &s);
+ } else { // insertion; cannot be a reference unless a bug
+ char *inscns = &bca->inscns[bc->a[i] * bca->maxins];
+ for (j = 0; j < bca->indel_types[bc->a[i]]; ++j)
+ kputc("ACGTN"[(int)inscns[j]], &s);
+ for (j = 0; j < bca->indelreg; ++j) kputc(ref[pos+1+j], &s);
+ }
+ }
+ kputc('\0', &s);
+ } else { // a SNP
+ kputc("ACGTN"[bc->ori_ref], &s); kputc('\0', &s);
+ for (i = 1; i < 5; ++i) {
+ if (bc->a[i] < 0) break;
+ if (i > 1) kputc(',', &s);
+ kputc(bc->unseen == i? 'X' : "ACGT"[bc->a[i]], &s);
+ }
+ kputc('\0', &s);
}
kputc('\0', &s);
- kputc('\0', &s);
// INFO
- kputs("I16=", &s);
+ if (bc->ori_ref < 0) kputs("INDEL;", &s);
+ kputs("DP=", &s); kputw(bc->ori_depth, &s); kputs(";I16=", &s);
for (i = 0; i < 16; ++i) {
if (i) kputc(',', &s);
kputw(bc->anno[i], &s);
}
kputc('\0', &s);
// FMT
- kputs("PL", &s); kputc('\0', &s);
+ kputs("PL", &s);
+ if (bcr) {
+ kputs(":DP", &s);
+ if (is_SP) kputs(":SP", &s);
+ }
+ kputc('\0', &s);
b->m_str = s.m; b->str = s.s; b->l_str = s.l;
bcf_sync(b);
memcpy(b->gi[0].data, bc->PL, b->gi[0].len * bc->n);
+ if (bcr) {
+ uint16_t *dp = (uint16_t*)b->gi[1].data;
+ uint8_t *sp = is_SP? b->gi[2].data : 0;
+ for (i = 0; i < bc->n; ++i) {
+ bcf_callret1_t *p = bcr + i;
+ dp[i] = p->depth < 0xffff? p->depth : 0xffff;
+ if (is_SP) {
+ if (p->anno[0] + p->anno[1] < 2 || p->anno[2] + p->anno[3] < 2
+ || p->anno[0] + p->anno[2] < 2 || p->anno[1] + p->anno[3] < 2)
+ {
+ sp[i] = 0;
+ } else {
+ double left, right, two;
+ int x;
+ kt_fisher_exact(p->anno[0], p->anno[1], p->anno[2], p->anno[3], &left, &right, &two);
+ x = (int)(-4.343 * log(two) + .499);
+ if (x > 255) x = 255;
+ sp[i] = x;
+ }
+ }
+ }
+ }
return 0;
}
#define BAM2BCF_H
#include <stdint.h>
+#include "errmod.h"
#include "bcftools/bcf.h"
-struct __bcf_callaux_t;
-typedef struct __bcf_callaux_t bcf_callaux_t;
+#define B2B_INDEL_NULL 10000
+
+typedef struct __bcf_callaux_t {
+ int capQ, min_baseQ;
+ int openQ, extQ, tandemQ;
+ // for internal uses
+ int max_bases;
+ int indel_types[4];
+ int maxins, indelreg;
+ char *inscns;
+ uint16_t *bases;
+ errmod_t *e;
+ void *rghash;
+} bcf_callaux_t;
typedef struct {
- int depth, qsum[4];
+ int depth, ori_depth, qsum[4];
int anno[16];
float p[25];
} bcf_callret1_t;
typedef struct {
int a[5]; // alleles: ref, alt, alt2, alt3
int n, n_alleles, shift, ori_ref, unseen;
- int anno[16], depth;
+ int anno[16], depth, ori_depth;
uint8_t *PL;
} bcf_call_t;
bcf_callaux_t *bcf_call_init(double theta, int min_baseQ);
void bcf_call_destroy(bcf_callaux_t *bca);
- int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base /*4-bit*/, bcf_callaux_t *bca, bcf_callret1_t *r);
+ int bcf_call_glfgen(int _n, const bam_pileup1_t *pl, int ref_base, bcf_callaux_t *bca, bcf_callret1_t *r);
int bcf_call_combine(int n, const bcf_callret1_t *calls, int ref_base /*4-bit*/, bcf_call_t *call);
- int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b);
+ int bcf_call2bcf(int tid, int pos, bcf_call_t *bc, bcf1_t *b, bcf_callret1_t *bcr, int is_SP,
+ const bcf_callaux_t *bca, const char *ref);
+ int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_callaux_t *bca, const char *ref,
+ const void *rghash);
#ifdef __cplusplus
}
--- /dev/null
+#include <assert.h>
+#include <ctype.h>
+#include <string.h>
+#include "bam.h"
+#include "bam2bcf.h"
+#include "ksort.h"
+#include "kaln.h"
+#include "kprobaln.h"
+#include "khash.h"
+KHASH_SET_INIT_STR(rg)
+
+#define MINUS_CONST 0x10000000
+#define INDEL_WINDOW_SIZE 50
+#define MIN_SUPPORT_COEF 500
+
+void *bcf_call_add_rg(void *_hash, const char *hdtext, const char *list)
+{
+ const char *s, *p, *q, *r, *t;
+ khash_t(rg) *hash;
+ if (list == 0 || hdtext == 0) return _hash;
+ if (_hash == 0) _hash = kh_init(rg);
+ hash = (khash_t(rg)*)_hash;
+ if ((s = strstr(hdtext, "@RG\t")) == 0) return hash;
+ do {
+ t = strstr(s + 4, "@RG\t"); // the next @RG
+ if ((p = strstr(s, "\tID:")) != 0) p += 4;
+ if ((q = strstr(s, "\tPL:")) != 0) q += 4;
+ if (p && q && (t == 0 || (p < t && q < t))) { // ID and PL are both present
+ int lp, lq;
+ char *x;
+ for (r = p; *r && *r != '\t' && *r != '\n'; ++r); lp = r - p;
+ for (r = q; *r && *r != '\t' && *r != '\n'; ++r); lq = r - q;
+ x = calloc((lp > lq? lp : lq) + 1, 1);
+ for (r = q; *r && *r != '\t' && *r != '\n'; ++r) x[r-q] = *r;
+ if (strstr(list, x)) { // insert ID to the hash table
+ khint_t k;
+ int ret;
+ for (r = p; *r && *r != '\t' && *r != '\n'; ++r) x[r-p] = *r;
+ x[r-p] = 0;
+ k = kh_get(rg, hash, x);
+ if (k == kh_end(hash)) k = kh_put(rg, hash, x, &ret);
+ else free(x);
+ } else free(x);
+ }
+ s = t;
+ } while (s);
+ return hash;
+}
+
+void bcf_call_del_rghash(void *_hash)
+{
+ khint_t k;
+ khash_t(rg) *hash = (khash_t(rg)*)_hash;
+ if (hash == 0) return;
+ for (k = kh_begin(hash); k < kh_end(hash); ++k)
+ if (kh_exist(hash, k))
+ free((char*)kh_key(hash, k));
+ kh_destroy(rg, hash);
+}
+
+static int tpos2qpos(const bam1_core_t *c, const uint32_t *cigar, int32_t tpos, int is_left, int32_t *_tpos)
+{
+ int k, x = c->pos, y = 0, last_y = 0;
+ *_tpos = c->pos;
+ for (k = 0; k < c->n_cigar; ++k) {
+ int op = cigar[k] & BAM_CIGAR_MASK;
+ int l = cigar[k] >> BAM_CIGAR_SHIFT;
+ if (op == BAM_CMATCH) {
+ if (c->pos > tpos) return y;
+ if (x + l > tpos) {
+ *_tpos = tpos;
+ return y + (tpos - x);
+ }
+ x += l; y += l;
+ last_y = y;
+ } else if (op == BAM_CINS || op == BAM_CSOFT_CLIP) y += l;
+ else if (op == BAM_CDEL || op == BAM_CREF_SKIP) {
+ if (x + l > tpos) {
+ *_tpos = is_left? x : x + l;
+ return y;
+ }
+ x += l;
+ }
+ }
+ *_tpos = x;
+ return last_y;
+}
+// FIXME: check if the inserted sequence is consistent with the homopolymer run
+// l is the relative gap length and l_run is the length of the homopolymer on the reference
+static inline int est_seqQ(const bcf_callaux_t *bca, int l, int l_run)
+{
+ int q, qh;
+ q = bca->openQ + bca->extQ * (abs(l) - 1);
+ qh = l_run >= 3? (int)(bca->tandemQ * (double)abs(l) / l_run + .499) : 1000;
+ return q < qh? q : qh;
+}
+
+static inline int est_indelreg(int pos, const char *ref, int l, char *ins4)
+{
+ int i, j, max = 0, max_i = pos, score = 0;
+ l = abs(l);
+ for (i = pos + 1, j = 0; ref[i]; ++i, ++j) {
+ if (ins4) score += (toupper(ref[i]) != "ACGTN"[(int)ins4[j%l]])? -10 : 1;
+ else score += (toupper(ref[i]) != toupper(ref[pos+1+j%l]))? -10 : 1;
+ if (score < 0) break;
+ if (max < score) max = score, max_i = i;
+ }
+ return max_i - pos;
+}
+
+int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos, bcf_callaux_t *bca, const char *ref,
+ const void *rghash)
+{
+ extern void ks_introsort_uint32_t(int, uint32_t*);
+ int i, s, j, k, t, n_types, *types, max_rd_len, left, right, max_ins, *score1, *score2;
+ int N, K, l_run, ref_type, n_alt;
+ char *inscns = 0, *ref2, *query;
+ khash_t(rg) *hash = (khash_t(rg)*)rghash;
+ if (ref == 0 || bca == 0) return -1;
+ // mark filtered reads
+ if (rghash) {
+ N = 0;
+ for (s = N = 0; s < n; ++s) {
+ for (i = 0; i < n_plp[s]; ++i) {
+ bam_pileup1_t *p = plp[s] + i;
+ const uint8_t *rg = bam_aux_get(p->b, "RG");
+ p->aux = 1; // filtered by default
+ if (rg) {
+ khint_t k = kh_get(rg, hash, (const char*)(rg + 1));
+ if (k != kh_end(hash)) p->aux = 0, ++N; // not filtered
+ }
+ }
+ }
+ if (N == 0) return -1; // no reads left
+ }
+ // determine if there is a gap
+ for (s = N = 0; s < n; ++s) {
+ for (i = 0; i < n_plp[s]; ++i)
+ if (plp[s][i].indel != 0) break;
+ if (i < n_plp[s]) break;
+ }
+ if (s == n) return -1; // there is no indel at this position.
+ for (s = N = 0; s < n; ++s) N += n_plp[s]; // N is the total number of reads
+ { // find out how many types of indels are present
+ int m, n_alt = 0, n_tot = 0;
+ uint32_t *aux;
+ aux = calloc(N + 1, 4);
+ m = max_rd_len = 0;
+ aux[m++] = MINUS_CONST; // zero indel is always a type
+ for (s = 0; s < n; ++s) {
+ for (i = 0; i < n_plp[s]; ++i) {
+ const bam_pileup1_t *p = plp[s] + i;
+ if (rghash == 0 || p->aux == 0) {
+ ++n_tot;
+ if (p->indel != 0) {
+ ++n_alt;
+ aux[m++] = MINUS_CONST + p->indel;
+ }
+ }
+ j = bam_cigar2qlen(&p->b->core, bam1_cigar(p->b));
+ if (j > max_rd_len) max_rd_len = j;
+ }
+ }
+ ks_introsort(uint32_t, m, aux);
+ // squeeze out identical types
+ for (i = 1, n_types = 1; i < m; ++i)
+ if (aux[i] != aux[i-1]) ++n_types;
+ if (n_types == 1 || n_alt * MIN_SUPPORT_COEF < n_tot) { // no indels or too few supporting reads
+ free(aux); return -1;
+ }
+ types = (int*)calloc(n_types, sizeof(int));
+ t = 0;
+ types[t++] = aux[0] - MINUS_CONST;
+ for (i = 1; i < m; ++i)
+ if (aux[i] != aux[i-1])
+ types[t++] = aux[i] - MINUS_CONST;
+ free(aux);
+ for (t = 0; t < n_types; ++t)
+ if (types[t] == 0) break;
+ ref_type = t; // the index of the reference type (0)
+ assert(n_types < 64);
+ }
+ { // calculate left and right boundary
+ left = pos > INDEL_WINDOW_SIZE? pos - INDEL_WINDOW_SIZE : 0;
+ right = pos + INDEL_WINDOW_SIZE;
+ if (types[0] < 0) right -= types[0];
+ // in case the alignments stand out the reference
+ for (i = pos; i < right; ++i)
+ if (ref[i] == 0) break;
+ right = i;
+ }
+ { // the length of the homopolymer run around the current position
+ int c = bam_nt16_table[(int)ref[pos + 1]];
+ if (c == 15) l_run = 1;
+ else {
+ for (i = pos + 2; ref[i]; ++i)
+ if (bam_nt16_table[(int)ref[i]] != c) break;
+ l_run = i;
+ for (i = pos; i >= 0; --i)
+ if (bam_nt16_table[(int)ref[i]] != c) break;
+ l_run -= i + 1;
+ }
+ }
+ // construct the consensus sequence
+ max_ins = types[n_types - 1]; // max_ins is at least 0
+ if (max_ins > 0) {
+ int *inscns_aux = calloc(4 * n_types * max_ins, sizeof(int));
+ // count the number of occurrences of each base at each position for each type of insertion
+ for (t = 0; t < n_types; ++t) {
+ if (types[t] > 0) {
+ for (s = 0; s < n; ++s) {
+ for (i = 0; i < n_plp[s]; ++i) {
+ bam_pileup1_t *p = plp[s] + i;
+ if (p->indel == types[t]) {
+ uint8_t *seq = bam1_seq(p->b);
+ for (k = 1; k <= p->indel; ++k) {
+ int c = bam_nt16_nt4_table[bam1_seqi(seq, p->qpos + k)];
+ if (c < 4) ++inscns_aux[(t*max_ins+(k-1))*4 + c];
+ }
+ }
+ }
+ }
+ }
+ }
+ // use the majority rule to construct the consensus
+ inscns = calloc(n_types * max_ins, 1);
+ for (t = 0; t < n_types; ++t) {
+ for (j = 0; j < types[t]; ++j) {
+ int max = 0, max_k = -1, *ia = &inscns_aux[(t*max_ins+j)*4];
+ for (k = 0; k < 4; ++k)
+ if (ia[k] > max)
+ max = ia[k], max_k = k;
+ inscns[t*max_ins + j] = max? max_k : 4;
+ }
+ }
+ free(inscns_aux);
+ }
+ // compute the likelihood given each type of indel for each read
+ ref2 = calloc(right - left + max_ins + 2, 1);
+ query = calloc(right - left + max_rd_len + max_ins + 2, 1);
+ score1 = calloc(N * n_types, sizeof(int));
+ score2 = calloc(N * n_types, sizeof(int));
+ bca->indelreg = 0;
+ for (t = 0; t < n_types; ++t) {
+ int l, ir;
+ kpa_par_t apf1 = { 1e-4, 1e-2, 10 }, apf2 = { 1e-6, 1e-3, 10 };
+ apf1.bw = apf2.bw = abs(types[t]) + 3;
+ // compute indelreg
+ if (types[t] == 0) ir = 0;
+ else if (types[t] > 0) ir = est_indelreg(pos, ref, types[t], &inscns[t*max_ins]);
+ else ir = est_indelreg(pos, ref, -types[t], 0);
+ if (ir > bca->indelreg) bca->indelreg = ir;
+// fprintf(stderr, "%d, %d, %d\n", pos, types[t], ir);
+ // write ref2
+ for (k = 0, j = left; j <= pos; ++j)
+ ref2[k++] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[j]]];
+ if (types[t] <= 0) j += -types[t];
+ else for (l = 0; l < types[t]; ++l)
+ ref2[k++] = inscns[t*max_ins + l];
+ if (types[0] < 0) { // mask deleted sequences to avoid a particular error in the model.
+ int jj, tmp = types[t] >= 0? -types[0] : -types[0] + types[t];
+ for (jj = 0; jj < tmp && j < right && ref[j]; ++jj, ++j)
+ ref2[k++] = 4;
+ }
+ for (; j < right && ref[j]; ++j)
+ ref2[k++] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[j]]];
+ if (j < right) right = j;
+ // align each read to ref2
+ for (s = K = 0; s < n; ++s) {
+ for (i = 0; i < n_plp[s]; ++i, ++K) {
+ bam_pileup1_t *p = plp[s] + i;
+ int qbeg, qend, tbeg, tend, sc;
+ uint8_t *seq = bam1_seq(p->b);
+ // FIXME: the following skips soft clips, but using them may be more sensitive.
+ // determine the start and end of sequences for alignment
+ qbeg = tpos2qpos(&p->b->core, bam1_cigar(p->b), left, 0, &tbeg);
+ qend = tpos2qpos(&p->b->core, bam1_cigar(p->b), right, 1, &tend);
+ if (types[t] < 0) {
+ int l = -types[t];
+ tbeg = tbeg - l > left? tbeg - l : left;
+ }
+ // write the query sequence
+ for (l = qbeg; l < qend; ++l)
+ query[l - qbeg] = bam_nt16_nt4_table[bam1_seqi(seq, l)];
+ { // do realignment; this is the bottleneck
+ const uint8_t *qual = bam1_qual(p->b), *bq;
+ uint8_t *qq;
+ qq = calloc(qend - qbeg, 1);
+ bq = (uint8_t*)bam_aux_get(p->b, "ZQ");
+ if (bq) ++bq; // skip type
+ for (l = qbeg; l < qend; ++l) {
+ qq[l - qbeg] = bq? qual[l] + (bq[l] - 64) : qual[l];
+ if (qq[l - qbeg] > 30) qq[l - qbeg] = 30;
+ if (qq[l - qbeg] < 7) qq[l - qbeg] = 7;
+ }
+ sc = kpa_glocal((uint8_t*)ref2 + tbeg - left, tend - tbeg + abs(types[t]),
+ (uint8_t*)query, qend - qbeg, qq, &apf1, 0, 0);
+ l = (int)(100. * sc / (qend - qbeg) + .499); // used for adjusting indelQ below
+ if (l > 255) l = 255;
+ score1[K*n_types + t] = score2[K*n_types + t] = sc<<8 | l;
+ if (sc > 5) {
+ sc = kpa_glocal((uint8_t*)ref2 + tbeg - left, tend - tbeg + abs(types[t]),
+ (uint8_t*)query, qend - qbeg, qq, &apf2, 0, 0);
+ l = (int)(100. * sc / (qend - qbeg) + .499);
+ if (l > 255) l = 255;
+ score2[K*n_types + t] = sc<<8 | l;
+ }
+ free(qq);
+ }
+/*
+ for (l = 0; l < tend - tbeg + abs(types[t]); ++l)
+ fputc("ACGTN"[(int)ref2[tbeg-left+l]], stderr);
+ fputc('\n', stderr);
+ for (l = 0; l < qend - qbeg; ++l) fputc("ACGTN"[(int)query[l]], stderr);
+ fputc('\n', stderr);
+ fprintf(stderr, "pos=%d type=%d read=%d:%d name=%s qbeg=%d tbeg=%d score=%d\n", pos, types[t], s, i, bam1_qname(p->b), qbeg, tbeg, sc);
+*/
+ }
+ }
+ }
+ free(ref2); free(query);
+ { // compute indelQ
+ int *sc, tmp, *sumq;
+ sc = alloca(n_types * sizeof(int));
+ sumq = alloca(n_types * sizeof(int));
+ memset(sumq, 0, sizeof(int) * n_types);
+ for (s = K = 0; s < n; ++s) {
+ for (i = 0; i < n_plp[s]; ++i, ++K) {
+ bam_pileup1_t *p = plp[s] + i;
+ int *sct = &score1[K*n_types], indelQ1, indelQ2, seqQ, indelQ;
+ for (t = 0; t < n_types; ++t) sc[t] = sct[t]<<6 | t;
+ for (t = 1; t < n_types; ++t) // insertion sort
+ for (j = t; j > 0 && sc[j] < sc[j-1]; --j)
+ tmp = sc[j], sc[j] = sc[j-1], sc[j-1] = tmp;
+ /* errmod_cal() assumes that if the call is wrong, the
+ * likelihoods of other events are equal. This is about
+ * right for substitutions, but is not desired for
+ * indels. To reuse errmod_cal(), I have to make
+ * compromise for multi-allelic indels.
+ */
+ if ((sc[0]&0x3f) == ref_type) {
+ indelQ1 = (sc[1]>>14) - (sc[0]>>14);
+ seqQ = est_seqQ(bca, types[sc[1]&0x3f], l_run);
+ } else {
+ for (t = 0; t < n_types; ++t) // look for the reference type
+ if ((sc[t]&0x3f) == ref_type) break;
+ indelQ1 = (sc[t]>>14) - (sc[0]>>14);
+ seqQ = est_seqQ(bca, types[sc[0]&0x3f], l_run);
+ }
+ tmp = sc[0]>>6 & 0xff;
+ indelQ1 = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ1 + .499); // reduce indelQ
+ sct = &score2[K*n_types];
+ for (t = 0; t < n_types; ++t) sc[t] = sct[t]<<6 | t;
+ for (t = 1; t < n_types; ++t) // insertion sort
+ for (j = t; j > 0 && sc[j] < sc[j-1]; --j)
+ tmp = sc[j], sc[j] = sc[j-1], sc[j-1] = tmp;
+ if ((sc[0]&0x3f) == ref_type) {
+ indelQ2 = (sc[1]>>14) - (sc[0]>>14);
+ } else {
+ for (t = 0; t < n_types; ++t) // look for the reference type
+ if ((sc[t]&0x3f) == ref_type) break;
+ indelQ2 = (sc[t]>>14) - (sc[0]>>14);
+ }
+ tmp = sc[0]>>6 & 0xff;
+ indelQ2 = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ2 + .499);
+ // pick the smaller between indelQ1 and indelQ2
+ indelQ = indelQ1 < indelQ2? indelQ1 : indelQ2;
+ p->aux = (sc[0]&0x3f)<<16 | seqQ<<8 | indelQ;
+ sumq[sc[0]&0x3f] += indelQ < seqQ? indelQ : seqQ;
+// fprintf(stderr, "pos=%d read=%d:%d name=%s call=%d q=%d\n", pos, s, i, bam1_qname(p->b), types[sc[0]&0x3f], indelQ);
+ }
+ }
+ // determine bca->indel_types[] and bca->inscns
+ bca->maxins = max_ins;
+ bca->inscns = realloc(bca->inscns, bca->maxins * 4);
+ for (t = 0; t < n_types; ++t)
+ sumq[t] = sumq[t]<<6 | t;
+ for (t = 1; t < n_types; ++t) // insertion sort
+ for (j = t; j > 0 && sumq[j] > sumq[j-1]; --j)
+ tmp = sumq[j], sumq[j] = sumq[j-1], sumq[j-1] = tmp;
+ for (t = 0; t < n_types; ++t) // look for the reference type
+ if ((sumq[t]&0x3f) == ref_type) break;
+ if (t) { // then move the reference type to the first
+ tmp = sumq[t];
+ for (; t > 0; --t) sumq[t] = sumq[t-1];
+ sumq[0] = tmp;
+ }
+ for (t = 0; t < 4; ++t) bca->indel_types[t] = B2B_INDEL_NULL;
+ for (t = 0; t < 4 && t < n_types; ++t) {
+ bca->indel_types[t] = types[sumq[t]&0x3f];
+ memcpy(&bca->inscns[t * bca->maxins], &inscns[(sumq[t]&0x3f) * max_ins], bca->maxins);
+ }
+ // update p->aux
+ for (s = n_alt = 0; s < n; ++s) {
+ for (i = 0; i < n_plp[s]; ++i) {
+ bam_pileup1_t *p = plp[s] + i;
+ int x = types[p->aux>>16&0x3f];
+ for (j = 0; j < 4; ++j)
+ if (x == bca->indel_types[j]) break;
+ p->aux = j<<16 | (j == 4? 0 : (p->aux&0xffff));
+ if ((p->aux>>16&0x3f) > 0) ++n_alt;
+// fprintf(stderr, "X pos=%d read=%d:%d name=%s call=%d type=%d q=%d seqQ=%d\n", pos, s, i, bam1_qname(p->b), p->aux>>16&63, bca->indel_types[p->aux>>16&63], p->aux&0xff, p->aux>>8&0xff);
+ }
+ }
+ }
+ free(score1); free(score2);
+ // free
+ free(types); free(inscns);
+ return n_alt > 0? 0 : -1;
+}
const bam_pileup1_t *p = pl + i;
uint32_t q, x = 0, qq;
uint16_t y = 0;
- if (p->is_del || (p->b->core.flag&BAM_FUNMAP)) continue;
+ if (p->is_del || p->is_refskip || (p->b->core.flag&BAM_FUNMAP)) continue;
q = (uint32_t)bam1_qual(p->b)[p->qpos];
if (q < bm->min_baseQ) continue;
x |= (uint32_t)bam1_strand(p->b) << 18 | q << 8 | p->b->core.qual;
y |= q << 5;
qq = bam1_seqi(bam1_seq(p->b), p->qpos);
q = bam_nt16_nt4_table[qq? qq : ref_base];
- if (!p->is_del && q < 4) x |= 1 << 21 | q << 16, y |= q;
+ if (!p->is_del && !p->is_refskip && q < 4) x |= 1 << 21 | q << 16, y |= q;
bm->aux->info16[n] = y;
bm->aux->info[n++] = x;
}
#include "sam.h"
#include "kstring.h"
#include "kaln.h"
+#include "kprobaln.h"
void bam_fillmd1_core(bam1_t *b, char *ref, int is_equal, int max_nm)
{
return (int)(t + .499);
}
-int bam_prob_realn(bam1_t *b, const char *ref)
+int bam_prob_realn_core(bam1_t *b, const char *ref, int apply_baq)
{
int k, i, bw, x, y, yb, ye, xb, xe;
uint32_t *cigar = bam1_cigar(b);
bam1_core_t *c = &b->core;
- ka_probpar_t conf = ka_probpar_def;
- // find the start and end of the alignment
- if (c->flag & BAM_FUNMAP) return -1;
+ kpa_par_t conf = kpa_par_def;
+ uint8_t *bq = 0, *zq = 0, *qual = bam1_qual(b);
+ if (c->flag & BAM_FUNMAP) return -1; // do nothing
+ // test if BQ or ZQ is present
+ if ((bq = bam_aux_get(b, "BQ")) != 0) ++bq;
+ if ((zq = bam_aux_get(b, "ZQ")) != 0 && *zq == 'Z') ++zq;
+ if (bq && zq) { // remove the ZQ tag
+ bam_aux_del(b, zq-1);
+ zq = 0;
+ }
+ if (bq || zq) {
+ if ((apply_baq && zq) || (!apply_baq && bq)) return -3; // in both cases, do nothing
+ if (bq && apply_baq) { // then convert BQ to ZQ
+ for (i = 0; i < c->l_qseq; ++i)
+ qual[i] = qual[i] + 64 < bq[i]? 0 : qual[i] - ((int)bq[i] - 64);
+ *(bq - 3) = 'Z';
+ } else if (zq && !apply_baq) { // then convert ZQ to BQ
+ for (i = 0; i < c->l_qseq; ++i)
+ qual[i] += (int)zq[i] - 64;
+ *(zq - 3) = 'B';
+ }
+ return 0;
+ }
+ // find the start and end of the alignment
x = c->pos, y = 0, yb = ye = xb = xe = -1;
for (k = 0; k < c->n_cigar; ++k) {
int op, l;
x += l; y += l;
} else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l;
else if (op == BAM_CDEL) x += l;
- else if (op == BAM_CREF_SKIP) return -1;
+ else if (op == BAM_CREF_SKIP) return -1; // do nothing if there is a reference skip
}
// set bandwidth and the start and the end
bw = 7;
if (xe - xb - c->l_qseq > bw)
xb += (xe - xb - c->l_qseq - bw) / 2, xe -= (xe - xb - c->l_qseq - bw) / 2;
{ // glocal
- uint8_t *s, *r, *q, *seq = bam1_seq(b), *qual = bam1_qual(b);
+ uint8_t *s, *r, *q, *seq = bam1_seq(b), *bq;
int *state;
+ bq = calloc(c->l_qseq + 1, 1);
+ memcpy(bq, qual, c->l_qseq);
s = calloc(c->l_qseq, 1);
for (i = 0; i < c->l_qseq; ++i) s[i] = bam_nt16_nt4_table[bam1_seqi(seq, i)];
r = calloc(xe - xb, 1);
r[i-xb] = bam_nt16_nt4_table[bam_nt16_table[(int)ref[i]]];
state = calloc(c->l_qseq, sizeof(int));
q = calloc(c->l_qseq, 1);
- ka_prob_glocal(r, xe-xb, s, c->l_qseq, qual, &conf, state, q);
+ kpa_glocal(r, xe-xb, s, c->l_qseq, qual, &conf, state, q);
for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) {
int op = cigar[k]&0xf, l = cigar[k]>>4;
if (op == BAM_CMATCH) {
for (i = y; i < y + l; ++i) {
- if ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y)) qual[i] = 0;
- else qual[i] = qual[i] < q[i]? qual[i] : q[i];
+ if ((state[i]&3) != 0 || state[i]>>2 != x - xb + (i - y)) bq[i] = 0;
+ else bq[i] = bq[i] < q[i]? bq[i] : q[i];
}
x += l; y += l;
} else if (op == BAM_CSOFT_CLIP || op == BAM_CINS) y += l;
else if (op == BAM_CDEL) x += l;
}
- free(s); free(r); free(q); free(state);
+ for (i = 0; i < c->l_qseq; ++i) bq[i] = qual[i] - bq[i] + 64; // finalize BQ
+ if (apply_baq) {
+ for (i = 0; i < c->l_qseq; ++i) qual[i] -= bq[i] - 64; // modify qual
+ bam_aux_append(b, "ZQ", 'Z', c->l_qseq + 1, bq);
+ } else bam_aux_append(b, "BQ", 'Z', c->l_qseq + 1, bq);
+ free(bq); free(s); free(r); free(q); free(state);
}
return 0;
}
+int bam_prob_realn(bam1_t *b, const char *ref)
+{
+ return bam_prob_realn_core(b, ref, 1);
+}
+
int bam_fillmd(int argc, char *argv[])
{
- int c, is_equal = 0, tid = -2, ret, len, is_bam_out, is_sam_in, is_uncompressed, max_nm = 0, is_realn, capQ = 0;
+ int c, is_equal, tid = -2, ret, len, is_bam_out, is_sam_in, is_uncompressed, max_nm, is_realn, capQ, apply_baq;
samfile_t *fp, *fpout = 0;
faidx_t *fai;
char *ref = 0, mode_w[8], mode_r[8];
bam1_t *b;
- is_bam_out = is_sam_in = is_uncompressed = is_realn = 0;
+ is_equal = is_bam_out = is_sam_in = is_uncompressed = is_realn = max_nm = apply_baq = capQ = 0;
mode_w[0] = mode_r[0] = 0;
strcpy(mode_r, "r"); strcpy(mode_w, "w");
- while ((c = getopt(argc, argv, "reubSC:n:")) >= 0) {
+ while ((c = getopt(argc, argv, "reubSC:n:A")) >= 0) {
switch (c) {
case 'r': is_realn = 1; break;
case 'e': is_equal = 1; break;
case 'S': is_sam_in = 1; break;
case 'n': max_nm = atoi(optarg); break;
case 'C': capQ = atoi(optarg); break;
+ case 'A': apply_baq = 1; break;
default: fprintf(stderr, "[bam_fillmd] unrecognized option '-%c'\n", c); return 1;
}
}
fprintf(stderr, " -u uncompressed BAM output (for piping)\n");
fprintf(stderr, " -b compressed BAM output\n");
fprintf(stderr, " -S the input is SAM with header\n");
+ fprintf(stderr, " -A modify the quality string\n");
fprintf(stderr, " -r read-independent local realignment\n\n");
return 1;
}
fprintf(stderr, "[bam_fillmd] fail to find sequence '%s' in the reference.\n",
fp->header->target_name[tid]);
}
-// if (is_realn) bam_realn(b, ref);
- if (is_realn) bam_prob_realn(b, ref);
+ if (is_realn) bam_prob_realn_core(b, ref, apply_baq);
if (capQ > 10) {
int q = bam_cap_mapQ(b, ref, capQ);
if (b->core.qual > q) b->core.qual = q;
mempool_t *mp;
lbnode_t *head, *tail, *dummy;
int32_t tid, pos, max_tid, max_pos;
- int is_eof, flag_mask, max_plp, error;
+ int is_eof, flag_mask, max_plp, error, maxcnt;
bam_pileup1_t *plp;
// for the "auto" interface only
bam1_t *b;
iter->dummy = mp_alloc(iter->mp);
iter->max_tid = iter->max_pos = -1;
iter->flag_mask = BAM_DEF_MASK;
+ iter->maxcnt = 8000;
if (func) {
iter->func = func;
iter->data = data;
if (b) {
if (b->core.tid < 0) return 0;
if (b->core.flag & iter->flag_mask) return 0;
+ if (iter->tid == b->core.tid && iter->pos == b->core.pos && iter->mp->cnt > iter->maxcnt) return 0;
bam_copy1(&iter->tail->b, b);
iter->tail->beg = b->core.pos; iter->tail->end = bam_calend(&b->core, bam1_cigar(b));
iter->tail->s = g_cstate_null; iter->tail->s.end = iter->tail->end - 1; // initialize cstate_t
const bam_pileup1_t *plp;
if (iter->func == 0 || iter->error) { *_n_plp = -1; return 0; }
if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp;
- else {
+ else { // no pileup line can be obtained; read alignments
*_n_plp = 0;
if (iter->is_eof) return 0;
while (iter->func(iter->data, iter->b) >= 0) {
return 0;
}
if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp;
+ // otherwise no pileup line can be returned; read the next alignment.
}
bam_plp_push(iter, 0);
if ((plp = bam_plp_next(iter, _tid, _pos, _n_plp)) != 0) return plp;
iter->flag_mask = mask < 0? BAM_DEF_MASK : (BAM_FUNMAP | mask);
}
+void bam_plp_set_maxcnt(bam_plp_t iter, int maxcnt)
+{
+ iter->maxcnt = maxcnt;
+}
+
/*****************
* callback APIs *
*****************/
return iter;
}
+void bam_mplp_set_maxcnt(bam_mplp_t iter, int maxcnt)
+{
+ int i;
+ for (i = 0; i < iter->n; ++i)
+ iter->iter[i]->maxcnt = maxcnt;
+}
+
void bam_mplp_destroy(bam_mplp_t iter)
{
int i;
#include <stdio.h>
#include <unistd.h>
#include <ctype.h>
+#include <string.h>
+#include <errno.h>
#include "sam.h"
#include "faidx.h"
#include "bam_maqcns.h"
#define MPLP_NO_COMP 0x20
#define MPLP_NO_ORPHAN 0x40
#define MPLP_REALN 0x80
+#define MPLP_FMT_DP 0x100
+#define MPLP_FMT_SP 0x200
+#define MPLP_NO_INDEL 0x400
typedef struct {
- int max_mq, min_mq, flag, min_baseQ, capQ_thres;
- char *reg, *fn_pos;
+ int max_mq, min_mq, flag, min_baseQ, capQ_thres, max_depth;
+ int openQ, extQ, tandemQ;
+ char *reg, *fn_pos, *pl_list;
faidx_t *fai;
kh_64_t *hash;
} mplp_conf_t;
static int mplp_func(void *data, bam1_t *b)
{
extern int bam_realn(bam1_t *b, const char *ref);
- extern int bam_prob_realn(bam1_t *b, const char *ref);
+ extern int bam_prob_realn_core(bam1_t *b, const char *ref, int);
extern int bam_cap_mapQ(bam1_t *b, char *ref, int thres);
mplp_aux_t *ma = (mplp_aux_t*)data;
int ret, skip = 0;
ret = ma->iter? bam_iter_read(ma->fp, ma->iter, b) : bam_read1(ma->fp, b);
if (ret < 0) break;
skip = 0;
- if (has_ref && (ma->flag&MPLP_REALN)) bam_prob_realn(b, ma->ref);
+ if (has_ref && (ma->flag&MPLP_REALN)) bam_prob_realn_core(b, ma->ref, 1);
if (has_ref && ma->capQ_thres > 10) {
int q = bam_cap_mapQ(b, ma->ref, ma->capQ_thres);
if (q < 0) skip = 1;
static int mpileup(mplp_conf_t *conf, int n, char **fn)
{
+ extern void *bcf_call_add_rg(void *rghash, const char *hdtext, const char *list);
+ extern void bcf_call_del_rghash(void *rghash);
mplp_aux_t **data;
- int i, tid, pos, *n_plp, beg0 = 0, end0 = 1u<<29, ref_len, ref_tid;
+ int i, tid, pos, *n_plp, beg0 = 0, end0 = 1u<<29, ref_len, ref_tid, max_depth;
const bam_pileup1_t **plp;
bam_mplp_t iter;
bam_header_t *h = 0;
char *ref;
khash_t(64) *hash = 0;
+ void *rghash = 0;
bcf_callaux_t *bca = 0;
bcf_callret1_t *bcr = 0;
data[i]->fp = strcmp(fn[i], "-") == 0? bam_dopen(fileno(stdin), "r") : bam_open(fn[i], "r");
h_tmp = bam_header_read(data[i]->fp);
bam_smpl_add(sm, fn[i], h_tmp->text);
+ rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list);
if (conf->reg) {
int beg, end;
bam_index_t *idx;
bcf_hdr_write(bp, bh);
bca = bcf_call_init(-1., conf->min_baseQ);
bcr = calloc(sm->n, sizeof(bcf_callret1_t));
+ bca->rghash = rghash;
+ bca->openQ = conf->openQ, bca->extQ = conf->extQ, bca->tandemQ = conf->tandemQ;
}
ref_tid = -1; ref = 0;
iter = bam_mplp_init(n, mplp_func, (void**)data);
+ max_depth = conf->max_depth;
+ if (max_depth * sm->n > 1<<20)
+ fprintf(stderr, "(%s) Max depth is above 1M. Potential memory hog!\n", __func__);
+ if (max_depth * sm->n < 8000) {
+ max_depth = 8000 / sm->n;
+ fprintf(stderr, "<%s> Set max per-sample depth to %d\n", __func__, max_depth);
+ }
+ bam_mplp_set_maxcnt(iter, max_depth);
while (bam_mplp_auto(iter, &tid, &pos, n_plp, plp) > 0) {
if (conf->reg && (pos < beg0 || pos >= end0)) continue; // out of the region requested
if (hash) {
for (i = 0; i < gplp.n; ++i)
bcf_call_glfgen(gplp.n_plp[i], gplp.plp[i], ref16, bca, bcr + i);
bcf_call_combine(gplp.n, bcr, ref16, &bc);
- bcf_call2bcf(tid, pos, &bc, b);
+ bcf_call2bcf(tid, pos, &bc, b, (conf->flag&(MPLP_FMT_DP|MPLP_FMT_SP))? bcr : 0,
+ (conf->flag&MPLP_FMT_SP), 0, 0);
bcf_write(bp, bh, b);
bcf_destroy(b);
+ // call indels
+ if (!(conf->flag&MPLP_NO_INDEL) && bcf_call_gap_prep(gplp.n, gplp.n_plp, gplp.plp, pos, bca, ref, rghash) >= 0) {
+ for (i = 0; i < gplp.n; ++i)
+ bcf_call_glfgen(gplp.n_plp[i], gplp.plp[i], -1, bca, bcr + i);
+ if (bcf_call_combine(gplp.n, bcr, -1, &bc) >= 0) {
+ b = calloc(1, sizeof(bcf1_t));
+ bcf_call2bcf(tid, pos, &bc, b, (conf->flag&(MPLP_FMT_DP|MPLP_FMT_SP))? bcr : 0,
+ (conf->flag&MPLP_FMT_SP), bca, ref);
+ bcf_write(bp, bh, b);
+ bcf_destroy(b);
+ }
+ }
} else {
printf("%s\t%d\t%c", h->target_name[tid], pos + 1, (ref && pos < ref_len)? ref[pos] : 'N');
for (i = 0; i < n; ++i) {
bam_smpl_destroy(sm); free(buf.s);
for (i = 0; i < gplp.n; ++i) free(gplp.plp[i]);
free(gplp.plp); free(gplp.n_plp); free(gplp.m_plp);
+ bcf_call_del_rghash(rghash);
if (hash) { // free the hash table
khint_t k;
for (k = kh_begin(hash); k < kh_end(hash); ++k)
return 0;
}
+#define MAX_PATH_LEN 1024
+int read_file_list(const char *file_list,int *n,char **argv[])
+{
+ char buf[MAX_PATH_LEN];
+ int len, nfiles;
+ char **files;
+
+ FILE *fh = fopen(file_list,"r");
+ if ( !fh )
+ {
+ fprintf(stderr,"%s: %s\n", file_list,strerror(errno));
+ return 1;
+ }
+
+ // Speed is not an issue here, determine the number of files by reading the file twice
+ nfiles = 0;
+ while ( fgets(buf,MAX_PATH_LEN,fh) ) nfiles++;
+
+ if ( fseek(fh, 0L, SEEK_SET) )
+ {
+ fprintf(stderr,"%s: %s\n", file_list,strerror(errno));
+ return 1;
+ }
+
+ files = calloc(nfiles,sizeof(char*));
+ nfiles = 0;
+ while ( fgets(buf,MAX_PATH_LEN,fh) )
+ {
+ len = strlen(buf);
+ while ( len>0 && isspace(buf[len-1]) ) len--;
+ if ( !len ) continue;
+
+ files[nfiles] = malloc(sizeof(char)*(len+1));
+ strncpy(files[nfiles],buf,len);
+ files[nfiles][len] = 0;
+ nfiles++;
+ }
+ fclose(fh);
+ if ( !nfiles )
+ {
+ fprintf(stderr,"No files read from %s\n", file_list);
+ return 1;
+ }
+ *argv = files;
+ *n = nfiles;
+ return 0;
+}
+#undef MAX_PATH_LEN
+
int bam_mpileup(int argc, char *argv[])
{
int c;
+ const char *file_list = NULL;
+ char **fn = NULL;
+ int nfiles = 0;
mplp_conf_t mplp;
memset(&mplp, 0, sizeof(mplp_conf_t));
mplp.max_mq = 60;
mplp.min_baseQ = 13;
mplp.capQ_thres = 0;
+ mplp.max_depth = 250;
+ mplp.openQ = 40; mplp.extQ = 20; mplp.tandemQ = 100;
mplp.flag = MPLP_NO_ORPHAN | MPLP_REALN;
- while ((c = getopt(argc, argv, "gf:r:l:M:q:Q:uaORC:B")) >= 0) {
+ while ((c = getopt(argc, argv, "gf:r:l:M:q:Q:uaORC:BDSd:b:P:o:e:h:I")) >= 0) {
switch (c) {
case 'f':
mplp.fai = fai_load(optarg);
if (mplp.fai == 0) return 1;
break;
+ case 'd': mplp.max_depth = atoi(optarg); break;
case 'r': mplp.reg = strdup(optarg); break;
case 'l': mplp.fn_pos = strdup(optarg); break;
+ case 'P': mplp.pl_list = strdup(optarg); break;
case 'g': mplp.flag |= MPLP_GLF; break;
case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_GLF; break;
case 'a': mplp.flag |= MPLP_NO_ORPHAN | MPLP_REALN; break;
case 'B': mplp.flag &= ~MPLP_REALN & ~MPLP_NO_ORPHAN; break;
case 'O': mplp.flag |= MPLP_NO_ORPHAN; break;
case 'R': mplp.flag |= MPLP_REALN; break;
+ case 'D': mplp.flag |= MPLP_FMT_DP; break;
+ case 'S': mplp.flag |= MPLP_FMT_SP; break;
+ case 'I': mplp.flag |= MPLP_NO_INDEL; break;
case 'C': mplp.capQ_thres = atoi(optarg); break;
case 'M': mplp.max_mq = atoi(optarg); break;
case 'q': mplp.min_mq = atoi(optarg); break;
case 'Q': mplp.min_baseQ = atoi(optarg); break;
+ case 'b': file_list = optarg; break;
+ case 'o': mplp.openQ = atoi(optarg); break;
+ case 'e': mplp.extQ = atoi(optarg); break;
+ case 'h': mplp.tandemQ = atoi(optarg); break;
}
}
if (argc == 1) {
fprintf(stderr, "Options: -f FILE reference sequence file [null]\n");
fprintf(stderr, " -r STR region in which pileup is generated [null]\n");
fprintf(stderr, " -l FILE list of positions (format: chr pos) [null]\n");
+ fprintf(stderr, " -b FILE list of input BAM files [null]\n");
fprintf(stderr, " -M INT cap mapping quality at INT [%d]\n", mplp.max_mq);
fprintf(stderr, " -Q INT min base quality [%d]\n", mplp.min_baseQ);
fprintf(stderr, " -q INT filter out alignment with MQ smaller than INT [%d]\n", mplp.min_mq);
+ fprintf(stderr, " -d INT max per-sample depth [%d]\n", mplp.max_depth);
+ fprintf(stderr, " -P STR comma separated list of platforms for indels [all]\n");
+ fprintf(stderr, " -o INT Phred-scaled gap open sequencing error probability [%d]\n", mplp.openQ);
+ fprintf(stderr, " -e INT Phred-scaled gap extension seq error probability [%d]\n", mplp.extQ);
+ fprintf(stderr, " -h INT coefficient for homopolyer errors [%d]\n", mplp.tandemQ);
fprintf(stderr, " -g generate BCF output\n");
fprintf(stderr, " -u do not compress BCF output\n");
fprintf(stderr, " -B disable BAQ computation\n");
+ fprintf(stderr, " -D output per-sample DP\n");
+ fprintf(stderr, " -S output per-sample SP (strand bias P-value, slow)\n");
+ fprintf(stderr, " -I do not perform indel calling\n");
fprintf(stderr, "\n");
fprintf(stderr, "Notes: Assuming diploid individuals.\n\n");
return 1;
}
- mpileup(&mplp, argc - optind, argv + optind);
- free(mplp.reg);
+ if ( file_list )
+ {
+ if ( read_file_list(file_list,&nfiles,&fn) ) return 1;
+ mpileup(&mplp,nfiles,fn);
+ for (c=0; c<nfiles; c++) free(fn[c]);
+ free(fn);
+ }
+ else
+ mpileup(&mplp, argc - optind, argv + optind);
+ free(mplp.reg); free(mplp.pl_list);
if (mplp.fai) fai_destroy(mplp.fai);
return 0;
}
#endif
#ifndef PACKAGE_VERSION
-#define PACKAGE_VERSION "0.1.9 (r783)"
+#define PACKAGE_VERSION "0.1.10 (r829)"
#endif
int bam_taf2baf(int argc, char *argv[]);
} else {
b->fp = strcmp(fn, "-")? bgzf_open(fn, mode) : bgzf_fdopen(fileno(stdin), mode);
}
+#ifndef BCF_LITE
b->fp->owned_file = 1;
+#endif
return b;
}
b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2;
} else if (b->gi[i].fmt == bcf_str2int("DP", 2) || b->gi[i].fmt == bcf_str2int("HQ", 2)) {
b->gi[i].len = 2;
- } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("GT", 2)) {
+ } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("GT", 2)
+ || b->gi[i].fmt == bcf_str2int("SP", 2))
+ {
b->gi[i].len = 1;
} else if (b->gi[i].fmt == bcf_str2int("GL", 2)) {
b->gi[i].len = b->n_alleles * (b->n_alleles + 1) / 2 * 4;
}
} else if (b->gi[i].fmt == bcf_str2int("DP", 2)) {
kputw(((uint16_t*)b->gi[i].data)[j], s);
- } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) {
+ } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) {
kputw(((uint8_t*)b->gi[i].data)[j], s);
} else if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
int y = ((uint8_t*)b->gi[i].data)[j];
memcpy(r->gi[i].data, b->gi[i].data, r->n_smpl * r->gi[i].len);
return 0;
}
+
+int bcf_is_indel(const bcf1_t *b)
+{
+ char *p;
+ if (strlen(b->ref) > 1) return 1;
+ for (p = b->alt; *p; ++p)
+ if (*p != ',' && p[1] != ',' && p[1] != '\0')
+ return 1;
+ return 0;
+}
#include <stdint.h>
#include <zlib.h>
+
+#ifndef BCF_LITE
#include "bgzf.h"
+typedef BGZF *bcfFile;
+#else
+typedef gzFile bcfFile;
+#define bgzf_open(fn, mode) gzopen(fn, mode)
+#define bgzf_fdopen(fd, mode) gzdopen(fd, mode)
+#define bgzf_close(fp) gzclose(fp)
+#define bgzf_read(fp, buf, len) gzread(fp, buf, len)
+#define bgzf_write(fp, buf, len)
+#define bgzf_flush(fp)
+#endif
/*
A member in the structs below is said to "primary" if its content
typedef struct {
int is_vcf; // if the file in operation is a VCF
void *v; // auxillary data structure for VCF
- BGZF *fp; // file handler for BCF
+ bcfFile fp; // file handler for BCF
} bcf_t;
struct __bcf_idx_t;
int bcf_shrink_alt(bcf1_t *b, int n);
// convert GL to PL
int bcf_gl2pl(bcf1_t *b);
+ // if the site is an indel
+ int bcf_is_indel(const bcf1_t *b);
// string hash table
void *bcf_build_refhash(bcf_hdr_t *h);
#define VC_QCALL 1024
#define VC_CALL_GT 2048
#define VC_ADJLD 4096
+#define VC_NO_INDEL 8192
typedef struct {
int flag, prior_type, n1;
char *fn_list, *prior_file;
- double theta, pref;
+ double theta, pref, indel_frac;
} viewconf_t;
khash_t(set64) *bcf_load_pos(const char *fn, bcf_hdr_t *_h)
kputs(b->alt, &s); kputc('\0', &s); kputc('\0', &s);
kputs(b->info, &s);
if (b->info[0]) kputc(';', &s);
- ksprintf(&s, "AF1=%.3lf;AFE=%.3lf", 1.-pr->f_em, 1.-pr->f_exp);
+// ksprintf(&s, "AF1=%.4lg;AFE=%.4lg;CI95=%.4lg,%.4lg", 1.-pr->f_em, 1.-pr->f_exp, pr->cil, pr->cih);
+ ksprintf(&s, "AF1=%.4lg;CI95=%.4lg,%.4lg", 1.-pr->f_em, pr->cil, pr->cih);
ksprintf(&s, ";DP4=%d,%d,%d,%d;MQ=%d", a.d[0], a.d[1], a.d[2], a.d[3], a.mq);
if (a.is_tested) {
if (pr->pc[0] >= 0.) ksprintf(&s, ";PC4=%lg,%lg,%lg,%lg", pr->pc[0], pr->pc[1], pr->pc[2], pr->pc[3]);
int bcfview(int argc, char *argv[])
{
extern int bcf_2qcall(bcf_hdr_t *h, bcf1_t *b);
+ extern void bcf_p1_indel_prior(bcf_p1aux_t *ma, double x);
bcf_t *bp, *bout = 0;
bcf1_t *b, *blast;
int c;
tid = begin = end = -1;
memset(&vc, 0, sizeof(viewconf_t));
- vc.prior_type = vc.n1 = -1; vc.theta = 1e-3; vc.pref = 0.5;
- while ((c = getopt(argc, argv, "N1:l:cHAGvbSuP:t:p:QgL")) >= 0) {
+ vc.prior_type = vc.n1 = -1; vc.theta = 1e-3; vc.pref = 0.5; vc.indel_frac = -1.;
+ while ((c = getopt(argc, argv, "N1:l:cHAGvbSuP:t:p:QgLi:I")) >= 0) {
switch (c) {
case '1': vc.n1 = atoi(optarg); break;
case 'l': vc.fn_list = strdup(optarg); break;
case 'u': vc.flag |= VC_UNCOMP | VC_BCFOUT; break;
case 'H': vc.flag |= VC_HWE; break;
case 'g': vc.flag |= VC_CALL_GT | VC_CALL; break;
+ case 'I': vc.flag |= VC_NO_INDEL; break;
case 't': vc.theta = atof(optarg); break;
case 'p': vc.pref = atof(optarg); break;
+ case 'i': vc.indel_frac = atof(optarg); break;
case 'Q': vc.flag |= VC_QCALL; break;
case 'L': vc.flag |= VC_ADJLD; break;
case 'P':
fprintf(stderr, " -N skip sites where REF is not A/C/G/T\n");
fprintf(stderr, " -Q output the QCALL likelihood format\n");
fprintf(stderr, " -L calculate LD for adjacent sites\n");
+ fprintf(stderr, " -I skip indels\n");
fprintf(stderr, " -1 INT number of group-1 samples [0]\n");
fprintf(stderr, " -l FILE list of sites to output [all sites]\n");
- fprintf(stderr, " -t FLOAT scaled mutation rate [%.4lg]\n", vc.theta);
+ fprintf(stderr, " -t FLOAT scaled substitution mutation rate [%.4lg]\n", vc.theta);
+ fprintf(stderr, " -i FLOAT indel-to-substitution ratio [%.4lg]\n", vc.indel_frac);
fprintf(stderr, " -p FLOAT variant if P(ref|D)<FLOAT [%.3lg]\n", vc.pref);
fprintf(stderr, " -P STR type of prior: full, cond2, flat [full]\n");
fprintf(stderr, "\n");
bcf_p1_set_n1(p1, vc.n1);
bcf_p1_init_subprior(p1, vc.prior_type, vc.theta);
}
+ if (vc.indel_frac > 0.) bcf_p1_indel_prior(p1, vc.indel_frac);
}
if (vc.fn_list) hash = bcf_load_pos(vc.fn_list, h);
if (optind + 1 < argc && !(vc.flag&VC_VCFIN)) {
}
}
while (vcf_read(bp, h, b) > 0) {
- if (vc.flag & VC_ACGT_ONLY) {
+ int is_indel = bcf_is_indel(b);
+ if ((vc.flag & VC_NO_INDEL) && is_indel) continue;
+ if ((vc.flag & VC_ACGT_ONLY) && !is_indel) {
int x;
if (b->ref[0] == 0 || b->ref[1] != 0) continue;
x = toupper(b->ref[0]);
#include "kseq.h"
KSTREAM_INIT(gzFile, gzread, 16384)
-#define MC_AVG_ERR 0.007
#define MC_MAX_EM_ITER 16
#define MC_EM_EPS 1e-4
+#define MC_DEF_INDEL 0.15
unsigned char seq_nt4_table[256] = {
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
};
struct __bcf_p1aux_t {
- int n, M, n1;
+ int n, M, n1, is_indel;
double *q2p, *pdg; // pdg -> P(D|g)
- double *phi;
+ double *phi, *phi_indel;
double *z, *zswap; // aux for afs
double *z1, *z2, *phi1, *phi2; // only calculated when n1 is set
double t, t1, t2;
int PL_len;
};
+void bcf_p1_indel_prior(bcf_p1aux_t *ma, double x)
+{
+ int i;
+ for (i = 0; i < ma->M; ++i)
+ ma->phi_indel[i] = ma->phi[i] * x;
+ ma->phi_indel[ma->M] = 1. - ma->phi[ma->M] * x;
+}
+
static void init_prior(int type, double theta, int M, double *phi)
{
int i;
void bcf_p1_init_prior(bcf_p1aux_t *ma, int type, double theta)
{
init_prior(type, theta, ma->M, ma->phi);
+ bcf_p1_indel_prior(ma, MC_DEF_INDEL);
}
void bcf_p1_init_subprior(bcf_p1aux_t *ma, int type, double theta)
fprintf(stderr, "[%s] heterozygosity=%lf, ", __func__, (double)sum);
for (sum = 0., k = 1; k <= ma->M; ++k) sum += k * ma->phi[ma->M - k] / ma->M;
fprintf(stderr, "theta=%lf\n", (double)sum);
+ bcf_p1_indel_prior(ma, MC_DEF_INDEL);
return 0;
}
ma->q2p = calloc(256, sizeof(double));
ma->pdg = calloc(3 * ma->n, sizeof(double));
ma->phi = calloc(ma->M + 1, sizeof(double));
+ ma->phi_indel = calloc(ma->M + 1, sizeof(double));
ma->phi1 = calloc(ma->M + 1, sizeof(double));
ma->phi2 = calloc(ma->M + 1, sizeof(double));
ma->z = calloc(2 * ma->n + 1, sizeof(double));
{
if (ma) {
free(ma->q2p); free(ma->pdg);
- free(ma->phi); free(ma->phi1); free(ma->phi2);
+ free(ma->phi); free(ma->phi_indel); free(ma->phi1); free(ma->phi2);
free(ma->z); free(ma->zswap); free(ma->z1); free(ma->z2);
free(ma->afs); free(ma->afs1);
free(ma);
{
int k;
long double sum = 0.;
+ double *phi = ma->is_indel? ma->phi_indel : ma->phi;
memset(ma->afs1, 0, sizeof(double) * (ma->M + 1));
mc_cal_y(ma);
for (k = 0, sum = 0.; k <= ma->M; ++k)
- sum += (long double)ma->phi[k] * ma->z[k];
+ sum += (long double)phi[k] * ma->z[k];
for (k = 0; k <= ma->M; ++k) {
- ma->afs1[k] = ma->phi[k] * ma->z[k] / sum;
+ ma->afs1[k] = phi[k] * ma->z[k] / sum;
if (isnan(ma->afs1[k]) || isinf(ma->afs1[k])) return -1.;
}
for (k = 0, sum = 0.; k <= ma->M; ++k) {
{
int i, k;
long double sum = 0.;
+ ma->is_indel = bcf_is_indel(b);
// set PL and PL_len
for (i = 0; i < b->n_gi; ++i) {
if (b->gi[i].fmt == bcf_str2int("PL", 2)) {
flast = rst->f_em;
}
}
+ { // estimate equal-tail credible interval (95% level)
+ int l, h;
+ double p;
+ for (i = 0, p = 0.; i < ma->M; ++i)
+ if (p + ma->afs1[i] > 0.025) break;
+ else p += ma->afs1[i];
+ l = i;
+ for (i = ma->M-1, p = 0.; i >= 0; --i)
+ if (p + ma->afs1[i] > 0.025) break;
+ else p += ma->afs1[i];
+ h = i;
+ rst->cil = (double)(ma->M - h) / ma->M; rst->cih = (double)(ma->M - l) / ma->M;
+ }
rst->g[0] = rst->g[1] = rst->g[2] = -1.;
contrast(ma, rst->pc);
return 0;
typedef struct {
int rank0;
double f_em, f_exp, f_flat, p_ref;
+ double cil, cih;
double pc[4];
double g[3];
} bcf_p1rst_t;
for (i = 0; i < b->n_gi; ++i) {
if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
((uint8_t*)b->gi[i].data)[k-9] = 1<<7;
- } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) {
+ } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) {
((uint8_t*)b->gi[i].data)[k-9] = 0;
} else if (b->gi[i].fmt == bcf_str2int("DP", 2)) {
((uint16_t*)b->gi[i].data)[k-9] = 0;
for (q = kstrtok(p, ":", &a2), i = 0; q && i < b->n_gi; q = kstrtok(0, 0, &a2), ++i) {
if (b->gi[i].fmt == bcf_str2int("GT", 2)) {
((uint8_t*)b->gi[i].data)[k-9] = (q[0] - '0')<<3 | (q[2] - '0') | (q[1] == '/'? 0 : 1) << 6;
- } else if (b->gi[i].fmt == bcf_str2int("GQ", 2)) {
+ } else if (b->gi[i].fmt == bcf_str2int("GQ", 2) || b->gi[i].fmt == bcf_str2int("SP", 2)) {
double _x = strtod(q, &q);
int x = (int)(_x + .499);
if (x > 255) x = 255;
exit;
sub main {
- my $version = '0.1.0';
&usage if (@ARGV < 1);
my $command = shift(@ARGV);
my %func = (subsam=>\&subsam, listsam=>\&listsam, fillac=>\&fillac, qstats=>\&qstats, varFilter=>\&varFilter,
- hapmap2vcf=>\&hapmap2vcf, ucscsnp2vcf=>\&ucscsnp2vcf, filter4vcf=>\&filter4vcf, ldstats=>\&ldstats);
+ hapmap2vcf=>\&hapmap2vcf, ucscsnp2vcf=>\&ucscsnp2vcf, filter4vcf=>\&varFilter, ldstats=>\&ldstats,
+ gapstats=>\&gapstats);
die("Unknown command \"$command\".\n") if (!defined($func{$command}));
&{$func{$command}};
}
next if (length($t[3]) != 1 || uc($t[3]) eq 'N');
$t[3] = uc($t[3]); $t[4] = uc($t[4]);
my @s = split(',', $t[4]);
- $t[5] = 3 if ($t[5] < 0);
+ $t[5] = 3 if ($t[5] eq '.' || $t[5] < 0);
next if (length($s[0]) != 1);
my $hit;
if ($is_vcf) {
}
sub varFilter {
- my %opts = (d=>1, D=>10000, l=>30, Q=>25, q=>10, G=>25, s=>100, w=>10, W=>10, N=>2, p=>undef, F=>.001);
- getopts('pq:d:D:l:Q:w:W:N:G:F:', \%opts);
+ my %opts = (d=>2, D=>10000, a=>2, W=>10, Q=>10, w=>10, p=>undef, 1=>1e-4, 2=>1e-100, 3=>0, 4=>1e-4);
+ getopts('pd:D:W:Q:w:a:1:2:3:4:', \%opts);
die(qq/
Usage: vcfutils.pl varFilter [options] <in.vcf>
Options: -Q INT minimum RMS mapping quality for SNPs [$opts{Q}]
- -q INT minimum RMS mapping quality for gaps [$opts{q}]
-d INT minimum read depth [$opts{d}]
-D INT maximum read depth [$opts{D}]
-
- -G INT min indel score for nearby SNP filtering [$opts{G}]
+ -a INT minimum number of alternate bases [$opts{a}]
-w INT SNP within INT bp around a gap to be filtered [$opts{w}]
-
- -W INT window size for filtering dense SNPs [$opts{W}]
- -N INT max number of SNPs in a window [$opts{N}]
-
- -l INT window size for filtering adjacent gaps [$opts{l}]
-
+ -W INT window size for filtering adjacent gaps [$opts{W}]
+ -1 FLOAT min P-value for strand bias (given PV4) [$opts{1}]
+ -2 FLOAT min P-value for baseQ bias [$opts{2}]
+ -3 FLOAT min P-value for mapQ bias [$opts{3}]
+ -4 FLOAT min P-value for end distance bias [$opts{4}]
-p print filtered variants
+
+Note: Some of the filters rely on annotations generated by SAMtools\/BCFtools.
\n/) if (@ARGV == 0 && -t STDIN);
# calculate the window size
- my ($ol, $ow, $oW) = ($opts{l}, $opts{w}, $opts{W});
+ my ($ol, $ow) = ($opts{W}, $opts{w});
my $max_dist = $ol > $ow? $ol : $ow;
- $max_dist = $oW if ($max_dist < $oW);
# the core loop
- my @staging; # (indel_filtering_score, flt_tag)
+ my @staging; # (indel_filtering_score, flt_tag, indel_span; chr, pos, ...)
while (<>) {
my @t = split;
- next if (/^#/);
+ if (/^#/) {
+ print; next;
+ }
next if ($t[4] eq '.'); # skip non-var sites
+ # check if the site is a SNP
my $is_snp = 1;
if (length($t[3]) > 1) {
$is_snp = 0;
last if ($staging[0][3] eq $t[0] && $staging[0][4] + $staging[0][2] + $max_dist >= $t[1]);
varFilter_aux(shift(@staging), $opts{p}); # calling a function is a bit slower, not much
}
- my ($flt, $score) = (0, -1);
-
- # collect key annotations
- my ($dp, $mq, $af) = (-1, -1, 1);
- if ($t[7] =~ /DP=(\d+)/i) {
- $dp = $1;
- } elsif ($t[7] =~ /DP4=(\d+),(\d+),(\d+),(\d+)/i) {
+ my $flt = 0;
+ # parse annotations
+ my ($dp, $mq, $dp_alt) = (-1, -1, -1);
+ if ($t[7] =~ /DP4=(\d+),(\d+),(\d+),(\d+)/i) {
$dp = $1 + $2 + $3 + $4;
+ $dp_alt = $3 + $4;
}
- if ($t[7] =~ /MQ=(\d+)/i) {
- $mq = $1;
- }
- if ($t[7] =~ /AF=([^\s;=]+)/i) {
- $af = $1;
- } elsif ($t[7] =~ /AF1=([^\s;=]+)/i) {
- $af = $1;
+ if ($t[7] =~ /DP=(\d+)/i) {
+ $dp = $1;
}
- # the depth filter
+ $mq = $1 if ($t[7] =~ /MQ=(\d+)/i);
+ # the depth and mapQ filter
if ($dp >= 0) {
if ($dp < $opts{d}) {
$flt = 2;
$flt = 3;
}
}
+ $flt = 4 if ($dp_alt >= 0 && $dp_alt < $opts{a});
+ $flt = 1 if ($flt == 0 && $mq >= 0 && $mq < $opts{Q});
+ $flt = 7 if ($flt == 0 && /PV4=([^,]+),([^,]+),([^,]+),([^,;\t]+)/
+ && ($1<$opts{1} || $2<$opts{2} || $3<$opts{3} || $4<$opts{4}));
# site dependent filters
- my $dlen = 0;
+ my ($rlen, $indel_score) = (0, -1); # $indel_score<0 for SNPs
if ($flt == 0) {
if (!$is_snp) { # an indel
- # If deletion, remember the length of the deletion
- $dlen = length($t[3]) - 1;
- $flt = 1 if ($mq < $opts{q});
+ $rlen = length($t[3]) - 1;
+ $indel_score = $t[5] * 100 + $dp_alt;
# filtering SNPs
- if ($t[5] >= $opts{G}) {
- for my $x (@staging) {
- # Is it a SNP and is it outside the SNP filter window?
- next if ($x->[0] >= 0 || $x->[4] + $x->[2] + $ow < $t[1]);
- $x->[1] = 5 if ($x->[1] == 0);
- }
+ for my $x (@staging) {
+ next if ($x->[0] >= 0 || $x->[1] || $x->[4] + $x->[2] + $ow < $t[1]);
+ $x->[1] = 5;
}
- # the indel filtering score
- $score = $t[5];
# check the staging list for indel filtering
for my $x (@staging) {
- # Is it a SNP and is it outside the gap filter window
- next if ($x->[0] < 0 || $x->[4] + $x->[2] + $ol < $t[1]);
- if ($x->[0] < $score) {
+ next if ($x->[0] < 0 || $x->[1] || $x->[4] + $x->[2] + $ol < $t[1]);
+ if ($x->[0] < $indel_score) {
$x->[1] = 6;
} else {
$flt = 6; last;
}
}
} else { # a SNP
- $flt = 1 if ($mq < $opts{Q});
- # check adjacent SNPs
- my $k = 1;
for my $x (@staging) {
- ++$k if ($x->[0] < 0 && -($x->[0] + 1) > $opts{F} && $x->[4] + $x->[2] + $oW >= $t[1] && ($x->[1] == 0 || $x->[1] == 4 || $x->[1] == 5));
- }
- # filtering is necessary
- if ($k > $opts{N}) {
- $flt = 4;
- for my $x (@staging) {
- $x->[1] = 4 if ($x->[0] < 0 && $x->[4] + $x->[2] + $oW >= $t[1] && $x->[1] == 0);
- }
- } else { # then check gap filter
- for my $x (@staging) {
- next if ($x->[0] < 0 || $x->[4] + $x->[2] + $ow < $t[1]);
- if ($x->[0] >= $opts{G}) {
- $flt = 5; last;
- }
- }
+ next if ($x->[0] < 0 || $x->[1] || $x->[4] + $x->[2] + $ow < $t[1]);
+ $flt = 5;
+ last;
}
}
}
- push(@staging, [$score < 0? -$af-1 : $score, $flt, $dlen, @t]);
+ push(@staging, [$indel_score, $flt, $rlen, @t]);
}
# output the last few elements in the staging list
while (@staging) {
if ($first->[1] == 0) {
print join("\t", @$first[3 .. @$first-1]), "\n";
} elsif ($is_print) {
- print STDERR join("\t", substr("UQdDWGgsiX", $first->[1], 1), @$first[3 .. @$first-1]), "\n";
+ print STDERR join("\t", substr("UQdDaGgP", $first->[1], 1), @$first[3 .. @$first-1]), "\n";
}
}
-sub filter4vcf {
- my %opts = (d=>3, D=>2000, 1=>1e-4, 2=>1e-100, 3=>0, 4=>1e-4, Q=>10, q=>3);
- getopts('d:D:1:2:3:4:Q:q:', \%opts);
- die(qq/
-Usage: vcfutils.pl filter4vcf [options] <in.vcf>
-
-Options: -d INT min total depth (given DP or DP4) [$opts{d}]
- -D INT max total depth [$opts{D}]
- -q INT min SNP quality [$opts{q}]
- -Q INT min RMS mapQ (given MQ) [$opts{Q}]
- -1 FLOAT min P-value for strand bias (given PV4) [$opts{1}]
- -2 FLOAT min P-value for baseQ bias [$opts{2}]
- -3 FLOAT min P-value for mapQ bias [$opts{3}]
- -4 FLOAT min P-value for end distance bias [$opts{4}]\n
-/) if (@ARGV == 0 && -t STDIN);
-
- my %ts = (AG=>1, GA=>1, CT=>1, TC=>1);
-
- my @n = (0, 0);
+sub gapstats {
+ my (@c0, @c1);
+ $c0[$_] = $c1[$_] = 0 for (0 .. 10000);
while (<>) {
- if (/^#/) {
- print;
- next;
- }
- next if (/PV4=([^,]+),([^,]+),([^,]+),([^,;\t]+)/ && ($1<$opts{1} || $2<$opts{2} || $3<$opts{3} || $4<$opts{4}));
- my $depth = -1;
- $depth = $1 if (/DP=(\d+)/);
- $depth = $1+$2+$3+$4 if (/DP4=(\d+),(\d+),(\d+),(\d+)/);
- next if ($depth > 0 && ($depth < $opts{d} || $depth > $opts{D}));
- next if (/MQ=(\d+)/ && $1 < $opts{Q});
+ next if (/^#/);
my @t = split;
- next if ($t[5] >= 0 && $t[5] < $opts{q});
- ++$n[0];
+ next if (length($t[3]) == 1 && $t[4] =~ /^[A-Za-z](,[A-Za-z])*$/); # not an indel
my @s = split(',', $t[4]);
- ++$n[1] if ($ts{$t[3].$s[0]});
- print;
+ for my $x (@s) {
+ my $l = length($x) - length($t[3]) + 5000;
+ if ($x =~ /^-/) {
+ $l = -(length($x) - 1) + 5000;
+ } elsif ($x =~ /^\+/) {
+ $l = length($x) - 1 + 5000;
+ }
+ $c0[$l] += 1 / @s;
+ }
+ }
+ for (my $i = 0; $i < 10000; ++$i) {
+ next if ($c0[$i] == 0);
+ $c1[0] += $c0[$i];
+ $c1[1] += $c0[$i] if (($i-5000)%3 == 0);
+ printf("C\t%d\t%.2f\n", ($i-5000), $c0[$i]);
}
+ printf("3\t%d\t%d\t%.3f\n", $c1[0], $c1[1], $c1[1]/$c1[0]);
}
sub ucscsnp2vcf {
fillac fill the allele count field
qstats SNP stats stratified by QUAL
varFilter filtering short variants
- filter4vcf filtering VCFs produced by samtools+bcftools
hapmap2vcf convert the hapmap format to VCF
ucscsnp2vcf convert UCSC SNP SQL dump to VCF
\n/);
-2, -2, -2, -2, -2
};
+int aln_sm_qual[] = {
+ 0, -23, -23, -23, 0,
+ -23, 0, -23, -23, 0,
+ -23, -23, 0, -23, 0,
+ -23, -23, -23, 0, 0,
+ 0, 0, 0, 0, 0
+};
+
ka_param_t ka_param_blast = { 5, 2, 5, 2, aln_sm_blast, 5, 50 };
ka_param_t ka_param_aa2aa = { 10, 2, 10, 2, aln_sm_blosum62, 22, 50 };
+ka_param2_t ka_param2_qual = { 37, 11, 37, 11, 37, 11, 0, 0, aln_sm_qual, 5, 50 };
+
static uint32_t *ka_path2cigar32(const path_t *path, int path_len, int *n_cigar)
{
int i, n;
}
typedef struct {
- uint8_t Mt:3, It:2, Dt:2;
+ uint8_t Mt:3, It:2, Dt:3;
} dpcell_t;
typedef struct {
score_matrix = ap->matrix;
N_MATRIX_ROW = ap->row;
- *n_cigar = 0;
+ if (n_cigar) *n_cigar = 0;
if (len1 == 0 || len2 == 0) return 0;
/* calculate b1 and b2 */
return cigar;
}
-/*****************************************
- * Probabilistic banded glocal alignment *
- *****************************************/
-
-static float g_qual2prob[256];
-
-#define EI .25
-#define EM .3333333333333
-#define set_u(u, b, i, k) { int x=(i)-(b); x=x>0?x:0; (u)=((k)-x+1)*3; }
-
-ka_probpar_t ka_probpar_def = { 0.001, 0.1, 10 };
-
-/*
- The topology of the profile HMM:
-
- /\ /\ /\ /\
- I[1] I[k-1] I[k] I[L]
- ^ \ \ ^ \ ^ \ \ ^
- | \ \ | \ | \ \ |
- M[0] M[1] -> ... -> M[k-1] -> M[k] -> ... -> M[L] M[L+1]
- \ \/ \/ \/ /
- \ /\ /\ /\ /
- -> D[k-1] -> D[k] ->
-
- M[0] points to every {M,I}[k] and every {M,I}[k] points M[L+1].
-
- On input, _ref is the reference sequence and _query is the query
- sequence. Both are sequences of 0/1/2/3/4 where 4 stands for an
- ambiguous residue. iqual is the base quality. c sets the gap open
- probability, gap extension probability and band width.
-
- On output, state and q are arrays of length l_query. The higher 30
- bits give the reference position the query base is matched to and the
- lower two bits can be 0 (an alignment match) or 1 (an
- insertion). q[i] gives the phred scaled posterior probability of
- state[i] being wrong.
- */
-int ka_prob_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual,
- const ka_probpar_t *c, int *state, uint8_t *q)
+typedef struct {
+ int M, I, D;
+} score_aux_t;
+
+#define MINUS_INF -0x40000000
+
+// matrix: len2 rows and len1 columns
+int ka_global_score(const uint8_t *_seq1, int len1, const uint8_t *_seq2, int len2, const ka_param2_t *ap)
{
- double **f, **b, *s, m[9], sI, sM, bI, bM, pb;
- float *qual, *_qual;
- const uint8_t *ref, *query;
- int bw, bw2, i, k, is_diff = 0;
-
- /*** initialization ***/
- ref = _ref - 1; query = _query - 1; // change to 1-based coordinate
- bw = l_ref > l_query? l_ref : l_query;
- if (bw > c->bw) bw = c->bw;
- if (bw < abs(l_ref - l_query)) bw = abs(l_ref - l_query);
- bw2 = bw * 2 + 1;
- // allocate the forward and backward matrices f[][] and b[][] and the scaling array s[]
- f = calloc(l_query+1, sizeof(void*));
- b = calloc(l_query+1, sizeof(void*));
- for (i = 0; i <= l_query; ++i) {
- f[i] = calloc(bw2 * 3 + 6, sizeof(double)); // FIXME: this is over-allocated for very short seqs
- b[i] = calloc(bw2 * 3 + 6, sizeof(double));
- }
- s = calloc(l_query+2, sizeof(double)); // s[] is the scaling factor to avoid underflow
- // initialize qual
- _qual = calloc(l_query, sizeof(float));
- if (g_qual2prob[0] == 0)
- for (i = 0; i < 256; ++i)
- g_qual2prob[i] = pow(10, -i/10.);
- for (i = 0; i < l_query; ++i) _qual[i] = g_qual2prob[iqual? iqual[i] : 30];
- qual = _qual - 1;
- // initialize transition probability
- sM = sI = 1. / (2 * l_query + 2); // the value here seems not to affect results; FIXME: need proof
- m[0*3+0] = (1 - c->d - c->d) * (1 - sM); m[0*3+1] = m[0*3+2] = c->d * (1 - sM);
- m[1*3+0] = (1 - c->e) * (1 - sI); m[1*3+1] = c->e * (1 - sI); m[1*3+2] = 0.;
- m[2*3+0] = 1 - c->e; m[2*3+1] = 0.; m[2*3+2] = c->e;
- bM = (1 - c->d) / l_query; bI = c->d / l_query; // (bM+bI)*l_query==1
- /*** forward ***/
- // f[0]
- set_u(k, bw, 0, 0);
- f[0][k] = s[0] = 1.;
- { // f[1]
- double *fi = f[1], sum;
- int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1, _beg, _end;
- for (k = beg, sum = 0.; k <= end; ++k) {
- int u;
- double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM;
- set_u(u, bw, 1, k);
- fi[u+0] = e * bM; fi[u+1] = EI * bI;
- sum += fi[u] + fi[u+1];
- }
- // rescale
- s[1] = sum;
- set_u(_beg, bw, 1, beg); set_u(_end, bw, 1, end); _end += 2;
- for (k = _beg; k <= _end; ++k) fi[k] /= sum;
- }
- // f[2..l_query]
- for (i = 2; i <= l_query; ++i) {
- double *fi = f[i], *fi1 = f[i-1], sum, qli = qual[i];
- int beg = 1, end = l_ref, x, _beg, _end;
- uint8_t qyi = query[i];
- x = i - bw; beg = beg > x? beg : x; // band start
- x = i + bw; end = end < x? end : x; // band end
- for (k = beg, sum = 0.; k <= end; ++k) {
- int u, v11, v01, v10;
- double e;
- e = (ref[k] > 3 || qyi > 3)? 1. : ref[k] == qyi? 1. - qli : qli * EM;
- set_u(u, bw, i, k); set_u(v11, bw, i-1, k-1); set_u(v10, bw, i-1, k); set_u(v01, bw, i, k-1);
- fi[u+0] = e * (m[0] * fi1[v11+0] + m[3] * fi1[v11+1] + m[6] * fi1[v11+2]);
- fi[u+1] = EI * (m[1] * fi1[v10+0] + m[4] * fi1[v10+1]);
- fi[u+2] = m[2] * fi[v01+0] + m[8] * fi[v01+2];
- sum += fi[u] + fi[u+1] + fi[u+2];
-// fprintf(stderr, "F (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, fi[u], fi[u+1], fi[u+2]); // DEBUG
- }
- // rescale
- s[i] = sum;
- set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2;
- for (k = _beg, sum = 1./sum; k <= _end; ++k) fi[k] *= sum;
- }
- { // f[l_query+1]
- double sum;
- for (k = 1, sum = 0.; k <= l_ref; ++k) {
- int u;
- set_u(u, bw, l_query, k);
- if (u < 3 || u >= bw2*3+3) continue;
- sum += f[l_query][u+0] * sM + f[l_query][u+1] * sI;
- }
- s[l_query+1] = sum; // the last scaling factor
+
+#define __score_aux(_p, _q0, _sc, _io, _ie, _do, _de) { \
+ int t1, t2; \
+ score_aux_t *_q; \
+ _q = _q0; \
+ _p->M = _q->M >= _q->I? _q->M : _q->I; \
+ _p->M = _p->M >= _q->D? _p->M : _q->D; \
+ _p->M += (_sc); \
+ ++_q; t1 = _q->M - _io - _ie; t2 = _q->I - _ie; _p->I = t1 >= t2? t1 : t2; \
+ _q = _p-1; t1 = _q->M - _do - _de; t2 = _q->D - _de; _p->D = t1 >= t2? t1 : t2; \
}
- /*** backward ***/
- // b[l_query] (b[l_query+1][0]=1 and thus \tilde{b}[][]=1/s[l_query+1]; this is where s[l_query+1] comes from)
- for (k = 1; k <= l_ref; ++k) {
- int u;
- double *bi = b[l_query];
- set_u(u, bw, l_query, k);
- if (u < 3 || u >= bw2*3+3) continue;
- bi[u+0] = sM / s[l_query] / s[l_query+1]; bi[u+1] = sI / s[l_query] / s[l_query+1];
+
+ int i, j, bw, scmat_size = ap->row, *scmat = ap->matrix, ret;
+ const uint8_t *seq1, *seq2;
+ score_aux_t *curr, *last, *swap;
+ bw = abs(len1 - len2) + ap->band_width;
+ i = len1 > len2? len1 : len2;
+ if (bw > i + 1) bw = i + 1;
+ seq1 = _seq1 - 1; seq2 = _seq2 - 1;
+ curr = calloc(len1 + 2, sizeof(score_aux_t));
+ last = calloc(len1 + 2, sizeof(score_aux_t));
+ { // the zero-th row
+ int x, end = len1;
+ score_aux_t *p;
+ j = 0;
+ x = j + bw; end = len1 < x? len1 : x; // band end
+ p = curr;
+ p->M = 0; p->I = p->D = MINUS_INF;
+ for (i = 1, p = &curr[1]; i <= end; ++i, ++p)
+ p->M = p->I = MINUS_INF, p->D = -(ap->edo + ap->ede * i);
+ p->M = p->I = p->D = MINUS_INF;
+ swap = curr; curr = last; last = swap;
}
- // b[l_query-1..1]
- for (i = l_query - 1; i >= 1; --i) {
- int beg = 1, end = l_ref, x, _beg, _end;
- double *bi = b[i], *bi1 = b[i+1], y = (i > 1), qli1 = qual[i+1];
- uint8_t qyi1 = query[i+1];
- x = i - bw; beg = beg > x? beg : x;
- x = i + bw; end = end < x? end : x;
- for (k = end; k >= beg; --k) {
- int u, v11, v01, v10;
- double e;
- set_u(u, bw, i, k); set_u(v11, bw, i+1, k+1); set_u(v10, bw, i+1, k); set_u(v01, bw, i, k+1);
- e = (k >= l_ref? 0 : (ref[k+1] > 3 || qyi1 > 3)? 1. : ref[k+1] == qyi1? 1. - qli1 : qli1 * EM) * bi1[v11];
- bi[u+0] = e * m[0] + EI * m[1] * bi1[v10+1] + m[2] * bi[v01+2]; // bi1[v11] has been foled into e.
- bi[u+1] = e * m[3] + EI * m[4] * bi1[v10+1];
- bi[u+2] = (e * m[6] + m[8] * bi[v01+2]) * y;
-// fprintf(stderr, "B (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, bi[u], bi[u+1], bi[u+2]); // DEBUG
+ for (j = 1; j < len2; ++j) {
+ int x, beg = 0, end = len1, *scrow, col_end;
+ score_aux_t *p;
+ x = j - bw; beg = 0 > x? 0 : x; // band start
+ x = j + bw; end = len1 < x? len1 : x; // band end
+ if (beg == 0) { // from zero-th column
+ p = curr;
+ p->M = p->D = MINUS_INF; p->I = -(ap->eio + ap->eie * j);
+ ++beg; // then beg = 1
}
- // rescale
- set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2;
- for (k = _beg, y = 1./s[i]; k <= _end; ++k) bi[k] *= y;
- }
- { // b[0]
- int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1;
- double sum = 0.;
- for (k = end; k >= beg; --k) {
- int u;
- double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM;
- set_u(u, bw, 1, k);
- if (u < 3 || u >= bw2*3+3) continue;
- sum += e * b[1][u+0] * bM + EI * b[1][u+1] * bI;
+ scrow = scmat + seq2[j] * scmat_size;
+ if (end == len1) col_end = 1, --end;
+ else col_end = 0;
+ for (i = beg, p = &curr[beg]; i <= end; ++i, ++p)
+ __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->iio, ap->iie, ap->ido, ap->ide);
+ if (col_end) {
+ __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->eio, ap->eie, ap->ido, ap->ide);
+ ++p;
}
- set_u(k, bw, 0, 0);
- pb = b[0][k] = sum / s[0]; // if everything works as is expected, pb == 1.0
+ p->M = p->I = p->D = MINUS_INF;
+// for (i = 0; i <= len1; ++i) printf("(%d,%d,%d) ", curr[i].M, curr[i].I, curr[i].D); putchar('\n');
+ swap = curr; curr = last; last = swap;
}
- is_diff = fabs(pb - 1.) > 1e-7? 1 : 0;
- /*** MAP ***/
- for (i = 1; i <= l_query; ++i) {
- double sum = 0., *fi = f[i], *bi = b[i], max = 0.;
- int beg = 1, end = l_ref, x, max_k = -1;
- x = i - bw; beg = beg > x? beg : x;
- x = i + bw; end = end < x? end : x;
- for (k = beg; k <= end; ++k) {
- int u;
- double z;
- set_u(u, bw, i, k);
- z = fi[u+0] * bi[u+0]; if (z > max) max = z, max_k = (k-1)<<2 | 0; sum += z;
- z = fi[u+1] * bi[u+1]; if (z > max) max = z, max_k = (k-1)<<2 | 1; sum += z;
+ { // the last row
+ int x, beg = 0, *scrow;
+ score_aux_t *p;
+ j = len2;
+ x = j - bw; beg = 0 > x? 0 : x; // band start
+ if (beg == 0) { // from zero-th column
+ p = curr;
+ p->M = p->D = MINUS_INF; p->I = -(ap->eio + ap->eie * j);
+ ++beg; // then beg = 1
}
- max /= sum; sum *= s[i]; // if everything works as is expected, sum == 1.0
- if (state) state[i-1] = max_k;
- if (q) k = (int)(-4.343 * log(1. - max) + .499), q[i-1] = k > 100? 99 : k;
-#ifdef _MAIN
- fprintf(stderr, "(%.10lg,%.10lg) (%d,%d:%d)~%lg\n", pb, sum, i-1, max_k>>2, max_k&3, max); // DEBUG
-#endif
- }
- /*** free ***/
- for (i = 0; i <= l_query; ++i) {
- free(f[i]); free(b[i]);
+ scrow = scmat + seq2[j] * scmat_size;
+ for (i = beg, p = &curr[beg]; i < len1; ++i, ++p)
+ __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->iio, ap->iie, ap->edo, ap->ede);
+ __score_aux(p, &last[i-1], scrow[(int)seq1[i]], ap->eio, ap->eie, ap->edo, ap->ede);
+// for (i = 0; i <= len1; ++i) printf("(%d,%d,%d) ", curr[i].M, curr[i].I, curr[i].D); putchar('\n');
}
- free(f); free(b); free(s); free(_qual);
- return 0;
+ ret = curr[len1].M >= curr[len1].I? curr[len1].M : curr[len1].I;
+ ret = ret >= curr[len1].D? ret : curr[len1].D;
+ free(curr); free(last);
+ return ret;
}
#ifdef _MAIN
-int main()
+int main(int argc, char *argv[])
{
- int l_ref = 5, l_query = 4;
- uint8_t *ref = (uint8_t*)"\0\1\3\3\1";
- uint8_t *query = (uint8_t*)"\0\3\3\1";
-// uint8_t *query = (uint8_t*)"\1\3\3\1";
- static uint8_t qual[4] = {20, 20, 20, 20};
- ka_prob_glocal(ref, l_ref, query, l_query, qual, &ka_probpar_def, 0, 0);
+// int len1 = 35, len2 = 35;
+// uint8_t *seq1 = (uint8_t*)"\0\0\3\3\2\0\0\0\1\0\2\1\2\1\3\2\3\3\3\0\2\3\2\1\1\3\3\3\2\3\3\1\0\0\1";
+// uint8_t *seq2 = (uint8_t*)"\0\0\3\3\2\0\0\0\1\0\2\1\2\1\3\2\3\3\3\0\2\3\2\1\1\3\3\3\2\3\3\1\0\1\0";
+ int len1 = 4, len2 = 4;
+ uint8_t *seq1 = (uint8_t*)"\1\0\0\1";
+ uint8_t *seq2 = (uint8_t*)"\1\0\1\0";
+ int sc;
+// ka_global_core(seq1, 2, seq2, 1, &ka_param_qual, &sc, 0);
+ sc = ka_global_score(seq1, len1, seq2, len2, &ka_param2_qual);
+ printf("%d\n", sc);
return 0;
}
#endif
} ka_param_t;
typedef struct {
- float d, e;
- int bw;
-} ka_probpar_t;
+ int iio, iie, ido, ide;
+ int eio, eie, edo, ede;
+ int *matrix;
+ int row;
+ int band_width;
+} ka_param2_t;
#ifdef __cplusplus
extern "C" {
uint32_t *ka_global_core(uint8_t *seq1, int len1, uint8_t *seq2, int len2, const ka_param_t *ap,
int *_score, int *n_cigar);
- int ka_prob_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual,
- const ka_probpar_t *c, int *state, uint8_t *q);
-
+ int ka_global_score(const uint8_t *_seq1, int len1, const uint8_t *_seq2, int len2, const ka_param2_t *ap);
#ifdef __cplusplus
}
#endif
extern ka_param_t ka_param_blast; /* = { 5, 2, 5, 2, aln_sm_blast, 5, 50 }; */
-extern ka_probpar_t ka_probpar_def; /* { 0.0001, 0.1, 10 } */
+extern ka_param_t ka_param_qual; // only use this for global alignment!!!
+extern ka_param2_t ka_param2_qual; // only use this for global alignment!!!
#endif
--- /dev/null
+/* The MIT License
+
+ Copyright (c) 2003-2006, 2008-2010, by Heng Li <lh3lh3@live.co.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+*/
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <math.h>
+#include "kprobaln.h"
+
+/*****************************************
+ * Probabilistic banded glocal alignment *
+ *****************************************/
+
+#define EI .25
+#define EM .33333333333
+
+static float g_qual2prob[256];
+
+#define set_u(u, b, i, k) { int x=(i)-(b); x=x>0?x:0; (u)=((k)-x+1)*3; }
+
+kpa_par_t kpa_par_def = { 0.001, 0.1, 10 };
+kpa_par_t kpa_par_alt = { 0.0001, 0.01, 10 };
+
+/*
+ The topology of the profile HMM:
+
+ /\ /\ /\ /\
+ I[1] I[k-1] I[k] I[L]
+ ^ \ \ ^ \ ^ \ \ ^
+ | \ \ | \ | \ \ |
+ M[0] M[1] -> ... -> M[k-1] -> M[k] -> ... -> M[L] M[L+1]
+ \ \/ \/ \/ /
+ \ /\ /\ /\ /
+ -> D[k-1] -> D[k] ->
+
+ M[0] points to every {M,I}[k] and every {M,I}[k] points M[L+1].
+
+ On input, _ref is the reference sequence and _query is the query
+ sequence. Both are sequences of 0/1/2/3/4 where 4 stands for an
+ ambiguous residue. iqual is the base quality. c sets the gap open
+ probability, gap extension probability and band width.
+
+ On output, state and q are arrays of length l_query. The higher 30
+ bits give the reference position the query base is matched to and the
+ lower two bits can be 0 (an alignment match) or 1 (an
+ insertion). q[i] gives the phred scaled posterior probability of
+ state[i] being wrong.
+ */
+int kpa_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual,
+ const kpa_par_t *c, int *state, uint8_t *q)
+{
+ double **f, **b = 0, *s, m[9], sI, sM, bI, bM, pb;
+ float *qual, *_qual;
+ const uint8_t *ref, *query;
+ int bw, bw2, i, k, is_diff = 0, is_backward = 1, Pr;
+
+ /*** initialization ***/
+ is_backward = state && q? 1 : 0;
+ ref = _ref - 1; query = _query - 1; // change to 1-based coordinate
+ bw = l_ref > l_query? l_ref : l_query;
+ if (bw > c->bw) bw = c->bw;
+ if (bw < abs(l_ref - l_query)) bw = abs(l_ref - l_query);
+ bw2 = bw * 2 + 1;
+ // allocate the forward and backward matrices f[][] and b[][] and the scaling array s[]
+ f = calloc(l_query+1, sizeof(void*));
+ if (is_backward) b = calloc(l_query+1, sizeof(void*));
+ for (i = 0; i <= l_query; ++i) {
+ f[i] = calloc(bw2 * 3 + 6, sizeof(double)); // FIXME: this is over-allocated for very short seqs
+ if (is_backward) b[i] = calloc(bw2 * 3 + 6, sizeof(double));
+ }
+ s = calloc(l_query+2, sizeof(double)); // s[] is the scaling factor to avoid underflow
+ // initialize qual
+ _qual = calloc(l_query, sizeof(float));
+ if (g_qual2prob[0] == 0)
+ for (i = 0; i < 256; ++i)
+ g_qual2prob[i] = pow(10, -i/10.);
+ for (i = 0; i < l_query; ++i) _qual[i] = g_qual2prob[iqual? iqual[i] : 30];
+ qual = _qual - 1;
+ // initialize transition probability
+ sM = sI = 1. / (2 * l_query + 2); // the value here seems not to affect results; FIXME: need proof
+ m[0*3+0] = (1 - c->d - c->d) * (1 - sM); m[0*3+1] = m[0*3+2] = c->d * (1 - sM);
+ m[1*3+0] = (1 - c->e) * (1 - sI); m[1*3+1] = c->e * (1 - sI); m[1*3+2] = 0.;
+ m[2*3+0] = 1 - c->e; m[2*3+1] = 0.; m[2*3+2] = c->e;
+ bM = (1 - c->d) / l_ref; bI = c->d / l_ref; // (bM+bI)*l_ref==1
+ /*** forward ***/
+ // f[0]
+ set_u(k, bw, 0, 0);
+ f[0][k] = s[0] = 1.;
+ { // f[1]
+ double *fi = f[1], sum;
+ int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1, _beg, _end;
+ for (k = beg, sum = 0.; k <= end; ++k) {
+ int u;
+ double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM;
+ set_u(u, bw, 1, k);
+ fi[u+0] = e * bM; fi[u+1] = EI * bI;
+ sum += fi[u] + fi[u+1];
+ }
+ // rescale
+ s[1] = sum;
+ set_u(_beg, bw, 1, beg); set_u(_end, bw, 1, end); _end += 2;
+ for (k = _beg; k <= _end; ++k) fi[k] /= sum;
+ }
+ // f[2..l_query]
+ for (i = 2; i <= l_query; ++i) {
+ double *fi = f[i], *fi1 = f[i-1], sum, qli = qual[i];
+ int beg = 1, end = l_ref, x, _beg, _end;
+ uint8_t qyi = query[i];
+ x = i - bw; beg = beg > x? beg : x; // band start
+ x = i + bw; end = end < x? end : x; // band end
+ for (k = beg, sum = 0.; k <= end; ++k) {
+ int u, v11, v01, v10;
+ double e;
+ e = (ref[k] > 3 || qyi > 3)? 1. : ref[k] == qyi? 1. - qli : qli * EM;
+ set_u(u, bw, i, k); set_u(v11, bw, i-1, k-1); set_u(v10, bw, i-1, k); set_u(v01, bw, i, k-1);
+ fi[u+0] = e * (m[0] * fi1[v11+0] + m[3] * fi1[v11+1] + m[6] * fi1[v11+2]);
+ fi[u+1] = EI * (m[1] * fi1[v10+0] + m[4] * fi1[v10+1]);
+ fi[u+2] = m[2] * fi[v01+0] + m[8] * fi[v01+2];
+ sum += fi[u] + fi[u+1] + fi[u+2];
+// fprintf(stderr, "F (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, fi[u], fi[u+1], fi[u+2]); // DEBUG
+ }
+ // rescale
+ s[i] = sum;
+ set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2;
+ for (k = _beg, sum = 1./sum; k <= _end; ++k) fi[k] *= sum;
+ }
+ { // f[l_query+1]
+ double sum;
+ for (k = 1, sum = 0.; k <= l_ref; ++k) {
+ int u;
+ set_u(u, bw, l_query, k);
+ if (u < 3 || u >= bw2*3+3) continue;
+ sum += f[l_query][u+0] * sM + f[l_query][u+1] * sI;
+ }
+ s[l_query+1] = sum; // the last scaling factor
+ }
+ { // compute likelihood
+ double p = 1., Pr1 = 0.;
+ for (i = 0; i <= l_query + 1; ++i) {
+ p *= s[i];
+ if (p < 1e-100) Pr += -4.343 * log(p), p = 1.;
+ }
+ Pr1 += -4.343 * log(p * l_ref * l_query);
+ Pr = (int)(Pr1 + .499);
+ if (!is_backward) { // skip backward and MAP
+ for (i = 0; i <= l_query; ++i) free(f[i]);
+ free(f); free(s); free(_qual);
+ return Pr;
+ }
+ }
+ /*** backward ***/
+ // b[l_query] (b[l_query+1][0]=1 and thus \tilde{b}[][]=1/s[l_query+1]; this is where s[l_query+1] comes from)
+ for (k = 1; k <= l_ref; ++k) {
+ int u;
+ double *bi = b[l_query];
+ set_u(u, bw, l_query, k);
+ if (u < 3 || u >= bw2*3+3) continue;
+ bi[u+0] = sM / s[l_query] / s[l_query+1]; bi[u+1] = sI / s[l_query] / s[l_query+1];
+ }
+ // b[l_query-1..1]
+ for (i = l_query - 1; i >= 1; --i) {
+ int beg = 1, end = l_ref, x, _beg, _end;
+ double *bi = b[i], *bi1 = b[i+1], y = (i > 1), qli1 = qual[i+1];
+ uint8_t qyi1 = query[i+1];
+ x = i - bw; beg = beg > x? beg : x;
+ x = i + bw; end = end < x? end : x;
+ for (k = end; k >= beg; --k) {
+ int u, v11, v01, v10;
+ double e;
+ set_u(u, bw, i, k); set_u(v11, bw, i+1, k+1); set_u(v10, bw, i+1, k); set_u(v01, bw, i, k+1);
+ e = (k >= l_ref? 0 : (ref[k+1] > 3 || qyi1 > 3)? 1. : ref[k+1] == qyi1? 1. - qli1 : qli1 * EM) * bi1[v11];
+ bi[u+0] = e * m[0] + EI * m[1] * bi1[v10+1] + m[2] * bi[v01+2]; // bi1[v11] has been foled into e.
+ bi[u+1] = e * m[3] + EI * m[4] * bi1[v10+1];
+ bi[u+2] = (e * m[6] + m[8] * bi[v01+2]) * y;
+// fprintf(stderr, "B (%d,%d;%d): %lg,%lg,%lg\n", i, k, u, bi[u], bi[u+1], bi[u+2]); // DEBUG
+ }
+ // rescale
+ set_u(_beg, bw, i, beg); set_u(_end, bw, i, end); _end += 2;
+ for (k = _beg, y = 1./s[i]; k <= _end; ++k) bi[k] *= y;
+ }
+ { // b[0]
+ int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1;
+ double sum = 0.;
+ for (k = end; k >= beg; --k) {
+ int u;
+ double e = (ref[k] > 3 || query[1] > 3)? 1. : ref[k] == query[1]? 1. - qual[1] : qual[1] * EM;
+ set_u(u, bw, 1, k);
+ if (u < 3 || u >= bw2*3+3) continue;
+ sum += e * b[1][u+0] * bM + EI * b[1][u+1] * bI;
+ }
+ set_u(k, bw, 0, 0);
+ pb = b[0][k] = sum / s[0]; // if everything works as is expected, pb == 1.0
+ }
+ is_diff = fabs(pb - 1.) > 1e-7? 1 : 0;
+ /*** MAP ***/
+ for (i = 1; i <= l_query; ++i) {
+ double sum = 0., *fi = f[i], *bi = b[i], max = 0.;
+ int beg = 1, end = l_ref, x, max_k = -1;
+ x = i - bw; beg = beg > x? beg : x;
+ x = i + bw; end = end < x? end : x;
+ for (k = beg; k <= end; ++k) {
+ int u;
+ double z;
+ set_u(u, bw, i, k);
+ z = fi[u+0] * bi[u+0]; if (z > max) max = z, max_k = (k-1)<<2 | 0; sum += z;
+ z = fi[u+1] * bi[u+1]; if (z > max) max = z, max_k = (k-1)<<2 | 1; sum += z;
+ }
+ max /= sum; sum *= s[i]; // if everything works as is expected, sum == 1.0
+ if (state) state[i-1] = max_k;
+ if (q) k = (int)(-4.343 * log(1. - max) + .499), q[i-1] = k > 100? 99 : k;
+#ifdef _MAIN
+ fprintf(stderr, "(%.10lg,%.10lg) (%d,%d:%c,%c:%d) %lg\n", pb, sum, i-1, max_k>>2,
+ "ACGT"[query[i]], "ACGT"[ref[(max_k>>2)+1]], max_k&3, max); // DEBUG
+#endif
+ }
+ /*** free ***/
+ for (i = 0; i <= l_query; ++i) {
+ free(f[i]); free(b[i]);
+ }
+ free(f); free(b); free(s); free(_qual);
+ return Pr;
+}
+
+#ifdef _MAIN
+#include <unistd.h>
+int main(int argc, char *argv[])
+{
+ uint8_t conv[256], *iqual, *ref, *query;
+ int c, l_ref, l_query, i, q = 30, b = 10, P;
+ while ((c = getopt(argc, argv, "b:q:")) >= 0) {
+ switch (c) {
+ case 'b': b = atoi(optarg); break;
+ case 'q': q = atoi(optarg); break;
+ }
+ }
+ if (optind + 2 > argc) {
+ fprintf(stderr, "Usage: %s [-q %d] [-b %d] <ref> <query>\n", argv[0], q, b); // example: acttc attc
+ return 1;
+ }
+ memset(conv, 4, 256);
+ conv['a'] = conv['A'] = 0; conv['c'] = conv['C'] = 1;
+ conv['g'] = conv['G'] = 2; conv['t'] = conv['T'] = 3;
+ ref = (uint8_t*)argv[optind]; query = (uint8_t*)argv[optind+1];
+ l_ref = strlen((char*)ref); l_query = strlen((char*)query);
+ for (i = 0; i < l_ref; ++i) ref[i] = conv[ref[i]];
+ for (i = 0; i < l_query; ++i) query[i] = conv[query[i]];
+ iqual = malloc(l_query);
+ memset(iqual, q, l_query);
+ kpa_par_def.bw = b;
+ P = kpa_glocal(ref, l_ref, query, l_query, iqual, &kpa_par_alt, 0, 0);
+ fprintf(stderr, "%d\n", P);
+ free(iqual);
+ return 0;
+}
+#endif
--- /dev/null
+/* The MIT License
+
+ Copyright (c) 2003-2006, 2008, 2009 by Heng Li <lh3@live.co.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+*/
+
+#ifndef LH3_KPROBALN_H_
+#define LH3_KPROBALN_H_
+
+#include <stdint.h>
+
+typedef struct {
+ float d, e;
+ int bw;
+} kpa_par_t;
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+ int kpa_glocal(const uint8_t *_ref, int l_ref, const uint8_t *_query, int l_query, const uint8_t *iqual,
+ const kpa_par_t *c, int *state, uint8_t *q);
+
+#ifdef __cplusplus
+}
+#endif
+
+extern kpa_par_t kpa_par_def, kpa_par_alt;
+
+#endif
&usage if (@ARGV < 1);
my $command = shift(@ARGV);
-my %func = (showALEN=>\&showALEN, pileup2fq=>\&pileup2fq, varFilter=>\&varFilter,
+my %func = (showALEN=>\&showALEN, pileup2fq=>\&pileup2fq, varFilter=>\&varFilter, plp2vcf=>\&plp2vcf,
unique=>\&unique, uniqcmp=>\&uniqcmp, sra2hdr=>\&sra2hdr, sam2fq=>\&sam2fq);
die("Unknown command \"$command\".\n") if (!defined($func{$command}));
close($fh);
}
+sub plp2vcf {
+ while (<>) {
+ my @t = split;
+ next if ($t[3] eq '*/*');
+ if ($t[2] eq '*') { # indel
+ my @s = split("/", $t[3]);
+ my (@a, @b);
+ my ($ref, $alt);
+ for (@s) {
+ next if ($_ eq '*');
+ if (/^-/) {
+ push(@a, 'N'.substr($_, 1));
+ push(@b, 'N');
+ } elsif (/^\+/) {
+ push(@a, 'N');
+ push(@b, 'N'.substr($_, 1));
+ }
+ }
+ if ($a[0] && $a[1]) {
+ if (length($a[0]) < length($a[1])) {
+ $ref = $a[1];
+ $alt = ($b[0] . ('N' x (length($a[1]) - length($a[0])))) . ",$b[1]";
+ } elsif (length($a[0]) > length($a[1])) {
+ $ref = $a[0];
+ $alt = ($b[1] . ('N' x (length($a[0]) - length($a[1])))) . ",$b[0]";
+ } else {
+ $ref = $a[0];
+ $alt = ($b[0] eq $b[1])? $b[0] : "$b[0],$b[1]";
+ }
+ } else {
+ $ref = $a[0]; $alt = $b[0];
+ }
+ print join("\t", @t[0,1], '.', $ref, $alt, $t[5], '.', '.'), "\n";
+ } else { # SNP
+ }
+ }
+}
+
#
# Usage
#
#include "khash.h"
KHASH_SET_INIT_STR(rg)
+// When counting records instead of printing them,
+// data passed to the bam_fetch callback is encapsulated in this struct.
+typedef struct {
+ bam_header_t *header;
+ int *count;
+} count_func_data_t;
+
typedef khash_t(rg) *rghash_t;
rghash_t g_rghash = 0;
return 0;
}
-// callback function for bam_fetch()
+// callback function for bam_fetch() that prints nonskipped records
static int view_func(const bam1_t *b, void *data)
{
if (!__g_skip_aln(((samfile_t*)data)->header, b))
return 0;
}
+// callback function for bam_fetch() that counts nonskipped records
+static int count_func(const bam1_t *b, void *data)
+{
+ if (!__g_skip_aln(((count_func_data_t*)data)->header, b)) {
+ (*((count_func_data_t*)data)->count)++;
+ }
+ return 0;
+}
+
static int usage(int is_long_help);
int main_samview(int argc, char *argv[])
{
- int c, is_header = 0, is_header_only = 0, is_bamin = 1, ret = 0, is_uncompressed = 0, is_bamout = 0, slx2sngr = 0;
+ int c, is_header = 0, is_header_only = 0, is_bamin = 1, ret = 0, is_uncompressed = 0, is_bamout = 0, slx2sngr = 0, is_count = 0;
int of_type = BAM_OFDEC, is_long_help = 0;
+ int count = 0;
samfile_t *in = 0, *out = 0;
char in_mode[5], out_mode[5], *fn_out = 0, *fn_list = 0, *fn_ref = 0, *fn_rg = 0;
/* parse command-line options */
strcpy(in_mode, "r"); strcpy(out_mode, "w");
- while ((c = getopt(argc, argv, "Sbt:hHo:q:f:F:ul:r:xX?T:CR:")) >= 0) {
+ while ((c = getopt(argc, argv, "Sbct:hHo:q:f:F:ul:r:xX?T:CR:")) >= 0) {
switch (c) {
+ case 'c': is_count = 1; break;
case 'C': slx2sngr = 1; break;
case 'S': is_bamin = 0; break;
case 'b': is_bamout = 1; break;
ret = 1;
goto view_end;
}
- if ((out = samopen(fn_out? fn_out : "-", out_mode, in->header)) == 0) {
+ if (!is_count && (out = samopen(fn_out? fn_out : "-", out_mode, in->header)) == 0) {
fprintf(stderr, "[main_samview] fail to open \"%s\" for writing.\n", fn_out? fn_out : "standard output");
ret = 1;
goto view_end;
while ((r = samread(in, b)) >= 0) { // read one alignment from `in'
if (!__g_skip_aln(in->header, b)) {
if (slx2sngr) sol2sanger(b);
- samwrite(out, b); // write the alignment to `out'
+ if (!is_count) samwrite(out, b); // write the alignment to `out'
+ count++;
}
}
if (r < -1) {
goto view_end;
}
for (i = optind + 1; i < argc; ++i) {
- int tid, beg, end;
+ int tid, beg, end, result;
bam_parse_region(in->header, argv[i], &tid, &beg, &end); // parse a region in the format like `chr2:100-200'
if (tid < 0) { // reference name is not found
fprintf(stderr, "[main_samview] region \"%s\" specifies an unknown reference name. Continue anyway.\n", argv[i]);
continue;
}
// fetch alignments
- if (bam_fetch(in->x.bam, idx, tid, beg, end, out, view_func) < 0) {
+ if (is_count) {
+ count_func_data_t count_data = { in->header, &count };
+ result = bam_fetch(in->x.bam, idx, tid, beg, end, &count_data, count_func);
+ } else
+ result = bam_fetch(in->x.bam, idx, tid, beg, end, out, view_func);
+ if (result < 0) {
fprintf(stderr, "[main_samview] retrieval of region \"%s\" failed due to truncated file or corrupt BAM index file\n", argv[i]);
ret = 1;
break;
}
view_end:
+ if (is_count && ret == 0) {
+ printf("%d\n", count);
+ }
// close files, free and return
free(fn_list); free(fn_ref); free(fn_out); free(g_library); free(g_rg); free(fn_rg);
if (g_rghash) {
kh_destroy(rg, g_rghash);
}
samclose(in);
- samclose(out);
+ if (!is_count)
+ samclose(out);
return ret;
}
fprintf(stderr, " -u uncompressed BAM output (force -b)\n");
fprintf(stderr, " -x output FLAG in HEX (samtools-C specific)\n");
fprintf(stderr, " -X output FLAG in string (samtools-C specific)\n");
+ fprintf(stderr, " -c print only the count of matching records\n");
fprintf(stderr, " -t FILE list of reference names and lengths (force -S) [null]\n");
fprintf(stderr, " -T FILE reference sequence file (force -S) [null]\n");
fprintf(stderr, " -o FILE output file name [stdout]\n");
-.TH samtools 1 "27 October 2010" "samtools-0.1.9" "Bioinformatics tools"
+.TH samtools 1 "15 November 2010" "samtools-0.1.10" "Bioinformatics tools"
.SH NAME
.PP
samtools - Utilities for the Sequence Alignment/Map (SAM) format
.PP
samtools pileup -vcf ref.fasta aln.sorted.bam
.PP
-samtools mpileup -C50 -agf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
+samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
.PP
samtools tview aln.sorted.bam ref.fasta
.TP 10
.B view
-samtools view [-bhuHS] [-t in.refList] [-o output] [-f reqFlag] [-F
+samtools view [-bchuHS] [-t in.refList] [-o output] [-f reqFlag] [-F
skipFlag] [-q minMapQ] [-l library] [-r readGroup] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
Extract/print all or sub alignments in SAM or BAM format. If no region
.B -b
Output in the BAM format.
.TP
-.BI -f " INT"
+.BI -f \ INT
Only output alignments with all bits in INT present in the FLAG
field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
.TP
-.BI -F " INT"
+.BI -F \ INT
Skip alignments with bits present in INT [0]
.TP
.B -h
.B -H
Output the header only.
.TP
-.BI -l " STR"
+.BI -l \ STR
Only output reads in library STR [null]
.TP
-.BI -o " FILE"
+.BI -o \ FILE
Output file [stdout]
.TP
-.BI -q " INT"
+.BI -q \ INT
Skip alignments with MAPQ smaller than INT [0]
.TP
-.BI -r " STR"
+.BI -r \ STR
Only output reads in read group STR [null]
.TP
-.BI -R " FILE"
+.BI -R \ FILE
Output reads in read groups listed in
.I FILE
[null]
.B `-t'
option is required.
.TP
-.BI -t " FILE"
+.B -c
+Instead of printing the alignments, only count them and print the
+total number. All filter options, such as
+.B `-f',
+.B `-F'
+and
+.B `-q'
+, are taken into account.
+.TP
+.BI -t \ FILE
This file is TAB-delimited. Each line must contain the reference name
and the length of the reference, one line for each distinct reference;
additional fields are ignored. This file also defines the order of the
region in the format like `chr10:10,000,000' or `=10,000,000' when
viewing the same reference sequence.
-.TP
-.B pileup
-samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] [-l
-in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N nHap] [-r
-pairDiffRate] [-m mask] [-d maxIndelDepth] [-G indelPrior]
-<in.bam>|<in.sam>
-
-Print the alignment in the pileup format. In the pileup format, each
-line represents a genomic position, consisting of chromosome name,
-coordinate, reference base, read bases, read qualities and alignment
-mapping qualities. Information on match, mismatch, indel, strand,
-mapping quality and start and end of a read are all encoded at the read
-base column. At this column, a dot stands for a match to the reference
-base on the forward strand, a comma for a match on the reverse strand,
-a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward
-strand and `acgtn' for a mismatch on the reverse strand. A pattern
-`\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this
-reference position and the next reference position. The length of the
-insertion is given by the integer in the pattern, followed by the
-inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'
-represents a deletion from the reference. The deleted bases will be
-presented as `*' in the following lines. Also at the read base column, a
-symbol `^' marks the start of a read. The ASCII of the character
-following `^' minus 33 gives the mapping quality. A symbol `$' marks the
-end of a read segment.
-
-If option
-.B -c
-is applied, the consensus base, Phred-scaled consensus quality, SNP
-quality (i.e. the Phred-scaled probability of the consensus being
-identical to the reference) and root mean square (RMS) mapping quality
-of the reads covering the site will be inserted between the `reference
-base' and the `read bases' columns. An indel occupies an additional
-line. Each indel line consists of chromosome name, coordinate, a star,
-the genotype, consensus quality, SNP quality, RMS mapping quality, #
-covering reads, the first alllele, the second allele, # reads supporting
-the first allele, # reads supporting the second allele and # reads
-containing indels different from the top two alleles.
-
-The position of indels is offset by -1.
-
-.B OPTIONS:
-.RS
-.TP 10
-.B -B
-Disable the BAQ computation. See the
-.B mpileup
-command for details.
-.TP
-.B -c
-Call the consensus sequence using SOAPsnp consensus model. Options
-.BR -T ", " -N ", " -I " and " -r
-are only effective when
-.BR -c " or " -g
-is in use.
-.TP
-.BI -C " INT"
-Coefficient for downgrading the mapping quality of poorly mapped
-reads. See the
-.B mpileup
-command for details. [0]
-.TP
-.BI -d " INT"
-Use the first
-.I NUM
-reads in the pileup for indel calling for speed up. Zero for unlimited. [1024]
-.TP
-.BI -f " FILE"
-The reference sequence in the FASTA format. Index file
-.I FILE.fai
-will be created if
-absent.
-.TP
-.B -g
-Generate genotype likelihood in the binary GLFv3 format. This option
-suppresses -c, -i and -s. This option is deprecated by the
-.B mpileup
-command.
-.TP
-.B -i
-Only output pileup lines containing indels.
-.TP
-.BI -I " INT"
-Phred probability of an indel in sequencing/prep. [40]
-.TP
-.BI -l " FILE"
-List of sites at which pileup is output. This file is space
-delimited. The first two columns are required to be chromosome and
-1-based coordinate. Additional columns are ignored. It is
-recommended to use option
-.TP
-.BI -m " INT"
-Filter reads with flag containing bits in
-.I INT
-[1796]
-.TP
-.BI -M " INT"
-Cap mapping quality at INT [60]
-.TP
-.BI -N " INT"
-Number of haplotypes in the sample (>=2) [2]
-.TP
-.BI -r " FLOAT"
-Expected fraction of differences between a pair of haplotypes [0.001]
-.TP
-.B -s
-Print the mapping quality as the last column. This option makes the
-output easier to parse, although this format is not space efficient.
-.TP
-.B -S
-The input file is in SAM.
-.TP
-.BI -t " FILE"
-List of reference names ane sequence lengths, in the format described
-for the
-.B import
-command. If this option is present, samtools assumes the input
-.I <in.alignment>
-is in SAM format; otherwise it assumes in BAM format.
-.B -s
-together with
-.B -l
-as in the default format we may not know the mapping quality.
-.TP
-.BI -T " FLOAT"
-The theta parameter (error dependency coefficient) in the maq consensus
-calling model [0.85]
-.RE
-
.TP
.B mpileup
samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
misaligned. Applying this option greatly helps to reduce false SNPs
caused by misalignments.
.TP
-.BI -C " INT"
+.BI -C \ INT
Coefficient for downgrading mapping quality for reads containing
excessive mismatches. Given a read with a phred-scaled probability q of
being generated from the mapped position, the new mapping quality is
about sqrt((INT-q)/INT)*INT. A zero value disables this
-functionality; if enabled, the recommended value is 50. [0]
+functionality; if enabled, the recommended value for BWA is 50. [0]
+.TP
+.BI -e \ INT
+Phred-scaled gap extension sequencing error probability. Reducing
+.I INT
+leads to longer indels. [20]
.TP
-.BI -f " FILE"
+.BI -f \ FILE
The reference file [null]
.TP
.B -g
Compute genotype likelihoods and output them in the binary call format (BCF).
.TP
-.B -u
-Similar to
-.B -g
-except that the output is uncompressed BCF, which is preferred for pipeing.
-.TP
-.BI -l " FILE"
+.BI -h \ INT
+Coefficient for modeling homopolymer errors. Given an
+.IR l -long
+homopolymer
+run, the sequencing error of an indel of size
+.I s
+is modeled as
+.IR INT * s / l .
+[100]
+.TP
+.BI -l \ FILE
File containing a list of sites where pileup or BCF is outputted [null]
.TP
-.BI -q " INT"
+.BI -o \ INT
+Phred-scaled gap open sequencing error probability. Reducing
+.I INT
+leads to more indel calls. [40]
+.TP
+.BI -P \ STR
+Comma dilimited list of platforms (determined by
+.BR @RG-PL )
+from which indel candidates are obtained. It is recommended to collect
+indel candidates from sequencing technologies that have low indel error
+rate such as ILLUMINA. [all]
+.TP
+.BI -q \ INT
Minimum mapping quality for an alignment to be used [0]
.TP
-.BI -Q " INT"
+.BI -Q \ INT
Minimum base quality for a base to be considered [13]
.TP
-.BI -r " STR"
+.BI -r \ STR
Only generate pileup in region
.I STR
[all sites]
+.TP
+.B -u
+Similar to
+.B -g
+except that the output is uncompressed BCF, which is preferred for piping.
.RE
.TP
.B -n
Sort by read names rather than by chromosomal coordinates
.TP
-.B -m INT
+.BI -m \ INT
Approximately the maximum required memory. [500000000]
.RE
.B OPTIONS:
.RS
.TP 8
-.BI -h " FILE"
+.BI -h \ FILE
Use the lines of
.I FILE
as `@' headers to be copied to
is actually in SAM format, though any alignment records it may contain
are ignored.)
.TP
-.BI -R " STR"
+.BI -R \ STR
Merge files in the specified region indicated by
.I STR
.TP
.B OPTIONS:
.RS
.TP 8
+.B -A
+When used jointly with
+.B -r
+this option overwrites the original base quality.
+.TP 8
.B -e
Convert a the read base to = if it is identical to the aligned reference
base. Indel caller does not support the = bases at the moment.
.B -S
The input is SAM with header lines
.TP
-.BI -C " INT"
+.BI -C \ INT
Coefficient to cap mapping quality of poorly mapped reads. See the
.B pileup
command for details. [0]
.TP
.B -r
-Perform probabilistic realignment to compute BAQ, which will be used to
-cap base quality.
+Compute the BQ tag without changing the base quality.
+.RE
+
+.TP
+.B pileup
+samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] [-l
+in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N nHap] [-r
+pairDiffRate] [-m mask] [-d maxIndelDepth] [-G indelPrior]
+<in.bam>|<in.sam>
+
+Print the alignment in the pileup format. In the pileup format, each
+line represents a genomic position, consisting of chromosome name,
+coordinate, reference base, read bases, read qualities and alignment
+mapping qualities. Information on match, mismatch, indel, strand,
+mapping quality and start and end of a read are all encoded at the read
+base column. At this column, a dot stands for a match to the reference
+base on the forward strand, a comma for a match on the reverse strand,
+a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward
+strand and `acgtn' for a mismatch on the reverse strand. A pattern
+`\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this
+reference position and the next reference position. The length of the
+insertion is given by the integer in the pattern, followed by the
+inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'
+represents a deletion from the reference. The deleted bases will be
+presented as `*' in the following lines. Also at the read base column, a
+symbol `^' marks the start of a read. The ASCII of the character
+following `^' minus 33 gives the mapping quality. A symbol `$' marks the
+end of a read segment.
+
+If option
+.B -c
+is applied, the consensus base, Phred-scaled consensus quality, SNP
+quality (i.e. the Phred-scaled probability of the consensus being
+identical to the reference) and root mean square (RMS) mapping quality
+of the reads covering the site will be inserted between the `reference
+base' and the `read bases' columns. An indel occupies an additional
+line. Each indel line consists of chromosome name, coordinate, a star,
+the genotype, consensus quality, SNP quality, RMS mapping quality, #
+covering reads, the first alllele, the second allele, # reads supporting
+the first allele, # reads supporting the second allele and # reads
+containing indels different from the top two alleles.
+
+.B NOTE:
+Since 0.1.10, the `pileup' command is deprecated by `mpileup'.
+
+.B OPTIONS:
+.RS
+.TP 10
+.B -B
+Disable the BAQ computation. See the
+.B mpileup
+command for details.
+.TP
+.B -c
+Call the consensus sequence. Options
+.BR -T ", " -N ", " -I " and " -r
+are only effective when
+.BR -c " or " -g
+is in use.
+.TP
+.BI -C \ INT
+Coefficient for downgrading the mapping quality of poorly mapped
+reads. See the
+.B mpileup
+command for details. [0]
+.TP
+.BI -d \ INT
+Use the first
+.I NUM
+reads in the pileup for indel calling for speed up. Zero for unlimited. [1024]
+.TP
+.BI -f \ FILE
+The reference sequence in the FASTA format. Index file
+.I FILE.fai
+will be created if
+absent.
+.TP
+.B -g
+Generate genotype likelihood in the binary GLFv3 format. This option
+suppresses -c, -i and -s. This option is deprecated by the
+.B mpileup
+command.
+.TP
+.B -i
+Only output pileup lines containing indels.
+.TP
+.BI -I \ INT
+Phred probability of an indel in sequencing/prep. [40]
+.TP
+.BI -l \ FILE
+List of sites at which pileup is output. This file is space
+delimited. The first two columns are required to be chromosome and
+1-based coordinate. Additional columns are ignored. It is
+recommended to use option
+.TP
+.BI -m \ INT
+Filter reads with flag containing bits in
+.I INT
+[1796]
+.TP
+.BI -M \ INT
+Cap mapping quality at INT [60]
+.TP
+.BI -N \ INT
+Number of haplotypes in the sample (>=2) [2]
+.TP
+.BI -r \ FLOAT
+Expected fraction of differences between a pair of haplotypes [0.001]
+.TP
+.B -s
+Print the mapping quality as the last column. This option makes the
+output easier to parse, although this format is not space efficient.
+.TP
+.B -S
+The input file is in SAM.
+.TP
+.BI -t \ FILE
+List of reference names ane sequence lengths, in the format described
+for the
+.B import
+command. If this option is present, samtools assumes the input
+.I <in.alignment>
+is in SAM format; otherwise it assumes in BAM format.
+.B -s
+together with
+.B -l
+as in the default format we may not know the mapping quality.
+.TP
+.BI -T \ FLOAT
+The theta parameter (error dependency coefficient) in the maq consensus
+calling model [0.85]
.RE
.SH SAM FORMAT
.IP o 2
Call SNPs and short indels for one diploid individual:
- samtools pileup -vcf ref.fa aln.bam > var.raw.plp
- samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp
- awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' var.flt.plp > var.final.plp
+ samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf
+ bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf
The
.B -D
adjusted to about twice the average read depth. One may consider to add
.B -C50
to
-.B pileup
+.B mpileup
if mapping quality is overestimated for reads containing excessive
mismatches. Applying this option usually helps
.B BWA-short
-but may not other mappers. It also potentially increases reference
-biases.
+but may not other mappers.
.IP o 2
-Call SNPs (not short indels) for multiple diploid individuals:
+Call SNPs and short indels for multiple diploid individuals:
- samtools mpileup -augf ref.fa *.bam | bcftools view -bcv - > snp.raw.bcf
- bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 | bgzip > snp.flt.vcf.gz
+ samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view -bcvg - > var.raw.bcf
+ bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > var.flt.vcf
Individuals are identified from the
.B SM
tags in the
.B @RG
header lines. Individuals can be pooled in one alignment file; one
-individual can also be separated into multiple files. Similarly, one may
-consider to apply
-.B -C50
-to
-.BR mpileup .
-SNP calling in this way also works for single sample and has the
-advantage of enabling more powerful filtering. The drawback is the lack
-of short indel calling, which may be implemented in future.
+individual can also be separated into multiple files. The
+.B -P
+option specifies that indel candidates should be collected only from
+read groups with the
+.B @RG-PL
+tag set to
+.IR ILLUMINA .
+Collecting indel candidates from reads sequenced by an indel-prone
+technology may affect the performance of indel calling.
.IP o 2
Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals:
- samtools mpileup -gf ref.fa *.bam > all.bcf
+ samtools mpileup -Igf ref.fa *.bam > all.bcf
bcftools view -bl sites.list all.bcf > sites.bcf
bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
.IP o 2
Dump BAQ applied alignment for other SNP callers:
- samtools calmd -br aln.bam > aln.baq.bam
+ samtools calmd -bAr aln.bam > aln.baq.bam
It adds and corrects the
.B NM
.B calmd
command also comes with the
.B -C
-option, the same as the on in
+option, the same as the one in
.B pileup
and
.BR mpileup .
.PP
Heng Li from the Sanger Institute wrote the C version of samtools. Bob
Handsaker from the Broad Institute implemented the BGZF library and Jue
-Ruan from Beijing Genomics Institute wrote the RAZF library. Various
-people in the 1000 Genomes Project contributed to the SAM format
+Ruan from Beijing Genomics Institute wrote the RAZF library. John
+Marshall and Petr Danecek contribute to the source code and various
+people from the 1000 Genomes Project have contributed to the SAM format
specification.
.SH SEE ALSO
samtools pileup -vcf ref.fasta aln.sorted.bam
- samtools mpileup -C50 -agf ref.fasta -r chr3:1,000-2,000 in1.bam
- in2.bam
+ samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
samtools tview aln.sorted.bam ref.fasta
COMMANDS AND OPTIONS
- view samtools view [-bhuHS] [-t in.refList] [-o output] [-f
+ view samtools view [-bchuHS] [-t in.refList] [-o output] [-f
reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read-
Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
-S Input is in SAM. If @SQ header lines are absent, the
`-t' option is required.
+ -c Instead of printing the alignments, only count them
+ and print the total number. All filter options, such
+ as `-f', `-F' and `-q' , are taken into account.
+
-t FILE This file is TAB-delimited. Each line must contain
the reference name and the length of the reference,
one line for each distinct reference; additional
reference sequence.
- pileup samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
- [-l in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N
- nHap] [-r pairDiffRate] [-m mask] [-d maxIndelDepth] [-G
- indelPrior] <in.bam>|<in.sam>
-
- Print the alignment in the pileup format. In the pileup for-
- mat, each line represents a genomic position, consisting of
- chromosome name, coordinate, reference base, read bases, read
- qualities and alignment mapping qualities. Information on
- match, mismatch, indel, strand, mapping quality and start and
- end of a read are all encoded at the read base column. At
- this column, a dot stands for a match to the reference base
- on the forward strand, a comma for a match on the reverse
- strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
- match on the forward strand and `acgtn' for a mismatch on the
- reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates
- there is an insertion between this reference position and the
- next reference position. The length of the insertion is given
- by the integer in the pattern, followed by the inserted
- sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' repre-
- sents a deletion from the reference. The deleted bases will
- be presented as `*' in the following lines. Also at the read
- base column, a symbol `^' marks the start of a read. The
- ASCII of the character following `^' minus 33 gives the map-
- ping quality. A symbol `$' marks the end of a read segment.
-
- If option -c is applied, the consensus base, Phred-scaled
- consensus quality, SNP quality (i.e. the Phred-scaled proba-
- bility of the consensus being identical to the reference) and
- root mean square (RMS) mapping quality of the reads covering
- the site will be inserted between the `reference base' and
- the `read bases' columns. An indel occupies an additional
- line. Each indel line consists of chromosome name, coordi-
- nate, a star, the genotype, consensus quality, SNP quality,
- RMS mapping quality, # covering reads, the first alllele, the
- second allele, # reads supporting the first allele, # reads
- supporting the second allele and # reads containing indels
- different from the top two alleles.
-
- The position of indels is offset by -1.
-
- OPTIONS:
-
- -B Disable the BAQ computation. See the mpileup com-
- mand for details.
-
- -c Call the consensus sequence using SOAPsnp consensus
- model. Options -T, -N, -I and -r are only effective
- when -c or -g is in use.
-
- -C INT Coefficient for downgrading the mapping quality of
- poorly mapped reads. See the mpileup command for
- details. [0]
-
- -d INT Use the first NUM reads in the pileup for indel
- calling for speed up. Zero for unlimited. [1024]
-
- -f FILE The reference sequence in the FASTA format. Index
- file FILE.fai will be created if absent.
-
- -g Generate genotype likelihood in the binary GLFv3
- format. This option suppresses -c, -i and -s. This
- option is deprecated by the mpileup command.
-
- -i Only output pileup lines containing indels.
-
- -I INT Phred probability of an indel in sequencing/prep.
- [40]
-
- -l FILE List of sites at which pileup is output. This file
- is space delimited. The first two columns are
- required to be chromosome and 1-based coordinate.
- Additional columns are ignored. It is recommended
- to use option
-
- -m INT Filter reads with flag containing bits in INT
- [1796]
-
- -M INT Cap mapping quality at INT [60]
-
- -N INT Number of haplotypes in the sample (>=2) [2]
-
- -r FLOAT Expected fraction of differences between a pair of
- haplotypes [0.001]
-
- -s Print the mapping quality as the last column. This
- option makes the output easier to parse, although
- this format is not space efficient.
-
- -S The input file is in SAM.
-
- -t FILE List of reference names ane sequence lengths, in
- the format described for the import command. If
- this option is present, samtools assumes the input
- <in.alignment> is in SAM format; otherwise it
- assumes in BAM format. -s together with -l as in
- the default format we may not know the mapping
- quality.
-
- -T FLOAT The theta parameter (error dependency coefficient)
- in the maq consensus calling model [0.85]
-
-
mpileup samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l
list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam
[...]]
phred-scaled probability q of being generated from
the mapped position, the new mapping quality is about
sqrt((INT-q)/INT)*INT. A zero value disables this
- functionality; if enabled, the recommended value is
- 50. [0]
+ functionality; if enabled, the recommended value for
+ BWA is 50. [0]
+
+ -e INT Phred-scaled gap extension sequencing error probabil-
+ ity. Reducing INT leads to longer indels. [20]
-f FILE The reference file [null]
- -g Compute genotype likelihoods and output them in the
+ -g Compute genotype likelihoods and output them in the
binary call format (BCF).
- -u Similar to -g except that the output is uncompressed
- BCF, which is preferred for pipeing.
+ -h INT Coefficient for modeling homopolymer errors. Given an
+ l-long homopolymer run, the sequencing error of an
+ indel of size s is modeled as INT*s/l. [100]
-l FILE File containing a list of sites where pileup or BCF
is outputted [null]
- -q INT Minimum mapping quality for an alignment to be used
+ -o INT Phred-scaled gap open sequencing error probability.
+ Reducing INT leads to more indel calls. [40]
+
+ -P STR Comma dilimited list of platforms (determined by @RG-
+ PL) from which indel candidates are obtained. It is
+ recommended to collect indel candidates from sequenc-
+ ing technologies that have low indel error rate such
+ as ILLUMINA. [all]
+
+ -q INT Minimum mapping quality for an alignment to be used
[0]
-Q INT Minimum base quality for a base to be considered [13]
-r STR Only generate pileup in region STR [all sites]
+ -u Similar to -g except that the output is uncompressed
+ BCF, which is preferred for piping.
+
reheader samtools reheader <in.header.sam> <in.bam>
OPTIONS:
- -e Convert a the read base to = if it is identical to
- the aligned reference base. Indel caller does not
+ -A When used jointly with -r this option overwrites the
+ original base quality.
+
+ -e Convert a the read base to = if it is identical to
+ the aligned reference base. Indel caller does not
support the = bases at the moment.
-u Output uncompressed BAM
-S The input is SAM with header lines
- -C INT Coefficient to cap mapping quality of poorly mapped
+ -C INT Coefficient to cap mapping quality of poorly mapped
reads. See the pileup command for details. [0]
- -r Perform probabilistic realignment to compute BAQ,
- which will be used to cap base quality.
+ -r Compute the BQ tag without changing the base quality.
+
+
+ pileup samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
+ [-l in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N
+ nHap] [-r pairDiffRate] [-m mask] [-d maxIndelDepth] [-G
+ indelPrior] <in.bam>|<in.sam>
+
+ Print the alignment in the pileup format. In the pileup for-
+ mat, each line represents a genomic position, consisting of
+ chromosome name, coordinate, reference base, read bases, read
+ qualities and alignment mapping qualities. Information on
+ match, mismatch, indel, strand, mapping quality and start and
+ end of a read are all encoded at the read base column. At
+ this column, a dot stands for a match to the reference base
+ on the forward strand, a comma for a match on the reverse
+ strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
+ match on the forward strand and `acgtn' for a mismatch on the
+ reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates
+ there is an insertion between this reference position and the
+ next reference position. The length of the insertion is given
+ by the integer in the pattern, followed by the inserted
+ sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' repre-
+ sents a deletion from the reference. The deleted bases will
+ be presented as `*' in the following lines. Also at the read
+ base column, a symbol `^' marks the start of a read. The
+ ASCII of the character following `^' minus 33 gives the map-
+ ping quality. A symbol `$' marks the end of a read segment.
+
+ If option -c is applied, the consensus base, Phred-scaled
+ consensus quality, SNP quality (i.e. the Phred-scaled proba-
+ bility of the consensus being identical to the reference) and
+ root mean square (RMS) mapping quality of the reads covering
+ the site will be inserted between the `reference base' and
+ the `read bases' columns. An indel occupies an additional
+ line. Each indel line consists of chromosome name, coordi-
+ nate, a star, the genotype, consensus quality, SNP quality,
+ RMS mapping quality, # covering reads, the first alllele, the
+ second allele, # reads supporting the first allele, # reads
+ supporting the second allele and # reads containing indels
+ different from the top two alleles.
+
+ NOTE: Since 0.1.10, the `pileup' command is deprecated by
+ `mpileup'.
+
+ OPTIONS:
+
+ -B Disable the BAQ computation. See the mpileup com-
+ mand for details.
+
+ -c Call the consensus sequence. Options -T, -N, -I and
+ -r are only effective when -c or -g is in use.
+
+ -C INT Coefficient for downgrading the mapping quality of
+ poorly mapped reads. See the mpileup command for
+ details. [0]
+
+ -d INT Use the first NUM reads in the pileup for indel
+ calling for speed up. Zero for unlimited. [1024]
+
+ -f FILE The reference sequence in the FASTA format. Index
+ file FILE.fai will be created if absent.
+
+ -g Generate genotype likelihood in the binary GLFv3
+ format. This option suppresses -c, -i and -s. This
+ option is deprecated by the mpileup command.
+
+ -i Only output pileup lines containing indels.
+
+ -I INT Phred probability of an indel in sequencing/prep.
+ [40]
+
+ -l FILE List of sites at which pileup is output. This file
+ is space delimited. The first two columns are
+ required to be chromosome and 1-based coordinate.
+ Additional columns are ignored. It is recommended
+ to use option
+
+ -m INT Filter reads with flag containing bits in INT
+ [1796]
+
+ -M INT Cap mapping quality at INT [60]
+
+ -N INT Number of haplotypes in the sample (>=2) [2]
+
+ -r FLOAT Expected fraction of differences between a pair of
+ haplotypes [0.001]
+
+ -s Print the mapping quality as the last column. This
+ option makes the output easier to parse, although
+ this format is not space efficient.
+
+ -S The input file is in SAM.
+
+ -t FILE List of reference names ane sequence lengths, in
+ the format described for the import command. If
+ this option is present, samtools assumes the input
+ <in.alignment> is in SAM format; otherwise it
+ assumes in BAM format. -s together with -l as in
+ the default format we may not know the mapping
+ quality.
+
+ -T FLOAT The theta parameter (error dependency coefficient)
+ in the maq consensus calling model [0.85]
SAM FORMAT
- SAM is TAB-delimited. Apart from the header lines, which are started
+ SAM is TAB-delimited. Apart from the header lines, which are started
with the `@' symbol, each alignment line consists of:
o Attach the RG tag while merging sorted alignments:
- perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
+ perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
samtools merge -rh rg.txt merged.bam ga.bam 454.bam
The value in a RG tag is determined by the file name the read is com-
- ing from. In this example, in the merged.bam, reads from ga.bam will
- be attached RG:Z:ga, while reads from 454.bam will be attached
+ ing from. In this example, in the merged.bam, reads from ga.bam will
+ be attached RG:Z:ga, while reads from 454.bam will be attached
RG:Z:454.
o Call SNPs and short indels for one diploid individual:
- samtools pileup -vcf ref.fa aln.bam > var.raw.plp
- samtools.pl varFilter -D 100 var.raw.plp > var.flt.plp
- awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' var.flt.plp >
- var.final.plp
+ samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - >
+ var.raw.bcf
+ bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 >
+ var.flt.vcf
The -D option of varFilter controls the maximum read depth, which
should be adjusted to about twice the average read depth. One may
- consider to add -C50 to pileup if mapping quality is overestimated
+ consider to add -C50 to mpileup if mapping quality is overestimated
for reads containing excessive mismatches. Applying this option usu-
- ally helps BWA-short but may not other mappers. It also potentially
- increases reference biases.
+ ally helps BWA-short but may not other mappers.
- o Call SNPs (not short indels) for multiple diploid individuals:
+ o Call SNPs and short indels for multiple diploid individuals:
- samtools mpileup -augf ref.fa *.bam | bcftools view -bcv - >
- snp.raw.bcf
- bcftools view snp.raw.bcf | vcfutils.pl filter4vcf -D 2000 | bgzip
- > snp.flt.vcf.gz
+ samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view
+ -bcvg - > var.raw.bcf
+ bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 >
+ var.flt.vcf
- Individuals are identified from the SM tags in the @RG header lines.
- Individuals can be pooled in one alignment file; one individual can
- also be separated into multiple files. Similarly, one may consider to
- apply -C50 to mpileup. SNP calling in this way also works for single
- sample and has the advantage of enabling more powerful filtering. The
- drawback is the lack of short indel calling, which may be implemented
- in future.
+ Individuals are identified from the SM tags in the @RG header lines.
+ Individuals can be pooled in one alignment file; one individual can
+ also be separated into multiple files. The -P option specifies that
+ indel candidates should be collected only from read groups with the
+ @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads
+ sequenced by an indel-prone technology may affect the performance of
+ indel calling.
- o Derive the allele frequency spectrum (AFS) on a list of sites from
+ o Derive the allele frequency spectrum (AFS) on a list of sites from
multiple individuals:
- samtools mpileup -gf ref.fa *.bam > all.bcf
+ samtools mpileup -Igf ref.fa *.bam > all.bcf
bcftools view -bl sites.list all.bcf > sites.bcf
bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
......
where sites.list contains the list of sites with each line consisting
- of the reference sequence name and position. The following bcftools
+ of the reference sequence name and position. The following bcftools
commands estimate AFS by EM.
o Dump BAQ applied alignment for other SNP callers:
- samtools calmd -br aln.bam > aln.baq.bam
+ samtools calmd -bAr aln.bam > aln.baq.bam
- It adds and corrects the NM and MD tags at the same time. The calmd
- command also comes with the -C option, the same as the on in pileup
+ It adds and corrects the NM and MD tags at the same time. The calmd
+ command also comes with the -C option, the same as the one in pileup
and mpileup. Apply if it helps.
LIMITATIONS
- o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
+ o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
bam_aux.c.
- o In merging, the input files are required to have the same number of
- reference sequences. The requirement can be relaxed. In addition,
- merging does not reconstruct the header dictionaries automatically.
- Endusers have to provide the correct header. Picard is better at
+ o In merging, the input files are required to have the same number of
+ reference sequences. The requirement can be relaxed. In addition,
+ merging does not reconstruct the header dictionaries automatically.
+ Endusers have to provide the correct header. Picard is better at
merging.
- o Samtools paired-end rmdup does not work for unpaired reads (e.g.
- orphan reads or ends mapped to different chromosomes). If this is a
- concern, please use Picard's MarkDuplicate which correctly handles
+ o Samtools paired-end rmdup does not work for unpaired reads (e.g.
+ orphan reads or ends mapped to different chromosomes). If this is a
+ concern, please use Picard's MarkDuplicate which correctly handles
these cases, although a little slower.
AUTHOR
- Heng Li from the Sanger Institute wrote the C version of samtools. Bob
+ Heng Li from the Sanger Institute wrote the C version of samtools. Bob
Handsaker from the Broad Institute implemented the BGZF library and Jue
- Ruan from Beijing Genomics Institute wrote the RAZF library. Various
- people in the 1000 Genomes Project contributed to the SAM format speci-
+ Ruan from Beijing Genomics Institute wrote the RAZF library. John Mar-
+ shall and Petr Danecek contribute to the source code and various people
+ from the 1000 Genomes Project have contributed to the SAM format speci-
fication.
-samtools-0.1.9 27 October 2010 samtools(1)
+samtools-0.1.10 15 November 2010 samtools(1)