Normal and Compound Poisson Approximations for Pattern Occurrences in NGS Reads

	Zhai, Zhiyuan ; Reinert, Gesine ; Song, Kai ; Waterman, Michael S. ; Luan, Yihui ; Sun, Fengzhu
	2012
Keywords	algorithms; genome analysis; HMM; next generation sequencing; statistical models; FEATURE FREQUENCY PROFILES; WHOLE-PROTEOME PHYLOGENY; ALIGNMENT-FREE METHOD; FACTOR-BINDING SITES; DNA MOTIF DISCOVERY; MARKOV-CHAINS; EUKARYOTIC GENOMES; SEQ DATA; SEQUENCES; PROKARYOTES
Abstract	Next generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data using transcription factor GABP.
Software is available online (www-rcf.usc.edu/~fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).
Biochemical Research Methods; Biotechnology & Applied Microbiology; Computer Science, Interdisciplinary Applications; Mathematical & Computational Biology; Statistics & Probability; SCI(E); 0; ARTICLE; 6; 839-854; 19
Language	English
Source	SCI
Publisher	Journal of Computational Biology
Content type	Other
Source URL	[http://hdl.handle.net/20.500.11897/393237]
Collection	School of Mathematical Sciences
Recommended citation (GB/T 7714)	Zhai, Zhiyuan, Reinert, Gesine, Song, Kai, et al. Normal and Compound Poisson Approximations for Pattern Occurrences in NGS Reads. 2012-01-01.