Detecting Genomic Insertions and Deletions in the Cloud with MapReduce and Cloudbreak

The detection of genomic structural variations remains one of the most difficult challenges in analyzing high-throughput sequencing data. Considering multiple mappings of all reads, rather than only uniquely mapped discordant fragments, can improve the performance of read-pair-based detection methods. However, the computational requirements for creating, storing, and processing large-scale data sets with multiple mappings can be formidable. Meanwhile, the growing size and number of sequencing data sets have led to intense interest in distributing computation for genomic analyses to cloud or commodity servers. MapReduce, via its Hadoop implementation, is becoming a standard architecture for distributing processing across such compute clusters.
We have developed a conceptual framework for structural variation detection in Hadoop based on computing local features along the genome. In this framework, we have implemented and evaluated an algorithm for finding deletions and short insertions based on fitting a Gaussian mixture model (GMM) to the distribution of mapped insert sizes spanning each location in the genome. A similar method was used in MoDIL; however, our algorithm and the Hadoop framework drastically reduce the runtime requirements and overall difficulty of using this approach.
On simulated and real data sets of paired-end reads, our algorithm achieves performance similar to or better than a variety of popular structural variation detection algorithms. Cloudbreak performs well on both small (40-100 bp) and medium-sized (100 bp to 25 kb) deletions, and in our simulations has greater sensitivity at most fixed levels of specificity than other methods. We also show increased performance in finding deletions in repetitive areas of the genome, identifying more variants that overlap repeats than other approaches in both simulated and real data. Cloudbreak also outperforms other read-pair-based approaches for small insertion detection.
In addition, our algorithm can accurately genotype heterozygous and homozygous deletions and short insertions from diploid samples. Using the parameters computed in fitting the GMM and a simple thresholding procedure, we were able to achieve 88.0% and 94.9% accuracy in predicting the genotype of the true positive deletions we detected in simulated and real data sets, respectively, and 91.2% accuracy on simulated insertions.
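As a rough illustration of the mixture-model idea in the abstract, the sketch below fits a two-component Gaussian mixture to the insert sizes spanning one genomic location and then thresholds the fitted weight to suggest a genotype. It is not the Cloudbreak implementation. The function names, the choice to pin one component at the library mean with a shared fixed standard deviation, and the cutoffs `min_shift_sds`, `het_min`, and `hom_min` are all assumptions made for this example.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_gmm(insert_sizes, lib_mean, lib_sd, iterations=50):
    """EM fit of a two-component mixture (illustrative assumptions only).

    One component is pinned at the library mean (no variant); the other
    has a free mean (variant allele). Both share the fixed library sd.
    Returns (alt_mean, alt_weight).
    """
    alt_mean = float(max(insert_sizes))  # start the free component at the largest insert
    alt_weight = 0.5
    for _ in range(iterations):
        # E-step: responsibility of the free component for each insert size.
        resp = []
        for x in insert_sizes:
            p_ref = (1.0 - alt_weight) * normal_pdf(x, lib_mean, lib_sd)
            p_alt = alt_weight * normal_pdf(x, alt_mean, lib_sd)
            total = p_ref + p_alt
            resp.append(p_alt / total if total > 0.0 else 0.5)
        # M-step: update the free component's weight and mean.
        total_resp = sum(resp)
        alt_weight = total_resp / len(insert_sizes)
        if total_resp > 1e-9:
            alt_mean = sum(r * x for r, x in zip(resp, insert_sizes)) / total_resp
    return alt_mean, alt_weight


def call_genotype(alt_mean, alt_weight, lib_mean, lib_sd,
                  min_shift_sds=2.0, het_min=0.2, hom_min=0.8):
    """Threshold the fitted parameters; all cutoffs here are hypothetical."""
    shift = alt_mean - lib_mean  # a deletion stretches the mapped insert size
    if shift < min_shift_sds * lib_sd or alt_weight < het_min:
        return "no deletion"
    if alt_weight >= hom_min:
        return "homozygous deletion"
    return "heterozygous deletion"


if __name__ == "__main__":
    lib_mean, lib_sd = 300.0, 30.0
    offsets = [-30, -15, 0, 15, 30]
    # Half the fragments look normal, half span a 500 bp deletion.
    sizes = [300 + d for d in offsets] * 4 + [800 + d for d in offsets] * 4
    alt_mean, alt_weight = fit_two_component_gmm(sizes, lib_mean, lib_sd)
    print(call_genotype(alt_mean, alt_weight, lib_mean, lib_sd))
```

In a MapReduce setting, one plausible arrangement is for the map phase to emit each read pair's insert size keyed by the genomic locations it spans, and for the reduce phase to run a fit like this per location; that partitioning is likewise an assumption here, not a description of the actual pipeline.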

Speakers: Chris Whelan, PhD student at the Institute on Development and Disability and Center for Spoken Language Understanding, Oregon Health & Science University; Kemal Sonmez, Associate Professor of Bioinformatics and Computational Biology, Oregon Health & Science University

Room 465

Tuesday, 03/05/13

Cost: Free

UC Berkeley

Soda Hall
Berkeley, CA 94720
