Genomic and Reconfigurable Computing

Jacopo Werther [Public domain], via Wikimedia Commons https://commons.wikimedia.org/wiki/File:1gof_front_b.png

Next-generation genome sequencing technology has reached a point at which it is almost cost-effective to sequence all patients. Biologists, doctors, and researchers face an oncoming deluge of genomic data, whose processing requires new and scalable bioinformatics architectures and systems. Processing raw genetic sequence data is computationally expensive, and datasets are large. Current software systems can require many hours to process a single genome and generally run only on a single computer.

To address these challenges, we initially built Persona, a cluster-scale, high-throughput bioinformatics framework. Persona currently supports paired-read alignment, sorting, and duplicate marking using well-known algorithms and techniques. Persona can significantly reduce end-to-end processing times for bioinformatics computations. In a case study on sequence alignment, Persona sustains 1.353 gigabases aligned per second with 101 base pair reads on a 32-node cluster and can align a full genome in ~16.7 seconds using the SNAP algorithm. Our results demonstrate that: (1) alignment computation with Persona scales linearly across servers with no measurable completion-time imbalance and negligible framework overheads; (2) on a single server, sorting with Persona and AGD is up to 2.3× faster than commonly used tools, while duplicate marking is 3× faster; (3) with AGD, a 7-node COTS network storage system can serve up to 60 alignment nodes; (4) server cost dominates for a balanced system running Persona, while long-term data storage dwarfs the cost of computation.

Protein clustering is an essential technique in the field of proteomes; the analysis of the protein sequences contained in an organism’s genome.  Clustering reduces the number of comparisons needed to find similar pairs in a set of elements, such as protein sequences. Precise clustering ensures that each pair of similar elements appears together in at least one cluster so that similarities can be identified by all-to-all comparison in each cluster rather than on the full set. ClusterMerge is a new algorithm for precise clustering that uses transitive relationships among the elements to enable parallel and scalable implementations of this approach.

We apply ClusterMerge to the problem of finding similar amino acid sequences in a collection of proteins. ClusterMerge identifies 99.8% of similar pairs found by a full O(n2) comparison, with only half as many operations. More importantly, ClusterMerge is highly amenable to parallel and distributed computation. Our implementation achieves a speedup of 604x on 768 cores (1400x faster than a comparable single-threaded clustering implementation), a strong scaling efficiency of 90%, and a weak scaling efficiency of nearly 100%.

Papers

Presentations

Persona conference presentation (USENIX ATC 2017).