ISBN-13: 9783659684883 / English / Paperback / 2015 / 132 pp.
Recent progress in bioinformatics and especially high-throughput sequencing has enabled us to sequence and analyze genomes of many individuals, which can lead to improved diagnostics and treatment for patients suffering from genetic diseases. To achieve this, tools used in both clinical and research environments need to be enhanced to handle large amounts of data. This book analyzes application-level parallelization of database query processing by means of sharding as a technique for improving performance and scalability of an open-source search engine for genomic variants. We describe the challenges of designing and implementing a data access layer, the core of which is a general sharding framework. The approach allows for utilization of multiple processors as well as machines when querying the underlying data. This enables the system to scale in a near-linear fashion as more servers are added, with many queries achieving even superlinear speedup. This book should be useful to software engineers and scientists interested in an intriguing problem in the area of parallelization as well as anyone curious about what happens under the hood of modern genome analysis systems.