Applications and Extensions of pClust to Big Microbial Proteomic Data
The goal of the biological sciences is to understand the biomolecular mechanics of living organisms. Proteins serve as the foundation for the functional analysis of organisms, and sequence analysis has proven invaluable in answering questions about individual organisms. The first step in any sequence analysis is alignment, and even modestly sized studies commonly involve hundreds of thousands of protein sequences. In multigenome studies, the time required for sequence alignment becomes paramount, and heuristic algorithms are frequently used, sacrificing accuracy for speed. At the same time, new algorithms have appeared that not only perform highly efficiently but also guarantee optimal solutions. However, the adoption of these algorithms is hindered by the absence of a generalized analysis pipeline and of user-friendly computational tools. In this dissertation we present applications of existing, computationally efficient algorithms to multigenome studies, in which we apply our pClust pipeline to various sets of microbial organisms. Computational time is significantly reduced, and the results are more accurate than those obtained by traditional methods. The first study is a baseline comparison on a small set of 11 microorganisms. It compares pClust results with existing scientific knowledge and finds them consistent, while also providing new insights. The second study addresses the identification of common tick-transmissibility mechanisms across different species. It involves a larger set of 108 microbial genomes with approximately 127K protein sequences. Traditionally, a study of this scope would have required hours, if not days, of CPU time on high-performance computers to produce an all-versus-all sequence alignment. Using pClust, sequence alignment and clustering took less than 10 minutes on a desktop computer.
For this study we also developed a graphical user interface for pClust to make the new algorithms more accessible to microbiologists. The third study analyzes the set of all proteobacterial genomes. It comprised 2326 complete genomes containing 8.7M protein sequences. The alignment was performed using the pGraph-Tascel algorithm on high-performance computers. This is the first study of its kind.