An efficient MapReduce algorighm for parallelizing large-scale graph clustering
MetadataShow full item record
Identifying close-knit communities (or “clusters”) in graphs is an advanced operation with a broad range of scientific applications. While theoretical formulations of this operation are either intractable or computationally prohibitive, practical algorithmic heuristics exist to efficiently tackle the problem. However, implementing these heuristics to work for large real world graphs still remains a significant challenge, owing to a combination of factors that include magnitude of the data, irregular data access patterns and computer-intensive operations to better the approximation. In this paper, we propose i) a novel MapReduce-based  algorithm for a well known serial graph clustering heuristic called Shingling ; and ii) a novel application of the method to cluster biological graphs built out of proteins and domains. Operating on an input graph that is simply represented as a list of edges, our algorithm uses a combination of shuffling and sorting operations, and pipelined MapReduce stages to implement the various phases of the algorithm. Preliminary results show linear scaling of the time-dominant phase up to 64 cores on a relatively small real world graph containing 8.41M vertices (8,407,839 proteins and 11,823 domains) and 11M edges (protein to domain connections). More importantly, MapReduce parallelization has allowed us to enhance the problem size reach by about two to three orders of magnitude (from 20K to 8M vertices) relative to our previous serial implementation, in roughly the same amount of time.