We present an algorithm for general sparse matrix-matrix multiplication (SpGEMM) on many-core architectures, such as GPUs. SpGEMM is implemented by iterative row merging, similar to merge sort, except that elements with duplicate column indices are aggregated on the fly. The main kernel merges small numbers of sparse rows at once, using sub-warps of threads, to realize an early compression effect that reduces the overhead of global memory accesses. The performance is compared with a parallel CPU implementation as well as with three GPU-based implementations. Measurements of matrix squaring for 21 sparse matrices show that the proposed method consistently outperforms the other methods. Analysis shows that this performance results from the compression effect and effective use of the GPU caching architecture. Improved performance was also found for computing Galerkin products, which are required by algebraic multigrid solvers. The performance was particularly good for 7-point stencil matrices arising in the context of diffuse optical imaging, and the improved performance allows image reconstruction to be performed at higher resolution using the same computational resources.
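To illustrate the row-merging principle with on-the-fly aggregation, the following is a minimal CPU-side sketch, not the GPU kernel described above: two sparse rows with sorted column indices are merged as in merge sort, and entries with equal column indices are summed immediately, which is the compression effect the abstract refers to. The type SparseRow and the function mergeRows are hypothetical names introduced only for this example.

```cpp
// Minimal sketch of merging two sparse rows with duplicate aggregation.
// Illustration only; the actual method uses sub-warps of GPU threads.
#include <cstdio>
#include <vector>

struct SparseRow {
    std::vector<int>    cols; // column indices, strictly increasing
    std::vector<double> vals; // corresponding values
};

SparseRow mergeRows(const SparseRow& a, const SparseRow& b) {
    SparseRow out;
    size_t i = 0, j = 0;
    while (i < a.cols.size() || j < b.cols.size()) {
        if (j == b.cols.size() ||
            (i < a.cols.size() && a.cols[i] < b.cols[j])) {
            // Only row a has the next (smallest) column index.
            out.cols.push_back(a.cols[i]);
            out.vals.push_back(a.vals[i]);
            ++i;
        } else if (i == a.cols.size() || b.cols[j] < a.cols[i]) {
            // Only row b has the next column index.
            out.cols.push_back(b.cols[j]);
            out.vals.push_back(b.vals[j]);
            ++j;
        } else {
            // Duplicate column index: aggregate on the fly (compression).
            out.cols.push_back(a.cols[i]);
            out.vals.push_back(a.vals[i] + b.vals[j]);
            ++i; ++j;
        }
    }
    return out;
}

int main() {
    SparseRow a{{0, 2, 5}, {1.0, 2.0, 3.0}};
    SparseRow b{{2, 3},    {4.0, 5.0}};
    SparseRow c = mergeRows(a, b);
    for (size_t k = 0; k < c.cols.size(); ++k)
        std::printf("(%d, %.1f) ", c.cols[k], c.vals[k]);
    std::printf("\n"); // expected: (0, 1.0) (2, 6.0) (3, 5.0) (5, 3.0)
    return 0;
}
```

Repeatedly merging pairs (or small groups) of partial rows in this fashion shrinks the intermediate results early, which is why the technique reduces global memory traffic on the GPU.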