List of works
Book chapter
Random number generators for parallel applications
Published 2009
Advances in Chemical Physics, Volume 105: Monte Carlo Methods in Chemical Physics, 1 - 30
Book chapter
A Synchronous Mode MPI Implementation on the Cell BETM Architecture
Published 2007
Parallel and Distributed Processing and Applications: 5th International Symposium, ISPA 2007, Niagara Falls, Canada, August 29-31, 2007, Proceedings, 982 - 991
The Cell Broadband Engine shows much promise in high performance computing applications. The Cell is a heterogeneous multi-core processor, with the bulk of the computational work load meant to be borne by eight co-processors called SPEs. Each SPE operates on a distinct 256 KB local store, and all the SPEs also have access to a shared 512 MB to 2 GB main memory through DMA. The unconventional architecture of the SPEs, and in particular their small local store, creates some programming challenges. We have provided an implementation of core features of MPI for the Cell to help deal with this. This implementation views each SPE as a node for an MPI process, with the local store used as if it were a cache. In this paper, we describe synchronous mode communication in our implementation, using the rendezvous protocol, which makes MPI communication for long messages efficient. We further present experimental results on the Cell hardware, where it demonstrates good performance, such as throughput up to 6.01 GB/s and latency as low as 0.65 μs on the pingpong test. This demonstrates that it is possible to efficiently implement MPI calls even on the simple SPE cores.
Book chapter
Optimization of Collective Communication in Intra-cell MPI
Published 2007
High Performance Computing – HiPC 2007: 14th International Conference, Goa, India, December 18-21, 2007, Proceedings, 488 - 499
The Cell is a heterogeneous multi-core processor, which has eight co-processors, called SPEs. The SPEs can access a common shared main memory through DMA, and each SPE can directly operate on a small distinct local store. An MPI implementation can use each SPE as if it were a node for an MPI process. In this paper, we discuss the efficient implementation of collective communication operations for intra-Cell MPI, both for cores on a single chip, and for a Cell blade. While we have implemented all the collective operations, we describe in detail the following: barrier, broadcast, and reduce. The main contributions of this work are (i) describing our implementation, which achieves low latencies and high bandwidths using the unique features of the Cell, and (ii) comparing different algorithms, and evaluating the influence of the architectural features of the Cell processor on their effectiveness.
Book chapter
A buffered-mode MPI implementation for the cell BE (TM) processor
Published 01/01/2007
Computational Science – ICCS 2007: 7th International Conference, Beijing China, May 27-30, 2007, Proceedings, Part I, 4487, 603 - 610
The Cell Broadband Engine (TM) is a heterogeneous multi-core architecture developed by IBM, Sony and Toshiba. It has eight computation intensive cores (SPEs) with a small local memory, and a single PowerPC core. The SPEs have a total peak single precision performance of 204.8 Gflops/s, and 14.64 Gflops/s in double precision. Therefore, the Cell has a good potential for high performance computing. But the unconventional architecture makes it difficult to program. We propose an implementation of the core features of MPI as a solution to this problem. This can enable a large class of existing applications to be ported to the Cell. Our MPI implementation attains bandwidth up to 6.01 GB/s, and latency as small as 0.41 mu s. The significance of our work is in demonstrating the effectiveness of intra-Cell MPI, consequently enabling the porting of MPI applications to the Cell with minimal effort.
Book chapter
Domain decomposition for particle methods on the sphere
Published 06/16/2005
Parallel Algorithms for Irregularly Structured Problems: Third International Workshop, IRREGULAR '96, Santa Barbara, CA, USA, August 19 - 21, 1996. Proceedings, 119 - 130
We present an algorithm for efficient parallelization of particle methods when the domain is the surface of a sphere. Such applications typically arise when dealing with directional data. We propose a domain decomposition scheme based on geometric partitioning that provides domains suitable for practical implementation. This algorithm has the advantage of being fast enough to be applied dynamically, and at the same time provides good partitions, comparable in quality to those produced by spectral graph partitioning schemes.
Book chapter
Parallel Simulation of Carbon Nanotube Based Composites
Published 2004
High Performance Computing - HiPC 2004: 11th International Conference, Bangalore, India, December 19-22, 2004, Proceedings, 211
Computational simulation plays a vital role in nanotechnology. Molecular dynamics (MD) is an important computational method to understand the fundamental behavior of nanoscale systems, and to transform that understanding into useful products. MD computations, however, are severely restricted by the spatial and temporal scales of simulations. This paper describes the methods used to achieve effective spatial parallelization of a MD code that is based on a multi-body bond order potential. The material system studied here is a carbon nanotube (CNT). We discuss the scientific and computational issues in the development and implementation of parallel algorithms, when the domain needs to be discretized with fine granularity. Specific issues in terms of neighbor-list computation, communication reduction, and cache awareness are delineated, with corresponding benefit in terms of speed up.Important practical problems relevant to CNT based composites are studied, and the effectiveness of various strategies reported. Our implementation achieves efficient parallelization at a finer granularity compared with published works on CNTs with complex configurations.
Book chapter
Improved Monte Carlo Linear Solvers Through Non-diagonal Splitting
Published 2003
Computational Science and Its Applications — ICCSA 2003: International Conference, Montreal, Canada, May 18-21, 2003, Proceedings, Part III, 168 - 177
Fast, but approximate, solutions to linear algebra problems have many potential applications, such as in graph partitioning, preconditioning, information retrieval, etc. Monte Carlo techniques appear attractive for such needs. While Monte Carlo linear solvers have a long history, their application has been limited due to slow convergence. Despite the development of techniques to improve their accuracy, current methods suffer from the drawback that they are stochastic realizations of inherently poor iterative methods. The reason for such choices is the need for efficient Monte Carlo implementation, which has restricted the splittings that are considered. However, in this paper we demonstrate that such restrictions are not necessarily required, and that efficient Monte Carlo implementations are possible even with splittings that do not appear amenable to it.
Book chapter
Monte Carlo techniques for estimating the Fiedler vector in graph applications
Published 2002
Computational Science — ICCS 2002: International Conference Amsterdam, The Netherlands, April 21–24, 2002 Proceedings, Part II, 2330, 2, 635 - 645
Determining the Fiedler vector of the Laplacian or adjacency matrices of graphs is the most computationally intensive component of several applications, such as graph partitioning, graph coloring, envelope reduction, and seriation. Often an approximation of the Fiedler vector is sufficient. We discuss issues involved in the use of Monte Carlo techniques for this purpose.
Book chapter
A compact storage scheme for fast wavelet-based subregion retrieval
Published 1997
Computing and Combinatorics: Third Annual International Conference, COCOON '97, Shanghai, China, August 20-22, 1997. Proceedings, 1276, 353 - 362
Home Computing and Combinatorics Conference paper
A compact storage scheme for fast wavelet-based subregion retrieval
Session 11: Algorithms and Applications
A. S. Poulakidas, A. Srinivasan, O. Egecioglu, O. Ibarra & T. Yang
Conference paper
First Online: 01 January 2006
94 Accesses
1 Citations
Part of the Lecture Notes in Computer Science book series (LNCS,volume 1276)
Abstract
In an image browsing environment there is need for progressively viewing image subregions at various resolutions. We describe a storage scheme that accomplishes good image compression, while supporting fast image subregion retrieval. We evaluate analytically and experimentally the compression performance of our algorithm. We also provide results on the speed of the algorithm to demonstrate its effectiveness, and present an extension to a client/server environment.