All Stories

  1. CommBench: Micro-Benchmarking Hierarchical Networks with Multi-GPU, Multi-NIC Nodes
  2. Fine-grained Policy-driven I/O Sharing for Burst Buffers
  3. Performance Analysis and Optimal Node-aware Communication for Enlarged Conjugate Gradient Methods
  4. EMPRESS: Accelerating Scientific Discovery through Descriptive Metadata Management
  5. Realizing the Vision of CFD in 2030
  6. Succeeding Together
  7. Performance Portability for Advanced Architectures
  8. Translational research in the MPICH project
  9. Convergence of artificial intelligence and high performance computing on NSF-supported cyberinfrastructure
  10. HAL: Computer System for Scalable Deep Learning
  11. Convergence of Artificial Intelligence and High Performance Computing on NSF-supported Cyberinfrastructure
  12. Node-Aware Improvements to Allreduce
  13. Enabling real-time multi-messenger astrophysics discoveries with deep learning
  14. Node aware sparse matrix–vector multiplication
  15. Guest editor's introduction: Special issue on best papers from EuroMPI/USA 2017
  16. Using Node Information to Implement MPI Cartesian Topologies
  17. Big data and extreme-scale computing
  18. The Blue Waters Super-System for Super-Science
  19. Final report for “Extreme-scale Algorithms and Solver Resilience”
  20. Key Value Stores in HPC
  21. Scalable Non-blocking Preconditioned Conjugate Gradient Methods
  22. Towards millions of communicating threads
  23. Modeling MPI Communication Performance on SMP Nodes
  24. Rethinking High Performance Computing System Architecture for Scientific Big Data Applications
  25. An implementation and evaluation of the MPI 3.0 one-sided communication interface
  26. Scalability Challenges in Current MPI One-Sided Implementations
  27. Final report: Compiled MPI. Cost-Effective Exascale Application Development
  28. Efficient disk-to-disk sorting
  29. Message Passing Interface
  30. Remote Memory Access Programming in MPI-3
  31. A study of file system read and write behavior on supercomputers
  32. Runtime Support for Irregular Computation in MPI-Based Applications
  33. Algebraic Multigrid on a Dragonfly Network: First Experiences on a Cray XC30
  34. Rethinking Key-Value Store for Parallel I/O Optimization
  35. Nonblocking Epochs in MPI One-Sided Communication
  36. Decoupled I/O for Data-Intensive High Performance Computing
  37. Enabling the environmentally clean air transportation of the future: a vision of computational fluid dynamics in 2030
  38. Special Issue: SC13 – The International Conference for High Performance Computing, Networking, Storage and Analysis
  39. MPI-Interoperable Generalized Active Messages
  40. Optimization Strategies for MPI-Interoperable Active Messages
  41. Programming for Exascale Computers
  42. Analysis of topology-dependent MPI performance on Gemini networks
  43. Runtime system design of decoupled execution paradigm for data-intensive high-end computing
  44. MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory
  45. Toward Asynchronous and MPI-Interoperable Active Messages
  46. Performance Analysis of the Lattice Boltzmann Model Beyond Navier-Stokes
  47. Systematic Reduction of Data Movement in Algebraic Multigrid Solvers
  48. Multiphysics simulations
  49. Applications of the streamed storage format for sparse matrix operations
  50. Parallel Adaptive Deflated GMRES
  51. A Case for Optimistic Coordination in HPC Storage Systems
  52. Performance Modeling of Algebraic Multigrid on Blue Gene/Q: Lessons Learned
  53. A Decoupled Execution Paradigm for Data-Intensive High-End Computing
  54. Modeling the Performance of an Algebraic Multigrid Cycle Using Hybrid MPI/OpenMP
  55. Best algorithms + best computers = powerful match
  56. Hybrid Static/dynamic Scheduling for Already Optimized Dense Matrix Factorization
  57. Faster topology-aware collective algorithms through non-minimal communication
  58. Faster topology-aware collective algorithms through non-minimal communication
  59. Adaptive Strategy for One-Sided Communication in MPICH2
  60. Efficient Multithreaded Context ID Allocation in MPI
  61. Leveraging MPI’s One-Sided Communication Interface for Shared-Memory Programming
  62. MPI 3 and Beyond: Why MPI Is Successful and What Challenges It Faces
  63. Formal analysis of MPI-based parallel programs
  64. Weighted locality-sensitive scheduling for mitigating noise on multi-core clusters
  65. Avoiding hot-spots on two-level direct networks
  66. Performance modeling for systematic performance tuning
  67. Modeling the performance of an algebraic multigrid cycle on HPC platforms
  68. LACIO: A New Collective I/O Strategy for Parallel I/O Systems
  69. Architectural Constraints to Attain 1 Exaflop/s for Three Scientific Application Classes
  70. MPI on Millions of Cores
  71. EcoG: A Power-Efficient GPU Cluster Architecture for Scientific Computing
  72. The International Exascale Software Project roadmap
  73. Multi-core and Network Aware MPI Topology Functions
  74. Performance Expectations and Guidelines for MPI Derived Datatypes
  75. Scalable Memory Use in MPI: A Case Study with MPICH2
  76. Minimizing MPI Resource Contention in Multithreaded Multicore Environments
  77. P2S2 2010: Third International Workshop on Parallel Programming Models and Systems Software for High-End Computing
  78. Optimizing Sparse Data Structures for Matrix-vector Multiply
  79. Self-Consistent MPI Performance Guidelines
  80. Erratum
  81. Fine-Grained Multithreading Support for Hybrid Threaded MPI Programming
  82. The Importance of Non-Data-Communication Overheads in MPI
  83. A Pipelined Algorithm for Large, Irregular All-Gather Problems
  84. The Importance of Non-Data-Communication Overheads in MPI
  85. An adaptive performance modeling tool for GPU architectures
  86. An adaptive performance modeling tool for GPU architectures
  87. A Scalable MPI_Comm_split Algorithm for Exascale Computing
  88. An introductory exascale feasibility study for FFTs and multigrid
  89. Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems
  90. Load Balancing for Regular Meshes on SMPs with MPI
  91. PMI: A Scalable Parallel Process-Management Interface for Extreme-Scale Systems
  92. Toward Performance Models of MPI Implementations for Understanding Application Scaling Issues
  93. Enabling the Next Generation of Scalable Clusters
  94. Formal methods applied to high-performance computing software design: a case study of MPI one-sided communication-based locking
  95. Test suite for evaluating performance of multithreaded MPI communication
  96. On the Need for a Consortium of Capability Centers
  97. Toward Exascale Resilience
  98. Investigating High Performance RMA Interfaces for the MPI-3 Standard
  99. Software for Petascale Computing Systems
  100. Toward message passing for a million processes: characterizing MPI on a massive scale Blue Gene/P
  101. MPI on a Million Processors
  102. Processing MPI Datatypes Outside MPI
  103. Natively Supporting True One-Sided Communication in MPI on Multi-core Systems with InfiniBand
  104. Hiding I/O latency with pre-execution prefetching for parallel applications
  105. Parallel I/O prefetching using MPI file caching and I/O signatures
  106. Exploring Parallel I/O Concurrency with Speculative Prefetching
  107. Applied Mathematics at the U.S. Department of Energy: Past, Present and a View to the Future
  108. An Efficient Format for Nearly Constant-Time Access to Arbitrary Time Intervals in Large Trace Files
  109. Non-data-communication Overheads in MPI: Analysis on Blue Gene/P
  110. A Simple, Pipelined Algorithm for Large, Irregular All-gather Problems
  111. A Formal Approach to Detect Functionally Irrelevant Barriers in MPI Programs
  112. Communication Analysis of Parallel 3D FFT for Flat Cartesian Meshes on Large Blue Gene Systems
  113. Implementing Efficient Dynamic Formal Verification Methods for MPI Programs
  114. Improving the Performance of Tensor Matrix Vector Multiplication in Cumulative Reaction Probability Based Quantum Chemistry Codes
  115. Self-consistent MPI-IO Performance Requirements and Expectations
  116. Toward Efficient Support for Multithreaded MPI Communication
  117. Analyzing the impact of supporting out-of-order communication on in-order performance with iWARP
  118. Advanced Flow-control Mechanisms for the Sockets Direct Protocol over InfiniBand
  119. Implementation and evaluation of shared-memory communication and synchronization operations in MPICH2 using the Nemesis communication subsystem
  120. Thread-safety in an MPI implementation: Requirements and analysis
  121. A Portable Method for Finding User Errors in the Usage of MPI Collective Operations
  122. Electron injection by a nanowire in the bubble regime
  123. MPI - Eine Einführung
  124. Nonuniformly Communicating Noncontiguous Data: A Case Study with PETSc and MPI
  125. Self-consistent MPI Performance Requirements
  126. Collective communication on architectures that support simultaneous communication over multiple links
  127. An Interface to Support the Identification of Dynamic MPI 2 Processes for Scalable Parallel Debugging
  128. Automatic Memory Optimizations for Improving MPI Derived Datatype Performance
  129. S01 - Advanced MPI
  130. M01 - Application supercomputing and multiscale simulation techniques
  131. Awards & video - Awards session
  132. Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem
  133. Formal Verification of Programs That Use MPI One-Sided Communication
  134. Issues in Developing a Thread-Safe MPI Implementation
  135. Message from the Program Chair
  136. Multi-core issues - Multi-Core for HPC
  137. Optimizing the Synchronization Operations in Message Passing Interface One-Sided Communication
  138. Optimization of Collective Communication Operations in MPICH
  139. Using MPI-2: A Problem-Based Approach
  140. Collective Error Detection for MPI Collective Operations
  141. An Evaluation of Implementation Options for MPI One-Sided Communication
  142. Designing a Common Communication Subsystem
  143. Implementing MPI-IO atomic mode without file system support
  144. Towards a Productive MPI Environment
  145. Fault Tolerance in Message Passing Interface Programs
  146. Efficient Implementation of MPI-2 Passive One-Sided Communication on InfiniBand Clusters
  147. Minimizing Synchronization Overhead in the Implementation of MPI One-Sided Communication
  148. High performance MPI-2 one-sided communication over InfiniBand
  149. Parallel netCDF
  150. Integrated Network Management VIII
  151. Improving the performance of MPI derived datatypes by optimizing memory-access cost
  152. Efficient structured data access in parallel file systems
  153. Improving the Performance of Collective Operations in MPICH
  154. Noncontiguous I/O accesses through MPI-IO
  155. Optimizing noncontiguous accesses in MPI-IO
  156. MPI on the Grid
  157. Parallel Programming with MPI
  158. Parallel Programming with MPI
  159. Advanced Topics in MPI Programming
  160. Advanced Topics in MPI Programming
  161. Components and interfaces of a process management system for parallel programs
  162. High-performance parallel implicit CFD
  163. Scalable Unix Commands for Parallel Processors: A High-Performance Implementation
  164. Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel Implicit CFD
  165. A Scalable Process-Management Environment for Parallel Programs
  166. Analyzing the Parallel Scalability of an Implicit Unstructured Mesh CFD Code
  167. From Trace Generation to Visualization: A Performance Framework for Distributed Parallel Systems
  168. Performance Modeling and Tuning of an Unstructured Mesh CFD Application
  169. Towards Realistic Performance Bounds for Implicit CFD Codes
  170. Toward Scalable Performance Visualization with Jumpshot
  171. Parallel computation of three-dimensional nonlinear magnetostatic problems
  172. On implementing MPI-IO portably and with high performance
  173. Using MPI-2
  174. Using MPI
  175. Achieving high sustained performance in an unstructured mesh CFD application
  176. Data sieving and collective I/O in ROMIO
  177. I/O in Parallel Applications: the Weakest Link
  178. Parallel Newton-Krylov-Schwarz Algorithms for the Transonic Full Potential Equation
  179. A Case for Using MPI's Derived Datatypes to Improve I/O Performance
  180. MPI - The Complete Reference
  181. Parallel Implicit PDE Computations
  182. Sowing MPICH: A Case Study in the Dissemination of a Portable Environment for Parallel Scientific Computing
  183. A high-performance MPI implementation on a shared-memory vector supercomputer
  184. Why are PVM and MPI so different?
  185. A high-performance, portable implementation of the MPI message passing interface standard
  186. I/O characterization of a portable astrophysics application on the IBM SP and Intel Paragon
  187. Numerical Simulation of Vortex Dynamics in Type-II Superconductors
  188. An experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon using a production application
  189. Early Applications in the Message-Passing Interface (MPI)
  190. Experiences with the IBM SP1
  191. Solution of dense systems of linear equations arising from integral-equation formulations
  192. Users guide for the ANL IBM SPx
  193. A comparison of some domain decomposition and ILU preconditioned iterative methods for nonsymmetric elliptic problems
  194. Newton-Krylov-Schwarz Methods in CFD
  195. Parallel implicit methods for aerodynamics
  196. Early experiences with the IBM SP-1
  197. Users manual for the Chameleon parallel programming tools
  198. A test implementation of the MPI draft message-passing standard
  199. Convergence rate estimate for a domain decomposition method
  200. Parallel Performance of Domain-Decomposed Preconditioned Krylov Methods for PDEs with Locally Uniform Refinement
  201. Domain decomposition techniques for the parallel solution of nonsymmetric systems of elliptic boundary value problems
  202. Krylov methods preconditioned with incompletely factored matrices on the CM-2
  203. A parallel version of the fast multipole method
  204. Computational fluid dynamics on parallel processors
  205. Domain decomposition on parallel computers
  206. Recursive mesh refinement on hypercubes
  207. A Parallel Version of the Fast Multipole Method
  208. Recursive Mesh Refinement on Hypercubes
  209. Complexity of Parallel Implementation of Domain Decomposition Techniques for Elliptic Partial Differential Equations
  210. Local uniform mesh refinement on loosely-coupled parallel processors
  211. Solving PDEs on loosely-coupled parallel processors
  212. A Comparison of Domain Decomposition Techniques for Elliptic Partial Differential Equations and their Parallel Implementation
  213. Local Uniform Mesh Refinement on Vector and Parallel Processors
  214. Predicting memory-access cost based on data-access patterns
  215. Grid-based Image Registration
  216. Observations on WoCo9
  217. Data Transfers between Processes in an SMP System: Performance Study and Application to MPI
  218. High Performance File I/O for The Blue Gene/L Supercomputer
  219. A taxonomy of programming models for symmetric multiprocessors and SMP clusters
  220. An abstract-device interface for implementing portable parallel-I/O interfaces
  221. Developing Applications for a Heterogeneous Computing Environment
  222. Dynamic process management in an MPI setting
  223. Goals guiding design: PVM and MPI
  224. Open Issues in MPI Implementation
  225. Practical Model-Checking Method for Verifying Correctness of MPI Programs
  226. Revealing the Performance of MPI RMA Implementations
  227. Scalable Unix tools on parallel processors
  228. The MPI communication library: its design and a portable implementation