Abstract. HPC has been extremely successful by focusing on how best to support software developers who have the technical skills required to design highly parallel algorithms, to optimize data partitioning and load balancing, to minimize communication and synchronization, and to do everything else necessary to achieve the highest possible performance on large-scale dedicated clusters and supercomputers. In HPC, when a choice must be made between high performance and fault tolerance, the tendency has almost always been to favor the former.
Big Data, AI and Cloud Computing have also been extremely successful in recent years, by focusing on exactly the opposite community: software developers who lack the technical skills, or the motivation, required for HPC. In Big Data, AI and Cloud Computing, most of the key design decisions regarding parallelization, data partitioning, load balancing, communication, synchronization, redundancy and fault tolerance are automated, enabling developers to produce, with minimal effort, applications that typically run on low-cost pools of commodity cloud resources, often virtualized or containerized.
In this talk, I will describe some of the challenges in bringing high performance to Big Data, AI and Cloud Computing, and outline some new approaches to this problem. In particular, I will present a new bridging model for general-purpose parallel computing with fault tolerance and tail tolerance.