Khaled HAMIDOUCHE
AMD Research

Address: AMD Research, 7171 Southwest Pkwy, Austin, TX 78735, USA
Office: B100
Phone: 512-602-3569
Email: Khaled.Hamidouche(at)amd.com / khaledhamidouche (at) gmail.com
I am a Senior Member of Technical Staff (SMTS) at Advanced Micro Devices (AMD) Research, where I lead work on network-based programming models for parallel and heterogeneous architectures.
From 2012 to 2017, I was a Research Scientist in the Department of Computer Science and Engineering at The Ohio State University, as a member of the Network-Based Computing Laboratory working with Prof. Dhabaleswar K. (DK) Panda. I led the design and development of the MVAPICH project for GPUs and accelerators, as well as its support for hybrid MPI+PGAS programming models.
Before that, I was a post-doctoral researcher with the HP2 (High Performance and Parallel) team at Telecom SudParis, Evry, working on compilation and code generation for parallel architectures: I optimized the STEP source-to-source transformation tool and ported it to manycore architectures.
I defended my PhD thesis at the LRI (Laboratoire de Recherche en Informatique, Parall team) in November 2011, on parallel computing and parallel architectures. My thesis work (dissertation available here), supervised by Prof. Daniel Etiemble and Dr. Joel Falcou, focused on a programming model and hybrid code generation and deployment for hierarchical and heterogeneous parallel architectures, together with the development of supporting tools.
A full version of my resume is available in PDF form.
- High Performance Computing
- Parallel Programming Models
- Scale-out Machine Learning Frameworks and Applications
- Compilation and code generation
- Application co-design
- 60) M. LeBeane, K. Hamidouche, B. Benton, M. Breternitz, S. K. Reinhardt and L. K. John, ComP-Net: Command Processor Networking for Efficient Intra-kernel Communications on GPUs, The 27th International Conference on Parallel Architectures and Compilation Techniques (PACT'18), November 2018
- 59) M. LeBeane, K. Hamidouche, B. Benton, M. Breternitz, S. K. Reinhardt and L. K. John, GPU Triggered Networking for Intra-Kernel Communications, ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17), November 2017
- 58) J. Hashmi, K. Hamidouche, H. Subramoni, and D. K. Panda, Kernel-assisted Communication Engine for MPI on Emerging Manycore Processors, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), December 2017
- 57) A. Venkatesh, C. Chu, K. Hamidouche, S. Potluri, D. Rossetti, and D. K. Panda, MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling, International Conference on Parallel Processing (ICPP 2017), August 2017
- 56) A. Awan, K. Hamidouche, J. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2017), February 2017
- 55) J. M. Hashmi, K. Hamidouche, and D. K. Panda, Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models, IEEE International Conference on High Performance Computing and Communications (HPCC 2016), December 2016
- 54) D. S. Banerjee, K. Hamidouche, and D. K. Panda, Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters, IEEE International Conference on Cloud Computing Technology and Science (CloudComp'16), December 2016
- 53) K. Hamidouche, A. Awan, A. Venkatesh, and D. K. Panda, CUDA M3: Designing Efficient CUDA Managed Memory-aware MPI by Exploiting GDR and IPC, IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'16), December 2016
- 52) M. Li, X. Lu, K. Hamidouche, J. Zhang, and D. K. Panda, Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA, IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'16), December 2016
- 51) M. Li, K. Hamidouche, X. Lu, H. Subramoni, J. Zhang, and D. K. Panda, Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits, ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), November 2016
- 50) K. Hamidouche, J. Zhang, K. Tomko, and D. K. Panda, OpenSHMEM NonBlocking Data Movement Operations with MVAPICH2-X: Early Experiences, PGAS Applications Workshop (PAW'16), affiliated with SuperComputing (SC'16), November 2016
- 49) C. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications, Communication Optimizations in HPC Workshop (COMHPC'16), affiliated with SuperComputing (SC'16), November 2016
- 48) C. Chu, K. Hamidouche, A. Venkatesh, H. Subramoni, B. Elton, and D. K. Panda, Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'16), October 2016
- 47) A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, EUROMPI'16, September 2016, Best Paper Runner-up Award
- 46) K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, and D. K. Panda, CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters, accepted to appear in PARCO: Elsevier Parallel Computing Journal
- 45) C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, The 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS'16), May 2016
- 44) C. Chu, K. Hamidouche, A. Venkatesh, A. Awan, and D. K. Panda, CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'16), May 2016
- 43) H. Subramoni, A. M. Augustine, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and D. K. Panda, INAM2: InfiniBand Network Analysis & Monitoring with MPI, International Supercomputing Conference (ISC'16), June 2016
- 42) D. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, General Purpose GPU (GPGPU-9), affiliated with PPoPP'16, March 2016
- 41) A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, 22nd IEEE International Conference on High Performance Computing (HiPC'15), December 2015
- 40) M. Li, K. Hamidouche, X. Lu, J. Zhang, J. Lin, and D. K. Panda, High Performance OpenSHMEM Strided Communication Support with InfiniBand UMR, 22nd IEEE International Conference on High Performance Computing (HiPC'15), December 2015
- 39) A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, A Case for Application-Oblivious Energy-Efficient MPI Runtime, ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'15), November 2015, Best Student Paper Finalist
- 38) K. Hamidouche, A. Venkatesh, A. A. Awan, H. Subramoni, C. Chu, and D. K. Panda, Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters, IEEE Cluster 2015, September 2015, Chicago, USA
- 37) M. Li, H. Subramoni, K. Hamidouche, X. Lu, and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, IEEE Cluster 2015, September 2015, Chicago, USA
- 36) A. Awan, K. Hamidouche, A. Venkatesh, J. Perkins, H. Subramoni, and D. K. Panda, GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks, EUROMPI Conference 2015, September 2015, France
- 35) M. Li, K. Hamidouche, X. Lu, J. Lin, and D. K. Panda, High-Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters, International EURO-PAR Conference (Euro-Par 2015), August 2015, Austria
- 34) H. Subramoni, A. Venkatesh, K. Hamidouche, K. Tomko, and D. K. Panda, Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-all Collective Algorithms, IEEE Hot Interconnects (HotI'15), August 2015
- 33) A. Awan, K. Hamidouche, C. Chu, and D. K. Panda, A Case for Non-Blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X, OpenSHMEM Workshop 2015, July 2015
- 32) J. Lin, K. Hamidouche, J. Zhang, X. Lu, A. Vishnu, and D. K. Panda, Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM, OpenSHMEM Workshop 2015, July 2015
- 31) A. Gómez-Iglesias, J. Vienne, K. Hamidouche, W. Barth, and D. K. Panda, Scalable Out-of-core OpenSHMEM Library for HPC, OpenSHMEM Workshop 2015, July 2015
- 30) H. Subramoni, A. A. Awan, K. Hamidouche, D. Pekurovsky, A. Venkatesh, S. Chakraborty, K. Tomko, and D. K. Panda, Designing Non-Blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters, International Supercomputing Conference (ISC'15), July 2015, Germany
- 29) A. Gómez-Iglesias, D. Pekurovsky, K. Hamidouche, J. Zhang, and J. Vienne, Porting Scientific Libraries to PGAS in XSEDE Resources: Practice and Experience, XSEDE'2015 Conference, July 2015, St. Louis, USA
- 28) J. Lin, K. Hamidouche, X. Lu, M. Li, and D. K. Panda, Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation, International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS'15), affiliated with IPDPS 2015, May 2015, India
- 27) R. Rajachandrasekar, A. Venkatesh, K. Hamidouche, and D. K. Panda, Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'2015), May 2015, Shenzhen, China
- 26) J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar, and D. K. Panda, Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, International Conference on Partitioned Global Address Space Programming Models (PGAS'2014), October 2014, Oregon, USA
- 25) R. Shi, S. Potluri, K. Hamidouche, M. Li, J. Perkins, D. Rossetti, and D. K. Panda, Designing Efficient Small Message Transfer Mechanism for Inter-node MPI Communication on InfiniBand GPU Clusters, IEEE International Conference on High Performance Computing (HiPC'2014), December 2014, Goa, India
- 24) A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters, IEEE International Conference on High Performance Computing (HiPC'2014), December 2014, Goa, India
- 23) J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko, and D. K. Panda, High Performance OpenSHMEM for MIC Clusters: Extensions, Runtime Designs and Application Co-design, IEEE CLUSTER'14 (Best Paper Nominee), September 2014, Madrid, Spain
- 22) M. Li, X. Lu, S. Potluri, K. Hamidouche, J. Jose, K. Tomko, and D. K. Panda, Scalable Graph500 Design with MPI-3 RMA, IEEE CLUSTER'14, September 2014, Madrid, Spain
- 21) R. Rajachandrasekar, J. Perkins, K. Hamidouche, M. Arnold, and D. K. Panda, Understanding the Memory-Utilization of MPI Libraries: Challenges and Designs in Implementing the MPI_T Interface, EUROMPI'14, September 2014, Japan
- 20) R. Shi, X. Lu, S. Potluri, K. Hamidouche, J. Zhang, and D. K. Panda, HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement using MPI Datatypes on GPU Clusters, International Conference on Parallel Processing (ICPP'14), September 2014, Minneapolis, USA
- 19) R. Rajachandrasekar, S. Potluri, A. Venkatesh, K. Hamidouche, Md. Wasi-ur-Rahman and D. K. Panda, MIC-Check: A Distributed Checkpointing Framework for the Intel Many Integrated Cores Architecture. The International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC '14). June 2014, Vancouver, Canada
- 18) H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences. IEEE International Supercomputing Conference (ISC '14). June 2014, Leipzig, Germany
- 17) J. Jose, K. Hamidouche, J. Zhang, A. Venkatesh and D. K. Panda, Optimizing Collective Communication in UPC. International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS '14). May 2014, Phoenix, USA
- 16) A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters. IEEE International Parallel & Distributed Processing Symposium (IPDPS '14). May 2014, Phoenix, USA
- 15) M. Luo, X. Lu, K. Hamidouche, K. Kandalla and D. K. Panda, Initial Study of Multi-Endpoint Runtime for MPI+OpenMP Hybrid Applications on Multi-Core Systems. International Symposium on Principles and Practice of Parallel Programming (PPoPP '14). February 2014, Orlando, USA
- 14) R. Shi, S. Potluri, K. Hamidouche, X. Lu, K. Tomko and D. K. Panda, A Scalable and Portable Approach to Accelerate Hybrid HPL on Heterogeneous CPU-GPU Clusters. IEEE Cluster (Cluster'13), Best Student Paper Award. September 2013, Indianapolis, USA
- 13) S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni and D. K. Panda, MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters. IEEE/ACM International Conference on Supercomputing (SC13). November 2013, Denver, CO, USA
- 12) S. Potluri, K. Hamidouche, D. Bureddy and D. K. Panda, MVAPICH2-MIC: A High-Performance MPI Library for Xeon Phi Clusters with InfiniBand. Extreme Scaling Workshop. August 2013, Boulder, CO, USA
- 11) K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda, Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters. IEEE International Symposium on High-Performance Interconnects (HotI 2013). August 2013, San Jose, CA, USA
- 10) S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs. IEEE International Conference on Parallel Processing (ICPP 2013). October 2013, Lyon, France
- 9) M. Li, S. Potluri, K. Hamidouche, J. Jose, D. K. Panda, Efficient and Truly Passive MPI-3 RMA Using InfiniBand Atomics. EuroMPI 13. September 2013, Madrid, Spain
- 8) K. Hamidouche, S. Potluri, H. Subramoni, K. Kandalla and D. K. Panda, MIC-RO: Enabling Efficient Remote Offload on Heterogeneous Many Integrated Core (MIC) Clusters with InfiniBand. ACM International Conference on Supercomputing (ICS 2013). June 2013, Oregon, USA
- 7) K. Hamidouche, F. M. Mendonca, J. Falcou, A. C. M. A. Melo, D. Etiemble, Parallel Smith-Waterman Comparison on Multicore and Manycore Computing Platforms with BSP++. International Journal of Parallel Programming (IJPP). August 2012
- 6) K. Hamidouche, F. M. Mendonca, J. Falcou, D. Etiemble, Parallel Biological Sequence Comparison on Heterogeneous High Performance Computing Platforms with BSP++. 23rd IEEE International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2011), Vitoria, Espirito Santo, Brazil, October 26-29, 2011
- 5) K. Hamidouche, J. Falcou, D. Etiemble, A Framework for an Automatic Hybrid MPI+OpenMP Code Generation, ACM High Performance Computing Symposium (HPC-11), Boston, USA, April 3-7, 2011
- 4) K. Hamidouche, J. Falcou, D. Etiemble, Hybrid Bulk Synchronous Parallelism Library for Clustered SMP Architectures, ACM International Workshop on High Level Parallel Programming and Applications (HLPP 2010), affiliated with ICFP 2010, Baltimore, USA, September 25, 2010
- 3) K. Hamidouche, A. Borghi, P. Esterie, J. Falcou, S. Peyronnet, Three High Performance Architectures in the Parallel APMC Boat, IEEE International Workshop on Parallel and Distributed Methods in Verification (PDMC 2010), Enschede, Netherlands, September 30, 2010
- 2) C. Tadonki, L. Lacassagne, T. Saidani, J. Falcou, K. Hamidouche, The Harris Algorithm Revisited on the CELL Processor, International Workshop on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2010), June 1, 2010, Tsukuba, Japan
- 1) K. Hamidouche, F. Cappello, D. Etiemble, Comparison of MPI, OpenMP, and MPI+OpenMP on a Shared-Memory AMD Multicore Multiprocessor Node (in French), Rencontres Francophones du Parallélisme (RenPar 2009), Toulouse, September 9-11, 2009
- Organizer and Co-Chair: HotI'17, ESPM2 2016, ExaComm'16, ESPM2 2015, ExaComm'15
- Publicity Chair: HPCC17, SmartCity17, DSS17, PGAS'15, PGAS'14
- Program Committee Member: SC'18, ScalCom'18, OpenSHMEM'18, SAC'18, PMAM'18, PMAM'17, SAC'17, CCGrid17, HotI'16, OpenSHMEM'16, WAMCA'16, ADVCOMP'16, AsHES'16, PMAM'16, PGAS'15, ADVCOMP 2015, HIPS'15, AsHES'15, CCGrid'15
- Tutorials: MUG'14, PGAS'14, SC'14, PPoPP'15, XSEDE'15, PGAS'15, SC'15, PPoPP'16, ISC'16, MUG'16
- MVAPICH2
The MVAPICH2 software, supporting the MPI 3.0 standard, delivers the best performance, scalability and fault tolerance for high-end computing systems and servers using InfiniBand, 10GigE/iWARP and RoCE networking technologies. The MVAPICH2-X software package provides support for hybrid MPI+PGAS (UPC and OpenSHMEM) programming models with a unified communication runtime for emerging exascale systems. The MVAPICH2-GDR package provides support for clusters with NVIDIA GPUs supporting the GPUDirect RDMA feature. The MVAPICH2-MIC package provides support for clusters with Intel MIC coprocessors. Recently, we proposed the first production-ready energy-aware MPI runtime with the MVAPICH2-EA library; its novelty is the combination of an analytical model of MPI protocols with on-line time estimation. MVAPICH2 libraries power several supercomputers in the TOP500 list.
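As a rough illustration of the hybrid MPI+PGAS model that MVAPICH2-X supports, the sketch below mixes OpenSHMEM one-sided puts with an MPI collective in a single program over one unified runtime. It uses only standard MPI and OpenSHMEM 1.x calls; this is a minimal example of the programming model, not code taken from the MVAPICH2 sources.

/* Hybrid MPI + OpenSHMEM sketch: one-sided puts for a neighbor exchange,
 * an MPI collective for the global reduction. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                      /* both models coexist in one process */

    int pe   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric heap allocation, remotely accessible by other PEs. */
    long *recv_buf = (long *) shmem_malloc(sizeof(long));
    long  my_val   = (long) pe;

    /* One-sided put of my value into my right neighbor's buffer. */
    shmem_long_put(recv_buf, &my_val, 1, (pe + 1) % npes);
    shmem_barrier_all();               /* puts are complete after the barrier */

    /* Global sum of the received values with an MPI collective. */
    long local = *recv_buf, global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (pe == 0)
        printf("sum of neighbor values = %ld\n", global);

    shmem_free(recv_buf);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}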
- BSP++ Library
BSP++ is a generic library built on C++ templates. Based on a hierarchical model, it takes hybrid architectures (multicore clusters and Cell BE accelerator-based clusters) as native targets. Using a small set of primitives and intuitive concepts, BSP++ provides a simple way to program hybrid and heterogeneous architectures: it generates MPI, OpenMP, MPI+OpenMP, Cell BE and MPI+Cell BE codes from the same version of the user's base code (the choice of target machine is just a preprocessor symbol).
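For context, a BSP program advances in supersteps: local computation, communication, then a global synchronization. The toy sketch below spells out that structure in plain MPI; it only illustrates the model and does not use the actual BSP++ primitives.

/* Toy BSP superstep loop in plain MPI (illustrative only, not the BSP++ API):
 * each superstep = local compute, exchange with neighbors, barrier. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double) rank;

    for (int step = 0; step < 3; ++step) {
        /* 1. Local computation phase. */
        local = local * 0.5 + 1.0;

        /* 2. Communication phase: send to right neighbor, receive from left. */
        double incoming = 0.0;
        int right = (rank + 1) % size, left = (rank + size - 1) % size;
        MPI_Sendrecv(&local, 1, MPI_DOUBLE, right, 0,
                     &incoming, 1, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        local += incoming;

        /* 3. Global synchronization ends the superstep. */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("value on PE 0 after 3 supersteps: %f\n", local);

    MPI_Finalize();
    return 0;
}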
- BSPGen Framework
BSPGen is a tool for automatic generation of hybrid, multi-level hierarchical code (MPI + OpenMP or MPI + Cell BE). Using the BSP++ cost model, BSPGen predicts and generates the appropriate hierarchical hybrid (BSP++) code for a given application on a target architecture. The prediction is implemented as a new pass in the LLVM compiler. BSPGen generates hybrid code from a list of sequential functions and a description of the parallel algorithm (an XML file).
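To make the generation target concrete, hierarchical hybrid code of this kind combines MPI across processes with OpenMP threads inside each process. The fragment below is a generic, hand-written example of that MPI+OpenMP pattern, not actual BSPGen output.

/* Generic MPI + OpenMP hybrid pattern: MPI between processes,
 * OpenMP threads inside each process (illustration only). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    double local_sum = 0.0;

    /* Intra-node level: OpenMP threads share the per-process work. */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < N; ++i)
        local_sum += 1.0 / (double)(i + 1 + rank * N);

    /* Inter-node level: MPI combines the per-process results. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}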
- I was an instructor at the University of Paris-Sud 11. During my PhD, I taught several courses at different levels:
Polytechnique - IFIPS, University of Paris-Sud 11
- (5th year - continuing education) Parallel and Distributed Programming
- (5th year - apprenticeship) Parallel and Distributed Programming
IUT d'Orsay, University of Paris-Sud 11
- (3rd year) Operating Systems
UFR d'Orsay, University of Paris-Sud 11
- (1st year) Introduction to Algorithms and the C Language