Khaled HAMIDOUCHE
AMD Research

Address: AMD Research, 7171 Southwest Pkwy, Austin, TX 78735, USA
Office: B100
Phone: 512-602-3569
Email: Khaled.Hamidouche(at)amd.com / khaledhamidouche (at) gmail.com
I am a Senior Member of Technical Staff (SMTS) at Advanced Micro Devices (AMD) Research, where I lead work on network-based programming models for parallel and heterogeneous architectures.
From 2012 to 2017, I was a Research Scientist in the Department of Computer Science and Engineering at The Ohio State University, as a member of the Network-Based Computing Laboratory working with Prof. Dhabaleswar K. (DK) Panda. I led the design and development of the MVAPICH project for GPUs and accelerators, as well as its support for hybrid MPI+PGAS programming models.
Before that, I was a post-doctoral researcher with the HP2 (High Performance and Parallel) team at Telecom SudParis, Evry, working on compilation and code generation for parallel architectures: I optimized the STEP source-to-source transformation tool and ported it to manycore architectures.
I defended my PhD thesis at the LRI (Laboratoire de Recherche en Informatique, Parall team) in November 2011, on parallel computing and parallel architectures. My thesis work (dissertation available here), supervised by Prof. Daniel Etiemble and Dr. Joel Falcou, focused on a programming model and hybrid code generation and deployment for hierarchical and heterogeneous parallel architectures, together with the development of supporting tools.
A full version of my resume is available in PDF form.
- High Performance Computing
- Parallel Programming Models
- Scale-out Machine Learning Frameworks and Applications
- Compilation and code generation
- Application co-design
- 60) M. LeBeane, K. Hamidouche, B. Benton, M. Breternitz, S. K. Reinhardt and L. K. John, ComP-Net: Command Processor Networking for Efficient Intra-kernel Communications on GPUs, The 27th International Conference on Parallel Architectures and Compilation Techniques (PACT'18), November 2018
- 59) M. LeBeane, K. Hamidouche, B. Benton, M. Breternitz, S. K. Reinhardt and L. K. John, GPU Triggered Networking for Intra-Kernel Communications, ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17), November 2017
- 58) J. Hashmi, K. Hamidouche, H. Subramoni, and D. K. Panda, Kernel-assisted Communication Engine for MPI on Emerging Manycore Processors, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), December 2017
- 57) A. Venkatesh, C. Chu, K. Hamidouche, S. Potluri, D. Rossetti, and D. K. Panda, MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling, International Conference on Parallel Processing (ICPP 2017), August 2017
- 56) A. Awan, K. Hamidouche, J. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2017), February 2017
- 55) J. M. Hashmi, K. Hamidouche, and D. K. Panda, Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models, IEEE International Conference on High Performance Computing and Communications (HPCC 2016), December 2016
- 54) D. S. Banerjee, K. Hamidouche, and D. K. Panda, Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters, IEEE International Conference on Cloud Computing Technology and Science (CloudComp'16), December 2016
- 53) K. Hamidouche, A. Awan, A. Venkatesh, and D. K. Panda, CUDA M3: Designing Efficient CUDA Managed Memory-aware MPI by Exploiting GDR and IPC, IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'16), December 2016
- 52) M. Li, X. Lu, K. Hamidouche, J. Zhang, and D. K. Panda, Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA, IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'16), December 2016
- 51) M. Li, K. Hamidouche, X. Lu, H. Subramoni, J. Zhang, and D. K. Panda, Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits, ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), November 2016
- 50) K. Hamidouche, J. Zhang, K. Tomko, and D. K. Panda, OpenSHMEM NonBlocking Data Movement Operations with MVAPICH2-X: Early Experiences, PGAS Applications Workshop (PAW'16), affiliated with SuperComputing (SC'16), November 2016
- 49) C. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications, Communication Optimizations in HPC Workshop (COMHPC'16), affiliated with SuperComputing (SC'16), November 2016
- 48) C. Chu, K. Hamidouche, A. Venkatesh, H. Subramoni, B. Elton, and D. K. Panda, Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'16), October 2016
- 47) A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, EUROMPI'16, September 2016, Best Paper Runner-up Award
- 46) K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, and D. K. Panda, CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters, accepted to appear in PARCO: Elsevier Parallel Computing Journal
- 45) C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, The 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS'16), May 2016
- 44) C. Chu, K. Hamidouche, A. Venkatesh, A. Awan, and D. K. Panda, CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'16), May 2016
- 43) H. Subramoni, A. M. Augustine, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and D. K. Panda, INAM2: InfiniBand Network Analysis & Monitoring with MPI, International Supercomputing Conference (ISC'16), June 2016
- 42) D. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, General Purpose GPU (GPGPU-9), affiliated with PPoPP'16, March 2016
- 41) A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, 22nd IEEE International Conference on High Performance Computing (HiPC'15), December 2015
- 40) M. Li, K. Hamidouche, X. Lu, J. Zhang, J. Lin, and D. K. Panda, High Performance OpenSHMEM Strided Communication Support with InfiniBand UMR, 22nd IEEE International Conference on High Performance Computing (HiPC'15), December 2015
- 39) A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, A Case for Application-Oblivious Energy-Efficient MPI Runtime, ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'15), November 2015, Best Student Paper Finalist
- 38) K. Hamidouche, A. Venkatesh, A. A. Awan, H. Subramoni, C. Chu, and D. K. Panda, Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters, IEEE Cluster 2015, September 2015, Chicago, USA
- 37) M. Li, H. Subramoni, K. Hamidouche, X. Lu, and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, IEEE Cluster 2015, September 2015, Chicago, USA
- 36) A. Awan, K. Hamidouche, A. Venkatesh, J. Perkins, H. Subramoni, and D. K. Panda, GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks, EUROMPI Conference 2015, September 2015, France
- 35) M. Li, K. Hamidouche, X. Lu, J. Lin, and D. K. Panda, High-Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters, International EURO-PAR Conference (Euro-Par 2015), August 2015, Austria
- 34) H. Subramoni, A. Venkatesh, K. Hamidouche, K. Tomko, and D. K. Panda, Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-all Collective Algorithms, IEEE Hot Interconnects (HotI'15), August 2015
- 33) A. Awan, K. Hamidouche, C. Chu, and D. K. Panda, A Case for Non-Blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X, OpenSHMEM Workshop 2015, July 2015
- 32) J. Lin, K. Hamidouche, J. Zhang, X. Lu, A. Vishnu, and D. K. Panda, Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM, OpenSHMEM Workshop 2015, July 2015
- 31) A. Gómez-Iglesias, J. Vienne, K. Hamidouche, W. Barth, and D. K. Panda, Scalable Out-of-core OpenSHMEM Library for HPC, OpenSHMEM Workshop 2015, July 2015
- 30) H. Subramoni, A. A. Awan, K. Hamidouche, D. Pekurovsky, A. Venkatesh, S. Chakraborty, K. Tomko, and D. K. Panda, Designing Non-Blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters, International Supercomputing Conference (ISC'15), July 2015, Germany
- 29) A. Gómez-Iglesias, D. Pekurovsky, K. Hamidouche, J. Zhang, and J. Vienne, Porting Scientific Libraries to PGAS in XSEDE Resources: Practice and Experience, XSEDE'2015 Conference, July 2015, St. Louis, USA
- 28) J. Lin, K. Hamidouche, X. Lu, M. Li, and D. K. Panda, Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation, International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS'15), affiliated with IPDPS 2015, May 2015, India
- 27) R. Rajachandrasekar, A. Venkatesh, K. Hamidouche, and D. K. Panda, Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'2015), May 2015, Shenzhen, China
- 26) J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar, and D. K. Panda, Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, International Conference on Partitioned Global Address Space Programming Models (PGAS'2014), October 2014, Oregon, USA
- 25) R. Shi, S. Potluri, K. Hamidouche, M. Li, J. Perkins, D. Rossetti, and D. K. Panda, Designing Efficient Small Message Transfer Mechanism for Inter-node MPI Communication on InfiniBand GPU Clusters, IEEE International Conference on High Performance Computing (HiPC'2014), December 2014, Goa, India
- 24) A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters, IEEE International Conference on High Performance Computing (HiPC'2014), December 2014, Goa, India
- 23) J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko, and D. K. Panda, High Performance OpenSHMEM for MIC Clusters: Extensions, Runtime Designs and Application Co-design, IEEE CLUSTER'14 (Best Paper Nominee), September 2014, Madrid, Spain
- 22) M. Li, X. Lu, S. Potluri, K. Hamidouche, J. Jose, K. Tomko, and D. K. Panda, Scalable Graph500 Design with MPI-3 RMA, IEEE CLUSTER'14, September 2014, Madrid, Spain
- 21) R. Rajachandrasekar, J. Perkins, K. Hamidouche, M. Arnold, and D. K. Panda, Understanding the Memory-Utilization of MPI Libraries: Challenges and Designs in Implementing the MPI_T Interface, EUROMPI'14, September 2014, Japan
- 20) R. Shi, X. Lu, S. Potluri, K. Hamidouche, J. Zhang, and D. K. Panda, HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement using MPI Datatypes on GPU Clusters, International Conference on Parallel Processing (ICPP'14), September 2014, Minneapolis, USA
- 19) R. Rajachandrasekar, S. Potluri, A. Venkatesh, K. Hamidouche, Md. Wasi-ur-Rahman and D. K. Panda, MIC-Check: A Distributed Checkpointing Framework for the Intel Many Integrated Cores Architecture. The International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC '14). June 2014, Vancouver, Canada
- 18) H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences. IEEE International Supercomputing Conference (ISC '14). June 2014, Leipzig, Germany
- 17) J. Jose, K. Hamidouche, J. Zhang, A. Venkatesh and D. K. Panda, Optimizing Collective Communication in UPC. International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS '14). May 2014, Phoenix, USA
- 16) A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters. IEEE International Parallel & Distributed Processing Symposium (IPDPS '14). May 2014, Phoenix, USA
- 15) M. Luo, X. Lu, K. Hamidouche, K. Kandalla and D. K. Panda, Initial Study of Multi-Endpoint Runtime for MPI+OpenMP Hybrid Applications on Multi-Core Systems. International Symposium on Principles and Practice of Parallel Programming (PPoPP '14). February 2014, Orlando, USA
- 14) R. Shi, S. Potluri, K. Hamidouche, X. Lu, K. Tomko and D. K. Panda, A Scalable and Portable Approach to Accelerate Hybrid HPL on Heterogeneous CPU-GPU Clusters. IEEE Cluster (Cluster'13), Best Student Paper Award. September 2013, Indianapolis, USA
- 13) S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni and D. K. Panda, MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters. IEEE/ACM International Conference on Supercomputing (SC13). November 2013, Denver, CO, USA
- 12) S. Potluri, K. Hamidouche, D. Bureddy and D. K. Panda, MVAPICH2-MIC: A High-Performance MPI Library for Xeon Phi Clusters with InfiniBand. Extreme Scaling Workshop. August 2013, Boulder, CO, USA
- 11) K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda, Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters. IEEE International Symposium on High-Performance Interconnects (HotI 2013). August 2013, San Jose, CA, USA
- 10) S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs. IEEE International Conference on Parallel Processing (ICPP 2013). October 2013, Lyon, France
- 9) M. Li, S. Potluri, K. Hamidouche, J. Jose, D. K. Panda, Efficient and Truly Passive MPI-3 RMA Using InfiniBand Atomics. EuroMPI 13. September 2013, Madrid, Spain
- 8) K. Hamidouche, S. Potluri, H. Subramoni, K. Kandalla and D. K. Panda, MIC-RO: Enabling Efficient Remote Offload on Heterogeneous Many Integrated Core (MIC) Clusters with InfiniBand. ACM International Conference on Supercomputing (ICS 2013). June 2013, Oregon, USA
- 7) K. Hamidouche, F. M. Mendonca, J. Falcou, A. C. M. A. Melo, D. Etiemble, Parallel Smith-Waterman Comparison on Multicore and Manycore Computing Platforms with BSP++. International Journal of Parallel Programming (IJPP). August 2012
- 6) K. Hamidouche, F. M. Mendonca, J. Falcou, D. Etiemble, Parallel Biological Sequence Comparison on Heterogeneous High Performance Computing Platforms with BSP++. 23rd IEEE International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2011), Vitoria, Espirito Santo, Brazil, October 26-29, 2011
- 5) K. Hamidouche, J. Falcou, D. Etiemble, A Framework for an Automatic Hybrid MPI+OpenMP Code Generation, ACM High Performance Computing Symposium (HPC-11), Boston, USA, April 3-7, 2011
- 4) K. Hamidouche, J. Falcou, D. Etiemble, Hybrid Bulk Synchronous Parallelism Library for Clustered SMP Architectures, ACM International Workshop on High Level Parallel Programming and Applications (HLPP 2010), affiliated with ICFP 2010, Baltimore, USA, September 25, 2010
- 3) K. Hamidouche, A. Borghi, P. Esterie, J. Falcou, S. Peyronnet, Three High Performance Architectures in the Parallel APMC Boat, IEEE International Workshop on Parallel and Distributed Methods in Verification (PDMC 2010), Enschede, Netherlands, September 30, 2010
- 2) C. Tadonki, L. Lacassagne, T. Saidani, J. Falcou, K. Hamidouche, The Harris Algorithm Revisited on the CELL Processor, International Workshop on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2010), June 1, 2010, Tsukuba, Japan
- 1) K. Hamidouche, F. Cappello, D. Etiemble, Comparison of MPI, OpenMP, and MPI+OpenMP on a Shared-Memory AMD Multicore Multiprocessor Node (in French), Rencontres Francophones du Parallélisme (RenPar 2009), Toulouse, September 9-11, 2009
- Organizer and Co-Chair: HotI'17, ESPM2 2016, ExaComm'16, ESPM2 2015, ExaComm'15
- Publicity Chair: HPCC17, SmartCity17, DSS17, PGAS'15, PGAS'14
- Program Committee Member: SC'18, ScalCom'18, OpenSHMEM'18, SAC'18, PMAM'18, PMAM'17, SAC'17, CCGrid17, HotI'16, OpenSHMEM'16, WAMCA'16, ADVCOMP'16, AsHES'16, PMAM'16, PGAS'15, ADVCOMP 2015, HIPS'15, AsHES'15, CCGrid'15
- Tutorials: MUG'14, PGAS'14, SC'14, PPoPP'15, XSEDE'15, PGAS'15, SC'15, PPoPP'16, ISC'16, MUG'16
- MVAPICH2
The MVAPICH2 software, supporting the MPI 3.0 standard, delivers the best performance, scalability and fault tolerance for high-end computing systems and servers using InfiniBand, 10GigE/iWARP and RoCE networking technologies. The MVAPICH2-X software package provides support for hybrid MPI+PGAS (UPC and OpenSHMEM) programming models with a unified communication runtime for emerging exascale systems. The MVAPICH2-GDR package provides support for clusters with NVIDIA GPUs supporting the GPUDirect RDMA feature. The MVAPICH2-MIC package provides support for clusters with Intel MIC coprocessors. Recently, we proposed the first production-ready energy-aware MPI runtime with the MVAPICH2-EA library; its novelty is the combination of an analytical model of MPI protocols with on-line time estimation. MVAPICH2 libraries power several supercomputers in the TOP500 list.
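As a rough illustration of the hybrid MPI+PGAS model that MVAPICH2-X supports, the sketch below mixes OpenSHMEM one-sided puts with an MPI collective in a single program over one unified runtime. It uses only standard MPI and OpenSHMEM 1.x calls; this is a minimal example of the programming model, not code taken from the MVAPICH2 sources.

/* Hybrid MPI + OpenSHMEM sketch: one-sided puts for a neighbor exchange,
 * an MPI collective for the global reduction. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                      /* both models coexist in one process */

    int pe   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric heap allocation, remotely accessible by other PEs. */
    long *recv_buf = (long *) shmem_malloc(sizeof(long));
    long  my_val   = (long) pe;

    /* One-sided put of my value into my right neighbor's buffer. */
    shmem_long_put(recv_buf, &my_val, 1, (pe + 1) % npes);
    shmem_barrier_all();               /* puts are complete after the barrier */

    /* Global sum of the received values with an MPI collective. */
    long local = *recv_buf, global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (pe == 0)
        printf("sum of neighbor values = %ld\n", global);

    shmem_free(recv_buf);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}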
- BSP++ Library
BSP++ is a generic library built on C++ templates. Based on a hierarchical model, it takes hybrid architectures (multicore clusters and Cell BE accelerator-based clusters) as native targets. Using a small set of primitives and intuitive concepts, BSP++ provides a simple way to program hybrid and heterogeneous architectures: it generates MPI, OpenMP, MPI+OpenMP, Cell BE and MPI+Cell BE codes from the same version of the user's base code (the choice of target machine is just a preprocessor symbol).
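For context, a BSP program advances in supersteps: local computation, communication, then a global synchronization. The toy sketch below spells out that structure in plain MPI; it only illustrates the model and does not use the actual BSP++ primitives.

/* Toy BSP superstep loop in plain MPI (illustrative only, not the BSP++ API):
 * each superstep = local compute, exchange with neighbors, barrier. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double) rank;

    for (int step = 0; step < 3; ++step) {
        /* 1. Local computation phase. */
        local = local * 0.5 + 1.0;

        /* 2. Communication phase: send to right neighbor, receive from left. */
        double incoming = 0.0;
        int right = (rank + 1) % size, left = (rank + size - 1) % size;
        MPI_Sendrecv(&local, 1, MPI_DOUBLE, right, 0,
                     &incoming, 1, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        local += incoming;

        /* 3. Global synchronization ends the superstep. */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("value on PE 0 after 3 supersteps: %f\n", local);

    MPI_Finalize();
    return 0;
}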
- BSPGen Framework
BSPGen is a tool for automatic generation of hybrid, multi-level hierarchical code (MPI + OpenMP or MPI + Cell BE). Using the BSP++ cost model, BSPGen predicts and generates the appropriate hierarchical hybrid (BSP++) code for a given application on a target architecture. The prediction is implemented as a new pass in the LLVM compiler. BSPGen generates hybrid code from a list of sequential functions and a description of the parallel algorithm (an XML file).
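To make the generation target concrete, hierarchical hybrid code of this kind combines MPI across processes with OpenMP threads inside each process. The fragment below is a generic, hand-written example of that MPI+OpenMP pattern, not actual BSPGen output.

/* Generic MPI + OpenMP hybrid pattern: MPI between processes,
 * OpenMP threads inside each process (illustration only). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    double local_sum = 0.0;

    /* Intra-node level: OpenMP threads share the per-process work. */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < N; ++i)
        local_sum += 1.0 / (double)(i + 1 + rank * N);

    /* Inter-node level: MPI combines the per-process results. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}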
- I was an instructor at the University of Paris-Sud 11. During my PhD, I taught several courses at different levels:
Polytechnique - IFIPS, University of Paris-Sud 11
- (5th year - continuing education) Parallel and Distributed Programming
- (5th year - apprenticeship) Parallel and Distributed Programming
IUT d'Orsay, University of Paris-Sud 11
- (3rd year) Operating Systems
UFR d'Orsay, University of Paris-Sud 11
- (1st year) Introduction to Algorithms and the C Language