Research
Compiler Technologies for Emerging Architectures
[The overview of PIMFlow]
With the end of Dennard scaling, we are witnessing a major shift in computer system and microarchitecture design toward exploiting more specialized and lightweight “accelerators” of different types instead of relying mostly on general-purpose processors. Such heterogeneous systems pose an unprecedented challenge for the entire software stack: providing programmability and portability while delivering performance.
We work on rethinking compiler and runtime technologies for heterogeneous systems with emerging architectures such as compute-augmented memory (NDP/PIM) (joint work with SK Hynix, KAIST, SNU, and other universities and start-ups). PIMFlow (CAL ’22, CGO ’23) proposes software layers specifically designed to accelerate compute-intensive convolutional layers on PIM-DRAM. XLA-NDP (CAL ’23) introduces a compiler and runtime solution for NDPX to maximize parallelism based on GPU and NDPX costs.
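To illustrate the flavor of cost-based GPU/NDP scheduling (this is a simplified sketch, not PIMFlow’s or XLA-NDP’s actual algorithm), consider a greedy list scheduler that assigns each operation to whichever device would finish it earliest, so GPU and NDP execution overlap. The operation names and cost numbers are hypothetical, and dependencies between operations are ignored for brevity:

```python
def schedule(ops):
    """Greedy list scheduling over two devices.

    ops: list of (name, costs) pairs, where costs maps a device name
    ("GPU" or "NDP") to that op's estimated execution time. Each op is
    placed on the device that would finish it earliest given the work
    already assigned, overlapping execution across the two devices.
    Returns the placement plan and the resulting makespan.
    """
    finish = {"GPU": 0.0, "NDP": 0.0}  # accumulated busy time per device
    plan = []
    for name, costs in ops:
        best = min(costs, key=lambda d: finish[d] + costs[d])
        finish[best] += costs[best]
        plan.append((name, best))
    return plan, max(finish.values())

# Hypothetical per-op cost estimates (e.g., from profiling or a model).
plan, makespan = schedule([
    ("conv1", {"GPU": 3.0, "NDP": 5.0}),
    ("conv2", {"GPU": 3.0, "NDP": 4.0}),
    ("fc",    {"GPU": 2.0, "NDP": 6.0}),
])
print(plan, makespan)
```

Even this toy version shows the key trade-off: conv2 is individually faster on the GPU, but placing it on the NDP side lets it run concurrently with conv1 and shortens the overall schedule.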
Related publications:
J. Park, H. Sung, XLA-NDP: Efficient Scheduling and Code Generation for DL Model Training on Near-Data Processing Memory, IEEE Computer Architecture Letters (CAL), 2023.
Y. Shin, J. Park, S. Cho, and H. Sung, PIMFlow: Compiler and Runtime Support for CNN Models on Processing-in-Memory DRAM. International Symposium on Code Generation and Optimization (CGO), 2023.
Y. Shin, J. Park, J. Hong, and H. Sung, Runtime Support for Accelerating CNN Models on Digital DRAM Processing-in-Memory Hardware, IEEE Computer Architecture Letters (CAL), 2022.
Machine Learning for Compiler Optimizations
[The overview of One-shot Tuner]
Generating high-performing code for increasingly heterogeneous hardware calls for more flexible and adaptive ways of modeling program behavior under different optimization decisions than traditional heuristic-based cost models allow.
We take data-driven approaches where the code-performance relationship is accurately learned from code representations and profiling results. CogR (PACT ’19) guides the OpenMP runtime scheduler by predicting whether an OpenMP target region will execute faster on the CPU or the GPU using a deep-learning-based predictor model, while MetaTune extends the auto-tuning framework of a deep-learning compiler, TVM, to reduce auto-tuning overheads and generate better-optimized code for tensor operations. Most recently, One-shot Tuner (CC ’22) showed how online auto-tuning overheads can be practically eliminated with a NAS-inspired performance predictor model trained on a small set of samples (open-sourced).
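The data-driven idea can be sketched in miniature (this is an illustrative toy, not the actual CogR or One-shot Tuner model): learn a predictor from program features to a device-placement decision. The two features, the synthetic training data, and the decision boundary below are all hypothetical stand-ins for real code representations and profiling results:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features per kernel: [log(trip count), arithmetic intensity].
X = rng.uniform(0.0, 10.0, size=(200, 2))
# Synthetic ground truth: large, compute-heavy kernels favor the GPU.
y = (0.6 * X[:, 0] + 0.8 * X[:, 1] > 7.0).astype(float)

# A tiny logistic-regression predictor trained with plain gradient descent.
w = np.zeros(2)
b = 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(GPU faster)
    w -= 0.1 * (X.T @ (p - y)) / len(y)      # gradient step on weights
    b -= 0.1 * float(np.mean(p - y))         # gradient step on bias

def predict_device(features):
    """Return 'GPU' if the model expects the GPU to be faster."""
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    return "GPU" if p > 0.5 else "CPU"

print(predict_device(np.array([9.0, 9.0])))  # large, compute-intensive kernel
print(predict_device(np.array([1.0, 1.0])))  # small, memory-light kernel
```

In the actual research, the features come from learned code representations rather than hand-picked scalars, and the models predict performance rather than a binary label, but the training loop over profiled samples follows the same pattern.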
Related publications:
J. Ryu, E. Park, and H. Sung, One-shot tuner for deep learning compilers. ACM SIGPLAN International Conference on Compiler Construction (CC), 2022.
J. Ryu and H. Sung, MetaTune: Meta-Learning Based Cost Model for Fast and Efficient Auto-tuning Frameworks, arXiv, 2021.
H. Sung, T. Chen, A. Eichenberger and K. K. O'Brien, POSTER: CogR: Exploiting Program Structures for Machine-Learning Based Runtime Solutions, International Conference on Parallel Architectures and Compilation Techniques (PACT), 2019.
OpenCL Compiler and Runtime Support for Next-gen Supercomputers
Modern supercomputers harness the massive parallelism provided by host CPUs and specialized computing elements such as GPUs and NPUs.
In collaboration with ETRI and KISTI, we work on building Korea’s own next-generation supercomputers and providing OpenCL programming support with optimizing compilers and runtimes.
Funded Projects
TBU