Genghan Zhang

Authors

Genghan Zhang, Shaowei Zhu, Anjiang Wei, Zhenyu Song, Allen Nie, Zhen Jia, Nandita Vijaykumar, Yida Wang, and Kunle Olukotun

Abstract

We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. To evaluate the effectiveness of AccelOpt, we build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels of varying complexity extracted from real-world LLM workloads. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26x cheaper.

BibTeX


@article{zhang2026accelopt,
  title={AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization},
  author={Zhang, Genghan and Zhu, Shaowei and Wei, Anjiang and Song, Zhenyu and Nie, Allen and Jia, Zhen and Vijaykumar, Nandita and Wang, Yida and Olukotun, Kunle},
  journal={Proceedings of Machine Learning and Systems},
  volume={9},
  year={2026}
}