One Agent, Any Chip: Expert-free, Self-Improving Kernel Optimization

Genghan Zhang

December 22, 2025

The code for AccelOpt is open-source and available on GitHub: https://github.com/zhang677/AccelOpt. More details can be found in our paper. AccelOpt is a follow-up to this work.

AI accelerator kernels are hard to optimize

The unprecedented demand for compute power in the age of large models has prompted the rise of AI accelerators. However, their performance critically depends on the efficiency of kernels—the low-level implementations that determine how machine learning operators are mapped onto hardware resources. Suboptimal kernels can severely limit system performance and, when scaled to large deployments, result in substantial waste of compute and financial resources. Kernel optimization, however, is notoriously difficult and demanding, even for well-understood architectures like GPUs.

FA1-4
Figure 1 Each FlashAttention release lags the corresponding NVIDIA GPU release by ~1 year.


Can LLM-based systems autonomously navigate the kernel optimization space?

There are two primary challenges. First, LLM queries incur substantial computational costs, so exploration must be conducted strategically to balance comprehensive coverage of the search space with cost efficiency. Second, we want the system to autonomously accumulate optimization insights during these explorations, allowing it to progressively improve its capabilities over time without manual intervention.

AccelOpt-Method
Figure 2 At each iteration of AccelOpt, the agentic workflow shown on the right optimizes candidate kernels using the latest optimization memory, generates new candidate kernels, and updates the optimization memory with the newly collected experiences.


Solution: AccelOpt

To tackle this challenging task, we propose AccelOpt, a self-improving LLM agentic system for kernel optimization on emerging AI accelerators that combines search with memory accumulation. Among the open-source systems we are aware of, AccelOpt is the first that does not require expert-provided, hardware-specific optimization knowledge or predefined optimization recipes for emerging AI accelerators.
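To make the search-plus-memory idea concrete, here is a minimal sketch of one AccelOpt iteration in Python. The names (`agent`, `profiler`, `memory`) and the structure are illustrative simplifications, not the actual implementation.

```python
# A minimal sketch (not the real implementation) of one AccelOpt iteration:
# beam search over candidate kernels combined with an accumulating optimization memory.
def accelopt_iteration(candidates, memory, beam_width, agent, profiler):
    scored = []
    for kernel in candidates:
        # The agent proposes optimized variants, conditioned on past experiences.
        proposals = agent.optimize(kernel, memory)
        for proposal in proposals:
            latency = profiler.measure(proposal)      # run on real hardware
            scored.append((proposal, latency))
            memory.add(kernel, proposal, latency)     # record what was tried and whether it helped
    # Keep the fastest kernels as the beam for the next iteration.
    scored.sort(key=lambda pair: pair[1])
    return [kernel for kernel, _ in scored[:beam_width]], memory
```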

AccelOpt-Example
Figure 3 A snapshot of AccelOpt’s execution trace, in which it performs a non-trivial loop transformation.


Real-world Kernel Challenge: NKIBench

We construct NKIBench, the first benchmark suite for NKI kernel optimization on Amazon Trainium, with all kernels derived from real-world LLM workloads. NKIBench measures kernel performance against theoretical peak hardware performance on Trainium, rather than relying solely on relative speedup metrics, which can be ambiguous due to different baseline choices. AccelOpt boosts the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels.
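As a concrete illustration of the metric, "percentage of peak throughput" can be computed roughly as below; the exact operation counting and peak numbers NKIBench uses may differ from this simplified sketch.

```python
# Schematic computation of "percentage of peak throughput".
def percent_of_peak(total_flops: float, measured_latency_s: float,
                    peak_flops_per_s: float) -> float:
    achieved = total_flops / measured_latency_s       # achieved throughput (FLOP/s)
    return 100.0 * achieved / peak_flops_per_s

# Example: a kernel doing 1e12 FLOPs in 8 ms on hardware with a 250 TFLOP/s peak
# achieves (1e12 / 8e-3) / 250e12 = 50% of peak.
```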

AccelOpt-NKIBench
Figure 4 Per-task kernel improvement on NKIBench achieved by Claude Sonnet 4 and by AccelOpt (gpt-oss-120b + Qwen3-Coder-480B) on Trainium 1. The y-axis represents the percentage of peak throughput, where a higher score signifies better kernels.


Enhancing Test-Time Scaling via Beam Search and Optimization Memory

AccelOpt leverages test-time scaling to unlock the potential of LLMs that may be under-trained for specialized tasks. It employs beam search to iteratively discover better kernels by building upon the best kernels from previous iterations, which aligns with AutoComp's findings [1]. A key new finding of AccelOpt is the utility of optimization memory, which directs agents toward successful strategies. Similar to how RLVR improves pass@1 but not pass@n [2], optimization memory reduces the cost of test-time scaling while achieving the same average performance. AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26× cheaper.
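To illustrate what "optimization memory" might look like in practice, here is a hypothetical sketch of a memory entry and how it could be rendered into the optimizer's prompt; AccelOpt's actual schema may differ.

```python
# Hypothetical shape of an optimization-memory entry (illustrative only).
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    problem: str       # e.g., "fused attention, seq_len=2048"
    strategy: str      # natural-language summary of the transformation tried
    speedup: float     # measured speedup relative to the parent kernel
    verdict: str       # "helped" / "hurt" / "neutral"

def render_memory(entries, top_k=8):
    """Select the most useful experiences and format them for the optimizer prompt."""
    best = sorted(entries, key=lambda e: e.speedup, reverse=True)[:top_k]
    return "\n".join(
        f"- [{e.verdict}] {e.strategy} (speedup {e.speedup:.2f}x) on {e.problem}"
        for e in best
    )
```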

AccelOpt-Cost
Figure 5 Left: AccelOpt improves the cost efficiency of test-time scaling by 26×. Right: beam search discovers better kernels, and optimization memory saves iterations.


“Aha moment” and the Educational Impact

As a CA for Stanford’s CS149 (Parallel Computing), I leveraged AccelOpt to design problems for Assignment 4: Fused Conv+MaxPool on the Trainium2 Accelerator. During the 14th iteration of the optimization process, AccelOpt achieved a breakthrough by transforming a temporally sequential execution pattern into simultaneous spatial execution. This transformation yielded a 5x speedup over last year’s reference Conv2D kernel on small-scale inputs.
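To give a flavor of the transformation (the actual assignment kernel is more involved), here is a toy NumPy illustration of turning a temporally sequential loop into a single spatially parallel computation.

```python
# Toy illustration (not the assignment kernel): work that was computed one step
# at a time is laid out along a hardware-parallel ("spatial") dimension instead.
import numpy as np

patches = np.random.rand(256, 27)   # 256 im2col'd input patches, each of length 3*3*3
filters = np.random.rand(64, 27)    # 64 convolution filters, flattened

# Temporally sequential: one output channel per loop iteration.
out_sequential = np.empty((256, 64))
for c in range(64):
    out_sequential[:, c] = patches @ filters[c]

# Spatially simultaneous: all output channels map onto one matmul, so on an
# accelerator they can occupy parallel partitions instead of a time loop.
out_spatial = patches @ filters.T

assert np.allclose(out_sequential, out_spatial)
```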

Aha
Figure 6 "Aha moment" of AccelOpt.


In collaboration with other CAs, we developed four extra-credit problems centered on this optimization technique. The challenge was substantial: approximately half of the class was unable to complete it, yet one third of the students successfully acquired this critical ‘spatial thinking’ skill and earned full extra credit.

CS149
Figure 7 Human students' performance on the optimization discovered by LLM-powered AccelOpt.


This example underscores the generality of AccelOpt beyond NKIBench and highlights the educational impact of LLM-assisted kernel optimization.

What about GPUs?

AccelOpt is a hardware-agnostic framework. To demonstrate its efficacy, we evaluated it on the H100 SXM5 platform using the best Triton baselines from nine FlashInfer-Bench categories (as of December 23, 2025). Using the gpt-oss-120b model, AccelOpt discovered significant kernel improvements, achieving up to a 3.49x speedup. Here are all the generated kernels. We are currently collaborating with the FlashInfer-Bench team to integrate these optimized kernels into the public benchmark.

Fib
Figure 8 Performance evaluation of AccelOpt on selected FlashInfer-Bench problems. The y-axis represents latency, where a lower score signifies better kernels.


Let the Agent Flow: Beyond Human Guidance

A notable finding of AccelOpt is that the base prompt does not need to include full syntax information for NKI—a relatively new architecture-specific programming language (ASPL) released in 2024. It is often assumed that LLMs must master the syntax of an ASPL before optimizing kernels [3]. However, as pointed out in this blog, many ASPLs are structurally similar to one another. Syntax is therefore not the primary consideration; the priority is building a workflow that enables continuous improvement.
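As an example of that structural similarity, most ASPL kernels follow the same tile-based pattern: pick a tile index, load the tile, compute, store. The Triton snippet below shows this pattern; NKI kernels follow a broadly analogous load-compute-store structure over on-chip tiles. The snippet is illustrative and not taken from AccelOpt's prompts.

```python
# A minimal Triton kernel showing the load-compute-store tile structure shared
# by most tile-based ASPLs (illustrative example, not from AccelOpt).
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # which tile this program instance owns
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)          # load a tile
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)    # compute and store the tile

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```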

Within this continuous-improvement workflow, AccelOpt demonstrates that human-crafted optimization guidance is unnecessary. It was previously assumed that designers needed to provide specific instructions, such as optimization plans for the LLM to select from [1]. Consequently, porting the prompts from NKI to Triton simply involves replacing NKI-specific details with a few sentences about Triton; the two backends can even share the same summarizer prompts. Interested readers can read the prompts for NKI and Triton.
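Here is a hypothetical sketch of how such a backend-agnostic prompt could be assembled; the placeholder names and wording are mine, not the actual AccelOpt prompts.

```python
# Hypothetical prompt assembly: only the backend-specific notes change between targets.
BASE_OPTIMIZER_PROMPT = """\
You are optimizing a {backend} kernel.
{backend_notes}

Current kernel:
{kernel_source}

Relevant past experiences:
{optimization_memory}

Propose an optimized kernel and explain the transformation you applied.
"""

# Illustrative backend notes, not the real prompt text.
NKI_NOTES = "NKI kernels tile work over Trainium's on-chip memory partitions."
TRITON_NOTES = "Triton kernels tile work over GPU thread blocks via tl.program_id."

def build_prompt(backend: str, kernel_source: str, memory: str) -> str:
    notes = NKI_NOTES if backend == "NKI" else TRITON_NOTES
    return BASE_OPTIMIZER_PROMPT.format(
        backend=backend, backend_notes=notes,
        kernel_source=kernel_source, optimization_memory=memory,
    )
```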

Memory
Figure 9 "Magic of Optimization Memory," generated by Nano Banana Pro.


Benchmarks as Environments for Self-Improvement

Kernel benchmarks have three distinct properties that distinguish them from traditional benchmarks with fixed answers (such as labels or solutions):

  1. Open-ended Optimization: Each problem remains technically “unsolved” because further optimization is always possible until the hardware’s theoretical roofline bound is reached.
  2. Hardware-Dependent Evaluation: Evaluating kernels requires directly running them on real hardware. With LLM inference and fine-tuning becoming standardized and optimized, the entire self-improvement process is often bottlenecked by kernel profiling.
  3. Production Readiness: The surrounding ML frameworks are mature, meaning optimized kernels can be deployed directly into production.

These properties indicate that kernel benchmark environments must be scalable to new problems, parallelized to support heavy sampling, and modular for “day-zero” integration. Developed concurrently with FlashInfer-Bench, NKIBench uses a similar interface that features structured storage for problems and kernels along with a distributed profiling service. This is an advancement over traditional kernel benchmarks such as KernelBench [4], which consist only of a list of problems. By providing more than a fixed repository of tasks, this approach creates a comprehensive environment for developing self-improving LLM agents.
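The interface below sketches what such an environment could look like: structured problem storage plus a stub for a distributed profiling service. The names are hypothetical and do not reflect NKIBench's or FlashInfer-Bench's actual APIs.

```python
# Hypothetical sketch of a kernel-benchmark environment (names are illustrative).
from dataclasses import dataclass, field

@dataclass
class Problem:
    name: str
    reference_impl: str        # e.g., a NumPy/PyTorch reference for correctness checks
    shapes: dict               # concrete tensor shapes for this task
    peak_flops_per_s: float    # used to report percentage of peak

@dataclass
class KernelBenchEnv:
    problems: dict = field(default_factory=dict)

    def register(self, problem: Problem) -> None:
        """Scalable: new problems are added without touching existing ones."""
        self.problems[problem.name] = problem

    def profile(self, problem_name: str, kernel_source: str) -> dict:
        """Modular: dispatch to a distributed profiling service on real hardware."""
        # A real environment would submit the kernel to a worker pool, check
        # correctness against the reference, and return the measured latency.
        raise NotImplementedError("backend-specific profiling worker goes here")
```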

What about RL?

TL;DR: An effective training recipe has not yet been identified. We believe the potential of AccelOpt for data synthesis remains underexplored. Leveraging AccelOpt, we synthesized ~18k ‘slow-fast’ pairs through 16 iterations across 14 NKIBench problems. Additionally, we leveraged an in-house framework to produce 1k NKI kernels for training and 150 for evaluation. Supervised fine-tuning (SFT) was conducted on two distinct tasks: (1) slow-to-fast NKI kernel transformation and (2) Numpy-to-NKI program translation. Reinforcement learning (RL) was then applied using GRPO via verl, trained on the 1k NKI kernel dataset. The reward function penalizes empty outputs, compilation errors, and performance regressions, while incentivizing functional correctness and execution speedup. We evaluated four distinct recipes using a Best-of-5 sampling strategy. The metrics are functional correctness, the number of cases achieving a speedup ≥ 1 (and the average speedup among them), and the maximum speedup across the 150 evaluation problems.
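The reward shaping described above can be sketched as follows; the thresholds and weights are illustrative, not the exact values used in our RL runs.

```python
# Sketch of the reward shaping: penalize empty output, compile errors, and
# regressions; reward functional correctness and measured speedup.
def kernel_reward(output: str, compiled: bool, correct: bool, speedup: float) -> float:
    if not output.strip():
        return -1.0                           # empty generation
    if not compiled:
        return -0.5                           # compilation error
    if not correct:
        return 0.0                            # runs but produces wrong results
    reward = 0.5                              # base reward for functional correctness
    if speedup >= 1.0:
        reward += min(speedup - 1.0, 1.0)     # incentive for speedup, capped for stability
    else:
        reward -= 0.25                        # penalty for a performance regression
    return reward
```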

Qwen2.5-Coder-7B-Instruct

| Recipe | Correct Cases | Cases with Speedup ≥ 1 | Avg. Speedup | Max Speedup |
|---|---|---|---|---|
| Base model | 126 / 150 | 39 / 150 | 1.022 | 1.194 |
| SFT-Slow-Fast | 127 / 150 | 31 / 150 | 1.034 | 1.179 |
| SFT-Slow-Fast-RL | 131 / 150 | 53 / 150 | 1.024 | 1.161 |
| SFT-Numpy-NKI | 125 / 150 | 75 / 150 | 1.029 | 1.135 |
| SFT-Numpy-NKI-RL | 132 / 150 | 52 / 150 | 1.018 | 1.149 |

DeepSeek-Coder-33B-Base-Instruct

| Recipe | Correct Cases | Cases with Speedup ≥ 1 | Avg. Speedup | Max Speedup |
|---|---|---|---|---|
| Base model | 105 / 150 | 26 / 150 | 1.025 | 1.095 |
| SFT-Slow-Fast | 70 / 150 | 66 / 150 | 1.040 | 1.172 |
| SFT-Slow-Fast-RL | 104 / 150 | 57 / 150 | 1.029 | 1.103 |
| SFT-Numpy-NKI | 105 / 150 | 49 / 150 | 1.032 | 1.210 |
| SFT-Numpy-NKI-RL | 108 / 150 | 67 / 150 | 1.025 | 1.152 |

The RL infrastructure was initiated by Anjiang Wei, who also collected the first 1k+150 kernels and implemented the Numpy-NKI recipes. Genghan Zhang later optimized the RL infrastructure, collected the 18k ‘slow-fast’ kernel pairs, and implemented the Slow-Fast recipes.

Fast-weight Engineering

Context engineering can be viewed as ‘fast-weight engineering’ because the context persists in the KV-cache, where it is referenced by every generated token much like static model weights. Workflows such as AccelOpt demonstrate that this is not merely ad-hoc manual prompting, but a systematic approach to fast-weight optimization. In a broader sense, just as back-propagation was designed to optimize weights, we are now seeking the most effective design for context. We are currently in an era reminiscent of the early days of neural networks, experimenting with diverse ‘weight engineering’ recipes to find what sticks. While ‘fast-weight’ may be an overloaded term in this context, I believe self-improving agents share a deep structural lineage with Test-Time Training (TTT). Both paradigms essentially treat inference not as a static forward pass, but as a dynamic optimization process where the model’s ‘state’—whether in the KV-cache or through local RNN state updates—adapts to the task (or sequence) at hand.
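One way to make the analogy concrete is the fast-weight-programmer view of linear attention, in which the context writes an outer-product update into a matrix "weight" that every later query reads, much as the KV-cache is read by every generated token. This framing is my reading of the connection, not a claim from the paper.

```latex
% Linear attention as a fast-weight update: context tokens (k_t, v_t) write into W,
% and each query q_t reads the accumulated "fast weights".
W_t = W_{t-1} + v_t k_t^{\top}, \qquad o_t = W_t q_t
```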

Acknowledgement

This work was a collaborative effort across several teams. I would like to thank our collaborators from Stanford University, AWS Neuron Science Team, and the University of Toronto: Shaowei Zhu, Anjiang Wei, Zhenyu Song, Allen Nie, Zhen Jia, Nandita Vijaykumar, Yida Wang, and Kunle Olukotun. Special thanks to Stanford CS149 instructors Kayvon Fatahalian and Kunle Olukotun, as well as the amazing CA team—particularly our “Assignment 4 gurus” Weixin Yu and Anthony Zhan—and the AWS support team for their essential role in the course. Finally, I’m grateful to Yixin Dong from the FlashInfer-Bench team for his insights and collaboration.

AccelOpt-Moon
Appendix "AccelOpt Cooking Kernels on the Moon," generated by Nano Banana Pro.


References

  1. Hong, C., Bhatia, S., Cheung, A. and Shao, S., 2025. Autocomp: LLM-driven code optimization for tensor accelerators. In Machine Learning for Computer Architecture and Systems 2025.
  2. Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Song, S. and Huang, G., 2025. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837.
  3. Agrawal, L.A., Tan, S., Soylu, D., Ziems, N., Khare, R., Opsahl-Ong, K., Singhvi, A., Shandilya, H., Ryan, M.J., Jiang, M. and Potts, C., 2025. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457.
  4. Ouyang, A., Guo, S., Arora, S., Zhang, A.L., Hu, W., Ré, C. and Mirhoseini, A., 2025. KernelBench: Can LLMs write efficient GPU kernels? arXiv preprint arXiv:2502.10517.