CATS: Context-Aware Thresholding for Sparsity in Large Language Models
Abstract
The dramatic improvements in Large Language Models (LLMs) come at the cost of increased computational resources for inference. Recent studies reduce the inference costs of LLMs by increasing their activation sparsity, but suffer significant performance degradation on downstream tasks. In this work, we introduce a new framework for sparsifying the activations of LLMs and reducing inference costs, dubbed Context-Aware Thresholding for Sparsity (CATS). CATS is simple to implement and highly effective. At the heart of our framework is a new non-linear activation function. We demonstrate that CATS can be applied to various models, including Mistral-7B and Llama2-7B & 13B, and outperforms existing sparsification techniques across multiple tasks. More precisely, CATS-based models achieve downstream task performance within ~99% of their base models at 50% activation sparsity, even without fine-tuning. Moreover, with fine-tuning that targets only 1% of the parameters, CATS-based models not only converge faster but also achieve better task performance than competing techniques. Finally, we develop a custom GPU kernel that translates the activation sparsity of CATS into real wall-clock speedups: our kernel implementation yields a ~15% improvement in the wall-clock inference latency of token generation. We release our code, experiments, and datasets at https://github.com/ScalingIntelligence/CATS.
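The core idea of the abstract — a non-linear activation that zeroes out small activations to reach a target sparsity level — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names (`calibrate_threshold`, `cats_activation`) and the use of SiLU as the base activation and of a quantile over sample activations to pick the cutoff are illustrative assumptions; see the repository linked above for the actual method and kernel.

```python
import numpy as np

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def calibrate_threshold(sample_activations, target_sparsity=0.5):
    """Hypothetical calibration step: choose the magnitude cutoff so that
    `target_sparsity` of the sampled activations fall below it."""
    return np.quantile(np.abs(sample_activations), target_sparsity)

def cats_activation(x, threshold):
    """Illustrative thresholded activation: keep a SiLU output only when
    its magnitude exceeds the calibrated cutoff, otherwise emit exactly 0,
    so downstream matrix multiplies can skip the zeroed entries."""
    g = silu(x)
    return np.where(np.abs(g) >= threshold, g, 0.0)
```

A speedup only materializes if the zeros are actually exploited, which is why the paper pairs the activation with a custom GPU kernel that skips the corresponding rows/columns of the MLP weight matrices.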
Blog
Article
BibTeX
@inproceedings{leecats,
  title={CATS: Context-Aware Thresholding for Sparsity in Large Language Models},
  author={Lee, Donghyun and Lee, Jaeyong and Zhang, Genghan and Tiwari, Mo and Mirhoseini, Azalia},
  booktitle={First Conference on Language Modeling},
  year={2024}
}