
Figure 1: Demonstration of the distribution of different channels in OPT decoder layers. Each point is the (maximum value, minimum value) of a channel in the activation.
accelerates compute-intensive operations such as matrix multiplication and linear layers. By quantizing weights and activations, storage and communication overhead is reduced, leading to improved efficiency and faster inference. Quantization methods are typically divided into two categories: post-training quantization (PTQ) and quantization-aware training (QAT). While QAT methods can achieve higher accuracy in most cases, they require substantial computational resources to train the models, making them less practical for LLMs whose training costs are already enormous. In contrast, PTQ methods can quantize pre-trained models without additional training, making them more practical for large models that demand significant computational and memory resources. This paper therefore focuses on PTQ for LLMs.
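To make the general idea concrete, below is a minimal sketch of uniform asymmetric quantization in NumPy. The function names and the 8-bit setting are our own choices for illustration and do not reflect the specific quantizer used in this paper.

```python
import numpy as np

def quantize(x, n_bits=8):
    """Uniform asymmetric quantization: map float values to integers in [0, 2^n_bits - 1]."""
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()) / qmax           # step size between adjacent integer levels
    zero_point = np.round(-x.min() / scale)      # integer to which x.min() is mapped
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original float values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1024).astype(np.float32)     # a toy activation tensor
q, scale, zp = quantize(x)                        # stored and communicated as 8-bit integers
x_hat = dequantize(q, scale, zp)
print("max reconstruction error:", np.abs(x - x_hat).max())
```

Storing the 8-bit integers together with a single scale and zero-point uses roughly a quarter of the memory of the 32-bit original, which is the source of the storage and bandwidth savings mentioned above.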
In this paper, we highlight the challenge of quantizing the activations of LLMs, which stems from the significant variations in values across different channels², as shown in Figure 1. Two observations can be made from this figure: 1) Some channels exhibit significant outliers, with maximum or minimum values that are hundreds of times larger than those of other channels. Previous studies [34; 11] have also identified this issue and proposed special treatment for outliers. 2) Different channels exhibit significant differences in their value ranges. Quantizing different channels with the same quantization parameters can lead to substantial quantization errors. Even if two channels have outliers of the same absolute value, their numerical ranges can still differ greatly. For instance, one channel may have a range of -100 to -50, while another may have a range of 80 to 100. Using the same quantization parameters for both can lead to significant quantization errors, a challenge that has not been effectively addressed in previous works.
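To quantify the example above, the following sketch (with hypothetical channel contents drawn uniformly from the stated ranges) compares the quantization error when both channels share one set of parameters against giving each channel its own parameters:

```python
import numpy as np

def quant_mse(x, n_bits=8):
    """Mean squared error after uniform asymmetric quantization of x with its own parameters."""
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return np.mean(((q - zero_point) * scale - x) ** 2)

ch_a = np.random.uniform(-100, -50, 4096)   # channel with range [-100, -50]
ch_b = np.random.uniform(80, 100, 4096)     # channel with range [80, 100]

# Shared parameters must cover the combined range [-100, 100].
shared_mse = quant_mse(np.concatenate([ch_a, ch_b]))
# Separate parameters cover each channel's much narrower range.
separate_mse = 0.5 * (quant_mse(ch_a) + quant_mse(ch_b))
print(f"shared parameters MSE: {shared_mse:.5f}")
print(f"per-channel parameters MSE: {separate_mse:.5f}")
```

With 8 bits, the shared step size is about 200/255, whereas the per-channel step sizes are only 50/255 and 20/255, so the per-channel error is more than an order of magnitude smaller in this example; the gap widens further at lower bit widths.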
To address the issue of quantizing activations whose channels have significantly different ranges, we propose a method called RPTQ. The method clusters channels in the activations that exhibit similar value ranges and then quantizes the values within each cluster using the same quantization parameters. Consequently, channels with considerably different numerical ranges can use distinct quantization parameters, leading to a significant reduction in quantization error. Furthermore, we propose strategies to avoid explicit reordering, thereby decreasing computational overhead and improving inference efficiency. We propose a modified layer norm operation that yields reordered activations directly, obviating the need for explicit channel adjustments during inference. In addition, we reorganize the weights of linear layers so that they directly accept and produce activations in the sorted order.
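As a rough illustration of these ideas (not the exact implementation used in this paper; the clustering setup, function names, and dimensions are assumptions), the sketch below clusters channels by their calibration-time (min, max) statistics, folds the resulting permutation into the layer norm's affine parameters and the next linear layer's weight columns, and leaves contiguous channel blocks that can each receive their own quantization parameters:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_channels(act_min, act_max, n_clusters=4):
    """Cluster channels by their calibration-time (min, max) statistics and
    return a permutation that makes channels of the same cluster contiguous."""
    feats = np.stack([act_min, act_max], axis=1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    perm = np.argsort(labels, kind="stable")
    return perm, np.bincount(labels, minlength=n_clusters)

def reordered_layernorm(x, weight, bias, perm, eps=1e-5):
    """LayerNorm that emits channels already in the clustered order, i.e. it is
    equivalent to a standard LayerNorm followed by a gather along the channel dim."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return x_hat[..., perm] * weight[perm] + bias[perm]

# Hypothetical calibration statistics for a hidden size of 768.
hidden = 768
act_min = np.random.uniform(-120.0, 0.0, hidden)
act_max = act_min + np.random.uniform(10.0, 120.0, hidden)
perm, cluster_sizes = cluster_channels(act_min, act_max)

# The linear layer that consumes the reordered activations has its input
# dimension permuted once, offline, so no explicit reorder happens at inference:
# W[:, perm] @ y[perm] == W @ y for any permutation perm.
W = np.random.randn(hidden, hidden)          # (out_features, in_features)
W_reordered = W[:, perm]

# Sanity check on a random token: the fused reorder matches "LayerNorm then permute".
x = np.random.randn(1, hidden)
ln_w, ln_b = np.ones(hidden), np.zeros(hidden)
y = reordered_layernorm(x, ln_w, ln_b, perm)
y_ref = reordered_layernorm(x, ln_w, ln_b, np.arange(hidden))   # standard LayerNorm
assert np.allclose(W_reordered @ y[0], W @ y_ref[0], atol=1e-5)

# Each contiguous block of cluster_sizes[k] channels in y now shares one set of
# quantization parameters (scale / zero-point) computed from that cluster's range.
```

Because the permutation is absorbed into the layer norm parameters and the weight matrix offline, the reordered and clustered quantization comes at no extra gather cost at inference time.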
Our experiments demonstrate that RPTQ is an effective solution for addressing the issue of quantizing the activations of LLMs. Clustering the channels into only a small number of clusters can significantly reduce quantization errors and improve the accuracy of quantized LLMs. The results show that RPTQ
² For simplicity, we use the term "channel" to refer to the dimension of the hidden size. See Appendix A.1 for more results.