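For a programmer, calling an intrinsic is just like calling any other function. Underneath, the compiler replaces the call with the appropriate assembly instruction. So instead of handling registers through inline assembly in C/C++ code, you call the corresponding intrinsic function. Each CPU architecture has its own intrinsics API and a corresponding header file. As an example, let's vectorize a snippet of PostgreSQL code using the ARM architecture's SIMD intrinsics and see how much difference vectorization makes. Before that, you may want to skim an overview of the NEON architecture to understand the naming conventions for registers, lanes, and vectors. NEON is ARM's brand name for its SIMD architecture: the implementation of the Advanced SIMD extension used in ARM processors is called NEON. The NEON unit is a mandatory part of ARMv8 chips.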
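To make the "an intrinsic is just a function call" point concrete, here is a minimal sketch that adds two groups of four 16-bit integers. The helper name `add4_s16` and the scalar fallback are assumptions of this sketch and are not part of PostgreSQL; the NEON calls `vld1_s16`, `vadd_s16`, and `vst1_s16` come from `arm_neon.h`.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[k] = a[k] + b[k] for the four lanes k = 0..3. */
static void
add4_s16(const int16_t *a, const int16_t *b, int16_t *out)
{
    assert(a != 0 && b != 0 && out != 0);
#if defined(__ARM_NEON)
    int16x4_t va = vld1_s16(a);         /* load 4 lanes from memory */
    int16x4_t vb = vld1_s16(b);
    int16x4_t vsum = vadd_s16(va, vb);  /* one lane-wise addition */

    vst1_s16(out, vsum);                /* store the 4 lanes back */
#else
    /* Scalar fallback so the sketch also builds on non-ARM hosts. */
    for (int k = 0; k < 4; k++)
        out[k] = (int16_t) (a[k] + b[k]);
#endif
}
```

Each intrinsic call reads like ordinary C; on an ARM target the compiler is expected to turn it into the matching NEON instruction without any hand-written assembly.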
#include <arm_neon.h>
......
......
    int         i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
    int         remainder;
    int         count = i2 + 1;
    int32      *digptr = &dig[i1 + 2];

    /* Load the same var1digit value into all lanes of 16x4 vector. */
    int16x4_t   var1digit_16x4 = vdup_n_s16(var1digit);    // VDUP.16 d0,r0

    /* Parallelize each group of 4 digits */
    remainder = count % 4;
    count -= remainder;
    for (i = 0; i < count; i += 4)
    {
        /*
         * 1. Load required data into vectors
         * 2. Do multiply-accumulate-long operation using 16x4 vectors,
         *    whose output is a 32x4 vector which we need, because digptr[]
         *    is 32bit.
         * 3. Store back the result vector into digptr[]
         */

        /* Load 4 var2digits into 16x4 vector and digptr into 32x4 */
        int16x4_t   var2digits_16x4 = vld1_s16(&var2digits[i]);
        int32x4_t   dig_32x4 = vld1q_s32(&digptr[i]);

        /* Multiply-accumulate-long: dig += var1digit * var2digits, widened to 32 bits */
        dig_32x4 = vmlal_s16(dig_32x4, var1digit_16x4, var2digits_16x4);

        /* Store the result vector back into digptr[] */
        vst1q_s32(&digptr[i], dig_32x4);
    }

    /* Handle the remaining (count % 4) digits with scalar code */
    for (; remainder != 0; remainder--, i++)
        digptr[i] += var1digit * var2digits[i];
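For experimenting outside PostgreSQL, the loop can be wrapped in a standalone function. This is a sketch under stated assumptions: the name `mul_accumulate`, the `int16_t`/`int32_t` types standing in for PostgreSQL's digit types, the scalar fallback for non-ARM hosts, and the choice of `vmlal_s16` and `vst1q_s32` for the multiply-accumulate and store steps named in the comment are all mine.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/*
 * Hypothetical helper: digptr[i] += var1digit * var2digits[i]
 * for every i in [0, count).
 */
static void
mul_accumulate(int32_t *digptr, const int16_t *var2digits,
               int16_t var1digit, int count)
{
    int i = 0;

    assert(count >= 0);
#if defined(__ARM_NEON)
    /* Broadcast var1digit into all four 16-bit lanes. */
    int16x4_t var1digit_16x4 = vdup_n_s16(var1digit);
    int vec_count = count - count % 4;  /* largest multiple of 4 <= count */

    for (; i < vec_count; i += 4)
    {
        int16x4_t var2digits_16x4 = vld1_s16(&var2digits[i]);
        int32x4_t dig_32x4 = vld1q_s32(&digptr[i]);

        /* Widening multiply-accumulate: 16-bit inputs, 32-bit sums. */
        dig_32x4 = vmlal_s16(dig_32x4, var1digit_16x4, var2digits_16x4);
        vst1q_s32(&digptr[i], dig_32x4);
    }
#endif
    /* Scalar tail; on non-ARM hosts this loop handles every element. */
    for (; i < count; i++)
        digptr[i] += var1digit * var2digits[i];
}
```

Using a count that is not a multiple of 4 (for example 7) exercises both the vector body and the scalar tail on an ARM build.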
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index f3a725271e..4243242ad9 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -7226,6 +7226,7 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
     int         res_weight;
     int         maxdigits;
     int        *dig;
+    int        *digptr;
     int         carry;
     int         maxdig;
     int         newdig;
@@ -7362,10 +7363,14 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
          *
          * As above, digits of var2 can be ignored if they don't contribute,
          * so we only include digits for which i1+i2+2 <= res_ndigits - 1.
+         *
+         * For large precisions, this can become a bottleneck; so keep this for
+         * loop simple so that it can be auto-vectorized.
          */
-        for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;
-             i2 >= 0; i2--)
-            dig[i--] += var1digit * var2digits[i2];
+        i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
+        digptr = &dig[i1 + 2];
+        for (i = 0; i <= i2; i++)
+            digptr[i] += var1digit * var2digits[i];
     }
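The patch only changes the direction and indexing of the loop, not its result. The sketch below places the two forms side by side so that the equivalence can be checked; the function names and the `int16_t` digit type are assumptions made for illustration and are not taken from numeric.c.

```c
#include <assert.h>
#include <stdint.h>

/* Old form: two induction variables, both counting down. */
static void
accumulate_backward(int *dig, const int16_t *var2digits,
                    int var1digit, int i1, int i2)
{
    int i;

    assert(i1 >= 0);
    for (i = i1 + i2 + 2; i2 >= 0; i2--)
        dig[i--] += var1digit * var2digits[i2];
}

/* New form: one index counting up from 0 over both arrays. */
static void
accumulate_forward(int *dig, const int16_t *var2digits,
                   int var1digit, int i1, int i2)
{
    int *digptr = &dig[i1 + 2];
    int i;

    assert(i1 >= 0);
    for (i = 0; i <= i2; i++)
        digptr[i] += var1digit * var2digits[i];
}
```

Both functions update `dig[i1 + 2]` through `dig[i1 + i2 + 2]` with the same products; the second form gives the compiler a single ascending index, which is the "simple" shape the patch comment asks for. With GCC, a vectorization report (for example from the `-fopt-info-vec-all` option) shows whether such a loop was actually vectorized.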
numeric.c:7217:3: optimized: loop vectorized using 16 byte vectors

Or, if the compiler cannot vectorize the loop, you would see something like this:

numeric.c:7380:3: missed: couldn't vectorize loop
numeric.c:7381:15: missed: not vectorized: relevant stmt not supported: _39 = *_38;