How to Efficiently Implement log2(__m256d) in AVX2?


DDD

Release: 2024-12-04 14:06:11

Efficient Implementation of log2(__m256d) in AVX2

Introduction

__m256d _mm256_log2_pd (__m256d a) is not a hardware instruction but a function from Intel's Short Vector Math Library (SVML). It is not widely available outside Intel's compilers, and its performance is reportedly compromised on AMD processors. This article describes a fast, cross-compiler way to compute log2() for vectors of doubles using the AVX2 instruction set.

Approach

The typical method relies on the identity log(a*b) = log(a) + log(b). A double is stored as mantissa * 2^exponent, so once the exponent bias is removed, log2(x) = exponent + log2(mantissa). Because the mantissa is confined to the range [1.0, 2.0), a polynomial approximation for log2(mantissa) is sufficient.
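The split described above can be checked in scalar code before vectorizing it. The following is a minimal sketch, assuming IEEE 754 binary64 doubles and positive, normal inputs; the names `ExpMant`, `split_double`, and `log2_via_split` are hypothetical and do not come from the article.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper type: unbiased exponent and mantissa in [1.0, 2.0),
// so that x == mantissa * 2^exponent.
struct ExpMant {
    int exponent;
    double mantissa;
};

ExpMant split_double(double x) {
    // This sketch only covers positive, normal inputs.
    assert(x > 0.0 && std::isnormal(x));

    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // well-defined way to read the bit pattern

    const std::uint64_t exp_mask = 0x7FFULL << 52;  // 11-bit exponent field
    const int biased = static_cast<int>((bits & exp_mask) >> 52);

    // Overwrite the exponent field with the bias (1023), i.e. 2^0,
    // which places the value in [1.0, 2.0).
    const std::uint64_t mant_bits = (bits & ~exp_mask) | (1023ULL << 52);
    double m;
    std::memcpy(&m, &mant_bits, sizeof m);

    return {biased - 1023, m};
}

// log2(x) = exponent + log2(mantissa); std::log2 stands in here for the
// polynomial approximation that a vectorized version would use.
double log2_via_split(double x) {
    const ExpMant p = split_double(x);
    return p.exponent + std::log2(p.mantissa);
}
```

For example, 10.0 splits into exponent 3 and mantissa 1.25, since 10 = 1.25 * 2^3.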

Accuracy Considerations

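The accuracy of the polynomial determines the error of the final result. To minimize the maximum absolute or relative error over the whole mantissa range, the coefficients should be tuned through minimax fitting rather than taken from a truncated Taylor series, which is most accurate near its expansion point and least accurate at the far end of the interval.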
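To see why a truncated series leaves accuracy on the table, it helps to measure where its error sits. The sketch below assumes the standard expansion ln(y) = 2*(t + t^3/3 + t^5/5 + ...) with t = (y-1)/(y+1); whether the article's own coefficients equal these series values is not stated, and the function names `series_log2` and `max_abs_error` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// log2(y) from the series truncated to `terms` terms, in scalar form.
double series_log2(double y, int terms) {
    assert(y > 0.0 && terms >= 1);
    const double t = (y - 1.0) / (y + 1.0);
    const double t2 = t * t;
    double power = t;  // holds t^(2k+1)
    double sum = 0.0;
    for (int k = 0; k < terms; ++k) {
        sum += power / (2 * k + 1);
        power *= t2;
    }
    return sum * (2.0 / std::log(2.0));  // convert ln to log2, including the factor 2
}

// Largest absolute error against std::log2 on `samples` evenly spaced
// points in [lo, hi).
double max_abs_error(int terms, double lo, double hi, int samples) {
    assert(samples > 0 && lo < hi);
    double worst = 0.0;
    for (int i = 0; i < samples; ++i) {
        const double y = lo + (hi - lo) * i / samples;
        worst = std::max(worst, std::fabs(series_log2(y, terms) - std::log2(y)));
    }
    return worst;
}
```

Comparing `max_abs_error(5, 1.0, 1.1, 1000)` with `max_abs_error(5, 1.0, 2.0, 10000)` shows the error is far smaller near y = 1 than near y = 2. A minimax fit trades some of that surplus accuracy near 1 for a lower worst-case error across the interval.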

Vectorization

To leverage the AVX2 instruction set for vector processing, the following steps are implemented:

Extract the exponent bits, subtract the bias, and convert the result to doubles.
Extract the mantissa and force its exponent field to a fixed value so that it lies in a known range such as [1.0, 2.0) or [0.5, 1.0), adjusting the exponent to match.
Evaluate a polynomial approximation of log(x) that is accurate around x = 1.0, using FMA instructions.
Calculate the final log2 result by adding the exponent and the polynomial approximation.
Add special handling for underflow, overflow, and denormal cases (for example zero, negative, infinite, NaN, and subnormal inputs), which the exponent/mantissa split does not cover.
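The last step can be done without branches by comparing and blending. The following is a minimal sketch of one way to patch special inputs into four already-computed results; it is not taken from the article. The helper name `fix_special_cases` is hypothetical, the pointer-based interface is an assumption made so the example can be called from code built without AVX flags, and `__attribute__((target("avx2")))` is a GCC/Clang extension. Subnormal inputs are not handled here.

```cpp
#include <cassert>
#include <cmath>
#include <immintrin.h>
#include <limits>

// Hypothetical helper: x holds four inputs, result holds four log2 values
// computed by a routine that assumed positive, finite, normal inputs.
__attribute__((target("avx2")))
void fix_special_cases(const double* x, double* result) {
    assert(x != nullptr && result != nullptr);

    const __m256d vx = _mm256_loadu_pd(x);
    __m256d r = _mm256_loadu_pd(result);

    const __m256d zero = _mm256_setzero_pd();
    const __m256d inf = _mm256_set1_pd(std::numeric_limits<double>::infinity());
    const __m256d nan = _mm256_set1_pd(std::numeric_limits<double>::quiet_NaN());

    // log2(+0) and log2(-0) are -infinity.
    const __m256d is_zero = _mm256_cmp_pd(vx, zero, _CMP_EQ_OQ);
    r = _mm256_blendv_pd(r, _mm256_sub_pd(zero, inf), is_zero);

    // log2(+infinity) is +infinity.
    const __m256d is_inf = _mm256_cmp_pd(vx, inf, _CMP_EQ_OQ);
    r = _mm256_blendv_pd(r, inf, is_inf);

    // "Not greater-or-equal, unordered" is true for negative inputs and NaN.
    const __m256d is_bad = _mm256_cmp_pd(vx, zero, _CMP_NGE_UQ);
    r = _mm256_blendv_pd(r, nan, is_bad);

    _mm256_storeu_pd(result, r);
}
```

Each fix-up costs one compare and one blend, which is why dropping these checks is attractive when the inputs are known to be well behaved.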

Performance Enhancements

To improve performance or precision:

Use a higher-order polynomial or a ratio of two polynomials for greater precision, at the cost of more arithmetic.
Use AVX-512 where available: its VGETEXPPD and VGETMANTPD instructions extract exponents and mantissas directly.
Remove or simplify the special-case checks if the inputs are known to be finite and positive.

Implementation

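The implementation below uses AVX2 intrinsics for vectorization and FMA instructions to fuse each multiply with the following add. It performs no special-case handling, so it is only valid for positive, finite, normal inputs: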

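#include <immintrin.h>

// The original listing does not define its constants. The values below are
// ASSUMPTIONS chosen to match the series
// ln(y) = 2*(t + t^3/3 + t^5/5 + t^7/7 + t^9/9), with t = (y-1)/(y+1).
const __m256i gDoubleExpMask = _mm256_set1_epi64x(0x7FFLL << 52);  // exponent field
const __m256i gDoubleExp0 = _mm256_set1_epi64x(1023LL << 52);      // exponent of 1.0
const __m256i gTo32bitExp = _mm256_set_epi32(0, 0, 0, 0, 6, 4, 2, 0);  // low dword of each qword
const __m128i gExpNormalizer = _mm_set1_epi32(1023);               // exponent bias
const __m256d gVect1 = _mm256_set1_pd(1.0);
const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
const __m256d gCommMul = _mm256_set1_pd(2.0 / 0.693147180559945309417);  // 2/ln(2)

// Valid for positive, finite, normal inputs only (no special-case handling).
__m256d Log2(__m256d x) {
  // Extract the biased exponents, pack them into the low 128 bits,
  // remove the bias, and convert to double
  const __m256i exps64 = _mm256_srli_epi64(_mm256_and_si256(gDoubleExpMask, _mm256_castpd_si256(x)), 52);
  const __m256i exps32_avx = _mm256_permutevar8x32_epi32(exps64, gTo32bitExp);
  const __m128i exps32_sse = _mm256_castsi256_si128(exps32_avx);
  const __m128i normExps = _mm_sub_epi32(exps32_sse, gExpNormalizer);
  const __m256d expsPD = _mm256_cvtepi32_pd(normExps);

  // Prepare mantissa: clear the exponent field, then set it to that of 1.0,
  // giving y in [1.0, 2.0) with the assumed gDoubleExp0
  const __m256d y = _mm256_or_pd(_mm256_castsi256_pd(gDoubleExp0),
    _mm256_andnot_pd(_mm256_castsi256_pd(gDoubleExpMask), x));

  // Calculate t=(y-1)/(y+1) and t**2
  const __m256d tNum = _mm256_sub_pd(y, gVect1);
  const __m256d tDen = _mm256_add_pd(y, gVect1);
  const __m256d t = _mm256_div_pd(tNum, tDen);
  const __m256d t2 = _mm256_mul_pd(t, t); // t**2

  // Accumulate the odd-power terms, then scale and add the exponent
  const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
  const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
  const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
  const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
  const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
  const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
  const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
  const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);
  const __m256d log2_y = _mm256_mul_pd(terms01234, gCommMul);
  const __m256d log2_x = _mm256_add_pd(log2_y, expsPD);

  return log2_x;
}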


Conclusion

This implementation computes log2() for four doubles at a time using only AVX2 and FMA intrinsics, so it builds with any compiler that supports them and does not depend on Intel's math library. Its accuracy is set by the polynomial and its coefficients, and its speed by how much special-case handling is kept.
