パフォーマンスの探求パートⅢ：C Force-C++-php.cn

The Quest for Performance Part III : C Force

このシリーズの前 2 回の記事では、Perl での浮動演算のパフォーマンスについて検討しました。
関数 cos(sin(sqrt(x))) を計算するおもちゃの例では、Python と R が使用されています。ここで、x は 50M の倍精度浮動小数点数の 非常に大きな 配列です。
算術集中部分を C に委任したハイブリッド実装は、最もパフォーマンスの高い実装の 1 つでした。今回は少し脱線して、おもちゃのサンプルの純粋な C コード実装のパフォーマンスを見ていきます。
C コードは、パフォーマンスに対するメモリの局所性の重要性について、コンテナと比較してさらに洞察を提供します (デフォルトでは、C 配列の要素はメモリ内の連続したアドレスに格納され、PDL やそのようなコンテナとの numpy インターフェイスなどの数値 API) 、
例えばメモリ内の連続したアドレスに値を格納しない Perl 配列。最後に、重要なことですが、C コードの実装により、低レベルコンパイラ (この場合は gcc) の浮動小数点演算に関連するフラグがパフォーマンスに影響を与えるかどうかを評価できるようになります。
この点は強調する価値があります。一般的な人間は、「インストール」を「パイプ」するとき、またはインラインファイルをビルドするときのコンパイラフラグの選択に完全に依存しています。これらのフラグに触れなければ、幸いなことに、何が欠けているのか、あるいは避けている落とし穴に気付かないことになります。
謙虚な C ファイル makefile を使用すると、このようなパフォーマンス評価を明示的に行うことができます。

おもちゃの例の C コード全体を以下に示します。このコードは一目瞭然なので、

に対する 4 つの関数が含まれていることを指摘する以外は説明に時間を費やすことはありません。

高価な関数の非順次計算: 3 つの浮動小数点演算はすべて、1 つのスレッドを使用して単一のループ内で実行されます
高価な関数の逐次計算: 3 つの浮動小数点関数の評価はそれぞれ、1 つのスレッドを使用して別個のループ内で行われます
ノンシーケンシャル OpenMP コード: ノンシーケンシャルコードのスレッドバージョン
シーケンシャル OpenMP コード: シーケンシャルコードのスレッド化

この場合、コンパイラーが、平方根がアセンブリ内のパックされた (ベクトル化された) 浮動小数点演算にマップされることを認識するのに十分賢いので、適切な SIMD 命令を使用して 1 つの関数をベクトル化できることを期待するかもしれません (実際に実行したことに注意してください) OpenMP コードには simd プログラムを使用しないでください)。
おそらく、ベクトル化による高速化により、同じメモリ位置に繰り返しアクセスする (またはしない) ことによるパフォーマンスの低下が相殺される可能性があります。

#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <stdio.h>
#include <omp.h>

// simulates a large array of random numbers
double*  simulate_array(int num_of_elements,int seed);
// OMP environment functions
void _set_openmp_schedule_from_env();
void _set_num_threads_from_env();



// functions to modify C arrays 
void map_c_array(double* array, int len);
void map_c_array_sequential(double* array, int len);
void map_C_array_using_OMP(double* array, int len);
void map_C_array_sequential_using_OMP(double* array, int len);

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: %s <array_size>\n", argv[0]);
        return 1;
    }

    int array_size = atoi(argv[1]);
    // printf the array size
    printf("Array size: %d\n", array_size);
    double *array = simulate_array(array_size, 1234);

    // Set OMP environment
    _set_openmp_schedule_from_env();
    _set_num_threads_from_env();

    // Perform calculations and collect timing data
    double start_time, end_time, elapsed_time;
    // Non-Sequential calculation
    start_time = omp_get_wtime();
    map_c_array(array, array_size);
    end_time = omp_get_wtime();
    elapsed_time = end_time - start_time;
    printf("Non-sequential calculation time: %f seconds\n", elapsed_time);
    free(array);

    // Sequential calculation
    array = simulate_array(array_size, 1234);
    start_time = omp_get_wtime();
    map_c_array_sequential(array, array_size);
    end_time = omp_get_wtime();
    elapsed_time = end_time - start_time;
    printf("Sequential calculation time: %f seconds\n", elapsed_time);
    free(array);

    array = simulate_array(array_size, 1234);
    // Parallel calculation using OMP
    start_time = omp_get_wtime();
    map_C_array_using_OMP(array, array_size);
    end_time = omp_get_wtime();
    elapsed_time = end_time - start_time;
    printf("Parallel calculation using OMP time: %f seconds\n", elapsed_time);
    free(array);

    // Sequential calculation using OMP
    array = simulate_array(array_size, 1234);
    start_time = omp_get_wtime();
    map_C_array_sequential_using_OMP(array, array_size);
    end_time = omp_get_wtime();
    elapsed_time = end_time - start_time;
    printf("Sequential calculation using OMP time: %f seconds\n", elapsed_time);

    free(array);
    return 0;
}



/*
*******************************************************************************
* OMP environment functions
*******************************************************************************
*/
void _set_openmp_schedule_from_env() {
  char *schedule_env = getenv("OMP_SCHEDULE");
  printf("Schedule from env %s\n", getenv("OMP_SCHEDULE"));
  if (schedule_env != NULL) {
    char *kind_str = strtok(schedule_env, ",");
    char *chunk_size_str = strtok(NULL, ",");

    omp_sched_t kind;
    if (strcmp(kind_str, "static") == 0) {
      kind = omp_sched_static;
    } else if (strcmp(kind_str, "dynamic") == 0) {
      kind = omp_sched_dynamic;
    } else if (strcmp(kind_str, "guided") == 0) {
      kind = omp_sched_guided;
    } else {
      kind = omp_sched_auto;
    }
    int chunk_size = atoi(chunk_size_str);
    omp_set_schedule(kind, chunk_size);
  }
}

void _set_num_threads_from_env() {
  char *num = getenv("OMP_NUM_THREADS");
  printf("Number of threads = %s from within C\n", num);
  omp_set_num_threads(atoi(num));
}
/*
*******************************************************************************
* Functions that modify C arrays whose address is passed from Perl in C
*******************************************************************************
*/

double*  simulate_array(int num_of_elements, int seed) {
  srand(seed); // Seed the random number generator
  double *array = (double *)malloc(num_of_elements * sizeof(double));
  for (int i = 0; i < num_of_elements; i++) {
    array[i] =
        (double)rand() / RAND_MAX; // Generate a random double between 0 and 1
  }
  return array;
}

void map_c_array(double *array, int len) {
  for (int i = 0; i < len; i++) {
    array[i] = cos(sin(sqrt(array[i])));
  }
}

void map_c_array_sequential(double* array, int len) {
  for (int i = 0; i < len; i++) {
    array[i] = sqrt(array[i]);
  }
  for (int i = 0; i < len; i++) {
    array[i] = sin(array[i]);
  }
  for (int i = 0; i < len; i++) {
    array[i] = cos(array[i]);
  }
}

void map_C_array_using_OMP(double* array, int len) {
#pragma omp parallel
  {
#pragma omp for schedule(runtime) nowait
    for (int i = 0; i < len; i++) {
      array[i] = cos(sin(sqrt(array[i])));
    }
  }
}

void map_C_array_sequential_using_OMP(double* array, int len) {
#pragma omp parallel
  {
#pragma omp for schedule(runtime) nowait
    for (int i = 0; i < len; i++) {
      array[i] = sqrt(array[i]);
    }
#pragma omp for schedule(runtime) nowait
    for (int i = 0; i < len; i++) {
      array[i] = sin(array[i]);
    }
#pragma omp for schedule(runtime) nowait
    for (int i = 0; i < len; i++) {
      array[i] = cos(array[i]);
    }
  }
}

ログイン後にコピー

重要な問題は、高速フローティングコンパイラフラグの使用 (コードの精度と速度を引き換えにするトリック) がパフォーマンスに影響を与える可能性があるかどうかです。
このコンパイラフラグのないメイクファイルは次のとおりです

CC = gcc
CFLAGS = -O3 -ftree-vectorize  -march=native  -Wall -std=gnu11 -fopenmp -fstrict-aliasing 
LDFLAGS = -fPIE -fopenmp
LIBS =  -lm

SOURCES = inplace_array_mod_with_OpenMP.c
OBJECTS = $(SOURCES:.c=_noffmath_gcc.o)
EXECUTABLE = inplace_array_mod_with_OpenMP_noffmath_gcc

all: $(SOURCES) $(EXECUTABLE)

clean:
    rm -f $(OBJECTS) $(EXECUTABLE)

$(EXECUTABLE): $(OBJECTS)
    $(CC) $(LDFLAGS) $(OBJECTS) $(LIBS) -o $@

%_noffmath_gcc.o : %.c 
    $(CC) $(CFLAGS) -c $< -o $@

ログイン後にコピー

そして、これがこのフラグを持つものです:

CC = gcc
CFLAGS = -O3 -ftree-vectorize  -march=native -Wall -std=gnu11 -fopenmp -fstrict-aliasing -ffast-math
LDFLAGS = -fPIE -fopenmp
LIBS =  -lm

SOURCES = inplace_array_mod_with_OpenMP.c
OBJECTS = $(SOURCES:.c=_gcc.o)
EXECUTABLE = inplace_array_mod_with_OpenMP_gcc

all: $(SOURCES) $(EXECUTABLE)

clean:
    rm -f $(OBJECTS) $(EXECUTABLE)

$(EXECUTABLE): $(OBJECTS)
    $(CC) $(LDFLAGS) $(OBJECTS) $(LIBS) -o $@

%_gcc.o : %.c 
    $(CC) $(CFLAGS) -c $< -o $@

ログイン後にコピー

これら 2 つのプログラムを実行した結果が次のとおりです

-ffast-math なし

OMP_SCHEDULE=guided,1 OMP_NUM_THREADS=8 ./inplace_array_mod_with_OpenMP_noffmath_gcc 50000000
Array size: 50000000
Schedule from env guided,1
Number of threads = 8 from within C
Non-sequential calculation time: 1.12 seconds
Sequential calculation time: 0.95 seconds
Parallel calculation using OMP time: 0.17 seconds
Sequential calculation using OMP time: 0.15 seconds

ログイン後にコピー

-ffast-math を使用する

OMP_SCHEDULE=guided,1 OMP_NUM_THREADS=8 ./inplace_array_mod_with_OpenMP_gcc 50000000
Array size: 50000000
Schedule from env guided,1
Number of threads = 8 from within C
Non-sequential calculation time: 0.27 seconds
Sequential calculation time: 0.28 seconds
Parallel calculation using OMP time: 0.05 seconds
Sequential calculation using OMP time: 0.06 seconds

ログイン後にコピー

次のように Numba コードで fastmath を使用できることに注意してください (デフォルトは fastmath=False)。

@njit(nogil=True,fastmath=True)
def compute_inplace_with_numba(array):
    np.sqrt(array,array)
    np.sin(array,array)
    np.cos(array,array)

ログイン後にコピー

注目に値するいくつかの点:

-ffast-math はパフォーマンスを大幅に向上させます (シングルスレッドコードとマルチスレッドコードの両方で約 300%) が、誤った結果が生成される可能性があります
Fastmath は Numba でも動作しますが、正確さを追求するアプリケーションで避けるべきと同じ理由で避けるべきです
シーケンシャル C シングルスレッドコードは、シングルスレッド PDL や Numpy と同様のパフォーマンスを提供します
少し驚くべきことに、正しい (高速ではない) 計算を使用すると、シーケンシャルコードは非シーケンシャルコードよりも約 20% 高速になります。
当然のことながら、マルチスレッドコードはシングルスレッドコードより高速です :)
numbas が、このかなり単純な関数の C コードよりもどのようにパフォーマンスを 50% 向上させるのか、まだ説明できません。

タイトル: 「パフォーマンスの探求パート III : C フォース」

date: 2024-07-07

In the two prior installments of this series, we considered the performance of floating operations in Perl,
Python and R in a toy example that computed the function cos(sin(sqrt(x))), where x was a very large array of 50M double precision floating numbers.
Hybrid implementations that delegated the arithmetic intensive part to C were among the most performant implementations. In this installment, we will digress slightly and look at the performance of a pure C code implementation of the toy example.
The C code will provide further insights about the importance of memory locality for performance (by default elements in a C array are stored in sequential addresses in memory, and numerical APIs such as PDL or numpy interface with such containers) vis-a-vis containers,
e.g. Perl arrays which do not store their values in sequential addresses in memory. Last, but certainly not least, the C code implementations will allow us to assess whether flags related to floating point operations for the low level compiler (in this case gcc) can affect performance.
This point is worth emphasizing: common mortals are entirely dependent on the choice of compiler flags when "piping" their "install" or building their Inline file. If one does not touch these flags, then one will be blissfully unaware of what they may missing, or pitfalls they may be avoiding.
The humble C file makefile allows one to make such performance evaluations explicitly.

The C code for our toy example is listed in its entirety below. The code is rather self-explanatory, so will not spend time explaining other than pointing out that it contains four functions for

Non-sequential calculation of the expensive function : all three floating pointing operations take place inside a single loop using one thread
Sequential calculations of the expensive function : each of the 3 floating point function evaluations takes inside a separate loop using one thread
Non-sequential OpenMP code : threaded version of the non-sequential code
Sequential OpenMP code: threaded of the sequential code

In this case, one may hope that the compiler is smart enough to recognize that the square root maps to packed (vectorized) floating pointing operations in assembly, so that one function can be vectorized using the appropriate SIMD instructions (note we did not use the simd program for the OpenMP codes).
Perhaps the speedup from the vectorization may offset the loss of performance from repeatedly accessing the same memory locations (or not).

#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <stdio.h>
#include <omp.h>

// simulates a large array of random numbers
double*  simulate_array(int num_of_elements,int seed);
// OMP environment functions
void _set_openmp_schedule_from_env();
void _set_num_threads_from_env();



// functions to modify C arrays 
void map_c_array(double* array, int len);
void map_c_array_sequential(double* array, int len);
void map_C_array_using_OMP(double* array, int len);
void map_C_array_sequential_using_OMP(double* array, int len);

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: %s <array_size>\n", argv[0]);
        return 1;
    }

    int array_size = atoi(argv[1]);
    // printf the array size
    printf("Array size: %d\n", array_size);
    double *array = simulate_array(array_size, 1234);

    // Set OMP environment
    _set_openmp_schedule_from_env();
    _set_num_threads_from_env();

    // Perform calculations and collect timing data
    double start_time, end_time, elapsed_time;
    // Non-Sequential calculation
    start_time = omp_get_wtime();
    map_c_array(array, array_size);
    end_time = omp_get_wtime();
    elapsed_time = end_time - start_time;
    printf("Non-sequential calculation time: %f seconds\n", elapsed_time);
    free(array);

    // Sequential calculation
    array = simulate_array(array_size, 1234);
    start_time = omp_get_wtime();
    map_c_array_sequential(array, array_size);
    end_time = omp_get_wtime();
    elapsed_time = end_time - start_time;
    printf("Sequential calculation time: %f seconds\n", elapsed_time);
    free(array);

    array = simulate_array(array_size, 1234);
    // Parallel calculation using OMP
    start_time = omp_get_wtime();
    map_C_array_using_OMP(array, array_size);
    end_time = omp_get_wtime();
    elapsed_time = end_time - start_time;
    printf("Parallel calculation using OMP time: %f seconds\n", elapsed_time);
    free(array);

    // Sequential calculation using OMP
    array = simulate_array(array_size, 1234);
    start_time = omp_get_wtime();
    map_C_array_sequential_using_OMP(array, array_size);
    end_time = omp_get_wtime();
    elapsed_time = end_time - start_time;
    printf("Sequential calculation using OMP time: %f seconds\n", elapsed_time);

    free(array);
    return 0;
}



/*
*******************************************************************************
* OMP environment functions
*******************************************************************************
*/
void _set_openmp_schedule_from_env() {
  char *schedule_env = getenv("OMP_SCHEDULE");
  printf("Schedule from env %s\n", getenv("OMP_SCHEDULE"));
  if (schedule_env != NULL) {
    char *kind_str = strtok(schedule_env, ",");
    char *chunk_size_str = strtok(NULL, ",");

    omp_sched_t kind;
    if (strcmp(kind_str, "static") == 0) {
      kind = omp_sched_static;
    } else if (strcmp(kind_str, "dynamic") == 0) {
      kind = omp_sched_dynamic;
    } else if (strcmp(kind_str, "guided") == 0) {
      kind = omp_sched_guided;
    } else {
      kind = omp_sched_auto;
    }
    int chunk_size = atoi(chunk_size_str);
    omp_set_schedule(kind, chunk_size);
  }
}

void _set_num_threads_from_env() {
  char *num = getenv("OMP_NUM_THREADS");
  printf("Number of threads = %s from within C\n", num);
  omp_set_num_threads(atoi(num));
}
/*
*******************************************************************************
* Functions that modify C arrays whose address is passed from Perl in C
*******************************************************************************
*/

double*  simulate_array(int num_of_elements, int seed) {
  srand(seed); // Seed the random number generator
  double *array = (double *)malloc(num_of_elements * sizeof(double));
  for (int i = 0; i < num_of_elements; i++) {
    array[i] =
        (double)rand() / RAND_MAX; // Generate a random double between 0 and 1
  }
  return array;
}

void map_c_array(double *array, int len) {
  for (int i = 0; i < len; i++) {
    array[i] = cos(sin(sqrt(array[i])));
  }
}

void map_c_array_sequential(double* array, int len) {
  for (int i = 0; i < len; i++) {
    array[i] = sqrt(array[i]);
  }
  for (int i = 0; i < len; i++) {
    array[i] = sin(array[i]);
  }
  for (int i = 0; i < len; i++) {
    array[i] = cos(array[i]);
  }
}

void map_C_array_using_OMP(double* array, int len) {
#pragma omp parallel
  {
#pragma omp for schedule(runtime) nowait
    for (int i = 0; i < len; i++) {
      array[i] = cos(sin(sqrt(array[i])));
    }
  }
}

void map_C_array_sequential_using_OMP(double* array, int len) {
#pragma omp parallel
  {
#pragma omp for schedule(runtime) nowait
    for (int i = 0; i < len; i++) {
      array[i] = sqrt(array[i]);
    }
#pragma omp for schedule(runtime) nowait
    for (int i = 0; i < len; i++) {
      array[i] = sin(array[i]);
    }
#pragma omp for schedule(runtime) nowait
    for (int i = 0; i < len; i++) {
      array[i] = cos(array[i]);
    }
  }
}

ログイン後にコピー

A critical question is whether the use of fast floating compiler flags, a trick that trades speed for accuracy of the code, can affect performance.
Here is the makefile withut this compiler flag

CC = gcc
CFLAGS = -O3 -ftree-vectorize  -march=native  -Wall -std=gnu11 -fopenmp -fstrict-aliasing 
LDFLAGS = -fPIE -fopenmp
LIBS =  -lm

SOURCES = inplace_array_mod_with_OpenMP.c
OBJECTS = $(SOURCES:.c=_noffmath_gcc.o)
EXECUTABLE = inplace_array_mod_with_OpenMP_noffmath_gcc

all: $(SOURCES) $(EXECUTABLE)

clean:
    rm -f $(OBJECTS) $(EXECUTABLE)

$(EXECUTABLE): $(OBJECTS)
    $(CC) $(LDFLAGS) $(OBJECTS) $(LIBS) -o $@

%_noffmath_gcc.o : %.c 
    $(CC) $(CFLAGS) -c $< -o $@

ログイン後にコピー

and here is the one with this flag:

CC = gcc
CFLAGS = -O3 -ftree-vectorize  -march=native -Wall -std=gnu11 -fopenmp -fstrict-aliasing -ffast-math
LDFLAGS = -fPIE -fopenmp
LIBS =  -lm

SOURCES = inplace_array_mod_with_OpenMP.c
OBJECTS = $(SOURCES:.c=_gcc.o)
EXECUTABLE = inplace_array_mod_with_OpenMP_gcc

all: $(SOURCES) $(EXECUTABLE)

clean:
    rm -f $(OBJECTS) $(EXECUTABLE)

$(EXECUTABLE): $(OBJECTS)
    $(CC) $(LDFLAGS) $(OBJECTS) $(LIBS) -o $@

%_gcc.o : %.c 
    $(CC) $(CFLAGS) -c $< -o $@

ログイン後にコピー

And here are the results of running these two programs

Without -ffast-math

OMP_SCHEDULE=guided,1 OMP_NUM_THREADS=8 ./inplace_array_mod_with_OpenMP_noffmath_gcc 50000000
Array size: 50000000
Schedule from env guided,1
Number of threads = 8 from within C
Non-sequential calculation time: 1.12 seconds
Sequential calculation time: 0.95 seconds
Parallel calculation using OMP time: 0.17 seconds
Sequential calculation using OMP time: 0.15 seconds

ログイン後にコピー

With -ffast-math

OMP_SCHEDULE=guided,1 OMP_NUM_THREADS=8 ./inplace_array_mod_with_OpenMP_gcc 50000000
Array size: 50000000
Schedule from env guided,1
Number of threads = 8 from within C
Non-sequential calculation time: 0.27 seconds
Sequential calculation time: 0.28 seconds
Parallel calculation using OMP time: 0.05 seconds
Sequential calculation using OMP time: 0.06 seconds

ログイン後にコピー

Note that one can use the fastmath in Numba code as follows (the default is fastmath=False):

@njit(nogil=True,fastmath=True)
def compute_inplace_with_numba(array):
    np.sqrt(array,array)
    np.sin(array,array)
    np.cos(array,array)

ログイン後にコピー

A few points that are worth noting:

The -ffast-math gives major boost in performance (about 300% for both the single threaded and the multi-threaded code), but it can generate erroneous results
Fastmath also works in Numba, but should be avoided for the same reasons it should be avoided in any application that strives for accuracy
The sequential C single threaded code gives performance similar to the single threaded PDL and Numpy
Somewhat surprisingly, the sequential code is about 20% faster than the non-sequential code when the correct (non-fast) math is used.
Unsurprisingly, multi-threaded code is faster than single threaded code :)
I still cannot explain how numbas delivers a 50% performance premium over the C code of this rather simple function.

以上がパフォーマンスの探求パートⅢ：C Forceの詳細内容です。詳細については、PHP 中国語 Web サイトの他の関連記事を参照してください。