Loop Unrolling in JavaScript?

王林

Released: 2024-07-24 13:18:52

JavaScript can feel very removed from the hardware it runs on, but thinking low-level can still be useful in limited cases.

A recent post by Kafeel Ahmad on loop optimization detailed a number of techniques for improving loop performance. That article got me thinking about the topic.

Premature Optimization

Just to get this out of the way, this is a technique very few will ever need to consider in web development. Also, focusing on optimization too early can make code harder to write and much harder to maintain. Still, taking a peek at low-level techniques can give us insight into our tools and the work in general, even if we can't apply that knowledge directly.

What is Loop Unrolling?

Loop unrolling basically duplicates the logic inside a loop so you perform multiple operations during each, well, loop. In specific cases, making the code in the loop longer can make it faster.

By intentionally performing some operations in groups rather than one-by-one, the computer may be able to operate more efficiently.

Unrolling Example

Let's take a very simple example: summing values in an array.

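// 1-to-1 looping: one element per iteration
const simpleSum = (data) => {
  let sum = 0;
  for (let i = 0; i < data.length; i += 1) {
    sum += data[i];
  }
  return sum;
};

// Unrolled by two: two elements per iteration, two independent accumulators
const parallelSum = (data) => {
  let sum1 = 0;
  let sum2 = 0;
  let i = 0;
  // Stop before the last element when the length is odd
  for (; i + 1 < data.length; i += 2) {
    sum1 += data[i];
    sum2 += data[i + 1];
  }
  // Pick up the leftover element of an odd-length array;
  // without this, data[i + 1] would be undefined and the sum NaN
  if (i < data.length) {
    sum1 += data[i];
  }
  return sum1 + sum2;
};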

This may look very strange at first. We're managing more variables and performing additional operations that don't happen in the simple example. How can this be faster?!

Measuring the Difference

I ran some comparisons over a variety of data sizes and multiple runs, as well as sequential or interleaved testing. The parallelSum performance varied, but was almost always better, excepting some odd results for very small data sizes. I tested this using RunJS, which is built on Chrome's V8 engine.
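
For anyone who wants to repeat this kind of comparison locally, a minimal timing sketch might look like the following. This is not the RunJS or JSPerf setup used for the numbers in this article: `timeIt` is a hypothetical helper, the run count and test data are arbitrary assumptions, and a hand-rolled timer like this is far noisier than a real benchmarking tool. It relies only on `performance.now()`, which is available in browsers and recent versions of Node.js.

```javascript
// Hypothetical helper (not from the article): average milliseconds per call.
const timeIt = (fn, data, runs = 20) => {
  fn(data); // warm-up call so the engine has a chance to optimize fn
  let result = 0;
  const start = performance.now();
  for (let r = 0; r < runs; r += 1) {
    result = fn(data);
  }
  const msPerRun = (performance.now() - start) / runs;
  return { result, msPerRun };
};

// One element per iteration
const simpleSum = (data) => {
  let sum = 0;
  for (let i = 0; i < data.length; i += 1) {
    sum += data[i];
  }
  return sum;
};

// Two elements per iteration, with a guard for odd-length arrays
const parallelSum = (data) => {
  let sum1 = 0;
  let sum2 = 0;
  let i = 0;
  for (; i + 1 < data.length; i += 2) {
    sum1 += data[i];
    sum2 += data[i + 1];
  }
  if (i < data.length) {
    sum1 += data[i];
  }
  return sum1 + sum2;
};

// Assumed test data: 1 million small integers
const data = Array.from({ length: 1_000_000 }, (_, i) => i % 100);

const simple = timeIt(simpleSum, data);
const parallel = timeIt(parallelSum, data);

console.log(`simpleSum:   ${simple.msPerRun.toFixed(3)} ms per run`);
console.log(`parallelSum: ${parallel.msPerRun.toFixed(3)} ms per run`);
console.log(`same result: ${simple.result === parallel.result}`);
```

The timings will differ from machine to machine and run to run, so treat any single result as a rough indication only. Checking that both functions return the same value is a cheap way to catch mistakes in the unrolled version.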

Different data sizes gave very roughly these results:

Small (< 10k): Rarely any difference
Medium (10k-100k): Typically ~20-80% faster
Large (> 1M): Consistently twice as fast

Then I created a JSPerf with 1 million records to try across different browsers. Try it yourself!

Chrome ran parallelSum twice as fast as simpleSum, as expected from the RunJS testing.

Safari was almost identical to Chrome, both in percents and operations per second.

Firefox on the same system performed almost the same for simpleSum but parallelSum was only about 15% faster, not twice as fast.

This variation sent me looking for more information. While it's nothing definitive, I found a StackOverflow comment from 2016 discussing some of the JS engine issues with loop unrolling. It's an interesting look at how engines and optimizations can affect code in ways we don't expect.

Variations

I tried a third version as well, which added two values in a single operation to see if there was a noticeable difference between one variable and two.
```
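const parallelSum = (data) => {
  let sum = 0;
  let i = 0;
  // Stop before the last element when the length is odd
  for (; i + 1 < data.length; i += 2) {
    sum += data[i] + data[i + 1];
  }
  // Pick up the leftover element of an odd-length array
  if (i < data.length) {
    sum += data[i];
  }
  return sum;
};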
```
Short answer: No. The two "parallel" versions were within the reported margin of error of each other.

So, How Does it Work?

While JavaScript is single-threaded, the interpreters, compilers, and hardware underneath can perform optimizations for us when certain conditions are met.

In the simple example, the operation needs the value i to know what data to fetch, and it needs the latest value of sum to update. Because both of these change in each loop, the computer has to wait for the loop to complete to get more data. While it may seem obvious to us what i += 1 will do, the computer mostly understands "the value will change, check back later", so it has difficulty optimizing.

Our parallel versions load multiple data entries for each value of i. We still depend on sum for each loop, but we can load and process twice as much data per cycle. But that doesn't mean it runs twice as fast.

Deeper Dive

To understand why loop unrolling works, we look to the low-level operation of a computer. Processors with superscalar architectures can have multiple pipelines to perform simultaneous operations. They can support out-of-order execution so operations that don't depend on each other can happen as soon as possible. For some operations, SIMD can perform one action on multiple pieces of data at once. Beyond that we start getting into caching, data fetching, and branch prediction...

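But this is a JavaScript article! We're not going that deep. If you want to know more about processor architectures, AnandTech has some excellent Deep Dives.

Limitations and Drawbacks

Loop unrolling is not magic. There are limits and diminishing returns that come from the program or data size, the complexity of the operation, the computer architecture, and more. But we have only tested one or two operations, and modern computers often support four or more threads.

To try larger increments, I made another JSPerf that processes 1, 2, 4, and 10 records per iteration, and ran it on an Apple M1 Max MacBook Pro running macOS 14.5 Sonoma and on an AMD Ryzen 9 3950X PC running Windows 11.

On the Mac, processing 10 records at a time was 2.5 to 3.5 times faster than the basic loop, but only 12-15% faster than processing 4 records. On the PC we still saw the 2x improvement between 1 and 2 records, but 10 records was only 2% faster than 4 records, which I would not have predicted for a 16-core processor.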
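
As a rough illustration of what a 4-record variant can look like, here is a minimal sketch. It is not the code from the JSPerf test; the name `sumUnrolled4` and the use of a separate cleanup loop for leftover elements are assumptions made for this example.

```javascript
// Hypothetical 4-way unrolled sum (illustrative, not the JSPerf test code).
const sumUnrolled4 = (data) => {
  let sum1 = 0;
  let sum2 = 0;
  let sum3 = 0;
  let sum4 = 0;
  const len = data.length;
  // Largest multiple of 4 that fits inside the array
  const limit = len - (len % 4);
  let i = 0;
  // Main loop: four elements per iteration, four independent accumulators
  for (; i < limit; i += 4) {
    sum1 += data[i];
    sum2 += data[i + 1];
    sum3 += data[i + 2];
    sum4 += data[i + 3];
  }
  // Cleanup loop: handles the 0 to 3 elements the main loop did not reach
  for (; i < len; i += 1) {
    sum1 += data[i];
  }
  return sum1 + sum2 + sum3 + sum4;
};

console.log(sumUnrolled4([1, 2, 3, 4, 5, 6, 7])); // 28
```

The cleanup loop is what keeps the function correct when the array length is not a multiple of the unroll factor. Each larger factor adds more code to maintain, which is worth weighing against the shrinking gains described above.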

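Platforms and Updates

These varied results are a reminder to be careful with optimization. Optimizing for your own computer can lead to worse performance or a worse experience on different hardware. Performance or functionality problems on older or entry-level hardware are common when developers work on fast, powerful machines, and I have run into them many times over my career.

For some sense of scale, an entry-level Chromebook currently available from HP has an Intel Celeron N4120 processor. That is roughly equivalent to my 2013 Core i5-4250U MacBook Air, which has just one-ninth the performance of the M1 Max in a synthetic benchmark. On that 2013 MacBook Air running the latest version of Chrome, the 4-record function was faster than the 10-record function, and it was still 60% faster than the single-record function!

Browsers and standards are constantly changing, too. A routine browser update or a different processor architecture could make optimized code slower than a regular loop. If you find yourself optimizing deeply, you may need to verify that your optimizations are relevant to your consumers, and that they stay relevant.

It reminds me of Nicholas Zakas's book High Performance JavaScript, which I read in 2012. It was a great book with a lot of insight. By 2014, however, many of the significant performance issues identified in the book had been resolved or greatly reduced by browser engine updates, and we could focus more of our effort on writing maintainable code.

If you want to stay on the leading edge of performance optimization, be prepared for change and for regular validation.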

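Lessons from the Past

While researching this topic, I came across a Linux Kernel Mailing List thread from 2000 about removing some loop unrolling optimizations, which ultimately improved the application's performance. It included this still-relevant point (emphasis mine):

The bottom line is that our intuitive assumptions of what's fast and what isn't can often be wrong, especially given how much CPUs have changed over the past couple of years.
– Theodore Ts'o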

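Conclusion

There may be times when you need to squeeze performance out of a loop, and if you are processing enough items, this could be one way to do it. It's good to know about these kinds of optimizations, but for most work, You Aren't Gonna Need It™.

Still, I hope you enjoyed my ramblings, and perhaps they will jog your memory about performance optimization considerations somewhere down the line.

Thanks for reading!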
