NestJS + OpenTelemetry (Sampling)

PHPz

Released: 2024-08-19 17:16:03

Grafana Cloud

In the previous post, I sent OpenTelemetry data to Grafana Cloud, stored it there, and viewed it.

The free tier of Grafana Cloud includes about 50 GB of logs and traces per month. For a service with few users that produces little trace data (or records no logs), that is enough to use as-is. But once you introduce it into a service of even modest scale, I worry that logs and traces will pile up and blow past that quota.

Sampling

Sampling means extracting a part from the whole. The goal here is to reduce the amount of telemetry data that gets stored.

Why sampling is needed

[Figure: traces shown as circles]

There is no need to store every trace (every circle in the figure above). It is enough to keep the important traces (errors, or unusually long execution times) plus a sample of the rest that is representative of the whole (some of the OK traces).
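The retention rule described above can be sketched as a small function. This is only an illustration: the TraceSummary shape, the 1000 ms threshold, and the 10% ratio are assumptions made for this example, not OpenTelemetry types or recommended values.

```typescript
// Hypothetical summary of a finished trace; not an OpenTelemetry type.
interface TraceSummary {
  traceId: string;
  hasError: boolean;
  durationMs: number;
}

// Assumed thresholds, chosen only for illustration.
const SLOW_THRESHOLD_MS = 1000;
const OK_KEEP_RATIO = 0.1;

// Decide whether a finished trace is worth storing.
// `random` is injectable so the decision can be checked deterministically.
function shouldKeep(
  trace: TraceSummary,
  random: () => number = Math.random
): boolean {
  if (trace.hasError) return true; // always keep errors
  if (trace.durationMs >= SLOW_THRESHOLD_MS) return true; // always keep slow traces
  return random() < OK_KEEP_RATIO; // keep a fraction of the remaining OK traces
}

const traces: TraceSummary[] = [
  { traceId: "a", hasError: true, durationMs: 40 },
  { traceId: "b", hasError: false, durationMs: 2500 },
  { traceId: "c", hasError: false, durationMs: 35 },
];
const kept = traces.filter((t) => shouldKeep(t, () => 0.5));
console.log(kept.map((t) => t.traceId)); // with random fixed at 0.5: [ 'a', 'b' ]
```

Note that this rule needs to know whether the trace failed and how long it took, which is only known once the trace has finished.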

Types of Sampling

Sampling can be broadly divided into Head Sampling and Tail Sampling.

Head Sampling

Head sampling makes the sampling decision at the very beginning of a trace. A typical example is simple probabilistic sampling: for instance, only 10% of all traces are kept and the rest are not traced at all.

JavaScript

The OpenTelemetry JavaScript SDK provides TraceIdRatioBasedSampler out of the box.

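import { NodeSDK } from '@opentelemetry/sdk-node';
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

// Keep roughly 10% of traces; the decision is derived from the trace ID.
const samplePercentage = 0.1;

const sdk = new NodeSDK({
  // Other SDK configuration parameters go here
  sampler: new TraceIdRatioBasedSampler(samplePercentage),
});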

Disadvantage

Because the decision is made up front, without looking at what happens later in the trace, important traces can be dropped.
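To make this drawback concrete, here is a simplified sketch of a ratio decision derived from the trace ID. It is not the SDK's actual algorithm: headSample is a hypothetical helper written for this example, and the two trace IDs are made up.

```typescript
// Simplified, hypothetical head sampler: NOT the OpenTelemetry SDK algorithm.
// It reads the first 8 hex characters of the trace ID as an unsigned 32-bit
// number and keeps the trace when that number falls below ratio * 2^32.
function headSample(traceId: string, ratio: number): boolean {
  const prefix = parseInt(traceId.slice(0, 8), 16);
  return prefix < ratio * 0x100000000;
}

// The decision is made when the trace starts, so it cannot know that the
// second trace will later end in an error.
const okTraceId = "0a000000000000000000000000000001";
const failingTraceId = "f0000000000000000000000000000002";

console.log(headSample(okTraceId, 0.1)); // true: 0x0a000000 is below 10% of 2^32
console.log(headSample(failingTraceId, 0.1)); // false: dropped whatever happens later
```

The same trace ID always yields the same decision, which keeps a trace whole across services, but nothing in the input says whether the trace will turn out to matter.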

Tail Sampling

Tail sampling makes the decision at the end, after the spans of a trace have finished. Much more information is available at that point, so you can filter by whatever logic you want.
For example, error traces can always be sampled.
Usually, tail sampling is performed in the collector, once all spans of a trace have been received.

Disadvantages

It can be difficult to implement: the sampling rules have to change whenever the system and its conditions change.
It is difficult to operate, because the sampler must keep state while it waits to make a decision.
Tail samplers are sometimes vendor-specific.
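The statefulness mentioned above can be illustrated with a minimal in-memory sketch. Everything here is hypothetical (the FinishedSpan shape, the BufferingTailSampler class, and the assumption that the root span ends last); a real tail sampler also has to handle timeouts, memory limits, and spans arriving from several services.

```typescript
// Hypothetical minimal span shape; not an OpenTelemetry type.
interface FinishedSpan {
  traceId: string;
  name: string;
  isRoot: boolean;
  isError: boolean;
}

// Sketch of a tail sampler: it holds every span of a trace in memory until
// the root span ends, then keeps or drops the trace as a whole.
class BufferingTailSampler {
  private buffers = new Map<string, FinishedSpan[]>();
  private exportSpans: (spans: FinishedSpan[]) => void;
  private keepOkTrace: (traceId: string) => boolean;

  constructor(
    exportSpans: (spans: FinishedSpan[]) => void,
    keepOkTrace: (traceId: string) => boolean
  ) {
    this.exportSpans = exportSpans;
    this.keepOkTrace = keepOkTrace;
  }

  onSpanEnd(span: FinishedSpan): void {
    const spans = this.buffers.get(span.traceId) ?? [];
    spans.push(span);
    this.buffers.set(span.traceId, spans);
    // Assumption: the root span ends last, so the trace is complete here.
    if (!span.isRoot) return;

    this.buffers.delete(span.traceId);
    const hasError = spans.some((s) => s.isError);
    if (hasError || this.keepOkTrace(span.traceId)) {
      this.exportSpans(spans);
    }
  }

  // Number of traces still waiting for their root span.
  pendingTraces(): number {
    return this.buffers.size;
  }
}

const exported: FinishedSpan[] = [];
const sampler = new BufferingTailSampler(
  (spans) => {
    exported.push(...spans);
  },
  () => false // drop every OK trace in this demo
);
sampler.onSpanEnd({ traceId: "t1", name: "db.query", isRoot: false, isError: true });
sampler.onSpanEnd({ traceId: "t1", name: "GET /users", isRoot: true, isError: false });
sampler.onSpanEnd({ traceId: "t2", name: "GET /health", isRoot: true, isError: false });
console.log(exported.map((s) => s.name)); // [ 'db.query', 'GET /users' ]
```

The Map of buffered spans is exactly the state that has to be stored, bounded, and kept on a single node, which is why tail sampling is harder to operate than head sampling.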

Implementation

Let’s implement a simple form of tail sampling with a custom span processor.

SamplingSpanProcessor implementation

Create a sampling-span-processor.ts file.

import { Context, SpanStatusCode } from "@opentelemetry/api";
import {
  SpanProcessor,
  ReadableSpan,
  Span,
} from "@opentelemetry/sdk-trace-node";

/**
 * Sampling span processor: forwards every error span, plus a ratio of all
 * other spans, to the wrapped span processor.
 */
export class SamplingSpanProcessor implements SpanProcessor {
  constructor(
    private _spanProcessor: SpanProcessor,
    private _ratio: number
  ) {}

  /**
   * Forces to export all finished spans
   */
  forceFlush(): Promise<void> {
    return this._spanProcessor.forceFlush();
  }

  onStart(span: Span, parentContext: Context): void {
    this._spanProcessor.onStart(span, parentContext);
  }

  /**
   * Maps the trace ID to a value in [0, 1) by summing its character codes,
   * so every span of the same trace gets the same decision.
   */
  shouldSample(traceId: string): boolean {
    let accumulation = 0;
    for (let idx = 0; idx < traceId.length; idx++) {
      accumulation += traceId.charCodeAt(idx);
    }
    const cmp = (accumulation % 100) / 100;
    return cmp < this._ratio;
  }

  /**
   * Called when a {@link ReadableSpan} is ended, if the `span.isRecording()`
   * returns true.
   * @param span the Span that just ended.
   */
  onEnd(span: ReadableSpan): void {
    if (span.status.code === SpanStatusCode.ERROR) {
      // Error spans are always exported (UNSET = 0, OK = 1, ERROR = 2).
      this._spanProcessor.onEnd(span);
    } else if (this.shouldSample(span.spanContext().traceId)) {
      // All other spans are exported only at the configured ratio.
      this._spanProcessor.onEnd(span);
    }
  }

  /**
   * Shuts down the processor. Called when SDK is shut down. This is an
   * opportunity for processor to do any cleanup required.
   */
  async shutdown(): Promise<void> {
    return this._spanProcessor.shutdown();
  }
}

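A span is exported, by calling this._spanProcessor.onEnd(span), only when its status.code is 2 (ERROR) or when its trace ID passes the ratio check. Because the ratio check depends only on the trace ID, all non-error spans of the same trace get the same decision.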
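As a rough sanity check of the ratio logic, the hash can be copied into a standalone function and run against random trace IDs. This is a sketch: sampleByTraceId and randomTraceId are names invented for this example, and the IDs are random hex strings rather than IDs produced by the SDK.

```typescript
// Standalone copy of the character-code hash used by shouldSample.
function sampleByTraceId(traceId: string, ratio: number): boolean {
  let accumulation = 0;
  for (let idx = 0; idx < traceId.length; idx++) {
    accumulation += traceId.charCodeAt(idx);
  }
  return (accumulation % 100) / 100 < ratio;
}

// Hypothetical helper: builds a random 32-character hex trace ID.
function randomTraceId(): string {
  let id = "";
  for (let i = 0; i < 32; i++) {
    id += Math.floor(Math.random() * 16).toString(16);
  }
  return id;
}

const total = 10000;
let keptCount = 0;
for (let i = 0; i < total; i++) {
  if (sampleByTraceId(randomTraceId(), 0.1)) keptCount++;
}
console.log(`kept ${keptCount} of ${total}`); // roughly 10%, varies per run
```

The kept share only approximates the configured ratio, since the hash maps each trace ID to one of 100 buckets.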

OTel SDK Update

Update spanProcessors in main.ts.

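const sdk = new NodeSDK({
  // Other SDK configuration parameters go here.
  // Remove the ratio-based `sampler` from before: spans dropped by head
  // sampling never reach onEnd, so their errors would be lost.
  spanProcessors: [
    new SamplingSpanProcessor(
      new BatchSpanProcessor(traceExporter),
      samplePercentage
    ),
  ],
});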
