How Redis prevents frequent master switching in Sentinel mode: troubleshooting monitoring jitter and tuning down-after-milliseconds

Mary-Kate Olsen

Apr 28, 2026 am 07:09 AM

Sentinel wrongly judges the master to be offline when network jitter or a delayed response from the master lasts longer than down-after-milliseconds. Subjective offline (sdown) relies on a single Sentinel's timeout, while objective offline (odown) requires at least quorum Sentinels to agree. When several Sentinels time out at the same moment, however, that agreement is reached and a false failover is triggered.

Why does Sentinel misjudge the master as offline?

When Sentinel switches the master frequently, the root cause is usually not that the configuration is too aggressive, but that network jitter or a delayed response from the master lasts longer than down-after-milliseconds. Subjective offline (sdown) is decided by a single Sentinel when it gets no valid reply to its PING within that window, without any negotiation. Objective offline (odown) requires at least quorum Sentinels to agree, but if several Sentinels time out at the same time because of the same network delay, the quorum is reached quickly and a failover is triggered.
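
The difference between the two states can be illustrated with a small simulation. This is a simplified sketch, not Sentinel's actual implementation: the SentinelView structure and the function names are made up for illustration, and the real protocol also involves leader election before a failover starts.

```python
from dataclasses import dataclass


@dataclass
class SentinelView:
    """One Sentinel's view of the master (hypothetical structure)."""
    name: str
    ms_since_last_valid_reply: int


def is_sdown(view, down_after_ms):
    # Subjective down: this Sentinel alone has had no valid reply for longer
    # than down-after-milliseconds. No other Sentinel is consulted.
    return view.ms_since_last_valid_reply > down_after_ms


def is_odown(views, down_after_ms, quorum):
    # Objective down: at least `quorum` Sentinels consider the master sdown.
    votes = sum(is_sdown(v, down_after_ms) for v in views)
    return votes >= quorum


# A network stall of roughly 3.5 s seen by all three Sentinels at once.
stalled = [
    SentinelView("s1", 3500),
    SentinelView("s2", 3600),
    SentinelView("s3", 3400),
]
print(is_odown(stalled, down_after_ms=3000, quorum=2))  # True: false failover risk
print(is_odown(stalled, down_after_ms=5000, quorum=2))  # False: stall is absorbed
```

The same stall triggers odown with a 3000 ms window and is ignored with a 5000 ms window, which is why the threshold has to be sized against real jitter rather than picked by habit.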

Common causes include: occasional packet loss on the internal network, brief CPU saturation on the master that delays INFO or PING replies, Sentinel and the master deployed in different data centers, and excessive load on Sentinel itself (for example, monitoring a large number of instances without tuning).

How to set down-after-milliseconds reasonably?

Smaller is not always better, and the value should not be blindly set to 5000 ms or left at the 30000 ms default either. It should be greater than 3 to 5 times the sum of the normal network RTT and the master's average command processing time, with a buffer on top. Recommendations from practical testing:

LAN (same data center) deployment: start from 3000, watch how often +sdown appears in the log, and increase gradually to 5000 ~ 8000.
Cross-data-center or high-latency link: set to at least 15000. A ping-reply-timeout directive is sometimes suggested alongside it, but it is not listed in the stock sentinel.conf, so verify it against your Redis version before relying on it.
If the master often runs slow commands (such as an unoptimized KEYS *), use slowlog-log-slower-than together with monitoring to determine how long the stalls last. down-after-milliseconds should be significantly longer than that.
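
The sizing rule above can be written as a small calculation. This is a rough sketch under stated assumptions: the function name, the 1000 ms margin, and reading "significantly longer" as twice the longest observed stall are illustrative choices, not values from Redis.

```python
def suggest_down_after_ms(rtt_ms, avg_cmd_ms, longest_stall_ms=0.0,
                          factor=5, margin_ms=1000, floor_ms=3000):
    """Suggest a down-after-milliseconds value from measured latencies."""
    # 3 to 5 times (RTT + average command time), plus a buffer margin.
    base = factor * (rtt_ms + avg_cmd_ms) + margin_ms
    # "Significantly longer" than the longest stall, taken here as 2x (assumption).
    stall_guard = 2 * longest_stall_ms
    # Never go below the chosen starting point (3000 for LAN, 15000 cross-DC).
    return int(max(base, stall_guard, floor_ms))


print(suggest_down_after_ms(0.5, 0.2))                        # LAN, no stalls
print(suggest_down_after_ms(1, 1, longest_stall_ms=4000))     # LAN with 4 s stalls
print(suggest_down_after_ms(40, 1, longest_stall_ms=6000,
                            floor_ms=15000))                  # cross-data-center
```

On a healthy LAN the floor dominates, which matches the advice to start from 3000 and only raise the value when the logs show sdown events.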

After changing the value, either restart the Sentinel processes one by one or apply it at runtime with SENTINEL SET <master-name> down-after-milliseconds <ms> on every Sentinel. Editing the configuration file alone does not affect a running Sentinel.
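
Because SENTINEL SET only affects the Sentinel that receives it, the change has to be sent to each one. The sketch below assumes every client object exposes an execute_command(*args) method, as redis-py's redis.Redis does. The apply_down_after function and the RecordingClient stand-in are hypothetical names used only for a dry run without a live server.

```python
def apply_down_after(sentinel_clients, master_name, down_after_ms):
    """Send SENTINEL SET to every Sentinel; return labels of those that failed."""
    failed = []
    for label, client in sentinel_clients.items():
        try:
            client.execute_command(
                "SENTINEL", "SET", master_name,
                "down-after-milliseconds", str(down_after_ms),
            )
        except Exception:
            # Keep going so one unreachable Sentinel does not block the rest.
            failed.append(label)
    return failed


class RecordingClient:
    """Stand-in for a real connection: records commands instead of sending them."""

    def __init__(self):
        self.calls = []

    def execute_command(self, *args):
        self.calls.append(args)
        return "OK"


clients = {"s1": RecordingClient(), "s2": RecordingClient(), "s3": RecordingClient()}
print(apply_down_after(clients, "mymaster", 5000))  # [] when every call succeeds
```

Any label returned in the list marks a Sentinel that still runs with the old value. Leaving Sentinels on different thresholds makes sdown timing inconsistent, so those should be retried before the change is considered done.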

What monitoring indicators can detect jitter in advance?

Don't watch only the +switch-master log entries. The more useful signals are in Sentinel's own metrics (the Sentinel section of INFO):

sentinel_masters: the number of masters this Sentinel monitors. An unexpected change, including a drop to 0, means the monitoring configuration itself changed. Lost contact shows up in the status field (ok, sdown, odown) of each master0:... line, and every Sentinel reporting sdown at once likely indicates a network partition.
sentinel_running_scripts: a value that stays above 0 indicates that a notification or reconfiguration script is stuck, possibly because the notification service times out.
sentinel_tilt: a value of 1 means the Sentinel has entered TILT mode (it detected an abnormal clock jump or a long stall of its own process). While in TILT mode Sentinel keeps monitoring but stops acting on what it sees. Check system time synchronization (ntpq -p) and CPU load.
connected_clients and used_cpu_sys on the master: if they spike, correlate the spike with the +sdown / +odown timestamps in the Sentinel log to determine whether the master itself caused the problem.
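
These checks can be scripted against the text returned by INFO sentinel. The sketch below parses that text and lists warning conditions. The sample output is illustrative rather than captured from a real deployment, and the function names are hypothetical.

```python
def parse_info_sentinel(text):
    """Split INFO sentinel output into scalar fields and per-master dicts."""
    scalars, masters = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if key.startswith("master") and "=" in value:
            # e.g. master0:name=mymaster,status=ok,address=...,slaves=2,sentinels=3
            fields = dict(item.split("=", 1) for item in value.split(","))
            masters[fields["name"]] = fields
        else:
            scalars[key] = value
    return scalars, masters


def sentinel_warnings(scalars, masters):
    """Return human-readable warnings for the conditions described above."""
    warnings = []
    if scalars.get("sentinel_tilt") == "1":
        warnings.append("TILT mode active: check clock sync and CPU load")
    if int(scalars.get("sentinel_running_scripts", "0")) > 0:
        warnings.append("scripts still running: check for a stuck script")
    for name, fields in masters.items():
        if fields.get("status") != "ok":
            warnings.append(f"master {name} is {fields.get('status')}")
    return warnings


SAMPLE = """# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:1
master0:name=mymaster,status=sdown,address=10.0.0.1:6379,slaves=2,sentinels=3
"""

for warning in sentinel_warnings(*parse_info_sentinel(SAMPLE)):
    print(warning)
```

Running the same check against every Sentinel and comparing the results helps separate a problem on the master (all Sentinels report sdown) from a problem local to one Sentinel.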

It is recommended to scrape the redis_sentinel_* metrics with Prometheus and to keep a time histogram of +odown events, so that bursts concentrated within a few minutes stand out.
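
Where no metrics pipeline is available, a similar picture can be built from the Sentinel log itself. The sketch below counts +odown events per master in fixed time windows and reports the windows with repeated events. The log line layout in the regular expression is an assumption based on the usual Redis log format and may need adjusting for your version.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed layout: "<pid>:X 28 Apr 2026 07:09:01.123 # +odown master <name> <ip> <port> ..."
ODOWN = re.compile(
    r"^\d+:X (\d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2})\.\d+ \S \+odown master (\S+)"
)
EPOCH = datetime(1970, 1, 1)


def odown_bursts(lines, window_s=300, threshold=2):
    """Return {(master, window_start): count} for windows with >= threshold events."""
    buckets = Counter()
    for line in lines:
        match = ODOWN.match(line)
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%d %b %Y %H:%M:%S")
        # Floor the timestamp to the start of its window.
        seconds = int((ts - EPOCH).total_seconds()) // window_s * window_s
        buckets[(match.group(2), EPOCH + timedelta(seconds=seconds))] += 1
    return {key: n for key, n in buckets.items() if n >= threshold}


log = [
    "4021:X 28 Apr 2026 07:09:01.123 # +odown master mymaster 10.0.0.1 6379 #quorum 2/2",
    "4021:X 28 Apr 2026 07:09:40.456 # +odown master mymaster 10.0.0.1 6379 #quorum 2/2",
    "4021:X 28 Apr 2026 07:20:00.000 # +odown master mymaster 10.0.0.1 6379 #quorum 2/2",
]
for (master, start), count in odown_bursts(log).items():
    print(master, start, count)
```

Windows that repeatedly exceed the threshold at similar times of day point to a periodic cause, such as a scheduled job on the master or a backup saturating the link.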

Besides parameter tuning, what other hard safeguards are there?

Parameters are only the baseline; the architecture matters more:

Use an odd number of Sentinels (3 or 5) and spread them across different physical machines or availability zones, so that a single host failure cannot take down the majority of Sentinels.
Do not lower sentinel failover-timeout below its default of 180000 ms (for example, to 60000). A short value can force another failover attempt before the network has recovered.
Turn off unnecessary Sentinel scripts (such as sentinel notification-script), or make sure the script is idempotent and bounded in time (for example, timeout 5s ./notify.sh).
Set tcp-keepalive 60 on the master to reduce false timeouts caused by intermediate devices (such as a cloud vendor's SLB) dropping idle connections.
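
The reasoning behind the odd count and the spread across hosts is simple majority arithmetic, sketched below. The function names and the placement dictionaries are illustrative only.

```python
def majority(n):
    """Smallest number of Sentinels that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many Sentinels can be lost while a majority still remains."""
    return n - majority(n)


def survives_single_host_loss(placement):
    """placement maps host name -> number of Sentinels running on that host."""
    total = sum(placement.values())
    # Losing any one host must still leave a majority of the original total.
    return all(total - count >= majority(total) for count in placement.values())


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n))

print(survives_single_host_loss({"host-a": 2, "host-b": 1}))
print(survives_single_host_loss({"host-a": 1, "host-b": 1, "host-c": 1}))
```

Four Sentinels tolerate the same single failure as three, so the fourth only adds cost. Placing two of three Sentinels on one host means that host's failure removes the majority.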

The most easily overlooked factor is Sentinel's own resource limits. The more masters (and their replicas) a single Sentinel monitors, the more heartbeat handling can be delayed, and the practical effect of down-after-milliseconds is compromised. Treat a figure such as 50 masters per Sentinel as a rule of thumb rather than a documented limit. Monitor sentinel_masters together with the Sentinel process's CPU usage, and split monitoring across more Sentinel groups when the load grows.
