Detailed explanation of Redis' high availability and high concurrency mechanisms

1. High concurrency mechanism

Redis executes commands on a single thread, and a standalone instance can typically handle only tens of thousands of requests per second. To serve hundreds of thousands of concurrent requests under heavy data volumes, Redis relies on a master-slave architecture with read/write separation.

1. Master-slave replication

This article does not dwell on how to configure Redis master-slave replication; it focuses on the principle and process. Master-slave replication needs one master node, with multiple slave nodes built around it. When a slave starts, it sends a PSYNC command to the master. If the slave is reconnecting, the master copies only the data the slave is missing. If the slave is connecting for the first time, a full resynchronization is triggered: the master starts a background process to generate an RDB snapshot file and, in the meantime, buffers the write commands it receives in memory. Once the RDB file is generated, the master sends it to the slave, which first writes it to disk and then loads it into memory. Finally, the master sends the buffered write commands to the slave. If a master-slave network failure occurs and multiple slaves reconnect, the master generates only one RDB to serve all of them.

Resumable replication: both the master and the slave keep a replication offset along with the master's run id, and the master keeps recently replicated data in its backlog. When the slave reconnects after a network failure, it asks the master to continue copying from its last replication offset. If the master cannot find that offset in the backlog, a full resynchronization is triggered.

①The complete process of replication

(1) The slave node starts and only saves the master node's information, including its host and port, but the replication process has not started yet. The master host and port come from the slaveof configuration in redis.conf.
(2) A scheduled task inside the slave node checks every second whether there is a new master node to connect to and replicate from. If it finds one, it establishes a socket connection with the master node.
(3) The slave node sends a ping command to the master node.
(4) Password authentication: if the master sets requirepass, the slave node must send the masterauth password for authentication.
(5) The master node performs a full replication the first time and sends all of its data to the slave node.
(6) The master node keeps receiving write commands and copies them asynchronously to the slave node.

②Core mechanisms of data synchronization

These are the detailed mechanisms involved in the full replication performed when the slave connects to the master for the first time.

(1) Both the master and the slave maintain an offset

The master continuously accumulates its own offset, and so does the slave.
The slave reports its offset to the master every second, and the master also saves the offset of each slave.

The offset is not used only for full replication. The main point is that both the master and the slave need to know their own data offsets in order to detect data inconsistency between them.

(2) backlog

The master node has a backlog, whose default size is 1MB.
When the master node copies data to a slave node, it also writes a copy of that data into the backlog.
The backlog is mainly used for incremental replication after a full replication is interrupted.
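
The backlog behaviour described above can be sketched as a small in-memory model. This is only an illustration under simplifying assumptions, not Redis's actual implementation: the class name ReplicationBacklog and its methods are hypothetical, and offsets are counted as "bytes already received".

```python
class ReplicationBacklog:
    """Toy fixed-size backlog that keeps only the most recent `size` bytes."""

    def __init__(self, size=1024 * 1024):  # 1MB, matching the default above
        self.size = size
        self.buffer = bytearray()
        self.master_offset = 0  # total bytes the master has ever replicated

    def feed(self, data: bytes) -> None:
        """Append replicated bytes and evict the oldest ones beyond `size`."""
        self.buffer.extend(data)
        self.master_offset += len(data)
        if len(self.buffer) > self.size:
            del self.buffer[: len(self.buffer) - self.size]

    def first_offset(self) -> int:
        """Offset of the oldest byte still held in the backlog."""
        return self.master_offset - len(self.buffer)

    def read_from(self, offset: int):
        """Return the bytes after `offset`, or None if they were evicted."""
        if offset < self.first_offset() or offset > self.master_offset:
            return None  # the caller would have to fall back to full replication
        return bytes(self.buffer[offset - self.first_offset():])
```

A slave whose offset is still covered by the backlog gets only the missing bytes; a slave that fell too far behind gets None, which corresponds to the full replication case.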

(3) master run id

Run info server to see the master run id.
Locating the master node by host and IP alone is unreliable. If the master node restarts or its data changes, the slave node should tell the difference by the run id: if the run id is different, a full replication is performed.
If you need to restart Redis without changing the run id, you can use the redis-cli debug reload command.

(4) psync

The slave node uses psync to replicate from the master node, sending psync runid offset.
The master node responds according to its own state: it may return FULLRESYNC runid offset, which triggers full replication, or CONTINUE, which triggers incremental replication.
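
The master's choice between the two replies can be sketched as a single function. This is a simplified model, not Redis source code: the function name handle_psync and its parameters are hypothetical, and the real protocol has more cases than the two shown here.

```python
def handle_psync(master_run_id, master_offset, backlog_start,
                 requested_run_id, requested_offset):
    """Decide how a master answers `psync runid offset` (simplified model).

    backlog_start is the oldest offset still available in the backlog.
    """
    same_master = requested_run_id == master_run_id
    in_backlog = backlog_start <= requested_offset <= master_offset
    if same_master and in_backlog:
        # The missing data is still in the backlog: incremental replication.
        return "+CONTINUE"
    # Unknown run id or evicted offset: full replication.
    return f"+FULLRESYNC {master_run_id} {master_offset}"
```

For example, a slave connecting for the first time has no run id or offset to offer (modelled here as "?" and -1), so it always receives FULLRESYNC.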

③Full replication

(1) The master executes bgsave and generates an RDB snapshot file locally.
(2) The master node sends the RDB snapshot file to the slave node. If the RDB transfer takes longer than 60 seconds (repl-timeout), the slave node considers the replication failed; adjust this parameter as appropriate.
(3) A machine with a gigabit network card typically transfers about 100MB per second, so a 6GB file is likely to take longer than 60 seconds.
(4) While the master node is generating the RDB, it buffers all new write commands in memory. After the slave node has saved the RDB, the master copies those new write commands to the slave node.
(5) client-output-buffer-limit slave 256MB 64MB 60: if, during replication, the output buffer stays above 64MB for 60 consecutive seconds or exceeds 256MB at any moment, replication stops and fails.
(6) After the slave node receives the RDB, it clears its old data and loads the RDB into memory. While the transfer is in progress, it continues to serve requests based on the old version of the data.
(7) If the slave node has AOF enabled, it immediately executes BGREWRITEAOF to rewrite the AOF.
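
The client-output-buffer-limit rule in step (5) combines a hard limit with a soft limit that must be exceeded for a sustained period. A minimal sketch of that check, assuming the buffer size is sampled at known timestamps (the function name and the sampling model are hypothetical, not how Redis tracks it internally):

```python
MB = 1024 * 1024

def buffer_limit_exceeded(samples, hard_limit=256 * MB,
                          soft_limit=64 * MB, soft_seconds=60):
    """Return True if the output buffer would cause replication to be aborted.

    samples: list of (timestamp_seconds, buffer_bytes) in time order.
    """
    over_since = None  # when the buffer first went above the soft limit
    for ts, size in samples:
        if size > hard_limit:
            return True  # hard limit: fails immediately
        if size > soft_limit:
            if over_since is None:
                over_since = ts
            elif ts - over_since >= soft_seconds:
                return True  # soft limit held for too long
        else:
            over_since = None  # dropped back below the soft limit, reset
    return False
```

A short spike above 64MB is tolerated as long as the buffer drains again before the 60 seconds elapse.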

RDB generation, copying the RDB over the network, clearing old data on the slave, and the slave's AOF rewrite are all very time-consuming.

If the amount of data to copy is between 4GB and 6GB, the full replication is likely to take 1.5 to 2 minutes.
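
The transfer-time concern from the gigabit example can be checked with simple arithmetic. The sketch below assumes a steady throughput of 100MB per second and counts only the network transfer, not RDB generation or loading; the function name is hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def rdb_transfer_seconds(rdb_bytes, bandwidth_bytes_per_sec):
    """Estimated time to ship an RDB file at a steady bandwidth."""
    return rdb_bytes / bandwidth_bytes_per_sec

REPL_TIMEOUT = 60  # seconds, the repl-timeout value mentioned above

seconds = rdb_transfer_seconds(6 * GB, 100 * MB)  # 6144MB / 100MB/s = 61.44s
exceeds_timeout = seconds > REPL_TIMEOUT          # True for a 6GB file
```

Under these assumptions a 6GB file already crosses the 60-second limit on transfer alone, which is why repl-timeout may need to be raised for large datasets.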

④Incremental replication

(1) If the master-slave network connection drops during full replication, incremental replication is triggered when the slave reconnects to the master.
(2) The master fetches the missing portion of data directly from its own backlog and sends it to the slave node. The default backlog size is 1MB.
(3) The master locates the data in the backlog based on the offset in the psync command sent by the slave.

⑤heartbeat

The master and slave nodes will send heartbeat information to each other

By default, the master sends a heartbeat every 10 seconds, and the slave node sends one every second.

⑥Asynchronous replication

Every time the master receives a write command, it first writes the data internally and then sends it asynchronously to the slave nodes.

2. Read/write separation: the master handles write operations, and the slaves take over read queries to reduce the load on the master.
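
On the client side, read/write separation comes down to routing each command to the right node. The sketch below is illustrative only: ReadWriteRouter, the READ_COMMANDS subset, and the addresses are hypothetical, and a real client would also have to deal with replication lag and failover.

```python
import itertools

# Illustrative subset of read-only commands; everything else goes to the master.
READ_COMMANDS = {"GET", "MGET", "EXISTS", "HGET", "LRANGE"}

class ReadWriteRouter:
    """Send writes to the master and spread reads over slaves round-robin."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves) if slaves else None

    def route(self, command: str) -> str:
        """Return the address of the node that should run `command`."""
        name = command.split()[0].upper()
        if name in READ_COMMANDS and self._slaves is not None:
            return next(self._slaves)
        return self.master
```

Routing unknown commands to the master is the conservative choice here, since sending a write to a read-only slave would fail.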

2. High availability mechanism

A one-master, multi-slave deployment solves the high concurrency problem, but there is still only one master. If the master goes down, the system can no longer perform write operations and the slaves can no longer synchronize data, so the whole system becomes unavailable. Redis's high availability mechanism is the sentinel mechanism. Sentinel is an important component of a Redis cluster, responsible for cluster monitoring, message notification, failover, and acting as a configuration center.

(1) Cluster monitoring: checks whether the Redis master and slave processes are working normally.
(2) Message notification: if a Redis instance fails, the sentinel sends an alarm notification to the administrator.
(3) Failover: if the master node goes down, one of the slave nodes is automatically promoted to master.
(4) Configuration center: if a failover occurs, the sentinel notifies clients of the new master address.

Sentinel is itself distributed: it runs as a cluster of sentinels that work together.

When the master node is judged to be down, the agreement of a majority of sentinels is required. This involves a distributed election.

The sentinel mechanism needs at least 3 nodes to be robust. Suppose a test deployment has only two nodes, a master node and a slave node, each running a sentinel that monitors both (s1 on the master node and s2 on the slave node). When the master host goes down, the sentinels must hold an election, but s1 went down with the master node, so only s2 on the slave node can take part. After the election, a failover must be carried out, and the majority parameter specifies how many sentinels must authorize it. With two sentinels the majority is 2, but only s2 is left, so there is no majority and the failover cannot proceed. That is why at least 3 nodes are needed.
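
The two-node problem follows directly from how a majority is counted. A minimal sketch, assuming "majority" means more than half of all configured sentinels (the function names are hypothetical):

```python
def majority(total_sentinels: int) -> int:
    """Smallest number of sentinels that is more than half of the total."""
    return total_sentinels // 2 + 1

def can_failover(total_sentinels: int, alive_sentinels: int) -> bool:
    """A failover needs authorization from a majority of all sentinels."""
    return alive_sentinels >= majority(total_sentinels)

two_node_ok = can_failover(total_sentinels=2, alive_sentinels=1)    # False
three_node_ok = can_failover(total_sentinels=3, alive_sentinels=2)  # True
```

With 2 sentinels the majority is 2, so losing one blocks the failover; with 3 sentinels the majority is still 2, so the cluster survives the loss of one node.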

3. Data loss issues arising from high availability and high concurrency

(1) Data loss caused by asynchronous replication

Because replication from master to slave is asynchronous, some data may not have been copied to the slave when the master crashes, and that data is lost.

(2) Data loss caused by split brain

Split brain means that the machine hosting a master suddenly drops off the normal network and cannot connect to the slave machines, while the master itself is in fact still running.

At this point, the sentinels may conclude that the master is down, start an election, and promote one of the slaves to master.

The cluster now has two masters, which is the so-called split brain.

Although a slave has been promoted to master, clients may not have switched to the new master yet, and the data they keep writing to the old master may be lost.

When the old master recovers, it is attached to the new master as a slave: its own data is cleared, and it copies the data from the new master again.

Solution to data loss caused by asynchronous replication and split-brain

min-slaves-to-write 1
min-slaves-max-lag 10

These settings require that at least 1 slave has a replication and synchronization delay of no more than 10 seconds.

Once the replication and synchronization delay of every slave exceeds 10 seconds, the master stops accepting write requests.

Together, the two settings reduce the data loss caused by asynchronous replication and split brain.
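
The effect of min-slaves-to-write and min-slaves-max-lag can be modelled as a check the master runs before accepting a write. This is a simplified sketch, not Redis's implementation: the function name is hypothetical, and the lag is modelled as the number of seconds since each slave's last ACK.

```python
def master_accepts_writes(slave_lags, min_slaves_to_write=1,
                          min_slaves_max_lag=10):
    """Return True if enough slaves have acknowledged recently.

    slave_lags: seconds since the last ACK from each connected slave.
    """
    healthy = sum(1 for lag in slave_lags if lag <= min_slaves_max_lag)
    return healthy >= min_slaves_to_write
```

With the values shown above, master_accepts_writes([2, 15]) is True because one slave is within 10 seconds, while master_accepts_writes([12, 15]) and master_accepts_writes([]) are both False, so an isolated master stops taking writes.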

(1) Reduce data loss caused by asynchronous replication

With the min-slaves-max-lag setting, once the slaves' replication and ACK delay grows too long, the master assumes that too much data would be lost if it went down, and rejects write requests. This keeps the data loss caused by unsynchronized data when the master goes down within a controllable range.

(2) Reduce the data loss caused by split brain

If a master suffers a split brain and loses its connection to the slaves, the two settings above ensure that when it cannot keep sending data to the specified number of slaves, and no slave has sent it an ACK for more than 10 seconds, it rejects client write requests directly.

In this way, the old master after a split brain does not accept new data from clients, which limits data loss.

The configuration above ensures that if the master loses its connection to every slave and finds that no slave has sent an ACK for 10 seconds, it rejects new write requests.

Therefore, in a split-brain scenario, at most 10 seconds of data is lost.
