
    Analysis of a Redis Cluster Failure

    小云云, 2017-12-14 15:04:06 (original article)

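    Redis Cluster is the distributed mode of Redis, designed to keep working when individual nodes fail. It has no central or "most important" node, and one of its main design goals is linear scalability: nodes can be added and removed as needed.

    This article walks through the analysis of a Redis Cluster failure in detail. Hopefully it is useful to you.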

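    Symptom:

    The business layer reported that queries to Redis were failing.

    Cluster layout:

    3 masters and 3 replicas (slaves), with about 8 GB of data per node.

    Machine placement:

    All machines are in the same rack: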

    xx.x.xxx.199
    xx.x.xxx.200
    xx.x.xxx.201

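    redis-server process status:

    Checking with the command ps -eo pid,lstart | grep $pid

    showed that the process had been running continuously for 3 months.

    Cluster node status before the failure: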

    xx.x.xxx.200:8371(bedab2c537fe94f8c0363ac4ae97d56832316e65) master
    xx.x.xxx.199:8373(792020fe66c00ae56e27cd7a048ba6bb2b67adb6) slave
    xx.x.xxx.201:8375(5ab4f85306da6d633e4834b4d3327f45af02171b) master
    xx.x.xxx.201:8372(826607654f5ec81c3756a4a21f357e644efe605a) slave
    xx.x.xxx.199:8370(462cadcb41e635d460425430d318f2fe464665c5) master
    xx.x.xxx.200:8374(1238085b578390f3c8efa30824fd9a4baba10ddf) slave

    ---------------------------------Log analysis follows--------------------------------------

    Step 1:
    Master 8371 loses its connection to replica 8373:
    46590:M 09 Sep 18:57:51.379 # Connection with slave xx.x.xxx.199:8373 lost.

    Step 2:
    Masters 8370/8375 mark 8371 as failing:
    42645:M 09 Sep 18:57:50.117 * Marking node bedab2c537fe94f8c0363ac4ae97d56832316e65 as failing (quorum reached).

    Step 3:
    Replicas 8372/8373/8374 receive the FAIL message from master 8375 saying that 8371 is down:
    46986:S 09 Sep 18:57:50.120 * FAIL message received from 5ab4f85306da6d633e4834b4d3327f45af02171b about bedab2c537fe94f8c0363ac4ae97d56832316e65

    Step 4:
    Masters 8370/8375 grant the failover request of 8373, authorizing its promotion to master:
    42645:M 09 Sep 18:57:51.055 # Failover auth granted to 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 for epoch 16

    Step 5:
    The former master 8371 detects the configuration change and reconfigures itself as a replica of 8373:
    46590:M 09 Sep 18:57:51.488 # Configuration change detected. Reconfiguring myself as a replica of 792020fe66c00ae56e27cd7a048ba6bb2b67adb6

    Step 6:
    Masters 8370/8375/8373 clear the FAIL state of 8371:
    42645:M 09 Sep 18:57:51.522 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: master without slots is reachable again.

    Step 7:
    The new replica 8371 starts its first full resync from the new master 8373:
    Log of 8373:
    4255:M 09 Sep 18:57:51.906 * Full resync requested by slave xx.x.xxx.200:8371
    4255:M 09 Sep 18:57:51.906 * Starting BGSAVE for SYNC with target: disk
    4255:M 09 Sep 18:57:51.941 * Background saving started by pid 5230
    Log of 8371:
    46590:S 09 Sep 18:57:51.948 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440721826993

    Step 8:
    Masters 8370/8375 mark 8373 (the new master) as failing:
    42645:M 09 Sep 18:58:00.320 * Marking node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 as failing (quorum reached).

    Step 9:
    Masters 8370/8375 see 8373 (the new master) as reachable again and clear its FAIL state:
    60295:M 09 Sep 18:58:18.181 * Clear FAIL state for node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6: is reachable again and nobody is serving its slots after some time.

    Step 10:
    Master 8373 finishes the BGSAVE required for the full resync:
    5230:C 09 Sep 18:59:01.474 * DB saved on disk
    5230:C 09 Sep 18:59:01.491 * RDB: 7112 MB of memory used by copy-on-write
    4255:M 09 Sep 18:59:01.877 * Background saving terminated with success

    Step 11:
    Replica 8371 starts receiving data from master 8373:
    46590:S 09 Sep 18:59:02.263 * MASTER <-> SLAVE sync: receiving 2657606930 bytes from master

    Step 12:
    Master 8373 closes the connection to replica 8371 because the output buffer for that replica exceeded the configured limit:
    4255:M 09 Sep 19:00:19.014 # Client id=14259015 addr=xx.x.xxx.200:21772 fd=844 name= age=148 idle=148 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=4103 omem=95944066 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
    4255:M 09 Sep 19:00:19.015 # Connection with slave xx.x.xxx.200:8371 lost.

    Step 13:
    Replica 8371 fails to sync from master 8373 because the connection was dropped; the first full resync has failed:
    46590:S 09 Sep 19:00:19.018 # I/O error trying to sync with MASTER: connection lost
    46590:S 09 Sep 19:00:20.102 * Connecting to MASTER xx.x.xxx.199:8373
    46590:S 09 Sep 19:00:20.102 * MASTER <-> SLAVE sync started

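    Step 14:
    Replica 8371 tries to sync again, but the attempt is rejected because master 8373 has reached its maximum number of client connections: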
    46590:S 09 Sep 19:00:21.103 * Connecting to MASTER xx.x.xxx.199:8373
    46590:S 09 Sep 19:00:21.103 * MASTER <-> SLAVE sync started
    46590:S 09 Sep 19:00:21.104 * Non blocking connect for SYNC fired the event.
    46590:S 09 Sep 19:00:21.104 # Error reply to PING from master: '-ERR max number of clients reached'

    Step 15:
    Replica 8371 reconnects to master 8373 and starts the second full resync:
    Log of 8371:
    46590:S 09 Sep 19:00:49.175 * Connecting to MASTER xx.x.xxx.199:8373
    46590:S 09 Sep 19:00:49.175 * MASTER <-> SLAVE sync started
    46590:S 09 Sep 19:00:49.175 * Non blocking connect for SYNC fired the event.
    46590:S 09 Sep 19:00:49.176 * Master replied to PING, replication can continue...
    46590:S 09 Sep 19:00:49.179 * Partial resynchronization not possible (no cached master)
    46590:S 09 Sep 19:00:49.501 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440780763454
    Log of 8373:
    4255:M 09 Sep 19:00:49.176 * Slave xx.x.xxx.200:8371 asks for synchronization
    4255:M 09 Sep 19:00:49.176 * Full resync requested by slave xx.x.xxx.200:8371
    4255:M 09 Sep 19:00:49.176 * Starting BGSAVE for SYNC with target: disk
    4255:M 09 Sep 19:00:49.498 * Background saving started by pid 18413
    18413:C 09 Sep 19:01:52.466 * DB saved on disk
    18413:C 09 Sep 19:01:52.620 * RDB: 2124 MB of memory used by copy-on-write
    4255:M 09 Sep 19:01:53.186 * Background saving terminated with success

    Step 16:
    Replica 8371 receives the data successfully and starts loading it into memory:
    46590:S 09 Sep 19:01:53.190 * MASTER <-> SLAVE sync: receiving 2637183250 bytes from master
    46590:S 09 Sep 19:04:51.485 * MASTER <-> SLAVE sync: Flushing old data
    46590:S 09 Sep 19:05:58.695 * MASTER <-> SLAVE sync: Loading DB in memory

    Step 17:
    The cluster returns to normal:
    42645:M 09 Sep 19:05:58.786 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: slave is reachable again.

    Step 18:
    Replica 8371 finishes syncing successfully; the sync took about 7 minutes:
    46590:S 09 Sep 19:08:19.303 * MASTER <-> SLAVE sync: Finished with success
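The timestamps logged in steps 15 to 18 can be broken down into rough per-phase durations for the second full resync. The following is only a sketch: the phase names are labels chosen here for illustration, each phase is measured from the log line that starts it to the log line that starts the next one, and the year 2016 is taken from the dates given in the cause analysis of this article.

```python
from datetime import datetime

FMT = "%Y %d %b %H:%M:%S.%f"

# (phase that starts at this log line, timestamp copied from the logs above)
# The phase names are illustrative labels, not Redis terminology.
events = [
    ("BGSAVE on 8373", "2016 09 Sep 19:00:49.498"),
    ("RDB transfer to 8371", "2016 09 Sep 19:01:53.190"),
    ("flush old data on 8371", "2016 09 Sep 19:04:51.485"),
    ("load RDB into memory on 8371", "2016 09 Sep 19:05:58.695"),
    ("finished", "2016 09 Sep 19:08:19.303"),
]


def phase_durations(events):
    """Return (phase, seconds until the next logged event) for each phase."""
    parsed = [(name, datetime.strptime(ts, FMT)) for name, ts in events]
    return [
        (name, (nxt - cur).total_seconds())
        for (name, cur), (_, nxt) in zip(parsed, parsed[1:])
    ]


durations = phase_durations(events)
total = sum(seconds for _, seconds in durations)

for name, seconds in durations:
    print(f"{name}: {seconds:.1f} s")
print(f"total: {total:.1f} s")
```

Measured this way, the transfer of the RDB file and the loading into memory are the two longest phases, and the total is in line with the roughly 7 minutes noted above.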

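    Why 8371 was marked as failed:

    Because the machines are in the same rack, a network outage was unlikely. The slow log (SLOWLOG GET) showed that a KEYS command had been executed and had taken 8.3 seconds. The cluster node timeout turned out to be 5 s (cluster-node-timeout 5000).

    Cause of the node being marked as failed:

    A client executed a single command that took 8.3 s:

    2016/9/9 18:57:43 KEYS command starts executing
    2016/9/9 18:57:50 8371 is marked as failing (Redis log)
    2016/9/9 18:57:51 KEYS command finishes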
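The arithmetic behind this timeline can be checked with a few lines of code. This is a minimal sketch that uses only the numbers quoted in the article (start time, the 8.3 s duration from the slow log, the 5000 ms timeout, and the timestamp of the "Marking node ... as failing" line in step 2). The variable names are made up for the example.

```python
from datetime import datetime, timedelta

CLUSTER_NODE_TIMEOUT_MS = 5000  # cluster-node-timeout reported in the article

keys_started = datetime(2016, 9, 9, 18, 57, 43)
keys_duration = timedelta(seconds=8.3)  # duration reported by SLOWLOG GET
# Timestamp of "Marking node bedab2c5... as failing (quorum reached)" in step 2.
marked_failing = datetime(2016, 9, 9, 18, 57, 50, 117000)

keys_finished = keys_started + keys_duration
blocked_when_marked = (marked_failing - keys_started).total_seconds()

# The command alone blocks the node for longer than the timeout...
exceeds_timeout = keys_duration.total_seconds() * 1000 > CLUSTER_NODE_TIMEOUT_MS
# ...and the FAIL decision falls inside the window in which KEYS was running.
marked_while_blocked = keys_started < marked_failing < keys_finished

print(f"KEYS finished at {keys_finished.time()}")
print(f"node had been blocked for {blocked_when_marked:.3f} s when marked failing")
print(f"exceeds cluster-node-timeout: {exceeds_timeout}")
print(f"marked failing while KEYS was running: {marked_while_blocked}")
```

Since Redis executes commands on a single thread, a node that is busy with one long command cannot answer cluster pings during that time, which is consistent with the other masters giving up on 8371 after more than 5 seconds of silence.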

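    In summary, there were several problems:

    1. Because cluster-node-timeout was set rather short, the slow KEYS query caused the cluster to mark node 8371 as failed.

    2. Because 8371 was marked as failed, 8373 was promoted to master and a master-replica sync began.

    3. Because of the configured client-output-buffer-limit, the first full resync failed.

    4. On top of that, the PHP client's connection pool was faulty and kept hammering the server with new connections, producing an effect similar to a SYN flood.

    5. After the first full resync failed, the replica needed about 30 seconds to reconnect to the master, because the master had exceeded its maximum of 10,000 client connections.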
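The article does not show the faulty PHP connection pool, so the following is not its fix, only a hedged illustration of one common way to avoid a reconnect storm: capped exponential backoff with jitter. The sketch is in Python rather than PHP, and every name in it (backoff_delays, connect_with_backoff, the connect and sleep callables) is hypothetical.

```python
import random


def backoff_delays(base=0.1, cap=10.0, attempts=8, rng=random.random):
    """Yield sleep intervals in seconds: capped exponential backoff with full jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def connect_with_backoff(connect, sleep, **backoff_options):
    """Call connect() until it succeeds, sleeping between failed attempts.

    connect: hypothetical callable that returns a connection or raises ConnectionError.
    sleep:   callable taking seconds, e.g. time.sleep.
    """
    last_error = None
    for delay in backoff_delays(**backoff_options):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)  # wait before the next attempt instead of retrying at once
    raise ConnectionError("gave up after repeated connection failures") from last_error
```

In real use the sleep argument would be time.sleep and connect would wrap whatever the client library uses to open a connection. The random jitter spreads retries from many clients over time, so a server that is already at its connection limit is not hit by all of them at the same instant.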

    About the client-output-buffer-limit parameter:

    # The syntax of every client-output-buffer-limit directive is the following: 
    # 
    # client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds> 
    # 
    # A client is immediately disconnected once the hard limit is reached, or if 
    # the soft limit is reached and remains reached for the specified number of 
    # seconds (continuously). 
    # So for instance if the hard limit is 32 megabytes and the soft limit is 
    # 16 megabytes / 10 seconds, the client will get disconnected immediately 
    # if the size of the output buffers reach 32 megabytes, but will also get 
    # disconnected if the client reaches 16 megabytes and continuously overcomes 
    # the limit for 10 seconds. 
    # 
    # By default normal clients are not limited because they don't receive data 
    # without asking (in a push way), but just after a request, so only 
    # asynchronous clients may create a scenario where data is requested faster 
    # than it can read. 
    # 
    # Instead there is a default limit for pubsub and slave clients, since 
    # subscribers and slaves receive data in a push fashion. 
    # 
    # Both the hard or the soft limit can be disabled by setting them to zero. 
    client-output-buffer-limit normal 0 0 0 
    client-output-buffer-limit slave 256mb 64mb 60 
    client-output-buffer-limit pubsub 32mb 8mb 60
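To relate these limits to the disconnect logged in step 12, the omem value from that log line can be compared with the slave limits. The sketch below assumes that the values shown above (256mb hard, 64mb soft, 60 seconds) were the ones in effect during the incident, which the article does not state explicitly. The helper names are made up for the example.

```python
# redis.conf size suffixes: 1kb = 1024 bytes, 1mb = 1024*1024 bytes, 1gb = 1024^3 bytes.
UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '256mb' or '0' into bytes."""
    text = text.lower()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def limit_status(omem, hard, soft):
    """Classify an output buffer size against the limits (0 disables a limit)."""
    if hard and omem >= hard:
        return "over hard limit"
    if soft and omem >= soft:
        return "over soft limit"
    return "within limits"


# Assumed limits, copied from the "slave" line of the config shown above.
hard = parse_size("256mb")
soft = parse_size("64mb")
soft_seconds = 60

omem = 95944066  # omem= field of the client line logged by 8373 in step 12

status = limit_status(omem, hard, soft)
print(f"omem = {omem / 1024 ** 2:.1f} MB -> {status}")
```

Under that assumption the buffer was above the 64 MB soft limit but well below the 256 MB hard limit, which suggests the replica was dropped because the buffer stayed above the soft limit for 60 seconds while the RDB transfer was still in progress.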


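    Measures taken:

    1. Split the data so that each instance holds less than 4 GB; otherwise a master-replica switch takes a very long time.

    2. Tune the client-output-buffer-limit parameter so that a sync does not fail halfway through.

    3. Raise cluster-node-timeout; it must not be less than 15 s.

    4. Forbid any slow query that takes longer than cluster-node-timeout, because it triggers a master-replica switch.

    5. Fix the client's SYN-flood-like behavior of hammering the server with connections.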
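One common way to satisfy the fourth measure is to replace KEYS with incremental SCAN calls, so that no single command blocks the node for long. The article does not say which replacement was actually used, so the following is only a sketch. It assumes a client object with a redis-py style scan(cursor, match, count) method that returns a (next_cursor, keys) pair; FakeClient is a made-up stand-in so that the example runs without a server.

```python
import fnmatch


def iter_keys(client, pattern="*", count=1000):
    """Yield keys matching pattern by walking the SCAN cursor instead of calling KEYS."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        yield from keys
        if cursor == 0:  # SCAN signals the end of the iteration with cursor 0
            break


class FakeClient:
    """Illustration-only stand-in for a Redis client; pages through a fixed key list."""

    def __init__(self, keys, page=2):
        self._keys = list(keys)
        self._page = page

    def scan(self, cursor=0, match=None, count=None):
        nxt = cursor + self._page
        batch = [
            key for key in self._keys[cursor:nxt]
            if fnmatch.fnmatchcase(key, match or "*")
        ]
        return (nxt if nxt < len(self._keys) else 0), batch


client = FakeClient(["user:1", "order:1", "user:2", "user:3", "order:2"])
matched = list(iter_keys(client, "user:*"))
print(matched)
```

Against a real server, SCAN may return the same key more than once and only covers the node it is sent to, so in a cluster every master has to be scanned separately and the results deduplicated by the caller.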
