As an open source storage engine, RocksDB supports the ACID characteristics of transactions. To support I (Isolation) in ACID, concurrency control is indispensable. This article mainly discusses the lock mechanism implementation of RocksDB, and the details will be covered. To source code analysis, I hope that through this article readers can gain a deeper understanding of RocksDB concurrency control principles. The article mainly starts from the following four aspects. First, I will introduce the basic structure of RocksDB lock. Then I will introduce the lock space overhead under the design of RocksDB row lock data structure. Then I will introduce the locking process of several typical scenarios. Finally, I will introduce the lock. An essential deadlock detection mechanism in the mechanism.
1. Row lock data structure
The minimum lock granularity of RocksDB is row. For KV storage, the lock object is the key, and each key corresponds to one LockInfo structure. All keys are managed through the hash table. When searching for a lock, you can determine whether the key has been locked by directly locating it through the hash table. But if there is only one hash table globally, it will cause many conflicts in accessing this hash table, affecting concurrency performance. RocksDB is first split by Columnfamily. The locks in each Columnfamily are managed by a LockMap, and each LockMap is split into several shards. Each shard is managed by LockMapStripe, and the hash table (std::unordered_map
The relevant data structure is as follows:
struct LockInfo { bool exclusive; //排它锁或是共享锁 autovector<TransactionID> txn_ids; //事务列表,对于共享锁而言,同一个key可以对应多个事务 // Transaction locks are not valid after this time in us uint64_t expiration_time; } struct LockMapStripe { // Mutex must be held before modifying keys map std::shared_ptr<TransactionDBMutex> stripe_mutex; // Condition Variable per stripe for waiting on a lock std::shared_ptr<TransactionDBCondVar> stripe_cv; // Locked keys mapped to the info about the transactions that locked them. std::unordered_map<std::string, LockInfo> keys; } struct LockMap { const size_t num_stripes_; //分片个数 std::atomic<int64_t> lock_cnt{0}; //锁数目 std::vector<LockMapStripe*> lock_map_stripes_; //锁分片 } class TransactionLockMgr { using LockMaps = std::unordered_map<uint32_t, std::shared_ptr<LockMap>>; LockMaps lock_maps_; // Thread-local cache of entries in lock_maps_. This is an optimization // to avoid acquiring a mutex in order to look up a LockMap std::unique_ptr<ThreadLocalPtr> lock_maps_cache_; }
2.行锁空间代价
由于锁信息是常驻内存,我们简单分析下RocksDB锁占用的内存。每个锁实际上是unordered_map中的一个元素,则锁占用的内存为key_length+8+8+1,假设key为bigint,占8个字节,则100w行记录,需要消耗大约22M内存。但是由于内存与key_length正相关,导致RocksDB的内存消耗不可控。我们可以简单算算RocksDB作为MySQL存储引擎时,key_length的范围。对于单列索引,最大值为2048个字节,具体可以参考max_supported_key_part_length实现;对于复合索引,索引最大长度为3072个字节,具体可以参考max_supported_key_length实现。假设最坏的情况,key_length=3072,则100w行记录,需要消耗3G内存,如果是锁1亿行记录,则需要消耗300G内存,这种情况下内存会有撑爆的风险。因此RocksDB提供参数配置max_row_locks,确保内存可控,默认RDB_MAX_ROW_LOCKS设置为1G,对于大部分key为bigint场景,极端情况下,也需要消耗22G内存。而在这方面,InnoDB则比较友好,hash表的key是(space_id, page_no),所以无论key有多大,key部分的内存消耗都是恒定的。前面我也提到了InnoDB在一个事务需要锁大量记录场景下是有优化的,多个记录可以公用一把锁,这样也间接可以减少内存。
3.上锁流程分析
前面简单了解了RocksDB锁数据结构的设计以及锁对内存资源的消耗。这节主要介绍几种典型场景下,RocksDB是如何加锁的。与InnoDB一样,RocksDB也支持MVCC,读不上锁,为了方便,下面的讨论基于RocksDB作为MySQL的一个引擎来展开,主要包括三类,基于主键的更新,基于二级索引的更新,基于主键的范围更新等。在展开讨论之前,有一点需要说明的是,RocksDB与InnoDB不同,RocksDB的更新也是基于快照的,而InnoDB的更新基于当前读,这种差异也使得在实际应用中,相同隔离级别下,表现有所不一样。对于RocksDB而言,在RC隔离级别下,每个语句开始都会重新获取一次快照;在RR隔离级别下,整个事务中只在第一个语句开始时获取一次快照,所有语句共用这个快照,直到事务结束。
3.1.基于主键的更新
这里主要接口是TransactionBaseImpl::GetForUpdate
1).尝试对key加锁,如果锁被其它事务持有,则需要等待
2).创建snapshot
3).调用ValidateSnapshot,Get key,通过比较Sequence判断key是否被更新过
4).由于是加锁后,再获取snapshot,所以检查一定成功。
5).执行更新操作
这里有一个延迟获取快照的机制,实际上在语句开始时,需要调用acquire_snapshot获取快照,但为了避免冲突导致的重试,在对key加锁后,再获取snapshot,这就保证了在基于主键更新的场景下,不会存在ValidateSnapshot失败的场景。
堆栈如下:
1-myrocks::ha_rocksdb::get_row_by_rowid 2-myrocks::ha_rocksdb::get_for_update 3-myrocks::Rdb_transaction_impl::get_for_update 4-rocksdb::TransactionBaseImpl::GetForUpdate { //加锁 5-rocksdb::TransactionImpl::TryLock 6-rocksdb::TransactionDBImpl::TryLock 7-rocksdb::TransactionLockMgr::TryLock //延迟获取快照,与acquire_snapshot配合使用 6-SetSnapshotIfNeeded() //检查key对应快照是否过期 6-ValidateSnapshot 7-rocksdb::TransactionUtil::CheckKeyForConflict 8-rocksdb::TransactionUtil::CheckKey 9-rocksdb::DBImpl::GetLatestSequenceForKey //第一次读取 //读取key 5-rocksdb::TransactionBaseImpl::Get 6-rocksdb::WriteBatchWithIndex::GetFromBatchAndDB 7-rocksdb::DB::Get 8-rocksdb::DBImpl::Get 9-rocksdb::DBImpl::GetImpl //第二次读取 }
3.2. Range update based on primary key
1).Create Snapshot and scan the primary key based on iterator
2).Through get_row_by_rowid, try to Key lock
3). Call ValidateSnapshot, Get key, and compare the Sequence to determine whether the key has been updated
4). If the key has been updated by other transactions ( The SequenceNumber corresponding to the key is newer than the Snapshot), triggering a retry
5). In the case of retry, the old snapshot will be released and the lock will be released, through tx->acquire_snapshot(false), delay Get the snapshot (after locking, take the snapshot again)
5). Call get_for_update again. Since the key has been locked at this time, retrying will definitely succeed.
6). Perform update operation
7). Jump to 1 and continue execution until the primary key does not meet the conditions, then it ends.
3.3. Update based on secondary index
This scenario is similar to 3.2, except that there is one more step in the process of locating the primary key from the secondary index.
1). Create a Snapshot and scan the secondary index based on the iterator
2). Find the primary key in reverse according to the secondary index. In fact, get_row_by_rowid is also called. This The process will try to lock the key
3). Continue to traverse the next primary key according to the secondary index and try to lock
4). When the returned secondary When the index does not meet the conditions, it ends
3.4 The difference between locking and InnoDB
We mentioned earlier that the difference between RocksDB and InnoDB is that for updates In this scenario, RocksDB is still a snapshot read, while InnoDB is a current read, resulting in behavioral differences. For example, in the range update scenario under the RC isolation level, for example, a transaction needs to update 1,000 records. Since it is locked while scanning, when the 999th record is scanned, it may be found that the Sequence of this key is greater than the scanned snapshot (this key Updated by other transactions), this time will trigger re-acquisition of the snapshot, and then get the latest key value based on this snapshot. InnoDB does not have this problem. Through the current read and scan process, if the 999th record is updated, InnoDB can directly see the latest record. In this case, the results seen by RocksDB and InnoDB are the same. In another case, assuming that a new key is inserted into the scanning range, the Sequence of this key will undoubtedly be larger than the scanned Snapshot, so this key will be filtered out during the Scan process and will not exist. The so-called conflict detection, this key will not be found. During the update process, two records with IDs 1 and 900 were inserted. Finally, the 900th record could not be updated because it was invisible. For InnoDB, since it is currently reading, the newly inserted record with id 900 can be seen and updated, so this is the difference from InnoDB.
In addition to the difference that updates are based on snapshots, RocksDB is also more concise in locking. All locks only involve unique indexes. Specifically, during the update process, only the primary key is locked; update When a column involves a unique constraint, it needs to be locked. However, ordinary secondary indexes do not need to be locked. This purpose is to avoid unique constraint conflicts. Here, if the unique constraint (primary key, or unique index) is updated, a lock is required. InnoDB needs to lock each index. For example, if the update is based on the secondary index positioning, the secondary index also needs to be locked. The reason for this difference is that InnoDB implements the RR isolation level. Let’s talk a little bit about the isolation level here. In fact, the RR isolation level defined in MySQL is somewhat different from the isolation level defined by the SQL standard. The SQL standard defines the RR isolation level to solve the problem of non-repeatable reads, and the Serializable isolation level to solve the problem of phantom reads. Non-repeatable reading emphasizes that the value of the same record will not be modified; while phantom reading emphasizes that the number of records returned by two reads is fixed and will not increase or decrease the number of records. MySQL defines the RR isolation level to solve the problems of non-repeatable reads and phantom reads, while the implementation of the RR isolation level in InnoDB relies on GAP locks. RocksDB does not support GAP locks (only supports unique constraint checks and locks non-existent keys) because the snapshot-based mechanism can effectively filter out newly inserted records, while InnoDB needs to prohibit other insertions through gap locks due to current reads. , so the secondary index also needs to be locked, mainly for the lock gap, otherwise the results of the two current reads may be different. Of course, for the RC split level, InnoDB ordinary secondary indexes do not need to be locked.
4.死锁检测算法
死锁检测采用DFS((Depth First Search,深度优先算法),基本思路根据加入等待关系,继续查找被等待者的等待关系,如果发现成环,则认为发生了死锁,当然在大并发系统下,锁等待关系非常复杂,为了将死锁检测带来的资源消耗控制在一定范围,可以通过设置deadlock_detect_depth来控制死锁检测搜索的深度,或者在特定业务场景下,认为一定不会发生死锁,则关闭死锁检测,这样在一定程度上有利于系统并发的提升。需要说明的是,如果关闭死锁,最好配套将锁等待超时时间设置较小,避免系统真发生死锁时,事务长时间hang住。死锁检测基本流程如下:
1.定位到具体某个分片,获取mutex
2.调用AcquireLocked尝试加锁
3.若上锁失败,则触发进行死锁检测
4.调用IncrementWaiters增加一个等待者
5.如果等待者不在被等待者map里面,则肯定不会存在死锁,返回
6.对于被等待者,沿着wait_txn_map_向下检查等待关系,看看是否成环
7.若发现成环,则将调用DecrementWaitersImpl将新加入的等待关系解除,并报死锁错误。
相关的数据结构:
class TransactionLockMgr { // Must be held when modifying wait_txn_map_ and rev_wait_txn_map_. std::mutex wait_txn_map_mutex_; // Maps from waitee -> number of waiters. HashMap<TransactionID, int> rev_wait_txn_map_; // Maps from waiter -> waitee. HashMap<TransactionID, autovector<TransactionID>> wait_txn_map_; DecrementWaiters // IncrementWaiters // } struct TransactionOptions { bool deadlock_detect = false; //是否检测死锁 int64_t deadlock_detect_depth = 50; //死锁检测的深度 int64_t lock_timeout = -1; //等待锁时间,线上一般设置为5s int64_t expiration = -1; //持有锁时间, }
The above is the detailed content of Detailed explanation of examples of RocksDB locking mechanism. For more information, please follow other related articles on the PHP Chinese website!