
Is tidb the go language?

青灯夜游
Release: 2022-12-02 18:24:17

Yes, TiDB is written in the Go language. TiDB is a distributed NewSQL database: it supports horizontal elastic scaling, ACID transactions, standard SQL, MySQL syntax and the MySQL protocol, and provides strong data consistency and high availability. In the TiDB architecture, PD stores the cluster's metadata, such as which TiKV node a given key is on; PD is also responsible for cluster load balancing and data sharding. PD achieves data distribution and fault tolerance by embedding etcd, and PD is written in Go.


The operating environment of this tutorial: Windows 7 system, Go version 1.18, Dell G3 computer.

There are many heavyweight Go projects, and probably the most influential Go open source project from China is TiDB. TiDB is a distributed database that many people know little about, so let's talk about it today.

TiDB has a clean design, and its official website and code are easy to read; it is a first-choice open source project for learning about distributed databases.

Databases, operating systems, and compilers are collectively known as the three major systems and can be considered the cornerstone of all computer software.

Many people have used databases, but few have implemented one, especially a distributed one. Understanding how a database is implemented and the details behind it both improves your own skills (which helps when building other systems) and helps you use databases well.

1. Introduction to TiDB

TiDB is a distributed NewSQL database. It supports horizontal elastic scaling, ACID transactions, standard SQL, MySQL syntax and the MySQL protocol, and provides strong data consistency and high availability. It is a hybrid database suitable for both OLTP and OLAP scenarios.

OLTP: On-Line Transaction Processing.
OLAP: On-Line Analytical Processing.

  • Highly compatible with MySQL 5.7

TiDB is highly compatible with the MySQL 5.7 protocol and with common MySQL 5.7 functions and syntax. Although TiDB supports MySQL syntax and the MySQL protocol, it is a product developed completely independently by the PingCAP team, not one built on top of MySQL.

System tools (PHPMyAdmin, Navicat, MySQL Workbench, mysqldump, Mydumper, Myloader), clients, etc. in the MySQL 5.7 ecosystem are all applicable to TiDB.

TiDB currently does not support triggers, stored procedures, custom functions, and foreign keys.

  • Ease of use

TiDB is very simple to use. You can use the TiDB cluster as MySQL. TiDB can be used in any application that uses MySQL as a backend storage service, and there is basically no need to modify the application code. At the same time, most popular MySQL management tools can be used to manage TiDB.

As long as the programming language supports MySQL Client/Driver, you can use TiDB directly.
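For example, in Go you can connect with the standard database/sql package and the go-sql-driver/mysql driver, exactly as you would for MySQL. A minimal sketch, assuming the local TiUP playground described later in this article (port 4000, user root, empty password) and a test database:

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL protocol, so the stock MySQL driver works
)

func main() {
    // Assumes a local TiDB on the default port 4000, user root with an empty password.
    db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    var version string
    // TiDB reports a MySQL-compatible version string.
    if err := db.QueryRow("SELECT VERSION()").Scan(&version); err != nil {
        log.Fatal(err)
    }
    fmt.Println("connected, server version:", version)
}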

  • Support distributed transactions

Whether the nodes are in a single location or spread across multiple data centers, TiDB supports ACID distributed transactions.

TiDB's transaction model is inspired by Google's Percolator model. Its core is a two-phase commit protocol with some practical optimizations. The model relies on a timestamp allocator that assigns monotonically increasing timestamps to every transaction so that conflicts can be detected. In a TiDB cluster, PD plays the role of the timestamp allocator.
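To make the idea concrete, here is a toy sketch (not TiDB's actual implementation) of commit-time conflict detection with monotonically increasing timestamps: a commit fails if some other transaction committed a write to the same key after our start_ts. The in-memory store and the timestamp counter below are hypothetical stand-ins for TiKV and PD.

package main

import "fmt"

// store is a hypothetical in-memory stand-in: lastCommit maps a key to the
// commit timestamp of its latest write, nextTS stands in for PD's allocator.
type store struct {
    lastCommit map[string]uint64
    nextTS     uint64
}

func (s *store) getTS() uint64 { s.nextTS++; return s.nextTS }

// commit applies Percolator-style conflict detection: a write conflicts if some
// other transaction committed the same key after this transaction's start_ts.
func (s *store) commit(startTS uint64, keys []string) (uint64, error) {
    for _, k := range keys {
        if s.lastCommit[k] > startTS {
            return 0, fmt.Errorf("write conflict on %q, retry transaction", k)
        }
    }
    commitTS := s.getTS()
    for _, k := range keys {
        s.lastCommit[k] = commitTS
    }
    return commitTS, nil
}

func main() {
    s := &store{lastCommit: map[string]uint64{}}
    t1 := s.getTS() // txn 1 starts
    t2 := s.getTS() // txn 2 starts
    if _, err := s.commit(t2, []string{"k"}); err == nil {
        fmt.Println("txn 2 committed")
    }
    if _, err := s.commit(t1, []string{"k"}); err != nil {
        fmt.Println("txn 1:", err) // conflicts, because txn 2 committed "k" after t1
    }
}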

TiDB does not need XA support to handle cross-shard transactions the way MySQL does: TiDB's own distributed transaction model is far better than XA in both performance and stability, so there is no need to support XA.

Compared with traditional stand-alone databases, TiDB has the following advantages:

  • Purely distributed architecture with good scalability; supports flexible scale-out and scale-in
  • Supports SQL, exposes the MySQL network protocol to the outside, is compatible with most MySQL syntax, and can directly replace MySQL in most scenarios
  • Provides high availability by default: when a minority of replicas fail, the database repairs the data and fails over automatically, transparently to the business
  • Supports ACID transactions, friendly to scenarios with strong consistency requirements such as bank transfers
  • Has a rich tool-chain ecosystem covering data migration, synchronization, backup, and other scenarios

In short, TiDB is suitable for scenarios with the following characteristics:

  • The data volume is too large to fit on a single machine
  • You do not want to do sharding (or cannot be bothered to)
  • The access pattern has no obvious hotspots
  • You need transactions, strong consistency, and disaster recovery
  • You want real-time HTAP and fewer storage links

Five core features

  • One-click horizontal expansion or reduction

    Thanks to TiDB's architecture, which separates storage from compute, compute and storage can each be scaled out or in online, on demand, and the scaling process is transparent to application operations and maintenance staff.

  • Financial-grade high availability

    Data is stored in multiple replicas, which synchronize the transaction log via the Multi-Raft protocol. A transaction can only commit when it has been written successfully to a majority of replicas, which guarantees strong consistency, and the failure of a minority of replicas does not affect data availability. Policies such as replica location and replica count can be configured as needed to meet different disaster-recovery requirements.

  • Real-time HTAP

    TiDB provides two storage engines: the row store TiKV and the column store TiFlash. TiFlash replicates data from TiKV in real time through the Multi-Raft Learner protocol, ensuring strong data consistency between the row engine TiKV and the column engine TiFlash. TiKV and TiFlash can be deployed on different machines as needed to solve the problem of HTAP resource isolation.

  • Cloud-native distributed database

    A distributed database designed specifically for the cloud. With TiDB Operator, deployment and automation can be implemented in public clouds, private clouds, and hybrid clouds.

  • Compatible with MySQL 5.7 protocol and MySQL ecosystem

    Compatible with the MySQL 5.7 protocol, common MySQL functions, and the MySQL ecosystem. Applications can be migrated from MySQL to TiDB with no code changes or only minor ones. A rich set of data migration tools is provided to help applications migrate data easily.

Four core application scenarios

  • Financial-industry scenarios with high requirements for data consistency, reliability, system availability, scalability, and disaster recovery

    As we all know, the financial industry has high requirements for data consistency, reliability, system availability, scalability, and disaster recovery. The traditional solution is two data centers in the same city providing service plus a third in another location providing data disaster recovery but no service. This solution has shortcomings: low resource utilization, high maintenance cost, and RTO (Recovery Time Objective) and RPO (Recovery Point Objective) that cannot truly reach the values the enterprise expects. TiDB uses the multi-replica Multi-Raft protocol to schedule data to different data centers, racks, and machines, so that when some machines fail the system fails over automatically and the cluster still meets its RTO and RPO targets.

  • Massive data and high-concurrency OLTP scenarios with high storage capacity, scalability, and concurrency requirements

    With rapid business growth, data grows explosively, and traditional single-machine databases can no longer meet the capacity requirements. Feasible solutions include sharding middleware (splitting databases and tables), NewSQL databases, or high-end storage devices; the most cost-effective of these is a NewSQL database such as TiDB. TiDB adopts an architecture that separates compute from storage, so each can be scaled out or in independently. Compute supports up to 512 nodes, each node supports up to 1,000 concurrent connections, and cluster capacity scales up to the PB level.

  • Real-time HTAP scenario

    With the rapid development of 5G, the Internet of Things, and artificial intelligence, enterprises produce more and more data, possibly reaching hundreds of TB or even the PB level. The traditional solution is to handle online transactions in an OLTP database and use ETL tools to synchronize the data into an OLAP database for analysis, but this approach suffers from high storage cost and poor real-time performance. In version 4.0, TiDB introduced the columnar engine TiFlash, which combined with the row engine TiKV forms a true HTAP database. With a small increase in storage cost, online transaction processing and real-time data analysis can be done in the same system, greatly reducing cost for enterprises.

  • Data aggregation and secondary processing scenarios

    Currently, the business data of most enterprises is scattered across different systems with no unified view. As the business develops, decision makers need to understand the business status of the whole company in order to decide in time, so the data scattered across systems must be gathered into one system and processed further to generate T+0 or T+1 reports. The traditional solution is ETL plus Hadoop, but a Hadoop system is complex, and its operation, maintenance, and storage costs are too high to meet users' needs. Compared with Hadoop, TiDB is much simpler: the business synchronizes data into TiDB through ETL tools or TiDB's own synchronization tools, and reports can be generated directly in TiDB with SQL.

2. Get started quickly

TiDB is a distributed system. The most basic TiDB test cluster usually consists of 2 TiDB instances, 3 TiKV instances, 3 PD instances and optional TiFlash instances. Through TiUP Playground, you can quickly build the above basic test cluster. The steps are as follows:

  • step1. Download and install TiUP.

    curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh

After the installation completes, the following is displayed:

Successfully set mirror to https://tiup-mirrors.pingcap.com
Detected shell: bash
Shell profile:  /home/user/.bashrc
/home/user/.bashrc has been modified to add tiup to PATH
open a new terminal or source /home/user/.bashrc to use it
Installed path: /home/user/.tiup/bin/tiup
===============================================
Have a try:     tiup playground
===============================================
  • step2. Declare global environment variables. source ${your_shell_profile}

    source /home/user/.bashrc
  • step3. Execute the following command in the current session to start the cluster.

    tiup playground
  • step4. Verification. Now you can use TiDB just like MySQL.

    # Start a new session to access the TiDB database.
    # Connect to TiDB with the TiUP client:
    tiup client
    # You can also connect to TiDB with a MySQL client:
    mysql --host 127.0.0.1 --port 4000 -u root
    # Visit http://127.0.0.1:9090 for TiDB's Prometheus monitoring UI.
    # Visit http://127.0.0.1:2379/dashboard for the TiDB Dashboard; the default username is root with an empty password.
    # Visit http://127.0.0.1:3000 for TiDB's Grafana UI; the default username and password are both admin.

3. TiDB architecture principles

In terms of core design, the TiDB distributed database splits the overall architecture into multiple modules, and each module communicates with each other to form a complete TiDB system. The corresponding architecture diagram is as follows:

[Figure: TiDB overall architecture]

  • TiDB Server

    TiDB Server is responsible for the SQL-related logic: it converts SQL statements into key lookups and finds, via PD, which TiKV node holds the data. TiDB itself is stateless, stores no data, and is only responsible for computation. TiDB is written in Go.

  • PD

    PD stores the cluster's metadata, such as which TiKV node a key is on. PD is also responsible for cluster load balancing and data sharding. PD achieves data distribution and fault tolerance by embedding etcd. PD is written in Go.

  • TiKV Server

    TiKV is a distributed Key-Value storage engine that provides transactions. Its design draws on Google Spanner and HBase, but it does not depend on a complex underlying distributed file system such as HDFS. Key-Value data is stored on local disk via RocksDB, and the Raft protocol is used for replication to maintain data consistency and disaster recovery. TiKV is written in Rust.


1. Storage of TiDB database - TiKV Server

The TiDB storage model is a distributed KV engine with transactions.

This section covers the layered structure of a globally ordered, distributed Key-Value engine and how it achieves multi-replica fault tolerance.

Storage Node

TiKV Server: Responsible for storing data. From the outside, TiKV is a distributed Key-Value storage engine that provides transactions. The basic unit for storing data is Region. Each Region is responsible for storing data of a Key Range (the left-closed and right-open range from StartKey to EndKey). Each TiKV node is responsible for multiple Regions. TiKV's API provides native support for distributed transactions at the KV key-value pair level, and provides the isolation level of SI (Snapshot Isolation) by default, which is also the core of TiDB's support for distributed transactions at the SQL level. After TiDB's SQL layer completes SQL parsing, it will convert the SQL execution plan into an actual call to the TiKV API. Therefore, the data is stored in TiKV. In addition, the data in TiKV automatically maintains multiple copies (three copies by default), naturally supporting high availability and automatic failover.

TiFlash: TiFlash is a special type of storage node. Unlike ordinary TiKV nodes, within TiFlash, data is stored in columnar form, and its main function is to accelerate analytical scenarios.

TiKV

Saving data requires guaranteeing that data is neither lost nor corrupted → the Raft protocol. In addition, the following issues need to be considered:

1. Can it support cross-data center disaster recovery?

2. Is the writing speed fast enough?
3. After the data is saved, is it easy to read?
4. How to modify the saved data? How to support concurrent modifications?
5. How to modify multiple records atomically?

The TiKV project solves the above problems very well. So:

How do you implement a huge, high-performance, highly reliable (distributed) Map like TiKV?

  • TiKV is a huge Map, which stores Key-Value pairs.
  • The Key-Value pairs in this Map are ordered by the binary order of the Key: we can Seek to the position of a given Key and then repeatedly call Next to retrieve Key-Value pairs with larger Keys in increasing order (see the sketch below).
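Conceptually, that interface is just an ordered map with Seek and Next. The sketch below shows what such an interface could look like, backed by a toy in-memory store; the names are illustrative and are not TiKV's actual API.

package main

import (
    "bytes"
    "fmt"
    "sort"
)

// Iterator walks Key-Value pairs in ascending binary order of the Key.
type Iterator interface {
    Valid() bool
    Key() []byte
    Value() []byte
    Next()
}

// memKV is a toy in-memory stand-in for an ordered KV store.
type memKV struct {
    keys [][]byte
    vals [][]byte
}

// Seek positions an iterator at the first key >= target.
func (s *memKV) Seek(target []byte) Iterator {
    i := sort.Search(len(s.keys), func(i int) bool { return bytes.Compare(s.keys[i], target) >= 0 })
    return &memIter{s: s, i: i}
}

type memIter struct {
    s *memKV
    i int
}

func (it *memIter) Valid() bool   { return it.i < len(it.s.keys) }
func (it *memIter) Key() []byte   { return it.s.keys[it.i] }
func (it *memIter) Value() []byte { return it.s.vals[it.i] }
func (it *memIter) Next()         { it.i++ }

func main() {
    kv := &memKV{
        keys: [][]byte{[]byte("a"), []byte("b"), []byte("c")},
        vals: [][]byte{[]byte("1"), []byte("2"), []byte("3")},
    }
    // Seek to "b" and scan forward in key order.
    for it := kv.Seek([]byte("b")); it.Valid(); it.Next() {
        fmt.Printf("%s -> %s\n", it.Key(), it.Value())
    }
}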

Raft and RocksDB

TiKV uses Raft for data replication. Each data change will be recorded as a Raft log. Raft's log replication function synchronizes data to most nodes of the Group safely and reliably.

TiKV does not write data directly to disk; it stores data in RocksDB, which handles the actual persistence. (RocksDB is an excellent open source single-machine storage engine.)


By using the Raft consistency algorithm, data is copied into multiple copies between TiKV nodes to ensure the security of the data when a node fails.

Actually, under the hood, TiKV uses the replicated-log state machine model to replicate data. For write requests, data is written to the Leader, which then replicates the command to its Followers in the form of a log. When a majority of the nodes in the cluster have received this log, the log is committed and the state machine is updated accordingly.

Region concept

For a KV system, there are two typical ways to distribute data across multiple machines: one is to hash the Key and choose the storage node according to the hash value; the other is to split by range, so that a continuous stretch of Keys lives on one storage node. To support range queries, TiKV chose the second approach: it divides the whole Key-Value space into many segments, each being a series of consecutive Keys. Each segment is called a Region, and TiKV tries to keep the data in each Region below a certain size (configurable, currently 96 MiB by default). Each Region can be described by the left-closed, right-open interval from StartKey to EndKey.

[Figure: the Key space divided into Regions]
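As a minimal sketch of how a key is routed to its Region, assuming the Regions are kept sorted by StartKey and the first Region starts at the empty key (the struct below is illustrative, not PD's real data structure):

package main

import (
    "fmt"
    "sort"
)

// Region covers the left-closed, right-open key range [StartKey, EndKey).
// An empty EndKey means "unbounded on the right".
type Region struct {
    ID               int
    StartKey, EndKey string
}

// locate returns the Region containing key, given regions sorted by StartKey
// where the first Region starts at the empty key.
func locate(regions []Region, key string) Region {
    // Find the first Region whose StartKey is strictly greater than key,
    // then step back one: that Region's range contains the key.
    i := sort.Search(len(regions), func(i int) bool { return regions[i].StartKey > key })
    return regions[i-1]
}

func main() {
    regions := []Region{
        {ID: 1, StartKey: "", EndKey: "g"},
        {ID: 2, StartKey: "g", EndKey: "n"},
        {ID: 3, StartKey: "n", EndKey: ""},
    }
    fmt.Println(locate(regions, "hello").ID) // 2: "g" <= "hello" < "n"
    fmt.Println(locate(regions, "zebra").ID) // 3: falls in the last, unbounded Region
}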

After the data is divided into Regions, two important things are done:

  • Use Region as the unit, disperse the data on all nodes in the cluster, and try to ensure that the number of Regions served on each node is about the same.

    The data is divided into many Regions by Key, and each Region's data is stored on only one node (per replica). A component in the system, PD, is responsible for spreading the Regions as evenly as possible across all nodes in the cluster. This achieves horizontal scaling of storage capacity (after new nodes are added, Regions from other nodes are automatically scheduled onto them) and also load balancing (no node ends up with lots of data while others have little). At the same time, to let upper-layer clients reach the data they need, PD also records the distribution of Regions across nodes: given any Key, you can look up which Region the Key is in and which node that Region currently lives on.

  • Perform Raft replication and membership management per Region.

    TiKV replicates data in units of Regions: each Region's data is kept in multiple copies, each called a Replica. Replicas stay consistent through Raft (finally, Raft comes in). The Replicas of one Region are stored on different nodes and form a Raft Group. One Replica acts as the Leader of the Group and the others as Followers. All reads and writes go through the Leader and are then replicated to the Followers.

    After understanding Region, you should be able to understand the following picture:

[Figure: Regions replicated across TiKV nodes as Raft Groups]

With data dispersed and replicated by Region, you have a distributed Key-Value system with a certain level of disaster tolerance, and you no longer need to worry about losing data when a disk fails.

MVCC

If two clients modify the value of the same Key at the same time and there is no MVCC, the data must be locked, which in a distributed scenario can cause performance problems and deadlocks. TiKV implements MVCC by appending a Version to the Key.

For multiple versions of the same Key, the larger version number is placed first and smaller version numbers after it. When a user fetches a Value with Key + Version, the Key and Version can be combined into the MVCC key, Key-Version. A direct Seek(Key-Version) then locates the first position greater than or equal to this Key-Version.

# Simply put, before MVCC, TiKV can be viewed like this:
Key1 -> Value
Key2 -> Value
……
KeyN -> Value
# With MVCC, TiKV's Keys are arranged like this:
Key1-Version3 -> Value
Key1-Version2 -> Value
Key1-Version1 -> Value
……
Key2-Version4 -> Value
Key2-Version3 -> Value
Key2-Version2 -> Value
Key2-Version1 -> Value
……
KeyN-Version2 -> Value
KeyN-Version1 -> Value
……
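A small sketch of the idea: append the version to the user key, encoded so that under the storage engine's ascending key order a larger version sorts before a smaller one. The bit-inverted big-endian encoding below is only illustrative, not TiKV's exact on-disk format.

package main

import (
    "encoding/binary"
    "fmt"
    "sort"
)

// mvccKey appends the version to the user key, encoded so that a larger
// version sorts *before* a smaller one under ascending byte order
// (big-endian bytes of the bit-inverted version; illustrative encoding only).
func mvccKey(key string, version uint64) string {
    var buf [8]byte
    binary.BigEndian.PutUint64(buf[:], ^version)
    return key + string(buf[:])
}

func main() {
    keys := []string{
        mvccKey("Key1", 1),
        mvccKey("Key1", 3),
        mvccKey("Key2", 4),
        mvccKey("Key1", 2),
        mvccKey("Key2", 1),
    }
    sort.Strings(keys) // simulate the storage engine's ascending key order

    for _, k := range keys {
        user := k[:len(k)-8]
        version := ^binary.BigEndian.Uint64([]byte(k[len(k)-8:]))
        fmt.Printf("%s-Version%d\n", user, version)
    }
    // Prints: Key1-Version3, Key1-Version2, Key1-Version1, Key2-Version4, Key2-Version1
}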

GC

TiDB's transactions are implemented with MVCC (multi-version concurrency control). When newly written data overwrites old data, the old data is not replaced; it is kept alongside the new data, and versions are distinguished by timestamp. The job of Garbage Collection (GC) is to clean up old data that is no longer needed.

  • Overall GC process

In a TiDB cluster, one TiDB instance is elected as the GC leader, and the GC run is controlled by the GC leader.

GC is triggered periodically. Each time GC runs, TiDB first computes a timestamp called the safe point; it then deletes older, expired data while guaranteeing that every snapshot after the safe point still has correct data. Each GC round consists of the following three steps:

step 1: the "Resolve Locks" phase scans all Regions for locks older than the safe point and cleans them up.

step 2: the "Delete Ranges" phase quickly deletes whole ranges of obsolete data produced by operations such as DROP TABLE / DROP INDEX.

step 3: the "Do GC" phase has each TiKV node scan the data on that node and delete, for every key, the old versions that are no longer needed.

Under the default configuration, GC is triggered every 10 minutes and each run keeps the data of the most recent 10 minutes (that is, the default GC life time is 10 minutes, and the safe point is computed as the current time minus the GC life time). If one GC round runs too long, the next round does not start when the next trigger time arrives; it waits for the current round to finish. In addition, so that long-running transactions can still execute normally beyond the GC life time, the safe point never passes the start time (start_ts) of a transaction that is still executing.
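A tiny sketch of the safe point rule described above: current time minus the GC life time, clamped so it never passes the start_ts of a transaction that is still running (illustrative only, not TiDB's actual code):

package main

import (
    "fmt"
    "time"
)

// safePoint returns now - lifeTime, clamped so that it never passes the
// start time of any transaction that is still executing.
func safePoint(now time.Time, lifeTime time.Duration, activeTxnStarts []time.Time) time.Time {
    sp := now.Add(-lifeTime)
    for _, start := range activeTxnStarts {
        if start.Before(sp) {
            sp = start // a long-running transaction holds the safe point back
        }
    }
    return sp
}

func main() {
    now := time.Now()
    // Default GC life time is 10 minutes; one transaction has been running for 25 minutes.
    sp := safePoint(now, 10*time.Minute, []time.Time{now.Add(-25 * time.Minute)})
    fmt.Println("safe point lags now by:", now.Sub(sp)) // 25m: held back by the long transaction
}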

2. TiDB database computation - TiDB Server

This section looks, from the SQL perspective, at how data is stored and how it is used for computation.

On top of the distributed storage capability provided by TiKV, TiDB builds a computation engine that combines excellent transaction processing with good data analysis capability.

TiDB Server: the SQL parsing layer. It exposes MySQL-protocol connection endpoints, accepts client connections, performs SQL parsing and optimization, and finally generates a distributed execution plan. The TiDB layer itself is stateless; in practice you can start multiple TiDB instances and provide a single access address through a load-balancing component (such as LVS, HAProxy, or F5), so that client connections are spread evenly across the TiDB instances. TiDB Server does not store data itself; it only parses SQL and forwards the actual data read requests to the underlying storage nodes, TiKV (or TiFlash).

Mapping SQL to KV

The relational model can be reduced to Tables and SQL statements, so the question becomes how to store Tables on a KV structure and how to run SQL statements on a KV structure. There is a huge gap between SQL and KV structures, so mapping between them conveniently and efficiently is an important problem. A good mapping scheme must serve the needs of data operations.

[Figure: mapping relational data onto Key-Value storage]

Distributed SQL computation

First, we need to move computation as close to the storage nodes as possible to avoid large numbers of RPC calls. Second, we need to push filters down to the storage nodes as well, so that only valid rows are returned and meaningless network transfer is avoided. Finally, aggregate functions and GROUP BY can also be pushed down (computation pushdown) to the storage nodes for pre-aggregation: each node only needs to return one Count value, and tidb-server sums the Count values (parallel operators). Here is a diagram of data being returned layer by layer:

[Figure: data returned layer by layer from the storage nodes]
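As a toy illustration of pushdown, the sketch below has each "storage node" apply the pushed-down filter locally and return only a partial count, which tidb-server then sums; the real coprocessor protocol is of course much richer.

package main

import (
    "fmt"
    "sync"
)

// partialCount stands in for a storage node: it applies the pushed-down filter
// locally and returns only a single count instead of the matching rows.
func partialCount(rows []int, filter func(int) bool) int {
    n := 0
    for _, r := range rows {
        if filter(r) {
            n++
        }
    }
    return n
}

func main() {
    // Data spread over three "storage nodes".
    nodes := [][]int{{1, 5, 9}, {2, 6, 10}, {3, 7, 11}}
    filter := func(v int) bool { return v > 5 } // pushed-down WHERE v > 5

    var (
        wg    sync.WaitGroup
        mu    sync.Mutex
        total int
    )
    for _, rows := range nodes {
        wg.Add(1)
        go func(rows []int) { // each node counts in parallel
            defer wg.Done()
            c := partialCount(rows, filter)
            mu.Lock()
            total += c
            mu.Unlock()
        }(rows)
    }
    wg.Wait()
    // tidb-server only sums the partial counts returned by each node.
    fmt.Println("COUNT(*) =", total) // 5 (9, 6, 10, 7, 11)
}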

In fact, TiDB's SQL layer is far more complex, with many modules and layers. The following diagram (the SQL engine architecture) shows the important modules and their call relationships:

[Figure: SQL engine architecture]

A brief outline of how a SQL query is answered: the user's SQL request is sent to tidb-server directly or through a load balancer. tidb-server parses the MySQL Protocol Packet to obtain the request content, then performs syntax analysis, query plan formulation and optimization, and executes the query plan to fetch and process the data. All data is stored in the TiKV cluster, so during this process tidb-server needs to interact with tikv-server to obtain data. Finally, tidb-server returns the query result to the user.

SQL execution process

In TiDB, the process from the input query text to the final execution plan and its execution result is shown in the figure below:

[Figure: SQL execution flow in TiDB]

After the parser parses the original query text and performs some simple validity checks, TiDB first applies some logically equivalent transformations to the query (query logic optimization).
Through these equivalent transformations, the query becomes easier to handle in the logical execution plan. After the transformations are done, TiDB obtains a query plan structure equivalent to the original query, and then derives the final execution plan based on the data distribution and the concrete execution cost of each operator (query physical optimization).
In addition, when executing PREPARE statements, TiDB can optionally enable caching to reduce the cost of generating execution plans (the execution plan cache).
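From the application side, the plan cache matters when a statement is prepared once and executed many times. A hedged Go example using database/sql prepared statements (the table t(id INT PRIMARY KEY, v INT) and the connection details are assumptions for illustration):

package main

import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    // Connection details assumed: local TiDB playground, database "test".
    db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Preparing once and executing many times lets TiDB reuse the cached execution plan
    // (when the execution plan cache is enabled) instead of re-optimizing every time.
    stmt, err := db.Prepare("SELECT v FROM t WHERE id = ?")
    if err != nil {
        log.Fatal(err)
    }
    defer stmt.Close()

    for id := 1; id <= 3; id++ {
        var v int
        if err := stmt.QueryRow(id).Scan(&v); err != nil && err != sql.ErrNoRows {
            log.Fatal(err)
        }
    }
}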

3. TiDB database scheduling - PD Server

PD (Placement Driver) is the management module of the TiDB cluster and is also responsible for real-time scheduling of the cluster's data.

Scheduling scenario

PD (Placement Driver) Server: the metadata management module of the entire TiDB cluster. It stores the real-time data distribution of every TiKV node and the overall topology of the cluster, provides the TiDB Dashboard management interface, and allocates transaction IDs for distributed transactions. PD not only stores metadata: based on the data distribution reported in real time by TiKV nodes, it also issues data scheduling commands to specific TiKV nodes, so it can be called the "brain" of the cluster. In addition, PD itself consists of at least 3 nodes to provide high availability; deploying an odd number of PD nodes is recommended.

Scheduling requirements

The first category: as a distributed, highly available storage system, the requirements that must be met relate to disaster recovery:

  • The number of replicas must be neither more nor fewer than configured.
  • Replicas must be distributed across machines with different attributes according to the topology.
  • When a node goes down or becomes abnormal, disaster recovery happens automatically and reasonably quickly.

The second category: as a good distributed system, things that need to be considered include reasonable resource utilization and good scalability:

  • Maintain even distribution of Leaders throughout the cluster.
  • Maintain uniform storage capacity of each node.
  • Maintain even distribution of access hotspots.
  • Control the speed of load balancing to avoid affecting online services.
  • Manage node status, including manually bringing nodes online/offline.

To meet these needs, enough information must first be collected, such as the state of each node, information about each Raft Group, and statistics of business access operations. Then some policies must be defined, and PD uses this information together with the scheduling policies to produce a scheduling plan that satisfies the needs above as far as possible.

Scheduling operations

The basic scheduling operations are the means by which a scheduling plan is carried out. The scheduling requirements above can be reduced to the following three operations:

  • Add a replica
  • Delete a replica
  • Transfer (migrate) the Leader role between different replicas of a Raft Group

Conveniently, the Raft protocol supports exactly these three basic operations through the AddReplica, RemoveReplica, and TransferLeader commands.
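A toy sketch of how one of these three operations might be chosen for a Region from its current replica placement; PD's real schedulers consider far more signals (store load, labels, hotspots, and so on):

package main

import "fmt"

// Operator is one of the three basic scheduling operations supported by Raft.
type Operator string

const (
    AddReplica     Operator = "AddReplica"
    RemoveReplica  Operator = "RemoveReplica"
    TransferLeader Operator = "TransferLeader"
)

type RegionInfo struct {
    Replicas    int // current number of replicas
    LeaderStore int // store currently holding the Leader
}

// schedule picks a basic operation for one Region: fix the replica count first,
// then rebalance the Leader if it sits on an overloaded store.
func schedule(r RegionInfo, targetReplicas int, overloadedStores map[int]bool) (Operator, bool) {
    switch {
    case r.Replicas < targetReplicas:
        return AddReplica, true
    case r.Replicas > targetReplicas:
        return RemoveReplica, true
    case overloadedStores[r.LeaderStore]:
        return TransferLeader, true
    }
    return "", false // nothing to do for this Region
}

func main() {
    overloaded := map[int]bool{1: true}
    fmt.Println(schedule(RegionInfo{Replicas: 2, LeaderStore: 3}, 3, overloaded)) // AddReplica true
    fmt.Println(schedule(RegionInfo{Replicas: 3, LeaderStore: 1}, 3, overloaded)) // TransferLeader true
}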

The status of TiKV Store is specifically divided into Up, Disconnect, Offline, Down, and Tombstone. The relationship between each status is as follows:

[Figure: TiKV Store state transitions]

Scheduling strategy

  • The number of copies of a Region is correct.
  • Multiple copies in a Raft Group are not in the same location.
  • Copies are evenly distributed among Stores.
  • The number of Leaders is evenly distributed among Stores.
  • The number of access hotspots is evenly distributed among Stores.
  • The storage space occupied by each Store is roughly equal.
  • Control the scheduling speed to avoid affecting online services.

Scheduling implementation

PD continuously collects information through heartbeat packets from each Store (i.e. TiKV node) and each Region Leader to obtain detailed data about the whole cluster, and generates a sequence of scheduling operations based on this information and the scheduling policies. Every time PD receives a heartbeat packet from a Region Leader, it checks whether there are pending operations for that Region, returns the required operations to the Region Leader in the heartbeat reply, and then monitors the execution result in subsequent heartbeat packets. Note that these operations are only suggestions to the Region Leader; there is no guarantee that they will be executed. Whether and when to execute them is decided by the Region Leader itself based on its current state.

5. TiDB Best Practices

TiDB's best practices are closely tied to its implementation principles, so it is recommended to first understand the basic mechanisms, including Raft, distributed transactions, data sharding, load balancing, the SQL-to-KV mapping scheme, the secondary index implementation, and the distributed execution engine.

Raft

Raft is a consensus protocol that provides strongly consistent data replication; the bottom layer of TiDB uses Raft to synchronize data. Each write must reach a majority of replicas before success is returned, so even if a few replicas are lost, the system is guaranteed to still have the latest data. For example, with 3 replicas, a write is considered successful once it reaches 2 replicas; at any time, if one replica is lost, at least one of the two surviving replicas holds the latest data.

Compared with master-slave synchronization that also keeps three copies, Raft is more efficient because the write latency depends on the two fastest replicas rather than the slowest one. Therefore, with Raft-based synchronization it is feasible to run active-active across locations. In the typical "two locations, three centers" scenario, each write only needs to succeed in the local data center and a nearby data center to guarantee data consistency; it does not need to succeed in all three data centers.

Distributed transactions

TiDB provides complete distributed transactions, and the transaction model is optimized based on Google Percolator.

  • Optimistic lock

    TiDB's optimistic transaction model performs conflict detection only when the transaction actually commits. If there is a conflict, a retry is needed. This model is relatively inefficient under heavy conflict, because the work done before the retry is wasted and must be repeated. As an extreme example, if the database is used as a counter and access concurrency is high, there will be severe conflicts leading to many retries or even timeouts. But if access conflicts are not severe, the optimistic model is more efficient. In conflict-heavy scenarios, it is recommended to use pessimistic locking or to solve the problem at the architecture level, for example by putting the counter in Redis.

  • Pessimistic lock

    In TiDB's pessimistic transaction mode, pessimistic transactions behave basically the same as in MySQL: locks are taken during the execution phase on a first-come-first-served basis, avoiding retries caused by conflicts and ensuring a higher success rate for transactions in conflict-heavy workloads. Pessimistic locking also covers the case where you want to lock data in advance with SELECT ... FOR UPDATE. But if the business itself has few conflicts, optimistic locking performs better (see the sketch after this list).

  • Transaction size limit

    Since distributed transactions require two-phase commit, and the bottom layer also performs Raft replication, a very large transaction makes the commit process very slow and blocks the Raft replication behind it. To prevent the system from getting stuck, the transaction size is limited (by default, a single transaction may contain no more than 5,000 SQL statements).
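As a rough illustration of the two modes from the application side (a counters(id INT PRIMARY KEY, n INT) table and a local TiDB are assumed; error handling is trimmed): a pessimistic transaction locks the row up front with SELECT ... FOR UPDATE, while an optimistic transaction simply retries the whole transaction if the commit reports a conflict.

package main

import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql"
)

// incrementPessimistic locks the row up front, so concurrent writers queue instead of conflicting.
func incrementPessimistic(db *sql.DB, id int) error {
    tx, err := db.Begin()
    if err != nil {
        return err
    }
    defer tx.Rollback()

    var n int
    // SELECT ... FOR UPDATE takes the lock during the execution phase (first come, first served).
    if err := tx.QueryRow("SELECT n FROM counters WHERE id = ? FOR UPDATE", id).Scan(&n); err != nil {
        return err
    }
    if _, err := tx.Exec("UPDATE counters SET n = ? WHERE id = ?", n+1, id); err != nil {
        return err
    }
    return tx.Commit()
}

// incrementOptimistic relies on commit-time conflict detection and retries the whole transaction on failure.
func incrementOptimistic(db *sql.DB, id int, maxRetries int) error {
    var err error
    for i := 0; i < maxRetries; i++ {
        tx, e := db.Begin()
        if e != nil {
            return e
        }
        if _, e = tx.Exec("UPDATE counters SET n = n + 1 WHERE id = ?", id); e == nil {
            if err = tx.Commit(); err == nil {
                return nil // conflicts surface at commit; success means no conflict
            }
        } else {
            err = e
            tx.Rollback()
        }
    }
    return err
}

func main() {
    // Connection details assumed: local TiDB playground, database "test".
    db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    if err := incrementPessimistic(db, 1); err != nil {
        log.Println("pessimistic:", err)
    }
    if err := incrementOptimistic(db, 1, 3); err != nil {
        log.Println("optimistic:", err)
    }
}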

Data sharding

TiKV automatically shards the underlying data by Key range. Each Region is a range of Keys, a left-closed, right-open interval from StartKey to EndKey. When the total number of Key-Value pairs in a Region exceeds a certain threshold, the Region splits automatically. This is transparent to the user.

Load Balancing

PD schedules the cluster's load based on the state of the entire TiKV cluster. Scheduling uses the Region as its unit and the policies configured in PD as its logic, and it is carried out automatically.

SQL on KV

TiDB automatically maps SQL structures to KV structures. Simply put, TiDB performs the following operations:

  • A row of data maps to a Key-Value pair; the Key is prefixed by the TableID and suffixed by the row ID
  • An index entry maps to a Key-Value pair; the Key is prefixed by TableID + IndexID and suffixed by the index value

As you can see, the data (or index entries) of one table share the same prefix, so in TiKV's Key space these Key-Values sit in adjacent positions. When the write volume is high and concentrated on one table, write hotspots form, especially when some index values in the continuously written data are themselves continuous (for example an update time, a field that increases with time); the writes then concentrate on a few Regions and become the bottleneck of the whole system. Similarly, if all data reads are concentrated in a small range (for example tens of thousands or hundreds of thousands of consecutive rows), access hotspots can form as well.
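A sketch of that prefix scheme; the key layout below is simplified and illustrative, not TiDB's real encoding:

package main

import "fmt"

// rowKey builds a key prefixed by the table ID and suffixed by the row ID,
// so all rows of one table are adjacent in the key space.
func rowKey(tableID, rowID int64) string {
    return fmt.Sprintf("t%d_r%d", tableID, rowID)
}

// indexKey builds a key prefixed by table ID + index ID and suffixed by the index value,
// so all entries of one index are adjacent as well.
func indexKey(tableID, indexID int64, indexValue string) string {
    return fmt.Sprintf("t%d_i%d_%s", tableID, indexID, indexValue)
}

func main() {
    fmt.Println(rowKey(10, 1))                 // t10_r1
    fmt.Println(rowKey(10, 2))                 // t10_r2 - same prefix, adjacent to the row above
    fmt.Println(indexKey(10, 1, "2022-12-02")) // t10_i1_2022-12-02
    // Monotonically increasing row IDs or index values (e.g. an update time) land on
    // the same few Regions, which is exactly how write hotspots form.
}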

Secondary index

TiDB supports complete secondary indexes, and they are global indexes. Many queries can be optimized through indexes, so making good use of secondary indexes is very important for the business. Much of the experience gained on MySQL still applies to TiDB, but TiDB also has some characteristics of its own to watch for. This section discusses the main things to keep in mind when using secondary indexes in TiDB.

  • More secondary indexes are not always better: each index adds write and storage overhead, so create them as actually needed.

  • It is more appropriate to create an index on a column with a relatively large degree of discrimination [cardinality]. When there are multiple query conditions, you can choose a combined index and pay attention to the leftmost prefix principle.

  • The difference between querying through index and directly scanning Table.

  • Query concurrency.

    Data is scattered across many Regions, so TiDB will perform queries concurrently. The default concurrency is relatively conservative, because too high concurrency will consume a lot of system resources.

    For OLTP type queries, which often do not involve a large amount of data, lower concurrency can already meet the needs.
    For OLAP type Query, a higher degree of concurrency is often required.

    So TiDB supports adjusting query concurrency through system variables (tidb_distsql_scan_concurrency, tidb_index_lookup_size, tidb_index_lookup_concurrency, tidb_index_serial_scan_concurrency, etc.); see the example after this list.

  • Use the index to guarantee result order. Besides filtering data, an index can also be used to sort it: row IDs are fetched in index order, and row contents are then returned in the order the row IDs come back, so the results are ordered by the index columns.

  • Reverse index scans are also supported. (They are currently slower than forward scans, usually by about 20%, and even slower when the data is modified frequently and has many versions. If possible, avoid reverse index scans.)
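For example, you can raise a concurrency variable for one session before a large analytical query and keep the default for OLTP traffic. A hedged Go sketch (the variable name comes from the list above; the value 30 and the table t are just examples):

package main

import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    // Connection details assumed: local TiDB playground.
    db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Raise the scan concurrency for this session only, before running a large analytical query.
    if _, err := db.Exec("SET SESSION tidb_distsql_scan_concurrency = 30"); err != nil {
        log.Fatal(err)
    }

    var n int
    if err := db.QueryRow("SELECT COUNT(*) FROM t").Scan(&n); err != nil {
        log.Fatal(err)
    }
    log.Println("rows:", n)
}

Note that database/sql pools connections, so a session variable set this way only applies to whichever pooled connection runs it; for real use you would pin a dedicated connection with db.Conn.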
