MySQL · Engine Features · Detailed introduction of InnoDB IO subsystem

Preface

As a mature cross-platform database engine, InnoDB implements a set of efficient and easy-to-use IO interfaces, including synchronous and asynchronous IO, IO merging, and so on. This article briefly introduces their internal implementation. The main code is concentrated in the file os0file.cc. Unless stated otherwise, the analysis in this article is based on MySQL 5.6, CentOS 6, and gcc 4.8; information on other versions is pointed out separately.

Basic knowledge

WAL technology: Write-ahead logging, which basically all databases use. Put simply, when a data block needs to be written, the database front-end thread first writes the corresponding log to disk (batched, sequential writes) and then tells the client that the operation succeeded. The actual writing of the data block (scattered, random writes) is left to the background IO threads. Although this technique adds one more disk write, the log is written in batches and sequentially, which is very efficient, so the client gets its response quickly. In addition, if the database crashes before the actual data blocks are written to disk, it can use the log for crash recovery on restart, so no data is lost.
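The ordering described above can be illustrated with a minimal Python sketch. It is a toy model, not InnoDB's redo log: the class name TinyWAL, the file names and the text record format are all assumptions made for illustration.

```python
import os


class TinyWAL:
    """Toy write-ahead log: log first, acknowledge, write data blocks later."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "redo.log")   # hypothetical name
        self.data_path = os.path.join(directory, "data.bin")  # hypothetical name
        self.log_fd = os.open(self.log_path,
                              os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.pending = []  # (offset, payload) pairs not yet in the data file

    def write(self, offset, payload):
        # 1. Append the log record sequentially and force it to disk.
        os.write(self.log_fd, f"{offset}:{payload.hex()}\n".encode())
        os.fsync(self.log_fd)
        # 2. Only now is it safe to report success to the client.
        self.pending.append((offset, payload))
        return True

    def flush_data(self):
        # 3. Later (a background thread in a real engine): random data writes.
        fd = os.open(self.data_path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            for offset, payload in self.pending:
                os.pwrite(fd, payload, offset)
            os.fsync(fd)
        finally:
            os.close(fd)
        self.pending.clear()

    def close(self):
        os.close(self.log_fd)
```

If the process dies between write() and flush_data(), the log file still holds every acknowledged record, which is what makes replay during crash recovery possible.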
Data pre-reading: Data blocks B and C that are "adjacent" to data block A have a high probability of being read once A has been read, so they can be read into memory in advance when A is read. This is data pre-reading (read-ahead). Adjacency here has two meanings: physical and logical. Blocks that are adjacent in the underlying data file are physically adjacent. Blocks that are not adjacent in the data file but are adjacent logically (the row with id=1 and the row with id=2 are logically adjacent, but not necessarily physically adjacent, and may sit at different locations in the same file) are logically adjacent.
File opening mode: The open system call has three main common modes: O_DIRECT, O_SYNC and the default mode. O_DIRECT means that subsequent operations on the file do not use the file system cache; user mode operates on the device directly, bypassing the kernel's caching and optimization. Put another way, when a write in O_DIRECT mode returns successfully, the data has really been written to the disk (ignoring the disk's own cache), and every read in O_DIRECT mode really goes to the disk rather than to the file system cache. O_SYNC means that the operating system cache is used and reads and writes go through the kernel, but each write is guaranteed to have reached the disk when it returns. The default mode is similar to O_SYNC, except that a completed write does not guarantee that the data is on disk: it may still be in the file system cache, and it may be lost if the host goes down.
In addition, a write operation must put not only the modified or added data on disk but also the file's meta-information. Only when both parts are on disk is the data safe. O_DIRECT mode does not guarantee that the file meta-information is written to disk (although most file systems do, see Bug #45892), so if nothing else is done, data written with O_DIRECT can still be lost. O_SYNC ensures that both data and meta-information are placed on disk. The default mode guarantees neither.
Calling fsync ensures that the data and logs are really written to disk. Therefore, for files opened in O_DIRECT or default mode, fsync must be called after the data is written.
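As a small illustration of the difference between the default mode and O_SYNC, the following Python sketch writes a file either way. The helper name write_durably is made up for this example, and O_DIRECT is left out on purpose because it additionally needs aligned buffers.

```python
import os


def write_durably(path, data, use_o_sync=False):
    """Write data to path and return only once it should be on disk."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if use_o_sync:
        # O_SYNC: every write() returns only after the data reached the disk.
        flags |= os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        written = os.write(fd, data)
        if not use_o_sync:
            # Default mode: data may still sit in the file system cache,
            # so an explicit fsync is needed.
            os.fsync(fd)
        return written
    finally:
        os.close(fd)
```

Both paths end with the data flushed; the difference is whether the flush is implied by each write or requested explicitly afterwards.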
Synchronous IO: The read/write functions we commonly use (on Linux) are this type of IO. The caller waits until the function finishes, and there is no notification mechanism: when the function returns, the operation is complete, and the return value tells directly whether it succeeded. This type of IO is relatively simple to program and all operations can be completed in the same thread, but the caller has to wait. In a database system it suits cases where some data is needed urgently. For example, the WAL log must be on disk before the result is returned to the client, so a synchronous IO operation is performed.
Asynchronous IO: In a database, the background IO threads that flush data blocks basically use asynchronous IO. The database front-end thread only needs to submit the flush request to the asynchronous IO queue and can then return to do other things, while the background IO thread periodically checks whether the submitted requests have completed and, if so, does the follow-up processing. Asynchronous IO requests are also often submitted in batches. If different requests access the same file at consecutive offsets, they can be merged into one IO request. For example, if the first request reads 200 bytes of file 1 starting at offset 100 and the second request reads 100 bytes of file 1 starting at offset 300, the two can be merged into a single read of 300 bytes of file 1 starting at offset 100. Logical pre-reading also often uses asynchronous IO.
The current asynchronous IO library on Linux requires the file to be opened in O_DIRECT mode, and the memory address of the buffer holding the data block, the file offset, and the amount of data read or written must all be integer multiples of the file system logical block size. The logical block size can be queried with a command such as sudo blockdev --getss /dev/sda5. If any of the three is not an integer multiple of the logical block size, the read and write functions fail with EINVAL. If the file is not opened with O_DIRECT, the program still runs, but it degrades to synchronous IO and blocks in the io_submit call.
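The three alignment conditions can be checked with a few lines of Python. This is only a sketch: the function names are invented here, and the default block size of 512 is an assumption that should be replaced with the value reported for the actual device.

```python
import ctypes
import mmap


def direct_io_args_ok(buf_addr, offset, length, block_size=512):
    """True if buffer address, file offset and length are all block-aligned."""
    return all(v % block_size == 0 for v in (buf_addr, offset, length))


def aligned_buffer(length):
    """Anonymous mmap memory starts on a page boundary, so it is block-aligned."""
    return mmap.mmap(-1, length)


def address_of(buf):
    """Memory address of a writable buffer object such as an mmap."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf))
```

A request that fails this check is the kind that would be rejected with EINVAL when issued against a file opened with O_DIRECT.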

InnoDB regular IO operations and synchronous IO

In InnoDB, if the system has the pread/pwrite functions, they are used for reading and writing (os_file_read_func and os_file_write_func); otherwise the lseek+read/write scheme is used. This is InnoDB's synchronous IO. The pread/pwrite documentation shows that these two functions do not change the offset of the file handle and are thread-safe, so they are recommended in multi-threaded environments. The lseek+read/write scheme needs its own mutex protection, and under heavy concurrency the frequent traps into kernel mode have some impact on performance.
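The two read paths can be contrasted in a short Python sketch. The function names and the module-level lock are illustrative only; InnoDB's own code is C++ and structured differently.

```python
import os
import threading

_seek_lock = threading.Lock()  # protects the shared file offset


def read_at_pread(fd, length, offset):
    # pread takes the offset as an argument and leaves the file offset alone,
    # so concurrent callers need no extra locking.
    return os.pread(fd, length, offset)


def read_at_seek(fd, length, offset):
    # lseek + read is two system calls that share the file offset,
    # so the pair must be serialized with a mutex.
    with _seek_lock:
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, length)
```

The pread path needs one system call and no lock, which is why it is preferred when it is available.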

In InnoDB, files are opened with the open system call (os_file_create_func). In addition to O_RDONLY (read-only), O_RDWR (read-write) and O_CREAT (create the file), the flags O_EXCL (guarantees that this thread created the file) and O_TRUNC (truncate the file) are also used. By default (the database is not in read-only mode), all files are opened with O_RDWR. The innodb_flush_method parameter is the important one, so let's focus on it:

If innodb_flush_method is set to O_DSYNC, the log file (ib_logfileXXX) is opened with O_SYNC, so there is no need to call fsync after writing. The data file (ibd) is opened in default mode, so fsync must be called to flush it after writing.
If innodb_flush_method is set to O_DIRECT, the log file (ib_logfileXXX) is opened in default mode and fsync must be called after writing. The data file (ibd) is opened in O_DIRECT mode, and fsync must also be called after writing.
If innodb_flush_method is set to fsync or is not set, both data files and log files are opened in default mode, and fsync is required after writing.
If innodb_flush_method is set to O_DIRECT_NO_FSYNC, files are opened as in O_DIRECT mode. The difference is that fsync is not called after the data file is written. This relies on the fact that, on some file systems, O_DIRECT also ensures that the file's metadata is placed on disk.
InnoDB currently supports neither opening log files in O_DIRECT mode nor opening data files in O_SYNC mode.
Note that if you use Linux native aio (see the next section for details), innodb_flush_method must be set to O_DIRECT; otherwise it degrades to synchronous IO (and nothing in the error log says so).
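The rules above can be condensed into a lookup table. The Python sketch below only restates this article's description; the dictionary layout and the helper names open_mode and needs_fsync are assumptions for illustration and are not InnoDB identifiers.

```python
# For each innodb_flush_method value: how the log file and data file are
# opened, and whether fsync is called after writing, as described above.
FLUSH_METHODS = {
    "fsync": {
        "log": ("default", True), "data": ("default", True)},
    "O_DSYNC": {
        "log": ("O_SYNC", False), "data": ("default", True)},
    "O_DIRECT": {
        "log": ("default", True), "data": ("O_DIRECT", True)},
    "O_DIRECT_NO_FSYNC": {
        "log": ("default", True), "data": ("O_DIRECT", False)},
}


def open_mode(method, file_kind):
    """Return the open mode used for 'log' or 'data' files."""
    return FLUSH_METHODS[method][file_kind][0]


def needs_fsync(method, file_kind):
    """Return True if fsync is called after writing this kind of file."""
    return FLUSH_METHODS[method][file_kind][1]
```

For example, needs_fsync("O_DSYNC", "log") is False because the log file is already opened with O_SYNC.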

InnoDB uses the file system's file locks to ensure that only one process reads and writes a file (os_file_lock). It uses advisory locking rather than mandatory locking, because mandatory locking has bugs on many systems, including Linux. In non-read-only mode, every file is locked with a file lock after it is opened.
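A minimal Python sketch of taking an advisory lock on an open file follows. It assumes a POSIX fcntl-style lock over the whole file; the function name lock_file is made up for this example.

```python
import fcntl


def lock_file(fd):
    """Try to take an exclusive advisory lock on the whole file.

    Returns True on success and False if another process holds the lock.
    The lock is advisory: it only stops processes that also ask for it.
    """
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False
```

Because the lock is tied to the process, it is dropped automatically when the process exits, even after a crash.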

Directories in InnoDB are created recursively (os_file_create_subdirs_if_needed and os_file_create_directory). For example, to create the directory /a/b/c/, the recursion starts at c and walks up through b to a; since a directory can only be created once its parent exists, the mkdir calls then complete from the top down. Each directory is created with the mkdir function. In addition, note that to create the upper levels of a directory you need to call os_file_create_simple_func instead of os_file_create_func.
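The recursion can be sketched in Python as follows. The function name echoes the InnoDB one, but the body is an illustrative reimplementation and the created list exists only to make the order visible. The recursion starts at the deepest directory, while the mkdir calls complete from the top down because a parent has to exist first.

```python
import os


def create_subdirs_if_needed(path, created=None):
    """Create path and any missing parents; return directories in mkdir order."""
    created = [] if created is None else created
    path = path.rstrip("/") or "/"
    parent = os.path.dirname(path)
    if parent and not os.path.isdir(parent):
        # Recurse upwards first: the parent has to exist before mkdir works.
        create_subdirs_if_needed(parent, created)
    if not os.path.isdir(path):
        os.mkdir(path)
        created.append(path)
    return created
```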

InnoDB also needs temporary files. The logic for creating them is relatively simple (os_file_create_tmpfile): after a file is successfully created in the tmp directory, it is immediately removed with unlink while the handle stays open, so that when the process ends (normally or abnormally) the file is released automatically. To create a temporary file, InnoDB first reuses the server-layer function mysql_tmpfile. Because a server-layer function must then be called to release those resources, InnoDB calls dup to copy the handle for its own use.
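The create, unlink and dup sequence looks roughly like this in Python. It is a sketch of the pattern only: tempfile.mkstemp stands in for the server-layer helper, and create_tmpfile is a made-up name.

```python
import os
import tempfile


def create_tmpfile():
    """Return a handle to an already-unlinked temporary file."""
    fd, path = tempfile.mkstemp()  # create the file in the tmp directory
    os.unlink(path)                # drop the name; the open handle keeps the data
    dup_fd = os.dup(fd)            # private copy of the handle
    os.close(fd)                   # the original owner releases its handle
    return dup_fd, path
```

The file has no name any more, so the kernel reclaims its space as soon as the last handle is closed, including when the process is killed.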

To get the size of a file, InnoDB does not check the file's metadata (the stat function) but uses lseek(file, 0, SEEK_END). The reason is to avoid getting a wrong file size because of a delayed meta-information update.

InnoDB pre-allocates a size for every newly created file (both data and log files), and the pre-allocated content is set to zero (os_file_set_size). The file is extended later, when it becomes full. In addition, when the log files are created, that is, during the install_db phase, allocation progress is written to the error log at 100MB intervals.
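Zero-filled pre-allocation, together with the lseek-based size query mentioned earlier, can be sketched in Python. The names set_size and file_size and the 1MB chunk size are assumptions for this example and do not mirror os_file_set_size exactly.

```python
import os


def file_size(fd):
    """Size of the file, obtained by seeking to its end instead of using stat."""
    return os.lseek(fd, 0, os.SEEK_END)


def set_size(fd, size, chunk=1024 * 1024):
    """Fill the file with zeros up to size bytes, then flush it to disk."""
    zeros = bytes(chunk)
    written = 0
    while written < size:
        n = min(chunk, size - written)
        # pwrite may write fewer bytes than asked; the loop simply continues.
        written += os.pwrite(fd, zeros[:n], written)
    os.fsync(fd)
    return file_size(fd)
```

Writing real zeros, rather than just extending the file length, makes the file system allocate the blocks up front.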

Generally speaking, conventional IO operations and synchronous IO are relatively simple, but in InnoDB, asynchronous IO is basically used for writing data files.

InnoDB Asynchronous IO

Since MySQL was born before Linux native aio, the MySQL asynchronous IO code contains two implementations of asynchronous IO.
The first is the original Simulated aio. Before Linux native aio was introduced, and on systems that do not support aio, InnoDB simulated an aio mechanism itself. When an asynchronous read or write request is submitted, it is simply put into a queue and the call returns, so the program can do other things. Several asynchronous IO handler threads in the background (their number is controlled by the two parameters innobase_read_io_threads and innobase_write_io_threads) keep taking requests out of this queue, complete them with synchronous IO, and do the work that follows a completed read or write.
The other is Native aio. On Linux it is currently implemented with io_submit, io_getevents and related functions (glibc aio is not used, since it is also simulated). Requests are submitted with io_submit and waited for with io_getevents. The Windows platform also has its own aio, which is not covered here. If you use the Windows technology stack, the database should be SQL Server. Other platforms (apart from Linux and Windows) can currently only use Simulated aio.

First we introduce some common functions and structures, and then Simulated aio and Native aio on Linux in detail.
os0file.cc defines global arrays of type os_aio_array_t. These arrays are the queues that Simulated aio uses to cache read and write requests. Each element of an array is of type os_aio_slot_t, which records the type of the IO request, the file's fd, the offset, the amount of data to be read, the time the IO request was initiated, whether the IO request has completed, and so on. The struct iocb used by Linux native aio is also stored in os_aio_slot_t. The array structure os_aio_array_t records some statistics, such as how many elements (os_aio_slot_t) are in use, whether the array is empty, whether it is full, and so on.
There are five such global arrays in total: asynchronous data file read requests (os_aio_read_array), asynchronous data file write requests (os_aio_write_array), asynchronous log file write requests (os_aio_log_array), asynchronous insert buffer write requests (os_aio_ibuf_array), and synchronous data file read and write requests (os_aio_sync_array).
Log file blocks are written with synchronous IO, so why allocate an asynchronous request queue (os_aio_log_array) for log writes? The reason is that checkpoint information has to be recorded in the header of the InnoDB log file, and reading and writing the checkpoint information is currently still done with asynchronous IO, because it is not very urgent. On Windows, a file that uses asynchronous IO cannot also use synchronous IO, which is why the synchronous data file read and write request queue (os_aio_sync_array) was introduced. Log files need no asynchronous read queue, because the log only has to be read during crash recovery, and the database is not yet available at that point, so there is no need for asynchronous reads.
One thing to note: whatever the values of innobase_read_io_threads and innobase_write_io_threads, there is only one os_aio_read_array and one os_aio_write_array, but the number of os_aio_slot_t elements in each array grows accordingly. On Linux, each increase of the variable by 1 adds 256 elements. For example, with innobase_read_io_threads=4 the os_aio_read_array array is divided into four parts of 256 elements each, and each part has its own lock, semaphore and statistics variables, simulating 4 threads. innobase_write_io_threads works the same way. This also shows that each asynchronous read/write thread can cache at most 256 read and write requests; beyond that, subsequent asynchronous requests have to wait. The number 256 can be understood as the InnoDB layer's limit on asynchronous IO concurrency. The file system layer and the disk layer have their own queue length limits, which can be queried with cat /sys/block/sda/queue/nr_requests and cat /sys/block/sdb/queue/nr_requests.
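The arithmetic behind this layout is simple enough to write down. In the Python sketch below, the constant and function names are invented, and mapping a slot to its part by integer division is an assumption about how one contiguous array would be split into equal parts.

```python
SLOTS_PER_SEGMENT = 256  # slots added per IO thread on Linux, per the text


def array_size(n_io_threads):
    """Total number of slots in the single global array."""
    return n_io_threads * SLOTS_PER_SEGMENT


def segment_of(slot_index):
    """Which part (simulated IO thread) a slot belongs to."""
    return slot_index // SLOTS_PER_SEGMENT
```

With 4 read threads the array has 1024 slots, and slots 0 to 255 belong to the first part, 256 to 511 to the second, and so on.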
os_aio_init is called when InnoDB starts and initializes the various structures, including the global arrays described above as well as the locks and mutexes used by Simulated aio. os_aio_free releases the corresponding structures. The os_aio_print_XXX family of functions outputs the status of the aio subsystem and is mainly used by the show engine innodb status statement.

Simulated aio

Compared with Native aio, Simulated aio is relatively complicated, because InnoDB implements the whole simulation mechanism itself.

The entry function is os_aio_func. In debug mode the parameters are checked, for example whether the memory address of the buffer holding the data block, the file offset, and the amount of data read or written are integer multiples of OS_FILE_LOG_BLOCK_SIZE. Whether the file was opened with O_DIRECT is not checked, because Simulated aio ultimately uses synchronous IO and does not need the file to be opened with O_DIRECT.
After the check passes, os_aio_array_reserve_slot is called. It assigns the IO request to one of the background IO handler threads (their number is set by innobase_xxxx_io_threads, although they all share the same global array) and records the information about the IO request so that the background IO thread can process it. If two requests have the same IO type, target the same file, and have offsets that are close together (by default, within 1M of each other), InnoDB assigns them to the same IO thread to make IO merging easier in the later steps.
After an IO request is submitted, the background IO handler thread must be woken up, because a background thread that finds no IO requests enters the waiting state (os_event_wait).
At this point the function returns and the program can do other things; the rest of the IO processing is left to the background thread.
Next, how the background IO thread works.
When InnoDB starts, the background IO threads are started (io_handler_thread). Each one calls os_aio_simulated_handle to take an IO request out of the global array and processes it with synchronous IO. Afterwards some finishing work is needed; for example, for a write request, the corresponding data page has to be removed from the dirty page list in the buffer pool.
First, os_aio_simulated_handle has to select an IO request from the array to execute. The selection algorithm is not simple first-in-first-out: it picks the request with the smallest offset among all requests, which makes the later IO merging easier to compute. However, this can easily leave isolated requests with particularly large offsets unexecuted for a long time, that is, starved. To solve this, before selecting an IO request InnoDB first makes a pass over the requests. If it finds a request that was pushed at least 2 seconds ago (that is, it has waited 2 seconds) and has still not been executed, the oldest such request is executed first so that it does not starve. If two requests have waited equally long, the one with the smaller offset is chosen.
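The selection rule can be written as a small Python function. This is a sketch of the rule as described in this article; the Request class, the pick_next name and the use of plain floats for timestamps are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    offset: int          # file offset of the IO request
    submitted_at: float  # time the request was pushed, in seconds


def pick_next(requests, now, max_wait=2.0):
    """Choose the next request: starving ones first, else the lowest offset."""
    starving = [r for r in requests if now - r.submitted_at >= max_wait]
    if starving:
        # Oldest request first; on equal age, the smaller offset wins.
        return min(starving, key=lambda r: (r.submitted_at, r.offset))
    return min(requests, key=lambda r: r.offset)
```

The age check bounds how long an isolated high-offset request can be passed over, at the cost of occasionally breaking the offset order that merging relies on.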
The next step in os_aio_simulated_handle is IO merging. For example, if read request 1 asks for 200 bytes of file1 starting at offset 100 and read request 2 asks for 100 bytes of file1 starting at offset 300, the two can be merged into one request: 300 bytes of file1 starting at offset 100. After the IO returns, the data is copied into the buffers of the original requests. Write requests are handled similarly: before the write, the data to be written is copied into a temporary buffer and then written in one go. Note that IOs are merged only when their offsets are contiguous; requests with gaps or overlaps are not merged. Identical IO requests are not merged either, so this can be seen as a possible optimization.
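A Python sketch of merging contiguous reads and copying the result back to each original request is shown below. The function name and the (offset, length) tuple representation are assumptions; all requests are taken to target the same file.

```python
import os


def read_with_merging(fd, requests):
    """Serve (offset, length) read requests, merging contiguous ones.

    Returns (results, io_count), where results maps each request to its bytes
    and io_count is the number of pread calls actually issued.
    """
    results = {}
    io_count = 0
    ordered = sorted(requests)
    i = 0
    while i < len(ordered):
        start, length = ordered[i]
        group = [ordered[i]]
        end = start + length
        # Extend the group only while the next request starts exactly at `end`
        # (no gap, no overlap).
        while i + 1 < len(ordered) and ordered[i + 1][0] == end:
            i += 1
            group.append(ordered[i])
            end += ordered[i][1]
        data = os.pread(fd, end - start, start)  # one IO for the whole group
        io_count += 1
        for off, ln in group:
            # Copy each request's slice out of the merged buffer.
            results[(off, ln)] = data[off - start: off - start + ln]
        i += 1
    return results, io_count
```

With the requests (100, 200), (300, 100) and (450, 10), the first two are served by a single 300-byte read and the third by a separate read.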
If os_aio_simulated_handle finds that there are currently no IO requests, it enters the waiting state and waits to be woken up.
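The overall shape of simulated aio, a queue plus a background thread that performs synchronous IO, can be sketched in Python. This is a deliberately reduced model with a single handler thread and no slot array; the class name, method names and callback parameter are all made up for illustration.

```python
import os
import queue
import threading


class SimulatedAIO:
    """Queue read requests and let a background thread run them synchronously."""

    def __init__(self):
        self.requests = queue.Queue()
        self.worker = threading.Thread(target=self._handle, daemon=True)
        self.worker.start()

    def submit_read(self, fd, length, offset, on_done):
        # Returns immediately; putting an item also wakes the waiting worker.
        self.requests.put((fd, length, offset, on_done))

    def _handle(self):
        while True:
            item = self.requests.get()  # waits while the queue is empty
            if item is None:            # shutdown marker
                break
            fd, length, offset, on_done = item
            # Synchronous IO in the background thread, then follow-up work.
            on_done(os.pread(fd, length, offset))

    def shutdown(self):
        # The waiting worker has to be woken up explicitly to exit.
        self.requests.put(None)
        self.worker.join()
```

The caller never blocks on the disk; it only pays for putting the request on the queue.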

From the above it can be seen that IO requests are pushed one at a time, and each request is picked up by a background thread as soon as it is pushed. If the background thread has a relatively high priority, IO merging may work poorly. To address this, Simulated aio offers something similar to group submission: the background thread is woken up only after a whole group of IO requests has been submitted, and then processes them together, so IO merging works better. A small problem remains, though. If the background thread is busy, it never enters the waiting state, which means a request is processed as soon as it enters the queue. Native aio, described below, solves this problem.
Generally speaking, the simulation mechanism implemented by InnoDB is relatively safe and reliable. If the platform does not support Native aio, this mechanism will be used to read and write data files.

Linux native aio

If the system has the libaio library installed and innodb_use_native_aio=on is set in the configuration file, Native aio will be used at startup.

The entry function is still os_aio_func. In debug mode the incoming parameters are still checked, and again there is no check that the file was opened in O_DIRECT mode. This is somewhat risky: if the user does not know that Linux native aio needs files opened with O_DIRECT to take advantage of aio, performance will fall short of expectations. It would be better to check this here and report any problem in the error log.
After the check passes, os_aio_array_reserve_slot is called, just as in Simulated aio, to assign the IO request to a background thread. The assignment algorithm also takes later IO merging into account, the same as in Simulated aio. The main difference is that the iocb structure has to be initialized with the parameters of the IO request. Besides initializing the iocb, the information about the IO request also has to be recorded in the slot of the global array, mainly for the statistics reported by the os_aio_print_XXX family of functions.
Call io_submit to submit the request.
At this point, the function returns, the program can do other things, and subsequent IO processing is handed over to the background thread.
Next is the background IO thread.
Similar to Simulated aio, the background IO threads are also started when InnoDB starts. With Linux native aio, they call the os_aio_linux_handle function. Its role is similar to os_aio_simulated_handle, but the underlying implementation is simpler: it just calls io_getevents to wait for IO requests to complete. The timeout is 0.5s, which means that if no IO request completes within 0.5 seconds, the function returns and io_getevents is called again to keep waiting. Before each wait it checks whether the server is shutting down, and exits if so.
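The wait loop has a simple shape that can be imitated in Python. Python's standard library has no binding for io_getevents, so in this sketch a queue.Queue of completion events stands in for the kernel's completion queue; the function name and parameters are assumptions, and only the control flow (timed wait, shutdown check, wait again) reflects the description above.

```python
import queue
import threading


def completion_loop(completions, shutting_down, handle, timeout=0.5):
    """Wait for completed IOs with a timeout, checking for shutdown each round."""
    while not shutting_down.is_set():      # check before every wait
        try:
            event = completions.get(timeout=timeout)
        except queue.Empty:
            continue                       # nothing completed in time: wait again
        handle(event)                      # follow-up work for a completed IO
```

Because the loop wakes up on its own after each timeout, nobody has to wake it when new requests arrive or when the server shuts down.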

When requests are distributed to IO threads, adjacent IOs are placed in the same thread where possible, as in Simulated aio. For the subsequent IO merging, however, Simulated aio implements it itself while Native aio leaves it to the kernel, so the code is simpler.
Another difference is that when there are no IO requests, Simulated aio enters the waiting state, while Native aio wakes up every 0.5 seconds, does some checks, and then continues to wait. Therefore, when a new request arrives, Simulated aio needs the user thread to wake up the IO thread, but Native aio does not. Likewise, Simulated aio needs a wake-up when the server shuts down, but Native aio does not.

It can be seen that Native aio resembles Simulated aio in that requests are submitted one by one and then processed one by one, which leads to poor IO merging. The Facebook team contributed a group submission optimization for Native aio: IO requests are first cached, and then io_submit is called to submit all the cached requests at once (io_submit can submit multiple requests in one call), which makes it easier for the kernel to optimize the IO. When the IO threads are under heavy pressure, the group submission optimization of Simulated aio stops working, but that of Native aio does not. Note that with group submission you cannot submit too many requests at once: if the length of the aio waiting queue is exceeded, an io_submit is forced.
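The buffering logic of group submission can be sketched independently of the kernel interface. In the Python sketch below, submit_fn is a hypothetical stand-in for a call that submits a list of requests at once, and the class name and max_batch parameter are also assumptions.

```python
class BatchSubmitter:
    """Cache requests and hand them over in groups."""

    def __init__(self, submit_fn, max_batch):
        self.submit_fn = submit_fn  # called with a list of requests
        self.max_batch = max_batch  # stand-in for the waiting queue length
        self.pending = []

    def add(self, request):
        self.pending.append(request)
        if len(self.pending) >= self.max_batch:
            self.flush()            # forced submit once the limit is reached

    def flush(self):
        if self.pending:
            self.submit_fn(list(self.pending))
            self.pending.clear()
```

The caller adds requests as they come and calls flush() at the end of a group, so the receiver sees whole batches instead of single requests.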

Summary

This article introduces in detail the implementation of the IO subsystem in InnoDB and the points that need attention when using it. InnoDB logs use synchronous IO, data uses asynchronous IO, and the writing order of asynchronous IO is not first-in-first-out mode. These points need to be paid attention to. Although Simulate aio has relatively large learning value, in modern operating systems, it is recommended to use Native aio.
