A detailed, plain-language analysis of the MySQL transaction logs

In back-end interviews, MySQL is an indispensable topic, and questions about transactions and logs are where candidates struggle most. Most people know that MySQL guarantees the transactional behavior of SQL through the redo log, binlog, and undo log, and that these logs can be used for data recovery. But dig a little deeper: how exactly are transactions guaranteed? How is data written to disk during an update? What happens if the contents of the redo log and the binlog are inconsistent? Writing a log also means writing to disk, so why is it more efficient than writing the data to disk directly? If the interviewer (especially at a big company) asks these questions and you cannot answer any of them, you will almost certainly be told to go home and wait for news.

redo log and binlog

Most articles on this topic have already covered it, but for completeness, let's still go over the differences between the redo log and the binlog.

Different location

First, the two logs live in different places. The overall architecture of MySQL can be divided into the server layer and the storage engine layer. MySQL uses pluggable storage engines; common ones include MyISAM, InnoDB, and MEMORY. The storage engine is specified when the table is created (create table .... engine=innodb).

The binlog lives at the server layer, so it can record executed statements no matter which storage engine is used. The redo log is unique to the InnoDB storage engine.

Different size

The binlog is recorded in multiple log files. The size of a single file is set with max_binlog_size, and the files are written in append mode: when the current binlog file exceeds the configured size, MySQL creates a new log file and switches to it. In addition, you can set the number of days to keep binlog files with expire_logs_days (replaced by binlog_expire_logs_seconds in MySQL 8.0).

The size of the redo log is fixed. In the MySQL configuration you can use innodb_log_files_in_group and innodb_log_file_size to set the number of log files and the size of each one. The files are written circularly: when the write position reaches the end, it wraps around to the beginning and continues writing.
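The difference between the two write patterns can be sketched in a few lines of Python. This is a toy model, not MySQL code: the class names RotatingBinlog and CircularRedoLog and their fields are made up for illustration, sizes are plain integers, and the rotation rule is simplified.

```python
class RotatingBinlog:
    """Append-only log that switches to a new file when the current one is full."""

    def __init__(self, max_file_size):
        self.max_file_size = max_file_size
        self.files = [[]]  # each file is a list of event sizes

    def append(self, event_size):
        current = self.files[-1]
        # Rotate when the event would push a non-empty file over the limit.
        if current and sum(current) + event_size > self.max_file_size:
            self.files.append([])
        self.files[-1].append(event_size)


class CircularRedoLog:
    """Fixed-size log whose write position wraps around to the beginning."""

    def __init__(self, file_count, file_size):
        self.capacity = file_count * file_size
        self.total_written = 0  # bytes written since the log was created

    def append(self, record_size):
        offset = self.total_written % self.capacity  # where this record starts
        self.total_written += record_size
        return offset
```

The binlog model keeps growing by adding files, while the redo log model only ever reuses the same fixed space, which is why old redo records are eventually overwritten.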

Record content is different

The binlog is a logical log. Its two main record formats are statement and row (a third format, mixed, switches between the two). The statement format records the SQL statements; the row format records the content of the row before and after the update.

The redo log is a physical log: it records the modification made to each data page, for example "what was changed on a certain data page".

Two-phase commit update process

Now that we understand the differences between the two logs, let's see how they are written by following the execution of an update statement. The statement is:

update t set a = a + 1 where id = 1

  1. The executor asks the InnoDB engine for the row with id = 1. If the row is already in memory, it is returned to the executor directly; otherwise it is read from disk into memory first and then returned.

  2. The executor takes the row given by the engine, adds 1 to the value to get a new row, and then calls the engine interface to write the new row.

  3. The engine updates the new row in memory, then records the change to the in-memory data page in the redo log buffer. At this point the redo record is in the prepare state. The engine then informs the executor that execution is complete and the transaction can be committed at any time.

  4. When the server layer commits the transaction, it first writes the log of this operation into the binlog buffer and then calls the engine's transaction commit interface. The engine changes the state of the redo record it has just written to commit, and the update is complete.

As you can see, after an update not only is the data kept in memory, but the redo log and binlog are also written to memory first, and are flushed to disk later according to the configured flush policy.
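The four steps above can be sketched as a small in-memory simulation. This is only an illustration of the ordering (redo prepare, then binlog, then redo commit); the Engine and Server classes and their methods are hypothetical names, and nothing here touches disk.

```python
class Engine:
    """Toy stand-in for the storage engine layer."""

    def __init__(self):
        self.rows = {1: {"a": 0}}  # in-memory rows keyed by id
        self.redo_buffer = []      # redo records, still in memory

    def read_row(self, row_id):
        return dict(self.rows[row_id])

    def write_row(self, trx_id, row_id, new_row):
        self.rows[row_id] = new_row
        # The redo record starts out in the prepare state.
        self.redo_buffer.append(
            {"trx": trx_id, "row": row_id, "new": new_row, "state": "prepare"}
        )

    def commit(self, trx_id):
        for record in self.redo_buffer:
            if record["trx"] == trx_id:
                record["state"] = "commit"


class Server:
    """Toy stand-in for the server layer (executor plus binlog)."""

    def __init__(self, engine):
        self.engine = engine
        self.binlog_buffer = []

    def increment_a(self, trx_id, row_id):
        row = self.engine.read_row(row_id)           # step 1: fetch the row
        row["a"] += 1                                # step 2: compute the new row
        self.engine.write_row(trx_id, row_id, row)   # step 3: redo in prepare
        self.binlog_buffer.append(                   # step 4a: write binlog
            f"update t set a = a + 1 where id = {row_id}"
        )
        self.engine.commit(trx_id)                   # step 4b: redo to commit


engine = Engine()
server = Server(engine)
server.increment_a(trx_id=1, row_id=1)
```

After the call, the row holds a = 1, the binlog buffer holds one statement, and the single redo record is in the commit state.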

Flushing logs to disk

Binlog flush strategy

MySQL uses the sync_binlog parameter to control the flush strategy of the binlog buffer.

sync_binlog = 0
MySQL does not control when the binlog is flushed and relies on the file system's own cache flushing. Performance is the best, but so is the risk: if the system crashes, binlog data that has not yet reached the disk is lost.

sync_binlog = 1
Every time a transaction is committed, the log data in the buffer is flushed to disk. This is the safest setting, but because flushing is so frequent, performance is also the worst.

sync_binlog = N (N > 1)
The log data is flushed to disk after every N transactions written to the binlog buffer.
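To get a feel for the trade-off, here is a rough count of how many flushes MySQL itself would issue under each setting. The function name is made up for this sketch, and it follows the simplified description above: it ignores group commit, where several transactions can share one flush.

```python
def count_binlog_fsyncs(sync_binlog, committed_transactions):
    """Estimate the fsync calls MySQL issues for the binlog (simplified model)."""
    if sync_binlog == 0:
        return 0  # flushing is left entirely to the file system
    # One flush after every sync_binlog committed transactions.
    return committed_transactions // sync_binlog
```

For 100 committed transactions, this model gives 0 flushes with sync_binlog = 0, 100 with sync_binlog = 1, and 10 with sync_binlog = 10.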

Redo log flush strategy

Before talking about redo log persistence, let's first look at two system calls: write and fsync. In the operating system, memory is divided into user space and kernel space. User space holds the application's own cached data, and the redo log buffer lives there. To persist user space data to disk, the application first calls write to copy the data into kernel space, and then calls fsync to write the kernel space data to disk.
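The two calls can be tried directly from Python, whose os.write and os.fsync functions are thin wrappers around the system calls of the same name. The helper name durable_append and the file name redo.log are only illustrative; the file is created in a temporary directory.

```python
import os
import tempfile


def durable_append(path, payload):
    """Append payload to path and force it down to the storage device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        written = os.write(fd, payload)  # user space -> kernel page cache
        os.fsync(fd)                     # kernel page cache -> disk
    finally:
        os.close(fd)
    return written


with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = os.path.join(tmp_dir, "redo.log")
    bytes_written = durable_append(log_path, b"page 7: a=1\n")
    with open(log_path, "rb") as log_file:
        content = log_file.read()
```

If the process died between the write and the fsync, the data would still sit in the kernel cache and would normally reach the disk; if the whole machine lost power in that window, it could be lost.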

MySQL uses the innodb_flush_log_at_trx_commit parameter to control when the redo log buffer is written to disk.

innodb_flush_log_at_trx_commit = 0
When the transaction is committed, the log stays in the redo log buffer. write and fsync are called at the interval set by innodb_flush_log_at_timeout to persist the log to disk. The default interval is 1, which means the log is written to disk every second. Writes are batched, so I/O performance is better, but the risk of data loss is greater.

innodb_flush_log_at_trx_commit = 1
When the transaction is committed, both write and fsync are called to write the log to disk. No data is lost this way, but I/O performance is poor.

innodb_flush_log_at_trx_commit = 2
When the transaction is committed, write is called to copy the log into the kernel cache, and fsync is then called every second to write the log to disk. This is also fairly safe: even if the MySQL process crashes, the logs in the OS buffer are not lost. If the operating system itself crashes, however, that part of the log is gone.
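The three settings can be compared with a toy model that tracks where each log record currently lives. The class and method names are invented for this sketch; the three lists stand for the redo log buffer, the OS cache, and the disk.

```python
class RedoLogStore:
    """Toy model of the three innodb_flush_log_at_trx_commit settings."""

    def __init__(self, flush_log_at_trx_commit):
        self.policy = flush_log_at_trx_commit
        self.log_buffer = []  # user space: lost if the MySQL process crashes
        self.os_cache = []    # kernel space: lost if the OS crashes
        self.disk = []        # durable

    def _write(self):
        self.os_cache.extend(self.log_buffer)
        self.log_buffer.clear()

    def _fsync(self):
        self.disk.extend(self.os_cache)
        self.os_cache.clear()

    def commit(self, record):
        self.log_buffer.append(record)
        if self.policy == 1:
            self._write()
            self._fsync()
        elif self.policy == 2:
            self._write()
        # policy 0: nothing happens at commit time

    def background_tick(self):
        """The once-per-interval write plus fsync."""
        self._write()
        self._fsync()

    def survives_mysql_crash(self):
        return self.disk + self.os_cache

    def survives_os_crash(self):
        return list(self.disk)
```

Committing one record and then asking each model what survives shows the trade-off: with 0 nothing survives until the next tick, with 2 the record survives a MySQL crash but not an OS crash, and with 1 it survives both.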


Q: Will a redo log record in the prepare state be flushed to disk?

A: Yes. For example, suppose two transactions a and b run at the same time, and a is in the prepare state when b commits and triggers a log flush. The redo log of a is then flushed to disk along with that of b.

Q: Is the binlog redundant? Can the redo log be used instead of the binlog?

A: For transaction support alone, the binlog really contributes little. During crash recovery the binlog is consulted to decide whether a transaction should be committed, but only because the binlog may already have been applied on the standby database: if the primary simply rolled the transaction back, the primary and standby data would become inconsistent.

However, the redo log cannot provide the "archive" function of the binlog. The redo log has a fixed size and is written circularly, so older logs are overwritten and cannot be kept for long. The binlog has no size limit and is written by appending. As long as the binlog files are kept, the database can be restored to its state at any point in time.

Q: The flush strategies of the binlog and redo log also write to disk frequently. How is that different from writing the data to disk directly?

A: A log file is stored in consecutive data pages, so writing the log to disk needs only one addressing operation and is sequential I/O. The data changed in one transaction, on the other hand, may be spread over several scattered data pages, and writing them to disk requires multiple "seek -> rotation" addressing operations. This is random I/O, which on a mechanical disk can be several orders of magnitude slower than sequential I/O.

Flushing data to disk

To avoid the performance bottleneck caused by frequent disk writes, a data page is modified in memory first. A page that has been modified in memory is called a dirty page, because its content is now inconsistent with the disk. The changed data pages must be synchronized to disk at some later time, a process called "flushing dirty pages".


In InnoDB, every modification of a data page generates an 8-byte log sequence number (lsn) that marks the version. The lsn is global and increases monotonically as the log is written. It is stored both in data pages and in the redo log. During the update process there are several lsns worth attention:

  1. When the data in an in-memory data page is modified, the lsn in that page is updated. We call this data_in_buffer_lsn here.

  2. When a log is written to the redo log buffer, the corresponding lsn is recorded. We call this redo_log_in_buffer_lsn.

  3. When one of the redo log flush strategies is triggered, the logs in the redo log buffer are flushed to disk and the corresponding lsn is recorded in the file. We call this redo_log_on_disk_lsn.

  4. When data is flushed from memory to disk, the current lsn is recorded in the corresponding data page on disk. We call this data_on_disk_lsn.

  5. At appropriate times InnoDB synchronizes the data page changes recorded in the redo log to disk. The progress of this synchronization is also marked by an lsn, called checkpoint_lsn. (Details are introduced later.)

You can check the value of each lsn with:

show engine innodb status

The lsn can be understood as the amount of redo log the database has generated since it was created. The larger the value, the more the database has been updated, so it can also be read as a kind of update timestamp. Each data page also carries an lsn, the lsn at the time the page was last modified; the larger the value, the later the modification. For example, suppose the lsn of data page A is 100, the lsn of data page B is 200, the checkpoint lsn is 150, and the system lsn is 300. This means the system has been updated up to 300, and data pages whose lsn is smaller than 150 have been flushed to disk. The latest data of page A must therefore be on disk, while page B is not necessarily there and may still be only in memory.
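The example with pages A and B can be written as a small check. The function name is made up for this sketch, and the comparison follows the rule stated above (a page lsn smaller than the checkpoint lsn means the page is on disk).

```python
def page_is_certainly_on_disk(page_lsn, checkpoint_lsn):
    """True if the latest version of the page must already be on disk."""
    return page_lsn < checkpoint_lsn


checkpoint_lsn = 150
system_lsn = 300
page_a_on_disk = page_is_certainly_on_disk(100, checkpoint_lsn)
page_b_on_disk = page_is_certainly_on_disk(200, checkpoint_lsn)
# Amount of redo log whose page changes may not be on disk yet.
pending_redo = system_lsn - checkpoint_lsn
```

A False result does not mean the page is missing from disk, only that this cannot be concluded from the lsn comparison alone.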

Let's discuss the timing of dirty page flushing in InnoDB.

When dirty pages are flushed

Periodic flush

InnoDB's main thread periodically flushes a certain percentage of dirty pages to disk. This process is asynchronous and does not block other operations such as queries and updates.

Not enough memory

InnoDB maintains an LRU list of in-memory data pages and uses a separate page cleaner thread to keep a certain number of free pages available. When free pages run short, the pages at the tail of the LRU list are evicted. If an evicted page is dirty, its data is flushed to disk first.
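A minimal sketch of "evict from the LRU tail, flush first if dirty" is shown below. The BufferPool class is hypothetical and uses a plain LRU; InnoDB's real list is more elaborate, and here "flushing" just records the page number.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer pool: plain LRU with a dirty flag per page."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_no -> dirty flag, most recent last
        self.flushed = []           # page numbers written to disk, in order

    def access(self, page_no, modify=False):
        if page_no in self.pages:
            # Re-inserting moves the page to the most recently used end.
            dirty = self.pages.pop(page_no) or modify
        else:
            if len(self.pages) >= self.capacity:
                self._evict()
            dirty = modify
        self.pages[page_no] = dirty

    def _evict(self):
        victim, dirty = self.pages.popitem(last=False)  # least recently used
        if dirty:
            self.flushed.append(victim)  # a dirty page is flushed before it goes


pool = BufferPool(capacity=2)
pool.access(1, modify=True)  # page 1 becomes dirty
pool.access(2)
pool.access(3)               # evicts page 1, which must be flushed first
pool.access(4)               # evicts page 2, which is clean
```

Only page 1 ends up in the flushed list, because page 2 was clean when it was evicted.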

The proportion of dirty pages is too high

InnoDB has a parameter, innodb_max_dirty_pages_pct, that controls the proportion of dirty pages in memory. When the proportion of dirty pages exceeds the configured value, some dirty pages are flushed to disk.

mysql> show variables like 'innodb_max_dirty_pages_pct';
+----------------------------+-----------+
| Variable_name              | Value     |
+----------------------------+-----------+
| innodb_max_dirty_pages_pct | 90.000000 |
+----------------------------+-----------+

The database is shut down normally


The innodb_fast_shutdown parameter controls the flush strategy when the database is closed. When it is set to 1, all dirty log pages and dirty data pages are flushed to disk; when it is set to 2, only the log is guaranteed to be flushed.

redo log checkpoint

Let's review the update process once more: the update operation is recorded in the redo log, the data is updated in memory, and the update is then considered finished. If the database shuts down abnormally, the changes to the corresponding data pages must be restored from the redo log the next time it starts.

However, the redo log has a fixed size and is written circularly: when the write position reaches the end, it wraps around to the beginning. As update operations accumulate, records in the redo log are overwritten, and any change that had not yet reached the data pages on disk would be lost.

Would it be fine not to limit the size of the redo log? Imagine that the redo log has grown to 1 TB and the database holds 10 TB of data. To restore the data page changes after an abnormal restart, we would need to read the 1 TB log. If all data pages had been modified, we would also need to load all 10 TB of data into memory. So an unlimited redo log brings two other problems:

  1. The recovery speed is slow;
  2. The memory cannot cache all the data in the database.

The redo log therefore adopts a checkpoint strategy: the data page changes recorded in the redo log are periodically flushed to disk, and the progress is marked with an lsn called checkpoint_lsn. checkpoint_lsn divides the redo log into two parts. For logs smaller than checkpoint_lsn, the corresponding data page changes have already been flushed to disk, so this part of the log can be overwritten. For logs larger than checkpoint_lsn, the corresponding changes have not yet been synchronized to disk.

Redo log checkpoint flushing is divided into asynchronous flushing and synchronous flushing.

checkpoint_age = redo_lsn - checkpoint_lsn
async_water_mark = 75% * total_redo_log_file_size
sync_water_mark = 90% * total_redo_log_file_size

checkpoint_age < async_water_mark: there is little dirty page data, and redo log checkpoint flushing is not triggered.

async_water_mark < checkpoint_age < sync_water_mark: a certain number of dirty pages are flushed to disk asynchronously until checkpoint_age < async_water_mark holds again. Asynchronous flushing does not affect other update operations.

checkpoint_age > sync_water_mark: this happens when the redo log capacity is small and a large number of updates run at the same time, leaving little available log space. Synchronous flushing is triggered, and dirty pages are flushed to disk until checkpoint_age < async_water_mark holds. Synchronous flushing blocks user update operations.
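The three cases can be expressed as one small function. The function name and the returned labels are made up for this sketch; the 75% and 90% thresholds are the ones given above.

```python
def checkpoint_action(redo_lsn, checkpoint_lsn, total_redo_log_file_size):
    """Return which kind of checkpoint flushing the current log usage triggers."""
    checkpoint_age = redo_lsn - checkpoint_lsn
    async_water_mark = 0.75 * total_redo_log_file_size
    sync_water_mark = 0.90 * total_redo_log_file_size
    if checkpoint_age < async_water_mark:
        return "none"   # plenty of reusable log space
    if checkpoint_age < sync_water_mark:
        return "async"  # flush in the background, updates keep running
    return "sync"       # flush now, user updates are blocked
```

With a total redo log size of 1000, a checkpoint age of 700 triggers nothing, 800 triggers asynchronous flushing, and 950 triggers synchronous flushing.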


Q: Apart from the redo log checkpoint, does flushing dirty pages in the other situations advance checkpoint_lsn?

A: No. The buffer pool maintains a flush_list for managing dirty pages. When a data page becomes dirty because its data was modified, it is added to flush_list, and it is removed from flush_list after it has been flushed to disk.

flush_list is sorted in ascending order by the oldest modification lsn (oldest_modification) of each data page. For example, when a clean page becomes dirty, its data_in_buffer_lsn is 100 and its position in flush_list is 1. When the page is changed again, data_in_buffer_lsn becomes 120, but its position in flush_list stays the same.

When performing a redo checkpoint, the page number of the selected log record (page_no_redo) only needs to be compared with the page number of the oldest page on flush_list (page_no_flush_list, the page with the smallest lsn on flush_list):

  1. page_no_flush_list != page_no_redo: the dirty page for this log record has already been synchronized to disk, so checkpoint_lsn is advanced.
  2. page_no_flush_list == page_no_redo: the dirty page is flushed to disk, and then checkpoint_lsn is advanced.
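The role of the oldest dirty page can be illustrated with a toy flush list. The FlushList class is hypothetical, and it relies on lsns only ever increasing, so the order in which pages first become dirty is also their order by oldest modification lsn.

```python
class FlushList:
    """Toy flush list keyed by each dirty page's oldest modification lsn."""

    def __init__(self):
        self.oldest_modification = {}  # page_no -> lsn of first change since clean

    def modify(self, page_no, lsn):
        # Later changes to an already dirty page do not move it in the list.
        self.oldest_modification.setdefault(page_no, lsn)

    def flush(self, page_no):
        self.oldest_modification.pop(page_no, None)

    def checkpoint_lsn(self, current_lsn):
        """The checkpoint can advance only up to the oldest dirty page."""
        if not self.oldest_modification:
            return current_lsn
        return min(self.oldest_modification.values())


flush_list = FlushList()
flush_list.modify(1, 100)
flush_list.modify(2, 110)
flush_list.modify(1, 120)  # page 1 keeps its position (oldest lsn is still 100)
before = flush_list.checkpoint_lsn(current_lsn=130)
flush_list.flush(2)        # flushing a newer page does not help
after_newer = flush_list.checkpoint_lsn(current_lsn=130)
flush_list.flush(1)        # flushing the oldest page lets the checkpoint move
after_oldest = flush_list.checkpoint_lsn(current_lsn=130)
```

In this model the checkpoint stays at 100 until page 1, the oldest dirty page, has been flushed.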

Q: Where is the checkpoint information stored, and how?

A: The checkpoint information is stored in the header of the first redo log file. It is stored in two copies that are written in turn.

The header of the first redo log file has two places for storing checkpoint information, and these two checkpoint fields are written alternately. Suppose there were only one checkpoint field: if the server crashed when the checkpoint was only half updated, the whole field would be unusable, the database could not perform crash recovery, and it could not start. With two checkpoint fields, even if one is damaged during a write, the other can still be used to attempt recovery. The log may already have been overwritten by then, but the probability of a successful recovery is at least improved. Writing the two fields in turn also reduces the impact of disk sector failures.
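A small model of the two alternating checkpoint fields is sketched below. The CheckpointHeader class and its layout are invented for illustration; a CRC from Python's zlib module stands in for whatever validity check the real file format uses.

```python
import zlib


class CheckpointHeader:
    """Toy log file header with two checkpoint fields written in turn."""

    def __init__(self):
        self.slots = [None, None]  # each slot: (checkpoint_no, lsn, checksum)
        self.next_no = 0

    @staticmethod
    def _checksum(checkpoint_no, lsn):
        return zlib.crc32(f"{checkpoint_no}:{lsn}".encode())

    def write(self, lsn):
        no = self.next_no
        # Even-numbered checkpoints go to slot 0, odd-numbered ones to slot 1.
        self.slots[no % 2] = (no, lsn, self._checksum(no, lsn))
        self.next_no += 1

    def read_latest(self):
        valid = [
            slot for slot in self.slots
            if slot is not None and slot[2] == self._checksum(slot[0], slot[1])
        ]
        if not valid:
            return None
        # The valid slot with the highest checkpoint number wins.
        return max(valid, key=lambda slot: slot[0])[1]


header = CheckpointHeader()
header.write(100)
header.write(150)
header.write(200)
latest = header.read_latest()

# Simulate a torn write of the newest field by breaking its checksum.
no, lsn, checksum = header.slots[0]
header.slots[0] = (no, lsn, checksum ^ 1)
fallback = header.read_latest()
```

When the newest field is damaged, reading falls back to the older checkpoint, so recovery starts earlier in the log than necessary but can still be attempted.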

Crash recovery

Suppose a user modifies data and successfully commits the transaction, but the changed data is still only in memory and has not yet been flushed to disk. If the database crashes at this moment, then after restart the data of the committed transaction must be recovered from the log and rewritten to disk so that nothing is lost. At the same time, uncommitted transactions must be rolled back. Besides the redo log and binlog, crash recovery also depends on the undo log.

undo log

Whenever data is modified, an undo log is generated: when a record is deleted, a corresponding insert log is recorded; when a record is updated, a corresponding update log (restoring the old value) is recorded; when a record is inserted, a corresponding delete log is recorded.
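The idea of recording the inverse operation can be sketched with a dictionary as the table and a list as the undo log. The function names and the tuple layout are made up for this illustration.

```python
def apply_with_undo(table, undo_log, op, key, value=None):
    """Apply a change to table and record the inverse operation in undo_log."""
    if op == "insert":
        undo_log.append(("delete", key, None))
        table[key] = value
    elif op == "delete":
        undo_log.append(("insert", key, table[key]))
        del table[key]
    elif op == "update":
        undo_log.append(("update", key, table[key]))  # remember the old value
        table[key] = value
    else:
        raise ValueError(f"unknown operation: {op}")


def rollback(table, undo_log):
    """Undo all recorded changes, newest first."""
    while undo_log:
        op, key, old_value = undo_log.pop()
        if op == "delete":
            del table[key]
        else:  # "insert" and "update" both restore the old value
            table[key] = old_value


table = {1: "a"}
undo_log = []
apply_with_undo(table, undo_log, "update", 1, "b")
apply_with_undo(table, undo_log, "insert", 2, "c")
apply_with_undo(table, undo_log, "delete", 1)
state_before_rollback = dict(table)
rollback(table, undo_log)
```

The undo records are replayed in reverse order, which is what brings the table back to exactly its starting state.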

To roll back a transaction, you only need to perform the corresponding undo operations to restore the data. The undo log also helps guarantee transaction isolation. Suppose the isolation level is read committed and the uncommitted transaction A has modified the row with id=1. When transaction B wants to read the row with id=1, it first obtains the latest version of the row and then follows the undo log back until it finds a version that satisfies its visibility rules.

Undo log pages are treated like ordinary data pages: a modification to an undo page must first be written to the redo log, and an undo page may be evicted from memory under the LRU rules and later read back from disk.
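Following the undo log back to a visible version can be modeled as a walk over a version chain. This is a strong simplification: the function name and the chain layout are hypothetical, and visibility is reduced to "the writing transaction has committed", whereas a real read view is more involved.

```python
def read_committed_version(version_chain, committed_trx_ids):
    """Return the newest value written by a committed transaction.

    version_chain lists (trx_id, value) pairs, newest first; every entry after
    the first is what following the undo log would reconstruct.
    """
    for trx_id, value in version_chain:
        if trx_id in committed_trx_ids:
            return value
    return None  # no visible version: the row does not exist for this reader


# Transaction 20 (uncommitted) changed a from 1 to 2; transaction 10 wrote a = 1.
chain = [(20, "a=2"), (10, "a=1")]
seen_while_uncommitted = read_committed_version(chain, committed_trx_ids={10})
seen_after_commit = read_committed_version(chain, committed_trx_ids={10, 20})
```

While transaction 20 is uncommitted the reader gets the older version, and once it commits the reader gets the new one.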

Crash recovery process

The entire crash recovery process can be divided into two parts: redo roll forward and undo rollback.

redo roll forward

Logs before checkpoint_lsn need no attention, because their changes have already been flushed to disk. Recovery first initializes a hash table, then scans the logs after checkpoint_lsn, placing the logs of the same data page in the same hash table slot and sorting them by lsn in ascending order. After the scan, the whole hash table is traversed and the logs of each data page are applied in turn. Once they are applied, the data pages in memory are restored to their state just before the crash.
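The roll forward step can be sketched as "group by page, sort by lsn, apply". The record layout (lsn, page_no, key, value) and the page structure are invented for this sketch; skipping records that a page already contains is an added assumption that makes replay safe to repeat.

```python
from collections import defaultdict


def roll_forward(pages, redo_records, checkpoint_lsn):
    """Replay redo records newer than checkpoint_lsn onto in-memory pages."""
    by_page = defaultdict(list)  # the hash table: page_no -> its log records
    for record in redo_records:
        if record[0] > checkpoint_lsn:
            by_page[record[1]].append(record)

    for page_no, records in by_page.items():
        page = pages.setdefault(page_no, {"lsn": 0, "rows": {}})
        for lsn, _, key, value in sorted(records, key=lambda r: r[0]):
            if lsn > page["lsn"]:  # assumption: skip changes already in the page
                page["rows"][key] = value
                page["lsn"] = lsn
    return pages


pages = {1: {"lsn": 100, "rows": {"a": 1}}}
redo_records = [
    (130, 2, "b", 7),
    (90, 1, "a", 0),   # before the checkpoint, ignored
    (120, 1, "a", 3),
    (110, 1, "a", 2),
]
recovered = roll_forward(pages, redo_records, checkpoint_lsn=100)
```

Page 1 ends with a = 3 at lsn 120, and page 2 is rebuilt with b = 7 at lsn 130, regardless of the order in which the records were scanned.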

undo rollback

Next, the undo log is initialized. It is divided into undo_insert_list and undo_update_list according to the operation type. Both linked lists are traversed, and the transaction state is rebuilt from the state recorded in the log: TRX_ACTIVE means the transaction must be rolled back, and TRX_STATE_PREPARED means it may need to be rolled back. Each transaction is then added to the trx_list linked list, and trx_list is traversed to roll back or commit each transaction according to its state. Transactions in the TRX_ACTIVE state are rolled back directly using the undo log. For transactions in the TRX_STATE_PREPARED state, the binlog of the server layer decides: if the binlog has been written and is complete, the transaction is committed; otherwise it is rolled back.
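The decision rule for each recovered transaction fits in a few lines. The function name and the binlog_complete flag are made up for this sketch; the state names are the ones used in the text.

```python
def resolve_transaction(state, binlog_complete):
    """Decide what crash recovery does with a transaction found in the undo log."""
    if state == "TRX_ACTIVE":
        return "rollback"  # never reached prepare, always rolled back
    if state == "TRX_STATE_PREPARED":
        # The binlog of the server layer is the tie-breaker.
        return "commit" if binlog_complete else "rollback"
    raise ValueError(f"unexpected transaction state: {state}")
```

This is where the two logs are reconciled: a prepared transaction survives only if its binlog was fully written, which keeps the redo log and the binlog consistent with each other.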


Q: When is the undo log deleted?

A: Undo logs can be divided into update, delete, and insert types according to the operation. An inserted row is visible only to the current transaction before that transaction commits, so the undo log it generates can be deleted as soon as the transaction commits. For update and delete operations, other transactions may still need the old version of the data, so the undo log must be kept until the transaction id of the undo operation is smaller than all current transaction snapshots in the database (at which point the change is visible to every transaction in the database). Only then can it be deleted.
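The deletion rule can be written as a predicate. The function and parameter names are hypothetical, and each snapshot is reduced to a single id so that the comparison matches the wording above.

```python
def can_purge_undo(kind, committed, undo_trx_id, active_snapshot_ids):
    """Return True if an undo log of the given kind may be deleted."""
    if not committed:
        return False  # still needed to roll back its own transaction
    if kind == "insert":
        return True   # nobody else could ever need an older version
    # update/delete: keep it while any snapshot might still need the old version
    return all(undo_trx_id < snapshot_id for snapshot_id in active_snapshot_ids)
```

For example, an update undo log written by transaction 50 must be kept while a snapshot with id 40 is still open, and can go once all open snapshots have ids above 50.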
