InnoDB System Tables: Structure Analysis

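Under information_schema, MySQL exposes several system tables named 'INNODB_SYS%'. They record the tables, indexes, and related objects that InnoDB stores in the current instance, and are collectively known as the data dictionary. Their contents live in the ibdata1 system tablespace file. In some situations, even when the .frm file is gone, the table structure can still be recovered by reading ibdata1. This article introduces the structure of the system tablespace and shows how to read the ibdata1 file.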


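Basic File Structure

Page Composition and Types

Both data files and system files consist of multiple pages, each 16 KB by default. Every page serves a specific purpose, and the following page types exist (storage/innobase/include/fil0fil.h):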

/** File page types (values of FIL_PAGE_TYPE) @{ */
#define FIL_PAGE_INDEX 17855 /*!< B-tree node */
#define FIL_PAGE_RTREE 17854 /*!< B-tree node */
#define FIL_PAGE_UNDO_LOG 2 /*!< Undo log page */
#define FIL_PAGE_INODE 3 /*!< Index node */
#define FIL_PAGE_IBUF_FREE_LIST 4 /*!< Insert buffer free list */
/* File page types introduced in MySQL/InnoDB 5.1.7 */
#define FIL_PAGE_TYPE_ALLOCATED 0 /*!< Freshly allocated page */
#define FIL_PAGE_IBUF_BITMAP 5 /*!< Insert buffer bitmap */
#define FIL_PAGE_TYPE_SYS 6 /*!< System page */
#define FIL_PAGE_TYPE_TRX_SYS 7 /*!< Transaction system data */
#define FIL_PAGE_TYPE_FSP_HDR 8 /*!< File space header */
#define FIL_PAGE_TYPE_XDES 9 /*!< Extent descriptor page */

... (some less relevant types omitted)

Every page starts with a 38-byte FIL Header, laid out as follows:

Size (bytes)	Field
4	Checksum
4	Offset (Page Number)
4	Previous Page
4	Next Page
8	LSN for last page modification
2	Page Type
8	Flush LSN
4	Space ID

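Parsing the FIL Header yields several useful values:

offset // the page's position within the file, stored as a page number
previous/next page // page numbers of the neighboring pages, forming a doubly linked list
page type // the type of the page, one of the values listed above
space id // the tablespace ID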
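
As a minimal sketch of the parsing described above, the following Go program decodes the 38 bytes at the start of a page. The byte positions follow the standard FIL Header layout (checksum at 0, page number at 4, previous at 8, next at 12, LSN at 16, page type at 24, flush LSN at 26, space id at 34). The `FilHeader` type and `ParseFilHeader` function are illustrative names, not part of MySQL, and the page in `main` is synthetic.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// FilHeader mirrors the 38-byte FIL Header.
// The struct and its field names are illustrative, not taken from MySQL.
type FilHeader struct {
	Checksum uint32
	PageNo   uint32 // the "offset" field: page number within the tablespace
	Prev     uint32
	Next     uint32
	LSN      uint64
	PageType uint16
	FlushLSN uint64
	SpaceID  uint32
}

const filHeaderSize = 38

// ParseFilHeader decodes the big-endian FIL Header at the start of a page.
func ParseFilHeader(page []byte) (FilHeader, error) {
	if len(page) < filHeaderSize {
		return FilHeader{}, fmt.Errorf("page too short: %d bytes", len(page))
	}
	return FilHeader{
		Checksum: binary.BigEndian.Uint32(page[0:4]),
		PageNo:   binary.BigEndian.Uint32(page[4:8]),
		Prev:     binary.BigEndian.Uint32(page[8:12]),
		Next:     binary.BigEndian.Uint32(page[12:16]),
		LSN:      binary.BigEndian.Uint64(page[16:24]),
		PageType: binary.BigEndian.Uint16(page[24:26]),
		FlushLSN: binary.BigEndian.Uint64(page[26:34]),
		SpaceID:  binary.BigEndian.Uint32(page[34:38]),
	}, nil
}

func main() {
	// Synthetic page for illustration: page 7, type 6 (FIL_PAGE_TYPE_SYS), space 0.
	page := make([]byte, 16384)
	binary.BigEndian.PutUint32(page[4:8], 7)
	binary.BigEndian.PutUint16(page[24:26], 6)

	h, err := ParseFilHeader(page)
	if err != nil {
		panic(err)
	}
	fmt.Printf("page=%d type=%d space=%d\n", h.PageNo, h.PageType, h.SpaceID)
}
```

To use it on a real file, one would read 16384 bytes at `pageNo * 16384` from ibdata1 and pass that slice to `ParseFilHeader`.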

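ibdata1 File Structure

The ibdata1 file is the system tablespace, with space id 0. Its structure is as follows:

As the figure above shows, page 7 of ibdata1 (counting from 0) is a page of type FIL_PAGE_TYPE_SYS that holds the Data Dictionary Header. The offsets of the contents stored in this page are (storage/innobase/include/dict0boot.h):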

/* The offset of the dictionary header on the page */
#define DICT_HDR FSEG_PAGE_DATA // skips the 38-byte FIL Header
/*-------------------------------------------------------------*/
/* Dictionary header offsets */
#define DICT_HDR_ROW_ID 0 /* The latest assigned row id */
#define DICT_HDR_TABLE_ID 8 /* The latest assigned table id */
#define DICT_HDR_INDEX_ID 16 /* The latest assigned index id */
#define DICT_HDR_MAX_SPACE_ID 24 /* The latest assigned space id,or 0*/
#define DICT_HDR_MIX_ID_LOW 28 /* Obsolete,always DICT_HDR_FIRST_ID*/
#define DICT_HDR_TABLES 32 /* Root of SYS_TABLES clust index */
#define DICT_HDR_TABLE_IDS 36 /* Root of SYS_TABLE_IDS sec index */
#define DICT_HDR_COLUMNS 40 /* Root of SYS_COLUMNS clust index */
#define DICT_HDR_INDEXES 44 /* Root of SYS_INDEXES clust index */
#define DICT_HDR_FIELDS 48 /* Root of SYS_FIELDS clust index */
#define DICT_HDR_FSEG_HEADER 56 /* Segment header for the tablespace
                                segment into which the dictionary
                                header is created */

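DICT_HDR_TABLES, DICT_HDR_TABLE_IDS, DICT_HDR_COLUMNS, DICT_HDR_INDEXES, and DICT_HDR_FIELDS hold the root pages of the clustered indexes behind the INNODB_SYS_% system tables (DICT_HDR_TABLE_IDS is the root page of the secondary index on SYS_TABLES). Parsing these pages yields the data in the corresponding system tables.

Parsing ibdata1 with hexdump -C

Page 7 (counting from 0) starts at file offset 0x1c000 (7 * 16384). Skipping 70 bytes from there (FSEG_PAGE_DATA + DICT_HDR_TABLES = 38 + 32) brings us to the root page numbers of these system tables: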

DICT_HDR_TABLES // 8

DICT_HDR_TABLE_IDS // 9

DICT_HDR_COLUMNS // 10

DICT_HDR_INDEXES // 11

DICT_HDR_FIELDS // 12
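
The arithmetic above can be sketched in Go as follows. The constants come from the text (16 KB pages, FSEG_PAGE_DATA = 38, DICT_HDR_TABLES = 32); `DictRoots`, `PageOffset`, and `ParseDictRoots` are hypothetical helper names, and `main` runs against a synthetic page filled with the page numbers quoted above rather than a real ibdata1.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	pageSize      = 16384 // default page size in bytes
	dictHdrPageNo = 7     // page holding the Data Dictionary Header
	fsegPageData  = 38    // FSEG_PAGE_DATA: size of the FIL Header
	dictHdrTables = 32    // DICT_HDR_TABLES
)

// DictRoots holds the five root page numbers stored in the dictionary header.
// The type name is illustrative.
type DictRoots struct {
	Tables, TableIDs, Columns, Indexes, Fields uint32
}

// PageOffset returns the byte offset of page pageNo within the file.
func PageOffset(pageNo int64) int64 {
	return pageNo * pageSize
}

// ParseDictRoots reads the root page numbers from the raw bytes of page 7.
func ParseDictRoots(page []byte) (DictRoots, error) {
	base := fsegPageData + dictHdrTables // 70 bytes into the page
	if len(page) < base+5*4 {
		return DictRoots{}, fmt.Errorf("page too short: %d bytes", len(page))
	}
	root := func(i int) uint32 {
		return binary.BigEndian.Uint32(page[base+4*i:])
	}
	return DictRoots{
		Tables: root(0), TableIDs: root(1), Columns: root(2),
		Indexes: root(3), Fields: root(4),
	}, nil
}

func main() {
	// Synthetic page 7 filled with the root page numbers quoted above.
	page := make([]byte, pageSize)
	for i, no := range []uint32{8, 9, 10, 11, 12} {
		binary.BigEndian.PutUint32(page[70+4*i:], no)
	}
	roots, err := ParseDictRoots(page)
	if err != nil {
		panic(err)
	}
	fmt.Printf("page %d starts at 0x%x, roots: %+v\n",
		dictHdrPageNo, PageOffset(dictHdrPageNo), roots)
}
```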

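Parsing Index Pages

The pages referenced by DICT_HDR_% are of type INDEX. An INDEX page is structured as follows:

The INDEX Header is structured as follows:

In the two figures above, Number of Directory Slots and the Page Directory matter most. Slots speed up lookups inside a page by enabling binary search. Number of Directory Slots gives the total number of slots in the page; each slot is 2 bytes and stores a record offset relative to the start of the page.

The slots are laid out from the end of the page backward, so their offsets are read back to front. The layout looks like this:

For example, the figure above has 7 slots holding the offsets [99, 221, 349, 477, 605, 733, 112]. To find the record with K=10, only the slots holding offsets 477 and 349 need to be examined to locate the data. Each slot owns 4 to 8 records.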
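
A minimal Go sketch of reading the Page Directory, under these assumptions: Number of Directory Slots is the first 2-byte field of the INDEX Header (directly after the 38-byte FIL Header), and the page ends with an 8-byte FIL Trailer, so slot 0 occupies the 2 bytes just before the trailer. `DirectorySlots` is a hypothetical helper name, and the page in `main` is synthetic, filled with the seven offsets from the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	filHeaderSize  = 38 // FIL Header at the start of the page
	filTrailerSize = 8  // FIL Trailer at the end of the page
	slotSize       = 2  // each directory slot is a 2-byte offset
)

// DirectorySlots returns the record offsets stored in the Page Directory.
// Slot 0 sits just before the trailer; later slots lie at lower addresses.
func DirectorySlots(page []byte) ([]uint16, error) {
	if len(page) < filHeaderSize+2+filTrailerSize {
		return nil, fmt.Errorf("page too short: %d bytes", len(page))
	}
	// Number of Directory Slots: first field of the INDEX Header.
	n := int(binary.BigEndian.Uint16(page[filHeaderSize:]))
	end := len(page) - filTrailerSize
	if n*slotSize > end-filHeaderSize-2 {
		return nil, fmt.Errorf("slot count %d does not fit in the page", n)
	}
	slots := make([]uint16, n)
	for i := 0; i < n; i++ {
		pos := end - (i+1)*slotSize
		slots[i] = binary.BigEndian.Uint16(page[pos:])
	}
	return slots, nil
}

func main() {
	// Synthetic page holding the seven offsets from the example.
	page := make([]byte, 16384)
	offsets := []uint16{99, 221, 349, 477, 605, 733, 112}
	binary.BigEndian.PutUint16(page[filHeaderSize:], uint16(len(offsets)))
	for i, off := range offsets {
		pos := len(page) - filTrailerSize - (i+1)*slotSize
		binary.BigEndian.PutUint16(page[pos:], off)
	}
	slots, err := DirectorySlots(page)
	if err != nil {
		panic(err)
	}
	fmt.Println(slots)
}
```

A binary search would then compare the search key against the records these offsets point to, halving the slot range at each step.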

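Parsing Records

Now that we know how to locate data through the Page Directory, we need the storage layout of each record. MySQL has two record formats, old and new (Redundant and Compact). If the highest bit of Number of Heap Records in the Index Header is 1, the page uses the Compact format; otherwise it uses Redundant.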

(storage/innobase/include/page0page.h)

#define PAGE_N_HEAP 4 /* number of records in the heap,
                      bit 15=flag: new-style compact page format */
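
As a small illustration of the flag test, the sketch below reads the 2-byte PAGE_N_HEAP field and splits it into the format flag (bit 15) and the heap record count (low 15 bits). It assumes the INDEX Header starts right after the 38-byte FIL Header, so the field sits at byte 38 + 4 = 42. `HeapInfo` is a hypothetical name.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	pageHeader = 38 // the INDEX Header starts after the FIL Header
	pageNHeap  = 4  // PAGE_N_HEAP offset within the INDEX Header
)

// HeapInfo reports whether the page uses the Compact format and how many
// records are in the heap.
func HeapInfo(page []byte) (compact bool, nHeap uint16, err error) {
	if len(page) < pageHeader+pageNHeap+2 {
		return false, 0, fmt.Errorf("page too short: %d bytes", len(page))
	}
	v := binary.BigEndian.Uint16(page[pageHeader+pageNHeap:])
	return v&0x8000 != 0, v & 0x7fff, nil
}

func main() {
	// Synthetic page: flag bit clear (Redundant), 7 heap records.
	page := make([]byte, 16384)
	binary.BigEndian.PutUint16(page[pageHeader+pageNHeap:], 7)
	compact, n, err := HeapInfo(page)
	if err != nil {
		panic(err)
	}
	fmt.Printf("compact=%v heapRecords=%d\n", compact, n)
}
```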

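The storage layout differs between the two formats. A record consists of two parts, a header and the data.

The header holds the record's length information plus some extra information. It is 6 bytes in the Redundant format and 5 bytes in the Compact format.

(storage/innobase/include/rem0rec.ic)

We are parsing the system tables in ibdata1, which all use the Redundant format. The 6 bytes are laid out as follows: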

/* Offsets of the bit-fields in an old-style record. NOTE! In the table the
most significant bytes and bits are written below less significant.

    (1) byte offset     (2) bit usage within byte
    downward from
    origin ->   1       8 bits pointer to next record
                2       8 bits pointer to next record
                3       1 bit short flag
                        7 bits number of fields
                4       3 bits number of fields
                        5 bits heap number
                5       8 bits heap number
                6       4 bits n_owned
                        4 bits info bits
*/

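This extra information tells us:

whether the record is delete-marked
how many records the current slot contains
the kind of record: heap number 0 is the infimum, 1 is the supremum, and user records start at 2
how many fields the record has
how the field offsets are stored (1 or 2 bytes each)
the offset of the next record (relative to the page)

As an example, let us parse page 8, which holds the contents of innodb_sys_tables:

Page 8 starts at offset 0x20000

Page 9 starts at offset 0x24000

Stepping back 8 bytes from the end of page 8, the 2 bytes just before that point hold the first slot: 00 65 (hex) = 101 (decimal), as shown below:

The 6 bytes immediately preceding offset 101 are the record header. Parsing them according to the layout above looks like this: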

h.deleteFlag = (data[0] & 0x20) != 0
h.minRecFlag = (data[0] & 0x10) != 0
h.Owned = data[0] & 0x0f
h.heapNo |= uint16(data[1]) << 5
h.heapNo |= uint16(data[2]&0xf8) >> 3
h.nField |= uint16(data[2]&0x07) << 7
h.nField |= uint16(data[3]&0xfe) >> 1
h.sFlag = (data[3] & 0x1) != 0
h.nextRecorder = binary.BigEndian.Uint16(data[4:])

The resulting values:

Field	Value
deleteFlag	false
minRecFlag	false
Owned	1
heapNo	0
nField	1
sFlag	true
nextRecorder	366

This tells us the record is the infimum and that the next record is at offset 366.
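
For readers who want to run the header-parsing snippet above, here is a self-contained sketch of the same bit manipulation. The `RecHeader` type and `ParseOldRecHeader` function are illustrative names. The six bytes in `main` were not copied from a real file: they are reconstructed from the decoded infimum values listed above (n_owned 1, heap number 0, one field, short flag set, next record 366).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// RecHeader holds the decoded 6-byte header of a Redundant-format record.
type RecHeader struct {
	DeleteFlag bool
	MinRecFlag bool
	NOwned     uint8
	HeapNo     uint16
	NFields    uint16
	ShortFlag  bool   // true: field offsets are 1 byte each
	Next       uint16 // page-relative offset of the next record
}

const oldExtraBytes = 6

// ParseOldRecHeader decodes the 6 bytes that precede the record origin.
func ParseOldRecHeader(page []byte, origin int) (RecHeader, error) {
	if origin < oldExtraBytes || origin > len(page) {
		return RecHeader{}, fmt.Errorf("origin %d out of range", origin)
	}
	d := page[origin-oldExtraBytes : origin]
	return RecHeader{
		DeleteFlag: d[0]&0x20 != 0,
		MinRecFlag: d[0]&0x10 != 0,
		NOwned:     d[0] & 0x0f,
		HeapNo:     uint16(d[1])<<5 | uint16(d[2]&0xf8)>>3, // 13 bits
		NFields:    uint16(d[2]&0x07)<<7 | uint16(d[3]&0xfe)>>1, // 10 bits
		ShortFlag:  d[3]&0x01 != 0,
		Next:       binary.BigEndian.Uint16(d[4:6]),
	}, nil
}

func main() {
	// Header bytes reconstructed from the decoded infimum values above.
	page := make([]byte, 16384)
	copy(page[95:101], []byte{0x01, 0x00, 0x00, 0x03, 0x01, 0x6e})

	h, err := ParseOldRecHeader(page, 101)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", h)
}
```

Following `Next` from record to record until it reaches the supremum walks every record on the page in key order.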

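The 6 bytes preceding offset 366 are as follows:

Parsed values:

Field	Value
deleteFlag	false
minRecFlag	false
Owned	0
heapNo	5
nField	10
sFlag	true
nextRecorder	141

The record has 10 fields. Reading 10 more bytes backward gives 3a b6 36 32 2a 26 22 1a 13 0d, which are the end offsets of the fields. With these offsets, each field's value can be parsed starting from the fieldOffset position. The figure below makes this more intuitive:

Decoding these 10 bytes yields the length of each field: [13,6,7,8,4,4,8,4,0,4]. Roughly, the bytes are examined from last to first: the highest bit of each byte marks the field as NULL when it is 1, the low 7 bits are the field's end offset, and subtracting the previous field's end offset gives the field's length. For example: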

3a b6 36 32 2a 26 22 1a 13 0d

0d = 0000 1101

13 = 0001 0011

Neither of these two fields is NULL, so the first field is 13 bytes long and the second is 19 - 13 = 6 bytes long.
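
The decoding rule can be sketched in Go as follows, assuming 1-byte offsets (short flag set) where the highest bit is the NULL marker and the low 7 bits are the end offset. `Field` and `DecodeShortOffsets` are hypothetical names; the input is the 10 bytes quoted above, in the order they appear in the file.

```go
package main

import "fmt"

// Field describes one decoded field: its length in bytes and whether it is NULL.
type Field struct {
	Len  int
	Null bool
}

// DecodeShortOffsets decodes 1-byte field end offsets. raw is given in file
// order, so the first field's offset is the LAST byte of raw.
func DecodeShortOffsets(raw []byte) []Field {
	fields := make([]Field, len(raw))
	prev := 0
	for i := range fields {
		b := raw[len(raw)-1-i]
		end := int(b & 0x7f) // low 7 bits: end offset of the field
		fields[i] = Field{Len: end - prev, Null: b&0x80 != 0}
		prev = end
	}
	return fields
}

func main() {
	raw := []byte{0x3a, 0xb6, 0x36, 0x32, 0x2a, 0x26, 0x22, 0x1a, 0x13, 0x0d}
	lens := make([]int, 0, len(raw))
	for _, f := range DecodeShortOffsets(raw) {
		lens = append(lens, f.Len)
	}
	fmt.Println(lens) // prints [13 6 7 8 4 4 8 4 0 4]
}
```

The ninth field (byte b6) has its highest bit set, which is why it decodes as a NULL field of length 0.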

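With each field's length known, the field values can be parsed starting from the fieldOffset position:

The clustered index of sys_tables is defined as follows: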

enum dict_fld_sys_tables_enum {
    DICT_FLD__SYS_TABLES__NAME = 0,
    DICT_FLD__SYS_TABLES__DB_TRX_ID = 1,
    DICT_FLD__SYS_TABLES__DB_ROLL_PTR = 2,
    DICT_FLD__SYS_TABLES__ID = 3,
    DICT_FLD__SYS_TABLES__N_COLS = 4,
    DICT_FLD__SYS_TABLES__TYPE = 5,
    DICT_FLD__SYS_TABLES__MIX_ID = 6,
    DICT_FLD__SYS_TABLES__MIX_LEN = 7,
    DICT_FLD__SYS_TABLES__CLUSTER_ID = 8,
    DICT_FLD__SYS_TABLES__SPACE = 9,
    DICT_NUM_FIELDS__SYS_TABLES = 10
};

For the parsing logic, see the function dict_sys_tables_rec_read (storage/innobase/dict/dict0load.cc). The parsed values:

Field	Value
DICT_FLD__SYS_TABLES__NAME	SYS_DATAFILES
DICT_FLD__SYS_TABLES__DB_TRX_ID	769
DICT_FLD__SYS_TABLES__DB_ROLL_PTR	45317471250485761
DICT_FLD__SYS_TABLES__ID	14
DICT_FLD__SYS_TABLES__N_COLS	2
DICT_FLD__SYS_TABLES__TYPE	1
DICT_FLD__SYS_TABLES__MIX_ID	0
DICT_FLD__SYS_TABLES__MIX_LEN	64
DICT_FLD__SYS_TABLES__CLUSTER_ID	null
DICT_FLD__SYS_TABLES__SPACE	0

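For comparison, here is the corresponding row queried from the INNODB_SYS_TABLES table:

The query reports n_cols as 5, while we parsed 2. The difference is that the query also counts the three hidden columns (DB_TRX_ID, DB_ROLL_PTR, DB_ROW_ID).

How are file_format and row_format derived?

The 32nd (highest) bit of n_cols encodes the row format: when it is 1, the table is not REDUNDANT but COMPACT (or a newer format).

file_format is a little special: it has to be computed from the parsed type and n_cols together. The code does it like this (storage/innobase/include/dict0dict.ic):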

/**
Convert a 32 bit integer from SYS_TABLES.TYPE to dict_table_t::flags
The following chart shows the translation of the low order bit.
Other bits are the same.
========================= Low order bit ==========================
                    | REDUNDANT | COMPACT | COMPRESSED and DYNAMIC
SYS_TABLES.TYPE     |     1     |    1    |     1
dict_table_t::flags |     0     |    1    |     1
==================================================================
@return ulint containing SYS_TABLES.TYPE */
UNIV_INLINE
ulint
dict_sys_tables_type_to_tf(
/*=======================*/
    ulint type, /*!< in: SYS_TABLES.TYPE field */
    ulint n_cols) /*!< in: SYS_TABLES.N_COLS field */
{
    ulint flags;
    ulint redundant = !(n_cols & DICT_N_COLS_COMPACT); // is the row format REDUNDANT or COMPACT?

    /* Adjust bit zero. */
    flags = redundant ? 0 : 1;

    /* ZIP_SSIZE, ATOMIC_BLOBS & DATA_DIR are the same. */
    flags |= type & (DICT_TF_MASK_ZIP_SSIZE
                     | DICT_TF_MASK_ATOMIC_BLOBS
                     | DICT_TF_MASK_DATA_DIR
                     | DICT_TF_MASK_SHARED_SPACE); // apart from the lowest bit, '&' type with the flag masks to test for these attributes, then store the result in flags

    ut_ad(!DICT_TF_GET_ZIP_SSIZE(flags) || DICT_TF_HAS_ATOMIC_BLOBS(flags));

    return(flags);
}

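Field	Value
DICT_FLD__SYS_TABLES__N_COLS	2
DICT_FLD__SYS_TABLES__TYPE	1

n_cols is 2, so its 32nd bit is 0; therefore redundant=1 and the low bit of flags is 0.

type is 1; after the '&' with the flag masks every bit is 0, so the final flags value is 0. The figure below illustrates this:

At query time the flags are converted to determine the file_format: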

i_s_dict_fill_sys_tables(
/*=====================*/
    THD* thd, /*!< in: thread */
    dict_table_t* table, /*!< in: table */
    TABLE* table_to_fill) /*!< in/out: fill this table */
{
    Field** fields;
    ulint compact = DICT_TF_GET_COMPACT(table->flags); // read the compact flag bit
    ulint atomic_blobs = DICT_TF_HAS_ATOMIC_BLOBS(
        table->flags); // read the atomic_blobs flag bit; it means blob or text fields are stored off-page, which in turn means file_format is Barracuda
    const page_size_t& page_size = dict_tf_get_page_size(table->flags);
    const char* file_format;
    const char* row_format;
    const char* space_type;

    file_format = trx_sys_file_format_id_to_name(atomic_blobs); // see file_format_name_map, an array of pointers: index 0 is Antelope, index 1 is Barracuda

    if (!compact) {
        row_format = "Redundant";
    } else if (!atomic_blobs) {
        row_format = "Compact";
    } else if (DICT_TF_GET_ZIP_SSIZE(table->flags)) {
        row_format = "Compressed";
    } else {
        row_format = "Dynamic";
    }

    if (is_system_tablespace(table->space)) {
        space_type = "System";
    } else if (DICT_TF_HAS_SHARED_SPACE(table->flags)) {
        space_type = "General";
    } else {
        space_type = "Single";
    }

    // remaining code omitted
}

compact is 0, so !compact is 1 and row_format is Redundant.

atomic_blobs is also 0, so file_format is Antelope.

Now let us apply the same rules to a record for a non-system table:

Field	Value
DICT_FLD__SYS_TABLES__NAME	dhy/dhytest2
DICT_FLD__SYS_TABLES__DB_TRX_ID	27446
DICT_FLD__SYS_TABLES__DB_ROLL_PTR	46161896180619265
DICT_FLD__SYS_TABLES__ID	45
DICT_FLD__SYS_TABLES__N_COLS	2147483650
DICT_FLD__SYS_TABLES__TYPE	33
DICT_FLD__SYS_TABLES__MIX_ID	0
DICT_FLD__SYS_TABLES__MIX_LEN	80
DICT_FLD__SYS_TABLES__CLUSTER_ID	null
DICT_FLD__SYS_TABLES__SPACE	82

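n_cols is 2147483650 (0x80000002): the 32nd bit is 1, so redundant=0 and the low bit of flags is 1.

type is 33, binary 0010 0001. After the '&' with the flag masks the result is 0010 0000, and after the '|' with flags it becomes 0010 0001. Both atomic_blobs and compact are therefore 1 while ZIP_SSIZE is 0, so file_format is Barracuda and row_format is Dynamic.

Clearing the flag in the 32nd bit of n_cols leaves n_cols = 2.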
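
Both worked examples can be reproduced with the Go sketch below. It mirrors the logic of the two C functions quoted earlier, but the bit positions are an assumption based on the flag layout in MySQL 5.7 (COMPACT at bit 0, ZIP_SSIZE at bits 1 to 4, ATOMIC_BLOBS at bit 5, DATA_DIR at bit 6, SHARED_SPACE at bit 7); check them against dict0mem.h for the version at hand. All Go names are illustrative.

```go
package main

import "fmt"

// Bit positions assumed from MySQL 5.7; verify against dict0mem.h.
const (
	nColsCompact   = 0x80000000 // DICT_N_COLS_COMPACT: 32nd bit of N_COLS
	tfCompact      = 1 << 0
	tfMaskZipSsize = 0xf << 1
	tfAtomicBlobs  = 1 << 5
	tfDataDir      = 1 << 6
	tfSharedSpace  = 1 << 7
)

// TypeToFlags mirrors dict_sys_tables_type_to_tf: bit 0 comes from N_COLS,
// the remaining attribute bits are copied from TYPE.
func TypeToFlags(typ, nCols uint32) uint32 {
	var flags uint32
	if nCols&nColsCompact != 0 {
		flags = tfCompact
	}
	return flags | typ&(tfMaskZipSsize|tfAtomicBlobs|tfDataDir|tfSharedSpace)
}

// Formats derives row_format and file_format from the flags, following the
// if/else chain in i_s_dict_fill_sys_tables.
func Formats(flags uint32) (rowFormat, fileFormat string) {
	compact := flags&tfCompact != 0
	atomicBlobs := flags&tfAtomicBlobs != 0

	fileFormat = "Antelope"
	if atomicBlobs {
		fileFormat = "Barracuda"
	}
	switch {
	case !compact:
		rowFormat = "Redundant"
	case !atomicBlobs:
		rowFormat = "Compact"
	case flags&tfMaskZipSsize != 0:
		rowFormat = "Compressed"
	default:
		rowFormat = "Dynamic"
	}
	return rowFormat, fileFormat
}

// UserCols clears the format flag to recover the real column count.
func UserCols(nCols uint32) uint32 {
	return nCols &^ nColsCompact
}

func main() {
	// The two records parsed in this article: (TYPE, N_COLS).
	for _, rec := range [][2]uint32{{1, 2}, {33, 2147483650}} {
		flags := TypeToFlags(rec[0], rec[1])
		row, file := Formats(flags)
		fmt.Printf("type=%d n_cols=%d flags=%#x row_format=%s file_format=%s\n",
			rec[0], UserCols(rec[1]), flags, row, file)
	}
}
```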

This matches what a query against the table returns:


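Conclusion

This article covered the basic structure of the InnoDB system tablespace and how to parse the records in the ibdata tablespace. Reading through reference material and the MySQL source code taught me many techniques, such as bit shifting, bitwise logic, and ways to save space. A follow-up article will cover how the system tables are loaded.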