MySQL Source Code and Database Standards

Gu Lei (古雷), Senior MySQL DBA at CNTV (央視國際網路)

Member of CMUG (China MySQL User Group)

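Gu Lei, Senior MySQL Engineer at CNTV

• Has worked in the MySQL DBA field for many years

• Grew professionally at Sohu and Changyou

• Currently continuing to study MySQL in depth at CNTV

• Enjoys studying the MySQL source code

• Actively takes part in MySQL user group (CMUG) activities

• Because of an interest in Buddhism, is nicknamed "Master Gu" (古大師); this is purely a nickname given by friends, so please do not read too much into it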

Terms That May Come Up

• 資料庫 / 數據庫: database

• 資料 / 數據: data

• 欄位 / 字段: column, field

• 1位元 / 1位: one bit

• 1位元組 / 1個字節: one byte

• 檔案 / 文件: file

• 暫存檔案 / 臨時文件: temporary file

• 記憶體 / 內存: memory

• 字元 / 字符: character

• 網路 / 網絡: network

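Why Read the Source Code

• After operating MySQL for years, the more I use it, the more I want to know why it behaves as it does

• The documentation alone does not satisfy that curiosity

• When rolling out database development standards, colleagues in the development (R&D) department also want to know more of the "why"

• Progress so far:

• Learned part of how execution plans are executed

• Learned part of the structure of an InnoDB page

• Learned a little about SQL cost estimation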

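Database Development Standards

• Study and borrow from the standards of several companies

• Draw up and roll out standards that suit our own company

• Is every rule unbreakable? Does each rule have a scope of applicability?

• Can I persuade the developers?

• Try turning to the source code for help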

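FILESORT: Starting from a Sorting SQL Statement

• In MySQL, a filesort occurs when ORDER BY cannot use an index

• This is a simplified example; the index is deliberately omitted in order to demonstrate filesort

• In practice, some slightly more complex SQL statements cannot easily use an index, or an index really is missing

• The example is meant solely to show the effect of filesort in a simplified setting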

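FILESORT: Selecting One More Column Makes a 6x Difference in Time

• select * returns only one more column than select id,name: address varchar(255)

• Moreover, address is empty in every row, i.e. NULL

• Selecting that one extra column makes the query about 6 times slower: 1.19 seconds vs. 0.20 seconds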

mysql> flush status;
mysql> select * from testsort order by name;
mysql> select * from information_schema.session_status where variable_name in ('created_tmp_files','sort_merge_passes');
+-------------------+----------------+
| VARIABLE_NAME     | VARIABLE_VALUE |
+-------------------+----------------+
| CREATED_TMP_FILES | 3              |
| SORT_MERGE_PASSES | 3              |
+-------------------+----------------+

mysql> flush status;
mysql> select id,name from testsort order by name;
mysql> select * from information_schema.session_status where variable_name in ('created_tmp_files','sort_merge_passes');
+-------------------+----------------+
| VARIABLE_NAME     | VARIABLE_VALUE |
+-------------------+----------------+
| CREATED_TMP_FILES | 0              |
| SORT_MERGE_PASSES | 0              |
+-------------------+----------------+

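FILESORT: Computing the Sort Memory

• select id,name,address from testsort order by name;

• The ORDER BY column is name varchar(50) utf8:

• 50 utf8 characters occupy 50*3 bytes; converted to a 2-byte encoding this becomes (150*2+2)/3=100 (integer division)

• name is nullable, so 1 more byte is needed: 101 bytes in total

• The selected columns id, name, address:

• id int occupies 4 bytes; name occupies 50*3+1+1=152 (the two 1s arise because 150<255 and 151<255)

• address varchar(255): 255*3+2+2=769 (the two 2s arise because 765>255 and 767>255)

• name and address are nullable and share 1 byte of NULL flags (one bit each)

• One char* (pointer) occupies 8 bytes (it exists only in the sort buffer and is not stored in the temporary file)

• Total: 101+4+152+769+1+8=1035 bytes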
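The arithmetic on this slide can be reproduced with a short Python sketch. The function names are hypothetical, and the exact boundary at which a VARCHAR switches from a 1-byte to a 2-byte length prefix is an assumption: the slide only shows that 150 and 151 fall below 255 while 765 and 767 fall above it.

```python
def sort_key_bytes(chars, bytes_per_char=3, nullable=True):
    """Sort-key length for a utf8 VARCHAR, following the slide's formula."""
    raw = chars * bytes_per_char
    key = (raw * 2 + 2) // 3          # integer division, as in (150*2+2)/3=100
    return key + (1 if nullable else 0)


def packed_varchar_bytes(chars, bytes_per_char=3):
    """Payload plus the two length prefixes counted on the slide."""
    raw = chars * bytes_per_char
    first = 1 if raw <= 255 else 2            # assumed boundary
    second = 1 if raw + first <= 255 else 2   # assumed boundary
    return raw + first + second


def filesort_row_bytes(key_chars, varchar_chars, int_columns=1, pointer=8):
    """Per-row size in the sort buffer for the layout described on the slide."""
    key = sort_key_bytes(key_chars)
    ints = 4 * int_columns
    varchars = sum(packed_varchar_bytes(n) for n in varchar_chars)
    null_flags = 1  # one shared byte, one bit per nullable column (two here)
    return key + ints + varchars + null_flags + pointer


# name varchar(50) is both the sort key and a selected column;
# address varchar(255) is only selected.
row_bytes = filesort_row_bytes(key_chars=50, varchar_chars=[50, 255])
print(row_bytes)
```

Dropping address from the select list removes 769 of those bytes per row, which is why the narrower query fits in the sort buffer.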

A One-Row Difference

mysql> set sort_buffer_size=1035*122881; flush status;
mysql> select * into outfile '/data/dump/sort.txt' from testsort order by name;
Query OK, 122881 rows affected (0.56 sec)
mysql> select * from information_schema.session_status where variable_name in ('created_tmp_files','sort_merge_passes');
+-------------------+----------------+
| VARIABLE_NAME     | VARIABLE_VALUE |
+-------------------+----------------+
| CREATED_TMP_FILES | 0              |
| SORT_MERGE_PASSES | 0              |
+-------------------+----------------+

mysql> set sort_buffer_size=1035*122880; flush status;
mysql> select * into outfile '/data/dump/sort.txt' from testsort order by name;
Query OK, 122881 rows affected (0.91 sec)
mysql> select * from information_schema.session_status where variable_name in ('created_tmp_files','sort_merge_passes');
+-------------------+----------------+
| VARIABLE_NAME     | VARIABLE_VALUE |
+-------------------+----------------+
| CREATED_TMP_FILES | 2              |
| SORT_MERGE_PASSES | 1              |
+-------------------+----------------+
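The boundary in this experiment can be checked with a few lines of Python. This is a simplified sketch: it assumes the whole sort_buffer_size is available for fixed-size rows of 1035 bytes, which matches what the slide observed but ignores any other bookkeeping MySQL may do. The function names are hypothetical.

```python
ROW_BYTES = 1035      # per-row size computed for select id,name,address
TABLE_ROWS = 122881   # rows in testsort, from the "rows affected" output


def rows_that_fit(sort_buffer_size, row_bytes=ROW_BYTES):
    """How many fixed-size rows fit in the sort buffer."""
    return sort_buffer_size // row_bytes


def needs_temp_file(sort_buffer_size, table_rows=TABLE_ROWS):
    """True when the rows cannot all be sorted in memory."""
    return rows_that_fit(sort_buffer_size) < table_rows


for size in (1035 * 122881, 1035 * 122880):
    print(size, rows_that_fit(size), needs_temp_file(size))
```

A buffer that is short by a single row is enough to push the sort onto disk, which is the difference between 0.56 and 0.91 seconds in the session above.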

FILESORT: Choosing Between the Two Algorithms

mysql> set max_length_for_sort_data=1024;
mysql> set sort_buffer_size=117*122881;
mysql> flush status;
mysql> show global status like 'innodb_rows_read';
+------------------+---------+
| Variable_name    | Value   |
+------------------+---------+
| Innodb_rows_read | 1720348 |
+------------------+---------+
mysql> select * into outfile '/data/dump/sort.txt' from testsort order by name;
Query OK, 122881 rows affected (0.66 sec)
mysql> select * from information_schema.session_status where variable_name in ('created_tmp_files','sort_merge_passes','innodb_rows_read');
+-------------------+----------------+
| VARIABLE_NAME     | VARIABLE_VALUE |
+-------------------+----------------+
| CREATED_TMP_FILES | 0              |
| INNODB_ROWS_READ  | 1966110        |
| SORT_MERGE_PASSES | 0              |
+-------------------+----------------+

• max_length_for_sort_data=1024 puts the per-row length limit for filesort below the 1035 bytes this query needs

• filesort therefore chooses the other algorithm, in which each row holds only the sort KEY, the primary key (8 bytes) and a pointer (8 bytes)

• This time sort_buffer_size is much smaller, yet it is still sufficient

• Also, 1966110 - 1720348 = 245762, which is exactly twice 122881
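The choice between the two layouts can be sketched as follows. The byte counts come from the slides; the function name, the labels and the use of a strict `>` comparison are illustrative assumptions, not MySQL's actual code.

```python
def choose_filesort_layout(full_row_bytes, max_length_for_sort_data,
                           key_bytes=101, row_ref_bytes=8, pointer_bytes=8):
    """Return (label, bytes per row kept in the sort buffer)."""
    if full_row_bytes > max_length_for_sort_data:   # assumed comparison
        # Keep only the sort key and a reference back to the row.
        return "key + row reference", key_bytes + row_ref_bytes + pointer_bytes
    # Keep the sort key together with every selected column.
    return "key + selected columns", full_row_bytes


rows = 122881
label, per_row = choose_filesort_layout(1035, 1024)
print(label, per_row, per_row * rows)

# With the key-only layout every row is read twice: once to build the
# sort key and once more to fetch the columns in sorted order.
rows_read = 1966110 - 1720348
print(rows_read, rows_read == 2 * rows)
```

The trade-off is visible in the numbers: the buffer shrinks from 1035 to 117 bytes per row, at the price of reading every row a second time.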

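FILESORT: Turning This into Database Standards

• Do not use SELECT *

• After SELECT, name the columns that are actually needed, and as few of them as will do

• The n in VARCHAR(n) matters too: keep it as small as will do

• Put TEXT columns and large VARCHAR columns that never appear in WHERE, ORDER BY or GROUP BY into a separate table

• These habits avoid a great deal of later optimization work

• Some of that optimization work is very costly, for example splitting large columns out into a new table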

Calculating SORT_MERGE_PASSES

mysql> desc tab;
+---------+--------------+------+-----+---------+----------------+
| Field   | Type         | Null | Key | Default | Extra          |
+---------+--------------+------+-----+---------+----------------+
| id      | int(11)      | NO   | PRI | NULL    | auto_increment |
| name    | varchar(15)  | YES  |     | NULL    |                |
| address | varchar(200) | YES  |     | NULL    |                |
+---------+--------------+------+-----+---------+----------------+
mysql> select * into outfile '/media/psf/Home/workspace/dump/sort.txt' from tab order by name;
Query OK, 63584 rows affected (3.70 sec)
mysql> select * from information_schema.session_status where variable_name in ('created_tmp_files','sort_merge_passes');
+-------------------+----------------+
| VARIABLE_NAME     | VARIABLE_VALUE |
+-------------------+----------------+
| CREATED_TMP_FILES | 3              |
| SORT_MERGE_PASSES | 28             |
+-------------------+----------------+

Counting Calls to the Relevant Functions

Introducing a New Toy: SystemTap

root@ubuntu:~# stap calls_count.stp /usr/local/mysql56debug/bin/mysqld

make_sortkey 63584
write_keys 169
merge_buffers 28
create_temp_file 3
filesort_free_buffers 3
filesort 1
trace_filesort_information 1
init_for_filesort 1
merge_many_buff 1
merge_index 1

SystemTap Script

root@ubuntu:~# cat calls_count.stp
global calls

# Count calls to the filesort-related functions in the binary passed as @1
probe process(@1).function("*write_keys*").call,
      process(@1).function("*filesort*").call,
      process(@1).function("*merge_buffers*").call,
      process(@1).function("*create_temp_file*").call,
      process(@1).function("*make_sortkey*").call,
      process(@1).function("*merge_many_buff*").call,
      process(@1).function("*merge_index*").call
{
  calls[ppfunc()]++
}

# Every 10 seconds, print the 30 most frequently called functions
probe timer.s(10) {
  foreach (name in calls- limit 30)
    printf("%s\t%d\n", name, calls[name])
  print("\n")
}

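Calculating SORT_MERGE_PASSES

• Merge batch 1: 169=7*24+1; that is, 24 merges of sorted results, the first 23 merging 7 results each and the last merging 8

• Merge batch 2: 24=7*3+3; that is, 3 merges of sorted results, the first 2 merging 7 results each and the last merging 10

• Merge batch 3: 3 sorted results remain, and 1 final merge produces the end result

• 24+3+1=28; this is the number of calls to merge_buffers, and also the value of sort_merge_passes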
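The three batches can be reproduced with a small simulation. The merge width of 7 comes from the slides; the threshold of 15 below which a single final merge is performed, and the rule that the last merge of a batch absorbs the leftover results, are assumptions chosen because they reproduce the 23x7+8 and 2x7+10 groupings. The function name is hypothetical.

```python
def simulate_merge_batches(sorted_runs, width=7, final_threshold=15):
    """Return a list of batches; each batch lists how many runs each merge took."""
    if sorted_runs <= 1:
        return []                      # nothing to merge
    batches = []
    last = sorted_runs - 1             # index of the last run, counting from 0
    while last >= final_threshold:
        sizes = []
        i = 0
        # Merge full groups while at least about 1.5 groups remain.
        while i <= last - width * 3 // 2:
            sizes.append(width)
            i += width
        sizes.append(last - i + 1)     # the tail merge takes whatever is left
        batches.append(sizes)
        last = len(sizes) - 1          # each merge produced one new run
    batches.append([last + 1])         # single final merge
    return batches


batches = simulate_merge_batches(169)
for number, sizes in enumerate(batches, start=1):
    print(number, len(sizes), sizes)
total_merges = sum(len(sizes) for sizes in batches)
print(total_merges)
```

Each entry in a batch corresponds to one merge_buffers call, so the total is the figure to compare against sort_merge_passes.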

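What the Call Counts Mean

make_sortkey 63584: builds the sort key, which in this example is the name column

write_keys 169: each time a sort completes in the sort buffer, the sorted result is written to a temporary file; 169 writes in total, which also means 169 sorts

merge_buffers 28: merges sorted results, i.e. performs the merge sort, 7 results at a time

create_temp_file 3: 3 temporary files were created, one of which should be as large as 16*sort_buffer_size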

SystemTap: Monitoring Temporary-File Reads and Writes

• root@ubuntu:~# stap iotime.stp 19120 ## part of the output has been removed

• (mysqld) access /tmp/MY2h0vCA read: 43682208 write: 43682208

• (mysqld) access /tmp/MYfjNKjs read: 87364416 write: 87364416

• (mysqld) access /tmp/MYXM65AV read: 41711104 write: 41711104

• (mysqld) access /media/psf/Home/workspace/dump/sort.txt read: 0 write: 36424243

• id int is 4 bytes, name varchar(15) utf8 is 45+1+1 bytes, address varchar(200) utf8 is 600+2+2 bytes; with 1 byte of NULL flags the total is 656

• name as the sort KEY: (45*2+2)/3=30, plus 1 NULL byte, 31 in total

• (656+31)*63584=43682208; 43682208*2=87364416

• 41711104/63584=656

• The results of the 169 sorts are written to /tmp/MYfjNKjs; the first merge batch writes to /tmp/MY2h0vCA; the second merge batch writes to /tmp/MYfjNKjs; the third merge batch writes to /tmp/MYXM65AV; the final output is read from /tmp/MYXM65AV
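The byte counts reported by iotime.stp can be cross-checked against the per-row sizes. This sketch uses only numbers from the slide; the variable names are hypothetical.

```python
ROWS = 63584
ROW_BYTES = 656   # id 4 + name 45+1+1 + address 600+2+2 + 1 byte of NULL flags
KEY_BYTES = 31    # (45*2+2)//3 + 1 NULL byte

# Bytes written to each temporary file, as reported by iotime.stp
observed = {
    "/tmp/MY2h0vCA": 43682208,   # first merge batch
    "/tmp/MYfjNKjs": 87364416,   # 169 sorted results, then second merge batch
    "/tmp/MYXM65AV": 41711104,   # third (final) merge batch
}

one_pass_with_key = (ROW_BYTES + KEY_BYTES) * ROWS   # key stored with each row
final_without_key = ROW_BYTES * ROWS                 # key dropped at the end

print(one_pass_with_key, 2 * one_pass_with_key, final_without_key)
print(observed["/tmp/MYXM65AV"] // ROWS)
```

The file that is written twice holds exactly two full passes, and the final file is smaller because the 31-byte key is no longer stored with each row.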

IOTIME.STP, a Script from the SystemTap Website (1)

root@ubuntu:~# cat iotime.stp
#! /usr/bin/env stap

global start
global time_io

function timestamp:long() { return gettimeofday_us() - start }

function proc:string() { return sprintf("%d (%s)", pid(), execname()) }

probe begin { start = gettimeofday_us() }

global filehandles, fileread, filewrite

# Remember the file name behind each descriptor opened by the target PID (@1)
probe syscall.open.return {
  if (pid() == strtol(@1, 10)) {
    filename = user_string($filename)
    if ($return != -1) {
      filehandles[pid(), $return] = filename
    } else {
      printf("%d %s access %s fail\n", timestamp(), proc(), filename)
    }
  }
}

IOTIME.STP (2)

# Accumulate bytes read and time spent per descriptor; MySQL uses pread
probe syscall.read.return, syscall.pread.return {
  if (pid() == strtol(@1, 10)) {
    p = pid()
    fd = $fd
    bytes = $return
    time = gettimeofday_us() - @entry(gettimeofday_us())
    if (bytes > 0)
      fileread[p, fd] += bytes
    time_io[p, fd] <<< time
  }
}

IOTIME.STP (3)

probe syscall.write.return {
  if (pid() == strtol(@1, 10)) {
    p = pid()
    fd = $fd
    bytes = $return
    time = gettimeofday_us() - @entry(gettimeofday_us())
    if (bytes > 0)
      filewrite[p, fd] += bytes
    time_io[p, fd] <<< time
  }
}

IOTIME.STP (4)

probe syscall.close {
  if (pid() == strtol(@1, 10)) {
    if ([pid(), $fd] in filehandles) {
      printf("%d %s access %s read: %d write: %d\n",
             timestamp(), proc(), filehandles[pid(), $fd],
             fileread[pid(), $fd], filewrite[pid(), $fd])
      if (@count(time_io[pid(), $fd]))
        printf("%d %s iotime %s time: %d\n",
               timestamp(), proc(), filehandles[pid(), $fd],
               @sum(time_io[pid(), $fd]))
    }
    delete fileread[pid(), $fd]
    delete filewrite[pid(), $fd]
    delete filehandles[pid(), $fd]
    delete time_io[pid(), $fd]
  }
}

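Summary

• Reading the MySQL source code is great fun

• What you learn carries over naturally into database development standards

• SystemTap is very powerful and well worth learning and using

• Currently studying: the formulas behind cost and rows in optimizer_trace

• Any questions? Let's discuss. Thank you all.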