Upload
morgan-tocker
View
165
Download
11
Tags:
Embed Size (px)
DESCRIPTION
Citation preview
<Insert Picture Here>
The InnoDB Storage Engine for MySQL Morgan Tocker, MySQL Community Managerhttp://www.tocker.ca/
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
<Insert Picture Here> MySQL 5.5MySQL Cluster 7.3
MySQL Enterprise Monitor 2.3 & 3.0MySQL Enterprise Backup Security Scalability HA Audit
MySQL 5.6MySQL Workbench 6.0
M y S Q L U t i l i t i e s
M y S Q L A p p l i e r f o r
H a d o o p
MySQL Workbench 5.2 & 6.0M y S Q L E n t e r p r i s e O r a c l e C e r t i f i c a t i o n s
4 Years of MySQL Innovation
M y S Q L C l u s t e r M a n a g e r
Windows installer & Tools
MySQL Cluster 7.2MySQL Cluster 7.1
MySQL Migration Wizard
MySQL 5.7
Hello and Welcome!
• I will be talking about InnoDB’s internal behaviour. • Not talking (much) about MySQL. • Aim of this talk is to give you X-ray vision.
• i.e. not so many direct takeaways, but one day it will help you debug a problem.
Copyright © 2012 Oracle and/or its affiliates. All rights reserved.
Prerequisites
MySQL Architecture
L1 cache reference 0.5 ns!Branch mispredict 5 ns!L2 cache reference 7 ns!Mutex lock/unlock 25 ns!Main memory reference 100 ns!Compress 1K bytes with Zippy 3,000 ns!Send 2K bytes over 1 Gbps network 20,000 ns!Read 1 MB sequentially from memory 250,000 ns!Round trip within same datacenter 500,000 ns!Disk seek 10,000,000 ns!Read 1 MB sequentially from disk 20,000,000 ns!Send packet CA->Netherlands->CA 150,000,000 ns
IO Performance
See: http://www.linux-mag.com/cache/7589/1.html and Google http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
• 5-10ms per disk IO. • Maybe 50us for a high end SSD.
• Still not “memory speed”.
IO Performance (cont.)
• Operating Systems compensate well already. • Reads are cached with free memory. • Writes don’t happen instantly.
• A step is introduced to rewrite and merge.
Buffered IO
Block 9, 10, 1, 4, 200, 5.
Block 1, 4, 5, 9, 10, 200
fsync
Synopsis #include <unistd.h> int fsync(int fd); int fdatasync(int fd); !Description fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) where that file resides. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).
Copyright © 2012 Oracle and/or its affiliates. All rights reserved.
Basic Operation
InnoDB High Level OverviewTransa
ctio
nSystem
Storage
Caching
SYS_TABLES
ibda
ta1
spac
e 0
Page Cache
A.ibd
B.ibd
C.ibd
IBUF_HEADERIBUF_TREETRX_SYS
FIRST_RSEGDICT_HDR
Data Dict.
SYS_COLUMNSSYS_INDEXESSYS_FIELDS
Block 1 (64 pages)Block 2 (64 pages)
iblogfile0 iblogfile1 iblogfile2
Tables withfile_per_tableDoublewrite Buffer
Buffe
r Poo
l Data Dictionary Cache
Adaptive Hash Indexes
Buffer Pool LRU
Additional Mem Pool
Log Buffer
Log
Gro
up
Buffer Pool Flush List
In Memory On Disk
Query (pages not in buffer pool)
InnoDB
Transa
ctio
nSystem
Storage
Caching
SYS_TABLES
ibda
ta1
spac
e 0
Page Cache
A.ibd
B.ibd
C.ibd
IBUF_HEADERIBUF_TREETRX_SYS
FIRST_RSEGDICT_HDR
Data Dict.
SYS_COLUMNSSYS_INDEXESSYS_FIELDS
Block 1 (64 pages)Block 2 (64 pages)
iblogfile0 iblogfile1 iblogfile2
Tables withfile_per_tableDoublewrite Buffer
Buffe
r Poo
l Data Dictionary Cache
Adaptive Hash Indexes
Buffer Pool LRU
Additional Mem Pool
Log Buffer
Log
Gro
upBuffer Pool Flush List
SELECT * FROM a WHERE id = 10;
mysqld
Not Found
Query (pages in buffer pool)
InnoDB
Transa
ctio
nSystem
Storage
Caching
SYS_TABLES
ibda
ta1
spac
e 0
Page Cache
A.ibd
B.ibd
C.ibd
IBUF_HEADERIBUF_TREETRX_SYS
FIRST_RSEGDICT_HDR
Data Dict.
SYS_COLUMNSSYS_INDEXESSYS_FIELDS
Block 1 (64 pages)Block 2 (64 pages)
iblogfile0 iblogfile1 iblogfile2
Tables withfile_per_tableDoublewrite Buffer
Buffe
r Poo
l Data Dictionary Cache
Adaptive Hash Indexes
Buffer Pool LRU
Additional Mem Pool
Log Buffer
Log
Gro
upBuffer Pool Flush List
SELECT * FROM a WHERE id = 10;
mysqld
Update Query in a Transaction (simplified)
InnoDB
Transa
ctio
nSystem
Storage
Caching
SYS_TABLES
ibda
ta1
spac
e 0
Page Cache
A.ibd
B.ibd
C.ibd
IBUF_HEADERIBUF_TREETRX_SYS
FIRST_RSEGDICT_HDR
Data Dict.
SYS_COLUMNSSYS_INDEXESSYS_FIELDS
Block 1 (64 pages)Block 2 (64 pages)
iblogfile0 iblogfile1 iblogfile2
Tables withfile_per_tableDoublewrite Buffer
Buffe
r Poo
l Data Dictionary Cache
Adaptive Hash Indexes
Buffer Pool LRU
Additional Mem Pool
Log Buffer
Log
Gro
upBuffer Pool Flush List
UPDATE a SET col1 = ‘new’ WHERE id = 10;
mysqld
commit;
Log files
• Provide recovery. • Only written to in regular operation. • Read only required if there is a crash.
• Are rewritten over-and-over again. • Think of it like a tank tread.
Log files (cont.)
• Are an optimization! • 512B aligned sequential writes. • Tablespace writes are 16KiB random writes.
• Tablespace writes to same pages in close time window can be merged. • Just need a large enough log file.
Checkpoint (Background Activity)
InnoDB
Transa
ctio
nSystem
Storage
Caching
SYS_TABLES
ibda
ta1
spac
e 0
Page Cache
A.ibd
B.ibd
C.ibd
IBUF_HEADERIBUF_TREETRX_SYS
FIRST_RSEGDICT_HDR
Data Dict.
SYS_COLUMNSSYS_INDEXESSYS_FIELDS
Block 1 (64 pages)Block 2 (64 pages)
iblogfile0 iblogfile1 iblogfile2
Tables withfile_per_tableDoublewrite Buffer
Buffe
r Poo
l Data Dictionary Cache
Adaptive Hash Indexes
Buffer Pool LRU
Additional Mem Pool
Log Buffer
Log
Gro
upBuffer Pool Flush List
(nothing) mysqld
• Q: What do we write to the log - is it committed data only, or can we write uncommitted data as well?
• A: Both.
FAQ
• Q: How do you unapply transactions? • A: UNDO space.
Think of it like a hidden table internally stored in ibdata1.
FAQ
Update Query (More Accurate*)
InnoDB
Transa
ctio
nSystem
Storage
Caching
SYS_TABLES
ibda
ta1
spac
e 0
Page Cache
A.ibd
B.ibd
C.ibd
IBUF_HEADERIBUF_TREETRX_SYS
FIRST_RSEGDICT_HDR
Data Dict.
SYS_COLUMNSSYS_INDEXESSYS_FIELDS
Block 1 (64 pages)Block 2 (64 pages)
iblogfile0 iblogfile1 iblogfile2
Tables withfile_per_tableDoublewrite Buffer
Buffe
r Poo
l Data Dictionary Cache
Adaptive Hash Indexes
Buffer Pool LRU
Additional Mem Pool
Log Buffer
Log
Gro
upBuffer Pool Flush List
UPDATE a SET col1 = ‘new’ WHERE id = 10;
mysqld
commit;Page Cache
Update any indexes to hold both versions.Modify row in place
Redirect older version of row to undo space
• Background purge process is able to clean old rows from UNDO as soon as oldest transaction advances forward.
Update Query (cont.)
Summarized Performance Characteristics
• Log Files: • Are short sequential writes. • They permit InnoDB to delay tablespace writes -
enabling more merging/optimization. • Buffer Pool:
• “In memory version of the tablespace”. • Loading/unloading via modified LRU algorithm.
• Indexes and “data” in InnoDB are B+Trees. • Clustered Index design means that data itself is
stored in an index.
Index Structure
Index Structure (cont.)
Infimum Supremum
Level 0Root
NextRecord
Page 3
Empty root
Index Structure (cont.)
Infimum Supremum
1B*
Level 0Root
Page 3
Insert: 1
Page size is 16KBB* is 2KB
Index Structure (cont.)
Level 0Root
I S
1B*
Page 3
2B*
3B*
4B*
5B*
6B*
7B*
Insert: 1 to 7
Index Structure (cont.)
Infimum Supremum
14
Level 1Root
Page 3
Level 0Leaf
I S
1B*
Page 4
2B*
3B*
4B*
5B*
6B*
7B*
Insert: 8
Allocate new page and link in rootMove records to new pageSplit new page
Index Structure (cont.)
Infimum Supremum
14
Level 1Root
Page 3
Level 0Leaf
45
I S
4B*
Page 5
5B*
6B*
7B*
I S
1B*
Page 4
2B*
3B*
Insert: 8 (Cont.)
Split at the middle of original page
Index Structure (cont.)
Infimum Supremum
14
Level 1Root
Page 3
Level 0Leaf
45
I S
4B*
Page 5
5B*
6B*
10B*
I S
1B*
Page 4
2B*
3B*
7B*
8B*
9B*
Insert: 9 and 10
Index Structure (cont.)
Level 1Root
Level 0Leaf
I S
1B*
Page 4
2B*
3B*
I S
4B*
Page 5
5B*
6B*
10B*
7B*
8B*
9B*
Infimum Supremum
14
Page 3
116
45
I S
11B*
Page 6
Insert: 11
Insert leads to a split at the insertion point
Index Structure
Infimum Supremum
≥0→4
≥4→5
I S≥2→7
≥0→6
I S≥6→9
≥4→8
I S
1B
0A
I S
3D
2C
I S
5F
4E
I S
7H
6GL
eve
l 0Leaf
Leve
l 1In
tern
al
Leve
l 2R
oo
t
Next Page
Prev Page
NextRecord
Page 3
Page 4 Page 5
Page 6 Page 7 Page 8 Page 9
Page Format
FIL Header (38)
FIL Trailer (8)
0
38
16384
16376
Other headers and page data,depending on page type.
Total usable space: 16,338 bytes.
Row Format
N-2
Info Flags (4 bits)Number of Records Owned (4 bits)
Order (13 bits)Record Type (3 bits)
Next Record Offset (2)Cluster Key Fields (k)
N
N+k
Transaction ID (6)Roll Pointer (7)
Non-Key Fields (j)
N-4
N-5Variable field lengths (1-2 bytes per var. field)
N+k+6
N+k+13
N+k+13+j
• Page is basic unit of storage. • Default is 16KiB • Rows of variable length.
Conclusion
• Adaptive hash - Partial hash index to accelerate secondary key lookups.
• Change buffering - when non-unique indexes are not in memory, changes can be temporarily buffered until they are.
Two more useful features
Query (by secondary key)
InnoDB
Transa
ctio
nSystem
Storage
Caching
SYS_TABLES
ibda
ta1
spac
e 0
Page Cache
A.ibd
B.ibd
C.ibd
IBUF_HEADERIBUF_TREETRX_SYS
FIRST_RSEGDICT_HDR
Data Dict.
SYS_COLUMNSSYS_INDEXESSYS_FIELDS
Block 1 (64 pages)Block 2 (64 pages)
iblogfile0 iblogfile1 iblogfile2
Tables withfile_per_tableDoublewrite Buffer
Buffe
r Poo
l Data Dictionary Cache
Adaptive Hash Indexes
Buffer Pool LRU
Additional Mem Pool
Log Buffer
Log
Gro
upBuffer Pool Flush List
SELECT * FROM a WHERE b_key = 10;
mysqld
Update Query (large table)
InnoDB
Transa
ctio
nSystem
Storage
Caching
SYS_TABLES
ibda
ta1
spac
e 0
Page Cache
A.ibd
B.ibd
C.ibd
IBUF_HEADERIBUF_TREETRX_SYS
FIRST_RSEGDICT_HDR
Data Dict.
SYS_COLUMNSSYS_INDEXESSYS_FIELDS
Block 1 (64 pages)Block 2 (64 pages)
iblogfile0 iblogfile1 iblogfile2
Tables withfile_per_tableDoublewrite Buffer
Buffe
r Poo
l Data Dictionary Cache
Adaptive Hash Indexes
Buffer Pool LRU
Additional Mem Pool
Log Buffer
Log
Gro
upBuffer Pool Flush List
UPDATE a SET col1 = ‘new’ WHERE id = 10;
mysqld
commit;
Not Required
Copyright © 2012 Oracle and/or its affiliates. All rights reserved.
New Features
• IO Scalability • Async IO • Multiple Buffer Pools • Adaptive Flushing • Scan Resistant LRU • Compressed Pages • CPU Scalability • Improved Atomics • Spin Loops with PAUSE
MySQL 5.5+
• LRU Dump and Restore
• Improved Group Commit
• Fulltext Search • Fast Read-only
Transactions • Memcached Interface • Information Schema
metadata tables
• Persistent Statistics • Variable Page Size • Online DDL • Transportable
Tablespace • Transactional
ReplicationUsing InnoDB
MySQL 5.6+
• Faster Temporary Tables • Index Lock Contention Reduction • More Online DDL
• Extend VARCHAR • Rename Index
• Improved Read-Only Transactions • Improved CPU Scalability
MySQL 5.7+
Copyright © 2012 Oracle and/or its affiliates. All rights reserved.
Configuration
1. innodb-buffer-pool-size 2. innodb-log-file-size 3. innodb_flush_log_at_trx_commit
The Top 3
• Really only one major buffer/cache settings to set. • Responsible for all pages types (data, indexes, undo,
insert buffer..)
innodb-buffer-pool-size
• Recommendation is 50-80% of RAM. • Default is 128M of RAM. • Please allow 5-10% on top for other meta data to
grow.
innodb-buffer-pool-size (cont.)
• Log files are on disk, but this contributes to how many unflushed (dirty) pages you can hold in memory.
• In theory larger log files = longer crash recovery. • In MySQL 5.5 -2G max. • In MySQL 5.6 - 4G is usually safe. • Early versions should be much smaller.
• Default is 48M Log Files.
innodb-log-file-size
• Default is full ACID Compliance (=1) • Can be set to 0/2 if you do not mind some data loss.
innodb_flush_log_at_trx_commit
• 0 = Log buffer written + synced once per second. Nothing done at commit.
• 1 = Log buffer written + synced once per second + written and synced on commit.
• 2 = Log buffer written + synced once per second + written (not synced) on commit.
!2 is a slightly safer version of 0.
innodb_flush_log_at_trx_commit
Basic Configuration
InnoDBTransa
ctio
nSystem
Storage
Caching
SYS_TABLES
ibda
ta1
spac
e 0
Page Cache
A.ibd
B.ibd
C.ibd
IBUF_HEADERIBUF_TREETRX_SYS
FIRST_RSEGDICT_HDR
Data Dict.
SYS_COLUMNSSYS_INDEXESSYS_FIELDS
Block 1 (64 pages)Block 2 (64 pages)
iblogfile0 iblogfile1 iblogfile2
Tables withfile_per_tableDoublewrite Buffer
Buffe
r Poo
l Data Dictionary Cache
Adaptive Hash Indexes
Buffer Pool LRU
Additional Mem Pool
Log BufferLo
g G
roup
Buffer Pool Flush List
Innodb_buffer_pool_size. Recommendation is 50-80% RAM.
Requires about 5-10% of buffer pool size as overhead (not directly configurable).
innodb_log_buffer_size. Typical values 1-8M. Flushing behaviour influenced by innodb_flush_log_at_trx_commit.
innodb_log_file_size. Typical values 256M+. Default of 2 files (innodb_log_files_in_group).
innodb_file_per_table (Default: ON in 5.6+)
1. innodb-buffer-pool-size 2. innodb-log-file-size 3. innodb-log-buffer-size 4. innodb_flush_log_at_trx_commit 5. innodb_flush_method 6. innodb_flush_neighbors 7. innodb_io_capacity, innodb_io_capacity_max,
innodb_lru_scan_depth 8. innodb-buffer-pool-instances 9. innodb_read_io_threads and innodb_write_io_threads
The ~Top 10
• innodb_thread_concurrency • innodb_concurrency_tickets • innodb_max_pct_dirty_pages • innodb_use_native_aio (always on) • innodb_old_blocks_time (5.6 default: 1000)
Less-likely to need configuration
• Typically “remove on sight” from config files: • innodb_additional_mempool_size • innodb_use_sys_malloc
Deprecated Settings
Credits
• InnoDB Architecture Diagrams via https://github.com/jeremycole/innodb_diagrams
• Available under (3-clause) BSD licenseCopyright (c) 2013, Twitter, Inc.Copyright (c) 2013, Jeremy Cole <[email protected]>Copyright (c) 2013, Davi Arnaut <[email protected]>