ACM SIGMOD 2007, Beijing, China -1-COMPUTER SCIENCE DEPARTMENT
Design of Flash-Based DBMS: An In-Page Logging Approach

Bongki Moon
Department of Computer Science, University of Arizona
Tucson, AZ 85721, U.S.A.
[email protected]

Sang-Won Lee
School of Info & Comm Eng, Sungkyunkwan University
Suwon, Korea 440-746
[email protected]

SIGMOD’07

Outline
• Flash memory
• Disk-Based DBMS on Flash Memory
• Flash-Based DBMS: In-Page Logging approach
• Reviews

Flash Memory
• Flash memory is a type of electrically erasable programmable read-only memory (EEPROM).
• A page is the unit of read and write operations.
  − Typical value: 2KB
• A write operation can only clear bits (change their value from 1 to 0).
• The only way to change a bit from 0 to 1 is to erase an entire region of memory.
  − This region has a fixed size and is called an erase unit, erase block, or just block.
  − Typical value: 128KB for large flash memory
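
The program/erase asymmetry described on this slide can be illustrated with a toy model. This is only a sketch: the class name `FlashBlock` and its methods are hypothetical, and the geometry (64 pages of 2KB in a 128KB block) simply combines the typical values quoted above.

```python
class FlashBlock:
    """Toy erase unit: programming can only clear bits, erasing sets them all back to 1."""

    def __init__(self, page_size=2048, pages_per_block=64):
        self.page_size = page_size
        # An erased page holds all 1 bits (0xFF in every byte).
        self.pages = [bytearray(b"\xff" * page_size) for _ in range(pages_per_block)]

    def program(self, page_no, data):
        """Write data to a page: each stored byte becomes (old AND new), so bits only go 1 -> 0."""
        page = self.pages[page_no]
        for i, value in enumerate(data):
            page[i] &= value

    def erase(self):
        """Erase the whole block: the only way to turn 0 bits back into 1 bits."""
        for page in self.pages:
            page[:] = b"\xff" * self.page_size
```

Programming `0x0F` and then `0xF0` into the same byte leaves `0x00`, not `0xF0`, which is why a page cannot simply be overwritten in place.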

Characteristics of Flash
• No in-place update
  − A data item must be erased before it can be written again.
  − An erase unit (16KB or 128KB) is much larger than a sector (512 bytes).
• No mechanical latency
  − Flash memory is an electronic device without moving parts.
  − It provides uniform random access speed without seek or rotational latency.
• Asymmetric read and write speed
  − Read speed is typically at least twice as fast as write speed.
  − Write (and erase) optimization is critical.

Magnetic Disk vs Flash Memory

                 Read time    Write time   Erase time
Magnetic Disk    12.7 msec    13.7 msec    N/A
NAND Flash       80 μsec      200 μsec     1.5 msec

• Magnetic disk: Seagate Barracuda 7200.7 ST380011A
• NAND flash: Samsung K9WAG08U1A 16 Gbits SLC NAND
  − Unit of read/write: 2KB, unit of erase: 128KB

Categories of Flash Memory
• NAND vs. NOR flash
  − NOR: high erase cost (several seconds); directly addressable (access is by bit or byte)
  − NAND: relatively low erase cost (several ms); access is by page
• MLC vs. SLC NAND flash
  − MLC (Multi-Level Cell): stores multiple bits per cell, but has significantly slower read and write speeds and a 10x lower read/write lifetime
  − SLC (Single-Level Cell): stores only a single bit per cell
  − SLC flash has much better performance, lifetime, and reliability than MLC

Small Flash vs Large Flash
• Small flash memory has been widely used in PDAs, MP3 players, mobile phones, sensor networks, ...
  − Advantages: size, weight, shock resistance, power consumption, noise, ...
  − Typical size: a few gigabytes
• Recently, some vendors have developed large flash memory devices called flash SSDs (Solid State Disks)
  − Mainly used in notebook PCs, e.g. Apple MacBook Air / ThinkPad X300
  − Typical size: > 16GB

Different FTLs (Flash Translation Layers) for Large and Small Flash Memory
• Page-mapping FTL (used for small flash memory)
  − Maintains the mapping between each logical page and its physical page separately
  − Log-structured architecture
  − Needs a large amount of memory for its mapping information
  − The mapping must be reconstructed by scanning the whole flash memory at start-up, which may result in a long mount time
• Block-mapping FTL
  − Needs only a small amount of memory for its mapping information
  − Any update causes a whole-block rewrite (which is why random writes are so slow!)
  − In real products, there are some optimizations for improving concentrated updates
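
To make the memory trade-off between the two FTL styles concrete, the following back-of-the-envelope sketch counts mapping-table entries. The 16GB capacity, 2KB page, and 128KB block are the typical values mentioned on earlier slides; the helper `mapping_entries` is a hypothetical name, and real FTLs keep more state than one entry per unit.

```python
KB = 1024
GB = 1024 ** 3

def mapping_entries(capacity_bytes, unit_bytes):
    """Number of mapping-table entries when every unit of the given size is mapped separately."""
    return capacity_bytes // unit_bytes

# Page-mapping FTL: one entry per 2KB page.
page_entries = mapping_entries(16 * GB, 2 * KB)

# Block-mapping FTL: one entry per 128KB erase block.
block_entries = mapping_entries(16 * GB, 128 * KB)

# The block-mapping table is smaller by the number of pages per block.
ratio = page_entries // block_entries
```

Under these assumptions the page-mapping table needs 64 times as many entries, which is the price paid for being able to redirect an individual page without rewriting its whole block.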

Flash Memory for Server Applications
• More recently, thanks to the advantages of flash memory and its increasing capacity, there is a new trend of using large flash memory for database server applications.
• Jim Gray said: Tape is Dead, Disk is Tape, Flash is Disk!

Outline
• Flash memory
• Disk-Based DBMS on Flash Memory
• Flash-Based DBMS: In-Page Logging approach
• My reviews

Disk-Based DBMS on Flash Memory
• What happens if a disk-based DBMS runs on flash memory?
  − Because in-place update is not possible, it writes the whole block into another clean block
  − Free blocks are consumed quickly, causing frequent garbage collection and erase operations
[Figure: SQL updates, inserts, and deletes go through the buffer manager; each dirty block write (page: 4KB) updates the data block area of flash memory (erase unit: 128KB).]

Disk-Based DBMS Performance
• Run SQL queries on a commercial DBMS
  − Sequential scan or update of a table
  − Non-sequential read or update of a table (via B-tree index)
• Experimental settings
  − Storage: magnetic disk vs M-Tron SSD (Samsung flash chip)
  − Data page of 8KB
  − 10 tuples per page, 640,000 tuples in a table (64,000 pages, 512MB)

Disk-Based DBMS Performance
• Read performance: the result is not surprising at all
  − Hard disk: read performance is poor for non-sequential accesses, mainly because of seek and rotational latency
  − Flash memory: read performance is insensitive to access patterns

                  Disk                Flash
Sequential        14.0 sec            11.0 sec
Non-sequential    61.1 ~ 172.0 sec    12.1 ~ 13.1 sec

Disk-Based DBMS Performance
• Write performance
  − Hard disk: write performance is poor for non-sequential accesses, mainly because of seek and rotational latency
  − Flash memory: write performance is poor (worse than disk) for non-sequential accesses, due to out-of-place updates and erase operations
  − This demonstrates the need for write optimization in a DBMS running on flash

                  Disk                 Flash
Sequential        34.0 sec             26.0 sec
Non-sequential    151.9 ~ 340.7 sec    61.8 ~ 369.9 sec

Outline
• Flash memory
• Disk-Based DBMS on Flash Memory
• Flash-Based DBMS: In-Page Logging approach
• My reviews

In-Page Logging (IPL) Approach
• Design principles
  − Take advantage of the characteristics of flash memory
    • Fast read speed
  − Overcome the “erase-before-write” limitation of flash memory
  − Minimize the changes to the DBMS architecture
    • Limited to the buffer manager and the storage manager

Design of the IPL
• Logging on a per-page basis in both memory and flash
  − An in-memory log sector can be associated with a buffer frame in memory
    • Allocated on demand when a page becomes dirty
  − An in-flash log segment is allocated in each erase unit
    • The log area is shared by all the data pages in an erase unit
[Figure: In the database buffer, each in-memory data page (8KB) is updated in place and has an in-memory log sector (512B). In flash memory, each erase unit (128KB) holds 15 data pages (8KB each) and a log area (8KB) of 16 sectors.]
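
The sizes on this slide fit together exactly, which the following sketch checks. The constants come from the slide; the `locate` function is a hypothetical illustration of one simple static placement of pages into erase units, not something the slides specify.

```python
KB = 1024

ERASE_UNIT = 128 * KB   # size of one erase unit
PAGE_SIZE = 8 * KB      # size of one data page
LOG_AREA = 8 * KB       # in-flash log segment per erase unit
LOG_SECTOR = 512        # size of one log sector

# Whatever is left after reserving the log area holds data pages.
PAGES_PER_UNIT = (ERASE_UNIT - LOG_AREA) // PAGE_SIZE   # 15
SECTORS_PER_LOG_AREA = LOG_AREA // LOG_SECTOR           # 16

def locate(page_no):
    """Map a logical page number to (erase unit, slot within the unit).

    Assumes pages are laid out contiguously, 15 per erase unit.
    """
    return divmod(page_no, PAGES_PER_UNIT)
```

With this layout, `locate(31)` places page 31 in slot 1 of erase unit 2, and all 15 pages of that unit share its 16 log sectors.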

IPL Write
[Figure: Updates, inserts, and deletes are applied in place to the buffered data page (8KB) by the buffer manager, which also adds a physiological log record to the in-memory log sector (512B); log sectors are written to the data block area of flash memory (block: 128KB).]
• Whenever an update is performed on a data page, the in-memory copy of the data page is updated immediately. In addition, the IPL buffer manager adds a log record to the in-memory log sector.
• When a dirty page is evicted by the replacement policy, or when the in-memory log sector is full, the content of the data page is not written to flash memory. Instead, the in-memory log sector is written to the in-flash log segment.
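
The write path described above might be sketched as follows. This is a simplified illustration, not the authors' implementation: the names `BufferFrame`, `update`, `evict`, and `flush_log_sector` are hypothetical, a log record is modelled as a plain byte-range overwrite rather than a true physiological record, the 4-byte record header is an assumption, and `flash_log` is a Python list standing in for the in-flash log segment.

```python
SECTOR_SIZE = 512   # bytes in one in-memory log sector (from the slides)
RECORD_HEADER = 4   # assumed per-record overhead; not from the slides

class BufferFrame:
    """A buffered data page together with its in-memory log sector."""

    def __init__(self, page_no, data):
        self.page_no = page_no
        self.data = bytearray(data)
        self.log_sector = []   # pending log records: (offset, new_bytes)
        self.log_bytes = 0     # space used by the pending records

def flush_log_sector(frame, flash_log):
    """Write the in-memory log sector to the in-flash log segment and reset it."""
    if frame.log_sector:
        flash_log.append((frame.page_no, list(frame.log_sector)))
        frame.log_sector.clear()
        frame.log_bytes = 0

def update(frame, offset, new_bytes, flash_log):
    """Update the in-memory page in place and add a log record for the change."""
    size = RECORD_HEADER + len(new_bytes)
    if frame.log_bytes + size > SECTOR_SIZE:
        # The log sector is full: flush it, but do not write the data page.
        flush_log_sector(frame, flash_log)
    frame.data[offset:offset + len(new_bytes)] = new_bytes
    frame.log_sector.append((offset, new_bytes))
    frame.log_bytes += size

def evict(frame, flash_log):
    """Evict a dirty page: only its log sector goes to flash, never the page itself."""
    flush_log_sector(frame, flash_log)
```

Note that neither `update` nor `evict` ever writes `frame.data` to flash; the stale on-flash copy plus the log records is enough to rebuild the page later. The sketch does not handle a single record larger than one sector.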

IPL Read
• When a page is read from flash, the current version is computed on the fly
  − Read from flash the original copy of Pi and all log records belonging to Pi (I/O overhead)
  − Apply the “physiological actions” to the copy read from flash (CPU overhead) to reconstruct the current in-memory copy of Pi
[Figure: The buffer manager reads Pi from the data area (120KB: 15 pages) and its log records from the log area (8KB: 16 sectors) of the same erase unit.]
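
A minimal sketch of the read path, under the same simplifying assumption that a log record is a byte-range overwrite of the form `(page_no, offset, new_bytes)`. The function name `read_page` and the dictionary/list representation of the data and log areas are hypothetical.

```python
def read_page(page_no, data_area, log_area):
    """Rebuild the current version of a page from its flash copy plus its log records.

    data_area maps page numbers to the (possibly stale) page images stored in flash.
    log_area is the erase unit's shared log, in the order the records were written.
    """
    page = bytearray(data_area[page_no])            # original copy of Pi (I/O)
    for rec_page, offset, new_bytes in log_area:    # scan the shared log area (I/O)
        if rec_page == page_no:                     # keep only records belonging to Pi
            page[offset:offset + len(new_bytes)] = new_bytes   # re-apply (CPU)
    return page
```

Because the log area is shared by all pages in the erase unit, the reader has to filter out records of other pages, and records must be applied in write order so that later updates win.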

IPL Merge
• When all free log sectors in an erase unit are consumed:
  − Log records are applied to the corresponding data pages
  − The current data pages are copied into a new erase unit
[Figure: Merge of a physical flash block. The old block B_old, whose log area (8KB: 16 sectors) is full, is merged into a new block B_new with 15 up-to-date data pages and a clean log area.]
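
The merge step could be sketched as below. The class `EraseUnit` and the functions `write_log_sector` and `merge` are hypothetical names, log records are again simplified to byte-range overwrites `(slot, offset, new_bytes)`, and the trigger used here (merge when a log sector arrives and no free sector is left) is one plausible reading of the slide.

```python
LOG_SECTORS_PER_UNIT = 16   # 8KB log area / 512B sectors (from the slides)

class EraseUnit:
    """An erase unit holding data pages and a shared log area."""

    def __init__(self, pages):
        self.pages = [bytearray(p) for p in pages]
        self.log_sectors = []   # each sector is a list of (slot, offset, new_bytes)

def merge(old):
    """Apply every log record to its data page and copy the result into a new unit."""
    new = EraseUnit(old.pages)                  # copies the page images
    for sector in old.log_sectors:
        for slot, offset, new_bytes in sector:
            new.pages[slot][offset:offset + len(new_bytes)] = new_bytes
    return new                                  # new unit has a clean log area

def write_log_sector(unit, sector):
    """Append a log sector, merging first if the log area has no free sector.

    Returns the erase unit that holds the data afterwards.
    """
    if len(unit.log_sectors) == LOG_SECTORS_PER_UNIT:
        unit = merge(unit)    # the old unit can now be erased and reused
    unit.log_sectors.append(sector)
    return unit
```

The caller must keep using the returned unit, since after a merge the data lives in a different physical erase unit.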

Why Can IPL Improve the Write Performance of a DBMS?
• The number of disk writes doesn’t decrease
  − Actually, the number of writes may increase because:
    (1) It introduces extra disk writes whenever the log sector is full
    (2) The merge operation introduces overhead
• Then why can IPL improve write performance?
  − IPL overcomes the erase-before-write property of flash
  − It reduces the number of erasures

IPL Simulation with TPC-C
• TPC-C log data generation
  − Run a commercial DBMS to generate reference streams of the TPC-C benchmark
    • HammerOra utility used for TPC-C workload generation
  − Each trace contains log records of physiological updates as well as physical page writes
  − Average length of a log record: 20 ~ 50B
• TPC-C traces
  − 100M.20M.10u: 100MB DB, 20MB buffer, 10 simulated users
  − 1G.20M.100u: 1GB DB, 20MB buffer, 100 simulated users
  − 1G.40M.100u: 1GB DB, 40MB buffer, 100 simulated users
• Parameter settings
  − Write (2KB): 200 us
  − Merge (128KB): 20 ms

Log Segment Size vs Merges
• TPC-C write frequencies are highly skewed (and have low temporal locality)
• Erase units containing hot pages consume log sectors quickly
  − This could cause a large number of erase operations
  − With more log sectors per erase unit: more storage, but less frequent merges

Estimated Write Performance
• Performance trend with varying buffer sizes
  − The size of the log segment was fixed at 8KB
• Estimated write time
  − With IPL = (# of sector writes) × 200us + (# of merges) × 20ms
  − Without IPL = α × (# of page writes) × 20ms
  − α is the probability that a page write causes an erase operation
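
The two estimates can be written as small functions, using the 200 us sector-write and 20 ms merge costs from the parameter settings and writing `alpha` for the probability that a page write causes an erase operation. The function names and the counts in the usage note are made up for illustration; they are not results from the talk.

```python
SECTOR_WRITE_US = 200      # cost of one log sector write, in microseconds
MERGE_US = 20_000          # cost of one merge (or block rewrite), in microseconds

def write_time_with_ipl_us(sector_writes, merges):
    """Estimated write time with IPL, in microseconds."""
    return sector_writes * SECTOR_WRITE_US + merges * MERGE_US

def write_time_without_ipl_us(alpha, page_writes):
    """Estimated write time without IPL, in microseconds.

    alpha is the probability that a page write causes an erase operation.
    """
    return alpha * page_writes * MERGE_US
```

For example, with hypothetical counts of 1,000 sector writes and 10 merges, `write_time_with_ipl_us(1000, 10)` gives 400,000 us (0.4 s), while 100 page writes at `alpha = 0.5` give `write_time_without_ipl_us(0.5, 100)` = 1,000,000 us (1 s).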

Support for Recovery
• IPL helps realize a lean recovery mechanism
  − Additional logging: transaction log and list of dirty pages
• Transaction commit
  − Similar to flushing the log tail
  − An in-memory log sector is forced out to flash if it contains at least one log record of a committing transaction
  − No explicit REDO action is required at system restart
• Transaction abort
  − De-apply the log records of an aborting transaction
  − Use selective merge instead of regular merge, because a regular merge is irreversible
    • If committed, merge the log record
    • If aborted, discard the log record
    • If active, carry over the log record to a new erase unit
  − To avoid thrashing behavior, allow an erase unit to have overflow log sectors
  − No explicit UNDO action is required
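
The three selective-merge cases can be sketched as follows. The function name `selective_merge`, the record layout `(txn_id, slot, offset, new_bytes)`, and the status strings are all hypothetical, and a log record is simplified to a byte-range overwrite.

```python
def selective_merge(pages, log_records, txn_status):
    """Merge an erase unit while respecting transaction status.

    pages: page images of the old erase unit.
    log_records: (txn_id, slot, offset, new_bytes) tuples in write order.
    txn_status: maps txn_id to "committed", "aborted", or "active".
    Returns (new page images, log records carried over to the new erase unit).
    """
    new_pages = [bytearray(p) for p in pages]
    carried_over = []
    for record in log_records:
        txn_id, slot, offset, new_bytes = record
        status = txn_status[txn_id]
        if status == "committed":
            # Committed: merge the log record into the data page.
            new_pages[slot][offset:offset + len(new_bytes)] = new_bytes
        elif status == "active":
            # Still running: outcome unknown, so keep the record as a log record.
            carried_over.append(record)
        elif status != "aborted":
            raise ValueError(f"unknown transaction status: {status}")
        # Aborted: the record is simply dropped, so no UNDO is needed.
    return new_pages, carried_over
```

Because changes of uncommitted transactions never reach the data pages in flash, discarding an aborted transaction's log records is enough to roll it back.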

Conclusion
• Clear and present evidence that flash can replace disk
• The IPL approach demonstrates its potential for TPC-C type database applications by
  − Overcoming the “erase-before-write” limitation
  − Exploiting the fast and uniform random access
• IPL also helps realize a lean recovery mechanism

Outline
• Flash memory
• Disk-Based DBMS on Flash Memory
• Flash-Based DBMS: In-Page Logging approach
• Reviews

Reviews
• IPL hurts read performance
  − For each read operation, it has to read both the data page and the log sector page
  − Read performance will be about 2X slower
• No experimental results
  − The authors only give results obtained through I/O access simulation
• Simulation
  − The data size used in the simulation is too small (1GB)
  − It does not show the overall performance of TPC-C (most operations in TPC-C are read operations)

Any Questions?
• Q & A