ACM SIGMOD 2007, Beijing, China

Design of Flash-Based DBMS: An In-Page Logging Approach

Bongki Moon
Department of Computer Science, University of Arizona
Tucson, AZ 85721, U.S.A.
[email protected]

Sang-Won Lee
School of Info & Comm Eng, Sungkyunkwan University
Suwon, Korea 440-746
[email protected]

SIGMOD’07

Outline
• Flash memory
• Disk-Based DBMS on Flash Memory
• Flash-Based DBMS: In-Page Logging approach
• Reviews

Flash Memory
• Flash memory is a type of electrically erasable programmable read-only memory (EEPROM)
• A page is the unit of read and write operations
  − Typical value: 2KB
• A write operation can only clear bits (change their value from 1 to 0)
• The only way to change a value from 0 to 1 is to erase an entire region of memory
  − This region has a fixed size and is called an erase unit, erase block, or just block
  − Typical value: 128KB for large flash memory
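The write and erase rules above can be illustrated with a toy model. This is only a sketch: the class name `FlashBlock` and its methods are hypothetical, and real NAND chips add further constraints (for example, limits on how often a page may be programmed between erases) that are not modeled here.

```python
class FlashBlock:
    """Toy model of one erase unit (hypothetical helper, not a real driver API)."""

    def __init__(self, pages_per_block=64, page_size=2048):
        self.page_size = page_size
        # The erased state is all bits set to 1 (0xFF in every byte).
        self.pages = [bytearray(b"\xff" * page_size) for _ in range(pages_per_block)]

    def program(self, page_no, data):
        """Write one page. Bits can only go from 1 to 0, so the result is old AND new."""
        if len(data) != self.page_size:
            raise ValueError("data must be exactly one page")
        page = self.pages[page_no]
        for i, byte in enumerate(data):
            page[i] &= byte

    def erase(self):
        """Erase the whole block: every bit of every page returns to 1."""
        for page in self.pages:
            page[:] = b"\xff" * self.page_size


block = FlashBlock(pages_per_block=4, page_size=4)
block.program(0, bytes([0b11110000] * 4))
block.program(0, bytes([0b00111100] * 4))  # overwrite attempt without an erase
print(bin(block.pages[0][0]))              # only bits that were 1 in both writes stay 1
block.erase()
print(block.pages[0][0] == 0xFF)
```

The second `program` call does not store the new value; it can only clear more bits, which is why an update needs an erase of the whole block first.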

Characteristics of Flash
• No in-place update
  − A data item needs to be erased before it can be written again
  − An erase unit (16KB or 128KB) is much larger than a sector (512 bytes)
• No mechanical latency
  − Flash memory is an electronic device without moving parts
  − Provides uniform random access speed without seek/rotational latency
• Asymmetric read & write speed
  − Read speed is typically at least twice as fast as write speed
  − Write (and erase) optimization is critical

Magnetic Disk vs Flash Memory

Magnetic disk: Seagate Barracuda 7200.7 ST380011A
NAND flash: Samsung K9WAG08U1A 16 Gbits SLC NAND (unit of read/write: 2KB, unit of erase: 128KB)

                Read time    Write time   Erase time
Magnetic Disk   12.7 msec    13.7 msec    N/A
NAND Flash      80 μsec      200 μsec     1.5 msec

Category of Flash Memory
• NAND vs. NOR Flash
  − NOR: high erase cost (several seconds), directly addressable (access is by bit or byte)
  − NAND: relatively low erase cost (several ms), access is by page
• MLC vs. SLC NAND Flash
  − MLC (Multi-Level Cell): stores multiple bits per cell, but has significantly slower read and write speeds and a 10x lower read/write lifetime
  − SLC (Single-Level Cell): stores only a single bit per cell
  − SLC flash has much better performance, lifetime, and reliability properties than MLC

Small Flash vs Large Flash
• Small flash memory has been widely used in PDAs, MP3 players, mobile phones, sensor networks…
  − Advantages: size, weight, shock resistance, power consumption, noise…
  − Typical size: a few gigabytes
• Recently, some vendors have developed large flash memory devices called Flash SSDs (Solid State Disks)
  − Mainly used in notebook PCs, e.g. Apple MacBook Air / ThinkPad X300
  − Typical size: > 16GB

Different FTLs for Large and Small Flash Memory
• Page-mapping FTL (used for small flash memory)
  − Maintains the mapping between each logical page and its physical page separately
  − Log-structured architecture
  − Needs a large amount of memory for its mapping information
  − The mapping must be reconstructed by scanning the whole flash memory at start-up, which may result in a long mount time
• Block-mapping FTL
  − Needs only a small amount of memory for its mapping information
  − Any update causes a whole block rewrite (that is why random writes are so slow!)
  − In real products, there are some optimizations that improve concentrated updates
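To see why a whole-block rewrite makes random writes slow, the following back-of-the-envelope sketch estimates the cost of updating a single page under a block-mapping scheme. It uses the NAND timings from the Magnetic Disk vs Flash Memory slide (read as 80 and 200 microseconds and 1.5 milliseconds) and assumes, for illustration only, that the other pages of the block are read, the block is erased, and all pages are written back. Real FTLs differ in detail.

```python
# Timings per 2KB page / 128KB block, in microseconds (from the comparison slide).
READ_US = 80
WRITE_US = 200
ERASE_US = 1500
PAGES_PER_BLOCK = (128 * 1024) // (2 * 1024)  # 64 pages of 2KB in a 128KB block


def block_rewrite_cost_us(pages_per_block=PAGES_PER_BLOCK):
    """Estimated cost of updating one page by rewriting its whole block.

    Assumption: the unchanged pages are read, the block is erased once,
    and every page (including the updated one) is written back.
    """
    reads = (pages_per_block - 1) * READ_US
    writes = pages_per_block * WRITE_US
    return reads + ERASE_US + writes


print(block_rewrite_cost_us() / 1000, "ms")
```

Under these assumptions the estimate is 19.34 ms per updated page, close to the 20 ms merge cost used in the simulation parameters later in the deck, and roughly a hundred times the 200 microsecond cost of a plain page write.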

Flash Memory for Server Applications
• More recently, because of the advantages of flash memory and its increasing capacity, there is a new trend of using large flash memory for database server applications
• Jim Gray said:
  Tape is Dead, Disk is Tape, Flash is Disk!

Outline
• Flash memory
• Disk-Based DBMS on Flash Memory
• Flash-Based DBMS: In-Page Logging approach
• My reviews

Disk-Based DBMS on Flash Memory
• What happens if a disk-based DBMS runs on flash memory?
  − Because there is no in-place update, each update writes the whole block into another clean block
  − This consumes free blocks quickly, causing frequent garbage collection and erase operations

[Figure: update path of a disk-based DBMS on flash. Labels: SQL update / insert / delete; buffer manager; dirty block write (page: 4KB); data block area; flash memory (erase unit: 128KB).]

Disk-Based DBMS Performance
• Run SQL queries on a commercial DBMS
  − Sequential scan or update of a table
  − Non-sequential read or update of a table (via B-tree index)
• Experimental settings
  − Storage: magnetic disk vs M-Tron SSD (Samsung flash chip)
  − Data page of 8KB
  − 10 tuples per page, 640,000 tuples in a table (64,000 pages, 512MB)

Disk-Based DBMS Performance
• Read performance: the result is not surprising at all
  Hard disk
  − Read performance is poor for non-sequential accesses, mainly because of seek and rotational latency
  Flash memory
  − Read performance is insensitive to access patterns

                  Disk                Flash
Sequential        14.0 sec            11.0 sec
Non-sequential    61.1 ~ 172.0 sec    12.1 ~ 13.1 sec

Disk-Based DBMS Performance
• Write performance
  Hard disk
  − Write performance is poor for non-sequential accesses, mainly because of seek and rotational latency
  Flash memory
  − Write performance is poor (worse than disk) for non-sequential accesses due to out-of-place updates and erase operations
  − This demonstrates the need for write optimization in a DBMS running on flash

                  Disk                  Flash
Sequential        34.0 sec              26.0 sec
Non-sequential    151.9 ~ 340.7 sec     61.8 ~ 369.9 sec

Outline
• Flash memory
• Disk-Based DBMS on Flash Memory
• Flash-Based DBMS: In-Page Logging approach
• My reviews

In-Page Logging (IPL) Approach
• Design Principles
  − Take advantage of the characteristics of flash memory
    • Fast read speed
  − Overcome the “erase-before-write” limitation of flash memory
  − Minimize the changes to the DBMS architecture
    • Limited to buffer manager and storage manager

Design of the IPL
• Logging on a per-page basis in both memory and flash
  − An in-memory log sector can be associated with a buffer frame in memory
    • Allocated on demand when a page becomes dirty
  − An in-flash log segment is allocated in each erase unit
    • The log area is shared by all the data pages in an erase unit

[Figure: IPL layout. In the database buffer, an in-memory data page (8KB) is updated in place and has an in-memory log sector (512B). In flash memory, each erase unit (128KB) holds 15 data pages (8KB each) and a log area (8KB) of 16 sectors.]
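The sizes in this layout fit together as follows. The snippet is plain arithmetic on the numbers given on the slide; the constant names are made up for illustration.

```python
# Sizes taken from the slide; names are illustrative only.
ERASE_UNIT_BYTES = 128 * 1024
DATA_PAGE_BYTES = 8 * 1024
LOG_AREA_BYTES = 8 * 1024
LOG_SECTOR_BYTES = 512

# Data pages that fit once the log area is set aside.
data_pages_per_unit = (ERASE_UNIT_BYTES - LOG_AREA_BYTES) // DATA_PAGE_BYTES
# Log sectors available to be shared by those data pages.
log_sectors_per_unit = LOG_AREA_BYTES // LOG_SECTOR_BYTES
# Fraction of each erase unit spent on logging.
log_space_fraction = LOG_AREA_BYTES / ERASE_UNIT_BYTES

print(data_pages_per_unit, log_sectors_per_unit, log_space_fraction)
```

With these sizes, one sixteenth of each erase unit is given up to the log area, and the 15 data pages of a unit share its 16 log sectors.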

IPL Write
• Whenever an update is performed on a data page, the in-memory copy of the data page is updated immediately. In addition, the IPL buffer manager adds a log record to the in-memory log sector.
• When a dirty page is evicted by the replacement policy or the in-memory log sector is full, the content of the data page is not written to flash memory. Instead, the in-memory log sector is written to the in-flash log segment.

[Figure: IPL write path. Labels: update / insert / delete; buffer manager; update-in-place; physiological log; page: 8KB; sector: 512B; block: 128KB; data block area; flash memory.]
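The write path described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the authors' implementation: pages are modeled as slot-to-value dictionaries, every log record is assumed to take 50 bytes, and `BufferFrame`, `IPLBufferManager`, and `flash_log` are hypothetical names.

```python
from dataclasses import dataclass, field

SECTOR_BYTES = 512  # size of one in-memory log sector, as on the slide


@dataclass
class BufferFrame:
    page_id: int
    data: dict = field(default_factory=dict)        # in-memory page: slot -> value
    log_sector: list = field(default_factory=list)  # log records not yet on flash
    log_bytes: int = 0


class IPLBufferManager:
    def __init__(self):
        self.frames = {}
        self.flash_log = []  # stands in for in-flash log segments: (page_id, records)

    def update(self, page_id, slot, value, record_bytes=50):
        frame = self.frames.setdefault(page_id, BufferFrame(page_id))
        if frame.log_bytes + record_bytes > SECTOR_BYTES:
            self._flush_log(frame)              # the log sector is full
        frame.data[slot] = value                # update the in-memory copy in place
        frame.log_sector.append((slot, value))  # add a log record for the change
        frame.log_bytes += record_bytes

    def evict(self, page_id):
        frame = self.frames.pop(page_id)
        self._flush_log(frame)  # only the log sector goes to flash, not the data page

    def _flush_log(self, frame):
        if frame.log_sector:
            self.flash_log.append((frame.page_id, list(frame.log_sector)))
            frame.log_sector.clear()
            frame.log_bytes = 0


mgr = IPLBufferManager()
for slot in range(11):
    mgr.update(page_id=7, slot=slot, value=f"v{slot}")
mgr.evict(7)
print(len(mgr.flash_log))  # number of log sector writes issued for page 7
```

In this run the eleventh update no longer fits in the 512-byte sector, so the sector is flushed once when full and once more on eviction. The 8KB data page itself is never written.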

IPL Read
• When a page is read from flash, the current version is computed on the fly
  − Read from flash the original copy of Pi and all log records belonging to Pi (I/O overhead)
  − Apply the “physiological action” to the copy read from flash (CPU overhead)
  − Reconstruct the current in-memory copy of Pi

[Figure: erase unit layout with a log area (8KB) of 16 sectors and a data area (120KB) of 15 pages.]
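A minimal sketch of this read path, assuming pages are slot-to-value dictionaries and the log area of an erase unit is a list of `(page_id, records)` entries in write order. The function name `read_page` and both data shapes are assumptions made for illustration.

```python
def read_page(page_id, flash_pages, flash_log):
    """Rebuild the current version of a page from its stale copy plus its log records."""
    # I/O overhead: fetch the original copy of the page (copied so flash stays unchanged).
    page = dict(flash_pages[page_id])
    # I/O overhead: scan the log area of the same erase unit.
    for logged_page_id, records in flash_log:
        if logged_page_id != page_id:
            continue  # the log area is shared, so skip records of other pages
        # CPU overhead: apply each logged change in the order it was written.
        for slot, value in records:
            page[slot] = value
    return page


flash_pages = {7: {0: "a", 1: "b"}, 8: {0: "z"}}
flash_log = [(7, [(1, "B")]), (8, [(0, "x")]), (7, [(2, "c")])]
print(read_page(7, flash_pages, flash_log))
```

Because the log area is shared by all pages in the erase unit, the reader has to filter out records that belong to other pages.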

IPL Merge
• When all free log sectors in an erase unit are consumed
  − Log records are applied to the corresponding data pages
  − The current data pages are copied into a new erase unit

[Figure: merge of a physical flash block B_old, whose log area (8KB) of 16 sectors is full, into B_new, which holds 15 up-to-date data pages and a clean log area.]
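The merge step can be sketched as below, with an erase unit modeled as a dictionary holding its data pages and its log. The shapes and the name `merge` are illustrative assumptions, not the paper's code.

```python
def merge(old_unit):
    """Apply all logged changes and return a fresh erase unit with a clean log area.

    old_unit has the form {"pages": {page_id: {slot: value}}, "log": [(page_id, records)]},
    where records is a list of (slot, value) pairs in write order.
    """
    # Copy every data page so the old unit is left untouched until it is erased.
    new_pages = {page_id: dict(page) for page_id, page in old_unit["pages"].items()}
    # Apply the log records to the corresponding data pages.
    for page_id, records in old_unit["log"]:
        for slot, value in records:
            new_pages[page_id][slot] = value
    # The new unit holds up-to-date pages and an empty log area.
    return {"pages": new_pages, "log": []}


b_old = {
    "pages": {1: {0: "a"}, 2: {0: "b"}},
    "log": [(1, [(0, "A")]), (2, [(1, "c")]), (1, [(0, "AA")])],
}
b_new = merge(b_old)
print(b_new)
```

After the merge, the old erase unit can be erased and returned to the free pool. That single erase is the cost IPL pays once per full log area instead of once per page update.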

Why Can IPL Improve the Write Performance of a DBMS?
• The number of disk writes doesn’t decrease
  − Actually, the number of writes may increase because:
    (1) It introduces extra disk writes whenever the log sector is full
    (2) The merge operation introduces overhead
• Then why can IPL improve write performance?
  − IPL overcomes the erase-before-write property of flash
  − It reduces the number of erasures

IPL Simulation with TPC-C
• TPC-C Log Data Generation
  − Run a commercial DBMS to generate reference streams of the TPC-C benchmark
    • HammerOra utility used for TPC-C workload generation
  − Each trace contains log records of physiological updates as well as physical page writes
  − Average length of a log record: 20 ~ 50B
• TPC-C Traces
  − 100M.20M.10u: 100MB DB, 20MB buffer, 10 simulated users
  − 1G.20M.100u: 1GB DB, 20MB buffer, 100 simulated users
  − 1G.40M.100u: 1GB DB, 40MB buffer, 100 simulated users
• Parameter setting
  − Write (2KB): 200 us
  − Merge (128KB): 20 ms

Log Segment Size vs Merges
• TPC-C write frequencies are highly skewed (and have low temporal locality)
• Erase units containing hot pages consume log sectors quickly
  − This could cause a large number of erase operations
  − With more log sectors: more storage but less frequent merges

Estimated Write Performance
• Performance trend with varying buffer sizes
  − The size of the log segment was fixed at 8KB
• Estimated write time
  − With IPL = (# of sector writes) × 200us + (# of merges) × 20ms
  − Without IPL = α × (# of page writes) × 20ms
  − α is the probability that a page write causes an erase operation
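The two estimates can be written as small functions. The 200 us and 20 ms costs come from the simulation parameters. The workload counts and the value of α in the example call are invented purely to show how the formulas behave and are not results from the paper.

```python
def write_time_with_ipl(sector_writes, merges, write_us=200, merge_ms=20):
    """Estimated total write time in seconds when IPL is used."""
    return sector_writes * write_us / 1e6 + merges * merge_ms / 1e3


def write_time_without_ipl(page_writes, alpha, merge_ms=20):
    """Estimated total write time in seconds without IPL.

    alpha is the probability that a page write causes an erase operation.
    """
    return alpha * page_writes * merge_ms / 1e3


# Hypothetical workload, for illustration only.
with_ipl = write_time_with_ipl(sector_writes=1_000_000, merges=10_000)
without_ipl = write_time_without_ipl(page_writes=500_000, alpha=0.1)
print(with_ipl, without_ipl)
```

The formulas make the trade-off visible: IPL pays for many cheap sector writes plus a limited number of merges, while the conventional path pays the 20 ms block cost on a fraction α of all page writes.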

Support for Recovery
• IPL helps realize a lean recovery mechanism
  − Additional logging: transaction log and list of dirty pages
• Transaction Commit
  − Similar to flushing the log tail
  − An in-memory log sector is forced out to flash if it contains at least one log record of a committing transaction
  − No explicit REDO action required at system restart
• Transaction Abort
  − De-apply the log records of an aborting transaction
  − Use selective merge instead of regular merge, because a merge is irreversible
    • If committed, merge the log record
    • If aborted, discard the log record
    • If active, carry over the log record to a new erase unit
  − To avoid thrashing behavior, allow an erase unit to have overflow log sectors
  − No explicit UNDO action required
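The three selective-merge rules can be sketched as a single pass over the log of an erase unit. The record layout `(txn_id, page_id, slot, value)`, the status strings, and the function name `selective_merge` are assumptions chosen for illustration. How the real system tracks transaction status is not described on the slide.

```python
def selective_merge(pages, log, txn_status):
    """Merge only committed changes; drop aborted ones; carry over active ones.

    pages: {page_id: {slot: value}}
    log: list of (txn_id, page_id, slot, value) in write order
    txn_status: {txn_id: "committed" | "aborted" | "active"}
    """
    new_pages = {page_id: dict(page) for page_id, page in pages.items()}
    carried_over = []  # log records that move to the new erase unit unchanged
    for txn_id, page_id, slot, value in log:
        status = txn_status[txn_id]
        if status == "committed":
            new_pages[page_id][slot] = value  # safe to apply permanently
        elif status == "active":
            carried_over.append((txn_id, page_id, slot, value))
        # Records of aborted transactions are simply discarded.
    return new_pages, carried_over


pages = {1: {0: "a", 1: "b"}}
log = [("T1", 1, 0, "A"), ("T2", 1, 1, "X"), ("T3", 1, 2, "c")]
status = {"T1": "committed", "T2": "aborted", "T3": "active"}
new_pages, carried = selective_merge(pages, log, status)
print(new_pages, carried)
```

Since aborted records never reach a data page, no explicit UNDO pass is needed. This matches the claim on the slide.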

Conclusion
• Clear and present evidence that flash can replace disk
• The IPL approach demonstrates its potential for TPC-C type database applications by
  − Overcoming the “erase-before-write” limitation
  − Exploiting the fast and uniform random access
• IPL also helps realize a lean recovery mechanism

Outline
• Flash memory
• Disk-Based DBMS on Flash Memory
• Flash-Based DBMS: In-Page Logging approach
• Reviews

Reviews
• IPL hurts read performance
  − For each read operation, it has to read both the data page and the log sector page
  − Read performance will be about 2X slower
• No experimental results
  − The authors only give results from an I/O access simulation
• Simulation
  − The data size of the simulation is too small (1GB)
  − It does not show the overall performance of TPC-C (most operations in TPC-C are read operations)

Any Questions?
• Q & A