agrawal (4)

Embed Size (px)

Citation preview

  • 8/9/2019 agrawal (4)

    1/35

    Generating Realistic Impressions

    for File-System BenchmarkingNitin Agrawal

    Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

  • 8/9/2019 agrawal (4)

    2/35

    For better or for worse,

    benchmarks shape a fieldDavid Patterson

    2

  • 8/9/2019 agrawal (4)

    3/35

  • 8/9/2019 agrawal (4)

    4/35

    FS images in past: use what is convenient

    Randomly generated files ofseveral MB (FAST 08)1000 files in 10 dirs w/ random data (SOSP 03)

    188GB and 129GB volumes in Engg dept (OSDI 99)

    5-deep tree, 5 subdirs, 10 8KB files in each (FAST 04)

    10702 files from /usr/local, size 354MB (SOSP 01)

    Typical desktop file system w/ no description (SOSP 05)

    1641 files, 109 dirs, 13.4 MB total size (OSDI 02)

  • 8/9/2019 agrawal (4)

    5/35

    Performance offindoperation

    5

    Relative

    Time

    Taken

    Disk layout & cache state

    File-system logical

    organization

  • 8/9/2019 agrawal (4)

    6/35

    Problem scope

    Characteristics of file-system images have strong impact on performance

    We need to incorporate representative file-system images in benchmarking & design

    6

    How to create representativefile-system images?

  • 8/9/2019 agrawal (4)

    7/35

    Requirements for creating FS images

    Access to data on file systems and disk layout Properties of file-system metadata [Satyanarayan81,

    Mullender84, Irlam93, Sienknecht94, Douceur99, Agrawal07]Disk fragmentation [Smith97]More such studies in future? A technique to create file-system images that

    isRepresentative: given a set of input distributionsControllable: supply additional user constraints Reproducible: control & report internal parametersEasy to use: for widespread adoption and consensus

    7

  • 8/9/2019 agrawal (4)

    8/35

    Introducing Impressions Powerful statistical framework to generate

    file-system images

    Takes properties of file-system attributes as input

    Works out underlying statistical details of the image Mounted on a disk partition for real benchmarkingSatisfies the four design goals

    Applying Impressions gives useful insightsWhat is the impact on performance and storage size?How does an application behave on a real FS image?

    8

  • 8/9/2019 agrawal (4)

    9/35

    Outline

    Introduction Generating realistic file-system images

    Applying Impressions: Desktop search

    Conclusion

    9

  • 8/9/2019 agrawal (4)

    10/35

    Overview ofImpressions

    10

    Impressions

  • 8/9/2019 agrawal (4)

    11/35

    Properties of file-system metadataFive-year study of file-system metadata [FAST07]

    (Agrawal, Bolosky, Douceur, Lorch)

    Used as exemplar for metadata properties in Impressions

    11

  • 8/9/2019 agrawal (4)

    12/35

    Features of Impressions Modes of operation for different usages

    Basic mode: choose default settings for parametersAdvanced mode: several individually tunable knobs

    Thorough statistical machinery ensures accuracyUses parameterized curve fitsAllows arbitrary user constraints

    Built-in statistical tests for goodness-of-fit

    Generates namespace, metadata, file content, anddisk fragmentation using above techniques

    12

  • 8/9/2019 agrawal (4)

    13/35

    Creating valid metadata Creating file-system namespace

    Uses Generative Model proposed earlier [FAST 07]Explains the process of directory tree creationAccurately regenerates distribution of directory

    size and of directory depth

    13

  • 8/9/2019 agrawal (4)

    14/35

    Creating namespace

    14

    Directory tree Monte Carlo runIncorporates dirs by depth

    and dirs by subdir counti

    Probability of parent selection Count(subdirs)+2

    50

    60

    70

    80

    90

    100

    0 2 4 6 8 10 12 14 16Cumulative%o

    fdirectories

    Count of subdirectories

    Directories by Subdirectory Count

    DG

    Dirs by subdir count

    00.020.040.060.08

    0.10.120.140.160.18

    0 2 4 6 8 10 12 14 16Fraction

    ofdirectories

    Namespace depth (bin size 1)

    Directories by Namespace Depth

    DG

    Dirs by namespace depth

    DatasetGenerated

  • 8/9/2019 agrawal (4)

    15/35

    Creating valid metadata Creating file-system namespace Creating files: stepwise process

    File size, file extension, file depth, parent directoryUses statistical models & analytical approximations

    15

  • 8/9/2019 agrawal (4)

    16/35

    Example: creating realistic file sizes

    Pure lognormal distribution no longer good fit Hybrid model: lognormal body, Pareto tail

    Fits observed data more accurately, used to recreatefile sizes in Impressions

    Con

    tribution

    tousedspace

    Containing file size (bytes, log scale)

    LognormalHybrid

    16

    File Sizes

  • 8/9/2019 agrawal (4)

    17/35

    Creating files

    17

    File Size ModelLognormal body, Pareto tail

    Captures bimodal curveiS2S4S6S8S9 S1S3S5S7

    0

    0.02

    0.04

    0.06

    0.08

    0.1

    0.12

    0 8 2K 512K 512M128G

    Fractio

    n

    ofbytes

    File Size (bytes, log scale)

    Files by Containing Bytes

    DG

    Files by containing bytes

  • 8/9/2019 agrawal (4)

    18/35

    Creating files

    18

    File Extensions Percentile valuesTop 20 extensions account

    for 50% of files and bytesiS2S4S6S8S9 S1S3S5S7 E2E4E6E8E9 E3E5E7 E1

    0

    0.2

    0.4

    0.6

    0.8

    1

    Desired Generated

    Fractio

    n

    offiles

    Top Extensions by Count

    cppdllexegifhhtm

    jpgnulltxt

    others

    cppdllexegifhhtm

    jpgnulltxt

    others

    Top extensions by count

  • 8/9/2019 agrawal (4)

    19/35

    Creating files

    19

    File DepthPoissonMultiplicative model along

    with bytes by depth

    00.020.040.06

    0.080.1

    0.120.140.16

    0 2 4 6 8 10 12 14 16

    Fraction

    offiles

    Namespace depth (bin size 1)

    Files by Namespace Depth

    DG

    Files by namespace depth

    16KB

    64KB

    256KB

    768KB2MB

    0 2 4 6 8 10 12 14 16Meanby

    tesperfile

    (log

    scale)

    Namespace depth (bin size 1)

    Bytes by Namespace Depth

    DG

    Bytes by namespace depth

    iS2S4S6S8S9 S1S3S5S7 E2E4E6E8E9 E3E5E7 E1D2D4D6D8D9 D3D5D7 D1

  • 8/9/2019 agrawal (4)

    20/35

    Creating files

    20

    Parent Dir Inverse Polynomial Satisfies distribution of dirs

    with file count

    0

    0.05

    0.1

    0.15

    0.2

    0.25

    0 2 4 6 8 10 12 14 16

    Fraction

    offiles

    Namespace depth (bin size 1)

    Files by Namespace Depth(with Special Directories)

    DG

    Files by namespace depthw/ special dirs

    i

  • 8/9/2019 agrawal (4)

    21/35

    Resolving arbitrary constraints

    21

  • 8/9/2019 agrawal (4)

    22/35

    0

    0.05

    0.1

    0.15

    8 2K 512K 8M

    Fraction

    offiles

    File Size (bytes, log scale)

    C

    0

    0.05

    0.1

    0.15

    8 2K 512K 8M

    Fraction

    offiles

    File Size (bytes, log scale)

    O

    Resolving arbitrary constraints

    Accurate both for the sum and the distribution22

    Original Constrained

    Contrived

    for sum

    Constraint: Given count of files & size distribution, ensure

    sum of file sizes matches a desired total file system size

  • 8/9/2019 agrawal (4)

    23/35

    Resolving arbitrary constraints Arbitrarily specified on file system parameters Variant of NP-complete Subset Sum Problem

    Approximation algorithm based solution (in paper)Oversampling to get additional sample valuesLocal improvement to iteratively converge to the

    desired sum by identifying best-fit in current sample While constraints are satisfied, constrained

    distribution also retains original characteristics23

  • 8/9/2019 agrawal (4)

    24/35

    Interpolation and extrapolation Why dont we just use available data values?

    Limited to empirical data in input dataset

    What-if analysis beyond available dataset

    Efficient to maintain compact curve fits and useinterpolation/extrapolation instead of all data

    Technique: Piecewise interpolation

    24

  • 8/9/2019 agrawal (4)

    25/35

    0

    0.02

    0.04

    0.06

    0.08

    0.1

    0.12

    0.14

    0 2 832

    128

    512

    2K

    8K

    32K

    128K

    512K

    2M

    8M

    32M

    128M

    512M

    2G

    8G

    32G

    128G

    Fractiono

    fbytes

    File Size

    Piecewise Interpolation

    100 GB

    50 GB

    10 GB

    Segment 19Segment 19Segment 19Segment 19

    0

    0.02

    0.04

    0.06

    0 50 100SegmentValue

    File System Size (GB)

    Interpola9on: Seg 1

    Interpolation technique & accuracyPiecewise Interpolation

    25

    Each distribution broken down into segments Data points within a segment used for curve fit Combine segment interpolations for new curve

    0

    0.02

    0.04

    0.06

    0.08

    0.1

    0.12

    8 2K 512K 128M 32G

    Fraction

    ofbytes

    File Size

    Extrapolation (125 GB)

    RE

    File Size extrapolation 125GB

    0

    0.02

    0.04

    0.06

    0.08

    0.1

    0.12

    8 2K 512K 128M 32G

    Fraction

    ofbytes

    File Size

    Interpolation (75 GB)

    RI

    File Size interpolation 75GBReal

    InterpolatedExtrapolated

    Real

  • 8/9/2019 agrawal (4)

    26/35

    File content Files having natural language content

    Word-popularity model (heavy tailed)

    Word-length frequency model (for the long tail)

    Content for other files (mp3, gif, mpeg etc)Impressions generates valid header/footerUses third-party libraries and software

    26

  • 8/9/2019 agrawal (4)

    27/35

    Disk layout and fragmentation

    27

  • 8/9/2019 agrawal (4)

    28/35

    Disk layout and fragmentation Simplistic technique

    Layout Scorefor degree of fragmentation [Smith97]

    Pairs of file create/delete operations till desiredlayout score is achieved In future more nuanced ways are possible

    Out-of-order file writes, writes with long delaysAccess to file-system specific interfaces

    FIPMAP in Linux, XFS_IOC_GETBMAP for XFSPerhaps a tool complementary to Impressions

    28

    File 1 File 2

    1 non-contiguous block (out of 8)File Layout Score = 7/8 All blocks contiguousFile Layout Score = 1 (6/6)

  • 8/9/2019 agrawal (4)

    29/35

    Outline

    Introduction Generating realistic file-system images

    Applying Impressions: Desktop search

    Conclusion

    29

  • 8/9/2019 agrawal (4)

    30/35

    Applying Impressions Case study: desktop search

    Google desktop for linux (GDL) and Beagle

    Metrics of interest:

    Size of index, time to build initial search indexIdentifying application bugs and policies

    GDL doesnt index content beyond 10-deep Computing realistic rules of thumb

    Overhead of metadata replication?30

  • 8/9/2019 agrawal (4)

    31/35

  • 8/9/2019 agrawal (4)

    32/35

    Impact of metadata and content

    Developer aid: understanding impact of different

    file system content & different index schemes

    32

    00.5

    11.5

    22.5

    33.5

    Original

    TextCache

    DisDir

    DisFilte

    rRelativeIndexSize

    Beagle: Index Size

    DefaultText

    Image

    Binary

  • 8/9/2019 agrawal (4)

    33/35

    Impact of metadata and content

    Reproducing identical file-system image to

    compare other apps or ones developed later

    33

    00.5

    11.5

    22.5

    33.5

    Origina

    l

    TextC

    ache

    DisDir

    DisFilte

    rRelativeIndexSize

    Beagle: Index Size

    DefaultText

    Image

    Binary

    Future AppBeagle

  • 8/9/2019 agrawal (4)

    34/35

    Conclusion Impressions framework for realistic FS images

    Representative, controllable, reproducible, easy to use

    Includes almost all file system params of interest

    Extensible architecturePlug in new statistical constructs, new models for

    metadata and content generation Powerful utility for file-system benchmarking

    To be contributed publicly (coming soon) http://www.cs.wisc.edu/adsl/Software/Impressions

    34

  • 8/9/2019 agrawal (4)

    35/35

    Nitin Agrawalhttp://www.cs.wisc.edu/~nitina

    Impressions download (coming soon)http://www.cs.wisc.edu/adsl/Software/Impressions

    35

    Questions?

    i