HDFS Analysis for Small Files

Analyzing Small Files in HDFS Cluster

Presenters: Rohit Jangid, Raman Goyal

Outline
▪ What are small files and their problems?
▪ Small Files Analysis
  ▪ Architecture
  ▪ FsImage Processing and Aggregation
  ▪ Implementation and Tool
▪ Dashboards and Results
  ▪ Dashboards
  ▪ Results
▪ Future Work
▪ Conclusions

Expedia’s HDFS Cluster

HDFS Doesn’t Like Lots of Small Files…

Problem?

INEFFICIENT DATA ACCESS PATTERN

MAKES JOBS SLOW....

Trivial Solution?

Solution?

Compaction
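To make the idea of compaction concrete, here is a minimal sketch that groups small files into batches whose combined size stays under a target, so that each batch could later be rewritten as one larger file. The function name `plan_compaction`, the 128 MB target, and the sample paths are assumptions for illustration only; they are not taken from the presented tool.

```python
from typing import Dict, List

MB = 1024 * 1024


def plan_compaction(file_sizes: Dict[str, int],
                    target_bytes: int = 128 * MB) -> List[List[str]]:
    """Group files into batches whose total size stays within target_bytes.

    A single file larger than target_bytes ends up in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    # Visit files from largest to smallest and close a batch when it would overflow.
    for path, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical directory listing: path -> size in bytes.
    sizes = {"/d/a": 60 * MB, "/d/b": 50 * MB, "/d/c": 40 * MB, "/d/d": 30 * MB}
    for batch in plan_compaction(sizes):
        print(batch)
```

The plan only decides which files belong together; actually merging them depends on the file format and is left out of this sketch.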

BUT WHERE...?

SMALL FILES ANALYSIS

ARCHITECTURE

HDFS Cluster → Raw FsImage → Interpreted FsImage (LSR) → Attributed Files and Directory information → Aggregated Files and Directory information → Storage → Dashboard

FsImage PROCESSING

▪ Fetched from NameNode
▪ OIV to LSR interpreter
▪ Custom OIV interpreter for reduced memory usage
▪ Processed a 20 GB FsImage in ~20 minutes

HDFS Cluster → Raw FsImage → Interpreted FsImage (LSR)
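As an illustration of the interpretation step, the sketch below parses one row of an interpreted FsImage, assuming the rows resemble `ls -lR` output: permissions, replication, owner, group, size, date, time, path. That column layout, the `FsEntry` type, and the `parse_lsr_line` name are assumptions made for this example; the custom interpreter used in the talk is not shown in the slides.

```python
from typing import NamedTuple, Optional


class FsEntry(NamedTuple):
    path: str
    is_dir: bool
    replication: int
    owner: str
    group: str
    size: int
    modified: str


def parse_lsr_line(line: str) -> Optional[FsEntry]:
    """Parse one ls -lR style row; return None for blank or malformed rows."""
    # Split into at most 8 fields so a path containing spaces stays intact.
    parts = line.strip().split(None, 7)
    if len(parts) != 8:
        return None
    perms, repl, owner, group, size, date, time, path = parts
    if not size.isdigit():
        return None
    return FsEntry(
        path=path,
        is_dir=perms.startswith("d"),
        # Directories are assumed to show "-" instead of a replication factor.
        replication=int(repl) if repl.isdigit() else 0,
        owner=owner,
        group=group,
        size=int(size),
        modified=f"{date} {time}",
    )


if __name__ == "__main__":
    row = "-rw-r--r--   3 alice hadoop    1048576 2016-05-01 10:15 /data/logs/part 0001.txt"
    print(parse_lsr_line(row))
```

Parsing one row at a time, rather than loading the whole listing, keeps memory use flat regardless of how many namespace objects the image holds.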

Attribution and Aggregation

Attributes Found Directly
▪ Owner name
▪ Group name
▪ Size of file
▪ Replication factor
▪ Number of direct file objects
▪ Last modified date
▪ Level of file
▪ Is file or is directory?

Aggregated Attributes (if directory)
▪ Number of small file objects
▪ Number of namespace objects
▪ Smallest, largest, average file size
▪ Difference in size since last run

Attribution and Aggregation

▪ Generate small files / total files metrics
▪ Roll up attributes to parent directories
▪ Custom UDFs and Pig scripts, using Sqoop
▪ Stored in HDFS

Attributed Files and Directory information → Aggregated Files and Directory information → Storage
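The roll-up and the small files / total files metric can be sketched in a few lines. This is a plain-Python illustration, not the Pig scripts or UDFs from the talk; the names `roll_up` and `small_file_ratio` are hypothetical, and the 10 MB threshold is an assumption borrowed from the smallest size bucket shown on the dashboards.

```python
import posixpath
from collections import defaultdict
from typing import Dict, Iterable, Tuple

MB = 1024 * 1024
SMALL_FILE_BYTES = 10 * MB  # assumed cutoff for "small"


def roll_up(files: Iterable[Tuple[str, int]]) -> Dict[str, Dict[str, int]]:
    """Count files and small files for every ancestor directory of each file."""
    stats: Dict[str, Dict[str, int]] = defaultdict(lambda: {"files": 0, "small_files": 0})
    for path, size in files:
        parent = posixpath.dirname(path)
        while True:
            stats[parent]["files"] += 1
            if size < SMALL_FILE_BYTES:
                stats[parent]["small_files"] += 1
            # Stop at the root (or at "" for a relative path).
            if parent in ("/", ""):
                break
            parent = posixpath.dirname(parent)
    return dict(stats)


def small_file_ratio(entry: Dict[str, int]) -> float:
    """Small files divided by total files for one directory."""
    return entry["small_files"] / entry["files"] if entry["files"] else 0.0


if __name__ == "__main__":
    sample = [("/a/b/f1", 1 * MB), ("/a/b/f2", 200 * MB), ("/a/c/f3", 2 * MB)]
    for directory, entry in sorted(roll_up(sample).items()):
        print(directory, entry, small_file_ratio(entry))
```

Because every file contributes to each of its ancestors, the counts at `/` cover the whole namespace while deeper directories show where the small files concentrate.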

STORAGE AND REPORTING

Storage
▪ Relational database and REST API

Dashboard
▪ Different dashboards showing user level and overall level
▪ Powered by Cyclotron: http://cyclotron.io

Storage → REST API → Dashboard

Implementation and Tool

▪ Download and interpret HDFS NameNode FsImage using the OIV interpreter
▪ Files and directories attributed by splitting FsImage rows
▪ Small file and directory information at directory level; statistics like smallest file calculated
▪ Storage, REST API and dashboards; can easily add new clusters in the tool

DASHBOARDS AND RESULTS

Dashboards Information

3 possible bucketing models:
▪ For file size less than 10 MB
▪ For file size between 10 MB and 70 MB
▪ For file size between 70 MB and ~100 MB

▪ Goes up to all levels in HDFS
▪ Distribution of owners of small files
▪ Top 10 directories to be investigated for deletion, re-partition, compaction
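The three size ranges can be expressed as a small classification function. This is a sketch under stated assumptions: the boundary handling (which side of 10 MB and 70 MB is inclusive), treating "~100 MB" as exactly 100 MB, and counting 1 MB as 1024 × 1024 bytes are all choices made for the example, and the name `size_bucket` is hypothetical.

```python
from collections import Counter
from typing import Iterable

MB = 1024 * 1024


def size_bucket(size_bytes: int) -> str:
    """Map a file size to one of the three small-file ranges, or 'not small'."""
    if size_bytes < 10 * MB:
        return "<10 MB"
    if size_bytes < 70 * MB:
        return "10-70 MB"
    if size_bytes <= 100 * MB:  # "~100 MB" taken as exactly 100 MB here
        return "70-100 MB"
    return "not small"


def bucket_counts(sizes: Iterable[int]) -> Counter:
    """Count how many files fall into each bucket."""
    return Counter(size_bucket(s) for s in sizes)


if __name__ == "__main__":
    print(bucket_counts([1 * MB, 5 * MB, 30 * MB, 80 * MB, 512 * MB]))
```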

Overall Dashboard containing all Information

Distribution of Owners of Small Files

Sample Directories Containing Small Files

Top 10: Files vs Small Files

Daily Small Files per Directory

Future Work

1. Real-time analysis with alerting
   ▪ The tool doesn’t have real-time analysis yet.
   ▪ The cluster has 200+ million namespace objects that we get as a memory dump from the Hadoop server.
   ▪ Translating and attributing each directory and file is a time-consuming process.
2. Developing a customisable compaction utility

Conclusions

EDWPMonitoring@expedia.com
