TagIt: An Integrated Indexing and Search Service for File Systems
Hyogi Sim†,∗, Youngjae Kim‡, Sudharshan S. Vazhkudai∗, Geoffroy R. Vallee∗, Seung-Hwan Lim∗, and Ali R. Butt†
Virginia Tech†, Oak Ridge National Laboratory∗, Sogang University‡
Need for a scientific data management service
• Big data in scientific computing
  – Ever-increasing data generation rate from scientific applications and experiments
  – Growing number of data files in the shared storage system
  – Extremely cumbersome to locate datasets of interest
• "Where are my result files from the Supernova simulation last month?"
[Figure: Number of files in the Spider II storage system (32 PB) at the Oak Ridge Leadership Computing Facility]
Existing solutions
• File systems do not directly support scalable search and discovery semantics
  – GPFS, Lustre, HDFS, GlusterFS, Ceph, …
  – They are designed to provide scalable storage and failure resilience
• Tagging and search solutions for commodity/desktop file systems cannot simply be extended to large-scale file systems
  – Spotlight/HFS+ in OS X, Google Desktop
• Ad-hoc methods to manage scientific datasets
  – Directory hierarchies and descriptive file names
  – Manual annotations and domain-specific datasets with an external database catalog (e.g., ESGF, …)
Current state of the art: using an external database catalog
• Extra resources and effort to design, deploy, and maintain at scale
• Updating the external database catalog is very costly
• Disconnect between the file system and the external database catalog
  – Inevitable inconsistency between the two systems
[Figure: The file system is periodically scanned ("FS Scan & DB Update") to populate the external database catalog]
TagIt: file system-integrated data management service
• File system-integrated distributed metadata indexing
  – Supporting user-defined tags
  – Consistent and scalable metadata indexing
• Additional data management services using active operations
  – Server-side data reduction/filtering
  – Automatic metadata extraction framework

TagIt integrates user-defined metadata with the datasets, making the file system inherently searchable!
GlusterFS: shared-nothing distributed FS
• No dedicated metadata server
  – The directory hierarchy is mirrored on all volume servers
  – A file is created on a single volume server (DHT); see the placement sketch below
• File/directory semantics are kept in the volume servers
• All operations to a file are isolated to a single volume server

$ mkdir /mnt/gluster/testdir
$ touch /mnt/gluster/testdir/file1
$ touch /mnt/gluster/testdir/file2

[Figure: Clients issue the commands above over the network; /testdir appears on every volume server's brick, while /testdir/file1 and /testdir/file2 each live on a single brick]
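To make the placement concrete, here is a minimal Python sketch of DHT-style placement that hashes only the file name to pick a home server. GlusterFS's actual elastic hashing assigns hash ranges per directory, so the hash function and server list here are illustrative assumptions only.

```python
import hashlib

# Hypothetical volume servers; in GlusterFS these would be bricks.
SERVERS = ["server1", "server2", "server3"]

def place(path: str) -> str:
    """Pick the single volume server responsible for a file.

    Only the final name component is hashed; the directory itself
    is mirrored on every server, matching the figure above.
    """
    name = path.rsplit("/", 1)[-1]
    digest = hashlib.md5(name.encode()).digest()
    idx = int.from_bytes(digest[:4], "big") % len(SERVERS)
    return SERVERS[idx]

print(place("/testdir/file1"))  # lands on one server
print(place("/testdir/file2"))  # may land on another
```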
TagIt architecture overview

[Figure: Each volume server (#1, #2, #3, …) runs a DB Manager, an Active Manager, and a TagIt Utility on top of its brick; the per-server index shards together form the Distributed Metadata Index Database. Clients reach TagIt over the network through a find-like command line utility and a proc-like dynamic virtual view.]
Distributed metadata index database
• The file system distributes files evenly across multiple servers based on the DHT
• In the shared-nothing architecture, all operations to a file take place on a single server
• The index database is likewise evenly distributed across the servers
• Consistency and durability problems are localized to a single server

Each volume server keeps an index shard next to its brick, recording file IDs, file path names, and attribute names/values:

  Files:            GID | GFID
  Paths:            FID | GID | PATH | NAME
  Attribute names:  NID | NAME
  Attribute values: XID | GID | NID | VALUE

Example: a.txt lands on volume server #1 and b.txt on volume server #2, so their metadata goes to different shards (a sketch of one shard follows below):

  Shard #1: (GID 1, GFID 1000); (FID 10, GID 1, PATH /my/test, NAME a.txt);
            (NID 101, NAME job); (XID 100, GID 1, NID 101, VALUE "sim-a")
  Shard #2: (GID 1, GFID 1001); (FID 10, GID 1, PATH /my/test, NAME b.txt);
            (NID 101, NAME job); (XID 100, GID 1, NID 101, VALUE "sim-b")
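The shard layout can be sketched with SQLite, which the TagIt implementation embeds; the table names below are hypothetical, only the columns come from the slide.

```python
import sqlite3

# One shard per volume server; hypothetical table names.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE files (GID INTEGER PRIMARY KEY, GFID INTEGER);          -- file IDs
CREATE TABLE paths (FID INTEGER, GID INTEGER, PATH TEXT, NAME TEXT); -- file path names
CREATE TABLE attr_names (NID INTEGER PRIMARY KEY, NAME TEXT);        -- attribute names
CREATE TABLE attr_values (XID INTEGER, GID INTEGER, NID INTEGER, VALUE TEXT);
""")

# Shard #1 holds a.txt, tagged job="sim-a" (the slide's example rows).
db.execute("INSERT INTO files VALUES (1, 1000)")
db.execute("INSERT INTO paths VALUES (10, 1, '/my/test', 'a.txt')")
db.execute("INSERT INTO attr_names VALUES (101, 'job')")
db.execute("INSERT INTO attr_values VALUES (100, 1, 101, 'sim-a')")

# Tag-based lookup: which local files carry job="sim-a"?
rows = db.execute("""
    SELECT p.PATH || '/' || p.NAME
    FROM attr_values v
    JOIN attr_names n ON v.NID = n.NID
    JOIN paths p ON v.GID = p.GID
    WHERE n.NAME = 'job' AND v.VALUE = 'sim-a'
""").fetchall()
print(rows)  # [('/my/test/a.txt',)]
```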
Index update: TagIt-Sync
• Synchronous index update
  – Consistent, durable index database
  – Significant performance penalty due to the extra I/O operations for the DB update
[Figure: On each client I/O request, the volume server's File I/O Manager performs the I/O operation on the brick (e.g., XFS) and the Index DB Manager updates the metadata DB file before returning to the client]
Index update: TagIt-Async
• Dedicated DB update thread (a sketch follows below)
  – Negligible runtime overhead
  – Preserves the consistency and durability of the index database
• Queueing delay stays under 1 ms even in a congested environment (1:8 server-to-client ratio)
• After an unexpected shutdown, lost records are recovered from the GlusterFS journal and the backend local FS (30 sec. to recover the metadata of 10,000 files)
[Figure: The File I/O Manager performs the I/O operation on the brick (e.g., XFS) and returns to the client immediately, requesting an index update from the Index DB Manager; a dedicated DB thread updates a memory-mapped DB, which the OS periodically syncs to the metadata DB file]
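Below is a minimal sketch of the dedicated-update-thread idea, assuming a plain in-process queue; the real TagIt-Async path additionally memory-maps the DB file and relies on the OS to sync it, as in the figure above.

```python
import queue
import threading

update_queue: "queue.Queue[tuple]" = queue.Queue()

def io_path(path: str, tag: tuple) -> None:
    """Foreground I/O path: do the file operation, enqueue the
    index record, and return to the client without touching the DB."""
    # ... perform the actual file I/O on the brick here ...
    update_queue.put((path, tag))  # non-blocking hand-off

def db_thread() -> None:
    """Dedicated DB update thread: drains the queue and applies
    records to the local index shard in the background."""
    while True:
        path, (name, value) = update_queue.get()
        # ... INSERT into the shard's attribute tables here ...
        print(f"indexed {name}={value} for {path}")
        update_queue.task_done()

threading.Thread(target=db_thread, daemon=True).start()
io_path("/my/test/a.txt", ("job", "sim-a"))
update_queue.join()  # wait for the background update (demo only)
```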
TagIt: data management service
• User-defined tags and file search
  – TagIt indexes the standard POSIX extended attributes
    • setfattr and getfattr commands (a scripting sketch follows the session below)
  – The tagit utility supports searching files based on stat and extended attributes
    • e.g., where are the files that I generated last month with the Supernova simulation?
• Advanced active operations associated with the search
  – Similar to find … -exec …
  – Operations are offloaded to and performed in the volume servers

user $ ls /tagit/data
chkpnt1.out chkpnt2.out chkpnt3.out run.log tmp.txt
user $ setfattr -n job -v Supernova /tagit/data/chkpnt*.out
user $ tagit /tagit/data -attr "job=Supernova"
/tagit/data/chkpnt1.out
/tagit/data/chkpnt2.out
/tagit/data/chkpnt3.out
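The same tagging step can also be scripted. Here is a minimal sketch using Python's os.setxattr/os.getxattr on Linux, where user-created attributes normally live in the "user." namespace (the bare "job" name in the session above is presumably mapped by TagIt itself; the path is the slide's example).

```python
import os

# Tag a checkpoint file via standard POSIX extended attributes.
path = "/tagit/data/chkpnt1.out"  # hypothetical path from the slide
os.setxattr(path, b"user.job", b"Supernova")
print(os.getxattr(path, b"user.job"))  # b'Supernova'
print(os.listxattr(path))              # ['user.job', ...]
```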
[Figure: The client tags files with the standard xattr commands; a file search based on tags sends the search query over the network to each volume server's index DB, and the search results come back to the client]
user $ tagit /tagit/data -attr "job=Supernova"
/tagit/data/chkpnt1.out   # we want to calculate the average
/tagit/data/chkpnt2.out   # of the temperature values in each file
/tagit/data/chkpnt3.out
user $ tagit /tagit/data -attr "job=Supernova" -exec ./avg
avgtemp=1000   # result of ./avg /tagit/data/chkpnt1.out
avgtemp=2000   # result of ./avg /tagit/data/chkpnt2.out
avgtemp=1500   # result of ./avg /tagit/data/chkpnt3.out
[Figure: Active operation — the search query plus the command travel over the network; each volume server's Active Manager runs the active processing on its local search results (the dataset of interest) and returns the execution results to the client]
user $ tagit /tagit/data -attr "job=Supernova" -exec ./avg
avgtemp=1000   # result of ./avg /tagit/data/chkpnt1.out
avgtemp=2000   # result of ./avg /tagit/data/chkpnt2.out
avgtemp=1500   # result of ./avg /tagit/data/chkpnt3.out
user $ tagit /tagit/data -attr "job=Supernova" -exec ./avg -index
user $ tagit /tagit/data -attr "job=Supernova and avgtemp>1000"
/tagit/data/chkpnt2.out
/tagit/data/chkpnt3.out

With -index, each execution result is indexed back into the metadata database (metadata extraction), so the new avgtemp attribute becomes searchable over the dataset of interest (a server-side sketch appears after the figure below).
[Figure: The search query plus the command travel over the network; each Active Manager performs the active processing locally, stores the processing result in its index DB, and sends back only a return code; a subsequent search can then use the new attribute]
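As a rough illustration of the server-side flow for -exec … -index, here is a minimal sketch: each volume server finds matches in its local shard, runs the operator on each local file, and indexes the operator's output as a new attribute. The Shard class and run_active_op are hypothetical stand-ins; TagIt implements this inside a GlusterFS translator, not in Python.

```python
import subprocess

class Shard:
    """Toy stand-in for one server's index shard (hypothetical)."""
    def __init__(self):
        self.attrs = {}  # path -> {attribute name: value}
    def search(self, name, value):
        return [p for p, a in self.attrs.items() if a.get(name) == value]
    def index(self, path, name, value):
        self.attrs.setdefault(path, {})[name] = value

def run_active_op(shard: Shard, operator: str, name: str, value: str) -> int:
    """Server-side handler for `tagit ... -exec <operator> -index`."""
    for path in shard.search(name, value):        # local index lookup only
        out = subprocess.run([operator, path],    # e.g., ./avg <file>
                             capture_output=True, text=True).stdout
        key, _, val = out.strip().partition("=")  # e.g., "avgtemp=1000"
        shard.index(path, key, val)               # metadata extraction
    return 0  # only a return code travels back to the client
```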
Performance evaluation
• Implementation using GlusterFS-3.7
  – Modular architecture based on translators
  – Server side: a dedicated translator for the metadata indexing and active operations, using the SQLite embedded database
  – Client side: command line utilities using the GlusterFS API (glapi)

Evaluation questions:
1. What is the overhead of the extra metadata indexing?
2. What is the file search performance?
3. How effective are the active operators?
1. What is the overhead of the extra metadata indexing?

IOPS of TagIt-Async normalized to baseline GlusterFS (mdtest):

  F-create  F-stat  F-read  F-remove  D-create  D-stat  D-remove
  0.98      0.95    0.98    0.91      0.97      1.00    0.96

• mdtest on a 104-node cluster
  – Two four-core Xeon E5410 processors with 16 GB RAM per node
  – Mellanox MT25208 10 Gbit/sec InfiniBand network
• 80 volume servers using 80 physical nodes (tmpfs) and 24 clients

The indexing overhead is less than 10% across all operations.
2. What is the file search performance?
• Compared to the external database approach
  – TagIt vs. MySQL, both with 16 identical servers using SSDs
  – Workload: 1.3 million entries from a daily snapshot of the Spider II PFS
• Populating the index database with MySQL took 96 minutes:
  1. Scan the file system
  2. Populate the database with the scan result
• TagIt does not need any extra data population
Test queries and method
• Test file search queries
  – Q1: Locate files/directories whose pathname contains 'never-existing' (name)
  – Q2: Count the number of regular files in '/proj' owned by me (path, mode, uid)
  – Q3: Find regular files with a '.mpi' extension owned by our group under /proj (path, mode, uid)
  – Q4: List all files owned by our group (path, mode, gid)
  – Q5: List all files that were created within the last 24 hours (path, mode, ctime)
• Test method
  – Each client repeatedly executes a query 50 times
  – The number of clients increases up to 16
[Figure: Q4 and Q5 record distributions across the 16 servers (x-axis: Server ID, y-axis: Number of Records) for MySQL-16 vs. TagIt, annotated 562/647 (Q4) and 35,150/50,502 (Q5)]
[Figure: Q4 and Q5 runtime (seconds) for 1, 2, 4, 8, and 16 clients, MySQL-16 vs. TagIt]
3. How effective are the active operators?
• Workload
  – AMIP* atmospheric measurement dataset with 132 netCDF files (each 1.2 GB, 150 GB in total)
  – Calculating the average temperature from each file
• Offline vs. TagIt operator
  – 16 volume servers
  – Offline: the traditional way, with file I/O system calls
    • up to 16 processes from 16 client nodes
  – TagIt: using the active operator
    • tagit /AMIP -name *.nc -exec ./getavg

*AMIP: Atmospheric Model Intercomparison Project
Active operator runtime comparison

Runtime (seconds) vs. number of clients:

  Clients        1      2      4      8      16
  Offline-Multi  65.52  34.91  19.51  11.65  7.28
  TagIt          4.26   4.26   4.26   4.26   4.26
TagIt summary
• File system-integrated indexing and search service
  – Consistent, scalable metadata indexing framework
  – Advanced data management services including active operations and metadata extraction
  – No need for additional resources
Questions?
Query performance at scale: query broadcasting overhead
• 96 volume servers using 48 physical nodes (OLCF Rhea)
• 105 million files populated; metadata index database: 140 GB
• Queries executed from a single server (a broadcast sketch follows below)
• Query result sizes: Q1: 0, Q2: 1 (count), Q3: 4,766
[Figure: Runtime (seconds) vs. number of volume servers (2 to 94) for Q1, Q2, and Q3; runtime grows roughly linearly with the server count, annotated 0.013x (Q1), 0.018x (Q2), and 0.016x (Q3), reaching about 4.5 and 6.1 seconds at the largest scale]
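To illustrate where the broadcasting overhead comes from, here is a minimal sketch of the fan-out step, with a hypothetical query_shard stub standing in for a per-server shard lookup; the per-server cost of this fan-out is what appears as the roughly 0.01x-per-server slope above.

```python
from concurrent.futures import ThreadPoolExecutor

def query_shard(server: str, predicate: str) -> list:
    """Hypothetical stub: each volume server answers the predicate
    from its local index shard and returns matching paths."""
    return []  # stand-in for the shard's local result

def broadcast(servers: list, predicate: str) -> list:
    """Fan a search query out to every volume server in parallel
    and merge the per-shard results."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        parts = pool.map(lambda s: query_shard(s, predicate), servers)
    return [r for part in parts for r in part]

print(broadcast([f"server{i}" for i in range(96)], "job=Supernova"))
```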
Active operator runtime comparison (including Offline-Single)

Runtime (seconds) vs. number of clients:

  Clients         1      2      4      8      16
  Offline-Single  65.52  43.56  35.79  34.63  34.25
  Offline-Multi   65.52  34.91  19.51  11.65  7.28
  TagIt           4.26   4.26   4.26   4.26   4.26