TagIt: An Integrated Indexing and Search Service for File Systems
Hyogi Sim†,∗, Youngjae Kim‡, Sudharshan S. Vazhkudai∗, Geoffroy R. Vallee∗, Seung-Hwan Lim∗, and Ali R. Butt†
Virginia Tech†, Oak Ridge National Laboratory∗, Sogang University‡
Need for a scientific data management service
• Big data in scientific computing
  – Ever-increasing data generation rate from scientific applications and experiments
  – Growing number of data files in the shared storage system
  – Extremely cumbersome to locate datasets of interest
• "Where are my result files from the Supernova simulation last month?"
[Figure: Number of files in the Spider II storage system (32 PB) at the Oak Ridge Leadership Computing Facility]
Existing solutions
• File systems do not directly support scalable search and discovery semantics
  – GPFS, Lustre, HDFS, GlusterFS, Ceph, …
  – They are designed to provide scalable storage and failure resilience
• Tagging and search solutions for commodity/desktop file systems cannot simply be extended to large-scale file systems
  – Spotlight/HFS+ in OS X, Google Desktop
• Ad-hoc methods to manage scientific datasets
  – Directory hierarchies and descriptive file names
  – Manual annotations and domain-specific datasets with an external database catalog (e.g., ESGF, …)
Current state of the art: using an external database catalog
• Extra resources and effort to design, deploy, and maintain at scale
• Updating the external database catalog is very costly
• Disconnect between the file system and the external database catalog
  – Inevitable inconsistency between the two systems
[Figure: The file system is periodically scanned ("FS Scan & DB Update") to populate the external database catalog]
TagIt: file system-integrated data management service
• File system-integrated distributed metadata indexing
  – Supporting user-defined tags
  – Consistent and scalable metadata indexing
• Additional data management services using active operations
  – Server-side data reduction/filtering
  – Automatic metadata extraction framework

TagIt integrates user-defined metadata with the datasets, making the file system inherently searchable!
GlusterFS: shared-nothing distributed FS
• No dedicated metadata server
  – The directory hierarchy is mirrored on all volume servers
  – A file is created on a single volume server (DHT); see the placement sketch below
• File/directory semantics are kept in the volume servers
• All operations to a file are isolated to a single volume server

$ mkdir /mnt/gluster/testdir
$ touch /mnt/gluster/testdir/file1
$ touch /mnt/gluster/testdir/file2

[Figure: Clients issue the commands above over the network; /testdir appears on every volume server's brick, while /testdir/file1 and /testdir/file2 each live on a single brick]
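To make the placement concrete, here is a minimal Python sketch of DHT-style placement that hashes only the file name to pick a home server. GlusterFS's actual elastic hashing assigns hash ranges per directory, so the hash function and server list here are illustrative assumptions only.

```python
import hashlib

# Hypothetical volume servers; in GlusterFS these would be bricks.
SERVERS = ["server1", "server2", "server3"]

def place(path: str) -> str:
    """Pick the single volume server responsible for a file.

    Only the final name component is hashed; the directory itself
    is mirrored on every server, matching the figure above.
    """
    name = path.rsplit("/", 1)[-1]
    digest = hashlib.md5(name.encode()).digest()
    idx = int.from_bytes(digest[:4], "big") % len(SERVERS)
    return SERVERS[idx]

print(place("/testdir/file1"))  # lands on one server
print(place("/testdir/file2"))  # may land on another
```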
TagIt architecture overview

[Figure: Each volume server (#1, #2, #3, …) runs a DB Manager, an Active Manager, and a TagIt Utility on top of its brick; the per-server index shards together form the Distributed Metadata Index Database. Clients reach TagIt over the network through a find-like command line utility and a proc-like dynamic virtual view.]
Distributed metadata index database
• The file system distributes files evenly across multiple servers based on the DHT
• In the shared-nothing architecture, all operations to a file take place on a single server
• The index database is likewise evenly distributed across the servers
• Consistency and durability problems are localized to a single server

Each volume server keeps an index shard next to its brick, recording file IDs, file path names, and attribute names/values:

  Files:            GID | GFID
  Paths:            FID | GID | PATH | NAME
  Attribute names:  NID | NAME
  Attribute values: XID | GID | NID | VALUE

Example: a.txt lands on volume server #1 and b.txt on volume server #2, so their metadata goes to different shards (a sketch of one shard follows below):

  Shard #1: (GID 1, GFID 1000); (FID 10, GID 1, PATH /my/test, NAME a.txt);
            (NID 101, NAME job); (XID 100, GID 1, NID 101, VALUE "sim-a")
  Shard #2: (GID 1, GFID 1001); (FID 10, GID 1, PATH /my/test, NAME b.txt);
            (NID 101, NAME job); (XID 100, GID 1, NID 101, VALUE "sim-b")
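The shard layout can be sketched with SQLite, which the TagIt implementation embeds; the table names below are hypothetical, only the columns come from the slide.

```python
import sqlite3

# One shard per volume server; hypothetical table names.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE files (GID INTEGER PRIMARY KEY, GFID INTEGER);          -- file IDs
CREATE TABLE paths (FID INTEGER, GID INTEGER, PATH TEXT, NAME TEXT); -- file path names
CREATE TABLE attr_names (NID INTEGER PRIMARY KEY, NAME TEXT);        -- attribute names
CREATE TABLE attr_values (XID INTEGER, GID INTEGER, NID INTEGER, VALUE TEXT);
""")

# Shard #1 holds a.txt, tagged job="sim-a" (the slide's example rows).
db.execute("INSERT INTO files VALUES (1, 1000)")
db.execute("INSERT INTO paths VALUES (10, 1, '/my/test', 'a.txt')")
db.execute("INSERT INTO attr_names VALUES (101, 'job')")
db.execute("INSERT INTO attr_values VALUES (100, 1, 101, 'sim-a')")

# Tag-based lookup: which local files carry job="sim-a"?
rows = db.execute("""
    SELECT p.PATH || '/' || p.NAME
    FROM attr_values v
    JOIN attr_names n ON v.NID = n.NID
    JOIN paths p ON v.GID = p.GID
    WHERE n.NAME = 'job' AND v.VALUE = 'sim-a'
""").fetchall()
print(rows)  # [('/my/test/a.txt',)]
```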
Index update: TagIt-Sync
• Synchronous index update
  – Consistent, durable index database
  – Significant performance penalty due to the extra I/O operations for the DB update
[Figure: On each client I/O request, the volume server's File I/O Manager performs the I/O operation on the brick (e.g., XFS) and the Index DB Manager updates the metadata DB file before returning to the client]
Index update: TagIt-Async
• Dedicated DB update thread (a sketch follows below)
  – Negligible runtime overhead
  – Preserves the consistency and durability of the index database
• Queueing delay stays under 1 ms even in a congested environment (1:8 server-to-client ratio)
• After an unexpected shutdown, lost records are recovered from the GlusterFS journal and the backend local FS (30 sec. to recover the metadata of 10,000 files)
[Figure: The File I/O Manager performs the I/O operation on the brick (e.g., XFS) and returns to the client immediately, requesting an index update from the Index DB Manager; a dedicated DB thread updates a memory-mapped DB, which the OS periodically syncs to the metadata DB file]
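Below is a minimal sketch of the dedicated-update-thread idea, assuming a plain in-process queue; the real TagIt-Async path additionally memory-maps the DB file and relies on the OS to sync it, as in the figure above.

```python
import queue
import threading

update_queue: "queue.Queue[tuple]" = queue.Queue()

def io_path(path: str, tag: tuple) -> None:
    """Foreground I/O path: do the file operation, enqueue the
    index record, and return to the client without touching the DB."""
    # ... perform the actual file I/O on the brick here ...
    update_queue.put((path, tag))  # non-blocking hand-off

def db_thread() -> None:
    """Dedicated DB update thread: drains the queue and applies
    records to the local index shard in the background."""
    while True:
        path, (name, value) = update_queue.get()
        # ... INSERT into the shard's attribute tables here ...
        print(f"indexed {name}={value} for {path}")
        update_queue.task_done()

threading.Thread(target=db_thread, daemon=True).start()
io_path("/my/test/a.txt", ("job", "sim-a"))
update_queue.join()  # wait for the background update (demo only)
```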
TagIt: data management service
• User-defined tags and file search
  – TagIt indexes the standard POSIX extended attributes
    • setfattr and getfattr commands (a scripting sketch follows the session below)
  – The tagit utility supports searching files based on stat and extended attributes
    • e.g., where are the files that I generated last month with the Supernova simulation?
• Advanced active operations associated with the search
  – Similar to find … -exec …
  – Operations are offloaded to and performed in the volume servers

user $ ls /tagit/data
chkpnt1.out chkpnt2.out chkpnt3.out run.log tmp.txt
user $ setfattr -n job -v Supernova /tagit/data/chkpnt*.out
user $ tagit /tagit/data -attr "job=Supernova"
/tagit/data/chkpnt1.out
/tagit/data/chkpnt2.out
/tagit/data/chkpnt3.out
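The same tagging step can also be scripted. Here is a minimal sketch using Python's os.setxattr/os.getxattr on Linux, where user-created attributes normally live in the "user." namespace (the bare "job" name in the session above is presumably mapped by TagIt itself; the path is the slide's example).

```python
import os

# Tag a checkpoint file via standard POSIX extended attributes.
path = "/tagit/data/chkpnt1.out"  # hypothetical path from the slide
os.setxattr(path, b"user.job", b"Supernova")
print(os.getxattr(path, b"user.job"))  # b'Supernova'
print(os.listxattr(path))              # ['user.job', ...]
```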
[Figure: The client tags files with the standard xattr commands; a file search based on tags sends the search query over the network to each volume server's index DB, and the search results come back to the client]
user $ tagit /tagit/data -attr "job=Supernova"
/tagit/data/chkpnt1.out   # we want to calculate the average
/tagit/data/chkpnt2.out   # of the temperature values in each file
/tagit/data/chkpnt3.out
user $ tagit /tagit/data -attr "job=Supernova" -exec ./avg
avgtemp=1000   # result of ./avg /tagit/data/chkpnt1.out
avgtemp=2000   # result of ./avg /tagit/data/chkpnt2.out
avgtemp=1500   # result of ./avg /tagit/data/chkpnt3.out
[Figure: Active operation — the search query plus the command travel over the network; each volume server's Active Manager runs the active processing on its local search results (the dataset of interest) and returns the execution results to the client]
user $ tagit /tagit/data -attr "job=Supernova" -exec ./avg
avgtemp=1000   # result of ./avg /tagit/data/chkpnt1.out
avgtemp=2000   # result of ./avg /tagit/data/chkpnt2.out
avgtemp=1500   # result of ./avg /tagit/data/chkpnt3.out
user $ tagit /tagit/data -attr "job=Supernova" -exec ./avg -index
user $ tagit /tagit/data -attr "job=Supernova and avgtemp>1000"
/tagit/data/chkpnt2.out
/tagit/data/chkpnt3.out

With -index, each execution result is indexed back into the metadata database (metadata extraction), so the new avgtemp attribute becomes searchable over the dataset of interest (a server-side sketch appears after the figure below).
[Figure: The search query plus the command travel over the network; each Active Manager performs the active processing locally, stores the processing result in its index DB, and sends back only a return code; a subsequent search can then use the new attribute]
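As a rough illustration of the server-side flow for -exec … -index, here is a minimal sketch: each volume server finds matches in its local shard, runs the operator on each local file, and indexes the operator's output as a new attribute. The Shard class and run_active_op are hypothetical stand-ins; TagIt implements this inside a GlusterFS translator, not in Python.

```python
import subprocess

class Shard:
    """Toy stand-in for one server's index shard (hypothetical)."""
    def __init__(self):
        self.attrs = {}  # path -> {attribute name: value}
    def search(self, name, value):
        return [p for p, a in self.attrs.items() if a.get(name) == value]
    def index(self, path, name, value):
        self.attrs.setdefault(path, {})[name] = value

def run_active_op(shard: Shard, operator: str, name: str, value: str) -> int:
    """Server-side handler for `tagit ... -exec <operator> -index`."""
    for path in shard.search(name, value):        # local index lookup only
        out = subprocess.run([operator, path],    # e.g., ./avg <file>
                             capture_output=True, text=True).stdout
        key, _, val = out.strip().partition("=")  # e.g., "avgtemp=1000"
        shard.index(path, key, val)               # metadata extraction
    return 0  # only a return code travels back to the client
```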
Performance evaluation
• Implementation using GlusterFS-3.7
  – Modular architecture based on translators
  – Server side: a dedicated translator for the metadata indexing and active operations, using the SQLite embedded database
  – Client side: command line utilities using the GlusterFS API (glapi)

Evaluation questions:
1. What is the overhead of the extra metadata indexing?
2. What is the file search performance?
3. How effective are the active operators?
1. What is the overhead of the extra metadata indexing?

IOPS of TagIt-Async normalized to baseline GlusterFS (mdtest):

  F-create  F-stat  F-read  F-remove  D-create  D-stat  D-remove
  0.98      0.95    0.98    0.91      0.97      1.00    0.96

• mdtest on a 104-node cluster
  – Two four-core Xeon E5410 processors with 16 GB RAM per node
  – Mellanox MT25208 10 Gbit/sec InfiniBand network
• 80 volume servers using 80 physical nodes (tmpfs) and 24 clients

The indexing overhead is less than 10% across all operations.
2. What is the file search performance?
• Compared to the external database approach
  – TagIt vs. MySQL, both with 16 identical servers using SSDs
  – Workload: 1.3 million entries from a daily snapshot of the Spider II PFS
• Populating the index database with MySQL took 96 minutes:
  1. Scan the file system
  2. Populate the database with the scan result
• TagIt does not need any extra data population
Test queries and method
• Test file search queries
  – Q1: Locate files/directories whose pathname contains 'never-existing' (name)
  – Q2: Count the number of regular files in '/proj' owned by me (path, mode, uid)
  – Q3: Find regular files with a '.mpi' extension owned by our group under /proj (path, mode, uid)
  – Q4: List all files owned by our group (path, mode, gid)
  – Q5: List all files that were created within the last 24 hours (path, mode, ctime)
• Test method
  – Each client repeatedly executes a query 50 times
  – The number of clients increases up to 16
[Figure: Q4 and Q5 record distributions across the 16 servers (x-axis: Server ID, y-axis: Number of Records) for MySQL-16 vs. TagIt, annotated 562/647 (Q4) and 35,150/50,502 (Q5)]
[Figure: Q4 and Q5 runtime (seconds) for 1, 2, 4, 8, and 16 clients, MySQL-16 vs. TagIt]
3. How effective are the active operators?
• Workload
  – AMIP* atmospheric measurement dataset with 132 netCDF files (each 1.2 GB, 150 GB in total)
  – Calculating the average temperature from each file
• Offline vs. TagIt operator
  – 16 volume servers
  – Offline: the traditional way, with file I/O system calls
    • up to 16 processes from 16 client nodes
  – TagIt: using the active operator
    • tagit /AMIP -name *.nc -exec ./getavg

*AMIP: Atmospheric Model Intercomparison Project
Active operator runtime comparison

Runtime (seconds) vs. number of clients:

  Clients        1      2      4      8      16
  Offline-Multi  65.52  34.91  19.51  11.65  7.28
  TagIt          4.26   4.26   4.26   4.26   4.26
TagIt summary
• File system-integrated indexing and search service
  – Consistent, scalable metadata indexing framework
  – Advanced data management services including active operations and metadata extraction
  – No need for additional resources
Questions?
Query performance at scale: query broadcasting overhead
• 96 volume servers using 48 physical nodes (OLCF Rhea)
• 105 million files populated; metadata index database: 140 GB
• Queries executed from a single server (a broadcast sketch follows below)
• Query result sizes: Q1: 0, Q2: 1 (count), Q3: 4,766
[Figure: Runtime (seconds) vs. number of volume servers (2 to 94) for Q1, Q2, and Q3; runtime grows roughly linearly with the server count, annotated 0.013x (Q1), 0.018x (Q2), and 0.016x (Q3), reaching about 4.5 and 6.1 seconds at the largest scale]
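To illustrate where the broadcasting overhead comes from, here is a minimal sketch of the fan-out step, with a hypothetical query_shard stub standing in for a per-server shard lookup; the per-server cost of this fan-out is what appears as the roughly 0.01x-per-server slope above.

```python
from concurrent.futures import ThreadPoolExecutor

def query_shard(server: str, predicate: str) -> list:
    """Hypothetical stub: each volume server answers the predicate
    from its local index shard and returns matching paths."""
    return []  # stand-in for the shard's local result

def broadcast(servers: list, predicate: str) -> list:
    """Fan a search query out to every volume server in parallel
    and merge the per-shard results."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        parts = pool.map(lambda s: query_shard(s, predicate), servers)
    return [r for part in parts for r in part]

print(broadcast([f"server{i}" for i in range(96)], "job=Supernova"))
```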
Active operator runtime comparison (including Offline-Single)

Runtime (seconds) vs. number of clients:

  Clients         1      2      4      8      16
  Offline-Single  65.52  43.56  35.79  34.63  34.25
  Offline-Multi   65.52  34.91  19.51  11.65  7.28
  TagIt           4.26   4.26   4.26   4.26   4.26