Gluster Technical Overview
Kaleb Keithley, Red Hat
11 April, 2015
Concepts:
Translator: a shared library that implements storage semantics, e.g. distribution, replication, performance, etc.
Translator Stack: a directed graph of translators
Brick: a server and a directory
Volume: a collection of bricks
Gluster Processes:
Server: glusterfsd, the heart of it all
Clients: FUSE, glusterfs
NFS server: glusterfs (gluster client, NFS server)
Management: glusterd
Healing: glustershd
And more
Let's get real: creating a volume

srv1% gluster peer probe srv1...
srv1% gluster volume create my_vol srv1:/bricks/my_vol
srv1% gluster volume start my_vol

client1% mount -t glusterfs srv1:my_vol /mnt
Simple Translator Stack: Linear

Client stack (top to bottom): debug/io-stats, performance/stat-prefetch, performance/quick-read, performance/io-cache, performance/read-ahead, performance/write-behind, protocol/client
Server stack (top to bottom): protocol/server, debug/io-stats, features/marker, performance/io-threads, features/locks, features/access-control, storage/posix
Here's a simple single brick volume.
You can see which translators run on the client, and which ones run on the server.
In the lower left is the brick, the real disk drive where the data is stored. In the upper left is the virtual disk on the client.
A write to the virtual disk on the client traverses all the translators down to the protocols/client translator, where both the data and metadata are marshaled and sent to the server, and down through the server's stack to the disk. And then the results come back the opposite path.
[Diagram: on the client, application → /mnt → kernel VFS → FUSE → glusterfs; on the server, glusterfsd → kernel VFS → brick, with glusterd alongside]
Let's get real: creating a distribute volume

srv1% gluster volume create my_dist srv1:/bricks/my_vol srv2:/bricks/my_vol
srv1% gluster volume start my_dist

client1% mount -t glusterfs srv1:my_dist /mnt
Complex Translator Stack: Distribute (DHT)

Client stack: debug/io-stats, performance/stat-prefetch, ..., performance/write-behind, cluster/distribute, then protocol/client + protocol/client (one per brick)
Server stack (on each server): protocol/server, debug/io-stats, ..., features/access-control, storage/posix
Here's a similar setup, but now we have added another brick and are using DHT.
See how the I/O is split by the cluster/distribute xlator and sent to both servers.
[Diagram: client (application → /mnt → kernel VFS → FUSE → glusterfs) fanning out to server 1 and server 2, each running glusterd and glusterfsd over kernel VFS and a brick]
Complex Translator Stack: DHT + NFS

NFS/client stack (runs on one of the servers): nfs/server, performance/write-behind, cluster/distribute, then protocol/client + protocol/client (one per brick)
Server stack (on each server): protocol/server, debug/io-stats, ..., features/access-control, storage/posix
NFS clients mount from the server running the nfs/server xlator.
A similar setup, but now we've added NFS by adding an NFS xlator.
Color here is important. On the previous slides I used different colors to indicate which pieces ran on different machines.
That's still true here, but notice that the NFS-server client stack runs on one of the servers.
Thus you can see how GlusterFS serves NFS with DHT (or AFR).
[Diagram: NFS clients talk to a glusterfs NFS-server process running on server 1; server 1 and server 2 each run glusterd and glusterfsd over kernel VFS and a brick]
Let's get real: creating a replica volume

srv1% gluster volume create my_repl replica 2 srv1:/bricks/my_vol srv2:/bricks/my_vol
srv1% gluster volume start my_repl

client1% mount -t glusterfs srv1:my_repl /mnt
Complex Translator Stack: Replica (AFR)

Client stack: debug/io-stats, performance/stat-prefetch, ..., performance/write-behind, cluster/replicate, then protocol/client + protocol/client (one per brick)
Server stack (on each server): protocol/server, debug/io-stats, ..., features/access-control, storage/posix
Here's a similar setup, but now we have added another brick and are using AFR (replication).
See how the I/O is duplicated by the cluster/replicate xlator and sent to both servers.
[Diagram: client (application → /mnt → kernel VFS → FUSE → glusterfs) replicating to server 1 and server 2, each running glusterd and glusterfsd over kernel VFS and a brick]
Tiering: a variation on distribution

Hot and cold data
[Diagram: client (application → /mnt → kernel VFS → FUSE → glusterfs) steering I/O between a hot tier and a cold tier, each a server running glusterd and glusterfsd over kernel VFS and a brick]
Adding a brick: distribute and rebalancing

[Diagram, before: client distributing across server 1 and server 2. After: server 3 and its brick have been added, and files are rebalanced across all three]
Adding a brick: replica and healing

[Diagram, before: client replicating to server 1 and server 2. After: server 3 and its brick have been added, and healing brings the new brick up to date]
Isn't FUSE slow?

Count the copies. And the context switches.
[Diagram: application → /mnt → kernel VFS → FUSE → glusterfs client → glusterfsd, each hop adding copies and context switches]
gfapi to the rescue

[Diagram: application linked with gfapi talks to glusterfsd directly, bypassing the kernel VFS/FUSE path on the client]
POSIX-like API:
glfs_open(), glfs_close()
glfs_read(), glfs_write()
glfs_lseek()
glfs_stat()
etc.
libgfapi: hellogluster.c

#include <fcntl.h>
#include <glusterfs/api/glfs.h>

int main (int argc, char **argv)
{
    glfs_t    *fs;
    glfs_fd_t *fd;
    int        ret;

    fs = glfs_new ("my_vol");
    ret = glfs_set_volfile_server (fs, "tcp", "localhost", 24007);
    /* or ret = glfs_set_volfile (fs, "/tmp/foo.vol"); */
    ret = glfs_init (fs);
    fd = glfs_creat (fs, "filename", O_RDWR, 0644);
    ret = glfs_write (fd, "hello gluster", 14, 0);
    glfs_close (fd);
    return 0;
}
Writing a translator in Python with GluPy
[Diagram: the GluPy xlator sits in the C translator stack (e.g. between io-cache and dht); fops STACK_WIND down into it and cbks STACK_UNWIND back up; GluPy bridges into Python via ctypes (ffi) and gluster.py, on top of libglusterfs]
SwiftOnFile RESTful API, examples with curl
Create a container:
curl -v -X PUT -H "X-Auth-Token: $authtoken" https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername -k

List containers:
curl -v -X GET -H "X-Auth-Token: $authtoken" https://$myhostname:443/v1/AUTH_$myvolname -k

Copy a file into a container (upload):
curl -v -X PUT -T $filename -H "X-Auth-Token: $authtoken" -H "Content-Length: $filelen" https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername/$filename -k

Copy a file from a container (download):
curl -v -X GET -H "X-Auth-Token: $authtoken" https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername/$filename -k > $filename
Here's an example of community at work. Gluster never provided a -devel rpm until we did it for Fedora.
The HekaFS source tree is a good starting point for building your own translators out-of-tree.
Or build in-tree. On modern hardware the whole gluster build takes only a couple of minutes.
Need example of files to change to add new xlator
Translator basics
Translators are shared objects (shlibs)

Methods:
int32_t init(xlator_t *this);
void fini(xlator_t *this);

Data:
struct xlator_fops fops { ... };
struct xlator_cbks cbks { };
struct volume_options options [] = { ... };
Client, Server, Client/Server
Threads: write MT-SAFE
Portability: GlusterFS != Linux only
License: GPLv2 or LGPLv3+
fops: a function pointer table or vtable, with its own sub-API
cbks: legacy, never used by the translator itself, but the run-time will dlsym() it, so you have to define it.
Think carefully about where your translator should run. Client? Server? Either one?
GlusterFS is threaded, your xlator must be MT-SAFE.
This is LinuxCon, yes, and GlusterFS is developed on Linux, but
Say some words about the pieces and their license(s)
Every method has a different signature
Open fop method and callback:
typedef int32_t (*fop_open_t) (call_frame_t *, xlator_t *, loc_t *, int32_t, fd_t *, dict_t *);
typedef int32_t (*fop_open_cbk_t) (call_frame_t *, void *, xlator_t *, int32_t, int32_t, fd_t *, dict_t *);

Rename fop method and callback:
typedef int32_t (*fop_rename_t) (call_frame_t *, xlator_t *, loc_t *, loc_t *, dict_t *);
typedef int32_t (*fop_rename_cbk_t) (call_frame_t *, void *, xlator_t *, int32_t, int32_t, struct iatt *, struct iatt *, struct iatt *, struct iatt *, struct iatt *);
Here we compare the fop and cbk signatures for open() and rename(), to emphasize the point that fops and cbks have different signatures.
Perhaps that's obvious
Data Types in Translators

call_frame_t: per-I/O call context
xlator_t: translator context
inode_t: represents a file on disk; ref-counted
fd_t: represents an open file; ref-counted
iatt_t ~= struct stat
dict_t ~= Python dict (or C++ std::map)
client_t: represents the connected client
call_frame_t, context for the I/O as it traverses down and back up the stack of translators. Valid for the duration of this particular I/O
xlator_t, context about the particular translator the I/O is in at this moment in time. Valid for as long as the translator is loaded, i.e. effectively indefinite.
inode_t and fd_t. It is possible for them to go away. If you need them to stay around, bump their ref count. Don't forget to decrement the ref count when you're done with it.
fop methods and fop callbacks
uidmap_writev (...)
{
    ...
    STACK_WIND (frame, uidmap_writev_cbk,
                FIRST_CHILD (this),
                FIRST_CHILD (this)->fops->writev,
                fd, vector, count, offset, iobref);

    /* DANGER ZONE */
    return 0;
}

You effectively lose control after STACK_WIND:
The callback might have already happened
Or might be running right now
Or maybe it's not going to run 'til later
Here's an fop method for writev().
Do all your work before passing the I/O on to the next translator in the chain, i.e. before calling STACK_WIND.
Note cbk, in this case uidmap_writev_cbk(). No I/O occurs here. It all happens sometime between the STACK_WIND and the invocation of the cbk.
N.B. In this case we're only passing the I/O to a single child.
In the danger zone? Only do clean-up, e.g. release local stuff. Make sure it's not in use further down the call tree.
fop methods and fop callback methods, cont.
uidmap_writev_cbk (call_frame_t *frame, void *cookie, ...)
{
    ...
    STACK_UNWIND_STRICT (writev, frame,
                         op_ret, op_errno, prebuf, postbuf);
    return 0;
}

The I/O is complete when the callback is called
Here's the cbk.
The I/O is complete at this point; return back up the stack with STACK_UNWIND_STRICT.
Notice the cookie parameter. More about this in the next slide
STACK_WIND versus STACK_WIND_COOKIE
Pass extra data to the cbk with STACK_WIND_COOKIE:

quota_statfs (call_frame_t *frame, xlator_t *this, loc_t *loc)
{
    inode_t *root_inode = loc->inode->table->root;
    STACK_WIND_COOKIE (frame, quota_statfs_cbk,
                       root_inode, FIRST_CHILD (this),
                       FIRST_CHILD (this)->fops->statfs, loc, xdata);
    return 0;
}
There is also frame->local, shared by all STACK_WIND callbacks.
Compare STACK_WIND to STACK_WIND_COOKIE. Notice 'root_inode', and that it's being passed as an extra param.
The 'cookie' is only shared between fop and cbk
There's also frame->local which is shared by all fops and cbks, but be careful you don't overwrite something that's already there.
STACK_WIND, STACK_WIND_COOKIE, cont.
Pass extra data to the cbk with STACK_WIND_COOKIE:

quota_statfs_cbk (call_frame_t *frame, void *cookie, ...)
{
    inode_t *root_inode = cookie;
    ...
}
Here we see that cookie != NULL when the fop used STACK_WIND_COOKIE
STACK_UNWIND versus STACK_UNWIND_STRICT
STACK_UNWIND_STRICT uses the correct type:

/* return from function in a type-safe way */
#define STACK_UNWIND(frame, params ...)
do {
    ret_fn_t fn = frame->ret;
    ...

versus:

#define STACK_UNWIND_STRICT(op, frame, params ...)
do {
    fop_##op##_cbk_t fn = (fop_##op##_cbk_t)frame->ret;
    ...
And why wouldn't you want strong typing?
This slide speaks for itself.
Is there a reason to use STACK_UNWIND? Ever?
Calling multiple children (fan out)
afr_writev_wind (...)
{
    ...
    for (i = 0; i < priv->child_count; i++) {
        if (local->transaction.pre_op[i]) {
            STACK_WIND_COOKIE (frame, afr_writev_wind_cbk,
                               (void *) (long) i,
                               priv->children[i],
                               priv->children[i]->fops->writev,
                               local->fd, ...);
        }
    }
    return 0;
}
Here's an example from AFR where the I/O is sent to all the replicas.
Remember the danger zone
Calling multiple children, cont. (fan in)
afr_writev_wind_cbk (...)
{
    LOCK (&frame->lock);
    callcnt = --local->call_count;
    UNLOCK (&frame->lock);
    if (callcnt == 0) /* we're done */
        ...
}

Does failure by any one child mean the whole transaction failed? It needs to be handled accordingly.
Not much else to say
Dealing With Errors: I/O errors
uidmap_writev_cbk (call_frame_t *frame, void *cookie,
                   xlator_t *this, int32_t op_ret, int32_t op_errno, ...)
{
    ...
    STACK_UNWIND_STRICT (writev, frame, -1, EIO, ...);
    return 0;
}

op_ret: 0 or -1, success or failure
op_errno: from errno.h. Use an op_errno that's valid and/or relevant for the fop.
Note that this is in the cbk
Xlators further down the call tree may return an error in op_ret and op_errno. E.g. the storage/posix xlator that actually writes to disk may have a real error, and this propagates back up.
You could decide it's not a real error. Conversely, the lower xlators may return success and you could decide it is an error anyway.
Either pass the parameters on, or change them accordingly.
Recommend: use appropriate errno, or you risk confusing people
Dealing With Errors: FOP method errors
uidmap_writev (call_frame_t *frame, xlator_t *this, ...)
{
    ...
    if (horrible_logic_error_must_abort) {
        goto error; /* glusterfs idiom */
    }
    STACK_WIND (frame, uidmap_writev_cbk, ...);
    return 0;

error:
    STACK_UNWIND_STRICT (writev, frame, -1, EIO, NULL, NULL);
    return 0;
}
And here's how to handle some unrecoverable error, logic or otherwise, in the fop.
Note the glusterfs idiom (retch)
Call to action
Go forth and write applications for GlusterFS!
A basic translator isn't hard. Trust me. ;-)