
Gluster Technical Overview

Kaleb KEITHLEY, Red Hat, 11 April 2015

Concepts:

Translator: a shared library that implements storage semantics, e.g. distribution, replication, performance, etc.

Translator stack: a directed graph of translators

Brick: a server and a directory

Volume: a collection of bricks

Gluster Processes

Server: glusterfsd, the heart of it all

Clients: FUSE, glusterfs

NFS server: glusterfs (gluster client plus NFS server in one process)

Management: glusterd

Healing: glustershd

And more
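A quick way to see these processes on a live system (the volume name anticipates the next slide): gluster volume status lists the brick processes, the NFS server, and the self-heal daemon, with their ports and PIDs.

srv1% gluster volume status my_vol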

Let's get real: creating a volume

srv1% gluster peer probe srv2 ...
srv1% gluster volume create my_vol srv1:/bricks/my_vol
srv1% gluster volume start my_vol

client1% mount -t glusterfs srv1:my_vol /mnt

Simple Translator Stack: Linear

Client stack (top to bottom): debug/io-stats, performance/stat-prefetch, performance/quick-read, performance/io-cache, performance/read-ahead, performance/write-behind, protocol/client

Server stack (top to bottom): protocol/server, debug/io-stats, features/marker, performance/io-threads, features/locks, features/access-control, storage/posix

Here's a simple single-brick volume.

You can see which translators run on the client and which ones run on the server.

In the lower left is the brick, the real disk drive where the data is stored. In the upper left is the virtual disk on the client.

A write to the virtual disk on the client traverses all the translators down to the protocol/client translator, where both the data and metadata are marshaled and sent to the server, and then down through the server's stack to the disk. The results come back along the opposite path.

[Diagram: on the client, application → /mnt → kernel (vfs, fuse) → glusterfs; on the server, glusterd and glusterfsd → kernel (vfs) → brick.]

Let's get real: creating a distribute volume

srv1% gluster volume create my_dist srv1:/bricks/my_vol srv2:/bricks/my_vol
srv1% gluster volume start my_dist

client1% mount -t glusterfs srv1:my_dist /mnt

Complex Translator Stack: Distribute (DHT)

Client stack: debug/io-stats, performance/stat-prefetch, ..., performance/write-behind, cluster/distribute, which fans out to one protocol/client per brick

Server side, one stack per brick: protocol/server, debug/io-stats, features/access-control, ..., storage/posix

Here's a similar setup, but now we have added another brick and are using DHT.

See how the I/O is split by the cluster/distribute xlator and sent to both servers.

[Diagram: client (application → /mnt → kernel vfs/fuse → glusterfs) fanning out to server 1 and server 2, each running glusterd and glusterfsd over the kernel vfs and a brick.]

Complex Translator Stack: DHT + NFS

NFS-server stack (runs on one of the servers): nfs/server, debug/io-stats, performance/stat-prefetch, ..., performance/write-behind, cluster/distribute, fanning out to one protocol/client per brick

Server side, one stack per brick: protocol/server, debug/io-stats, features/access-control, ..., storage/posix

nfs clients mount from the nfs/server translator

A similar setup, but now we've added NFS by adding an NFS xlator.

Color here is important: on the previous slides I used different colors to indicate which pieces ran on different machines.

That's still true here, but notice that the NFS server's client stack runs on one of the servers.

Thus you can see how GlusterFS uses NFS with DHT (or AFR).
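For reference, an NFS client would mount the volume with ordinary NFS tooling. The built-in Gluster NFS server speaks NFSv3 only, so the mount looks something like this (the options shown are typical, not from the deck):

nfsclient1% mount -t nfs -o vers=3,mountproto=tcp srv1:/my_dist /mnt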

[Diagram: server 1 and server 2, each with glusterd, glusterfsd, kernel vfs, and a brick; a glusterfs process carrying the NFS-server stack runs on each server, and nfs clients connect to those.]

Let's get real: creating a replica volume

srv1% gluster volume create my_repl replica 2 srv1:/bricks/my_vol srv2:/bricks/my_vol
srv1% gluster volume start my_repl

client1% mount -t glusterfs srv1:my_repl /mnt

Complex Translator Stack: Replica (AFR)

Client stack: debug/io-stats, performance/stat-prefetch, ..., performance/write-behind, cluster/replicate (AFR), which fans out to one protocol/client per brick

Server side, one stack per brick: protocol/server, debug/io-stats, features/access-control, ..., storage/posix

Here's a similar setup, but now the second brick is a mirror of the first, using AFR.

See how the I/O is duplicated by the cluster/replicate xlator and sent to both servers.

[Diagram: client (application → /mnt → kernel vfs/fuse → glusterfs) replicating to server 1 and server 2, each running glusterd and glusterfsd over the kernel vfs and a brick.]

Tiering: a variation on distribution

Hot and cold data

[Diagram: client (application → /mnt → kernel vfs/fuse → glusterfs) distributing between a hot tier and a cold tier, each a server with glusterd, glusterfsd, kernel vfs, and a brick.]
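A sketch of attaching a hot tier, assuming a tiering-capable release (tiering landed in GlusterFS 3.7, shortly after this talk; 3.7 spelled the command attach-tier, later releases use the gluster volume tier ... attach form, and the hot-brick hosts here are hypothetical):

srv1% gluster volume attach-tier my_vol replica 2 srv3:/bricks/hot srv4:/bricks/hot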

Adding a brick (distribute) and rebalancing

[Diagram, before: client plus server 1 and server 2. After: client plus server 1, server 2, and a new server 3. Each server runs glusterd and glusterfsd over the kernel vfs and a brick.]
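The corresponding commands, following the deck's earlier examples (the brick path on srv3 is assumed):

srv1% gluster peer probe srv3
srv1% gluster volume add-brick my_dist srv3:/bricks/my_vol
srv1% gluster volume rebalance my_dist start
srv1% gluster volume rebalance my_dist status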

Adding a brick (replica) and healing

[Diagram, before: client plus server 1 and server 2. After: client plus server 1, server 2, and a new server 3. Each server runs glusterd and glusterfsd over the kernel vfs and a brick.]
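Again, the corresponding commands, growing the two-way replica to three-way (the brick path on srv3 is assumed); the self-heal daemon (glustershd) then populates the new brick:

srv1% gluster peer probe srv3
srv1% gluster volume add-brick my_repl replica 3 srv3:/bricks/my_vol
srv1% gluster volume heal my_repl full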

Isn't FUSE slow?

[Diagram: application → /mnt → kernel (vfs, fuse) → glusterfs client process → glusterfsd on the server.]

Count the copies

And context switches

gfapi to the rescue

[Diagram: the application links against libgfapi and talks to glusterfsd directly, bypassing the kernel vfs/fuse path and /mnt.]

POSIX-like API:

glfs_open()

glfs_close()

glfs_read(), glfs_write()

glfs_seek()

glfs_stat()

etc.

libgfapi: hellogluster.c

#include <fcntl.h>
#include <glusterfs/api/glfs.h>

int
main (int argc, char **argv)
{
        glfs_t    *fs;
        glfs_fd_t *fd;
        int        ret;

        fs = glfs_new ("my_vol");
        ret = glfs_set_volfile_server (fs, "tcp", "localhost", 24007);
        /* or ret = glfs_set_volfile (fs, "/tmp/foo.vol"); */
        ret = glfs_init (fs);
        fd = glfs_creat (fs, "filename", O_RDWR, 0644);
        ret = glfs_write (fd, "hello gluster", 14, 0);
        glfs_close (fd);
        glfs_fini (fs);
        return 0;
}
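To build it (assuming the glusterfs-api development package and its pkg-config file, as shipped on Fedora; the output name is ours):

% cc -o hellogluster hellogluster.c $(pkg-config --cflags --libs glusterfs-api)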


Writing a translator in Python with GluPy

[Diagram: the GluPy translator sits in the C translator stack between io-cache and dht; libglusterfs dispatches fops into gluster.py via ctypes (ffi), the Python code runs, and calls continue down and back up the stack with STACK_WIND and STACK_UNWIND as usual.]

SwiftOnFile: RESTful API, examples with curl

Create a container: curl -v -X PUT -H "X-Auth-Token: $authtoken" https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername -k

List containers: curl -v -X GET -H "X-Auth-Token: $authtoken" https://$myhostname:443/v1/AUTH_$myvolname -k

Copy a file into a container (upload): curl -v -X PUT -T $filename -H "X-Auth-Token: $authtoken" -H "Content-Length: $filelen" https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername/$filename -k

Copy a file from a container (download): curl -v -X GET -H "X-Auth-Token: $authtoken" https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername/$filename -k > $filename
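Following the same pattern, deleting an object uses the standard Swift DELETE verb (this example is not in the original deck):

Delete a file from a container: curl -v -X DELETE -H "X-Auth-Token: $authtoken" https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername/$filename -k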

Here's an example of community at work. Gluster never provided a -devel rpm until we did it for Fedora.

The HekaFS source tree is a good starting point for adding your own translators built out of tree.

Or build in-tree. On modern hardware the whole gluster build takes only a couple of minutes.

Need example of files to change to add new xlator
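A plausible sketch of those files for the in-tree case (an assumption, not from the deck): create xlators/<category>/<name>/src/ with the translator source and its Makefile.am, add the new directory to the parent directory's Makefile.am, and list the generated Makefiles in AC_CONFIG_FILES in configure.ac.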

Translator basics

Translators are shared objects (shlibs)

Methods:

int32_t init (xlator_t *this);

void fini (xlator_t *this);

Data:

struct xlator_fops fops = { ... };

struct xlator_cbks cbks = { ... };

struct volume_options options [] = { ... };

Client, Server, Client/Server

Threads: write MT-SAFE

Portability: GlusterFS != Linux only

License: GPLv2 or LGPLv3+

fops: function pointer table or vtable with its own sub-API

cbks: legacy, never used in the translator itself, but the run-time will dlsym() it, so you have to have it.

Think carefully about where your translator should run. Client? Server? Either one?

GlusterFS is threaded, your xlator must be MT-SAFE.

This is LinuxCon, yes, and GlusterFS is developed on Linux, but

Say some words about the pieces and their license(s)

Every method has a different signature

Open fop method and callback:

typedef int32_t (*fop_open_t) (call_frame_t *, xlator_t *, loc_t *, int32_t, fd_t *, dict_t *);
typedef int32_t (*fop_open_cbk_t) (call_frame_t *, void *, xlator_t *, int32_t, int32_t, fd_t *, dict_t *);

Rename fop method and callback:

typedef int32_t (*fop_rename_t) (call_frame_t *, xlator_t *, loc_t *, loc_t *, dict_t *);
typedef int32_t (*fop_rename_cbk_t) (call_frame_t *, void *, xlator_t *, int32_t, int32_t, struct iatt *, struct iatt *, struct iatt *, struct iatt *, struct iatt *, dict_t *);

Here we compare the fop and cbk signatures for open(), and also for rename(), just to emphasize the point that fops and cbks have different signatures.

Perhaps that's obvious

Data Types in Translators

call_frame_t: context for the I/O

xlator_t: translator context

inode_t: represents a file on disk; ref-counted

fd_t: represents an open file; ref-counted

struct iatt: ~= struct stat

dict_t: ~= a Python dict (or C++ std::map)

client_t: represents the connected client

call_frame_t, context for the I/O as it traverses down and back up the stack of translators. Valid for the duration of this particular I/O

xlator_t, context about the particular translator the I/O is in at this moment in time. Valid for as long as the translator is loaded, i.e. effectively indefinite.

inode_t and fd_t: it is possible for them to go away. If you need them to stay around, bump their ref count, and don't forget to decrement the ref count when you're done with them.

fop methods and fop callbacks

uidmap_writev (...)
{
        ...
        STACK_WIND (frame, uidmap_writev_cbk,
                    FIRST_CHILD (this),
                    FIRST_CHILD (this)->fops->writev,
                    fd, vector, count, offset, iobref);

        /* DANGER ZONE */
        return 0;
}

You effectively lose control after STACK_WIND: the callback might have already happened,

Or might be running right now,

Or maybe it's not going to run 'til later.

Here's an fop method for writev().

Do all your work before passing the I/O on to the next translator in the chain, i.e. before calling STACK_WIND.

Note the cbk, in this case uidmap_writev_cbk(). No I/O occurs here; it all happens sometime between the STACK_WIND and the invocation of the cbk.

N.B. In this case we're only passing the I/O to a single child.

In the danger zone? Only do clean-up, e.g. release local stuff, and make sure it's not in use further down the call tree.

fop methods and fop callback methods, cont.

uidmap_writev_cbk (call_frame_t *frame, void *cookie, ...)
{
        ...
        STACK_UNWIND_STRICT (writev, frame,
                             op_ret, op_errno, prebuf, postbuf);
        return 0;
}

The I/O is complete when the callback is called

Here's the cbk.

The I/O is complete at this point; return back up the stack with STACK_UNWIND.

Notice the cookie parameter. More about this in the next slide

STACK_WIND versus STACK_WIND_COOKIE

Pass extra data to the cbk with STACK_WIND_COOKIE:

quota_statfs (call_frame_t *frame, xlator_t *this,
              loc_t *loc, dict_t *xdata)
{
        inode_t *root_inode = loc->inode->table->root;

        STACK_WIND_COOKIE (frame, quota_statfs_cbk,
                           root_inode, FIRST_CHILD (this),
                           FIRST_CHILD (this)->fops->statfs,
                           loc, xdata);
        return 0;
}

There is also frame->local, shared by all STACK_WIND callbacks.

Compare STACK_WIND to STACK_WIND_COOKIE. Notice root_inode, and that it's being passed as an extra param.

The 'cookie' is only shared between fop and cbk

There's also frame->local which is shared by all fops and cbks, but be careful you don't overwrite something that's already there.

STACK_WIND, STACK_WIND_COOKIE, cont.

quota_statfs_cbk (call_frame_t *frame, void *cookie, ...)
{
        inode_t *root_inode = cookie;
        ...
}

Here we see that cookie != NULL when the fop used STACK_WIND_COOKIE

STACK_UNWIND versus STACK_UNWIND_STRICT

STACK_UNWIND_STRICT uses the correct type:

#define STACK_UNWIND(frame, params ...)                 \
        do {                                            \
                ret_fn_t fn = frame->ret;               \
        ...

versus:

/* return from function in a type-safe way */
#define STACK_UNWIND_STRICT(op, frame, params ...)      \
        do {                                            \
                fop_##op##_cbk_t fn = (fop_##op##_cbk_t) frame->ret; \
        ...

And why wouldn't you want strong typing?

This slide speaks for itself.

Is there a reason to use STACK_UNWIND? Ever?

Calling multiple children (fan out)

afr_writev_wind (...)
{
        ...
        for (i = 0; i < priv->child_count; i++) {
                if (local->transaction.pre_op[i]) {
                        STACK_WIND_COOKIE (frame, afr_writev_wind_cbk,
                                           (void *) (long) i,
                                           priv->children[i],
                                           priv->children[i]->fops->writev,
                                           local->fd, ...);
                }
        }
        return 0;
}

Here's an example from AFR where the I/O is sent to all the replicas.

Remember the danger zone

Calling multiple children, cont. (fan in)

afr_writev_wind_cbk (...)
{
        LOCK (&frame->lock);
        callcnt = --local->call_count;
        UNLOCK (&frame->lock);

        if (callcnt == 0) /* we're done */
                ...
}

Does failure by any one child mean the whole transaction failed? It needs to be handled accordingly.

Not much else to say

Dealing With Errors: I/O errors

uidmap_writev_cbk (call_frame_t *frame, void *cookie,
                   xlator_t *this, int32_t op_ret, int32_t op_errno, ...)
{
        ...
        STACK_UNWIND_STRICT (writev, frame, -1, EIO, ...);
        return 0;
}

op_ret: 0 or -1, success or failure

op_errno: from <errno.h>. Use an op_errno that's valid and/or relevant for the fop.

Note that this is in the cbk

Xlators further down the call tree may return an error in op_ret and op_errno, e.g. the storage/posix xlator that actually writes to disk may have a real error, and this propagates back up.

You could decide it's not a real error. Conversely, the lower xlators may return success and you could decide it is an error anyway.

Either pass the parameters on, or change them accordingly.

Recommend: use appropriate errno, or you risk confusing people

Dealing With Errors: FOP method errors

uidmap_writev (call_frame_t *frame, xlator_t *this, ...)
{
        ...
        if (horrible_logic_error_must_abort) {
                goto error; /* glusterfs idiom */
        }

        STACK_WIND (frame, uidmap_writev_cbk, ...);
        return 0;

error:
        STACK_UNWIND_STRICT (writev, frame, -1, EIO, NULL, NULL);
        return 0;
}

And here's how to handle some unrecoverable error, logic or otherwise, in the fop.

Note the glusterfs idiom (retch)

Call to action

Go forth and write applications for GlusterFS!

A basic translator isn't hard. Trust me. ;-)
