GlusterFS: a filesystem for Big Data | Roberto Franchini - Codemotion Rome 2015


  • GlusterFS: a distributed file system for today and tomorrow's Big Data

    Roberto "FRANK" Franchini

    @robfrankie @CELI_NLP

    CELI 2015

    Rome, 27-28 March 2015

  • whoami(1)

    15 years of experience, proud to be a programmer

    Writes software for information extraction, NLP, opinion mining (@scale), and a lot of other buzzwords

    Implements scalable architectures

    Plays with servers (don't say that to my sysadmin)

    Member of the JUG-Torino coordination team

  • Company


  • The problem

    Identify a distributed and scalable file system for today and tomorrow's Big Data

  • Once upon a time

  • Once upon a time

    2008: one NFS share

    1.5 TB ought to be enough for anybody

  • Once upon a time

    2010: a herd of shares

    (1.5 TB x N) ought to be enough for anybody

  • Once upon a time

    DIY hand-made distributed file system: HMDFS (tm)

    Nobody could stop the data flood

    It was time for something new

  • Requirements

    Can be enlarged on demand

    No dedicated hardware

    Open source is preferred and trusted

    No specialized API

    No specialized kernel

    POSIX compliance

    Zillions of big and small files

    No NAS or SAN

  • What we do with GFS

    More than 10 GB of Lucene inverted indexes to be stored on a daily basis (more than 200 GB/month)

    Search stored indexes to extract different sets of documents for different customers

    Analytics on large sets of documents (years of data)

  • GlusterFS: features

  • GlusterFS

    Clustered scale-out general-purpose storage platform

    POSIX-y distributed file system

    Built on commodity systems: x86_64 Linux, with POSIX filesystems underneath (XFS, ext4)

    No central metadata server (no SPOF)

    Modular architecture for scale and functionality

  • Common use cases

    Large-scale file server

    Media/content distribution network (CDN)

    Backup / archive / disaster recovery (DR)

    HDFS replacement in the Hadoop ecosystem

    High-performance computing (HPC)

    IaaS storage layer

    Database offload (blobs)

    Unified object store + file access

  • Features

    ACL and quota support

    Fault tolerance

    Peer to peer

    Self-healing

    Fast setup

    Enlarge/shrink on demand

    Snapshots

    On cloud or on premise (physical or virtual)
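The quota support listed above is driven from the gluster CLI; a minimal sketch, assuming an existing volume named myvol (a placeholder name) and a /projects directory inside it:

```shell
# Enable the quota feature on an existing volume (myvol is a placeholder)
sudo gluster volume quota myvol enable

# Cap a directory inside the volume at 10 GB
sudo gluster volume quota myvol limit-usage /projects 10GB

# Show configured limits and current usage
sudo gluster volume quota myvol list
```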

  • Architecture


  • Architecture: node

    Peer / node: a cluster server (glusterfs server)

    Runs the gluster daemons and participates in volumes

  • Architecture: brick

    A filesystem mountpoint on GlusterFS nodes

    A unit of storage used as a capacity building block

  • Architecture: from disks to bricks

    [diagram: 3 TB disks on each Gluster server aggregated into LVM volumes]

  • Architecture: from disks to bricks

    [diagram: 3 TB disks combined into a RAID 6 physical volume, exposed as a single brick per Gluster server]

  • Architecture: from disks to bricks

    [diagram: one brick per disk (/b1, /b2, ... /bN) on each Gluster server]

  • Bricks on a node

  • Architecture: translators

    Logic between bricks or subvolumes that generates a subvolume with certain characteristics

    Distribute, replicate, and stripe are special translators that generate RAID-like configurations

    Performance translators: read-ahead, write-behind
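Performance translators can be toggled per volume from the CLI; a minimal sketch, assuming an existing volume named myvol (a placeholder name):

```shell
# Turn the read-ahead and write-behind performance translators on
# for a volume (myvol is a placeholder)
sudo gluster volume set myvol performance.read-ahead on
sudo gluster volume set myvol performance.write-behind on

# Inspect the resulting volume options
sudo gluster volume info myvol
```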

  • Architecture: volume

    Bricks combined and passed through translators

    Ultimately, what's presented to the end user

  • Architecture: volume

    brick → translator → subvolume → translator → subvolume → translator → subvolume → translator → volume

  • Volume

    [diagram: bricks /brick11 ... /brick34 on Gluster Servers 1, 2 and 3, combined by replicate and distribute translators into the /bigdata volume mounted by a client]


  • Volume types


  • Distributed

    The default configuration

    Files evenly spread across bricks

    Similar to file-level RAID 0

    Server/disk failure could be catastrophic
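A distributed volume is what `gluster volume create` builds when no replica or stripe count is given; a minimal sketch, assuming two bricks prepared as in the quick start on hypothetical hosts node01 and node02:

```shell
# Two bricks, no replica/stripe count: files are distributed across bricks
sudo gluster volume create distvol \
    node01:/srv/sdb1/brick node02:/srv/sdb1/brick
sudo gluster volume start distvol
```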

  • Distributed

  • Replicated

    Files written synchronously to replica peers

    Files read synchronously, but ultimately serviced by the first responder

    Similar to file-level RAID 1

  • Replicated

  • Distributed + replicated

    Similar to file-level RAID 10

    Most used layout
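With a replica count smaller than the brick count, gluster builds replica sets and distributes files across them; a minimal sketch of the RAID-10-like layout above, assuming four prepared bricks on hypothetical hosts node01-node04:

```shell
# replica 2 + 4 bricks = 2 replica pairs, with files distributed across pairs
sudo gluster volume create drvol replica 2 \
    node01:/srv/sdb1/brick node02:/srv/sdb1/brick \
    node03:/srv/sdb1/brick node04:/srv/sdb1/brick
sudo gluster volume start drvol
```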

  • Distributed + replicated

  • Striped

    Individual files split among bricks (sparse files)

    Similar to block-level RAID 0

    Limited use cases: HPC pre/post processing, file size exceeds brick size
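Striped volumes are created with an explicit stripe count; a minimal sketch, assuming two prepared bricks on hypothetical hosts node01 and node02 (note that the stripe feature was deprecated in later GlusterFS releases):

```shell
# stripe 2: each file is split into chunks spread over the two bricks
sudo gluster volume create stripevol stripe 2 \
    node01:/srv/sdb1/brick node02:/srv/sdb1/brick
sudo gluster volume start stripevol
```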

  • Striped

  • Moving parts


  • Components

    glusterd: management daemon; one instance on each GlusterFS server; interfaced through the gluster CLI

    glusterfsd: GlusterFS brick daemon; one process for each brick on each server; managed by glusterd

  • Components

    glusterfs: volume service daemon; one process for each volume service (NFS server, FUSE client, self-heal, quota, ...)

    mount.glusterfs: FUSE native client mount extension

    gluster: Gluster Console Manager (CLI)

  • Quick start

    sudo fdisk /dev/sdb
    sudo mkfs.xfs -i size=512 /dev/sdb1
    sudo mkdir -p /srv/sdb1
    sudo mount /dev/sdb1 /srv/sdb1
    sudo mkdir -p /srv/sdb1/brick
    echo "/dev/sdb1 /srv/sdb1 xfs defaults 0 0" | sudo tee -a /etc/fstab

    sudo apt-get install glusterfs-server

  • Quick start

    sudo gluster peer probe node02

    sudo gluster volume create testvol replica 2 node01:/srv/sdb1/brick node02:/srv/sdb1/brick

    sudo gluster volume start testvol

    sudo mkdir /mnt/gluster
    sudo mount -t glusterfs node01:/testvol /mnt/gluster
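After the quick-start steps, pool and volume state can be verified from any node; a minimal sketch using the same testvol name:

```shell
# Confirm node02 joined the trusted storage pool
sudo gluster peer status

# Confirm the volume was created and started, and see its bricks
sudo gluster volume info testvol
sudo gluster volume status testvol
```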

  • Clients


  • Clients: native

    FUSE kernel module allows the filesystem to be built and operated entirely in userspace

    Specify mount to any GlusterFS server

    Native client fetches the volfile from the mount server, then communicates directly with all nodes to access data

    Recommended for high concurrency and high write performance

    Load is inherently balanced across distributed volumes
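Since the mount server is only used to fetch the volfile, a backup server can be listed so the mount survives that one server being down; a minimal sketch with the quick-start names (the option is spelled backupvolfile-server in older GlusterFS releases):

```shell
# Mount natively; node02 is tried for the volfile if node01 is unreachable
sudo mkdir -p /mnt/gluster
sudo mount -t glusterfs \
    -o backup-volfile-servers=node02 \
    node01:/testvol /mnt/gluster
```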

  • Clients: NFS

    Standard NFS v3 clients

    Standard automounter is supported

    Mount to any server, or use a load balancer

    GlusterFS NFS server includes Network Lock Manager (NLM) to synchronize locks across clients

    Better performance for reading many small files from a single client

    Load balancing must be managed externally
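The built-in GlusterFS NFS server speaks NFS v3 over TCP, so the client mount must request that explicitly; a minimal sketch with the quick-start names:

```shell
# Mount the volume over NFSv3/TCP (GlusterFS's built-in NFS server)
sudo mkdir -p /mnt/gluster-nfs
sudo mount -t nfs -o vers=3,mountproto=tcp \
    node01:/testvol /mnt/gluster-nfs
```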

  • Clients: libgfapi

    Introduced with GlusterFS 3.4

    User-space library for accessing data in GlusterFS

    Filesystem-like API

    Runs in the application process: no FUSE, no copies, no context switches

    ...but same volfiles, translators, etc.

  • Clients: SMB/CIFS

    In GlusterFS 3.4: Samba + libgfapi

    No need for a local native client mount & re-export

    Significant performance improvements with FUSE removed from the equation

    Must be set up on each server you wish to connect to via CIFS

    CTDB is required for Samba clustering
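Samba reaches the volume through its vfs_glusterfs module (the libgfapi path above); a minimal smb.conf share sketch, assuming the quick-start volume name (the share name and paths are placeholders):

```shell
# Append a GlusterFS-backed share to smb.conf (names/paths are placeholders)
sudo tee -a /etc/samba/smb.conf <<'EOF'
[gluster-testvol]
    path = /
    vfs objects = glusterfs
    glusterfs:volume = testvol
    glusterfs:volfile_server = localhost
    read only = no
EOF
```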

  • Clients: HDFS

    Access data within and outside of Hadoop

    No HDFS NameNode single point of failure / bottleneck

    Seamless replacement for HDFS

    Scales with the massive growth of big data

  • Scalability

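Scaling out a live volume is an add-brick followed by a rebalance; a minimal sketch growing the quick-start testvol with two more hypothetical nodes, node03 and node04:

```shell
# Grow the replica-2 volume by one more replica pair
sudo gluster peer probe node03
sudo gluster peer probe node04
sudo gluster volume add-brick testvol \
    node03:/srv/sdb1/brick node04:/srv/sdb1/brick

# Spread existing files onto the new bricks
sudo gluster volume rebalance testvol start
sudo gluster volume rebalance testvol status
```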

  • Under the hood
