
GlusterFS – What, Why, How ...

Presented by: Sahina Bose, Red Hat
18th Mar, 2015

Slide credits: Vijay Bellur, GlusterFS architect, Red Hat


Agenda

● 4Ws and a H
● Why GlusterFS?
● What is GlusterFS?
● How does GlusterFS work?
● Where is GlusterFS used?
● When does feature foo appear in GlusterFS?
● Q & A


Why GlusterFS?


Why GlusterFS?

● 2.5+ exabytes of data produced every day!
● 90% of the world's data was created in the last two years
● Data needs to be stored somewhere!
● Commoditization and Democratization – the way to go

source: http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html


What is GlusterFS?


What is GlusterFS?

● Scale-out distributed storage system.
● Aggregates storage exports over network interconnects to provide a unified namespace.
● Layered on disk file systems that support extended attributes.
● Provides file, object and block interfaces for data access.


How does GlusterFS work?


Typical GlusterFS Deployment


GlusterFS Architecture – Foundations

● Software only, runs on commodity hardware

● No external metadata servers

● Scale-out with Elasticity

● Extensible and modular

● Deployment agnostic


Concepts & Algorithms


GlusterFS concepts – Trusted Storage Pool

(Diagram: Node1 sends a probe to Node2, which accepts it; Node1 and Node2 are now peers in a trusted storage pool.)


GlusterFS concepts – Trusted Storage Pool

(Diagram: a trusted storage pool of Node1, Node2 and Node3; Node3 is then detached from the pool.)
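A minimal CLI sketch of forming and shrinking a pool (hostnames are illustrative):

    # from node1: add node2 to the trusted storage pool
    gluster peer probe node2
    # verify peer state
    gluster peer status
    # remove node3 from the pool
    gluster peer detach node3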


GlusterFS concepts - Bricks

● A brick is the combination of a node and an export directory – e.g. hostname:/dir
● Each brick inherits the limits of the underlying filesystem
● No limit on the number of bricks per node
● Data and metadata are stored on bricks

(Diagram: three storage nodes exporting 3, 5 and 3 bricks respectively, e.g. /export1 through /export5.)


GlusterFS concepts - Volumes

● Logical collection of bricks.
● Identified by an administrator-provided name.
● A volume, or part of a volume, can be mounted on a client:
● mount -t glusterfs server1:/<volname> /my/mnt/point
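A minimal sketch of the full flow, from volume creation to mount (hostnames, brick paths and the volume name are illustrative):

    # create a volume from two bricks and start it
    gluster volume create myvol server1:/export/brick1 server2:/export/brick1
    gluster volume start myvol
    gluster volume info myvol
    # mount it on a client
    mount -t glusterfs server1:/myvol /my/mnt/point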


GlusterFS concepts - Volumes

(Diagram: volumes "music" and "videos" built from bricks /export/brick1 and /export/brick2 on Node1, Node2 and Node3.)


Volume Types

➢ Type of a volume is specified at the time of volume creation
➢ Volume type determines how and where data is placed
➢ The following volume types are supported in GlusterFS:
1) Distribute
2) Stripe
3) Replicate
4) Distributed Replicate
5) Striped Replicate
6) Distributed Striped Replicate
7) Disperse


Distributed Volume

➢ Distributes files across the bricks of the volume.
➢ Directories are present on all bricks of the volume.
➢ Removes the need for an external metadata server and provides O(1) lookup.

How does a distributed volume work?
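The animated diagram is omitted in this transcript. In short, each directory is assigned a hash range per brick, recorded in the trusted.glusterfs.dht extended attribute, and a file's name is hashed to pick the brick that stores it. A hedged way to inspect this on a storage node (brick path is illustrative):

    # show the hash range this brick owns for a given directory
    getfattr -e hex -n trusted.glusterfs.dht /export/brick1/mydir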

Replicated Volume

● Synchronous replication of all updates.

● Provides HA for data.

● Transaction driven for ensuring consistency.

● Changelogs maintained for reconciliation.

● Any number of replicas can be configured.

How does a replicated volume work?
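The diagrams are omitted here; writes go to all replicas synchronously, and changelogs identify files needing heal after a brick outage. A hedged CLI sketch (names illustrative):

    # 3-way synchronous replication across three nodes
    gluster volume create repvol replica 3 server1:/export/brick1 server2:/export/brick1 server3:/export/brick1
    gluster volume start repvol
    # list files pending self-heal, e.g. after a brick was down
    gluster volume heal repvol info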

Distributed Replicated Volume
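The diagram is omitted; a distributed replicated volume is created by supplying more bricks than the replica count. Replica sets form in brick order, and files are distributed across the sets (names illustrative):

    # 4 bricks with replica 2 => 2 replica sets, files distributed across them
    gluster volume create drvol replica 2 \
        server1:/export/brick1 server2:/export/brick1 \
        server1:/export/brick2 server2:/export/brick2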


Disperse Volume

● Introduced in GlusterFS 3.6

● Erasure Coding / RAID 5 over the network

● “Disperses” data on to various bricks

● IDA: Reed-Solomon codes

● Non-systematic erasure coding

● Encoding / decoding done on client side
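A sketch of creating a dispersed volume that survives the loss of any one brick (two data bricks plus one of redundancy; names illustrative):

    gluster volume create dispvol disperse 3 redundancy 1 server1:/export/brick1 server2:/export/brick1 server3:/export/brick1
    gluster volume start dispvol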


Access Mechanisms


Access Mechanisms

Gluster volumes can be accessed via the following mechanisms:

– FUSE based Native protocol

– NFSv3

– SMB

– libgfapi

– ReST/HTTP

– HDFS

– iSCSI
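Illustrative client commands for the first three mechanisms (server, volume and share names are assumptions; Samba hook scripts conventionally export a volume as gluster-<volname>):

    # FUSE based native protocol
    mount -t glusterfs server1:/myvol /mnt/native
    # Gluster NFS (NFSv3 only)
    mount -t nfs -o vers=3 server1:/myvol /mnt/nfs
    # SMB via Samba
    mount -t cifs //server1/gluster-myvol /mnt/smb -o username=smbuser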


FUSE based native access


libgfapi access
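libgfapi lets applications access volumes in-process, with no FUSE mount in the data path; QEMU is a well-known consumer. A hedged example of creating a VM image directly on a Gluster volume (URL components are illustrative):

    # qemu's gluster block driver speaks libgfapi: gluster://HOST/VOLNAME/PATH
    qemu-img create -f qcow2 gluster://server1/myvol/vm01.qcow2 20G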


NFSv3 access with Gluster NFS


NFS-Ganesha with GlusterFS


SMB with GlusterFS


Object/ReST - SwiftonFile

(Diagram: an HTTP request using the Swift REST API passes through the proxy, account, container and object services; each object maps to a file in a directory on a Gluster volume, which is also reachable via an NFS or GlusterFS mount.)

● Unified file and object view.
● Entity mapping between file and object building blocks.


HDFS access


Features

● Scale-out NAS
● Elasticity
● Directory & volume quotas
● Data protection and recovery
  ● Volume and file snapshots
  ● User-serviceable snapshots
  ● Geographic/asynchronous replication
● Archival
  ● Read-only
  ● WORM
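A few of these features in CLI form, a hedged sketch with illustrative names:

    # directory quota: cap /projects on volume myvol at 10GB
    gluster volume quota myvol enable
    gluster volume quota myvol limit-usage /projects 10GB
    # point-in-time volume snapshot (bricks must be on thin-provisioned LVM)
    gluster snapshot create snap1 myvol
    # asynchronous geo-replication to a remote slave volume
    gluster volume geo-replication myvol slave1::backupvol create push-pem
    gluster volume geo-replication myvol slave1::backupvol start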


Features

● Performance
  ● Client-side in-memory caching for performance
  ● Data, metadata and readdir caching
● Monitoring
  ● Built-in I/O statistics
  ● /proc-like interface for introspection
● Provisioning
  ● puppet-gluster
  ● gluster-deploy
● More..
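The built-in statistics are exposed through the CLI, for example:

    # per-brick latency and throughput statistics
    gluster volume profile myvol start
    gluster volume profile myvol info
    # most frequently read files on the volume
    gluster volume top myvol read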


GlusterFS & oVirt



GlusterFS Monitoring with Nagios

http://www.ovirt.org/Features/Nagios_Integration


How is it implemented?


Translators in GlusterFS

● Building blocks for a GlusterFS process.
● Based on translators in GNU Hurd.
● Each translator is a functional unit.
● Translators can be stacked together to achieve the desired functionality.
● Translators are deployment agnostic – can be loaded in either the client or server stacks.
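The stack is described by an auto-generated "volfile". A trimmed, illustrative excerpt for the client graph of a replicated volume (names are assumptions; the second client translator is elided):

    # excerpt from a client volfile
    volume myvol-client-0
        type protocol/client
        option remote-host server1
        option remote-subvolume /export/brick1
    end-volume

    volume myvol-replicate-0
        type cluster/replicate
        subvolumes myvol-client-0 myvol-client-1
    end-volume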


Customizable Translator Stack


Develop with GlusterFS

● Write your own translator
  ● Possible in C and Python as of today
● Write your custom application using libgfapi
  ● Interfaces similar to system calls
  ● C, Python, Java and Go bindings available
● Talk to Gluster through ReST APIs


Where is GlusterFS used?


GlusterFS Use Cases

Source: 2014 GlusterFS user survey


Ecosystem Integration

● Currently used with various ecosystem projects
● Virtualization
  – OpenStack
  – oVirt
  – Qemu
  – CloudStack
● Big Data Analytics
  – Hadoop
  – Tachyon
● File Sync and Share
  – ownCloud


Openstack Juno + GlusterFS – Current Integration

(Diagram: Nova nodes run KVM with GlusterFS as ephemeral storage, accessed via FUSE/libgfapi; Cinder block data, Glance images, Manila file shares and Swift objects, served through the Swift API, are all backed by a pool of GlusterFS storage servers.)


When and What in GlusterFS.next?


New Features in GlusterFS 3.7

● Data Tiering

● Bitrot detection

● Sharding

● Performance improvements

● Better protection against split brain

● NFSv4, 4.1 and pNFS access using NFS-Ganesha

● Netgroups style configuration for NFS


Data Tiering in 3.7

● Policy based data movement across hot and cold tiers

● New translator for identifying candidates for promotion/demotion

● Enables better utilization of different classes of storage device/SSDs

(Diagram: a tier translator on the client stack sits above a hot DHT subvolume and a cold DHT subvolume, alongside replication and other client translators; each brick stack runs POSIX, change time recorder (CTR) and other server translators, with the CTR feeding a heat data store that drives promotion to the hot tier and demotion to the cold tier.)
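A hedged sketch using the 3.7-era tiering CLI (the commands were still evolving at the time; names are illustrative):

    # attach a replicated pair of SSD-backed bricks as the hot tier
    gluster volume attach-tier myvol replica 2 server1:/ssd/brick1 server2:/ssd/brick1
    # later, drain the hot tier and remove it
    gluster volume detach-tier myvol start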


Bitrot detection in 3.7

● Detection of silent data corruption
● Checksum associated with each file
● Periodic data scrubbing
● Detection also upon access
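A hedged CLI sketch (volume name illustrative; the scrub options take values such as hourly/daily and lazy/normal/aggressive):

    # enable checksumming and scrubbing on a volume
    gluster volume bitrot myvol enable
    gluster volume bitrot myvol scrub-frequency daily
    gluster volume bitrot myvol scrub-throttle lazy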


Sharding in 3.7

● Solves fragmentation in Gluster volumes
● Chunks files and places the chunks on any node that has space
● Suitable for large-file workloads
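Sharding is enabled per volume through options; a hedged sketch (names illustrative):

    gluster volume set myvol features.shard on
    # size of each shard/chunk
    gluster volume set myvol features.shard-block-size 64MB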


Netgroups and Exports for NFS in 3.7

● More advanced configuration for authentication, based on an /etc/exports-like syntax

● Support for netgroups

● Patches written at Facebook

● Forward ported from 3.4 to 3.7

● Cleaned up and posted for review
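A hedged sketch of the exports-style configuration (the file location and option name are assumptions based on the 3.7-era feature pages):

    # /var/lib/glusterd/nfs/exports (assumed location)
    /myvol 10.0.0.0/24(rw) @trusted-hosts(rw)

    # enable exports/netgroups authentication for Gluster NFS
    gluster volume set myvol nfs.exports-auth-enable on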


NFS Ganesha support in 3.7

● Supports NFSv4, NFSv4.1 with Kerberos

● pNFS support for Gluster volumes follows later
● Modifications to Gluster internals:
  ● Upcall infrastructure
  ● Gluster CLI to manage NFS-Ganesha
  ● libgfapi improvements

● High-Availability based on Pacemaker and Corosync
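A hedged sketch of the 3.7-era management commands (cluster HA configuration is assumed to be in place):

    # enable the cluster-wide NFS-Ganesha service
    gluster nfs-ganesha enable
    # export a specific volume through NFS-Ganesha
    gluster volume set myvol ganesha.enable on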


Small-file performance enhancements in 3.7

● Multithreaded epoll (transport layer)

● In memory caching of stat and extended attributes on the bricks

● Improvements for directory reads

● Tiering of data within a volume
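Several of these enhancements surface as volume options; a hedged sketch (values illustrative):

    # multithreaded epoll: more event threads on both sides of the transport
    gluster volume set myvol server.event-threads 4
    gluster volume set myvol client.event-threads 4
    # improved directory reads
    gluster volume set myvol performance.readdir-ahead on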


Features beyond GlusterFS 3.7

● HyperConvergence with oVirt

● At rest compression

● De-duplication

● Multi-protocol support with NFS, FUSE and SMB

● Native ReST APIs for gluster management

● More integration with OpenStack, Containers


Hyperconverged oVirt – GlusterFS

● Server nodes are used both for virtualization and serving replicated images from the GlusterFS Bricks

● The boxes can be standardized (hardware and deployment) for easy addition and replacement

● Support for both scaling up, adding more disks, and scaling out, adding more hosts

(Diagram: each host runs VMs and storage side by side, managed by the oVirt Engine; a GlusterFS volume spans bricks on all hosts.)


GlusterFS 4.0


GlusterFS 4.0

● Not evolutionary anymore

● Intended for massive scalability and manageability improvements, removing known bottlenecks

● Make it easier for devops to provision, manage and monitor

● Enable larger deployments and new use cases


GlusterFS 4.0

● New Style Replication

● Improved Distributed Hashing Translator (DHT)

● Composite operations in the GlusterFS RPC protocol

● Support for multiple networks

● Coherent client-side caching

● Advanced data tiering

● ... and much more


GlusterFS next releases

● 3.7 – April 2015
● 3.7+ – April 2015 till 4.0
● 4.0 – 2016


Resources

Mailing lists: [email protected], [email protected]

IRC: #gluster and #gluster-dev on freenode

Web: http://www.gluster.org

Thank you!