DROSS
Distributed & Resilient Open Source Software

Andrew Hardie
http://ashardie.com

ECPRD WGICT
17-21 November 2010
Chamber of Deputies, Bucharest


Topics

• Distributed, not virtualized or ‘cloud’
• DRBD
• Gluster
• Heartbeat
• Nginx
• Trends:
  • NoSQL
  • Map / Reduce
  • Cassandra, Hadoop & family
• Other stuff ‘out there’
• Predictions…


DRBD

• Block-level disk replicator (effectively, net RAID-1)


DRBD – Good/bad points

• Good for HA clusters (e.g. LAMP servers)
• Ideal for block-level apps, e.g. MySQL
• Sync/Async operation
• Auto recovery from disk, net or node failure
• In Linux kernels from 2.6.33 (Ubuntu 10.10 is 2.6.35)
• Supports Infiniband, LVM, XEN, Dual primary config
• Hard to extend beyond two systems, three is maximum
• Remote offsite really needs DRBD Proxy (commercial)
• Requires dedicated disk/partition
• Moderately difficult to configure
• Documentation could be better
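
The sync/async trade-off can be sketched with a toy two-node block mirror. This is only an illustration under stated assumptions: the class `MirroredDisk` and its methods are hypothetical and are not DRBD's real interface, and the "network" is just an in-memory queue.

```python
from collections import deque


class MirroredDisk:
    """Toy model of a two-node block mirror (hypothetical, not DRBD's API)."""

    def __init__(self, num_blocks, synchronous):
        self.primary = [b""] * num_blocks
        self.secondary = [b""] * num_blocks
        self.synchronous = synchronous
        self.pending = deque()  # writes not yet applied on the secondary

    def write(self, block_no, data):
        self.primary[block_no] = data
        if self.synchronous:
            # Synchronous: the write completes only once the peer has it.
            self.secondary[block_no] = data
        else:
            # Asynchronous: acknowledge now, ship to the peer later.
            self.pending.append((block_no, data))

    def flush(self):
        """Deliver queued writes to the secondary (the link catching up)."""
        while self.pending:
            block_no, data = self.pending.popleft()
            self.secondary[block_no] = data

    def in_sync(self):
        return self.primary == self.secondary


sync_disk = MirroredDisk(num_blocks=4, synchronous=True)
sync_disk.write(0, b"row 1")

async_disk = MirroredDisk(num_blocks=4, synchronous=False)
async_disk.write(0, b"row 1")
```

In this sketch, losing the primary before `flush()` runs would lose the queued write in asynchronous mode, which is the price paid for not waiting on the link.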


Gluster

• Filesystem-level replicator
• More like NAS than RAID
• Claims to scale to petabytes
• Nodes can be servers, clients or both
• On the fly reconfig of disks & nodes
• Scripting interface
• ‘Cloud compliant’ (isn’t everything?)


Gluster – Use case - Dublin

Real-time mirroring of Digital Audio


Gluster – Good/bad points

• Moving to “turnkey system” (black box)
• N-way replication easy
• Easier than DRBD to configure
• Dedicated partitions or disks not required
• Supports Infiniband
• Background self-healing (pull rather than push)
• Aggregate and/or replicate volumes
• POSIX support
• Native support for NFS, CIFS, HTTP & FTP
• No specific features for slow link replication
• Similar documentation vs revenue earning tension


Heartbeat

• HA Cluster infrastructure (“cluster glue”)
• Needs Cluster Resource manager (CRM), e.g. Pacemaker, to be useful
• Part of the Linux-HA project
• Provides:
  • hot-swap of synthetic IP address between nodes (Synthetic IP is in addition to node’s own IPs)
  • Node failure/restore detection
  • Start/stop of services to be managed, via init scripts
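
The three things provided above (failure detection, IP hot-swap, service start/stop) can be sketched as a toy simulation. Everything here is an assumption for illustration: `Node`, `Cluster`, `dead_time` and the address `192.0.2.10` are hypothetical names, time is passed in explicitly, and the sketch always fails back to the preferred node, which a real cluster may be configured not to do.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.last_beat = None          # time of the last heartbeat seen
        self.services_running = False  # stands in for init-script start/stop


class Cluster:
    """Toy failover logic (hypothetical, not the Heartbeat implementation)."""

    def __init__(self, nodes, virtual_ip, dead_time):
        self.nodes = nodes            # in order of preference
        self.virtual_ip = virtual_ip  # the extra 'synthetic' address
        self.dead_time = dead_time    # silence longer than this means 'dead'
        self.owner = None             # node currently holding the virtual IP

    def heartbeat(self, node, now):
        node.last_beat = now

    def alive(self, node, now):
        return node.last_beat is not None and now - node.last_beat <= self.dead_time

    def check(self, now):
        """Give the virtual IP to the first live node in preference order."""
        for node in self.nodes:
            if self.alive(node, now):
                new_owner = node
                break
        else:
            new_owner = None
        if new_owner is not self.owner:
            if self.owner is not None:
                self.owner.services_running = False  # 'stop' on the old node
            if new_owner is not None:
                new_owner.services_running = True    # 'start' on the new node
            self.owner = new_owner
        return self.owner


node_a, node_b = Node("node-a"), Node("node-b")
cluster = Cluster([node_a, node_b], virtual_ip="192.0.2.10", dead_time=3.0)
cluster.heartbeat(node_a, 0.0)
cluster.heartbeat(node_b, 0.0)
first_owner = cluster.check(1.0)    # node-a is preferred and alive
cluster.heartbeat(node_b, 4.0)      # node-a has gone silent
second_owner = cluster.check(5.0)   # the virtual IP moves to node-b
```

Clients only ever talk to the virtual IP, so the move is invisible to them apart from the detection delay set by `dead_time`.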


Heartbeat/DRBD – use case

HA LAMP Server pair


Heartbeat – good/bad points

• Lots of resource agents available
  • e.g. Apache, Squid, Sphinx search, VMWare, DB2, WebSphere, Oracle, JBOSS, Tomcat, Postfix, Informix, SAP, iSCSI, DRBD, …
• Beyond simple 2-way hot-swap, config can get very complicated
• Good for stateless (e.g. HTTP); not so good for file shares (e.g. Samba)
• Documentation out of date in some areas, e.g. Ubuntu ‘upstart’ scripts (boot-time startup of services to be managed by Heartbeat has to be disabled)


NGINX

• Fast, simple Russian HTTP server
• Reverse proxy server
• Mail proxy server
• Fast static content serving
• Very low memory footprint
• Load balancing and fault tolerance
• Name and IP based virtual servers
• Embedded Perl
• FLV streaming
• Non-threaded, event-driven architecture
• Modular architecture
• Can front-end Apache (instead of mod_proxy)


Trends – NoSQL, etc…

• NoSQL
• Or, is it really NoACID (atomicity, consistency, isolation, durability)?
• It’s really the ACID that’s hard to scale, esp. in the very large, very active data stores (e.g. social networks)
  • Some NoSQLs now have SQL for query only
  • Ways of solving ACID scalability being discussed
• The problems:
  • Huge numbers of simultaneous updates
  • Large JOINs across very large tables (= big SQL query)
  • Lots of updates & searches on small data elements in vast data sets
• The alternative:
  • Key/value stores
  • De-normalized data


Consequences of De-normalizing

• Order(s) of magnitude increase in storage requirements
• Difficulty of updating numerous “Key equivalents” in many places – can’t be done synchronously
• Breaking relationship links allows parallel processing:
  • helps the bottleneck of storage read speed (storage capacity is growing much faster than transfer rates)
  • No JOINs or transactions
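
A small sketch of these consequences, using invented placeholder data and hypothetical helper names. The normalized form stores each fact once and joins at read time; the de-normalized form copies the member data into every record, so a rename has to touch every copy.

```python
# Normalized: each fact is stored once and joined at read time.
members = {101: {"name": "A. Member", "party": "Blue"}}
speeches = [
    {"id": 1, "member_id": 101, "text": "On the budget..."},
    {"id": 2, "member_id": 101, "text": "On transport..."},
]


def speeches_with_names(members, speeches):
    """The 'JOIN': look up the member for every speech."""
    return [
        {"id": s["id"], "name": members[s["member_id"]]["name"], "text": s["text"]}
        for s in speeches
    ]


def denormalize(members, speeches):
    """Copy the member data into every speech record, keyed by speech id."""
    return {
        s["id"]: {**members[s["member_id"]], "text": s["text"]}
        for s in speeches
    }


def rename_member(store, old_name, new_name):
    """Renaming means touching every copy; returns how many were changed."""
    changed = 0
    for record in store.values():
        if record["name"] == old_name:
            record["name"] = new_name
            changed += 1
    return changed


store = denormalize(members, speeches)
```

Each record in `store` can now be read or processed on its own, with no lookup into `members`, which is what makes the parallel processing possible.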


Name/Value Models

• Just name/value pairs, e.g. memcachedb, Dynamo
• Name/value pairs plus associated data, e.g. CouchDB, MongoDB – think document stores with metadata
• Name/value pairs with nesting, e.g. Cassandra


Cassandra

• Distributed, fault-tolerant database, based on ideas in Dynamo (Amazon) & BigTable (Google)
• Developed by Facebook, open-sourced in 2008
• Now Apache project
• Key/value pairs, in column-oriented format
  • Standard column: name, value, timestamp
  • Super-column: name, map of columns, each with name, value, timestamp (think array of hashes)
  • Grouped by Column family, also either standard or super
  • Column family contains ‘rows’, roughly like a DB table
  • Column families then go in key-spaces
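
The nesting described above can be pictured with plain Python dictionaries. This is a sketch of the data shape only, not of any Cassandra client API; the family names, row keys, values and the helper `write_column` are hypothetical, and the timestamp rule shown is a simple "newest wins" assumption.

```python
from collections import namedtuple

# A standard column: name, value, timestamp.
Column = namedtuple("Column", ["name", "value", "timestamp"])

# key-space -> column family -> row key -> columns (or super-columns)
keyspace = {
    "Users": {                      # standard column family
        "row-1": {
            "email": Column("email", "a@example.org", 1),
        },
    },
    "Addresses": {                  # super column family
        "row-1": {
            "home": {               # super-column: name -> map of columns
                "city": Column("city", "Bucharest", 1),
            },
        },
    },
}


def write_column(row, col):
    """Keep whichever version of the column has the newest timestamp."""
    current = row.get(col.name)
    if current is None or col.timestamp >= current.timestamp:
        row[col.name] = col


users_row = keyspace["Users"]["row-1"]
write_column(users_row, Column("email", "new@example.org", 2))
```

Rows in the same family need not share the same columns, which is the main difference from a relational table.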


Cassandra - NoACID

• Cassandra, et al, e.g. Voldemort (LinkedIn), trade consistency and atomicity for speed, distribution and availability
• No single point of failure
• “Eventually consistent” model
• Tunable levels of consistency
• Atomicity only guaranteed within a column family
• Accessed using Thrift (also developed by Facebook)
• Used by:
  • Facebook
  • Digg
  • Twitter
  • Reddit
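
“Eventually consistent” with tunable levels can be sketched as a toy replica set where each write reaches `w` of `n` replicas and each read asks `r` of them. The class `ReplicaSet` is hypothetical and deliberately picks the worst case (writes go to the first replicas, reads ask the last ones); it is not how Cassandra chooses replicas. The point it illustrates is the general quorum rule: a read is guaranteed to see the latest write when r + w > n.

```python
class ReplicaSet:
    """Toy model of tunable consistency (hypothetical, worst-case overlap)."""

    def __init__(self, n):
        self.replicas = [dict() for _ in range(n)]  # key -> (timestamp, value)

    def write(self, key, value, timestamp, w):
        """Acknowledge once the first w replicas have the value."""
        for replica in self.replicas[:w]:
            replica[key] = (timestamp, value)

    def read(self, key, r):
        """Ask the last r replicas and return the newest value seen."""
        answers = [rep[key] for rep in self.replicas[-r:] if key in rep]
        if not answers:
            return None
        return max(answers, key=lambda version: version[0])[1]

    def repair(self, key):
        """Background catch-up: copy the newest version to every replica."""
        versions = [rep[key] for rep in self.replicas if key in rep]
        if versions:
            newest = max(versions, key=lambda version: version[0])
            for rep in self.replicas:
                rep[key] = newest


rs = ReplicaSet(3)
rs.write("k", "v1", timestamp=1, w=3)
rs.write("k", "v2", timestamp=2, w=1)   # fast write, only one replica has it
stale = rs.read("k", r=1)               # 1 + 1 <= 3: may return the old value
fresh = rs.read("k", r=3)               # 3 + 1 > 3: must see the new value
```

Lower `r` and `w` give speed and availability; higher values give consistency. After `repair()` every replica agrees, which is the "eventually" part.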


NoSQL for Parliaments?

• Much parliamentary material is naturally unstructured and suited to the name/value model (think XML)
• Remember the old discussions about how to map such parliamentary material into relational databases?
• Think of every MP’s contribution (speech) in chamber or committee as a key/value pair, i.e. a column
• Think of every PQ & answer as a super-column of name/value pairs for question, answer, holding, supplementary, pursuant, referral …
• Hansard becomes a super-column family!
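
One way to picture the PQ idea, as a sketch only: the row key (sitting date), the PQ identifier, the texts and the helper names are all invented placeholders, and plain dictionaries stand in for the super-column family.

```python
# Hypothetical layout: row key = sitting date, super-column = one PQ.
hansard = {
    "2010-11-17": {
        "PQ-0001": {
            "question": "What plans does the Minister have for digitization?",
            "answer": "I will write to the honourable Member.",
        },
    },
}


def add_field(family, sitting, pq_id, name, value):
    """Attach one more name/value pair to a PQ; no schema change is needed."""
    family.setdefault(sitting, {}).setdefault(pq_id, {})[name] = value


def fields_for(family, sitting, pq_id):
    """Return the names recorded for a PQ, in sorted order."""
    return sorted(family.get(sitting, {}).get(pq_id, {}))


add_field(hansard, "2010-11-17", "PQ-0001", "supplementary", "When?")
```

A PQ that never gets a holding answer or a referral simply has no such entry, which is awkward to express in fixed relational columns.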


Map / Reduce

• Column (or record) oriented design & de-normalized data power the parallel “map reduce” model (think “sharding on speed”)
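
A minimal word-count sketch of the model, run in one process for simplicity. The shard contents and function names are made up for illustration. Each shard is mapped on its own, so in a real system the map step could run on whichever machine holds that shard; only the small partial results travel to the reduce step.

```python
from collections import Counter
from functools import reduce


def map_shard(lines):
    """Map step: count words within one shard, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results."""
    return left + right


shards = [
    ["the house met at noon", "the motion was agreed"],
    ["the committee met", "the report was agreed"],
]

partials = [map_shard(shard) for shard in shards]   # could run in parallel
totals = reduce(reduce_counts, partials, Counter())
```

Because no shard needs data from another, adding machines adds throughput, which is what the de-normalized, record-oriented layout buys.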


Hadoop

• Nothing to do with NoSQL
• Hadoop is an infrastructure and now family of tools for managing distributed systems and immense datasets
• How immense? Hundreds of GB and 10 node cluster is ‘entry-level’ in Hadoop terms
• Developed by Yahoo for their cloud, now Apache project
• Supports Map/Reduce by pre-dividing & distributing data
• “Moves computation to the data instead of data to the computation”
• HDFS file system particularly interesting – distributed, resilient (far more advanced than DRBD or Gluster), but not real time (more eventually consistent…)
• Hive data warehouse front end – has SQL-like queries


Who uses Hadoop?

• Twitter
• AOL
• IBM
• Last.fm
• LinkedIn
• E-Bay
• Yahoo
  • 36,000 machines with > 100,000 cores running Hadoop
  • Largest cluster is only 4000 nodes
• Largest known cluster is Facebook!
  • 2000 machines with 22,400 cores
  • 21 Petabytes in a single HDFS store


Hadoop for Parliaments?

• Hadoop may seem overkill for parliaments now…
• But, when you start your legacy collection digitization and digital preservation projects, its model, for managing large datasets which essentially do not change & don’t need real-time commit, is a very good fit!
• Other interesting Hadoop projects:
  • Zookeeper (distributed apps co-ordination)
  • Hive (data warehouse infrastructure)
  • Pig (high-level data flow language)
  • Mahout (scalable machine learning library)
  • Scribe (for aggregating streaming log data) [not strictly Hadoop project, but can be integrated with it, using interesting work-around for the non-real time & NameNode single point of failure]


Other things ‘out there’

• Drizzle
  • A database “optimized for Cloud infrastructure and Web applications”
  • “Design for massive concurrency on modern multi-cpu architecture”
  • But, doesn’t actually explain how to use it for these…
  • It’s SQL and ACID
  • Mostly seems to be a reaction against what’s happening at MySQL…
  • Has to be compiled from source – no distros available for it yet
• CouchDB
  • Distributed, fault-tolerant, schema-free document-oriented database
  • RESTful JSON API (i.e. Web front end)
  • Incremental replication with bi-directional conflict detection
  • Written in Erlang (highly reliable language developed by Ericsson)
  • Supports ‘map/reduce’ like querying and indexing
  • Interesting model, different from most other offerings
  • Also now an Apache project
  • Still too immature for anything beyond experimentation


Also ‘out there’

• Voldemort
  • Another distributed key/value storage system
  • Used at LinkedIn
  • Doesn’t seem to have much future
  • Cassandra is similar, better & more widely used
• MonetDB
  • “database system for high-performance applications in data mining, OLAP, GIS, XML Query, text and multimedia retrieval”
  • SQL and XQUERY front ends
  • Also hard to see where it’s going…
• MongoDB
  • Tries to bridge the gap between RDBMS and map/reduce
  • JSON document storage (like CouchDB)
  • No JOINs, no transactions
  • Supports atomic transactions only on single documents
  • Interesting, but may ‘fall between two stools’


Predictions

• Hadoop and Cassandra are the ones to watch
• There will likely be some sort of re-convergence between NoSQL and query languages of some kind – can’t do everything with map/reduce (esp. not ad hoc queries)
• SQL may be destined to become like COBOL – still around and running things but not something to use for new projects
• Distributed storage models (with or without map/reduce) have good future
• Datasets will only get bigger – compliance, audit, digital preservation, the shift to visuals, etc
• Information management models (“strategy”) and access speed will remain key problems


Questions

“What’s it all about?”

http://ashardie.com
