DROSS
Distributed & Resilient Open Source Software
Andrew Hardie
http://ashardie.com
ECPRD WGICT
17-21 November 2010
Chamber of Deputies, Bucharest

Topics
• Distributed, not virtualized or ‘cloud’
• DRBD
• Gluster
• Heartbeat
• Nginx
• Trends:
  • NoSQL
  • Map / Reduce
  • Cassandra, Hadoop & family
• Other stuff ‘out there’
• Predictions…

DRBD
• Block-level disk replicator (effectively, net RAID-1)

DRBD – Good/bad points
• Good for HA clusters (e.g. LAMP servers)
• Ideal for block-level apps, e.g. MySQL
• Sync/Async operation
• Auto recovery from disk, net or node failure
• In Linux kernels from 2.6.33 (Ubuntu 10.10 is 2.6.35)
• Supports Infiniband, LVM, XEN, Dual primary config
• Hard to extend beyond two systems, three is maximum
• Remote offsite really needs DRBD Proxy (commercial)
• Requires dedicated disk/partition
• Moderately difficult to configure
• Documentation could be better
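
The sync/async distinction above can be sketched in a few lines of Python. This is a toy in-memory model, not DRBD's code or configuration: the `Node` and `ReplicatedDevice` classes and the `flush` method are invented for illustration. It only shows when the peer receives a block relative to the acknowledgement.

```python
class Node:
    """A toy block device: block number -> bytes."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_no, data):
        self.blocks[block_no] = data


class ReplicatedDevice:
    """Mirrors every write from a primary node to one secondary node."""

    def __init__(self, primary, secondary, synchronous=True):
        self.primary = primary
        self.secondary = secondary
        self.synchronous = synchronous
        self.pending = []  # writes acknowledged but not yet on the secondary

    def write(self, block_no, data):
        self.primary.write(block_no, data)
        if self.synchronous:
            # Sync: the peer has the block before the caller gets its ack.
            self.secondary.write(block_no, data)
        else:
            # Async: ack immediately, ship the block to the peer later.
            self.pending.append((block_no, data))
        return "ack"

    def flush(self):
        """Deliver queued writes, i.e. the network catching up."""
        while self.pending:
            self.secondary.write(*self.pending.pop(0))


sync_dev = ReplicatedDevice(Node("a"), Node("b"), synchronous=True)
sync_dev.write(0, b"hello")

async_dev = ReplicatedDevice(Node("a"), Node("b"), synchronous=False)
async_dev.write(0, b"hello")
lag = len(async_dev.pending)  # blocks that would be lost if the primary died now
```

In the async case the primary can fail with `pending` non-empty, which is the price paid for not waiting on the network.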

Gluster
• Filesystem-level replicator
• More like NAS than RAID
• Claims to scale to petabytes
• Nodes can be servers, clients or both
• On the fly reconfig of disks & nodes
• Scripting interface
• ‘Cloud compliant’ (isn’t everything?)

Gluster – Use case - Dublin
Real-time mirroring of Digital Audio

Gluster – Good/bad points
• Moving to “turnkey system” (black box)
• N-way replication easy
• Easier than DRBD to configure
• Dedicated partitions or disks not required
• Supports Infiniband
• Background self-healing (pull rather than push)
• Aggregate and/or replicate volumes
• POSIX support
• Native support for NFS, CIFS, HTTP & FTP
• No specific features for slow link replication
• Similar documentation vs revenue-earning tension

Heartbeat
• HA Cluster infrastructure (“cluster glue”)
• Needs Cluster Resource manager (CRM), e.g. Pacemaker, to be useful
• Part of the Linux-HA project
• Provides:
  • hot-swap of synthetic IP address between nodes (Synthetic IP is in addition to node’s own IPs)
  • Node failure/restore detection
  • Start/stop of services to be managed, via init scripts
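
A rough Python sketch of the two ideas above: detecting a dead node from missed heartbeats, and moving the synthetic IP to a survivor. It is not Heartbeat's implementation or configuration. The `Cluster` class, the `dead_after` timeout, the node names and the address are all assumptions made for the example, and time is passed in explicitly so the behaviour is easy to follow.

```python
class Cluster:
    """The synthetic IP lives on the first node that is still sending heartbeats."""

    def __init__(self, nodes, virtual_ip, dead_after=3.0):
        self.nodes = list(nodes)      # in order of preference
        self.virtual_ip = virtual_ip
        self.dead_after = dead_after  # seconds of silence before a node counts as dead
        self.last_seen = {node: 0.0 for node in self.nodes}

    def heartbeat(self, node, now):
        """Record that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def alive(self, node, now):
        return now - self.last_seen[node] <= self.dead_after

    def ip_holder(self, now):
        """Return the node that should currently hold the synthetic IP."""
        for node in self.nodes:
            if self.alive(node, now):
                return node
        return None  # nobody left to serve the address


cluster = Cluster(["web1", "web2"], virtual_ip="192.0.2.10")
cluster.heartbeat("web1", now=0.0)
cluster.heartbeat("web2", now=0.0)
normal = cluster.ip_holder(now=1.0)       # both nodes alive

cluster.heartbeat("web2", now=4.0)        # web1 has gone silent
failed_over = cluster.ip_holder(now=5.0)

cluster.heartbeat("web1", now=6.0)        # web1 comes back
failed_back = cluster.ip_holder(now=6.0)
```

Automatic fail-back to the preferred node is a design choice in this sketch; a real cluster may prefer to leave the address where it is.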

Heartbeat/DRBD – use case
HA LAMP Server pair

Heartbeat – good/bad points
• Lots of resource agents available, e.g. Apache, Squid, Sphinx search, VMWare, DB2, WebSphere, Oracle, JBOSS, Tomcat, Postfix, Informix, SAP, iSCSI, DRBD, …
• Beyond simple 2-way hot-swap, config can get very complicated
• Good for stateless (e.g. HTTP); not so good for file shares (e.g. Samba)
• Documentation out of date in some areas, e.g. Ubuntu ‘upstart’ scripts (boot-time startup of services to be managed by Heartbeat has to be disabled)

NGINX
• Fast, simple Russian HTTP server
• Reverse proxy server
• Mail proxy server
• Fast static content serving
• Very low memory footprint
• Load balancing and fault tolerance
• Name and IP based virtual servers
• Embedded Perl
• FLV streaming
• Non-threaded, event-driven architecture
• Modular architecture
• Can front-end Apache (instead of mod_proxy)
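
To make “load balancing and fault tolerance” concrete, here is a small Python sketch of a round-robin picker that skips back ends marked as down. It illustrates the general technique only; it is not nginx code or nginx configuration, and the `RoundRobinBalancer` class and back-end names are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Hands out back ends in turn, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.down = set()
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        self.down.add(backend)

    def mark_up(self, backend):
        self.down.discard(backend)

    def pick(self):
        # Try each back end at most once per call, then give up.
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if backend not in self.down:
                return backend
        raise RuntimeError("no back end available")


lb = RoundRobinBalancer(["app1", "app2", "app3"])
healthy_picks = [lb.pick() for _ in range(4)]

lb.mark_down("app2")  # e.g. after a failed health check
degraded_picks = [lb.pick() for _ in range(3)]
```

With one back end down, requests simply rotate over the remaining two.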

Trends – NoSQL, etc…
• NoSQL
  • Or, is it really NoACID (atomicity, consistency, isolation, durability)?
  • It’s really the ACID that’s hard to scale, esp. in the very large, very active data stores (e.g. SN)
    • Some NoSQLs now have SQL for query only
    • Ways of solving ACID scalability being discussed
• The problems:
  • Huge numbers of simultaneous updates
  • Large JOINs across very large tables (= big SQL query)
  • Lots of updates & searches on small data elements in vast data sets
• The alternative:
  • Key/value stores
  • De-normalized data

Consequences of De-normalizing
• Order(s) of magnitude increase in storage requirements
• Difficulty of updating numerous “Key equivalents” in many places – can’t be done synchronously
• Breaking relationship links allows parallel processing:
  • helps the bottleneck of storage read speed (storage capacity is growing much faster than transfer rates)
• No JOINs or transactions
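
A small Python illustration of the trade-off listed above, using made-up data. In the normalized form a member's name is stored once and every read needs a JOIN-like lookup; in the de-normalized form each record carries its own copy, so reads are independent but a change has to touch every copy. All names and values here are hypothetical.

```python
# Normalized: each fact is stored once and linked by key.
members = {1: {"name": "A. Member", "party": "Blue"}}
speeches = [
    {"id": 10, "member_id": 1, "text": "First speech"},
    {"id": 11, "member_id": 1, "text": "Second speech"},
]


def speeches_with_names(members, speeches):
    """The JOIN: every read follows member_id back to the members table."""
    return [
        {"id": s["id"], "name": members[s["member_id"]]["name"], "text": s["text"]}
        for s in speeches
    ]


# De-normalized: every record is self-contained, at the cost of repeated data.
denormalized = {
    s["id"]: {
        "name": members[s["member_id"]]["name"],
        "party": members[s["member_id"]]["party"],
        "text": s["text"],
    }
    for s in speeches
}


def rename_member(store, old_name, new_name):
    """Update every copy of the name; returns how many records were touched."""
    touched = 0
    for record in store.values():
        if record["name"] == old_name:
            record["name"] = new_name
            touched += 1
    return touched


joined = speeches_with_names(members, speeches)
copies_updated = rename_member(denormalized, "A. Member", "A. N. Member")
```

With two speeches the rename touches two records; with millions, it becomes the background job that “can’t be done synchronously”.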

Name/Value Models
• Just name/value pairs, e.g. memcachedb, Dynamo
• Name/value pairs plus associated data, e.g. CouchDB, MongoDB – think document stores with metadata
• Name/value pairs with nesting, e.g. Cassandra
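
The three shapes can be pictured as plain Python dictionaries. This is only an analogy for the data shapes, not how any of the named products store data; the keys and values are invented.

```python
# 1. Just name/value pairs: the store treats the value as an opaque blob.
plain = {"speech:1001": "Mr Speaker, I beg to move..."}

# 2. Name/value pairs plus associated data: a document with metadata.
documents = {
    "speech:1001": {
        "body": "Mr Speaker, I beg to move...",
        "meta": {"member": "A. Member", "date": "2010-11-17"},
    }
}

# 3. Name/value pairs with nesting: values are themselves maps of name/value pairs.
nested = {
    "2010-11-17": {
        "speech:1001": {"member": "A. Member", "text": "Mr Speaker, I beg to move..."},
        "speech:1002": {"member": "B. Member", "text": "Will the Member give way?"},
    }
}

speakers_that_day = sorted(col["member"] for col in nested["2010-11-17"].values())
```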

Cassandra
• Distributed, fault-tolerant database, based on ideas in Dynamo (Amazon) & BigTable (Google)
• Developed by Facebook, open-sourced in 2008
• Now Apache project
• Key/value pairs, in column-oriented format
  • Standard column: name, value, timestamp
  • Super-column: name, map of columns, each with name, value, timestamp (think array of hashes)
  • Grouped by Column family, also either standard or super
  • Column family contains ‘rows’, roughly like a DB table
  • Column families then go in key-spaces
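
The nesting described above (key-space, column family, row, column) can be modelled in Python as below. This is a simplified picture of the data model, not Cassandra's API: the `Column` class, the `insert` helper and the `Members` family are assumptions for the example. The newest timestamp winning is shown in its simplest form.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """A standard column: name, value, timestamp."""
    name: str
    value: str
    timestamp: int


# key-space -> column family -> row key -> column name -> Column
keyspace = {"Members": {}}


def insert(keyspace, family, row_key, column):
    """Store a column in a row, keeping whichever version has the newest timestamp."""
    row = keyspace[family].setdefault(row_key, {})
    current = row.get(column.name)
    if current is None or column.timestamp >= current.timestamp:
        row[column.name] = column


insert(keyspace, "Members", "m42", Column("email", "old@example.org", 1))
insert(keyspace, "Members", "m42", Column("email", "new@example.org", 2))
insert(keyspace, "Members", "m42", Column("email", "stale@example.org", 1))  # arrives late, ignored
insert(keyspace, "Members", "m42", Column("constituency", "Anytown", 1))

row = keyspace["Members"]["m42"]
```

Rows in the same column family need not have the same columns, which is where the comparison with a DB table stops.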

Cassandra - NoACID
• Cassandra, et al, e.g. Voldemort (LinkedIn), trade consistency and atomicity for speed, distribution and availability
• No single point of failure
• “Eventually consistent” model
• Tunable levels of consistency
• Atomicity only guaranteed within a column family
• Accessed using Thrift (also developed by Facebook)
• Used by:
  • Facebook
  • Digg
  • Twitter
  • Reddit
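
“Tunable consistency” usually comes down to simple arithmetic: with N replicas, a write acknowledged by W of them and a read that asks R of them must overlap when R + W > N. The Python sketch below plays out the worst case. It is a toy model with hypothetical names, not Cassandra's client API.

```python
def read_after_write(n, w, r):
    """Worst-case read after one write, with n replicas.

    The write reaches only the first w replicas; the read asks the last r
    replicas (the least favourable choice) and keeps the newest answer.
    Assumes 1 <= w <= n and 1 <= r <= n.
    """
    replicas = [(1, "new")] * w + [(0, "old")] * (n - w)  # (timestamp, value)
    answers = replicas[n - r:]
    return max(answers)[1]  # highest timestamp wins


def overlap_guaranteed(n, w, r):
    """True when any read set must share a replica with any write set."""
    return r + w > n


quorum_result = read_after_write(n=3, w=2, r=2)  # quorum write, quorum read
fast_result = read_after_write(n=3, w=1, r=1)    # fastest settings
```

Lower settings give faster, more available operations that may return stale data until the replicas converge, which is the “eventually consistent” part.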

NoSQL for Parliaments?
• Much parliamentary material is naturally unstructured and suited to the name/value model (think XML)
• Remember the old discussions about how to map such parliamentary material into relational databases?
• Think of every MP’s contribution (speech) in chamber or committee as a key/value pair, i.e. a column
• Think of every PQ & answer as a super-column of name/value pairs for question, answer, holding, supplementary, pursuant, referral …
• Hansard becomes a super-column family!
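
As a sketch of that mapping, here is Hansard as a super-column family in plain Python dictionaries. The choice of sitting date as row key, the PQ identifier and all the text are assumptions for illustration only; a real design would need its own key scheme.

```python
# super-column family: row key -> super-column name -> column name -> (value, timestamp)
hansard = {}


def record(family, row_key, super_column, name, value, timestamp):
    """Add one name/value column to a super-column within a row."""
    family.setdefault(row_key, {}).setdefault(super_column, {})[name] = (value, timestamp)


# One parliamentary question (PQ) and its follow-ups become one super-column.
record(hansard, "2010-11-17", "PQ-0001", "question", "What steps is the Minister taking?", 1)
record(hansard, "2010-11-17", "PQ-0001", "holding", "I will reply as soon as possible.", 2)
record(hansard, "2010-11-17", "PQ-0001", "answer", "The Department has begun a review.", 3)
record(hansard, "2010-11-17", "PQ-0001", "supplementary", "When will the review report?", 4)

pq = hansard["2010-11-17"]["PQ-0001"]
stages = [name for name, _ in sorted(pq.items(), key=lambda item: item[1][1])]
```

Each PQ carries only the stages it actually went through, with no empty columns for the rest.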

Map / Reduce
• Column (or record) oriented design & de-normalized data power the parallel “map reduce” model (think “sharding on speed”)
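
The classic illustration is a word count. The Python below runs the three phases (map, shuffle, reduce) in a single process over two made-up shards; in a real cluster each shard's map step would run on the node that holds the shard. Function names and data are hypothetical.

```python
from collections import defaultdict


def map_phase(shard):
    """Emit (word, 1) for every word in one shard; needs no other shard."""
    return [(word.lower(), 1) for line in shard for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


shards = [
    ["the house will come to order"],
    ["order order", "the question is"],
]

# Each call to map_phase is independent, so the calls could run in parallel.
mapped = [pair for shard in shards for pair in map_phase(shard)]
counts = reduce_phase(shuffle(mapped))
```

Because no map call needs data from another shard, adding nodes adds throughput, which is what de-normalized, record-oriented data makes possible.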

Hadoop
• Nothing to do with NoSQL
• Hadoop is an infrastructure and now family of tools for managing distributed systems and immense datasets
• How immense? Hundreds of GB and 10 node cluster is ‘entry-level’ in Hadoop terms
• Developed by Yahoo for their cloud, now Apache project
• Supports Map/Reduce by pre-dividing & distributing data
• “Moves computation to the data instead of data to the computation”
• HDFS file system particularly interesting – distributed, resilient (far more advanced than DRBD or Gluster), but not real time (more eventually consistent…)
• Hive data warehouse front end – has SQL-like queries

Who uses Hadoop?
• Twitter
• AOL
• IBM
• Last.fm
• LinkedIn
• eBay
• Yahoo
  • 36,000 machines with > 100,000 cores running Hadoop
  • Largest cluster is only 4000 nodes
• Largest known cluster is Facebook!
  • 2000 machines with 22,400 cores
  • 21 petabytes in a single HDFS store

Hadoop for Parliaments?
• Hadoop may seem overkill for parliaments now…
• But, when you start your legacy collection digitization and digital preservation projects, its model, for managing large datasets which essentially do not change & don’t need real-time commit, is a very good fit!
• Other interesting Hadoop projects:
  • Zookeeper (distributed apps co-ordination)
  • Hive (data warehouse infrastructure)
  • Pig (high-level data flow language)
  • Mahout (scalable machine learning library)
  • Scribe (for aggregating streaming log data) [not strictly a Hadoop project, but can be integrated with it, using an interesting work-around for the non-real time & NameNode single point of failure]

Other things ‘out there’
• Drizzle
  • A database “optimized for Cloud infrastructure and Web applications”
  • “Design for massive concurrency on modern multi-cpu architecture”
  • But, doesn’t actually explain how to use it for these…
  • It’s SQL and ACID
  • Mostly seems to be a reaction against what’s happening at MySQL…
  • Has to be compiled from source – no distros available for it yet
• CouchDB
  • Distributed, fault-tolerant, schema-free document-oriented database
  • RESTful JSON API (i.e. Web front end)
  • Incremental replication with bi-directional conflict detection
  • Written in Erlang (highly reliable language developed by Ericsson)
  • Supports ‘map/reduce’ like querying and indexing
  • Interesting model, different from most other offerings
  • Also now an Apache project
  • Still too immature for anything beyond experimentation

Also ‘out there’
• Voldemort
  • Another distributed key/value storage system
  • Used at LinkedIn
  • Doesn’t seem to have much future
  • Cassandra is similar, better & more widely used
• MonetDB
  • “database system for high-performance applications in data mining, OLAP, GIS, XML Query, text and multimedia retrieval”
  • SQL and XQUERY front ends
  • Also hard to see where it’s going…
• MongoDB
  • Tries to bridge the gap between RDBMS and map/reduce
  • JSON document storage (like CouchDB)
  • No JOINs, no multi-document transactions
  • Supports atomic operations only on single documents
  • Interesting, but may ‘fall between two stools’

Predictions
• Hadoop and Cassandra are the ones to watch
• There will likely be some sort of re-convergence between NoSQL and query languages of some kind – can’t do everything with map/reduce (esp. not ad hoc queries)
• SQL may be destined to become like COBOL – still around and running things but not something to use for new projects
• Distributed storage models (with or without map/reduce) have a good future
• Datasets will only get bigger – compliance, audit, digital preservation, the shift to visuals, etc.
• Information management models (“strategy”) and access speed will remain key problems

Questions
“What’s it all about?”
http://ashardie.com