A multi-tool in computing clouds: Tuple Space


Joerg Fritsch,

School of Computer Science & Informatics

Cardiff University, 16 January 2014

• Key themes: parallelization, shared nothing, Challenging Data (aka “Big Data”)

• Tuple Space: the multi-tool

• Use Case (1): Overcoming limitations of tier based architectures

• Use Case (2): In Stream processing of Big Data

• Some miscellaneous remarks

2

Agenda

• Eventually everything is about scalability.

• Scalable software: make use of 1000s of cores

– Distribution

– Decomposition & modularity

– Coordination

• Data does not fit in main memory

– Distribution

– Stream processing

• Need for speed: reduce time complexity

3

The whys and hows

Key Themes: Parallelization

• Clouds will need to support scalable programs.

• Many programs have to parallelize relatively small computations with high inter-dependency.

• “Any” application scaled through distribution over parallel (multicore) hardware.

• Everything “inside a cloud” is physically distributed (data, processing).

• Large scale distributed processing. “Many Core”.

4

Key Themes: Shared Nothing

5

• Synchronization = shared “something”, for example memory, disk, data(base)

• Asynchronous = “shared nothing”

• Avoid synchronization issues

• Abstract multithreading and parallelization issues away from the developer, i.e. actor model

• Highly scalable! For example Erlang

Challenging Data (aka “Big Data”)

6

• Data in computing clouds is challenging

• 3V Data (Gartner, 2001): Volume, Variety, Velocity

• Volume: perceived as “Big”

– Hadoop & traditional RDBMSs often similar in data volume

– Differ in number of nodes (proportional to no. cores)

– Analytics

• Variety: unstructured data, data mashups

– Hadoop does not cast into schemes, rows, cols

• Velocity: streams

Challenging Data (aka “Big Data”)

7

• Batch tasks are the prevailing computational model:

– Map Reduce

– Computation over “offline” data set (on disks)

– Parallelized polynomial time: N^m / k

• Stream processing catching up:

– Operating on real-time data

– N log(N) time

– You only get ‘one shot’

– In memory data structures (e.g. Redis, Memcached)

– Examples: Storm project, AWS Kinesis, Apache S4
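The ‘one shot’ constraint can be made concrete with a minimal sketch (not from the slides): a running mean computed in a single pass over the stream, in O(1) space per statistic, with the stream itself never stored.

```python
# One-pass ('one shot') stream processing sketch: a running mean kept in an
# in-memory structure; the data is seen exactly once and never stored.
def running_mean():
    count, total = 0, 0.0
    def update(x):
        nonlocal count, total
        count += 1
        total += x
        return total / count          # current mean after each tuple
    return update

update = running_mean()
for x in [4.0, 8.0, 6.0]:
    mean = update(x)
print(mean)  # 6.0
```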

• Tuples are key-value pairs

• Tuple Space acting as Distributed Shared Memory (DSM)

• Four primitives to manipulate and store tuples: rd(), in(), out(), eval()

• No schema, ideal for unstructured data

• Tuples matched using associative lookup

• Associative lookup generally very powerful: CAM table/Routing, Data Flow programming & processors

• Commercial use as in-memory Data Grids

8

Tuple Space, Gelernter (1985)
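The four primitives and the associative lookup can be sketched in a few lines. This is a hypothetical, single-process toy (real tuple spaces are distributed and concurrent; eval(), which spawns active tuples, is omitted); None acts as a wildcard in patterns.

```python
# Minimal single-process sketch of a tuple space with out/rd/in primitives.
class TupleSpace:
    def __init__(self):
        self.tuples = []

    def out(self, t):                       # store a tuple
        self.tuples.append(t)

    def _match(self, pattern, t):           # associative matching; None = wildcard
        return len(pattern) == len(t) and all(
            p is None or p == v for p, v in zip(pattern, t))

    def rd(self, pattern):                  # non-destructive associative read
        return next(t for t in self.tuples if self._match(pattern, t))

    def in_(self, pattern):                 # destructive read ('in' is reserved in Python)
        t = self.rd(pattern)
        self.tuples.remove(t)
        return t

ts = TupleSpace()
ts.out(("temp", "room1", 21.5))
ts.out(("temp", "room2", 19.0))
print(ts.rd(("temp", "room2", None)))   # ('temp', 'room2', 19.0)
print(ts.in_(("temp", "room1", None)))  # removes and returns the room1 tuple
```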

• Loose coupling

• Decoupled in

– Time

– Space

– Synchronization

• Distributed shared memory (DSM) vs distributed memory (“like MPI”)

9

Tuple Space, cont’d

• In memory key-value store, can be persistent across system reboot

• No schema

• Exact keys matched in O(1); glob-style pattern matching (KEYS) scans in O(N)

– Good enough implementation of associative lookup

• Binary safe

• Other key-value stores may be equally suitable, with different advantages/disadvantages:

– Distributed Hash Tables (DHT)

– Memcached: distribution

– Dynamo: present as an application service in AWS

10

Redis Key-Value Store as Tuple Space
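The glob-style key matching that makes Redis usable as a tuple space can be sketched without a server. The snippet below stands in a plain dict for the store and uses Python's fnmatch for the glob patterns; the key names are invented for illustration, and real Redis KEYS/SCAN behaves analogously.

```python
# Sketch of Redis-style glob matching over keys, with a dict standing in
# for the store (no Redis server assumed).
from fnmatch import fnmatch

store = {
    "sensor:room1:temp": b"21.5",
    "sensor:room2:temp": b"19.0",
    "sensor:room1:co2":  b"410",
}

def keys(pattern):
    """Return all keys matching a glob pattern, like the Redis KEYS command."""
    return sorted(k for k in store if fnmatch(k, pattern))

print(keys("sensor:room1:*"))  # ['sensor:room1:co2', 'sensor:room1:temp']
print(keys("sensor:*:temp"))   # ['sensor:room1:temp', 'sensor:room2:temp']
```

Structuring key names around a delimiter (here `:`) is what turns a flat key-value store into a “good enough” associative lookup for tuples.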

• Coordination vs Threading

• Composition happens outside of the worker or agent code

– FPLs: composition and currying outside of functions

– Stream processing and composition of kernels

– Unix pipes: application_1 | application_3 | application_2

– Pipes/(Message Queues) represent the dataflow graph

• Error handling?

– What happens to the mutable state if app_3 (or any of the kernels) fails?

11

Coordination Language LINDA, Gelernter (1992)

(Pipeline diagram: app_1 → app_3 → app_2 → std_out)
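The idea of composing the dataflow graph outside the workers can be sketched with plain generators (an illustrative analogy, not from the slides): each stage knows nothing about its neighbours, and the pipe is wired up separately, just as in `application_1 | application_3 | application_2`.

```python
# Composition outside the workers: each stage is an isolated generator;
# the pipeline (the dataflow graph) is assembled separately.
def numbers(n):            # app_1: source
    yield from range(n)

def square(stream):        # app_3: transform
    for x in stream:
        yield x * x

def keep_even(stream):     # app_2: filter
    for x in stream:
        if x % 2 == 0:
            yield x

# The 'pipe' is built here, not inside any stage:
pipeline = keep_even(square(numbers(5)))
print(list(pipeline))  # [0, 4, 16]
```

The open question from the slide applies directly: if `square` dies mid-stream, any state it held is gone, which is exactly why externalizing state into a tuple space is attractive.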

• Not enough expressive power! (for complex coordination)

• Ways to make it more expressive:

– Algebra of Communicating Processes (ACP)

– ACP generally quite suitable for streams, clicks, GUIs, Dataflow programming

– Constraint Handling Rules (CHRs)

– Agent Communicating through Logic Theories (ACLT), Omicini et al (1995), Denti et al (1998)

• For example: barriers (i.e. MPI_Barrier) / eureka conditions; Turing-powerful implementation

12

Coordination Language LINDA, cont’d
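The barrier mentioned above can be expressed directly on tuple-space primitives. The sketch below is hypothetical (a thread-safe toy space with a blocking, destructive in_): each worker out()s a “done” token and the coordinator in()s one token per worker before proceeding, mirroring MPI_Barrier.

```python
# MPI_Barrier-style rendezvous on a toy tuple space (None = wildcard).
import threading

class MiniSpace:
    def __init__(self):
        self.tuples, self.cv = [], threading.Condition()

    def out(self, t):
        with self.cv:
            self.tuples.append(t)
            self.cv.notify_all()          # wake any blocked in_() callers

    def in_(self, pattern):               # blocking, destructive match
        with self.cv:
            while True:
                for t in self.tuples:
                    if len(t) == len(pattern) and all(
                            p is None or p == v for p, v in zip(pattern, t)):
                        self.tuples.remove(t)
                        return t
                self.cv.wait()            # no match yet; wait for the next out()

def barrier(space, n_workers):
    for _ in range(n_workers):
        space.in_(("done", None))         # blocks until every worker checked in

space = MiniSpace()
workers = [threading.Thread(target=space.out, args=(("done", i),))
           for i in range(4)]
for w in workers:
    w.start()
barrier(space, 4)                         # returns only after all four tokens arrive
for w in workers:
    w.join()
print("all workers passed the barrier")
```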

• Database, Data Grid

– No schema

• Key / Value store

• Extension to programming languages

– Without adaptation not Turing-powerful

• Message bus, Message Queue

• Means of coordination

– Workers, Agents, Skeletons

• Memory virtualization

– Extension of main memory across physical boundaries

13

Recap: Tuple Space is like a(n) …

14

Use Case (1 of 2): Overcoming limitations of tier-based architectures

• Concept has been around since 1998

• Costly serialization (of data) required at every system boundary → latency!

• Often depicted with three simple tiers: web server, application server and data(base)

• Many more devices & protocols involved: redundant load balancers, spanning tree, etc.

15

Tier-based architectures

• To date: not many alternatives

• Space based architectures

– Gigaspaces

– Tibco activespace

• Notion of a one-stop shop

– Networks: L2 Ethernet fabrics

– Networks: integrated packet processing

• Nobody wants to hit a spindle!

– In-memory computing

16

Tier-based architectures (alternatives)

17

The end of Tier-based architectures

Source: http://wiki.gigaspaces.com

18

The end of Tier-based architectures (cont’d 1)

Source: http://wiki.gigaspaces.com

19

The end of Tier-based architectures (cont’d 2)

Space-based cloud platform:

– No tiers

– Implicit load balancing

– Harmonization of messaging, data and coordination

Traditional tier-based cloud platform

20

Use Case (2 of 2): In-stream processing of Big Data

21

“More programmer-friendly parallel dataflow languages await discovery, I think. Map Reduce is one (small) step in that direction.”

Jeff Hammerbacher, Engineer-to-Engineer Lectures, June 2010

• Stream

– An unbounded sequence of tuples

• Map Reduce excels at ad-hoc queries, but is no fit for recursion, and hence for machine learning (ML)

• Error resilient: Stateful stream processing

– Redis supports transactions

– Tuple space can contain global mutable state

• Tuple vs Batch / Fine grain vs coarse grain

22

Stream processing of 3V Data
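The point about global mutable state can be illustrated with a minimal sketch (names invented; a dict stands in for the tuple space or Redis): workers keep their running counts in the shared store rather than in worker-local memory, so a failed worker can be replaced and processing resumes from the surviving state.

```python
# Stateful stream processing sketch: mutable state lives in a shared store,
# outside the workers, so it survives a worker failure.
state = {}                     # stands in for a tuple space / Redis

def process(tuple_stream):
    """A worker: consumes tuples and updates the shared state, one per tuple."""
    for word in tuple_stream:
        state[word] = state.get(word, 0) + 1

process(iter(["cloud", "tuple", "cloud"]))
# Simulate the worker dying and a fresh one resuming from the shared state:
process(iter(["tuple"]))
print(state)  # {'cloud': 2, 'tuple': 2}
```

In a real deployment the `state[...] += 1` step would be an atomic store operation (e.g. a Redis transaction), which is exactly what makes the pipeline error-resilient.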

• Redis has a built-in Lua interpreter to manipulate data

• Commercial tuple spaces are mostly “reactive”

• Context-based recursion on the portion of data that is in memory (aka “granularity”)

23

(Reactive) in Memory Tuple Space
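What “reactive” means here can be sketched with a hypothetical API (invented for illustration): consumers register a pattern and a callback, and writing a matching tuple fires the callback, instead of consumers polling the space.

```python
# 'Reactive' tuple space sketch: out() pushes matching tuples to registered
# watchers (hypothetical API; None = wildcard in patterns).
class ReactiveSpace:
    def __init__(self):
        self.watchers = []

    def on(self, pattern, callback):      # subscribe to a tuple pattern
        self.watchers.append((pattern, callback))

    def out(self, t):                     # writing a tuple triggers reactions
        for pattern, cb in self.watchers:
            if len(t) == len(pattern) and all(
                    p is None or p == v for p, v in zip(pattern, t)):
                cb(t)

seen = []
space = ReactiveSpace()
space.on(("click", None), seen.append)    # react to every click tuple
space.out(("click", "ad-17"))
space.out(("view", "page-3"))             # no watcher matches; ignored
print(seen)  # [('click', 'ad-17')]
```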

24

Tuple space architecture for in-stream processing of Big Data

25

Commonalities

FPLs & Flow-based Programming (Johnston, 2004):

– Immutable data; shared nothing

– Freedom of side effects

– Locality of effects

– Lazy evaluation

– Data dependency equivalent to scheduling

FPLs & Tuple Space (Fritsch & Walker, 2013):

– Coordination

– Distribution

– Decoupling

– Inter-process communication (IPC)

26

Commonalities cont’d

Flow-based Programming & Tuple Space:

– Both need “a space”

– IP space in flow-based programs

– Tuple Space in LINDA

Altogether:

– (Data) queues

– Coordination does not need to reckon with side effects

– Coordination & composition

– Representation of a dataflow graph in place of a (thread) call graph

• News/RSS streams

• Clicks

– Online advertisement analytics (e.g. spider.io)

– URLs (e.g. bit.ly)

– GUI programming

• Logistics & Transportation

• Media

– GPUs (streams + kernels)

• Mashups: create new wisdom from multiple data sources (incompatible in velocity, volume, variety/structure)

– Separate errors

• Debit card transactions

– Data masking

– Fraud detection/feedback context mashups

27

Real World Applications

• The ultimate mashup: batch data (aka “map reduce”) and speed data (aka “streams”)

– Lambda architecture

– Complementary to each other (e.g. Apache Spark, Lambda Architecture)

• Currently three paradigms: RDBMSs, Map Reduce, Streams.

– Distributed query processing is a key element

28

Points to ponder
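The Lambda-architecture mashup above reduces to a simple merge at query time, sketched here with invented key names and dicts standing in for the two views: a batch view rebuilt periodically by MapReduce, and a speed view maintained in-stream for the data since the last batch run.

```python
# Lambda-architecture sketch: a query merges a precomputed batch view with
# a speed (streaming) view covering only the data since the last batch run.
batch_view = {"clicks:ad-17": 1000}   # rebuilt periodically by MapReduce
speed_view = {"clicks:ad-17": 42}     # incremented tuple-by-tuple in-stream

def query(key):
    """Serve a result covering both historical and just-arrived data."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("clicks:ad-17"))  # 1042
```

When the next batch run completes, the speed view for the covered interval is discarded, which is what keeps streaming errors from accumulating forever.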

• Tuple Space is a piece of software as well

• Scalability of tuple space

– Distribution vs fast in memory computation

• Complex coordination is a must!

– So is error handling (stream replay?)

• Number of supporting elements needed

– (auto) scaler

– cloud-like deployment: DevOps recipes

– ZooKeeper?

29

Issues

30

Thank you

Denti, Enrico, Antonio Natali, and Andrea Omicini. "On the expressive power of a language for programming coordination media." Proceedings of the 1998 ACM symposium on Applied Computing. ACM, 1998.

Fritsch, Joerg, and Coral Walker. "CMQ: A lightweight, asynchronous high-performance messaging queue for the cloud." Journal of Cloud Computing 1.1 (2012): 1-13.

Fritsch, Joerg, and Coral Walker. "Cwmwl, a LINDA-based PaaS fabric for the cloud." Journal of Communications, Special Issue on Cloud and Big Data (2013, to be published).

Gelernter, David. "Generative communication in Linda." ACM Transactions on Programming Languages and Systems (TOPLAS) 7.1 (1985): 80-112.

Gelernter, David, and Nicholas Carriero. "Coordination languages and their significance." Communications of the ACM 35.2 (1992): 96.

Johnston, Wesley M., J. R. Hanna, and Richard J. Millar. "Advances in dataflow programming languages." ACM Computing Surveys (CSUR) 36.1 (2004): 1-34.

Omicini, Andrea, Enrico Denti, and Antonio Natali. "Agent coordination and control through logic theories." Topics in Artificial Intelligence. Springer Berlin Heidelberg, 1995. 439-450.

31

References
