A multi-tool in computing clouds: Tuple Space


Joerg Fritsch,

School of Computer Science & Informatics

Cardiff University, 16 January 2014

• Key themes: parallelization, shared nothing, Challenging Data (aka “Big Data”)

• Tuple Space: the multi-tool

• Use Case (1): Overcoming limitations of tier based architectures

• Use Case (2): In Stream processing of Big Data

• Some miscellaneous remarks

2

Agenda

• Eventually everything is about scalability.

• Scalable software: make use of 1000s of cores

– Distribution

– Decomposition & modularity

– Coordination

• Data does not fit in main memory

– Distribution

– Stream processing

• Need for speed: reduce time complexity

3

The whys and hows

Key Themes: Parallelization

• Clouds will need to support scalable programs.

• Many programs have to parallelize relatively small computations with high inter-dependency.

• “Any” application scaled through distribution over parallel (multicore) hardware.

• Everything “inside a cloud” is physically distributed (data, processing).

• Large scale distributed processing. “Many Core”.

4

Key Themes: Shared Nothing

5

• Synchronization = shared “something”, for example memory, disk, data(base)

• Asynchronous = “shared nothing”

• Avoid synchronization issues

• Abstract multithreading and parallelization issues away from the developer, i.e. actor model

• Highly scalable! For example Erlang

Challenging Data (aka “Big Data”)

6

• Data in computing clouds is challenging

• 3V Data (Gartner, 2001): Volume, Variety, Velocity

• Volume: perceived as “Big”

– Hadoop & traditional RDBMSs often similar in data volume

– Differ in number of nodes (proportional to no. cores)

– Analytics

• Variety: unstructured data, data mashups

– Hadoop does not cast into schemes, rows, cols

• Velocity: streams

Challenging Data (aka “Big Data”)

7

• Batch tasks are the prevailing computational model:

– Map Reduce

– Computation over “offline” data set (on disks)

– Parallelized polynomial time: N^m / k

• Stream processing catching up:

– Operating on real-time data

– N log(N) time

– You only get ‘one shot’

– In memory data structures (e.g. Redis, Memcached)

– Examples: Storm project, AWS Kinesis, Apache S4
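The ‘one shot’ constraint can be made concrete with a minimal sketch (not from the slides): a running mean computed in a single pass over the stream, in O(1) space per statistic, with the stream itself never stored.

```python
# One-pass ('one shot') stream processing sketch: a running mean kept in an
# in-memory structure; the data is seen exactly once and never stored.
def running_mean():
    count, total = 0, 0.0
    def update(x):
        nonlocal count, total
        count += 1
        total += x
        return total / count          # current mean after each tuple
    return update

update = running_mean()
for x in [4.0, 8.0, 6.0]:
    mean = update(x)
print(mean)  # 6.0
```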

• Tuples are key-value pairs

• Tuple Space acting as Distributed Shared Memory (DSM)

• Four primitives to manipulate and store tuples: rd(), in(), out(), eval()

• No schema, ideal for unstructured data

• Tuples matched using associative lookup

• Associative lookup generally very powerful: CAM table/Routing, Data Flow programming & processors

• Commercial use as in-memory Data Grids

8

Tuple Space, Gelernter (1985)
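The four primitives and the associative lookup can be sketched in a few lines. This is a hypothetical, single-process toy (real tuple spaces are distributed and concurrent; eval(), which spawns active tuples, is omitted); None acts as a wildcard in patterns.

```python
# Minimal single-process sketch of a tuple space with out/rd/in primitives.
class TupleSpace:
    def __init__(self):
        self.tuples = []

    def out(self, t):                       # store a tuple
        self.tuples.append(t)

    def _match(self, pattern, t):           # associative matching; None = wildcard
        return len(pattern) == len(t) and all(
            p is None or p == v for p, v in zip(pattern, t))

    def rd(self, pattern):                  # non-destructive associative read
        return next(t for t in self.tuples if self._match(pattern, t))

    def in_(self, pattern):                 # destructive read ('in' is reserved in Python)
        t = self.rd(pattern)
        self.tuples.remove(t)
        return t

ts = TupleSpace()
ts.out(("temp", "room1", 21.5))
ts.out(("temp", "room2", 19.0))
print(ts.rd(("temp", "room2", None)))   # ('temp', 'room2', 19.0)
print(ts.in_(("temp", "room1", None)))  # removes and returns the room1 tuple
```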

• Loose coupling

• Decoupled in

– Time

– Space

– Synchronization

• Distributed shared memory (DSM) vs distributed memory (“like MPI”)

9

Tuple Space, cont’d

• In memory key-value store, can be persistent across system reboot

• No schema

• Exact keys matched in O(1); glob-style pattern matching (KEYS) scans in O(N)

– Good enough implementation of associative lookup

• Binary safe

• Other key-value stores may be equally suitable, with different advantages/disadvantages:

– Distributed Hash Tables (DHT)

– Memcached: distribution

– Dynamo: present as an application service in AWS

10

Redis Key-Value Store as Tuple Space
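The glob-style key matching that makes Redis usable as a tuple space can be sketched without a server. The snippet below stands in a plain dict for the store and uses Python's fnmatch for the glob patterns; the key names are invented for illustration, and real Redis KEYS/SCAN behaves analogously.

```python
# Sketch of Redis-style glob matching over keys, with a dict standing in
# for the store (no Redis server assumed).
from fnmatch import fnmatch

store = {
    "sensor:room1:temp": b"21.5",
    "sensor:room2:temp": b"19.0",
    "sensor:room1:co2":  b"410",
}

def keys(pattern):
    """Return all keys matching a glob pattern, like the Redis KEYS command."""
    return sorted(k for k in store if fnmatch(k, pattern))

print(keys("sensor:room1:*"))  # ['sensor:room1:co2', 'sensor:room1:temp']
print(keys("sensor:*:temp"))   # ['sensor:room1:temp', 'sensor:room2:temp']
```

Structuring key names around a delimiter (here `:`) is what turns a flat key-value store into a “good enough” associative lookup for tuples.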

• Coordination vs Threading

• Composition happens outside of the worker or agent code

– FPLs: composition and currying outside of functions

– Stream processing and composition of kernels

– Unix pipes: application_1 | application_3 | application_2

– Pipes/(Message Queues) represent the dataflow graph

• Error handling?

– What happens to the mutable state if app_3 (or any of the kernels) fails?

11

Coordination Language LINDA, Gelernter (1992)

(Pipeline diagram: app_1 → app_3 → app_2 → std_out)
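The idea of composing the dataflow graph outside the workers can be sketched with plain generators (an illustrative analogy, not from the slides): each stage knows nothing about its neighbours, and the pipe is wired up separately, just as in `application_1 | application_3 | application_2`.

```python
# Composition outside the workers: each stage is an isolated generator;
# the pipeline (the dataflow graph) is assembled separately.
def numbers(n):            # app_1: source
    yield from range(n)

def square(stream):        # app_3: transform
    for x in stream:
        yield x * x

def keep_even(stream):     # app_2: filter
    for x in stream:
        if x % 2 == 0:
            yield x

# The 'pipe' is built here, not inside any stage:
pipeline = keep_even(square(numbers(5)))
print(list(pipeline))  # [0, 4, 16]
```

The open question from the slide applies directly: if `square` dies mid-stream, any state it held is gone, which is exactly why externalizing state into a tuple space is attractive.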

• Not enough expressive power! (for complex coordination)

• Ways to make it more expressive:

– Algebra of Communicating Processes (ACP)

– ACP generally quite suitable for streams, clicks, GUIs, Dataflow programming

– Constraint Handling Rules (CHRs)

– Agent Communicating through Logic Theories (ACLT), Omicini et al (1995), Denti et al (1998)

• For example: barriers (i.e. MPI_Barrier) / eureka conditions; Turing-powerful implementation

12

Coordination Language LINDA, cont’d
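The barrier mentioned above can be expressed directly on tuple-space primitives. The sketch below is hypothetical (a thread-safe toy space with a blocking, destructive in_): each worker out()s a “done” token and the coordinator in()s one token per worker before proceeding, mirroring MPI_Barrier.

```python
# MPI_Barrier-style rendezvous on a toy tuple space (None = wildcard).
import threading

class MiniSpace:
    def __init__(self):
        self.tuples, self.cv = [], threading.Condition()

    def out(self, t):
        with self.cv:
            self.tuples.append(t)
            self.cv.notify_all()          # wake any blocked in_() callers

    def in_(self, pattern):               # blocking, destructive match
        with self.cv:
            while True:
                for t in self.tuples:
                    if len(t) == len(pattern) and all(
                            p is None or p == v for p, v in zip(pattern, t)):
                        self.tuples.remove(t)
                        return t
                self.cv.wait()            # no match yet; wait for the next out()

def barrier(space, n_workers):
    for _ in range(n_workers):
        space.in_(("done", None))         # blocks until every worker checked in

space = MiniSpace()
workers = [threading.Thread(target=space.out, args=(("done", i),))
           for i in range(4)]
for w in workers:
    w.start()
barrier(space, 4)                         # returns only after all four tokens arrive
for w in workers:
    w.join()
print("all workers passed the barrier")
```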

• Database, Data Grid

– No schema

• Key / Value store

• Extension to programming languages

– Without adaptation not Turing-powerful

• Message bus, Message Queue

• Means of coordination

– Workers, Agents, Skeletons

• Memory virtualization

– Extension of main memory across physical boundaries

13

Recap: Tuple Space is like a(n) …

14

Use Case (1 of 2): Overcoming limitations of tier-based architectures

• Concept has been around since 1998

• Costly serialization (of data) required at every system boundary → latency!

• Often depicted with three simple tiers: web server, application server and data(base)

• Many more devices & protocols involved: redundant load balancers, spanning tree, etc.

15

Tier-based architectures

• To date: not many alternatives

• Space based architectures

– Gigaspaces

– Tibco activespace

• Notion of a one-stop shop

– Networks: L2 Ethernet fabrics

– Networks: integrated packet processing

• Nobody wants to hit a spindle!

– In-memory computing

16

Tier-based architectures (alternatives)

17

The end of Tier-based architectures

Source: http://wiki.gigaspaces.com

18

The end of Tier-based architectures (cont’d 1)

Source: http://wiki.gigaspaces.com

19

The end of Tier-based architectures (cont’d 2)

Space-based cloud platform:

– No tiers

– Implicit load balancing

– Harmonization of messaging, data and coordination

Traditional tier-based cloud platform

20

Use Case (2 of 2): In-stream processing of Big Data

21

“More programmer-friendly parallel dataflow languages await discovery, I think. Map Reduce is one (small) step in that direction.”

Jeff Hammerbacher, Engineer-to-Engineer Lectures, June 2010

• Stream

– An unbounded sequence of tuples

• Map Reduce excels at ad-hoc queries, but is no fit for recursion, and hence for machine learning (ML)

• Error resilient: Stateful stream processing

– Redis supports transactions

– Tuple space can contain global mutable state

• Tuple vs Batch / Fine grain vs coarse grain

22

Stream processing of 3V Data
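The point about global mutable state can be illustrated with a minimal sketch (names invented; a dict stands in for the tuple space or Redis): workers keep their running counts in the shared store rather than in worker-local memory, so a failed worker can be replaced and processing resumes from the surviving state.

```python
# Stateful stream processing sketch: mutable state lives in a shared store,
# outside the workers, so it survives a worker failure.
state = {}                     # stands in for a tuple space / Redis

def process(tuple_stream):
    """A worker: consumes tuples and updates the shared state, one per tuple."""
    for word in tuple_stream:
        state[word] = state.get(word, 0) + 1

process(iter(["cloud", "tuple", "cloud"]))
# Simulate the worker dying and a fresh one resuming from the shared state:
process(iter(["tuple"]))
print(state)  # {'cloud': 2, 'tuple': 2}
```

In a real deployment the `state[...] += 1` step would be an atomic store operation (e.g. a Redis transaction), which is exactly what makes the pipeline error-resilient.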

• Redis has a built-in Lua interpreter to manipulate data

• Commercial tuple spaces are mostly “reactive”

• Context-based recursion on the portion of data that is in memory (aka “granularity”)

23

(Reactive) in Memory Tuple Space
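What “reactive” means here can be sketched with a hypothetical API (invented for illustration): consumers register a pattern and a callback, and writing a matching tuple fires the callback, instead of consumers polling the space.

```python
# 'Reactive' tuple space sketch: out() pushes matching tuples to registered
# watchers (hypothetical API; None = wildcard in patterns).
class ReactiveSpace:
    def __init__(self):
        self.watchers = []

    def on(self, pattern, callback):      # subscribe to a tuple pattern
        self.watchers.append((pattern, callback))

    def out(self, t):                     # writing a tuple triggers reactions
        for pattern, cb in self.watchers:
            if len(t) == len(pattern) and all(
                    p is None or p == v for p, v in zip(pattern, t)):
                cb(t)

seen = []
space = ReactiveSpace()
space.on(("click", None), seen.append)    # react to every click tuple
space.out(("click", "ad-17"))
space.out(("view", "page-3"))             # no watcher matches; ignored
print(seen)  # [('click', 'ad-17')]
```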

24

Tuple space architecture for in-stream processing of Big Data

25

Commonalities

FPLs & Flow-based Programming (Johnston, 2004):

– Immutable data; shared nothing

– Freedom of side effects

– Locality of effects

– Lazy evaluation

– Data dependency equivalent to scheduling

FPLs & Tuple Space (Fritsch & Walker, 2013):

– Coordination

– Distribution

– Decoupling

– Inter-process communication (IPC)

26

Commonalities cont’d

Flow-based Programming & Tuple Space:

– Both need “a space”

– IP space in flow-based programs

– Tuple Space in LINDA

Altogether:

– (Data) queues

– Coordination does not need to reckon with side effects

– Coordination & composition

– Representation of a dataflow graph in place of a (thread) call graph

• News/RSS streams

• Clicks

– Online advertisement analytics (e.g. spider.io)

– URLs (e.g. bit.ly)

– GUI programming

• Logistics & Transportation

• Media

– GPUs (streams + kernels)

• Mashups: create new wisdom from multiple data sources (incompatible in velocity, volume, variety/structure)

– Separate errors

• Debit card transactions

– Data masking

– Fraud detection/feedback context mashups

27

Real World Applications

• The ultimate mashup: batch data (aka “map reduce”) and speed data (aka “streams”)

– Lambda architecture

– Complementary to each other (e.g. Apache Spark, Lambda Architecture)

• Currently three paradigms: RDBMSs, Map Reduce, Streams.

– Distributed query processing is a key element

28

Points to ponder
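The Lambda-architecture mashup above reduces to a simple merge at query time, sketched here with invented key names and dicts standing in for the two views: a batch view rebuilt periodically by MapReduce, and a speed view maintained in-stream for the data since the last batch run.

```python
# Lambda-architecture sketch: a query merges a precomputed batch view with
# a speed (streaming) view covering only the data since the last batch run.
batch_view = {"clicks:ad-17": 1000}   # rebuilt periodically by MapReduce
speed_view = {"clicks:ad-17": 42}     # incremented tuple-by-tuple in-stream

def query(key):
    """Serve a result covering both historical and just-arrived data."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("clicks:ad-17"))  # 1042
```

When the next batch run completes, the speed view for the covered interval is discarded, which is what keeps streaming errors from accumulating forever.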

• Tuple Space is a piece of software as well

• Scalability of tuple space

– Distribution vs fast in memory computation

• Complex coordination is a must!

– So is error handling (stream replay?)

• Number of supporting elements needed

– (auto) scaler

– cloud-like deployment: DevOps recipes

– ZooKeeper?

29

Issues

30

Thank you

Denti, Enrico, Antonio Natali, and Andrea Omicini. "On the expressive power of a language for programming coordination media." Proceedings of the 1998 ACM symposium on Applied Computing. ACM, 1998.

Fritsch, Joerg, and Coral Walker. "CMQ: A lightweight, asynchronous high-performance messaging queue for the cloud." Journal of Cloud Computing 1.1 (2012): 1-13.

Fritsch, Joerg, and Coral Walker. "Cwmwl, a LINDA-based PaaS fabric for the cloud." Journal of Communications, Special Issue on Cloud and Big Data (2013, to be published).

Gelernter, David. "Generative communication in Linda." ACM Transactions on Programming Languages and Systems (TOPLAS) 7.1 (1985): 80-112.

Gelernter, David, and Nicholas Carriero. "Coordination languages and their significance." Communications of the ACM 35.2 (1992): 96.

Johnston, Wesley M., J. R. Hanna, and Richard J. Millar. "Advances in dataflow programming languages." ACM Computing Surveys (CSUR) 36.1 (2004): 1-34.

Omicini, Andrea, Enrico Denti, and Antonio Natali. "Agent coordination and control through logic theories." Topics in Artificial Intelligence. Springer Berlin Heidelberg, 1995. 439-450.

31

References
