
Page 1: A multi-tool in computing clouds: Tuple Space

Joerg Fritsch,

School of Computer Science & Informatics

Cardiff University, 16 January 2014

A multi-tool in computing clouds: Tuple Space

Page 2: A multi-tool in computing clouds: Tuple Space

• Key themes: parallelization, shared nothing, Challenging Data (aka “Big Data”)

• Tuple Space: the multi-tool

• Use Case (1): Overcoming limitations of tier-based architectures

• Use Case (2): In-stream processing of Big Data

• Some miscellaneous remarks

2

Agenda

Page 3: A multi-tool in computing clouds: Tuple Space

• Ultimately, everything is about scalability.

• Scalable software: make use of 1000s of cores

– Distribution

– Decomposition & modularity

– Coordination

• Data does not fit in main memory

– Distribution

– Stream processing

• Need for speed: reduce time complexity

3

The whys and hows

Page 4: A multi-tool in computing clouds: Tuple Space

Key Themes: Parallelization

• Clouds will need to support scalable programs.

• Many programs have to parallelize relatively small computations with high interdependency.

• “Any” application is scaled through distribution over parallel (multicore) hardware.

• Everything “inside a cloud” is physically distributed (data, processing).

• Large scale distributed processing. “Many Core”.

4

Page 5: A multi-tool in computing clouds: Tuple Space

Key Themes: Shared Nothing

5

• Synchronization = shared “something”, for example memory, disk, data(base)

• Asynchronous = “shared nothing”

• Avoid synchronization issues

• Abstract multithreading and parallelization issues away from the developer, e.g. the actor model

• Highly scalable! For example, Erlang

Page 6: A multi-tool in computing clouds: Tuple Space

Challenging Data (aka “Big Data”)

6

• Data in computing clouds is challenging

• 3V Data (Gartner, 2001): Volume, Variety, Velocity

• Volume: perceived as “Big”

– Hadoop & traditional RDBMSs often similar in data volume

– Differ in number of nodes (proportional to the no. of cores)

– Analytics

• Variety: unstructured data, data mashups

– Hadoop does not cast data into schemas, rows, columns

• Velocity: streams

Page 7: A multi-tool in computing clouds: Tuple Space

Challenging Data (aka “Big Data”)

7

• Batch tasks are the prevailing computational model:

– Map Reduce

– Computation over “offline” data set (on disks)

– Parallelized polynomial time: N^m / k

• Stream processing catching up:

– Operating on real-time data

– N * log (N) time

– You only get ‘one shot’

– In-memory data structures (e.g. Redis, Memcached)

– Examples: Storm, AWS Kinesis, Apache S4

Page 8: A multi-tool in computing clouds: Tuple Space

• Tuples are key-value pairs

• Tuple Space acting as Distributed Shared Memory (DSM)

• Four primitives to manipulate and store tuples: rd(), in(), out(), eval() (see the sketch at the end of this slide)

• No schema, ideal for unstructured data

• Tuples matched using associative lookup

• Associative lookup is generally very powerful: CAM tables/routing, dataflow programming & processors

• Commercial use as in-memory Data Grids

8

Tuple Space, Gelernter (1985)
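
The four primitives map naturally onto a small shared data structure. Below is a minimal, single-process sketch in Python (the class and helper names are invented for illustration, not taken from any published implementation); a real tuple space is distributed, but the blocking rd()/in() semantics and wildcard matching are the same idea.

import threading

class TupleSpace:
    """Minimal in-process tuple space; illustration only, not distributed."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):
        """Write a tuple into the space."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _match(self, template):
        # None acts as a wildcard field: associative lookup by content.
        for tup in self._tuples:
            if len(tup) == len(template) and all(
                t is None or t == f for t, f in zip(template, tup)
            ):
                return tup
        return None

    def rd(self, template):
        """Read a matching tuple without removing it; block until one exists."""
        with self._cond:
            while (tup := self._match(template)) is None:
                self._cond.wait()
            return tup

    def in_(self, template):
        """Read and remove a matching tuple ('in' is a Python keyword)."""
        with self._cond:
            while (tup := self._match(template)) is None:
                self._cond.wait()
            self._tuples.remove(tup)
            return tup

    def eval(self, fn, *args):
        """Run fn in a new thread and write its result tuple into the space."""
        threading.Thread(target=lambda: self.out(fn(*args)), daemon=True).start()

ts = TupleSpace()
ts.eval(lambda: ("sum", 2 + 3))   # active tuple, computed concurrently
print(ts.in_(("sum", None)))      # blocks until the worker writes ("sum", 5)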

Page 9: A multi-tool in computing clouds: Tuple Space

• Loose coupling

• Decoupled in

– Time

– Space

– Synchronization

• Distributed shared memory (DSM) vs distributed memory (“like MPI”)

9

Tuple Space, cont’d

Page 10: A multi-tool in computing clouds: Tuple Space

• In-memory key-value store, can be persistent across system reboots (see the sketch at the end of this slide)

• No schema

• Keys matched with glob-style patterns in O(1) time

– Good-enough implementation of associative lookup

• Binary safe

• Other key-value stores may be equally suitable and have different advantages/disadvantages:

– Distributed Hash Tables (DHTs)

– Memcached: distribution

– Dynamo: presence as an application service in AWS

10

Redis Key-Value Store as Tuple Space
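
As an illustration only (not the Cwmwl implementation), the Linda primitives can be approximated with ordinary redis-py calls; the "tuple:<name>:<id>" key layout, the JSON encoding and the local Redis connection are assumptions made for this sketch.

import json
import redis

# Assumes a local Redis instance; the "tuple:<name>:<id>" key layout is invented.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def out(name, value):
    """Store a tuple as a JSON value under a content-derived key."""
    r.set(f"tuple:{name}:{value['id']}", json.dumps(value))

def rd(pattern):
    """Non-destructive read: glob-style key match, then fetch the value."""
    keys = r.keys(f"tuple:{pattern}")
    return json.loads(r.get(keys[0])) if keys else None

def in_(pattern):
    """Destructive read: fetch and delete the first matching tuple."""
    keys = r.keys(f"tuple:{pattern}")
    if not keys:
        return None
    value = json.loads(r.get(keys[0]))
    r.delete(keys[0])
    return value

out("click", {"id": 42, "url": "http://example.org"})
print(rd("click:*"))    # {'id': 42, 'url': 'http://example.org'}
print(in_("click:*"))   # same tuple, now removed from the space

Note that this in_() is not atomic (GET followed by DEL in two round trips); the Lua sketch on slide 23 closes that gap.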

Page 11: A multi-tool in computing clouds: Tuple Space

• Coordination vs Threading

• Composition happens outside of the worker or agent code:

– FPLs: composition and currying outside of functions

– Stream processing and composition of kernels

– Unix pipes: application_1 | application_3 | application_2

– Pipes/message queues represent the dataflow graph (see the sketch at the end of this slide)

• Error handling?

– What happens to the mutable state if app_3 (or any of the kernels) fails?

11

Coordination Language LINDA, Gelernter (1992)

app_1 → app_3 → app_2 → std_out
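
A toy Python sketch of the same idea: the three stages know nothing about each other, and the pipeline list (standing in for pipes or message queues) is the dataflow graph. The stage names are invented for the example.

from functools import reduce

# Each stage is an ordinary function with no knowledge of its neighbours.
def parse(line):
    return line.split(",")

def enrich(fields):
    return fields + ["EU"]

def render(fields):
    return "|".join(fields)

# Composition lives outside the workers: this list *is* the dataflow graph,
# analogous to application_1 | application_3 | application_2 in a shell.
pipeline = [parse, enrich, render]

def run(value, stages=pipeline):
    return reduce(lambda acc, stage: stage(acc), stages, value)

print(run("42,click"))   # 42|click|EU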

Page 12: A multi-tool in computing clouds: Tuple Space

• Not enough expressive power! (for complex coordination)

• Ways to make it more expressive:

– Algebra of Communicating Processes (ACP)

– ACP generally quite suitable for streams, clicks, GUIs, Dataflow programming

– Constraint Handling Rules (CHRs)

– Agents Communicating through Logic Theories (ACLT), Omicini et al. (1995), Denti et al. (1998)

• For example: barriers (i.e. MPI_Barrier), eureka conditions, Turing-powerful implementations (see the barrier sketch at the end of this slide)

12

Coordination Language LINDA, cont’d
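
For instance, a counting barrier can be written with nothing but the in()/out()/rd() primitives, building on the TupleSpace sketch from slide 8. This is a hedged illustration, not the ACLT or CHR formulation.

def barrier(ts, name, n_workers):
    """All callers block until n_workers have arrived; built only from in_/out/rd."""
    count = ts.in_((name, None))[1]   # take the shared counter tuple
    ts.out((name, count + 1))         # put it back, incremented
    ts.rd((name, n_workers))          # block until the last worker has arrived

# The coordinator seeds the counter once:
#   ts.out(("barrier", 0))
# and every worker calls barrier(ts, "barrier", 4) at its synchronization point.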

Page 13: A multi-tool in computing clouds: Tuple Space

• Database, data grid

– No schema

• Key / Value store

• Extension to programming languages

– Without adaptation, not Turing-powerful

• Message bus, Message Queue

• Means of coordination

– Workers, agents, skeletons

• Memory virtualization

– Extension of main memory across physical boundaries

13

Recap: Tuple Space is like a(n) …

Page 14: A multi-tool in computing clouds: Tuple Space

14

Use Case (1 of 2): Overcoming limitations of tier-based architectures

Page 15: A multi-tool in computing clouds: Tuple Space

• Concept has been around since 1998

• Costly serialization (of data) required at every system boundary → latency!

• Often depicted with three simple tiers: web server, application server and data(base)

• Many more devices & protocols involved: redundant load balancers, spanning tree, etc.

15

Tier-based architectures

Page 16: A multi-tool in computing clouds: Tuple Space

• To date: not many alternatives

• Space-based architectures

– GigaSpaces

– TIBCO ActiveSpaces

• Notion of a one-stop shop

– Networks: L2 Ethernet fabrics

– Networks: integrated packet processing

• Nobody wants to hit a spindle!

– In-memory computing

16

Tier-based architectures (alternatives)

Page 17: A multi-tool in computing clouds: Tuple Space

17

The end of Tier-based architectures

Source: http://wiki.gigaspaces.com

Page 18: A multi-tool in computing clouds: Tuple Space

18

The end of Tier-based architectures (cont’d 1)

Source: http://wiki.gigaspaces.com

Page 19: A multi-tool in computing clouds: Tuple Space

19

The end of Tier-based architectures (cont’d 2)

Space-based cloud platform:

– No tiers

– Implicit load balancing

– Harmonization of messaging, data and coordination

Traditional tier-based cloud platform

Page 20: A multi-tool in computing clouds: Tuple Space

20

Use Case (2 of 2): In-stream processing of Big Data

Page 21: A multi-tool in computing clouds: Tuple Space

21

“More programmer-friendly parallel dataflow languages await discovery, I think. Map Reduce is one (small) step in that direction.”

Jeff Hammerbacher, Engineer-to-Engineer Lectures, June 2010

Page 22: A multi-tool in computing clouds: Tuple Space

• Stream

– An unbounded sequence of tuples

• Map Reduce excels at ad-hoc queries, but is no fit for recursion, hence no fit for machine learning (ML)

• Error-resilient: stateful stream processing

– Redis supports transactions

– Tuple space can contain global mutable state (see the sketch at the end of this slide)

• Tuple vs Batch / Fine grain vs coarse grain

22

Stream processing of 3V Data
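
A minimal sketch of such a stateful, one-pass consumer, assuming a Redis list "clicks" as the unbounded tuple stream and a hash "click_counts" as the global mutable state (both names are invented for the example; this is not a specific Storm or Cwmwl API).

import json
import redis

r = redis.Redis(decode_responses=True)

def consume(stream="clicks", state="click_counts"):
    """One-pass ('one shot') consumer; its state survives a worker restart in Redis."""
    while True:
        item = r.blpop(stream, timeout=1)   # block on the unbounded tuple stream
        if item is None:
            continue                        # no tuple yet, keep waiting
        _, payload = item
        tup = json.loads(payload)
        r.hincrby(state, tup["url"], 1)     # global mutable state held in the space

# A producer would push tuples with, e.g.:
#   r.rpush("clicks", json.dumps({"url": "/home"}))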

Page 23: A multi-tool in computing clouds: Tuple Space

• Redis has a built-in Lua interpreter to manipulate data (see the sketch at the end of this slide)

• Commercial tuple spaces are mostly “reactive”

• Context-based recursion on the portion of data that is in memory (aka “granularity”)

23

(Reactive) in Memory Tuple Space
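
A sketch of what that buys us: the destructive read (in()) from slide 10 becomes atomic when GET and DEL are pushed into a server-side Lua script via EVAL (the script and key names are illustrative assumptions).

import redis

r = redis.Redis(decode_responses=True)

# GET and DEL run as one server-side step, so two workers can never take the same tuple.
IN_SCRIPT = """
local value = redis.call('GET', KEYS[1])
if value then redis.call('DEL', KEYS[1]) end
return value
"""

def in_(key):
    return r.eval(IN_SCRIPT, 1, key)

r.set("tuple:job:7", '{"id": 7, "op": "resize"}')
print(in_("tuple:job:7"))   # the tuple, removed atomically
print(in_("tuple:job:7"))   # None: it cannot be taken twice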

Page 24: A multi-tool in computing clouds: Tuple Space

24

Tuple space architecture for in-stream processing of Big Data

Page 25: A multi-tool in computing clouds: Tuple Space

25

Commonalities

FPLs & flow-based programming (Johnston, 2004):

– Immutable data. Shared nothing.

– Freedom of side effects

– Locality of effects

– Lazy evaluation

– Data dependency equivalent to scheduling

FPLs & Tuple Space (Fritsch & Walker, 2013):

– Coordination

– Distribution

– Decoupling

– Inter-process communication (IPC)

Page 26: A multi-tool in computing clouds: Tuple Space

26

Commonalities cont’d

Flow-based programming & Tuple Space:

– Both need “a space”

– IP space in flow-based programs

– Tuple Space in LINDA

Altogether:

– (Data) queues

– Coordination does not need to reckon with side effects

– Coordination & composition

– Representation of a dataflow graph in place of a (thread) call graph

Page 27: A multi-tool in computing clouds: Tuple Space

• News/RSS streams

• Clicks

– Online advertisement analytics (e.g. spider.io)

– URLs (e.g. bit.ly)

– GUI programming

• Logistics & transportation

• Media

– GPUs (streams + kernels)

• Mashups: create new wisdom from multiple data sources (incompatible in velocity, volume, variety/structure)

– Separate errors

• Debit card transactions

– Data masking

– Fraud detection / feedback context mashups

27

Real World Applications

Page 28: A multi-tool in computing clouds: Tuple Space

• The ultimate mashup: batch data (aka “Map Reduce”) and speed data (aka “streams”) (see the sketch at the end of this slide)

– Lambda architecture

– Complementary to each other (e.g. Apache Spark, Lambda Architecture)

• Currently three paradigms: RDBMSs, Map Reduce, streams.

– Distributed query processing is a key element

28

Points to ponder
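
A toy sketch of the “ultimate mashup” idea: answering a query by merging a precomputed batch view with a live speed view. The view names and numbers are invented; this is not a specific Lambda Architecture or Spark implementation.

def query(url, batch_view, speed_view):
    # batch_view: counts precomputed offline (Map Reduce);
    # speed_view: counts accumulated from the stream since the last batch run.
    return batch_view.get(url, 0) + speed_view.get(url, 0)

batch_view = {"/home": 10_000}   # from the last nightly batch job
speed_view = {"/home": 42}       # in-memory, updated tuple by tuple
print(query("/home", batch_view, speed_view))   # 10042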

Page 29: A multi-tool in computing clouds: Tuple Space

• Tuple Space is a piece of software as well

• Scalability of tuple space

– Distribution vs fast in memory computation

• Complex coordination is a must!

– So is error handling (stream replay?)

• A number of supporting elements are needed

– (auto) scaler

– cloud-like deployment: DevOps recipes

– ZooKeeper?

29

Issues

Page 30: A multi-tool in computing clouds: Tuple Space

30

Thank you

Page 31: A multi-tool in computing clouds: Tuple Space

Denti, Enrico, Antonio Natali, and Andrea Omicini. "On the expressive power of a language for programming coordination media." Proceedings of the 1998 ACM symposium on Applied Computing. ACM, 1998.

Fritsch, Joerg, and Coral Walker. "CMQ-A lightweight, asynchronous high-performance messaging queue for the cloud." Journal of Cloud Computing 1.1 (2012): 1-13.

Fritsch, Joerg, and Coral Walker. "Cwmwl, a LINDA-based PaaS fabric for the cloud." Journal of Communications, Special Issue on Cloud and Big Data (2013, to be published).

Gelernter, David. "Generative communication in Linda." ACM Transactions on Programming Languages and Systems (TOPLAS) 7.1 (1985): 80-112.

Gelernter, David, and Nicholas Carriero. "Coordination languages and their significance." Communications of the ACM 35.2 (1992): 96.

Johnston, Wesley M., J. R. Hanna, and Richard J. Millar. "Advances in dataflow programming languages." ACM Computing Surveys (CSUR) 36.1 (2004): 1-34.

Omicini, A., Denti, E., & Natali, A. (1995). Agent coordination and control through logic theories. In Topics in Artificial Intelligence (pp. 439-450). Springer Berlin Heidelberg.

31

References