Data Streaming Raja Chiky [email protected]

Introduction to Data streaming - 05/12/2014


Page 1: Introduction to Data streaming - 05/12/2014

Data Streaming

Raja Chiky – [email protected]

Page 2: Introduction to Data streaming - 05/12/2014

About me ¡  Associate professor in Computer Science – LISITE-RDI

¡  Research interest: Data stream mining, scalability and resource optimization in distributed architectures (e.g. cloud architectures), recommender systems

¡  Research field: Large scale data management

1. Real-time and distributed processing of various data sources

2. Use semantic technologies to add a semantic layer

3. Recommender systems and collaborative data mining

4. Optimizing resources in large scale systems

Heterogeneous and dynamic data streams

Heterogeneous and static data

sensors

2

Page 3: Introduction to Data streaming - 05/12/2014

OUTLINE

¡ Context: Big Data

¡ What is a data stream ?

¡ Data stream management systems

¡  Basic approximate algorithms

¡  Big Data: Distributed Systems

¡  Semantic Data Streaming

¡ Conclusion

3

Page 4: Introduction to Data streaming - 05/12/2014

New era 4

Page 5: Introduction to Data streaming - 05/12/2014

5 Big Data: Buzzword!

Page 6: Introduction to Data streaming - 05/12/2014


6 Where is all this data coming from?

Page 7: Introduction to Data streaming - 05/12/2014

7 More and More connected Things

Page 8: Introduction to Data streaming - 05/12/2014

8 So, what is Big Data?

Volume of data created worldwide: from the dawn of time to 2003, about 5 EB; by 2012, 2.7 ZB; by 2015, 10 ZB (estimated).

§  1 YB = 10^24 Bytes  §  1 ZB = 10^21 Bytes  §  1 EB = 10^18 Bytes  §  1 PB = 10^15 Bytes  §  1 TB = 10^12 Bytes  §  1 GB = 10^9 Bytes

Velocity of data:

§  Walmart handles 1M transactions per hour
§  Google processes 24 PB of data per day
§  AT&T transfers 30 PB of data per day
§  90 trillion emails are sent per year
§  World of Warcraft uses 1.3 PB of storage
§  Facebook, when it had a user base of 900M users, had 25 PB of compressed data
§  400M tweets per day in June '12
§  72 hours of video is uploaded to Youtube every minute

Variety of data:

§  Radio  §  TV  §  News  §  E-Mails  §  Facebook posts  §  Tweets  §  Blogs  §  Photos  §  Videos (user and paid)  §  RSS feeds  §  Wikipedia  §  GPS data  §  RFID  §  POS scanners  §  …

Big Data elements: Volume, Variety, Velocity

Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla

+ Veracity (IBM) - information uncertainty

Page 9: Introduction to Data streaming - 05/12/2014

Diagram: processing pipeline from data sources to output. Data sources: static data (databases) and stream (big) data (sensors). Gathering information: ETL and batch processing for static data; semantic ETL and stream processing, with load shedding, for data streams. Store: data warehouse, and databases/triplestores holding synopses. User interaction: ad-hoc queries, analytics, continuous queries / business rules, knowledge enrichment. Output: real-time visual analytics and retro-action.


9

Page 10: Introduction to Data streaming - 05/12/2014

10 Big Data: Velocity. Examples: website logs, network monitoring, financial services, eCommerce, traffic control, power consumption, weather forecasting.

Page 11: Introduction to Data streaming - 05/12/2014

What is a data stream? ¡  Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.”

¡  Massive volumes of data, items arrive at a high rate.

11

Page 12: Introduction to Data streaming - 05/12/2014

12 Applications of data stream processing ¡ Data stream processing ¡  Process queries (compute statistics, activate alarms)

¡  Apply data mining algorithms

¡  Requirements ¡  Real-time processing

¡  One-pass processing

¡  Bounded storage (no complete storage of streams)

¡  Possibly consider several streams

Page 13: Introduction to Data streaming - 05/12/2014

13 Applications of data stream processing ¡  Let’s go deeper into some examples ¡  Network management

¡  Stock monitoring

Page 14: Introduction to Data streaming - 05/12/2014

14 Network management ¡  Supervision of a computer network

¡  Improvement of network configuration (hardware, software, architecture)

¡ Detection of attacks

¡ Measurements made on routers, sent to a network supervision center: huge volume of data, high rate of arrivals

Page 15: Introduction to Data streaming - 05/12/2014

15 Network management

Network supervision center

Timestamp  Source     Destination  Duration  Bytes  Protocol
...        ...        ...          ...       ...    ...
12342      10.1.0.2   16.2.3.7     12        20K    http
12343      18.6.7.1   12.4.0.3     16        24K    http
12344      12.4.3.8   14.8.7.4     26        58K    http
12345      19.7.1.2   16.5.5.8     18        80K    ftp
...        ...        ...          ...       ...    ...

Page 16: Introduction to Data streaming - 05/12/2014

16 Network management

Typical queries:

- 100 most frequent (@S, @D) on router R1 …

- How many different (@S, @D) seen on R1 but not R2 …

- … during last month, last week, last day, last hour ?

Network supervision center

Page 17: Introduction to Data streaming - 05/12/2014

17 Stock monitoring ¡  Stream of price and sales volume of stocks over time

¡  Technical analysis/charting for stock investors

¡  Support trading decisions

l  Notify me when the price of IBM is above $83, and the first MSFT price afterwards is below $27.

l  Notify me when some stock goes up by at least 5% from one transaction to the next.

l  Notify me when the price of any stock increases monotonically for ≥30 min.

l  Notify me when the difference between the current price of a stock and its 10 day moving average is greater than some threshold value

Source: Gehrke 07 and Cayuga application scenarios (Cornell University)

Page 18: Introduction to Data streaming - 05/12/2014


18 Where is the problem? (1/2) ¡ Example:

¡  Join between several streams ¡  Join between stream data and customer database

¡  Generic tools for processing streams ¡  Avoid the ‘Store’, ‘Compute’, ‘Delete’ approach

¡  Solution: incremental computation and definition of temporal windows for joins

Illustration: a bank withdrawal of 50 € and a bank withdrawal of 1000 $, both on 12/05/2014, are joined to raise a bank fraud alert.

Page 19: Introduction to Data streaming - 05/12/2014

19 Where is the problem? (2/2)

¡ Example: ¡ 100 most frequent @S IP addresses on a router ¡ Maintain a table of IP addresses with frequencies? ¡ Sampling the stream?

¡ Face high (and varying) rate of arrivals ¡ Exact versus approximate answers

Page 20: Introduction to Data streaming - 05/12/2014

20 Examples of queries

Example queries over a stream of elements taking values 1 to 20:

- Find elements with frequency > 0.1%
- top-k most frequent elements
- Frequency of the element 3
- Total frequency of elements between 8 and 14
- Number of elements having a non-zero frequency

Page 21: Introduction to Data streaming - 05/12/2014

21 Data Stream Management Systems

DBMS vs. DSMS

Data model. DBMS: permanent updatable relations. DSMS: streams and permanent updatable relations.

Storage. DBMS: data is stored on disk. DSMS: permanent relations are stored on disk; streams are processed on the fly.

Query. DBMS: SQL language (creating structures, inserting/updating/deleting data, retrieving data with one-time queries). DSMS: SQL-like query language (standard SQL on permanent relations, extended SQL with windowing on streams, continuous queries).

Performance. DBMS: large volumes of data. DSMS: optimization of computing resources to deal with several streams and several queries; ability to face variations in arrival rates without crashing.

Page 22: Introduction to Data streaming - 05/12/2014

22 Generic DSMS architecture

Diagram (Golab & Özsu, 2003): streaming inputs pass through an input monitor; a query processor works with a query repository (fed by user queries), working storage, summary storage and static storage (which receives updates to static data); results leave through an output buffer as streaming outputs.

Page 23: Introduction to Data streaming - 05/12/2014

23 Existing DSMS

Page 24: Introduction to Data streaming - 05/12/2014

24 How to deal with Big Data Streams

Distribution

DSMS Sampling/Load Shedding/Sketch/…

Page 25: Introduction to Data streaming - 05/12/2014

25 Approximate answers to queries ¡ When ? ¡  Queries needing unbounded memory

¡  Too many queries / too fast streams / too strict response-time requirements

¡  CPU limit

¡  Memory limit

¡  Solution : approximate answers to queries ¡  Sliding windows

¡  Sampling and load shedding

¡  Definition of synopsis

Page 26: Introduction to Data streaming - 05/12/2014

26 Approaches ¡  Two approaches for handling such streams ¡  Use a time window, and query the window as a static table

¡  When you can’t store collected data, or to keep track of historical data

¡  Sampling

¡  Filtering

¡  Counting

Page 27: Introduction to Data streaming - 05/12/2014

Windowing ¡ Applying queries/mining tasks to the whole stream (from

beginning to current time)

¡ Applying queries/mining to a portion of the stream

Diagram: a window on the stream, placed between the beginning of the stream and the current date, along the time axis t.

Page 28: Introduction to Data streaming - 05/12/2014


29 Windowing ¡ Definition of windows of interest on streams ¡  Fixed windows: September 2014

¡  Sliding windows: last 3 hours

¡  Landmark windows: from September 1st, 2014

¡ Window specification ¡  Physical : last 3 hours

¡  Logical : last 1000 items

¡  Refreshing rate ¡  Rate of producing results (every item, every 10 items, every minute, …); a minimal count-based example is sketched below
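To make the window specification concrete, here is a minimal Python sketch (not from the slides; the function name and defaults are illustrative) of a logical sliding window over the last 1000 items with a refreshing rate of one result every 10 items:

from collections import deque

def logical_sliding_window(stream, size=1000, refresh_every=10, aggregate=len):
    # Logical (count-based) sliding window: keep the last `size` items and
    # produce an aggregate every `refresh_every` arrivals (the refreshing rate).
    window = deque(maxlen=size)            # old items fall out automatically
    for i, item in enumerate(stream, start=1):
        window.append(item)
        if i % refresh_every == 0:
            yield aggregate(window)        # e.g. count, sum or average over the window

# Example: average of the last 1000 readings, refreshed every 10 readings
# for avg in logical_sliding_window(readings, aggregate=lambda w: sum(w) / len(w)):
#     print(avg)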

Page 29: Introduction to Data streaming - 05/12/2014

Sliding window

Diagram: the window slides along the time axis from the beginning of the stream; at each refreshment time (tc, then t'c) results are produced over the most recent portion of the stream.

Give me the last room where Axel has been in the last 10 minutes, updating results every minute

30

Page 30: Introduction to Data streaming - 05/12/2014

Sliding window vs. tumbling window

Diagram: with a tumbling window the refreshment period equals the window size, so successive windows (ending at tc, then t'c) do not overlap; results are produced at each refreshment time.

Give me the last room where Axel has been in the last 10 minutes, updating results every 10 minutes

31

Page 31: Introduction to Data streaming - 05/12/2014

32 Sampling from data stream ¡  Inputs: ¡  Sample size k ¡  Window size n >> k (alternatively, time duration m) (optionally) ¡  Stream of data elements that arrive online

¡ Output: ¡  k elements chosen uniformly at random from the last n elements

(alternatively, from all elements that have arrived in the last m time units)

¡ Goal: ¡  maintain a data structure that can produce the desired output at

any time upon request

¡ Challenge: ¡  we don’t know how long the stream is ¡  So when/how often to sample?

Page 32: Introduction to Data streaming - 05/12/2014

34 A simple, Unsatisfying Approach ¡ Choose a random subset X={x1, …,xk}, X⊂{0,1,…,n-1}

¡  The sample always consists of the non-expired elements whose indexes are equal to x1, …,xk (modulo n)

¡ Only uses O(k) memory

¡  Technically produces a uniform random sample of each window, but unsatisfying because the sample is highly periodic

¡  Unsuitable for many real applications, particularly those with periodicity in the data

Page 33: Introduction to Data streaming - 05/12/2014

35 Reservoir sampling ¡ Classic online algorithm due to Vitter (1985)

¡ Maintains a fixed-size uniform random sample ¡  Size of the data stream need not be known in advance

¡ Data structure: “reservoir” of k data elements

¡ As the ith data element arrives: ¡  Add it to the reservoir with probability p = k/i, discarding a randomly

chosen data element from the reservoir to make room
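A minimal Python sketch of Vitter's reservoir sampling as described above (the function name and structure are illustrative):

import random

def reservoir_sample(stream, k):
    # Maintain a uniform random sample of size k from a stream of unknown length.
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)                 # fill the reservoir first
        elif random.random() < k / i:              # keep the i-th item with probability k/i
            reservoir[random.randrange(k)] = item  # evict a randomly chosen current element
    return reservoir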

Page 34: Introduction to Data streaming - 05/12/2014

36 Chain Sampling v  Chain Sampling method is for sequence based windows.

v  In this type of sampling, when the ith element arrives it is chosen to become the sample with probability 1/min(i, n), so that every element of the current window is equally likely to be the sample.

v  If the ith element is chosen as the sample, the algorithm also selects the index of the element that will replace it when expires (assuming that it is still present in the sample when it expires). This index is picked uniformly at random from the range i+1…i+n, representing the range of indexes of the elements that will be active when the ith element expires.

v  When the element with the selected index arrives, the algorithm stores it in memory and chooses the index of the element that will replace it when it expires, and so on, building a chain of elements to use in case of the expiration of the current element in the sample.
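A Python sketch of chain sampling for a single sample over a sequence-based window, following the description above (class and method names are illustrative; for a sample of size k, run k independent instances):

import random
from collections import deque

class ChainSample:
    # One chain-sample over a sliding window of the last n elements
    # (Babcock, Datar & Motwani); run k independent instances for a size-k sample.
    def __init__(self, n):
        self.n = n                  # window size
        self.i = 0                  # number of elements seen so far
        self.chain = deque()        # (index, value) pairs; the head is the current sample
        self.next_index = None      # index pre-selected to extend the chain

    def add(self, value):
        self.i += 1
        if random.random() < 1.0 / min(self.i, self.n):
            # the new element becomes the sample; restart the chain
            self.chain = deque([(self.i, value)])
            self.next_index = random.randint(self.i + 1, self.i + self.n)
        elif self.next_index == self.i:
            # this element was pre-selected: store it and pick its own successor
            self.chain.append((self.i, value))
            self.next_index = random.randint(self.i + 1, self.i + self.n)
        if self.chain and self.chain[0][0] <= self.i - self.n:
            self.chain.popleft()    # expired sample replaced by the next chain element

    def sample(self):
        return self.chain[0][1] if self.chain else None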

Page 35: Introduction to Data streaming - 05/12/2014

37 Chain Sampling

Illustration: a chain is built over the example stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3.

Page 36: Introduction to Data streaming - 05/12/2014

38 Another Simple Approach: oversample ¡ As each element arrives, remember it with probability p = (c·k·log n)/n for some constant c; otherwise discard it

¡ Discard elements when they expire

¡ When asked to produce a sample, choose k elements at random from the set in memory

¡  Expected memory usage of O(k log n)

¡  The algorithm can fail if fewer than k elements from a window are remembered; however, with high probability this will not happen
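A Python sketch of the oversampling approach under the stated assumptions (c is an arbitrary constant; the function name is illustrative):

import math
import random
from collections import deque

def oversample_window(stream, k, n, c=2.0):
    # Keep each arriving item with probability p = c*k*log(n)/n, drop expired items,
    # and draw k items at random from what is retained whenever a sample is requested.
    p = min(1.0, c * k * math.log(n) / n)
    kept = deque()                                   # (index, value) pairs still in the window
    for i, x in enumerate(stream):
        if random.random() < p:
            kept.append((i, x))
        while kept and kept[0][0] <= i - n:          # discard expired elements
            kept.popleft()
        yield random.sample([v for _, v in kept], min(k, len(kept)))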

Page 37: Introduction to Data streaming - 05/12/2014

39 Min-wise Sampling ¡  For each item, pick a random fraction between 0 and 1

¡  Store item(s) with the smallest random tag (Nath et al. 2004)

0.391 0.908 0.291 0.555 0.619 0.273

Each item has the same chance of carrying the least tag, so the sample is uniform. The method can run on multiple streams separately, and the samples can then be merged.
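A Python sketch of min-wise sampling, keeping the k items with the smallest random tags (names are illustrative):

import heapq
import random

def minwise_sample(stream, k):
    # Tag each item with a uniform random number in [0,1) and keep the k items
    # with the smallest tags: a uniform random sample without replacement.
    # Tagged samples from several streams can be merged by keeping the k smallest tags overall.
    heap = []                                    # max-heap of (-tag, item), size <= k
    for item in stream:
        tag = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (-tag, item))
        elif tag < -heap[0][0]:
            heapq.heapreplace(heap, (-tag, item))
    return [(-neg_tag, item) for neg_tag, item in heap]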

Page 38: Introduction to Data streaming - 05/12/2014


40 Hash function - Revision ¡  Hash function: Any algorithm that maps data of arbitrary length

to data of a fixed length. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.

Page 39: Introduction to Data streaming - 05/12/2014

41 Stream filtering ¡ Compact data structure, by B.H. Bloom, 1970: ¡  A bit array of size m,

¡  A family H of k hash functions (MD5, SHA256, Murmur),

¡  A set S of n items

¡  False positive probability f = (1 - e^(-kn/m))^k

¡  Example: m = 10, hash = MD5, k = 3
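A minimal Python sketch of a Bloom filter; deriving the k hash functions by seeding MD5 is just one illustrative choice:

import hashlib

class BloomFilter:
    # m-bit array plus k hash functions; supports add and probabilistic membership tests.
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k bucket positions by hashing the item with k different seeds.
        for seed in range(self.k):
            digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False positives possible (probability about (1 - e^(-kn/m))^k); false negatives impossible.
        return all(self.bits[pos] for pos in self._positions(item))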

Page 40: Introduction to Data streaming - 05/12/2014

42 Sketches ¡  Sketch ¡  Synopsis structure taking advantage of high volumes of data

¡  Provides an approximate result with probabilistic bounds

¡  Random projections on smaller spaces (hash functions)

¡ Many sketch structures: usually dedicated to a specialized task

¡  Examples of sketch structures ¡  COUNT (Flajolet 85)

¡  COUNT SKETCH (Charikar et al. 04)

Page 41: Introduction to Data streaming - 05/12/2014

43 Difference between Sampling and Sketching ¡  Sample sees only those items which were selected to be in the

sample; whereas the sketch sees the entire input, but is restricted to retain only a small summary of it

¡ Not every problem can be solved with sampling ¡  Example: counting how many distinct items in the stream

¡  If a large fraction of items aren’t sampled, don’t know if they are all same or all different

Page 42: Introduction to Data streaming - 05/12/2014

44 Sketches: COUNT (Flajolet Martin algorithm) ¡ Goal ¡  Number N of distinct values in a stream (for large N)

¡  Example: number of distinct IP addresses going through a router

¡  Sketch structure ¡  SK: L bits initialized to 0

¡  H: hashing function transforming an element of the stream into L bits

¡  H distributes the elements of the stream uniformly over the 2^L possibilities

Example: H(18.6.7.1) = 0 0 1 0 1 0 1 0 ; SK is initialized to 0 0 0 0 0 0 0 0

Page 43: Introduction to Data streaming - 05/12/2014

45 Sketches ¡ Method ¡  Maintenance and update of SK

¡  For each new element e

¡  Compute H(e)

¡  Select the position of the leftmost 1 in H(e)

¡  Set this position to 1 in SK

SK          = 0 1 0 0 1 0 0 1

H(18.6.7.1) = 0 0 1 0 1 0 1 0 (leftmost 1 at position 2)

New SK      = 0 1 1 0 1 0 0 1

Page 44: Introduction to Data streaming - 05/12/2014

46 Sketches ¡  Result ¡  Select the position R (0…L-1) of the leftmost 0 in SK

¡  E(R) = log2(φ·N) with φ = 0.77351…

→ Estimate the count by 2^R/φ ¡  σ(R) = 1.12

Example: SK = 1 1 1 0 1 0 0 0, so R = 3

To improve accuracy of this approximation algorithm, use multiple hash functions and use the average R instead.
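A Python sketch of the COUNT (Flajolet-Martin) estimator following the slides' convention of recording the leftmost 1 of each hash value (class name and hashing details are illustrative; for better accuracy, run several seeds and average R):

import hashlib

class FMSketch:
    PHI = 0.77351

    def __init__(self, L=32, seed=0):
        self.L, self.seed, self.bitmap = L, seed, 0

    def _hash_bits(self, item):
        digest = hashlib.md5(f"{self.seed}:{item}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % (1 << self.L)

    def add(self, item):
        h = self._hash_bits(item)
        if h == 0:
            return                              # no 1 bit to record (extremely unlikely)
        pos = self.L - h.bit_length()           # position of the leftmost 1 (0 = most significant)
        self.bitmap |= 1 << (self.L - 1 - pos)  # force that position to 1 in SK

    def estimate(self):
        # R = position of the leftmost 0 in SK; estimated distinct count = 2^R / phi
        r = 0
        while r < self.L and self.bitmap & (1 << (self.L - 1 - r)):
            r += 1
        return (2 ** r) / self.PHI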

Page 45: Introduction to Data streaming - 05/12/2014

47 Sketches ¡ COUNT SKETCH ALGORITHM (Charikar et al. 2004)

¡ Goal ¡  k most frequent elements in a stream (for large number N of distinct

values)

¡  Ex. 100 most frequent IP addresses going through a router

Input stream 2, 0, 1, 3, 1, 2, 4, . . .  →  frequencies so far: f(0) = 1, f(1) = 2, f(2) = 2, f(3) = 1, f(4) = 1

Page 46: Introduction to Data streaming - 05/12/2014

48 Sketches

Diagram: an array of B counters maintained for each of t hash functions; when element e arrives, one counter per hash function is updated by +1 or -1 according to the sign hash s(e).

Page 47: Introduction to Data streaming - 05/12/2014

49 Sketches ¡  Sketch structure

¡  h : hash function from [0, … , N-1] to [0, 1, … , B] ¡  s : hash function from [0, … , N-1] to {+1, -1} ¡  Array of B counters: C1, …, CB (with B << N)

¡  Sketch maintenance when e arrives: Ch(e) += s(e)

¡  Use of sketch ¡  Estimation of the frequency of object e: ne ≈ Ch(e) · s(e) ¡  Actually, t hash functions hj and t hash functions sj are used:

ne ≈ median j∈[1…t] ( Chj(e) · sj(e) ) ¡  Theoretical results on the error depending on N, t and B.

Page 48: Introduction to Data streaming - 05/12/2014

50 Sketches ¡ Algorithm

¡  Maintenance of a list (e1, e2, …, ek) of the current k most frequent elements

¡  For a new arriving element e

¡  Add e to the sketch structure

¡  Estimate frequency of e from the sketch structure

¡  If f(e) > f(ek), remove ek and insert e into the list
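A Python sketch of the COUNT SKETCH structure and the top-k maintenance loop described above (MD5-based hashing and the names CountSketch and top_k are illustrative choices, not part of the original specification):

import hashlib
from statistics import median

class CountSketch:
    # t rows of B counters; row j uses a bucket hash h_j and a sign hash s_j.
    def __init__(self, t=5, B=1024):
        self.t, self.B = t, B
        self.C = [[0] * B for _ in range(t)]

    def _hashes(self, item, row):
        digest = hashlib.md5(f"{row}:{item}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % self.B
        sign = 1 if digest[4] & 1 else -1
        return bucket, sign

    def add(self, item):
        # Sketch maintenance when item arrives: C[j][h_j(item)] += s_j(item)
        for j in range(self.t):
            bucket, sign = self._hashes(item, j)
            self.C[j][bucket] += sign

    def estimate(self, item):
        # Estimated frequency: median over rows of C[j][h_j(item)] * s_j(item)
        values = []
        for j in range(self.t):
            bucket, sign = self._hashes(item, j)
            values.append(self.C[j][bucket] * sign)
        return median(values)

def top_k(stream, k, sketch=None):
    # Maintain an approximate list of the k most frequent elements.
    sketch = sketch or CountSketch()
    best = {}                                     # element -> estimated frequency
    for e in stream:
        sketch.add(e)
        est = sketch.estimate(e)
        if e in best or len(best) < k:
            best[e] = est
        else:
            weakest = min(best, key=best.get)
            if est > best[weakest]:               # f(e) > f(ek): replace the weakest entry
                del best[weakest]
                best[e] = est
    return sorted(best.items(), key=lambda kv: -kv[1])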

Page 49: Introduction to Data streaming - 05/12/2014

Distributed Systems

Answering today's Big Data needs: S4, Storm, SAMOA, …

51

Page 50: Introduction to Data streaming - 05/12/2014

Yahoo S4 ¡  Simple Scalable Streaming System

¡  Use a decentralized and symmetric architecture

¡  All nodes are similar, no master node

¡  Based on Processing Elements

¡  Each PE has 4 components: 1.  Functionality defined by a PE class and associated configuration

¡  input event handler processEvent() ¡  output mechanism output()

2.  the types of events that it consumes 3.  the keyed attribute in those events

4.  the value of the keyed attribute in events which it consumes

¡  Several PEs are available for standard tasks such as count, aggregate, join …

¡  Custom PEs can easily be programmed

¡  The PEs are automatically deployed on the Processing Nodes (machines), and the platform: ¡  Ensures load balancing of events, ¡  Notifies appropriate PEs when an event comes in.

Page 51: Introduction to Data streaming - 05/12/2014

S4 Example: Word Count

Page 52: Introduction to Data streaming - 05/12/2014

Twitter Storm ¡  Same principle as S4, with a simplified programming model

¡  Storm provides realtime computation ¡  Scalable

¡  Guarantees no data loss

¡  Extremely robust and fault-tolerant

¡  Programming language agnostic

¡ Concepts ¡  Streams

¡  Spouts

¡  Bolts

¡  Topologies

Page 53: Introduction to Data streaming - 05/12/2014

56 Lambda architecture (By Nathan Marz)

Generic, scalable and fault-tolerant data processing architecture

Diagram: the incoming data flow feeds both the batch layer (batch processing producing precomputed views) and the speed layer (real-time stream processing); the serving layer exposes the precomputed views to queries.

Source: Mathieu DESPRIEE (USI)

Page 54: Introduction to Data streaming - 05/12/2014

Lambda architecture 1.  All data entering the system is dispatched to both the batch

layer and the speed layer for processing.

2.  The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.

3.  The serving layer indexes the batch views so that they can be queried in an ad-hoc way.

4.  The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.

5.  Any incoming query can be answered by merging results from batch views and real-time views.

http://lambda-architecture.net/
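A toy Python sketch of how the three layers cooperate, assuming a simple event-counting use case (all names are illustrative):

from collections import Counter

class LambdaToy:
    # Toy event-counting system organized as a lambda architecture.
    def __init__(self):
        self.master_dataset = []          # batch layer: immutable, append-only raw data
        self.batch_view = Counter()       # serving layer: precomputed (indexed) batch view
        self.realtime_view = Counter()    # speed layer: covers recent data only

    def ingest(self, key):
        # (1) every incoming record is dispatched to both layers
        self.master_dataset.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        # (2)-(3) recompute the batch view from the whole master dataset,
        # then the speed layer can forget what the batch view now covers
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view.clear()

    def query(self, key):
        # (5) answer by merging the batch view and the real-time view
        return self.batch_view[key] + self.realtime_view[key]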

Page 55: Introduction to Data streaming - 05/12/2014

58 Big Data Stream Mining

Machine learning tools:

Distributed, batch: Hadoop, Mahout

Distributed, stream: S4, Storm, SAMOA

Non-distributed, batch: R, WEKA, …

Non-distributed, stream: MOA

Page 56: Introduction to Data streaming - 05/12/2014

59

What is SAMOA?

•  NEW Software framework for mining distributed data streams

•  Big Data mining for evolving streams in REAL-TIME

Page 57: Introduction to Data streaming - 05/12/2014

SAMOA

SAMOA architecture (diagram): SAMOA sits on top of S4, Storm, or another distributed stream processing platform, and provides classifier methods, clustering methods and frequent pattern mining.

}  Use S4, Storm, or other distributed stream processing platform

}  Use MOA, or other streaming machine learning library

}  Easy to extend through PACKAGES

Page 58: Introduction to Data streaming - 05/12/2014

Semantic data streaming

61

Page 59: Introduction to Data streaming - 05/12/2014

62 Too many data streams

Page 60: Introduction to Data streaming - 05/12/2014

63 Type of data used in Big Data initiatives

Internal data Traditional sources

« New data »

Source: Big Data opportunities survey, Unisphere / SAP, May 2013.

Page 61: Introduction to Data streaming - 05/12/2014


64 BI: New Generation

Diagram: heterogeneous and dynamic data streams (e.g. from sensors) and heterogeneous and static data are turned into semantic data streams; semantic filtering, continuous queries and interconnection with ontologies support decision making (monitoring, alerts, statistics, fault detection, etc.) and retro-action.

Page 62: Introduction to Data streaming - 05/12/2014

65 Semantic Web technologies for data stream ¡ Annotate stream data with semantic metadata

¡ Apply Linked Data principles to publish streaming data

¡  Interlink streaming data with existing datasets

¡  Integrate data stream processing + reasoning

¡ Objectives : interoperability, automation, enrichment

Page 63: Introduction to Data streaming - 05/12/2014

66 Existing prototypes

CQELS

SPARQL-Stream

EP-SPARQL

C-SPARQL

Page 64: Introduction to Data streaming - 05/12/2014

67 Approaches

Differences in RDF stream models (timestamps, events, time intervals, triple-based, graph-based, …)

Page 65: Introduction to Data streaming - 05/12/2014

68 Example: SRBench Q: Detect if a hurricane has been observed

“A hurricane has a sustained wind (for more than 3 hours) of at least 33 metres per second or 74 miles per hour (119 km/h).”

ASK
WHERE {
  STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 10800s SLIDE 600s] {
    ?observation om-owl:procedure ?sensor ;
                 om-owl:observedProperty weather:WindSpeed ;
                 om-owl:result [ om-owl:floatValue ?value ] .
  }
}
GROUP BY ?sensor
HAVING ( AVG(?value) >= "74"^^xsd:float )

ASK
FROM STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 3h STEP 10m]
WHERE {
  ?observation om-owl:procedure ?sensor ;
               om-owl:observedProperty weather:WindSpeed ;
               om-owl:result [ om-owl:floatValue ?value ] .
}
GROUP BY ?sensor
HAVING ( AVG(?value) >= "74"^^xsd:float )

Page 66: Introduction to Data streaming - 05/12/2014

69 How to deal with Big Data Streams

Cloud Computing

DSMS Sampling/Load Shedding

[1] J. Hoeksema & S. Kotoulas : High-performance Distributed Stream Reasoning using S4 (ISWC 2011) [2] D. L. Phuoc & al : Elastic and Scalable Processing of Linked Stream Data in the Cloud (ISWC 2013) [3] N. Jain & al : Sampling Semantic Data Stream: Resolving Overload and Limited Storage Issues. (DaEng 2013)

[1] and [2]

[3]

Page 67: Introduction to Data streaming - 05/12/2014

70 Sampling Extensions for continuous SPARQL

PREFIX vocab: <http://data-gov.tw.rpi.edu/vocab/p/8/>
SELECT ?val
WHERE {
  STREAM <C:/CQELS/streams/data-8.stream> [NOW][UNISAMPLING %80] {
    ?rawData vocab:ozone_8hr_daily_max ?val
  }
  GRAPH <C:/CQELS/test.rdf> { ?user sioc:account_of ?person }
}

Operators: [UNISAMPLING %{Sampling Percentage}] [RESSAMPLING %{Reservoir Size}] [CHNSAMPLING %{Window Size}]

N. Jain & al : Sampling Semantic Data Stream: Resolving Overload and Limited Storage Issues. (DaEng 2013)

Page 68: Introduction to Data streaming - 05/12/2014

71 Semantic Data Stream Load Shedding (1/2)

Diagram: an RDF graph describing an air temperature observation and a relative humidity observation, each linked to its type, observed property, procedure (System_ID), sampling time (an Instant with inXSDDateTime 2004-08-08T06:25:00) and result (a MeasureData node with a floatValue and a unit, fahrenheit or percentage). Two dotted triples, (1) and (2), connect node (a) to node (b) and node (c) to node (d).

RDF triple approach: effect of deleting the two triples (1) and (2) above (in dotted lines) in the graph: the deletion of the first RDF triple destroys the link connecting node (a) to node (b); the deletion of the second RDF triple destroys the link connecting node (c) to node (d). This makes nodes (b), (d) and all those connected to them unreachable, in spite of the presence of their data in memory, which represents 6 RDF triples out of 18, i.e. 33% of unusable data.

Page 69: Introduction to Data streaming - 05/12/2014

72 Semantic Data Stream Load Shedding (2/2)

RDF graph approach: effect of deleting sub-graphs such as those formed by nodes (b) or (d) and all nodes to which they are directly connected:
=> Preserves the semantic level of the information
=> Protects the data consistency of the whole graph
=> Enhances semantic data stream systems

Page 70: Introduction to Data streaming - 05/12/2014

73 Conclusion: Big Data Stream challenges ¡  Semantic Information aggregation ¡  Information aggregation: “too much data to assimilate but not

enough knowledge to act”

¡ Distributed and real-time processing ¡  Design of real-time and distributed algorithms for stream processing

and information aggregation

¡  Distribution and parallelization of data mining algorithms

¡ Visual analytics and user modeling ¡  Dynamic user model

¡  Novel visualizations for very large datasets

Page 71: Introduction to Data streaming - 05/12/2014

74

Thanks to Zakia Kazi Aoul, ISEP Marie-Aude Aufaure, ECP Fethi Belghaouti, ISEP-INT Georges Hébrail, EDF R&D Sylvain Lefebvre, ISEP Yousra Chabchoub, ISEP

Page 72: Introduction to Data streaming - 05/12/2014

Big Data: Volume, Variety, Velocity, Veracity, … Value
Integrate, aggregate, analyze, visualize large data sets, whatever their type, provenance, or the speed of their flow.

Linked Data: Web of data, Semantic Web
- A set of principles and good practices allowing to link, publish and search for web data
- Structure and semantically enrich RDF data, with very high scalability

-> Big Linked Data / Linked Big Data

Our value proposition:
- Semantic aggregation from textual and non-textual streams
- Manage semantic heterogeneity, real-time and distributed processing
- Ensure data quality and veracity
- Visual analytics

Semantic Technologies, Living Lab, Linked & Big Data Academic Chair