Upload
payam-barnaghi
View
714
Download
3
Embed Size (px)
Citation preview
1
Intelligent Data Processing for the Internet of Things
Payam Barnaghi
Institute for Communication Systems (ICS)
University of Surrey
Guildford, United Kingdom
International “IoT 360″ Summer SchoolOctober 29th – November 1st, 2014 – Rome, Italy
2
Key characteristics of IoT devices
−Often inexpensive sensors (actuators) equipped with a radio transceiver for various applications, typically low data rate ~ 10-250 kbps.
−Deployed in large numbers−The sensors should coordinate to perform the desired task.−The acquired information (periodic or event-based) is
reported back to the information processing centre (or some cases in-network processing is required)
−Solutions are often application-dependent.
2
3
Beyond conventional sensors
− Human as a sensor (citizen sensors)− e.g. tweeting real world data and/or events
− Software sensors− e.g. Software agents/services generating/representing
data
Road block, A3Road block, A3
Road block, A3Road block, A3
Suggest a different routeSuggest a different route
Internet of Things: The story so far
P. Barnaghi, A. Sheth, “The Internet of Things: The story so far”, IEEE IoT Newsletter, September 2014.
5
The benefits of data processing in IoT
− Turn 12 terabytes of Tweets created each day into sentiment analysis related to different events/occurrences or relate them to products and services.
− Convert (billions of) smart meter readings to better predict and balance power consumption.
− Analyze thousands of traffic, pollution, weather, congestion, public transport and event sensory data to provide better traffic and smart city management.
− Monitor patients, elderly care and much more…
− Requires: real-time, reliable, efficient (for low power and resource limited nodes), and scalable solutions.
Partially adapted from: What is Bog Data?, IBM
… but also Data Dynamicity:
Not just Volume…
How can we efficiently deal with:- Large amounts of (heterogeneous/distributed) data?- Both static and dynamic data?- In a re-usable, modular, flexible way?- Integrate different types of data- Provide hypothesis and create more context-aware solutions
Adapted from: M. Hauswirth. A. Mileo, Insight, National University of Ireland, Galway.
AnyPlace AnyTime
AnyThing
Data Volume
Security, Reliability, Trust and Privacy
Societal Impacts, Economic Values and Viability
Services and Applications
Networking andCommunication
Data is not what we want or is it?
What we need are insights and actionable-information
Diffusion of innovation
image source: Wikipedia
IoT
Problem #1
Data: We seem to have lots of it…
Real World Data: it is always difficult to get (silos, format, privacy, business interests or lack of interest!...)
Problem #2
Data: interoperability and metadata frameworks…
Real World Data: there are solutions for service based (RESTful) access, meta-data/semantic representation frameworks (e.g. W3C SSN, HyperCat,…) but none of them are still widely adapted.
Problem #3
Data: quality, reliability…
Real World Data: data can be noisy, crowed source data can be inaccurate, contradictory, delay in accessing/processing the data…
Problem #4
Data: having too much data and using analytics tools alone won’t solve the problem…
Real World Data: in addition to the HPC issues, we need new methods/solutions that can provide real-time analysis of dynamic, variable quality and multi-modal streams…
Problem #5
Data: abstraction, discovering the associations…
Real World Data: co-occurrence vs. causation; we need hypothesis, background knowledge,…After all data is not what we are really after…
We need more linked open data
(near) real-time linked open data Streams
Sometimes it’s even better if we have:
(near) real-time linked open data Streams+meta-data (semantic annotations)+Adaptable and scalable analytics tools+Sufficient background knowledge
or even better than that if we have:
IoT Data
19
?
Current focus on Big Data
− Emphasis on power of data and data mining solutions
− Technology solutions to handle large volumes of data; e.g. Hadoop, NoSQL, Graph Databases, …
− Trying to find patterns and trends from large volumes of data…
Myths About Big Data
− Big Data is only about massive data volume− Big Data means Hadoop− Big Data means unstructured data− If we have enough data we can draw conclusions
(enough here often means massive amounts)− NoSQL means No SQL− It is about increasing computational power and
taking more data and running data mining algorithms.
21Some of the items are adapted from: Brain Gentile, http://mashable.com/2012/06/19/big-data-myths/
What happens if we only focus on data
− Number of burgers consumed per day.− Number of cats outside.− Number of people checking their facebook
account.
− What insight would you draw?
22
It is also important to note what type of problems we expect to solve.
Back to the future
24
25Source LAT Times, http://documents.latimes.com/la-2013/
Future cities: a view from 1998
26Source: http://robertluisrabello.com/denial/traffic-in-la/#gallery[default]/0/
Source: wikipedia
27
101 Smart City Use-case Scenarios
http://www.ict-citypulse.eu/page/content/smart-city-use-cases-and-requirements
29
Data alone is not enough
− Domain knowledge− Machine interpretable meta-data− Delivery, sharing and representation services− Query, discovery, aggregation services− Publish, subscribe, notification, and access
interfaces/services− More open solutions for innovation and citizen participation− Efficient feedback and control mechanisms − Social network and social system analysis− In cities, interactions with people and social systems is the
key.
30
We need an Integrated Approach
31
Processing steps
AnalyticsToolbox
Context-awareDecision Support,
Visualisation
Knowledge-based
Stream Processing
Real-TimeMonitoring &
Testing
Accuracy & Trust
Modelling
SemanticIntegration
On Demand Data
Federation
OpenReferenceData Sets
Real-TimeIoT InformationExtraction
IoT StreamProcessing
Federation ofHeterogenousData Streams
Design-Time Run-Time Testing
Exposure APIs
32
Storing, handling and processing the data
Image courtesy: IEEE Spectrum
33
Technical Challenges
− Discovery: finding appropriate device and data sources− Access: Availability and (open) access to data resources
and data− Search: querying for data− Integration: dealing with heterogeneous devices, networks
and data− Large-scale data mining, adaptable learning and efficient
computing and processing − Interpretation: translating data to knowledge that can be
used by people and applications− Scalability: dealing with large numbers of devices and a
myriad of data and the computational complexity of interpreting the data.
34
IoT Data Access
− Publish/Subscribe (long-term/short-term)− Ad-hoc query− The typical types of data request for sensory data:
− Query based on− ID (resource/service) – for known resources− Location− Type− Time – requests for freshness data or historical data; − One of the above + a range [+ Unit of Measurement]− Type/Location/Time + A combination of Quality of Information
attributes − An entity of interest (a feature of an entity on interest)− Complex Data Types (e.g. pollution data could be a combination of
different types)
35
IoT Data in the Cloud
Image courtesy: http://images.mathrubhumi.comhttp://www.anacostiaws.org/userfiles/image/Blog-Photos/river2.jpg
Comparing IoT data streams with conventional multimedia streams
Source: P. Barnaghi, W. Wang, L. Dong, C. Wang, "A Linked-data Model for Semantic Sensor Streams", in the Proc. of
IEEE International Conference on Internet of Things (iThings 2013), August 2013. Source: P. Barnaghi, W. Wang, L. Dong, C. Wang, "A Linked-data Model for Semantic Sensor Streams", in the Proc. of
IEEE International Conference on Internet of Things (iThings 2013), August 2013.
37
Describing IoT Data: An example
Time
Location
Type
Value
Link to QoI metadata
UTCUTC
#GeoHash#GeoHash
#Hash#Hash
[DataType, Value][DataType, Value]
URIURI
Ontology for commontypes
38
Observation and Measurement Value
GeoHash
UTC time
UTC time (in Java) : The time indicated is returned represented as the distance, measured in milliseconds, of that time from the epoch (00:00:00 GMT on January 1, 1970).
Standard XSD data type
39
GeoHashing
− For example Guildford: lat: 51.235401 and long: -0.574600 can be hashed as: gcpe6zjeffgp
− It can be used as:− A unique identifier − represent point data as hash string
− It uses Base 32 encoding and bit interleaving − It’s used for geo-tagging (and is a symmetric technique)− Place close to each other will have similar prefix (string similarity) − Limitations:
− We could have Geohash codes with no common prefix − Edge case (locations close to each other but on opposite sides of the
Equator)
− A meridian point (line of longitude)
40
GeoHash Example
Sample locations on a Google Map and theirequivalent geohash strings; - close locations have similar prefixes
IoT Data Processing
WSNWSN
WSNWSN
WSNWSN
WSNWSN
WSNWSN
Network-enabled DevicesNetwork-enabled Devices
Network-enabled DevicesNetwork-enabled Devices
Network services/storage and processing
units Data/service access at
application level
Data collections and processing
within the networks
Data Discovery
Service/Resource Discovery
42
In-network processing
− Mobile Ad-hoc Networks can be seen as a set of nodes that deliver bits from one end to the other;
− WSNs, on the other end, are expected to provide information, not necessarily original bits− Gives additional options− e.g., manipulate or process the data in the network
− Main example: aggregation − Applying aggregation functions to a obtain an average value of
measurement data− Typical functions: minimum, maximum, average, sum, … − Not amenable functions: median
Source: Protocols and Architectures for Wireless Sensor Networks, Protocols and Architectures for Wireless Sensor NetworksHolger Karl, Andreas Willig, chapter 3, Wiley, 2005 .
43
In-network processing
− Depending on application, more sophisticated processing of data can take place within the network− Example edge detection: locally exchange raw data with
neighboring nodes, compute edges, only communicate edge description to far away data sinks
− Example tracking/angle detection of signal source: Conceive of sensor nodes as a distributed microphone array, use it to compute the angle of a single source, only communicate this angle, not all the raw data
− Exploit temporal and spatial correlation− Observed signals might vary only slowly in time; so no need to
transmit all data at full rate all the time− Signals of neighboring nodes are often quite similar; only try to
transmit differences (details a bit complicated, see later)
Source: Protocols and Architectures for Wireless Sensor Networks, Protocols and Architectures for Wireless Sensor NetworksHolger Karl, Andreas Willig, chapter 3, Wiley, 2005 .
Data Aggregation
− Computing a smaller representation of a number of data items (or messages) that is extracted from all the individual data items.
− For example computing min/max or mean of sensor data.− More advance aggregation solutions could use approximation
techniques to transform high-dimensionality data to lower-dimensionality abstractions/representations.
− The aggregated data can be smaller in size, represent patterns/abstractions; so in multi-hop networks, nodes can receive data form other node and aggregate them before forwarding them to a sink or gateway.
− Or the aggregation can happen on a sink/gateway node.
Aggregation example
− Reduce number of transmitted bits/packets by applying an aggregation function in the network
1
1
31
1
6
1
1
11
1
1
Source: Holger Karl, Andreas Willig, Protocols and Architectures for Wireless Sensor Networks, Protocols and Architectures for Wireless Sensor Networks, chapter 3, Wiley, 2005 .
Efficacy of an aggregation mechanism
− Accuracy: difference between the resulting value or representation and the original data− Some solutions can be lossless or lossly depending on the
applied techniques. − Completeness: the percentage of all the data items that are
included in the computation of the aggregated data.− Latency: delay time to compute and report the aggregated data
− Computation foot-print; complexity;
− Overhead: the main advantage of the aggregation is reducing the size of the data representation;− Aggregation functions can trade-off between accuracy, latency and
overhead;
− Aggregation should happen close to the source.
Publish/Subscribe
− Achieved by publish/subscribe paradigm− Idea: Entities can publish data under certain names− Entities can subscribe to updates of such named data
− Conceptually: Implemented by a software bus− Software bus stores subscriptions, published data; names used
as filters; subscribers notified when values of named data changes
Software bus
Publisher 1 Publisher 2
Subscriber 1 Subscriber 2 Subscriber 3
Source: Holger Karl, Andreas Willig, Protocols and Architectures for Wireless Sensor Networks, Protocols and Architectures for Wireless Sensor Networks, chapter 12, Wiley, 2005 .
MQTT Pub/Sub Protocol
− MQ Telemetry Transport (MQTT) is a lightweight broker-based publish/subscribe messaging protocol.
− MQTT is designed to be open, simple, lightweight and easy to implement. − These characteristics make MQTT ideal for use in constrained
environments, for example in IoT. − Where the network is expensive, has low bandwidth or is
unreliable − When run on an embedded device with limited processor or
memory resources;− A small transport overhead (the fixed-length header is just 2
bytes), and protocol exchanges minimised to reduce network traffic
− MQTT was developed by Andy Stanford-Clark of IBM, and Arlen Nipper of Cirrus Link Solutions.
Source: MQTT V3.1 Protocol Specification, IBM, http://public.dhe.ibm.com/software/dw/webservices/ws-mqtt/mqtt-v3r1.htmlSource: MQTT V3.1 Protocol Specification, IBM, http://public.dhe.ibm.com/software/dw/webservices/ws-mqtt/mqtt-v3r1.html
MQTT
− It supports publish/subscribe message pattern to provide one-to-many message distribution and decoupling of applications
− A messaging transport that is agnostic to the content of the payload − The use of TCP/IP to provide basic network connectivity − Three qualities of service for message delivery:
− "At most once", where messages are delivered according to the best efforts of the underlying TCP/IP network. Message loss or duplication can occur.
− This level could be used, for example, with ambient sensor data where it does not matter if an individual reading is lost as the next one will be published soon after.
− "At least once", where messages are assured to arrive but duplicates may occur.
− "Exactly once", where message are assured to arrive exactly once. This level could be used, for example, with billing systems where duplicate or lost messages could lead to incorrect charges being applied.
Source: MQTT V3.1 Protocol Specification, IBM, http://public.dhe.ibm.com/software/dw/webservices/ws-mqtt/mqtt-v3r1.htmlSource: MQTT V3.1 Protocol Specification, IBM, http://public.dhe.ibm.com/software/dw/webservices/ws-mqtt/mqtt-v3r1.html
MQTT Message Format
− The message header for each MQTT command message contains a fixed header.
− Some messages also require a variable header and a payload. − The format for each part of the message header:
Source: MQTT V3.1 Protocol Specification, IBM, http://public.dhe.ibm.com/software/dw/webservices/ws-mqtt/mqtt-v3r1.htmlSource: MQTT V3.1 Protocol Specification, IBM, http://public.dhe.ibm.com/software/dw/webservices/ws-mqtt/mqtt-v3r1.html
— DUP: Duplicate delivery— QoS: Quality of Service— RETAIN: RETAIN flag
—This flag is only used on PUBLISH messages. When a client sends a PUBLISH to a server, if the Retain flag is set (1), the server should hold on to the message after it has been delivered to the current subscribers.—This allows new subscribers to instantly receive data with the retained, or Last Known Good, value.
— DUP: Duplicate delivery— QoS: Quality of Service— RETAIN: RETAIN flag
—This flag is only used on PUBLISH messages. When a client sends a PUBLISH to a server, if the Retain flag is set (1), the server should hold on to the message after it has been delivered to the current subscribers.—This allows new subscribers to instantly receive data with the retained, or Last Known Good, value.
Sensor Data as time-series data
− The sensor data (or IoT data in general) can be seen as time-series data.
− A sensor stream refers to a source that provide sensor data over time.
− The data can be sampled/collected at a rate (can be also variable) and is sent as a series of values.
− Over time, there will be a large number of data items collected. − Using time-series processing techniques can help to reduce the
size of the data that is communicated; − Let’s remember, communication can consume more energy
than communication;
Sensor Data as time-series data
− Different representation method that introduced for time-series data can be applied.
− The goal is to reduce the dimensionality (and size) of the data, to find patterns, detect anomalies, to query similar data;
− Dimensionality reduction techniques transform a data series with n items to a representation with w items where w < n.− This functions are often lossy in comparison with solutions like
normal compression that preserve all the data. − One of these techniques is called Symbolic Aggregation
Approximation (SAX).− SAX was originally proposed for symbolic representation of time-
series data; it can be also used for symbolic representation of time-series sensor measurements.
− The computational foot-print of SAX is low; so it can be also used as a an in-network processing technique.
53
In-network processing
Using Symbolic Aggregate Approximation (SAX)
SAX Pattern (blue) with word length of 20 and a vocabulary of 10 symbolsover the original sensor time-series data (green)
Source: P. Barnaghi, F. Ganz, C. Henson, A. Sheth, "Computing Perception from Sensor Data", in Proc. of the IEEE Sensors 2012, Oct. 2012.
fggfffhfffffgjhghfff
jfhiggfffhfffffgjhgi
fggfffhfffffgjhghfff
Symbolic Aggregate Approximation (SAX)
− SAX transforms time-series data into symbolic string representations.
− Symbolic Aggregate approXimation was proposed by Jessica Lin et al at the University of California –Riverside;
− http://www.cs.ucr.edu/~eamonn/SAX.htm .
− It extends Piecewise Aggregate Approximation (PAA) symbolic representation approach.
− SAX algorithm is interesting for in-network processing in WSN because of its simplicity and low computational complexity.
− SAX provides reasonable sensitivity and selectivity in representing the data.
− The use of a symbolic representation makes it possible to use several other algorithms and techniques to process/utilise SAX representations such as hashing, pattern matching, suffix trees etc.
Processing Steps in SAX
− SAX transforms a time-series X of length n into the string of arbitrary length, where typically, using an alphabet A of size a > 2.
− The SAX algorithm has two main steps: − Transforming the original time-series into a PAA representation− Converting the PAA intermediate representation into a string
during. − The string representations can be used for pattern matching,
distance measurements, outlier detection, etc.
Piecewise Aggregate Approximation
− In PAA, to reduce the time series from n dimensions to w dimensions, the data is divided into w equal sized “frames.”
− The mean value of the data falling within a frame is calculated and a vector of these values becomes the data-reduced representation.
− Before applying PAA, each time series to have a needs to be normalised to achieve a mean of zero and a standard deviation of one.− The reason is to avoid comparing time series with different
offsets and amplitudes;
Source: Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery (DMKD '03). ACM, New York, NY, USA, 2-11.Source: Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery (DMKD '03). ACM, New York, NY, USA, 2-11.
SAX- normalisation before PAA
Timeseries (c): 2, 3, 4.5, 7.6, 4, 2, 2, 2, 3, 1
Mean (μ): μ= (2+3+4.5+7.6+4+2+2+2+3+1)/10= 3.11
Standard deviation (σ): (2-3.11)2 = 1.2321(3-3.11)2 = 0.0121(4.5-3.11)2 = 1.9321(7.6-3.11)2 = 20.1601(4-3.11)2 = 0.7921(2-3.11)2 = 1.2321(2-3.11)2 = 1.2321(2-3.11)2 = 1.2321(3-3.11)2 = 0.0121(1-3.11)2 = 4.4521
1.2321+0.0121+ 1.9321+ 20.1601+ 0.7921+ 1.2321+ 1.2321+ 1.2321+ 1.2321+ 0.0121+4.4521 = 33.5211
σ = √ (33.5211/10) = 1.83087683911
1.2321+0.0121+ 1.9321+ 20.1601+ 0.7921+ 1.2321+ 1.2321+ 1.2321+ 1.2321+ 0.0121+4.4521 = 33.5211
σ = √ (33.5211/10) = 1.83087683911
Normalisation
Timeseries (c): 2, 3, 4.5, 7.6, 4, 2, 2, 2, 3, 1
Normalised: zi = (ci – μ)/ σ
σ = 1.83087683911μ = 3.11z1 = (2- 3.11)/1.83087683911 = -0.606z2 = (3-3.11)/ 1.83087683911= -0.600z3 = (4.5-3.11)/ 1.83087683911= 2.452z4 = (7.6-3.11)/ 1.83087683911= -0.600z5 = (4-3.11)/ 1.83087683911= 0.486z6 = (2-3.11)/ 1.83087683911= -0.606z7 = (2-3.11)/ 1.83087683911= -0.606z8 = (2-3.11)/ 1.83087683911= -0.606z9 = (3-3.11)/ 1.83087683911= -0.600z10 = (1-3.11)/ 1.83087683911= -1.152
Normalised Timeseries (z): -0.606, -0.600, 2.452, -0.600, 0.486, -0.606, -0.606, -0.606 , -0.600, -1.152
PAA calculation
Timeseries (c): 2, 3, 4.5, 7.6, 4, 2, 2, 2, 3, 1Normalised Timeseries (z): -0.606, -0.600, 2.452, -0.600,
0.486, -0.606, -0.606, -0.606 , -0.600, -1.152
PAA (w=5): -0.603, 0.926, -0.06, -0.606, 0.273
PAA to SAX Conversion
− Conversion of the PAA representation of a time-series into SAX is based on producing symbols that correspond to the time-series features with equal probability.
− The SAX developers have shown that time-series which are normalised (zero mean and standard deviation of 1) follow a Normal distribution (Gaussian distribution).
− The SAX method introduces breakpoints that divides the PAA representation to equal sections and assigns an alphabet for each section.− For defining breakpoints, Normal inverse cumulative
distribution function
Breakpoints in SAX
− “Breakpoints: breakpoints are a sorted list of numbers B = β 1,…, β a-1 such that the area under a N(0,1) Gaussian curve from βi to βi+1 = 1/a”.
Source: Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery (DMKD '03). ACM, New York, NY, USA, 2-11.Source: Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery (DMKD '03). ACM, New York, NY, USA, 2-11.
Alphabet representation in SAX
− Let’s assume that we will have 4 symbols alphabet: a,b,c,d− As shown in the table in the previous slide, the cut lines for
this alphabet (also shown as the thin red lines on the plot below) will be { -0.67, 0, 0.67 }
Source: JMOTIF Time series mining, http://code.google.com/p/jmotif/wiki/SAXSource: JMOTIF Time series mining, http://code.google.com/p/jmotif/wiki/SAX
SAX Represetantion
Timeseries (c): 2, 3, 4.5, 7.6, 4, 2, 2, 2, 3, 1Normalised Timeseries (z): -0.606, -0.600, 2.452, -
0.600, 0.486, -0.606, -0.606, -0.606 , -0.600, -1.152
PAA (w=5): -0.603, 0.926, -0.06, -0.606, 0.273
Cut off ranges: {-0.67, 0, 0.67}Alphabet: a ,b ,c, d
SAX representation: bdbbc
Features of the SAX technique
− SAX divides a time series data into equal segments and then creates a string representation for each segment.
− The SAX patterns create the lower-level abstractions that are used to create the higher-level interpretation of the underlying data.
− The string representation of the SAX mechanism enables to compare the patterns using a specific type of string similarity function.
65
A sample data processing framework
fggfffhfffffgjhghfff dddfffffffffffddd cccddddccccdddccc aaaacccaaaaaaaaccccdddcdcdcdcddasddd
PIR Sensor Light Sensor Temperature
Sensor
Raw sensor data stream Raw sensor data streamRaw sensor data stream
Attendance PhoneHot
Temperature Cold
TemperatureBright
Day-time
Night-time
Office room BA0121
On going meeting
Window has been left open
….
Temporal data(extracted from descriptions)
Spatial data(extracted from descriptions)
Thematic data(low level
abstractions)
Intelligent Processing
Observations
High-level abstractions
Domain knowledge
SAX Patterns
Raw sensor data(or Annotated data)
…
….
Intelligent Processing/Reasoning
High-level information/knowledge
Correlation analysis
Slides: Maria Bermudez-Edo
Correlations: linear and nonlinear
67
Co-occurrence analysis
68
Correlation analysis
69
Challenges
− Complexity − Co-occurrences vs. Causation
70
Data Discovery
Slides: Syed Amir Hoseinitabatabaei
A discovery method in the IoT
timetime
locationlocation
typetype
Query formulatingQuery formulating
[#location | #type | time][#location | #type | time]
Discovery IDDiscovery ID
Discovery/DHT ServerDiscovery/DHT Server
Data repository(archived data)Data repository(archived data)
#location#type
#location#type
#location#type
Data hypercubeData hypercube
GatewayGateway
Core networkCore network
Network Connection
Logical Connection
Data
An example: a discovery method in the IoT
73S. A. Hoseinitabatabaei, P. Barnaghi, C. Wang, R. Tafazolli, L. Dong, "A Distributed Data Discovery Mechanism for the Internet of Things", 2014.
An example: a discovery method in the IoT
74S. A. Hoseinitabatabaei, P. Barnaghi, C. Wang, R. Tafazolli, L. Dong, "A Distributed Data Discovery Mechanism for the Internet of Things", 2014.
Adaptive learning
Slides: Daniel Puschmann, Masound Hassanpour
Adaptable and dynamic learning methods
http://kat.ee.surrey.ac.uk/
A D
CB
Image source for equilibrium diagram: John D. Hey, The University of York.
Equilibrium in transient and non-uniform world
Social data analysis: A case study
Slides: Pramod Anantharam, Kno.e.sis Centre, Wright State University
Social data analysis: A case study
− Are people talking about city infrastructure on twitter?
− Can we extract city infrastructure related events from twitter?
− How can we leverage event and location knowledge bases for event extraction?
− How well can we extract city events?
Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.
Do people talk about city infrastructure on Twitter?
80Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.
Some challenges in extracting events from Tweets
− No well accepted definition of “events related to a city”.
− Tweets are short (140 characters) and its informal nature make it hard to analyze− Entity, location, time, and type of the event
− Multiple reports of the same event and sparse report of some events (biased sample)− Numbers don’t necessarily indicate intensity
− Validation of the solution is hard due to the open domain nature of the problem
81Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.
Event extraction techniques
− N-grams + Regression − Text analysis to extract uni- and bi-grams (event markers)− Feature selection to select best possible event markers− Apply regression to predict P(Y|X) where Y is the target (rainfall)
and X is the input (event marker).− Clustering
− Create event clusters incrementally over time− Identify clusters of interest based on its relevance (manual
inspection)− Granularity remains at the tweet/cluster level (tweets are assigned
to clusters of interest)− Sequence Labeling (CRFs)
− Text analysis to extract features such as named entities, POS1 tagging
− Each event indicator is modeled as a mixture of event types that are latent variables
− Each type corresponds to a distribution over named entities n (labels assigned to event types by manual inspection) and other features
Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.
City event extraction
− Event extraction should be open domain (no a priori event types) with event metadata.
− Incorporate background knowledge related to city related events e.g., 511.org hierarchy, SCRIBE ontology, city location names.
− Assess the intensity of an event using content and network cues.
− Robust to noise, informal nature, and variability of data.
Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.
N-grams + Regression
− Open domain: works best when there is a reference corpus to extract n-grams
− Event metadata: cannot distinguish between entities and hence hard to extract event metadata
− Background knowledge: incorporating domain vocabulary (e.g., subsumption) is not natural
− Event intensity: regression maps the event indicators to some quantified values
− Robustness: quite robust if there is a reference corpus
Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.
Clustering
− Open domain: works well for domains with no a priori knowledge of events (may need human inspection)
− Background knowledge: incorporating domain vocabulary is not natural
− Event intensity: not captured − Robustness: quite robust for twitter data with enough
data for each cluster
Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.
Sequence Labeling (CRFs)
− Open domain: works well for domains with no a priori knowledge of events (may need human inspection)
− Event metadata: event metadata extraction is captured naturally with the named entities
− Background knowledge: incorporating domain vocabulary is quite natural
− Event intensity: part-of-speech tag may indirectly capture intensity
− Robustness: with a deeper model for capturing context, quite robust for twitter data
Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.
City Infrastructure
Tweets from a city POS Tagging
POS Tagging
Hybrid NER+ Event term extraction
Hybrid NER+ Event term extraction
GeohashingGeohashing
Temporal EstimationTemporal Estimation
Impact Assessment
Impact Assessment
Event Aggregation
Event AggregationOSM LocationsOSM Locations SCRIBE ontologySCRIBE ontology
511.org hierarchy511.org hierarchy
City Event ExtractionCity Event Annotation
Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.
City event extraction solution architecture
City event extraction - Key insights
a) Space: events reported within a grid (gi ∈G where G is a set of all grids in a city)at a certain time are most likely reporting the same event
b) Time: events reported within a time ∆t in a grid gi are most likely to be reporting the same event
c) Theme: events with similar entities within a grid gi and time ∆t are most likely reporting the same event
Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.
0.6 miles
Max-lat
Min-lat
Min-long
Max-long
0.38 miles
37.7545166015625, -122.40966796875
37.7490234375, -122.40966796875
37.7545166015625, -122.420654296875
37.7490234375, -122.420654296875
4
37.74933, -122.4106711
Hierarchical spatial structure of geohash for representing locations with variable precision.Here the location string is 5H34
0 1 2 3 4 5 6
7 8 9 B C D E
F G H I J K L
0 1
7
2 3 4
5 6 8 9
0 1 2 3 4
5 6 7
0 1 2
3 4 5
6 7 8
City event extraction – Geohashing
Implmentation
− City event annotation − Automated creation of training data − Annotation task (our CRF model vs. baseline CRF model)
− City event extraction− Use aggregation algorithm for event extraction− Extracted events AND ground truth
− Dataset (Aug – Nov 2013) ~ 8 GB of data on disk− Over 8 million tweets− Over 162 million sensor data points− 311 active events and 170 scheduled events
Source: Pramod Anantharam, Kno.e.sis Centre, Wright State University.
Evaluation – extracted events & ground truth
92
Conclusions
− A primary goal of interconnecting devices and collecting/processing data from them is to create situation awareness and enable applications, machines, and human users to better understand their surrounding environments.
− The understanding of a situation, or context, potentially enables services and applications to make intelligent decisions and to respond to the dynamics of their environments.
− Dynamicity, energy efficiency, multi-modality, heterogeneity and volume are among the key challenges.
− We need to design adaptable and scalable solutions that combine knowledge and data engineering (semantics), background knowledge (ontologies) and machine learning techniques.
Acknowledgements
− Some parts of the content are adapted from:− Holger Karl, Andreas Willig, Protocols and Architectures for
Wireless Sensor Networks, Protocols and Architectures for Wireless Sensor Networks, chapters 3 and 12, Wiley, 2005 .
− Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery (DMKD '03). ACM, New York, NY, USA, 2-11.
− JMOTIF Time series mining, http://code.google.com/p/jmotif/wiki/SAX
Q&A
− Payam Barnaghi, University of Surrey/EU FP7 CityPulse Project
http://www.ict-citypulse.eu/
@pbarnaghi