Tetherless World Constellation

Broad Data
Jim Hendler
Tetherless World Professor of Computer and Cognitive Science
Director, The Rensselaer Institute of Data Exploration and Applications (IDEA)
Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~hendler
@jahendler (twitter)

Broad Data (India 2015)

This talk

• What I’m not going to talk about much
– The Semantic Web (per se)
  • http://www.slideshare.net/jahendler/semantic-web-the-inside-story
– Social Machines
  • http://www.slideshare.net/jahendler/social-machines-oxford-hendler
– My work with Watson and Cognitive Computing
  • http://www.slideshare.net/jahendler/watson-an-academics-perspective
  • http://www.slideshare.net/jahendler/watson-summer-review82013final
• What I am going to present
– The rest of the big data story…

Data is important!

• Roughly every 50 years a new power source for the human race is found. Once upon a time it was chemical, then it was electrical, then nuclear, etc.

• Information – so not just data, but data being used – is the new power source for our generation.

http://www.slideshare.net/jahendler/the-science-of-data-science

The Rensselaer Institute for Data Exploration and Applications

• Business Systems
• Built and Natural Environments
• Cyber-Resiliency
• Policy, Ethics and Stewardship
• Materials Informatics
• Data-driven Physical/Life Sciences
• Healthcare Analytics and Mobile Health
• Social Network Analytics
• Agents and Augmented Reality

Developing a “Data Science” Research Agenda

Multiscale Sparsity

Abductive Agent-oriented

BIG Data

• The term “Big Data” is widely used nowadays to refer to a whole bunch of machine-readable data in one accessible (to the researcher) place
– 3 main contexts
• The large data collections of “big science” projects
– In traditional data warehouse or database formats
• The enterprise data of large, non-Web-based companies (IBM, TATA, etc.)
– Generally in multiple data formats, stores, warehouses, etc.
• The data holdings of a Google, Facebook or other large Web company
– Include large “unstructured” holdings
– Include “graph” data

But wait, there’s more!

• 4th context: Broad Data
– The huge amount of freely available, but widely varied, Open Data on the World Wide Web (Structured and Semi-structured)
• Example: The extended Facebook OGP graph (the part outside Facebook’s datasets)
• Example: DBpedia, YAGO, Wikidata, and other indexed information sources
• Example: The growing linked open data cloud of freely available linked data from many domains
• Example: millions of datasets that are freely available on the Web from governments around the world

The V’s

Volume Velocity

BROAD data challenges

• For broad data the new challenges that emerge include
– (Web-scale) data search
– “Crowd-sourced” modeling and user testing
– Rapid (and potentially ad hoc) integration of datasets
– Visualization and analysis of only-partially modeled datasets
– Policies for data use, reuse and combination
• Which are an overlooked but critical part of the KDD world

KDD Pipeline – as usually presented

[Figure: KDD pipeline diagram; label: Data Storage (Big Data Warehouse)]

KDD Pipeline – in the real world

• Data is increasingly being brought in from external sources, with mixed provenance, and increasingly outside the analyzers’ control.

• At increasing rates and scales

[Figure: many separate data stores fed by multiple data sources: sensors … apps, social media, customer behaviors, Web partners, open data. Annotation: formatting, standards use, data cleansing, data bias analysis, …]

Tough data integration challenges

Enterprise analytics

Open Data Integration

Hard problems!

DIVE into Data

Discover Integrate Validate Explore

Thinking outside the Database box

Discovery needs semantics

How do you find the data you need?

The answer isn’t: Middle Eastern Terrorists for $800 …

Discovery – there’s a lot out there

Discovery challenge: keyword search won’t work

World Bank: Africa

US Data.gov: Crop

Africover: Agriculture

Kenya: Agricultural

Integration challenge: need to understand the data

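Person
  RIN:           660125137
  Address #:     1118
  Address St:    Pinehurst
  Address zip:   12203
  Course topic:  CSCI
  Course #:      4961

Campus Personnel
  RPI ID:  660125137
  Name:    Hendler

Campus Classes
  CRN:   1118
  Name:  Intro to Physics

YES: Person RIN 660125137 matches Campus Personnel RPI ID 660125137

NO!!!!: Person Address # 1118 does not match Campus Classes CRN 1118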
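The YES/NO contrast on this slide can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: the values come from the slide, while the `Field` class, the `meaning` labels, and both function names are hypothetical and introduced here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    table: str
    name: str
    value: str
    meaning: str  # hypothetical semantic label, not part of the slide


# Values copied from the slide; the "meaning" labels are assumptions.
person = [
    Field("Person", "RIN", "660125137", "campus-person-id"),
    Field("Person", "Address #", "1118", "street-number"),
]
campus = [
    Field("Campus Personnel", "RPI ID", "660125137", "campus-person-id"),
    Field("Campus Classes", "CRN", "1118", "course-identifier"),
]


def value_matches(left, right):
    """Naive join candidates: any pair of fields holding the same value."""
    return [(a, b) for a in left for b in right if a.value == b.value]


def semantic_matches(left, right):
    """Keep only the value matches whose fields also share a meaning."""
    return [(a, b) for a, b in value_matches(left, right) if a.meaning == b.meaning]


naive = value_matches(person, campus)       # pairs RIN/RPI ID and Address #/CRN
checked = semantic_matches(person, campus)  # keeps only RIN/RPI ID
```

Matching on values alone links a street number to a course, which is exactly the "NO!!!!" case; the match only becomes safe once something describes what each column means.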

Semantic Web and Linked Data (UK)

County Council

Ordnance Survey

Royal Mail

IOGDC Open Data Tutorial

http://logd.tw.rpi.edu

Semantic Web and Linked Data (US examples)

Validation challenge: easy for humans

Easy for us

But very hard for machines without people (or knowledge)

Head to head comparison shows that burglaries in Avon and Somerset (UK) far exceed those in Los Angeles, California

* one of the most dangerous places in the US vs. one of the safest in the UK
* fails the “smell test”

Data + everything else you know

Same or different?

Do the terms mean the same? Are they collected in the same way? Are they processed differently? …
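One way to make these questions machine-checkable is to record the answers as metadata and compare them before comparing the numbers. The sketch below assumes such metadata exists; the `SeriesInfo` class, its fields, and every value in the example are hypothetical placeholders, not facts about any real crime dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesInfo:
    source: str
    term: str        # what the publisher calls the measure
    definition: str  # what is actually counted
    collection: str  # how the counts are gathered
    processing: str  # normalization applied before publication


def comparability_issues(a, b):
    """List the ways two series differ even if they share a term."""
    issues = []
    for attr in ("definition", "collection", "processing"):
        left, right = getattr(a, attr), getattr(b, attr)
        if left != right:
            issues.append(f"{attr} differs: {left!r} vs {right!r}")
    return issues


# Hypothetical metadata: same term, different definition and processing.
series_a = SeriesInfo("dataset-a", "burglary", "definition-1",
                      "police reports", "raw count")
series_b = SeriesInfo("dataset-b", "burglary", "definition-2",
                      "police reports", "rate per 1,000 residents")

issues = comparability_issues(series_a, series_b)
```

An empty list would not prove that two series are comparable; it only means that none of the recorded differences rules the comparison out.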

Exploration challenge: develop/test earlier in pipeline

[Figure: the real-world pipeline diagram again, with many data stores fed by sensors and apps, social media, customer behaviors, Web partners and open data (annotation: formatting, standards use, data cleansing, data bias analysis, …), plus an added “Explore” step]

Can we develop mechanisms to rapidly develop/test hypotheses prior to entering the full analytics pipeline?

Can human perceptual apparatus help?

Exploration challenge is to improve human/data interaction

Were there really no fires in 1985?

How do we attack these challenges?

DOH? OR DO!

Traditional Metadata

• Traditionally metadata tries to be comprehensive
– Example: ISO 19115 (GIS standard)
  • >400 elements
  • 14 “packages”
  • Dozens of UML models (not all consistent w/ each other)
• After 50 years this still doesn’t work!

The alternative: Not your “father’s metadata”

• Big Data on the web
– Is moving away from traditional relational models (cf. NoSQL)
– Moving towards third party application and extension (cf. JSON)
– Focus on interoperability and exchange with “lightweight” semantics
• Using ideas from the Semantic Web
– Search: Schema.org
– Social Networking: OGP

Semantic Web to Knowledge graph

Knowledge graph and schema.org

Google 2014

Google finds embedded metadata on >20% of its crawl – Guha, 2014

• The schema.org hierarchy and details are all available online
– https://schema.org/docs/full.html

Schema.org/Dataset

Human-readable database description (HTML)

Schema.org/Dataset

Embedded meta-data (RDFa)

Dataset extension to schema.org - April, 2013

Schema.org/Dataset – add this to your pages!
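As a rough sketch of what adding this to a page could look like, the Python below builds a Schema.org/Dataset description and wraps it in a script tag. Note the swap: the slides show the metadata embedded as RDFa, whereas this example uses JSON-LD because it can be generated with the standard library alone. The function name, the example dataset, and the example.org URLs are all made up for illustration.

```python
import json


def dataset_jsonld(name, description, url, keywords, download_url, encoding_format):
    """Build a schema.org Dataset description as a plain dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": keywords,
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,
            "encodingFormat": encoding_format,
        },
    }


# Hypothetical dataset; none of these values come from the talk.
record = dataset_jsonld(
    name="Example crop statistics",
    description="Illustrative dataset description for a landing page.",
    url="https://example.org/datasets/crops",
    keywords=["agriculture", "crops"],
    download_url="https://example.org/datasets/crops.csv",
    encoding_format="text/csv",
)

# Snippet that could be pasted into the dataset's HTML landing page.
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(record, indent=2)
    + "\n</script>"
)
```

The human-readable page stays as it is; the embedded description is what a crawler or data search engine would read.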

Schema.org/Dataset (Schema-labs, data search engine)

Big Deal!

USA “Project Data” – metadata (JSON)

Aimed at developers

Based on DCAT

USA “Project Data” – metadata (RDFa)

Embedded metadata for Search, Web Apps

Based on Schema.org/Dataset

EU moving in similar direction

ADMS

Not just Govt sector

• IPTC rNews
– Embedded format for online news publications

Not just Govt sector

• GoodRelations
– Embedded format for online products/catalogs

Not just Govt sector

• Open Graph Protocol
– Embedded format for Facebook relationships
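For a sense of how lightweight this embedded format is, here is a minimal Python sketch that renders Open Graph meta tags for a page. The four properties it checks for (og:title, og:type, og:image, og:url) are the ones the protocol lists as required; the helper name and the example values are assumptions made for this illustration.

```python
from html import escape

# The four properties the Open Graph protocol requires on every page.
REQUIRED = ("og:title", "og:type", "og:image", "og:url")


def ogp_meta_tags(properties):
    """Render Open Graph properties as HTML <meta> tags, one per line."""
    missing = [name for name in REQUIRED if name not in properties]
    if missing:
        raise ValueError(f"missing required Open Graph properties: {missing}")
    return "\n".join(
        f'<meta property="{escape(name)}" content="{escape(value)}" />'
        for name, value in properties.items()
    )


# Hypothetical page; the URLs are placeholders.
tags = ogp_meta_tags({
    "og:title": "Broad Data",
    "og:type": "website",
    "og:image": "https://example.org/slides/cover.png",
    "og:url": "https://example.org/slides/broad-data",
})
```

The resulting tags go in the page's head element, so the same HTML page serves both human readers and applications that consume the graph.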

OGP Use

Next steps

Person            Date
Smith James       June 4
Jones Fred        May 17
O’Connell Frank   April 3
Chang Wu          February 21
Hoffman Bernd     December 9

It’s not enough just to describe the data elements…

Describing a dataset … requires a context

Person            Date
Smith James       June 4
Jones Fred        May 17
O’Connell Frank   April 3
Chang Wu          February 21
Hoffman Bernd     December 9

Context: 1976 Dates of Birth

Describing a dataset … requires a context

How do we capture more of this information?

Person            Date
Smith James       June 4
Jones Fred        May 17
O’Connell Frank   April 3
Chang Wu          February 21
Hoffman Bernd     December 9

Context: 1976 Cancer Mortality dates
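The point of these last slides can be sketched as a small Python example: two datasets with identical columns and rows that differ only in their context. The rows and context captions come from the slides; the `DescribedDataset` class and the two helper functions are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DescribedDataset:
    columns: tuple
    rows: tuple
    context: str  # what the rows are about, taken from the slide captions


rows = (
    ("Smith James", "June 4"),
    ("Jones Fred", "May 17"),
    ("O'Connell Frank", "April 3"),
    ("Chang Wu", "February 21"),
    ("Hoffman Bernd", "December 9"),
)

births = DescribedDataset(("Person", "Date"), rows, "1976 Dates of Birth")
deaths = DescribedDataset(("Person", "Date"), rows, "1976 Cancer Mortality dates")


def same_shape(a, b):
    """True when the element-level descriptions and values are identical."""
    return a.columns == b.columns and a.rows == b.rows


def interchangeable(a, b):
    """True only when the context matches as well."""
    return same_shape(a, b) and a.context == b.context
```

Here `same_shape(births, deaths)` holds while `interchangeable(births, deaths)` does not: describing the data elements alone cannot tell a birthday list from a mortality record.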

Scalable Data Integration (via metadata)

Semantic Linking

ARL Network-Science CTA

[Figure: two groups of four plot panels, each labeled A–D. Axes: Count vs. Time interval (# of days). First group: both axes run from 1 to 1000 in powers of ten, with series “Mentorship first” and “Housing trust first”. Second group: time axis from -300 to 300 days.]

Algorithms designed in Y3 were tested against 220GB of data from the Everquest II game, looking for proxy measures of trust. Performance results on real data showed good correspondence with theoretical results (but 220GB = 1 month of our 2 yrs of data).

Scaling inference for discovery, integration & validation

AI “rules on graphs” bring (limited) KR languages to supercomputing models

Weaver (PhD 2013) showed power of BlueGene/Q for AI computations

From visualization to exploration

… Unfortunately, visualization too often becomes an end product of scientific analysis, rather than an exploration tool that scientists can use throughout the research life cycle. However, new database technologies, coupled with emerging Web-based technologies, may hold the key to lowering the cost of visualization generation and allow it to become a more integral part of the scientific process.

Conclusions

• Our data challenge is becoming “Broad Data”
– World Wide Web trend towards more and more varied data
• In many domains
– E-commerce, Open Govt, many more (cf. Health/Medical care)
• Broad data requires
– Modern, Web-oriented metadata
– LINKING the metadata, not the data
• Broad data requires thinking outside the “Database” box
– DIVE: discover, integrate, validate and
– especially: EXPLORE (early, often, rapidly)

Questions?