Data Big and Broad (Oxford, 2012)

Definitions, examples, and challenges in a world where data is available and plentiful

Page 1: Data Big and Broad (Oxford, 2012)

Tetherless World Constellation

Data: Big and Broad

Jim Hendler
Tetherless World Constellation

Tetherless World Professor of Computer and Cognitive Science
Head, Computer Science Department

Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~hendler

@jahendler (twitter)

Page 2: Data Big and Broad (Oxford, 2012)

Outline (if I stick to it)

• What is big data?
• How big is big?
• What is big data on the Web?
• What is Broad data?
• Got an example?
• What’s the problem?
• What’s going on

Page 3: Data Big and Broad (Oxford, 2012)

Useful Terms

• Machine-readable Data
  – Information available in a form that is accessible and manipulable by computer
  – Accessible ≠ Manipulable
    • e.g., PDF documents can be read in and displayed, but the information in the document is not readily available without special tooling

• Metadata
  – Information associated with (machine-readable) data that provides information about the data set

• Workflow, Provenance, and lots of other terms
  – Useful sorts of metadata with respect to who created the data, when, how it was processed, etc.

• Metadata and the other stuff are most useful when machine-readable and openly available in commonly agreed upon formats
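To make the terms above concrete, here is a minimal sketch of a machine-readable metadata record that captures who created a data set, when, and how it was processed. The class name, field names, and sample values are assumptions made for illustration; the slides do not prescribe any particular format, and JSON is used only because it is a commonly agreed upon one.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    creator: str   # who created the data
    created: str   # when it was created (ISO 8601 date string)
    processing_steps: List[str] = field(default_factory=list)  # how it was processed


def to_machine_readable(meta: DatasetMetadata) -> str:
    """Serialize the record to JSON so other programs can read it."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)


# Placeholder values, not taken from any real catalog.
record = DatasetMetadata(
    title="Example dataset",
    creator="Example Agency",
    created="2012-07-09",
    processing_steps=["converted CSV to RDF"],
)

serialized = to_machine_readable(record)
restored = json.loads(serialized)  # any consumer can parse it back
```

Because the record round-trips through plain JSON, it is both accessible and manipulable in the sense used above, unlike the same facts buried in a PDF.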

Page 4: Data Big and Broad (Oxford, 2012)

BIG Data is NOT the Web of Data

• The term “Big Data” is widely used nowadays to refer to a whole bunch of machine-readable data in one accessible (to the researcher) place
  – 3 main contexts
    • The large data collections of “big science” projects
      – in traditional data warehouse or database formats
    • The enterprise data of large, non-Web-based companies (IBM, TATA, etc.)
      – Generally in multiple
    • The data holdings of a Google, Facebook or other large Web company
      – Include large “unstructured” holdings
      – Include “graph” data

Page 5: Data Big and Broad (Oxford, 2012)

Tera, Peta, Zetta, yotta, yotta, yotta…

• World Wide Web data is extremely large
• Extremely well “funded”
  – e.g., Facebook
    • 25 Terabytes of logged data per day; valuation $33B (US NIH budget ~$31B)
  – e.g., Google
    • In 2008 it was estimated at 20 petabytes per day (not including YouTube); current valuation $190B (about 1/3 the entire US DoD budget)
• And really, really fascinating stuff
  – Data about people and their relationships
    • To each other
    • To products
    • To activities and actions
    • …

Page 6: Data Big and Broad (Oxford, 2012)

How BIG is Big?

Page 7: Data Big and Broad (Oxford, 2012)

BIG Data

Google uses their data in many ways

Search => ads => user

Page 8: Data Big and Broad (Oxford, 2012)

Big Data is becoming different on the Web

• New Work
  – Is moving away from traditional relational models
    • cf. NoSQL
  – Moving towards third party application and extension
    • cf. Mobile apps for local governments
  – Includes a focus on interoperability and exchange with “lightweight” semantics
    • Using ideas from the Semantic Web
      – Search: Schema.org
      – Social Networking: OGP

Page 9: Data Big and Broad (Oxford, 2012)

Which in part gives rise to BROAD data

• 4th context: Broad Data
  – The huge amount of freely available, but widely varied, Open Data on the World Wide Web (Structured and Semi-structured)
    • Example: The extended Facebook OGP graph (the part outside Facebook’s datasets)
    • Example: The growing linked open data cloud of freely available RDF linked data
    • Example: Hundreds of thousands of datasets that are available on the Web free from governments around the world

Page 10: Data Big and Broad (Oxford, 2012)

Example: adding “Breadth”

April 2010

Page 11: Data Big and Broad (Oxford, 2012)

Facebook’s Open Graph Protocol

• Facebook now allows other sites to extend the graph
• Open Graph Protocol uses RDFa to let web sites contain information about the things people “like”

og:title - The title of your object as it should appear within the graph, e.g., "The Rock".
og:type - The type of your object, e.g., "movie". Depending on the type you specify, other properties may also be required.
og:image - An image URL which should represent your object within the graph.
og:url - The canonical URL of your object that will be used as its permanent ID in the graph.
og:description - A one to two sentence description of your object.
og:site_name - If your object is part of a larger web site, the name which should be displayed for the overall site, e.g., "IMDb".

  – Not a traditional “ontology”
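As a rough illustration of how a third-party page carries these properties, the sketch below reads `og:` values out of `<meta>` tags using only Python's standard `html.parser` module. The sample page, its example.org URLs, and the names `OpenGraphParser` and `extract_open_graph` are assumptions made for this example; only the property names and the "The Rock", "movie", and "IMDb" values come from the slide.

```python
from html.parser import HTMLParser
from typing import Dict


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.properties: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(prop, content)


# Hypothetical page; the URLs are placeholders, not real resources.
SAMPLE_PAGE = """<html prefix="og: http://ogp.me/ns#"><head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://example.org/movies/the-rock" />
<meta property="og:image" content="http://example.org/images/the-rock.jpg" />
<meta property="og:site_name" content="IMDb" />
</head><body></body></html>"""


def extract_open_graph(html: str) -> Dict[str, str]:
    """Return the Open Graph properties found in an HTML document."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


og = extract_open_graph(SAMPLE_PAGE)
```

The result is a flat set of property/value pairs, which fits the slide's point that this is lightweight markup rather than a traditional ontology.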

Page 12: Data Big and Broad (Oxford, 2012)

Big Data

Facebook generates terabytes of data per day

What could be learned from this?

Page 13: Data Big and Broad (Oxford, 2012)

Creates a platform for SW-powered apps

Page 14: Data Big and Broad (Oxford, 2012)

BROAD data challenges

• For broad data the new challenges that emerge include
  – (Web-scale) data search
  – “Crowd-sourced” modeling
  – rapid (and potentially ad hoc) integration of datasets
  – visualization and analysis of only-partially modeled datasets
  – policies for data use, reuse and combination.

Page 15: Data Big and Broad (Oxford, 2012)

Huh?

“The more I work with data, the more I realize I need Semantics”

Huh?

The traditional database community has, umm, not always been the first to embrace semantics

What is different here?

Page 16: Data Big and Broad (Oxford, 2012)

Government Data Sharing

Page 17: Data Big and Broad (Oxford, 2012)

2012 International Open Government Data Conference—Open Gov Data Tutorial

The Web of Open Government Data is Growing

• Analytics based on over 1,000,000 datasets from around the world can be seen at
  – http://logd.tw.rpi.edu/iogds_data_analytics

• The examples that follow are from that page

9 July 2012 17

Datasets    1,028,054
Countries   43
Catalogs    192
Categories  2460
Languages   24

Page 18: Data Big and Broad (Oxford, 2012)

International

Page 19: Data Big and Broad (Oxford, 2012)

Page 20: Data Big and Broad (Oxford, 2012)

Important note: quantity is not really the most important issue

Many others…

Page 21: Data Big and Broad (Oxford, 2012)

Topics (Across All Catalogs)

Page 22: Data Big and Broad (Oxford, 2012)

Topics (Across All Catalogs)

Page 23: Data Big and Broad (Oxford, 2012)

Combining data from different data sharing sites

Page 24: Data Big and Broad (Oxford, 2012)

Data Integration Problems

A head-to-head comparison shows that burglaries in Avon and Somerset (UK) far exceed those in Los Angeles, California (one of the highest crime areas in the US)

Page 25: Data Big and Broad (Oxford, 2012)

The problem is (likely) semantics

Same or different?

Do the terms mean the same? Are they collected in the same way? Are they processed differently? …
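The three questions above can be turned into a simple pre-comparison check. The sketch below assumes each publisher documents a field with a definition identifier, a collection method, and a processing note; the class `FieldDescription`, the function `semantic_mismatches`, and every sample value are hypothetical and say nothing about how any real police force defines or records burglary.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FieldDescription:
    """Hypothetical description of one column as documented by a publisher."""
    label: str              # the term shown to users, e.g. "burglary"
    definition: str         # identifier of the definition the publisher used
    collection_method: str  # how the numbers were gathered
    processing: str         # how the numbers were processed before release


def semantic_mismatches(a: FieldDescription, b: FieldDescription) -> List[str]:
    """List the aspects in which two same-labelled fields differ."""
    problems = []
    if a.definition != b.definition:
        problems.append("definition")
    if a.collection_method != b.collection_method:
        problems.append("collection_method")
    if a.processing != b.processing:
        problems.append("processing")
    return problems


# Placeholder descriptions: same label, different semantics.
source_one = FieldDescription("burglary", "definition-A", "method-X", "raw counts")
source_two = FieldDescription("burglary", "definition-B", "method-Y", "raw counts")

mismatches = semantic_mismatches(source_one, source_two)
safe_to_compare = not mismatches
```

A matching label alone passes none of these checks, which is the point of the slide: the comparison is only meaningful once the metadata says the terms agree.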

Page 26: Data Big and Broad (Oxford, 2012)

Example: Water

Page 27: Data Big and Broad (Oxford, 2012)

Example: Water/Kenya

Page 28: Data Big and Broad (Oxford, 2012)

Finding Data

World Bank: Africa

US Data.gov: Crop

Africover: Agriculture

Kenya: Agricultural

Page 29: Data Big and Broad (Oxford, 2012)

5 Star Data

Page 30: Data Big and Broad (Oxford, 2012)

Broad Data “Integration” requires simple semantics

Page 31: Data Big and Broad (Oxford, 2012)

Example: any Wikipedia topic!

Page 32: Data Big and Broad (Oxford, 2012)

Arizona

Page 33: Data Big and Broad (Oxford, 2012)

Arizona info (From the previous)

Page 34: Data Big and Broad (Oxford, 2012)

USDA data turns out to be crucial

Page 35: Data Big and Broad (Oxford, 2012)

Metadata is crucial for Broad Data

• Metadata design is crucial to govt data sharing
  – Needed for search and federation in large data sharing efforts

• International data sharing
  – W3C Govt Linked Data Working Group
  – Need for vocabularies within govt sectors
    • Esp. for cross-language use
  – How can we compare health (or legal, or social, or …) data between countries like US, UK, India, Kenya (English) with Norway, China, France, etc.?
  – How can we link local govts (in traditional languages, local dialects, etc.) w/ national data?

Page 36: Data Big and Broad (Oxford, 2012)

Database metadata

Page 37: Data Big and Broad (Oxford, 2012)

Dataset extension to schema.org (pending)
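For a sense of what dataset metadata in the schema.org vocabulary can look like, the sketch below builds a JSON-LD description using the `Dataset` and `DataDownload` types and the `distribution`, `encodingFormat`, and `contentUrl` properties as they appear in the published schema.org vocabulary. Which properties the proposal on this slide contained is not shown in the transcript, so the selection here is an assumption, as are the helper name `dataset_jsonld` and the example.org URLs.

```python
import json
from typing import Any, Dict


def dataset_jsonld(name: str, description: str, landing_page: str,
                   download_url: str, media_type: str) -> Dict[str, Any]:
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": landing_page,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": media_type,
                "contentUrl": download_url,
            }
        ],
    }


# Placeholder values; none of these refer to a real catalog entry.
doc = dataset_jsonld(
    name="Example crop statistics",
    description="Hypothetical dataset used to illustrate the markup.",
    landing_page="http://example.org/datasets/crops",
    download_url="http://example.org/datasets/crops.csv",
    media_type="text/csv",
)

# The string a publisher would embed in a page inside a
# <script type="application/ld+json"> element.
markup = json.dumps(doc, indent=2)
```

Embedding such a description in a catalog page lets a search engine index the dataset itself rather than only the page text around it.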

Page 38: Data Big and Broad (Oxford, 2012)

Government Data in the linked open data cloud

http://linkeddata.org/

Government Data is currently over ½ the cloud in size (~17B triples), 10s of thousands of links to other data (within and without)

Page 39: Data Big and Broad (Oxford, 2012)

Research in Govt Data => Broad Data challenges

• Trust
  – Government data is controversial, and potentially biased
    • How do we confirm or dispute?

• Combination
  – When we combine data we need to keep the provenance of information (see trust)
    • How do we make policies explicit and sharable

• Scaling
  – Our project has already converted 9.9B triples from only >2,000 of the 710,000 government databases we can identify (116 catalogs, 32 countries, 16 languages)
    • Cross-catalog
    • Cross Language

• Versioning and updating
• Archiving
• Visualization
• …
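The "Combination" point can be sketched in a few lines: when records from several catalogs are merged, each one keeps a pointer to where it came from, so a disputed figure can be traced back to its source. The `Observation` class, the helper names, the catalog names, and the numbers below are all hypothetical placeholders rather than real government data.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Observation:
    """One value about a subject, tagged with the catalog it came from."""
    subject: str
    measure: str
    value: float
    source: str  # provenance: which catalog published this value


def tag_with_source(rows: Iterable[Tuple[str, str, float]],
                    source: str) -> List[Observation]:
    """Attach a source label to raw (subject, measure, value) rows."""
    return [Observation(s, m, v, source) for s, m, v in rows]


def combine(*batches: List[Observation]) -> List[Observation]:
    """Merge batches without discarding where each record came from."""
    merged: List[Observation] = []
    for batch in batches:
        merged.extend(batch)
    return merged


def sources_for(observations: Iterable[Observation],
                subject: str, measure: str) -> Set[str]:
    """Return every source that reports the given subject and measure."""
    return {o.source for o in observations
            if o.subject == subject and o.measure == measure}


# Placeholder rows from two hypothetical catalogs.
first = tag_with_source([("region-A", "rainfall", 1.0)], "catalog-one")
second = tag_with_source([("region-A", "rainfall", 2.0),
                          ("region-B", "rainfall", 3.0)], "catalog-two")

combined = combine(first, second)
reporting = sources_for(combined, "region-A", "rainfall")
```

Here the two catalogs disagree about region-A, and because the source travels with each value, the disagreement stays visible instead of being silently averaged away.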

Page 40: Data Big and Broad (Oxford, 2012)

Big Data needs bigger ideas for visualization

(Fox & Hendler, Science, 2/11/10)

Page 41: Data Big and Broad (Oxford, 2012)

A new idea we’re playing with at RPI

• Data as “exhibition”
  – Museums/Performing Arts have explored accessibility for real world artifacts; can we extend these to the data web?

• Data via physical interaction
  – Using theatre techniques we can literally move a person through a data landscape; what new metaphors does this open up?

Page 42: Data Big and Broad (Oxford, 2012)

Conclusions

• Big data is going Broad
  – World Wide Web trend towards more and more varied data
    • In many domains
      – E-commerce, Open Govt, many more (cf. Health/Medical care)

• Broad data requires thinking outside the “Database” box
  – Including considering access

• Broad data opens exciting possibilities for research and innovation
  – And I hope will help provide tools for making data more accessible