WHD Colloquium, March 27, 2012. Historical Data Integration based on Collective Intelligence. Vladimir Zadorozhny, Graduate Information Science and Technology Program, School of Information Sciences, University of Pittsburgh. NADM Group.


Page 1: Historical Data Integration based on Collective Intelligence

WHD Colloquium, March 27, 2012 1

Historical Data Integration based on Collective Intelligence

Vladimir Zadorozhny

Graduate Information Science and Technology Program, School of Information Sciences

University of Pittsburgh

NADM Group

V. Zadorozhny

Page 2: Historical Data Integration based on Collective Intelligence

Challenge

Diverse, Heterogeneous, Semi-structured Data Sources

WHD Data Integration Infrastructure

Consolidated Structured Information

Page 3: Historical Data Integration based on Collective Intelligence

Web of Data?

• Linked Data: using the Web to create typed links between data from different sources

• Linked Data uses RDF (Resource Description Framework) to make typed statements (triples)

• Expected result: a Web of Data extending the Web with a global data space connecting diverse domains (people, companies, publications, etc.)

• In general, the Web of Data has the potential (still questionable) to support loose data coupling that may facilitate more efficient data utilization

While WHD can utilize LD and related Web mashup technologies to some extent, it would be premature to rely upon the Linked Data infrastructure
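The notion of typed statements can be illustrated with a minimal sketch in plain Python, where each triple is a (subject, predicate, object) tuple. The `ex:` identifiers and the `match` helper are invented for illustration; they do not come from a real vocabulary, an RDF library, or WHD itself.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical statements; the "ex:" names are placeholders, not a real vocabulary.
TRIPLES = [
    Triple("ex:Liberia", "ex:locatedIn", "ex:WestAfrica"),
    Triple("ex:Liberia", "ex:population1950", "824000"),
    Triple("ex:IvoryCoast", "ex:locatedIn", "ex:WestAfrica"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Follow the typed link "ex:locatedIn" to find everything placed in West Africa.
located = match(TRIPLES, predicate="ex:locatedIn", obj="ex:WestAfrica")
```

Because every statement has the same three-part shape, data from different sources can be linked simply by reusing the same subject or object identifiers.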

Page 4: Historical Data Integration based on Collective Intelligence

Dataverse Network?

• An open source application to publish, share, reference, extract and analyze research data that facilitates making data available to others

• "Dataverse owners can upload any file type and format (excel, txt, pdf, doc, etc.), and the files will be stored and made available in the original format" (http://thedata.org/files/dataversehandout.pdf)

• Information consumers should further integrate data sources to perform analysis using multiple "dataverses".

While WHD aims to be a part of the Dataverse Network, it would not encourage users to contribute data in ANY format. Instead, users integrate their data into the WHD repository while submitting the data. To summarize, the WHD infrastructure crowdsources the data integration task, not just the data contribution task.

Page 5: Historical Data Integration based on Collective Intelligence

General WHD Architecture

[Figure: architecture diagram. Labeled components: Information Providers; Heterogeneous historical data sources; Wrapper Generation; Wrapper Registration; Wrapper (two instances); Data Submission System; Structured homogeneous historical data; Internal Data Reliability Assessment; External Data Reliability Assessment; Annotated historical data; Data Fusion; Fused historical data; Information Consumers.]

Page 6: Historical Data Integration based on Collective Intelligence

Simple Scenario

Data Source: s1 (xl)

Data Source: s2 (doc)

According to the 2006 revision of the World Population Prospects, the total population in the region of Liberia in 1950 was 824,000. The average population growth per year over the following ten years was 2.5 percent. For Ivory Coast those numbers are 2,505,000 and 3.6, respectively.

Wrapper

Mapping: Territories -> Location; Population -> Population; Data Aggregation -> Total; Year -> From, To

Wrapper

Mapping: region -> Location; Population -> Population; Data Aggregation -> Total; Year -> From, To

Materialize Data

Keep Data Remotely

WHD Infrastructure

Extendable Target Schema (relational is not mandatory):

Source | Location | From | To | Population |

select * from Population

Source | Location | From | To | Population |
s2 | Liberia | 01/01/1950 | 12/31/1950 | 824,000 |
s2 | Liberia | 01/01/1960 | 12/31/1960 | 1,052,000 |
s2 | Ivory Coast | 01/01/1950 | 12/31/1950 | 2,505,000 |
s2 | Ivory Coast | 01/01/1950 | 12/31/1950 | 3,692,000 |
s1 | Mauritania | 01/01/1950 | 12/31/1950 | 692,000 |
s1 | Mauritania | 01/01/1960 | 12/31/1960 | 892,000 |
s1 | Senegal | 01/01/1950 | 12/31/1950 | 2,543,000 |
s1 | Senegal | 01/01/1960 | 12/31/1960 | 3,277,000 |
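The wrapper mappings in this scenario can be sketched roughly as follows. This is only an illustration, not the WHD wrapper format: the function `wrap_record`, the shape of the input records, and the assignment of the "Territories" mapping to s1 and the "region" mapping to s2 are all assumptions, and the "Data Aggregation -> Total" rule is left out.

```python
from datetime import date


def wrap_record(source_id, record, field_map, year_field="Year"):
    """Map one source record onto the target schema
    Source | Location | From | To | Population (hypothetical helper)."""
    target = {"Source": source_id}
    # Rename source fields to target fields, e.g. Territories -> Location.
    for source_field, target_field in field_map.items():
        target[target_field] = record[source_field]
    # Year -> From, To: expand a year into its first and last day.
    year = int(record[year_field])
    target["From"] = date(year, 1, 1).strftime("%m/%d/%Y")
    target["To"] = date(year, 12, 31).strftime("%m/%d/%Y")
    # Normalize numbers written as "692,000" into integers.
    target["Population"] = int(str(target["Population"]).replace(",", ""))
    return target


# Assumed pairing of mappings and sources (not stated explicitly on the slide).
S1_MAP = {"Territories": "Location", "Population": "Population"}
S2_MAP = {"region": "Location", "Population": "Population"}

row_s1 = wrap_record(
    "s1", {"Territories": "Mauritania", "Year": 1950, "Population": "692,000"}, S1_MAP
)
row_s2 = wrap_record(
    "s2", {"region": "Liberia", "Year": 1950, "Population": 824000}, S2_MAP
)
```

Both rows end up with the same five target columns, which is what lets a single query such as `select * from Population` return data from both sources.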

Page 7: Historical Data Integration based on Collective Intelligence

WHD Infrastructure

Data Collection, Data Curation, Data Utilization

Big Picture: continuously growing infrastructure (à la Wikipedia)

Page 8: Historical Data Integration based on Collective Intelligence

WHD Prototype

• Group of graduate IS students: special project in the Advanced Data Management class (INFSCI2711)

• Content Management → Pligg (open source content management system; Apache, PHP, and MySQL based)

• Data Integration Engine → Pentaho Kettle (open source data integration engine; Java-based GUI and command-line tools; XML-based data transformation file)

• Data providers download the wrapper generating software, configure wrappers on their workstations (using preconfigured templates), and register the wrappers on the WHD server

Page 9: Historical Data Integration based on Collective Intelligence
Page 10: Historical Data Integration based on Collective Intelligence

[Figure labels: Data Source; Data Transformation; Transformed Data; XML Wrapper]

Page 11: Historical Data Integration based on Collective Intelligence


Page 12: Historical Data Integration based on Collective Intelligence

Data Reliability Assessment and Data Fusion

• Systems based on crowdsourcing require mechanisms to ensure data quality.

• The WHD infrastructure will support efficient data curation strategies based on advanced data reliability assessment and data fusion methods.

• As the system continuously receives new historical reports, WHD estimates the reliability of this data, which evolves with respect to new evidence.

• WHD uses a measure of the inconsistency caused by a report to assess its internal reliability.

• WHD also allows users to submit their subjective feedback on the reliability of data to assess external reliability.

• WHD utilizes subjective logic to combine internal and external reliability assessments.
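As a rough illustration of how subjective logic can combine two reliability assessments, the sketch below fuses two binomial opinions (belief, disbelief, uncertainty) with the cumulative fusion operator from the subjective logic literature. The slides do not say which operator WHD uses, so the choice of operator, the class and function names, and the example numbers are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """Binomial opinion: belief + disbelief + uncertainty = 1."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def expected(self) -> float:
        # Projected probability: belief plus the base-rate share of uncertainty.
        return self.belief + self.base_rate * self.uncertainty


def cumulative_fusion(a: Opinion, b: Opinion) -> Opinion:
    """Cumulative fusion of two opinions that share the same base rate."""
    k = a.uncertainty + b.uncertainty - a.uncertainty * b.uncertainty
    if k == 0:
        raise ValueError("both opinions are dogmatic (zero uncertainty)")
    return Opinion(
        belief=(a.belief * b.uncertainty + b.belief * a.uncertainty) / k,
        disbelief=(a.disbelief * b.uncertainty + b.disbelief * a.uncertainty) / k,
        uncertainty=(a.uncertainty * b.uncertainty) / k,
        base_rate=a.base_rate,
    )


# Made-up numbers: an internal (inconsistency-based) and an external
# (user-feedback-based) opinion about the same report.
internal = Opinion(belief=0.6, disbelief=0.1, uncertainty=0.3)
external = Opinion(belief=0.7, disbelief=0.2, uncertainty=0.1)
combined = cumulative_fusion(internal, external)
```

In this operator the fused uncertainty is lower than either input uncertainty, which matches the idea that reliability estimates sharpen as new evidence arrives.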

Page 13: Historical Data Integration based on Collective Intelligence

Historical Data: Redundancy

Temporal Overlaps

t1 | source_ref1 | Measles | NYC | 10/10/1900 | 10/10/1920 | 700
t2 | source_ref2 | Measles | NYC | 10/20/1910 | 10/30/1930 | 300

[Figure: timeline with ticks at 1900, 1910, 1920, 1930. Measles reports: 700, 300]

Total number of Measles cases in New York City from 1900 to 1930: 700 + 300 = 1000 ??? Temporal overlap between t1 and t2

Spatial Overlaps

t3 | source_ref1 | Smallpox | NY | 10/20/1900 | 10/20/1920 | 500
t4 | source_ref1 | Smallpox | NYC | 10/30/1920 | 10/30/1930 | 600

[Figure: timeline with ticks at 1900, 1910, 1920, 1930. Smallpox reports: 500 (NY), 600 (NYC)]

Total number of Smallpox cases in New York State from 1900 to 1930: 500 + 600 = 1100 ??? Spatial overlap between t3 and t4

Naming Overlaps

t5 | source_ref1 | Yellow fever | NY | 10/10/1900 | 10/10/1920 | 700
t6 | source_ref2 | Hepatitis | NY | 10/10/1900 | 10/10/1920 | 700
t7 | source_ref4 | Hepatitis B | NY | 10/20/1910 | 10/30/1930 | 300

Total number of Hepatitis cases in New York State from 1920 to 1930: 700 + 700 + 300 = 1700 ??? Naming overlap between t5, t6 and t7
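A temporal overlap such as the one between t1 and t2 can be detected with a simple interval check. The sketch below is only an illustration with hypothetical helper names (`parse`, `overlap_days`); it is not the method WHD uses, and it treats each report's From/To dates as a closed interval.

```python
from datetime import datetime


def parse(text):
    """Parse a date written as MM/DD/YYYY, as in the report tuples."""
    return datetime.strptime(text, "%m/%d/%Y").date()


def overlap_days(a, b):
    """Days shared by two closed intervals (from, to); 0 if they are disjoint."""
    start = max(parse(a[0]), parse(b[0]))
    end = min(parse(a[1]), parse(b[1]))
    return max((end - start).days + 1, 0)


# From/To dates copied from the report tuples.
t1 = ("10/10/1900", "10/10/1920")
t2 = ("10/20/1910", "10/30/1930")
t3 = ("10/20/1900", "10/20/1920")
t4 = ("10/30/1920", "10/30/1930")

shared_t1_t2 = overlap_days(t1, t2)  # positive: the Measles reports overlap in time
shared_t3_t4 = overlap_days(t3, t4)  # zero: these two intervals do not intersect
```

When the shared period is positive, adding the two case counts would count part of that period twice, which is why the naive total is questioned.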

Page 14: Historical Data Integration based on Collective Intelligence

Historical Data: Inconsistency

[Figure: timeline of Measles reports in NYC. Values shown: 200, 500; 300, 400; 700. Labels: R1, R2, "Redundant and Inconsistent".]

Page 15: Historical Data Integration based on Collective Intelligence

Information Consumer Toolset: Data Visualization Dashboard

Page 16: Historical Data Integration based on Collective Intelligence

ICTS: Map Exhibits and Timeline Widgets

Page 17: Historical Data Integration based on Collective Intelligence


ICTS: Motion Chart Animation

Page 18: Historical Data Integration based on Collective Intelligence

Conclusion

• We explore a novel approach to reliable, large-scale historical data integration based on collective intelligence

• We implement this approach in the WHD infrastructure for consolidation of heterogeneous historical data

• Major challenge: how to engage a large community of researchers to share their data and collectively resolve the data heterogeneities in a continuously growing large-scale distributed historical repository?

  – contributions from CHAI members (only a small fraction of Wikipedia users contributes information to ensure its growth)

  – as the infrastructure evolves, users may become interested in "embedding" their data in a larger context to perform global analysis and to utilize WHD tools

  – open development platform (extendable data transformation library and toolsets)

Page 19: Historical Data Integration based on Collective Intelligence

Acknowledgements

Graduate IS Students (WHD system development team):

Andrew Barnett (team leader), Andrew Entin, Thomas Junker, Jidapa Kraisangka, Han Liao, Eric Miller, Ye Peng, Evan Pulgino, Henry Quattrone, Mark Swartz, Miao Tan, Liu Yuchen, Lihong Zhang

Doctoral Students:

Ying-Feng Hsu, Julian Lee