Federated Ontology Based Query System

Integrated Ontology for Sports (Domains: Cricket, Football, and Tennis)

Database Interoperability Project

Abhishek Agrawal, George Sam, Hari Haran Venugopal, Noopur Joshi

Outline:

• Problem Statement and Motivation
• Scope of the Project
• Our Approach
• Data Sources – Scraper
• Data Cleaning – Google Refine, Karma
• Ontology Creation – Using existing ontologies to create a federated ontology
• Data Modeling – Karma Tool
• Data Publishing – RDF and Triple Store Creation
• Data Extraction – Using OpenRDF for SPARQL Queries
• Future Work and Challenges
• Conclusion

Problem Statement and Motivation


Why do we need ontologies?
- Users need constant, intelligent access to up-to-date, integrated, and detailed information from the Web
- Ontologies help aggregate data from various sources

Why a federated sports ontology?
- Represents different sports and presents a common view
- Is easily extensible
- Intelligent information gathering
  - Scores: Who's winning, and how did the score change?
  - Schedules: Who's playing whom, when, and where?
  - Standings: Who's in first place? Who's closest to qualifying?
- Data analysis
  - Statistics: How do the players and/or teams measure up against one another in various categories?
  - News: How do we combine editorial coverage of sports with all data feeds?

Scope of the Project

Tennis
- Players
- Tournaments

Cricket
- Players
- Matches
- Rankings

Football
- Players
- Leagues

Our Approach

Data Extraction

Data Cleaning

Ontology Creation

Data Modeling

Querying using SPARQL

Data Source: Scraper

Web scraping (also called web harvesting or web data extraction) is a software technique for extracting information from websites.

Scraping tools:

• Beautiful Soup – Python library with simple methods and Unicode support; works with parsers such as lxml and html5lib.

• jsoup – Java HTML parser that implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.

• Web Scraper (Chrome extension) – Lets you create a plan (sitemap) describing how a website should be traversed and what should be extracted. Web Scraper then navigates the site according to the sitemap and extracts the data.
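
To make the scraping step concrete, here is a minimal sketch of extracting rows from an HTML table. It is an illustration only, not the project's scraper: it uses Python's standard-library `html.parser` instead of the tools listed above so that it runs without extra packages, and the page fragment, column names, and player names are all made up.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical page fragment; a real run would download the page first.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Country</th><th>Rank</th></tr>
  <tr><td>Player A</td><td>India</td><td>1</td></tr>
  <tr><td>Player B</td><td>Australia</td><td>2</td></tr>
</table>
"""

records = scrape_table(SAMPLE_HTML)
```

Each scraped row becomes a dictionary of strings, which is a convenient shape to hand to the cleaning step.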


Data Cleaning

Data cleansing (also called data cleaning or data scrubbing) is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.

Data Cleaning tools:

• Karma – Offers a programming-by-example interface that lets users define data transformation scripts, which convert data expressed in multiple formats into a common format.

• Google Refine – A power tool for working with messy data: cleaning it up and transforming it from one format into another.
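
The kind of correction and removal described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not what Karma or Google Refine do internally; the field names (`player`, `country`, `rank`) and the sample records are assumptions.

```python
import re


def clean_records(records):
    """Trim whitespace, normalize case, convert ranks, drop bad rows and duplicates."""
    cleaned, seen = [], set()
    for rec in records:
        # Correct: collapse repeated spaces and normalize capitalization.
        name = re.sub(r"\s+", " ", rec.get("player", "")).strip().title()
        country = rec.get("country", "").strip().title()
        rank_text = rec.get("rank", "").strip()

        # Remove: records with no name or a non-numeric rank cannot be repaired here.
        if not name or not rank_text.isdigit():
            continue

        # Remove: duplicates that only differed by spacing or case.
        key = (name, country)
        if key in seen:
            continue
        seen.add(key)

        cleaned.append({"player": name, "country": country, "rank": int(rank_text)})
    return cleaned


# Hypothetical messy input, as it might come out of a scraper.
raw = [
    {"player": "  player   a ", "country": "india", "rank": "1"},
    {"player": "Player A", "country": "India ", "rank": "1"},       # duplicate after cleaning
    {"player": "Player B", "country": "Australia", "rank": "n/a"},  # unusable rank
    {"player": "", "country": "England", "rank": "3"},              # missing name
    {"player": "PLAYER C", "country": "england", "rank": " 3"},
]

cleaned = clean_records(raw)
```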


Ontology: Class Hierarchy

Federated Ontology


Data Modeling

Tool used: Karma (USC ISI)

• Browser-based data integration and data modeling tool
• Advantage – data integration and publishing are easy

• Steps:
  1. Load ontologies and data sets
  2. Primitive data filtering
  3. Set semantic types for attributes
  4. Build semantics for each sport individually

• Karma intelligently creates semantic mappings for higher-level concepts.
• Create a URL for each entity.
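
The effect of setting semantic types and creating entity URLs can be pictured as turning each table row into subject–predicate–object triples. The sketch below is a hand-written approximation of that idea, not Karma's API or output; the `http://example.org/sports/` namespace, the property and class names, and the sample row are all hypothetical.

```python
from urllib.parse import quote

BASE = "http://example.org/sports/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical semantic types: data column -> ontology property.
SEMANTIC_TYPES = {
    "name": BASE + "ontology#name",
    "country": BASE + "ontology#country",
}


def entity_uri(kind, label):
    """Build a stable URL for an entity from its kind and a human-readable label."""
    slug = quote(label.strip().lower().replace(" ", "_"))
    return f"{BASE}{kind}/{slug}"


def row_to_triples(row, kind, class_uri, key="name"):
    """Map one data row to triples using the semantic types above."""
    subject = entity_uri(kind, row[key])
    triples = [(subject, RDF_TYPE, class_uri)]
    for column, prop in SEMANTIC_TYPES.items():
        if column in row:
            triples.append((subject, prop, row[column]))
    return triples


triples = row_to_triples(
    {"name": "Player A", "country": "India"},
    kind="tennis/player",
    class_uri=BASE + "ontology#TennisPlayer",
)
```

Because the URL is derived from the cleaned label, the same player appearing in two data sets maps to the same subject, which is what lets the data be integrated.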


Screenshot


Data Publishing

• Available frameworks: OpenRDF, Protégé, Apache Jena.

• OpenRDF:
  - Browser-based framework
  - Integrated with Karma
  - Publish each data set as:
    1. JSON
    2. R2RML model
    3. RDF
  - Create a triple store for the RDF
  - Load the RDF into the OpenRDF triple store
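
As a rough picture of what published RDF looks like before it is loaded into a triple store, the sketch below serializes a couple of triples in the N-Triples format, one statement per line. It is a simplified illustration, not Karma's or OpenRDF's exporter: it handles only IRIs and plain string literals, and the example namespace and data are hypothetical.

```python
from typing import NamedTuple

EX = "http://example.org/sports/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Literal(NamedTuple):
    """Marks a value as a plain string literal rather than an IRI."""
    value: str


def escape_literal(text):
    """Escape backslashes, quotes, and line breaks inside a literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def term(node):
    """Render one node: literals in double quotes, IRIs in angle brackets."""
    if isinstance(node, Literal):
        return f'"{escape_literal(node.value)}"'
    return f"<{node}>"


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "".join(f"{term(s)} {term(p)} {term(o)} .\n" for s, p, o in triples)


player = EX + "player/player_a"
ntriples = to_ntriples([
    (player, RDF_TYPE, EX + "ontology#TennisPlayer"),
    (player, EX + "ontology#name", Literal('Player "A"')),
])
```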


Data Extraction

SPARQL

• Language used to extract information from RDF

• Query Based

SELECT *
WHERE { ?Subject ?Predicate ?Object }
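
The query above matches every triple in the store, because all three positions are variables. To show what "matching a triple pattern" means, here is a toy matcher in plain Python. It is not how OpenRDF evaluates SPARQL; it covers a single pattern only, and the prefixed names and sample triples are hypothetical.

```python
def match_pattern(triples, pattern):
    """Return one binding dict per triple that matches the pattern.

    Terms starting with '?' are variables; any other term must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# Hypothetical data, written with prefixed names for brevity.
TRIPLES = [
    ("ex:player_a", "rdf:type", "ex:TennisPlayer"),
    ("ex:player_a", "ex:country", "India"),
    ("ex:player_b", "rdf:type", "ex:Cricketer"),
]

# Equivalent in spirit to: SELECT * WHERE { ?Subject ?Predicate ?Object }
everything = match_pattern(TRIPLES, ("?Subject", "?Predicate", "?Object"))

# Equivalent in spirit to: SELECT ?player WHERE { ?player rdf:type ex:TennisPlayer }
tennis_players = match_pattern(TRIPLES, ("?player", "rdf:type", "ex:TennisPlayer"))
```

Replacing a variable with a fixed term narrows the result, which is how more specific questions (for example, all tennis players) are expressed.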


Future Work

1. Inclusion of other sports
2. Creating a web/mobile-based interface to query data
3. Creating an application for university-level players and teams
4. Providing more specific information, such as:

• Details about a particular team for the years 1990–2014
• Images of the players/teams
• Details of all the matches played between two players/teams


References

• http://www.isi.edu/integration/karma/
• http://phd.jabenitez.com/wp-content/uploads/2014/03/A-Practical-Guide-To-Building-OWL-Ontologies-Using-Protege-4.pdf
• http://ict.siit.tu.ac.th/~sun/SW/Protege%20Tutorial.pdf
• http://www.crummy.com/software/BeautifulSoup/
• https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn?hl=en
• https://code.google.com/p/google-refine/
• http://www.datacleansing.net.au/Data_Cleansing_Services

