Integrated Ontology for Sports (Domains: Cricket, Football and Tennis)
Database Interoperability Project
Abhishek Agrawal, George Sam, Hari Haran Venugopal, Noopur Joshi
Outline:
• Problem Statement and Motivation
• Scope of the Project
• Our Approach
• Data Sources – Scraper
• Data Cleaning – Google Refine, Karma
• Ontology Creation – Using existing ontologies to create a federated ontology
• Data Modeling – Karma Tool
• Data Publishing – RDF and Triple Store Creation
• Data Extraction – Using OpenRDF for SPARQL Queries
• Future Work and Challenges
• Conclusion
Problem Statement and Motivation
Why do we need ontologies?
- Need for constant, intelligent access to up-to-date, integrated and detailed information from the Web
- Helps to aggregate data from various sources
Why a federated sports ontology?
- Helps to represent different sports and presents a common view
- Is easily extensible
- Intelligent information gathering
  - Scores: Who's winning, and how did the score change?
  - Schedules: Who's playing whom, when, and where?
  - Standings: Who's in first place? Who's closest to qualifying?
- Data analysis
  - Statistics: How do the players and/or teams measure up against one another in various categories?
  - News: How do we combine editorial coverage of sports with all data feeds?
Scope of the Project
Tennis
- Players
- Tournaments
Cricket
- Players
- Matches
- Rankings
Football
- Players
- Leagues
Our Approach
1. Data Extraction
2. Data Cleaning
3. Ontology Creation
4. Data Modeling
5. Querying using SPARQL
Data Source: Scraper
Web scraping (also called web harvesting or web data extraction) is a software technique for extracting information from websites.
Scraping tools:
• Beautiful Soup – Python library with simple methods and Unicode support; works with parsers such as lxml and html5lib.
• jsoup – Java HTML parser that implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.
• Web Scraper (Chrome extension) – Lets you create a plan (sitemap) describing how a website should be traversed and what should be extracted. The Web Scraper then navigates the site according to the sitemap and extracts the data.
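As a rough illustration of the scraping step, the sketch below pulls rows out of an HTML table. It uses Python's standard-library `html.parser` as a stand-in for the tools listed above, so it runs without extra packages; the HTML fragment, the column names, and the `TableScraper` and `scrape_table` names are all hypothetical and are not taken from the project.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a scraped sports page.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Sport</th></tr>
  <tr><td>Player A</td><td>Tennis</td></tr>
  <tr><td>Player B</td><td>Cricket</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape_table(SAMPLE_HTML)
```

A library such as Beautiful Soup would replace the hand-written parser class with a few selector calls; the overall shape (fetch the page, locate the rows, emit one record per row) would stay the same.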
Data Cleaning
Data cleaning (also called data cleansing or data scrubbing) is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.
Data cleaning tools:
• Karma – Offers a programming-by-example interface that lets users define data transformation scripts, which transform data expressed in multiple formats into a common format.
• Google Refine – A power tool for working with messy data: cleaning it up and transforming it from one format into another.
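The kind of correction these tools perform can be sketched in a few lines of Python. This is only an illustration of detecting and fixing bad records, not a reproduction of what Karma or Google Refine do internally; the field names (`name`, `country`) and the sample rows are assumptions made for the example.

```python
def clean_records(rows):
    """Trim whitespace, normalize casing, drop incomplete rows and duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        # Collapse runs of whitespace and use a consistent capitalization.
        name = " ".join(row.get("name", "").split()).title()
        country = row.get("country", "").strip().upper()
        if not name or not country:
            continue  # incomplete record: remove it
        key = (name, country)
        if key in seen:
            continue  # same record after normalization: remove the duplicate
        seen.add(key)
        cleaned.append({"name": name, "country": country})
    return cleaned


# Hypothetical messy input, as it might come out of a scraper.
raw = [
    {"name": "  player   a ", "country": "ind"},
    {"name": "Player A", "country": "IND "},
    {"name": "", "country": "AUS"},
    {"name": "player b", "country": "esp"},
]
cleaned = clean_records(raw)
```

Normalizing before checking for duplicates matters here: the first two rows only turn out to be the same record once whitespace and casing have been fixed.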
Ontology: Class Hierarchy
Federated Ontology
Data Modeling
Tool used: Karma (USC ISI)
• Browser-based data integration / data modeling tool
• Advantage – Data integration and publishing are easy
• Steps:
1. Load ontologies and data sets
2. Primitive data filtering
3. Set semantic types for attributes
4. Build semantics for each sport individually
• Karma intelligently creates semantic mappings for higher-level concepts.
• Create URLs for entities.
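To make "semantic types" and "URLs for entities" more concrete, the following sketch maps one cleaned row to subject-predicate-object triples by hand. It is not Karma's implementation or API; the `http://example.org/sports/` namespace, the property and class names, and the `mint_uri` and `row_to_triples` helpers are hypothetical.

```python
from urllib.parse import quote

BASE = "http://example.org/sports/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical semantic types: column name -> ontology property URI.
SEMANTIC_TYPES = {
    "name": BASE + "ontology/name",
    "country": BASE + "ontology/country",
}


def mint_uri(kind, label):
    """Build a stable entity URL from an entity kind and a label."""
    slug = quote(label.strip().lower().replace(" ", "_"))
    return f"{BASE}{kind}/{slug}"


def row_to_triples(row, kind, class_uri):
    """Turn one data row into (subject, predicate, object) triples."""
    subject = mint_uri(kind, row["name"])
    triples = [(subject, RDF_TYPE, class_uri)]
    for column, prop in SEMANTIC_TYPES.items():
        if column in row:
            triples.append((subject, prop, row[column]))
    return triples


triples = row_to_triples(
    {"name": "Player A", "country": "IND"},
    kind="player",
    class_uri=BASE + "ontology/TennisPlayer",
)
```

Deriving the URL from the cleaned label means the same player scraped from two sources resolves to the same entity, which is what lets data for different sports be linked under one federated ontology.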
Screenshot
Data Publishing
• Available frameworks: OpenRDF, Protégé, Apache Jena.
• OpenRDF:
  - Browser-based framework
  - Integrated with Karma
• Publish each data set as:
1. JSON
2. R2RML model
3. RDF
• Create a triple store for the RDF
• Load the RDF into the OpenRDF triple store
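For readers unfamiliar with what the published RDF looks like, here is a minimal sketch that serializes triples as N-Triples, one of the standard RDF text formats that triple stores can load. In the project itself Karma generates the RDF; the URIs, the sample triples, and the rule "anything starting with http is a URI, everything else is a string literal" are simplifying assumptions for this example.

```python
def nt_term(value):
    """Render a URI as <...> and anything else as a plain string literal.

    Assumption: values starting with http:// or https:// are URIs.
    """
    if value.startswith(("http://", "https://")):
        return f"<{value}>"
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    lines = [f"<{s}> <{p}> {nt_term(o)} ." for s, p, o in triples]
    return "\n".join(lines) + "\n"


# Hypothetical triples for one player.
sample = [
    ("http://example.org/sports/player/player_a",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://example.org/sports/ontology/TennisPlayer"),
    ("http://example.org/sports/player/player_a",
     "http://example.org/sports/ontology/name",
     'Player "A"'),
]
document = to_ntriples(sample)
```

The resulting text can be saved to a `.nt` file and loaded into a triple store through that store's import interface.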
Data Extraction
SPARQL
• Language used to extract information from RDF
• Query based
SELECT * WHERE { ?Subject ?Predicate ?Object }
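The query above matches every triple in the store, because all three positions are variables. The sketch below imitates that behaviour with a tiny in-memory list of triples, where `None` plays the role of a SPARQL variable. It is a teaching aid rather than a SPARQL engine, and the `ex:` identifiers and the `match` helper are hypothetical.

```python
# Hypothetical triples; "ex:" is a stand-in prefix, not a real namespace.
TRIPLES = [
    ("ex:player_a", "ex:playsSport", "Tennis"),
    ("ex:player_b", "ex:playsSport", "Cricket"),
    ("ex:player_a", "ex:country", "IND"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts like a SPARQL variable."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s)
        and predicate in (None, p)
        and obj in (None, o)
    ]


# Equivalent in spirit to: SELECT * WHERE { ?Subject ?Predicate ?Object }
everything = match(TRIPLES)

# Equivalent in spirit to: SELECT ?s WHERE { ?s ex:playsSport "Tennis" }
tennis_players = [s for s, _, _ in match(TRIPLES, predicate="ex:playsSport", obj="Tennis")]
```

Fixing one or more positions narrows the result, which is how more specific questions (for example, all players of one sport) are expressed.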
Future Work
1. Inclusion of other sports
2. Creating a web/mobile interface to query data
3. Creating an application for university-level players and teams
4. Providing more specific information, such as:
• Details about a particular team from 1990 to 2014
• Images of the players/teams
• Details of all the matches played between two players/teams
References
• http://www.isi.edu/integration/karma/
• http://phd.jabenitez.com/wp-content/uploads/2014/03/A-Practical-Guide-To-Building-OWL-Ontologies-Using-Protege-4.pdf
• http://ict.siit.tu.ac.th/~sun/SW/Protege%20Tutorial.pdf
• http://www.crummy.com/software/BeautifulSoup/
• https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn?hl=en
• https://code.google.com/p/google-refine/
• http://www.datacleansing.net.au/Data_Cleansing_Services