Eurostat
Big Data
Effective Processing and Analysis of Very Large and Unstructured Data for Official Statistics
Dealing with Schemaless Data: Examples and Applications
Monica Scannapieco
Istat ([email protected])
Example 1: Census LOD Project
• Datalift Platform:
  • To design and implement the LOD production process
• Steps:
  • Dataset Upload
  • Ontologies Upload
  • Mapping to RDF
  • LOD Publishing
• Example of LOD-based services: Querying and Visualization
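The "Mapping to RDF" step can be illustrated with a minimal sketch that does not use Datalift itself. It turns one tabular record into N-Triples lines, with one subject per record and one triple per column. The namespace, the property names, and the sample record are hypothetical, and every value is emitted as a plain literal (datatypes and ontology alignment are left out).

```python
# Hypothetical namespace: not taken from the Census LOD project.
BASE = "http://example.org/census/"


def escape_literal(value: str) -> str:
    """Escape a string so it can be used as an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def row_to_ntriples(row: dict, id_field: str) -> list:
    """Map one record to N-Triples: one subject, one triple per remaining column."""
    subject = f"<{BASE}section/{row[id_field]}>"
    triples = []
    for column, value in row.items():
        if column == id_field:
            continue  # the identifier is already encoded in the subject URI
        predicate = f"<{BASE}prop/{column}>"
        triples.append(f'{subject} {predicate} "{escape_literal(str(value))}" .')
    return triples


# Hypothetical sample record
row = {"section_id": "S001", "municipality": "Roma", "population": 1250}
for line in row_to_ntriples(row, "section_id"):
    print(line)
```

In a real production process the predicates would come from the uploaded ontologies rather than from column names.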
Census LOD Project: Recap
Data Model: RDF Graph
Query Language: SPARQL
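To make the recap concrete, the sketch below models an RDF graph as a set of (subject, predicate, object) triples and evaluates a single triple pattern, which is the basic building block of a SPARQL query. It is a toy illustration, not a SPARQL engine. All identifiers and values are hypothetical.

```python
# Toy RDF graph: a set of (subject, predicate, object) triples.
# All identifiers and values are hypothetical.
graph = {
    ("ex:section/S001", "ex:municipality", "Roma"),
    ("ex:section/S002", "ex:municipality", "Milano"),
    ("ex:section/S001", "ex:population", "1250"),
}


def match(graph, pattern):
    """Return the variable bindings for one triple pattern.

    Terms starting with '?' are variables; any other term must match exactly.
    """
    results = []
    for triple in sorted(graph):  # sorted only to make the output deterministic
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value everywhere.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# Roughly corresponds to: SELECT ?s ?m WHERE { ?s ex:municipality ?m }
bindings = match(graph, ("?s", "ex:municipality", "?m"))
for b in bindings:
    print(b["?s"], b["?m"])
```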
Screenshot Live Demo - 1
Addresses Triples
Screenshot Live Demo - 2
Linked Census Section
Screenshot Live Demo - 3
Linked Census Section
Screenshot Live Demo - 4
Linked Census Section
Example 2: Scraping and Processing Web Documents
• Apache Platform:
  • Nutch: Scraper
  • Lucene: Document access
  • Solr: Document management
• Steps:
  • Configure and launch the Nutch scraper
  • Configure Solr
  • Access the Lucene API for processing
Example 2: Configure and Launch Nutch
• Set parameters like:
  • Seed: URLs from which to start crawling
  • Width and depth of navigation
  • Regular expressions the URLs should conform to
  • HTML tags to keep
  • Data object types to include (e.g. images, etc.)
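The regular-expression parameter can be sketched as follows. Nutch's regex URL filter reads rules prefixed with `+` (include) or `-` (exclude), and the first matching rule decides; a URL that matches no rule is ignored. The Python code below only emulates that logic for illustration. It is not Nutch code, the rules and domain are hypothetical, and Python's `re.search` is assumed to be a close enough stand-in for the Java regex matching that Nutch uses.

```python
import re

# Hypothetical rules written in the style of a Nutch regex URL filter file.
RULES = r"""
# skip common image and archive suffixes
-\.(gif|jpg|png|zip)$
# accept pages under a hypothetical seed domain
+^https?://([a-z0-9-]+\.)*example\.org/
"""


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first matching rule decides; a URL matching no rule is rejected."""
    for include, regex in rules:
        if regex.search(url):
            return include
    return False


rules = parse_rules(RULES)
for url in ["http://www.example.org/page.html",
            "http://www.example.org/logo.png",
            "http://other.com/"]:
    print(url, accepts(rules, url))
```

Rule order matters: putting the exclusion before the inclusion is what keeps image URLs on the seed domain out of the crawl.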
Example 2: Solr Features and Config
• Defines the field types and fields of documents
• HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby, PHP, Velocity, CSV, binary)
• Natural Language Processing settings:
  • E.g. whitespace removal, stemming, etc.
• Index settings:
  • Tokens to index
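As an illustration of the HTTP interface and its configurable response formats, the sketch below builds a Solr select URL. The `q`, `wt`, `rows`, and `fl` parameters are standard Solr query parameters. The core name `webdocs`, the field names, and the local base URL are assumptions made for this example. No request is sent; the code only constructs the URL.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, query, response_format="json",
                    rows=10, fields=None):
    """Build a Solr /select URL; 'wt' chooses the response format (json, xml, csv, ...)."""
    params = {"q": query, "wt": response_format, "rows": rows}
    if fields:
        params["fl"] = ",".join(fields)  # restrict the fields returned per document
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical core and field names; the base URL assumes a local Solr instance.
url = solr_select_url("http://localhost:8983/solr", "webdocs",
                      "content:statistics", fields=["url", "title"])
print(url)
```

Changing `response_format` to `"csv"` or `"xml"` would request the same results in another of the formats listed above.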