NLP2RDF – http://aksw.org/Projects/NLP2RDF . Page 1 http://lod2.eu
NLP2RDF
Integration of Data, Tools and Applications with RDF/OWL in the Areas of Textmining and Linguistics
PhD Thesis, Sebastian Hellmann
Extensive Topic – What is the core?
Features for Machine Learning
Which features do I need for a certain Textmining task?
An introductory example:
Resources:
• Face recognition tool that detects eye color (brown, green, blue) and type of haircut (Vo-ku-hi-la, mullet, GI Joe)
• Database with age and occupation
Goal: predict the income of persons
• Young students earn less than old CEOs.
=> Eye color and haircut are probably irrelevant!
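The intuition can be checked on a handful of records. The following is a minimal sketch, assuming invented toy data and one simple relevance score (the share of income variance explained by grouping on a feature); it is not the method used in the thesis.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Invented toy records: (eye_color, haircut, occupation, income)
PERSONS = [
    ("brown", "mullet", "student", 800),
    ("blue", "gi_joe", "student", 900),
    ("green", "mullet", "student", 1000),
    ("brown", "gi_joe", "ceo", 9000),
    ("blue", "mullet", "ceo", 10000),
    ("green", "gi_joe", "ceo", 11000),
]
# Column index of each candidate feature in a record.
FEATURES = {"eye_color": 0, "haircut": 1, "occupation": 2}


def explained_variance(column: int) -> float:
    """Share of income variance explained by grouping on one feature (0..1)."""
    incomes = [person[3] for person in PERSONS]
    groups = defaultdict(list)
    for person in PERSONS:
        groups[person[column]].append(person[3])
    overall = mean(incomes)
    # Between-group sum of squares divided by total sum of squares.
    between = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values())
    return between / (len(incomes) * pvariance(incomes))


scores = {name: explained_variance(col) for name, col in FEATURES.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

On this toy data occupation explains almost all of the variance, while eye color and haircut explain very little, which matches the intuition on the slide.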
Basic idea: a benchmarking framework
Input:
• Task specification
• Text
• Training/test data
Output:
• Tools and data required to solve the task
Do I need POS tags to classify Tourism documents?
Prerequisites:
• Tools and applications need a standardized interface
• Data needs a standardized format
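A rough sketch of how such a benchmark could look once every tool sits behind the same interface. Everything here is an assumption made for illustration: the `Tool` signature, the two toy tools, the naive overlap classifier and the tiny data set are hypothetical and not part of NLP2RDF.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical standardized interface: a tool maps a text to a set of features.
Tool = Callable[[str], FrozenSet[str]]
# Hypothetical task data: (text, label) pairs.
Dataset = List[Tuple[str, str]]


def bag_of_words(text: str) -> FrozenSet[str]:
    return frozenset("word=" + word.lower() for word in text.split())


def text_length(text: str) -> FrozenSet[str]:
    return frozenset({"long" if len(text.split()) > 4 else "short"})


def accuracy(tools: List[Tool], train: Dataset, test: Dataset) -> float:
    """Score one tool combination with a naive feature-overlap classifier."""
    def featurize(text: str) -> FrozenSet[str]:
        return frozenset().union(*(tool(text) for tool in tools))

    seen = [(featurize(text), label) for text, label in train]
    hits = 0
    for text, gold in test:
        feats = featurize(text)
        # Predict the label of the training text sharing the most features.
        predicted = max(seen, key=lambda pair: len(pair[0] & feats))[1]
        hits += predicted == gold
    return hits / len(test)


def benchmark(tools: Dict[str, Tool], train: Dataset, test: Dataset):
    """Evaluate every non-empty combination of tools on the task."""
    results = {}
    for size in range(1, len(tools) + 1):
        for names in combinations(sorted(tools), size):
            results[names] = accuracy([tools[n] for n in names], train, test)
    return results


TRAIN = [("Book a hotel near the beach", "tourism"),
         ("The parser builds a syntax tree", "nlp")]
TEST = [("Cheap hotel with beach view", "tourism"),
        ("A syntax tree for every sentence", "nlp")]
results = benchmark({"bag_of_words": bag_of_words, "text_length": text_length},
                    TRAIN, TEST)
```

If adding a tool to a combination does not improve the score, the task probably does not need it. The same loop would answer the POS tag question once a POS tagger is wrapped as a `Tool`.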
Basic idea: a benchmarking framework
NLP2RDF stack
Basic idea: a benchmarking framework
• Google Code project was created
• Stanford parser was integrated
• Ontologies were found and integrated
• Pipeline implemented
• Plugin system implemented
• Some results were achieved
But…
• Architecture not flexible enough (Pipeline)
• Integration bound to Java
• Data sources were not sufficient
• Wikipedia/DBpedia too coarse-grained
• Speed of integration too slow
Prerequisites
One step back:
1. Creation of data sets in RDF
2. Data integration and linking of data sets
3. Licences
4. Standardized format for tool integration
5. Acquisition of additional knowledge
Why RDF and OWL ?
1. RDF makes data integration easy: URIref, LinkedData
2. OWL is based on Description Logics (Guarded Fragment)
3. Availability of open data sets (access and licence)
4. Diverse serializations for annotations: XML, Turtle, RDFa+XHTML
5. Scalable tool support (Databases, Reasoning)
6. If the only tool you have is a hammer, everything looks like a nail.
LOD Cloud - over 26 Billion Facts
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
DBpedia is central:
• Cross-domain
• Crystallization point (early bird)
Simplified:
• Circles are Database Tables
• Links are HTTP-Foreign Keys
LinkedData
http://www4.wiwiss.fu-berlin.de/rdf_browser/?browse_uri=http%3A%2F%2Fdata.nytimes.com%2FN12930380387917339601
Resembles a database table
Key-Value pairs
Values can be:
• Datatypes (Strings, Integers)
• URIs pointing to subjects in the same table
• URIs pointing to subjects in any other table
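The table analogy can be sketched with plain dictionaries. All URIs, keys and values below are made up for illustration (they are not real NYT or DBpedia data), and `dereference` only stands in for an HTTP lookup.

```python
# Each "table" is keyed by subject URI; each row is a set of key-value pairs.
# All URIs and values are invented placeholders.
NEWS = {
    "http://example.org/news/CityA": {
        "label": "City A",                              # datatype value
        "sameAs": "http://example.org/facts/CityA",     # URI into another table
    },
}
FACTS = {
    "http://example.org/facts/CityA": {
        "population": 123456,                           # datatype value
        "country": "http://example.org/facts/CountryB",  # URI into the same table
    },
    "http://example.org/facts/CountryB": {"label": "Country B"},
}
TABLES = [NEWS, FACTS]


def dereference(uri):
    """Stand-in for an HTTP lookup: return the row stored under a URI."""
    for table in TABLES:
        if uri in table:
            return table[uri]
    return None


def follow(uri, *keys):
    """Follow a chain of keys, treating each URI value as a foreign key."""
    value = uri
    for key in keys:
        row = dereference(value)
        if row is None:
            return None
        value = row.get(key)
    return value


country_label = follow("http://example.org/news/CityA", "sameAs", "country", "label")
```

Here `follow` crosses from one table into another and back within the second one, which is the "HTTP foreign key" idea in miniature.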
SPARQL – optimizations for table joins
All soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants
http://tinyurl.com/2uhuow9
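Each condition in such a question becomes a triple pattern, and patterns that share a variable behave like table joins. The sketch below shows this with a naive nested-loop join in Python. The triples and predicate names are invented placeholders (not the DBpedia vocabulary and not the query behind the link above), and a real SPARQL engine would use indexes and join reordering instead.

```python
# Invented toy triples (subject, predicate, object).
TRIPLES = [
    ("player1", "position", "goalkeeper"),
    ("player1", "club", "club1"),
    ("player1", "birthCountry", "country1"),
    ("player2", "position", "goalkeeper"),
    ("player2", "club", "club2"),
    ("player2", "birthCountry", "country1"),
    ("player3", "position", "striker"),
    ("player3", "club", "club1"),
    ("player3", "birthCountry", "country1"),
    ("club1", "stadium", "stadium1"),
    ("club2", "stadium", "stadium2"),
    ("stadium1", "seats", 60000),
    ("stadium2", "seats", 15000),
    ("country1", "population", 80000000),
]


def is_variable(term):
    return isinstance(term, str) and term.startswith("?")


def match(pattern, binding):
    """Yield every extension of binding that makes pattern match a triple."""
    for triple in TRIPLES:
        candidate = dict(binding)
        for term, value in zip(pattern, triple):
            if is_variable(term):
                # Bind the variable, or reject if it is already bound differently.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield candidate


def query(patterns, filters=()):
    """Join all patterns on shared variables, then apply the filters."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, b)]
    return [b for b in bindings if all(check(b) for check in filters)]


result = query(
    [
        ("?player", "position", "goalkeeper"),
        ("?player", "club", "?club"),
        ("?club", "stadium", "?stadium"),
        ("?stadium", "seats", "?seats"),
        ("?player", "birthCountry", "?country"),
        ("?country", "population", "?inhabitants"),
    ],
    filters=[
        lambda b: b["?seats"] > 40000,
        lambda b: b["?inhabitants"] > 10_000_000,
    ],
)
players = [b["?player"] for b in result]
```

The order of the patterns matters a lot for a naive join like this one: starting with the most selective pattern keeps the intermediate binding lists small.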
Creation of data sets: Wiktionary2RDF
http://en.wiktionary.org/wiki/house
• Covers 170 languages
• Total of 10 million pages
• 900,000 users
• RDF dump will increase the number of editors
• Same properties as Wikipedia (stable identifiers)
• Hundreds of Wiktionary parsers (especially for English)
• Information is trapped in the wiki
• Structure changes make software obsolete
Why try it again?
• DBpedia Extraction Framework is very mature (5 years, 15 developers)
• Configuration over code: templates will allow Wiktionarians to update parsers
• Early contact with the community
Creation of data sets: Wortschatz
Converted in 2009:
Matthias Quasthoff, Sebastian Hellmann and Konrad Höffner: Standardized Multilingual Language Resources for the Web of Data
http://corpora.uni-leipzig.de/rdf
3rd prize at the LOD Triplification Challenge, Graz, 2009
What was missing?
• Research questions
• Use cases
• Other data sets to link to!
• Wikipedia is not suited as a linking partner
• No servers
Wiktionary, Wortschatz and OLiA can become the crystallization point for a Linguistic Linked Data Web
Four major types:
• Lexical Semantic Resources
• Dictionaries
• Corpora
• Schemas/Ontologies
Interlinking Wortschatz: Research and Use Case
Iterated co-occurrences can be computed with SPARQL
Wiktionary and Wortschatz can be loaded into the same database
Interesting questions:
• What is the overlap and coverage?
• Which Wiktionary relation can be linked to which statistical relation?
• Can we build tools that help Wiktionary editors (suggestions)?
• Wiktionary links words across languages. Are there any similar patterns?
• Can we validate the Wiktionary RDF dump with Wortschatz?
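To illustrate what "iterated" means here, the sketch below computes co-occurrences in Python rather than SPARQL. It assumes a simplified reading: two words co-occur if they share a sentence, and the next order is obtained by treating every word's co-occurrence set as a new sentence. The sentences are invented and no significance filtering is applied.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrences(sentences):
    """Map each word to the words that share at least one 'sentence' with it."""
    result = defaultdict(set)
    for sentence in sentences:
        for a, b in combinations(sorted(set(sentence)), 2):
            result[a].add(b)
            result[b].add(a)
    return dict(result)


def iterate(cooc):
    """Next order: treat every word's co-occurrence set as a new sentence."""
    return cooccurrences(cooc.values())


# Invented toy corpus.
SENTENCES = [["doctor", "hospital"], ["nurse", "hospital"], ["bank", "money"]]
first = cooccurrences(SENTENCES)
second = iterate(first)
```

In this toy corpus "doctor" and "nurse" never share a sentence, yet they become second-order co-occurrents because both co-occur with "hospital".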
Open Licences – Focus of LOD2 and OKFN
http://ckan.net/
CKAN is an open registry of data and content packages. Harnessing the CKAN software, this site makes it easy to find, share and reuse content and data, especially in ways that are machine automatable.
Working Group on Open Data in Linguistics
http://wiki.okfn.org/wg/linguistics
• Founded in Nov 2010
• 6-7 members
• Membership open, please join
Standardized Formats: Part 1 – Corpora
http://www.sfb632.uni-potsdam.de/~d1/paula/doc/
PAULA XML is the Potsdamer Austauschformat für linguistische Annotation ("Potsdam Interchange Format for Linguistic Annotation"). It is an XML-based standoff representation format, which has been designed to represent data with heterogeneous annotation layers produced by different tools. For visualization and querying of PAULA XML data, the database ANNIS can be used.
Christian Chiarcos at work: PAULA will become POWLA and will be used to represent corpus annotations.
Standardized Formats: Part 2 – the Web
The bottom layer of the NLP2RDF stack can be reused: an ontology to represent Strings (formerly the SSO).
In his latest book, Wikinomics, Don Tapscott explains deep changes in technology, demographics and business.
• URIs to represent Strings, e.g. http://nlp2rdf.org/example/Don_Tapscott
• Relation between Strings: previous, next, sub, super
• http://nlp2rdf.org/example/Don is a subString of the above
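A small sketch of how such String URIs and relations could be generated. The URI prefix comes from the slide; the plain relation labels ("sub", "super", "next", "previous"), their direction and the whitespace tokenization are assumptions made for illustration, not the actual ontology terms.

```python
from urllib.parse import quote

PREFIX = "http://nlp2rdf.org/example/"  # prefix as shown on the slide


def string_uri(text):
    """Mint a URI for a string: spaces become underscores, the rest is escaped."""
    return PREFIX + quote(text.replace(" ", "_"))


def string_triples(phrase):
    """Relate a phrase to its tokens and the tokens to their neighbours.

    Assumed reading of the relations: (whole, "sub", part) and
    (part, "super", whole); (left, "next", right) and (right, "previous", left).
    """
    tokens = phrase.split()
    whole = string_uri(phrase)
    triples = []
    for token in tokens:
        part = string_uri(token)
        if part != whole:
            triples.append((whole, "sub", part))
            triples.append((part, "super", whole))
    for left, right in zip(tokens, tokens[1:]):
        triples.append((string_uri(left), "next", string_uri(right)))
        triples.append((string_uri(right), "previous", string_uri(left)))
    return triples


triples = string_triples("Don Tapscott")
```

One limitation of this sketch: a string that occurs twice in a text gets the same URI both times, so it cannot tell the two occurrences apart.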
Standardized Formats: Part 2 – the Web
• RDFa allows for flexible in-line annotations
• Multiple services can be ad-hoc integrated
• Multiple layers of annotation can be used
• Full compatibility with POWLA
• Trade-off between flexibility and speed
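As a rough illustration of an in-line annotation, the sketch below wraps one phrase of the example sentence in a span carrying the RDFa attributes `about` and `typeof`. The class URI `http://example.org/ontology/Person` is a hypothetical placeholder, full URIs in `typeof` assume RDFa 1.1, and nested or overlapping layers are not handled.

```python
from html import escape


def annotate_inline(text, phrase, about, typeof):
    """Wrap the first occurrence of phrase in a span with RDFa attributes."""
    start = text.find(phrase)
    if start < 0:
        return escape(text)  # phrase not found: return the text unannotated
    end = start + len(phrase)
    span = '<span about="{}" typeof="{}">{}</span>'.format(
        escape(about), escape(typeof), escape(phrase)
    )
    return escape(text[:start]) + span + escape(text[end:])


SENTENCE = ("In his latest book, Wikinomics, Don Tapscott explains deep changes "
            "in technology, demographics and business.")
xhtml = annotate_inline(
    SENTENCE,
    "Don Tapscott",
    about="http://nlp2rdf.org/example/Don_Tapscott",
    typeof="http://example.org/ontology/Person",  # hypothetical class URI
)
```

Because the annotation lives inside the text, several services can each add their own spans to the same document, at the cost of parsing markup on every read.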
Knowledge Acquisition
Tiger Corpus Navigator
Ontology Learning
Johanna Völker – Learning Expressive Ontologies (LExO)
# Example:
# A fish is any aquatic vertebrate animal that is covered with scales,
# and equipped with two sets of paired fins and several unpaired fins.
#
# [fish] subClassOf [any aquatic vertebrate animal that is covered …]
#Construct {?sub rdfs:subClassOf ?super} {
Construct {?sub owl:equivalentClass ?super} {
  ?is a penn:BePresentTense .
  ?is nlp:superToken ?is_any_aquatic_ .
  ?is_any_aquatic_ a olia:VerbPhrase .
  ?is_any_aquatic_ nlp:syntacticSubToken [ nlp:normUri ?super ] .
  ?animal nlp:cop ?is .
  ?animal nlp:nsubj ?fish .
  ?fish nlp:superToken [ nlp:normUri ?sub ] .
}
Standing on the shoulders of giants
Markus Strohmaier, TU Graz
Johanna Völker, Uni Mannheim
Christian Chiarcos, SFB632 - Uni Potsdam
Sören Auer, Uni Leipzig
Jens Lehmann, Uni Leipzig
Thank you for your attention