Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

TR DISCOVER

DeZhao Song, Frank Schilder, Charese Smiley…
TR Corporate Research and Development, Eagan, Minnesota

Chris Brew
TR Corporate Research & Development, London

ML Prague, April 23rd 2016

Outline

• TR Discover: NLP as part of the solution to a business problem.

– Problem

– Technologies used

– Demonstration

– Reflections

• What is it like to be a scientist working in a business setting?

About me

• B.Sc Chemistry, Bristol
• Search Examiner, European Patent Office, Berlin, Germany
• M.Sc and D.Phil, Sussex, with Steve Isard in EP
• Postdoc at Edinburgh, Scotland
• Sharp Laboratories of Europe, Oxford
• Research (and faculty-ish) positions at Edinburgh
• Core faculty in Linguistics and CSE, OSU, Columbus OH, USA
• Educational Testing Service, Princeton, NJ, USA
• Nuance Communications, Sunnyvale, CA, USA
• Thomson Reuters Corporate Research, London, England

Disclaimer

All opinions are my own, and do not reflect official positions of The Thomson Reuters Corporation

Thomson Reuters’ Business

• Offer people information that they value enough to pay for.
• Professional users
• Many products, each catering for its own market segment.

Thomson Reuters’ Business

• Not an internet company with tens of millions of users
• Company is set up to build long-term trust relationships with clients.
– Many product managers and marketers, who are highly expert in maintaining these relationships
– Privacy and data security are crucial.
• Cannot be “one size fits all”.
• Role of technology is to support and improve products. Primary responsibility for daily support lies with the Platform Group.

Who are Corporate Research & Development?

• Mission: To support Thomson Reuters by carrying out applied research relevant to our businesses.
• Team of approx. 40 researchers, developers, managers, administrators, architects
• Distributed group:
– Rochester, NY, USA
– Eagan, MN, USA
– New York City, NY, USA (3 Times Square, 12th floor)
– London, UK (1 Mark Square), opened in August 2013

The business context

Financial and Risk

IP&Science

The technology context

• Databases (mainly SQL, mainly Oracle)
• Search (mainly Elasticsearch, built on top of Lucene)
• Virtualized servers in data centers
• Front ends mostly in JavaScript with AngularJS + components to aid branding via common look and feel.
• Back ends mostly Java.
• Products often consist of a bundle of related capabilities, packaged together to help potential users understand.

BOLD

• Big
• Open
• Linked
• Data

The Knowledge Graph

Experts and non-experts

• There are also expert and non-expert professional users
– Cortellis (product for drug companies)
    • First-time user. Asks broad questions. No idea what is available. Needs whatever guidance we can give.
    • Expert user. Knows roughly what is available, but may need help locating what they want.
– Common thread: users are trying to do something specific, such as a market overview, a comparison, or verification of a hunch about a trend. Give them a data visualization, not just raw data.

Expert user

Natural Language Query: TR-Discover

• Keyword-based search is not enough to express user intent.
• What if the user could type queries, and be guided towards things that our system can answer?
– Experts and first-timers alike can access through NL
– Enables discovery of data
– Capture of user intent allows well-targeted analytics
• This is not new: there have been NL database query systems since the 60s, but these tend to be hard-wired to specific databases and their schemas. We want a reusable tool.

Placeholder for demo. Available to you at http://cortellislabs.com/. You do have to register, but anyone can. NB: Beta version. Works, but has rough edges.

First-time user

Market technology trends

NER Sentiment Analytics

Comparing top 10 indications for companies for Drugs having a primary indication of pain

How we did it

• Feature-based context-free grammar, using the formalism of NLTK.
• Real logic-based formal semantics.
• Autosuggest based on grammar, logical form and heuristics derived from our databases.
• Query via translation from logic to SPARQL or SQL
– SQL is just for now, for efficiency.
– But we plan to keep logic as a separate level, not translate directly to query language.

The Grammar

• Feature-based context-free grammar, using the formalism of NLTK.
– Grammar captures selectional restrictions relevant to the drug domain.
– Adding a new domain should (mainly) be a matter of adding new lexical entries.

Grammar

• The word “drugs” is plural, and has λx.drug(x) as its semantics
• For now, prepositional phrases have features that enforce very tight attachment preferences. This is going to break, but OK for now.
• The type of a verb specifies both the potential subject-type and object-type, which can be used to filter out nonsensical questions like “drugs headquartered in the U.S.”.

G1: Nom → N
G2: NP → Nom
G3: NPbar → NP
G4: NPbar → NPbar VPbar
G5: VPbar → TV NPbar
Lex1: N[type=drug, num=pl, sem=<λx.drug(x)>] → 'drugs'
Lex2: TV[TYPE=[drug,org,dev], sem=<λX x.X(λy.dev_org_drug(y,x))>, tns=past, NUM=?n] → 'developed by'
Lex3: TV[TYPE=[org,country,hq], NUM=?n] → 'headquartered in'
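The type-based filtering described on this slide can be illustrated with a minimal sketch. It does not use NLTK; it only mimics the check that a verb's subject type must match the noun's type. The reading of TYPE=[a,b,c] as [subject type, object type, relation], the `Verb` class, the function `is_well_typed`, and the extra noun "companies" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verb:
    subject_type: str  # assumed: first element of TYPE=[...]
    object_type: str   # assumed: second element of TYPE=[...]
    relation: str      # assumed: third element, names the relation


# Noun types follow Lex1; "companies" is a hypothetical extra entry.
NOUN_TYPES = {"drugs": "drug", "companies": "org"}

# Verb entries follow the TYPE lists of Lex2 and Lex3.
VERBS = {
    "developed by": Verb("drug", "org", "dev"),
    "headquartered in": Verb("org", "country", "hq"),
}


def is_well_typed(noun: str, verb: str) -> bool:
    """Return True if the noun's type matches the verb's subject type."""
    return NOUN_TYPES[noun] == VERBS[verb].subject_type


print(is_well_typed("drugs", "developed by"))      # True
print(is_well_typed("drugs", "headquartered in"))  # False
```

In the real grammar this check would presumably fall out of feature unification during parsing rather than a separate lookup.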

Query translation input: Drugs developed by Merck

Query translation output: Drugs developed by Merck

• This was SPARQL. It works, but is by far the slowest part of the system
• Similar translation for SQL.
– In our demo, we can use the fact that we know all the words of the grammar to make the database small.
– This lets us replace the big costly knowledge graph with a single-file SQLite database.
• Yes, we know it won’t scale
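To make the SQL route concrete, here is a rough sketch of answering "Drugs developed by Merck" against a small SQLite database. The table layout (`company`, `drug`, `develops`), the placeholder rows, and the function names are assumptions for illustration, not the schema used in the demo.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical three-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE company (id INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE drug (id INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE develops (company_id INTEGER, drug_id INTEGER);
        """
    )
    # Placeholder rows only; these are not real product data.
    conn.executemany("INSERT INTO company VALUES (?, ?)",
                     [(1, "Merck"), (2, "ExampleCo")])
    conn.executemany("INSERT INTO drug VALUES (?, ?)",
                     [(1, "drug_a"), (2, "drug_b"), (3, "drug_c")])
    conn.executemany("INSERT INTO develops VALUES (?, ?)",
                     [(1, 1), (1, 2), (2, 3)])
    return conn


def drugs_developed_by(conn: sqlite3.Connection, company_label: str) -> list:
    """Return the labels of drugs developed by the named company."""
    rows = conn.execute(
        """
        SELECT d.label
        FROM drug AS d
        JOIN develops AS v ON v.drug_id = d.id
        JOIN company AS c ON c.id = v.company_id
        WHERE c.label = ?
        ORDER BY d.label
        """,
        (company_label,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
print(drugs_developed_by(conn, "Merck"))  # ['drug_a', 'drug_b']
```

Passing the company name as a bound parameter, rather than pasting it into the SQL string, keeps user-typed text out of the query itself.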

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX example: <http://www.example.com#>
select ?x
where {
  ?id042 rdfs:label 'Merck' .
  ?id042 rdf:type example:Company .
  ?x rdf:type example:Drug .
  ?id042 example:develops ?x .
}
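One way to picture the translation from a logical form to a query like the one above is the sketch below. The flat tuple encoding of the logical form (`label`, `type`, `rel` atoms) and the function names are assumptions for illustration; the talk does not say how the logic is represented internally. Quote escaping in labels is not handled here.

```python
PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX example: <http://www.example.com#>\n"
)


def atom_to_triple(atom: tuple) -> str:
    """Map one atom of the (hypothetical) logical form to a triple pattern."""
    kind = atom[0]
    if kind == "label":
        _, var, text = atom
        return f"?{var} rdfs:label '{text}' ."
    if kind == "type":
        _, var, cls = atom
        return f"?{var} rdf:type example:{cls} ."
    if kind == "rel":
        _, pred, subj, obj = atom
        return f"?{subj} example:{pred} ?{obj} ."
    raise ValueError(f"unknown atom kind: {kind}")


def to_sparql(select_var: str, atoms: list) -> str:
    """Assemble a SELECT query from a conjunction of atoms."""
    body = "\n".join("  " + atom_to_triple(a) for a in atoms)
    return f"{PREFIXES}SELECT ?{select_var}\nWHERE {{\n{body}\n}}"


# "Drugs developed by Merck" as a conjunction of atoms.
MERCK_ATOMS = [
    ("label", "id042", "Merck"),
    ("type", "id042", "Company"),
    ("type", "x", "Drug"),
    ("rel", "develops", "id042", "x"),
]
print(to_sparql("x", MERCK_ATOMS))
```

Keeping the atoms independent of the output syntax is what would let the same logical form be rendered as SQL instead.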

Autosuggest

• Use the grammar to calculate possible continuations from what we have so far.
– Currently this process does not use a full-fledged parser, and relies on the fact that the grammar is carefully engineered to minimize local and eradicate global ambiguity.
– I want to achieve a tighter integration with the parser, and generate predictions based on elements present in the parser’s chart, allowing more ambiguity
• Rank suggestions by preferring concepts that correspond to nodes in the RDF graph that are involved in many relationships.
– When we have large enough query logs we hope to add in an additional preference component based on a domain-specific n-gram language model.
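The two ideas on this slide, grammar-driven continuations and ranking by how connected a node is, can be sketched together as follows. This is a toy stand-in, not the system's algorithm: the three-step noun / verb / entity pattern, the lexicon, the entity names, and the edge list are all hypothetical.

```python
from collections import Counter

# Hypothetical lexicon: noun -> type, verb -> (subject type, object type).
NOUNS = {"drugs": "drug", "companies": "org"}
VERBS = {"developed by": ("drug", "org"), "headquartered in": ("org", "country")}
ENTITIES = {"Merck": "org", "ExampleCo": "org", "Exampleland": "country"}

# Toy graph edges (placeholder data); node degree drives the ranking.
EDGES = [
    ("Merck", "drug_a"), ("Merck", "drug_b"), ("Merck", "drug_c"),
    ("ExampleCo", "drug_d"), ("ExampleCo", "Exampleland"),
]
DEGREE = Counter()
for left, right in EDGES:
    DEGREE[left] += 1
    DEGREE[right] += 1


def suggest(prefix: list) -> list:
    """Suggest next tokens for a prefix of the form [noun, verb, ...]."""
    if not prefix:
        return sorted(NOUNS)
    if len(prefix) == 1:
        noun_type = NOUNS.get(prefix[0])
        # Only verbs whose subject type matches the noun are offered.
        return sorted(v for v, (subj, _) in VERBS.items() if subj == noun_type)
    if len(prefix) == 2 and prefix[1] in VERBS:
        object_type = VERBS[prefix[1]][1]
        candidates = [e for e, t in ENTITIES.items() if t == object_type]
        # Prefer entities involved in many relationships; break ties by name.
        return sorted(candidates, key=lambda e: (-DEGREE[e], e))
    return []


print(suggest(["drugs"]))                  # ['developed by']
print(suggest(["drugs", "developed by"]))  # ['Merck', 'ExampleCo']
```

An n-gram model trained on query logs, as mentioned above, would add a second term to the sort key.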

What is it like being a scientist in the business world?

• It varies with the DNA of the organization…
– ETS
– Nuance
– Thomson Reuters
