Thomson Reuters © 2014. Confidential. All Rights Reserved. No part of this document may be disclosed, reproduced or used in any form without the prior permission of Thomson Reuters.
TR DISCOVER
DeZhao Song, Frank Schilder, Charese Smiley…
TR Corporate Research and Development, Eagan, Minnesota
Chris Brew
TR Corporate Research & Development, London
ML Prague, April 23rd 2016

Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets



Page 1: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets


TR DISCOVER
DeZhao Song, Frank Schilder, Charese Smiley…
TR Corporate Research and Development, Eagan, Minnesota
Chris Brew
TR Corporate Research & Development, London
ML Prague, April 23rd 2016

Page 2: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Outline

• TR Discover: NLP as part of the solution to a business problem.

– Problem

– Technologies used

– Demonstration

– Reflections

• What is it like to be a scientist working in a business setting?

Page 3: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

About me
• B.Sc Chemistry, Bristol
• Search Examiner, European Patent Office, Berlin, Germany
• M.Sc and D.Phil, Sussex, with Steve Isard in EP
• Postdoc at Edinburgh, Scotland
• Sharp Laboratories of Europe, Oxford
• Research (and faculty-ish) positions at Edinburgh
• Core faculty in Linguistics and CSE, OSU, Columbus OH, USA
• Educational Testing Service, Princeton, NJ, USA
• Nuance Communications, Sunnyvale, CA, USA
• Thomson Reuters Corporate Research, London, England

Page 4: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Disclaimer

All opinions are my own, and do not reflect official positions of The Thomson Reuters Corporation

Page 5: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Thomson Reuters’ Business
• Offer people information that they value enough to pay for.
• Professional users
• Many products, each catering for its own market segment.

Page 6: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Thomson Reuters’ Business
• Not an internet company with tens of millions of users
• Company is set up to build long-term trust relationships with clients.
– Many product managers and marketers, who are highly expert in maintaining these relationships
– Privacy and data security are crucial.
• Cannot be “one size fits all”.
• Role of technology is to support and improve products. Primary responsibility for daily support lies with the Platform Group.

Page 7: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Who are Corporate Research & Development?
• Mission: To support Thomson Reuters by carrying out applied research relevant to our businesses.
• Team of approx. 40 researchers, developers, managers, administrators, architects
• Distributed group:
– Rochester, NY, USA
– Eagan, MN, USA
– New York City, NY, USA (3 Times Square, 12th floor)
– London, UK (1 Mark Square), opened in August 2013

Page 8: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

The business context

Financial and Risk

Legal

News

IP&Science

Page 9: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

The technology context
• Databases (mainly SQL, mainly Oracle)
• Search (mainly Elasticsearch, built on top of Lucene)
• Virtualized servers in data centers
• Front ends mostly in JavaScript with AngularJS + components to aid branding via common look and feel.
• Back ends mostly Java.
• Products often consist of a bundle of related capabilities, packaged together to help potential users understand.

Page 10: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

BOLD
• Big
• Open
• Linked
• Data

The Knowledge Graph

Page 11: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Experts and non-experts
• There are also expert and non-expert professional users
– Cortellis (product for drug companies)
  • First time user, asks broad questions. No idea what is available. Needs whatever guidance we can give.
  • Expert user. Knows roughly what is available, but may need help locating what they want.
– Common thread: users are trying to do something specific, such as a market overview, a comparison, or verification of a hunch about a trend. Give them a data visualization, not just raw data.

Page 12: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Expert user

Page 13: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Natural Language Query: TR-Discover
• Keyword-based search is not enough to express user intent.
• What if the user could type queries, and be guided towards things that our system can answer?
– Experts and first-timers alike can access through NL
– Enables discovery of data
– Capture of user intent allows well-targeted analytics
• This is not new: there have been NL database query systems since the 1960s, but these tend to be hard-wired to specific databases and their schemas. We want a reusable tool.

Page 14: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Placeholder for demo.
Available to you at http://cortellislabs.com/. You do have to register, but anyone can. NB: Beta version. Works, but has rough edges.

Page 15: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

First-time user

Page 16: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Market technology trends

Page 17: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

NER Sentiment Analytics

Page 18: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Comparing top 10 indications for companies for Drugs having a primary indication of pain

Page 19: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

How we did it
• Feature-based context-free grammar, using the formalism of NLTK.
• Real logic-based formal semantics.
• Autosuggest based on grammar, logical form and heuristics derived from our databases.
• Query via translation from logic to SPARQL or SQL
– SQL is just for now, for efficiency.
– But we plan to keep logic as a separate level, not translate directly to query language.

Page 20: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

The Grammar
• Feature-based context-free grammar, using the formalism of NLTK.
– Grammar captures selectional restrictions relevant to the drug domain.
– Adding a new domain should (mainly) be a matter of adding new lexical entries.

Page 21: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Grammar

• The word “drugs” is plural, and has λx.drug(x) as its semantics.
• For now, prepositional phrases have features that enforce very tight attachment preferences. This is going to break, but OK for now.
• The type of a verb specifies both the potential subject-type and object-type, which can be used to filter out nonsensical questions like “drugs headquartered in the U.S.”.

G1: Nom → N
G2: NP → Nom
G3: NPbar → NP
G4: NPbar → NPbar VPbar
G5: VPbar → TV NPbar
Lex1: N[type=drug, num=pl, sem=<λx.drug(x)>] → 'drugs'
Lex2: TV[TYPE=[drug,org,dev], sem=<λX x.X(λy.dev_org_drug(y,x))>, tns=past, NUM=?n] → 'developed by'
Lex3: TV[TYPE=[org,country,hq], NUM=?n] → 'headquartered in'
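The type-based filtering described on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the NLTK grammar itself: the dictionaries `NOUNS`, `VERBS` and `ENTITIES` and the function `is_well_typed` are hypothetical names, and the entries are toy examples modelled on Lex1 to Lex3.

```python
# Hypothetical mini-lexicon modelled on Lex1-Lex3; all names here are illustrative.
NOUNS = {
    "drugs": "drug",
    "companies": "org",
}

# Each verb maps to (subject type, object type), mirroring TYPE=[subject, object, relation].
VERBS = {
    "developed by": ("drug", "org"),
    "headquartered in": ("org", "country"),
}

# Toy named entities with their types (assumed, not taken from any real database).
ENTITIES = {
    "Merck": "org",
    "the U.S.": "country",
}


def is_well_typed(subject: str, verb: str, obj: str) -> bool:
    """Return True if the verb's selectional restrictions accept subject and object."""
    subject_type = NOUNS.get(subject)
    object_type = ENTITIES.get(obj)
    if verb not in VERBS or subject_type is None or object_type is None:
        return False
    expected_subject, expected_object = VERBS[verb]
    return subject_type == expected_subject and object_type == expected_object


if __name__ == "__main__":
    print(is_well_typed("drugs", "developed by", "Merck"))
    print(is_well_typed("drugs", "headquartered in", "the U.S."))
```

In the real system the same check falls out of feature unification in the grammar, so a question such as “drugs headquartered in the U.S.” simply fails to parse.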

Page 22: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Query translation input: Drugs developed by Merck

Page 23: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Query translation output: Drugs developed by Merck

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX example: <http://www.example.com#>
select ?x
where {
?id042 rdfs:label 'Merck' .
?id042 rdf:type example:Company .
?x rdf:type example:Drug .
?id042 example:develops ?x .
}

• This was SPARQL. It works, but is by far the slowest part of the system.
• Similar translation for SQL.
– In our demo, we can use the fact that we know all the words of the grammar to make the database small.
– This lets us replace the big costly knowledge graph with a single-file SQLite database.
• Yes, we know it won’t scale.
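The translation step can be sketched as follows, assuming the logical form is available as a flat list of (predicate, subject, object) atoms. The representation, the function names `to_sparql`, `to_sql` and `answer_with_sqlite`, the single `triples` table and the toy data are all assumptions made for illustration; the actual system's logical forms and database schema are not shown in the slides.

```python
import sqlite3

# Logical form as (predicate, subject, object) atoms; terms starting with "?" are variables.
# Predicate names and the toy data below are illustrative assumptions.
LOGICAL_FORM = [
    ("label", "?c", "Merck"),
    ("type", "?c", "Company"),
    ("type", "?x", "Drug"),
    ("develops", "?c", "?x"),
]

PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX example: <http://www.example.com#>\n"
)

TOY_TRIPLES = [
    ("c1", "label", "Merck"), ("c1", "type", "Company"), ("c1", "develops", "drug_a"),
    ("c2", "label", "OtherCo"), ("c2", "type", "Company"), ("c2", "develops", "drug_b"),
    ("drug_a", "type", "Drug"), ("drug_b", "type", "Drug"),
]


def to_sparql(atoms, answer="?x"):
    """Render the atoms as a SPARQL SELECT query."""
    patterns = []
    for pred, subj, obj in atoms:
        if pred == "label":
            patterns.append(f"  {subj} rdfs:label '{obj}' .")
        elif pred == "type":
            patterns.append(f"  {subj} rdf:type example:{obj} .")
        else:
            patterns.append(f"  {subj} example:{pred} {obj} .")
    return f"{PREFIXES}SELECT {answer}\nWHERE {{\n" + "\n".join(patterns) + "\n}"


def to_sql(atoms, answer="?x"):
    """Render the same atoms as a self-join over a triples(s, p, o) table."""
    tables, conditions, params, bound = [], [], [], {}
    for i, (pred, subj, obj) in enumerate(atoms):
        alias = f"t{i}"
        tables.append(f"triples AS {alias}")
        conditions.append(f"{alias}.p = ?")
        params.append(pred)
        for column, term in (("s", subj), ("o", obj)):
            ref = f"{alias}.{column}"
            if not term.startswith("?"):
                conditions.append(f"{ref} = ?")  # constant: bind as a parameter
                params.append(term)
            elif term in bound:
                conditions.append(f"{ref} = {bound[term]}")  # repeated variable: join
            else:
                bound[term] = ref  # first occurrence of a variable
    sql = (f"SELECT DISTINCT {bound[answer]} FROM " + ", ".join(tables)
           + " WHERE " + " AND ".join(conditions))
    return sql, params


def answer_with_sqlite(atoms, triples, answer="?x"):
    """Load the toy triples into an in-memory SQLite database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    sql, params = to_sql(atoms, answer)
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return sorted(row[0] for row in rows)


if __name__ == "__main__":
    print(to_sparql(LOGICAL_FORM))
    print(answer_with_sqlite(LOGICAL_FORM, TOY_TRIPLES))
```

Keeping the atoms as a separate level means both back ends are generated from the same logical form, which matches the stated plan of not translating directly from language to query language.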

Page 24: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Autosuggest
• Use the grammar to calculate possible continuations from what we have so far.
– Currently this process does not use a full-fledged parser, and relies on the fact that the grammar is carefully engineered to minimize local and eradicate global ambiguity.
– I want to achieve a tighter integration with the parser, and generate predictions based on elements present in the parser’s chart, allowing more ambiguity.
• Rank suggestions by preferring concepts that correspond to nodes in the RDF graph that are involved in many relationships.
– When we have large enough query logs we hope to add in an additional preference component based on a domain-specific n-gram language model.
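A minimal sketch of the two ideas on this slide, typed continuation followed by ranking on graph degree, might look like this. It assumes a deliberately tiny grammar of the shape noun, verb, entity; the lexicon, the `EDGES` list and the functions `node_degrees` and `suggest` are hypothetical and stand in for the real grammar and RDF graph.

```python
from collections import Counter
from typing import List

# Hypothetical lexicon and toy graph; none of this comes from the real system.
NOUNS = {"drugs": "drug", "companies": "org"}
VERBS = {"developed by": ("drug", "org"), "headquartered in": ("org", "country")}
ENTITIES = {"Merck": "org", "OtherCo": "org", "the U.S.": "country"}

EDGES = [
    ("Merck", "develops", "drug_a"),
    ("Merck", "develops", "drug_b"),
    ("Merck", "headquartered_in", "the U.S."),
    ("OtherCo", "develops", "drug_c"),
]


def node_degrees(edges) -> Counter:
    """Count how many relationships each node takes part in."""
    degrees = Counter()
    for subj, _pred, obj in edges:
        degrees[subj] += 1
        degrees[obj] += 1
    return degrees


def suggest(prefix: List[str]) -> List[str]:
    """Return ranked continuations for a prefix given as a list of phrases."""
    if not prefix:
        return sorted(NOUNS)
    head_type = NOUNS.get(prefix[0])
    if head_type is None:
        return []
    if len(prefix) == 1:
        # Only verbs whose subject type matches the head noun may follow.
        return sorted(v for v, (subj_type, _) in VERBS.items() if subj_type == head_type)
    if len(prefix) == 2 and prefix[1] in VERBS:
        subj_type, obj_type = VERBS[prefix[1]]
        if subj_type != head_type:
            return []
        degrees = node_degrees(EDGES)
        candidates = [name for name, kind in ENTITIES.items() if kind == obj_type]
        # Prefer well-connected nodes; break ties alphabetically.
        return sorted(candidates, key=lambda name: (-degrees[name], name))
    return []


if __name__ == "__main__":
    print(suggest(["drugs"]))
    print(suggest(["drugs", "developed by"]))
```

A chart-based version would replace the fixed three-slot pattern with predictions read off the parser's incomplete edges, and an n-gram preference could be added as a further term in the sort key.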

Page 25: Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

What is it like being a scientist in the business world?
• It varies with the DNA of the organization…
– ETS
– Nuance
– Thomson Reuters