
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (System Demonstrations), pages 114–119, Brussels, Belgium, October 31–November 4, 2018. ©2018 Association for Computational Linguistics


Interactive Instance-based Evaluation of Knowledge Base Question Answering

Daniil Sorokin and Iryna Gurevych
Ubiquitous Knowledge Processing Lab (UKP) and Research Training Group AIPHES

Department of Computer Science, Technische Universität Darmstadt
www.ukp.tu-darmstadt.de

Abstract

Most approaches to Knowledge Base Question Answering are based on semantic parsing. In this paper, we present a tool that aids in the debugging of question answering systems that construct a structured semantic representation for the input question. Previous work has largely focused on building question answering interfaces or evaluation frameworks that unify multiple data sets. The primary objective of our system is to enable interactive debugging of model predictions on individual instances (questions) and to simplify manual error analysis. Our interactive interface helps researchers to understand the shortcomings of a particular model, qualitatively analyze the complete pipeline and compare different models. A set of sit-by sessions was used to validate our interface design.

1 Introduction

Knowledge base question answering (QA) is an important natural language processing problem. Given a natural language question, the task is to find a set of entities in a knowledge base (KB) that constitutes the answer. For example, for a question "Who played Princess Leia?" the answer, "Carrie Fisher", could be retrieved from a general-purpose KB, such as Wikidata[1]. A successful KB QA system would ultimately provide a universally accessible natural language interface to factual knowledge (Liang, 2016).

KB QA requires a precise modeling of the question semantics through the entities and relations available in the KB in order to retrieve the correct answer. It is common to break down the task into three main steps: entity linking, semantic parsing or relation disambiguation, and answer retrieval. We show in Figure 1 how the outlined steps lead to an answer on an example question and how the output of each step is re-used in the next one.

[1] https://www.wikidata.org/

Figure 1: Typical steps undertaken by a QA system, illustrated on the input question "what are taylor swift's albums?": (1) entity linking (Taylor Swift → Q462, album → Q24951125), (2) semantic parsing into a graph that connects the question variable q to Q462 via PERFORMER and to Q24951125 via INSTANCE OF, and (3) answer retrieval from Wikidata (Red, 1989, …). Gray dashed arrows show how the output of the previous step is passed into the next one. Qxxx stands for a KB identifier.

This approach is adopted by most of the recent work on KB QA (Berant and Liang, 2014; Reddy et al., 2016; Yih et al., 2015; Peng et al., 2017; Sorokin and Gurevych, 2018b). The multi-step pipeline poses particular challenges for error analysis, since many unique errors can arise at different processing stages. With our tool, we aim at supporting the evaluation of QA systems and helping to identify problems that do not necessarily form generalisable error patterns, but hinder the overall system performance nonetheless.
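To make the data flow concrete, the following minimal Python sketch traces the three stages on the Figure 1 example. All function bodies are illustrative stubs rather than the system's actual code; the identifiers (Q462, Q24951125, PERFORMER) are taken from the figure.

def link_entities(question: str) -> dict:
    # 1. Entity linking: map mentions in the question to KB identifiers.
    return {"taylor swift": "Q462", "albums": "Q24951125"}

def build_graph(entities: dict) -> dict:
    # 2. Semantic parsing: attach the entities to the question variable ?q.
    return {"select": "?q",
            "edges": [("?q", "PERFORMER", entities["taylor swift"]),
                      ("?q", "INSTANCE_OF", entities["albums"])]}

def retrieve_answers(graph: dict) -> list:
    # 3. Answer retrieval: translate the graph into a KB query (stubbed here).
    return ["Red", "1989"]

print(retrieve_answers(build_graph(link_entities("what are taylor swift's albums?"))))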

Some frameworks have been introduced recently to streamline the evaluation of KB QA systems. The ParlAI framework focuses on building a unified interface for multiple QA data sets (Miller et al., 2017), while GerbilQA[2] introduces evaluation of individual steps of a QA pipeline. However, none of them addresses an interactive debugging scenario that can be used by researchers to do instance-based error analysis. This is especially relevant in the context of such benchmarks as QALD, where each individual question is meant to test a particular aspect of the system and debugging individual instances is crucial for understanding the system's performance (Unger et al., 2016).

Another set of tools has focused on building infrastructure to support the development of KB QA systems. Ask Wikidata[3] offers an easy way to post queries to Wikidata via a web-based interface, though the tool relies on manual disambiguation to understand a question. The WDAqua project[4] has produced a speech-based plug-in interface (Kumar et al., 2017) and the Qanary specification for QA systems (Singh et al., 2016). These tools follow the described steps of the QA pipeline, but do not facilitate the interactive instance-based evaluation that is the main aspect of this work.

Main contribution In this paper, we present a modular debugging application for KB QA that can be used to manually evaluate the main steps of a QA pipeline. Our system focuses on the analysis of individual examples and a detailed view of possible causes of errors, so that individual error propagation cases can be identified.

Demo system and the code A demo instance with the default QA model is running at the following URL: http://semanticparsing.ukp.informatik.tu-darmstadt.de:5000/question-answering/. Our system is freely available at https://github.com/UKPLab/emnlp2018-question-answering-interface.

2 A prototypical QA pipeline and the requirements

The first stage of every QA approach is entity linking (EL), that is, the identification of entity mentions in the question and linking them to entities in the KB. In Figure 1, two entity mentions are detected and linked to their KB referents. According to multiple error analyses, entity linking is a common source of errors in a QA system (Berant and Liang, 2014; Reddy et al., 2016).

[2] http://aksw.org/Projects/GERBIL.html
[3] https://tools.wmflabs.org/bene/ask/
[4] http://wdaqua.eu

Figure 2: Overview of the system architecture. The debugging UI sends the question to the back-end services (preprocessing, entity linking, QA model) via HTTP REST and receives answers, entities, graphs and weights in return; the back-end relies on CoreNLP and Wikidata.

The entity linking stage is followed by semantic parsing, which consists of combining the extracted entities into a single structured meaning representation. The entities are connected with semantic relations to a special question variable that denotes the answer to the question (Yih et al., 2015).

Given the ambiguity of natural language, a semantic parsing model constructs multiple representations that can match the question and assigns probabilities to them (Liang, 2016). It is common to learn a vector encoding for the question and the structured representations and then use a similarity function to compute the probabilities (Yih et al., 2015; Reddy et al., 2016). The most probable structured representation is then translated into a query and used to extract the answer from the KB.
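As a toy illustration of this scoring scheme (not the integrated model itself), one can encode the question and each candidate representation as vectors, compare them with cosine similarity, and turn the scores into probabilities with a softmax:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Similarity between a question encoding and a graph encoding.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question_vec = np.array([0.9, 0.1, 0.3])           # toy question encoding
graph_vecs = [np.array([0.8, 0.2, 0.4]),           # candidate graph 1
              np.array([0.1, 0.9, 0.2])]           # candidate graph 2

scores = np.array([cosine(question_vec, g) for g in graph_vecs])
probs = np.exp(scores) / np.exp(scores).sum()      # softmax over candidates
best = int(np.argmax(probs))                       # graph sent to the KB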

Some approaches circumvent building a structured representation and instead directly compose vector encodings of the potential answers (Dong et al., 2015). Since this is a less common architecture type for KB QA, we focused on semantic parsing approaches while developing our interface.

The described pipeline lets us outline the main requirements for an interactive debugging tool:

1. It needs to represent all stages of the QA pipeline in a sequential manner to let the user identify where the error occurs and how it propagates.

2. It needs to account for the specific properties of semantic parsing approaches to KB QA, such as structured semantic representations.

3. It needs to include an analysis block that shows if the model has learned meaningful vector representations.


entity.linking:
  model: models/elmodel.pkl
  max.entity.options: 5
  max.ngram.len: 4

model:
  model.file: models/qamodel.pkl

evaluation:
  max.num.entities: 3
  beam.size: 10
  ...

wikidata:
  backend: localhost:8890/sparql
  timeout: 10
  ...

Listing 1: A snippet of the YAML configuration file with the main modules and settings
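Assuming the file follows standard YAML syntax (the ellipses above elide further options), a back-end module could read its settings roughly like this; the exact loading code in the repository may differ:

import yaml  # PyYAML

# Hypothetical loading sketch; the key names are taken from Listing 1.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

beam_size = config["evaluation"]["beam.size"]    # -> 10
endpoint = config["wikidata"]["backend"]         # -> localhost:8890/sparql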

3 System overview

Our system consists of a web-based front-end and a set of back-end services that communicate through an HTTP REST API (see Figure 2). The front-end contains the interactive debugging user interface (Section 4). A separate request is sent to the back-end service for each processing step of the QA pipeline. Thus, we are able to show the results to the user as they are being delivered by the back-end services. The front-end is responsible for aggregating and visualizing the information after each step. In case any service fails, a partial result from the previous steps is still available to the user.
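The per-step communication pattern can be sketched as follows; the endpoint paths and payloads here are hypothetical, chosen only to show how partial results become available as each service responds:

import requests

BASE = "http://localhost:5000"  # hypothetical back-end address

payload = {"question": "what are taylor swift's albums?"}
# One request per pipeline stage: the UI can already render the output of
# an earlier stage while a later service is still running or has failed.
tokens = requests.post(f"{BASE}/preprocess", json=payload).json()
entities = requests.post(f"{BASE}/link-entities", json=tokens).json()
graphs = requests.post(f"{BASE}/parse", json=entities).json()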

The back-end services include the pre-processing, the entity linking and the semantic parsing modules. The pre-processing module performs tokenization and part-of-speech tagging using the Stanford CoreNLP toolkit (Manning et al., 2014). The entity linking module recognizes mentions of KB entities in the input question. We provide a default model for entity linking on Wikidata, a freely available feature-based implementation of Sorokin and Gurevych (2018a).

The semantic parsing module includes the construction of structured semantic representations and a learned model that selects the correct representation. The integrated model uses a convolutional neural network architecture to learn vector encodings for questions and semantic representations (Sorokin and Gurevych, 2017). We provide an integrated model for demonstration purposes, while the main purpose of the tool is to enable manual evaluation and comparison of new models. We define an interface that a user needs to implement in order to integrate their own model into the tool.

Figure 3: The main input field and the answer block

Finally, using the KB query provided by the semantic parsing module, the front-end retrieves the answers from the KB. The back-end modules can be configured using a YAML properties file (see Listing 1 for an example configuration).
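For illustration, the semantic graph from Figure 1 corresponds roughly to the following SPARQL query, using the KB identifiers from the figure and Wikidata's performer (P175) and instance-of (P31) properties; the tool itself posts its query to the endpoint configured under wikidata.backend in Listing 1, so the public endpoint below is only a stand-in:

import requests

# Sketch of the retrieval step; identifiers are taken from Figure 1 and may
# not resolve to the expected entities on the live endpoint.
query = """
SELECT ?q WHERE {
  ?q wdt:P175 wd:Q462 .       # performer: the linked artist entity
  ?q wdt:P31  wd:Q24951125 .  # instance of: the linked album class
}
"""
response = requests.get("https://query.wikidata.org/sparql",
                        params={"query": query, "format": "json"})
answers = [b["q"]["value"] for b in response.json()["results"]["bindings"]]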

3.1 Implementation details

We implement the user interface and the back-end services with modern web technologies, such as jQuery, D3.js and Bootstrap. The back-end services are implemented in Python with Flask. The system is configurable and can easily be extended with other models for entity linking and QA.

A new QA model can be integrated either as a Python module or as a separate REST service. To communicate the results to the front-end, the service has to send the response in the defined JSON format. We refer to the published code repository for additional details on the implementation.
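As a sketch of the REST variant, a custom model could be wrapped as follows; the endpoint name and JSON fields below are hypothetical, and the repository defines the actual schema:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/qa", methods=["POST"])
def answer():
    question = request.get_json()["question"]
    # Run your own model here; the field names below are illustrative.
    return jsonify({
        "graphs": [],   # candidate semantic graphs with scores
        "weights": [],  # token-level weights for the visualization
        "answers": [],  # KB identifiers or labels of the answers
    })

if __name__ == "__main__":
    app.run(port=5001)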

4 User interface

The interactive web-based user interface (UI) is the central element of our system. We have designed the interface for an expert user and have considered the following user traits (Raskin, 2000): a background in KB QA, knowledge of programming languages, and an interest in manual error analysis.

The UI is modeled after the prototypical QA pipeline described in Section 2. Each step of the pipeline is represented as a separate block in the interface (see the complete UI depicted in Figure 5). That is, the represented model of the UI directly corresponds to the implementation model of the prototypical QA system. Such a design choice is appropriate for tools aimed at domain experts who already have a clear mental model of the underlying processes, and it makes the UI understandable for first-time (expert) users (Cooper et al., 2007).

Consequently, the interface is divided into blocks that correspond to the steps of the QA pipeline. There are several base blocks that are fixed and cannot be removed, such as the input question block, while the other ones can be hidden if needed.

Input question and answer block The first block consists of the input question field and the answer area (Figure 3). When the user first loads the interface, only the input question field is presented. This avoids confusion about the starting point of the interaction with the system. Further elements appear only when the corresponding results are returned. For example, the answer area is only shown when the processing is finished.

Although the answer retrieval from the KB is the last step of the pipeline, we put the answer area right below the input question field. This design choice makes it easy to see right away whether the system has processed the question correctly.

Entity linking block In this block, we list all identified entity mentions in the input question and the top 5 entity disambiguation candidates. The entity candidate with the highest score is automatically selected and forwarded to the QA model.

The list of entity disambiguation candidates is interactive, and the user can select any or none of the candidates for each entity mention. In case multiple candidates are selected, all of them are sent to the QA system as separate entities. This lets the user correct potential entity linker mistakes by selecting a candidate other than the top-ranked one and continue to debug the rest of the pipeline.

Semantic graphs We visualize the structured semantic representations that the QA model generates as graphs. This is the most common way to visually depict structured representations (Yih et al., 2015; Reddy et al., 2016; Sorokin and Gurevych, 2018b). A semantic graph consists of a question variable node that denotes the answer to the question, KB entities and KB relation types.

We use circles to depict entities and solid lines between them to show relations. In each graph, the question variable node is represented with a high-contrast blue circle. Since most relations are attached to the question node, this makes it easier to parse the structure of the graph. Additionally, since each graph has only one high-contrast node, the user can identify at first sight how many graphs have been composed for the input question. For example, Figure 4 shows four clearly distinguishable semantic graphs for the question "Who played Princess Leia in Star Wars?". In this instance, the model selects an incorrect graph (highlighted in green) and retrieves all cast members of the Star Wars movie. The correct graph would be the second one, which also uses the entity Princess Leia.

Figure 4: The semantic graphs block for the question "Who played Princess Leia in Star Wars?"

Representation analysis The visual inspection of the learned vector representations in the two final blocks makes it possible to identify implementation or training errors in the QA model. Once an error is attributed to the learned model, a researcher can continue the inspection in whatever tool is most appropriate for the weights of the particular model architecture (e.g. TensorBoard[5]).

[5] https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard

The token-level representations block visualizes the weights computed by the model for each input token of the question. This kind of visual analysis helps to identify whether the model is learning meaningful word representations. We rely on the shade and saturation visual variables to encode the computed vectors. Each vertical line corresponds to a vector dimension, and darker, more saturated colors denote a higher numerical value. In Figure 5, one can see that the model is assigning the highest weights to the main entity in the question.
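The visualization can be approximated with a few lines of matplotlib (toy data; the actual block renders the model's real weights in the browser with D3.js):

import numpy as np
import matplotlib.pyplot as plt

tokens = ["who", "played", "princess", "leia", "?"]
weights = np.random.rand(len(tokens), 16)  # stand-in for learned weights

# One row per token, one column per vector dimension;
# darker cells correspond to higher values.
plt.imshow(weights, cmap="Greys", aspect="auto")
plt.yticks(range(len(tokens)), tokens)
plt.xlabel("vector dimension")
plt.show()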

The second representation analysis block places the vector representations of the question and of the semantic relations on a 2D plane using t-SNE (van der Maaten and Hinton, 2008). The vectors of the relations that appear in the generated semantic graphs are rendered as red circles. The user can zoom in and out to compare the position of the question vector to the surrounding relations.
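The projection itself is standard; a minimal sketch with scikit-learn follows (random stand-in vectors; the tool projects the actual question and relation encodings):

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 50))  # 1 question + 19 relation encodings

# Project to 2D; perplexity must be smaller than the number of points.
points = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)
question_xy, relation_xy = points[0], points[1:]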

5 User study

The interface was designed according to the requirements outlined in Section 2. As domain experts, we were personally interested in applying the developed tool to QA systems. To verify that the interface and the designed interactions are in line with the expectations of first-time users, we conducted a set of brief sit-by testing sessions. Sit-by sessions are usually used for exploratory situations and for gathering first impressions about the design of a product (Rubin and Chisnell, 2008).

In the user study, we aimed to evaluate two questions: can a person familiar with KB QA use the tool independently? Does the developed tool make it possible to manually identify errors in a QA pipeline? Two participants with a background in natural language processing and linguistics were asked to perform a simple analysis task while a moderator sat near them and monitored the progress. The participants were asked to input a list of three questions into the tool and tell whether the model succeeded in answering them. In case of an incorrect answer, we expected participants to be able to identify the stage of the pipeline that caused the error. During the sit-by sessions, we were able to confirm that the interface is intuitive and easy to use. All participants completed the task in under 10 minutes and could point out at what stage an error occurred for all input questions. The representation analysis instruments, by contrast, proved to be the least intuitive element of the interface. Although the participants could attribute the error to the model, they were unable to say whether the learned vector representations were meaningful based on the provided visualization.

6 Conclusions

In this work, we have presented an interactive debugging tool for semantic parsing approaches to KB QA. We started by defining the main requirements for an instance-based evaluation tool and then demonstrated how the different aspects of the designed interface fulfill them. Our tool enables researchers to explore and qualitatively analyse a developed QA pipeline. We used sit-by sessions to verify the design choices and to assess the usability of the tool. Our architecture includes default models for entity linking and question answering, which makes it easy to replace only one of the components with a new module.

Acknowledgments

This work has been supported by the German Research Foundation as part of the Research Training Group AIPHES (grant No. GRK 1994/1), and via the QA-EduInf project (grant GU 798/18-1 and RI 803/12-1). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU.

References

Jonathan Berant and Percy Liang. 2014. Semantic Parsing via Paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1415–1425, Baltimore, MD, USA.

Alan Cooper, Robert Reimann, and David Cronin. 2007. About Face 3: The Essentials of Interaction Design. John Wiley & Sons.

Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Question Answering over Freebase with Multi-Column Convolutional Neural Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 260–269.

Ashwini Jaya Kumar, Christoph Schmidt, and Joachim Köhler. 2017. A knowledge graph based speech interface for question answering systems. Speech Communication, 92:1–12.

Percy Liang. 2016. Learning Executable Semantic Parsers for Natural Language Understanding. Communications of the ACM, 59(9):68–76.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing High-Dimensional Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Christopher D. Manning, John Bauer, Jenny Finkel, Steven J. Bethard, Mihai Surdeanu, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations, pages 55–60, Baltimore, MD, USA.

A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston. 2017. ParlAI: A Dialog Research Software Platform. arXiv preprint arXiv:1705.06476.


Haoruo Peng, Ming-Wei Chang, and Wen-tau Yih. 2017. Maximum Margin Reward Networks for Learning from Explicit and Implicit Supervision. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2358–2368, Copenhagen, Denmark.

Jef Raskin. 2000. The Humane Interface: New Directions for Designing Interactive Systems. Addison-Wesley.

Siva Reddy, Oscar Täckström, Michael Collins, Tom Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. 2016. Transforming Dependency Structures to Logical Forms for Semantic Parsing. Transactions of the Association for Computational Linguistics, 4:127–140.

Jeffrey Rubin and Dana Chisnell. 2008. Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests, 2nd edition. Wiley Publishing.

Kuldeep Singh, Andreas Both, Dennis Diefenbach, Saeedeh Shekarpour, Didier Cherix, and Christoph Lange. 2016. Qanary – The Fast Track to Creating a Question Answering System with Linked Data Technology. In The Semantic Web, pages 183–188, Cham. Springer International Publishing.

Daniil Sorokin and Iryna Gurevych. 2017. End-to-end Representation Learning for Question Answering with Weak Supervision. In Semantic Web Challenges: 4th SemWebEval Challenge at ESWC 2017, volume 769 of Communications in Computer and Information Science, pages 70–83, Cham. Springer International Publishing.

Daniil Sorokin and Iryna Gurevych. 2018a. Mixing Context Granularities for Improved Entity Linking on Question Answering Data across Entity Categories. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics (*SEM), pages 65–75, New Orleans, LA, USA. Association for Computational Linguistics.

Daniil Sorokin and Iryna Gurevych. 2018b. Modeling Semantics with Gated Graph Neural Networks for Knowledge Base Question Answering. In Proceedings of COLING 2018, the 27th International Conference on Computational Linguistics, pages 3306–3317. Association for Computational Linguistics.

Christina Unger, Axel-Cyrille Ngonga Ngomo, and Elena Cabrio. 2016. 6th Open Challenge on Question Answering over Linked Data (QALD-6). In Semantic Web Challenges, pages 171–177, Cham. Springer International Publishing.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 1321–1331, Beijing, China.

Figure 5: The complete unrolled UI for the question "Who played Luke Skywalker in Star Wars?"