
QSAnglyzer: Visual Analytics for Prismatic Analysis of Question Answering System Evaluations

Nan-Chen Chen*
University of Washington

Been Kim†
Allen Institute for Artificial Intelligence

Figure 1: The QSAnglyzer interface consists of three basic panels: ‘Evaluations’ (top left), ‘Question Space Angles’ (top right), and ‘Question Table’ (bottom) panels, following the well-known visualization mantra: “Overview first, zoom and filter, details on demand.”

ABSTRACT

Developing sophisticated artificial intelligence (AI) systems requires AI researchers to experiment with different designs and analyze results from evaluations (we refer to this task as evaluation analysis). In this paper, we tackle the challenges of evaluation analysis in the domain of question-answering (QA) systems. Through in-depth studies with QA researchers, we identify tasks and goals of evaluation analysis and derive a set of design rationales, based on which we propose a novel approach termed prismatic analysis. Prismatic analysis examines data through multiple ways of categorization (referred to as angles). Categories in each angle are measured by aggregate metrics to enable diverse comparison scenarios.

To facilitate prismatic analysis of QA evaluations, we design and implement the Question Space Anglyzer (QSAnglyzer), a visual analytics (VA) tool. In QSAnglyzer, the high-dimensional space formed by questions is divided into categories based on several angles (e.g., topic and question type). Each category is aggregated by accuracy, the number of questions, and accuracy variance across evaluations. QSAnglyzer visualizes these angles so that QA researchers can examine and compare evaluations from various aspects both individually and collectively. Furthermore, QA researchers can filter questions based on any angle by clicking to construct complex queries. We validate QSAnglyzer through controlled experiments and expert reviews. The results indicate that when using QSAnglyzer, users perform analysis tasks faster (p < 0.01) and more accurately (p < 0.05), and are quick to gain new insight. We discuss how prismatic analysis and QSAnglyzer scaffold evaluation analysis, and provide directions for future research.

*e-mail: [email protected]
†e-mail: [email protected]

Keywords: visual analytics; visualization; interactive visualization; question answering; multi-experiment analysis; visual comparison; visual exploration; prismatic analysis

Index Terms: H.5.2 [Information Interfaces and Presentation]: User Interfaces—

1 INTRODUCTION

In the artificial intelligence (AI) era, complex AI systems such as conversational agents, self-driving cars, and recommendation systems are part of daily life. Various reports indicate that AI plays an increasingly influential role in human society [18, 23]. Major technology companies [1, 12] as well as government agencies [6] are putting increasing emphasis on AI applications.

The development of a sophisticated AI system is a challenging process. AI researchers experiment with different AI system architectures and evaluate their performance on the target tasks. As components in the system interact with each other, it can be difficult to make sense of evaluations and find insight to improve the system. For example, a question answering (QA) system, a type of AI system that answers questions, may involve information retrieval (IR), natural language processing (NLP), knowledge representation, reasoning, and machine learning (ML) components. All components are complex in their own ways, and can interact with each other in complicated ways such that the impact of changes in one component can propagate to overall changes in behavior. AI researchers may modify one component based on insight gained from one set of evaluations, but discover that the overall behavior of the system changes in unexpected ways. For instance, an NLP component can take the output of an IR component as input, and pass this on to an ML component. Small changes in the IR component may thus propagate to the ML component, and lead to unanticipated changes in overall behavior.

Traditional metrics for system performance (e.g., task accuracy) often fail to provide useful and actionable insights for AI researchers. For instance, two versions of a QA system can both achieve 50% accuracy by answering completely different sets of questions correctly. One version may answer all the questions starting with ‘what’ correctly, whereas the other version may get all the ‘why’ questions right. More complex patterns of the strengths and weaknesses of each version are challenging to discover and track. The burden for AI researchers is to analyze evaluations to understand how changes impact system behavior, and then find ways to enhance the system.

One way to aid this investigation of evaluations is to introduce categorization. Showing finer-grain patterns may provide deeper insight than showing overall metrics. The challenge is that researchers do not know a priori the best way to categorize evaluations. In addition, the best insight to improve the system may require iterative filtering on various types of categories. The complexity of such exploratory queries can further impose burdens on researchers.

In this paper, we tackle the challenge of evaluation analysis in the domain of QA systems. We work closely with QA researchers to identify their goals and tasks in analyzing QA evaluations, and derive a set of design rationales that support their workflows. These in-depth studies with QA researchers lead to our proposed novel approach termed prismatic analysis. This approach examines data from multiple angles, where each angle shows a different way to divide the input data into categories. Categories are then measured by aggregate metrics for further investigation. With this approach, users can perform various comparison scenarios, such as between-category, between-angle, and between-evaluation comparison. In addition, the approach works with data subsets as well, so QA researchers can conduct analysis even after filtering.

Although multiple angles of categorization can enable more diverse pattern discovery, they can also overwhelm researchers. To support the prismatic analysis of QA evaluations, we design and implement a visual analytics (VA) tool called the Question Space Anglyzer (QSAnglyzer). In QSAnglyzer, questions in evaluations constitute a high-dimensional space in which each question is a point. The question space can be divided into categories based on several angles, such as topics and question types. The aggregate metrics for each category, chosen based on the design rationales, include accuracy, the number of questions, and accuracy variance across evaluations. QSAnglyzer visualizes these angles and aggregate metrics of categories using color, height, and order. The QSAnglyzer design enables QA researchers to examine and compare evaluations from various angles individually and collectively. Furthermore, QA researchers filter questions based on any angle by clicking, thus constructing complex queries visually. With controlled experiments and expert reviews, we validate that the tool helps users perform analysis tasks more quickly (p < 0.01) and with greater accuracy (p < 0.05), and also generates insight in a short amount of time.

The major contributions of our work are as follows:

• We identify the goals and tasks of QA researchers in analyzing evaluations to derive design rationales. We highlight the need to further study the workflows of complex AI system development.

• We propose prismatic analysis, a novel approach for multi-angle categorization and shared aggregate metrics for category comparison and connection, which enables finer-grain pattern discovery.

• We design and implement QSAnglyzer, a visual analytics system for performing prismatic analysis on QA system evaluations, and validate its effectiveness. We discuss how prismatic analysis and QSAnglyzer scaffold evaluation analysis, and we provide directions for future research to extend the tool and apply prismatic analysis to other AI domains.

2 BACKGROUND AND RELATED WORK

2.1 Question Answering Systems

The QA system is a type of complex AI system that gained recognition when IBM’s QA system, Watson, appeared on the Jeopardy! game show [7] in 2011. The subject of study now for four decades, QA is one of the oldest AI problems. Numerous QA systems have been proposed to answer short factoid questions [3, 7, 10], standardized tests in geometry [19], and mathematics [11]. Some QA systems are designed to answer multiple-choice questions (e.g., [5, 27]), whereas some are designed for direct-answer questions (e.g., [26, 34]).

Typical QA systems can be roughly divided into two types: IR-based and knowledge-based [8, Chap 28]. IR-based QA systems rely on large sets of unstructured corpora and use IR techniques to search for sentences that contain words found in the questions. In contrast, knowledge-based QA systems are built upon structured knowledge (e.g., lists of tuples for relations between semantically typed entities) for reasoning. The knowledge base is usually smaller and more organized than IR corpora, and the knowledge can be chained to infer unseen facts.

In addition, in complex AI systems, ensemble methods are often used to combine results from a set of weaker AI systems to generate the final outputs. Likewise, a QA system may comprise a set of solvers that answer questions, and use ensemble methods to produce the system’s final answers.

2.2 VA for Multi-Experiment Result Analysis

Conducting multi-experiment result analysis, including comparison and exploration, is one of the most common usages of visualization in scientific domains. For instance, Nocke et al. propose a series of visualization views for climate science simulation output comparison [17]. Wilson and Potter focus on visualizing ensemble simulation data for weather forecasting and support side-by-side comparisons [30]. Malik et al. utilize visualization to inspect sets of 3D X-ray computed tomography images where device parameters varied when taking the images [13]. Although such applications are common in scientific domains, they do not directly apply to QA system design for three reasons. First, most of the above-mentioned work focuses on simulation data from the same models with different parameter settings, which is not the case in QA system design. Tuning parameters is only one type of change that QA researchers make to their solvers. Oftentimes the changes are more complex, such as adding more knowledge, changing language processing algorithms, and so on. Furthermore, there is typically more than one type of solver, and they all work very differently from each other; thus there is no intuitive way to visualize the parameter spaces across these heterogeneous solvers. Such unique characteristics of QA systems compose a different design space for visualization from the typical comparative visualization used in scientific domains.

2.3 VA for Inspecting Computational Models

Inspection of computational models, such as ML models, is a growing topic in the field of human-computer interaction (HCI) and visualization (VIS) research. Two key goals of this direction are to better understand complex black-box models, and to reduce the effort required to construct and tune models. The former goal aims to make it easier for model designers to pin down potential issues with the complex computational process, whereas the latter provides clues for improvement of models. For instance, Smilkov and Carter demonstrate an interactive visualization that shows the process of neural networks (NN) [21] so that model designers can see how an NN model progresses over time. Various work in classifier weight tuning [9, 24] or feature ideation [4] also demonstrates how visualization can simplify efforts to construct a better model. Although QA and other complex AI systems also contain computational models, enhancing the performance of such complex systems requires more than focusing on specific models. Therefore, we focus on the workflow level of the development process, and specifically focus on building analytics that support evaluation analysis.

2.4 Multi-view Learning and Multiple Clustering Solutions

Prismatic analysis can be related to multi-view learning and multiple clustering solutions. Multi-view learning considers a machine learning problem from multiple views (e.g., multiple sources or different feature subsets), since the target learning problem cannot be described using a single view [32, 33]. Multiple clustering solutions focus on using multiple ways to cluster data points [14], and have been used in subspace analysis of high-dimensional data spaces [25]. Like multi-view learning, prismatic analysis also emphasizes the need to consider multiple views (here referred to as ‘angles’), but it does not focus on learning problems per se. In addition, multi-view learning does not require defining categories in each view. In contrast, for multiple clustering solutions, the goal is to define categories (i.e., clusters) in many different ways; this is similar to the idea of prismatic analysis. However, prismatic analysis can be positioned as the next step after finding multiple clustering solutions. The results from multiple clustering solutions can be fed into prismatic analysis. Moreover, multiple clustering solutions focus on finding categories, whereas prismatic analysis emphasizes comparisons and making connections between categories. Thus, defining aggregate metrics is also a critical step of prismatic analysis, which is not a part of multiple clustering solutions. In Section 4, we describe prismatic analysis in detail.

3 USER REQUIREMENT ANALYSIS

In this section, we describe user requirements and design rationales derived through in-depth user studies with QA researchers.

3.1 Identifying User Requirements

To identify user requirements in analyzing QA system evaluations, we worked for six months with a research team of about 20 people who built a QA system at a research institute in the USA. The formative study mostly happened in the first month of the project. During the formative study, we informally interacted with the team on a daily basis, and we conducted a series of interviews with five QA researchers who were primary solver developers. All of the interviewees hold PhD degrees and have been working on the project as research scientists full time for more than a year, and their experience in the related fields ranges from 10 years to 33 years.

The QA system consists of multiple solvers for answering standardized science exam questions. The system uses an ensemble method to combine answers from the solvers, and returns a final answer for each question. For simplicity, we hereafter use ‘solvers’ to refer to both solvers and the QA system, as the QA system can be considered an ensemble solver that outputs answers based on all solvers’ answers.

3.1.1 QA Evaluation Overview

To develop and test the QA system, the QA researchers ran solvers on a set of evaluation questions. When we worked with the QA researchers, their set contained about 1,200 multiple-choice questions, and they sometimes ran evaluations on a subset of the whole question set. They had another question set that contained direct-answer questions, but in our study we focused on multiple-choice questions as they were the center of solver development at the time of the study.

To evaluate a solver, the QA researchers run the solver on an arbitrary set of questions. The solver takes each question’s description and answer choices, and produces a confidence score for each choice. The choice with the highest confidence score is taken as the solver’s final answer to the question. Sometimes, confidence scores of choices are tied, and the solver returns more than one choice as its answer to a question. Currently, QA researchers measure the solver’s overall accuracy in answering these questions. In addition to single solver evaluation, QA researchers also run multiple solvers together as an ensemble solver. The output of an ensemble of solvers combines individual solvers’ answers and returns an answer.
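To make the answer-selection step concrete, the following Python sketch is purely illustrative (the solvers are not implemented this way, and the function name and data layout are hypothetical): it picks the choice(s) with the highest confidence score, returning multiple choices when scores are tied and none when the solver produces no scores.

    # Hypothetical sketch: derive a solver's final answer from per-choice confidence scores.
    def select_answers(confidences):
        """confidences: dict mapping a choice label (e.g., 'A'-'D') to a confidence score."""
        if not confidences:
            return []                                   # the solver returns no answer
        best = max(confidences.values())
        return [c for c, s in confidences.items() if s == best]

    # A tie between 'B' and 'D' yields two selected choices (multiple answers due to a tie).
    print(select_answers({"A": 0.1, "B": 0.4, "C": 0.1, "D": 0.4}))  # ['B', 'D']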

3.1.2 QA Researcher Goals and Tasks

From interviews and informal interactions with the QA researchers, we noted three primary goals for them when analyzing evaluations:

G1. Look for ways to improve a solver. When they analyze evaluations, one key goal is to seek opportunities to improve the solver. Example ways include examining whether a solver has enough data (e.g., knowledge) to answer a question, or whether other dependencies, such as an entailment service, impact solver performance.

G2. Examine if new changes to a solver improve performance. After QA researchers modify a solver, they evaluate whether the changes are helpful. As they often have a hypothesis in mind about how the changes may work, they also examine whether the newer version of a solver behaves as expected, and whether any unexpected behavior appears.

G3. Understand the strengths and weaknesses of solvers. Since the final QA system combines the answers of all solvers, it is important to understand the strengths and weaknesses of solvers. This understanding provides opportunities for a researcher to better contribute to overall system performance.

We then collected and compiled the following common tasks by observing how QA researchers investigate evaluations to achieve the above goals. For each task, we note in parentheses its corresponding goals.

T1. Compare two or more versions of a solver (G1, G2). After modifying a solver, QA researchers usually compare the new version with the original version. They look for questions that the newer version gets right but the older version gets wrong, and vice versa. Sometimes, they investigate questions for which the two versions of the solver have different answers, regardless of the correctness. Since solver development is not a linear process (i.e., versions can branch out), these comparison tasks are sometimes performed on more than two versions at the same time.

T2. Contrast one solver’s behavior with that of other solvers (G1, G3). In addition to comparing different versions of a solver, QA researchers also contrast a solver’s behavior with that of other solvers to understand solver strengths and weaknesses (G3). QA researchers compare their solvers’ behavior to that of others as a reference to show increases or decreases in their own solvers’ performance.

T3. Categorize and/or filter questions (G1, G2, G3). QA researchers often focus on question subsets to improve certain categories (G1), investigate component-wise changes (G2), or identify patterns of solver performance between subsets (G3). To this end, they either try to group the whole set into categories and examine individual categories, or they find a way to retrieve relevant questions (e.g., questions containing certain keywords).

T4. Investigate question descriptions to make sense of a solver’s behavior (G1). QA researchers usually read question text and answer choices first to see if they can come up with a hypothesis for the solver’s correct or incorrect answer. This task relies on their past experience in developing the solver and debugging similar questions. QA researchers do not always have an idea about why solvers behave in certain ways for a question. Therefore, they have to closely investigate the question to discover situations where, for instance, a solver is being misled by a particular word and thus exhibits extremely different behavior for two very similar questions.

T5. Inspect how a solver answers a question (G1, G2). To know more about how a solver answers a question, QA researchers program solvers to produce intermediate output (e.g., describing the reasoning process) so that they can quickly make sense of what may have happened. They also have solver-specific tools to drill down into detailed steps.

T6. Find insight that can improve a large subset of questions at once (G1). One QA researcher noted, “We want to make changes that can fix (solvers’ answers to) many questions, not just one at a time.” In order to efficiently gain insight, they typically investigate subsets of questions that have shared properties and that are not too small (i.e., at least 10 questions). For instance, they examine questions that belong to the topic ‘adaptation’ to figure out what knowledge should be added to the knowledge base to answer questions on this topic.

T7. Look for questions for which solvers return no answers, or multiple answers (G1). To improve a solver, one common approach taken by researchers is to investigate questions for which the solver returns no answers, or multiple answers. For these questions they seek ways to modify the solver to break ties between choices, or cause the solver to return a single answer.

T8. Share their findings and insights with other researchers (G1, G3). Researchers sometimes share findings from evaluation analyses with each other. In particular, when discussing solver strengths and weaknesses, spreadsheets and screenshots are popular ways to illustrate their findings.

3.2 Preliminary Investigation: Sunburst Visualization

During our user requirement analysis, the QA researchers expressed interest in utilizing the three types of categories they manually labeled: topic, subtopic, and question type (to be consistent with later sections, we term each categorization an angle; note that angles do not refer to degrees encoded within the Sunburst). Thus, as a way to crystallize user requirements, we created an interactive Sunburst visualization [22] using the three angles (Fig. 2). In this visualization, a Sunburst corresponds to an evaluation. The three angles were arranged hierarchically. Each layer of the Sunburst consisted of the categories within one angle. The color indicated the solver accuracy in the corresponding categories in an evaluation. The accuracy was bucketed into 10 levels of the red-green scale shown in the legends on the top left. All Sunbursts were linked together such that as users interacted with one of the Sunbursts by clicking a category in any level, the other Sunburst views were updated to reflect the change.

3.2.1 Challenges

We presented the visualization to the QA researchers and conducted contextual inquiries. Although researchers reacted positively and commented that some solver patterns confirmed their understanding of the solver strengths and weaknesses, they indicated a few challenges and issues when using this Sunburst visualization for their analysis:

C1. Difficulty when comparing more than two evaluations. As tasks T1 and T2 indicate, QA researchers need to compare more than two evaluations. We attempted to present more than two Sunbursts to researchers in a single view: although the visualization did provide an overview of the results of multiple evaluations, the more Sunbursts that were shown (i.e., the more evaluations loaded), the larger the area they occupied on the screen. This made it more difficult to conduct complex analysis and comparisons.

Figure 2: Sunburst visualization utilizing three angles: topic, subtopic, and question type, the three question categories. Each Sunburst layer consists of the categories within an angle. Color indicates solver accuracy for the corresponding categories in an evaluation.

C2. Unidentifiable categories in outer layers. Although we used only a three-level hierarchy in this Sunburst visualization, the researchers still found it difficult to see and identify categories in the outer layers. Although they could filter categories in outer layers of the hierarchy, finding the category they sought to investigate was challenging. This problem is further complicated by additional angles, which add more layers to the Sunbursts.

C3. Complex query requirements. Though the Sunburst visualization provided an overview of the evaluations, QA researchers found it difficult to perform more complex queries. For example, they may want to filter on multiple angles, such as questions that belong to the topic ‘Matter’ or ‘The Earth’, but that also belong to the question type ‘Definition’ or ‘Story’. Such complex queries are difficult to perform using the Sunburst visualization.

3.3 Design Rationales

Given the identified goals and common tasks of users, as well as the challenges found in our preliminary investigation of the Sunburst visualization, we present the following design rationales that led to our focus on prismatic analysis and the QSAnglyzer design.

R1. Take questions as the center of analysis. When QA researchers analyze evaluations, questions are the center of analysis. In most of their tasks, they directly investigate questions (T3, T4, and T6) or solver behavior with regard to questions (T5 and T7). In addition, questions are the only aspect of the QA system shared between all solver evaluations. Furthermore, even though solvers can be very different from each other, and not every researcher knows all the solvers, they can still communicate and share insights about the questions (T8).

R2. Prioritize accuracy, the number of questions, and accuracy variance as key variables. When researchers modify a solver, their goal is to increase accuracy (G1). When they investigate a subset of questions, they want to find a large subset to ensure that changes made based on the set have a broad impact (G1 and T3). In addition, when studying solver strengths and weaknesses (G3 and T2), they look for a set of questions for which solvers have highly different accuracy. Therefore, we consider accuracy, the number of questions, and accuracy variance as the high-priority variables for design.

R3. Support categorization. Researchers group questions into categories when focusing on a subset of questions (T3), comparing between versions of a solver (T1), or contrasting solvers (T2). Thus, we observe that categorization can be valuable in revealing patterns in evaluations.

R4. Consider diverse comparison scenarios and complex queries. QA researchers need support for diverse comparison scenarios and complex queries. They compare between versions of solvers (T1) and different solvers (T2). They also want to determine whether a solver’s performance is worse in one category than in another (T3), to decide which category should be improved (G1). Moreover, they also compare whether two categories have different performance patterns across solvers (T3) to identify strengths and weaknesses (G3). These comparison scenarios can take place after search and filtering (T3 and R5). Therefore, it is necessary to have a design that supports any combination of the above comparison scenarios.

R5. Enable filtering and search. As researchers often drill down into a subset of questions (T3), our VA design should include both filtering and search functionalities.

R6. Show multiple angles at the same time. QA researchers often want to examine evaluations from multiple angles. For example, they may want to find questions for which two versions of a solver produce different answers (T1). They may also want to see for which topics solver accuracy differs the most (T2). Hence, it is critical to show multiple angles of evaluations. In addition, from challenges C2 and C3 we learn that presenting different angles as layers in a hierarchy may not be ideal, and thus we suggest presenting multiple angles in parallel.

R7. Ensure scalability. As there may be more than two sets of evaluations involved in the analysis (C1), it is essential that the design scales up. In our case, we aim to support at least eight sets of evaluations since the QA system contains eight solvers. Additionally, we need to ensure that the design scales to multiple angles (R6).

R8. Make different levels of evaluation details available. The design should incorporate various levels of evaluation details, including high-level metrics like accuracy, the description and choices of questions (T4), solver confidence scores on each choice of a question (T7), and intermediate outputs (T5).

R9. Accommodate existing workflows. Our VA tool design should fit into the QA researchers’ existing workflows so that they can use the tool together with their solver-specific tools (T5), and discuss and share findings with each other (T8).

4 PRISMATIC ANALYSIS

The study of QA researchers’ current evaluation analysis workflows underlies our proposal of prismatic analysis. Although this is in the context of QA systems, we argue that the approach can be applied to other types of complex AI system evaluation analysis. Therefore, we start from a generalized formal definition of prismatic analysis using mathematical notation:

Given a set of data points that constitute a high-dimensional data space D, we define a set of categorization functions F. Each function fi ∈ F divides D into a set of categories Ci from the i-th angle. In addition, we define a set of aggregate metrics M. Each metric m ∈ M represents aggregated information about data points as a scalar. As a result, we can compare categories within an angle or between angles based on the set of aggregate metrics. Furthermore, some of the metrics may have multiple measurements, and thus we can also compare these measurements within a category. Last, we can filter to a subset of D and conduct the same analysis.

We call this approach ‘prismatic analysis’ because when we consider the data space D, the categorization functions F act like a set of prisms. These prisms are oriented at different angles toward the space so that we see the space divided in various ways. This results in multiple ways to group the data points into categories. Then the set of aggregate metrics M summarizes all the categories in the same way regardless of angles; this aids us in making comparisons within and between categories in an angle, as well as making connections between angles.

We then describe how prismatic analysis can be applied to analyze QA evaluations: as questions are the center of QA evaluation analysis (R1), the question space, the high-dimensional space formed by questions, corresponds to the data space D. The question space can be divided into categories from multiple angles (each angle corresponding to a categorization function in F), such as by topics or by question types. This implements our design rationales R3 and R6. The aggregate metrics M here include accuracy (denoted as Macc), the number of questions (Mnum), and accuracy variance (Mvar) between evaluations, as suggested by design rationale R2. The Macc metric can have multiple measurements when we load multiple evaluations. Moreover, prismatic analysis allows us to filter subsets of questions to conduct the same analysis, which fulfills design rationale R5.
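To make these definitions concrete, below is a minimal Python sketch, for illustration only (the names and data layout are hypothetical, and this is not QSAnglyzer’s actual JavaScript implementation), that treats each question as a data point, each angle as a categorization function, and computes Mnum, per-evaluation Macc, and Mvar for every category. Filtering to a subset of questions and re-running the same aggregation corresponds to the filtered analysis described above.

    # Hypothetical sketch of prismatic analysis over QA evaluations.
    from statistics import pvariance

    # Each question records its angle labels and, per evaluation, whether it was answered correctly.
    questions = [
        {"topic": "matter",    "qtype": "definition", "correct": {"solver_v1": True,  "solver_v2": True}},
        {"topic": "matter",    "qtype": "story",      "correct": {"solver_v1": False, "solver_v2": True}},
        {"topic": "the earth", "qtype": "definition", "correct": {"solver_v1": False, "solver_v2": False}},
    ]

    # Angles: categorization functions f_i in F, each mapping a question to a category label.
    angles = {"topic": lambda q: q["topic"], "question type": lambda q: q["qtype"]}

    def aggregate(questions, angle_fn, evaluations):
        """For each category of one angle, compute Mnum, per-evaluation Macc, and Mvar."""
        metrics = {}
        for cat in {angle_fn(q) for q in questions}:
            members = [q for q in questions if angle_fn(q) == cat]
            macc = {e: sum(q["correct"][e] for q in members) / len(members) for e in evaluations}
            metrics[cat] = {"Mnum": len(members), "Macc": macc, "Mvar": pvariance(macc.values())}
        return metrics

    print(aggregate(questions, angles["topic"], ["solver_v1", "solver_v2"]))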

Prismatic analysis is not limited to QA system evaluations. This approach can be applied to other types of complex AI system evaluation analysis by identifying the data space D, categorization functions F, and aggregate metrics M. We further discuss this point in Section 7.4.

Although prismatic analysis captures many analytical tasks performed by QA researchers, it can be overwhelming to have multiple angles and metrics in pure text (e.g., labels and numbers). In addition, design rationales R7, R8, and R9 are not directly supported by prismatic analysis. This leads to the design of QSAnglyzer, our visual analytics tool introduced in the next section.

5 QSANGLYZER

In this section, we describe the design of QSAnglyzer, which is implemented as a JavaScript web application using React.js and D3.js. All data are stored in a PostgreSQL database and can be accessed through web service calls (implemented in Node.js). QSAnglyzer aims to support prismatic analysis of QA evaluations; its design follows the rationales we derived in Section 3.3. To be more specific, QSAnglyzer enables QA researchers to load evaluations into the tool, and supports visual comparison and interactions based on the framework of prismatic analysis. The system treats each evaluation as a dataset in which each question is a data point. Results from different evaluations are different dimensions of the data points. The system provides seven default angles (F) to divide questions into groups; we explain the details of each angle in Section 5.1. After the user loads an evaluation on the system or performs filtering or search, the system automatically calculates the aggregate metrics (Macc, Mnum, and Mvar) for each category. The metric Macc has N measures, one for each evaluation, where N is the number of evaluations loaded on the system.

5.1 Angles

By default, the system provides seven angles that can be divided into three types: question-related angles, evaluation-related angles, and free labels. These angles are chosen and defined based on what we found QA researchers need, though we envision that more angles will be defined in the future. Question-related angles utilize the three sets of manually labeled categories from the QA researchers (topics, subtopics, and question types) used in Section 3.2.

Evaluation-related angles are a set of angles automatically derived from evaluations. These include correctness (is correct), agreement of answers (difference), and the number of selected choices (num selected). The ‘is correct’ angle categorizes questions based on whether a selected choice was correct (T) or not (F), or having no answer (X). For multiple evaluations, the categories are correctness permutations across evaluations. For example, when the system is loaded with three evaluations, the ‘F / T / T’ category refers to questions for which the first evaluation is incorrect but the second and third evaluations are both correct.

The ‘difference’ angle divides questions into two categories: ‘same’ and ‘different’. Questions for which all evaluations have the same answer belong to the former; the rest belong to the latter. A question may have the same answers in two evaluations, but as long as there is another evaluation that has a different answer, it belongs to the ‘different’ category.

The ‘num selected’ angle aims to support task T7 by categorizing questions based on the number of selected choices. For simplicity, we treat questions with two or more selected choices as being in the same group; thus each question can only have ‘0’, ‘1’, or ‘>=2’ selected choices. Similar to the ‘is correct’ angle, in the case of multiple evaluations, the categories are permutations across evaluations. For instance, when there are three evaluations, the category ‘>=2 / 1 / 1’ includes questions for which the first evaluation selected more than one choice, and the other two selected only one choice.
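For illustration only, the evaluation-related categories above can be thought of as joining one symbol per loaded evaluation, in the order of the ‘Evaluations’ panel. The Python sketch below (hypothetical names; not the tool’s actual code) derives the ‘is correct’ and ‘num selected’ labels for a single question across three evaluations.

    # Hypothetical derivation of permutation-style category labels across evaluations.
    def is_correct_symbol(answer):
        """T = correct, F = incorrect, X = no answer returned."""
        if not answer["selected"]:
            return "X"
        return "T" if answer["is_correct"] else "F"

    def num_selected_symbol(answer):
        n = len(answer["selected"])
        return str(n) if n < 2 else ">=2"

    def category_label(answers, symbol_fn):
        """answers: one entry per evaluation, in the 'Evaluations' panel order."""
        return " / ".join(symbol_fn(a) for a in answers)

    answers = [
        {"selected": ["B"],      "is_correct": False},  # evaluation 1
        {"selected": ["C"],      "is_correct": True},   # evaluation 2
        {"selected": ["A", "C"], "is_correct": True},   # evaluation 3
    ]
    print(category_label(answers, is_correct_symbol))    # F / T / T
    print(category_label(answers, num_selected_symbol))  # 1 / 1 / >=2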

The ‘free labels’ angle is designed for users to assign customized categories to questions. Users label the questions in the bottom ‘Question Table’ panel. We describe this process in Section 5.2.3.

5.2 Interface Design

As shown in Fig. 1, the QSAnglyzer interface consists of three basic panels: the top left ‘Evaluations’, the top right ‘Question Space Angles’, and bottom ‘Question Table’ panels. This design follows the visualization mantra: “Overview first, zoom and filter, details on demand [20].” The ‘Evaluations’ panel provides the most high-level overview of the evaluations, and helps users keep track of how many questions have been filtered by showing partial bars. In the ‘Question Space Angles’ panel, users perform prismatic analysis; this panel allows users to filter and zoom into a finer set of questions. Finally, the ‘Question Table’ panel shows details related to questions and solver behavior, and supports keyword search and question bookmarking (i.e., starring). This design fulfills our design rationale R8. In addition to the basic panels, users can obtain a link to share the current view of the tool by clicking the link icon on the top right corner of the tool, which follows design rationale R9. Below, we explain each component of QSAnglyzer and the design decisions for visual encoding and tool interaction.

5.2.1 Top Left Panel: A Visual Summary of Evaluations

In the top left ‘Evaluations’ panel, users add or remove evaluations to visualize by providing URLs (the URLs link to their existing database APIs). Users assign a name to the evaluation, or the system automatically names the evaluation based on the solver. We represent each evaluation as a colored vertical bar, where the color indicates the overall accuracy (dark green represents high accuracy). The maximum height of the bars corresponds to the highest number of questions loaded to the tool, and the height of each individual bar is proportional to the number of questions in its evaluation (labeled in text under the bars). When hovering over a bar, a tooltip pops up to show the exact accuracy.

These bars provide a visual summary of the evaluations, and are updated when the user interacts with the tool. When a user filters a set of questions, a partial bar appears reflecting this filtering action. The color of the partial bar represents the accuracy for the filtered set of questions. If the number of questions varies across evaluations, the system automatically filters the subset of questions that all evaluations contain.

Design choices. Although color is not the best choice for encoding quantitative values such as accuracy, it is a visual attribute supported by human pre-attentive processing [28]. In addition, since length is a recommended visual attribute for showing quantitative data [16], we use it to encode the number of questions. The two design choices are also used in the ‘Question Space Angles’ panel.

5.2.2 Top Right Panel: Question Space Angles

The Question Space Angles panel on the top right supports prismatic analysis as a metaphorical representation of the question space. Each box on the panel represents an angle (an fi ∈ F) and its categories (Ci), and each category is represented by a horizontal slice. The height of a slice corresponds to the number of questions (Mnum) in this category, whereas the height of an entire angle box corresponds to the number of visible questions (i.e., all questions or remaining questions after filtering). For simplicity, hereafter we directly use ‘angles’ to refer to these boxes, and ‘categories’ to refer to the horizontal slices.

Figure 3: The design of an angle. The header shows the name of the angle. Each slice is a category within the angle, and its height corresponds to the number of questions in the category. The vertical segments represent evaluations, and the colors indicate how accurate the solver in the evaluation is on the questions in the category. When hovering over a category, a tooltip appears, providing detailed numbers.

Fig. 3 illustrates the parts of an angle. When there are N evaluations, every category is divided into N vertical segments. Each segment represents one evaluation. For example, in Fig. 1, as the tool is loaded with three evaluations, each category has three segments. The order of the segments follows the order in the ‘Evaluations’ panel. The color of each segment reflects the accuracy (Macc) of the solver in an evaluation on the questions within the category.

When the user clicks a category within an angle, a filtering action is triggered, and all angles are updated based on the remaining questions. When users hover over a category, a tooltip appears (Fig. 3), providing detailed information such as the number of questions in the category, the number of correctly-answered questions in each evaluation, and the corresponding accuracy. When users seek to filter more than one category, they use the filter box shown in Fig. 4A. If a filter is applied to an angle, the filter icon on its header turns white. The ‘Active Filters’ panel on the left lists the currently applied filters.

Categories within an angle are sorted with respect to variance in accuracy (Mvar) across evaluations (i.e., the first category on the top has the highest variance across evaluations). In the filter box, however, we sort the categories with respect to the number of questions (Mnum). As we have observed cases in which both the order of Mnum and Mvar can be useful, we use different ordering mechanisms as a quick way to provide the two types of information without re-ordering.
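As a small illustration of the two orderings (hypothetical data; not the tool’s code), the angle view sorts categories by accuracy variance while the filter box sorts them by question count:

    # Hypothetical per-category metrics for one angle.
    metrics = {
        "adaptation": {"Mnum": 42,  "Mvar": 0.031},
        "matter":     {"Mnum": 118, "Mvar": 0.004},
        "the earth":  {"Mnum": 95,  "Mvar": 0.012},
    }
    # Angle view: highest accuracy variance across evaluations first.
    angle_order  = sorted(metrics, key=lambda c: metrics[c]["Mvar"], reverse=True)
    # Filter box: largest category first.
    filter_order = sorted(metrics, key=lambda c: metrics[c]["Mnum"], reverse=True)
    print(angle_order)   # ['adaptation', 'the earth', 'matter']
    print(filter_order)  # ['matter', 'the earth', 'adaptation']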


Figure 4: (A) The filter box of the ‘topics’ angle. The user can filter on more than one category using the check boxes on the left. The categories are sorted with respect to the number of questions. (B) The panel shows the details of the solvers’ answers to a question. By unfolding the ‘outputs’, users examine a solver’s intermediate outputs for a choice.

Design choices. We chose to use a box to represent an angle and show all angles in parallel to reflect challenge C2 and design rationales R6 and R7. Such a design enables the easy adding and removing of angles, as well as changing their order. In addition, putting evaluations as segments next to each other greatly facilitates comparisons between evaluations and reflects challenge C1. The design easily scales up to nine evaluations (R7). In addition, we chose to use order to encode Mvar. Although Mvar is a quantitative variable, the exact values and the differences between values are not critical for analysis. Therefore, we assign an ordinal encoding to the variable. We also use order to show Mnum. This is an additional encoding to the height encoding for this variable, which is suitable for ordinal encoding.

5.2.3 Bottom Panel: Question Table

The ‘Question Table’ panel at the bottom provides finer-grain details of solver evaluations. By default, this panel is collapsed. Every row in the table corresponds to a question and the solvers’ answers in the evaluations. The first column enables users to ‘bookmark’ (or ‘star’) questions, which can then be viewed in the ‘Starred’ tab. The second and following columns of the table show the solvers’ choices and confidence scores in each evaluation. The order of the evaluations is consistent with both the order in the ‘Evaluations’ panel and that in the categories.

Bar charts in these columns represent the normalized confidence scores (between 0 and 1) returned from each evaluation, where the x-axis is the choices and the y-axis is the scores. When users mouse over the bars, a small tooltip appears with the normalized and raw scores. The letter above each bar chart represents the choice with the highest confidence score (that is, the selected choice). When a solver returns more than one choice, the letters of all the selected choices are shown above the bar chart, whereas when a solver returns no answer, the grid is empty (like the first column in the second row in Fig. 1). In contrast to the accuracy colors in the top two panels, the colors in these bar charts are binary: green for correctly answered questions, and orange for incorrectly answered questions.

In addition to solvers’ answers, each row of the question table also shows the question’s membership in each angle’s categories (the yellow labels in the bottom panel of Fig. 1). Users change the question’s membership in a category by adding or removing labels. This is also where users create and assign customized labels to the ‘free labels’ angle.

Currently, each page of the question table contains 100 questions. Users can also perform keyword searches. As with filtering, completed search actions update the interface to reflect the questions returned from the search. If users want to retrieve solvers’ intermediate outputs for a question, they click the eye icon in the last column. A pop-up panel (Fig. 4B) appears so that users can compare intermediate outputs (folded under ‘outputs’ by default).

Table 1: Lab study design. We took the number of evaluations (IVNumOfEval) and the tool used (IVtool) as the independent variables.

                                 Tools (IVtool)
  # Evaluations (IVNumOfEval)    Excel (control)    QSAnglyzer (treatment)
  2                              2-Excel            2-QSAnglyzer
  4                              4-Excel            4-QSAnglyzer
  6                              6-Excel            6-QSAnglyzer

Design choices. The design of this panel is based on the spreadsheets the QA researchers created before the study. Therefore, even though both colors and heights encode different variables from the other panels, for the QA researchers these encodings are acceptable. Note that we designed this panel for multiple-choice questions, but the design can be tweaked to show direct-answer questions without the need to change the other panels.

6 EVALUATION

To verify the effectiveness and efficiency of QSAnglyzer, we conducted a lab study with non-experts and expert reviews. The goal of the lab study with non-experts was to evaluate the effectiveness of the visual and interaction design of QSAnglyzer. We defined the tasks in the lab study based on tasks drawn from the requirement analysis. With the expert reviews, we sought to examine whether QSAnglyzer facilitates insight discovery. The two evaluation approaches are designed based on Munzner’s nested model [15]. In this section, we describe both studies in detail.

6.1 Lab Study with Non-experts

Since QA researchers commonly utilize spreadsheets in their workflows, we designed a set of Excel spreadsheets that support the same functionalities for prismatic analysis. These functionalities include the use of multiple angles to categorize questions; the use of accuracy, the number of questions, and accuracy variance as aggregate metrics; and filtering. In addition to the interface, we tested whether the number of evaluations loaded in the tools impacts user performance.

6.1.1 Lab Study Design

Participants were asked to perform sets of predefined tasks with both QSAnglyzer and the Excel spreadsheets. In addition, we tested both Excel spreadsheets and our tool with 2-, 4-, and 6-solver evaluations (details in Table 1). We took as the independent variables the number of evaluations (IVNumOfEval) and the tool used (IVtool). For each condition, the participants performed three tasks; we measured the time spent (DVtime) and the accuracy (DVacc) of these tasks. We adopted a within-group study design in which all participants went through all conditions. We used a balanced Latin square design [29] to minimize the learning effect.

6.1.2 Data and Tasks

We prepared three sets of evaluations from different solvers or versions of solvers to design tasks. The tasks were selected from the common workflow of QA researchers. Here is an example task and what it was meant to achieve for QA researchers:

“For questions for which the three solvers yield different answers, filter to the category “F/T/F” of the “is correct” angle. What is the category of “question type” that has the largest number of questions?” (This task is a combination of tasks T1, T3, and T6.)

The goal of this task is a combination of goals G2 and G3. QA researchers are interested in questions that solvers disagree on, and specifically look at those question types for which a solver outperforms others.

Figure 5: Screenshot of a highly customized Excel spreadsheet for the baseline condition. Users can perform rich interactions: for example, users can filter to a category or categories, and the summarized performance statistics of each category are pre-calculated and presented.

Figure 6: Aggregate metrics in the customized Excel spreadsheet for the categories of the ‘topics’ angle. Upon filtering, aggregate metrics are automatically updated.

6.1.3 Excel Spreadsheets

We created highly customized Excel spreadsheets that support prismatic analysis (Fig. 5). For example, as shown in Fig. 6, the aggregate metrics of each category are also pre-calculated and presented. A data table on the first sheet enables users to filter on angles. Upon filtering, aggregate metrics are automatically updated. However, unlike QSAnglyzer, the categories in the spreadsheet are not re-ordered after filtering. We believe this is a reasonable difference and limitation, since automatic re-ordering requires the writing of Excel-specific scripts, which for most users is not an option.

6.1.4 Participants

From the research institute and from the authors’ social networks, we recruited 13 non-experts who were not QA researchers but had experience in AI-related fields. The ages of the participants ranged from 21 to 40 (9 male, 3 female, and 1 other gender). Ten held post-graduate degrees in computer science or related fields.

6.1.5 Study Setup and Process

Each participant came to the study room individually. We described the context of the project to them, and then explained the two tools: QSAnglyzer and the customized Excel spreadsheets. Prior to audio and screen recording, the moderator secured participant permission. All tasks were performed on a 28-inch external monitor with a USB mouse and keyboard.

Prior to the formal sessions, the participants were given four practice tasks: two on each tool. We guided the participants in the use of both the Excel spreadsheets and QSAnglyzer, and ensured they understood the tasks. The formal study consisted of six sessions, one for each study condition. In each session, the participants were given one of the two tools preloaded with N evaluations. The number N depended on the condition of the session. During each session, the participants performed three tasks which were randomly selected from a pool of predefined tasks without replacement. Each task had a three-minute time limit. We ensured that participants spent no more than the allotted time on the tasks.

Figure 7: Average time and accuracy for test conditions. Statistical testing showed significant differences between tools (IVtool) on both DVtime (p < 0.01) and DVacc (p < 0.05).

During the practice session, the participants simply spoke their task answers aloud. Once the formal sessions started, the participants were required to submit their answers via a Google form in a window next to the tool. After performing each task, we helped participants restore the tool to its initial state with no filtering and sorting.

6.1.6 Analysis and Results

We analyzed the results of the lab study using a mixed-effects model analysis of variance. The model took the within-group variables (IVtool and IVNumOfEval) as factorial effects, but considered each participant as a random level drawn from a population [31]. For each condition, the time (DVtime) of each participant was the average time for the completion of three tasks, whereas the accuracy (DVacc) was the number of correctly answered tasks (up to three). The average time and accuracy for each condition are shown in Fig. 7.
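For readers who wish to reproduce this style of analysis, the sketch below uses Python’s statsmodels to fit a linear mixed-effects model with participant as a random effect, which approximates the mixed-effects analysis of variance described above; the column names and file are hypothetical, and the paper does not specify which statistical package was used.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical long-format data: one row per participant x condition, with columns
    # participant, tool ('Excel' or 'QSAnglyzer'), num_evals (2, 4, or 6),
    # time (average seconds over three tasks), and acc (0-3 correct tasks).
    df = pd.read_csv("lab_study_results.csv")

    # Fixed factorial effects for tool and number of evaluations (and their interaction),
    # random effect for participant.
    model = smf.mixedlm("time ~ C(tool) * C(num_evals)", df, groups=df["participant"])
    result = model.fit()
    print(result.summary())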

Statistical testing showed significant differences between the tools (IVtool) for both DVtime (p < 0.01) and DVacc (p < 0.05), but the difference between the numbers of loaded evaluations (IVNumOfEval) was not significant with respect to either time or accuracy. There was also no interaction effect between IVtool and IVNumOfEval. These results showed that QSAnglyzer more effectively and efficiently supports users performing the tasks. Interestingly, the results did not show significant differences between the three levels of IVNumOfEval. This may be because the tasks did not emphasize the comparison of multiple evaluations. Even though the results were not significant, during the study we did observe that when more evaluations were given, participants expressed difficulty in finding the target categories.

6.2 Use Cases from Expert Reviews

We invited five QA researchers at the research institute to test and review QSAnglyzer. Each spent an hour testing the tool, and we asked them to think aloud throughout the testing sessions. We pre-loaded the evaluations of the solvers they developed, and let them explore freely. The researchers all reacted positively and found numerous insights within an hour. They commented that being able to thoroughly examine questions from various angles enabled them to quickly form hypotheses and verify them. They also starred many questions that they were interested in for further inspection. Below we highlight some of the use cases and insights they found during the sessions.


Figure 8: User insights: (A) One of the QA researchers confirmed their solver was improving in the newer version (the right segments) by examining the ‘difference’ and ‘is correct’ angles. After filtering to the ‘different’ category, the QA researcher drew this conclusion by comparing the number of questions in the ‘X / T’ and ‘F / T’ categories with the number of questions in the ‘T / F’ category. (B) Another QA researcher discovered an unusual trend over time in the ‘water’ category under the ‘subtopic’ angle. The tool was loaded with seven different versions of a solver (in chronological order from left to right). The researcher filtered to the ‘the earth’ category in the ‘topic’ angle, and found this trend in the ‘subtopic’ angle. After further examination of the intermediate outputs, the QA researcher concluded that the earlier versions had higher accuracy in this category only by chance.

6.2.1 Investigating Gain and Loss of Accuracy after Changes

As noted in goal G2, QA researchers seek to examine how changes to a solver impact its behavior. The researchers performed this task using QSAnglyzer. For example, one researcher first filtered to the ‘different’ category in the ‘difference’ angle, and then investigated the ‘is correct’ angle (Fig. 8A) by comparing the height of the categories. The researcher then explained: “This (the “is correct” angle) shows that we have gained more than we lost after adding the new knowledge base... which is a good thing. It also shows that many questions our solver could not answer before (marked as “X” in the categories’ names) now have answers.”
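
For readers unfamiliar with the ‘is correct’ angle encoding, the sketch below shows one plausible way its category labels (e.g., ‘T / F’ or ‘X / T’) could be derived per question from two evaluations. The encoding (T = correct, F = incorrect, X = no answer) follows our reading of Fig. 8A; the function names and data layout are hypothetical illustrations, not QSAnglyzer's implementation.

```python
# Derive an 'is correct' angle category label for one question across two
# evaluations (old solver version vs. new solver version).
def correctness_mark(answer, gold):
    if answer is None:
        return "X"          # the solver produced no answer
    return "T" if answer == gold else "F"

def is_correct_category(old_answer, new_answer, gold):
    return f"{correctness_mark(old_answer, gold)} / {correctness_mark(new_answer, gold)}"

# Example: a question the old solver skipped and the new solver answers correctly.
print(is_correct_category(None, "B", "B"))  # -> "X / T"
```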

6.2.2 Uncovering Incidentally Correct Questions

In one session, the tool was loaded with the evaluations of seven versions of a specific solver, from left to right in chronological order. Fig. 8B shows a pattern discovered by a researcher in this session; the researcher learned that the solver’s overall performance gradually improved for questions on the topic “the earth” due to new knowledge added to the solver’s knowledge base. However, the researcher noticed that the subtopic “water” did not show the same improvement. By filtering for questions in the “water” category, the QA researcher concluded that the first couple of versions of the solver were answering this set of questions correctly only by chance. This insight was novel and thought-provoking to the QA researcher.

6.2.3 Discovering Previously Unknown Solver Behaviors

During the evaluation sessions, three QA researchers observed solver behavior that surprised them. For instance, because of the “num selected” angle, a researcher realized that an earlier version of his solver chose more than one answer choice for many questions (more than 30). He noted that this pattern is significant, since the tie between answer choices had been broken by newer versions of the solver. Another researcher commented that he did not know there were questions for which his solver did not choose any answer choice at all. He also noted that he discovered a potential bug using QSAnglyzer: he believed that the confidence scores over all answer choices to a question should sum to one, which was not always the case.
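
The kind of sanity check that surfaced this suspected bug could look like the following sketch. The dictionary format and function name are hypothetical; the only assumption taken from the session is that per-question confidence scores are expected to sum to one.

```python
# Flag questions whose answer-choice confidence scores do not sum to ~1.
import math

def flag_unnormalized(confidences_by_question, tol=1e-3):
    """Return the ids of questions with unnormalized confidence scores."""
    return [qid for qid, scores in confidences_by_question.items()
            if not math.isclose(sum(scores.values()), 1.0, abs_tol=tol)]

print(flag_unnormalized({"q1": {"A": 0.7, "B": 0.3},
                         "q2": {"A": 0.5, "B": 0.9}}))  # -> ['q2']
```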

6.2.4 Relating to Different Solvers

In addition to loading different versions of the same solver, we also showed evaluations of different solvers to the QA researchers. One found that a legacy solver that used an IR-based approach tended to perform better on “story”-type questions (a category of the “question types” angle). Another QA researcher compared two versions of a solver along with a legacy IR-based solver as a reference. By doing so, the QA researcher learned that the solver behaved much like the legacy IR-based solver: “I am interested in questions that used to be wrong, and now are correct, and want to see if the IR-based solver also answered correctly on those questions. (after the exploration and comparison)... I guess my solver now becomes more like an IR-based solver.”

6.2.5 Finding Tricky or Mislabeled Questions

Since QSAnglyzer can be used to quickly form complex queries on different angles, all researchers found questions that they considered tricky or bad (e.g., questions they thought had more than one correct answer). One found a few mislabeled questions, and was able to use the label modification functions in the ‘Question Table’ panel to correct them after the tool was deployed to their work environment.

7 DISCUSSION AND FUTURE WORK

In this section, we discuss how prismatic analysis and QSAnglyzer scaffold evaluation analysis. We then discuss potential extensions of the tool and other directions for future research.

7.1 Scaffolding Evaluation Analysis Workflows

Since QA and other AI systems are often highly complex, generating insights from evaluations requires scaffolding diverse scenarios for comparison and examination. Prismatic analysis provides a foundation that supports such scenarios through multi-angle categorization. The shared aggregate metrics across angles enable diverse scenarios for comparing categories. The QSAnglyzer design further supports the analysis workflows by leveraging the power of visual comparison and interactive filtering.
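
As a minimal illustration of these shared aggregate metrics, the sketch below computes, for one angle, the per-category accuracy in each evaluation, the number of questions, and the accuracy variance across evaluations. The long-format layout and column names ("question_id", "evaluation", "correct", and the angle column) are assumptions for the example, not QSAnglyzer's internal data model.

```python
# Aggregate metrics for one angle: per-evaluation accuracy per category,
# question count, and accuracy variance across evaluations.
import pandas as pd

def angle_summary(df, angle):
    # Accuracy of each category within each evaluation.
    per_eval = (df.groupby([angle, "evaluation"])["correct"]
                  .mean().unstack("evaluation"))
    summary = pd.DataFrame({
        "num_questions": df.groupby(angle)["question_id"].nunique(),
        "accuracy_variance": per_eval.var(axis=1),  # variance across evaluations
    })
    return summary.join(per_eval)

# Example usage (with a hypothetical evaluations_df table):
# print(angle_summary(evaluations_df, "topic"))
```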

From the lab study we found that when performing the comparison tasks using QSAnglyzer, users are faster and more accurate. From the expert reviews we witnessed QA researchers gain insight into their solvers within one-hour sessions. In addition, when multiple angles were spread across separate spreadsheets, as in our lab study, we found users often lost track and forgot which filters had been applied. In contrast, QSAnglyzer helped QA researchers maintain context by presenting angles and filters in the same view. This aligns with our design rationale R6.

7.2 Adding Evaluation-based or Automatic Angles

The current QSAnglyzer design includes only seven angles. Though these angles are defined based on the QA researchers' needs, there is room to add more. Three directions for future work are evaluation-dependent, semi-automatic, and fully-automatic angles. Evaluation-dependent angles are angles specific to one evaluation, unlike the current evaluation-related angles, which treat all evaluations equally. For example, solvers usually have features for intermediate processing, and we can group questions based on these features. However, features are evaluation-dependent and may differ drastically across evaluations, which requires further study of categorization and interface design. Semi-automatic angles are an augmented version of the ‘free labels’ angle: although QSAnglyzer supports manually created angles, researchers commented that they want the system to recommend similar questions when they assign a category label to a question. Fully-automatic angles would instead create categories using clustering algorithms. These three directions are not mutually exclusive; we can also create evaluation-dependent angles using clustering algorithms, or run topic modeling algorithms (e.g., [2]) to create automatically generated topics. Preliminary investigation in this direction suggests a challenge: the questions that interest researchers may drift over time, and it may be difficult to collect a set of questions that defines the categories researchers want to investigate in every evaluation. In addition, automatic algorithms may often produce uninterpretable categories, and thus may not be useful to users. We leave addressing these issues for future work.
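
To make the fully-automatic direction concrete, the sketch below clusters questions into topics with LDA [2] and uses each question's dominant topic as its category. This is only an illustration of the idea under assumed inputs (a list of question texts); it is not part of QSAnglyzer, and in practice the resulting topics may be uninterpretable, as noted above.

```python
# Sketch of a fully-automatic angle: dominant LDA topic per question.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

questions = [
    "Which process moves water from the ocean to the air?",
    "Which of these is a conductor of electricity?",
    "What gas do plants absorb during photosynthesis?",
]  # hypothetical question texts

counts = CountVectorizer(stop_words="english").fit_transform(questions)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
auto_angle = lda.transform(counts).argmax(axis=1)  # dominant topic per question
print(auto_angle)  # e.g., [0 1 0]
```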

7.3 Supporting Complex AI System Development

Evaluation analysis is critical in developing complex AI systems, yet it is only part of the development workflow. Many other parts of these workflows are under-studied and could be better supported. For instance, since each solver has many components, we could also build visual analytics tools to help researchers understand the interactions between QA system components. In addition, one QA researcher told us that QSAnglyzer is most useful once a solver is already reasonably accurate; when a solver is in its initial development stage, researchers may want to focus on getting one or two questions correct, rather than on reviewing overall patterns. These examples indicate the need for more holistic studies of complex AI system development, which could open up a design space for visual analytics tools.

7.4 Applying Prismatic Analysis to Other AI Domains

One key reason to use prismatic analysis is that the analysis target (e.g., evaluations) and the goal (e.g., improving a QA system) are complex, which necessitates examination from many angles and categorization to help discover patterns. Therefore, prismatic analysis is suitable not only for analyzing QA system evaluations, but also for any complex analysis target that can benefit from multi-angle categorization. For example, a potential use case for prismatic analysis is analyzing evaluations of a recommendation system: we can categorize user profiles from multiple angles, and determine how well the recommendation system performs on the various profile categories. However, since new development practices keep emerging for AI systems, more in-depth studies of user workflows are needed to apply prismatic analysis in different contexts. In addition, the QSAnglyzer design focuses on three aggregate metrics, one of which (Macc) has multiple measurements. Other contexts may call for different aggregate metrics, and the design of QSAnglyzer may need to be adjusted accordingly. Using prismatic analysis in different contexts and building visual analytics tools for those contexts are directions that call for further exploration.

8 CONCLUSION

Building AI systems is a fascinating but challenging problem for AI researchers, yet most work in HCI and VIS focuses only on the development and investigation of individual models. In this paper, we address the development of complex AI systems in the domain of QA systems. Specifically, we focus on the workflows of multiple-evaluation analysis. We worked closely with QA researchers to extract their goals and tasks in evaluation analysis, identifying challenges and design rationales that led to our proposal of prismatic analysis and QSAnglyzer. Although applied here to the QA domain, we envision that prismatic analysis, and likewise the design of QSAnglyzer, can be applied to other AI domains. This work opens a door and highlights the need for better support for AI development. Further studies should examine this design space and seek opportunities for visual analytics research.

ACKNOWLEDGMENTS

We wish to thank all the participants for their time and the anonymous reviewers for their helpful comments. This work was supported by the Allen Institute for Artificial Intelligence (AI2).

REFERENCES

[1] L. Bell. Artificial intelligence is now Intel's major focus, 2016. Retrieved from http://www.alphr.com/the-future/1004845/artificial-intelligence-is-now-intels-major-focus.

[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[3] E. Brill, S. Dumais, and M. Banko. An analysis of the AskMSR question-answering system. pp. 257–264. Association for Computational Linguistics, 2002.

[4] M. Brooks, S. Amershi, B. Lee, S. M. Drucker, A. Kapoor, and P. Simard. FeatureInsight: Visual support for error-driven feature ideation in text classification. In Proceedings of the 2015 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 105–112, 2015. doi: 10.1109/VAST.2015.7347637

[5] P. Clark, O. Etzioni, T. Khot, A. Sabharwal, O. Tafjord, P. Turney, and D. Khashabi. Combining retrieval, statistics, and inference to answer elementary science questions. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.

[6] Defense Advanced Research Projects Agency. Explainable Artificial Intelligence (XAI), 2016. Retrieved from http://www.darpa.mil/attachments/DARPA-BAA-16-53.pdf.

[7] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[8] D. Jurafsky and J. H. Martin. Speech and Language Processing, vol. 3. Pearson, 2014.

[9] A. Kapoor, B. Lee, D. Tan, and E. Horvitz. Performance and preferences: Interactive refinement of machine learning procedures. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

[10] J. Ko, E. Nyberg, and L. Si. A probabilistic graphical model for joint answer ranking in question answering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 343–350. ACM, 2007.

[11] N. Kushman, Y. Artzi, L. Zettlemoyer, and R. Barzilay. Learning to automatically solve algebra word problems. Association for Computational Linguistics, 2014.

[12] G. Lewis-Kraus. The Great A.I. Awakening, 2016. Retrieved from https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html.

[13] M. M. Malik, C. Heinzl, and M. E. Groeller. Comparative visualization for parameter studies of dataset series. IEEE Transactions on Visualization and Computer Graphics, 16(5):829–840, 2010. doi: 10.1109/TVCG.2010.20

[14] E. Muller, S. Gunnemann, I. Farber, and T. Seidl. Discovering multiple clustering solutions: Grouping objects in different views of the data. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), pp. 1207–1210. IEEE, 2012.

[15] T. Munzner. A nested model for visualization design and validation. IEEE Transactions on Visualization and Computer Graphics, 15(6):921–928, 2009. doi: 10.1109/TVCG.2009.111

[16] T. Munzner. Visualization Analysis and Design. AK Peters Visualization Series. CRC Press, 2014.

[17] T. Nocke, M. Flechsig, and U. Bohm. Visual exploration and evaluation of climate-related simulation data. In 2007 Winter Simulation Conference, pp. 703–711, 2007. doi: 10.1109/WSC.2007.4419664

[18] S. Russell, D. Dewey, and M. Tegmark. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105–114, 2015.

[19] M. J. Seo, H. Hajishirzi, A. Farhadi, and O. Etzioni. Diagram understanding in geometry questions. In AAAI, pp. 2831–2838, 2014.

[20] B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the 1996 IEEE Symposium on Visual Languages, pp. 336–343. IEEE, 1996.

[21] D. Smilkov and S. Carter. Tensorflow neural network playground, 2016. Retrieved from http://playground.tensorflow.org.

[22] J. Stasko and E. Zhang. Focus+context display and navigation techniques for enhancing radial, space-filling hierarchy visualizations. In Proceedings of the 2000 IEEE Symposium on Information Visualization, pp. 57–65. IEEE, 2000.


[23] P. Stone, R. Brooks, E. Brynjolfsson, R. Calo, O. Etzioni, G. Hager, J. Hirschberg, S. Kalyanakrishnan, E. Kamar, S. Kraus, K. Leyton-Brown, D. Parkes, W. Press, A. Saxenian, J. Shah, M. Tambe, and A. Teller. "Artificial Intelligence and Life in 2030" - One Hundred Year Study on Artificial Intelligence: Report of the 2015-2016 Study Panel. Technical report, Stanford University, Stanford, CA, September 2016. Retrieved from http://ai100.stanford.edu/2016-report.

[24] J. Talbot, B. Lee, A. Kapoor, and D. S. Tan. EnsembleMatrix: Interactive visualization to support machine learning with multiple classifiers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '09, pp. 1283–1292. ACM, 2009. doi: 10.1145/1518701.1518895

[25] A. Tatu, F. Maaß, I. Farber, E. Bertini, T. Schreck, T. Seidl, and D. Keim. Subspace search and visualization to make sense of alternative clusterings in high-dimensional data. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 63–72. IEEE, 2012.

[26] E. M. Voorhees et al. The TREC-8 Question Answering Track Report. In TREC, vol. 99, pp. 77–82, 1999.

[27] D. Wang, L. Boytsov, J. Araki, A. Patel, J. Gee, Z. Liu, E. Nyberg, and T. Mitamura. CMU Multiple-choice Question Answering System at NTCIR-11 QA-Lab. In NTCIR, 2014.

[28] C. Ware. Information Visualization: Perception for Design. Elsevier, 2012.

[29] E. Williams. Experimental designs balanced for the estimation of residual effects of treatments. Australian Journal of Chemistry, 2(2):149–168, 1949.

[30] A. T. Wilson and K. C. Potter. Toward visual analysis of ensemble data sets. In Proceedings of the 2009 Workshop on Ultrascale Visualization, UltraVis '09, pp. 48–53. ACM, 2009. doi: 10.1145/1838544.1838551

[31] R. D. Wolfinger, R. D. Tobias, and J. Sall. Mixed models: A future direction. In Proceedings of the Sixteenth Annual SAS Users Group Conference, SAS Institute Inc., Cary, NC, pp. 1380–1388, 1991.

[32] C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.

[33] J. Zhao, X. Xie, X. Xu, and S. Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017.

[34] Z. Zheng. AnswerBus question answering system. In Proceedings of the Second International Conference on Human Language Technology Research, pp. 399–404. Morgan Kaufmann Publishers Inc., 2002.