


Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 280–289, August 1st - August 6th, 2021.

©2021 Association for Computational Linguistics


ExplainaBoard: An Explainable Leaderboard for NLP

Pengfei Liu1†, Jinlan Fu2, Yang Xiao2, Weizhe Yuan1, Shuaichen Chang3, Junqi Dai2, Yixin Liu1, Zihuiwen Ye1, Graham Neubig1‡

1 Carnegie Mellon University, 2 Fudan University, 3 The Ohio State University; †[email protected], ‡[email protected]

Abstract

With the rapid development of NLP research, leaderboards have emerged as one tool to track the performance of various systems on various NLP tasks. They are effective in this goal to some extent, but generally present a rather simplistic one-dimensional view of the submitted systems, communicated only through holistic accuracy numbers. In this paper, we present a new conceptualization and implementation of NLP evaluation: the ExplainaBoard, which, in addition to inheriting the functionality of the standard leaderboard, also allows researchers to (i) diagnose strengths and weaknesses of a single system (e.g. what is the best-performing system bad at?), (ii) interpret relationships between multiple systems (e.g. where does system A outperform system B? What if we combine systems A, B and C?), and (iii) examine prediction results closely (e.g. what are common errors made by multiple systems, or in what contexts do particular errors occur?). So far, ExplainaBoard covers more than 400 systems, 50 datasets, 40 languages, and 12 tasks [1]. We not only release an online platform [2] but also make our evaluation tool available as an API, released under the MIT License on GitHub [3] and PyPI [4], that allows users to conveniently assess their models offline. We additionally release all output files from systems that we have run or collected, to motivate "output-driven" research in the future.

1 Introduction

Natural language processing (NLP) research has been and is making astounding strides forward.

[1] ExplainaBoard is kept up to date and was recently upgraded to support (1) a multilingual multi-task benchmark and (2) a more complicated task, machine translation, both of which were also suggested by reviewers.

[2] http://explainaboard.nlpedia.ai/
[3] https://github.com/neulab/explainaBoard
[4] https://pypi.org/project/interpret-eval/

Figure 1: Illustration of the ExplainaBoard concept. Compared to vanilla leaderboards, ExplainaBoard allows users to perform interpretable (single-system, pairwise analysis, data bias), interactive (system combination, fine-grained/common error analysis), and reliable analysis (confidence interval, calibration) on systems in which they are interested. "Comb." denotes "combination" and "Errs" represents "errors". "PER, LOC, ORG" refer to different labels.

This is true both for classical tasks such as machine translation (Sutskever et al., 2014; Wu et al., 2016), as well as for new tasks (Lu et al., 2016; Rajpurkar et al., 2016), domains (Beltagy et al., 2019), and languages (Conneau and Lample, 2019). One way this progress is quantified is through leaderboards, which report and update performance numbers of state-of-the-art systems on one or more tasks. Some prototypical leaderboards include GLUE and SuperGLUE for natural language understanding (Wang et al., 2018, 2019), XTREME and XGLUE (Hu et al., 2020; Liang et al., 2020) for multilingual understanding, the WMT shared tasks (Barrault et al., 2020) for machine translation, and GEM and GENIE for natural language generation (Gehrmann et al., 2021; Khashabi et al., 2021), among many others.

These leaderboards serve an important purpose:


they provide a standardized evaluation setup, often on multiple tasks, that eases reproducible model comparison across organizations. However, at the same time, due to the prestige imbued by a top, or high, place on a leaderboard, they can also result in a singular focus on raising evaluation numbers at the cost of deeper scientific understanding of model properties (Ethayarajh and Jurafsky, 2020). In particular, we argue that, among others, the following are three major limitations of the existing leaderboard paradigm:

• Interpretability: Most existing leaderboards commonly use a single number to summarize system performance holistically. This is conducive to system ranking, but at the same time the results are opaque, making the strengths and weaknesses of systems less interpretable.

• Interactivity: Existing leaderboards are static and non-interactive, which limits the ability of users to dig deeper into the results. Thus, (1) they usually do not flexibly support more complex evaluation settings (e.g. multi-dataset, multi-metric, multi-language), and (2) users may miss opportunities to understand the relationships between different systems. For example, where does model A outperform model B? Would the performance be further improved if we combined the top-3 models?

• Reliability: Given the increasing role that leaderboards have taken in guiding NLP research, it is important that the information expressed in them is reliable, especially on datasets with small sample sizes, but most current leaderboards do not give an idea of the reliability of system rankings.

In this paper, we describe ExplainaBoard (see Fig. 1), a software package and hosted leaderboard that satisfies all of the above desiderata. It also serves as a prototype implementation of some desirable features that may be included in future leaderboards, even independent of the provided software itself. We have deployed ExplainaBoard for 9 different tasks and 41 different datasets, and demonstrate how it can be easily adapted to new tasks of interest.

We expect that ExplainaBoard will benefit different steps of the research process:

(i) System Development: ExplainaBoard provides more detailed information regarding the submitted systems (e.g. fine-grained results, confidence intervals), allowing system developers to diagnose successes and failures of their own systems, or compare their systems with baselines and understand where the improvements of their proposed methods come from. This better understanding can lead to more efficient and effective system improvements. Additionally, ExplainaBoard can help system developers uncover their systems' advantages over others, even when these systems have not achieved state-of-the-art performance holistically.

(ii) Leaderboard Organization: The ExplainaBoard software both provides a ready-made platform for easy leaderboard development over different NLP tasks, and helps upgrade traditional leaderboards to allow for more fine-grained analysis. For example, we have already established an ExplainaBoard [5] for the existing XTREME benchmark [6].

(iii) Broad Analysis and Understanding: Because ExplainaBoard encourages system developers to provide their system outputs in an easy-to-analyze format, these outputs will also help researchers, particularly those just starting in a particular NLP sub-field, get a broad sense of what current state-of-the-art models can and cannot do. This not only helps them quickly track the progress of different areas, but also allows them to understand the relative advantages of diverse systems, suggesting insightful ideas for what's left and what's next [7].

2 ExplainaBoard

As stated above, ExplainaBoard extends existing leaderboards, improving their interpretability, interactivity, and reliability. It does so by providing a number of functionalities that are applicable to a wide variety of NLP tasks (as illustrated in Tab. 1). Many of these functionalities are grounded in existing research on evaluation and fine-grained diagnostics.

2.1 Interpretability

Interpretable evaluation (Popovic and Ney, 2011; Stymne, 2011; Neubig et al., 2019; Fu et al., 2020a) is a research area that considers methods that break down the holistic performance of each system into different interpretable groups. For example, in a Named Entity Recognition (NER) task, we may examine the accuracy along different dimensions of a concerned entity (such as "entity frequency," telling us how well the model does on entities that appear in the training data a certain number of times) or sentences (such as "sentence length," telling us how well the model does on entities that appear in longer or shorter sentences) (Fu et al., 2020a).

[5] http://explainaboard.nlpedia.ai/leaderboard/xtreme/
[6] https://sites.research.google/xtreme/
[7] Since the first release of ExplainaBoard, we have received invitations from multiple companies, startups, and researchers to collaborate, and we are working together to make it better for the community.


Aspect: Interpretability
• Single-system Analysis (input: one model). Performance Histogram: the input model is good at dealing with short entities, while achieving lower performance on long entities.
• Pairwise Analysis (input: two models, M1 and M2). Performance Gap Histogram (M1 − M2): M1 is better at dealing with short entities, while M2 is better at dealing with long entities.
• Data Bias Analysis (input: multiple datasets). Data Bias Chart: for the entity length attribute, the average entity length (we average the lengths of all test entities on a given dataset) of these datasets, ordered descending, is BN > BC > CN03 > WB.

Aspect: Interactivity
• Fine-grained Error Analysis (input: single- or pairwise-system diagnostic results). Error Table: error analysis allows the user to print out the entities that are incorrectly predicted by the given model, as well as the true label of the entity, the mispredicted label, and the sentence where the entity is located.
• System Combination (input: multiple models M1, M2, M3). Ensemble Chart: the combined result of models M1, M2, and M3 is shown by the histogram bar labeled "comb"; the combined result is better than the single models.

Aspect: Reliability
• Confidence (input: one model). Error Bars: the error bars represent 95% confidence intervals of the performance on each specific bucket.
• Calibration (input: one model). Reliability Diagram: confidence histograms (red) and reliability diagrams (blue) that indicate the accuracy of model probability estimates.

Table 1: A graphical breakdown of the functionality of ExplainaBoard, with examples from an NER task.

This makes it possible to understand where models do well and poorly, leading to further insights beyond those that can be gleaned from holistic evaluation numbers. Applying this to a new task involves the following steps: (i) Attribute definition: define attributes by which we can partition the test set into different groups. (ii) Bucketing: partition the test set into different buckets based on the defined attributes and calculate performance w.r.t. each bucket.
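To make these two steps concrete, here is a minimal, self-contained sketch of attribute bucketing for NER entities (our illustration, not ExplainaBoard's actual implementation; the bucket_accuracy helper, bucket edges, and toy entities are all hypothetical):

```python
from collections import defaultdict

def bucket_accuracy(entities, max_edge=4):
    """Partition test entities into attribute buckets and score each bucket.

    `entities` holds (attribute_value, gold_label, predicted_label) triples;
    attribute values 1, 2, 3 get their own buckets and everything else falls
    into a ">=4" bucket, mirroring the eLen histogram in Tab. 1.
    """
    buckets = defaultdict(list)
    for value, gold, pred in entities:
        key = str(value) if value < max_edge else f">={max_edge}"
        buckets[key].append(gold == pred)  # per-entity correctness
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# Toy data: entity length in tokens, gold label, predicted label.
test_entities = [(1, "PER", "PER"), (2, "ORG", "ORG"), (3, "LOC", "ORG"),
                 (4, "ORG", "ORG"), (5, "ORG", "O")]
print(bucket_accuracy(test_entities))  # {'1': 1.0, '2': 1.0, '3': 0.0, '>=4': 0.5}
```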

Generally, previous work on interpretable evaluation has been performed over single tasks, while ExplainaBoard allows for comprehensive evaluation of different types of tasks in a single software package. We concretely show several ways interpretable evaluation can be defined within ExplainaBoard below:

F1 [8]: Single-system Analysis: What is a system good or bad at? For an individual system as input, generate a performance histogram that highlights the buckets where it performs well or poorly. For example, in Tab. 1 we demonstrate an example from NER where the input system does worse at dealing with longer entities (eLen ≥ 4).

F2: Pairwise Analysis: Where is one system better (worse) than another? Given a pair of systems, interpret where the performance gap occurs. Researchers can flexibly choose two systems they are interested in (e.g. selecting two rows from the leaderboard), and ExplainaBoard will output a performance gap histogram that describes how the performance differences change over different buckets of different attributes. Tab. 1 demonstrates how we can see that one system is better than the other at longer or shorter entities.

[8] “F” represents “Functionality”.
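As a rough sketch of the computation behind the gap histogram (the gap_histogram helper and the numbers below are hypothetical, not ExplainaBoard's API), the per-bucket difference can be as simple as:

```python
def gap_histogram(bucket_f1_a, bucket_f1_b):
    """Map each bucket name to the F1 of system A minus that of system B."""
    return {bucket: bucket_f1_a[bucket] - bucket_f1_b[bucket]
            for bucket in bucket_f1_a}

# Toy per-bucket F1 scores for two systems, keyed by entity-length bucket.
f1_a = {"1": 0.95, "2": 0.92, "3": 0.88, ">=4": 0.70}
f1_b = {"1": 0.93, "2": 0.91, "3": 0.89, ">=4": 0.78}
print(gap_histogram(f1_a, f1_b))  # A wins on short entities, loses on long ones
```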



F3: Data Bias Analysis: What are the characteristics of different evaluated datasets? The defined attributes not only help us interpret system performance, but also make it possible for users to take a closer look at the characteristics of diverse datasets. For example, Tab. 1 shows an example of analyzing differences in average entity length across several datasets.

2.2 Interactivity

ExplainaBoard also allows users to dig deeper, interacting with the results in more complex ways.

F4: Fine-grained Error Analysis: What are common mistakes that most systems make, and where do they occur? ExplainaBoard provides flexible fine-grained error analyses based on the above-described performance evaluation (a minimal sketch of common-error extraction follows the list below):

1. Users can choose multiple systems and see their common error cases, which can be useful to identify challenging samples or even annotation errors.

2. In single-system analysis, users can choose particular buckets in the performance histogram [9] and see the corresponding error samples in that bucket (e.g. which long entities does the current system mispredict?).

3. In pairwise analysis, users can select a bucket, and the unique errors (e.g. system A succeeds while B fails, and vice versa) of the two models will be displayed.
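A minimal sketch of the first use case, intersecting the error sets of several chosen systems, might look as follows; the error-identifier layout (sentence id, span text, gold label) is our assumption, not ExplainaBoard's internal representation:

```python
def common_errors(system_errors):
    """`system_errors`: one set of error identifiers per system; returns the
    cases that every selected system gets wrong."""
    return set.intersection(*system_errors)

# Each error is identified by (sentence_id, entity_span, gold_label).
errors_a = {("s1", "Bolton", "ORG"), ("s2", "NFL", "ORG")}
errors_b = {("s1", "Bolton", "ORG"), ("s3", "NCAA", "ORG")}
print(common_errors([errors_a, errors_b]))  # -> {('s1', 'Bolton', 'ORG')}
```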

F5: System Combination: Is there potential complementarity between different systems? System combination (Ting and Witten, 1997; González-Rubio et al., 2011; Duh et al., 2011) is a technique to improve performance by combining the outputs of multiple existing systems. In ExplainaBoard, users can choose multiple systems and obtain combined results calculated by voting over the multiple base systems [10]. In practice, for the NER task we use the recently proposed SpanNER (Fu et al., 2021) as a combiner, and for text summarization we employ Refactor, a state-of-the-art ensemble approach (Liu et al., 2021). For the other tasks, we adopt majority voting for system combination.
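For the tasks that fall back to plain voting, a minimal sketch of token-level majority voting could look like the following (an illustration under our own assumptions, not the SpanNER or Refactor combiners):

```python
from collections import Counter

def majority_vote(predictions):
    """`predictions`: one label sequence per system (all the same length);
    returns the per-token majority label, ties going to the earliest system."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions)]

m1 = ["B-ORG", "O", "B-PER"]
m2 = ["B-ORG", "O", "O"]
m3 = ["B-LOC", "O", "B-PER"]
print(majority_vote([m1, m2, m3]))  # -> ['B-ORG', 'O', 'B-PER']
```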

[9] Each bin of the performance histogram is clickable, returning an error case table.

[10] With the system combination button of ExplainaBoard, we observed that the state-of-the-art performance on some tasks (e.g. NER, chunking) can be further improved.


2.3 Reliability

The experimental conclusions obtained from evaluation metrics are not necessarily statistically reliable, especially when the experimental results can be affected by many factors. ExplainaBoard also makes a step towards more reliable interpretable evaluation.

F6: Confidence Analysis: To what extent can we trust the results of our system? ExplainaBoard can perform confidence analysis over both holistic and fine-grained performance metrics. As shown in Tab. 1, for each bucket there is an error bar whose width reflects how reliable the performance value is. We claim this is an important feature for fine-grained analysis: since the numbers of test samples in each bucket are imbalanced, the confidence interval tells us how much uncertainty there is. In practice, we use the bootstrap method (Efron, 1992; Ye et al., 2021) to calculate the confidence interval.
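As a sketch of the percentile bootstrap over a bucket's per-sample correctness scores (our illustration; ExplainaBoard's exact resampling setup may differ):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Return a (1 - alpha) percentile confidence interval for the mean of
    `scores`, estimated by resampling with replacement."""
    rng = random.Random(seed)
    means = sorted(sum(rng.choices(scores, k=len(scores))) / len(scores)
                   for _ in range(n_resamples))
    return (means[int(alpha / 2 * n_resamples)],
            means[int((1 - alpha / 2) * n_resamples) - 1])

bucket_scores = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]  # correctness per test sample
print(bootstrap_ci(bucket_scores))  # a wide interval for a small bucket
```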

F7: Calibration Analysis: How well is the confidence of a prediction calibrated with its correctness? One commonly-cited issue with modern neural predictors is that their probability estimates are not accurate (i.e. they are poorly calibrated), often being over-confident in the correctness of their predictions (Guo et al., 2017). We also incorporate this feature into ExplainaBoard, allowing users to evaluate how well-calibrated their systems of interest are.
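As a sketch of the computation that underlies such a reliability diagram (the reliability_bins helper is hypothetical; the interface's exact binning may differ), predictions are binned by confidence, and per-bin accuracy is compared against average confidence, which also yields the expected calibration error (ECE):

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Bin predictions by confidence; return per-bin (avg confidence,
    accuracy) pairs and the expected calibration error."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    stats, ece = [], 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        stats.append((avg_conf, accuracy))
        ece += len(b) / len(confidences) * abs(avg_conf - accuracy)
    return stats, ece

confs = [0.95, 0.90, 0.80, 0.60, 0.99, 0.70]  # model probability estimates
hits = [1, 1, 0, 1, 1, 0]                     # whether each prediction was right
print(reliability_bins(confs, hits))
```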

3 Tasks, Datasets and Systems

We have already added to ExplainaBoard 12 NLP tasks, 50 datasets, and 400 models [11], which cover many or most of the top-scoring systems on these tasks. We briefly describe them below, and show high-level statistics in Tab. 2.

Text Classification: Prediction of one or multiple pre-defined label(s) for a given input text. The current interface includes datasets for sentiment classification (Pang et al., 2002), topic identification (Wang and Manning, 2012), and intention detection (Chen et al., 2013).

[11] 265 of these models are implemented by us, as unfortunately it is currently not standard in NLP to release the system outputs that ExplainaBoard needs.



Text-Span Classification: Prediction of a pre-defined class from the input of a text and a span, such as the aspect-based sentiment classification task (Pappas and Popescu-Belis, 2014). We collect top-performing system outputs from Dai et al. (2021).

Text Pair Classification: Prediction of a class given two texts, such as the natural language inference task (Bowman et al., 2015).

Sequence Labeling: Prediction of a label for each token in a sequence. ExplainaBoard currently includes four concrete tasks: named entity recognition (Tjong Kim Sang and De Meulder, 2003), part-of-speech tagging (Toutanova et al., 2003), text chunking (Ando and Zhang, 2005), and Chinese word segmentation (Chen et al., 2015).

Structure Prediction: Prediction of a syntactic or semantic structure from text, where ExplainaBoard currently covers semantic parsing tasks (Berant et al., 2013; Yu et al., 2018).

Text Generation: ExplainaBoard also considers text generation tasks, and currently focuses mainly on conditional text generation, for example text summarization (Rush et al., 2015; Liu and Lapata, 2019) and machine translation. System outputs for text summarization are expanded from a previous collection (Bhandari et al., 2020) as well as recent state-of-the-art systems (Liu and Liu, 2021), while outputs for machine translation are collected from WMT20 [12].

[12] http://www.statmt.org/wmt20/metrics-task.html

4 Case Study

Here, we briefly showcase the actual ExplainaBoard interface through a case study on analyzing state-of-the-art NER systems.

4.1 Experimental Setup

Attribute Definition: We define attributes following Fu et al. (2020a), and three of them are used below: entity length, sentence length, and label of entity.

Collection of System Outputs: Currently, we collect system outputs either by implementing the models ourselves or by collecting them from other researchers (Fu et al., 2020b; Schweter and Akbik, 2020; Yamada et al., 2020). Using these methods, we have gathered 74 models on six NER datasets with system output information.


Task                      Subtask            Data   Model   Attr.
Text Classification       Sentiment             8      40       2
                          Topic                 4      18       2
                          Intention             1       3       2
Text-Span Classification  Aspect Sentiment      4      20       4
Text Pair Classification  NLI                   2       6       7
Sequence Labeling         NER                   3      74       9
                          POS                   3      14       4
                          Chunking              3      14       9
                          CWS                   7      64       7
Structure Pred.           Semantic Parsing      4      12       4
Text Generation           Summarization         2      36       7
                          Translation           4      60       9

Table 2: Brief descriptions of tasks, datasets and systems that ExplainaBoard currently supports. "Attr." denotes "Attribute"; "Pred." denotes "Prediction".


4.2 Analysis using ExplainaBoard

Fig. 2 illustrates different types of results driven by four functionality buttons [13] over the top-3 NER systems: LUKE (Yamada et al., 2020), FLERT (Schweter and Akbik, 2020), and FLAIR (Akbik et al., 2019).

Box A breaks down the performance of the top-1 system over different attributes [14]. We can intuitively observe that even the state-of-the-art system does worse on longer entities. Users can further print error cases in the longer-entity bucket by clicking the corresponding bin.

Box B shows the 1st system's (LUKE) performance minus the 2nd system's (FLERT) performance. We can see that although LUKE surpasses FLERT holistically, it performs worse when dealing with PERSON entities.

Box C identifies samples that all systems mispredict. Further analysis of these samples uncovers challenging patterns or annotation errors.

Box D examines potential complementarity among these top-3 systems. The result shows that, by a simple voting ensemble strategy, a new state-of-the-art result (94.65 F1) has been achieved on the CoNLL-2003 dataset.

[13] As it is relatively challenging to define calibration in structure prediction tasks, this feature is currently only provided for classification tasks. We will explore more in the future.

[14] Due to the page limitation, we only show three.


Figure 2: An example of the actual ExplainaBoard interface for NER over three top-performing systems on the CoNLL-2003 dataset. Box A shows the single-system analysis results obtained when users select the top-1 system and click the "Single Analysis" button. Box B shows the pairwise analysis results when the top-2 systems are chosen and "Pair Analysis" is clicked. Users can click any bin of the histogram, which results in a fine-grained error case table. Box C represents a table with common errors of these top-3 systems. Box D illustrates the combined result of the top-3 systems.


5 Usage

Example Use-cases: To show the practical utility of ExplainaBoard, we first present examples of how it has been used as an analysis tool in existing published research papers. Fu et al. (2020b) (Tab. 4) utilize single-system analysis with the attribute of label consistency for the NER task, while Zhong et al. (2019) (Tab. 4-5) use it for text summarization with the attributes of density and compression. Fig. 4 and Tab. 3 in Fu et al. (2020a) leverage the data bias analysis and pairwise system diagnostics to interpret top-performing NER systems, while Tab. 4 in Fu et al. (2020c) uses single- and pairwise-system analysis to investigate what's next for the Chinese Word Segmentation task. Liu et al. (2021) use the system combination functionality to perform ensemble analysis of summarization systems, and Fig. 1 in Ye et al. (2021) uses the reliability analysis functionality to observe how confidence intervals change in different buckets of a performance histogram.

Using ExplainaBoard Researchers can use EXPLAINABOARD in different ways: (i) We maintain a website where each task-specific EXPLAINABOARD allows researchers to interact with it, interpreting the systems and datasets they are interested in from different perspectives. (ii) We also release our back-end code for different NLP tasks so that researchers can flexibly apply it to their own system outputs in support of their research projects.
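As a concrete (and again purely illustrative) example of route (ii), processing one's own outputs mainly requires decoding a tagged file into the span sets analyzed in the sketch above. A minimal BIO decoder, an assumption about a common output format rather than the toolkit's actual reader, might look like this:

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, int, str]


def bio_to_spans(sent_id: int, tags: List[str]) -> Set[Entity]:
    """Decode one sentence's BIO tags into (sent_id, start, end, label) spans."""
    spans: Set[Entity] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last span
        starts_new = tag.startswith("B-")
        continues = tag.startswith("I-") and tag[2:] == label and start is not None
        if start is not None and not continues:
            spans.add((sent_id, start, i - 1, label))
            start, label = None, None
        if starts_new:
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, label = i, tag[2:]  # tolerate stray I- tags in noisy outputs
    return spans


# e.g. bio_to_spans(0, ["B-PER", "I-PER", "O", "B-LOC"])
# -> {(0, 0, 1, "PER"), (0, 3, 3, "LOC")}
```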

Contributing to ExplainaBoard The community can contribute to EXPLAINABOARD in several ways: (i) submit system outputs of their implemented models; (ii) add more informative attributes for different NLP tasks; (iii) add new datasets or benchmarks for existing or new tasks.
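For contribution route (i), NER outputs are conventionally exchanged as CoNLL-style files, one token per line with its gold and predicted tags and a blank line between sentences. The writer below illustrates that convention only; the exact columns EXPLAINABOARD expects are defined in the GitHub README, so treat this as an assumption rather than a specification.

```python
from typing import Iterable, List, Tuple

# One token row: (token, gold_tag, predicted_tag).
Row = Tuple[str, str, str]


def write_system_output(path: str, sentences: Iterable[List[Row]]) -> None:
    """Write tab-separated token/gold/pred rows with a blank line per sentence."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for token, gold, pred in sentence:
                f.write(f"{token}\t{gold}\t{pred}\n")
            f.write("\n")


# e.g. write_system_output("my_conll03_output.tsv",
#                          [[("EU", "B-ORG", "B-ORG"), ("rejects", "O", "O")]])
```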

6 Implications and Roadmap

EXPLAINABOARD presents a new paradigm in leaderboard development for NLP. This is just the beginning of its development, and there are many future directions.


Research Revolving around System Outputs15 Due to its ability to analyze, contrast, and combine results from many systems, EXPLAINABOARD incentivizes researchers to submit their results to better understand them and to showcase their systems' strengths. At the same time, EXPLAINABOARD will serve as a central repository for system outputs across many tasks, opening future avenues of research into cross-system analysis and system combination.

Enriching ExplainaBoard with Glass-box Analysis EXPLAINABOARD currently performs black-box analysis, solely analyzing system outputs without accessing model internals. On the other hand, there are many glass-box interpretability tools that do look at model internals, such as AllenNLP Interpret (Wallace et al., 2019) and the Language Interpretability Tool (Tenney et al., 2020). Expanding leaderboards with glass-box analysis methods (see Lipton (2018); Belinkov and Glass (2019) for surveys) is an interesting direction for future work.

In the future, we aim to improve applicability and usefulness through the following action items: (1) collaborate with the organizers of leaderboards for diverse tasks and set up corresponding EXPLAINABOARDs for them; (2) cover more tasks, datasets, and models, as well as more functionality.

Acknowledgements

We thank all reviewers for their valuable comments and the authors who shared their system outputs with us: Ikuya Yamada, Stefan Schweter, Colin Raffel, Yang Liu, and Li Dong. We also thank Vijay Viswanathan, Yiran Chen, and Hiroaki Hayashi for useful discussion and feedback about EXPLAINABOARD.

The work was supported in part by the Air Force Research Laboratory under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.

15 We release the system outputs collected in EXPLAINABOARD at: http://explainaboard.nlpedia.ai/download.html

References

Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 724–728, Minneapolis, Minnesota. Association for Computational Linguistics.

Rie Ando and Tong Zhang. 2005. A high-performance semi-supervised learning method for text chunking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 1–9, Ann Arbor, Michigan. Association for Computational Linguistics.

Loïc Barrault, Magdalena Biesialska, Ondrej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešic, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics.

Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig. 2020. Re-evaluating evaluation in text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9347–9359, Online. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015. Long short-term memory neural networks for Chinese word segmentation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1197–1206, Lisbon, Portugal. Association for Computational Linguistics.

Zhiyuan Chen, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. 2013. Identifying intention posts in discussion forums. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1041–1050, Atlanta, Georgia. Association for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 7057–7067.

Junqi Dai, Hang Yan, Tianxiang Sun, Pengfei Liu, and Xipeng Qiu. 2021. Does syntax matter? A strong baseline for aspect-based sentiment analysis with RoBERTa. In The 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Kevin Duh, Katsuhito Sudoh, Xianchao Wu, Hajime Tsukada, and Masaaki Nagata. 2011. Generalized minimum Bayes risk system combination. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 1356–1360.

Bradley Efron. 1992. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pages 569–593. Springer.

Kawin Ethayarajh and Dan Jurafsky. 2020. Utility is in the eye of the user: A critique of NLP leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4846–4853, Online. Association for Computational Linguistics.

Jinlan Fu, Xuanjing Huang, and Pengfei Liu. 2021. SpanNER: Named entity re-/recognition as span prediction. In Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), Virtual.

Jinlan Fu, Pengfei Liu, and Graham Neubig. 2020a. Interpretable multi-dataset evaluation for named entity recognition. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6058–6069, Online. Association for Computational Linguistics.

Jinlan Fu, Pengfei Liu, and Qi Zhang. 2020b. Rethinking generalization of neural models: A named entity recognition case study. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7732–7739. AAAI Press.

Jinlan Fu, Pengfei Liu, Qi Zhang, and Xuanjing Huang. 2020c. RethinkCWS: Is Chinese word segmentation a solved task? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5676–5686, Online. Association for Computational Linguistics.

Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, et al. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. arXiv preprint arXiv:2102.01672.

Jesús González-Rubio, Alfons Juan, and Francisco Casacuberta. 2011. Minimum Bayes-risk system combination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1268–1277, Portland, Oregon, USA. Association for Computational Linguistics.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pages 4411–4421. PMLR.

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, and Daniel S. Weld. 2021. GENIE: A leaderboard for human-in-the-loop evaluation of text generation. arXiv preprint arXiv:2101.06561.

Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, et al. 2020. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. arXiv preprint arXiv:2004.01401.

Zachary C. Lipton. 2018. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Yixin Liu and Pengfei Liu. 2021. SimCLS: A simple framework for contrastive learning of abstractive summarization. In Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), Virtual.

Yixin Liu, Zi-Yi Dou, and Pengfei Liu. 2021. RefSum: Refactoring neural summarization. In The 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 289–297.

Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, and Xinyi Wang. 2019. compare-mt: A tool for holistic comparison of language generation systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 35–41, Minneapolis, Minnesota. Association for Computational Linguistics.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 79–86. Association for Computational Linguistics.

Nikolaos Pappas and Andrei Popescu-Belis. 2014. Explaining the stars: Weighted multiple-instance learning for aspect-based sentiment analysis. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 455–466, Doha, Qatar. Association for Computational Linguistics.

Maja Popovic and Hermann Ney. 2011. Towards automatic error analysis of machine translation output. Computational Linguistics, 37(4):657–688.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.

Stefan Schweter and Alan Akbik. 2020. FLERT: Document-level features for named entity recognition. arXiv preprint arXiv:2011.06993.

Sara Stymne. 2011. Blast: A tool for error analysis of machine translation output. In Proceedings of the ACL-HLT 2011 System Demonstrations, pages 56–61, Portland, Oregon. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.

Ian Tenney, James Wexler, Jasmijn Bastings, Tolga Bolukbasi, Andy Coenen, Sebastian Gehrmann, Ellen Jiang, Mahima Pushkarna, Carey Radebaugh, Emily Reif, et al. 2020. The Language Interpretability Tool: Extensible, interactive visualizations and analysis for NLP models. arXiv preprint arXiv:2008.05122.

Kai Ming Ting and Ian H. Witten. 1997. Stacked generalizations: When does it work? In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, IJCAI 97, Nagoya, Japan, August 23-29, 1997, 2 Volumes, pages 866–873. Morgan Kaufmann.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 252–259.

Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, and Sameer Singh. 2019. AllenNLP Interpret: A framework for explaining predictions of NLP models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 7–12, Hong Kong, China. Association for Computational Linguistics.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Sida Wang and Christopher Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 90–94, Jeju Island, Korea. Association for Computational Linguistics.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.

Zihuiwen Ye, Pengfei Liu, Jinlan Fu, and Graham Neubig. 2021. Towards more fine-grained and reliable NLP performance prediction. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), Online.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, Brussels, Belgium. Association for Computational Linguistics.

Ming Zhong, Danqing Wang, Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2019. A closer look at data bias in neural extractive summarization models. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 80–89, Hong Kong, China. Association for Computational Linguistics.