Many NLP applications today deploy state-of-the-art deep neural networks that are essentially black boxes. One of the goals of Explainable AI (XAI) is to have AI models reveal why and how they make their predictions so that a human can interpret them. But work in this direction has been conducted on different datasets with correspondingly different aims, and because interpretability can mean different things depending on the task and context, there has been no standard way to evaluate it.
The Evaluating Rationales And Simple English Reasoning (ERASER) benchmark is the first effort to unify and standardize NLP tasks with the goal of interpretability. Specifically, we unify the definition of interpretability and its evaluation metrics through a standardized data-collection and evaluation process across a suite of NLP tasks.
This benchmark comprises 7 diverse NLP datasets and tasks for which we collected human annotations of explanations as supporting evidence for predictions. ERASER focuses on “rationales”, that is, snippets of text extracted from the source document of a task that provide sufficient evidence for predicting the correct output. All the datasets included in ERASER are classification tasks, including sentiment analysis, natural language inference, and question answering, among others, with varying numbers of class labels. The figure below shows an example instance from 4 of the datasets with their corresponding classes, as well as the rationales (erased) that support the predicted labels.
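To make this concrete, here is a hypothetical, simplified instance in the spirit of the benchmark (the field names below are illustrative, not ERASER's exact schema): a source document, an optional query, the gold label, and the token spans humans marked as the rationale.

```python
# Illustrative only: a made-up sentiment example showing the kind of
# information each ERASER-style instance carries. Field names are hypothetical.
example_instance = {
    "document": "The movie was a complete waste of time. "
                "The plot went nowhere and the acting felt wooden.",
    "query": None,              # QA/NLI-style tasks also include a query or hypothesis
    "label": "NEGATIVE",        # the classification target
    "rationales": [
        {"start_token": 0, "end_token": 9},   # tokens spanning the first sentence
    ],
}
```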
Evaluating explanations is a difficult and open research problem, with no single metric capturing the various dimensions of what constitutes good supporting evidence. We define several metrics that capture how well the rationales provided by neural models align with human rationales, and also how faithful these rationales are (i.e., the degree to which the provided rationales actually influenced the model’s predictions). For example, a model might extract reasonable rationales but not actually use them to make its prediction.
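As a rough sketch of the agreement dimension, the snippet below computes a token-level F1 score between a predicted rationale and a human-annotated one; this is only one of the agreement-style measures ERASER reports (others handle partial span matches and soft token scores).

```python
def token_f1(predicted_tokens, human_tokens):
    """Token-level F1 between a predicted and a human rationale.

    Both arguments are sets of token positions. This is a simplified sketch,
    not the benchmark's exact implementation.
    """
    if not predicted_tokens or not human_tokens:
        return 0.0
    overlap = len(predicted_tokens & human_tokens)
    precision = overlap / len(predicted_tokens)
    recall = overlap / len(human_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. the model selected tokens 0-7, humans annotated tokens 0-9
print(token_f1(set(range(0, 8)), set(range(0, 10))))  # ≈ 0.89
```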
ERASER, therefore, includes a suite of metrics that evaluate rationales along some of these dimensions, such as agreement with what humans consider the right rationale and model faithfulness. We measure model faithfulness in terms of sufficiency and comprehensiveness. Sufficiency measures whether the extracted rationale alone contains enough information for the model to arrive at the same prediction. Comprehensiveness measures how much the model’s prediction changes when the extracted rationale is removed from the input. The figure below illustrates the faithfulness metrics on an instance of the Commonsense Explanations dataset.
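The sketch below shows, under simplifying assumptions, how these two faithfulness scores can be computed from a model's predicted probabilities. Here `model_prob` is a hypothetical helper standing in for an actual model, not part of the ERASER codebase.

```python
def faithfulness_scores(model_prob, full_input, rationale_tokens, label):
    """Sketch of comprehensiveness and sufficiency for one instance.

    `model_prob(tokens, label)` is assumed to return the model's predicted
    probability of `label` given a token sequence. `rationale_tokens` is the
    set of token positions in the extracted rationale.
    """
    full = model_prob(full_input, label)

    # Comprehensiveness: how much does the prediction drop when the
    # rationale is removed from the input? Higher = the rationale mattered.
    without_rationale = [t for i, t in enumerate(full_input) if i not in rationale_tokens]
    comprehensiveness = full - model_prob(without_rationale, label)

    # Sufficiency: how close does the rationale alone get to the original
    # prediction? Lower (near zero) = the rationale alone is sufficient.
    only_rationale = [t for i, t in enumerate(full_input) if i in rationale_tokens]
    sufficiency = full - model_prob(only_rationale, label)

    return comprehensiveness, sufficiency
```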
ERASER also includes various baseline models, including BERT-based ones, evaluated on the benchmark’s datasets using the proposed metrics. We find that no single model adapts well to the input lengths and rationale granularities across the tasks in ERASER. This indicates that although many NLP models are near human performance on these datasets when predicting the outputs, they lag far behind in their ability to explain how they arrived at those predictions. This lack of transparency makes such models not only hard to trust but also prone to amplifying unintended biases.
Our hope is that releasing this benchmark will facilitate progress on designing more interpretable NLP systems that can adapt to different task input types and rationale granularities. We have released all the benchmark tasks, code, and documentation at www.eraserbenchmark.com. Please see the paper and GitHub repository for more details on the datasets, metrics, and models.
Website: http://www.eraserbenchmark.com
GitHub: https://github.com/jayded/eraserbenchmark
Paper: https://arxiv.org/abs/1911.03429