Question Answering for Artificial Intelligence (QuAIL)
27 Feb 2020
This blog post summarizes our AAAI 2020 paper “Getting Closer to AI-complete Question Answering: A Set of Prerequisite Real Tasks” (Rogers, Kovaleva, Downey, & Rumshisky, 2020).
The leaderboard for the QuAIL benchmark can be found here.
We thank the anonymous reviewers for their insightful comments. The first author is also grateful for the opportunities to discuss this work at an invited talk at AI2 and a keynote at the Meta-Eval workshop. Among many people whose comments made this post better are Noah Smith, Matt Gardner, Daniel Khashabi, Ronan Le Bras (AI2), Chris Welty, Lora Aroyo, and Praveen Paritosh (Google Research).
Since 2018 there has been an explosion of new datasets for high-level verbal reasoning tasks, such as reading comprehension, commonsense reasoning, and natural language inference. However, many new datasets get “solved” almost immediately, prompting concerns about data artifacts and biases. Our models get fooled by superficial cues (Jia & Liang, 2017; McCoy, Pavlick, & Linzen, 2019) and so may achieve seemingly super-human accuracy without any real verbal reasoning skills.
One of the top reasons for the datasets being so easy is poor diversity of data. Deep learning requires large training sets, which typically are generated by crowd workers provided with minimal instructions. This may result in large portions of data exhibiting spurious patterns that the model learns to associate with a particular label. For example, the word “never” is strongly predictive of the contradiction label in SNLI (Gururangan et al., 2018), simply because negating was an easy strategy for the crowd workers to generate contradictory statements. If a few such patterns cover large portions of the data, we have a problem.
To avoid this, we fundamentally need to reduce the number of spurious correlations with predicted labels, which would hopefully force our models to learn generalizable patterns rather than dataset-specific “shortcuts”. The solution that we explored in QuAIL is balancing different types of data within one dataset. This does not guarantee the absence of any biases, but it should decrease the likelihood of a single bias affecting a large portion of the data. In particular, we experiment with a 4x9 design: 4 domains by 9 question types, each with approximately the same number of questions.
In addition to reducing biases, this approach should also enable diagnostics of both the models and the data: finding what the model can and cannot do, and whether any parts of data look suspiciously easy. When we started working on QuAIL, the only other reading comprehension dataset with annotation for question types was the original bAbI (Weston et al., 2016), which consisted of toy tasks with synthetic data.
It turns out, there are very good reasons why this approach has not been explored before. We have learned a lot from this project, and here are the main takeaways:
- it is possible to crowdsource an RC dataset balanced by question types and domains;
- balanced design is great for model and data diagnostics;
- reasoning over the full spectrum of uncertainty is harder for humans than for machines;
- paraphrasing hurts even BERT.
Balanced and diverse data is great!
QuAIL is a multi-domain RC dataset featuring news, blogs, fiction and user stories. Each domain is represented by 200 texts, which gives us a 4-way data split. The texts are 300-350 word excerpts from CC-licensed texts that were hand-picked so as to make sense to human readers without larger context. Domain diversity mitigates the issue of possible overlap between training and test data of large pre-trained models, which the current SOTA systems are based on. For instance, BERT is trained on Wikipedia + BookCorpus, and was tested on Wikipedia-based SQuAD (Devlin, Chang, Lee, & Toutanova, 2019).
In addition to balancing the domains, we balance the number of questions across 9 question types:
- text-based questions (the information should be found in the text):
  - reasoning about factual information in the text (e.g. What did John do?);
  - temporal order of events (e.g. When did X happen - before, after, or during Y?);
  - character identity (e.g. Who ate the cake?);
- world knowledge questions (rely on some kind of inference about characters and events, based on information in the text and world knowledge):
  - causality (e.g. Why did John eat the cake?);
  - properties and qualities of characters (e.g. What kind of food does John probably like?);
  - belief states (e.g. What does John think about Mary?);
  - subsequent state after the narrated events (e.g. What does John do next?);
  - duration of the narrated events (e.g. How long did it probably take John to eat the cake?);
- unanswerable questions: questions for which the information is not found in the text, and all answer options are equally likely (e.g. What is John’s brother’s name?).
In terms of the format, QuAIL is a multi-choice dataset where all questions have 3 answer options, as well as a “not enough information” (NEI) option, which is the correct answer for the unanswerable questions. The data looks as follows.
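To make the format concrete, here is a minimal sketch of how one such multi-choice item could be represented. The class name, field names, and the three distractor names are hypothetical and chosen only for illustration; this is not the released QuAIL file format. Only the question and the 3-options-plus-NEI layout come from the description above.

```python
from dataclasses import dataclass
from typing import List

NEI = "not enough information"


@dataclass
class MultiChoiceItem:
    """Hypothetical record layout for one question (not the official schema)."""
    domain: str            # e.g. "fiction", "news", "blogs", "user stories"
    question_type: str     # one of the 9 question types
    question: str
    options: List[str]     # 3 human-written options followed by the NEI option
    correct: int           # index into `options`

    def __post_init__(self) -> None:
        if len(self.options) != 4 or self.options[-1] != NEI:
            raise ValueError("expected 3 options followed by the NEI option")
        if not 0 <= self.correct < len(self.options):
            raise ValueError("correct index out of range")

    @property
    def is_unanswerable(self) -> bool:
        # For unanswerable questions the NEI option is the correct one.
        return self.options[self.correct] == NEI


# The three names below are made up; the question is the example used above.
item = MultiChoiceItem(
    domain="fiction",
    question_type="unanswerable",
    question="What is John's brother's name?",
    options=["Mark", "Peter", "Tom", NEI],
    correct=3,
)
```

Keeping the domain and question type on every record is what makes the per-subset breakdowns discussed in this post possible.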
Overall, the above design gives us a total of 4 x 9 balanced subsets of the data, which are all labeled for question types and domains. The labels can be used for diagnosing both the models (what they can and can’t do) and the data (finding the sections that are suspiciously easy).
So, how do our models do with this data? This is the place in the paper where you usually get a table with the results of a few baselines. If the dataset is partitioned like QuAIL, you get a larger table that tells you how successful your model was on different types of questions – and that might tell you something useful about how it works. We experimented with a simple heuristic baseline (picking the longest answer), a classic PMI solver that should prefer the answers with significant lexical overlap with the text, a biLSTM baseline using word embeddings, BERT (Devlin, Chang, Lee, & Toutanova, 2019), and TriAN - a ConceptNet-based system that won SemEval2018-Task11 (Wang, Sun, Zhao, Shen, & Liu, 2018). The random guess accuracy is 25%.
The first fun fact is that on world knowledge questions BERT beats a ConceptNet-based system made specifically for this kind of multi-choice QA, which suggests either that it contains more world knowledge than ConceptNet, or that its world knowledge is more relevant to QuAIL.
The most difficult question type for BERT is character identity, which often involves coreference resolution. Although both coreference and temporal order questions involve reasoning over longer dependencies, the former appears to be more difficult. In both cases, BERT loses to TriAN.
Surprisingly, simple PMI beats BERT on factual questions (e.g. What did John eat?). Note that this is the most prevalent type in most other RC datasets. If simple lexical matching provides such a strong signal, then training on such data would teach a model to rely on it, and we will show further that this is what happens to BERT too.
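To illustrate how far plain lexical matching can go, here is a minimal sketch of an overlap-based answer picker. It is a simplified stand-in, not the PMI solver used in the paper: it only measures what fraction of an option's words also occur in the passage. The function names and the toy passage are assumptions made up for this example.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word types in a string."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap_score(passage: str, option: str) -> float:
    """Fraction of the option's word types that also appear in the passage."""
    option_tokens = tokens(option)
    if not option_tokens:
        return 0.0
    return len(option_tokens & tokens(passage)) / len(option_tokens)


def pick_by_overlap(passage: str, options: List[str]) -> int:
    """Index of the option with the highest overlap (first one wins ties)."""
    scores = [overlap_score(passage, option) for option in options]
    return max(range(len(options)), key=scores.__getitem__)


# Toy data, invented for illustration only.
passage = "John came home late and ate the chocolate cake in the kitchen."
options = ["a chocolate cake", "a green salad", "some soup", "not enough information"]
choice = pick_by_overlap(passage, options)  # index 0: two of its three words are in the passage
```

If a scorer this simple picks the right option often, the correct answers share more vocabulary with the text than the distractors do, which is the kind of signal a trained model can also latch onto.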
Finally, note how heuristic baselines are helpful for finding suspiciously easy sections of partitioned data. Our LongChoice heuristic is very successful on causality questions, which suggests that the Turkers mostly wrote the long causal explanations for the correct answers but not distractors. Since causality is also the easiest category for BERT, it is possible that this bias helps it as well. The partitioning methodology enables the dataset authors to locate the problematic parts and fix them, e.g. to release a more difficult version of the dataset.
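A longest-answer check of this kind takes only a few lines. The sketch below is an assumed re-implementation of the idea, not the paper's code: it picks the longest option by character count and reports accuracy per question type. The toy examples are invented, and they list only the human-written options; whether the NEI option takes part in the length comparison is a design choice left open here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (question_type, answer options, index of the correct option)
Example = Tuple[str, List[str], int]


def longest_choice(options: List[str]) -> int:
    """Index of the longest option by character count (first one wins ties)."""
    return max(range(len(options)), key=lambda i: len(options[i]))


def accuracy_by_type(examples: List[Example]) -> Dict[str, float]:
    """Accuracy of the longest-answer heuristic, broken down by question type."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for question_type, options, correct in examples:
        totals[question_type] += 1
        hits[question_type] += int(longest_choice(options) == correct)
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}


# Toy data, invented for illustration only.
toy = [
    ("causality", ["he was hungry after a long day at work", "he was bored", "he was sad"], 0),
    ("causality", ["it rained", "the road was closed because of the flood", "he forgot"], 1),
    ("factual", ["cake", "a bowl of soup", "bread"], 0),
    ("factual", ["Mary", "Alexander", "Tom"], 2),
]
report = accuracy_by_type(toy)
```

A question type where this heuristic scores far above chance is a candidate for closer inspection before any model result on it is trusted.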
Partitioning across domains is helpful as well. Consider the following breakdown of BERT results:
| Question category | Fiction | News | Blogs | User stories |
BERT does suspiciously well on unanswerable questions in the news, but not in blogs. This could be due to the fact that news style and vocabulary are quite far from everyday speech, making the crowdworkers’ distractor options easier to detect.
Crowdsourcing specific question types is hard (but not impossible)
So if partitioning datasets in sections by question types and domains is so great, why is nobody doing it?
Well, it’s very difficult.
First, not all texts are suitable for all question types: for example, you wouldn’t be able to do much for coreference or causality given a description of a landscape. One of the most frequent questions we got from reviewers and the audience was how we chose the question types and texts. The answer is, we chose them together by much trial and error, to maximize the number of possible questions for the same text. The domain options were also limited to non-specialized texts that could be processed by crowd workers.
Second, the crowd workers really do not want to write diverse questions. Most other studies that did not control for question types during question writing provide either a breakdown by the first word of the question (which is very crude, as causality, factual, and world knowledge questions could all start with “what”), or an analysis of a small sample that they manually annotated. The reported distributions tend to be unbalanced, usually with factoid questions making up the bulk of the data and a long tail of other question types. For example, in NarrativeQA (Kocisky et al., 2018) 25% of questions are about descriptions and 10% are about locations (both corresponding to our “factual” category, since the answers are supposed to be found in the narrative), 31% are on character identity, about 10% are on causality, and less than 2% are on event duration.
This is what we found as well: the easiest questions to generate are simple factoid questions (“John ate a cake” -> “What did John eat?”). Our form showed the text and sections corresponding to different question types, with different examples and instructions for each type. Even with the examples, the most frequent kind of error would be factoid questions instead of any other question type. Also, unanswerable questions could be of any type, but what we got was mostly factoid.
So, how do you ensure question diversity? In this work we experimented with two strategies:
1) Crowd-assisted writing: the crowdsourced questions were checked by additional student annotators, who were asked to correct any language errors, any ambiguous or incorrect answers, and also rewrite the questions of the wrong type.
2) Crowdsourcing with automatic keyword-based validation: the crowd workers had the same interface to work with, but now they got automatic feedback if their questions did not match our lists of relevant keywords (e.g. “because”, “reason” for causality questions).
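The keyword check in workflow (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual validation code: only “because” and “reason” for causality come from the description above, while the other keywords, the function name, and the feedback wording are hypothetical. The sketch also assumes that the check runs on the question text alone.

```python
import re
from typing import Dict, List, Optional

# "because" and "reason" are the keywords mentioned in the post;
# "why" and the duration list are assumptions added for illustration.
KEYWORDS: Dict[str, List[str]] = {
    "causality": ["because", "reason", "why"],
    "duration": ["how long"],
}


def validate(question: str, question_type: str) -> Optional[str]:
    """Return a feedback message for the worker, or None if the question passes."""
    keywords = KEYWORDS.get(question_type)
    if keywords is None:
        return None  # no keyword list for this type, so nothing to check
    text = question.lower()
    for keyword in keywords:
        # Whole-word match, so that "reason" does not fire on "reasonable".
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            return None
    expected = ", ".join(keywords)
    return f"This does not look like a {question_type} question; expected one of: {expected}."


ok = validate("Why did John eat the cake?", "causality")   # passes
feedback = validate("What did John eat?", "causality")      # gets a feedback message
```

A check like this is deliberately crude: it cannot confirm that a question is of the right type, only flag questions that show no surface sign of it.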
Based on our experience, we recommend the latter workflow, as it is much faster and comparable in terms of data quality. In a manually annotated sample, 11.1% of questions in workflow (1) had answer problems vs 16.7% for workflow (2), and 6.7% of questions were of the wrong type vs 11.7% for workflow (2).
Workflow (1) has the advantage of reducing language errors: many Turkers were clearly not native speakers, although we limited the HIT to workers from the US and Canada with at least 1000 accepted HITs and 97% approval. However, we found that (1) also introduced extra question type biases. In our sample, 25% of the questions that were of the entity properties type, according to the annotator, were in fact on causality. None of the Turkers made this type of error. This is likely due to the recently pointed out annotator bias problem (Geva, Goldberg, & Berant, 2019): people processing large amounts of data are likely to develop their own heuristics for dealing with it. If a few annotators are processing a lot of crowdsourced data, the same effect can be expected. While completely relying on multiple Turkers increases the likelihood of noise, it also increases diversity.
The good news is that automatic keyword-based validation produces the Hawthorne effect on the crowd workers. Similarly to people making more donations to charities when they perceived they were being watched, our checks decreased the number of people who simply copy-pasted example questions or contributed nonsensical data.
World knowledge + text-based + unanswerable questions = trouble
To recap, QuAIL is the first verbal reasoning benchmark that attempted to combine text-based questions, world knowledge questions, and unanswerable questions. For text-based questions, the answer is directly in the text. In world knowledge questions the system needs to combine the facts of the specific context with external knowledge, which makes one of the proposed answer options more likely than the others. In unanswerable questions, we asked the crowd workers to specify answer options that were equally likely. All QuAIL questions have 3 human-generated answer options and a “not enough information” option, which is the correct one for unanswerable questions.
Our rationale for trying this text-based + world knowledge + unanswerable question setup was that humans in everyday life are demonstrably able to perform reasoning across the full spectrum of uncertainty. A brief example:
Suppose you want to buy a new laptop, and are considering several options. You will probably want to know how much RAM you can get, and you know where to find the answer to that question (in the model description). That’s a perfect-information, text-based question. Now, say you also want to know whether the video calls will feature your nostrils. You know that you can’t just google for this information, but you can look at the photo and use your external knowledge (of optics, in this case) to deduce the answer. Finally, you probably also want to know how the keyboard feels - but this question is unanswerable: the only way to tell is to try typing on that keyboard yourself.
So, in real life humans are able to reason under all three uncertainty settings. But they don’t want to! When we presented our student volunteers with a sample of QuAIL data (180 questions, 10 texts, 2 questions of each type per text), the agreement with the Turkers was only 0.6 (Krippendorff’s alpha). However, when we split the same questions into text-based+unanswerable questions (≈ SQuAD 2.0 setup) and world knowledge questions (≈ SWAG setup), the agreement was considerably higher - even though the questions were exactly the same!
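For readers unfamiliar with the agreement measure, here is a minimal sketch of Krippendorff's alpha for nominal labels, computed as 1 minus the ratio of observed to expected disagreement via the coincidence matrix. It is a generic textbook-style implementation, not the script used for the paper. The input layout is an assumption: each unit is the list of answer labels that one question received, for example the Turker's answer and a student's answer.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Hashable]]) -> float:
    """Krippendorff's alpha for nominal data; units[i] holds the labels given to item i."""
    coincidences: Counter = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label carries no agreement information
        # Every ordered pair of labels from different annotators counts 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1 - (n - 1) * observed / expected


# Toy labels, invented for illustration: answer options A/B chosen by two annotators.
perfect = krippendorff_alpha_nominal([("A", "A"), ("B", "B")])
mixed = krippendorff_alpha_nominal([("A", "A"), ("A", "B"), ("B", "B")])
```

Unlike raw percentage agreement, alpha corrects for the agreement expected by chance given how often each label is used, so 1 means perfect agreement and 0 means chance-level agreement.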
This is an interesting result in “machine learning psychology” (© Chris Welty, discussion at Meta-Eval workshop) which deserves further investigation. One of the reasons for the stark difference between all-questions, SQuAD and SWAG setups could be that humans are uncomfortable admitting that they don’t know something, and so in the unanswerable questions they may opt to make a guess rather than select the “not enough information” option. Another possibility is that borderline cases of uncertainty estimation are mentally taxing, and humans are not good at performing this at scale. When there is not a clear reason to prefer one option over the other, we may take a long time even deciding which restaurant to go to, or just pick one randomly.
Note, however, that the systems we evaluated do NOT have this problem in the all-questions setup.
On many categories BERT is close to humans in the all-questions setup, in which the humans perform worse, but it does not struggle with unanswerable questions like they do. Manual analysis suggests that most unanswerable questions resemble factual questions (e.g. “What did John eat?”), so the success is not due to the model just learning to distinguish the unanswerable questions from other types. Some of the success is due to the above-mentioned problem with unanswerable questions in the news, but if there are no other major problems we did not notice, this may be a real case of “superhuman” performance on a verbal reasoning task, by virtue of machines not getting mentally tired the way we do.
Finally, note that the duration questions are consistently difficult for humans in both all-questions and world-knowledge setups, but not particularly difficult for BERT. It is possible that the high model performance is due to the fact that people tend to use only a few typical duration units (e.g. “five minutes” rather than “six” or “four”), and that collocation information provides good hints (e.g. “life” is more associated with “years” than “hours”). Low human performance indicates either a higher rate of genuine disagreements about expected event durations, or lower tolerance for guesswork on this topic. In MC-TACO (Zhou, Khashabi, Ning, & Roth, 2019) this question type required filtering out a lot of data that humans disagreed on (personal communication).
Since many of the questions formulated by crowd workers could be solved with simple cooccurrence counts, we conducted an additional experiment to see how much the performance would be hurt if these surface cues were taken away. The authors created paraphrased versions of questions for 30 fiction texts (556 questions in total), aiming in particular to reduce lexical cooccurrences between the text and the words in the question and the correct answer. Then we trained our baseline models on all non-paraphrased data and tested on this extra test set.
| Question category | | |
|---|---|---|
| Temporal | 0.51 (0.06) | 0.24 (-0.25) |
| Coreference | 0.42 (-0.11) | 0.31 (-0.13) |
| Factual | 0.32 (-0.21) | 0.4 (-0.12) |
| Causality | 0.33 (-0.2) | 0.3 (-0.27) |
| Subsequent | 0.23 (-0.12) | 0.3 (-0.18) |
| Duration | 0.62 (0.0) | 0.48 (-0.13) |
| Properties | 0.31 (-0.06) | 0.42 (0.02) |
| Belief states | 0.26 (-0.27) | 0.31 (-0.39) |
| Unanswerable | 0.55 (-0.1) | 0.48 (-0.13) |
| All questions | 0.4 (-0.11) | 0.36 (-0.17) |
The drop in performance is stark, even for BERT. With respect to factual questions, this means that it does rely on lexical cooccurrence counts to a large degree. These results complement the evidence of BERT’s susceptibility to shallow heuristics in NLI (McCoy, Pavlick, & Linzen, 2019).
For other question types, we interpret our results as suggesting that many of the correct answer options were initially formulated with lexical cues from the text, even for world knowledge questions, whereas the distractors were more often made up, and that made the data easier for the models. In particular, this seems to be true for the causality questions that also had the longest-answer bias.
Discussion: other ways to increase question diversity
Increasing data diversity and avoiding spurious correlations with predicted labels are the fundamental challenges for NLP data methodology. In developing QuAIL, we focused on one possible direction: diversifying data through balanced dataset design and partitioning by question types and domains.
In parallel with our work, the community is exploring adversarial data development. Datasets can be collected with a model-in-the-loop that rejects “easy” entries (Dua et al., 2019), or filtered post-hoc (Bras et al., 2019). This is very promising, and we intend in particular to experiment with adversarial selection of distractors, since human-written distractors are often easy to detect through text cooccurrence counts.
However, the adversarial approach also has caveats: its success ultimately depends on the hypothesis of a specific model-in-the-loop, which may not generalize to other models. For instance, the authors of PIQA used BERT as the adversary, and in the final dataset it performed worse than GPT-2 (Bisk, Zellers, Bras, Gao, & Choi, 2020), although generally it outperforms GPT-2 on many other benchmarks. There is also recent evidence that stronger adversaries change the question distribution in a way that prevents some weaker models from generalizing to the original distribution (Bartolo, Roberts, Welbl, Riedel, & Stenetorp, 2020).
While QuAIL aimed to collect balanced question types for the same texts, the community also started experimenting with multi-dataset learning (Dua, Gottumukkala, Talmor, Gardner, & Singh, 2019). A model trained on one RC dataset does not necessarily generalize to another (Yatskar, 2019), but training on multiple datasets can create more general-purpose systems (Talmor & Berant, 2019). In the MRQA shared task the participants had to train on one set of datasets and generalize to another (Fisch et al., 2019).
Clearly, this approach enables using all the datasets already prepared by the community, and together they offer much more variety than is possible within any one dataset. However, there is also a problem: if the texts are not matched by domain, length and discourse structure, and come with different question distributions, the model could learn what kinds of questions are asked of what texts. In that sense, the QuAIL strategy provides a more general signal because the same kinds of questions are asked of all texts (but, as mentioned above, this limits both the possible question and text types).
Machine language understanding is far from being solved, but it’s fair to say we’re making progress, and the findings from all the above approaches will hopefully fuel a new generation of verbal reasoning benchmarks. Stay tuned for our tutorial at COLING 2020, which will provide both the latest updates in data collection methodology, and the dos and don’ts for the practitioners who don’t want their models to cheat.
- Zhou, B., Khashabi, D., Ning, Q., & Roth, D. (2019). “Going on a Vacation” Takes Longer than “Going for a Walk”: A Study of Temporal Commonsense Understanding. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3361–3367. https://doi.org/10.18653/v1/D19-1332
- Yatskar, M. (2019). A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2318–2323.
- Weston, J., Bordes, A., Chopra, S., Rush, A. M., van Merriënboer, B., Joulin, A., & Mikolov, T. (2016). Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. ICLR 2016.
- Wang, L., Sun, M., Zhao, W., Shen, K., & Liu, J. (2018). Yuanfudao at SemEval-2018 Task 11: Three-way Attention and Relational Knowledge for Commonsense Machine Comprehension. Proceedings of The 12th International Workshop on Semantic Evaluation, 758–762. https://doi.org/10.18653/v1/S18-1120
- Talmor, A., & Berant, J. (2019). MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4911–4921. https://doi.org/10.18653/v1/P19-1485
- Rogers, A., Kovaleva, O., Downey, M., & Rumshisky, A. (2020). Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks. AAAI, 11.
- McCoy, T., Pavlick, E., & Linzen, T. (2019). Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3428–3448. https://doi.org/10.18653/v1/P19-1334
- Kocisky, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G., & Grefenstette, E. (2018). The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics, 6, 317–328.
- Jia, R., & Liang, P. (2017). Adversarial Examples for Evaluating Reading Comprehension Systems. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2021–2031. https://doi.org/10.18653/v1/D17-1215
- Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., & Smith, N. A. (2018). Annotation Artifacts in Natural Language Inference Data. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 107–112. https://doi.org/10.18653/v1/N18-2017
- Geva, M., Goldberg, Y., & Berant, J. (2019). Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 1161–1166. https://doi.org/10.18653/v1/D19-1107
- Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., & Gardner, M. (2019). DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2368–2378.
- Dua, D., Gottumukkala, A., Talmor, A., Gardner, M., & Singh, S. (2019). ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension. Proceedings of the 2nd Workshop on Machine Reading for Question Answering, 147–153. https://doi.org/10.18653/v1/D19-5820
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.
- Bras, R. L., Swayamdipta, S., Bhagavatula, C., Zellers, R., Peters, M., Sabharwal, A., & Choi, Y. (2019). Adversarial Filters of Dataset Biases.
- Bisk, Y., Zellers, R., Bras, R. L., Gao, J., & Choi, Y. (2020). PIQA: Reasoning about Physical Commonsense in Natural Language. AAAI.
- Bartolo, M., Roberts, A., Welbl, J., Riedel, S., & Stenetorp, P. (2020). Beat the AI: Investigating Adversarial Human Annotations for Reading Comprehension. ArXiv:2002.00293 [Cs].