Textual Entailment Specialized Data Sets
Textual Entailment Specialized Data Sets are the result of a feasibility study carried out jointly by FBK-Irst, CELCT and Bar-Ilan University on the application of a methodology for decomposing complex Textual Entailment (TE) pairs into monothematic T-H pairs, i.e. pairs in which a single linguistic phenomenon relevant to entailment is highlighted and isolated. The expected benefit of specialized data sets derives from the intuition that investigating linguistic phenomena separately, i.e. decomposing the complexity of the TE problem, facilitates the development of specific strategies to cope with each of them.
The methodology for the creation of the monothematic pairs starts from an existing RTE pair and defines the following steps:
- identify the linguistic phenomena present in the original RTE pair
- apply an annotation procedure to isolate each phenomenon and create the related monothematic pair
- group together all the monothematic T-H pairs relative to the same linguistic phenomenon, hence creating specialized data sets
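The grouping step above can be sketched in code. The following is a minimal, hypothetical illustration of the data flow (none of these class or function names come from the original resource): each monothematic pair carries exactly one annotated phenomenon, and pairs are bucketed by that phenomenon to form the specialized data sets.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical data structure for a T-H pair; field names are illustrative.
@dataclass
class Pair:
    text: str                  # T
    hypothesis: str            # H
    judgment: str              # "entailment" | "contradiction" | "unknown"
    phenomena: list = field(default_factory=list)  # annotated phenomena

def build_specialized_datasets(monothematic_pairs):
    """Step 3 of the methodology: group monothematic pairs by phenomenon."""
    datasets = defaultdict(list)
    for pair in monothematic_pairs:
        # A monothematic pair isolates exactly one phenomenon.
        assert len(pair.phenomena) == 1, "monothematic = one phenomenon"
        datasets[pair.phenomena[0]].append(pair)
    return dict(datasets)

# Two toy monothematic pairs derived from the same original pair.
p1 = Pair("T ...", "H1 ...", "entailment", ["argument_realization"])
p2 = Pair("T ...", "H2 ...", "entailment", ["apposition"])
print(sorted(build_specialized_datasets([p1, p2])))
# prints ['apposition', 'argument_realization']
```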
The methodology has been applied to a sample of 90 T-H pairs randomly extracted from the RTE-5 data set (30 entailment, 30 contradiction and 30 unknown examples), and the linguistic phenomena underlying the entailment/contradiction/unknown relations in the pairs have been annotated, at both fine-grained and macro-category level, by two annotators with skills in linguistics. 203 monothematic pairs have been created from the 90 annotated pairs (157 entailment, 33 contradiction, and 13 unknown examples). Such pilot data sets can be profitably used both to advance the understanding of the linguistic phenomena involved in entailment judgments and to take a first step toward the creation of large-scale specialized data sets.
Below you can find an example of the creation of monothematic pairs from original RTE pairs.
The data sets are freely available for research purposes. The following resources can be downloaded:
- 90 RTE-5 Test Set pairs annotated with linguistic phenomena (30 entailment, 30 contradiction and 30 unknown examples)
- 203 monothematic pairs (i.e. pairs where only one linguistic phenomenon is relevant to the entailment relation) created from the 90 annotated pairs (157 entailment, 33 contradiction, and 13 unknown examples)
For further information, please contact Elena Cabrio (cabrio[at]fbk.eu) and Danilo Giampiccolo (giampiccolo[at]celct.it).
Luisa Bentivogli, Elena Cabrio, Ido Dagan, Danilo Giampiccolo, Medea Lo Leggio, and Bernardo Magnini. 2010. Building Textual Entailment Specialized Data Sets: a Methodology for Isolating Linguistic Phenomena Relevant to Inference. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta, 19-21 May. [.pdf]
The table above shows the decomposition of an original entailment pair (pair 408 in RTE-5) into monothematic pairs. First, the linguistic phenomena (i.e. apposition, synonymy, verbalization and argument realization) that are considered relevant to the entailment between T and H are annotated on the original pair. Then, the methodology is applied to create a monothematic pair for each phenomenon detected in the first phase.
As an example, we apply the procedure step by step to the phenomenon we define as argument realization. As the first step, the general entailment rule:
Pattern: x y ↔ y in x
Constraint: type(x) = temporal_expression
is instantiated (2007 Nobel Prize in Literature ↔ Nobel Prize in Literature in 2007), and the substitution in T is carried out ([...] Doris Lessing, recipient of the Nobel Prize (in Literature) in 2007 [...]). Then, the monothematic pair T-H1 is composed and marked as "argument realization" (macro-category "syntactic"). Finally, this pair is judged as entailment. As noted above, these steps are repeated for all the phenomena identified in the pair.
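The rule instantiation above can be illustrated with a toy rewrite. This is only a sketch under strong simplifying assumptions: the temporal-expression constraint is approximated by a four-digit year, and the regex is an illustrative stand-in, not part of the original methodology.

```python
import re

def apply_argument_realization(text):
    # Toy instance of the rule "x y <-> y in x" with the constraint
    # type(x) = temporal_expression. Here x is approximated as a
    # four-digit year and y as the capitalized phrase that follows it,
    # up to a comma, period, or end of string.
    return re.sub(r"\b(\d{4}) ([A-Z][\w ]*?)(?=[,.]|$)", r"\2 in \1", text)

t = "Doris Lessing, recipient of the 2007 Nobel Prize in Literature, said..."
print(apply_argument_realization(t))
# prints: Doris Lessing, recipient of the Nobel Prize in Literature in 2007, said...
```

Applying the rewrite to T yields the hypothesis side of the monothematic pair T-H1, which is then judged for entailment.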
Several phenomena may be collapsed on the same token. For instance, as shown in the Table above, a chain of two phenomena must be resolved to match recipient of with won. In such cases, in order to create a monothematic H for each phenomenon, the methodology is applied recursively: after applying it once to the first phenomenon of the chain (creating the pair T-Hi), it is applied again to Hi (which becomes T') to resolve the second phenomenon of the chain (creating the pair T'-Hj). More specifically, in the Table above the methodology is first applied to T for the verbalization (T-H3) and then recursively applied to H3 (which becomes T') to resolve the synonymy (T'-H4).
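The recursive step can be sketched as a simple loop in which the output of one rewrite becomes the T of the next. The rewrite functions below are toy stand-ins (simple string replacements) for the actual linguistic operations; the function name and chain are hypothetical.

```python
def decompose_chain(t, rewrites):
    """Resolve a chain of phenomena: each rewrite produces one
    monothematic pair, and its H becomes the T' of the next step."""
    pairs = []
    current_t = t
    for phenomenon, rewrite in rewrites:
        h = rewrite(current_t)
        pairs.append((current_t, h, phenomenon))
        current_t = h  # Hi becomes T' for the next phenomenon
    return pairs

# Toy chain mirroring the example: verbalization first, then synonymy.
chain = [
    ("verbalization", lambda s: s.replace(", recipient of", " received")),
    ("synonymy",      lambda s: s.replace("received", "won")),
]
for t, h, phen in decompose_chain("Lessing, recipient of the prize", chain):
    print(f"{phen}: {t} -> {h}")
```

Each iteration yields one monothematic pair, so a chain of n phenomena over the same token produces n pairs.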
More details on the methodology and other examples of its application can be found in the reference paper (Bentivogli et al. 2010).