Integrating Explanations in the Textual Entailment of an LLM

By Ahmad Khalidi (HAW-Hamburg)

Large language models (LLMs) show great potential in reasoning, but often struggle with multi-step reasoning and chains of thought. In this project, we train a huggingface🤗 sequence-to-sequence transformer model (LLama 2 7b) on textual entailment across a wide range of common-sense, entailment, legal, ethics and natural science datasets. To this end, the datasets are embedded into a graph based reasoning language (GBRL). This GBRL is designed specifically for LLMs and allows models to understand the current context and the requested goal in a structured manner.

Figure 1 shows one common task of textual entailment: predicting the entailment relationship between premises and a hypothesis. Does the hypothesis logically follow from the premises?

Our model predicts these relationships and is able to construct explanations that reason about those relations (see Figure 2).

In this work we describe the GUI, define a GBRL, and transform exemplary datasets into the GBRL. We then experiment with Chat-GPT and the trained LLama 2 model by querying them with simple representative tasks.
We will show that the formal definitions for explanations and chains of thought in our GBRL are too imprecise and thus confuse the models or lead to unusable conclusions. We therefore conclude that the relationships between premises, explanations, chains of thought and hypotheses need to be made more precise.

To run the examples in this project, simply build the image in the root path by running:

docker build -t gbrl .

and then replace the square brackets with the access token to the transformer model and run the image with:

docker run -p 8888:8888 -v ${PWD}:/home/jovyan -e HUGGINGFACE_ACCESS_TOKEN=[REPLACE WITH ACCESS TOKEN] gbrl

The Docker container prints two local URLs to its output, one of which provides access to a Jupyter environment within the container. Opening one of these URLs in a browser gives access to the Jupyter environment, in which the notebooks can be executed.

Related Work

The use of a GBRL is strongly inspired by the work on entailment trees by Dalvi et al. [1]. The authors identified three reasons for invalid decision steps, which we will address later:

Repetition: The entailed conclusion simply repeats one of the input sentences.
Invalid Entailment: The entailed conclusion does not follow from input sentences.
Mis-evaluation and Irrelevance: The entailed conclusion is correct, but either different from gold or irrelevant to prove the hypothesis.
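
The first of these failure modes is the easiest to check mechanically. The following is a minimal sketch, not part of this repository: the helper names normalize and is_repetition are hypothetical, and the check only catches verbatim repetition after lowercasing and stripping punctuation. Paraphrases would need a semantic comparison that is not attempted here.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def is_repetition(conclusion: str, premises: list[str]) -> bool:
    """Return True if the conclusion is a normalized copy of any premise."""
    target = normalize(conclusion)
    return any(normalize(premise) == target for premise in premises)


premises = ["The sun is a star.", "Stars produce light."]
print(is_repetition("the sun is a star", premises))        # True
print(is_repetition("The sun produces light.", premises))  # False
```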

We trained our model on seventy-one datasets, four of which we present as representative.

The Winograd Schema (WS) [2] defines a binary multiple-choice question dataset for common sense reasoning. The authors base their definition of textual entailment on the one provided by the PASCAL Recognizing Textual Entailment Challenge [3], namely: "We consider an applied notion of textual entailment, defined as a directional relation between two text fragments, termed t - the entailing text, and h - the entailed text. We say that t entails h if, typically, a human reading t would infer that h is most likely true." Definitions of this kind can be found in most common sense datasets, which are often generated by non-professional crowdworkers under supervision.

The Story Cloze Test (SC) [4] is a dataset of multi-sentence stories. For each story, two alternative endings are presented, one of which is considered to be entailed by the story, while the other is considered to contradict it.

The Natural Questions Benchmark Dataset (NQ) [5] pairs a Wikipedia article with an open-ended question about this article, whereby the answer to the question is not always present in the article.

The Stanford Natural Language Inference (SNLI) corpus [6] is a collection of sentence pairs in which the two sentences of each pair stand in one of the relationships entailment, contradiction or contextual independence (neutral).

Graph Based Reasoning Operations

Each agent manages a directed graph in which the user can add nodes and edges. A node has a name and a free text that represents any statement. In most cases, an edge describes the textual entailment relationship between two nodes. A relationship can be of the type ent for entailment, con for contradiction, neu for neutral or the negations of these relationships ¬ent, ¬con, ¬neu.
With three simple operations, the user then has the option of having edges and nodes created by the model. The edges generated by the model must reflect the correct relationship type between two nodes. The statements of generated nodes must not contradict the relationship type between newly generated nodes and existing nodes.
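
To make the data model concrete, here is a minimal sketch of such a graph. It is an illustration only: the class and method names (Relation, ReasoningGraph, add_node, add_edge) are assumptions and are not taken from agent.py, whose actual implementation may differ.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    """The six relationship types named in the text."""
    ENT = "ent"
    CON = "con"
    NEU = "neu"
    NOT_ENT = "¬ent"
    NOT_CON = "¬con"
    NOT_NEU = "¬neu"


@dataclass
class ReasoningGraph:
    """Directed graph: node name -> statement, plus typed edges."""
    nodes: dict[str, str] = field(default_factory=dict)
    edges: list[tuple[str, Relation, str]] = field(default_factory=list)

    def add_node(self, name: str, text: str) -> None:
        self.nodes[name] = text

    def add_edge(self, source: str, relation: Relation, target: str) -> None:
        # Both endpoints must already exist in the graph.
        for name in (source, target):
            if name not in self.nodes:
                raise KeyError(f"unknown node: {name}")
        self.edges.append((source, relation, target))


graph = ReasoningGraph()
graph.add_node("s1", "It is raining.")
graph.add_node("s2", "The street is wet.")
graph.add_edge("s1", Relation.ENT, "s2")
print(graph.edges)
```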

The operations are

relation between: Creates an edge between the input node and output node. Figure 1 illustrates this operation. Input nodes can also be a set of nodes, which are then connected with an & operator.
generate entailment: Creates a new node from the input node, which has the given relationship to the input node. A set of nodes can also be used as input nodes.
explain relation: Creates a new node whose statement can be inferred from the input node and whose relationship to the output node is the same as the relationship between the input and output nodes. Figure 2 illustrates this operation. This operation should generate a statement that explains the relationship between the input and output nodes.

For more information on these operations, please refer to agent.py.

The result of each operation is a decision graph in which each edge represents the "correct" relationship between the nodes. The graph should satisfy a generalization of the Markov property: the correctness of an edge must not depend on the incoming edges of its source node. In other words, each subgraph of the graph may only contain correct edges.
However, due to the non-deterministic nature of LLMs and the vague statement about the correctness of edges, this property can only be enforced to a limited extent. We will leave this observation for the time being.

Graph Based Reasoning Language

We extend the language defined in Entailment Trees [1] with the operations mentioned above. We also try to modify the tokens that are elementary to the syntax so that they are clearly separated from natural language tokens.
As a result, we get a language that consists of variable definitions for nodes and edges and their placeholders.

The first example shows how the generate entailment operation is expressed.

<s1><:>some text<;><s2><:><t2><;><s1><ent><s2><;>

<s1> and <s2> are variable names for nodes. <:> assigns a value to the variable, either a free text or a placeholder such as <t2>. <;> is the delimiter between definitions and declarations. The last sequence <s1><ent><s2><;> assigns the entailment relation between the nodes <s1> and <s2>.
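
The query above can be assembled from two small building blocks, one for definitions and one for edges. This is a sketch under assumptions: the function names define, relate and generate_entailment_query are hypothetical and do not claim to mirror agent.py.

```python
def define(name: str, value: str) -> str:
    """Assign a free text or a placeholder token to a variable."""
    return f"<{name}><:>{value}<;>"


def relate(source: str, relation: str, target: str) -> str:
    """Declare an edge of the given relation type between two nodes."""
    return f"<{source}><{relation}><{target}><;>"


def generate_entailment_query(text: str, relation: str = "ent") -> str:
    """Known node s1, unknown node s2 (placeholder t2), fixed relation."""
    return define("s1", text) + define("s2", "<t2>") + relate("s1", relation, "s2")


query = generate_entailment_query("some text")
print(query)
# <s1><:>some text<;><s2><:><t2><;><s1><ent><s2><;>
```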

The second example shows how the relation between operation is expressed.

<s1><:>some text<;><s2><:>some other text<;><e1><:><rel1><;><s1><e1><s2><;>

This time we are using a relationship variable <e1> with a placeholder <rel1>.

Each time the model recognizes a placeholder, it should try to fill it so that the relationships between the nodes would "typically" be inferred as correct.
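
As an illustration of what the model is asked to fill, the sketch below builds the relation between query and lists its open placeholders. It assumes that placeholders always follow the naming pattern seen in the examples (t followed by a number for text, rel followed by a number for relations); this convention and the function names are assumptions, not something the repository guarantees.

```python
import re

# Assumed naming convention: <t2>, <t3>, ... and <rel1>, <rel2>, ...
PLACEHOLDER = re.compile(r"<(t\d+|rel\d+)>")


def relation_between_query(premise: str, hypothesis: str) -> str:
    """Two known nodes, edge variable e1 with an open relation placeholder."""
    return (
        f"<s1><:>{premise}<;>"
        f"<s2><:>{hypothesis}<;>"
        "<e1><:><rel1><;>"
        "<s1><e1><s2><;>"
    )


def find_placeholders(query: str) -> list[str]:
    """Return placeholder names in order of first appearance, without duplicates."""
    seen: list[str] = []
    for name in PLACEHOLDER.findall(query):
        if name not in seen:
            seen.append(name)
    return seen


query = relation_between_query("some text", "some other text")
print(query)
# <s1><:>some text<;><s2><:>some other text<;><e1><:><rel1><;><s1><e1><s2><;>
print(find_placeholders(query))
# ['rel1']
```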

Queries can become more complicated, as in the explain relation example.

<s1><:>some text<;><s2><:><t2><;><s3><:>some contradicting text<;><s1><ent><s2><;><s1><con><s3><;><s2><con><s3><;>

This time the model has to take into account incoming and outgoing edges of <s2>.
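
The structure of this query can be sketched as a small builder. The function name explain_relation_query is hypothetical. The sketch assumes, following the description of the explain relation operation, that the explanation node <s2> is entailed by the input node and must stand in the same relation to the output node as the input node does.

```python
def explain_relation_query(premise: str, hypothesis: str, relation: str) -> str:
    """Ask for an explanation node s2 between premise s1 and hypothesis s3."""
    return (
        f"<s1><:>{premise}<;>"
        "<s2><:><t2><;>"                # explanation to be generated
        f"<s3><:>{hypothesis}<;>"
        "<s1><ent><s2><;>"              # explanation follows from the premise
        f"<s1><{relation}><s3><;>"      # given premise-hypothesis relation
        f"<s2><{relation}><s3><;>"      # explanation must share that relation
    )


print(explain_relation_query("some text", "some contradicting text", "con"))
# <s1><:>some text<;><s2><:><t2><;><s3><:>some contradicting text<;><s1><ent><s2><;><s1><con><s3><;><s2><con><s3><;>
```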

Going back to the Winograd Schema, the datasets are transformed using the following schema:

<s1><:><C><;><s2><:>Question <t2><;><s3><:>candidate 1<;><s4><:>candidate 2<;><s1><ent><s2><;><t2><:><s3><|><s4><;>

while we expect a response like:

<t2><:><s3><;>

Questions, prompts or story continuations are modeled in the style of masked language modeling. The answer to a question is not textually entailed by the question. Instead, the question together with the answer is entailed by the context. In this case, the context is common sense in general. To represent common sense in an abstract form, we introduce the special token <C>. The placeholder <t2> is narrowed down to the answer candidates <s3> and <s4>, while the <|> operator allows the model to respond with one or more answer candidates.
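
The transformation and the handling of the response can be sketched as follows. The function names winograd_query and parse_response are hypothetical. The sketch attaches the candidate restriction to the placeholder <t2>, following the description above, and it assumes that a response naming several candidates separates them with <|>, which the text implies but does not show.

```python
import re


def winograd_query(question: str, candidates: list[str]) -> str:
    """Build a query whose answer placeholder t2 is limited to the candidates."""
    parts = ["<s1><:><C><;>", f"<s2><:>{question} <t2><;>"]
    names = []
    for index, candidate in enumerate(candidates, start=3):
        names.append(f"<s{index}>")
        parts.append(f"<s{index}><:>{candidate}<;>")
    parts.append("<s1><ent><s2><;>")
    parts.append("<t2><:>" + "<|>".join(names) + "<;>")
    return "".join(parts)


# Matches assignments of the form <tN><:>...<;>
RESPONSE = re.compile(r"<(t\d+)><:>(.*?)<;>")


def parse_response(response: str) -> dict[str, list[str]]:
    """Map each answered placeholder to the candidate node names it selects."""
    result: dict[str, list[str]] = {}
    for placeholder, value in RESPONSE.findall(response):
        result[placeholder] = re.findall(r"<(s\d+)>", value)
    return result


print(winograd_query("Question", ["candidate 1", "candidate 2"]))
print(parse_response("<t2><:><s3><;>"))
# {'t2': ['s3']}
```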

We can transform most reasoning and entailment datasets with these language rules. Possible entailment relations can be found in agent.py.

Experiments

We have prepared three notebooks showcasing:

the user interface and expected responses,
an experiment with the Chat-GPT agent (web version) and
an experiment with the fine-tuned LLama 2 model.

Expectation: User Interface and Expected Responses

This notebook showcases the user interface. It runs without an actual LLM as backend and uses mock-up responses instead. This way we get a feeling for the usage of the agent in general and can formulate expectations against which we will compare the outputs of actual LLMs.
Please refer to expectations.ipynb for the experiment.
In summary, we expect the agent to not produce any contradictions. Generated statements should sound plausible and should not contradict the tautology of the graph. The graph structure helps us to identify errors in our thought process and in the model's responses.

Experiment 1: Chat-GPT Agent

Chat-GPT is based on GPT-3.5 [7] and is probably the world's best-known chatbot application. Due to its impact on our economy and social life, and due to its wide accessibility, we have chosen to benchmark our GBRL with Chat-GPT. We examine in more detail how language models pick up graph notation and what conclusions they draw from it. Although GPT-3.5 is only one of many pre-trained models, we can identify weaknesses in the language and in the tasks with just a few interactions.

To make Chat-GPT understand our language, we engineered an initial prompt with explanations and examples. The initial prompt can be found in chat_gpt-agent.py. The chatbot is then instructed to resolve the queries generated by our agent. The user then passes the responses generated by Chat-GPT back to the GUI.
Please refer to chat_gpt_experiment.ipynb for the experiment.
The experiment revealed a crucial error in the definition of explanations in the GBRL from Figure 1. If we look at the subgraphs (premise, explanation) and (explanation, hypothesis) separately, we can see that the explanation itself is not a logical conclusion from the premise, just as the hypothesis cannot be logically inferred from the explanation.

Figure 3 describes explanations in GBRL a little better, but has its own problems, which we will leave for now.

In summary, the Chat-GPT experiment made us realise three crucial points.

Our definition of what constitutes an explanation in GBRL is flawed.
Simple graphs give the model too much freedom to choose the most obvious, unhelpful entailments, such as repetition or paraphrasing.
The syntax of the GBRL may confuse the model and may result in invalid entailment decisions.

Two of these three points (repetition and invalid entailments) were also recognised by Dalvi et al. [1]. We can conclude that:

We need a formal definition of what counts as an explanation and as a chain of thought in the GBRL.
We need to make the queries more explicit to reduce repetition and paraphrasing.
We need to make GBRL more robust and more understandable to language models in general.

Experiment 2: Finetuned LLama 2 Agent

This experiment showcases our finetuned sequence-to-sequence transformer model. LLama 2 7b is part of the LLama 2 family and was released in 2023 by Meta AI [8]. We have finetuned the 7 billion parameter model on common sense, entailment, legal, ethics and natural science datasets that were transformed into GBRL tasks (36 million data points in total). We were only able to train the model for 1/12 of an epoch, which already took 3 days on nine 24 GB GPUs.
Please refer to llama2_experiment.ipynb for the experiment.
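
To put the training budget into perspective, the reported figures can be extrapolated with simple arithmetic. This is a rough estimate only: it assumes constant throughput over the whole run and treats the reported 36 million data points as an exact number.

```python
TOTAL_DATAPOINTS = 36_000_000  # reported size of the transformed training set
EPOCH_DENOMINATOR = 12         # the model was trained for 1/12 of an epoch
TRAINING_DAYS = 3              # reported wall-clock time
NUM_GPUS = 9                   # reported number of 24 GB GPUs

seen_datapoints = TOTAL_DATAPOINTS // EPOCH_DENOMINATOR
datapoints_per_day = seen_datapoints // TRAINING_DAYS
full_epoch_days = TRAINING_DAYS * EPOCH_DENOMINATOR
full_epoch_gpu_days = full_epoch_days * NUM_GPUS

print(seen_datapoints)      # 3000000 data points seen during training
print(datapoints_per_day)   # 1000000 data points per day
print(full_epoch_days)      # 36 days for one full epoch
print(full_epoch_gpu_days)  # 324 GPU-days for one full epoch
```

Under these assumptions, a single full epoch would take about 36 days on the same hardware.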

In summary, the model is not yet able to respond in a syntactically correct form. It confuses the task of predicting the textual relationship between nodes with the task of generating statements that fit the given relationship. The main reason for this poor performance is probably the short training time. Nevertheless, the model already shows signs of an understanding of textual entailment.

Conclusions and Future Work

We have shown what a graph-based reasoning language can look like. We have shown how we can transform common sense and entailment datasets into this language. We defined expectations and compared them with the output of Chat-GPT and our finetuned LLama 2 model. We found that our formal definitions for explanations and chains of thought are imprecise and confuse the model, leaving too many gaps and possibly leading to repetition or invalid entailments.
We must therefore specify the formal definitions for explanations and chains of thought in order to force the model to generate meaningful statements.
In future work, we will replace GBRL with the Cypher query language [9]. LLMs have already been trained successfully on Cypher, and the language is widely used and standardized. We will treat our LLM as a graph database that implicitly stores the nodes and edges of all reasoning graphs and returns them on query.
In addition, we have to go back to a smaller model in order to carry out meaningful experiments on our relatively weak hardware in a reasonable amount of time.

References

[1] Dalvi, Bhavana, Peter Alexander Jansen, Oyvind Tafjord, Zhengnan Xie, Hannah Smith, Leighanna Pipatanangkura and Peter Clark. “Explaining Answers with Entailment Trees.” Conference on Empirical Methods in Natural Language Processing (2021).

[2] Levesque, Hector J., Davis, Ernest and Morgenstern, Leora. "The Winograd Schema Challenge." In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, 552-561. Rome, Italy: AAAI Press, 2012.

[3] Dagan, Ido, Glickman, Oren and Magnini, Bernardo. "The PASCAL Recognising Textual Entailment Challenge." Paper presented at the meeting of the Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment, 2005.

[4] Mostafazadeh, N., Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli and James F. Allen. “A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories.” ArXiv abs/1604.01696 (2016).

[5] Kwiatkowski, Tom, Palomaki, Jennimaria, Redfield, Olivia, Collins, Michael, Parikh, Ankur, Alberti, Chris, Epstein, Danielle, Polosukhin, Illia, Kelcey, Matthew, Devlin, Jacob, Lee, Kenton, Toutanova, Kristina N., Jones, Llion, Chang, Ming-Wei, Dai, Andrew, Uszkoreit, Jakob, Le, Quoc and Petrov, Slav. "Natural Questions: a Benchmark for Question Answering Research." (2019).

[6] Bowman, Samuel R., Gabor Angeli, Christopher Potts and Christopher D. Manning. “A large annotated corpus for learning natural language inference.” Conference on Empirical Methods in Natural Language Processing (2015).

[7] Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language Models are Few-Shot Learners." ArXiv preprint ArXiv:2005.14165 (2020).

[8] Touvron, Hugo, Martin, Louis, Stone, Kevin R., Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, Bikel, D., Blecher, Lukas, Ferrer, Cristian Cantón, Chen, Moya, Cucurull, Guillem, Esiobu, David, Fernandes, Jude, Fu, Jeremy, Fu, Wenyin, Fuller, Brian, Gao, Cynthia, Goswami, Vedanuj, Goyal, Naman, Hartshorn, A., Hosseini, Saghar, Hou, Rui, Inan, Hakan, Kardas, Marcin, Kerkez, Viktor, Khabsa, Madian, Kloumann, Isabel M., Korenev, A., Koura, Punit Singh, Lachaux, Marie-Anne, Lavril, Thibaut, Lee, Jenya, Liskovich, Diana, Lu, Yinghai, Mao, Yuning, Martinet, Xavier, Mihaylov, Todor, Mishra, Pushkar, Molybog, Igor, Nie, Yixin, Poulton, Andrew, Reizenstein, Jeremy, Rungta, Rashi, Saladi, Kalyan, Schelten, Alan, Silva, Ruan, Smith, Eric Michael, Subramanian, R., Tan, Xia, Tang, Binh, Taylor, Ross, Williams, Adina, Kuan, Jian Xiang, Xu, Puxin, Yan, Zhengxu, Zarov, Iliyan, Zhang, Yuchen, Fan, Angela, Kambadur, Melanie, Narang, Sharan, Rodriguez, Aurelien, Stojnic, Robert, Edunov, Sergey and Scialom, Thomas. "Llama 2: Open Foundation and Fine-Tuned Chat Models." ArXiv abs/2307.09288 (2023).

[9] Francis, Nadime, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer and Andrés Taylor. “Cypher: An Evolving Query Language for Property Graphs.” Proceedings of the 2018 International Conference on Management of Data (2018).
