Python Package for the Generation of Syntactic Tests for LLM Evaluations.

DanielGall500/Grew-TSE

Grew-TSE

About The Project

Grew-TSE is a tool for the query-based generation of custom minimal-pair syntactic tests from treebanks for Targeted Syntactic Evaluation (TSE) of LLMs. The query language of choice is GREW (Graph Rewriting for NLP). The name is pronounced a bit like the German word Grütze, meaning grits or groats. The package is available on the Python Package Index (PyPI) as grew-tse.

The general research question that Grew-TSE aims to help answer is:
Can language models distinguish grammatical from ungrammatical sentences across syntactic phenomena and languages?
This means that if you speak a language, especially one that is low-resource, then you likely have something novel you could test in this area.

The pipeline generally looks something like the following:

  1. Parse a Universal Dependencies treebank in CoNLL-U format.
  2. Isolate a specific syntactic phenomenon (e.g. verbal agreement) using a GREW query.
  3. Convert these isolated sentences into masked- or prompt-based datasets.
  4. Search the original treebank for words that differ by one syntactic feature to form a minimal pair.
  5. Evaluate a model available on the Hugging Face platform and view metrics such as accuracy, precision, recall, and the F1 score.

What does a "minimal-pair syntactic test" look like?

To analyse models in this way, we use what are called minimal pairs. A minimal pair consists of either
(1) two sentences that differ by one syntactic feature, or
(2) one sentence with a "gap" (or one that simply ends mid-sentence, as in next-token prediction) and two accompanying lexical items (e.g. is/are), one deemed grammatical in the given context and one not.
With this tool we concern ourselves with the latter, and focus on generating minimal pairs (W1, W2) for the same context.

Some example tests are shown in the table below, generated using Grew-TSE from the English EWT UD Treebank.

masked_text                                                form_grammatical  form_ungrammatical
It [MASK] clear to me that the manhunt for high Ba...      seems             seem
In Ramadi, there [MASK] a big demonstration...             was               were
As the survey cited in the above-linked article [MASK]...  shows             show
Jim Lobe [MASK] more on the political implications...      has               have

The above tests are for models trained on a masked language modelling (MLM) task; you may also generate prompt-based datasets with Grew-TSE.
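As a rough illustration of how such a test is scored, the sketch below counts a model as correct on a pair when it assigns a higher score to the grammatical form than to the ungrammatical one at the gap. The log-probabilities are made-up placeholder values, not real model output, and the helper name prefers_grammatical is hypothetical; this is not Grew-TSE's own evaluation code.

```python
def prefers_grammatical(log_probs, grammatical, ungrammatical):
    """Return True if the grammatical form scores strictly higher."""
    return log_probs[grammatical] > log_probs[ungrammatical]

# Column names follow the table above; the log_probs values are
# invented for illustration and do not come from any real model.
TESTS = [
    {"form_grammatical": "seems", "form_ungrammatical": "seem",
     "log_probs": {"seems": -0.4, "seem": -3.1}},
    {"form_grammatical": "was", "form_ungrammatical": "were",
     "log_probs": {"was": -0.9, "were": -2.2}},
    {"form_grammatical": "shows", "form_ungrammatical": "show",
     "log_probs": {"shows": -1.0, "show": -1.8}},
    {"form_grammatical": "has", "form_ungrammatical": "have",
     "log_probs": {"has": -2.0, "have": -1.5}},  # model gets this one wrong
]

results = [
    prefers_grammatical(t["log_probs"], t["form_grammatical"], t["form_ungrammatical"])
    for t in TESTS
]
accuracy = sum(results) / len(results)
print(f"{sum(results)}/{len(results)} pairs correct, accuracy = {accuracy:.2f}")
# prints "3/4 pairs correct, accuracy = 0.75"
```

In a real evaluation the scores would come from the model's output distribution at the [MASK] position (or at the next-token position for prompt-based datasets).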

Try out the Hugging Face 🤗 Dashboard

You can try out the official Grew-TSE dashboard, available as a Hugging Face Space. It is currently intended primarily for demonstration purposes, but can be useful for quickly carrying out syntactic evaluations.

Installation

Grew-TSE depends on the Grew ecosystem, so you must install Opam and Grewpy before using the package.

1. Install Opam & Grewpy Backend

Install Opam (Linux, macOS, or Windows via WSL), then set up Grewpy:

bash -c "sh <(curl -fsSL https://opam.ocaml.org/install.sh)"
opam init
opam remote add grew "https://opam.grew.fr"
opam update
opam install grewpy_backend
echo 'eval $(opam env)' >> ~/.bashrc

2. Install Grew-TSE

Once Opam and Grewpy are installed, go ahead and install the Python package:

pip install grew-tse

If you want to make use of the evaluation tools, you also need a few more dependencies:

# quoting keeps shells such as zsh from interpreting the square brackets
pip install "grew-tse[eval]"

For the full installation guide, see the documentation: 👉 https://grew-tse.readthedocs.io/

Basic Usage

The first step in using this package is to create a lexical item set, which is a fancy way of saying a dataset of words and their features. These are used to identify the ungrammatical word for every grammatical word that you isolate in your Grew query.

from grewtse.pipeline import GrewTSEPipe
g_pipe = GrewTSEPipe()

# the first step is always to load in a UD Treebank
# you can supply either a single file path or a list of file paths
treebank_path = "./my-treebanks/german.conllu"
g_pipe.parse_treebank(treebank_path)
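To make the idea of a lexical item set concrete, here is a minimal standalone sketch that reads CoNLL-U text and collects each word with its lemma, part of speech and morphological features. It relies only on the standard ten-column CoNLL-U layout; the German sample sentence and the function names are illustrative assumptions, and this is not how Grew-TSE parses treebanks internally.

```python
# A tiny hand-written CoNLL-U sample (illustrative, not from a real treebank).
# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
SAMPLE = "\n".join([
    "# text = Ich sehe den Hund",
    "\t".join(["1", "Ich", "ich", "PRON", "_", "Case=Nom|Number=Sing|Person=1", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "sehe", "sehen", "VERB", "_", "Number=Sing|Person=1|Tense=Pres", "0", "root", "_", "_"]),
    "\t".join(["3", "den", "der", "DET", "_", "Case=Acc|Gender=Masc|Number=Sing", "4", "det", "_", "_"]),
    "\t".join(["4", "Hund", "Hund", "NOUN", "_", "Case=Acc|Gender=Masc|Number=Sing", "2", "obj", "_", "_"]),
])

def parse_feats(feats):
    """Turn 'Case=Acc|Number=Sing' into {'Case': 'Acc', 'Number': 'Sing'}."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

def lexical_items(conllu_text):
    """Collect form, lemma, upos and features for every regular token."""
    items = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        items.append({
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "feats": parse_feats(cols[5]),
        })
    return items

items = lexical_items(SAMPLE)
print(items[3])
```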

A Grew query and a target are together the means by which we isolate an individual phenomenon and the target word, typically the grammatical word, of our grammatical-ungrammatical minimal pair. The feature values used in a Grew query may change between treebanks, but the logic of the query should remain consistent. The target is the variable in the Grew query that represents the word we want to change to form the minimal pair. For instance, DirObj in the query below is the name we have assigned to the direct object. Anything referenced as the target must be given a variable name in the query. The below fancy-schmancy query isolates non-negated transitive verb phrases:

grew_query = """
  pattern {
    V [upos=VERB];
    DirObj [Case=Acc];
    V -[obj]-> DirObj;
  }

  without {
    NEG [upos=PART, Polarity=Neg];
    V -[advmod:neg]-> NEG;
  }
"""

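# the query variable whose word is swapped to form the minimal pair;
# DirObj is the node that carries the Case feature changed below
target = "DirObj"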

The deeper your knowledge of a language, the better you'll be at choosing syntactic phenomena to evaluate. Treebanks with richer feature annotation allow you to ask more questions, and larger treebanks are more likely to contain suitable minimal pairs. A minimal pair is found by taking the target word and its features and altering (typically) one of those features, for instance by changing an accusative noun to a genitive one. Note that morphological constraints (e.g. Case, Gender, Number) are passed separately from universal constraints (upos). Morphological changes are specified in a dict, like so:

morphology_change = {
  "case": "Gen"
}
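The search for a minimal-pair partner can be pictured as a lookup: take the target word's features, apply the requested change, and look for another form of the same lemma with exactly the resulting features. The sketch below shows that idea on a toy lexicon. The lexicon entries, the lowercase feature names (chosen to mirror the dict above) and the function find_alternative are all assumptions made for illustration, not Grew-TSE internals.

```python
# Toy lexical item set (hypothetical entries for the German noun "Hund").
LEXICON = [
    {"form": "Hund", "lemma": "Hund", "upos": "NOUN",
     "feats": {"case": "Acc", "gender": "Masc", "number": "Sing"}},
    {"form": "Hundes", "lemma": "Hund", "upos": "NOUN",
     "feats": {"case": "Gen", "gender": "Masc", "number": "Sing"}},
    {"form": "Hunde", "lemma": "Hund", "upos": "NOUN",
     "feats": {"case": "Nom", "gender": "Masc", "number": "Plur"}},
]

def find_alternative(target, lexicon, change):
    """Return a form of the same lemma whose features equal the
    target's features with `change` applied, or None if none exists."""
    wanted = {**target["feats"], **change}
    for item in lexicon:
        if (item["lemma"] == target["lemma"]
                and item["upos"] == target["upos"]
                and item["feats"] == wanted):
            return item["form"]
    return None

target_word = LEXICON[0]  # accusative singular "Hund"
print(find_alternative(target_word, LEXICON, {"case": "Gen"}))  # prints "Hundes"
print(find_alternative(target_word, LEXICON, {"case": "Dat"}))  # prints "None"
```

The second call shows why larger treebanks help: if no form with the altered features is attested, no minimal pair can be built for that sentence.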

The generation of grammatical-ungrammatical minimal pairs for each sentence, as well as the automatic masking of that sentence, can then be undertaken with the following:

# generate a dataset from the treebank that creates masked
# sentences for masked language modeling (MLM)
masked_df = g_pipe.generate_masked_dataset(
    grew_query, 
    target
)

# generate a dataset from the treebank that creates prompts
# for next-word prediction
prompt_df = g_pipe.generate_prompt_dataset(
    grew_query, 
    target
)

# can only occur after a masked or prompt dataset
# has been generated
mp_dataset = g_pipe.generate_minimal_pair_dataset(
    morphology_change,
)
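The difference between the masked and prompt formats can be sketched in a few lines: a masked example replaces the target word with a mask token, while a prompt example keeps only the text before the target. The helper names below are hypothetical, the [MASK] token is taken from the table shown earlier, and the naive space-joining ignores details such as punctuation spacing; this is not the library's implementation.

```python
def to_masked(tokens, target_index, mask_token="[MASK]"):
    """Replace the target token with a mask token (MLM-style test)."""
    return " ".join(tokens[:target_index] + [mask_token] + tokens[target_index + 1:])

def to_prompt(tokens, target_index):
    """Keep only the tokens before the target (next-word prediction test)."""
    return " ".join(tokens[:target_index])

tokens = ["Jim", "Lobe", "has", "more", "on", "the", "political", "implications"]
target_index = 2  # position of the target word "has"

print(to_masked(tokens, target_index))  # prints "Jim Lobe [MASK] more on the political implications"
print(to_prompt(tokens, target_index))  # prints "Jim Lobe"
```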

Built With

Grew-TSE was built completely in Python and is available as a Python package. It makes use of the Hugging Face Transformers library as well as plotnine for plotting.

  • Python
  • Hugging Face

Of course, the grewpy package was essential for this project.



For questions or academic collaboration inquiries, please contact the maintainer via the GitHub repository.
