Skip to content

rifatarefin/natgi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

480 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NatGI Overview

NatGI infers context-free grammar from example programs. It has only black‑box access to the language parser (an oracle) during the learning process. NatGI follows the parse tree recovery principle from Arvada/TreeVada for grammar inference. NatGI's tree building technique is more powerful with GPT-4o and it produces human‑readable grammars with semantically meaningful nonterminal names.

Ground Truth while Grammar NatGIBase Inferred while Grammar
grammar whileLang;

start
    : stmt
    ;
stmt
    : stmt ';' stmt
    | 'skip'
    ;
boolexpr
    : '~' boolexpr
    | boolexpr '&' boolexpr
    | numexpr '==' numexpr
    | 'true'
    | 'false'
    ;
numexpr
    : '(' numexpr '+' numexpr ')'
    | 'L'
    | 'n'
    ;
grammar whileLang;

start
    : stmt
    ;
stmt
    : stmt ';' stmt
    | 'L' '=' numexpr
    | 'L' '=' 'L'
    | 'while' boolexpr 'do' stmt
    | 'if' boolexpr 'then' stmt 'else' stmt
    | 'skip'
    ;
boolexpr
    : '~' boolexpr
    | boolexpr '&' boolexpr
    | numexpr '==' numexpr
    | 'L' '==' numexpr
    | numexpr '==' 'L'
    | 'L' '==' 'L'
    | 'true'
    | 'false'
    ;
numexpr
    : '(' 'L' '+' 'L' ')'
    | '(' 'L' '+' numexpr ')'
    | '(' numexpr '+' 'L' ')'
    | '(' numexpr '+' numexpr ')'
    | 'n'
    ;

Key features

  • Needs few example programs and an oracle, the oracle is a command that accepts a filename and returns 0 for valid inputs.
  • Relies on LLM to build parse-trees of the example programs.
  • Produces human-readable grammars with meaningful non-terminals.
  • Generates grammar in ANTLR4 format, which is directly compatible with popular grammar-based fuzzers (i.e. Grammarinator)

NatGI

NatGI is built on Arvada/TreeVada's approach for grammar inference and aims to make inferred grammars easier to inspect and use in downstream tooling.

Requirements

Requires at least python 3.10. Install the following two packages via pip to make sure everything runs:

$ pip3 install lark-parser tqdm openai PrettyPrintTree

(Optional) for using python target of antlr4 parser

pip3 install antlr4-python3-runtime==4.9.2

Quickstart

NatGI takes a directory of example programs (TRAIN_DIR) and an oracle command (ORACLE_CMD). The oracle must be invocable as:

  • ORACLE_CMD filename — runs the oracle on the file filename
  • return 0 for valid examples, non‑zero for invalid examples

Search.py is the entry point to the tool. To learn a grammar run:

$ python3 search.py ORACLE_CMD TRAIN_DIR LOG_FILE

This writes training details to LOG_FILE and saves the learned grammar as a pickled dictionary in LOG_FILE.gramdict. For more configurable options, run python3 search.py --help.

If you also have a held-out test set in TEST_DIR, you can evaluate the precision and recall of the mined grammar with the utility eval.py. This utility also handily prints out the unpickled grammar to LOG_FILE.eval file. The provided LOG_FILE must match one generated by search.py, as this utility looks for LOG_FILE.gramdict.

$ python3 eval.py [--no-antlr4] [-n PRECISION_SET_SIZE] ORACLE_CMD TEST_DIR LOG_FILE
  • PRECISION_SET_SIZE (optional) is the number of samples to draw from the learned grammar to estimate precision (default: 1000).
  • TEST_DIR is a directory of held-out valid programs.
  • --no-antlr4 flag doesn't generate antlr4 grammar. Otherwise antlr4 grammar is written into a .g4 file.

Zero shot GPT-o4 (mini) grammar

Compare how NatGI performs against zero shot grammar inference, even with powerful reasoning models like GPT-o4.

  1. Run gpt.py to get a text grammar.
python3 gpt.py SEED_DIR SEED_NAME   #python3 gpt.py Seed_Programs/tinyc/tinyc-train tinyc-r1

This writes a text grammar to results/gpt_grammar_{SEED_NAME}.txt.

  1. Convert the text grammar into the pickled .gramdict format (uses ORACLE_CMD for lexical inference on the terminals):
python3 grammar_from_csv.py SEED_NAME ORACLE_CMD        #python3 grammar_from_csv.py tinyc-r1 Seed_Programs/tinyc/parse_tinyc

This produces a binary grammar file named results/gpt-grammar_{SEED_NAME}.gramdict.

  1. Evaluate the saved grammar (precision/recall/F1) with the eval utility:
python3 eval.py ORACLE_CMD TEST_DIR LOG_FILE   #python3 eval.py Seed_Programs/tinyc/parse_tinyc Seed_Programs/tinyc/tinyc-test tinyc-r1

Here, LOG_FILE is the base name used for the .gramdict produced by step 2.

Note that grammar_from_csv.py expects the text grammar in a format instructed in the prompt. if the LLM cannot follow the specific format or the grammar has errors (i.e. some non-terminals used in rules but never defined), step 2 and 3 will fail.

Releases

No releases published

Packages

 
 
 

Contributors

Languages