Rigorous-Software-Engineering/LaZ

Content

This README contains instructions for installing and running vanilla Nomos and LaZ (the LaZ instructions are toward the bottom).

Directories content:

  • benchmarks (models and specifications): compas, german_credit, mnist, speech_command, hotel_review, lunar, bipedal
  • auxiliary code for running benchmarks: mod_gym, mod_stable_baselines3
  • source code: lang, opt
  • experimental results: results
  • scripts to run the experiments: experiments
  • main Nomos script and grammar (in the root directory): main.py and Nomos.g4

NOMOS

This repository contains our implementation of the NOMOS specification language and its application to testing and re-training of ML models. The language and our testing framework are introduced in our IJCAI 2023 paper "Specifying and Testing k-Safety Properties for Machine-Learning Models".

Specification Language

The NOMOS grammar is given in the Nomos.g4 file in the main directory. We use ANTLR4 to parse the grammar, and the language semantics are implemented in lang/MyNomosVisitor.py. The parser, lexer, and other necessary files have to be generated using the following command:

antlr4 -Dlanguage=Python3 Nomos.g4 -visitor -o lang

This requires an ANTLR4 installation, including its Python runtime. The above command generates the parser, lexer, and other necessary files under the lang folder. main.py translates a NOMOS specification into Python3 code, namely the test harness. The following command illustrates how to translate a specification file:

python3 main.py <path_to_spec_file>

<benchmark_name> refers to the name of the benchmark that the ML model is trained on. A folder with this name must exist in the main directory, and it should contain a file named helper.py; this file provides the functions needed to test the specification, such as loading the model and loading inputs. See compas/helper.py for an example. <path_to_spec_file> refers to the path to the specification file. See the following section for where to find the specifications tested in the paper.
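To make the helper.py contract concrete, here is a minimal sketch of such a file. The function names, signatures, and bodies are assumptions for illustration, not the repository's actual API — see compas/helper.py for the real interface:

```python
# Illustrative sketch of a benchmark's helper.py; the function names and
# bodies here are assumptions, not the repository's actual interface.
import random


def load_model(model_path):
    """Load the ML model under test from model_path (stand-in)."""
    # A real helper would deserialize the trained model here.
    return {"path": model_path}


def load_input(rng):
    """Sample one test input using the harness's seeded RNG (stand-in)."""
    # A real helper would draw a valid input from the benchmark's domain.
    return [rng.random() for _ in range(3)]


if __name__ == "__main__":
    rng = random.Random(42)
    model = load_model("compas/compas_dt_model.p")
    print(model["path"], len(load_input(rng)))
```

The key design point is that the generated harness stays benchmark-agnostic: all benchmark-specific logic (model format, input domain) lives behind these two entry points.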

Upon successful execution of the above command, a Python file is created under <benchmark_name>/harness_<spec_name>.py, where <spec_name> is the specification file name without the .nom extension. See compas/harness_felony_inc.py for an example. We describe how to test the model below.

Specifications

This repository also contains all NOMOS specifications that we used to test models in the above paper. The specification files can be found in the following directories:

  • compas/specs/*.nom
  • german_credit/specs/*.nom
  • mnist/specs/*.nom
  • speech_command/specs/*.nom
  • hotel_review/specs/*.nom
  • lunar/specs/*.nom

Testing Models

The generated test harness is used to check whether the specification holds for a given ML model. A harness takes three inputs from the user:

  1. a random seed: an integer to seed the random number generator
  2. the path to the model: the path of the model file that will be loaded by helper.py
  3. the test budget (i.e., the number of inputs for which the postcondition should be checked)

For example, the following command runs the test harness compas.harness_felony_inc for our Compas model compas/compas_dt_model.p with random seed 42 and budget 5000:

python3 -m compas.harness_felony_inc 42 compas/compas_dt_model.p 5000

Upon successful execution, a summary of the testing campaign is printed to the standard output.
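Conceptually, a harness wires these three arguments into a simple test loop. The following is an illustrative sketch, not the generated code: the model stand-in and the check_postcondition callback are assumptions made for the example.

```python
# Minimal sketch of the test loop inside a generated harness; the model
# loading and the postcondition check are illustrative stand-ins.
import random
import sys


def run_harness(seed, model_path, budget, check_postcondition):
    rng = random.Random(seed)      # input 1: seed the RNG
    model = {"path": model_path}   # input 2: stand-in for loading the model
    failures = 0
    for _ in range(budget):        # input 3: check `budget` inputs
        x = rng.random()
        if not check_postcondition(model, x):
            failures += 1
    return failures


if __name__ == "__main__":
    # e.g. python3 harness.py 42 compas/compas_dt_model.p 5000
    seed, model_path, budget = int(sys.argv[1]), sys.argv[2], int(sys.argv[3])
    print(run_harness(seed, model_path, budget, lambda m, x: x >= 0.0))
```

Seeding the generator from the first argument is what makes a testing campaign reproducible: the same seed, model, and budget always exercise the same inputs.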

Feasibility Study: Using Bugs for Model Training

For our feasibility study, we devise a new guided RL training scheme (extending standard PPO training) that incorporates the states on which we identify bugs in the current or past policies. Our guided training algorithm can be started using the following command:

python3 -m lunar.train_agents <timesteps> guided <rand_seed>

<timesteps> is typically a large integer, such as 1 million, and <rand_seed> is an integer to seed the random number generator used for training.

LaZ - Lazy Testing of ML Models

To run LaZ, first install all Nomos dependencies with: bash -l install_env.sh

Make sure the path to the Mopsa analyser is set in two environment variables, MOPSAPATH and PATH (the install_env.sh script writes the necessary exports into ~/.bashrc).

Generate a single test harness

To generate a harness for the different variations of LaZ, use the following commands (they differ only in their flags):

  1. Nomos: python3 main.py path-to-spec
  2. LaZ-Naive: python3 main.py path-to-spec
  3. LaZ: python3 main.py path-to-spec -on
  4. LaZ-Man: already generated; conditions are hardcoded, see <dataset>/harnesses/manual_on_<spec_name>.py
  5. LaZ-Flex: python3 main.py path-to-spec -topt
  6. LaZ-Flex-Man: already generated; conditions are hardcoded, see <dataset>/harnesses/manual_opt_<spec_name>.py

See all available options by running python3 main.py -h.
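For scripting, the variants above reduce to a small flag table. The helper below is hypothetical — only the -on and -topt flags come from the list above; the variant labels and the function itself are illustrative:

```python
# Build main.py invocations for each LaZ variant. The flags -on and -topt
# come from the variant list above; the helper itself is illustrative.
VARIANT_FLAGS = {
    "nomos": [],            # vanilla Nomos / LaZ-Naive: no extra flag
    "laz": ["-on"],         # LaZ
    "laz-flex": ["-topt"],  # LaZ-Flex
}


def build_command(spec_path, variant):
    """Return the argv list that generates a harness for `variant`."""
    return ["python3", "main.py", spec_path] + VARIANT_FLAGS[variant]


if __name__ == "__main__":
    for variant in VARIANT_FLAGS:
        print(" ".join(build_command("compas/specs/felony_inc.nom", variant)))
```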

Generate experiments for hyper-parameter study

To generate all experiments, run: python experiments/hyperparameters.py

This will create harnesses for all datasets and specifications in the <dataset>/harnesses/ directories, as well as Python scripts that execute the experiments and log the data in experiments/run_sa_<dataset-and-spec-name>.py.

You can run the hyperparameter experiment for one specification with: python3 -m experiments.run_sa_<dataset-and-spec-name>

or use the script that will execute all harnesses: ./experiments/fastest_first.sh

Important! All experiments for one specification have to be executed in one go; otherwise, runtime fluctuations may affect the relative results, and the plots are likely to look different.

Time info: The hyper-parameter study takes several days; we execute 110 runs (5 seeds and 22 configurations) per specification. Beware that lunar and bipedal are slow: one test harness with a budget of 1000 tests takes more than 3 hours. The running times for the faster benchmarks (german_credit and compas) were about 90 seconds per configuration on our machine, so the hyper-parameter experiment for each of these specifications took approximately 90 * 110 = 9900 seconds, which is a little under 3 hours.

Note: The bipedal benchmark requires retraining before any tests can be executed on a new machine (paths are hardcoded in the model).

When executed, each experiment creates a CSV file in the results/ directory. After all experiments are done, run ./experiments_merge_csvs.sh to merge all results into a single CSV file.
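The merge step itself is a plain header-aware concatenation. The following Python sketch is an illustrative equivalent of the shipped shell script; the filenames and column layout are assumptions:

```python
# Illustrative equivalent of the CSV-merging script: concatenate all
# per-experiment CSVs in a directory, keeping a single header row.
import csv
import glob
import os


def merge_csvs(result_dir, out_path):
    """Merge every *.csv in result_dir into out_path; return the data row count."""
    header, rows = None, []
    for path in sorted(glob.glob(os.path.join(result_dir, "*.csv"))):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            file_header = next(reader)
            if header is None:
                header = file_header  # keep the first header only
            rows.extend(reader)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```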

To plot the relative test-throughput improvement (Figures 5a and 5b) and print the analysis statistics (Table 1), use the command:

python experiments/plot_throughput.py results/hypertmp.csv [laz|nomos]

The last option selects the baseline: laz plots the throughput relative to the LaZ results (as in the paper), while nomos plots it relative to vanilla Nomos. If no option is given, laz is used by default.
