Rigorous-Software-Engineering/LaZ

Content

This README contains instructions for installing and running vanilla Nomos and LaZ (the LaZ instructions are toward the bottom).

Directories content:

  • benchmarks (models and specifications): compas, german_credit, mnist, speech_command, hotel_review, lunar, bipedal
  • auxiliary code for running benchmarks: mod_gym, mod_stable_baselines3
  • source code: lang, opt
  • experimental results: results
  • scripts to run the experiments: experiments
  • main Nomos script and grammar (in the root directory): main.py and Nomos.g4

NOMOS

This repository contains our implementation of the NOMOS specification language and its application to testing and re-training of ML models. The language and our testing framework are introduced in our IJCAI 2023 paper "Specifying and Testing k-Safety Properties for Machine-Learning Models".

Specification Language

The NOMOS grammar is given in the Nomos.g4 file in the main directory. We use ANTLR4 to parse the grammar, and the language semantics are implemented in lang/MyNomosVisitor.py. The parser, lexer, and other necessary files have to be generated using the following command:

antlr4 -Dlanguage=Python3 Nomos.g4 -visitor -o lang

This requires an ANTLR4 installation, including its Python runtime. The above command generates the parser, lexer, and other necessary files under the lang folder. main.py translates a NOMOS specification into Python3 code, namely the test harness. The following command illustrates how to translate a specification file:

python3 main.py <path_to_spec_file>

<benchmark_name> refers to the name of the benchmark that the ML model is trained on. A folder with this name must exist in the main directory, and it should contain a file named helper.py; this file provides the functions needed to test the specification, such as loading the model and loading inputs. See compas/helper.py for an example. <path_to_spec_file> refers to the path to the specification file. See the following section for where to find the specifications tested in the paper.
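To make the helper.py contract concrete, here is a minimal sketch of such a file. The function names, signatures, and bodies are assumptions for illustration, not the repository's actual API — see compas/helper.py for the real interface:

```python
# Illustrative sketch of a benchmark's helper.py; the function names and
# bodies here are assumptions, not the repository's actual interface.
import random


def load_model(model_path):
    """Load the ML model under test from model_path (stand-in)."""
    # A real helper would deserialize the trained model here.
    return {"path": model_path}


def load_input(rng):
    """Sample one test input using the harness's seeded RNG (stand-in)."""
    # A real helper would draw a valid input from the benchmark's domain.
    return [rng.random() for _ in range(3)]


if __name__ == "__main__":
    rng = random.Random(42)
    model = load_model("compas/compas_dt_model.p")
    print(model["path"], len(load_input(rng)))
```

The key design point is that the generated harness stays benchmark-agnostic: all benchmark-specific logic (model format, input domain) lives behind these two entry points.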

Upon successful execution of the above command, a Python file is created under <benchmark_name>/harness_<spec_name>.py, where <spec_name> is the specification file name without the .nom extension. See compas/harness_felony_inc.py for an example. We describe how to test the model below.

Specifications

This repository also contains all NOMOS specifications that we used to test models in the above paper. The specification files can be found in the following directories:

  • compas/specs/*.nom
  • german_credit/specs/*.nom
  • mnist/specs/*.nom
  • speech_command/specs/*.nom
  • hotel_review/specs/*.nom
  • lunar/specs/*.nom

Testing Models

The generated test harness is used to check whether the specification holds for a given ML model. A harness takes three inputs from the user:

  1. a random seed: an integer to seed the random number generator
  2. the path to the model: the path of the model file that will be loaded by helper.py
  3. the test budget (i.e., the number of inputs for which the postcondition should be checked)

For example, the following command runs the test harness compas.harness_felony_inc for our Compas model compas/compas_dt_model.p with random seed 42 and budget 5000:

python3 -m compas.harness_felony_inc 42 compas/compas_dt_model.p 5000

Upon successful execution, a summary of the testing campaign is printed to the standard output.
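Conceptually, a harness wires these three arguments into a simple test loop. The following is an illustrative sketch, not the generated code: the model stand-in and the check_postcondition callback are assumptions made for the example.

```python
# Minimal sketch of the test loop inside a generated harness; the model
# loading and the postcondition check are illustrative stand-ins.
import random
import sys


def run_harness(seed, model_path, budget, check_postcondition):
    rng = random.Random(seed)      # input 1: seed the RNG
    model = {"path": model_path}   # input 2: stand-in for loading the model
    failures = 0
    for _ in range(budget):        # input 3: check `budget` inputs
        x = rng.random()
        if not check_postcondition(model, x):
            failures += 1
    return failures


if __name__ == "__main__":
    # e.g. python3 harness.py 42 compas/compas_dt_model.p 5000
    seed, model_path, budget = int(sys.argv[1]), sys.argv[2], int(sys.argv[3])
    print(run_harness(seed, model_path, budget, lambda m, x: x >= 0.0))
```

Seeding the generator from the first argument is what makes a testing campaign reproducible: the same seed, model, and budget always exercise the same inputs.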

Feasibility Study: Using Bugs for Model Training

For our feasibility study, we devise a new guided RL training scheme (extending standard PPO training) that incorporates the states on which we identify bugs in the current or past policies. Our guided training algorithm can be started using the following command:

python3 -m lunar.train_agents <timesteps> guided <rand_seed>

<timesteps> is typically a large integer, such as 1 million, and <rand_seed> is an integer to seed the random number generator used for training.

LaZ - Lazy Testing of ML Models

To run LaZ, first install all Nomos dependencies with: bash -l install_env.sh

Make sure the path to the Mopsa analyser is set in two environment variables, MOPSAPATH and PATH (the install_env.sh script writes the necessary exports into ~/.bashrc).

Generate a single test harness

To generate a harness for the different variations of LaZ, use the following commands (they differ only in their flags):

  1. Nomos: python3 main.py path-to-spec
  2. LaZ-Naive: python3 main.py path-to-spec
  3. LaZ: python3 main.py path-to-spec -on
  4. LaZ-Man: already generated; conditions are hardcoded, see <dataset>/harnesses/manual_on_<spec_name>.py
  5. LaZ-Flex: python3 main.py path-to-spec -topt
  6. LaZ-Flex-Man: already generated; conditions are hardcoded, see <dataset>/harnesses/manual_opt_<spec_name>.py

See all available options by running python3 main.py -h.
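For scripting, the variants above reduce to a small flag table. The helper below is hypothetical — only the -on and -topt flags come from the list above; the variant labels and the function itself are illustrative:

```python
# Build main.py invocations for each LaZ variant. The flags -on and -topt
# come from the variant list above; the helper itself is illustrative.
VARIANT_FLAGS = {
    "nomos": [],            # vanilla Nomos / LaZ-Naive: no extra flag
    "laz": ["-on"],         # LaZ
    "laz-flex": ["-topt"],  # LaZ-Flex
}


def build_command(spec_path, variant):
    """Return the argv list that generates a harness for `variant`."""
    return ["python3", "main.py", spec_path] + VARIANT_FLAGS[variant]


if __name__ == "__main__":
    for variant in VARIANT_FLAGS:
        print(" ".join(build_command("compas/specs/felony_inc.nom", variant)))
```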

Generate experiments for hyper-parameter study

To generate all experiments, run: python experiments/hyperparameters.py

This will create harnesses for all datasets and specifications in the <dataset>/harnesses/ directories, as well as Python scripts that execute the experiments and log the data in experiments/run_sa_<dataset-and-spec-name>.py.

You can run the hyperparameter experiment for one specification with: python3 -m experiments.run_sa_<dataset-and-spec-name>

or use the script that will execute all harnesses: ./experiments/fastest_first.sh

Important! All experiments for one specification have to be executed in one go; otherwise, runtime fluctuations may affect the relative results, and the plots are likely to look different.

Time info: The hyper-parameter study takes several days; we execute 110 runs (5 seeds and 22 configurations) per specification. Beware that lunar and bipedal are slow: one test harness with a budget of 1000 tests takes more than 3 hours. The running times for the faster benchmarks (german_credit and compas) were about 90 seconds per configuration on our machine, so the hyper-parameter experiment for each of these specifications took approximately 90 * 110 = 9900 seconds, which is a little under 3 hours.

Note: The bipedal benchmark requires retraining before any tests can be executed on a new machine (paths are hardcoded in the model).

When executed, each experiment creates a CSV file in the results/ directory. After all experiments are done, run ./experiments_merge_csvs.sh to merge all results into a single CSV file.
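The merge step itself is a plain header-aware concatenation. The following Python sketch is an illustrative equivalent of the shipped shell script; the filenames and column layout are assumptions:

```python
# Illustrative equivalent of the CSV-merging script: concatenate all
# per-experiment CSVs in a directory, keeping a single header row.
import csv
import glob
import os


def merge_csvs(result_dir, out_path):
    """Merge every *.csv in result_dir into out_path; return the data row count."""
    header, rows = None, []
    for path in sorted(glob.glob(os.path.join(result_dir, "*.csv"))):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            file_header = next(reader)
            if header is None:
                header = file_header  # keep the first header only
            rows.extend(reader)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```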

To plot the relative test-throughput improvement (Figures 5a and 5b) and print the analysis statistics (Table 1), use the command:

python experiments/plot_throughput.py results/hypertmp.csv [laz|nomos]

The last option selects the baseline: laz plots the throughput relative to the LaZ results (as in the paper), while nomos plots it relative to vanilla Nomos. If no option is given, laz is used by default.
