
Commit c39323e

Merge pull request #67 from CLMBRs/learn_quant
Add learn quant
2 parents 6858d22 + 9ec99b5 commit c39323e


43 files changed, +14378 −0 lines changed

src/examples/learn_quant/README.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# Introduction
This module provides code used in the publication `Quantifiers of Greater Monotonicity are Easier to Learn`, presented at SALT35, sponsored by the Linguistic Society of America.

This code provides an example of using the `ultk` package to generate abstract data models of ordered referents in a universe and to define a grammar for generating unique quantifier expressions, as well as enumerating quantifier expressions and evaluating their meaning with respect to a universe of referents.
The example also includes code for training neural models to correctly verify a given quantifier expression, along with functions that compute a quantifier's degree of monotonicity, as described in the published manuscript.

For an introduction to the data structures and research question, please refer to the publication and to the [tutorial](src/examples/learn_quant/notebooks/tutorial.ipynb).

It is highly recommended that the user review the docs of the [`hydra` package](https://hydra.cc).

# Usage

## Generation
From the `src/examples` directory:

`python -m learn_quant.scripts.generate_expressions`: generates `generated_expressions.yml` files that catalog the licensed `QuantifierModel`s for a given `Grammar` and `QuantifierUniverse`, using the config at `conf/expressions.yaml`.

Using `hydra`, you may refer to the recipe files at `conf/recipes/`:

`python -m learn_quant.scripts.generate_expressions recipe=3_3_3_xi.yaml`

This generates unique expressions evaluated over the universe and up to the depth specified in the selected recipe config.
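A recipe is simply a small config that fixes the universe and grammar parameters for a run. The recipe files themselves are not reproduced here; as an illustration only, a recipe like `conf/recipes/3_3_3_xi.yaml` might look roughly like the sketch below, where the field names follow the `UniverseConfig`/`GrammarConfig` schema included later in this commit and the specific values and `@package` directive are assumptions:

```
# @package _global_
# Hypothetical recipe: a 3x3 universe with generation depth 3 (values are assumptions).
universe:
  m_size: 3
  x_size: 3
  weight: 2.0
  inclusive_universes: false
grammar:
  depth: 3
  indices: true
```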
You may also override specific parameters:

`python -m learn_quant.scripts.generate_expressions recipe=3_3_3_xi.yaml ++universe.m_size=4`

## Learning

### Sampling
At large universe sizes and generation depths, the number of generated expressions can be too large to complete learning experiments with the available compute resources.

After generating a list of expressions, you may sample them using `notebooks/randomize_expression_index.ipynb`. This generates a `.csv` file that draws the desired number of expressions and maps them to their ordering in the original `generated_expressions.yml` file.
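The notebook is not reproduced in this commit, but the sampling step it performs is straightforward. A minimal standalone sketch, assuming the generated expressions parse as a list and using the output paths from `conf/learn.yaml` (the exact file layout and column names are assumptions):

```
import csv
import random

import yaml

N_SAMPLES = 2000  # matches expressions.n_limit in conf/learn.yaml

# Load the catalog of generated expressions (assumed to parse as a list).
with open("learn_quant/outputs/M4/X4/d5/generated_expressions.yml") as f:
    expressions = yaml.safe_load(f)

# Draw a random subset and record each draw's position in the original file.
sampled = random.sample(range(len(expressions)), k=min(N_SAMPLES, len(expressions)))

with open("learn_quant/expressions_sample_2k.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample_order", "original_index"])
    for order, original_index in enumerate(sampled):
        writer.writerow([order, original_index])
```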
### Training with `slurm`
On your `slurm`-configured node:

Uncomment the following lines in `conf/learn.yaml`:
```
# - override hydra/launcher: swarm
# - override hydra/sweeper: sweep
```

Run:

`HYDRA_FULL_ERROR=1 python -m learn_quant.scripts.learn_quantifiers --multirun training.lightning=true training.strategy=multirun training.device=cpu model=mvlstm grammar.indices=false`

This command will read the config at `conf/learn.yaml`, prepare training data based on the chosen quantifier expressions, and run one training job per expression **in parallel** using the `hydra` `submitit` plugin. To set specific `slurm` parameters, you may modify `conf/hydra/launcher/swarm.yaml`, as in the example below.
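For reference, `conf/hydra/launcher/swarm.yaml` (included in full later in this commit) holds the cluster-specific settings. The values below are the defaults from that file and should be adjusted for your own cluster:

```
partition: gpu-l40
account: clmbr          # your slurm account, if required by your cluster
timeout_min: 1440
cpus_per_task: 1
mem_gb: 8
array_parallelism: 120  # number of jobs to launch in parallel
```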
### Without `slurm`
Run:

`HYDRA_FULL_ERROR=1 python -m learn_quant.scripts.learn_quantifiers training.lightning=true training.strategy=multirun training.device=cpu model=mvlstm grammar.indices=false`

This command will read the config at `conf/learn.yaml`, prepare training data based on the chosen quantifier expressions, and sequentially run one training job per expression on your local machine.

### Tracking

If you would like to track experimental runs with MLflow, you may run an `mlflow` server at the endpoint specified at `tracking.mlflow.host` and have `learn_quant.scripts.learn_quantifiers` log metrics to that server.
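For example, a tracking server listening on the default `tracking.mlflow.port` of 5000 can be started with the command below; the bind address is only an example, and `tracking.mlflow.host` in `conf/learn.yaml` should point at wherever the server actually runs:

```
mlflow server --host 0.0.0.0 --port 5000
```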
You may turn off MLflow tracking by setting the config value `tracking.mlflow.active` to `false`.

## Calculation of monotonicity
The `measures.py` script calculates monotonicity for specified quantifier expressions at given universe sizes. It references the config `conf/learn.yaml`: the generated expressions are loaded from the folder associated with the parameter values under the `expressions` keyspace, and `measures.expressions` specifies which of those expressions will be measured. If universe parameters are defined under the `measures.monotonicity.universe` keyspace, they define the size of the universe at which the monotonicity value is calculated for each expression.

Run `python -m learn_quant.measures` to generate a `.csv` file of the specified monotonicity measurements.
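For example, the comment block in `conf/learn.yaml` (included later in this commit) shows how to target a different expressions file by overriding the grammar parameters:

```
HYDRA_FULL_ERROR=1 python -m learn_quant.measures ++expressions.grammar.depth=3 ++expressions.grammar.index_weight=5.0 ++expressions.grammar.indices="[0,3]"
```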
# Content Descriptions

- `scripts`: a set of scripts for generating `QuantifierModel`s and measuring various properties of individual models and sets of models.
- `generate_expressions.py`: This script references the configuration file at `conf/expressions.yaml` to generate a `Universe` of the specified dimensions and to generate all expressions from a defined `Grammar`. Outputs are saved in the `outputs` folder. The script saves the _shortest_ expression (an ULTK `GrammaticalExpression`) for each possible `Meaning` (set of `Referent`s) verified by licit permutations of the composed functions defined in `grammar.yml`. In particular, ULTK provides methods for enumerating all grammatical expressions up to a given depth, with user-provided keys for uniqueness and for comparison in the case of a clash. By setting the former to get the `Meaning` from an expression and the latter to compare along the length of the expression, the enumeration method returns a mapping from meanings to the shortest expressions which express them (see the sketch after this list).
- `learn_quantifiers.py`: This script references the configuration file at `conf/learn.yaml`. It loads expressions that were saved to the `outputs` folder by the `generate_expressions.py` script. It transforms the data into a format that allows training a neural network to learn the relationship between quantifier models and the truth values assigned by a particular expression. The script then iterates through the loaded expressions and uses PyTorch Lightning to train a neural model to verify randomly sampled models of particular sizes (determined by the `M` and `X` parameters). Logs of parameters, metrics, and other artifacts are saved to an `mlruns` folder in directories specified by the configuration of the running `mlflow` server.
- `grammar.yml`: defines the "language of thought" grammar (a ULTK `Grammar` is created from this file in one line in `grammar.py`) for this domain, using the functions in [van de Pol 2023](https://pubmed.ncbi.nlm.nih.gov/36563568/).
- `measures.py`: functions to measure degrees of monotonicity of quantifier expressions according to Section 5 of [Steinert-Threlkeld, 2021](https://doi.org/10.3390/e23101335).
- `outputs`: outputs from the generation routines for creating `QuantifierModel`s and `QuantifierUniverse`s.
- `quantifier.py`: subclasses of `ultk`'s `Referent` and `Universe` classes that add properties and functionality for quantifier learning with `ultk`.
- `sampling.py`: functions for sampling quantifier models as training data.
- `set_primitives.py`: optional module-defined functions for primitives of the basic grammar. Not used unless specified by the `grammar.typed_rules` key.
- `training.py`: base `torch` classes and helper functions. Referenced only when `training.lightning=false`. Not maintained.
- `training_lightning.py`: primary training classes and functions. Uses `lightning`.
- `util.py`: utility functions, I/O, etc.
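The bookkeeping described under `generate_expressions.py` can be pictured with the following sketch (illustrative Python only, not `ultk`'s actual API): expressions are keyed by their meaning, and when two expressions share a meaning, the shorter one wins.

```
from typing import Callable, Dict, Hashable, Iterable

def shortest_by_meaning(
    expressions: Iterable[object],
    meaning_of: Callable[[object], Hashable],  # uniqueness key: the expression's Meaning
    length_of: Callable[[object], int],        # comparison key on a clash: expression length
) -> Dict[Hashable, object]:
    """Map each meaning to the shortest expression that expresses it."""
    best: Dict[Hashable, object] = {}
    for expr in expressions:
        meaning = meaning_of(expr)
        if meaning not in best or length_of(expr) < length_of(best[meaning]):
            best[meaning] = expr
    return best
```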
# TODO:
- Fully implement `hydra`'s `Structured Config` (example begun with `conf/expressions.py`)
- Show example of adding custom primitives with custom-implemented classes (`quantifiers_grammar_xprimitives`)

src/examples/learn_quant/__init__.py

Whitespace-only changes.

src/examples/learn_quant/conf/__init__.py

Whitespace-only changes.
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
import dataclasses
from dataclasses import dataclass, field
from omegaconf import DictConfig
import hydra
from hydra.core.config_store import ConfigStore


@dataclasses.dataclass
class ModeConfig:
    name: str


@dataclasses.dataclass
class UniverseConfig:
    m_size: int
    x_size: int
    weight: float
    inclusive_universes: bool


@dataclasses.dataclass
class GrammarConfig:
    depth: int


# Define a configuration schema
@dataclasses.dataclass
class Config:
    mode: ModeConfig
    universe: UniverseConfig
    grammar: GrammarConfig


cs = ConfigStore.instance()
# Registering the Config schema with the name 'conf'.
cs.store(name="conf", node=Config)
cs.store(group="universe", name="base_config", node=UniverseConfig)
cs.store(group="grammar", name="base_config", node=GrammarConfig)
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
# This file is used to generate expressions given a desired recipe for a grammar and universe.
# Resulting expressions are saved in nested folders organized by the size of models generated at `./outputs`.

defaults:
  - _self_
  - recipe: ???

output: "learn_quant/outputs/"

mode:
  - generate

save: true
time_trial_log: M${universe.m_size}_X${universe.x_size}_D${grammar.depth}_idx-${grammar.indices}.csv
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
path: "learn_quant/grammar.yml"
weight: 2.0
depth: 3
indices: true
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
defaults:
  - _self_
  - typed_rules: set_primitives
path: "learn_quant/grammar_xprimitives.yml"
weight: 2.0
depth: 3
indices: true
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
module_path: learn_quant.set_primitives
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
submitit_folder: ${hydra.sweep.dir}/.submitit/%j
timeout_min: 1440
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
partition: gpu-l40
account: clmbr # Your account (if required by your cluster)
time: 2880 # Time in minutes (48 hours)
cpus_per_task: 1
mem_gb: 8
additional_parameters: {"gpus": "0", "time": "1-00"}
max_num_timeout: 10 # number of times to re-queue job after timeout
array_parallelism: 120 # number of jobs to launch in parallel
Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
# This configuration is used to train Pytorch models on generated quantifier expressions.
# It can be modified to either run a loop in a single process over multiple expressions, or to swarm learning jobs using `slurm`.


defaults:
  - _self_
  - model: null
  # To use the slurm launcher, you need to set the following options in the defaults list:
  # - override hydra/launcher: swarm # For use with launching multiple jobs via slurm
  # - override hydra/sweeper: sweep

experiment_name: transformers_improved_2 # Name of the experiment to be created in MLFlow
notes: |
  This run is to evaluate the neural learning of quantifiers and logging in MLFlow.

tracking:
  mlflow:
    active: true
    host: g3116 # This could be an IP address or a hostname (job name in slurm)
    port: 5000
    vars:
      MLFLOW_SYSTEM_METRICS_ENABLED: "true"
      MLFLOW_HTTP_REQUEST_MAX_RETRIES: "8"
      MLFLOW_HTTP_REQUEST_BACKOFF_FACTOR: "60"

# Options to define where the expressions should be created and/or loaded, how they should be represented, and how they should be generated.
# The expressions are generated from a grammar, which is defined in the grammar.yml file.
# The grammar is used to generate the expressions, and the expressions are then used to create the dataset used by the training script.
expressions:
  n_limit: 2000
  output_dir: learn_quant/outputs/
  grammar:
    depth: 5
    path: learn_quant/grammar.yml
    indices: false # If set to true, the index primitives will be used in the grammar. Specific integer indices can also be set.
    index_weight: 2.0
  universe:
    x_size: 4
    m_size: 4
  representation: one_hot
  downsampling: true
  generation_args:
    batch_size: 1000
    n_limit: 5000 # Minimum number of sample rows in dataset for a *single* class. Full dataset length is 2 * n_limit.
    M_size: 12
    X_size: 16
    entropy_threshold: 0.01
    inclusive: False
  batch_size: 64
  split: 0.8
  target: "M${expressions.universe.m_size}/X${expressions.universe.x_size}/d${expressions.grammar.depth}"
  index_file: "learn_quant/expressions_sample_2k.csv" # If set, examples will be trained in order according to the index file

training:
  # Given an expressions file, the "resume" key will ensure that the training will continue from the designated expression in the file.
  #resume:
  #  term_expression: and(and(not(subset_eq(A, B)), equals(cardinality(A), cardinality(B))), subset_eq(index(cardinality(A), union(A, B)), union(difference(A, B), difference(B, A))))
  strategy: multirun
  k_splits: 5
  n_runs: 1
  lightning: true
  device: cpu
  epochs: 50
  conditions: false
  early_stopping:
    threshold: 0.05
    monitor: val_loss
    min_delta: 0.001
    patience: 20
    mode: min
    check_on_train_epoch_end: false

optimizer:
  _partial_: true
  _target_: torch.optim.Adam
  lr: 1e-3

criterion:
  _target_: torch.nn.BCEWithLogitsLoss


# This section defines how the measures will be calculated.
# This is an example of how to use the measures module to calculate the monotonicity of the expressions.
# This will search for an expressions file that fits the given arguments and then calculate the monotonicity of the expressions.
# HYDRA_FULL_ERROR=1 python -m learn_quant.measures ++expressions.grammar.depth=3 ++expressions.grammar.index_weight=5.0 ++expressions.grammar.indices="[0,3]"
measures:
  expressions:
    - all
    # - or(subset_eq(A, B), subset_eq(B, A))
  monotonicity:
    debug: false
    direction:
      - all
    create_universe: false # This creates a universe for the purpose of evaluating monotonicity
    universe:
      x_size: 6
      m_size: 6
    # If you want to filter out certain representations in the universe, you can use the 'universe_filter' key.
    # This will filter out models with indices of the given values.
    #universe_filter:
    #  - 3
    #  - 4

# The hydra sweeper is used in tandem with the hydra slurm launcher to launch individual jobs for each expression.
hydra:
  sweeper:
    params:
      +expressions.index: range(0, ${expressions.n_limit})
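The sweeper params above expand into one job per expression index, 0 through `expressions.n_limit - 1`. As a rough equivalent (a sketch only; the exact override and quoting are assumptions based on the sweeper params in this file), the same sweep could be expressed directly on the command line:

```
python -m learn_quant.scripts.learn_quantifiers --multirun "+expressions.index=range(0, 2000)"
```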
