Commit b6da9af

Merge pull request '2.5.0' (#281) from develop into master
2 parents d6c4386 + d40d25a commit b6da9af

88 files changed

Lines changed: 5364 additions & 2395 deletions

.github/workflows/tests.yml

Lines changed: 7 additions & 5 deletions
@@ -16,16 +16,18 @@ jobs:
         python-version: ['3.11', '3.12', '3.13', "3.14"]

     steps:
-    - uses: actions/checkout@v4
-    - name: Run pre-commit checks on all files
-      uses: pre-commit/action@v3.0.1
+    - uses: actions/checkout@v5
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v5
+      uses: actions/setup-python@v6
       with:
         python-version: ${{ matrix.python-version }}
-    - name: Install Dependencies
+    - name: Run pre-commit checks on all files
       run: |
         python -m pip install --upgrade pip
+        pip install pre-commit
+        pre-commit run --all-files
+    - name: Install Dependencies
+      run: |
         pip install -r requirements.txt
         # Should we have some tests with only requirements.txt?
         pip install -r requirements-dev.txt

.pre-commit-config.yaml

Lines changed: 9 additions & 10 deletions
@@ -21,7 +21,7 @@ repos:
         name: (base:repo) debug-statements
         stages: [pre-commit]
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.14.0
+    rev: v0.14.14
     hooks:
       - id: ruff
         name: (base:repo) ruff
@@ -31,25 +31,24 @@ repos:
         name: (base:repo) ruff-format
         stages: [pre-commit]
   - repo: https://github.com/sphinx-contrib/sphinx-lint
-    rev: v1.0.0
+    rev: v1.0.2
     hooks:
       - id: sphinx-lint
         name: (docs:repo) sphinx-lint
         stages: [pre-commit]
-  - repo: local
+  - repo: https://github.com/biomejs/pre-commit
+    rev: v2.3.13
     hooks:
-      - id: biome-check-ts-js
+      - id: biome-check
         name: (webtools:repo) biome check typescript/javascript
-        entry: npx @biomejs/biome check --write ./webtools/src --config-path=./webtools/biome.json --files-ignore-unknown=true --no-errors-on-unmatched
-        language: node
-        types: [text]
+        args: ["./webtools/src", "--config-path=./webtools/biome.json"]
         files: "\\.(jsx?|tsx?|c(js|ts)|m(js|ts)|d\\.(ts|cts|mts)|jsonc?)$"
         exclude: ^docs/
         require_serial: true
-      - id: biome-check-css
+        stages: [pre-commit]
+      - id: biome-check
         name: (webtools:repo) biome check css
-        entry: npx @biomejs/biome check --write ./webtools/src --config-path=./webtools/biome.json --files-ignore-unknown=true --no-errors-on-unmatched
-        language: node
+        args: ["./webtools/src", "--config-path=./webtools/biome.json"]
         types: [text]
         files: "\\.(css?)$"
         exclude: ^docs/

CHANGELOG.md

Lines changed: 55 additions & 23 deletions
@@ -1,5 +1,37 @@
 # Changelog

+## 2.5.0
+
+> [!WARNING]
+>
+> The `imperial_units` keyword argument for `parse_ingredient` is deprecated and will be removed at the next major release.
+>
+> Use the new `volumetric_units_system="imperial"` keyword argument for the same functionality.
+
+* Improve execution and accuracy performance of the foundation foods matching functionality.
+
+* See the docs [here](https://ingredient-parser.readthedocs.io/en/latest/explanation/foundation.html ) for details on how this now works.
+
+* The execution performance is ~2.5x faster than in version 2.4.0.
+
+* Add `volumetric_unit_system` keyword argument for `parse_ingredient` which allows for specifying unit system that will be used to volumetric units like cup, tablespoon etc. where there can are multiple options with slight differences in the volumes.
+
+* This replaced the `imperial_units` argument which will removed in a future release.
+* Supported options are `us_customary` (default), `imperial`, `metric` (for metric tablespoon, teaspoon definitions) , `australian` (for Australian pints, tablespoons), `japanese` (for Japanese cups).
+* See the docs [here](https://ingredient-parser.readthedocs.io/en/latest/tutorials/options.html#volumetric-units-system) for specific details.
+* The customised Pint units registry (`UREG`) that contains additional units relevant to cooking (such as metric cups and tablespoons, Japanese cups etc.) is also more easily importable.
+
+```py
+from ingredient_parser import UREG
+```
+* Add `unit_system` attribute to `IngredientAmount` and `CompositeIngredientAmount` to indicate which unit system the amount uses.
+
+* This is an Enum with the following values: METRIC, US_CUSTOMARY, IMPERIAL, AUSTRALIAN, JAPANESE, OTHER, NONE.
+
+* Fix a bug where an exception was raised if quantity range ended with `x` (e.g. `3-4x`).
+
+* If an amount has `MULTIPLIER=True`, set `SINGULAR=True` for any immediately subsequent amounts.
+
 ## 2.4.0

 ### General
@@ -26,13 +58,13 @@
 >
 > This release only contains changes related to the development tools for this library. There are no changes to the functionality of the library.

-### Development tools
+### Development Tools

-* Replace the labeller and webapp tools with a new tool ("webtools") written in react. Many thanks to @[mcioffi](https://github.com/mcioffi) for this contribution. Key functionality:
+* Replace the labeler and webapp tools with a new tool ("webtools") written in react. Many thanks to @[mcioffi](https://github.com/mcioffi) for this contribution. Key functionality:

 * Parser, to display to parsed output of an input ingredient sentence.

-* Labeller, to edit the labelled training data or add new training data.
+* Labeler, to edit the labelled training data or add new training data.

 * Trainer, to initiate training of models.

@@ -42,7 +74,7 @@

 ## 2.2.0

-### Foundation foods:
+### Foundation Foods:

 * Bias foundation food matching to prefer "raw" FDC ingredients, but only if the ingredient name does not include any verbs that indicate the ingredient is not raw (e.g. "cooked").
 * Normalise spelling of tokens in ingredient names to align with spelling used in FDC ingredient descriptions.
@@ -68,13 +100,13 @@

 > [!WARNING]
 >
-> This version replaces the floret dependency with numpy.
+> This version replaces the floret dependency with NumPy.
 >
-> Numpy was already a dependency of floret, so if you are upgrading from v2.0.0 there should be little impact.
+> NumPy was already a dependency of floret, so if you are upgrading from v2.0.0 there should be little impact.

-* This release overhauls the foundation foods functionality so that ingredient names are matched to entries in the [FoodData Central](https://fdc.nal.usda.gov/) (FDC) database.
+* This release overhauls the foundation foods functionality so that ingredient names are matched to entries in the [Food Data Central](https://fdc.nal.usda.gov/) (FDC) database.

-* This update does not change the API. It adds additional fields to `FoundationFood` objects for FDC ID, category and data type. The `text` field now returns the description for the matching FDC entry.
+* This update does not change the API. It adds additional fields to `FoundationFood` objects for FDC ID, category, and data type. The `text` field now returns the description for the matching FDC entry.

 * Beware that enabling this functionality causes the `parse_ingredient` function to be much slower than when disabled (default).

@@ -179,7 +211,7 @@

 * Various minor improvements to feature generation.

-* Add PREPARED_INGREDIENT flag to IngredientAmount objects. This is used to indicate if the amount refers to the prepared ingredient (`PREPARED_INGREDIENT=True`) or the unpreprared ingredient (`PREPARED_INGREDIENT=False`).
+* Add PREPARED_INGREDIENT flag to IngredientAmount objects. This is used to indicate if the amount refers to the prepared ingredient (`PREPARED_INGREDIENT=True`) or the unprepared ingredient (`PREPARED_INGREDIENT=False`).

 * Add `starting_index` attribute to IngredientText objects, indicating the index of the token that starts the IngredientText.

@@ -245,15 +277,15 @@ Require NLTK >= 3.8.2 due to change in POS tagger weights format.

 ### Processing

-* Change processing of numbers written as words (e.g. 'one', 'two' ). If the token is labelled as QTY, then the number will converted to a digit (i.e. 'one' -> 1) or collapsed into a range (i.e. 'one or two' -> 1-2), otherwise the token is left unchanged.
+* Change processing of numbers written as words (e.g. 'one', 'two' ). If the token is labelled as QTY, then the number will be converted to a digit (i.e. 'one' -> 1) or collapsed into a range (i.e. 'one or two' -> 1-2), otherwise the token is left unchanged.

 ## 1.0.1

 > [!WARNING]
 >
 > This version requires NLTK >=3.8.2

-NLTK 3.8.2 changes the file format (from pickle to json) of the weights used by the part of speech tagger used in this project, to address some security concerns. This patch updates the NLTK resource checks performed when `ingredient-parser` is imported to check for the new json files, and downloads them if they are not present.
+NLTK 3.8.2 changes the file format (from pickle to json) of the weights used by the part of speech tagger used in this project, to address some security concerns. This patch updates the NLTK resource checks performed when `ingredient-parser` is imported to check for the new JSON files, and downloads them if they are not present.

 This version requires NLTK>=3.8.2.

@@ -285,7 +317,7 @@ This version requires NLTK>=3.8.2.
 ### Processing

 * Various bug fixes to post-processing of tokens with labels NAME, COMMENT, PREP, PURPOSE, SIZE to correct punctuation and confidence calculations.
-* Modification of tokeniser to split full stops from the end of tokens. This helps to model avoid treating "`token.`" and "`token`" as different cases to learn.
+* Modification of tokenizer to split full stops from the end of tokens. This helps to model avoid treating "`token.`" and "`token`" as different cases to learn.
 * Add fallback functionality to `parse_ingredient` for cases where none of the tokens are labelled as NAME. This will select name as the token with the highest confidence of being labelled NAME, even though a different label has a high confidence for that token. This can be disabled by setting `expect_name_in_output=False` in `parse_ingredient`.

 ## 0.1.0-beta10
@@ -298,14 +330,14 @@ Fix incorrect python version specifier in package which was preventing pip in Py

 ### General

-- Add github actions to run tests (#7, @boxydog)
+- Add GitHub actions to run tests (#7, @boxydog)

 - Add pre-commit for use with development (#10, @boxydog)

 ### Model

 - Add additional model performance metrics.
-- Add model hyper-parameter tuning functionality with `python train.py gridsearch` to iterate over specified training algorithms and hyper-parameters.
+- Add model hyperparameter tuning functionality with `python train.py gridsearch` to iterate over specified training algorithms and hyperparameters.
 - Add `--detailed` argument to output detailed information about model performance on test data. (#9, @boxydog)
 - Change model labels to treat label all punctuation as PUNC - this resolves some of the ambiguity in token labeling
 - Introduce SIZE label for tokens that modify the size of the ingredient. Note that his only applies to size modifiers of the ingredient. Size modifiers of the unit will remain part of the unit e.g. large clove.
@@ -316,7 +348,7 @@ Fix incorrect python version specifier in package which was preventing pip in Py

 - By default, units in `IngredientAmount` object will be returned as `pint.Unit` objects (where possible). This enables the easy conversion of amounts between different units. This can be disabled by setting `string_units=True` in the `parse_ingredient` function calls.

-- For units that have US customary and Imperial version with the same name (e.g, cup), setting `imperial_units=True` in the `parse_ingredient` function calls will return the imperial version. The default is US customary.
+- For units that have US customary and Imperial version with the same name (e.g., cup), setting `imperial_units=True` in the `parse_ingredient` function calls will return the imperial version. The default is US customary.
 - This only applies to units in `pint`'s unit registry (basically all common, standardised units). If the unit can't be found, then the string is returned as previously.

 - Additions to `IngredientAmount` object:
@@ -326,7 +358,7 @@ Fix incorrect python version specifier in package which was preventing pip in Py
 - RANGE is set to True for quantity ranges e.g. `1-2`
 - MULTIPLIER is set to True for quantities like `1x`
 - Conversion of quantity field to `float` where possible
-- PreProcessor improvements
+- `PreProcessor` improvements
 - Be less aggressive about replacing written numbers (e.g. one) with the digit version. For example, in sentences like `1 tsp Chinese five-spice`, `five-spice` is now kept as written instead of being replaced by two tokens: `5 spice`.
 - Improve handling of ranges that duplicate the units e.g. `1 pound to 2 pound` is now returned as `1-2 pound`

@@ -340,26 +372,26 @@ Fix incorrect python version specifier in package which was preventing pip in Py
 ### Model

 - Include more training data, expanding the Cookstr and BBC data by 5,000 additional sentences each
-- Change how the training data is stored. An SQLite database is now used to store the sentences and their tokens and labels. This fixes a long standing bug where tokens in the training data would be assigned the wrong label. csv exports are still available.
+- Change how the training data is stored. An SQLite database is now used to store the sentences and their tokens and labels. This fixes a long standing bug where tokens in the training data would be assigned the wrong label. CSV exports are still available.
 - Discard any sentences containing OTHER label prior to training model, so a parsed ingredient sentence can never contain anything labelled OTHER.

 ### Processing

 - Remove `other` field from `ParsedIngredient` return from `parse_ingredient` function.

-- Added `text` field to `IngredientAmount`. This is auto-generated on when the object is created and proves a human readable string for the amount e.g. "100 g"
+- Added `text` field to `IngredientAmount`. This is autogenerated on when the object is created and proves a human readable string for the amount e.g. "100 g"

 - Allow SINGULAR flag to be set if the amount it's being applied to is in brackets

 - Where a sentence has multiple related amounts e.g. `14 ounce (400 g)` , any flags set for one of the related amounts are applied to all the related amounts

-- Rewrite the tokeniser so it doesn't require all handled characters to be explicitly stated
+- Rewrite the tokenizer so it doesn't require all handled characters to be explicitly stated

-- Add an option to `parse_ingredient` to discard isolated stop words that appear in the name, comment and preparation fields.
+- Add an option to `parse_ingredient` to discard isolated stop words that appear in the name, comment, and preparation fields.

 - `IngredientAmount.amount` elements are now ordered to match the order in which they appear in the sentence.

-- Initial support for composite ingredient amounts e.g. `1 lb 2 oz` is now consider to be a single `CompositeIngredientAmount` instead of two separate `IngredientAmount`.
+- Initial support for composite ingredient amounts e.g. `1 lb 2 oz` is now consider to be a single `CompositeIngredientAmount` instead of two separate `IngredientAmount`.

 - Further work required to handle other cases such `1 tablespoon plus 1 teaspoon`.
 - This solution may change as it develops
@@ -376,7 +408,7 @@ Fix incorrect python version specifier in package which was preventing pip in Py
 - Removal of StrangerFoods dataset from model training due to lack of PREP labels
 - Addition of a BBC Food dataset in the model training
 - 10,000 additional ingredient sentences from the archive of 10599 recipes found at https://archive.org/details/recipes-en-201706
-- Miscellaneous bug fixes to the preprocessing steps to resolve reported issues
+- Miscellaneous bugfixes to the preprocessing steps to resolve reported issues
 - Handling of fractions with the format: 1 and 1/2
 - Handling of amounts followed by 'x' e.g. 1x can
 - Handling of ranges where the units were duplicated: 100g - 200g
@@ -386,7 +418,7 @@ Fix incorrect python version specifier in package which was preventing pip in Py
 - Support the extraction of multiple amounts from the input sentence.
 - Change output dataclass to put confidence values with each field.
 - The name, comment, other fields are output as an `IngredientText` object containing the text and confidence
-- The amounts are output as an `IngredientAmount` object containing the quantity, unit, confidence and flags for whether the amount is approximate or for a singular item of the ingredient.
+- The amounts are output as an `IngredientAmount` object containing the quantity, unit, confidence, and flags for whether the amount is approximate or for a singular item of the ingredient.
 - Rewrite post-processing functionality to make it more maintainable and extensible in the future.
 - Add a [model card](https://github.com/strangetom/ingredient-parser/blob/master/ingredient_parser/ModelCard.md), which provides information about the data used to train and evaluate the model, the purpose of the model and it's limitations.
 - Increase l1 regularisation during model training.
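
Reviewer note on the 2.5.0 entry: the changelog deprecates `imperial_units` in favour of a units-system keyword with five string options. The sketch below shows one common way such a deprecated boolean can be folded into a new option while warning callers. It is not taken from this commit: `resolve_units_system` is a hypothetical helper, and the parameter name follows the `volumetric_units_system` spelling used in the warning block (the bullet further down spells it `volumetric_unit_system`). Only the option names come from the changelog.

```python
import warnings

# Option names as listed in the 2.5.0 changelog entry.
SUPPORTED_SYSTEMS = ("us_customary", "imperial", "metric", "australian", "japanese")


def resolve_units_system(volumetric_units_system="us_customary", imperial_units=None):
    """Hypothetical helper: fold the deprecated boolean into the new option.

    This is an illustrative assumption, not the library's implementation.
    """
    if imperial_units is not None:
        # Warn whenever the deprecated argument is passed explicitly.
        warnings.warn(
            "imperial_units is deprecated; use volumetric_units_system='imperial'",
            DeprecationWarning,
            stacklevel=2,
        )
        if imperial_units:
            volumetric_units_system = "imperial"

    if volumetric_units_system not in SUPPORTED_SYSTEMS:
        raise ValueError(f"Unsupported units system: {volumetric_units_system!r}")
    return volumetric_units_system


if __name__ == "__main__":
    print(resolve_units_system())            # default option
    print(resolve_units_system("japanese"))  # explicit new-style option
```

Defaulting the deprecated argument to `None` rather than `False` lets the helper distinguish "not passed" from "passed as False", so the warning only fires for callers still using the old keyword.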

MANIFEST.in

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
-include ingredient_parser/density_context.txt
+include ingredient_parser/pint_extensions.txt
 include ingredient_parser/en/data/model.en.crfsuite
 include ingredient_parser/en/data/ModelCard.en.md
-include ingredient_parser/en/data/ingredient_embeddings.25d.glove.txt.gz
+include ingredient_parser/en/data/ingredient_embeddings.35d.glove.txt.gz
 include ingredient_parser/en/data/fdc_ingredients.csv.gz
 include ingredient_parser/en/data/ingredient_tagdict.json.gz
 global-exclude test*
