feat: port normalization from phoonnx#2

Open
JarbasAl wants to merge 6 commits into dev from phoonnx

Conversation

@JarbasAl
Member

@JarbasAl JarbasAl commented Aug 4, 2025

closes #1

Summary by CodeRabbit

  • New Features

    • Enhanced text normalization now supports expansion of contractions, pronunciation of numbers, fractions, units, dates, and times across multiple languages.
    • Improved handling of locale-specific number and date formats for more natural spoken output.
    • Added extensive locale-specific data for titles and units in multiple languages to improve normalization accuracy.
    • Introduced automated unit testing workflow across multiple Python versions for improved code quality.
  • Chores

    • Updated dependencies to support advanced text and number parsing capabilities.
    • Included package data files in distribution for comprehensive locale support.

@coderabbitai

coderabbitai bot commented Aug 4, 2025

Warning

Rate limit exceeded

@JarbasAl has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 14 minutes and 28 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 2c7af67 and b29eeb9.

📒 Files selected for processing (1)
  • ovos_dialog_normalizer_plugin/util.py (1 hunks)

Walkthrough

The codebase refactors dialog normalization by delegating all normalization logic from the main transformer class to a new utility module. The new module implements comprehensive, language-aware normalization, including number, date, time, contraction, title, and unit expansion. Dependencies are updated to support these features.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Dialog Normalizer Refactor: `ovos_dialog_normalizer_plugin/__init__.py` | Removed all inline normalization logic from `DialogNormalizerTransformer`; now delegates normalization to `normalize` in the `util` module. |
| Normalization Utility Module: `ovos_dialog_normalizer_plugin/util.py` | Added new module implementing language-aware normalization for numbers, dates, times, contractions, titles, and units. |
| Locale Data Files: `ovos_dialog_normalizer_plugin/locale/*/contractions.json`, `.../titles.json`, `.../units.json` | Added multiple locale-specific JSON files for contractions, titles, and units for languages including en, de, es, fr, pt, ca, gl, it, nl. |
| Dependency Updates: `requirements.txt` | Added dependencies: `langcodes`, `ovos-date-parser`, `unicode_rbnf`; updated `ovos-number-parser` version requirement. |
| CI Workflow: `.github/workflows/unit_tests.yml` | Added GitHub Actions workflow to run unit tests across Python 3.10, 3.11, and 3.12 with coverage reporting. |
| Package Setup: `setup.py` | Added helper function `package_files()` to include all package data files in the distribution, together with `include_package_data=True`. |
| Test Package Initialization: `tests/__init__.py` | Added empty `__init__.py` with docstring to mark the tests directory as a package. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DialogNormalizerTransformer
    participant normalize (util.py)
    participant External Libraries

    User->>DialogNormalizerTransformer: transform(dialog, context)
    DialogNormalizerTransformer->>normalize (util.py): normalize(dialog, lang)
    normalize (util.py)->>External Libraries: parse/pronounce numbers, dates, times, units
    normalize (util.py)-->>DialogNormalizerTransformer: normalized_text
    DialogNormalizerTransformer-->>User: normalized_text, context
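
The delegation in the diagram can be sketched roughly as follows. This is a minimal illustration and not the plugin's actual code: the `normalize` stub, the `"lang"` context key, and the `"en-us"` default are assumptions, and any plugin base class is omitted.

```python
from typing import Optional, Tuple


def normalize(text: str, lang: str) -> str:
    """Hypothetical stand-in for util.normalize: it only collapses whitespace."""
    return " ".join(text.split())


class DialogNormalizerTransformer:
    """Sketch of a transformer that keeps no normalization logic of its own."""

    def transform(self, dialog: str, context: Optional[dict] = None) -> Tuple[str, dict]:
        context = context or {}
        lang = context.get("lang", "en-us")  # assumed context key and default
        try:
            dialog = normalize(dialog, lang)
        except Exception:
            # If normalization fails, the original dialog is passed through unchanged
            pass
        return dialog, context
```

Keeping the transformer this thin means the normalization rules can be tested through `normalize` alone, without instantiating the plugin.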

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Assessment against linked issues

Objective Addressed Explanation
Port phoonnx/util.py normalization logic (#1)
Improve normalization: numbers, dates, units, contractions, titles (#1)

Poem

A rabbit hopped through code so neat,
With numbers, dates, and times to greet.
Contractions stretched, units spelled out,
Titles expanded, no shadow of doubt.
Now dialogs shine, so clear and bright—
This bunny’s work feels just right!
🐇✨



@github-actions github-actions bot added feature and removed feature labels Aug 4, 2025
coderabbitai bot added a commit that referenced this pull request Aug 4, 2025
Docstrings generation was requested by @JarbasAl.

* #2 (comment)

The following files were modified:

* `ovos_dialog_normalizer_plugin/__init__.py`
* `ovos_dialog_normalizer_plugin/util.py`
@coderabbitai

coderabbitai bot commented Aug 4, 2025

Note

Generated docstrings for this pull request at #3


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (4)
requirements.txt (1)

2-5: Consider pinning versions for new dependencies

The newly added dependencies langcodes and unicode_rbnf don't have version constraints. To ensure reproducible builds and avoid unexpected breaking changes, consider adding version constraints.

 ovos-plugin-manager
-langcodes
+langcodes>=3.3.0
 ovos-number-parser>=0.4.0
 ovos-date-parser>=0.6.4a1
-unicode_rbnf
+unicode_rbnf>=1.1.0
ovos_dialog_normalizer_plugin/util.py (3)

14-186: Add contractions for other supported languages

The CONTRACTIONS dictionary only includes English entries, but the module claims to support multiple languages (pt, es, fr, de, etc.). Consider adding common contractions for these languages to ensure consistent normalization across all supported languages.

Would you like me to help generate common contractions for Portuguese, Spanish, French, and other supported languages?
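
As a rough sketch of how a per-language mapping could be applied once such entries exist: the function name and the sample entries below are hypothetical, and keys are assumed to be stored in lowercase.

```python
import re
from typing import Dict


def expand_contractions(text: str, contractions: Dict[str, str]) -> str:
    """Replace whole-word contractions using a lowercase-keyed mapping."""
    if not contractions:
        return text
    # Longest keys first so that longer contractions win over their prefixes
    keys = sorted(contractions, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: contractions[m.group(1).lower()], text)
```

For example, `expand_contractions("vou pra casa", {"pra": "para"})` yields `"vou para casa"`, while an empty mapping for an unsupported language leaves the text untouched.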


500-506: Document year disambiguation logic

The 2-digit year disambiguation logic assumes years 00-29 map to 2000-2029 and 30-99 map to 1930-1999. This assumption may not be appropriate for all use cases and could lead to incorrect date parsing.

Consider making this configurable or documenting this behavior clearly in the function docstring:

 def _normalize_dates_and_times(text: str, full_lang: str, date_format: str = "DMY") -> str:
     """
     Helper function to normalize dates and times using regular expressions.
     This prepares the strings for pronunciation.
+    
+    Note: 2-digit years are expanded as follows:
+    - 00-29 -> 2000-2029
+    - 30-99 -> 1930-1999
     """

1-719: Consider future modularization for maintainability

This utility module is comprehensive but quite large (700+ lines). As the normalization features grow, consider splitting it into smaller, focused modules:

  • contractions.py - Language-specific contraction mappings
  • numbers.py - Number and fraction normalization
  • datetime.py - Date and time normalization
  • units.py - Unit conversion and normalization

This would improve maintainability and make it easier to add language-specific features.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb255e5 and d0a46f4.

📒 Files selected for processing (3)
  • ovos_dialog_normalizer_plugin/__init__.py (2 hunks)
  • ovos_dialog_normalizer_plugin/util.py (1 hunks)
  • requirements.txt (1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
ovos_dialog_normalizer_plugin/util.py

150-150: Dictionary key literal "shan't" repeated

Remove repeated key literal "shan't"

(F601)


712-712: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (1)
ovos_dialog_normalizer_plugin/__init__.py (1)

6-27: Clean refactoring with good separation of concerns!

The extraction of normalization logic to a dedicated utility module improves maintainability and makes the transformer class focused on its plugin responsibilities. Error handling and logging are properly preserved.

Docstrings generation was requested by @JarbasAl.

* #2 (comment)

The following files were modified:

* `ovos_dialog_normalizer_plugin/__init__.py`
* `ovos_dialog_normalizer_plugin/util.py`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions github-actions bot added feature and removed feature labels Aug 4, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
ovos_dialog_normalizer_plugin/util.py (3)

12-12: Use a more specific logger name

The logger name "normalize" is too generic and could conflict with other modules. Consider using a name that includes the package namespace.
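
A small sketch of a namespaced logger using the standard `logging` module. Whether the module uses `logging` directly or an OVOS logging wrapper is not shown here, so treat this as an assumption; inside the module itself, `logging.getLogger(__name__)` would produce the same dotted name.

```python
import logging

# Package-level logger; handlers and levels configured here apply to children
PKG_LOG = logging.getLogger("ovos_dialog_normalizer_plugin")

# Dotted names form a hierarchy, so this logger is a child of PKG_LOG
LOG = logging.getLogger("ovos_dialog_normalizer_plugin.util")
```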


150-150: Remove duplicate dictionary key

The key "shan't" is already defined at line 51. This duplicate at line 150 will override the previous value.
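
To illustrate why a repeated key matters: Python keeps only the last value for a duplicated dict literal key, without raising an error. The conflicting value and the `find_duplicate_keys` helper below are hypothetical and only serve the illustration; the helper shows one way to detect the same problem in a flat JSON object.

```python
import json
from collections import Counter
from typing import List

contractions = {
    "shan't": "shall not",       # first definition
    "won't": "will not",
    "shan't": "shall not ever",  # hypothetical conflicting duplicate; this one wins
}


def find_duplicate_keys(raw_json: str) -> List[str]:
    """Return keys that occur more than once in a flat JSON object."""
    pairs = json.loads(raw_json, object_pairs_hook=list)
    counts = Counter(key for key, _ in pairs)
    return [key for key, count in counts.items() if count > 1]
```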


772-772: Remove unnecessary f-string prefix

This string doesn't contain any placeholders, so the f prefix is not needed.

🧹 Nitpick comments (3)
ovos_dialog_normalizer_plugin/util.py (3)

362-378: Consider using locale information for separator detection

While the current implementation works for common languages, consider using Python's locale module or the babel library for more comprehensive and accurate locale-specific number formatting.

Example using babel:

from babel import Locale
from babel.numbers import get_decimal_symbol, get_group_symbol

def _get_number_separators(full_lang: str) -> tuple[str, str]:
    try:
        locale = Locale.parse(full_lang.replace('-', '_'))
        # The babel.numbers helpers avoid depending on the layout of
        # Locale.number_symbols, which has changed across Babel releases
        decimal_separator = get_decimal_symbol(locale)
        thousands_separator = get_group_symbol(locale)
        return decimal_separator, thousands_separator
    except Exception:
        # Fallback to current implementation
        lang_code = full_lang.split("-")[0]
        if lang_code in ["pt", "es", "fr", "de"]:
            return ',', '.'
        return '.', ','

531-534: Document the year expansion logic

The 2-digit year expansion logic assumes years 00-29 map to 2000-2029 and 30-99 map to 1930-1999. This assumption should be documented and might need to be configurable in the future.

             # Expand 2-digit year to 4-digit year
             if year < 100:
-                # Assume years 00-29 are 2000-2029, 30-99 are 1930-1999
+                # Assume years 00-29 are 2000-2029, 30-99 are 1930-1999
+                # TODO: Consider making this cutoff configurable or date-aware
                 year = 2000 + year if year < 30 else 1900 + year
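
One way the cutoff could be made configurable is to lift it into a parameter. The function name and the `pivot` parameter are hypothetical; the default of 30 mirrors the behavior described above.

```python
def expand_two_digit_year(year: int, pivot: int = 30) -> int:
    """Expand a 2-digit year; values below `pivot` map to 20xx, the rest to 19xx."""
    if year >= 100:
        # Already a full year, nothing to expand
        return year
    return 2000 + year if year < pivot else 1900 + year
```

With the default, `expand_two_digit_year(29)` gives 2029 and `expand_two_digit_year(30)` gives 1930; a caller that expects near-future dates could pass a larger `pivot`.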

1-779: Successfully ported normalization with improvements

This implementation successfully ports the normalization functionality from the phoonnx project while adding comprehensive error handling, multi-language support, and modular design. The code is well-structured and maintains good separation of concerns.

Consider adding unit tests to ensure the normalization behavior remains consistent across updates, especially for edge cases like 2-digit year expansion and locale-specific number formatting.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0a46f4 and d1d53a9.

📒 Files selected for processing (2)
  • ovos_dialog_normalizer_plugin/__init__.py (2 hunks)
  • ovos_dialog_normalizer_plugin/util.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ovos_dialog_normalizer_plugin/__init__.py
🧰 Additional context used
🪛 Ruff (0.12.2)
ovos_dialog_normalizer_plugin/util.py

150-150: Dictionary key literal "shan't" repeated

Remove repeated key literal "shan't"

(F601)


772-772: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (5)
ovos_dialog_normalizer_plugin/util.py (5)

188-359: Well-structured multi-language dictionaries

The TITLES and UNITS dictionaries provide comprehensive coverage for multiple languages with clear organization.


381-436: Robust number normalization with proper error handling

The implementation correctly handles various number formats, locale-specific separators, and provides appropriate fallback mechanisms with error logging.
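
As an illustration of what handling locale-specific separators involves, the sketch below rewrites a locale-formatted number into a form that `float()` accepts. The function name is hypothetical and this is not the module's actual implementation.

```python
def to_canonical_number(token: str, decimal_sep: str, thousands_sep: str) -> str:
    """Strip the thousands separator, then use "." as the decimal separator."""
    # Order matters: remove grouping first so a "." group mark is not
    # mistaken for the decimal point introduced by the second replace
    return token.replace(thousands_sep, "").replace(decimal_sep, ".")
```

Under this scheme `"1.234,56"` with separators `(",", ".")` and `"1,234.56"` with separators `(".", ",")` both become `"1234.56"`.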


440-467: Clean date/time pronunciation interfaces

Both functions provide appropriate abstractions with good error handling, especially the fallback mechanism in pronounce_time.


563-572: Simple and effective hyphen-digit normalization

The regex pattern correctly handles word-hyphen-digit combinations.
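
A hedged sketch of what such a pattern might look like. The exact regex and replacement used in `util.py` are not shown in this review, so both the pattern and the choice to replace the hyphen with a space are assumptions.

```python
import re

# One or more letters (any script), a hyphen, then one or more digits
_HYPHEN_DIGIT = re.compile(r"([^\W\d_]+)-(\d+)")


def split_hyphen_digit(text: str) -> str:
    """Turn "word-123" into "word 123" so the number can be pronounced on its own."""
    return _HYPHEN_DIGIT.sub(r"\1 \2", text)
```

Because the left side must be letters, purely numeric ranges such as `2020-21` are left alone.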


693-729: Well-orchestrated normalization pipeline

The main normalize function properly coordinates all normalization steps in a logical order with appropriate error handling for RBNF engine initialization.

JarbasAl and others added 2 commits August 4, 2025 13:10
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai

coderabbitai bot commented Aug 4, 2025

Note

Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it.


Generating unit tests... This may take up to 20 minutes.

@github-actions github-actions bot added feature and removed feature labels Aug 4, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
ovos_dialog_normalizer_plugin/util.py (1)

484-484: Remove unnecessary f-string prefix

This string doesn't contain any placeholders, so the f prefix is not needed.

-    print(f"\n--- Other Examples ---")
+    print("\n--- Other Examples ---")
🧹 Nitpick comments (3)
setup.py (1)

69-70: Consider being more selective with package data inclusion

Including all files from the plugin directory might package unnecessary files (e.g., __pycache__, .pyc files). Consider filtering by file extensions or using a MANIFEST.in file.

 def package_files(directory):
     paths = []
     for (path, _, filenames) in os.walk(directory):
         for filename in filenames:
-            paths.append(os.path.join(path, filename))
+            # Only include JSON files and other necessary resources
+            if filename.endswith(('.json', '.txt', '.yml', '.yaml')):
+                paths.append(os.path.join(path, filename))
     return paths

Alternatively, use a MANIFEST.in file for better control over included files.

ovos_dialog_normalizer_plugin/util.py (2)

79-83: Consider implementing the TODO: Move separator logic to locale JSON files

The hardcoded language-specific separator logic could be moved to JSON files for better maintainability and extensibility.

Would you like me to help implement this by:

  1. Creating a JSON structure for number format configurations
  2. Updating the LocaleDataManager to load this data
  3. Refactoring _get_number_separators to use the JSON data

This would make it easier to add support for new languages without modifying code.
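
A minimal sketch of the data-driven approach, assuming a hypothetical JSON layout keyed by language code. The JSON content, the default, and this standalone `get_number_separators` are illustrative only; the JSON is inlined as a string to keep the example self-contained, whereas a real implementation would read it from the locale directory.

```python
import json
from typing import Dict, Tuple

# Hypothetical contents of a number-format data file
NUMBER_FORMATS_JSON = """
{
  "pt": {"decimal": ",", "thousands": "."},
  "en": {"decimal": ".", "thousands": ","}
}
"""

# Assumed fallback for languages without an entry
DEFAULT_FORMAT = {"decimal": ".", "thousands": ","}

FORMATS: Dict[str, Dict[str, str]] = json.loads(NUMBER_FORMATS_JSON)


def get_number_separators(full_lang: str, formats: Dict[str, Dict[str, str]]) -> Tuple[str, str]:
    """Return (decimal, thousands) separators for a tag such as "pt-PT"."""
    lang_code = full_lang.split("-")[0].lower()
    fmt = formats.get(lang_code, DEFAULT_FORMAT)
    return fmt["decimal"], fmt["thousands"]
```

Adding a language then only requires a new JSON entry, with no code change.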


303-337: Cache compiled regex patterns for better performance

The regex patterns are compiled on every function call. For better performance, especially when processing multiple texts, consider caching the compiled patterns.

# Add to LocaleDataManager or create a separate cache
class RegexCache:
    def __init__(self):
        self._cache = {}
    
    def get_units_regex(self, lang_code, separator_info):
        cache_key = (lang_code, separator_info)
        if cache_key not in self._cache:
            # Build and compile patterns
            self._cache[cache_key] = self._build_units_patterns(lang_code, separator_info)
        return self._cache[cache_key]

This would significantly improve performance when normalizing multiple texts in the same language.
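
A runnable variant of the same idea using `functools.lru_cache` from the standard library. The function name, its arguments, and the pattern itself are assumptions made for illustration; the units are passed as a tuple because cache keys must be hashable.

```python
import re
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)
def get_units_regex(units: Tuple[str, ...], decimal_sep: str) -> re.Pattern:
    """Compile (once per argument combination) a pattern matching "<number> <unit>"."""
    # Longest symbols first so that "km" is tried before "m"
    alternation = "|".join(re.escape(u) for u in sorted(units, key=len, reverse=True))
    number = rf"\d+(?:{re.escape(decimal_sep)}\d+)?"
    return re.compile(rf"({number})\s*({alternation})\b")
```

Repeated calls with the same arguments return the same compiled object, so the pattern is built only on first use for each language configuration.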

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d1d53a9 and 98a541b.

📒 Files selected for processing (20)
  • .github/workflows/unit_tests.yml (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/ca/titles.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/de/titles.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/de/units.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/en/contractions.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/en/titles.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/en/units.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/es/titles.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/es/units.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/fr/titles.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/fr/units.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/gl/titles.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/it/titles.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/nl/titles.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/pt/titles.json (1 hunks)
  • ovos_dialog_normalizer_plugin/locale/pt/units.json (1 hunks)
  • ovos_dialog_normalizer_plugin/util.py (1 hunks)
  • requirements.txt (1 hunks)
  • setup.py (2 hunks)
  • tests/__init__.py (1 hunks)
✅ Files skipped from review due to trivial changes (17)
  • ovos_dialog_normalizer_plugin/locale/ca/titles.json
  • tests/__init__.py
  • ovos_dialog_normalizer_plugin/locale/es/titles.json
  • ovos_dialog_normalizer_plugin/locale/de/titles.json
  • ovos_dialog_normalizer_plugin/locale/en/titles.json
  • ovos_dialog_normalizer_plugin/locale/nl/titles.json
  • ovos_dialog_normalizer_plugin/locale/it/titles.json
  • ovos_dialog_normalizer_plugin/locale/es/units.json
  • ovos_dialog_normalizer_plugin/locale/fr/units.json
  • ovos_dialog_normalizer_plugin/locale/de/units.json
  • ovos_dialog_normalizer_plugin/locale/fr/titles.json
  • ovos_dialog_normalizer_plugin/locale/gl/titles.json
  • ovos_dialog_normalizer_plugin/locale/pt/units.json
  • ovos_dialog_normalizer_plugin/locale/en/units.json
  • ovos_dialog_normalizer_plugin/locale/pt/titles.json
  • .github/workflows/unit_tests.yml
  • ovos_dialog_normalizer_plugin/locale/en/contractions.json
🚧 Files skipped from review as they are similar to previous changes (1)
  • requirements.txt
🧰 Additional context used
🪛 Ruff (0.12.2)
setup.py

53-53: Loop control variable directories not used within loop body

Rename unused directories to _directories

(B007)

ovos_dialog_normalizer_plugin/util.py

484-484: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (1)
ovos_dialog_normalizer_plugin/util.py (1)

236-240: Review 2-digit year expansion logic

The current logic assumes years 00-29 map to 2000-2029 and 30-99 map to 1930-1999. Since the current year is 2025, a 2-digit year such as 30 resolves to 1930 rather than 2030, so near-future dates can be misread.

Consider making the cutoff year configurable or using a sliding window approach. For example, dates within 50 years in the past and 30 years in the future from the current year.

# Example sliding window approach
from datetime import datetime

current_year = datetime.now().year
if year < 100:
    # Assume within -50 to +30 years from current year
    century = (current_year // 100) * 100
    candidate_year = century + year
    if candidate_year > current_year + 30:
        candidate_year -= 100
    elif candidate_year < current_year - 50:
        candidate_year += 100
    year = candidate_year
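
The snippet above can be wrapped into a testable function. This sketch is a variant rather than the exact suggestion: it uses a full 100-year window ending `future_span` years ahead, so every 2-digit year has exactly one candidate inside the window. The function name and parameters are hypothetical, and `future_span` is assumed to lie between 0 and 99.

```python
from datetime import datetime
from typing import Optional


def expand_year_sliding(year: int, current_year: Optional[int] = None,
                        future_span: int = 30) -> int:
    """Map a 2-digit year into (current + future_span - 100, current + future_span]."""
    if year >= 100:
        return year
    if current_year is None:
        current_year = datetime.now().year
    upper = current_year + future_span
    candidate = (current_year // 100) * 100 + year
    if candidate > upper:
        candidate -= 100   # too far in the future, use the previous century
    elif candidate <= upper - 100:
        candidate += 100   # too far in the past, use the next century
    return candidate
```

Passing `current_year` explicitly keeps the function deterministic in tests; for example, with `current_year=2025` the year 30 expands to 2030 and 56 expands to 1956.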

@coderabbitai

coderabbitai bot commented Aug 4, 2025

UTG Post-Process Complete

No new issues were detected in the generated code and all check runs have completed. The unit test generation process has completed successfully.

@coderabbitai

coderabbitai bot commented Aug 4, 2025

Creating a PR to put the unit tests in...

The changes have been created in this pull request: View PR
