Releases: analyticsinmotion/werpy

Version 3.3.0

19 Dec 10:21
v3.3.0
4badd1c

Enhancements

  • Implemented ultra-fast WER-only path with space-optimized 2-row dynamic programming algorithm and batch buffer reuse. Added four new functions (calculations_wer_only(), _calculations_wer_only_reuse_ptr(), _metrics_batch_wer_only(), metrics_wer_only()) that eliminate backtrace overhead and use O(n) memory instead of O(m×n). This optimization uses pointer swapping instead of value copying and reuses DP buffers across entire batches, providing significant performance gains for wer() and wers() functions that only need the WER metric without error counts or word lists.

  • Fixed a portability issue in WER-only batch processing by replacing platform-dependent int* pointers with guaranteed 32-bit cnp.int32_t* pointers. This ensures correct behavior on all platforms, including those where sizeof(int) differs from 4 bytes, and removes unnecessary type casts, giving cleaner code that follows NumPy/Cython best practices.

  • Expanded benchmarking support by adding optional third-party WER libraries (pywer, evaluate, universal-edit-distance, torchmetrics) to pyproject.toml under the benchmarks extra. Updated benchmark_synthetic_data_local.py to safely import optional dependencies, ensure all benchmark functions are always defined, and enforce consistent numeric return types. This fixes static analysis warnings, prevents runtime errors when optional packages are missing, and enables more comprehensive and reliable cross-package performance comparisons.

  • Standardized all Levenshtein dynamic programming buffers and memoryviews to use cnp.int32_t instead of platform-dependent int. This ensures strict dtype alignment with NumPy int32 arrays, removes undefined behavior on platforms where sizeof(int) != 4, and improves type safety without impacting performance.
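
The two-row idea behind the WER-only path can be sketched in plain Python. This is only an illustration of the technique, not the package's Cython code: the function name wer_only and the choice to return 0.0 for an empty reference are assumptions made for this sketch.

```python
import numpy as np


def wer_only(reference: str, hypothesis: str) -> float:
    """Illustrative WER using word-level Levenshtein distance with two DP rows.

    Hypothetical helper for this sketch; it is not the package's API.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    m, n = len(ref), len(hyp)
    if m == 0:
        # Assumption for this sketch: an empty reference yields 0.0.
        return 0.0

    # Two int32 rows of length n + 1 instead of an (m + 1) x (n + 1) matrix.
    prev = np.arange(n + 1, dtype=np.int32)  # row for an empty reference prefix
    curr = np.empty(n + 1, dtype=np.int32)   # overwritten on every iteration

    for i in range(1, m + 1):
        curr[0] = i  # i deletions turn the reference prefix into nothing
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            best = prev[j - 1] + cost       # match or substitution
            if prev[j] + 1 < best:          # deletion
                best = prev[j] + 1
            if curr[j - 1] + 1 < best:      # insertion
                best = curr[j - 1] + 1
            curr[j] = best
        # Swap the row references rather than copying values.
        prev, curr = curr, prev

    # After the final swap, prev holds the last completed row.
    return float(prev[n]) / m
```

Because no backtrace is kept, this form can report only the distance and the rate, which is why it suits callers that do not need error counts or word lists.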

Version 3.2.0

15 Dec 12:09
v3.2.0
dbd5c50

Enhancements

  • Refactored metrics computation architecture to eliminate np.vectorize() overhead by replacing it with a C-level batch processing loop (_metrics_batch()). This provides cleaner code structure and establishes a foundation for future performance optimizations without introducing any performance regression.

  • Fixed double computation bug where error_handler() was calling metrics() for validation, then wrapper functions were computing metrics again. The error_handler() function is now validation-only, and all metric calculations happen through a single unified metrics() entry point, improving efficiency and code maintainability.

  • Standardized internal metrics return format to row-based (n, 9) array structure instead of columnar format. This simplifies DataFrame construction in summary() and summaryp() functions by eliminating complex transpose operations and reducing code complexity.

  • Improved code organization with unified metrics() router function that dispatches to either single-pair calculations() or batch _metrics_batch() processing, providing a cleaner and more maintainable architecture for metric computation.

  • Updated Pylint configuration to suppress import errors for Cython modules during static analysis and exclude benchmark/development directories from linting. This resolves CI/CD build failures while maintaining code quality standards for the core package.

  • Optimized Levenshtein distance algorithm in calculations() function with C-level performance improvements: replaced np.zeros() with np.empty() to eliminate redundant initialization, moved boundary condition initialization outside the main DP loop to remove conditional branches from the hot path, and replaced Python's min() function with manual C-level sequential comparisons.

  • Implemented a dual-path architecture with a fast path for functions that don't require word tracking. Added three new functions (calculations_fast(), _metrics_batch_fast(), metrics_fast()) that skip word list construction and return float64 arrays instead of object arrays. Updated the wer(), wers(), werp(), and werps() functions to use the fast path, achieving a performance improvement on synthetic benchmarks. Functions requiring word tracking (summary() and summaryp()) continue using the full path.
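
The router pattern described above (one entry point that dispatches to a single-pair routine or a batch loop and returns row-based results) might look roughly like the sketch below. The names single_pair, batch_loop, and route are hypothetical stand-ins for the roles of calculations(), _metrics_batch(), and metrics(), and the per-pair body is a placeholder that only counts words.

```python
import numpy as np


def single_pair(reference: str, hypothesis: str) -> np.ndarray:
    """Placeholder per-pair routine (hypothetical): returns two word counts."""
    return np.array(
        [len(reference.split()), len(hypothesis.split())], dtype=np.float64
    )


def batch_loop(references, hypotheses) -> np.ndarray:
    """Plain loop over pairs, writing one row per pair (no np.vectorize)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    out = np.empty((len(references), 2), dtype=np.float64)
    for i, (ref, hyp) in enumerate(zip(references, hypotheses)):
        out[i] = single_pair(ref, hyp)
    return out


def route(reference, hypothesis) -> np.ndarray:
    """Single entry point: dispatch on whether the inputs are single strings."""
    if isinstance(reference, str) and isinstance(hypothesis, str):
        return single_pair(reference, hypothesis)
    return batch_loop(reference, hypothesis)
```

With row-based output, a batch of n pairs comes back as an array with n rows, so a table can be built from it directly without a transpose step.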

Bug Fixes

  • Expanded try/except scope in all wrapper functions (wer.py, wers.py, werp.py, werps.py, summary.py, summaryp.py) to properly catch exceptions from both validation (error_handler()) and computation (metrics()/metrics_fast()). This fixes 6 pre-existing test failures where invalid input types (e.g., lists of integers) would crash instead of returning None with an error message.

  • Added division-by-zero guards in calculations_fast() function (wer = (<double>ld) / m if m > 0 else 0.0) and corpus-level wrapper functions (wer.py, werp.py) to prevent crashes on empty input. Also added per-row masked division in werps.py to handle cases where individual samples have zero reference length.
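
A minimal NumPy sketch of the two kinds of guard follows. The function names are hypothetical, and using 0.0 as the result for rows with a zero-length reference is an assumption carried over from the scalar guard quoted above.

```python
import numpy as np


def corpus_wer(total_ld: int, total_ref_words: int) -> float:
    """Scalar guard: avoid dividing by a zero reference length."""
    return total_ld / total_ref_words if total_ref_words > 0 else 0.0


def per_row_wer(ld, ref_lengths) -> np.ndarray:
    """Per-row masked division; rows with zero reference length stay at 0.0.

    Hypothetical helper; the 0.0 fill value is an assumption for this sketch.
    """
    ld = np.asarray(ld, dtype=np.float64)
    m = np.asarray(ref_lengths, dtype=np.float64)
    out = np.zeros_like(ld)
    # Only positions where m > 0 are divided; the others keep the 0.0 fill.
    np.divide(ld, m, out=out, where=m > 0)
    return out
```

Passing both out and where to np.divide keeps the masked positions at their initial value instead of leaving them uninitialized.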

Version 3.1.1

14 Dec 13:15
v3.1.1
c6455be

Bug Fixes

  • Fixed IndexError in wer, wers, werp, werps, summary, and summaryp functions when processing single string inputs. The issue occurred due to incorrect handling of np.vectorize return values, which produce different array structures for single versus batch inputs (0-dimensional arrays for single strings, 1-dimensional object arrays for lists). The fix unwraps 0-dimensional arrays and uses element-type checking to distinguish between single example vectors and batch processing scenarios.

  • Improved handling of ragged data structures in batch processing by implementing direct field summation instead of transpose operations. This prevents errors when processing mixed scalar and list data in the word error rate breakdown results.
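
The shape difference behind the IndexError can be reproduced with a small stand-alone example. The per-pair function and the unwrap helper below are hypothetical and exist only to show what np.vectorize returns for single versus batch inputs.

```python
import numpy as np


def length_ratio(reference: str, hypothesis: str) -> float:
    """Hypothetical per-pair function, used only to show return shapes."""
    return len(hypothesis.split()) / max(len(reference.split()), 1)


vectorized = np.vectorize(length_ratio, otypes=[float])

single = vectorized("a b c", "a b")                   # 0-dimensional array
batch = vectorized(["a b c", "d e"], ["a b", "d e"])  # 1-dimensional array


def unwrap(result):
    """Return a plain Python scalar for 0-d arrays; pass other results through."""
    if isinstance(result, np.ndarray) and result.ndim == 0:
        return result.item()
    return result
```

Indexing single[0] raises an IndexError because a 0-dimensional array has no axis to index, which is why unwrapping has to happen before any per-element access.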

Enhancements

  • Added comprehensive benchmarking infrastructure to compare performance against werx and jiwer packages using the LibriSpeech evaluation dataset.

  • Added optional dependencies groups in pyproject.toml for testing and benchmarking workflows, enabling easier development environment setup.

Version 3.1.0

22 Apr 01:48
v3.1.0
5654bad

Enhancements

  • Updated the 'summary' function in 'summary.py' to include a precise return type hint: 'pd.DataFrame | None'.

    • Improved readability and type safety for the function.
  • Updated the return type hint for the 'wer' function in 'wer.py' from 'float' to 'float | np.float64 | None'.

    • Enhanced type accuracy and alignment with the function's behavior.
  • Continued work on type hinting improvements and resolving 'mypy' errors for better code quality and maintainability.

  • Optimized Levenshtein distance matrix initialization in calculations() by replacing a Python list-of-lists with a Cython-typed NumPy array (cdef int[:, :]). This reduces memory overhead and significantly speeds up execution on typical workloads, especially for large datasets or repeated function calls. It improves scalability, responsiveness, and memory efficiency.

  • Refactored internal variable typing in calculations() for clarity and consistency:

    • Loop indices and size variables now use Py_ssize_t, matching Python's internal conventions.
    • Grouped and explicitly typed intermediate variables like inserted_words, deleted_words, and substituted_words for improved readability and static checks. This enhances code quality, reduces reliance on dynamic typing in performance-critical paths, and prepares the function for future optimizations.

Version 3.0.2

03 Apr 04:49
v3.0.2
f525094

Enhancements

  • Updated tests to better handle and validate invalid input types.

  • Improved code quality by addressing some minor CodeFactor alerts:

    • Fixed naming issues and built-in name overrides.
    • Updated constants to use proper UPPER_CASE naming style.
  • Included 'ZeroDivisionError' in exception handling across source files for better error coverage.

  • Adopted REUSE and SPDX specifications for license compliance:

    • Removed unnecessary license files for third-party dependencies ('numpy', 'pandas', 'cython') from the 'LICENSES' directory.
    • Added 'BSD-3-Clause.txt' license file to the 'LICENSES' directory.
    • Added 'SPDX-FileCopyrightText' and 'SPDX-License-Identifier' headers directly in all '.py' and '.pyx' source files.
    • Created a 'REUSE.toml' file to centralize license declarations. The benefits of doing this for other files include:
      • Centralized license declarations — avoids repetitive SPDX headers in every file (there are lots of other files).
      • Cleaner source files — especially for binary or non-comment-friendly formats (e.g., '.png', '.yml', '.toml').
      • Scalable and maintainable — simplifies licensing for new files and directories.
      • Improved automation — supports tools like 'reuse lint' and CI-based license checks.
      • Clarity and compliance — enhances transparency and ensures open source compliance through machine-readable metadata.
  • Initial support for static type checking: included py.typed marker file to enable type checkers to recognize the package as typed. Note: full type coverage is not yet guaranteed and will be improved incrementally.

Version 3.0.1

26 Mar 23:54

Enhancements

  • Added custom exception handling for ZeroDivisionError to catch cases where the reference is blank or both reference and hypothesis are empty. When either case occurs, a clear, descriptive error message is raised:
    "Invalid input: reference must not be blank, and reference and hypothesis cannot both be empty."

  • Unit test added to verify that blank reference and hypothesis inputs raise a ZeroDivisionError.

  • The publishing process is now automated using GitHub Actions and PyPI Trusted Publishing.
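
The error handling for blank references could be sketched as below. The helper checked_ratio is hypothetical and only shows one way to attach the quoted message to a ZeroDivisionError; how the package itself surfaces the error is not shown here.

```python
ERROR_MESSAGE = (
    "Invalid input: reference must not be blank, and reference and "
    "hypothesis cannot both be empty."
)


def checked_ratio(errors: int, reference: str) -> float:
    """Divide an error count by the reference word count (hypothetical helper)."""
    ref_words = len(reference.split())
    try:
        return errors / ref_words
    except ZeroDivisionError as exc:
        # Re-raise with the descriptive message instead of the default one.
        raise ZeroDivisionError(ERROR_MESSAGE) from exc
```

A unit test for this behaviour would call the helper with a blank reference and assert that a ZeroDivisionError carrying this message is raised.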

Version 3.0.0

20 Mar 04:16

Breaking Changes

  • Dropped support for Python 3.8.x and 3.9.x

    • These versions have reached or are nearing their end of life
    • Users are strongly encouraged to upgrade to Python 3.10 or later
    • This change allows us to use newer Python features and improve overall package maintenance and security
    • Minimum supported Python version is now 3.10
  • Dropped support for Pandas 1.x (pandas>=2.0.0 is now required)

  • NumPy versions below 1.26.0 are no longer supported for Python 3.10+

  • NumPy 2.x is required for Python 3.12+, which may introduce API changes

Enhancements

  • Official support for Python 3.13:

    • This release ensures full compatibility with Python 3.13, with thorough testing to verify seamless functionality.
    • All dependencies have been reviewed and confirmed to be compatible with Python 3.13.
  • Updated minimum required versions for NumPy dependencies:

    • For Python < 3.12: Increased from 1.21.6 to 1.26.0
    • For Python >= 3.12: Increased to 2.2.0 (previously 1.23.2 for Python >= 3.11)
  • Updated minimum required versions for Pandas dependencies:

    • Increased from 1.3.0 to 2.0.0, ensuring compatibility with the latest NumPy versions.
  • Bump sphinx from 7.3.7 to 8.1.3

  • Bump sphinx-nefertiti from 0.3.4 to 0.7.4

    • The latest version ('0.7.4') is now fully compatible with Sphinx 8.1.
  • Updated C standard from C11 ('c_std=c11') to C17 ('c_std=c17') in the Meson build system.

    • Improves compatibility with modern compilers.
    • Aligns with the latest stable C standard while maintaining backward compatibility.
    • No breaking changes expected, as C17 is a bug-fix refinement of C11.
  • Updated Cython NumPy Import Convention

    • Changed cimport numpy as np to cimport numpy as cnp to align with Cython best practices, improving clarity between Python and Cython NumPy APIs.
  • Explicit NumPy C API Initialization

    • Added cnp.import_array() to ensure proper initialization of NumPy’s C API, resolving the numpy.core.multiarray failed to import error in NumPy 2.x+.
  • Updated Cython Function Type Annotation

    • Changed cpdef np.ndarray calculations(...) to cpdef cnp.ndarray calculations(...) to properly reference the Cython-level NumPy API, ensuring type safety and compatibility with compiled C extensions.

Version 2.1.2

05 Apr 03:54

Enhancements

  • Full compatibility with Python v3.12:

    • The package has been thoroughly tested and verified to work seamlessly with Python v3.12.
    • All dependencies have been updated and are also compliant with Python v3.12.
  • Added comprehensive package documentation using Sphinx and hosted on Read the Docs. The documentation covers installation instructions, usage guides, and worked examples of all feature functionality. Access the documentation at https://werpy.readthedocs.io/.

  • Updated the circleci config.yml file to support the compilation of Cython .pyx files.

  • Integrated codecov into the circleci config.yml file to generate comprehensive coverage reports.

  • Enhanced testing procedures to increase code coverage percentage.

  • Ensured compliance with Black code formatting by reformatting the relevant files.

Changed

  • Bump jinja2 from 3.1.2 to 3.1.3 in /docs. (#1)

  • Bump sphinx-nefertiti from 0.2.1 to 0.2.3 (#2)

  • Bump sphinx-nefertiti from 0.2.3 to 0.3.1 (#3)

Version 2.1.1

27 Nov 10:37

Enhancements

  • Updated the meson.build file to align with the recommended approach for integrating Cython into the build process:
    • Added Cython to the list of languages utilized by the project.
    • Passed the Cython source code directly to the py.extension_module() definition for improved integration.
    • Specified the C standard configuration as C11, instructing Meson to use C11 as the designated C standard.

Version 2.1.0

23 Nov 04:17

New Feature

  • Enhanced cross-platform support by integrating cibuildwheel, enabling compatibility with macOS and popular Linux distributions. With existing Windows compatibility, the package now spans all major configurations. Feel free to reach out if you have a specific OS configuration you'd like to discuss for potential inclusion.