Access to LM ARPA Files and Training Text Data #15

@csetanmayjain

Could we also get access to the LM ARPA files and the underlying text data?
Currently, the shared language model is only available in binary format, which limits visibility into its structure.

Having access to the ARPA/text data would help us with:

  • Understanding the tokenizer structure
  • Analyzing spoken punctuation handling
  • Extending or adapting the LM
  • Performing LM interpolation with other datasets
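To illustrate the last point, here is a rough sketch of the kind of interpolation that a text-format LM would make possible. It is only a toy under stated assumptions: the two ARPA snippets are made-up sample data (not taken from the shared model), the helper names `parse_arpa_unigrams` and `interpolate` are hypothetical, and only unigrams are handled. A real workflow would use a dedicated LM toolkit, which also deals with higher orders and backoff weights.

```python
import math

# Made-up ARPA snippets for illustration only (not the project's actual LM).
SAMPLE_A = "\\data\\\nngram 1=2\n\n\\1-grams:\n-1.0\thello\t-0.5\n-2.0\tcomma\n\n\\end\\\n"
SAMPLE_B = "\\data\\\nngram 1=2\n\n\\1-grams:\n-2.0\thello\n-1.0\tperiod\n\n\\end\\\n"


def parse_arpa_unigrams(text):
    """Return {word: log10 probability} read from the \\1-grams: section."""
    unigrams = {}
    in_unigrams = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # Section markers such as \data\, \1-grams:, \2-grams:, \end\
            in_unigrams = (line == "\\1-grams:")
            continue
        if in_unigrams:
            # Entry layout: log10 probability, word, optional backoff weight
            fields = line.split()
            unigrams[fields[1]] = float(fields[0])
    return unigrams


def interpolate(lm_a, lm_b, weight_a=0.5):
    """Linearly mix two unigram tables in the probability domain.

    Assumes 0 < weight_a < 1. A word missing from one table contributes
    probability 0 from that table (a simplification; no smoothing).
    """
    mixed = {}
    for word in set(lm_a) | set(lm_b):
        p_a = 10 ** lm_a[word] if word in lm_a else 0.0
        p_b = 10 ** lm_b[word] if word in lm_b else 0.0
        mixed[word] = math.log10(weight_a * p_a + (1.0 - weight_a) * p_b)
    return mixed


lm_a = parse_arpa_unigrams(SAMPLE_A)
lm_b = parse_arpa_unigrams(SAMPLE_B)
mixed = interpolate(lm_a, lm_b, weight_a=0.5)
```

With only the binary model, none of the per-entry log probabilities read above are directly inspectable, which is why the ARPA file would be useful.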

Thanks
