Could we also get access to the LM ARPA files and the underlying text data?
Currently, the shared language model is only available in binary format, which limits visibility into its structure.
Having access to the ARPA files and text data would help us with:
- Understanding the tokenizer structure
- Analyzing spoken punctuation handling
- Extending or adapting the LM
- Performing LM interpolation with other datasets
Thanks