I've been exploring SONAR's multilingual capabilities and am impressed by its ability to handle diverse languages through its encoder-decoder architecture. I'm wondering if it would be possible to extend SONAR to support structured languages, such as programming languages or other context-free grammars, by treating them as new languages in the system.
Given SONAR's language-agnostic design and its use of SentencePiece tokenization, it seems theoretically possible to train SONAR to handle structured languages by defining them as new language codes (e.g., "py_Code" for Python, "java_Code" for Java, or "cfg_Form" for formal grammars).
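To make the idea concrete, here is a minimal sketch of what I have in mind. It assumes each language is identified by a special tag token prepended to the token sequence; I have not checked how SONAR actually does this. Every name here (`build_lang_token_table`, `tag_sequence`, the `__code__` token format, the starting id `256000`) is hypothetical and not taken from the SONAR codebase.

```python
# Hypothetical sketch: none of these names or ids come from the SONAR codebase.
EXISTING_LANG_CODES = ["eng_Latn", "fra_Latn"]               # natural-language codes
NEW_STRUCTURED_CODES = ["py_Code", "java_Code", "cfg_Form"]  # proposed new codes


def build_lang_token_table(codes, first_id):
    """Assign one special-token id per language code, starting at first_id."""
    return {f"__{code}__": first_id + i for i, code in enumerate(codes)}


def tag_sequence(token_ids, lang_code, table):
    """Prepend the language token id so the model can tell which 'language' follows."""
    return [table[f"__{lang_code}__"]] + list(token_ids)


# first_id is an assumed offset just past the ordinary subword vocabulary.
table = build_lang_token_table(EXISTING_LANG_CODES + NEW_STRUCTURED_CODES, first_id=256000)

# Pretend [17, 42, 8] are SentencePiece ids for a short Python snippet.
tagged = tag_sequence([17, 42, 8], "py_Code", table)
```

In this picture a structured language only adds one entry to the tag table. The tokenizer and the encoder-decoder would still need training data in that language.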
Therefore, may I ask:
- Is it possible to train a structured language as a new 'language', so that SONAR can encode and decode its expressions?
- Would any modifications be needed to handle strict syntactic rules?
If so, may I further ask how I can add a new language to SONAR, i.e., what the training recipe is?
Looking forward to hearing from you :)