Efficiently translating Latin to English using a sequence-to-sequence transformer augmented with learnable morphologically-derived grammatical embeddings. 🚀
Evaluation requires two steps.
First, DeclEngine should be cloned and compiled.
A sentence should be translated to IR using the test.py script in DeclEngine:
$ python3 test.py "Qui Deum non audiunt, certe peribunt."
how<SEP><ACC>God<SEP>not<SEP>they hear,<SEP>surely<SEP>they will die.
Then the string should be used with the test subcommand:
$ ./target/release/seq2seq test model.pt "how<SEP><ACC>God<SEP>not<SEP>they hear,<SEP>surely<SEP>they will die."
output: <BOS> those who do not listen to god, will they die.<EOS>
Well done! Comparing Google Translate and our result:
| Translation | |
|---|---|
| GTranslate | Those who do not listen to God will surely perish. |
| Our result | those who do not listen to god, will they die. |
A comparison of translations from the Latin Vulgate (Genesis 8:7):
| Translation | |
|---|---|
| Latin | qui egrediebatur, et non revertebatur, donec siccarentur aquae super terram. |
| Ground Truth | which went forth and did not return, until the waters were dried up across the earth. |
| GTranslate | who went out and did not return until the waters were dried up on the earth. |
| Our result | and he went out, and he did not return, until the waters were dried up upon the earth. |
Tokenizers can be tested using the test-tok command:
$ ./target/release/seq2seq test-tok tokenizer.json "This is a test." # <tokenizer> <test-sentence>
["Ä this", "Ä is", "Ä a", "Ä test", "."]Training (tensorboard logs are written to ./logdir/train):
$ python3 src/model.py # generate torchscripts
$ python3 src/split.py # split data
$ ./target/release/seq2seq train ir.txt en.txt 5000 ir-en.txt false 1 # last parameter is number of hours before quitting
Epoch 1 complete!
Epoch 2 complete!
...
Epoch 18 complete!Checkpoints will be saved to model_<EPOCH>.pt