This repo contains a simple implementation of the Byte Pair Encoding algorithm implemented in Refal-05 language - a dialect of Refal.
BPE is a compression algorithm that iteratively finds the most frequent pair of bytes, replaces it with a unique token, and repeats this process until no further optimizable pairs remain. Despite its simplicity, BPE not only compresses text but also encodes semantic patterns of the input data into its substitution table. This feature has found applications in LLMs' tokenizers and automatic grammar inference algorithms.
build:
I don’t expect you to try building this yourself, since first you’d need to set up a Refal-05 compiler—which is an interesting but involved process on its own.
R05CCOMP="clang -o main" refal05c refal05rts Library LibraryEx main.refhelp:
$ ./main
Usage: ./main <command>
Commands:
encode <input file> <file to save table> <file to save encoded text> - Encode text with BPE, save BPE table and encoded tokens to file.
decode <input file> <table table> <file to save decoded text> - Decode text using specified BPE table and save decoded text to file.encode:
this may take a while, e.g. it takes about a minute to encode
main.refitself.
./main encode main.ref output/main.table output/main.tokensdecode:
this should run fast
./main decode output/main.tokens output/main.table output/main.decoded.ref