Skip to content

butvinm/refal-bpe

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 

Repository files navigation

Bytes Pair Encoding in Refal-05

This repo contains a simple implementation of the Byte Pair Encoding algorithm implemented in Refal-05 language - a dialect of Refal.

BPE is a compression algorithm that iteratively finds the most frequent pair of bytes, replaces it with a unique token, and repeats this process until no further optimizable pairs remain. Despite its simplicity, BPE not only compresses text but also encodes semantic patterns of the input data into its substitution table. This feature has found applications in LLMs' tokenizers and automatic grammar inference algorithms.

Usage

build:

I don’t expect you to try building this yourself, since first you’d need to set up a Refal-05 compiler—which is an interesting but involved process on its own.

R05CCOMP="clang -o main" refal05c refal05rts Library LibraryEx main.ref

help:

$ ./main
Usage: ./main <command>

Commands:
  encode <input file> <file to save table> <file to save encoded text> - Encode text with BPE, save BPE table and encoded tokens to file.
  decode <input file> <table table> <file to save decoded text> - Decode text using specified BPE table and save decoded text to file.

encode:

this may take a while, e.g. it takes about a minute to encode main.ref itself.

./main encode main.ref output/main.table output/main.tokens

decode:

this should run fast

./main decode output/main.tokens output/main.table output/main.decoded.ref

Acknowledgments

About

Bytes Pair Encoding implemented in Refal-05

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published