
Statistical approach to data segmentation for LLMs

This repository presents a novel tokenizer for training machine learning models for natural language processing

Our tokenizer is an alternative to byte-pair encoding (BPE) that aims to mitigate hallucinations in large language models and enhance model reasoning

The tokenizer uses syllable segmentation to tokenize the data and converts the tokenized dataset into a PyTorch-compatible tensor

The tool is trained with a Bayesian approach and uses the Expectation-Maximization (EM) algorithm

Our algorithm

  1. Collect the data from Wiktionary
  2. Apply rule-based syllable segmentation
  3. Train EM on the annotated data (see the sketch after this list)
  4. Build the dictionary (see model.json) for machine learning
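
The actual training code lives in train.py. Purely as an illustration of the EM idea behind step 3, the toy sketch below re-estimates segment probabilities from candidate splits; the helper names and data are ours, not the repository's API:

import math
from collections import defaultdict

def candidate_segmentations(word, max_len=4):
    # Enumerate every split of `word` into segments no longer than max_len
    if not word:
        return [[]]
    splits = []
    for i in range(1, min(max_len, len(word)) + 1):
        for rest in candidate_segmentations(word[i:], max_len):
            splits.append([word[:i]] + rest)
    return splits

def train_em(words, iterations=10):
    probs = defaultdict(lambda: 1.0)  # uniform start: every candidate segment weighted equally
    for _ in range(iterations):
        counts = defaultdict(float)
        for word in words:
            candidates = candidate_segmentations(word)
            weights = [math.prod(probs[s] for s in cand) for cand in candidates]
            total = sum(weights)
            # E-step: split one observation of the word across its candidate segmentations
            for cand, w in zip(candidates, weights):
                for seg in cand:
                    counts[seg] += w / total
        # M-step: re-estimate segment probabilities from the expected counts
        norm = sum(counts.values())
        probs = {seg: c / norm for seg, c in counts.items()}
    return probs

print(train_em(["мама", "рама", "панорама"]))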

Model usage and compatibility

Use vectorize.sh to vectorize your data with our algorithm and avoid any incompatibilities

The command-line script returns a JSON object with the original text, segments, and vectors

Vector dimensionality equals the vocabulary size (the number of unique segments from our EM model). The vectors are built in a one-hot-encoding manner and can be easily converted to any tensor format (PyTorch, TF, etc.), as well as from sparse to dense representation

Example usage

./vectorize.sh ~/stat-llm/train/model.json "ультравысокочастотными" sample_outputs/sample_output.json json

Sample output

{
  "text": "ультравысокочастотными",
  "segments": [
    "уль",
    "трав",
    "ы",
    "соко",
    "част",
    "о",
    "тным",
    "и"
  ],
  "vector": [
    1,
    0,
    ...
  ],
  "vocab_size": [
    6413
  ]
}
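
Purely as an illustration, the one-hot vector from such an output can be converted into dense or sparse PyTorch tensors along these lines (file path and field names are taken from the example above; PyTorch is assumed to be installed):

import json

import torch

# Load the JSON produced by vectorize.sh
with open("sample_outputs/sample_output.json") as f:
    record = json.load(f)

dense = torch.tensor(record["vector"], dtype=torch.float32)  # dense one-hot tensor, shape (vocab_size,)
sparse = dense.to_sparse()                                   # sparse COO representation
print(dense.shape, sparse)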

In Python, you can use our tool as a module (see vectorizer.py). The module is fully compatible with PyTorch workflows, including HuggingFace integrations

You can use this module as a standalone tool, e.g.:

python3 vectorizer.py --model ~/stat-llm/train/model.json --text "ультравысокочастотными"

This tool can also be used for GPU-based computing. See torch_demo.py for an example of PyTorch integration:

import torch

from vectorizer import TextVectorizer

vectorizer = TextVectorizer("train/model.json")
batch = ["синхрофазотрон", "гипотенуза", "алфавит"]
tensor_batch = vectorizer.batch_vectorize(batch, 'tensor')

# Sample classifier in PyTorch
# Could be your torch.nn layers, HF Transformers, etc.
class SegmentClassifier(torch.nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.fc = torch.nn.Linear(vocab_size, 1)
    
    def forward(self, x):
        return torch.sigmoid(self.fc(x))

model = SegmentClassifier(vectorizer.vocab_size)
output = model(tensor_batch)
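
To actually run on a GPU, the classifier and the vectorized batch from the example above can be moved to a CUDA device in the usual PyTorch way (assuming such a device is available):

# Optional: move the classifier and the batch to the GPU, if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
output = model(tensor_batch.to(device))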

We also provide a dummy LLM usage example; check out dummy_demo.py

Training pipeline

For model training, use the following script:

python3 train.py \
  --input ~/stat-llm/data/train_data.csv \
  --output_model model.json \
  --output_test test_results.json \
  --test_words "ультравысокочастотными" "новоеслово" \
  --max_iterations 10 \
  --min_segment_count 10

Output format:

{
  "segment_probs": {
    "уль": 0.15,
    "тра": 0.12,
    ...
  },
  "transition_probs": {
    "уль,тра": 0.95,
    "тра,вы": 0.92,
    ...
  },
  "parameters": {
    "max_iterations": 50,
    "convergence_threshold": 1e-05,
    ...
  }
}
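
As an illustration of how these probabilities can be consumed downstream, the sketch below scores a candidate segmentation with a simple unigram-plus-transition log-probability. The scoring rule is our assumption, not necessarily the one used inside the repository; the segment and transition keys follow the format shown above:

# Illustrative scoring of a segmentation with the probabilities from model.json.
# The unigram + transition scoring here is an assumption, not the repo's exact rule.
import json
import math

with open("model.json") as f:
    model = json.load(f)

def score(segments, model, floor=1e-9):
    # Sum of log segment probabilities plus log transition probabilities between neighbours
    logp = sum(math.log(model["segment_probs"].get(s, floor)) for s in segments)
    for left, right in zip(segments, segments[1:]):
        logp += math.log(model["transition_probs"].get(f"{left},{right}", floor))
    return logp

print(score(["уль", "тра", "вы"], model))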

Output test results:

[
  {
    "word": "ультравысокочастотными",
    "segmentation": [
      "уль",
      "трав",
      "ы",
      "соко",
      "част",
      "о",
      "тным",
      "и"
    ],
    "hyphenated": "уль-трав-ы-соко-част-о-тным-и"
  },
  ...
]

Model testing

For model testing, use the following script:

python3 test.py \
  --model ~/stat-llm/train/model.json \
  --test_data ~/stat-llm/data/test_data.csv \
  --output test_results.json

Example outputs:

Evaluating model...

Evaluation Report:
Metric         Score     
-------------------------
Precision      0.5740
Recall         0.5832
F1-Score       0.5786
Accuracy       0.1967

Confusion Matrix:
True Positives: 167320
False Positives: 124166
False Negatives: 119591

Correct Words: 15418/78388 (19.67%)
Saving detailed results to test_results.json...
Done!
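
For reference, the reported scores follow directly from the boundary-level confusion matrix and the word-level counts above:

# Sanity check: recompute the reported metrics from the confusion matrix above
tp, fp, fn = 167320, 124166, 119591

precision = tp / (tp + fp)                            # 0.5740
recall = tp / (tp + fn)                               # 0.5832
f1 = 2 * precision * recall / (precision + recall)    # 0.5786
accuracy = 15418 / 78388                              # 0.1967 (fully correct words)

print(f"{precision:.4f} {recall:.4f} {f1:.4f} {accuracy:.4f}")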

TODO:

  • apply NFKD normalization to the training data
  • clean the training data
