Written as a hands-on learning exercise to understand how language models work by building one from scratch.
Simple n-gram statistical language model that illustrates some concepts used in modern LLMs, but without heavy machinery like neural networks.
Inspiration: "How LLMs Actually Generate Text" by LearnThatStack.
Requirements:

- Bun (all scripts below use it)
Scripts:
- `bun i` to install dependencies
- `bun start` to start the app
- `bun lint` to lint code with ESLint
- `bun typecheck` to check TypeScript types
- `bun test` to run all tests
- `dataset/` contains rahular/simple-wikipedia data in Parquet format: 87 MB with 770k rows of text from English Wikipedia.
- `docs/` contains documentation and images.
- `src/`:
  - `attention.ts` – self-attention mechanism
  - `defaults.ts` – default configuration values
  - `context.ts` – context windows (n-grams)
  - `dataset/` – dataset loading and text extraction
  - `embeddings.ts` – tokens -> vectors in semantic space
  - `index.ts` – training + CLI
  - `llm.ts` – combines all components into the LLM
  - `model.ts` – statistical language model + sampling with temperature and Top P
  - `tokenizer.ts` – text -> tokens
  - `vocabulary.ts` – word <-> number mapping (token IDs)
- `tests/` contains unit and integration tests.
Defaults from `defaults.ts`:

- `DEFAULT_ATTENTION_LAYERS`: number of attention layers
- `DEFAULT_CONTEXT_SIZE`: n-gram size (how many words as context)
- `DEFAULT_EMBEDDING_DIMENSION`: size of embedding vectors
- `DEFAULT_GENERATION_LENGTH`: number of tokens to generate
- `DEFAULT_TEMPERATURE`: sampling randomness
- `DEFAULT_TOP_P`: nucleus sampling threshold
- Converts text into tokens (word -> unique number ID)
- Files: `tokenizer.ts`, `vocabulary.ts`
- Wikipedia: Tokenization
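The core idea can be sketched in a few lines. This is a hypothetical illustration of word-level tokenization with a growing vocabulary, not the actual `tokenizer.ts`/`vocabulary.ts` implementation:

```typescript
// A growing word <-> ID mapping, as built up during training.
type Vocabulary = { wordToId: Map<string, number>; idToWord: string[] };

function tokenize(text: string): string[] {
  // Lowercase and split on non-word characters.
  return text.toLowerCase().split(/\W+/).filter((w) => w.length > 0);
}

function encode(words: string[], vocab: Vocabulary): number[] {
  return words.map((word) => {
    let id = vocab.wordToId.get(word);
    if (id === undefined) {
      id = vocab.idToWord.length; // assign the next free ID
      vocab.wordToId.set(word, id);
      vocab.idToWord.push(word);
    }
    return id;
  });
}
```

For example, `encode(tokenize("The cat sat. The cat ran."), vocab)` maps repeated words to the same ID, so "the" and "cat" each appear twice in the output sequence.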
- Converts token IDs into vectors (lists of numbers) where similar words are positioned close together in semantic space
- Note: Real LLMs learn embeddings through training. We initialize randomly.
- File: `embeddings.ts`
- Wikipedia: Word Embedding
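A minimal sketch of what "random initialization" means here, plus the usual way to compare vectors in semantic space (cosine similarity). This is illustrative and may differ from the actual `embeddings.ts`:

```typescript
// Random embedding: values in [-1, 1) with no semantic structure yet.
// Real LLMs would learn these values during training.
function randomEmbedding(dimension: number): number[] {
  return Array.from({ length: dimension }, () => Math.random() * 2 - 1);
}

// Cosine similarity: 1 for identical directions, 0 for orthogonal ones.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}
```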
- Helps the model understand relationships between tokens by computing attention scores across the context
- Note: Attention is included here for educational purposes; it illustrates the underlying principles of the mechanism, but without a neural network it cannot be used for prediction.
- File: `attention.ts`
- Wikipedia: Attention
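"Attention scores across the context" boils down to dot products followed by a softmax. Below is a hypothetical sketch of scaled dot-product attention weights; the real `attention.ts` may differ in detail:

```typescript
// Softmax: turn raw scores into a probability distribution.
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores); // subtract max for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Attention weights for one query token against the keys in its context:
// dot products scaled by sqrt(dimension), normalized with softmax.
function attentionWeights(query: number[], keys: number[][]): number[] {
  const scale = Math.sqrt(query.length);
  const scores = keys.map(
    (key) => key.reduce((sum, k, i) => sum + k * query[i], 0) / scale
  );
  return softmax(scores); // weights sum to 1 across the context
}
```

Keys that point in the same direction as the query receive higher weight, which is exactly the "relationship between tokens" the section above describes.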
- Counts how often each word follows a given context and converts those counts into probabilities
- File: `model.ts`
- Wikipedia: Probability Distribution
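The counting step above can be sketched as follows. This is a hypothetical illustration of the core n-gram statistic, not the actual `model.ts`:

```typescript
// Count how often each word follows each context of `contextSize` words.
function countFollowers(tokens: string[], contextSize: number) {
  const counts = new Map<string, Map<string, number>>();
  for (let i = contextSize; i < tokens.length; i++) {
    const context = tokens.slice(i - contextSize, i).join(" ");
    const next = tokens[i];
    const followers = counts.get(context) ?? new Map<string, number>();
    followers.set(next, (followers.get(next) ?? 0) + 1);
    counts.set(context, followers);
  }
  return counts;
}

// Normalize raw counts into a probability distribution over next words.
function toProbabilities(followers: Map<string, number>): Map<string, number> {
  const total = [...followers.values()].reduce((a, b) => a + b, 0);
  return new Map([...followers].map(([word, n]) => [word, n / total]));
}
```

With the text "the cat sat the cat ran" and a context size of 2, the context "the cat" is followed once by "sat" and once by "ran", so each gets probability 0.5.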
- Temperature: Controls randomness
  - T < 1: more deterministic (precision)
  - T = 1: proportional to probability
  - T > 1: more random (creativity)
- Top P (Nucleus Sampling): Only considers the most likely tokens whose combined probability reaches P, balancing diversity and quality
- File: `model.ts`
- Wikipedia: Top-p Sampling
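Both sampling controls can be sketched as simple transformations of the probability distribution. These are hypothetical illustrations; the real `model.ts` may implement them differently:

```typescript
// Temperature: raise each probability to 1/T and renormalize.
// T < 1 sharpens the distribution, T > 1 flattens it, T = 1 leaves it as-is.
function applyTemperature(probs: number[], temperature: number): number[] {
  const scaled = probs.map((p) => Math.pow(p, 1 / temperature));
  const total = scaled.reduce((a, b) => a + b, 0);
  return scaled.map((p) => p / total);
}

// Top P (nucleus): keep the most likely tokens until their cumulative
// probability reaches p, zero out the rest, and renormalize the survivors.
function topPFilter(probs: number[], p: number): number[] {
  const order = probs
    .map((prob, i) => [prob, i] as const)
    .sort((a, b) => b[0] - a[0]);
  const kept = new Set<number>();
  let cumulative = 0;
  for (const [prob, i] of order) {
    kept.add(i);
    cumulative += prob;
    if (cumulative >= p) break; // nucleus reached
  }
  const total = [...kept].reduce((sum, i) => sum + probs[i], 0);
  return probs.map((prob, i) => (kept.has(i) ? prob / total : 0));
}
```

For a distribution `[0.5, 0.3, 0.2]` with p = 0.75, the first two tokens cover the nucleus, the tail token is cut, and the survivors are renormalized to `[0.625, 0.375, 0]`.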
