Written as a hands-on learning exercise to understand how language models work by building one from scratch.
Simple n-gram statistical language model that illustrates some concepts used in modern LLMs, but without heavy machinery like neural networks.
Inspiration: "How LLMs Actually Generate Text" by LearnThatStack.
Requirements:

- Bun (all scripts below use it)
Scripts:
- `bun i` to install dependencies
- `bun start` to start the app
- `bun lint` to lint code with ESLint
- `bun typecheck` to check TypeScript types
- `bun test` to run all tests
- `dataset/` contains rahular/simple-wikipedia data in Parquet format: 87 MB with 770k rows of text from English Wikipedia.
- `docs/` contains documentation and images.
- `src/`:
  - `attention.ts` – self-attention mechanism
  - `defaults.ts` – default configuration values
  - `context.ts` – context windows (n-grams)
  - `dataset/` – dataset loading and text extraction
  - `embeddings.ts` – tokens -> vectors in semantic space
  - `index.ts` – training + CLI
  - `llm.ts` – combines all components into the LLM
  - `model.ts` – statistical language model + sampling with temperature and Top P
  - `tokenizer.ts` – text -> tokens
  - `vocabulary.ts` – word <-> number mapping (token IDs)
- `tests/` contains unit and integration tests.
Defaults from `defaults.ts`:

- `DEFAULT_ATTENTION_LAYERS`: number of attention layers
- `DEFAULT_CONTEXT_SIZE`: n-gram size (how many words as context)
- `DEFAULT_EMBEDDING_DIMENSION`: size of embedding vectors
- `DEFAULT_GENERATION_LENGTH`: number of tokens to generate
- `DEFAULT_TEMPERATURE`: sampling randomness
- `DEFAULT_TOP_P`: nucleus sampling threshold
- Converts text into tokens (word -> unique number ID)
- Files: `tokenizer.ts`, `vocabulary.ts`
- Wikipedia: Tokenization
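The core idea can be sketched in a few lines. This is a hypothetical illustration of word-level tokenization with a growing vocabulary, not the actual `tokenizer.ts`/`vocabulary.ts` implementation:

```typescript
// A growing word <-> ID mapping, as built up during training.
type Vocabulary = { wordToId: Map<string, number>; idToWord: string[] };

function tokenize(text: string): string[] {
  // Lowercase and split on non-word characters.
  return text.toLowerCase().split(/\W+/).filter((w) => w.length > 0);
}

function encode(words: string[], vocab: Vocabulary): number[] {
  return words.map((word) => {
    let id = vocab.wordToId.get(word);
    if (id === undefined) {
      id = vocab.idToWord.length; // assign the next free ID
      vocab.wordToId.set(word, id);
      vocab.idToWord.push(word);
    }
    return id;
  });
}
```

For example, `encode(tokenize("The cat sat. The cat ran."), vocab)` maps repeated words to the same ID, so "the" and "cat" each appear twice in the output sequence.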
- Converts token IDs into vectors (lists of numbers) where similar words are positioned close together in semantic space
- Note: Real LLMs learn embeddings through training. We initialize randomly.
- File: `embeddings.ts`
- Wikipedia: Word Embedding
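A minimal sketch of what "random initialization" means here, plus the usual way to compare vectors in semantic space (cosine similarity). This is illustrative and may differ from the actual `embeddings.ts`:

```typescript
// Random embedding: values in [-1, 1) with no semantic structure yet.
// Real LLMs would learn these values during training.
function randomEmbedding(dimension: number): number[] {
  return Array.from({ length: dimension }, () => Math.random() * 2 - 1);
}

// Cosine similarity: 1 for identical directions, 0 for orthogonal ones.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}
```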
- Helps the model understand relationships between tokens by computing attention scores across the context
- Note: Attention is included here for educational purposes; it illustrates the underlying principles of the mechanism, but without a neural network it cannot be used for prediction.
- File: `attention.ts`
- Wikipedia: Attention
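"Attention scores across the context" boils down to dot products followed by a softmax. Below is a hypothetical sketch of scaled dot-product attention weights; the real `attention.ts` may differ in detail:

```typescript
// Softmax: turn raw scores into a probability distribution.
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores); // subtract max for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Attention weights for one query token against the keys in its context:
// dot products scaled by sqrt(dimension), normalized with softmax.
function attentionWeights(query: number[], keys: number[][]): number[] {
  const scale = Math.sqrt(query.length);
  const scores = keys.map(
    (key) => key.reduce((sum, k, i) => sum + k * query[i], 0) / scale
  );
  return softmax(scores); // weights sum to 1 across the context
}
```

Keys that point in the same direction as the query receive higher weight, which is exactly the "relationship between tokens" the section above describes.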
- Counts how often each word follows a given context and converts those counts into probabilities
- File: `model.ts`
- Wikipedia: Probability Distribution
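The counting step above can be sketched as follows. This is a hypothetical illustration of the core n-gram statistic, not the actual `model.ts`:

```typescript
// Count how often each word follows each context of `contextSize` words.
function countFollowers(tokens: string[], contextSize: number) {
  const counts = new Map<string, Map<string, number>>();
  for (let i = contextSize; i < tokens.length; i++) {
    const context = tokens.slice(i - contextSize, i).join(" ");
    const next = tokens[i];
    const followers = counts.get(context) ?? new Map<string, number>();
    followers.set(next, (followers.get(next) ?? 0) + 1);
    counts.set(context, followers);
  }
  return counts;
}

// Normalize raw counts into a probability distribution over next words.
function toProbabilities(followers: Map<string, number>): Map<string, number> {
  const total = [...followers.values()].reduce((a, b) => a + b, 0);
  return new Map([...followers].map(([word, n]) => [word, n / total]));
}
```

With the text "the cat sat the cat ran" and a context size of 2, the context "the cat" is followed once by "sat" and once by "ran", so each gets probability 0.5.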
- Temperature: Controls randomness
  - T < 1: more deterministic (precision)
  - T = 1: proportional to probability
  - T > 1: more random (creativity)
- Top P (Nucleus Sampling): Only considers the most likely tokens whose combined probability reaches P, balancing diversity and quality
- File: `model.ts`
- Wikipedia: Top-p Sampling
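Both sampling controls can be sketched as simple transformations of the probability distribution. These are hypothetical illustrations; the real `model.ts` may implement them differently:

```typescript
// Temperature: raise each probability to 1/T and renormalize.
// T < 1 sharpens the distribution, T > 1 flattens it, T = 1 leaves it as-is.
function applyTemperature(probs: number[], temperature: number): number[] {
  const scaled = probs.map((p) => Math.pow(p, 1 / temperature));
  const total = scaled.reduce((a, b) => a + b, 0);
  return scaled.map((p) => p / total);
}

// Top P (nucleus): keep the most likely tokens until their cumulative
// probability reaches p, zero out the rest, and renormalize the survivors.
function topPFilter(probs: number[], p: number): number[] {
  const order = probs
    .map((prob, i) => [prob, i] as const)
    .sort((a, b) => b[0] - a[0]);
  const kept = new Set<number>();
  let cumulative = 0;
  for (const [prob, i] of order) {
    kept.add(i);
    cumulative += prob;
    if (cumulative >= p) break; // nucleus reached
  }
  const total = [...kept].reduce((sum, i) => sum + probs[i], 0);
  return probs.map((prob, i) => (kept.has(i) ? prob / total : 0));
}
```

For a distribution `[0.5, 0.3, 0.2]` with p = 0.75, the first two tokens cover the nucleus, the tail token is cut, and the survivors are renormalized to `[0.625, 0.375, 0]`.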
