twistezo/simple-language-model

Simple Language Model (n-gram)

Written as a hands-on learning exercise to understand how language models work by building one from scratch.

A simple n-gram statistical language model that illustrates some of the concepts used in modern LLMs, without the heavy machinery of neural networks.

Inspiration: "How LLMs Actually Generate Text" by LearnThatStack.

Preview

Usage

Requirements:

  • Bun (all scripts below use the bun CLI)
Scripts:

  • bun i to install dependencies
  • bun start to start the app
  • bun lint to lint the code with ESLint
  • bun typecheck to check TypeScript types
  • bun test to run all tests

Structure

Configuration

Defaults from defaults.ts:

  • DEFAULT_ATTENTION_LAYERS: Number of attention layers
  • DEFAULT_CONTEXT_SIZE: N-gram size (how many words as context)
  • DEFAULT_EMBEDDING_DIMENSION: Size of embedding vectors
  • DEFAULT_GENERATION_LENGTH: Number of tokens to generate
  • DEFAULT_TEMPERATURE: Sampling randomness
  • DEFAULT_TOP_P: Nucleus sampling threshold
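
As a concrete illustration, defaults.ts might look like this. The constant names come from the list above; the values are invented for the example and are not necessarily the repository's actual settings:

```typescript
// Illustrative sketch of defaults.ts — values are assumptions, not the repo's.
export const DEFAULT_ATTENTION_LAYERS = 2;     // number of attention layers
export const DEFAULT_CONTEXT_SIZE = 3;         // n-gram size (words of context)
export const DEFAULT_EMBEDDING_DIMENSION = 16; // length of each embedding vector
export const DEFAULT_GENERATION_LENGTH = 50;   // tokens to generate
export const DEFAULT_TEMPERATURE = 1.0;        // sampling randomness
export const DEFAULT_TOP_P = 0.9;              // nucleus sampling threshold
```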

Algorithm

Step 1. Tokenization
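
The repository's tokenizer isn't reproduced here, so the following is an assumed minimal sketch: split text on whitespace and map each unique token to an integer ID (tokenize and buildVocab are hypothetical names):

```typescript
// Minimal whitespace tokenizer sketch (assumed approach; the repo may differ).
// Lowercases the text, splits on whitespace, and drops empty fragments.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// Assigns each unique token an integer ID in order of first appearance.
function buildVocab(tokens: string[]): Map<string, number> {
  const vocab = new Map<string, number>();
  for (const token of tokens) {
    if (!vocab.has(token)) vocab.set(token, vocab.size);
  }
  return vocab;
}

const tokens = tokenize("the cat sat on the mat");
const vocab = buildVocab(tokens);
// tokens → ["the", "cat", "sat", "on", "the", "mat"]; vocab has 5 entries
```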

Step 2. Embeddings

  • Converts token IDs into vectors (lists of numbers) where similar words are positioned close together in semantic space
  • Note: Real LLMs learn embeddings through training. We initialize randomly.
  • File: embeddings.ts
  • Wikipedia: Word Embedding
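
A minimal sketch of the random initialization the note describes, plus a cosine-similarity helper that makes "close together in semantic space" concrete (both function names are illustrative, not taken from embeddings.ts):

```typescript
// Random embedding table: one small random vector per vocabulary ID.
// Real LLMs learn these through training; here they start random.
function initEmbeddings(vocabSize: number, dim: number): number[][] {
  const table: number[][] = [];
  for (let id = 0; id < vocabSize; id++) {
    table.push(Array.from({ length: dim }, () => Math.random() * 0.2 - 0.1));
  }
  return table;
}

// Cosine similarity: 1 means identical direction, 0 means unrelated.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const na = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const nb = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return dot / (na * nb);
}

const table = initEmbeddings(5, 16); // 5 tokens, 16 dimensions each
```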

Step 3. Attention Mechanism

  • Helps the model understand relationships between tokens by computing attention scores across the context
  • Note: Attention is used here for educational purposes; it illustrates the underlying principles of the mechanism, but without a neural network it cannot be used for prediction.
  • File: attention.ts
  • Wikipedia: Attention
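
The scores can be sketched as scaled dot-product attention over the context's embedding vectors (an assumed formulation; attention.ts may differ). Each token's embedding is scored against every other, and softmax turns each row of scores into weights that sum to 1:

```typescript
// Numerically stable softmax: subtract the max before exponentiating.
function softmax(xs: number[]): number[] {
  const max = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Scaled dot-product attention weights: row i holds how much token i
// attends to every token in the context.
function attentionWeights(embeddings: number[][]): number[][] {
  const scale = Math.sqrt(embeddings[0].length);
  return embeddings.map((query) =>
    softmax(
      embeddings.map(
        (key) => query.reduce((s, q, i) => s + q * key[i], 0) / scale
      )
    )
  );
}

const ctx = [[1, 0], [0, 1], [1, 1]]; // toy 2-dimensional embeddings
const weights = attentionWeights(ctx); // 3×3, each row sums to ~1
```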

Step 4. Probability Distribution
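
For an n-gram model, this step presumably counts which token follows each context window in the training text and normalizes those counts into a probability distribution. A hedged sketch (function names are invented, not from the repo):

```typescript
// Count next-token occurrences for every context window of contextSize words.
function buildNgramModel(
  tokens: string[],
  contextSize: number
): Map<string, Map<string, number>> {
  const counts = new Map<string, Map<string, number>>();
  for (let i = 0; i + contextSize < tokens.length; i++) {
    const context = tokens.slice(i, i + contextSize).join(" ");
    const next = tokens[i + contextSize];
    if (!counts.has(context)) counts.set(context, new Map());
    const row = counts.get(context)!;
    row.set(next, (row.get(next) ?? 0) + 1);
  }
  return counts;
}

// Normalize one context's counts into probabilities that sum to 1.
function nextTokenProbs(
  counts: Map<string, Map<string, number>>,
  context: string
): Map<string, number> {
  const row = counts.get(context) ?? new Map<string, number>();
  const total = [...row.values()].reduce((a, b) => a + b, 0);
  const probs = new Map<string, number>();
  for (const [token, count] of row) probs.set(token, count / total);
  return probs;
}

const model = buildNgramModel(["the", "cat", "sat", "the", "cat", "ran"], 2);
const probs = nextTokenProbs(model, "the cat");
// probs: { sat: 0.5, ran: 0.5 }
```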

Step 5. Sampling

  • Temperature: Controls randomness
    • T < 1: more deterministic (precision)
    • T = 1: proportional to probability
    • T > 1: more random (creativity)
  • Top P (Nucleus Sampling): Only considers the most likely tokens whose combined probability reaches P, balancing diversity and quality
  • File: model.ts
  • Wikipedia: Top-p Sampling
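
The two knobs above can be combined in a sketch like this (an assumed implementation; model.ts may differ). Temperature reshapes the distribution, then top-p keeps only the smallest set of highest-probability tokens whose cumulative mass reaches P:

```typescript
// Sample one token with temperature scaling followed by nucleus (top-p)
// filtering. `rand` is injectable so the function can be tested.
function sample(
  probs: Map<string, number>,
  temperature: number,
  topP: number,
  rand: () => number = Math.random
): string {
  // 1. Temperature: raise each probability to 1/T, then renormalize.
  //    T < 1 sharpens the distribution; T > 1 flattens it.
  let entries = [...probs.entries()].map(
    ([t, p]) => [t, Math.pow(p, 1 / temperature)] as [string, number]
  );
  const total = entries.reduce((s, [, p]) => s + p, 0);
  entries = entries.map(([t, p]) => [t, p / total] as [string, number]);

  // 2. Top-p: keep highest-probability tokens until cumulative mass >= topP.
  entries.sort((a, b) => b[1] - a[1]);
  const kept: [string, number][] = [];
  let cum = 0;
  for (const [t, p] of entries) {
    kept.push([t, p]);
    cum += p;
    if (cum >= topP) break;
  }

  // 3. Sample proportionally from the surviving nucleus.
  const keptTotal = kept.reduce((s, [, p]) => s + p, 0);
  let r = rand() * keptTotal;
  for (const [t, p] of kept) {
    r -= p;
    if (r <= 0) return t;
  }
  return kept[kept.length - 1][0];
}

const dist = new Map([["sat", 0.7], ["ran", 0.2], ["slept", 0.1]]);
// with topP = 0.5 only "sat" survives the nucleus, so sampling is deterministic
```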
