Skip to content

ujjwal-basnet/Building-LLMs-for-Production

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

📚 Building-LLMs-for-Production

This repository will include code, notes, and experiments from the book
"Building LLMs for Production" (by Towards AI).
Focus areas include training, scaling, deployment, and optimization of large language models (LLMs).


🧠 Reading Log & Insights

  • Finding: For a model with X parameters, the optimal training involves approximately X × 20 tokens.
  • Example: A model with 100 billion parameters should ideally be trained on about 2 trillion tokens.
  • This insight shifts focus from just scaling model size → to balancing model size and training data for optimal performance.

Limitations of the Original Transformer Architecture

  1. Quadratic Self-Attention

Attention(Q, K, V) = softmax(QKᵀ / √d) · V

  • Q shape = [n × d], Kᵀ shape = [d × n]
  • Matrix multiply → QKᵀ = [n × n]
  • Time complexity = O(n² × d)
  • If d is constant, then it's effectively O(n²) for attention

Raw Memory Calculation

 ``` For n = 1000 tokens:
 n² = 1000 × 1000 = 1000000  values 
 Each value = 4 bytes ,  Total = 1000000 × 4 = 4000000 bytes
4000000 bytes  ÷ 1024 bytes= 3906.25 KB 
3906.25 ÷ 1024 = 3.81 MB 
3.81 ÷ 1024 ≈ 0.0037 GB ```

same way for 100,000 tokens , attention matrix size is 10 billion values and it cost ~40 gb which is huge cost

and this is for one attention head, in one layer only

  • Each attention head computes its own [n × n] attention matrix independently.
  • A Transformer layer typically has 8–12 heads running in parallel.
  • The outputs of all heads are concatenated and linearly transformed.

Example: If one head uses ~37 GB for n = 100,000 tokens,
then 12 heads in one layer may require over 440 GB 😭😭.

Transformer Context Window Optimization

The O(n²) complexity in attention mechanisms creates a significant bottleneck, limiting the context window(the amount of text a model can process at once)

To overcome this, researchers have developed several optimization techniques that allow models to handle vastly longer contexts—up to 100,000 tokens and beyond—while managing computational and memory costs.

[click here for optimization techquies]


Evaluation Metrics of LLm

• Intrinsic metrics: which are directly related to the training objective. A well-known intrinsic metric is perplexity.

Extrinsic metrics evaluate performance across various downstream tasks and are not directly connected to the training objective. Popular examples of extrinsic metrics include benchmarking frameworks like GLUE, SuperGLUE, BIG-bench, HELM, and FLASK.

Perplexity

perplexility means confuse , puzzled or uncertain , higher the value of perplexility less the model is accurate

[click here to see Perplexilit calcuation]

About

Anything or code inside this book

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages