📚 Building-LLMs-for-Production

This repository will include code, notes, and experiments from the book
"Building LLMs for Production" (by Towards AI).
Focus areas include training, scaling, deployment, and optimization of large language models (LLMs).

🧠 Reading Log & Insights

1.✅ 📄 Research Paper: Chinchilla: Training Compute-Optimal Large Language Models

Finding: For a model with X parameters, compute-optimal training uses approximately X × 20 tokens.
Example: A model with 100 billion parameters should ideally be trained on about 2 trillion tokens.
This insight shifts the focus from scaling model size alone to balancing model size and training data for a given compute budget.
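
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the helper name `compute_optimal_tokens` and the constant `TOKENS_PER_PARAM` are made up for this note, and the factor of 20 is the approximate ratio quoted above, not an exact law.

```python
# Rule of thumb from the finding above: about 20 training tokens per parameter.
TOKENS_PER_PARAM = 20  # approximate ratio, assumed constant for this sketch


def compute_optimal_tokens(num_params: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Return the approximate compute-optimal number of training tokens."""
    return num_params * tokens_per_param


# Print the suggested token budget for a few model sizes.
for params in (1e9, 10e9, 100e9):
    tokens = compute_optimal_tokens(params)
    print(f"{params / 1e9:>5.0f}B params -> ~{tokens / 1e12:.2f}T tokens")
```

The 100B row reproduces the 2 trillion token example above.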

Limitations of the Original Transformer Architecture

Quadratic Self-Attention

Attention(Q, K, V) = softmax(QKᵀ / √d) · V

Q shape = [n × d], Kᵀ shape = [d × n]
Matrix multiply → QKᵀ = [n × n]
Time complexity = O(n² × d)
If d is constant, then it's effectively O(n²) for attention
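
To see where the [n × n] matrix comes from, here is a minimal pure-Python sketch of the formula above. It is a toy version for reading, not how real libraries implement attention: the function names and the input values are made up for illustration, and there is no batching, masking, or multi-head logic.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for Q, K, V given as n lists of d floats."""
    d = len(Q[0])
    scale = math.sqrt(d)
    # scores is the [n x n] matrix QK^T / sqrt(d):
    # n * n entries, each needing d multiplications -> O(n^2 * d).
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K] for q in Q]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted sum of the rows of V.
    out = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return out, weights


# Toy input (hypothetical values): n = 3 tokens, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = attention(Q, K, V)
print(len(weights), "x", len(weights[0]))  # shape of the attention matrix
```

The `weights` matrix always has n rows and n columns, no matter how small d is, which is why doubling the sequence length roughly quadruples the work.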

Raw Memory Calculation

```
For n = 1000 tokens:
n² = 1000 × 1000 = 1,000,000 values
Each value = 4 bytes (float32), so total = 1,000,000 × 4 = 4,000,000 bytes
4,000,000 bytes ÷ 1024 = 3906.25 KB
3906.25 KB ÷ 1024 ≈ 3.81 MB
3.81 MB ÷ 1024 ≈ 0.0037 GB
```

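In the same way, for n = 100,000 tokens the attention matrix holds 10 billion values. At 4 bytes each that is 40 billion bytes: 40 GB in decimal units, or about 37.25 GB when dividing by 1024 at each step. That is a huge cost.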

And this is for one attention head, in one layer only

Each attention head computes its own [n × n] attention matrix independently.

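A Transformer layer in smaller models typically has 8–12 heads running in parallel (the original Transformer base model uses 8, BERT-base uses 12); larger models use more.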

The outputs of all heads are concatenated and linearly transformed.

Example: If one head uses ~37 GB for n = 100,000 tokens,
then 12 heads in one layer may require over 440 GB if every score matrix is stored in full 😭😭.
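
The arithmetic in this section can be checked with a short script. This is a rough sketch under stated assumptions: it counts only the [n × n] score matrices, assumes 4 bytes per value (float32), and ignores Q, K, V, activations, and any memory-saving tricks. The helper names `attention_matrix_bytes` and `to_gib` are made up for this note.

```python
def attention_matrix_bytes(n_tokens: int, n_heads: int = 1, bytes_per_value: int = 4) -> int:
    """Bytes needed to store full [n x n] score matrices for n_heads heads in one layer."""
    return n_tokens * n_tokens * bytes_per_value * n_heads


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB by dividing by 1024 three times, as in the calculation above."""
    return num_bytes / 1024 ** 3


# One head at 1,000 tokens, one head at 100,000 tokens, 12 heads at 100,000 tokens.
for n, heads in [(1_000, 1), (100_000, 1), (100_000, 12)]:
    size = attention_matrix_bytes(n, heads)
    print(f"n={n:>7,} heads={heads:>2}: {size:,} bytes = {to_gib(size):.4f} GiB")
```

Changing `bytes_per_value` to 2 gives the figure for half-precision values, which halves the cost but does not change the n² growth.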

Transformer Context Window Optimization

The O(n²) complexity of the attention mechanism creates a significant bottleneck, limiting the context window (the amount of text a model can process at once).

To overcome this, researchers have developed several optimization techniques that let models handle much longer contexts (100,000 tokens and beyond) while keeping computational and memory costs manageable.

[click here for optimization techniques]

Evaluation Metrics of LLMs

• Intrinsic metrics are directly related to the training objective. A well-known intrinsic metric is perplexity.

• Extrinsic metrics evaluate performance across various downstream tasks and are not directly connected to the training objective. Popular examples of extrinsic metrics include benchmarking frameworks like GLUE, SuperGLUE, BIG-bench, HELM, and FLASK.

Perplexity

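Perplexity means being confused, puzzled, or uncertain. For a language model, it measures how uncertain the model is when predicting the next token: the higher the perplexity, the less accurate the model's predictions.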

[click here to see the perplexity calculation]
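
As a quick illustration, perplexity is commonly computed as the exponential of the average negative log-probability the model assigns to the correct tokens. The sketch below assumes we already have those per-token probabilities; the two probability lists are made-up numbers, not output from a real model, and `perplexity` is just a helper name for this note.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability of the correct tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical probabilities a model assigned to the correct next token at each position.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.2, 0.1, 0.25, 0.15]

print(f"confident model: {perplexity(confident):.2f}")
print(f"uncertain model: {perplexity(uncertain):.2f}")
```

A model that always gives the correct token probability 1/k has perplexity k, so perplexity can be read as the number of equally likely choices the model is hesitating between.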
