Scripts demonstrating 45 Prompt Engineering, Context Engineering, and Agentic AI techniques using LangChain and the OpenAI API.
Language / Idioma: Português Brasileiro | English
Study Roadmap: Want to master AI-assisted development? Check out the ROADMAP.md - a comprehensive guide with learning tracks (beginner to advanced), technique connection maps, project organization templates, and modern tooling integration with Claude Code.
| Script | Technique | Description |
|---|---|---|
| `01_zero_shot.py` | Zero-Shot | Direct prompts without prior examples |
| `02_chain_of_thought.py` | Chain of Thought (CoT) | Step-by-step reasoning |
| `03_few_shot.py` | Few-Shot | Examples to guide the model |
| `04_tree_of_thoughts.py` | Tree of Thoughts (ToT) | Multiple reasoning paths |
| `05_skeleton_of_thought.py` | Skeleton of Thought (SoT) | Structure first, details later |
| `06_react_agent.py` | ReAct | Reasoning + Actions with tools |
| Script | Technique | Description |
|---|---|---|
| `07_self_consistency.py` | Self-Consistency | Generate N responses, vote on the most consistent |
| `08_least_to_most.py` | Least-to-Most | Progressive decomposition into sub-problems |
| `09_self_refine.py` | Self-Refine | Iterative critique and improvement |
| `10_prompt_chaining.py` | Prompt Chaining | Pipeline of connected prompts |
| Script | Technique | Description |
|---|---|---|
| `11_rag_basic.py` | RAG Basic | ChromaDB + semantic search + chunking |
| `12_rag_reranking.py` | RAG + Reranking | Reordering for better relevance |
| `13_rag_conversational.py` | Conversational RAG | RAG with chat memory |
| Script | Technique | Description |
|---|---|---|
| `14_ollama_basic.py` | Ollama Basic | Local LLMs (Llama 3, Mistral) |
| `15_ollama_rag.py` | Ollama + RAG | 100% offline RAG |
| Script | Technique | Description |
|---|---|---|
| `16_structured_output.py` | Structured Output | JSON mode + Pydantic models |
| `17_tool_calling.py` | Tool Calling | Custom function tools |
| Script | Technique | Description |
|---|---|---|
| `18_vision_multimodal.py` | Vision/Multimodal | Image analysis with GPT-4o |
| `19_memory_conversation.py` | Memory/Conversation | Persistent conversation context |
| `20_meta_prompting.py` | Meta-Prompting | LLM generating/optimizing prompts |
| Script | Technique | Description |
|---|---|---|
| `21_advanced_chunking.py` | Advanced Chunking | Semantic, recursive, token-based, sliding window strategies |
| `22_hybrid_search.py` | Hybrid Search | BM25 (keyword) + Vector (semantic) with RRF fusion |
| `23_query_transformation.py` | Query Transformation | HyDE, Multi-Query, Step-Back, Decomposition |
| `24_contextual_compression.py` | Contextual Compression | Extract only relevant parts from documents |
| `25_self_query.py` | Self-Query Retrieval | LLM auto-generates metadata filters |
| Script | Technique | Description |
|---|---|---|
| `26_parent_document.py` | Parent-Document Retrieval | Small chunks for search, large parents for context |
| `27_multi_vector.py` | Multi-Vector Retrieval | Multiple representations (summary + questions + content) |
| `28_ensemble_retrieval.py` | Ensemble Retrieval | Combine multiple retrievers with weighted RRF |
| `29_long_context.py` | Long Context Strategies | Map-Reduce, Refine, Map-Rerank for large documents |
| `30_time_weighted.py` | Time-Weighted Retrieval | Recency bias in retrieval with exponential decay |
| Script | Technique | Description |
|---|---|---|
| `31_mcp_basics.py` | MCP Basics | Model Context Protocol fundamentals (resources, tools, prompts) |
| `32_mcp_server_stdio.py` | MCP Server STDIO | Local MCP server with standard input/output transport |
| `33_mcp_server_http.py` | MCP Server HTTP/SSE | Remote MCP server with HTTP and Server-Sent Events |
| `34_multi_agent.py` | Multi-Agent | Collaborative AI agents (pipeline, debate, hierarchical patterns) |
| `35_prompt_evaluation.py` | Prompt Evaluation | Prompt quality evaluation, A/B testing, observability |
| Script | Technique | Description |
|---|---|---|
| `36_llm_security.py` | LLM Security | OWASP Top 10, prompt injection detection, guardrails, rate limiting |
| `37_caching_strategies.py` | Caching Strategies | Response cache, semantic cache, embedding cache, conversation cache |
| `38_cost_optimization.py` | Cost Optimization | Token counting, model selection, usage tracking, budget management |
| `39_ai_testing.py` | AI Testing | Testing non-deterministic outputs, validators, mocking, snapshots |
| `40_fine_tuning.py` | Fine-tuning | Dataset preparation, validation, fine-tuning workflow, best practices |
| Script | Technique | Description |
|---|---|---|
| `41_agent_skills.py` | Agent Skills | On-demand knowledge loading, SEO-like skill descriptions, lazy loading |
| `42_context_window.py` | Context Window Management | Monitor context usage, smart summarization, context health tracking |
| `43_subagent_orchestration.py` | Subagent Orchestration | Isolated context windows, parallel execution, results aggregation |
| `44_shared_memory.py` | Shared Memory | Memory DB for agents, short/medium/long-term memory, feedback loops |
| `45_spec_generation.py` | Spec-Driven Development | Technical specification generation, validation, task breakdown |
- Python 3.10+
- OpenAI API key
- (Optional) Ollama for local models
- (Optional) Cohere API key for reranking
- Clone or navigate to the project directory:

```bash
cd /path/to/project
```

- Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate # Linux/macOS
# or
venv\Scripts\activate    # Windows
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Configure credentials:

```bash
cp .env.example .env
```

Edit the `.env` file and add your keys:

```bash
OPENAI_API_KEY=sk-your-key-here
OPENAI_MODEL=gpt-4o-mini
# Optional - for Ollama (local models)
OLLAMA_MODEL=llama3.2
OLLAMA_BASE_URL=http://localhost:11434
# Optional - for Cohere reranking
COHERE_API_KEY=your-cohere-key-here
```

- (Optional) Install Ollama for local models:

```bash
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama service
ollama serve
# Pull a model
ollama pull llama3.2
ollama pull nomic-embed-text  # For embeddings
```

Execute any script from the `techniques/` folder:
English examples:

```bash
# Basic Prompting (01-06)
python techniques/en/01_zero_shot.py
python techniques/en/02_chain_of_thought.py
python techniques/en/03_few_shot.py
python techniques/en/04_tree_of_thoughts.py
python techniques/en/05_skeleton_of_thought.py
python techniques/en/06_react_agent.py
# Advanced Prompting (07-10)
python techniques/en/07_self_consistency.py
python techniques/en/08_least_to_most.py
python techniques/en/09_self_refine.py
python techniques/en/10_prompt_chaining.py
# RAG (11-13) - Requires sample_data/
python techniques/en/11_rag_basic.py
python techniques/en/12_rag_reranking.py
python techniques/en/13_rag_conversational.py
# Ollama (14-15) - Requires Ollama running
python techniques/en/14_ollama_basic.py
python techniques/en/15_ollama_rag.py
# Structured Output & Tools (16-17)
python techniques/en/16_structured_output.py
python techniques/en/17_tool_calling.py
# Advanced Features (18-20)
python techniques/en/18_vision_multimodal.py
python techniques/en/19_memory_conversation.py
python techniques/en/20_meta_prompting.py
# Context Engineering - Chunking & Retrieval (21-25)
python techniques/en/21_advanced_chunking.py
python techniques/en/22_hybrid_search.py
python techniques/en/23_query_transformation.py
python techniques/en/24_contextual_compression.py
python techniques/en/25_self_query.py
# Context Engineering - Context Management (26-30)
python techniques/en/26_parent_document.py
python techniques/en/27_multi_vector.py
python techniques/en/28_ensemble_retrieval.py
python techniques/en/29_long_context.py
python techniques/en/30_time_weighted.py
# MCP & Agentic AI (31-35)
python techniques/en/31_mcp_basics.py
python techniques/en/32_mcp_server_stdio.py
python techniques/en/33_mcp_server_http.py
python techniques/en/34_multi_agent.py
python techniques/en/35_prompt_evaluation.py
# Enterprise & Production (36-40)
python techniques/en/36_llm_security.py
python techniques/en/37_caching_strategies.py
python techniques/en/38_cost_optimization.py
python techniques/en/39_ai_testing.py
python techniques/en/40_fine_tuning.py
# Agentic AI - Advanced (41-45)
python techniques/en/41_agent_skills.py
python techniques/en/42_context_window.py
python techniques/en/43_subagent_orchestration.py
python techniques/en/44_shared_memory.py
python techniques/en/45_spec_generation.py
```

Portuguese examples:

```bash
python techniques/pt-br/01_zero_shot.py
# ... (same pattern with pt-br/)
```

Technique where the model receives a task without prior examples, using only its pre-trained knowledge.
Available functions:
- `classify_sentiment(text)` - Classifies sentiment as POSITIVE, NEGATIVE, or NEUTRAL
- `translate_text(text, target_language)` - Translates text to the specified language
- `extract_entities(text)` - Extracts people, locations, organizations, and dates
- `summarize_text(text)` - Summarizes text in a few sentences
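
A minimal usage sketch (the import path is assumed, following the pattern of the examples below; the functions live in `01_zero_shot.py`):

```python
# Hypothetical import path, mirroring the other examples in this README
from techniques.en.zero_shot import classify_sentiment, translate_text

# No examples in the prompt; the model relies on pre-trained knowledge alone
print(classify_sentiment("This product exceeded my expectations!"))  # POSITIVE
print(translate_text("Good morning", target_language="French"))
```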
Instructs the model to "think step by step" before reaching the final answer, improving performance on reasoning tasks.
Available functions:
- `solve_math_problem(problem)` - Solves math problems showing each step
- `logical_reasoning(puzzle)` - Solves logic puzzles with deductions
- `analyze_decision(situation)` - Analyzes scenarios for decision making
- `debug_code(code, error)` - Analyzes code and error to find a solution
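
A minimal usage sketch (import path assumed, following the pattern of the examples below):

```python
# Hypothetical import path, mirroring the other examples in this README
from techniques.en.chain_of_thought import solve_math_problem

# The prompt asks the model to show each reasoning step before answering
result = solve_math_problem(
    "A jacket costs $250. It gets a 20% discount, then an extra 10% coupon. "
    "What is the final price?"
)
print(result)  # Step-by-step reasoning ending in $180
```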
Provides examples to the model before the task, helping it understand the format and expected response type.
Available functions:
- `classify_support_ticket(ticket)` - Classifies tickets with category, priority, and action
- `convert_to_sql(description)` - Converts natural language to SQL
- `generate_docstring(code)` - Generates Google Style docstrings
- `extract_structured_data(text)` - Extracts data in JSON format
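
A minimal usage sketch (import path assumed, following the pattern of the examples below):

```python
# Hypothetical import path, mirroring the other examples in this README
from techniques.en.few_shot import convert_to_sql, classify_support_ticket

# The prompt embeds a few input/output example pairs before the real request
print(convert_to_sql("List all customers from Brazil ordered by signup date"))

ticket = "My invoice was charged twice this month, please refund one charge."
print(classify_support_ticket(ticket))  # category, priority, suggested action
```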
Explores multiple reasoning paths in parallel, evaluates each one, and selects the most promising.
Available functions:
- `tree_of_thoughts(problem, depth)` - Executes the complete ToT algorithm
- `generate_thoughts(problem, num)` - Generates multiple initial approaches
- `evaluate_thought(problem, thought)` - Evaluates viability of an approach
- `expand_thought(problem, thought, next_step)` - Develops an approach
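
A minimal usage sketch (import path assumed, following the pattern of the examples below):

```python
# Hypothetical import path, mirroring the other examples in this README
from techniques.en.tree_of_thoughts import tree_of_thoughts

# Generates several candidate approaches, scores each, and expands the best
result = tree_of_thoughts(
    "Plan a migration of a monolith to microservices with zero downtime",
    depth=2,  # assumed meaning: number of expansion rounds
)
print(result)
```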
First generates a "skeleton" (structure/topics) and then expands each part, allowing parallelization.
Available functions:
- `skeleton_of_thought_sync(topic, context)` - Synchronous version
- `skeleton_of_thought_async(topic, context)` - Asynchronous version (parallel)
- `generate_skeleton(topic, context)` - Generates the list of topics
- `expand_topic(main_topic, topic, context)` - Expands a specific topic
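
A minimal usage sketch (import path assumed, following the pattern of the examples below):

```python
# Hypothetical import path, mirroring the other examples in this README
import asyncio
from techniques.en.skeleton_of_thought import (
    skeleton_of_thought_sync,
    skeleton_of_thought_async,
)

topic, context = "Remote work best practices", "article for a tech blog"

# Sequential: skeleton first, then each topic expanded one by one
print(skeleton_of_thought_sync(topic, context))

# Parallel: all topics expanded concurrently for lower latency
print(asyncio.run(skeleton_of_thought_async(topic, context)))
```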
Combines reasoning (Thought) with actions (Action) and observations (Observation) in an iterative loop, using external tools.
Available tools:
- `web_search` - Internet search via DuckDuckGo
- `wikipedia` - Wikipedia queries
- `calculator` - Mathematical calculations
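
A minimal sketch of the loop (the entry-point name `run_react_agent` is hypothetical; the script wires the tools above into the agent):

```python
# Hypothetical entry point, shown only to illustrate the ReAct loop
from techniques.en.react_agent import run_react_agent

# The agent alternates Thought -> Action -> Observation until it can answer,
# e.g. using web_search for the fact and calculator for the arithmetic
answer = run_react_agent(
    "What is the population of the capital of Japan, divided by 2?"
)
print(answer)
```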
Generates multiple responses to the same problem, then uses majority voting to select the most consistent answer. Improves accuracy on reasoning tasks.
Available functions:
- `self_consistency_solve(problem, num_samples)` - Solves with multiple samples and voting
- `solve_with_voting(problem, num_samples)` - Alternative with explicit voting
- `extract_answer(response)` - Extracts the final answer from a response
Example:

```python
from techniques.en.self_consistency import self_consistency_solve
result = self_consistency_solve(
"If a train travels 120 km in 2 hours, what is its speed?",
num_samples=5
)
print(result["final_answer"]) # 60 km/hDecomposes complex problems into smaller sub-problems, solves them progressively from simplest to most complex, building on previous answers.
Available functions:
- `least_to_most_solve(problem)` - Complete decomposition and solution
- `decompose_problem(problem)` - Breaks down into sub-problems
- `solve_subproblem(subproblem, context)` - Solves with previous context
Example:

```python
from techniques.en.least_to_most import least_to_most_solve
result = least_to_most_solve(
"How do I build a machine learning model to predict house prices?"
)
print(result["final_answer"])Generates an initial response, then iteratively critiques and improves it until it meets quality standards.
Available functions:
- `self_refine(task, max_iterations)` - Complete refinement loop
- `generate_initial(task)` - Creates the first draft
- `critique(task, response)` - Evaluates and identifies issues
- `refine(task, response, feedback)` - Improves based on the critique
Example:

```python
from techniques.en.self_refine import self_refine
result = self_refine(
"Write a function to check if a string is a palindrome",
max_iterations=3
)
print(result["final_response"])Connects multiple prompts in a pipeline where the output of one becomes the input of the next, enabling complex multi-step workflows.
Available functions:
- `chain_prompts(initial_input, prompt_chain)` - Executes a prompt pipeline
- `research_chain(topic)` - Research → Analysis → Summary
- `content_chain(topic)` - Outline → Draft → Edit → Format
Example:

```python
from techniques.en.prompt_chaining import content_chain
result = content_chain("Benefits of Remote Work")
print(result["final_output"])Retrieval-Augmented Generation with ChromaDB for document storage, semantic search, and text chunking.
Available functions:
- `create_vectorstore(documents)` - Creates a ChromaDB vector store
- `rag_query(question, vectorstore)` - Queries with RAG
- `load_and_split_documents(path)` - Loads and chunks documents
Key features:
- Recursive text chunking (1000 chars, 200 overlap)
- OpenAI embeddings for semantic search
- Top-k retrieval with relevance scores
Example:

```python
from techniques.en.rag_basic import create_vectorstore, rag_query
# Load documents and create vector store
vectorstore = create_vectorstore(documents)
# Query with RAG
result = rag_query("What is machine learning?", vectorstore)
print(result["answer"])Enhances basic RAG with reranking to improve retrieval relevance. Supports multiple reranking methods.
Reranking methods:
- LLM-based reranking (uses GPT to score relevance)
- Cohere Rerank (requires API key)
- CrossEncoder (local transformer model)
Available functions:
- `rag_with_reranking(question, vectorstore, method)` - RAG with reranking
- `llm_rerank(question, documents)` - LLM-based reranking
- `cohere_rerank(question, documents)` - Cohere API reranking
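
A hedged sketch (the accepted `method` values are assumed to mirror the reranking methods listed above):

```python
# Hypothetical usage, mirroring the other examples in this README
from techniques.en.rag_reranking import rag_with_reranking

# vectorstore is created as in the basic RAG example above
result = rag_with_reranking(
    "How does gradient descent work?",
    vectorstore,
    method="llm",  # assumed values: "llm", "cohere", or a CrossEncoder variant
)
print(result["answer"])  # "answer" key assumed, mirroring the basic RAG example
```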
RAG with conversation memory for multi-turn dialogues. Maintains context across questions.
Memory types:
- Buffer Memory - Full conversation history
- Summary Memory - Compressed summary
Available functions:
- `create_conversational_rag(vectorstore)` - Creates the conversational chain
- `chat(question)` - Chat with memory
- `get_chat_history()` - Retrieve the conversation history
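
A hedged sketch (calling `chat()` on the created chain is an assumption based on the function list above):

```python
# Hypothetical usage, mirroring the other examples in this README
from techniques.en.rag_conversational import create_conversational_rag

chain = create_conversational_rag(vectorstore)  # vectorstore from the RAG example
print(chain.chat("What is machine learning?"))
print(chain.chat("What are its main types?"))  # "its" resolved from chat memory
```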
Use local LLMs via Ollama without API costs or internet dependency.
Supported models:
- `llama3.2` - Meta's Llama 3
- `mistral` - Mistral 7B
- `codellama` - Code-specialized Llama
- `phi3` - Microsoft's Phi-3
Available functions:
- `ollama_chat(message)` - Chat with a local model
- `ollama_generate(prompt)` - Text generation
- `list_local_models()` - List available models
Example:

```python
from techniques.en.ollama_basic import ollama_chat
response = ollama_chat("Explain quantum computing in simple terms")
print(response)
```

100% offline RAG using Ollama for both embeddings and generation.
Components:
- Local embeddings: `nomic-embed-text`
- Local LLM: `llama3.2` or `mistral`
- ChromaDB for vector storage
Available functions:
- `create_local_vectorstore(documents)` - Creates a store with local embeddings
- `local_rag_query(question, vectorstore)` - Query with local RAG
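
A minimal usage sketch (import path assumed, following the pattern of the examples in this README):

```python
# Hypothetical usage; requires `ollama serve` running, with the llama3.2
# and nomic-embed-text models already pulled
from techniques.en.ollama_rag import create_local_vectorstore, local_rag_query

vectorstore = create_local_vectorstore(documents)  # documents loaded beforehand
result = local_rag_query("What is machine learning?", vectorstore)
print(result["answer"])  # "answer" key assumed, mirroring the basic RAG example
```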
Force LLM outputs to follow specific schemas using Pydantic models or JSON mode.
Available functions:
- `extract_person(text)` - Extracts person info as a Pydantic model
- `extract_invoice(text)` - Extracts invoice data
- `json_mode_extract(text, schema)` - Generic JSON extraction
Example:

```python
from techniques.en.structured_output import extract_person
from pydantic import BaseModel
class Person(BaseModel):
    name: str
    age: int
    occupation: str

result = extract_person("John is a 30-year-old software engineer")
print(result.name)  # John
print(result.age)   # 30
```

Enable LLMs to call custom functions/tools to perform actions or retrieve information.
Available tools:
- `get_weather(city)` - Get weather information
- `calculate(expression)` - Perform calculations
- `search_database(query)` - Search a mock database
Example:

```python
from techniques.en.tool_calling import agent_with_tools
response = agent_with_tools(
"What's the weather in Tokyo and calculate 15% tip on $85"
)
print(response)
```

Analyze images using vision-enabled models like GPT-4o.
Available functions:
- `analyze_image(image_path, prompt)` - Analyze an image with a custom prompt
- `describe_image(image_path)` - Generate a detailed description
- `extract_text_from_image(image_path)` - OCR-like text extraction
- `analyze_chart(image_path)` - Analyze charts and graphs
- `compare_images(image1, image2)` - Compare two images
Example:

```python
from techniques.en.vision_multimodal import analyze_chart
result = analyze_chart("sample_data/images/chart.png")
print(result)  # Chart type, data, insights
```

Maintain conversation context across multiple interactions using different memory strategies.
Memory types:
- `BufferMemory` - Stores the complete conversation history
- `WindowMemory` - Stores the last N exchanges
- `SummaryMemory` - Maintains a compressed summary
- `EntityMemory` - Tracks mentioned entities
Example:

```python
from techniques.en.memory_conversation import ConversationChain
chain = ConversationChain(memory_type="buffer")
response1 = chain.chat("My name is Alice")
response2 = chain.chat("What's my name?")  # Remembers "Alice"
```

Use an LLM to generate, optimize, and improve prompts for other LLM tasks.
Available functions:
- `generate_prompt(task_description)` - Generates an optimized prompt
- `optimize_prompt(original_prompt, issues)` - Improves an existing prompt
- `evaluate_prompt(prompt, task)` - Scores and critiques a prompt
- `generate_prompt_variations(base_prompt)` - A/B testing variations
- `auto_improve_prompt(prompt, task, test_input)` - Iterative improvement
Example:

```python
from techniques.en.meta_prompting import generate_prompt
prompt = generate_prompt(
task_description="Extract key information from customer emails",
context="SaaS company support",
constraints=["JSON output", "Include urgency level"]
)
print(prompt)
```

Multiple text splitting strategies optimized for different content types and retrieval scenarios.
Chunking strategies:
- `RecursiveCharacter` - Hierarchical splitting by separators
- `TokenBased` - Split by token count (model-aware)
- `MarkdownAware` - Respects markdown structure
- `Semantic` - Groups by semantic similarity
- `SlidingWindow` - Overlapping fixed-size windows
- `SentenceBased` - Natural sentence boundaries
Available functions:
- `recursive_character_chunking(text, chunk_size, overlap)` - Standard recursive splitting
- `token_based_chunking(text, chunk_size)` - Token-aware splitting
- `markdown_aware_chunking(text)` - Structure-preserving splitting for markdown
- `semantic_chunking(text, threshold)` - Similarity-based grouping
- `sliding_window_chunking(text, window_size, step)` - Overlapping windows
- `sentence_based_chunking(text, sentences_per_chunk)` - Sentence grouping
Example:

```python
from techniques.en.advanced_chunking import semantic_chunking
chunks = semantic_chunking(long_document, threshold=0.75)
for chunk in chunks:
    print(f"Chunk: {len(chunk)} chars")
```

Combines keyword-based (BM25) and semantic (vector) search using Reciprocal Rank Fusion.
Components:
- `BM25Retriever` - Traditional keyword matching
- `VectorRetriever` - Semantic similarity search
- `HybridRetriever` - Weighted combination of both
Available functions:
- `create_hybrid_retriever(documents, bm25_weight, vector_weight)` - Creates the hybrid retriever
- `reciprocal_rank_fusion(results_list, k)` - Combines ranked results
- `hybrid_search(query, k)` - Searches with both methods
Example:

```python
from techniques.en.hybrid_search import HybridRetriever
retriever = HybridRetriever(documents, bm25_weight=0.4, vector_weight=0.6)
results = retriever.search("machine learning algorithms", k=5)
```

Transform queries to improve retrieval effectiveness using various techniques.
Transformation methods:
- `HyDE` - Hypothetical Document Embeddings (generate a hypothetical answer, search with that)
- `Multi-Query` - Generate multiple query variations
- `Step-Back` - Abstract the query to a broader concept
- `Decomposition` - Break a complex query into sub-queries
Available functions:
- `hyde_transform(query)` - Generates a hypothetical document
- `multi_query_transform(query, num_queries)` - Generates query variations
- `step_back_transform(query)` - Abstracts to a broader question
- `decompose_query(query)` - Splits into sub-questions
Example:

```python
from techniques.en.query_transformation import multi_query_transform
queries = multi_query_transform(
"What are the best practices for microservices?",
num_queries=3
)
# Returns variations like:
# - "microservices architecture best practices"
# - "how to design microservices effectively"
# - "recommended patterns for microservice development"Extract only the relevant portions of retrieved documents to reduce noise and token usage.
Compression methods:
- `LLMExtractor` - Uses an LLM to extract relevant sentences
- `EmbeddingsFilter` - Filters by semantic similarity
- `SentenceExtractor` - Extracts relevant sentences by scoring
Available functions:
- `create_compression_retriever(base_retriever, compressor)` - Wraps a retriever with compression
- `llm_compress(documents, query)` - LLM-based compression
- `embeddings_filter(documents, query, threshold)` - Similarity-based filtering
Example:

```python
from techniques.en.contextual_compression import ContextualCompressionRetriever
compression_retriever = ContextualCompressionRetriever(
base_retriever=vector_retriever,
compressor=LLMExtractorCompressor()
)
# Returns only relevant excerpts instead of full documents
results = compression_retriever.retrieve("What is RAG?")
```

The LLM automatically generates metadata filters from natural language queries.
Features:
- Automatic filter extraction from queries
- Support for comparison operators (=, >, <, >=, <=)
- Combines semantic search with structured filtering
Available functions:
- `create_self_query_retriever(vectorstore, metadata_info)` - Creates the self-query retriever
- `parse_query(query)` - Extracts the semantic query and filters
- `apply_filters(documents, filters)` - Applies metadata filters
Example:

```python
from techniques.en.self_query import SelfQueryRetriever
retriever = SelfQueryRetriever(
vectorstore=vectorstore,
metadata_fields=[
{"name": "category", "type": "string"},
{"name": "price", "type": "float"},
{"name": "year", "type": "integer"}
]
)
# Query: "cheap electronics from 2024"
# Auto-generates: category="electronics", price<100, year=2024
results = retriever.retrieve("cheap electronics from 2024")Search with small chunks for precision, but retrieve larger parent documents for context.
Concept:
- Child chunks: Small (e.g., 400 chars) for precise matching
- Parent documents: Larger (e.g., 2000 chars) for complete context
- Map child → parent for retrieval
Available functions:
- `create_parent_document_retriever(documents, child_size, parent_size)` - Creates the retriever
- `add_documents(documents)` - Indexes documents with parent-child relationships
- `retrieve(query, k)` - Searches children, returns parents
Example:

```python
from techniques.en.parent_document import ParentDocumentRetriever
retriever = ParentDocumentRetriever(
child_chunk_size=400,
parent_chunk_size=2000
)
retriever.add_documents(documents)
# Searches small chunks, returns full parent context
results = retriever.retrieve("neural network architecture", k=3)Store multiple representations of documents for improved retrieval.
Representation types:
- Original document content
- Generated summaries
- Hypothetical questions the document answers
Available functions:
- `create_multi_vector_retriever(documents)` - Creates a retriever with multiple vectors
- `generate_summary(document)` - Generates a document summary
- `generate_questions(document)` - Generates hypothetical questions
- `retrieve(query, k)` - Searches across all representations
Example:

```python
from techniques.en.multi_vector import MultiVectorRetriever
retriever = MultiVectorRetriever()
retriever.add_documents(documents) # Creates summary + question vectors
# Can match query to summary, questions, or original content
results = retriever.retrieve("How does backpropagation work?", k=3)Combine multiple retrievers using Reciprocal Rank Fusion with configurable weights.
Components:
- Multiple base retrievers (BM25, Vector, etc.)
- Configurable weights per retriever
- RRF algorithm for score combination
Available functions:
- `create_ensemble_retriever(retrievers, weights)` - Creates the ensemble
- `reciprocal_rank_fusion(results_list, weights, k)` - Combines results with RRF
Example:

```python
from techniques.en.ensemble_retrieval import EnsembleRetriever
ensemble = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever, sparse_retriever],
weights=[0.3, 0.5, 0.2]
)
results = ensemble.retrieve("machine learning optimization", k=5)
```

Process documents that exceed typical context windows using various strategies.
Strategies:
- `Map-Reduce` - Process chunks separately, combine results
- `Refine` - Iteratively build the answer with each chunk
- `Map-Rerank` - Score each chunk, use the best ones
- `Stuffing` - Fit the most relevant content into the context
Available functions:
- `map_reduce_summarize(chunks)` - Summarizes with map-reduce
- `refine_summarize(chunks)` - Iterative refinement
- `map_rerank_answer(chunks, question)` - Scores chunks and selects the best
- `stuffing_with_prioritization(chunks, question, max_context)` - Priority-based stuffing
Example:

```python
from techniques.en.long_context import map_reduce_summarize, map_rerank_answer
# Summarize a 50-page document
summary = map_reduce_summarize(document_chunks)
# Answer question using best chunks
answer = map_rerank_answer(
chunks=document_chunks,
question="What are the main conclusions?"
)
```

Incorporate temporal relevance with exponential decay to prefer recent documents.
Features:
- Exponential decay function for time weighting
- Configurable decay rate and time units
- Combines semantic similarity with recency
Available functions:
- `create_time_weighted_retriever(documents, decay_rate)` - Creates the retriever
- `calculate_time_weight(timestamp, decay_rate, time_unit)` - Calculates the decay weight
- `retrieve(query, k, time_weight_factor)` - Searches with time weighting
Example:

```python
from techniques.en.time_weighted import TimeWeightedRetriever
retriever = TimeWeightedRetriever(
documents=news_articles,
decay_rate=0.05, # Per day
time_unit="days"
)
# Recent articles score higher
results = retriever.retrieve(
"AI developments",
k=5,
time_weight_factor=0.4 # 40% time, 60% semantic
)
```

Configuration guide:
| Use Case | Decay Rate | Time Unit | Weight Factor |
|---|---|---|---|
| News/Current | 0.1-0.5 | hours | 0.5-0.7 |
| Chat history | 0.05-0.1 | hours | 0.3-0.5 |
| Documentation | 0.01-0.05 | days | 0.2-0.4 |
| Research papers | 0.001-0.01 | days | 0.1-0.3 |
Model Context Protocol (MCP) is an open protocol created by Anthropic to connect AI assistants to external data sources and tools in a standardized way.
Key concepts:
- Resources - Data exposed by the server (files, databases, APIs)
- Tools - Functions that the LLM can invoke
- Prompts - Reusable prompt templates
Example:

```python
from techniques.en.mcp_basics import MCPServerSimulator
server = MCPServerSimulator(name="demo-server")
server.add_tool("search_database", description="Search data", ...)
server.add_resource("file:///config.json", name="Config", ...)
```

STDIO is the most common transport method for local MCP servers. Communication occurs through stdin/stdout.
Use cases:
- Claude Desktop integration
- Command line tools
- Local file access
- Script execution
Example:

```python
# In production, use:
from mcp.server import Server
from mcp.server.stdio import stdio_server
server = Server("my-server")
async def main():
    async with stdio_server() as (read_stream, write_stream):
        await server.run(read_stream, write_stream)
```

HTTP/SSE is the transport method for remote MCP servers, enabling communication over the network.
Use cases:
- AI-as-a-service APIs
- Enterprise integrations
- Centralized servers
- AI microservices
Example:

```python
# In production, use FastAPI with MCP SDK:
from fastapi import FastAPI
from mcp.server import Server
from mcp.server.sse import SseServerTransport
app = FastAPI()
server = Server("my-http-server")Multi-agent systems allow multiple AI agents to collaborate to solve complex tasks.
Patterns:
- `Pipeline` - Sequential processing (Agent1 → Agent2 → Agent3)
- `Debate` - Agents with different perspectives discuss to reach consensus
- `Hierarchical` - Agents organized in authority levels (Director → Managers → Workers)
- `Orchestrator` - A central agent coordinates the others
Frameworks:
- LangGraph (LangChain)
- AutoGen (Microsoft)
- CrewAI
- Swarm (OpenAI)
Example:

```python
from techniques.en.multi_agent import MultiAgentSystem, Agent, AgentRole
system = MultiAgentSystem(name="dev-team")
system.add_agent(Agent(name="Planner", role=AgentRole.PLANNER, ...))
system.add_agent(Agent(name="Coder", role=AgentRole.EXECUTOR, ...))
system.add_agent(Agent(name="Reviewer", role=AgentRole.REVIEWER, ...))Prompt evaluation is essential to ensure quality, consistency, and continuous improvement in LLM applications.
Metrics:
- Relevance - Does response address the question?
- Coherence - Does text have logic and flow?
- Groundedness - Based on facts/context?
- Accuracy - Correct information?
- Safety - Appropriate content?
Tools:
- LangSmith (LangChain)
- LangFuse (Open Source)
- Weights & Biases
- Promptfoo (CLI)
- Phoenix (Arize)
Example:

```python
from techniques.en.prompt_evaluation import PromptEvaluator
evaluator = PromptEvaluator()
result = evaluator.evaluate_relevance(question, answer)
print(f"Relevance: {result.score:.2f}")Security measures for LLM applications following OWASP Top 10 guidelines for LLMs.
Security components:
- `PromptInjectionDetector` - Detects prompt injection attempts
- `OutputValidator` - Validates and sanitizes LLM outputs
- `ContentGuardrail` - Enforces content policies
- `RateLimiter` - Prevents abuse with a token bucket algorithm
- `SecureLLMWrapper` - Combines all security measures
Example:

```python
from techniques.en.llm_security import SecureLLMWrapper
secure_llm = SecureLLMWrapper(
enable_injection_detection=True,
enable_output_validation=True,
rate_limit_rpm=60
)
response = secure_llm.chat("User input here")
```

Multiple caching approaches to reduce API costs and improve response times.
Cache types:
- `ResponseCache` - Exact-match caching for identical prompts
- `EmbeddingCache` - Caches embeddings to avoid recomputation
- `SemanticCache` - Finds similar queries using embedding similarity
- `ConversationCache` - Caches conversation contexts
Example:

```python
from techniques.en.caching_strategies import SemanticCache
cache = SemanticCache(similarity_threshold=0.95)
cache.set("What is Python?", "Python is a programming language...")
# Similar query will return cached response
response = cache.get("Tell me about Python")
```

Tools and strategies to monitor and reduce LLM API costs.
Components:
- `TokenCounter` - Counts tokens before API calls
- `UsageTracker` - Tracks usage and costs over time
- `ModelSelector` - Chooses the optimal model based on task complexity
Example:

```python
from techniques.en.cost_optimization import TokenCounter, UsageTracker
counter = TokenCounter()
tokens = counter.count_tokens("Your prompt here")
estimated_cost = counter.estimate_cost(tokens, 500, "gpt-4o-mini")
tracker = UsageTracker()
tracker.daily_budget = 1.00
```

Testing frameworks and strategies for non-deterministic LLM outputs.
Testing approaches:
- Property-based validators (contains, length, regex, JSON)
- Semantic similarity testing with embeddings
- LLM-as-Judge evaluation
- Mock clients for CI/CD
- Snapshot testing for regression detection
Example:

```python
from techniques.en.ai_testing import LLMTestRunner, TestCase, ContainsValidator
test = TestCase(
name="Greeting test",
prompt="Say hello",
validators=[ContainsValidator(["hello"])]
)
runner = LLMTestRunner()
result = runner.run_test(test)
```

Complete workflow for fine-tuning OpenAI models on custom datasets.
Components:
- `DatasetGenerator` - Creates training examples from templates
- `DatasetValidator` - Validates dataset format and quality
- `FineTuningManager` - Uploads data, trains, and uses fine-tuned models
Example:

```python
from techniques.en.fine_tuning import DatasetGenerator, DatasetValidator
generator = DatasetGenerator(system_prompt="Classify sentiment...")
generator.add_examples_from_pairs([
("Great product!", "POSITIVE"),
("Terrible experience", "NEGATIVE")
])
generator.export_jsonl("training_data.jsonl")
```

All scripts include automatic token counting to help monitor costs and API usage.
Each LLM call displays the tokens used:

```
Text: This product is amazing! It exceeded all...
Tokens - Input: 52 | Output: 3 | Total: 55
Sentiment: POSITIVE
```
At the end of each script, a total summary is displayed:

```
============================================================
TOTAL - Zero-Shot Prompting
Input: 1,234 tokens
Output: 456 tokens
Total: 1,690 tokens
============================================================
```
```
.
├── .env.example # Configuration template
├── .gitignore # Files ignored by Git
├── README.md # English documentation
├── README.pt-BR.md # Portuguese documentation
├── requirements.txt # Project dependencies
├── config.py # Centralized config + Token tracking
├── sample_data/ # Sample data for RAG, Vision, and Context Engineering
│ ├── documents/ # Text documents for RAG and Context Engineering
│ │ ├── ai_handbook.txt
│ │ ├── company_faq.txt
│ │ ├── technical_docs.md
│ │ ├── products_catalog.json # Product data with metadata (Self-Query)
│ │ ├── news_articles.txt # Dated articles (Time-Weighted)
│ │ └── long_document.txt # Large document (Long Context)
│ └── images/ # Images for Vision demos
│ ├── chart.png
│ ├── diagram.png
│ └── photo.jpg
└── techniques/
├── en/ # English examples (45 scripts)
│ ├── 01_zero_shot.py
│ ├── ...
│ ├── 20_meta_prompting.py
│ ├── 21_advanced_chunking.py
│ ├── ...
│ ├── 30_time_weighted.py
│ ├── 31_mcp_basics.py
│ ├── ...
│ ├── 40_fine_tuning.py
│ ├── 41_agent_skills.py
│ ├── 42_context_window.py
│ ├── 43_subagent_orchestration.py
│ ├── 44_shared_memory.py
│ └── 45_spec_generation.py
└── pt-br/ # Portuguese examples (45 scripts)
├── 01_zero_shot.py
├── ...
├── 20_meta_prompting.py
├── 21_advanced_chunking.py
├── ...
├── 30_time_weighted.py
├── 31_mcp_basics.py
├── ...
├── 40_fine_tuning.py
├── 41_agent_skills.py
├── 42_context_window.py
├── 43_subagent_orchestration.py
├── 44_shared_memory.py
    └── 45_spec_generation.py
```
The `config.py` file provides utility functions:

```python
from config import get_llm, get_model_name, TokenUsage
# Create LLM instance with custom temperature
llm = get_llm(temperature=0.7)
# Get configured model name
model = get_model_name() # e.g., "gpt-4o-mini"
# Create token tracker
tracker = TokenUsage()
# For Ollama (local models)
from config import get_ollama_llm, get_ollama_embeddings, is_ollama_available
if is_ollama_available():
    local_llm = get_ollama_llm(model="llama3.2")
    local_embeddings = get_ollama_embeddings()
# For embeddings
from config import get_embeddings
embeddings = get_embeddings()  # OpenAI embeddings
```

Temperature is one of the most important parameters when working with LLMs. It controls the randomness and creativity of the model's responses.
- Range: 0.0 to 2.0 (most common usage is 0.0 to 1.0)
- Low values (0.0-0.3): More deterministic, focused, and consistent responses
- High values (0.7-1.0+): More creative, diverse, and unpredictable responses
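
A quick sketch of the effect using the `get_llm` helper from `config.py` (outputs vary run to run; the helper is assumed to return a LangChain chat model, matching its usage elsewhere in this README):

```python
from config import get_llm

prompt = "Suggest one name for a coffee shop."

# Low temperature: repeated calls give (nearly) identical, safe answers
deterministic = get_llm(temperature=0.0)
print(deterministic.invoke(prompt).content)

# High temperature: repeated calls give varied, more creative answers
creative = get_llm(temperature=0.9)
print(creative.invoke(prompt).content)
```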
| Technique | Temperature | Reason |
|---|---|---|
| Zero-Shot Classification | 0.0 | Consistent results |
| Chain of Thought | 0.0 | Accurate reasoning |
| Few-Shot | 0.0 - 0.3 | Follow examples |
| Tree of Thoughts | 0.3 - 0.8 | Diverse thoughts |
| Self-Consistency | 0.7 - 0.9 | Need variation |
| Self-Refine | 0.3 - 0.5 | Balanced critique |
| RAG | 0.0 - 0.3 | Factual answers |
| Structured Output | 0.0 | Consistent schema |
| Tool Calling | 0.0 | Reliable tool use |
| Meta-Prompting | 0.5 - 0.7 | Creative prompts |
| Query Transformation | 0.3 - 0.7 | Creative variations |
| Contextual Compression | 0.0 | Accurate extraction |
| Self-Query (filter gen) | 0.0 | Precise filters |
| Multi-Vector (summaries) | 0.3 | Balanced summaries |
| Long Context | 0.0 - 0.3 | Accurate synthesis |
- `gpt-4o` - Most capable, most expensive
- `gpt-4o-mini` - Good cost/performance balance (recommended)
- `gpt-4-turbo` - Turbo version of GPT-4
- `gpt-3.5-turbo` - Cheaper, less capable
- `llama3.2` - Meta's Llama 3 (recommended)
- `mistral` - Mistral 7B
- `codellama` - Code-specialized
- `phi3` - Microsoft Phi-3
- `langchain` - LLM framework
- `langchain-openai` - OpenAI integration
- `openai` - OpenAI API client
- `python-dotenv` - Environment variables
- `chromadb` - Vector database
- `sentence-transformers` - Local embeddings
- `pypdf` - PDF processing
- `unstructured` - Document parsing
- `rank-bm25` - BM25 keyword search for Hybrid Search
- `langchain-ollama` - Ollama integration
- `cohere` - Cohere reranking
- `pillow` - Image processing
- Start with Zero-Shot - It's the simplest technique and works well for direct tasks.
- Use CoT for reasoning - Mathematical, logical problems, or those requiring analysis benefit from "think step by step".
- Few-Shot for specific formats - When you need output in a specific format (JSON, SQL, etc.), provide examples.
- Self-Consistency for accuracy - When you need high accuracy on reasoning tasks, generate multiple responses and vote.
- Ollama for privacy - Use local models when data privacy is important or you want to avoid API costs.
- Structured Output for APIs - When building integrations, use Pydantic models to ensure consistent output.
- RAG for knowledge - Use RAG when you need the model to answer based on specific documents.
- Hybrid Search for precision - Combine BM25 + Vector search when queries contain specific terms or keywords.
- Advanced Chunking for quality - Choose the chunking strategy based on content type (semantic for articles, markdown-aware for docs).
- Query Transformation for recall - Use HyDE or Multi-Query when initial retrieval quality is low.
- Contextual Compression for tokens - Compress retrieved documents to reduce token usage while maintaining relevance.
- Self-Query for structured data - Use when documents have rich metadata that can filter results.
- Parent-Document for context - When retrieved chunks lack surrounding context, use parent-document retrieval.
- Long Context for large docs - Use Map-Reduce for summarization, Map-Rerank for Q&A on large documents.
- Time-Weighted for freshness - Use when document recency matters (news, logs, chat history).
The scripts make calls to the OpenAI API, which charges per token.
Each script automatically displays:
- Input and output tokens per call
- Total tokens at the end of execution
Approximate prices (January 2025):
| Model | Input (1M tokens) | Output (1M tokens) |
|---|---|---|
| gpt-4o | $2.50 | $10.00 |
| gpt-4o-mini | $0.15 | $0.60 |
| gpt-3.5-turbo | $0.50 | $1.50 |
- Use `gpt-4o-mini` (default) instead of `gpt-4o`
- Use Ollama for local inference (free)
- Reduce the number of examples in tests
- Monitor token totals displayed at the end of each script
MIT