A definitive reference for the terminology used in 2026 production-grade system design.
- ACID: Atomicity, Consistency, Isolation, Durability. The four properties that guarantee a database transaction is processed reliably.
- Agent-to-Agent (A2A): A communication protocol that standardizes how specialized AI models exchange goals and context.
- ANN (Approximate Nearest Neighbor): A class of algorithms (like HNSW) used to find similar vectors in a high-dimensional space without scanning every data point.
- API Gateway: A server that acts as an entry point for an application, handling routing, rate limiting, and authentication.
- Artifact: A persistent, versioned document or asset generated during a technical workflow.
- Asymmetric Workload: A query where the input is small but the output/computational cost is massive (typical of LLMs).
- At-Least-Once Delivery: A messaging guarantee that a message will be delivered one or more times, requiring idempotency on the receiver's end.
- Atomic Clock: A highly precise timekeeping device used in distributed databases (like Spanner) to order transactions globally.
- Augmentation: The process of providing an LLM with external context (see RAG).
- Availability: The percentage of time a system is operational and accessible to users.
- Back-of-the-Envelope: Quick, rough mathematical estimations used to validate an architectural path.
- Backpressure: A mechanism that allows a downstream service to signal to an upstream service that it is overloaded and cannot accept more requests.
- Batching: Grouping multiple requests or data points together to be processed in a single operation to improve throughput.
- BM25 (Best Matching 25): A classic sparse retrieval ranking function used for keyword-based search.
- Bulkhead: A design pattern that isolates resources (thread pools, connection pools, memory) per component so that a failure in one cannot cascade to the others.
- Cache-Aside: A caching pattern where the application first checks the cache and then fetches from the DB on a miss, updating the cache afterward.
- Cache Stampede: A failure mode where multiple simultaneous cache misses on a hot key overwhelm the underlying data store.
- CAP Theorem: The principle that a distributed system cannot guarantee Consistency, Availability, and Partition Tolerance all at once; when a network partition occurs, it must give up either consistency or availability.
- Chaos Engineering: Intentionally injecting failures into production to test system resiliency.
- Circuit Breaker: A mechanism that detects high failure rates and "trips," failing fast to protect downstream services.
- Consensus Algorithm: A protocol (like Raft or Paxos) that allows a cluster of nodes to agree on a single state change despite failures.
- Consistent Hashing: A hashing technique that distributes data across nodes such that adding or removing a node only impacts a small fraction of the data.
- Continuous Batching: An LLM serving optimization where new requests are added to the active batch as soon as others finish, maximizing GPU utilization.
- CPU-Bound: A system bottlenecked by processor speed rather than I/O; increasingly common in NVMe-backed databases, where storage is no longer the slowest component.
- CRDT (Conflict-Free Replicated Data Type): A data structure that can be updated independently and concurrently across nodes, always merging to a consistent state.
- Cross-Encoder: A model that re-ranks search results by deeply evaluating the query and the document simultaneously for high precision.
- DORA Metrics: Four key metrics (Deployment Frequency, Lead Time for Changes, Time to Restore Service, Change Failure Rate) used to measure software delivery performance. See MTTR.
- Distance Metric: A mathematical measure (like Cosine Similarity or Euclidean Distance) used to calculate the similarity between two vectors.
- Distributed Tracing: Capturing the path of a request as it spans multiple services to identify bottlenecks and failures.
- Edge Computing: Moving computation and data storage closer to the user to reduce latency and bandwidth consumption.
- Embedding: A numerical representation of a piece of data (text, image) in a high-dimensional vector space.
- Ephemeral: Short-lived data or infrastructure that is not intended to be persistent.
- Eventual Consistency: A consistency model where data will eventually become identical across all nodes if no new updates occur.
- Failover: The automatic process of switching to a redundant or standby system upon the failure of the primary system.
- Fan-out: The process of delivering a single update to multiple destinations (common in social media news feeds).
- Feature Store: A centralized repository where ML features are stored and served for model training and real-time inference.
- FP8 (8-bit Floating Point): A numerical format that reduces model weight size by 50% compared to FP16, saving memory and cost.
- GitOps: A practice where the desired state of infrastructure is defined in Git and automatically reconciled by a controller.
- Gossip Protocol: A peer-to-peer communication method where nodes share state with a few neighbors, eventually propagating the info to the entire cluster.
- gRPC: A high-performance remote procedure call (RPC) framework that uses Protobuf for efficient binary serialization.
- HNSW (Hierarchical Navigable Small World): A leading ANN algorithm that builds a multi-layered graph for fast, high-recall vector search.
- Hybrid Search: Combining vector-based semantic search with keyword-based sparse search (BM25) for maximum retrieval accuracy.
- Idempotency: A property where an operation can be performed multiple times without changing the result beyond the initial application.
- IDP (Internal Developer Portal): A self-service portal (like Backstage) that simplifies infrastructure management for developers.
- IO-Bound: A system bottlenecked by disk or network input/output, now less common in the era of NVMe.
- Isolate (Wasm): A lightweight sandboxed execution context (such as a V8 isolate) that starts in milliseconds and consumes minimal RAM.
- IVF (Inverted File): A vector indexing technique that clusters vectors into buckets so that a query scans only the nearest clusters instead of the whole dataset.
- KV Cache: Key-Value Cache. In LLMs, it stores the internal state of previous tokens to speed up the generation of the next token.
- Latency: The time between sending a request and receiving its response, often measured as round-trip time.
- Leaderless Replication: A replication scheme (as in Dynamo) where any replica can accept writes, with conflicts resolved later.
- LLM-as-a-Judge: Using a powerful LLM to evaluate the quality and accuracy of another model's output.
- LLMOps: The operational practices used to manage the delivery and maintenance of LLM-based applications.
- MCP (Model Context Protocol): An open standard for connecting AI agents to tools, data, and external APIs.
- Microservices: An architectural style that structures an application as a collection of small, independent services.
- MTTR (Mean Time To Recovery): The average time taken to restore a service after a failure occurs.
- Multi-tenancy: A software architecture where a single instance of an application serves multiple customers (tenants).
- NewSQL: A category of databases (like Spanner) that provide horizontal scale with strict ACID consistency.
- NVMe: Non-Volatile Memory Express. An interface for high-speed SSDs that has radically reduced disk latency.
- OIDC (OpenID Connect): An authentication layer built on top of OAuth 2.0 that provides identity information.
- Orchestrator: A central service that coordinates the actions of multiple sub-agents or microservices.
- PACELC: An extension of the CAP theorem: during a Partition, choose between Availability and Consistency; Else, during normal operation, choose between Latency and Consistency.
- PagedAttention: An LLM memory management technique that allocates KV caches in non-contiguous blocks, reducing fragmentation.
- Partition Tolerance: The ability of a distributed system to continue operating when network failures drop or delay messages between nodes.
- Paxos: A classic consensus algorithm used to reach agreement across a distributed cluster.
- Primary Source: A verifiable, high-authority origin for technical information (e.g., engineering blog or research paper).
- Prompt Caching: Storing the state of a long LLM prompt to bypass redundant processing costs for subsequent queries.
- Prompt Injection: A security vulnerability where user input "hijacks" the LLM's system instructions.
- Quantization: Compressing LLM weights to lower precision (e.g., FP16 to INT4) to reduce memory and infra costs.
- QUIC: A UDP-based transport layer protocol used by HTTP/3 to reduce latency and handle network switching.
- RAG (Retrieval-Augmented Generation): Retrieving relevant external data and inserting it into an LLM's context window to ground its responses in facts.
- Raft: A modern consensus algorithm designed to be more understandable and implementable than Paxos.
- Recall: A measure of search quality; the fraction of all truly relevant items that the search actually returns.
- Reranker: A high-precision model used to re-evaluate and order the candidate results from a search query.
- Saga Pattern: A distributed transaction pattern that uses a series of local transactions and compensating actions to maintain consistency.
- Semantic Caching: Caching LLM responses based on the meaning of the query rather than an exact string match.
- Sharding: Horizontally partitioning data across multiple database instances.
- SLO (Service Level Objective): A target value for a specific service level, such as "99.9% availability."
- SSE (Server-Sent Events): A unidirectional streaming protocol over HTTP, ideal for pushing LLM tokens to a UI.
- Speculative Decoding: Using a small "draft" model to guess future tokens and a large model to verify them, significantly increasing inference speed.
- TAO: The Associations and Objects. Meta's distributed graph cache for the social graph.
- Throughput: The amount of data or number of requests a system can handle in a given period.
- Token Bucket: A rate-limiting algorithm that allows for bursts while maintaining a steady average request rate.
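- Transactional Outbox: A pattern that writes an outgoing message to an outbox table in the same local DB transaction as the business data; a separate relay then publishes it to the queue, so a message is sent only if the transaction commits.
- TrueTime: Google's clock API, backed by GPS receivers and atomic clocks, that reports time with bounded uncertainty so Spanner can order transactions globally.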
- Vector Database: A database optimized for storing, indexing, and searching high-dimensional embeddings.
- Vertical Scaling: Adding more power (CPU, RAM) to a single machine to handle increased load.
- Wasm (WebAssembly): A portable binary instruction format that runs sandboxed code at near-native speed.
- Write-Through Cache: A caching pattern where each write updates the cache and the database synchronously as part of the same operation, keeping the two consistent.
- Zero Trust: A security model that requires continuous verification of every user and device, assuming no internal trust.
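
As an illustration of the Token Bucket entry, here is a minimal Python sketch. The class name, parameters, and the caller-supplied `now` timestamp are assumptions made for this example (passing the clock in keeps the behavior deterministic); it is not the API of any particular library.

```python
class TokenBucket:
    """Hypothetical token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate              # steady-state tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity) # start full so an initial burst is allowed
        self.last = now               # timestamp of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, never exceeding capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Spend tokens if enough are available; otherwise reject the request.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=1` and `capacity=3`, three back-to-back requests pass as a burst, the fourth is rejected, and one more is admitted after a second has elapsed.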
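
The Consistent Hashing entry can be made concrete with a small hash ring. This is a sketch under stated assumptions: the `HashRing` class, the use of SHA-256, and the default of 100 virtual nodes per physical node are illustrative choices, not a reference implementation.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Take the first 8 bytes of SHA-256 as a position on the ring.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each physical node owns many points on the ring to smooth the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring point at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

When a node is added, a key changes owner only if the new node takes it over, so with three existing nodes roughly a quarter of the keys would be expected to move rather than nearly all of them.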
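
The Circuit Breaker entry describes a small state machine, sketched below. The class name, the consecutive-failure threshold, the reset timeout, and the caller-supplied `now` timestamp are all assumptions for illustration; production libraries typically add rolling windows and richer half-open handling.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: trips after N consecutive failures, retries after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, now: float):
        if self.opened_at is not None and now - self.opened_at < self.reset_timeout:
            # Open: fail fast instead of hitting the struggling dependency.
            raise CircuitOpenError("circuit is open")
        # Closed, or half-open after the timeout: let the call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Trip on reaching the threshold, or re-trip if a half-open trial fails.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = now
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately; after `reset_timeout` seconds a single trial call decides whether the circuit closes again.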
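
The At-Least-Once Delivery and Idempotency entries fit together: because a message may arrive more than once, the receiver has to make repeated deliveries harmless. A minimal sketch, assuming each message carries a unique ID (the `IdempotentConsumer` class and its fields are hypothetical, and a real system would persist the seen-ID set durably):

```python
class IdempotentConsumer:
    """Hypothetical receiver that applies each message ID at most once."""

    def __init__(self):
        self.seen = set()  # IDs already processed (in-memory for illustration only)
        self.balance = 0

    def handle(self, message_id: str, amount: int) -> bool:
        # A redelivered message is acknowledged but has no further effect.
        if message_id in self.seen:
            return False
        self.seen.add(message_id)
        self.balance += amount
        return True
```

Delivering the same message twice leaves `balance` unchanged after the first application, which is the property the Idempotency entry describes.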