System Design Glossary: 80+ Technical Terms

A definitive reference for the terminology used in 2026 production-grade system design.

A

ACID: Atomicity, Consistency, Isolation, Durability. The four properties that guarantee a database transaction is processed reliably.
Agent-to-Agent (A2A): A communication protocol that standardizes how specialized AI agents exchange goals and context.
ANN (Approximate Nearest Neighbor): A class of algorithms (like HNSW) used to find similar vectors in a high-dimensional space without scanning every data point.
API Gateway: A server that acts as an entry point for an application, handling routing, rate limiting, and authentication.
Artifact: A persistent, versioned document or asset generated during a technical workflow.
Asymmetric Workload: A workload in which a small input triggers a disproportionately large output or computational cost (typical of LLM inference).
At-Least-Once Delivery: A messaging guarantee that a message will be delivered one or more times, requiring idempotency on the receiver's end.
Atomic Clock: A highly precise timekeeping device used in distributed databases (like Spanner) to order transactions globally.
Augmentation: The process of providing an LLM with external context (see RAG).
Availability: The percentage of time a system is operational and accessible to users.
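The Availability entry can be made concrete with a little arithmetic. The sketch below converts an availability percentage into a downtime budget; the function name and the 30-day period are illustrative assumptions, not part of any standard.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted in a period at a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Print the downtime budget for a few common targets.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

At 99.9% over 30 days the budget is roughly 43 minutes, which is why each additional "nine" is a large operational step.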

B

Back-of-the-Envelope: Quick, rough mathematical estimations used to validate an architectural path.
Backpressure: A mechanism that allows a downstream service to signal to an upstream service that it is overloaded and cannot accept more requests.
Batching: Grouping multiple requests or data points together to be processed in a single operation to improve throughput.
BM25 (Best Matching 25): A classic sparse retrieval ranking function used for keyword-based search.
Bulkhead: A design pattern that isolates resources (thread pools, connection pools, memory) per component to prevent a failure in one component from cascading to others.
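For the BM25 entry in this section, a minimal scoring sketch may help. It uses the common formulation with parameters k1 and b and a smoothed IDF term; the whitespace tokenizer, the default parameter values, and the function name are assumptions chosen for illustration, and the corpus is assumed to be non-empty.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores
```

Documents that contain none of the query terms score zero, which is the "sparse" behavior that Hybrid Search later pairs with vector similarity.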

C

Cache-Aside: A caching pattern where the application first checks the cache and then fetches from the DB on a miss, updating the cache afterward.
Cache Stampede: A failure mode where multiple simultaneous cache misses on a hot key overwhelm the underlying data store.
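CAP Theorem: The principle that a distributed system cannot guarantee Consistency, Availability, and Partition Tolerance all at once. Because network partitions cannot be ruled out, the practical choice during a partition is between consistency and availability.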
Chaos Engineering: Intentionally injecting failures into production to test system resiliency.
Circuit Breaker: A mechanism that detects high failure rates and "trips," failing fast to protect downstream services.
Consensus Algorithm: A protocol (like Raft or Paxos) that allows a cluster of nodes to agree on a single state change despite failures.
Consistent Hashing: A hashing technique that distributes data across nodes such that adding or removing a node only impacts a small fraction of the data.
Continuous Batching: An LLM serving optimization where new requests are added to the active batch as soon as others finish, maximizing GPU utilization.
CPU-Bound: A system whose throughput is limited by processor speed rather than I/O; increasingly common in databases backed by fast NVMe storage.
CRDT (Conflict-Free Replicated Data Type): A data structure that can be updated independently and concurrently across nodes, always merging to a consistent state.
Cross-Encoder: A model that re-ranks search results by deeply evaluating the query and the document simultaneously for high precision.
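The Consistent Hashing entry in this section can be sketched as a hash ring with virtual nodes. The class name `HashRing`, the use of MD5 as the hash, and the default of 100 virtual nodes per server are illustrative assumptions rather than a reference implementation.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring (MD5 used only for spreading)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node is placed at many positions to even out the load.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get(self, key: str) -> str:
        # The owner is the first ring position clockwise from the key's hash.
        i = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]
```

When a node is added, the only keys that change owner are those that now fall on the new node's positions; every other key keeps its previous owner.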

D

Distance Metric: A mathematical measure (like Cosine Similarity or Euclidean Distance) used to calculate the similarity between two vectors.
Distributed Tracing: Capturing the path of a request as it spans multiple services to identify bottlenecks and failures.
DORA Metrics: Four key metrics (Deployment Frequency, Lead Time for Changes, MTTR, Change Failure Rate) used to measure an engineering team's software delivery performance.
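For the Distance Metric entry in this section, the two measures it names can be written in a few lines. This is a plain-Python sketch with illustrative function names; it assumes equal-length, non-zero vectors, and production systems would normally use a vectorized library instead.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between a and b: 0 = identical points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Cosine similarity ignores vector length, so `[1, 2]` and `[2, 4]` count as identical in direction even though their Euclidean distance is non-zero.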

E

Edge Computing: Moving computation and data storage closer to the user to reduce latency and bandwidth consumption.
Embedding: A numerical representation of a piece of data (text, image) in a high-dimensional vector space.
Ephemeral: Short-lived data or infrastructure that is not intended to be persistent.
Eventual Consistency: A consistency model where data will eventually become identical across all nodes if no new updates occur.

F

Failover: The automatic process of switching to a redundant or standby system upon the failure of the primary system.
Fan-out: The process of delivering a single update to multiple destinations (common in social media news feeds).
Feature Store: A centralized repository where ML features are stored and served, so the same feature values are available for training and for real-time inference.
FP8 (8-bit Floating Point): A numerical format that reduces model weight size by 50% compared to FP16, saving memory and cost.

G

GitOps: A practice where the desired state of infrastructure is defined in Git and automatically reconciled by a controller.
Gossip Protocol: A peer-to-peer communication method where nodes share state with a few neighbors, eventually propagating the info to the entire cluster.
gRPC: A high-performance remote procedure call (RPC) framework that uses Protobuf for efficient binary serialization.
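The Gossip Protocol entry in this section describes information spreading through random peer exchanges. The toy simulation below counts how many rounds it takes for one node's update to reach the whole cluster; the push-only model, the fanout of 2, and the function name are simplifying assumptions.

```python
import random


def gossip_rounds(n_nodes: int, fanout: int = 2, seed: int = 0, max_rounds: int = 1000) -> int:
    """Return the number of rounds until every node has heard the update."""
    rng = random.Random(seed)
    informed = {0}  # node 0 starts with the update
    rounds = 0
    while len(informed) < n_nodes and rounds < max_rounds:
        newly_informed = set()
        for _node in informed:
            # Each informed node pushes the update to `fanout` random peers.
            peers = rng.sample(range(n_nodes), k=min(fanout, n_nodes))
            newly_informed.update(peers)
        informed |= newly_informed
        rounds += 1
    return rounds
```

Because the informed set can grow by up to a constant factor each round, the round count grows roughly logarithmically with cluster size in this model.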

H

HNSW (Hierarchical Navigable Small World): A leading ANN algorithm that builds a multi-layered graph for fast, high-recall vector search.
Hybrid Search: Combining vector-based semantic search with keyword-based sparse search (BM25) for maximum retrieval accuracy.
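The Hybrid Search entry leaves open how the two result lists are merged. One common option is Reciprocal Rank Fusion (RRF), sketched below; RRF is an assumption introduced here for illustration, not something the glossary prescribes, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher positions contribute more; k dampens the top-rank advantage.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]   # hypothetical semantic-search results
    keyword_hits = ["b", "c", "a"]  # hypothetical BM25 results
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF needs only ranks, not raw scores, which sidesteps the problem that BM25 scores and vector similarities live on different scales.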

I

Idempotency: A property where an operation can be performed multiple times without changing the result beyond the initial application.
IDP (Internal Developer Portal): A self-service portal (like Backstage) that simplifies infrastructure management for developers.
IO-Bound: A system bottlenecked by disk or network input/output, now less common in the era of NVMe.
Isolate (Wasm): A lightweight, sandboxed execution context (such as a V8 isolate) that starts in milliseconds and consumes minimal RAM.
IVF (Inverted File): A vector indexing technique that clusters vectors into buckets so that a query scans only the nearest clusters instead of the full dataset, reducing search cost.
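For the Idempotency entry in this section, a typical realization is an idempotency key that the receiver remembers. The sketch below uses a hypothetical `PaymentService` with an in-memory dictionary; a real service would persist the keys and expire them.

```python
class PaymentService:
    """Toy receiver that applies each charge at most once per idempotency key."""

    def __init__(self, balance: int = 100):
        self.balance = balance
        self._results = {}  # idempotency key -> result of the first attempt

    def charge(self, idempotency_key: str, amount: int) -> int:
        # A retry with a known key returns the stored result without re-applying.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance -= amount
        self._results[idempotency_key] = self.balance
        return self.balance
```

This is what makes At-Least-Once Delivery safe: a redelivered message carries the same key, so the second attempt has no additional effect.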

K

KV Cache: Key-Value Cache. In LLMs, it stores the attention keys and values computed for previous tokens so they are not recomputed when generating each new token.

L

Latency: The time between sending a request and receiving its response (round-trip time when measured end to end).
Leaderless Replication: A system (like Dynamo) where any node can accept writes, resolving conflicts later.
LLM-as-a-Judge: Using a powerful LLM to evaluate the quality and accuracy of another model's output.
LLMOps: The operational practices used to manage the delivery and maintenance of LLM-based applications.

M

MCP (Model Context Protocol): An open standard for connecting AI agents to tools, data, and external APIs.
Microservices: An architectural style that structures an application as a collection of small, independent services.
MTTR (Mean Time To Recovery): The average time taken to restore a service after a failure occurs.
Multi-tenancy: A software architecture where a single instance of an application serves multiple customers (tenants).

N

NewSQL: A category of databases (like Spanner) that provide horizontal scale with strict ACID consistency.
NVMe: Non-Volatile Memory Express. An interface for high-speed SSDs that has radically reduced disk latency.

O

OIDC (OpenID Connect): An authentication layer built on top of OAuth 2.0 that provides identity information.
Orchestrator: A central service that coordinates the actions of multiple sub-agents or microservices.

P

PACELC: An extension of the CAP theorem: during a Partition, a system trades Availability against Consistency; Else (in normal operation), it trades Latency against Consistency.
PagedAttention: An LLM memory management technique that allocates KV caches in non-contiguous blocks, reducing fragmentation.
Partition Tolerance: The ability of a distributed system to continue operating when the network drops or delays messages between groups of nodes.
Paxos: A classic consensus algorithm used to reach agreement across a distributed cluster.
Primary Source: A verifiable, high-authority origin for technical information (e.g., engineering blog or research paper).
Prompt Caching: Storing the state of a long LLM prompt to bypass redundant processing costs for subsequent queries.
Prompt Injection: A security vulnerability where user input "hijacks" the LLM's system instructions.

Q

Quantization: Compressing LLM weights to lower precision (e.g., FP16 to INT4) to reduce memory and infra costs.
QUIC: A UDP-based transport layer protocol used by HTTP/3 to reduce latency and handle network switching.
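For the Quantization entry in this section, the basic idea can be shown with symmetric per-tensor INT4 quantization. This is a simplified sketch with illustrative function names; real LLM quantizers typically work per group or per channel and pack two 4-bit values into each byte.

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-8, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats; the difference is the quantization error."""
    return [q * scale for q in quantized]
```

Each weight is stored in 4 bits plus a shared scale, and the reconstruction error per weight is at most about half the scale.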

R

RAG (Retrieval-Augmented Generation): Retrieving relevant external data at query time and inserting it into an LLM's context window to ground its responses in facts.
Raft: A modern consensus algorithm designed to be more understandable and implementable than Paxos.
Recall: A measure of search quality; the fraction of all relevant items that the search actually returns.
Reranker: A high-precision model used to re-evaluate and order the candidate results from a search query.
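For the Recall entry in this section, the usual practical form is recall at a cutoff k. The sketch below assumes document ids are hashable and that the set of relevant ids is known from labeled data; the function name is illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)
```

This is the quantity that ANN indexes trade against speed: a faster, more approximate search usually returns fewer of the true nearest neighbors.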

S

Saga Pattern: A distributed transaction pattern that uses a series of local transactions and compensating actions to maintain consistency.
Semantic Caching: Caching LLM responses based on the meaning of the query rather than an exact string match.
Sharding: Horizontally partitioning data across multiple database instances.
SLO (Service Level Objective): A target value for a specific service level, such as "99.9% availability."
SSE (Server-Sent Events): A unidirectional streaming protocol over HTTP, ideal for pushing LLM tokens to a UI.
Speculative Decoding: Using a small "draft" model to guess future tokens and a large model to verify them, significantly increasing inference speed.
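The Saga Pattern entry in this section can be sketched as a loop over local steps with compensations run in reverse on failure. The `run_saga` helper and the idea of passing steps as (action, compensation) pairs are illustrative assumptions; real sagas also persist their progress so they can resume after a crash.

```python
from typing import Callable

Step = tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_saga(steps: list[Step]) -> bool:
    """Run each action in order; on failure, undo completed steps in reverse."""
    compensations = []
    for action, compensate in steps:
        try:
            action()
            compensations.append(compensate)
        except Exception:
            # Roll back only the steps that actually completed.
            for undo in reversed(compensations):
                undo()
            return False
    return True
```

The failed step itself is not compensated, since its local transaction never committed.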

T

TAO: The Associations and Objects. Meta's distributed graph cache for the social graph.
Throughput: The amount of data or number of requests a system can handle in a given period.
Token Bucket: A rate-limiting algorithm that allows for bursts while maintaining a steady average request rate.
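Transactional Outbox: A pattern that writes an outgoing message to an "outbox" table in the same local DB transaction as the business data, so a separate relay publishes it to the queue only if that transaction commits.
TrueTime: Google's GPS- and atomic-clock-based time API that exposes bounded clock uncertainty, used by Spanner to order transactions globally.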
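For the Token Bucket entry in this section, a minimal single-threaded sketch follows. The class name, the injectable `clock` parameter (added so the behavior can be tested without sleeping), and the lazy refill strategy are illustrative choices.

```python
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self._clock = clock
        self._tokens = capacity   # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits, else False."""
        now = self._clock()
        # Refill lazily based on elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

`capacity` bounds the burst size while `rate` bounds the long-run average, which is the combination the definition describes.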

V

Vector Database: A database optimized for storing, indexing, and searching high-dimensional embeddings.
Vertical Scaling: Adding more power (CPU, RAM) to a single machine to handle increased load.

W

Wasm (WebAssembly): A binary instruction format that enables high-performance execution of code at near-native speeds.
Write-Through Cache: A caching pattern where every write updates both the cache and the database synchronously on the same write path, keeping the cache consistent with the database.
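The Write-Through Cache entry can be sketched with a dictionary standing in for the database. The class name and the choice to write the store before the cache are illustrative assumptions; the point is that both are updated within the same write call.

```python
class WriteThroughCache:
    def __init__(self, store: dict):
        self._store = store  # stand-in for the database
        self._cache = {}

    def write(self, key, value) -> None:
        # Persist first; if this raises, the cache is left untouched.
        self._store[key] = value
        self._cache[key] = value

    def read(self, key):
        if key in self._cache:
            return self._cache[key]
        value = self._store[key]  # miss: load from the store and remember it
        self._cache[key] = value
        return value
```

Compared with Cache-Aside, reads after a write never see a stale cached value, at the cost of extra latency on every write.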

Z

Zero Trust: A security model that requires continuous verification of every user and device, assuming no internal trust.
