39 changes: 29 additions & 10 deletions docs/en/llama_stack/install.mdx
After the operator is installed, deploy Llama Stack Server by creating a `LlamaStackDistribution` custom resource:

> **Note:** Prepare the following in advance; otherwise the distribution may not become ready:
> - **Inference URL**: `VLLM_URL` must point at a **vLLM OpenAI-compatible** HTTP base URL (for example an in-cluster vLLM or KServe InferenceService) that serves the target model.
> - **Secret (optional)**: `VLLM_API_TOKEN` is only needed when the vLLM endpoint requires authentication. If vLLM has no auth, do not set it. When required, create a Secret in the same namespace and reference it from `containerSpec.env` (see the commented example in the manifest below).
> - **Storage Class**: Ensure the `default` Storage Class exists in the cluster; otherwise the PVC cannot be bound and the resource will not become ready.

```yaml
spec:
  replicas: 1 # Number of server replicas
  server:
    containerSpec:
      name: llama-stack
      port: 8321
      env:
      - name: VLLM_URL
        value: "http://vllm-predictor.default.svc.cluster.local/v1" # vLLM OpenAI-compatible base URL
      - name: VLLM_MAX_TOKENS
        value: "8192" # Maximum output tokens

      # Optional: VLLM_API_TOKEN — add only when the vLLM endpoint requires authentication.
      # If vLLM is deployed without auth, omit the entire block below (do not set VLLM_API_TOKEN).
      # Example after creating: kubectl create secret generic vllm-api-token -n default --from-literal=token=<TOKEN>
      # - name: VLLM_API_TOKEN
      #   valueFrom:
      #     secretKeyRef:
      #       key: token
      #       name: vllm-api-token

    distribution:
      name: starter # Distribution name (options: starter, postgres-demo, meta-reference-gpu)
    storage:
      mountPath: /home/lls/.lls
      size: 1Gi # Requires the "default" Storage Class to be configured beforehand
```

After deployment, the Llama Stack Server will be available within the cluster. The access URL is displayed in `status.serviceURL`, for example:
```yaml
status:
  phase: Ready
  serviceURL: http://demo-service.default.svc.cluster.local:8321
```

## Tool calling with vLLM on KServe

The following applies to the **vLLM predictor** on KServe, not to the `LlamaStackDistribution` manifest. For agent flows that use **tools** (client-side tools or MCP), the vLLM server must be started with tool-calling support enabled. Add predictor container `args` as required by upstream vLLM, for example:

```yaml
args:
- --enable-auto-tool-choice
- --tool-call-parser
- hermes
```

Choose `--tool-call-parser` (and any related flags) according to the **served model** and the vLLM documentation for that model family.
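For context, these args belong in the predictor container of the InferenceService. The following is a minimal sketch, assuming KServe with a custom vLLM container; the resource name, image tag, and served model are illustrative assumptions, not prescribed values:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: vllm  # illustrative; the predictor Service becomes vllm-predictor.<namespace>.svc.cluster.local
spec:
  predictor:
    containers:
    - name: kserve-container
      image: vllm/vllm-openai:latest  # assumed image; pin a specific tag in practice
      args:
      - --model
      - Qwen/Qwen2.5-7B-Instruct  # illustrative model; the hermes parser suits Hermes-style tool templates
      - --enable-auto-tool-choice
      - --tool-call-parser
      - hermes
```

With a name like `vllm`, the in-cluster predictor URL lines up with the `VLLM_URL` example in the manifest above (`http://vllm-predictor.<namespace>.svc.cluster.local/v1`).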
16 changes: 6 additions & 10 deletions docs/en/llama_stack/quickstart.mdx
## Prerequisites

- Python 3.12 or higher (if not satisfied, refer to [FAQ: How to prepare Python 3.12 in Notebook](#how-to-prepare-python-312-in-notebook))
- Llama Stack Server installed and running via Operator (see [Install Llama Stack](./install)), with **`VLLM_URL` pointing at a vLLM-served model endpoint** (see install notes)
- Access to a Notebook environment (e.g., Jupyter Notebook, JupyterLab)
- Python environment with `llama-stack-client`, `fastmcp` (for the MCP section), and other notebook dependencies installed

## Quickstart Example

Download the notebook and upload it to a Notebook environment to run.

The notebook demonstrates:

- **Two tool options:** client-side tools (`@client_tool`) and MCP tools (FastMCP + `toolgroups.register`)
- **Shared agent flow:** connect to Llama Stack Server, select a model, create an `Agent` with `tools=AGENT_TOOLS`, then run sessions and streaming turns
- Streaming responses and event logging
- Optional FastAPI deployment of the `agent`
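The shared agent flow can be sketched as follows. This is a minimal sketch, not the notebook's exact code: the server URL, session name, model selection, and the `get_weather` tool are illustrative assumptions.

```python
# Minimal sketch of the client-side tool flow. The server URL, model choice,
# and get_weather tool are illustrative assumptions, not the notebook's code.

def get_weather(city: str) -> str:
    """Query the current weather for a city (stand-in for a real weather API)."""
    return f"Sunny, 22°C in {city}"

def main() -> None:
    # Imports live here so the tool above stays usable without a running server.
    from llama_stack_client import Agent, LlamaStackClient
    from llama_stack_client.lib.agents.client_tool import client_tool

    client = LlamaStackClient(base_url="http://demo-service.default.svc.cluster.local:8321")
    model_id = client.models.list()[0].identifier  # pick the first served model

    agent = Agent(
        client,
        model=model_id,
        instructions="You are a helpful assistant. Use tools to answer weather questions.",
        tools=[client_tool(get_weather)],  # wrap the function as a client-side tool
    )
    session_id = agent.create_session("quickstart-session")
    turn = agent.create_turn(
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        session_id=session_id,
        stream=False,  # the notebook streams; a blocking turn keeps the sketch short
    )
    print(turn.output_message.content)

# Run against a live server with: main()
```

The MCP path differs only in tool wiring: instead of a `@client_tool` function, the notebook registers a FastMCP endpoint via `toolgroups.register` and passes the toolgroup id in `tools`.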

## FAQ
