generated from mintlify/starter
Add tei tgi delete instance #5
Open

aowen14 wants to merge 4 commits into `shadeform:main` from `aowen14:add-tei-tgi-delete-instance` (base: `main`)
**New file (+60 lines): "Stopping GPU Instances" guide**
---
title: 'Stopping GPU Instances'
icon: 'octagon'
iconType: 'solid'
---
### Intro
Stopping instances saves money on GPU deployments when you're not using them. The Shadeform API lets you stop instances programmatically, which cuts costs and enables new use cases.
### Setup
Let's assume that you have a Shadeform API key from [here](https://platform.shadeform.ai/settings/api) and an active instance. If you need an active instance, [this guide](/guides/mostaffordablegpus) will help you get set up.
### Deleting an instance

Using the API, let's list our active instances:
<CodeGroup>
```python Python
import json
import requests

base_url = "https://api.shadeform.ai/v1/instances"
headers = {
    "X-API-KEY": "<api-key>",
    "Content-Type": "application/json"
}
instance_response = requests.request("GET", base_url, headers=headers)
print(instance_response)
```
```bash cURL
curl --request GET \
  --url https://api.shadeform.ai/v1/instances \
  --header 'X-API-KEY: <api-key>'
```
</CodeGroup>
To delete an instance, we make a delete request with that instance's ID.
<CodeGroup>
```python Python
instances = json.loads(instance_response.text)['instances']
# We'll delete the first instance in the list.
# Note that this requires at least one active instance.
instance_id = instances[0]['id']
delete_url = base_url + '/' + instance_id + '/delete'
delete_response = requests.request("POST", delete_url, headers=headers)
print(delete_response)
```
```bash cURL
# Copy the instance id from above that you'd like to delete
curl --request POST \
  --url https://api.shadeform.ai/v1/instances/{id}/delete \
  --header 'X-API-KEY: <api-key>'
```
</CodeGroup>
Just like that, we've stopped a running instance. While we had a human in the loop here, this can all be done without human involvement once compute jobs are finished.
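For example, an automated cleanup could poll a completion check and delete the instance once the job reports done. Here's a minimal sketch under assumptions: `job_is_finished` is a hypothetical callback you would supply, and the HTTP call is injectable so the flow can be exercised without hitting the live API.

```python
import time
import requests

BASE_URL = "https://api.shadeform.ai/v1/instances"

def delete_instance(api_key, instance_id, post=requests.post):
    # POST to the /delete endpoint for this instance.
    # `post` is injectable for testing; it defaults to requests.post.
    url = f"{BASE_URL}/{instance_id}/delete"
    headers = {"X-API-KEY": api_key, "Content-Type": "application/json"}
    return post(url, headers=headers)

def cleanup_when_done(api_key, instance_id, job_is_finished,
                      poll_seconds=60, post=requests.post):
    # Poll a user-supplied completion check, then delete the instance.
    while not job_is_finished():
        time.sleep(poll_seconds)
    return delete_instance(api_key, instance_id, post=post)
```

You could run this at the end of a training script, so the GPU stops billing the moment your job completes.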
### Next Steps
We alluded to the fact that stopping instances enables new use cases. Be on the lookout as we showcase some of these here 👀.
**New file (+126 lines): "Serve Embeddings" guide**
---
title: 'Serve Embeddings'
icon: 'webhook'
iconType: 'solid'
---
### Intro
Embedding and re-ranker models are important backbones for RAG and other AI applications with chained calls to LLMs.
They help efficiently organize a body of knowledge and prepare it for an LLM to generate tokens.
Finetuning embeddings can also provide performance improvements over generic embedding models for your apps and workflows.
In this guide, we will show you how to serve an embedding model for online inference using `Text-Embeddings-Inference` from Hugging Face.
This framework runs inference very quickly and can handle large batch sizes.
### Setup
This guide builds on our guide for [finding the best GPU](/guides/mostaffordablegpus).

We have a Python notebook ready to go for deploying this model, which you can [find here](https://github.com/shadeform/examples/blob/main/serving_embeddings_tei.ipynb).
The requirements are simple, so in a Python environment with `requests` installed:
```bash
git clone https://github.com/shadeform/examples.git
cd examples/
```
Then in `basic_serving_embeddings_tei.ipynb` you will need to input your [Shadeform API Key](https://platform.shadeform.ai/settings/api).
### Configuring our Server

Once we have an instance, we deploy a `Text-Embeddings-Inference` container with this request payload.
```python
model_id = "BAAI/bge-base-en-v1.5"

payload = {
    "cloud": best_instance["cloud"],
    "region": region,
    "shade_instance_type": shade_instance_type,
    "shade_cloud": True,
    "name": "text_embeddings_inference_server",
    "launch_configuration": {
        "type": "docker",
        "docker_configuration": {
            "image": "ghcr.io/huggingface/text-embeddings-inference",
            "args": "--model-id " + model_id,
            "port_mappings": [
                {
                    "container_port": 80,
                    "host_port": 8000
                }
            ]
        }
    }
}
```
Note that we are mapping the container's port 80 to the host's port 8000. `Text-Embeddings-Inference` serves on port 80 by default, but many higher-level frameworks expect the server to host on port 8000.
#### Tailoring your image to hardware version
Shadeform supports multiple versions and generations of GPUs, as does `Text-Embeddings-Inference`. But not all features are available on every hardware version, and reliability issues may arise.
Luckily, Hugging Face publishes a list of [docker images](https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#docker-images) for each hardware platform for the best performance and support.
We use the general base image, as it works for the A6000 GPUs we're using in this guide and is somewhat flexible.

You can substitute the image that you need above in the `image` key in the payload.
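If you launch on several GPU types, you can keep this choice in one place. A minimal sketch, assuming a mapping you populate yourself from the TEI README (the tag names there are hardware-specific, so we don't hard-code any here):

```python
# General base image used in this guide; works for the A6000.
DEFAULT_IMAGE = "ghcr.io/huggingface/text-embeddings-inference"

def tei_image_for(gpu_type, image_by_gpu):
    # image_by_gpu maps GPU type -> image tag, copied from the TEI README.
    # Fall back to the general base image when no hardware-specific tag is listed.
    return image_by_gpu.get(gpu_type, DEFAULT_IMAGE)
```

The returned string would then go into the `image` key of the payload above.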
### Serving Embeddings
We can deploy our server to this instance by POSTing the request like so:
```python
response = requests.request("POST", create_url, json=payload, headers=headers)
# easy way to visually see if this request worked
print(response.text)
```
Once we request it, Shadeform will provision the machine and deploy a Docker container based on the image, arguments, and environment variables that we selected.
This may take 5-10 minutes depending on the machine chosen and the size of the model weights.
For more information on the API fields, check out the [Create Instance API Reference](/api-reference/instances/instances-create).
### Checking on our server

There are four main steps that we need to wait for: VM provisioning, image download and startup, spinning up `Text-Embeddings-Inference`, and downloading the model.
Luckily, `Text-Embeddings-Inference` is optimized for serverless use cases with a small image size, and embedding models are usually small compared to LLMs.
```python
instance_response = requests.request("GET", base_url, headers=headers)
ip_addr = ""
print(instance_response.text)
instance = json.loads(instance_response.text)["instances"][0]
instance_status = instance['status']
if instance_status == 'active':
    print(f"Instance is active with IP: {instance['ip']}")
    ip_addr = instance['ip']
else:
    print(f"Instance isn't yet active: {instance}")
```
This cell will print the IP address once the instance has provisioned. Generally, the server should be ready a few minutes after the IP shows up, once the image and model have downloaded.

Or, once we've made the request, we can watch the logs under [Running Instances](https://platform.shadeform.ai/instances). Once it is ready to serve, it should look something like this:

### Querying our Server

<CodeGroup>
```bash cURL
# Copy the IP address printed above; port 8000 is the host port we mapped
curl XX.XX.XXX.XX:8000/embed \
    -X POST \
    -d '{"inputs":"What is the meaning of life?"}' \
    -H 'Content-Type: application/json'
```
```python requests
server_headers = {
    'Content-Type': 'application/json',
}

json_data = {
    'inputs': 'What is the meaning of life?',
}

embedding_response = requests.post(f'http://{ip_addr}:8000/embed', headers=server_headers, json=json_data)

print(embedding_response.text)
```
</CodeGroup>
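The `/embed` endpoint returns JSON containing embedding vectors. A common next step in RAG is ranking documents against a query by cosine similarity. Here's a minimal, dependency-free sketch; the short vectors below are stand-ins for real `/embed` output:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    # Indices of doc_vecs sorted from most to least similar to the query.
    return sorted(range(len(doc_vecs)),
                  key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                  reverse=True)
```

In practice you would embed your query and documents with the server above, then feed the resulting vectors into `rank_by_similarity`.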
### What's Next
You can now serve your own embedding models for RAG, search, and other AI applications on Shadeform, giving you more control over your applications at more affordable prices. We'll cover more details on how to do this in future guides.
**Modified file: "Serve LLM Using TGI" guide**
---
title: "Serve LLM Using TGI"
icon: "comment"
iconType: "solid"
---
### Intro

Model deployment is a very common GPU use case. [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference) (`TGI`) is a popular framework for running inference on LLMs from Hugging Face.

With Shadeform, it's easy to deploy models to the most affordable GPUs on the market with just a few commands.

In this guide, we will deploy [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) with `TGI` onto an A6000. This guide is very similar to our [vLLM guide](/guides/vllm), with a few changes to swap the inference framework.
```bash
git clone https://github.com/shadeform/examples.git
cd examples/
```

Then in `serve_tgi.ipynb` you will need to input your [Shadeform API Key](https://platform.shadeform.ai/settings/api).
### Serving a Model

Once we have an instance, we deploy a model serving container with this request payload.
```python
model_id = "mistralai/Mistral-7B-v0.1"
port = 8000

# If the model you need requires authenticated access, paste your Hugging Face API key here
huggingface_token = ""

payload = {
    "cloud": best_instance["cloud"],
    "region": region,
    "shade_instance_type": shade_instance_type,
    "shade_cloud": True,
    "name": "text_generation_inference_server",
    "launch_configuration": {
        "type": "docker",
        "docker_configuration": {
            "image": "ghcr.io/huggingface/text-generation-inference:1.4",
            "args": "--model-id " + model_id + f" --port {port}",
            "envs": [],
            "port_mappings": [
                {
                    "container_port": 8000,
                    "host_port": 8000
                }
            ]
        }
    }
}

# If a token was provided, append it to the payload as an environment variable
if huggingface_token != "":
    token_env_json = {
        "name": "HUGGING_FACE_HUB_TOKEN",
        "value": huggingface_token
    }
    payload["launch_configuration"]["docker_configuration"]["envs"].append(token_env_json)

# request the best instance that is available
response = requests.request("POST", create_url, json=payload, headers=headers)
# easy way to visually see if this request worked
print(response.text)
```
Once we request it, Shadeform will provision the machine and deploy a Docker container with the image, arguments, and environment variables provided. In this case, it will deploy an OpenAI-compatible server with `TGI` serving Mistral-7b-v0.1.
This may take 5-10 minutes depending on the machine chosen and the size of the model weights.

For more information on the API fields, check out the [Create Instance API Reference](/api-reference/instances/instances-create).
### Checking on our Model server

There are four main steps that we need to wait for: VM provisioning, image download and startup, spinning up `TGI`, and downloading the model.
```python
instance_response = requests.request("GET", base_url, headers=headers)
ip_addr = ""
print(instance_response.text)
instance = json.loads(instance_response.text)["instances"][0]
instance_status = instance['status']
if instance_status == 'active':
    print(f"Instance is active with IP: {instance['ip']}")
    ip_addr = instance['ip']
else:
    print(f"Instance isn't yet active: {instance}")
```
This cell will print the IP address once the instance has provisioned. However, the image needs to download, and `TGI` needs to download the model and spin up, which should take a few minutes.
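Rather than re-running that cell by hand, you can wrap the status check in a small polling loop. A sketch under assumptions: `fetch_instances` is a hypothetical callable standing in for the GET request above, and both it and `sleep` are injectable so the loop can be tested offline.

```python
import time

def wait_for_active(fetch_instances, timeout_s=900, poll_s=30, sleep=time.sleep):
    # Poll until the first instance reports 'active'; return its IP, or None on timeout.
    waited = 0
    while waited <= timeout_s:
        instances = fetch_instances()
        if instances and instances[0].get("status") == "active":
            return instances[0].get("ip")
        sleep(poll_s)
        waited += poll_s
    return None
```

In the notebook, `fetch_instances` could simply wrap the `requests.request("GET", base_url, ...)` call and return the parsed `instances` list.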
#### Watch via the notebook

Once the model is ready, this code will output a response to our query. We can use either `requests` or OpenAI's completions library.
<CodeGroup>
```python requests
tgi_headers = {
    'Content-Type': 'application/json',
}

json_data = {
    'model': model_id,
    'prompt': 'New York City is the',
    'max_tokens': 7,
}

completion_response = requests.post(f'http://{ip_addr}:{port}/v1/completions', headers=tgi_headers, json=json_data)

print(completion_response.text)
```
```python openai
# Alternatively, you can call this with the OpenAI library, which requires `openai` installed
from openai import OpenAI

# Point the OpenAI client's API key and base URL at TGI's OpenAI-compatible server
openai_api_key = "EMPTY"
openai_api_base = f"http://{ip_addr}:{port}/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model=model_id,
                                       prompt='New York City is the',
                                       max_tokens=7)
print("Completion result:", completion)
```
</CodeGroup>
#### Watching with the Shadeform UI
Or, once we've made the request, we can watch the logs under [Running Instances](https://platform.shadeform.ai/instances). Once it is ready to serve, it should look something like this:

Happy Serving!
**Review comment:**
This should be called "Deleting GPU Instances".
Deleting and stopping are two different things in the cloud space. We support deleting but not stopping.