60 changes: 60 additions & 0 deletions guides/deleteinstance.mdx
@@ -0,0 +1,60 @@
---
title: 'Deleting GPU Instances'
icon: 'octagon'
iconType: 'solid'
---

> Note: deleting and stopping are two different things in the cloud space. Shadeform supports deleting instances, not stopping them.

### Intro
Deleting instances once your workloads finish is the easiest way to save money on GPU deployments. The Shadeform API lets you delete instances programmatically, which cuts costs and opens up new automated use cases.

### Setup
Let's assume you have a Shadeform API key from [here](https://platform.shadeform.ai/settings/api) and an active instance. If you don't have one running,
[this guide](/guides/mostaffordablegpus) will help you get set up.

### Deleting an instance:

Using the API, let's list our active instances:
<CodeGroup>
```python Python
import requests

base_url = "https://api.shadeform.ai/v1/instances"
headers = {
    "X-API-KEY": "<api-key>",
    "Content-Type": "application/json"
}
# List all of your active instances
instance_response = requests.get(base_url, headers=headers)
print(instance_response.text)
```
```bash cURL
curl --request GET \
--url https://api.shadeform.ai/v1/instances \
--header 'X-API-KEY: <api-key>'
```
</CodeGroup>


To delete this instance, we'll use the ID of the instance that we want to delete to make a delete request.
<CodeGroup>
```python Python
import json

instances = json.loads(instance_response.text)['instances']
# We'll delete the first instance in the list.
# Note that this requires at least one active instance.
instance_id = instances[0]['id']
delete_url = base_url + '/' + instance_id + '/delete'
delete_response = requests.post(delete_url, headers=headers)
print(delete_response)
```

```bash cURL
# Copy the id of the instance above that you'd like to delete
curl --request POST \
--url https://api.shadeform.ai/v1/instances/{id}/delete \
--header 'X-API-KEY: <api-key>'

```
</CodeGroup>

Just like that, we've deleted a running instance. While we had a human in the loop here, the same calls can run without human involvement once your compute jobs are finished.
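Once a workload can report its own completion, the two calls above compose into an unattended teardown. A minimal sketch, assuming a caller-supplied `job_is_finished` check plus the `base_url` and `headers` defined above:

```python
import time

import requests

def delete_url_for(base_url, instance_id):
    # Build the delete endpoint for a given instance id.
    return f"{base_url}/{instance_id}/delete"

def delete_when_finished(instance_id, job_is_finished, base_url, headers, poll_seconds=60):
    # Poll until the workload reports done, then delete the instance.
    while not job_is_finished():
        time.sleep(poll_seconds)
    return requests.post(delete_url_for(base_url, instance_id), headers=headers)
```

Run at the tail of a training script or on a scheduler, this removes the need to remember to clean up by hand.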

### Next Steps
We alluded to the fact that programmatically deleting instances enables new use cases. Be on the lookout as we showcase some of these here 👀.
4 changes: 2 additions & 2 deletions guides/mostaffordablegpus.mdx
@@ -15,7 +15,7 @@ In such cases, we provide an API that can help search for the most affordable GP

### Using the Instance Types API

We can find the most affordable GPU instances by applying filtering and sorting query params to the [Instance Types API](/api-reference/endpoint/instances-types).
We can find the most affordable GPU instances by applying filtering and sorting query params to the [Instance Types API](/api-reference/instances/instances-types).
Without any query parameters, the API returns all of the instance types supported by Shadeform in a random order.
By using the following query parameters, we can filter down the results and sort them as needed.
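As a sketch of what such a filtered call might look like, note that the endpoint path and the parameter names `gpu_type` and `sort` below are illustrative assumptions, not confirmed names; check the Instance Types API reference for the exact supported params:

```python
from urllib.parse import urlencode

base_url = "https://api.shadeform.ai/v1/instances/types"
# Hypothetical filter/sort parameters for illustration only.
params = {"gpu_type": "A6000", "sort": "price"}
query_url = f"{base_url}?{urlencode(params)}"
print(query_url)
# A GET with these params would then look like:
# requests.get(query_url, headers={"X-API-KEY": "<api-key>"})
```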

@@ -259,7 +259,7 @@ From array of results returned, we can see that the A6000 instance from RunPod i
```

### Launching the Instance
After you have found the most affordable instance, you can launch the instance by using the [Create Instance API](/api-reference/endpoint/instances-create).
After you have found the most affordable instance, you can launch the instance by using the [Create Instance API](/api-reference/instances/instances-create).
In the example below, we will launch the most affordable A6000 GPU based on the data retrieved from the API call in the previous section.
We will take the following properties from the previous response:

126 changes: 126 additions & 0 deletions guides/serveembeddings.mdx
@@ -0,0 +1,126 @@
---
title: 'Serve Embeddings'
icon: 'webhook'
iconType: 'solid'
---

### Intro
Embedding and re-ranker models are important backbones for RAG and other AI applications with chained calls to LLMs.
They help efficiently organize a body of knowledge and prepare it for an LLM to generate tokens.
Finetuning embeddings can also provide performance improvements over generic embedding models for your apps and workflows.
In this guide, we will show you how to serve an embedding model for online inference using `Text-Embeddings-Inference` from Hugging Face.
This framework runs inference very quickly and can handle large batch sizes.
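To make the RAG connection concrete, here is a minimal sketch of how embeddings organize knowledge: documents are ranked against a query by cosine similarity over their vectors. The toy two-dimensional vectors stand in for real model output:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for model output:
query_vec = [1.0, 0.0]
doc_vecs = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]}
ranked = sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]), reverse=True)
print(ranked)  # doc_a ranks above doc_b
```

A RAG pipeline does exactly this at scale, then hands the top-ranked documents to the LLM as context.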

### Setup
This guide builds on our guide for [finding the best gpu](/guides/mostaffordablegpus).

We have a python notebook ready to go for you to deploy this model, which you can [find here](https://github.com/shadeform/examples/blob/main/serving_embeddings_tei.ipynb).
The requirements are simple: in a python environment with `requests` installed, run:

```bash
git clone https://github.com/shadeform/examples.git
cd examples/
```
Then in `serving_embeddings_tei.ipynb` you will need to input your [Shadeform API Key](https://platform.shadeform.ai/settings/api).

### Configuring our Server

Once we have an instance, we deploy a `Text-Embeddings-Inference` container with this request payload.

```python
model_id = "BAAI/bge-base-en-v1.5"

# best_instance, region, and shade_instance_type come from the
# GPU-selection cells earlier in the notebook.
payload = {
    "cloud": best_instance["cloud"],
    "region": region,
    "shade_instance_type": shade_instance_type,
    "shade_cloud": True,
    "name": "text_embeddings_inference_server",
    "launch_configuration": {
        "type": "docker",
        "docker_configuration": {
            "image": "ghcr.io/huggingface/text-embeddings-inference",
            "args": "--model-id " + model_id,
            "port_mappings": [
                {
                    "container_port": 80,
                    "host_port": 8000
                }
            ]
        }
    }
}
```

Note that we are mapping the container's port 80 to the host's port 8000. `Text-Embeddings-Inference` by default serves on port 80, but many higher level frameworks expect the server to host on port 8000.
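A small illustrative helper (not part of the Shadeform API) showing how a client-facing URL follows from that mapping:

```python
def public_endpoint(ip, port_mappings, container_port=80):
    # Find the host port bound to the given container port.
    for m in port_mappings:
        if m["container_port"] == container_port:
            return f"http://{ip}:{m['host_port']}"
    raise ValueError("container port not mapped")

mappings = [{"container_port": 80, "host_port": 8000}]
print(public_endpoint("1.2.3.4", mappings))  # http://1.2.3.4:8000
```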

#### Tailoring your image to hardware version
Shadeform supports many different GPU versions and generations, as does `Text-Embeddings-Inference`. But not all features are available on every hardware version, and reliability issues may arise.
Luckily, Hugging Face publishes a list of [docker images](https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#docker-images) tuned to each hardware platform for the best performance and support.
We use the general base image here, as it works for the A6000 GPUs used in this guide and is somewhat flexible.

You can substitute the image that you need above in the `image` key in the payload.
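One way to wire that substitution up is sketched below; the image tags are assumptions based on the Hugging Face list linked above, so confirm them before use:

```python
# Illustrative mapping from GPU type to a TEI image tag; the tag names
# here are assumptions -- confirm against the Hugging Face list above.
TEI_IMAGES = {
    "A100":  "ghcr.io/huggingface/text-embeddings-inference:latest",
    "A6000": "ghcr.io/huggingface/text-embeddings-inference:86-latest",
    "L4":    "ghcr.io/huggingface/text-embeddings-inference:89-latest",
    "H100":  "ghcr.io/huggingface/text-embeddings-inference:hopper-latest",
}

def image_for(gpu_type):
    # Fall back to the general base image used in this guide.
    return TEI_IMAGES.get(gpu_type, "ghcr.io/huggingface/text-embeddings-inference")

print(image_for("A6000"))
```

The returned string drops straight into the `image` key of the payload above.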

### Serving Embeddings
We can deploy our server to this instance by POSTing the request like so:
```python
# create_url and headers are defined earlier in the notebook
response = requests.request("POST", create_url, json=payload, headers=headers)
# easy way to visually see if this request worked
print(response.text)
```

Once we request it, Shadeform will provision the machine, and deploy a docker container based on the image, arguments, and environment variables that we selected.
This might take 5-10 minutes depending on the machine and the size of the model weights you choose.
For more information on the API fields, check out the [Create Instance API Reference](/api-reference/instances/instances-create).

### Checking on our server

There are four main steps that we need to wait for: VM provisioning, image downloading and startup, spinning up `Text-Embeddings-Inference`, and downloading the model.
Luckily, `Text-Embeddings-Inference` is optimized for serverless use cases with a small image size, and embedding models are usually small compared to LLMs.

```python
instance_response = requests.request("GET", base_url, headers=headers)
ip_addr = ""
print(instance_response.text)
instance = json.loads(instance_response.text)["instances"][0]
instance_status = instance['status']
if instance_status == 'active':
    print(f"Instance is active with IP: {instance['ip']}")
    ip_addr = instance['ip']
else:
    print(f"Instance isn't yet active: {instance}")
```
This cell will print the IP address once the instance has provisioned. Generally the server is ready a few minutes after the IP shows up, once the image and model finish downloading.
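If you'd rather not re-run the cell by hand, the same listing call can be wrapped in a polling loop. A sketch, reusing `base_url` and `headers` from earlier (it returns the first *active* instance's IP rather than assuming `instances[0]` as the cell above does):

```python
import json
import time

import requests

def active_ip(instances_json):
    # Return the first active instance's IP, or None if nothing is ready yet.
    for inst in json.loads(instances_json)["instances"]:
        if inst["status"] == "active":
            return inst["ip"]
    return None

def wait_for_ip(base_url, headers, poll_seconds=30):
    # Re-run the same listing call until an instance reports active.
    while True:
        resp = requests.get(base_url, headers=headers)
        ip = active_ip(resp.text)
        if ip:
            return ip
        time.sleep(poll_seconds)
```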

Alternatively, once we've made the request, we can watch the logs under [Running Instances](https://platform.shadeform.ai/instances). Once it is ready to serve, it should look something like this:
![StartedServing](/images/startedserving_tei.png)

### Querying our Server

<CodeGroup>
```bash cURL
# Copy the ip address printed above; note the host port 8000 from the mapping
curl XX.XX.XXX.XX:8000/embed \
    -X POST \
    -d '{"inputs":"What is the meaning of life?"}' \
    -H 'Content-Type: application/json'
```

```python requests
server_headers = {
'Content-Type': 'application/json',
}

json_data = {
'inputs': 'What is the meaning of life?',
}

embedding_response = requests.post(f'http://{ip_addr}:8000/embed', headers=server_headers, json=json_data)

print(embedding_response.text)
```

</CodeGroup>

### What's Next
You can now serve your own embedding models for your RAG, search, and other AI applications on Shadeform, giving you more control over your applications at more affordable prices. Future guides will cover these use cases in more detail.
139 changes: 136 additions & 3 deletions guides/tgi.mdx
@@ -1,5 +1,138 @@
---
title: "Serve LLM Using TGI"
icon: "comment"
iconType: "solid"
---

### Intro

Model deployment is a very common GPU use case. [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference) (`TGI`) is a popular framework for running inference on LLMs from Hugging Face.

With Shadeform, it's easy to deploy models right to the most affordable GPUs on the market with just a few commands.

In this guide, we will deploy [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) with `TGI` onto an A6000. This guide closely mirrors our [vLLM guide](/guides/vllm), with a few changes to swap the inference framework. To follow along, clone our examples repo:

```bash
git clone https://github.com/shadeform/examples.git
cd examples/
```

Then in `serve_tgi.ipynb` you will need to input your [Shadeform API Key](https://platform.shadeform.ai/settings/api).

### Serving a Model

Once we have an instance, we deploy a model serving container with this request payload.

```python
model_id = "mistralai/Mistral-7B-v0.1"
port = 8000

# If the model you need requires authenticated access, paste your Hugging Face api key here
huggingface_token = ""

payload = {
    "cloud": best_instance["cloud"],
    "region": region,
    "shade_instance_type": shade_instance_type,
    "shade_cloud": True,
    "name": "text_generation_inference_server",
    "launch_configuration": {
        "type": "docker",
        "docker_configuration": {
            "image": "ghcr.io/huggingface/text-generation-inference:1.4",
            "args": "--model-id " + model_id + f" --port {port}",
            "envs": [],
            "port_mappings": [
                {
                    "container_port": 8000,
                    "host_port": 8000
                }
            ]
        }
    }
}

# Add another environment variable to the payload by appending a json object
if huggingface_token != "":
    token_env_json = {
        "name": "HUGGING_FACE_HUB_TOKEN",
        "value": huggingface_token
    }
    payload["launch_configuration"]["docker_configuration"]["envs"].append(token_env_json)

# request the best instance that is available
response = requests.request("POST", create_url, json=payload, headers=headers)
# easy way to visually see if this request worked
print(response.text)
```

Once we request it, Shadeform will provision the machine and deploy a docker container with the image, arguments, and environment variables provided. In this case, it will deploy an OpenAI-compatible server with `TGI` serving Mistral-7b-v0.1.
This might take 5-10 minutes depending on the machine and the size of the model weights you choose.

For more information on the API fields, check out the [Create Instance API Reference](/api-reference/instances/instances-create).

### Checking on our Model server

There are four main steps that we need to wait for: VM provisioning, image downloading and startup, spinning up `TGI`, and downloading the model.

```python
instance_response = requests.request("GET", base_url, headers=headers)
ip_addr = ""
print(instance_response.text)
instance = json.loads(instance_response.text)["instances"][0]
instance_status = instance['status']
if instance_status == 'active':
    print(f"Instance is active with IP: {instance['ip']}")
    ip_addr = instance['ip']
else:
    print(f"Instance isn't yet active: {instance}")
```

This cell will print the IP address once the instance has provisioned. However, the image needs to download, and `TGI` needs to download the model and spin up, which should take a few minutes.
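If you want to block until the server is up rather than retrying by hand, one option is to poll an HTTP endpoint. The sketch below assumes a `/health` route that returns 200 once the model is loaded; adjust the path if your `TGI` version differs:

```python
import time

import requests

def is_ready(status_code):
    # Treat only HTTP 200 as ready; 5xx (or a connection error) means still loading.
    return status_code == 200

def wait_for_server(ip_addr, port, timeout_s=600, poll_s=15):
    # Poll the assumed /health route until it answers 200 or we time out.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if is_ready(requests.get(f"http://{ip_addr}:{port}/health", timeout=5).status_code):
                return True
        except requests.exceptions.RequestException:
            pass  # server not up yet
        time.sleep(poll_s)
    return False
```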

#### Watch via the notebook

Once the model is ready, this code will return a completion for our prompt. We can use either `requests` or OpenAI's client library.

<CodeGroup>
```python requests
tgi_headers = {
    'Content-Type': 'application/json',
}

json_data = {
    'model': model_id,
    'prompt': 'New York City is the',
    'max_tokens': 7,
}

completion_response = requests.post(f'http://{ip_addr}:{port}/v1/completions', headers=tgi_headers, json=json_data)

print(completion_response.text)
```

```python openai
# Alternatively, you can call this with the OpenAI client library (requires `openai` installed)
from openai import OpenAI

# Modify OpenAI's API key and API base to use TGI's API server.
openai_api_key = "EMPTY"
openai_api_base = f"http://{ip_addr}:{port}/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model=model_id,
    prompt='New York City is the',
    max_tokens=7,
)
print("Completion result:", completion)
```

</CodeGroup>

#### Watching with the Shadeform UI
Alternatively, once we've made the request, we can watch the logs under [Running Instances](https://platform.shadeform.ai/instances). Once it is ready to serve, it should look something like this:
![StartedServing](/images/startedserving_tgi.png)

Happy Serving!
9 changes: 9 additions & 0 deletions guides/tutoriallibrary.mdx
Expand Up @@ -22,10 +22,19 @@ iconType: 'solid'
<Card title="Serve LLM using vLLM" icon="microchip" href="/guides/vllm">
Serve an LLM model using vLLM
</Card>
<Card title="Delete Active Instances" icon="octagon" href="/guides/deleteinstance">
Programmatically spin down instances
</Card>
### Frequent Use Cases
<Card title="Setup Firewalls using UFW" icon="block-brick-fire" href="/guides/firewall">
Provision network rules on your instance
</Card>
<Card title="Evaluate Language Models" icon="chart-simple" href="/guides/benchmark">
Evaluate language models using lm-eval-harness
</Card>
<Card title="Serve LLM using TGI" icon="comment" href="/guides/tgi">
Serve an LLM model using TGI
</Card>
<Card title="Serve Embedding and Re-Ranker Models" icon="webhook" href="/guides/serveembeddings">
Serve Embeddings using Text-Embeddings-Inference
</Card>