generated from mintlify/starter
Add tei tgi delete instance #5
Open

aowen14 wants to merge 4 commits into `shadeform:main` from `aowen14:add-tei-tgi-delete-instance` (base: `main`)
**New file (+60 lines): "Stopping GPU Instances" guide**
---
title: 'Stopping GPU Instances'
icon: 'octagon'
iconType: 'solid'
---
### Intro
Stopping instances saves money on GPU deployments when you're not using them. The Shadeform API lets you stop instances programmatically, which cuts costs and enables new use cases.
### Setup
Let's assume that you have a Shadeform API key from [here](https://platform.shadeform.ai/settings/api) and an active instance. If you need an active instance, [this guide](/guides/mostaffordablegpus) will help you get set up.
### Deleting an instance

Using the API, let's list our active instances:
<CodeGroup>
```python Python
import json
import requests

base_url = "https://api.shadeform.ai/v1/instances"
headers = {
    "X-API-KEY": "<api-key>",
    "Content-Type": "application/json"
}
instance_response = requests.request("GET", base_url, headers=headers)
print(instance_response)
```
```bash cURL
curl --request GET \
  --url https://api.shadeform.ai/v1/instances \
  --header 'X-API-KEY: <api-key>'
```
</CodeGroup>
To delete an instance, we make a delete request with that instance's ID.
<CodeGroup>
```python Python
instances = json.loads(instance_response.text)['instances']
# We'll delete the first instance in the list.
# Note that this requires at least one active instance.
instance_id = instances[0]['id']
delete_url = base_url + '/' + instance_id + '/delete'
delete_response = requests.request("POST", delete_url, headers=headers)
print(delete_response)
```
```bash cURL
# Copy the instance id from above that you'd like to delete
curl --request POST \
  --url https://api.shadeform.ai/v1/instances/{id}/delete \
  --header 'X-API-KEY: <api-key>'
```
</CodeGroup>
Just like that, we've stopped a running instance. While we had a human in the loop here, this can all be done without human involvement once compute jobs are finished.
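For example, an automated cleanup could poll a completion check and delete the instance once the job reports done. Here's a minimal sketch under assumptions: `job_is_finished` is a hypothetical callback you would supply, and the HTTP call is injectable so the flow can be exercised without hitting the live API.

```python
import time
import requests

BASE_URL = "https://api.shadeform.ai/v1/instances"

def delete_instance(api_key, instance_id, post=requests.post):
    # POST to the /delete endpoint for this instance.
    # `post` is injectable for testing; it defaults to requests.post.
    url = f"{BASE_URL}/{instance_id}/delete"
    headers = {"X-API-KEY": api_key, "Content-Type": "application/json"}
    return post(url, headers=headers)

def cleanup_when_done(api_key, instance_id, job_is_finished,
                      poll_seconds=60, post=requests.post):
    # Poll a user-supplied completion check, then delete the instance.
    while not job_is_finished():
        time.sleep(poll_seconds)
    return delete_instance(api_key, instance_id, post=post)
```

You could run this at the end of a training script, so the GPU stops billing the moment your job completes.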
### Next Steps
We alluded to the fact that stopping instances enables new use cases. Be on the lookout as we showcase some of these here 👀.
**New file (+126 lines): "Serve Embeddings" guide**
---
title: 'Serve Embeddings'
icon: 'webhook'
iconType: 'solid'
---
### Intro
Embedding and re-ranker models are important backbones for RAG and other AI applications with chained calls to LLMs.
They help efficiently organize a body of knowledge and prepare it for an LLM to generate tokens.
Finetuning embeddings can also provide performance improvements over generic embedding models for your apps and workflows.
In this guide, we will show you how to serve an embedding model for online inference using `Text-Embeddings-Inference` from Hugging Face.
This framework runs inference very quickly and can handle large batch sizes.
### Setup
This guide builds on our guide for [finding the best GPU](/guides/mostaffordablegpus).

We have a Python notebook ready to go for deploying this model, which you can [find here](https://github.com/shadeform/examples/blob/main/serving_embeddings_tei.ipynb).
The requirements are simple, so in a Python environment with `requests` installed:
```bash
git clone https://github.com/shadeform/examples.git
cd examples/
```
Then in `basic_serving_embeddings_tei.ipynb` you will need to input your [Shadeform API Key](https://platform.shadeform.ai/settings/api).
### Configuring our Server

Once we have an instance, we deploy a `Text-Embeddings-Inference` container with this request payload.
```python
model_id = "BAAI/bge-base-en-v1.5"

payload = {
    "cloud": best_instance["cloud"],
    "region": region,
    "shade_instance_type": shade_instance_type,
    "shade_cloud": True,
    "name": "text_embeddings_inference_server",
    "launch_configuration": {
        "type": "docker",
        "docker_configuration": {
            "image": "ghcr.io/huggingface/text-embeddings-inference",
            "args": "--model-id " + model_id,
            "port_mappings": [
                {
                    "container_port": 80,
                    "host_port": 8000
                }
            ]
        }
    }
}
```
Note that we are mapping the container's port 80 to the host's port 8000. `Text-Embeddings-Inference` serves on port 80 by default, but many higher-level frameworks expect the server to host on port 8000.
#### Tailoring your image to hardware version
Shadeform supports multiple versions and generations of GPUs, as does `Text-Embeddings-Inference`. But not all features are available on every hardware version, and reliability issues may arise.
Luckily, Hugging Face publishes a list of [docker images](https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#docker-images) for each hardware platform for the best performance and support.
We use the general base image, as it works for the A6000 GPUs we're using in this guide and is somewhat flexible.

You can substitute the image that you need above in the `image` key in the payload.
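If you launch on several GPU types, you can keep this choice in one place. A minimal sketch, assuming a mapping you populate yourself from the TEI README (the tag names there are hardware-specific, so we don't hard-code any here):

```python
# General base image used in this guide; works for the A6000.
DEFAULT_IMAGE = "ghcr.io/huggingface/text-embeddings-inference"

def tei_image_for(gpu_type, image_by_gpu):
    # image_by_gpu maps GPU type -> image tag, copied from the TEI README.
    # Fall back to the general base image when no hardware-specific tag is listed.
    return image_by_gpu.get(gpu_type, DEFAULT_IMAGE)
```

The returned string would then go into the `image` key of the payload above.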
### Serving Embeddings
We can deploy our server to this instance by POSTing the request like so:
```python
response = requests.request("POST", create_url, json=payload, headers=headers)
# easy way to visually see if this request worked
print(response.text)
```
Once we request it, Shadeform will provision the machine and deploy a Docker container based on the image, arguments, and environment variables that we selected.
This may take 5-10 minutes depending on the machine chosen and the size of the model weights.
For more information on the API fields, check out the [Create Instance API Reference](/api-reference/instances/instances-create).
### Checking on our server

There are four main steps that we need to wait for: VM provisioning, image download and startup, spinning up `Text-Embeddings-Inference`, and downloading the model.
Luckily, `Text-Embeddings-Inference` is optimized for serverless use cases with a small image size, and embedding models are usually small compared to LLMs.
```python
instance_response = requests.request("GET", base_url, headers=headers)
ip_addr = ""
print(instance_response.text)
instance = json.loads(instance_response.text)["instances"][0]
instance_status = instance['status']
if instance_status == 'active':
    print(f"Instance is active with IP: {instance['ip']}")
    ip_addr = instance['ip']
else:
    print(f"Instance isn't yet active: {instance}")
```
This cell will print the IP address once the instance has provisioned. Generally, the server should be ready a few minutes after the IP shows up, once the image and model have downloaded.

Or, once we've made the request, we can watch the logs under [Running Instances](https://platform.shadeform.ai/instances). Once it is ready to serve, it should look something like this:

### Querying our Server

<CodeGroup>
```bash cURL
# Copy the IP address printed above; port 8000 is the host port we mapped
curl XX.XX.XXX.XX:8000/embed \
    -X POST \
    -d '{"inputs":"What is the meaning of life?"}' \
    -H 'Content-Type: application/json'
```
```python requests
server_headers = {
    'Content-Type': 'application/json',
}

json_data = {
    'inputs': 'What is the meaning of life?',
}

embedding_response = requests.post(f'http://{ip_addr}:8000/embed', headers=server_headers, json=json_data)

print(embedding_response.text)
```
</CodeGroup>
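The `/embed` endpoint returns JSON containing embedding vectors. A common next step in RAG is ranking documents against a query by cosine similarity. Here's a minimal, dependency-free sketch; the short vectors below are stand-ins for real `/embed` output:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    # Indices of doc_vecs sorted from most to least similar to the query.
    return sorted(range(len(doc_vecs)),
                  key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                  reverse=True)
```

In practice you would embed your query and documents with the server above, then feed the resulting vectors into `rank_by_similarity`.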
### What's Next
You can now serve your own embedding models for RAG, search, and other AI applications on Shadeform, giving you more control over your applications at more affordable prices. We'll cover more details on how to do this in future guides.
**Modified file: "Serve LLM Using TGI" guide**
---
title: "Serve LLM Using TGI"
icon: "comment"
iconType: "solid"
---
### Intro

Model deployment is a very common GPU use case. [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference) (`TGI`) is a popular framework for running inference on LLMs from Hugging Face.

With Shadeform, it's easy to deploy models to the most affordable GPUs on the market with just a few commands.

In this guide, we will deploy [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) with `TGI` onto an A6000. This guide is very similar to our [vLLM guide](/guides/vllm), with a few changes to swap the inference framework.
```bash
git clone https://github.com/shadeform/examples.git
cd examples/
```

Then in `serve_tgi.ipynb` you will need to input your [Shadeform API Key](https://platform.shadeform.ai/settings/api).
### Serving a Model

Once we have an instance, we deploy a model serving container with this request payload.
```python
model_id = "mistralai/Mistral-7B-v0.1"
port = 8000

# If the model you need requires authenticated access, paste your Hugging Face API key here
huggingface_token = ""

payload = {
    "cloud": best_instance["cloud"],
    "region": region,
    "shade_instance_type": shade_instance_type,
    "shade_cloud": True,
    "name": "text_generation_inference_server",
    "launch_configuration": {
        "type": "docker",
        "docker_configuration": {
            "image": "ghcr.io/huggingface/text-generation-inference:1.4",
            "args": "--model-id " + model_id + f" --port {port}",
            "envs": [],
            "port_mappings": [
                {
                    "container_port": 8000,
                    "host_port": 8000
                }
            ]
        }
    }
}

# If a token was provided, append it to the payload as an environment variable
if huggingface_token != "":
    token_env_json = {
        "name": "HUGGING_FACE_HUB_TOKEN",
        "value": huggingface_token
    }
    payload["launch_configuration"]["docker_configuration"]["envs"].append(token_env_json)

# request the best instance that is available
response = requests.request("POST", create_url, json=payload, headers=headers)
# easy way to visually see if this request worked
print(response.text)
```
Once we request it, Shadeform will provision the machine and deploy a Docker container with the image, arguments, and environment variables provided. In this case, it will deploy an OpenAI-compatible server with `TGI` serving Mistral-7b-v0.1.
This may take 5-10 minutes depending on the machine chosen and the size of the model weights.

For more information on the API fields, check out the [Create Instance API Reference](/api-reference/instances/instances-create).
### Checking on our Model server

There are four main steps that we need to wait for: VM provisioning, image download and startup, spinning up `TGI`, and downloading the model.
```python
instance_response = requests.request("GET", base_url, headers=headers)
ip_addr = ""
print(instance_response.text)
instance = json.loads(instance_response.text)["instances"][0]
instance_status = instance['status']
if instance_status == 'active':
    print(f"Instance is active with IP: {instance['ip']}")
    ip_addr = instance['ip']
else:
    print(f"Instance isn't yet active: {instance}")
```
This cell will print the IP address once the instance has provisioned. However, the image needs to download, and `TGI` needs to download the model and spin up, which should take a few minutes.
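Rather than re-running that cell by hand, you can wrap the status check in a small polling loop. A sketch under assumptions: `fetch_instances` is a hypothetical callable standing in for the GET request above, and both it and `sleep` are injectable so the loop can be tested offline.

```python
import time

def wait_for_active(fetch_instances, timeout_s=900, poll_s=30, sleep=time.sleep):
    # Poll until the first instance reports 'active'; return its IP, or None on timeout.
    waited = 0
    while waited <= timeout_s:
        instances = fetch_instances()
        if instances and instances[0].get("status") == "active":
            return instances[0].get("ip")
        sleep(poll_s)
        waited += poll_s
    return None
```

In the notebook, `fetch_instances` could simply wrap the `requests.request("GET", base_url, ...)` call and return the parsed `instances` list.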
#### Watch via the notebook

Once the model is ready, this code will output a response to our query. We can use either `requests` or OpenAI's completions library.
<CodeGroup>
```python requests
tgi_headers = {
    'Content-Type': 'application/json',
}

json_data = {
    'model': model_id,
    'prompt': 'New York City is the',
    'max_tokens': 7,
}

completion_response = requests.post(f'http://{ip_addr}:{port}/v1/completions', headers=tgi_headers, json=json_data)

print(completion_response.text)
```
```python openai
# Alternatively, you can call this with the OpenAI library, which requires `openai` installed
from openai import OpenAI

# Point the OpenAI client's API key and base URL at TGI's OpenAI-compatible server
openai_api_key = "EMPTY"
openai_api_base = f"http://{ip_addr}:{port}/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model=model_id,
                                       prompt='New York City is the',
                                       max_tokens=7)
print("Completion result:", completion)
```
</CodeGroup>
#### Watching with the Shadeform UI
Or, once we've made the request, we can watch the logs under [Running Instances](https://platform.shadeform.ai/instances). Once it is ready to serve, it should look something like this:

Happy Serving!
**Review comment:**
This should be called "Deleting GPU Instances".
Deleting and stopping are two different things in the cloud space. We support deleting but not stopping.