
Conversation


@Gastron Gastron commented Aug 15, 2025

No description provided.

Gastron and others added 30 commits July 15, 2025 16:37
…(#379)

* Implement LLM inference improvements for the Megatron 70B model patch

* precommit implementation

* Fix the patch for megatron 70B and move directory overrides one level down, now within helm/

	deleted:    workloads/llm-inference-megatron-lm/helm/mount/arguments.py
	modified:   workloads/llm-inference-megatron-lm/helm/mount/megatron-lm-inference-llama3-1-70b.patch
	modified:   workloads/llm-inference-megatron-lm/helm/templates/deployment.yaml

* Change pre-commit.yaml so the patch applies cleanly by excluding .patch files globally in every hook.

* Refactor pre-commit configuration to handle .patch exclusion globally

* Fix formatting inconsistencies in pre-commit configuration

* Update .pre-commit-config.yaml

---------

Co-authored-by: aivanni <4340981+aivanni@users.noreply.github.com>
… (#321)

* Add annotations for PVC auto-creation and configuration in deployment

* add pvc annotations for jupyterlab

* add auto-pvc for comfyui

* adjust vscode wl to new pvc protocol

* adjust jupyter wl to new pvc protocol

* adjust comfyui wf to new pvc protocol

* save models and workflows to user pvc in comfyui

---------

Co-authored-by: Andrey Ivannikov <andrey.ivannikov@silo.ai>
* Adds link to tutorial 03 in the tutorial list, exposing it on the documentation page

* Adds link to tutorial 03 in documentation page sidebar
* Use checkpoint-final for the full param (merged) model

* Not all shells support the `&>` redirect syntax; use the portable `> file 2>&1` form instead

* Use Lorenzo's file-count-reporting version of the MinIO storage check (sketched below)
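
As a rough illustration of what a file-count-reporting storage check might look like (a sketch only; the endpoint, bucket, and prefix names below are assumptions, not the workload's real values):

```python
# Hypothetical sketch of a file-count-reporting MinIO storage check; endpoint,
# credentials, bucket, and prefix are placeholders, not the workload's real values.
import os

from minio import Minio

client = Minio(
    os.environ["MINIO_ENDPOINT"],            # assumed env var, e.g. "minio.default.svc:9000"
    access_key=os.environ["MINIO_ACCESS_KEY"],
    secret_key=os.environ["MINIO_SECRET_KEY"],
    secure=False,
)

file_count = 0
total_bytes = 0
for obj in client.list_objects("checkpoints", prefix="checkpoint-final/", recursive=True):
    file_count += 1
    total_bytes += obj.size

print(f"checkpoint-final/: {file_count} files, {total_bytes / 2**30:.2f} GiB")
```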
…tep of tutorial (#381)

* update tutorial

* add option to download tokenizer from s3

* typo fix

* add suffix trim to the tokenizer id

* use correct path of the tokenizer in preprocessing script

* upload tokenizer on condition

* fix tutorial 0 links

* temporarily keep the nousresearch tokenizer so as not to break the tutorial
* Change snake_case to camelCase in multinode pretraining template

Signed-off-by: Robert Talling <rtalling@amd.com>

* Wrap rayjob template with kaiwo

Signed-off-by: Robert Talling <rtalling@amd.com>

* Update tokenizer path and labels

Signed-off-by: Robert Talling <rtalling@amd.com>

* Add fixes to pvc creation + kaiwo, update tokenizer path

Signed-off-by: Robert Talling <rtalling@amd.com>

* Disable kaiwo by default

Signed-off-by: Robert Talling <rtalling@amd.com>

* Disable kaiwo pvc creation and check

Signed-off-by: Robert Talling <rtalling@amd.com>

* Move overrides

Signed-off-by: Robert Talling <rtalling@amd.com>

* Remove empty line

Co-authored-by: aivanni <4340981+aivanni@users.noreply.github.com>

* Add flags to gc.sh

Signed-off-by: Robert Talling <rtalling@amd.com>

* Set ttlSecondsAfterFinished to 0

Signed-off-by: Robert Talling <rtalling@amd.com>

* Update conditions

Signed-off-by: Robert Talling <rtalling@amd.com>

* Update condition

Signed-off-by: Robert Talling <rtalling@amd.com>

---------

Signed-off-by: Robert Talling <rtalling@amd.com>
Co-authored-by: aivanni <4340981+aivanni@users.noreply.github.com>
* add kaiwo crd to single node megatron pretraining

* precommit

* add kaiwo crd to prepare megatron data job

* fix remoteCheckpointsPath

* precommit
* Judge MLFlow integration.

* fixed default image name and MLFlow experiment and run names (sketched after this list)

* Fixed and added pydoc.
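
A minimal sketch of configurable MLFlow experiment and run names, assuming the `mlflow` Python client; every environment variable name and default below is an assumption, not the workload's actual configuration:

```python
# Hypothetical sketch of configurable MLFlow experiment and run names; the
# env var names and defaults are illustrative, not the workload's actual ones.
import os

import mlflow

tracking_uri = os.environ.get("MLFLOW_TRACKING_URI", "")
if tracking_uri:
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(os.environ.get("MLFLOW_EXPERIMENT_NAME", "judge-evaluation"))
    with mlflow.start_run(run_name=os.environ.get("MLFLOW_RUN_NAME", "judge-run")):
        mlflow.log_param("judge_model", "placeholder-model-id")
else:
    # No tracking URI configured: skip MLFlow rather than fail on a bad default.
    print("MLFLOW_TRACKING_URI is empty; skipping MLFlow logging")
```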
* Set values for dev center

* Fix hf:// prefix

* Llama 3.1 8B Instruct

* Llama 3.1 70B Instruct

* Don't use exclude if not set

* Qwen 2.5 7B Instruct

* Add DeepSeek-R1-Distill-Llama-8B finetuning BKC

* Argilla data override

* Qwen family, larger ones LoRA

* Add mergeadapter true

* Add deepseek-distill-qwen models

* Cohere Aya models

* Qwen 3 models

* Increase memory

* Llama 3.2 small models

* Add DeepSeek-R1-0528-Qwen3-8B

* Add google gemma-3-1b-it

* Add mixtral

* v0.6 tag

* Better download sizes

* Script to create download overrides (sketched after this list)

* Download overrides

---------

Co-authored-by: Emil Eirola <emil.eirola@amd.com>
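
A hedged sketch of what such an override-generating script could look like; the model IDs, storage sizes, and YAML keys are illustrative placeholders, not the repository's actual values:

```python
# Hypothetical sketch of a script that generates per-model download override
# files; model IDs, storage sizes, and YAML keys are illustrative only.
import pathlib

import yaml

MODELS = {
    "meta-llama/Llama-3.1-8B-Instruct": "20Gi",
    "Qwen/Qwen2.5-7B-Instruct": "18Gi",
}

out_dir = pathlib.Path("overrides/download")
out_dir.mkdir(parents=True, exist_ok=True)

for model_id, storage in MODELS.items():
    override = {
        "modelId": model_id,          # assumed override key
        "storageQuantity": storage,   # assumed key for the download PVC size
    }
    file_name = model_id.lower().replace("/", "--").replace(".", "-") + ".yaml"
    (out_dir / file_name).write_text(yaml.safe_dump(override, sort_keys=False))
```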
* Add tutorial for llama70b multinode training with dddp2

Signed-off-by: Robert Talling <rtalling@amd.com>

* Minor fix in tutorial

Signed-off-by: Robert Talling <rtalling@amd.com>

* Update override name

Signed-off-by: Robert Talling <rtalling@amd.com>

* Update override name

Signed-off-by: Robert Talling <rtalling@amd.com>

* small fixes to the tutorial

* changes to toc

* rename tutorial 04 md file

* swap the steps in the tutorial

* yaml ext fix

* smaller corrections to the 03 tutorial

* smaller corrections to 04 tutorial

* Move readmes

Signed-off-by: Robert Talling <rtalling@amd.com>

* extension fix

---------

Signed-off-by: Robert Talling <rtalling@amd.com>
Co-authored-by: Robert Talling <rtalling@amd.com>
Co-authored-by: Andrey Ivannikov <andrey.ivannikov@silo.ai>
* Remove default MLFlow IP. Empty string means that MLFlow artifacts are not used.

* Remove special characters from the job name (see the sketch after this commit list).

* Add model dir mount to evaluation container.

* Add logging.

This reverts commit 13c7ba0987b2e48119ae66fad195cc9e7184563f.

* Add local model dir path for evaluation.

* Fix a wrong parameter name.

* Comment out unused path prefix removals.
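
A small sketch of the job-name sanitization idea from the commit above, so the result is a valid Kubernetes resource name; the function name and exact rules are assumptions:

```python
# Hypothetical sketch of removing special characters from a job name so it is
# a valid Kubernetes resource name (lowercase alphanumerics and dashes).
import re

def sanitize_job_name(name: str, max_len: int = 63) -> str:
    """Lowercase, collapse disallowed characters to '-', and trim the length."""
    cleaned = re.sub(r"[^a-z0-9-]+", "-", name.lower()).strip("-")
    return cleaned[:max_len].rstrip("-")

print(sanitize_job_name("Metrics Eval (billsum) #1"))  # -> "metrics-eval-billsum-1"
```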
* evaluation logging: judge documents progress

* evaluation logging: add rocm-smi GPU display in template

* evaluation logging: add batch progress

* evaluation logging: add inference progress to metrics workload

* evaluation logging: waiting time judge

* evaluation logging: spelling nit

* evaluation logging: also batch_number for judgments

* evaluation logging: readability fix

* evaluation logging: batch number easier to catch

* evaluation logging: tqdm over inference container batches (sketched after this list)

* evaluation logging: fix the use_subset check - reject negative values from argparse and fix the if statement

* evaluation logging: fix num_judge_inferences and judged_documents_counter
Signed-off-by: Robert Talling <rtalling@amd.com>
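
A minimal sketch of the tqdm-based batch progress described above; the function and logger setup are placeholders, not the evaluation workload's actual code:

```python
# Hypothetical sketch of tqdm progress over inference batches, with the batch
# number also logged so it is easy to catch in plain pod logs.
import logging

from tqdm import tqdm

logger = logging.getLogger(__name__)

def run_inference(batches: list) -> None:
    for batch_number, batch in enumerate(tqdm(batches, desc="inference batches"), start=1):
        logger.info("=== inference batch %d / %d ===", batch_number, len(batches))
        ...  # placeholder for the actual model call over `batch`
```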
* Add s3 support for Metrics Evaluation (Go Template).

* Add job name and use_data_subset to overrides, add MinIO env vars.
… API calls (#358)

* bugfix: service auto-discovery now handles exceptions during Kubernetes API calls (see the sketch after this commit list)

* Update workloads/dev-chatui-openwebui/helm/mount/get_openai_api_base_urls.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update workloads/dev-chatui-openwebui/helm/mount/get_openai_api_base_urls.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update workloads/dev-chatui-openwebui/helm/mount/get_openai_api_base_urls.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: add missing logging

---------

Co-authored-by: Mark van Heeswijk <mark.vanheeswijk@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
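
A sketch of the exception-handling pattern this bugfix describes, using the official `kubernetes` Python client; the label selector and URL format are assumptions, not the actual logic of get_openai_api_base_urls.py:

```python
# Hypothetical sketch of service auto-discovery that survives Kubernetes API
# errors instead of crashing; selector and port are placeholders.
import logging

from kubernetes import client, config
from kubernetes.client.rest import ApiException

logger = logging.getLogger(__name__)

def discover_openai_services() -> list[str]:
    """Return base URLs for discovered services, or an empty list on API failure."""
    try:
        config.load_incluster_config()
        v1 = client.CoreV1Api()
        services = v1.list_service_for_all_namespaces(label_selector="app=vllm")
    except ApiException as exc:
        logger.warning("Kubernetes API call failed (%s); returning no services", exc.status)
        return []
    return [
        f"http://{svc.metadata.name}.{svc.metadata.namespace}.svc:8000/v1"
        for svc in services.items
    ]
```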
…kload specifications (#405)

* Add download configurations for various models based on inference workload specifications

* update model configuration and update storage quantities

* validated configs and updated storage quantities

* Update storage quantities

* revert changes in tutorial configs

* Update storage quantities to optimize resource allocation

* revert some changes on example configs

* Remove deprecated Llama model configuration files

* Set 880G for all r1 models
…b models (#410)

* Add configuration files for openai_gpt-oss-120b and openai_gpt-oss-20b models

* Add flag "disable-log-requests" to command line.

* add a comment and "no-enable-prefix-caching"
* Dev-Center overrides
* Add HF token to download workload dev center overrides
* Use modelId instead of modelID
* Disable tensorboard only in dev-center overrides
* Rename odia model for consistency
* Restore the tutorials in download workload (but with modelId)
* Don't change token in values.yaml
* adding option to start pretraining from scratch

* add overrides for training from scratch

* fix variable naming

* Update workloads/llm-pretraining-megatron-lm-ray/helm/templates/_helpers.tpl

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update workloads/llm-pretraining-megatron-lm-ray/helm/templates/ray_job.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* results: rm unused saved_results list, rename local_save_inference_results, rm return of unused path string

* results: rename save and read to read_local

* results: rename to minio_save, minio_results_dir_path, some logging, create two separate local and minio results dirs

* results: change argparser to minio_output_dir

* results: parameterize metrics minio_output_dir_path

* results: new save_json_object_to_minio function call (helper sketched after this list)

* results: save_json_object call

* results: summary results

* results: prompts, generations and all_scores

* results: refactor client names and create minio client once

* results: remove obsolete results function

* results: summary scores dict for judge, new results writing for run metrics main

* results: other dicts for judge minio results

* results: serializable json lists

* results: correct logging

* results: fix tolist

* results: create full prompts

* results: prompts variable fix

* results: fix judge path

* results: correct to local for judge results

* results: judge full prompts

* results: judge fix full prompt and prompt template storage

* results: data classes judge prompt name fix

* Refactor the inference results dictionary to harmonise with the judge results, save judge results locally, and copy local inference and judge results to MinIO.

* Refactor results minio copying to function. Fix changed dictionary key.

* Copy inference results to MinIO

* pre-commit

---------

Co-authored-by: Mikko Vilenius <mvileniu@amd.com>
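
A hedged sketch of what the `save_json_object_to_minio` helper named above might look like, using the `minio` Python SDK; the real signature may differ:

```python
# Hypothetical sketch of a save_json_object_to_minio helper, named after the
# commit above; the actual signature may differ. Assumes the `minio` SDK.
import io
import json

from minio import Minio

def save_json_object_to_minio(client: Minio, bucket: str, object_path: str, obj) -> None:
    """Serialize `obj` as JSON and upload it to MinIO at `object_path`."""
    payload = json.dumps(obj, indent=2).encode("utf-8")
    client.put_object(
        bucket,
        object_path,
        data=io.BytesIO(payload),
        length=len(payload),
        content_type="application/json",
    )
```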
* fix inference tokenizer arg and tutorials

* Update docs/tutorials/tutorial-04-deliver-llama70b-and-run-megatron-cpt-with-tp8-ddp2.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/tutorials/tutorial-03-deliver-resources-and-run-megatron-cpt.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…#424)

* First attempt in harmonising the override and values for helm chart

* pre-commit checks

* Updates to readmes

* Cleaning up judge evaluation overrides / values.

* Cleaning up judge evaluation overrides / values. Corrections.

* Cleaning up metrics evaluation overrides / values.

* Added missing model path prefixes and changed MLFlow experiment and run names.

* pre-commit fixes

* overrides: metrics overrides

* fix id_column_name bug

* billsum job name

* billsum id fix

---------

Co-authored-by: Sander Bijl de Vroe <Sander.BijldeVroe@amd.com>
* overrides renames

* overrides case sensitive renames

* overrides renames consistency with other workloads
* minio client error handling (illustrated after this commit list)

* fix: replace mc cp with mc mirror for model download and improve error logging

* Update workloads/llm-inference-vllm/helm/templates/_entrypoint.tpl

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update workloads/llm-inference-vllm/helm/templates/_entrypoint.tpl

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Mark van Heeswijk <mark.vanheeswijk@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
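
The actual fix lives in the Helm entrypoint template and uses the `mc` CLI (`mc mirror`); as an analogous illustration of the error-handling idea in Python, assuming the `minio` SDK:

```python
# Analogous Python illustration of MinIO download error handling; the actual
# workload uses the `mc` CLI (`mc mirror`) inside the Helm entrypoint template.
import sys

from minio import Minio
from minio.error import S3Error

def download_model(client: Minio, bucket: str, prefix: str, dest: str) -> None:
    try:
        for obj in client.list_objects(bucket, prefix=prefix, recursive=True):
            client.fget_object(bucket, obj.object_name, f"{dest}/{obj.object_name}")
    except S3Error as exc:
        # Surface the MinIO error code instead of failing silently.
        print(f"model download failed: {exc.code}: {exc.message}", file=sys.stderr)
        sys.exit(1)
```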
* Add HF_TOKEN environment variable to model configuration files

* Add max-model-len configuration for Llama-3.1 70b models

* Update GPU allocation and storage quantity for Mixtral-8x22B model configuration

* update configuration for Mixtral-8x22B sglang

* remove duplicated OdiaGenAI config

* Increased GPU allocation to 4 for Mixtral-8x22B

---------

Co-authored-by: Bo Zhang <Bo.Zhang2@amd.com>
* Add mkdocs entries for missing workloads

* Fix list: remove extra entries left over from an unclean repo
@Gastron Gastron merged commit 1db2e3e into main Aug 18, 2025
1 check passed
