
Conversation


@Gastron Gastron commented Aug 15, 2025

No description provided.

Gastron and others added 30 commits July 15, 2025 16:37
…(#379)

* Implement LLM inference improvements for the Megatron 70B model patch

* precommit implementation

* Fix the patch for megatron 70B and move directory overrides one level down, now within helm/

	deleted:    workloads/llm-inference-megatron-lm/helm/mount/arguments.py
	modified:   workloads/llm-inference-megatron-lm/helm/mount/megatron-lm-inference-llama3-1-70b.patch
	modified:   workloads/llm-inference-megatron-lm/helm/templates/deployment.yaml

* Change pre-commit.yaml so the patch applies cleanly by excluding .patch files globally in every hook.

* Refactor pre-commit configuration to handle .patch exclusion globally

* Fix formatting inconsistencies in pre-commit configuration

* Update .pre-commit-config.yaml

---------

Co-authored-by: aivanni <4340981+aivanni@users.noreply.github.com>
… (#321)

* Add annotations for PVC auto-creation and configuration in deployment

* add pvc annotations for jupyterlab

* add auto-pvc for comfyui

* adjust vscode wl to new pvc protocol

* adjust jupyter wl to new pvc protocol

* adjust comfyui wf to new pvc protocol

* save models and workflows to user pvc in comfyui

---------

Co-authored-by: Andrey Ivannikov <andrey.ivannikov@silo.ai>
* Adds link to tutorial 03 in the tutorial list, exposing it on the documentation page

* Adds link to tutorial 03 in documentation page sidebar
* Use checkpoint-final for the full param (merged) model

* Not all shells support the `&>` redirect syntax; use the portable `> file 2>&1` form instead

* Use Lorenzo's file-count-reporting version of the MinIO storage check (sketched below)
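
As a rough illustration of what a file-count-reporting storage check might look like (a sketch only; the endpoint, bucket, and prefix names below are assumptions, not the workload's real values):

```python
# Hypothetical sketch of a file-count-reporting MinIO storage check; endpoint,
# credentials, bucket, and prefix are placeholders, not the workload's real values.
import os

from minio import Minio

client = Minio(
    os.environ["MINIO_ENDPOINT"],            # assumed env var, e.g. "minio.default.svc:9000"
    access_key=os.environ["MINIO_ACCESS_KEY"],
    secret_key=os.environ["MINIO_SECRET_KEY"],
    secure=False,
)

file_count = 0
total_bytes = 0
for obj in client.list_objects("checkpoints", prefix="checkpoint-final/", recursive=True):
    file_count += 1
    total_bytes += obj.size

print(f"checkpoint-final/: {file_count} files, {total_bytes / 2**30:.2f} GiB")
```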
…tep of tutorial (#381)

* update tutorial

* add option to download tokenizer from s3

* typo fix

* add suffix trim to the tokenizer id

* use correct path of the tokenizer in preprocessing script

* upload tokenizer on condition

* fix tutorial 0 links

* temporarily keep the nousresearch tokenizer so as not to break the tutorial
* Change snake_case to camelCase in multinode pretraining template

Signed-off-by: Robert Talling <rtalling@amd.com>

* Wrap rayjob template with kaiwo

Signed-off-by: Robert Talling <rtalling@amd.com>

* Update tokenizer path and labels

Signed-off-by: Robert Talling <rtalling@amd.com>

* Add fixes to pvc creation + kaiwo, update tokenizer path

Signed-off-by: Robert Talling <rtalling@amd.com>

* Disable kaiwo by default

Signed-off-by: Robert Talling <rtalling@amd.com>

* Disable kaiwo pvc creation and check

Signed-off-by: Robert Talling <rtalling@amd.com>

* Move overrides

Signed-off-by: Robert Talling <rtalling@amd.com>

* Remove empty line

Co-authored-by: aivanni <4340981+aivanni@users.noreply.github.com>

* Add flags to gc.sh

Signed-off-by: Robert Talling <rtalling@amd.com>

* Set ttlSecondsAfterFinished to 0

Signed-off-by: Robert Talling <rtalling@amd.com>

* Update conditions

Signed-off-by: Robert Talling <rtalling@amd.com>

* Update condition

Signed-off-by: Robert Talling <rtalling@amd.com>

---------

Signed-off-by: Robert Talling <rtalling@amd.com>
Co-authored-by: aivanni <4340981+aivanni@users.noreply.github.com>
* add kaiwo crd to single node megatron pretraining

* precommit

* add kaiwo crd to prepare megatron data job

* fix remoteCheckpointsPath

* precommit
* Judge MLFlow integration.

* fixed default image name and MLFlow experiment and run names (sketched after this list)

* Fixed and added pydoc.
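
A minimal sketch of configurable MLFlow experiment and run names, assuming the `mlflow` Python client; every environment variable name and default below is an assumption, not the workload's actual configuration:

```python
# Hypothetical sketch of configurable MLFlow experiment and run names; the
# env var names and defaults are illustrative, not the workload's actual ones.
import os

import mlflow

tracking_uri = os.environ.get("MLFLOW_TRACKING_URI", "")
if tracking_uri:
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(os.environ.get("MLFLOW_EXPERIMENT_NAME", "judge-evaluation"))
    with mlflow.start_run(run_name=os.environ.get("MLFLOW_RUN_NAME", "judge-run")):
        mlflow.log_param("judge_model", "placeholder-model-id")
else:
    # No tracking URI configured: skip MLFlow rather than fail on a bad default.
    print("MLFLOW_TRACKING_URI is empty; skipping MLFlow logging")
```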
* Set values for dev center

* Fix hf:// prefix

* Llama 3.1 8B Instruct

* Llama 3.1 70B Instruct

* Don't use exclude if not set

* Qwen 2.5 7B Instruct

* Add DeepSeek-R1-Distill-Llama-8B finetuning BKC

* Argilla data override

* Qwen family, larger ones LoRA

* Add mergeadapter true

* Add deepseek-distill-qwen models

* Cohere Aya models

* Qwen 3 models

* Increase memory

* Llama 3.2 small models

* Add DeepSeek-R1-0528-Qwen3-8B

* Add google gemma-3-1b-it

* Add mixtral

* v0.6 tag

* Better download sizes

* Script to create download overrides (sketched after this list)

* Download overrides

---------

Co-authored-by: Emil Eirola <emil.eirola@amd.com>
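
A hedged sketch of what such an override-generating script could look like; the model IDs, storage sizes, and YAML keys are illustrative placeholders, not the repository's actual values:

```python
# Hypothetical sketch of a script that generates per-model download override
# files; model IDs, storage sizes, and YAML keys are illustrative only.
import pathlib

import yaml

MODELS = {
    "meta-llama/Llama-3.1-8B-Instruct": "20Gi",
    "Qwen/Qwen2.5-7B-Instruct": "18Gi",
}

out_dir = pathlib.Path("overrides/download")
out_dir.mkdir(parents=True, exist_ok=True)

for model_id, storage in MODELS.items():
    override = {
        "modelId": model_id,          # assumed override key
        "storageQuantity": storage,   # assumed key for the download PVC size
    }
    file_name = model_id.lower().replace("/", "--").replace(".", "-") + ".yaml"
    (out_dir / file_name).write_text(yaml.safe_dump(override, sort_keys=False))
```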
* Add tutorial for llama70b multinode training with dddp2

Signed-off-by: Robert Talling <rtalling@amd.com>

* Minor fix in tutorial

Signed-off-by: Robert Talling <rtalling@amd.com>

* Update override name

Signed-off-by: Robert Talling <rtalling@amd.com>

* Update override name

Signed-off-by: Robert Talling <rtalling@amd.com>

* small fixes to the tutorial

* changes to toc

* rename tutorial 04 md file

* swap the steps in the tutorial

* yaml ext fix

* smaller corrections to the 03 tutorial

* smaller corrections to 04 tutorial

* Move readmes

Signed-off-by: Robert Talling <rtalling@amd.com>

* extension fix

---------

Signed-off-by: Robert Talling <rtalling@amd.com>
Co-authored-by: Robert Talling <rtalling@amd.com>
Co-authored-by: Andrey Ivannikov <andrey.ivannikov@silo.ai>
* Remove default MLFlow IP. Empty string means that MLFlow artifacts are not used.

* Remove special characters from the job name (see the sketch after this commit list).

* Add model dir mount to evaluation container.

* Add logging.

This reverts commit 13c7ba0987b2e48119ae66fad195cc9e7184563f.

* Add local model dir path for evaluation.

* Fix a wrong parameter name.

* Comment out unused path prefix removals.
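
A small sketch of the job-name sanitization idea from the commit above, so the result is a valid Kubernetes resource name; the function name and exact rules are assumptions:

```python
# Hypothetical sketch of removing special characters from a job name so it is
# a valid Kubernetes resource name (lowercase alphanumerics and dashes).
import re

def sanitize_job_name(name: str, max_len: int = 63) -> str:
    """Lowercase, collapse disallowed characters to '-', and trim the length."""
    cleaned = re.sub(r"[^a-z0-9-]+", "-", name.lower()).strip("-")
    return cleaned[:max_len].rstrip("-")

print(sanitize_job_name("Metrics Eval (billsum) #1"))  # -> "metrics-eval-billsum-1"
```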
* evaluation logging: judge documents progress

* evaluation logging: add rocm-smi GPU display in template

* evaluation logging: add batch progress

* evaluation logging: add inference progress to metrics workload

* evaluation logging: waiting time judge

* evaluation logging: spelling nit

* evaluation logging: also batch_number for judgments

* evaluation logging: readability fix

* evaluation logging: batch number easier to catch

* evaluation logging: tqdm over inference container batches (sketched after this list)

* evaluation logging: fix the use_subset check - reject negative values from argparse and fix the if statement

* evaluation logging: fix num_judge_inferences and judged_documents_counter
Signed-off-by: Robert Talling <rtalling@amd.com>
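
A minimal sketch of the tqdm-based batch progress described above; the function and logger setup are placeholders, not the evaluation workload's actual code:

```python
# Hypothetical sketch of tqdm progress over inference batches, with the batch
# number also logged so it is easy to catch in plain pod logs.
import logging

from tqdm import tqdm

logger = logging.getLogger(__name__)

def run_inference(batches: list) -> None:
    for batch_number, batch in enumerate(tqdm(batches, desc="inference batches"), start=1):
        logger.info("=== inference batch %d / %d ===", batch_number, len(batches))
        ...  # placeholder for the actual model call over `batch`
```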
* Add s3 support for Metrics Evaluation (Go Template).

* Add job name and use_data_subset to overrides, add MinIO env vars.
… API calls (#358)

* bugfix: service auto-discovery now handles exceptions during Kubernetes API calls (see the sketch after this commit list)

* Update workloads/dev-chatui-openwebui/helm/mount/get_openai_api_base_urls.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update workloads/dev-chatui-openwebui/helm/mount/get_openai_api_base_urls.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update workloads/dev-chatui-openwebui/helm/mount/get_openai_api_base_urls.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: add missing logging

---------

Co-authored-by: Mark van Heeswijk <mark.vanheeswijk@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
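
A sketch of the exception-handling pattern this bugfix describes, using the official `kubernetes` Python client; the label selector and URL format are assumptions, not the actual logic of get_openai_api_base_urls.py:

```python
# Hypothetical sketch of service auto-discovery that survives Kubernetes API
# errors instead of crashing; selector and port are placeholders.
import logging

from kubernetes import client, config
from kubernetes.client.rest import ApiException

logger = logging.getLogger(__name__)

def discover_openai_services() -> list[str]:
    """Return base URLs for discovered services, or an empty list on API failure."""
    try:
        config.load_incluster_config()
        v1 = client.CoreV1Api()
        services = v1.list_service_for_all_namespaces(label_selector="app=vllm")
    except ApiException as exc:
        logger.warning("Kubernetes API call failed (%s); returning no services", exc.status)
        return []
    return [
        f"http://{svc.metadata.name}.{svc.metadata.namespace}.svc:8000/v1"
        for svc in services.items
    ]
```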
…kload specifications (#405)

* Add download configurations for various models based on inference workload specifications

* update model configuration and update storage quantities

* validated configs and updated storage quantities

* Update storage quantities

* revert changes in tutorial configs

* Update storage quantities to optimize resource allocation

* revert some changes on example configs

* Remove deprecated Llama model configuration files

* Set 880G for all r1 models
…b models (#410)

* Add configuration files for openai_gpt-oss-120b and openai_gpt-oss-20b models

* Add flag "disable-log-requests" to command line.

* add a comment and "no-enable-prefix-caching"
* Dev-Center overrides
* Add HF token to download workload dev center overrides
* Use modelId instead of modelID
* Disable tensorboard only in dev-center overrides
* Rename odia model for consistency
* Restore the tutorials in download workload (but with modelId)
* Don't change token in values.yaml
* adding option to start pretraining from scratch

* add overrides for training from scratch

* fix variable naming

* Update workloads/llm-pretraining-megatron-lm-ray/helm/templates/_helpers.tpl

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update workloads/llm-pretraining-megatron-lm-ray/helm/templates/ray_job.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* results: rm unused saved_results list, rename local_save_inference_results, rm return of unused path string

* results: rename save and read to read_local

* results: rename to minio_save, minio_results_dir_path, some logging, create two separate local and minio results dirs

* results: change argparser to minio_output_dir

* results: parameterize metrics minio_output_dir_path

* results: new save_json_object_to_minio function call (helper sketched after this list)

* results: save_json_object call

* results: summary results

* results: prompts, generations and all_scores

* results: refactor client names and create minio client once

* results: remove obsolete results function

* results: summary scores dict for judge, new results writing for run metrics main

* results: other dicts for judge minio results

* results: serializable json lists

* results: correct logging

* results: fix tolist

* results: create full prompts

* results: prompts variable fix

* results: fix judge path

* results: correct to local for judge results

* results: judge full prompts

* results: judge fix full prompt and prompt template storage

* results: data classes judge prompt name fix

* Refactor the inference results dictionary to harmonise with the judge results, save judge results locally, and copy local inference and judge results to MinIO.

* Refactor results minio copying to function. Fix changed dictionary key.

* Copy inference results to MinIO

* pre-commit

---------

Co-authored-by: Mikko Vilenius <mvileniu@amd.com>
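
A hedged sketch of what the `save_json_object_to_minio` helper named above might look like, using the `minio` Python SDK; the real signature may differ:

```python
# Hypothetical sketch of a save_json_object_to_minio helper, named after the
# commit above; the actual signature may differ. Assumes the `minio` SDK.
import io
import json

from minio import Minio

def save_json_object_to_minio(client: Minio, bucket: str, object_path: str, obj) -> None:
    """Serialize `obj` as JSON and upload it to MinIO at `object_path`."""
    payload = json.dumps(obj, indent=2).encode("utf-8")
    client.put_object(
        bucket,
        object_path,
        data=io.BytesIO(payload),
        length=len(payload),
        content_type="application/json",
    )
```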
* fix inference tokenizer arg and tutorials

* Update docs/tutorials/tutorial-04-deliver-llama70b-and-run-megatron-cpt-with-tp8-ddp2.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/tutorials/tutorial-03-deliver-resources-and-run-megatron-cpt.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…#424)

* First attempt in harmonising the override and values for helm chart

* pre-commit checks

* Updates to readmes

* Cleaning up judge evaluation overrides / values.

* Cleaning up judge evaluation overrides / values. Corrections.

* Cleaning up metrics evaluation overrides / values.

* Added missing model path prefixes and changed MLFlow experiment and run names.

* pre-commit fixes

* overrides: metrics overrides

* fix id_column_name bug

* billsum job name

* billsum id fix

---------

Co-authored-by: Sander Bijl de Vroe <Sander.BijldeVroe@amd.com>
* overrides renames

* overrides case sensitive renames

* overrides renames consistency with other workloads
* minio client error handling (illustrated after this commit list)

* fix: replace mc cp with mc mirror for model download and improve error logging

* Update workloads/llm-inference-vllm/helm/templates/_entrypoint.tpl

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update workloads/llm-inference-vllm/helm/templates/_entrypoint.tpl

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Mark van Heeswijk <mark.vanheeswijk@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
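
The actual fix lives in the Helm entrypoint template and uses the `mc` CLI (`mc mirror`); as an analogous illustration of the error-handling idea in Python, assuming the `minio` SDK:

```python
# Analogous Python illustration of MinIO download error handling; the actual
# workload uses the `mc` CLI (`mc mirror`) inside the Helm entrypoint template.
import sys

from minio import Minio
from minio.error import S3Error

def download_model(client: Minio, bucket: str, prefix: str, dest: str) -> None:
    try:
        for obj in client.list_objects(bucket, prefix=prefix, recursive=True):
            client.fget_object(bucket, obj.object_name, f"{dest}/{obj.object_name}")
    except S3Error as exc:
        # Surface the MinIO error code instead of failing silently.
        print(f"model download failed: {exc.code}: {exc.message}", file=sys.stderr)
        sys.exit(1)
```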
* Add HF_TOKEN environment variable to model configuration files

* Add max-model-len configuration for Llama-3.1 70b models

* Update GPU allocation and storage quantity for Mixtral-8x22B model configuration

* update configuration for Mixtral-8x22B sglang

* remove duplicated OdiaGenAI config

* Increased GPU allocation to 4 for Mixtral-8x22B

---------

Co-authored-by: Bo Zhang <Bo.Zhang2@amd.com>
* Add mkdocs entries for missing workloads

* Fix list: remove extra entries left over from an unclean repo
@Gastron Gastron merged commit 1db2e3e into main Aug 18, 2025
1 check passed
