@Gastron (Collaborator) commented Jul 7, 2025

Bring latest changes from development repository here.

markvanheeswijk and others added 30 commits November 26, 2024 10:22
* Silogen finetuning through helm
* Remove imagePullSecret from example as it's not needed
* Refactor values.yaml and example configuration for clarity and updated paths

---------

Co-authored-by: Mark van Heeswijk <mark.vanheeswijk@amd.com>
userbz and others added 25 commits June 4, 2025 13:40
* parameterise model download url

* flux sdxl overrides

* convert entrypoint to template

* rename overrides

* update readme

* add flux1-schnell and sdxl-base overlays

* fix: rename

* update model config

* Add ingress/http_route templates

* improve model tag config

* one more model config

* readme update ingress

---------

Co-authored-by: Mark van Heeswijk <mark.vanheeswijk@amd.com>
* Add HTTP route and ingress configuration with schema updates

* Add support for configurable replicas in deployment and schema
* use rocm/pytorch image for faster deployment

* let comfy-cli handle requirements
* WIP: overrides for Qwen2.5-3B-Instruct model for inference.

* Removed -debug from image names.
* judge download: pull in helpers.tpl and main template changes

* judge download: working. Ugly dual bucket_storage_host fix to have different protocols in different containers

* judge download: pre-commit test

* judge download: pre-commit

* judge download: add protocol in helpers instead of two storage hosts

* judge download: move dataset protocol trimming from k8s to package code

* judge download: move dataset protocol trimming from k8s to package code2
* added mutate_manifest.py script that can be used to wrap some resources with Kaiwo equivalents

* added RayService and fixed RayJob spec_key in mutate_manifest.py
* Add Ray based Megatron-LM workload chart

Signed-off-by: Robert Talling <rtalling@amd.com>
Co-authored-by: aivanni <4340981+aivanni@users.noreply.github.com>
Signed-off-by: Robert Talling <rtalling@amd.com>
* Add Helm chart for MLflow tracking server deployment (user/project)

* mlflow readme update

* Enhance db configuration with secret management

* update readme

* readme update regarding url prefix

* update to remove url prefix setting

* fix s3 store issue and set default to local artifact store

* improve minio settings and set default artifacts to S3 again

* refactor: update README and scripts for artifact storage configuration and usage instructions

* add MinIO S3-compatible storage configuration for MLflow artifacts

* final touch before merge
* add readme for infinity embedding workload

* remove whitespace

* add overrides
* Add model and data loading from minio

* Add deepspeed config example

* Validate starting of sync process, escape and quote path argument

* Wait 1s before checking if sync process started

* Fix for quotes in checkpointsRemote

* Update readme and other edits for clarity

* Update workloads/llm-finetune-llama-factory/helm/README.md

Co-authored-by: Aku Rouhe <akurouhe@amd.com>

---------

Co-authored-by: Aku Rouhe <akurouhe@amd.com>
* veRL GRPO finetuning ROCm example workload

* Refactor and complete VeRL workload

* Fix comments and typos

---------

Co-authored-by: Emil Eirola <emil.eirola@amd.com>
* Add basic MLFlow export

* Upgrade ROCm image.

* Fix nested folders for artifacts on MLFlow EVEN BETTER!

* Fix extra f-string

---------

Co-authored-by: Sander Bijl de Vroe <Sander.BijldeVroe@amd.com>
…rmonise model names. Harmonise arg parser. (#354)

* Fix erroneously removed LLM client URL prefix.
* Quote paths and escape chars in mc mirror

* Fix handling of minio paths

* Fix handling of quotes in echo statements
* Add on-boarding documentation for pre-commit

* clarify cd in docs and fix <br />

* small edit
* WandB downloader

* Make it work

* Correct override name, always mount

* No ephemeral storage, just emptyDir
@sarooshsh sarooshsh self-requested a review July 8, 2025 08:38
@Brednas (Contributor) left a comment

Approved as discussed in daily

@Gastron Gastron changed the base branch from main to gh-pages July 8, 2025 10:19
@Gastron Gastron closed this Jul 8, 2025