Update main 250815 #5
Merged
Conversation
…(#379)
* Implement LLM inference improvements for patching the 70B model
* Pre-commit implementation
* Fix the patch for Megatron 70B and move directory overrides one level down, now within helm/:
  * deleted: workloads/llm-inference-megatron-lm/helm/mount/arguments.py
  * modified: workloads/llm-inference-megatron-lm/helm/mount/megatron-lm-inference-llama3-1-70b.patch
  * modified: workloads/llm-inference-megatron-lm/helm/templates/deployment.yaml
* Change pre-commit.yaml so the patch runs successfully; this excludes .patch files globally in every hook (see the sketch below)
* Refactor the pre-commit configuration to handle the .patch exclusion globally
* Fix formatting inconsistencies in the pre-commit configuration
* Update .pre-commit-config.yaml

Co-authored-by: aivanni <4340981+aivanni@users.noreply.github.com>
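On the global .patch exclusion: pre-commit supports a top-level `exclude` pattern that applies to every hook. A minimal sketch of the idea (the hooks and revision shown here are illustrative, not the repo's actual configuration):

```yaml
# .pre-commit-config.yaml -- illustrative sketch, not the repo's exact file.
# A top-level `exclude` is applied to every hook, so formatters never
# rewrite .patch files and corrupt their diff context.
exclude: '\.patch$'
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0          # example revision
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
```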
…ation page (#384)
… (#321)
* Add annotations for PVC auto-creation and configuration in deployment (see the sketch below)
* Add PVC annotations for JupyterLab
* Add auto-PVC for ComfyUI
* Adjust the VSCode workload to the new PVC protocol
* Adjust the Jupyter workload to the new PVC protocol
* Adjust the ComfyUI workload to the new PVC protocol
* Save models and workflows to the user PVC in ComfyUI

Co-authored-by: Andrey Ivannikov <andrey.ivannikov@silo.ai>
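The "PVC protocol" above is annotation-driven: the chart reads annotations and provisions a PersistentVolumeClaim for the workload. A hedged sketch of the shape, with entirely hypothetical annotation keys (the repo's actual keys and defaults may differ):

```yaml
# Hypothetical annotation keys -- the workloads' real PVC protocol may
# use different names and defaults.
metadata:
  annotations:
    pvc.auto-create/enabled: "true"         # ask the chart to create a PVC
    pvc.auto-create/size: "100Gi"           # requested storage
    pvc.auto-create/mount-path: /workspace  # where the user volume is mounted
```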
* Adds a link to tutorial 03 in the tutorial list, exposing it on the documentation page
* Adds a link to tutorial 03 in the documentation page sidebar
* Use checkpoint-final for the full-parameter (merged) model
* Not all shells support the &> redirect syntax (see the sketch below)
* Use Lorenzo's file-count-reporting version of the MinIO storage check
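On the &> note: `&>` is bash shorthand for redirecting both stdout and stderr, and POSIX sh (the default shell in many minimal container images) does not understand it. A sketch of the portable form in a container spec; the script name and log path are hypothetical:

```yaml
# Illustrative container fragment. `cmd &> file` is bash-only;
# `cmd > file 2>&1` works in any POSIX shell.
command: ["/bin/sh", "-c"]
args:
  - check_minio_storage.sh > /tmp/check.log 2>&1
```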
…tep of tutorial (#381)
* Update tutorial
* Add option to download the tokenizer from S3
* Typo fix
* Add suffix trim to the tokenizer ID
* Use the correct path of the tokenizer in the preprocessing script
* Upload the tokenizer conditionally
* Fix tutorial 0 links
* Temporarily keep using the NousResearch tokenizer so as not to break the tutorial
* Change snake_case to camelCase in the multinode pretraining template
* Wrap the RayJob template with Kaiwo
* Update tokenizer path and labels
* Add fixes to PVC creation + Kaiwo, update tokenizer path
* Disable Kaiwo by default
* Disable Kaiwo PVC creation and check
* Move overrides
* Remove empty line
* Add flags to gc.sh
* Set ttlSecondsAfterFinished to 0 (see the sketch below)
* Update conditions
* Update condition

Signed-off-by: Robert Talling <rtalling@amd.com>
Co-authored-by: aivanni <4340981+aivanni@users.noreply.github.com>
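For context, KubeRay's RayJob exposes ttlSecondsAfterFinished; setting it to 0 tears the Ray cluster down as soon as the job completes. A minimal sketch, with a hypothetical name and entrypoint (the repo's template carries many more fields):

```yaml
# Minimal RayJob fragment (KubeRay). Name and entrypoint are hypothetical.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: multinode-pretraining
spec:
  entrypoint: python pretrain.py
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 0   # delete the cluster immediately after the job finishes
```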
* Add Kaiwo CRD to single-node Megatron pretraining
* Pre-commit
* Add Kaiwo CRD to the prepare-Megatron-data job
* Fix remoteCheckpointsPath
* Pre-commit
* Judge MLFlow integration
* Fixed default image name and the MLFlow experiment and run names
* Fixed and added pydoc
* Set values for dev center
* Fix hf:// prefix
* Llama 3.1 8B Instruct
* Llama 3.1 70B Instruct
* Don't use exclude if not set
* Qwen 2.5 7B Instruct
* Add DeepSeek-R1-Distill-Llama-8B finetuning BKC
* Argilla data override
* Qwen family, larger ones LoRA
* Add mergeadapter true
* Add deepseek-distill-qwen models
* Cohere Aya models
* Qwen 3 models
* Increase memory
* Llama 3.2 small models
* Add DeepSeek-R1-0528-Qwen3-8B
* Add google gemma-3-1b-it
* Add mixtral
* v0.6 tag
* Better download sizes
* Script to create download overrides
* Download overrides

Co-authored-by: Emil Eirola <emil.eirola@amd.com>
* Add tutorial for Llama 70B multinode training with ddp2
* Minor fix in tutorial
* Update override name
* Update override name
* Small fixes to the tutorial
* Changes to ToC
* Rename tutorial 04 md file
* Swap the steps in the tutorial
* yaml extension fix
* Smaller corrections to the 03 tutorial
* Smaller corrections to the 04 tutorial
* Move readmes
* Extension fix

Signed-off-by: Robert Talling <rtalling@amd.com>
Co-authored-by: Robert Talling <rtalling@amd.com>
Co-authored-by: Andrey Ivannikov <andrey.ivannikov@silo.ai>
* Remove default MLFlow IP; an empty string means MLFlow artifacts are not used
* Remove special characters from the job name
* Add model dir mount to the evaluation container
* Add logging. This reverts commit 13c7ba0987b2e48119ae66fad195cc9e7184563f
* Add local model dir path for evaluation
* Fix a wrong parameter name
* Comment out unused path prefix removals
* Evaluation logging: judge documents progress
* Evaluation logging: add rocm-smi GPU display in template
* Evaluation logging: add batch progress
* Evaluation logging: add inference progress to the metrics workload
* Evaluation logging: waiting time for the judge
* Evaluation logging: spelling nit
* Evaluation logging: also batch_number for judgments
* Evaluation logging: readability fix
* Evaluation logging: make the batch number easier to catch
* Evaluation logging: tqdm over inference container batches
* Evaluation logging: fix the use_subset check - don't allow negatives from argparse and fix the if statement
* Evaluation logging: fix num_judge_inferences and judged_documents_counter
Signed-off-by: Robert Talling <rtalling@amd.com>
* Add S3 support for Metrics Evaluation (Go Template)
* Add job name and use_data_subset to overrides, add MinIO env vars (see the sketch below)
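A hedged sketch of what the MinIO env vars in a Go-templated job spec could look like; the variable names, values keys, and secret name are all hypothetical:

```yaml
# Hypothetical names throughout -- the chart's actual keys may differ.
env:
  - name: MINIO_ENDPOINT
    value: {{ .Values.minio.endpoint | quote }}
  - name: MINIO_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: minio-credentials
        key: access-key
  - name: MINIO_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: minio-credentials
        key: secret-key
```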
… API calls (#358)
* Bugfix: service auto-discovery handles exceptions during Kubernetes API calls
* Update workloads/dev-chatui-openwebui/helm/mount/get_openai_api_base_urls.py (three review suggestions applied)
* Fix: add missing logging

Co-authored-by: Mark van Heeswijk <mark.vanheeswijk@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…kload specifications (#405)
* Add download configurations for various models based on inference workload specifications
* Update model configuration and update storage quantities
* Validated configs and updated storage quantities
* Update storage quantities
* Revert changes in tutorial configs
* Update storage quantities to optimize resource allocation
* Revert some changes in example configs
* Remove deprecated Llama model configuration files
* Set 880G for all R1 models
…b models (#410)
* Add configuration files for the openai_gpt-oss-120b and openai_gpt-oss-20b models
* Add the "disable-log-requests" flag to the command line
* Add a comment and "no-enable-prefix-caching" (see the sketch below)
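The two flags named above are vLLM server options. A sketch of how they could appear in a model configuration; the surrounding field names are illustrative, only the flags come from the commit:

```yaml
# Hypothetical config shape -- only the two flags are from the commit.
model: openai/gpt-oss-120b
extraArgs:
  - --disable-log-requests       # keep per-request payloads out of pod logs
  - --no-enable-prefix-caching   # turn off automatic prefix caching
```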
* Dev-Center overrides
* Add HF token to download workload dev-center overrides
* Use modelId instead of modelID
* Disable tensorboard only in dev-center overrides
* Rename odia model for consistency
* Restore the tutorials in the download workload (but with modelId)
* Don't change token in values.yaml
* Add option to start pretraining from scratch
* Add overrides for training from scratch
* Fix variable naming
* Update workloads/llm-pretraining-megatron-lm-ray/helm/templates/_helpers.tpl
* Update workloads/llm-pretraining-megatron-lm-ray/helm/templates/ray_job.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Results: remove unused saved_results list, rename local_save_inference_results, remove return of unused path string
* Results: rename save and read to read_local
* Results: rename to minio_save, minio_results_dir_path, some logging, create two separate local and MinIO results dirs
* Results: change argparse to minio_output_dir
* Results: parameterize metrics minio_output_dir_path
* Results: new save_json_object_to_minio function call
* Results: save_json_object call
* Results: summary results
* Results: prompts, generations and all_scores
* Results: refactor client names and create the MinIO client once
* Results: remove obsolete results function
* Results: summary scores dict for judge, new results writing for run metrics main
* Results: other dicts for judge MinIO results
* Results: serializable JSON lists
* Results: correct logging
* Results: fix tolist
* Results: create full prompts
* Results: prompts variable fix
* Results: fix judge path
* Results: correct to local for judge results
* Results: judge full prompts
* Results: judge fix for full prompt and prompt template storage
* Results: data classes judge prompt name fix
* Refactor the inference results dictionary to harmonise with judge results, save judge results locally, copy local inference and judge results to MinIO
* Refactor results MinIO copying into a function; fix a changed dictionary key
* Copy inference results to MinIO
* Pre-commit

Co-authored-by: Mikko Vilenius <mvileniu@amd.com>
* Fix inference tokenizer arg and tutorials
* Update docs/tutorials/tutorial-04-deliver-llama70b-and-run-megatron-cpt-with-tp8-ddp2.md
* Update docs/tutorials/tutorial-03-deliver-resources-and-run-megatron-cpt.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…#424)
* First attempt at harmonising the overrides and values for the Helm chart
* Pre-commit checks
* Updates to readmes
* Clean up judge evaluation overrides / values
* Clean up judge evaluation overrides / values; corrections
* Clean up metrics evaluation overrides / values
* Added missing model path prefixes and changed MLFlow experiment and run names
* Pre-commit fixes
* Overrides: metrics overrides
* Fix id_column_name bug
* billsum job name
* billsum ID fix

Co-authored-by: Sander Bijl de Vroe <Sander.BijldeVroe@amd.com>
* Overrides renames
* Overrides case-sensitive renames
* Overrides renames for consistency with other workloads
* MinIO client error handling
* Fix: replace mc cp with mc mirror for model download and improve error logging (see the sketch below)
* Update workloads/llm-inference-vllm/helm/templates/_entrypoint.tpl

Co-authored-by: Mark van Heeswijk <mark.vanheeswijk@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
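On mc cp vs mc mirror: mirror synchronizes an entire prefix, which is more robust for multi-file model downloads. A hedged sketch of the entrypoint logic; the alias, bucket, and paths are hypothetical:

```yaml
# Illustrative entrypoint fragment; alias, bucket and paths are made up.
command: ["/bin/sh", "-c"]
args:
  - |
    mc alias set minio "$MINIO_ENDPOINT" "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY"
    # mirror the whole model prefix from MinIO to the local model dir;
    # exit non-zero with a clear message if the transfer fails
    mc mirror minio/models/llama-3-1-70b /models/llama-3-1-70b \
      || { echo "model download failed" >&2; exit 1; }
```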
* Add the HF_TOKEN environment variable to model configuration files (see the sketch below)
* Add max-model-len configuration for the Llama 3.1 70B models
* Update GPU allocation and storage quantity for the Mixtral-8x22B model configuration
* Update configuration for Mixtral-8x22B SGLang
* Remove duplicated OdiaGenAI config
* Increase GPU allocation to 4 for Mixtral-8x22B

Co-authored-by: Bo Zhang <Bo.Zhang2@amd.com>
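A sketch of the HF_TOKEN and max-model-len additions as they might look in a model config; the secret name and the concrete length are illustrative:

```yaml
# Hypothetical fragment -- secret name and length value are examples only.
env:
  - name: HF_TOKEN             # required for gated checkpoints such as Llama 3.1
    valueFrom:
      secretKeyRef:
        name: hf-token
        key: token
extraArgs:
  - --max-model-len=32768      # cap context length so the KV cache fits in GPU memory
```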
* Add mkdocs entries for missing workloads
* Fix list: remove extra entries left over from an unclean repo
markvanheeswijk approved these changes on Aug 15, 2025.
alexander-aurell-amd approved these changes on Aug 18, 2025.