Draft
* Restoring deleted .buildkite/test-template.j2
* Enabling agents for split HW
---------
Co-authored-by: Alexei Ivanov <alivanov@gpu9448.jax.cs.cpe.ice.amd.com>
#193)
* [Grok-1] 1. upload moe configuration file for moe kernel optimization 2. support "--num-scheduler-steps" in benchmark_latency.py
* [Grok-1] 1. upload moe configuration file for moe kernel optimization 2. add copy of benchmark_latency.py to support "--num-scheduler-steps"
* [Grok-1] add option num-scheduler-steps in benchmark_latency.py
* Removing the original text in reminder_comment.yml and manually testing the disabled GitHub action that launches this script.
* Delete .github/workflows/reminder_comment.yml
* update custom PA kernel with support for fp8 kv cache dtype; change custom PA partition size to 512 to prefer throughput scenarios at cost of latency
* Fix lint
* Fix BF16 with FP8 KV cache (scaled conversion incorrectly done in fp16)
* Fix custom PA tests
* Merge branch 'main' of git@github.com:ROCm/vllm.git into mawong/fix_custom_pa_tests
* Fix partition sizes for PAv2, PAcustom
* Fix linting
* Fix a few names and variable scopes
* Rename custom to rocm as per suggestion
---------
Co-authored-by: Shomy Sanyal <shomy.sanyal@amd.com>
* Adding P3L measurement to the benchmarks collection tools. * . * . * . * .
* Fixing incompatibility with cython.
* Change type as per reviewer suggestions
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
---------
Co-authored-by: Matt Wong <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
* Adding bias to hipb_mm in gradlib. Expanding gradlib to tune based on bias and dtype captured from the actual capture run
Upstream merge 24 9 23
…ero-sized tensor, on which skinny gemm fails (#204)
…llm-project#7995)" (#207) This reverts commit 34a0e96.
…hen num-scheduler-steps>1 (#210)
* extend moe padding to DUMMY weights
…e requirements-build Changing back PromptType to PromptInputs following refactoring revert
* Enable RPD for single/multi gpu
Co-authored-by: AdrianAbeyta <adrian.abeyta@amd.com>
* Add rpd build instructions to Dockerfile.rocm
* Handle env path
* Fix code errors
* Move RPD based profiling over to profiling folder
* use envs vs os.getenv
---------
Co-authored-by: AdrianAbeyta <adrian.abeyta@amd.com>
* adding cython into docker file with flag * correcting if
* integrate new cpa kernel, update tests and benchmark
* added comments to mfma4 kernel
* further comments for mfma16 kernel
* clang-format
* Lint
* add flag for logits rtz conversion and disable by default
* lint
* [Bugfix]: Fix paged attention unit tests of #372 (#389)
* [Bugfix]: fix paged attention tests based on the updated kernels in `csrc/attention/paged_attention_v1.cu`, `csrc/attention/paged_attention_v2.cu` and `csrc/rocm/attention.cu`.
* improve code documentation.
* lint
---------
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
---------
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
… quant is not supported Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
* Aiter section
* Aiter section in docker
* Enablement
* Only exposing a single knob
* More details on env defaults
…pstream_merge_25_02_03
Correct initial values
Upstream merge 25 02 03
* Enabling P3L.py & P3L_mling.py tests to run with multiple batched queries. This alteration adds minimal measurement noise. The underlying testing material is the same, so the resulting measurements are comparable to the old (BS=1) testing runs.
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
* Making linters happy.
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
* Changed the device specification for the 'forced_sample' tensor. The resulting implementation produces identical measurements and is actually faster (3.21s/it vs 3.42s/it with the previous commit).
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
* Fixing reporting to reflect processed intervals.
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
---------
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
1d2c43d to eb9d4de
DO NOT MERGE.
TESTING K8s agents with different gpu sizes