|
✅ Test results for PR #927
Test logs:
=== STATUS: Programs ran successfully: main_matrix_transpose, main_matrix_multiply ===
=== main_matrix_transpose stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 8.31939 sec (CUDA: 0.113937 sec, OpenCL: 0.707783 sec, Vulkan: 7.49761 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using CUDA API...
Matrix size: rows=H=8192 x cols=W=16384 (512 MB)
______________________________________________________
Evaluating algorithm #1/2: 01 naive transpose (non-coalesced)
algorithm times (in seconds) - 10 values (min=0.0236044 10%=0.0236145 median=0.0238951 90%=0.0255099 max=0.0255099)
median effective algorithm bandwidth: 41.8495 GB/s
______________________________________________________
Evaluating algorithm #2/2: 02 transpose via local memory (coalesced)
algorithm times (in seconds) - 10 values (min=0.00817343 10%=0.00817558 median=0.00818427 90%=0.00830092 max=0.00830092)
median effective algorithm bandwidth: 122.186 GB/s
=== main_matrix_multiply stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.304857 sec (CUDA: 0.126677 sec, OpenCL: 0.0388888 sec, Vulkan: 0.139232 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using CUDA API...
C = A x B, matrices size: C (rows=H=2048 x cols=W=4096) = A (rows=H=2048 x cols=K=1024) x B (rows=K=1024 x cols=W=4096)
matrices data size: A - 8 MB, B - 16 MB, C - 16 MB
______________________________________________________
Evaluating algorithm #1/3: CPU with OpenMP
algorithm times (in seconds) - 1 values (min=11.4562 10%=11.4562 median=11.4562 90%=11.4562 max=11.4562)
algorithm GFlops: 1.49889 GFlops
algorithm effective memory bandwidth: 0.00477364 GB/s
______________________________________________________
Evaluating algorithm #2/3: 01 naive
algorithm times (in seconds) - 10 values (min=0.0590775 10%=0.0597447 median=0.0611722 90%=0.0652011 max=0.0652011)
algorithm GFlops: 280.707 GFlops
algorithm effective memory bandwidth: 0.893993 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
______________________________________________________
Evaluating algorithm #3/3: 02 using local memory
algorithm times (in seconds) - 10 values (min=0.0172957 10%=0.0182773 median=0.022297 90%=0.0234924 max=0.0234924)
algorithm GFlops: 770.126 GFlops
algorithm effective memory bandwidth: 2.45269 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.33797e-07 90%=1.88501e-06 max=31106)
median relative difference with CPU: 2.33797e-07
99% percentile relative difference with CPU: 0.130007
=== main_matrix_multiply stderr (exit code: -11 (segfault after execution)) ===
Error: Assertion "54623452334232 0.130007" failed at line 199 |
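The throughput figures in the log above can be reproduced from the matrix sizes and the median times. The sketch below is inferred from the printed numbers, not taken from the benchmark's source: it assumes 4-byte `float` elements, bandwidth reported in units of 2^30 bytes per second, GFlops in units of 10^9 operations per second, and 2K-1 operations (K multiplications plus K-1 additions) per output element. The function names are hypothetical. Note that the bandwidth figure only reproduces if C is counted as H x W floats (32 MB), although the log prints "C - 16 MB".

```cpp
#include <cassert>

// Assumption: matrices hold 4-byte floats (8192 x 16384 elements = 512 MB in the log).
constexpr double kBytesPerFloat = 4.0;
// Assumption: "GB/s" in the log means 2^30 bytes per second.
constexpr double kGiB = 1024.0 * 1024.0 * 1024.0;

// Transpose reads every element once and writes it once.
double transpose_bandwidth_gbps(double rows, double cols, double seconds) {
    assert(seconds > 0.0);
    return 2.0 * rows * cols * kBytesPerFloat / kGiB / seconds;
}

// Each of the h*w output elements takes k multiplications and k-1 additions.
double multiply_gflops(double h, double w, double k, double seconds) {
    assert(seconds > 0.0);
    return h * w * (2.0 * k - 1.0) / 1e9 / seconds;
}

// Counts A (h x k), B (k x w) and C (h x w) as transferred exactly once each.
double multiply_bandwidth_gbps(double h, double w, double k, double seconds) {
    assert(seconds > 0.0);
    const double bytes = (h * k + k * w + h * w) * kBytesPerFloat;
    return bytes / kGiB / seconds;
}
```

With the medians from the log, `transpose_bandwidth_gbps(8192, 16384, 0.0238951)` lands near the reported 41.8495 GB/s and `multiply_gflops(2048, 4096, 1024, 0.0611722)` near the reported 280.707 GFlops.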
|
CI failed on GitHub, please fix it: open the logs and scroll to the bottom (for example with the End key; there seem to be many empty lines, so it loads slowly), work out what the problem is, and try searching the course chat for this error. If anything is unclear along the way, or nothing turns up, or anything else comes up, don't hesitate to ask (in the chat or in a direct message). |
Test logs: Compilation error |
|
✅ Test results for PR #927
Test logs:
=== STATUS: Programs ran successfully: main_matrix_transpose, main_matrix_multiply ===
=== main_matrix_transpose stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.313536 sec (CUDA: 0.122462 sec, OpenCL: 0.0383136 sec, Vulkan: 0.152698 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using CUDA API...
Matrix size: rows=H=8192 x cols=W=16384 (512 MB)
______________________________________________________
Evaluating algorithm #1/2: 01 naive transpose (non-coalesced)
algorithm times (in seconds) - 10 values (min=0.0236967 10%=0.023718 median=0.0237391 90%=0.0238717 max=0.0238717)
median effective algorithm bandwidth: 42.1246 GB/s
______________________________________________________
Evaluating algorithm #2/2: 02 transpose via local memory (coalesced)
algorithm times (in seconds) - 10 values (min=0.00817316 10%=0.0081756 median=0.00818112 90%=0.00831771 max=0.00831771)
median effective algorithm bandwidth: 122.233 GB/s
=== main_matrix_multiply stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.327598 sec (CUDA: 0.126617 sec, OpenCL: 0.0386167 sec, Vulkan: 0.162306 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using CUDA API...
C = A x B, matrices size: C (rows=H=2048 x cols=W=4096) = A (rows=H=2048 x cols=K=1024) x B (rows=K=1024 x cols=W=4096)
matrices data size: A - 8 MB, B - 16 MB, C - 16 MB
______________________________________________________
Evaluating algorithm #1/3: CPU with OpenMP
algorithm times (in seconds) - 1 values (min=11.2695 10%=11.2695 median=11.2695 90%=11.2695 max=11.2695)
algorithm GFlops: 1.52372 GFlops
algorithm effective memory bandwidth: 0.00485272 GB/s
______________________________________________________
Evaluating algorithm #2/3: 01 naive
algorithm times (in seconds) - 10 values (min=0.060987 10%=0.061368 median=0.0648583 90%=0.0658256 max=0.0658256)
algorithm GFlops: 264.754 GFlops
algorithm effective memory bandwidth: 0.843185 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
______________________________________________________
Evaluating algorithm #3/3: 02 using local memory
algorithm times (in seconds) - 10 values (min=0.0172814 10%=0.0189008 median=0.0231557 90%=0.0243095 max=0.0243095)
algorithm GFlops: 741.567 GFlops
algorithm effective memory bandwidth: 2.36173 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.35155e-07 90%=2.03321e-06 max=70276.7)
median relative difference with CPU: 2.35155e-07
99% percentile relative difference with CPU: 0.146045
=== main_matrix_multiply stderr (exit code: -11 (segfault after execution)) ===
Error: Assertion "54623452334232 0.146045" failed at line 199 |
|
On the Tesla T4, rassert 54623452334232 fails |
Test logs: Compilation error |
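For context, the failing assertion message carries the 99th-percentile relative difference against the CPU result (0.130007 and 0.146045 in the failing runs, around 2e-05 in the passing ones). The check itself is not shown in this thread, so the following is only a sketch of how such a percentile test might look. The relative-difference formula, the epsilon guard, the nearest-rank percentile, and the names `relative_difference`, `percentile`, and `results_match` are all assumptions, and the threshold is a caller-supplied hypothetical value.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: difference of a GPU value relative to the CPU reference.
// Assumption: the denominator is |cpu|, guarded against division by zero.
double relative_difference(float gpu, float cpu) {
    const double diff = std::fabs(static_cast<double>(gpu) - static_cast<double>(cpu));
    const double scale = std::max(std::fabs(static_cast<double>(cpu)), 1e-12);
    return diff / scale;
}

// Value at the given integer percent (0..100) of the sorted order.
// Takes the vector by value because nth_element reorders it.
double percentile(std::vector<double> values, std::size_t percent) {
    assert(!values.empty() && percent <= 100);
    const std::size_t idx = std::min(values.size() - 1, values.size() * percent / 100);
    std::nth_element(values.begin(),
                     values.begin() + static_cast<std::ptrdiff_t>(idx),
                     values.end());
    return values[idx];
}

// True when the 99th-percentile relative difference stays below the threshold.
bool results_match(const std::vector<float>& gpu, const std::vector<float>& cpu,
                   double threshold) {
    assert(gpu.size() == cpu.size() && !gpu.empty());
    std::vector<double> diffs(gpu.size());
    for (std::size_t i = 0; i < gpu.size(); ++i) {
        diffs[i] = relative_difference(gpu[i], cpu[i]);
    }
    return percentile(diffs, 99) < threshold;
}
```

A percentile-based check tolerates a small fraction of outliers, which would explain why a run with max=2.77294 can still pass while a run whose 99th percentile reaches 0.13 does not: in the latter, more than 1% of the output elements are far from the CPU values.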
|
✅ Test results for PR #927
Test logs:
=== STATUS: Programs ran successfully: main_matrix_transpose, main_matrix_multiply ===
=== main_matrix_transpose stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 8.65443 sec (CUDA: 0.116201 sec, OpenCL: 0.707217 sec, Vulkan: 7.83095 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using CUDA API...
Matrix size: rows=H=8192 x cols=W=16384 (512 MB)
______________________________________________________
Evaluating algorithm #1/2: 01 naive transpose (non-coalesced)
algorithm times (in seconds) - 10 values (min=0.0235627 10%=0.0235686 median=0.0239527 90%=0.0288704 max=0.0288704)
median effective algorithm bandwidth: 41.749 GB/s
______________________________________________________
Evaluating algorithm #2/2: 02 transpose via local memory (coalesced)
algorithm times (in seconds) - 10 values (min=0.00812178 10%=0.00812347 median=0.00813146 90%=0.00824007 max=0.00824007)
median effective algorithm bandwidth: 122.979 GB/s
=== main_matrix_multiply stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.301851 sec (CUDA: 0.12824 sec, OpenCL: 0.0419848 sec, Vulkan: 0.131569 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using CUDA API...
C = A x B, matrices size: C (rows=H=2048 x cols=W=4096) = A (rows=H=2048 x cols=K=1024) x B (rows=K=1024 x cols=W=4096)
matrices data size: A - 8 MB, B - 16 MB, C - 16 MB
______________________________________________________
Evaluating algorithm #1/3: CPU with OpenMP
algorithm times (in seconds) - 1 values (min=11.9697 10%=11.9697 median=11.9697 90%=11.9697 max=11.9697)
algorithm GFlops: 1.43458 GFlops
algorithm effective memory bandwidth: 0.00456883 GB/s
______________________________________________________
Evaluating algorithm #2/3: 01 naive
algorithm times (in seconds) - 10 values (min=1.24433 10%=1.24486 median=1.24952 90%=1.35516 max=1.35516)
algorithm GFlops: 13.7424 GFlops
algorithm effective memory bandwidth: 0.0437667 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
______________________________________________________
Evaluating algorithm #3/3: 02 using local memory
algorithm times (in seconds) - 10 values (min=0.152937 10%=0.152941 median=0.152948 90%=0.154574 max=0.154574)
algorithm GFlops: 112.27 GFlops
algorithm effective memory bandwidth: 0.357557 GB/s
relative differences with CPU: 8388608 values (min=0 10%=8.6743e-08 median=4.71658e-07 90%=2.07979e-06 max=9.13368)
median relative difference with CPU: 4.71658e-07
99% percentile relative difference with CPU: 1.9618e-05 |
src/main_02_matrix_multiply.cpp
Outdated
| } else if (context.type() == gpu::Context::TypeCUDA) { | ||
| if (algorithm == "01 naive") { | ||
| cuda::matrix_multiply_naive(gpu::WorkSize(GROUP_SIZE, 1, w, h), matrix_a_gpu, matrix_b_gpu, matrix_c_gpu, w, h, k); | ||
| cuda::matrix_multiply_naive(gpu::WorkSize(1, 1, w, h), matrix_a_gpu, matrix_b_gpu, matrix_c_gpu, w, h, k); |
Please don't launch a 1x1 work group on the GPU, otherwise 31 Lilliputians end up sitting sad and idle somewhere
Yes, I somehow missed that while debugging. Fixed.
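The reviewer's remark refers to how NVIDIA GPUs schedule threads: work groups execute in warps of 32 lanes, so a 1x1 group still occupies a full warp with only one lane doing work. A minimal sketch of that arithmetic follows; the helper name `idle_lanes` is hypothetical and not part of the course code.

```cpp
#include <cassert>

// Warp width on NVIDIA GPUs such as the Tesla T4.
constexpr unsigned kWarpSize = 32;

// Number of warp lanes left idle by a work group of group_x by group_y threads.
unsigned idle_lanes(unsigned group_x, unsigned group_y) {
    const unsigned threads = group_x * group_y;
    assert(threads > 0);
    // A group always occupies a whole number of warps, rounded up.
    const unsigned warps = (threads + kWarpSize - 1) / kWarpSize;
    return warps * kWarpSize - threads;
}
```

For a 1x1 group this gives 31 idle lanes (the "31 Lilliputians"), while a 16x16 group fills its 8 warps completely.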
|
✅ Test results for PR #927
Test logs:
=== STATUS: Programs ran successfully: main_matrix_transpose, main_matrix_multiply ===
=== main_matrix_transpose stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 11.8432 sec (CUDA: 0.112519 sec, OpenCL: 0.706046 sec, Vulkan: 11.0246 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using CUDA API...
Matrix size: rows=H=8192 x cols=W=16384 (512 MB)
______________________________________________________
Evaluating algorithm #1/2: 01 naive transpose (non-coalesced)
algorithm times (in seconds) - 10 values (min=0.0239883 10%=0.0239894 median=0.0240359 90%=0.0258209 max=0.0258209)
median effective algorithm bandwidth: 41.6045 GB/s
______________________________________________________
Evaluating algorithm #2/2: 02 transpose via local memory (coalesced)
algorithm times (in seconds) - 10 values (min=0.00817342 10%=0.00817681 median=0.00818267 90%=0.0082992 max=0.0082992)
median effective algorithm bandwidth: 122.209 GB/s
=== main_matrix_multiply stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.308016 sec (CUDA: 0.124573 sec, OpenCL: 0.038077 sec, Vulkan: 0.145307 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using CUDA API...
C = A x B, matrices size: C (rows=H=2048 x cols=W=4096) = A (rows=H=2048 x cols=K=1024) x B (rows=K=1024 x cols=W=4096)
matrices data size: A - 8 MB, B - 16 MB, C - 16 MB
______________________________________________________
Evaluating algorithm #1/3: CPU with OpenMP
algorithm times (in seconds) - 1 values (min=11.97 10%=11.97 median=11.97 90%=11.97 max=11.97)
algorithm GFlops: 1.43454 GFlops
algorithm effective memory bandwidth: 0.0045687 GB/s
______________________________________________________
Evaluating algorithm #2/3: 01 naive
algorithm times (in seconds) - 10 values (min=0.171345 10%=0.172939 median=0.174058 90%=0.329976 max=0.329976)
algorithm GFlops: 98.6536 GFlops
algorithm effective memory bandwidth: 0.314191 GB/s
relative differences with CPU: 8388608 values (min=0 10%=8.67401e-08 median=4.71637e-07 90%=2.07923e-06 max=3.12559)
median relative difference with CPU: 4.71637e-07
99% percentile relative difference with CPU: 1.95534e-05
______________________________________________________
Evaluating algorithm #3/3: 02 using local memory
algorithm times (in seconds) - 10 values (min=0.152764 10%=0.152767 median=0.152778 90%=0.155259 max=0.155259)
algorithm GFlops: 112.395 GFlops
algorithm effective memory bandwidth: 0.357955 GB/s
relative differences with CPU: 8388608 values (min=0 10%=8.67415e-08 median=4.71645e-07 90%=2.07943e-06 max=6.30526)
median relative difference with CPU: 4.71645e-07
99% percentile relative difference with CPU: 1.95739e-05 |
|
9/10 points 👍 (because of the deadline) |
Transpose
Local output
GitHub CI output
Multiply
Local output
GitHub CI output