| Run | Configuration |
|---|---|
| Run 1x8 | Number processes: 1; Number nodes: 1; Run Command: `<executable> -m meta-llama-3.1-8b-instruct-Q8_0.gguf -no-cnv -t <OMP_NUM_THREADS> -n 512 -p "what is a LLM?" --seed 0`; MPI Command: `mpirun -n <number_processes>`; Dataset: (empty); Run Directory: /beegfs/hackathon/users/eoseret/qaas_runs_test/175-950-2189/intel/llama.cpp/run/oneview_runs/multicore/icx_3/oneview_run_1759511881; OMP_NUM_THREADS: 8; I_MPI_PIN_ORDER: bunch; OMP_DISPLAY_AFFINITY: TRUE; OMP_PROC_BIND: spread; OMP_AFFINITY_FORMAT: 'OMP: pid %P tid %i thread %n bound to OS proc set {%A}'; OMP_DISPLAY_ENV: TRUE; I_MPI_PIN_DOMAIN: auto; I_MPI_DEBUG: 4; OMP_PLACES: threads |
| Run 1x64 | Number processes: 1; Number nodes: 1; Run Command: `<executable> -m meta-llama-3.1-8b-instruct-Q8_0.gguf -no-cnv -t <OMP_NUM_THREADS> -n 512 -p "what is a LLM?" --seed 0`; MPI Command: `mpirun -n <number_processes>`; Dataset: (empty); Run Directory: /beegfs/hackathon/users/eoseret/qaas_runs_test/175-950-2189/intel/llama.cpp/run/oneview_runs/multicore/icx_3/oneview_run_1759511881; OMP_NUM_THREADS: 64; I_MPI_PIN_ORDER: bunch; OMP_DISPLAY_AFFINITY: TRUE; OMP_PROC_BIND: spread; OMP_AFFINITY_FORMAT: 'OMP: pid %P tid %i thread %n bound to OS proc set {%A}'; OMP_DISPLAY_ENV: TRUE; I_MPI_PIN_DOMAIN: auto; I_MPI_DEBUG: 4; OMP_PLACES: threads |
| Run 1x96 | Number processes: 1; Number nodes: 1; Run Command: `<executable> -m meta-llama-3.1-8b-instruct-Q8_0.gguf -no-cnv -t <OMP_NUM_THREADS> -n 512 -p "what is a LLM?" --seed 0`; MPI Command: `mpirun -n <number_processes>`; Dataset: (empty); Run Directory: /beegfs/hackathon/users/eoseret/qaas_runs_test/175-950-2189/intel/llama.cpp/run/oneview_runs/multicore/icx_3/oneview_run_1759511881; OMP_NUM_THREADS: 96; I_MPI_PIN_ORDER: bunch; OMP_DISPLAY_AFFINITY: TRUE; OMP_PROC_BIND: spread; OMP_AFFINITY_FORMAT: 'OMP: pid %P tid %i thread %n bound to OS proc set {%A}'; OMP_DISPLAY_ENV: TRUE; I_MPI_PIN_DOMAIN: auto; I_MPI_DEBUG: 4; OMP_PLACES: threads |
| Run 1x128 | Number processes: 1; Number nodes: 1; Run Command: `<executable> -m meta-llama-3.1-8b-instruct-Q8_0.gguf -no-cnv -t <OMP_NUM_THREADS> -n 512 -p "what is a LLM?" --seed 0`; MPI Command: `mpirun -n <number_processes>`; Dataset: (empty); Run Directory: /beegfs/hackathon/users/eoseret/qaas_runs_test/175-950-2189/intel/llama.cpp/run/oneview_runs/multicore/icx_3/oneview_run_1759511881; OMP_NUM_THREADS: 128; I_MPI_PIN_ORDER: bunch; OMP_DISPLAY_AFFINITY: TRUE; OMP_PROC_BIND: spread; OMP_AFFINITY_FORMAT: 'OMP: pid %P tid %i thread %n bound to OS proc set {%A}'; OMP_DISPLAY_ENV: TRUE; I_MPI_PIN_DOMAIN: auto; I_MPI_DEBUG: 4; OMP_PLACES: threads |
| Run 1x160 | Number processes: 1; Number nodes: 1; Run Command: `<executable> -m meta-llama-3.1-8b-instruct-Q8_0.gguf -no-cnv -t <OMP_NUM_THREADS> -n 512 -p "what is a LLM?" --seed 0`; MPI Command: `mpirun -n <number_processes>`; Dataset: (empty); Run Directory: /beegfs/hackathon/users/eoseret/qaas_runs_test/175-950-2189/intel/llama.cpp/run/oneview_runs/multicore/icx_3/oneview_run_1759511881; OMP_NUM_THREADS: 160; I_MPI_PIN_ORDER: bunch; OMP_DISPLAY_AFFINITY: TRUE; OMP_PROC_BIND: spread; OMP_AFFINITY_FORMAT: 'OMP: pid %P tid %i thread %n bound to OS proc set {%A}'; OMP_DISPLAY_ENV: TRUE; I_MPI_PIN_DOMAIN: auto; I_MPI_DEBUG: 4; OMP_PLACES: threads |
| Run 1x192 | Number processes: 1; Number nodes: 1; Run Command: `<executable> -m meta-llama-3.1-8b-instruct-Q8_0.gguf -no-cnv -t <OMP_NUM_THREADS> -n 512 -p "what is a LLM?" --seed 0`; MPI Command: `mpirun -n <number_processes>`; Dataset: (empty); Run Directory: /beegfs/hackathon/users/eoseret/qaas_runs_test/175-950-2189/intel/llama.cpp/run/oneview_runs/multicore/icx_3/oneview_run_1759511881; OMP_NUM_THREADS: 192; I_MPI_PIN_ORDER: bunch; OMP_DISPLAY_AFFINITY: TRUE; OMP_PROC_BIND: spread; OMP_AFFINITY_FORMAT: 'OMP: pid %P tid %i thread %n bound to OS proc set {%A}'; OMP_DISPLAY_ENV: TRUE; I_MPI_PIN_DOMAIN: auto; I_MPI_DEBUG: 4; OMP_PLACES: threads |
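
The six runs above share every setting except `OMP_NUM_THREADS`. As a minimal sketch of how such a sweep could be described programmatically, the snippet below builds the environment and argument list for one run without launching anything. The helper name `build_run`, the executable path passed to it, and the choice to prefix the run command with the MPI command are assumptions made for illustration; the report only lists the two commands separately.

```python
# Settings shared by all runs, copied from the configuration table.
COMMON_ENV = {
    "I_MPI_PIN_ORDER": "bunch",
    "OMP_DISPLAY_AFFINITY": "TRUE",
    "OMP_PROC_BIND": "spread",
    "OMP_AFFINITY_FORMAT": "OMP: pid %P tid %i thread %n bound to OS proc set {%A}",
    "OMP_DISPLAY_ENV": "TRUE",
    "I_MPI_PIN_DOMAIN": "auto",
    "I_MPI_DEBUG": "4",
    "OMP_PLACES": "threads",
}

# Thread counts of the six runs (1x8 ... 1x192).
THREAD_COUNTS = [8, 64, 96, 128, 160, 192]


def build_run(threads, executable, processes=1):
    """Return (env, argv) for one run. Hypothetical helper, not part of the report.

    Assumption: the MPI command is placed in front of the run command.
    """
    env = dict(COMMON_ENV, OMP_NUM_THREADS=str(threads))
    argv = [
        "mpirun", "-n", str(processes),
        executable,
        "-m", "meta-llama-3.1-8b-instruct-Q8_0.gguf",
        "-no-cnv",
        "-t", str(threads),
        "-n", "512",
        "-p", "what is a LLM?",
        "--seed", "0",
    ]
    return env, argv


if __name__ == "__main__":
    # "./llama-executable" is a placeholder path, not taken from the report.
    for n in THREAD_COUNTS:
        env, argv = build_run(n, "./llama-executable")
        print(f"1x{n}: OMP_NUM_THREADS={env['OMP_NUM_THREADS']} {' '.join(argv[:4])} ...")
```

The returned pair could be handed to `subprocess.run(argv, env={**os.environ, **env})` if the runs were to be launched from Python.
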

| Loop id | Source Location | Source Function | Level | Max Thread Time / Walltime 1x8 (%) | Max Thread Time / Walltime 1x64 (%) | Max Thread Time / Walltime 1x96 (%) | Max Thread Time / Walltime 1x128 (%) | Max Thread Time / Walltime 1x160 (%) | Max Thread Time / Walltime 1x192 (%) | Exclusive Coverage 1x8 (%) | Exclusive Coverage 1x64 (%) | Exclusive Coverage 1x96 (%) | Exclusive Coverage 1x128 (%) | Exclusive Coverage 1x160 (%) | Exclusive Coverage 1x192 (%) | Inclusive Coverage 1x8 (%) | Inclusive Coverage 1x64 (%) | Inclusive Coverage 1x96 (%) | Inclusive Coverage 1x128 (%) | Inclusive Coverage 1x160 (%) | Inclusive Coverage 1x192 (%) | Max Exclusive Time Over Threads 1x8 (s) | Max Exclusive Time Over Threads 1x64 (s) | Max Exclusive Time Over Threads 1x96 (s) | Max Exclusive Time Over Threads 1x128 (s) | Max Exclusive Time Over Threads 1x160 (s) | Max Exclusive Time Over Threads 1x192 (s) | Max Inclusive Time Over Threads 1x8 (s) | Max Inclusive Time Over Threads 1x64 (s) | Max Inclusive Time Over Threads 1x96 (s) | Max Inclusive Time Over Threads 1x128 (s) | Max Inclusive Time Over Threads 1x160 (s) | Max Inclusive Time Over Threads 1x192 (s) | Exclusive Time w.r.t. Wall Time 1x8 (s) | Exclusive Time w.r.t. Wall Time 1x64 (s) | Exclusive Time w.r.t. Wall Time 1x96 (s) | Exclusive Time w.r.t. Wall Time 1x128 (s) | Exclusive Time w.r.t. Wall Time 1x160 (s) | Exclusive Time w.r.t. Wall Time 1x192 (s) | Inclusive Time w.r.t. Wall Time 1x8 (s) | Inclusive Time w.r.t. Wall Time 1x64 (s) | Inclusive Time w.r.t. Wall Time 1x96 (s) | Inclusive Time w.r.t. Wall Time 1x128 (s) | Inclusive Time w.r.t. Wall Time 1x160 (s) | Inclusive Time w.r.t. Wall Time 1x192 (s) | Nb Threads 1x8 | Nb Threads 1x64 | Nb Threads 1x96 | Nb Threads 1x128 | Nb Threads 1x160 | Nb Threads 1x192 | GFLOPS 1x8 | GFLOPS 1x64 | GFLOPS 1x96 | GFLOPS 1x128 | GFLOPS 1x160 | GFLOPS 1x192 | Vectorization Ratio (%) | Vector Length Use (%) | Speedup If No Scalar Integer | Speedup If FP Vectorized | Speedup If Fully Vectorized | Speedup If Perfect Load Balancing 1x8 | Speedup If Perfect Load Balancing 1x64 | Speedup If Perfect Load Balancing 1x96 | Speedup If Perfect Load Balancing 1x128 | Speedup If Perfect Load Balancing 1x160 | Speedup If Perfect Load Balancing 1x192 | Stride 0 | Stride 1 | Stride n | Stride Unknown | Stride Indirect | Array Access Efficiency | (1x8) Efficiency | (1x8) Potential Speed-Up (%) | (1x64) Efficiency | (1x64) Potential Speed-Up (%) | (1x96) Efficiency | (1x96) Potential Speed-Up (%) | (1x128) Efficiency | (1x128) Potential Speed-Up (%) | (1x160) Efficiency | (1x160) Potential Speed-Up (%) | (1x192) Efficiency | (1x192) Potential Speed-Up (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5766 | libggml-cpu.so - quants.c:108-1042 [...] | ggml_vec_dot_q8_0_q8_0 | Single | 75.04 | 81.73 | 83.30 | 83.00 | 82.36 | 82.75 | 86.86 | 61.30 | 62.24 | 61.14 | 60.99 | 61.38 | 86.86 | 61.30 | 62.24 | 61.14 | 60.99 | 61.38 | 37.90 | 51.37 | 56.70 | 56.52 | 56.70 | 57.30 | 37.90 | 51.37 | 56.70 | 56.52 | 56.70 | 57.30 | 36.70 | 33.59 | 37.34 | 36.65 | 37.02 | 37.52 | 36.70 | 33.59 | 37.34 | 36.65 | 37.02 | 37.52 | 8 | 64 | 96 | 128 | 160 | 192 | 53.34 | 60.05 | 54.21 | 55.28 | 54.73 | 53.94 | NA | NA | NA | NA | NA | 1.03 | 1.54 | 1.53 | 1.55 | 1.55 | 1.54 | NA | NA | NA | NA | NA | 0.00 | 1 | 0 | 0.14 | 52.93 | 0.08 | 57.14 | 0.06 | 57.32 | 0.05 | 57.96 | 0.04 | 58.88 |
| 1779 | libggml-cpu.so - vec.cpp:311-316 | ggml_vec_dot_f16 | Single | 0.51 | 0.30 | 0.24 | 0.26 | 0.26 | 0.27 | 0.38 | 0.10 | 0.05 | 0.05 | 0.03 | 0.04 | 0.38 | 0.10 | 0.05 | 0.05 | 0.03 | 0.04 | 0.26 | 0.19 | 0.16 | 0.18 | 0.18 | 0.19 | 0.26 | 0.19 | 0.16 | 0.18 | 0.18 | 0.19 | 0.16 | 0.05 | 0.03 | 0.03 | 0.02 | 0.02 | 0.16 | 0.05 | 0.03 | 0.03 | 0.02 | 0.02 | 8 | 32 | 33 | 32 | 34 | 33 | 35.89 | 189.26 | 409.77 | 557.26 | 719.52 | 701.54 | 100 | 66.67 | 1 | 1 | 1.57 | 1.63 | 1.78 | 1.85 | 1.4 | 1.83 | 1.51 | 0 | 2 | 0 | 0 | 0 | 100.00 | 1 | 0 | 0.37 | 0.06 | 0.44 | 0.03 | 0.31 | 0.04 | 0.38 | 0.02 | 0.31 | 0.02 |
| 5300 | libggml-cpu.so - sgemm.cpp:138-1044 [...] | _ZN12_GLOBAL__N_115tinyBLAS_Q0_AVXI10block_q8_0S1_fE7gemm4xNILi4EEEvllll.A | Innermost | 0.32 | 0.19 | 0.18 | 0.18 | 0.19 | 0.20 | 0.28 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.28 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.16 | 0.12 | 0.12 | 0.12 | 0.13 | 0.14 | 0.16 | 0.12 | 0.12 | 0.12 | 0.13 | 0.14 | 0.12 | 0.06 | 0.07 | 0.07 | 0.07 | 0.07 | 0.12 | 0.06 | 0.07 | 0.07 | 0.07 | 0.07 | 8 | 64 | 95 | 128 | 156 | 189 | 119.90 | 234.46 | 210.51 | 212.47 | 216.36 | 227.92 | NA | NA | NA | NA | NA | 1.36 | 1.95 | 1.76 | 1.82 | 1.93 | 2.11 | NA | NA | NA | NA | NA | 0.00 | 1 | 0 | 0.24 | 0.09 | 0.14 | 0.1 | 0.11 | 0.1 | 0.09 | 0.1 | 0.07 | 0.1 |
| 4189 | libggml-cpu.so - vec.h:491-497 | ggml_compute_forward_flash_attn_ext | Innermost | 0.32 | 0.37 | 0.21 | 0.21 | 0.31 | 0.25 | 0.27 | 0.11 | 0.04 | 0.03 | 0.03 | 0.03 | 0.27 | 0.11 | 0.04 | 0.03 | 0.03 | 0.03 | 0.16 | 0.23 | 0.14 | 0.14 | 0.21 | 0.17 | 0.16 | 0.23 | 0.14 | 0.14 | 0.21 | 0.17 | 0.12 | 0.06 | 0.03 | 0.02 | 0.02 | 0.02 | 0.12 | 0.06 | 0.03 | 0.02 | 0.02 | 0.02 | 8 | 33 | 32 | 34 | 32 | 33 | 157.26 | 368.13 | 853.70 | 1427.13 | 1495.16 | 1562.63 | 100 | 75 | 1 | 1 | 1.45 | 1.39 | 1.98 | 1.87 | 2.08 | 2.15 | 1.58 | 0 | 2 | 0 | 0 | 0 | 100.00 | 1 | 0 | 0.24 | 0.08 | 0.38 | 0.03 | 0.4 | 0.02 | 0.29 | 0.02 | 0.26 | 0.02 |
| 5326 | libggml-cpu.so - sgemm.cpp:138-1044 [...] | _ZN12_GLOBAL__N_115tinyBLAS_Q0_AVXI10block_q8_0S1_fE7gemm4xNILi2EEEvllll.A | Innermost | 0.24 | 0.18 | 0.18 | 0.18 | 0.19 | 0.19 | 0.21 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.21 | 0.11 | 0.11 | 0.11 | 0.11 | 0.11 | 0.12 | 0.11 | 0.12 | 0.12 | 0.13 | 0.13 | 0.12 | 0.11 | 0.12 | 0.12 | 0.13 | 0.13 | 0.09 | 0.06 | 0.07 | 0.07 | 0.07 | 0.07 | 0.09 | 0.06 | 0.07 | 0.07 | 0.07 | 0.07 | 8 | 64 | 95 | 128 | 156 | 189 | 82.29 | 118.92 | 107.66 | 108.16 | 101.93 | 105.92 | NA | NA | NA | NA | NA | 1.35 | 1.84 | 1.82 | 1.86 | 1.86 | 1.96 | NA | NA | NA | NA | NA | 0.00 | 1 | 0 | 0.18 | 0.09 | 0.11 | 0.1 | 0.09 | 0.1 | 0.06 | 0.11 | 0.06 | 0.1 |
| 901 | libggml-cpu.so - binary-ops.cpp:10-32 [...] | ggml_compute_forward_add_non_quantized | Innermost | 0.51 | 0.27 | 0.28 | 0.10 | 0.13 | 0.13 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.26 | 0.17 | 0.19 | 0.07 | 0.09 | 0.09 | 0.26 | 0.17 | 0.19 | 0.07 | 0.09 | 0.09 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2 | 1 | 2 | 1 | 1 | 2 | 4.00 | 54.15 | 54.73 | 240.12 | 251.02 | 266.11 | 0 | 6.25 | 1 | 1.33 | 16 | 2 | 1 | 1.9 | 1 | 1 | 1.8 | 0 | 3 | 0 | 0 | 0 | 100.00 | 1 | 0 | 1.52 | -0 | 1.29 | -0 | 3.68 | -0 | 2.86 | -0 | 2.58 | -0 |
| 236 | libggml-cpu.so - ggml-cpu.c:1164-1198 [...] | ggml_compute_forward_mul_mat | InBetween | 0.10 | 0.30 | 0.22 | 0.22 | 0.22 | 0.33 | 0.06 | 0.12 | 0.09 | 0.07 | 0.09 | 0.12 | 0.21 | 0.15 | 0.12 | 0.09 | 0.10 | 0.14 | 0.05 | 0.19 | 0.15 | 0.15 | 0.15 | 0.23 | 0.11 | 0.22 | 0.19 | 0.17 | 0.16 | 0.26 | 0.03 | 0.07 | 0.06 | 0.04 | 0.05 | 0.07 | 0.09 | 0.08 | 0.07 | 0.05 | 0.06 | 0.08 | 8 | 64 | 96 | 128 | 160 | 192 | 92.76 | 67.89 | 61.34 | 64.33 | 53.66 | 54.38 | 0 | 11.14 | 1 | 1 | 9.6 | 1.91 | 2.92 | 2.68 | 3.74 | 2.88 | 3.18 | NA | NA | NA | NA | NA | 0.00 | 1 | 0 | 0.05 | 0.11 | 0.04 | 0.09 | 0.04 | 0.06 | 0.02 | 0.08 | 0.02 | 0.12 |
| 1793 | libggml-cpu.so - vec.h:1084-1116 [...] | ggml_vec_swiglu_f32 | Single | 0.26 | 0.13 | 0.21 | 0.24 | 0.31 | 0.16 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.13 | 0.08 | 0.14 | 0.16 | 0.21 | 0.11 | 0.13 | 0.08 | 0.14 | 0.16 | 0.21 | 0.11 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7 | 7 | 7 | 7 | 7 | 7 | 295.65 | 3553.38 | 3376.79 | 3694.81 | 3722.56 | 7455.70 | 100 | 100 | 1 | 1 | 1 | 7 | 6.22 | 7 | 6.59 | 7 | 5.92 | 0 | 3 | 0 | 0 | 0 | 100.00 | 1 | 0 | 1.44 | -0 | 0.92 | 0 | 0.76 | 0 | 0.61 | 0 | 0.99 | 0 |
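
The `(1xN) Efficiency` columns are not defined in the table itself. The sketch below tests one plausible reading: efficiency relative to the 1x8 run, computed from the `Exclusive Time w.r.t. Wall Time` columns and the run's thread count. This formula is an inference from the numbers, not a documented definition. For loop 5766 it reproduces the reported values (0.14, 0.08, 0.06, 0.05, 0.04) after rounding to two decimals; for the smaller loops the table's times are themselves rounded to two decimals, so agreement there is only approximate.

```python
# Exclusive Time w.r.t. Wall Time (s) for loop 5766, keyed by run thread count.
# Values are copied from the loop table.
exclusive_time_s = {8: 36.70, 64: 33.59, 96: 37.34, 128: 36.65, 160: 37.02, 192: 37.52}


def scaling_efficiency(times, baseline_threads=8):
    """Efficiency of each run relative to the baseline run.

    Assumed formula (inferred from the table, not from tool documentation):
        efficiency(n) = (T_ref * n_ref) / (T_n * n)
    """
    t_ref = times[baseline_threads]
    return {
        n: (t_ref * baseline_threads) / (t * n)
        for n, t in times.items()
    }


efficiency = scaling_efficiency(exclusive_time_s)
for n, e in efficiency.items():
    print(f"1x{n}: efficiency = {e:.2f}")
```

Read this way, the loop's wall-time contribution stays near 37 s from 8 to 192 threads, so the efficiency falls roughly in proportion to the thread count.
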