
Loops Index

30 loops have been discarded from the report because their ratio ((Max Inclusive Time Over Threads * 100) / Max Thread Active Time) is lower than the threshold set by object_coverage_threshold (0.1%). Together they represent about 0.03% of the application. To include them, change the value of object_coverage_threshold in the experiment directory configuration file, then rerun the command with the additional parameter --force-static-analysis.
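The filtering rule described in the notice can be sketched as follows. This is only an illustration of the stated ratio and threshold comparison, not the tool's actual implementation: the function names are hypothetical, and the timing values in the usage lines are made up for the example (the report does not state the Max Thread Active Time).

```python
def coverage_ratio(max_inclusive_time_s: float, max_thread_active_time_s: float) -> float:
    """Return (Max Inclusive Time Over Threads * 100) / Max Thread Active Time, in percent."""
    if max_thread_active_time_s <= 0:
        raise ValueError("max_thread_active_time_s must be positive")
    return max_inclusive_time_s * 100.0 / max_thread_active_time_s


def is_discarded(max_inclusive_time_s: float,
                 max_thread_active_time_s: float,
                 threshold_pct: float = 0.1) -> bool:
    """A loop is dropped when its ratio is strictly lower than the threshold.

    threshold_pct mirrors object_coverage_threshold (0.1% in this report).
    """
    return coverage_ratio(max_inclusive_time_s, max_thread_active_time_s) < threshold_pct


# Hypothetical values: a 35.0 s max thread active time is assumed for illustration.
print(is_discarded(0.02, 35.0))   # small loop, ratio of roughly 0.057%
print(is_discarded(13.08, 35.0))  # hot loop, ratio well above the threshold
```

Lowering `threshold_pct` in this sketch corresponds to lowering object_coverage_threshold so that more of the small loops stay in the report.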

Loop id | Source Location | Source Function | Level | Max Thread Time / Walltime aocc_4 (%) | Exclusive Coverage aocc_4 (%) | Inclusive Coverage aocc_4 (%) | Max Exclusive Time Over Threads aocc_4 (s) | Max Inclusive Time Over Threads aocc_4 (s) | Exclusive Time w.r.t. Wall Time aocc_4 (s) | Inclusive Time w.r.t. Wall Time aocc_4 (s) | Nb Threads aocc_4, GFLOPS aocc_4, Vectorization Ratio (%), Vector Length Use (%), Speedup If No Scalar Integer, Speedup If FP Vectorized, Speedup If Fully Vectorized, Speedup If Perfect Load Balancing aocc_4, Stride 0, Stride 1, Stride n, Stride Unknown, Stride Indirect, Array Access Efficiency (values concatenated)
548 | libggml-cpu.so - mmq.cpp:1570-1597 [...] | ggml_backend_amx_mul_mat(ggml_compute_params const*, ggml_tensor*)::$_1::operator()(int, int) const::{lambda()#1}::operator()() const | Single | 37.15 | 24.99 | 24.99 | 13.08 | 13.08 | 4.53 | 4.53 | 18379.33NANANANANA2.79NANANANANA0.00
399 | libggml-cpu.so - mmq.cpp:303-1392 [...] | void parallel_for<(anonymous namespace)::convert_B_packed_format<block_q8_0, 32>(void*, block_q8_0 const*, int, int)::{lambda(int, int)#1}>(int, (anonymous namespace)::convert_B_packed_format<block_q8_0, 32>(void*, block_q8_0 const*, int,... | Innermost | 0.16 | 0.11 | 0.11 | 0.05 | 0.05 | 0.02 | 0.02 | 1780.0090.9138.761.4711.412.5923009085.94
1232 | libggml-cpu.so - vec.cpp:311-316 | ggml_vec_dot_f16 | Single | 0.47 | 0.09 | 0.09 | 0.17 | 0.17 | 0.02 | 0.02 | 321325.00NANANANANA1.67NANANANANA0.00
713 | libggml-cpu.so - mmq.cpp:520-2488 [...] | ggml_backend_amx_mul_mat(ggml_compute_params const*, ggml_tensor*)::$_2::operator()(int, int) const::{lambda()#1}::operator()() const | InBetween | 0.14 | 0.08 | 0.09 | 0.05 | 0.05 | 0.02 | 0.02 | 150329.15NANANANANA2.63NANANANANA0.00
2304 | libggml-cpu.so - vec.h:491-497 | ggml_compute_forward_flash_attn_ext | Innermost | 0.36 | 0.08 | 0.08 | 0.13 | 0.13 | 0.02 | 0.02 | 32917.75NANANANANA1.41NANANANANA0.00
2294 | libggml-cpu.so - ops.cpp:8759-8881 [...] | ggml_compute_forward_flash_attn_ext | InBetween | 0.21 | 0.04 | 0.12 | 0.07 | 0.20 | 0.01 | 0.02 | 321547.5918.5912.32.431.69.791.8NANANANANA0.00
2015 | libggml-cpu.so - ops.cpp:6220-6245 [...] | ggml_compute_forward_rope_f32(ggml_compute_params const*, ggml_tensor*, bool) | Innermost | 0.09 | 0.04 | 0.04 | 0.03 | 0.03 | 0.01 | 0.01 | 192581.0506.2511.1244.6512000100.00
117 | libggml-cpu.so - ggml-cpu.c:533-2891 [...] | ggml_graph_compute_thread | Single | 0.07 | 0.02 | 0.02 | 0.03 | 0.03 | 0.00 | 0.00 | 10814.7609.581114.773.51NANANANANA0.00
2004 | libggml-cpu.so - ops.cpp:6365-6484 [...] | ggml_compute_forward_rope_f32(ggml_compute_params const*, ggml_tensor*, bool) | InBetween | 0.07 | 0.01 | 0.02 | 0.03 | 0.03 | 0.00 | 0.00 | 16646.2209.874.23.125.849.12NANANANANA0.00
3150 | libggml-cpu.so - quants.c:298-355 [...] | quantize_row_q8_0 | Single | 0.85 | 0.01 | 0.01 | 0.30 | 0.30 | 0.00 | 0.00 | 1888.5660.729.6611.312.68102000100.00
2292 | libggml-cpu.so - ops.cpp:8885-8886 [...] | ggml_compute_forward_flash_attn_ext | Innermost | 0.06 | 0.01 | 0.01 | 0.02 | 0.02 | 0.00 | 0.00 | 260.0006.251.331162.170200166.67
2007 | libggml-cpu.so - ops.cpp:6446-6457 | ggml_compute_forward_rope_f32(ggml_compute_params const*, ggml_tensor*, bool) | Innermost | 0.06 | 0.01 | 0.01 | 0.02 | 0.02 | 0.00 | 0.00 | 33165.3406.2511.51630202075.00
914 | libggml-cpu.so - binary-ops.cpp:18-32 [...] | ggml_compute_forward_mul | Innermost | 0.62 | 0.01 | 0.01 | 0.22 | 0.22 | 0.00 | 0.00 | 7110.6606.2511.0616703000100.00
1240 | libggml-cpu.so - vec.h:1084-1115 [...] | ggml_vec_swiglu_f32 | Single | 0.45 | 0.00 | 0.00 | 0.16 | 0.16 | 0.00 | 0.00 | 76147.909898.1311170.5003056.25
3069 | exec - sampling.cpp:125-126 [...] | common_sampler::set_logits(llama_context*, int) | Single | 0.26 | 0.00 | 0.00 | 0.09 | 0.09 | 0.00 | 0.00 | 10.0006.25311611180080.00
826 | libggml-cpu.so - binary-ops.cpp:10-32 [...] | ggml_compute_forward_add_non_quantized | Innermost | 0.23 | 0.00 | 0.00 | 0.08 | 0.08 | 0.00 | 0.00 | 1343.3906.2511.0616103000100.00
2303 | libllama.so - stl_heap.h:139-262 [...] | llama_token_data_array_partial_sort_inplace(llama_token_data_array*, int) | Outermost | 0.20 | 0.00 | 0.00 | 0.07 | 0.07 | 0.00 | 0.00 | 10.0008.482.33113.951NANANANANA0.00
2775 | libllama.so - hashtable.h:2386-2403 [...] | std::_Hashtable<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::pair<std::pai... | Single | 0.06 | 0.00 | 0.00 | 0.02 | 0.02 | 0.00 | 0.00 | 10.00011.93112.461NANANANANA0.00
1734 | libggml-cpu.so - ops.cpp:4325-4326 | ggml_compute_forward_rms_norm | Innermost | 0.06 | 0.00 | 0.00 | 0.02 | 0.02 | 0.00 | 0.00 | 13628.127518.7511.385.18101000100.00