OV - Compare Summary

▼Stylizer

maq_icx	maq_gcc	maq_clang
[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete. Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.	[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete. Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.	[ 2 / 3 ] Security settings from the host restrict profiling. Some metrics will be missing or incomplete. Current value for kernel.perf_event_paranoid is 2. If possible, set it to 1 or check with your system administrator which flag can be used to achieve this.
Not available for this run	[ 0 / 0 ] Fastmath not used. Consider adding -ffast-math to the compilation flags (or replacing -O3 with -Ofast) to unlock potential extra speedup by relaxing floating-point computation consistency. Warning: floating-point accuracy may be reduced and compliance with IEEE/ISO rules and specifications for math functions will be relaxed; typically, 'errno' will no longer be set after calling some math functions.	[ 0 / 0 ] Fastmath not used. Consider adding -ffast-math to the compilation flags (or replacing -O3 with -Ofast) to unlock potential extra speedup by relaxing floating-point computation consistency. Warning: floating-point accuracy may be reduced and compliance with IEEE/ISO rules and specifications for math functions will be relaxed; typically, 'errno' will no longer be set after calling some math functions.
[ 0 / 3 ] Compilation of some functions is not optimized for the target processor. Architecture-specific options are needed to produce efficient code for a specific processor ( -x(target) or -ax(target) ). The application ran on the ZEN_V4 micro-architecture while the code was specialized for znver3.	[ 0 / 3 ] Compilation of some functions is not optimized for the target processor. Architecture-specific options are needed to produce efficient code for a specific processor ( -x(target) or -ax(target) ). The application ran on the ZEN_V4 micro-architecture while the code was specialized for znver3.	[ 0 / 3 ] Compilation of some functions is not optimized for the target processor. Architecture-specific options are needed to produce efficient code for a specific processor ( -x(target) or -ax(target) ).
[ 2.68 / 3 ] Most of the time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer. The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of the call chains found during application profiling.	[ 3.00 / 3 ] Most of the time spent in analyzed modules comes from functions compiled with -g and -fno-omit-frame-pointer. The -g option gives access to debugging information, such as source locations. -fno-omit-frame-pointer improves the accuracy of the call chains found during application profiling.	[ 0 / 3 ] Most of the time spent in analyzed modules comes from functions without compilation information. Functions without compilation information (typically not compiled with -g and -grecord-gcc-switches) account for 100.00% of the time spent in analyzed modules. Check that -g and -grecord-gcc-switches are present. Remark: if -g and -grecord-gcc-switches are indeed used, this can also be due to some compiler built-in functions (typically math) or statically linked libraries. This warning can be ignored in that case.
[ 4 / 4 ] Application profile is long enough (65.79 s). For good-quality measurements, the application profiling time should be greater than 10 seconds.	[ 4 / 4 ] Application profile is long enough (70.03 s). For good-quality measurements, the application profiling time should be greater than 10 seconds.	[ 4 / 4 ] Application profile is long enough (63.78 s). For good-quality measurements, the application profiling time should be greater than 10 seconds.
[ 2 / 2 ] Application is correctly profiled ("Others" category represents 10.16 % of the execution time). For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.	[ 2 / 2 ] Application is correctly profiled ("Others" category represents 19.55 % of the execution time). For a representative profile, the "Others" category should represent less than 20% of the execution time, so that as much of the user code as possible is analyzed.	[ 0 / 2 ] Too much execution time spent in category "Others" (21.53 %). If the "Others" category represents more than 20% of the execution time, the application profile misses a representative part of the application. Examine the function details to identify the components of the "Others" category. Rerun after adding the most represented library names (e.g. more than 20% of coverage) to external_libraries (the names can be provided directly by ONE View).
[ 2.68 / 3 ] Optimization level option is correctly used	[ 3 / 3 ] Optimization level option is correctly used	[ 0 / 3 ] Some functions are compiled with a low optimization level (O0 or O1). For better performance, help the compiler by using a proper optimization level (-O2 or higher). Warning: depending on the compiler, higher optimization levels can decrease numeric accuracy.
[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.	[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.	[ 1 / 1 ] Lstopo present. The Topology lstopo report will be generated.
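
For a quick side-by-side reading of the Stylizer table, the per-check scores can be added up per run. The following is a minimal sketch only: the report does not state how ONE View aggregates scores, so the plain sum and the helper name `total_score` are assumptions made for illustration.

```python
import re

# Matches a score prefix such as "[ 2.68 / 3 ]" at the start of a cell.
SCORE_RE = re.compile(r"\[\s*([\d.]+)\s*/\s*([\d.]+)\s*\]")


def total_score(cells):
    """Return (obtained, maximum) summed over cells that carry a score prefix."""
    got = best = 0.0
    for cell in cells:
        match = SCORE_RE.match(cell.strip())
        if match is None:  # e.g. "Not available for this run"
            continue
        got += float(match.group(1))
        best += float(match.group(2))
    return got, best


# Score prefixes copied from the maq_icx column of the table above.
icx_cells = [
    "[ 2 / 3 ]", "Not available for this run", "[ 0 / 3 ]", "[ 2.68 / 3 ]",
    "[ 4 / 4 ]", "[ 2 / 2 ]", "[ 2.68 / 3 ]", "[ 1 / 1 ]",
]
got, best = total_score(icx_cells)
print(f"maq_icx: {got:.2f} / {best:.0f}")
```

The same call on the maq_gcc and maq_clang columns gives comparable totals; cells without a score prefix are skipped rather than counted as zero.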

▼Strategizer

maq_icx	maq_gcc	maq_clang
[ 0 / 4 ] CPU activity is below 90% (0.53%). CPU cores are idle more than 10% of the time. Threads supposed to run on these cores are probably waiting on I/O or synchronization. Some hints: use faster filesystems to read/write data, improve parallel load balancing and/or scheduling.	[ 0 / 4 ] CPU activity is below 90% (13.92%). CPU cores are idle more than 10% of the time. Threads supposed to run on these cores are probably waiting on I/O or synchronization. Some hints: use faster filesystems to read/write data, improve parallel load balancing and/or scheduling.	[ 0 / 4 ] CPU activity is below 90% (13.51%). CPU cores are idle more than 10% of the time. Threads supposed to run on these cores are probably waiting on I/O or synchronization. Some hints: use faster filesystems to read/write data, improve parallel load balancing and/or scheduling.
[ 3 / 4 ] Affinity stability is lower than 90% (80.44%). Threads are often migrating to other CPU cores/threads. For OpenMP, typically set (OMP_PLACES=cores OMP_PROC_BIND=close) or (OMP_PLACES=threads OMP_PROC_BIND=spread). With OpenMPI + OpenMP, use --bind-to core --map-by node:PE=$OMP_NUM_THREADS --report-bindings. With IntelMPI + OpenMP, set I_MPI_PIN_DOMAIN=omp:compact or I_MPI_PIN_DOMAIN=omp:scatter and use -print-rank-map.	[ 2 / 4 ] Affinity stability is lower than 90% (56.02%). Threads are often migrating to other CPU cores/threads. For OpenMP, typically set (OMP_PLACES=cores OMP_PROC_BIND=close) or (OMP_PLACES=threads OMP_PROC_BIND=spread). With OpenMPI + OpenMP, use --bind-to core --map-by node:PE=$OMP_NUM_THREADS --report-bindings. With IntelMPI + OpenMP, set I_MPI_PIN_DOMAIN=omp:compact or I_MPI_PIN_DOMAIN=omp:scatter and use -print-rank-map.	[ 2 / 4 ] Affinity stability is lower than 90% (51.56%). Threads are often migrating to other CPU cores/threads. For OpenMP, typically set (OMP_PLACES=cores OMP_PROC_BIND=close) or (OMP_PLACES=threads OMP_PROC_BIND=spread). With OpenMPI + OpenMP, use --bind-to core --map-by node:PE=$OMP_NUM_THREADS --report-bindings. With IntelMPI + OpenMP, set I_MPI_PIN_DOMAIN=omp:compact or I_MPI_PIN_DOMAIN=omp:scatter and use -print-rank-map.
[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (68.41%). If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.	[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (57.11%). If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.	[ 4 / 4 ] Enough of the experiment time is spent in analyzed loops (57.18%). If the time spent in analyzed loops is less than 30%, standard loop optimizations will have a limited impact on application performance.
[ 3 / 3 ] Cumulative outermost/in-between loop coverage (26.43%) is lower than cumulative innermost loop coverage (41.98%). Having cumulative outermost/in-between loop coverage greater than cumulative innermost loop coverage makes loop optimization more complex.	[ 3 / 3 ] Cumulative outermost/in-between loop coverage (28.53%) is lower than cumulative innermost loop coverage (28.58%). Having cumulative outermost/in-between loop coverage greater than cumulative innermost loop coverage makes loop optimization more complex.	[ 3 / 3 ] Cumulative outermost/in-between loop coverage (23.70%) is lower than cumulative innermost loop coverage (33.48%). Having cumulative outermost/in-between loop coverage greater than cumulative innermost loop coverage makes loop optimization more complex.
[ 0 / 4 ] A significant number of threads are idle (99.48%). On average, more than 10% of observed threads are idle. Such threads are probably waiting on I/O or synchronization. Some hints: use faster filesystems to read/write data, improve parallel load balancing and/or scheduling.	[ 0 / 4 ] A significant number of threads are idle (89.24%). On average, more than 10% of observed threads are idle. Such threads are probably waiting on I/O or synchronization. Some hints: use faster filesystems to read/write data, improve parallel load balancing and/or scheduling.	[ 0 / 4 ] A significant number of threads are idle (89.82%). On average, more than 10% of observed threads are idle. Such threads are probably waiting on I/O or synchronization. Some hints: use faster filesystems to read/write data, improve parallel load balancing and/or scheduling.
[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations. BLAS2 calls usually make poor use of the cache and could benefit from inlining.	[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations. BLAS2 calls usually make poor use of the cache and could benefit from inlining.	[ 2 / 2 ] Less than 10% (0.00%) is spent in BLAS2 operations. BLAS2 calls usually make poor use of the cache and could benefit from inlining.
[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (41.98%). If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.	[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (28.58%). If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.	[ 4 / 4 ] Enough of the experiment time is spent in analyzed innermost loops (33.48%). If the time spent in analyzed innermost loops is less than 15%, standard innermost loop optimizations such as vectorization will have a limited impact on application performance.
[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations. It could be more efficient to inline BLAS1 operations by hand.	[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations. It could be more efficient to inline BLAS1 operations by hand.	[ 3 / 3 ] Less than 10% (0.00%) is spent in BLAS1 operations. It could be more efficient to inline BLAS1 operations by hand.
[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)	[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)	[ 2 / 2 ] Less than 10% (0.00%) is spent in Libm/SVML (special functions)
[ 4 / 4 ] Loop profile is not flat. At least one loop has a coverage greater than 4% (10.93%), representing a hotspot for the application.	[ 4 / 4 ] Loop profile is not flat. At least one loop has a coverage greater than 4% (14.13%), representing a hotspot for the application.	[ 4 / 4 ] Loop profile is not flat. At least one loop has a coverage greater than 4% (13.12%), representing a hotspot for the application.
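
The affinity hint in the table above suggests setting OMP_PLACES and OMP_PROC_BIND. A minimal sketch of applying that hint from a launcher script follows. It assumes the application is OpenMP-based, which the report does not confirm, and the helper name `pinned_env` is hypothetical.

```python
import os


def pinned_env(base=None, places="cores", bind="close"):
    """Return a copy of the environment with the OpenMP affinity variables set."""
    env = dict(os.environ if base is None else base)
    env["OMP_PLACES"] = places      # where threads may be placed
    env["OMP_PROC_BIND"] = bind     # how threads are bound to those places
    return env


env = pinned_env()
print(env["OMP_PLACES"], env["OMP_PROC_BIND"])

# To launch a run with this environment (the binary path is a placeholder):
# import subprocess
# subprocess.run(["./multithreading_assembly_perf_test"], env=env, check=True)
```

Passing `places="threads", bind="spread"` gives the second combination named in the hint. Which of the two reduces thread migration for this workload would have to be measured with a new profile.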

▼Optimizer

▶Loops List

r0 - maq_icx - 10 analyzed loop(s)
- Loop 565 - multithreading_assembly_perf_test
- Loop 564 - multithreading_assembly_perf_test
- Loop 715 - libboundary_conditions.so
- Loop 713 - libboundary_conditions.so
- Loop 5162 - libfinite_elements.so
- Loop 283 - multithreading_assembly_perf_test
- Loop 285 - multithreading_assembly_perf_test
- Loop 352 - multithreading_assembly_perf_test
- Loop 836 - multithreading_assembly_perf_test
- Loop 284 - multithreading_assembly_perf_test
r1 - maq_gcc - 10 analyzed loop(s)
- Loop 294 - multithreading_assembly_perf_test
- Loop 296 - multithreading_assembly_perf_test
- Loop 2895 - libfinite_elements.so
- Loop 2892 - libfinite_elements.so
- Loop 623 - multithreading_assembly_perf_test
- Loop 7575 - libfinite_elements.so
- Loop 1548 - libfinite_elements.so
- Loop 7573 - libfinite_elements.so
- Loop 736 - multithreading_assembly_perf_test
- Loop 38 - multithreading_assembly_perf_test
r2 - maq_clang - 10 analyzed loop(s)
- Loop 451 - multithreading_assembly_perf_test
- Loop 450 - multithreading_assembly_perf_test
- Loop 272 - libboundary_conditions.so
- Loop 271 - libboundary_conditions.so
- Loop 195 - multithreading_assembly_perf_test
- Loop 273 - libboundary_conditions.so
- Loop 196 - multithreading_assembly_perf_test
- Loop 197 - multithreading_assembly_perf_test
- Loop 201 - multithreading_assembly_perf_test
- Loop 3626 - libfinite_elements.so

Analysis	Issue	r0	r1	r2
Loop Computation Issues	Less than 10% of the FP ADD/SUB/MUL arithmetic operations are performed using FMA	1	0	2
Loop Computation Issues	Presence of a large number of scalar integer instructions	3	3	3
Control Flow Issues	Presence of calls	1	2	2
Control Flow Issues	Presence of 2 to 4 paths	3	2	2
Control Flow Issues	Presence of more than 4 paths	2	2	3
Control Flow Issues	Non-innermost loop	4	5	5
Data Access Issues	Presence of constant non-unit stride data access	3	3	2
Data Access Issues	More than 10% of the vector loads instructions are unaligned	3	1	0
Data Access Issues	Presence of special instructions executing on a single port	2	4	4
Data Access Issues	More than 20% of the loads are accessing the stack	2	2	1
Vectorization Roadblocks	Presence of calls	1	2	2
Vectorization Roadblocks	Presence of 2 to 4 paths	3	2	2
Vectorization Roadblocks	Presence of more than 4 paths	2	4	3
Vectorization Roadblocks	Non-innermost loop	4	5	5
Vectorization Roadblocks	Presence of constant non-unit stride data access	3	3	2
Vectorization Roadblocks	Out of user code	1	0	0
Inefficient Vectorization	Presence of special instructions executing on a single port	2	4	4
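
To see where the three compilers differ most, the rows of the table above can be ranked by the gap between the highest and lowest loop count. This is a rough sketch over the Data Access Issues rows only; the shortened issue labels and the helper names `spread` and `rank_by_spread` are assumptions for illustration, not part of ONE View.

```python
# Loop counts (r0, r1, r2) copied from the Data Access Issues rows above.
rows = {
    "Constant non-unit stride data access": (3, 3, 2),
    "More than 10% of vector loads unaligned": (3, 1, 0),
    "Special instructions on a single port": (2, 4, 4),
    "More than 20% of loads access the stack": (2, 2, 1),
}
runs = ("maq_icx", "maq_gcc", "maq_clang")  # r0, r1, r2


def spread(counts):
    """Gap between the most and least affected run."""
    return max(counts) - min(counts)


def rank_by_spread(table):
    """Rows ordered from the largest gap to the smallest (ties keep table order)."""
    return sorted(table.items(), key=lambda item: spread(item[1]), reverse=True)


for issue, counts in rank_by_spread(rows):
    # On a tie for the maximum, the first run in r0, r1, r2 order is reported.
    worst = runs[counts.index(max(counts))]
    print(f"{spread(counts)}  {issue}: {counts} (highest count: {worst})")
```

A large gap only says that the compilers generated different code for the analyzed loops; the loops themselves differ per run, so the counts are not a like-for-like comparison.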
