OV - direct assembly 128 threads

multithreading_assembly_perf_test - 2025-07-30 12:07:29 - MAQAO 2025.1.0

Global Metrics

Total Time (s)		126.84
Max (Thread Active Time) (s)		11.86
Average Active Time (s)		4.70
Activity Ratio (%)		3.73
Average number of active threads		4.747
Affinity Stability (%)		4.40
Time in analyzed loops (%)		29.4
Time in analyzed innermost loops (%)		22.0
Time in user code (%)		45.2
Compilation Options Score (%)		100
Array Access Efficiency (%)		82.0
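
The threading metrics above appear to be related by simple arithmetic. The sketch below re-derives them from the reported times, using the 128 threads observed in the Experiment Summary. The two formulas are assumptions about how the metrics are defined, not MAQAO's documented definitions, and the helper names are hypothetical. They reproduce the reported values only approximately.

```python
# Values copied from the Global Metrics table.
total_time_s = 126.84
max_thread_active_s = 11.86
avg_active_time_s = 4.70
threads_observed = 128  # from the Experiment Summary


def activity_ratio_pct(avg_active_s: float, total_s: float) -> float:
    """Assumed definition: mean per-thread active time over wall time."""
    return 100.0 * avg_active_s / total_s


def avg_active_threads(avg_active_s: float, total_s: float, n_threads: int) -> float:
    """Assumed definition: thread count scaled by the activity ratio."""
    return n_threads * avg_active_s / total_s


ratio = activity_ratio_pct(avg_active_time_s, total_time_s)
active = avg_active_threads(avg_active_time_s, total_time_s, threads_observed)
# Share of the wall time during which the busiest thread was active.
busiest_pct = 100.0 * max_thread_active_s / total_time_s

print(f"activity ratio        ~ {ratio:.2f} % (reported: 3.73)")
print(f"average active threads ~ {active:.3f} (reported: 4.747)")
print(f"busiest thread active  ~ {busiest_pct:.2f} % of total time")
```

Under these assumptions, even the busiest thread is active for less than a tenth of the run, which is consistent with the low Activity Ratio reported by the tool.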

Potential Speedups
Scenario	Potential Speedup	Nb Loops to get 80%
Perfect Flow Complexity	1.00	-
Perfect OpenMP + MPI + Pthread	1.00	-
Perfect OpenMP + MPI + Pthread + Perfect Load Distribution	2.55	-
No Scalar Integer	1.05	8
FP Vectorised	1.02	4
Fully Vectorised	1.12	12
FP Arithmetic Only	1.14	11
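
To put the speedup factors in terms of run time, the sketch below divides the reported Total Time by each factor. It assumes each factor applies to the whole 126.84 s wall time and that the scenarios are considered independently, which the report does not state explicitly. The dictionary keys are shortened labels chosen for this example.

```python
# Total Time (s) from the Global Metrics table.
total_time_s = 126.84

# Potential speedup factors copied from the table above.
speedups = {
    "Perfect Flow Complexity": 1.00,
    "Perfect OpenMP + MPI + Pthread": 1.00,
    "Perfect Load Distribution": 2.55,
    "No Scalar Integer": 1.05,
    "FP Vectorised": 1.02,
    "Fully Vectorised": 1.12,
    "FP Arithmetic Only": 1.14,
}

# Assumption: each factor is a whole-application speedup, so the
# projected run time is simply the measured time divided by the factor.
projected_s = {name: total_time_s / factor for name, factor in speedups.items()}

# Print the scenarios from the largest to the smallest projected gain.
for name, seconds in sorted(projected_s.items(), key=lambda kv: kv[1]):
    saved = total_time_s - seconds
    print(f"{name:32s} {seconds:7.2f} s  (saves {saved:6.2f} s)")
```

Read this way, load distribution is the only scenario that changes the run time substantially; the vectorisation scenarios each save on the order of 2 to 16 seconds.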

[Charts not reproduced: CQA Potential Speedups Summary, Average Active Threads Count, Loop Based Profile, Innermost Loop Based Profile, Application Categorization]

Compilation Options

Source Object	Issue
libassembly.so	–
  Kokkos_OpenMP_Parallel_Scan.hpp
  finite_elements.hpp
  Kokkos_OpenMP_Parallel_For.hpp
libfinite_elements.so	–
  PacketMath.h
  MapBase.h
  material_brick.hpp
  GeneralMatrixMatrix.h
  GeneralMatrixVector.h
  GeneralProduct.h
  generic_elements.hpp
  GemmKernel.h
  stl_vector.h
  GeneralBlockPanelKernel.h
  Matrix.h
  element_U.tpp
  TensorDeviceDefault.h
libamat.so	–
  behavior_base.hpp
  behavior_integrator_direct.hpp
  TensorMap.h
  behavior_base.cpp
  GeneralMatrixVector.h
  integration_point_data_view.cpp
  elastic_behavior.cpp
  TensorExecutor.h
  material_context.cpp
  ProductEvaluators.h
libdofs.so	–
  dof_list.cpp
  stl_vector.h
  MapBase.h
  stl_iterator.h
  dof.cpp
multithreading_assembly_perf_test	–
  std_function.h
libboundary_conditions.so	–
  GemmKernel.h

[Charts not reproduced: Loop Path Count Profile, Cumulated Speedup If No Scalar Integer, Cumulated Speedup If FP Vectorized, Cumulated Speedup If Fully Vectorized, Cumulated Speedup If FP Arithmetic Only]

Experiment Summary

Experiment Name	direct assembly 128 threads
Application	./multithreading_assembly_perf_test
Timestamp	2025-07-30 12:07:29
Universal Timestamp	1753870049
Number of processes observed	1
Number of threads observed	128
Experiment Type	OpenMP
Machine	be-par054
Model Name	AMD EPYC 9534 64-Core Processor
Architecture	x86_64
Micro Architecture	ZEN_V4
Cache Size	1024 KB
Number of Cores	64
OS Version	Linux 4.18.0-477.10.1.el8_8.x86_64 #1 SMP Wed Apr 5 13:35:01 EDT 2023
Architecture used during static analysis	x86_64
Micro Architecture used during static analysis	ZEN_V4
Frequency Driver	acpi-cpufreq
Frequency Governor	performance
Huge Pages	always
Hyperthreading	off
Number of sockets	2
Number of cores per socket	64
Compilation Options (libamat.so)	GNU C++20 13.2.0 -march=znver4 -mprefer-vector-width=256 -g3 -O3 -std=c++20 -fno-omit-frame-pointer -fopenmp -funroll-loops -fPIC
Compilation Options (libassembly.so)	GNU C++20 13.2.0 -march=znver4 -mprefer-vector-width=256 -g3 -O3 -std=c++20 -fno-omit-frame-pointer -funroll-loops -fPIC -fopenmp
Compilation Options (libboundary_conditions.so)	GNU C++20 13.2.0 -march=znver4 -mprefer-vector-width=256 -g3 -O3 -std=c++20 -fno-omit-frame-pointer -fopenmp -funroll-loops -fPIC
Compilation Options (libdofs.so)	GNU C++20 13.2.0 -march=znver4 -mprefer-vector-width=256 -g3 -O3 -std=c++20 -fno-omit-frame-pointer -fopenmp -funroll-loops -fPIC
Compilation Options (libfinite_elements.so)	GNU C++20 13.2.0 -march=znver4 -mprefer-vector-width=256 -g3 -O3 -std=c++20 -fno-omit-frame-pointer -fopenmp -funroll-loops -fPIC
Compilation Options (multithreading_assembly_perf_test)	GNU C++20 13.2.0 -march=znver4 -mprefer-vector-width=256 -g3 -O3 -std=c++20 -fno-omit-frame-pointer -funroll-loops -fopenmp
Comments

Configuration Summary

Dataset
Run Command	<executable> --method direct --ncut 280 --max_threads=128 --min_threads=128
Number Processes	1
Number Nodes	1
Filter	Not Used
Profile Start	Not Used
Maximal Path Number	4
