OV - vmc.mov1 - Loop 5110

0x5fc0a6 VMOVUPD	(%R14,%R15,8),%YMM0    [2]

0x5fc0ac VMOVUPD	%YMM0,(%RDX,%R15,8)    [1]

0x5fc0b2 ADD	$0x4,%R15

0x5fc0b6 CMP	%RCX,%R15

0x5fc0b9 JB	5fc0a6

/home/kcamus/comparative/champ/champ/src/vmc/detsav.f: 69 - 70

--------------------------------------------------------------------------------

69:            do k=ndetsingle(iab)+1,kcum

70:               wfmat(k,1:4,iab,stoo(istate))=wfmatn(k,1:4,stoo(istate))
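
Read together, the five instructions and the Fortran statement describe a block copy: each trip through the binary loop moves one 32-byte YMM register (four 8-byte elements) and advances the index by 4. The Python sketch below models only that main loop. The function name, the list-based arrays, and the omission of any peel or remainder loop are assumptions made for illustration; the report does not show them.

```python
def copy_main_loop(src, dst, index, bound, lanes=4):
    """Model of the 5-instruction loop: copy `lanes` elements per trip.

    The body runs before the bound check, mirroring the ADD/CMP/JB
    sequence at the bottom of the binary loop.
    """
    while True:
        # VMOVUPD load followed by VMOVUPD store: move one 4-element chunk.
        dst[index:index + lanes] = src[index:index + lanes]
        index += lanes          # ADD $0x4,%R15
        if not index < bound:   # CMP %RCX,%R15 ; JB repeats while index < bound
            break
    return index


# Hypothetical data: copy the first 8 of 12 elements in two trips.
src = [float(i) for i in range(12)]
dst = [0.0] * 12
final_index = copy_main_loop(src, dst, index=0, bound=8)
print(final_index, dst[:8] == src[:8])
```

Because the check sits at the bottom, the body always runs at least once, so any guard against an empty range would have to live outside this loop.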

Coverage (%)	Name	Source Location	Module
►60.71+	metrop6	metrop_mov1_slat.f:659	vmc.mov1
○	vmc	vmc.f:157	vmc.mov1
○	optwf_sr	optwf_sr.f90:258	vmc.mov1
○	MAIN_	main.f90:96	vmc.mov1
○	main		vmc.mov1
○	__libc_init_first		libc.so.6
►39.29+	metrop6	metrop_mov1_slat.f:659	vmc.mov1
○	vmc	vmc.f:157	vmc.mov1
○	optwf_sr	optwf_sr.f90:164	vmc.mov1

Path /

Metric	Value
CQA speedup if no scalar integer	1.00
CQA speedup if FP arith vectorized	1.00
CQA speedup if fully vectorized	2.00
CQA speedup if no inter-iteration dependency	NA
CQA speedup if next bottleneck killed	1.20
Bottlenecks	micro-operation queue, P4
Function	detsav
Source	detsav.f:69-70
Source loop unroll info	unrolled by 4
Source loop unroll confidence level	max
Unroll/vectorization loop type	main
Unroll factor	4
CQA cycles	1.00
CQA cycles if no scalar integer	1.00
CQA cycles if FP arith vectorized	1.00
CQA cycles if fully vectorized	0.50
Front-end cycles	1.00
DIV/SQRT cycles	0.00
P0 cycles	0.50
P1 cycles	0.50
P2 cycles	0.83
P3 cycles	0.50
P4 cycles	1.00
P5 cycles	0.50
P6 cycles	0.50
P7 cycles	0.67
Inter-iter dependencies cycles	1
FE+BE cycles (UFS)	1.12
Stall cycles (UFS)	0.00
Nb insns	5.00
Nb uops	4.00
Nb loads	1.00
Nb stores	1.00
Nb stack references	0.00
FLOP/cycle	0.00
Nb FLOP add-sub	0.00
Nb FLOP mul	0.00
Nb FLOP fma	0.00
Nb FLOP div	0.00
Nb FLOP rcp	0.00
Nb FLOP sqrt	0.00
Nb FLOP rsqrt	0.00
Bytes/cycle	64.00
Bytes prefetched	0.00
Bytes loaded	32.00
Bytes stored	32.00
Stride 0	0.00
Stride 1	2.00
Stride n	0.00
Stride unknown	0.00
Stride indirect	0.00
Vectorization ratio all	100.00
Vectorization ratio load	100.00
Vectorization ratio store	100.00
Vectorization ratio mul	NA
Vectorization ratio add_sub	NA
Vectorization ratio fma	NA
Vectorization ratio div_sqrt	NA
Vectorization ratio other	NA
Vector-efficiency ratio all	50.00
Vector-efficiency ratio load	50.00
Vector-efficiency ratio store	50.00
Vector-efficiency ratio mul	NA
Vector-efficiency ratio add_sub	NA
Vector-efficiency ratio fma	NA
Vector-efficiency ratio div_sqrt	NA
Vector-efficiency ratio other	NA
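
Two of the derived values in this table follow from the raw counts by simple arithmetic. The sketch below recomputes them from the figures above. The formulas are inferred from the numbers in this report rather than taken from MAQAO documentation, and the variable names are illustrative.

```python
# Raw values copied from the metrics table.
cqa_cycles = 1.00
cqa_cycles_if_fully_vectorized = 0.50
bytes_loaded = 32.00
bytes_stored = 32.00

# Assumed relation: total memory traffic per iteration divided by its cost.
bytes_per_cycle = (bytes_loaded + bytes_stored) / cqa_cycles

# Assumed relation: current cost divided by the projected cost.
speedup_if_fully_vectorized = cqa_cycles / cqa_cycles_if_fully_vectorized

print(bytes_per_cycle, speedup_if_fully_vectorized)
```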

Function	detsav
Source file and lines	detsav.f:69-70
Module	vmc.mov1

The loop is defined in /home/kcamus/comparative/champ/champ/src/vmc/detsav.f:69-70.

It is the main loop of the related source loop, which is unrolled by 4 (including vectorization).

Confidence levels: gain, potential, hint, expert

Vectorization

Your loop is vectorized, but using only 256 out of 512 bits (AVX/AVX2 instructions on AVX-512 processors).

Details

All SSE/AVX instructions are used in their vector versions (each processes two or more data elements in vector registers).

Workaround

Read the "512-bits vectorization on Skylake SP" report at "Potential" confidence level.

Execution units bottlenecks

Performance is limited by writing data to caches/RAM (the store unit is a bottleneck). By removing all these bottlenecks, you can lower the cost of an iteration from 1.00 to 0.83 cycles (1.20x speedup).

Workaround

Write fewer array elements
Provide more information to your compiler:
- hardcode the bounds of the corresponding loop (a 'do' loop in this Fortran source)

512-bits vectorization on Skylake SP and Icelake Server

On Gold 5122, 6xxx and Platinum Skylake processors and on Icelake Server processors, performance can be improved by using 512-bit vectorization if the number of vectorized loops is high and their trip counts are high.

Workaround

Recompile with -qopt-zmm-usage=high

Type of elements and instruction set

No instructions are processing arithmetic or math operations on FP elements. This loop is probably writing/copying data or processing integer elements.

Matching between your loop (in the source code) and the binary loop

The binary loop does not contain any FP arithmetical operations. The binary loop is loading 32 bytes. The binary loop is storing 32 bytes.

General properties

nb instructions	5
nb uops	4
loop length	21
used x86 registers	4
used mmx registers	0
used xmm registers	0
used ymm registers	1
used zmm registers	0
nb stack references	0

Front-end

ASSUMED MACRO FUSION
FIT IN UOP CACHE

micro-operation queue	1.00 cycles
front end	1.00 cycles

Back-end

	P0	P1	P2	P3	P4	P5	P6	P7
uops	0.50	0.50	0.83	0.50	1.00	0.50	0.50	0.67
cycles	0.50	0.50	0.83	0.50	1.00	0.50	0.50	0.67

Cycles executing div or sqrt instructions	NA
Longest recurrence chain latency (RecMII)	1.00
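
The back-end cost of an iteration is set by the busiest port. A minimal sketch, using the per-port cycle values from the table above and assuming that the back-end bound is the maximum over ports (the dictionary and variable names are illustrative):

```python
# Per-port cycles copied from the Back-end table.
port_cycles = {
    "P0": 0.50, "P1": 0.50, "P2": 0.83, "P3": 0.50,
    "P4": 1.00, "P5": 0.50, "P6": 0.50, "P7": 0.67,
}

# The most loaded port bounds the back-end throughput.
busiest_port = max(port_cycles, key=port_cycles.get)
backend_cycles = port_cycles[busiest_port]

# The second-highest value is what would remain if that port stopped limiting.
next_busiest_cycles = sorted(port_cycles.values(), reverse=True)[1]

print(busiest_port, backend_cycles, next_busiest_cycles)
```

P4 is the port listed with a full cycle for the VMOVUPD store in the ASM code table, which matches the report naming the store unit as a bottleneck.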

Front-end and detailed OoO resources (UFS)

FE+BE cycles	1.12
Stall cycles	0.00

Cycles summary

Front-end	1.00
Dispatch	1.00
Data deps.	1.00
Overall L1	1.00

Vectorization ratios

all	100%
load	100%
store	100%
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	NA (no add-sub vectorizable/vectorized instructions)
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)

Vector efficiency ratios

all	50%
load	50%
store	50%
mul	NA (no mul vectorizable/vectorized instructions)
add-sub	NA (no add-sub vectorizable/vectorized instructions)
fma	NA (no fma vectorizable/vectorized instructions)
div/sqrt	NA (no div/sqrt vectorizable/vectorized instructions)
other	NA (no other vectorizable/vectorized instructions)
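
The 50% figures are consistent with the ratio of the register width actually used to the widest width available on the target. A sketch under that assumption (the reading of the metric and the names are inferred from this report, not from the tool's documentation):

```python
used_vector_bits = 256       # YMM registers, as used by both VMOVUPD instructions
available_vector_bits = 512  # ZMM registers on AVX-512 processors
element_bits = 64            # 8-byte elements, per the scale factor 8 in the addressing

vector_efficiency_pct = 100.0 * used_vector_bits / available_vector_bits
elements_per_ymm = used_vector_bits // element_bits
elements_per_zmm = available_vector_bits // element_bits

print(vector_efficiency_pct, elements_per_ymm, elements_per_zmm)
```

Under this reading, a 512-bit version of the loop would move 8 elements per iteration instead of 4.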

Cycles and memory resources usage

Assuming all data fit into the L1 cache, each iteration of the binary loop takes 1.00 cycles. At this rate:

25% of peak load performance is reached (32.00 out of 128.00 bytes loaded per cycle (GB/s @ 1GHz))
50% of peak store performance is reached (32.00 out of 64.00 bytes stored per cycle (GB/s @ 1GHz))
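
The two percentages can be reproduced directly from the byte counts and the peak figures quoted above. In this sketch the variable names are illustrative, and the peak values are simply the ones stated in the report:

```python
cycles_per_iteration = 1.00
bytes_loaded_per_iteration = 32.00
bytes_stored_per_iteration = 32.00
peak_load_bytes_per_cycle = 128.00
peak_store_bytes_per_cycle = 64.00

load_rate = bytes_loaded_per_iteration / cycles_per_iteration
store_rate = bytes_stored_per_iteration / cycles_per_iteration

load_pct_of_peak = 100.0 * load_rate / peak_load_bytes_per_cycle
store_pct_of_peak = 100.0 * store_rate / peak_store_bytes_per_cycle

# At 1 GHz, one byte per cycle equals 1 GB/s, so the rates double as GB/s.
print(load_pct_of_peak, store_pct_of_peak)
```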

Front-end bottlenecks

Performance is limited by instruction throughput (loading/decoding program instructions to execution core) (front-end is a bottleneck). By removing all these bottlenecks, you can lower the cost of an iteration from 1.00 to 0.83 cycles (1.20x speedup).
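
The quoted speedup is the ratio of the two iteration costs. A short check, assuming only that definition (the names are illustrative):

```python
current_cycles = 1.00
cycles_without_bottleneck = 0.83

speedup = current_cycles / cycles_without_bottleneck
print(f"{speedup:.2f}x")
```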

ASM code

In the binary file, the address of the loop is: 5fc0a6

Instruction	Nb FU	P0	P1	P2	P3	P4	P5	P6	P7	Latency	Recip. throughput
VMOVUPD (%R14,%R15,8),%YMM0	1	0	0	0.50	0.50	0	0	0	0	5-6	0.50
VMOVUPD %YMM0,(%RDX,%R15,8)	1	0	0	0.33	0.33	1	0	0	0.33	3	1
ADD $0x4,%R15	1	0.25	0.25	0	0	0	0.25	0.25	0	1	0.25
CMP %RCX,%R15	1	0.25	0.25	0	0	0	0.25	0.25	0	1	0.25
JB 5fc0a6	1	0.50	0	0	0	0	0	0.50	0	0	0.50-1
