Adjust writeback in non-zero memset

This fixes an ineffiency in the non-zero memset.  Delaying the writeback
until the end of the loop is slightly faster on some cores - this shows
~5% performance gain on Cortex-A53 when doing large non-zero memsets.

Tested against the GLIBC testsuite.
This commit is contained in:
Wilco Dijkstra 2018-11-06 14:42:10 +00:00 committed by Richard Earnshaw
parent 535903696c
commit d80db60066
1 changed files with 3 additions and 3 deletions

View File

@ -142,10 +142,10 @@ L(set_long):
b.eq L(try_zva)
L(no_zva):
sub count, dstend, dst /* Count is 16 too large. */
add dst, dst, 16
sub dst, dst, 16 /* Dst is biased by -32. */
sub count, count, 64 + 16 /* Adjust count and bias for loop. */
1: stp q0, q0, [dst], 64
stp q0, q0, [dst, -32]
1: stp q0, q0, [dst, 32]
stp q0, q0, [dst, 64]!
L(tail64):
subs count, count, 64
b.hi 1b