Adjust writeback in non-zero memset
This fixes an inefficiency in the non-zero memset. Delaying the writeback until the end of the loop is slightly faster on some cores: it gives a ~5% performance gain on Cortex-A53 for large non-zero memsets. Tested against the GLIBC testsuite.
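For illustration, here are the two loop shapes from the diff below side by side, with the addressing modes annotated (the surrounding setup and the dst bias are shown in the diff itself):

Before, the first store of each iteration performs the writeback:

1:	stp	q0, q0, [dst], 64	/* Store 32 bytes at dst; post-index: dst += 64.  */
	stp	q0, q0, [dst, -32]	/* Store 32 bytes at the updated dst - 32.  */
	subs	count, count, 64
	b.hi	1b

After, the writeback is delayed to the last store of the iteration:

1:	stp	q0, q0, [dst, 32]	/* Store 32 bytes at dst + 32; no address update.  */
	stp	q0, q0, [dst, 64]!	/* Pre-index: dst += 64, then store 32 bytes.  */
	subs	count, count, 64
	b.hi	1b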
committed by Richard Earnshaw

parent 535903696c
commit d80db60066
@@ -142,10 +142,10 @@ L(set_long):
 	b.eq	L(try_zva)
 L(no_zva):
 	sub	count, dstend, dst	/* Count is 16 too large.  */
-	add	dst, dst, 16
+	sub	dst, dst, 16		/* Dst is biased by -32.  */
 	sub	count, count, 64 + 16	/* Adjust count and bias for loop.  */
-1:	stp	q0, q0, [dst], 64
-	stp	q0, q0, [dst, -32]
+1:	stp	q0, q0, [dst, 32]
+	stp	q0, q0, [dst, 64]!
 L(tail64):
 	subs	count, count, 64
 	b.hi	1b
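For experimentation, a minimal standalone sketch of the new loop (a hypothetical wrapper, not part of this patch): it assumes x0 holds a 16-byte-aligned destination, x1 a byte count that is a nonzero multiple of 64, and that the caller has broadcast the fill byte into q0 (e.g. with dup v0.16b, w2).

	.text
	.global	set_loop		/* Hypothetical symbol, not in glibc.  */
	.type	set_loop, %function
set_loop:
	sub	x0, x0, 32		/* Bias dst by -32 so the loop offsets cover dst..dst+63.  */
1:	stp	q0, q0, [x0, 32]	/* First 32 bytes of this iteration.  */
	stp	q0, q0, [x0, 64]!	/* Last 32 bytes; the only writeback per iteration.  */
	subs	x1, x1, 64
	b.hi	1b
	ret
	.size	set_loop, . - set_loop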