If you don't want to use __z88dk_callee it could be good to keep in mind that
    ;                 cycles   size
    ld   iy, #2     ; 16       4
    add  iy, sp     ; 17       2
    ld   c, 0(iy)   ; 21       3
    ld   b, 1(iy)   ; 21       3
    ;               => 75      => 12
can be replaced with
    ;                 cycles   size
    push hl         ; 12       1
    ld   hl, #4     ; 11       3
    add  hl, sp     ; 12       1
    ld   c, (hl)    ; 8        1
    inc  hl         ; 7        1
    ld   b, (hl)    ; 8        1
    pop  hl         ; 11       1
    ;               => 69      => 9
Or you can use Bengalack's alternative without __z88dk_callee, as long as you push bc back:
    ;                 cycles   size
    pop  iy         ; 16       2
    pop  bc         ; 11       1
    push bc         ; 12       1
    ;               => 39      => 4
You will still need to exit with jp (iy) instead of ret here, unless you push iy back too.
Surely the ret to jump indirectly will still work? You push the jump address right before the ret.
But of course it depends on where this ends up in the end, and only aoineko will know. I was only thinking that if you already have the return address in iy, it makes sense to use jp (iy). It is "only" 10 cycles. Doing push + ret instead is 17+11 = 28.
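To put the numbers from the posts above side by side, here is a small C sketch that just adds them up. The cycle and size figures are copied from the listings above; the struct, the array names and the COUNT macro are made up for this illustration and are not part of anyone's library.

```c
#include <stddef.h>

/* One instruction with its cost: cycles and size in bytes,
   as quoted in the listings above. */
typedef struct {
    const char *text;
    unsigned cycles;
    unsigned bytes;
} Instr;

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Classic IY-based fetch of the stack parameter. */
static const Instr fetch_iy[] = {
    {"ld iy, #2", 16, 4}, {"add iy, sp", 17, 2},
    {"ld c, 0(iy)", 21, 3}, {"ld b, 1(iy)", 21, 3},
};

/* HL-based fetch, preserving HL. */
static const Instr fetch_hl[] = {
    {"push hl", 12, 1}, {"ld hl, #4", 11, 3}, {"add hl, sp", 12, 1},
    {"ld c, (hl)", 8, 1}, {"inc hl", 7, 1}, {"ld b, (hl)", 8, 1},
    {"pop hl", 11, 1},
};

/* Pop-based fetch that pushes bc back for a caller-cleaned stack. */
static const Instr fetch_pop[] = {
    {"pop iy", 16, 2}, {"pop bc", 11, 1}, {"push bc", 12, 1},
};

/* Exit costs discussed above. */
enum { EXIT_JP_IY = 10, EXIT_PUSH_IY_RET = 17 + 11 };

static unsigned total_cycles(const Instr *seq, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += seq[i].cycles;
    return sum;
}

static unsigned total_bytes(const Instr *seq, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += seq[i].bytes;
    return sum;
}
```

With these figures the pop-based entry plus jp (iy) comes to 39 + 10 = 49 cycles, and 39 + 28 = 67 if iy is pushed back and ret is used instead.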
Here is the "final" version for the record:
    void Mem_FastCopy(const void* src, void* dest, u16 size) __naked
    {
        src;  // HL
        dest; // DE
        size; // SP+2

    __asm
        // Get parameters
        pop  iy                          // 16 cc (return address)
        pop  bc                          // 11 cc (retrieve size)

    mem_fastcopy_setup:
        // Setup fast LDIR loop
        xor  a                           // 5 cc
        sub  c                           // 5 cc
        and  #15                         // 8 cc
        jp   z, mem_fastcopy_loop        // 11 cc - total 29 cc (break-even at 16 loops)
        add  a                           // 5 cc
        exx                              // 5 cc
        add  a, #mem_fastcopy_loop       // 8 cc
        ld   l, a                        // 5 cc
        ld   a, #0                       // 8 cc
        adc  a, #mem_fastcopy_loop >> 8  // 8 cc
        ld   h, a                        // 5 cc
        push hl                          // 12 cc
        exx                              // 5 cc
        ret                              // 11 cc - total 101 cc (break-even at 25 loops)

    mem_fastcopy_loop:
        // Fast LDIR (with 16x unrolled LDI)
        .rept 16
        ldi                              // 18 cc
        .endm
        jp   pe, mem_fastcopy_loop       // 11 cc (0.6875 cc per ldi)

    mem_fastcopy_end:
        jp   (iy)                        // 10 cc
    __endasm;
    }
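For readers who find the computed jump into the unrolled block hard to follow, here is a rough C model of what the setup code works out. It is only a sketch of my reading of the listing: the function names are hypothetical, it assumes size is at least 1, and it relies on each LDI being 2 bytes long, which is why the listing doubles A before adding it to the loop address.

```c
#include <stdint.h>

/* LDI slots skipped on the first pass, as computed by
   "xor a / sub c / and #15": (0 - size) modulo 16. */
static unsigned skipped_slots(uint16_t size)
{
    return (0u - (unsigned)size) & 15u;
}

/* Byte offset added to mem_fastcopy_loop; each LDI is 2 bytes long. */
static unsigned entry_offset(uint16_t size)
{
    return skipped_slots(size) * 2u;
}

/* C model of the copy itself. Assumes size >= 1. */
static void model_fastcopy(const uint8_t *src, uint8_t *dest, uint16_t size)
{
    unsigned slot = skipped_slots(size); /* first pass may start mid-block */
    uint16_t bc = size;

    do {
        for (; slot < 16; ++slot) {      /* remaining LDIs of this pass */
            *dest++ = *src++;
            --bc;
        }
        slot = 0;                        /* later passes run all 16 LDIs */
    } while (bc != 0);                   /* jp pe: loop while BC != 0 */
}
```

For example, a size of 25 skips 7 slots (an offset of 14 bytes), so the first pass copies 9 bytes and the second pass copies the remaining 16.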
And here is a speed comparison (measured after the HL, DE and BC registers are set up, to count only the pure assembler part):
    loop count => gain in %
    16  => +9.6%  (break-even for multiple of 16)
    25  => +0.2%  (break-even for non-multiple of 16)
    30  => +3.4%
    32  => +14.2% (multiple of 16)
    40  => +7.2%
    48  => +15.7% (multiple of 16)
    50  => +9.5%
    100 => +14.2%
    128 => +17.6% (multiple of 16)
    500 => +17.8%
    512 => +18.5% (multiple of 16)
    ∞   => +18.7%
Thanks to all!
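If anyone wants to re-check figures like these, a rough cycle model in C could look like the sketch below. The LDI, jp and setup costs come from the listing; the LDIR timings (23 cc per repeated byte, 18 cc for the last one) are my assumption for the plain version and are not stated in this thread, and the function names are hypothetical. It assumes n is at least 1 and counts one jp pe per pass through the unrolled block.

```c
/* Cycle costs (cc). */
enum {
    CC_LDI        = 18,   /* from the listing */
    CC_JP         = 11,   /* jp pe at the end of each pass */
    CC_SETUP_MULT = 29,   /* setup when size is a multiple of 16 */
    CC_SETUP_FULL = 101,  /* setup otherwise */
    CC_LDIR_REP   = 23,   /* assumption: LDIR per byte while repeating */
    CC_LDIR_LAST  = 18    /* assumption: LDIR on the final byte */
};

/* Plain LDIR copying n bytes (n >= 1). */
static unsigned long ldir_cycles(unsigned n)
{
    return (unsigned long)(n - 1) * CC_LDIR_REP + CC_LDIR_LAST;
}

/* Unrolled version: setup + n LDIs + one jp pe per (partial) pass. */
static unsigned long fast_cycles(unsigned n)
{
    unsigned long passes = (n + 15u) / 16u;
    unsigned long setup = (n % 16u == 0) ? CC_SETUP_MULT : CC_SETUP_FULL;
    return setup + (unsigned long)n * CC_LDI + passes * CC_JP;
}

/* Gain in tenths of a percent, truncated toward zero;
   a negative value means the unrolled version is slower. */
static long gain_permille(unsigned n)
{
    long plain = (long)ldir_cycles(n);
    long fast = (long)fast_cycles(n);
    return (plain - fast) * 1000 / plain;
}
```

Under these assumptions the model gives 9.6%, 14.2%, 15.7% and 17.6% for 16, 32, 48 and 128 bytes. For sizes that are not a multiple of 16, the result depends on how the partial first pass and its jp pe are counted, so it need not match a given table to the last digit.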
Think you need __z88dk_callee in addition to __naked. If the above works, it is because the caller code has stored original SP-value.
In your calculation of the break-even, you forgot the "// Get parameters" part. It adds 75 t-states to the total.
This is also used for my "normal" ldir version, so I only counted the extra code between the two versions.
Yes, but it's better to include it, because parameter entry takes longer for the fast version.
Think you need __z88dk_callee in addition to __naked. If the above works, it is because the caller code has stored original SP-value.
It's a little bit off-topic, but __sdcccall(1) (the new default function signature) already acts like __z88dk_callee. The stack adjustment is done by the function, not the caller.
The documentation is not clear on that subject, but it's what I see in all my tests:
« If __z88dk_callee is not used, after the call, the stack parameters are cleaned up by the caller, with the following exceptions: functions that do not have variable arguments and return void or a type of at most 16 bits, or have both a first parameter of type float and a return value of type float. »
I added __z88dk_callee for peace of mind.
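As a way of spelling out the documentation passage quoted above, here is a small C predicate that encodes my reading of it. It is only a sketch: the FuncSig struct and the function name are invented for illustration, and the rule itself is taken from the quoted sentence, not verified against the compiler sources.

```c
#include <stdbool.h>

/* Hypothetical description of a function signature. */
typedef struct {
    bool has_varargs;        /* takes variable arguments */
    bool first_param_float;  /* first parameter is a float */
    bool returns_float;      /* return type is float */
    unsigned return_bits;    /* width of the return type, 0 for void */
    bool z88dk_callee;       /* declared with __z88dk_callee */
} FuncSig;

/* True when the called function, not the caller, removes the
   stack parameters, following the quoted rule. */
static bool callee_cleans_stack(const FuncSig *f)
{
    if (f->z88dk_callee)
        return true;
    if (!f->has_varargs && f->return_bits <= 16)
        return true;
    if (f->first_param_float && f->returns_float)
        return true;
    return false;
}
```

Under this reading, a function like Mem_FastCopy (no variable arguments, returns void) already cleans its own stack parameters, with or without __z88dk_callee.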
Great - that was news to me and very good to know. Thanks! If utilised, this can speed things up A LOT! I've been replacing the old "retrieve-from-the-stack dance" (ld iy,#2 / add iy,sp / etc.) with sets of pops in several places. So much faster.
By doing this, you need to keep the jp (iy), and not replace it by ret as Grauw suggested.
Grauw is right. I didn't look carefully enough: it is perfectly fine to use ret in this case.
Replace
jp z, mem_fastcopy_loop
with
jr z, mem_fastcopy_loop
and you will gain another 5 cc in most cases.
But then I lose 2 cc for multiples of 16, don't I?
I like to have this "multiple of 16" optimization.