suggestion on optimization

By ARTRAG

Enlighted (6976)

30-03-2019, 11:07

This code renders a list of objects held in RAM into the SAT.
Basically, if an object is active and its X,Y position (16-bit coordinates in the level map) falls inside the screen window (256x128, starting at Y=64), the object is copied into the SAT in RAM using one or two sprites, depending on its shape.

The screen window goes from -32 to 255 on X and from -16 to 127 on Y, so the EC bit is fully managed.
Do you have any suggestions to make it much faster?


_plot_enemy:

	ld	iy,ram_sat
	ld	ix,enemies
	ld	bc,max_enem*256+0

.npc_loop1:
	bit 0,(ix+enemy_data.status)
	jp	z,.next

	ld	l,(ix+enemy_data.y)
	ld	h,(ix+enemy_data.y+1)
	ld	de,16
	add hl,de
	ld	de,(ymap)
	and a
	sbc hl,de		; hl = enemy.y + 16 - ymap
	jp	c,.next		; enemy.y - ymap < -16

	ld	de,128+16
	sbc hl,de		; enemy.y - ymap + 16 - 128 - 16 >= 0 
	jp	nc,.next	; enemy.y - ymap  >= 128
	ld	de,128+64
	add	hl,de
	ld	a,l
	ex 	af,af

	ld	l,(ix+enemy_data.x+0)
	ld	h,(ix+enemy_data.x+1)
	ld	de,32
	add hl,de
	ld	de,(xmap)
	and a
	sbc hl,de		; hl = enemy.x + 32 - xmap < 0
	jp	c,.next		; hl <0  <==> dx = enemy.x - xmap < -32
	ld	de,32
	sbc hl,de		; enemy.x + 32 - xmap - 32 <0

	ld	a,(ix+enemy_data.color)
	jp nc,.noec		; -32 < dx < 0
	
	ld	a,(ix+enemy_data.frame)
	cp	16*4					; hard coded in the SPT
	jp	nc,.two_layers

.one_layer:
	ld	(iy+2),a				; write shape
	ld	(iy+1),l				; write X
	ex 	af,af					; write Y
	ld	(iy+0),a
	ld	(iy+3),e				; write colour
	inc c
	ld	de,4
	add iy,de
	; jp 	.next
	
	
.next:
	ld	de,enemy_data
	add ix,de
	djnz	.npc_loop1

	ld	a,c
	ld	(visible_sprts),a
	ret
	
.two_layers:
	ld	(iy+2),a				; write shape
	add	a,8
	ld	(iy+2+4),a				; second layer shape
	ld	(iy+1),l				; write X
	ld	(iy+1+4),l	
	ex 	af,af					; write Y
	ld	(iy+0),a
	ld	(iy+0+4),a
	
	ld	(iy+3),e				; write colour
	ld	a,e
	and 0xF0
	or 	1						; second layer colour
	ld	(iy+3+4),a	
[2]	inc c
	ld	de,8
	add iy,de
	jp 	.next

By Sandy Brand

Champion (309)

30-03-2019, 11:47

Using IX and IY is generally quite slow.

Index registers are best used for when your algorithm has to access memory in a more or less 'random' pattern. You could, however, try to think of a way that you can write into the SAT buffer sequentially from start to finish. In that case you could just put HL to the start of the table and just increment it as you write the sprite attributes into the table. For this to work you will need to find a way to somehow 'sort' your enemies in the order that they need to be written into the SAT table. This will require some sort of 'allocation' system every time when a new enemy is created.

For bonus points you could even make sure that your entire SAT buffer is located in a specific place such that the high-byte of the addresses never changes. That will enable you to use INC L instead of INC HL.
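
As a rough sketch of the sequential-write idea (an illustration, not code from this thread): assume the SAT buffer is placed so that all 128 bytes share the same high address byte, and that the caller has already put the four attribute bytes in B, C, D and E. The label write_sprite and this register assignment are hypothetical.

```asm
; Hypothetical helper: append one sprite to a page-aligned SAT buffer.
; In:  HL -> next free SAT slot (the high byte never changes)
;      B = Y, C = X, D = shape, E = colour + EC
; Out: HL -> following slot
write_sprite:
	ld	(hl),b		; Y
	inc	l
	ld	(hl),c		; X
	inc	l
	ld	(hl),d		; shape
	inc	l
	ld	(hl),e		; colour + EC
	inc	l			; may wrap to 0 after the very last slot of the page
	ret
```

On a standard Z80, LD (HL),r takes 7 T-states and INC L takes 4, against 19 for each LD (IY+d),r. The MSX adds a wait state to every M1 cycle, but the comparison stays clearly in favour of HL.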

Or, for additional bonus points, you could even not have a SAT table in memory in the first place, but just OUT the data directly into VRAM, but that will require a much more complex setup with regards to interrupt handling. :)

Some further micro optimization suggestions:

        ....
	add hl,de
	ld	de,(ymap)
	and a
	sbc hl,de		; hl = enemy.y + 16 - ymap
        ...

The ADD HL,DE will likely never produce a carry, so there is no need for the AND A to reset it.

        ...
	ld	de,128+16
	sbc hl,de		; enemy.y - ymap + 16 - 128 - 16 >= 0 
        ...

Subtracting constant 16 bit values can be done faster by adding the negated value (although you will need to rewrite some of the conditional logic that checks the carry afterwards):

LD DE,- (128 + 16) 
ADD HL,DE
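
For example, the upper-bound test on Y could be rewritten along these lines. This is only a sketch: it assumes HL already holds enemy.y + 16 - ymap, as in the original listing, and it reuses the .next label from there. Note that the sense of the carry test flips.

```asm
	; HL = enemy.y + 16 - ymap
	ld	de,-(128+16)
	add	hl,de		; HL = HL - 144, no "and a" needed beforehand
	jp	c,.next		; carry set <==> old HL >= 144 (unsigned): outside the window
```

ADD HL,DE leaves the same value in HL as the SBC did, so the "ld de,128+64 / add hl,de" that follows still works. Because the comparison is unsigned, a negative HL (object above the window) also sets the carry here. On a standard Z80, ADD HL,DE takes 11 T-states against 15 for SBC HL,DE.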

I notice that you use 'JP condition' for almost everything. I would suggest using 'JR condition' where you know that the condition will be false most of the time (a JR that is not taken costs fewer cycles than a JP).

By ARTRAG

Enlighted (6976)

30-03-2019, 12:31

I've already removed all the redundant "and a" instructions, and add hl,de and sbc hl,de cost the same time.
I need to keep the SAT in RAM because the ISR transfers it to VRAM alternately in direct and reverse order to reduce sprite flickering. I can't write to VRAM directly here unless I develop two routines, one scanning the objects in direct order and the other in reverse order, and put both in the ISR. It's doable, but it wastes precious vblank time, so it would be my last resort.

By theNestruo

Champion (429)

30-03-2019, 15:13

You are computing (enemy.y + 16 - ymap) and (enemy.x + 32 - xmap) for every sprite...
Could it help to compute (ymap2 = 16 - ymap) and (xmap2 = 32 - xmap) outside the loop, and then do (enemy.y + ymap2) and (enemy.x + xmap2) inside?

By ARTRAG

Enlighted (6976)

30-03-2019, 15:20

Good one! Thanks

By ARTRAG

Enlighted (6976)

30-03-2019, 15:46

This is the current revision:


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
;	plot enemies and bullets if visible
;
;	depends on xmap,ymap

_plot_enemy:

	ld	iy,(alt_ram_sat)
	ld	ix,enemies
	ld	bc,max_enem*256+0
	
	ld	hl,(ymap)
	ld	de,16
	and a
	sbc	hl,de
	ld	(tempy),hl

	ld	hl,(xmap)
	ld	de,32
	and a
	sbc	hl,de
	ld	(tempx),hl

.npc_loop1:
	bit 0,(ix+enemy_data.status)
	jp	z,.next

	ld	l,(ix+enemy_data.y)
	ld	h,(ix+enemy_data.y+1)
	ld	de,(tempy)
	and a
	sbc hl,de		; hl = enemy.y + 16 - ymap <0
	jp	m,.next		; enemy.y - ymap < -16

	ld	de,128+16
	sbc hl,de		; enemy.y - ymap + 16 - 128 - 16 >= 0 
	jp	nc,.next	; enemy.y - ymap  >= 128
	ld	e,128+64
	add	hl,de
	ld	(iy+0),l
	ld	(iy+0+4),l	; not needed if single layer but in this way it is overall faster 
	
	ld	l,(ix+enemy_data.x+0)
	ld	h,(ix+enemy_data.x+1)
	ld	de,(tempx)
	and a			
	sbc hl,de		; hl = enemy.x + 32 - xmap < 0
	jp	m,.next		; hl <0  <==> dx = enemy.x - xmap < -32
	
	ld	de,32
	sbc hl,de		; enemy.x + 32 - xmap - 32 <0

	ld	a,(ix+enemy_data.color)
	jp nc,.noec		; -32 < dx < 0
	
	ld	a,(ix+enemy_data.frame)
	cp	16*4					; hard coded in the SPT
	jp	nc,.two_layers

.one_layer:
	ld	(iy+1),l				; write X
	ld	(iy+2),a				; write shape
	ld	(iy+3),e				; write colour
	inc c
	ld	e,4
	add iy,de
	; jp 	.next
		
.next:
	ld	de,enemy_data
	add ix,de
	djnz	.npc_loop1

	ld	a,c
	ld	(alt_visible_sprts),a
	ret
	
.two_layers:
	ld	(iy+1),l				; write X
	ld	(iy+2),a				; write shape
	ld	(iy+3),e				; write colour
	
	ld	(iy+1+4),l				; second layer X
	add	a,8
	ld	(iy+2+4),a				; second layer shape
	ld	a,e
	and 0xF0
	or 	1						; second layer colour
	ld	(iy+3+4),a	
[2]	inc c
	ld	e,8
	add iy,de
	jp 	.next

By theNestruo

Champion (429)

31-03-2019, 09:54

Hi!

When computing tempx and tempy, flags are not needed, so I guess Sandy Brand's suggestion of replacing "ld de,16/and a/sbc hl,de" with "ld de,-16/add hl,de" can be applied.

If speed is far more important than size, avoid the last "jp .next" by repeating the entire ".next" section afterwards (thus avoiding the cost of the "jp").
On the other hand, a few bytes can be saved by moving one "inc c" and the "add iy,de" into the ".next" section. Also, in the "ld a,e/and 0xF0/or 1" sequence of ".two_layers", the last "or 1" can be replaced by a shorter "inc a".

Also, the entire ".noec" section is missing. Again, if speed is far more important than size, the ".two_layers" section could be duplicated as two versions ("_ec" and "_noec") to avoid the small extra cost of computing the second layer colour (that seems to be always 1 in your code).
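
The "inc a" replacement works because the AND has just cleared the low nibble, so adding 1 can only set bit 0. A sketch of the second-layer colour computation, assuming E holds the colour byte as in the listing above:

```asm
	ld	a,e			; colour byte of the first layer (EC flag in bit 7)
	and	0xF0		; keep the upper nibble, clear the colour nibble
	inc	a			; low nibble was 0, so it becomes 1: same result as "or 1"
	ld	(iy+3+4),a	; second layer colour
```

On a standard Z80, INC A is one byte and 4 T-states, against two bytes and 7 T-states for OR 1.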

By ARTRAG

Enlighted (6976)

31-03-2019, 23:43

Something in my comments was messing up the result in the forum. I hope this time it is fine.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
;	plot enemies and bullets if visible
;
;	depends on xmap,ymap

_plot_enemy:

	ld	iy,(alt_ram_sat)
	ld	ix,enemies
	ld	bc,max_enem*256+0
	
	ld	hl,(ymap)
	ld	de,-16
	add	hl,de
	ld	(tempy),hl

	ld	hl,(xmap)
	ld	de,-32
	add	hl,de
	ld	(tempx),hl

.npc_loop1:
	res 7,(ix+enemy_data.status)	; set it as invisible
	bit 0,(ix+enemy_data.status)
	jp	z,.next

	ld	l,(ix+enemy_data.y)
	ld	h,(ix+enemy_data.y+1)
	ld	de,(tempy)
	and a
	sbc hl,de		; hl = enemy.y + 16 - ymap <0
	jp	m,.next		; enemy.y - ymap < -16

	ld	de,128+16
	sbc hl,de		; enemy.y - ymap + 16 - 128 - 16 >= 0 
	jp	nc,.next	; enemy.y - ymap  >= 128
	ld	e,128+64
	add	hl,de
	ld	(iy+0),l
	ld	(iy+0+4),l	; not needed if single layer but in this way it is overall faster 
	
	ld	l,(ix+enemy_data.x+0)
	ld	h,(ix+enemy_data.x+1)
	ld	de,(tempx)
	and a			
	sbc hl,de		; hl = enemy.x + 32 - xmap < 0
	jp	m,.next		; hl <0  <==> dx = enemy.x - xmap < -32
	
	ld	de,32
	sbc hl,de		; enemy.x + 32 - xmap - 32 <0

	ld	a,(ix+enemy_data.color)
	jp nc,.noec		; -32< dx <0
	or	128			; set EC
	add	hl,de		; add 32
.noec
	ld	e,a
	ld	a,h
	and a
	jp	nz,.next	; dx >255
	
	ld	a,(ix+enemy_data.frame)
	cp	16*4					; hard coded in the SPT
	jp	nc,.two_layers

.one_layer:
	ld	(iy+1),l				; write X
	ld	(iy+2),a				; write shape
	ld	(iy+3),e				; write colour
	inc c
	ld	e,4
	add iy,de
	set 7,(ix+enemy_data.status)	; set it as visible
	; jp 	.next
		
.next:
	ld	de,enemy_data
	add ix,de
	djnz	.npc_loop1

	ld	a,c
	ld	(alt_visible_sprts),a
	ret
	
.two_layers:
	ld	(iy+1),l				; write X
	ld	(iy+2),a				; write shape
	ld	(iy+3),e				; write colour
	
	ld	(iy+1+4),l				; second layer X
	add	a,4
	ld	(iy+2+4),a				; second layer shape
	ld	a,e
	and 0xF0
	or 	1						; second layer colour
	ld	(iy+3+4),a	
[2]	inc c
	ld	e,8
	add iy,de
	set 7,(ix+enemy_data.status)	; set it as visible
	jp 	.next

By ARTRAG

Enlighted (6976)

31-03-2019, 23:48

theNestruo wrote:

Hi!

When computing tempx and tempy, flags are not needed, so I guess Sandy Brand's suggestion of replacing "ld de,16/and a/sbc hl,de" with "ld de,-16/add hl,de" can be applied.

If speed is far more important than size, avoid the last "jp .next" by repeating the entire ".next" section afterwards (thus avoiding the cost of the "jp").
On the other hand, a few bytes can be saved by moving one "inc c" and the "add iy,de" into the ".next" section. Also, in the "ld a,e/and 0xF0/or 1" sequence of ".two_layers", the last "or 1" can be replaced by a shorter "inc a".

Also, the entire ".noec" section is missing. Again, if speed is far more important than size, the ".two_layers" section could be duplicated as two versions ("_ec" and "_noec") to avoid the small extra cost of computing the second layer colour (that seems to be always 1 in your code).

Hi, thanks for the suggestion about the temporary variables. They are outside the loop, though, so the gain is small.
About unrolling, it is hard because the two branches would have to be turned into call/ret in order to return to the right place. I would save a DJNZ, but I would lose CPU time in the call/ret transformation.

I cannot change "ld a,e/and 0xF0/or 1" to INC A, as I need to remove the color in the lower nibble, not just take the adjacent color.

By theNestruo

Champion (429)

01-04-2019, 00:04

Maybe I have been unclear in my explanation (I was talking about several different optimizations at the same time, so maybe my message was confusing). Sorry!
Here's a pastebin with some of the optimizations applied to clarify my previous message: https://pastebin.com/iHF7z7VP (the lines with the "<---" mark)

By ARTRAG

Enlighted (6976)

01-04-2019, 09:20

Thanks for your suggestions.
I will try tonight and let you know.
I think that the second djnz can't jump that far.
Anyway, it is only a guess.
I have to see what I get tonight.
