Assembler Optimizer

Page 46/58
39 | 40 | 41 | 42 | 43 | 44 | 45 | | 47 | 48 | 49 | 50 | 51

By Bengalack

Paladin (747)

Bengalack's picture

08-05-2021, 07:18

Ok, that works!

INFO: New solution found after 933209832 solutions tested (size: 5, time: 59.999):
INFO:     add hl, hl
INFO:     add hl, hl
INFO:     add hl, hl
INFO:     add hl, hl
INFO:     add hl, hl
INFO: SearchBasedOptimizer: search ended (933210048 solutions tested)

And yes it takes some time. I recently got a new PC, and hoped it would be beefy enough for these algorithms, but it seems like there is no help from java, out of the box, to spread the load over more threads/cpus. That would have helped.

Anyway, the other day while I was finding optimization patterns for SDCC and was thinking of passing it over to MDL, but ... there are just sooo many potentials! Well, one of the opts was a simple 16-bit right shift 5:

    ; 110 cycles
    ; ld c, l
    ; ld b, h
    ; sra b
    ; rr c
    ; sra b
    ; rr c
    ; sra b
    ; rr c
    ; sra b
    ; rr c
    ; sra b
    ; rr c

    ; 66 cycles. Recognize a right shift 5
    XOR A
    ADD HL, HL
    RLA
    ADD HL, HL
    RLA
    ADD HL, HL
    RLA
    LD C, H
    LD B, A

Your example shift10 reminded me of that of course, and it would be cool to challenge MDL with Ian Taylors numbers here : http://www.chilliant.com/z80shift.html :-)

By santiontanon

Paragon (1805)

santiontanon's picture

08-05-2021, 15:22

ah, good point about parallelizing! I could try using multiple threads during the search. I don't think it should be too hard. I am using a standard depth-first-search + iterative deepening as the base search algorithm, so I could parallelize that maybe at the very first level of the search tree, to allow for k parallel threads (I am sure there must be an easy way to poll for the number of available cores with some system call?). I'll give it a try!

There's also several optimizations to the search process that I still need to implement (like defining a "canonical order" of instructions for those sequences that are synonymous, to make sure they are only searched once, etc.).

By Ped7g

Expert (67)

Ped7g's picture

09-05-2021, 10:48

Looking into github at the "<<13" test-example. Any idea, how slow it would be if you would allow it to look for optimal solution?

(I mean this one is maybe optimal (allowing hl+a to be used, like your test entry, but all Z80 instructions), I spent only couple of minutes on it, so maybe there's better, but should be close enough):

  ld a,l
  rrca
  rrca
  rrca
  and 0xE0
  ld h,a
  ld l,0

By santiontanon

Paragon (1805)

santiontanon's picture

09-05-2021, 16:26

That's a good question! I'll run a few tests and see how long would it take to find the smallest or the fastest. The main big problem is constants (like that 0xE0), which blow up the search space. By default, I only allow the constant "0" (but you can specify this, to consider more, or all). And the second problem is if allowing instructions that read/write from memory, since this forces me to run some extra code during search (e.g., to restore the program after each simulation, as a program could modify itself), making search significantly slower. However, I have not done any systematic experiment yet to see how slow does it get allowing EVERYTHING. Might be an interesting test! Smile

By Bengalack

Paladin (747)

Bengalack's picture

09-05-2021, 20:58

Ped7g wrote:

I spent only couple of minutes on it, so maybe there's better, but should be close enough)

Ian Taylor agrees with you :-) http://www.chilliant.com/z80shift.html

By santiontanon

Paragon (1805)

santiontanon's picture

09-05-2021, 22:29

Ok, I just did the test, tldr: it will take a long time haha

So, starting from easier to harder and using the latest version, which can do multithreaded search, and has some improvements on search-space prunning ( https://github.com/santiontanon/mdlz80optimizer/releases/tag... ):

First is the easy case, with a reduced search space:

allowed_ops: 
    addition
    shift
    ld
allow_ram_use = false
allowed_registers: hl, a
max_ops = 7
initial_state:
    HL = val
goal_state:
    HL = val << 13

This takes about 15 seconds on the latest development version using 8 threads in my laptop (Apple M1 macbook pro), and finds this solution when optimizing for number of instructions (-so-ops):

    add hl, hl
    add hl, hl
    add hl, hl
    add hl, hl
    add hl, hl
    ld h, l
    ld l, 0

I then modified it a bit to try to find the solution you proposed (targeting the instruction set), to this:

allowed_ops: 
    logic
    ld
    rotation
allow_ram_use = false
allowed_registers: a, hl
8bit_constants: 0, 224
max_ops = 7
initial_state:
    HL = val
goal_state:
    HL = val << 13

When optimizing for execution time (-so-time), this finds a very similar solution in a bit less than 30 seconds (I think they are equivalent):

    ld a, l
    ld l, 0
    rrca
    rrca
    rrca
    and 224
    ld h, a

I then expanded it to use ALL the instruction types that I currently have (but still limited to only using a, h, l and the constants 0 and 224):

allowed_ops: 
    logic
    increment
    addition
    ld
    rotation
    shift
    negation
    bits
    carry
    cp
    jump
allow_ram_use = false
allowed_registers: a, hl
8bit_constants: 0, 224
max_ops = 7
initial_state:
    HL = val
goal_state:
    HL = val << 13

I only left it running for about 10 minutes, but it's clear this will take a long time. From the growth rate I was seeing, maybe it could finish in less than 1 day? I'm not sure. But this search will take a long time in any case haha. And if we allow even more constants than 0 and 224 to try to guarantee finding the absolute optimal, that'll be even slower. Perhaps removing "cp" and "jump" can help, or maybe the "bits" operations (bit/set/res) can help as I'm not sure if those would be helpful?

By Bengalack

Paladin (747)

Bengalack's picture

10-05-2021, 16:37

Nice!
In a test, the previous output was:
INFO: SearchBasedOptimizer: search ended (933210048 solutions tested)
and now it is:
INFO: SearchBasedOptimizer: search ended (322422569 solutions tested)

I also see that all my cores are going (almost) full speed. Very nice!

My challenge, I guess, is to construct a description of what I want a solution to. For example, one of the heftiest optimisations I ever have come up with myself, is a scheme for updating x-amount of data by (ab-)using the SP-register as a general purpose register. -Solely because SP has the fastest way to move two bytes into a 16-bit register, and at the same time modify another 16-bit register (SP) (command: pop). All this in only 11 cycles is very fast, and I'm always on the lookout for ways to abuse this Smile

But again, describe things in a simple way, seems complex. Or maybe I just don't know MDL well enough. So, your examples are pretty nice for learning, in that regard.

By the way, maybe add a "Time elapsed" to a run? Nice to know the duration if you leave your computer on and leave it.

By santiontanon

Paragon (1805)

santiontanon's picture

10-05-2021, 20:47

Ah, yes, the "copying using the stack" is a very nice "trick" to move data very fast!

As for constructing the description for the MDL optimizer to use that, it can probably not be specified with the current specification language, and I'll have to expand it. Only register and flag constraints can be specified at this point. But it'd be very nice to be able to specify constraints concerning memory (for finding optimal ways to move data, etc.). I am not sure how fast/slow will the search process be, but this will be a very nice addition!! It's in my to-do list, hopefully for the next version Smile

And indeed, I'll add "time elapsed", since I usually run everything from inside the IDE, I get timing automatically, but I forget that's not the actual way MDL is supposed to be run haha.

By thegeps

Paragon (1187)

thegeps's picture

27-05-2021, 18:58

I'm sorry, I don't remeber if there is written somewhere in this thread (but this is so big so looking for the answer can be stressful):

there is a way to output the original source with annotated optimizations, so I can check them to be safe (they seems to be, Freedom Fighter seems to work fine, but I want to check and to learn something...)

or maybe a way to generate a txt file with all the MDL output because when going back in cmd window it doesn't reach the start of optimization output (yep, a lot of optimization, so my code has to be really terrible!)

when doing:
java -jar mdl.jar ffp.asm -asm ffopt.asm -dialect glass -po speed

I have:

mdl optimization summary:126 bytes, 1221 t-states saved. pattern based optimizer pattern application: 249

By santiontanon

Paragon (1805)

santiontanon's picture

27-05-2021, 23:16

Oh, there is currently no option to just "annotate" the optimizations. But that's an interesting option, I'll make a note of that. The two ways you can do something similar is:
- If you use an editor like Sublime Text, vim, or VSCode, you can integrate mdl into the editor, and it will show annotations directly over your code (in a similar way as compilation warnings/errors are displayed)
- Or a simpler option is just to redirect the output to a file, like: java -jar mdl.jar ffp.asm -asm ffopt.asm -dialect glass -po speed > output.txt
And then just inspect the output.txt file

Nice to see that it generates some optimizations!!! Big smile I have a new version coming soon, hopefully this weekend Smile

Page 46/58
39 | 40 | 41 | 42 | 43 | 44 | 45 | | 47 | 48 | 49 | 50 | 51