Kuro5hin.org: technology and culture, from the trenches

ARM Assembly Code Optimization?

By MichaelCrawford in Technology
Sun Dec 15, 2002 at 03:46:36 AM EST
Tags: Help! (Ask Kuro5hin)

I'm writing an embedded application for an ARM7TDMI processor. It's not going fast enough to bring me any joy yet. Not even close.

Can any of you suggest any books, magazine articles or websites that teach about optimizing ARM or Thumb assembly code? I need to reduce both my code size and my run time. I also need to reduce my application's overall RAM consumption, as my present RAM usage exceeds the tiny amount available on the chip by a few hundred bytes.

References on processor architectures other than the ARM could be useful too. I'm interested in learning any tricks I can from any architecture, because I may be able to figure out how they apply to the ARM7TDMI, or they may stimulate me to think up completely new approaches.


Some ARM processors like the ARM7TDMI support two instruction sets - 32-bit ARM instructions and 16-bit Thumb instructions. By "16-bit" I mean each instruction is 16 bits in size; Thumb code still has 32-bit registers and a 4 GB address space.

The "T" in the ARM7TDMI's model number indicates that it supports Thumb instructions - there are a variety of other ARM cores that do as well.

ARM code is more flexible and can be made to run faster, but Thumb takes less space for doing routine things. I'm planning to mix the two instruction sets, but I can't do that with complete freedom because there is a cost to switching modes (the pipeline gets flushed, for one thing, and you have to execute a couple of instructions to actually perform the switch).

I'm pretty sure that I'm using the most efficient algorithm available for what I'm trying to do. I have done some of the more obvious things already to improve my speed. There is more that I can easily do, but I don't think that what I presently know to do will get me to my goal.

If I scale the algorithm's performance as reported on some other processors by the ratio of clock speeds, my target should be well within reach. However, it is not as simple as that: there are other factors to consider, such as the relative efficiency of the different instruction sets and the capabilities of the hardware in question. Figures quoted for a Pentium III desktop PC cannot be counted on to scale proportionally to an embedded 50 MHz ARM7TDMI core.

There are some helpful documents at ARM's website, but I have not yet found the solution to my problem by studying them.

I'm particularly interested to understand better how the ARM7TDMI's pipeline can be utilized more effectively.

I know that on other processors, if the second instruction of a pair depends on the results of the first instruction, a pipeline stall will occur while the first instruction completes, slowing everything down.

Sometimes the pipeline can be kept full by moving an unrelated instruction between them in such a way that the whole program still gets the same results. The instruction in between gives the first instruction time to complete, so that what is now the third instruction can execute right away. Does the ARM7TDMI work like that?

The ARM7TDMI does not have a cache. Reducing the overall code size won't in itself help the runtime; what will help most is reducing the total count of instructions executed, as well as eliminating any references to RAM that aren't strictly necessary.

The paper Cost-Effective Microarchitecture Optimization for ARM7TDMI Microprocessor has a simple table of instruction timings for ARM code, as well as a good explanation of how the ARM7TDMI works in general. It is available in PDF Format or Google's "view as HTML" format.

(My vendor tells me that reading from RAM takes two cycles, writing takes one. In most cases the ARM can retire one instruction per clock, using a three-stage pipeline. However the instruction timings given in the paper cited above indicate a read takes three clocks, while a write takes two.)

Thank you for any help you can offer. It's very important to me that I succeed with this application, but I foresee that some very difficult challenges lie ahead.

(Running on different hardware is not an option at this time, for business reasons.)

Here are some books that friends recommended when I asked them about this. Unfortunately my budget is very limited and I can't afford to buy them all right now. Perhaps those of you who are familiar with them could comment on which are the best ones to get:

Here are some books I already have:

  • ARM System-on-Chip Architecture by Steve Furber

    This book has an easy-to-read introduction to ARM and Thumb assembly code and discusses in a general way how RISC processors work, but unfortunately is somewhat light on the kind of specifics that would help me at this point.

  • Optimizing PowerPC Code by Gary Kacmarcik

    This really helped me when I was working at Apple, but I don't think that most of what it has to say (like improving cache usage or keeping multiple execution units occupied) would apply to my situation.

  • Pentium Processor Optimization Tools by Michael L. Schmit

    This comes with a free version of Schmit's program PENTOPT, an "assembly code optimizer" for Pentium code. What it actually does is save a listing of your assembly program with comments added to indicate how many cycles each instruction will take; if an instruction takes a long time, it includes an explanation of why, such as pipeline stalls and so on.

    I would be stoked to find out if there is a program that does the same for ARM and Thumb assembly.

If you'd like to learn more about ARM assembly code, check out Peter Cockerell's ARM Assembly Language Programming and Richard Murray's ARM Assembler. The ARM has a really pleasant assembly language, but unfortunately this is the first time I've written any, so I don't know any of the tricks yet.

I'm sorry to say that I am not at liberty to say what my application is until my client announces it.

I think the article would be more useful to people besides me if readers posted whatever clever assembly tricks they know. Some optimizations that appear unrelated to my problem would likely stimulate me to think of something that helps, or I may find a way to apply them to my problem.

The algorithm in question, though, is entirely integer oriented. The chip I'm targeting has 64kbytes of flash ROM. I'm not sure if I'm allowed to say how much RAM is available, but it is a really tiny amount, as you might guess by my complaint that I'm over budget by a few hundred bytes.

I should point out that I am using a profiler to test my code.

There are two ways I do this. One is to load my code into the target system and execute my algorithm in a loop a few thousand times, indicating the start and end of the test by lighting certain LEDs on the development board I test with.

I can get more precise timing of the algorithm or its parts by using the ARMulator, a microprocessor emulator that comes with ARM's development system. This allows me to execute ARM binaries under Windows, where I can get at them with a source code debugger. (There are open source ARM emulators available as well, but I haven't tried any of them yet.)

What I can do is set a breakpoint just before stepping into a subroutine, reset the cycle counter, then step over the subroutine. The cycle counter can then tell me how many cycles were spent executing that subroutine and every subroutine it calls, and so on.

There is an option for real-time simulation if I write a memory map configuration file that gives the address ranges for the different kinds of memory involved, their data bus width and access times.

What I don't seem to have, though, is a tool that will tell me how many clock cycles each instruction will take. The profiler can tell me which general areas are most important to address, and whether I've gained or lost, but it doesn't help much at the level of individual instructions.

ARM Assembly Code Optimization? | 74 comments (55 topical, 19 editorial, 0 hidden)
Code Optimization is well Studied (4.50 / 2) (#4)
by HidingMyName on Sat Dec 14, 2002 at 10:17:24 AM EST

Sounds like you have two problems, code size and code speed. Some optimizations improve one at the expense of the other, but some techniques can improve both. Could you please confirm that I am right that you have enough PROM space but not enough volatile RAM?

Code optimizations happen at a variety of levels, starting at the highest level:

  • Algorithm and Problem Formulation
  • High level (intermediate code) optimizations
  • Low level (architecture dependent) optimizations
Hennessy, Patterson and Goldberg should provide good coverage of many low-level optimizations suitable for RISC processors.

Now for some remarks not commonly found in textbooks. Most of us know that algorithmic optimizations often provide the biggest improvements. Check closely for unused variables/constants/instructions/subroutines, as those consume space. Another thing you can do is aggressively look for repeated stretches of code and gather them into subroutines rather than coding them inline; this also improves cache performance. Another technique that works if you have null-terminated strings (a la C/C++): where one constant string is a suffix of another, represent the shorter string as a pointer into the correct position in the longer one. Use the stack when possible; this tends to reduce memory usage, since variables are only memory-resident while they are in scope (and likely to be used). Be careful about register allocation: don't leave registers unused at the expense of memory.

Yes, you're right about the lack of ram (5.00 / 2) (#6)
by MichaelCrawford on Sat Dec 14, 2002 at 10:32:21 AM EST

Thank you for your answers. Yes, you're right about the lack of RAM. I have some space left in the flash, but my program already exceeds the hard limit of available RAM.

The only way I can test my application at all is either to run a slower version that doesn't exceed the RAM, in order to test the other functions that are not performance critical, or to start the chip up straight into a performance test of my algorithm and then hang the chip when it's complete.

I'm not yet able to run both the basic functions of the chip and my latest algorithm implementation at the same time, because of the lack of RAM.

If I could find a good way to shave about 400 bytes off my ram consumption, I would be greatly relieved.


--

Live your fucking life. Sue someone on the Internet. Write a fucking music player. Like the great man Michael David Crawford has shown us all: Hard work, a strong will to stalk, and a few fries short of a happy meal goes a long way. -- bride of spidy


[ Parent ]

Found the book where I can get it today! (none / 0) (#13)
by MichaelCrawford on Sat Dec 14, 2002 at 01:27:47 PM EST

It took some calling around, but I was able to find Computer Architecture: A Quantitative Approach in a bookstore a couple hours away. They have it on hold for me until I can get in to pick it up.

They also had Hacker's Delight, which promises all kinds of weird things you can do with twos-complement arithmetic. A friend I used to work with at Apple recommended it highly so I put that on hold too.

Now it's just a matter of getting to the bookstore. It's raining heavily right now.

A couple of months ago I had a long talk with a woman who's working for a non-profit foundation whose objective is to stimulate the growth of high-tech industry in the state of Maine. The lady from The Maine Technology Institute had a number of suggestions for how I could improve my consulting business, both in the short term and in the happy long term when I might be able to hire Maine residents as employees.

The main (!) suggestion I had for her was to ask that the State do something to encourage the sale of more technology books in the state. I know you can order them online, but there is no substitute for being able to browse in a real bookstore.

There are three bookstores in the whole state that I know carry some technical books, but the selection is pretty limited and they concentrate pretty heavily on stuff like web publishing and Microsoft technologies. It's not like the good old days when The Computer Literacy Bookstore was right there at Apple's Infinite Loop campus.

And I called around today to a number of online booksellers, and was not able to find any of them that could ship me my order today. So the earliest I could expect to receive a book if I ordered online would be Tuesday. (For the record, today is Saturday.)




[ Parent ]

Some Notes about Inline Optimization (none / 0) (#70)
by HidingMyName on Mon Dec 16, 2002 at 06:19:46 PM EST

Suppose that we have a function F, using a callee-saves model (the function is responsible for preserving registers that it overwrites and restoring them before exit), that is invoked in n different places in a program. Note that if the function is only invoked once, in-lining always reduces memory consumption (perhaps at the expense of readability). For n ≥ 2, if each function call takes c bytes of storage, and f is the number of bytes it takes to represent F in machine code, then in-lining reduces memory consumption when:

Inline-Cost < Non-Inlined-Cost

n * f < n * c + f

(n - 1) * f < n * c

f < (n * c) / (n - 1)

Taking the limit of the right-hand side (RHS) as n gets large, we get:

lim(n -> Infinity) (((n - 1) + 1) * c) / (n - 1) = lim(n -> Infinity) [c + c / (n - 1)] = c

and when n = 2, the RHS is 2 * c. So for very short functions, in-lining conserves memory, and the more frequently a function is called, the shorter it needs to be before in-lining still conserves memory.

[ Parent ]

ying and yang (none / 0) (#8)
by turmeric on Sat Dec 14, 2002 at 11:01:28 AM EST

why has nobody mentioned profiling here? would you try to improve your hundred yard dash without a stopwatch?

err... (none / 0) (#9)
by turmeric on Sat Dec 14, 2002 at 11:06:23 AM EST

maybe that is one of the 'obvious things'... i dunno. self modifying code?

[ Parent ]
Profiling on steroids (none / 0) (#11)
by MichaelCrawford on Sat Dec 14, 2002 at 11:19:56 AM EST

I addressed your concern in an update to the article just now. But I thought I should add that I'm hip to profilers. Check out where I discuss studying performance by tracing all the machine instructions that are executed by a task.

Some people at Apple thought I was pretty bent for doing that, but I found it very enlightening.




[ Parent ]

i think what they are telling you (none / 0) (#33)
by turmeric on Sat Dec 14, 2002 at 10:09:04 PM EST

is they dont like one of their kind getting all dirty in the mud and muck, pay no attention to that man behind the curtain.

[ Parent ]
ps (none / 0) (#34)
by turmeric on Sat Dec 14, 2002 at 10:11:36 PM EST

i have no idea how to help you. sorry

[ Parent ]
It's a compute-bound problem (5.00 / 2) (#12)
by MichaelCrawford on Sat Dec 14, 2002 at 01:10:07 PM EST

I should add that, while my overall application does involve I/O, the problem at hand is purely compute bound.

I wanted to point this out because many performance tuning problems on operating systems involve getting the best out of your I/O, for example minimizing the latency or bandwidth of network traffic. That's an interesting problem too, but it's not my problem at the moment.

Problems with I/O performance are probably the best kind to address by changing your algorithm or overall architecture, and are less likely to be helped by examining the pipeline utilization of your machine code.

In my case, I can put my bottleneck all inside a single subroutine (and the subroutines it calls), and do everything I need to do by just calling that subroutine. Nothing but computation and transfer to or from local memory happens during the call.

And that computation just takes too long.

Thank you for all your help.




A couple of possibilities (5.00 / 1) (#25)
by CrimsonDeath on Sat Dec 14, 2002 at 06:14:40 PM EST

(repost as topical)

I've done a bit of work with ARM, so I just have a couple of thoughts.

1) What size is your bus? If it's only 16-bit, then Thumb is really the way to go ... it'll be much faster and you don't lose much.

2) If you're staying with 32-bit, you can increase speed (in some cases) and reduce memory use (always) by making sure you use the conditional execution flags on all instructions where you can, instead of branching.

3) Are you sure it's not cached? I guess you must be running an MMU-less processor with the ARM7TDMI core?

And the number of clock cycles is described in the ARM7TDMI data sheets, I hope you have those. Sometimes it's really hard to know due to pipelining/caching, but if you don't have a cache (as you say) then that's not a problem.

If performance is important to you though, you shouldn't be running without cache.


16-bit ROM, 32-bit RAM (none / 0) (#39)
by MichaelCrawford on Sun Dec 15, 2002 at 04:17:17 AM EST

The flash ROM has a 16-bit bus, so code in the ROM really does have to be Thumb, except for the occasional unusually productive ARM instruction.

The flash is also pretty slow compared to the clock speed of the CPU.

The RAM is 32 bit, with a faster access time, but there is very little of it. For now, I'm pretending that I have enough, but I've got some memory-mining to do before I'm done.




[ Parent ]

Thumb in ROM, ARM in RAM (none / 0) (#48)
by pin0cchio on Sun Dec 15, 2002 at 09:05:04 AM EST

The flash ROM has a 16-bit bus ... The RAM is 32 bit

The Game Boy Advance uses this setup. In general, a GBA game will copy a set of inner-loop subroutines in ARM ISA (such as a mixer or a polygon rasterizer) from the flash cartridge to 32 KB of fast RAM at once, execute them, and then return to the Thumb caller running in either ROM or slow RAM.

I know this won't help your RAM usage if you already have a lot of relatively persistent data in RAM, but it does boost speed heavily.


lj65
[ Parent ]
questions and answers (5.00 / 3) (#26)
by pb on Sat Dec 14, 2002 at 06:16:17 PM EST

I looked up the data sheet on the ARM7TDMI processor and the assembler looks relatively normal... You say that you need to optimize for size and speed.

My advice is to use the thumb instructions as much as possible to reduce the code size of your application (if this is an issue).

One optimization to increase speed at the expense of code size is loop unrolling.  This can also get rid of branching and pipeline stalls.  For example, I had an (x86) assembly program that--among other things--needed to sort four characters.  I was able to get this down to 17 instructions--two to load/store two of the characters in memory, and 3 for each of the five possible swaps.  You see, I determined that it's possible to sort four numbers with at most 5 static swaps (this is an algorithmic improvement) and instead of having a loop of some sort, I just specified all five swaps (unrolled).  Code follows:


bsort:  mov     ax,[word ptr n2]        ;use al, ah as n2, n3
        cmp     [n1],al                 ;if (n1<n2)
        jae     skip1                   ;then
        xchg    [n1],al                 ;swap(n1,n2)
skip1:  cmp     ah,[n4]                 ;etc... basically
        jae     skip2                   ;this is a binary
        xchg    ah,[n4]                 ;subdivision, in-place
skip2:  cmp     [n1],ah                 ;sorting algorithm,
        jae     skip3                   ;but hardcoded for
        xchg    [n1],ah                 ;four characters.
skip3:  cmp     al,[n4]                 ;(and hopefully
        jae     skip4                   ;it will work...)
        xchg    al,[n4]                 ;...swap(n2,n4)
skip4:  cmp     al,ah                   ;if (n2<n3)
        jae     skip5                   ;then
        xchg    al,ah                   ;swap(n2,n3)
skip5:  mov [word ptr n2],ax            ;save new n2, n3

Now obviously this is a very special purpose algorithm, which is a big problem with giving people generic advice on assembly optimization--usually the best optimizations you can make will end up being extremely specific, so it's unlikely that we'll stumble on the right one for you.  However, loop unrolling is a very generic optimization, which is great for speeding up tight loops.

Another thing I did later on in the same program was that I managed to convert a summation into a lookup table.  Converting even relatively simple computations into (hopefully small) lookup tables will lower your total instructions executed (which was the task here).

Original code (before the lookup table):

nofirst:mov     bx,[word ptr n1]        ;get first two characters
        sub     bl,[n4]                 ;find position in list
        sub     bh,[n3]                 ;(summation - 1)
        mov     al,bl                   ;this does a summation
        inc     al                      ;and subtracts one
        mul     bl                      ;to find the offset in memory
        shr     ax,1                    ;for stable.

After the lookup table:


itable   db  0, 0, 2, 5, 9, 14, 20, 27, 35, 44
[...]
        sub     al,ah                   ;subtract n2, n3
        mov     bh,0                    ;clear bh
        mov     bl,[n1]                 ;bl=n1
        sub     bl,[n4]                 ;bl=n1-n4
        mov     bl,[itable+bx]          ;lookup summation thingy

You might have to take my word for it that these two chunks of code do the same thing--I probably reorganized a couple of other things between these two versions as well.  But the main thing to note is that besides shaving off a couple of instructions, we've avoided a mul as well!  And our lookup table isn't very large.  I know you've already discovered the joy of lookup tables in assembler optimization, but consider this encouragement to take it one or two steps farther whenever you need to (dramatically) lower your instructions executed.

I understand if you can't tell us very much about what you're working on now, so look up as much as you can on the net for specific algorithms and optimizations related to what you're doing.  Or read through some appropriately licensed source code if you can find free software that performs similar tasks.  (I suppose free software written in ARM assembler would be even better in this case...)

Best of luck to you!
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall

Dangers of x86 assembly optimization (5.00 / 2) (#32)
by arjan de lumens on Sat Dec 14, 2002 at 10:01:25 PM EST

In your first example, you are using the XCHG instruction with a memory argument, which is very slow on 486 and higher x86 processors - it locks the memory bus, taking 10+ clock cycles (tested: about 11 on a 486 and 25 on an athlon). Also, on PPro and higher you may want to remove the conditional jumps (which cost 15+ cycles each if the jump prediction logic fails), using CMOV instructions to move data around instead.

For the second example, you have

mov bh, 0
mov bl,[n1]
sub bl,[n4]
mov bl,[itable+bx]

which is probably optimal on a Pentium Classic, but on PentiumPro,2,3,4 may produce a partial register stall (happens when you write to partial registers and then immediately afterwards try to read the full register), which is even worse than a multiply. Try instead:

mov bl,[n1]
sub bl,[n4]
movzx bx,bl
mov bl,[itable+bx]

Also, PentiumPro,2,3, K6 and all Athlons (but not Pentium 4) have very fast integer multiplies, so much that they're not really worth the effort to get rid of most of the time.

x86 as a whole really isn't easy to optimize for, and for each new generation there is a risk that the nifty tricks you spent so much time figuring out that gave optimal results on the previous generation of processors now suddenly just blow up in your face. Especially Pentium 4 is notorious in that respect.

[ Parent ]

about the assembler code (none / 0) (#36)
by pb on Sun Dec 15, 2002 at 03:09:00 AM EST

Thanks for the tips, but I'd like to mention that this was done for one goal only--reducing the total count of instructions executed.  Also, it was designed for one instruction set only--that of the 8086.  And since I'm trying to illustrate concepts that might also apply to ARM assembler, it honestly doesn't matter which instructions I use at all.  :)

And yes, I agree--hardly any of the x86 processors these days have that much in common, so optimized assembler for one x86 chip can be pretty bad on another one.  (I remember seeing comparisons long ago between simple things, like the LOOP instruction vs. doing it by hand with incrementing and conditional jumps)
[ Parent ]

What were you optimising? (none / 0) (#58)
by evanbd on Sun Dec 15, 2002 at 11:56:36 PM EST

Just out of curiosity, because instructions executed and instruction count are weird optimization targets as opposed to bytes and clocks. Also, those two targets on the 8086 are ones I've written for (assembly course, NC State University).

And to offer suggestions for the original poster...

Jump tables are wonderful things.  I'm sure you've thought about them, but taking things to the extreme of spaghetti code produces very interesting results (sometimes useful, but good luck with some of the weirder bugs...).

Also, consider tactfully setting aside register use calling conventions, if you spend a lot of time calling subroutines.  Sometimes the cost of having one less register available is outweighed by the benefit of not having to save it (both when calling and being called).

Good luck.

[ Parent ]

yes, yes they are. :) (none / 0) (#61)
by pb on Mon Dec 16, 2002 at 09:34:47 AM EST

It just so happens that both of those chunks of assembler are parts of the same program that I wrote for the assembly course when I was at NCSU--so if it sounds weird, blame Dana Lasher, or tell him I said hi.  :)
[ Parent ]
Assembly not always the way to go (5.00 / 2) (#35)
by GGardner on Sun Dec 15, 2002 at 12:23:47 AM EST

I cut and pasted your assembly example, with a tiny test rig, and ran it with the data reverse sorted. On my Pentium III, your assembly routine took about 219 cycles to run.

I then coded up a simple C version, again with the unrolled loop. It took about 700 cycles. I recompiled with optimization on, and it took about 146 cycles. The machine code generated by the C compile was 3 bytes shorter than the assembler version.

[ Parent ]

neat. (none / 0) (#37)
by pb on Sun Dec 15, 2002 at 03:11:51 AM EST

Could you paste at least the C code, and possibly the optimized assembler it outputs as well?

Also, did I win in total instructions executed, and did your compiler stick to only original 8086 instructions?  :)
[ Parent ]

Here's the C code (5.00 / 1) (#56)
by GGardner on Sun Dec 15, 2002 at 07:06:37 PM EST

I will grant you that your code is faster on 2 of the 3 machines I tested it on. Of course, the first machine I ran it on was the other way around! Interestingly, VC++ generates much different code than gcc -- gcc turns the XOR'ing swap into xchgb instructions, which is pretty interesting, though I'm not sure it is optimal. VC++ leaves the XORs in.
#define SORT2(x, y) if (x > y) { x ^= y; y ^= x; x ^= y; }

void c_sort(char *arg) {
char a = arg[0];
char b = arg[1];
char c = arg[2];
char d = arg[3];

char t;

SORT2(a, b)
SORT2(c, d)
SORT2(a, c)
SORT2(b, c)
SORT2(b, d)
SORT2(c, d)

arg[0] = a;
arg[1] = b;
arg[2] = c;
arg[3] = d;
}



[ Parent ]
heh. (none / 0) (#57)
by pb on Sun Dec 15, 2002 at 11:01:35 PM EST

That is interesting.  Although you have an extra swap in there; you should only need these:


SORT2(a, b)
SORT2(c, d)
SORT2(a, c)
SORT2(b, d)
SORT2(b, c)

And some of this code (from gcc) looks familiar...


        cmpb    %al, %bl
        jle     .L2
        xchgb   %al, %dil
.L2:
        movl    %esi, %ebx
        cmpb    %bl, %dl
        jle     .L3
        xchgb   %sil, %dl
.L3:
        movl    %edi, %ebx
        cmpb    %dl, %bl
        jle     .L4
        xchgb   %dl, %dil
.L4:
        movl    %esi, %ebx
        cmpb    %bl, %al
        jle     .L5
        xchgb   %sil, %al
.L5:
        cmpb    %dl, %al
        jle     .L6
        xchgb   %dl, %al
.L6:

I tried to get Intel's compiler working, but it wasn't happy with my setup on Gentoo.  It would probably like my old RedHat installation better...
[ Parent ]

XCHGing (none / 0) (#63)
by GGardner on Mon Dec 16, 2002 at 09:52:17 AM EST

That is interesting. Although you have an extra swap in there; you should only need these:

Good eye -- now it is much faster and smaller on all machines I tried it on. Note the reason that xchg is so slow on modern machines is that it needs to lock the bus to be atomic. I suspect this gets it into microcode. An xchg whose targets are both registers doesn't need to do this, and runs much faster. This is the big difference between the C version and the assembly.

[ Parent ]

P.S. (none / 0) (#38)
by pb on Sun Dec 15, 2002 at 03:50:08 AM EST

I tried this with gcc--it generated a lot more instructions and it used mov, movz, all four registers, and the stack (which is good if you want to avoid references to main memory, but not if you want to save some registers and instructions) Also, movz isn't an original 8086 instruction.

So, yes, if you're optimizing for cycles on a modern processor, my code isn't the code to use.  But then,  if I could have used any x86 instruction set I wanted, I could have just used the CMPXCHG instruction from the 486...

I think it all goes back to what I said originally after that block of code... "usually the best optimizations you can make will end up being extremely specific"--in this case, both specific to the processor and to your original goals.  :)
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall
[ Parent ]

intel StrongArm and ARM architecture documents (none / 0) (#28)
by sye on Sat Dec 14, 2002 at 06:36:13 PM EST

ftp://download.intel.com/design/strong/manuals/27824004.zip
http://www.google.com/search?q=DDI0100E_ARM_ARM.pdf.
Hope these two links are still good.

~~~~~~~~~~~~~~~~~~~~~~~
commentary - For a better sye@K5
~~~~~~~~~~~~~~~~~~~~~~~
ripple me ~~> ~allthingsgo: gateway to Garden of Perfect Brightess in CNY/BTC/LTC/DRK
rubbing u ~~> ~procrasti: getaway to HE'LL
Hey! at least he was in a stable relationship. - procrasti
enter K5 via Blastar.in

A Pentium is not an ARM processor (none / 0) (#29)
by GGardner on Sat Dec 14, 2002 at 06:40:28 PM EST

Well, that's obvious. Having done a fair amount of embedded programming myself, I'm frequently surprised by how fast our modern x86 CPUs are, even on a clock-for-clock basis. A simple embedded CPU, such as an ARM, especially a cache-less one running at one-tenth the clock speed of a Pentium, will often run the same C code 50-100 times slower than the Intel or AMD chip.

The ARM chips are pretty nifty, run with amazingly low power/MIPS, and are much more deterministic than a modern Pentium. (Note that ever since the Pentium II or so, Intel has stopped publishing clock cycles for each instruction, because it just varies so much.) However, the price you pay for all this goodness is that the throughput of the ARM chip, even clock-for-clock, is a lot slower than a CPU aimed at a desktop or server.

look up table mathematics? (5.00 / 1) (#31)
by martingale on Sat Dec 14, 2002 at 09:01:57 PM EST

You stated that your "fast" algorithm uses a lookup table. I don't know what you're doing, but is it possible that your lookup table has redundant information? Symmetries you can use to trade off size for an added couple of instructions? Since your reference implementation fits in memory, I presume your lookup table is fairly big, so even a reduction by a factor of two might be enough already.

Another thing you can try is to tweak the table, trading off accuracy for symmetry. Do the calculations have to be completely, 100% exact or can you perturb a couple of entries, making them wrong but so similar to something else that you obtain a symmetry you can leverage?

Actually, there's more potential in this sort of direction. Assuming you're doing some kind of complex calculation, how about doing a first-order approximation to obtain an inexact but close alternative calculation?
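As a toy stand-in for the symmetry idea (not the poster's actual table): any even function's lookup table stores every value twice, so you can keep half of it and fold the index with an extra instruction or two.

```c
#include <assert.h>

/* Toy illustration of the symmetry trade-off (not the actual table in
   question): f(x) = x*x is an even function, so a table over [-128, 127]
   stores every value twice.  Keeping only the non-negative half roughly
   halves the RAM cost for one extra negate per lookup. */
static int tbl[129];

static void init_table(void)
{
    for (int i = 0; i <= 128; i++)
        tbl[i] = i * i;
}

static int f_lookup(int x)   /* valid for -128 <= x <= 128 */
{
    int i = x < 0 ? -x : x;  /* fold the redundant half onto the stored half */
    return tbl[i];
}
```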

Here's what I'm setting into now (none / 0) (#40)
by MichaelCrawford on Sun Dec 15, 2002 at 04:43:31 AM EST

First, I must say I found a pleasant surprise when I woke up early this morning. I had to go to sleep in the middle of the voting yesterday because I'd been up all night hacking. It was very nice to get up and find my post approved by the moderators.

What I'm about to do is paste the entire source code for my assembly code algorithm into a Quattro Pro spreadsheet (my wife's a WordPerfect fan), leaving several columns of empty cells on the left that I will use for figuring.

Then I will refer to the timing chart I mention in the article above to note the times for each instruction.

Then I will tally them up proportionally to the number of times each loop is executed. I'll have to work out a nice way to do this, but if I know a loop is executed four times, then I'll multiply its time by four.

The whole thing is kind of complicated so I will probably be a few hours doing this but I hope to find it enlightening.

I already did this for the most-frequently executed block of code, and based on that I would have expected my whole algorithm to be about five times faster than it is. Something that I don't understand is killing my performance, and I hope to understand it by doing this.

One concern I have is that there are a lot of branches in my code. A branch costs three clock cycles because it flushes the pipeline - the ARM7TDMI doesn't have branch prediction or speculative execution like some fancier RISC processors (and the Pentium-class processors are RISCs in fancy clothes).

Many of those branches are unavoidable but there is probably something I can do about some of them.


--

Live your fucking life. Sue someone on the Internet. Write a fucking music player. Like the great man Michael David Crawford has shown us all: Hard work, a strong will to stalk, and a few fries short of a happy meal goes a long way. -- bride of spidy


URL for PowerPC Compiler Writer's Guide (none / 0) (#41)
by blue0 on Sun Dec 15, 2002 at 05:01:45 AM EST

Michael,

Been reading you in Advogato!

This one at IBM, and another one outside IBM.

Regards, rogersm.
My way of joking is to tell the truth. That's the funniest joke in the world. -- Muhammad Ali

Here's a couple things I did (5.00 / 1) (#42)
by MichaelCrawford on Sun Dec 15, 2002 at 05:02:28 AM EST

Here's a couple things I did, one I picked up from the doc at ARM's website and the other I figured out after thinking about it for a couple days.

I frequently need to mask a 32-bit value down to an 8-bit value. All ARM and Thumb instructions operate on 32-bit values, with the exception of being able to load and store bytes and halfwords.

The way I was doing that was to AND the value with 0xff. That's what the C code did after all.

But in both ARM and Thumb it costs two cycles to do an AND (I can't imagine why) and in Thumb it costs a register to hold the bitmask, because Thumb's smaller instruction word size places a lot of restrictions on immediate values - you can't do an immediate AND in Thumb.

But ARM distributes an Application Note called Writing Efficient C for ARM that mentioned in passing that the way the compiler masks an int down to a short or char is to shift to the left to clobber the high bits, then shift to the right. The logical right shift zero-fills the upper bits, so no masking is needed.

This still takes two cycles but it saves me a register, and by doing so I am now able to avoid some RAM reads in my innermost loop!
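The trick reads like this in C, assuming 32-bit unsigned values (a sketch, not the poster's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* The app note's masking trick in C: shift the unwanted high bits off the
   top, then shift back down.  On an unsigned 32-bit value the right shift
   zero-fills, so the result equals (x & 0xff) with no register needed to
   hold the mask. */
static uint32_t mask_byte(uint32_t x)
{
    return (x << 24) >> 24;
}
```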

The other thing I have done is that I knew some code in a subroutine wasn't needed every time the subroutine was called. But it seemed too expensive to use a conditional branch to jump around it.

What I did I'm sure would horrify most object-oriented purists - I moved the code in question to the top of the subroutine, and added another entry point just after. So now one of my subroutines has two entry points!

You may gag at that but it just shaved 1500 instructions off the total amount needed to execute my algorithm just once, because that was also in the innermost loop.
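In C the same shape can only be approximated with a wrapper (names here are made up; in assembly the first entry's code sits directly above the second label and falls through, so the extra "call" below costs nothing):

```c
#include <assert.h>

/* Hypothetical sketch of a two-entry-point subroutine.  In assembly,
   full_entry's setup code sits immediately before the core_entry label
   and falls through into it; callers that don't need the setup branch
   to core_entry directly and skip it for free. */
static int state;

static void core_entry(void)   /* second entry point: the common work */
{
    state += 1;
}

static void full_entry(void)   /* first entry point: optional setup + core */
{
    state = 0;                 /* the work not needed on every call */
    core_entry();              /* "fall through" (an actual call in C) */
}
```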

Thank you for your attention.


--

Live your fucking life. Sue someone on the Internet. Write a fucking music player. Like the great man Michael David Crawford has shown us all: Hard work, a strong will to stalk, and a few fries short of a happy meal goes a long way. -- bride of spidy


"You may gag at that..." (5.00 / 1) (#44)
by kreyg on Sun Dec 15, 2002 at 05:37:55 AM EST

The first rule in assembly optimization:

There are no rules.

There was a point to this story, but it has temporarily escaped the chronicler's mind. - Douglas Adams
[ Parent ]
I apologize in advance (5.00 / 1) (#69)
by cdyer on Mon Dec 16, 2002 at 06:18:35 PM EST

The second rule of assembly optimization is you do not talk about assembly optimization!

Cheers,
Cliff



[ Parent ]
Shift to the left! Shift to the right! (none / 0) (#60)
by wiredog on Mon Dec 16, 2002 at 08:32:38 AM EST

Stand up, sit down, byte byte byte!

I use shift instructions to multiply and divide (if it's powers of 2) in C++ code for efficiency reasons.

The greatest contribution of the internet to society is that it makes it possible for anyone of any age to become a grumpy old fart.
[ Parent ]

why stop there? (5.00 / 1) (#62)
by pb on Mon Dec 16, 2002 at 09:48:14 AM EST

a*3 == (a<<1)+a;
a*4 == a<<2;
a*5 == (a<<2)+a;
a*6 == (a<<2)+(a<<1);
a*7 == (a<<2)+(a<<1)+a;
a*8 == a<<3;
a*9 == (a<<3)+a;

(the parentheses matter in C: + binds tighter than <<)

or, more generally...

r = a*b: r=0; while(b) { if(b&1) r+=a; a<<=1; b>>=1; }

Note: mul will probably be faster here at some point, depending on your system.  :(
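The general loop above, made runnable (it's just binary long multiplication: one conditional add per set bit of b):

```c
#include <assert.h>
#include <stdint.h>

/* Runnable version of the shift-and-add loop: binary long multiplication,
   one conditional add per bit of b. */
static uint32_t shift_add_mul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    while (b) {
        if (b & 1)
            r += a;   /* this bit of b contributes a << (bit position) */
        a <<= 1;
        b >>= 1;
    }
    return r;
}
```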
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall
[ Parent ]

ever heard of an optimizing compiler? (none / 0) (#71)
by joto on Tue Dec 17, 2002 at 06:35:44 PM EST

I use shift instructions to multiply and divide (if it's powers of 2) in C++ code for efficiency reasons.

I would be utterly amazed if you could actually find a C++ compiler that didn't do such obvious optimizations.

Hell, this can easily be done with a sed script working on the assembly code produced by the compiler, looking at only one instruction at a time (not even a peephole optimizer)...

The only thing you are going to accomplish with such premature micro-optimization is to confuse any readers of your code. It will certainly not make the code any faster!

[ Parent ]

That's nothing (none / 0) (#72)
by KWillets on Wed Dec 18, 2002 at 12:34:51 PM EST

I once saw a subroutine with two entry points.  One entry point used an ADD or something on one line.  The other entry point changed a bit in the ADD instruction to make it a SUB before executing.  The programmer had figured out that the two subroutines he needed had only one bit's difference between them.  

This, of course, was on a machine with no memory protection.  Read-only code pages - hah.

[ Parent ]

Assembly Optimization (5.00 / 1) (#43)
by kreyg on Sun Dec 15, 2002 at 05:12:18 AM EST

I've never worked with the ARM processor, but I've been mired in Playstation2 microcode for ages, and did lots of 6502 and x86 assembly years ago, so I might have a couple pointers...

1. Get a decent optimizing assembler! The computer's going to be way better at figuring it out than you are, and make a lot fewer errors. Some will do the optimization I've outlined in tip 5 automatically, but as far as I know it's not very common. If the compiler will only do what you say, rather than what you mean, then:
2. For memory reads/writes, use the automatic pre/post-increment/decrement capabilities. This will save you an extra addition or two to move your read/write pointers. Rearrange your data to accommodate this if at all possible.
3. You may also be able to remove a counter from the main loop by pre-calculating the end address of your data (start+numloops) and comparing with that value rather than a separate counter register.
4. If it's a 3-instruction pipeline, you'll probably need to put two instructions after a register write before you use the result to avoid stalls.
5. Along the same lines, you can be evaluating several different stages in parallel during a single loop. This would be like taking three iterations of the loop and interleaving them. Example:
[Load data 0]
[Data 0 Operation 0]
[Load data 1]
[Data 0 Operation 1]
[Data 1 Operation 0]
Begin Loop, n = 0 to numloops:
[Load data 2+n]
[Data n operation 2]
[Data 1+n operation 1]
[Data 2+n operation 0]
[Write data n]
Loop

That might be a little hard to follow, but it should let you execute with no stalls.
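Tip 3 can be sketched in C on a hypothetical byte-summing loop (the pre/post-increment of tip 2 shows up as the `*p++`):

```c
#include <assert.h>
#include <stddef.h>

/* Tip 3 on a made-up loop: compare the moving read pointer against a
   precomputed end address (start + numloops) instead of decrementing a
   separate counter register. */
static int sum_bytes(const unsigned char *p, size_t n)
{
    const unsigned char *end = p + n;  /* computed once, before the loop */
    int sum = 0;
    while (p != end)
        sum += *p++;   /* post-increment maps onto a post-indexed load */
    return sum;
}
```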

There was a point to this story, but it has temporarily escaped the chronicler's mind. - Douglas Adams
Insipid Advice (none / 0) (#45)
by bugmaster on Sun Dec 15, 2002 at 06:43:18 AM EST

Sorry, I have never programmed for ARM, so any advice I can offer is probably utterly idiotic.

Nonetheless: have you considered simply writing your program in C and letting the compiler optimize it for you? Modern compilers are usually very good at optimizing code (once you turn on all the flags); much better than human beings. In fact, many optimization techniques that you (and others) have mentioned, such as pipelining, loop unrolling, etc., are usually performed automatically by the compiler.

This was certainly the case for me and DOS/Windows/MIPS; but then again, perhaps my assembly skills just weren't all that great...
>|<*:=

Tried that, assembly really is faster and smaller (5.00 / 3) (#46)
by MichaelCrawford on Sun Dec 15, 2002 at 07:24:14 AM EST

I am already three times faster in assembly than the fastest C I was able to obtain. Also my assembly is significantly smaller.

I have always been able to beat any optimizing compiler I've used when I've really had to get the best performance.

One reason for that is that there are some opcodes that do things in a single instruction that can't be expressed in C. An example is register rotation - you can shift a value in C but you can't rotate it. To achieve the effect of rotation requires a second scratch variable and some shifting and masking, and so it is very slow compared to assembly.
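For instance, a 32-bit rotate in portable C takes two shifts and an OR (a sketch; whether a given compiler collapses this into a single rotate instruction varies):

```c
#include <assert.h>
#include <stdint.h>

/* A 32-bit rotate right in portable C: two shifts and an OR, versus a
   single barrel-shifted operand on the ARM.  The n == 0 guard avoids the
   undefined shift by 32. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}
```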

With some development environments there are "intrinsic functions" callable from C that can be compiled to a single machine opcode, but that's not the case with the environment I'm using. I'm using the ARM development system that comes from ARM the company.

Another reason assembly can beat C is that there are control structures you can create that cannot be expressed in C. Calling into the middle of a function is one such control structure.

I once wrote a decompressor that had a switch statement inside the innermost loop. Each time through the loop the variable that the switch selected on was set to a new value that was fixed for each case of the switch. The result was that each case of the switch was performed in a fixed sequence.

What I did in assembler was jump off a register - that is, set the program counter to the value of a register. Then when I was done with my work I would add 8 to that register and go back to the top of the loop.

I put my different cases one after the other in the order that they needed to be called. Each case was 8 bytes.

In this way I was able to get my two innermost loops of my decompressor to fit into the 64 BYTE cache of the 68020 microprocessor, which was the most widely used CPU for Macs at the time. It doubled the speed of the code.
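A rough C analogue of that dispatch (names are hypothetical; the real trick set the program counter directly, and the "table" was the code itself, each case 8 bytes long):

```c
#include <assert.h>

/* Rough C analogue of jumping off a register: the cases live in a table
   in the order they must run, and stepping the index replaces the switch
   inside the loop.  (In the original, "add 8 to the register" advanced
   to the next case.) */
static int trace;

static void case_a(void) { trace = trace * 10 + 1; }
static void case_b(void) { trace = trace * 10 + 2; }
static void case_c(void) { trace = trace * 10 + 3; }

static void run_cases(void)
{
    void (*cases[])(void) = { case_a, case_b, case_c };
    for (unsigned i = 0; i < 3; i++)
        cases[i]();   /* the "jump off a register" */
}
```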

One thing you have to realize is that compilers try hard, but they are sometimes not all that good. Not as good as spending a week tuning up the very instructions that your CPU will execute. The code I'm working on will be executed very, very frequently and hopefully by a lot of happy customers, so it's worth the effort to optimize.

One reason that optimizers are often not as good as people seem to think they are is software patents.

There a great many well-known ways to optimize code, and if any single compiler could use them all, I imagine it would be hard to beat.

But I've heard that compiler technology is a minefield of patents. Every commercial compiler vendor owns some and doesn't allow their competition to use them.

The result is that each individual compiler is not able to use many, perhaps even most, of the possible optimizations that have been published.

Sometimes compiler vendors will cross-license patents, so they both benefit. That's one reason some commercial compilers are better than gcc.

But a problem with gcc is that by its nature it cannot use any patented algorithms, and so it is limited in the best it can hope to achieve.


--

Live your fucking life. Sue someone on the Internet. Write a fucking music player. Like the great man Michael David Crawford has shown us all: Hard work, a strong will to stalk, and a few fries short of a happy meal goes a long way. -- bride of spidy


[ Parent ]

Humans can *always* beat compilers (none / 0) (#65)
by hardburn on Mon Dec 16, 2002 at 01:09:49 PM EST

Before I go on, I need to insert an obligatory remark: rules about premature optimization still apply. We now return you to our regularly scheduled comment, already in progress.

A human with even a basic knowledge of ASM can always create code that is at least equal to the compiler. Why? Because the human can write the C code, push it through the compiler, grab the ASM output from the compiler, and then hand optimize that. And someone who is really good at ASM doesn't need no stinken' compiler.

Additionally, in some cases a human can look at the overall mathematical structure of a code block and optimize for that, whereas a compiler can only look at things one piece at a time.

There is one thing I've noticed about code snippets of hand-optimized ASM vs. the compiler output. While, cycle-for-cycle, the hand-optimized snippet is almost always faster, the compiler output tends to handle branch misprediction better. There is no reason why a human couldn't create code that handles branch misprediction better, but it tends to be true, at least from what I've seen on the net. Naturally, if your processor doesn't do branch prediction, you have nothing to worry about.


----
while($story = K5::Story->new()) { $story->vote(-1) if($story->section() == $POLITICS); }


[ Parent ]
NOT *always* (none / 0) (#73)
by jrst on Sat Jan 04, 2003 at 06:55:51 PM EST

Not that this has anything to do with the main topic, but I feel compelled to respond...

Yes, a human can take compiler output and optimize it further.*  And yes, a human--given sufficient time and energy--can out-optimize a compiler.

Theoretically. In practice, far from "always". It is not atypical for compilers to beat human programmers at optimizing code. This is not necessarily a common occurrence, but it's been going on for at least three decades (about the first time I witnessed it, and believe me, it was quite a shock).

Compilers tend to beat assembly language programmers for two reasons:

1. The more complex the architecture, the more difficult it typically is for a human to optimize for. The case I referred to above was an example of this phenomenon. The programmer who got beat was very good - but there were some out-of-order optimizations related to (the equivalent of) pipeline stalls that he had not accounted for.

2. Assembly language programs still have to be maintainable.  Thus--just as in high level languages--there are optimizations that programmers do not use because the optimizations would make the code unmaintainable.  Example: using a nearby instruction's opcode as a numeric literal.

-----
* I gotta wonder why you would do this.  Code is maintained at the level it is written.  If I have to go back through the hand optimization stage every time I change the high level source, maintainability goes way down.  And the smarter the compiler, the worse it gets.  Routinely massaging compiler assembly language output suggests that there are more serious structural problems in the code/programmer.

[ Parent ]

RISC OS (none / 0) (#47)
by beebware on Sun Dec 15, 2002 at 08:36:50 AM EST

You may find trawling the ODP category for RISC OS of use (fyi: RISC OS was designed for, and only runs on, ARM RISC processors). This site on ARM Assembler Programming may be of use, and you may find the programming forum on The IconBar of use as well (the IconBar and Drobe are RISC OS orientated bulletin boards - hence everyone there knows about ARMs and some even know ARM assembler in detail, including the differences between each chipset).
-- Blog: http://blog.beebware.co.uk
ARM7 has a simple pipeline (none / 0) (#49)
by pin0cchio on Sun Dec 15, 2002 at 09:19:00 AM EST

If I scale the algorithm's performance as reported on some other processors by the ratio of clock speeds, my target should be well within reach.

At slightly less than one instruction per clock, I'd find ARM7 clock-for-clock comparable to Intel i486.

I know that on other processors, if the second instruction of a pair depends on the results of the first instruction, a pipeline stall will occur while the first instruction completes, slowing everything down.

You're thinking of some highly superscalar processor. The ARM7 has a much simpler pipeline, where the execute and writeback stages take place within one cycle.

Because the Game Boy Advance system uses an ARM7TDMI processor, you might be able to find some people experienced in low-level ARM7T coding on EFnet #gbadev. Also look through some of the documents on gbadev.org.

(My vendor tells me that reading from RAM takes two cycles, writing takes one. In most cases the ARM can retire one instruction per clock, using a three-stage pipeline. However the instruction timings given in the paper cited above indicate a read takes three clocks, while a write takes two.)

Your vendor is kinda-sorta right: Reading memory (ldr) takes three cycles (compute address, memory access, retire), which is two cycles plus retire, plus however many wait states your system forces on you. Writing to memory (str) takes two cycles (compute address, memory access + retire), one cycle plus retire plus wait states.

(Running on different hardware is not an option at this time, for business reasons.)

In other words, you're making a 64 KB GBA intro, right?

(There are open source ARM emulators available as well, but I haven't tried any of them yet.)

Even if you're not working on the GBA, the debugging facilities in VisualBoyAdvance + GDB/Insight (both GPL'd) may help you get your algorithm working.


lj65
I don't understand the performance I am seeing (none / 0) (#50)
by MichaelCrawford on Sun Dec 15, 2002 at 09:57:12 AM EST

So I did as I said in a comment below, where I pasted my assembler source code into a spreadsheet and noted the execution times for each of the instructions.

Most instructions take 1 clock. Shifts take 2 (not ANDing, I was mistaken). Loading a register from RAM takes 3 clocks, according to the timing chart in that paper, and I think the RAM in my system is fast enough that it really does that. Branches take 3 clocks (because they flush the pipeline - there's no branch prediction).

Then I multiplied each time by the number of times it's executed in a loop, and then, in the cell to the left of that, by how many times it's executed in the next outer loop, and so on.

Then I added up the number of cycles executed, and divided by the clock speed (about 50 MHz). Then I inverted that to find the number of executions of my algorithm I can expect per second.
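The spreadsheet's bottom line amounts to this, with made-up block costs and trip counts:

```c
#include <assert.h>

/* The spreadsheet's tally with made-up numbers: weight each block's cycle
   count by its trip count, sum them, and the total converts to executions
   per second at the 50 MHz clock. */
static long total_cycles(const long cycles[], const long trips[], int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += cycles[i] * trips[i];   /* block cost x times executed */
    return sum;
}
```

At 50 MHz, a run costing 4000 cycles would allow 50,000,000 / 4000 = 12,500 executions per second.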

The figure I get is off from what I measure by a factor of four. Either I have made a mistake in my spreadsheet (a possibility, but I've looked it over pretty carefully) or something is going on that I don't understand (more likely, I think).

Hmph.


--

Live your fucking life. Sue someone on the Internet. Write a fucking music player. Like the great man Michael David Crawford has shown us all: Hard work, a strong will to stalk, and a few fries short of a happy meal goes a long way. -- bride of spidy


Shifts (none / 0) (#52)
by CrimsonDeath on Sun Dec 15, 2002 at 12:56:38 PM EST

Just a quick note about shifts. There's a good chance you know this already, but I'll just mention it in case.

One other place you can save memory and gain speed is if you use the barrel shifter wherever you can. It works really well, since you can use it with most instructions, and you can often sneak in your shifts with other instructions. I'm not sure if it causes a clock cycle penalty, but I'd imagine it's only one cycle at most.

I don't have my reference documentation here (it's at work) but normally using mov to do a shift is wasteful.

[ Parent ]

Memory Speed (none / 0) (#53)
by CrimsonDeath on Sun Dec 15, 2002 at 01:05:33 PM EST

Also on the subject of memory speed.

That is extremely dependent on your memory controller. If it's an ASIC you're working on, this can be almost anything. You'd have to look at any datasheets you can get for this. The time it takes to do memory accesses depends on how long this memory controller stalls the AHB bus. You probably have one SRAM-style controller for the Flash ROM, and another SDRAM controller if your RAM is SDRAM? In any case, you might want to double-check that you're running at full speed. The registers in these devices normally start up at the slowest speed until you initialise them, just so that it will work.

[ Parent ]

Pilot Error. My calculations really are correct. (none / 0) (#54)
by MichaelCrawford on Sun Dec 15, 2002 at 01:33:10 PM EST

It turns out that my spreadsheet that sums up the instruction times is very close to correct. The small remaining difference is explained, I think, by not being too sure how long it takes to load a register from Flash.

I had made some dumb arithmetic mistake in converting the total number of cycles to the same sort of figures I was measuring from my timing tests.

My vendor says I can read from the flash in five clocks. The table of instruction times says the LDR instruction takes three clocks. What is the total time? Five clocks or eight clocks, or something in between?

I'm going to have to measure it, which I will do in a little while.


--

Live your fucking life. Sue someone on the Internet. Write a fucking music player. Like the great man Michael David Crawford has shown us all: Hard work, a strong will to stalk, and a few fries short of a happy meal goes a long way. -- bride of spidy


[ Parent ]

I'd guess - ~6-8 cycles, depending on your flash (none / 0) (#68)
by artemb on Mon Dec 16, 2002 at 05:53:20 PM EST

Normal load takes 3 cycles: address calculation + load + retire.

Your vendor's estimate of 5-cycle flash access looks good in theory. The theory being that typical flash has an ~80-120ns access cycle. Your CPU clock is 50MHz -> 20ns cycle time. Therefore, to perform flash read cycles you'll need 4-6 CPU cycles.

All in all, you have

  • +1 address calculation cycle
  • +4-6 cycles for actual flash access
  • +1 retire cycle
  • = 6-8 CPU cycles.
In the end it looks like moving some of the code to RAM will give you a huge performance boost. That is, if you can spare some RAM for it.

[ Parent ]

Give it up (5.00 / 2) (#51)
by X3nocide on Sun Dec 15, 2002 at 12:28:34 PM EST

You just can't port quake2 to the GBA.

pwnguin.net
Never Say Die! (5.00 / 1) (#55)
by MichaelCrawford on Sun Dec 15, 2002 at 04:06:27 PM EST

I have not yet begun to code.


--

Live your fucking life. Sue someone on the Internet. Write a fucking music player. Like the great man Michael David Crawford has shown us all: Hard work, a strong will to stalk, and a few fries short of a happy meal goes a long way. -- bride of spidy


[ Parent ]

Some tips (5.00 / 1) (#59)
by jwaskett on Mon Dec 16, 2002 at 04:15:32 AM EST

I'm not very familiar with Thumb, but I've been writing ARM assembler since 1992 or so, so I hope I can help here.

Knowledge of other architectures will probably be of little help. ARM is quite unusual. There's only a single execution unit, so reordering instructions won't be of any benefit to you. Although you get slightly under one instruction per cycle, the great thing about ARM is how much you can do in that cycle.

The most important thing to do is to avoid pipeline stalls. That means avoiding branches. ARM has conditional execution for every instruction. Use it wherever possible. It depends on the core, but a good rule of thumb is that a branch is equivalent to three instructions. Also, remember that there is no branch predictor, so structure your code such that the most likely path will be quickest.

You've got 15 general purpose registers, which is often enough for tight loops to require little or no memory access. You can store the contents of the link register on the stack and use it freely. If you've got fixed addresses, you can store the stack pointer too.

Multiplies are still fairly expensive. A shift-and-add is often equivalent. eg ADD R0, R0, R0, LSL #1 is equivalent to multiplication by three. RSB is also your friend for multiplication by one less than powers of two.
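In C terms (a sketch): ADD R0, R0, R0, LSL #1 is multiply-by-3, and the RSB (reverse subtract) trick gives multiply-by-7 as (a << 3) - a:

```c
#include <assert.h>
#include <stdint.h>

/* The shift-and-add tricks in C.  mul3 corresponds to
   ADD R0, R0, R0, LSL #1; mul7 to RSB R0, R0, R0, LSL #3,
   i.e. (a << 3) - a, multiplication by one less than a power of two. */
static uint32_t mul3(uint32_t a) { return (a << 1) + a; }
static uint32_t mul7(uint32_t a) { return (a << 3) - a; }
```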

chilliesmad at hotmail dot com



Details my friend.. (4.00 / 2) (#64)
by bored on Mon Dec 16, 2002 at 12:59:13 PM EST

I just went through this a couple of weeks ago. I don't know how far along you are, so it's hard to give specifics. The first thing I did was move my code onto the on-chip RAM for my device. I'm using GCC, so this involved changing my linker script ('. = 0x02018000' to '. = 0x00100000'). This basically doubled my execution speed, since I'm using full 32-bit instructions and the external memory interface is only 16 bits wide. Ouch... :> Secondly, use the ARM compiler; it's faster than any of the other ones. Third, consider your algorithm. Finally, consider doing it in assembly. In my case, since GCC sucks so much, I just wrote a couple of small inline assembly routines and now I'm about 5x as fast as I need to be.

As others have pointed out, the ARM7TDMI chips tend to be really simple: no cache, single-pipe, in-order chips. This means that you basically optimize them like you optimize a 6502: by counting cycles. This is really easy on the ARM because most of the instructions are single-cycle. Try not to use conditional execution for more than 3 instructions, because a branch will probably be faster. After that, it's a matter of writing the code in as few instructions as possible. Do as much as you can in the registers. Since it's assembly, once you write a couple of versions of the code, all kinds of evil optimization ideas will become apparent. Consider lookup tables for algorithms that are slow. The ARM's clock rate to memory reference timings are bad enough that lookup tables are very useful for lots of things.

In the end, buy a faster chip. ARM7s tend to be pretty SLOW, so don't waste a bunch of time cramming a bunch of code into them when an extra $.50 will buy a nice ARM9 or a faster ARM7. This is especially true if your production run only calls for a few hundred.



Wacked idea (none / 0) (#66)
by bored on Mon Dec 16, 2002 at 01:27:38 PM EST

Well, I've been reading more of the comments, and paid closer attention to the article this time. grin.. So, here is a wacky idea. Have you considered compressing parts of the code/data? I did this a few years back. Compressing the code will be tricky if you're short on both code and data space. On the other hand, somewhere you said that a lot of the data is repetitive. So it sounds like a perfect chance to implement one of the simple compression/decompression routines and compress the data that isn't being used. You should be able to get a basic RLE in maybe 20 instructions. A data-specific compressor would probably work better, though. Then it becomes a matter of structuring your code so that data access is carefully controlled.



A couple of ideas (none / 0) (#67)
by julesb on Mon Dec 16, 2002 at 04:00:31 PM EST

I don't know the nature of your inner loops, but I'd bear in mind the possibilities of using/abusing the LDM/STM instructions to access the data you're working with. I think you mentioned using byte-sized quantities - perhaps you can read 8, 12 or 16 of these and operate on all quantities simultaneously, in a DSP-type style. There are tricks that enable you to do some quite spectacular things with just the plain ARM instruction set (less so with Thumb, I expect).

You might need to re-align or re-order any data you're working with to be able to do this effectively. Look around for simple stuff people have done with ARM assembler - bitblt functions, CRC32, etc. There are some (large) sources of graphics demos written in ARM assembler out there; it might help to add "Acorn" or "RISC OS" to your searches... (and try to ignore the bigots, they're embarrassing).

Would be easier to give advice if you could say what you're doing, of course.

Jules
-- Get off the Internet

Not ARM-specific books, but may be useful (none / 0) (#74)
by jrst on Sat Jan 04, 2003 at 07:15:08 PM EST

The "bible" for all the old microprocessors was published by Adam Osborne (et. al.).  There were several volumes that covered concepts, architecture and programming.

Every assembly language programmer I know who worked on microprocessors had at least one of the volumes on their bookshelf. I don't know of any modern books that provide an equivalent comparative study of microprocessors at a practical level.

They're all out of print (in print circa 1978-1982).  The architectures are not directly comparable to the ARM, but I'd bet a lot of tricks can still be used on the ARM.

There are several volumes that might be relevant to your quest:

Osborne 16-bit microprocessor handbook.*

An Introduction to Microcomputers: Volume 0 - The Beginner's Book.

An Introduction to Microcomputers: Volume 1 - Basic Concepts.

An Introduction to Microcomputers: Volume 2 - Some Real Microprocessors.*

---
* Probably of most interest/use given what you're after.

ARM Assembly Code Optimization? | 74 comments (55 topical, 19 editorial, 0 hidden)

kuro5hin.org

All trademarks and copyrights on this page are owned by their respective companies. The Rest © 2000 - Present Kuro5hin.org Inc.
See our legalese page for copyright policies. Please also read our Privacy Policy.
Kuro5hin.org is powered by Free Software, including Apache, Perl, and Linux, The Scoop Engine that runs this site is freely available, under the terms of the GPL.
Need some help? Email help@kuro5hin.org.
My heart's the long stairs.

Powered by Scoop create account | help/FAQ | mission | links | search | IRC | YOU choose the stories!