ARM Assembly Code Optimization?

By MichaelCrawford in Technology
Sun Dec 15, 2002 at 03:46:36 AM EST
Tags: Help! (Ask Kuro5hin)

I'm writing an embedded application for an ARM7TDMI processor. It's not going fast enough to bring me any joy yet. Not even close.

Can any of you suggest any books, magazine articles or websites that teach about optimizing ARM or Thumb assembly code? I need to reduce both my code size and my run time. I also need to reduce my application's overall RAM consumption, as my present RAM usage exceeds the tiny amount available on the chip by a few hundred bytes.

References on processor architectures other than the ARM could be useful too. I'm interested in learning any tricks I can for any architecture, because I may be able to figure out how they apply to the ARM7TDMI, or they may stimulate me to think up completely new approaches.


Some ARM processors like the ARM7TDMI support two instruction sets - 32-bit ARM instructions and 16-bit Thumb instructions. By "16-bit" I mean each instruction is 16 bits in size; Thumb code still has 32-bit registers and a 4 GB address space.

The "T" in the ARM7TDMI's model number indicates that it supports Thumb instructions - there are a variety of other ARM cores that do as well.

ARM code is more flexible and can be made to run faster, but Thumb takes less space for doing routine things. I'm planning to mix the two instruction sets, but I can't do that with complete freedom because there is a cost to switching modes (the pipeline gets flushed, for one thing, and you have to execute a couple of instructions to actually do the switch).
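For what it's worth, here is roughly what an ARM-state call into a Thumb routine looks like, as I understand the interworking sequence (the label is made up, and the Thumb routine would return with BX lr):

        ; ARM state: call a Thumb routine; setting bit 0 of the target
        ; address is what requests the state change
        LDR     r12, =thumb_func + 1
        MOV     lr, pc              ; return address = the instruction after the BX
        BX      r12                 ; flush the pipeline, switch to Thumb, jump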

I'm pretty sure that I'm using the most efficient algorithm available for what I'm trying to do. I have done some of the more obvious things already to improve my speed. There is more that I can easily do, but I don't think that what I presently know to do will get me to my goal.

If I scale the algorithm's performance as reported on some other processors by the ratio of clock speeds, my target should be well within reach. However, it is not as simple as that: there are other factors to consider, such as the relative efficiency of the different instruction sets and the capabilities of the hardware in question. Figures quoted for a Pentium III desktop PC cannot be counted on to scale proportionally to an embedded 50 MHz ARM7TDMI core.

There are some helpful documents at ARM's website, but I have not yet found the solution to my problem by studying them.

I'm particularly interested to understand better how the ARM7TDMI's pipeline can be utilized more effectively.

I know that on other processors, if the second instruction of a pair depends on the results of the first instruction, a pipeline stall will occur while the first instruction completes, slowing everything down.

Sometimes the pipeline can be kept full by moving an unrelated instruction between them in such a way that the whole program still gets the same results. The instruction in between gives the first instruction time to complete, so that what is now the third instruction can execute right away. Does the ARM7TDMI work like that?
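To make that concrete, here is a sketch of the kind of reordering I mean, written as ARM assembler with made-up registers; whether the ARM7TDMI's three-stage pipeline actually benefits from this is exactly what I'm asking:

        ; naive order: the ADD wants r0 on the very next cycle after the LDR
        LDR     r0, [r4]            ; fetch a value from RAM
        ADD     r1, r0, #1          ; depends on r0 right away
        SUB     r2, r2, #1          ; unrelated bookkeeping

        ; scheduled order: the unrelated SUB sits between producer and consumer
        LDR     r0, [r4]
        SUB     r2, r2, #1          ; touches neither r0 nor r4
        ADD     r1, r0, #1          ; by now r0 should be ready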

The ARM7TDMI does not have a cache. Reducing the overall code size won't in itself help the runtime; what will help most is reducing the total count of instructions executed, as well as eliminating any references to RAM that aren't strictly necessary.

The paper Cost-Effective Microarchitecture Optimization for ARM7TDMI Microprocessor has a simple table of instruction timings for ARM code, as well as a good explanation of how the ARM7TDMI works in general. It is available in PDF Format or Google's "view as HTML" format.

(My vendor tells me that reading from RAM takes two cycles, writing takes one. In most cases the ARM can retire one instruction per clock, using a three-stage pipeline. However the instruction timings given in the paper cited above indicate a read takes three clocks, while a write takes two.)

Thank you for any help you can offer. It's very important to me that I succeed with this application, but I foresee that some very difficult challenges lie ahead.

(Running on different hardware is not an option at this time, for business reasons.)

Here are some books that have been recommended to me by friends I asked about this. Unfortunately my budget is very limited and I can't afford to buy them all right now. Perhaps those of you who are familiar with them could comment on the best ones to get:

  • High Performance Computing

  • High Performance Computer Architecture

  • Hacker's Delight

Here are some books I already have:

  • Arm System-on-chip Architecture by Steve Furber

    This book has an easy-to-read introduction to ARM and Thumb assembly code and discusses in a general way how RISC processors work, but unfortunately is somewhat light on the kind of specifics that would help me at this point.

  • Optimizing PowerPC Code by Gary Kacmarcik

    This really helped me when I was working at Apple, but I don't think that most of what it has to say (like improving cache usage or keeping multiple execution units occupied) would apply to my situation.

  • Pentium Processor Optimization Tools by Michael L. Schmit

    This comes with a free version of Schmit's program PENTOPT, an "assembly code optimizer" for Pentium code. What it actually does is save a listing of your assembly program which has comments added to indicate how many cycles each instruction will take, and if an instruction takes a long time, it includes an explanation as to why, such as pipeline stalls and so on.

    I would be stoked to find out if there is a program that does the same for ARM and Thumb assembly.

If you'd like to learn more about ARM assembly code, check out Peter Cockerell's ARM Assembly Language Programming and Richard Murray's ARM Assembler. The ARM has a really pleasant assembly language, but unfortunately this is the first time I've written any, so I don't know any of the tricks yet.

I'm sorry to say that I am not at liberty to say what my application is until my client announces it.

I think the article would be more useful to other people, and not just me, if people posted any sort of clever assembly trick they know. Optimizations that are apparently unrelated to my problem might well stimulate me to think of something that helps, or suggest a way to apply them to my problem.

The algorithm in question, though, is entirely integer-oriented. The chip I'm targeting has 64 KB of flash ROM. I'm not sure if I'm allowed to say how much RAM is available, but it is a really tiny amount, as you might guess from my complaint that I'm over budget by a few hundred bytes.

I should point out that I am using a profiler to test my code.

There are two ways I do this - one is to load my code into the target system and execute my algorithm in a loop a few thousand times. I indicate the start and end of the test by lighting up certain LEDs on the development board I have to test with.

I can get more precise timing of the algorithm or its parts by using the ARMulator, a microprocessor emulator that comes with ARM's development system. This allows me to execute ARM binaries under Windows, where I can get at them with a source code debugger. (There are open source ARM emulators available as well, but I haven't tried any of them yet.)

What I can do is set a breakpoint just before stepping into a subroutine, reset the cycle counter, then step over the subroutine. The cycle counter then tells me how many cycles were spent executing that subroutine and every subroutine it calls, and so on.

There is an option for real-time simulation if I write a memory map configuration file that gives the address ranges for the different kinds of memory involved, their data bus width and access times.

What I don't seem to have, though, is a tool that will tell me how many clock cycles each instruction will take. The profiler can tell me which general areas are most important to address, and it can tell me if I've gained or lost, but it doesn't help much at the level of individual instructions.


Related Links
o Google
o some helpful documents
o PDF Format
o Google's "view as HTML"
o High Performance Computing
o High Performance Computer Architecture
o Hacker's Delight
o Arm System-on-chip Architecture
o Optimizing PowerPC Code
o Pentium Processor Optimization Tools
o ARM Assembly Language Programming
o ARM Assembler


ARM Assembly Code Optimization? | 74 comments (55 topical, 19 editorial, 0 hidden)
Code Optimization is well Studied (4.50 / 2) (#4)
by HidingMyName on Sat Dec 14, 2002 at 10:17:24 AM EST

Sounds like you have two problems, code size and code speed. Some optimizations improve one at the expense of the other, but some techniques can improve both. Could you confirm that you have enough PROM space but not enough volatile RAM?

Code optimizations happen at a variety of levels, starting at the highest level:

  • Algorithm and Problem Formulation
  • High level (intermediate code) optimizations
  • Low level (architecture dependent) optimizations
Hennessy, Patterson and Goldberg should provide good coverage of many low-level optimizations suitable for RISC processors.

Now for some remarks not commonly found in textbooks. Most of us know that algorithmic optimizations often provide the biggest improvements. Check closely for unused variables/constants/instructions/subroutines, as those consume space. Another thing you can do is aggressively look for repeated stretches of code and gather them into subroutines rather than coding them inline; this also improves cache performance. Another technique that works if you have null-terminated strings (a la C/C++) is to look for constant strings where one string is a suffix of another, and express the shorter string as a pointer into the correct position within the longer one. Use the stack when possible; this tends to reduce memory usage, since variables are only memory-resident while they are in scope (and are likely to be used). Be careful about register allocation: don't leave registers unused at the expense of memory.
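As a rough sketch of the last two points in ARM assembler (the labels, registers and the work inside the routine are hypothetical):

        ; a repeated stretch of code gathered into a subroutine; its scratch
        ; register only occupies stack space while the call is in progress
do_step STMFD   sp!, {r4, lr}       ; save a working register and the return address
        LDRB    r4, [r0], #1        ; the repeated work (here: fetch a byte
        ADD     r1, r1, r4          ; and accumulate it)
        LDMFD   sp!, {r4, pc}       ; restore and return in one instruction

        ; each former inline copy becomes a single-instruction call:
        BL      do_step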

ying and yang (none / 0) (#8)
by turmeric on Sat Dec 14, 2002 at 11:01:28 AM EST

why has nobody mentioned profiling here? would you try to improve your hundred yard dash without a stopwatch?

It's a compute-bound problem (5.00 / 2) (#12)
by MichaelCrawford on Sat Dec 14, 2002 at 01:10:07 PM EST

I should add that, while my overall application does involve I/O, the problem at hand is purely compute bound.

I wanted to point this out because many performance tuning problems on operating systems involve getting the best out of your I/O, for example minimizing the latency or bandwidth of network traffic. That's an interesting problem too, but it's not my problem at the moment.

Problems with I/O performance are probably the best kind to address by changing your algorithm or overall architecture, and are less likely to be helped by examining the pipeline utilization of your machine code.

In my case, I can put my bottleneck all inside a single subroutine (and the subroutines it calls), and do everything I need to do by just calling that subroutine. Nothing but computation and transfer to or from local memory happens during the call.

And that computation just takes too long.

Thank you for all your help.


--

Live your fucking life. Sue someone on the Internet. Write a fucking music player. Like the great man Michael David Crawford has shown us all: Hard work, a strong will to stalk, and a few fries short of a happy meal goes a long way. -- bride of spidy


A couple of possibilities (5.00 / 1) (#25)
by CrimsonDeath on Sat Dec 14, 2002 at 06:14:40 PM EST

(repost as topical)

I've done a bit of work with ARM, so I just have a couple of thoughts.

1) What size is your bus? If it's only 16-bit, then Thumb is really the way to go ... it'll be much faster and you don't lose much

2) If you're staying with 32-bit, you can increase speed (in some cases) and reduce memory use (always) if you make sure you use the conditional execution flags on all instructions where you can instead of branching

3) Are you sure it's not cached? I guess you must be  running with an MMU-less processor with the ARM7TDMI core?

And the number of clock cycles is described in the ARM7TDMI data sheets, I hope you have those. Sometimes it's really hard to know due to pipelining/caching, but if you don't have a cache (as you say) then that's not a problem.

If performance is important to you though, you shouldn't be running without cache.


questions and answers (5.00 / 3) (#26)
by pb on Sat Dec 14, 2002 at 06:16:17 PM EST

I looked up the data sheet on the ARM7TDMI processor and the assembler looks relatively normal... You say that you need to optimize for size and speed.

My advice is to use the thumb instructions as much as possible to reduce the code size of your application (if this is an issue).

One optimization to increase speed at the expense of code size is loop unrolling.  This can also get rid of branching and pipeline stalls.  For example, I had an (x86) assembly program that--among other things--needed to sort four characters.  I was able to get this down to 17 instructions--two to load/store two of the characters in memory, and 3 for each of the five possible swaps.  You see, I determined that it's possible to sort four numbers with at most 5 static swaps (this is an algorithmic improvement) and instead of having a loop of some sort, I just specified all five swaps (unrolled).  Code follows:


bsort:  mov     ax,[word ptr n2]        ;use al, ah as n2, n3
        cmp     [n1],al                 ;if (n1<n2)
        jae     skip1                   ;then
        xchg    [n1],al                 ;swap(n1,n2)
skip1:  cmp     ah,[n4]                 ;etc... basically
        jae     skip2                   ;this is a binary
        xchg    ah,[n4]                 ;subdivision, in-place
skip2:  cmp     [n1],ah                 ;sorting algorithm,
        jae     skip3                   ;but hardcoded for
        xchg    [n1],ah                 ;four characters.
skip3:  cmp     al,[n4]                 ;(and hopefully
        jae     skip4                   ;it will work...)
        xchg    al,[n4]                 ;...swap(n2,n4)
skip4:  cmp     al,ah                   ;if (n2<n3)
        jae     skip5                   ;then
        xchg    al,ah                   ;swap(n2,n3)
skip5:  mov [word ptr n2],ax            ;save new n2, n3

Now obviously this is a very special purpose algorithm, which is a big problem with giving people generic advice on assembly optimization--usually the best optimizations you can make will end up being extremely specific, so it's unlikely that we'll stumble on the right one for you.  However, loop unrolling is a very generic optimization, which is great for speeding up tight loops.
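For what it's worth, here's roughly what an unrolled loop might look like on the ARM (I haven't run this on real hardware, and the register choices are arbitrary):

        ; sum 16 words starting at [r1], unrolled four at a time so the
        ; branch is taken 4 times instead of 16
        MOV     r0, #0
        MOV     r3, #4              ; 16 words / 4 per iteration
sumlp   LDMIA   r1!, {r4-r7}        ; fetch four words and advance r1
        ADD     r0, r0, r4
        ADD     r0, r0, r5
        ADD     r0, r0, r6
        ADD     r0, r0, r7
        SUBS    r3, r3, #1          ; decrement the iteration count, set flags
        BNE     sumlp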

Another thing I did later on in the same program was that I managed to convert a summation into a lookup table.  Converting even relatively simple computations into (hopefully small) lookup tables will lower your total instructions executed (which was the task here).

Original code (before the lookup table):

nofirst:mov     bx,[word ptr n1]        ;get first two characters
        sub     bl,[n4]                 ;find position in list
        sub     bh,[n3]                 ;(summation - 1)
        mov     al,bl                   ;this does a summation
        inc     al                      ;and subtracts one
        mul     bl                      ;to find the offset in memory
        shr     ax,1                    ;for stable.

After the lookup table:


itable   db  0, 0, 2, 5, 9, 14, 20, 27, 35, 44
[...]
        sub     al,ah                   ;subtract n2, n3
        mov     bh,0                    ;clear bh
        mov     bl,[n1]                 ;bl=n1
        sub     bl,[n4]                 ;bl=n1-n4
        mov     bl,[itable+bx]          ;lookup summation thingy

You might have to take my word for it that these two chunks of code do the same thing--I probably reorganized a couple of other things between these two versions as well.  But the main thing to note is that besides shaving off a couple of instructions, we've avoided a mul as well!  And our lookup table isn't very large.  I know you've already discovered the joy of lookup tables in assembler optimization, but consider this encouragement to take it one or two steps farther whenever you need to (dramatically) lower your instructions executed.
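On the ARM the equivalent lookup should be even cheaper, since a load can take a register offset directly (again untested, and assuming the table above is assembled as bytes with DCB):

        ADR     r2, itable          ; r2 = address of the lookup table
        LDRB    r0, [r2, r1]        ; r0 = itable[r1]: one load, no MUL needed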

I understand if you can't tell us very much about what you're working on now, so look up as much as you can on the net for specific algorithms and optimizations related to what you're doing.  Or read through some appropriately licensed source code if you can find free software that performs similar tasks.  (I suppose free software written in ARM assembler would be even better in this case...)

Best of luck to you!
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall

intel StrongArm and ARM architecture documents (none / 0) (#28)
by sye on Sat Dec 14, 2002 at 06:36:13 PM EST

ftp://download.intel.com/design/strong/manuals/27824004.zip
http://www.google.com/search?q=DDI0100E_ARM_ARM.pdf.
Hope these two links are still good.

~~~~~~~~~~~~~~~~~~~~~~~
commentary - For a better sye@K5
~~~~~~~~~~~~~~~~~~~~~~~
ripple me ~~> ~allthingsgo: gateway to Garden of Perfect Brightess in crypto-cash
rubbing u ~~> ~procrasti: getaway to HE'LL
Hey! at least he was in a stable relationship. - procrasti
Enter K5 via my lair

A Pentium is not an ARM processor (none / 0) (#29)
by GGardner on Sat Dec 14, 2002 at 06:40:28 PM EST

Well, that's obvious. Having done a fair amount of embedded programming myself, I'm frequently surprised by how fast our modern x86 CPUs are, even on a clock-for-clock basis. A simple embedded CPU, such as an ARM, especially a cache-less one running at one-tenth the clock speed of a Pentium, will often run the same C code 50-100 times slower than the Intel or AMD chip.

The ARM chips are pretty nifty, run with amazingly low power/MIPS, and are much more deterministic than a modern Pentium. (Note that ever since the Pentium II or so, Intel has stopped publishing clock cycles for each instruction, because it just varies so much.) However, the price you pay for all this goodness is that the throughput of the ARM chip, even clock-for-clock, is a lot slower than a CPU aimed at a desktop or server.

look up table mathematics? (5.00 / 1) (#31)
by martingale on Sat Dec 14, 2002 at 09:01:57 PM EST

You stated that your "fast" algorithm uses a lookup table. I don't know what you're doing, but is it possible that your lookup table has redundant information? Symmetries you can use to trade off size for an added couple of instructions? Since your reference implementation fits in memory, I presume your lookup table is fairly big, so even a reduction by a factor of two might be enough already.

Another thing you can try is to tweak the table, trading off accuracy for symmetry. Do the calculations have to be completely, 100% exact or can you perturb a couple of entries, making them wrong but so similar to something else that you obtain a symmetry you can leverage?

Actually, there's more potential in this sort of direction. Assuming you're doing some kind of complex calculation, how about doing a first-order approximation to obtain an inexact but close alternative calculation?

Here's what I'm setting into now (none / 0) (#40)
by MichaelCrawford on Sun Dec 15, 2002 at 04:43:31 AM EST

First, I must say I found a pleasant surprise when I woke up early this morning. I had to go to sleep in the middle of the voting yesterday because I'd been up all night hacking. It was very nice to get up and find my post approved by the moderators.

What I'm about to do is paste the entire source code for my assembly code algorithm into a Quattro Pro spreadsheet (my wife's a WordPerfect fan), leaving several columns of empty cells on the left that I will use for figuring.

Then I will refer to the timing chart I mention in the article above to note the times for each instruction.

Then I will tally them up proportionally to the number of times each loop is executed. I'll have to work out a nice way to do this, but if I know a loop is executed four times, then I'll multiply its time by four.

The whole thing is kind of complicated so I will probably be a few hours doing this but I hope to find it enlightening.

I already did this for the most-frequently executed block of code, and based on that I would have expected my whole algorithm to be about five times faster than it is. Something that I don't understand is killing my performance, and I hope to understand it by doing this.

One concern I have is that there are a lot of branches in my code. A branch costs three clock cycles because it flushes the pipeline - the ARM7TDMI doesn't have branch prediction or speculative execution like some fancier RISC processors (and the Pentium-class processors are RISCs in fancy clothes).

Many of those branches are unavoidable but there is probably something I can do about some of them.




URL for PowerPC Compiler Writer's Guide (none / 0) (#41)
by blue0 on Sun Dec 15, 2002 at 05:01:45 AM EST

Michael,

Been reading you in Advogato!

This one at IBM, and another one outside IBM.

Regards, rogersm.
My way of joking is to tell the truth. That's the funniest joke in the world. -- Muhammad Ali

Here's a couple things I did (5.00 / 1) (#42)
by MichaelCrawford on Sun Dec 15, 2002 at 05:02:28 AM EST

Here's a couple things I did, one I picked up from the doc at ARM's website and the other I figured out after thinking about it for a couple days.

I frequently have a need to mask a 32-bit value down to an 8-bit value. ARM and Thumb instructions all operate on 32-bit values, with the exception of the loads and stores that handle bytes and halfwords.

The way I was doing that was to AND the value with 0xff. That's what the C code did after all.

But in both ARM and Thumb it costs two cycles to do an AND (I can't imagine why) and in Thumb it costs a register to hold the bitmask, because Thumb's smaller instruction word size places a lot of restrictions on immediate values - you can't do an immediate AND in Thumb.

But ARM distributes an Application Note called Writing Efficient C for ARM that mentions in passing that the way the compiler masks an int down to a short or char is to shift left to clobber the high bits, then shift back to the right. The right shift zero-fills the upper bits, so no mask is needed.

This still takes two cycles but it saves me a register, and by doing so I am now able to avoid some RAM reads in my innermost loop!
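In other words (r0 here stands in for whatever register holds the value):

        ; ARM: mask r0 down to its low byte with no bitmask register
        MOV     r0, r0, LSL #24     ; throw away the high 24 bits
        MOV     r0, r0, LSR #24     ; the logical right shift zero-fills the top

        ; Thumb: the same trick, still no immediate 0xff required
        LSL     r0, r0, #24
        LSR     r0, r0, #24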

The other thing I have done is that I knew some code in a subroutine wasn't needed every time the subroutine was called. But it seemed too expensive to use a conditional branch to jump around it.

What I did I'm sure would horrify most object-oriented purists - I moved the code in question to the top of the subroutine, and added another entry point just after. So now one of my subroutines has two entry points!

You may gag at that, but it just shaved 1500 instructions off the total needed to execute my algorithm once, because that code was also in the innermost loop.
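Schematically it looks something like this; the labels and the instructions are made up, since I can't show the real routine:

full_entry                          ; callers that need the setup come in here
        LDR     r0, [r4]            ; the occasionally-needed setup work
        ADD     r0, r0, #1
        STR     r0, [r4]
fast_entry                          ; the innermost loop branches in here instead
        LDRB    r1, [r5], #1        ; the main body of the routine
        ADD     r6, r6, r1
        MOV     pc, lr              ; return (ARM-to-ARM, so no BX needed)

Callers just do BL full_entry or BL fast_entry depending on whether they need the setup.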

Thank you for your attention.




Assembly Optimization (5.00 / 1) (#43)
by kreyg on Sun Dec 15, 2002 at 05:12:18 AM EST

I've never worked with the ARM processor, but I've been mired in Playstation2 microcode for ages, and did lots of 6502 and x86 assembly years ago, so I might have a couple pointers...

1. Get a decent optimizing assembler! The computer's going to be way better at figuring it out than you are, and make a lot fewer errors. Some will do the optimization I've outlined in tip 5 automatically, but as far as I know it's not very common. If the compiler will only do what you say, rather than what you mean, then:
2. For memory reads/writes, use the automatic pre/post-increment/decrement capabilities. This will save you an extra addition or two to move your read/write pointers. Rearrange your data to accommodate this if at all possible.
3. You may also be able to remove a counter from the main loop by pre-calculating the end address of your data (start+numloops) and comparing with that value rather than a separate counter register (a rough ARM sketch of this and the previous tip follows the example below).
4. If it's a 3-instruction pipeline, you'll probably need to put two instructions after a register write before you use the result to avoid stalls.
5. Along the same lines, you can be evaluating several different stages in parallel during a single loop. This would be like taking three iterations of the loop and interleaving them. Example:
[Load data 0]
[Data 0 Operation 0]
[Load data 1]
[Data 0 Operation 1]
[Data 1 Operation 0]
Begin Loop, n = 0 to numloops:
[Load data 2+n]
[Data n operation 2]
[Data 1+n operation 1]
[Data 2+n operation 0]
[Write data n]
Loop

That might be a little hard to follow, but it should let you execute with no stalls.
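To make tips 2 and 3 concrete, here is a rough ARM version (I've never used an ARM assembler, so treat the syntax as approximate):

        ; copy r0 words from [r1] to [r2] using post-increment addressing
        ; and an end pointer instead of a separate loop counter
        ADD     r3, r1, r0, LSL #2  ; r3 = source end address (r0 words * 4 bytes)
cploop  LDR     r4, [r1], #4        ; load and advance the read pointer
        STR     r4, [r2], #4        ; store and advance the write pointer
        CMP     r1, r3
        BLO     cploop              ; loop until the read pointer hits the end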

There was a point to this story, but it has temporarily escaped the chronicler's mind. - Douglas Adams
Insipid Advice (none / 0) (#45)
by bugmaster on Sun Dec 15, 2002 at 06:43:18 AM EST

Sorry, I have never programmed for ARM, so any advice I can offer is probably utterly idiotic.

Nonetheless: have you considered simply writing your program in C and letting the compiler optimize it for you? Modern compilers are usually very good at optimizing code (once you turn on all the flags); much better than human beings. In fact, many optimization techniques that you (and others) have mentioned, such as pipelining, loop unrolling, etc., are usually performed automatically by the compiler.

This was certainly the case for me and DOS/Windows/MIPS; but then again, perhaps my assembly skills just weren't all that great...
>|<*:=

RISC OS (none / 0) (#47)
by beebware on Sun Dec 15, 2002 at 08:36:50 AM EST

You may find trawling the ODP category for RISC OS of use (fyi: RISC OS was designed for, and only runs on, ARM RISC processors). This site on ARM Assembler Programming may be of use, and you may find the programming forum on The IconBar of use as well (the IconBar and Drobe are RISC OS orientated bulletin boards - hence everyone there knows about ARMs and some even know ARM assembler in detail, including the differences between each chipset).
-- Blog: http://blog.beebware.co.uk
ARM7 has a simple pipeline (none / 0) (#49)
by pin0cchio on Sun Dec 15, 2002 at 09:19:00 AM EST

If I scale the algorithm's performance as reported on some other processors by the ratio of clock speeds, my target should be well within reach.

At slightly less than one instruction per clock, I'd find ARM7 clock-for-clock comparable to Intel i486.

I know that on other processors, if the second instruction of a pair depends on the results of the first instruction, a pipeline stall will occur while the first instruction completes, slowing everything down.

You're thinking of some highly superscalar processor. The ARM7 has a much simpler pipeline, where the execute and writeback stages take place within one cycle.

Because the Game Boy Advance system uses an ARM7TDMI processor, you might be able to find some people experienced in low-level ARM7T coding on EFnet #gbadev. Also look through some of the documents on gbadev.org.

(My vendor tells me that reading from RAM takes two cycles, writing takes one. In most cases the ARM can retire one instruction per clock, using a three-stage pipeline. However the instruction timings given in the paper cited above indicate a read takes three clocks, while a write takes two.)

Your vendor is kinda-sorta right: Reading memory (ldr) takes three cycles (compute address, memory access, retire), which is two cycles plus retire, plus however many wait states your system forces on you. Writing to memory (str) takes two cycles (compute address, memory access + retire), one cycle plus retire plus wait states.

(Running on different hardware is not an option at this time, for business reasons.)

In other words, you're making a 64 KB GBA intro, right?

(There are open source ARM emulators available as well, but I haven't tried any of them yet.)

Even if you're not working on the GBA, the debugging facilities in VisualBoyAdvance + GDB/Insight (both GPL'd) may help you get your algorithm working.


lj65
I don't understand the performance I am seeing (none / 0) (#50)
by MichaelCrawford on Sun Dec 15, 2002 at 09:57:12 AM EST

So I did as I said in a comment below, where I pasted my assembler source code into a spreadsheet and noted the execution times for each of the instructions.

Most instructions take 1 clock. Shifts take 2 (not AND, as I mistakenly said earlier). Loading a register from RAM takes 3 clocks, according to the timing chart in that paper, and I think the RAM in my system is fast enough that it really does take that. Branches take 3 clocks (because they flush the pipeline - there's no branch prediction).

Then I multiplied each time by the number of times it's executed in a loop, and then, in the cell to the left of that, by how many times it's executed in the next outer loop, and so on.

Then I added up the number of cycles executed and divided by the clock speed (about 50 MHz). Then I inverted that to find the number of executions of my algorithm I can expect per second.

The figure I get is off from what I measure by a factor of four. Either I have made a mistake in my spreadsheet (a possibility, but I've looked it over pretty carefully) or something is going on that I don't understand (more likely, I think).

Hmph.




Give it up (5.00 / 2) (#51)
by X3nocide on Sun Dec 15, 2002 at 12:28:34 PM EST

You just can't port quake2 to the GBA.

pwnguin.net
Some tips (5.00 / 1) (#59)
by jwaskett on Mon Dec 16, 2002 at 04:15:32 AM EST

I'm not very familiar with Thumb, but I've been writing ARM assembler since 1992 or so, so I hope I can help here.

Knowledge of other architectures will probably be of little help. ARM is quite unusual. There's only a single execution unit, so reordering instructions won't be of any benefit to you. Although you get slightly under one instruction per cycle, the great thing about ARM is how much you can do in that cycle.

The most important thing to do is to avoid pipeline stalls. That means avoiding branches. ARM has conditional execution for every instruction. Use it wherever possible. It depends on the core, but a good rule of thumb is that a branch is equivalent to three instructions. Also, remember that there is no branch predictor, so structure your code such that the most likely path will be quickest.
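For example (the registers are arbitrary):

        ; with a branch: the taken branch flushes the pipeline
        CMP     R0, #0
        BEQ     skip
        ADD     R1, R1, #1
        SUB     R2, R2, #4
skip                                ; execution continues here either way

        ; branch-free: the same logic using conditional execution
        CMP     R0, #0
        ADDNE   R1, R1, #1          ; executed only if R0 != 0
        SUBNE   R2, R2, #4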

You've got 15 general purpose registers, which is often enough for tight loops to require little or no memory access. You can store the contents of the link register on the stack and use it freely. If you've got fixed addresses, you can store the stack pointer too.

Multiplies are still fairly expensive. A shift-and-add is often equivalent, e.g. ADD R0, R0, R0, LSL #1 is equivalent to multiplication by three. RSB is also your friend for multiplication by one less than a power of two.
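A few more of the same kind, each line taken on its own:

        ADD     R0, R0, R0, LSL #2  ; R0 = R0 * 5
        RSB     R0, R0, R0, LSL #3  ; R0 = (R0 << 3) - R0 = R0 * 7
        ADD     R0, R0, R0, LSL #3  ; R0 = R0 * 9
        RSB     R0, R0, R0, LSL #4  ; R0 = R0 * 15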

chilliesmad at hotmail dot com



Details my friend.. (4.00 / 2) (#64)
by bored on Mon Dec 16, 2002 at 12:59:13 PM EST

I just went through this a couple of weeks ago. I don't know how far along you have come, so it's hard to give specifics. The first thing I did was move my code onto the on-chip RAM for my device. I'm using GCC, so this involved changing my linker script ('. = 0x02018000' to '. = 0x00100000'). This basically doubled my execution speed, since I'm using full 32-bit instructions and the external memory interface is only 16 bits wide. Ouch... :> Secondly, use the ARM compiler; it's faster than any of the other ones. Third, consider your algorithm. Finally, consider doing it in assembly. In my case, since GCC sucks so much, I just wrote a couple of small inline assembly routines and now I'm about 5x as fast as I need to be.

As others have pointed out, the ARM7TDMI chips tend to be really simple: no cache, single-pipe, in-order chips. This means that you basically optimize them like you optimize a 6502: by counting cycles. This is really easy on the ARM because most of the instructions are single cycle. Try not to use conditional execution for more than 3 instructions, because a branch will probably be faster. After that, it's a matter of writing the code in as few instructions as possible. Do as much as you can in the registers. Since it's assembly, once you write a couple of versions of the code, all kinds of evil optimization ideas will become apparent. Consider lookup tables for algorithms that are slow. The ARM's clock-rate-to-memory-reference timings are bad enough that lookup tables are very useful for lots of things.

In the end, buy a faster chip. ARM7s tend to be pretty SLOW, so don't waste a bunch of time cramming a bunch of code into them when an extra $.50 will buy a nice ARM9 or a faster ARM7. This is especially true if your production run only calls for a few hundred.



Wacked idea (none / 0) (#66)
by bored on Mon Dec 16, 2002 at 01:27:38 PM EST

Well, I've been reading more of the comments, and paid closer attention to the article this time. grin.. So, here is a wacky idea. Have you considered compressing parts of the code/data? I did this a few years back. Compressing the code will be tricky if you're short on both code and data space. On the other hand, somewhere you said that a lot of the data is repetitive. So it sounds like a perfect chance to implement one of the simple compression/decompression routines and compress the data that isn't being used. You should be able to get a basic RLE in maybe 20 instructions. A data-specific compressor would probably work better, though. Then it becomes a matter of structuring your code so that data access is carefully controlled.
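Something like this, roughly, assuming a made-up byte-oriented format (a count byte followed by the value to repeat, with a zero count marking the end of the stream):

        ; r0 = compressed source, r1 = destination
rle     LDRB    r2, [r0], #1        ; fetch the run length
        CMP     r2, #0
        MOVEQ   pc, lr              ; zero length = end of stream, return
        LDRB    r3, [r0], #1        ; fetch the byte to repeat
fill    STRB    r3, [r1], #1        ; write it out...
        SUBS    r2, r2, #1
        BNE     fill                ; ...length times
        B       rle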



A couple of ideas (none / 0) (#67)
by julesb on Mon Dec 16, 2002 at 04:00:31 PM EST

I don't know the nature of your inner loops, but I'd bear in mind the possibilities of using/abusing the LDM/STM instructions to access the data you're working with. I think you mentioned using byte-sized quantities - perhaps you can read 8, 12 or 16 of these and operate on all quantities simultaneously, in a DSP-type style. There are tricks that enable you to do some quite spectacular things with just the plain ARM instruction set (less so with Thumb, I expect).
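For instance, instead of sixteen LDRBs you can pull sixteen bytes in with a single LDM and peel them apart with shifts. A sketch, assuming little-endian byte order and arbitrary registers:

        LDMIA   r1!, {r2-r5}        ; one instruction fetches 16 bytes into r2-r5
        AND     r0, r2, #0xFF       ; byte 0 of the block
        MOV     r6, r2, LSR #8
        AND     r6, r6, #0xFF       ; byte 1
        MOV     r7, r2, LSR #16
        AND     r7, r7, #0xFF       ; byte 2
        MOV     r8, r2, LSR #24     ; byte 3 (the shift already zero-fills)
        ; ...and likewise for the bytes packed into r3-r5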

You might need to re-align or re-order any data you're working with to be able to do this effectively. Look around for simple stuff people have done with ARM assembler - bitblt functions, CRC32, etc. There are some (large) sources of graphics demos written in ARM assembler out there; it might help to add "Acorn" or "RISC OS" to your searches... (and try to ignore the bigots, they're embarrassing).

Would be easier to give advice if you could say what you're doing, of course.

Jules
-- Get off the Internet

Not ARM-specific books, but may be useful (none / 0) (#74)
by jrst on Sat Jan 04, 2003 at 07:15:08 PM EST

The "bible" for all the old microprocessors was published by Adam Osborne (et. al.).  There were several volumes that covered concepts, architecture and programming.

Every assembly language programmer I know who worked on microprocessors had at least one of the volumes on their bookshelf. I don't know of any modern books that provide an equivalent comparative study of microprocessors at a practical level.

They're all out of print (in print circa 1978-1982).  The architectures are not directly comparable to the ARM, but I'd bet a lot of tricks can still be used on the ARM.

There are several volumes that might be relevant to your quest:

Osborne 16-bit microprocessor handbook.*

An Introduction to Microcomputers-Volume 0 - The Beginners Book.

An Introduction to Microcomputers Volume 1 - Basic Concepts.

An Introduction to Microcomputers: Volume 2 - Some Real Microprocessors.*

---
* Probably of most interest/use given what you're after.
