Some ARM processors, like the ARM7TDMI, support two instruction sets: 32-bit ARM
instructions and 16-bit Thumb instructions. By "16 bits" I mean each
instruction is 16 bits in size; Thumb code still works with 32-bit registers and a 4 GB
address space.
The "T" in the ARM7TDMI's model number indicates that it supports Thumb instructions - there are a variety of other ARM cores that do as well.
ARM code is more flexible and can be made to run faster, but Thumb takes less
space for doing routine things. I'm planning to mix the two instruction sets,
but I can't do that with complete freedom because there is a cost to switching
modes (the pipeline gets flushed, for one thing, and you have to execute a
couple of instructions to actually do the switch).
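To make that cost concrete, here is a minimal sketch of an ARM-to-Thumb call the way I understand it has to be done on the ARM7TDMI (no BLX on ARMv4T). I'm using GNU assembler syntax here; ARM's own armasm spells the directives CODE32/CODE16 instead, and the label names are just mine for illustration:

	        .text
	        .code 32                    @ 32-bit ARM instructions from here
	@ ARM-state wrapper that calls a Thumb routine and returns to its own caller.
	call_thumb_from_arm:
	        stmfd   sp!, {lr}           @ save our own return address
	        adr     lr, back_in_arm     @ where the Thumb routine should come back to
	        adr     r0, thumb_routine
	        orr     r0, r0, #1          @ make sure bit 0 is set: this requests Thumb state
	        bx      r0                  @ branch and exchange: this is where the pipeline flush happens
	back_in_arm:
	        ldmfd   sp!, {lr}
	        bx      lr                  @ return to our own (ARM-state) caller
	
	        .code 16                    @ 16-bit Thumb instructions from here
	        .thumb_func
	thumb_routine:
	        @ ... a few Thumb instructions doing the routine work ...
	        bx      lr                  @ bit 0 of lr is clear, so this switches back to ARM state

The BX instructions are the switch points, and each one costs a pipeline refill on top of the few setup instructions around it, which is why I can't scatter switches around freely.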
I'm pretty sure that I'm using the most efficient algorithm available for what
I'm trying to do. I have done some of the more obvious things already to
improve my speed. There is more that I can easily do, but I don't think that
what I presently know to do will get me to my goal.
If I scale the algorithm's performance as reported on some other processors by the ratio of clock speeds, my target should be well within reach. But it isn't that simple: there are other factors to consider, such as the relative efficiency of the different instruction sets and the capabilities of the hardware in question. Figures quoted for a Pentium III desktop PC cannot be counted on to scale proportionally down to an embedded 50 MHz ARM7TDMI core.
There are
some helpful documents at ARM's website, but I have not yet found the solution to my problem by studying them.
I'm particularly interested to understand better how the ARM7TDMI's pipeline
can be utilized more effectively.
I know that on other processors, if the second instruction of a pair depends
on the results of the first instruction, a pipeline stall will occur while the
first instruction completes, slowing everything down.
Sometimes the pipeline can be kept full by moving an unrelated instruction
between them in such a way that the whole program still gets the same results.
The instruction in between gives the first instruction time to complete, so that
what is now the third instruction can execute right away. Does the ARM7TDMI work like that?
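Here is the kind of reordering I mean, written with ARM mnemonics; whether the ARM7TDMI actually benefits from it is exactly what I'm asking. The registers and surrounding code are made up:

	        @ Dependent pair: the ADD needs r0, which the LDR is still fetching.
	        ldr     r0, [r4]
	        add     r0, r0, r1
	        mov     r2, #0              @ unrelated work
	
	        @ Reordered: the unrelated MOV sits between the load and its use,
	        @ giving the load time to complete on cores that would otherwise stall.
	        ldr     r0, [r4]
	        mov     r2, #0
	        add     r0, r0, r1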
The ARM7TDMI does not have a cache. Reducing the overall code size won't, in
itself, help the runtime; what will help most is reducing the total number of
instructions executed and eliminating any references to RAM that aren't
strictly necessary.
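One concrete way I know of to cut both the instruction count and the per-access overhead, assuming the data happens to sit in consecutive words, is to fold separate loads into a load-multiple:

	        @ Three separate loads: each one pays its own instruction overhead.
	        ldr     r0, [r4, #0]
	        ldr     r1, [r4, #4]
	        ldr     r2, [r4, #8]
	
	        @ One load-multiple: the overhead is paid once for all three words.
	        ldmia   r4, {r0, r1, r2}

The same trick applies to stores with STMIA. Whether it helps in my inner loop depends on how the data is laid out, of course.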
The paper Cost-Effective Microarchitecture Optimization for ARM7TDMI Microprocessor has a simple table of instruction timings for ARM code, as well as a good explanation of how the ARM7TDMI works in general. It is available in
PDF Format or
Google's "view as HTML" format.
(My vendor tells me that reading from RAM takes two cycles and writing takes one.
In most cases the ARM can retire one instruction per clock,
using its three-stage pipeline. However, the instruction timings given in the paper cited above indicate that a read takes three clocks, while a write takes two.)
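My best guess at reconciling those two sets of numbers (and it is only a guess, not something my vendor confirmed) is that the vendor may be quoting the cycles an access adds on top of the single cycle every instruction costs anyway, while the paper quotes the total. Using ARM's usual notation of sequential (S), non-sequential (N) and internal (I) cycles, and assuming zero-wait-state memory where each is one clock:

	        ldr     r0, [r1]        @ 1S + 1N + 1I = 3 cycles total, i.e. 2 beyond the base cycle
	        str     r0, [r2]        @ 2N           = 2 cycles total, i.e. 1 beyond the base cycle
	        add     r0, r0, #1      @ 1S           = 1 cycle for a simple data-processing instruction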
Thank you for any help you can offer. It's very important to me that I succeed
with this application, but I foresee that some very difficult challenges lie
ahead.
(Running on different hardware is not an option at this time, for business
reasons.)
Here are some books that have been recommended to me by friends I asked about this. Unfortunately my budget
is very limited and I can't afford to buy them all right now. Perhaps those of you
who are familiar with them could comment on the best ones to get:
Here are some books I already have:
- ARM System-on-Chip Architecture by Steve Furber
This book has an easy-to-read introduction to ARM and Thumb assembly code and
discusses in a general way how RISC processors work, but unfortunately is
somewhat light on the kind of specifics that would help me at this point.
- Optimizing PowerPC Code by Gary Kacmarcik
This really helped me when I was working at Apple, but I don't think that most
of what it has to say (like improving cache usage or keeping multiple execution
units occupied) would apply to my situation.
- Pentium Processor Optimization Tools by Michael L. Schmit
This comes with a free version of Schmit's program PENTOPT, an "assembly code
optimizer" for Pentium code. What it actually does is save a listing of your
assembly program with comments added to indicate how many cycles each
instruction will take; if an instruction takes a long time, it includes an
explanation of why, such as pipeline stalls and so on.
I would be stoked to find out if there is a program that does the same for ARM
and Thumb assembly.
If you'd like to learn more about ARM assembly code, check out Peter Cockerell's
ARM Assembly Language Programming and
Richard Murray's
ARM Assembler. The ARM has a really pleasant assembly language, but
unfortunately this is the first time I've written any, so I don't know any of
the tricks yet.
I'm sorry to say that I am not at liberty to say what my application is until my client announces it.
I think the article would be more useful to other people, and not just me, if people posted whatever clever assembly tricks they know. It seems likely that some optimizations that appear unrelated to my problem would stimulate me to think of something that would help, or that I could find a way to apply them to my problem.
The algorithm in question, though, is entirely integer-oriented. The chip I'm targeting has 64 kilobytes of flash ROM. I'm not sure if I'm allowed to say how much RAM is available, but it is a really tiny amount, as you might guess from my complaint that I'm over budget by a few hundred bytes.
I should point out that I am using a profiler to test my code.
There are two ways I do this. One is to load my code into the target system and execute my algorithm in a loop a few thousand times, indicating the start and end of the test by lighting up certain LEDs on the development board I have for testing.
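For what it's worth, the timing loop itself is nothing fancy. Here is a sketch of what I do, with a made-up GPIO address standing in for whatever register actually drives the board's LEDs; both the address and the routine name are hypothetical:

	        .equ    LED_REG, 0x80000000      @ hypothetical: whatever register drives the LEDs
	        .text
	        .code 32
	time_algorithm:
	        stmfd   sp!, {r4, r5, lr}
	        ldr     r4, =LED_REG
	        mov     r5, #1
	        str     r5, [r4]                 @ LED on marks the start of the timed region
	        mov     r5, #4096                @ run the algorithm a few thousand times
	run_loop:
	        bl      algorithm_under_test     @ hypothetical name for the routine being measured
	        subs    r5, r5, #1
	        bne     run_loop
	        mov     r5, #0
	        str     r5, [r4]                 @ LED off marks the end of the timed region
	        ldmfd   sp!, {r4, r5, pc}
	        .ltorg                           @ literal pool for the LDR= above

As long as the routine under test follows the usual ARM calling convention, r4 and r5 survive the calls, so the loop counter and the LED register pointer don't need to be reloaded.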
I can get more precise timing of the algorithm or its parts by using the ARMulator, a microprocessor emulator that comes with ARM's development system. This allows me to execute ARM binaries under Windows, where I can get at them with a source code debugger. (There are open source ARM emulators available as well, but I haven't tried any of them yet.)
What I can do is set a breakpoint just before stepping into a subroutine, reset the cycle counter, then step over the subroutine. The counter can then tell me what it cost to execute that subroutine and every subroutine it calls, and so on.
There is an option for real-time simulation if I write a memory map configuration file that gives the address ranges for the different kinds of memory involved, their data bus width and access times.
What I don't seem to have, though, is a tool that will tell me how many clock cycles each instruction will take. The profiler can tell me which general areas are most important to address, and it can tell me whether I've gained or lost, but it doesn't help much at the level of individual instructions.