Compare an algorithm written in Python on a Raspberry Pi with an algorithm written in C on an STM32

Question:

English is not my first language, so sorry for my bad writing.

I need to optimize an algorithm which is written in Python and runs on a Raspberry Pi. The catch is that I need to write the optimized code as a C program running on an STM32F4.

It is an image-processing algorithm (I know, image processing in C on a microcontroller sounds fun …) and the functionality must stay the same (so the same output, within tolerance). Of course I need a method of benchmarking the two programs.

In my case "optimization" means that the program should run faster (which it automatically will, but I need to show that it is faster because of optimized code and not just because it is written in C and running on a bare-metal system).

I know that, for example, I can compare the number of lines of code, because the fewer the lines the faster the program. Are there more "factors" that are system-independent which I can compare, to explain why the optimized code is faster?

Kind regards,
Dan

PS: I thought about converting the Python code to C code with Cython. Then I could compile it and compare the assembly or machine code. But I am not sure if that is the right way, because I don't know what exactly Cython is doing.

Asked By: Dan No


Answers:

"Of course I need a method of benchmarking the two programs."

For embedded systems, this is always done by toggling a GPIO pin at the start and end of the algorithm, then measuring the time with an oscilloscope. This should be possible on both the Raspberry Pi and the STM32 target. But you'll be measuring raw execution speed and not just the algorithm: the Raspberry Pi will be messing around with context switches and the like.
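A minimal sketch of the STM32 side of that, assuming CMSIS device headers, PA5 already configured as a push-pull output, and a hypothetical run_image_processing() standing in for the algorithm under test:

#include "stm32f4xx.h"

void run_image_processing(void);     /* hypothetical code under test */

void benchmark_once(void)
{
    GPIOA->BSRR = (1U << 5);         /* PA5 high: a single store */
    run_image_processing();
    GPIOA->BSRR = (1U << (5 + 16));  /* PA5 low: measure the pulse width on the scope */
}

Writing the BSRR register directly keeps the instrumentation down to one store instruction on each side of the measurement.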


"I know that, for example, I can compare the number of lines of code, because the fewer the lines the faster the program."

No, that's nonsense. The number of lines does not necessarily have any relation to execution speed. If you think it does, then I'd say you are still far too inexperienced to do manual code optimization for a specific target.


As for specific performance improvements to look for: dropping Linux in favour of bare metal will give a huge performance boost. On the other hand, you will at the same time downsize from some Cortex-A to an M4, which runs at a much lower clock and lacks cache. But this also means that if you get it running faster on the M4, that's mission accomplished, since it is a less powerful target. (And outperforming a Linux PC should be a walk in the park for a bare-metal Cortex-M4.)

I suspect that merely converting from Python to C will improve performance quite a bit, since all manner of type-generic goo and implicit/hidden function calls performed "behind the scenes" in Python will simply get removed.
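To illustrate with a made-up example (not the OP's actual algorithm): a fixed-type per-pixel loop like the hypothetical threshold below compiles to plain integer compares and stores in C, whereas the equivalent pure-Python loop performs dynamic type dispatch and object handling on every single pixel access:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-pixel threshold: every operation here is a plain
   compare/store on fixed-width types, with no hidden function calls. */
void threshold_u8(uint8_t *img, size_t n, uint8_t level)
{
    for (size_t i = 0; i < n; i++)
        img[i] = (img[i] > level) ? 255 : 0;
}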

Other than that, the STM32F4 is advanced enough to have a form of branch prediction, and it also has an FPU. So you can still look at reducing the number of branches and floating-point operations. You can also look at the CPU clock used versus flash wait states and see if there are possible improvements. As far as I know this MCU doesn't have a data cache, meaning it can't compensate for flash wait states. So maybe consider executing code from RAM if wait states are a bottleneck. Or simply clock it up as much as possible.
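As a sketch of the run-from-RAM idea (the .ramfunc section name is an assumption, not a standard: your linker script must define such a section and your startup code must copy it from flash to SRAM, and the details vary between toolchains):

#include <stdint.h>
#include <stddef.h>

/* Sketch: place a hot function in SRAM to dodge flash wait states.
   The ".ramfunc" section name is an assumption; the linker script
   must define it and the startup code must copy it into RAM. */
__attribute__((section(".ramfunc")))
void hot_inner_loop(uint8_t *img, size_t n)
{
    for (size_t i = 0; i < n; i++)
        img[i] = (uint8_t)(img[i] >> 1);   /* stand-in for the real work */
}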

Answered By: Lundin

Fewer lines of code or fewer machine instructions do not mean faster.

void more_fun ( unsigned int );
void fun ( void )
{
    more_fun(0x12345678);
}

00000000 <fun>:
   0:   4801        ldr r0, [pc, #4]    ; (8 <fun+0x8>)
   2:   f7ff bffe   b.w 0 <more_fun>
   6:   bf00        nop
   8:   12345678    .word   0x12345678

This is a perfectly functional solution, but

.thumb
.cpu cortex-m4
.syntax unified

movw r0,0x5678
movt r0,0x1234

ldr r1,=0x12345678

Disassembly of section .text:

00000000 <.text>:
   0:   f245 6078   movw    r0, #22136  ; 0x5678
   4:   f2c1 2034   movt    r0, #4660   ; 0x1234
   8:   4900        ldr r1, [pc, #0]    ; (c <.text+0xc>)
   a:   0000        .short  0x0000
   c:   12345678    .word   0x12345678

it is generic. A movw/movt pair will get you the same result, but with two instructions instead of one. That should be twice as slow, yes? Not at all: the ldr is a load, so the processor stalls while it waits for a memory cycle, which takes some number of clocks even with zero-wait-state memory. The flash on these MCUs, even with the prefetcher and the cache thing, can still be 2 or 4 or more times slower than the processor.

On your Cortex-A with an operating system and DRAM, it could take dozens to hundreds of clocks to get this data back; it just depends. Even once the data is in L1 cache it is still not that fast.

The movw/movt pair is twice as many instructions, but they are linear: they are fed straight into the pipeline, and the pipe does not have to stall for either one of them, whereas the load's stall is not deterministic. Now, in a loop (or not), the code might land on a cache-line boundary (if you even have an icache); that cache-line fill will take a while, and the extra instruction may cause some bad luck as to where that boundary falls. If you are pushing this kind of hand tuning you have to know all of this. For that matter, as I have demonstrated many times here and elsewhere, the alignment of the code can matter, especially with high-performance cores like ARM's. Adding NOPs here might significantly improve the performance of a loop there, giving an overall win just from fetch effects. Then add in the fact that the compiler cannot know the target system, nor which way is better to implement something.
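If you want to experiment with alignment without hand-writing assembly, GCC lets you force it from C; a sketch (the value 8 is an arbitrary example, and whether any particular alignment helps must be measured on the target):

/* Sketch: forcing code alignment from C with GCC. Loops can also be
   aligned globally with the -falign-loops=N and -falign-functions=N
   compiler flags. */
__attribute__((aligned(8)))
void hot_function(void)
{
    /* time-critical code goes here */
}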

I think the other answer is good. Python to C: yes, you should see an instant improvement. But it is not deterministic, and quite often you will find that, despite the obstacles/bloat/etc., the application on top of Linux may outrun the MCU for the exact same C source code, no matter how much you optimize it at the C level. Your Cortex-A could be running slow and your Cortex-M running fast.

I thought the reason ARM put rules on the address ranges, say 0x00000000 to 0x20000000-1 and 0x20000000 to 0x40000000-1 and so on, was so they could turn on a data cache without an MMU. Maybe that is only Cortex-M7 and not M4; I forget the caching details on the Cortex-Ms, as I don't use the cache.

ST has a flash cache thing in the STM32s that you normally cannot turn off (it can help or hurt your benchmark; remember, all benchmarks are nonsense, and it is easy to make a slow system look faster than a fast one), and I think a prefetcher as well in front of the flash, so that can help. Clocking the MCU up as fast as you can to get the core running faster can make the flash wait states longer, making the result mostly cache-dependent rather than processor-dependent. Yet the SRAM should scale up with the core clock, so code run from SRAM can be from 2 to several times faster for the same machine code at the same alignment. Other companies have this cache too, but might still have a prefetcher to make linear code runs faster (making movw/movt probably faster than a random-access ldr that may stall the pipe).

  1. Convert your code to C.
  2. Get it to at least function and give the same results as the Python.
  3. See if you can get it to fit on the MCU at all.
  4. Get it to give the same results on the target as on the host.
  5. Start with alignment, branch prediction (might be on by default in that core), an icache if present, etc.
  6. Then maybe try to change the C code.
  7. As a last-ditch effort, try to adjust/tune the assembly output of the compiler.

The GPIO method adds time to the test. If you use a high-level C GPIO call to do the toggling, that can mess up your results; you need to try to do it in a single instruction before and after the code under test.

A timer in the MCU will work very well, as it usually runs at the CPU clock. The overhead is then just the clock or few it takes to execute the sampling before and after, if you do it in asm or in a single instruction. If you use time() calls or library calls for the timer or GPIO, that can/will skew or ruin your measurement and can leave you with confusing or bogus results (benchmarks are nonsense).
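For example, the Cortex-M4 has a cycle counter in the DWT unit that runs at the core clock; a minimal sketch using CMSIS register names, with run_image_processing() again a hypothetical stand-in for the code under test:

#include "stm32f4xx.h"

void run_image_processing(void);     /* hypothetical code under test */

/* Call once at startup to enable the DWT cycle counter. */
static void cyccnt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable trace block */
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start counting */
}

uint32_t measure_cycles(void)
{
    uint32_t start = DWT->CYCCNT;
    run_image_processing();
    return DWT->CYCCNT - start;      /* elapsed CPU clock cycles */
}

Reading CYCCNT is a single load, so the instrumentation overhead stays down at the clock-or-few level described above.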

Answered By: old_timer