Is your ARM Cortex-A9 hot enough?

Inspired by the Google+ post by Koen Kooi, I decided to check whether NEON is also hot on Cortex-A9. It appears that the cpuburn tool has supported ARM since 2010, and OpenEmbedded uses an alternative cpuburn-neon implementation. With at least two implementations available, one of them may well be more efficient on Cortex-A9 than the other, so I tested both on my old OMAP4430-based pandaboard (I would not miss this board too much if it actually burned). The results of this comparison are in the current draw table further down.

I could have stopped at this point, but that would not be fun :) So I experimented a bit with Cortex-A9 power consumption myself. It turns out that Cortex-A9 can actually run a bit hotter. On the NEON side, VLDx instructions seem to be more power hungry than anything else by a large margin, and aligned 128-bit reads are the best at generating heat. Using the VLD2 variant with post-increment makes it do a bit more work than the plain VLD1. On the ARM side, conditional branches and SMLAL instructions are also rather hot. Mixing everything together, we get one more implementation of cpuburn for Cortex-A9:

    .syntax unified
    .text
    .arch armv7-a
    .fpu neon
    .arm

    .global main
    .global sysconf
    .global fork

/* optimal value for LOOP_UNROLL_FACTOR seems to be BTB size dependent */
#define LOOP_UNROLL_FACTOR   110
/* 64 seems to be a good choice */
#define STEP                 64

.func main
main:

#ifdef __linux__
        mov         r0, #84 /* _SC_NPROCESSORS_ONLN */
        blx         sysconf
        mov         r4, r0
        cmp         r4, #2
        blt         1f
        blx         fork /* have at least 2 cores */
        cmp         r4, #4
        blt         1f
        blx         fork /* have at least 4 cores */
1:
#endif

        ldr         lr, =(STEP * 4 + 15)
        subs        lr, sp, lr
        bic         lr, lr, #15
        mov         ip, #STEP
        mov         r0, #0
        mov         r1, #0
        mov         r2, #0
        mov         r3, #0
        ldr         r4, =0xFFFFFFFF
        b           0f
    .ltorg
0:
    .rept LOOP_UNROLL_FACTOR
        vld2.8      {q0}, [lr, :128], ip
        it          ne
        smlalne     r0, r1, lr, r4
        bne         1f
1:
        vld2.8      {q1}, [lr, :128], ip
        it          ne
        smlalne     r2, r3, lr, r4
        bne         1f
1:
        vld2.8      {q2}, [lr, :128], ip
        vld2.8      {q3}, [lr, :128], ip
        it          ne
        subsne      lr, lr, #(STEP * 4)
    .endr
        bne         0b
.endfunc
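
For readers less used to ARM assembly, the setup code and the pointer handling in the listing above can be modelled in C. This is only a rough sketch and not part of the original program: the function names `process_count`, `buffer_base` and `one_pass` are made up for illustration, and the model tracks only the process count and the address arithmetic, not the NEON loads or the multiplies themselves.

```c
#include <assert.h>
#include <stdint.h>

#define STEP 64 /* same value as in the listing */

/* Processes running after the two conditional fork() calls. */
int process_count(long ncpu)
{
    int n = 1;
    if (ncpu >= 2)
        n *= 2; /* first fork */
    if (ncpu >= 4)
        n *= 2; /* second fork, executed by both parent and child */
    return n;
}

/* Models: ldr lr, =(STEP * 4 + 15); subs lr, sp, lr; bic lr, lr, #15 */
uintptr_t buffer_base(uintptr_t sp)
{
    uintptr_t base = (sp - (STEP * 4 + 15)) & ~(uintptr_t)15;
    assert(base % 16 == 0);        /* satisfies the :128 alignment hint */
    assert(base + STEP * 4 <= sp); /* the whole window lies below sp */
    return base;
}

/* Address held in lr after one group of four loads plus the subsne. */
uintptr_t one_pass(uintptr_t p)
{
    for (int i = 0; i < 4; i++)
        p += STEP;   /* post-increment by ip after each vld2.8 */
    p -= STEP * 4;   /* subsne lr, lr, #(STEP * 4) */
    return p;        /* back at the start; non-zero for a real stack
                        address, so the NE condition stays true */
}
```

On the dual-core pandaboard `process_count(2)` gives 2, one process per core. `one_pass(p)` returns `p` unchanged, which is why the loop keeps reading the same 256 bytes just below the stack pointer and never terminates.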

More improvements may still be possible if I have overlooked some better instructions, tricks with L2->L1 prefetches or anything else. I have not tried running any tests on Cortex-A8 yet, but Cortex-A8 needs different tuning and I would not be surprised if the older cpuburn implementations actually do a better job there. Finally, the obligatory warning: this program tries to stress the processor, attempting to generate as much heat as possible. Improperly cooled or otherwise flawed hardware may potentially overheat and fail. Use at your own risk!

As for the table below, each implementation was tested with both Cortex-A9 cores fully loaded (starting two instances of cpuburn if needed). Current draw values were measured after running the test uninterrupted for 10-15 minutes. Honestly, the total sustained current draw of ~1640 mA by the pandaboard looks quite scary to me. I would not dare to additionally stress the GPU and/or the hardware video decoder at the same time.

cpuburn implementation, running on both A9 cores	current draw from 5V PSU (whole board, not just CPU)
idle system (this kernel has no power management)	~550 mA
cpuburn-neon	~1130 mA
cpuburn-1.4a (burnCortexA9.s)	~1180 mA
ssvb-cpuburn-a9.S	~1640 mA
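
To put the current draw numbers into watts, here is a small C sketch. It assumes the PSU delivers exactly its nominal 5 V (the actual rail voltage was not measured), and the helper names `board_watts` and `extra_watts` are made up for illustration. The results are for the whole board, not just the CPU.

```c
#include <assert.h>

#define SUPPLY_VOLTS 5.0 /* assumption: nominal PSU voltage */

/* Whole-board power in watts for a current draw given in milliamps. */
double board_watts(double milliamps)
{
    assert(milliamps >= 0.0);
    return SUPPLY_VOLTS * milliamps / 1000.0;
}

/* Extra power over the idle baseline, in watts. */
double extra_watts(double load_ma, double idle_ma)
{
    assert(load_ma >= idle_ma);
    return board_watts(load_ma - idle_ma);
}
```

With the table values, `board_watts(1640.0)` is about 8.2 W in total, and `extra_watts(1640.0, 550.0)` is about 5.45 W above idle, compared to about 2.9 W above idle for cpuburn-neon.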

And also a cpuburn tweak for ARM Cortex-A8 (added on 2011-04-11)

A quick test on Cortex-A8 shows that using SMLAL is a bad idea there, but extra NEON arithmetic instructions can be added because Cortex-A8 supports dual issue for NEON.

This time I experimented with a DM3730-based IGEPv2 board (ARM Cortex-A8 @ 1 GHz), using dm3730-temp-sensor for temperature measurements:

cpuburn implementation	temperature
idle system (this kernel has no power management)	~57.75 C
cpuburn-neon	~92.75 C
cpuburn-1.4a (burnCortexA8.s)	~96.00 C
ssvb-cpuburn-a8.S	~104.25 C

~~If the sensor is not lying, then maybe using a plastic case for this board was not a good choice after all.~~ The sensor is most likely lying, as explained by Nishanth Menon in the Google+ comments.

Final words (added on 2011-04-11)

Before anybody jumps to wild conclusions, I would like to note that:

Pandaboard is not a mobile device and is not designed for really low power consumption. It is known to require a PSU rated at 4 A. I have no idea where most of the heat is dissipated, but quite likely more than just the OMAP chip is involved.
Cpuburn is very different from any typical workload and can't be used for estimating power consumption. It's just a hardware reliability testing tool.

Posted on 10 Apr 2012