Inspired by the Google+ post by Koen Kooi, I decided to check whether NEON also runs hot on Cortex-A9. It turns out that the cpuburn tool has supported ARM since 2010, and OpenEmbedded uses an alternative cpuburn-neon implementation. With at least two implementations available, one of them is likely to be more efficient on Cortex-A9 than the other, so I tested both on my old OMAP4430-based PandaBoard (I would not miss this board too much if it actually burned). The results of this comparison are in the first table below.
I could have stopped at this point, but that would be no fun :) So I experimented a bit with Cortex-A9 power consumption myself. It turns out that Cortex-A9 can actually run a bit hotter. On the NEON side, VLDx instructions seem to be more power hungry than anything else by a large margin, and aligned 128-bit reads are the best at generating heat. Using the VLD2 variant with post-increment makes the core do a bit more work than a plain VLD1. On the ARM side, conditional branches and SMLAL instructions are also rather hot. Mixing everything together, we get one more implementation of cpuburn for Cortex-A9:
```asm
    .syntax unified
    .text
    .arch armv7-a
    .fpu neon
    .arm

    .global main
    .global sysconf
    .global fork

/* optimal value for LOOP_UNROLL_FACTOR seems to be BTB size dependent */
#define LOOP_UNROLL_FACTOR 110
/* 64 seems to be a good choice */
#define STEP 64

    .func main
main:
#ifdef __linux__
    /* start one process per core (up to 4) */
    mov     r0, #84 /* _SC_NPROCESSORS_ONLN */
    blx     sysconf
    mov     r4, r0
    cmp     r4, #2
    blt     1f
    blx     fork /* have at least 2 cores */
    cmp     r4, #4
    blt     1f
    blx     fork /* have at least 4 cores */
1:
#endif
    /* lr = 16-byte aligned scratch area of STEP * 4 bytes below sp */
    ldr     lr, =(STEP * 4 + 15)
    subs    lr, sp, lr /* nonzero result, so "ne" holds from here on */
    bic     lr, lr, #15
    mov     ip, #STEP
    mov     r0, #0
    mov     r1, #0
    mov     r2, #0
    mov     r3, #0
    ldr     r4, =0xFFFFFFFF
    b       0f
    .ltorg
0:
    .rept LOOP_UNROLL_FACTOR
    vld2.8  {q0}, [lr, :128], ip
    it      ne
    smlalne r0, r1, lr, r4
    bne     1f /* always taken, falls through to the next label */
1:
    vld2.8  {q1}, [lr, :128], ip
    it      ne
    smlalne r2, r3, lr, r4
    bne     1f
1:
    vld2.8  {q2}, [lr, :128], ip
    vld2.8  {q3}, [lr, :128], ip
    it      ne
    subsne  lr, lr, #(STEP * 4) /* rewind lr to the start of the area */
    .endr
    bne     0b
    .endfunc
```
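For readers less used to ARM assembly, the startup code before the loop can be read roughly as the following C. This is only an illustrative sketch: the helper names `fork_calls_for`, `spawn_workers` and `scratch_base` are made up here and are not part of the original source, and `scratch_base` takes the stack pointer as a plain integer because C has no portable way to read it.

```c
#include <stdint.h>
#include <unistd.h>

#define STEP 64 /* same value as in the assembly source */

/* How many times the prologue calls fork() for a given number of online cores. */
int fork_calls_for(long ncores)
{
    int calls = 0;
    if (ncores >= 2)
        calls++; /* "have at least 2 cores" */
    if (ncores >= 4)
        calls++; /* "have at least 4 cores" */
    return calls;
}

/* Hypothetical C equivalent of the #ifdef __linux__ block.
 * Every fork() doubles the number of processes, so two calls give four. */
void spawn_workers(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    int calls = fork_calls_for(ncores);
    for (int i = 0; i < calls; i++)
        (void)fork(); /* return value ignored, as in the assembly */
}

/* Address arithmetic done by the ldr/subs/bic sequence:
 * step down from sp by STEP * 4 + 15 bytes, then round down to 16 bytes. */
uintptr_t scratch_base(uintptr_t sp)
{
    return (sp - (STEP * 4 + 15)) & ~(uintptr_t)15;
}
```

On the dual-core OMAP4430 this means one `fork()` call and two processes, while a quad-core would end up with four. The scratch area is 16-byte aligned and lies entirely below `sp`, which is what the `:128` alignment qualifier on the VLD2 loads needs.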
Maybe more improvements are still possible if I have overlooked some better instructions, tricks with L2->L1 prefetches, or anything else. I have not tried running any tests on Cortex-A8 yet, but Cortex-A8 needs different tuning, and I would not be surprised if the older cpuburn implementations actually do a better job there. Finally, the obligatory warning: this program tries to stress the processor, attempting to generate as much heat as possible. Improperly cooled or otherwise flawed hardware may potentially overheat and fail. Use at your own risk!
As for the table below, each implementation has been tested with both Cortex-A9 cores fully loaded (starting two instances of cpuburn if needed). Current draw values were measured after running the test uninterrupted for 10-15 minutes. Honestly, the total sustained current draw of ~1640 mA by the PandaBoard looks quite scary to me. At least I would not dare to additionally stress the GPU and/or the hardware video decoder at the same time.
cpuburn implementation, running on both A9 cores | current draw from 5V PSU (whole board, not just CPU) |
---|---|
idle system (this kernel has no power management) | ~550 mA |
cpuburn-neon | ~1130 mA |
cpuburn-1.4a (burnCortexA9.s) | ~1180 mA |
ssvb-cpuburn-a9.S | ~1640 mA |
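
To put the current figures into more familiar units, they can be converted to power. This is a minimal sketch, assuming the supply sits at exactly its nominal 5 V (the actual voltage under load was not measured) and using made-up helper names; the currents are the approximate values from the table above.

```c
/* Board power in milliwatts from current (mA) and supply voltage (mV).
 * Integer arithmetic keeps the result exact for the table values. */
long board_power_mw(long current_ma, long supply_mv)
{
    return current_ma * supply_mv / 1000;
}

/* Extra power relative to the idle system, in milliwatts. */
long extra_power_mw(long load_ma, long idle_ma, long supply_mv)
{
    return board_power_mw(load_ma - idle_ma, supply_mv);
}
```

Under that assumption, `board_power_mw(1640, 5000)` gives 8200 mW for the whole board, and `extra_power_mw(1640, 550, 5000)` gives 5450 mW above idle, compared with 2900 mW above idle for cpuburn-neon.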
A quick test on Cortex-A8 shows that using SMLAL is a bad idea there, but extra NEON arithmetic instructions can be added because Cortex-A8 supports dual issue for NEON.
This time I experimented with a DM3730-based IGEPv2 board (ARM Cortex-A8 @ 1 GHz), using dm3730-temp-sensor for temperature measurements:
cpuburn implementation | temperature |
---|---|
idle system (this kernel has no power management) | ~57.75 C |
cpuburn-neon | ~92.75 C |
cpuburn-1.4a (burnCortexA8.s) | ~96.00 C |
ssvb-cpuburn-a8.S | ~104.25 C |
If the sensor is not lying, then maybe using a plastic case for this board was not a good choice after all. However, the sensor is most likely lying, as explained by Nishanth Menon in the Google+ comments.
Before anybody jumps to wild conclusions, I would like to note that: