libjpeg-turbo is currently the fastest open source jpeg encoder/decoder to the best of my knowledge. Achieving good performance in libjpeg-turbo would be impossible without using SIMD instructions available in modern processors. The optimizations for MMX/SSE2 capable x86 processors existed in libjpeg-turbo for a while, and now support for ARM NEON is also coming in the next libjpeg-turbo 1.2 release.

One of the important parts of libjpeg-turbo, which benefits from SIMD optimizations is DCT/iDCT. For the obvious practical reasons (easier testing and maintenance and full compatibility with the older versions), it makes a lot of sense to ensure that SIMD optimized code produces exactly the same results as C code. That is, unless there are some really good reasons not to do so (for example, if the algorithm is a bad match for the instruction set of some particular processor).

And there are naturally some potential pitfalls on the bit-exactness road. In order to use SIMD efficiently, it is important to use the smallest possible data type in calculations. The C code is happy to use 32-bit variables and "32-bit * 32-bit -> 32-bit" multiplications. But for the SIMD code, using 16-bit data means that we can pack more information into a single register and process more of it in parallel, saving CPU cycles. Still using 16-bit calculations, we need to be sure that there are no unwanted overflows. And doing things somewhat different from C always has a risk of getting somewhat different results in the end.

DCT takes 8x8 blocks of samples with the values in [-128, 127] range and produces blocks of 8x8 DCT coefficients in [-1024, 1023] range. IDCT can convert the DCT coefficients back to the original 8-bit samples. Mathematically, the original samples can be perfectly reconstructed. But practically, there may be rounding errors and some extra loss of precision due to quantization. And there is one very important thing to note. Any arbitrary 8x8 block of [-128, 127] samples passed through DCT produces a 8x8 block of coefficients in [-1024, 1023] range. But any arbitrary 8x8 block of [-1024, 1023] coefficients does not necessarily produce 8x8 block of [-128, 127] samples when passed through IDCT. Some of the samples may be well outside [-128, 127] range. Searching on the Internet reveals some information, which says that the range of IDCT output may be as large as [-1805, 1805]. Obviously, there is no way for such arbitrarily selected DCT coefficients to have been generated by the forward DCT with the normal [-128, 127] input in the first place. However, it is possible to hand craft JPEG bitstreams and embed any arbitrary DCT coefficients there, so the decoder has to handle them somehow.

When developing SIMD optimized IDCT implementation, apparently there are two separate cases to consider:

- decoding the files generated by a normal jpeg encoder (DCT coefficients are generated by a normal forward DCT from [-128, 127] samples)
- decoding some bogus out-of-range data (DCT coefficients are generated in some arbitrary way)

For the former, the decoding result is better to be well defined and bit-exact when compared to C impementation. The latter is a bit of gray area. On one hand, still producing the same results as C would be nice. On the other hand, if producing the same results as C regresses performance, then it is clearly not so desirable. Also we may need to look carefully in the spec, just to see how the out-of-range DCT coefficients data fits into it and whether it is allowed. What if some cleverly optimized jpeg encoder tries to use them for some purpose?

But now it's time for some experiments. Generating hand crafted DCT coefficients is actually quite easy by modifying libjpeg code and using cjpeg tool. It is a simple matter of just hacking convsamp function and injecting the samples data there.

The first victim of these experiments is actually not SIMD, but C implementation. The comment from jdmaster.c explains:

```
MASK is 2 bits wider than legal sample data, ie 10 bits for 8-bit
samples. Under normal circumstances this is more than enough range and
a correct output will be generated; with bogus input data the mask will
cause wraparound, and we will safely generate a bogus-but-in-range output.
```

So what happens if we deliberately generate a jpeg file, which can decode to such very much out of range samples? One of the variants of 8x8 DCT coefficients for this purpose can be the following:

-1024 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

And the results of decoding this hand crafted sample are below. You may want to pay special attention to the leftmost image, because it links to the bogus jpeg file itself and gets decoded by the jpeg library used by your browser.

original file, decoded by your browser | decoded using jpeg_idct_islow | decoded using jsimd_idct_islow_sse2 |

The rightmost image (decoded by SSE2 implementation from libjpeg-turbo 1.1.1) does not have any range limitations and always performs correct clamping to bring the color into [0, 255] range. So the color of some 8x8 tiles gets saturated to white. The C implementation wraps around and shows the same tiles as black.

As mentioned earlier, SIMD relies a lot on 16-bit arithmetics. And looking at ISLOW IDCT C code, there is an obvious case of potential overflow:

```
/* Odd part per figure 8; the matrix is unitary and hence its
* transpose is its inverse. i0..i3 are y7,y5,y3,y1 respectively.
*/
tmp0 = (INT32) wsptr[7];
tmp1 = (INT32) wsptr[5];
tmp2 = (INT32) wsptr[3];
tmp3 = (INT32) wsptr[1];
z1 = tmp0 + tmp3;
z2 = tmp1 + tmp2;
z3 = tmp0 + tmp2;
z4 = tmp1 + tmp3;
z5 = MULTIPLY(z3 + z4, FIX_1_175875602); /* sqrt(2) * c3 */
```

The 16-bit values from wsptr[1], wsptr[3], wsptr[5] and wsptr[7] are all added together and passed as an argument to MULTIPLY macro, which is supposed to be able to treat its arguments as 16-bit values (so this sum must fit 16 bits). And this can easily overflow on the second pass if the DCT coefficients feeded to IDCT function contain arbitrary [-1024, 1023] input. The comment stating that

```
The outputs of the first pass are scaled up by PASS1_BITS bits so that
they are represented to better-than-integral precision. These outputs
require BITS_IN_JSAMPLE + PASS1_BITS + 3 bits; this fits in a 16-bit word
with the recommended scaling.
```

clearly applies to the case whan handling the "normal" DCT coefficients data. Because "BITS_IN_JSAMPLE + PASS1_BITS + 3" is equal to 13, we have enough of headroom to add 4 such values together without overflowing 16 bits. But again, this is not true for the arbitrary hand crafted [-1024, 1023] coefficients data. In any case, the C implementation uses 32-bit variables and we have no luck reproducing this overflow with it :)

The equivalent SSE2 code is a little bit different:

```
; -- Odd part
movdqa xmm4, XMMWORD [XMMBLOCK(1,0,rsi,SIZEOF_JCOEF)]
movdqa xmm6, XMMWORD [XMMBLOCK(3,0,rsi,SIZEOF_JCOEF)]
pmullw xmm4, XMMWORD [XMMBLOCK(1,0,rdx,SIZEOF_ISLOW_MULT_TYPE)]
pmullw xmm6, XMMWORD [XMMBLOCK(3,0,rdx,SIZEOF_ISLOW_MULT_TYPE)]
movdqa xmm1, XMMWORD [XMMBLOCK(5,0,rsi,SIZEOF_JCOEF)]
movdqa xmm3, XMMWORD [XMMBLOCK(7,0,rsi,SIZEOF_JCOEF)]
pmullw xmm1, XMMWORD [XMMBLOCK(5,0,rdx,SIZEOF_ISLOW_MULT_TYPE)]
pmullw xmm3, XMMWORD [XMMBLOCK(7,0,rdx,SIZEOF_ISLOW_MULT_TYPE)]
movdqa xmm5,xmm6
movdqa xmm7,xmm4
paddw xmm5,xmm3 ; xmm5=z3
paddw xmm7,xmm1 ; xmm7=z4
; (Original)
; z5 = (z3 + z4) * 1.175875602;
; z3 = z3 * -1.961570560; z4 = z4 * -0.390180644;
; z3 += z5; z4 += z5;
;
; (This implementation)
; z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;
; z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);
movdqa xmm2,xmm5
movdqa xmm0,xmm5
punpcklwd xmm2,xmm7
punpckhwd xmm0,xmm7
movdqa xmm5,xmm2
movdqa xmm7,xmm0
pmaddwd xmm2,[rel PW_MF078_F117] ; xmm2=z3L
pmaddwd xmm0,[rel PW_MF078_F117] ; xmm0=z3H
pmaddwd xmm5,[rel PW_F117_F078] ; xmm5=z4L
pmaddwd xmm7,[rel PW_F117_F078] ; xmm7=z4H
```

Here only the values of z3 (wsptr[3] + wsptr[7]) and z4 (wsptr[1] + wsptr[5]) are calculated using 16-bit additions and then used as 16-bit operands for multiplication. The following DCT coefficients have been hand crafted with the intention to trigger "wsptr[3] + wsptr[7]" overflow:

0 | -724 | 0 | -299 | 0 | -724 | 0 | 300 |

0 | -1004 | 0 | -416 | 0 | -1004 | 0 | 416 |

0 | -946 | 0 | -391 | 0 | -946 | 0 | 392 |

0 | -851 | 0 | -352 | 0 | -851 | 0 | 352 |

0 | -724 | 0 | -299 | 0 | -724 | 0 | 300 |

0 | -569 | 0 | -235 | 0 | -569 | 0 | 235 |

0 | -391 | 0 | -162 | 0 | -391 | 0 | 162 |

0 | -199 | 0 | -82 | 0 | -199 | 0 | 82 |

And the decoding results of the generated sample are below:

original file, decoded by your browser | decoded using jpeg_idct_islow (correctly clamped) | decoded using jpeg_idct_islow | decoded using jsimd_idct_islow_sse2 |

Funnily enough, the three images on the left are all different ("correctly clamped" is the case when C code is tweaked to solve the range problem described in the previous section). Comparing the leftmost image with each one of them can give some idea about what kind of IDCT implementation might be used on your computer.

I think it's necessary to add a disclaimer just in case: this all only applies to decoding bogus out-of-range data. So the differences in decoding results can't be immediately considered a bug.

This whole blog post is actually the result of my mini-investigation, intended to clear the doubts that I got shortly after submitting ARM NEON optimized ISLOW iDCT patch.

Just like SSE2 IDCT, ARM NEON code also has some overflows for the out-of-range data, but should be perfectly fine for the normal jpeg files. And it still can be easily tweaked to ensure no overflows even when handling any arbitrary [-1024, 1023] DCT coefficients. But this may cost a few extra CPU cycles.

And one more final disclaimer: I'm not a hardcore multimedia expert, so may be easily wrong. Comments and corrections are surely welcome.