
Bug of the month: Cache flow problem crashes Samsung phone apps

Exploding batteries to the left, exploding code to the right

It's not been a good summer for Samsung. It packed its Galaxy Note 7 smartphones with detonating batteries, sparking a global recall.

And its whizzy Exynos 8890 processor, which powers the Note 7 and the Galaxy S7 and S7 Edge, is tripping up apps with seemingly bizarre crashes – from null pointers to illegal instruction exceptions, all triggering randomly. It's an issue that has stumped engineers for months.

And now they've finally cracked the case.

Apps built with Mono, the software development toolkit, were crashing indiscriminately on Samsung's latest Android handsets with illegal instruction errors despite the code being good. The GameCube and Wii emulator Dolphin, and the PSP emulator PPSSPP, were also falling over on the phones.

The faults involved code created just in time (JIT) on demand: Mono uses a JIT compiler to transform an app's portable bytecode to native ARM instructions on handheld devices. Dolphin and PPSSPP do something similar to run a game's PowerPC or MIPS executable on the underlying CPU. Any program that generates or modifies its own code at runtime was at risk of bombing out on the Exynos 8890, it seemed.

On the ARM architecture, due to the split instruction and data caches, JIT engines have to clear the processor's instruction cache to ensure that any freshly generated instructions are loaded and run.

Mono's engineers noticed that, when flushing 128-byte blocks from the I-cache, only 64 bytes were being cleared, allowing the processor core to run stale and mismatched code and crash the running application.

The Exynos 8890 system-on-chip has eight cores: four Cortex-A53s designed by ARM, and four Samsung-designed M1 cores. These are arranged in ARM's big.LITTLE style: four beefy cores – the M1s – for when a lot of processing power is briefly needed, and four lighter cores – the A53s – for normal work. Threads in apps move from core to core, big or LITTLE, depending on the amount of work that needs to be done.

The A53 has a 64-byte instruction cache line width, meaning the cache is flushed and refilled in 64-byte blocks. The M1, on the other hand, has a 128-byte instruction cache line. This is problematic.

Ooops marks the spot ... Samsung's slide to chip designers last month

Apps built using GCC – as is the case with Mono, at least – use a function that looks like the following pseudocode to flush a core's instruction cache:

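/* Pseudocode: the two helpers stand in for CPU-specific instructions. */
void __clear_cache (char *address, size_t size)
{
        /* Read once, on whichever core runs the first call, then reused. */
        static size_t cache_line_size = 0;
        if (!cache_line_size)
                cache_line_size = get_current_cpu_cache_line_size ();

        /* Step through the region one (assumed) cache line at a time. */
        for (size_t i = 0; i < size; i += cache_line_size)
                flush_cache_line (address + i);
}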

The first time __clear_cache is used by an application, it reads the CPU core's cache line width direct from the processor and stores it in cache_line_size. Then when flushing the cache, it loops through the memory it has to clear, telling the processor to dump its instruction cache, one cache line at a time.

So if the app starts on an A53, it'll expect to clear the instruction cache in 64-byte blocks, and loop in 64-byte increments. If it starts on an M1, it'll use 128-byte blocks.

Now, if an app that started on an M1 is moved to an A53, __clear_cache will still step through memory in 128-byte increments. In reality, each flush on the smaller core clears only the first 64 bytes of a block, and the loop then skips ahead to the next 128-byte block. That leaves stale code in the cache for the other 64 bytes, which will confuse and crash the program.
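
To put a number on what gets skipped, here is a small simulation in C. It is a sketch, not real cache-maintenance code: the function name count_stale_bytes and its parameters are hypothetical, and it assumes the region starts on an address aligned to the stride and that both sizes are nonzero.

```c
#include <stddef.h>

/*
 * Hypothetical model, not real cache maintenance: counts how many bytes of a
 * `size`-byte region are never invalidated when the loop advances by `stride`
 * bytes but each flush only covers `hw_line` bytes on the current core.
 */
size_t count_stale_bytes(size_t size, size_t stride, size_t hw_line)
{
    size_t stale = 0;

    for (size_t offset = 0; offset < size; offset += stride) {
        /* Bytes this iteration is responsible for (the last chunk may be short). */
        size_t chunk = size - offset < stride ? size - offset : stride;

        /* Only the first hw_line bytes of the chunk are actually flushed. */
        if (chunk > hw_line)
            stale += chunk - hw_line;
    }
    return stale;
}
```

For a 1,024-byte region, count_stale_bytes(1024, 128, 64) returns 512: half of the freshly generated code is never flushed. The opposite mismatch, count_stale_bytes(1024, 64, 128), returns 0, because flushing too often is merely redundant.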

Mono, Dolphin and PPSSPP have patched their code to try to avoid the problem.

Software built by the LLVM compiler and Google's V8 JavaScript engine does not suffer as badly as GCC's generated code because it requests the CPU's I-cache width immediately before each increment. Mono does something similar now: it tries to work out the smallest instruction cache width in the device and just uses that.
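
The "smallest width" idea can be sketched roughly as follows. On 64-bit ARM, the line size is reported by the CTR_EL0 register, whose low four bits (the IminLine field) hold log2 of the smallest instruction cache line, counted in 4-byte words. The sketch takes the register value as a parameter so that it compiles on any machine. The function names and the running-minimum variable are hypothetical, not Mono's actual code, and the static variable is not thread-safe as written.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode IminLine (bits [3:0] of CTR_EL0): log2 of the line size in 4-byte words. */
size_t icache_line_bytes(uint64_t ctr_el0)
{
    return (size_t)4 << (ctr_el0 & 0xF);
}

/* Smallest line size observed so far across calls; 0 means "nothing seen yet". */
static size_t min_line_seen = 0;

/*
 * Hypothetical helper: fold the current core's CTR_EL0 value into the running
 * minimum and return the stride to use for the flush loop.
 */
size_t flush_stride(uint64_t ctr_el0)
{
    size_t line = icache_line_bytes(ctr_el0);

    if (min_line_seen == 0 || line < min_line_seen)
        min_line_seen = line;
    return min_line_seen;
}
```

Once a call has run on a 64-byte core, flush_stride keeps returning 64 even when later calls land on a 128-byte core. Until that happens, the stride can still be too large, so this is an approximation of "smallest width in the device" rather than a guarantee.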

Unfortunately, it's tricky to completely solve from userspace because fetching the I-cache width from the CPU and flushing the next line is not atomic: a thread could be rescheduled onto a core with a different cache line width during the loop in a way that would cause memory to be skipped. Mono's approach is at least resilient to this.

One proper way out is to patch the operating system kernel so that reading the CPU's I-cache line width always returns the smallest size for the hardware, thus leaving no byte behind when swabbing out dead instructions.
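
As a rough illustration of that kernel-side idea, the sketch below takes the CTR_EL0 value read on each core and builds a single value to report everywhere, with the IminLine field (bits [3:0]) replaced by the smallest one found. The function name sanitised_ctr is hypothetical, the other register fields are simply copied from the first core as a simplification, and at least one core is assumed. It is not taken from any real kernel.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: pick the smallest IminLine across all cores and
 * splice it into the value that every core would report.
 * Assumes cores >= 1.
 */
uint64_t sanitised_ctr(const uint64_t *per_core_ctr, size_t cores)
{
    uint64_t min_iminline = 0xF;    /* largest value the 4-bit field can hold */

    for (size_t i = 0; i < cores; i++) {
        uint64_t iminline = per_core_ctr[i] & 0xF;
        if (iminline < min_iminline)
            min_iminline = iminline;
    }
    /* Keep the first core's other fields, replace only IminLine. */
    return (per_core_ctr[0] & ~(uint64_t)0xF) | min_iminline;
}
```

With userspace always seeing the 64-byte value, a flush loop like __clear_cache steps in 64-byte increments whichever core it started on, and no thread migration can make it skip memory.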

On the one hand, this isn't strictly Samsung's fault. The technical reference manual for the Cortex-A15, an early big.LITTLE core, notes:

The Cortex-A15 processor L1 caches contain 64-byte lines. Other processors, however, can feature caches that support cache line lengths different than those of the Cortex-A15 processor.

That implies that software engineers should be aware that the cache line width can vary across a device. However, the technical reference manual for the Cortex-A53, as used in Samsung's Exynos 8890, makes no mention of other cores having different widths. It's convention in the ARM system-on-chip world to keep the I-cache lines the same width within a package to avoid all of the above headaches: ARM certainly does in its Cortex-A designs (well, OK, it didn't with the A7 and A15, but it has thereafter).

So, one could conclude that Samsung should have known better – or at least given us a little more warning. ®
