Mystery bugs
I was working on part of my virtual memory management functionality when I decided to test what I had so far. The function was supposed to find and map a region of virtual memory by looking for and writing into a range of pagetable entries. It seemed to be working and generating the correct addresses. But when I tried to actually use the newly mapped page, it caused a “data abort” exception.
I hardcoded the address and attempted to isolate the problem. The strangest part seemed to be that the “data abort” exception would only occur if and only if I looked at the pagetable entry with some kind of read operation beforehand. If I blindly wrote the new entry into place, it was fine (which is good, because that is how I have been doing things up until now). But if I checked to see if I was using an available entry, it would cause the later code to crash. It would even crash if I examined only the neighboring pagetable entries. This kind of behavior “smells” like some kind of stale cache problem. But it didn’t make any sense. After all, the cache should be largely transparent to the processor, besides affecting performance. I even tried flushing the entire cache, but to no avail.
But then while reading through the Cortex-A8 manual, in the chapter about the TTB register, I came across this little tidbit:
“Note: The processor cannot perform a translation table walk from L1 cache. Therefore, if C is set to 1, to ensure coherency, you must store translation tables in inner write-through memory. If you store the translation tables in an inner write-back memory region, you must clean the appropriate cache entries after modification so that the mechanism for the hardware translation table walks sees them.”
In other words, the MMU was skipping over the cache and going directly to memory. But by reading the pagetable entry, I was bringing the line into cache. And since it was configured using the efficient “write-back” option, the subsequent pagetable updates were not being written directly into main memory. I needed to clean and invalidate the cache line. But I thought I had tried that. Upon further investigate, it turns out, the Cortex-A8 does not seem to have the same “flush entire cache” operations that older manuals talk about. But if I insert a “clean and invalidate cache” instruction that applies to the pagetable entry’s address, then everything starts working again.
A lot of systems code is about caches and management of caches, of all kinds. They make efficient operation possible. But coming from an x86 background, it never occurred to me that parts of the CPU might not be able to use the L1 cache (especially since the L1 data cache is physically indexed in the Cortex-A8). At least, not until I saw this strange behavior occur.
Categories: hacking