Tagged TLBs and Context Switching
ARM seems to be a pretty fast moving architecture. I’ve had to rethink my design a number of times because it turned out my technical documents were a little old. One of the new features that I am happy to see is the introduction of “tagged TLBs” in ARMv6 and above.
Context switching is an important mechanism in a modern, virtual memory operating system. Every process gets its own memory space, and that is defined by a set of page tables. Since traversing the page tables for every single memory access would be too slow, the traversals are cached in a special area of the processor known as the Translation Lookaside Buffer (TLB). This buffer makes modern virtual memory feasible. However, like all caches, it must be invalidated when the information inside it becomes incorrect. Otherwise, you might allow one process to access the memory of another process — a safety and security violation.
An operating system switches between different processes hundreds or thousands of times each second. Each time it switches, it may need to change the working set of page tables. And every time that happens, the TLB must be invalidated. This adds up a heavy cost, as the TLB gets repopulated thousands of times per second. The amount of time it takes to flush the TLB and related caches is approximately 5 micro-seconds. This does not seem like much, but it is not the only cost. Subsequent memory operations are now adversely affected: they most likely must trigger hardware page traversal in order to translate the virtual address to a physical address. It is impossible to use a micro-benchmark to find the impact of TLB flushing on overall performance, as it depends on the specific application and its pattern of memory access. Whatever it is, the less flushing, the better.
Older ARM architectures added a feature to avoid context switching in some cases: The Fast Context-Switch Extension (FCSE). If you were willing to follow some constraints, then you could avoid TLB flushing. It basically worked by creating an array of memory spaces, and indexing them by a given “process ID”. Every virtual address handled by the processor would be “modified” by adding 32MB multiplied by the process ID. Since ARMv6 this is deprecated, as it has been replaced with another option that is more widely used in industry: tagged TLBs.
With tagged TLBs, there is extra space in every TLB entry which stores an 8-bit “Address Space ID (ASID)”. In addition, the page tables now store a bit for every last-level entry which specifies whether that entry is “global” or not. When the MMU misses on a “non-global” page table entry, the TLB stores the result of the hardware page-table walk along with the current ASID. Later, when you access that same virtual address again, the MMU looks up the entry in the TLB. If the current ASID matches the stored ASID, then it uses the cached entry. Otherwise, it “misses” and re-walks the page tables. In essence, this dodges the security and safety problems of not flushing the TLB by checking an ASID number whenever it looks up an entry in the TLB.
Now when you switch contexts, you change the current set of page tables, and the current ASID. This cuts down the immediate cost to about 500 nano-seconds, an order of magnitude. However, the effect on overall performance is more difficult to quantify. Since you don’t wipe out the TLB, it is possible that the old entries are still there when you return to the original process. On the other hand, it is also possible that other programs have wiped them out and replaced the entries with their own. Either way, this should be better than the old way of doing things; at worst, all your entries were cleared.
I also found that ARMv7 has this feature permanently turned on. According to the ARMv6 technical manual, it was possible to toggle tagged TLBs. The trade-off was that the additional bits in the page table entries would take up the space formerly used by sub-page access permissions. Then I was reading through the Cortex-A8 technical manual and noticed that the bit was permanently set. Just another example of how things change quickly.
Another cute feature of ARM is the ability to access the hardware translation mechanism from software. Using CP15, c7, c8 and CP15, c7, c4, you can write a virtual address and read the result of the physical address translation, if successful. Without this feature, you must implement the walking of tables in software, and that involves either finding a virtual address from a physical address, or more likely, mapping the physical address temporarily. This is slow and has the annoying requirement of manipulating virtual memory structures that you may prefer to leave alone.
Categories: hacking