Mystery bugs

I was working on part of my virtual memory management code when I decided to test what I had so far. The function was supposed to find and map a region of virtual memory by locating a free range of pagetable entries and writing new entries into it. It seemed to be working and generating the correct addresses. But when I tried to actually use the newly mapped page, it caused a “data abort” exception.

I hardcoded the address and attempted to isolate the problem. The strangest part was that the “data abort” exception would occur if and only if I looked at the pagetable entry with some kind of read operation beforehand. If I blindly wrote the new entry into place, it was fine (which is good, because that is how I have been doing things up until now). But if I checked whether I was using an available entry, the later code would crash. It would even crash if I examined only the neighboring pagetable entries. This kind of behavior “smells” like some kind of stale cache problem. But it didn’t make any sense. After all, the cache should be largely transparent to the processor, aside from affecting performance. I even tried flushing the entire cache, but to no avail.

But then while reading through the Cortex-A8 manual, in the chapter about the TTB register, I came across this little tidbit:

“Note: The processor cannot perform a translation table walk from L1 cache. Therefore, if C is set to 1, to ensure coherency, you must store translation tables in inner write-through memory. If you store the translation tables in an inner write-back memory region, you must clean the appropriate cache entries after modification so that the mechanism for the hardware translation table walks sees them.”

In other words, the MMU was skipping over the cache and going directly to memory. But by reading the pagetable entry, I was bringing the line into the cache. And since the region was configured with the efficient “write-back” option, the subsequent pagetable updates were not being written directly into main memory. I needed to clean and invalidate the cache line. But I thought I had tried that. Upon further investigation, it turns out the Cortex-A8 does not seem to have the same “flush entire cache” operations that older manuals talk about. But if I insert a “clean and invalidate cache” instruction that applies to the pagetable entry’s address, then everything starts working again.
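For the record, the fix boils down to something like this (a sketch, not necessarily the exact code that went into Puppy): after writing a pagetable entry, clean and invalidate the data cache line that holds it, then wait for the maintenance to finish before the MMU can walk the table.

/* Sketch: per-line maintenance after a pagetable update, so the MMU's
 * hardware table walk sees the new entry.  Uses the ARMv7 CP15 operation
 * DCCIMVAC (clean and invalidate data cache line by MVA to PoC). */
static inline void pte_sync(void *pte_addr)
{
    /* Clean and invalidate the line containing pte_addr. */
    asm volatile("mcr p15, 0, %0, c7, c14, 1" : : "r" (pte_addr) : "memory");
    /* Make sure the maintenance completes before continuing. */
    asm volatile("dsb" : : : "memory");
}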

A lot of systems code is about caches of all kinds and their management. They make efficient operation possible. But coming from an x86 background, it never occurred to me that parts of the CPU might not be able to use the L1 cache (especially since the L1 data cache is physically indexed in the Cortex-A8). At least, not until I saw this strange behavior for myself.

Discussion (0) | November 18th, 2011 Categories: hacking

The Dark Art of Linker Scripts

There are many ways in which writing operating system code is like writing ordinary application code: you still need libraries and runtime support, and you use many common algorithms. However, there are also many ways in which it is different: for example, you are responsible for providing those libraries and that runtime support. Most importantly, a kernel does not have the luxury of taking a simplistic view of memory.

A typical application assumes a flat memory layout with standardized addresses for the program entry point, stack and heap. That illusion is provided by the operating system through virtual memory, which is implemented in hardware by the MMU. The system compiler produces object files that contain symbol entries, and the system linker combines them into a single image using a linker script. The linker writes the final address of every symbol in the program (for dynamic linking, this step is deferred until runtime). You can get a list of symbols and addresses for any unstripped image with the "nm" tool. You'll notice that for ordinary applications the addresses are very similar from program to program. That is because there is a default system-wide linker script for regular programs. You can see it by typing "ld --verbose", but it is likely to be very long and very confusing. Linker scripts are what would now be termed a "domain-specific language" with the sole purpose of describing the layout of sections and symbols in an output file, combining many inputs.

There are a myriad of options and features, mostly tacked on here and there to support different platforms, formats or architectures. I will focus on our barebones ARM image in the ubiquitous Executable and Linking Format (ELF). You can read all about ELF and its use in ordinary circumstances elsewhere; I am going to focus on Puppy's needs. When examining ELF files you want to be familiar with the tools "readelf" and "objdump" (and keep in mind that we are using the "arm-none-linux-gnueabi-" toolchain). Let's run "arm-none-linux-gnueabi-readelf -l" on "puppy.elf":


Elf file type is EXEC (Executable file)
Entry point 0x80008000
There are 5 program headers, starting at offset 52

Program Headers:
  Type      Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  EXIDX     0x00c2c0 0xc000c2c0 0x8000c2c0 0x000f8 0x000f8 R   0x4
  LOAD      0x008000 0x80008000 0x80008000 0x0017c 0x0017c R E 0x8000
  LOAD      0x009000 0xc0009000 0x80009000 0x040d0 0x040d0 RWE 0x8000
  LOAD      0x010000 0xc0010000 0x8000d0d0 0x00000 0x08000 RW  0x8000
  GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RWE 0x4

 Section to Segment mapping:
  Segment Sections...
   00     .ARM.exidx
   01     .startup .stubtext .stubARMidx
   02     .text .rodata .ARM.extab .ARM.exidx .rodata.str1.4 .data
   03     .bss
   04

Program headers list "segments", which describe memory layout. Section headers describe "sections", which partition the image file itself. Not all sections are loaded into memory -- some may just contain metadata for debuggers and such. In addition, sections that appear consecutively on disk may be loaded into disparate memory locations. For example, although ".data" and ".bss" are neighbors in the file, they are loaded into different program segments. In fact, ".bss" is interesting because it takes up zero space on disk but some amount of space in memory. That is because it consists of uninitialized space for global variables. You can see that represented above in the last LOAD segment, whose FileSiz is 0x00000 but whose MemSiz is 0x08000.
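Since ".bss" contributes no bytes to the image, something at startup has to supply the zeroes. In a kernel that is typically a small loop in early boot code, along these lines (a sketch; the __bss_start and __bss_end symbol names are illustrative, they would have to be defined in the linker script around the ".bss" output section):

/* Sketch: zeroing .bss by hand at startup.  The symbols bracket the
 * .bss output section and come from the linker script. */
extern char __bss_start[], __bss_end[];

static void clear_bss(void)
{
    for (char *p = __bss_start; p < __bss_end; p++)
        *p = 0;
}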

The most important aspect of the program header to note is that there is a difference between VirtAddr and PhysAddr. This is one of the distinguishing aspects of a kernel image vs an ordinary program image. A kernel has to deal with the real physical memory underlying the virtual memory abstraction. It also has to live inside the virtual memory world that it ends up creating. In other words, kernel symbols must be in two places at the same time. This trick can be pulled off by linkers because they understand that the memory addresses assigned to symbols may be completely unrelated to where the symbol ends up living inside the image. Doing this, however, requires work on your part: you must bootstrap the virtual memory system so that these symbol addresses are actually meaningful and not wild references off into unmapped areas.
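In practice this also means kernel code regularly converts between the two views with a fixed offset. Something like the following sketch, where the constants follow from the layout described below (the kernel is linked at 0xC0008000 but loaded at 0x80008000):

/* Sketch: translating between the kernel's link-time virtual addresses
 * and the physical addresses it is actually loaded at. */
#define KERNEL_VIRT_BASE 0xC0000000u
#define KERNEL_PHYS_BASE 0x80000000u

static inline unsigned long virt_to_phys(unsigned long va)
{
    return va - KERNEL_VIRT_BASE + KERNEL_PHYS_BASE;
}

static inline unsigned long phys_to_virt(unsigned long pa)
{
    return pa - KERNEL_PHYS_BASE + KERNEL_VIRT_BASE;
}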

In Puppy, the job of bootstrapping virtual memory is handled by the "stub". The stub is placed at the entry address. There is a small bit of assembler which sets up a temporary stack -- enough to run bare metal C code. The stub C code then does the necessary initialization of the memory hardware. This needs to be done in so-called "identity-mapped" space: where virtual and physical addresses are equivalent. Why? Because the moment you toggle on the MMU, the program counter is now working in virtual memory. The simplest way to keep the kernel running (and not worry about pipeline issues) is to make sure that the next instruction is the same regardless of whether you are addressing virtually or physically. Once the stub returns, the assembler sets up the permanent system stacks (in high virtual memory) and then jumps to the real kernel entry point, in high virtual memory.
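To make that concrete, the final step of the stub looks roughly like this (a sketch, not Puppy's actual code; it assumes the L1 translation table has already been filled in, and the function and parameter names are mine):

/* Sketch: point the MMU at the L1 table, give domain 0 "client" access,
 * and flip the enable bit.  Because this runs identity-mapped, the next
 * instruction fetch works out to the same address either way. */
static void enable_mmu(unsigned long l1table_phys)
{
    unsigned long sctlr;

    asm volatile("mcr p15, 0, %0, c2, c0, 0" : : "r" (l1table_phys)); /* TTBR0 */
    asm volatile("mcr p15, 0, %0, c3, c0, 0" : : "r" (0x1u));         /* DACR: domain 0 = client */
    asm volatile("mcr p15, 0, %0, c8, c7, 0" : : "r" (0u));           /* invalidate TLBs */

    asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r" (sctlr));         /* read SCTLR */
    sctlr |= 0x1u;                                                    /* M bit: MMU enable */
    asm volatile("mcr p15, 0, %0, c1, c0, 0" : : "r" (sctlr));
    asm volatile("isb");                                              /* flush the pipeline */
}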

Therefore, we lay out the sections for the stub according to this portion of the linker script:

ENTRY(_reset)
SECTIONS
{
. = 0x80008000;
_physical_start = .;
_stub_start = .;
/* The stub runs in identity-mapped space */
.startup : { init/startup.o (.text) }
.stubtext : { init/stub.o (.text) }
.stubARMtab : { init/stub.o (.ARM.extab) }
.stubARMidx : { init/stub.o (.ARM.exidx) }
_stub_end = ALIGN(0x1000);
_stub_len = _stub_end - _stub_start;

I have selected the starting address, in physical memory, to be the same as Linux uses. Keep in mind that on the Beagle Board (OMAP353x), there are only two banks of SDRAM, and the first one has physical addresses beginning at 0x80000000. This is a bit weird if you are coming from x86, where physical memory starts at 0, but it makes sense because many other addresses are used for memory-mapped device registers. The linker script maintains a running counter of "virtual memory" addresses, and you can set it to a value with a line such as ". = 0x80008000;". The primary job of the linker script is to create output sections and symbols with addresses derived from that counter. The output sections can mix and match input sections from the various object files. Here I have specified that the input section ".text" of "init/startup.o" should go into the output section named ".startup". Note that ".text" is a standardized section name used for executable code. ".stubtext" contains the C code. The stack-unwinding information in ".stubARMtab"/".stubARMidx" is not really necessary for me, because I am not using C++ exceptions, but due to the vagaries of ARM linking I must put it somewhere. I define a few convenient symbols for finding the starting and ending addresses of the stub, as well as its length.
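These linker-defined symbols are also visible from C, which is how the rest of the kernel gets at them. The idiom is a little odd because the symbols are addresses rather than variables (a sketch):

/* Sketch: linker-script symbols appear to C as addresses, so the usual
 * trick is to declare them as arrays.  _stub_len holds a plain number,
 * so its "address" is the value we want. */
extern char _stub_start[], _stub_end[], _stub_len[];

static unsigned long stub_length(void)
{
    unsigned long from_symbols = (unsigned long)(_stub_end - _stub_start);
    unsigned long from_len     = (unsigned long) _stub_len;
    return (from_symbols == from_len) ? from_len : 0;  /* sanity check */
}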

The rest of the kernel is placed into high virtual memory. The reason for this is to leave the low virtual memory addresses for the use of ordinary programs. This is a pretty standard technique for OS kernels that use paged virtual memory. The main problem is that PhysAddr and VirtAddr can no longer be equivalent. That means we have to override the default. This can be accomplished with the "AT" option. It is also important that the physical addresses be kept as close together as possible. This is because the final image is not going to be ELF -- the bootloader doesn't understand it -- but rather a raw binary image. Any discontinuities between sections will be filled in with zeroes, resulting in a much larger image file than you probably intended.

. = 0xC0008000;
/* From now on, virtual and physical addresses are separate. */
. += _stub_len; /* skip stub code */

We need to skip over the stub in virtual memory. The reason is simply that we are mapping a 1MB region that begins at 0xC0000000 to the physical address 0x80000000. Therefore, the stub code is going to take up the page at 0xC0008000. It can be reclaimed later.

_kernel_start = .;
/* The virtual address is taken from the counter, but the physical
* address must be specified with AT. */
.text : AT ( _stub_end ) { *(.text) }
.rodata : AT ( LOADADDR (.text) + SIZEOF (.text) ) { *(.rodata) }
. = ALIGN(0x1000);
_kernel_readonly_end = .;
.data : AT ( LOADADDR (.text) + _kernel_readonly_end - _kernel_start ) { *(.data) }
.bss : AT ( LOADADDR (.data) + SIZEOF (.data) ) { *(.bss) }
. = ALIGN(0x1000);
. = . + 0x1000; /* 4kB of stack memory */
svc_stack_top = .;

Since _stub_end is a physical address, "AT (_stub_end)" should be fairly self-explanatory. The subsequent sections are somewhat more difficult. Since we do not have the advantage of a physical-memory counter, we must calculate the appropriate locations ourselves. Fortunately, we are given the LOADADDR and SIZEOF functions, which return a section's PhysAddr and its length respectively. Using these tools we lay out each output section consecutively in physical memory, combining the input sections from all object files (the ones not already used). Then we bump the counter by a page and define a symbol for the supervisor stack top, which is used in the startup assembler.

_kernel_pages = (_kernel_end - _kernel_start) / 0x1000;
_kernel_readonly_pages = (_kernel_readonly_end - _kernel_start) / 0x1000;
_kernel_readwrite_pages = (_kernel_end - _kernel_readonly_end) / 0x1000;
/* compute the physical address of the l1table */
_l1table_phys = l1table - ((_kernel_start - _stub_len) - _stub_start);
/* To make arm-none-linux-gnueabi-gcc shut up: */
__aeabi_unwind_cpp_pr0 = .;
__aeabi_unwind_cpp_pr1 = .;
}

Linker scripts let us define arbitrary symbols, so I add some useful metadata about the kernel itself. In addition, there is another exception-related workaround: the two symbols at the end aren't used by my code, but I can't convince GNUEABI GCC to stop emitting references to them. This makes the linker stop complaining about undefined symbols.

Let's check the output of "arm-none-linux-gnueabi-nm puppy.elf":

80008000 T _reset
...
c0009018 T c_entry

Looks good, for now. But this won't be the end of tweaking the linker script. I've already spotted some small details that I would like to fix. And you can do other interesting things with linker scripts, like creating sections for constructors, per-CPU variables, or other special purposes. And while there is formal documentation for LD, it isn't really much help -- not to mention that the implementation is not always flawless. I've had to be careful with the linker version I use, as some language features change as bugs are fixed or new ones are introduced. Since linker script is treated more as a configuration file format than as a programming language, it doesn't get the same level of careful management that might be given to C compiler front-ends. And it doesn't help that the primary writers of linker scripts are also the developers of LD itself, who are intimately familiar with the inner workings of their interpreter.

Discussion (0) | November 15th, 2011 Categories: hacking

Virtual Memory, Interrupts and Timers

ARM has a fairly straightforward MMU design, but with a couple of twists that are unfamiliar to me as a longtime x86 programmer. The page table structures are similar but differently sized -- the first level controls 1MB regions instead of 4MB. There is also a 64kB page option at the second level, but I am not going to bother with it; I will stick to 4kB pages. One small oddity is that it is possible to set access permissions down to the 1kB level -- there are 4 AP fields per page table entry.
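For the curious, a first-level "section" entry (the 1MB kind) looks roughly like this (a sketch in the short-descriptor format; the exact attribute bits you want will vary, and the helper name is mine):

/* Sketch: build a first-level section entry.  Each entry maps 1MB, so
 * the table index is simply the virtual address shifted down by 20. */
#define SECTION_DESC      0x2u          /* bits [1:0] = 0b10: section */
#define SECTION_B         (1u << 2)     /* bufferable */
#define SECTION_C         (1u << 3)     /* cacheable */
#define SECTION_AP_RW     (0x3u << 10)  /* AP[1:0] = 0b11: full access */
#define SECTION_DOMAIN(d) (((d) & 0xFu) << 5)

static void map_section(unsigned long *l1table, unsigned long va,
                        unsigned long pa, unsigned domain)
{
    l1table[va >> 20] = (pa & 0xFFF00000u)
                      | SECTION_DOMAIN(domain)
                      | SECTION_AP_RW | SECTION_C | SECTION_B
                      | SECTION_DESC;
}

/* e.g. the kernel's 1MB window: map_section(l1table, 0xC0000000, 0x80000000, 0); */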

The major twist is the notion of "Domains", which add another layer of protection integrated with the MMU. Basically, there are up to 16 domains that you can program. Each page entry is assigned to a domain. There is a central domain register which contains 2 bits per domain and controls access to every page, on top of the existing page table access permissions. So conceivably, you could share a TTB (translation table base) between multiple tasks, and then use domains to protect one from the other. Kind of like segments, in a way, but not limited to contiguous regions.
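Programming the domain register itself goes through coprocessor 15. A sketch (the helper name is mine): 0b01 means "client", so page-table permissions are enforced; 0b11 means "manager" and bypasses them entirely.

/* Sketch: the Domain Access Control Register packs 2 bits per domain. */
#define DOMAIN_CLIENT  0x1u
#define DOMAIN_MANAGER 0x3u

static inline void set_domain_access(unsigned domain, unsigned access)
{
    unsigned long dacr;
    asm volatile("mrc p15, 0, %0, c3, c0, 0" : "=r" (dacr));
    dacr &= ~(0x3ul << (2 * domain));
    dacr |=  (unsigned long)(access & 0x3u) << (2 * domain);
    asm volatile("mcr p15, 0, %0, c3, c0, 0" : : "r" (dacr));
}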

I'm not sure yet whether sharing a TTB between tasks like that is worthwhile. There is another hardware feature which makes use of domains to provide something called the "Fast Context Switch Extension." Basically it allows you to assign a hardware process ID and avoid TLB flushes during context switches. But there are some limitations, including a 32MB maximum virtual size and no more than 16 process IDs. I need to do some microbenchmarks before I determine whether to pursue this. To do that I will need to get interrupts and timers working.

The OMAP353x interrupt controller is pretty well described in the TRM. There are 96 lines, though one of them covers all "external interrupt sources", whatever that may end up being. There are two interrupt inputs to the ARM processor itself: FIQ and IRQ. Programming it is pretty straightforward to start with. I've just set all the lines to go to IRQ for the time being. In the future, there are a bunch of interesting possibilities: priorities, preemption and FIQ support. For now, I just need a way to get the timer interrupt.

While looking at the GCC documentation for function attributes, I discovered that it already knows about ARM interrupt handlers. That means I don't have to write assembly veneers for the C functions, which saved me some work. The only major trouble I had was setting up the interrupt vector table. It's actually a piece of code: the ARM processor transfers control to some offset in the table and expects your code to do the right thing at that point. Usually the right thing is to load the PC with the address of the actual interrupt handler. On x86, I am accustomed to encoding the address of a far jump directly into the instruction. However, with ARM's 32-bit fixed-length instructions, this is basically impossible except in limited cases. ARM assemblers will actually create a "pool" of the 32-bit constants that you use in your program, and then load them using PC-relative addressing. It's a little bit of extra "magic" that they do for convenience, but it is a surprise if you are not used to having assemblers insert extra data into your code. In this case, the surprise got me because when I copied the interrupt table into place, I forgot to copy the "pool" that went with it. This resulted in unexpected "prefetch aborts", which is ARM-inology for a page fault caused by an instruction fetch.
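The handler side is pleasantly small once GCC does the heavy lifting. A sketch (acknowledge_irq is a stand-in, not a real function of mine):

/* Sketch: letting GCC generate the IRQ prologue/epilogue.  The attribute
 * makes the compiler save the necessary state and return with the correct
 * instruction, so no hand-written assembly veneer is needed. */
void __attribute__((interrupt("IRQ"))) irq_handler(void)
{
    /* Stand-in for reading the INTC's active-interrupt register and
     * dispatching to the right driver. */
    acknowledge_irq();
}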

Getting timers to work took me on a bit of a runaround. There are actually two subsystems that I am dealing with here: performance monitoring, and general-purpose timers. The perfmon system in Cortex-A8 (ARMv7) processors is controlled via coprocessor 15. Originally, I had gotten hold of bad information from one of the manuals. Unfortunately, QEMU does not support perfmon, so I had to wait until I tried it on the real hardware to discover that I was causing "undefined instruction" exceptions. Except that I was getting "data abort" exceptions instead, which puzzled me until I remembered that I had not yet set up a stack for "undefined instruction" mode. The hardware was telling me that the CP15 functionality I was invoking did not exist. I tracked down the correct information and plugged it in -- no more crashes, but no performance monitoring either. I was just trying to count the number of processor cycles that had passed, which is one of the built-in counter sources. Then I noticed that there was yet another register which mentioned "Enable Clock Counter" in its description. Strange, but it worked -- to enable the feature, I had to set it in two places.
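For reference, the cycle counter setup ends up looking roughly like this (a sketch; as far as I can tell, the two places are the master enable bit in PMCR and the cycle-counter bit in PMCNTENSET):

/* Sketch: turn on the ARMv7 cycle counter and read it. */
static inline void ccnt_enable(void)
{
    asm volatile("mcr p15, 0, %0, c9, c12, 0" : : "r" (1u));        /* PMCR.E */
    asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r" (1u << 31));  /* PMCNTENSET.C */
}

static inline unsigned long ccnt_read(void)
{
    unsigned long cycles;
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (cycles));     /* PMCCNTR */
    return cycles;
}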

GP Timers are a bit more involved. There are 11 of them, scattered throughout various "Power Domains" for some reason. Some have different capabilities than the others. Also there is a 32kHz timer register, which counts the number of ticks since reset. Convenient -- and I timed my perfmon counter using this, finding that approximately 720 million cycles pass per second. This corresponds to the advertised value of 720 "MHz" -- though it seems they are abusing the term "MHz" here to mean "Million Hertz" instead of "2^20 Hertz". It doesn't help that the 32kHz timer is actually kilohertz (32*2^10). Oh well.
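The calibration itself is simple: spin on the 32kHz counter for one second's worth of ticks and see how far the cycle counter moves. Roughly (read_32k is a stand-in for reading the 32kHz sync counter register):

/* Sketch: measure processor cycles per second against the 32kHz timer. */
static unsigned long cycles_per_second(void)
{
    unsigned long t0 = read_32k();
    unsigned long c0 = ccnt_read();
    while (read_32k() - t0 < 32768)   /* 32768 ticks = one second */
        ;
    return ccnt_read() - c0;          /* came out to roughly 720 million */
}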

The GP Timers can be programmed to raise an interrupt on overflow, or when they match a certain value. There are also lots of other little features I will need to check out at some point. The first thing I tested was the clock rate of GP timer 1. I expected it to be 32kHz, which seems to be the default reset value according to the manual. Instead, it was ticking approximately 100 times faster than expected. After spending a long time in the TRM (these manuals do not make it easy!) I found a register called CM_CLKSEL_WKUP which has a bit controlling GP timer 1: it can be sourced from either the 32K_FCLK (the stated default) or the SYS_CLK. I checked it, and on my system it was set to SYS_CLK instead of 32K_FCLK. Since SYS_CLK is driving the processor, it is a lot faster, which explains the speedy tick rate. I do not know why it was not set to the stated default. But now I have a working, predictable timer.

All I need is a primitive context switching experiment and I can do the microbenchmarks I had set out to do.

Discussion (2) | November 14th, 2011 Categories: hacking