Compiler
Mario Kart Wii was written in C and C++ and then compiled to PowerPC (PPC) Assembly Code by the Metrowerks CodeWarrior compiler. C and C++ code is not directly executable, but PPC Assembly Code can be run directly on the Wii's hardware. The distributed game contains only the PPC Assembly Code and it is likely that the original C and C++ code that the game was written in will never be made public. Understanding how the compiler works is therefore useful to gain an insight into what the original code may have looked like, and therefore guides the creation of new behaviour.
Compilation
ABI
See also Wikipedia's ABI article
In order to allow code to be reusable, most compilers conform to an Application Binary Interface (ABI). This is a set of rules and conventions for how functions will be compiled. For Mario Kart Wii the Metrowerks CodeWarrior compiler seems to have conformed with the PPC EABI, an extension of the PPC ABI. Some of these rules are outlined here.
Register Convention
See also Wikipedia's Calling Convention article
PowerPC has 32 general purpose integer registers and 32 general purpose floating point registers. These can store 32 bit integers and 64 bit floating point numbers respectively. Despite the 'general purpose' name, these have very strict usage conventions according to the ABI. PowerPC also has several special purpose registers which are optimised for storing very particular kinds of variables. Notably, there is the lr
(link register), the ctr
(count register) and the cr
s (condition registers).
Register(s) | Role | Caller/Callee Save | Description |
---|---|---|---|
r0
|
Temporary | Caller | A spare temporary register, somewhat awkard to use due to instruction set. |
r1
|
Stack Pointer | Callee | Stack pointer kept 8 byte aligned. |
r2
|
Read Only SDA | Constant | Pointer to the read only small data area + 0x8000; used for optimising loads to constants. The value of r2 never changes throught program execution.
|
r3- r4
|
Arguments/Return Values | Caller | On function call, r3 to r10 contain the integer arguments to the function, in code order. If a function needs more arguments, these are stored in the caller's stack frame. If the function returns an integer value, it is stored in r3 to r4 . r4 is only used for this purpose if the return value is a 64 bit number.
|
r5 -r10
|
Arguments | Caller | |
r11 -r12
|
Linker Reserved | Caller | On function call, r11 and r12 are never used to enable them to be used as temproaries by a dynamic linker if needed.
|
r13
|
Read/Write SDA | Constant | Pointer to the read/write small data area + 0x8000; used for optimising reads and writes to global variables. The value of r13 never changes throught program execution.
|
r14 -r31
|
Variables | Callee | r14 to r31 are used to store local variables. |
f0
|
Temporary | Caller | A spare temporary register. |
f1
|
Arguments/Return Value | Caller | On function call, f1 to f8 contain the floating point arguments to the function, in code order. If a function needs more arguments, these are stored in the caller's stack frame. If the function returns a floating point value, it is stored in f1 .
|
f2 -f8
|
Arguments | Caller | |
f9 -f13
|
Temporaries | Calller | Spare temorary registers. |
f14 -f31
|
Variables | Callee | f14 to f31 are used to store local variables.
|
Stack Frame Convention
See also Wikipedia's Call Stack article
In order to implement the function call abstraction in assembly code (which has no such concept), most systems use a call stack. The call stack gains a 'frame' every time a function starts, and loses one each time one ends. In the PPC ABI, r1
always points to the 'top' of the call stack. Whenever a new function is called, the size of that function's stack frame is subtracted from the current value of r1
to form the new r1
. When the function returns, it adds the same amount back to r1
to recover the original value of the stack pointer. Additionally, when the function starts, the old value of the stack pointer is stored at the value in memory addressed by the new stack pointer. That is to say, if you loaded the value addressed by r1
, you always find the old stack pointer. Recurisvely, if you loaded the value at that address, you would find the stack pointer of the caller's caller. This procedure can be repeated meaning the stack forms a 'chain' of stack frames. This can be useful for debugging, allowing a debugger to inspect the frame of all functions that are currently executing, even if they are not the 'top' of the stack. The last entry in the stack (the first function called) has 0
in this value, so the chain is not infinite.
The contents of the stack frame depend on the needs of the function. The shortest possible stack frame is 8 bytes, the longest observed is nearly 4 kilobytes. All used registers which are marked callee save must be stored in the stack frame. Additionally, any local variables created in the code must be stored there as well as the arguments to any call made by the function which do not fit in registers. Generally speaking the compiler is able to compute the 'worst case' stack frame size for each individual function and always uses that amount as the size when creating stack frames. Rarely, the compiler allocates too little space and then has a correction procedure in the middle of the function to adjust the size of the stack frame. This can happen if there is a dynamically sized local variable in the function, or if a local variable requires tighter alignment than 8 bytes.
While the exact contents of a stack frame differ for each function, the order of different types of data is fixed.
- The old stack frame value is at
r1
. - 4 bytes are left for the saved link register of functions that this function may call.
- Any arguments to child functions which may be needed.
- Any local variables as needed.
- Callee saved floating point registers as needed.
- Callee saved integer registers as needed.
The only way to know the contents of the stack frame of a particular function is to analyse its code.
Behaviour
Compilers are complicated pieces of software and often have many optimisation passes used to improve the quality of the code. Nevertheless, in many cases the compiler's behaviour amounts to a 'pattern match'. In other words, a given piece of C or C++ code is translated to a given set of assembly code. Some of these patterns are explained here.
Function Prologue
The start of each function must maintain the stack an arguments in accodance with the ABI. There are a few commonly seen patterns at the beginning of each function. The most common is something like:
stwu r1, -N(r1) mflr r0 stw r0, N+4(r1)
This short sequence creates an ABI conformant stack frame of size N. This is typically followed by a sequence of instructions to save the callee saved registers. Common variations include directly storing:
stw r31, N-4(r1) stw r30, N-8(r1) stw r29, N-12(r1) ...
The store more words instruction:
stmw r29, N-12(r1)
Or a helper function, which looks something like:
addi r11, r1, N bl _savegpr_29
The destination of the branch in this case is a small piece of reusable assembly code which looks something like:
... _savegpr_29: stw r29, -12(r11) _savegpr_30: stw r30, -8(r11) _savegpr_31: stw r31, -4(r11) blr
If the local variables in a function require tighter alginment than 8 bytes, the compiler must align r1
to an address which is a multiple of 8. For example, the following function in GNU C:
void f(void) { int example_variable[8] __attribute__((aligned(32)); ... }
The compiler therefore has a special function prolog sequence for such cases. For example, to create a stack frame of size N with 32 byte alignment, the compiler typically uses the following sequence of instructions:
clrlwi r11, r1, 27 mr r12, r1 subfic r11, r11, -N stwux r1, r1, r11 mflr r0 stw r0, 4(r12)
Function Epilogue
The end of each function must reverse the actions taken at the start of the function, in accodance with the ABI. The common pattern to remove a frame of size N is:
lwz r0, N+4(r1) addi r1, r1, N mtlr r0 blr
This snipped is typically prceeded by a few instruction to restore the callee saved registers which match the corresponding snippets listed above. It can be directed loading:
lwz r29, N-12(r1) lwz r30, N-8(r1) lwz r31, N-4(r1)
Or alternatively the load more words instruction:
lmw r29, N-12(r1)
Or finally the helpfer function variation:
addi r11, r1, N bl _savegpr_29
The destination of the branch in this case is a small piece of reusable assembly code which looks something like:
... _savegpr_29: lwz r29, -12(r11) _savegpr_30: lwz r30, -8(r11) _savegpr_31: lwz r31, -4(r11) blr
Leaf Functions
If a function is a leaf function (a funciton which does not call another function, or a function whose only call is a tail call) then the compiler will often optimise the stack frame away entirely. For example if we consider a simple two argument function which takes two integers as arguments and returns their sum:
int f(int x, int y) { return x + y; }
the compiler would optimise away the standard prolgue and epilogue to produce assembly code like:
add r3, r3, r4 blr
Similarly, a function which takes two arguments, swaps them, and then returns the result of a call to another function f
:
int g(int x, int y) { return f(y, x); }
the compiler may optimise as a leaf tail call like:
mr r0, r3 mr r3, r4 mr r4, r0 b f
32 bit constant
To set a register to a 32 bit constant, for example an address, two instructions are generally needed. For example, to set r3 to 0x80003180:
lis r3, 0x8000 addi r3, r3, 0x3180
One consequence of using the addi
instruction is that it becomes more complex to load a 32 bit constant if the 17th bit is 1. This is because the addi
instruction uses a signed immediate. Therefore, to set r3 to 0x8000b180:
lis r3, 0x8001 addi r3, r3, -0x4f80
As an optimisation, the above addi
instructions can be swapped for stw
or ldw
instructions (or similar) if the intention is to load or store a value at the address given by the constant. Occasionally the compiler makes use of the stwu
or ldwu
instructions to further optimise this if both the 32 bit constant and the value at that address are required.
Small Data Areas
The small data areas (SDAs) are a pair of optimisations used by the compiler to reduce the number of instructions needed to reference certain global variables. Typically, loading the value of a 32 bit global variable x
would require two instructions:
lis rN, x@ha lwz rN, x@l(rN)
However, if the variable is stored in, say, the read only small data area, it can be accessed in just one instruction. For example, if the variable is at offset 0xa4
into the small data area, it would be accessed by the following instruction.
lwz rN, 0xa4-0x8000(r2)
This is because r2 always holds the address of the read only small data area + 0x8000, so this load will always load the correct value.
Switch Statements
In C and C++ swtich statements are common code constructs for expressing many many divergent behaviours stemming from a single variable. The Metrowerks CodeWarrior compiler seems to have two strategies for implementing these, depending on whether they or not the tested values fall into a neat range. These are a jump table or transformation to conditionals. For example, in the following switch statement, all the cases fall in a relatively small range with relatively few gaps:
switch (x) { case 1: f1(); break; case 2: f2(); break; case 4: f4(); break; case 5: f5(); break; case 6: f6(); break; case 7: f7(); break; default: fdefault(); break; }
The compiler may therefore compile this to a jump table. The jump table code would look something like:
addi r0, rX, -1 cmplwi r0, 6 bgt defaultCase lis r3, jumptable@ha rlwinm r0, r0, 2, 0, 29 addi r3, r3, jumptable@l lwz r3, r3, r0 mtctr r3 bctr
The choice of r3 and r0 here is arbitrary. The strategy here is first to subtract the lowest numbered case from x
and then check if it needs to branch to the default case. This is done be comparing against the highest numbered case minus the lowest numbered case. If the test is greater than this (unsigned) value, then it definitely needs to branch to the default case. Otherwise it loads the address of the jump table, and loads an address to branch to from that table. Finally it does an indirect branch to that address. In this case the first table entry would be the address of the code for case 1
, the second entry would be for case 2
and so on, with the third entry being the addres of the default case, as case 3
did not exist. In the event that the switch statement had a case 0
the compiler would optimie away the unnecessary addi
instruction. This strategy can still be used if some or all of the cases are negative, all that matters is that the range is sufficiently small and sufficiently dense.
If the switch statement is not amenable to a jump table, it is transformed to a series of conditionals. For example, the following switch statement:
switch (x) { case 1000: f1000(); break; case 1001: f1001(); break; case 2000: f2000(); break; case 3000: f3000(); break; case 3001: f3001(); break; }
Is not amenable to a jump table, as the jump table would need over 2000 entries for just 5 cases. In this case, the compiler tends to produce an optimised conditional, something like:
cmpwi rX, 2000 blt lt1000 beq case2000 cmpwi rX, 3000 beq case3000 cmpwi rX, 3001 beq case3001 b defaultCase lt1000: cmpwi rX, 1000 beq case1000 cmpwi rX, 1001 beq case1001 b defaultCase
The compiler is using a divide and conquer strategy here, testing the middle case first in order to narrow down the number of possibilities. This clearly involves far more instructions that the jump table approach, but given the size of the jump table for this swtich statement it means less bytes over all. For very simple switches, the condition version may even be faster.
C++ classes
One key feature of C++ is the ability to support classes for object oriented programming. Most of Mario Kart Wii's game logic code seems to make heavy use of this. At a basic level, classes in C++ describe an object which has methods and fields. For example, consider the following class:
class Vector { private: float x; float y; floay z; public: float length () { return sqrt(x * x + y * y + z * z); } }
The compiler will check the access modifiers at compile time (public
/private
/...) but will ultimately discard them. By analysing the code, we have no way of knowing which fields and methods had what access modifiers. Whenever the programmer creates this Vector
in a function, it will need to store the three fields, x
, y
and z
. The compiler therefore decides how to lay these out in memory. It always does this identically, placing the fields in code order consecutively, along with any padding needed. Therefore, in memory, an instance of this class would just appear as three floating point values. The function length
would be implemented by transforming it to have an additional parameter, always passed in r3
. If it had other parameters, these would be passed starting in r4
. The pointer in r3
would point to the start of the vector upon which length is being run (known in C++ as this
). Therefore the code for length may look something like:
lfs f0, 0(r3) fmuls f1, f0, f0 lfs f0, 4(r3) fmadds f1, f0, f0, f1 lfs f0, 8(r3) fmadds f1, f0, f0, f1 b sqrt
Inheritance
See also Wikipedia's Virtual Method Table article
One of the key features of object oriented programming is inheritance. For example, consider the following code:
class Shape { public: int32_t x; int32_t y; virtual int32_t area(void) = 0; } class Square : public Shape { public: int32_t length; virtual int32_t area(void) { return length * length; } } class Rectangle : public Shape { public: int32_t width; int32_t height; virtual int32_t area(void) { return width * height; } }
In this case, we have an abstract class Shape
and two derived classes Square
and Rectangle
. The programmer can never create a Shape
directly because it lacks an implementation for area
. This would be checked at compile time. However, the programmer can create Square
s and Rectangle
s and cast these to pointers to shapes. For example, we could write a method:
int mulAreas(Shape *a, Shape *b) { return a->area() * b->area(); }
In general it is impossible for the compiler to know at compile time whether the arguments a
and b
are Square
s or Rectangle
s. It may even differer by call site. Therefore it is impossible for the compiler to know at compile time which version of the area
method it ought to call.
The solution to this problem used by the MetroWerks CodeWarrior compiler is a vtable or virtual method table. When organising the memory layout for Square
and Rectangle
it begins by putting a pointer to the vtable. The table itself is stored once somewhere in the read only data section of memory. The table always seems to contain 8 bytes of 0. There is no other header associated with the tables, so it is difficult to find them directly by analysing memory. These 8 bytes are followed by one pointer entry for each virtual
function in code order. In this case, that would just be area
. The pointer is the address in memory of the implementation of area
for that particular class. The code for mulAreas
would look something like:
stwu r1, -16(r1) mflr r0 stw r0, 20(r1) stw r31, 12(r1) mr r31, r4 lwz r12, 0(r3) lwz r12, 8(r12) mtctr r12 bctrl mr r0, r3 mr r3, r31 mr r31, r0 lwz r12, 0(r3) lwz r12, 8(r12) mtctr r12 bctrl mullw r3, r3, r31 lwz r31, 12(r1) lwz r0, 20(r1) addi r1, r1, 16 mtlr r0 blr
After the vtable pointer follows the fields of the parent class (x
and y
in this case), then the fields of the derived class (either length
or width
then height
). The size of the data therefore depends on which class is created (Square
at 16 bytes or Rectangle
at 20 bytes). For example, the code for area
for Square
would look something like:
lwz r3, 12(r3) mullw r3, r3, r3 blr
Complex Fields
One key difference with C++ objects and many other object oriented programming languages is that objects can be stored by value, not just by reference or pointer. That is to say, given the examples above, the following code:
class Panel { public: Rectangle bounds; int32_t textID; }
Would result in a 24 byte class panel, starting with a 20 byte Rectangle
, followed by the 4 byte int32_t
for textID
. It is therefore possible that analysing this class would confuse you about the nature of the class Rectangle
as the first entry of Panel
in memory would be the vtable for Rectangle
, which could be mistake for the vtable for class Panel
. Similarly, if the two fields in Panel
were the other way round, it would be easy to mistake the vtable of Rectangle
for an ordinary variable as it is not the first entry in the class Panel
. Thus, in general, any offset into an object may be a vtable and there may be many per object if there are multiple complex fields.
Libraries
Along with code written specifically for Mario Kart Wii, the game contains some library code which provides much generic functionality reusable between projects. It is theoretically impossible to say for certain which parts of the code represent libraries and which represent specific Mario Kart Wii code, but with some very resonable assumptions we can guess. In particular the C and C++ standards specify some libraries which the compiler must provide (even if the code decides not to use them).
Memory Management
See also Wikipedia's C Dynamic Memory Management article
One important part of most piece of code is dynamic memory management. In contrast to many more modern languages, C and C++ memory management is explicit. Nothing happens automatically. In C it is mostly done using the malloc
and free
methods. C++ tends to use new
and delete
, although these are often just implemented to call down to the C library. Perhaps suprisingly, Mario Kart Wii does not seem to use the standard malloc
and free
methods much. Instead, Nintendo's own Wii library seems to provide an alternative similar interface. Their allocation functions seem to take two arguments as opposed to malloc
which takes only one. The arguments are the amount of memory the code is requesting and also what heap to allocate in. The heaps are areas of memory dedicated to storing different types of variables. For example, there is a heap dedicated to storing copies of files from the game disk. There is also a heap dedicated to storing all variables related to a race during a race. The main advantage of this approach seems to be the ability to delete entire heaps very efficiently. For example, when a race ends, the game simply deletes the race data heap and all the variables that were part of it disappear. This can make analysing the game's memory usage somewhat more confusing as there may never be a call to free
corresponding to a call to malloc
.
One useful aspect of the game's version of malloc
is that it leaves a small header before each allocation, which it uses to track the allocated memory. Subtracting 0x10 from the pointer returned by malloc
gives the start of the header. The format of the block is as follows:
Offset | Type | Description |
---|---|---|
0x00 | Uint16 | Magic, always 0x5544 or UD' in ASCII. |
0x02 | UInt16 | Flags, unkown meanings. |
0x04 | UInt32 | N Length of the block in bytes. |
0x08 | UInt32 | Pointer to previous block. |
0x0C | UInt32 | Pointer to next block. |
0x10 | UInt8[N] | N bytes of data. |
N + 0x10 | End of this block, start of next block at next 8 byte aligned offset |
Because of the pointers, the blocks form a doubly linked list. If a block is free
d, it is not necessarily deleted immediately, but the pointers are corrected so that they point to the next and previous undeleted but free block. Thus, there is a linked list of free
d blocks and a linked list of used blocks. The game seems to reuse free
d blocks if another allocation of the same size is made.
The heaps also have a header, which seems to begin with the magic 0x4652 or FR in ASCII, though nothing further is no about it.
Multithreading
See also Wikipedia's Thread article
Like many modern programs, Mario Kart Wii makes heavy use of multithreading. This gives the illusion to the programmer that multiple lines of code are executing simultaneoulsy. Because the Wii's processor is single core, this is impossible. Instead the library starts a timer (the decrementer) when each section of code starts running. When the timer expires, the processor automatically interrupts the code (decrementer interrupt) and branches to another piece of code. This happens sufficiently quickly that all the threads make progress. If a thread has nothing to do, it can voluntarily suspend itself in order to another thread chance to work. This is common for threads that perform background tasks, such as loading from the disk.
Each thread in the game has a priority. The game's library executes only the highest priority thread(s) available until they all run out of work. It then considers the next priority level. The highest priority level is 31 and the lowest is 0.
The library seems to make a distinction between two related concepts a thread and a context. Every thread has a context, but not every context has a thread. Contexts are used for tiny pieces of work such as interrupt handlers. The library always seems to execute any threadless contexts first if possible, as if they are high priority threads. Contexts cannot call certain functions which require full blown threads, such as some sleep routines. A point to the current executing context is always stored at 0x800000d4
.
The context structure in memory is only updated when the context changes, in order to remember what to switch back to. The format of a context in memory is as follows:
Offset | Type | Description |
---|---|---|
0x000 | Uint32[32] | A copy of each register in order (r0 to r31 ).
|
0x080 | UInt32 | A copy of the condition register (cr ).
|
0x084 | UInt32 | A copy of the link register (lr ).
|
0x088 | UInt32 | A copy of the count register (ctr ).
|
0x08c | UInt32 | A copy of the integer exception register (xer ).
|
0x090 | Double[32] | A copy of each floating point register in order (f0 to f31 ).
|
0x190 | 8 bytes | Unknown |
0x198 | UInt32[2] | A copy of the two save/restore reigsters in order (srr0 and srr1 )
|
0x1a0 | 2 bytes | Unknown |
0x1a2 | UInt16 | Flags (0x1 is restore_fpr, 0x2 is restore_volatile_gpr, others unknown) |
0x1a4 | Uint32[8] | A copy of each graphics quantisation register in order (gqr0 to gqr7 ).
|
0x1c4 | 4 bytes | Unknown |
0x1c8 | Float[64] | A copy of each floating point pair register in order (f0 to f31 ).
|
0x2c8 | End of this structure |
A thread data structure is like an extended context:
Offset | Type | Description |
---|---|---|
0x000 | OSContext | As above, a full context. |
0x2c8 | 2 bytes | Unknown |
0x2ca | SInt16 | A copy of argument r9 from OSCreateThread .
|
0x2cc | 4 bytes | Unknown |
0x2d0 | SInt32 | A thread priority value. Usage unclear. |
0x2d4 | SInt32 | A thread priority value. Usage unclear. |
0x2d8 | 12 bytes | Unknown |
0x304 | Pointer32 | Highest address of the call stack (i.e. the bottom). |
0x308 | Pointer32 | Lowest address of the call stack (i.e. the bottom). |
0x30c | SInt32 | C error number variable (errno ).
|
0x310 | 8 bytes | Unknown |
0x318 | End of this structure |
Linking and Loading
The game's code (along with all referenced libraries) was divided and compiled into two files; main.dol and StaticR.rel. main.dol is loaded first by the disc loader (typically the Wii Menu) and then StaticR.rel is dynamically loaded and linked during the health and safety screen. main.dol is statically linked, so always loads to the same location in memory. StaticR.rel has references to the static method addresses in main.dol, but is itself relocatable and has relocation information to assist with changing its loaded address. Despite this, due to the deterministic nature of the game's early code, StaticR.rel is always loaded to the same address in practice.
It is not clear why the game's code was split in this manner; other games such as The Legend of Zelda: Twilight Princess feature many more rel files, and dynamically load and unload them to save on memory by loading code only when needed. Mario Kart Wii, having only one rel file, does not do this. One potential explanation is that most of the game's specific logic (physics, ui, ...) is located in StaticR.rel, so perhaps the game's main.dol is intended to be reusable as an engine for different games. There is no evidence this was ever used as such however; even the Mario Kart Channel uses a different main.dol.