Compiler

Mario Kart Wii was written in C and C++ and then compiled to PowerPC (PPC) Assembly Code by the Metrowerks CodeWarrior compiler. C and C++ code is not directly executable, but PPC Assembly Code can be run directly on the Wii's hardware. The distributed game contains only the PPC Assembly Code and it is likely that the original C and C++ code that the game was written in will never be made public. Understanding how the compiler works is therefore useful to gain an insight into what the original code may have looked like, and therefore guides the creation of new behaviour.

Compilation

ABI

See also Wikipedia's ABI article

In order to allow code to be reusable, most compilers conform to an Application Binary Interface (ABI). This is a set of rules and conventions for how functions will be compiled. For Mario Kart Wii the Metrowerks CodeWarrior compiler seems to have conformed with the PPC EABI, an extension of the PPC ABI. Some of these rules are outlined here.

Register Convention

PowerPC has 32 general purpose integer registers and 32 general purpose floating point registers. These can store 32 bit integers and 64 bit floating point numbers respectively. Despite the 'general purpose' name, these have very strict usage conventions according to the ABI. PowerPC also has several special purpose registers which are optimised for storing very particular kinds of variables. Notably, there is the lr (link register), the ctr (count register) and the crs (condition registers).

ABI Register Convention
Register(s)	Role	Caller/Callee Save	Description
`r0`	Temporary	Caller	A spare temporary register, somewhat awkward to use due to instruction set.
`r1`	Stack Pointer	Callee	Stack pointer kept 8 byte aligned.
`r2`	Read Only SDA	Constant	Pointer to the read only small data area + 0x8000; used for optimising loads to constants. The value of `r2` never changes throughout program execution.
`r3-r4`	Arguments/Return Values	Caller	On function call, `r3` to `r10` contain the integer arguments to the function, in code order. If a function needs more arguments, these are stored in the caller's stack frame. If the function returns an integer value, it is stored in `r3` to `r4`. `r4` is only used for this purpose if the return value is a 64 bit number.
`r5`-`r10`	Arguments	Caller
`r11`-`r12`	Linker Reserved	Caller	On function call, `r11` and `r12` are never used to enable them to be used as temporaries by a dynamic linker if needed.
`r13`	Read/Write SDA	Constant	Pointer to the read/write small data area + 0x8000; used for optimising reads and writes to global variables. The value of `r13` never changes throughout program execution.
`r14`-`r31`	Variables	Callee	r14 to r31 are used to store local variables.
`f0`	Temporary	Caller	A spare temporary register.
`f1`	Arguments/Return Value	Caller	On function call, `f1` to `f8` contain the floating point arguments to the function, in code order. If a function needs more arguments, these are stored in the caller's stack frame. If the function returns a floating point value, it is stored in `f1`.
`f2`-`f8`	Arguments	Caller
`f9`-`f13`	Temporaries	Calller	Spare temporary registers.
`f14`-`f31`	Variables	Callee	`f14` to `f31` are used to store local variables.

Stack Frame Convention

In order to implement the function call abstraction in assembly code (which has no such concept), most systems use a call stack. The call stack gains a 'frame' every time a function starts, and loses one each time one ends. In the PPC ABI, r1 always points to the 'top' of the call stack. Whenever a new function is called, the size of that function's stack frame is subtracted from the current value of r1 to form the new r1. When the function returns, it adds the same amount back to r1 to recover the original value of the stack pointer. Additionally, when the function starts, the old value of the stack pointer is stored at the value in memory addressed by the new stack pointer. That is to say, if you loaded the value addressed by r1, you always find the old stack pointer. Recursively, if you loaded the value at that address, you would find the stack pointer of the caller's caller. This procedure can be repeated meaning the stack forms a 'chain' of stack frames. This can be useful for debugging, allowing a debugger to inspect the frame of all functions that are currently executing, even if they are not the 'top' of the stack. The last entry in the stack (the first function called) has 0 in this value, so the chain is not infinite.

The contents of the stack frame depend on the needs of the function. The shortest possible stack frame is 8 bytes, the longest observed is nearly 4 kilobytes. All used registers which are marked callee save must be stored in the stack frame. Additionally, any local variables created in the code must be stored there as well as the arguments to any call made by the function which do not fit in registers. Generally speaking the compiler is able to compute the 'worst case' stack frame size for each individual function and always uses that amount as the size when creating stack frames. Rarely, the compiler allocates too little space and then has a correction procedure in the middle of the function to adjust the size of the stack frame. This can happen if there is a dynamically sized local variable in the function, or if a local variable requires tighter alignment than 8 bytes.

While the exact contents of a stack frame differ for each function, the order of different types of data is fixed.

The old stack frame value is at r1.
4 bytes are left for the saved link register of functions that this function may call.
Any arguments to child functions which may be needed.
Any local variables as needed.
Callee saved floating point registers as needed.
Callee saved integer registers as needed.

The only way to know the contents of the stack frame of a particular function is to analyse its code.

Behaviour

Compilers are complicated pieces of software and often have many optimisation passes used to improve the quality of the code. Nevertheless, in many cases the compiler's behaviour amounts to a 'pattern match'. In other words, a given piece of C or C++ code is translated to a given set of assembly code. Some of these patterns are explained here.

Function Prologue

The start of each function must maintain the stack an arguments in accordance with the ABI. There are a few commonly seen patterns at the beginning of each function. The most common is something like:

stwu   r1, -N(r1)
mflr   r0
stw    r0, N+4(r1)

This short sequence creates an ABI conformant stack frame of size N. This is typically followed by a sequence of instructions to save the callee saved registers. Common variations include directly storing:

stw    r31, N-4(r1)
stw    r30, N-8(r1)
stw    r29, N-12(r1)
...

The store more words instruction:

stmw   r29, N-12(r1)

Or a helper function, which looks something like:

addi   r11, r1, N
bl     _savegpr_29

The destination of the branch in this case is a small piece of reusable assembly code which looks something like:

 ...
_savegpr_29:
 stw    r29, -12(r11)
_savegpr_30:
 stw    r30, -8(r11)
_savegpr_31:
 stw    r31, -4(r11)
 blr

If the local variables in a function require tighter alginment than 8 bytes, the compiler must align r1 to an address which is a multiple of 8. For example, the following function in GNU C:

void f(void) {
  int example_variable[8] __attribute__((aligned(32));
  ...
}

The compiler therefore has a special function prolog sequence for such cases. For example, to create a stack frame of size N with 32 byte alignment, the compiler typically uses the following sequence of instructions:

clrlwi  r11, r1, 27
mr      r12, r1
subfic  r11, r11, -N
stwux   r1, r1, r11
mflr    r0
stw     r0, 4(r12)

Function Epilogue

The end of each function must reverse the actions taken at the start of the function, in accordance with the ABI. The common pattern to remove a frame of size N is:

lwz    r0, N+4(r1)
addi   r1, r1, N
mtlr   r0
blr

This snipped is typically proceeded by a few instruction to restore the callee saved registers which match the corresponding snippets listed above. It can be directed loading:

lwz    r29, N-12(r1)
lwz    r30, N-8(r1)
lwz    r31, N-4(r1)

Or alternatively the load more words instruction:

lmw r29, N-12(r1)

Or finally the helper function variation:

addi   r11, r1, N
bl     _restgpr_29

The destination of the branch in this case is a small piece of reusable assembly code which looks something like:

 ...
_restgpr_29:
 lwz   r29, -12(r11)
_restgpr_30:
 lwz   r30, -8(r11)
_restgpr_31:
 lwz   r31, -4(r11)
 blr

Leaf Functions

If a function is a leaf function (a function which does not call another function, or a function whose only call is a tail call) then the compiler will often optimise the stack frame away entirely. For example if we consider a simple two argument function which takes two integers as arguments and returns their sum:

int f(int x, int y) {
  return x + y;
}

the compiler would optimise away the standard prologue and epilogue to produce assembly code like:

add    r3, r3, r4
blr

Similarly, a function which takes two arguments, swaps them, and then returns the result of a call to another function f:

int g(int x, int y) {
  return f(y, x);
}

the compiler may optimise as a leaf tail call like:

mr     r0, r3
mr     r3, r4
mr     r4, r0
b      f

32 bit constant

To set a register to a 32 bit constant, for example an address, two instructions are generally needed. For example, to set r3 to 0x80003180:

lis    r3, 0x8000
addi   r3, r3, 0x3180

One consequence of using the addi instruction is that it becomes more complex to load a 32 bit constant if the 17th bit is 1. This is because the addi instruction uses a signed immediate. Therefore, to set r3 to 0x8000b180:

lis    r3, 0x8001
addi   r3, r3, -0x4f80

As an optimisation, the above addi instructions can be swapped for stw or ldw instructions (or similar) if the intention is to load or store a value at the address given by the constant. Occasionally the compiler makes use of the stwu or ldwu instructions to further optimise this if both the 32 bit constant and the value at that address are required.

Small Data Areas

The small data areas (SDAs) are a pair of optimisations used by the compiler to reduce the number of instructions needed to reference certain global variables. Typically, loading the value of a 32 bit global variable x would require two instructions:

lis    rN, x@ha
lwz    rN, x@l(rN)

However, if the variable is stored in, say, the read only small data area, it can be accessed in just one instruction. For example, if the variable is at offset 0xa4 into the small data area, it would be accessed by the following instruction.

lwz    rN, 0xa4-0x8000(r2)

This is because r2 always holds the address of the read only small data area + 0x8000, so this load will always load the correct value.

Switch Statements

In C and C++ switch statements there are common code constructs for expressing many divergent behaviours stemming from a single variable. The Metrowerks CodeWarrior compiler seems to have two strategies for implementing these, depending on whether or not the tested values fall into a neat range. These are a jump table or transformation to conditionals. For example, in the following switch statement, all the cases fall in a relatively small range with relatively few gaps:

switch (x) {
case 1: f1(); break;
case 2: f2(); break;
case 4: f4(); break;
case 5: f5(); break;
case 6: f6(); break;
case 7: f7(); break;
default: fdefault(); break;
}

The compiler may therefore compile this to a jump table. The jump table code would look something like:

addi   r0, rX, -1
cmplwi r0, 6
bgt    defaultCase
lis    r3, jumptable@ha
rlwinm r0, r0, 2, 0, 29
addi   r3, r3, jumptable@l
lwz    r3, r3, r0
mtctr  r3
bctr

The choice of r3 and r0 here is arbitrary. The strategy here is first to subtract the lowest numbered case from x and then check if it needs to branch to the default case. This is done be comparing against the highest numbered case minus the lowest numbered case. If the test is greater than this (unsigned) value, then it definitely needs to branch to the default case. Otherwise it loads the address of the jump table, and loads an address to branch to from that table. Finally it does an indirect branch to that address. In this case the first table entry would be the address of the code for case 1, the second entry would be for case 2 and so on, with the third entry being the address of the default case, as case 3 did not exist. In the event that the switch statement had a case 0 the compiler would optimise away the unnecessary addi instruction. This strategy can still be used if some or all of the cases are negative, all that matters is that the range is sufficiently small and sufficiently dense.

If the switch statement is not amenable to a jump table, it is transformed to a series of conditionals. For example, the following switch statement:

switch (x) {
case 1000: f1000(); break;
case 1001: f1001(); break;
case 2000: f2000(); break;
case 3000: f3000(); break;
case 3001: f3001(); break;
}

Is not amenable to a jump table, as the jump table would need over 2000 entries for just 5 cases. In this case, the compiler tends to produce an optimised conditional, something like:

  cmpwi  rX, 2000
  blt    lt1000
  beq    case2000
  cmpwi  rX, 3000
  beq    case3000
  cmpwi  rX, 3001
  beq    case3001
  b      defaultCase
lt1000:
  cmpwi  rX, 1000
  beq    case1000
  cmpwi  rX, 1001
  beq    case1001
  b      defaultCase

The compiler is using a divide and conquer strategy here, testing the middle case first in order to narrow down the number of possibilities. This clearly involves far more instructions that the jump table approaches, but given the size of the jump table for this switch statement it means less bytes over all. For very simple switches, the condition version may even be faster.

C++ classes

One key feature of C++ is the ability to support classes for object oriented programming. Most of Mario Kart Wii's game logic code seems to make heavy use of this. At a basic level, classes in C++ describe an object which has methods and fields. For example, consider the following class:

class Vector {
private:
  float x;
  float y;
  float z;
public:
  float length () { return sqrt(x * x + y * y + z * z); }
}

The compiler will check the access modifiers at compile time (public/private/...) but will ultimately discard them. By analysing the code, we have no way of knowing which fields and methods had what access modifiers. Whenever the programmer creates this Vector in a function, it will need to store the three fields, x, y and z. The compiler therefore decides how to lay these out in memory. It always does this identically, placing the fields in code order consecutively, along with any padding needed. Therefore, in memory, an instance of this class would just appear as three floating point values. The function length would be implemented by transforming it to have an additional parameter, always passed in r3. If it had other parameters, these would be passed starting in r4. The pointer in r3 would point to the start of the vector upon which length is being run (known in C++ as this). Therefore the code for length may look something like:

lfs    f0, 0(r3)
fmuls  f1, f0, f0
lfs    f0, 4(r3)
fmadds f1, f0, f0, f1
lfs    f0, 8(r3)
fmadds f1, f0, f0, f1
b      sqrt

Inheritance

One of the key features of object oriented programming is inheritance. For example, consider the following code:

class Shape {
public:
  virtual int32_t area(void) = 0;
  
  int32_t x;
  int32_t y;
  
}

class Square : public Shape {
public:
  int32_t length;
  virtual int32_t area(void) { return length * length; }
}

class Rectangle : public Shape {
public:
  int32_t width;
  int32_t height;
  virtual int32_t area(void) { return width * height; }
}

In this case, we have an abstract class Shape and two derived classes Square and Rectangle. The programmer can never create a Shape directly because it lacks an implementation for area. This would be checked at compile time. However, the programmer can create Squares and Rectangles and cast these to pointers to shapes. For example, we could write a method:

int mulAreas(Shape *a, Shape *b) {
  return a->area() * b->area();
}

In general it is impossible for the compiler to know at compile time whether the arguments a and b are Squares or Rectangles. It may even differ by call site. Therefore it is impossible for the compiler to know at compile time which version of the area method it ought to call.

The solution to this problem used by the MetroWerks CodeWarrior compiler is a vtable or virtual method table. When organising the memory layout for Square and Rectangle it puts a pointer to the vtable at the first virtual function in layout (almost always at the beginning). The table itself is stored once somewhere in the read only data section of memory. The table always seems to contain 8 bytes of 0. There is no other header associated with the tables, so it is difficult to find them directly by analysing memory. These 8 bytes are followed by one pointer entry for each virtual function in code order. In this case, that would just be area. The pointer is the address in memory of the implementation of area for that particular class. The code for mulAreas would look something like:

stwu    r1, -16(r1)
mflr    r0
stw     r0, 20(r1)
stw     r31, 12(r1)
mr      r31, r4
lwz     r12, 0(r3)
lwz     r12, 8(r12)
mtctr   r12
bctrl
mr      r0, r3
mr      r3, r31
mr      r31, r0
lwz     r12, 0(r3)
lwz     r12, 8(r12)
mtctr   r12
bctrl
mullw   r3, r3, r31
lwz     r31, 12(r1)
lwz     r0, 20(r1)
addi    r1, r1, 16
mtlr    r0
blr

After the vtable pointer follows the fields of the parent class (x and y in this case), then the fields of the derived class (either length or width then height). The size of the data therefore depends on which class is created (Square at 16 bytes or Rectangle at 20 bytes). For example, the code for area for Square would look something like:

lwz    r3, 12(r3)
mullw  r3, r3, r3
blr

Complex Fields

One key difference with C++ objects and many other object oriented programming languages is that objects can be stored by value, not just by reference or pointer. That is to say, given the examples above, the following code:

class Panel {
public:
  Rectangle bounds;
  int32_t textID;
}

Would result in a 24 byte class panel, starting with a 20 byte Rectangle, followed by the 4 byte int32_t for textID. It is therefore possible that analysing this class would confuse you about the nature of the class Rectangle as the first entry of Panel in memory would be the vtable for Rectangle, which could be mistake for the vtable for class Panel. Similarly, if the two fields in Panel were the other way round, it would be easy to mistake the vtable of Rectangle for an ordinary variable as it is not the first entry in the class Panel. Thus, in general, any offset into an object may be a vtable and there may be many per object if there are multiple complex fields.

Libraries

Along with code written specifically for Mario Kart Wii, the game contains some library code which provides much generic functionality reusable between projects. It is theoretically impossible to say for certain which parts of the code represent libraries and which represent specific Mario Kart Wii code, but with some very reasonable assumptions we can guess. In particular, the C and C++ standards specify some libraries which the compiler must provide (even if the code decides not to use them).

Memory Management

One important part of most piece of code is dynamic memory management. In contrast to many more modern languages, C and C++ memory management is explicit. Nothing happens automatically. In C it is mostly done using the malloc and free methods. C++ tends to use new and delete, although these are often just implemented to call down to the C library. Perhaps surprisingly, Mario Kart Wii does not seem to use the standard new and delete methods much. Instead, Nintendo's own Wii library seems to provide an alternative similar interface which is used in overloaded new and delete operators.

The game's equivalent of malloc (at PAL 0x80229814) takes three arguments. The first argument is the size of the memory block to allocate. Secondly, the alignment of the block to allocate. For tasks like reading a file from the DVD drive which requires 32-byte address alignment, this value will be need to set to 32. Lastly, the heap to allocate it the block in. If this value is null, the game will simply use the default heap. Additionally, if the current thread is wrapped by the game's C++ thread class, the heap at the thread + 0x3C will be used instead of the default or explicitly supplied heap. This doesn't appear to be used in the retail game. However, it was probably used to restrict debug-only threads to the debug heap (only created if it detects the game is running on a development unit). Additionally, the static heap at PAL 0x80386EA8 if not null, will restrict (but not redirect) all heap allocations to use this heap. Mario Kart Wii never uses this, opting instead for multiple heaps.

The system memory initialization functions creates a few expanded heaps. The game can further create children of these heaps. The game's C++ wrappers for these heaps store a pointer to the name at +0x34, by default set to "NoName".

System Heaps
Name	Arena Low	Arena High	Parent	Description
EGGRootMEM1	OS initial MEM1 arena low	OS initial MEM1 arena high	-	The root expanded heap for the MEM1 memory card.
EGGRootMEM2	OS initial MEM2 arena low	OS initial MEM2 arena high	-	The root expanded heap for the MEM2 memory card.
EGGRootDebug	End of retail MEM2 region and start of development unit additional MEM2 memory	End of development unit additional MEM2 memory	-	Only created if the MEM2 arena size is greater than 0x4000000. If it is, the size will be 0x3B50000.
EGGSystem	-	-	EGGRootMEM1	The heap used for the game's EGG system (akin to JSystem) tasks.

The heaps are areas of memory dedicated to storing different types of variables. For example, there is a heap dedicated to storing copies of files from the game disk. There is also a heap dedicated to storing all variables related to a race during a race. The main advantage of this approach seems to be the ability to delete entire heaps very efficiently. For example, when a race ends, the game simply deletes the race data heap and all the variables that were part of it disappear. This can make analysing the game's memory usage somewhat more confusing as there may never be a call to free corresponding to a call to malloc.

One useful aspect of the Wii library's version of malloc is that it leaves a small header before each allocation, which it uses to track the allocated memory. Subtracting 0x10 from the pointer returned by a heap allocation gives the start of the header. The format of the block is as follows:

Allocated memory block
Offset	Type	Description
0x00	UInt16	Magic, always 0x5544 or UD in ASCII.
0x02	UInt16	Flags, unkown meanings.
0x04	UInt32	N Length of the block in bytes.
0x08	UInt32	Pointer to previous block.
0x0C	UInt32	Pointer to next block.
0x10	UInt8[N]	N bytes of data.
N + 0x10		End of this block, start of next block at next 8 byte aligned offset

Because of the pointers, the blocks form a doubly linked list. If a block is freed, it is not necessarily deleted immediately, but the pointers are corrected so that they point to the next and previous undeleted but free block. Thus, there is a linked list of freed blocks and a linked list of used blocks. The game seems to reuse freed blocks if another allocation of the same size is made.

The heaps also have a header, which seems to begin with the magic 0x4652 or FR in ASCII, though nothing further is no about it.

Multithreading

See also Wikipedia's Thread article

Like many modern programs, Mario Kart Wii makes heavy use of multithreading. This gives the illusion to the programmer that multiple lines of code are executing simultaneously. Because the Wii's processor is single core, this is impossible. Instead the library starts a timer (the decrementer) when each section of code starts running. When the timer expires, the processor automatically interrupts the code (decrementer interrupt) and branches to another piece of code. This happens sufficiently quickly that all the threads make progress. If a thread has nothing to do, it can voluntarily suspend itself in order to another thread chance to work. This is common for threads that perform background tasks, such as loading from the disk.

Each thread in the game has a priority. The game's library executes only the highest priority thread(s) available until they all run out of work. It then considers the next priority level. The highest priority level is 0 and the lowest is 31.

The library seems to make a distinction between two related concepts: a thread and a context. Every thread has a context, but not every context has a thread. Contexts are used for tiny pieces of work such as interrupt handlers. The library always seems to execute any threadless contexts first if possible, as if they are high priority threads. Contexts cannot call certain functions which require full blown threads, such as some sleep routines. A point to the current executing context is always stored at 0x800000d4.

The context structure in memory is only updated when the context changes, in order to remember what to switch back to. The format of a context in memory is as follows:

OSContext data structure
Offset	Type	Description
0x000	UInt32[32]	A copy of each register in order (`r0` to `r31`).
0x080	UInt32	A copy of the condition register (`cr`).
0x084	UInt32	A copy of the link register (`lr`).
0x088	UInt32	A copy of the count register (`ctr`).
0x08c	UInt32	A copy of the integer exception register (`xer`).
0x090	Double[32]	A copy of each floating point register in order (`f0` to `f31`).
0x190	8 bytes	Unknown
0x198	UInt32[2]	A copy of the two save/restore reigsters in order (`srr0` and `srr1`)
0x1a0	2 bytes	Unknown
0x1a2	UInt16	Flags (0x1 is restore_fpr, 0x2 is restore_volatile_gpr, others unknown)
0x1a4	UInt32[8]	A copy of each graphics quantisation register in order (`gqr0` to `gqr7`).
0x1c4	4 bytes	Unknown
0x1c8	Float[64]	A copy of each floating point pair register in order (`f0` to `f31`).
0x2c8		End of this structure

A thread data structure is like an extended context:

OSThread data structure
Offset	Type	Description
0x000	OSContext	As above, a full context.
0x2c8	2 bytes	Unknown
0x2ca	SInt16	A copy of argument `r9` from `OSCreateThread`.
0x2cc	4 bytes	Unknown
0x2d0	SInt32	A thread priority value (0-31). Lower numbers equal higher priority. Usage unclear.
0x2d4	SInt32	A thread priority value (0-31). Lower numbers equal higher priority. Usage unclear.
0x2d8	12 bytes	Unknown
0x304	Pointer32	Highest address of the call stack (i.e. the bottom).
0x308	Pointer32	Lowest address of the call stack (i.e. the bottom).
0x30c	SInt32	C error number variable (`errno`).
0x310	8 bytes	Unknown
0x318		End of this structure

Linking and Loading

The game's code (along with all referenced libraries) was divided and compiled into two files: main.dol and StaticR.rel. main.dol is loaded first by the disc loader (typically the Wii Menu) and then StaticR.rel is dynamically loaded and linked during the health and safety screen. main.dol is statically linked, so always loads to the same location in memory. StaticR.rel has references to the static method addresses in main.dol, but is itself relocatable and has relocation information to assist with changing its loaded address. Despite this, due to the deterministic nature of the game's early code, StaticR.rel is always loaded to the same address in practice.

It is not clear why the game's code was split in this manner; other games such as The Legend of Zelda: Twilight Princess feature many more rel files, and dynamically load and unload them to save on memory by loading code only when needed. Mario Kart Wii, having only one REL file, does not do this. One potential explanation is that most of the game's specific logic (physics, UI, ...) is located in StaticR.rel, so perhaps the game's main.dol is intended to be reusable as an engine for different games. There is no evidence this was ever used as such however; even the Mario Kart Channel uses a different main.dol.