Geometry Transformation Engine (GTE) Part 2: In-Depth Functioning
First part of this article
In the first part of this article we went over the basics of what the GTE is on the PlayStation. If you have not read it yet, we encourage you to do so before continuing.
We will now delve into a more detailed example of what the GTE does and how it works before finishing with a quick note on why it is essential to implement it correctly.
How we modeled the GTE in our emulator
The GTE is implemented as COP2 (Coprocessor 2), which contains two register banks:
- Control registers: `m_ctrlReg[0..31]` (COP2C)
- Data registers: `m_dataReg[0..31]` (COP2D)
This matches how the CPU communicates with the GTE. Instead of addressing the GTE using memory-mapped IO, it uses coprocessor opcodes to move values in/out and execute commands.
Quick overview of the most used registers
This is not a full register table — it’s the subset that matters for understanding the code below.
Control registers:
- Rotation matrix base (RT): used by `RTPS`/`RTPT` commands and also by `OP` (diagonal elements)
- Translation vector (TR): `TRX`/`TRY`/`TRZ` (we extract them via `extractTranslation(5)` in the `RTPS` command)
- Projection parameters:
  - `OFX` = ctrl[24]
  - `OFY` = ctrl[25]
  - `H` = ctrl[26]
  - `DQA` = ctrl[27]
  - `DQB` = ctrl[28]
- Color pipeline:
  - `RFC`/`GFC`/`BFC` live around ctrl[21..23] in classic docs
  - Our code also uses ctrl[21..23] as interpolation targets in `INTPL` (see below)
Data registers:
- `IR1`/`IR2`/`IR3` = data[9..11] (intermediate vector)
- `SXY0`/`SXY1`/`SXY2` are stored as packed `(SY << 16) | SX` in data[12..14]
- `SZ3` used in RTPS is stored into data[19]
- `MAC0` often ends up in data[24] in our code (e.g. `NCLIP`, `AVSZ`)
- `MAC1/2/3` are written in data[25..27]
- Color FIFO head is packed into data[20] by `pushColorFIFO`
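To make the packed `(SY << 16) | SX` layout concrete, here is a minimal sketch of pack and unpack helpers. The function names are hypothetical and are not taken from our emulator; only the bit layout comes from the list above.

```cpp
#include <cstdint>

// Pack two signed 16-bit screen coordinates as (SY << 16) | SX.
// Going through uint16_t keeps a negative SX from smearing sign bits
// into the SY half of the word.
constexpr uint32_t packSXY(int16_t sx, int16_t sy) {
    return (static_cast<uint32_t>(static_cast<uint16_t>(sy)) << 16) |
           static_cast<uint32_t>(static_cast<uint16_t>(sx));
}

// Low halfword is SX; converting back to int16_t restores the sign.
constexpr int16_t unpackSX(uint32_t sxy) {
    return static_cast<int16_t>(static_cast<uint16_t>(sxy & 0xFFFFu));
}

// High halfword is SY.
constexpr int16_t unpackSY(uint32_t sxy) {
    return static_cast<int16_t>(static_cast<uint16_t>(sxy >> 16));
}
```

For example, `packSXY(-2, 3)` yields `0x0003FFFE`, and unpacking that word gives back `-2` and `3`.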
Fixed-point arithmetic in practice
The MIPS R3000A CPU can optionally host a floating-point unit (FPU) as COP1. We can picture the FPU as a specialized processor that only performs floating-point operations, but does so very quickly.
(Un)fortunately, the PS1 CPU doesn’t have such an FPU as its COP1. Sony may have left out a dedicated FPU for cost reasons. Instead, the GTE uses fixed-point arithmetic. This means that a value is divided into three sections: the sign bit, the integer bits and the fraction bits.
Depending on the type of value, the sections vary in size and can sometimes be unused.
Some common examples of fixed-point bit patterns:
| Usage | 16bit | 32bit |
|---|---|---|
| Matrix | 1-3-12 | - |
| Translation Vector | - | 1-31-0 |
| Background Color | - | 1-19-12 |
| Far Color | - | 1-27-4 |
| Average Z | 1-3-12 | - |
| Math Accumulators | - | 1-31-0 |
This is not an exhaustive list of bit patterns but it gives you a nice overview of how mathematical objects are stored and used on the system.
In our implementation, most math is performed on `int64_t` values, which are then optionally shifted right according to the `sf` flag.
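As an illustration of the 1-3-12 format, the sketch below interprets a raw matrix element as a real number and multiplies it by a vector component in 64 bits. The helper names are hypothetical and only meant to show the arithmetic, not our actual code.

```cpp
#include <cstdint>

// Number of fraction bits in the 1-3-12 matrix format.
constexpr int kFractionBits = 12;

// Interpret a raw 1-3-12 value as a real number (handy when debugging).
constexpr double fromFixed12(int16_t raw) {
    return static_cast<double>(raw) / (1 << kFractionBits);
}

// Multiply a 1-3-12 matrix element by an integer vector component.
// The product is kept in 64 bits, then optionally shifted down by sf.
constexpr int64_t mulFixed(int16_t matrixElem, int16_t component, bool sf) {
    const int64_t product = static_cast<int64_t>(matrixElem) * component;
    return product >> (sf ? kFractionBits : 0);
}
```

With these helpers, `0x1000` reads as 1.0 and `0x0800` as 0.5, so `mulFixed(0x0800, 300, true)` gives 150, while the unshifted product is 614400.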
GTE Commands
Command Encoding
This is the command encoding extracted from the psx-spx documentation:
31-25 Must be 0100101b for "COP2 imm25" instructions
20-24 Ignored
19 sf - Shift Fraction in IR registers (0=No fraction, 1=12bit fraction)
17-18 MVMVA Multiply Matrix (0=Rotation. 1=Light, 2=Color, 3=Reserved)
15-16 MVMVA Multiply Vector (0=V0, 1=V1, 2=V2, 3=IR/long)
13-14 MVMVA Translation Vector (0=TR, 1=BK, 2=FC/Bugged, 3=None)
11-12 Always zero
10 lm - Saturate IR1,IR2,IR3 result (0=To -8000h..+7FFFh, 1=To 0..+7FFFh)
6-9 Always zero
0-5 Real GTE Command Number (00h..3Fh) (used by hardware)
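The bit fields above translate directly into shifts and masks. The following is a minimal decoding sketch; the `GteCommand` struct and the function names are assumptions made for illustration.

```cpp
#include <cstdint>

// Decoded fields of a "COP2 imm25" instruction (bit positions per the table).
struct GteCommand {
    bool    sf;    // bit 19: shift fraction
    uint8_t mx;    // bits 17-18: MVMVA matrix select
    uint8_t v;     // bits 15-16: MVMVA vector select
    uint8_t cv;    // bits 13-14: MVMVA translation select
    bool    lm;    // bit 10: IR lower saturation bound
    uint8_t func;  // bits 0-5: command number
};

// True when bits 31-25 hold 0100101b.
constexpr bool isGteCommand(uint32_t opcode) {
    return (opcode >> 25) == 0x25u;
}

constexpr GteCommand decodeGteCommand(uint32_t opcode) {
    return GteCommand{
        ((opcode >> 19) & 0x1u) != 0,
        static_cast<uint8_t>((opcode >> 17) & 0x3u),
        static_cast<uint8_t>((opcode >> 15) & 0x3u),
        static_cast<uint8_t>((opcode >> 13) & 0x3u),
        ((opcode >> 10) & 0x1u) != 0,
        static_cast<uint8_t>(opcode & 0x3Fu),
    };
}
```

Decoding `0x4A180001`, for instance, gives `sf = 1` and command number `0x01` (RTPS).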
“sf” Flag
In essence, the “sf” flag removes the fractional portion of the fixed-point value.
- `sf = 1` → shift bits to the right by 12 (effectively removing fraction bits)
- `sf = 0` → keep full precision
You’ll see this pattern everywhere:
int64_t mac = (...) >> (sf * 12);
If your test vectors are “in GTE scale” (S.12 fractional), then:
- `sf = 1` tends to bring results back to “normal IR scale”
- `sf = 0` can overflow earlier or saturate differently
The RTPS/RTPT docs explicitly mention saturation differences depending on sf and how IR flags behave.
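A small sketch shows why the same product can fit or saturate depending on `sf`. The helper names are hypothetical, and the clamp uses the `lm = 0` bounds from the encoding table.

```cpp
#include <cstdint>

// Apply the sf shift to a 64-bit intermediate result.
constexpr int64_t applySf(int64_t mac, bool sf) {
    return mac >> (sf ? 12 : 0);
}

// Saturate to the signed 16-bit IR range (-0x8000..+0x7FFF, i.e. lm = 0).
constexpr int32_t saturateIR(int64_t value) {
    return value > 0x7FFF    ? 0x7FFF
           : value < -0x8000 ? -0x8000
                             : static_cast<int32_t>(value);
}
```

Multiplying 1.0 by 1.0 in S.12 gives the raw product `4096 * 4096`. With `sf = 1` it shifts back to 4096 and fits in IR; with `sf = 0` the unshifted 16777216 saturates to `0x7FFF`.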
Coordinate Transformation (RTPS / RTPT)
One of the most important tasks of the GTE is transforming a 3D point from model space into screen space. The transformation pipeline is roughly:
Model space → World space → View space → Screen space
This is implemented in the GTE using matrix-vector multiplication and perspective division. The most common instruction for this is RTPS (Rotate, Translate, Perspective Single).
It performs:
- Rotation of a vertex using the rotation matrix
- Translation using the translation vector
- Perspective projection to screen coordinates
The output is written into the screen registers SX, SY and SZ, which can then be sent to the GPU to draw polygons.
RTPS Flow Schematic
Input vertex Vn (from data regs)      Projection params (ctrl regs)
┌────────────────────┐                ┌────────────────────────────┐
│ Vx, Vy, Vz (int16) │                │ OFX, OFY, H, DQA, DQB      │
└─────────┬──────────┘                └─────────────┬──────────────┘
          │                                         │
          v                                         v
┌──────────────────────────┐          ┌─────────────────────────┐
│ MAC1..3 = TR*4096 + R*V  │          │ projectScale = H / SZ3  │
│ then >> (sf*12)          │          │ clamped to 0..0x1FFFF   │
└────────────┬─────────────┘          └────────────┬────────────┘
             │                                     │
             v                                     v
┌────────────────────────────┐        ┌──────────────────────────────┐
│ IR1..3 = clamp(MAC1..3)    │        │ SX2,SY2 = (scale*IR + OF)/16 │
│ SZ3 = clamp(MAC3 >> ...)   │        │ packed into SXY2 (data[14])  │
└─────────────┬──────────────┘        └─────────────┬────────────────┘
              │                                     │
              v                                     v
         ┌─────────┐                       ┌─────────────────┐
         │ MAC regs│                       │ DQA/DQB depth   │
         └─────────┘                       │ push: data[24]  │
                                           │ IR0-ish: data[8]│
                                           └─────────────────┘
RTPT in our emulator
RTPT is simply “RTPS executed 3 times”:
for (uint8_t i = 0; i < 3; i++)
    executeRTPS(opcode, i);
This matches the conceptual meaning: transform V0, V1, V2 and update the screen/depth FIFOs.
Example (with “made-up but realistic” values)
Assume:
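- `sf = 1`
- rotation = identity
- translation = (0, 0, 0)
- `H = 0x100` (256)
- `OFX = OFY = 0`
Vertex:
- `V0 = (4096, 0, 8192)` (meaning X=1.0, Y=0.0, Z=2.0 if using 12 fractional bits)
Then:
- `MAC3 ≈ 8192 >> 12 = 2` (this takes the rotation diagonal as the raw value 1; with a 1-3-12 identity, where the diagonal is `0x1000`, the product `4096 * 8192` shifts down to `MAC3 = 8192` instead)
- `SZ3` becomes small → in our code if `SZ3 <= H/2` we clamp and set a flag, and use max scale.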
This is exactly why fixed-point + threshold logic matters: “near camera” and “behind camera” paths are very sensitive.
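The Z path of this example can be sketched in a few lines. This is a simplified model, not our `executeRTPS`: it keeps only the `RT33 * VZ` term (the other terms vanish for an identity rotation), and the quotient uses the plain formula `((H * 0x20000 / SZ3) + 1) / 2` from psx-spx rather than the hardware's division algorithm. The struct and function names are hypothetical.

```cpp
#include <cstdint>

struct DepthResult {
    int64_t  mac3;
    uint16_t sz3;
    uint32_t scale;           // H / SZ3 as a 16.16 quotient, max 0x1FFFF
    bool     divideOverflow;  // true when SZ3 <= H/2
};

constexpr DepthResult projectDepth(int16_t rt33, int16_t vz, int32_t trz,
                                   uint16_t h, bool sf) {
    // MAC3 = (TRZ * 4096 + RT33 * VZ) >> (sf * 12)
    const int64_t sum = static_cast<int64_t>(trz) * 4096 +
                        static_cast<int64_t>(rt33) * vz;
    const int64_t mac3 = sum >> (sf ? 12 : 0);
    // SZ3 always drops the 12 fraction bits, whatever sf says.
    const int64_t z = sf ? mac3 : (mac3 >> 12);
    const uint16_t sz3 =
        static_cast<uint16_t>(z < 0 ? 0 : (z > 0xFFFF ? 0xFFFF : z));
    if (sz3 <= h / 2) {
        // Too close to (or behind) the camera: flag it and use the max scale.
        return DepthResult{mac3, sz3, 0x1FFFFu, true};
    }
    const uint64_t q = ((static_cast<uint64_t>(h) * 0x20000u) / sz3 + 1) / 2;
    return DepthResult{mac3, sz3,
                       static_cast<uint32_t>(q > 0x1FFFFu ? 0x1FFFFu : q),
                       false};
}
```

With a raw diagonal of 1, `projectDepth(1, 8192, 0, 0x100, true)` gives `SZ3 = 2` and takes the overflow path. With a 1-3-12 identity (`0x1000`), the same vertex gives `SZ3 = 8192` and a scale of 2048, which is 256/8192 in 16.16. The two readings of "identity" land on opposite sides of the threshold.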
Normal Clipping (NCLIP)
NCLIP is used to help decide if a triangle is front-facing or back-facing in screen space.
Our code:
- reads `SXY0` = data[12], `SXY1` = data[13], `SXY2` = data[14]
- extracts each `(sx, sy)` from the packed words
- computes:
MAC0 = (SX0*SY1 + SX1*SY2 + SX2*SY0) - (SX0*SY2 + SX1*SY0 + SX2*SY1)
and stores it into MAC0 (we write it to data[24]). This matches official training material and references.
Schematic
          SXY0,     SXY1,     SXY2
            │         │         │
            v         v         v
Extract (sx0,sy0) (sx1,sy1) (sx2,sy2)
                      │
                      v
      MAC0 = 2 × oriented area (signed)
                      │
                      v
               data[24] = MAC0
If MAC0 is:
- positive → one winding
- negative → the other winding
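The formula can be checked with a small standalone function. This is a sketch with hypothetical names; it only assumes the packed `(SY << 16) | SX` layout described earlier.

```cpp
#include <cstdint>

// Signed low / high halfword of a packed SXY word.
constexpr int16_t lowHalf(uint32_t w) {
    return static_cast<int16_t>(static_cast<uint16_t>(w & 0xFFFFu));
}
constexpr int16_t highHalf(uint32_t w) {
    return static_cast<int16_t>(static_cast<uint16_t>(w >> 16));
}

// MAC0 for NCLIP: twice the signed area of the screen-space triangle.
constexpr int64_t nclip(uint32_t sxy0, uint32_t sxy1, uint32_t sxy2) {
    const int64_t sx0 = lowHalf(sxy0), sy0 = highHalf(sxy0);
    const int64_t sx1 = lowHalf(sxy1), sy1 = highHalf(sxy1);
    const int64_t sx2 = lowHalf(sxy2), sy2 = highHalf(sxy2);
    return (sx0 * sy1 + sx1 * sy2 + sx2 * sy0) -
           (sx0 * sy2 + sx1 * sy0 + sx2 * sy1);
}
```

For the triangle (0,0), (10,0), (0,10) the result is 100, twice its area of 50. Swapping the last two vertices flips the sign to -100, which is the winding test in action.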
Cross Product (OP)
In PSX docs this opcode is sometimes called “Outer Product”, but it’s effectively the cross product in practice.
In our emulator:
- Vector A = `(IR1, IR2, IR3)`
- Vector B = `(D1, D2, D3)` where:
  - `D1 = RT11`
  - `D2 = RT22`
  - `D3 = RT33` (diagonal of rotation matrix)
Then:
MAC1 = IR3*D2 - IR2*D3
MAC2 = IR1*D3 - IR3*D1
MAC3 = IR2*D1 - IR1*D2
Then we apply >> (sf*12), store into MAC1..3 and clamp into IR1..3. This matches the spec note about “misusing RT diagonal as a vector”.
Example
Let:
- `IR = (1000, 2000, 3000)`
- `D = (1, 2, 3)` (from RT11/RT22/RT33)
- `sf = 0` (no shift)
Then:
- `MAC1 = 3000*2 - 2000*3 = 6000 - 6000 = 0`
- `MAC2 = 1000*3 - 3000*1 = 3000 - 3000 = 0`
- `MAC3 = 2000*1 - 1000*2 = 2000 - 2000 = 0`
So the result is the zero vector, because IR is an exact multiple of D (IR = 1000 × D) and the cross product of parallel vectors is zero. This is a great unit-test shape: it’s easy to verify, and it stresses “do we read D from the correct matrix slots?”.
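Here is a minimal sketch of the three MAC formulas, useful for building such test vectors. The `Vec3` type and function name are hypothetical, and the IR clamp is left out to keep the focus on the products.

```cpp
#include <cstdint>

struct Vec3 { int64_t x, y, z; };

// OP: MAC = (IR3*D2 - IR2*D3, IR1*D3 - IR3*D1, IR2*D1 - IR1*D2) >> (sf*12)
constexpr Vec3 outerProduct(Vec3 ir, Vec3 d, bool sf) {
    const int shift = sf ? 12 : 0;
    return Vec3{
        (ir.z * d.y - ir.y * d.z) >> shift,
        (ir.x * d.z - ir.z * d.x) >> shift,
        (ir.y * d.x - ir.x * d.y) >> shift,
    };
}
```

The parallel pair from the example returns (0, 0, 0). A non-parallel pair such as `IR = (1, 0, 0)`, `D = (0, 5, 7)` returns (0, 7, -5), so a test like this also notices when two diagonal slots are swapped.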
Square Vector (SQR)
Our SQR:
- reads IR1..3
- squares each component
- applies `>> (sf*12)`
- stores into MAC1..3 and clamps into IR1..3
This instruction is often used as a building block for vector-length calculations (the squared magnitude is the sum of the three results).
Example
Let:
- `IR = (200, -300, 400)`
- `sf = 0`
Then:
- `MAC1 = 200*200 = 40000` → exceeds `0x7FFF` (32767), so IR1 saturates
- `MAC2 = (-300)*(-300) = 90000` → IR2 saturates
- `MAC3 = 400*400 = 160000` → IR3 saturates
This is exactly the kind of instruction where saturation behavior defines whether geometry “explodes” or stays stable.
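The example above can be reproduced per component with the sketch below. The names are hypothetical; since a square is never negative, only the upper IR bound can be hit.

```cpp
#include <cstdint>

struct SqrResult {
    int64_t mac;        // unsaturated square, after the sf shift
    int32_t ir;         // value written back to IR
    bool    saturated;  // true when IR hit 0x7FFF
};

// Square one IR component, apply the sf shift, then saturate into IR.
constexpr SqrResult sqrComponent(int16_t ir, bool sf) {
    const int64_t mac = (static_cast<int64_t>(ir) * ir) >> (sf ? 12 : 0);
    const bool saturated = mac > 0x7FFF;
    return SqrResult{mac, static_cast<int32_t>(saturated ? 0x7FFF : mac),
                     saturated};
}
```

With `sf = 0` all three components of the example saturate to `0x7FFF`. With `sf = 1`, `200*200 >> 12` is 9 and nothing saturates, which is why the two modes have to be tested separately.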
Average Z (AVSZ3 / AVSZ4)
The GPU uses an “ordering table” approach, and the GTE provides helpers to compute an average depth:
- `AVSZ3` averages 3 Z values
- `AVSZ4` averages 4 Z values
In our emulator we share code:
- Sum SZ values from the SZ FIFO (we iterate starting at SZ3 and going backward)
- Multiply by `ZSF3` or `ZSF4` (read from a control register)
- Shift down by 12
- Clamp to `[0 .. 0xFFFF]`
- Store into OTZ (lower halfword of data[7])
- Store MAC0 for inspection/debug
This aligns with the idea of “scaled average depth”, even if real hardware has a lot of nuance around FIFO ordering and flags.
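The steps above can be sketched as a shared helper with two thin wrappers. The names and the `AvszResult` struct are hypothetical; FIFO handling and FLAG updates are deliberately left out.

```cpp
#include <cstdint>

struct AvszResult {
    int64_t  mac0;  // ZSF * sum, before the shift
    uint16_t otz;   // (mac0 >> 12) clamped to 0..0xFFFF
};

// Shared part: scale the Z sum, shift down by 12, clamp into OTZ.
constexpr AvszResult scaleAverage(int64_t szSum, int16_t zsf) {
    const int64_t mac0 = szSum * zsf;
    const int64_t shifted = mac0 >> 12;
    const uint16_t otz = static_cast<uint16_t>(
        shifted < 0 ? 0 : (shifted > 0xFFFF ? 0xFFFF : shifted));
    return AvszResult{mac0, otz};
}

constexpr AvszResult avsz3(uint16_t sz1, uint16_t sz2, uint16_t sz3,
                           int16_t zsf3) {
    return scaleAverage(static_cast<int64_t>(sz1) + sz2 + sz3, zsf3);
}

constexpr AvszResult avsz4(uint16_t sz0, uint16_t sz1, uint16_t sz2,
                           uint16_t sz3, int16_t zsf4) {
    return scaleAverage(static_cast<int64_t>(sz0) + sz1 + sz2 + sz3, zsf4);
}
```

With `ZSF3 = 0x555` (about 1/3 in S.12), depths 100, 200 and 300 give `OTZ = 199` rather than 200, because 0x555 is slightly below one third. With `ZSF4 = 0x400` (exactly 1/4), depths 100 through 400 give exactly 250.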
General matrix and vector math (MVMVA)
MVMVA is the workhorse “matrix * vector + translation” command.
In our code, it is driven by opcode flags:
- `mx` selects which matrix to use
- `v` selects which vector source to use
- `cv` selects which translation base to apply
- `sf` controls shift (`>> 12` or not)
- `lm` changes the lower clamp bound for IR (either `-0x8000` or `0`)
We:
- build the matrix (special case when `mx == 3`, otherwise extract it)
- fetch the selected vector
- fetch the translation vector
- compute MAC1..3
- clamp into IR1..3 with a lower bound that depends on `lm`
This instruction is often used for:
- transforming normals
- transforming light directions
- “small pipelines” inside other operations
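As a rough model of the core computation, the sketch below takes an already selected matrix, vector and translation and produces MAC and IR. All type and function names are hypothetical, the translation is scaled by 4096 as in the RTPS schematic, and the `mx == 3` special case and FLAG updates are omitted.

```cpp
#include <cstdint>

struct Vec3 { int64_t x, y, z; };
struct Mat3 { int16_t m[3][3]; };
struct MvmvaResult { Vec3 mac; Vec3 ir; };

// 1-3-12 identity matrix (1.0 == 0x1000).
constexpr Mat3 kIdentity = {{{0x1000, 0, 0}, {0, 0x1000, 0}, {0, 0, 0x1000}}};

// Saturate into IR; lm raises the lower bound from -0x8000 to 0.
constexpr int64_t clampIR(int64_t v, bool lm) {
    return v > 0x7FFF ? 0x7FFF
                      : (v < (lm ? 0 : -0x8000) ? (lm ? 0 : -0x8000) : v);
}

// One output row: (T * 4096 + M[r] . V) >> (sf * 12)
constexpr int64_t row(const Mat3& mx, int r, const Vec3& v, int64_t tr,
                      bool sf) {
    return (tr * 4096 + mx.m[r][0] * v.x + mx.m[r][1] * v.y +
            mx.m[r][2] * v.z) >> (sf ? 12 : 0);
}

constexpr MvmvaResult mvmva(const Mat3& mx, const Vec3& v, const Vec3& tr,
                            bool sf, bool lm) {
    const Vec3 mac{row(mx, 0, v, tr.x, sf), row(mx, 1, v, tr.y, sf),
                   row(mx, 2, v, tr.z, sf)};
    return MvmvaResult{
        mac, Vec3{clampIR(mac.x, lm), clampIR(mac.y, lm), clampIR(mac.z, lm)}};
}
```

With the identity matrix, `V = (100, -200, 300)`, `T = (10, 20, 30)` and `sf = 1`, MAC comes out as (110, -180, 330). IR matches MAC when `lm = 0`, while `lm = 1` clamps the negative component to 0.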
Lighting and Color (executeNColor pipeline)
This is the most “emulator-specific” part of this article. It’s not just “what the GTE does” but also how we structured our implementation to cover a family of opcodes with one shared pipeline function.
The GTE has multiple color-related commands:
- NCS / NCT -> Normal Color Single/Triple
- NCCS / NCCT -> Normal Color Color Single/Triple
- NCDS / NCDT -> Normal Color Depth Cue Single/Triple
- CC -> Color Color
- CDP -> Color Depth
The PSX-SPX docs group these into “color calculation commands” with shared concepts:
- background color BK
- color matrix LCM
- far color FC
- interpolation factor IR0
- and the RGB FIFO
Our pipeline switches
Our function:
executeNColor(normal, sf, isNormal, color, depth)
interprets flags like this:
- `isNormal == true` → do a first stage “LLM * normal” to produce IR
- `color == true` → apply a color multiplication stage (think NCCx/CC)
- `depth == true` → apply a depth-cue interpolation stage (think NCDx/CDP)
Pipeline schematic
(Option A) Normal transform stage (if isNormal)
    normal (V0/V1/V2) --[LLM]--> MAC1..3 --> clamp --> IR1..3

(Always) Background + ColorMatrix stage
    BK + [LCM]*IR --> MAC1..3 --> clamp --> IR1..3

(Option B) Color and/or Depth stage (if color||depth)
    IR scaled/colored --> MAC1..3
    if depth: interpolate with IR0 (lerp toward "Far Color")
    shift >> (sf*12) --> clamp --> IR1..3

(Final) Pack to RGB and push FIFO
    RGB = clamp(MAC >> 4) into [0..255]
    push into color FIFO, update data[20]
FIFO behavior in our code
We model a 3-entry FIFO:
Rgbc m_colorFIFO[3]; // RGB0, RGB1, RGB2
When we push:
- RGB0 <- RGB1
- RGB1 <- RGB2
- RGB2 <- new
And we pack the newest into data[20] as:
(CODE << 24) | (R << 16) | (G << 8) | (B)
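The push and pack steps can be sketched as two small functions. The `Rgbc` field names and the value-returning `pushColor` are assumptions for illustration; our `pushColorFIFO` mutates member state instead. Note that the byte order follows the packing shown above, whereas psx-spx lists the RGBC register with red in the lowest byte, so treat this layout as our emulator's convention.

```cpp
#include <cstdint>

struct Rgbc { uint8_t r, g, b, code; };

struct ColorFifo { Rgbc entries[3]; };  // RGB0, RGB1, RGB2

// Shift the FIFO by one entry and append the newest color at RGB2.
constexpr ColorFifo pushColor(ColorFifo fifo, Rgbc newest) {
    fifo.entries[0] = fifo.entries[1];
    fifo.entries[1] = fifo.entries[2];
    fifo.entries[2] = newest;
    return fifo;
}

// Pack as (CODE << 24) | (R << 16) | (G << 8) | B.
constexpr uint32_t packRgbc(Rgbc c) {
    return (static_cast<uint32_t>(c.code) << 24) |
           (static_cast<uint32_t>(c.r) << 16) |
           (static_cast<uint32_t>(c.g) << 8) |
           static_cast<uint32_t>(c.b);
}
```

After pushing two colors into an empty FIFO, the first sits in RGB1 and the second in RGB2, and RGB0 still holds the initial zero entry.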
Interpolation helpers (DPCS / DPCT / DCPL / INTPL / GPF / GPL)
These are “color math building blocks” in the GTE.
In our implementation:
- `DPCS` builds a MAC vector from the current `RGBC` and calls `INTPL`
- `DPCT` applies that to 3 consecutive RGB values (loop)
- `DCPL` multiplies `RGBC` components by IR and then calls `INTPL`
- `INTPL` performs a 3-channel interpolation using:
  - current MAC values
  - target colors (ctrl[21..23] in our code)
  - interpolation factor `data[8]`

  It then pushes the result to the color FIFO.
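For one channel, the interpolation can be sketched as below. This follows the `MAC + (FC - MAC) * IR0` description in psx-spx and is not a copy of our `INTPL`; the function names are hypothetical, `mac` is expected to already carry 12 fraction bits, and FLAG updates are left out.

```cpp
#include <cstdint>

// Saturate to the signed 16-bit IR range (lm = 0 bounds).
constexpr int64_t clampIR(int64_t v) {
    return v > 0x7FFF ? 0x7FFF : (v < -0x8000 ? -0x8000 : v);
}

// One channel: move `mac` toward `target` by the factor ir0
// (S.12, so 0x1000 == 1.0).
constexpr int64_t interpolateChannel(int64_t mac, int32_t target, int16_t ir0,
                                     bool sf) {
    const int shift = sf ? 12 : 0;
    // Distance to the target, saturated like an IR register.
    const int64_t delta =
        clampIR((static_cast<int64_t>(target) * 4096 - mac) >> shift);
    return (delta * ir0 + mac) >> shift;
}
```

Starting from `mac = 1000 << 12` with a target of 2000 and `sf = 1`, a factor of 0 returns 1000, `0x800` (0.5) returns 1500, and `0x1000` (1.0) returns 2000.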
Separately:
- `GPF` and `GPL` use IR * IR0-ish scaling with optional base addition, then push to FIFO
This is the kind of thing that makes the GTE feel like a “graphics DSP” rather than a pure transform unit.
Why The GTE Matters For Emulation
The GTE is responsible for almost all of the 3D math for the PlayStation.
If the GTE is inaccurate, you will see:
- Warped geometry
- Broken lighting
- Incorrect depth
- Flickering or exploding polygons
An accurate emulator must reproduce:
- Instruction timing and pipeline semantics
- Register saturation rules (especially IR and RGB paths)
- Fixed-point precision and `sf` behavior
- Flag behavior (FLAG is not “optional”, games do depend on it)
The GTE is one of the hardest parts of a PlayStation emulator to implement correctly, but also one of the most rewarding.
Implementation notes (based on our current code)
This section is here because emulator blogs are most useful when they document real decisions and real gotchas.
1) NCS decode currently calls NCCS
In decodeAndExecute:
case GTEFunction::NCS: {
    bool sf = (opcode >> 19) & 0x1;
    executeNCCS(sf);
    break;
}
This means NCS (normal color single) is currently routed to the NCCS path (normal color color single). If you see unexpected extra “color multiplication behavior” in games, this is a prime suspect.
2) Saturation is subtle (and games care)
No$PSX docs emphasize that:
- MAC registers themselves are generally not saturated
- IR/RGB saturations set FLAG bits
- some commands treat lm differently (RTPS/RTPT behave as if lm=0)
When debugging rendering bugs, always verify:
- where you clamp (MAC vs IR vs RGB)
- whether you set flags on “pre-shift” or “post-shift” values
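One way to keep the clamp and the flag update together is to return both from a single helper, as sketched below. The bit positions (24, 23, 22 for IR1..IR3) are the ones listed in psx-spx for the FLAG register; the struct and function names are hypothetical and this is not how our code is currently organized.

```cpp
#include <cstdint>

// FLAG bits for IR saturation, as listed in psx-spx.
constexpr uint32_t kFlagIR1 = 1u << 24;
constexpr uint32_t kFlagIR2 = 1u << 23;
constexpr uint32_t kFlagIR3 = 1u << 22;

struct Saturated {
    int32_t  value;  // value to store in the IR register
    uint32_t flag;   // bit to OR into FLAG (0 when nothing was clamped)
};

// Clamp `v` into the IR range selected by lm and report which flag to set.
// Pass the post-shift MAC value here; MAC itself stays unclamped.
constexpr Saturated saturateIR(int64_t v, bool lm, uint32_t flagBit) {
    const int64_t lo = lm ? 0 : -0x8000;
    const int64_t hi = 0x7FFF;
    return v > hi   ? Saturated{static_cast<int32_t>(hi), flagBit}
           : v < lo ? Saturated{static_cast<int32_t>(lo), flagBit}
                    : Saturated{static_cast<int32_t>(v), 0u};
}
```

A caller would OR the returned `flag` into its FLAG register after each IR write, which makes it harder to clamp a value and forget the matching bit.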