How We Rasterized 3D Graphics Like It's 1994

The PlayStation 1 doesn’t have a modern GPU with programmable shaders. It has a fixed-function chip that draws flat and Gouraud-shaded triangles, textured quads, and lines — one pixel at a time — into a 1 MB block of video RAM. There’s no z-buffer, no perspective-correct textures, and no floating-point math.

In this article, we’ll walk through how we implemented a software rasterizer for the PS1 GPU in our emulator, RogEm. We’ll cover how VRAM works, how the GPU receives and processes drawing commands, how triangles are filled pixel by pixel, and how textures are sampled from palettized formats that predate PNG.

A Framebuffer Made of Raw Pixels

Before we draw anything, we need somewhere to draw it. The PS1 GPU has 1 megabyte of Video RAM (VRAM), organized as a 1024×512 pixel grid. Each pixel is 16 bits in ABGR1555 format — 5 bits per color channel and 1 bit for a semi-transparency mask:

Bit:  15  14───10   9───5   4───0
       │    Blue    Green    Red
       └─ Mask bit (semi-transparency)

That’s 32,768 possible colors — a far cry from today’s 16 million, but enough to render Crash Bandicoot.
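As a quick illustration, unpacking one of those 16-bit words back into its channels is just shifts and masks (a minimal sketch; the struct name is ours, not from the emulator):

```cpp
#include <cstdint>

struct Abgr1555 {
    uint8_t r, g, b;  // 5-bit values, 0-31
    bool mask;        // semi-transparency bit
};

// Decode a raw 16-bit VRAM pixel into its channels.
inline Abgr1555 decodePixel(uint16_t px)
{
    return {
        static_cast<uint8_t>(px & 0x1F),          // bits 0-4:   red
        static_cast<uint8_t>((px >> 5) & 0x1F),   // bits 5-9:   green
        static_cast<uint8_t>((px >> 10) & 0x1F),  // bits 10-14: blue
        (px & 0x8000) != 0                        // bit 15:     mask
    };
}
```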

In our emulator, VRAM is a flat byte array:

std::array<uint8_t, 1024 * 2 * 512> m_vram;  // 1,048,576 bytes

Writing a single pixel means computing a byte offset and storing two bytes (little-endian):

void GPU::setPixel(const Vec2i &pos, uint16_t color)
{
    int index = (pos.y * 1024 + pos.x) * 2;
    m_vram[index]     = color & 0xFF;
    m_vram[index + 1] = color >> 8;
}

VRAM isn’t just for the framebuffer though — it also stores textures and color palettes (CLUTs). The game and the GPU share this space, which means that a game’s textures, its display output, and its rendering target all live in the same flat 1 MB buffer. The game is responsible for laying things out so they don’t overlap.
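Reading a pixel back is the mirror image of setPixel; the texture-sampling code later in this post leans on a getPixel along these lines (a standalone sketch of the read path, not the emulator's exact signature):

```cpp
#include <array>
#include <cstdint>

struct Vec2i { int x, y; };

// Model of the VRAM read path: the two little-endian bytes at
// (y * 1024 + x) * 2 form one 16-bit ABGR1555 pixel.
uint16_t getPixelFrom(const std::array<uint8_t, 1024 * 2 * 512> &vram,
                      const Vec2i &pos)
{
    int index = (pos.y * 1024 + pos.x) * 2;
    return static_cast<uint16_t>(vram[index]) |
           (static_cast<uint16_t>(vram[index + 1]) << 8);
}
```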

block-beta
  columns 4
  block:row1:4
    columns 3
    A["Framebuffer\n(display area)\n320×240 or 640×480"]
    B["Textures\n(4-bit, 8-bit, or 15-bit)"]
    C["CLUTs\n(Color\nLook-Up\nTables)"]
  end
  D["1024 × 512 pixels  =  1 MB VRAM (ABGR1555)"]:4

The PS1’s VRAM layout is entirely up to the game. There’s no hardware-enforced separation between the framebuffer and texture data. A buggy game could literally draw polygons over its own textures.

How Commands Reach the GPU

The CPU talks to the GPU through two IO ports:

  • GP0 (0x1F801810): rendering commands and VRAM data transfers
  • GP1 (0x1F801814): display control (resolution, video mode, display area)

When the CPU writes a 32-bit word to GP0, the top 3 bits tell the GPU what category of command it is:

| Bits 31–29 | Category | Examples |
|---|---|---|
| 000 | Misc | NOP, cache clear, fill rectangle |
| 001 | Draw Polygon | Triangles, quads (flat, Gouraud, textured) |
| 010 | Draw Line | Lines, polylines (flat or Gouraud) |
| 011 | Draw Rectangle | Sprites (1×1, 8×8, 16×16, variable) |
| 100 | VRAM → VRAM | Copy regions within VRAM |
| 101 | CPU → VRAM | Upload texture/CLUT data |
| 110 | VRAM → CPU | Read back pixels |
| 111 | Environment | Set draw mode, texture window, draw area |
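Dispatching on the category is then a single shift (a sketch; the enum names are ours, mapped from the table above):

```cpp
#include <cstdint>

// Top 3 bits of a GP0 word select the command category.
enum class Gp0Category : uint8_t {
    Misc = 0, Polygon = 1, Line = 2, Rectangle = 3,
    VramToVram = 4, CpuToVram = 5, VramToCpu = 6, Environment = 7
};

inline Gp0Category categoryOf(uint32_t gp0Word)
{
    return static_cast<Gp0Category>(gp0Word >> 29);
}
```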

The GPU operates as a state machine with four states:

stateDiagram
    [*] --> WaitingForCommand
    WaitingForCommand --> ReceivingParameters : GP0 command word
    WaitingForCommand --> WaitingForCommand : Immediate command
    ReceivingParameters --> WaitingForCommand : All params received
    WaitingForCommand --> ReceivingData : CPU→VRAM transfer start
    ReceivingData --> WaitingForCommand : Transfer complete
    WaitingForCommand --> SendingData : VRAM→CPU transfer start
    SendingData --> WaitingForCommand : Transfer complete

Most draw commands require multiple 32-bit words. For example, a Gouraud-shaded textured triangle needs 9 words: vertex 0's color + command, vertex 0 position, texcoord 0 + CLUT info, then a color word, a position word, and a texcoord word for each of the two remaining vertices (the second texcoord word also carries the texture page). The GPU collects parameters one by one until it has enough, then fires the appropriate draw function.
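The collection logic can be modeled as a tiny buffer that knows how many words the current command still needs (a simplified sketch, not RogEm's actual class):

```cpp
#include <cstdint>

// Simplified GP0 parameter collection: the first word of a command
// fixes the total word count; feed() reports completion.
struct Gp0Collector {
    uint32_t words[16] = {};
    int have = 0;
    int need = 0;

    // Feed one GP0 word; returns true when the command is complete.
    bool feed(uint32_t w, int wordsForCommand)
    {
        if (have == 0)
            need = wordsForCommand;  // first word sets the budget
        words[have++] = w;
        if (have < need)
            return false;
        have = 0;                    // ready for the next command
        return true;
    }
};
```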

Decoding a Polygon Command

Let’s zoom into the most interesting command category: draw polygon. A single GP0 command word packs a lot of information:

 31──29  28   27   26   25   24   23───0
  001   Shade Quad Tex  Semi Raw  Color (R8G8B8)
         │     │    │     │    │
         │     │    │     │    └─ Raw texture (skip color modulation)
         │     │    │     └─ Semi-transparent blending
         │     │    └─ Textured polygon
         │     └─ 0 = Triangle (3 verts), 1 = Quad (4 verts)
         └─ 0 = Flat shading, 1 = Gouraud shading

From these 5 flag bits, the GPU knows exactly how many parameters to expect. The formula:

nbParams = nbVertices * (1 + shaded + textured) - shaded + 1;

This accounts for the fact that each vertex can have a position word, optionally a color word (Gouraud), and optionally a texture coordinate word. The first vertex’s color is embedded in the command word itself (hence the - shaded).

Here are a few concrete examples:

| Command | Shade | Quad | Tex | Params | Description |
|---|---|---|---|---|---|
| 0x20 | 0 | 0 | 0 | 4 | Flat-shaded triangle |
| 0x30 | 1 | 0 | 0 | 6 | Gouraud-shaded triangle |
| 0x2C | 0 | 1 | 1 | 9 | Flat-textured quad |
| 0x3C | 1 | 1 | 1 | 12 | Gouraud-textured quad |
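The formula and the table can be checked against each other directly (a quick sanity sketch):

```cpp
// Parameter-count formula from the article: each vertex has a position
// word, plus a color word if Gouraud (minus one, because the first
// vertex's color rides in the command word), plus a texcoord word if
// textured, plus the command word itself.
constexpr int nbParams(int nbVertices, bool shaded, bool textured)
{
    return nbVertices * (1 + (shaded ? 1 : 0) + (textured ? 1 : 0))
         - (shaded ? 1 : 0) + 1;
}

static_assert(nbParams(3, false, false) == 4,  "0x20 flat triangle");
static_assert(nbParams(3, true,  false) == 6,  "0x30 Gouraud triangle");
static_assert(nbParams(4, false, true)  == 9,  "0x2C flat textured quad");
static_assert(nbParams(4, true,  true)  == 12, "0x3C Gouraud textured quad");
```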

Rasterizing a Triangle: Edge Functions

Now we get to the heart of it: given three vertices on screen, how do we fill in the pixels?

We use the edge function method, which is a classic technique based on the cross product. For a triangle defined by vertices A, B, C, the edge function for edge AB evaluated at point P is:

\[E_{AB}(P) = (P_x - A_x)(B_y - A_y) - (P_y - A_y)(B_x - A_x)\]

If P is on the left side of the directed edge A→B, the result is positive. If it’s on the right, it’s negative. If P is on the edge itself, it’s zero.

A point P is inside the triangle if and only if all three edge functions have the same sign (all positive for counter-clockwise winding, all negative for clockwise):

static int edgeFunction(const Vec2i& a, const Vec2i& b, const Vec2i& c)
{
    return (c.x - a.x) * (b.y - a.y) - (c.y - a.y) * (b.x - a.x);
}

The algorithm is straightforward:

flowchart TD
    A["Compute bounding box of the 3 vertices\n(clamped to VRAM: 0–1023 × 0–511)"] --> B["Compute triangle area via edgeFunction(v0, v1, v2)\nIf area == 0 → degenerate, skip"]
    B --> C["For each pixel (x, y) in bounding box:"]
    C --> D["Compute w0, w1, w2\n(edge function for each edge)"]
    D --> E{"All ≥ 0\nor all ≤ 0?"}
    E -- "No" --> C
    E -- "Yes (inside)" --> F["Compute barycentric coords:\nα = w0/area, β = w1/area, γ = w2/area"]
    F --> G["Interpolate color, UVs\nSample texture if needed\nsetPixel()"]
    G --> C

In code, the core loop looks like this:

int area = edgeFunction(v0, v1, v2);
if (area == 0) return;  // degenerate triangle

for (int y = minY; y <= maxY; y++) {
    for (int x = minX; x <= maxX; x++) {
        Vec2i p = {x, y};
        int w0 = edgeFunction(v1, v2, p);
        int w1 = edgeFunction(v2, v0, p);
        int w2 = edgeFunction(v0, v1, p);

        // Accept both winding orders
        if ((w0 >= 0 && w1 >= 0 && w2 >= 0) ||
            (w0 <= 0 && w1 <= 0 && w2 <= 0))
        {
            float alpha = (float)w0 / area;
            float beta  = (float)w1 / area;
            float gamma = (float)w2 / area;
            // ... shade, texture, write pixel
        }
    }
}

We accept both clockwise and counter-clockwise winding orders. The PS1 doesn’t cull back-faces at the rasterizer level — that’s the GTE’s job via the NCLIP command (see our GTE article). Games that use back-face culling check NCLIP and simply don’t send the polygon to the GPU.

What About Quads?

PS1 games love quads. The GPU handles them by splitting each quad into two triangles:

void GPU::rasterizePoly4(const Vertex *verts, ...)
{
    rasterizePoly3(verts, ...);       // Triangle: v0, v1, v2
    rasterizePoly3(verts + 1, ...);   // Triangle: v1, v2, v3
}

Simple, effective, and exactly how the real hardware does it.

From Flat Colors to Gouraud Shading

Flat Shading

The simplest case: one color for the entire polygon. The color is packed in the first command word as 8-bit RGB:

ColorRGBA color;
color.r = command & 0xFF;
color.g = (command >> 8) & 0xFF;
color.b = (command >> 16) & 0xFF;

Every pixel in the triangle gets this exact same color, converted to ABGR1555 and written to VRAM.
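The 8-to-5-bit conversion simply drops the low three bits of each channel (a minimal sketch; the mask bit is left clear here):

```cpp
#include <cstdint>

// Pack 8-bit RGB into ABGR1555: keep each channel's top 5 bits.
// Bit 15 (the semi-transparency mask) is not set in this sketch.
inline uint16_t toAbgr1555(uint8_t r, uint8_t g, uint8_t b)
{
    return static_cast<uint16_t>((r >> 3)
         | ((g >> 3) << 5)
         | ((b >> 3) << 10));
}
```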

Gouraud Shading

When bit 28 is set, each vertex carries its own color. The GPU interpolates between them using the barycentric coordinates we already computed:

ColorRGBA interpolateColor(const ColorRGBA& c0, const ColorRGBA& c1,
                           const ColorRGBA& c2,
                           float alpha, float beta, float gamma)
{
    ColorRGBA color;
    color.r = static_cast<uint8_t>(c0.r * alpha + c1.r * beta + c2.r * gamma);
    color.g = static_cast<uint8_t>(c0.g * alpha + c1.g * beta + c2.g * gamma);
    color.b = static_cast<uint8_t>(c0.b * alpha + c1.b * beta + c2.b * gamma);
    return color;
}

This creates smooth color gradients across the polygon surface — essential for the PS1’s characteristic look.

graph LR
    subgraph "Flat Shading"
        F["All pixels = same color\n■■■■\n■■■■\n■■■■"]
    end
    subgraph "Gouraud Shading"
        G["Color interpolated per-pixel\n🔴🟠🟡\n🟠🟡🟢\n🟡🟢🔵"]
    end

Texture Mapping: Palettes and Pixel Packing

PS1 textures don’t work like modern textures. They come in three color depths, and two of them use color look-up tables (CLUTs) — essentially palettes, just like a GIF image.

The Three Texture Modes

| Mode | Bits/pixel | Texels per 16-bit VRAM word | Needs CLUT? |
|---|---|---|---|
| 4-bit | 4 | 4 texels packed | Yes (16-color palette) |
| 8-bit | 8 | 2 texels packed | Yes (256-color palette) |
| 15-bit | 16 | 1 texel direct | No (direct ABGR1555) |

For 4-bit textures, four pixels are squeezed into a single 16-bit VRAM word. To extract a single texel, we need bit-shifting:

// 4-bit mode: 4 texels per 16-bit word
uint16_t texData = getPixel({texPageBaseX + u / 4, texPageBaseY + v});
uint8_t index = (texData >> ((u % 4) * 4)) & 0xF;
return getPixel({clutX + index, clutY});  // CLUT lookup

For 8-bit textures, two pixels share one VRAM word:

// 8-bit mode: 2 texels per 16-bit word
uint16_t texData = getPixel({texPageBaseX + u / 2, texPageBaseY + v});
uint8_t index = (texData >> ((u % 2) * 8)) & 0xFF;
return getPixel({clutX + index, clutY});

For 15-bit, it’s a direct read — no palette indirection.
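Putting the three modes together, a sampler just branches on color depth. This condensed sketch takes the pixel-fetch as a callback so it stands alone; the parameter names mirror the snippets above, but the function itself is ours:

```cpp
#include <cstdint>
#include <functional>

struct Vec2i { int x, y; };
using FetchFn = std::function<uint16_t(Vec2i)>;

// depth: 0 = 4-bit CLUT, 1 = 8-bit CLUT, 2 = 15-bit direct.
uint16_t sampleTexel(const FetchFn &getPixel, int depth,
                     int pageX, int pageY, int clutX, int clutY,
                     int u, int v)
{
    switch (depth) {
    case 0: {  // four 4-bit palette indices per 16-bit word
        uint16_t word = getPixel({pageX + u / 4, pageY + v});
        uint8_t idx = (word >> ((u % 4) * 4)) & 0xF;
        return getPixel({clutX + idx, clutY});
    }
    case 1: {  // two 8-bit palette indices per 16-bit word
        uint16_t word = getPixel({pageX + u / 2, pageY + v});
        uint8_t idx = (word >> ((u % 2) * 8)) & 0xFF;
        return getPixel({clutX + idx, clutY});
    }
    default:   // 15-bit: the word *is* the color
        return getPixel({pageX + u, pageY + v});
    }
}
```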

Where Do Textures Live?

Textures are stored directly in VRAM. The texture page base is specified per-polygon via the command parameters:

  • texPageX in units of 64 pixels (so values 0–15 cover the 1024px width)
  • texPageY as 0 or 256 (top or bottom half of VRAM)

The CLUT’s position is also in VRAM, specified as part of the first textured vertex’s parameter word. This means textures, palettes, and the framebuffer all coexist in the same 1 MB — games have to juggle these layouts carefully.

flowchart LR
    UV["UV from vertex\n(u, v) 8-bit each"] --> Sample["Sample VRAM at\ntexPage + offset"]
    Sample --> Mode{"Color\nmode?"}
    Mode -- "4-bit" --> Pack4["Extract 4-bit index\nfrom packed word"]
    Mode -- "8-bit" --> Pack8["Extract 8-bit index\nfrom packed word"]
    Mode -- "15-bit" --> Direct["Use ABGR1555 directly"]
    Pack4 --> CLUT["CLUT lookup\nin VRAM"]
    Pack8 --> CLUT
    CLUT --> Pixel["Final texel color"]
    Direct --> Pixel

Texture × Vertex Color Modulation

Once we have the texel color, it’s blended with the vertex color through multiplicative modulation:

finalR = (texR * vertexR) / 128;
finalG = (texG * vertexG) / 128;
finalB = (texB * vertexB) / 128;

The divide-by-128 (not 255!) means a vertex color of (128, 128, 128) preserves the texture’s original color. Values above 128 brighten the texture; below darkens it. This is how the PS1 achieves “baked lighting” on textured polygons.
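One subtlety the snippet above glosses over: with 8-bit inputs, tex × vertex / 128 can reach 508, so the result has to be saturated back into range. Here is one channel of the modulation as a sketch (the clamp to 255 is our assumption here):

```cpp
#include <algorithm>
#include <cstdint>

// Texture x vertex-color modulation for one channel.  128 is the
// neutral vertex value; results above 255 are saturated (an
// assumption of this sketch).
inline uint8_t modulate(uint8_t tex, uint8_t vertex)
{
    return static_cast<uint8_t>(std::min(tex * vertex / 128, 255));
}
```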

If the rawTexture flag is set (bit 24), this modulation is skipped and the texture color is used as-is — useful for UI elements or pre-lit textures.

Transparency check: if a texel’s ABGR1555 value is exactly 0x0000, the pixel is considered fully transparent and is skipped entirely. This is the PS1’s version of alpha testing — simple, binary, and cheap.

Affine Texture Mapping (and Why Things Wobble)

You might have noticed: our UV interpolation uses the barycentric coordinates directly:

float u = alpha * verts[0].u + beta * verts[1].u + gamma * verts[2].u;
float v = alpha * verts[0].v + beta * verts[1].v + gamma * verts[2].v;

This is affine texture mapping — no perspective correction. Modern GPUs divide UVs by the depth (W) value to get perspective-correct textures. The PS1… doesn’t.

The result is the characteristic “texture warping” that PS1 games are famous for. When a polygon is viewed at an angle, the texture appears to swim and bend. Games work around this by subdividing large polygons into smaller ones (the GTE’s NCLIP helps decide when to subdivide), but the wobble is an inherent part of the PS1 aesthetic.

This is actually one of those rare cases where not implementing a feature is the correct emulation behavior. We match the hardware by doing it “wrong.”

If you’ve ever wondered why PS1 textures look “wobbly” compared to the N64 (which had perspective correction), this is why. It’s not a bug — it’s a 1994 cost-saving decision baked into the silicon.

Lines: Bresenham’s Algorithm

Lines are drawn using the classic Bresenham’s line algorithm — the same algorithm taught in every computer graphics course since the 1960s. For Gouraud-shaded lines, we interpolate colors linearly along the line:

float dr = (float)(v1.color.r - v0.color.r) / steps;
float dg = (float)(v1.color.g - v0.color.g) / steps;
float db = (float)(v1.color.b - v0.color.b) / steps;

// For each step along the line:
currentR += dr;
currentG += dg;
currentB += db;

Polylines (connected line segments) are handled by a 0x5xxx5xxx termination sentinel — the GPU keeps accepting line segments until it sees that magic value.
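For reference, the integer core of Bresenham that the color stepping rides on looks like this (a generic textbook sketch, not RogEm's exact code):

```cpp
#include <cstdlib>
#include <vector>

struct Vec2i { int x, y; };

// Classic integer Bresenham: visits every pixel from a to b,
// inclusive, with no floating point in the position stepping.
std::vector<Vec2i> bresenham(Vec2i a, Vec2i b)
{
    std::vector<Vec2i> pts;
    int dx = std::abs(b.x - a.x), sx = a.x < b.x ? 1 : -1;
    int dy = -std::abs(b.y - a.y), sy = a.y < b.y ? 1 : -1;
    int err = dx + dy;  // combined error term
    for (;;) {
        pts.push_back(a);
        if (a.x == b.x && a.y == b.y)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; a.x += sx; }  // step in x
        if (e2 <= dx) { err += dx; a.y += sy; }  // step in y
    }
    return pts;
}
```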

Rectangles: The Fast Path

Rectangles (sprites) take a shortcut: no edge function math needed. It’s a simple nested loop:

for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        uint8_t u = vert.u + x;
        uint8_t v = vert.v + y;
        uint16_t texColor = sampleTexture(u, v, ...);
        // modulate, write pixel
    }
}

The GPU supports four rectangle sizes encoded in the command: variable (size in a parameter word), 1×1 (single pixel), 8×8, and 16×16. The fixed sizes are used heavily for tile-based backgrounds and UI elements.

VRAM Transfers: The Data Highway

Not all GPU commands draw things. Three copy operations move raw pixel data around:

flowchart LR
    CPU["CPU\n(main RAM)"] -- "GP0 cmd 0xA0\n32-bit words → 2 pixels each" --> VRAM["VRAM\n1024×512"]
    VRAM -- "GP0 cmd 0xC0\nread back pixels" --> CPU
    VRAM -- "GP0 cmd 0x80\npixel-by-pixel copy" --> VRAM

  • CPU → VRAM: Used to upload textures and CLUTs. The GPU switches to ReceivingDataWords state, and each subsequent 32-bit write is unpacked into two 16-bit pixels.
  • VRAM → CPU: Used for screenshots or reading back rendered data. The GPU switches to SendingDataWords state.
  • VRAM → VRAM: Immediate pixel-by-pixel copy within VRAM. Used for double-buffering tricks and texture manipulation.
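The CPU → VRAM unpacking step is trivial but worth seeing: each 32-bit data word carries two ABGR1555 pixels, low halfword first (a sketch):

```cpp
#include <cstdint>
#include <utility>

// Split one GP0 transfer word into the two 16-bit pixels it carries;
// the low halfword is written to VRAM first.
inline std::pair<uint16_t, uint16_t> unpackTransferWord(uint32_t word)
{
    return { static_cast<uint16_t>(word & 0xFFFF),
             static_cast<uint16_t>(word >> 16) };
}
```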

Timing: Synced to the CRT

The GPU doesn’t just draw — it also generates the video timing signal. Our emulator simulates NTSC timing:

  • GPU clock: 53.2224 MHz (the 33.8688 MHz CPU clock multiplied by 11/7, which is the ratio our emulator uses)
  • One scanline: 3413 GPU cycles
  • One frame: 263 scanlines
  • VBlank IRQ: fired at the end of each frame
void GPU::update(uint32_t cpuCycles)
{
    m_gpuCycles += cpuCycles * 11 / 7;  // CPU→GPU clock ratio
    while (m_gpuCycles >= 3413) {
        m_gpuCycles -= 3413;
        m_scanline++;
        if (m_scanline >= 263) {
            m_scanline = 0;
            // Fire VBlank interrupt
            m_interruptController->requestInterrupt(InterruptType::VBLANK);
        }
    }
}

This VBlank interrupt is what drives the game loop — most PS1 games synchronize their logic and rendering to the 60 Hz (NTSC) or 50 Hz (PAL) refresh rate.

Getting Pixels on Screen

After the GPU has rasterized everything into VRAM, we need to actually display it. In RogEm, the display pipeline is:

  1. Grab the raw VRAM byte array from the GPU
  2. Upload the entire 1024×512 buffer to an OpenGL texture using glTexSubImage2D with the GL_UNSIGNED_SHORT_1_5_5_5_REV format — which conveniently matches the PS1’s native ABGR1555 pixel format
  3. Compute UV coordinates based on the current display area settings (games can display from any position in VRAM)
  4. Render the texture into an ImGui window

The PS1 supports multiple display resolutions (256×240, 320×240, 512×240, 640×240, 640×480 interlaced), all configured through GP1 commands. The display area is just a window into the larger 1024×512 VRAM.
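Computing those UVs (step 3) amounts to normalizing the display rectangle against the full 1024×512 VRAM texture (a minimal sketch; the struct is ours):

```cpp
struct UvRect { float u0, v0, u1, v1; };

// Map a display area (origin + size in VRAM pixels) to normalized
// texture coordinates over the 1024x512 VRAM texture.
UvRect displayAreaUv(int x, int y, int width, int height)
{
    return { x / 1024.0f, y / 512.0f,
             (x + width) / 1024.0f, (y + height) / 512.0f };
}
```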

Implementation Notes

What’s Not Here (Yet)

Honesty is important, so here’s what our rasterizer doesn’t implement yet:

  • Semi-transparency blending: The flag is parsed and stored, but the actual alpha blending between the source and destination pixels isn’t applied during rasterization. The PS1 supports 4 blending modes (B/2+F/2, B+F, B-F, B+F/4).
  • Dithering: The PS1 applies an ordered 4×4 dither matrix when converting from 24-bit internal color to 15-bit VRAM. We track the flag but don’t apply it. This is what gives PS1 games their characteristic “grainy” gradients.
  • Draw area clipping: We clamp to VRAM edges but don’t enforce the configurable draw area rectangle that games use to prevent drawing outside the current framebuffer.
  • Mask bit: The PS1 can protect VRAM pixels from being overwritten (useful for UI overlays). We parse the settings but don’t enforce them.
  • Texture window masking: Allows texture coordinate wrapping/clamping within a sub-region, used for animated textures. Parsed but not applied.

What We Got Right

  • Both winding orders: accepting CW and CCW triangles is critical — PS1 games don’t have a consistent winding convention.
  • All three texture depths: 4-bit CLUT, 8-bit CLUT, and 15-bit direct all work correctly, with proper bit-packing extraction.
  • Affine textures: Deliberately not implementing perspective correction is the correct choice for PS1 emulation.

The Full Pipeline

Putting it all together, here’s the complete rendering pipeline for a single Gouraud-shaded textured triangle:

flowchart TD
    A["CPU writes GP0 word\n0x3C (Gouraud + Textured + Quad)"] --> B["GPUCommand parses flags:\nshaded=1, quad=1, textured=1"]
    B --> C["GPU collects 12 parameter words\n(colors, positions, UVs, CLUT, texpage)"]
    C --> D["drawPolygon() extracts vertices"]
    D --> E["rasterizePoly4() splits quad\ninto two triangles"]
    E --> F["rasterizePoly3():\nCompute bounding box"]
    F --> G["For each pixel:\nedgeFunction × 3"]
    G --> H{"Inside\ntriangle?"}
    H -- No --> G
    H -- Yes --> I["Barycentric coords\nα, β, γ"]
    I --> J["Interpolate Gouraud color\nR = R0·α + R1·β + R2·γ"]
    I --> K["Interpolate UVs\nu = u0·α + u1·β + u2·γ"]
    K --> L["sampleTexture()\n4-bit/8-bit CLUT or 15-bit direct"]
    L --> M{"texel == 0x0000?"}
    M -- "Yes (transparent)" --> G
    M -- No --> N["Modulate:\nfinal = tex × vertex / 128"]
    J --> N
    N --> O["Convert to ABGR1555\nsetPixel() → VRAM"]
    O --> G

What We Learned

Building a software rasterizer from scratch, even for 1994-era hardware, teaches you things that no graphics API tutorial will:

  1. Every pixel costs: Without hardware acceleration, you feel every multiplication, every branch, every memory access. It gives you a visceral understanding of why GPUs exist.

  2. The PS1’s constraints were brilliant: No z-buffer means games use the GTE’s AVSZ command and ordering tables to sort polygons by depth. No perspective correction means games subdivide geometry to minimize warping. Every hardware limitation spawned a creative software workaround.

  3. Simple beats complex: The edge function rasterizer is about 60 lines of code. Bresenham’s line algorithm is 30 lines. The texture sampler with all three modes is under 30 lines. The entire rasterizer fits in one 886-line file. The PS1 GPU is proof that you can build a complete 3D rendering pipeline without any of the complexity of modern graphics.

  4. VRAM as a unified memory was ahead of its time: Modern “unified memory” architectures in AMD APUs and Apple Silicon are conceptually similar to the PS1’s approach of putting everything — framebuffer, textures, palettes — in one shared memory space.

The PS1 GPU may not have shaders or compute dispatches, but it delivered experiences that defined a generation. And recreating it pixel by pixel has been one of the most rewarding parts of building RogEm.

References

  1. psx-spx GPU documentation
  2. Rodrigo Copetti — PlayStation Architecture
  3. Scratchapixel — Rasterization: a Practical Implementation
  4. Fabien Sanglard — “How do PSX emulators work?”
  5. Our GTE Part 2 article — the math that happens before polygons reach the GPU
This post is licensed under CC BY 4.0 by the author.