Pipeline Overview


Geometry Feed

Vertex data is read sequentially from up to 16 streams.  Each stream may contain any combination of vertex components, such as position, diffuse and/or specular color, normals, and texture coordinates.  These components are collected and then sent through the remaining stages of the pipeline.

Because the software renderer is designed around the SSE instruction set, the geometry feed pipeline actually fetches 4 triangles (12 vertices) and swizzles them before moving to the next stage.  This is why it is so important to render batches of triangles at once--maximum efficiency is achieved when all 4 triangles go through the pipeline together.  If fewer than 4 triangles are supplied, the pipeline "dummies up" infinitely small triangles which evaporate the moment they reach the rasterizer.  This is one of the reasons why Quake2 doesn't run as quickly as it should (especially at lower resolutions); it sends one triangle at a time, wasting much of the potential.
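
The padding step described above could look roughly like the following.  This is only a sketch:  the `Vertex` layout, the `fill_batch` name, and the choice of where the dummy triangles collapse to are assumptions made for illustration, not VSG's actual code.

```c
#include <stddef.h>

typedef struct { float x, y, z, w; } Vertex;   /* hypothetical vertex layout */

enum { BATCH_TRIS = 4, BATCH_VERTS = BATCH_TRIS * 3 };

/* Copies up to 4 triangles into a 12-vertex batch and fills any unused
   slots with zero-area triangles.  Returns the number of real triangles. */
size_t fill_batch(Vertex batch[BATCH_VERTS], const Vertex *src, size_t tri_count)
{
    size_t real = tri_count < BATCH_TRIS ? tri_count : BATCH_TRIS;
    size_t i;

    for (i = 0; i < real * 3; ++i)
        batch[i] = src[i];

    /* A triangle whose three vertices coincide covers no pixels, so the
       rasterizer has nothing to draw for it. */
    for (; i < BATCH_VERTS; ++i) {
        Vertex degenerate = { 0.0f, 0.0f, 1.0f, 1.0f };
        batch[i] = degenerate;
    }
    return real;
}
```

A caller submitting a single triangle still pays for all 4 slots, which is the waste the paragraph above describes.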

About SSE Data Packing

Vertex data is packed (or swizzled) before it can be processed by SSE instructions.  Similar components from adjacent triangles are packed into a single memory location corresponding to the data type of an SSE register.

    Vertex4ps vertex4[3];

    vertex4[0].vector.x[0] = src_vertex[0].x;
    vertex4[0].vector.x[1] = src_vertex[3].x;
    vertex4[0].vector.x[2] = src_vertex[6].x;
    vertex4[0].vector.x[3] = src_vertex[9].x;

    vertex4[1].vector.x[0] = src_vertex[1].x;
    vertex4[1].vector.x[1] = src_vertex[4].x;
    vertex4[1].vector.x[2] = src_vertex[7].x;
    vertex4[1].vector.x[3] = src_vertex[10].x;

    vertex4[2].vector.x[0] = src_vertex[2].x;
    vertex4[2].vector.x[1] = src_vertex[5].x;
    vertex4[2].vector.x[2] = src_vertex[8].x;
    vertex4[2].vector.x[3] = src_vertex[11].x;

You can see that non-adjacent vertex components are extracted from each of the 4 triangles and placed in a temporary buffer so that subsequent SSE instructions can operate on them, for example:

    // projection
    rcpps xmm0, [eax] vertex4ps.w // 1 / vector[0..3].w in register xmm0
    rcpps xmm1, [eax] vertex4ps.w // 1 / vector[0..3].w in register xmm1, too
    mulps xmm0, [eax] vertex4ps.x // vector[0..3].x * (1 / vector[0..3].w)
    mulps xmm1, [eax] vertex4ps.y // vector[0..3].y * (1 / vector[0..3].w)
    movaps [ebx] vertex4ps.screenpos.x, xmm0
    movaps [ebx] vertex4ps.screenpos.y, xmm1
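
For readers who prefer C to assembly, the packing and the 4-wide divide can be sketched in portable code.  The `SrcVertex` and `Vec4x4` types and both function names are hypothetical.  The loop over `lane` stands in for a single SSE instruction, and it uses an exact reciprocal where `rcpps` would return an approximation.

```c
typedef struct { float x, y, z, w; } SrcVertex;            /* hypothetical source layout */
typedef struct { float x[4], y[4], z[4], w[4]; } Vec4x4;   /* one lane per triangle */

/* Gathers corner `corner` (0..2) of four consecutive triangles so that
   like components sit next to each other in memory. */
void swizzle_corner(Vec4x4 *dst, const SrcVertex src[12], int corner)
{
    for (int lane = 0; lane < 4; ++lane) {
        const SrcVertex *v = &src[lane * 3 + corner];
        dst->x[lane] = v->x;
        dst->y[lane] = v->y;
        dst->z[lane] = v->z;
        dst->w[lane] = v->w;
    }
}

/* Performs four perspective divides, mirroring the rcpps/mulps sequence:
   one reciprocal of w per lane, then one multiply each for x and y. */
void project4(const Vec4x4 *in, float sx[4], float sy[4])
{
    for (int lane = 0; lane < 4; ++lane) {
        float inv_w = 1.0f / in->w[lane];
        sx[lane] = in->x[lane] * inv_w;
        sy[lane] = in->y[lane] * inv_w;
    }
}
```

Calling `swizzle_corner` three times, once per corner, produces the equivalent of the `vertex4[0..2]` array shown earlier.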

Swizzled data cannot be processed by the SSE instructions right away.  Modern processors stall on a fragmented load/store operation, where a wide load reads back data that was just written by several narrower stores, so it's important to do something else between the time when swizzled data is written to the cache and when it is processed by SSE instructions.


Transformation

In this stage, vertices are transformed to screen space.  The world, view, and projection transforms have been combined to create the final composite matrix by which all vertices are transformed.  The results are stored and sent to the next stage.
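
A scalar sketch of the composite-matrix idea follows.  The `Mat4` and `Vec4` types, the row-major storage, and the column-vector convention are all assumptions chosen for the example;  the real renderer works on 4 swizzled vertices at a time.

```c
typedef struct { float m[4][4]; } Mat4;      /* row-major, hypothetical */
typedef struct { float x, y, z, w; } Vec4;

/* Returns a * b.  With column vectors, transforming by the result applies
   b first and then a, so composite = projection * view * world. */
Mat4 mat4_mul(const Mat4 *a, const Mat4 *b)
{
    Mat4 out;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a->m[r][k] * b->m[k][c];
            out.m[r][c] = sum;
        }
    return out;
}

/* Transforms one vertex by a matrix (column-vector convention). */
Vec4 mat4_transform(const Mat4 *m, Vec4 v)
{
    Vec4 out;
    out.x = m->m[0][0] * v.x + m->m[0][1] * v.y + m->m[0][2] * v.z + m->m[0][3] * v.w;
    out.y = m->m[1][0] * v.x + m->m[1][1] * v.y + m->m[1][2] * v.z + m->m[1][3] * v.w;
    out.z = m->m[2][0] * v.x + m->m[2][1] * v.y + m->m[2][2] * v.z + m->m[2][3] * v.w;
    out.w = m->m[3][0] * v.x + m->m[3][1] * v.y + m->m[3][2] * v.z + m->m[3][3] * v.w;
    return out;
}
```

Combining the matrices once per object means each vertex costs a single matrix multiply instead of three.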


Vertex Blending

Each vertex can have up to 3 blend values; the 0th blend value is implicit.

Vertices are transformed by as many world matrices as necessary (corresponding to the number of blend values specified).  The outputs from each blend stage are "blended" using weighted formulas.
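
One plausible form of the weighted formula is sketched below.  It assumes the implicit 0th weight is whatever remains after subtracting the explicit weights from 1;  the text does not spell this out, so treat that rule, the `Vec3` type, and the function name as illustrative only.

```c
typedef struct { float x, y, z; } Vec3;   /* hypothetical */

/* Blends count+1 already-transformed positions.  weights[] holds the
   `count` explicit weights for pos[1..count];  pos[0] receives the
   implicit remainder (an assumption, see the lead-in). */
Vec3 blend_positions(const Vec3 *pos, const float *weights, int count)
{
    float w0 = 1.0f;
    for (int i = 0; i < count; ++i)
        w0 -= weights[i];

    Vec3 out = { pos[0].x * w0, pos[0].y * w0, pos[0].z * w0 };
    for (int i = 0; i < count; ++i) {
        out.x += pos[i + 1].x * weights[i];
        out.y += pos[i + 1].y * weights[i];
        out.z += pos[i + 1].z * weights[i];
    }
    return out;
}
```

With `count` set to 0 the function returns `pos[0]` unchanged, which matches an unblended vertex.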


Viewport Detection

Triangles that are not visible in the viewing frustum can now be eliminated.  Viewport detection determines which triangles are completely outside the viewing frustum and which are at least partially inside.  Triangles which pass the visibility test are tagged and then passed to the next stage for clipping.
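
A common way to implement this kind of test is with per-vertex outcodes, sketched below.  Whether VSG does it this way is not stated;  the clip-space bounds (x and y within -w..w, z within 0..w) and all names are assumptions.

```c
typedef struct { float x, y, z, w; } ClipVertex;   /* hypothetical */

enum {
    OUT_LEFT = 1, OUT_RIGHT = 2, OUT_BOTTOM = 4,
    OUT_TOP = 8, OUT_NEAR = 16, OUT_FAR = 32
};

typedef enum { TRI_OUTSIDE, TRI_PARTIAL, TRI_INSIDE } Visibility;

/* Sets one bit for every frustum plane the vertex lies beyond. */
static unsigned outcode(const ClipVertex *v)
{
    unsigned code = 0;
    if (v->x < -v->w) code |= OUT_LEFT;
    if (v->x >  v->w) code |= OUT_RIGHT;
    if (v->y < -v->w) code |= OUT_BOTTOM;
    if (v->y >  v->w) code |= OUT_TOP;
    if (v->z <  0.0f) code |= OUT_NEAR;   /* assumes a 0..w depth range */
    if (v->z >  v->w) code |= OUT_FAR;
    return code;
}

Visibility classify_triangle(const ClipVertex tri[3])
{
    unsigned a = outcode(&tri[0]);
    unsigned b = outcode(&tri[1]);
    unsigned c = outcode(&tri[2]);

    if (a & b & c)      return TRI_OUTSIDE;   /* all three beyond the same plane */
    if ((a | b | c) == 0) return TRI_INSIDE;  /* no vertex beyond any plane */
    return TRI_PARTIAL;                       /* needs clipping or later rejection */
}
```

The test is conservative:  a triangle whose vertices lie beyond different planes is reported as partial even if it turns out to miss the frustum entirely.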


Clipping

Triangles which passed the visibility test are now clipped.  Clipping in VSG isn't the same as normal plane clipping.  It doesn't attempt to clip triangles against all planes of the viewing frustum, only against the near-z and far-z planes.  The other edges of the viewing frustum are handled by later stages of the pipeline:  the top and bottom edges are clipped during triangle setup, and the left and right edges are clipped during rasterization.

A triangle may actually become a quad when only one of its vertices is outside the viewing frustum.  If this happens, the clipper must divide the quad into 2 triangles.  One of the triangles flows through the pipeline normally with the rest of the bunch, while the 2nd, newly computed triangle is placed in a queue where it waits to join up with 3 other triangles also generated from clipping.  As soon as the 4th triangle is ready, the queued triangles flow through the pipeline together.
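
The quad case can be illustrated with a single-plane clipper.  The sketch below handles only the near plane, taken here to be z >= 0, and uses the textbook Sutherland-Hodgman approach;  it is not VSG's clipper, and the `ClipVertex` type and function names are hypothetical.

```c
typedef struct { float x, y, z, w; } ClipVertex;   /* hypothetical */

/* Linear interpolation between two vertices, t in 0..1. */
static ClipVertex lerp_vertex(const ClipVertex *a, const ClipVertex *b, float t)
{
    ClipVertex r;
    r.x = a->x + (b->x - a->x) * t;
    r.y = a->y + (b->y - a->y) * t;
    r.z = a->z + (b->z - a->z) * t;
    r.w = a->w + (b->w - a->w) * t;
    return r;
}

/* Clips one triangle against the plane z >= 0.  Writes 0, 1 or 2
   triangles to out (room for 6 vertices) and returns how many. */
int clip_near(const ClipVertex tri[3], ClipVertex out[6])
{
    ClipVertex poly[4];
    int n = 0;

    for (int i = 0; i < 3; ++i) {
        const ClipVertex *cur  = &tri[i];
        const ClipVertex *next = &tri[(i + 1) % 3];
        int cur_in  = cur->z  >= 0.0f;
        int next_in = next->z >= 0.0f;

        if (cur_in)
            poly[n++] = *cur;
        if (cur_in != next_in) {
            /* The edge crosses the plane:  find where z reaches 0. */
            float t = cur->z / (cur->z - next->z);
            poly[n++] = lerp_vertex(cur, next, t);
        }
    }

    if (n < 3)
        return 0;                      /* entirely behind the plane */

    out[0] = poly[0]; out[1] = poly[1]; out[2] = poly[2];
    if (n == 3)
        return 1;

    /* One vertex was outside, leaving a quad:  emit a second triangle. */
    out[3] = poly[0]; out[4] = poly[2]; out[5] = poly[3];
    return 2;
}
```

When the function returns 2, the triangle in `out[3..5]` is the extra one that would wait in the queue described above.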


SSE Data Packing Stage 2

This stage is a little different from its earlier cousin:  the data is already packed and swizzled, ready to be used with SSE instructions.  However, it is not yet naturalized.  Color values, for example, are normally represented as normalized values from 0 to 1, but the rasterizer wants all color values to be between 0 and 255.  This stage of the pipeline takes the normalized vertex components and naturalizes them, putting them in the natural range used by the scanline rasterizer.
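
For color channels, naturalization amounts to a scale by 255.  The sketch below works on one swizzled channel (4 lanes);  the clamping to 0..1 is an added assumption, and the function names are hypothetical.

```c
/* Maps one normalized value to the 0..255 range.  Out-of-range input is
   clamped first (an assumption, not something the text specifies). */
static float naturalize(float v)
{
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return v * 255.0f;
}

/* Naturalizes one swizzled color channel, one lane per triangle. */
void naturalize4(float channel[4])
{
    for (int lane = 0; lane < 4; ++lane)
        channel[lane] = naturalize(channel[lane]);
}
```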



Lighting

At this stage, standard lighting equations are applied for diffuse and ambient light.  VSG doesn't support specular lighting at this time. :)
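
A minimal sketch of an ambient-plus-diffuse term for a single directional light is shown below.  It assumes unit-length vectors and a Lambert (N dot L) diffuse model;  the types and the `light_vertex` name are hypothetical, and the renderer's actual equations may differ.

```c
typedef struct { float x, y, z; } Vec3;    /* hypothetical */
typedef struct { float r, g, b; } Color;   /* normalized 0..1 channels */

static float dot3(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

static float saturate(float v)
{
    return v > 1.0f ? 1.0f : v;
}

/* normal and to_light are assumed to be unit length. */
Color light_vertex(Vec3 normal, Vec3 to_light,
                   Color ambient, Color light, Color material)
{
    float n_dot_l = dot3(normal, to_light);
    if (n_dot_l < 0.0f)
        n_dot_l = 0.0f;                 /* surface faces away from the light */

    Color out;
    out.r = saturate(material.r * (ambient.r + light.r * n_dot_l));
    out.g = saturate(material.g * (ambient.g + light.g * n_dot_l));
    out.b = saturate(material.b * (ambient.b + light.b * n_dot_l));
    return out;
}
```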



Projection

Vertices are projected to screen space by dividing x and y by w.

    vertex.screenpos.x = vertex.worldpos.x / vertex.worldpos.w;
    vertex.screenpos.y = vertex.worldpos.y / vertex.worldpos.w;


Sort Vertices

The vertices must be sorted so that the top-most vertex is first in memory and the bottom-most vertex is last.  This optimization reduces decision logic in the triangle rasterizer by guaranteeing a top-to-bottom drawing order, but it does create a huge bubble in the triangle processing pipeline.
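
Sorting 3 vertices takes at most 3 compare-and-swap steps.  The sketch below assumes screen y grows downward, so the smallest y is the top-most vertex;  `ScreenVertex` and the function names are hypothetical.

```c
typedef struct { float x, y; } ScreenVertex;   /* hypothetical */

/* Swaps the pair if b is above a (has the smaller y). */
static void order_pair(ScreenVertex *a, ScreenVertex *b)
{
    if (b->y < a->y) {
        ScreenVertex tmp = *a;
        *a = *b;
        *b = tmp;
    }
}

/* Sorts so that v[0] is top-most and v[2] is bottom-most. */
void sort_by_y(ScreenVertex v[3])
{
    order_pair(&v[0], &v[1]);
    order_pair(&v[1], &v[2]);   /* largest y is now in v[2] */
    order_pair(&v[0], &v[1]);
}
```

The data-dependent branches in `order_pair` are one likely source of the pipeline bubble mentioned above, since each triangle in a batch of 4 may need a different swap pattern.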


Rasterize Triangles

Here, triangles are broken down into 3 edges; stepping values for each edge and scanline are calculated (and pre-stepped; special thanks to Chris Hecker for that tip) and scanlines are drawn by sending them to the scanline rasterizer.
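
Edge setup with pre-stepping can be sketched as follows.  The idea is that a vertex rarely falls exactly on a scanline, so x is advanced from the vertex to the first scanline the edge actually covers.  The `Edge` layout, the round-up rule for the first scanline, and the names are assumptions made for the example.

```c
typedef struct { float x, y; } ScreenVertex;   /* hypothetical */

typedef struct {
    int   y_start;   /* first scanline covered by the edge */
    int   y_end;     /* one past the last scanline */
    float x;         /* x on scanline y_start, after pre-stepping */
    float x_step;    /* change in x per scanline */
} Edge;

/* Rounds up to the next integer without needing the math library. */
static int ceil_to_int(float v)
{
    int i = (int)v;
    return (v > (float)i) ? i + 1 : i;
}

/* Expects top.y <= bottom.y, as guaranteed by the sorting stage. */
Edge setup_edge(ScreenVertex top, ScreenVertex bottom)
{
    Edge e;
    float dy = bottom.y - top.y;

    e.y_start = ceil_to_int(top.y);
    e.y_end   = ceil_to_int(bottom.y);
    e.x_step  = dy > 0.0f ? (bottom.x - top.x) / dy : 0.0f;

    /* Pre-step:  move x by the fractional distance from the vertex to
       the first scanline so the edge is sampled on the scanline itself. */
    e.x = top.x + ((float)e.y_start - top.y) * e.x_step;
    return e;
}
```

The same pre-step applies to every interpolated component (z, color, texture coordinates), not only x.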


Rasterize Scanlines

A scanline is represented as a starting point, its length (run), and a stepping value for each component.

A scanline arrives at this stage in the pipeline ready to be drawn.  A stepping value for each component is used to increment the correct value as the rasterizer draws from left to right.  The scanline rasterizer is responsible for a number of things including texturing, shading, fog, and z-buffering.  There is a built-in optimization here that can trivially reject scanlines or shorten them depending on the results of an early Z buffer compare.  Perspective correction is applied to all vertex components, not just the texture uv's.
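
The stepping loop at the heart of this stage can be sketched as below.  The example interpolates only depth and a single shade value and leaves out texturing, fog, perspective correction, and the early rejection test.  The `Scanline` fields, the 8-bit output, and the convention that a smaller z is nearer are assumptions.

```c
typedef struct {
    int   x_start;             /* first pixel of the run */
    int   run;                 /* number of pixels */
    float z, z_step;           /* depth at x_start and its per-pixel step */
    float shade, shade_step;   /* shade (0..255) and its per-pixel step */
} Scanline;

/* Draws one scanline into a row of the z-buffer and color buffer.
   Returns the number of pixels that passed the z-compare. */
int draw_scanline(const Scanline *s, float *zbuf_row, unsigned char *color_row)
{
    float z = s->z;
    float shade = s->shade;
    int written = 0;

    for (int i = 0; i < s->run; ++i) {
        int x = s->x_start + i;
        if (z < zbuf_row[x]) {          /* nearer than what is stored */
            zbuf_row[x] = z;
            color_row[x] = (unsigned char)shade;
            ++written;
        }
        z += s->z_step;                 /* step every component per pixel */
        shade += s->shade_step;
    }
    return written;
}
```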

All pixel values are computed and held in the CPU cache until they are ready to be flushed.  The final z-compare and subsequent z-write occur at that point.

Next: Rasterization Techniques