**Pipeline Overview**
**Geometry Feed**
Vertex data is read sequentially
from up to 16 streams. Each stream may contain any combination
of vertex components, such as position, diffuse and/or specular color,
normals, and texture coordinates. These components are
collected and then sent through the remaining stages of the
pipeline.
Because the software renderer is
designed around the SSE instruction set, the geometry feed pipeline
actually fetches 4 triangles (12 vertices) and swizzles them before
moving to the next stage. This is why it is so important to render
batches of triangles at once--maximum efficiency is achieved when
all 4 triangles go through the pipeline together. If fewer
than 4 triangles are supplied, the pipeline "dummies up"
infinitesimally small (zero-area) triangles, which evaporate
the moment they reach the rasterizer.
This is one of the reasons why Quake2 doesn't run as quickly as it
should (especially at lower resolutions); it sends one triangle at a
time, wasting much of the potential.
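As a rough illustration of the padding described above, here is a minimal sketch in C. The `Vertex` layout and the `fill_batch` helper are hypothetical names invented for this example, not VSG's actual code; the sketch assumes a padded triangle can be made by repeating one vertex three times so that it has no area.

```c
#include <stddef.h>

typedef struct { float x, y, z, w; } Vertex;   /* hypothetical vertex layout */

enum { BATCH_TRIS = 4, BATCH_VERTS = 12 };

/* Copies 1..4 triangles into a 12-vertex batch and pads the remainder with
   zero-area triangles (all three corners identical). Returns the number of
   padding triangles added, or -1 if tri_count is out of range. */
int fill_batch(const Vertex *src, size_t tri_count, Vertex batch[BATCH_VERTS])
{
    if (tri_count == 0 || tri_count > BATCH_TRIS)
        return -1;

    size_t n = tri_count * 3;
    for (size_t i = 0; i < n; ++i)
        batch[i] = src[i];

    /* Repeat the last real vertex so every padded triangle is degenerate. */
    for (size_t i = n; i < BATCH_VERTS; ++i)
        batch[i] = src[n - 1];

    return (int)(BATCH_TRIS - tri_count);
}
```

A caller that submits a single triangle gets 3 padding triangles back, which is exactly the wasted work the paragraph above warns about.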
**About SSE Data Packing**
Vertex data is packed (or swizzled)
before it can be processed by SSE instructions. Similar components from adjacent triangles are packed into a single
memory location corresponding to the data type of an SSE register.
Vertex4ps vertex4[3];
vertex4[0].vector.x[0] = src_vertex[0].x;
vertex4[0].vector.x[1] = src_vertex[3].x;
vertex4[0].vector.x[2] = src_vertex[6].x;
vertex4[0].vector.x[3] = src_vertex[9].x;
vertex4[1].vector.x[0] = src_vertex[1].x;
vertex4[1].vector.x[1] = src_vertex[4].x;
vertex4[1].vector.x[2] = src_vertex[7].x;
vertex4[1].vector.x[3] = src_vertex[10].x;
vertex4[2].vector.x[0] = src_vertex[2].x;
vertex4[2].vector.x[1] = src_vertex[5].x;
vertex4[2].vector.x[2] = src_vertex[8].x;
vertex4[2].vector.x[3] = src_vertex[11].x;
You can see that non-adjacent
vertex components are extracted from each of the 4 triangles and placed in
a temporary buffer so that subsequent SSE instructions can operate on
them, for example:
// projection
rcpps xmm0, [eax] vertex4ps.w   // 1 / vector[0..3].w in register xmm0
rcpps xmm1, [eax] vertex4ps.w   // 1 / vector[0..3].w in register xmm1, too
mulps xmm0, [eax] vertex4ps.x   // vector[0..3].x * (1 / vector[0..3].w)
mulps xmm1, [eax] vertex4ps.y   // vector[0..3].y * (1 / vector[0..3].w)
movaps [ebx] vertex4ps.screenpos.x, xmm0
movaps [ebx] vertex4ps.screenpos.y, xmm1
Swizzled data cannot be processed by
the SSE instructions right away. Modern processors stall when a wide load
reads data that was just written by several narrower stores (a fragmented
load/store operation), so it's important to do something else between the time
when swizzled data is written to the cache and
when it is processed by SSE instructions.
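The packing listing above only shows the x component. A minimal sketch of the same idea for all four components, written as a plain C loop, might look like the following. The `SrcVertex` and `Vertex4` types and the `swizzle4` function are hypothetical stand-ins for this example; the only thing taken from the text is the index pattern (corner `c` of triangle `t` is read from `src[t * 3 + c]`).

```c
typedef struct { float x, y, z, w; } SrcVertex;            /* hypothetical input vertex */
typedef struct { float x[4], y[4], z[4], w[4]; } Vertex4;  /* one corner of 4 triangles */

/* Packs 4 triangles (12 vertices) into 3 structure-of-arrays records.
   out[c] holds corner c of every triangle, so each float[4] member can
   later be loaded into a single SSE register. */
void swizzle4(const SrcVertex src[12], Vertex4 out[3])
{
    for (int c = 0; c < 3; ++c) {          /* corner within the triangle */
        for (int t = 0; t < 4; ++t) {      /* triangle within the batch  */
            const SrcVertex *v = &src[t * 3 + c];
            out[c].x[t] = v->x;
            out[c].y[t] = v->y;
            out[c].z[t] = v->z;
            out[c].w[t] = v->w;
        }
    }
}
```

In a real SSE build the `Vertex4` records would also need 16-byte alignment, because `movaps` faults on unaligned addresses; this sketch leaves alignment out.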
**Transform**
In this stage, vertices are
transformed to screen space. The world, view, and projection
transforms have been combined to create the final composite matrix by which all
vertices are transformed. The results are stored and sent
to the next stage.
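To make the composite-matrix idea concrete, here is a small scalar sketch. It assumes the Direct3D-style row-vector convention (`v * World * View * Projection`), which the text does not state, and all type and function names are made up for the example.

```c
typedef struct { float m[4][4]; } Mat4;
typedef struct { float x, y, z, w; } Vec4;

Mat4 mat4_identity(void)
{
    Mat4 r = {{{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}}};
    return r;
}

/* r = a * b. With row vectors, v * (a * b) applies a first, then b. */
Mat4 mat4_mul(const Mat4 *a, const Mat4 *b)
{
    Mat4 r;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a->m[i][k] * b->m[k][j];
            r.m[i][j] = sum;
        }
    return r;
}

/* Row vector times matrix: one multiply per vertex instead of three. */
Vec4 transform_vertex(const Vec4 *v, const Mat4 *m)
{
    Vec4 r;
    r.x = v->x * m->m[0][0] + v->y * m->m[1][0] + v->z * m->m[2][0] + v->w * m->m[3][0];
    r.y = v->x * m->m[0][1] + v->y * m->m[1][1] + v->z * m->m[2][1] + v->w * m->m[3][1];
    r.z = v->x * m->m[0][2] + v->y * m->m[1][2] + v->z * m->m[2][2] + v->w * m->m[3][2];
    r.w = v->x * m->m[0][3] + v->y * m->m[1][3] + v->z * m->m[2][3] + v->w * m->m[3][3];
    return r;
}
```

The composite would be built once per batch, for example `wv = mat4_mul(&world, &view)` followed by `wvp = mat4_mul(&wv, &proj)`, and then reused for every vertex.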
**Vertex Blending**
Each vertex can have up to 3 blend
values; the 0th blend value is implicit.
Vertices are transformed by as many
world matrices as necessary (corresponding to the number of blend
values specified). The outputs from each blend stage are
"blended" using weighted formulas.
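The text doesn't spell out the weighted formula, so the following is only a sketch under an assumption: the explicit blend values are the weights for matrices 1 and up, and the implicit 0th weight is whatever remains after subtracting them from 1. The `blend_positions` helper and `Vec3` type are hypothetical.

```c
typedef struct { float x, y, z; } Vec3;

/* pos[i] is the vertex after transformation by world matrix i.
   explicit_w holds the weights for matrices 1..count-1; the weight for
   matrix 0 is implied as 1 minus their sum (an assumed convention). */
Vec3 blend_positions(const Vec3 *pos, const float *explicit_w, int count)
{
    Vec3 r = {0.0f, 0.0f, 0.0f};
    float w0 = 1.0f;

    for (int i = 1; i < count; ++i) {
        float w = explicit_w[i - 1];
        w0 -= w;
        r.x += w * pos[i].x;
        r.y += w * pos[i].y;
        r.z += w * pos[i].z;
    }

    /* Apply the implicit weight to the first matrix's result. */
    r.x += w0 * pos[0].x;
    r.y += w0 * pos[0].y;
    r.z += w0 * pos[0].z;
    return r;
}
```

With two matrices and a single explicit weight of 0.25, the result sits a quarter of the way from `pos[0]` to `pos[1]`.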
**Viewport Detection**
Triangles that are not visible in
the viewing frustum can now be eliminated. Viewport detection
determines which triangles are completely outside the viewing
frustum or partially inside. Triangles which
pass the visibility test are tagged and then passed to the next
stage for clipping.
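One common way to do this kind of visibility test is with per-vertex "outcodes". The sketch below is not taken from VSG; it assumes clip-space vertices with Direct3D-style bounds (`-w <= x <= w`, `-w <= y <= w`, `0 <= z <= w`), and the names are invented for illustration.

```c
typedef struct { float x, y, z, w; } Vec4;

enum {
    OUT_LEFT = 1, OUT_RIGHT = 2, OUT_BOTTOM = 4,
    OUT_TOP = 8, OUT_NEAR = 16, OUT_FAR = 32
};

typedef enum { TRI_OUTSIDE, TRI_PARTIAL, TRI_INSIDE } TriVisibility;

/* One bit per frustum plane the vertex lies outside of. */
static unsigned outcode(const Vec4 *v)
{
    unsigned c = 0;
    if (v->x < -v->w) c |= OUT_LEFT;
    if (v->x >  v->w) c |= OUT_RIGHT;
    if (v->y < -v->w) c |= OUT_BOTTOM;
    if (v->y >  v->w) c |= OUT_TOP;
    if (v->z <  0.0f) c |= OUT_NEAR;   /* assumed 0..w depth range */
    if (v->z >  v->w) c |= OUT_FAR;
    return c;
}

TriVisibility classify_triangle(const Vec4 *a, const Vec4 *b, const Vec4 *c)
{
    unsigned ca = outcode(a), cb = outcode(b), cc = outcode(c);

    if (ca & cb & cc)        /* all three outside the same plane */
        return TRI_OUTSIDE;
    if ((ca | cb | cc) == 0) /* no vertex outside any plane */
        return TRI_INSIDE;
    return TRI_PARTIAL;      /* conservative: may still be invisible */
}
```

Triangles classified `TRI_OUTSIDE` can be dropped immediately; the other two classes would be tagged and passed on.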
**Clipping**
Triangles which passed the
visibility test are now clipped. Clipping in VSG isn't the
same as normal plane clipping. It doesn't attempt to clip
triangles against all planes of the viewing frustum; *only*
against the near-z and far-z planes. Other edges of the viewing
frustum are clipped by later stages of the pipeline; top and
bottom edges are clipped during triangle setup and left and
right edges are clipped during rasterization.
A triangle may actually become a
quad when only one of its vertices is outside the viewing
frustum. If this happens, the clipper must divide the quad
into 2 triangles; one of the triangles flows through the pipeline
normally with the rest of the bunch, while the 2nd, newly computed
triangle is placed in a queue where it waits to join up with 3 other
triangles also generated from clipping. As soon as the 4th
triangle is ready, they then flow through the pipeline normally.
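The triangle-to-quad split can be sketched as follows. This is a generic single-plane clip, not VSG's clipper: it handles only the near plane, assumes clip-space depth where `z >= 0` is visible, interpolates only position, and uses hypothetical names throughout.

```c
typedef struct { float x, y, z, w; } Vec4;

static Vec4 lerp4(const Vec4 *a, const Vec4 *b, float t)
{
    Vec4 r;
    r.x = a->x + t * (b->x - a->x);
    r.y = a->y + t * (b->y - a->y);
    r.z = a->z + t * (b->z - a->z);
    r.w = a->w + t * (b->w - a->w);
    return r;
}

/* Clips one triangle against the plane z >= 0 (assumed near plane).
   Writes 0, 3 or 6 vertices to out and returns the triangle count. */
int clip_near(const Vec4 tri[3], Vec4 out[6])
{
    Vec4 poly[4];
    int n = 0;

    for (int i = 0; i < 3; ++i) {
        const Vec4 *cur = &tri[i];
        const Vec4 *nxt = &tri[(i + 1) % 3];
        int cur_in = cur->z >= 0.0f;
        int nxt_in = nxt->z >= 0.0f;

        if (cur_in)
            poly[n++] = *cur;
        if (cur_in != nxt_in) {            /* edge crosses the plane */
            float t = cur->z / (cur->z - nxt->z);
            poly[n++] = lerp4(cur, nxt, t);
        }
    }

    if (n < 3)
        return 0;                          /* fully clipped away */

    out[0] = poly[0]; out[1] = poly[1]; out[2] = poly[2];
    if (n == 3)
        return 1;

    /* One vertex was outside: the quad becomes a second triangle. */
    out[3] = poly[0]; out[4] = poly[2]; out[5] = poly[3];
    return 2;
}
```

When the function returns 2, the second triangle (`out[3..5]`) is the one that would wait in the queue described above until three more clipped triangles arrive.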
**SSE Data Packing Stage 2**
This stage is a little different
from its earlier cousin; the data is already packed and swizzled,
ready to be used with SSE instructions. However, it is not
naturalized. Naturalization means converting a normalized value into
its natural range. A color value, for example, is normally represented
as a normalized value from 0 to 1.
The rasterizer, however, wants all color values to be between 0 and
255, so this stage of the pipeline takes the normalized vertex
components and naturalizes them, putting them in the range
usable by the scanline rasterizer.
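For color, naturalization boils down to a clamp and a scale. A minimal scalar sketch, with a hypothetical `Color` type and helper names (the real stage would do this four values at a time with SSE):

```c
typedef struct { float r, g, b, a; } Color;

static float clamp01(float v)
{
    if (v < 0.0f) return 0.0f;
    if (v > 1.0f) return 1.0f;
    return v;
}

/* Maps each normalized 0..1 channel into the 0..255 range that the
   rasterizer expects. Values outside 0..1 are clamped first. */
Color naturalize_color(Color c)
{
    Color out;
    out.r = clamp01(c.r) * 255.0f;
    out.g = clamp01(c.g) * 255.0f;
    out.b = clamp01(c.b) * 255.0f;
    out.a = clamp01(c.a) * 255.0f;
    return out;
}
```

Whether the actual pipeline clamps here or elsewhere isn't stated in the text; the clamp is included only so the sketch never produces values outside 0 to 255.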
**Lighting**
At this stage, standard lighting
equations are applied: diffuse, ambient, and specular.
Actually, VSG doesn't support specular lighting at
this time. :)
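The text doesn't give the exact equations, so as a stand-in here is the textbook ambient plus Lambert diffuse term for a single directional light. It assumes both vectors are already normalized, and the function name and parameters are hypothetical.

```c
typedef struct { float x, y, z; } Vec3;

/* Returns a light intensity in 0..1 from an ambient term plus a Lambert
   diffuse term. n is the vertex normal and to_light points from the
   vertex toward the light; both are assumed to be unit length. */
float light_vertex(Vec3 n, Vec3 to_light, float ambient, float diffuse)
{
    float ndotl = n.x * to_light.x + n.y * to_light.y + n.z * to_light.z;
    if (ndotl < 0.0f)
        ndotl = 0.0f;                  /* surface faces away from the light */

    float intensity = ambient + diffuse * ndotl;
    if (intensity > 1.0f)
        intensity = 1.0f;
    return intensity;
}
```

A vertex facing the light head-on receives `ambient + diffuse`; one facing away receives only the ambient term.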
**Projection**
Vertices are projected to screen
space by dividing x and y by w.
vertex.screenpos.x = vertex.worldpos.x / vertex.worldpos.w
vertex.screenpos.y = vertex.worldpos.y / vertex.worldpos.w
**Sort Vertices**
The vertices must be sorted so that
the top-most vertex is first in memory and the bottom-most
vertex is last. This is an optimization which reduces decision
logic in the triangle rasterizer by guaranteeing a top-to-bottom
drawing order, but it does create a huge bubble in the triangle
processing pipeline.
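Sorting three vertices needs at most three compare-and-swap steps. The sketch below assumes screen-space y grows downward, so the top-most vertex has the smallest y; `ScreenVertex` and `sort_by_y` are names invented for the example.

```c
typedef struct { float x, y; } ScreenVertex;

static void swap_if_lower(ScreenVertex *a, ScreenVertex *b)
{
    if (b->y < a->y) {
        ScreenVertex tmp = *a;
        *a = *b;
        *b = tmp;
    }
}

/* Orders v[0..2] top to bottom (ascending y) with three
   compare-and-swap steps. */
void sort_by_y(ScreenVertex v[3])
{
    swap_if_lower(&v[0], &v[1]);
    swap_if_lower(&v[1], &v[2]);   /* largest y is now in v[2] */
    swap_if_lower(&v[0], &v[1]);
}
```

The data-dependent branches are a poor fit for a pipeline that processes 4 triangles in lockstep, which is one plausible reading of the "bubble" mentioned above.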
**Rasterize Triangles**
Here, triangles are broken down into
3 edges; stepping values for each edge and scanline are
calculated (and pre-stepped; special thanks to
Chris Hecker for that tip) and scanlines are drawn by sending
them to the scanline rasterizer.
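Pre-stepping means advancing an edge from its real starting y to the first scanline it actually touches before any stepping begins. A minimal sketch for a single edge follows; the `Edge` struct, the helper names, and the rule that scanline `y` is covered when `y0 <= y < y1` are assumptions made for illustration.

```c
typedef struct {
    float x;        /* x position on the first covered scanline */
    float dxdy;     /* change in x per scanline                  */
    int   y_start;  /* first covered scanline                    */
    int   y_end;    /* one past the last covered scanline        */
} Edge;

/* Ceiling without pulling in the math library. */
static int ceil_to_int(float v)
{
    int i = (int)v;                     /* truncates toward zero */
    return (v > (float)i) ? i + 1 : i;
}

/* Sets up the edge from (x0, y0) to (x1, y1), with y0 <= y1. */
Edge setup_edge(float x0, float y0, float x1, float y1)
{
    Edge e;
    float dy = y1 - y0;

    e.y_start = ceil_to_int(y0);
    e.y_end   = ceil_to_int(y1);
    e.dxdy    = (dy > 0.0f) ? (x1 - x0) / dy : 0.0f;

    /* Pre-step: move x from y0 down to the first whole scanline. */
    e.x = x0 + ((float)e.y_start - y0) * e.dxdy;
    return e;
}
```

After setup, the rasterizer can add `dxdy` to `x` once per scanline from `y_start` up to, but not including, `y_end`. The same pre-step would apply to every interpolated component, not only x.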
**Rasterize Scanlines**
A scanline is represented as a
starting point, its length (run), and a stepping value for each
component. A scanline arrives at this stage in
the pipeline ready to be drawn. The stepping values are used to
increment each component as the rasterizer draws from left to
right. The scanline rasterizer is responsible for a number of
things, including texturing, shading, fog, and z-buffering. There is a built-in optimization here that can
trivially reject scanlines or shorten them depending on the results
of an early z-buffer compare. Perspective correction is
applied to all vertex components, not just the texture UVs.
All pixel values are computed and
stored internally in the CPU cache until they are ready to be
flushed. This is when the final z-compare and subsequent
z-write occur.
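The stepping and perspective correction can be sketched for a single texture coordinate as shown below. This is a simplification, not VSG's rasterizer: it does the z-compare and write per pixel instead of deferring them to a flush, skips the early-z trimming, assumes that a smaller z is nearer, and writes the interpolated `u` to a float buffer in place of a texture lookup. All names are hypothetical.

```c
typedef struct {
    int   x, y, run;        /* start pixel and length            */
    float z, dz;            /* depth and its per-pixel step      */
    float u_over_w;         /* u / w at the start pixel          */
    float du_over_w;        /* per-pixel step of u / w           */
    float inv_w;            /* 1 / w at the start pixel          */
    float dinv_w;           /* per-pixel step of 1 / w           */
} Scanline;

/* Draws one scanline into a width-pixel-wide z buffer and u buffer.
   Returns the number of pixels that passed the depth test. */
int draw_scanline(const Scanline *s, float *zbuf, float *ubuf, int width)
{
    float z = s->z, uw = s->u_over_w, iw = s->inv_w;
    float *zrow = zbuf + s->y * width + s->x;
    float *urow = ubuf + s->y * width + s->x;
    int written = 0;

    for (int i = 0; i < s->run; ++i) {
        if (z < zrow[i]) {           /* assumed: smaller z is nearer */
            zrow[i] = z;
            urow[i] = uw / iw;       /* perspective-correct u        */
            ++written;
        }
        /* u/w and 1/w are linear in screen space, so they can be stepped. */
        z  += s->dz;
        uw += s->du_over_w;
        iw += s->dinv_w;
    }
    return written;
}
```

Any other component (color, fog, a second texture coordinate) would follow the same pattern: step the value divided by w, then divide by the stepped 1/w at each pixel.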