Monochrome Polygons can also reach 66Mpix/s; however, the speed drops to 33Mpix/s for Polygons with Texture or Gouraud shading (that's an apparent design mistake that can make the GPU twice as slow as needed, although it isn't as bad as it sounds because other overload may outweigh that mistake).
Overload can include cache misses, semi-transparency, per-scanline overload (most noticeable in short scanlines), precalculation (most noticeable in small polygons), and collisions with memory refresh and display fetching. The exact overload depends on the scenery; the average rendering speed might be around 6-11Mpix/s.
To avoid overload, it may be recommended to omit small polygons (or to use a reduced polygon count for more distant objects).
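As a rough cross-check of the 66Mpix/s and 33Mpix/s figures, here is a minimal sketch that converts a per-pixel cost into a peak fill rate. It assumes the round 33MHz clock quoted in these notes and ignores every kind of overload; the function name and constant are made up for illustration.

```python
# Round clock figure as quoted in the notes ("33MHz"); an assumption of this sketch.
GPU_CLOCK_HZ = 33_000_000

def peak_mpix_per_s(clks_per_pixel, clock_hz=GPU_CLOCK_HZ):
    """Peak fill rate in Mpix/s for a given per-pixel cost, ignoring all overload."""
    return clock_hz / clks_per_pixel / 1_000_000

print(peak_mpix_per_s(0.50))  # monochrome polygon (0.50 clks per pixel)
print(peak_mpix_per_s(1.00))  # textured or gouraud polygon (1.00 clks per pixel)
```

The 6-11Mpix/s average mentioned above is far below either peak, which is why the per-scanline and precalc overload matters more than the per-pixel cost for small polygons.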
_________________________________ Rectangles _________________________________
GP0(60h..7Fh) Rectangles
1.00 clks per any-scanline (even if outside of draw area y1,y2)
5.00 clks per scanline ;'
0.50 clks per pixel, when non-fully-transp ;'without ;
0.25 clks per pixel, when fully-transp ;/semi-transp ; Old GPU
1.00 clks per pixel-read, when non-fully-transp ;'with ;
0.00 clks per pixel-read, when fully-transp ; semi ;
0.50 clks per pixel-write ;/ ;/
3.50 clks per scanline, when width=1 ;' ;'
2.50 clks per scanline, when width=2..3 ; without ;
2.00 clks per scanline, when width=4..5 ; semi-transp ; New GPU
1.50 clks per scanline, when width=6..7 ; ;
1.00 clks per scanline, when width>=8 ;/ ;
6.00 clks per scanline ;'with ;
3.75 clks per 16pix chunk ;/semi-transp ;
0.50 clks per pixel, when non-fully-transp ;
0.25 clks per pixel, when fully-transp (color 0000h) ;/
and... extra time per cache-misses (when textured)
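The Old GPU rows above can be combined into a rough estimate for a plain rectangle. This is only a sketch under assumptions: untextured, no semi-transparency, all scanlines inside the draw area, even widths only (so no rounding to pixel-pairs is needed), and no refresh or display collisions. The function name is hypothetical.

```python
def rect_clks_old_gpu(width, height):
    """Estimated clks for an opaque, untextured rectangle on the Old GPU."""
    if width % 2:
        raise ValueError("sketch covers even widths only (pair rounding not modelled)")
    per_scanline = 1.00 + 5.00  # "any-scanline" cost plus Old GPU per-scanline cost
    per_pixel = 0.50            # non-fully-transparent pixel, without semi-transp
    return height * (per_scanline + width * per_pixel)

print(rect_clks_old_gpu(16, 16))  # 16 * (6 + 8)
print(rect_clks_old_gpu(64, 64))  # 64 * (6 + 32)
```

For the 16x16 case the per-scanline part (96 clks) is almost as large as the pixel part (128 clks), which illustrates why short scanlines are comparatively expensive.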
The timing for number of pixels per scanline is rounded to pixel-pairs,
generally that is "width AND NOT 1" (except: on Old GPU for pixel-reads it is
"num+1 AND NOT 1").
__________________________________ Polygons __________________________________
GP0(20h..3Fh) Polygons
Polygon timings consist of precalculation and rendering phases:
3-point Poly (one triangle) --> precalc + render
4-point Poly (two triangles) --> precalc + render + precalc + render
The precalc computes the fixed point steps for screen coords (and texcoords and
gouraud RGB values, if any). The precalc can be quite slow, however, it can be
done during the rendering phase of the previous command/triangle (so the
precalc won't take up any extra time, provided that the previous rendering
phase takes long enough to finish the precalc in parallel).
10.00 clks per triangle, base cycles ;'
90.00 clks per triangle, extra cycles when textured ; precalc
150.00 clks per triangle, extra cycles when gouraud shaded ;/
1.00 clks per any-scanline (even if outside of draw area y1,y2)
4.75 clks per scanline, without semi ;'
5.50 clks per scanline, with semi ;
3.00 clks per pixel, with semi, with gouraud/texture ; Old GPU
1.50 clks per pixel, with semi, without gouraud/texture ;
1.00 clks per pixel, without semi, with gouraud/texture ;
0.50 clks per pixel, without semi, without gouraud/texture;/
1.00 clks per scanline, without semi ;'
2.00 clks per scanline, with semi, with gouraud/texture ;
5.25 clks per scanline, with semi, without gouraud/texture; New GPU
2.00 clks per scanline, when width=1 ;' ;
1.50 clks per scanline, when width=2 ; without ;
1.25 clks per scanline, when width=3 ; semi-transp, ;
0.75 clks per scanline, when width=4..5 ; and without ;
0.50 clks per scanline, when width=6..7 ; gouraud/tex ;
0.00 clks per scanline, when width>=8 ;/ ;
3.75 clks per 16pix chunk, with semi, without gouraud/tex ;
1.00 clks per pixel, with gouraud/texture ;
0.50 clks per pixel, without gouraud/texture ;/
and... extra time per cache-misses (when textured)
XXX probably has faster cases for fully-transp pixels, alike Rectangles?
There's one oddity: 0-pixel polygons can be slower than 1-pixel polygons
(maybe the precalculation timings get messed up when dividing by size=0).
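The precalc/render overlap described for Polygons can be modelled roughly as follows. This is a sketch, not a description of the actual hardware pipeline: it assumes the precalc of a triangle starts together with the rendering phase of the previous triangle, and that only the part of the precalc that outlasts that rendering phase costs extra time. The render times passed in are placeholders, and the function names are invented.

```python
def precalc_clks(textured=False, gouraud=False):
    """Precalc cost per triangle, from the table above."""
    return 10.0 + (90.0 if textured else 0.0) + (150.0 if gouraud else 0.0)

def total_clks(triangles):
    """triangles: list of (precalc, render) pairs in draw order.

    Each precalc is assumed to run in parallel with the previous render phase.
    """
    total = 0.0
    prev_render = 0.0
    for precalc, render in triangles:
        total += max(0.0, precalc - prev_render)  # exposed (non-hidden) precalc time
        total += render
        prev_render = render
    return total

pc = precalc_clks(textured=True, gouraud=True)  # 250 clks
print(total_clks([(pc, 100.0), (pc, 100.0)]))   # small triangles: precalc dominates
print(total_clks([(pc, 400.0), (pc, 400.0)]))   # large triangles: second precalc is hidden
```

In this model, two small textured gouraud triangles spend most of their time in precalc, whereas for large ones only the very first precalc is visible.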
___________________________________ Lines ____________________________________
GP0(40h..5Fh) Lines
Line timings consist of precalculation and rendering phases (similar to
Polygons). Lines aren't particularly fast, the per-pixel timing isn't optimal,
and non-horizontal lines do additionally have a high per-scanline overload
(versus few pixels per scanline). Nonetheless, Lines may be faster than large
filled shapes (that require more pixels).
40.00 clks per line segment, base cycles ;'precalc
60.00 clks per line segment, extra cycles when gouraud shaded ;/
1.00 clks per pixel, base time for horizontal lines ;'
2.00 clks per pixel, base time for non-horizontal lines ; pixels
2.00 clks per pixel, extra when Old GPU with semi-transp ;/
0.00 clks per offscreen-scanline (outside of draw area y1,y2)
0.00 clks per scanline, when 0..5' (flat) ;'
?? clks per scanline, when 5..44' ; Old GPU
5.00 clks per scanline, when 45..90' (steep) ;/
2.00 clks per scanline, when 0..30' (flat) ;'
2.00 clks per scanline, when 40' without semi-transp ;
?? clks per scanline, when 40' with semi-transp ; New GPU
2.50 clks per scanline, when 45..90' (steep) without semi ;
5.50 clks per scanline, when 45..90' (steep) with semi-transp ;/
______________________________ Memory Transfers ______________________________
GP0(02h) VRAM Fill
The VRAM chips have a special hardware feature for writing the fill value to
several pixels at once, that's making the fill command very fast (about 0.0625
clks per 16bit pixel). The overall timings consist of the pixel time plus some
scanline overload:
1.00 clks per 16 pixels
7.00 clks per scanline, when xsiz>0, on Old GPU
5.00 clks per scanline, when xsiz>0, on New GPU
1.00 clks per scanline, when xsiz=0, should never happen in practice
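The fill timings above can be put into a small estimator. This is a sketch: it assumes that a partial group of 16 pixels is charged as a full one (rounding xsiz up), and it ignores refresh and display collisions. The function name is hypothetical.

```python
import math

def fill_clks(xsiz, ysiz, new_gpu=False):
    """Estimated clks for GP0(02h) VRAM Fill of xsiz * ysiz pixels."""
    if xsiz == 0:
        return ysiz * 1.00                 # degenerate case from the table
    per_scanline = 5.00 if new_gpu else 7.00
    chunks = math.ceil(xsiz / 16)          # assumption: partial chunks count as full
    return ysiz * (chunks * 1.00 + per_scanline)

print(fill_clks(320, 240))                # Old GPU: 240 * (20 + 7)
print(fill_clks(320, 240, new_gpu=True))  # New GPU: 240 * (20 + 5)
```

Even for a full 320x240 clear, the per-scanline overload is a noticeable share of the total because the per-pixel part is so cheap.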
GP0(80h) VRAM-to-VRAM Copy
1.50 clks per pixel, without Mask Check ;'
2.50 clks per pixel, with Mask Check ; Old GPU
2.50 clks per scanline ;/
1.25 clks per pixel, without Mask Check ;'
1.50 clks per pixel, with Mask Check, when width<16 ;
1.00 clks per pixel, with Mask Check, when width>=16 ; New GPU
19.50 clks per scanline, without Mask Check ;
22.25 clks per scanline, with Mask Check, when width<16 ;
25.50 clks per 16pix chunk, with Mask Check, when width>=16 ;/
Like many other commands, data is transferred in Pixel-Pairs (albeit oddly,
this command seems to use pairs that begin on odd X texcoords).
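For the Old GPU, the VRAM-to-VRAM figures boil down to a per-pixel and a per-scanline term. A minimal sketch, assuming even widths and ignoring the odd-X pixel-pair alignment mentioned above (the New GPU rows depend on width and chunking and aren't modelled here); the function name is made up:

```python
def vram_copy_clks_old_gpu(width, height, mask_check=False):
    """Estimated clks for GP0(80h) VRAM-to-VRAM Copy on the Old GPU."""
    per_pixel = 2.50 if mask_check else 1.50
    per_scanline = 2.50
    return height * (width * per_pixel + per_scanline)

print(vram_copy_clks_old_gpu(64, 64))                   # 64 * (96 + 2.5)
print(vram_copy_clks_old_gpu(64, 64, mask_check=True))  # 64 * (160 + 2.5)
```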
GP0(A0h) CPU-to-VRAM Copy
The timings do mainly rely on the transfer rate on CPU side, the transfer
should be usually done via DMA (manually polling the I/O ports would be about
10x slower). The DMA timings depend on the DRQ mode and DMA blocksize (plus
additional slowdown if other DMA channels are simultaneously active). The
following timings per 16bit pixel are possible with DMA:
1.00 clks per pixel, without Mask Check ;'Old GPU
1.50 clks per pixel, with Mask Check ;/
1.00 clks per pixel, with/without Mask Check ;-New GPU
plus some overload per scanline/chunk
GP0(C0h) VRAM-to-CPU Copy
Again, the timings do mainly rely on the transfer rate on CPU side. The
following timings per 16bit pixel are possible with DMA:
1.50 clks per pixel ;-Old GPU
1.00 clks per pixel ;-New GPU
plus some overload per scanline/chunk (especially on New GPU)
____________________________ Additional Overload _____________________________
Cache Misses (for textured Rects/Polys)
CLUT Cache misses take 1.00 clks per halfword. That is,
16.00 clks when loading 16-color CLUT (color 00h..0Fh)
256.00 clks when loading 256-color CLUT (color 00h..FFh)
Texture cache misses do theoretically take 8.00 clks (2.00 clks per halfword).
However, there appears to be some overload resulting in timings like this:
8.25 clks for 16bpp, with/without semi-transparency ;-without CLUT
10.00 clks for 4bpp/8bpp, without semi-transparency ;'with CLUT
11.00 clks for 4bpp/8bpp, with semi-transparency ;/
There are cases where cache misses are triggering yet more overload (maybe
because cache loading can increase memory refresh collisions).
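The cache miss figures can be tabulated like this. It is only a sketch: the number of misses depends on the texture access pattern and has to be supplied by the caller, and the "yet more overload" cases mentioned above are not covered. All function names are hypothetical.

```python
def clut_load_clks(bpp):
    """CLUT cache miss cost: 1.00 clks per halfword (none for 16bpp)."""
    if bpp == 4:
        return 16 * 1.00
    if bpp == 8:
        return 256 * 1.00
    if bpp == 16:
        return 0.0
    raise ValueError("bpp must be 4, 8 or 16")

def texture_miss_clks(bpp, semi_transp=False):
    """Measured cost per texture cache miss, from the table above."""
    if bpp == 16:
        return 8.25
    if bpp in (4, 8):
        return 11.00 if semi_transp else 10.00
    raise ValueError("bpp must be 4, 8 or 16")

def cache_overload_clks(bpp, texture_misses, clut_reloads=0, semi_transp=False):
    """Extra clks caused by the given number of texture misses and CLUT reloads."""
    return (texture_misses * texture_miss_clks(bpp, semi_transp)
            + clut_reloads * clut_load_clks(bpp))

print(cache_overload_clks(4, texture_misses=20, clut_reloads=1))  # 20 * 10 + 16
print(cache_overload_clks(8, texture_misses=20, clut_reloads=1))  # 20 * 10 + 256
```

The second call shows that a single 256-color CLUT reload can cost more than twenty texture cache misses.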
Semi-Transparency and Mask Check
These features require reading old pixels from framebuffer. Both
Semi-Transparency and Mask Check have the same overload (in the above tables,
the timings for "semi-transp" do also apply when using mask check). These
timings do always apply when using semi-transparent commands (no matter if the
actual texture pixels have the semi-transparent flag in color.bit15 set or
cleared).
Memory Refresh
The Old GPU appears to require a lot of refresh cycles, resulting in an average
overload of 2% to be added to the overall rendering time (albeit with rather
unstable results, the rendering time for the exact same polygon can vary
greatly depending on whether it's colliding with refresh cycles or not).
The New GPU requires very few refresh cycles, resulting in little overload and
much more stable timings. However, setting GP0(03h) to nonstandard values may
(or may not) slow down rendering:
Rect 3x128 rendering with GP0(03h)=0Fh is about 4x slower than GP0(03h)=0
Rect 512x128 rendering time is constant, no matter of GP0(03h)
(in the latter case, refresh probably occurs inbetween write buffer draining)
Vblank vs Display Area
The full rendering speed can be reached only during Vblank. There will be
additional overload outside of Vblank, the exact overload depends on the size
of the Display Area and on the horizontal resolution (dotclock).
XXX Todo: add some examples for Rects/Polys at different resolutions
Note: Trying to use GP1(03h) to disable display does NOT disable the overload
(the pixel VRAM fetches do still occur despite not actually displaying those
pixels). However, setting Display Area Y2=Y1+1 will cause the GPU to be
(almost) always in Vblank, and thus disable most of the overload.
Note: The Old GPU is using Dual-ported VRAM chips (with two data busses) which
do theoretically allow to fetch pixels during rendering (unknown if that's
actually working free of overload).
___________________________________ Notes ____________________________________
Pixel-Pairs
Most rendering commands are capable of doing two 16bit pixels as a single 32bit
pixel-pair (except, the Line command doesn't seem to do that). For example,
Rects with 3x200 or 4x200 pixels are both having the same rendering time (at
least so with even screen X coords, the timing is probably different on odd
screen X coords).
16-pixel Buffer
The New GPU is rendering most scanlines in 16-pix chunks, that new feature is
usually making it faster than Old GPU (although, it can also be slower for
short scanlines).
The feature is probably related to the Write Buffer (see GPU Texture Cache
chapter), and accordingly the timings do probably differ for odd/even screen X
coords, and with the first chunk being 15-17 pixels wide instead of 16.
Oscillators
All rendering timings seem to be bound to the 33MHz CPU clock (aka derived from
the 67MHz oscillator). The 53MHz GPU clock doesn't seem to affect GPU rendering
timings at all (it appears to be solely used to generate the PAL/NTSC color
clock & horizontal resolution dotclock; the latter might indirectly affect
rendering timings when rendering is colliding with display fetches).
Notes
The above timings are hopefully more or less correct. But they might include
some measurement inaccuracies and some other mistakes; it can be difficult to
tell apart which factors (or combination of factors) are causing which delays.
Timings like 0.50 or 0.25 clks may rely on the GPU being clocked at twice
33MHz, and on rendering pixel-pairs; some odd timings may also rely on things
like average refresh collisions.
In particular, I've been wondering if the timing emulation will cause the frame rate to drop in some games. For example, Spiderman has quite high GPU load (it's using 25fps, with a high horizontal resolution of 512 pixels), but apparently real hardware is actually capable of rendering that many pixels.
Interestingly, the game requires up to 27M clks/s on NewGPU, and 32M clks/s on OldGPU, which is very close to the maximum of 33M clks/s. So the game engine or level design must be somehow aware of how many polygons it could display at any viewing angle, with any combination of cache hits and misses, without ever dropping below 25fps.
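To put those numbers in perspective, the clock budget per frame and the resulting load can be worked out as below. The sketch uses the round 33M clks/s figure from above and the 27M/32M measurements as given; the function names are made up.

```python
CLOCK_HZ = 33_000_000  # round figure used in the text

def clks_per_frame(fps, clock_hz=CLOCK_HZ):
    """Clock budget available for one frame at the given frame rate."""
    return clock_hz / fps

def gpu_load(required_clks_per_s, clock_hz=CLOCK_HZ):
    """Fraction of the available clks that the game needs."""
    return required_clks_per_s / clock_hz

print(clks_per_frame(25))                # budget per frame at 25fps
print(round(gpu_load(27_000_000) * 100)) # percent load on New GPU
print(round(gpu_load(32_000_000) * 100)) # percent load on Old GPU
```

So at 25fps there are about 1.32M clks per frame, and the Old GPU case leaves only around 3% headroom.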