GPU Rendering Timings

Post by **nocash** » February 9th, 2024, 12:17 am

Here are the GPU rendering timings. All values are measured in 33MHz CPU clock cycles. The GPU can render up to 66 million pixels per second; that figure applies to monochrome Rects, textured Rects, and even Rects with texture blending.
Monochrome Polygons can also reach 66Mpix/s; however, the speed drops to 33Mpix/s for Polygons with Texture or Gouraud shading (that's an apparent design mistake that can make the GPU twice as slow as needed, although it isn't as bad as it sounds because other overload may outweigh that mistake).
Overload can include cache misses, semi-transparency, per-scanline overload (most noticeable in short scanlines), precalculation (most noticeable in small polygons), and collisions with memory refresh and display fetching. The exact overload depends on the scenery; the average rendering speed might be around 6-11Mpix/s.
To avoid overload, it may be recommended to omit small polygons (or to use a reduced polygon count for more distant objects).
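
As a rough illustration of where the 66Mpix/s and 33Mpix/s figures come from, here is a small Python sketch that converts a per-pixel cost in clks into a peak fill rate. The constant CPU_CLOCK_HZ and the function name are my own; the clock value assumes the nominal 33.8688MHz CPU clock, and the result ignores every kind of overload.

```python
# Peak fill rate from per-pixel cost (sketch, ignores all overload).
# Assumption: nominal PS1 CPU clock of 33.8688 MHz.
CPU_CLOCK_HZ = 33_868_800

def pixels_per_second(clks_per_pixel: float) -> float:
    """Return the theoretical pixels/second for a given clks-per-pixel cost."""
    return CPU_CLOCK_HZ / clks_per_pixel

# 0.50 clks/pixel: monochrome Rects/Polys; 1.00 clks/pixel: textured/gouraud Polys
for label, cost in [("0.50 clks/pixel", 0.50), ("1.00 clks/pixel", 1.00)]:
    print(f"{label}: {pixels_per_second(cost) / 1e6:.1f} Mpix/s")
```

The two results land slightly above the rounded 66 and 33 figures because the clock is a bit faster than 33.0MHz.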


 _________________________________ Rectangles _________________________________

GP0(60h..7Fh) Rectangles
  1.00 clks     per any-scanline (even if outside of draw area y1,y2)
  5.00 clks     per scanline                                    ;'
  0.50 clks     per pixel, when non-fully-transp ;'without      ;
  0.25 clks     per pixel, when fully-transp     ;/semi-transp  ; Old GPU
  1.00 clks     per pixel-read, when non-fully-transp  ;'with   ;
  0.00 clks     per pixel-read, when fully-transp      ; semi   ;
  0.50 clks     per pixel-write                        ;/       ;/
  3.50 clks     per scanline, when width=1      ;'              ;'
  2.50 clks     per scanline, when width=2..3   ; without       ;
  2.00 clks     per scanline, when width=4..5   ; semi-transp   ; New GPU
  1.50 clks     per scanline, when width=6..7   ;               ;
  1.00 clks     per scanline, when width>=8     ;/              ;
  6.00 clks     per scanline                    ;'with          ;
  3.75 clks     per 16pix chunk                 ;/semi-transp   ;
  0.50 clks     per pixel, when non-fully-transp                ;
  0.25 clks     per pixel, when fully-transp (color 0000h)      ;/
  and...        extra time per cache-misses (when textured)
The timing for number of pixels per scanline is rounded to pixel-pairs,
generally that is "width AND NOT 1" (except: on Old GPU for pixel-reads it is
"num+1 AND NOT 1").

 __________________________________ Polygons __________________________________

GP0(20h..3Fh) Polygons
Polygon timings consist of precalculation and rendering phases:
  3-point Poly (one triangle)  --> precalc + render
  4-point Poly (two triangles) --> precalc + render + precalc + render
The precalc computes the fixed point steps for screen coords (and texcoords and
gouraud RGB values, if any). The precalc can be quite slow, however, it can be
done during the rendering phase of the previous command/triangle (so the
precalc won't take up any extra time, provided that the previous rendering
phase takes long enough to finish the precalc in parallel).
  10.00 clks    per triangle, base cycles                       ;'
  90.00 clks    per triangle, extra cycles when textured        ; precalc
  150.00 clks   per triangle, extra cycles when gouraud shaded  ;/
  1.00 clks     per any-scanline (even if outside of draw area y1,y2)
  4.75 clks     per scanline, without semi                      ;'
  5.50 clks     per scanline, with semi                         ;
  3.00 clks     per pixel, with semi, with gouraud/texture      ; Old GPU
  1.50 clks     per pixel, with semi, without gouraud/texture   ;
  1.00 clks     per pixel, without semi, with gouraud/texture   ;
  0.50 clks     per pixel, without semi, without gouraud/texture;/
  1.00 clks     per scanline, without semi                      ;'
  2.00 clks     per scanline, with semi, with gouraud/texture   ;
  5.25 clks     per scanline, with semi, without gouraud/texture; New GPU
  2.00 clks     per scanline, when width=1      ;'              ;
  1.50 clks     per scanline, when width=2      ; without       ;
  1.25 clks     per scanline, when width=3      ; semi-transp,  ;
  0.75 clks     per scanline, when width=4..5   ; and without   ;
  0.50 clks     per scanline, when width=6..7   ; gouraud/tex   ;
  0.00 clks     per scanline, when width>=8     ;/              ;
  3.75 clks     per 16pix chunk, with semi, without gouraud/tex ;
  1.00 clks     per pixel, with gouraud/texture                 ;
  0.50 clks     per pixel, without gouraud/texture              ;/
  and...        extra time per cache-misses (when textured)
XXX probably has faster cases for fully-transp pixels, like Rectangles?
There's one oddity: 0-pixel polygons can be slower than 1-pixel polygons
(maybe the precalculation timings get messed up when dividing by size=0).
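
The precalc/render overlap described above can be sketched as a tiny model. This is only an approximation under stated assumptions: the textured and gouraud extras are taken to add up, the precalc of a triangle is assumed to run entirely in parallel with the rendering of the previous triangle, and the render_clks values are inputs you would have to estimate from the per-scanline/per-pixel table yourself. The function names are hypothetical.

```python
def precalc_clks(textured: bool, gouraud: bool) -> float:
    """Precalc cost per triangle (assumes the two extras are additive)."""
    return 10.0 + (90.0 if textured else 0.0) + (150.0 if gouraud else 0.0)

def total_clks(triangles) -> float:
    """triangles: list of (textured, gouraud, render_clks) tuples, in draw order.

    Only the part of a precalc that does not fit into the previous
    triangle's rendering phase is counted as extra (stall) time.
    """
    total = 0.0
    prev_render = 0.0  # nothing is rendering before the first triangle
    for textured, gouraud, render_clks in triangles:
        stall = max(0.0, precalc_clks(textured, gouraud) - prev_render)
        total += stall + render_clks
        prev_render = render_clks
    return total

# A textured gouraud quad = two triangles
small_quad = [(True, True, 100.0), (True, True, 100.0)]
large_quad = [(True, True, 400.0), (True, True, 400.0)]
print(total_clks(small_quad))  # precalc dominates
print(total_clks(large_quad))  # second precalc is fully hidden
```

In this model the small quad spends most of its time in precalc, which is the reason small polygons are comparatively expensive.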

 ___________________________________ Lines ____________________________________

GP0(40h..5Fh) Lines
Line timings consist of precalculation and rendering phases (similar to
Polygons). Lines aren't particularly fast, the per-pixel timing isn't optimal,
and non-horizontal lines additionally have a high per-scanline overload
(versus few pixels per scanline). Nonetheless, Lines may be faster than large
filled shapes (that require more pixels).
  40.00 clks    per line segment, base cycles                       ;'precalc
  60.00 clks    per line segment, extra cycles when gouraud shaded  ;/
  1.00 clks     per pixel, base time for horizontal lines           ;'
  2.00 clks     per pixel, base time for non-horizontal lines       ; pixels
  2.00 clks     per pixel, extra when Old GPU with semi-transp      ;/
  0.00 clks     per offscreen-scanline (outside of draw area y1,y2)
  0.00 clks     per scanline, when 0..5' (flat)                     ;'
  ??   clks     per scanline, when 5..44'                           ; Old GPU
  5.00 clks     per scanline, when 45..90' (steep)                  ;/
  2.00 clks     per scanline, when 0..30' (flat)                    ;'
  2.00 clks     per scanline, when 40' without semi-transp          ;
  ??   clks     per scanline, when 40' with semi-transp             ; New GPU
  2.50 clks     per scanline, when 45..90' (steep) without semi     ;
  5.50 clks     per scanline, when 45..90' (steep) with semi-transp ;/

 ______________________________ Memory Transfers ______________________________

GP0(02h) VRAM Fill
The VRAM chips have a special hardware feature for writing the fill value to
several pixels at once, that's making the fill command very fast (about 0.0625
clks per 16bit pixel). The overall timings consist of the pixel time plus some
scanline overload:
  1.00 clks     per 16 pixels
  7.00 clks     per scanline, when xsiz>0, on Old GPU
  5.00 clks     per scanline, when xsiz>0, on New GPU
  1.00 clks     per scanline, when xsiz=0, should never happen in practice
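
A minimal sketch of the fill timing above, assuming the per-16-pixel cost and the per-scanline overload simply add up for each scanline, and that partial 16-pixel chunks count as full chunks. The function name and parameters are made up for illustration.

```python
import math

def vram_fill_clks(xsiz: int, ysiz: int, new_gpu: bool) -> float:
    """Estimated clks for GP0(02h) with a fill area of xsiz * ysiz pixels."""
    if xsiz == 0:
        # degenerate case from the table: 1.00 clks per scanline
        return 1.0 * ysiz
    per_scanline = 5.0 if new_gpu else 7.0
    chunks = math.ceil(xsiz / 16)  # 1.00 clks per 16 pixels
    return ysiz * (chunks * 1.0 + per_scanline)

print(vram_fill_clks(320, 240, new_gpu=False))  # Old GPU
print(vram_fill_clks(320, 240, new_gpu=True))   # New GPU
```

Note that for narrow fills the per-scanline overload outweighs the per-pixel time by far.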

GP0(80h) VRAM-to-VRAM Copy
  1.50 clks     per pixel, without Mask Check                    ;'
  2.50 clks     per pixel, with Mask Check                       ; Old GPU
  2.50 clks     per scanline                                     ;/
  1.25 clks     per pixel, without Mask Check                    ;'
  1.50 clks     per pixel, with Mask Check, when width<16        ;
  1.00 clks     per pixel, with Mask Check, when width>=16       ; New GPU
  19.50 clks    per scanline, without Mask Check                 ;
  22.25 clks    per scanline, with Mask Check, when width<16     ;
  25.50 clks    per 16pix chunk, with Mask Check, when width>=16 ;/
Like many other commands, data is transferred in Pixel-Pairs (albeit oddly,
this command seems to use pairs that begin on odd X texcoords).

GP0(A0h) CPU-to-VRAM Copy
The timings do mainly rely on the transfer rate on CPU side, the transfer
should be usually done via DMA (manually polling the I/O ports would be about
10x slower). The DMA timings depend on the DRQ mode and DMA blocksize (plus
additional slowdown if other DMA channels are simultaneously active). The
following timings per 16bit pixel are possible with DMA:
  1.00 clks     per pixel, without Mask Check                    ;'Old GPU
  1.50 clks     per pixel, with Mask Check                       ;/
  1.00 clks     per pixel, with/without Mask Check               ;-New GPU
  plus some overload per scanline/chunk

GP0(C0h) VRAM-to-CPU Copy
Again, the timings do mainly rely on the transfer rate on CPU side. The
following timings per 16bit pixel are possible with DMA:
  1.50 clks     per pixel                                        ;-Old GPU
  1.00 clks     per pixel                                        ;-New GPU
  plus some overload per scanline/chunk (especially on New GPU)

 ____________________________ Additional Overload _____________________________

Cache Misses (for textured Rects/Polys)
CLUT Cache misses take 1.00 clks per halfword. That is,
  16.00 clks    when loading 16-color CLUT (color 00h..0Fh)
  256.00 clks   when loading 256-color CLUT (color 00h..FFh)
Texture cache misses do theoretically take 8.00 clks (2.00 clks per halfword).
However, there appears to be some overload resulting in timings like this:
  8.25 clks     for 16bpp, with/without semi-transparency   ;-without CLUT
  10.00 clks    for 4bpp/8bpp, without semi-transparency    ;'with CLUT
  11.00 clks    for 4bpp/8bpp, with semi-transparency       ;/
There are cases where cache misses are triggering yet more overload (maybe
because cache loading can increase memory refresh collisions).

Semi-Transparency and Mask Check
These features require reading old pixels from framebuffer. Both
Semi-Transparency and Mask Check have the same overload (in the above tables,
the timings for "semi-transp" do also apply when using mask check). These
timings do always apply when using semi-transparent commands (no matter if the
actual texture pixels have the semi-transparent flag in color.bit15 set or
cleared).

Memory Refresh
The Old GPU appears to require a lot of refresh cycles, resulting in an average
overload of 2% to be added to the overall rendering time (albeit with rather
unstable results, the rendering time for the exact same polygon can vary
greatly depending on whether it's colliding with refresh cycles or not).
The New GPU requires very few refresh cycles, resulting in little overload and
much more stable timings. However, setting GP0(03h) to nonstandard values may
(or may not) slow down rendering:
  Rect 3x128    rendering with GP0(03h)=0Fh is about 4x slower than GP0(03h)=0
  Rect 512x128  rendering time is constant, no matter of GP0(03h)
  (in the latter case, refresh probably occurs in between write buffer draining)

Vblank vs Display Area
The full rendering speed can be reached only during Vblank. There will be
additional overload outside of Vblank, the exact overload depends on the size
of the Display Area and on the horizontal resolution (dotclock).
  XXX Todo: add some examples for Rects/Polys at different resolutions
Note: Trying to use GP1(03h) to disable display does NOT disable the overload
(the pixel VRAM fetches do still occur despite not actually displaying those
pixels). However, setting Display Area Y2=Y1+1 will cause the GPU to be
(almost) always in Vblank, and thus disable most of the overload.
Note: The Old GPU is using Dual-ported VRAM chips (with two data busses) which
do theoretically allow to fetch pixels during rendering (unknown if that's
actually working free of overload).

 ___________________________________ Notes ____________________________________

Pixel-Pairs
Most rendering commands are capable of doing two 16bit pixels as a single 32bit
pixel-pair (except, the Line command doesn't seem to do that). For example,
Rects with 3x200 or 4x200 pixels are both having the same rendering time (at
least so with even screen X coords, the timing is probably different on odd
screen X coords).

16-pixel Buffer
The New GPU is rendering most scanlines in 16-pix chunks, that new feature is
usually making it faster than Old GPU (although, it can also be slower for
short scanlines).
The feature is probably related to the Write Buffer (see GPU Texture Cache
chapter), and accordingly the timings do probably differ for odd/even screen X
coords, and with the first chunk being 15-17 pixels wide instead of 16.

Oscillators
All rendering timings seem to be bound to the 33MHz CPU clock (aka derived from
the 67MHz oscillator). The 53MHz GPU clock doesn't seem to affect GPU rendering
timings at all (it appears to be solely used to generate the PAL/NTSC color
clock & horizontal resolution dotclock; the latter might indirectly affect
rendering timings when rendering is colliding with display fetches).

Notes
The above timings are hopefully more or less correct. But they might include
some measurement inaccuracies and some other mistakes; it can be difficult to
tell apart which factors (or combination of factors) are causing which delays.
Timings like 0.50 or 0.25 clks may rely on the GPU being clocked at twice
33MHz, and on rendering pixel-pairs; some odd timings may also rely on things
like average refresh collisions.

Next no$psx update will emulate those timings, and display information about GPU clks in the I/O map status window.
In particular, I've been wondering if the timing emulation will cause the frame rate to drop in some games. For example, Spiderman has quite high GPU load (it's using 25fps, with a high horizontal resolution of 512 pixels), but apparently real hardware is actually capable of rendering that many pixels.
Interestingly, the game requires up to 27M clks/s on NewGPU, and 32M clks/s on OldGPU, which is very close to the maximum of 33M clks/s. So the game engine or level design must be somehow aware of how many polygons it could display at any viewing angle, with any combination of cache hits and misses, without ever dropping below 25fps.
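
To put those numbers in perspective, here is the same arithmetic as a small Python sketch. The names are hypothetical, and the clock value assumes the nominal 33.8688MHz CPU clock rather than a rounded 33MHz.

```python
# Frame budget arithmetic (sketch).
# Assumption: nominal PS1 CPU clock of 33.8688 MHz.
CPU_CLOCK_HZ = 33_868_800

def clks_per_frame(fps: float) -> float:
    """GPU clks available per frame at the given frame rate."""
    return CPU_CLOCK_HZ / fps

def gpu_load(clks_per_second: float) -> float:
    """Fraction of the available clks that rendering uses up."""
    return clks_per_second / CPU_CLOCK_HZ

print(f"budget at 25fps: {clks_per_frame(25):.0f} clks per frame")
print(f"NewGPU load: {gpu_load(27e6):.0%}")
print(f"OldGPU load: {gpu_load(32e6):.0%}")
```

So on OldGPU there is only a few percent of headroom left before a frame would be missed.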

Post by **nocash** » February 11th, 2024, 2:20 am

And roughly related: The GPU FIFO behaviour. The FIFO itself isn't so special, but the timings for removing and processing data from FIFO are quite complicated...


The GPU uses a 16-word (64-byte) write FIFO for sending commands and parameters
(and VRAM data) from CPU to GPU.

FIFO Overrun
Internally, the FIFO is using 5bit counters for computing the FIFO size:
  fifo_rd_index = (rd_ptr) AND 0Fh                                ;'4bit range
  fifo_wr_index = (wr_ptr) AND 0Fh                                ;/
  fifo_size     = (wr_ptr-rd_ptr) AND 1Fh                         ;-5bit range
  fifo_not_more_than_half_full = cleared when fifo_size=09h..1Fh  ;'drq flags
  fifo_empty                   = cleared when fifo_size=01h..1Fh  ;/
The 5bit size should be in range 0..10h, however it can grow as large as 1Fh in
case of overruns. Overruns do overwrite an old FIFO entry while appending the
newly written value, and the newly written value will be then processed twice
(once in place of the overwritten old value, and once in place of the newly
appended value). If the size grows bigger than 1Fh then it will wrap to 00h,
causing the FIFO to become empty.
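
The counter behaviour above can be modelled in a few lines of Python. This is a toy model and not the actual hardware logic: the class and attribute names are made up, it only reproduces the 4bit index / 5bit size arithmetic, and it does not model when the GPU removes words.

```python
class GpuFifoModel:
    """Toy model of the 16-word FIFO with 5bit read/write pointers."""

    def __init__(self):
        self.slots = [0] * 16
        self.rd_ptr = 0
        self.wr_ptr = 0

    @property
    def size(self) -> int:
        return (self.wr_ptr - self.rd_ptr) & 0x1F      # 5bit range

    @property
    def empty(self) -> bool:
        return self.size == 0                          # cleared for 01h..1Fh

    @property
    def not_more_than_half_full(self) -> bool:
        return self.size <= 8                          # cleared for 09h..1Fh

    def write(self, word: int) -> None:
        # no overrun check: a write to a full FIFO overwrites an old entry
        self.slots[self.wr_ptr & 0x0F] = word          # 4bit index
        self.wr_ptr = (self.wr_ptr + 1) & 0x1F

    def read(self) -> int:
        # no underrun check: the caller is expected to test 'empty' first
        word = self.slots[self.rd_ptr & 0x0F]          # 4bit index
        self.rd_ptr = (self.rd_ptr + 1) & 0x1F
        return word


fifo = GpuFifoModel()
for word in range(16):
    fifo.write(word)          # FIFO is now full (size=10h)
fifo.write(0x99)              # overrun: overwrites the oldest entry
print(hex(fifo.size))         # 0x11
print([hex(fifo.read()) for _ in range(17)])  # 0x99 shows up first and last
```

Writing 32 words without any reads wraps the size to 00h in this model, so the FIFO then reports itself as empty.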

FIFO Oddities
The above things are fairly common behaviour for FIFOs. The more confusing part
is WHEN the GPU is removing the separate word(s) from the FIFO:
The different GPU commands behave differently in that aspect, some commands do
remove the command word and some (or all) parameter words almost immediately
after writing, other commands do not remove any words until after executing the
command. Additionally, the GPU can prefetch the next command while the previous
command is still busy, so some words may be removed before finishing the
previous command.

FIFO Command Phases
Many commands consist of a precalculation phase and rendering phase; especially
the POLY and LINE commands need some time for precalculating steps for screen
coordinates (and for Texcoords and Gouraud RGB values).
Prefetching FIFO words seems to occur after the calculation phase(s), but can
occur before/during the final rendering phase. In case of 4-point QUAD
polygons, that would be after calculating & rendering the 1st triangle, and
after calculating the 2nd triangle (but before/during rendering the 2nd
triangle).

FIFO Prefetch
If a rendering command is busy, the following commands can prefetch the
following amounts of words:
  NOP,INVALID,DRAWBASE,DRAWAREA,MASKBITS     1 word (and executed right away)
  TEXPAGE,TEXWINDOW,FLUSHCACHE,IRQ,REFRESH   1 word (but not yet executed)
  POLY, LINE                                 0 words (none)
  RECT (fixed size, without texture)         1 word (not all parameters)
  RECT (variable size, or with texture)      2 words (not all parameters)
  VRAM FILL                                  2 words (not all parameters)
  VRAM-to-VRAM, CPU-to-VRAM, VRAM-to-CPU     1 word (not all parameters)
Note: If there is no rendering command busy, then some commands (eg. RECT) seem
to prefetch more word(s) than shown above (that is, more words are removed from
fifo, even if the command cannot start yet because some parameters do still
need to be written to fifo).
As shown above, NOPs (and some attributes) can be executed even when another
command is still busy, that leads to cases where NOPs may or may not take up
FIFO space, for example the following series of commands will act as so:
  RECT                ;-GPU gets busy
  NOP                 ;'
  NOP                 ; these are (almost) immediately REMOVED from FIFO
  NOP                 ; (even when RECT is still busy)
  NOP                 ;/
  TEXPAGE             ;-removed from FIFO, but not executed until RECT is done
  NOP                 ;'
  NOP                 ; these are KEPT in FIFO
  NOP                 ; (until RECT is done, and TEXPAGE is executed)
  NOP                 ;/
When rendering a series of POLYs, the precalculation of the next triangle can
occur while rendering the previous triangle (that is, there will be a delay
before starting to draw pixels of the first triangle, but pixels of the
following triangles can be drawn seamlessly after each other).

In short, it's quite confusing (and in some cases I still don't know what happens when exactly).

In practice, when rendering a Quad (which consists of two Triangles with shared vertices), a FIFO overrun can change the appearance of the 2nd triangle (or it can even affect both triangles if the 1st triangle wasn't yet fully processed at time of overrun).

Post by **Marin25** » February 21st, 2024, 8:37 pm

Hey,

Thanks for sharing this detailed information about GPU rendering timings on the PS1. It provides a comprehensive understanding of the inner workings and timings involved in rendering various graphical elements. I appreciate the effort you've put into breaking down the timings for different commands and phases.

It's fascinating to see how the GPU handles different scenarios, including the prefetching of commands and the impact of FIFO overruns on rendering. This kind of insight is valuable for anyone interested in the technical aspects of the PS1 hardware.

If you have any specific questions or if there's anything else you'd like to discuss about PS1 development or related topics, feel free to share!