This takes about six vertical blanks (on PAL) to draw, which feels too slow to me, so I'm not entirely sure whether the problem lies mainly in rendering time or in other processing.
My code is written in C (not C++) and makes use of the Psy-Q libraries.
The mesh data is held in a separate memory structure and copied into a pre-allocated buffer using the setPolyG3, setXY3, setRGB0 etc. macros; these packets are then linked into an ordering table using the addPrim macro.
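To make that concrete, the linking works roughly like this (a minimal stand-in sketch in plain C; `Prim`, `add_prim` and `ot` are simplified stand-ins for the Psy-Q packet, the addPrim macro and the ordering table, not the real types):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a Psy-Q GPU packet. On real hardware the tag
   word holds the next-packet address plus a length; here a plain
   pointer is enough to show how ordering-table linking works. */
typedef struct Prim {
    struct Prim *next;              /* ordering-table link (the tag)   */
    int16_t x0, y0, x1, y1, x2, y2; /* screen coords (as per setXY3)   */
    uint8_t code;                   /* GPU command (as per setPolyG3)  */
} Prim;

#define OT_LEN 16
static Prim *ot[OT_LEN]; /* one list head per depth bucket */

/* Mimics addPrim: push the packet onto the front of the bucket's list,
   so packets at the same depth end up in reverse submission order. */
static void add_prim(Prim **bucket, Prim *p)
{
    p->next = *bucket;
    *bucket = p;
}
```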
I have two primitive buffers and ordering tables, one per draw buffer, so that there's no interference if the GPU is still drawing the previous frame. (Is this needed? I assume so, but I'm not entirely sure; perhaps the draw buffer is simply copied to some dedicated GPU memory, or else I'd imagine a lot of DMA transfer blocking if the GPU fetches each packet from main RAM, draws it, and repeats?)
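The swap itself is just an index toggle on my side, something like this (simplified sketch; `PacketBuffer` is a hypothetical stand-in for my primitive buffer plus ordering table pair):

```c
/* Hypothetical stand-in for one frame's primitive buffer + OT pair. */
typedef struct {
    unsigned char packets[1024]; /* pre-allocated primitive buffer */
    void *ot[16];                /* ordering table for this frame  */
} PacketBuffer;

static PacketBuffer buffers[2];
static int active = 0;

/* Called once per frame: the GPU now owns the buffer that was just
   submitted, so the CPU switches to building into the other one. */
static PacketBuffer *swap_buffers(void)
{
    active ^= 1;
    return &buffers[active];
}
```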
For triangles (which is all I have at the moment), I'm using RotTransPers3 to project the 3D coordinates from the mesh data to screen space, which are then used to fill out the aforementioned primitive buffer.
There are two things I'm curious about with this particular function:
- The underlying GTE operation is pretty costly. Is the ccpsx compiler smart enough to rearrange the generated MIPS instructions so that other calculations can be fit between issuing this instruction and fetching its results, or does it just dump in a bunch of NOPs to wait for it to complete before doing anything else?
- If the projected screen coordinates are outside of the drawing area, will the GPU reject this draw call automatically, or is it more efficient (to a point where it matters) if I check for this myself and don't emit such primitives to the primitive buffer / ordering table?
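In case it helps, the CPU-side check I have in mind for the second point would be something like this (plain C sketch; `DRAW_W`/`DRAW_H` are placeholder draw-area dimensions, the real ones come from my DRAWENV):

```c
#include <stdbool.h>

/* Placeholder draw-area bounds; in the real program these come from
   the DRAWENV set up at init time. */
#define DRAW_W 320
#define DRAW_H 256

/* Returns true if the triangle's screen-space bounding box misses the
   draw area entirely, i.e. the primitive could be skipped on the CPU
   instead of being added to the ordering table. */
static bool tri_offscreen(int x0, int y0, int x1, int y1, int x2, int y2)
{
    int minx = x0 < x1 ? (x0 < x2 ? x0 : x2) : (x1 < x2 ? x1 : x2);
    int maxx = x0 > x1 ? (x0 > x2 ? x0 : x2) : (x1 > x2 ? x1 : x2);
    int miny = y0 < y1 ? (y0 < y2 ? y0 : y2) : (y1 < y2 ? y1 : y2);
    int maxy = y0 > y1 ? (y0 > y2 ? y0 : y2) : (y1 > y2 ? y1 : y2);
    return maxx < 0 || minx >= DRAW_W || maxy < 0 || miny >= DRAW_H;
}
```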
I'd of course still have to compute the screen coordinates each frame anyway, since the viewer position may change, as well as the links to the next primitive, but I wouldn't also have to write the colour, size and GPU code fields of each packet. It would waste some space, though, so I'd rather not do it unless the payoff in execution speed is well worth it; as I imagine it, the main slowdown should lie in the actual polygon drawing by the GPU, and possibly in the perspective transformation by the GTE as described above?
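To clarify what I mean by splitting static and per-frame fields (hypothetical stand-in struct, not the real POLY_G3; the 0x30 GPU code for a shaded triangle is what I believe setPolyG3 writes):

```c
/* Hypothetical packet layout split into fields that never change for a
   given mesh and fields that are rewritten every frame. */
typedef struct {
    void *ot_link;                 /* relinked every frame by addPrim  */
    unsigned char r, g, b, code;   /* static: colour + GPU command     */
    short x0, y0, x1, y1, x2, y2;  /* per-frame: projected coordinates */
} GPacket;

/* Done once when the mesh is loaded: colour and GPU code never change. */
static void init_packet(GPacket *p, unsigned char r, unsigned char g,
                        unsigned char b)
{
    p->r = r; p->g = g; p->b = b;
    p->code = 0x30; /* shaded-triangle command, assumed value */
}

/* Done every frame: only the projected coordinates are rewritten. */
static void update_packet(GPacket *p, short x0, short y0, short x1,
                          short y1, short x2, short y2)
{
    p->x0 = x0; p->y0 = y0;
    p->x1 = x1; p->y1 = y1;
    p->x2 = x2; p->y2 = y2;
}
```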
Now I'm considering breaking the mesh up into smaller chunks to check whether their bounding box will be visible with the current global rotation/translation matrix or not and cull the chunks based on that.
How would one best go about this? It seems to me that there's no dedicated projection matrix for the "camera" / "eye" that determines how the perspective projection is done, which is what you'd normally use to implement frustum culling. Another option is to just run RotTransPers on the eight corner points of the bounding box, check whether any of the projected screen coordinates fall within the drawing area, and reject the chunk from drawing otherwise. Which of these approaches would be best (and if it's frustum culling, how would I get the needed perspective transformation details)? Or is there some third, superior approach I haven't thought of?
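For the second option, I'm picturing something like the following (plain C sketch with a naive perspective divide standing in for RotTransPers; `NEAR_Z` and `h` are assumed values, and I'd test overlap of the projected box with the draw area rather than "any corner inside", since a big chunk could cover the whole screen with all eight corners off-screen):

```c
#include <stdbool.h>
#include <stdint.h>

#define DRAW_W 320
#define DRAW_H 256
#define NEAR_Z 16    /* hypothetical near-plane distance */

typedef struct { int32_t x, y, z; } Vec3;

/* Stand-in for RotTransPers: assumes the corner is already in camera
   space and does a plain perspective divide with projection distance h
   (the GTE does the equivalent divide in hardware). */
static void project(Vec3 v, int32_t h, int32_t *sx, int32_t *sy)
{
    *sx = DRAW_W / 2 + v.x * h / v.z;
    *sy = DRAW_H / 2 + v.y * h / v.z;
}

/* Conservative cull: project all 8 corners and test the resulting
   screen-space box against the draw area. Corners at or behind the
   near plane make the chunk visible by default, since the divide is
   meaningless there. */
static bool chunk_visible(const Vec3 corners[8], int32_t h)
{
    int32_t minx = INT32_MAX, maxx = INT32_MIN;
    int32_t miny = INT32_MAX, maxy = INT32_MIN;
    for (int i = 0; i < 8; i++) {
        if (corners[i].z < NEAR_Z)
            return true; /* straddles/behind near plane: don't cull */
        int32_t sx, sy;
        project(corners[i], h, &sx, &sy);
        if (sx < minx) minx = sx;
        if (sx > maxx) maxx = sx;
        if (sy < miny) miny = sy;
        if (sy > maxy) maxy = sy;
    }
    return maxx >= 0 && minx < DRAW_W && maxy >= 0 && miny < DRAW_H;
}
```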
Finally, tying back into the question about the compiler above, this same issue would be present for several other multi-cycle instructions such as multiplications, divisions, obviously most GTE and DMA operations, but perhaps mainly memory access delays and branch delay slots, as these are bound to happen all over the place. Can I assume the compiler does what it can about this from my C code, or should I expect it to just drop NOPs everywhere? In the latter case (I would think it should at least handle load and branch delays, and would be surprised if that weren't the case), is there any documentation somewhere on how to write, compile and link assembly code in a Psy-Q project?
Sorry for the long post and all the different questions, but I thought I may as well ask before sitting down this weekend and hopefully getting somewhere with this.