
This is a great overview. I don't remember having to put in padding instructions to prevent the pipeline issues mentioned here; maybe we just never ran into that. (I wrote pretty much all the R3000 code for Crash 1 and just do not recall problems like that coming up.)

To us, the elephant-in-the-room limitation of the PS1 -- not mentioned here, as far as I could tell from a quick read-through -- was that it had no Z-buffer. This meant that you either had to do heroic pre-computation (sort polygons ahead of time) or you ended up with lots of flickering polygons (bluedino's "jiggling and sparkling" comment below seems apt).

The other thing to note is that to make a truly high-performance game for PS1 you had to completely ditch Sony's sanctioned C APIs -- something they said would cause your game to get banned, but which we did anyway. Sony's libraries spent more time copying registers to and from the stack than they did doing the actual work, so they were a big lose.

Finally, that 2KB scratchpad was immensely powerful if you used it well. It's the only reason the 3D collision detection on Crash was fast enough; it took me 9 months to get it performant enough to ship. The CPU and RAM in those machines were just incredibly slow by today's standards.
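
To make the pattern concrete, here's a minimal sketch (illustrative, not the actual Crash code; the 0x1F800000 address is from the public PS1 docs):

    #include <stdint.h>

    /* The scratchpad is ordinary addressable memory that runs at cache
       speed. Hypothetical pattern: copy the hot working set in, crunch
       it in a tight loop, and write only the results back to main RAM. */
    typedef struct { int16_t x, y, z; } Vert;

    #define SCRATCH_VERTS ((Vert *)0x1f800000)

    void stage_verts(const Vert *src, int n)
    {
        for (int i = 0; i < n; i++)   /* one-time cost on the main bus */
            SCRATCH_VERTS[i] = src[i];
        /* ...collision tests then touch only the scratchpad... */
    }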



> This is a great overview. I don't remember having to put in padding instructions to prevent the pipeline issues mentioned here; maybe we just never ran into that. (I wrote pretty much all the R3000 code for Crash 1 and just do not recall problems like that coming up.)

If you were using the GNU assembler, it automatically fills branch delay slots with nop instructions unless you mark the assembly with `.set noreorder`. GAS handles load delay slots the same way.
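
For example (a hypothetical snippet in MIPS GCC inline-asm syntax, not from any shipped game), under `.set noreorder` you own both kinds of slot yourself:

    int busy_wait(volatile int *flag)
    {
        int v;
        __asm__ volatile(
            ".set noreorder       \n\t"
            "1: lw   %0, 0(%1)    \n\t"  /* load the flag                     */
            "   nop               \n\t"  /* load delay slot: %0 not ready yet */
            "   beqz %0, 1b       \n\t"  /* loop while it reads zero          */
            "   nop               \n\t"  /* branch delay slot: always runs    */
            ".set reorder"
            : "=&r"(v) : "r"(flag) : "memory");
        return v;
    }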


Ah yes, that makes perfect sense! I'm sure that's why I never had to worry about it.


Inserting NOPs wastes code space and execution resources, though. If resources aren't too tight, that's fine.

We used gcc on a MIPS M4K in a communication chip. We had a lot of existing C code and were short on ROM and CPU cycles, so a few co-workers wrote a tool that parsed the gcc asm output and filled each branch delay slot with an instruction that had no side effects on the branch. It also fixed a gcc issue where 16-bit memory accesses in C were emitted as 32-bit load instructions in the asm (which can cost two cycles when only the first 16 bits of a 32-bit word are needed). I had a HW/SW co-simulation setup to test code and hardware (Verilog). These were cool projects. Good memories (although quite vague now).

PS: If we had had a license for the Green Hills compiler, we could have saved some of the effort. IIRC it did branch delay slot optimization.
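
The core idea of the tool, as a toy sketch (vastly simplified -- the real tool had to prove the hoisted instruction had no side effects on the branch, and all names here are illustrative):

    #include <string.h>

    /* lines[] holds one asm instruction per entry. If a branch is
       followed by a nop, pull the instruction just before the branch
       into the delay slot instead. */
    static int is_branch(const char *s) { return s[0] == 'b'; }  /* crude */

    void fill_delay_slot(const char *lines[], int i)
    {
        if (i > 0 && is_branch(lines[i]) && strcmp(lines[i + 1], "nop") == 0) {
            lines[i + 1] = lines[i - 1];  /* hoist into the slot            */
            lines[i - 1] = "nop";         /* removed entirely in a 2nd pass */
        }
    }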


Reverse engineering the assembly code required to drive the Geometry Transformation Engine efficiently from the compiled output of the C libraries is, to this day, one of my favorite technical puzzles of my career.

I remember sitting in a meeting with our technical contact at Sony's headquarters in San Mateo where he basically asked without asking if we were using the C libraries. From the way he asked the question it was clear that the correct answer was "No, not using the libraries" since otherwise our game wouldn't be performant. At the same time we both knew we couldn't say that explicitly since that would violate Sony's contracts.


> Reverse engineering the assembly code

Well of course you were using the libraries! Just like you were using the documentation. Exactly like you were using the documentation, in fact. The fact that neither got jumped to as executable code is clearly immaterial.


I assume you only had a disassembler to work with, at the time?


Thanks for your great comment.

> This meant that you either had to do heroic pre-computation (sort polygons ahead of time)

Much more detail on this is available in this amazing series of blog posts:

https://all-things-andy-gavin.com/2011/02/02/making-crash-ba...


Thank you for this insight and special thanks for providing me and many others with this wonderful piece of childhood.


> To us, the elephant-in-the-room limitation of the PS1 -- not mentioned here, as far as I could tell from a quick read-through -- was that it had no Z-buffer.

There is a section "Visibility Approach" under the GPU section that starts: "Just like the competition, the PS1 doesn’t include any hardware feature that solves the visibility problem."

I agree that it doesn't really stress the ramifications of that.


Legend. I've heard the original GOAL code is floating around somewhere online. Is there any chance Sony ever releases it into the public domain?

I notice another game today on here using ClojureScript. Apart from a lack of tooling / engine, is Lisp better suited for gamedev?


One could endlessly debate the merits of Lisp dialects for game development, but at least in the Crash era it was a big win for us. Naughty Dog migrated away from Lisp in the PS3 era, I believe. (I was long gone by then.)

It was very powerful to be able to write code for critters and other animated objects in GOOL/GOAL because it was both conceptually simple and compiled into tiny amounts of code, which mattered a lot when you only had 2MB of RAM (half of which was video RAM if I recall correctly).


I've not heard of GOAL before - do you have a link to any references or articles about it?


The OpenGOAL project is very active in reimplementing the GOAL compiler and infrastructure of Jak 1.

https://blog.jakspeedruns.com/opengoal-project-update-septem...


Looks like GOOL was an early interpreted(?) Lisp/Scheme, and GOAL was a later compiled version?

GOOL: https://all-things-andy-gavin.com/2011/03/12/making-crash-ba...

GOAL: https://en.m.wikipedia.org/wiki/Game_Oriented_Assembly_Lisp

https://github.com/water111/jak-project


Comments like this are the reason I keep coming back to HN, thanks!


> The CPU and RAM in those machines were just incredibly slow by today's standards.

I have a few questions, interspersed with my musings on them.

Based on an initial misreading, I took your statement to mean that memory latency on a similar order to today's computers was something you had to fight with. Was it memory latency per se that made achieving sufficient performance hard?

Per the article, the system had EDO memory, which I've discovered can deliver a word read in 3 cycles for the first word of a page and 2 cycles for subsequent words in the same page -- or 1 cycle for all of them if the RAM is fast enough. I don't know the details of a MIPS memory access cycle, but by comparison, the 68k takes at least 4 clock cycles to pull a word off the bus. Thus, ISTM that EDO memory could keep a 68k fully supplied with data without resorting to wait states. I would hope that a MIPS processor, being a more modern design, could pull a word off the bus in a single clock cycle, but on the other hand, I could see Sony using cheaper chips with a tRAC slower than the CPU's clock period to save money.

Or was the issue less memory latency per se and more a bus-utilization problem? In the article, I see that the DMA controller leaves the CPU idle while other devices are using the bus, unless the CPU is making use of the scratchpad.

Re-reading your statement -- that the CPU and RAM were just slow in general compared to today -- I find myself wondering: how did the power of your development workstations compare to the PS1? Were you using high-end workstations like NeXTs or SGIs, or common PCs of the era? (Per the article, the SDK and development board targeted Windows 3.1 and 95.)


Good questions. We were using SGI workstations, which ran at 250MHz if I recall correctly -- so ~8x the PS1's CPU clock.

It was admittedly nearly 30 years ago but my recollection is that the only way to make anything fast on that hardware was to keep everything in registers and avoid touching memory except when absolutely necessary. It was definitely many cycles to access memory.


An MMX Pentium would run circles around the PSX; paired with a 3dfx card it could play lots of the same games.

A Pentium Pro could run Unreal, I think. That game (and engine) could basically crush any PSX and N64 at once.

A Pentium II could emulate the PSX at the lowest settings.


> The other thing to note is that to make a truly high-performance game for PS1 you had to completely ditch Sony's sanctioned C APIs -- something they said would cause your game to get banned, but which we did anyway.

Fascinating! Was this because Sony didn't bother checking, or because they saw the performance you were getting and just let it slide?


I recall that they mandated use of the C APIs to easily support future PS2 backward compatibility. I think what they ended up doing was just embedding a PS1 into the PS2, so use of the C APIs never mattered. Which was good, because by the time PS2 was out, lots of PS1 games had shipped that used the bare metal hardware, circumventing the APIs.

The problem was there was just no way to make a fast C API for things like polygon rendering, because as soon as you touched RAM (the stack) you were looking at many cycles of delay per instruction.
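
For flavor, here's a hypothetical sketch of what going around the API looked like (illustrative, not our code or Sony's; the register numbers and RTPS opcode are as given in the usual public PS1 documentation):

    #include <stdint.h>

    /* Load one vertex into GTE (coprocessor 2) data registers VXY0/VZ0
       and kick RTPS (rotate/translate/perspective, single vertex) --
       no function call, so nothing is spilled to the stack. Real code
       would also respect the GTE's internal timing. */
    static inline void gte_rtps_one(int16_t x, int16_t y, int16_t z)
    {
        uint32_t xy = (uint16_t)x | ((uint32_t)(uint16_t)y << 16);
        __asm__ volatile("mtc2 %0, $0" :: "r"(xy));           /* VXY0 */
        __asm__ volatile("mtc2 %0, $1" :: "r"((uint32_t)z));  /* VZ0  */
        __asm__ volatile(".word 0x4A180001");                 /* RTPS */
    }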


"I think what they ended up doing was just embedding a PS1 into the PS2, so use of the C APIs never mattered." I'm not sure on the PS2 but I know that's what they did in the PS3 - I found the 60GB launch PS3 that had the hardware PS2 embedded as well as the greatest amount of ports and maybe a few other aspects that made it one of the more sought after PS3 version. I still use it as a Bluray player and occasionally fire up Little Big Planet - still one of my favorite games.


The layout of this article seems unique, and in one of the sub-sections I think they did go into the Z-axis ordering a bit:

> Moving on, the ordering table puts the burden on the developer/program to show the geometry in the right order. In some cases, the calculations implemented rely on too many approximations to gain performance. This may result in flickering or occluded surfaces that should have been displayed.
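
For the curious, a minimal sketch of what an ordering table amounts to (illustrative names, not the actual libgpu layout): primitives get linked into buckets indexed by quantized average depth, and the buckets are then drawn farthest-first, so two polygons landing in the same bucket have no defined order between them -- hence the flicker.

    #define OT_LEN 1024                       /* number of depth buckets  */

    typedef struct Prim { struct Prim *next; /* GPU packet follows */ } Prim;

    static Prim *ot[OT_LEN];                  /* one list head per bucket */

    void ot_add(Prim *p, int avg_z)           /* avg_z: mean vertex depth */
    {
        int i = avg_z >> 2;                   /* quantize depth to bucket */
        if (i < 0) i = 0;
        if (i >= OT_LEN) i = OT_LEN - 1;
        p->next = ot[i];                      /* push: same-bucket prims   */
        ot[i] = p;                            /* keep insertion order only */
    }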


Something I've wondered about the "sorting polygons" strategy: would sorting only part of the polygons per frame have been enough? (front half vs. back half, one bubble-sort pass per frame, or some other iterative partial solution)

How often did the polygons get out of order?

(Or am I thinking only in 2D arrays when the problem is really multiple-polygon occlusion?)


I doubt it. One observation is that polygons can experience cyclic overlap, so A > B and B > C does NOT imply A > C. That transitivity property is required for O(N lg N) sorting algorithms. And note that even at Crash 1 polygon counts, O(N^2) is > 1000^2 -- way too huge for a 33MHz processor.

However, the way I did it was to precompute the sort order of the polygons ahead of time and store periodic key frames (the complete sorted list) along with diffs. These diffs were generally very small, meaning your intuition that not much changes from frame to frame is valid. The problem is that without a runtime oracle telling you which polygons move from frame to frame, it's hard to exploit this property.

Another issue with approximate polygon sorting methods like bucket sorting is that the sort isn't terribly stable from frame to frame, so you get the annoying flickering effect you see so often in PS1 games, as a pair of polygons flips relative sort order while you move around.
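
To make the keyframe-plus-diff idea concrete, here's a toy decoder (purely illustrative; the real format was different): every K frames you store the whole permutation, and in between you store a handful of "slot i now shows polygon p" edits.

    #include <stdint.h>
    #include <string.h>

    typedef struct { uint16_t slot, poly; } Edit;

    /* order[] is the draw order for this frame; keyframe is NULL on
       diff-only frames, since little changes between adjacent frames. */
    void apply_frame(uint16_t *order, int n, const uint16_t *keyframe,
                     const Edit *edits, int nedits)
    {
        if (keyframe)
            memcpy(order, keyframe, n * sizeof *order);
        for (int i = 0; i < nedits; i++)
            order[edits[i].slot] = edits[i].poly;
    }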


Thanks, really clear explanation, and thanks for taking the time to answer such a basic question. :)


It’s not a basic question at all. If you look at the most innovative games of that era, they all found some way to avoid O(N^2) polygon sorting. A bit before Crash came out, id Software showed that spatial data structures like BSP trees could help a lot; we were influenced by that.
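
For reference, the textbook back-to-front BSP traversal looks like this (a generic sketch, not id's code or ours): because each node's plane splits space, recursing into the half-space away from the eye first yields a correct painter's order with no per-frame sorting.

    #include <stdint.h>

    typedef struct { int32_t nx, ny, nz, d; } Plane;  /* fixed-point plane */
    typedef struct Node {
        Plane plane;
        struct Node *front, *back;
        void (*draw_polys)(const struct Node *);      /* emit node's polys */
    } Node;

    static int32_t side_of(const Plane *p, int32_t x, int32_t y, int32_t z)
    {
        return p->nx * x + p->ny * y + p->nz * z - p->d;
    }

    void draw_back_to_front(const Node *n, int32_t ex, int32_t ey, int32_t ez)
    {
        if (!n) return;
        if (side_of(&n->plane, ex, ey, ez) >= 0) {    /* eye in front half */
            draw_back_to_front(n->back, ex, ey, ez);  /* far side first    */
            n->draw_polys(n);
            draw_back_to_front(n->front, ex, ey, ez);
        } else {
            draw_back_to_front(n->front, ex, ey, ez);
            n->draw_polys(n);
            draw_back_to_front(n->back, ex, ey, ez);
        }
    }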


I think a late-generation PS1 game did have a Z-buffer; I don't remember which one, though. Blasto, maybe?


It would have had to be in software, then; modern hardware handles Z-buffer calculations for you. That would have been slow and memory-intensive on the PS1.
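
A sketch of the per-pixel cost (illustrative; put_pixel is an assumed helper): every pixel drawn takes an extra depth read, compare, and write against main RAM, plus ~150KB of the 2MB just for the buffer.

    #include <stdint.h>

    static uint16_t zbuf[320 * 240];  /* ~150KB; cleared to 0xFFFF each frame */

    extern void put_pixel(int x, int y, uint16_t color);  /* assumed helper */

    void plot(int x, int y, uint16_t z, uint16_t color)
    {
        uint16_t *zp = &zbuf[y * 320 + x];
        if (z < *zp) {                /* extra RAM read per pixel  */
            *zp = z;                  /* extra RAM write per pixel */
            put_pixel(x, y, color);
        }
    }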


It is emulated, but you can still see a sorting failure here (the gray platform at the bottom left of the video appears "over" the ground for a bit):

https://youtu.be/Qo-wX5CeDRA?t=340

So it most likely wasn't Blasto.



