!!title="A Parallel Graphics Processor"
!!date="December 13, 2025"
!!
$DPP "$title" "$title" <template/head.html # >
$DPP <template/bar.html # >

$DPP "$1" <template/readtime.html # >
!!

<p><a class="out" href="https://github.com/Raniconduh/paragon">PARAGON</a>
(Parallel Architecture for Rendering and Graphics OperatioNs) is a graphics
processor I designed for a class. Unlike some other class projects, this one was
a lot of fun to build and so I've decided to write about it.</p>

<h3>Overview</h3>
<p>The final design consisted of 12 compute cores drawing to a 320x240
framebuffer. The framebuffer stores 8 bit pixels (RGB 332) and each compute core
accesses a single pixel at a time. Internally, the compute cores are a (rather
strange) 24 bit Harvard RISC architecture. The architecture was designed to be
as minimal as possible while still being very usable. (More on this later.) Each
core runs in parallel and, for the most part, does not need to know that any
other core exists. The video memory is organized in a 3 level hierarchy, with
two levels of caching and the framebuffer at the highest level. (Note, however,
that the final design did not include the level 2 cache, for reasons discussed
below.)</p>

<h3>Architecture</h3>
<p>
The compute cores are designed around a minimal RISC architecture with 16 bit
fixed width instructions and 24 bit registers. The register file contains 16
registers indexed 0 through 15 where register 0 is the zero register (sinking
all writes and sourcing 0) and register 15 is the core ID register (sinking
writes and sourcing the ID of the current core). The core ID register is what
truly allows for parallelism, as otherwise each core would run the exact same
instructions in the exact same order. (This can cause glitches depending on the
kernel; more on this later.)
</p>
<p>
To the untrained eye, a register which is always zero seems wasteful. If you
need a zero, you can just load it in, right? The zero register is a hallmark of
RISC architectures. In fact, the inclusion of this register makes the
architecture smaller and simpler, although it also means the programmer has to
be more aware of what they are doing. (In many cases, this can be handled by the
assembler, e.g. <code>mov rd, rs</code> is translated as <code>add rd, r0, rs</code>.)</p>
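<p>
A minimal sketch of how an assembler might lower such pseudo-instructions onto
the zero register. Only the <code>mov</code> rewrite comes from the text above;
the table layout and the extra <code>neg</code> entry are illustrative, not
PARAGON's actual assembler.
</p>

```python
# Hypothetical pseudo-instruction expansion using the zero register (r0).
EXPANSIONS = {
    # mov rd, rs  ->  add rd, r0, rs   (from the article)
    "mov": lambda rd, rs: ("add", rd, "r0", rs),
    # neg rd, rs  ->  sub rd, r0, rs   (illustrative extra entry)
    "neg": lambda rd, rs: ("sub", rd, "r0", rs),
}

def expand(op, *args):
    """Expand a pseudo-instruction; real instructions pass through."""
    if op in EXPANSIONS:
        return EXPANSIONS[op](*args)
    return (op, *args)

print(expand("mov", "r3", "r7"))  # ('add', 'r3', 'r0', 'r7')
```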
<p>
Instructions have 4 bit opcodes, meaning there are only 16 possible
instructions. Fortunately, this limit was not reached in the final design. Most
instructions have 3 operands, which is typical of RISC architectures, although
some instructions use fewer.
</p>
<p>
The instruction set contains the following ALU operations: ADD, SUB, MUL, AND,
OR, XOR, NOT, SH, ADDI. The first few are self-explanatory. SH is the shift
instruction, which performs left, right, or arithmetic shifts on a register by
an immediate (i.e. a constant which is not a register). ADDI is perhaps the most
unusual instruction of the ALU operations as it is not strictly necessary.
Simply, it adds an immediate to a register. Note, however, that the immediate is
a signed 4 bit integer (ew). This is the only instruction to use a 4 bit signed
immediate. Even though the instruction is absolutely not required, it makes
writing programs much easier.
</p>
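<p>
For illustration, here is one way a 16 bit, 3-operand instruction could be
packed given the numbers above (a 4 bit opcode plus three 4 bit register
fields). The field order and the opcode value are assumptions, not PARAGON's
documented encoding.
</p>

```python
# Hypothetical 16 bit encoding: [opcode:4][rd:4][ra:4][rb:4].
def encode(opcode, rd, ra, rb):
    assert all(0 <= f < 16 for f in (opcode, rd, ra, rb))
    return (opcode << 12) | (rd << 8) | (ra << 4) | rb

def decode(word):
    return ((word >> 12) & 0xF, (word >> 8) & 0xF,
            (word >> 4) & 0xF, word & 0xF)

word = encode(1, 3, 0, 7)          # e.g. "add r3, r0, r7" with opcode 1
assert word < (1 << 16)            # fits in a fixed 16 bit word
assert decode(word) == (1, 3, 0, 7)
```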
<p>
The processor's memory operations are limited, loading and storing only 1 byte
at a time. The LD and ST instructions are the only two memory operations and
they are used to access video memory (pixelwise). Because there are no other
memory operations, the astute may wonder how the program memory is accessed. The
answer is that it cannot be: program memory is read only and only ever touched
by the control system (i.e. a programmer cannot read or write it). The LDI
instruction is in a similar vein, although it does not read from memory.
Instead it loads an unsigned 8 bit immediate into a register.
</p>
<p>
Finally, the architecture includes two control flow instructions. B is the
branch instruction. It supports both conditional and unconditional branches and
can either branch to a PC relative immediate or directly to an address stored in
a register. Once again, some may recognize that something is missing. Two
things, actually. Mainly, there is no compare instruction. This seems to imply
that all branching is unconditional. However, all ALU operations modify a set of
flags: N, Z, and C. These flags store whether the output of the ALU is
(N)egative (i.e. the MSB is set), (Z)ero, or if the operation generated a
carry-out. Using these flags, comparisons are typically performed by subtracting
two registers and discarding the result. The other thing that is missing is some
sort of linked branch (i.e. function call). This is solved by the final control
instruction, AIPC, or add immediate to program counter. The instruction adds a
signed 8 bit immediate to the program counter (PC) and stores the result in a
register. This is especially helpful for direct branches and e.g. function
calls. The instruction is technically redundant since the PC is always known to
the assembler, but it makes writing functions easier.
</p>
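<p>
The compare-by-subtract trick can be sketched as follows. This models the N/Z/C
flags of a 24 bit ALU as described above; the exact flag update rules in
PARAGON's hardware may differ in detail.
</p>

```python
# N/Z/C flags from a 24 bit subtraction, used as a comparison.
MASK = (1 << 24) - 1

def sub_flags(a, b):
    """Return (N, Z, C) after computing a - b and discarding the result."""
    raw = a + ((~b) & MASK) + 1      # two's-complement subtraction
    res = raw & MASK
    n = (res >> 23) & 1              # MSB set -> negative
    z = int(res == 0)                # result is zero -> operands equal
    c = (raw >> 24) & 1              # carry out of the 24 bit ALU
    return n, z, c

assert sub_flags(5, 5) == (0, 1, 1)  # equal: Z set
assert sub_flags(3, 5)[0] == 1       # 3 - 5 is negative: N set
```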

<h4>Program Memory</h4>
<p>
As mentioned above, the program memory is used solely by the processor's control
unit. Even though it is not as extensible as the video memory, the program
memory is obviously a necessary part of the system. Separating the program
memory (ROM) from the video memory (VRAM) also helped keep the architecture
smaller and simpler (and is what makes it a Harvard architecture).
</p>
<p>
One of the most important parts of a graphics processor is having it run
different kernels. (It wouldn't be very interesting if it could only ever do one
thing.) Because of this, the program ROM is reprogrammable. By enabling the
programming mode on the ROM, instructions can be streamed to the graphics
processor, where they will be stored in the same order. Once programming is
complete, the processor begins execution, starting at the first instruction.
</p>
<p>
But if the compute cores are independent of each other, how can they work with
only one program ROM? Well, one solution is to just go core by core, handling
each read request one at a time. Of course, this is horribly slow. The solution
used by PARAGON is to copy the instruction ROM into multiple (individual) ROMs.
Since the ROM is (as per the name) read only, there is no risk of processors
modifying the instructions in the kernel. This means that every core can access
the program ROM at the exact same time and not have to worry about waiting for
other cores. On the other hand, this means that the program ROM is way bigger
than it has to be. This is not a huge deal for PARAGON since the ROM is quite
small (2048 instructions deep). I also believe that speed is a much more
important factor in this case than memory utilization. (If read requests were
handled one at a time, I'm inclined to believe that the ROM would be 12x slower,
meaning there wouldn't really be much benefit to having more than 1 core.)
</p>

<h4>Video Memory</h4>
<p>
The video memory (VRAM) is seemingly simple. Most of video memory (~60%) is
allocated to the framebuffer (what you actually see on the screen). There is a
bit of memory (~40%) beneath the framebuffer for general purpose use, although
it is a bit limited since memory is accessed byte by byte. (Maybe in the future
loads and stores can optionally work on multiple bytes.) Again, this doesn't
seem too complex. But what happens when multiple cores are trying to use VRAM at
the same time? The program ROM solution can be applied here too: just copy the
VRAM for each core. This seems to keep VRAM fast like it did for the program
ROM. However, the program ROM was very small (4kB); copying VRAM for 12 cores
would make it huge and absolutely infeasible. The real solution (and the one
employed by PARAGON) is to have a caching scheme and memory arbitration.
</p>
<p>
The caching scheme is fairly simple (especially because of the way VRAM was
actually implemented). Each core has a private write-through level 1 cache that
stores 64 cache lines (each of which is 8 bytes wide, i.e. a total of 512 bytes
of cache). The cache is direct-mapped, meaning multiple addresses map to the
same cache line. (This can lead to thrashing due to conflicting cache accesses,
but I am treating it as a non-issue for the sake of PARAGON.) When a core wants
to access VRAM, the cache first checks if the memory has been cached already. If
it has, great! On reads, just return the cache line. On writes, modify it and
send it to the next level. When the memory has not yet been cached, the L1 cache
requests the line from the next level and waits for it to come back. But what
happens if a core writes to memory that is cached by another core? There are a
lot of different coherency schemes that can be used to handle this, but PARAGON
employs the simplest: such cache lines are invalidated. This means that if one
core tries to read cached memory that has been modified by a different core, it
will have to fetch the modified line from the next level.
</p>
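<p>
Concretely, with 8 byte lines and 64 entries, an address splits into offset,
index, and tag bits as sketched below. The field widths follow from the sizes
above; the split itself is the standard direct-mapped one, not taken from
PARAGON's source.
</p>

```python
# Direct-mapped address split: 3 offset bits (8 byte lines), 6 index bits
# (64 entries), and the remaining high bits form the tag.
LINE_BYTES = 8
NUM_LINES = 64

def split(addr):
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % NUM_LINES
    tag = addr // (LINE_BYTES * NUM_LINES)
    return tag, index, offset

# Addresses 512 bytes apart share an index but differ in tag, so they
# evict each other: this is the thrashing mentioned above.
a = split(0x100)
b = split(0x100 + LINE_BYTES * NUM_LINES)
assert a[1] == b[1] and a[0] != b[0]
```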
<p>
In the final design of PARAGON, the memory hierarchy does not include a level
two cache. Instead, the memory is directly arbitrated between the private L1
caches and the full VRAM. The arbitration is fairly simple. The memory arbiter
keeps track of the last core that it granted a request to. It then tries to find
requests starting at the next core before wrapping back around. This means that
one core will never be able to hold the memory arbiter forever.
</p>
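<p>
A sketch of that rotating-priority search (a standard round-robin arbiter; the
code is illustrative, not PARAGON's RTL):
</p>

```python
# Round-robin arbitration: start searching at the core after the last
# grant, wrapping around, so no single core can starve the others.
def arbitrate(requests, last_grant):
    """requests: one bool per core; returns the granted core or None."""
    n = len(requests)
    for i in range(1, n + 1):
        core = (last_grant + i) % n
        if requests[core]:
            return core
    return None

reqs = [False] * 12
reqs[0] = reqs[5] = True
assert arbitrate(reqs, 0) == 5    # core 0 went last, so core 5 wins now
assert arbitrate(reqs, 5) == 0    # and the grant rotates back
```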

<h3>Okay... Does it Even Work?</h3>
<p>
Yes! PARAGON was implemented on a Xilinx Spartan 7 FPGA on the RealDigital
<a class="out" href="https://www.realdigital.org/hardware/urbana">Urbana Board</a>.
The Spartan 7 is a decent low end FPGA with a fair bit of
resources. Originally, I thought that my limiting factor would be LUTs. However,
once I actually built the thing, I quickly found out that I was mostly limited
by the RAM. The Spartan 7 (XC7S50) contains 75 RAMB36 blocks. These are
synchronous, dual port, 36 kilobit SRAM blocks. There are four ways in which RAM
is used by the processor: program memory, cache memory, video memory, and CPU
memory. I haven't mentioned anything about a CPU so far, but it is a necessary
part of the graphics processor. Of course, you wouldn't have a computer with a
GPU but no CPU. The CPU used in the design is the Xilinx MicroBlaze processor. I
stripped out as much as I could from it to make it small (and slow; CPU
performance is not an integral part of the graphics processor). The MicroBlaze
seems to require about 16KB of memory including all the graphics demos, although
I allocated 64KB just to be safe. All in all, the biggest memory consumer is the
VRAM. This is because it is implemented entirely in on-chip memory. The dev
board does come with 128Mbits of DDR3 memory, but I had a lot of difficulty
trying to get it to work and I eventually scrapped that idea. The upside to
using the on-chip memory is that it is very fast: the memory has a 1 clock cycle
read latency, much faster than however many clock cycles it would take for DDR3
to be ready. This is what led to the omission of an L2 cache. If the large, slow
DDR3 had been used, the L2 cache would likely have been necessary.
</p>
<p>
So how fast is it? Well, each compute core runs at 50MHz. Originally the plan
was to run them at 100MHz, but that proved impossible for two reasons. Firstly,
the processor cores are not particularly efficient; that is, they are not
pipelined and expect all data to be ready in the same clock cycle. This does not
scale, and it is what initially caused me to lower the clock frequency to 50MHz.
(This might seem like an oversight on my part, but remember that this was a
class project and I was on a time limit. Pipelined processor design is not the
easiest and would greatly increase the development time.) The other limiting
factor is physical. The processor uses logic cells from all over the chip.
Physically, routing signals from one corner of the chip to another is slow. I'm
sure I could bump up the frequency a little, but 50MHz is a nice round number.
</p>

<h3>Demos</h3>
<p>
I designed a bunch of demos to showcase PARAGON's processing power. Since the
architecture contains only integer arithmetic and no special graphics
operations, kernels are limited to fixed point arithmetic and manually writing
out necessary functions (like matrix multiplication or dot product). (Floating
point is technically possible using a software floating point library, but I
haven't bothered to try implementing one as it is likely very difficult.)
Surprisingly, even though kernels were limited to fixed point, I was still able
to get some good looking demos.
</p>
<p>
The following video shows the processor running the Rule 30 elementary cellular
automaton. The simulation is artificially slowed and runs on only a single core.
Yes, this is intentional. The simulation was otherwise way too fast and did not
look good.
</p>
<div class="video"><video width="320" height="240" controls>
 <source src="content/rule30.webm" type="video/webm">
 Rule 30 Elementary Cellular Automaton Simulation
</video></div>
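<p>
For reference, one generation of Rule 30 boils down to an 8 entry lookup on each
cell's left/center/right neighborhood, where the rule number 30 (0b00011110) is
the truth table. A minimal sketch with wrapping edges; the kernel's actual edge
handling is an assumption here:
</p>

```python
# One Rule 30 step: new cell = bit of 30 selected by (left, center, right).
def rule30_step(row):
    n = len(row)
    return [
        (30 >> ((row[(i - 1) % n] << 2) | (row[i] << 1) | row[(i + 1) % n])) & 1
        for i in range(n)
    ]

row = [0, 0, 0, 1, 0, 0, 0]       # single live cell in the middle
assert rule30_step(row) == [0, 0, 1, 1, 1, 0, 0]
```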
<p>
You might notice the strange purple bar on the left side of the screen. This is
caused by the (soft) HDMI encoder that is used to generate the video output. Can
this be fixed? Probably. Is it a big deal? I don't think so. And so it stays.
The demo looks cool (in my opinion) but it doesn't showcase the real power of
the processor. For that, I wrote a Mandelbrot set rendering kernel. The
Mandelbrot set is an "image" that is found by computing the rate of divergence
of the equation <code>z_next = z_prev^2 + c</code>. I don't want to talk too
much about the math, but the kernel just computes the formula a bunch of times
and sees when (or if) the variable eventually blows up (i.e. gets really big in
magnitude). The demo shows the computation twice: once on a single core and
again using all 12 cores.
</p>
<div class="video"><video width="320" height="240" controls>
 <source src="content/mandelbrot.webm" type="video/webm">
 Single Core and Multi Core Mandelbrot Set Rendering
</video></div>
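<p>
The heart of such a kernel is the escape-time loop below, shown in the fixed
point style the hardware forces. The Q format (16 fractional bits) and the
iteration cap are assumptions for illustration; the real kernel's values may
differ.
</p>

```python
# Fixed point escape-time iteration of z = z^2 + c.
FRAC = 16
ONE = 1 << FRAC                    # fixed point 1.0

def fmul(a, b):
    return (a * b) >> FRAC         # fixed point multiply

def escape_iters(cr, ci, max_iters=32):
    """Count iterations until |z|^2 > 4 (divergence), up to max_iters."""
    zr = zi = 0
    for i in range(max_iters):
        zr2, zi2 = fmul(zr, zr), fmul(zi, zi)
        if zr2 + zi2 > 4 * ONE:
            return i               # diverged: outside the set
        zr, zi = zr2 - zi2 + cr, 2 * fmul(zr, zi) + ci
    return max_iters               # never blew up: inside the set

assert escape_iters(0, 0) == 32        # c = 0 is in the set
assert escape_iters(2 * ONE, 0) < 32   # c = 2 escapes quickly
```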
<p>
Clearly, the single core rendering is very slow while the multicore rendering is
quite fast. Even though cores still have to wait their turn to write to memory,
the heavy computation involved means that the memory bus sits mostly idle and so
there are few conflicts between cores.
</p>
<p>
Another interesting demo is Conway's Game of Life. The GoL is another cellular
automaton but produces more interesting patterns than Rule 30. The demo uses
Rule 30 as a pseudorandom initial state and then simulates the Game of Life.
</p>
<div class="video"><video width="320" height="240" controls>
 <source src="content/gol.webm" type="video/webm">
 Conway's Game of Life Simulation
</video></div>
<p>
The demo runs the Game of Life on multiple cores. This is not because the
simulation is computationally complex, but instead because it requires heavy
memory usage. For each pixel, the kernel fetches the 8 surrounding pixels to
compute the next state, which is stored in the right half of the framebuffer.
Once all cells have been computed, the next state buffer is copied back to the
left half of the framebuffer so that the next computation can begin. I have also
run this demo on a single core and it is abysmally slow. So, even though it is a
memory heavy kernel, the cores are still able to share the memory bus
effectively, thereby increasing the simulation speed.
</p>
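<p>
Per pixel, the kernel's work amounts to the classic neighbor count below (a
sketch with wrapping edges; the demo's actual edge behavior is an assumption):
</p>

```python
# Game of Life next state for one cell: count the 8 neighbors, then a
# live cell survives with 2 or 3, and a dead cell is born with exactly 3.
def next_cell(grid, x, y):
    h, w = len(grid), len(grid[0])
    live = sum(
        grid[(y + dy) % h][(x + dx) % w]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return int(live == 3 or (grid[y][x] == 1 and live == 2))

# A "blinker": a vertical bar of three cells flips to horizontal.
g = [[0] * 5 for _ in range(5)]
g[1][2] = g[2][2] = g[3][2] = 1
assert next_cell(g, 2, 2) == 1    # center survives
assert next_cell(g, 1, 2) == 1    # left neighbor is born
assert next_cell(g, 2, 1) == 0    # top cell dies
```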
<p>
The final interesting demo is only half baked (visually and computationally, but
still interesting). It rotates the vertices of a cube by applying a rotation
matrix to them. The demo is also artificially slowed and single core since it
did not look good at true computational speed.
</p>
<div class="video"><video width="320" height="240" controls>
 <source src="content/cube.webm" type="video/webm">
 A 3 Dimensional Rotating Cube
</video></div>
<p>
Again, due to time constraints, I did not make the cube visually appealing
(edges or face coloring). However, the video clearly shows a rotating cube.
Eventually, the cube begins to fall apart, and this is due to inaccuracy in the
rotation matrix and vertex coordinates. Because they are fixed point, only so
much accuracy can be encoded. To preserve some accuracy, the matrix values were
precomputed and somewhat optimized (I checked a lot of different values by hand
and the one in this demo is one of the best ones I found). The kernel does not
re-normalize the vertices, even though it should, once again due to time
constraints.
</p>
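<p>
The drift can be demonstrated numerically: a rotation matrix quantized to fixed
point is never exactly orthonormal, so every application scales the vertices
slightly. The Q format and angle below are illustrative, not the demo's exact
values.
</p>

```python
import math

# Quantize a rotation's cosine/sine to Q12 fixed point.
FRAC = 12
ONE = 1 << FRAC

def fx(v):
    return round(v * ONE)

c, s = fx(math.cos(0.05)), fx(math.sin(0.05))

# An exact rotation satisfies cos^2 + sin^2 = 1; the quantized version
# misses, and the error compounds on every frame until the cube warps.
assert c * c + s * s != ONE * ONE
error = (c * c + s * s) / (ONE * ONE) - 1.0
print(f"per-step scale error: {error:+.6f}")
```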
<p>
The rotating cube demo is also single core for another reason (which is hinted
at above). The cube's vertices are stored in VRAM right below the framebuffer
(so they are not visible in the video output). Because of this, if multiple
cores try to read the vertices to draw them to the screen and then modify the
vertices and store them back, the coordinates will end up corrupted and too many
vertices will be drawn to the screen. The real fix is to have each core work on
an independent set of vertices and synchronize the cores before clearing the
screen. I elected not to do this since it adds additional complexity and would
still look bad due to being much too fast.
</p>
<p>
Below are some more fun renderings.
</p>

<table class="images"><tbody>
 <tr>
 <td class="image"><img src="content/blurbrot.jpg" width="320" height="240"></td>
 <td class="image"><img src="content/blurrule30.jpg" width="320" height="240"></td>
 </tr>
 <tr>
 <td class="caption">Blurred Mandelbrot Set</td>
 <td class="caption">Blurred Rule 30 Simulation</td>
 </tr>
 <tr>
 <td class="image"><img src="content/julia1.jpg" width="320" height="240"></td>
 <td class="image"><img src="content/julia2.jpg" width="320" height="240"></td>
 </tr>
 <tr>
 <td class="caption">Julia Set 1</td>
 <td class="caption">Julia Set 2</td>
 </tr>
 <tr>
 <td class="image"><img src="content/julia3.jpg" width="320" height="240"></td>
 <td class="image"><img src="content/ship.jpg" width="320" height="240"></td>
 </tr>
 <tr>
 <td class="caption">Julia Set 3</td>
 <td class="caption">Burning Ship Fractal</td>
 </tr>
</tbody></table>

<p>
All these demos were painstakingly written by hand in PARAGON assembly.
</p>

<h3>What Now?</h3>
<p>
Even though it clearly works, PARAGON is lacking some features. The inclusion of
large DDR3 memory would be very useful as it would allow for an increased
framebuffer resolution and also add a ton more memory for general purpose use
(e.g. double buffering). Since VRAM is byte addressable (at least as far as the
compute cores are concerned), it is difficult (or at least annoying) to store
large 24 bit values from the registers into memory. One fix might be to add a
private scratchpad where cores may store general purpose data, function call
stacks, etc. Currently, recursion or functions calling functions is very
difficult to implement due to the byte addressing limitation.
</p>
<p>
The processors could also use some form of hardware synchronization. A simple
solution is adding a SYNC instruction. Of course, this instruction could be
shared with other instructions to form a set of processor control instructions.
Hardware synchronization would allow for writing kernels where cores must
interact with each other but will likely diverge in execution, without using
clunky software synchronization.
</p>
<p>
Fixed point arithmetic is horrible to work with: poor precision, rounding
errors, etc. The addition of a floating point unit would greatly increase the
ability to write high quality, high precision kernels. Of course, FPUs are huge,
so there would likely only be one or two shared between all the cores. Yes, this
would be slow, but it's worth having high precision arithmetic.
</p>
<p>
All in all, I think I'm happy with how this turned out.
</p>

!!$DPP <template/foot.html # >
