!!title="A Parallel Graphics Processor"
!!date="December 13, 2025"
!!
$DPP "$title" "$title" <template/head.html # >
$DPP <template/bar.html # >

$DPP "$1" <template/readtime.html # >
!!
<p><a class="out" href="https://github.com/Raniconduh/paragon">PARAGON</a>
(Parallel Architecture for Rendering and Graphics OperatioNs) is a graphics
processor I designed for a class. Unlike some other class projects, this one was
a lot of fun to build, so I've decided to write about it.</p>

<h3>Overview</h3>
<p>The final design consisted of 12 compute cores drawing to a 320x240
framebuffer. The framebuffer stores 8 bit pixels (RGB 332) and each compute core
accesses a single pixel at a time. Internally, the compute cores implement a
(rather strange) 24 bit Harvard RISC architecture. The architecture was designed
to be as minimal as possible while still being very usable. (More on this
later.) Each core runs in parallel and, for the most part, does not need to know
that any other core exists. The video memory is organized in a 3 level
hierarchy, with two levels of caching and the framebuffer at the highest level.
(Note, however, that the final design did not include the level 2 cache, for
reasons explained below.)</p>

<h3>Architecture</h3>
<p>
The compute cores are designed around a minimal RISC architecture with 16 bit
fixed width instructions and 24 bit registers. The register file contains 16
registers indexed 0 through 15, where register 0 is the zero register (sinking
all writes and sourcing 0) and register 15 is the core ID register (sinking
writes and sourcing the ID of the current core). The core ID register is what
truly allows for parallelism, as otherwise each core would run the exact same
instructions in the exact same order. (This can cause glitches depending on the
kernel; more on this later.)
</p>
<p>
To the untrained eye, a register which is always zero seems wasteful. If you
need a zero, you can just load it in, right? The zero register is a hallmark of
RISC architectures. In fact, the inclusion of this register makes the
architecture smaller and simpler, although it also means the programmer has to
be more aware of what they are doing. (In many cases, this can be handled by the
assembler. e.g. <code>mov rd, rs</code> is translated as <code>add rd, r0, rs</code>.)</p>
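As a sketch of how that assembler-level translation might work (Python, illustrative; only the <code>mov</code> rule comes from the text above, the <code>neg</code> and <code>nop</code> expansions are hypothetical extras):

```python
# Sketch of pseudo-instruction expansion using the zero register (r0).
# Instructions are modeled as tuples of mnemonic and register names.
def expand(mnemonic, *ops):
    """Expand a pseudo-instruction into a single real instruction."""
    if mnemonic == "mov":                 # mov rd, rs  ->  add rd, r0, rs
        rd, rs = ops
        return ("add", rd, "r0", rs)
    if mnemonic == "neg":                 # neg rd, rs  ->  sub rd, r0, rs (hypothetical)
        rd, rs = ops
        return ("sub", rd, "r0", rs)
    if mnemonic == "nop":                 # nop         ->  add r0, r0, r0 (hypothetical)
        return ("add", "r0", "r0", "r0")
    return (mnemonic,) + ops              # already a real instruction
```

Because each pseudo-instruction maps onto one real instruction, the hardware never needs to know these conveniences exist.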
<p>
Instructions have 4 bit opcodes, meaning there are only 16 possible
instructions. Fortunately, this limit was not reached in the final design. Most
instructions have 3 operands, which is typical of RISC architectures, although
some instructions use fewer.
</p>
<p>
The instruction set contains the following ALU operations: ADD, SUB, MUL, AND,
OR, XOR, NOT, SH, ADDI. The first few are self explanatory. SH is the shift
instruction, which performs left, right, or arithmetic shifts on a register by
an immediate (i.e. a constant which is not a register). ADDI is perhaps the most
unusual of the ALU operations as it is not strictly necessary. Simply, it adds
an immediate to a register. Note, however, that the immediate is a signed 4 bit
integer (ew). This is the only instruction to use a 4 bit signed immediate. Even
though the instruction is absolutely not required, it makes writing programs
much easier.
</p>
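A signed 4 bit immediate only covers -8 through +7, so the hardware has to sign-extend it before adding. A quick sketch of that (Python, illustrative; the wrap to 24 bits matches the register width described above):

```python
def sext4(imm4):
    """Sign-extend a 4-bit field to a Python int (range -8..7)."""
    imm4 &= 0xF
    return imm4 - 16 if imm4 & 0x8 else imm4

def addi(reg, imm4):
    """ADDI: add a signed 4-bit immediate to a 24-bit register value."""
    return (reg + sext4(imm4)) & 0xFFFFFF   # wrap to 24 bits
```

So `addi r1, -1` on a register holding 0 wraps around to the all-ones 24 bit value, exactly as two's complement arithmetic should.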
<p>
The processor's memory operations are limited, loading and storing only 1 byte
at a time. The LD and ST instructions are the only two memory operations and
they are used to access video memory (pixelwise). Because there are no other
memory operations, the astute may wonder how the program memory is accessed. The
answer is that it cannot be: program memory is read only and only ever touched
by the control system (i.e. a programmer cannot read or write it). The LDI
instruction is in a similar vein, although it does not read from memory.
Instead, it loads an unsigned 8 bit immediate into a register.
</p>
<p>
Finally, the architecture includes two control flow instructions. B is the
branch instruction. It supports both conditional and unconditional branches and
can either branch to a PC relative immediate or directly to an address stored in
a register. Once again, some may recognize that something is missing. Two
things, actually. Mainly, there is no compare instruction, which seems to imply
that all branching is unconditional. However, all ALU operations modify a set of
flags: N, Z, and C. These flags store whether the output of the ALU is
(N)egative (i.e. the MSB is set), (Z)ero, or whether the operation generated a
(C)arry out. Using these flags, comparisons are typically performed by
subtracting two registers and discarding the result. The other thing that is
missing is some sort of linked branch (i.e. function call). This is solved by
the final control instruction AIPC, or add immediate to program counter. The
instruction adds a signed 8 bit immediate to the program counter (PC) and stores
the result in a register. This is especially helpful for direct branches and
e.g. function calls. The instruction is technically redundant since the PC is
always known to the assembler, but it makes writing functions much easier.
</p>
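The subtract-and-discard comparison idiom can be sketched like this (Python, illustrative; the carry convention shown here, where C set means no borrow occurred, is an assumption on my part and mirrors common RISC practice rather than anything stated above):

```python
MASK24 = (1 << 24) - 1

def sub_flags(a, b):
    """SUB a - b on 24-bit values, returning (result, N, Z, C) like the ALU."""
    raw = (a & MASK24) + ((~b) & MASK24) + 1   # two's-complement subtract
    res = raw & MASK24
    n = bool(res >> 23)        # N: MSB of the 24-bit result is set
    z = res == 0               # Z: result is zero
    c = bool(raw >> 24)        # C: carry out of the top bit (no borrow)
    return res, n, z, c

def blt(a, b):
    """Unsigned branch-if-less-than: subtract, discard the result,
    branch when the carry flag is clear (a borrow occurred)."""
    _, n, z, c = sub_flags(a, b)
    return not c
```

The kernel would issue the SUB with r0 as the destination (discarding the result) and then a conditional B on the flags.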

<h4>Program Memory</h4>
<p>
As mentioned above, the program memory is used solely by the processor's control
unit. Even though it is not as extensible as the video memory, the program
memory is obviously a necessary part of the system. Separating the program
memory (ROM) from the video memory (VRAM) also helped keep the architecture
smaller and simpler (and is what makes it a Harvard architecture).
</p>
<p>
One of the most important capabilities of a graphics processor is running
different kernels. (It wouldn't be very interesting if it could only ever do one
thing.) Because of this, the program ROM is reprogrammable. By enabling the
programming mode on the ROM, instructions can be streamed to the graphics
processor, where they are stored in order. Once programming is complete, the
processor begins execution, starting at the first instruction.
</p>
<p>
But if the compute cores are independent of each other, how can they work with
only one program ROM? One solution is to just go core by core, handling each
read request one at a time. Of course, this is horribly slow. The solution used
by PARAGON is to copy the instruction ROM into multiple (individual) ROMs. Since
the ROM is (as per the name) read only, there is no risk of cores modifying the
instructions in the kernel. This means that every core can access the program
ROM at the exact same time and never has to wait for other cores. On the other
hand, it means that the program ROM is far bigger than it has to be. This is not
a huge deal for PARAGON since the ROM is quite small (2048 instructions deep). I
also believe that speed is a much more important factor here than memory
utilization. (If read requests were handled one at a time, the ROM would
effectively be 12x slower, meaning there wouldn't really be much benefit to
having more than 1 core.)
</p>
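The replication cost is easy to put a number on, using the figures above (2048 instructions deep, 16 bit fixed width instructions, 12 cores):

```python
# Back-of-the-envelope cost of replicating the program ROM per core.
CORES = 12
DEPTH = 2048          # instructions per ROM
INSN_BYTES = 2        # 16-bit fixed-width instructions

single_rom = DEPTH * INSN_BYTES   # one copy: 4096 bytes (the 4kB below)
replicated = CORES * single_rom   # twelve copies: 49152 bytes = 48kB
```

48kB of duplicated ROM is a perfectly acceptable price for fully parallel instruction fetch at this scale.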

<h4>Video Memory</h4>
<p>
The video memory (VRAM) is seemingly simple. Most of it (~60%) is allocated to
the framebuffer (what you actually see on the screen). The rest (~40%) sits
beneath the framebuffer for general purpose use, although it is a bit limited
since memory is accessed byte by byte. (Maybe in the future loads and stores can
optionally work on multiple bytes.) Again, this doesn't seem too complex. But
what happens when multiple cores try to use VRAM at the same time? The program
ROM solution could be applied here too: just copy the VRAM for each core. That
would keep VRAM fast, like it did for the program ROM. But the program ROM was
very small (4kB); copying VRAM for 12 cores would make it enormous and
absolutely impossible to fit. The real solution (and the one employed by
PARAGON) is a caching scheme with memory arbitration.
</p>
<p>
The caching scheme is fairly simple (especially because of the way VRAM was
actually implemented). Each core has a private write-through level 1 cache that
stores 64 cache lines, each of which is 8 bytes wide, for a total of 512 bytes
of cache per core. The cache is direct-mapped, meaning each address maps to
exactly one cache line, so different addresses can compete for the same line.
(This can lead to thrashing due to conflicting cache accesses, but I am treating
it as a non issue for the sake of PARAGON.) When a core wants to access VRAM,
the cache first checks if the memory has been cached already. If it has, great!
On reads, just return the cache line. On writes, modify it and send it to the
next level. When the memory has not yet been cached, the L1 cache requests the
line from the next level and waits for it to come back. But what happens if a
core writes to memory that is cached by another core? There are a lot of
different coherency schemes that can handle this, but PARAGON employs the
simplest: such cache lines are invalidated. This means that if one core tries to
read cached memory that has been modified by a different core, it will have to
fetch the modified line from the next level.
</p>
<p>
In the final design of PARAGON, the memory hierarchy does not include a level
two cache. Instead, memory accesses are directly arbitrated between the private
L1 caches and the full VRAM. The arbitration is fairly simple: the memory
arbiter keeps track of the last core it granted a request to, then looks for
requests starting at the next core before wrapping back around. This means that
no single core can hold the memory arbiter forever.
</p>

<h3>Okay... Does it Even Work?</h3>
<p>
Yes! PARAGON was implemented on a Xilinx Spartan 7 FPGA on the RealDigital
<a class="out" href="https://www.realdigital.org/hardware/urbana">Urbana</a>
development board. The Spartan 7 is a decent low end FPGA with a fair bit of
resources. Originally, I thought that my limiting factor would be LUTs. However,
once I actually built the thing, I quickly found out that I was mostly limited
by RAM. The Spartan 7 (XC7S50) contains 75 RAMB36 blocks. These are synchronous,
dual port, 36 kilobit SRAM blocks. There are four ways in which RAM is used by
the processor: program memory, cache memory, video memory, and CPU memory. I
haven't mentioned anything about a CPU so far, but it is a necessary part of the
system. Of course, you wouldn't have a computer with a GPU but no CPU. The CPU
used in the design is the Xilinx MicroBlaze processor. I stripped out as much as
I could from it to make it small (and slow; CPU performance is not an integral
part of the graphics processor). The MicroBlaze seems to require about 16KB of
memory including all the graphics demos, although I allocated 64KB just to be
safe. All in all, the biggest memory consumer is the VRAM. This is because it is
implemented entirely in on-chip memory. The dev board does come with 128Mbits of
DDR3 memory, but I had a lot of difficulty trying to get it to work and
eventually scrapped that idea. The upside to using the on-chip memory is that it
is very fast: it has a 1 clock cycle read latency, much faster than however many
cycles it would take for DDR3 to be ready. This is what led to the omission of
an L2 cache. If the large, slow DDR3 had been used, the L2 cache would likely
have been necessary.
</p>
<p>
So how fast is it? Well, each compute core runs at 50MHz. Originally the plan
was to run them at 100MHz, but that proved impossible for two reasons. Firstly,
the processor cores are not particularly efficient; that is, they are not
pipelined and expect all data to be ready in the same clock cycle. This does not
scale, and it is what initially caused me to lower the clock frequency to 50MHz.
(This might seem like an oversight on my part, but remember that this was a
class project and I was on a time limit. Pipelined processor design is not the
easiest and would have greatly increased the development time.) The other
limiting factor is physical. The processor uses logic cells from all over the
chip, and physically routing signals from one corner of the chip to the other is
slow. I'm sure I could bump up the frequency a little, but 50MHz is a nice round
number.
</p>

<h3>Demos</h3>
<p>
I designed a bunch of demos to showcase PARAGON's processing power. Since the
architecture contains only integer arithmetic and no special graphics
operations, kernels are limited to fixed point arithmetic and manually written
out functions (like matrix multiplication or dot product). (Floating point is
technically possible using a software floating point library, but I haven't
bothered trying to implement one as it is likely very difficult.) Surprisingly,
even though kernels were limited to fixed point, I was still able to get some
good looking demos.
</p>
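For anyone unfamiliar with fixed point: you pick a number of fractional bits and pretend the integer is scaled by that power of two. A minimal sketch (Python; the Q12.12 split shown here is a hypothetical choice for the 24 bit registers, not necessarily what my kernels use):

```python
FRAC_BITS = 12                         # hypothetical Q12.12 format

def to_fix(x):
    """Convert a real number to fixed point."""
    return int(round(x * (1 << FRAC_BITS)))

def fix_mul(a, b):
    """Multiply two fixed-point numbers; the product has twice the
    fractional bits, so shift right once to renormalize."""
    return (a * b) >> FRAC_BITS

def to_float(a):
    """Convert fixed point back to a real number (for inspection)."""
    return a / (1 << FRAC_BITS)
```

Addition and subtraction need no adjustment at all, which is why fixed point maps so cleanly onto an integer-only ALU.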
<p>
The following video shows the processor running the Rule 30 elementary cellular
automaton. The simulation is artificially slowed and runs on only a single core.
Yes, this is intentional: the simulation was otherwise way too fast and did not
look good.
</p>
<div class="video"><video width="320" height="240" controls>
<source src="content/rule30.webm" type="video/webm">
Rule 30 Elementary Cellular Automaton Simulation
</video></div>
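The Rule 30 update itself is tiny, which is what makes it a nice first kernel. One generation looks like this (Python sketch; the wrap-around edges are an assumption, my kernel's boundary handling may differ):

```python
def rule30_step(row):
    """One generation of Rule 30: new cell = left XOR (center OR right).
    `row` is a list of 0/1 cells; edges wrap around."""
    n = len(row)
    return [row[i - 1] ^ (row[i] | row[(i + 1) % n]) for i in range(n)]
```

Starting from a single live cell, repeated steps produce the familiar chaotic triangle seen in the video.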
<p>
You might notice the strange purple bar on the left side of the screen. This is
caused by the (soft) HDMI encoder that is used to generate the video output. Can
this be fixed? Probably. Is it a big deal? I don't think so. And so it stays.
The demo looks cool (in my opinion) but it doesn't showcase the real power of
the processor. For that, I wrote a Mandelbrot set rendering kernel. The
Mandelbrot set is an "image" found by computing the rate of divergence of the
iteration <code>z_next = z_prev^2 + c</code>. I don't want to talk too much
about the math, but the kernel just computes the formula a bunch of times and
sees when (or if) the variable eventually blows up (i.e. gets really big in
magnitude). The demo shows the computation twice: once on a single core and
again using all 12 cores.
</p>
<div class="video"><video width="320" height="240" controls>
<source src="content/mandelbrot.webm" type="video/webm">
Single Core and Multi Core Mandelbrot Set Rendering
</video></div>
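The per-pixel escape-time loop, done with only integer arithmetic, looks roughly like this (Python sketch; the fractional-bit count and iteration cap are hypothetical stand-ins for whatever my kernel actually uses):

```python
FRAC = 12                                    # hypothetical Q-format

def mandelbrot_iters(cr, ci, max_iter=32):
    """Escape-time count for one pixel, fixed point throughout.
    cr, ci are the pixel's complex coordinate in fixed point."""
    zr = zi = 0
    for i in range(max_iter):
        zr2 = (zr * zr) >> FRAC
        zi2 = (zi * zi) >> FRAC
        if zr2 + zi2 > (4 << FRAC):          # |z|^2 > 4: diverged
            return i
        # z = z^2 + c, expanded into real and imaginary parts
        zr, zi = zr2 - zi2 + cr, ((2 * zr * zi) >> FRAC) + ci
    return max_iter                          # assumed inside the set
```

The iteration count is then mapped to a pixel color; points that never escape are painted as the set's interior.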
<p>
Clearly, the single core rendering is very slow while the multicore rendering is
quite fast. Even though cores still have to wait their turn to write to memory,
the heavy computation involved means that the memory bus sits mostly idle and so
there are few conflicts between cores.
</p>
<p>
Another interesting demo is Conway's Game of Life. The Game of Life is another
cellular automaton, but it produces more interesting patterns than Rule 30. The
demo uses Rule 30 as a pseudorandom initial state and then simulates the Game of
Life.
</p>
<div class="video"><video width="320" height="240" controls>
<source src="content/gol.webm" type="video/webm">
Conway's Game of Life Simulation
</video></div>
<p>
The demo runs the Game of Life on multiple cores. This is not because the
simulation is computationally complex, but because it requires heavy memory
usage. For each pixel, the kernel fetches the 8 surrounding pixels to compute
the next state, which is stored in the right half of the framebuffer. Once all
cells have been computed, the next state buffer is copied back to the left half
of the framebuffer so that the next computation can begin. I have also run this
demo on a single core and it is abysmally slow. So, even though it is a memory
heavy kernel, the cores are still able to share the memory bus effectively,
thereby increasing the simulation speed.
</p>
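The per-cell rule behind those 8 neighbor fetches is simple to state in code (Python sketch; wrap-around edges are an assumption, and the real kernel of course works directly on framebuffer bytes rather than lists):

```python
def life_step(grid):
    """One Game of Life generation on a 2D list of 0/1 cells. Each cell
    counts its 8 neighbours, mirroring the per-pixel kernel."""
    h, w = len(grid), len(grid[0])
    nxt = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            n = sum(grid[(y + dy) % h][(x + dx) % w]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
            # survive with 2 or 3 neighbours, born with exactly 3
            nxt[y][x] = 1 if n == 3 or (grid[y][x] and n == 2) else 0
    return nxt
```

Nine reads per written pixel is why the demo stresses the memory arbiter far more than the ALUs.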
<p>
The final interesting demo is only half baked (visually and computationally, but
still interesting). It rotates the vertices of a cube by applying a rotation
matrix to them. The demo is also artificially slowed and single core since it
did not look good at true computational speed.
</p>
<div class="video"><video width="320" height="240" controls>
<source src="content/cube.webm" type="video/webm">
A 3 Dimensional Rotating Cube
</video></div>
<p>
Again, due to time constraints, I did not make the cube visually appealing
(edges or face coloring). However, the video clearly shows a rotating cube.
Eventually, the cube begins to fall apart, and this is due to inaccuracy in the
rotation matrix and vertex coordinates. Because they are fixed point, only so
much accuracy can be encoded. To preserve some accuracy, the matrix values were
precomputed and somewhat optimized (I checked a lot of different values by hand
and the one in this demo is one of the best I found). The kernel does not
re-normalize the vertices, even though it should, once again due to time
constraints.
</p>
<p>
The rotating cube demo is also single core for another reason (hinted at above).
The cube's vertices are stored in VRAM right below the framebuffer (so they are
not visible in the video output). Because of this, if multiple cores read the
vertices to draw them to the screen and then modify the vertices and store them
back, the coordinates end up corrupted and too many vertices are drawn to the
screen. The real fix is to have each core work on an independent set of vertices
and synchronize the cores before clearing the screen. I elected not to do this
since it adds additional complexity and would still look bad due to being much
too fast.
</p>
<p>
Below are some more fun renderings.
</p>

<table class="images"><tbody>
<tr>
<td class="image"><img src="content/blurbrot.jpg" width="320" height="240"></td>
<td class="image"><img src="content/blurrule30.jpg" width="320" height="240"></td>
</tr>
<tr>
<td class="caption">Blurred Mandelbrot Set</td>
<td class="caption">Blurred Rule 30 Simulation</td>
</tr>
<tr>
<td class="image"><img src="content/julia1.jpg" width="320" height="240"></td>
<td class="image"><img src="content/julia2.jpg" width="320" height="240"></td>
</tr>
<tr>
<td class="caption">Julia Set 1</td>
<td class="caption">Julia Set 2</td>
</tr>
<tr>
<td class="image"><img src="content/julia3.jpg" width="320" height="240"></td>
<td class="image"><img src="content/ship.jpg" width="320" height="240"></td>
</tr>
<tr>
<td class="caption">Julia Set 3</td>
<td class="caption">Burning Ship Fractal</td>
</tr>
</tbody></table>

<p>
All these demos were painstakingly written by hand in PARAGON assembly.
</p>

<h3>What Now?</h3>
<p>
Even though it clearly works, PARAGON is lacking some features. The inclusion of
large DDR3 memory would be very useful as it would allow for an increased
framebuffer resolution and also add a ton more memory for general purpose use
(e.g. double buffering). Since VRAM is byte addressable (at least as far as the
compute cores are concerned), it is difficult (or at least annoying) to store
large 24 bit values from the registers into memory. One fix might be to add a
private scratchpad where cores can keep general purpose data, function call
stacks, etc. Currently, recursion and functions calling functions are very
difficult to implement due to the byte addressing limitation.
</p>
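To make the annoyance concrete: spilling one 24 bit register today costs three ST instructions plus the address arithmetic, and reloading it costs three LDs plus shifts and ORs. In sketch form (Python, illustrative; the little-endian byte order is an arbitrary choice here):

```python
def store24(mem, addr, value):
    """Store a 24-bit register value as 3 consecutive byte stores
    (little-endian); `mem` is a bytearray standing in for VRAM."""
    for i in range(3):
        mem[addr + i] = (value >> (8 * i)) & 0xFF

def load24(mem, addr):
    """Reassemble a 24-bit value from 3 byte loads."""
    return mem[addr] | (mem[addr + 1] << 8) | (mem[addr + 2] << 16)
```

A register-width scratchpad port would collapse each of these into a single instruction, which is what makes call stacks practical.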
<p>
The processors could also use some form of hardware synchronization. A simple
solution is adding a SYNC instruction. Of course, this instruction could share
an opcode with others to form a set of processor control instructions. Hardware
synchronization would allow for writing kernels where cores must interact with
each other but would otherwise diverge in execution, without resorting to clunky
software synchronization.
</p>
<p>
Fixed point arithmetic is horrible to work with: poor precision, rounding
errors, etc. The addition of a floating point unit would greatly increase the
ability to write high quality, high precision kernels. Of course, FPUs are huge,
so there would likely only be one or two shared between all the cores. Yes, this
would be slow, but it's worth it for high precision arithmetic.
</p>
<p>
All in all, I think I'm happy with how this turned out.
</p>

!!$DPP <template/foot.html # >