!!title="A Parallel Graphics Processor"
!!date="December 13, 2025"
!!
$DPP "$title" "$title" <template/head.html # >
$DPP <template/bar.html # >

$DPP "$1" <template/readtime.html # >
!!
<p><a class="out" href="https://github.com/Raniconduh/paragon">PARAGON</a>
(Parallel Architecture for Rendering and Graphics OperatioNs) is a graphics
processor I designed for a class. Unlike some other class projects, this one was
a lot of fun to build, so I've decided to write about it.</p>

<h3>Overview</h3>
<p>The final design consisted of 12 compute cores drawing to a 320x240
framebuffer. The framebuffer stores 8 bit pixels (RGB 332) and each compute core
accesses a single pixel at a time. Internally, the compute cores implement a
(rather strange) 24 bit Harvard RISC architecture. The architecture was designed
to be as minimal as possible while still being very usable. (More on this
later.) Each core runs in parallel and, for the most part, does not need to know
that any other core exists. The video memory is organized in a 3 level
hierarchy, with two levels of caching and the framebuffer at the highest level.
(Note, however, that the final design did not include the level 2 cache, for
reasons explained below.)</p>

<h3>Architecture</h3>
<p>
The compute cores are designed around a minimal RISC architecture with 16 bit
fixed width instructions and 24 bit registers. The register file contains 16
registers indexed 0 through 15, where register 0 is the zero register (sinking
all writes and sourcing 0) and register 15 is the core ID register (sinking
writes and sourcing the ID of the current core). The core ID register is what
truly allows for parallelism, as otherwise each core would run the exact same
instructions in the exact same order. (This can cause glitches depending on the
kernel; more on this later.)
</p>
<p>
To the untrained eye, a register which is always zero seems wasteful. If you
need a zero, you can just load it in, right? The zero register is a hallmark of
RISC architectures. In fact, the inclusion of this register makes the
architecture smaller and simpler, although it also means the programmer has to
be more aware of what they are doing. (In many cases, this can be handled by the
assembler. e.g. <code>mov rd, rs</code> is translated as <code>add rd, r0, rs</code>.)</p>
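As a sketch of how that assembler-level translation might work (Python, illustrative; only the <code>mov</code> rule comes from the text above, the <code>neg</code> and <code>nop</code> expansions are hypothetical extras):

```python
# Sketch of pseudo-instruction expansion using the zero register (r0).
# Instructions are modeled as tuples of mnemonic and register names.
def expand(mnemonic, *ops):
    """Expand a pseudo-instruction into a single real instruction."""
    if mnemonic == "mov":                 # mov rd, rs  ->  add rd, r0, rs
        rd, rs = ops
        return ("add", rd, "r0", rs)
    if mnemonic == "neg":                 # neg rd, rs  ->  sub rd, r0, rs (hypothetical)
        rd, rs = ops
        return ("sub", rd, "r0", rs)
    if mnemonic == "nop":                 # nop         ->  add r0, r0, r0 (hypothetical)
        return ("add", "r0", "r0", "r0")
    return (mnemonic,) + ops              # already a real instruction
```

Because each pseudo-instruction maps onto one real instruction, the hardware never needs to know these conveniences exist.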
<p>
Instructions have 4 bit opcodes, meaning there are only 16 possible
instructions. Fortunately, this limit was not reached in the final design. Most
instructions have 3 operands, which is typical of RISC architectures, although
some instructions use fewer.
</p>
<p>
The instruction set contains the following ALU operations: ADD, SUB, MUL, AND,
OR, XOR, NOT, SH, ADDI. The first few are self explanatory. SH is the shift
instruction, which performs left, right, or arithmetic shifts on a register by
an immediate (i.e. a constant which is not a register). ADDI is perhaps the most
unusual of the ALU operations as it is not strictly necessary. Simply, it adds
an immediate to a register. Note, however, that the immediate is a signed 4 bit
integer (ew). This is the only instruction to use a 4 bit signed immediate. Even
though the instruction is absolutely not required, it makes writing programs
much easier.
</p>
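A signed 4 bit immediate only covers -8 through +7, so the hardware has to sign-extend it before adding. A quick sketch of that (Python, illustrative; the wrap to 24 bits matches the register width described above):

```python
def sext4(imm4):
    """Sign-extend a 4-bit field to a Python int (range -8..7)."""
    imm4 &= 0xF
    return imm4 - 16 if imm4 & 0x8 else imm4

def addi(reg, imm4):
    """ADDI: add a signed 4-bit immediate to a 24-bit register value."""
    return (reg + sext4(imm4)) & 0xFFFFFF   # wrap to 24 bits
```

So `addi r1, -1` on a register holding 0 wraps around to the all-ones 24 bit value, exactly as two's complement arithmetic should.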
<p>
The processor's memory operations are limited, loading and storing only 1 byte
at a time. The LD and ST instructions are the only two memory operations and
they are used to access video memory (pixelwise). Because there are no other
memory operations, the astute may wonder how the program memory is accessed. The
answer is that it cannot be: program memory is read only and only ever touched
by the control system (i.e. a programmer cannot read or write it). The LDI
instruction is in a similar vein, although it does not read from memory.
Instead, it loads an unsigned 8 bit immediate into a register.
</p>
<p>
Finally, the architecture includes two control flow instructions. B is the
branch instruction. It supports both conditional and unconditional branches and
can either branch to a PC relative immediate or directly to an address stored in
a register. Once again, some may recognize that something is missing. Two
things, actually. Mainly, there is no compare instruction, which seems to imply
that all branching is unconditional. However, all ALU operations modify a set of
flags: N, Z, and C. These flags store whether the output of the ALU is
(N)egative (i.e. the MSB is set), (Z)ero, or whether the operation generated a
(C)arry out. Using these flags, comparisons are typically performed by
subtracting two registers and discarding the result. The other thing that is
missing is some sort of linked branch (i.e. function call). This is solved by
the final control instruction AIPC, or add immediate to program counter. The
instruction adds a signed 8 bit immediate to the program counter (PC) and stores
the result in a register. This is especially helpful for direct branches and
e.g. function calls. The instruction is technically redundant since the PC is
always known to the assembler, but it makes writing functions much easier.
</p>
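The subtract-and-discard comparison idiom can be sketched like this (Python, illustrative; the carry convention shown here, where C set means no borrow occurred, is an assumption on my part and mirrors common RISC practice rather than anything stated above):

```python
MASK24 = (1 << 24) - 1

def sub_flags(a, b):
    """SUB a - b on 24-bit values, returning (result, N, Z, C) like the ALU."""
    raw = (a & MASK24) + ((~b) & MASK24) + 1   # two's-complement subtract
    res = raw & MASK24
    n = bool(res >> 23)        # N: MSB of the 24-bit result is set
    z = res == 0               # Z: result is zero
    c = bool(raw >> 24)        # C: carry out of the top bit (no borrow)
    return res, n, z, c

def blt(a, b):
    """Unsigned branch-if-less-than: subtract, discard the result,
    branch when the carry flag is clear (a borrow occurred)."""
    _, n, z, c = sub_flags(a, b)
    return not c
```

The kernel would issue the SUB with r0 as the destination (discarding the result) and then a conditional B on the flags.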

<h4>Program Memory</h4>
<p>
As mentioned above, the program memory is used solely by the processor's control
unit. Even though it is not as extensible as the video memory, the program
memory is obviously a necessary part of the system. Separating the program
memory (ROM) from the video memory (VRAM) also helped keep the architecture
smaller and simpler (and is what makes it a Harvard architecture).
</p>
<p>
One of the most important capabilities of a graphics processor is running
different kernels. (It wouldn't be very interesting if it could only ever do one
thing.) Because of this, the program ROM is reprogrammable. By enabling the
programming mode on the ROM, instructions can be streamed to the graphics
processor, where they are stored in order. Once programming is complete, the
processor begins execution, starting at the first instruction.
</p>
<p>
But if the compute cores are independent of each other, how can they work with
only one program ROM? One solution is to just go core by core, handling each
read request one at a time. Of course, this is horribly slow. The solution used
by PARAGON is to copy the instruction ROM into multiple (individual) ROMs. Since
the ROM is (as per the name) read only, there is no risk of cores modifying the
instructions in the kernel. This means that every core can access the program
ROM at the exact same time and never has to wait for other cores. On the other
hand, it means that the program ROM is far bigger than it has to be. This is not
a huge deal for PARAGON since the ROM is quite small (2048 instructions deep). I
also believe that speed is a much more important factor here than memory
utilization. (If read requests were handled one at a time, the ROM would
effectively be 12x slower, meaning there wouldn't really be much benefit to
having more than 1 core.)
</p>
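The replication cost is easy to put a number on, using the figures above (2048 instructions deep, 16 bit fixed width instructions, 12 cores):

```python
# Back-of-the-envelope cost of replicating the program ROM per core.
CORES = 12
DEPTH = 2048          # instructions per ROM
INSN_BYTES = 2        # 16-bit fixed-width instructions

single_rom = DEPTH * INSN_BYTES   # one copy: 4096 bytes (the 4kB below)
replicated = CORES * single_rom   # twelve copies: 49152 bytes = 48kB
```

48kB of duplicated ROM is a perfectly acceptable price for fully parallel instruction fetch at this scale.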

<h4>Video Memory</h4>
<p>
The video memory (VRAM) is seemingly simple. Most of it (~60%) is allocated to
the framebuffer (what you actually see on the screen). The rest (~40%) sits
beneath the framebuffer for general purpose use, although it is a bit limited
since memory is accessed byte by byte. (Maybe in the future loads and stores can
optionally work on multiple bytes.) Again, this doesn't seem too complex. But
what happens when multiple cores try to use VRAM at the same time? The program
ROM solution could be applied here too: just copy the VRAM for each core. That
would keep VRAM fast, like it did for the program ROM. But the program ROM was
very small (4kB); copying VRAM for 12 cores would make it enormous and
absolutely impossible to fit. The real solution (and the one employed by
PARAGON) is a caching scheme with memory arbitration.
</p>
<p>
The caching scheme is fairly simple (especially because of the way VRAM was
actually implemented). Each core has a private write-through level 1 cache that
stores 64 cache lines, each of which is 8 bytes wide, for a total of 512 bytes
of cache per core. The cache is direct-mapped, meaning each address maps to
exactly one cache line, so different addresses can compete for the same line.
(This can lead to thrashing due to conflicting cache accesses, but I am treating
it as a non issue for the sake of PARAGON.) When a core wants to access VRAM,
the cache first checks if the memory has been cached already. If it has, great!
On reads, just return the cache line. On writes, modify it and send it to the
next level. When the memory has not yet been cached, the L1 cache requests the
line from the next level and waits for it to come back. But what happens if a
core writes to memory that is cached by another core? There are a lot of
different coherency schemes that can handle this, but PARAGON employs the
simplest: such cache lines are invalidated. This means that if one core tries to
read cached memory that has been modified by a different core, it will have to
fetch the modified line from the next level.
</p>
<p>
In the final design of PARAGON, the memory hierarchy does not include a level
two cache. Instead, memory accesses are directly arbitrated between the private
L1 caches and the full VRAM. The arbitration is fairly simple: the memory
arbiter keeps track of the last core it granted a request to, then looks for
requests starting at the next core before wrapping back around. This means that
no single core can hold the memory arbiter forever.
</p>

<h3>Okay... Does it Even Work?</h3>
<p>
Yes! PARAGON was implemented on a Xilinx Spartan 7 FPGA on the RealDigital
<a class="out" href="https://www.realdigital.org/hardware/urbana">Urbana</a>
development board. The Spartan 7 is a decent low end FPGA with a fair bit of
resources. Originally, I thought that my limiting factor would be LUTs. However,
once I actually built the thing, I quickly found out that I was mostly limited
by RAM. The Spartan 7 (XC7S50) contains 75 RAMB36 blocks. These are synchronous,
dual port, 36 kilobit SRAM blocks. There are four ways in which RAM is used by
the processor: program memory, cache memory, video memory, and CPU memory. I
haven't mentioned anything about a CPU so far, but it is a necessary part of the
system. Of course, you wouldn't have a computer with a GPU but no CPU. The CPU
used in the design is the Xilinx MicroBlaze processor. I stripped out as much as
I could from it to make it small (and slow; CPU performance is not an integral
part of the graphics processor). The MicroBlaze seems to require about 16KB of
memory including all the graphics demos, although I allocated 64KB just to be
safe. All in all, the biggest memory consumer is the VRAM. This is because it is
implemented entirely in on-chip memory. The dev board does come with 128Mbits of
DDR3 memory, but I had a lot of difficulty trying to get it to work and
eventually scrapped that idea. The upside to using the on-chip memory is that it
is very fast: it has a 1 clock cycle read latency, much faster than however many
cycles it would take for DDR3 to be ready. This is what led to the omission of
an L2 cache. If the large, slow DDR3 had been used, the L2 cache would likely
have been necessary.
</p>
<p>
So how fast is it? Well, each compute core runs at 50MHz. Originally the plan
was to run them at 100MHz, but that proved impossible for two reasons. Firstly,
the processor cores are not particularly efficient; that is, they are not
pipelined and expect all data to be ready in the same clock cycle. This does not
scale, and it is what initially caused me to lower the clock frequency to 50MHz.
(This might seem like an oversight on my part, but remember that this was a
class project and I was on a time limit. Pipelined processor design is not the
easiest and would have greatly increased the development time.) The other
limiting factor is physical. The processor uses logic cells from all over the
chip, and physically routing signals from one corner of the chip to the other is
slow. I'm sure I could bump up the frequency a little, but 50MHz is a nice round
number.
</p>

<h3>Demos</h3>
<p>
I designed a bunch of demos to showcase PARAGON's processing power. Since the
architecture contains only integer arithmetic and no special graphics
operations, kernels are limited to fixed point arithmetic and manually written
out functions (like matrix multiplication or dot product). (Floating point is
technically possible using a software floating point library, but I haven't
bothered trying to implement one as it is likely very difficult.) Surprisingly,
even though kernels were limited to fixed point, I was still able to get some
good looking demos.
</p>
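For anyone unfamiliar with fixed point: you pick a number of fractional bits and pretend the integer is scaled by that power of two. A minimal sketch (Python; the Q12.12 split shown here is a hypothetical choice for the 24 bit registers, not necessarily what my kernels use):

```python
FRAC_BITS = 12                         # hypothetical Q12.12 format

def to_fix(x):
    """Convert a real number to fixed point."""
    return int(round(x * (1 << FRAC_BITS)))

def fix_mul(a, b):
    """Multiply two fixed-point numbers; the product has twice the
    fractional bits, so shift right once to renormalize."""
    return (a * b) >> FRAC_BITS

def to_float(a):
    """Convert fixed point back to a real number (for inspection)."""
    return a / (1 << FRAC_BITS)
```

Addition and subtraction need no adjustment at all, which is why fixed point maps so cleanly onto an integer-only ALU.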
<p>
The following video shows the processor running the Rule 30 elementary cellular
automaton. The simulation is artificially slowed and runs on only a single core.
Yes, this is intentional: the simulation was otherwise way too fast and did not
look good.
</p>
<div class="video"><video width="320" height="240" controls>
<source src="content/rule30.webm" type="video/webm">
Rule 30 Elementary Cellular Automaton Simulation
</video></div>
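The Rule 30 update itself is tiny, which is what makes it a nice first kernel. One generation looks like this (Python sketch; the wrap-around edges are an assumption, my kernel's boundary handling may differ):

```python
def rule30_step(row):
    """One generation of Rule 30: new cell = left XOR (center OR right).
    `row` is a list of 0/1 cells; edges wrap around."""
    n = len(row)
    return [row[i - 1] ^ (row[i] | row[(i + 1) % n]) for i in range(n)]
```

Starting from a single live cell, repeated steps produce the familiar chaotic triangle seen in the video.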
<p>
You might notice the strange purple bar on the left side of the screen. This is
caused by the (soft) HDMI encoder that is used to generate the video output. Can
this be fixed? Probably. Is it a big deal? I don't think so. And so it stays.
The demo looks cool (in my opinion) but it doesn't showcase the real power of
the processor. For that, I wrote a Mandelbrot set rendering kernel. The
Mandelbrot set is an "image" found by computing the rate of divergence of the
iteration <code>z_next = z_prev^2 + c</code>. I don't want to talk too much
about the math, but the kernel just computes the formula a bunch of times and
sees when (or if) the variable eventually blows up (i.e. gets really big in
magnitude). The demo shows the computation twice: once on a single core and
again using all 12 cores.
</p>
<div class="video"><video width="320" height="240" controls>
<source src="content/mandelbrot.webm" type="video/webm">
Single Core and Multi Core Mandelbrot Set Rendering
</video></div>
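The per-pixel escape-time loop, done with only integer arithmetic, looks roughly like this (Python sketch; the fractional-bit count and iteration cap are hypothetical stand-ins for whatever my kernel actually uses):

```python
FRAC = 12                                    # hypothetical Q-format

def mandelbrot_iters(cr, ci, max_iter=32):
    """Escape-time count for one pixel, fixed point throughout.
    cr, ci are the pixel's complex coordinate in fixed point."""
    zr = zi = 0
    for i in range(max_iter):
        zr2 = (zr * zr) >> FRAC
        zi2 = (zi * zi) >> FRAC
        if zr2 + zi2 > (4 << FRAC):          # |z|^2 > 4: diverged
            return i
        # z = z^2 + c, expanded into real and imaginary parts
        zr, zi = zr2 - zi2 + cr, ((2 * zr * zi) >> FRAC) + ci
    return max_iter                          # assumed inside the set
```

The iteration count is then mapped to a pixel color; points that never escape are painted as the set's interior.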
<p>
Clearly, the single core rendering is very slow while the multicore rendering is
quite fast. Even though cores still have to wait their turn to write to memory,
the heavy computation involved means that the memory bus sits mostly idle and so
there are few conflicts between cores.
</p>
<p>
Another interesting demo is Conway's Game of Life. The Game of Life is another
cellular automaton, but it produces more interesting patterns than Rule 30. The
demo uses Rule 30 as a pseudorandom initial state and then simulates the Game of
Life.
</p>
<div class="video"><video width="320" height="240" controls>
<source src="content/gol.webm" type="video/webm">
Conway's Game of Life Simulation
</video></div>
<p>
The demo runs the Game of Life on multiple cores. This is not because the
simulation is computationally complex, but because it requires heavy memory
usage. For each pixel, the kernel fetches the 8 surrounding pixels to compute
the next state, which is stored in the right half of the framebuffer. Once all
cells have been computed, the next state buffer is copied back to the left half
of the framebuffer so that the next computation can begin. I have also run this
demo on a single core and it is abysmally slow. So, even though it is a memory
heavy kernel, the cores are still able to share the memory bus effectively,
thereby increasing the simulation speed.
</p>
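The per-cell rule behind those 8 neighbor fetches is simple to state in code (Python sketch; wrap-around edges are an assumption, and the real kernel of course works directly on framebuffer bytes rather than lists):

```python
def life_step(grid):
    """One Game of Life generation on a 2D list of 0/1 cells. Each cell
    counts its 8 neighbours, mirroring the per-pixel kernel."""
    h, w = len(grid), len(grid[0])
    nxt = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            n = sum(grid[(y + dy) % h][(x + dx) % w]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
            # survive with 2 or 3 neighbours, born with exactly 3
            nxt[y][x] = 1 if n == 3 or (grid[y][x] and n == 2) else 0
    return nxt
```

Nine reads per written pixel is why the demo stresses the memory arbiter far more than the ALUs.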
<p>
The final interesting demo is only half baked (visually and computationally, but
still interesting). It rotates the vertices of a cube by applying a rotation
matrix to them. The demo is also artificially slowed and single core since it
did not look good at true computational speed.
</p>
<div class="video"><video width="320" height="240" controls>
<source src="content/cube.webm" type="video/webm">
A 3 Dimensional Rotating Cube
</video></div>
<p>
Again, due to time constraints, I did not make the cube visually appealing
(edges or face coloring). However, the video clearly shows a rotating cube.
Eventually, the cube begins to fall apart, and this is due to inaccuracy in the
rotation matrix and vertex coordinates. Because they are fixed point, only so
much accuracy can be encoded. To preserve some accuracy, the matrix values were
precomputed and somewhat optimized (I checked a lot of different values by hand
and the one in this demo is one of the best I found). The kernel does not
re-normalize the vertices, even though it should, once again due to time
constraints.
</p>
<p>
The rotating cube demo is also single core for another reason (hinted at above).
The cube's vertices are stored in VRAM right below the framebuffer (so they are
not visible in the video output). Because of this, if multiple cores read the
vertices to draw them to the screen and then modify the vertices and store them
back, the coordinates end up corrupted and too many vertices are drawn to the
screen. The real fix is to have each core work on an independent set of vertices
and synchronize the cores before clearing the screen. I elected not to do this
since it adds additional complexity and would still look bad due to being much
too fast.
</p>
<p>
Below are some more fun renderings.
</p>

<table class="images"><tbody>
<tr>
<td class="image"><img src="content/blurbrot.jpg" width="320" height="240"></td>
<td class="image"><img src="content/blurrule30.jpg" width="320" height="240"></td>
</tr>
<tr>
<td class="caption">Blurred Mandelbrot Set</td>
<td class="caption">Blurred Rule 30 Simulation</td>
</tr>
<tr>
<td class="image"><img src="content/julia1.jpg" width="320" height="240"></td>
<td class="image"><img src="content/julia2.jpg" width="320" height="240"></td>
</tr>
<tr>
<td class="caption">Julia Set 1</td>
<td class="caption">Julia Set 2</td>
</tr>
<tr>
<td class="image"><img src="content/julia3.jpg" width="320" height="240"></td>
<td class="image"><img src="content/ship.jpg" width="320" height="240"></td>
</tr>
<tr>
<td class="caption">Julia Set 3</td>
<td class="caption">Burning Ship Fractal</td>
</tr>
</tbody></table>

<p>
All these demos were painstakingly written by hand in PARAGON assembly.
</p>

<h3>What Now?</h3>
<p>
Even though it clearly works, PARAGON is lacking some features. The inclusion of
large DDR3 memory would be very useful as it would allow for an increased
framebuffer resolution and also add a ton more memory for general purpose use
(e.g. double buffering). Since VRAM is byte addressable (at least as far as the
compute cores are concerned), it is difficult (or at least annoying) to store
large 24 bit values from the registers into memory. One fix might be to add a
private scratchpad where cores can keep general purpose data, function call
stacks, etc. Currently, recursion and functions calling functions are very
difficult to implement due to the byte addressing limitation.
</p>
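To make the annoyance concrete: spilling one 24 bit register today costs three ST instructions plus the address arithmetic, and reloading it costs three LDs plus shifts and ORs. In sketch form (Python, illustrative; the little-endian byte order is an arbitrary choice here):

```python
def store24(mem, addr, value):
    """Store a 24-bit register value as 3 consecutive byte stores
    (little-endian); `mem` is a bytearray standing in for VRAM."""
    for i in range(3):
        mem[addr + i] = (value >> (8 * i)) & 0xFF

def load24(mem, addr):
    """Reassemble a 24-bit value from 3 byte loads."""
    return mem[addr] | (mem[addr + 1] << 8) | (mem[addr + 2] << 16)
```

A register-width scratchpad port would collapse each of these into a single instruction, which is what makes call stacks practical.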
<p>
The processors could also use some form of hardware synchronization. A simple
solution is adding a SYNC instruction. Of course, this instruction could share
an opcode with others to form a set of processor control instructions. Hardware
synchronization would allow for writing kernels where cores must interact with
each other but would otherwise diverge in execution, without resorting to clunky
software synchronization.
</p>
<p>
Fixed point arithmetic is horrible to work with: poor precision, rounding
errors, etc. The addition of a floating point unit would greatly increase the
ability to write high quality, high precision kernels. Of course, FPUs are huge,
so there would likely only be one or two shared between all the cores. Yes, this
would be slow, but it's worth it for high precision arithmetic.
</p>
<p>
All in all, I think I'm happy with how this turned out.
</p>

!!$DPP <template/foot.html # >