Why Command And Vector Processors Rock September 7th, 2017
I had a Commodore Amiga as a kid. I'm told they were never especially popular in America, but in Europe they were everywhere. Well, sucks to be them I guess.
The Amiga was, and for its time still is, the best home computer ever made. It had a clean, powerful CPU architecture. It had an operating system that blended the best parts of CP/M and Unix with none of the unfriendly parts. It had 4-channel digital stereo sound playback in an era when a typical PC had a sound system whose entire output was either a 0 or a 1. It had proper multitasking, which took another ten years to finally arrive on Windows. It was also a completely open platform, unlike today's mobile, console, and store ecosystems.
But most importantly, it had Agnus.
Agnus is the name of the main chip inside the Amiga. Its primary role is graphics, but not in the way you'd think. The Amiga was unusual compared to most other home computers of the time. A typical early-80s computer had a CPU (usually a Z80, a 6502, or later the 68000). It would also have a video chip, which would read from the framebuffer RAM and either output pixels directly, or look up the bytes in a tilemap and output that instead. And that was generally as far as it went.
The Amiga, however, wasn't satisfied with that. It had a CPU, sure. And it had a video chip ("Denise"), which read in bytes and spat out a video signal. But it didn't stop there. It had a custom-designed ASIC for each part of the machine. The entire hardware was built around these "custom chips" and the means to let them communicate.
Agnus is a kind of "ringmaster" chip. Its main component is the DMA controller (Direct Memory Access). This lets bytes be read from main memory and sent around to the various custom chips as needed. You can think of it as an asynchronous memcpy -- you give it an address and it'll either read or write bytes one at a time to/from the appropriate chip. It supported 25 different DMA channels at once for all the different parts of the machine that needed RAM access.
So what would you want to do with all this DMA? Let's look at one of the biggest examples -- the blitter.
The blitter was another part of the Agnus chip. Its operation was very simple. You'd give it three source pointers, one destination pointer, and a function ID. It would then read individual bits from memory (processing them 16 at a time), perform an arbitrary bitwise operation on them, and store the result out. You can think of it as a general-purpose bitwise arithmetic chip.
Given three bits (let's call them A, B, and C), there are exactly 8 different input combinations. So in order to specify your arithmetic function, you just need a lookup table of 8 result bits, one per combination. This handily fits into a single byte.
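To make that concrete, here's a small sketch (in Python, purely illustrative -- the real blitter is programmed through hardware registers, and the function name here is mine) that builds the 8-bit function ID from any 3-input rule:

```python
# Build a blitter-style "function ID" byte from any 3-input bitwise rule.
# Convention assumed here: bit (A<<2 | B<<1 | C) of the byte holds the
# result for inputs A, B, C.
def minterm(rule):
    value = 0
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                if rule(a, b, c):
                    value |= 1 << ((a << 2) | (b << 1) | c)
    return value

print(hex(minterm(lambda a, b, c: a ^ b)))  # XOR of A and B, ignoring C
print(hex(minterm(lambda a, b, c: 0)))      # all-zeroes rule: clear memory
```

Any of the 256 possible bytes is a valid function, which is exactly why a single byte is enough to describe the whole operation.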
Having this kind of bitwise arithmetic was important because, like many machines of the time, the Amiga used bitplanes as a format to store its graphics in.
Let's say you're using 32-color paletted mode. That's 5 bits per pixel you need to store. How do you store that? Well, you could use a byte per pixel, use 5 of the 8 bits to store your data and leave the other 3 empty. But that's a hell of a waste. Instead, you store it as 5 individual bitplanes, each plane using one bit per pixel. (i.e. each byte contains one bit from eight different pixels)
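Here's a tiny sketch of what that layout means in practice, packing eight 5-bit pixels into five bitplane bytes (the function is illustrative, not any actual Amiga API):

```python
# Pack eight "chunky" pixels (one value per pixel) into bitplane bytes:
# each output byte holds bit N of all eight pixels, one byte per plane.
def to_bitplanes(pixels, depth=5):
    assert len(pixels) == 8
    planes = []
    for plane in range(depth):
        byte = 0
        for i, pixel in enumerate(pixels):
            if (pixel >> plane) & 1:
                byte |= 0x80 >> i   # leftmost pixel lands in the high bit
        planes.append(byte)
    return planes

# Eight pixels, the first one using palette entry 31 (all five bits set):
print(to_bitplanes([31, 0, 0, 0, 0, 0, 0, 0]))  # [128, 128, 128, 128, 128]
```

Five planes of one bit each is 5 bits per pixel with nothing wasted, at the cost of making per-pixel access awkward -- which is exactly where a bitwise chip like the blitter earns its keep.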
Now let's apply our blitter to this. Imagine we want to draw a sprite on-screen. We've got it stored as 5 individual bitplanes, plus a sixth 'mask' bitplane to store the transparency. To get the blitter to draw this for us, we set up our three inputs:
A - the framebuffer data at the position we want to draw at
B - our sprite source data (1st bitplane)
C - our sprite transparency mask
We'll need to repeat the whole thing a total of 5 times, once for each sprite bitplane, but that's fine (it's real quick!). The final piece of the puzzle is how we specify how to mix these three inputs. We can build a Boolean truth table to handle it.
Our goal is to use the transparency bit (C) to select EITHER (A) the background data (if C=0), or (B) the sprite data (if C=1). i.e.
D = C ? B : A. To figure out our function ID, we just list out all eight cases:

A B C | D
0 0 0 | 0
0 0 1 | 0
0 1 0 | 0
0 1 1 | 1
1 0 0 | 1
1 0 1 | 0
1 1 0 | 1
1 1 1 | 1

If we concatenate all the bits of D together (reading from the ABC=111 row down to the 000 row), we get 11011000, or 0xD8. This is called a minterm, and it represents our bitwise operation in its entirety.
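We can check that 0xD8 really does cookie-cut a sprite by simulating one blitter word in software (a sketch only; the real chip does this in hardware, 16 bits at a time):

```python
# Apply a blitter minterm to one 16-bit word from each of the three sources.
# With fn=0xD8 this computes D = C ? B : A, bit by bit.
def blit_word(a, b, c, fn=0xD8):
    d = 0
    for bit in range(16):
        idx = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        if (fn >> idx) & 1:
            d |= 1 << bit
    return d

background = 0xF0F0   # A: what's already on screen
sprite     = 0x0F00   # B: the sprite's bitplane data
mask       = 0x0FF0   # C: 1 = take the sprite bit, 0 = keep the background
print(hex(blit_word(background, sprite, mask)))  # 0xff00
```

Wherever the mask is 1 the sprite bit comes through; everywhere else the background survives untouched.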
This "minterm" idea is a pretty powerful one. You can combine elements together to get any bitwise function you like. Want to XOR images? Sure. Want to just clear memory? That works too, just set the ID to 0x00 and ignore all the inputs. You'll still occasionally see systems that use this. Windows, for example, still uses it for its BitBlt function, although you'd never know that from reading the BitBlt documentation.
To actually program the blitter to do this, we simply write the three source addresses into three of its registers, write the function ID to another register, and then signal it to start. It'll run in the background while our CPU gets on with other things, and we can either check a flag to see if it's finished, or get it to wake the CPU with an interrupt.
So far we've seen how Agnus contains the blitter functionality, and the DMA controller. But there's one more little secret hidden inside this chip, and that's the co-processor (aka "COPR", or simply "copper" to its friends).
The copper was a completely independent CPU that ran in parallel with the main one. It wasn't a Turing-complete, general-purpose CPU like the 68000. It only had three instructions. It didn't have its own memory or registers, but instead shared main memory (like everything else on the Amiga), and it could directly access many of the registers inside the custom chips.
The copper read its instructions via DMA. This meant you allocated some memory and filled in a program, called a "copper list", by writing 16-bit instructions into it. You then pointed the DMA at that address and started the program. The DMA would fetch each instruction and feed them into the copper.
So what could you do with a CPU that only has three instructions? Let's see what the instructions were:
MOVE reg, value
WAIT X, Y
SKIP X, Y
That's a pretty simple machine. We can load a value into a register, wait for the raster beam to hit a specific X/Y position, or skip the next instruction if the raster beam is past a specific X/Y position. Doesn't sound like much at first. If I wanted to write registers I could just do it on the main CPU, right? Why would I want to wait to write registers at a specific time? But there are some surprising effects you can get out of this simple mechanism.
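To give a feel for what a copper list looks like in memory, here's a sketch of assembling one in Python. Each instruction is a pair of 16-bit words; the bit layouts below follow the usual description of the copper's instruction format (MOVE has bit 0 of the first word clear, WAIT and SKIP have it set and are told apart by bit 0 of the second word), but treat the exact encodings and the COLOR00 register offset as illustrative:

```python
# Assemble a toy copper list as a flat array of 16-bit words.
def move(reg, value):
    return [reg & 0x01FE, value & 0xFFFF]   # bit 0 clear = MOVE

def wait(x, y):
    return [((y & 0xFF) << 8) | ((x & 0x7F) << 1) | 1, 0xFFFE]

def skip(x, y):
    return [((y & 0xFF) << 8) | ((x & 0x7F) << 1) | 1, 0xFFFF]

COLOR00 = 0x180   # background-colour register offset (illustrative)

# Turn the background red at scanline 100, blue at scanline 200:
copper_list = (wait(0, 100) + move(COLOR00, 0x0F00) +
               wait(0, 200) + move(COLOR00, 0x000F))
print([hex(w) for w in copper_list])
```

You'd write these words into chip RAM and point the copper's DMA channel at them; every frame, the chip walks the list and fires the register writes at exactly the right beam positions, with the CPU not involved at all.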
What the copper let you do is to apply different properties to different parts of the screen. You can change the address of the framebuffer on a line-by-line basis, for example, to create parallax scrolling in Shadow Of The Beast (above).
Or you can change the address of the framebuffer at the halfway point on each line, to create a 2-player scrolling split-screen, like in Lemmings (no other port of Lemmings could do this!):
Or you could wait till a certain line and change both the color palette AND the framebuffer address, to create a wobbly water effect, like in Ugh:
And this wasn't just a trick for games; even the operating system made use of it. The Amiga allowed programs, if they chose, to have their own private screen with a different resolution. Windows suffered for years with this problem -- ever switch back from a DirectX program and see all your icons have moved around on the desktop? Not in Amiga-land. Here, two different programs on different screens can co-exist just by dragging the menu bar down:
These aren't just different windows you're seeing here. These are different screens, each running at a different resolution! The lower screen is 640x256 at 4 colors and the upper screen is 320x256 at 32 colors. Try that on a PC.
All these effects, and more, are achieved via the simple ability to change settings when you want to, rather than having them fixed at the start of the frame. It didn't require more power to be added to the system, just the flexibility to use the existing system in unusual ways.
If you want to see more creative uses like this, try the excellent codetapper.com which takes apart many Amiga games to see how they do things.
Hardware as a tool
The reason I'm writing all this isn't just to show off how cool the Amiga was. I want to show how its design principles opened up new avenues.
The Amiga hardware never said "this tool is for this purpose". It gave you a toolbox but let you decide what these things were to be used for. And it allowed each tool to interoperate with the others using common registers and common data formats.
I've presented the blitter here as a thing for processing graphics bitplanes, but it was really just a vector coprocessor for operating on boolean/bitwise data. It could be used for other tasks, and it was. The Amiga's floppy disks were formatted using MFM encoding, which is a kind of edge-based binary encoding. To decode it, you had to process the bit array from disk and look for 0-1 transitions. The blitter provided an ideal tool for the job, and the OS used it for exactly that. The same kind of task we might use a compute kernel for today, perhaps.
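The transition-hunting part is a natural fit for a bitwise engine: XOR the raw stream with a copy of itself shifted by one bit, and every position where the signal changed gets marked. Since the blitter could shift its inputs as it read them, this was the kind of job it could chew through 16 bits at a time. A Python sketch of the idea (not the actual decoding the OS performed):

```python
# Mark every bit position where a signal differs from its neighbour:
# edges = x XOR (x >> 1). This is the core trick behind finding the
# 0-1 transitions in an MFM bit stream.
def transitions(word, width=16):
    return (word ^ (word >> 1)) & ((1 << width) - 1)

signal = 0b0000111100110000
print(bin(transitions(signal)))  # 0b100010101000
```

Real MFM decoding also has to strip clock bits and reassemble the data words, but it's all bitwise shuffling of exactly this sort.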
The copper, while seeming to be a very simple processor, effectively acted as an amplifier for the power contained within the other chips. It could be viewed perhaps as a metaprocessor -- not doing the work itself but controlling the work of others.
This combination of a vector processor and a control chip is a powerful one. It's so powerful, in fact, that the machine you're reading this on now has the same architecture. A modern GPU consists of three parts:

a) a triangle drawer
b) a vector processor
c) a command processor
Part a) is a thing that can draw triangles. There's usually special-purpose hardware for doing this. There was a time a few years ago when this was what we thought of as the GPU, but we're seeing less and less of that every year. Games now are doing voxel ray-tracing, and people are using GPUs for lots of things other than just rendering.
Part b) is the vector processor, a unit that reads data and runs functions using it. Ours are much more powerful than the old blitter though. We can do full floating-point operations on ours, not just bitwise ops. But it's a more advanced version of the same principle -- a program that operates on many values at once rather than just one.
Part c) is the command processor. A modern GPU has a chip that reads instructions from the host CPU, decodes the various draw calls, state changes, etc, and then issues work to the vector processor (for compute kernels). Or, when using rendering APIs, it sends work to the triangle drawer which in turn sends work to the vector processor (either to shade vertices or pixels).
Right now we're a little stuck, however. A modern GPU lets you use its triangle drawer (via OpenGL perhaps), and it lets you use its vector processor (via CUDA perhaps). But the one thing it does not do, on almost any platform (even most consoles), is let you use the command processor. About the only one I've ever seen that did give you that kind of access was the PlayStation 2, something I'll no doubt write about in a future article.
You see, the Amiga documented its command processor. The designers wanted you to write programs that ran on it. They wanted you to use it for doing all sorts of clever things. They recognized that the power to operate the underlying horsepower directly was something that could amplify the capabilities of a system way past the limits of its original design.
But on Direct3D, or OpenGL, all you can do is call DrawIndexedPrimitive etc. and let it do things on your behalf. You can't build your own copper lists like you used to on the Amiga. Some APIs let you make a command buffer, but they're usually just recording API calls into it. You can't program it with your own logic, or your own algorithms. The 3D driver has the power to do this, but you don't.
The Amiga was a good machine not because of what it was designed to do but because the designers intentionally gave you the flexibility to do things they'd never designed it to do.
The old COPR chip only had three instructions and couldn't do much by itself, but you could use it to make the rest of the system sing. I'm sure the command processors in modern GPUs are much more advanced -- I'd love to see what we could do with them given the chance.
Written by Richard Mitton,
software engineer and travelling wizard.
Follow me on twitter: http://twitter.com/grumpygiant