Something Rotten In The Core October 24th, 2017
There's a key idea at the heart of the UNIX philosophy, which centers around linking programs together. You know, piping the output from sed into sort, that kind of thing. It kinda works well, I guess. For text at least.
But one of the reasons it can work OK is because you, as the end-user writing this little script or command, have full knowledge of the pieces you're building it from. You understand grep, you understand sed, and if any of those pieces suddenly stop working then you can pick the piped command-line apart again and see why. It's a system made up of pieces that you have control over, and most importantly: all the pieces are exposed to you.
This idea spread throughout UNIX, but in ways it should never have. I'm referring here to the misfortune that is debugging.
You see, on UNIX there's GDB. That's the debugger. That's the only debugger. It's very old and has had a lot of work put into it, and as a result it usually works pretty well, at least in terms of functionality. But on every other metric you measure software by, it kinda sucks.
Despite every computer made in the past 40 years having a graphical display, GDB lives in a parallel universe where the framebuffer was never invented and we all still use teletype printers. Teletypes work well enough for getting output, but for interactive programs the model just falls apart.
And so people who didn't want to deal with the pain of GDB invented the "GDB Wrapper" -- a separate piece of software that would show a nice user interface, but internally would call GDB to do the work.
We're not talking about calling out to a library here. We're talking about actually launching an instance of GDB, passing it commands, and parsing the results it prints out. And this is where we get led down a dangerous path.
APIs are hard to begin with. Good API design is very much an art, and it takes a lot of experience to come up with good ones. And the reason so many APIs are bad isn't because someone designed a bad API -- it's that they didn't even realize they were designing an API to begin with.
So much of our software world now is filled with wrappers -- programs that don't actually do the thing themselves, but 'outsource' their work to other programs. It's a stack of layers, and it's not a nice clean stack. I remember something Jeff Roberts once said to me -- the layers grind against each other, and you can feel each one chipping bits away as they collide.
I had the utter delight a few months back of trying to debug something using Qt Creator, except I couldn't. One morning it just suddenly refused to start debugging programs. It wouldn't say why, of course; it would just sit there doing nothing.
Nothing had changed, at least so I thought. So why the failure? It turned out, after some experimentation, to be because Microsoft's symbol servers were down. That's right, a remote failure on someone else's part meant I couldn't debug locally.
Now of course errors happen in life, and are to be expected. But I didn't get an error. I didn't get a warning. I didn't get anything, except an unresponsive UI. And the reason for this, I think, is precisely because the wrapper wasn't fully aware of all the facts.
Jeff Goldblum said it best in that famous scene from Jurassic Park:
The problem with the scientific power you've used is it didn't require any discipline to attain it. You read what others had done and you took the next step. You didn't earn the knowledge yourselves, so you don't take the responsibility for it. You stood on the shoulders of geniuses to accomplish something as fast as you could, and before you knew what you had, you patented it, packaged it, slapped it on a plastic lunch box, and now you want to sell it.
We've seen it hundreds of times in all kinds of software. Functions that return bool instead of an error code. Where did the precise error vanish to? Poof, it's gone! What used to be a useful error message becomes false, and if you're lucky you'll get a generic "Unexpected error" appearing on screen. And that's if your program is using a library. If it's calling out to a command-line worker, the most likely case is the error won't get checked at all; it'll just get printed into a log file you'll never find, never to be seen again.
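To make that concrete, here's a minimal sketch of what gets lost at that boundary. This is a hypothetical example of my own, not code from any particular library:

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* The anti-pattern: the precise cause of failure is erased at the boundary. */
    static bool save_file_bool(const char *path, const char *data)
    {
        FILE *f = fopen(path, "w");
        if (!f) return false;    /* Was it ENOENT? EACCES? ENOSPC? Poof, it's gone. */
        bool ok = fputs(data, f) >= 0;
        fclose(f);
        return ok;
    }

    /* Keeping the cause: return 0 on success, or an errno value on failure,
       so the caller can still report *what* went wrong. */
    static int save_file_err(const char *path, const char *data)
    {
        FILE *f = fopen(path, "w");
        if (!f) return errno;
        int err = (fputs(data, f) < 0) ? errno : 0;
        fclose(f);
        return err;
    }

    int main(void)
    {
        if (!save_file_bool("/no/such/dir/out.txt", "hello"))
            fprintf(stderr, "bool version: save failed... but why?\n");

        int err = save_file_err("/no/such/dir/out.txt", "hello");
        if (err)
            fprintf(stderr, "errno version: save failed: %s\n", strerror(err));
        return 0;
    }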
Or networking software that just sits there spinning a cursor when something goes wrong. So much user-facing network software is built on top of other programs, like rsync, and when those things fail it simply doesn't know what to do. And so much of the problem is precisely because they're not using them as libraries; they're using them as command-line utilities. They're building on things that have ill-defined interfaces to begin with, and because it's all based on outputting text, the programmers think they can just look at examples of the output and figure out an API from that.
There's a quote from the great Douglas Adams which I'm sure I've used many times before, but it's just so incredibly apt for most of today's software:
The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at and repair.
You've all seen those cheap Chinese toys that look like a PlayStation, but inside it's just a 6502 and 50 NES games. It's all fake, it's an illusion. It's a nice plastic finish on the outside, but if you were to open it up there's nothing in there. It's a rotten core, wrapped in layers of opaque complexity.
We're making systems that are fragile, because they're glued on rather than bolted together. We're wrapping complex things up in wrappers that don't take the same responsibilities as the things they rely on. Like Homer's pecking bird in The Simpsons, they work just fine while everything is as expected, but the slightest change in situation and everything breaks.
Little Lightmap Tricks October 10th, 2017
Just a quick post today to write down some lightmap lessons I've learned over the years, inspired by Ignacio Castaño's post on his iOS optimizations for The Witness. Many years ago I helped a little on the Wii port of "Call Of Duty: Modern Warfare", and his post reminded me of the "fun" I had rewriting the lightmap system to fit into that. So I thought I'd write up some of the little tips and tricks I picked up along the way. Nothing special here, just some common mistakes to avoid.
Don't put gaps in!
It's surprisingly common to see people generating lightmaps with empty pixels in-between each chart. You don't need to do this! Put them right up against each other without any gaps. If you're worried that maybe you needed the black pixel to stop the charts bleeding into each other, then you're calculating your UVs wrong.
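For what it's worth, the usual culprit is sampling at texel edges instead of texel centers. Here's a minimal sketch of the idea (my own illustration, not code from any particular baker):

    /* With bilinear filtering, map chart geometry so its edges land on texel
       *centers*, not texel edges; then samples never blend across the chart
       boundary and no black padding pixel is needed. */
    void texel_center_uv(int x, int y, int w, int h, float *u, float *v)
    {
        *u = ((float)x + 0.5f) / (float)w;
        *v = ((float)y + 0.5f) / (float)h;
    }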
Squish blocks down
If all the pixels in the chart are exactly the same color (or near-enough), you don't need to waste space storing all of them. Just shrink the entire chart down to a single pixel. And furthermore (see below), use the same single pixel for all of the shrunken charts.
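A sketch of the "near-enough" test, assuming single-channel charts and a hypothetical tolerance parameter:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* True if every pixel is within `tolerance` levels of the first one,
       in which case the whole chart can collapse to a single shared texel. */
    bool chart_is_uniform(const uint8_t *pixels, int count, int tolerance)
    {
        for (int i = 1; i < count; i++)
            if (abs((int)pixels[i] - (int)pixels[0]) > tolerance)
                return false;
        return true;
    }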
Share identical charts
You might imagine that the majority of a lightmap's space is taken up by nice big chunky pieces, like open terrain areas, or the sides of buildings. But in fact, you'll find that probably the majority of the lightmap space is occupied by a thousand little tiny shards of rubbish. Many of these only occupy a single 2x2 or 3x3 block on the lightmap. Now if you think about it, there's only so many 2x2 blocks that can exist in the world. So:
For each chart, search for all previous charts that are the same size and have the same contents (within a given pixel error). If you find any, throw the new chart away and simply re-use the UVs of the old one.
You might even find that this doesn't just work for little 2x2 blocks. If there's any instanced geometry in the level that's facing the same direction, then they'll often have identical lightmaps too. One example would be the side of an apartment block, with many balconies. Because the sun is a directional light, each balcony will have the same shadow cast onto it. So you can get chart re-use in a lot more places than you'd think.
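The comparison step might look something like this. This is a sketch: the Chart struct and tolerance handling are illustrative, and a real implementation would hash charts first rather than compare every pair:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        int w, h;
        const uint8_t *pixels;   /* w*h single-channel texels */
    } Chart;

    /* Same size, and every texel within `tolerance` levels? Then the new
       chart can be thrown away and the old chart's UVs re-used. */
    bool charts_match(const Chart *a, const Chart *b, int tolerance)
    {
        if (a->w != b->w || a->h != b->h)
            return false;
        for (int i = 0; i < a->w * a->h; i++)
            if (abs((int)a->pixels[i] - (int)b->pixels[i]) > tolerance)
                return false;
        return true;
    }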
Don't ruin your block compression
You can get a big benefit by using a block compression scheme for your texture (DXT/BC/etc). But don't just compress your texture without thinking first! DXT stores two colors for each 4x4 block. The pixels you didn't write to on the lightmap will be an empty black. Do you really want to waste one of those colors on storing black? Of course you don't.
For each 4x4 block, fill in the unused pixels with one of the other pixels from the same block (doesn't really matter which).
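In code, the pre-pass might look like this (a sketch: the `used` mask, marking which texels a chart actually covers, is assumed to come from your packer):

    #include <stdbool.h>
    #include <stdint.h>

    /* Before compressing a 4x4 block, overwrite unused texels with one that
       *is* used, so the DXT compressor doesn't waste an endpoint on black. */
    void fill_unused_texels(uint8_t rgb[16][3], const bool used[16])
    {
        int donor = -1;
        for (int i = 0; i < 16; i++)
            if (used[i]) { donor = i; break; }
        if (donor < 0) return;                 /* whole block empty; leave it */
        for (int i = 0; i < 16; i++)
            if (!used[i]) {
                rgb[i][0] = rgb[donor][0];
                rgb[i][1] = rgb[donor][1];
                rgb[i][2] = rgb[donor][2];
            }
    }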
One of the really cool things I did was to write a little visual debugger -- a small command-line parameter that would pop open an OpenGL window showing the results of the bake. You could fly around and inspect the results, and if you found a strangely black triangle somewhere, you could click on it, and the program would re-run the lighting for that triangle and automatically break into the debugger at the right location. I highly recommend this.
The scheme I used
I tried a lot of texture compression schemes out, but here's the one I finally settled on. Bear in mind we were super-tight on memory, so I didn't want to use anything more than 4-5 bits per pixel really.
Like its parent 360/PS3 versions, the Wii port of COD:MW uses a non-HDR lighting engine where the sun's shadow is stored off as a separate lightmap channel. This permits partial time-of-day changes and special effects like lightning flashes, and also lets the total lighting brightness exceed 1.0, giving some overbrightening. I really wanted to keep that scheme, but I didn't want to use any more memory.
The shadow term is stored at full resolution, in the red channel of a DXT1 texture. This consists of the shadow visibility multiplied by the N.L term. This is then blended at runtime with the actual sun color.
The remaining non-sun light is stored a little differently. I split the secondary light into its separate components -- luminance and color. A separate RGB565 texture stores the color, at quarter resolution. The luminance is stored at full resolution in the green channel of the full-resolution texture above. At runtime, we simply read both textures and multiply them together.
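Putting it together, the per-pixel combine works out to something like this. This is my own sketch of the math as described (names are illustrative, and on hardware this was of course done in the texture combiners, not in C):

    /* shadow_ndotl: red channel of the full-res texture (shadow * N.L)
       sec_lum:      green channel of the full-res texture
       sec_color:    sample from the quarter-res RGB565 color texture */
    void combine_lighting(float out[3],
                          float shadow_ndotl, float sec_lum,
                          const float sun_color[3],
                          const float sec_color[3])
    {
        for (int i = 0; i < 3; i++)
            out[i] = shadow_ndotl * sun_color[i] + sec_lum * sec_color[i];
    }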
Now, you have to be careful about this. DXT compression relies on correlation between the channels -- you can't just throw any old data into the separate RGB channels and expect it to compress well.
But it turns out that for our purposes, this works great. It tends to be that each area of the world only has one strong light affecting each point (either the primary sun or one of the secondary lights), so the DXT compressor is free to EITHER:
a) If the shadow term is mostly constant, focus its efforts on the secondary luminance, or
b) If the shadow term varies, focus on that and let the luminance suffer.
It tends to work out either way though. The human eye is drawn to brightness, so if the luminance gets bad compression then you're probably looking at an area that has strong, varying shadows, and that'll be the thing that stands out anyway. And of course, when you throw the diffuse texture on top it hides a lot of any remaining errors.
Because the quarter-res 16bpp texture has 1/16th the pixels of the high-res 4bpp texture, the total storage cost is effectively 5 bits per pixel (4 + 16/16 = 5).
Well there you go. Nothing that special, but it's nice to write these things down sometimes so they don't get lost. Of course, it's all voxel GI these days so I don't expect this to be of much use, but there it is.
Why Command And Vector Processors Rock September 7th, 2017
I had a Commodore Amiga as a kid. I'm told they were never especially popular in America, but in Europe they were everywhere. Well, sucks to be them I guess.
The Amiga was and still is, for its time, the best home computer ever made. It had a clean, powerful CPU architecture. It had an operating system that blended the best parts of CP/M and Unix with none of the unfriendly parts. It had 4-channel digital stereo sound playback in an era when a typical PC had a sound system that was either 0 or 1. It had proper multitasking, which took another ten years to finally arrive on Windows. It was also a completely open platform, unlike today's mobile, console, and store ecosystems.
But most importantly, it had Agnus.
Agnus is the name of the main chip inside the Amiga. Its primary role is for graphics, but not as you'd think. The Amiga was unusual compared to most other home computers of the time. A typical early-80s computer had a CPU (usually Z80, 6502, or later the 68000). It would also have a video chip, which would read from the framebuffer RAM and either output pixels directly, or look up the bytes in a tilemap and output that instead. And that was generally as far as it went.
The Amiga, however, wasn't satisfied with that. It had a CPU, sure. And it had a video chip ("Denise"), which read in bytes and spat out a video signal. But it didn't stop there. It had a custom-designed ASIC for each part of the machine. The entire hardware was built around these "custom chips" and the means to let them communicate.
Agnus is a kind of "ringmaster" chip. Its main component is the DMA controller (Direct Memory Access). This lets bytes be read from main memory and sent around to the various custom chips as needed. You can think of it as an asynchronous memcpy -- you give it an address and it'll either read or write bytes one at a time to/from the appropriate chip. It supported 25 different DMA channels at once for all the different parts of the machine that needed RAM access.
So what would you want to do with all this DMA? Let's look at one of the biggest examples -- the blitter.
The blitter was another part of the Agnus chip. Its operation was very simple. You'd give it three source pointers, one destination pointer, and a function ID. It would then read individual bits from memory (processing them 16 at a time), perform an arbitrary bitwise operation on them, and store the result out. You can think of it as a general-purpose bitwise arithmetic chip.
Given three input bits (let's call them A, B, and C), there are exactly 8 different combinations they can take. So in order to specify your arithmetic function, you just need a lookup table of 8 result bits, one per combination. This handily fits into a single byte.
Having this kind of bitwise arithmetic was important because, like many machines of the time, the Amiga used bitplanes as a format to store its graphics in.
Let's say you're using 32-color paletted mode. That's 5 bits per pixel you need to store. How do you store that? Well, you could use a byte per pixel, use 5 of the 8 bits to store your data and leave the other 3 empty. But that's a hell of a waste. Instead, you store it as 5 individual bitplanes, each plane using one bit per pixel. (i.e. each byte contains one bit from eight different pixels)
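To make the layout concrete, here's a sketch of how you'd read one pixel's 5-bit color back out of planar memory (plain C, assuming each plane is a separate row-major array):

    #include <stdint.h>

    /* Fetch the palette index of pixel (x, y) from 5 bitplanes, where each
       byte holds one bit from eight adjacent pixels (MSB = leftmost). */
    uint8_t read_planar_pixel(const uint8_t *planes[5],
                              int pitch_bytes, int x, int y)
    {
        int byte = y * pitch_bytes + (x >> 3);
        int bit  = 7 - (x & 7);
        uint8_t color = 0;
        for (int p = 0; p < 5; p++)
            color |= (uint8_t)(((planes[p][byte] >> bit) & 1) << p);
        return color;   /* 0..31 palette index */
    }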
Now let's apply our blitter to this. Imagine we want to draw a sprite on-screen. We've got it stored as 5 individual bitplanes, plus a sixth 'mask' bitplane to store the transparency. To get the blitter to draw this for us, we set up our three inputs:
A - The position on the framebuffer we want it at
B - Our sprite source data (1st bitplane)
C - Our sprite transparency mask
We'll need to repeat the whole thing a total of 5 times, once for each sprite bitplane, but that's fine (it's real quick!). The final piece of the puzzle is how we specify how to mix these three inputs. We can build a Boolean truth table to handle it.
Our goal is to use the transparency bit (C) to select EITHER the background data A (if C=0), or the sprite data B (if C=1); i.e. D = C ? B : A. To figure out our function ID, we just list out all eight cases:
    A B C | D (output)
    0 0 0 | 0
    0 0 1 | 0
    0 1 0 | 0
    0 1 1 | 1
    1 0 0 | 1
    1 0 1 | 0
    1 1 0 | 1
    1 1 1 | 1
If we concatenate all the bits of D together (reading from the bottom row up, so A=B=C=1 gives the top bit), we get the value 0xD8. This is called a minterm, and it represents our bitwise operation in its entirety.
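Here's a small C model of that logic (a sketch of the minterm idea, not actual Amiga register code): for each of the 8 input combinations whose bit is set in the minterm, OR in that product term.

    #include <stdint.h>

    /* Apply an 8-bit minterm to three 16-bit source words, exactly as the
       truth table above describes. Bit i of the minterm is the output for
       the input combination (A<<2)|(B<<1)|C = i. */
    uint16_t blit_minterm(uint8_t minterm, uint16_t a, uint16_t b, uint16_t c)
    {
        uint16_t d = 0;
        for (int i = 0; i < 8; i++) {
            if (minterm & (1 << i)) {
                uint16_t ta = (i & 4) ? a : (uint16_t)~a;
                uint16_t tb = (i & 2) ? b : (uint16_t)~b;
                uint16_t tc = (i & 1) ? c : (uint16_t)~c;
                d |= ta & tb & tc;
            }
        }
        return d;
    }

    /* blit_minterm(0xD8, bg, sprite, mask) computes (mask ? sprite : bg)
       independently for all 16 bits; 0x00 clears, 0x3C is A XOR B, etc. */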
This "minterm" idea is a pretty powerful one. You can combine elements together to get any bitwise function you like. Want to XOR images? Sure. Want to just clear memory? That works too, just set the ID to 0x00 and ignore all the inputs. You'll still occasionally see systems that use this. Windows, for example, still uses it for its BitBlt function, although you'd never know that from reading the BitBlt documentation.
To actually program the blitter to do this, we simply write the three source addresses into three of its registers, write the function ID to another register, and then signal it to start. It'll run in the background while our CPU gets on with other things, and we can either check a flag to see if it's finished, or get it to wake the CPU with an interrupt.
So far we've seen how Agnus contains the blitter functionality, and the DMA controller. But there's one more little secret hidden inside this chip, and that's the co-processor (aka "COPR", or simply "copper" to its friends).
The copper was a completely independent CPU that ran in parallel with the main one. It wasn't a Turing-complete, general-purpose CPU like the 68000. It only had three instructions. It didn't have its own memory or registers, but instead shared main memory (like everything else on the Amiga), and it could directly access many of the registers inside the custom chips.
The copper read its instructions via DMA. This meant you allocated some memory and filled in a program, called a "copper list", by writing instructions into it (each one a pair of 16-bit words). You then pointed the DMA at that address and started the program. The DMA would fetch each instruction and feed it into the copper.
So what could you do with a CPU that only has three instructions? Let's see what the instructions were:
    MOVE reg, value
    WAIT X, Y
    SKIP X, Y
That's a pretty simple machine. We can load a value into a register, wait for the raster beam to hit a specific X/Y position, or skip the next instruction if the raster beam is past a specific X/Y position. Doesn't sound like much at first. If I wanted to write registers I could just do it on the main CPU, right? Why would I want to wait to write registers at a specific time? But there's some surprising effects you can get out of this simple mechanism.
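For example, here's roughly what a tiny copper list looks like in memory. This is a sketch written from memory, so treat the exact bit encodings as approximate: it waits for the beam to reach line 100, then changes the background color register.

    #include <stdint.h>

    /* A copper list is just an array of 16-bit word pairs in chip RAM.
       This one waits for raster line 100, then turns the background red. */
    static uint16_t copperlist[] = {
        0x6401, 0xFF00,   /* WAIT: vertical position 100, ignore horizontal */
        0x0180, 0x0F00,   /* MOVE: value $0F00 (red) -> COLOR00 ($dff180)   */
        0xFFFF, 0xFFFE,   /* WAIT for an impossible position: end of list   */
    };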
What the copper let you do is to apply different properties to different parts of the screen. You can change the address of the framebuffer on a line-by-line basis, for example, to create the parallax scrolling in Shadow Of The Beast.
Or you can change the address of the framebuffer at the halfway point on each line, to create a 2-player scrolling split-screen, like in Lemmings. (No other port of Lemmings could do this!)
Or you could wait till a certain line and change both the color palette AND the framebuffer address, to create a wobbly water effect, like in Ugh.
And this wasn't just a trick for games; even the operating system made use of this. The Amiga allowed programs, if they chose to, to have their own private screen with a different resolution. Windows suffered for years with this problem -- ever switch back from a DirectX program and see all your icons have moved around on the desktop? Not in Amiga-land. Here, two different programs on different screens can co-exist, just by dragging the menu bar down.
These aren't just different windows you're seeing here. These are different screens, each running at a different resolution! The lower screen is 640x256 at 4 colors and the upper screen is 320x256 at 32 colors. Try that on a PC.
All these effects, and more, are achieved via the simple ability to change settings when you want to, rather than having them fixed at the start of the frame. It didn't require more power to be added to the system, just the flexibility to use the existing system in unusual ways.
If you want to see more creative uses like this, try the excellent codetapper.com which takes apart many Amiga games to see how they do things.
Hardware as a tool
The reason I'm writing all this isn't just to show off how cool the Amiga was. I want to show how its design principles allow new avenues to be opened up.
The Amiga hardware never said "this tool is for this purpose". It gave you a toolbox but let you decide what these things were to be used for. And it allowed each tool to interoperate with the others using common registers and common data formats.
I've presented the blitter here as a thing for processing graphics bitplanes, but it was really just a vector coprocessor for operating on boolean/bitwise data. It could be used for other tasks, and it was. The Amiga's floppy disks were formatted using MFM encoding, which is a kind of edge-based binary encoding. To decode it, you had to process the bit array from disk and look for 0-1 transitions. The blitter was an ideal tool for doing this, and the OS used it for exactly that -- the same kind of task we might use a compute kernel for today, perhaps.
The copper, while seeming to be a very simple processor, effectively acted as an amplifier for the power contained within the other chips. It could be viewed perhaps as a metaprocessor -- not doing the work itself but controlling the work of others.
This combination of a vector processor and a control chip is a powerful one. It's so powerful in fact, that the machine you're reading this on now has the same architecture. A modern GPU consists of three parts:
Part a) is a thing that can draw triangles. There's usually special-purpose hardware for doing this. There was a time a few years ago when this was what we thought of as the GPU, but we're seeing less and less of that every year. Games now are doing voxel ray-tracing, and people are using GPUs for lots of things other than just rendering.
Part b) is the vector processor, a unit that reads data and runs functions using it. Ours are much more powerful than the old blitter though. We can do full floating-point operations on ours, not just bitwise ops. But it's a more advanced version of the same principle -- a program that operates on many values at once rather than just one.
Part c) is the command processor. A modern GPU has a chip that reads instructions from the host CPU, decodes the various draw calls, state changes, etc, and then issues work to the vector processor (for compute kernels). Or, when using rendering APIs, it sends work to the triangle drawer which in turn sends work to the vector processor (either to shade vertices or pixels).
Right now we're a little stuck, however. A modern GPU lets you use its triangle drawer (via OpenGL perhaps), and it lets you use its vector processor (via CUDA perhaps). But the one thing it does not do, on almost any platform (even most consoles), is to let you use the command processor. About the only one I've ever seen that did give you that kind of access was the PlayStation 2, something I'll no doubt write about in a future article.
You see, the Amiga documented its command processor. The designers wanted you to write programs that ran on it. They wanted you to use it for doing all sorts of clever things. They recognized that the ability to drive the underlying horsepower directly could amplify the capabilities of a system way past the limits of its original design.
But on Direct3D, or OpenGL, all you can do is call DrawIndexedPrimitive etc. and let it do things on your behalf. You can't build your own copper lists like you used to on the Amiga. Some APIs let you make a command buffer, but they're usually just recording API calls into it. You can't program it with your own logic, or your own algorithms. The 3D driver has the power to do this, but you don't.
The Amiga was a good machine not because of what it was designed to do but because the designers intentionally gave you the flexibility to do things they'd never designed it to do.
The old COPR chip only had three instructions and couldn't do much by itself, but you could use it to make the rest of the system sing. I'm sure the command processors in modern desktops are far more advanced -- I'd love to see what we could do with them given the chance.
Written by Richard Mitton,
software engineer and travelling wizard.
Follow me on twitter: http://twitter.com/grumpygiant