20150624

AMD Fury X (aka Fiji) is a Beast of a GPU Compute Platform

Compute Raw Specs
Raw specs from Wikipedia, adjusted to per-millisecond rates: comparing what the two vendors built at around 600 mm^2 on 28 nm,

AMD FURY X: 8.6 GFlop/ms, 0.5 GB/ms, 0.27 GTex/ms
NV TITAN X: 6.1 GFlop/ms, 0.3 GB/ms, 0.19 GTex/ms


Or the same numbers in operations per pixel at 1920x1080 at 60 Hz,

AMD FURY X: 69 KFlop/pix, 4.0 KB/pix, 2.2 KTex/pix
NV TITAN X: 49 KFlop/pix, 2.4 KB/pix, 1.5 KTex/pix


Think about what is possible with 69 thousand flops per pixel per frame.
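
As a sanity check, here is a tiny C sketch of that per-pixel arithmetic (just the Fury X raw rates quoted above divided by the 1080p60 pixel rate):

#include <stdio.h>

int main(void) {
    double pixPerSec = 1920.0 * 1080.0 * 60.0;  /* ~124.4 Mpix/s */
    /* Fury X raw specs from above: 8.6 GFlop/ms, 0.5 GB/ms, 0.27 GTex/ms */
    printf("%.0f KFlop/pix, %.1f KB/pix, %.1f KTex/pix\n",
        8.6e12  / pixPerSec / 1e3,
        0.5e12  / pixPerSec / 1e3,
        0.27e12 / pixPerSec / 1e3);
    return 0;
}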

HBM
HBM definitely represents the future of bandwidth scaling for GPUs: a change which brings the memory clocks down and bus width up (512 bytes wide on Fury X vs 48 bytes wide on Titan X). This will have side effects on ideal algorithm design: ideal access granularity gets larger. Things like random access global atomics and random access 16-byte vector load/store operations become much less interesting (bad idea before, worse idea now). Working in LDS with shared atomics, staying in cache, etc, becomes more rewarding.

20150623

BenQ XL2730Z Blur Reduction vs CRT

TFT Central Review for the BenQ XL2730Z

The BenQ XL2730Z is one of the best 2560x1440 LCDs on the market. This display has a blur reduction mode which enables a strobe back-light (which only works properly at fixed 120Hz or 144Hz refresh rates). Its sharpness control also allows reducing sharpness below neutral, which can cut out a lot of the hard, exact square-pixel LCD feel, leaving an image which is more pleasing to look at. And it provides an exact pixel mode which, instead of enlarging lower resolution modes, keeps a 1:1 mapping of input to output pixels and draws the outside borders in black. So 1920x1080 still looks good, it just doesn't fully fill the screen. Compared to other LCDs this display is awesome, and running with a 1000Hz gaming mouse at 144Hz with a strobe back-light is a wonderful experience compared to what most people are used to.

Compared to CRTs?
CRTs still easily surpass even these top-of-the-line LCDs in quality. One of the core differences is that even the best LCDs cannot transition pixels fast enough for a true low persistence display. The BenQ for instance has frame cross-talk (seeing part of one or more other frames). At the default "Area" setting (which controls the vertical position of minimal cross-talk), at 144Hz, up to 2 extra frames are partly visible at the top and bottom of the screen, while toward the center just 1 extra frame is partly visible. I have a feeling this variable cross-talk comes from the fact that the back-light has to be a global strobe, but pixel rows are still scanned (changed) in the classic CRT scan order, so row-to-strobe timing is only optimal at one point on the screen. Likewise the LCD uses a column dither on the transition overdrive: even and odd pixel columns respectively over-shoot and under-shoot the transition, such that the 2-column average is closer to the correct signal. The minimum of 1 extra frame of cross-talk is a product of the inability to transition to the proper signal within one frame. This is quite distracting in comparison with the CRT. The CRT can easily offer low persistence at much lower frame rates without transition artifacts, while providing filtered pixels at lower resolutions, presenting a proper image instead of a lesson in cubism. So what would be 2560x1440 at 120Hz or 442 Mpix/s on the LCD still does not compare to the quality of 1600x1200 at 76Hz or 146 Mpix/s on the CRT. And the CRT frees up that 3x GPU performance for higher quality visuals.

Other Technology?
OLED isn't yet cost effective for consumer desktop displays. For example, for $26K it is possible to get a Sony BVME250A - 25" Trimaster EL OLED Master Monitor. Sony's pro OLEDs seem to max out at 75Hz, and I'm not sure they even have a global strobe.

Prysm sells Laser Phosphor Display (LPD) tiles: a CRT-like phosphor display driven by a laser instead of an electron beam. Each 4:3 aspect ratio tile is a near border-less 20"x15" (25" diagonal) with 320x240 or 427x320 resolution, scanned at 360 Hz according to their product sheet. I'm guessing these are also prohibitively expensive? If the 360 Hz scan rate is fixed, that would also be a serious problem: a 90Hz input would be quad strobed, and that won't look good at all.

20150604

Panasonic CT-34WX59 = Piece of Junk

Waiting for "Updating dependencies...", time to reflect on a fail while CRT hunting...
The majority of mass-market consumer "HDTV ready" CRTs closing out the CRT TV era scanned out 1080i using an aperture grille (and some a shadow mask) which could never resolve 1920 columns. Because of that they were effectively 960x540p displays, limited to something smaller by overscan. This is not a problem; this is exactly what I wanted. The CT-34WX59 is a 34" 16:9 CRT which can do 896x512 displayable at 60Hz via an HDMI cable and a custom modeline. I grabbed it for $10 on Craigslist. That was a 180 lbs mistake. I knew the display had a sharpen control; I didn't know Panasonic's idea of sharpen=0 adds ringing to everything: they don't support fully turning sharpening off. Some Panasonic displays don't have a Scan Velocity Modulation (SVM) disable, so just in case I opened up the TV and disconnected the physical connector which I believed was for the SVM feature: didn't fix it. Attempted to find an override in the service menu: nothing there. Did some more testing: 1920x1080i, 1920x540p, and 960x540p based modelines all get the same over-sharpening effect. Effectively this display was garbage even when new; avoid Panasonic when CRT hunting...

20150525

OS Project : 7 - PS/2 and Misc

Just started back up again after a long break. Finished up the PS/2 driver. Simplified the keyboard interface to a single 64-bit bitmap in memory and a simple polling function.

   __0 __1 __2 __3 __4 __5 __6 __7 __8 __9 __A __B __C __D __E __F
0| 0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
1| G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V
2| W   X   Y   Z   `   -   =   [   ]   \   ;   '   ,   .   /   SPC
3| LF  RT  UP  DN  HOM END PUP PDN INS DEL BS  TAB RET SHF CTL ALT

Key down sets a bit. Key release clears bits for {SHF, CTL, ALT} only. It is up to the application to clear bits otherwise. Ended up being a good design choice as keys like '\' immediately generate release commands even when held down on my keyboard. Key bitmap organized to make editor usage (like key-to-hex conversions) simple. Keys limited to what I expect to use, and supporting everything on typical PC arcade controllers.
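
For illustration, a minimal C sketch of what using that interface might look like (the names kbd_bitmap, key_down, and key_clear are hypothetical; bit indexes follow the table above):

#include <stdint.h>

enum { KEY_A = 0x0A, KEY_SPC = 0x2F, KEY_SHF = 0x3D,
       KEY_CTL = 0x3E, KEY_ALT = 0x3F };

/* One bit per key; the driver sets a bit on key down, and clears
   bits on release for {SHF, CTL, ALT} only. */
extern volatile uint64_t kbd_bitmap;

static int key_down(int key) { return (int)((kbd_bitmap >> key) & 1); }

/* Non-modifier keys stay set until the application clears them. */
static void key_clear(int key) { kbd_bitmap &= ~(1ull << key); }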

Multitasking
Humans and modern computers share a common thread: neither is any good at multitasking. The context of multitasking in this post is not async IO, but rather running multiple applications at the same time on the same CPU core. Suspect that the needs of multitasking change quite a bit at the point where nearly everything on the machine is perceptually instantaneous: as it would be once the 30 years of 40% yearly compounding complexity debt is removed. SSDs have throughput similar to last-level cache on early P4 era PCs. Logical end point: CPU cores owned by the applications themselves, no single-core multitasking, no preemption. The app uses power control to drop its CPU core to the lowest power state required to sustain a hard real-time user interface.

20150519

Voxelnauts

Kickstarter Link: Voxelnauts

"Voxelnauts: one universe, infinite worlds, with possibilities limited only by your imagination. Your adventure awaits - come build it"

As a young kid I grew up building things out of legos. Had one box which collected all the legos from a few sets. Would set out to build things and imagine what they could be in a virtual world. Voxelnauts looks to be the modern realization of that, but done in a way which was impossible 30+ years ago: the ability to feel the presence of what you build at massive real-life scale, to share that with others, and to interact in a persistent world...

20150509

OS Project : 6 - Hashing

Been a while since I had any time to work on this project. Renamed as I've done some major design changes...

Axed
Dropping the source-less programming part. Source-less worked great for x86 programming; however, there are a few things I didn't like about it. Little issues like having 2 words for each common forth construct, like "CALL WORD" instead of just "WORD". Larger issues being the requirement to have another editor to generate complex data or non-x86 machine code.

From the Ashes
Pulled the old work into a new bootloader. This time rapid prototyping using just NASM. Eventually I'll have to rewrite the bootloader in the source/language I end up creating. Switched back to 64-bit long mode (see why below). Running with 1GB pages since long mode forces page tables.

Re-Thinking The Forth Dictionary
Time off from this project brought forth some new ideas addressing the original motivation for going source-less: (a.) don't like linear searches at interpreter-time through large dictionaries (simple forth implementation), (b.) don't like how hashing (complex forth implementation) takes a bunch of ALU-time, (c.) really don't like how dynamic hashing can leave 75% of the cache unused, (d.) and still want to have full 32-bit values in source without multiple source tokens.

Solving (d.) is easy, just switch to 64-bit source tokens. Source is now 2x the size, but 32-bit immediates are trivial, and words can now use 10 characters instead of 5. Source is a prefetchable linear read, so the "more memory for more simple" trade is worth it to me.

Now Can Hashing be Made Better?
Solving (b.): how about factoring everything in the hashing algorithm except the "AND" operation into the string encoding itself? The tokens encode strings as 60-bit integers. Just need to switch from a generic 6 bits per character to something where the characters are encoded in a way that they affect all bits.

One thing I tried is to use a reversible cellular automaton (CA) like Wolfram Rule 30R. So strings in tokens are kept in memory in the form produced after N steps of applying the CA. To display the string, the editor simply has to reverse the CA process.
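
A sketch of what one 30R step could look like on a 60-bit token (this is the standard second-order reversible construction; the wrap-around boundary is my assumption):

#include <stdint.h>

#define MASK ((1ull << 60) - 1)

/* Wolfram rule 30 on a 60-bit ring: new = left XOR (center OR right). */
static uint64_t rule30(uint64_t x) {
    uint64_t l = ((x << 1) | (x >> 59)) & MASK;
    uint64_t r = ((x >> 1) | (x << 59)) & MASK;
    return (l ^ (x | r)) & MASK;
}

/* Second-order reversible step: next = rule30(cur) XOR prev.
   Running the same update with the pair swapped walks backwards,
   which is what the editor does to recover the string. */
static void step30R(uint64_t *prev, uint64_t *cur) {
    uint64_t next = rule30(*cur) ^ *prev;
    *prev = *cur;
    *cur  = next;
}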

To test the theory I first grabbed an English dictionary, then hashed to {256, 512, 1K, 2K, ..., 512K, 1M} entry tables, and tracked the {min, max, average} table bin counts. Comparing the reversible CA to just the 64-bit finalizer in MurmurHash3 (as a quality baseline): both have similar results.
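
For reference, a sketch of that test harness (not the original code); fmix64 is MurmurHash3's published 64-bit finalizer, and tokens are assumed to arrive pre-encoded, one hex value per line:

#include <stdint.h>
#include <stdio.h>

/* MurmurHash3 64-bit finalizer. */
static uint64_t fmix64(uint64_t k) {
    k ^= k >> 33; k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33; k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

#define BINS (1 << 8)   /* 256-entry table; repeat for 512 ... 1M */

int main(void) {
    static uint32_t bin[BINS];
    unsigned long long tok;
    while (scanf("%llx", &tok) == 1)
        bin[fmix64((uint64_t)tok) & (BINS - 1)]++;
    uint32_t min = ~0u, max = 0; uint64_t sum = 0;
    for (int i = 0; i < BINS; i++) {
        if (bin[i] < min) min = bin[i];
        if (bin[i] > max) max = bin[i];
        sum += bin[i];
    }
    printf("min %u max %u avg %.2f\n", min, max, (double)sum / BINS);
    return 0;
}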

Also tested against a dictionary with all possible {1,2,3} character strings (including symbols and numbers). Results are again similar.

Second I tried a recursive interval encoding: each character is encoded in the interval of the parent character. The first character starts with the full 0 to 2^60-1 interval. Character 63 is a special terminal character (signals no more characters are encoded). Character 63 gets a smaller interval, allowing the other characters to get an extended interval designed to spread out the value when ANDed to a power of two. This works similarly in theory to arithmetic encoding, except with constant probabilities. Multipliers used to encode each character (starting with the 1st character and going to the last),

64*64*64*64*64*64*64*64*64 +63*63*63*63*63*63*63*63,
64*64*64*64*64*64*64*64 +63*63*63*63*63*63*63,
64*64*64*64*64*64*64 +63*63*63*63*63*63,
64*64*64*64*64*64 +63*63*63*63*63,
64*64*64*64*64 +63*63*63*63,
64*64*64*64 +63*63*63,
64*64*64 +63*63,
64*64 +63,
64 +1,
1,

This evenly distributes the effect of a character across all bits. The synthetic all {1,2,3} length string case performs better with this encoding. The English dictionary does not see much of an improvement.
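
A hedged C sketch of that encoder (the 6-bit character codes, names, and exact terminal handling are my assumptions; the multipliers are the table above):

#include <stdint.h>

static uint64_t pow_u64(uint64_t b, int e) {
    uint64_t r = 1;
    while (e-- > 0) r *= b;
    return r;
}

/* Multiplier for character position i (0-based): positions 0..8 use
   64^(9-i) + 63^(8-i) (with 63^0 = 1), position 9 uses 1. */
static uint64_t mult(int i) {
    return (i == 9) ? 1 : pow_u64(64, 9 - i) + pow_u64(63, 8 - i);
}

/* codes[] holds 6-bit character codes 0..62; 63 is the terminal. */
static uint64_t encode_token(const uint8_t *codes, int n) {
    uint64_t t = 0;
    int i;
    for (i = 0; i < n && i < 10; i++)
        t += codes[i] * mult(i);
    if (i < 10)
        t += 63ull * mult(i);   /* mark end of a short string */
    return t;
}

/* The entire hash is then just an AND against a power-of-two table. */
static uint32_t hash_token(uint64_t t, uint32_t tableSize) {
    return (uint32_t)(t & (tableSize - 1));
}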

A fourth option would be to just use arithmetic encoding, or predictive arithmetic encoding, on the string. I suspect this would have similar results on the English dictionary as the above fixed interval encoder had on the all-possible-strings case. However I didn't try it, because I don't have any great way to get symbol probabilities for source that does not exist yet, and current results are probably good enough.

Hash Cache Utilization
Solving (c.) hash cache utilization: how about a software hash cache. Source tends to have a small working set of words, even if the dictionary gets quite large. Now that hashing to different size tables is free via AND, I can split the hash table into two parts: a 256 entry table (or larger) which will effectively always be filled and in hardware cache, and a much larger table for software misses which still wastes 75% of the hardware cacheline. The software hash cache contains the following per entry,

{4-byte data, 4-byte larger table address, 8-byte string}

The larger table is inclusive. Eviction from the small hash table uses the stored larger-table address to avoid searching (or probing) for the right entry.
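
A sketch of how that could look in C (sizes and names are illustrative; collision handling in the larger table is elided):

#include <stdint.h>

/* 16-byte entry: 4 entries share one 64-byte hardware cacheline. */
typedef struct { uint32_t data; uint32_t big; uint64_t str; } Entry;

#define SMALL 256            /* hot table, effectively always in cache */
#define BIG   (256 * 1024)   /* inclusive larger table */

static Entry smallTab[SMALL];
static Entry bigTab[BIG];

static Entry *lookup(uint64_t tok) {
    Entry *s = &smallTab[tok & (SMALL - 1)];
    if (s->str == tok) return s;            /* fast path */
    uint32_t bi = (uint32_t)(tok & (BIG - 1));
    /* Write the victim back via its stored big-table address: the
       larger table is inclusive, so no search for its home slot.
       (str == 0 is treated as an empty slot in this sketch.) */
    if (s->str) bigTab[s->big] = *s;
    *s = bigTab[bi];                        /* fill from the big table */
    s->big = bi;
    return s;
}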

Moving On
This should have the right combination of being simple enough and fast enough to avoid triggering the "something is wrong" re-design impulse again.

The high-level view I have of the OS is effectively an ultra-high-power micro-controller that boots into a forth editor like a C64 boots into BASIC...

20150426

Source-Less Programming : 5

Boot Loader Bring-up
Managed to get the boot loader done, which includes the following steps,

(1.) Move the stack seg:pointer (since next step overwrites it).
(2.) Use BIOS to read the other 62 512-byte sectors for the first track.
(3.) Use BIOS to switch to 80x50 text mode and load custom character glyphs.
(4.) Use BIOS to set EGA text palette to 0-15 with 0 for overscan.
(5.) Program VGA palette registers for those 16 colors.
(6.) Use BIOS to enable A20.
(7.) Turn off interrupts, and relocate the image's 63 sectors to zero.
(8.) Load zero entry IDT, minimal 3 entry GDT.
(9.) Enable protected mode and jump to the 3rd sector.

The 2nd 512-byte sector contains the 8x8 character bitmaps for the first 64 characters. The majority of the time was spent making a nice font, getting colors the way I wanted, and prototyping editor look and feel (without building it).

Didn't feel like fully hand assembling 16-bit x86 machine code for the boot loader, so I used NASM and hexdump to accelerate the process (to provide machine code I could pad out to 32-bit alignment). Also wrote a quick C based tool to bootstrap the process of building the loader: something which would enable me to easily build out an annotated image, and show a printout in the console of what I'd be seeing in the editor. Here is a shot of a bit of the scratch C code I used to make the font,

[image]

Here is a shot in QEMU of the loader displaying the font,

[image]

And another shot from QEMU showing the palette,

[image]

What the Current Annotated Image Looks Like
Below is a shot captured from the terminal window output of the C tool. I'm using 3 cache lines for the loader code.

[image]

Grey lines separate the 512-byte sectors. Memory address on the left in grey. Each pair of lines shows half an x86 cacheline. The blue-to-white text shows the 5 character/word annotation strings (now using the extra 2 bits of the label for color). The red hex shows the image data. Not using {GET,ABS,REL} tagged words in this part, so everything in the bootloader is just hand assembled 16-bit machine code, and this is not representative of what the rest of the system will look like. The rest of the system will have {GET opcode} followed by {HEX} or {ABS} for opcode immediates (easy to write). The 16-bit code is {HEX} mixed opcode and immediates, quite a bit different (hard to write).

Some hints on the annotations,

Everything is in base 16. AX is TOP so I don't bother with "A=9000" (which wouldn't fit anyway), instead I just write "9000" (the A= is implied). The "!" means store so "SSSP!" is storing TOP (or AX) into both SS and SP. The "B=200" means BX=200h. In this 16-bit x86 case I use 3E to pad out opcodes to 32-bit. The "X" = SI, "Y" = DI, "F" = BP.

Next Step
Groundwork is done. Next step is to bring up the opcode dictionary for {GET} words, then write a little IDE driver to load the rest of the image and to be able to save from the editor. After that, write the drawing code for the editor, then a mini PS/2 driver for the input, then write the editor input handling. Then I have a full OS ready to start on a real machine.

20150423

Source-Less Programming : 4

Still attempting to fully vet the design before the bootstrap reboot...

DAT words in the edit image need to maintain their source address in the live image. This way on reload the live data can be copied over, and persistent data gets saved to disk. DAT annotations no longer have 30 bits of free space; instead they have a live address. When the live address is zero, DAT words won't maintain live data. This way read-only data can be self-repairing (as long as the annotations don't get modified). Going to use a different color for read-only DAT words. New persistent-data DAT words will reference their edit-image hex value before reload (then get updated to the live address).

REL words always get changed on reload (self repairing). No need to keep the live address. REL is only used for relative-branching x86 opcodes. Don't expect to have any run-time (non-edit-time) self-modifying of relative branch addresses. Given that branching to a relative branch opcode immediate is not useful, the LABEL of a REL word is only useful as a comment.

GET words also get changed on reload (self repairing). GET is only designed for opcodes and labeled constants. GET words will often be LABELed as a named branch/call target. Been thinking about removing GET, and instead making a new self-annotating word (display searches for a LABELed DAT word with the same image value, then displays the LABEL instead of HEX). This opens up the implicit possibility of mis-annotations. Would be rare for opcodes given they are large 32-bit values. But for annotating things like data structure immediate offsets, this will be a problem (4 is the second word offset in any structure).

ABS words always get changed on reload (self repairing). ABS words are targets for self-modifying code/data, so they also need LABELs. Reset on reload presents a problem in that ABS cannot be used to setup persistent data unless that persistent data is constant or only built/changed in the editor. But this limitation makes sense in the context that ABS addresses in live data structures can get invalidated by moving stuff around in memory. The purpose of ABS is edit-time relinking.

Source-Less Programming : 3

Annotation Encoding
Refined from last post, two 32-bit annotation words per binary image word,

FEDCBA9876543210FEDCBA9876543210
================================
00EEEEEEDDDDDDCCCCCCBBBBBBAAAAAA - LABEL : 5 6-bit chr string ABCDE


FEDCBA9876543210FEDCBA9876543210
================================
..............................00 - DAT : hex data
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA01 - GET : get word from address A*4
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA10 - ABS : absolute address to A*4
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA11 - REL : relative address to A*4
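
A small C sketch of unpacking one annotation pair per this layout (struct and function names are illustrative, and the 6-bit-to-glyph mapping is left out):

#include <stdint.h>

typedef struct { uint32_t label; uint32_t note; } Annotation;

static void decode(Annotation a, uint8_t chr[5],
                   uint32_t *tag, uint32_t *addr) {
    for (int i = 0; i < 5; i++)     /* LABEL: chars A..E, low bits first */
        chr[i] = (a.label >> (6 * i)) & 0x3f;
    *tag  = a.note & 3;             /* 0=DAT 1=GET 2=ABS 3=REL */
    *addr = (a.note >> 2) * 4;      /* byte address = A*4 */
}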


Going to switch to just 2 lines per word displayed in the editor. Only DAT annotations show the hex value; other types show the LABEL of the referenced address in place of the hex value. So no need for an extra note. In practice will be using some amount of binary image memory to build up a dictionary of DAT words representing all the common somewhat forth-like opcodes, then GET words in the editor to build up source.

Need to redo the bootloader from floppy to hard drive usage, and switch even the bootloader's 16-bit x86 code to 32-bit aligned LABEL'ed words so the final editor can edit the bootloader. Previously I was avoiding manually assembling the 16-bit x86 code in the boot loader, but might as well ditch NASM and use something else to bootstrap everything.

20150422

Source-Less Programming : 2

Continuing with what will either be an insanely great or amazingly stupid project...

Making slow progress with bits of free-time after work; far enough into thinking through the full editor design to continue building. Decided to ditch 64-bit long mode for 32-bit protected mode. Not planning on using the CPU for much other than driving more parallel friendly hardware... so this is mostly a question of limiting complexity. Don't need 16 registers, and the REX prefix is too ugly for me to waste time on any more. The 32-bit mode uses the much more friendly mov reg,[imm32] absolute addressing, also with the ability to use [EBP+imm32] without an SIB byte (another thing I mostly avoid). Unfortunately still need relative addresses for branching. 32-bit protected mode thankfully doesn't require page tables, unlike 64-bit long mode. Can still pad out instructions to 32 bits via redundant segment selectors.

Source-Less Analog to Compile-Time Math?
Compile-time math is mostly for the purpose of self-documenting code: "const uint32_t willForgetHowICameUpWithThisNumber = startHere + 16 * sizeof(lemons);". The source-less analog is to write out the instructions to compute the value, execute that code at edit time, then have annotations for 32-bit data words which automatically pull from the result when building 32-bit words for opcode immediates in the new binary image.

Reduced Register Usage Via Self Modifying Code
Sure, it kills the trace cache in two ways; what do I care. Sometimes the easiest way to do something complex is to just modify the opcode immediates before calling into the function...

What Will Annotations Look Like?
The plan so far is for the editor to display a grid of 8x8 32-bit words. Each word is colored according to a tag annotation {data, absolute address, relative address, pull value}. Each word has two extra associated annotations {LABEL, NOTE}. Both are 5 6-bit character strings. Words in grid get drawn showing {LABEL, HEX VALUE, NOTE} as follows,

LABEL
00000000
NOTE


The LABEL provides a name for an address in memory (data or branch address). Words tagged with absolute or relative addresses or pull value show in the NOTE field the LABEL of the memory address they reference. Words tagged as data use NOTE to describe the opcode, or the immediate value. When inserting a NOTE the editor can grab the data value from other words with the same NOTE (so an opcode only needs to be manually assembled once). Edit-time insert of new words, delete of words, and moves of blocks of words all just relink the entire edit copy of the binary image. The ESC key updates a version number in the edit copy, which the executing copy sees, triggering it to replace itself with the edit copy.

Boot Strapping
I'm bootstrapping the editor in NASM in a way that I'll be able to see and edit later at run-time. This is a time consuming process to get started because instead of using NASM to assemble code, I need to manually write the machine code to get the 32-bit padded opcodes. Once enough of the editor is ready, I need a very tiny IDE/PATA driver to be able to store to the disk image. Then I can finish the rest of the editor in the editor. Then I'll also be self-hosted outside the emulator and running directly on an old PC with a non-USB keyboard, but with a proper PCIe slot...

Look No Triangles : Scatter vs Gather

There are a bunch of people working on and succeeding at non-triangle rendering. With GPU perf still climbing, IMO it is possible to return to the golden age of a different kind of software rendering: the kind done in a pipeline built out of compute shaders.

In my sphere tracing of math-based SDF fields I was purely ALU bound, tracing to the limit of floating point precision. The largest performance win was found by doing a many-level hierarchical trace (starting with very coarse grain empty space skipping). But the limit of all this is just a log reduction of the number of steps in the search; it still requires many search steps per pixel. And when doing a memory-based trace (instead of a math-based trace), the search is just a very long latency chain with divergent access patterns. Tracing via searching on the GPU hits a wall. To make matters worse, when tracing the ALU units are loaded up with search work instead of something useful.

The alternative to this is to switch to a mostly scatter based design. A large amount of the tree structure traversed each frame in a gather based approach is similar across frames. Why not just keep the tree stored mostly expanded in memory based on the needs of the view? Then expand or collapse the tree based on the new visibility needs of the next frame. Rendering is then a mostly scatter process which reads leaves of the tree once. Reads of memory can now be coherent, and ALU can be used for things more interesting than search. Scatter will be somewhat divergent, but that cost can be managed by loading up enough useful ALU work in parallel. There are a lot of ways to skin this. Nodes of the tree can be bricks. Bricks can be converted into little view-based depth sprites, then binned into tiles and composited. Seems as if bricks converted into triangle meshes and rasterized is the popular path now, but still using the CPU to feed everything. This could get much more interesting when the GPU is generating the cached geometry bricks: artistically controlled procedural volume generation...

20150421

From Scratch Bug 2 : Source-Less Programming

This is a disruptive idea which comes back periodically: source-less programming. Is it possible to efficiently program at a level even lower than an assembler?

The general idea is that the editor is now something similar to an enhanced hex editor which edits a binary image directly. Lowest memory is split into three parts {{running image, annotations for edit image}, edit image}. The {running image, annotations for edit image} is the boot image. The {edit image} is working space which replaces the {running image} via snapshot at a safe transition point. The "annotation" section is what enables human understanding of the binary image.

Words
One way to keep the system very simple is to always work in 32-bit words. A 32-bit word in memory is one of four things {data, opcode, absolute address, rip relative address}. Data is easily handled by the hex editor part. The annotation could provide a name for the data or a comment. Opcodes in x86 can be complex, but it is easy to simplify. Start with something similar to forth zero-operand and one-operand operations (calls, etc). Make all operations padded to 32-bit alignment (x86-64 can use the 2e ignored prefix to pad). A call for instance becomes {32-bit opcode for call, 32-bit rip relative branch address}. Or a global load becomes {32-bit opcode for load, 32-bit rip relative data address}. Annotation memory can provide a name for the opcode. Annotation can provide a tag for each word in memory which marks if the memory is a relative or absolute address (word gets a different color based on tag, similar to colorForth). Addresses can be auto-annotated by displaying the annotation associated with the destination. The editor works on the {edit image}, with insert and delete of words automatically adjusting all the words tagged as addresses (effectively relinking at edit time). The {edit image} also keeps a mapping of original {running image} addresses so that it is possible to view the live values of any data. The editor provides something to easily be able to write an annotation and have the associated opcode or address automatically get updated. For example, type the opcode name and the 32-bit value is automatically updated. Very simple and yet powerful minimal tool.
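
As a concrete byte-level example of such a padded call (my sketch of the scheme described above, not code from the project):

/* One 32-bit word holding CALL (three ignored 2e pad prefixes plus the
   e8 opcode), followed by one 32-bit word holding the rip relative
   immediate, which is the word the editor relinks on insert/delete. */
static const unsigned char callWords[8] = {
    0x2e, 0x2e, 0x2e, 0xe8,   /* pad, pad, pad, CALL rel32 */
    0x00, 0x00, 0x00, 0x00    /* rel32 (filled in by edit-time relink) */
};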