Demo Tubes: Parnassum & Monolith

Thoughts on the Evolution of Processor Design

Feels like the fundamental limiter in the evolution of processor design is the {load,alu,store} design paradigm: the separation of memory and ALU at all scales. A CPU is effectively like having billions of people each with a mailbox to store data, each routing the data to just one single person (out of billions) with a calculator doing computation. As CPUs have evolved, there has only been a tiny increase in the number of people with calculators. Then the GPU enters the timeline, providing a substantial increase in the number of people with calculators, but this increase is still relatively tiny with respect to the number of people routing data to and from the mailboxes. I'm wondering if perhaps all people should just have calculators. Looking at some numbers,

Chip Capacity in Flop per Clock per Transistor
Using numbers from Wikipedia for Fury X,

8601 Gflop/s
8900 Mtransistors
1050 MHz

Capacity for upwards of 8000 flops each clock, but with around 1 million transistors per flop.
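
As a sanity check, the arithmetic behind those two claims as a tiny C sketch (the constants are just the Wikipedia numbers quoted above; the function names are mine):

```c
// Flops retired per clock: (Gflop/s * 1e9) / (MHz * 1e6).
static double flop_per_clock(double gflops, double mhz) {
    return (gflops * 1.0e9) / (mhz * 1.0e6);
}

// Transistors spent per flop-per-clock of capacity.
static double transistors_per_flop(double mtransistors, double gflops, double mhz) {
    return (mtransistors * 1.0e6) / flop_per_clock(gflops, mhz);
}
```

For the Fury X numbers, flop_per_clock(8601.0, 1050.0) lands just under 8192, and transistors_per_flop(8900.0, 8601.0, 1050.0) comes out a touch over one million.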

Science Fiction Version of the Suggested Paradigm Shift
A completely science fiction version of the suggested paradigm shift might be a chip with 256 MB of memory, divided into 32 million 64-bit cells, with each cell doing one 64-bit uber-operation every 64 clocks (bit per clock), clocked at a relatively low clock rate like 500 Mhz: providing something like 250,000,000,000,000 uber-ops per second. The board composed of 3D stacks of these chips connected by TSVs, stacks connected by interposers (like HBM). Board might have something like 16 stacks, providing 4,000,000,000,000,000 uber-ops per second. The local parallel just-in-time compile step configures cells to work around bad cells, yield problems go away. The mindset used to program the machine is quite different. Data constantly flows around chip as it is filtered by computation. The organization of data constantly changing to adapt to locality of reference. Programs reconfigure parts of the chip at run-time to solve problems. Reconfigure of a cell is basically adjusting which neighborhood connections are inputs to the uber-op, and the properties of the uber-op.
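
The throughput claim in this thought experiment reduces to one expression; a sketch in C, with all numbers hypothetical by construction:

```c
// Hypothetical cell machine: memory split into 64-bit cells, each cell
// retiring one uber-op every 64 clocks (bit-serial, one bit per clock).
static double uber_ops_per_second(double memBytes, double clockHz) {
    double cells = memBytes / 8.0;      // one cell per 64-bit word
    return cells * (clockHz / 64.0);    // each cell: 1 op per 64 clocks
}
```

uber_ops_per_second(256.0e6, 500.0e6) gives the 250 trillion uber-ops per second quoted above for one chip, and 16 stacks scale that to the 4e15 figure.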


1536-3 : Simplify, Repeat

Night 3 on the 1536 project, decided to make some changes before going full ahead with writing an editor.

(1.) Switched the boot loader to not relocate to zero, leaving the BIOS areas alone. I doubt I'll ever go back to real mode after switching to long mode, so this ends up making the code easier, and it provides room for the next change.

(2.) Switched the boot loader to fetch the first 15 tracks instead of just 1 track. Now have a little over 472KB of source to work with on boot, which is effectively infinite for this project. The motivation for this change was the realization that x86-64 assembly source would get big. Don't want to change this later. 472KB is close to the maximum without checking for things like the EBDA, or handling non-full-track reads.

(3.) Switched to an easier source model. Lines are now always 64 characters long. Comments are switched to a single \ which functions like the C // style comment, ignoring the rest of the line. Since lines are always 64 bytes (cacheline sized and aligned), the interpreter can quickly skip over comments. This trades increased source size for simplification of the editor: fixed size lines make everything trivial.

(4.) Making a convention that syntax highlighting with color only has a line of context. Which translates into don't let things wrap. Easy.
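
A sketch of why the fixed 64-byte lines in (3.) make comment skipping trivial (hypothetical helper names; the real interpreter is assembly):

```c
#include <stddef.h>
#include <string.h>

#define LINE_BYTES 64  // lines are always 64 bytes, cacheline sized and aligned

// Interpretable characters on a line: everything before the first '\' is
// code, the rest of the line is comment. No newline scan is ever needed,
// because the next line is always exactly LINE_BYTES further on.
static size_t code_length(const char *line) {
    for (size_t i = 0; i < LINE_BYTES; i++)
        if (line[i] == '\\') return i;
    return LINE_BYTES;
}

static const char *next_line(const char *line) { return line + LINE_BYTES; }
```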

Star Wars Battle Pod Arcade Review : Save Your Tokens for Air Hockey

Went to the Cary, North Carolina Dave and Busters to try out the Star Wars Battle Pod a few days ago, after posting someone's youtube review ages ago. The arcade experience in the US has certainly changed since I was a youth. Nearly everything I loved as a kid is gone, with the exception of some classic physical games like air hockey, skee ball, pool, etc. The Battle Pod is a great example of how the spirit of the arcade is getting lost. Starting with the screen: it's a spherical projection screen where Dave and Busters had the awesome idea of keeping their card reader illuminated so strongly during gameplay that the screen black level was practically white: nearly impossible to see what was going on. That might have actually been a blessing, because whoever wrote the spherical projection code apparently figured out how to do something worse than bilinear filtering: it looks horrible. What is left is relatively low resolution which would be fine if properly filtered, except in this case the aliasing is so bad, I kept on getting the feeling that the only point to the card reader was to hand out a refund to pay for the player's eye pain. It gets better: the game hitches, and doesn't even feel like 30 Hz, let alone the 60 Hz which sets the minimum bar for frame rate in a real arcade game. The classic arcades were defined by perceptually zero latency input designed to take a beating, with locked frame rates at the highest rate possible on the display hardware, and stunning visuals pushing the limits of the hardware. Someone badly needs to bring that experience back...


CRT Shadow Masks vs LCD

Found these two images below (click for full resolution) on this thread http://www.vogons.org/viewtopic.php?t=32331 when looking for source photographs for CRT shadow masks. Great visual description of why CRT shadow masks were so good for image quality in comparison to modern LCDs. Sure, the LCD image is double scanned (2 scanlines per pixel), but the comparison would still hold even if it were not.


Stochastic 1 Sample/Pixel Lit Fog Stills

Found the old stochastic 1 sample/pixel lit fog stills. Left the post process grain on; it is smoother in practice. This could be an algorithm which only looks good in this demo, never really tried adjustments on other content...

General idea is to stochastically select a z value/pixel between the eye and the opaque backing z value based on the volume of material in between. In this demo I just used a very large sphere of volumetric stuff behind the center sphere. Each {z} point is shaded and "lit" (fake in the demo), and also has some opacity value. Then there is a separate spatial+temporal filter process which attempts to remove the noise from the extremely sparse volume sampling, and also correctly manage un-occlusions, etc. The volume is treated separately from the opaque layer and blended together before the final temporal noise reduction pass (scene is traced). The demo was running at 120Hz, and didn't ever look right at 60Hz. These temporal techniques are all about visual masking of artifacts in motion, so they tend to be highly tuned just to the point of perceptual artifacts at a given target frame rate. The one takeaway from this little project was to weight samples in the filter based on similarity of their backing opaque z value to the center backing opaque z value (instead of using shading z values). This tends to maintain an even gradient based on objects which are at a similar distance from the eye. Which is what one would expect in general for diffuse fog volumes.

Something I didn't try but would help here is to decouple volume density sampling (aka alpha value) from shaded color. Run alpha computation at a higher sampling rate, then mix together later...

Runs a spatial filter on {color,alpha} with 13 taps in the following pattern,

. . . . . . . . .
. . x . . . x . .
. . . . x . . . .
. . x . . . x . .
x . . . x . . . x
. . x . . . x . .
. . . . x . . . .
. . x . . . x . .
. . . . . . . . .

Pixel weights are "gaussian * f(sampleOpaqueBackingZ,centerOpaqueBackingZ)", where f(s,c) decreases weight as opaque z-buffer value becomes non-matching (filter intent is that fog tends to have similar effect when the opaque backing is at a similar distance away),

r = min(c,s)/max(c,s);
return (r*r)*(r*r);

Runs a second spatial filter with 13 taps in the following pattern,

. . . . . . . . .
. . . . . . . . .
. . . . x . . . .
. . . x x x . . .
. . x x x x x . .
. . . x x x . . .
. . . . x . . . .
. . . . . . . . .
. . . . . . . . .

Pixel weights are "gaussian * f(sampleOpaqueBackingZ,centerOpaqueBackingZ)", where f(s,c) does something similar,

r = 1.0/(1.0+abs(c-s)/min(s,c));

Cannot remember why these two spatial filter passes have different z based weighting functions. Turns out the temporal filter has another depth weighting function. They both have some fixes for when depths are zero which I didn't bother to copy in. The temporal filter reprojects 5 points in a packed + pattern. Want to use reprojected Z (project reprojected backing Z into the current frame). Does a neighborhood clamp, then has reprojection weights based on "gaussian * f(sampleOpaqueBackingZ,centerOpaqueBackingZ)", where f(s,c) does something similar but with a depth bias,

r = 1.0/(1.0+abs(c-s)/c);


1536-2 : Assembling From the Nothing

Started bringing up a limited subset x86-64 assembler. The full x86-64 opcode encoding space is an unfortunate beast of complexity which I'd like to avoid. So I did...

This prototype sticks to only exactly 4-byte or 8-byte instructions (8-byte only if the instruction contains a 32-bit immediate/displacement). The native x86-64 opcodes are prefix padded to fill the full 4-byte word. Given that x86-64 CPUs work in chunks of 16-bytes of instruction fetch, this makes it easy to maintain branch alignment visually in the code. Since x86-64 float opcodes are natively 4-bytes without the REX prefix, I'm self limiting to only 8 registers for this assembler, which is good enough for the intended usage. I'm not doing doubles and certainly not wasting time on vector instructions (have an attached GPU for that!). Supported opcode forms in classic Intel syntax,

op reg;
op reg,reg;
op reg,imm32;
op reg,[reg];
op reg,[reg+imm8];
op reg,[reg+imm32];
op reg,[imm32];
op reg,[reg+reg]; <- For LEA only.

This is a bloody ugly list which needed translation into some kind of naming in which "op" changes based on the form. I borrowed some forthisms: @ for load, ! for store. Then added ' for imm8, " for imm32, and # for RIP relative [imm32]. A 32-bit ADD and LEA end up with this mess of options (note . pushes a word value on the stack, so A. pushes 0 for EAX in this context, and , pushes a hex number, and / executes the opcode word which assembles the instruction to the current assembly write position),

A.B.+/ .......... add eax,ebx;
A.1234,"+/ ...... add eax,0x1234;
A.B.@+/ ......... add eax,[rbx];
A.B.12,'@+/ ..... add eax,[rbx+0x12];
A.B.1234,"@+/ ... add eax,[rbx+0x1234];
A.LABEL.#@+/ .... add eax,[LABEL]; <- RIP relative
A.B.12,'+=/ ..... lea eax,[rbx+0x12];
A.B.C.+=/ ....... lea eax,[rbx+rcx*1];

Then using L to expand from 32-bit operand to 64-bit operand,

A.B.L+/ .......... add rax,rbx;
A.1234,L"+/ ...... add rax,0x1234;
A.B.L@+/ ......... add rax,[rbx];
A.B.12,L'@+/ ..... add rax,[rbx+0x12];
A.B.1234,L"@+/ ... add rax,[rbx+0x1234];
A.LABEL.L#@+/ .... add rax,[LABEL]; <- RIP relative
A.B.12,L'+=/ ..... lea rax,[rbx+0x12];
A.B.C.L+=/ ....... lea rax,[rbx+rcx*1];
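
The prefix padding can be sketched in C. This is my reconstruction, not the post's code: it pads with repeated 3E bytes (DS segment override, meaningless in long mode), whereas the 0xC033403E example later in the post mixes 3E with an empty REX 40 (REX has to sit immediately before the opcode, so it can only ever be the last pad byte):

```c
#include <stdint.h>

// Pad a 1..3 byte native x86-64 encoding out to exactly 4 bytes with
// redundant 0x3E prefixes, returning the little-endian dword as it
// would sit in memory (pad bytes first, opcode bytes last).
static uint32_t pad_to_word(const uint8_t *op, int len) {
    uint8_t b[4] = {0x3E, 0x3E, 0x3E, 0x3E};
    for (int i = 0; i < len; i++) b[4 - len + i] = op[i];
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

So xor eax,eax (33 C0) becomes memory bytes 3E 3E 33 C0, i.e. the dword 0xC0333E3E.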

Source Example With Google Docs Mockup Syntax Highlighting
Font and colors are not what I'm going for, just enough to get to the next step. This is an expanded example which starts building up enough of an assembler to boot and clear the VGA text screen. Some of this got copied from older projects in which I used "X" instead of "L" to mark the 64-bit operand (just noticed I need to fix the shifts...). Currently I just copy from this to a text file which gets included into the boot loader on build.

From Nothing to Something
This starts by semi-self-documenting hand assembled x86 instructions via macros. So "YB8-L'![F87B8948,/]" reads like this,

(1.) Y.B.8-,L'! packed to a word name YB8-L'! with tag characters removed.
(2.) [ which starts the macro.
(3.) F87B8948 which is {48 (REX 64-bit operand), 89 (store version of MOV), 7B (modrm byte: rdi,[rbx+imm8]), F8 (-8)}, stored little-endian so the bytes execute in the order 48 89 7B F8.
(4.) , which pushes the number on the data stack.
(5.) / which after , executes the empty word, which pops the data stack and writes a 32-bit value to the asm position.
(6.) ] which ends the macro.

Later YB8-L'! with ; appended can be used to assemble that instruction by interpreting the macro.
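
To make the byte order concrete, a small C check of that hand-assembled word (decoding comments are mine):

```c
#include <stdint.h>

// The word F87B8948 stored little-endian yields instruction bytes
// 48 89 7B F8: REX.W, MOV r/m64,r64 (store form), modrm 7B
// (mod=01 disp8, reg=rdi, rm=rbx), disp8 0xF8 = -8.
// In Intel syntax: mov [rbx-8],rdi.
static void word_to_bytes(uint32_t w, uint8_t out[4]) {
    out[0] = (uint8_t)(w);         // 0x48
    out[1] = (uint8_t)(w >> 8);    // 0x89
    out[2] = (uint8_t)(w >> 16);   // 0x7B
    out[3] = (uint8_t)(w >> 24);   // 0xF8
}
```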

The first assembled words are $ which pushes the current assembly position on the stack, and $DRP (which is actually a bug which needs to be removed). The $! pops an address from the data stack, and stores the current assembly position to the given address. This is later used for instruction build macros which do things like PSH` where the ` results in the dictionary address for the PSH word to be placed on the data stack. The end game is getting to the point where given one of the opcode forms, it is possible to write the following to produce a function which compiles an opcode,


Which pushes the 4-byte opcode base 0xC033403E, then the opcode name ^ for XOR, then runs the _ macro which assembles this into:

MOV eax,0xC033403E;

Immediately afterwards it is possible to execute the ^ word (call it) and assemble an XOR instruction. The X86-RM expects to get the REG and RM operands from the data stack with base instruction opcode data in EAX.

Making a Mess to Clean Up
This about concludes the worst part of getting going from nothing, except for the PTSD dreams where people only speak in mixed hex and x86 machine code: FUCOM! REX DA TEST JO. When placed into final context there will be a few KB of source to build an assembler which covers all the functionality I need for the rest of the system. At this point I can easily add instructions and a few more of the opcode forms as they are needed. And it becomes very easy to write assembly like this,

A.A.^/ Y.B8000,"/ C.1F40,"/ L!REP/

Which is this in Intel syntax,

xor eax,eax; <- set eax to zero
mov edi,0xB8000; <- VGA text memory start address
mov ecx,0x1F40; <- 80x50 times two bytes per character
rep stosq; <- using old CISC style slow form to "do:mov [rdi],rax;add rdi,8;dec rcx;jnz do;"
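
A tiny C model of what that rep store form does (semantics only; the variable names just mirror the registers involved):

```c
#include <stdint.h>

// Equivalent of "rep stosq": store rax to [rdi], advance rdi by 8 bytes,
// decrement rcx, repeat until rcx hits zero.
static void rep_stosq(uint64_t *rdi, uint64_t rax, uint64_t rcx) {
    while (rcx--) *rdi++ = rax;
}
```

Clearing the VGA text screen is then just this applied at 0xB8000 with rax zero.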


1536-1 : The Programmer Addiction = Feedback

Continuing on the 1536-byte loader based system. Interpreter finished, under the 1536-byte goal. Second major goal is to get the instant feedback productivity addiction loop going: modify, get feedback, repeat. Have a simple ASCII to custom 64-character set encoding converter, and a way to include the converter source text starting at sector 3 in the boot loader. First major test, getting an infinite loop or infinite reboot working. Source without any syntax coloring,

Loader sets up memory with dictionary at 1MB (2MB chunk with 1MB overflow), copies source to 4MB (4MB chunk maximum), then starts the compile position at 8MB (so 8MB and on is the rest of the memory on the system). Had one major bug getting the interpreter up: forgot that \ in NASM results in a line continuation even when in a comment, which removed a line of a lookup table, resulting in a crash. Tracking down bugs is very easy, add "JMP $" or "db 0xEA" in NASM to hang or reboot respectively.

Adjusted the character syntax.
- - Negate the 64-bit number, add dash to the string.
. - Lookup word in dictionary, and push 64-bit value from word entry onto data stack.
, - Push 64-bit number on data stack.
: - Lookup word in dictionary, pop 64-bit value from data stack to word entry.
; - Lookup word in dictionary, interpret string starting at address stored in word entry.
[ - Lookup word in dictionary, store pointer to source after the [ in the word entry, skip past next ].
] - When un-matched with [, this ends interpretation via RET.
\ - Ignore text until the next \.
/ - Lookup word in dictionary, call to address stored in dictionary entry.
` - Lookup word in dictionary, push address of word on data stack.

Space, and every character above except the - char, clear the working word string and number. So , results in pushing a zero on the data stack. And / results in calling the empty word, which I've setup as a word that pops from the data stack and writes a 32-bit word to the compile position, then advances the compile position. This provides the base mechanics to start to create opcodes via manual hand assembly and build out an assembler, which is the topic of the next post...
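
A rough C model of this parse rule (the real interpreter is 512 bytes of x86; the packing here is a toy stand-in, and `feed` is a name I made up):

```c
#include <stdint.h>

// Ordinary characters accumulate simultaneously into a packed word string
// and a hex number; a tag character acts on what was parsed so far, and
// every tag except '-' then clears both accumulators. Hence a bare ','
// pushes zero and a bare '/' calls the empty word.
typedef struct {
    uint64_t name;  // packed word string (toy packing, not the real 6-bit set)
    uint64_t num;   // hex number accumulated from 0-9 A-F
} Parse;

static int is_tag(char c) {
    return c == '-' || c == '.' || c == ',' || c == ':' || c == ';' ||
           c == '[' || c == ']' || c == '\\' || c == '/' || c == '`' ||
           c == ' ';
}

// Feed one character. Returns 0 while accumulating; on a tag character,
// writes the accumulated name/number for the tag to act on, then returns it.
static char feed(Parse *p, char c, uint64_t *name, uint64_t *num) {
    if (!is_tag(c)) {
        p->name = p->name * 64 + ((uint8_t)c % 64);
        if (c >= '0' && c <= '9') p->num = p->num * 16 + (uint64_t)(c - '0');
        else if (c >= 'A' && c <= 'F') p->num = p->num * 16 + (uint64_t)(c - 'A' + 10);
        return 0;
    }
    *name = p->name;
    *num = p->num;
    if (c == '-') {                        // '-' negates, keeps accumulating
        p->num = (uint64_t)(-(int64_t)p->num);
        return '-';
    }
    p->name = 0;                           // every other tag clears
    p->num = 0;
    return c;
}
```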

Great Tube: Old computers did it better!


Solskogen 2015 and Misc Demo Tubes

Running on a ZX Evolution,

Oh How Programming Has Changed

Programming in the 1990's: power on computer, open up text file in source editor, edit, run shell script to compile, test, repeat.

Programming today: Put in windows password to get machine to wake up after automatically going to sleep, find that trackpad no longer works for some reason, pull usb mouse from another computer, click away popup of firefox asking for a security update right now, then find that Windows is also forcing an update and a reboot, find something else to do for a while, come back after machine reboots, bring up text file in source editor, edit, open developer command prompt for visual studio, find error message "ERROR: Cannot determine the location of the VS Common Tool folder", search on the internet for solution and find nothing that works, open up visual studio instead, get message that visual studio license no longer works, requires login, attempt to login with windows sign on, find that account has been temporarily disabled for some reason, go through process to re-enable account via email, click through message on email, re-login to windows sign on, get new message, account still disabled and for "security reasons" must re-enable via process through phone, choose SMS process, wait for a while, phone never gets SMS message, scramble to find another solution, try call method instead, finally get windows sign on re-enabled, finally get visual studio to work, build console project, click on option to not create a new directory, find it creates a new directory anyway, close visual studio, move around files, reopen, add source file to project, attempt to make a quick change to the code, notice that by default visual studio is reforming the code to something other than what is desired, look on internet to find out how to disable auto-format, disable auto-format, press F7 to compile, test, out of time for the day, repeat again tomorrow with a different selection of things which are randomly broken...


Inspiration Reboot

Quite inspired by the insane one still or video per day at beeple.tumblr.com. Attempting to get back in the groove of consistently taking a small amount of non-work time every day to reboot fun projects. I'm on week 2 now of probably a three month process of healing from a torn lower back; sitting in front of a computer is now low enough pain to have fun again...

Setting a new 1536-byte (3x 512-byte sector) constraint for a bootloader + source interpreter which brings up a PC in 64-bit long-mode with a nice 8x8 pixel VGA font and with 30720 bytes (60 sectors, to fill out one track) of source text for editor and USB driver. USB providing thumb drive access to load in more stuff. Have 1st sector bringing up VGA and long mode, 2nd sector with 64-character font, and last 512-byte sector currently in progress as the interpreter. Went full circle back to something slow, but dead simple: interpreter works on bytes as input. The following selection of characters appends simultaneously to a 60-bit word string (10 characters at 6 bits/char), and a 64-bit number,


Then giving a "color forth tag" like meaning to another fixed set of characters,

~ - Negate the 64-bit number.
. - Lookup word in dictionary, and push 64-bit value onto data stack.
, - Push 64-bit number on data stack.
: - Lookup word in dictionary, pop 64-bit value from data stack to word.
; - Write 32-bit number at compile position.
" - Lookup word in dictionary, interpret the string at address stored in word.
[ - Lookup word in dictionary, store compile position in word, append string from [ to ] at the compile position.
] - When un-matched with [, this ends interpretation via RET.
\ - Ignore text until the next \.
` - Lookup word in dictionary, call to address stored in word.

That set of characters replaces the "space" character in forth, and works like a post-fix tag on either the string or number just parsed from input. The set of tags is minimal but flexible enough to build up a forth style macro language assembler, with everything defined in the source itself. More on this next time. One nice side effect of post-fix tags is that syntax highlighting is trivial by reading characters backwards starting at the end of the block.

Sony Wega CRT HDTVs
The old Wega CRT HDTVs work quite well. They apparently are nearly fixed frequency 1080 interlaced, with around 540 lines per ~60 Hz field, and unlike prior NTSC CRT TVs, they seem to not do any real progressive scanning. Taking a working 1080i modeline and converting it to 540p and driving the CRT results in the Wega initiating a "mode-reset" when it doesn't see the interlaced fields for the 2nd frame. However 480p modes do work (perhaps with an internal conversion to 1080i). Given that 1080i modes are totally useless as the 60Hz interlace flicker is horrible, and 540p won't work, these HDTVs should be complete garbage. However 720p works awesome as the TV's processing to re-sample to 1080i does not flicker any worse than 60Hz already does. In theory the even and odd fields (in alternating frames) share around 75% of an input line (540/720), and likely more if the re-sampling has some low-pass filtering. Drop in a PS4 game which has aliasing problems, and the CRT HDTV works like magic. These late model "hi-scan" Wega CRTs only had a roughly 853 pixel width aperture grille: 853x540 from what was a 1920x1080 render is a good amount of super-sampling...


GPU Unchained ASCII Notes

Posted stills prior, here are the notes, hopefully the "pre" tag works for everyone...

                     ##X=--=X##  ##X=--=X##  ##      ##
                     ##          __      ##  ##      ##
                .    ##    -=##  ##      ##  ##      ##
  .                  ##      ##  ##X=--=X##  ##      ##               .
                     ##X=--=X##  ##          ##X=--=X##
          .                                 _               .
                              #                                 #        .
 .      .   #   # #=-=# #=-=# #=-=# #=-=# -=#   #=-=# #=-=# #=-=#         .
.:...    .. #   # #   # #     #   # ___ #   #   #   # # __# #   #   ...  :::..:.
:::::::::::.# . # # . # # .:. # : # #   # . = . # . # # ... # . #.::::::::::::::
|||||||'||| #=-=# # : # #=-=# # | # #=-=# #=-=# # : # #=-=# #=-=# ||||.|||||||||           
''''/   ''''';' ' '     '                              ';'    '' '  ''''''''';''  
   /     '  /                  ___                                   . '''  /
                                |imothy Lottes


     "Evaluation criteria for correct sharpness
                                should be different for stills and video"

 - In motion, sharpness must be reduced to have good spatial precision

 - 1/4 x 1/4 resolution rendering 
 - Render via stylized even field aperture grille (arcade Trinitron)
 - Font designed on pixel grid 
 - Font rendered via function of min 1D distance to capsules (for lines)

                         FILTERING DIGITAL ORIGAMI [b]

      "Filtering is critical 
              to convince the skeptical mind that a scene is believable"

 - 1/2 x 1/2 resolution tracing
 - Full resolution reconstruction
 - Lots of high frequency sub-pixel sized content

 - Simple gaussian resolve kernel (not enough samples/pix for negative lobes)
 - Temporal feedback also weights samples by rcp of reprojected luma difference


      "Highly variable frame timing is a poison for high refresh rates"

 - Standard termination,
     if(abs(distanceToSurface) <= distanceAlongRay * coneRadius) hit
 - LOD termination,
     ... <= distanceAlongRay * (coneRadius + steps * lodFactor)) hit
 - Expands surfaces as ray march gets progressively more expensive
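
The two termination tests above as C (this follows the note's pseudocode; the lodFactor value is content-dependent tuning):

```c
#include <math.h>

// Standard cone-trace termination: hit when the distance estimate shrinks
// below the cone radius at the current distance along the ray.
static int hit_standard(float distToSurface, float distAlongRay, float coneRadius) {
    return fabsf(distToSurface) <= distAlongRay * coneRadius;
}

// LOD termination: the effective cone radius also grows with the number of
// steps taken, so surfaces expand as the march gets more expensive and the
// ray is guaranteed to terminate sooner.
static int hit_lod(float distToSurface, float distAlongRay, float coneRadius,
                   int steps, float lodFactor) {
    return fabsf(distToSurface) <=
           distAlongRay * (coneRadius + (float)steps * lodFactor);
}
```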

 - Biased: 
     1/(1 + abs(a-b) * const)
 - Still not great: 
     1/(1 + (abs(a-b) / max(minimum,min(a,b))) * const)
 - Better:
     square(1 - abs(a-b) / max(a,b,minimum))


     "Maintaining higher spatial precision
               without increasing shading rate for higher quality visual"

 - Still tracing at sample-jittered half resolution (1/4 area).
 - Still sampling reprojection at half resolution.
 - But sampling reprojection from full resolution source.
 - And running reconstruction at full resolution.


    "Reorder work to keep ALUs fully loaded,
                      ok to leverage poor data access patterns if ALU bound"

 - Not pixel to thread -> too much ALU under utilization after rays hit surf
 - Rather fixed distance estimator iterations per thread

 - Fetch ray
 - For fixed traversal iterations
    - Distance estimation
    - Walk forward
    - If ray hit (aka "close enough")
       - Store result
       - Start on another ray
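
A CPU-side model of that loop (names and the toy distance estimator are illustrative; on the GPU this runs per thread with rays pulled from a shared queue):

```c
// One "thread": run a fixed budget of distance-estimator iterations,
// pulling a fresh ray whenever the current one hits, so the ALUs stay
// busy instead of idling after an early hit.
typedef struct { float t; } Ray;  // distance marched along the ray

static float distance_estimate(float t) { return 1.0f - t; }  // toy scene: surface at t=1

static int trace_fixed_budget(Ray *rays, int rayCount, int budget) {
    int next = 0, finished = 0;
    Ray r = rays[next++];
    for (int i = 0; i < budget; i++) {
        float d = distance_estimate(r.t);
        if (d < 0.001f) {              // ray hit (aka "close enough")
            finished++;                // store result would go here
            if (next >= rayCount) break;
            r = rays[next++];          // start on another ray
        } else {
            r.t += d;                  // walk forward
        }
    }
    return finished;                   // rays finished within the budget
}
```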

                                  BREAD CRUMBS [e]

  "When working in parallel, don't wait, if data isn't ready, just compute it"

 - Order rays by parent cones first in space filling curve,

     0 -> 12 -> 4589 -> and so on
          34    67ab

 - Cones end by writing their stop depth into memory.
 - Child cones will check for completed parent result,
   otherwise will start from scratch.
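
With rays stored per level in that space filling curve order, the parent lookup is just an index shift; a sketch, where the negative-means-unwritten convention is my assumption:

```c
#include <stdint.h>

// Each level is Morton ordered, so a parent's four children are contiguous
// and the parent's within-level index is childIdx / 4. A child cone reads
// the parent's stop depth if it was written, otherwise starts from scratch.
// Convention assumed here: negative depth means "parent not finished yet".
static float start_depth(const float *parentStopDepth, uint32_t childIdx) {
    float d = parentStopDepth[childIdx >> 2];   // 4 children per parent
    return (d < 0.0f) ? 0.0f : d;               // no result: march from the eye
}
```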

                                 IN THE NOISE [e]  

                   "Degrade to pleasing noise when out of time"

 - Fixed maximum DE iterations -> not always going to traverse full scene
 - Ok fine, let's not require tracing all rays

 - Fill ordering,

    0 7 E 5   0 . . .   0 7 . 5   0 7 . 5
    C 3 A 1   . 3 . 1   . 3 . 1   . 3 A 1
    8 F 6 D   . . . .   . . 6 .   8 . 6 .
    4 B 2 9   . . 2 .   4 . 2 .   4 B 2 9

 - Pattern defined by math (fast but perhaps not ideal)
 - Each frame gets different fill order (holes not always in same place)


 - Rendering into an octahedron map
 - Each sample writes out {x,y} projected screen position
 - Distortion finds the nearest texel in the octahedron map
 - Then samples a neighborhood around the texel
 - Then uses the difference in pixel center and sample {x,y} to filter
 - Reprojection samples from the warped reconstructed frame
 - Ideally adjust the filter kernel based on domain distortion
    - Showing simple adjustment of kernel size here (anisotropic is better)
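
For reference, the usual octahedral mapping those notes rely on (the standard formulation, not necessarily the demo's exact code): fold a unit direction into a [-1,1]^2 octahedron map coordinate.

```c
#include <math.h>

// Project a unit direction onto the octahedron |x|+|y|+|z|=1, then fold
// the lower hemisphere (z<0) out over the corners of the unit square.
static void oct_encode(float x, float y, float z, float *u, float *v) {
    float a = fabsf(x) + fabsf(y) + fabsf(z);
    float px = x / a, py = y / a;
    if (z < 0.0f) {
        float fx = (1.0f - fabsf(py)) * (px >= 0.0f ? 1.0f : -1.0f);
        float fy = (1.0f - fabsf(px)) * (py >= 0.0f ? 1.0f : -1.0f);
        px = fx; py = fy;
    }
    *u = px; *v = py;
}
```

The distortion pass then finds the nearest texel at {u,v} and filters a neighborhood around it as described above.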