20160728

Blink Mechanic for Fast View Switch for VR

As seen in the SIGGRAPH Realtime section for Bound for PS4 PSVR, around 42 minutes into this video. Great to see someone making good use of the "blink mechanic" to quickly switch view in VR. The scene quickly transitions to black to simulate the eyelid closing, then fades back in on a new view, simulating the eyelid opening.

My recommendation for the ideal VR display interface used this mechanic: specifically, "blink" to hide the hand-off of exclusive ownership of the Display between a "Launcher / VR Portal" application and the game. The advantages of an exclusive app-owned Display for VR on PC would have been tremendous. For instance, it then becomes possible to,

(1.) Fully saturate the GPU. No more massive sections of GPU idle time.

(2.) Render directly to the post-warped final image, for a 2x reduction in shaded samples with compute-generated (non-triangle-based) rendering, and pixel-perfect image quality.

(3.) Factor out shading to Async Compute, and only generate the view right before v-sync. Rendering just in time is better than time-warp: no more incorrect transparency, no more doubling of visual error for dynamic objects which are moving differently than the head camera tracking.

(4.) Race the beam for the ultimate in low latency.

(5.) Great MGPU scaling (app owned display cuts MGPU transfer cost by 4x).

(6.) Have any GPU programming API, even compute APIs, be able to work well in VR without complex cross-process interop.

(7.) Etc.

Ultimately no one in the PC space implemented this, and thus all my R&D on the ultimate VR experience got locked out and blocked by external-process VR compositors, pushing me personally out of VR and back to flat 3D, where I can still actually push the limits: with good frame delivery, without artifacts, and with perfect image quality.

20160727

Vulkan - How to Deal With the Layouts of Presentable Images

Continuing my posts on building a Vulkan based Compute based Graphics engine from scratch (no headers, no libraries, no debug tools, no using existing working code)...

Interesting to Google something and already get hits on Vulkan questions on Stack Overflow - How to Deal With the Layouts of Presentable Images. Turns out one of the frustrating aspects of Vulkan is the WSI or presentation interface. Three specific things make this a pain in the butt, quoting from the Vulkan spec.

(1.) "Use of a presentable image must occur only after the image is returned by vkAcquireNextImageKHR, and before it is presented by vkQueuePresentKHR. This includes transitioning the image layout and rendering commands."

(2.) "The order in which images are acquired is implementation-dependent. Images may be acquired in a seemingly random order that is not a simple round-robin."

(3.) "Let n be the total number of images in the swapchain, m be the value of VkSurfaceCapabilitiesKHR::minImageCount, and a be the number of presentable images that the application has currently acquired (i.e. images acquired with vkAcquireNextImageKHR, but not yet presented with vkQueuePresentKHR). vkAcquireNextImageKHR can always succeed if a<=n-m at the time vkAcquireNextImageKHR is called. vkAcquireNextImageKHR should not be called if a>n-m ..."

The last part, (3.), roughly translates to: you are not guaranteed the ability to acquire all images at any one time. Put together, these rules mean the following are impossible,

(A.) No way to robustly pre-transition all images into VK_IMAGE_LAYOUT_PRESENT_SRC_KHR before entering normal run-time conditions. Instead you have to special-case the transition the 1st time acquire returns a given index. IMO this adds unnecessary complexity for absolutely no benefit, and makes it really easy to introduce bugs. I've seen online Vulkan examples violate rule (1.).

(B.) No way to ensure a simplified round-robin order even in cases where it is physically impossible to get anything other than round-robin (such as full-screen flip with v-sync on and a 2 deep swap chain).

Working Around the Problem
This problem infuriates me personally because of all the wasted time required to add complexity for no benefit. Likewise, being forced to double buffer instead of using the front buffer is a large waste of time in exchange for a regression in latency. Since my engine is command buffer replay based (no command buffers are generated after init time), I ended up needing 8 baked command buffer permutations.

(1.) Even frame pre-acquire.
(2.) Odd frame pre-acquire.
(3.) Even frame post-acquire image index 0.
(4.) Even frame post-acquire image index 1.
(5.) Odd frame post-acquire image index 0.
(6.) Odd frame post-acquire image index 1.
(7.) Transition from UNDEFINED to PRESENT_SRC for image index 0.
(8.) Transition from UNDEFINED to PRESENT_SRC for image index 1.

My workaround for needing to special-case transitions on the 1st acquire of a given index is to run the transition-from-UNDEFINED command buffer instead of the one which normally draws into the frame. So there is a possibility of randomly seeing one extra black frame after init time. IMO this is all throw-away code anyway once I can get some kind of front-buffer access.
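
For reference, a minimal sketch of what that transition-from-UNDEFINED command buffer records (assuming "cmd" is already in the recording state and "image" is the swap chain image for the given index; this is not the engine's actual code):

// One-time transition: UNDEFINED -> PRESENT_SRC_KHR for one swap chain image.
VkImageMemoryBarrier barrier = {0};
barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
barrier.srcAccessMask = 0;
barrier.dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;
barrier.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;       // contents are discarded
barrier.newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image = image;
barrier.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
barrier.subresourceRange.levelCount = 1;
barrier.subresourceRange.layerCount = 1;
vkCmdPipelineBarrier(cmd,
 VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
 0, 0, NULL, 0, NULL, 1, &barrier);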

Bugs
Interesting to look back at the bugs I had to deal with en route to getting the basic example of a compute shader rendering into a back-buffer. Really only 2 bugs: for one, I forgot to call vkGetDeviceQueue(), which was trivial to find and fix. The other was that when creating the swap chain I accidentally set imageExtent.width to the height value and left imageExtent.height at zero. No amount of type-checking would ever help in finding that bug. Didn't see any errors, so it took a while of re-inspecting the code to see what I had screwed up.
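
For reference, the fix stated as code (a sketch, assuming "surfaceCaps" came from vkGetPhysicalDeviceSurfaceCapabilitiesKHR; the rest of the create info is omitted):

VkSwapchainCreateInfoKHR info = {0};
info.sType = VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR;
// The bug: width was set to the height value and height was left at zero.
// The fix: copy both dimensions from the surface capabilities.
info.imageExtent.width  = surfaceCaps.currentExtent.width;
info.imageExtent.height = surfaceCaps.currentExtent.height;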

In hindsight, after knowing what to do, using Vulkan was actually quite easy.

20160726

On Killing WIN32?

Many years ago I used to be a dedicated reader of Ars, but it slowly transitioned to something a little too biased for my taste, so I avoid it. Thanks to twitter, however, it is still possible to get sucked into a highly controversial article: "Tim Sweeney claims that Microsoft will remove Win32, destroy Steam".

My take on this is quite simple. Everyone in this industry who has lived long enough to have programmed in the C64 era has witnessed a universal truth on every mass-market platform: the user's or programmer's freedom and access to the computer is reduced annually, at a rate which roughly scales with the complexity of the software and hardware.

The emergent macro-level behavior is undeniable. Human nature is undeniable. It is possible to continuously limit freedom as long as it is done slowly enough that it falls under the instantaneous tolerance to act on each micro-level regression of freedom. Or, in translation: humans are lazy, humans adapt fast, and humans don't live long. Each new generation lacks the larger perspective of the last, and starts ignorant of what has been lost.

The reason why computers and freedom are so important is that computers are on a crash course to continue deeper and deeper integration with our lives. I believe ultimately humans will transcend the limits of our biology, blurring the lines between the mind and machine. Seems rather important at that apex to have the individual freedoms we have today, the privacy of our thoughts, etc.

In the short term, as a developer, I'm also somewhat concerned about whether the infants that will grow up to replace the generation I started in will have the same opportunities I had: the same access to the hardware, the freedom to implement their dreams, and, if they choose, the ability to make a living doing so in a free market, controlling their own destiny, selling their own product, without a larger controlling interest gating that process.

WIN32 is one such manifestation of that freedom.

There are some very obvious trends in the industry, specifically in the layers of complexity being introduced either in hardware or software. For example, virtualization in hardware mixed with more attempts to sandbox software. Or the increased distance one has to the display hardware: look at VR, where you as an application developer are locked out of the display and have to pass through a graphics API interop layer which does a significant amount of extra processing in a separate process. Or perhaps the "service-ification" of software into subscription models. Or perhaps the HDR standard removing your ability to control tone-mapping. Or perhaps it is just the complexity of the API which makes it no longer practical to do what was done before, even if it is still actually possible.

Following the trends to their natural conclusion perhaps paints a different picture for system APIs like WIN32. They don't go away per se; they just get virtualized behind so many layers that it becomes impossible to gain the advantages those APIs had when they were direct. That is one of the important freedoms which is eventually lost.

One of the best examples of this phenomenon is how the new generation perceives old arcade games. Specifically as: games with incorrect color (CRT gamma around 2.5 being presented as sRGB without conversion), giant exactly-square pixels (which never happened on CRTs), dropped frames (arcades had crystal-clear, no-jitter, v-synced animation), high input latency due to emulation in a browser for example (arcade input was instant in contrast), more latency due to swap-chains added in the program (arcade hardware generated images on scan-out), added high-latency displays (HDTVs and their +100 milliseconds, vs instant CRTs), and poor button and joystick quality (arcade controls are a completely different experience). Everything which made arcades awesome was lost in the translation to emulation.

Returning to the article, I don't believe there is any risk in WIN32 being instantly deprecated, because if that was to happen, it would be a macro-level event well beyond the tolerance level required to trigger action. The real risk is the continued slow extinction.

20160724

Simplified Vulkan Rapid Prototyping

Nothing simple about using Vulkan, so this title is a little misleading ...
Trying something new for my next Vulkan-based at-home prototyping effort: building from scratch for 64-bit machines only, as a simplified version of my prior rapid prototyping system. On code change, instead of reloading a DLL, this version actually re-compiles and restarts the program. My theory is that restart time is going to be lower than the time it takes to recompile shaders. I'm not concerned with re-filling the GPU with baked data because I don't ever use much, and never have much non-runtime-regeneratable state either. The program is required, somewhat like a "save snapshot" game emulator, to be able to instantly restart to where it was running before (at the time of the last snapshot). This has some interesting advantages; error handling becomes trivial: just exit the program and restart! For correct handling of things like VK_ERROR_DEVICE_LOST or VK_ERROR_SURFACE_LOST_KHR, just exit. No need to have two binaries (one for development, one for release), as I never use debug.

Details
I've got only one source file, with #defines to enable keeping both GLSL and C code in the same file. Also I've got no includes, to optimize for compile time. Notice that on Windows, "vulkan.h" ultimately includes "windows.h", for example to get the HWND and HINSTANCE types, so short of rolling your own version of the headers, the compile dips into the massive platform include tree. Re-rolling only what I need from the Vulkan headers is quite frankly a nightmare of work due to Vulkan's verbosity, but should be mostly over soon. In the process I've also made an un-type-safe (yeah) version of the Vulkan API, returning to base system types, so I never have to bother with silly compile warnings. All handles are just 64-bit pointers, etc. It works great. I was beyond having type-safety bugs from birth, having been brought up on assembly first. The bugs I have now are more like, "the last time I worked on this was a month ago, and I forgot to call vkGetDeviceQueue(), but already wrote code out-of-order using the queue handle". Like any programmer, out of habit, I first blamed the driver, and ultimately realized that I was the idiot instead.

Part of the motivation for this design is laziness. Since Vulkan requires SPIR-V input, and I work in GLSL, I need to call "glslangValidator.exe" to convert my GLSL into SPIR-V, and I sure didn't feel like writing a complex system to spawn processes from inside my app. So I have a shell script per platform which does {compile shaders, convert SPIR-V binaries to headers which are included in the program, recompile the program, launch program, then repeat}.
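
As an illustration of the "SPIR-V binaries to headers" step (hypothetical header and symbol names, not the script's actual output):

// Hypothetical generated header: the SPIR-V words baked in as a C array.
//   static const unsigned int shaderComp[] = { 0x07230203, /* ... */ };
#include "shaderComp.h"
// The shader module is then created straight from the baked-in words.
VkShaderModuleCreateInfo smInfo = {0};
smInfo.sType    = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
smInfo.codeSize = sizeof(shaderComp);   // size in bytes
smInfo.pCode    = shaderComp;           // consumed as 32-bit SPIR-V words
VkShaderModule module;
vkCreateShaderModule(device, &smInfo, NULL, &module);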

Engine design is trivial as well: just setting up baked command buffers and then replaying them until exit. Everything is compute based, and indirect dispatch based to manage variability. Having no graphics makes using Vulkan quite easy, relatively speaking: no graphics state, no render passes, trivial transitions.
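
A minimal sketch of the "baked once, replayed until exit" idea with indirect dispatch (assuming the compute pipeline, descriptor set, and args buffer already exist; the names are mine, not the engine's):

// Recorded once at init time, then replayed every frame.
VkCommandBufferBeginInfo begin = {0};
begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
vkBeginCommandBuffer(cmd, &begin);
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, computePipeline);
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayout,
 0, 1, &descriptorSet, 0, NULL);
// The work amount is read at execution time from a VkDispatchIndirectCommand
// {x,y,z} stored in argsBuffer, so the same baked command buffer covers a
// variable amount of work per frame without any re-recording.
vkCmdDispatchIndirect(cmd, argsBuffer, 0);
vkEndCommandBuffer(cmd);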

I'm debating whether to eventually release basic source for this project or not. On one hand, it is a good example of a Windows/Linux Vulkan app from scratch. On the other hand, my code is very much in a shorthand which looks alien to other humans (likely the inverse of how C++ looks totally alien to me). For example, the following (which might get wrapped poorly by the browser) is my implementation of everything I need for printf-style debugging writing to the terminal.

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//=============================================================================================================================
//
// [KON] CONSOLE MESSAGE SYSTEM
//
//-----------------------------------------------------------------------------------------------------------------------------
// A background message thread which handles printing.
// This works around the problem of slow console print on Windows.
// This also allows single point to override to stream to file, etc.
// Multiple threads can send messages simultaneously.
// It would be faster to queue messages per thread, but this isn't about speed, but rather mostly debug.
// Merging per message gives proper idea of sequence of events across threads.
// This has spin waits in case of overflow panic, so set limits so overflow panic never happens. 
//=============================================================================================================================
 // Defaults, must be a power of two.
 // Number of characters in ring.
 #ifndef KON_BUF_MAX
  #define KON_BUF_MAX 32768
 #endif
 // Number of messages in ring.
 #ifndef KON_SIZE_MAX
  #define KON_SIZE_MAX 1024
 #endif
 // Maximum message size for macro message generation.
 #ifndef KON_CHR_MAX
  #define KON_CHR_MAX 1024
 #endif
//-----------------------------------------------------------------------------------------------------------------------------
 typedef struct {
  A_(64) U1 buf[KON_BUF_MAX*2]; // Buffer for messages, double size for overflow.
  A_(64) U4 size[KON_SIZE_MAX]; // Size of messages.
  A_(64) U8 atomReserved[1];    // Amount reserved: packed {MSB 32-bit buffer bytes, LSB 32-bit size count}.
  U8 atom[1];                   // Next: packed {MSB 32-bit buffer offset, LSB 32-bit size offset}.
  C2 write;                     // Function to write to console (adr,size). 
  U4 frame;                     // Updated +1 every time the writer goes to sleep (used for drain).
 } KonT;
 S_ KonT kon_[1];
 #define konR TR_(KonT,kon_)
 #define konV TV_(KonT,kon_)
//-----------------------------------------------------------------------------------------------------------------------------
 // Begin KON_CHR_MAX macro message.
 #define K_ { U1 konMsg[KON_CHR_MAX]; U1R konPtr=U1R_(konMsg)
 // Ends.
 #define KON_MSG KonWrite(konMsg,U4_(U8_(konPtr)-U8_(konMsg)))
 #define KE_ KON_MSG; }
 #define KW_ KON_MSG; KonWake(); }
 #define KD_ KON_MSG; KonDrain(); }
//-----------------------------------------------------------------------------------------------------------------------------
 #define KN_ konPtr[0]='\n'; konPtr++
 // Ends with newline.
 #define KNE_ KN_; KE_
 #define KNW_ KN_; KW_
 #define KND_ KN_; KD_
//-----------------------------------------------------------------------------------------------------------------------------
 // Append numbers.
 #define KH_(a) konPtr=Hex(konPtr,a)
 #define KU1_(a) konPtr=HexU1(konPtr,a)
 #define KU2_(a) konPtr=HexU2(konPtr,a)
 #define KU4_(a) konPtr=HexU4(konPtr,a)
 #define KU8_(a) konPtr=HexU8(konPtr,a)
 #define KS1_(a) konPtr=HexS1(konPtr,a)
 #define KS2_(a) konPtr=HexS2(konPtr,a)
 #define KS4_(a) konPtr=HexS4(konPtr,a)
 #define KS8_(a) konPtr=HexS8(konPtr,a)
//-----------------------------------------------------------------------------------------------------------------------------
 // Append decimal.
 #define KDec1_(a) konPtr=Dec1(konPtr,a)
 #define KDec2_(a) konPtr=Dec2(konPtr,a)
 #define KDec3_(a) konPtr=Dec3(konPtr,a)
//-----------------------------------------------------------------------------------------------------------------------------
 // Append raw data.
 #define KR_(a,b) do { U4 konSiz=U4_(b); CopyU1(konPtr,U1R_(a),konSiz); konPtr+=konSiz; } while(0)
 // Append character.
 #define KC_(a) konPtr[0]=U1_(a); konPtr++
 // Append zero terminated compile time immediate C-string.
 #define KZ_(a) CopyU1(konPtr,U1R_(a),sizeof(a)-1); konPtr+=sizeof(a)-1
 // Append non-compile time immediate C-string.
 #define KZZ_(a) KR_(a,ZeroLen(U1R_(a)))
//-----------------------------------------------------------------------------------------------------------------------------
 // Quick message for debug.
 #define KQ_(a) K_; KZ_(a); KD_
//-----------------------------------------------------------------------------------------------------------------------------
 // Quick decimal.
 #define KDec2Dot3_(a) KDec2_(a/1000); KC_('.'); KDec3_(a%1000)
 #define KDec3Dot3_(a) KDec3_(a/1000); KC_('.'); KDec3_(a%1000)
//=============================================================================================================================
 S_ void KonWake(void) { SigSet(SIG_KON); }
//-----------------------------------------------------------------------------------------------------------------------------
 // Unpack components from atom.
 I_ U4 KonSize(U8 atom) { return U4_(atom); } 
 I_ U4 KonBuf(U8 atom) { return U4_(atom>>U8_(32)); }
//-----------------------------------------------------------------------------------------------------------------------------
 // Unpack components from atom and mask.
 I_ U4 KonMaskSize(U8 atom) { return KonSize(atom)&(KON_SIZE_MAX-1); }
 I_ U4 KonMaskBuf(U8 atom) { return KonBuf(atom)&(KON_BUF_MAX-1); }
//-----------------------------------------------------------------------------------------------------------------------------
 // Reserve space to write message.
 I_ U8 KonReserve(U4 bytes) { return AtomAddU8(konV->atomReserved,(U8_(bytes)<<32)+1); } 
//-----------------------------------------------------------------------------------------------------------------------------
 // Release space reservation.
 S_ void KonRelease(U4 bytes,U4 msgs) { AtomAddU8(konV->atomReserved,(-(U8_(bytes)<<32))+(-U8_(msgs))); }
//-----------------------------------------------------------------------------------------------------------------------------
 // Check if reservation under limits.
 S_ U4 KonOk(U8 atom) { return (KonSize(atom)<KON_SIZE_MAX)&&(KonBuf(atom)<KON_BUF_MAX); }
//-----------------------------------------------------------------------------------------------------------------------------
 // Get position in ring for the next message.
 I_ U8 KonNext(U4 bytes) { return AtomAddU8(konV->atom,(U8_(bytes)<<32)+1); }
//-----------------------------------------------------------------------------------------------------------------------------
 // Copy in message.
 S_ void KonCopy(U8 atom,U1R adr,U4 bytes) { CopyU1(konR->buf+KonMaskBuf(atom),adr,bytes);
  AtomSwapU4(konV->size+KonMaskSize(atom),bytes); }
//-----------------------------------------------------------------------------------------------------------------------------
 // Used for debug busy wait until message is displayed.
 S_ void KonDrain(void) { U4 f=konV->frame; while(f==konV->frame) { SigSet(SIG_KON); ThrYield(); } }
//-----------------------------------------------------------------------------------------------------------------------------
 // Write message to console.
 S_ void KonWrite(U1R adr,U4 bytes) { while(1) { 
  if(KonOk(KonReserve(bytes))) { KonCopy(KonNext(bytes),adr,bytes); return; }
  KonRelease(bytes,1); KonWake(); ThrYield(); } }
//=============================================================================================================================
 // Background thread which sends messages to the actual console.
 S_ U8 KonThread(U8 unused) { U4 bufOffset=0; U4 sizeOffset=0; 
  while(1) { U4 bytes=0; U4 msgs=0; SigWait(SIG_KON,1000); SigReset(SIG_KON);
   while(1) { U4 size=konV->size[sizeOffset]; bytes+=size;       
    // If not zero need to force clear before adjusting free counts, to mark as unused entry.
    if(size) konV->size[sizeOffset]=0; 
    // Force write if would wrap, or found zero size message.
    if(((bufOffset+bytes)>=KON_BUF_MAX)||(size==0)) {
     konV->write(U8_(konR->buf+bufOffset),bytes);
     bufOffset=(bufOffset+bytes)&(KON_BUF_MAX-1); 
     KonRelease(bytes,msgs); bytes=0; msgs=0;
     // If hit zero size break (zero size means rest of messages are empty).
     if(size==0) break; }
    msgs++; sizeOffset=(sizeOffset+1)&(KON_SIZE_MAX-1); } 
   // Only advance frame until after draining.
   BarC(); konV->frame++; } } 
//-----------------------------------------------------------------------------------------------------------------------------
 S_ void KonInit(void) { konR->write=C2_(ConWrite); ThrOpen(KonThread,THR_KON); }

20160723

Why Motion Blur Trivially Breaks Tone-Mapping - Part 2

Continuing from last post...

The question remains: how to "fix it"? Any "fix" requires that the post-tone-mapped image is physically energy conserving under motion blur, meaning energy conserving from the perspective of the viewer looking at the display. This requires, by definition, knowledge of the tone-mapping transfer function. On conventional displays this is not a problem: the application owns the tone-mapping transfer function.

However all the standards bodies for the new HDR Displays (HDR-10, Dolby Vision, ...) decided to do something rather rash: they took away the ability for you the developer to tone-map, and gave that step to the display OEM!

Ouch!

A long but important technical tangent for those who haven't been following what is happening in the display world; here is a refresher. Until recently you have enjoyed the freedom of targeting a display with enough knowledge of the output transfer function to do what you need. Basically, on PCs you use sRGB, on Mac you use Gamma 2.2, and on HDTVs you use Rec709.

This ends with "HDR" Displays.

HDR Display signals like HDR-10 have switched from display-relative signals to an absolute nits scale with absolute wide-gamut primaries, both of which are far outside the realm of what a consumer display can output. Each display has a different capacity. The luminance range of the HDR signal is {0-10000 nits}, but it is roughly only {0-500 nits} for an OLED display (yeah, likely not even a stop brighter than the LDR screen you are reading this post from). The gamut of the signal is Rec 2020, but the gamut of the displays is around P3. The only consumer devices which are similar in gamut to Rec 2020 are the MicroVision-based pico laser projectors, and related licensed products, which have existed for a while. In order to reach such a wide gamut they resort to ultra-narrow primaries which have a side effect: metamerism (meaning all viewers see something different on the display; they are impossible to calibrate for multiple human viewers). There has also long existed a market selling displays over 5000 nits: the outdoor LED sign industry. While LCD HDR TVs are driven by LED back-lights, adapting outdoor sign LEDs would require water cooling and power draw which is very far outside the range of what is acceptable in the consumer space. Point being, the range of the HDR signal will remain outside the realm of consumer displays for the foreseeable future: they will always be tone-mapped when driven by these new "absolute" scale signals.

In fact, the HDR standards *require* the display to tone-map the input signal and reduce to the range that the display can output. But the tone-mapper is up to the display OEM (or possibly some other 3rd party in the signal output chain in the future for non-TV cases). OEMs like to have a collection of different tone-mapping transfer functions depending on TV settings like "standard", "movie mode", "vivid", etc. You as a developer have no way of knowing what the user selected, and each TV can be different, even within the same product due to firmware updates.

So yes, the HDR TV standard for tone-mapping is effectively random!

Double Ouch!

Many developers understand what it means to target "random", because existing HDTVs already have this problem with things like contrast and saturation settings, just not to the extent of HDR TVs. The only way to author content is to purchase a large collection of displays, then play whack-a-mole: take the cases with the visually worst output, and keep re-adjusting the content until it looks ok on the largest number of displays. The problem with this, besides the expensive iteration cycles, as many color professionals know, is that when you cannot target calibrated displays, you also cannot push to the limits of the displays (especially in the darks); you must play it safe with content, and accept that your visual message gets diluted before your consumer sees it.

But it gets better (well at least sarcastically speaking): you also cannot really calibrate these new HDR displays!

Triple Ouch!

LED-driven LCDs and OLEDs each have their own problems. Let's start with LCDs. The UHD HDR certification label requires an output contrast range which is physically impossible for LCDs to display without cheating. Quoting the above press release for the part applying to LCDs: "More than 1000 nits peak brightness and less than 0.05 nits black level". Taken literally that means 1000/0.05, or a 20000:1 contrast ratio.

PC LCD displays reached their ultra-cheap pricing by cutting panel quality; they typically top out at around 1000:1 ANSI contrast. Below is a table yanked from TFT Central showing some recent measured examples.



The best LCD displays are around 5000:1 ANSI contrast. The HDR TV industry's high-end LCD models use panels which have around 4000:1 ANSI contrast. So LCDs are anywhere between 2 and 5 stops away from the minimum requirements for the HDR label.

Now on to the cheating part: enter Local Dimming. The back-light of these LCDs is driven by a regular grid of a hundred or so LEDs (called zones), each of which can be controlled individually. It then becomes possible to very coarsely adjust black level by dropping the peak white level in a given zone. Let's go through an example of how this works with an LCD panel capable of 2000:1 ANSI contrast (see the small sketch after the table),

full brightness zone     ->  2000:1 contrast ratio from display peak to black level in zone
1 stop  less bright zone ->  4000:1 contrast ratio from display peak to black level in zone
2 stops less bright zone ->  8000:1 contrast ratio from display peak to black level in zone
3 stops less bright zone -> 16000:1 contrast ratio from display peak to black level in zone
4 stops less bright zone -> 32000:1 contrast ratio from display peak to black level in zone
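
Stated as code, the arithmetic behind the table (a toy sketch; the function name is mine):

#include <math.h>
// Effective in-zone contrast when a zone's back-light is dimmed by "stops"
// below display peak, for a panel with native ANSI contrast "panelContrast".
// Example: a 2000:1 panel dimmed 4 stops -> 2000 * 2^4 = 32000:1 in that zone.
double ZoneContrast(double panelContrast, double stops)
{
 return panelContrast * pow(2.0, stops);
}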


Now let's see what this looks like visually on a real display with some simple test cases: first a white line going through the center of the screen, then a white line outlining the screen. The image on the left is on a display without Local Dimming and represents what the output is supposed to look like; the image on the right is on a display with Local Dimming. Images below yanked from Gizmodo.




This visually shows exactly why it is impossible to calibrate a Local Dimming display: the error introduced into the signal by the TV exceeds all sane standards for calibration. Quite literally, these displays have horrible "uniformity". The error introduced by the display in the blacks can range from 2 to 5 stops depending on the quality of the LCD panel.

Bringing this back to authoring content, the display introduces its own "square-ish bloom filter". The "square-ish bloom" doesn't move as the bright content moves by a fraction of a zone, and the bloom doesn't actually scale in intensity with the average intensity of the output signal. Instead, even a tiny amount of highlight will cause the full "bloom", because the display needs to fire the associated zone at peak to correctly reproduce the brightness of just a few pixels. Also, the color of the "bloom" might not track the color of the associated highlight(s). No self-respecting developer would ever release a game with such poor quality "bloom".

This becomes even more of a problem with real-time games. Existing games often have bright UI elements wrapping the screen, or overlaid on top of game content. The reason this UI content is bright is because it needs to be, in order to be visible over in-game content. Existing LDR PC displays often reach roughly 400 nits, so these new "HDR" LCD displays are only a little over 1 stop brighter (the UHD HDR cert minimum is 1000 nits). Taking the example of a 2000:1 ANSI contrast UHD "HDR" labeled display, any 400 nit white text in that UI is going to bring the nearby zones to close to 1 stop from peak, which drops the effective contrast ratio to around 4000:1, and will introduce "square-ish bloom" if the game content is dark.

High quality "non-HDR-labeled" displays from the past will actually produce much better quality output than current HDR LCDs, and they don't require any new HDR signal to do so. For example, the circa 2013 Eizo Foris FG2421 120 Hz LCD as reviewed by TFT Central has around a 5000:1 ANSI Contrast ratio without resorting to Local Dimming and has a low-persistence mode for gaming. Older top-of-the-line discontinued Plasma displays are much better than these new HDR LCDs because they don't have Local Dimming. Plasma unfortunately was ended by the display industry. Like OLED, Plasma is not as energy efficient as the LEDs driving the LCD back lights, so they get lower APL peaks. Lets look at one of the best, the Pioneer Elite KURO PRO-110FD Plasma HDTV from 2007. The KUDO's ANSI contrast is APL limited measured around 3239:1. That doesn't really tell the full story, as real content is non-APL limited, at which point the contrast ratio is measured around 10645:1 which is over double the best current HDR LCDs true contrast ratios (meaning what is possible without major signal degradation).

Now lets look at OLED.

OLEDs don't have back lights, so they do not suffer from the woes of Local Dimming; they do however have a "darker" problem, quite literally in fact. The problem is described quite well by the latest HDTV Test LG OLED55E6 4K HDR OLED TV review: "the E6V’s default [Brightness] position of “50” has been purposely set up from factory to crush some shadow detail so that most users won’t pick up its near-black foibles. Once we raised [Brightness] to its correct reference value, we could see that the television was applying dithering to shadowed areas to better mask above-black blockiness."

The OLEDs crush the blacks because they have severe problems with near-black uniformity. From some personal testing and calibration on a slightly older LG OLED, I could see uniformity problems similar to burn-in that were larger in the darks than a few steps of 8-bit sRGB output. So while absolute black is quite dark, they don't have enough accuracy in the darks to reproduce anything well when APL is dropped down to levels appropriate for a dark room suited to HDR viewing. These OLEDs can look ok in daytime viewing because the default black crush, in combination with ambient reflection on the screen, masks the problem in the darks.

The black crush problem is a symptom of the larger problem facing OLED TVs. This problem is also described quite well by HDTV Test's review of the LG OLED65G6P HDR OLED TV, "We calibrated several G6s in both the ISF Night and ISF Day modes, and found that with about 200-250 hours run-in time on each unit, the grayscale tracking was consistently red deficient and had a green tint as a result (relative to a totally accurate reference). We witnessed the same inaccuracy on the European Panasonic CZ950/CZ952 OLED (which of course also uses a WRGB panel from LG Display).".

OLEDs cannot hold calibration.

OLED as a technology suffers from gradual pixel decay issues. These latest LG OLEDs resort to a 10-20 minute "Cleaning" attempted auto-re-calibration cycle when the TV is in powered standby after around 4 hours of viewing time. Even this cannot solve the problem. The accuracy of the calibration is poor, which is why OLEDs have such a problem reproducing anything interesting in the darks.

Returning to "Fixing" the Motion Blur Issue

So, as established above, you are effectively screwed on newer HDR Displays if you use the HDR input signal: tone-mapping is out of your hands. It is still possible on HDR displays to select the "PC" input and drive the display without tone-mapping using a traditional non-HDR signal. Often the HDR TVs (given they really are not that bright anyway) still support their full brightness range in that mode. Some HDR TVs, I believe, even support disabling Local Dimming. However there is no guarantee this will be the case or will remain the case. It also places a large burden on the consumer to be tech-savvy enough to correctly navigate the TV's options and do the right thing.

Until game developers organize en masse and throw their combined weight into forcing a display property query and pass-through signal standard, with measured minimal pixel-level signal degradation as the cert metric, you will be forced to live with all the problems of the new HDR standards.

On non-HDR-signal output there are a lot of interesting options for "fixing" the motion blur issue.

Let's break this down into non-photo-real and photo-real categories. The non-photo-real case is what I'm personally interested in because it aligns with one of my current at-home projects involving a state-of-the-art Vulkan-based engine. My intended graphics pipeline does {auto-exposure, color-grading, and tone-mapping} of graphics elements into the limited display range, then afterwards does a linear composite which applies linearly processed motion blur and DOF to the elements, thus side-stepping the problem. Everything done post-tone-mapping is energy preserving, so motion does not change Average Picture Level.
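
A toy sketch of why the post-tone-map composite preserves APL (plain C; the names and the normalized weighting are mine, not the actual engine code):

typedef struct { float r, g, b; } Rgb;
// Blur applied AFTER tone-mapping: a weighted average whose weights sum to 1,
// so the total displayed energy of the element layer is unchanged by motion.
Rgb BlurPostTonemap(const Rgb* tap, const float* weight, int count)
{
 Rgb sum = { 0.0f, 0.0f, 0.0f };
 for (int i = 0; i < count; i++)
 {
  sum.r += tap[i].r * weight[i];
  sum.g += tap[i].g * weight[i];
  sum.b += tap[i].b * weight[i];
 }
 return sum; // weights are normalized so that sum(weight[i]) == 1
}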

One of the interesting things I've found when playing around with low-persistence 160 Hz output, most recently playing through DOOM on a CRT, is that motion blur still plays an important role in visual quality. One would think that at 160 Hz motion blur simply isn't necessary. This is true as long as it is possible to restrict camera motion to be slow enough, and the eye always tracks exactly with the motion. But with twitch games where the camera can spin instantly, the motion can get fast enough that when the eye is not actively matching the camera rotation, it is possible to see discontiguous strobes of the image at the 160 Hz refresh. The eye/mind picks up on edges perpendicular to the direction of assumed motion. Introducing some amount of motion blur which effectively removes those perpendicular edges enables the mind to accept a continuously moving image.

For real-time graphics it is impossible to do "correct" motion blur because of the display limits and because we lack eye tracking. Instead motion blur serves a different purpose: to best mask artifacts so the mind falls into a deeper relationship with the moving picture. The desire is that the mind lives in what it imagines the world to be based on what it is seeing, without being distracted back into the real world by awareness of things like {triangles, pixels, scan-and-hold, aliasing, flickering, etc}. This is where the true magic happens, and why games like INSIDE and LIMBO by Playdead, which are effectively visually flawless, are so deeply engaging.

As for the photo-real case with content outside the capacity of the display, I believe it would be quite interesting to re-engineer motion blur to be APL conserving as observed by the gamer, while maintaining correct linear behavior. With bloom turned off on a still image, the actual brightness of various specular highlights is very hard to know due to highlight compression done in the tone-mapper. This hints at one possible compromise. Apply motion blur linearly so the color is correct, but with the pixel intensity as seen post-tone-mapped, so the APL as observed by the gamer is constant. Then linearly add a very diffuse bloom to the motion-blurred image such that the visual hint of virtual scene brightness remains. This bloom is computed prior to motion blur using the original pre-tone-mapped full-dynamic-range color. The bloom must be diffuse enough that the motion length would not have caused a perceptual visual difference in the bloom if it were computed using traditional motion blur. Bloom in this case is more like a colored graduated filter, or like fog on the glass.

There are other interesting ideas around this, but that is all I have time for this time ...

20160721

Why Motion Blur Trivially Breaks Tone-Mapping

Some food for visual thought ...

You have a scene with a bunch of High Dynamic Range (HDR) light sources, or bright secondary reflections. Average Picture Level (APL) of the scene as displayed is relatively low, which is standard practice and expected behavior for displayed HDR scenes. Because the display cannot output the full dynamic range of the scene, the scene is tone-mapped before display, and thus the groups of pixels representing highlights appear not as bright as they should.

Now the camera moves, and the scene gets motion blur applied in-engine prior to tone-mapping. All of a sudden the scene gets brighter. The APL increases. In fact, the larger the motion, the brighter the scene gets.

You, yes you the reader, have seen this before in many games. And technically speaking the game engine isn't doing anything wrong.

What is happening is that those small bright areas of the scene get blurred, distributing their energy over more pixels. Each of these blurred pixels has less intensity than the original source. As the intensity lowers, it falls further away from the aggressive areas of highlight compression, approaching tonality which can be reproduced by the display. So the APL increases, because less of the scene's output energy is being limited by the tone-mapper.
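
A toy worked example, using a Reinhard-style curve t(x) = x/(1+x) purely for illustration (not any particular engine's tone-mapper):

// t(x) = x / (1 + x), an illustrative highlight-compressing tone-mapper.
float Tonemap(float x) { return x / (1.0f + x); }
// One pixel of scene intensity 100 surrounded by 9 black pixels:
//   displayed energy = t(100)     = 0.990
// Motion blur spreads that energy over all 10 pixels (10 each) BEFORE tone-mapping:
//   displayed energy = 10 * t(10) = 9.091
// The scene energy is identical in both cases, but the displayed sum is ~9x
// higher, so APL rises as blur length grows.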

The irony in this situation is that as motion and motion blur increase, the scene actually gets closer to its correct energy-conserving visual representation as displayed.

Note the same effect applies as more of the scene gets out of focus during application of Depth of Field.

Re Twitter: Thoughts on Vulkan Command Buffers

Because twitter is too short ...

API Review
In Vulkan the application is free to create multiple VkCommandPools each of which can be used to allocate multiple VkCommandBuffers. However effectively only one VkCommandBuffer per VkCommandPool can be actively recording commands at a given time. The intent of this design is to avoid having a mutex when command buffer recording needs to allocate new CPU|GPU memory.

Usage/Problem Case
The following hypothetical situation is my best understanding of the usage case as presented in fragmented max 140 character messages on twitter. Say one had a 16-core CPU, where each core did a variable amount of command buffer recording. The application will need at a minimum 16 VkCommandPools in order to have 16 instances of command buffer recording going in parallel (one per core). Say the application has a peak of 256 command buffers generated per frame, and cores pull a job to write a command buffer from some central queue. Now given CPU threading and preemption is effectively random, it is possible in the worst case that only one thread on the machine has to generate all 256 command buffers. In Vulkan there are two obvious methods one could attempt to manage this situation,

(1.) Could pre-allocate 256 VkCommandBuffers on each of the 16 VkCommandPools, resulting in 4096 VkCommandBuffer objects total. Unfortunately AMD's Vulkan driver currently has a higher than desired minimum memory allocation for each VkCommandBuffer. On the plus side there is an active bug, number 98777 (if you want to reference this in an email to AMD), for resolving this issue.

(2.) Could alternatively allocate then free VkCommandBuffers at run-time each frame.

Once bug 98777 is resolved with a driver fix, option (1.) would be the preferred solution from the above two options.

Digging Deeper
Part of what concerns me personally about this usage case is that it implies building an engine where a VkCommandPool is effectively pinned to a specific CPU thread, and then randomly asymmetrically loading each VkCommandPool! For example, say in the typical case each CPU thread builds on average the same amount of command buffers in terms of CPU and GPU memory consumption. In this mostly symmetrical load pattern, the total memory utilization of each VkCommandPool will be relatively balanced. Now say at some frequency one of the threads, chosen randomly, and its associated VkCommandPool, is loaded with 50% of the frame's command buffers in terms of memory utilization. If VkCommandPools "pool" memory and keep it, then over time each VkCommandPool would end up "pooling" 50% of the memory required for all the frame's command buffers, which in this case would be roughly 8 times what is required.

This problem isn't really Vulkan specific; it is a fundamental problem for anything which does deferred freeing of a resource. The amount of over-subscription under a random asymmetrical load is a function of the delay before the deferred free, which ultimately becomes a balancing act between the run-time or synchronization cost of dynamic allocation and the extra memory required.

Possible Better Solution?
Might be better to un-pin VkCommandPool from the CPU thread. Then instead use a few more VkCommandPools than CPU threads, and have each CPU thread grab exclusive access to a random VkCommandPool at run-time to build command buffers for jobs until a set timeout, at which point it releases the given VkCommandPool and then chooses the next free one to start work again. Note there is no mutex here for pool acquire/release, but rather lock-free atomic access to a bit array in, say, a 64-bit word (a sketch follows below).
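
A minimal sketch of that acquire/release using C11 atomics (assuming at most 64 pools; __builtin_ctzll is the GCC/Clang count-trailing-zeros builtin, and all names are mine):

#include <stdatomic.h>
#include <stdint.h>
// One bit per VkCommandPool: 1 = free, 0 = in use.
static _Atomic uint64_t poolFreeBits = UINT64_MAX;
// Try to grab any free pool; returns the pool index, or -1 if none are free.
int PoolAcquire(void)
{
 uint64_t bits = atomic_load(&poolFreeBits);
 while (bits)
 {
  int idx = __builtin_ctzll(bits);      // lowest set bit = first free pool
  uint64_t mask = 1ull << idx;
  if (atomic_compare_exchange_weak(&poolFreeBits, &bits, bits & ~mask)) return idx;
  // CAS failure reloads "bits" with the current value; just retry.
 }
 return -1;
}
void PoolRelease(int idx) { atomic_fetch_or(&poolFreeBits, 1ull << idx); }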

In this situation, assuming CPU/GPU memory overhead for a command buffer scales roughly with CPU load of filling said command buffer, regardless of how asymmetrical the mapping is of jobs to CPU threads, the VkCommandPools get loaded relatively symmetrically.

Another thing about CPU threading which is rather important, IMO, is that the OS will preempt CPU threads randomly after they have taken a job, which can cause random pipeline bubbles. As long as this is a problem, it might be desirable to preempt the OS's preemption and instead manually yield execution to another CPU thread at a point which ensures no pipeline bubbles (i.e. after finishing a job and releasing a lock on a queue, etc). The idea is to transform the OS's perception of the thread from a "compute-bound" thread (something which always runs until preemption) to something which looks like an interactive "IO-bound" thread (something which ends in self-blocking). Maybe it is possible to do this by having more worker threads than physical/virtual CPU threads, waking another worker, then blocking until woken again. Something to think about...

Transferring Command Buffers Across Pools?
I'll admit here I've been so Vulkan focused that I'm currently out of touch with how exactly DX12 works. The twitter claim seems to be that the Vulkan design is fundamentally flawed because a VkCommandBuffer is locked to a VkCommandPool at allocation time, instead of being set at begin-recording time like DX12. This sounds to me the same as option (2.) at the top of this post: it effectively makes "Allocate" and "Free" very fast for command buffers in a given pool, just with "Allocate" now effectively being "Begin Recording" in the DX12 model, meaning the work is just shuffled around to different API entry points. Assigning the Pool at "Begin Recording" time does nothing to solve the asymmetric Pool loading problem caused by the desire to have Pools pinned to CPU threads for this usage case.

Baking Command Buffers - And Replaying
As the number of command buffers increases, one is effectively factoring out the sorting/predication of commands which would otherwise be baked into one command buffer, and deferring that sorting/predication until batch command buffer submit time. As command buffer size gets smaller, it can cross the threshold where it becomes more expensive to generate the tiny command buffers than to cache them and just place them into the batch submit. So if, say, one had roughly 256 command buffers covering effectively everything outside of shadow generation and drawing (meaning everything from compute-based lighting through post-processing), it is likely better to just cache baked command buffers instead of always regenerating them.

My personal preference is effectively "compute-generated graphics": rendering with compute only, mixed with fully baked command buffer replay (no command buffer generation after init time), and indirect dispatch to manage adjusting the amount of work run per frame ...

20160715

LED Displays

Gathering information to attempt to understand what is required to drive indoor LED sign based displays...

Target
256x128 2:1 letter box display (NES was 256 pixels wide).

How do LED Modules Work?
Adafruit provides one description of how to drive a 32x16 LED module. Attempting a rough translation: LEDs are either on or off. The 32x16 panel can only drive 64 LEDs at one time, organized as two 32x1 lines 8 rows apart. Scanning starts with lines {0,8}, then {1,9}, then {2,10}, and so on.

Panels are designed to be chained, driven by a 16-bit connector which provides 2 pixels per clock (one pixel for top and one for bottom scan-line). Looks like some other grouped LED panels go up to 128x128, driven by 4 row chunks of 128x32, each built from two chained 64x32 panels. Seems like the 64x32 panels are driven with 2 lines of 64 pixels (based on the addition of one extra address bit). Could not find a good description of chaining yet.

Seems like the 64x32 panels have roughly a 1/16 duty cycle (meaning only 1/16 of the LEDs are active at any one time). LED displays are low-persistence, high-frame-rate displays with binary pixels. Based on this thread they can drive one cable at 40 MHz. So for a 128x128 panel with 4 cables (each cable feeding a 128x32 chunk at 2 pixels per clock), that is roughly 80M pixels/sec / (128*32 pixels per sub-frame) = 19.5 thousand frames per second.

Using basic Pulse Width Modulation (PWM) to modulate brightness would transform this low-persistence display into something effectively scan-and-hold, just with a lot of micro-strobed sub-frames doing PWM across the effective "scan-and-hold" period. Getting something truly low-persistence is more of a challenge. These displays can be over 1500 nits (even with a 1/16 duty cycle). So one option for lower persistence is to actually insert black frames between frames, dropping the scan-and-hold time.

A 120 Hz frame rate provides 8.333 ms of frame time; switching to half black frames would drop that to 4.16 ms (which isn't yet low persistence IMO), and would reduce to a 750 nit display (half the contrast), leaving roughly 80 or so sub-frames for PWM.

A 240 Hz frame rate at half black frames could be the right compromise between lost contrast and low persistence. A 480 Hz frame rate with no black frames might be able to provide full contrast, and low enough persistence, but likely would need some seriously good temporal dithering.
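
The sub-frame budget from the paragraphs above, written out (a sketch using the assumed numbers, not measured hardware):

int pixelsPerSecondPerCable = 40000000 * 2;        // 40 MHz clock, 2 pixels per clock
int pixelsPerSubFrame       = 128 * 32;            // each cable feeds a 128x32 chunk
int subFramesPerSecond      = pixelsPerSecondPerCable / pixelsPerSubFrame; // ~19.5k
int subFramesPerFrame       = subFramesPerSecond / 120;  // ~162 at 120 Hz
int pwmSubFrames            = subFramesPerFrame / 2;     // ~81 left after half black-frame insertion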

20160706

Low Cost Branching to Factor Out the Loop Exit Check

Threading to hide pipeline depth, combined with an ISA which makes branching cheap, is one goal. Specifically: absolute branching with the immediate destination in the opcode word (single-word branch/call, no adder), and instructions which include a return flag (making returns free). This enables easy computed branches, both for jump tables and for loops. The loop exit check can be factored out into a hierarchical call tree,

Do4: Unroll work four times; Return;
Do16: Call Do4; Call Do4; Call Do4; Jump Do4;
Do64: Call Do16; Call Do16; Call Do16; Jump Do16;
... etc ...


Can use a computed branch to jump into the tree for other loop counts.

20160705

CPU Threading to Hide Pipelining

If a CPU has a 4 stage pipeline, it would be nice to have 4 CPU threads round-robin scheduled, to ensure pipeline delays for {memory, ALU, branches, etc} do not have to be programmer-visible, and to avoid complexities such as forwarding.

According to docs, Xilinx 7-series DSPs need a 3 stage pipeline for full speed MAC, and BRAMs (Block RAMs) need 1 cycle delay for reads.

Working from high level constraints, I'm planning on using the following for the CPU side of the project,

16-bit or 18-bit machine
1 DSP
X BRAMs of Instruction RAM (2 ports, read or write for either)
Y BRAMs of Data RAM (2 ports, read or write for either)


Which suggests the following 4 stage pipeline (with 4 CPU threads running in parallel, one on each pipeline stage),

[0]
Instruction BRAM Read -> Instruction BRAM Registers
DSP MUL -> DSP {M,C} Registers (from prior instruction)

[1]
Instruction Decode
DSP ALU -> DSP {P} Registers (from prior instruction)

[2]
Data BRAM Write(s) (results from prior instruction)
Data BRAM Read(s) -> Data BRAM Registers

[3]
DSP Input -> DSP {A,B,D} Registers


With an ISA which can do something as complex as the following (below) in one instruction. A focus on instruction forms which can leverage both ports on the Instruction BRAMs (opcode and separate optional immediate), as well as both ports on the Data BRAMs. Using dedicated address registers to provide windows into the Data BRAMs for immediate access instead of a conventional register file, and leveraging a special high-bit-width accumulator to maintain precision of intermediate fixed-point operations.

[addressRegister[2bitImmediate]^nbitImmediate] = accumulator;
accumulator = [addressRegister[2bitImmediate]^nbitImmediate] OP 18bitImmediate;

Relative Addressing With XOR - Removing an Adder

Could be an interesting compromise: use XOR instead of ADD for relative addressing to remove an adder. Specifically,

Address = AddressRegister XOR Immediate

This forces the programmer to keep the address register on some power-of-two alignment. With caches and/or parallel access to banked memory, this would be bad. But it is likely fine for a core with a private memory, and code written in assembly.
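
A quick C illustration of that constraint (names are mine):

#include <assert.h>
// If "base" is aligned to 2^k and "imm" fits in k bits, XOR and ADD produce the
// same address, so the adder can be dropped from the address path.
unsigned XorAddress(unsigned base, unsigned imm, unsigned k)
{
 assert((base & ((1u << k) - 1u)) == 0u); // base aligned to 2^k
 assert(imm < (1u << k));                 // immediate fits in k bits
 unsigned viaXor = base ^ imm;
 unsigned viaAdd = base + imm;
 assert(viaXor == viaAdd);                // identical under these constraints
 return viaXor;
}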

20160704

SpartanMC

SpartanMC - An FPGA soft core with an 18-bit word size, with a SPARC-like sliding register window.

"Forth" of July Reboot

Used the nuclear option on the blog, starting over, synchronizing with an internal reboot, an attempt to completely refocus personal hobby time on FPGA based hardware design. This blog serving as a place to collect thoughts and ideas as I stumble towards something to synthesize...