NVIDIA's Shield Tablet

Shield Tablet
Assuming I can replace Android with Linux, get updated GL drivers for Linux, and use a Bluetooth keyboard/mouse in Linux, this is the first tablet I've ever considered getting. Simply because I could use it as a portable development device for GL4 PC content, and with Linux I could do everything on it that my desktop can do (just at lower perf). Comparing the K1 GPU to a GTX 560 Ti (what I have at home),

       GFLOPS  GB/s  GTex/s  GPix/s
560 Ti 1263    128   52      26
K1     365     17    7.6     3.8

ratio  3.5     7.5   6.8     6.8

Really impressive given the power difference.


Body Hacking: Running on Oil

Now for something quite serious, completely off topic from what I normally write about on this blog.

In terms of health, I've had no real problems, now working towards my 37th year. Not everyone I know well has been so lucky. Attempting to help a close friend or family member through a challenging health related issue can be a combination of the hardest and yet most meaningful and rewarding experiences. It might require placing your life on hold, living in an ICU room in the hospital, and attempting to learn everything about topics you either ignored your entire life (like being healthy) or specifics of something which even doctors still know little about. Reading medical journals is quite a bit different than reading papers from SIGGRAPH. Ultimately, through this experience, you might end up learning something critical for your own well being as well.

Health care in the US is interesting. Given that doctors are both licensed and often sued, they frequently do not have the flexibility to do anything other than the standard of care, even if that standard of care does not work, or even if there are alternative methods of treatment which have had great success in individuals but have not been proven by case studies. Likewise, insurance will flat out refuse to pay for anything other than the standard of care. The case studies are where it gets more interesting: the standard of care is based on case studies, which are limited to things which can get funding, almost always from major drug companies.

This model is massively broken: often anything which can have an instrumental effect on health or prognosis, but is not profitable for drug companies, never gets into the standard of care. One obvious example: diet, more on this later. It gets better. If there is a treatment which is proven to work in some individuals in other countries, or even proven in early case study results in the US, the US will not allow you to even choose to use the treatment. It is flat out illegal, even if you have a terminal disease and the treatment might be critical to survival. On top of this, case studies are very selective about who they will allow in the study. Something like prior treatment choices can easily disqualify you.

For those who are lucky enough to get into a case study, for something that is critical to long term survival, they often have a 50% chance of being placed in the placebo group (no treatment). On top of this, the majority of case studies that I have read about are strictly controlled: either no treatment, or only one treatment. Rarely any moderate combined treatment. Case study participants don't have the option of adding extra treatments, as that would drop them from the study. Often this runs on the failed assumption that a cure to a certain problem can be found with, say, one new drug. The reality of the situation is that the rare people who survive terminal illnesses survive not through one thing, and no conventional case study would ever test the combination of things they did to survive. The end result is that case studies often basically ensure early death for people in the placebo group.

In summary: in the US, if you are facing a terminal illness, this country would rather ensure you die than allow you, by your own choice, to try something which might save your life.

As a patient, some of the most critical aspects of treatment end up being your responsibility: diet, supplements, exercise, emotional well being, etc, all of which can have a profound outcome on long term prognosis.

If you are in chemo treatment and ask what you should eat, there is no good answer, because it isn't part of standard of care, because no human case studies have really explored this in combination with the standard treatment. If you walk into the radiation wing of the Duke Hospital, as a cancer patient, they will give you a free donut on certain days of the week. There is great irony in this, in that high glucose levels directly accelerate cancer growth.

The defining characteristic of nearly all cancers is impaired cellular energy metabolism. In the presence of oxygen, normal cells leverage oxidative phosphorylation for energy (respiration). Cancer cells on the other hand have mitochondrial defects which force a switch to less efficient forms of energy metabolism, like glycolysis, which is the breakdown of sugar (fermentation). In this case, cancer cells are relying on glucose to function. Another side effect of damaged mitochondria is that cancer cells also rely on higher levels of glucose to repair free radical damage.

Humans evolved a second energy pathway which the body switches over to when it runs low on glucose. This second pathway is fueled by ketones, which are produced in the liver. Ketones can cross the blood brain barrier and also fuel the brain. When the body switches into ketosis, it is actively converting fat into ketones. Cancer cells, thanks to their mitochondrial defects, cannot function on ketones.

One interesting aspect of cancer is that cancer cells seem to be generated all the time in the body: a regular process in which damaged cells lose the ability to self-terminate. Instead the cancer cells live on and start replicating at will. Normally the body keeps them in check and kills off the damaged cells. In this war, when the cancer cells get the upper hand over the immune system, that is what we label as "cancer". The plan of the conventional treatment options is basically to selectively (via radiation) and universally (via chemo) destroy the body. During this extreme stress the cancer cells typically die out faster than the normal cells. My understanding is that chemo only works when the agent is in the cell during division. Since cancer cells divide faster than normal cells, cancer cells die out faster. Healthy and young people can often bounce back after treatment.

The most common problem in cancer treatment is that later in life (sometimes right after treatment) the cancer grows back, typically much more aggressive than before. The problem being that the first treatment selectively removed the weaker cancer cells, leaving the cancer cells which adapted best to survive under conventional treatment. Often after treatment, patients return to the very life choices which likely enabled the cancer to win the war in the first place. Doctors have no idea how to treat the root cause of the problem.

For aggressive cancers, like say high grade brain cancer, conventional treatment usually just extends life by a matter of months or maybe a few years. However, out of this group of people labeled with high grade terminal cancers, there are some cases of complete recovery. People who return to regular life and never have problems again. The medical community largely ignores these cases, or even refuses to invest in any kind of research into why they survived. Sometimes these are the same people who, when given a terminal prognosis, decided to skip conventional treatment and do something different.

As a graphics developer, if I learn about the possibility of doing the impossible by seeing just one example (even if it is only partially working), I'm all over it, attempting to learn anything about it, attempting to find out how it works, and then attempting to do a better job. This is my contribution to the field. Something the next generation can work from and in turn end up doing a much better job than me. However, for the medical community, with the existence of people over the past generation beating terminal cancers, where is the insane push for understanding why? If I were in cancer research, why would I waste my time researching the conventional standard treatment which fails for the majority if not all terminal cancers? The only logical conclusion is to research those who have successfully survived.

While I don't feel like spending the time to mark up the huge number of things I've been referencing to make this post, I will provide a few links to stories about some of these people. Dr. Fred Hatfield (Dr. Squat): a well known world powerlifting champion, very technically oriented; I used to use one of his books as a guide when powerlifting in my 20s. He had 3 months to live because of widespread metastatic cancer. No conventional treatment, survived. Joe Mancaruso: active in Brazilian Jiu Jitsu with only one functioning lung, surviving terminal lung cancer. Ben Williams, PhD: surviving an extremely aggressive terminal brain cancer. Just the stories surrounding these people's fight and will to live are worth reading as a source of motivation.

One common factor in a bunch of the cases I could find was leveraging ketosis to place the cancer under severe metabolic stress (starving the cancer). Normal people have a slightly elevated amount of ketones in the morning after fasting through a night's sleep. After breakfast, glucose spikes and the body is again fully adapted to run from sugar. It is possible through diet control to switch over to running in a much deeper ketosis all the time. In fact some societies, like the Eskimos or Maasai, had natural diets which placed them in a constant state of ketosis. This is the path some cancer survivors have been leveraging.

In order to help someone else who wanted to try this form of not-medically-accepted metabolic treatment for cancer, early this year I decided to run an experiment and try to shift my body into a constant state of ketosis. I didn't try the fast path of multi-day fasting. Instead I just shifted my diet, from {mostly carbs, then protein}, to {mostly fat, with a controlled amount of protein (because too much protein gets converted into glucose), and almost no carbs}. It takes roughly a month before the body converts over. The process is described by most people as horrible, because the body goes through an extremely strong set of sugar withdrawal symptoms. The next challenge is getting used to running on oil. In order to consume the quantity of fat required by the diet, it is very common to supplement with drinking raw coconut oil (this is what I do), as coconut oil is easily converted to ketones. Yes, the majority of people exit the diet quickly, as it is hard to tolerate.

Roughly six months later, I'm still on the diet, and so is the person I referenced before, and that person is having great success thus far. After trying it, I've decided to just adopt it as a change to my lifestyle to promote better long term health. There are a lot of other benefits to being in ketosis. For example, especially if you like sports and activities which leave you injured, ketosis naturally reduces inflammation. The lack of sugar and carbs in the diet also results in improved teeth health (quite a difference in experience at the dentist after being in ketosis). I do cheat on carbs once in a while on business trips or business dinners, but I've found after this long on the diet that my body seems to fall back into ketosis quite quickly. My morning glucose levels are around 55-65 mg/dL, which is below the point at which most people feel sick. Under ketosis however it just seems normal. I can still successfully do hardcore Kali training on the diet, and overall my endurance has actually improved (I'm always in the endurance training "fat" burning state). However I did lose around 35 lbs, most of that being fat, but a lot of muscle too (wasn't lifting any more either). Ketosis basically eats your body fat, and you lose your desire to eat lots of food on the diet (no insulin spikes).

Anyway, if there is anything to learn from this post, it is for when you or someone you know faces an obstacle which appears to be the end of the line: give it everything you have, be willing to think outside conventional wisdom, and I wish you the best of luck!


Bad Industry Humor: Computer Engineering Hall of Shame

4K HDTV - Timing is critical, need to sell these 4K TVs before the next generation of 300 dpi smart-phone-addicted permanently-nearsighted kids realize the TV is not in focus when sitting on the couch. That way we can use the proceeds from the 4K gold rush to invest in the Lasik industry to ready the next generation for 8K. Which needs to happen before Google's self driving cars program takes off, because after that no one will need to pass an eye exam.

Loading Libraries Outside The Lower 32-bit Address Space on a 64-bit Machine - Forced 64-bit address for any library call, because given the slop in modern software engineering, looks like we are going to need over 4GB of executable code really soon.

Operating Systems Without Raw Human Input Interfaces for Full Screen Applications - Because we need to protect the world from viruses and hackers which read PS3 controller input just in case you might want to enter your etrade password using a game controller instead of the keyboard.

Dynamic Linking Everything - At 20 GB/$ for drive space, saving a few cents by not static linking helps ensure my hundred other Linux comrades can fork this beautiful package manager, then promptly dump it after they get infatuated by forking something else.

Sendfile in Linux 2.6.x - Because syscall backwards compatibility does not matter now that we have these great package managers, and because only an idiot would actually want OS support for zero-copy file copy.

Graphics Drivers 100's of MB in Size - Because we are secretly getting into the game distribution business, one set of shaders at a time.

ELF - If "hello world" is not bigger than 4KB, then clearly the system is not complicated enough.

Compositing Window Managers - Because like bell-bottoms, it is going to be a while before rectangular windows without animation or transparency come back into fashion.

Systems Mounting Everything Read-Write with Hundreds of Unknown OS Background Jobs - Because after manufacturing has left the country, the US is readying for the days when the only export it has left is outsourcing its massive IT and security infrastructure.

Cloud Serving Games - Because with years of desensitization to dropped frames, targeting and missing 30 Hz, non-game-mode HDTVs, and 1080-pee Youtube, this newer generation can no longer realize the difference.

Chains of Random Access First Read Page Faults When First Loading Large Dynamically Linked Applications - Because virtual memory is super important to handle the complexity we built on after designing for machines with only 1MB of memory.

Computing Devices Which are Not Allowed to Have Compilers or Desktops - Because it is important for a device to be limited to just one purpose, like making phone calls. And besides we don't have any patents useful for extorting the wireless keyboard and mouse industry.

HDR HDTVs - If we play our cards right, we can put the tanning bed industry out of business.

Private Class Members - After successfully factoring out the ability to understand the code via awesome feats of abstraction and templatization, I need to protect it from being used or changed.

ELF - Why use a 32-bit index for a symbol? We are CS majors, and we need to put all this advanced string management and hashing knowledge into something. Of course we slept through the factoring part of basic algebra: why do at compile time on one build machine what you can do at run-time on every phone instead!

Software Complexity - Is like peeing into a small pool, when only one person does it, it is awful, but now that everyone pees in the pool, children just grow up thinking that pool water was always yellow.

Phones 2084 - Government mandates loan insurance to cover the loan parents take out to cover the IP patent pool amortized into the cost of the device required for their children to connect to the internet.

Graphics Conferences 2099 - We charge people early just for thinking about giving a talk at the conference, but that is ok because we simultaneously send out spam for thought insurance so you do not have to worry about having a thought which has already been patented.


Infinite Projection Matrix Notes

If you are reading this and have to deal with GL/GLES vendors not supporting DX style [0 to 1] clip space, please talk to your hardware/OS vendors and ask for it!!

Clip Coordinates (CC)
Output of the vertex shader,
GL: gl_Position
DX: SV_Position

Normalized Device Coordinates (NDC)
The following transform is done after the vertex shader,

NDC = float4(CC.xyz * rcp(CC.w), 1.0);

On both GL and DX NDC.xy are [-1 to 1] ranged. On DX NDC.z is [0 to 1] ranged. On GL NDC.z is [-1 to 1] ranged and this can cause precision problems (see below in Window Coordinate transform). Anything outside the range is clipped by hardware unless in DX11 DepthClipEnable=FALSE, or in GL glEnable(GL_DEPTH_CLAMP) is used to clamp clipped geometry to the near and far plane.
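The divide above is trivial, but for concreteness here is a minimal C sketch of it (the struct and helper name are mine, not any API's):

```c
#include <assert.h>

typedef struct { float x, y, z, w; } Float4;

/* Perspective divide: clip coordinates to NDC (sketch, not any API). */
static Float4 clip_to_ndc(Float4 cc)
{
    float rcpW = 1.0f / cc.w;
    Float4 ndc = { cc.x * rcpW, cc.y * rcpW, cc.z * rcpW, 1.0f };
    return ndc;
}
```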

Window Coordinates (WC) : DX11
The following transform is done in hardware,

float3 WC = float3(
NDC.x * (W* 0.5 ) + (X + W*0.5),
NDC.y * (H*(-0.5)) + (Y + H*0.5),
NDC.z * (F-N) + N);

With parameters specified by RSSetViewports(),

X = D3D11_VIEWPORT.TopLeftX;
Y = D3D11_VIEWPORT.TopLeftY;
W = D3D11_VIEWPORT.Width;
H = D3D11_VIEWPORT.Height;
N = D3D11_VIEWPORT.MinDepth;
F = D3D11_VIEWPORT.MaxDepth;

Fractional viewport parameters for X and Y are supported on DX11. DX10 does not support fractional viewports, and DX11 feature level 9 implicitly casts to DWORD internally, ensuring fractional viewports will not work. Not sure if DX11 feature level 10 supports fractional viewports or not. Both N and F are required to be in the [0 to 1] range. It is better to not use the viewport transform to modify depth, and instead fold any transform into the application projection shader code. For best precision, N should be zero.

Window Coordinates (WC) : GL
The following transform is done in hardware,

float3 WC = float3(
NDC.x * (W*0.5) + (X + W*0.5),
NDC.y * (H*0.5) + (Y + H*0.5),
NDC.z * ((F-N)*0.5) + (N+F)*0.5);
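For reference, both hardware viewport transforms above can be sketched side by side in C (helper names are mine, not any API's):

```c
#include <assert.h>

typedef struct { float x, y, z; } Float3;

/* DX11-style viewport transform: NDC.z in [0,1], window Y flipped. */
static Float3 wc_dx(Float3 ndc, float X, float Y, float W, float H,
                    float N, float F)
{
    Float3 wc = {
        ndc.x * (W *  0.5f) + (X + W * 0.5f),
        ndc.y * (H * -0.5f) + (Y + H * 0.5f),
        ndc.z * (F - N) + N };
    return wc;
}

/* GL-style viewport transform: NDC.z in [-1,1]. */
static Float3 wc_gl(Float3 ndc, float X, float Y, float W, float H,
                    float N, float F)
{
    Float3 wc = {
        ndc.x * (W * 0.5f) + (X + W * 0.5f),
        ndc.y * (H * 0.5f) + (Y + H * 0.5f),
        ndc.z * ((F - N) * 0.5f) + (N + F) * 0.5f };
    return wc;
}
```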

Window Coordinates (WC) : OpenGL 4.2 and OpenGL ES 2.0
These versions of GL have the following form to specify input parameters,

glDepthRangef(GLclampf N, GLclampf F); // ES and some versions of GL
glDepthRange(GLclampd N, GLclampd F); // GL
glViewport(GLint X, GLint Y, GLsizei W, GLsizei H);

Note the inputs to glDepthRange*() are clamped to [0 to 1] range. This ensures NDC.z is biased by a precision destroying floating point addition. The defaults N=0 and F=1 result in,

WC.z = NDC.z * 0.5 + 0.5;

Which, when computed with standard 32-bit floating point, I believe in theory has only exactly enough precision for 24-bit integer depth buffers.
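This precision destruction is easy to see directly in C: distinct NDC.z values near zero collapse to the same depth once biased into [0,1], because near 0.5 an fp32 ulp is only 2^-24 (the helper below is just the default GL mapping):

```c
#include <assert.h>

/* GL default depth range (N=0, F=1): WC.z = NDC.z * 0.5 + 0.5. */
static float gl_default_depth(float ndcZ)
{
    return ndcZ * 0.5f + 0.5f;
}
```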

Window Coordinates (WC) : OpenGL 4.3 and OpenGL ES 3.0
These versions of OpenGL dropped the clamp type resulting in,

glDepthRangef(GLfloat N, GLfloat F);

Both specs say, "If a fixed-point representation is used, the parameters n and f are clamped to the range [0;1] when computing zw". So in theory for floating point depth buffers the following can be specified,

glDepthRangef(-1.0f, 1.0f);

Which results in no precision destroying floating point addition in the Window Coordinate transform,

WC.z = NDC.z * 1.0 + 0.0;

However at least some vendor(s) still do the clamp anyway. To get around the clamp in GL one can use GL_NV_depth_buffer_float which provides glDepthRangedNV() which is supported by both AMD and NVIDIA.

Projection Matrix
General form,

X 0 0 0
0 Y 0 0
0 0 A 1
0 0 B 0

The A=0 case provides the highest precision. The next highest precision comes from A=1 or A=-1; otherwise ideally choose A to have an exact representation in floating point. In the low precision cases, it is better for precision to break the model view projection matrix into two parts, for an increase of just one scalar multiply accumulate operation overall,

// Constants
float4 ConstX = ModelViewMatrixX * X;
float4 ConstY = ModelViewMatrixY * Y;
float4 ConstZ = ModelViewMatrixZ;
float2 ConstAB = float2(A, B);

// Vertex shader work
float3 View = float3(
dot(Vertex, ConstX) + ConstX.w,
dot(Vertex, ConstY) + ConstY.w,
dot(Vertex, ConstZ) + ConstZ.w);

float4 Projected = float4(
View.xy,
View.z * ConstAB.x + ConstAB.y,
View.z);

Projection Matrix : Infinite Reversed (1=near, 0=far)
This is both the fastest and highest precision path. For DX, or GL using glDepthRangedNV(-1.0, 1.0),

X 0 0 0
0 Y 0 0
0 0 0 1
0 0 N 0

This can be optimized to the following,

// Constants
float4 ConstX = ModelViewMatrixX * X;
float4 ConstY = ModelViewMatrixY * Y;
float4 ConstZ = ModelViewMatrixZ;
float ConstN = N; // Ideally N=1 and no constant is needed.

// Vertex shader work
float4 Projected = float4(
dot(Vertex, ConstX) + ConstX.w,
dot(Vertex, ConstY) + ConstY.w,
ConstN,
dot(Vertex, ConstZ) + ConstZ.w);
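As a quick sanity check of the matrix above, here is a depth-only C sketch (X/Y terms omitted, N is a made-up near distance): post-divide depth is N/viewZ, giving 1 at the near plane and tending to 0 at infinity.

```c
#include <assert.h>

/* Reversed infinite projection, depth only: CC.z = N, CC.w = viewZ,
   so NDC.z = N / viewZ (1 at the near plane, -> 0 at infinity). */
static float reversed_infinite_depth(float viewZ, float N)
{
    float ccZ = N;      /* from the (0 0 N 0) row        */
    float ccW = viewZ;  /* from the (0 0 0 1) row        */
    return ccZ / ccW;   /* perspective divide            */
}
```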

For GL without glDepthRangedNV(-1.0, 1.0),

X 0 0   0
0 Y 0   0
0 0 -1  1
0 0 2*N 0

Which can be optimized to the following,

// Constants
vec4 ConstX = ModelViewMatrixX * X;
vec4 ConstY = ModelViewMatrixY * Y;
vec4 ConstZ = ModelViewMatrixZ;
float ConstN = 2.0 * N; // Ideally N=1 and no constant is needed.

// Vertex shader work
vec4 Projected;
Projected.w = dot(Vertex, ConstZ);
Projected.xyz = vec3(
dot(Vertex, ConstX) + ConstX.w,
dot(Vertex, ConstY) + ConstY.w,
ConstN - Projected.w);



VR Topics: Racing Scan-Out + Filtering/Noise


(1.) [A] Do view independent work.

(2.) Read latest prediction of head position and orientation for the time at which the frame gets displayed. This is reading from a client side persistent mapped buffer, then writing into a uniform buffer on the GPU. A real-time background CPU job is updating the prediction each time new sensor data arrives.

(3.) [B] Do view dependent work which all rendering depends on.

(4.) Render frame into the front buffer, racing scan-out. Must render in the coarse granularity order in which the front buffer gets scanned out. Given that raster order is vendor dependent, this process involves splitting the frame into some number of stacked blocks (where block width is the width of the frame). Each block gets rendered independently in scan-out order. Blocks must be large enough that they can fully fill the GPU with work.

Racing scan-out might be good for a little over a half frame latency reduction in practice.
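A toy sketch of the per-block schedule implied above (all parameters hypothetical): with equal-height blocks, scan-out reaches block i at i/blocks of the refresh period, and rendering of block i must finish just before that.

```c
#include <assert.h>

/* Time (ms from start of scan-out) at which scan-out begins reading
   block i, for `blocks` equal-height blocks and a refresh period of
   `frameMs`. Rendering of block i must be done before this deadline
   (minus whatever safety margin calibration dials in). */
static float block_scanout_start_ms(int i, int blocks, float frameMs)
{
    return frameMs * (float)i / (float)blocks;
}
```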

Below is an overly simplified example display frame (timing is made up, I removed v-blank, etc). Eight blocks are used. Refresh rate is 8 ms/frame, so scan-out per block is 1 ms. Prediction is updated every millisecond. The display flashes 1 ms after scan-out finishes, and stays lit for just 2 ms, followed by 6 ms of darkness. A frame has 4/3 ms of view independent work and 4/3 ms of view dependent work before rendering. Each block takes 2/3 ms to draw. The numbers are all made up to enable drawing an easy ASCII diagram of what is going on,

444555666777000111222333444555666777000111222333 scan-out
__AAAABBBB0011223344556677______________________ GPU work for one frame
__AAAA__________________________________________ view independent work
_____||_________________________________________ read prediction
______BBBB______________________________________ view dependent work
__________00____________________________________ GPU work for block 0
____________000_________________________________ scan-out for block 0
________________________77______________________ GPU work for block 7
_________________________________777____________ scan-out for block 7
_______________________________________XXXXXX___ global display
___PPP__________________________________________ prediction jitter
___---------------------------------------______ latency
444555666777000111222333444555666777000111222333 scan-out

In this made up example, total latency would be at best around 13 ms for an 8 ms scan-out (125 fps). If back-buffer rendering, latency will be longer than 18 ms for an 8 ms scan-out,

000111222333444555666777000111222333444555666777000111222 scan-out
AAAABBBB0011223344556677_________________________________ GPU work for one frame
_______________________||________________________________ swap
________________________------------------------_________ scan-out
_PPP_____________________________________________________ prediction jitter
___________________________________________________XXXXXX global display
_-----------------------------------------------------___ latency

Future Hardware Wish-list
Drive scan-out at the peak rate of the bus and sleep, instead of driving scan-out at the display rate. The faster the better.

Implementation Challenges
Assuming ray-tracing instead of rasterization (can ray-trace in the warped space), and a fully pull model based engine (just re-submit the same commands each frame), and relatively fixed cost per frame, there are still challenges left.

Synchronizing the CPU to the GPU with front-buffer rendering is not well supported by any API. Need something to stall the "read prediction" step until a set amount of time before scan-out of the next frame. For GPUs which support volatile reads which pass L2 and get through to the system bus, one could poll a time value written by a background CPU thread. Will need some way to calibrate this system, perhaps as hacky at first as the user dialing in the delay until just before any tearing happens.

Post processing must happen while rendering each block, no screen-space effects. This means bloom needs to be replaced in the common case when diffuse bloom is being used to fake atmospheric scattering effects. For quality filtering, each block must have at least a 2 pixel stencil. Dealing with chromatic aberration is the larger problem, requires an even larger stencil.

One option is to go monochrome in green only, removing the need for any chromatic aberration based filtering.

Filtering and Noise
If DK2 is around 2Mpix/frame at 75Hz, that is a lot of pixels to push. Pixel quality can be broken down into various components,

(a.) Antialiasing. Does geometry snap to pixels?
(b.) Sharpness. What is the maximum frequency of detail in the scene?
(c.) Resolution. What granularity of pixels do geometric edges move by?

With ray-marching, sharpness can be directly related to LOD, or how close one gets to the actual surface. Resolution is relatively independent of the number of rays shot per frame. With a high quality sample pattern it is easy to resolve to a frame which has more pixels than the number of rays shot per frame. With VR, I'd argue that antialiasing and resolution are more important than sharpness, because sub-pixel motion is critical for depth perception. On top of that, textures are virtually useless because they look like images painted on toys instead of real geometry. In the spectrum of options, GPU ray-tracing based methods have serious advantages over GPU raster methods simply because of the flexibility of sample distribution. With ray based methods, sharpness ends up being a function of how much GPU perf is available. Lower perf can mean a less sharp frame, but still native spatial resolution to get the same sub-pixel parallax.

I'm highly biased towards using something which feels like temporal film grain, to both remove the illusion of rendering perfection and mask rendering artifacts. Probably as a holdover from the fact that I use a Plasma HDTV as a primary monitor, I stick to grain around 1.5 pixels in diameter. Plasma HDTVs use temporal error diffusion, therefore single pixel noise can result in artifacts. Not sure yet what is best for VR, but guessing the grain should be at or slightly above the frequency of detail in the scene.

An example of low sharpness, but full resolution, high quality antialiasing, and grain:


Unreal Engine 4 "Rivalry" Demo -- Google I/O 2014

Normally I don't talk about work related stuff here, but this is quite awesome. The "Rivalry" demo is the GL ES 3.1 AEP path running the DX11 based desktop UE4 engine. Same crazy fat G-buffer, deferred shading, reflection probes, screen space reflections, temporal AA algorithm, etc. Using the scalability options that desktop has.

All of this on a tablet platform: Tegra K1.

I'm really happy that Epic has the balls to just capture the youtube video below exactly how it looks on device, WYSIWYG! They could have just dumped out still frames generated with massive super-sampling and all settings turned up to GTX Titan level, but they did not. I know this because all the artifacts of the depth of field and motion blur are there, stuff I'm hoping to get around to fixing later this year, and unfortunately, for the sake of time, my fix for the screen space reflection flash on scene changes didn't make it into that capture (it actually looks better on device).


Sad Day for High-End ARM

Nvidia Abandons 64-bit Denver Chip for Servers - I guess this was to be expected, as HPC likely only exists when highly subsidized by the volume of high-end desktop products. After Microsoft was successful in completely destroying the market for desktop ARM chips by crippling Windows on ARM (making it Metro only): no volume desktop chips = no cost competitive HPC products. The loser in this market is the consumer. 64-bit ARM for desktop would have been an awesome platform.


No Traditional Dynamic Memory Allocation

To provide some meaning to a prior tweet... Outside of current and prior day jobs, I never use traditional dynamic memory allocation. All allocations are done up front at application load. On modern CPUs the cost of virtual memory is always there, so might as well use it. At load time, allocate virtual memory address space for the maximum practical memory usage of the various data in the application. The virtual address space is initially backed by a read-only common zero-fill page (no physical memory allocated). On write, the OS modifies the page table and backs written pages with unique zero-filled physical pages. Can also preempt the OS write-fault and manually force various OSs to pre-back used virtual memory space (for real-time applications).
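On Linux this load-time reservation can be sketched with POSIX `mmap` (the helper name and sizes are mine): reserve a large virtual range once, and let the OS lazily back pages with zero-filled physical memory on first write.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserve a large virtual range once at load time. No physical memory
   is committed yet: the OS backs pages with zero-fill on first write. */
static void *reserve_arena(size_t maxBytes)
{
    void *p = mmap(NULL, maxBytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```

For the real-time case mentioned above, touching (or `mlock`ing) the pages up front pre-empts the write faults.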

The reduction of complexity, runtime cost, and development cost enabled by this practice is massive.

This practice is a direct analog to what is required to do efficient GPU programming: lay out data by usage locality into 1D/2D/3D arrays/tables/textures, with indexes or handles linking different data structures. Programs designed around transforms of data, by some mix of gather/scatter. Bits of the larger application cut up into manageable parallel pieces which can be debugged/tested/replaced/optimized individually. Capture and replay of the entire program is relatively easy. Synchronization factored into coarse granularity signals and barriers. Scaling this development up to a large project requires an architect who can lay out the high-level network of how data flows through the program. Then sub-architects who own self-contained sub-networks. Then individuals who provide the programs and details of parts of each sub-network.
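A tiny C illustration of that layout style (all names and sizes hypothetical): flat tables per data kind, sized up front, linked by index handles rather than pointers.

```c
#include <assert.h>

/* Flat tables, sized up front, linked by index handles. */
#define MAX_MESHES    1024
#define MAX_INSTANCES 65536

typedef struct { float x, y, z; } Vec3;
typedef unsigned MeshHandle; /* index into the mesh tables */

static Vec3       meshBoundsMin[MAX_MESHES];   /* one array per field  */
static Vec3       meshBoundsMax[MAX_MESHES];
static MeshHandle instanceMesh[MAX_INSTANCES]; /* link: instance->mesh */
static Vec3       instancePos[MAX_INSTANCES];
```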

My programming language of choice to feed this system is nothing standard, but rather something built around an instant run-time edit/modify/test cycle which works from within the program itself. Something similar to Forth (a tiny fully expressive programming language which requires no parsing) which can also express assembly, which can be run-time assembled into the tiny programs which process data in the application network.

Should be obvious that the practices which enable fast shader development and fast development of the GPU side of a rendering pipeline can directly apply to everything else as well, and also elegantly solves the problem in a way which scales on a parallel machine.


Filtering and Reconstruction: Points and Filtered Raster Without MSAA

Just ran across these old screen captures (open the images in a new window, they are likely getting cropped), stuff I'm no longer working on but might re-invest in sometime in the future...

Reconstruction with Points or Surfels

The above shot is an HDR monochrome image (converted to color), with a little grain and bloom, of a fractal with lots of sub-pixel geometry rendered in a 360-degree fisheye with point-based reconstruction. There is an average of 2 samples/pixel (using 1 bin/pixel but 2x the screen area). The GPU keeps a tree of the scene representation which is updated based on visibility, and then bins all the leaf nodes each frame. Finally there is a reconstruction pass which takes {color, sub-pixel offset} for all the binned points and reconstructs the scene. Reconstruction is a simple gaussian filter which does a weighted average of a neighborhood of points, with weights based on distance from the pixel center. For HDR, also pre-convert to color*rcp(luma+1) to remove the HDR firefly problem (not energy conserving), remembering to un-convert after filtering. Binning of points uses some stratified jitter so that each frame gets a slightly different collection of points for filtering. Did not use any temporal filtering. The result is a quality analog-feeling reconstruction with soothing temporal noise. Need about 2 samples/pixel for good quality if not using temporal filtering. With really good temporal filtering (see Brian Karis's talk at SIGGRAPH) only need an average of one shaded sample/pixel/frame.
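A sketch of the reconstruction weighting described above, in monochrome (so luma equals the color value): pre-convert samples by rcp(luma+1), gaussian-weight by distance from the pixel center, then un-convert. The sigma and sample values are made up for illustration.

```python
import math

def tonemap(c):
    # color * rcp(luma + 1): fireflies get compressed before filtering
    # (not energy conserving, as noted in the post).
    return c / (c + 1.0)  # monochrome, so luma == c

def untonemap(c):
    return c / (1.0 - c)

def reconstruct(samples):
    # samples: list of (dx, dy, color), offsets from the pixel center.
    # Gaussian-weighted average, weight by distance from the center.
    sigma = 0.5  # hypothetical filter width
    wsum = csum = 0.0
    for dx, dy, c in samples:
        w = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
        wsum += w
        csum += w * tonemap(c)
    return untonemap(csum / wsum)

# One 1000x "firefly" among dim samples barely moves the result:
pixel = reconstruct([(0.1, 0.0, 0.2), (-0.2, 0.3, 0.3), (0.4, -0.1, 1000.0)])
```

Without the tonemap/untonemap pair, the same three samples would average to roughly a third of the firefly's value; with it, the result stays near the dim samples.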

Binning for the above image was done using standard point raster, with everything running in the vertex shader and a near-null fragment shader. This works ok on NVIDIA GPUs, but I would not advise it for AMD. Future forward, with GPUs which have 64-bit integer min/max atomics and the ability to run a full cache-line of atomics in one clock, the binning process should be quite fast. Bin layout in cache-lines must have good 2D locality (atomics on a texture, not a buffer). Evaluate points in groups which have good spatial locality. Ideally shading operations should hide in the shadow of atomic throughput. If there is not enough work to fill that shadow, then try pre-resolving atomic collisions in shared memory. Pack pixels into 64 bits {MSB logZ, LSB color and a few bits of sub-pixel offset}. The color is stored scaled by rcp(luma+1) and uses dithering in the conversion.
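A sketch of why that packing works with a single min atomic: with quantized logZ in the most-significant bits, an integer min on the whole 64-bit word keeps the nearest point per bin, payload and all. The field widths below are invented for illustration, and a plain Python `min` stands in for the GPU atomic.

```python
def pack(log_z16, color40, subpix8):
    # {MSB 16-bit quantized logZ | 40-bit tonemapped color | 8-bit
    # sub-pixel offset} -- hypothetical field widths.
    return (log_z16 << 48) | (color40 << 8) | subpix8

def unpack(v):
    return v >> 48, (v >> 8) & ((1 << 40) - 1), v & 0xFF

def bin_point(bins, idx, log_z16, color40, subpix8):
    # Stand-in for a 64-bit atomic min: because depth occupies the
    # most-significant bits, integer min keeps the nearest point
    # written to the bin, with ties broken by the payload bits.
    bins[idx] = min(bins[idx], pack(log_z16, color40, subpix8))

bins = [(1 << 64) - 1] * 4             # cleared to "far"
bin_point(bins, 2, 900, 0x123456, 7)   # farther point lands first
bin_point(bins, 2, 300, 0xABCDEF, 1)   # nearer point wins the bin
```

The same layout supports the early-out mentioned below: a point can compare its packed value against the bin's current value before issuing the atomic.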

Points can early-out by fetching destination pixels and checking if they are more distant than the current framebuffer result. Just make sure to schedule the load early to avoid the latency problem. Can extend this theory to scene traversal and do a hierarchical z-buffer pre-pass: render anti-conservative points into the hierarchy from a compute shader. This process would involve culling paths in the scene graph and appending a list of nodes (point groups) which need to get rendered. Load balancing this traversal is a challenge worthy of another blog post.
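A sketch of the hierarchical z-buffer cull test that traversal would use, under the convention that larger z is farther: each coarse texel stores the farthest depth of its footprint, and a point group whose nearest point is still behind that value cannot contribute. The tiny framebuffer and function names are made up for illustration.

```python
def build_hiz(depth, w, h):
    # One mip level of a hierarchical z-buffer: each coarse texel
    # holds the FARTHEST depth of its 2x2 footprint.
    hw, hh = w // 2, h // 2
    hiz = [0.0] * (hw * hh)
    for y in range(hh):
        for x in range(hw):
            hiz[y * hw + x] = max(
                depth[(2 * y + dy) * w + (2 * x + dx)]
                for dy in (0, 1) for dx in (0, 1))
    return hiz

def group_occluded(hiz, hw, tx, ty, group_min_z):
    # Cull a point group during traversal if even its nearest point is
    # behind everything already rendered under the tile.
    return group_min_z > hiz[ty * hw + tx]

depth = [0.2, 0.3,
         0.4, 0.9]            # 2x2 framebuffer depths
hiz = build_hiz(depth, 2, 2)  # one coarse texel
```

Surviving groups would be appended to the render list; culled subtrees never touch the fine bins.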

Common to both point-based techniques and ray-marching with fixed-cost/frame algorithms is that, depending on traversal cost, some percentage of screen pixels might have holes or fractional coverage. It is possible to hole-fill leveraging temporal reprojection, also a topic worthy of another blog post.

Reconstruction with Rotated Rendering and 2x Super-Sampling

Another monochrome image (converted to color), this time of a lot of unlit boxes. This image is also using a slight image warp and some out-of-focus vignette. It was generated by rendering with an average of 2x super-sampling, but with the frame rotated. This removes the garbage sample pattern of regular non-rotated rendering. The result, with a proper reconstruction filter (I used a gaussian for performance, with weights based on sample distance from the pixel center), is something similar to 3xMSAA (horizontal and vertical gradients have three visible primary steps if the frame rotation angle is chosen correctly). There is a high memory cost to rendering rotated. However the pre-z-fill pass (set to the near plane) to mask out the invisible extents is super fast.
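A sketch of why rotating the frame helps: rotate a regular sample grid about the pixel center and the samples' projections onto either axis become distinct, so near-horizontal and near-vertical edges see several sub-pixel offsets instead of one. A 2x2 grid is used here for simplicity (the post's frames average 2 samples/pixel); the angle is one plausible choice, not necessarily the one used.

```python
import math

def rotated_samples(n, angle):
    # Rotate a regular n x n sample grid about the pixel center.
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for j in range(n):
        for i in range(n):
            x = (i + 0.5) / n - 0.5
            y = (j + 0.5) / n - 0.5
            out.append((x * c - y * s, x * s + y * c))
    return out

# Project sample x offsets: unrotated, the 2x2 grid collapses to two
# columns; rotated, all four samples land at distinct x offsets, which
# is what turns an edge gradient into multiple visible steps.
axis = sorted(round(x, 3) for x, _ in rotated_samples(2, 0.0))
rot  = sorted(round(x, 3) for x, _ in rotated_samples(2, math.atan(0.5)))
```

The same distance-from-pixel-center weighting as the point reconstruction then resolves these rotated samples back onto the unrotated output grid.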