20140820

Next Generation OpenGL Initiative Details from Khronos BOF

OpenGL Ecosystem BOF 2014

Highlights,

Cross vendor project between OpenGL and OpenGL ES working groups:
- Chair = Tom Olson (ARM)
- IL Group Chair = Bill Licea-Kane (Qualcomm)
- API Spec Editors = Graham Sellers (AMD) and Jeff Bolz (NVIDIA)

Committed to adopting a portable intermediate language for shaders.
Compatibility break from existing OpenGL.
Starting from first principles.
Multi-thread friendly.
Greatly reduced CPU overhead.
Full support for tiled and direct renderers.
Explicit control: application tells driver what it wants.

20140819

Scanlines

Link to the Shadertoy example.


Growing up in the era of the CRT "CGA" arcade monitor was just awesome. Roughly 320x240 or lower resolution at 60 Hz on a low-persistence display. Mix that with stunning pixel art. One of the core reasons I got into the graphics industry.

Built the above Shadertoy example to show what I personally like when attempting to simulate that old look and feel on modern LCD displays. The human mind is exceptionally good at filling in hidden visual information. The dark gaps between scanlines enable the mind to reconstruct a better image than what is actually there. The rightmost panel adds a quick attempt at a shadow mask. It is nearly impossible to do a good job simulating that because the LCD cannot get bright enough. The compromise in the shader example is to rotate the mask 90 degrees to reduce chromatic aberration. The mask could definitely be improved, but this is a great place to start...
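
For reference, a minimal C sketch of the dark-gap idea (not the actual Shadertoy shader): a per-scanline brightness weight that is full at a scanline center and falls toward black in the gap. The Gaussian falloff, the hardness constant, and the half-integer line-center convention are all assumptions to tune.

#include <math.h>

/* Weight to multiply into a pixel's color: y is the output pixel row,
   outHeight is the number of output rows, srcLines is the number of simulated
   scanlines. Scanline centers sit at half-integer source-line coordinates here. */
static float scanline_weight(float y, float outHeight, float srcLines)
{
    float line = y * srcLines / outHeight;          /* position in source scanlines   */
    float dist = fabsf(line - floorf(line) - 0.5f); /* 0 at a line center, 0.5 in gap */
    const float hardness = 8.0f;                    /* tuning constant (a guess)      */
    return expf(-dist * dist * hardness * hardness);
}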

Feel free to use/modify the shader. Hopefully I'll get lucky and have the option to turn on the vintage scanline look when I play those soon-to-be-released games with awesome pixel art!

20140816

Vintage Programming

A photo (not a screenshot) of one of my home vintage development environments running on modern fast PCs. The shot shows syntax-colored source for the compiler of the language I use most often (specifically the part which generates the ELF header for Linux). More on this below.



This is running 640x480 on a small mid 90's VGA CRT which supports around 1000 lines. So no garbage double scan and no horrible squares for pixels. Instead a high quality analog display running at 85 Hz. The font is my 6x11 fixed size programming font.



This specific compiler binary on x86-64 Linux is under 1700 bytes.

A Language
The language is ultra primitive: it does not include a linker or anything to do code generation, and there is no debugger (frankly one is not needed, as debuggers are slower than instant run-time recompile/reload style development). Instead the ELF (or platform) header for the binary, and the assembler or secondary language which actually describes the program, are written in the language itself.

Over the years I've been playing with both languages in classic text form and languages which require custom editors and live in a binary form. This A language uses the classic text source form. All the variations of languages I've been interested in are heavily influenced by Color Forth.

This A compiler works in two passes: the first both parses and translates the source into x86-64 machine code. Think of this as factoring out the interpreter into the parser. The second pass simply calls the entry point of the source code to interpret the source (by running the just-generated machine code). After that, whatever is written in the output buffer gets saved to a file.
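
A rough C sketch of that two-pass shape, not the author's actual compiler (which is itself tiny hand-built machine code): pass one is a hypothetical compile() callback emitting x86-64 into an executable buffer, pass two just calls that buffer, and whatever landed in the output buffer is written to disk.

#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical pass 1: parse the source and append x86-64 machine code to 'code'. */
typedef void (*CompilePass)(const char *src, unsigned char *code);

static int build(CompilePass compile, const char *src,
                 unsigned char *out_buf, size_t out_len, const char *path)
{
    /* Executable buffer for the machine code produced by pass 1. */
    unsigned char *code = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE | PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED) return -1;
    compile(src, code);              /* pass 1: parse + translate to machine code */
    ((void (*)(void))code)();        /* pass 2: run it; the code fills out_buf    */
    FILE *f = fopen(path, "wb");     /* save whatever the program emitted         */
    if (f) { fwrite(out_buf, 1, out_len, f); fclose(f); }
    munmap(code, 1 << 20);
    return f ? 0 : -1;
}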

Below is the syntax for the A language. A symbol is an untyped 64-bit value in memory. Like Forth, there are separate data and return stacks.

\comment\
012345- \compile: push -0x12345 on the data stack\
,c3 \write a literal byte into the compile stream\
symbol \compile: call to symbol, symbol value is a pointer to function\
'symbol \compile: pop top of data stack, if value is true, call symbol\
`symbol \copy the symbol data into the compile stream, symbol is {32-bit pointer, 32-bit size}\
:symbol \compile: pop data stack into symbol value\
.symbol \compile: push symbol value onto data stack\
%symbol \compile: push address of symbol value onto data stack\
"string" \compile: push address of string, then push size of string on the data stack\
{ symbol ... } \define a function, symbol value set to head of compile stream\


And that is the A language. The closing "}" writes out the 32-bit size to the packed {32-bit pointer, 32-bit size} symbol value, and also adds an extra RET opcode to avoid needing to add one at the end of every define. There is one other convention missing from the description above: there is a hidden register used as the pointer to the output buffer.

Writing Parts of the Language in the Language
The first part of any source file is a collection of opcodes, like the { xor ,48 ... } at the top of the image, which is the raw x86-64 machine code to do the following in traditional assembly language (rax = top of data stack, rbx points to the second data stack entry),

XOR rax, [rbx]
SUB rbx, 8


This collection of opcodes generates symbols which form the stack-based language the interpreter uses. They would get used like `xor in the code (the copy-symbol-into-compile-stream syntax). For instance `long pops the top of the data stack and writes out 8 bytes to the output buffer, and `asm pushes the output buffer pointer onto the data stack.

I use this stack-based language to then define an assembler (in the source code), and then I write code in the assembler using the stack-based language as effectively the ultimate macro language. For instance, if I were to describe the `xor command in the assembler it would look as follows,

{ xor .top .stk$ 0 X@^ .stk$ 8 #- }

Which is really hard to read without syntax coloring (sorry my HTML is lazy). For naming, the "X" = 64-bit extended, the "@" = load, and the "#" = immediate. So the "X@^" means assemble "XOR reg,[mem+imm]". The symbols "top" and "stk$" contain the numbers of the registers for the top of the stack and the pointer to the second item on the stack respectively.

Compiler Parser
The compiler parsing pass is quite easy: just a jump table based on the prefix character to a function which parses the {symbol, number, comment, white space, etc}. These functions don't return, they simply jump to the next thing to parse. As symbol strings are read they are hashed into a register and bit-packed into two extra 64-bit registers (lower 4 bits/character in one register, upper 3 bits/character in another register). This packing makes string compare easy later when probing. Max symbol string is 16 characters. The hash table is a simple linear probing style, but with an array of 2 entries per hash value filling one cacheline. Each hash table entry has the following 8-byte values: {lower bits of string, upper bits of string, pointer to symbol storage, unused}. The symbol storage is allocated from another stack (which only grows). Upon lookup, if a symbol isn't in the hash table it is added with new storage. Symbols never get deleted.
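
A C sketch of that symbol table as I read it; the hash function, table size, and names are stand-ins, not the author's code.

#include <stdint.h>
#include <stddef.h>

typedef struct {        /* 32-byte entry; two of these fill one 64-byte cacheline */
    uint64_t lo;        /* lower 4 bits of each character, packed                 */
    uint64_t hi;        /* upper 3 bits of each character, packed                 */
    uint64_t *value;    /* pointer to symbol storage                              */
    uint64_t unused;
} Entry;

#define SLOTS 4096
static Entry table[SLOTS][2];       /* 2 entries per hash value       */
static uint64_t storage[1 << 16];   /* grow-only symbol storage stack */
static uint64_t storageTop;

static uint64_t *lookup(const char *sym, size_t len)   /* len <= 16 */
{
    uint64_t lo = 0, hi = 0, hash = 0;
    for (size_t i = 0; i < len; i++) {
        lo   |= (uint64_t)(sym[i] & 0x0f) << (i * 4);
        hi   |= (uint64_t)((sym[i] >> 4) & 0x07) << (i * 3);
        hash  = hash * 33 + (uint8_t)sym[i];           /* hash is a stand-in */
    }
    for (uint64_t probe = hash; ; probe++) {           /* linear probing over slots */
        Entry *pair = table[probe & (SLOTS - 1)];
        for (int j = 0; j < 2; j++) {
            if (pair[j].lo == lo && pair[j].hi == hi) return pair[j].value;
            if (pair[j].value == 0) {                  /* empty: add the new symbol */
                pair[j].lo = lo; pair[j].hi = hi;
                pair[j].value = &storage[storageTop++];
                return pair[j].value;
            }
        }
    }
}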

Highly Recommend PowerNotebooks.com

Got a custom notebook from powernotebooks.com, and I'd highly recommend them for anyone else looking for a new laptop. They have an interesting practice of providing faster orders and a few percent off for paying cash (via a few different methods). Their customer service for traditional phone calls is also quite awesome. I learned a few things talking to their technically knowledgeable staff.

20140814

HRAA And Coverage and Related Topics

Michal Drobot's HRAA Slides: great talk, I've read it a few times now. Really good to see people getting serious about solving the aliasing problem.

Coverage Fail Case?
Start with a simple example of two color and depth samples {N,S} (with associated coverage samples), and two extra coverage samples {w,e}, in the following pattern,

.N..
...e
w...
..S.


Starting with the all-cleared (unknown) case,

._..
..._
_...
.._.


Render a triangle in the foreground which covers {S,w,e},

._..
...s
s...
..S.


Now render a triangle in the background which covers {N,S,w,e}; this, for instance, could be a skybox. The N sample passes the depth test, the S sample fails the depth test, and the {w,e} coverage samples get set to unknown (coverage samples have no depth, and the raster unit does not know which triangle is in front, because a coverage sample's associated depth sample won't work in sloped cases). The result is the same as if there were no coverage samples,

.N..
..._
_...
..S.


Front-to-back drawing order (the best order for performance) clears out coverage information. Only back-to-front draw order (the worst for overdraw) builds coverage, as front triangles evict and optionally replace coverage sample associations.

This is ultimately why I abandoned the idea of using coverage samples for reconstruction.

Since then I've learned that coverage might work if there is a front-to-back full z-pre-pass, followed by rendering back-to-front with the depth test passing if depth is nearer or equal. This process would likely restore coverage. This likely explains why EQAA and CSAA actually seemed to work when they were first introduced, because engines did do z-pre-passes at that time. Back when I tried coverage-based reconstruction I never did a z-pre-pass (couldn't afford to submit the geometry again).

The CRAA LUT in Michal Drobot's paper is a great idea for a z-pre-passing engine working on a platform which provides coverage information.

FLIPQUAD
For any hardware which provides programmable sample locations at a granularity of 2x2 pixels (or beyond), one can do better than the flipquad setup. On slide 70, notice the blue samples of pixel pairs {0,2} and {1,3} align on vertical lines.

BFCEE
Tried this before a few times and never found it to work well as just blending in more of the source as a function of approaching a full pixel offset; perhaps I did something wrong, going to need to try this again!

Abdul Bezrati: Real-time lighting via Light Linked List

Real-time lighting via Light Linked List

Seems like the basic idea is as follows,

Render G-buffer.
Get min/max depth for each 8x8 tile.
For 1/64 area (1600/8 by 900/8), raster lights with software depth test.
For each tile, build linked list of lights intersecting tile.
Linked lists of {half depthMin, half depthMax, uint8 lightIndex, uint24 nextStructureIndex} (see the struct sketch after this list).
Keeping light bounds helps avoid shading when a tile has a large depth range.
Full screen pass shading for all lights intersecting a tile.
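
My reading of that 8-byte list node as a C struct; the half depths are stored as raw 16-bit half-float bits, and the exact field packing is an assumption.

#include <stdint.h>

typedef struct {
    uint16_t depthMin;        /* half-float bits: near end of the light's range in this tile */
    uint16_t depthMax;        /* half-float bits: far end                                    */
    uint32_t lightIndex : 8;  /* which light (up to 256 lights)                              */
    uint32_t nextNode   : 24; /* index of the next node in this tile's list                  */
} LLLNode;                    /* 8 bytes per node                                            */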

Would This be Faster?
Given the maximum of 256 lights, have a fixed 32 bytes per tile which is a bitmask of the lights intersecting the tile. Raster the lights via AtomicOr() with no return value (no latency to hide), setting bits in the bitmask. At deferred shading time, each workgroup shades a tile: first load the bitmask into LDS, then, in groups of lights which fit in LDS, do a ballot-based scan of the remaining set bits in the bitmask, load those active lights into LDS, switch to shading pixels with the light data, and repeat.
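
A plain C stand-in for that bookkeeping (on the GPU the marking would be an AtomicOr during light raster and the scan would use ballot over LDS in the shading workgroup); the tile struct, light cap, and names are assumptions.

#include <stdatomic.h>
#include <stdint.h>

#define MAX_LIGHTS 256
typedef struct { _Atomic uint64_t words[MAX_LIGHTS / 64]; } TileMask;   /* 32 bytes per tile */

/* Light raster: mark light 'l' as touching this tile (no return value needed). */
static void mark_light(TileMask *tile, unsigned l)
{
    atomic_fetch_or(&tile->words[l >> 6], 1ull << (l & 63));
}

/* Deferred shading: visit every light whose bit is set for this tile. */
static void shade_tile(TileMask *tile, void (*shade)(unsigned light))
{
    for (unsigned w = 0; w < MAX_LIGHTS / 64; w++) {
        uint64_t bits = atomic_load(&tile->words[w]);
        while (bits) {
            unsigned l = (unsigned)__builtin_ctzll(bits);   /* lowest set bit (GCC/Clang builtin) */
            shade(w * 64 + l);
            bits &= bits - 1;                               /* clear that bit                     */
        }
    }
}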

20140811

NVIDIA's Project Denver

NVIDIA Blogs on Project Denver



I'm reading this press release as follows,

ARMv8 64-bit Processor.
Hardware decode of ARMv8 instructions (see above image).
Seems like similar area: dual core 3-way A15 @ 2.3 GHz -- single core 7-way Denver @ 2.5 GHz.
Run-time "Dynamic Code Optimization" into a 128MB chunk of DRAM backed by a 128KB cache.
7-way looks like (see above image): 2 Load/store units, 2 FPUs, 2 Integer ALUs, 1 Branch unit?

Wronski: Volumetric Fog SIGGRAPH 2014

Wronski: Volumetric Fog SIGGRAPH 2014

OpenGL 4.5, WebGL, AEP

Khronos Group Announces Key Advances in OpenGL Ecosystem

Pervasive WebGL
Enabling Shadertoy to run on all platforms!!!
"WebGL brings powerful GPU access to HTML5. As with any Web standard, pervasive availability across many browsers is the key to providing a commercially relevant deployment platform. Today, all mainstream desktop browsers support WebGL, including Chrome, Firefox, Safari and Internet Explorer, and WebGL support is rapidly being deployed to major mobile browsers. WebGL enables a true industry first: the ability to write high-performance 3D applications that run with zero porting effort on every significant desktop and mobile platform."

OpenGL 4.5
NVIDIA's OpenGL 4.5 Drivers
OpenGL 4.5 Quick Reference Card
GL 4.5 Spec with Changes Marked
GLSL 4.5 Spec with Changes Marked

Highlights,
ARB_direct_state_access: Majority of GL API without bind-to-edit.
ARB_clip_control: Enables DX clip space, glClipControl(GL_UPPER_LEFT, GL_ZERO_TO_ONE), high precision depth (see the reversed-Z sketch after this list)!
ARB_derivative_control: Explicit derivative control via dFdxCoarse(), dFdxFine(), etc.
GLSL adds "if(gl_HelperInvocation) { }" construct which enables writing shader code for invisible fragments used only for derivative computation.
GLSL adds "coherent" image qualifier specifies that writes will be visible to other shader invocations.
GL adds glGetGraphicsResetStatus() to check for a device reset (context lost).
ARB_get_texture_sub_image: Ability to get single slices of a texture.
ARB_sparse_buffer: Like sparse texture but for buffers.
ARB_pipeline_statistics_query: Ability to get various primitive and invocation counters.
ARB_texture_barrier: Better manage rendering to currently bound texture.
KHR_context_flush_control: Better control over context flushing.
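
Not part of the release notes, but for context, a sketch of the reversed-Z setup that the clip control extension enables (see the ARB_clip_control item above); assumes a GL 4.5 context with entry points loaded via a loader such as glad.

#include <glad/glad.h>   /* assumed loader providing the GL 4.5 entry points */

static void setup_reversed_z(void)
{
    glClipControl(GL_LOWER_LEFT, GL_ZERO_TO_ONE);  /* clip-space depth mapped to [0,1]   */
    glClearDepth(0.0);                             /* "far" now clears to 0              */
    glDepthFunc(GL_GREATER);                       /* nearer fragments have larger depth */
    /* Pair with a projection matrix that maps near to 1 and far to 0 for the precision win. */
}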

AEP
Android Extension Pack: Brings a collection of features beyond ES 3.1 to Android.

Next Generation OpenGL Initiative

The most important aspect of the Khronos Group Announces Key Advances in OpenGL Ecosystem press release is the Next Generation OpenGL Initiative. Specifically, "Khronos announced a call for participation today in a project to define a future open standard for high-efficiency access to graphics and compute on modern GPUs. Key directions for the new ground-up design include explicit application control over GPU and CPU workloads for performance and predictability, a multithreading-friendly API with greatly reduced overhead, a common shader program intermediate language, and a strengthened ecosystem focus that includes rigorous conformance testing. Fast-paced work on detailed proposals and designs are already underway, and any company interested to participate is strongly encouraged to join Khronos for a voice and a vote in the development process."

Also, Raja Koduri, chief technology officer, graphics at AMD: "AMD is tremendously excited to take a contributing role in the Next Generation OpenGL initiative as an evolution of the OpenGL standard aligned with AMD’s vision for low-overhead and multi-threaded graphics APIs."

And, "We are super excited to contribute and work with the Next Generation OpenGL Initiative, and bring our experience of low-overhead and explicit graphics APIs to build an efficient standard for multiple platforms and vendors in Khronos," said Johan Andersson, technical director at Frostbite - Electronic Arts. "This work is of critical importance to get the most out of modern GPUs on both mobile and desktop, and to make it easier to develop advanced and efficient 3D applications - enabling us to build amazing future games with Frostbite on all platforms."

----

As a contributor to this project, I'm also tremendously excited that this industry-wide effort from the Promoters and Contributing Members of Khronos is set to bring forward a next generation graphics API which is the high-performance platform-portable solution to the needs of modern games and applications! OpenGL has a very bright future.

Evoke 2014 Tubes

20140810

Front Buffer Rendering

Ideas for practical front buffer rendering?

The largest problem is probably variability in CPU thread scheduling. In theory, not sleeping and just polling causes an OS to treat the thread like a compute-bound thread, increasing scheduling variability. Sleeping might increase interactivity, but only if not blocking on a timer event (which typically has poor accuracy), and instead blocking on real IO. Attempting to track and adapt to scheduling variability is an option: sleep until the worst expected time before needing to wake, then spin until the actual event. But in practice, at interesting frame rates, I'd bet variability is too high for this to be practical.
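
A small C sketch of that sleep-then-spin idea, assuming a CLOCK_MONOTONIC timebase; the jitter margin would be measured per platform, and the 0.5 ms sleep slice is a guess.

#include <time.h>

static double now_sec(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (double)t.tv_sec + (double)t.tv_nsec * 1e-9;
}

/* Wait for 'target' (seconds on CLOCK_MONOTONIC), tolerating scheduler variability:
   sleep in small slices up to the worst expected wake-up error, then spin the rest. */
static void wait_until(double target, double worstJitterSec)
{
    while (now_sec() < target - worstJitterSec) {
        struct timespec nap = { 0, 500000 };   /* 0.5 ms slices */
        nanosleep(&nap, 0);
    }
    while (now_sec() < target) { /* spin */ }
}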

A Better Way
At home I'm working with 100% pull-model graphics. Meaning, what to draw is computed on the GPU, not the CPU. For a majority of the time, the CPU simply blasts the same exact rendering commands every frame. This decoupling of CPU and GPU enables the CPU to batch commands with any number of frames of latency before the GPU draws them. All IO between the CPU and GPU (gamepad input, etc) goes via low-latency persistent-mapped client-side storage. This is fully async from the generation of draw calls. The issue of non-atomic writes and no memory barrier over PCIe is managed by sending and checking CRCs for all data transferred. The latency wall is effectively gone.
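
A C sketch of the CPU side of that CRC idea: each packet dropped into the persistent mapped buffer carries a CRC, and the GPU only trusts the packet once the CRC it recomputes matches. The packet layout, sizes, and the CRC-32C choice are my assumptions; the GPU-side polling shader is omitted.

#include <stdint.h>
#include <string.h>

static uint32_t crc32c(const void *data, size_t len)   /* bitwise CRC-32C */
{
    const uint8_t *p = data;
    uint32_t crc = ~0u;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0x82f63b78u & (0u - (crc & 1)));
    }
    return ~crc;
}

typedef struct { uint32_t size; uint32_t crc; uint8_t payload[56]; } Packet;  /* one 64-byte slot */

/* CPU side: fill one slot in the persistent mapped client buffer. */
static void write_packet(volatile Packet *slot, const void *data, uint32_t size)
{
    memcpy((void *)slot->payload, data, size);
    slot->size = size;
    slot->crc  = crc32c(data, size);   /* the GPU recomputes this and compares before use */
}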

The next step in the evolution of this system is to race scan-out using front buffer rendering instead of double buffered rendering. Have not had the time to try this yet, so the rest of this post is theory...

Given the lack of API tools to make this easy, the core idea is to have the CPU periodically updating a "wall clock" value in persistent mapped client storage. The GPU shaders would read this value, and adapt the cost of rendering towards meeting the target framerate and sync. A background real-time priority CPU thread would write the time, then sleep, repeat. If required due to OS scheduling variability, regular CPU threads would also write out the time periodically. On platforms with no way to find the time of v-sync, likely this system would require providing the user with a tuning knob which dials in the v-sync phase (knob moves the tear line until it is off-screen).