Optical Flow

To enhance the possibilities to apply external velocity to the fluid solver, PixelFlow contains an optical flow module.

Based on the idea of simple frame differencing (image1 - image2), the optical flow between subsequent images (frame animations, movies, webcam capture, etc.) can be computed from 3 quantities:

dt – difference of frames (frame_A - frame_B)
dx – sum of horizontal gradients (sobelX_A + sobelX_B)
dy – sum of vertical gradients (sobelY_A + sobelY_B)
The normalized vector (dx, dy) is basically already the flow vector; dt defines its length and direction (positive or negative).

There are several ways to generate the gradients. I got decent results using the Sobel operator. Since everything can be done on the GPU using shaders, the overhead is minimal.
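To illustrate the Sobel operator mentioned above, here is a hypothetical CPU sketch in Java (PixelFlow computes the gradients in a GLSL shader; the class and method names are made up for illustration):

```java
// Hypothetical CPU sketch of the horizontal Sobel operator.
// PixelFlow runs the equivalent as a GLSL shader pass.
public class Sobel {

  // horizontal gradient (sobelX) of a grayscale image, clamped at the borders
  public static float[] sobelX(float[] img, int w, int h) {
    float[] out = new float[w * h];
    for (int y = 0; y < h; y++) {
      for (int x = 0; x < w; x++) {
        int xl = Math.max(x - 1, 0), xr = Math.min(x + 1, w - 1);
        int yt = Math.max(y - 1, 0), yb = Math.min(y + 1, h - 1);
        // weights: [-1 0 +1; -2 0 +2; -1 0 +1]
        out[y * w + x] =
            (img[yt * w + xr] + 2 * img[y * w + xr] + img[yb * w + xr])
          - (img[yt * w + xl] + 2 * img[y * w + xl] + img[yb * w + xl]);
      }
    }
    return out;
  }
}
```

The vertical gradient (sobelY) is the same kernel transposed.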

Optical Flow Algorithm
update step
preprocessing (blur, noise-reduction, grayscale/color, etc…) – optional
gradients (sobel, horizontal+vertical)
optical flow (GLSL-Shader)
smooth (gaussian blur + temporal smooth) – optional
Most obviously, the input images can (and should) be preprocessed. Whether it is worth applying more expensive filters (median, bilateral, …) and/or simply a Gaussian blur depends a lot on the source footage.
A threshold, to omit flow vectors below a certain length.
Gaussian blurring of the resulting flow, and finally some temporal smoothing (mixing the current and the previous flow).
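The temporal smoothing mentioned above is just a linear blend of the previous and current flow fields. A minimal CPU sketch (hypothetical names; PixelFlow does this in a shader):

```java
// Hypothetical sketch of temporal smoothing: blend the current flow
// field with the previous frame's flow. mix = 0 keeps the old flow,
// mix = 1 uses only the new flow.
public class TemporalSmooth {
  public static float[] smooth(float[] prevFlow, float[] currFlow, float mix) {
    float[] out = new float[currFlow.length];
    for (int i = 0; i < out.length; i++) {
      out[i] = prevFlow[i] * (1f - mix) + currFlow[i] * mix;
    }
    return out;
  }
}
```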
GLSL-Shader – Optical Flow

/**
 * PixelFlow | Copyright (C) 2016 Thomas Diewald - http://thomasdiewald.com
 */

#version 150

precision mediump float;
precision mediump int;

#define SUM_RGB(v) ((v).r + (v).g + (v).b)

out vec2 glFragColor;

uniform sampler2D tex_curr_frame ; // current image
uniform sampler2D tex_prev_frame ; // previous image
uniform sampler2D tex_curr_sobelH; // current gradient horizontal
uniform sampler2D tex_curr_sobelV; // current gradient vertical
uniform sampler2D tex_prev_sobelH; // previous gradient horizontal
uniform sampler2D tex_prev_sobelV; // previous gradient vertical

uniform vec2  wh;              // size rendertarget
uniform float scale;           // scale flow
uniform float threshold = 1.0; // flow intensity threshold

void main(){

  vec2 posn = gl_FragCoord.xy / wh;

  // dt, dx, dy
  vec3 dt_ = texture(tex_curr_frame , posn).rgb -
             texture(tex_prev_frame , posn).rgb;
  vec3 dx_ = texture(tex_curr_sobelH, posn).rgb +
             texture(tex_prev_sobelH, posn).rgb;
  vec3 dy_ = texture(tex_curr_sobelV, posn).rgb +
             texture(tex_prev_sobelV, posn).rgb;

  // sum rgb-channels
  float dt = SUM_RGB(dt_), dx = SUM_RGB(dx_), dy = SUM_RGB(dy_);
  // gradient length
  float dd = sqrt(dx*dx + dy*dy + 1.0);
  // optical flow
  vec2 flow = scale * dt * vec2(dx, dy) / dd;

  // threshold
  float len_old = sqrt(flow.x*flow.x + flow.y*flow.y + 0.00001);
  float len_new = max(len_old - threshold, 0.0);
  flow = (len_new * flow) / len_old;

  glFragColor = flow;
}

OpenCL Pathtracing [1] ~ Realtime Renderer

Pathtracer – OpenCL

My previous posts were about implementing a pathtracer in Java. But the more features I implemented, the longer I had to wait to check the resulting image, make changes, and wait again … in general 1-10 seconds.

So it was time to move the whole thing to OpenCL.

At the beginning I had a lot of problems creating the buffers (got a lot of FATAL ERROR …). As soon as I had removed all the bugs it worked quite well, since I could port the (already working) code from my Java implementation to the OpenCL language almost 1:1.

I'm still using a BVH data structure, but the flat version, traversed with a stack pointer, since OpenCL doesn't allow recursion.
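The stack-based traversal pattern looks roughly like this (a hypothetical sketch in Java rather than OpenCL C, with a boolean standing in for the real ray/AABB test; the node layout is made up for illustration):

```java
// Sketch of iterative BVH traversal with an explicit stack, as used when
// recursion is unavailable (e.g. in OpenCL kernels). Node layout is hypothetical.
public class BVHTraversal {

  public static class Node {
    public int left = -1, right = -1; // child indices, -1 for leaves
    public int prim = -1;             // primitive index, -1 for inner nodes
    public boolean hit;               // stands in for a real ray/AABB intersection test
  }

  // depth-first traversal of a flat node array; returns the primitives to intersect
  public static java.util.List<Integer> traverse(Node[] nodes) {
    java.util.List<Integer> prims = new java.util.ArrayList<>();
    int[] stack = new int[64]; // fixed-size stack instead of a call stack
    int sp = 0;
    stack[sp++] = 0;           // push root
    while (sp > 0) {
      Node n = nodes[stack[--sp]]; // pop
      if (!n.hit) continue;        // ray misses this subtree's bounds
      if (n.prim >= 0) {
        prims.add(n.prim);         // leaf: intersect the primitive
      } else {
        stack[sp++] = n.left;      // inner node: push both children
        stack[sp++] = n.right;
      }
    }
    return prims;
  }
}
```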

On the host side, the OBJ file gets imported, the BVH is created, and the buffers (mainly using structs) are created. The whole pathtracing (BVH traversal, shading, etc.) then happens on the device.

The change in performance was quite impressive. On the same system (CPU: Q6600, GPU: GeForce GTX 550 Ti) the GPU implementation got a huge speedup (factor 20-30) over the CPU implementation. So now I can render a scene containing hundreds of thousands of triangles while always keeping a framerate of at least 1 fps … of course that's far from realtime, but already very promising.

I also added some new features, such as sunlight/skylight (diffuse lighting, ambient occlusion) and transparent materials (refraction).

The following videos/images show the current state of my pathtracing experiment (the videos are a bit laggy due to the screen capturing while rendering).

My setup is …

Java bindings for OpenGL: http://jogamp.org/jogl/www/
Java bindings for OpenCL: http://www.jocl.org/
GPU: 1 Nvidia GeForce GTX 550 Ti
CPU: Intel Q6600
the demo application can be downloaded here: http://thomasdiewald.com/blog/?p=1637


First testscene

… working on transparency and refraction.


Velocity/Position update

While working on some PixelFlow demos, I needed a simple (CPU-based for the moment) particle system that is able to solve collisions and constraints. What started as a basic particle simulation experiment turned out to be quite a powerful tool.

Velocity/Position update
To move a particle, information about its position and velocity is required.

There are several ways to update a particle's position; I only mention the two (probably) most common ones.

Euler Integration

v += a * dt
c += v * dt
a = 0

Verlet Integration

v = c - p
p = c
c += v + a * 0.5 * dt * dt
a = 0

v … velocity
c … current position
p … previous position
a … acceleration
dt … timestep
I was used to Euler integration, but it kind of failed when I used it for softbody dynamics. So Verlet integration it is.
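The two update rules above, transcribed directly to Java (1D for brevity; the field names mirror the legend, the class names are made up):

```java
// Direct transcription of the Euler and Verlet update rules from the post.
// 1D positions for brevity; the 2D/3D versions are component-wise identical.
public class Integrators {

  // Euler: velocity is stored explicitly
  public static class EulerP {
    public float c, v, a; // position, velocity, acceleration
    public void update(float dt) {
      v += a * dt;
      c += v * dt;
      a = 0;
    }
  }

  // Verlet: velocity is implicit in the difference of current and previous position
  public static class VerletP {
    public float c, p, a; // current position, previous position, acceleration
    public void update(float dt) {
      float v = c - p;
      p = c;
      c += v + a * 0.5f * dt * dt;
      a = 0;
    }
  }
}
```

Note that in the Verlet version the velocity never has to be stored; collisions and constraints can simply move positions around, and the velocity follows automatically in the next step.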

For a more detailed overview i can recommend the following articles

Advanced Character Physics, by Thomas Jakobsen
Integration by Example – Euler vs Verlet vs Runge-Kutta, by Florian Bösch
Simulate Tearable Cloth and Ragdolls With Simple Verlet Integration, by Jared Counts
Gaffer on Games

Collision Detection
To solve collisions, information about a particle's size (radius) is required. A collision can be handled (in its simplest form) by moving two colliding particles (the distance between them is lower than the sum of their radii) apart by half the overlap (optionally weighted by their masses, etc.).

In a particle system, one particle most likely has to handle many collisions at once, which usually takes multiple subsequent iterations of the collision update step (relaxation). Obviously this task is very expensive and grows quadratically with the number of particles (naively, each particle needs to check all other particles).

For reducing the number of collision checks some space-partitioning techniques can be used, e.g. Quadtree/Octree, BVH, etc.

PixelFlow currently uses a regular collision grid, using PPLL (per-pixel linked lists) to efficiently store particles in grid cells. I chose the grid because it is very easy to implement, still fast, and can also be implemented on the GPU using GLSL shaders (http://thomasdiewald.com/blog/?p=2099).
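The grid idea can be sketched on the CPU like this (hypothetical Java classes; PixelFlow's GPU version stores particles in per-pixel linked lists instead of Java lists):

```java
// Hypothetical CPU sketch of a uniform collision grid: particles are binned
// into cells, and collision candidates are gathered from the 3x3 neighborhood.
public class CollisionGrid {
  public final int cols, rows;
  public final float cellSize;
  public final java.util.List<java.util.List<Integer>> cells;

  public CollisionGrid(float width, float height, float cellSize) {
    this.cellSize = cellSize;
    this.cols = (int) Math.ceil(width / cellSize);
    this.rows = (int) Math.ceil(height / cellSize);
    this.cells = new java.util.ArrayList<>();
    for (int i = 0; i < cols * rows; i++) cells.add(new java.util.ArrayList<>());
  }

  // insert a particle id at position (x, y)
  public void insert(int id, float x, float y) {
    int cx = Math.min((int) (x / cellSize), cols - 1);
    int cy = Math.min((int) (y / cellSize), rows - 1);
    cells.get(cy * cols + cx).add(id);
  }

  // gather candidate neighbors from the 3x3 block of cells around (x, y);
  // the cell size should be at least the largest particle diameter
  public java.util.List<Integer> neighbors(float x, float y) {
    java.util.List<Integer> result = new java.util.ArrayList<>();
    int cx = Math.min((int) (x / cellSize), cols - 1);
    int cy = Math.min((int) (y / cellSize), rows - 1);
    for (int j = Math.max(cy - 1, 0); j <= Math.min(cy + 1, rows - 1); j++)
      for (int i = Math.max(cx - 1, 0); i <= Math.min(cx + 1, cols - 1); i++)
        result.addAll(cells.get(j * cols + i));
    return result;
  }
}
```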

dxyz = pb.c - pa.c
dmin = pb.r + pa.r
dd_cur_sq = squaredLength(dxyz)
dd_min_sq = dmin * dmin

if (dd_cur_sq < dd_min_sq) {
  force = (dd_min_sq / (dd_cur_sq + dd_min_sq) - 0.5f)
  pa.collision -= dxyz * force
}
Since the particles are stored in a grid and the current position is used to look up grid cells, the position can't be updated immediately, which is why a temporary "collision" variable is used.

Additionally, all kind of constraints can be added to generate more complex simulations.

For example, a very simple constraint can be: keep two particles apart by maintaining a certain distance (rest-length) between them.

Just this simple rule is all it takes to generate stunning cloth/softbody-simulations. The code is almost identical to the one for handling collisions.

dxyz = pb.c - pa.c
dd_cur_sq = squaredLength(dxyz)
force = (dd_rest_sq / (dd_cur_sq + dd_rest_sq) - 0.5f)
pa.c -= dxyz * force
pb.c += dxyz * force
In contrast to the collision solver, the position can be updated immediately. However, this adds some noticeable error to the result, since particles computed earlier may be changed by the current results (a read/write race), which in turn invalidates the earlier computation. It is noticeable in certain situations. Iteratively executing the solver (I use 8 iterations by default) diminishes the error. If more accuracy is required, the position offset must be added/subtracted to a temporary buffer, the same as is done by the collision handler.
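For reference, the rest-length constraint above, iterated the way the solver does (a hypothetical 2D Java sketch; repeated relaxation pulls two points toward the rest distance):

```java
// Sketch of the rest-length constraint from the post, applied iteratively (relaxation).
// Positions are updated in place, as in the constraint solver described above.
public class DistanceConstraint {

  public static void relax(float[] pa, float[] pb, float restLen, int iterations) {
    float ddRestSq = restLen * restLen;
    for (int k = 0; k < iterations; k++) {
      float dx = pb[0] - pa[0], dy = pb[1] - pa[1];
      float ddCurSq = dx * dx + dy * dy;
      // force < 0 pushes apart, force > 0 pulls together
      float force = ddRestSq / (ddCurSq + ddRestSq) - 0.5f;
      pa[0] -= dx * force; pa[1] -= dy * force;
      pb[0] += dx * force; pb[1] += dy * force;
    }
  }
}
```

Because the correction is symmetric, the midpoint of the two particles stays fixed while the distance converges to the rest length.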

Cloth Simulation

Distance Transform

This post is about a short comparison of 3 similar distance transform algorithms.

CDT – Chamfer Distance Transform
DRA – Dead Reckoning Algorithm
DDT – developed by myself
All 3 have quite a lot in common and are based on the Chamfer Distance Transform. The CDT is a 2-pass algorithm, propagating distances and the corresponding nearest neighbors (NN) in a small window (3×3) across the image. The first pass starts at the upper-left corner, the second pass in the opposite corner. That's all.

0) initialization:
find all foreground pixels and set their distance to 0.

1) pass 1:
move window from left -> right, top -> bottom
at each pixel, check neighbors 0, 1, 2, 3 and save the smallest distance and NN
1 2 3
0 x .
. . .

2) pass 2:
move window from right -> left, bottom -> top
at each pixel, check neighbors 4, 5, 6, 7 and save the smallest distance and NN
. . .
. x 4
7 6 5

3) … finished …

… what differs among the 3 versions is how the smallest distance is updated/calculated.

Chamfer Distance Transform (CDT)

… looks up the direct neighbors' distances and adds the pixel distance (either 1 = orthogonal or sqrt(2) = diagonal, depending on the neighbor) to them. The smallest of these 4 new distances is then stored. In the 2nd pass (backwards) the currently saved distance is also taken into account. The actual nearest neighbor is stored, too.
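The two-pass scheme above, reduced to the distances only (no nearest-neighbor propagation), can be sketched in Java like this (hypothetical names, chamfer weights 1 and sqrt(2)):

```java
// Minimal sketch of the plain 2-pass Chamfer Distance Transform (distances only,
// no nearest-neighbor propagation). 'fg' marks the foreground pixels (distance 0).
public class ChamferDT {
  static final float ORTHO = 1f, DIAG = (float) Math.sqrt(2);

  public static float[] transform(boolean[] fg, int w, int h) {
    float[] d = new float[w * h];
    for (int i = 0; i < w * h; i++) d[i] = fg[i] ? 0f : Float.MAX_VALUE / 2;

    // pass 1: left -> right, top -> bottom; neighbors 0, 1, 2, 3
    for (int y = 0; y < h; y++)
      for (int x = 0; x < w; x++) {
        int i = y * w + x;
        if (x > 0)       d[i] = Math.min(d[i], d[i - 1]     + ORTHO); // 0
        if (y > 0) {
          if (x > 0)     d[i] = Math.min(d[i], d[i - w - 1] + DIAG);  // 1
                         d[i] = Math.min(d[i], d[i - w]     + ORTHO); // 2
          if (x < w - 1) d[i] = Math.min(d[i], d[i - w + 1] + DIAG);  // 3
        }
      }
    // pass 2: right -> left, bottom -> top; neighbors 4, 5, 6, 7
    for (int y = h - 1; y >= 0; y--)
      for (int x = w - 1; x >= 0; x--) {
        int i = y * w + x;
        if (x < w - 1)   d[i] = Math.min(d[i], d[i + 1]     + ORTHO); // 4
        if (y < h - 1) {
          if (x < w - 1) d[i] = Math.min(d[i], d[i + w + 1] + DIAG);  // 5
                         d[i] = Math.min(d[i], d[i + w]     + ORTHO); // 6
          if (x > 0)     d[i] = Math.min(d[i], d[i + w - 1] + DIAG);  // 7
        }
      }
    return d;
  }
}
```

DRA and DDT keep the same two passes and window; only the update per pixel changes (and both additionally propagate the NN coordinates).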

Dead Reckoning Algorithm (DRA)

… looks up the 4 direct neighbors' distances and, same as in CDT, adds the pixel distance (1 or sqrt(2)). If that distance is smaller than the one currently saved at X, then the euclidean distance from X to the direct neighbor's NN is calculated and stored. The newly found NN is stored as well.

This is definitely a great improvement over CDT, because the distances are no longer summed up but recomputed each time. A small error can still be introduced, though, during the comparison of the distances (before actually computing them). Another drawback of this comparison is that the distance can't be squared, which would otherwise allow avoiding floating-point precision and the sqrt() call. Besides that, the results in general look as they are supposed to.

(DDT) My own algorithm

… (I couldn't think of a name, so it's just DiewaldDT, or whatever) was inspired by DRA. Mainly I was looking for a way to use only squared distances and avoid introducing errors as much as possible. This means I need to compute the distances to the 4 direct neighbors' NNs before comparing, and then compare the current NN-distance against these. Not, as in DRA, comparing the current NN-distance to the neighbors' NN-distances (including the estimated correction by the pixel distance).

Since it is now possible to work only with squared distances, there are no more square-root calls, and the squared distances can be saved as integral values.
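My reading of the update described above, as a hypothetical Java sketch (names made up; only the per-pixel comparison is shown, the two-pass window traversal stays the same as in CDT/DRA):

```java
// Sketch of the DDT per-pixel update as described in the post: compute the exact
// squared distance from the current pixel to each neighbor's stored NN site, and
// keep the smallest. Integer math only; no sqrt, no estimated pixel-distance.
public class DDTUpdate {

  // squared euclidean distance from pixel (x, y) to a nearest-neighbor site
  static int distSq(int x, int y, int nnX, int nnY) {
    int dx = x - nnX, dy = y - nnY;
    return dx * dx + dy * dy;
  }

  // returns {nnX, nnY, distSq} of the best NN among the current one and the
  // candidates propagated from the window neighbors
  public static int[] update(int x, int y, int[] bestNN, int bestDistSq, int[][] neighborNNs) {
    int[] nn = bestNN;
    int dd = bestDistSq;
    for (int[] cand : neighborNNs) {
      if (cand == null) continue; // neighbor has no NN yet
      int ddCand = distSq(x, y, cand[0], cand[1]);
      if (ddCand < dd) { dd = ddCand; nn = cand; }
    }
    return new int[]{nn[0], nn[1], dd};
  }
}
```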


While in my tests DRA was about 20-30% slower than CDT, my version (DDT) runs at almost the same speed as CDT, with, at least in theory, better (but not really visible, see images) results. These timings were taken for computing the distance field only, not including the render pass.

EDIT: for DRA and DDT it is useful to create a lookup table for the distances. Since there is only a limited number of possible distances, these can be pre-computed once and saved into a matrix (w, h). For DRA this is most advantageous, because square-root calls can be completely avoided during the distance transform, which makes it almost as fast as CDT.


Although the above results look correct for DRA and DDT, there is still another source of errors. During comparison we can only decide on one NN to keep, which is then propagated further to the other pixels. But there are cases where two or more neighbors are at equal distance. While for the current pixel it is fine to just keep one of them, for later pixels one of the other NNs might have been closer. The resulting artifacts are quite obvious in the following images. CDT seems to be a better choice here.


some other test-cases






Demos (Java Applets)

comparing/debugging http://www.openprocessing.org/sketch/107671
random samples http://www.openprocessing.org/sketch/107674
lena voronoi http://www.openprocessing.org/sketch/107673


Distance field: randomly placed samples, text and user drawing, varying distance factors and smooth color falloff across the screen.


Lena Voronoi: centroids are a set of chosen random samples, based on some SAT averaging.