Liuye Zhao CAD&CG Lab · ZJU

In-depth analysis of the Kawase method

0. Original literature

Formal citation

Masaki Kawase, “Frame Buffer Postprocessing Effects in DOUBLE‑S.T.E.A.L (Wreckless),” Game Developers Conference (GDC) 2003, San Jose, CA, March 21, 2003. Programming Track, Session 2003-Programming-14.

Engineering background

This technique first appeared in the Xbox game Wreckless: The Yakuza Missions (Japanese title DOUBLE‑S.T.E.A.L, Bunkasha Games, 2002). The hardware constraint was the NVIDIA NV2A (the Xbox GPU, clocked at about 233 MHz, with no general-purpose compute units), and HDR Bloom post-processing had to be completed within a single frame. Kawase's core contribution is to replace expensive large-kernel Gaussian convolution with multiple iterations of offset bilinear sampling. Each pass uses a fixed 4 texture fetches, so the per-pass cost is independent of the blur radius, which makes the method extremely efficient in GPU bandwidth.

1. Core mathematical principle: rigorous derivation of corrected Kawase

1.1 Original version vs. corrected version

Original Kawase (2003): each pass uses an integer offset $d_i \in \{0, 1, 2, \dots\}$ and samples four diagonal neighbors:

$$K_i^{\text{orig}}[I](x, y) = \frac{1}{4} \sum_{s_x, s_y \in \{-1, +1\}} I(x + s_x d_i,\; y + s_y d_i)$$

When $d_i = 0$, $K_i^{\text{orig}}$ is the identity operator and contributes zero to the effective variance; moreover, integer offsets fall exactly on integer grid points, so GPU hardware bilinear interpolation cannot be exploited.

Corrected version (the object analyzed in this document): the offset is changed to the half-integer value $o_i = d_i + 0.5$:

$$K_i[I](x, y) = \frac{1}{4} \sum_{s_x, s_y \in \{-1, +1\}} I(x + s_x o_i,\; y + s_y o_i) \tag{1}$$

The half-integer offset has two advantages: (a) the GPU texture unit automatically performs bilinear interpolation at half-integer coordinates, making each sample equivalent to a weighted average of 4 integer grid points and effectively increasing the sampling density; (b) when $d_i = 0$, the offset is still nonzero ($o_i = 0.5$), so every pass has a substantive contribution.

1.2 Equivalent convolution kernel of a single pass

Equation (1) can be viewed as the convolution of the image $I$ with the following kernel function:

$$k_i(x, y) = \frac{1}{4} \sum_{s_x, s_y \in \{-1, +1\}} \delta(x - s_x o_i,\; y - s_y o_i), \qquad o_i = d_i + 0.5 \tag{2}$$

Here $\delta$ is the Dirac delta (in the discrete setting, the Kronecker delta; coordinates at fractional pixels are defined through bilinear interpolation).

Key structure: $k_i$ is a diagonal four-point (corner-tap) kernel, with mass equally divided among the four positions $(\pm o_i, \pm o_i)$. It can be decomposed into the tensor product of two independent 1D symmetric two-point distributions:

$$k_i(x, y) = p_i(x)\, p_i(y), \qquad p_i(t) = \tfrac{1}{2}\left[\delta(t - o_i) + \delta(t + o_i)\right] \tag{3}$$

This decomposition is the key to the CLT argument in §1.6: the statistical independence of the two coordinate directions ensures the isotropy of the two-dimensional Gaussian limit.

1.3 Equivalent kernel of multiple cascaded passes

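Let the sequence be $\{d_1, \dots, d_N\}$, with offsets $o_i = d_i + 0.5$ and single-pass kernels $k_i$. The cascade of $N$ passes is equivalent to the successive convolution of all kernels:

$$k_{\text{total}} = k_1 * k_2 * \cdots * k_N \tag{4}$$

Because every $k_i$ is separable, the whole is also separable:

$$k_{\text{total}}(x, y) = P(x)\, P(y), \qquad P = p_1 * p_2 * \cdots * p_N \tag{5}$$

$P$ is the convolution of $N$ symmetric two-point distributions $p_i(t) = \tfrac12[\delta(t - o_i) + \delta(t + o_i)]$. Its support lies on the discrete points $\{\sum_i \pm o_i\}$, generated by $2^N$ sign combinations; the number of distinct mass points is at most $2^N$ and is smaller whenever different sign combinations give the same sum.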

1.4 Variance calculation for a single pass

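The mean of the two-point distribution $p_i$ (mass $\tfrac12$ at $\pm o_i$, with $o_i = d_i + 0.5$) is $0$ by symmetry, and its variance is:

$$\operatorname{Var}(p_i) = \tfrac12 o_i^2 + \tfrac12 o_i^2 = o_i^2 = (d_i + 0.5)^2 \tag{6}$$

Cascading linear filters convolves their kernels, and the variance of a convolution of probability kernels is the sum of the individual variances (exactly as for a sum of independent random variables). The total per-axis variance after cascading is therefore:

$$\sigma_K^2 = \sum_{i=1}^{N} (d_i + 0.5)^2 \tag{7}$$

Variance-matching condition: to match the second moment of the cascaded kernel with that of $\mathcal{N}(0, \sigma^2)$, we need

$$\sum_{i=1}^{N} (d_i + 0.5)^2 \approx \sigma^2 \tag{8}$$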
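The additivity of variances and the structure of the support can be checked by brute force. The sketch below enumerates every sign combination for the five-pass sequence {0,1,2,3,4}; the helper names `supportOfCascade` and `variance` are illustrative and do not come from the original implementation.

```javascript
// Enumerate all 2^N sign combinations of the per-axis offsets o_i = d_i + 0.5
// and return a map from each distinct support point to its probability mass.
function supportOfCascade(seq) {
    const offsets = seq.map(d => d + 0.5);
    const mass = new Map();                 // position -> probability
    const combos = 1 << offsets.length;     // 2^N sign patterns
    for (let bits = 0; bits < combos; bits++) {
        let x = 0;
        offsets.forEach((o, i) => { x += ((bits >> i) & 1) ? o : -o; });
        mass.set(x, (mass.get(x) || 0) + 1 / combos);
    }
    return mass;
}

// Variance of a zero-mean discrete distribution given as position -> mass.
function variance(mass) {
    let v = 0;
    for (const [x, p] of mass) v += p * x * x;
    return v;
}

const mass = supportOfCascade([0, 1, 2, 3, 4]);
console.log(`distinct points: ${mass.size}, variance: ${variance(mass)}`);
```

For this sequence the 32 sign combinations collapse onto 24 distinct positions, and the enumerated variance equals 0.25 + 2.25 + 6.25 + 12.25 + 20.25 = 41.25, the sum of the per-pass variances.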

1.5 Greedy sequence construction algorithm

Given the target $\sigma$ and the maximum number of passes MAX_PASSES, the following greedy algorithm (which exactly corresponds to our CUDA implementation compute_kawase_sequence) constructs a sequence that approximately satisfies Equation (8):

Algorithm: Kawase_Sequence(σ, MAX_PASSES)
  seq ← [], var ← 0, d ← 0
  while var < σ² and |seq| < MAX_PASSES:
      seq.append(d)
      var ← var + (d + 0.5)²
      if √var ≥ σ: break
      d ← d + 1
  return seq

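Correctness explanation: the algorithm greedily appends passes in the order $d = 0, 1, 2, \dots$ and stops at the first pass for which the accumulated variance reaches $\sigma^2$ (or when MAX_PASSES is hit). At each step it uses the smallest offset not yet used, so that the accumulated variance approaches $\sigma^2$ in the smallest available increments. Since $(d + 0.5)^2$ increases monotonically with $d$, small-offset passes contribute smaller variance, which is beneficial for fine control of the total variance. The final pass can overshoot the target by up to its own variance, which is why $\sigma_K^2 > \sigma^2$ in the untruncated rows below.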

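Sequences corresponding to each $r$ value (with $\sigma = r/3$ and MAX_PASSES = 12):

r     σ = r/3    Sequence {d_i}             N     σ_K²      σ_K       o_max    3σ      Coverage o_max/(3σ)
16    5.333      {0,1,2,3,4}                5     41.25     6.422     4.5      16.0    28.1%
32    10.667     {0,1,2,3,4,5,6,7}          8     170.00    13.038    7.5      32.0    23.4%
48    16.000     {0,1,…,9}                  10    332.50    18.235    9.5      48.0    19.8%
64    21.333     {0,1,…,11}                 12    575.00    23.979    11.5     64.0    18.0%
96    32.000     {0,1,…,11} (truncated)     12    575.00    23.979    11.5     96.0    12.0%

Core issue: once the target exceeds what 12 passes can deliver ($\sigma^2 > 575$, roughly $r > 72$), the sequence is truncated at $N = 12$ and $\sigma_K^2$ is fixed at 575 while the target $\sigma^2$ continues to increase. Variance matching then fails completely (when $r = 96$, $\sigma_K^2 = 575$ against a target of $\sigma^2 = 1024$).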
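The table can be regenerated with a few lines of host code. This is a minimal sketch: `computeKawaseSequence` mirrors the greedy algorithm above, while `kawaseStats` and its field names are hypothetical helpers introduced only for this illustration.

```javascript
// Greedy variance-matching sequence (same logic as the pseudocode above).
function computeKawaseSequence(sigma, maxPasses = 12) {
    const seq = [];
    let variance = 0;
    let d = 0;
    while (variance < sigma * sigma && seq.length < maxPasses) {
        seq.push(d);
        variance += (d + 0.5) * (d + 0.5);
        if (Math.sqrt(variance) >= sigma) break;
        d++;
    }
    return seq;
}

// Per-radius statistics; "coverage" follows the document's definition o_max / (3 sigma).
function kawaseStats(r, maxPasses = 12) {
    const sigma = r / 3;
    const seq = computeKawaseSequence(sigma, maxPasses);
    const varK = seq.reduce((acc, d) => acc + (d + 0.5) * (d + 0.5), 0);
    const oMax = seq.length > 0 ? seq[seq.length - 1] + 0.5 : 0;
    return {
        r, sigma, seq,
        passes: seq.length,
        varK,
        sigmaK: Math.sqrt(varK),
        oMax,
        coverage: oMax / (3 * sigma),
    };
}

for (const r of [16, 32, 48, 64, 96]) {
    const s = kawaseStats(r);
    console.log(r, s.passes, s.varK.toFixed(2), s.sigmaK.toFixed(3),
                (100 * s.coverage).toFixed(1) + "%");
}
```

Note that $r = 64$ also uses all 12 passes but still reaches its target, so only $r = 96$ is truncated in the sense that the loop exits with the variance below $\sigma^2$.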

1.6 Why Kawase can approximate a Gaussian: rigorous CLT argument

Theorem 1.1 (Gaussian limit of multi-pass Kawase)

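Let $X_1, \dots, X_N$ be independent random variables, where $X_i$ follows a symmetric two-point distribution: $P(X_i = +o_i) = P(X_i = -o_i) = \tfrac12$ with $o_i = d_i + 0.5$. Let $S_N = \sum_{i=1}^{N} X_i$ and $s_N^2 = \sum_{i=1}^{N} o_i^2$. If $s_N \to \infty$ and the Lindeberg condition is satisfied (see below), then

$$\frac{S_N}{s_N} \xrightarrow{\;d\;} \mathcal{N}(0, 1)$$

That is, the characteristic function of $S_N / s_N$ converges pointwise to $e^{-t^2/2}$ (the characteristic function of the standard normal distribution).

Verification of the Lindeberg condition: for any $\varepsilon > 0$,

$$L_N(\varepsilon) = \frac{1}{s_N^2} \sum_{i=1}^{N} \mathbb{E}\!\left[X_i^2\, \mathbf{1}\{|X_i| > \varepsilon s_N\}\right] \to 0$$

Since $|X_i| = o_i$ with probability 1, the condition $|X_i| > \varepsilon s_N$ is equivalent to $o_i > \varepsilon s_N$. The greedy sequence satisfies $\max_i o_i = N - 0.5$, while $s_N^2 = \sum_{i=1}^{N} (i - 0.5)^2 = N(4N^2 - 1)/12 \sim N^3/3$ (when the greedy sequence has $d_i = i - 1$). Therefore $\max_i o_i / s_N \sim \sqrt{3/N} \to 0$, so for sufficiently large $N$ the condition $o_i > \varepsilon s_N$ is not satisfied for any $i$, every indicator vanishes, and the Lindeberg condition holds automatically.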

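Corollary (convergence rate): by the Berry-Esseen theorem, for any $x \in \mathbb{R}$,

$$\left| F_N(x) - \Phi(x) \right| \le C\, \frac{\rho_N}{s_N^3}$$

where $s_N^2 = \sum_i o_i^2$, $\rho_N = \sum_i \mathbb{E}|X_i|^3 = \sum_i o_i^3$, $F_N$ is the CDF of $S_N / s_N$, and $\Phi$ is the standard normal CDF. The tabulated values below correspond to $C = 0.4785$; for summands that are not identically distributed, as here, the commonly cited constant is $C \le 0.56$, which would scale each bound by about 1.17. Numerically:

r     N     s_N²      ρ_N       Berry-Esseen bound
16    5     41.25     153.12    0.277
32    8     170.0     1016.0    0.219
48    10    332.5     2487.5    0.196
64    12    575.0     5166.0    0.179

The Berry-Esseen bound lies between 0.18 and 0.28, indicating that with a finite number of passes the convergence is partial rather than exact. This is the fundamental source of Kawase error.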
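The bound is straightforward to evaluate for any offset list. In the sketch below, `berryEsseenBound` is a hypothetical helper, and the default constant 0.4785 is an assumption chosen because it reproduces the tabulated values; pass a different constant to see how the bound scales.

```javascript
// Berry-Esseen bound C * rho / s^3 for independent symmetric two-point
// variables X_i = +/- o_i (so E|X_i|^3 = o_i^3 and Var X_i = o_i^2).
function berryEsseenBound(offsets, C = 0.4785) {
    const s2  = offsets.reduce((acc, o) => acc + o * o, 0);      // sum of variances
    const rho = offsets.reduce((acc, o) => acc + o * o * o, 0);  // sum of third absolute moments
    return C * rho / Math.pow(s2, 1.5);
}

// Offsets o_i = d_i + 0.5 of the greedy sequence with N passes.
function greedyOffsets(N) {
    return Array.from({ length: N }, (_, d) => d + 0.5);
}

for (const N of [5, 8, 10, 12]) {
    console.log(N, berryEsseenBound(greedyOffsets(N)).toFixed(3));
}
```

Because $\rho_N / s_N^3$ decays only like $N^{-1/2}$ for this sequence, adding passes tightens the bound slowly.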

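Isotropy in the two-dimensional case: since the cascaded kernel factors as $k_{\text{total}}(x, y) = P(x)\,P(y)$, and since the sign variables in the $x$ and $y$ directions are identically distributed and mutually independent (which is ensured by diagonal sampling with equal mass on all four corners), the CLT holds simultaneously in both directions, and the limit is the isotropic Gaussian $\mathcal{N}(0, \sigma_K^2 I_2)$ with $\sigma_K^2 = \sum_i o_i^2$.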

1.7 Core code implementation

1.7.1 CUDA implementation

// ─────────────────────────────────────────────────────────────────
// CLI: ./kawase_blur --input <path> --r <radius>
// sigma = r/3,  pass sequence: greedy variance matching
// ─────────────────────────────────────────────────────────────────

#include <cuda_runtime.h>          // texture objects, cudaMalloc, tex2D
#include <opencv2/opencv.hpp>
#include <cmath>
#include <cstdio>
#include <cstdlib>                 // std::atoi, exit
#include <utility>                 // std::swap
#include <vector>

#define CUDA_CHECK(x) do { \
    cudaError_t _e=(x); \
    if(_e!=cudaSuccess){fprintf(stderr,"CUDA %s:%d %s\n",__FILE__,__LINE__, \
    cudaGetErrorString(_e));exit(1);} \
} while(0)

static std::vector<int> compute_kawase_sequence(float sigma,
                                                 int max_passes = 12) {
    std::vector<int> seq;
    float var = 0.f;
    int   d   = 0;
    while (var < sigma * sigma && (int)seq.size() < max_passes) {
        seq.push_back(d);
        float o = d + 0.5f;
        var += o * o;
        if (std::sqrt(var) >= sigma) break;
        ++d;
    }
    return seq;
}

// ── GPU kernel: a single Kawase pass ─────────────────────────────
// Each pixel samples four diagonal corner taps at (±offset, ±offset)
// Use tex2D bilinear interpolation (bound to cudaTextureObject_t)
__global__ void kawase_pass_kernel(
        cudaTextureObject_t src_tex,
        float* __restrict__ dst,
        int W, int H,
        float offset)          // = d + 0.5f (unit: pixels)
{
    const int px = blockIdx.x * blockDim.x + threadIdx.x;
    const int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= W || py >= H) return;

    // Pixel coordinates (tex2D uses unnormalized mode here, rather than [0,1])
    float u = (float)px + 0.5f;   // texel center
    float v = (float)py + 0.5f;

    float s = 0.f;
    s += tex2D<float>(src_tex, u + offset, v + offset);
    s += tex2D<float>(src_tex, u - offset, v + offset);
    s += tex2D<float>(src_tex, u + offset, v - offset);
    s += tex2D<float>(src_tex, u - offset, v - offset);
    dst[py * W + px] = s * 0.25f;
}

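// Wrap a dense W x H float buffer as a 2D texture with bilinear filtering.
// tex2D with linear filtering needs a pitch2D (or cudaArray) resource;
// cudaResourceTypeLinear only supports tex1Dfetch with integer indices.
// Assumption: the row pitch W * sizeof(float) is a multiple of
// cudaDeviceProp::texturePitchAlignment; for other widths, allocate with
// cudaMallocPitch and pass the returned pitch instead.
static cudaTextureObject_t make_tex(const float* d_ptr, int W, int H) {
    cudaResourceDesc rdesc{};
    rdesc.resType                  = cudaResourceTypePitch2D;
    rdesc.res.pitch2D.devPtr       = const_cast<float*>(d_ptr);
    rdesc.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
    rdesc.res.pitch2D.width        = (size_t)W;
    rdesc.res.pitch2D.height       = (size_t)H;
    rdesc.res.pitch2D.pitchInBytes = (size_t)W * sizeof(float);

    cudaTextureDesc tdesc{};
    tdesc.filterMode       = cudaFilterModeLinear;
    tdesc.addressMode[0]   = cudaAddressModeClamp;
    tdesc.addressMode[1]   = cudaAddressModeClamp;
    tdesc.readMode         = cudaReadModeElementType;
    tdesc.normalizedCoords = 0;   // use pixel coordinates

    cudaTextureObject_t tex = 0;
    CUDA_CHECK(cudaCreateTextureObject(&tex, &rdesc, &tdesc, nullptr));
    return tex;
}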

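int main(int argc, char** argv) {
    // Expected invocation: ./kawase_blur --input <path> --r <radius>
    if (argc < 5) {
        fprintf(stderr, "usage: %s --input <path> --r <radius>\n", argv[0]);
        return 1;
    }
    const char* input_path = argv[2];
    int r = std::atoi(argv[4]);
    float sigma = r / 3.f;

    auto seq = compute_kawase_sequence(sigma);
    printf("sigma=%.3f, passes=%d, seq=", sigma, (int)seq.size());
    for (int d : seq) printf("%d ", d);
    printf("\n");

    // Load the image as single-channel float in [0,1].
    // (Grayscale input is an assumption; the original loading code is omitted.)
    cv::Mat img8 = cv::imread(input_path, cv::IMREAD_GRAYSCALE);
    if (img8.empty()) {
        fprintf(stderr, "cannot read %s\n", input_path);
        return 1;
    }
    cv::Mat img;
    img8.convertTo(img, CV_32F, 1.0 / 255.0);
    const int W = img.cols, H = img.rows;
    const size_t bytes = (size_t)W * H * sizeof(float);

    // Allocate two ping-pong buffers and upload the input into the first
    float *d_buf0 = nullptr, *d_buf1 = nullptr;
    CUDA_CHECK(cudaMalloc(&d_buf0, bytes));
    CUDA_CHECK(cudaMalloc(&d_buf1, bytes));
    CUDA_CHECK(cudaMemcpy(d_buf0, img.ptr<float>(), bytes, cudaMemcpyHostToDevice));

    dim3 block(16, 16);
    dim3 grid((W+15)/16, (H+15)/16);

    for (int d : seq) {
        float offset = (float)d + 0.5f;
        cudaTextureObject_t tex = make_tex(d_buf0, W, H);
        kawase_pass_kernel<<<grid, block>>>(tex, d_buf1, W, H, offset);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaDeviceSynchronize());
        CUDA_CHECK(cudaDestroyTextureObject(tex));
        std::swap(d_buf0, d_buf1);   // output of this pass is the next input
    }

    // The result is in d_buf0; download and save
    // (the output file name is a placeholder).
    CUDA_CHECK(cudaMemcpy(img.ptr<float>(), d_buf0, bytes, cudaMemcpyDeviceToHost));
    img.convertTo(img8, CV_8U, 255.0);
    cv::imwrite("kawase_out.png", img8);
    CUDA_CHECK(cudaFree(d_buf0));
    CUDA_CHECK(cudaFree(d_buf1));
    return 0;
}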

GPU acceleration points

  1. Ping-pong buffering: two blocks of device memory are alternately used as src/dst to avoid in-place read-write conflicts.
  2. Texture cache: reading through a cudaTextureObject_t routes samples through the per-SM L1/texture cache (128-byte cache lines); the sampling footprints of neighboring threads overlap heavily, so most of the four diagonal taps are cache hits.
  3. Hardware bilinear interpolation: tex2D automatically performs 4-tap weighting for fractional coordinates, effectively merging 4 integer-grid samples per fetch, improving sampling accuracy at no extra compute cost.
  4. Computational complexity: $O(1)$ per pixel per pass (4 fetches), completely independent of $\sigma$. Total texture bandwidth is $O(N \cdot W \cdot H)$ for $N$ passes.

1.7.2 GLSL implementation (Mobile/Web GPU, compatible with OpenGL ES 3.0)

// ─────────────────────────────────────────────────────────────────
// kawase_pass.glsl  ──  a single Kawase pass (Fragment Shader)
// The host code must call it in a loop, update uOffset each time, and blit to an FBO
// ─────────────────────────────────────────────────────────────────
#version 300 es
precision highp float;

uniform sampler2D uSrc;         // output of the previous pass (bound as a texture)
uniform vec2      uTexelSize;   // = vec2(1.0/width, 1.0/height)
uniform float     uOffset;      // = float(d) + 0.5, unit: pixels

in  vec2 vTexCoord;             // normalized UV [0,1]
out vec4 fragColor;

void main() {
    vec2 o = uOffset * uTexelSize;     // convert to normalized offset

    // Four diagonal corner taps
    vec4 s = vec4(0.0);
    s += texture(uSrc, vTexCoord + vec2( o.x,  o.y));
    s += texture(uSrc, vTexCoord + vec2(-o.x,  o.y));
    s += texture(uSrc, vTexCoord + vec2( o.x, -o.y));
    s += texture(uSrc, vTexCoord + vec2(-o.x, -o.y));
    fragColor = s * 0.25;
}

// ─────────────────────────────────────────────────────────────────
// kawase_sequence.js  ──  host-side sequence computation
// ─────────────────────────────────────────────────────────────────
/*
function computeKawaseSequence(sigma, maxPasses = 12) {
    const seq = [];
    let var_ = 0.0;
    let d    = 0;
    while (var_ < sigma * sigma && seq.length < maxPasses) {
        seq.push(d);
        const o = d + 0.5;
        var_ += o * o;
        if (Math.sqrt(var_) >= sigma) break;
        d++;
    }
    return seq;
}

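// Render loop: the first pass reads srcTex, later passes read the previous output.
// fboA, fboB, kawaseProg, width, height and drawFullscreenQuad are assumed to be
// set up by the surrounding application.
function applyKawaseBlur(gl, srcTex, sigma) {
    const seq = computeKawaseSequence(sigma);
    // Look up uniform locations once instead of once per pass
    const uOffsetLoc    = gl.getUniformLocation(kawaseProg, 'uOffset');
    const uTexelSizeLoc = gl.getUniformLocation(kawaseProg, 'uTexelSize');
    let src = srcTex;
    let [dst, spare] = [fboA, fboB];

    gl.useProgram(kawaseProg);
    gl.uniform2f(uTexelSizeLoc, 1.0/width, 1.0/height);
    for (const d of seq) {
        gl.bindFramebuffer(gl.FRAMEBUFFER, dst.fbo);
        gl.bindTexture(gl.TEXTURE_2D, src);
        gl.uniform1f(uOffsetLoc, d + 0.5);
        drawFullscreenQuad();
        src = dst.tex;                  // this output feeds the next pass
        [dst, spare] = [spare, dst];
    }
    return src;   // final result (srcTex itself if seq is empty)
}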
*/

GLSL acceleration points

  1. FBO ping-pong: each pass renders into the FBO whose texture is not currently bound as the source, so no pass reads and writes the same image; the GPU driver orders the dependent passes.
  2. GL_LINEAR filtering: set the sampler to GL_LINEAR + GL_CLAMP_TO_EDGE, equivalent to CUDA's bilinear clamp.
  3. Tile-based GPU optimization (mobile): on Mali/Adreno tile-based renderers, each pass's framebuffer writes stay in on-chip tile memory until the tile is resolved, reducing DRAM bandwidth. With only four fetches per pass, the bandwidth cost is a fraction of that of a separable Gaussian when $r$ is large, yielding significant savings.
  4. Pack RGBA: for color images, pack 4 channels into vec4, so one texture() completes sampling for all channels; the GPU texture unit processes RGBA in parallel.

2. Frequency-domain analysis: upper and lower bounds of the frequency response

2.1 2D DTFT of a single pass

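Start from the definition of the 2D discrete-time Fourier transform of the single-pass kernel $k_i$:

$$H_i(f_x, f_y) = \sum_{x, y} k_i(x, y)\, e^{-j 2\pi (f_x x + f_y y)}$$

Substituting Equation (2), whose four taps sit at $(\pm o_i, \pm o_i)$ with $o_i = d_i + 0.5$:

$$H_i(f_x, f_y) = \frac{1}{4} \left(e^{j 2\pi f_x o_i} + e^{-j 2\pi f_x o_i}\right) \left(e^{j 2\pi f_y o_i} + e^{-j 2\pi f_y o_i}\right)$$

Inside each parenthesis, $e^{j\theta} + e^{-j\theta} = 2\cos\theta$, hence

$$H_i(f_x, f_y) = \cos(2\pi f_x o_i)\, \cos(2\pi f_y o_i)$$

where $f_x, f_y$ are normalized frequencies (unit: cycles/pixel).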

2.2 Frequency response of multiple cascaded passes

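By the convolution theorem ($\mathcal{F}\{a * b\} = \mathcal{F}\{a\} \cdot \mathcal{F}\{b\}$), the total frequency response of $N$ cascaded passes with offsets $o_i = d_i + 0.5$ is the product of the response of each pass:

$$H_{\text{total}}(f_x, f_y) = \prod_{i=1}^{N} \cos(2\pi f_x o_i)\, \cos(2\pi f_y o_i)$$

The frequency response of the target Gaussian with standard deviation $\sigma$ (the Fourier transform of a Gaussian is still a Gaussian) is:

$$G(f_x, f_y) = \exp\!\left(-2\pi^2 \sigma^2 (f_x^2 + f_y^2)\right)$$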
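Both responses are cheap to evaluate numerically, which is useful for checking the later tables. This is a minimal sketch under the pure-cosine model of this section; the function names `kawaseResponse` and `gaussianResponse` are illustrative.

```javascript
// Cascaded Kawase response: product of cos(2*pi*fx*o) * cos(2*pi*fy*o)
// over the per-pass offsets o = d + 0.5 (frequencies in cycles/pixel).
function kawaseResponse(offsets, fx, fy) {
    return offsets.reduce(
        (h, o) => h * Math.cos(2 * Math.PI * fx * o) * Math.cos(2 * Math.PI * fy * o),
        1
    );
}

// Target Gaussian response exp(-2*pi^2*sigma^2*(fx^2 + fy^2)).
function gaussianResponse(sigma, fx, fy) {
    return Math.exp(-2 * Math.PI * Math.PI * sigma * sigma * (fx * fx + fy * fy));
}

// Example: r = 48 (sigma = 16, greedy sequence {0,...,9}) at a 50 px period.
const offsetsR48 = Array.from({ length: 10 }, (_, d) => d + 0.5);
const f50 = 1 / 50;
console.log("Kawase axial:", kawaseResponse(offsetsR48, f50, 0).toFixed(4));
console.log("Gaussian    :", gaussianResponse(16, f50, 0).toFixed(4));
```

At a 50 px period the axial Kawase response is far below the Gaussian target for $r = 48$, consistent with $\sigma_K > \sigma$ for that sequence.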

2.3 Rigorous proof of the upper bound

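Lemma 2.1 (cosine-exponential inequality): for all $|x| \le \pi/2$, $\cos x \le e^{-x^2/2}$.

Proof (term-by-term series comparison): expand both functions as power series in $x$:

$$\cos x = \sum_{k \ge 0} \frac{(-1)^k x^{2k}}{(2k)!}, \qquad e^{-x^2/2} = \sum_{k \ge 0} \frac{(-1)^k x^{2k}}{2^k\, k!}$$

The comparison rests on the fact that for every $k \ge 0$, $(2k)! \ge 2^k k!$, i.e., $\frac{1}{(2k)!} \le \frac{1}{2^k k!}$.

Decompose $(2k)!$ as:

$$(2k)! = (2k)!!\,(2k-1)!! = 2^k\, k!\,(2k-1)!!$$

Since $(2k-1)!! \ge 1$, we have $(2k)! \ge 2^k k!$.

Therefore, for every $k$, the magnitude of the $x^{2k}$ coefficient in the cosine series is no larger than the corresponding magnitude in the exponential series, with equality for $k = 0, 1$. Because both series alternate in sign, coefficient dominance alone does not settle the inequality; the terms have to be paired, or one can argue directly: $g(x) = \ln\cos x + x^2/2$ satisfies $g(0) = 0$ and $g'(x) = x - \tan x \le 0$ on $[0, \pi/2)$, so $g \le 0$ there, and by symmetry on $(-\pi/2, \pi/2)$. Thus

$$\cos x \le e^{-x^2/2}, \qquad |x| \le \pi/2$$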

Numerical verification of $1/(2k)! \le 1/(2^k k!)$:

k    1/(2k)! (cosine coefficient)    1/(2^k k!) (exponential coefficient)    Inequality holds
0    1.00000000                      1.00000000                              ✓ (=)
1    0.50000000                      0.50000000                              ✓ (=)
2    0.04166667                      0.12500000                              ✓ (<)
3    0.00138889                      0.02083333                              ✓ (<)
4    0.00002480                      0.00260417                              ✓ (<)
5    0.00000028                      0.00026042                              ✓ (<)

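Proposition 2.2 (upper bound of axial response): for $0 \le f \le 1/(4 o_{\max})$, so that every argument $2\pi f o_i$ lies in $[0, \pi/2]$,

$$H_{\text{axial}}(f) = \prod_{i=1}^{N} \cos(2\pi f o_i) \le \prod_{i=1}^{N} \exp\!\left(-2\pi^2 f^2 o_i^2\right) = \exp\!\left(-2\pi^2 \sigma_K^2 f^2\right), \qquad \sigma_K^2 = \sum_i o_i^2$$

This upper bound shows that Kawase's axial response does not exceed a Gaussian with parameter $\sigma_K$. A larger $\sigma$ means a lower frequency-domain Gaussian. When $\sigma_K > \sigma$ (the sequence variance exceeds the target, as in $r = 16$ to $64$), the upper-bound Gaussian lies below the target Gaussian $\exp(-2\pi^2\sigma^2 f^2)$, so in this band Kawase attenuates more than the target (over-blur); when $\sigma_K < \sigma$ (the truncated case at $r = 96$), the upper-bound Gaussian lies above the target Gaussian, which leaves room for a response above the target and is consistent with the systematic under-blur reported in §3.1.1.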

2.4 Derivation of lower bounds (two methods)

Method 1: Bernoulli product inequality

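Using $\cos x \ge 1 - x^2/2$ (for all $x$, which follows from the sign of the Taylor remainder), let $a_i = 2\pi^2 f^2 o_i^2$. Then, as long as every factor is nonnegative,

$$H_{\text{axial}}(f) = \prod_{i=1}^{N} \cos(2\pi f o_i) \ge \prod_{i=1}^{N} (1 - a_i)$$

By the Weierstrass product inequality: for $a_i \in [0, 1]$, $\prod_i (1 - a_i) \ge 1 - \sum_i a_i$ (by induction on the number of factors). Therefore

$$H_{\text{axial}}(f) \ge 1 - 2\pi^2 f^2 \sigma_K^2, \qquad \sigma_K^2 = \sum_i o_i^2$$

This lower bound is valid in the low-frequency region ($2\pi^2 f^2 o_{\max}^2 \le 1$) and is informative only while the right-hand side is positive. In the reported test case, the bound lies about 2% below the measured response.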

Method 2: logarithmic Taylor lower bound (tighter)

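The Maclaurin expansion of $\ln\cos x$ (convergence domain $|x| < \pi/2$) is:

$$\ln\cos x = -\frac{x^2}{2} - \frac{x^4}{12} - \frac{x^6}{45} - \cdots$$

Every term is negative, so truncating after the first two terms drops only negative contributions: $-x^2/2 - x^4/12$ is a fourth-order approximation that lies slightly above $\ln\cos x$. It becomes a rigorous lower bound once the $x^4$ coefficient is enlarged to absorb the tail; for $|x| \le 1$, for example,

$$-\frac{x^2}{2} - \frac{x^4}{8} \;\le\; \ln\cos x \;\le\; -\frac{x^2}{2} - \frac{x^4}{12}$$

Summing over all passes with $x = 2\pi f o_i$:

$$\ln H_{\text{axial}}(f) \approx -2\pi^2 \sigma_K^2 f^2 - \frac{4\pi^4}{3} M_4 f^4$$

where $\sigma_K^2 = \sum_i o_i^2$ and $M_4 = \sum_i o_i^4$. Therefore

$$H_{\text{axial}}(f) \approx \exp\!\left(-2\pi^2 \sigma_K^2 f^2 - \frac{4\pi^4}{3} M_4 f^4\right) \quad \text{and} \quad H_{\text{axial}}(f) \ge \exp\!\left(-2\pi^2 \sigma_K^2 f^2 - 2\pi^4 M_4 f^4\right) \ \text{ for } 2\pi f o_{\max} \le 1$$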

Numerical verification (, , , ):

The measured value is , and the lower-bound error is , which is quite tight.
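The bounds of §2.3 and §2.4 can be compared against the exact product for any sequence and frequency. The sketch below is illustrative: `axialBounds` and its field names are hypothetical, and the `safeLower` entry uses an $x^4$ coefficient of $1/8$ instead of $1/12$, an assumption that makes the expression a guaranteed lower bound whenever $2\pi f o_{\max} \le 1$ (the $1/12$ version omits negative tail terms and therefore sits slightly above the true response).

```javascript
// Exact axial response and its analytic bounds at frequency f (cycles/pixel).
function axialBounds(seq, f) {
    const offsets = seq.map(d => d + 0.5);
    const varK = offsets.reduce((acc, o) => acc + o * o, 0);          // sigma_K^2
    const m4   = offsets.reduce((acc, o) => acc + Math.pow(o, 4), 0); // M_4
    const w = 2 * Math.PI * f;
    const a = 0.5 * w * w * varK;            // = 2 * pi^2 * f^2 * sigma_K^2
    const q = Math.pow(w, 4) * m4;           // = (2*pi*f)^4 * M_4
    return {
        response:       offsets.reduce((h, o) => h * Math.cos(w * o), 1),
        upper:          Math.exp(-a),            // Proposition 2.2
        bernoulliLower: 1 - a,                   // Method 1
        fourthOrder:    Math.exp(-a - q / 12),   // two-term ln cos truncation
        safeLower:      Math.exp(-a - q / 8),    // valid when w * o_max <= 1
    };
}

const b = axialBounds([0, 1, 2, 3, 4], 0.01);    // r = 16 sequence at a 100 px period
console.log(b);
```

For this example the values are ordered as `bernoulliLower < safeLower < response < fourthOrder < upper`, with the fourth-order expression matching the exact response to about five decimal places.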

2.5 Deviation bound between the frequency response and the target Gaussian

Theorem 2.3 (deviation upper bound)

For any ,

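Low-frequency approximation ($f \to 0$, using $e^{-x} \approx 1 - x$ for $x \ll 1$ on both $H_{\text{axial}}(f) \approx \exp(-2\pi^2\sigma_K^2 f^2)$ and $G(f) = \exp(-2\pi^2\sigma^2 f^2)$):

$$\left| H_{\text{axial}}(f) - G(f) \right| \approx 2\pi^2 \left| \sigma_K^2 - \sigma^2 \right| f^2$$

The error coefficient $2\pi^2 |\Delta|$ (where $\Delta = \sigma_K^2 - \sigma^2$) directly quantifies the effect of variance-matching accuracy on low-frequency error:

r     σ²         σ_K²      |Δ|       Low-frequency error coefficient 2π²|Δ|
16    28.44      41.25     12.81     252.8
32    113.78     170.00    56.22     1109.8
48    256.00     332.50    76.50     1510.0
96    1024.00    575.00    449.00    8862.9

The error coefficient at $r = 96$ is 35 times that at $r = 16$, showing that the low-frequency error grows sharply at large $r$.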

2.6 Rigorous analysis of anisotropy

Direction dependence of the 2D frequency response

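For the frequency vector $(f_x, f_y) = (f\cos\theta, f\sin\theta)$ (radial frequency $f$, direction angle $\theta$):

$$H(f, \theta) = \prod_{i=1}^{N} \cos(2\pi f o_i \cos\theta)\, \cos(2\pi f o_i \sin\theta)$$

Special directions:

  • Axial ($\theta = 0°$): $H_{\text{axial}}(f) = \prod_i \cos(2\pi f o_i)$
  • Diagonal ($\theta = 45°$): $H_{\text{diag}}(f) = \prod_i \cos^2(\sqrt{2}\,\pi f o_i)$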

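Proposition 2.4 (diagonal response $\ge$ axial response)

For any $0 \le f \le 1/(2 o_{\max})$, $H_{\text{diag}}(f) \ge H_{\text{axial}}(f)$ whenever the axial product is nonnegative.

Proof: it is enough to prove for each factor that $\cos^2(x/\sqrt{2}) \ge \cos x$. Let $x = 2\pi f o_i$ and $g(x) = \cos^2(x/\sqrt{2}) - \cos x$.

Using the double-angle identity, $\cos^2(x/\sqrt{2}) = \tfrac12\left[1 + \cos(\sqrt{2}\,x)\right]$. More simply, expand at $x = 0$:

Expanding (using $\cos(\sqrt{2}\,x) = 1 - x^2 + \tfrac{x^4}{6} - \tfrac{x^6}{90} + \cdots$, $\cos x = 1 - \tfrac{x^2}{2} + \tfrac{x^4}{24} - \tfrac{x^6}{720} + \cdots$):

$$g(x) = \frac{x^4}{24} - \frac{x^6}{240} + O(x^8)$$

The leading term is positive, so $g \ge 0$ holds in a neighborhood of $x = 0$ (consistently, $g(0) = g'(0) = g''(0) = g'''(0) = 0$ and $g''''(0) = 1 > 0$). On all of $[0, \pi/2)$ the inequality follows from concavity: $\phi(t) = \ln\cos\sqrt{t}$ has a power series with only negative coefficients, hence is concave with $\phi(0) = 0$, so $2\phi(t/2) \ge \phi(t)$, which is exactly $\cos^2(x/\sqrt{2}) \ge \cos x$ with $t = x^2$. Finally, because $\cos x \le 0$ when $x \in [\pi/2, \pi]$ while $\cos^2 \ge 0$, the factor inequality also holds there.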

Numerical results ($r = 48$, $\sigma = 16$, sequence {0,1,…,9}):

Period    H_axial    H_diag    G (target)    Anisotropy difference (H_diag − H_axial)
50 px     0.0392     0.0567    0.1325        +0.018
60 px     0.1252     0.1445    0.2457        +0.019
68 px     0.2092     0.2265    0.3353        +0.017
80 px     0.3339     0.3468    0.4540        +0.013

Physical meaning: the diagonal response is at least as large as the axial response, meaning that diagonal content is systematically blurred less than axial content. At these periods both responses lie below the target Gaussian (over-blurred, since $\sigma_K > \sigma$ at $r = 48$), but the diagonal direction deviates less. As a result, diagonal texture retains more detail than horizontal/vertical texture in the blurred result, producing directional artifacts.
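The directional dependence can be reproduced by evaluating the cascaded response along an arbitrary angle. This is a small sketch under the pure-cosine model of §2.1; `responseAtAngle` is a hypothetical helper name, and the offsets are those of the greedy sequence for $r = 48$.

```javascript
// Cascaded Kawase response at radial frequency f (cycles/pixel) and
// direction theta (radians): fx = f*cos(theta), fy = f*sin(theta).
function responseAtAngle(offsets, f, theta) {
    const fx = f * Math.cos(theta);
    const fy = f * Math.sin(theta);
    return offsets.reduce(
        (h, o) => h * Math.cos(2 * Math.PI * fx * o) * Math.cos(2 * Math.PI * fy * o),
        1
    );
}

const offsets48 = Array.from({ length: 10 }, (_, d) => d + 0.5);  // d = 0..9
for (const period of [50, 60, 68, 80]) {
    const axial = responseAtAngle(offsets48, 1 / period, 0);
    const diag  = responseAtAngle(offsets48, 1 / period, Math.PI / 4);
    console.log(`${period} px  axial=${axial.toFixed(4)}  diag=${diag.toFixed(4)}  ` +
                `diff=${(diag - axial).toFixed(3)}`);
}
```

Sweeping `theta` between 0 and π/4 at a fixed period shows the response rising monotonically from the axial to the diagonal value in this frequency range.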

2.7 Distribution of frequency-domain zeros

When $2\pi f o_i = \pi/2$, i.e.,

$$f = \frac{1}{4 o_i} = \frac{1}{4 d_i + 2}$$

the response of the $i$-th pass is $0$, so the response of the entire chain is 0. The corresponding spatial period is $T = 4 d_i + 2$ (pixels):

d    Zero frequency    Zero period
0    1/2               2 px (Nyquist, normal)
1    1/6               6 px
2    1/10              10 px
3    1/14              14 px

For $r = 16$ (seq={0,1,2,3,4}), zeros occur at periods of 2, 6, 10, 14, and 18 px. If an image has fine texture (such as fish scales) whose period happens to fall at a zero, that frequency is completely removed, producing a sense of "breakage."

2.8 Summary of frequency-domain analysis

Advantages:

  1. $H(0, 0) = 1$, so DC is perfectly preserved; in the low-frequency region below the first zero, $f < 1/(4 o_{\max})$, the response decreases monotonically (no zero-crossing region).
  2. The computational cost is $O(N)$ per pixel (4 fetches per pass), with extremely high GPU bandwidth efficiency.
  3. By the CLT, as $N \to \infty$, the suitably normalized frequency response converges pointwise to a Gaussian, and the Berry-Esseen bound gives a concrete error magnitude.

Defects:

  1. Severely insufficient coverage (main defect): $o_{\max}/(3\sigma) \le 28\%$, so Gaussian tail energy is largely missing ($r = 96$ has only 12% coverage).
  2. Anisotropy: $H_{\text{diag}} \ge H_{\text{axial}}$ (maximum difference up to 0.019), so diagonal texture is systematically blurred less than axial texture.
  3. Frequency zeros: the response is zero at $f = 1/(4 d_i + 2)$, causing selective elimination.
  4. Phase reversal: after the zero-crossing, the response becomes negative (cosine becomes negative after crossing zero), which is equivalent to phase flipping and can produce "pseudo-ringing."

3. Cases where Kawase exhibits severe defects

3.1 Severe numerical defects

3.1.1 Truncation collapse of $\sigma_K^2$ at large $r$

When $\sigma^2 > 575$ (roughly $r > 72$), the greedy algorithm stops after reaching MAX_PASSES=12, so $\sigma_K^2$ is clamped at 575 while $\sigma^2$ continues to increase:

r     σ²         σ_K²      σ² − σ_K²                  σ_K/σ    Status
48    256.00     332.50    −76.50 (over-matched)      1.14     mild over-blur
64    455.11     575.00    −119.89 (over-matched)     1.12     mild over-blur
96    1024.00    575.00    +449.00 (under-matched)    0.749    severe under-blur

At $r = 96$, the effective $\sigma_K$ is only 75% of the target: the response across the entire mid-to-low frequency band is too high, and all content is under-blurred.

3.1.2 Worst frequencies and PSNR lower bounds for each $r$

For a single-frequency sinusoidal image of fixed contrast amplitude, the PSNR lower bound is determined by the frequency with the maximum error:

r     σ        Worst period T*    H_Kawase    G (target)    |H − G|    PSNR
16    5.33     22.5 px            0.120       0.330         0.210      20.5 dB
32    10.67    48.2 px            0.193       0.380         0.187      21.5 dB
48    16.00    68.6 px            0.216       0.342         0.126      24.9 dB
64    21.33    91.3 px            0.230       0.341         0.110      26.1 dB
96    32.00    129.4 px           0.496       0.299         0.196      21.1 dB

Pattern: the worst period satisfies $T^* \approx 4\sigma$ to $4.5\sigma$ (inside the Gaussian passband, where the response difference is largest). The PSNR values for $r = 16$ and $r = 96$ both drop to 20–21 dB, but for different reasons: the former is caused by the coarse overshoot and uneven zero distribution of a short pass sequence (Kawase response below the target), while the latter is caused by systematic under-blurring due to truncation of $\sigma_K^2$ (Kawase response above the target).

3.1.3 Precise characterization of triggering conditions

The lowest PSNR requires three conditions to be satisfied simultaneously:

  1. Worst-period condition: the image power spectrum is concentrated near the worst frequency $f^* = 1/T^*$, corresponding to $T^* \approx 4\sigma$.
  2. High-contrast condition: the image uses the full amplitude range (binary 0/255 image), where PSNR reaches its lowest value. Binary textures are worse than pure sinusoids because they contain odd harmonics such as $3f^*, 5f^*, \dots$, and the errors from all harmonics accumulate.
  3. Axial alignment condition: the texture is arranged horizontally or vertically (the deviation from the target is largest along the axial direction).

3.2 Severe visual defects

3.2.1 Under-blur ghosting

Triggering condition: (), and the image contains high-contrast regular texture with period (grids, fences, knitted patterns).

Quantitative mechanism: at the worst frequency , the Kawase response is , while the Gaussian target is —Kawase preserves too much energy (high-frequency detail), producing visually perceptible texture ghosting over the background.

Severity thresholds: when PSNR (corresponding to an RMS error of about 8 gray levels), the artifact is visible to the naked eye in high-contrast regions; when PSNR (RMS ), the artifact is already very obvious.

Typical cases: fishing-net image: Kawase PSNR=35.47 dB, and the grid contours remain visible to the naked eye. Curtain/blinds scene: Kawase PSNR=35.47 dB, and the diagonal blind stripes are clearer than in the reference.

3.2.2 Anisotropic stripe distortion

Triggering condition: the image contains both diagonal ($45°$) and axial ($0°$/$90°$) structures, and the texture period is close to the one where the anisotropy difference peaks (about 60 px at $r = 48$, where the difference reaches its maximum, 0.019).

Visual appearance: diagonal texture at the same spatial frequency appears clearer than axial texture in the blurred result, i.e., a directional “under-blur bias.” Diagonal shadow stripes and diagonal steel braces retain more detail than horizontal components.

PSNR impact: a purely axial image has PSNR=39.4 dB, while a purely diagonal image (same frequency) has PSNR=47.5 dB, a difference of 8 dB—quality varies dramatically with direction.

3.2.3 Frequency-zero elimination artifact

Triggering condition: small $r$ (short sequences such as $r = 16$), and the image contains fine texture with period $T = 4 d_i + 2$ px (a zero period of one of the passes).

Mechanism: at a zero frequency, Kawase completely eliminates that frequency, while neighboring frequencies still pass through—leading to texture “breakage” or banding (nonuniform frequency elimination).

3.2.4 Extreme case: systematic distortion at $r = 96$ ($\sigma = 32$)

$\sigma_K^2$ is truncated at 575, $\sigma_K / \sigma = 0.749$, and the entire spectrum is systematically too high (the under-blur amount is about 25%):

  • Large-scale gradients that should be smoothed still retain obvious gradients.
  • The blur transition band around high-contrast edges is only about 75% of the target, appearing visually "harder" and "sharper."
  • For architectural stripes with a period of about 129 px, the theoretical PSNR is only 21.1 dB, which is very easy to notice at a normal viewing distance.

4. Suitable and unsuitable scenarios for Kawase

4.1 Suitable scenarios

(a) Real-time game Bloom / DoF / Motion Blur post-processing

This is the original design goal of Kawase. On 2003 hardware, 5 passes covered a blur of , with a per-frame cost of (1080p). On modern hardware (RTX 4060), measurement shows that costs about 0.51 ms; compared with Separable FIR at , it is about 2.4× faster.

Applicable conditions:

  • Image content is dominated by low frequencies (skin, sky, large smooth gradients), so PSNR is naturally high ().
  • Viewing distance is large (TV/monitor ), so low-frequency errors are below the perceptual threshold.
  • The frame-rate requirement is strict (), which is a hard constraint.

(b) Mobile real-time rendering (bandwidth-sensitive)

For mobile GPUs (Mali, Adreno), bandwidth is the main bottleneck. Kawase uses only 4 fetches per pass (independent of ), so its bandwidth cost is only times that of separable FIR; the savings are significant when is large. Combined with Dual Filter (§5.1), bandwidth can be halved further.

(c) Preview/draft rendering stage

In real-time previews of offline rendering tools (such as Blender and Nuke), Kawase can provide a visual reference within , with final rendering switched to exact FIR.

(d) Medium-to-small blur amounts with

At this point , coverage is , and PSNR on natural images is , which is visually acceptable at a normal viewing distance.

4.2 Unsuitable scenarios

(a) High-frequency regular texture + large

Texture period , contrast , and : PSNR , with visible “ghost texture.” Representative content: fishing nets, knitted fabrics, fences, blinds.

(b) Precision scenes containing diagonal structures

Diagonal lines () coexist with axial lines, and : directional distortion occurs (axial PSNR is about 8 dB lower than diagonal PSNR), which is obvious in architectural and industrial scenes.

(c) Large blur amounts with

Truncation of causes (a 25% gap at ), so the overall image is under-blurred and does not meet the design requirement.

(d) Accuracy requirements with PSNR

For any natural image containing content with , Kawase cannot reach 40 dB; IIR or exact FIR must be used instead.

(e) Medical imaging, scientific visualization, and satellite image processing

These scenarios require isotropy (frequency response independent of direction), no frequency zeros, and PSNR ; Kawase satisfies none of these requirements.

(f) Image-processing pipelines requiring exact Gaussian semantics

Examples include scale-space analysis, Harris corner detection, and SIFT descriptor extraction: these algorithms strongly depend on the mathematical properties of the Gaussian kernel (isotropy, no sidelobes). Kawase’s approximation error affects feature accuracy and repeatability.

5. Follow-up improvements

5.1 Dual Filter (Bjørge, SIGGRAPH 2015)

Original text

Marius Bjørge, “Bandwidth-Efficient Rendering,” ARM Ltd. SIGGRAPH 2015 Advances in Real-Time Rendering in Games, Los Angeles, August 2015.

Core idea

Replace same-resolution iterations with downsample-upsample pass pairs, using the reduced pixel count after resolution reduction to save bandwidth:

Downsample pass: reduce to while sampling the center of the current pixel and the midpoints in four adjacent directions (each with offsets ):

Upsample pass: restore back to , and compute a weighted sum over the 4 offset points :

Frequency-domain analysis

The DTFT of the downsample pass (assuming downsampling factor 2) is:

The DTFT of the upsample pass (defined on the low-resolution grid) is:

The combined response of one D-U pair (with frequency scaling taken into account) is complicated due to aliasing from downsampling; in practice, its effect is equivalent to a Gaussian approximation with an effective radius of about 2.

Bandwidth analysis

Assume the image is . The total number of fetches for Kawase passes is . Dual Filter uses downsampling levels: the total pixel count is , and the total number of fetches is approximately , about less bandwidth than Kawase.

Comparison with Kawase

Metric                       Kawase                        Dual Filter
Number of passes (r = 48)    10                            about 6 (3 down + 3 up)
Bandwidth                    baseline                      about half of Kawase
Anisotropy                   medium (diagonal sampling)    higher (cross sampling is weaker in the diagonal direction)
Edge ringing                 small                         more obvious (aliasing introduced by resolution switching)
Applicable range             medium-to-small r             large r (each level covers a wider range)

Limitations: Dual Filter has more severe anisotropy than Kawase (cross-shaped sampling has lower sampling density in diagonal directions), and resolution switching (downsample/upsample) introduces additional aliasing and ringing, which is visible at high-contrast edges (such as text and sharp geometry).

5.2 Numerical optimization of the pass sequence

Motivation: the greedy algorithm is approximately optimal for variance matching, but it does not consider the overall shape of the frequency-response error (maximum error, mean squared error, etc.).

Optimization problem: given the number of passes $N$ and the target $\sigma$, minimize over the real-valued offset space $(o_1, \dots, o_N) \in \mathbb{R}_{>0}^N$ a weighted error functional of the form:

$$J(o_1, \dots, o_N) = \int_0^{1/2} w(f)\, \left[ H_{\text{axial}}(f; o_1, \dots, o_N) - G(f) \right]^2 df$$

where $w(f)$ is a Gaussian weighting that makes low-frequency errors more important.

Optimization methods: CMA-ES (Covariance Matrix Adaptation Evolution Strategy) or L-BFGS-B (gradients can be computed through automatic differentiation).

Result: for $r = 32$ ($N = 8$), optimization reduces the worst frequency error from the greedy value of 0.187 to about 0.140 (an improvement of about 25%), and improves the PSNR lower bound from 21.5 dB to about 23.5 dB. The cost is that offsets are real values (handled automatically by GPU texture interpolation), but the sequence varies with $\sigma$, so a lookup table must be stored for each $\sigma$ (about 80 entries, with negligible storage cost).
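To make the optimization target concrete, the sketch below evaluates a weighted squared error between the axial Kawase response and the Gaussian target, then runs a deliberately crude search. Everything here is an assumption for illustration: the weight $w(f) = G(f)$, the Riemann-sum integration, and the one-parameter search over a uniform scale factor stand in for the CMA-ES or L-BFGS-B optimization over all $N$ offsets described above, and the names `weightedError` and `bestUniformScale` are hypothetical.

```javascript
// Axial response of a cascade with real-valued offsets.
function axialResponse(offsets, f) {
    return offsets.reduce((h, o) => h * Math.cos(2 * Math.PI * f * o), 1);
}

// Weighted squared error over f in [0, 0.5], approximated by a Riemann sum.
// Assumption: the weight is the target Gaussian itself, w(f) = G(f).
function weightedError(offsets, sigma, samples = 2000) {
    const df = 0.5 / samples;
    let J = 0;
    for (let k = 0; k <= samples; k++) {
        const f = k * df;
        const G = Math.exp(-2 * Math.PI * Math.PI * sigma * sigma * f * f);
        const e = axialResponse(offsets, f) - G;
        J += G * e * e * df;
    }
    return J;
}

// Crude stand-in for a real optimizer: scale all offsets by one factor s
// in {0.60, 0.61, ..., 1.20} and keep the best (s = 1 is the greedy sequence).
function bestUniformScale(offsets, sigma) {
    let best = { scale: 1, J: weightedError(offsets, sigma) };
    for (let k = 60; k <= 120; k++) {
        const s = k / 100;
        const J = weightedError(offsets.map(o => o * s), sigma);
        if (J < best.J) best = { scale: s, J };
    }
    return best;
}

const greedy32 = Array.from({ length: 8 }, (_, d) => d + 0.5);   // r = 32, N = 8
const sigma32 = 32 / 3;
console.log("greedy J:", weightedError(greedy32, sigma32));
console.log("best uniform scale:", bestUniformScale(greedy32, sigma32));
```

Even this one-parameter search lowers the error for $r = 32$, because the greedy sequence overshoots the target variance (170 against 113.78); a full $N$-dimensional optimizer can additionally reshape the response rather than only rescale it.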

5.3 Variable-Rate Kawase (spatial adaptivity)

Idea: smooth regions in the image (low local variance) have lower blur-quality requirements and can use more passes to ensure accuracy; texture-dense regions (high local variance) have lower human sensitivity to blur accuracy (texture masking effect), so fewer passes can be used to save time.

Implementation:

  1. Offline compute the local variance map of the image (or use the material ID from the G-Buffer).
  2. Quantize the variance map into 3–4 quality levels, each corresponding to a different number of passes.
  3. Implement adaptivity on the GPU through predicated execution or early-exit.

Typical benefit: in game scenes, about 60% of the area is smooth sky/ground and can be reduced to ; 30% is medium complexity and uses ; 10% is high-frequency texture and uses . The average is , saving about 50% GPU time compared with fixed .

Limitation: computing the variance map itself has extra cost (about 0.1 ms), and it is not suitable for effects such as real-time DoF that require an accurate CoC (Circle of Confusion).

5.4 Isotropic Correction

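Principle: add one "axial compensation pass" at the end of the Kawase chain, sampling $(\pm o, 0)$ and $(0, \pm o)$ (cross shape) instead of $(\pm o, \pm o)$ (diagonal shape), and linearly mix the two results with a weight $\alpha \in [0, 1]$:

$$I_{\text{out}} = (1 - \alpha)\, I_{\text{corner}} + \alpha\, I_{\text{cross}}$$

Derivation of the optimal weight: the frequency response of a single cross pass is $H_{\text{cross}}(f_x, f_y) = \tfrac12\left[\cos(2\pi f_x o) + \cos(2\pi f_y o)\right]$ (axial sampling), which has a different structure from the corner frequency response $H_{\text{corner}}(f_x, f_y) = \cos(2\pi f_x o)\cos(2\pi f_y o)$.

On the axis ($f_y = 0$) the cross response is $\tfrac12\left[1 + \cos(2\pi f o)\right]$, which is never smaller than the corner response $\cos(2\pi f o)$; on the diagonal ($f_x = f_y = f/\sqrt{2}$) it is $\cos(\sqrt{2}\,\pi f o)$, compared with $\cos^2(\sqrt{2}\,\pi f o)$ for the corner pass. In the axial direction after mixing: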

In the diagonal direction (after corner passes):

Choosing (for ) can reduce the maximum anisotropy error from 0.019 to (about a 5× improvement), at the cost of one additional cross pass (about 10% more bandwidth).