Lecture 4: Portrait Morphing Based on StyleGAN and 2D Polygon Shape Blending

Computer Animation, Lecture 4

I. Morphing based on StyleGAN

1.1. Generative Adversarial Networks (GAN)

Before delving into StyleGAN, we must first understand its cornerstone—Generative Adversarial Networks (GANs).

In 2014, Ian Goodfellow et al. proposed GANs, which revolutionized the field of generative models. The core idea of GANs originates from the “zero-sum game” concept in game theory. It consists of two competing neural networks:

Generator (G): Its task is to learn the distribution of real data, thereby generating new, indistinguishable “fake” data. It’s like a forger trying to create a fake that can fool experts.
Discriminator (D): Its task is to determine whether the input data comes from a real dataset or was generated by the generator. It’s like an art expert trying to distinguish between genuine and fake data.

Mathematical Principle: Minimax Game

The training process of GAN is a Minimax game. Assume the real data distribution is $p_{data} (x)$ , the generator samples noise $z$ from a simple prior distribution (such as Gaussian distribution) $p_{z} (z)$ , and generates samples $G (z)$ . The discriminator $D (x)$ outputs a scalar, representing the probability that $x$ comes from real data.

Our goal is to find a Nash equilibrium at which the generator's distribution $p_{g}$ matches the real data distribution $p_{data}$ . This can be achieved by optimizing the following value function $V (D, G)$ :

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)} [\log(1 - D(G(z)))]$$

For the discriminator D: Its goal is to maximize $V (D, G)$ . When the input is a real sample $x$ , it wants $D (x)$ to be close to 1; when the input is a generated sample $G (z)$ , it wants $D (G (z))$ to be close to 0.
For the generator G: Its goal is to minimize $V (D, G)$ . It wants its generated samples $G (z)$ to fool the discriminator, i.e., to make $D (G (z))$ close to 1, which is equivalent to minimizing $\log (1 - D (G (z)))$ .

Through this adversarial training, the discriminator and generator co-evolve. Ultimately, when the generator can produce samples indistinguishable from real data, the discriminator’s output will be 0.5 for any sample, and the system reaches equilibrium.
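As a rough numerical illustration of the value function (a sketch with made-up discriminator outputs, not a training loop), the snippet below estimates $V (D, G)$ from batches of discriminator probabilities. The function name `gan_value` and the sample numbers are illustrative assumptions.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A confident, correct discriminator pushes V toward its upper bound of 0.
v_strong_d = gan_value([0.99, 0.98], [0.01, 0.02])

# At equilibrium D outputs 0.5 everywhere, so V = 2 * log(0.5) = -log(4).
v_equilibrium = gan_value([0.5, 0.5], [0.5, 0.5])

print(v_strong_d, v_equilibrium)
```

The discriminator tries to move the first number toward 0, while the generator tries to pull the value down by making `d_fake` large.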

The introduction of GANs is a milestone in the field of deep learning. In 2019, three pioneers of deep learning, Geoffrey Hinton, Yoshua Bengio, and Yann LeCun, were jointly named recipients of the 2018 ACM A.M. Turing Award. Yann LeCun once described GAN adversarial training as “the coolest thing since sliced bread,” demonstrating its significant impact. From blurry faces in 2014 to photorealistic images after 2017, GANs have developed at an astonishing pace.

1.2. Application of GANs in Face Generation: From Pixels to Probability Distributions

How do we view the problem of face generation? An $n \times n$ color face image can be seen as a point in an $N = n \times n \times 3$ dimensional space. However, most points in this high-dimensional space correspond to meaningless noise, and only a very small subspace (a complex manifold) corresponds to images that “look like faces.”

The goal of GAN is to learn the probability distribution on this “face manifold.” We cannot describe this distribution with a clear formula, but we can implicitly learn it by training a generator.

The generator $G$ can be seen as a complex nonlinear function that maps a random vector $z$ from a simple, low-dimensional latent space to the high-dimensional image space, and its output $G (z)$ follows the desired face probability distribution.

$$G : Z \to X$$

where $Z$ is the latent space (e.g., $R^{512}$ ) and $X$ is the image space (e.g., $R^{1024 \times 1024 \times 3}$ ).

1.3. StyleGAN: A Style-Based Generator Architecture

Traditional GANs take the latent code $z$ directly as input and generate images through a series of transposed convolutional layers (upsampling). However, StyleGAN, proposed by NVIDIA in 2019, made revolutionary improvements to this.

Core Idea: StyleGAN treats image generation as controlling “styles” at different levels of detail. It no longer feeds the latent code directly into the generator network but uses the latent code to modulate the behavior of each layer of the generator network.

Two Major Advantages of StyleGAN:

High Generation Quality: The generated images reached a new level in terms of resolution and realism.
Superior Latent Space Properties: Its latent space has good Disentanglement properties, meaning that different directions in the latent space correspond to different, interpretable semantic attributes (such as age, hairstyle, gender, pose, etc.).

StyleGAN Architecture Analysis:

From Z Space to W Space:
- StyleGAN has two latent spaces. The first is the traditional Z space, usually a 512-dimensional standard Gaussian distribution.
- StyleGAN introduces a Mapping Network $f$ composed of 8 fully connected layers, which maps $z \in Z$ to an intermediate latent space—W space—to obtain $w = f (z), w \in W$ .
- Why is W space needed? Z space must follow a fixed prior: samples from a high-dimensional standard Gaussian concentrate near the surface of a hypersphere. To fit the extremely complex data distribution of the real world, the generator has to warp this fixed distribution in a highly nonlinear way. This distortion leads to the “entanglement” of features, meaning that one direction may control multiple semantic attributes simultaneously. The role of the mapping network $f$ is to “untangle” this entanglement, making the distribution in W space better match the feature distribution of real data and making attributes more linearly separable, thus achieving better disentanglement.
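The mapping from Z to W can be sketched as a small multilayer perceptron. This is only a toy stand-in under stated assumptions: the weights are random placeholders rather than trained values, and the input normalization and leaky ReLU activation are assumptions of this sketch rather than details given in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 512   # dimensionality of both Z and W, as in the text
NUM_LAYERS = 8     # the mapping network has 8 fully connected layers

# Placeholder weights: a trained StyleGAN would supply learned values here.
weights = [rng.normal(0.0, 1.0 / np.sqrt(LATENT_DIM), (LATENT_DIM, LATENT_DIM))
           for _ in range(NUM_LAYERS)]
biases = [np.zeros(LATENT_DIM) for _ in range(NUM_LAYERS)]

def mapping_network(z):
    """Map z in Z to w in W with an 8-layer MLP (toy stand-in for f)."""
    # Normalize z to unit RMS so the input scale does not matter (assumption).
    h = z / np.sqrt(np.mean(z ** 2) + 1e-8)
    for weight, bias in zip(weights, biases):
        h = h @ weight + bias
        h = np.where(h > 0, h, 0.2 * h)  # leaky ReLU with slope 0.2 (assumption)
    return h

z = rng.standard_normal(LATENT_DIM)  # z ~ N(0, I)
w = mapping_network(z)
print(w.shape)
```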
Style Modulation:
- The input to the generator network (called Synthesis Network $g$ ) is no longer the latent code but a learnable fixed constant tensor.
- The intermediate latent code $w$ is used to generate a “style” vector for each layer of the synthesis network through a learnable Affine Transform A.
- This style vector controls the output feature map of each layer through Adaptive Instance Normalization (AdaIN).
The mathematical formula for AdaIN is:
$AdaIN (x_{i}, y) = y_{s, i} \frac{x _{i} - μ ( x _{i} )}{σ ( x _{i} )} + y_{b, i}$
where $x_{i}$ is the $i$ -th feature map, and $μ (x_{i})$ and $σ (x_{i})$ are its mean and standard deviation. $y = (y_{s, i}, y_{b, i})$ is the style vector obtained from $w$ through the affine transformation A, representing the scaling factor and bias, respectively. This operation essentially replaces the statistical properties (style) of the feature map with the statistical properties specified by $w$ .
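The AdaIN formula above can be checked with a few lines of NumPy. This is a minimal sketch for a single image; the function name `adain`, the `(C, H, W)` layout, and the small `eps` added for numerical safety are assumptions of the example.

```python
import numpy as np

def adain(x, y_s, y_b, eps=1e-8):
    """Adaptive Instance Normalization for one image.

    x:   feature maps, shape (C, H, W)
    y_s: per-channel scale from the style vector, shape (C,)
    y_b: per-channel bias from the style vector, shape (C,)
    """
    mu = x.mean(axis=(1, 2), keepdims=True)      # per-channel mean
    sigma = x.std(axis=(1, 2), keepdims=True)    # per-channel std
    x_norm = (x - mu) / (sigma + eps)
    return y_s[:, None, None] * x_norm + y_b[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(3.0, 5.0, size=(4, 8, 8))   # arbitrary input statistics
y_s = np.array([1.0, 2.0, 0.5, 3.0])
y_b = np.array([0.0, -1.0, 4.0, 2.0])
out = adain(x, y_s, y_b)

# Each output channel now has mean y_b and standard deviation y_s,
# regardless of the statistics the input channel had.
print(out.mean(axis=(1, 2)), out.std(axis=(1, 2)))
```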
Hierarchical Style Control: StyleGAN’s synthesis network usually has 18 layers (corresponding to $1024 \times 1024$ resolution). The style at different levels controls different granularities of image features:
- Coarse styles (Layers 1-4): Control high-level, coarse-grained features such as pose, face shape, and hairstyle outline.
- Middle styles (Layers 5-8): Control medium-grained features such as facial details (eyes, nose) and hair texture.
- Fine styles (Layers 9-18): Control details, colors, and lighting, such as skin color, hair color, lighting direction, background, etc.
By injecting style from different $w$ vectors at different levels, Style Mixing can be achieved. For example, combining the pose of A with the skin color of B can generate a new, non-existent face.
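Style mixing amounts to choosing, per layer, which latent code supplies the style. The sketch below only builds the per-layer code array; the helper names `broadcast_w` and `style_mix` are hypothetical, the random vectors stand in for real latent codes, and the synthesis network that would consume the result is not included.

```python
import numpy as np

NUM_LAYERS = 18   # style inputs at 1024 x 1024 resolution
W_DIM = 512

def broadcast_w(w):
    """Repeat one w vector for every synthesis layer -> shape (18, 512)."""
    return np.tile(w, (NUM_LAYERS, 1))

def style_mix(w_a, w_b, crossover):
    """Use w_a for layers [0, crossover) and w_b for the remaining layers."""
    mixed = broadcast_w(w_b)
    mixed[:crossover] = w_a
    return mixed

rng = np.random.default_rng(0)
w_a = rng.standard_normal(W_DIM)  # stand-in for face A's latent code
w_b = rng.standard_normal(W_DIM)  # stand-in for face B's latent code

# Coarse layers 1-4 (indices 0-3) from A, middle and fine layers from B.
w_mixed = style_mix(w_a, w_b, crossover=4)
print(w_mixed.shape)
```

Moving `crossover` to 8 would also take the middle styles from A and leave only the fine styles (color, lighting) to B.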

1.4. Image Morphing in StyleGAN

The W space of StyleGAN, due to its good disentanglement and smoothness, is very suitable for interpolation.

Core Principle: In the W-space, points along the linear path between two points $w_{1}$ and $w_{2}$ (corresponding to images $I_{1}$ and $I_{2}$ respectively) generate intermediate images with a visually smooth transition.

Implementation Method: Given two latent codes $w_{1}$ and $w_{2}$ , we can generate an intermediate latent code $w_{interp}$ using simple linear interpolation:

$$w_{interp}(t) = (1 - t) w_{1} + t w_{2}, \quad t \in [0, 1]$$

Then, $w_{interp} (t)$ is fed into the synthesis network $g$ to obtain the frame $I (t) = g (w_{interp} (t))$ of the morph at time $t$ .

Thanks to the superior properties of the W-space, this simple linear interpolation can produce extremely natural, high-quality morphing effects. We can even interpolate only selected style levels; for example, interpolating only the coarse styles changes face shape and pose while keeping skin tone and lighting unchanged, which allows more creative control.
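The interpolation itself is only a few lines. The following sketch produces the latent codes for a morph sequence and also shows level-selective interpolation; `lerp_latents` and `lerp_layers` are hypothetical helper names, the random vectors stand in for real latent codes, and the call to the synthesis network $g$ is left out.

```python
import numpy as np

def lerp_latents(w1, w2, num_frames):
    """Return num_frames latent codes on the straight line from w1 to w2."""
    ts = np.linspace(0.0, 1.0, num_frames)
    return np.stack([(1.0 - t) * w1 + t * w2 for t in ts])

def lerp_layers(w1_layers, w2_layers, t, layers):
    """Interpolate only the selected layers; the others keep w1's style.

    w1_layers, w2_layers: per-layer codes of shape (num_layers, dim)
    layers: slice or index array selecting the layers to interpolate
    """
    out = w1_layers.copy()
    out[layers] = (1.0 - t) * w1_layers[layers] + t * w2_layers[layers]
    return out

rng = np.random.default_rng(0)
w1 = rng.standard_normal(512)
w2 = rng.standard_normal(512)
frames = lerp_latents(w1, w2, num_frames=5)   # shape (5, 512)

# Per-layer codes (18 layers); interpolate only the coarse layers 1-4.
w1_layers = np.tile(w1, (18, 1))
w2_layers = np.tile(w2, (18, 1))
half_coarse = lerp_layers(w1_layers, w2_layers, t=0.5, layers=slice(0, 4))
# Each code would then be passed to the synthesis network g (not included here).
```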

II. Morphing based on Diffusion Model

Although StyleGAN performs well in specific domains (such as faces), Diffusion Models have become the new SOTA (State-of-the-Art) in image generation in recent years.

2.1. Basic Principles of the Diffusion Model

The diffusion model comprises two processes:

Forward/Diffusion Process: This is a fixed process that progressively adds Gaussian noise to a clear image $x_{0}$ . After $T$ steps, the image becomes pure noise $x_{T} \sim N (0, I)$ . The noise addition process at step $t$ can be represented as:

$$q(x_{t} \mid x_{t-1}) = N(x_{t}; \sqrt{1 - β_{t}} \, x_{t-1}, β_{t} I)$$

where $β_{t}$ is a preset, small constant (noise variance) that increases with $t$ .

Reverse/Denoising Process: This is the process the model needs to learn. It starts with pure noise $x_{T}$ and removes noise step by step until a clear image $x_{0}$ is recovered. Each reverse step is parameterized by a neural network (usually a U-Net architecture) $p_{θ}$ :

$$p_{θ}(x_{t-1} \mid x_{t}) = N(x_{t-1}; μ_{θ}(x_{t}, t), Σ_{θ}(x_{t}, t))$$

In practice, the model usually does not predict the mean $μ_{θ}$ directly, but rather the noise $ϵ$ that was added to produce $x_{t}$ . The loss function makes the noise predicted by the model $ϵ_{θ} (x_{t}, t)$ as close as possible to the actual added noise $ϵ$ , typically by minimizing $\| ϵ - ϵ_{θ} (x_{t}, t) \|^{2}$ .
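The forward process and the noise-prediction loss can be sketched on a toy image. The linear $β_{t}$ schedule, the value $T = 1000$, and the helper names are assumptions of this example; `q_sample` uses the standard closed form of the forward chain, $x_{t} = \sqrt{\bar{α}_{t}} x_{0} + \sqrt{1 - \bar{α}_{t}} ϵ$ with $\bar{α}_{t} = \prod_{s \le t} (1 - β_{s})$ , which is not derived in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)      # cumulative product of (1 - beta_t)

def forward_step(x_prev, beta_t):
    """One step of q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def q_sample(x0, t, eps):
    """Jump directly to step t (1-based) using the closed form of the chain."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and actual noise."""
    return np.mean((eps_pred - eps) ** 2)

x0 = np.ones((64, 64))                    # toy "image"
x = x0.copy()
for beta_t in betas:                      # run the full forward chain
    x = forward_step(x, beta_t)
print(x.mean(), x.std())                  # roughly 0 and 1: x_T is close to N(0, I)

eps = rng.standard_normal(x0.shape)
x_t = q_sample(x0, t=500, eps=eps)
# Zeros stand in for an untrained model; a perfect predictor would return eps.
print(noise_prediction_loss(np.zeros_like(eps), eps))
```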

2.2. Controlling Diffusion Models with ControlNet

While large diffusion models like Stable Diffusion are powerful, their generation process is primarily guided by text prompts, making fine-grained structural control difficult. ControlNet is a revolutionary technique that adds additional conditional control to large pre-trained models without compromising their structure.

How it Works: ControlNet freezes the original pre-trained model (such as Stable Diffusion’s U-Net) and then creates a trainable copy of its blocks. This copy receives additional conditional inputs (such as Canny edge maps, human pose skeletons, depth maps, etc.) and learns how to adjust the generation process based on these conditions. The output of the copy is added to the corresponding layer of the original model through special “zero convolution” layers, i.e., convolutions whose weights and biases are initialized to zero. Because these layers output zero at the start of training, they do not degrade the performance of the original model but rather act as a plug-in, progressively injecting control information into the generation process.
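The effect of zero initialization can be seen in a toy version of a single controlled block. This is a sketch under strong simplifications: a linear layer with ReLU stands in for a U-Net block, plain matrices stand in for the zero convolutions, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

def block(x, weight):
    """Toy stand-in for one U-Net block: linear layer + ReLU."""
    return np.maximum(x @ weight, 0.0)

frozen_weight = rng.standard_normal((DIM, DIM))   # pre-trained, never updated
copy_weight = frozen_weight.copy()                # trainable copy of the block

# "Zero convolutions": modeled here as linear maps initialized to zero.
zero_in = np.zeros((DIM, DIM))    # injects the condition into the copy
zero_out = np.zeros((DIM, DIM))   # feeds the copy's output back to the model

def controlled_block(x, condition):
    """Frozen block output plus the control branch's contribution."""
    base = block(x, frozen_weight)
    control = block(x + condition @ zero_in, copy_weight) @ zero_out
    return base + control

x = rng.standard_normal((4, DIM))          # feature batch
condition = rng.standard_normal((4, DIM))  # e.g. an encoded edge map (toy values)

# With zero-initialized layers the control branch contributes nothing yet,
# so the output equals that of the original frozen block.
same_at_init = np.allclose(controlled_block(x, condition), block(x, frozen_weight))
print(same_at_init)
```

Only once training moves `zero_in` and `zero_out` away from zero does the condition start to influence the output.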

This allows for unprecedented control, such as making generated figures perfectly conform to a specified pose or transforming a photograph into a different style with the same composition.

2.3. Overview of Large-Scale Video Generation Models

The transition from static image generation to dynamic video generation represents the next major breakthrough in the AIGC (AI-Generated Content) field. In recent years, numerous large-scale video models have emerged, such as OpenAI’s Sora, Luma AI’s Dream Machine, and Google’s Veo 3.

The core challenge these models face is temporal consistency: objects and scenes in a video must maintain a consistent identity, appearance, and physical characteristics across consecutive frames. They typically employ diffusion- or Transformer-based architectures, but introduce a temporal dimension when processing data. For example, Sora’s “Spacetime Patches” technique treats the video as a series of visual data blocks arranged in time and space, thus learning its dynamic changes within a unified framework.

Models like Dream Machine support setting first and last frames, which greatly facilitates video looping and controlled generation, and also opens up new possibilities for video morphing.

III. 2D Shape Blending

Now, let’s return to the classic field of computer graphics and explore the problem of shape blending in 2D vector shapes. This has wide applications in 2D animation (In-betweening), font design, industrial design, etc.

Given two keyframe shapes (polygons) $P_{A}$ and $P_{B}$ , our goal is to generate smooth intermediate shapes $P (t)$ .

This problem can be decomposed into two subproblems:

Vertex Correspondence: Determine which vertex on $P_{A}$ corresponds to which vertex on $P_{B}$ .
Vertex Path: Determine how the corresponding vertices move.

3.1. Linear Interpolation (LERP) and Its Defects

The simplest method is vertex linear interpolation. Assume we have solved the vertex correspondence problem and that the two polygons have the same number of vertices $n$ . For the $i$ -th pair of corresponding vertices $P_{A, i}$ and $P_{B, i}$ , the intermediate vertex $P_{i} (t)$ can be calculated as:

$$P_{i}(t) = (1 - t) P_{A, i} + t P_{B, i}, \quad t \in [0, 1]$$

This method is simple but has serious defects:

Shrinkage and Kink: When the object rotates, linear interpolation causes the shape to shrink and deform unnaturally. For example, when a square is rotated by 90 degrees about its center, the intermediate shape at $t = 0.5$ has sides only $1/\sqrt{2}$ (about 71%) of the original length; for a 180-degree rotation, the square collapses into a single point at $t = 0.5$ . This is because linear interpolation ignores the rigid motion of the object and only considers the absolute coordinates of the vertices.
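The shrinkage is easy to reproduce numerically. The sketch below interpolates a unit square toward rotated copies of itself and measures the side length halfway through; the helper names `rotate` and `side_length` are introduced here only for illustration.

```python
import numpy as np

def rotate(vertices, degrees):
    """Rotate 2D points about the origin."""
    a = np.radians(degrees)
    rot = np.array([[np.cos(a), -np.sin(a)],
                    [np.sin(a),  np.cos(a)]])
    return vertices @ rot.T

def side_length(vertices):
    """Length of the first edge of a polygon."""
    return np.linalg.norm(vertices[1] - vertices[0])

# Unit square centered at the origin.
square = np.array([[-0.5, -0.5], [0.5, -0.5], [0.5, 0.5], [-0.5, 0.5]])

for degrees in (90, 180):
    target = rotate(square, degrees)
    midpoint = 0.5 * square + 0.5 * target   # vertex LERP at t = 0.5
    print(degrees, side_length(midpoint))
```

A rigid rotation should keep the side length at 1 throughout, so any value below 1 at the midpoint is an artifact of interpolating coordinates directly.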

3.2. Method Based on Intrinsic Shape Interpolation

To address the shortcomings of linear interpolation, Sederberg et al. proposed a more elegant method in 1993. Instead of interpolating the Cartesian coordinates of vertices, it interpolates the intrinsic properties of the shape.

This method is inspired by “Turtle Graphics”: a polygon can be defined not by vertex coordinates, but by a series of “forward” and “turn” instructions. These instructions are the shape’s intrinsic properties: side lengths and vertex angles.

Algorithm Steps:

Calculate Intrinsic Properties: For the source polygon $P_{A}$ and the target polygon $P_{B}$ , we calculate their side length sequences $L_{A, i}$ and $L_{B, i}$ , and the directed vertex angle sequences $θ_{A, i}$ and $θ_{B, i}$ .
Interpolate Intrinsic Properties: We perform linear interpolation on these intrinsic properties to obtain the intrinsic properties of the intermediate shape:

$$L_{i}(t) = (1 - t) L_{A, i} + t L_{B, i}$$

$$θ_{i}(t) = (1 - t) θ_{A, i} + t θ_{B, i}$$

Reconstructing the Shape and Closure Problem: Starting from a starting point, using the interpolated side lengths $L_{i} (t)$ and angles $θ_{i} (t)$ , we can reconstruct the vertices of the intermediate polygon step by step. However, the reconstructed polygon is usually not closed! That is, the last vertex cannot precisely return to the first vertex. This is because the linear combination of side lengths and angles does not guarantee the satisfaction of the geometric closure constraints of the polygon.
Forced Closure: Constraint Optimization Problem: This “almost closed” polygon is the key to solving the problem. We need to make small adjustments to the interpolated side lengths (called Edge Tweaking), denoted as $S_{i}$ , so that the adjusted new side lengths $L_{i}^{'} (t) = L_{i} (t) + S_{i}$ form a closed polygon, while these adjustments $S_{i}$ themselves should be as small as possible.

This translates to a classic constrained optimization problem:

Objective Function: Minimize the weighted sum of squares of the adjustments. We want the adjustments to be as small as possible.

$$\min \sum_{i=0}^{m} \frac{S_{i}^{2}}{L_{AB, i}}$$

(The denominator $L_{AB, i}$ is a normalization term used to handle edges of different scales, usually defined as $\max (|L_{A, i} - L_{B, i}|, ϵ)$ for a small tolerance $ϵ$ .)

Constraint: The adjusted polygon must be closed. This means that the sum of all edge vectors is zero. Let $α_{i}$ be the absolute orientation angle of the $i$ -th edge (which can be obtained by accumulating the rotation angles). Then the constraints are:

$$ϕ_{1} = \sum_{i=0}^{m} L_{i}^{'}(t) \cos(α_{i}) = \sum_{i=0}^{m} (L_{i}(t) + S_{i}) \cos(α_{i}) = 0$$

$$ϕ_{2} = \sum_{i=0}^{m} L_{i}^{'}(t) \sin(α_{i}) = \sum_{i=0}^{m} (L_{i}(t) + S_{i}) \sin(α_{i}) = 0$$

Solving using the Lagrange multiplier method: This is a quadratic programming problem with equality constraints, which can be solved using the Lagrange multiplier method. We construct the Lagrangian function:

$$Φ(S_{0}, ..., S_{m}, λ_{1}, λ_{2}) = \sum_{i=0}^{m} \frac{S_{i}^{2}}{L_{AB, i}} + λ_{1} ϕ_{1} + λ_{2} ϕ_{2}$$

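Taking the partial derivatives of $Φ$ with respect to $S_{i}$ and $λ_{1}, λ_{2}$ and setting them to zero:

$$\frac{\partial Φ}{\partial S_{i}} = \frac{2 S_{i}}{L_{AB, i}} + λ_{1} \cos(α_{i}) + λ_{2} \sin(α_{i}) = 0, \quad i = 0, ..., m$$

$$\frac{\partial Φ}{\partial λ_{1}} = ϕ_{1} = 0, \qquad \frac{\partial Φ}{\partial λ_{2}} = ϕ_{2} = 0$$

From the first equation, we can solve for $S_{i}$ in terms of $λ_{1}, λ_{2}$ :

$$S_{i} = - \frac{L_{AB, i}}{2} (λ_{1} \cos(α_{i}) + λ_{2} \sin(α_{i}))$$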

Substituting this expression into the two constraints, we get a $2 \times 2$ linear system in $λ_{1}, λ_{2}$ . After solving for $λ_{1}$ and $λ_{2}$ , back substitution yields all $S_{i}$ .
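Written out in the notation above, the substitution gives the following system. This is a worked step rather than a formula quoted from the original paper; $C_x$ and $C_y$ are shorthand introduced here for the closure gap of the unadjusted polygon.

```latex
\[
C_x = \sum_{i=0}^{m} L_i(t)\cos\alpha_i, \qquad
C_y = \sum_{i=0}^{m} L_i(t)\sin\alpha_i
\]
\[
\begin{pmatrix}
\sum_{i} L_{AB,i}\cos^2\alpha_i & \sum_{i} L_{AB,i}\sin\alpha_i\cos\alpha_i \\
\sum_{i} L_{AB,i}\sin\alpha_i\cos\alpha_i & \sum_{i} L_{AB,i}\sin^2\alpha_i
\end{pmatrix}
\begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix}
= 2 \begin{pmatrix} C_x \\ C_y \end{pmatrix}
\]
```

The right-hand side vanishes exactly when the interpolated polygon already closes, in which case $λ_{1} = λ_{2} = 0$ and no side length is adjusted.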

Final Reconstruction: After obtaining the optimized side length $L_{i}^{'} (t)$ , we can accurately reconstruct the closed, visually natural-looking intermediate polygon.

Algorithm Summary

Input: Two polygons $P_{A}$ and $P_{B}$ with an established vertex correspondence.
Process:

Calculate the intrinsic representations (side lengths, rotation angles) of $P_{A}$ and $P_{B}$ .
Linearly interpolate the intrinsic representations.
Establish and solve the system of linear equations about $λ_{1}$ and $λ_{2}$ .
Calculate the side length adjustment $S_{i}$ .
Update the side lengths and reconstruct the polygon vertex coordinates from the starting point.

Output: Intermediate frame polygon $P (t)$ .

This intrinsic interpolation method handles rotation, scaling, and non-rigid deformations well, producing results far more natural than linear interpolation.
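The closure problem and its fix can be verified numerically on a small case. The sketch below interpolates a square toward a trapezoid, measures the closure gap, and then applies the side length adjustment from the Lagrange system; the two test shapes and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def intrinsics(vertices):
    """Side lengths and absolute edge directions, unwrapped so that
    consecutive differences are the turn angles in [-pi, pi)."""
    edges = np.roll(vertices, -1, axis=0) - vertices
    lengths = np.linalg.norm(edges, axis=1)
    directions = np.arctan2(edges[:, 1], edges[:, 0])
    turns = np.diff(directions)
    turns = (turns + np.pi) % (2 * np.pi) - np.pi
    directions = directions[0] + np.concatenate([[0.0], np.cumsum(turns)])
    return lengths, directions

def closure_gap(lengths, directions):
    """Sum of all edge vectors; zero exactly when the polygon closes."""
    return np.array([np.sum(lengths * np.cos(directions)),
                     np.sum(lengths * np.sin(directions))])

def tweak_lengths(lengths, directions, l_ab):
    """Solve the 2x2 Lagrange system and return the adjusted side lengths."""
    c, s = np.cos(directions), np.sin(directions)
    E = np.array([[np.sum(l_ab * c * c), np.sum(l_ab * c * s)],
                  [np.sum(l_ab * c * s), np.sum(l_ab * s * s)]])
    lam = np.linalg.solve(E, 2.0 * closure_gap(lengths, directions))
    return lengths - 0.5 * l_ab * (lam[0] * c + lam[1] * s)

square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
trapezoid = np.array([[0, 0], [2, 0], [1, 1], [0, 1]], dtype=float)

len_a, dir_a = intrinsics(square)
len_b, dir_b = intrinsics(trapezoid)
t = 0.5
lengths = (1 - t) * len_a + t * len_b
directions = (1 - t) * dir_a + t * dir_b   # same as interpolating turn angles
l_ab = np.maximum(np.abs(len_a - len_b), 1e-6)

gap_before = closure_gap(lengths, directions)
gap_after = closure_gap(tweak_lengths(lengths, directions, l_ab), directions)
print(np.linalg.norm(gap_before), np.linalg.norm(gap_after))
```

Because the closure constraints are linear in $S_{i}$ , the adjusted polygon closes up to floating-point error, while the unadjusted one leaves a visible gap.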

3.3. Implementation Code Example (Python)

Below is a simplified Python implementation that uses numpy for calculations and matplotlib for visualization.

import numpy as np
import matplotlib.pyplot as plt

def compute_intrinsics(vertices):
    """Calculate the intrinsic properties of a polygon: side lengths and angles"""
    shifted_vertices = np.roll(vertices, -1, axis=0)
    edges = shifted_vertices - vertices
    lengths = np.linalg.norm(edges, axis=1)
    
    angles = np.arctan2(edges[:, 1], edges[:, 0])
    shifted_angles = np.roll(angles, 1)
    
    turn_angles = angles - shifted_angles
    # Normalize angles to (-pi, pi]
    turn_angles = (turn_angles + np.pi) % (2 * np.pi) - np.pi
    
    # The first turn angle is the absolute angle relative to the x-axis
    turn_angles[0] = angles[0]
    
    return lengths, turn_angles

def reconstruct_from_intrinsics(lengths, turn_angles):
    """Reconstruct polygon vertices from intrinsic properties"""
    num_verts = len(lengths)
    vertices = np.zeros((num_verts + 1, 2))
    abs_angles = np.cumsum(turn_angles)
    
    edges = np.zeros((num_verts, 2))
    edges[:, 0] = lengths * np.cos(abs_angles)
    edges[:, 1] = lengths * np.sin(abs_angles)
    
    vertices[1:] = np.cumsum(edges, axis=0)
    return vertices[:-1] # Return the vertices of the closed polygon

def intrinsic_morph(verts_a, verts_b, t):
    """Perform intrinsic shape interpolation"""
    if len(verts_a) != len(verts_b):
        raise ValueError("Polygons must have the same number of vertices.")
        
    # 1. Calculate intrinsic properties
    len_a, angles_a = compute_intrinsics(verts_a)
    len_b, angles_b = compute_intrinsics(verts_b)

    # 2. Interpolate intrinsic properties
    interp_len = (1 - t) * len_a + t * len_b
    interp_angles = (1 - t) * angles_a + t * angles_b
    
    # Accumulate to get absolute angles
    abs_angles = np.cumsum(interp_angles)
    
    # 3. Solve the constrained optimization problem
    cos_a = np.cos(abs_angles)
    sin_a = np.sin(abs_angles)
    
    # Define L_ABi for normalization (simplified here)
    lab = np.maximum(np.abs(len_a - len_b), 1e-6)

    # Establish 2x2 linear system E*lambda = U
    E_mat = np.zeros((2, 2))
    E_mat[0, 0] = np.sum(lab * cos_a * cos_a)
    E_mat[0, 1] = np.sum(lab * cos_a * sin_a)
    E_mat[1, 0] = E_mat[0, 1]
    E_mat[1, 1] = np.sum(lab * sin_a * sin_a)
    
    # Closure gap C of the unadjusted polygon: the sum of all interpolated
    # edge vectors, which is zero only if the polygon already closes
    closure_error = np.array([np.sum(interp_len * cos_a),
                              np.sum(interp_len * sin_a)])
    
    # Substituting S_i into the closure constraints gives E * lambda = 2 * C
    U_vec = 2 * closure_error
    
    # Solve lambda
    try:
        lambdas = np.linalg.solve(E_mat, U_vec)
    except np.linalg.LinAlgError:
        lambdas = np.array([0., 0.]) # If the matrix is singular, do not adjust

    # 4. Calculate edge length adjustments S_i
    s = -0.5 * lab * (lambdas[0] * cos_a + lambdas[1] * sin_a)
    
    # 5. Update edge lengths and reconstruct
    final_len = interp_len + s
    morphed_verts = reconstruct_from_intrinsics(final_len, interp_angles)
    
    # Move the shape centroid to the interpolated position of the original centroid
    centroid_a = np.mean(verts_a, axis=0)
    centroid_b = np.mean(verts_b, axis=0)
    interp_centroid = (1 - t) * centroid_a + t * centroid_b
    
    current_centroid = np.mean(morphed_verts, axis=0)
    morphed_verts += (interp_centroid - current_centroid)
    
    return morphed_verts

# --- Example ---
if __name__ == '__main__':
    # A square
    square = np.array([
        [0, 0], [1, 0], [1, 1], [0, 1]
    ])
    
    # A diamond: the square rotated by -45 degrees and scaled by sqrt(2).
    # Vertex i of the square corresponds to vertex i of the diamond; finding
    # such a correspondence automatically is a hard problem in its own right.
    star_like = np.array([
        [2, 2], [3, 1], [4, 2], [3, 3]
    ])


    plt.figure(figsize=(12, 5))
    
    # LERP for comparison
    plt.subplot(1, 2, 1)
    plt.title("Linear Interpolation (LERP)")
    plt.plot(square[:, 0], square[:, 1], 'r-o', label='Start')
    plt.plot(star_like[:, 0], star_like[:, 1], 'b-o', label='End')
    for t_val in np.linspace(0, 1, 6):
        morphed = (1 - t_val) * square + t_val * star_like
        plt.plot(morphed[:, 0], morphed[:, 1], 'g-', alpha=0.5)
    plt.axis('equal')
    plt.legend()
    
    # Intrinsic Morphing
    plt.subplot(1, 2, 2)
    plt.title("Intrinsic Shape Interpolation")
    plt.plot(square[:, 0], square[:, 1], 'r-o', label='Start')
    plt.plot(star_like[:, 0], star_like[:, 1], 'b-o', label='End')
    for t_val in np.linspace(0, 1, 6):
        morphed = intrinsic_morph(square, star_like, t_val)
        plt.plot(morphed[:, 0], morphed[:, 1], 'g-', alpha=0.5)
    plt.axis('equal')
    plt.legend()
    
    plt.show()

Conclusion

Today we have discussed two “Morphing” techniques that are fundamentally different but achieve the same goal:

Portrait Morphing based on StyleGAN: Utilizes the high-quality, disentangled latent space learned by deep generative models to achieve photorealistic face transitions through simple linear interpolation. This is a prime example of a data-driven approach, where the upper limit of the effect depends on the model’s expressive power and the quality of the training data.
Two-dimensional Polygon Shape Interpolation: Employs classical computer graphics methods, interpolating the intrinsic geometric properties (side lengths and angles) of the shape and combining them with constrained optimization to ensure geometric validity. This is a model-based and mathematically reasoned approach, resulting in precise, controllable, and physically interpretable results.

From StyleGAN to diffusion models, and now to the latest video generation large models, we see that AI’s ability to simulate and create visual content is developing at an unprecedented rate. In contrast, classical shape interpolation algorithms provide us with the theoretical foundation for understanding and controlling geometric deformations.

Future research directions may include:

3D Morphing: Extending these ideas to three-dimensional models, such as the interpolation of NeRF (Neural Radiance Fields) or the intrinsic geometric interpolation of 3D meshes.
Controllability and Semantic Editing: Combining the advantages of both, for example, using AI to understand high-level instructions (“make him smile”) and then using geometric methods to accurately execute the deformation.
Physical Realism: Introducing physical simulation during the morphing process to ensure that the deformation conforms to the laws of material mechanics and dynamics.