Training Signals#
How do we get a training signal for this system? We don’t have a ground-truth version of the still image in the poses we want to take from the driving video. The key is to decouple the motion from the identity in the still image.
We represent the driving image as a set of key-points, and the motion transfer happens based on those key-points. The driving image is only used as an input to produce the key-points; once we have those, we don’t need the driving image anymore.
We can train this system as a video reconstruction task. That is, we take the first frame of the driving video and try to deform it into the subsequent frames of the same video. In this case we know exactly what the end result should look like and can train against it.
There are a few training signals [1]:
- we want the output image to look the same as the driving image (Perceptual Loss)
- we want the key-points to correspond to interesting features in the image (Equivariance Loss)
- the optical flow we learn needs to turn the source image into the driving image, so we want the deformed source image to have the same encoder feature maps as the driving image (Optical-Flow Loss)
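To make the setup concrete, here is a minimal sketch of one reconstruction training step. All module names (`keypoint_detector`, `dense_motion`, `inpainting`, `random_tps`) are hypothetical placeholders for the networks described in this post, not the actual repository API; the three loss functions are sketched in the subsections below.

```python
import torch

def reconstruction_step(video_batch, keypoint_detector, dense_motion, inpainting,
                        perceptual_loss, equivariance_loss, optical_flow_loss,
                        random_tps, optimizer):
    """One video-reconstruction training step (sketch with placeholder modules)."""
    source = video_batch[:, 0]                              # first frame of each clip
    t = torch.randint(1, video_batch.shape[1], (1,)).item()
    driving = video_batch[:, t]                              # a later frame is the reconstruction target

    kp_source = keypoint_detector(source)
    kp_driving = keypoint_detector(driving)

    # The dense-motion stage predicts optical flow and occlusion masks from the key-points.
    flow, occlusion = dense_motion(source, kp_source, kp_driving)

    # The inpainting network warps the source encoder features and fills the holes.
    # Here it is assumed to also return the encoder feature maps of both images.
    output, enc_src, enc_drv = inpainting(source, driving, flow, occlusion)

    loss = (perceptual_loss(output, driving)
            + equivariance_loss(driving, kp_driving, keypoint_detector, random_tps)
            + optical_flow_loss(enc_src, enc_drv, flow, occlusion))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```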
Perceptual Loss#
To make sure that the two images look the same, we use a pre-trained VGG network and take its feature maps at several depths, for both the driving image and the generated (deformed source) image. Note that we also compute the loss at several image scales, i.e. on a downscaled pyramid of both images.
We put an \(L_1\) loss on the difference between the two sets of feature maps. This loss produces gradients for all the networks in the system.
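A minimal sketch of such a multi-scale VGG perceptual loss, assuming a frozen torchvision VGG19 as the feature extractor; the layer indices and pyramid scales below are illustrative choices, not the exact configuration used by the paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

class PerceptualLoss(torch.nn.Module):
    """Multi-scale VGG perceptual loss (sketch; ImageNet input normalization omitted)."""

    def __init__(self, layer_ids=(2, 7, 12, 21, 30), scales=(1.0, 0.5, 0.25)):
        super().__init__()
        features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in features.parameters():
            p.requires_grad_(False)            # VGG is frozen, only used as a feature extractor
        self.features = features
        self.layer_ids = set(layer_ids)
        self.scales = scales

    def _feats(self, x):
        out = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.layer_ids:
                out.append(x)
        return out

    def forward(self, generated, driving):
        loss = 0.0
        for s in self.scales:                  # image pyramid: compare at several resolutions
            g = generated if s == 1.0 else F.interpolate(generated, scale_factor=s,
                                                         mode="bilinear", align_corners=False)
            d = driving if s == 1.0 else F.interpolate(driving, scale_factor=s,
                                                       mode="bilinear", align_corners=False)
            for fg, fd in zip(self._feats(g), self._feats(d)):
                loss = loss + (fg - fd.detach()).abs().mean()   # L1 on the feature maps
        return loss
```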
Equivariance Loss#
This loss pushes the key-points to correspond to interesting features. What does that mean? Suppose we apply a known transformation to the image. If the key-points correspond to meaningful points in the image, we expect them to be transformed in a similar fashion. This is exactly what this loss forces the key-points to do. We take an image and its predicted key-points as inputs, apply a random TPS (thin-plate spline) transform to the image, and detect the key-points of the transformed image. Next we apply the same random TPS transform to the originally predicted key-points. Finally we put an \(L_1\) loss on the difference between the transformed key-points and the key-points detected on the transformed image.
This loss produces gradients for the key-points detection network.
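A minimal sketch of this loss, assuming a hypothetical `random_tps` helper that can apply the same random thin-plate-spline warp both to an image and to a set of 2D key-point coordinates:

```python
def equivariance_loss(image, kp_pred, keypoint_detector, random_tps):
    """Equivariance loss (sketch).

    `random_tps` is a hypothetical helper with two methods:
      - warp_image(image): applies a random TPS warp to the image
      - warp_points(points): applies the *same* warp to 2D key-point coordinates
    """
    warped_image = random_tps.warp_image(image)        # transform the image
    kp_of_warped = keypoint_detector(warped_image)     # detect key-points on the warped image
    kp_warped = random_tps.warp_points(kp_pred)        # warp the originally predicted key-points

    # The two sets should coincide if the key-points track real image structure.
    return (kp_of_warped - kp_warped).abs().mean()
```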
Optical-Flow Loss#
The inpainting network has an encoder-decoder architecture, where the encoder feature maps are deformed and occluded using the optical flow and masks predicted in the dense-motion stage. The optical flow is the part we wish to train here. So we take the encoder feature maps of the source image, deform and occlude them, and push them to look like the (occluded) encoder feature maps of the driving image. Once again an \(L_1\) loss is put on the feature differences.
This loss trains neither the inpainting encoder nor the occlusion masks; it trains only the optical flow.
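A minimal sketch of this feature-matching loss. The per-scale layout, tensor shapes, and flow convention (normalized grid coordinates) are assumptions for illustration; everything except the flow is detached so that only the optical-flow prediction receives gradients, as described above.

```python
import torch.nn.functional as F

def optical_flow_loss(source_feats, driving_feats, flow, occlusion_masks):
    """L1 loss between warped+occluded source features and occluded driving features (sketch).

    source_feats / driving_feats: lists of encoder feature maps, one per scale.
    flow: dense optical flow in normalized grid coordinates, shape (B, H, W, 2).
    occlusion_masks: one mask per scale, shape (B, 1, h, w).
    """
    loss = 0.0
    for src, drv, occ in zip(source_feats, driving_feats, occlusion_masks):
        b, c, h, w = src.shape
        # Resize the flow field to this feature-map resolution.
        grid = F.interpolate(flow.permute(0, 3, 1, 2), size=(h, w),
                             mode="bilinear", align_corners=False).permute(0, 2, 3, 1)
        # Deform the source features; src is detached so the encoder is not trained.
        warped = F.grid_sample(src.detach(), grid, align_corners=False)
        occ = occ.detach()                     # occlusion masks are not trained by this loss
        loss = loss + ((warped * occ) - (drv.detach() * occ)).abs().mean()
    return loss
```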
[1] There is an additional background transformation loss, which we skip for now.