Stages of Motion Transfer#

Motion transfer models take a still image and a video and animate the still image such that it mimics the motion of the video. We refer to the still image as the source and to the video as the driving video.

Our particular model is Thin-Plate Spline motion transfer [0]. It has three major stages (a rough sketch of the full flow follows the list):

  1. Detect Key-points - Use a ResNet model to discover points of interest in an image. We do this for the source image and for the driving video.

  2. Turn Key-points to optical flow - Based on the difference between the source and driving key-points we infer an optical flow [1]. Some parts of the driving image can't be seen in the source image, so we also infer where those areas are (we call those occluded areas).

  3. Fill in the occluded areas - A generative model is used to predict what should go in the missing parts.
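To make the data flow concrete, here is a minimal, hypothetical sketch of how the three stages chain together. The function and argument names are illustrative assumptions; only the transmotion classes mentioned on this page are real, and their actual signatures may differ.

```python
import torch

def motion_transfer_step(kp_detector, dense_motion, inpainter,
                         source: torch.Tensor, driving_frame: torch.Tensor) -> torch.Tensor:
    """source and driving_frame are (B, 3, H, W) image tensors. Illustrative only."""
    # 1. Detect key-points on the source image and the current driving frame
    #    (conceptually what transmotion.kp_detection.KPDetector does).
    kp_source = kp_detector(source)
    kp_driving = kp_detector(driving_frame)

    # 2. Turn the key-point difference into a dense optical flow plus occlusion
    #    masks (conceptually transmotion.dense_motion.DenseMotionNetwork).
    dense = dense_motion(source, kp_source, kp_driving)

    # 3. Warp the source with the flow and fill in the occluded areas
    #    (conceptually transmotion.inpainting.InpaintingNetwork).
    return inpainter(source, dense)
```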

[0] original repository

[1] where each pixel from the source image should move if we want to deform it to match the driving image

Detect Key-points#

We use a ResNet18 model to predict the key-points. Key-points are just 2D coordinates on an image, so the ResNet outputs a tensor of shape \(K \times N \times 2\) with values \(\in [0,1]\). transmotion.kp_detection.KPDetector is the key-point detector network and transmotion.kp_detection.KPResult bundles the outputs.

What are those \(K\) and \(N\) over there?#

The model uses multiple deformations (more on that later); in this case \(K\) is the number of deformations and \(N\) is the number of key-points per deformation.
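For intuition, here is what the key-point output looks like shape-wise. The concrete values of \(K\), \(N\) and the batch size below are made up for the example; the real ones come from the model configuration.

```python
import torch

# Hypothetical values, purely for illustration.
K, N = 10, 5      # K deformations (TPS transforms), N key-points per deformation
batch = 2

# A detector like transmotion.kp_detection.KPDetector outputs coordinates
# normalised to [0, 1]; here we just fake a tensor with the right layout.
keypoints = torch.rand(batch, K, N, 2)

assert keypoints.shape == (batch, K, N, 2)
assert keypoints.min() >= 0.0 and keypoints.max() <= 1.0
```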

Turn Key-points to optical flow#

Key-points are a sparse and lossy representation of what's going on in the images. We need to use this information to estimate how to deform the image, and to do that we do several things:

  1. Heatmap Representation: Turn sparse key-points into a Gaussian heatmap

  2. Estimate Image Deformation: Key-points represent sparse motion, that is, we only know where a limited number of points in the source image should move to. To estimate what happens to the other points in the image we learn a Thin-Plate Spline (TPS) transformation.

  3. Occlusion Masks: Image deformation is too simple for the real world (e.g. some parts of the driving image can't be seen in the source image). We estimate which areas we still don't know after applying the deformation.

All this happens inside transmotion.dense_motion.DenseMotionNetwork and the output is bundled in transmotion.dense_motion.DenseMotionResult.
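As an illustration of the first step, here is a simplified sketch of how sparse key-points could be turned into Gaussian heatmaps. This is only the generic idea; the real DenseMotionNetwork may differ in details (e.g. it can work with the difference between driving and source heatmaps).

```python
import torch

def kp_to_gaussian_heatmap(kp: torch.Tensor, h: int, w: int, std: float = 0.1) -> torch.Tensor:
    """
    kp: (B, K, N, 2) key-points with coordinates in [0, 1].
    Returns (B, K, N, h, w) heatmaps, one Gaussian bump per key-point.
    """
    ys = torch.linspace(0.0, 1.0, h)
    xs = torch.linspace(0.0, 1.0, w)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")   # (h, w) each
    grid = torch.stack([grid_x, grid_y], dim=-1)             # (h, w, 2)

    # Broadcast: (B, K, N, 1, 1, 2) - (h, w, 2) -> (B, K, N, h, w, 2)
    diff = kp[:, :, :, None, None, :] - grid
    dist2 = (diff ** 2).sum(-1)
    return torch.exp(-dist2 / (2 * std ** 2))
```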

TPS#

From a bird's-eye view, a TPS takes 2D points as input and produces 2D points as output. The idea is to have the TPS tell us where points from the source image should move to match the driving frame. The only points known to us are the key-points we found earlier. We call those the control points of the TPS.

The TPS output for the control points of the driving frame will match the control points of the source frame. Other points are moved smoothly based on the movement of the control points (weighted by the distance from the current point to the control points).
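To make the idea concrete, here is a small, self-contained sketch of fitting and applying a single TPS. This is the textbook formulation under a radial kernel \(U(r) = r^2 \log r^2\), not necessarily the exact code used in transmotion.

```python
import torch

def tps_fit(ctrl_src: torch.Tensor, ctrl_dst: torch.Tensor):
    """Fit one thin-plate spline mapping ctrl_src -> ctrl_dst exactly.
    ctrl_src, ctrl_dst: (N, 2) control points. Returns (weights, affine)."""
    n = ctrl_src.shape[0]
    d2 = torch.cdist(ctrl_src, ctrl_src).pow(2)            # (N, N) squared distances
    k = d2 * torch.log(d2 + 1e-9)                           # radial kernel U(r) = r^2 log r^2
    p = torch.cat([ctrl_src, torch.ones(n, 1)], dim=1)      # (N, 3) affine part

    # Assemble and solve the standard TPS linear system.
    top = torch.cat([k, p], dim=1)                          # (N, N+3)
    bottom = torch.cat([p.t(), torch.zeros(3, 3)], dim=1)   # (3, N+3)
    a = torch.cat([top, bottom], dim=0)                     # (N+3, N+3)
    b = torch.cat([ctrl_dst, torch.zeros(3, 2)], dim=0)     # (N+3, 2)
    params = torch.linalg.solve(a, b)
    return params[:n], params[n:]                           # weights (N, 2), affine (3, 2)

def tps_apply(points: torch.Tensor, ctrl_src: torch.Tensor,
              weights: torch.Tensor, affine: torch.Tensor) -> torch.Tensor:
    """Map arbitrary (M, 2) points through the fitted spline."""
    d2 = torch.cdist(points, ctrl_src).pow(2)
    u = d2 * torch.log(d2 + 1e-9)
    p = torch.cat([points, torch.ones(points.shape[0], 1)], dim=1)
    return u @ weights + p @ affine

# The fitted spline reproduces the control points (here: random stand-ins for
# driving and source key-points) exactly, up to numerical precision.
ctrl_driving, ctrl_source = torch.rand(5, 2), torch.rand(5, 2)
w, a = tps_fit(ctrl_driving, ctrl_source)
print(torch.allclose(tps_apply(ctrl_driving, ctrl_driving, w, a), ctrl_source, atol=1e-4))
```

By construction the spline hits the control points exactly, while every other point is interpolated smoothly between them.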

The different colors in @fig-key-points-demo represent control points for different TPS transforms. We use a bunch of them to deform the image in many different ways. The different TPS transforms are mixed together using weighting coefficients learned by a neural net.
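Conceptually, the mixing step could look like the following sketch. The per-pixel weighting coefficients would be predicted by a small network; here they are random purely for illustration, and the shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

B, K, H, W = 1, 10, 64, 64                  # hypothetical sizes
flows = torch.randn(B, K, H, W, 2)          # one dense flow field per TPS transform
logits = torch.randn(B, K, H, W)            # per-pixel mixing scores (normally predicted by a net)

weights = F.softmax(logits, dim=1)                       # weights sum to 1 across the K transforms
combined_flow = (weights.unsqueeze(-1) * flows).sum(1)   # (B, H, W, 2) final optical flow
```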

Why do we need occlusion masks?#

TPS is a simple transformation compared to the complexity required to transform the pose of a person in a real image. Also, a TPS is only locally accurate, that is, it only fits the key-points we asked it to fit.

Moving further away from those points we lose accuracy completely. We need to know which regions we can't approximate well with a simple image deformation. This can be because a new feature needs to appear (e.g. the person shows no teeth in the static image, but needs to show them as part of the motion transfer) or because our approximation is bad.

Fill in the occluded areas#

At this point we have the optical flow and the occlusion masks, so now it's time to inpaint the missing areas of the image. The inpainting is done by an encoder-decoder architecture: we encode the source image and decode back the transformed source image. All this happens in transmotion.inpainting.InpaintingNetwork and the final results are bundled in transmotion.inpainting.InpaintingResult.

The encoding is a series of feature maps. On the decoding end, each feature map gets deformed by the optical flow, occluded by the occlusion masks, and passed through a decoder layer, so what the encoder-decoder predicts is the content of the occluded areas. At the end of the decoder we take the deformed source image, occlude it, and add the predicted occluded areas.
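A minimal sketch of that decoder-side step, assuming the flow is already expressed in grid_sample coordinates (values in \([-1, 1]\)) and the occlusion mask holds one value in \([0, 1]\) per pixel. The function names are illustrative, not the transmotion API.

```python
import torch
import torch.nn.functional as F

def warp_and_occlude(feat: torch.Tensor, flow: torch.Tensor, occlusion: torch.Tensor) -> torch.Tensor:
    """Deform an encoder feature map with the optical flow, then zero out the
    occluded areas so the decoder has to predict what goes there.
    feat: (B, C, h, w), flow: (B, h, w, 2) in [-1, 1], occlusion: (B, 1, h, w)."""
    warped = F.grid_sample(feat, flow, align_corners=True)
    return warped * occlusion

def compose_output(deformed_source: torch.Tensor, predicted: torch.Tensor,
                   occlusion: torch.Tensor) -> torch.Tensor:
    """Keep the deformed source where it is visible and use the decoder's
    prediction in the occluded areas (one plausible reading of the step above)."""
    return deformed_source * occlusion + predicted * (1.0 - occlusion)
```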