Motion Estimation

Vicente González Ruiz - Depto Informática - UAL

October 25, 2025

1 Causes of motion

  1. Still camera, moving objects.
  2. Still scene, moving camera.
  3. Moving objects, moving camera.

Notice that the motion captured by the camera is a projection of the 3D movement of the objects in the scene onto the 2D plane captured by the camera.

Notice that captured motion is undefined in occluded regions.

2 Idea

In some 3D signals processed as sequences of 2D frames (for example, a video, which is a sequence of frames), motion estimation techniques find a mapping between such frames. Such mappings between two or more frames (usually in the form of one or more motion vector fields per frame) can be used in motion compensated transforms, such as Hybrid Coding [?] and MCTF [3]. Notice that in these examples of temporal transforms, the motion information must also be available during the decoding process.

In its simplest form, a motion compensated transform inputs one (or more) reference frame(s) \({\mathbf R}=\{{\mathbf R}_i\}\) and a motion vector field \(\overset {{\mathbf R}\rightarrow {\mathbf P}}{\mathbf M}\) that indicates how to project \(\mathbf R\) onto the predicted (anchor) frame \(\mathbf P\), and outputs a prediction frame \begin {equation} \hat {{\mathbf P}} = \overset {{\mathbf R}\rightarrow {\mathbf P}}{\mathbf M}({\mathbf R}). \label {eq:MCP1} \end {equation} With this, we compute the residue frame (prediction error) \begin {equation} {\mathbf E} = {\mathbf P} - \hat {\mathbf P}. \end {equation}
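To make the transform concrete, here is a minimal sketch in Python (not the notebook's actual code) that builds \(\hat{\mathbf P}\) from \(\mathbf R\) and a per-block motion vector field and then computes \(\mathbf E\); the block size, frame shapes, and in-bounds motion vectors are illustrative assumptions.

```python
# A minimal sketch of a motion compensated transform: given a reference
# frame R and one motion vector per block, build the prediction P_hat and
# the residue E = P - P_hat. Vectors are assumed to stay inside R.
import numpy as np

def predict(R, M, block=16):
    """Project R onto the predicted frame using one motion vector per block."""
    P_hat = np.empty_like(R)
    h, w = R.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            dy, dx = M[y // block, x // block]      # vector for this block
            ys, xs = y + dy, x + dx                 # matched position in R
            P_hat[y:y + block, x:x + block] = R[ys:ys + block, xs:xs + block]
    return P_hat

# Toy usage: a still scene (M = 0) yields P_hat == R, so E == 0.
R = np.random.rand(64, 64)
P = R.copy()
M = np.zeros((4, 4, 2), dtype=int)
E = P - predict(R, M)
assert np.allclose(E, 0)
```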

An example of such a transform can be found in the notebook Full search block-based ME. As can be seen, the entropy of the motion compensated residue is significantly lower than in the case in which the motion is not compensated.

3 Block-based motion estimation [6]

Figure 1: ME using disjoint blocks. \(({\mathbf M}_x, {\mathbf M}_y)\) is the motion vector that indicates where the block \((x,y)\) of \(\mathbf P\) is found in \(\mathbf R\).

Block-based ME is the simplest ME algorithm (see Fig. 1): \(\mathbf P\) is divided into blocks of (for example) 16x16 pixels1, and the (R)MSE, which measures the L\(_2\) (Euclidean) distance, can be used to compare each block of \(\mathbf P\) with its surrounding pixels in \(\mathbf R\) (the so-called search area) [7]. For each block, the motion vector that indicates the best match (smallest distance) is found. The set of motion vectors forms the motion vector field \(\overset {{\mathbf R}\rightarrow {\mathbf P}}{\mathbf M}\) which, obviously, except for a block size of 1x1, will be less dense than \(\mathbf R\) and \(\mathbf P\). Notice, however, that such a small block size is not a good idea because, in general, the resulting motion vectors will not describe the true motion in the scene.
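The following sketch shows one possible full-search implementation under the assumptions above (grayscale float frames, MSE criterion, a \(\pm\)4-pixel search window); the notebook's code may differ in details.

```python
# A minimal full-search block-based ME sketch: one motion vector per block
# of P, found by exhaustively comparing against the search area in R.
import numpy as np

def full_search(P, R, block=16, search=4):
    """Return one motion vector (dy, dx) per block of P."""
    h, w = P.shape
    M = np.zeros((h // block, w // block, 2), dtype=int)
    for y in range(0, h, block):
        for x in range(0, w, block):
            target = P[y:y + block, x:x + block]
            best = np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ys, xs = y + dy, x + dx
                    if ys < 0 or xs < 0 or ys + block > h or xs + block > w:
                        continue                    # candidate falls outside R
                    mse = np.mean((target - R[ys:ys + block, xs:xs + block]) ** 2)
                    if mse < best:                  # keep the best match
                        best = mse
                        M[y // block, x // block] = (dy, dx)
    return M

P = np.random.rand(64, 64)
R = np.roll(P, (2, -1), axis=(0, 1))    # synthetic global motion
print(full_search(P, R)[1, 1])          # interior blocks should match (2, -1)
```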

However, as can be seen in Figure ??, the motion information computed by the block-based ME algorithm does not always represent the true motion in the scene. This can be a drawback, for example, when solving object tracking problems. In the case of video coding, the main disadvantage of this issue is that the entropy of the motion fields increases, which decreases the compression ratio.

4 Deformable block matching

Deformable block matching allows mapping affine and bilinear motion models onto objects, so that deformations (and not only translations) can be compensated.
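As an illustration of what an affine motion model can express beyond translations, the following hedged sketch warps a reference frame with an assumed affine matrix using scipy.ndimage.affine_transform; the matrix and offset are made up for illustration, not estimated from data.

```python
# A sketch of a deformable (affine) prediction: each region of R is warped
# with an affine model instead of a pure translation.
import numpy as np
from scipy.ndimage import affine_transform

R = np.random.rand(64, 64)
A = np.array([[1.02, 0.01],       # slight scaling/shear (assumed, not fitted)
              [0.00, 0.98]])
offset = np.array([0.5, -0.3])    # subpixel translation component
P_hat = affine_transform(R, A, offset=offset, order=1)  # bilinear warp
```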

5 Overlapped block matching

Figure 2: ME using overlapped blocks.

A better approximation to the optical flow (OF) for small block sizes can be found if we allow the blocks to overlap in \(\mathbf P\) [5], in which case the block size used for the comparisons must be larger. Again, as happens in the disjoint case, only the non-overlapped pixels are used for building the prediction (see Fig. 2). Obviously, the main drawback of this technique is that it can be more computationally demanding than the previous one.

The dense ME algorithm can obtain better predictions than the block-based one, as can be seen in the notebook Full search dense ME. Notice also that the MVs are more coherent.

Figure 3: ME using overlapped blocks, averaging the overlapped pixels.

An improvement over the previous technique is to also average the overlapped pixels in the prediction frame \(\hat {\mathbf P}\), as shown in Fig. 3.
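A minimal sketch of this averaging variant, assuming the motion vectors M have already been estimated (one per step-spaced block origin); the block and step sizes are illustrative.

```python
# Overlapped prediction with averaging (Fig. 3): matched blocks are
# accumulated, and each pixel is divided by the number of blocks covering it.
import numpy as np

def overlapped_predict(R, M, block=16, step=8):
    acc = np.zeros_like(R, dtype=float)
    cnt = np.zeros_like(R, dtype=float)
    h, w = R.shape
    for y in range(0, h - block + 1, step):         # step < block => overlap
        for x in range(0, w - block + 1, step):
            dy, dx = M[y // step, x // step]
            ys = np.clip(y + dy, 0, h - block)      # stay inside R
            xs = np.clip(x + dx, 0, w - block)
            acc[y:y + block, x:x + block] += R[ys:ys + block, xs:xs + block]
            cnt[y:y + block, x:x + block] += 1
    return acc / np.maximum(cnt, 1)

R = np.random.rand(64, 64)
M = np.zeros((7, 7, 2), dtype=int)   # (64-16)//8 + 1 = 7 vectors per dimension
print(overlapped_predict(R, M).shape)
```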

5.1 Machine learning

ANNs (Artificial Neural Networks) can be trained to estimate the motion between frames [1]. For training, animation videos, in which the motion fields are known with precision, are generally used.

6 Subpixel accuracy

Objects in real scenes usually move a non-integer number of pixels2, and therefore the motion information should provide subpixel displacements.

Figure 4: Pixel interpolation.

This problem can be mitigated if the predictions are generated after: (1) interpolating the reference image(s), and (2) subsampling3 the prediction to the resolution of the predicted image. For example, in MPEG-1, the motion estimation can have up to 1/2 pixel accuracy, in which case a bi-linear interpolator is used (see Fig. 4).
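A small sketch of this interpolate-then-subsample idea, using cv2.resize as the bilinear interpolator; the block size and coordinates are illustrative, and the half-pel correspondence is approximate (resize has its own sampling-grid alignment).

```python
# A sketch of 1/2-pixel accuracy: the reference is bilinearly interpolated
# to twice its resolution, searched, and the matched block subsampled back
# to the resolution of the predicted frame.
import cv2
import numpy as np

R = (np.random.rand(64, 64) * 255).astype(np.uint8)
R2 = cv2.resize(R, None, fx=2, fy=2, interpolation=cv2.INTER_LINEAR)

# An odd displacement (2*dy+1, 2*dx+1) in R2 corresponds (approximately) to a
# half-pel displacement (dy+0.5, dx+0.5) in R. Taking every other pixel of the
# matched block in R2 subsamples it back to the original resolution.
block_half_pel = R2[1:33:2, 1:33:2]   # ~ (0.5, 0.5)-shifted 16x16 block of R
print(block_half_pel.shape)
```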

Unfortunately, the use of subpixel accuracy increases the entropy of the motion information and, potentially, the number of motion vectors.

7 Searching strategies

Usually, the search is only performed by the compressor.

Figure 5: \(\pm \) 1 spiral search. Notice that if all the comparisons yield the same error, the null motion vector is selected. Notice also that the spiral can have any size.

7.1 Full (exhaustive) search

All the possibilities are checked (see Fig. 6). Advantage: the highest compression ratios. Disadvantage: CPU killer. Usually, to maximize the number of vectors equal to zero, a spiral search is performed (see Fig. 5).
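A sketch of the spiral ordering: visiting the candidates by increasing distance from \((0,0)\), combined with a strict "improves the best error so far" test, guarantees that ties are resolved in favor of the null vector.

```python
# The +-s spiral ordering of Fig. 5: candidates sorted by distance to (0, 0),
# so that with equal errors the null motion vector wins, which lowers the
# entropy of the motion field.
def spiral_offsets(s):
    offsets = [(dy, dx) for dy in range(-s, s + 1) for dx in range(-s, s + 1)]
    return sorted(offsets, key=lambda v: v[0] ** 2 + v[1] ** 2)

print(spiral_offsets(1))  # (0, 0) first, then the 8 surrounding candidates
```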

Figure 6: The full search scheme.

7.2 Hierarchical search

It is a version of the full search algorithm in which the blocks and the search area are subsampled. After finding the best match, the resolution is increased by a factor of 2 and the previous match is refined in a search area of \(\pm 1\), until the maximum resolution (possibly with subpixel accuracy) is reached.
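A two-level sketch of this coarse-to-fine idea; the SAD criterion, cv2.pyrDown, and the concrete block coordinates are implementation choices of this sketch, not mandated by the text.

```python
# Hierarchical search sketch: estimate at half resolution, double the vector,
# then refine it within +-1 at full resolution.
import cv2
import numpy as np

def sad_search(target, R, ys, xs, rng):
    """Best (dy, dx) in [-rng, rng]^2 around (ys, xs), SAD criterion."""
    b = target.shape[0]
    best, vec = np.inf, (0, 0)
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            y, x = ys + dy, xs + dx
            if 0 <= y <= R.shape[0] - b and 0 <= x <= R.shape[1] - b:
                sad = np.abs(target.astype(int)
                             - R[y:y + b, x:x + b].astype(int)).sum()
                if sad < best:
                    best, vec = sad, (dy, dx)
    return vec

P = (np.random.rand(64, 64) * 255).astype(np.uint8)
R = np.roll(P, (6, -4), axis=(0, 1))              # known global motion
Pl, Rl = cv2.pyrDown(P), cv2.pyrDown(R)           # half-resolution level
dy, dx = sad_search(Pl[8:24, 8:24], Rl, 8, 8, 4)  # coarse estimate (~(3, -2))
ry, rx = sad_search(P[16:48, 16:48], R, 16 + 2 * dy, 16 + 2 * dx, 1)  # refine
print(2 * dy + ry, 2 * dx + rx)                   # close to (6, -4)
```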

7.3 Telescopic search

Any of the previously described techniques can be accelerated if the search area is reduced. This can be done by assuming that the motion vectors of the same block in two consecutive images are similar.

7.4 Optical flow

The Optical Flow (OF) [4] describes the apparent motion of the pixels in the scene between two frames. Several OF estimators have been proposed in the literature; one of the most widely used is Farnebäck’s algorithm [2], which, instead of comparing pixels in the image domain, compares the coefficients generated by the transform defined by the basis functions \begin {equation} \{1, x, y, x^2, y^2, xy\} \end {equation} (see the notebook Farnebäck’s motion estimation). In this transform domain, the corresponding subbands quantify the tendency of the image to increase its intensity in different 2D directions, and therefore it is easier to determine the direction in which the objects are moving.

Farnebäck’s is a dense OF estimator, which means that we obtain one motion vector per pixel. This is achieved by applying the previous algorithm to every pixel of the image using a sliding window. It also provides subpixel accuracy.
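For reference, OpenCV ships an implementation of this estimator; a minimal sketch follows (the parameter values are commonly used defaults, not tuned for any particular sequence).

```python
# Dense OF with OpenCV's implementation of Farnebäck's algorithm
# (see also the "Farnebäck's motion estimation" notebook).
import cv2
import numpy as np

f1 = (np.random.rand(128, 128) * 255).astype(np.uint8)
f2 = np.roll(f1, (2, 3), axis=(0, 1))     # synthetic global displacement

flow = cv2.calcOpticalFlowFarneback(
    f1, f2, None,
    pyr_scale=0.5, levels=3,     # pyramid (helps with large motions)
    winsize=15, iterations=3,    # neighborhood size and refinement iterations
    poly_n=5, poly_sigma=1.2,    # polynomial expansion window and Gaussian
    flags=0)
print(flow.shape)                # (128, 128, 2): one (u, v) vector per pixel
```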

Farnebäck’s algorithm represents each pixel of an image \(f(x,y)\) by a quadratic (second order) polynomial (expansion) \begin {equation} f(x,y) \approx a_1x^2 + a_2xy + a_3y^2 + a_4x + a_5y + a_6, \label {eq:expansion} \end {equation} which draws a 3D surface that approximates the surface generated by the actual values of a window surrounding the pixel with coordinates \((x, y)\). Eq. 4 can also be written as \begin {equation} f(x) \approx x^TAx + b^Tx + c \end {equation} where \(x = [x,y]^T\), and the coefficients are collected into \begin {equation} A = \begin {bmatrix} a_1 & a_2/2 \\ a_2/2 & a_3 \end {bmatrix},\quad b=\begin {bmatrix} a_4 \\ a_5 \end {bmatrix}, \quad c=a_6. \end {equation}

An estimation of a displacement \(d\) between two images \(f_1\) and \(f_2\) is represented by \begin {equation} f_2(x) \approx f_1(x-d). \end {equation} Expanding, \begin {equation} f_1(x-d) = (x-d)^TA(x-d) + b^T(x-d) + c = x^TAx + (b-2Ad)^Tx + (d^TAd - b^Td + c). \end {equation} So, if we also represent \(f_2(x)\) as \begin {equation} f_2(x) \approx x^TA_2x + b_2^Tx + c_2, \end {equation} then, for a small pure displacement, we expect that \begin {equation} A_2 = A,\quad b_2 = b-2Ad,\quad c_2 = c + d^TAd-b^Td. \end {equation} Solving for \(d\) using the linear terms, \begin {equation} d=\frac {A^{-1}(b-b_2)}{2}. \end {equation} Notice that \(d\) represents the displacement that, applied to \(f_1\), makes it look as much as possible like \(f_2\). In other words, \begin {equation} f_2(\mathbf {x}) \approx f_1(\mathbf {x} - \mathbf {d}). \end {equation}

The Aperture Problem

Between two consecutive images of a sequence we can suppose that \begin {equation} I(x, y, t) \approx I(x+u, y+v, t+1), \end {equation} where \([u\ v]^T\) is the motion vector (the flow) we are interested in, and \(t\) usually represents time.

Using a first-order Taylor expansion on the RHS, we get \begin {equation} I(x+u, y+v, t+1) \approx I(x, y, t) + \frac {\partial I}{\partial x}u + \frac {\partial I}{\partial y}v + \frac {\partial I}{\partial t}. \end {equation} If we define \(I_x = \frac {\partial I}{\partial x}\) (horizontal gradient), \(I_y = \frac {\partial I}{\partial y}\) (vertical gradient), and \(I_t = \frac {\partial I}{\partial t}\) (temporal gradient), and suppose that the displacement does not affect the pixel intensities, i.e., \begin {equation} I \approx I + I_xu + I_yv + I_t, \end {equation} we arrive at the Optical Flow Constraint Equation: \begin {equation} I_xu + I_yv + I_t = 0. \end {equation} As we can see, we have only one equation and two unknowns (\(u\) and \(v\)). This is the mathematical statement of the Aperture Problem. In a flat region, \(I_x\approx 0\) and \(I_y\approx 0\), so the equation becomes \(0\approx -I_t\), which is useless. On a straight edge, we only get information about the motion perpendicular (normal) to the edge.

The Least-Squares solution

Therefore, to find \(d=[u\ v]^T\) we must impose another restriction (equation). Typically, the flow is assumed to be constant for all pixels in a small neighborhood (a window centered at the pixel for which we want to determine \(d\)). Depending on the size of the neighborhood, we create an overdetermined system of equations, which in matrix form is \begin {equation} \underbrace { \begin {bmatrix} I_{x_1} & I_{y_1} \\ I_{x_2} & I_{y_2} \\ \vdots & \vdots \\ I_{x_n} & I_{y_n} \end {bmatrix} }_{\mathbf {A}} \underbrace { \begin {bmatrix} u \\ v \end {bmatrix} }_{\mathbf {d}} = \underbrace { \begin {bmatrix} -I_{t_1} \\ -I_{t_2} \\ \vdots \\ -I_{t_n} \end {bmatrix}. }_{\mathbf {b}} \end {equation} We cannot solve this exactly, so we find the motion vector \(\mathbf {d}\) that minimizes the squared error \begin {equation} \min _{\mathbf d}||\mathbf {A}\mathbf {d}-\mathbf {b}||^2. \end {equation} The standard least-squares solution is found by solving the normal equations \begin {equation} (\mathbf {A}^T\mathbf {A})\mathbf {d}=\mathbf {A}^T\mathbf {b}, \end {equation} where \(\mathbf {A}^T\mathbf {A}\) is a \(2\times 2\) matrix that sums the spatial gradient information (the “structure”) across the entire window \begin {equation} \mathbf {A}^T\mathbf {A} = \sum _{i\in W} \begin {bmatrix} I_{x_i}^2 & I_{x_i}I_{y_i} \\ I_{x_i}I_{y_i} & I_{y_i}^2 \end {bmatrix}, \end {equation} and \(\mathbf {A}^T\mathbf {b}\) is a \(2\times 1\) vector that sums the motion evidence \begin {equation} \mathbf {A}^T\mathbf {b} = \sum _{i\in W} \begin {bmatrix} -I_{x_i}I_{t_i} \\ -I_{y_i}I_{t_i} \end {bmatrix}. \end {equation} The final motion is then found by \begin {equation} \mathbf {d} = (\mathbf {A}^T\mathbf {A})^{-1}(\mathbf {A}^T\mathbf {b}). \end {equation}
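A compact sketch of this windowed least-squares estimator; the gradient operator (np.gradient) and the window half-size are implementation choices of the sketch.

```python
# Least-squares flow over a window: stack one OFCE row per pixel of the
# window and solve the normal equations A^T A d = A^T b.
import numpy as np

def flow_in_window(I1, I2, y, x, half=7):
    """Estimate (u, v) at (y, x) assuming constant flow in the window."""
    Iy, Ix = np.gradient(I1.astype(float))          # spatial gradients
    It = I2.astype(float) - I1.astype(float)        # temporal gradient
    sl = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)   # n x 2
    b = -It[sl].ravel()                                      # n x 1
    d, *_ = np.linalg.lstsq(A, b, rcond=None)       # solves A^T A d = A^T b
    return d                                        # d = [u, v]

I1 = np.random.rand(64, 64)
I2 = np.roll(I1, (1, 1), axis=(0, 1))
print(flow_in_window(I1, I2, 32, 32))   # rough small-motion estimate of (u, v)
```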

7.5 Practical issues and solutions

  1. \(A\) may be singular. If \(A\) is not invertible (e.g., in flat regions, in pure-noise regions, or simply because the motion cannot be described as a simple translation), it is impossible to find \(A^{-1}\). Solutions:

    1. Compute the pseudo-inverse.
    2. Compute \((A+\epsilon I)^{-1}\) instead of \(A^{-1}\), where \(\epsilon I\) is a regularization term (see the sketch after this list).
    3. Suppose that \(d=0\).
    4. Create an overdetermined linear system with the polynomials of the neighboring pixels, considering for example a Gaussian window for weighting their contributions, and solve for \(d\) in the least-squares sense. This is the solution proposed by Farnebäck. See the next section.
    5. Rely on pyramids. This helps to capture large motions (a big pixel movement in the full-resolution image becomes a small, manageable movement in the low-resolution (top) levels of the pyramid).
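A tiny sketch of solution 2 above (regularizing the structure matrix); the \(\epsilon\) value and the toy numbers are arbitrary choices for illustration.

```python
# Regularize A before inverting: flat regions (nearly singular A) then yield
# a small, stable d instead of a numerically exploding one.
import numpy as np

def solve_regularized(A, rhs, eps=1e-3):
    """Solve (A + eps*I) d = rhs instead of A d = rhs."""
    return np.linalg.solve(A + eps * np.eye(A.shape[0]), rhs)

A = np.array([[1e-8, 0.0],        # nearly singular structure matrix
              [0.0, 1e-8]])
rhs = np.array([1e-9, 0.0])
print(solve_regularized(A, rhs))  # a small vector instead of a huge one
```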

Finding \(d\) using least squares

Let’s define the polynomial expansion for pixel location \(l_i\) in the first image as \begin {equation} f^1(l) \approx l^TA_i^1l + {(b_i^1)}^Tl + c_i^1, \end {equation} where \(l_i\) is the \(i\)-th pixel’s coordinates, and the same for the second image \begin {equation} f^2(l) \approx l^TA_i^2l + {(b_i^2)}^Tl + c_i^2. \end {equation} As we derived earlier, if \(f^2\) is a translated version of \(f^1\) by a displacement \(d=[d_x\ d_y]^T\), ideally we have that \begin {equation} b^2_i-b^1_i = -2A^1_id. \end {equation} Notice that we use, say for a pixel at location \(l_1\), all the pixels in a window centered at \(l_1\) to determine a single \(d=[d_x\ d_y]^T\).

Therefore, when \(A^1_i\) is not invertible, considering also the Gaussian weights defined by \begin {equation} w_i = \exp \left ( -\frac {||x_i-x_0||^2}{2\sigma ^2} \right ) \end {equation} (notice that these weights downweight distant neighbors), we can build the overdetermined linear system \begin {equation} \begin {bmatrix} 2w_1A^1_1 \\ 2w_2A^1_2 \\ \vdots \\ 2w_NA^1_N \end {bmatrix} d = \begin {bmatrix} w_1(b^1_1-b^2_1) \\ w_2(b^1_2-b^2_2) \\ \vdots \\ w_N(b^1_N-b^2_N) \\ \end {bmatrix}. \end {equation} Notice that since \(N>1\) (\(N\) is the number of pixels in the window), we have at least \(2N=4\) equations for only two unknowns (the components of \(d\)). The best \(d\) (in the least-squares sense) minimizes the error \begin {equation} E(d) = \sum _i w_i ||b^2_i - b^1_i + 2A^1_id||^2, \end {equation} which, as can be seen, is a weighted squared error.

To find the minimizer, set the gradient of \(E(d)\) with respect to \(d\) to zero: \begin {equation} \frac {\partial E}{\partial d} = 4\sum _i w_i(A^1_i)^T\left (2A^1_i d + (b^2_i-b^1_i)\right ) = 0. \end {equation} To simplify, let’s define \(A^1_i=A_i\) and \(b^1_i=b_i\). Then \begin {equation} \left ( \sum _i w_iA_i^TA_i \right ) d = \frac {1}{2}\sum _i w_iA_i^T(b_i-b^2_i), \end {equation} which, considering that \((b^2_i - b_i) = -(b_i-b^2_i)\) (if we interchange \(f^1\) and \(f^2\), the sign of \(d\) changes), can also be written as \begin {equation} \left ( \sum _i w_iA_i^TA_i \right ) d = -\frac {1}{2}\sum _i w_iA_i^T(b^2_i-b_i). \end {equation} These are the normal equations.

This is a \(2\times 2\) linear system (for 2D flow), easily solvable for \(d\) as \begin {equation} d = \frac {1}{2}\left ( \sum _i w_iA_i^TA_i \right )^{-1}\left (\sum _i w_iA_i^T(b_i-b^2_i)\right ). \end {equation}

Example

Let’s verify Eq. 11 numerically.
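The following is a minimal numerical check with made-up expansion coefficients: we fabricate \(b_2\) from a known displacement using the ideal relation \(b_2 = b - 2Ad\) and then recover \(d\) with Eq. 11; all the numbers are illustrative.

```python
# Recover d = (1/2) A^{-1} (b - b2) from fabricated expansion coefficients.
import numpy as np

A = np.array([[2.0, 0.5],         # assumed expansion of f1 (positive definite)
              [0.5, 1.0]])
b = np.array([0.3, -0.2])
d_true = np.array([0.25, -0.5])   # the displacement we want to recover

b2 = b - 2 * A @ d_true           # ideal relation b2 = b - 2 A d
d = 0.5 * np.linalg.inv(A) @ (b - b2)
print(d)                          # [ 0.25 -0.5 ], matching d_true
```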

The general least-squares problem

Suppose you have an overdetermined linear system \begin {equation} Ax=b, \end {equation} where \(A\) is an \(m\times n\) matrix with \(m>n\) (more equations than unknowns), \(x\) is the \(n\times 1\) vector of unknowns, and \(b\) is the \(m\times 1\) vector of known measurements. Usually, there is no exact solution (the equations cannot all be satisfied simultaneously), so we look for an \(x\) that minimizes the sum of squared residuals \begin {equation} E(x) = ||Ax-b||^2 = (Ax-b)^T(Ax-b). \end {equation} This is the least-squares problem.

To minimize \(E(x)\), we set its gradient with respect to \(x\) equal to zero: \begin {equation} \frac {\partial E}{\partial x} = 2A^T(Ax-b) = 0. \end {equation}

Simplifying, \begin {equation} A^TAx = A^Tb, \end {equation} we obtain the normal equations (the residual vector \(r=Ax-b\) is orthogonal (normal) to the column space of \(A\), i.e., \(A^Tr=0\)).

If \(A^TA\) is invertible (i.e., the columns of \(A\) are linearly independent), the least-squares solution is \begin {equation} x = (A^TA)^{-1}A^Tb, \end {equation} where \(x\) minimizes the squared error.

For example, let’s say we want to fit a line \(y=ax+b\) through noisy data points. For each point \(i\) we have that \begin {equation*} ax_i+b = y_i. \end {equation*} Stacking for \(N\) points, \begin {equation*} \begin {bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end {bmatrix} \begin {bmatrix} a \\ b \\ \end {bmatrix} = \begin {bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end {bmatrix}. \end {equation*} That’s \(Ax=b\). The normal equation is \begin {equation*} A^TA \begin {bmatrix} a \\ b \\ \end {bmatrix} = A^Tb, \end {equation*} which yields the classic least-squares line fit.
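The same fit in code, using np.linalg.lstsq (which solves the least-squares problem in a numerically robust way); the synthetic data parameters are arbitrary.

```python
# Least-squares line fit: build A and b as above and solve for [a, b].
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)   # noisy points on y = 2x + 1

A = np.column_stack([x, np.ones_like(x)])        # rows [x_i, 1]
a_b, *_ = np.linalg.lstsq(A, y, rcond=None)
print(a_b)                                       # approximately [2.0, 1.0]
```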

8 ME and RDO

ME can be designed to minimize the distortion \(D\) of the residues after using the MCT (Motion Compensated Transform), or to minimize the Lagrangian \begin {equation} J = R + \lambda D, \end {equation} which also takes into consideration the bit-rate \(R\). Notice, however, that in this case the computation of the motion information is also determined by the bit-rate achieved by the entropy coding of the motion data and the residues.
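A toy sketch of this trade-off: among candidate vectors, the Lagrangian \(J = R + \lambda D\) selects the winner; all the rates, distortions, and \(\lambda\) below are fabricated for illustration.

```python
# RDO-driven vector selection: pick the candidate minimizing J = R + lambda*D.
candidates = [                     # (vector, rate in bits, distortion)
    ((0, 0), 1, 120.0),            # cheap to code, worse prediction
    ((3, -2), 9, 80.0),            # expensive to code, better prediction
]
lam = 10.0
best = min(candidates, key=lambda c: c[1] + lam * c[2])
print(best[0])                     # the vector with the smallest J
```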

Notice that, in general, \(D\) will decrease if the “motion part” of \(R\) increases. However, if the motion information can be inferred by the decoder, \(R\) will only be affected by the entropy encoding of the residues. On the other hand, when the motion information is inferred at the decoder, it will be less accurate than if we use all the visual information available at the encoder.

9 Matching criteria

10 References

[1]   A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.

[2]   G. Farnebäck. Two-Frame Motion Estimation Based on Polynomial Expansion. In Scandinavian Conference on Image Analysis, pages 363–370. Springer, 2003.

[3]   V. González-Ruiz. Motion Compensated Temporal Filtering (MCTF).

[4]   B.K.P. Horn and B.G. Schunck. Determining Optical Flow. In Techniques and Applications of Image Understanding, volume 281, pages 319–331. International Society for Optics and Photonics, 1981.

[5]   M.T. Orchard and G.J. Sullivan. Overlapped Block Motion Compensation: An Estimation-Theoretic Approach. IEEE Transactions on Image Processing, 3(5):693–699, 1994.

[6]   K.R. Rao and J.J. Hwang. Techniques and Standards for Image, Video, and Audio Coding, volume 70. Prentice Hall, New Jersey, 1996.

[7]   S. Zhu and K.-K. Ma. A New Diamond Search Algorithm for Fast Block-Matching Motion Estimation. IEEE Transactions on Image Processing, 9(2):287–290, 2000.

1For example, in the MPEG-1 standard, the images are divided into blocks of \(16\times 16\) pixels called macroblocks.

2This means that, even if two images are visually identical, they can have different representations, and therefore \({\mathbf E}\ne {\mathbf 0}\).

3This operation implies low-pass filtering to avoid the aliasing that the downsampling would otherwise produce.