Motion Estimation

Vicente González Ruiz - Depto Informática - UAL

October 25, 2025

1 Causes of motion

  1. Still camera, moving objects.
  2. Still scene, moving camera.
  3. Moving objects, moving camera.

Notice that the motion captured by the camera is a projection of the 3D movement of the objects in the scene onto the 2D plane captured by the camera.

Notice that captured motion is undefined in occluded regions.

2 Idea

In some 3D signals processed as sequences of 2D frames (for example, a video, which is a sequence of frames), motion estimation techniques find a mapping between such frames. Such mappings between two or more frames (usually in the form of one or more motion vector fields per frame) can be used in motion compensated transforms, such as Hybrid Coding [?] and MCTF [3]. Notice that in these examples of temporal transforms, the motion information must also be available during the decoding process.

In its simplest form, a motion compensated transform inputs one (or more) reference frame(s) \({\mathbf R}=\{{\mathbf R}_i\}\) and a motion vector field \(\overset {{\mathbf R}\rightarrow {\mathbf P}}{\mathbf M}\) that indicates how to project \(\mathbf R\) onto the predicted (anchor) frame \(\mathbf P\), and outputs a prediction frame \begin {equation} \hat {{\mathbf P}} = \overset {{\mathbf R}\rightarrow {\mathbf P}}{\mathbf M}({\mathbf R}). \label {eq:MCP1} \end {equation} With this, we compute the residue frame (prediction error) \begin {equation} {\mathbf E} = {\mathbf P} - \hat {\mathbf P}. \end {equation}
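To make the transform concrete, here is a minimal sketch in Python (not the notebook's actual code) that builds \(\hat{\mathbf P}\) from \(\mathbf R\) and a per-block motion vector field and then computes \(\mathbf E\); the block size, frame shapes, and in-bounds motion vectors are illustrative assumptions.

```python
# A minimal sketch of a motion compensated transform: given a reference
# frame R and one motion vector per block, build the prediction P_hat and
# the residue E = P - P_hat. Vectors are assumed to stay inside R.
import numpy as np

def predict(R, M, block=16):
    """Project R onto the predicted frame using one motion vector per block."""
    P_hat = np.empty_like(R)
    h, w = R.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            dy, dx = M[y // block, x // block]      # vector for this block
            ys, xs = y + dy, x + dx                 # matched position in R
            P_hat[y:y + block, x:x + block] = R[ys:ys + block, xs:xs + block]
    return P_hat

# Toy usage: a still scene (M = 0) yields P_hat == R, so E == 0.
R = np.random.rand(64, 64)
P = R.copy()
M = np.zeros((4, 4, 2), dtype=int)
E = P - predict(R, M)
assert np.allclose(E, 0)
```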

An example of such a transform can be found in the notebook Full search block-based ME. As can be seen, the entropy of the motion compensated residue is significantly lower than in the case in which the motion is not compensated.

3 Block-based motion estimation [6]

Figure 1: ME using disjoint blocks. \(({\mathbf M}_x, {\mathbf M}_y)\) is the motion vector that indicates where the block \((x,y)\) of \(\mathbf P\) is found in \(\mathbf R\).

Block-based ME is the simplest ME algorithm (see Fig. 1): \(\mathbf P\) is divided into blocks of (for example) 16x16 pixels1, and the (R)MSE, which measures the L\(_2\) (Euclidean) distance, can be used to compare each block of \(\mathbf P\) with its surrounding pixels in \(\mathbf R\) (the so-called search area) [7]. For each block, the motion vector that indicates the best match (smallest distance) is found. The set of motion vectors forms the motion vector field \(\overset {{\mathbf R}\rightarrow {\mathbf P}}{\mathbf M}\) which, obviously, except for a block size of 1x1, will be less dense than \(\mathbf R\) and \(\mathbf P\). Notice, however, that such a small block size is not a good idea because, in general, the resulting motion vectors will not describe the true motion in the scene.
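The following sketch shows one possible full-search implementation under the assumptions above (grayscale float frames, MSE criterion, a \(\pm\)4-pixel search window); the notebook's code may differ in details.

```python
# A minimal full-search block-based ME sketch: one motion vector per block
# of P, found by exhaustively comparing against the search area in R.
import numpy as np

def full_search(P, R, block=16, search=4):
    """Return one motion vector (dy, dx) per block of P."""
    h, w = P.shape
    M = np.zeros((h // block, w // block, 2), dtype=int)
    for y in range(0, h, block):
        for x in range(0, w, block):
            target = P[y:y + block, x:x + block]
            best = np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ys, xs = y + dy, x + dx
                    if ys < 0 or xs < 0 or ys + block > h or xs + block > w:
                        continue                    # candidate falls outside R
                    mse = np.mean((target - R[ys:ys + block, xs:xs + block]) ** 2)
                    if mse < best:                  # keep the best match
                        best = mse
                        M[y // block, x // block] = (dy, dx)
    return M

P = np.random.rand(64, 64)
R = np.roll(P, (2, -1), axis=(0, 1))    # synthetic global motion
print(full_search(P, R)[1, 1])          # interior blocks should match (2, -1)
```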

However, as can be seen in Figure ??, the motion information computed by the block-based ME algorithm does not always represent the true motion in the scene. This can be a drawback, for example, when solving object tracking problems. In the case of video coding, the main disadvantage of this issue is that the entropy of the motion fields increases, which decreases the compression ratio.

4 Deformable block matching

Deformable block matching allows mapping affine and bilinear motion models onto objects, so that deformations (and not only translations) can be compensated.
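As an illustration of what an affine motion model can express beyond translations, the following hedged sketch warps a reference frame with an assumed affine matrix using scipy.ndimage.affine_transform; the matrix and offset are made up for illustration, not estimated from data.

```python
# A sketch of a deformable (affine) prediction: each region of R is warped
# with an affine model instead of a pure translation.
import numpy as np
from scipy.ndimage import affine_transform

R = np.random.rand(64, 64)
A = np.array([[1.02, 0.01],       # slight scaling/shear (assumed, not fitted)
              [0.00, 0.98]])
offset = np.array([0.5, -0.3])    # subpixel translation component
P_hat = affine_transform(R, A, offset=offset, order=1)  # bilinear warp
```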

5 Overlapped block matching

Figure 2: ME using overlapped blocks.

A better approximation to the optical flow (OF) for small block sizes can be found if we allow the blocks to overlap in \(\mathbf P\) [5], in which case the block size used for the comparisons must be larger. Again, as happens in the disjoint case, only the non-overlapped pixels are used for building the prediction (see Fig. 2). Obviously, the main drawback of this technique is that it can be more computationally demanding than the previous one.

The dense ME algorithm can obtain better predictions than the block-based one, as can be seen in the notebook Full search dense ME. Notice also that the MVs are more coherent.

Figure 3: ME using overlapped blocks, averaging the overlapped pixels.

An improvement over the previous technique is to also average the overlapped pixels in the prediction frame \(\hat {\mathbf P}\), as shown in Fig. 3.
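A minimal sketch of this averaging variant, assuming the motion vectors M have already been estimated (one per step-spaced block origin); the block and step sizes are illustrative.

```python
# Overlapped prediction with averaging (Fig. 3): matched blocks are
# accumulated, and each pixel is divided by the number of blocks covering it.
import numpy as np

def overlapped_predict(R, M, block=16, step=8):
    acc = np.zeros_like(R, dtype=float)
    cnt = np.zeros_like(R, dtype=float)
    h, w = R.shape
    for y in range(0, h - block + 1, step):         # step < block => overlap
        for x in range(0, w - block + 1, step):
            dy, dx = M[y // step, x // step]
            ys = np.clip(y + dy, 0, h - block)      # stay inside R
            xs = np.clip(x + dx, 0, w - block)
            acc[y:y + block, x:x + block] += R[ys:ys + block, xs:xs + block]
            cnt[y:y + block, x:x + block] += 1
    return acc / np.maximum(cnt, 1)

R = np.random.rand(64, 64)
M = np.zeros((7, 7, 2), dtype=int)   # (64-16)//8 + 1 = 7 vectors per dimension
print(overlapped_predict(R, M).shape)
```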

5.1 Machine learning

ANNs (Artificial Neural Networks) can be trained to estimate the motion between frames [1]. For training, animation videos, in which the motion fields are known with precision, are generally used.

6 Subpixel accuracy

Objects in real scenes usually move a non-integer number of pixels2, and therefore the motion information should provide subpixel displacements.

Figure 4: Pixel interpolation.

This problem can be mitigated if the predictions are generated after: (1) interpolating the reference image(s), and (2) subsampling3 the prediction to the resolution of the predicted image. For example, in MPEG-1, the motion estimation can have up to 1/2 pixel accuracy, in which case a bi-linear interpolator is used (see Fig. 4).
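A small sketch of this interpolate-then-subsample idea, using cv2.resize as the bilinear interpolator; the block size and coordinates are illustrative, and the half-pel correspondence is approximate (resize has its own sampling-grid alignment).

```python
# A sketch of 1/2-pixel accuracy: the reference is bilinearly interpolated
# to twice its resolution, searched, and the matched block subsampled back
# to the resolution of the predicted frame.
import cv2
import numpy as np

R = (np.random.rand(64, 64) * 255).astype(np.uint8)
R2 = cv2.resize(R, None, fx=2, fy=2, interpolation=cv2.INTER_LINEAR)

# An odd displacement (2*dy+1, 2*dx+1) in R2 corresponds (approximately) to a
# half-pel displacement (dy+0.5, dx+0.5) in R. Taking every other pixel of the
# matched block in R2 subsamples it back to the original resolution.
block_half_pel = R2[1:33:2, 1:33:2]   # ~ (0.5, 0.5)-shifted 16x16 block of R
print(block_half_pel.shape)
```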

Unfortunately, the use of subpixel accuracy increases the entropy of the motion information and, potentially, the number of motion vectors.

7 Searching strategies

Usually, the search is only performed by the compressor.

Figure 5: \(\pm \) 1 spiral search. Notice that if all the comparisons yield the same error, the null motion vector is selected. Notice also that the spiral can have any size.

7.1 Full (exhaustive) search

All the possibilities are checked (see Fig. 6). Advantage: the highest compression ratios. Disadvantage: CPU killer. Usually, to maximize the number of vectors equal to zero, a spiral search is performed (see Fig. 5).
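A sketch of the spiral ordering: visiting the candidates by increasing distance from \((0,0)\), combined with a strict "improves the best error so far" test, guarantees that ties are resolved in favor of the null vector.

```python
# The +-s spiral ordering of Fig. 5: candidates sorted by distance to (0, 0),
# so that with equal errors the null motion vector wins, which lowers the
# entropy of the motion field.
def spiral_offsets(s):
    offsets = [(dy, dx) for dy in range(-s, s + 1) for dx in range(-s, s + 1)]
    return sorted(offsets, key=lambda v: v[0] ** 2 + v[1] ** 2)

print(spiral_offsets(1))  # (0, 0) first, then the 8 surrounding candidates
```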

Figure 6: The full search scheme.

7.2 Hierarchical search

It is a version of the full search algorithm in which the blocks and the search area are subsampled. After finding the best match, the resolution is increased by a factor of 2 and the previous match is refined in a search area of \(\pm 1\), until the maximum resolution (possibly with subpixel accuracy) is reached.
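A two-level sketch of this coarse-to-fine idea; the SAD criterion, cv2.pyrDown, and the concrete block coordinates are implementation choices of this sketch, not mandated by the text.

```python
# Hierarchical search sketch: estimate at half resolution, double the vector,
# then refine it within +-1 at full resolution.
import cv2
import numpy as np

def sad_search(target, R, ys, xs, rng):
    """Best (dy, dx) in [-rng, rng]^2 around (ys, xs), SAD criterion."""
    b = target.shape[0]
    best, vec = np.inf, (0, 0)
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            y, x = ys + dy, xs + dx
            if 0 <= y <= R.shape[0] - b and 0 <= x <= R.shape[1] - b:
                sad = np.abs(target.astype(int)
                             - R[y:y + b, x:x + b].astype(int)).sum()
                if sad < best:
                    best, vec = sad, (dy, dx)
    return vec

P = (np.random.rand(64, 64) * 255).astype(np.uint8)
R = np.roll(P, (6, -4), axis=(0, 1))              # known global motion
Pl, Rl = cv2.pyrDown(P), cv2.pyrDown(R)           # half-resolution level
dy, dx = sad_search(Pl[8:24, 8:24], Rl, 8, 8, 4)  # coarse estimate (~(3, -2))
ry, rx = sad_search(P[16:48, 16:48], R, 16 + 2 * dy, 16 + 2 * dx, 1)  # refine
print(2 * dy + ry, 2 * dx + rx)                   # close to (6, -4)
```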

7.3 Telescopic search

Any of the previously described techniques can be accelerated if the search area is reduced. This can be done by assuming that the motion vectors of the same block in two consecutive images are similar.

7.4 Optical flow

The Optical Flow (OF) [4] describes the apparent motion of the pixels in the scene between two frames. Several OF estimators have been proposed in the literature; one of the most widely used is Farnebäck’s algorithm [2], which, instead of comparing pixels in the image domain, compares the coefficients generated by the transform defined by the basis functions \begin {equation} \{1, x, y, x^2, y^2, xy\} \end {equation} (see the notebook Farnebäck’s motion estimation). In this transform domain, the corresponding subbands quantify the tendency of the image to increase its intensity in different 2D directions, and therefore it is easier to determine the direction in which the objects are moving.

Farnebäck’s is a dense OF estimator, which means that we obtain one motion vector per pixel. This is achieved by applying the previous algorithm to every pixel of the image using a sliding window. It also provides subpixel accuracy.
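For reference, OpenCV ships an implementation of this estimator; a minimal sketch follows (the parameter values are commonly used defaults, not tuned for any particular sequence).

```python
# Dense OF with OpenCV's implementation of Farnebäck's algorithm
# (see also the "Farnebäck's motion estimation" notebook).
import cv2
import numpy as np

f1 = (np.random.rand(128, 128) * 255).astype(np.uint8)
f2 = np.roll(f1, (2, 3), axis=(0, 1))     # synthetic global displacement

flow = cv2.calcOpticalFlowFarneback(
    f1, f2, None,
    pyr_scale=0.5, levels=3,     # pyramid (helps with large motions)
    winsize=15, iterations=3,    # neighborhood size and refinement iterations
    poly_n=5, poly_sigma=1.2,    # polynomial expansion window and Gaussian
    flags=0)
print(flow.shape)                # (128, 128, 2): one (u, v) vector per pixel
```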

Farnebäck’s algorithm represents each pixel of an image \(f(x,y)\) by a quadratic (second order) polynomial (expansion) \begin {equation} f(x,y) \approx a_1x^2 + a_2xy + a_3y^2 + a_4x + a_5y + a_6, \label {eq:expansion} \end {equation} which draws a 3D surface that approximates the surface generated by the actual values of a window surrounding the pixel with coordinates \((x, y)\). Eq. 4 can also be written as \begin {equation} f(x) \approx x^TAx + b^Tx + c \end {equation} where \(x = [x,y]^T\), and the coefficients are collected into \begin {equation} A = \begin {bmatrix} a_1 & a_2/2 \\ a_2/2 & a_3 \end {bmatrix},\quad b=\begin {bmatrix} a_4 \\ a_5 \end {bmatrix}, \quad c=a_6. \end {equation}

An estimation of a displacement \(d\) between two images \(f_1\) and \(f_2\) is represented by \begin {equation} f_2(x) \approx f_1(x-d). \end {equation} Expanding, \begin {equation} f_1(x-d) = (x-d)^TA(x-d) + b^T(x-d) + c = x^TAx + (b-2Ad)^Tx + (d^TAd - b^Td + c). \end {equation} So, if we also represent \(f_2(x)\) as \begin {equation} f_2(x) \approx x^TA_2x + b_2^Tx + c_2, \end {equation} then, for a small pure displacement, we expect that \begin {equation} A_2 = A,\quad b_2 = b-2Ad,\quad c_2 = c + d^TAd-b^Td. \end {equation} Solving for \(d\) using the linear terms, \begin {equation} d=\frac {A^{-1}(b-b_2)}{2}. \end {equation} Notice that \(d\) represents the displacement that, applied to \(f_1\), makes it look as much as possible like \(f_2\). In other words, \begin {equation} f_2(\mathbf {x}) \approx f_1(\mathbf {x} - \mathbf {d}). \end {equation}

The Aperture Problem

Between two consecutive images of a sequence we can suppose that \begin {equation} I(x, y, t) \approx I(x+u, y+v, t+1), \end {equation} where \([u\ v]^T\) is the motion vector (the flow) we are interested in, and \(t\) usually represents time.

Using a first-order Taylor expansion on the RHS, we get \begin {equation} I(x+u, y+v, t+1) \approx I(x, y, t) + \frac {\partial I}{\partial x}u + \frac {\partial I}{\partial y}v + \frac {\partial I}{\partial t}. \end {equation} If we define \(I_x = \frac {\partial I}{\partial x}\) (horizontal gradient), \(I_y = \frac {\partial I}{\partial y}\) (vertical gradient), and \(I_t = \frac {\partial I}{\partial t}\) (temporal gradient), and suppose that the displacement does not affect the pixel intensities, i.e., \begin {equation} I \approx I + I_xu + I_yv + I_t, \end {equation} we arrive at the Optical Flow Constraint Equation: \begin {equation} I_xu + I_yv + I_t = 0. \end {equation} As we can see, we have only one equation and two unknowns (\(u\) and \(v\)). This is the mathematical statement of the Aperture Problem. In a flat region, \(I_x\approx 0\) and \(I_y\approx 0\), so the equation becomes \(0\approx -I_t\), which is useless. On a straight edge, we only get information about the motion perpendicular (normal) to the edge.

The Least-Squares solution

Therefore, to find \(d=[u\ v]^T\) we must impose another restriction (equation). Typically, the flow is assumed to be constant for all pixels in a small neighborhood (a window centered at the pixel for which we want to determine \(d\)). Depending on the size of the neighborhood, we create an overdetermined system of equations, which in matrix form is \begin {equation} \underbrace { \begin {bmatrix} I_{x_1} & I_{y_1} \\ I_{x_2} & I_{y_2} \\ \vdots & \vdots \\ I_{x_n} & I_{y_n} \end {bmatrix} }_{\mathbf {A}} \underbrace { \begin {bmatrix} u \\ v \end {bmatrix} }_{\mathbf {d}} = \underbrace { \begin {bmatrix} -I_{t_1} \\ -I_{t_2} \\ \vdots \\ -I_{t_n} \end {bmatrix}. }_{\mathbf {b}} \end {equation} We cannot solve this exactly, so we find the motion vector \(\mathbf {d}\) that minimizes the squared error \begin {equation} \min _{\mathbf d}||\mathbf {A}\mathbf {d}-\mathbf {b}||^2. \end {equation} The standard least-squares solution is found by solving the normal equations \begin {equation} (\mathbf {A}^T\mathbf {A})\mathbf {d}=\mathbf {A}^T\mathbf {b}, \end {equation} where \(\mathbf {A}^T\mathbf {A}\) is a \(2\times 2\) matrix that sums the spatial gradient information (the “structure”) across the entire window \begin {equation} \mathbf {A}^T\mathbf {A} = \sum _{i\in W} \begin {bmatrix} I_{x_i}^2 & I_{x_i}I_{y_i} \\ I_{x_i}I_{y_i} & I_{y_i}^2 \end {bmatrix}, \end {equation} and \(\mathbf {A}^T\mathbf {b}\) is a \(2\times 1\) vector that sums the motion evidence \begin {equation} \mathbf {A}^T\mathbf {b} = \sum _{i\in W} \begin {bmatrix} -I_{x_i}I_{t_i} \\ -I_{y_i}I_{t_i} \end {bmatrix}. \end {equation} The final motion is then found by \begin {equation} \mathbf {d} = (\mathbf {A}^T\mathbf {A})^{-1}(\mathbf {A}^T\mathbf {b}). \end {equation}
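A compact sketch of this windowed least-squares estimator; the gradient operator (np.gradient) and the window half-size are implementation choices of the sketch.

```python
# Least-squares flow over a window: stack one OFCE row per pixel of the
# window and solve the normal equations A^T A d = A^T b.
import numpy as np

def flow_in_window(I1, I2, y, x, half=7):
    """Estimate (u, v) at (y, x) assuming constant flow in the window."""
    Iy, Ix = np.gradient(I1.astype(float))          # spatial gradients
    It = I2.astype(float) - I1.astype(float)        # temporal gradient
    sl = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)   # n x 2
    b = -It[sl].ravel()                                      # n x 1
    d, *_ = np.linalg.lstsq(A, b, rcond=None)       # solves A^T A d = A^T b
    return d                                        # d = [u, v]

I1 = np.random.rand(64, 64)
I2 = np.roll(I1, (1, 1), axis=(0, 1))
print(flow_in_window(I1, I2, 32, 32))   # rough small-motion estimate of (u, v)
```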

7.5 Practical issues and solutions

  1. \(A\) may be singular. If \(A\) is not invertible (e.g., in flat regions, in pure-noise regions, or simply because the motion cannot be described as a simple translation), it is impossible to find \(A^{-1}\). Solutions:

    1. Compute the pseudo-inverse.
    2. Compute \((A+\epsilon I)^{-1}\) instead of \(A^{-1}\), where \(\epsilon I\) is a regularization term (see the sketch after this list).
    3. Suppose that \(d=0\).
    4. Create an overdetermined linear system with the polynomials of the neighboring pixels, considering for example a Gaussian window for weighting their contributions, and solve for \(d\) in the least-squares sense. This is the solution proposed by Farnebäck. See the next section.
    5. Rely on pyramids. This helps to capture large motions (a big pixel movement in the full-resolution image becomes a small, manageable movement in the low-resolution (top) levels of the pyramid).
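A tiny sketch of solution 2 above (regularizing the structure matrix); the \(\epsilon\) value and the toy numbers are arbitrary choices for illustration.

```python
# Regularize A before inverting: flat regions (nearly singular A) then yield
# a small, stable d instead of a numerically exploding one.
import numpy as np

def solve_regularized(A, rhs, eps=1e-3):
    """Solve (A + eps*I) d = rhs instead of A d = rhs."""
    return np.linalg.solve(A + eps * np.eye(A.shape[0]), rhs)

A = np.array([[1e-8, 0.0],        # nearly singular structure matrix
              [0.0, 1e-8]])
rhs = np.array([1e-9, 0.0])
print(solve_regularized(A, rhs))  # a small vector instead of a huge one
```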

Finding \(d\) using least squares

Let’s define the polynomial expansion for pixel location \(l_i\) in the first image as \begin {equation} f^1(l) \approx l^TA_i^1l + {(b_i^1)}^Tl + c_i^1, \end {equation} where \(l_i\) is the \(i\)-th pixel’s coordinates, and the same for the second image \begin {equation} f^2(l) \approx l^TA_i^2l + {(b_i^2)}^Tl + c_i^2. \end {equation} As we derived earlier, if \(f^2\) is a translated version of \(f^1\) by a displacement \(d=[d_x\ d_y]^T\), ideally we have that \begin {equation} b^2_i-b^1_i = -2A^1_id. \end {equation} Notice that we use, say for a pixel at location \(l_1\), all the pixels in a window centered at \(l_1\) to determine a single \(d=[d_x\ d_y]^T\).

Therefore, when \(A^1_i\) is not invertible, considering also the Gaussian weights defined by \begin {equation} w_i = \exp \left ( -\frac {||x_i-x_0||^2}{2\sigma ^2} \right ) \end {equation} (notice that these weights downweight distant neighbors), we can build the overdetermined linear system \begin {equation} \begin {bmatrix} 2w_1A^1_1 \\ 2w_2A^1_2 \\ \vdots \\ 2w_NA^1_N \end {bmatrix} d = \begin {bmatrix} w_1(b^1_1-b^2_1) \\ w_2(b^1_2-b^2_2) \\ \vdots \\ w_N(b^1_N-b^2_N) \\ \end {bmatrix}. \end {equation} Notice that since \(N>1\) (\(N\) is the number of pixels in the window), we have at least \(2N=4\) equations for only two unknowns (the components of \(d\)). The best \(d\) (in the least-squares sense) minimizes the error \begin {equation} E(d) = \sum _i w_i ||b^2_i - b^1_i + 2A^1_id||^2, \end {equation} which, as can be seen, is a weighted squared error.

To find the minimizer, set the gradient of \(E(d)\) with respect to \(d\) to zero: \begin {equation} \frac {\partial E}{\partial d} = 4\sum _i w_i(A^1_i)^T\left (2A^1_i d + (b^2_i-b^1_i)\right ) = 0. \end {equation} To simplify, let’s define \(A^1_i=A_i\) and \(b^1_i=b_i\). Then \begin {equation} \left ( \sum _i w_iA_i^TA_i \right ) d = \frac {1}{2}\sum _i w_iA_i^T(b_i-b^2_i), \end {equation} which, considering that \((b^2_i - b_i) = -(b_i-b^2_i)\) (if we interchange \(f^1\) and \(f^2\), the sign of \(d\) changes), can also be written as \begin {equation} \left ( \sum _i w_iA_i^TA_i \right ) d = -\frac {1}{2}\sum _i w_iA_i^T(b^2_i-b_i). \end {equation} These are the normal equations.

This is a \(2\times 2\) linear system (for 2D flow), easily solvable for \(d\) as \begin {equation} d = \frac {1}{2}\left ( \sum _i w_iA_i^TA_i \right )^{-1}\left (\sum _i w_iA_i^T(b_i-b^2_i)\right ). \end {equation}

Example

Let’s verify Eq. 11 numerically.
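The following is a minimal numerical check with made-up expansion coefficients: we fabricate \(b_2\) from a known displacement using the ideal relation \(b_2 = b - 2Ad\) and then recover \(d\) with Eq. 11; all the numbers are illustrative.

```python
# Recover d = (1/2) A^{-1} (b - b2) from fabricated expansion coefficients.
import numpy as np

A = np.array([[2.0, 0.5],         # assumed expansion of f1 (positive definite)
              [0.5, 1.0]])
b = np.array([0.3, -0.2])
d_true = np.array([0.25, -0.5])   # the displacement we want to recover

b2 = b - 2 * A @ d_true           # ideal relation b2 = b - 2 A d
d = 0.5 * np.linalg.inv(A) @ (b - b2)
print(d)                          # [ 0.25 -0.5 ], matching d_true
```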

The general least-squares problem

Suppose you have an overdetermined linear system \begin {equation} Ax=b, \end {equation} where \(A\) is an \(m\times n\) matrix with \(m>n\) (more equations than unknowns), \(x\) is the \(n\times 1\) vector of unknowns, and \(b\) is the \(m\times 1\) vector of known measurements. Usually, there is no exact solution (the equations cannot all be satisfied simultaneously), so we look for an \(x\) that minimizes the sum of squared residuals \begin {equation} E(x) = ||Ax-b||^2 = (Ax-b)^T(Ax-b). \end {equation} This is the least-squares problem.

To minimize \(E(x)\), we set its gradient with respect to \(x\) equal to zero: \begin {equation} \frac {\partial E}{\partial x} = 2A^T(Ax-b) = 0. \end {equation}

Simplifying, \begin {equation} A^TAx = A^Tb, \end {equation} we obtain the normal equations (the residual vector \(r=Ax-b\) is orthogonal (normal) to the column space of \(A\), i.e., \(A^Tr=0\)).

If \(A^TA\) is invertible (i.e., the columns of \(A\) are linearly independent), the least-squares solution is \begin {equation} x = (A^TA)^{-1}A^Tb, \end {equation} where \(x\) minimizes the squared error.

For example, let’s say we want to fit a line \(y=ax+b\) through noisy data points. For each point \(i\) we have that \begin {equation*} ax_i+b = y_i. \end {equation*} Stacking for \(N\) points, \begin {equation*} \begin {bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end {bmatrix} \begin {bmatrix} a \\ b \\ \end {bmatrix} = \begin {bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end {bmatrix}. \end {equation*} That’s \(Ax=b\). The normal equation is \begin {equation*} A^TA \begin {bmatrix} a \\ b \\ \end {bmatrix} = A^Tb, \end {equation*} which yields the classic least-squares line fit.
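The same fit in code, using np.linalg.lstsq (which solves the least-squares problem in a numerically robust way); the synthetic data parameters are arbitrary.

```python
# Least-squares line fit: build A and b as above and solve for [a, b].
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)   # noisy points on y = 2x + 1

A = np.column_stack([x, np.ones_like(x)])        # rows [x_i, 1]
a_b, *_ = np.linalg.lstsq(A, y, rcond=None)
print(a_b)                                       # approximately [2.0, 1.0]
```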

8 ME and RDO

ME can be designed to minimize the distortion \(D\) of the residues after using the MCT (Motion Compensated Transform), or to minimize the Lagrangian \begin {equation} J = R + \lambda D, \end {equation} which also takes into consideration the bit-rate \(R\). Notice, however, that in this case the computation of the motion information is also determined by the bit-rate achieved by the entropy coding of the motion data and the residues.
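A toy sketch of this trade-off: among candidate vectors, the Lagrangian \(J = R + \lambda D\) selects the winner; all the rates, distortions, and \(\lambda\) below are fabricated for illustration.

```python
# RDO-driven vector selection: pick the candidate minimizing J = R + lambda*D.
candidates = [                     # (vector, rate in bits, distortion)
    ((0, 0), 1, 120.0),            # cheap to code, worse prediction
    ((3, -2), 9, 80.0),            # expensive to code, better prediction
]
lam = 10.0
best = min(candidates, key=lambda c: c[1] + lam * c[2])
print(best[0])                     # the vector with the smallest J
```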

Notice that, in general, \(D\) will decrease if the “motion part” of \(R\) increases. However, if the motion information can be inferred by the decoder, \(R\) will only be affected by the entropy encoding of the residues. On the other hand, when the motion information is inferred at the decoder, it will be less accurate than if we use all the visual information available at the encoder.

9 Matching criteria

10 References

[1]   A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.

[2]   G. Farnebäck. Two-Frame Motion Estimation Based on Polynomial Expansion. In Scandinavian Conference on Image Analysis, pages 363–370. Springer, 2003.

[3]   V. González-Ruiz. Motion Compensated Temporal Filtering (MCTF).

[4]   B.K.P. Horn and B.G. Schunck. Determining Optical Flow. In Techniques and Applications of Image Understanding, volume 281, pages 319–331. International Society for Optics and Photonics, 1981.

[5]   M.T. Orchard and G.J. Sullivan. Overlapped Block Motion Compensation: An Estimation-Theoretic Approach. IEEE Transactions on Image Processing, 3(5):693–699, 1994.

[6]   K.R. Rao and J.J. Hwang. Techniques and Standards for Image, Video, and Audio Coding, volume 70. Prentice Hall, New Jersey, 1996.

[7]   S. Zhu and K.-K. Ma. A New Diamond Search Algorithm for Fast Block-Matching Motion Estimation. IEEE Transactions on Image Processing, 9(2):287–290, 2000.

1For example, in the MPEG-1 standard, the images are divided into blocks of \(16\times 16\) pixels called macroblocks.

2This means that, even if two images are visually identical, they can have different representations, and therefore \({\mathbf E}\ne {\mathbf 0}\).

3This operation implies low-pass filtering to avoid the aliasing that the downsampling would otherwise produce.