MC (Motion Compensation) is the process of generating the residue frames by subtracting from the predicted (anchor) frame a prediction frame, which is generated by projecting one or more reference frames with one or more motion fields.
Most video coding standards use block-based MC techniques.
If the blocks can be deformed when they are projected (to create the prediction frame), we are using mesh-based MC.
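To make the idea concrete, here is a minimal sketch of block-based MC with integer-pixel accuracy, assuming grayscale frames stored as NumPy arrays and a per-block motion field (all names are illustrative, not those of any particular codec):
\begin{verbatim}
import numpy as np

def mc_predict(reference, motion, B=16):
    # Build the prediction frame by projecting `reference` with a
    # per-block integer motion field `motion` (shape: H/B x W/B x 2).
    h, w = reference.shape
    prediction = np.empty_like(reference)
    for by in range(0, h, B):
        for bx in range(0, w, B):
            dy, dx = motion[by // B, bx // B]
            # Clamp the displaced block so it stays inside the reference.
            sy = min(max(by + dy, 0), h - B)
            sx = min(max(bx + dx, 0), w - B)
            prediction[by:by + B, bx:bx + B] = reference[sy:sy + B, sx:sx + B]
    return prediction

# The residue is the pixel-wise difference between the anchor frame
# and its motion compensated prediction:
#   residue = anchor - mc_predict(reference, motion)
\end{verbatim}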
Most video coding standards operate at the block level. For each block, the mode decision (run only by the compressor) determines the “type of block”:
The type of the blocks is signaled in the code-stream. A visual example of the decision of the type of the macro-blocks is shown in Fig. 2.
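For illustration only (this is not the criterion of any particular standard), a mode decision could compare the energy of the motion compensated residue against a crude intra predictor:
\begin{verbatim}
import numpy as np

def block_type(block, predicted_block):
    # Illustrative mode decision: P-type when the MC prediction error
    # has less energy than a flat (DC) intra prediction, I-type otherwise.
    inter_energy = np.sum((block - predicted_block) ** 2)
    intra_energy = np.sum((block - block.mean()) ** 2)
    return 'P' if inter_energy < intra_energy else 'I'
\end{verbatim}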
Frames can be:
In the intra coding mode, all the frames are encoded as independent images (no MC is used at all and, therefore, all the blocks of the frames are I-type). Intra-coded frames are also called keyframes. See the notebook III... video compression.
Advantages:
Disadvantages:
Mainly used in video editing.
In the low-delay coding mode (see Fig. 10), only the first frame is intra-coded and the rest are prediction-coded using unidirectional motion compensation. See the notebook Introducing the Low-delay (IPP...) Mode.
Advantages:
Disadvantages:
Mainly used in video surveillance.
In the random-access mode (see Fig. 3) the use of B-type frames is allowed. Notice that, by definition, the low-delay coding mode is also a random-access mode.
Advantages:
Disadvantages:
See Fig. 4.
Figure 6 describes an inter/intra video codec, which corresponds to:
\begin {equation} {\mathbf W}_k = \text {color-T}({\mathbf V}_k), \tag {a} \end {equation}
\begin {equation} {\mathbf W}_{k-1} = Z^{-1}({\mathbf W}, k-1), \tag {b} \end {equation} and by definition, \(Z^{-1}({\mathbf W}, -1) = {\mathbf 0}\),
\begin {equation} \overset {(k-1)\rightarrow k}{\mathbf M} = \text {ME}({\mathbf W}_{k-1}, {\mathbf W}_k), \tag {c} \end {equation} where ME stands for Motion Estimation, and by definition, \(\overset {(-1)\rightarrow 0}{{\mathbf M}} = {\mathbf 0}\),
\begin {equation} \overset {(k-1)\rightarrow k}{\mathbf M} = \overset {(k-1)\rightarrow k}{\mathbf M}~\text {(lossless~coding)}, \tag {d} \end {equation}
\begin {equation} \overset {(k-1)\rightarrow k}{\mathbf M} = \overset {(k-1)\rightarrow k}{\mathbf M}~\text {(lossless~decoding)}, \tag {e} \end {equation} that is, the motion fields are losslessly encoded and decoded, so the decoder works with exactly the same motion information as the encoder,
\begin {equation} {\mathbf E}_k = {\mathbf W}_k - \overset {\wedge }{{\mathbf W}}_k, \tag {f} \end {equation} where the symbol \(-\) denotes pixel-wise subtraction,
\begin {equation} \overset {\sim }{\mathbf E}_k \leftarrow \text {QE}({\mathbf E}_k), \tag {g} \end {equation} where QE\((\cdot )\) represents the lossy compression of the prediction error texture data,
\begin {equation} \overset {\sim }{\mathbf E}_k \leftarrow \text {DQ}(\overset {\sim }{\mathbf E}_k), \tag {h} \end {equation} where DQ\((\cdot )\) represents the decompression of the prediction error texture data,
\begin {equation} \overset {\sim }{\mathbf W}_k \leftarrow \overset {\sim }{\mathbf E}_k + \overset {\wedge }{\mathbf W}_k, \tag {i} \end {equation} and notice that if \(\overset {\wedge }{\mathbf W}_k = {\mathbf 0}\), then \(\overset {\sim }{\mathbf E}_k = \overset {\sim }{\mathbf W}_k\),
\begin {equation} \overset {\wedge }{\mathbf W}_k = \text {P}(\overset {\sim }{\mathbf W}_{k-1}, \overset {(k-1)\rightarrow k}{\mathbf M}), \tag {j} \end {equation} where P\((\cdot ,\cdot )\) is a motion compensated predictor, and
\begin {equation} \overset {\sim }{\mathbf W}_{k-1} = Z^{-1}(\overset {\sim }{\mathbf W}, k-1), \tag {k} \end {equation} where by definition, \(Z^{-1}(\overset {\sim }{\mathbf W}, -1) = 0\), and
\begin {equation} \overset {\sim }{\mathbf V}_k = \text {color-T}^{-1}(\overset {\sim }{\mathbf W}_k), \tag {l} \end {equation} is the inverse color transform.
Notice that if \(\overset {\wedge }{{\mathbf W}}_k\) is similar to \({\mathbf W}_k\), then \({\mathbf E}_k\) will be approximately zero and, therefore, easily compressed. Another interesting aspect to highlight is that the encoder replicates the decoder in order to use the reconstructed images as reference and avoid drift error.
It’s time to test the performance of the ME/MC process previously described, in the image domain. We encode a sequence of frames \(\{W_k\}\) using the pattern IPP..., which means that the first frame will be intra-coded (I-type frame) and the rest of the frames of the GOF (Group Of Frames) will be predicted-coded (P-type frames), with respect to the previous one.
The IPP... coding can be done by the codec shown in Fig. 6, where:
\begin {equation} V_k \leftarrow \text {C}(W_k) = \begin {bmatrix} \frac {1}{4} & \frac {1}{2} & \frac {1}{4} \\ \frac {1}{2} & 0 & -\frac {1}{2} \\ -\frac {1}{4} & \frac {1}{2} & -\frac {1}{4} \end {bmatrix} \begin {bmatrix} W_k.\text {R} \\ W_k.\text {G} \\ W_k.\text {B} \end {bmatrix} , \tag {a} \end {equation}
\begin {equation} Z^{-1}(V_k) = V_{k-1}, \tag {b} \end {equation} and by definition, \(Z^{-1}(V_{-1}) = 0\),
\begin {equation} \overset {k\rightarrow k-1}{V} \leftarrow \text {M}(V_k, V_{k-1}), \tag {c} \end {equation} where M stands for Motion Estimation, and by definition, \(\overset {0\rightarrow (-1)}{V}=0\),
\begin {equation} \overset {\sim }{\overset {k\rightarrow k-1}{V}} \leftarrow \text {E}_{\overset {\rightarrow }{V}}(\overset {k\rightarrow k-1}{V}), \tag {d} \end {equation} where E\(_{\overset {\rightarrow }{V}}(\cdot )\) represents the lossy compression of the motion data,
\begin {equation} \overset {\sim }{\overset {k\rightarrow k-1}{V}} \leftarrow \text {D}_{\overset {\rightarrow }{V}}(\overset {\sim }{\overset {k\rightarrow k-1}{V}}), \tag {e} \end {equation} where D\(_{\overset {\rightarrow }{V}}(\cdot )\) represents the decompression of the motion data,
\begin {equation} E_k \leftarrow V_k - \overset {\wedge }{{V}}_k, \tag {f} \end {equation} where the symbol \(-\) denotes pixel-wise subtraction,
\begin {equation} \overset {\sim }{E}_k \leftarrow \text {E}_{E}(E_k), \tag {g} \end {equation} where E\(_{E}(\cdot )\) represents the lossy compression of the prediction error texture data,
\begin {equation} \overset {\sim }{E}_k \leftarrow \text {D}_{E}(\overset {\sim }{E}_k), \tag {h} \end {equation} where D\(_{E}(\cdot )\) represents the decompression of the prediction error texture data,
\begin {equation} \overset {\sim }{V}_k \leftarrow \overset {\sim }{E}_k + \overset {\wedge }{V}_k, \tag {i} \end {equation} and notice that if \(\overset {\wedge }{V}_k=0\), then \(\overset {\sim }{E}_k = \overset {\sim }{V}_k\),
\begin {equation} \overset {\wedge }{V}_k \leftarrow \text {P}(\overset {\sim }{\overset {k\rightarrow k-1}{V}}, \overset {\sim }{V}_{k-1}), \tag {j} \end {equation} where P\((\cdot ,\cdot )\) is a motion compensated predictor.
Again, notice that if \(\overset {\wedge }{{V}}_k\) is similar to \(V_k\), then \(E_k\) will be approximately zero and, therefore, easily compressed, and that the encoder replicates the decoder in order to use the reconstructed images as reference and avoid drift error.
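The whole loop can be condensed into a few lines. This is only a schematic sketch of Eqs. (a)–(j), assuming that \texttt{me} and \texttt{predict} are available implementations of the ME and P stages, and modeling the lossy texture codec E\(_E\)/D\(_E\) with a uniform quantizer:
\begin{verbatim}
import numpy as np

# Analysis color transform of Eq. (a); rows act on the (R, G, B) planes.
C = np.array([[ 1/4, 1/2,  1/4],
              [ 1/2, 0.0, -1/2],
              [-1/4, 1/2, -1/4]])

def encode_ipp(frames, me, predict, step=8):
    # IPP... encoder loop; the motion coding (Eqs. (d)-(e)) is omitted.
    prev, prev_rec, motion = None, None, None
    for k, W in enumerate(frames):
        V = np.einsum('ij,hwj->hwi', C, W)        # Eq. (a)
        if k == 0:
            V_hat = np.zeros_like(V)              # I-type frame
        else:
            motion = me(V, prev)                  # Eq. (c)
            V_hat = predict(motion, prev_rec)     # Eq. (j), uses the
                                                  # reconstruction (no drift)
        E = V - V_hat                             # Eq. (f)
        E_code = np.round(E / step)               # Eq. (g), the only loss
        E_tilde = E_code * step                   # Eq. (h)
        prev, prev_rec = V, E_tilde + V_hat       # Eqs. (b) and (i)
        yield motion, E_code                      # code-stream payload
\end{verbatim}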
VBR: Under a constant quantization level (constant video quality), the number of bits that each compressed image needs depends on the image content (Variable Bit-Rate). See Fig. 9 for an example.
CBR: Using a Constant Bit-Rate strategy, all frames use the same number of bits. See Fig. 10 for an example.
It’s time to put together all the “tools” that we have developed for encoding a sequence of frames \(\{V_k\}\). First, the sequence will be split into GOFs (Groups Of Frames), and the structure of each GOF will be IPP... [1], which means that the first frame of each GOF will be intra-coded (I-type), and the rest of the frames of the GOF will be predicted-coded (P-type), with respect to the previous one1. Notice that in an I-type frame all the coefficients (coeffs, for short; remember that we are compensating the motion in the DWT domain) will be I-type coeffs, and in a P-type frame the coeffs can be either I-type or P-type.
The MRVC IPP... step (one resolution level) codec is described in Fig. 11. The equations that describe this system are:
\begin {equation} (V_k.L, V_k.H) \leftarrow \text {DWT}(V_k), \tag {a} \end {equation} where \(\leftarrow \) denotes the assignment operator, and \(V_k\) is the \(k\)-th frame of the sequence,
\begin {equation} [V_k.L] \leftarrow \text {DWT}^{-1}(V_k.L, 0), \tag {E.a} \end {equation}
\begin {equation} Z^{-1}([V_k.L]) = [V_{k-1}.L], \tag {E.b} \end {equation} and by definition, \(Z^{-1}([V_{-1}.L]) = 0\),
\begin {equation} \overset {k\rightarrow k-1}{V} \leftarrow \text {M}([V_k.L], [V_{k-1}.L]), \tag {E.c} \end {equation} where M stands for Motion estimation, and by definition, \(\overset {0\rightarrow (-1)}{V}=0\),
\begin {equation} [\hat {V}_k.L] \leftarrow \text {P}(\overset {k\rightarrow k-1}{V}, [V_{k-1}.L]), \tag {E.d} \end {equation} where P stands for motion compensated Prediction,
\begin {equation} [E_k.L] \leftarrow [V_k.L] - [\hat {V}_k.L], \tag {E.e} \end {equation}
\begin {equation} \{[M_k],[S_k]\} \leftarrow \text {EW-min}([V_k.L], [E_k.L]) \tag {E.f} \end {equation} where \begin {equation} [M_k]_{i,j}=\text {min}([V_k.L]_{i,j}, [E_k.L]_{i,j}), \end {equation} and \([S_k]\) is a binary matrix defined by \begin {equation} [S_k]_{i,j} = \left \{ \begin {array}{lll} 0 & \text {if}~[V_k.L]_{i,j} < [E_k.L]_{i,j} & \text {(I-type coeff)} \\ 1 & \text {otherwise} & \text {(P-type coeff)}, \end {array} \right . \label {eq:matrix} \end {equation} (notice that \([M_k]\), which contains the element-wise minimum of both matrices, is discarded; a minimal sketch of this coefficient-type decision is given after the equations)
\begin {equation} [V_k.H] \leftarrow \text {DWT}^{-1}(0, V_k.H), \tag {b} \end {equation}
\begin {equation} [E_k.H] \leftarrow [V_k.H] - [\hat {V}_k.H], \tag {c} \end {equation} where, notice that \begin {equation} [E_k.H]_{i,j} = \left \{ \begin {array}{ll} {[}V_k.H{]}_{i,j} & \text {if}~{[}\hat {V}'_k.H{]}_{i,j} = 0~\text {(I-type coeff)} \\ {[}V_k.H{]}_{i,j} - [\hat {V}'_k.H]_{i,j} & \text {otherwise}~\text {(P-type coeff)}, \end {array} \right . \end {equation}
\begin {equation} [\tilde {E}_k.H] \leftarrow \text {Q}([E_k.H]), \tag {d} \end {equation}
\begin {equation} [\tilde {E}_k.H] \leftarrow \text {Q}^{-1}([\tilde {E}_k.H]), \tag {E.g} \end {equation}
\begin {equation} [\tilde {V}_k.H] \leftarrow [\tilde {E}_k.H] + [\hat {V}'_k.H], \tag {E.h} \end {equation} and notice that if \([\hat {V}'_k.H]=0\), then \([\tilde {V}_k.H] = [\tilde {E}_k.H]\),
\begin {equation} Z^{-1}([\tilde {V}_k.H]) = [\tilde {V}_{k-1}.H], \tag {E.i} \end {equation} and by definition, \(Z^{-1}([\tilde {V}_{-1}.H]) = 0\),
\begin {equation} [\hat {V}_k.H] \leftarrow \text {P}(\overset {k\rightarrow k-1}{V}, [\tilde {V}_{k-1}.H]), \tag {E.j} \end {equation}
\begin {equation} [\hat {V}'_k.H]_{i,j} \leftarrow \left \{ \begin {array}{ll} {[}\hat {V}_k.H{]}_{i,j} & \text {if}~{[}E_k.L{]}_{i,j} < {[}V_k.L{]}_{i,j} \text {(P-type coeff)} \\ 0 & \text {otherwise (I-type coeff)}, \end {array} \right . \tag {E.k} \end {equation}
\begin {equation} (0, \tilde {E}_k.H) \leftarrow \text {DWT}([\tilde {E}_k.H]), \tag {f} \end {equation}
\begin {equation} \{V_k.L, \tilde {E}_k.H\} \leftarrow \text {E}(V_k.L, \tilde {E}_k.H), \tag {g} \end {equation} where E represents the entropy coding of both data sources, in two different code-streams,
\begin {equation} (V_k.L, \tilde {E}_k.H) \leftarrow \text {E}^{-1}(\{V_k.L, \tilde {E}_k.H\}), \tag {h} \end {equation}
\begin {equation} [\tilde {E}_k.H] \leftarrow \text {DWT}^{-1}(0, \tilde {E}_k.H), \tag {i} \end {equation}
\begin {equation} (0, \tilde {V}_k.H) \leftarrow \text {DWT}([\tilde {V}_k.H]), \tag {j} \end {equation}
and
\begin {equation} \tilde {V}_k \leftarrow \text {DWT}^{-1}(V_k.L, \tilde {V}_k.H). \tag {k} \end {equation}
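As announced above, the coefficient-type decision of Eqs. (E.f) and (E.k) can be sketched as follows (a minimal illustration in NumPy; ties are resolved as P-type, following Eq. (E.f)):
\begin{verbatim}
import numpy as np

def ew_min(V_L, E_L):
    # Eq. (E.f): S[i,j] = 0 marks an I-type coefficient (V_L smaller),
    # 1 a P-type one. M, the element-wise minimum, is discarded.
    S = (V_L >= E_L).astype(np.uint8)
    M = np.minimum(V_L, E_L)
    return M, S

def gate_prediction(V_hat_H, S):
    # Eq. (E.k): keep the motion compensated prediction only for the
    # P-type coefficients; zero it for the I-type ones.
    return np.where(S == 1, V_hat_H, 0)
\end{verbatim}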
The IPP... codec is inspired by Differential Pulse Code Modulation (DPCM). This notebook shows how to implement a simple DPCM codec.
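A minimal 1-D DPCM sketch, in which the only loss is the quantization of the prediction error and the encoder embeds the decoder (exactly as the IPP... codec does), could look like this:
\begin{verbatim}
import numpy as np

def dpcm_encode(samples, step=4):
    codes, rec = [], 0.0
    for x in samples:
        e = x - rec                      # prediction error
        code = int(np.round(e / step))   # quantization (the only loss)
        codes.append(code)
        rec += code * step               # decoder replica (avoids drift)
    return codes

def dpcm_decode(codes, step=4):
    rec, out = 0.0, []
    for code in codes:
        rec += code * step
        out.append(rec)
    return out
\end{verbatim}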
As can be seen in the previous section, in each IPP... iteration of the step encoder, only the high-frequency information of the sequence of frames is decorrelated (the \(H\) subbands), considering the information provided by the low frequencies (the \(L\) subband), which are losslessly transmitted between the encoder and the decoder. Notice also that if the \(L\) data cannot be transmitted to the decoder, a drift error will occur, because the matrix \(S_k\) will be different in the encoder and the decoder for the same frame \(k\).
Obviously (and unfortunately), the lossless transmission of the \(L\)’s bounds the compression ratio that we can get. One solution is to perform more than one level at the DWT stage (see Eq. (a)) and to apply the IPP... MRVC step encoder by spatial resolutions, starting at the lowest, as the decoder will do at decompression time. If we represent the Spatial Resolution Level (SRL) with a superscript, 0 being the original SRL, we can express the operation of the codec described in Fig. 11 by \begin {equation} \left \{ \begin {array}{l} \text {SE}(V^0_k) = \{V^0_k.L, \tilde {E}^0_k.H\} = \{V^1_k, \tilde {E}^0_k.H\} \\ \text {SD}(\{V^1_k, \tilde {E}^0_k.H\}) = \tilde {V}^0_k, \end {array} \right . \label {eq:codec_1l} \end {equation} where \(\text {SE}(\cdot )\) denotes the operation of the IPP... step encoder and \(\text {SD}(\cdot )\) that of the IPP... step decoder. As can be seen, Eq. \eqref {eq:codec_1l} is only valid when exactly one level of the DWT has been applied.
In general, for \(s\) levels of the DWT, we have that \begin {equation} \left \{ \begin {array}{l} \text {SE}(V^{s-1}_k) = \{V^s_k, \tilde {E}^{s-1}_k.H\} \\ \text {SD}(\{V^s_k, \tilde {E}^{s-1}_k.H\}) = \tilde {V}^{s-1}_k, \end {array} \right . \label {eq:codec_sl} \end {equation} where \(\tilde {V}^{s-1}\) is the \((s-1)\)-th SRL of the reconstructed sequence \(\tilde {V}\).
The next SRL (\(s-2\)), \(\tilde {V}^{s-2}\), is determined by \begin {equation} \left \{ \begin {array}{l} \text {SE}(\tilde {V}^{s-2}_k) = \{\tilde {V}^{s-1}_k, \tilde {E}^{s-2}_k.H\} \\ \text {SD}(\{\tilde {V}^{s-1}_k, \tilde {E}^{s-2}_k.H\}) = \tilde {V}^{s-2}_k, \end {array} \right . \label {eq:codec_s1l} \end {equation} and finally, for the highest SRL, we get \(\tilde {V}^0\) defined by Eq. \eqref {eq:codec_1l}.
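One possible arrangement of this recursion, assuming that \texttt{SE} and \texttt{SD} implement the step codec of Fig. 11, is sketched below; the encoder produces one S-layer per SRL, and the decoder consumes them starting at the lowest resolution:
\begin{verbatim}
def encode_multires(V0, SE, s):
    # Apply the IPP... step encoder s times, from SRL 0 up to SRL s.
    layers, V = [], V0
    for _ in range(s):
        V, E_H = SE(V)          # SE(V^l) = {V^(l+1), E~^l.H}
        layers.append(E_H)
    return V, layers            # {V^s, E~^(s-1).H, ..., E~^0.H}

def decode_multires(V_s, layers, SD):
    # Invert the pyramid, starting at the lowest spatial resolution.
    V = V_s
    for E_H in reversed(layers):
        V = SD((V, E_H))        # SD({V^(l+1), E~^l.H}) = V~^l
    return V
\end{verbatim}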
So, only the lowest SRL of \(\tilde {V}\), \(\tilde {V}^s\), is a pure III... sequence of small frames, losslessly encoded. This multiresolution (multistage) scheme is described in Fig. 12. The outputs of the IPP... encoder will be referred to as “spatial-layers”, or simply “S-layers”.
Fig. 13 shows the decoder. It inputs the collection of subbands generated by the encoder and, from each one, a video with a different spatial resolution is obtained.
The logical order of the S-layers in the code-stream is the one that allows, when the code-stream is decoded sequentially, the progressive increase of the spatial resolution of the video. For example, if \(s\) is the number of levels of the DWT, the generated code-stream has \((s+1)\) S-layers \begin {equation*} \{V^s,\tilde {E}^{s-1}.H,\tilde {E}^{s-2}.H,\cdots ,\tilde {E}^0.H\}, \end {equation*} which are able to generate \((s+1)\) progressive reconstructions \begin {equation*} \{V^s,\tilde {V}^{s-1},\tilde {V}^{s-2},\cdots ,\tilde {V}^0\}. \end {equation*}
Moreover, Quality (Q-progression) scalability in each SRL can also be achieved in the high-frequency textures if a quality-scalable image codec such as JPEG2000 [2] replaces the PNG compressor, generating a number \(q\) of quality-layers (“Q-layers”) for each motion compensated high-frequency subband. An SQ-progression is defined considering both forms of scalability (spatial and quality), with a higher number of layers. For example, if \(s=3\) (2 IPP...-type iterations) and \(q=2\), the progression of layers would be \begin {equation*} \{V^s[1],V^s[0],\tilde {E}^{s-1}.H[1],\tilde {E}^{s-1}.H[0],\tilde {E}^{s-2}.H[1],\tilde {E}^{s-2}.H[0],\cdots ,\tilde {E}^0.H[1],\tilde {E}^0.H[0]\}. \end {equation*}
The use of quality scalability boosts the possibilities in real-time streaming scenarios where the transmission bit-rate can be variable (sending more or fewer Q-layers of a given spatial resolution, depending on the bit-rate). Notice that the SQ-progression is free of drift error.2
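The SQ ordering is easy to enumerate; this small helper (names are illustrative) reproduces the progression of the example above:
\begin{verbatim}
def sq_progression(s, q):
    # S-layers from the lowest SRL, each split into q Q-layers
    # (most significant quality layer first).
    s_layers = [f"V^{s}"] + [f"E~^{l}.H" for l in range(s - 1, -1, -1)]
    return [f"{sl}[{j}]" for sl in s_layers for j in range(q - 1, -1, -1)]

# sq_progression(3, 2) = ['V^3[1]', 'V^3[0]', 'E~^2.H[1]', 'E~^2.H[0]',
#                         'E~^1.H[1]', 'E~^1.H[0]', 'E~^0.H[1]', 'E~^0.H[0]']
\end{verbatim}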
Compute the \(l\)-level DWT, with \(l=\lfloor \log _2(R)\rfloor -1\), of the predicted frame \(s_j\) and the two reference frames \(s_i\) and \(s_k\).
\(LL^l(m)\leftarrow 0\), or any other precomputed values (for example, from a previous ME in neighbor frames).
Divide the subband \(LL^l(s_j)\) into blocks of size \(B\times B\) pixels, and \(\pm 1\)-spiral-search them in the subbands \(LL^l(s_i)\) and \(LL^l(s_k)\), calculating a low-resolution \(LL^l(m)=\{LL^l(\overleftarrow {m}), LL^l(\overrightarrow {m})\}\) bi-directional motion vector field (a minimal sketch of this \(\pm 1\) search is given after these steps).
While \(l>0\):
Synthesize \(LL^{l-1}(m)\), \(LL^{l-1}(s_j)\), \(LL^{l-1}(s_i)\) and \(LL^{l-1}(s_k)\), by computing the 1-level DWT\(^{-1}\).
\(LL^{l-1}(m)\leftarrow LL^{l-1}(m)\times 2\) (the motion vectors are scaled to match the doubled resolution).
Refine \(LL^{l-1}(m)\) using \(\pm 1\)-spiral-search.
While \(l<A\) (in the first iteration, \(l=0\), and \(LL^0(m):=m\)):
Synthesize \(LL^{-l}(s_j)\), \(LL^{-l}(s_i)\) and \(LL^{-l}(s_k)\), computing the 1-level DWT\(^{-1}\) (high-frequency subbands are \(0\)). This performs a zoom-in in these frames using \(1/2\)-subpixel accuracy.
\(m\leftarrow m\times 2\).
Divide the subband \(LL^{-l}(s_j)\) into blocks of \(B\times B\) pixels and \(\pm 1\)-spiral-search them into the subbands \(LL^{-l}(s_i)\) and \(LL^{-l}(s_k)\), calculating a \(1/2^l\) sub-pixel accuracy \(m\) bi-directional motion vector field.
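Before the final prediction is computed, the \(\pm 1\) refinement used in the steps above can be sketched as follows (for brevity, the \(3\times 3\) neighbourhood is scanned in raster order instead of in a spiral; both orders find the same minimum):
\begin{verbatim}
import numpy as np

def pm1_search(block, target, y, x, vy, vx):
    # Refine one motion vector: test (vy, vx) and its 8 neighbours,
    # keep the candidate displacement with the minimum SAD.
    h, w = target.shape
    B = block.shape[0]
    best_sad, best_v = np.inf, (vy, vx)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            cy, cx = y + vy + dy, x + vx + dx
            if 0 <= cy <= h - B and 0 <= cx <= w - B:
                sad = np.abs(block - target[cy:cy + B, cx:cx + B]).sum()
                if sad < best_sad:
                    best_sad, best_v = sad, (vy + dy, vx + dx)
    return best_v, best_sad
\end{verbatim}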
Compute \begin {equation} \hat {b}\leftarrow \frac {b_i\big (\overleftarrow {e}_{\text {max}}-\overleftarrow {e}(b)\big ) + b_k\big (\overrightarrow {e}_{\text {max}}-\overrightarrow {e}(b)\big )}{\big (\overleftarrow {e}_{\text {max}}-\overleftarrow {e}(b)\big ) + \big (\overrightarrow {e}_{\text {max}}-\overrightarrow {e}(b)\big )}, \end {equation}
where \(\overleftarrow {e}(b)\) is the (minimum) distortion of the best backward matching for block \(b\), \(\overrightarrow {e}(b)\) the (minimum) distortion of the best forward matching for block \(b\), \(\overleftarrow {e}_{\text {max}}=\overrightarrow {e}_{\text {max}}\) are the backward and forward maximum matching distortions, \(b_i\) is the (backward) block found (as the most similar to \(b\)) in frame \(s_i\), and \(b_k\) is the (forward) block found in frame \(s_k\). Notice that, if \(\overleftarrow {e}(b)=\overrightarrow {e}(b)\), then the prediction is \begin {equation} \hat {b} = \frac {b_i + b_k}{2}, \end {equation} and if \(\overrightarrow {e}(b)=\overrightarrow {e}_{\text {max}}\) (the forward matching is as bad as possible), \begin {equation} \hat {b} = b_i, \end {equation} and vice versa.
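The weighting can be implemented directly from the equation; in this sketch, \texttt{e\_b} and \texttt{e\_f} stand for the backward and forward matching distortions of the block:
\begin{verbatim}
import numpy as np

def bidirectional_prediction(b_i, b_k, e_b, e_f, e_max):
    # Weighted average: the reference with the smaller matching
    # distortion receives the larger weight.
    w_b = e_max - e_b                 # backward weight
    w_f = e_max - e_f                 # forward weight
    if w_b + w_f == 0:                # both matchings maximally bad
        return (b_i + b_k) / 2
    return (b_i * w_b + b_k * w_f) / (w_b + w_f)
\end{verbatim}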
Using the encoding system described in the Figure 15, and defined by \begin {equation} \left \{\ \begin {array}{l} \tilde {\mathbf R} = \text {Q}_{\mathbf R}({\mathbf R}) \\ \tilde {\mathbf E} = \text {Q}_{\mathbf E}\big ({\mathbf P}-\overset {{\mathbf R}\rightarrow {\mathbf P}}{\mathbf M}(\tilde {\mathbf R})\big ) \end {array} \right . \label {eq:forward} \end {equation} and \begin {equation} \begin {array}{l} \tilde {\mathbf P} = \tilde {\mathbf E} + \overset {{\mathbf R}\rightarrow {\mathbf P}}{\mathbf M}(\tilde {\mathbf R}), \end {array} \label {eq:backward} \end {equation}
find \(\text {Q}_{\mathbf {R}}\) and \(\text {Q}_{\mathbf {E}}\) that minimize, in the RD domain, the RD curve of \begin {equation} \text {MSE}(\{\mathbf {R},\mathbf {P}\},\{\tilde {\mathbf {R}},\tilde {\mathbf {P}}\}) = \frac {\text {MSE}({\mathbf R},\tilde {\mathbf R}) + \text {MSE}({\mathbf P},\tilde {\mathbf P})}{2}, \end {equation} subject to \begin {equation} \text {MSE}({\mathbf R},\tilde {\mathbf R}) = \text {MSE}({\mathbf P},\tilde {\mathbf P}). \label {eq:constant_quality} \end {equation} Equation \eqref {eq:constant_quality} indicates that all the decoded frames should have the same distortion (from a human perception point of view). Notice that the transform defined by Equations \eqref {eq:forward} and \eqref {eq:backward} is not orthogonal and, therefore, the “subbands” \(\tilde {\mathbf R}\) and \(\tilde {\mathbf P}\) are not independent. It can be seen that \(\text {Q}_{\mathbf R}\) affects the selection of \(\text {Q}_{\mathbf E}\), because \(\tilde {\mathbf R}\) is used as reference for finding \(\mathbf E\).
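A brute-force sketch of this optimization follows, ignoring the rate term for brevity and modeling the hypothetical \(\text {Q}_{\mathbf R}\) and \(\text {Q}_{\mathbf E}\) with uniform quantization steps (\texttt{mc} is an assumed motion compensation function):
\begin{verbatim}
import itertools
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def search_quantizers(R, P, mc, steps, tol=0.05):
    # Try all (step_R, step_E) pairs; among those whose reconstructions
    # have (almost) equal MSE, keep the one with minimum distortion.
    best_d, best_pair = np.inf, None
    for sR, sE in itertools.product(steps, repeat=2):
        R_t = np.round(R / sR) * sR               # R~ = Q_R(R)
        E_t = np.round((P - mc(R_t)) / sE) * sE   # E~ = Q_E(P - MC(R~))
        P_t = E_t + mc(R_t)                       # P~
        d_R, d_P = mse(R, R_t), mse(P, P_t)
        if abs(d_R - d_P) <= tol * max(d_R, d_P, 1e-12):
            d = (d_R + d_P) / 2
            if d < best_d:
                best_d, best_pair = d, (sR, sE)
    return best_pair, best_d
\end{verbatim}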
[1] D. Le Gall. MPEG: A Video Compression Standard for Multimedia Applications. Communications of the ACM, 34(4):46–58, 1991.
[2] D.S. Taubman and M.W. Marcellin. JPEG2000: Image Compression Fundamentals, Standards and Practice. Kluwer Academic Publishers, 2002.
1Each P-type frame uses the previous frame as reference, except the second one, which always has an I-type frame as reference.
2Other progressions, such as the QS-progression, would generate drift because, at some truncation points of the code-stream, the decoder would use low-frequency information with a different quality than the encoder used.