MC (Motion Compensation) is the process of generating the residue frames by subtracting from the predicted (anchor) frame a prediction frame, which is generated by projecting one or more reference frames with one or more motion fields.
Most video coding standards use block-based MC techniques.
If the blocks can be deformed when they are projected (to create the prediction frame), we are using mesh-based MC.
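To make the idea concrete, here is a minimal sketch of block-based MC with integer-pixel accuracy, assuming grayscale frames stored as NumPy arrays and a per-block motion field (all names are illustrative, not those of any particular codec):
\begin{verbatim}
import numpy as np

def mc_predict(reference, motion, B=16):
    # Build the prediction frame by projecting `reference` with a
    # per-block integer motion field `motion` (shape: H/B x W/B x 2).
    h, w = reference.shape
    prediction = np.empty_like(reference)
    for by in range(0, h, B):
        for bx in range(0, w, B):
            dy, dx = motion[by // B, bx // B]
            # Clamp the displaced block so it stays inside the reference.
            sy = min(max(by + dy, 0), h - B)
            sx = min(max(bx + dx, 0), w - B)
            prediction[by:by + B, bx:bx + B] = reference[sy:sy + B, sx:sx + B]
    return prediction

# The residue is the pixel-wise difference between the anchor frame
# and its motion compensated prediction:
#   residue = anchor - mc_predict(reference, motion)
\end{verbatim}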
Most video coding standards operate at the block level. For each block, the mode decision (run only by the compressor) determines the “type of block”:
The type of the blocks is signaled in the code-stream. A visual example of the decision of the type of the macro-blocks is shown in Fig. 2.
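For illustration only (this is not the criterion of any particular standard), a mode decision could compare the energy of the motion compensated residue against a crude intra predictor:
\begin{verbatim}
import numpy as np

def block_type(block, predicted_block):
    # Illustrative mode decision: P-type when the MC prediction error
    # has less energy than a flat (DC) intra prediction, I-type otherwise.
    inter_energy = np.sum((block - predicted_block) ** 2)
    intra_energy = np.sum((block - block.mean()) ** 2)
    return 'P' if inter_energy < intra_energy else 'I'
\end{verbatim}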
Frames can be:
In the intra coding mode, all the frames are encoded as independent images (no MC is used at all and, therefore, all the blocks of the frames are I-type). Intra-coded frames are also called keyframes. See the notebook III... video compression.
Advantages:
Disadvantages:
Mainly used in video editing.
In the low-delay coding mode (see Fig. 10), only the first frame is intra-coded and the rest are prediction-coded using unidirectional motion compensation. See the notebook Introducing the Low-delay (IPP...) Mode.
Advantages:
Disadvantages:
Mainly used in video surveillance.
In the random-access mode (see Fig. 3) the use of B-type frames is allowed. Notice that, by definition, the low-delay coding mode is also a random-access mode.
Advantages:
Disadvantages:
See Fig. 4.
Figure 6 describes an inter/intra video codec, which corresponds to:
\begin {equation} {\mathbf W}_k = \text {color-T}({\mathbf V}_k), \tag {a} \end {equation}
\begin {equation} {\mathbf W}_{k-1} = Z^{-1}({\mathbf W}, k-1), \tag {b} \end {equation} and by definition, \(Z^{-1}({\mathbf W}, -1) = {\mathbf 0}\),
\begin {equation} \overset {(k-1)\rightarrow k}{\mathbf M} = \text {ME}({\mathbf W}_{k-1}, {\mathbf W}_k), \tag {c} \end {equation} where ME stands for Motion Estimation, and by definition, \(\overset {(-1)\rightarrow 0}{{\mathbf M}} = {\mathbf 0}\),
\begin {equation} \overset {(k-1)\rightarrow k}{\mathbf M} = \overset {(k-1)\rightarrow k}{\mathbf M}~\text {(lossless~coding)}, \tag {d} \end {equation}
\begin {equation} \overset {(k-1)\rightarrow k}{\mathbf M} = \overset {(k-1)\rightarrow k}{\mathbf M}~\text {(lossless~decoding)}, \tag {e} \end {equation} that is, the motion fields are losslessly encoded and decoded, so the decoder works with exactly the same motion information as the encoder,
\begin {equation} {\mathbf E}_k = {\mathbf W}_k - \overset {\wedge }{{\mathbf W}}_k, \tag {f} \end {equation} where the symbol \(-\) denotes pixel-wise subtraction,
\begin {equation} \overset {\sim }{\mathbf E}_k \leftarrow \text {QE}({\mathbf E}_k), \tag {g} \end {equation} where QE\((\cdot )\) represents the lossy compression of the prediction error texture data,
\begin {equation} \overset {\sim }{\mathbf E}_k \leftarrow \text {DQ}(\overset {\sim }{\mathbf E}_k), \tag {h} \end {equation} where DQ\((\cdot )\) represents the decompression of the prediction error texture data,
\begin {equation} \overset {\sim }{\mathbf W}_k \leftarrow \overset {\sim }{\mathbf E}_k + \overset {\wedge }{\mathbf W}_k, \tag {i} \end {equation} and notice that if \(\overset {\wedge }{\mathbf W}_k = {\mathbf 0}\), then \(\overset {\sim }{\mathbf E}_k = \overset {\sim }{\mathbf W}_k\),
\begin {equation} \overset {\wedge }{\mathbf W}_k = \text {P}(\overset {\sim }{\mathbf W}_{k-1}, \overset {(k-1)\rightarrow k}{\mathbf M}), \tag {j} \end {equation} where P\((\cdot ,\cdot )\) is a motion compensated predictor, and
\begin {equation} \overset {\sim }{\mathbf W}_{k-1} = Z^{-1}(\overset {\sim }{\mathbf W}, k-1), \tag {k} \end {equation} where by definition, \(Z^{-1}(\overset {\sim }{\mathbf W}, -1) = 0\), and
\begin {equation} \overset {\sim }{\mathbf V}_k = \text {color-T}^{-1}(\overset {\sim }{\mathbf W}_k), \tag {l} \end {equation} is the inverse color transform.
Notice that if \(\overset {\wedge }{{\mathbf W}}_k\) is similar to \({\mathbf W}_k\), then \({\mathbf E}_k\) will be approximately zero and, therefore, easily compressed. Another interesting aspect to highlight is that the encoder replicates the decoder in order to use the reconstructed images as reference and avoid drift error.
It’s time to test the performance of the ME/MC process previously described, in the image domain. We encode a sequence of frames \(\{W_k\}\) using the pattern IPP..., which means that the first frame will be intra-coded (I-type frame) and the rest of the frames of the GOF (Group Of Frames) will be predicted-coded (P-type frames), with respect to the previous one.
The IPP... coding can be done by the codec shown in Fig. 6, where:
\begin {equation} V_k \leftarrow \text {C}(W_k) = \begin {bmatrix} \frac {1}{4} & \frac {1}{2} & \frac {1}{4} \\ \frac {1}{2} & 0 & -\frac {1}{2} \\ -\frac {1}{4} & \frac {1}{2} & -\frac {1}{4} \end {bmatrix} \begin {bmatrix} W_k.\text {R} \\ W_k.\text {G} \\ W_k.\text {B} \end {bmatrix} , \tag {a} \end {equation}
\begin {equation} Z^{-1}(V_k) = V_{k-1}, \tag {b} \end {equation} and by definition, \(Z^{-1}(V_{-1}) = 0\),
\begin {equation} \overset {k\rightarrow k-1}{V} \leftarrow \text {M}(V_k, V_{k-1}), \tag {c} \end {equation} where M stands for Motion Estimation, and by definition, \(\overset {0\rightarrow (-1)}{V}=0\),
\begin {equation} \overset {\sim }{\overset {k\rightarrow k-1}{V}} \leftarrow \text {E}_{\overset {\rightarrow }{V}}(\overset {k\rightarrow k-1}{V}), \tag {d} \end {equation} where E\(_{\overset {\rightarrow }{V}}(\cdot )\) represents the lossy compression of the motion data,
\begin {equation} \overset {\sim }{\overset {k\rightarrow k-1}{V}} \leftarrow \text {D}_{\overset {\rightarrow }{V}}(\overset {\sim }{\overset {k\rightarrow k-1}{V}}), \tag {e} \end {equation} where D\(_{\overset {\rightarrow }{V}}(\cdot )\) represents the decompression of the motion data,
\begin {equation} E_k \leftarrow V_k - \overset {\wedge }{{V}}_k, \tag {f} \end {equation} where the symbol \(-\) denotes pixel-wise subtraction,
\begin {equation} \overset {\sim }{E}_k \leftarrow \text {E}_{E}(E_k), \tag {g} \end {equation} where E\(_{E}(\cdot )\) represents the lossy compression of the prediction error texture data,
\begin {equation} \overset {\sim }{E}_k \leftarrow \text {D}_{E}(\overset {\sim }{E}_k), \tag {h} \end {equation} where D\(_{E}(\cdot )\) represents the decompression of the prediction error texture data,
\begin {equation} \overset {\sim }{V}_k \leftarrow \overset {\sim }{E}_k + \overset {\wedge }{V}_k, \tag {i} \end {equation} and notice that if \(\overset {\wedge }{V}_k=0\), then \(\overset {\sim }{E}_k = \overset {\sim }{V}_k\),
\begin {equation} \overset {\wedge }{V}_k \leftarrow \text {P}(\overset {\sim }{\overset {k\rightarrow k-1}{V}}, \overset {\sim }{V}_{k-1}), \tag {j} \end {equation} where P\((\cdot ,\cdot )\) is a motion compensated predictor.
Again, notice that if \(\overset {\wedge }{{V}}_k\) is similar to \(V_k\), then \(E_k\) will be approximately zero and, therefore, easily compressed, and that the encoder replicates the decoder in order to use the reconstructed images as reference and avoid drift error.
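The whole loop can be condensed into a few lines. This is only a schematic sketch of Eqs. (a)–(j), assuming that \texttt{me} and \texttt{predict} are available implementations of the ME and P stages, and modeling the lossy texture codec E\(_E\)/D\(_E\) with a uniform quantizer:
\begin{verbatim}
import numpy as np

# Analysis color transform of Eq. (a); rows act on the (R, G, B) planes.
C = np.array([[ 1/4, 1/2,  1/4],
              [ 1/2, 0.0, -1/2],
              [-1/4, 1/2, -1/4]])

def encode_ipp(frames, me, predict, step=8):
    # IPP... encoder loop; the motion coding (Eqs. (d)-(e)) is omitted.
    prev, prev_rec, motion = None, None, None
    for k, W in enumerate(frames):
        V = np.einsum('ij,hwj->hwi', C, W)        # Eq. (a)
        if k == 0:
            V_hat = np.zeros_like(V)              # I-type frame
        else:
            motion = me(V, prev)                  # Eq. (c)
            V_hat = predict(motion, prev_rec)     # Eq. (j), uses the
                                                  # reconstruction (no drift)
        E = V - V_hat                             # Eq. (f)
        E_code = np.round(E / step)               # Eq. (g), the only loss
        E_tilde = E_code * step                   # Eq. (h)
        prev, prev_rec = V, E_tilde + V_hat       # Eqs. (b) and (i)
        yield motion, E_code                      # code-stream payload
\end{verbatim}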
VBR: Under a constant quantization level (constant video quality), the number of bits that each compressed image needs depends on the image content (Variable Bit-Rate). See Fig. 9 for an example.
CBR: Using a Constant Bit-Rate strategy, all frames use the same number of bits. See Fig. 10 for an example.
It’s time to put together all the “tools” that we have developed for encoding a sequence of frames \(\{V_k\}\). First, the sequence will be split into GOFs (Groups Of Frames), and the structure of each GOF will be IPP... [1], which means that the first frame of each GOF will be intra-coded (I-type), and the rest of the frames of the GOF will be predicted-coded (P-type), with respect to the previous one1. Notice that in an I-type frame all the coefficients (coeffs, for short; remember that we are compensating the motion in the DWT domain) will be I-type coeffs, and in a P-type frame the coeffs can be either I-type or P-type.
The MRVC IPP... step (one resolution level) codec is described in Fig. 11. The equations that describe this system are:
\begin {equation} (V_k.L, V_k.H) \leftarrow \text {DWT}(V_k), \tag {a} \end {equation} where \(\leftarrow \) denotes the assignment operator, and \(V_k\) is the \(k\)-th frame of the sequence,
\begin {equation} [V_k.L] \leftarrow \text {DWT}^{-1}(V_k.L, 0), \tag {E.a} \end {equation}
\begin {equation} Z^{-1}([V_k.L]) = [V_{k-1}.L], \tag {E.b} \end {equation} and by definition, \(Z^{-1}([V_{-1}.L]) = 0\),
\begin {equation} \overset {k\rightarrow k-1}{V} \leftarrow \text {M}([V_k.L], [V_{k-1}.L]), \tag {E.c} \end {equation} where M stands for Motion estimation, and by definition, \(\overset {0\rightarrow (-1)}{V}=0\),
\begin {equation} [\hat {V}_k.L] \leftarrow \text {P}(\overset {k\rightarrow k-1}{V}, [V_{k-1}.L]), \tag {E.d} \end {equation} where P stands for motion compensated Prediction,
\begin {equation} [E_k.L] \leftarrow [V_k.L] - [\hat {V}_k.L], \tag {E.e} \end {equation}
\begin {equation} \{[M_k],[S_k]\} \leftarrow \text {EW-min}([V_k.L], [E_k.L]) \tag {E.f} \end {equation} where \begin {equation} [M_k]_{i,j}=\text {min}([V_k.L]_{i,j}, [E_k.L]_{i,j}), \end {equation} and \([S_k]\) is a binary matrix defined by \begin {equation} [S_k]_{i,j} = \left \{ \begin {array}{lll} 0 & \text {if}~[V_k.L]_{i,j} < [E_k.L]_{i,j} & \text {(I-type coeff)} \\ 1 & \text {otherwise} & \text {(P-type coeff)}, \end {array} \right . \label {eq:matrix} \end {equation} (notice that \([M_k]\), which contains the element-wise minimum of both matrices, is discarded; a minimal sketch of this coefficient-type decision is given after the equations)
\begin {equation} [V_k.H] \leftarrow \text {DWT}^{-1}(0, V_k.H), \tag {b} \end {equation}
\begin {equation} [E_k.H] \leftarrow [V_k.H] - [\hat {V}_k.H], \tag {c} \end {equation} where, notice that \begin {equation} [E_k.H]_{i,j} = \left \{ \begin {array}{ll} {[}V_k.H{]}_{i,j} & \text {if}~{[}\hat {V}'_k.H{]}_{i,j} = 0~\text {(I-type coeff)} \\ {[}V_k.H{]}_{i,j} - [\hat {V}'_k.H]_{i,j} & \text {otherwise}~\text {(P-type coeff)}, \end {array} \right . \end {equation}
\begin {equation} [\tilde {E}_k.H] \leftarrow \text {Q}([E_k.H]), \tag {d} \end {equation}
\begin {equation} [\tilde {E}_k.H] \leftarrow \text {Q}^{-1}([\tilde {E}_k.H]), \tag {E.g} \end {equation}
\begin {equation} [\tilde {V}_k.H] \leftarrow [\tilde {E}_k.H] + [\hat {V}'_k.H], \tag {E.h} \end {equation} and notice that if \([\hat {V}'_k.H]=0\), then \([\tilde {V}_k.H] = [\tilde {E}_k.H]\),
\begin {equation} Z^{-1}([\tilde {V}_k.H]) = [\tilde {V}_{k-1}.H], \tag {E.i} \end {equation} and by definition, \(Z^{-1}([\tilde {V}_{-1}.H]) = 0\),
\begin {equation} [\hat {V}_k.H] \leftarrow \text {P}(\overset {k\rightarrow k-1}{V}, [\tilde {V}_{k-1}.H]), \tag {E.j} \end {equation}
\begin {equation} [\hat {V}'_k.H]_{i,j} \leftarrow \left \{ \begin {array}{ll} {[}\hat {V}_k.H{]}_{i,j} & \text {if}~{[}E_k.L{]}_{i,j} < {[}V_k.L{]}_{i,j} \text {(P-type coeff)} \\ 0 & \text {otherwise (I-type coeff)}, \end {array} \right . \tag {E.k} \end {equation}
\begin {equation} (0, \tilde {E}_k.H) \leftarrow \text {DWT}([\tilde {E}_k.H]), \tag {f} \end {equation}
\begin {equation} \{V_k.L, \tilde {E}_k.H\} \leftarrow \text {E}(V_k.L, \tilde {E}_k.H), \tag {g} \end {equation} where E represents the entropy coding of both data sources, in two different code-streams,
\begin {equation} (V_k.L, \tilde {E}_k.H) \leftarrow \text {E}^{-1}(\{V_k.L, \tilde {E}_k.H\}), \tag {h} \end {equation}
\begin {equation} [\tilde {E}_k.H] \leftarrow \text {DWT}^{-1}(0, \tilde {E}_k.H), \tag {i} \end {equation}
\begin {equation} (0, \tilde {V}_k.H) \leftarrow \text {DWT}([\tilde {V}_k.H]), \tag {j} \end {equation}
and
\begin {equation} \tilde {V}_k \leftarrow \text {DWT}^{-1}(V_k.L, \tilde {V}_k.H). \tag {k} \end {equation}
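As announced above, the coefficient-type decision of Eqs. (E.f) and (E.k) can be sketched as follows (a minimal illustration in NumPy; ties are resolved as P-type, following Eq. (E.f)):
\begin{verbatim}
import numpy as np

def ew_min(V_L, E_L):
    # Eq. (E.f): S[i,j] = 0 marks an I-type coefficient (V_L smaller),
    # 1 a P-type one. M, the element-wise minimum, is discarded.
    S = (V_L >= E_L).astype(np.uint8)
    M = np.minimum(V_L, E_L)
    return M, S

def gate_prediction(V_hat_H, S):
    # Eq. (E.k): keep the motion compensated prediction only for the
    # P-type coefficients; zero it for the I-type ones.
    return np.where(S == 1, V_hat_H, 0)
\end{verbatim}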
The IPP... codec is inspired by Differential Pulse Code Modulation (DPCM). This notebook shows how to implement a simple DPCM codec.
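A minimal 1-D DPCM sketch, in which the only loss is the quantization of the prediction error and the encoder embeds the decoder (exactly as the IPP... codec does), could look like this:
\begin{verbatim}
import numpy as np

def dpcm_encode(samples, step=4):
    codes, rec = [], 0.0
    for x in samples:
        e = x - rec                      # prediction error
        code = int(np.round(e / step))   # quantization (the only loss)
        codes.append(code)
        rec += code * step               # decoder replica (avoids drift)
    return codes

def dpcm_decode(codes, step=4):
    rec, out = 0.0, []
    for code in codes:
        rec += code * step
        out.append(rec)
    return out
\end{verbatim}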
As can be seen in the previous section, in each IPP... iteration of the step encoder, only the high-frequency information of the sequence of frames is decorrelated (the \(H\) subbands), considering the information provided by the low frequencies (the \(L\) subband), which are losslessly transmitted between the encoder and the decoder. Notice also that if the \(L\) data cannot be transmitted to the decoder, a drift error will occur, because the matrix \(S_k\) will be different in the encoder and the decoder for the same frame \(k\).
Obviously (and unfortunately), the lossless transmission of the \(L\)’s bounds the compression ratio that we can get. One solution is to perform more than one level at the DWT stage (see Eq. (a)) and to apply the IPP... MRVC step encoder by spatial resolutions, starting at the lowest, as the decoder will do at decompression time. If we represent the Spatial Resolution Level (SRL) with a superscript, 0 being the original SRL, we can express the operation of the codec described in Fig. 11 by \begin {equation} \left \{ \begin {array}{l} \text {SE}(V^0_k) = \{V^0_k.L, \tilde {E}^0_k.H\} = \{V^1_k, \tilde {E}^0_k.H\} \\ \text {SD}(\{V^1_k, \tilde {E}^0_k.H\}) = \tilde {V}^0_k, \end {array} \right . \label {eq:codec_1l} \end {equation} where \(\text {SE}(\cdot )\) denotes the operation of the IPP... step encoder and \(\text {SD}(\cdot )\) that of the IPP... step decoder. As can be seen, Eq. \eqref {eq:codec_1l} is only valid when exactly one level of the DWT has been applied.
In general, for \(s\) levels of the DWT, we have that \begin {equation} \left \{ \begin {array}{l} \text {SE}(V^{s-1}_k) = \{V^s_k, \tilde {E}^{s-1}_k.H\} \\ \text {SD}(\{V^s_k, \tilde {E}^{s-1}_k.H\}) = \tilde {V}^{s-1}_k, \end {array} \right . \label {eq:codec_sl} \end {equation} where \(\tilde {V}^{s-1}\) is the \((s-1)\)-th SRL of the reconstructed sequence \(\tilde {V}\).
The next SRL (\(s-2\)), \(\tilde {V}^{s-2}\), is determined by \begin {equation} \left \{ \begin {array}{l} \text {SE}(\tilde {V}^{s-2}_k) = \{\tilde {V}^{s-1}_k, \tilde {E}^{s-2}_k.H\} \\ \text {SD}(\{\tilde {V}^{s-1}_k, \tilde {E}^{s-2}_k.H\}) = \tilde {V}^{s-2}_k, \end {array} \right . \label {eq:codec_s1l} \end {equation} and finally, for the highest SRL, we get \(\tilde {V}^0\) defined by Eq. \eqref {eq:codec_1l}.
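One possible arrangement of this recursion, assuming that \texttt{SE} and \texttt{SD} implement the step codec of Fig. 11, is sketched below; the encoder produces one S-layer per SRL, and the decoder consumes them starting at the lowest resolution:
\begin{verbatim}
def encode_multires(V0, SE, s):
    # Apply the IPP... step encoder s times, from SRL 0 up to SRL s.
    layers, V = [], V0
    for _ in range(s):
        V, E_H = SE(V)          # SE(V^l) = {V^(l+1), E~^l.H}
        layers.append(E_H)
    return V, layers            # {V^s, E~^(s-1).H, ..., E~^0.H}

def decode_multires(V_s, layers, SD):
    # Invert the pyramid, starting at the lowest spatial resolution.
    V = V_s
    for E_H in reversed(layers):
        V = SD((V, E_H))        # SD({V^(l+1), E~^l.H}) = V~^l
    return V
\end{verbatim}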
So, only the lowest SRL of \(\tilde {V}\), \(\tilde {V}^s\), is a pure III... sequence of small frames, losslessly encoded. This multiresolution (multistage) scheme is described in Fig. 12. The outputs of the IPP... encoder will be referred to as “spatial-layers”, or simply “S-layers”.
Fig. 13 shows the decoder. It inputs the collection of subbands generated by the encoder and, from each one, a video with a different spatial resolution is obtained.
The logical order of the S-layers in the code-stream is the one that allows, when the code-stream is decoded sequentially, the progressive increase of the spatial resolution of the video. For example, if \(s\) is the number of levels of the DWT, the generated code-stream has \((s+1)\) S-layers \begin {equation*} \{V^s,\tilde {E}^{s-1}.H,\tilde {E}^{s-2}.H,\cdots ,\tilde {E}^0.H\}, \end {equation*} which are able to generate \((s+1)\) progressive reconstructions \begin {equation*} \{V^s,\tilde {V}^{s-1},\tilde {V}^{s-2},\cdots ,\tilde {V}^0\}. \end {equation*}
Moreover, Quality (Q-progression) scalability in each SRL can also be achieved in the high-frequency textures if a quality-scalable image codec such as JPEG2000 [2] replaces the PNG compressor, generating a number \(q\) of quality-layers (“Q-layers”) for each motion compensated high-frequency subband. An SQ-progression is defined considering both forms of scalability (spatial and quality), with a higher number of layers. For example, if \(s=3\) (2 IPP...-type iterations) and \(q=2\), the progression of layers would be \begin {equation*} \{V^s[1],V^s[0],\tilde {E}^{s-1}.H[1],\tilde {E}^{s-1}.H[0],\tilde {E}^{s-2}.H[1],\tilde {E}^{s-2}.H[0],\cdots ,\tilde {E}^0.H[1],\tilde {E}^0.H[0]\}. \end {equation*}
The use of quality scalability boosts the possibilities in real-time streaming scenarios where the transmission bit-rate can be variable (sending more or fewer Q-layers of a given spatial resolution, depending on the bit-rate). Notice that the SQ-progression is free of drift error.2
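The SQ ordering is easy to enumerate; this small helper (names are illustrative) reproduces the progression of the example above:
\begin{verbatim}
def sq_progression(s, q):
    # S-layers from the lowest SRL, each split into q Q-layers
    # (most significant quality layer first).
    s_layers = [f"V^{s}"] + [f"E~^{l}.H" for l in range(s - 1, -1, -1)]
    return [f"{sl}[{j}]" for sl in s_layers for j in range(q - 1, -1, -1)]

# sq_progression(3, 2) = ['V^3[1]', 'V^3[0]', 'E~^2.H[1]', 'E~^2.H[0]',
#                         'E~^1.H[1]', 'E~^1.H[0]', 'E~^0.H[1]', 'E~^0.H[0]']
\end{verbatim}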
Compute the \(l\)-level DWT, with \(l=\lfloor \log _2(R)\rfloor -1\), of the predicted frame \(s_j\) and the two reference frames \(s_i\) and \(s_k\).
\(LL^l(m)\leftarrow 0\), or any other precomputed values (for example, from a previous ME in neighbor frames).
Divide the subband \(LL^l(s_j)\) into blocks of size \(B\times B\) pixels, and \(\pm 1\)-spiral-search them in the subbands \(LL^l(s_i)\) and \(LL^l(s_k)\), calculating a low-resolution \(LL^l(m)=\{LL^l(\overleftarrow {m}), LL^l(\overrightarrow {m})\}\) bi-directional motion vector field (a minimal sketch of this \(\pm 1\) search is given after these steps).
While \(l>0\):
Synthesize \(LL^{l-1}(m)\), \(LL^{l-1}(s_j)\), \(LL^{l-1}(s_i)\) and \(LL^{l-1}(s_k)\), by computing the 1-level DWT\(^{-1}\).
\(LL^{l-1}(m)\leftarrow LL^{l-1}(m)\times 2\) (the motion vectors are scaled to match the doubled resolution).
Refine \(LL^{l-1}(m)\) using \(\pm 1\)-spiral-search.
While \(l<A\) (in the first iteration, \(l=0\), and \(LL^0(m):=m\)):
Synthesize \(LL^{-l}(s_j)\), \(LL^{-l}(s_i)\) and \(LL^{-l}(s_k)\), computing the 1-level DWT\(^{-1}\) (high-frequency subbands are \(0\)). This performs a zoom-in in these frames using \(1/2\)-subpixel accuracy.
\(m\leftarrow m\times 2\).
Divide the subband \(LL^{-l}(s_j)\) into blocks of \(B\times B\) pixels and \(\pm 1\)-spiral-search them into the subbands \(LL^{-l}(s_i)\) and \(LL^{-l}(s_k)\), calculating a \(1/2^l\) sub-pixel accuracy \(m\) bi-directional motion vector field.
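Before the final prediction is computed, the \(\pm 1\) refinement used in the steps above can be sketched as follows (for brevity, the \(3\times 3\) neighbourhood is scanned in raster order instead of in a spiral; both orders find the same minimum):
\begin{verbatim}
import numpy as np

def pm1_search(block, target, y, x, vy, vx):
    # Refine one motion vector: test (vy, vx) and its 8 neighbours,
    # keep the candidate displacement with the minimum SAD.
    h, w = target.shape
    B = block.shape[0]
    best_sad, best_v = np.inf, (vy, vx)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            cy, cx = y + vy + dy, x + vx + dx
            if 0 <= cy <= h - B and 0 <= cx <= w - B:
                sad = np.abs(block - target[cy:cy + B, cx:cx + B]).sum()
                if sad < best_sad:
                    best_sad, best_v = sad, (vy + dy, vx + dx)
    return best_v, best_sad
\end{verbatim}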
Compute \begin {equation} \hat {b}\leftarrow \frac {b_i\big (\overleftarrow {e}_{\text {max}}-\overleftarrow {e}(b)\big ) + b_k\big (\overrightarrow {e}_{\text {max}}-\overrightarrow {e}(b)\big )}{\big (\overleftarrow {e}_{\text {max}}-\overleftarrow {e}(b)\big ) + \big (\overrightarrow {e}_{\text {max}}-\overrightarrow {e}(b)\big )}, \end {equation}
where \(\overleftarrow {e}(b)\) is the (minimum) distortion of the best backward matching for block \(b\), \(\overrightarrow {e}(b)\) the (minimum) distortion of the best forward matching for block \(b\), \(\overleftarrow {e}_{\text {max}}=\overrightarrow {e}_{\text {max}}\) are the backward and forward maximum matching distortions, \(b_i\) is the (backward) block found (as the most similar to \(b\)) in frame \(s_i\), and \(b_k\) is the (forward) block found in frame \(s_k\). Notice that, if \(\overleftarrow {e}(b)=\overrightarrow {e}(b)\), then the prediction is \begin {equation} \hat {b} = \frac {b_i + b_k}{2}, \end {equation} and if \(\overrightarrow {e}(b)=\overrightarrow {e}_{\text {max}}\) (the forward matching is as bad as possible), \begin {equation} \hat {b} = b_i, \end {equation} and vice versa.
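The weighting can be implemented directly from the equation; in this sketch, \texttt{e\_b} and \texttt{e\_f} stand for the backward and forward matching distortions of the block:
\begin{verbatim}
import numpy as np

def bidirectional_prediction(b_i, b_k, e_b, e_f, e_max):
    # Weighted average: the reference with the smaller matching
    # distortion receives the larger weight.
    w_b = e_max - e_b                 # backward weight
    w_f = e_max - e_f                 # forward weight
    if w_b + w_f == 0:                # both matchings maximally bad
        return (b_i + b_k) / 2
    return (b_i * w_b + b_k * w_f) / (w_b + w_f)
\end{verbatim}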
Using the encoding system described in the Figure 15, and defined by \begin {equation} \left \{\ \begin {array}{l} \tilde {\mathbf R} = \text {Q}_{\mathbf R}({\mathbf R}) \\ \tilde {\mathbf E} = \text {Q}_{\mathbf E}\big ({\mathbf P}-\overset {{\mathbf R}\rightarrow {\mathbf P}}{\mathbf M}(\tilde {\mathbf R})\big ) \end {array} \right . \label {eq:forward} \end {equation} and \begin {equation} \begin {array}{l} \tilde {\mathbf P} = \tilde {\mathbf E} + \overset {{\mathbf R}\rightarrow {\mathbf P}}{\mathbf M}(\tilde {\mathbf R}), \end {array} \label {eq:backward} \end {equation}
find \(\text {Q}_{\mathbf {R}}\) and \(\text {Q}_{\mathbf {E}}\) that minimize, in the RD domain, the RD curve of \begin {equation} \text {MSE}(\{\mathbf {R},\mathbf {P}\},\{\tilde {\mathbf {R}},\tilde {\mathbf {P}}\}) = \frac {\text {MSE}({\mathbf R},\tilde {\mathbf R}) + \text {MSE}({\mathbf P},\tilde {\mathbf P})}{2}, \end {equation} subject to \begin {equation} \text {MSE}({\mathbf R},\tilde {\mathbf R}) = \text {MSE}({\mathbf P},\tilde {\mathbf P}). \label {eq:constant_quality} \end {equation} Equation \eqref {eq:constant_quality} indicates that all the decoded frames should have the same distortion (from a human perception point of view). Notice that the transform defined by Equations \eqref {eq:forward} and \eqref {eq:backward} is not orthogonal and, therefore, the “subbands” \(\tilde {\mathbf R}\) and \(\tilde {\mathbf P}\) are not independent. It can be seen that \(\text {Q}_{\mathbf R}\) affects the selection of \(\text {Q}_{\mathbf E}\), because \(\tilde {\mathbf R}\) is used as reference for finding \(\mathbf E\).
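A brute-force sketch of this optimization follows, ignoring the rate term for brevity and modeling the hypothetical \(\text {Q}_{\mathbf R}\) and \(\text {Q}_{\mathbf E}\) with uniform quantization steps (\texttt{mc} is an assumed motion compensation function):
\begin{verbatim}
import itertools
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def search_quantizers(R, P, mc, steps, tol=0.05):
    # Try all (step_R, step_E) pairs; among those whose reconstructions
    # have (almost) equal MSE, keep the one with minimum distortion.
    best_d, best_pair = np.inf, None
    for sR, sE in itertools.product(steps, repeat=2):
        R_t = np.round(R / sR) * sR               # R~ = Q_R(R)
        E_t = np.round((P - mc(R_t)) / sE) * sE   # E~ = Q_E(P - MC(R~))
        P_t = E_t + mc(R_t)                       # P~
        d_R, d_P = mse(R, R_t), mse(P, P_t)
        if abs(d_R - d_P) <= tol * max(d_R, d_P, 1e-12):
            d = (d_R + d_P) / 2
            if d < best_d:
                best_d, best_pair = d, (sR, sE)
    return best_pair, best_d
\end{verbatim}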
[1] D. Le Gall. MPEG: A Video Compression Standard for Multimedia Applications. Communications of the ACM, 34(4):46–58, 1991.
[2] D.S. Taubman and M.W. Marcellin. JPEG2000: Image Compression Fundamentals, Standards and Practice. Kluwer Academic Publishers, 2002.
1Each P-type frame uses the previous frame as reference, except the second one, which always has an I-type frame as reference.
2Other progressions, such as the QS-progression, would generate drift because, at some truncation points of the code-stream, the decoder would use low-frequency information with a different quality than the encoder used.