March 6, 2020

1 Sources of redundancy

2 Hybrid video coding

3 Block-based MC (Motion Compensation) [1]

4 Sub-pixel accuracy

5 Matching criteria (similitude between macroblocks)

6 Searching strategies

7 The GOP (Group Of Pictures) concept

8 MCTF (Motion Compensated Temporal Filtering)

9 $\pm 1$-spiral-search ME (Motion Estimation)

10 Linear frame interpolation using block-based motion compensation

11 MC/DWT hybrid coding alternatives

12 Deblocking filtering

13 Bit-rate allocation

14 Video scalability

14.1 Quality scalability

14.2 Temporal scalability

14.3 Spatial scalability

References

See Fig. 1.

- Used in all the video compression standards.
- Intracoded images are transformed ($T$) and quantized ($Q$).
- Intercoded images are also motion compensated ($P$).
- The encoder incorporates a decoder to avoid the drift error.
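The closed-loop idea (the encoder embedding a decoder so that both predict from the same reconstructed data) can be illustrated with a toy 1-D DPCM codec; the function names and the scalar quantizer are illustrative, not part of any standard:

```python
import numpy as np

def encode(samples, step):
    # Toy closed-loop DPCM: the encoder predicts each sample from the
    # *reconstructed* (quantized) previous sample, exactly as a decoder
    # would, so quantization errors do not accumulate (no drift).
    reconstructed = 0.0
    code = []
    for s in samples:
        residue = s - reconstructed          # P: prediction residue
        q = int(np.round(residue / step))    # Q: quantization
        code.append(q)
        reconstructed += q * step            # embedded decoder
    return code

def decode(code, step):
    reconstructed, out = 0.0, []
    for q in code:
        reconstructed += q * step
        out.append(reconstructed)
    return out
```

Note that every reconstructed sample stays within one quantization step of the original, no matter how long the sequence is; an open-loop encoder (predicting from the original samples) would let the error grow.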

- Usually, only performed by the encoder (compress once, decompress many times).
- MC removes temporal redundancy. A predicted image can be encoded as the difference between it and a prediction image, which is a motion-compensated projection of one or more images named reference images. ME tries to generate residue images that are as close as possible to null images.
- For example, in the MPEG-1 standard, the reference image(s) is/are divided into blocks of $16\times 16$ pixels called macroblocks.
- Each reference block is searched for in the predicted image and the best match is indicated by means of a motion vector.
- Depending on the success of the search and the number of reference images, the macroblocks are classified into:
- I (intra): When the compression of the residue block generates more bits than that of the original (predicted) one.
- P (predicted): When it is better to compress the residue block and there is only one reference macroblock.
- B (bidirectionally predicted): The same, but when there are two reference macroblocks.
- S (skipped): When the energy of the residue block is smaller than a given threshold.

- I-pictures are composed of I macroblocks only.
- P-pictures do not have B macroblocks.
- B-pictures can have any type of macroblock.
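The classification above can be sketched as a toy decision function. All inputs (bit counts, residue energy, threshold) are assumed to be measured by the encoder beforehand; real encoders make this decision by rate-distortion optimization, not this simple cascade:

```python
def macroblock_type(intra_bits, residue_bits, residue_energy,
                    n_references, skip_threshold):
    # Toy classifier mirroring the I/P/B/S rules above.
    if residue_energy < skip_threshold:
        return 'S'                     # skipped: residue below threshold
    if residue_bits >= intra_bits:
        return 'I'                     # intra: residue costs more bits
    return 'B' if n_references == 2 else 'P'
```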

See Fig. 2.

- The motion estimation can be carried out using integer pixel accuracy or fractional (sub-)pixel accuracy.
- For example, in MPEG-1, the motion estimation can have up to 1/2-pixel accuracy. A bi-linear interpolator is used (see Fig. 3).
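A minimal sketch of bilinear interpolation for 1/2-pixel accuracy, assuming a grayscale block stored as a NumPy array (the function name is illustrative):

```python
import numpy as np

def half_pixel_interpolate(block):
    # Upsample a 2-D block by 2 in each dimension: original samples are
    # kept on the even grid positions, and the half-pixel positions are
    # the averages of their two (or four) integer-pixel neighbours.
    h, w = block.shape
    up = np.zeros((2 * h - 1, 2 * w - 1))
    up[::2, ::2] = block                                  # integer pixels
    up[1::2, ::2] = (block[:-1] + block[1:]) / 2          # vertical halves
    up[:, 1::2] = (up[:, ::2][:, :-1] + up[:, ::2][:, 1:]) / 2  # horizontal
    return up
```

For example, interpolating the block `[[0, 2], [4, 6]]` yields `[[0, 1, 2], [2, 3, 4], [4, 5, 6]]`.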

- Let $a$ and $b$ be the macroblocks we want to compare. Two main distortion metrics are commonly used:
- MSE (Mean Square Error):
$$\frac{1}{16\times 16}\sum _{i=1}^{16}\sum _{j=1}^{16}\left({a}_{ij}-{b}_{ij}\right)^2$$ (1)
- MAE (Mean Absolute Error):
$$\frac{1}{16\times 16}\sum _{i=1}^{16}\sum _{j=1}^{16}\left|{a}_{ij}-{b}_{ij}\right|$$ (2)
- These similitude measures are used only by the compressor (the MPEG standards do not specify them). Therefore, any other metric with similar effects (such as the error variance or the error entropy) could also be used.
- Other less common distortion metrics that can work are:
- EE (Error Entropy):
$$-\frac{1}{16\times 16}\sum _{i=1}^{16}\sum _{j=1}^{16}{\log}_{2}\,p\left({a}_{ij}-{b}_{ij}\right)$$ (3)

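The three metrics can be sketched for NumPy arrays as follows; the EE implementation estimates $p$ with the empirical histogram of the residue, so it equals the per-pixel average of $-\log_2 p$:

```python
import numpy as np

def MSE(a, b):
    # Mean square error between two blocks.
    return np.mean((a - b) ** 2)

def MAE(a, b):
    # Mean absolute error between two blocks.
    return np.mean(np.abs(a - b))

def EE(a, b):
    # Error entropy: average number of bits/pixel needed to encode the
    # residue, estimated from the residue's empirical histogram.
    e = (a - b).ravel()
    _, counts = np.unique(e, return_counts=True)
    p = counts / e.size
    return -np.sum(p * np.log2(p))
```

A constant residue gives EE = 0: a perfectly predictable error costs no bits, even when MSE and MAE are large.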

- Only performed by the compressor.
- Full search: All the possibilities are checked (see Fig. 4). Advantage: the best compression. Disadvantage: a CPU killer.

- **Logarithmic search**: It is a version of the full search algorithm where the macroblocks and the search area are sub-sampled. After finding the best match, the resolution is increased by a power of 2 and the previous match is refined in a search area of $\pm 1$, until the maximal resolution (even using subpixel accuracy) is reached.

- Telescopic search: Any of the previously described techniques can be sped up if the search area is reduced. This can be done by assuming that the motion vector of the same macroblock in two consecutive images is similar.
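A sketch of the full search strategy, assuming the $(B+2R)\times(B+2R)$ search window centered on the block's position has already been extracted from the reference frame (the function name and window layout are assumptions):

```python
import numpy as np

def full_search(block, window, R):
    # Exhaustively test every displacement in the +-R search area and
    # return the motion vector with the smallest mean absolute error.
    B = block.shape[0]
    best_mae, best_mv = np.inf, (0, 0)
    for dy in range(-R, R + 1):
        for dx in range(-R, R + 1):
            candidate = window[R + dy:R + dy + B, R + dx:R + dx + B]
            mae = np.mean(np.abs(block - candidate))
            if mae < best_mae:
                best_mae, best_mv = mae, (dy, dx)
    return best_mv, best_mae
```

The double loop makes the cost $O((2R+1)^2 B^2)$ per block, which is why the faster (logarithmic, telescopic) strategies above exist.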

- The temporal redundancy is exploited within blocks of images called GOPs. A GOP can be decoded independently of the rest of the GOPs (see Fig. 5).

- This is a DWT where the input samples are the original video images and the output is a sequence of residue images.

- $R$: square search area, in pixels.
- $B$: square block size, in pixels.
- $O$: border size, in pixels.
- $s_i$, $s_j$ and $s_k$: three chronologically ordered, equidistant frames with resolution $X\times Y$.
- $A$: $\frac{1}{{2}^{A}}$ subpixel accuracy.

- $\hat{s}_j$: a prediction for frame $s_j$.
- $m$: a matrix of $\lceil X/B\rceil \times \lceil Y/B\rceil$ bidirectional motion vectors.
- $e$: a matrix of $\lceil X/B\rceil \times \lceil Y/B\rceil$ bidirectional Root Mean Square matching Errors (RMSE).

- Compute the DWT${}^{l}$, where $l=\lfloor \log_2(R)\rfloor -1$ levels, of the predicted frame $s_j$ and the two reference frames $s_i$ and $s_k$.
- $LL^l(m)\leftarrow 0$, or any other precomputed values (for example, from a previous ME in neighboring frames).
- Divide the subband $LL^l(s_j)$ into blocks of size $B\times B$ pixels, and $\pm 1$-spiral-search them in the subbands $LL^l(s_i)$ and $LL^l(s_k)$, calculating a low-resolution $LL^l(m)=\{LL^l(\overleftarrow{m}), LL^l(\overrightarrow{m})\}$ bidirectional motion vector field.
- While $l>0$:
- Synthesize $LL^{l-1}(m)$, $LL^{l-1}(s_j)$, $LL^{l-1}(s_i)$ and $LL^{l-1}(s_k)$ by computing the 1-level DWT${}^{-1}$.
- $LL^{l-1}(m)\leftarrow LL^{l-1}(m)\times 2$.
- Refine $LL^{l-1}(m)$ using $\pm 1$-spiral-search.
- $l\leftarrow l-1$. (When $l=0$, the motion vector field $m$ has its full-resolution structure.)

- While $l<A$ (in the first iteration, $l=0$, and $LL^0(m):=m$):
- $l\leftarrow l+1$.
- Synthesize $LL^{-l}(s_j)$, $LL^{-l}(s_i)$ and $LL^{-l}(s_k)$, computing the 1-level DWT${}^{-1}$ (the high-frequency subbands are $0$). This performs a zoom-in of these frames with $1/2$-subpixel accuracy.
- $m\leftarrow m\times 2$.
- $B\leftarrow B\times 2$.
- Divide the subband $LL^{-l}(s_j)$ into blocks of $B\times B$ pixels and $\pm 1$-spiral-search them in the subbands $LL^{-l}(s_i)$ and $LL^{-l}(s_k)$, calculating a $1/2^l$ subpixel accuracy bidirectional motion vector field $m$.
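One $\pm 1$-spiral-search refinement step (as used in both loops above) can be sketched as follows; the function name and the search-window layout are assumptions, and only the 8 neighbours of the current vector, plus the vector itself, are tested:

```python
import numpy as np

def refine_pm1(block, window, mv):
    # One +-1 refinement step: test the current motion vector `mv` and
    # its 8 neighbours inside `window` (the search area centred on the
    # block's position) and keep the match with the smallest MAE.
    B = block.shape[0]
    R = (window.shape[0] - B) // 2
    best_mae, best_mv = np.inf, mv
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            y, x = R + mv[0] + dy, R + mv[1] + dx
            if 0 <= y <= window.shape[0] - B and 0 <= x <= window.shape[1] - B:
                mae = np.mean(np.abs(block - window[y:y + B, x:x + B]))
                if mae < best_mae:
                    best_mae, best_mv = mae, (mv[0] + dy, mv[1] + dx)
    return best_mv, best_mae
```

Because each resolution level only needs a $\pm 1$ correction of the up-scaled vectors, the per-block cost is 9 comparisons per level instead of a full $(2R+1)^2$ search.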
- Frame prediction. For each block $b$:
- Compute
$$\widehat{b}=\frac{\overleftarrow{e}(b)\,{b}_{i}+\overrightarrow{e}(b)\,{b}_{k}}{\overleftarrow{e}(b)+\overrightarrow{e}(b)},$$ (4)
where $\overleftarrow{e}(b)$ is the (minimum) distortion of the best backward matching for block $b$, $\overrightarrow{e}(b)$ the (minimum) distortion of the best forward matching for block $b$, $\overleftarrow{e}_{\text{max}}=\overrightarrow{e}_{\text{max}}$ are the backward and forward maximum matching distortions, ${b}_{i}$ is the (backward) block found (as the most similar to $b$) in frame ${s}_{i}$ and ${b}_{k}$ is the (forward) block found in frame ${s}_{k}$. Notice that, if $\overleftarrow{e}(b)=\overrightarrow{e}(b)$, then the prediction is

$$\widehat{b}=\frac{{b}_{i}+{b}_{k}}{2},$$ (5)

and if $\overleftarrow{e}(b)=0$,

$$\widehat{b}={b}_{k},$$ (6)

and vice versa.
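Per block, this prediction rule can be sketched as below. The weighting is a reconstruction chosen to reproduce the stated special cases (equal distortions average the two blocks, and a zero backward distortion selects $b_k$, and vice versa); the function name is illustrative:

```python
import numpy as np

def predict_block(b_i, b_k, e_back, e_forw):
    # Bidirectional block prediction: weight each reference block so
    # that equal matching errors give the plain average, and a zero
    # backward error yields b_k (and a zero forward error yields b_i).
    if e_back == e_forw:                  # includes the e_back = e_forw = 0 case
        return (b_i + b_k) / 2
    return (e_back * b_i + e_forw * b_k) / (e_back + e_forw)
```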

Implement Section 10 (work on https://github.com/Sistemas-Multimedia/MCDWT/blob/master/transform/mc/block/interpolate.py). Use https://github.com/Sistemas-Multimedia/MCDWT/blob/master/mcdwt/mc/block/interpolate.py and https://github.com/vicente-gonzalez-ruiz/MCTF-video-coding/blob/master/src/motion_estimate.cpp as references.

Compare the performance of the proposed matching strategies (MSE, MAE and EE) in Section 10 by computing the variance of the prediction error between the original frame ($s_j$) and the prediction frame ($\hat{s}_j$).

Test different DWT filters in Section 10 and compare their performance, computing the prediction error between the original frame ($s_j$) and the prediction frame ($\hat{s}_j$). Measure the dependency between this performance and the distance between frames (the $i$, $j$, and $k$ indexes).

Test the use of both the luma and the chroma in Section 10, and measure the performance of each option (only luma vs. all components) by computing the prediction error between the original frame ($s_j$) and the prediction frame ($\hat{s}_j$). Measure the dependency of the results on the distance between frames (the $i$, $j$, and $k$ indexes).

Analyze the impact of the $R$ (search range) parameter in Section 10. Compute the prediction error between the original frame ($s_j$) and the prediction frame ($\hat{s}_j$). Study the impact of initializing the motion vectors (Section ??). Measure the dependency with the distance between frames (the $i$, $j$, and $k$ indexes).

Analyze the impact of the $O$ (overlapping) parameter in Section ??, by computing the prediction error between the original frame ($s_j$) and the prediction frame ($\hat{s}_j$). Measure the dependency with the distance between frames (the $i$, $j$, and $k$ indexes).

Analyze the impact of the $B$ (block size) parameter in Section ??, by computing the prediction error between the original frame ($s_j$) and the prediction frame ($\hat{s}_j$). Compute the expected size of the motion fields using their 0-order entropy. Measure the dependency with the distance between frames (the $i$, $j$, and $k$ indexes).

Analyze the impact of the $A$ (subpixel accuracy) parameter in Section ??, by computing the prediction error between the original frame ($s_j$) and the prediction frame ($\hat{s}_j$). Compute the expected size of the motion fields using their entropy. Measure the dependency with the distance between frames (the $i$, $j$, and $k$ indexes).

Compare the performance of Section ?? when it holds that

$$\widehat{b}=\frac{{b}_{i}+{b}_{k}}{2},$$ (7)

for all blocks.

- t+2d: The sequence of images is decorrelated first along time (t) and the residue images are compressed, exploiting the remaining spatial (2d) redundancy. Examples: the MPEG* and H.26* codecs (except H.264/SVC).
- 2d+t: The spatial (2d) redundancy is exploited first (typically using the DWT) and, after that, the coefficients are decorrelated along time (t). For now, this has only been an experimental setup because most DWT transformed domains are not invariant to displacement and, therefore, ME/MC cannot be directly applied.
- 2d+t+2d: The first step creates a Laplacian Pyramid (2d), which is invariant to displacement. Next, each level of the pyramid is decorrelated along time (t) and, finally, the remaining spatial redundancy is removed (2d). Fig. 8 shows an example for H.264/SVC.

- If no other block-overlapping technique has been applied, block-based video encoders improve their performance if a deblocking filter is used to create the quantized predictions (see Fig. 9).

- The low-pass filter is applied only on the block boundaries.
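A toy deblocking filter along these lines, assuming a [1 2 1]/4 low-pass kernel applied only to the two pixel rows or columns that straddle each $B\times B$ block boundary (this is an illustration, not any standard's in-loop filter):

```python
import numpy as np

def deblock(frame, B):
    # Smooth the frame with a [1 2 1]/4 kernel, but keep only the
    # smoothed values at the rows/columns adjacent to block boundaries;
    # everything else is left untouched.
    f = frame.astype(np.float64)
    out = f.copy()
    for axis in (0, 1):
        lp = (np.roll(f, 1, axis=axis) + 2 * f + np.roll(f, -1, axis=axis)) / 4
        for edge in range(B, f.shape[axis], B):
            sl = [slice(None)] * 2
            sl[axis] = slice(edge - 1, edge + 1)  # the two boundary lines
            out[tuple(sl)] = lp[tuple(sl)]
    return out
```

On a synthetic frame with a sharp vertical edge at a block boundary, the step 0, 0, 4, 4 across the boundary becomes 0, 1, 3, 4: the discontinuity is softened without blurring the block interiors.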

- VBR: Under a constant quantization level (constant video quality), the number of bits that each compressed image needs depends on the image content (Variable Bit-Rate). Fig. 10 shows an example.

- CBR: Using a Constant Bit-Rate strategy, all frames need the same space. Fig. ?? shows an example.

See Fig. 12.

- Ideal for remote visualization environments.
- By deﬁnition, ${V}^{\left[0\right]}:=V$.

See Fig. 13.

- It holds that
$${V}^{t}=\left\{{V}_{{2}^{t}i}\right\}=\left\{{V}_{2i}^{t-1}\right\},$$ (8) where $t$ denotes the Temporal Resolution Level (TRL).

- Notice that $V:={V}^{0}$.
- Useful for fast random access.
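Equation (8) amounts to plain frame subsampling: each extra TRL keeps every second frame of the previous level. For a video stored as a Python list of frames (the function name is illustrative):

```python
def temporal_subsequence(V, t):
    # V^t = {V_{2^t i}}: keep the frames whose index is a multiple of 2^t.
    return V[::2 ** t]
```

Note the recursion of Eq. (8): subsampling level $t-1$ by 2 gives level $t$, which is what makes dyadic fast random access possible.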

See Fig. 14.

- Useful for low-resolution devices.
- By definition, ${V}_{i}:={V}_{i}^{<0>}$, and ${V}_{i}^{<s>}$ has a $\frac{X}{{2}^{s}}\times \frac{Y}{{2}^{s}}$ resolution, where $X\times Y$ is the resolution of ${V}_{i}$.

[1] Kamisetty Ramamohan Rao and Jae Jeong Hwang. Techniques and standards for image, video, and audio coding, volume 70. Prentice Hall New Jersey, 1996.