Symbol encoding

Vicente González Ruiz

December 31, 2019

Contents

1 How does it work?
2 Bits, data and information
3 Entropy
4 Compression basics
4.1 Encoding of a symbol
4.2 Decoding of a symbol
4.2.1 Tip
4.3 Example
5 Probabilistic models
6 Shannon-Fano coding
7 Huffman coding
8 Arithmetic coding
9 Move-to-front transform
10 Context-based text predictive transform
11 Unary coding
12 Golomb-Rice coding
13 gzip

1 How does it work?

We can compress a sequence of symbols if each one is translated into a code-word and, on average, the code-words are shorter than the symbols they replace.
The encoder and the decoder share a probabilistic model $M$ which provides the variable-length encoder ($C$) and decoder ($C^{-1}$) with the probability $p(s)$ of each symbol $s$.
The most probable symbols are represented by the shortest code-words, and vice versa.
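
As an illustration of the idea above, here is a minimal Python sketch. The four-symbol alphabet, the probabilities in `MODEL` and the prefix code in `CODE` are hypothetical values chosen for this example (they are not taken from the text); the point is only that giving shorter code-words to the more probable symbols lowers the average code-word length below the 2 bits/symbol that a fixed-length code would need.

```python
# Hypothetical model M: probability p(s) of each symbol (assumed values).
MODEL = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Hypothetical prefix code: the more probable the symbol, the shorter its code-word.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}


def encode(symbols):
    """Concatenate the code-words of the symbols."""
    return "".join(CODE[s] for s in symbols)


def decode(bits):
    """Split the code-stream into code-words (no code-word is a prefix of another)."""
    inverse = {word: symbol for symbol, word in CODE.items()}
    decoded, word = [], ""
    for bit in bits:
        word += bit
        if word in inverse:
            decoded.append(inverse[word])
            word = ""
    if word:
        raise ValueError("truncated code-stream")
    return "".join(decoded)


# Expected code-word length under the model, in bits/symbol.
average_length = sum(MODEL[s] * len(CODE[s]) for s in MODEL)

message = "abacad"
code_stream = encode(message)
print(code_stream, len(code_stream), "bits instead of", 2 * len(message))
print("average length:", average_length, "bits/symbol")
assert decode(code_stream) == message
```

With these assumed probabilities the average length is 1.75 bits/symbol, and the 6-symbol message takes 11 bits instead of 12.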

2 Bits, data and information

Data $\neq$ information: data is the representation of the information.
Lossless data compression uses a shorter representation for the same information.
By definition, a bit of data stores a bit of information if, and only if, it represents the occurrence of an equiprobable event (an event that can be true or false with the same probability). In this ideal situation the representation is fully efficient (no further compression is possible).

By definition, a symbol $s$ with probability $p(s)$ stores
$I(s) = -\log_{2} p(s)$

bits of information.

So, ideally, the length of a code-word in bits (of data) should match the number of bits of information that the symbol carries.
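
The definition of $I(s)$ can be checked numerically with a few lines of Python. This is only a sketch: the function name `information` and the sample probabilities are assumptions made for the example.

```python
import math


def information(p):
    """Bits of information carried by a symbol of probability p, with 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # log2(1/p) equals -log2(p); this form avoids printing -0.0 when p == 1.
    return math.log2(1.0 / p)


for p in (1.0, 0.5, 0.25, 0.125):
    print(f"p(s) = {p}: I(s) = {information(p)} bits")
```

A certain symbol ($p(s) = 1$) carries no information, and every halving of the probability adds one bit, which is why the ideal code-word for a symbol with $p(s) = 1/8$ is 3 bits long.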

3 Entropy

The entropy $H(S)$ measures the amount of information per symbol that a source of information $S$ produces, on average. By definition
$H(S) = \sum_{s = 1}^{N} p(s) \times I(s) = -\sum_{s = 1}^{N} p(s) \log_{2} p(s)$ (1)

bits-of-information/symbol, where $N$ is the size of the source alphabet (number of different symbols).
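
A short Python sketch of the entropy computation, using the usual Shannon form (the sum of $p(s) \times I(s)$ over the alphabet). The function name `entropy` and the two example distributions are assumptions made for illustration.

```python
import math


def entropy(probabilities):
    """Entropy, in bits/symbol, of a source with the given symbol probabilities."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must add up to 1")
    # Symbols with p == 0 never occur and contribute nothing to the sum.
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # 4 equiprobable symbols
skewed = [0.5, 0.25, 0.125, 0.125]   # same alphabet size, unequal probabilities

print("uniform:", entropy(uniform), "bits/symbol")
print("skewed: ", entropy(skewed), "bits/symbol")
```

The uniform source yields $\log_2 4 = 2$ bits/symbol, the maximum for 4 symbols, while the skewed one yields 1.75 bits/symbol: the less uniform the probabilities, the more room there is for compression.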

4 Compression basics

4.1 Encoding of a symbol

While the decoder does not know the symbol:
1. Assert something about the symbol that allows the decoder to reduce its uncertainty about that symbol. This guess should be true or false with the same probability.
2. Output a bit of code that says whether the last guess is true or false.

4.2 Decoding of a symbol

While the symbol is not known without uncertainty:
1. Make the same guess as the encoder.
2. Input a bit of code that represents the result of the last guess.

4.2.1 Tip

This codec is 100% efficient if the guesses are equiprobable.
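
The encoding and decoding loops of this section can be sketched in Python as follows. The sketch assumes the simplest possible case: an alphabet of 8 equiprobable symbols, where the guess "the symbol is in the first half of the remaining candidates" is true or false with the same probability. The function names and the alphabet are hypothetical.

```python
def encode(symbol, alphabet):
    """Return the bits (outcomes of the guesses) that identify symbol in alphabet."""
    candidates = list(alphabet)
    if symbol not in candidates:
        raise ValueError("symbol not in alphabet")
    bits = []
    while len(candidates) > 1:            # the decoder does not know the symbol yet
        half = len(candidates) // 2
        guess = candidates[:half]         # guess: "the symbol is in the first half"
        hit = symbol in guess
        bits.append(1 if hit else 0)      # one bit of code per guess
        candidates = guess if hit else candidates[half:]
    return bits


def decode(bits, alphabet):
    """Repeat the encoder's guesses, reading one bit per guess."""
    candidates = list(alphabet)
    bits = iter(bits)
    while len(candidates) > 1:            # the symbol is still uncertain
        half = len(candidates) // 2
        hit = next(bits) == 1
        candidates = candidates[:half] if hit else candidates[half:]
    return candidates[0]


alphabet = "abcdefgh"                     # 8 symbols, assumed equiprobable
code = encode("f", alphabet)
print(code, "->", decode(code, alphabet))
```

Every symbol of this alphabet costs 3 bits, which equals $-\log_2(1/8)$, so no bit is wasted. If the symbols were not equiprobable, splitting the candidates by count would no longer give equiprobable guesses and the codec would lose efficiency.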

4.3 Example

Let’s suppose that we use the Spanish alphabet. Humans know that symbols do not form words in arbitrary order, so we can formulate the following VLC (Variable Length Codec):
In Spanish there are 28 letters. Therefore, to encode, for example, the word preciosa, the first symbol p can be represented by its index within the Spanish alphabet, using a code-word of 5 bits. At this point the encoding is not very efficient, but this is only the first letter … For the second one, r, we can see (using a Spanish dictionary) that after a p only the following symbols are possible: (1) a, (2) e, (3) i, (4) l, (5) n, (6) o, (7) r, (8) s and (9) u. Therefore, we no longer need 5 bits: 4 are enough.

Notice that the compression ratio has been 40/25:1, that is, 1.6:1 (preciosa has 8 letters, which take 8 × 5 = 40 bits with fixed-length code-words).
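
The bit counts of this example can be reproduced with a small Python sketch. The helper name `index_bits` is hypothetical. Only the first two letters are worked out in the text, so the total of 25 compressed bits is taken from the ratio stated above rather than recomputed here.

```python
def index_bits(n_candidates):
    """Bits needed by a fixed-length index that selects one of n_candidates symbols."""
    if n_candidates < 1:
        raise ValueError("at least one candidate is required")
    # Smallest b such that 2**b >= n_candidates, computed with integers only.
    return (n_candidates - 1).bit_length()


first_letter = index_bits(28)    # any of the 28 letters can start a word
second_letter = index_bits(9)    # only 9 letters can follow a 'p'

uncompressed_bits = len("preciosa") * index_bits(28)
compressed_bits = 25             # total stated in the text, not derived here

print("first letter: ", first_letter, "bits")
print("second letter:", second_letter, "bits")
print("ratio:", uncompressed_bits / compressed_bits, ": 1")
```

The fewer candidates the context leaves, the shorter the index: a position where only one letter is possible would cost 0 bits.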

5 Probabilistic models

6 Shannon-Fano coding

7 Huffman coding

8 Arithmetic coding

9 Move-to-front transform

10 Context-based text predictive transform

11 Unary coding

12 Golomb-Rice coding

13 gzip
