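$$I(s) = -\log_2 p(s)$$
bits of information, where $p(s)$ is the probability of the symbol $s$. The entropy of the source is the average of this quantity over all symbols:
$$H = -\sum_{s=1}^{N} p(s) \log_2 p(s) \tag{1}$$
bits-of-information/symbol, where $N$ is the size of the source alphabet (number of different symbols).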
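As a minimal sketch of these two quantities, assuming the usual Shannon definitions (a symbol of probability p carries -log2(p) bits, and the entropy is the probability-weighted average of that value), the following Python code estimates the probabilities from symbol counts in a string. The function names `symbol_information` and `entropy` are chosen here for illustration only.

```python
import math
from collections import Counter


def symbol_information(p: float) -> float:
    """Information, in bits, carried by a symbol that occurs with probability p."""
    return -math.log2(p)


def entropy(text: str) -> float:
    """Average bits of information per symbol of `text`.

    Probabilities are estimated as relative frequencies of the symbols
    found in `text` (an assumption of this sketch, not a property of the
    language the text is written in).
    """
    counts = Counter(text)
    total = len(text)
    return sum(
        (count / total) * symbol_information(count / total)
        for count in counts.values()
    )


if __name__ == "__main__":
    print(symbol_information(0.5))  # a symbol seen half of the time
    print(entropy("preciosa"))      # eight distinct, equally frequent symbols
```

For the word preciosa, all eight letters are different, so each one has an estimated probability of 1/8 and the sketch reports 3 bits-of-information/symbol.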
The Spanish alphabet has 27 letters. Therefore, to encode, for example, the word preciosa, the first symbol p can be represented by its index in the Spanish alphabet with a code-word of 5 bits ($2^5 = 32 \geq 27$). This encoding is not very efficient, but this is only the first letter … For the second symbol, r, we can see (using a Spanish dictionary) that after a p only the following symbols are possible: (1) a, (2) e, (3) i, (4) l, (5) n, (6) o, (7) r, (8) s and (9) u. Therefore, we no longer need 5 bits: 4 are enough, because $2^4 = 16 \geq 9$.
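The bit counts in this example can be checked with a short Python sketch. It assumes fixed-length code-words that simply store the index of the symbol among the candidates that are possible at that point. The names `index_bits`, `codeword`, `ALPHABET` and `AFTER_P` are hypothetical. The alphabet string (the 26 Latin letters plus ñ) is also an assumption, but any alphabet of 17 to 32 letters needs the same 5 bits.

```python
def index_bits(n_choices: int) -> int:
    """Smallest number of bits able to index n_choices alternatives (n_choices >= 1)."""
    return (n_choices - 1).bit_length()


def codeword(symbol: str, candidates: str) -> str:
    """Fixed-length binary code-word: the 0-based index of `symbol` in `candidates`."""
    width = index_bits(len(candidates))
    if width == 0:
        return ""  # a single candidate needs no bits at all
    return format(candidates.index(symbol), f"0{width}b")


# Assumed alphabet for this sketch: 26 Latin letters plus ñ.
ALPHABET = "abcdefghijklmnñopqrstuvwxyz"

# Letters that can follow a "p", as listed in the text.
AFTER_P = "aeilnorsu"

if __name__ == "__main__":
    # First symbol: no context, so the whole alphabet is possible.
    print(index_bits(len(ALPHABET)), codeword("p", ALPHABET))
    # Second symbol: the context "p" leaves only nine candidates.
    print(index_bits(len(AFTER_P)), codeword("r", AFTER_P))
```

The code-words use 0-based indices, so r, the seventh candidate after p, is stored as index 6. The saving comes only from shrinking the candidate set, and a decoder would need the same dictionary to rebuild that set for each context.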