Burrows-Wheeler Transform (BWT)

Vicente González Ruiz

December 31, 2019

Contents

1 Intro
2 Encoder
3 Decoder
 3.1 Forward BWT
4 Inverse BWT
 4.1 Lab
Bibliography

1 Intro

2 Encoder

Let B the block-size in symbols:

  1. While the input is not exhausted:
    1. Read B symbols in S.
    2. (L, I) = BWT(S).
    3. Output L (the output block with longer runs) and I (an index of a symbol in L).

3 Decoder

  1. While input pairs (L, I):
    1. S = iBWT(L, I).
    2. Output S.

3.1 Forward BWT

  1. Input S, the sequence of B symbols. S = "abraca."
  2. Compute a matrix M’ with all possible cyclic rotations of S.
    M’ = ["abraca.",  
          "braca.a",  
          "raca.ab",  
          "aca.abr",  
          "ca.abra",  
          "a.abrac",  
          ".abraca"]

  3. Sort lexicographically M’ go generate
    M = [".abraca",  
         "a.abrac",  
         "abraca.",  
         "aca.abr",  
         "braca.a",  
         "ca.abra",  
         "raca.ab"]

  4. Let I the index of S in M. I = 2
  5. Let L the column B-1 of M. L = "ac.raab"
  6. Output I and L.

4 Inverse BWT

The backward transform regenerates the I-th row of M. Here there is an example:

  1. Input I and L, the output of a BWT applied to a string S of length B.
  2. The first F and the last L columns of M are available taking into consideration that F=sorted(L).
        F23456L  
        .     a  
        a     c  
        a     .  
        a     r  
        b     a  
        c     a  
        r     b

  3. Notice that for a particular symbol in L, the corresponding symbol in F follow it in S (for example, r follows b in abraca.). Therefore, we have found all pairs of S by taking pairs of LF.
        a.  
        ca  
        .a  
        ra  
        ab  
        ac  
        br

    Which sorted:

        .a  
        a.  
        ab  
        ac  
        br  
        ca  
        ra

    become the first two columns of M.

  4. Repeat the process until getting the rest of the columns of M (here only a few are shown):
        F23456L  
        .a    a  
        a.    c  
        ab    .  
        ac    r  
        br    a  
        ca    a  
        ra    b

    Now, for a particular symbol in L, the corresponding pair in columns F and 2 follows it in S (for example, pair br follows symbol a in abraca.). So, we can find all triples of S by tacking triples of LF2:

        a.a         .ab  
        ac.         a.a  
        .ab  sort   abr  
        rac ------> aca  
        abr         bra  
        aca         ca.  
        bra         rac

    to have the partial reconstruction of M:

        F23456L  
        .ab   a  
        a.a   c  
        abr   . <- I  
        aca   a  
        bra   a  
        ca.   a  
        rac   b

In an optimized implementation of the BWT, only the row I is generated.

4.1 Lab

IPython notebook

References

[1]   M. Burrows and D. J. Wheeler. A Block-Sorting Lossless Data Compression Algorithm. Technical Report 124, Digital Systems Research Center, Palo Alto, California, 1994.