The Monthly Course: The Burrows-Wheeler Transform

This month we are going to decode a string mangled with the Burrows Wheeler Transform

Please don't be scared by the somewhat long motivation that follows, even if you are a beginner. At the end you will be shown a few pretty simple algorithms

.

Important change

The official perl version for this golf is perl version 5.8.0. If you don't like compiling from scratch, most distributions provide 5.8.0 binaries now, and windows people can get a binary from ActiveState as usual.

Or simply keep using your old perl. Most likely it won't matter. But when you meet a situation where you're pretty sure your code should work and it doesn't, you might consider testing it on a different perl. We've set up an automatic tester (using 5.8.0) where you can test your solutions even without getting a new perl (ignore the leaderboard on that site. The only real leaderboard is here)


Motivation

The Burrows-Wheeler Transform (BWT) is a method that takes a block of text as input and rearranges it using a sorting algorithm. The unexpected thing is that the transform is reversible (which you normally don't expect. "Sort" rarely comes with an "unsort"). The result of the transform is a block of text where consecutive letters are very often the same, so it can be easily compressed with even very naive algorithms. The bzip program is based on this.

So how does it work ? It's pretty easy in fact. Let's assume the input text is "DRDOBBS". Write the string and all rotated versions under each other like this:

  
DRDOBBS(row 0)
RDOBBSD(row 1)
DOBBSDR(row 2)
OBBSDRD(row 3)
BBSDRDO(row 4)
BSDRDOB(row 5)
SDRDOBB(row 6)

Next sort rows:

  
BBSDRDO(row 4)
BSDRDOB(row 5)
DOBBSDR(row 2)
DRDOBBS(row 0)
OBBSDRD(row 3)
RDOBBSD(row 1)
SDRDOBB(row 6)

And finally, write down the last column: "OBRSDDB". Also note that the D in the last column of the second-last row (that's the new row 5, or the old row 1) is the D that starts the word "DRDOBBS". These two together are the result of the transform: "OBRSDDB" and 5.

It's pretty unexpected that this transform is reversible. One way to do that is to recover the (row n) lines in the sorted table. Let's first try to recover the sorted table a bit. We obviously already know:

  
......O
......B
......R
......S
......D
......D(row 1)
......B

In fact, we also know the first column, because it's just the letters in the last column, but sorted:

  
B.....O
B.....B
D.....R
D.....S
O.....D
R.....D(row 1)
S.....B

But this helps a lot ! From the row labeled with "(row 1)" we now know that R follows that D (remember that the strings were rotated, so the letter in the first column follows the letter in the last column). So if we know that the R is the first letter in row 1, we also know it's the last letter in row 2. So we can now extend our table to:

  
B.....O
B.....B
D.....R(row 2)
D.....S
O.....D
R.....D(row 1)
S.....B

Now repeat the same reasoning for the row labeled "(row 2)" We see that the letter after R must be a D which corresponds to row 3. We already have one of the D's labeled, so it must be the other one. And let's repeat this process:

  
B.....O(row 4)
B.....B
D.....R(row 2)
D.....S
O.....D(row 3)
R.....D(row 1)
S.....B

Oops, here we run into a problem. We know from the row labeled "(row 4)" that the letter after O is a B, but there are two of them in the last column. After which one should we write "(row 5)" ? To answer that, consider the letters that follow each B. By looking at a B in the last column and looking at the corresponding letter in the first column, we know they are B and S respectively. And since all rows are sorted, we in fact know:

  
BB....O(row 4)
BS....B
D.....R(row 2)
D.....S
O.....D(row 3)
R.....D(row 1)
S.....B

This B and that S in the second column of course also occur in the first column somewhere (on the second and last row in fact), and since the first column is also sorted, they occur in the same order, B first (on the second row) and the S later (on the last row), and they each have a B in their last column). So after this subtle bit of reasoning we get to a very simple rule: if you have repeated letters in the first row (the B in BB and BS), they correspond to that same letter in the last row in exactly the same order.

So the lines in a picture like this will never cross:

Demonstrate correspondence principle

The B in the first column of row one corresponds to the B in the last column of the second row, and the B in the first column of the second row corresponds to the B in the last column of the last row.

Let's go back to our question. We had O which was followed by B due to the top row. This is the first B in column 1, so it corresponds to the first B in the last column, the one on the second row. Therefore that's the one we must label with "(row 5)". So we get:

  
B.....O(row 4)
B.....B(row 5)
D.....R(row 2)
D.....S
O.....D(row 3)
R.....D(row 1)
S.....B

The rest is simple (don't forget that after row 6 we get row 0, not row 7):

  
B.....O(row 4)
B.....B(row 5)
D.....R(row 2)
D.....S(row 0)
O.....D(row 3)
R.....D(row 1)
S.....B(row 6)

And finally we write down the first letters from the rows labeled "(row 0)", "(row 1)", ..., "(row 6)" and we get "DRDOBBS"


Algorithms

Algorithm 1

Wow, that was hard ! Or was it ? Let's redo the exercise using the special rule we found: "All occurrences of the same letter in the first column will correspond to all occurrences of that letter in the last column in the same order" (see the picture where this rule is demonstrated if the letter is B). Then we can start with the encoded string with below it the sorted string and position 5 marked:

OBRSDDB
BBDDORS

In the sorted string at the bottom we see the first R marked. Now mark the first R in the top string (and the D below it):

OBRSDDB
BBDDORS

So now we have the first D below marked. Now mark the first D at the top and the O below it (forget the old mark):

OBRSDDB
BBDDORS

First O at the bottom marked, mark first O at the top:

OBRSDDB
BBDDORS

First B at the bottom marked (this also happens to be the first time we are processing a B, but "first" here means it's the first B in the string "BBDDORS"), mark the first B at the top (that is, the first B in the string "OBRSDDB"):

OBRSDDB
BBDDORS

Second B at the bottom marked, so mark the second B at the top:

OBRSDDB
BBDDORS

First S at the bottom marked, mark first S at the top:

OBRSDDB
BBDDORS

Second D at the bottom marked, but when we mark the second D at the top, we see we are back at the original:

OBRSDDB
BBDDORS

When we look at each of the letters we marked at the top in order, we see: "DRDOBBS"

Algorithm 2

Another way is to observe that if you know the first few columns in the sorted table, you can fill in the last column too. And since the letters in the last column are prefixes of the string you already have in a row, you can move that letter to the front, and you get a set of strings that must also be in the sorted table (but not sorted yet). Sort them, and we are in the start situation again, but all strings are one letter longer. This leads to the following method:

First write down a column consisting of the encoded string:

O
B
R
S
D
D
B

Then sort it:

B
B
D
D
O
R
S

Now write the encoded string in front of that again:

OB
BB
RD
SD
DO
DR
BS

And sort it again:

BB
BS
DO
DR
OB
RD
SD

Repeat prefixing and sorting until you get:

BBSDRDO
BSDRDOB
DOBBSDR
DRDOBBS
OBBSDRD
RDOBBSD
SDRDOBB

Now take row 5, and move the D at the end to the front, giving:

DRDOBBS

Many more algorithms exist of course. Maybe they even lead to shorter programs...


Specification

On STDIN you get one line consisting of a number (representing the start offset), a tab and a string consisting of uppercase characters and commas with a length between 2 and 999 (this is the string you have to decode) followed by a newline. So the input matches /^(0|[1-9]\d{0,2})\t[A-Z,]{2,999}\n\z/). The offset number represents the character that will be the first character of the output (counting starts at zero, and the number will be less than the string length, you don't have to handle invalid input).

You must write the properly newline terminated decoded string to STDOUT. Nothing should appear on STDERR. The returncode of the program does not matter.

We use normal ASCII sorting, so a comma sorts before all letters.

So if you get on STDIN:

5	OBRSDDB

The output should be:

DRDOBBS

Tiebreaker

The tiebreaker will favor programs with many the same characters (so they can be easily compressed).


General rules

  1. The program must be written as one or more lines. The score for each hole is the total number of bytes you need (smaller is better). If your program is more than one line, you must count the newlines in between as one byte each. The #! line is not counted. If you use options on the #! line, the options themselves are counted, including the leading space and -. Use the test program with the -n option if you have any doubts.
  2. The program is expected to finish in a reasonable time, e.g. if it gets started when the challenge opens, it's expected to be finished by the time the challenge closes even on a somewhat slow computer. The program has to be valid during the period that the challenge runs.
  3. At least 55% of the code must consist of normal ASCII characters matching /[ -~\t\n]/.
  4. You can submit as often as you want to until the deadline, no reason to wait until the last minute. In fact, other people want to see the score to beat on the leaderboard. So don't be a spoilsport by hoarding your score. Submit early and often.
  5. If you decide to play in a team, you may only play in one team and not at the same time play as an individual. Teams may be extended during play, but not with people already playing.
  6. The program may only use the perl executable, no other executables on the system are allowed (in particular, you must not use the programs implementing any other holes. The program may use itself though). You may use any of the perl 5.8.0 standard core modules. Your solution must be portable in the sense that it should work on all official versions of perl 5.8.0 everywhere. It's perfectly fine to abuse perl 5.8.0 bugs. For perl golf the executable (not the documentation) defines the language.
  7. The program should be self-contained (except for any I/O mentioned in the challenge). In particular, you may not do things like fetching extra data from a remote site.
  8. You may assume ASCII as character set and you may use perl's unicode semantics.
  9. Any bytes may be used in the sourcecode, including things like binary 0.
  10. All input will have a total size that will comfortably fit in memory (without swapping) and still allow you ample memory to play with.
  11. You may assume total memory is < 2**32 bytes (with the previous rule this e.g. implies sizes of arbitrary datastructures can be represented in a plain integer, and that you must not try to generate arbitrarily big datastructures).
  12. Your program should not introduce arbitrary limits not specifically mentioned in the challenge. In particular, your program should not need changes depending on the amount of free memory it runs with. E.g. doing @a = (1) x (10**8) (trying to create an array bigger than every valid inputsize) is wrong. It will fail miserably unless the machines has about 2 GigaByte of free memory. You may however always assume enough base memory to make datastructures of a few million entries, so something like @a = (1) x (10**6) is just fine.
  13. The program will be called as
    some_perl_5.8.0_binary programname {args}
    The file programname will be non-executable, but readable and writable (in fact, on unix it will have permissions 644). You do not get to choose the programname, but it will match /^[a-zA-Z][\w.-]*\z/ and will be the initial value of $0. You also don't get to choose the name of the perl binary.
  14. Your program must start with #!perl, possibly followed by some options.
  15. STDIN, STDOUT and STDERR may or may not be files.
  16. If read on a perl handle give EOF (End Of File), you may assume later reads on the same handle will also return EOF.
  17. The sort operator is stable in perl 5.8.0 (which means that elements that compare as equal among each other will appear in the output in the same order as in the input).
  18. Several things are not so much determined by perl itself, but by the environment in which it runs. But in real perls some things still tend to be true. Here are a number of assumptions you may make until someone finds a real perl (not specifically built to break them) that breaks them. Some of them may be even further restricted.


Deadline

The game starts November 12 (23:00 UTC) and ends November 19 (23:00 UTC).


Test program

A test program (version 1) is provided to help screen entries.

Any program that passes the test program can be submitted. If you are surprised that your solution passed the test program, please submit it anyway! That will help us identify bugs in the test program.

Name your script bwt.pl and place it in the same directory as your test program. Then verify and score your attempt using:

    $ perl tpr06.pl
To see some extra options of the test program, do:
    $ perl tpr06.pl --help

Passing the test program does not assure your solution is valid. The referees have the final say.


Submitting

You can submit your solutions here (you'll notice it's the same page as the Leaderboard). As a courtesy to other players, you should not wait until the last moment to submit your solutions. Everyone has more fun when the leaderboard depicts the situation correctly.

Do not publish your solutions anywhere. That will spoil the game, as your solutions are meant to be secret. All solutions will be published at the end of the game.

Prizes (provided by O'Reilly and ActiveState) will be awarded to veteran and beginner winners. A prize may also be awarded for any especially fascinating solution(s).


Leaderboard

You can track your ranking through the leaderboard here. Beginners are encouraged to enter and there is a separate leaderboard for them.

There is also a special leaderboard for teams. There will be no prizes awarded to the best team, other than the admiration of your fellow golfers. If you are in a team, you can't also play individually.


Referees

PGAS Master