This month we are going to decode a string mangled with the Burrows Wheeler Transform
Please don't be scared by the somewhat long motivation that follows, even if you are a beginner. At the end you will be shown a few pretty simple algorithms
.The official perl version for this golf is perl version 5.8.0. If you don't like compiling from scratch, most distributions provide 5.8.0 binaries now, and windows people can get a binary from ActiveState as usual.
Or simply keep using your old perl. Most likely it won't matter. But when you meet a situation where you're pretty sure your code should work and it doesn't, you might consider testing it on a different perl. We've set up an automatic tester (using 5.8.0) where you can test your solutions even without getting a new perl (ignore the leaderboard on that site. The only real leaderboard is here)
The Burrows-Wheeler Transform (BWT) is a method that takes a block of text as input and rearranges it using a sorting algorithm. The unexpected thing is that the transform is reversible (which you normally don't expect. "Sort" rarely comes with an "unsort"). The result of the transform is a block of text where consecutive letters are very often the same, so it can be easily compressed with even very naive algorithms. The bzip program is based on this.
So how does it work ? It's pretty easy in fact. Let's assume the input text is "DRDOBBS". Write the string and all rotated versions under each other like this:
D | R | D | O | B | B | S | (row 0) |
R | D | O | B | B | S | D | (row 1) |
D | O | B | B | S | D | R | (row 2) |
O | B | B | S | D | R | D | (row 3) |
B | B | S | D | R | D | O | (row 4) |
B | S | D | R | D | O | B | (row 5) |
S | D | R | D | O | B | B | (row 6) |
B | B | S | D | R | D | O | (row 4) |
B | S | D | R | D | O | B | (row 5) |
D | O | B | B | S | D | R | (row 2) |
D | R | D | O | B | B | S | (row 0) |
O | B | B | S | D | R | D | (row 3) |
R | D | O | B | B | S | D | (row 1) |
S | D | R | D | O | B | B | (row 6) |
And finally, write down the last column: "OBRSDDB". Also note that the D in the last column of the second-last row (that's the new row 5, or the old row 1) is the D that starts the word "DRDOBBS". These two together are the result of the transform: "OBRSDDB" and 5.
It's pretty unexpected that this transform is reversible. One way to do that is to recover the (row n) lines in the sorted table. Let's first try to recover the sorted table a bit. We obviously already know:
. | . | . | . | . | . | O | |
. | . | . | . | . | . | B | |
. | . | . | . | . | . | R | |
. | . | . | . | . | . | S | |
. | . | . | . | . | . | D | |
. | . | . | . | . | . | D | (row 1) |
. | . | . | . | . | . | B |
In fact, we also know the first column, because it's just the letters in the last column, but sorted:
B | . | . | . | . | . | O | |
B | . | . | . | . | . | B | |
D | . | . | . | . | . | R | |
D | . | . | . | . | . | S | |
O | . | . | . | . | . | D | |
R | . | . | . | . | . | D | (row 1) |
S | . | . | . | . | . | B |
But this helps a lot ! From the row labeled with "(row 1)" we now know that R follows that D (remember that the strings were rotated, so the letter in the first column follows the letter in the last column). So if we know that the R is the first letter in row 1, we also know it's the last letter in row 2. So we can now extend our table to:
B | . | . | . | . | . | O | |
B | . | . | . | . | . | B | |
D | . | . | . | . | . | R | (row 2) |
D | . | . | . | . | . | S | |
O | . | . | . | . | . | D | |
R | . | . | . | . | . | D | (row 1) |
S | . | . | . | . | . | B |
Now repeat the same reasoning for the row labeled "(row 2)" We see that the letter after R must be a D which corresponds to row 3. We already have one of the D's labeled, so it must be the other one. And let's repeat this process:
B | . | . | . | . | . | O | (row 4) |
B | . | . | . | . | . | B | |
D | . | . | . | . | . | R | (row 2) |
D | . | . | . | . | . | S | |
O | . | . | . | . | . | D | (row 3) |
R | . | . | . | . | . | D | (row 1) |
S | . | . | . | . | . | B |
Oops, here we run into a problem. We know from the row labeled "(row 4)" that the letter after O is a B, but there are two of them in the last column. After which one should we write "(row 5)" ? To answer that, consider the letters that follow each B. By looking at a B in the last column and looking at the corresponding letter in the first column, we know they are B and S respectively. And since all rows are sorted, we in fact know:
B | B | . | . | . | . | O | (row 4) |
B | S | . | . | . | . | B | |
D | . | . | . | . | . | R | (row 2) |
D | . | . | . | . | . | S | |
O | . | . | . | . | . | D | (row 3) |
R | . | . | . | . | . | D | (row 1) |
S | . | . | . | . | . | B |
This B and that S in the second column of course also occur in the first column somewhere (on the second and last row in fact), and since the first column is also sorted, they occur in the same order, B first (on the second row) and the S later (on the last row), and they each have a B in their last column). So after this subtle bit of reasoning we get to a very simple rule: if you have repeated letters in the first row (the B in BB and BS), they correspond to that same letter in the last row in exactly the same order.
So the lines in a picture like this will never cross:
The B in the first column of row one corresponds to the B in the last column of the second row, and the B in the first column of the second row corresponds to the B in the last column of the last row.
Let's go back to our question. We had O which was followed by B due to the top row. This is the first B in column 1, so it corresponds to the first B in the last column, the one on the second row. Therefore that's the one we must label with "(row 5)". So we get:
B | . | . | . | . | . | O | (row 4) |
B | . | . | . | . | . | B | (row 5) |
D | . | . | . | . | . | R | (row 2) |
D | . | . | . | . | . | S | |
O | . | . | . | . | . | D | (row 3) |
R | . | . | . | . | . | D | (row 1) |
S | . | . | . | . | . | B |
The rest is simple (don't forget that after row 6 we get row 0, not row 7):
B | . | . | . | . | . | O | (row 4) |
B | . | . | . | . | . | B | (row 5) |
D | . | . | . | . | . | R | (row 2) |
D | . | . | . | . | . | S | (row 0) |
O | . | . | . | . | . | D | (row 3) |
R | . | . | . | . | . | D | (row 1) |
S | . | . | . | . | . | B | (row 6) |
And finally we write down the first letters from the rows labeled "(row 0)", "(row 1)", ..., "(row 6)" and we get "DRDOBBS"
Wow, that was hard ! Or was it ? Let's redo the exercise using the special rule we found: "All occurrences of the same letter in the first column will correspond to all occurrences of that letter in the last column in the same order" (see the picture where this rule is demonstrated if the letter is B). Then we can start with the encoded string with below it the sorted string and position 5 marked:
OBRSDDB BBDDORS
In the sorted string at the bottom we see the first R marked. Now mark the first R in the top string (and the D below it):
OBRSDDB BBDDORS
So now we have the first D below marked. Now mark the first D at the top and the O below it (forget the old mark):
OBRSDDB BBDDORS
First O at the bottom marked, mark first O at the top:
OBRSDDB BBDDORS
First B at the bottom marked (this also happens to be the first time we are processing a B, but "first" here means it's the first B in the string "BBDDORS"), mark the first B at the top (that is, the first B in the string "OBRSDDB"):
OBRSDDB BBDDORS
Second B at the bottom marked, so mark the second B at the top:
OBRSDDB BBDDORS
First S at the bottom marked, mark first S at the top:
OBRSDDB BBDDORS
Second D at the bottom marked, but when we mark the second D at the top, we see we are back at the original:
OBRSDDB BBDDORS
When we look at each of the letters we marked at the top in order, we see: "DRDOBBS"
Another way is to observe that if you know the first few columns in the sorted table, you can fill in the last column too. And since the letters in the last column are prefixes of the string you already have in a row, you can move that letter to the front, and you get a set of strings that must also be in the sorted table (but not sorted yet). Sort them, and we are in the start situation again, but all strings are one letter longer. This leads to the following method:
First write down a column consisting of the encoded string:
O B R S D D B
Then sort it:
B B D D O R S
Now write the encoded string in front of that again:
OB BB RD SD DO DR BS
And sort it again:
BB BS DO DR OB RD SD
Repeat prefixing and sorting until you get:
BBSDRDO BSDRDOB DOBBSDR DRDOBBS OBBSDRD RDOBBSD SDRDOBB
Now take row 5, and move the D at the end to the front, giving:
DRDOBBS
Many more algorithms exist of course. Maybe they even lead to shorter programs...
On STDIN you get one line consisting of a number (representing the start offset), a tab and a string consisting of uppercase characters and commas with a length between 2 and 999 (this is the string you have to decode) followed by a newline. So the input matches /^(0|[1-9]\d{0,2})\t[A-Z,]{2,999}\n\z/). The offset number represents the character that will be the first character of the output (counting starts at zero, and the number will be less than the string length, you don't have to handle invalid input).
You must write the properly newline terminated decoded string to STDOUT. Nothing should appear on STDERR. The returncode of the program does not matter.
We use normal ASCII sorting, so a comma sorts before all letters.
So if you get on STDIN:
5 OBRSDDB
The output should be:
DRDOBBS
The tiebreaker will favor programs with many the same characters (so they can be easily compressed).
some_perl_5.8.0_binary programname {args}The file programname will be non-executable, but readable and writable (in fact, on unix it will have permissions 644). You do not get to choose the programname, but it will match /^[a-zA-Z][\w.-]*\z/ and will be the initial value of $0. You also don't get to choose the name of the perl binary.
perl -wle 'print []+0'You may assume that such an address will be > 100000 (At least one machine is known where the number is 156000) You may assume the address is a multiple of 4.
The game starts November 12 (23:00 UTC) and ends November 19 (23:00 UTC).
A test program (version 1) is provided to help screen entries.
Any program that passes the test program can be submitted. If you are surprised that your solution passed the test program, please submit it anyway! That will help us identify bugs in the test program.
Name your script bwt.pl and place it in the same directory as your test program. Then verify and score your attempt using:
$ perl tpr06.plTo see some extra options of the test program, do:
$ perl tpr06.pl --help
Passing the test program does not assure your solution is valid. The referees have the final say.
You can submit your solutions here (you'll notice it's the same page as the Leaderboard). As a courtesy to other players, you should not wait until the last moment to submit your solutions. Everyone has more fun when the leaderboard depicts the situation correctly.
Do not publish your solutions anywhere. That will spoil the game, as your solutions are meant to be secret. All solutions will be published at the end of the game.
Prizes (provided by O'Reilly and ActiveState) will be awarded to veteran and beginner winners. A prize may also be awarded for any especially fascinating solution(s).
You can track your ranking through the leaderboard here. Beginners are encouraged to enter and there is a separate leaderboard for them.
There is also a special leaderboard for teams. There will be no prizes awarded to the best team, other than the admiration of your fellow golfers. If you are in a team, you can't also play individually.