spooknick has asked for the wisdom of the Perl Monks concerning the following question:
Hi
I am trying to implement Burrows-Wheeler transform (BWT) and this is the code I have:
my $bwt = join('',
map{substr($text, ($_ - 1), 1)}
sort{
(substr($text,$a).substr($text, 0, $a))
cmp
(substr($text, $b).substr($text,0, $b))
} 0..length($text)-1
);
It is totally functional, but it gets problematic when the text is large.
I suspect, that the sort function here above is making all the cyclic rotations first and then sort them. If that's the case, is it possible to make references to the cyclic rotations without actually making them, and use those in the sort function?
Any remarks, suggestions are welcomed.
For those that are not familiar with, the BWT is done by making all cyclic rotation of text, sorting them, and keeping the last char of each rotation. BWT of text "BANA$" is done as following:
Cyclic rotations:
BANA$
ANA$B
NA$BA
A$BAN
$BANA
Which are sorted as:
$BANA
A$BAN
ANA$B
BANA$
NA$BA
Keeping the last chars gives BWT = ANB$A
Re: Is it possible to make reference to substrings and sort them?
by hdb (Monsignor) on Mar 22, 2015 at 10:07 UTC
|
my $ntext = length $text;
my $bwt2 = join('',
map{substr($text, ($_ - 1), 1)}
sort{
my $j = -1;
my $diff = 0;
while( $j++<$ntext ) {
$diff = substr( $text, ($j+$a) % $ntext, 1 ) cmp substr( $text
+, ($j+$b) % $ntext, 1 );
last if $diff;
}
return $diff;
} 0..$ntext-1
);
print "$text: $bwt2\n";
Whether or not this is faster, I do not know, but it avoids building all the cyclic rotations. | [reply] [d/l] |
|
my $text2 = "$text$text";
my $bwt4 = join('',
map{substr($text, ($_ - 1), 1)}
map{ $_->[0] }
sort{ ${$a->[1]} cmp ${$b->[1]} }
map{
[$_,\substr($text2,$_,$ntext)]
} 0..$ntext-1
);
print "$text: $bwt4\n";
| [reply] [d/l] |
|
my $text2 = "$text$text";
my $tlen = length($text);
my $bwt7 = join('',
map{ substr($$_, -1) }
sort{ $$a cmp $$b }
map{ \substr($text2, $_, $tlen) } 0..$tlen-1
);
print "$text: $bwt7\n";
Intermediate array of references consumes more memory than an array of offsets would, I think, but is faster to sort. Something to test and benchmark... | [reply] [d/l] |
|
Excellent, it took:
Spent 398.221 wallclock secs (397.00 usr + 0.98 sys = 397.98 CPU)
for a sequence of 4639675 nt
I just thought about not using the complete length of text in sort, like using only the first 1000 nt, but your idea is accurate.
| [reply] [d/l] |
Re: Is it possible to make reference to substrings and sort them?
by choroba (Cardinal) on Mar 22, 2015 at 11:02 UTC
|
my @suff = 0 .. length $string;
pop @suff;
@suff = sort { substr($string, $a) cmp substr $string, $b } @suff;
for my $idx (@suff) {
print substr $string, $idx - 1, 1;
}
print "\n";
Update: Now I noticed they mentioned O(|Text|) solution wasn't needed in the Note.
| [reply] [d/l] |
|
| [reply] |
Re: Is it possible to make reference to substrings and sort them?
by Anonymous Monk on Mar 22, 2015 at 09:56 UTC
|
my @substrrefs = map {
[ \substr( $text, $_ ) ,
\substr( $text, 0, $_ )
]
} 0 .. (-1+length $text);
my $bwt = join ...
sort {
$$a[0].$$a[1]
cmp $$b[0].$$b[1]
} @substrrefs;
| [reply] [d/l] |
Re: Is it possible to make reference to substrings and sort them?
by BrowserUk (Patriarch) on Mar 22, 2015 at 10:01 UTC
|
| [reply] |
|
| [reply] |
|
the text can be up to 5M characters
No matter how you generate them, just storing the 5 million rotations of 5 million characters (>23 Terabytes of data) is going to be a problem, long before you get to the point of trying to sort.
If the purpose of your BWT generation is to aid compression, then there are much more effective and efficient methods of compressing genomic data.
If the purpose is Short Read Alignment, you almost certainly need to look at the suffix trie methods described there.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
| [reply] |
|
|
|
Re: Is it possible to make reference to substrings and sort them?
by johngg (Canon) on Mar 22, 2015 at 12:28 UTC
|
$ perl -Mstrict -Mwarnings -E '
say join q{},
map { substr $_, -1 }
sort
sub {
my $s = shift;
my @r;
push @r, do {
$s = chop( $s ) . $s;
$s;
} for 1 .. length $s;
return @r;
}->( q{BANA$} );'
ANB$A
$
I think I read somewhere that chop is fast so it might be worth benchmarking.
| [reply] [d/l] |
|
I thought of a similar approach, but as BrowserUk pointed out above, if a list of all rotations of a 5,000,000 character string must be generated and stored somewhere, it's going to crash and burn. Anyway, FWIW:
c:\@Work\Perl\monks>perl -wMstrict -le
"use Test::More 'no_plan';
;;
my $t = 'BANA$';
;;
my @rots =
do { my $u = $t; map $u = chop($u) . $u, 1 .. length $u }
;
;;
my $ar_expected = [ reverse qw(BANA$ ANA$B NA$BA A$BAN $BANA) ];
is_deeply \@rots, $ar_expected, 'cyclic rotations';
;;
my $bwt =
join '',
map chop,
sort
@rots
;
;;
is $bwt, 'ANB$A', qq{'$t' -> '$bwt' transform};
"
ok 1 - cyclic rotations
ok 2 - 'BANA$' -> 'ANB$A' transform
1..2
The generation of the @rots array is broken out into a separate step just so it can be tested against the OPed data; normally, it would be part of the sort operation chain.
Give a man a fish: <%-(-(-(-<
| [reply] [d/l] [select] |
Re: Is it possible to make reference to substrings and sort them?
by Anonymous Monk on Mar 22, 2015 at 20:02 UTC
|
Testing the code in OP, and this following version ...
my $text2 = "$text$text";
my $bwt = join('',
map{substr($text2, $_-1, 1)}
sort{substr($text2,$a) cmp substr($text2,$b)}
0..length($text)-1
);
... reveals the memory usage is well under control for a ~10MB text. The versions with substring reference, on the other hand, blow up completely.
Problem here is that the sort is SLOW. About half a minute for 100k text. Is there not a module for BWT?
| [reply] [d/l] |
|
|