Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Is it possible to make reference to substrings and sort them?

by spooknick (Novice)
on Mar 22, 2015 at 09:31 UTC ( [id://1120869]=perlquestion: print w/replies, xml ) Need Help??

spooknick has asked for the wisdom of the Perl Monks concerning the following question:

Hi
I am trying to implement Burrows-Wheeler transform (BWT) and this is the code I have:

my $bwt = join('', map{substr($text, ($_ - 1), 1)} sort{ (substr($text,$a).substr($text, 0, $a)) cmp (substr($text, $b).substr($text,0, $b)) } 0..length($text)-1 );

It is totally functional, but it gets problematic when the text is large.

I suspect, that the sort function here above is making all the cyclic rotations first and then sort them. If that's the case,
is it possible to make references to the cyclic rotations without actually making them, and use those in the sort function?

Any remarks, suggestions are welcomed.

For those that are not familiar with, the BWT is done by making all cyclic rotation of text, sorting them, and keeping the last char of each rotation. BWT of text "BANA$" is done as following:

Cyclic rotations:

BANA$
ANA$B
NA$BA
A$BAN
$BANA

Which are sorted as:

$BANA
A$BAN
ANA$B
BANA$
NA$BA

Keeping the last chars gives BWT = ANB$A

Replies are listed 'Best First'.
Re: Is it possible to make reference to substrings and sort them?
by hdb (Monsignor) on Mar 22, 2015 at 10:07 UTC

    With a bit of index arithmetic and a manual compare function you would get

    my $ntext = length $text; my $bwt2 = join('', map{substr($text, ($_ - 1), 1)} sort{ my $j = -1; my $diff = 0; while( $j++<$ntext ) { $diff = substr( $text, ($j+$a) % $ntext, 1 ) cmp substr( $text +, ($j+$b) % $ntext, 1 ); last if $diff; } return $diff; } 0..$ntext-1 ); print "$text: $bwt2\n";

    Whether or not this is faster, I do not know, but it avoids building all the cyclic rotations.

      If you can spare the memory to double your text, you can create references to all your rotations. Below I use the Schwartzian transform to create the references once, then sort them.

      my $text2 = "$text$text"; my $bwt4 = join('', map{substr($text, ($_ - 1), 1)} map{ $_->[0] } sort{ ${$a->[1]} cmp ${$b->[1]} } map{ [$_,\substr($text2,$_,$ntext)] } 0..$ntext-1 ); print "$text: $bwt4\n";

        Good call on doubling the string. Though the Schwartzian seems unnecessary here:

        my $text2 = "$text$text"; my $tlen = length($text); my $bwt7 = join('', map{ substr($$_, -1) } sort{ $$a cmp $$b } map{ \substr($text2, $_, $tlen) } 0..$tlen-1 ); print "$text: $bwt7\n";

        Intermediate array of references consumes more memory than an array of offsets would, I think, but is faster to sort. Something to test and benchmark...

      Excellent, it took:

      Spent 398.221 wallclock secs (397.00 usr +  0.98 sys = 397.98 CPU)

      for a sequence of 4639675 nt

      I just thought about not using the complete length of text in sort, like using only the first 1000 nt, but your idea is accurate.

Re: Is it possible to make reference to substrings and sort them?
by choroba (Cardinal) on Mar 22, 2015 at 11:02 UTC
    This is how I implemented BW for Rosalind.info:
    my @suff = 0 .. length $string; pop @suff; @suff = sort { substr($string, $a) cmp substr $string, $b } @suff; for my $idx (@suff) { print substr $string, $idx - 1, 1; } print "\n";

    Update: Now I noticed they mentioned O(|Text|) solution wasn't needed in the Note.

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Your algorithm is somewhat different from the OP's in that you do not sort cyclic rotations at all, only the variable length endings of the text.

Re: Is it possible to make reference to substrings and sort them?
by Anonymous Monk on Mar 22, 2015 at 09:56 UTC

    Something like this

    my @substrrefs = map { [ \substr( $text, $_ ) , \substr( $text, 0, $_ ) ] } 0 .. (-1+length $text); my $bwt = join ... sort { $$a[0].$$a[1] cmp $$b[0].$$b[1] } @substrrefs;
Re: Is it possible to make reference to substrings and sort them?
by BrowserUk (Patriarch) on Mar 22, 2015 at 10:01 UTC
    problematic when the text is large.

    How big is your alphabet? How big is "large text"?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      It's for bioinformatics, so the alphabet is 4 letters, and the text can be up to 5M characters in my case.

        the text can be up to 5M characters

        No matter how you generate them, just storing the 5 million rotations of 5 million characters (>23 Terabytes of data) is going to be a problem, long before you get to the point of trying to sort.

        If the purpose of your BWT generation is to aid compression, then there are much more effective and efficient methods of compressing genomic data.

        If the purpose is Short Read Alignment, you almost certainly need to look at the suffix trie methods described there.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re: Is it possible to make reference to substrings and sort them?
by johngg (Canon) on Mar 22, 2015 at 12:28 UTC

    Using chop could be another way to go.

    $ perl -Mstrict -Mwarnings -E ' say join q{}, map { substr $_, -1 } sort sub { my $s = shift; my @r; push @r, do { $s = chop( $s ) . $s; $s; } for 1 .. length $s; return @r; }->( q{BANA$} );' ANB$A $

    I think I read somewhere that chop is fast so it might be worth benchmarking.

    Cheers,

    JohnGG

      I thought of a similar approach, but as BrowserUk pointed out above, if a list of all rotations of a 5,000,000 character string must be generated and stored somewhere, it's going to crash and burn. Anyway, FWIW:

      c:\@Work\Perl\monks>perl -wMstrict -le "use Test::More 'no_plan'; ;; my $t = 'BANA$'; ;; my @rots = do { my $u = $t; map $u = chop($u) . $u, 1 .. length $u } ; ;; my $ar_expected = [ reverse qw(BANA$ ANA$B NA$BA A$BAN $BANA) ]; is_deeply \@rots, $ar_expected, 'cyclic rotations'; ;; my $bwt = join '', map chop, sort @rots ; ;; is $bwt, 'ANB$A', qq{'$t' -> '$bwt' transform}; " ok 1 - cyclic rotations ok 2 - 'BANA$' -> 'ANB$A' transform 1..2
      The generation of the  @rots array is broken out into a separate step just so it can be tested against the OPed data; normally, it would be part of the sort operation chain.


      Give a man a fish:  <%-(-(-(-<

Re: Is it possible to make reference to substrings and sort them?
by Anonymous Monk on Mar 22, 2015 at 20:02 UTC

    Testing the code in OP, and this following version ...

    my $text2 = "$text$text"; my $bwt = join('', map{substr($text2, $_-1, 1)} sort{substr($text2,$a) cmp substr($text2,$b)} 0..length($text)-1 );
    ... reveals the memory usage is well under control for a ~10MB text. The versions with substring reference, on the other hand, blow up completely.

    Problem here is that the sort is SLOW. About half a minute for 100k text. Is there not a module for BWT?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1120869]
Approved by hdb
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (4)
As of 2024-04-25 14:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found