Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

(Golf) Fragment Reassembly

by MeowChow (Vicar)
on May 02, 2001 at 03:37 UTC ( [id://77183]=perlmeditation: print w/replies, xml ) Need Help??

... aka DNA Sequencing or Shortest Common Superstring. Your mission, should you choose to accept it, is to design a golfed subroutine that reassembles a list of string fragments into a minimal length string which contains all those string fragments as substrings.

For example:

sub assemble { ... } assemble qw(logger gerbil log analog) # analoggerbil assemble qw(GATTACA ATTACA GATT AAGAT CCC) # CCCAAGATTACA # .. or # AAGATTACACCC
Extra Credit: Provide a solution the runs in polynomial time ;)
   MeowChow                                   
               s aamecha.s a..a\u$&owag.print

Replies are listed 'Best First'.
(dws)Re: (Golf) Fragment Reassembly
by dws (Chancellor) on May 02, 2001 at 09:26 UTC
    For my first round of Golf at the Monastery, here's a non-strict, non-polynomial approach.
    sub assemble { ## Begin Golf $w=join'',@_;t("",@_);$w} sub t{my$s=shift;$w=$s if!@_ and length$s<length$w; return if!@_;my@b=(@_,@_); map{t($_,@b[0..$#_-1])}_($s,shift@b)while(@b>@_)} sub _{my($a,$b)=@_;$l=length$a;for(0..$l){$x=substr($a,0,$_).$b; return$x if substr($x,0,$l)eq$a} ## End Golf } print assemble qw(GATTACA ATTACA GATT AAGAT CCC)
    emits AAGATTACACCC That's 248 by my count (less the extra newlines that I threw in to avoid ugly wrapping). There's a 20 character test that prunes the search space back, but not enough (I think) to make it a polynomial solution.

    Update: This is polynomial after all. I was too lazy to work that out at first. Bah.

      A bit of tuning after reading japhy's Golf advice drops the score to 243
      sub assemble { ## Begin Golf $w=join'',@_;t("",@_);$w} sub t{my$s=shift;$w=$s if!@_ and length$s<length$w; return if!@_;my@b=(@_,@_); map{t($_,@b[0..@_-2])}_($s,shift@b)while(@b>@_)} sub _{($a,$b)=@_;$l=length$a;for(0..$l){$x=substr($a,0,$_).$b; last if substr($x,0,$l)eq$a}$x ## End Golf } print assemble qw(GATTACA ATTACA GATT AAGAT CCC)
Re: (Golf) Fragment Reassembly
by jepri (Parson) on May 02, 2001 at 09:28 UTC
    Update: A while ago I saw an absolutely fantastic piece of art on film. It was only a minute or two long. The artist had done some visual thing, but the highlight for me was that he/she had a person reading words that were joined together, similar to what I did to MeowChows post below. It was always just on the edge of understandable. I think it was called 'Mouth Music'. Anyway, it rocked.

    Well, here's code that does some of what you want. What I would really like to see is a mathematical proof of the minimum number and length of sequences to get complete reconstruction of the thing you chopped up in the first place.

    Plus, when you golf, aren't you meant to post your own code for other people to play with? Maybe I should call homework on this :)

    Here are the results:

    logger,gerbil,log,analog, Constructed sequence is analoggerbil GATTACA,ATTACA,GATT,AAGAT,CCC, Constructed sequence is gattacaagatt Fragment,Reassembly,by,MeowChow,... Constructed sequence is byoureassemblesuperstringolfednakasubstringseq +uencing

    And here is the code that did it:

    use strict; sub assemble{ my @sequences = sort {length($b) <=> length($a)} @_; &text_mangle($_) for @sequences; my $a=shift @sequences; my $b=shift @sequences; while ( @sequences ) { my $c=overlap($a,$b); if ( defined($c) ) { $a=$c; #push @sequences,$b; $b=shift @sequences; } else { push @sequences,$b; $b=shift @sequences; } } print "Constructed sequence is $a\n"; }; sub overlap{ my ($a,$b)=@_; my ($i,$l); if ($a=~/ / || $b=~/ / ) { die "Don't use spaces in your data! +\n";} $l=length($a); for ($i=0;$i<$l; $i++ ) { if ( substr($a,$i,$l-$i) eq substr($b,0,$l-$i) ) { substr($a,$i,($l-$i))= $b; return $a; } } $l=length($b); for ($i=0;$i<$l; $i++ ) { if ( lc(substr($b,$i,$l-$i)) eq lc(substr($a,0,$l-$i)) + ) { substr($b,$i,($l-$i))= $a ; return $b; } return undef; }; sub text_mangle { $_[0]=~s/ |\.|\,//g; my $a=lc($_[0]); $_[0]=$a; } sub swap { my $a=$_[0]; $_[0]=$_[1]; $_[1]=$a; } my @array=qw(logger gerbil log analog); foreach (@array) {print; print ",";};print "\n";assemble(@array);print + "\n"; @array=qw(GATTACA ATTACA GATT AAGAT CCC); foreach (@array) {print; print ",";};print "\n";assemble(@array);print + "\n"; @array=qw(Fragment Reassembly by MeowChow ... aka DNA Sequencing or Shortest Common Superstring. Your mission, s +hould you choose to accept it, is to design a golfed sforeach (@array +) {print; print ",";};print "\n";assemble(@array);print "\n";

    ____________________
    Jeremy
    I didn't believe in evil until I dated it.

Re (tilly) 1: (Golf) Fragment Reassembly
by tilly (Archbishop) on May 03, 2001 at 05:21 UTC
    I am surprised that nobody pointed out how this is related to Golf: Embedded In Order and (Golf) Ordered Combinations. From there we can define two helper functions:
    sub c{@r='';@r=map{$c=$_;map$c.$_,@r}@_ for 1..shift;@r} sub i{($t=pop)=~s/./.*\Q$&/gs;pop=~/$t/s}
    which have bodies of 34 and 49 respectively. Plus 14 for the surrounding pieces. So we are at 97 characters. And then it is easy to finish off with
    sub assemble { my$n;{for(c($n++,map{split//}@_)){$v=$_;map{i($v,$_)||next}@_;return$_ +}redo} }
    whose body has 76 characters for 173 characters. (Note that I added 5 characters to allow it to be called twice without retaining state.)

    This is a theoretically correct solution, but be warned that it is not polynomial either in speed or memory requirements. So it isn't a very useful solution.

    In fact it raises questions about what a solution is. This will not run on my machine with either of the original data sets. I do not have such a machine to test on, but I do not believe that even if you try to compile Perl on a 64-bit machine with a very large amount of memory that it will succeed. So while the algorithm is fine on paper, it cannot work on the stated data set.

    Is a correct algorithm that will not finish on practical machines considered a solution?

    My test data is:

    print assemble(qw(oa af wf wa));
    which cheerfully finds "owaf" as its answer.
      Confirming your observation about memory, this runs for about 45 seconds before running out of memory (on a 256Mb box) when run via   print assemble qw(GATTACA ATTACA GATT AAGAT CCC); Good code compression, though.

      I had considered explicitly stating that solutions such as yours, which iterate through all possible strings, would be rated in a seperate class. This makes me wonder, however, if there is a class of optimization problems for which iterating brute-force through the entire solution space is faster (algorithmically) than directly computing a solution.

      You are a bit mistaken in choosing Golf: Embedded In Order, however, since that is not the same thing as a substring:

      print assemble(qw(oa af wf wa)); # owaf - a wrong answer # oafwfwa - a right answer
      If you change that into an index, things work out bettter (and with less code):
      sub c{@r='';@r=map{$c=$_;map$c.$_,@r}@_ for 1..shift;@r} sub assemble { my$n;{for(c($n++,map{split//}@_)){$v=$_;map{1+index$v,$_ or next}@_;re +turn$_}redo} } print assemble(qw(oa af fa afa));
         MeowChow                                   
                     s aamecha.s a..a\u$&owag.print
        Oops, my misreading.

        As for the question, there actually are well-explored areas where the best known algorithms (by various criteria) are found by randomly guessing something with certain characteristics and then testing whether it really was a solution...

Re: (Golf) Fragment Reassembly
by satchboost (Scribe) on May 02, 2001 at 21:01 UTC
    This is a really cool problem! ++, MeowChow.

    sub assemble { $x='';sub a{($p,$q)=@_;return$q if!$p;$l=length$p;for(0..$l){$y= substr($p,0,$_).$q;return$y if substr($y,0,$l)eq$p}}sub b{my$s;$s=a( $s,$_)for@_;$x=$s if!$x||length$s<length$x;} sub c{my($a,$b)=@_;if(@$b){for(0..$#$b){my@c=@$b; c([@$a,splice@c,$_,1],[@c])}}else{b @$a}}c([],[@_]);$x }

    279 by my count. I know I could cut this down further in the recursion, but I'm damned if I can see it right now.

Re: (Golf) Fragment Reassembly
by chromatic (Archbishop) on May 03, 2001 at 05:55 UTC
    255 characters, minus four for newlines. I could cut it down at least to 243 if it doesn't need -w and strict.

    It's not beautiful, but it's much nicer than my first attempt:

    sub assemble{my($f,$s)=splice(@_,0,2);return$f unless$s; $f=(sort{length($a)<=>length($b)}(c($f,$s),c($s,$f)))[0]; assemble($f,@_)} sub c{($_,$b)=@_;return$_ if/$b/;my $r='';while($b and$r=chop($b).$r and!(index($_,$b)>0)){}$_.$r.(length($b)==1?$b:'')}

    Update: I see what the problem is. Luckily, a fix is even shorter. Here's one at 246 characters, minus four newlines. Removing -w and strict would put me around 230:

    sub assemble{my($f,$s)=splice(@_,0,2);return$f unless$s; $f=(sort{length($a)<=>length($b)}(c($f,$s),c($s,$f)))[0]; assemble($f,@_)} sub c{($_,$b)=@_;return$_ if/$b/;my $r='';while($b and$r=chop($b).$r and!(index($_,$b)>0)){}$_.$r.(/$b$/?'':$b)}
      Fails to produce the minimal fragment for the test case   print assemble qw(cxxx xxxa abc)
Re: (Golf) Fragment Reassembly
by MeowChow (Vicar) on May 03, 2001 at 10:55 UTC
    Ok, here's my current solution at 192 chars.
    sub a{$#_?(sort{length$a<=>length$b}map{my$s=$_;my@l=grep{$s ne $_}@_; +map{my$t=$_;my@m=grep{$t ne $_}@l;a(i($s,$t),@m)}@l}@_)[0]:pop} sub i{($_,$t)=@_;chop$t while!s/\Q$t\E$/$_[1]/&&''ne$t;$_;} print a qw(logger gerbil log analog); print "\n"; print a qw(GATTACA ATTACA GATT AAGAT CCC);
    update1: trimmed another 4 chars from the i sub, and stopped clobbering globals:
    sub i{my($s,$t)=@_;chop$t while$s!~s/\Q$t\E$/$_[1]/;$s}
    ... if we're clobbering $_, we can trim another 5:
    sub i{($_,$t)=@_;chop$t while!s/\Q$t\E$/$_[1]/;$_}
    update2: 163 chars, strict and global-clean, all in one sub, named assemble (or 156 if it's named a):
    sub assemble{$#_?(sort{length$a<=>length$b}map{my$s=$_;my@l=grep$s ne$ +_,@_;map{my($t,$u)=($s,$_);my@m=grep$u ne$_,@l;chop$u while$t!~s/\Q$u +\E$/$_/;assemble($t,@m)}@l}@_)[0]:pop}
       MeowChow                                   
                   s aamecha.s a..a\u$&owag.print
      Ok, here's my current solution at 192 chars.

      If we're only counting inside of sub assemble { ... } then I count 185 characters (and it passes my stress tests). Though if you use "assemble", as the problem statement suggests, you're up back up to 192 characters. Still, that's excellent.

Re: (Golf) Fragment Reassembly
by indigo (Scribe) on May 03, 2001 at 04:13 UTC
    92, although I suspect it may not run in polynomial time...:)
    sub assemble {$l=eval(join'+',map{y===c}@_);for$x('a'..'z'x$l,'A'..'Z'x$l){return$x + if!grep{$x!~/$_/}@_}}
      When I try to invoke this via   print assemble qw(GATTACA ATTACA GATT AAGAT CCC); I get   Modification of a read-only value attempted at indigo.pl line 2. If I pass a copy of the array, Perl runs out of memory (PIII-600, 256Mb, ActiveState Perl 5.6.0).

      The map looks suspicious.

        I'd strongly recommend not running this one with strings of length greater than 2...:)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://77183]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (7)
As of 2024-03-28 23:44 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found