Progressive pattern matching

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Progressive pattern matching by Masem (Monsignor) on Oct 14, 2001 at 05:41 UTC
Here's a different approach; find the longest left substring of a string that is in another string. `my $test = "asdflkjeroiuasdflkjabcdeawerij2lkjasdf"; my $string = join '', (a..z); my $match = 0; my $string = "$string "; # add one junk char... do { $string = substr( $string, 0, -2 ); $match = $test =~ $string; } while ( !$match \|\| !$string ); print $match ? "success with $string!\n" : "failure\n";` [download] ----------------------------------------------------- Dr. Michael K. Neylon - mneylon-pm@masemware.com \|\| "You've left the lens cap of your mind on again, Pinky" - The Brain It's not what you know, but knowing how to find it if you don't know that's important	[reply] [d/l]
Re: Re: Progressive pattern matching by Anonymous Monk on Oct 14, 2001 at 08:26 UTC
Sorry guys...thanks for your code...I am still trying to see if it is applicable...done some initial testing with it and I think that I might not have explained myself well enough. In one of my user prompts I ask for a string (it is actually a short protein sequence that they can put in)...lets say they input 20 letters (they represent the proteins sequence). I want to take these 20 letters and initiate a search into the file of their choice... What I would like to happen is for the program to take the first letter (1/20) and search with it...if this letter is found in their file, the program will add onto the first letter, the second letter (2/20), and now search with both 1&2/20. If these two letters are found in the file in succession then the third letter will be added to the search string and so on. What may happen in reality is that it won't be until letter 8/20 before there is an initial match...letters 9, 10, 11, 12 when added one at a time will eventually make a search string of 8,9,10,11,12/20. Letter 13 will not match and then the program must end. Here is my latest modification... `while (<MOTIF>) {@motif = <MOTIF>}; for ($i=0;$i<=length($blocks);$i++) { $pattern = substr($blocks, 0, $i); if (/$pattern/) {push(@array, $&);` [download] not sure... Again, thank-you for your help, dr_jgbn	[reply] [d/l]
Re: Re: Re: Progressive pattern matching by Corion (Patriarch) on Oct 14, 2001 at 12:12 UTC
To clarify what I think that you want, let me construct some examples : Input string : GATTACA File : `ATTACGATTACAAA GATT ZZGATTZZ asdghckasdlkj TTACA` Output : `On line 1 : GATTACA,GATTAC,ATTACA,TTACA,GATTA,ATTAC,TTAC,TACA, GATT,ATTA,TTA,TAC,GAT,ATT,ACA,TT,TA,GA,CA,AT,AC,T,G,C,A On line 2 : GATT,GAT,ATT,TT,GA,AT,T,G,A On line 3 : GATTA,GATT,ATTA,TTA,GAT,ATT,TT,TA,GA,AT,T,G,A On line 4 : GATTACA,GATTAC,ATTACA,TTACA,GATTA,ATTAC,TTAC,TACA, GATT,ATTA,TTA,TAC, GAT,ATT,ACA,TT,TA,GA,CA,AT,AC,T,G,C,A` [download] To achieve this, you want to find the longest substring of the input string that is found on a line of the file, for the various substrings that match until the end of the last character of the search string. To show you a first approach which is surely suboptimal, look at the following code which tries a brute force approach : use strict; my $searchString = "GATTACA"; my %subStrings = {}; my @subStrings = (); sub populate { # Fills the hash subStrings with all "allowed" substrings # of the argument. Duplicates are avoided by # filling a hash instead of an array. my ($string) = @_; return if $string eq ""; my $line = ""; foreach (split "", $string) { $line .= $_; #print "Added $line\n"; $subStrings{$line} = "1"; }; populate( substr( $string, 1 )); }; populate( $searchString ); # We are only interested in the keys of our hash, # longest matches first : @subStrings = reverse sort { length($a) <=> length($b) # Sort by string length \|\| $a cmp $b # and then by string content } keys %subStrings; # We read the file line by line : my ( $line, $substring ); while ($line = <DATA>) { my @MatchedSubstrings = (); foreach $substring (@subStrings) { if ($line =~ /$substring/) { push @MatchedSubstrings, $substring; }; }; if ($#MatchedSubstrings != -1) { print "On line $. : ", join(",", @MatchedSubstrings ),"\n"; }; }; __DATA__ AGATTACAAA ZZGATTZZ GATTAZZ GATGATTACAZZ asdfgh gattaca [download] Note that there already are many Perl modules for Bioinformatics, a search of the CPAN (http://www.cpan.org) should give you interesting results, as should a Google search for Perl and DNA I guess. `perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web` [download]	[reply] [d/l] [select]
Re: Re: Re: Re: Progressive pattern matching by tfrayner (Curate) on Oct 15, 2001 at 19:35 UTC
Re: Re: Re: Progressive pattern matching by pike (Monk) on Oct 15, 2001 at 13:42 UTC
Here's my solution to what I understood you wanted (using your definitions of `$blocks` and `@motifs`: `my @results; while ($blocks) { for (my $len = 1;$len <= length $blocks;$len++) { my $search = substr ($blocks, 0, $len); push @results, grep (/$search/, @motif); } #if there is a match, we're done last if @results; #mo match at starting position - try from next pos $blocks = substr ($blocks, 1); }` [download] Does that do what you wanted, or did I misunderstand you? pike	[reply] [d/l] [select]
Re: Progressive pattern matching by tommyw (Hermit) on Oct 14, 2001 at 04:59 UTC
If I'm understanding this right, try: `print "$1\n" if /(a?b?c?d?e?)/;` For your more specific example: `print "$1\n" if /((?:a\|b)?(?:b\|c)?(?:c\|d\|e)?)/;` or, more simply, `print "$&\n" if /(a\|b)?(b\|c)?(c\|d\|e)?/;`, provided you can stand the dreaded $& These return the first occurance of a matching string. To find the longest string, I'd add a /g modifier, and sort the results according to length. Don't see how to make that work with $& though? Update:It was late, and I'd been down the pub. Not a good time to write code... I missed all the the ? counts out of the regexps after the first, simplest one. My bad. Worse, I've just realised that the simple version will, of course, match ace as well as any of the prefixes from abcde. Which is not what was wanted. My very bad :-(	[reply] [d/l] [select]
Re: Progressive pattern matching by chipmunk (Parson) on Oct 14, 2001 at 19:14 UTC
Here's one way to do it, given an input string such as 'abcde'. The program tries matching the one-character substring, then the two-character substring, and so on until it fails to match or the entire input string has been matched. `#!/usr/local/bin/perl -w use strict; print "Name of file containing various random strings?\n"; chomp (my $file = <STDIN>); open (MOTIF, "$file") \|\| die "$!"; print "String to search with?\n"; chomp (my $blocks = <STDIN>); my $motif; { local $/; $motif = <MOTIF>; } my $i = 1; my $re = substr($blocks, 0, $i); while ($i <= length $blocks and $motif =~ /\Q$re/) { $re = substr($blocks, 0, ++$i); } chop $re if $i <= length $blocks; print "$re\n";` [download] To handle an input string like '(a\|b)(b\|c)(c\|d\|e)', you could use a similar approach, perhaps storing the possible characters for each position in an array of arrays, and iterating over each sub-array to construct the regexes.	[reply] [d/l]
Re: Progressive pattern matching by tfrayner (Curate) on Oct 17, 2001 at 14:11 UTC
Okay, I think understand now. I also think that the following code should do what you want. It takes a sequence ($seq) and a motif (@motif, which may be degenerate) and finds all the matches of greater than 3 residues. Again, this may be a suboptimal solution and I can't shake the nagging feeling that there's a simpler way. Of course, that describes all my perl experiance to date :-) Apologies to all and sundry for once again descending into biological jargon. It's the only way I can get my head round this stuff... #!/usr/bin/perl -w use strict; use warnings; my $seq="APKLGIYSPRIGLYHFHKLDTPRLGAKLJHHDGFYSDA"; my @motif=("ST","P","RK","ILVF","G","ILVFM","Y"); # set up motif array of arrays my @motifarray; for (my $e=0;$e<=$#motif;$e++){ my @elementarray= split (/ /, $motif[$e]); $motifarray[$e]=\@elementarray; } my $mstartpos = 0; # starting point within motif my $success = 0; # cycle through starting motif residues ("ST","P" etc.) MOTIFRES: while ($mstartpos+1 < $#motif){ # find all matches for a given starting motif residue my $test=$seq; my $lastmatchpos=0; while ($lastmatchpos < length($seq)){ my $found=''; # deal with the first 3 residue matches as a special case my @r0=@{$motifarray[$mstartpos]}; my @r1=@{$motifarray[$mstartpos+1]}; my @r2=@{$motifarray[$mstartpos+2]}; if ($test=~ /([@r0])(?=[@r1][@r2])/gc){ $found = $1; $lastmatchpos=pos($test); } # next motif starting residue if no further matches found unless ($found){ $mstartpos++; next MOTIFRES; } # get all the other residues in the motif for (my $e=$mstartpos+1;$e<=$#motifarray;$e++){ my @rn=@{$motifarray[$e]}; if ($test=~ /\G([@rn])/gc){ $found .= $1; } } # print out what we've got so far $success++; print ("$found at $lastmatchpos\n"); } # repeat, using the next motif residue as the new starting point $mstartpos++; } die ("No matches found.\n") unless ($success); print ("Total number of matches (nested or otherwise): $success\n"); [download] Have fun, Tim Update:* Minor bugfix; also removed a couple of superfluous and misconceived lines to tidy it up a bit.	[reply] [d/l]
Re: Re: Progressive pattern matching by blakem (Monsignor) on Oct 17, 2001 at 15:33 UTC
Ugh, its too late to finish my golf game.... warning: not necessarily portable since it uses 'glob'. #!/usr/bin/perl -w use strict; my $seq="APKLGIYSPRIGLYHFHKLDTPRLGAKLJHHDGFYSDA"; my @motif=("ST","P","RK","ILVF","G","ILVFM","Y"); my @a = my @b = my @c = @motif; my %m; while(my$r=!$\|++&&\@a\|\|pop@a&&\@a\|\|shift@b&&\@b){for(glob('{'. join('',map{'{'.(join',',split(//)).'}'}@$r).'}')){length($_)> 2&&$seq=~/$_/&&$m{$_}++;}}for(sort{length($b)<=>length($a)}keys %m){print"$_ at ",index($seq,$_)+1,"\n"} =OUTPUT SPRIGLY at 8 PRIGLY at 9 PKLGIY at 2 SPRIGL at 8 KLGIY at 3 RIGLY at 10 SPRIG at 8 TPRLG at 21 TPRL at 21 IGLY at 11 LGIY at 4 SPRI at 8 GFY at 33 GIY at 5 TPR at 21 GLY at 12 SPR at 8 [download] -Blake	[reply] [d/l]
Re: Re: Re: Progressive pattern matching by tfrayner (Curate) on Oct 17, 2001 at 17:25 UTC
Heh - nice. I knew there must be a shorter solution. I won't comment on whether it's simpler, but I know which I think is more readable :-). However, I note that your solution is in fact technically a little more complete than mine. My script misses substrings that aren't at the end of matches (i.e. will match GLY in SPRIGLY but not SPR). Neither script matches PRI, RIG or IGL in the above. Although why you might want to do this (save for the sake of programming elegance) I'm not sure :-P About my only contribution (since it's going to take me a little while to fully comprehend the golf) is that the `@c` array appears to be dispensable. Tim	[reply] [d/l]
Re: Re: Re: Re: Progressive pattern matching by blakem (Monsignor) on Oct 18, 2001 at 02:40 UTC
Re: Progressive pattern matching by tfrayner (Curate) on Oct 15, 2001 at 18:52 UTC
This problem happens to be of interest to me as well. I think the following code does what you're getting at. It's a little crude, but the best I can do at this point is: #!/usr/bin/perl use strict; use warnings; my $seq="ASPTFHKLDTPRLAKLJHHDFSDA"; my @pattern=("ST","P","RK","ILVF"); # array of refs to arrays of redundant # residues within the pattern my @patternarray; for (my $e=0;$e<=$#pattern;$e++){ my @elementarray= split (/ /, $pattern[$e]); $patternarray[$e]=\@elementarray; } my $found; my $lastmatchpos; LOOP: until ($found){ # deal with the first residue match as a special case my @resarray=@{$patternarray[0]}; $seq=~ /([@resarray])/gc; die("Sequence does not contain requested motif.\n") unless $1; $found = $1; $lastmatchpos=pos($seq); # all the other residues in the pattern for (my $e=1;$e<=$#patternarray;$e++){ my @resarray=@{$patternarray[$e]}; if ($seq=~ /\G([@resarray])/gc){ $found .= $1; }else{ #reset matching algorithm my $newmatchpos=$lastmatchpos+1; last if ($newmatchpos > length($seq)); pos($newmatchpos); $found=''; next LOOP; } } } print ("$found at $lastmatchpos\n"); [download] This only matches the first occurrance of a motif in a given sequence. It should be possible to extend this to return all the matches with a little work. For use of the m/\G.../gc idiom, see perldoc:perlop. Hope this helps, Tim Update:* Sorry, I just re-read the original request and one of the nuances escaped me. To get the script to just print out the most it can match after having matched the initial residue, I think you can just change the else clause to: `}else{ next LOOP; }` [download] Update to the update: To deal with the case where the first x residues in the motif don't match the target sequence, I think you should be able to do something like wrapping the `LOOP` block in another `for` loop to iterate over motif residues while looking for an initial match. Hmmm. I'm still not sure I've quite got what your're looking for. At what point do you call a match significant? I.e. do you want target sequences matching only 4 motif residues or more, for example? Or will just a single matching residue do (which I doubt, but which is of course the easiest case)?	[reply] [d/l] [select]
Re: Re: Progressive pattern matching by Anonymous Monk on Oct 16, 2001 at 09:12 UTC
Thanks guys for all the code...it will take some time go through it all...being a bit of a newbie and all The answer to your last question tfrayner is that I would like to have a minimum of 3 residues that match...hopefully more of course I was avoiding these more biological type details so that they would not confuse the issue for some. You basically have it right...imagine the user desiring to take his/her "motif" (lets say 10 a.a.) and searching it against one or more protein sequences. The program must match any part, and as much of, the initial input as possible to the protein sequence(s). All cases where there is any type of a match must be printed out. Hope this clarifies more of what I am trying to do. I notice that you are a post doc...cool...I am part of a research team at the Clinical Genomics Centre in Toronto, Canada. Thanks again for your help, it is greatly appreciated! dr_jgbn	[reply]


Perl-Sensitive Sunglasses
	PerlMonks