Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Progressive pattern matching

by Anonymous Monk
on Oct 14, 2001 at 04:43 UTC ( [id://118718]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am a seeker of perl wisdom...have worked on this for days too no avail.

I would like to take an input string e.g. "abcde" (more specifically it is like (a|b)(b|c)(c|d|e)) start at the first position "a" or (a|b) and search through another string of letters. If it finds "a", I would like the program to continue to "b" and then search with "ab". Again, if it finds the string "ab", next to try "abc" etc. until it can no longer extend the string and find a resulting match. i.e. it will find "abcd" then add onto it "e"...not find "abcde" and so print out just "abcd". I deperately hope this makes sense. Here is a snippet of my actual program to give you a basic idea of what I have done thus far...
print "Name of file containing various random strings?\n"; chomp ($motif=STDIN [omitting <>]); open (MOTIF, "$motif") || die "$!"; print "String to search with?\n"; chomp ($blocks=STDIN ); while (MOTIF) { @motif = MOTIF}; foreach (@motif) {
not sure what to do at this point??

Any help is greatly appreciated,

New'b'ish

dr_jgbn

Replies are listed 'Best First'.
Re: Progressive pattern matching
by Masem (Monsignor) on Oct 14, 2001 at 05:41 UTC
    Here's a different approach; find the longest left substring of a string that is in another string.
    my $test = "asdflkjeroiuasdflkjabcdeawerij2lkjasdf"; my $string = join '', (a..z); my $match = 0; my $string = "$string "; # add one junk char... do { $string = substr( $string, 0, -2 ); $match = $test =~ $string; } while ( !$match || !$string ); print $match ? "success with $string!\n" : "failure\n";

    -----------------------------------------------------
    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
    It's not what you know, but knowing how to find it if you don't know that's important

      Sorry guys...thanks for your code...I am still trying to see if it is applicable...done some initial testing with it and I think that I might not have explained myself well enough.
      In one of my user prompts I ask for a string (it is actually a short protein sequence that they can put in)...lets say they input 20 letters (they represent the proteins sequence). I want to take these 20 letters and initiate a search into the file of their choice...
      What I would like to happen is for the program to take the first letter (1/20) and search with it...if this letter is found in their file, the program will add onto the first letter, the second letter (2/20), and now search with both 1&2/20. If these two letters are found in the file in succession then the third letter will be added to the search string and so on. What may happen in reality is that it won't be until letter 8/20 before there is an initial match...letters 9, 10, 11, 12 when added one at a time will eventually make a search string of 8,9,10,11,12/20. Letter 13 will not match and then the program must end.
      Here is my latest modification...
      while (<MOTIF>) {@motif = <MOTIF>}; for ($i=0;$i<=length($blocks);$i++) { $pattern = substr($blocks, 0, $i); if (/$pattern/) {push(@array, $&);
      not sure...
      Again, thank-you for your help,
      dr_jgbn

        To clarify what I think that you want, let me construct some examples :

        Input string :
        GATTACA
        File :
        ATTACGATTACAAA
        GATT
        ZZGATTZZ
        asdghckasdlkj
        TTACA
        Output :

        On line 1 : GATTACA,GATTAC,ATTACA,TTACA,GATTA,ATTAC,TTAC,TACA, GATT,ATTA,TTA,TAC,GAT,ATT,ACA,TT,TA,GA,CA,AT,AC,T,G,C,A On line 2 : GATT,GAT,ATT,TT,GA,AT,T,G,A On line 3 : GATTA,GATT,ATTA,TTA,GAT,ATT,TT,TA,GA,AT,T,G,A On line 4 : GATTACA,GATTAC,ATTACA,TTACA,GATTA,ATTAC,TTAC,TACA, GATT,ATTA,TTA,TAC, GAT,ATT,ACA,TT,TA,GA,CA,AT,AC,T,G,C,A

        To achieve this, you want to find the longest substring of the input string that is found on a line of the file, for the various substrings that match until the end of the last character of the search string. To show you a first approach which is surely suboptimal, look at the following code which tries a brute force approach :

        use strict; my $searchString = "GATTACA"; my %subStrings = {}; my @subStrings = (); sub populate { # Fills the hash subStrings with all "allowed" substrings # of the argument. Duplicates are avoided by # filling a hash instead of an array. my ($string) = @_; return if $string eq ""; my $line = ""; foreach (split "", $string) { $line .= $_; #print "Added $line\n"; $subStrings{$line} = "1"; }; populate( substr( $string, 1 )); }; populate( $searchString ); # We are only interested in the keys of our hash, # longest matches first : @subStrings = reverse sort { length($a) <=> length($b) # Sort by string length || $a cmp $b # and then by string content } keys %subStrings; # We read the file line by line : my ( $line, $substring ); while ($line = <DATA>) { my @MatchedSubstrings = (); foreach $substring (@subStrings) { if ($line =~ /$substring/) { push @MatchedSubstrings, $substring; }; }; if ($#MatchedSubstrings != -1) { print "On line $. : ", join(",", @MatchedSubstrings ),"\n"; }; }; __DATA__ AGATTACAAA ZZGATTZZ GATTAZZ GATGATTACAZZ asdfgh gattaca

        Note that there already are many Perl modules for Bioinformatics, a search of the CPAN (http://www.cpan.org) should give you interesting results, as should a Google search for Perl and DNA I guess.

        perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
        Here's my solution to what I understood you wanted (using your definitions of $blocks and @motifs:

        my @results; while ($blocks) { for (my $len = 1;$len <= length $blocks;$len++) { my $search = substr ($blocks, 0, $len); push @results, grep (/$search/, @motif); } #if there is a match, we're done last if @results; #mo match at starting position - try from next pos $blocks = substr ($blocks, 1); }
        Does that do what you wanted, or did I misunderstand you?

        pike

Re: Progressive pattern matching
by tommyw (Hermit) on Oct 14, 2001 at 04:59 UTC

    If I'm understanding this right, try:   print "$1\n" if /(a?b?c?d?e?)/;

    For your more specific example:   print "$1\n" if /((?:a|b)?(?:b|c)?(?:c|d|e)?)/; or, more simply,   print "$&\n" if /(a|b)?(b|c)?(c|d|e)?/;, provided you can stand the dreaded $&

    These return the first occurance of a matching string. To find the longest string, I'd add a /g modifier, and sort the results according to length. Don't see how to make that work with $& though?

    Update:It was late, and I'd been down the pub. Not a good time to write code... I missed all the the ? counts out of the regexps after the first, simplest one. My bad.

    Worse, I've just realised that the simple version will, of course, match ace as well as any of the prefixes from abcde. Which is not what was wanted. My very bad :-(

Re: Progressive pattern matching
by chipmunk (Parson) on Oct 14, 2001 at 19:14 UTC
    Here's one way to do it, given an input string such as 'abcde'. The program tries matching the one-character substring, then the two-character substring, and so on until it fails to match or the entire input string has been matched.
    #!/usr/local/bin/perl -w use strict; print "Name of file containing various random strings?\n"; chomp (my $file = <STDIN>); open (MOTIF, "$file") || die "$!"; print "String to search with?\n"; chomp (my $blocks = <STDIN>); my $motif; { local $/; $motif = <MOTIF>; } my $i = 1; my $re = substr($blocks, 0, $i); while ($i <= length $blocks and $motif =~ /\Q$re/) { $re = substr($blocks, 0, ++$i); } chop $re if $i <= length $blocks; print "$re\n";
    To handle an input string like '(a|b)(b|c)(c|d|e)', you could use a similar approach, perhaps storing the possible characters for each position in an array of arrays, and iterating over each sub-array to construct the regexes.
Re: Progressive pattern matching
by tfrayner (Curate) on Oct 17, 2001 at 14:11 UTC
    Okay, I think understand now. I also think that the following code should do what you want. It takes a sequence ($seq) and a motif (@motif, which may be degenerate) and finds all the matches of greater than 3 residues.

    Again, this may be a suboptimal solution and I can't shake the nagging feeling that there's a simpler way.

    Of course, that describes all my perl experiance to date :-)

    Apologies to all and sundry for once again descending into biological jargon. It's the only way I can get my head round this stuff...

    #!/usr/bin/perl -w use strict; use warnings; my $seq="APKLGIYSPRIGLYHFHKLDTPRLGAKLJHHDGFYSDA"; my @motif=("ST","P","RK","ILVF","G","ILVFM","Y"); # set up motif array of arrays my @motifarray; for (my $e=0;$e<=$#motif;$e++){ my @elementarray= split (/ */, $motif[$e]); $motifarray[$e]=\@elementarray; } my $mstartpos = 0; # starting point within motif my $success = 0; # cycle through starting motif residues ("ST","P" etc.) MOTIFRES: while ($mstartpos+1 < $#motif){ # find all matches for a given starting motif residue my $test=$seq; my $lastmatchpos=0; while ($lastmatchpos < length($seq)){ my $found=''; # deal with the first 3 residue matches as a special case my @r0=@{$motifarray[$mstartpos]}; my @r1=@{$motifarray[$mstartpos+1]}; my @r2=@{$motifarray[$mstartpos+2]}; if ($test=~ /([@r0])(?=[@r1][@r2])/gc){ $found = $1; $lastmatchpos=pos($test); } # next motif starting residue if no further matches found unless ($found){ $mstartpos++; next MOTIFRES; } # get all the other residues in the motif for (my $e=$mstartpos+1;$e<=$#motifarray;$e++){ my @rn=@{$motifarray[$e]}; if ($test=~ /\G([@rn])/gc){ $found .= $1; } } # print out what we've got so far $success++; print ("$found at $lastmatchpos\n"); } # repeat, using the next motif residue as the new starting point $mstartpos++; } die ("No matches found.\n") unless ($success); print ("Total number of matches (nested or otherwise): $success\n");
    Have fun,
    Tim

    Update: Minor bugfix; also removed a couple of superfluous and misconceived lines to tidy it up a bit.

      Ugh, its too late to finish my golf game....
      warning: not necessarily portable since it uses 'glob'.
      #!/usr/bin/perl -w use strict; my $seq="APKLGIYSPRIGLYHFHKLDTPRLGAKLJHHDGFYSDA"; my @motif=("ST","P","RK","ILVF","G","ILVFM","Y"); my @a = my @b = my @c = @motif; my %m; while(my$r=!$|++&&\@a||pop@a&&\@a||shift@b&&\@b){for(glob('{'. join('',map{'{'.(join',',split(//)).'}'}@$r).'}')){length($_)> 2&&$seq=~/$_/&&$m{$_}++;}}for(sort{length($b)<=>length($a)}keys %m){print"$_ at ",index($seq,$_)+1,"\n"} =OUTPUT SPRIGLY at 8 PRIGLY at 9 PKLGIY at 2 SPRIGL at 8 KLGIY at 3 RIGLY at 10 SPRIG at 8 TPRLG at 21 TPRL at 21 IGLY at 11 LGIY at 4 SPRI at 8 GFY at 33 GIY at 5 TPR at 21 GLY at 12 SPR at 8

      -Blake

        Heh - nice.

        I knew there must be a shorter solution. I won't comment on whether it's simpler, but I know which I think is more readable :-). However, I note that your solution is in fact technically a little more complete than mine. My script misses substrings that aren't at the end of matches (i.e. will match GLY in SPRIGLY but not SPR). Neither script matches PRI, RIG or IGL in the above.

        Although why you might want to do this (save for the sake of programming elegance) I'm not sure :-P

        About my only contribution (since it's going to take me a little while to fully comprehend the golf) is that the @c array appears to be dispensable.

        Tim

Re: Progressive pattern matching
by tfrayner (Curate) on Oct 15, 2001 at 18:52 UTC
    This problem happens to be of interest to me as well. I think the following code does what you're getting at. It's a little crude, but the best I can do at this point is:
    #!/usr/bin/perl use strict; use warnings; my $seq="ASPTFHKLDTPRLAKLJHHDFSDA"; my @pattern=("ST","P","RK","ILVF"); # array of refs to arrays of redundant # residues within the pattern my @patternarray; for (my $e=0;$e<=$#pattern;$e++){ my @elementarray= split (/ */, $pattern[$e]); $patternarray[$e]=\@elementarray; } my $found; my $lastmatchpos; LOOP: until ($found){ # deal with the first residue match as a special case my @resarray=@{$patternarray[0]}; $seq=~ /([@resarray])/gc; die("Sequence does not contain requested motif.\n") unless $1; $found = $1; $lastmatchpos=pos($seq); # all the other residues in the pattern for (my $e=1;$e<=$#patternarray;$e++){ my @resarray=@{$patternarray[$e]}; if ($seq=~ /\G([@resarray])/gc){ $found .= $1; }else{ #reset matching algorithm my $newmatchpos=$lastmatchpos+1; last if ($newmatchpos > length($seq)); pos($newmatchpos); $found=''; next LOOP; } } } print ("$found at $lastmatchpos\n");

    This only matches the first occurrance of a motif in a given sequence. It should be possible to extend this to return all the matches with a little work. For use of the m/\G.../gc idiom, see perldoc:perlop.

    Hope this helps,
    Tim

    Update: Sorry, I just re-read the original request and one of the nuances escaped me. To get the script to just print out the most it can match after having matched the initial residue, I think you can just change the else clause to:

    }else{ next LOOP; }

    Update to the update: To deal with the case where the first x residues in the motif don't match the target sequence, I think you should be able to do something like wrapping the LOOP block in another for loop to iterate over motif residues while looking for an initial match.

    Hmmm. I'm still not sure I've quite got what your're looking for. At what point do you call a match significant? I.e. do you want target sequences matching only 4 motif residues or more, for example? Or will just a single matching residue do (which I doubt, but which is of course the easiest case)?

      Thanks guys for all the code...it will take some time go through it all...being a bit of a newbie and all
      The answer to your last question tfrayner is that I would like to have a minimum of 3 residues that match...hopefully more of course

      I was avoiding these more biological type details so that they would not confuse the issue for some.

      You basically have it right...imagine the user desiring to take his/her "motif" (lets say 10 a.a.) and searching it against one or more protein sequences. The program must match any part, and as much of, the initial input as possible to the protein sequence(s). All cases where there is any type of a match must be printed out.
      Hope this clarifies more of what I am trying to do. I notice that you are a post doc...cool...I am part of a research team at the Clinical Genomics Centre in Toronto, Canada.

      Thanks again for your help, it is greatly appreciated!
      dr_jgbn

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://118718]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (3)
As of 2024-04-25 23:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found