Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Fast Identification Of String Difference

by neversaint (Deacon)
on Jan 17, 2011 at 02:38 UTC ( [id://882590]=perlquestion: print w/replies, xml ) Need Help??

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
Given pairs of string like this.
my $s1 = "ACTGGA"; my $s2 = "AGTG-A"; # Note the string can be longer than this. # But they are always of the same length.
I would like to find position and character in in $s1 where it differs with $s2. In this case the answer would be:
#String Position 0-based # First col = Base in S1 # Second col = Base in S2 # Third col = Position in S1 where they differ C G 1 G - 4
I can achieve that easily with substr(). But it is horribly slow. Typically I need to compare millions of such pairs. Is there a fast way to achieve that?

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Fast Identification Of String Difference
by AnomalousMonk (Archbishop) on Jan 17, 2011 at 03:31 UTC

    The 'classic' Perlish approach to this type of problem involves bitwise string boolean operations. The string  $diff generated by the bitwise-xor of characters in original sequence strings can be used to produce masks that can then be used to extract the differing sub-string sequences from the original strings.

    use warnings; use strict; my $s1 = 'ACTGGACGTATGCA'; my $s2 = 'AGTG-ACGC-CGCA'; my $diff = $s1 ^ $s2; my @dpos; push @dpos, [ $-[1], $+[1] - $-[1] ] while $diff =~ m{ ([^\x00]+) }xmsg; print qq{diff at offset $_->[0], length $_->[1] \n} for @dpos; (my $mask = $diff) =~ tr{\x00}{\xff}c; $s1 &= $mask; $s2 &= $mask; my $differences = qr{ [^\x00]+ }xms; @dpos = (); while ($s1 =~ m{ ($differences) }xmsg) { # this code produces same result # my @diff_data = ($1); # $s2 =~ m{ ($differences) }xmsg; # push @diff_data, $1, $-[1]; # push @dpos, \@diff_data; push @dpos, [ $1, do { $s2 =~ m{ ($differences) }xmsg && $1, $-[1] } ] ; } print qq{@$_ \n} for @dpos;

    Output:

    diff at offset 1, length 1 diff at offset 4, length 1 diff at offset 8, length 3 C G 1 G - 4 TAT C-C 8

    See @- and @+ in perlvar, also Bitwise Or and Exclusive Or and Bitwise And in perlop.

    BrowserUk is very good on this general topic.

    Update: Added better code example, doc links. And thanks to ELISHEVA.

    Update: Fixed @- link above. What was I thinking?

      Using a look-ahead with pos can save having to do the sums.

      knoppix@Microknoppix:~$ perl -E ' > $s1 = q{ACTGGACGTATGCA}; > $s2 = q{AGTG-ACGC-CGCA}; > $m = $s1 ^ $s2; > push @pos, pos( $m ) while $m =~ m{(?=([^\x00]))}g; > $m =~ tr{[\x01-\xfe]}{\xff}; > $c1 = $s1 & $m; > @p1 = $c1 =~ m{([^\x00])}g; > $c2 = $s2 & $m; > @p2 = $c2 =~ m{([^\x00])}g; > printf qq{%-2s%-2s%6d\n}, $p1[ $_ ], $p2[ $_ ], $pos[ $_ ] > for 0 .. $#pos;' C G 1 G - 4 T C 8 A - 9 T C 10 knoppix@Microknoppix:~$

      I hope this is of interest.

      Cheers,

      JohnGG

Re: Fast Identification Of String Difference
by BrowserUk (Patriarch) on Jan 17, 2011 at 09:12 UTC
    I would like to find position and character in in $s1 where it differs with $s2. In this case the answer would be:

    What are you going to do with the information once you have it?

    I mean in the program that is discovering it. Ie. Are you going to write it to disk or utilise it within the program for some further processing?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Dear BrowserUK,
      Store in memory for further processing.

      ---
      neversaint and everlastingly indebted.......

        Could you give us more information about what processing you are going to do?

        For example:

        • Will you use each triplet of data (c1,c2,p) in isolation?
        • Or do you need all the triplets from a given pair of strings all together?
        • Or do you need the triplets from different pairs of strings together?

        The reason for these questions is that whether done in Perl or C allocating the memory in which to build the list of positions is a substantial part of the overall cost. If you only need each position in isolation, then an iterator interface might be more efficient to use.

        Equally, are the pairs of strings you are comparing the same length? Or are you comparing short strings with (every?) substring of a large strings? Are you comparing many short strings against (every) substring of larger strings?

        The problem is that the basic mechanics of comparing the characters in two string is very fast. Especially in C. But the details of the code that surrounds that can have a big impact on the overall application time.

        Rather than an extended to'n'fro of questions, it would be easier if you posted code or pseudo-code of the actual application, along with numbers and sizes of the strings involved.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fast Identification Of String Difference
by ww (Archbishop) on Jan 17, 2011 at 03:23 UTC

    How are you using substr()? What is the actual value of "horribly slow?"

    Our replies -- by preference -- would address your request for a faster way to evaluate your data with an assessment of the existing code and some profiling data.

Re: Fast Identification Of String Difference
by Limbic~Region (Chancellor) on Jan 17, 2011 at 14:41 UTC
    neversaint,
    Since you appear to have nucleic acid strings, perhaps you could leverage the minimal alphabet used to pre-compute differences. Assuming each position can only be represented by 5 characters (A, C, T, G and -) then taking 4 characters at a time, there can only be 625 possible strings. Comparing two such sub strings means there are only 390,625 possible pairs - surely small enough to fit into memory.

    I do realize that you have to keep track of which "quad" you are comparing so that your position comes out right but this is hardly significant. Even comparing 5 characters at a time is less than 10 million parings so depending on your memory constraints you may not be limited to quads. Of course, if your strings can have more than 5 characters at each position you will have to do fewer at a time.

    If you need an implementation of this, let me know but I think it should be fairly obvious.

    Cheers - L~R

Re: Fast Identification Of String Difference
by chrestomanci (Priest) on Jan 17, 2011 at 10:35 UTC

    Considering how simple your problem is, and the fact that you want maximum speed, have you considered re-writing in C, or another fast compiled language.

    In the past when I was working on image processing problems, I often saw a program run take an hour or more in perl, but only take a minute or so when re-coded in C. In my case, the typical problem was to take some bitmap images representing 1080p frames of uncompressed video, and convert them from RGB to YUV or suchlike, and them output them as ASCII text listing pixel values.

Re: Fast Identification Of String Difference
by Marshall (Canon) on Jan 17, 2011 at 17:13 UTC
    I looked for a Perl module that would do this with 'C' XS code and couldn't find one. I found one module with the right functionality, but alas it was pure Perl and not very fast. That surprised me because I've seen similar buffer comparison questions before.

    The 'C' code in a C program for something like this would be very fast. I haven't written an XS module myself yet and I don't know how efficient the transfer of information is between Perl and C. The hard part isn't the C code, it is how to interface it with Perl.

    As far as assembly language goes, for comparing byte buffers together, in Intel x86 assembly, there is a single machine instruction that can do this. That would be as fast as it could be done.

    If anybody writes an XS module to do this, I would like to see it and would like to help with it if desired.

Re: Fast Identification Of String Difference
by FunkyMonk (Chancellor) on Jan 18, 2011 at 01:41 UTC
    the string can be longer than this
    How long? The string length makes a huge difference in runtime.

    But it is horribly slow
    Define slow. Slow to some people might be really quick to others.
    Typically I need to compare millions of such pairs
    How many millions? 1, 10, 100, 1,000? It all makes a difference.

    I think you need to show your code and some realistic data. I have basic substr-based code that will process your example data 100k+ times a second.

    That, to me, is fast

Re: Fast Identification Of String Difference
by educated_foo (Vicar) on Jan 18, 2011 at 03:09 UTC
    I was curious, so...
    use Inline C; sub cmp_perl { my ($a, $b) = @_; my $c = $a ^ $b; my @ret; push @ret, pos($c) while $c =~ /[^\0]/g; @ret; } use Benchmark ':all'; my ($x, $y); $y = $x = 'A' x 1e6; substr($y, $_*100, 1) = 'C' for 1..(1e4-1); timethese(0, { perl => sub { cmp_perl($x, $y); }, C => sub { cmp_c($x, $y); }, }); __END__ __C__ void cmp_c(char *a, char *b) { int i, n; Inline_Stack_Vars; Inline_Stack_Reset; for (i=0, n=0; *a && *b; a++, b++, i++) if (*a != *b) mXPUSHi(i), n++; Inline_Stack_Return(n); }
    I was disappointed that the Perl version was about 8x slower:
    Benchmark: running C, perl for at least 3 CPU seconds... C: 3 wallclock secs ( 3.17 usr + 0.00 sys = 3.17 CPU) @ 37 +9.18/s (n=1202) perl: 3 wallclock secs ( 3.17 usr + 0.00 sys = 3.17 CPU) @ 43 +.22/s (n=137)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://882590]
Approved by broomduster
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (5)
As of 2024-03-29 12:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found