Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re: Optimizing a string processing sub

by dragonchild (Archbishop)
on Jan 08, 2003 at 19:49 UTC ( [id://225339]=note: print w/replies, xml ) Need Help??


in reply to Optimizing a string processing sub

Try something like this:
sub score { my ($x, $y) = @_; return $words_equal if $x eq $y; return 0 unless $x && $y; @{[$x =~ /[$y]/g]}; }
Basically, I'm treating $y as a character class to match on $x.

Now - there is a major assumption with this code - both words don't have characters twice. Thus, 'perl' and 'temp' is ok. But, 'perl' and 'etemp' isn't and neither is 'eperl' and 'temp'. Depending on which is first, the answer will be different. (The reason is that the characters in $y are treated once, but the characters in $x are treated as many times as it appears.)

I justify this cause you don't specify how to deal with multiple characters. If you are looking for just the number of unique characters they share, regardless of how many times it appears in either word, then do something like:

sub score { # Same as above, until: my $cnt1 = @{[$x =~ /[$y]/g]}; my $cnt2 = @{[$y =~ /[$x]/g]}; $cnt1 < $cnt2 ? $cnt1 : $cnt2; }

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Replies are listed 'Best First'.
Re: Re: Optimizing a string processing sub
by MarkM (Curate) on Jan 08, 2003 at 22:10 UTC

    Unfortunately (unfortunate because it is clever and short) -- I believe your solution is flawed.

    Using the words 'aabcc' and 'abbbc', your code suggests that the words have 5 characters in common. The original poster was vague about the requirements, however, I suspect that if both words have 5 characters, a declaration that the words have 5 characters in common, with different words, may be unexpected.

    With this in mind, I suggest that your first example is correct if the goal is to determine the sum of the number of characters that is in each word that are also in the other word (probably not what the original posted wanted to see), and that the second does not do what you expect.

    Good luck tweaking your solution... :-)

      I suspect that if both words have 5 characters, a declaration that the words have 5 characters in common, with different words, may be unexpected.

      The case where the words are equal is a special one in the original code and results in $words_equal being returned. Given that fact, I think that 5 would be expected.

      -sauoq
      "My two cents aren't worth a dime.";
      

        I guess that all depends on what the goal of the scoring is (it hasn't been stated by the original poster yet... :-) ).

        I assume that the goal has something to do with determining how 'close' two words are to each other. "aabcc" and "abbbc" are "5 close" according to the algorithms described. "aabcc" and "abc" are also "5 close". Is this still expected?

      You are right, 'aabcc' and 'abbbc' scoring 5 is a mistake. A score of 5 for two 5 letter words means that they are a permutation of one another. For instance, 'abcde' and 'eacdb' should score 5.

      Sorry for being vague about the requirements... They really are vague ! Maybe examples can help:

      aabcc, abbbc -> 3 (once a, once b and once c, that's it)
      stress, super -> 3 (once s, e and r)
      abcde, caebd -> 5 (permutation)
      abcde, caebdxxy -> 5 (doesn't change things)

        I put some more thought into the exact trouble I have with dragonchild's second solution. It works in all cases where each unique character in word1 shows up fewer times or an equal numbers of times as they do in word2 (or vice versa). The reason is simple. It calculates the number of characters in word1 that are in word2, and then word2 in word1, and returns the lower of the two values. It does not consider the possibility that some characters will show up fewer times in word1, and some characters will show up fewer times in word2. (Example: "aabccc" and "abbbbc" yield 6 with this solution - very unexpected)

        Rather than merely complaining, I will offer a solution:

        sub score { my($word1, $word2) = @_; return $words_are_equal_score if $word1 eq $word2; my $score = 0; for ($word1 =~ /(.)/g) { $score++ if $word2 =~ s/\Q$_\E//; } $score; }

        For the border case (the example) described above, this solution scores "aabccc" and "abbbbc" with the value 3.

        The trick is that the "$word2 =~ s/\Q$_\E//" ensures that characters in $word2 will not be considered twice, because they are removed as they are counted. I will continue to think of a method of accelerating this function. I wanted to get one, fast, correct answer in this thread out first. :-)

Re: Re: Optimizing a string processing sub
by sauoq (Abbot) on Jan 08, 2003 at 21:36 UTC

    I like your approach.

    Your first example does exactly what his does with one exception: it won't behave the same in list context.

    I suspect that your second example is really the desired behavior anyway (as it is commutative.)

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Re: Optimizing a string processing sub
by spurperl (Priest) on Jan 08, 2003 at 20:49 UTC
    Many thanks, dragonchild, ++.

    Your code works correctly and faster.
    Could you please explain how exactly it works ? (the @{[ part)
      The @{[]} is a trick to force list context. The reason why list context is important is because of what the regex operator returns. In list context, the regex operator returns a list of what is matched. In scalar context, the regex operator returns whether or not it matched. (In theory, that should be the number of things matched as well, but I couldn't get it to work.)

      So, by forcing list context, I get the list of things matched. Then, by converting the list to scalar context, I get the number of things in the list.

      I'm sure there's a more elegant way than creating two array references on the fly, but that's what I had in 60sec of imperfect memory.

      ------
      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

        The @{[]} is a trick to force list context.

        Actually, it does a bit more than that. It creates an array. You could have forced list context simply by using parens but a list in scalar context would return the last element in the list rather than the number of elements. So, you did what you had to. :-)

        -sauoq
        "My two cents aren't worth a dime.";
        

        chromatic's context chaining:

        my $count = () = $x =~ /(foo)/g;

        Update: I'm not claiming to have originated it; I just liked the alliteration.

        To ensure that characters such as '^', '-' or ']' do not interfere, I would always be tempted to surround the interpolated value with \Q..\E.

        An alternative (more intuitive?) method of forcing list context on an expression, is to assign the expression to a list:

        my $count = () = $x =~ /[\Q$y\E]/g;

        The comma operator in scalar context returns the second argument. The intermediate assignment above causes the comma operator to believe it is an list context.

        UPDATE: chromatic beat me to the draw by 2 minutes on the ()= trick. However -- chromatic: you cannot claim credit for being the first to use this trick (chromatic's context chaining). I remember it from the time that it was being argued over on the perl5-porters mailing list, perhaps 5 years ago... :-)

Re: Re: Optimizing a string processing sub
by John M. Dlugosz (Monsignor) on Jan 10, 2003 at 16:21 UTC
    I was thinking along similar lines, but using tr// in its count-them mode. $cnt1= $x =~ eval "tr/$y//";

    Hmm, the tr doesn't do interpolation. So it trades the @{[ for an eval.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://225339]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (10)
As of 2024-04-24 09:46 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found