Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Re: Re: Optimizing a string processing sub

by MarkM (Curate)
on Jan 08, 2003 at 22:10 UTC ( [id://225372]=note: print w/replies, xml ) Need Help??


in reply to Re: Optimizing a string processing sub
in thread Optimizing a string processing sub

Unfortunately (unfortunate because it is clever and short) -- I believe your solution is flawed.

Using the words 'aabcc' and 'abbbc', your code suggests that the words have 5 characters in common. The original poster was vague about the requirements, however, I suspect that if both words have 5 characters, a declaration that the words have 5 characters in common, with different words, may be unexpected.

With this in mind, I suggest that your first example is correct if the goal is to determine the sum of the number of characters that is in each word that are also in the other word (probably not what the original posted wanted to see), and that the second does not do what you expect.

Good luck tweaking your solution... :-)

  • Comment on Re: Re: Optimizing a string processing sub

Replies are listed 'Best First'.
Re: Re: Re: Optimizing a string processing sub
by sauoq (Abbot) on Jan 08, 2003 at 22:36 UTC
    I suspect that if both words have 5 characters, a declaration that the words have 5 characters in common, with different words, may be unexpected.

    The case where the words are equal is a special one in the original code and results in $words_equal being returned. Given that fact, I think that 5 would be expected.

    -sauoq
    "My two cents aren't worth a dime.";
    

      I guess that all depends on what the goal of the scoring is (it hasn't been stated by the original poster yet... :-) ).

      I assume that the goal has something to do with determining how 'close' two words are to each other. "aabcc" and "abbbc" are "5 close" according to the algorithms described. "aabcc" and "abc" are also "5 close". Is this still expected?

        "aabcc" and "abc" are also "5 close". Is this still expected?

        Well, they are either 5-close or 3-close depending on what order you submit them to the original function. (I really do think a commutative version, like dragonchild's second one, is the desired behavior.)

        How would you expect to score "beard" and "bread"? How about an example where no two letters are in the same position, like "peach" and "cheap"? By my reasoning, both pairs would be 5-close even though they aren't equal.

        After looking at your example again, "aabcc" and "abbbc", I suspect you think those should be considered 3-close (based on number of unique chars rather than length.) I suppose that has some merit, but it would result in any combination of "case", "cases", "cease", and "ceases" being 4-close. I think it makes sense that "case" and "cease" would be 4-close but "cases" and "cease" would be 5-close.

        Like you say, we can't really know without some input from the original poster... sigh.

        -sauoq
        "My two cents aren't worth a dime.";
        
Re: Re: Re: Optimizing a string processing sub
by spurperl (Priest) on Jan 09, 2003 at 05:53 UTC
    You are right, 'aabcc' and 'abbbc' scoring 5 is a mistake. A score of 5 for two 5 letter words means that they are a permutation of one another. For instance, 'abcde' and 'eacdb' should score 5.

    Sorry for being vague about the requirements... They really are vague ! Maybe examples can help:

    aabcc, abbbc -> 3 (once a, once b and once c, that's it)
    stress, super -> 3 (once s, e and r)
    abcde, caebd -> 5 (permutation)
    abcde, caebdxxy -> 5 (doesn't change things)

      I put some more thought into the exact trouble I have with dragonchild's second solution. It works in all cases where each unique character in word1 shows up fewer times or an equal numbers of times as they do in word2 (or vice versa). The reason is simple. It calculates the number of characters in word1 that are in word2, and then word2 in word1, and returns the lower of the two values. It does not consider the possibility that some characters will show up fewer times in word1, and some characters will show up fewer times in word2. (Example: "aabccc" and "abbbbc" yield 6 with this solution - very unexpected)

      Rather than merely complaining, I will offer a solution:

      sub score { my($word1, $word2) = @_; return $words_are_equal_score if $word1 eq $word2; my $score = 0; for ($word1 =~ /(.)/g) { $score++ if $word2 =~ s/\Q$_\E//; } $score; }

      For the border case (the example) described above, this solution scores "aabccc" and "abbbbc" with the value 3.

      The trick is that the "$word2 =~ s/\Q$_\E//" ensures that characters in $word2 will not be considered twice, because they are removed as they are counted. I will continue to think of a method of accelerating this function. I wanted to get one, fast, correct answer in this thread out first. :-)

        Nicely done.

        I think bart wins first place though and, to be fair, dragonchild's was the first that matched the example code in the original node.

        Excellent point about dragonchild's second solution, by the way. Although, it's still valid for some interpretation of the specs. :-)

        -sauoq
        "My two cents aren't worth a dime.";
        
        Very good catch, MarkM!. I didn't really thoroughly test my solution, leaving that as an exercise for the reader.

        As for my solution, although this wasn't in the original specs, I also wanted to solve the problem without using some variant of split (which /(.)/g is). However, I'm blanking on it.

        Although this may have been posted in the mainline discussion, I'll post this, as it Benchmark's faster than yours.

        sub score { my ($x, $y) = @_; return -1 if $x eq $y; my %x; $x{$_}++ for $x =~ /./g; my $score = 0; $x{$_} && $score++ for $y =~/./g; $score; }
        For 1 million iterations, this benches at 74.56sec. Yours benches at 122.51sec. :-)

        ------
        We are the carpenters and bricklayers of the Information Age.

        Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://225372]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (3)
As of 2024-04-25 09:27 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found