Re: Re: Optimizing a string processing sub

Replies are listed 'Best First'.
Re: Re: Re: Optimizing a string processing sub by sauoq (Abbot) on Jan 08, 2003 at 22:36 UTC
I suspect that if both words have 5 characters, a declaration that the words have 5 characters in common, with different words, may be unexpected. The case where the words are equal is a special one in the original code and results in `$words_equal` being returned. Given that fact, I think that 5 would be expected. -sauoq "My two cents aren't worth a dime.";	[reply] [d/l]
Re: Re: Re: Re: Optimizing a string processing sub by MarkM (Curate) on Jan 08, 2003 at 23:15 UTC
I guess that all depends on what the goal of the scoring is (it hasn't been stated by the original poster yet... :-) ). I assume that the goal has something to do with determining how 'close' two words are to each other. "aabcc" and "abbbc" are "5 close" according to the algorithms described. "aabcc" and "abc" are also "5 close". Is this still expected?	[reply]
Re: Re: Re: Re: Re: Optimizing a string processing sub by sauoq (Abbot) on Jan 08, 2003 at 23:54 UTC
"aabcc" and "abc" are also "5 close". Is this still expected? Well, they are either 5-close or 3-close depending on what order you submit them to the original function. (I really do think a commutative version, like dragonchild's second one, is the desired behavior.) How would you expect to score "beard" and "bread"? How about an example where no two letters are in the same position, like "peach" and "cheap"? By my reasoning, both pairs would be 5-close even though they aren't equal. After looking at your example again, "aabcc" and "abbbc", I suspect you think those should be considered 3-close (based on number of unique chars rather than length.) I suppose that has some merit, but it would result in any combination of "case", "cases", "cease", and "ceases" being 4-close. I think it makes sense that "case" and "cease" would be 4-close but "cases" and "cease" would be 5-close. Like you say, we can't really know without some input from the original poster... sigh. -sauoq "My two cents aren't worth a dime.";	[reply]
Re: Re: Re: Optimizing a string processing sub by spurperl (Priest) on Jan 09, 2003 at 05:53 UTC
You are right, 'aabcc' and 'abbbc' scoring 5 is a mistake. A score of 5 for two 5 letter words means that they are a permutation of one another. For instance, 'abcde' and 'eacdb' should score 5. Sorry for being vague about the requirements... They really are vague ! Maybe examples can help: aabcc, abbbc -> 3 (once a, once b and once c, that's it) stress, super -> 3 (once s, e and r) abcde, caebd -> 5 (permutation) abcde, caebdxxy -> 5 (doesn't change things)	[reply]
Re: Re: Re: Re: Optimizing a string processing sub by MarkM (Curate) on Jan 09, 2003 at 06:20 UTC
I put some more thought into the exact trouble I have with dragonchild's second solution. It works in all cases where each unique character in word1 shows up fewer times or an equal numbers of times as they do in word2 (or vice versa). The reason is simple. It calculates the number of characters in word1 that are in word2, and then word2 in word1, and returns the lower of the two values. It does not consider the possibility that some characters will show up fewer times in word1, and some characters will show up fewer times in word2. (Example: "aabccc" and "abbbbc" yield 6 with this solution - very unexpected) Rather than merely complaining, I will offer a solution: `sub score { my($word1, $word2) = @_; return $words_are_equal_score if $word1 eq $word2; my $score = 0; for ($word1 =~ /(.)/g) { $score++ if $word2 =~ s/\Q$_\E//; } $score; }` [download] For the border case (the example) described above, this solution scores "aabccc" and "abbbbc" with the value 3. The trick is that the "$word2 =~ s/\Q$_\E//" ensures that characters in $word2 will not be considered twice, because they are removed as they are counted. I will continue to think of a method of accelerating this function. I wanted to get one, fast, correct answer in this thread out first. :-)	[reply] [d/l]
Re: Re: Re: Re: Re: Optimizing a string processing sub by sauoq (Abbot) on Jan 09, 2003 at 08:31 UTC
Nicely done. I think bart wins first place though and, to be fair, dragonchild's was the first that matched the example code in the original node. Excellent point about dragonchild's second solution, by the way. Although, it's still valid for some interpretation of the specs. :-) -sauoq "My two cents aren't worth a dime.";	[reply]
Re5: Optimizing a string processing sub by dragonchild (Archbishop) on Jan 09, 2003 at 15:11 UTC
Very good catch, MarkM!. I didn't really thoroughly test my solution, leaving that as an exercise for the reader. As for my solution, although this wasn't in the original specs, I also wanted to solve the problem without using some variant of split (which /(.)/g is). However, I'm blanking on it. Although this may have been posted in the mainline discussion, I'll post this, as it Benchmark's faster than yours. `sub score { my ($x, $y) = @_; return -1 if $x eq $y; my %x; $x{$_}++ for $x =~ /./g; my $score = 0; $x{$_} && $score++ for $y =~/./g; $score; }` [download] For 1 million iterations, this benches at 74.56sec. Yours benches at 122.51sec. :-) ------ We are the carpenters and bricklayers of the Information Age. Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.	[reply] [d/l]


Think about Loose Coupling
	PerlMonks