Re: Optimizing a string processing sub

Replies are listed 'Best First'.
Re: Re: Optimizing a string processing sub by MarkM (Curate) on Jan 08, 2003 at 22:10 UTC
Unfortunately (unfortunate because it is clever and short) -- I believe your solution is flawed. Using the words 'aabcc' and 'abbbc', your code suggests that the words have 5 characters in common. The original poster was vague about the requirements, however, I suspect that if both words have 5 characters, a declaration that the words have 5 characters in common, with different words, may be unexpected. With this in mind, I suggest that your first example is correct if the goal is to determine the sum of the number of characters that is in each word that are also in the other word (probably not what the original posted wanted to see), and that the second does not do what you expect. Good luck tweaking your solution... :-)	[reply]
Re: Re: Re: Optimizing a string processing sub by sauoq (Abbot) on Jan 08, 2003 at 22:36 UTC
I suspect that if both words have 5 characters, a declaration that the words have 5 characters in common, with different words, may be unexpected. The case where the words are equal is a special one in the original code and results in `$words_equal` being returned. Given that fact, I think that 5 would be expected. -sauoq "My two cents aren't worth a dime.";	[reply] [d/l]
Re: Re: Re: Re: Optimizing a string processing sub by MarkM (Curate) on Jan 08, 2003 at 23:15 UTC
I guess that all depends on what the goal of the scoring is (it hasn't been stated by the original poster yet... :-) ). I assume that the goal has something to do with determining how 'close' two words are to each other. "aabcc" and "abbbc" are "5 close" according to the algorithms described. "aabcc" and "abc" are also "5 close". Is this still expected?	[reply]
Re: Re: Re: Re: Re: Optimizing a string processing sub by sauoq (Abbot) on Jan 08, 2003 at 23:54 UTC
Re: Re: Re: Optimizing a string processing sub by spurperl (Priest) on Jan 09, 2003 at 05:53 UTC
You are right, 'aabcc' and 'abbbc' scoring 5 is a mistake. A score of 5 for two 5 letter words means that they are a permutation of one another. For instance, 'abcde' and 'eacdb' should score 5. Sorry for being vague about the requirements... They really are vague ! Maybe examples can help: aabcc, abbbc -> 3 (once a, once b and once c, that's it) stress, super -> 3 (once s, e and r) abcde, caebd -> 5 (permutation) abcde, caebdxxy -> 5 (doesn't change things)	[reply]
Re: Re: Re: Re: Optimizing a string processing sub by MarkM (Curate) on Jan 09, 2003 at 06:20 UTC
I put some more thought into the exact trouble I have with dragonchild's second solution. It works in all cases where each unique character in word1 shows up fewer times or an equal numbers of times as they do in word2 (or vice versa). The reason is simple. It calculates the number of characters in word1 that are in word2, and then word2 in word1, and returns the lower of the two values. It does not consider the possibility that some characters will show up fewer times in word1, and some characters will show up fewer times in word2. (Example: "aabccc" and "abbbbc" yield 6 with this solution - very unexpected) Rather than merely complaining, I will offer a solution: `sub score { my($word1, $word2) = @_; return $words_are_equal_score if $word1 eq $word2; my $score = 0; for ($word1 =~ /(.)/g) { $score++ if $word2 =~ s/\Q$_\E//; } $score; }` [download] For the border case (the example) described above, this solution scores "aabccc" and "abbbbc" with the value 3. The trick is that the "$word2 =~ s/\Q$_\E//" ensures that characters in $word2 will not be considered twice, because they are removed as they are counted. I will continue to think of a method of accelerating this function. I wanted to get one, fast, correct answer in this thread out first. :-)	[reply] [d/l]
Re: Re: Re: Re: Re: Optimizing a string processing sub by sauoq (Abbot) on Jan 09, 2003 at 08:31 UTC
Re5: Optimizing a string processing sub by dragonchild (Archbishop) on Jan 09, 2003 at 15:11 UTC
Re: Re: Optimizing a string processing sub by sauoq (Abbot) on Jan 08, 2003 at 21:36 UTC
I like your approach. Your first example does exactly what his does with one exception: it won't behave the same in list context. I suspect that your second example is really the desired behavior anyway (as it is commutative.) -sauoq "My two cents aren't worth a dime.";	[reply]
Re: Re: Optimizing a string processing sub by spurperl (Priest) on Jan 08, 2003 at 20:49 UTC
Many thanks, dragonchild, ++. Your code works correctly and faster. Could you please explain how exactly it works ? (the @{[ part)	[reply]
Re3: Optimizing a string processing sub by dragonchild (Archbishop) on Jan 08, 2003 at 21:00 UTC
The `@{[]}` is a trick to force list context. The reason why list context is important is because of what the regex operator returns. In list context, the regex operator returns a list of what is matched. In scalar context, the regex operator returns whether or not it matched. (In theory, that should be the number of things matched as well, but I couldn't get it to work.) So, by forcing list context, I get the list of things matched. Then, by converting the list to scalar context, I get the number of things in the list. I'm sure there's a more elegant way than creating two array references on the fly, but that's what I had in 60sec of imperfect memory. ------ We are the carpenters and bricklayers of the Information Age. Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.	[reply] [d/l]
Re: Re3: Optimizing a string processing sub by sauoq (Abbot) on Jan 08, 2003 at 21:27 UTC
The `@{[]}` is a trick to force list context. Actually, it does a bit more than that. It creates an array. You could have forced list context simply by using parens but a list in scalar context would return the last element in the list rather than the number of elements. So, you did what you had to. :-) -sauoq "My two cents aren't worth a dime.";	[reply] [d/l]
Re: Re3: Optimizing a string processing sub by chromatic (Archbishop) on Jan 08, 2003 at 22:02 UTC
chromatic's context chaining: `my $count = () = $x =~ /(foo)/g;` Update: I'm not claiming to have originated it; I just liked the alliteration.	[reply] [d/l]
Re: Re3: Optimizing a string processing sub by MarkM (Curate) on Jan 08, 2003 at 22:04 UTC
To ensure that characters such as '^', '-' or ']' do not interfere, I would always be tempted to surround the interpolated value with \Q..\E. An alternative (more intuitive?) method of forcing list context on an expression, is to assign the expression to a list: `my $count = () = $x =~ /[\Q$y\E]/g;` The comma operator in scalar context returns the second argument. The intermediate assignment above causes the comma operator to believe it is an list context. UPDATE: chromatic beat me to the draw by 2 minutes on the ()= trick. However -- chromatic: you cannot claim credit for being the first to use this trick (chromatic's context chaining). I remember it from the time that it was being argued over on the perl5-porters mailing list, perhaps 5 years ago... :-)	[reply] [d/l]
Re: Re: Optimizing a string processing sub by John M. Dlugosz (Monsignor) on Jan 10, 2003 at 16:21 UTC
I was thinking along similar lines, but using tr// in its count-them mode. `$cnt1= $x =~ eval "tr/$y//";` Hmm, the tr doesn't do interpolation. So it trades the @{[ for an eval.	[reply] [d/l]


P is for Practical
	PerlMonks