Re: Comparing Lines within a Word List

Replies are listed 'Best First'.
Re^2: Comparing Lines within a Word List by hippo (Bishop) on Apr 27, 2016 at 08:53 UTC
This is a good, clear solution (++). However it looks like there is an off-by-one error in the while condition which means that the last two entries of `@words` are not compared against each other. If we change it to `while ( @words > 1 ) {` [download] then it seems to cover that case as well.	[reply] [d/l] [select]
Re^2: Comparing Lines within a Word List by dominick_t (Acolyte) on Apr 27, 2016 at 03:57 UTC
Thanks very much for the reply! At the moment, the problem I'd first like to solve is the second one you described . . . at least if I'm reading you correctly. I want pairs of words that are exactly the same except where one has an R, the other has an S. But I'm very much a newbie and am having a tough time parsing your example code so I cannot tell how it is working . . . would it be possible to walk me through it a bit? In general, are we starting with the first word in the list and testing it against the words below it for the R/S difference and reporting it if one is found, then moving to the next item on the list and testing it against all the ones below it, and so on? Again, many thanks!	[reply]
Re^3: Comparing Lines within a Word List by graff (Chancellor) on Apr 27, 2016 at 10:07 UTC
Yes, you've correctly described the approach, which uses the shift function to extract the word that is currently at the beginning of the array, and then, if that first word contains "r" or "s", a regex is created and used with the grep function to search for matches in all the remaining words in the array. One thing you didn't specify yet is what to do with sets like "cases / carer / caser / cares": Should the first one match all of the other three? Should the second one match both of the last two? Should the last two match each other? If the answer is "yes" on all points, then you'll want to create a different regex, which can be done using the split and map functions, and (my favorite from C) the "ternary" conditional operator: `my $model = shift @words; my $regex = join( "", map{ ( /[rs]/ ) ? "[rs]" : $_ } split( /([rs +])/, $model )); next if ( $regex eq $model ); # skip if model has no "r" or "s" my @hits = grep /^$regex$/, @words; ...` [download] (BTW, maybe you already know, but `/$regex/i` (adding the "i" modifier at the end) does case-insensitive matches.) (updated to add a missing paren at the end of the second line in the snippet -- also added the anchors around $regex in the grep call)	[reply] [d/l] [select]
Re^4: Comparing Lines within a Word List by dominick_t (Acolyte) on Apr 27, 2016 at 14:58 UTC
Thank you. This is extremely helpful. To answer your question about the tricky case you mention: 'cases' should NOT match 'carer', as for my purpose these words differ in two positions. (even though both positions involve an R/S swap, I still the need matches to differ in exactly one position). 'cases' SHOULD match 'caser' and 'cares'. For the same reason, 'carer' SHOULD match both of the remaining two ('caser' and 'cares') And finally, 'caser' should NOT match 'cares', again because these words differ at more than one position. Does that clarify things?	[reply]
Re^2: Comparing Lines within a Word List by Anonymous Monk on Apr 27, 2016 at 15:22 UTC
Let perl regex do the looping for you. `#!/usr/bin/perl -l use strict; use warnings; $_ = do{local $/; <DATA>}; /^((\w) [rs] (\w))$ # find a first word .* # skip non match lines ^(\2 [rs] \3)$ # second word same except rs (??{print "$1 $4"})/xms; # print and fail __DATA__ bare mare base case bust burt bent sat rat bat matter mattes pat` [download]	[reply] [d/l]
Re^2: Comparing Lines within a Word List by dominick_t (Acolyte) on Apr 27, 2016 at 15:33 UTC
Thank you, hippo, for catching the one-line issue, and thank you graff for this enormously helpful code. I responded to your question about the specific case you raised below but I'll post it here as well. 'cases' should NOT match 'carer', as for my purpose these words differ in two positions. (even though both positions involve an R/S swap, I still the need matches to differ in exactly one position). 'cases' SHOULD match 'caser' and 'cares'. For the same reason, 'carer' SHOULD match both of the remaining two ('caser' and 'cares'). And finally, 'caser' should NOT match 'cares', again because these words differ at more than one position. With all that in mind, does this code still find the matches I need? Another question: am I right to say this program will accept as input a word list with words of different lengths and has a way of handling that? Earlier I had gotten the idea that I might have to split my word list into separate lists (words of length 2, words of length 3, and so on), which the program would then handle each in turn. But does this piece of code automatically handle that? I suppose this is solved in the regex line, but I'm still learning the ins and outs of regular expressions. Any further explanations as to how that regex line works would be greatly appreciated.	[reply]
Re^2: Comparing Lines within a Word List by dominick_t (Acolyte) on Apr 29, 2016 at 23:04 UTC
I was able to run this code with my long word list. It appears to not be grabbing some matches that I need, but I do not know enough about regular expressions to be able to fix it. For example, when I run the code looking for a/b swap matches, 'lama' and 'lamb' should be a match but it is not showing up as such. I'm guessing it's to do with the fact that 'lama' has two a's? Is it possible to easily amend this code to handle this case?	[reply]
Re^3: Comparing Lines within a Word List by hippo (Bishop) on Apr 30, 2016 at 10:04 UTC
I'm guessing it's to do with the fact that 'lama' has two a's? Is it possible to easily amend this code to handle this case? Yes, and it's a trivial amendment. Just deploy the `/g` modifier: `#!/usr/bin/perl use strict; use warnings; my @words = <DATA>; chomp @words; while ( @words >= 2 ) { my $model = my $regex = shift @words; if ( $regex =~ s/(.?)[ab](.?)/$1\[ab\]$2/g ) { my @hits = grep /^$regex$/, @words; if ( @hits ) { print join( " ", $model, "matches", @hits, "using", $regex +, "\n" ); } } } __DATA__ lama lamb` [download]	[reply] [d/l] [select]
Re^4: Comparing Lines within a Word List by dominick_t (Acolyte) on Apr 30, 2016 at 14:06 UTC
Thanks hippo, I will give this a shot. I'm reading the documentation on the `/g` modifier but do not see yet how this will work. If the issue was that there were two a's in 'lama', what does it mean that when I ran the original code, 'aaron' successfully matched 'baron' as it should have? Guessing it's to do with the fact that in this latter case, the repeated instance of 'a' comes after rather than before the a/b swap position, unlike 'lama' and 'lamb'. How this plays into the regex though, I don't yet see.	[reply] [d/l]
Re^5: Comparing Lines within a Word List by graff (Chancellor) on Apr 30, 2016 at 15:27 UTC
Re^5: Comparing Lines within a Word List by hippo (Bishop) on Apr 30, 2016 at 14:26 UTC
Re^6: Comparing Lines within a Word List by dominick_t (Acolyte) on Apr 30, 2016 at 14:49 UTC
Some notes below your chosen depth have not been shown here