Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Comparing Lines within a Word List

by dominick_t (Acolyte)
on Apr 26, 2016 at 19:54 UTC ( [id://1161587]=perlquestion: print w/replies, xml ) Need Help??

dominick_t has asked for the wisdom of the Perl Monks concerning the following question:

Hello all-- New to Perl, new to this forum. Many thanks in advance for reading and offering help. I have a background in mathematics and have done a bit of programming, mostly for specific tasks that lead me to learn just enough of a language to achieve them. So I wouldn't call myself thoroughly conversant in any language. I am, however, interested in learning Perl more deeply, as I have some long-term projects that will require managing and searching through word lists in creative ways. I've been reading the O'Reilly book Learning Perl, but I have a specific problem that I need to solve somewhat urgently, and I'm afraid I haven't learned enough Perl yet to even attempt some code that could do it. So, here's the problem: I have a long list of text strings saved in a .txt file. What I am interested in are pairs of words that are exactly the same, except in one position . . . in particular, where one word has, say, an R, the other word has, say, an S. So if the word list was a standard dictionary and I ran the code on it, the output would include the pairs RAT and SAT, also RATE and SATE, also BARE and BASE, also BARR and BARS. This strikes me as something that should be possible using regular expressions in a Perl script. Am I right about that? If so, and if it's pretty easy for an expert to write some code that will do this, I would be much obliged, not just because I need a speedy solution to this question due to a deadline, but also because it will give me a great piece of example code to help me in getting my hands dirty learning Perl. All best-- Dominick

Replies are listed 'Best First'.
Re: Comparing Lines within a Word List
by Eily (Monsignor) on Apr 26, 2016 at 20:27 UTC

    Regular expression are not the best tool to do what you want actually. Not because your problem is impossible or even difficult to solve with regular expressions, but because there is a much better option. The bitwise xor operator "^" will yield a 0 anywhere the two strings are equal, but 1 for every bit that is different between the two.

    my $first = "Fool"; my $second = "Foot"; my $diff = ($first ^ $second); print unpack "B*", $diff; # Print the binary representation of the dif +ference my @diff_char = split //, $diff; # get a char by char difference.
    With that, and maybe the use of ord (you don't actually need it but it may help make things clearer) you should be able to do what you want.

      Actually, bitwise-xor on strings and  tr/// (update: see Quote-Like Operators in perlop) go together quite nicely for something like this:

      c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump qw(pp); ;; for my $word (qw(Fool Foot Tool Toot Foal)) { my $diff = 'Fool' ^ $word; print qq{'$word': }, pp $diff; print qq{'Fool' and '$word' differ by 1 char} if 1 == $diff =~ tr/\x00//c; } " 'Fool': "\0\0\0\0" 'Foot': "\0\0\0\30" 'Fool' and 'Foot' differ by 1 char 'Tool': "\22\0\0\0" 'Fool' and 'Tool' differ by 1 char 'Toot': "\22\0\0\30" 'Foal': "\0\0\16\0" 'Fool' and 'Foal' differ by 1 char

      Update: Changed example code to use  tr/\x00//c (/c modifier: complement the search list).


      Give a man a fish:  <%-{-{-{-<

        Thank you both for the replies! I hope everyone in the thread can see this, and not just the author of the note on which I hit the reply button. Okay, so if I'm getting this right, it looks like in this example, you're taking the word 'fool' and comparing its characters to each of the five words in the array, and since 'fool' matches itself exactly, the return on that one is all zeros. Any place there is not a zero is a place where the words differ. (I'm not immediately sure why the "difference" between the character 'l' and 't' would be 30 but I'm sure it's easily explained.) So I see how this works in principle, to compare two given words and look for word pairings that yield a one-character difference. But then how might I use this to solve the problem that I have, which is to find -- from let's say a massive dictionary of English language words -- all pairs of words that are the same except for one letter, and in particular, for that character difference to be that one has an R while the other has an S? Again, many thanks.

        I always forget about using tr/// for counting, thanks for the reminder :)

Re: Comparing Lines within a Word List
by graff (Chancellor) on Apr 26, 2016 at 23:16 UTC
    If you're looking for a way to catch any two words that differ by any single letter (e.g. report if input contains "fool/foot" and/or "foot/loot" and/or "loot/lout"), then it's a matter of building on Eily's idea using bitwise xor on pairs of strings that are the same length. Look at the elements of the array returned by the "split" statement, and if only one element of the array is non-zero, then the given pair of strings differ by only one character. (You'll want to organize your input into groups based on word length, and only do comparisons within each group.)

    OTOH, if you're looking for words that contain a particular pair of characters, and differ only in terms of using one vs. the other of those two (e.g. you really just want "bare/base", etc., but not "foot/fool"), you would probably want to use a regex like this:

    #!/usr/bin/perl use strict; use warnings; my @words = <DATA>; chomp @words; while ( @words >= 2 ) { my $model = my $regex = shift @words; if ( $regex =~ s/(.*?)[rs](.*?)/$1\[rs\]$2/ ) { my @hits = grep /^$regex$/, @words; if ( @hits ) { print join( " ", $model, "matches", @hits, "using", $regex +, "\n" ); } } } __DATA__ bare mare base case bust burt bent sat rat bat matter mattes pat
    (update: fixed the "while" condition as per hippo's remark below -- also added anchors around $regex in the grep call, so that "bare" doesn't match "debased", etc.)

      This is a good, clear solution (++). However it looks like there is an off-by-one error in the while condition which means that the last two entries of @words are not compared against each other. If we change it to

      while ( @words > 1 ) {

      then it seems to cover that case as well.

      Thanks very much for the reply! At the moment, the problem I'd first like to solve is the second one you described . . . at least if I'm reading you correctly. I want pairs of words that are exactly the same except where one has an R, the other has an S. But I'm very much a newbie and am having a tough time parsing your example code so I cannot tell how it is working . . . would it be possible to walk me through it a bit? In general, are we starting with the first word in the list and testing it against the words below it for the R/S difference and reporting it if one is found, then moving to the next item on the list and testing it against all the ones below it, and so on? Again, many thanks!
        Yes, you've correctly described the approach, which uses the shift function to extract the word that is currently at the beginning of the array, and then, if that first word contains "r" or "s", a regex is created and used with the grep function to search for matches in all the remaining words in the array.

        One thing you didn't specify yet is what to do with sets like "cases / carer / caser / cares": Should the first one match all of the other three? Should the second one match both of the last two? Should the last two match each other? If the answer is "yes" on all points, then you'll want to create a different regex, which can be done using the split and map functions, and (my favorite from C) the "ternary" conditional operator:

        my $model = shift @words; my $regex = join( "", map{ ( /[rs]/ ) ? "[rs]" : $_ } split( /([rs +])/, $model )); next if ( $regex eq $model ); # skip if model has no "r" or "s" my @hits = grep /^$regex$/, @words; ...
        (BTW, maybe you already know, but /$regex/i (adding the "i" modifier at the end) does case-insensitive matches.)

        (updated to add a missing paren at the end of the second line in the snippet -- also added the anchors around $regex in the grep call)

      Let perl regex do the looping for you.

      #!/usr/bin/perl -l use strict; use warnings; $_ = do{local $/; <DATA>}; /^((\w*) [rs] (\w*))$ # find a first word .* # skip non match lines ^(\2 [rs] \3)$ # second word same except rs (??{print "$1 $4"})/xms; # print and fail __DATA__ bare mare base case bust burt bent sat rat bat matter mattes pat

      Thank you, hippo, for catching the one-line issue, and thank you graff for this enormously helpful code. I responded to your question about the specific case you raised below but I'll post it here as well.

      'cases' should NOT match 'carer', as for my purpose these words differ in two positions. (even though both positions involve an R/S swap, I still the need matches to differ in exactly one position). 'cases' SHOULD match 'caser' and 'cares'.

      For the same reason, 'carer' SHOULD match both of the remaining two ('caser' and 'cares').

      And finally, 'caser' should NOT match 'cares', again because these words differ at more than one position.

      With all that in mind, does this code still find the matches I need?

      Another question: am I right to say this program will accept as input a word list with words of different lengths and has a way of handling that? Earlier I had gotten the idea that I might have to split my word list into separate lists (words of length 2, words of length 3, and so on), which the program would then handle each in turn. But does this piece of code automatically handle that? I suppose this is solved in the regex line, but I'm still learning the ins and outs of regular expressions. Any further explanations as to how that regex line works would be greatly appreciated.

      I was able to run this code with my long word list. It appears to not be grabbing some matches that I need, but I do not know enough about regular expressions to be able to fix it. For example, when I run the code looking for a/b swap matches, 'lama' and 'lamb' should be a match but it is not showing up as such. I'm guessing it's to do with the fact that 'lama' has two a's? Is it possible to easily amend this code to handle this case?
        I'm guessing it's to do with the fact that 'lama' has two a's? Is it possible to easily amend this code to handle this case?

        Yes, and it's a trivial amendment. Just deploy the /g modifier:

        #!/usr/bin/perl use strict; use warnings; my @words = <DATA>; chomp @words; while ( @words >= 2 ) { my $model = my $regex = shift @words; if ( $regex =~ s/(.*?)[ab](.*?)/$1\[ab\]$2/g ) { my @hits = grep /^$regex$/, @words; if ( @hits ) { print join( " ", $model, "matches", @hits, "using", $regex +, "\n" ); } } } __DATA__ lama lamb
Re: Comparing Lines within a Word List
by GotToBTru (Prior) on Apr 26, 2016 at 20:09 UTC

    If you're unsure of the Perl syntax, try to work the steps out in English.

    Does the text file contain one word per line? Are any words in the file repeated? Is the list of letter pairs (R and S in your example) fixed or something the user provides with each execution?

    But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)

      Thank you for the reply! The text file contains one word per line, and no words in the file repeated. You could basically think of the text file as a dictionary of standard English words. Ideally the user would be able to easily supply any letter pair each execution. I figured it would be easier for me just to see some code that will do the trick for the R/S pairing, and then I assumed it would be easy enough matter to add some lines that would allow the user to input any pair whatsoever.

        The Tutorials here will tell you everything you need to know about opening a file and reading the words therein. You will probably store them in an array. Here's a trick to detect when two words differ by only single letter:

        if (($word1 ^ $word2) =~ tr[\1-\255][] == 1) { ... }

        That works because the ^ operator performs an XOR function. Same letters in $word1 and $word2 become null values, and the tr function returns the count of how many matches it found. I told it to look for non-null values. If the count is 1, that's a word we want to look at.

        Update: should have looked at the other replies before I composed this one. Better and more complete answers already provided!

        But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)

Re: Comparing Lines within a Word List
by Marshall (Canon) on Apr 27, 2016 at 04:36 UTC
    As another thought for you, consider Levenshtein_distance. Also the Perl module, Levenshtein.pm.

    I think you are just looking for what are called N1 errors, that means: one substitution, or one deletion or one addition. Life gets complicated if you working with a subset of N2 that allows transpositions, that would be for example, RATE and ARTE would match, but that counts as 2 errors. Note 2 errors could also be RATE vs XATB which is completely different. So Levenshtein == 2 can produce a lot of false "positives".

    The last pattern matching code that I wrote, analyzed the input string, then generated a regex dynamically which was compiled and run against the candidate strings. A lot of "yeah, but's" with code like that and very specific to my particular application. Just a comment that something like that is possible in Perl (have Perl write a program (a REGEX), and then run it).

    I think that the standard Levenshtein module will do most of what you want and I would start there. There are links to other "string compares" in the doc's at Levenshtein.pm. If somebody else has built the wheel that you need, I would use it. Some of these things have XS modules which will run much more quickly than native Perl.

    Update: I thought that I should mention yet another approximate matcher agrep, agrep wiki. The algorithms are top notch and agrep is fast. In a lot of cases it will outperform standard grep for simple matches.

    The caveat here is that I am unsure of the code status. I was using agrep for all of my grepping until I caught it missing a match! I spent several days reducing my dataset into a "smallest" reproduceable error report and talked with the mantainer. He verified the problem, but indicated the difficulty of a fix. That was about a decade ago and I'm not sure what happened (i.e. whether the "Marshall" fix got implemented or not?). If it didn't then very rarely agrep will miss a match that it should get, even when used like standard grep. With that caveat, agrep is pretty cool. Certainly the algorithms are.

Re: Comparing Lines within a Word List
by clueless newbie (Curate) on Apr 27, 2016 at 19:19 UTC
    #!/usr/bin/env perl use strict; use warnings; use Data::Dumper; # Filter out all words that can't be candidates one way or the other my @candidates=grep { m{[RS]} } <DATA>; chomp(@candidates); # Sort them in order of length my @orderedCandidates=sort{ length $a <=> length $b } @candidates; # Push a guard on the end push @orderedCandidates,""; warn Data::Dumper->Dump([\@orderedCandidates],[qw(*orderedCandidates)] +),' '; my %hash; my $length=length($orderedCandidates[0]); for my $word (@orderedCandidates) { if (length($word) == $length) { # Same length ... add it to the ha +sh $hash{$word}=undef; } else { # Hash is complete for my $Rword (grep { m{R} } keys %hash) { # Word has a R my $pos=0; while (($pos=index($Rword,'R',$pos)) >= 0) { # Found an R # Make its S equivalent my $Sword=$Rword; substr($Sword,$pos,1)='S'; print "$Rword - $Sword\n" if (exists $hash{$Sword}); $pos++ } } # Done with this hash --- so start over %hash=(); $hash{$word}=undef; $length=length($word); }; } __DATA__ ONE THREE FOUR FOUS RRRRRRRR RSRRRRRS RRSRRRRR
    which yields
    FOUR - FOUS RRRRRRRR - RRSRRRRR
    Just for fun I modified this replacing R by each of the letters of the alphabet and S by another (using english-words.95 (~ 220,000 words). The R <=> S exchange is the most common with 381 words and the J <=> Q is the least common with only 5 words. Elapsed time for the run 344.65 seconds.
      Thank you! Your explanatory comments regarding the overall approach and the details are exactly what I need to get my bearings. I'll now dive in there and make it all make sense to me. One question which I already have: I assume I should not just copy and paste my own data set, which is a massive file, into the code here to replace the example data set. What's the best way to handle that part? Many many thanks.

        Put your filename as first argument on command line, then change <DATA> to <>

      Hello and thank you again for posting this, clueless newbie. Unlike you I am in fact a clueless newbie, and I can't seem to run your code correctly, with my own word list for the data set. Might you have a minute to help, either here or via private message?
        Here you go ... more comments:
        #!/usr/bin/env perl use strict; use warnings; use Data::Dumper; use Time::HiRes qw(time); die <<"__DOC__" perl $0 <dictionary file name> [<optional character> [<option swap +>]] so perl $0 <dictionary file name> R S will solve the problem as stated in the OP and perl $0 <dictionary file name> will try all of the pairs. (note that they are symmetrical ie r-s and +s-r will be the same) __DOC__ unless (@ARGV); # Get all the words into an array open(my $DATA,'<',$ARGV[0]) or die "Couldn't open '$ARGV[0]' for reading! $!"; my @all=<$DATA>; close($DATA) or die "Couldn't close '$ARGV[0]' after reading! $!"; chomp(@all); # @all now holds all of the words from the dictionary specified by $AR +GV[0] my $start=time(); my %counts; for my $R ($ARGV[1] || ('a'..'y')) { # Take $ARGV[1] or the letters fr +om a to y (no point to doing z) one at a time my $count=0; for my $S ($ARGV[2] || (chr(ord($R)+1)..'z')) { # Take $ARGV[2] or + the letters following $R one at a time my $re=qr{[$R$S]}; # Filter out everything that isn't relevant # (if it doesn't have $R or $s it can't be a word to be altere +d or a word after alteration) my @candidates=grep{ m{$re} } @all; # Order them by length (no use comparing a 4 letter word with +a 5 letter word) @candidates=sort{ length $a <=> length $b } @candidates; # Put a guard at the end of the array - to trigger the "comple +tion" push @candidates,''; my %hash; # Something to count up the number of matches my $count; # Initialize $length by setting it to length of the first word + (failure to do this is of no consequnce!) my $length=length $candidates[0]; for my $word (@candidates) { unless (length($word) == $length) { # Current word is of d +ifferent length - so we need to process everything already in the has +h for my $Rword (grep { m{r} } keys %hash) { # Word has +a $R my $pos=0; while (($pos=index($Rword,'r',$pos)) >= 0) { # Fou +nd a $R in $Rword # Make its S equivalent my $Sword=$Rword; substr($Sword,$pos,1)='s'; # Increment $count if $Sword appears in the ha +sh $count++ if (exists $hash{$Sword}); # Need to look for the next $R so $pos must be + incremented $pos++ } } # Done with this hash --- so recycle it (hey I'm green +! --- or really I date back to machines that only had 10,000 digits!) %hash=(); # A new length $length=length($word); }; $hash{$word}=undef; }; #warn "$R-$S: ",$count; # Save our count $counts{"$R-$S"}=$count; }; }; # All done - see how long all this took my $end=time(); printf("%.2f\n", $end-$start); # Dump the counts hash --- but we want the keys in sorted order $Data::Dumper::Sortkeys=1; print Data::Dumper->Dump([\%counts],[qw(*counts)]); # All done exit; __DATA__
Re: Comparing Lines within a Word List
by coicles (Sexton) on Apr 28, 2016 at 03:27 UTC

    Here's a trie solution which probably does everything you have asked about. Its performance is suboptimal in that it is a standard trie, where most "trie" applications would opt for one of the optimized equivalents. Also, this has a lot of confusing nested structure, as well as two traversal context stacks, so for clarity, hashes with named elements are used where arrays of fixed-index fields would be faster.

    It still runs pretty fast with SCOWL's big English word list, though.

    Also, the code's size is bloated by excessive commenting.

    use strict; package WordTrie { # "new" creates an empty WordTrie node, which is also a proper # WordTrie object when handled as as the trie's root node. sub new { bless {children => {}, terminates => 0}, shift } # insert a word into the trie sub insert { my $self = shift; for my $word (@_) { $word =~ s/^\s+|\s+$//g; # trim whitespace length($word) or next; my $node = $self; for my $ch (split //, lc $word) { $node = ($node->{children}->{$ch} //= new ref($self)); } $node->{terminates} = 1; } } # insert a list of words loaded from a file by name or GLOB refere +nce # the optional first argument is a whitelisting regex for filterin +g # the file's words for insertion. sub load { my $self = shift; my $regex = ref($_[0]) eq 'Regexp' ? shift() : undef; for my $arg (@_) { my $fh = ref($arg) ? $arg : undef; $fh or open($fh, $arg) or die "failed to read $arg"; my @list = <$fh>; chomp(@list); defined($regex) and @list = grep { /$regex/ } @list; $self->insert(@list); } } # simple traversal for dumping all of the trie's words sub full_traverse { my $self = shift; my @stack = ([$self]); while(@stack) { my $context = $stack[-1]; my ($node, $word, $child_keys) = @$context; unless($child_keys) { $node->{terminates} and print "$word\n"; $child_keys = $context->[2] = [sort keys %{$node->{chi +ldren}}]; } if(@$child_keys) { my $ch = shift @$child_keys; push @stack, [$node->{children}->{$ch}, $word . $ch]; } else { pop @stack; } } } # A primary traversal, with secondary traversals for finding pairs + of same-length words, # different at some number of character positions, within a given +range of counts. # Accepts zero, one or two integer arguments specifying the extrem +es of an # inclusive range of difference counts. # If only one number is provided, the range includes only one diff +erence count. # If no arguments are given, defaults to a difference count of 1 sub traverse_deltas { my $self = shift; my ($min_diff_count, $max_diff_count) = @_; $min_diff_count //= 1; $max_diff_count //= $min_diff_count; $max_diff_count < $min_diff_count and ($min_diff_count, $max_d +iff_count) = ($max_diff_count, $min_diff_count); # A secondary traversal does not intersect its primary travers +al, so "zero difference" pairs # cannot be found. Throw an error if pairs with zero (or nega +tive!) differences are requested. $min_diff_count <= 0 and die "0 is inside traverse_deltas diff +count range"; # stack[0] saves the states at each node level of a depth-firs +t recursive tree traversal # # stack[1] when the primary traversal finds a word, stack[1] g +ets a copy of stack[0] # and performs a secondary continuation of the primar +y traversal, specifically # for finding words whose lengths match the primary w +ord, and with character # difference counts between $min_diff_count and $max_ +diff_count (inclusive). # Start with one stack containing one element, a new context f +or the trie's root node. # Each node is a WordTrie object, for simplicity and because n +o collection metadata # was needed in this demo for a container object to maintain. my @stacks = ([{node => $self}]); while(@stacks) { my $ctx = $stacks[-1][-1]; # get the active context, at th +e top position of the active stack unless($ctx->{child_keys}) { # if "child_keys" is not defined, this context was jus +t created. # Traversal may return to this node more than once, an +d an undefined "child_keys" # acts as a flag for triggering one-time actions for t +his node, # like something that needs to be done once each time +a complete word is found. # create "child_keys", an array of single-character in +dexes into this node's # "children" hash. Each child node represents the app +ending of its key character # to this context's "word". $ctx->{child_keys} = [sort keys %{$ctx->{node}->{child +ren}}]; # Check if a word ends here. Even if this node termin +ates a word, it may still # have children. This happens when one word prefixes +another longer word. # For example, 'an' could terminate at this node, but +if 'and' exists, # it would terminate with this node's child whose key +is 'd' if($ctx->{node}->{terminates}) { if(@stacks == 1) { # The primary traversal found a complete word. + Now create a secondary # traversal to search for matching words. The + secondary traversal continues # from the primary context. Work would be was +ted if it started at the beginning, # because each matching word pair would be dup +licated. # This copies the primary context stack into a + secondary context stack. # The procedure is complicated by the need to +make a deep-copy of the "child_keys" arrays, # in order to preserve the primary versions. for my $ctx (@{$stacks[0]}) { # This uses a key/value hash slice, a newi +sh feature push @{$stacks[1]}, { %{$ctx}{qw/node word + diff_count/}, child_keys => [@{$ctx->{child_keys}}] }; } } elsif(@{$stacks[-1]} == @{$stacks[-2]}) { # The secondary traversal hit a word whose len +gth is the same as the primary. # Check that the difference count is within th +e desired range; if so, print the pair. # Pairs are always ordered alphabetically (or +however "sort" orders keys), with # the primary word preceding the secondary. if($ctx->{diff_count} >= $min_diff_count && $c +tx->{diff_count} <= $max_diff_count) { my ($pri, $sec, $diffs) = ($stacks[-2][-1] +->{word}, $ctx->{word}, $ctx->{diff_count}); # This is where one would add custom code +for handling each matching pair. # The demo just prints the primary word, t +he secondary word and their difference count. print " $pri $sec ($diffs)\n"; } } } # If a secondary traversal is being performed, do not +traverse deeper than the primary # traversal, because deeper words are all longer than +the primary word. @stacks > 1 && @{$stacks[0]} == @{$stacks[1]} and $ctx + = undef; } if(!$ctx || !@{$ctx->{child_keys}}) { # if context was cleared (to cancel deeper traversal), # or no children remain to be searched in this context +, # remove this context from the top of the active (top) + stack # and if the active stack is now empty, remove it from # the stack of context stacks, @stacks. # Primary traversal takes place when @stacks has only +one element, # and the main loop ends when @stacks is empty. pop @{$stacks[-1]}; @{$stacks[-1]} or pop @stacks; } else { # Children remain to be searched: get the next child's + key character my $ch = shift @{$ctx->{child_keys}}; my $diff_count = 0+$ctx->{diff_count}; if(@stacks > 1) { # For a secondary traversal, check whether the chi +ld key # just selected is the same as the primary word's +character # at the next depth. If they are different, incre +ment # "diff_count" between this context and the new ch +ild my $primary_word = $stacks[-2][@{$stacks[-1]}]->{w +ord}; my ($primary_ch) = $primary_word =~ /(.)$/; $ch ne $primary_ch and ++$diff_count; # If difference count corresponding to this new ch +ild is greater # than the allowed maximum, cancel traversal into +the child. $diff_count > $max_diff_count and next; } # push a new child context # "node" is this node's child, from its "children" has +h, indexed by the next key, stored in $ch # "word" is this context's word, appended with the chi +ld's key character # "diff_count" is this node's "diff_count", incremenet +ed only during a secondary traversal, # when the key character is different tha +n the primary traversal's key character # at the same depth as the child. push @{$stacks[-1]}, { node => $ctx->{node}->{children +}->{$ch}, word => $ctx->{word} . $ch, diff_count => $diff_count }; } } } } # Create an empty WordTrie object my $trie = WordTrie->new; # Example which loads words from the __DATA__ section at the bottom of + this file. $trie->load(\*DATA); ## Example which loads words from a dictionary file named 'english-wor +ds.95' ## This file has words which contain symbols like apostrophes, so ## a whitelisting regex is supplied for keeping just all-alpha words. ## Although the trie stores words in a case-insensitive manner, the lo +ading regex ## is applied to words as they appear in the file, so it must take thi +s into account. # $trie->load(qr/^[a-z]+$/i, 'english-words.95'); ## This prints all words contained in the trie # $trie->full_traverse; # This finds pairs which differ by only one charcter $trie->traverse_deltas(1); ## This finds pairs which differ by between 1 and 3 characters. # $trie->traverse_deltas(1, 3); __DATA__ aardvark aardbark bbardvark bardbark bardvark barddork aaardvark
Re: Comparing Lines within a Word List
by CountZero (Bishop) on Apr 28, 2016 at 20:32 UTC
    It takes a little longer, but the result is a file that shows all words in your worlist which differ in one place only and they are sorted according to which characters are different.

    For instance for "rs" differences, the result (based upon my wordlist) is as follows:

    ***** rs ***** abaser - abases abater - abates abider - abides abjurer - abjures ... year - yeas yodler - yodles yummier - yummies zanier - zanies zoner - zones
    And here is the program:
    use Modern::Perl qw/2015/; use List::MoreUtils qw/zip natatime/; use autodie; open( my $DICT, '<', './wordsEn.txt' ) or die "Could not open dictiona +ry - $!"; my @dict; print scalar <$DICT>; # show copyright message of word list # step 1: sort the dictionary into bins per length of the word while ( my $word = <$DICT> ) { chomp $word; next unless $word; push @{ $dict[ length $word ] }, $word; } close $DICT; # step 2: for each bin look for words that differ in one place only my %results; for my $length ( 2 .. @dict - 1 ) { next unless $dict[$length]; # skip if no words of this length my @bin = @{ $dict[$length] }; next if @bin < 2; # skip if only one word of this length say "*********** Testing words of length $length ***********"; while ( my ( $index, $test ) = each @bin ) { for my $check ( @bin[ $index + 1 .. @bin - 1 ] ) { my $diff = $test ^ $check; if ( 1 == $diff =~ tr/\x00//c ) { # only one character dif +ferent # find which characters are different my @first = split '', $test; my @second = split '', $check; my @test = zip @first, @second; my $it = natatime 2, @test; while ( my @vals = $it->() ) { next if $vals[0] eq $vals[1]; my $key = $vals[0] lt $vals[1] ? "$vals[0]$vals[1]" : "$vals[1]$vals[0]"; push @{ $results{$key} }, "$test - $check"; # save + in hash last; } } } } } say "Now writing results"; open( my $RESULTS, '>', './results.txt' ) or die "Cannot open output f +ile $!"; for my $key ( sort keys %results ) { say $RESULTS "***** $key *****"; for my $words ( sort @{ $results{$key} } ) { say $RESULTS "\t$words"; } } close $RESULTS;

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      "Light and agile" eh?

      #!/usr/bin/perl -l use strict; use warnings; $_ = do{local $/; <DATA>}; /^((\w*) (.) (\w*))$ # find a first word .* # skip non match lines ^(\2 (?!\3). \4)$ # second word same except one letter (??{print "$1 $5"})/xms; # print and fail __DATA__ cases carer caser cares bare mare base case bust burt bent sat rat bat matter mattes pat
Re: Comparing Lines within a Word List
by Anonymous Monk on Apr 27, 2016 at 19:02 UTC
    #!/usr/bin/perl -l use strict; use warnings; my %words; @words{map tr/a-z//cdr, <DATA>} = (); for (sort keys %words) # sort optional { exists $words{"$`s$'"} and print "$_ $`s$'" while /r/g; } __DATA__ cases carer caser cares bare mare base case bust burt bent sat rat bat matter mattes pat
Re: Comparing Lines within a Word List
by Anonymous Monk on Apr 27, 2016 at 15:47 UTC

    So you have a looming deadline and no idea how to solve your problem (every CS problem can be solved "using regular expressions in a Perl script"), so you come here to ask us to do your homework for you... well today is your lucky day :-P

    My first thought was "spell checker", so I investigated how they work. Apparently one possible implementation is the use of a Trie. I've never worked with a trie before, so I decided to use this chance to try it out. Since it's the first time I'm working with tries, the code might have some bugs. I compared it to a plain, brute-force linear search regex match. In my first tests, the trie data structure takes roughly 5x the memory (as reported by Devel::Size), but on large word lists I saw a 100x to 700x speedup. (Note: My implementation of AnomalousMonk's xor+tr search seems to be roughly twice as fast as the regex implementation.)

    If you want to understand what's going on in the code, the book Learning Perl is a good start, as is perldsc. Also, ask here.

    (Protip: Don't get thrown out of your class or school for plagiarism and cite your sources.)

      Just to clarify, I'm not a computer science student and my deadline is not a class deadline. I'm an artist with a background in mathematics, and I also occasionally publish crosswords and other word puzzles. My sense was that Perl and regular expressions could be enormously helpful in the latter, so I've been teaching myself with the O'Reilly book, but haven't learned enough yet to solve on my own a particular matching question that I could really use an answer to in the next few days, as it's to do with a printer deadline for a puzzle that I'm writing for my best friend's wedding.

      Thanks for these thoughts on the spell-checking approach. The code looks enormously interesting but obviously a lot to work through, and maybe more than I need at the moment? The solutions above appear to be able to tackle my specific problem, but perhaps there is something I'm not seeing.

        Try this with your dictionary file. If performance is a problem break into separate files according to word length.

        #!perl use strict; open IN,'dict.txt' or die "$!"; my %dict=(); for (<IN>){ chomp; $dict{uc $_}=1; }; close IN; for my $word (sort keys %dict){ next unless $word =~ /R/; my @f = split //,$word; # loop over each letter # changing R to S # to create new word in $w for my $i (0..$#f){ if ($f[$i] eq 'R'){ my $w = $word; substr($w,$i,1) = "S"; # check if generated word exists in dict if (exists $dict{$w}){ print "$word $w\n"; } } } }
        poj

        That's fine of course, we just get a lot of "do my homework for me" type questions here. The code works (at least on my machine) and as far as I can tell from your descriptions in this thread it generates the output you're looking for, so feel free to just use it. The learning curve of this particular code may be a little steep, and it contains a lot of benchmarking/statistics code which of course you don't need, but at least the theory behind a Trie is still something you can look into if you like. Doing a linear search on a long word list is inefficient when run often (which doesn't seem to be the case in your situation) so the size/speed tradeoff of a trie can be very useful.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1161587]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (3)
As of 2024-04-18 23:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found