Using special characters in left part of a regex match?

shamat has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Using special characters in left part of a regex match? by punch_card_don (Curate) on Feb 05, 2013 at 23:28 UTC
To my mind, you have a logic issue long before you have a regex syntax issue. Does the first phrase ($var[0]) always define the full version of the phrase? If so, fine, that is straightforward enough. But if not, how do you decide which phrases define what is "normal" for the phrase, and which contain a variant? The human brain may see that naturally enough, but your code has no notion of it.

If the first phrase always defines the complete phrase, then it is easy:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    print "Content-type:text/html\n\n";

    my @var;
    $var[0] = "Gallia est omnis divisa in partes tres";
    $var[1] = "Gallia est omnis divisa in ...";
    $var[2] = "Gallia est omnis ...";
    $var[3] = "Gallia";
    $var[4] = "... omnis divisa in ...";
    $var[5] = "Gallia est ... tres";
    $var[6] = "Gallia ... partes tres";
    $var[7] = "Gallia est ... partes tres";
    $var[8] = "Gallia ... divisa ... tres";
    $var[9] = "... tres";
    $var[10] = "quattuor";

    # Every word of the base phrase, plus the ellipsis, counts as known.
    my @base_phrase_words = split(/\s/, $var[0]);
    push(@base_phrase_words, "...");
    my %words = map { $_ => 1 } @base_phrase_words;

    for (my $i=1; $i<=$#var; $i++) {
        my @partial_phrase_words = split(/\s/, $var[$i]);
        foreach my $element (@partial_phrase_words){
            if (exists($words{$element})) {
                # do whatever
            } else {
                print "<p>Found word $element in phrase $i varies from base phrase\n";
            }
        }
    }

Output:

`Found word quattuor in phrase 10 varies from base phrase`

Time flies like an arrow. Fruit flies like a banana.
Re^2: Using special characters in left part of a regex match? by shamat (Acolyte) on Feb 06, 2013 at 20:44 UTC
Thank you for this one! Indeed there might not be a "full version", although this case is rare. In addition, the most complete version may not occur as the first element of the array. Sorry if my example was misleading.
Re: Using special characters in left part of a regex match? by kennethk (Abbot) on Feb 06, 2013 at 00:01 UTC
So, I'm assuming 3 should actually be `Gallia ...` since it would otherwise be inconsistent with all other options and would not map to the first string. Without the ability to anchor at the front and back, this problem seems unsolvable to me. If you have a full string, as in case 0, you can trivially use regular expressions. Otherwise, only your leading and trailing words are actually constraining. This strikes me more as a case where you would split your strings on `...` and then compare substrings -- this is not consistent with using regular expressions.

My version, with index doing the work of fragment checking:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @var;
    $var[0] = "Gallia est omnis divisa in partes tres";
    $var[1] = "Gallia est omnis divisa in ...";
    $var[2] = "Gallia est omnis ...";
    $var[3] = "Gallia ...";
    $var[4] = "... omnis divisa in ...";
    $var[5] = "Gallia est ... tres";
    $var[6] = "Gallia ... partes tres";
    $var[7] = "Gallia est ... partes tres";
    $var[8] = "Gallia ... divisa ... tres";
    $var[9] = "... tres";
    $var[10] = "quattuor";

    for my $i (0 .. $#var) {
        for my $j ($i+1 .. $#var) {
            print "$i - $j DO NOT MATCH!\n" unless compare($var[$i], $var[$j]);
        }
    }

    sub compare {
        my @str1 = split /\Q...\E/, shift, -1;
        my @str2 = split /\Q...\E/, shift, -1;

        if (@str1 == 1) { # Regex is possible
            local $" = ".+";
            return $str1[0] =~ /^@str2$/;
        } elsif (@str2 == 1) { # Regex is still possible
            local $" = ".+";
            return $str2[0] =~ /^@str1$/;
        } else { # Fragment matching
            # Openings must be consistent
            if (length $str1[0] > length $str2[0]) {
                return if index($str1[0], $str2[0]) != 0;
            } else {
                return if index($str2[0], $str1[0]) != 0;
            }

            # Closings must be consistent, start search from end
            if (length $str1[-1] > length $str2[-1]) {
                return if index(reverse($str1[-1]), reverse($str2[-1])) != 0;
            } else {
                return if index(reverse($str2[-1]), reverse($str1[-1])) != 0;
            }
        }
        return 1;
    }

which outputs

    0 - 10 DO NOT MATCH!
    1 - 10 DO NOT MATCH!
    2 - 10 DO NOT MATCH!
    3 - 10 DO NOT MATCH!
    4 - 10 DO NOT MATCH!
    5 - 10 DO NOT MATCH!
    6 - 10 DO NOT MATCH!
    7 - 10 DO NOT MATCH!
    8 - 10 DO NOT MATCH!
    9 - 10 DO NOT MATCH!

If instead you meant case 10 to be `... quattuor ...`, you get

    0 - 10 DO NOT MATCH!

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Re^2: Using special characters in left part of a regex match? by shamat (Acolyte) on Feb 06, 2013 at 20:53 UTC
> So, I'm assuming 3 should actually be Gallia ...

This is correct, thanks for spotting it. Unfortunately, I might not have a full string in case 0, so I really should compare every variant with every other one. I also realized that anchor points are important, but case 10 is indeed "... quattuor", and should be evaluated as the only non-variant in the set.
Re: Using special characters in left part of a regex match? by LanX (Saint) on Feb 05, 2013 at 23:01 UTC
> Is there a way to put special characters in the left part of the regular expression?

By special characters you mean the regex wildcards you are substituting for `...`, and the answer is no. In short:

    $var[1] = "Gallia est omnis divisa in .+";
    $var[5] = "Gallia est .+ tres";
    $var[1] =~ $var[5];   # false

What might work is to successively strip common beginning and/or ending parts. "Gallia est ..." is the shortest leading part that runs up to an ellipsis. Deleting the shared words from both strings leaves:

    $var[1] = "omnis divisa in ...";
    $var[5] = "... tres";

This configuration is always true, so such a strategy might work.

Cheers Rolf
Re^2: Using special characters in left part of a regex match? by shamat (Acolyte) on Feb 06, 2013 at 16:56 UTC
Thank you for the suggestion, I really appreciate it! I'll try your strategy, which I think may work well with syntactical variations as well.
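The stripping strategy suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not anyone's posted solution: fragments are treated as whitespace-separated words, `...` is taken to stand for one or more missing words, and the names `compatible` and `as_regex` are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: build an anchored, word-level regex from a token list.
# Assumption: "..." stands for one or more whole words.
sub as_regex {
    my $pattern = join ' ',
        map { $_ eq '...' ? '\S+(?: \S+)*' : quotemeta($_) } @_;
    return qr/^$pattern$/;
}

# Hypothetical helper: returns 1 if two fragments could describe the same
# phrase, 0 otherwise.
sub compatible {
    my @a = split ' ', shift;
    my @b = split ' ', shift;

    # Strip identical leading words until either side reaches an ellipsis.
    while (@a && @b && $a[0] ne '...' && $b[0] ne '...') {
        return 0 if shift(@a) ne shift(@b);
    }
    # Do the same from the end.
    while (@a && @b && $a[-1] ne '...' && $b[-1] ne '...') {
        return 0 if pop(@a) ne pop(@b);
    }

    my $gaps_a = grep { $_ eq '...' } @a;
    my $gaps_b = grep { $_ eq '...' } @b;

    # Both remainders still contain a gap: the known words no longer
    # constrain each other, so the fragments are treated as compatible.
    return 1 if $gaps_a && $gaps_b;

    # Otherwise at most one remainder has gaps; it must match the other one.
    my ($with_gaps, $plain) = $gaps_a ? (\@a, \@b) : (\@b, \@a);
    return join(' ', @$plain) =~ as_regex(@$with_gaps) ? 1 : 0;
}

print compatible("Gallia est omnis divisa in ...", "Gallia est ... tres"), "\n";    # 1
print compatible("Gallia est omnis divisa in partes tres", "quattuor"), "\n";       # 0
```

Note that under these assumptions `... quattuor` and `Gallia ...` also come out as compatible, since nothing rules out a phrase that starts with one and ends with the other; telling them apart still needs a fuller reference string.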
Re: Using special characters in left part of a regex match? by Kenosis (Priest) on Feb 05, 2013 at 23:07 UTC
Would something like the Levenshtein edit distance assist you (the greater the value, the greater the two strings' distance)?

    use strict;
    use warnings;
    use Text::LevenshteinXS qw(distance);

    my @var;
    $var[0] = "Gallia est omnis divisa in partes tres";
    $var[1] = "Gallia est omnis divisa in ...";
    $var[2] = "Gallia est omnis ...";
    $var[3] = "Gallia";
    $var[4] = "... omnis divisa in ...";
    $var[5] = "Gallia est ... tres";
    $var[6] = "Gallia ... partes tres";
    $var[7] = "Gallia est ... partes tres";
    $var[8] = "Gallia ... divisa ... tres";
    $var[9] = "... tres";
    $var[10] = "quattuor";

    print qq{Each string's 'distance' from "$var[0]":\n\n};

    for ( 0 .. $#var ) {
        print distance( $var[0], $var[$_] ) . " - $var[$_]\n";
    }

Output:

    Each string's 'distance' from "Gallia est omnis divisa in partes tres":

    0 - Gallia est omnis divisa in partes tres
    11 - Gallia est omnis divisa in ...
    21 - Gallia est omnis ...
    32 - Gallia
    21 - ... omnis divisa in ...
    22 - Gallia est ... tres
    19 - Gallia ... partes tres
    15 - Gallia est ... partes tres
    18 - Gallia ... divisa ... tres
    33 - ... tres
    34 - quattuor
Re^2: Using special characters in left part of a regex match? by soonix (Canon) on Feb 06, 2013 at 20:31 UTC
Is there a Levenshtein implementation that uses words instead of letters? That probably would be suitable to the problem at hand, although not exactly what the OP wanted.
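For what it's worth, a word-based variant is short enough to hand-roll. The following is a minimal sketch, not the API of any CPAN module: `word_distance` is a made-up name, words are split on whitespace, and `...` is compared like any other word.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: the usual dynamic-programming edit distance, but
# counted in whole words rather than characters.
sub word_distance {
    my @s = split ' ', shift;
    my @t = split ' ', shift;

    # Distances from the empty prefix of @s to every prefix of @t.
    my @prev = (0 .. scalar(@t));

    for my $i (1 .. scalar(@s)) {
        my @curr = ($i);
        for my $j (1 .. scalar(@t)) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,            # delete a word
                $curr[$j - 1] + 1,        # insert a word
                $prev[$j - 1] + $cost,    # substitute (or keep) a word
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

my $base = "Gallia est omnis divisa in partes tres";
for my $phrase ("Gallia est ... tres", "quattuor") {
    print word_distance($base, $phrase), " - $phrase\n";
}
```

Because the ellipsis counts as one ordinary word here, a long gap still costs several deletions, so this measures how much text is missing rather than whether two fragments agree.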
Re^2: Using special characters in left part of a regex match? by shamat (Acolyte) on Feb 06, 2013 at 20:38 UTC
Thank you Kenosis, I was not aware of that; it may be a useful tool for further research.
Re: Using special characters in left part of a regex match? by kcott (Archbishop) on Feb 06, 2013 at 07:10 UTC
G'day shamat,

Firstly, my comments (some of which have already been mentioned in earlier responses):

Do you really want to compare all fragments with each other? I can envisage a situation where you're attempting to decide whether `"... est ..."` matches `"... in ..."`. Perhaps you'd want to filter badly damaged fragments from any sort of matching whatsoever.

I think you'd be better off comparing the fragments with a single reference string. You wrote "... some of them being partly damaged.", so presumably some of them are complete.

You wrote "... only the last string should not match ..." (that would be `"quattuor"`). If that's the case, `"Gallia"` should probably be `"Gallia ..."`

The output you show does not match the code that creates it. From the code you posted, I'd be expecting output like: `N-M: [string1] and [string2] DO NOT MATCH!`

Here's a solution that takes all of the above into account:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my @exemplars = <DATA>;
    my $reference = shift @exemplars;

    print "Reference string: $reference";

    for (@exemplars) {
        my $exemplar = $_;
        s/[.]{3}/.+?/g;
        if ($reference !~ /$_/) {
            print "NO MATCH: $exemplar";
        }
    }

    __DATA__
    Gallia est omnis divisa in partes tres
    Gallia est omnis divisa in ...
    Gallia est omnis ...
    Gallia
    ... omnis divisa in ...
    Gallia est ... tres
    Gallia ... partes tres
    Gallia est ... partes tres
    Gallia ... divisa ... tres
    ... tres
    quattuor
    Gallia ...

Output:

    $ pm_latin_fragments.pl
    Reference string: Gallia est omnis divisa in partes tres
    NO MATCH: Gallia
    NO MATCH: quattuor

-- Ken
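A possible variation on the reference-string approach, offered only as a sketch: anchor the pattern explicitly at both ends and quote the literal parts, so that punctuation in a fragment is never read as regex syntax. The helper name `fragment_to_regex` is invented for this example, and it assumes the input lines have already been chomped.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a fragment such as "Gallia ... tres" into an
# anchored regex. Literal parts are quoted; each "..." becomes ".+".
sub fragment_to_regex {
    my ($fragment) = @_;
    my $pattern = join '.+',
        map { quotemeta($_) } split /\Q...\E/, $fragment, -1;
    return qr/^$pattern$/;
}

my $reference = "Gallia est omnis divisa in partes tres";

for my $fragment ("Gallia ...", "Gallia", "... tres", "quattuor") {
    my $verdict = $reference =~ fragment_to_regex($fragment) ? "match" : "NO MATCH";
    print "$verdict: $fragment\n";
}
```

The `-1` limit on `split` keeps the empty trailing field, which is what makes a final `...` turn into a trailing `.+` instead of being dropped.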
Re^2: Using special characters in left part of a regex match? by shamat (Acolyte) on Feb 06, 2013 at 21:31 UTC
Thank you so much Ken! This is amazing. As for your first comment, I might want to compare all the fragments with each other, which is a very hard job. As a workaround, I added a (clumsy) piece of code to yours, so that the script picks the most complete string as the reference, meaning the string that contains the most words. Here is the code:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my @exemplars = <DATA>;

    # Pick the line with the most whitespace-separated words as the reference
    # (on a tie, the later line wins).
    my ($reference, $most_words) = ('', 0);
    for my $line (@exemplars) {
        my @words = split ' ', $line;
        ($reference, $most_words) = ($line, scalar @words) if @words >= $most_words;
    }

    print "Reference string: $reference";

    for (@exemplars) {
        my $exemplar = $_;
        s/[.]{3}/.+?/g;
        if ($reference !~ /$_/) {
            print "NO MATCH: $exemplar";
        }
    }

    __DATA__
    Gallia est omnis divisa in partes tres
    Gallia est omnis divisa in ...
    Gallia est omnis ...
    Gallia
    ... omnis divisa in ...
    ... in ...
    ... est ...
    Gallia est ... tres
    Gallia ... partes tres
    Gallia est ... partes tres
    Gallia ... divisa ... tres
    ... tres
    quattuor
    Gallia ...

Output is the same as yours. I will run some tests, and see what happens.
Re: Using special characters in left part of a regex match? by moritz (Cardinal) on Feb 06, 2013 at 00:11 UTC
`$var[$j] =~ s/\.\.\./\.\+/g;`

The right half of an `s///` substitution is not a regex, but a mere string. So if you want to construct `.+` in the replacement, you need to write it as `.+` and not as `\.\+`.

Perl 6 - the future is here, just unevenly distributed
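A quick way to see how the replacement side is treated is to run both spellings side by side. This is just a small check, with variable names chosen for the example; the replacement is interpolated like a double-quoted string, where a backslash before punctuation simply yields that character.

```perl
use strict;
use warnings;

my $escaped = "Gallia est ... tres";
my $plain   = "Gallia est ... tres";

$escaped =~ s/\.\.\./\.\+/g;   # replacement written with backslashes
$plain   =~ s/\.\.\./.+/g;     # replacement written without them

print "$escaped\n";
print "$plain\n";
print $escaped eq $plain ? "identical\n" : "different\n";
```

Both variables end up holding `Gallia est .+ tres`, so the unescaped form is the one to prefer purely for readability.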
Re^2: Using special characters in left part of a regex match? by LanX (Saint) on Feb 06, 2013 at 00:14 UTC
> you need to write it as .+ and not as \.\+.

He doesn't need to; it's not an error. But it is more readable without the escaping.

Cheers Rolf
Re: Using special characters in left part of a regex match? by CountZero (Bishop) on Feb 05, 2013 at 22:41 UTC
"De Bello Gallico" was written by Julius Caesar.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics
Re^2: Using special characters in left part of a regex match? by ww (Archbishop) on Feb 06, 2013 at 00:42 UTC
And what happened to the version my Latin studies taught? "Omnia Gallia in tres partes divisa est."
Re^3: Using special characters in left part of a regex match? by shamat (Acolyte) on Feb 06, 2013 at 16:53 UTC
Good question. The goal of the database is to spot lexical and morphological variations, not the syntactical ones. The problem obviously arises when an exemplar showing syntactical variations also has lexical and morphological ones. I'm still working on that anyway.
Re^2: Using special characters in left part of a regex match? by shamat (Acolyte) on Feb 05, 2013 at 22:42 UTC
True, correction needed.

