Re: Using special characters in left part of a regex match?
by punch_card_don (Curate) on Feb 05, 2013 at 23:28 UTC
|
To my mind, you have a logic issue long before you have a regex syntax issue.
Does the first phrase ($var[0]) always define the full version of the phrase? If so, OK, straightforward enough. But if not, then how do you decide which phrases define what is "normal" to be in the phrase, and which have a variant? The human brain may be able to see that naturally enough, but your code has no element of that.
If the first phrase always defines the complete phrase, then, easy:
#!/usr/bin/perl -w
use strict;
use warnings;
print "Content-type:text/html\n\n";
my @var;
$var[0] = "Gallia est omnis divisa in partes tres";
$var[1] = "Gallia est omnis divisa in ...";
$var[2] = "Gallia est omnis ...";
$var[3] = "Gallia";
$var[4] = "... omnis divisa in ...";
$var[5] = "Gallia est ... tres";
$var[6] = "Gallia ... partes tres";
$var[7] = "Gallia est ... partes tres";
$var[8] = "Gallia ... divisa ... tres";
$var[9] = "... tres";
$var[10] = "quattuor";
my @base_phrase_words = split(/\s/, $var[0]);
push(@base_phrase_words, "...");
my %words = map { $_ => 1 } @base_phrase_words;
for (my $i=1; $i<=$#var; $i++) {
    my @partial_phrase_words = split(/\s/, $var[$i]);
    foreach my $element (@partial_phrase_words){
        if (exists($words{$element})) {
            # do whatever
        }
        else {
            print "<p>Found word $element in phrase $i varies from base phrase\n";
        }
    }
}
Output
Found word quattuor in phrase 10 varies from base phrase
Time flies like an arrow. Fruit flies like a banana.
|
Thank you for this one! Indeed there might not be a "full version", although this case is rare. In addition, the most complete version may not occur as the first element of the array. Sorry if my example was misleading.
Re: Using special characters in left part of a regex match?
by kennethk (Abbot) on Feb 06, 2013 at 00:01 UTC
|
So, I'm assuming 3 should actually be Gallia ... since it would otherwise be inconsistent with all other options and would not map to the first string. Without the ability to anchor at the front and back, this problem seems unsolvable to me.
If you have a full string, as in case 0, you can trivially use regular expressions. Otherwise, only your leading and trailing words are actually constraining. This strikes me more as a case where you would split your strings on ... and then compare substrings, which is not really a job for regular expressions. My version, with index doing the fragment checking:
#!/usr/bin/perl
use strict;
use warnings;
my @var;
$var[0] = "Gallia est omnis divisa in partes tres";
$var[1] = "Gallia est omnis divisa in ...";
$var[2] = "Gallia est omnis ...";
$var[3] = "Gallia ...";
$var[4] = "... omnis divisa in ...";
$var[5] = "Gallia est ... tres";
$var[6] = "Gallia ... partes tres";
$var[7] = "Gallia est ... partes tres";
$var[8] = "Gallia ... divisa ... tres";
$var[9] = "... tres";
$var[10] = "quattuor";
for my $i (0 .. $#var) {
    for my $j ($i+1 .. $#var) {
        print "$i - $j DO NOT MATCH!\n" unless compare($var[$i], $var[$j]);
    }
}
sub compare {
    my @str1 = split /\Q...\E/, shift, -1;
    my @str2 = split /\Q...\E/, shift, -1;
    if (@str1 == 1) { # Regex is possible
        local $" = ".+";
        return $str1[0] =~ /^@str2$/;
    } elsif (@str2 == 1) { # Regex is still possible
        local $" = ".+";
        return $str2[0] =~ /^@str1$/;
    } else { # Fragment matching
        # Openings must be consistent
        if (length $str1[0] > length $str2[0]) {
            return if index($str1[0], $str2[0]) != 0;
        } else {
            return if index($str2[0], $str1[0]) != 0;
        }
        # Closings must be consistent, start search from end
        if (length $str1[-1] > length $str2[-1]) {
            return if index(reverse($str1[-1]), reverse($str2[-1])) != 0;
        } else {
            return if index(reverse($str2[-1]), reverse($str1[-1])) != 0;
        }
    }
    return 1;
}
which outputs
0 - 10 DO NOT MATCH!
1 - 10 DO NOT MATCH!
2 - 10 DO NOT MATCH!
3 - 10 DO NOT MATCH!
4 - 10 DO NOT MATCH!
5 - 10 DO NOT MATCH!
6 - 10 DO NOT MATCH!
7 - 10 DO NOT MATCH!
8 - 10 DO NOT MATCH!
9 - 10 DO NOT MATCH!
If instead you meant case 10 to be ... quattuor ..., you get
0 - 10 DO NOT MATCH!
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
|
"So, I'm assuming 3 should actually be Gallia ..."
This is correct, thanks for pointing it out. Unfortunately, I might not have a full string in case 0, so I really should compare every variant with every other one. I also realized that anchor points are important, but case 10 is indeed "... quattuor", and should be evaluated as the only non-variant in the set.
Re: Using special characters in left part of a regex match?
by LanX (Saint) on Feb 05, 2013 at 23:01 UTC
|
$var[1] = "Gallia est omnis divisa in .+";
$var[5] = "Gallia est .+ tres";
$var[1] =~ $var[5] ;# false
What might work is to successively strip common beginning and/or ending parts.
"Gallia est ..." is the shortest leading substring that runs up to an ellipsis.
Deleting it from both strings:
$var[1] = "omnis divisa in ...";
$var[5] = "... tres";
This configuration always matches, because each end is now covered by an ellipsis in one of the two strings.
Such a strategy might work.
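A minimal sketch of that stripping idea, in case it helps. It works on whole words rather than characters, and the helper name compatible is made up for illustration. It assumes "..." is always a separate whitespace-delimited token that stands for at least one word, and it only checks the outer anchors: whatever remains between the ellipses is not verified.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if two fragments agree on the words before
# their first "..." and after their last "...", 0 otherwise.
sub compatible {
    my @a = split ' ', shift;
    my @b = split ' ', shift;

    # Strip common leading words until either side reaches an ellipsis.
    while (@a && @b && $a[0] ne '...' && $b[0] ne '...') {
        return 0 if shift(@a) ne shift(@b);
    }
    # Strip common trailing words the same way.
    while (@a && @b && $a[-1] ne '...' && $b[-1] ne '...') {
        return 0 if pop(@a) ne pop(@b);
    }

    # Both sides used up together: the fragments are identical.
    return 1 if !@a && !@b;
    # One side ran out while the other still has words or an ellipsis.
    # Assumption: "..." hides at least one word, so this is a mismatch.
    return 0 if !@a || !@b;
    # Otherwise each end is covered by an ellipsis on one side or the other.
    return 1;
}

print compatible("Gallia est omnis divisa in ...", "Gallia est ... tres")
    ? "match\n" : "no match\n";
print compatible("Gallia ... partes tres", "... quattuor")
    ? "match\n" : "no match\n";
```

The first call strips "Gallia est" and then stops at the ellipses, so it reports a match; the second fails on the trailing words "tres" and "quattuor".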
|
Thank you for the suggestion, I really appreciate it! I'll try your strategy, which I think may work well with syntactical variations as well.
Re: Using special characters in left part of a regex match?
by Kenosis (Priest) on Feb 05, 2013 at 23:07 UTC
|
Would something like the Levenshtein edit distance assist you (the greater the value, the greater the two strings' distance)?
use strict;
use warnings;
use Text::LevenshteinXS qw(distance);
my @var;
$var[0] = "Gallia est omnis divisa in partes tres";
$var[1] = "Gallia est omnis divisa in ...";
$var[2] = "Gallia est omnis ...";
$var[3] = "Gallia";
$var[4] = "... omnis divisa in ...";
$var[5] = "Gallia est ... tres";
$var[6] = "Gallia ... partes tres";
$var[7] = "Gallia est ... partes tres";
$var[8] = "Gallia ... divisa ... tres";
$var[9] = "... tres";
$var[10] = "quattuor";
print qq{Each string's 'distance' from "$var[0]":\n\n};
for ( 0 .. $#var ) {
    print distance( $var[0], $var[$_] ) . " - $var[$_]\n";
}
Output:
Each string's 'distance' from "Gallia est omnis divisa in partes tres":
0 - Gallia est omnis divisa in partes tres
11 - Gallia est omnis divisa in ...
21 - Gallia est omnis ...
32 - Gallia
21 - ... omnis divisa in ...
22 - Gallia est ... tres
19 - Gallia ... partes tres
15 - Gallia est ... partes tres
18 - Gallia ... divisa ... tres
33 - ... tres
34 - quattuor
|
Is there a Levenshtein implementation that uses words instead of letters?
That probably would be suitable to the problem at hand, although not exactly what the OP wanted.
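I don't know of a module offhand, but a word-based variant is short enough to write by hand. The following is only a sketch: word_distance is a made-up name, it uses the usual two-row dynamic-programming form of the algorithm, and it treats "..." as just another word, which may or may not be what is wanted here.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: Levenshtein distance counted in whole words
# (insert, delete, or substitute one word per step).
sub word_distance {
    my @s = split ' ', shift;
    my @t = split ' ', shift;

    # Row for the empty prefix of @s: j insertions to reach j words of @t.
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);    # i deletions to reach the empty prefix of @t
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $cur[$j] = min(
                $prev[$j] + 1,            # delete a word from @s
                $cur[$j - 1] + 1,         # insert a word from @t
                $prev[$j - 1] + $cost,    # keep or substitute a word
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my $base = "Gallia est omnis divisa in partes tres";
print word_distance($base, $_), " - $_\n"
    for "Gallia est ... tres", "quattuor";
```

With word units, "Gallia est ... tres" comes out at distance 4 from the base phrase and "quattuor" at 7, so the unrelated string stands out more clearly than it does with character distances.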
|
Thank you Kenosis, I was not aware of that. It may be a useful tool for further research.
Re: Using special characters in left part of a regex match?
by kcott (Archbishop) on Feb 06, 2013 at 07:10 UTC
|
G'day shamat,
Firstly, my comments (some of which have already been mentioned in earlier responses):
- Do you really want to compare all fragments with each other? I can envisage a situation where you're attempting to decide whether "... est ..." matches "... in ...". Perhaps you'd want to filter badly damaged fragments from any sort of matching whatsoever.
- I think you'd be better off comparing the fragments with a single reference string. You wrote "... some of them being partly damaged.", so presumably some of them are complete.
- You wrote "... only the last string should not match ..." (that would be "quattuor"). If that's the case, "Gallia" should probably be "Gallia ..."
- The output you show does not match the code that creates it. From the code you posted, I'd be expecting output like:
N-M: [string1] and [string2] DO NOT MATCH!
Here's a solution that takes all of the above into account:
#!/usr/bin/env perl
use strict;
use warnings;
my @exemplars = <DATA>;
my $reference = shift @exemplars;
print "Reference string: $reference";
for (@exemplars) {
    my $exemplar = $_;
    s/[.]{3}/.+?/g;
    if ($reference !~ /$_/) {
        print "NO MATCH: $exemplar";
    }
}
__DATA__
Gallia est omnis divisa in partes tres
Gallia est omnis divisa in ...
Gallia est omnis ...
Gallia
... omnis divisa in ...
Gallia est ... tres
Gallia ... partes tres
Gallia est ... partes tres
Gallia ... divisa ... tres
... tres
quattuor
Gallia ...
Output:
$ pm_latin_fragments.pl
Reference string: Gallia est omnis divisa in partes tres
NO MATCH: Gallia
NO MATCH: quattuor
|
Thank you so much Ken! This is amazing. As for your first comment, I might want to compare all the fragments with each other, which is a very hard job. As a workaround, I added a (clumsy) piece of code to yours, so that the script picks the most complete string as the reference one, meaning the string which contains the most words. Here is the code:
#!/usr/bin/env perl
use strict;
use warnings;
my @exemplars = <DATA>;
# Pick the exemplar with the most real words (ellipses not counted)
# as the reference string; the first one wins on a tie.
my $reference = '';
my $max_words = 0;
foreach my $line (@exemplars) {
    my @words = grep { $_ ne '...' } split ' ', $line;
    if (@words > $max_words) {
        $max_words = @words;
        $reference = $line;
    }
}
print "Reference string: $reference";
for (@exemplars) {
    my $exemplar = $_;
    # Turn each ellipsis into a non-greedy "one or more characters" pattern.
    s/[.]{3}/.+?/g;
    if ($reference !~ /$_/) {
        print "NO MATCH: $exemplar";
    }
}
__DATA__
Gallia est omnis divisa in partes tres
Gallia est omnis divisa in ...
Gallia est omnis ...
Gallia
... omnis divisa in ...
... in ...
... est ...
Gallia est ... tres
Gallia ... partes tres
Gallia est ... partes tres
Gallia ... divisa ... tres
... tres
quattuor
Gallia ...
Output is the same as yours. I will run some tests, and see what happens.
Re: Using special characters in left part of a regex match?
by moritz (Cardinal) on Feb 06, 2013 at 00:11 UTC
|
$var[$j] =~ s/\.\.\./\.\+/g;
The right half of an s/// substitution is not a regex, but a mere string (interpolated like a double-quoted one). So if you want to construct .+ in the replacement, just write it as .+; the backslashes in \.\+ are unnecessary there.
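A quick check of how the replacement side behaves, as a sketch with made-up variable names. Because the replacement is interpolated like a double-quoted string, a backslash in front of a character with no special meaning there is simply dropped, so both spellings end up producing the same text and the unescaped one is the cleaner choice.

```perl
use strict;
use warnings;

my $escaped = "Gallia est ... tres";
my $plain   = $escaped;

$escaped =~ s/\.\.\./\.\+/g;   # backslashes in the replacement are dropped
$plain   =~ s/\.\.\./.+/g;     # same replacement, written without them

print "$escaped\n";   # Gallia est .+ tres
print "$plain\n";     # Gallia est .+ tres
```

On the left-hand side the backslashes do matter, since an unescaped . there would match any character.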
Re: Using special characters in left part of a regex match?
by CountZero (Bishop) on Feb 05, 2013 at 22:41 UTC
|
"De Bello Gallico" was written by Julius Caesar.
CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
|
And what happened to the version my Latin studies taught?
"Omnia Gallia in tres partes divisa est."
|
Good question. The goal of the database is to spot lexical and morphological variations, not the syntactical ones. The problem obviously arises when an exemplar showing syntactical variations also has lexical and morphological ones. I'm still working on that anyway.