PerlMonks  

Using special characters in left part of a regex match?

by shamat (Acolyte)
on Feb 05, 2013 at 22:37 UTC ( [id://1017293]=perlquestion )

shamat has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I'm using Perl to process ancient texts (which is cool!). The problem is that I am having trouble with pattern matching in regular expressions. In more detail, I have a list of exemplars of Caesar's "De bello gallico", some of them partly damaged. The first line of each exemplar is loaded into an array as follows:
$var[0] = "Gallia est omnis divisa in partes tres";
$var[1] = "Gallia est omnis divisa in ...";
$var[2] = "Gallia est omnis ...";
$var[3] = "Gallia";
$var[4] = "... omnis divisa in ...";
$var[5] = "Gallia est ... tres";
$var[6] = "Gallia ... partes tres";
$var[7] = "Gallia est ... partes tres";
$var[8] = "Gallia ... divisa ... tres";
$var[9] = "... tres";
$var[10] = "quattuor";
The broken words are expressed by "...", which means "one or more words are missing". I would like to compare these strings so that Perl outputs a message when they do not match. This is useful to spot unexpected variations. In this case, only the last string should not match the others. My idea is to substitute "..." with ".+", and so far I got this:
for ($i=0;$i<=$#var;$i++) {
    $var[$i] =~ s/\.\.\./\.\+/g;
    for ($j=$i+1;$j<=$#var;$j++) {
        $var[$j] =~ s/\.\.\./\.\+/g;
        if ($var[$i] !~ m/$var[$j]/) {
            print "$i-$j:\t[$var[$i]] and [$var[$j]] DO NOT MATCH!\n";
        }
    }
}
which prints:

0 - 10 DO NOT MATCH!
1 - 5 DO NOT MATCH!
1 - 6 DO NOT MATCH!
1 - 7 DO NOT MATCH!
1 - 8 DO NOT MATCH!
1 - 9 DO NOT MATCH!
1 - 10 DO NOT MATCH!
2 - 4 DO NOT MATCH!
2 - 5 DO NOT MATCH!
2 - 6 DO NOT MATCH!
2 - 7 DO NOT MATCH!
2 - 8 DO NOT MATCH!
2 - 9 DO NOT MATCH!
2 - 10 DO NOT MATCH!
3 - 4 DO NOT MATCH!
3 - 5 DO NOT MATCH!
3 - 6 DO NOT MATCH!
3 - 7 DO NOT MATCH!
3 - 8 DO NOT MATCH!
3 - 9 DO NOT MATCH!
3 - 10 DO NOT MATCH!
4 - 5 DO NOT MATCH!
4 - 6 DO NOT MATCH!
4 - 7 DO NOT MATCH!
4 - 8 DO NOT MATCH!
4 - 9 DO NOT MATCH!
4 - 10 DO NOT MATCH!
5 - 6 DO NOT MATCH!
5 - 7 DO NOT MATCH!
5 - 8 DO NOT MATCH!
5 - 10 DO NOT MATCH!
6 - 7 DO NOT MATCH!
6 - 8 DO NOT MATCH!
6 - 10 DO NOT MATCH!
7 - 8 DO NOT MATCH!
7 - 10 DO NOT MATCH!
8 - 10 DO NOT MATCH!
9 - 10 DO NOT MATCH!

Is there a way to put special characters in the left part of the regular expression? I think this is what goes wrong with my idea, but I can't find the solution for it. Thank you so much for your help!

Replies are listed 'Best First'.
Re: Using special characters in left part of a regex match?
by punch_card_don (Curate) on Feb 05, 2013 at 23:28 UTC
    To my mind, you have a logic issue long before you have a regex syntax issue.

    Does the first phrase ($var[0]) always define the full version of the phrase? If so, OK, straightforward enough. But if not, then how do you decide which phrases define what is "normal" to be in the phrase, and which have a variant? The human brain may be able to see that naturally enough, but your code has no element of that.

    If the first phrase always defines the complete phrase, then, easy:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    print "Content-type:text/html\n\n";

    my @var;
    $var[0] = "Gallia est omnis divisa in partes tres";
    $var[1] = "Gallia est omnis divisa in ...";
    $var[2] = "Gallia est omnis ...";
    $var[3] = "Gallia";
    $var[4] = "... omnis divisa in ...";
    $var[5] = "Gallia est ... tres";
    $var[6] = "Gallia ... partes tres";
    $var[7] = "Gallia est ... partes tres";
    $var[8] = "Gallia ... divisa ... tres";
    $var[9] = "... tres";
    $var[10] = "quattuor";

    my @base_phrase_words = split(/\s/, $var[0]);
    push(@base_phrase_words, "...");
    my %words = map { $_ => 1 } @base_phrase_words;

    for (my $i=1; $i<=$#var; $i++) {
        my @partial_phrase_words = split(/\s/, $var[$i]);
        foreach my $element (@partial_phrase_words){
            if (exists($words{$element})) {
                # do whatever
            } else {
                print "<p>Found word $element in phrase $i varies from base phrase\n";
            }
        }
    }
    Output
    Found word quattuor in phrase 10 varies from base phrase



    Time flies like an arrow. Fruit flies like a banana.
      Thank you for this one! Indeed there might not be a "full version", although this case is rare. In addition, the most complete version may not occur as the first element of the array. Sorry if my example was misleading.
Re: Using special characters in left part of a regex match?
by kennethk (Abbot) on Feb 06, 2013 at 00:01 UTC
    So, I'm assuming 3 should actually be Gallia ... since it would otherwise be inconsistent with all other options and would not map to the first string. Without the ability to anchor at the front and back, this problem seems unsolvable to me.

    If you have a full string, as in case 0, you can trivially use regular expressions. Otherwise, only your leading and trailing words are actually constraining. This strikes me more as a case where you would split your strings on ... and then compare substrings, which is not consistent with using regular expressions. My version, with index doing the fragment checking:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @var;
    $var[0] = "Gallia est omnis divisa in partes tres";
    $var[1] = "Gallia est omnis divisa in ...";
    $var[2] = "Gallia est omnis ...";
    $var[3] = "Gallia ...";
    $var[4] = "... omnis divisa in ...";
    $var[5] = "Gallia est ... tres";
    $var[6] = "Gallia ... partes tres";
    $var[7] = "Gallia est ... partes tres";
    $var[8] = "Gallia ... divisa ... tres";
    $var[9] = "... tres";
    $var[10] = "quattuor";

    for my $i (0 .. $#var) {
        for my $j ($i+1 .. $#var) {
            print "$i - $j DO NOT MATCH!\n" unless compare($var[$i], $var[$j]);
        }
    }

    sub compare {
        my @str1 = split /\Q...\E/, shift, -1;
        my @str2 = split /\Q...\E/, shift, -1;

        if (@str1 == 1) {        # Regex is possible
            local $" = ".+";
            return $str1[0] =~ /^@str2$/;
        } elsif (@str2 == 1) {   # Regex is still possible
            local $" = ".+";
            return $str2[0] =~ /^@str1$/;
        } else {                 # Fragment matching
            # Openings must be consistent
            if (length $str1[0] > length $str2[0]) {
                return if index($str1[0], $str2[0]) != 0;
            } else {
                return if index($str2[0], $str1[0]) != 0;
            }

            # Closings must be consistent, start search from end
            if (length $str1[-1] > length $str2[-1]) {
                return if index(reverse($str1[-1]), reverse($str2[-1])) != 0;
            } else {
                return if index(reverse($str2[-1]), reverse($str1[-1])) != 0;
            }
        }
        return 1;
    }
    which outputs
    0 - 10 DO NOT MATCH!
    1 - 10 DO NOT MATCH!
    2 - 10 DO NOT MATCH!
    3 - 10 DO NOT MATCH!
    4 - 10 DO NOT MATCH!
    5 - 10 DO NOT MATCH!
    6 - 10 DO NOT MATCH!
    7 - 10 DO NOT MATCH!
    8 - 10 DO NOT MATCH!
    9 - 10 DO NOT MATCH!

    If instead you meant case 10 to be ... quattuor ..., you get

    0 - 10 DO NOT MATCH!

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      > So, I'm assuming 3 should actually be Gallia ...

      This is correct, thanks for spotting it. Unfortunately, I might not have a full string in case 0, so I really should compare every variant with every other one. I also realized that anchor points are important, but case 10 is indeed "... quattuor", and should be evaluated as the only non-variant in the set.
Re: Using special characters in left part of a regex match?
by LanX (Saint) on Feb 05, 2013 at 23:01 UTC
    > Is there a way to put special characters in the left part of the regular expression?

    with special characters you mean the regex wildcards you're including for ... and the answer is no.

    in short

    $var[1] = "Gallia est omnis divisa in .+";
    $var[5] = "Gallia est .+ tres";
    $var[1] =~ $var[5] ;# false
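The point above can be checked directly. The following is a minimal sketch, assuming a hypothetical helper named `matches` (not from the thread) that anchors the pattern at both ends. It shows that only the right-hand operand of `=~` is interpreted as a regex, so a `.+` stored in the left-hand string is just two ordinary characters.

```perl
use strict;
use warnings;

# Hypothetical helper: $text is plain data, $pattern is compiled as a regex.
sub matches {
    my ($text, $pattern) = @_;
    return $text =~ /^$pattern$/ ? 1 : 0;
}

my $full = "Gallia est omnis divisa in partes tres";
my $v1   = "Gallia est omnis divisa in .+";
my $v5   = "Gallia est .+ tres";

# A complete string on the left works: the wildcard lives in the pattern.
print "full vs v5: ", matches($full, $v5), "\n";
# Two damaged strings fail in both directions, although a reader can see
# that one sentence could satisfy both of them.
print "v1 vs v5:   ", matches($v1, $v5), "\n";
print "v5 vs v1:   ", matches($v5, $v1), "\n";
```

This is why comparing two damaged fragments needs some logic beyond a single regex match.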

    what could maybe work is to successively strip common beginning and/or ending parts.

    "Gallia est ..." is the smallest starting substring till an ellipsis.

    Deleting it from both strings:

    $var[1] = "omnis divisa in ...";
    $var[5] = "... tres";

    this configuration is always true.

    such a strategy might work.

    Cheers Rolf

      Thank you for the suggestion, I really appreciate it! I'll try your strategy, which I think may work well with syntactical variations as well.
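One possible way to turn the stripping idea into code is a word-by-word walk over both fragments. This is a sketch of a related technique, not Rolf's exact procedure: it asks whether any complete sentence could satisfy both fragments at once. The names `compatible` and `_walk` are hypothetical, and the sketch assumes that "..." always stands for one or more whole words.

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if some sentence fits both fragments.
sub compatible {
    my ($x, $y) = @_;
    my @p = split ' ', $x;
    my @q = split ' ', $y;
    return _walk(\@p, \@q, 0, 0, {});
}

# Recursive walk over token positions $i and $j, memoised in %$memo.
sub _walk {
    my ($p, $q, $i, $j, $memo) = @_;
    return $memo->{"$i,$j"} if exists $memo->{"$i,$j"};
    my $ok;
    if ($i == @$p && $j == @$q) {
        $ok = 1;    # both fragments are used up together
    }
    elsif ($i == @$p || $j == @$q) {
        $ok = 0;    # every leftover token still needs at least one word
    }
    else {
        my $gap_p = $p->[$i] eq '...';
        my $gap_q = $q->[$j] eq '...';
        if (!$gap_p && !$gap_q) {
            # two known words must be identical
            $ok = $p->[$i] eq $q->[$j] && _walk($p, $q, $i + 1, $j + 1, $memo);
        }
        else {
            # one shared word is consumed; each gap may end here or continue
            $ok = _walk($p, $q, $i + 1, $j + 1, $memo)
               || ($gap_p && _walk($p, $q, $i, $j + 1, $memo))
               || ($gap_q && _walk($p, $q, $i + 1, $j, $memo));
        }
    }
    return $memo->{"$i,$j"} = $ok ? 1 : 0;
}

print compatible("Gallia est omnis divisa in ...", "Gallia est ... tres"), "\n";
print compatible("quattuor", "... tres"), "\n";
```

Because the check is symmetric, it does not need a complete reference string. Note that two fragments with open ends, such as "... quattuor" and "Gallia est omnis ...", count as compatible under this definition.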
Re: Using special characters in left part of a regex match?
by Kenosis (Priest) on Feb 05, 2013 at 23:07 UTC

    Would something like the Levenshtein edit distance assist you (the greater the value, the greater the two strings' distance)?

    use strict;
    use warnings;
    use Text::LevenshteinXS qw(distance);

    my @var;
    $var[0] = "Gallia est omnis divisa in partes tres";
    $var[1] = "Gallia est omnis divisa in ...";
    $var[2] = "Gallia est omnis ...";
    $var[3] = "Gallia";
    $var[4] = "... omnis divisa in ...";
    $var[5] = "Gallia est ... tres";
    $var[6] = "Gallia ... partes tres";
    $var[7] = "Gallia est ... partes tres";
    $var[8] = "Gallia ... divisa ... tres";
    $var[9] = "... tres";
    $var[10] = "quattuor";

    print qq{Each string's 'distance' from "$var[0]":\n\n};

    for ( 0 .. $#var ) {
        print distance( $var[0], $var[$_] ) . " - $var[$_]\n";
    }

    Output:

    Each string's 'distance' from "Gallia est omnis divisa in partes tres":

    0 - Gallia est omnis divisa in partes tres
    11 - Gallia est omnis divisa in ...
    21 - Gallia est omnis ...
    32 - Gallia
    21 - ... omnis divisa in ...
    22 - Gallia est ... tres
    19 - Gallia ... partes tres
    15 - Gallia est ... partes tres
    18 - Gallia ... divisa ... tres
    33 - ... tres
    34 - quattuor
      Is there a Levenshtein implementation that uses words instead of letters? That probably would be suitable to the problem at hand, although not exactly what the OP wanted.
      Thank you Kenosis, I was not aware of that. It may be a useful tool for further research.
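On the question of a word-based Levenshtein distance: the standard dynamic-programming recurrence works on any sequence, so it can be applied to arrays of words. The following is a minimal pure-Perl sketch. The function name `word_distance` is hypothetical and is not part of Text::LevenshteinXS, and "..." is treated as an ordinary token rather than as a wildcard.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: edit distance counted in whole words, not characters.
sub word_distance {
    my ($x, $y) = @_;
    my @s = split ' ', $x;
    my @t = split ' ', $y;

    # @prev holds the distances from the previous row of the table
    my @prev = (0 .. scalar(@t));
    for my $i (1 .. scalar(@s)) {
        my @cur = ($i);
        for my $j (1 .. scalar(@t)) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # delete a word from @s
                $cur[$j - 1] + 1,         # insert a word from @t
                $prev[$j - 1] + $cost,    # keep or substitute a word
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my $base = "Gallia est omnis divisa in partes tres";
print word_distance($base, $_), " - $_\n"
    for "Gallia est omnis ...", "... tres", "quattuor";
```

With words as the unit, a single "..." costs one edit no matter how many letters are missing, so the numbers are smaller and easier to compare than the character-based ones.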
Re: Using special characters in left part of a regex match?
by kcott (Archbishop) on Feb 06, 2013 at 07:10 UTC

    G'day shamat,

    Firstly, my comments (some of which have already been mentioned in earlier responses):

    • Do you really want to compare all fragments with each other? I can envisage a situation where you're attempting to decide whether "... est ..." matches "... in ...". Perhaps you'd want to filter badly damaged fragments from any sort of matching whatsoever.
    • I think you'd be better off comparing the fragments with a single reference string. You wrote "... some of them being partly damaged.", so presumably some of them are complete.
    • You wrote "... only the last string should not match ..." (that would be "quattuor"). If that's the case, "Gallia" should probably be "Gallia ..."
    • The output you show does not match the code that creates it. From the code you posted, I'd be expecting output like:
      N-M:   [string1] and [string2] DO NOT MATCH!

    Here's a solution that takes all of the above into account:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my @exemplars = <DATA>;
    my $reference = shift @exemplars;
    print "Reference string: $reference";

    for (@exemplars) {
        my $exemplar = $_;
        s/[.]{3}/.+?/g;
        if ($reference !~ /$_/) {
            print "NO MATCH: $exemplar";
        }
    }
    __DATA__
    Gallia est omnis divisa in partes tres
    Gallia est omnis divisa in ...
    Gallia est omnis ...
    Gallia
    ... omnis divisa in ...
    Gallia est ... tres
    Gallia ... partes tres
    Gallia est ... partes tres
    Gallia ... divisa ... tres
    ... tres
    quattuor
    Gallia ...

    Output:

    $ pm_latin_fragments.pl
    Reference string: Gallia est omnis divisa in partes tres
    NO MATCH: Gallia
    NO MATCH: quattuor

    -- Ken

      Thank you so much Ken! This is amazing. As for your first comment, I might want to compare all the fragments with each other, which is a very hard job. As a workaround, I added a (clumsy) piece of code to yours, so that the script picks the most complete string as the reference, meaning the string that contains the most words. Here is the code:
      #!/usr/bin/env perl
      @exemplars = <DATA>;

      # Store each line at index (word count - 1); the last slot of @array
      # then holds a line with the most words, so no sorting is needed.
      foreach $line (@exemplars) {
          @words = split (/\s+/, $line);
          $array[$#words] = $line;
      }
      $reference = $array[-1];
      print "Reference string: $reference";

      for (@exemplars) {
          $exemplar = $_;
          s/[.]{3}/.+?/g;
          if ($reference !~ /$_/) {
              print "NO MATCH: $exemplar";
          }
      }
      __DATA__
      Gallia est omnis divisa in partes tres
      Gallia est omnis divisa in ...
      Gallia est omnis ...
      Gallia
      ... omnis divisa in ...
      ... in ...
      ... est ...
      Gallia est ... tres
      Gallia ... partes tres
      Gallia est ... partes tres
      Gallia ... divisa ... tres
      ... tres
      quattuor
      Gallia ...
      Output is the same as yours. I will run some tests, and see what happens.
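Since anchoring came up earlier in the thread, here is a hedged sketch of how a fragment could be turned into a pattern that is anchored at both ends, with the known text escaped so that punctuation in a manuscript cannot act as a regex metacharacter. The helper name `fragment_to_regex` is hypothetical, and the sketch assumes that "..." is always separated from neighbouring words by spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: build an anchored regex from a damaged fragment.
sub fragment_to_regex {
    my ($fragment) = @_;
    $fragment =~ s/^\s+|\s+$//g;    # drop surrounding whitespace and newlines
    # Split on the literal ellipsis and escape each surviving piece.
    my @pieces = map { quotemeta($_) } split /\Q...\E/, $fragment, -1;
    my $body = join '.+', @pieces;  # each gap needs at least one character
    return qr/^$body$/;
}

my $reference = "Gallia est omnis divisa in partes tres";
for my $fragment ("Gallia ...", "Gallia", "... tres", "est ... tres", "quattuor") {
    my $verdict = $reference =~ fragment_to_regex($fragment) ? "match" : "NO MATCH";
    print "$verdict: $fragment\n";
}
```

With the anchors in place, a fragment such as "est ... tres" is rejected because the reference does not begin with "est", whereas an unanchored pattern would accept it.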
Re: Using special characters in left part of a regex match?
by moritz (Cardinal) on Feb 06, 2013 at 00:11 UTC
      > you need to write it as .+ and not as \.\+.

      He doesn't need to; it's not an error.

      But it's more readable without the escaping.

      Cheers Rolf
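A quick way to confirm this is to run both spellings of the substitution and compare the results. In the sketch below the two helper names are hypothetical. In the replacement part of `s///`, a backslash before a non-alphanumeric character simply yields that character, so both versions produce the same string.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same substitution with and without escaping.
sub to_pattern_escaped {
    my ($s) = @_;
    $s =~ s/\.\.\./\.\+/g;    # replacement written as in the original post
    return $s;
}

sub to_pattern_plain {
    my ($s) = @_;
    $s =~ s/\.\.\./.+/g;      # replacement without the backslashes
    return $s;
}

my $fragment = "Gallia ... divisa ... tres";
print to_pattern_escaped($fragment), "\n";
print to_pattern_plain($fragment), "\n";
```

The backslashes in the search part are a different matter: there the dots must stay escaped (or be written as `[.]{3}`), otherwise any three characters would be replaced.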

Re: Using special characters in left part of a regex match?
by CountZero (Bishop) on Feb 05, 2013 at 22:41 UTC
    "De Bello Gallico" was written by Julius Caesar.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
      And what happened to the version my Latin studies taught?
      "Omnia Gallia in tres partes divisa est."
        Good question. The goal of the database is to spot lexical and morphological variations, not the syntactical ones. The problem obviously arises when an exemplar showing syntactical variations also has lexical and morphological ones. I'm still working on that anyway.
      True, correction needed.
