Enforcing growth of regex

Hena has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks

Again I'm parsing strings that are used in reference. I'm currently facing a problem of how to make regex continue to find another match as I've used anchoring both caret and dollar sign.

The input data is like this

110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's
anemia: anomaly of enzyme passage through the nuclear membrane? Anomalous
intracellular distribution of topoisomerase activity in placental
extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155,
1981.

And I'm trying to seperate the journal name, which in this case is 'Hum. Genet.'.

Now the script I used to read input in file is below

#!/usr/bin/perl
#
# parse publications strings
#
use warnings;
use strict;
use File::Basename;
use Data::Dumper;
# use open IN => ":any", OUT=> ":utf8";

my $TITLE = 'title';
my $YEAR = 'year';
my $START_PAGE = 'start_page';
my $END_PAGE = 'end_page';
my $JOURNAL = 'journal';
my $TYPE = 'type';
my $AUTHORS = 'authors';
my $VOLUME = 'volume';

sub parse_pub2 ($) {
  my $string = shift @_;
  local $_;
  my %ret = ();

  # use re "debug";
  # pos($string) = 0;
  # if ($string =~ m/^\d+\.([^:]+): ?((?:[^.]+\([^\)]+\)[^.?!]+|[^.?!]
++)[.?!]) ?(\([\w.]+\))? (.+)$/i) {
  # if ($string =~ m/^([^:]+): (.+?[.?!]) (\(\w+\.?\) )?([A-Z](?=\w*[.
+ ]).+)$/i) {
  # while ($string =~ m/\G^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[
+A-Z]\w+[. ])([A-Z].+)$/g) {
  while ($string =~ m/^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[A-Z]
+\w+[. ])([A-Z].+)$/g) {
    my $authors = $1;
    $ret{$TITLE} = $2;
    if ($4) {
      $ret{$TYPE} = $3;
      $_ = $4;
    } else {
      $_ = $3;
    }
    if (m/^([^:]+) ([\w()]+): (\d+)-(\d+), (\d+)\./) {
      $ret{$JOURNAL}=$1;
      $ret{$VOLUME}=$2;
      $ret{$START_PAGE}=$3;
      $ret{$END_PAGE}=$4;
      $ret{$YEAR}=$5;
      my @array = split (/ /,$ret{$JOURNAL});
      # last if (10 > scalar(@{[split (/ /,$ret{$JOURNAL})]}));
      last if (scalar(@array) < 10);
    } else {
      $ret{$JOURNAL}=$_;
      last;
    }
  }
  return %ret;
}

my $omimf = shift @ARGV || "-";

open (INF,"$omimf") or die "Unable to open '$omimf': $!";

my $i = 1;        # line number
my $space = 0;        # was last line space
my $extra = "";        # some entries are in multiple lines

while (<INF>) {
  m/^#/ && next;
  chomp;
  s/\r$//;
  if (!$_) {
    $space = 1;
  } else {
    $space = 0;
  }
  if ($space && $extra) {
    chop ($extra);
    print "$extra\n";
    my %pub = parse_pub2($extra);
    #print Dumper(%pub);
    if (defined($pub{$VOLUME}) && defined($pub{$YEAR}) &&
        defined($pub{$START_PAGE}) && defined($pub{$JOURNAL})) {
      #print  "${pub{$JOURNAL}}[JO] AND ${pub{$YEAR}}[DP] AND ",
      #           "${pub{$VOLUME}}[VI] AND ${pub{$START_PAGE}}[PG]\n";
      print "J:$pub{$JOURNAL}\n\n";
    } else {
      #print "NOT PMID($i): ";
      #foreach (sort keys %pub) { printf "$_ -> %s,",defined($pub{$_})
+ ? $pub{$_} : "undef" }
      #print "\n";
    }
    $extra = "";
  } else {
    $extra .= "$_ ";
  }
} #continue { printf ("\r%d", $i++); } print "\n";
continue { $i++ }
exit;
[download]

Heres the same but with input as scalar to eliminate some extra code bits (as requested).

#!/usr/bin/perl
#
# parse publications strings
#
use warnings;
use strict;
use Data::Dumper;

my $TITLE = 'title';
my $YEAR = 'year';
my $START_PAGE = 'start_page';
my $END_PAGE = 'end_page';
my $JOURNAL = 'journal';
my $TYPE = 'type';
my $AUTHORS = 'authors';
my $VOLUME = 'volume';

sub parse_pub ($) {
  my $string = shift @_;
  local $_;
  my %ret = ();

  # pos($string) = 0;
  # if ($string =~ m/^\d+\.([^:]+): ?((?:[^.]+\([^\)]+\)[^.?!]+|[^.?!]
++)[.?!]) ?(\([\w.]+\))? (.+)$/i) {
  # if ($string =~ m/^([^:]+): (.+?[.?!]) (\(\w+\.?\) )?([A-Z](?=\w*[.
+ ]).+)$/i) {
  # while ($string =~ m/\G^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[
+A-Z]\w+[. ])([A-Z].+)$/g) {
  while ($string =~ m/^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[A-Z]
+\w+[. ])([A-Z].+)$/g) {
    my $authors = $1;
    $ret{$TITLE} = $2;
    if ($4) {
      $ret{$TYPE} = $3;
      $_ = $4;
    } else {
      $_ = $3;
    }
    if (m/^([^:]+) ([\w()]+): (\d+)-(\d+), (\d+)\./) {
      $ret{$JOURNAL}=$1;
      $ret{$VOLUME}=$2;
      $ret{$START_PAGE}=$3;
      $ret{$END_PAGE}=$4;
      $ret{$YEAR}=$5;
      my @array = split (/ /,$ret{$JOURNAL});
      # last if (10 > scalar(@{[split (/ /,$ret{$JOURNAL})]}));
      last if (scalar(@array) < 10);
    } else {
      last;
    }
  }
  return %ret;
}

my $line = "110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fa
+nconi's anemia: anomaly of enzyme passage through the nuclear membran
+e? Anomalous intracellular distribution of topoisomerase activity in 
+placental extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149
+-155, 1981.";

print "$line\n";
my %pub = parse_pub($line);
#print Dumper(%pub);
print "J:$pub{$JOURNAL}\n\n";
exit;
[download]

This prints out

110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's anem
+ia: anomaly of enzyme passage through the nuclear membrane? Anomalous
+ intracellular distribution of topoisomerase activity in placental ex
+tracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155, 1981.
J:Anomalous intracellular distribution of topoisomerase activity in pl
+acental extracts in a case of Fanconi's anemia. Hum. Genet.
[download]

What I would want it to print out is

110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's anem
+ia: anomaly of enzyme passage through the nuclear membrane? Anomalous
+ intracellular distribution of topoisomerase activity in placental ex
+tracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155, 1981.
J:Hum. Genet.
[download]

Now obviously as I've set it to find shortest string ending with one of '.!?' the match is correct. However later I test if the journal name (comes from second pattern match) is longer than 10 words. If it is 10 or more, I would like it to continue from where it left off and try to find more. I tried with while(m//g) {}, but no luck. Also using \G I failed. So any idea how to do this?

Thanks.

Comment on Enforcing growth of regex Select or Download Code

Replies are listed 'Best First'.
Re: Enforcing growth of regex by liverpole (Monsignor) on Nov 23, 2005 at 13:38 UTC
The reason you're not getting the answer you want is that you haven't yet presented the problem in an easy-to-understand format. I don't think it is obvious that you're looking for "a string ending with one of '.!?'". For one thing, there isn't such a regex in your program. Maybe you meant ".?!", from the line "m/^\d+\. (^:+): (.+?.?!) ..."? But, I'm sorry, I still have no idea what you're trying to do. I spent about 10 minutes just trying to get your program to output something, before I realized the input has to contain multiple lines, and then a following blank line. This is the kind of thing where you have to help us to help you, by saying something along the lines of: "To run this program, perform the following steps ...". My suggestion is that you simplify this down to just a few lines, where the problem you're having is exhibited in a brief example. Make the input a hardcoded list; eg.: `my @lines = ( "110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's", "anemia: anomaly of enzyme passage through the nuclear membrane? Ano +malous", "intracellular distribution of topoisomerase activity in placental", "extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155,", "1981.", "", );` [download] so it doesn't take the user a lot of extra setup to be able to duplicate the problem. Perhaps you can even show several examples (with code, not just words) of what you've tried. The reason I'm suggesting this is twofold -- not only will you make the problem quicker to reproduce, and thus more tempting to try to solve, but you may actually find an answer yourself in the process of elucidating what you've tried. @ARGV=split//,"/:L"; map{print substr crypt($_,ord pop),2,3}qw"PerlyouC READPIPE provides"	[reply] [d/l]
Re: Enforcing growth of regex by Eimi Metamorphoumai (Deacon) on Nov 23, 2005 at 16:21 UTC
I think I understand what your problem is. Basically, you're trying to use a regexp match in a loop, creating your own backoff by altering part that was matched. As you found out, that's really not going to work. Although I'm sure there are different ways to approach this, the most natural seems to be to put all your matching into a single regexp. So when you say you want the journal to be no more than 10 words, specify that in your regexp. Then the regexp engine can do the backoff for you, and life should be good. My code below is rather severely rewritten, mostly because that's what it took for me to understand what you were doing. #!/usr/bin/perl # # parse publications strings # use warnings; use strict; use Data::Dumper; my $TITLE = 'title'; my $YEAR = 'year'; my $START_PAGE = 'start_page'; my $END_PAGE = 'end_page'; my $JOURNAL = 'journal'; my $TYPE = 'type'; my $AUTHORS = 'authors'; my $VOLUME = 'volume'; sub parse_pub ($) { my $string = shift @_; local $_; my %ret = (); @ret{$AUTHORS, $TITLE, $TYPE, $JOURNAL, $VOLUME, $START_PAGE, $END_PAGE, $YEAR} = $string =~ m/^\d+\.\s+ #citation number ([^:]+):\s+ #authors (.+?[.?!])\s+ #title (as short as possible) ($\w+.?$\s+)? #type (optional) ((?:\w+[.?!]?\s+){1,10}?) #journal ([\w()]+):\s+ #volume (\d+)-(\d+),\s+ #start page, end page (\d+)\.?$ #year /x or return undef; #not sure the best way to fail gracefully $ret{$JOURNAL} =~ s/\s+$//; return %ret; } my $line = "110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fa +nconi's anemia: anomaly of enzyme passage through the nuclear membran +e? Anomalous intracellular distribution of topoisomerase activity in +placental extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149 +-155, 1981."; print "$line\n"; my %pub = parse_pub($line); #print Dumper(\%pub); print "J:$pub{$JOURNAL}\n\n"; [download]	[reply] [d/l]
Re^2: Enforcing growth of regex by Hena (Friar) on Nov 24, 2005 at 07:53 UTC
That does it, thanks :). I had it splitted to two sections, since there are citations that do not follow the "normal" format given above and the first would have parsed that input as well. But I quess that I can do another pattern matching if that pattern fails. Since that does the same thing really. Btw. How does that assingment to %ret works? AFAIK the match returns the list of words (camel book p.151 is an example). But how @ret turns to %ret I do not understand.	[reply]
Re: Enforcing growth of regex by Perl Mouse (Chaplain) on Nov 23, 2005 at 09:40 UTC
I guess the regex you have a problem with is burried somewhere in the large blob of code you posted. Could you write a short program that shows the problem you are having? `Perl --((8:>*`	[reply]
A reply falls below the community's threshold of quality. You may see it by logging in.
Re: Enforcing growth of regex by TedPride (Priest) on Nov 23, 2005 at 09:55 UTC
Giving us a sample of proper output would probably be helpful.	[reply]
Re^2: Enforcing growth of regex by Hena (Friar) on Nov 23, 2005 at 10:41 UTC
Umm... like what I want it to print? It was there, but I quess too obscurely. Anyways this: 110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's anemia: anomaly of enzyme passage through the nuclear membrane? Anomalous intracellular distribution of topoisomerase activity in placental extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155, 1981. J: Hum. Genet. But other fields in %pub should be captured as well. So it cannot be completely changed without further thinking.	[reply]


Syntactic Confectionery Delight
	PerlMonks