Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Enforcing growth of regex

by Hena (Friar)
on Nov 23, 2005 at 09:08 UTC ( [id://511031]=perlquestion: print w/replies, xml ) Need Help??

Hena has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks

Again I'm parsing strings that are used in reference. I'm currently facing a problem of how to make regex continue to find another match as I've used anchoring both caret and dollar sign.

The input data is like this
110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's
anemia: anomaly of enzyme passage through the nuclear membrane? Anomalous
intracellular distribution of topoisomerase activity in placental
extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155,
1981.
And I'm trying to seperate the journal name, which in this case is 'Hum. Genet.'.

Now the script I used to read input in file is below
#!/usr/bin/perl # # parse publications strings # use warnings; use strict; use File::Basename; use Data::Dumper; # use open IN => ":any", OUT=> ":utf8"; my $TITLE = 'title'; my $YEAR = 'year'; my $START_PAGE = 'start_page'; my $END_PAGE = 'end_page'; my $JOURNAL = 'journal'; my $TYPE = 'type'; my $AUTHORS = 'authors'; my $VOLUME = 'volume'; sub parse_pub2 ($) { my $string = shift @_; local $_; my %ret = (); # use re "debug"; # pos($string) = 0; # if ($string =~ m/^\d+\.([^:]+): ?((?:[^.]+\([^\)]+\)[^.?!]+|[^.?!] ++)[.?!]) ?(\([\w.]+\))? (.+)$/i) { # if ($string =~ m/^([^:]+): (.+?[.?!]) (\(\w+\.?\) )?([A-Z](?=\w*[. + ]).+)$/i) { # while ($string =~ m/\G^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[ +A-Z]\w+[. ])([A-Z].+)$/g) { while ($string =~ m/^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[A-Z] +\w+[. ])([A-Z].+)$/g) { my $authors = $1; $ret{$TITLE} = $2; if ($4) { $ret{$TYPE} = $3; $_ = $4; } else { $_ = $3; } if (m/^([^:]+) ([\w()]+): (\d+)-(\d+), (\d+)\./) { $ret{$JOURNAL}=$1; $ret{$VOLUME}=$2; $ret{$START_PAGE}=$3; $ret{$END_PAGE}=$4; $ret{$YEAR}=$5; my @array = split (/ /,$ret{$JOURNAL}); # last if (10 > scalar(@{[split (/ /,$ret{$JOURNAL})]})); last if (scalar(@array) < 10); } else { $ret{$JOURNAL}=$_; last; } } return %ret; } my $omimf = shift @ARGV || "-"; open (INF,"$omimf") or die "Unable to open '$omimf': $!"; my $i = 1; # line number my $space = 0; # was last line space my $extra = ""; # some entries are in multiple lines while (<INF>) { m/^#/ && next; chomp; s/\r$//; if (!$_) { $space = 1; } else { $space = 0; } if ($space && $extra) { chop ($extra); print "$extra\n"; my %pub = parse_pub2($extra); #print Dumper(%pub); if (defined($pub{$VOLUME}) && defined($pub{$YEAR}) && defined($pub{$START_PAGE}) && defined($pub{$JOURNAL})) { #print "${pub{$JOURNAL}}[JO] AND ${pub{$YEAR}}[DP] AND ", # "${pub{$VOLUME}}[VI] AND ${pub{$START_PAGE}}[PG]\n"; print "J:$pub{$JOURNAL}\n\n"; } else { #print "NOT PMID($i): "; #foreach (sort keys %pub) { printf "$_ -> %s,",defined($pub{$_}) + ? $pub{$_} : "undef" } #print "\n"; } $extra = ""; } else { $extra .= "$_ "; } } #continue { printf ("\r%d", $i++); } print "\n"; continue { $i++ } exit;
Heres the same but with input as scalar to eliminate some extra code bits (as requested).
#!/usr/bin/perl # # parse publications strings # use warnings; use strict; use Data::Dumper; my $TITLE = 'title'; my $YEAR = 'year'; my $START_PAGE = 'start_page'; my $END_PAGE = 'end_page'; my $JOURNAL = 'journal'; my $TYPE = 'type'; my $AUTHORS = 'authors'; my $VOLUME = 'volume'; sub parse_pub ($) { my $string = shift @_; local $_; my %ret = (); # pos($string) = 0; # if ($string =~ m/^\d+\.([^:]+): ?((?:[^.]+\([^\)]+\)[^.?!]+|[^.?!] ++)[.?!]) ?(\([\w.]+\))? (.+)$/i) { # if ($string =~ m/^([^:]+): (.+?[.?!]) (\(\w+\.?\) )?([A-Z](?=\w*[. + ]).+)$/i) { # while ($string =~ m/\G^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[ +A-Z]\w+[. ])([A-Z].+)$/g) { while ($string =~ m/^\d+\. ([^:]+): (.+?[.?!]) (\(\w+.?\) )?(?=[A-Z] +\w+[. ])([A-Z].+)$/g) { my $authors = $1; $ret{$TITLE} = $2; if ($4) { $ret{$TYPE} = $3; $_ = $4; } else { $_ = $3; } if (m/^([^:]+) ([\w()]+): (\d+)-(\d+), (\d+)\./) { $ret{$JOURNAL}=$1; $ret{$VOLUME}=$2; $ret{$START_PAGE}=$3; $ret{$END_PAGE}=$4; $ret{$YEAR}=$5; my @array = split (/ /,$ret{$JOURNAL}); # last if (10 > scalar(@{[split (/ /,$ret{$JOURNAL})]})); last if (scalar(@array) < 10); } else { last; } } return %ret; } my $line = "110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fa +nconi's anemia: anomaly of enzyme passage through the nuclear membran +e? Anomalous intracellular distribution of topoisomerase activity in +placental extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149 +-155, 1981."; print "$line\n"; my %pub = parse_pub($line); #print Dumper(%pub); print "J:$pub{$JOURNAL}\n\n"; exit;
This prints out
110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's anem +ia: anomaly of enzyme passage through the nuclear membrane? Anomalous + intracellular distribution of topoisomerase activity in placental ex +tracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155, 1981. J:Anomalous intracellular distribution of topoisomerase activity in pl +acental extracts in a case of Fanconi's anemia. Hum. Genet.
What I would want it to print out is
110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's anem +ia: anomaly of enzyme passage through the nuclear membrane? Anomalous + intracellular distribution of topoisomerase activity in placental ex +tracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155, 1981. J:Hum. Genet.
Now obviously as I've set it to find shortest string ending with one of '.!?' the match is correct. However later I test if the journal name (comes from second pattern match) is longer than 10 words. If it is 10 or more, I would like it to continue from where it left off and try to find more. I tried with while(m//g) {}, but no luck. Also using \G I failed. So any idea how to do this?

Thanks.

Replies are listed 'Best First'.
Re: Enforcing growth of regex
by liverpole (Monsignor) on Nov 23, 2005 at 13:38 UTC
    The reason you're not getting the answer you want is that you haven't yet presented the problem in an easy-to-understand format.  I don't think it is obvious that you're looking for "a string ending with one of '.!?'".  For one thing, there isn't such a regex in your program.  Maybe you meant ".?!", from the line "m/^\d+\. (^:+): (.+?.?!) ..."?  But, I'm sorry, I still have no idea what you're trying to do.  I spent about 10 minutes just trying to get your program to output something, before I realized the input has to contain multiple lines, and then a following blank line.  This is the kind of thing where you have to help us to help you, by saying something along the lines of:  "To run this program, perform the following steps ...".

    My suggestion is that you simplify this down to just a few lines, where the problem you're having is exhibited in a brief example.  Make the input a hardcoded list; eg.:

    my @lines = ( "110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's", "anemia: anomaly of enzyme passage through the nuclear membrane? Ano +malous", "intracellular distribution of topoisomerase activity in placental", "extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155,", "1981.", "", );
    so it doesn't take the user a lot of extra setup to be able to duplicate the problem.  Perhaps you can even show several examples (with code, not just words) of what you've tried.  The reason I'm suggesting this is twofold -- not only will you make the problem quicker to reproduce, and thus more tempting to try to solve, but you may actually find an answer yourself in the process of elucidating what you've tried.

    @ARGV=split//,"/:L"; map{print substr crypt($_,ord pop),2,3}qw"PerlyouC READPIPE provides"
Re: Enforcing growth of regex
by Eimi Metamorphoumai (Deacon) on Nov 23, 2005 at 16:21 UTC
    I think I understand what your problem is. Basically, you're trying to use a regexp match in a loop, creating your own backoff by altering part that was matched. As you found out, that's really not going to work. Although I'm sure there are different ways to approach this, the most natural seems to be to put all your matching into a single regexp. So when you say you want the journal to be no more than 10 words, specify that in your regexp. Then the regexp engine can do the backoff for you, and life should be good. My code below is rather severely rewritten, mostly because that's what it took for me to understand what you were doing.
    #!/usr/bin/perl # # parse publications strings # use warnings; use strict; use Data::Dumper; my $TITLE = 'title'; my $YEAR = 'year'; my $START_PAGE = 'start_page'; my $END_PAGE = 'end_page'; my $JOURNAL = 'journal'; my $TYPE = 'type'; my $AUTHORS = 'authors'; my $VOLUME = 'volume'; sub parse_pub ($) { my $string = shift @_; local $_; my %ret = (); @ret{$AUTHORS, $TITLE, $TYPE, $JOURNAL, $VOLUME, $START_PAGE, $END_PAGE, $YEAR} = $string =~ m/^\d+\.\s+ #citation number ([^:]+):\s+ #authors (.+?[.?!])\s+ #title (as short as possible) (\(\w+.?\)\s+)? #type (optional) ((?:\w+[.?!]?\s+){1,10}?) #journal ([\w()]+):\s+ #volume (\d+)-(\d+),\s+ #start page, end page (\d+)\.?$ #year /x or return undef; #not sure the best way to fail gracefully $ret{$JOURNAL} =~ s/\s+$//; return %ret; } my $line = "110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fa +nconi's anemia: anomaly of enzyme passage through the nuclear membran +e? Anomalous intracellular distribution of topoisomerase activity in +placental extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149 +-155, 1981."; print "$line\n"; my %pub = parse_pub($line); #print Dumper(\%pub); print "J:$pub{$JOURNAL}\n\n";
      That does it, thanks :). I had it splitted to two sections, since there are citations that do not follow the "normal" format given above and the first would have parsed that input as well.

      But I quess that I can do another pattern matching if that pattern fails. Since that does the same thing really.

      Btw. How does that assingment to %ret works? AFAIK the match returns the list of words (camel book p.151 is an example). But how @ret turns to %ret I do not understand.
Re: Enforcing growth of regex
by Perl Mouse (Chaplain) on Nov 23, 2005 at 09:40 UTC
    I guess the regex you have a problem with is burried somewhere in the large blob of code you posted.

    Could you write a short program that shows the problem you are having?

    Perl --((8:>*
    A reply falls below the community's threshold of quality. You may see it by logging in.
Re: Enforcing growth of regex
by TedPride (Priest) on Nov 23, 2005 at 09:55 UTC
    Giving us a sample of proper output would probably be helpful.
      Umm... like what I want it to print? It was there, but I quess too obscurely. Anyways this:

      110. Wunder, E.; Burghardt, U.; Lang, B.; Hamilton, L.: Fanconi's anemia: anomaly of enzyme passage through the nuclear membrane? Anomalous intracellular distribution of topoisomerase activity in placental extracts in a case of Fanconi's anemia. Hum. Genet. 58: 149-155, 1981.
      J: Hum. Genet.

      But other fields in %pub should be captured as well. So it cannot be completely changed without further thinking.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://511031]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (4)
As of 2024-04-24 05:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found