PerlMonks  

problem with optional capture group

by Special_K (Monk)
on Dec 22, 2020 at 16:48 UTC (id://11125614)

Special_K has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to use a regex to match lines that will always have an opening <div tag and could optionally have a closing </div tag on the same line. If the closing </div tag is present, additional code will be executed. Here is a sample that illustrates my problem:

#!/usr/bin/perl -w
use strict;

my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\"></div>";

if ($line =~ /<div.+?(<\/div)*/) {
    printf("line matched\n");
    if (defined($1)) {
        printf("right after match, 1 is defined\n");
    }
}

The output of running the above is:

line matched

I can't figure out why the closing div tag isn't being captured. I thought adding the non-greedy ? would prevent any closing div tags from getting consumed by the .+, but even with that addition the closing div tag isn't being captured.
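A minimal sketch of what that regex actually consumes, assuming the sample line above. The helper name `probe` is hypothetical and only there for illustration. It returns the whole matched text (`$&`) along with `$1`, which shows that the overall match succeeds after a single character past `<div`, long before the engine ever reaches the closing tag:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (matched text, capture group 1) for the
# original regex, or an empty list if there is no match.
sub probe {
    my ($line) = @_;
    return unless $line =~ /<div.+?(<\/div)*/;
    return ($&, $1);
}

my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';
my ($matched, $captured) = probe($line);

printf "matched:  '%s'\n", $matched;                                   # '<div '
printf "captured: %s\n", defined $captured ? "'$captured'" : 'undef';  # undef
```

The non-greedy `.+?` takes one character (the space), the group then matches zero times at that position, and the regex is done. Nothing forces the engine to look any further.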



EDIT: After more searching I found this SO thread which describes the same basic problem I have: https://stackoverflow.com/questions/28782603/regex-optional-capturing-group
After reviewing the example in that thread, I modified my original code to the following, which does work:

#!/usr/bin/perl -w
use strict;

my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\"></div>";

if ($line =~ /<div.+?(<\/div>)*$/) {
    printf("line matched\n");
    if (defined($1)) {
        printf("right after match, 1 is defined\n");
    }
}

I had to add the $ anchor at the end and also the closing > to the optional div capture group. I still don't quite understand how the regex engine is parsing this regex, however:

1. Why is it necessary to add the '>' in order for the capture group to work?

2. If I replace the '*' at the end of the optional capture group with a '?' (non-greedy quantifier), the group is still captured. Are '*' and '?' equivalent when applied to a group?

3. If I omit the '$' from the above regex, the optional div is not captured. The referenced SO thread says this regarding why the regex without the '$' fails to capture the optional group ('cat' changed to 'div' to be consistent with my code):

The reason that you do not get an optional div after a reluctantly-qualified .+? is that it is both optional and non-anchored: the engine is not forced to make that match, because it can legally treat the div as the "tail" of the .+? sequence.

My question is: generally speaking, how does Perl handle the case in which an optional or non-greedy match (.+? in this case) is followed by another optional or non-greedy match ((<\/div>)* in this case)? Does it always prefer to use more characters for one match (i.e. act greedy) rather than make additional matches (when those matches are optional)?
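As a rough way to explore the three questions above, the variants can be run side by side. This is only a sketch: `capture_for` is a hypothetical helper, and the labels are mine. The engine returns the first overall match it finds, so a lazy `.+?` only grows when whatever follows it cannot succeed yet. The `$` is what forces that growth, and the `>` is what lets `$` succeed immediately after the group has matched:

```perl
use strict;
use warnings;

my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';

# Hypothetical helper: returns capture group 1, or undef if the regex does
# not match or the group did not take part in the match.
sub capture_for {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

my @variants = (
    [ 'no >, no $'        => qr/<div.+?(<\/div)*/    ],  # undef
    [ 'no >, with $'      => qr/<div.+?(<\/div)*$/   ],  # undef
    [ 'with > and $, "*"' => qr/<div.+?(<\/div>)*$/  ],  # '</div>'
    [ 'with > and $, "?"' => qr/<div.+?(<\/div>)?$/  ],  # '</div>'
);

for my $variant (@variants) {
    my ($label, $re) = @$variant;
    my $got = capture_for($re, $line);
    printf "%-20s => %s\n", $label, defined $got ? "'$got'" : 'undef';
}
```

In the second variant the group does match `</div` at one point, but `$` then fails because `>` is still left, so the engine backtracks, lets `.+?` swallow the rest of the line, and finishes with the group matching zero times. On a group, `*` means zero or more and `?` means zero or one (it only means non-greedy when it follows another quantifier); they behave the same here because the line has a single closing tag.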

Replies are listed 'Best First'.
Re: problem with optional capture group
by davido (Cardinal) on Dec 22, 2020 at 17:20 UTC

    Just use a proper DOM parser and be done with it. You don't have to work extra hard to produce an inferior regex solution to a problem that has already been solved robustly and that has a solution with very low barrier to entry.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Mojo::DOM;
    use Mojo::Util 'trim';

    my $html = do {local $/ = undef; <DATA>};
    my $dom = Mojo::DOM->new($html);

    foreach my $div ($dom->find('div')->each) {
        printf "Found div with id [%s], class [%s] and content [%s]\n",
            $div->{'id'} // '', $div->{'class'} // '', trim($div->content // '');
    }

    __DATA__
    <div id="roguebin-response-35911" class="bin-response"></div>
    <div id="othertest-1" class="foobar">content here
    </div>

    Notice the second div spans more than one line. This is a problem you don't have to solve yourself. It may be relatively trivial to do so, but this is only the first of many problems people encounter using regexes to treat HTML as regular text when it's not.

    Here's the output:

    Found div with id [roguebin-response-35911], class [bin-response] and content []
    Found div with id [othertest-1], class [foobar] and content [content here]

    Mojolicious takes up about two megabytes of storage once installed, and it also provides a UserAgent, a web framework, and a testing tool. You can install it using cpanm, or by downloading the tarball from Mojolicious, unpacking it, and running perl Makefile.PL && make && make test && make install. In containers, Carton is a nice way of specifying the dependency. Installation takes about a minute and has no non-core Perl requirements, although you do need to be on a version of Perl less than six years old.

    If you want to trigger some behavior based on whether the div is on a different line, just search $div->content for \n. But unless you're writing some sort of tool for cleaning up HTML it's usually best to write code that doesn't care about the formatting of the HTML.


    Dave

Re: problem with optional capture group
by hippo (Bishop) on Dec 22, 2020 at 17:31 UTC

    While I entirely agree with davido's exhortation to use a proper parser for HTML, I will answer your question because it is (in this one instance) fairly trivial. The capture group does not match because you have used the asterisk as the quantifier after it. This matches zero or more instances, and zero is all it can match at the point where the non-greedy .+? first hands over, one character after <div.

    Here's your code with a few small tweaks and the key change of using the plus as the quantifier:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $line = '<div id="roguebin-response-35911" class="bin-response"></div>';

    if ($line =~ /<div.+?(<\/div)+/) {
        print "line matched\n";
        if (defined $1) {
            print "right after match, 1 is defined\n";
        }
    }

    Similarly you don't really need a quantifier at all here because there is only one closing div in the string and one is the default quantity of anything in a regex.

    I've used print instead of printf because you are not doing any format conversion. I've removed some unnecessary brackets and have used single quotes to delimit the initial string so the internal double quotes no longer need escaping (and you aren't interpolating in this string either).

    But seriously, use a parser.


    🦛

      The problem with using the + quantifier is that the entire regex will not match a line that has an opening <div tag but no closing </div tag on the same line. For example, your modification will not match the following:

      my $line = "<div id=\"roguebin-response-35911\" class=\"bin-response\">";

      I was hoping to write a single regex that will handle both cases, i.e. an opening <div tag with a closing </div tag on the same line, and an opening <div tag with no closing </div tag on the same line. I understand a built in parser would make this task easier, but I would still like to understand how to write a single regex that would capture both of these cases.
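One possible single-regex approach, offered only as a sketch and not as the answer given elsewhere in this thread: keep the lazy `.+?`, but follow it with an alternation that is satisfied either by the closing tag or by the end of the line. The helper name `closing_tag` is hypothetical. Because the lazy quantifier must keep growing until one of the two branches succeeds, the closing tag is found first whenever it is present:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (matched?, closing tag or undef).
# The /x flag lets the pattern be spaced out for readability.
sub closing_tag {
    my ($line) = @_;
    return (0, undef) unless $line =~ m{ <div .+? (?: (</div) | $ ) }x;
    return (1, $1);
}

for my $line (
    '<div id="roguebin-response-35911" class="bin-response"></div>',
    '<div id="roguebin-response-35911" class="bin-response">',
    'no opening tag here',
) {
    my ($ok, $tag) = closing_tag($line);
    printf "%-8s %s\n",
        $ok ? 'matched' : 'no match',
        defined $tag ? "closing tag '$tag'" : 'no closing tag captured';
}
```

Note that `$` also matches just before a trailing newline, so this sketch assumes the input is handled one line at a time.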

        Win8 Strawberry 5.8.9.5 (32) Tue 12/22/2020 16:43:09
        C:\@Work\Perl\monks
        >perl -Mstrict -Mwarnings

        for my $line (
            '<div id="foo-bar-321" class="bin-boff"></div>',
            '<div id="foo-bar-321" class="bin-boff"> </div>',
            '<div id="foo-bar-321" class="bin-boff">foo</div>',
            '<div id="foo-bar-321" class="bin-boff"> foo </div>',
            '<div id="foo-bar-321" class="bin-boff">',
            '<div id="foo-bar-321" class="bin-boff"> ',
            '<div id="foo-bar-321" class="bin-boff">foo',
            '<div id="foo-bar-321" class="bin-boff"> foo',
        ) {
            if ($line =~ m{ <div (?: (?! </div) .)+ (</div)? }xms) {
                print "line matched \n   '$&' \n";
                if (defined $1) {
                    print "   right after match, \$1 is defined '$1' \n";
                }
            }
        }
        ^Z
        line matched
           '<div id="foo-bar-321" class="bin-boff"></div'
           right after match, $1 is defined '</div'
        line matched
           '<div id="foo-bar-321" class="bin-boff"> </div'
           right after match, $1 is defined '</div'
        line matched
           '<div id="foo-bar-321" class="bin-boff">foo</div'
           right after match, $1 is defined '</div'
        line matched
           '<div id="foo-bar-321" class="bin-boff"> foo </div'
           right after match, $1 is defined '</div'
        line matched
           '<div id="foo-bar-321" class="bin-boff">'
        line matched
           '<div id="foo-bar-321" class="bin-boff"> '
        line matched
           '<div id="foo-bar-321" class="bin-boff">foo'
        line matched
           '<div id="foo-bar-321" class="bin-boff"> foo'


        Give a man a fish:  <%-{-{-{-<

Re: problem with optional capture group
by choroba (Cardinal) on Dec 22, 2020 at 21:57 UTC
    You can use an alternation with the capture group in the first branch. That branch is tried first, so $1 is set whenever the closing tag is present.
    #!/usr/bin/perl
    use strict;
    use feature qw{ say };
    use warnings;

    for my $line ('<div id="roguebin-response-35911" class="bin-response"></div>',
                  '<div id="roguebin-response-35911" class="bin-response">'
    ) {
        if ($line =~ m{<div.+(</div)|<div.+}) {
            say "line matched";
            if (defined $1) {
                say "right after match, 1 is defined: $1";
            }
        }
    }
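To illustrate why the order of the branches matters, here is a small sketch comparing the pattern above with a version whose branches are swapped. The helper `try_pattern` and the variable names are hypothetical. Perl tries alternatives left to right at each starting position and takes the first one that succeeds, so a capture-free branch placed first always wins and `$1` stays undefined:

```perl
use strict;
use warnings;

# Hypothetical helper: returns [matched?, capture group 1] for a pattern.
sub try_pattern {
    my ($re, $line) = @_;
    return $line =~ $re ? [1, $1] : [0, undef];
}

my $capture_first = qr{<div.+(</div)|<div.+};   # branch order used above
my $capture_last  = qr{<div.+|<div.+(</div)};   # same branches, swapped

my $with_close    = '<div id="roguebin-response-35911" class="bin-response"></div>';
my $without_close = '<div id="roguebin-response-35911" class="bin-response">';

for my $line ($with_close, $without_close) {
    for my $pair (['capturing branch first', $capture_first],
                  ['capturing branch last',  $capture_last]) {
        my ($name, $re) = @$pair;
        my ($ok, $cap)  = @{ try_pattern($re, $line) };
        printf "%-24s matched=%d capture=%s\n",
            $name, $ok, defined $cap ? "'$cap'" : 'undef';
    }
}
```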
    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: problem with optional capture group
by GrandFather (Saint) on Dec 22, 2020 at 20:05 UTC

    As a general rule, any .* or .+ match is suspect. An often useful trick is to use a negative lookahead to nibble one character at a time:

    my $str = "Send it to bob.sled\@gmail.com. Don't send it to ski.guy\@yahoo.com.";
    print "$1\n" if $str =~ /(( (?!\.\s). )+ \.\s)/x;

    Prints:

    Send it to bob.sled@gmail.com.
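For contrast, a small sketch with a made-up sample string (not from the post above) showing how a plain greedy `.+` differs from the lookahead "nibble". The greedy version runs to the last delimiter it can find, while the lookahead version cannot step over the first one:

```perl
use strict;
use warnings;

# Made-up sample text containing two ". " delimiters.
my $text = 'First one. Second one. Third';

# Greedy: .+ grabs everything, then backs off to the LAST ". " in the string.
my ($greedy) = $text =~ /(.+\.\s)/;

# Nibbling: each character is consumed only if ". " does not start there,
# so the match has to stop at the FIRST ". ".
my ($tempered) = $text =~ /((?:(?!\.\s).)+\.\s)/;

print "greedy:   '$greedy'\n";    # 'First one. Second one. '
print "tempered: '$tempered'\n";  # 'First one. '
```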
    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
