PerlMonks  

Regexp question

by davidov0009 (Scribe)
on Dec 20, 2007 at 16:42 UTC

davidov0009 has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I am trying to match a string within some HTML but seem to be having some trouble doing so. The following is part of the HTML source that I am looking to match:

name="challenge" value="26eca68705b3b0c76a6d0602937ac524"

I want to grab the string for value. I am using grouping in my regexp and expect therefore that $1 will return 26eca68...etc, essentially the string in between the quotes for the value parameter. Here is the regexp I am using:

$returned_data =~ s/\n//g;
my $c_code = $returned_data;
$c_code =~ m/name="challenge" value="(.*)"/; #"
print $1 . "\n";

When $1 is printed, it contains not just the text between the quotes but everything after the opening quote of the value attribute, like this:

26eca68705b3b0c76a6d0602937ac524" /><input type="hidden" id="md5pass" name="md5pass" value="1" /><input type="hidden" id="noerror" name="noerror" value="1" /><label><span>Email:</span><input type="text" class="inputtext" name="....MORE HTML SOURCE...

Any ideas as to why $1 seems to be capturing more than it should?

use strict; use CGI;

Replies are listed 'Best First'.
Re: Regexp question
by RMGir (Prior) on Dec 20, 2007 at 16:48 UTC
    You've run into the "greedy" behaviour of * - it grabs the longest possible match.

    perlre explains it well.

    The workaround is to modify the * to be non-greedy, by adding a ?

    $c_code =~ m/name="challenge" value="(.*?)"/;
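A minimal sketch of the difference, run against a shortened copy of the HTML from the question. The variable names `$html`, `$greedy`, and `$lazy` are assumptions made for illustration, not names from the original code.

```perl
use strict;
use warnings;

# Shortened sample based on the HTML quoted in the question.
my $html = 'name="challenge" value="26eca68705b3b0c76a6d0602937ac524" />'
         . '<input type="hidden" id="md5pass" name="md5pass" value="1" />';

# Greedy: .* runs to the end of the string, then backtracks to the LAST quote.
my ($greedy) = $html =~ m/name="challenge" value="(.*)"/;

# Non-greedy: .*? stops at the FIRST quote that lets the match succeed.
my ($lazy) = $html =~ m/name="challenge" value="(.*?)"/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture runs on into the following input tag, while the non-greedy one stops at the closing quote of the challenge value.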

    Mike
Re: Regexp question
by kyle (Abbot) on Dec 20, 2007 at 16:50 UTC

    The * quantifier in regular expressions is "greedy". That means it will match as much as it possibly can. You can get a non-greedy match by adding ? after it like so:

    m/name="challenge" value="(.*?)"/

    Note that in general this won't cope with escaped quotes as in "* is \"greedy\" in regex". Before there were non-greedy matches, I might have written the expression this way (which also doesn't cope with escaped quotes):

    m/name="challenge" value="([^"]*)"/

    Or, given your input, this way (which will only match a hexadecimal string):

    m/name="challenge" value="([0-9a-f]+)"/
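The two alternatives above can be sketched side by side, along with the escaped-quote limitation. The sample strings and variable names here are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $html = 'name="challenge" value="26eca68705b3b0c76a6d0602937ac524" />'
         . '<input name="md5pass" value="1" />';

# Negated character class: [^"]* can never cross a double quote.
my ($by_class) = $html =~ m/name="challenge" value="([^"]*)"/;

# Hex-only: matches only if the value consists of lowercase hex digits.
my ($by_hex) = $html =~ m/name="challenge" value="([0-9a-f]+)"/;

# A value that is not lowercase hex makes the stricter pattern fail,
# so the list assignment leaves $strict_miss undefined.
my $other = 'name="challenge" value="not-hex!"';
my ($strict_miss) = $other =~ m/name="challenge" value="([0-9a-f]+)"/;

# Escaped quotes are not handled: the capture stops at the first quote,
# even though it is preceded by a backslash.
my $escaped = 'value="say \"hi\" now"';
my ($cut) = $escaped =~ m/value="([^"]*)"/;

print "class: $by_class\n";
print "hex:   $by_hex\n";
print "miss:  ", (defined $strict_miss ? $strict_miss : 'no match'), "\n";
print "cut:   $cut\n";
```

The hex-only pattern is the strictest of the three, which is useful when a value in an unexpected format should count as a failure rather than be captured silently.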

    As an aside, let me recommend in general using something like HTML::Parser when you're parsing HTML. Getting it right with regular expressions can be really difficult, but the module does the right thing for you.
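A rough sketch of what that could look like with HTML::Parser's version 3 event API. It assumes the module is installed from CPAN and that the value lives in an `<input>` tag whose `name` attribute is `challenge`; the question only shows a fragment of the tag, so the sample HTML below is a guess at its full shape.

```perl
use strict;
use warnings;
use HTML::Parser;

# Assumed shape of the form; only part of this tag appears in the question.
my $html = '<form><input type="hidden" id="challenge" name="challenge" '
         . 'value="26eca68705b3b0c76a6d0602937ac524" />'
         . '<input type="hidden" id="md5pass" name="md5pass" value="1" /></form>';

my $challenge;

my $parser = HTML::Parser->new(
    api_version => 3,
    # Call the handler for every start tag, passing the tag name
    # and a hash reference of its attributes.
    start_h => [
        sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'input';
            return unless defined $attr->{name} && $attr->{name} eq 'challenge';
            $challenge = $attr->{value};
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;

print defined $challenge ? "challenge: $challenge\n" : "challenge not found\n";
```

Because the parser hands over attributes by name, the result does not depend on attribute order, spacing, or quoting style, which is where a hand-written regexp tends to break.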

Re: Regexp question
by toolic (Bishop) on Dec 20, 2007 at 16:54 UTC
    In addition to being conscious of the greedy behavior, it is also good practice to check that the match succeeded before using $1.
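A small sketch of why the check matters: a failed match does not reset $1, so it can still hold the capture from an earlier successful match. The helper name `challenge_value` and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured value, or undef if nothing matched.
sub challenge_value {
    my ($html) = @_;
    # In list context a failed match returns an empty list,
    # so $value stays undefined instead of picking up a stale $1.
    my ($value) = $html =~ m/name="challenge" value="(.*?)"/;
    return $value;
}

my $good = challenge_value('name="challenge" value="26eca68705b3b0c76a6d0602937ac524"');
my $bad  = challenge_value('<p>no form here</p>');

# Demonstration of the stale capture: the second match fails,
# but $1 still holds "first" from the match in the if condition.
my $stale;
if ('value="first"' =~ m/value="(.*?)"/) {
    my $text = 'no quotes at all';
    $text =~ m/value="(.*?)"/;    # fails, $1 is not reset
    $stale = $1;
}

print "good:  $good\n";
print "bad:   ", (defined $bad ? $bad : 'undef'), "\n";
print "stale: $stale\n";
```

Assigning the match in list context, as the helper does, is one way to avoid touching $1 at all; testing the match in an `if` works just as well.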
      One last quick question. Is this valid for checking whether the match occurred, and would $1 be initialized to the matched text in this code?
      if ( $html =~ m/id="$element" name="challenge" value="(.*?)"/ ) { $post{$element} = $1; }

      use strict; use CGI;
        I believe so, but you can easily prove it for yourself with a trivial testcase.
        Yes, that'll work. As will
        $post{$element} = $1 if $html =~ m/id="$element" name="challenge" value="(.*?)"/;

        Unless you intend $element to be treated as a regex pattern, you should use

        m/id="\Q$element\E" name="challenge" value="(.*?)"/;

        to escape any metacharacters in $element. See quotemeta for more.
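A short sketch of the effect of \Q...\E, assuming a hypothetical element id that contains a dot. The sample HTML deliberately uses a different id, so a correct pattern should reject it.

```perl
use strict;
use warnings;

# Hypothetical element id containing a regex metacharacter (the dot).
my $element = 'user.name';

# This HTML has a DIFFERENT id ("userXname"), so it should not match.
my $html = 'id="userXname" name="challenge" value="abc123"';

# Unescaped: the dot matches any character, so the wrong id is accepted.
my $loose = $html =~ m/id="$element" name="challenge" value="(.*?)"/ ? 1 : 0;

# Escaped with \Q...\E: the dot is literal, so the wrong id is rejected.
my $strict = $html =~ m/id="\Q$element\E" name="challenge" value="(.*?)"/ ? 1 : 0;

print "unescaped matches: $loose\n";
print "escaped matches:   $strict\n";
print "quotemeta gives:   ", quotemeta($element), "\n";
```

\Q...\E inside a pattern applies the same escaping as calling quotemeta on the interpolated string.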
