PerlMonks  

hepl with a regex problem

by Anonymous Monk
on Jun 04, 2003 at 23:19 UTC ( #263176=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm new to Perl (started today) and I need some help with matching. I'm scraping a web page and I'm trying to extract some information between two tags, but I keep getting more than I ask for. Here is what I'm working with:
......wrap><font face=arial size=-1>Volume<br>262,605,456</font></td><td nowr......
I need the number 262,605,456. I've tried this:
/.*Volume<br>(.*)<\/font><\/td>.*/
But I get the whole page back. I've googled around and I think my problem has to do with nested tags, and that the "<" and ">" symbols might mean something special, but I can't figure it out. Thanks! Al

Replies are listed 'Best First'.
Re: hepl with a regex problem
by Juerd (Abbot) on Jun 04, 2003 at 23:24 UTC

    I've tried this: /.*Volume<br>(.*)<\/font><\/td>.*/ But I get the whole page back.

    What do you mean by 'getting back'? $1 contains whatever the middle .* matched. You don't need .* at the beginning and end, because regexes can match anywhere unless explicitly anchored.
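    A minimal sketch of that point, assuming the page has already been read into a scalar. The variable names and the sample string are made up for illustration; only the middle fragment comes from the question. The regex has no leading or trailing .*, and the number is read from $1 after a successful match:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the scraped page (assumption: the real page is
# already in a scalar). Only the Volume fragment is taken from the question.
my $page = '<tr><td nowrap><font face=arial size=-1>Volume<br>262,605,456</font></td><td nowrap>x</td></tr>';

my $volume;
# No .* at either end: an unanchored regex can match anywhere in the string.
if ($page =~ /Volume<br>(.*)<\/font><\/td>/) {
    $volume = $1;    # $1 holds whatever the parenthesised (.*) matched
}

print defined $volume ? "$volume\n" : "no match\n";
```

    Printing $page itself (or the whole match) instead of $1 would show far more than the number, which may be what "getting the whole page back" refers to.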

    Perhaps you should not try regexes on your first day of Perl. They're omnipresent, but not as easy as the average text processor's search function. Don't let their friendly exteriors fool you :)

    If you're serious about learning Perl and haven't got a book yet, get Beginning Perl (for free, legally) at http://learn.perl.org/library/beginning_perl/. Paper copies are available if you don't like screen reading.

    Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: hepl with a regex problem
by moxliukas (Curate) on Jun 04, 2003 at 23:27 UTC

    The first thing you should try is a "non-greedy" regular expression match. "Non-greedy" means "match as little as possible", and all you need to change is to add a question mark after the quantifier:

    /.*Volume<br>(.*?)<\/font><\/td>.*/

    Hope this helps

    Update: reading Juerd's answer I am starting to think that possibly I have misinterpreted the question ;) So yes, if you are having problems extracting the match, it is in $1. If you are just having a problem of something matching too much -- a greedy regexp may be to blame.

    As far as I can remember it took me days to understand greedy/non-greedy regexps when I started learning Perl ;)
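    The difference between the two quantifiers can be sketched as follows. The two-cell sample string is invented for illustration (the second cell and its "Avg Vol" label are assumptions, not part of the original page), chosen so that there is more than one </font></td> for the regex to stop at:

```perl
use strict;
use warnings;

# Hypothetical page fragment with two cells, so </font></td> occurs twice.
my $page = '<td nowrap><font face=arial size=-1>Volume<br>262,605,456</font></td>'
         . '<td nowrap><font face=arial size=-1>Avg Vol<br>1,234</font></td>';

# Greedy: .* grabs as much as it can, then backs off to the LAST </font></td>.
my ($greedy) = $page =~ /Volume<br>(.*)<\/font><\/td>/;

# Non-greedy: .*? stops at the FIRST </font></td> it can reach.
my ($lazy) = $page =~ /Volume<br>(.*?)<\/font><\/td>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

    With this sample the greedy capture runs on through the second cell, while the non-greedy capture contains only the number.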

      I'm thinking that AnonyMonk was referring to a greedy match. At least, that's how I read it at first.

      And if you know that what you're looking for is a number alone, you can use something like m!Volume<br>([\d,]+)</font></td>! for the regex. This will capture numbers that may or may not have a comma in them. It won't match negative numbers, or numbers with decimal points, or fractions, etc. Note that the m at the beginning allows you to use another delimiter instead of /, so you don't have to escape the / characters in the closing HTML tags. Just another way to do it...
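      A short sketch of the character-class version, using m!...! so the slashes in the closing tags need no escaping. The sample strings and variable names are made up for illustration, and the comma-stripping step at the end is an extra not mentioned in the thread:

```perl
use strict;
use warnings;

# Hypothetical inputs: one from the question, one the class is not meant to match.
my $good = '<font face=arial size=-1>Volume<br>262,605,456</font></td>';
my $bad  = '<font face=arial size=-1>Volume<br>-12.5</font></td>';

# In list context a match returns its captures, or an empty list on failure.
my ($volume)  = $good =~ m!Volume<br>([\d,]+)</font></td>!;
my ($nothing) = $bad  =~ m!Volume<br>([\d,]+)</font></td>!;   # "-" and "." are not in [\d,]

# Optional extra: drop the commas to get a plain number for arithmetic.
(my $plain = $volume) =~ tr/,//d;

print "volume: $volume\n";
print "plain:  $plain\n";
print "second input: ", (defined $nothing ? $nothing : "no match"), "\n";
```

      Because [\d,]+ cannot cross a "<", this pattern stops at the end of the number without relying on greedy or non-greedy behaviour.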

Re: hepl with a regex problem
by arrow (Friar) on Jun 05, 2003 at 00:14 UTC
    Also, you might want to consider using XML if you want to extract data from the page. With XML you could have the number enclosed in <number></number> tags, because your data may not always be in between the same two tags. It makes data extraction a lot simpler, and formatting is achieved using powerful stylesheets. Hope this helps!

    Just Another Perl Wannabe

Node Type: perlquestion [id://263176]
Approved by Enlil