hepl with a regex problem

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm new to Perl (started today) and I need some help with a match. I'm scraping a web page and trying to extract some information between two tags, but I keep getting more than I ask for. Here is what I'm working with:

 
......wrap><font face=arial size=-1>Volume<br>262,605,456</font></td><td nowr......

I need the number 262,605,456. I've tried this:

 
/.*Volume<br>(.*)<\/font><\/td>.*/

But I get the whole page back. I've googled around and I think my problem has to do with nested tags and the "<" ">" symbols might mean something special but I can't figure it out. Thanks! Al


Re: hepl with a regex problem by Juerd (Abbot) on Jun 04, 2003 at 23:24 UTC
"I've tried this: `/.*Volume<br>(.*)<\/font><\/td>.*/` But I get the whole page back."

What do you mean by 'getting back'? `$1` contains whatever the middle `.*` matched. You don't need `.*` at the beginning and end, because regexes can match anywhere unless explicitly anchored.

Perhaps you should not try regexes on your first day of Perl. They're omnipresent, but not as easy as the average text processor's search function. Don't let their friendly exteriors fool you :)

If you're serious about learning Perl and haven't got a book yet, get Beginning Perl (for free, legally) at http://learn.perl.org/library/beginning_perl/. Paper copies are available if you don't like screen reading.

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }
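A minimal sketch of the two points above (the capture lands in `$1`, and an unanchored regex needs no surrounding `.*`). The string is the fragment from the question with the elided surroundings left out, and the variable names `$html` and `$volume` are only illustrative:

```perl
use strict;
use warnings;

# Fragment from the question, without the elided text around it.
my $html = '<font face=arial size=-1>Volume<br>262,605,456</font></td>';

my $volume;

# No leading or trailing .* here: an unanchored regex may match anywhere.
if ($html =~ /Volume<br>(.*)<\/font><\/td>/) {
    $volume = $1;    # $1 holds only what the parentheses matched
    print "Volume: $volume\n";
}
else {
    print "No match\n";
}
```

On this single-cell fragment the capture is just the number. On a longer line with further closing tags, the greedy `(.*)` can still grab too much.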
Re: hepl with a regex problem by moxliukas (Curate) on Jun 04, 2003 at 23:27 UTC
The first thing you should try is a "non-greedy" regular expression match. "Non-greedy" means "match as little as possible", and all you need to change is to add a question mark: `/.*Volume<br>(.*?)<\/font><\/td>.*/`

Hope this helps.

Update: reading Juerd's answer I am starting to think that possibly I have misinterpreted the question ;) So yes, if you are having problems extracting the match, it is in `$1`. If you are just having a problem of something matching too much -- a greedy regexp may be to blame. As far as I can remember it took me days to understand greedy/non-greedy regexps when I started learning Perl ;)
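To make the greedy/non-greedy difference concrete, here is a small sketch. The second table cell (`Open<br>12.50`) is a made-up assumption, added only so that a later `</font></td>` exists on the same line; the real page will differ:

```perl
use strict;
use warnings;

# First cell is from the question; the second cell is hypothetical.
my $html = '<font face=arial size=-1>Volume<br>262,605,456</font></td>'
         . '<td><font face=arial size=-1>Open<br>12.50</font></td>';

# Greedy: (.*) runs on to the LAST </font></td> it can reach.
my ($greedy) = $html =~ /Volume<br>(.*)<\/font><\/td>/;

# Non-greedy: (.*?) stops at the FIRST </font></td> after the number.
my ($lazy) = $html =~ /Volume<br>(.*?)<\/font><\/td>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture drags in the markup of the following cell, while the non-greedy one holds only `262,605,456`.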
Re: Re: hepl with a regex problem by Nkuvu (Priest) on Jun 04, 2003 at 23:41 UTC
I'm thinking that AnonyMonk was referring to a greedy match. At least, that's how I read it at first.

And if you know that what you're looking for is a number alone, you can use something like `m!Volume<br>([\d,]+)</font></td>!` for the regex. This will capture numbers that may or may not have a comma in them. It won't match negative numbers, or numbers with decimal points, or fractions, etc.

Note that the m at the beginning allows you to use another delimiter instead of /, so you don't have to escape the / characters in the closing HTML tags. Just another way to do it...
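A short sketch of that character-class approach, under the assumption that the value always sits directly between `Volume<br>` and `</font></td>`. The helper name `extract_volume` and the decimal test string are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured number, or undef if there is no match.
sub extract_volume {
    my ($html) = @_;
    # m!...! uses ! as the delimiter, so the / in </font> needs no escaping.
    return $html =~ m!Volume<br>([\d,]+)</font></td>! ? $1 : undef;
}

my $volume = extract_volume('<font face=arial size=-1>Volume<br>262,605,456</font></td>');

# Strip the commas if the value is needed for arithmetic.
(my $plain = $volume) =~ tr/,//d;

# A decimal point is not in [\d,], so this made-up input does not match.
my $decimal = extract_volume('Volume<br>12.5</font></td>');

print "Volume: $volume ($plain)\n";
print "Decimal input matched? ", (defined $decimal ? "yes" : "no"), "\n";
```

Because `[\d,]+` cannot cross a `<`, this pattern has no greediness problem: the capture cannot spill into the following tags.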
Re: hepl with a regex problem by arrow (Friar) on Jun 05, 2003 at 00:14 UTC
Also, you might want to consider using XML if you want to extract data from the page. With XML you could have the number enclosed in `<number></number>` tags, because your data may not always be in between the same two tags. It makes data extraction a lot simpler, and formatting is achieved using powerful stylesheets.

Hope this helps!

Just Another Perl Wannabe