Parsing web pages (sort of)

MrDoney has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parsing web pages (sort of) by swiftone (Curate) on May 25, 2000 at 20:46 UTC
In case it was not clear from previous posts, your regex was fine, but you matched and then did nothing. You needed to print the match. So your regex: `m/\bDE\d+/g` would match and stop. What they did (in addition to being more specific with the number of digits) is to group your match in parentheses: `m/\b(DE\d+)/g` The match is then located in $1 (which they print).
Re: Parsing web pages (sort of) by perlcgi (Hermit) on May 25, 2000 at 19:29 UTC
`# Would like to see the page source but this will work`
`# for one target string per line.`
`# $_ contains mypage.html`
`print $1 if /\s(DE\d{6})/;`
RE: Re: Parsing web pages (sort of) by lhoward (Vicar) on May 25, 2000 at 19:30 UTC
I came up with this code: `perl -n -e 'while(m/\b(DE\d{6})/g){print "$1\n";}' mypage.html` There is also this slight variation: `perl -n -e 'print "$1\n" while(/\b(DE\d{6})/g)' mypage.html` but I prefer the first because, as a general rule, I don't like suffixing statements with conditionals except for error handling. It is slightly longer but does handle multiple matches in a line, and it also breaks apart the output to one "DE" per line.
Re: Parsing web pages (sort of) by perlcgi (Hermit) on May 25, 2000 at 19:41 UTC
Neat. What do you mean by your matches run together? Is it the lack of a \n? Thanks.
RE: Re: Parsing web pages (sort of) by lhoward (Vicar) on May 25, 2000 at 19:42 UTC
That is exactly what I mean... without a \n or other separator, the output is harder to read because it runs together.
RE: RE: Re: Parsing web pages (sort of) by perlcgi (Hermit) on May 25, 2000 at 19:44 UTC
Well, sorry :-) I thought the lack of \n was kinda obvious. So if that's the case, I'd say your "$1\n" really should be $1,"\n" on the grounds of efficiency. 25% faster. D'oh, I must admit your solution is better than mine. :-)
RE: RE: RE: Re: Parsing web pages (sort of) by lhoward (Vicar) on May 25, 2000 at 19:54 UTC

