PerlMonks  

Parsing web pages (sort of)

by MrDoney (Initiate)
on May 25, 2000 at 19:16 UTC ( [id://14771]=perlquestion )

MrDoney has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I've got a kind of dopey question. I'm reading in a web page which contains some useful data. What I need to do is parse out the useful strings (about five of them) from the HTML. I'm sure this should be easy, because each of these strings begins with the characters ' DE', followed by a six-character figure. How do I do it? My thinking has been along the lines of: perl -w 'm/\bDE\d+/g' mypage.html - unfortunately it doesn't bleedin' work, and I'm stumped.

Replies are listed 'Best First'.
Re: Parsing web pages (sort of)
by swiftone (Curate) on May 25, 2000 at 20:46 UTC
    In case it was not clear from previous posts, your regex was fine, but you matched and then did nothing. You needed to print the match. So your regex:
    m/\bDE\d+/g
    would match and stop. What they did (in addition to being more specific with the number of digits) is to group your match in parentheses:
    m/\b(DE\d+)/g
    The matched text is then available in $1, which is what they print.
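To make the point above concrete, here is a minimal sketch of matching versus capturing. The sample line and the helper name first_code are made up for illustration, since the actual page source was never posted.

```perl
use strict;
use warnings;

# Return the text captured by the parentheses, or undef when nothing matches.
# (first_code is a hypothetical helper, not something from the thread.)
sub first_code {
    my ($text) = @_;
    return $text =~ /\b(DE\d+)/ ? $1 : undef;
}

# Hypothetical sample line; the real page source was never posted.
my $line = '<td> DE123456</td>';

# A bare match only yields true or false; nothing is printed by itself.
my $matched = $line =~ /\bDE\d+/;

# Printing the capture is what actually produces output.
my $code = first_code($line);
print "$code\n" if $matched && defined $code;
```

Checking that the match succeeded before reading $1 matters, because $1 keeps its value from the last successful match and can otherwise hold stale data.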
Re: Parsing web pages (sort of)
by perlcgi (Hermit) on May 25, 2000 at 19:29 UTC
    # Would like to see the page source but this will work
    # for one target string per line.
    # $_ contains mypage.html
    print $1 if /\s(DE\d{6})/;
      I came up with this code:
      perl -n -e 'while(m/\b(DE\d{6})/g){print "$1\n";}' mypage.html
      There is also this slight variation:
      perl -n -e 'print "$1\n" while(/\b(DE\d{6})/g)' mypage.html
      but I prefer the first because, as a general rule, I don't like suffixing statements with conditionals except for error handling.

      It is slightly longer, but it handles multiple matches in a line and it also splits the output into one "DE" string per line.
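For anyone who would rather have this as a script than a one-liner, the following is a rough equivalent, offered as a sketch only. The explicit outer loop stands in for what -n does, and the page content is a made-up string read through an in-memory filehandle; extract_codes is a hypothetical name.

```perl
use strict;
use warnings;

# Collect every DE + six-digit code from a filehandle, one line at a time.
# (extract_codes is a hypothetical helper, not something from the thread.)
sub extract_codes {
    my ($fh) = @_;
    my @codes;
    while (my $line = <$fh>) {
        # In scalar context, /g resumes after the previous match on each
        # pass, so every occurrence on the line is visited in turn.
        while ($line =~ /\b(DE\d{6})/g) {
            push @codes, $1;
        }
    }
    return @codes;
}

# Hypothetical page content; substitute a real file open for actual use.
my $html = "<td> DE123456</td><td> DE654321</td>\n<p>none here</p>\n<b> DE000042</b>\n";
open my $fh, '<', \$html or die "cannot open in-memory file: $!";
print "$_\n" for extract_codes($fh);
close $fh;
```

Returning a list instead of printing inside the loop keeps the matching separate from the output format, so the separator can be changed in one place.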

Re: Parsing web pages (sort of)
by perlcgi (Hermit) on May 25, 2000 at 19:41 UTC
    Neat.
    What do you mean by your matches run together? Is it the lack of a \n?
    Thanks.
      That is exactly what I mean... without a \n or other separator the output is harder to read because it runs together.
        Well, sorry :-) I thought the lack of \n was kinda obvious.
        So if that's the case, I'd say your "$1\n" really should be $1,"\n" on the grounds of efficiency. 25% faster. D'oh, I must admit your solution is better than mine. :-)
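The speed difference between the two print forms is easy to measure for yourself. Below is a minimal sketch using the core Benchmark module. The helper names and the sample value are made up, output goes to the null device so terminal speed does not dominate, and the figures will vary by Perl version and machine, so treat any particular percentage with caution.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Spec;

# Interpolating form: builds one string, then prints it.
sub print_interpolated {
    my ($fh, $code) = @_;
    print {$fh} "$code\n";
}

# List form: hands print two separate arguments.
sub print_list {
    my ($fh, $code) = @_;
    print {$fh} $code, "\n";
}

# Send output to the null device so only the print call itself is timed.
open my $null, '>', File::Spec->devnull or die "cannot open null device: $!";

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    interpolated => sub { print_interpolated($null, 'DE123456') },
    list         => sub { print_list($null, 'DE123456') },
});
close $null;
```

Both forms produce identical output as long as $, (the output field separator) is left unset, so the choice between them is a matter of speed and taste only.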
