PerlMonks  

HTML parsing OR capturing text from a string within tags

by kevyt (Scribe)
on Dec 24, 2006 at 02:11 UTC ( [id://591484]=perlquestion )

kevyt has asked for the wisdom of the Perl Monks concerning the following question:

I hate to ask this, but I have been reading and trying to do this for a day. I read about and installed the following Perl modules: HTML::Strip, HTML::Parser, HTML::TreeBuilder, and HTML::Element, and I still cannot figure out how to get the data that I need.

I want to grab the following fields from an html page.

Harry Jones Wood Shop, 56789904,-938882991, Smith Rd New York, NY 14254, (154)555-1234

<div class=\042mytitle maximumtitle\042 id=\042idtitle\042> Harry Jones <b>Wood </b> &amp; Shop</div>
latlng=56789904,-938882991,3132132133321 &amp;
<div class=\042address\042 id=\042idaddr\042>737373 Smith Rd<br/>New York, NY 14254<br/></div><div class=\042 </div>
<div class=\042phone\042>(154) 555-1234&nbsp;-&nbsp;<span style=\042display:none\042 class=\042my_hide\042>

I am able to get the web page into the program by doing this:

my $url = 'http://www.somepage.com';
# $browser->cookie_jar({});   #### use if the site requires cookies
my $browser = LWP::UserAgent->new;
my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);
my $response = $browser->get( $url, @ns_headers);
die "Can't get $url -- ", $response->status_line
    unless $response->is_success;
die "Hey, I was expecting HTML, not ", $response->content_type
    unless $response->content_type eq 'text/html';

I have tried methods such as find->tag and others, and I am not getting anywhere. I also found a post on PerlMonks regarding parsing, edited the line from that posting, and tried this:

@addr = $response->content =~ /<div class=\042mytitle maximumtitle\042 id=\042idtitle\042>"([^ "]+)"/gi;

Can you please help? Thanks

Replies are listed 'Best First'.
Re: HTML parsing OR capturing text from a string within tags
by liverpole (Monsignor) on Dec 24, 2006 at 02:59 UTC
    Hi kevyt,

    I've found (being fairly close to a beginner myself at parsing HTML) that it's best to attack such a problem in little pieces.  Use print/printf along the way to show what your data looks like at the moment (and use Data::Dumper to really inspect your data with a fine-tooth comb).

    I don't see in your program where you're trying to construct the HTML tree, so I took your program and extended it a bit.  Here's what I have:

    # Strict
    use strict;
    use warnings;

    # Libraries
    use Data::Dumper;
    use LWP::UserAgent;
    use HTML::TreeBuilder;

    my $url = 'http://www.somepage.com';
    # $browser->cookie_jar({});   #### use if the site requires cookies
    my $browser = LWP::UserAgent->new;
    my @ns_headers = (
        'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
        'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
        'Accept-Charset'  => 'iso-8859-1,*,utf-8',
        'Accept-Language' => 'en-US',
    );
    my $response = $browser->get($url, @ns_headers);
    die "Can't get $url -- ", $response->status_line
        unless $response->is_success;
    die "Hey, I was expecting HTML, not ", $response->content_type
        unless $response->content_type eq 'text/html';

    # Now get the content, and display it
    my $content = $response->content;
    print "TFD> content $content\n";

    # Now build the HTML tree
    my $tree = HTML::TreeBuilder->new_from_content($content);

    # Now find the first occurrence of the desired tag
    my $tag = 'a';
    my $match = $tree->find($tag);
    if ($match) {
        # Found it!
        print "Found tag '$tag' ...\n";
        print " As text: ", $match->as_text, "\n";
        print " As HTML: ", $match->as_HTML, "\n";
    } else {
        print "Unable to find tag '$tag'\n";
    }

    Note that I'm building an HTML tree from the $content which is returned after a successful get from the LWP::UserAgent object.

    The program then prints out the contents in the line:

    print "TFD> content $content\n";

    as a debugging step (you can remove that once you're sure you're getting what you expect back from the LWP fetch).

    Then you construct the HTML tree with:

    my $tree = HTML::TreeBuilder->new_from_content($content);

    Finally, you use find to locate an occurrence of the desired tag.  In the program above, I searched for the first occurrence of an anchor 'a' with:

    my $tag = 'a';
    my $match = $tree->find($tag);

    which is then rendered both as text and HTML with:

    print " As text: ", $match->as_text, "\n";
    print " As HTML: ", $match->as_HTML, "\n";

    Does that help you get further along?
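
    As a small sketch of why the program above reports only one anchor: find returns the first match in scalar context and every match in list context. The sample markup and the variable names here are made up for illustration; only new_from_content, find, as_text, attr and delete are taken from HTML::TreeBuilder and HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page (not the page from the question).
my $html = '<html><body><a href="/one">First</a> <a href="/two">Second</a></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Scalar context: only the first matching element comes back.
my $first      = $tree->find('a');
my $first_text = $first ? $first->as_text : '';
print "First link: $first_text\n";

# List context: every matching element comes back.
my @links      = $tree->find('a');
my @link_texts = map { $_->as_text } @links;
print "Link: ", $_->as_text, " -> ", $_->attr('href'), "\n" for @links;

# Free the tree once the text has been copied out.
$tree->delete;
```

    The text is copied into plain scalars before delete is called, so nothing depends on the tree afterwards.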


    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
      Thanks, I will try this in the morning. I could not get this to print anything worthwhile.
      ### $response->content has the webpage stored in it
      $a = HTML::Element->new('a', $response->content);
      $addr = $a->find('tag', 'title');
      print $addr;
      Thanks Liverpole, that explains a lot. I was not able to get it to work with my example because I guess that long string of goop is not a tag. So I changed the tag to 'title' and that worked wonderfully!!! I noticed this line s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/ at the end of your posting, but I am not sure what it is for, unless it is a very complex regular expression to parse the data out. Thanks for all of your time and help. I might be able to make something work from what you wrote. Kevin
        Hi kevyt,

        I'm glad you were able to get further with your problem.  Always consider printing out intermediate results, so you know what your data looks like at each step of the way.

        The line at the end of my post:

        s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/

        is just my "signature".  If you run it as a separate Perl script, it prints liverpole.  You can create your own signature by editing your Signature Settings page.
      Liverpole, I tried this
      my $tag = 'div class=\\042mytitle maximumtitle\\042 id=\\042idtitle042';
      my $match = $tree->find($tag);
      if ($match) {
          # Found it!
          print "Found tag '$tag' ...\n";
          print " As text: ", $match->as_text, "\n";
          print " As text: ", $match->as_HTML, "\n";
      } else {
          print "Unable to find tag '$tag'\n";
      }
      I like how all of this is supposed to work! I think I read in one of the docs that there is a list of tags in the module (.pm) file. Maybe I can add this tag to the list of HTML tags in the module? I was hoping that it would treat anything between < and > as a tag, but I guess it does not do that. Thanks, Kevin
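
      For what it is worth, find only takes bare tag names such as 'div', so passing the whole tag with its attributes cannot match. HTML::Element's look_down can match on attributes instead. The sketch below is only that: it assumes the \042 sequences in the page are octal-escaped double quotes that may be unescaped before parsing, it uses a simplified copy of the markup from the question, and text_of is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Simplified copy of the markup from the question (the hidden span is dropped).
my $html = <<'END';
<div class=\042mytitle maximumtitle\042 id=\042idtitle\042> Harry Jones <b>Wood </b> &amp; Shop</div>
<div class=\042address\042 id=\042idaddr\042>737373 Smith Rd<br/>New York, NY 14254<br/></div>
<div class=\042phone\042>(154) 555-1234&nbsp;-&nbsp;</div>
END

# Assumption: \042 stands for a double quote, so turn it back into one.
$html =~ s/\\042/"/g;

my $tree = HTML::TreeBuilder->new_from_content($html);

# Hypothetical helper: element text with whitespace tidied, or undef if no element.
sub text_of {
    my ($elem) = @_;
    return undef unless $elem;
    my $text = $elem->as_text;
    $text =~ s/\xA0/ /g;        # non-breaking spaces become plain spaces
    $text =~ s/\s+/ /g;         # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;    # trim both ends
    return $text;
}

# look_down matches on the tag name plus exact attribute values.
my $title = text_of( $tree->look_down(_tag => 'div', id    => 'idtitle') );
my $addr  = text_of( $tree->look_down(_tag => 'div', id    => 'idaddr')  );
my $phone = text_of( $tree->look_down(_tag => 'div', class => 'phone')   );

print "Title:   $title\n" if defined $title;
print "Address: $addr\n"  if defined $addr;
print "Phone:   $phone\n" if defined $phone;

$tree->delete;
```

      Note that <br/> contributes no text to as_text, so the two address lines run together; the latlng value sits outside any tag in the original page and would still need a separate regex.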
Re: HTML parsing OR capturing text from a string within tags
by astaines (Curate) on Dec 24, 2006 at 02:52 UTC

    Well, let's see. LWP::UserAgent returns an HTTP::Response object from its get method. According to the documentation, HTTP::Response is a subclass of HTTP::Message, and its content method returns the body of the web page as a string of bytes. You then need to do something intelligent with this string, presumably.

    You don't describe how you are using HTML::Strip, but this is really intended to produce a pure text representation of the page. I suspect something like HTML::TreeBuilder which actually parses the HTML, and HTML::Element which lets you disassemble it at your leisure, would suit your needs better.
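
    To make the "string of bytes" point concrete, here is a minimal sketch that builds an HTTP::Response by hand, so it needs no network access. The body and header values are invented for illustration; content, decoded_content and content_type are the HTTP::Response / HTTP::Message methods being compared.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical UTF-8 encoded body: the e-acute takes two bytes.
my $bytes = "<p>caf\xC3\xA9</p>";

my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $bytes,
);

my $raw     = $response->content;           # raw bytes, exactly as received
my $decoded = $response->decoded_content;   # characters, after applying the charset
my $type    = $response->content_type;      # media type without the charset parameter

printf "content_type: %s\n", $type;
printf "raw length: %d, decoded length: %d\n", length $raw, length $decoded;
```

    Either string can be handed to HTML::TreeBuilder; the decoded one is usually the safer choice when the page contains non-ASCII text.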

    -- Anthony Staines
Re: HTML parsing OR capturing text from a string within tags
by Popcorn Dave (Abbot) on Dec 24, 2006 at 06:23 UTC
    kevyt,

    Might I suggest a different tack than the one you're taking now?

    Long ago, I wrote a newspaper headline grabber for a Perl class using LWP::Simple's get function to grab web pages. I found that easier to use since it can return the whole page to a scalar. Then I used HTML::TokeParser to actually divide up the information and based my collection on only the tokens I actually wanted to save.

    If you look at Re: HTML::TokeParser help - parsing headlines there's a quick and dirty token parser that I wrote so that you can see how it splits up an HTML file.

    Hope that helps!

    Revolution. Today, 3 O'Clock. Meet behind the monkey bars.

    If quizzes are quizzical, what are tests?

      Popcorn Dave, Thanks... I will try that... I just added a lot of prints to Element.pm to see what is going on. I will try your method tomorrow :) Thanks... This is what I have done. The format of Element.pm looks similar to code I used to work with at a former job.
      sub find_by_ktag_name {
        my(@pile) = shift(@_); # start out the to-do stack for the traverser
        Carp::croak "find_by_created_tag_name can be called only as an object method"
          unless ref $pile[0];
        return() unless @_;
        print "pile is @pile\n";
        my(@tags) = $pile[0]->_fold_case(@_);
        print "tags are @tags\n";
        my(@matching, $this, $this_tag);
        while(@pile) {
          $this_tag = ($this = shift @pile)->{'_tag'};
          print "In while loop. this_tag is $this_tag\n";
          foreach my $t (@tags) {
            print "foreach going through elements of tag. Elements are t and t is $t\n";
            print "next step will check to see if t is eq to this_tag. this_tag is $this_tag\n";
            if($t eq $this_tag) {
              print "inside of if... t and this_tag are equal.\n";
              if(wantarray) {
                print "I am here if wantarray is true. Now push this onto array matching\n";
                push @matching, $this;
                print "matching is @matching\n";
                last;
              } else {
                print "wantarray not true, returning this $this\n";
                return $this;
              }
            }
          }
          unshift @pile, grep ref($_), @{$this->{'_content'} || next};
        }
        print "returning @matching if wantarray\n";
        return @matching if wantarray;
        return;
      }
      My print statements showed me that there is a library of predefined tags. If I can add my own tags, I think it will work :) I will also try your method. Tackling this is sort of fun. some output:
      next step will check to see if t is eq to this_tag. this_tag is a
      In while loop. this_tag is a
      next step will check to see if t is eq to this_tag. this_tag is font
      next step will check to see if t is eq to this_tag. this_tag is br
      Popcorn Dave, I looked at your code. I don't know how it works yet. Will it allow me to add my own string and extract the text right after it? For example...
      <div\042\... > Person <b> Ran <\div>
      Will it allow me to capture Person Ran? I think this is the file where I can add my own tags :)
      HTML-Tree-3.23/lib/HTML/AsSubs.pm
        All that code does is get an HTML page and parse it into tokens. It will spit the whole mess out, so I ran it at the command line, e.g. perl tokeparser.pl > output.txt

        That way you can scan through the file and see how it's tokenizing the information you fed it.
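
        As for capturing the text between a start tag and its end tag, a rough sketch with HTML::TokeParser might look like the following. The markup is a made-up stand-in for the page in question, and the %field hash keyed on the class attribute is just one illustrative way to store the results; get_tag and get_trimmed_text are the HTML::TokeParser methods doing the work.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup in the shape of the example being asked about.
my $html = '<div class="mytitle" id="idtitle"> Person <b> Ran </b></div>'
         . '<div class="phone">(154) 555-1234</div>';

# A scalar reference tells HTML::TokeParser to parse the string itself.
my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my %field;
while (my $tag = $parser->get_tag('div')) {
    my $attr  = $tag->[1];                   # hash ref of the start tag's attributes
    my $class = $attr->{class} // 'unknown';
    # Gather the text up to the closing </div>, passing over nested tags such as <b>.
    $field{$class} = $parser->get_trimmed_text('/div');
}

print "$_: $field{$_}\n" for sort keys %field;
```

        No custom tags need to be registered for this; the parser reports whatever start tags it meets, and the attributes decide which div is which.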
