HTML parsing OR capturing text from a string within tags

kevyt has asked for the wisdom of the Perl Monks concerning the following question:

Re: HTML parsing OR capturing text from a string within tags by liverpole (Monsignor) on Dec 24, 2006 at 02:59 UTC
Hi kevyt,

I've found (being fairly close to a beginner myself with parsing HTML) that it's best to attack such a problem in little pieces. Use print/printf along the way to show what your data looks like at the moment (and use Data::Dumper to really inspect your data with a fine-tooth comb).

I don't see in your program where you're trying to construct the HTML tree, so I took your program and extended it a bit. Here's what I have:

    # Strict
    use strict;
    use warnings;

    # Libraries
    use Data::Dumper;
    use LWP::UserAgent;
    use HTML::TreeBuilder;

    my $url = 'http://www.somepage.com';

    my $browser = LWP::UserAgent->new;
    # $browser->cookie_jar({});   #### use if the site requires cookies

    my @ns_headers = (
        'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
        'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
        'Accept-Charset'  => 'iso-8859-1,*,utf-8',
        'Accept-Language' => 'en-US',
    );

    my $response = $browser->get($url, @ns_headers);
    die "Can't get $url -- ", $response->status_line
        unless $response->is_success;
    die "Hey, I was expecting HTML, not ", $response->content_type
        unless $response->content_type eq 'text/html';

    # Now get the content, and display it
    my $content = $response->content;
    print "TFD> content $content\n";

    # Now build the HTML tree
    my $tree = HTML::TreeBuilder->new_from_content($content);

    # Now find the first occurrence of the desired tag
    my $tag   = 'a';
    my $match = $tree->find($tag);
    if ($match) {
        # Found it!
        print "Found tag '$tag' ...\n";
        print "  As text: ", $match->as_text, "\n";
        print "  As HTML: ", $match->as_HTML, "\n";
    } else {
        print "Unable to find tag '$tag'\n";
    }

Note that I'm building an HTML tree from the $content which is returned after a successful get from the LWP::UserAgent object. The program prints out the contents in the line `print "TFD> content $content\n";` as a debugging step (you can remove that once you're sure you're getting what you expect back from the LWP fetch).

Then you construct the HTML tree with `my $tree = HTML::TreeBuilder->new_from_content($content);`.

Finally, you use find to locate an occurrence of the desired tag. In the program above, I searched for the first occurrence of an anchor 'a' with `my $tag = 'a'; my $match = $tree->find($tag);`, which is then rendered both as text and as HTML with `print "  As text: ", $match->as_text, "\n"; print "  As HTML: ", $match->as_HTML, "\n";`.

Does that help you get further along?

s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
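The program above stops at the first anchor. As a minimal sketch of collecting every match instead, the same find method can be called in list context. The sample markup here is made up so that the example runs without a network fetch; it stands in for `$response->content`.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page standing in for $response->content.
my $content = <<'HTML';
<html><head><title>Sample page</title></head>
<body>
<a href="/one">First link</a>
<p>Some text <a href="/two">Second link</a></p>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($content);

# In list context, find() returns every matching element,
# not just the first one.
my @links      = $tree->find('a');
my @link_texts = map { $_->as_text } @links;
my @link_hrefs = map { $_->attr('href') } @links;

printf "%s -> %s\n", $link_texts[$_], $link_hrefs[$_] for 0 .. $#links;

# Free the tree explicitly; older HTML::Tree releases need this.
$tree->delete;
```

The text and href values are copied into plain arrays before the tree is deleted, so they remain usable afterwards.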
Re^2: HTML parsing OR capturing text from a string within tags by kevyt (Scribe) on Dec 24, 2006 at 03:34 UTC
Thanks, I will try this in the morning. I could not get this to print anything worthwhile:

    ### $response->content has the webpage stored in it
    $a = HTML::Element->new('a', $response->content);
    $addr = $a->find('tag', 'title');
    print $addr;
Re^2: HTML parsing OR capturing text from a string within tags by kevyt (Scribe) on Dec 24, 2006 at 06:13 UTC
Thanks Liverpole,

That explains a lot. I was not able to get it to work with my example because I guess that long string of goop is not a tag. So I changed the tag to 'title' and that worked wonderfully!!!

I noticed this line `s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/` at the end of your posting, but I am not sure what it is for, unless it is a very complex regular expression to parse the data out.

Thanks for all of your time and help. I might be able to make something work from what you wrote.

Kevin
Re^3: HTML parsing OR capturing text from a string within tags by liverpole (Monsignor) on Dec 24, 2006 at 13:43 UTC
Hi kevyt,

I'm glad you were able to get further with your problem. Always consider printing out intermediate results, so you know what your data looks like at each step of the way.

The line at the end of my post, `s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/`, is just my "signature". If you run it as a separate Perl script, it prints liverpole. You can create your own signature by editing your Signature Settings page.

s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
Re^2: HTML parsing OR capturing text from a string within tags by kevyt (Scribe) on Dec 24, 2006 at 06:25 UTC
Liverpole, I tried this:

    my $tag = 'div class=\\042mytitle maximumtitle\\042 id=\\042idtitle042';
    my $match = $tree->find($tag);
    if ($match) {
        # Found it!
        print "Found tag '$tag' ...\n";
        print "  As text: ", $match->as_text, "\n";
        print "  As HTML: ", $match->as_HTML, "\n";
    } else {
        print "Unable to find tag '$tag'\n";
    }

I like how all of this is supposed to work! I think I read in one of the docs that there is a list of tags in the PM. Maybe I can add this tag to the list of HTML tags in the PM? I was hoping that it would treat anything between < > as a tag, but I guess it does not do that.

Thanks,
Kevin
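A possible way around this, sketched under assumptions: find matches on the tag name only ('div'), so attributes such as class and id cannot be part of the string passed to it. HTML::Element also provides look_down, which accepts attribute criteria alongside the tag name. The markup and the attribute values below are guesses based on the post above, not the real page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup; the class and id values are assumptions
# taken from the tag string in the post above.
my $content = <<'HTML';
<html><body>
<div class="other">Not this one</div>
<div class="mytitle maximumtitle" id="idtitle">Person <b>Ran</b></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($content);

# _tag matches the element name; the other pairs must equal
# the attribute values exactly. Scalar context gives the first match.
my $div = $tree->look_down(
    _tag  => 'div',
    class => 'mytitle maximumtitle',
    id    => 'idtitle',
);

my $div_text = $div ? $div->as_text : undef;
print defined $div_text ? "Found: $div_text\n" : "No matching div\n";

$tree->delete;
```

No new tags need to be registered in any module for this; the div is an ordinary tag and the attributes are simply extra search criteria.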
Re: HTML parsing OR capturing text from a string within tags by astaines (Curate) on Dec 24, 2006 at 02:52 UTC
Well, let's see. LWP::UserAgent returns an HTTP::Response object from its get method. According to the documentation, HTTP::Response is a subclass of HTTP::Message, and its content method returns the body of the web page as a string of bytes. You then need to do something intelligent with this string, presumably.

You don't describe how you are using HTML::Strip, but this is really intended to produce a pure text representation of the page. I suspect something like HTML::TreeBuilder, which actually parses the HTML, and HTML::Element, which lets you disassemble it at your leisure, would suit your needs better.

-- Anthony Staines
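To make the relationship between these classes concrete, here is a small sketch. It builds an HTTP::Response by hand, with a made-up body and header, so that it runs offline; a response returned by `$browser->get($url)` would be inspected the same way.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response (hypothetical body and header) so that
# no network access is needed for the illustration.
my $body     = "<html><head><title>Hello</title></head></html>";
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $body,
);

# HTTP::Response inherits from HTTP::Message.
my $is_message = $response->isa('HTTP::Message') ? 1 : 0;

my $raw  = $response->content;          # body as raw bytes
my $text = $response->decoded_content;  # body decoded using the declared charset
my $type = $response->content_type;     # media type without parameters

print "is an HTTP::Message: $is_message\n";
print "content type: $type\n";
print "body length: ", length($raw), " bytes\n";
```

For pages that are not plain ASCII, decoded_content is usually the more convenient of the two, since it returns Perl characters rather than bytes.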
Re: HTML parsing OR capturing text from a string within tags by Popcorn Dave (Abbot) on Dec 24, 2006 at 06:23 UTC
kevyt,

Might I suggest a different tack than the one you're taking now?

Long ago, I wrote a newspaper headline grabber for a Perl class using LWP::Simple's get function to grab web pages. I found that easier to use since it can return the whole page in a scalar. Then I used HTML::TokeParser to actually divide up the information, and based my collection on only the tokens I actually wanted to save.

If you look at Re: HTML::TokeParser help - parsing headlines, there's a quick and dirty token parser that I wrote so that you can see how it splits up an HTML file.

Hope that helps!

Revolution. Today, 3 O'Clock. Meet behind the monkey bars.
If quizzes are quizzical, what are tests?
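As a rough sketch of the token-based approach (not the code from the linked node), HTML::TokeParser can jump to each div start tag, check its attributes, and then collect the text up to the closing tag. The markup and the id value are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical page; in practice this would be the string
# returned by LWP::Simple's get().
my $html = <<'HTML';
<html><head><title>Headlines</title></head>
<body>
<div class="other">Skip me</div>
<div class="mytitle maximumtitle" id="idtitle"> Person <b> Ran </div>
</body></html>
HTML

my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @captured;
while (my $tag = $parser->get_tag('div')) {
    my $attr = $tag->[1];    # hash ref of the start tag's attributes
    next unless ($attr->{id} // '') eq 'idtitle';

    # Collect text up to the closing div, skipping inline tags
    # such as <b> and collapsing runs of whitespace.
    push @captured, $parser->get_trimmed_text('/div');
}

print "$_\n" for @captured;
```

Because the tokenizer does not build a tree, the unclosed b tag in the sample is not a problem; only the start and end tags being looked for matter.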
Re^2: HTML parsing OR capturing text from a string within tags by kevyt (Scribe) on Dec 24, 2006 at 07:09 UTC
Popcorn Dave,

Thanks... I will try that... I just added a lot of prints to Element.pm to see what is going on. I will try your method tomorrow :) Thanks...

This is what I have done. The format of Element.pm looks similar to code I used to work with at a former job.

    sub find_by_ktag_name {
      my(@pile) = shift(@_); # start out the to-do stack for the traverser
      Carp::croak "find_by_created_tag_name can be called only as an object method"
       unless ref $pile[0];
      return() unless @_;
      print "pile is @pile\n";
      my(@tags) = $pile[0]->_fold_case(@_);
      print "tags are @tags\n";
      my(@matching, $this, $this_tag);
      while(@pile) {
        $this_tag = ($this = shift @pile)->{'_tag'};
        print "In while loop. this_tag is $this_tag\n";
        foreach my $t (@tags) {
          print "foreach going through elements of tag. Elements are t and t is $t\n";
          print "next step will check to see if t is eq to this_tag. this_tag is $this_tag\n";
          if($t eq $this_tag) {
            print "inside of if... t and this_tag are equal.\n";
            if(wantarray) {
              print "I am here if wantarray is true. Now push this onto array matching\n";
              push @matching, $this;
              print "matching is @matching\n";
              last;
            } else {
              print "wantarray not true, returning this $this\n";
              return $this;
            }
          }
        }
        unshift @pile, grep ref($_), @{$this->{'_content'} || next};
      }
      print "returning @matching if wantarray\n";
      return @matching if wantarray;
      return;
    }

My print statements showed me that there is a library of predefined tags. If I can add my own tags, I think it will work :) I will also try your method. Tackling this is sort of fun.

Some output:

    next step will check to see if t is eq to this_tag. this_tag is a
    In while loop. this_tag is a
    next step will check to see if t is eq to this_tag. this_tag is font
    next step will check to see if t is eq to this_tag. this_tag is br
Re^2: HTML parsing OR capturing text from a string within tags by kevyt (Scribe) on Dec 24, 2006 at 07:31 UTC
Popcorn Dave,

I looked at your code. I don't know how it works yet. Will it allow me to add my own string and pull out the text right after it? For example, given `<div\042\... > Person <b> Ran <\div>`, will it allow me to capture Person Ran?

I think this is the file where I can add my own tags :) `HTML-Tree-3.23/lib/HTML/AsSubs.pm`
Re^3: HTML parsing OR capturing text from a string within tags by Popcorn Dave (Abbot) on Dec 24, 2006 at 09:12 UTC
All that code does is get an HTML page and parse it into tokens. It will spit the whole mess out, so I ran it from the command line, e.g. `perl tokeparser.pl > output.txt`. That way you can scan through the file and see how it's tokenizing the information you fed it.

Revolution. Today, 3 O'Clock. Meet behind the monkey bars.
If quizzes are quizzical, what are tests?
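For readers without access to the linked script, a minimal token dumper in the same spirit might look like the following. This is an illustrative sketch, not Popcorn Dave's original code, and the one-line input is made up.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical one-line document to tokenize.
my $html   = '<p class="x">Hello <b>world</b></p>';
my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @lines;
while (my $token = $parser->get_token) {
    my $type = $token->[0];    # 'S' start tag, 'E' end tag, 'T' text, ...
    if ($type eq 'S') {
        push @lines, "START $token->[1]";
    } elsif ($type eq 'E') {
        push @lines, "END $token->[1]";
    } elsif ($type eq 'T') {
        push @lines, "TEXT '$token->[1]'";
    } else {
        push @lines, "OTHER $type";    # comments, declarations, processing instructions
    }
}

print "$_\n" for @lines;
```

Redirecting the output to a file, as described above, makes it easy to see which tokens surround the text of interest before writing the real extraction loop.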
Re^4: HTML parsing OR capturing text from a string within tags by kevyt (Scribe) on Jan 02, 2007 at 17:44 UTC
Re^5: HTML parsing OR capturing text from a string within tags by Popcorn Dave (Abbot) on Jan 02, 2007 at 18:43 UTC
Some notes below your chosen depth have not been shown here

