I hope you had fun. Here is something new :) A walkthrough of how to shorten your HTML parsing code, declarative style (I think).
First, fetch the page and list candidate elements along with their XPaths:
$ lwp-download http://apod.nasa.gov/apod/ apod.html
4.21 KB received
$ perl htmltreexpather.pl apod.html _tag p | head -n 6
HTML::Element=HASH(0xb5ed04) 0.1.1.0
Milky Way Over Piton de l'Eau
/html/body/center[2]/b
/html/body/center[2]/b
/html/body[@link='#0000FF' and @vlink='#7F0F9F' and @alink='#FF0000' and @bgcolor='#F4F4FF' and @text='#000000']/center[2]/b
------------------------------------------------------------------
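htmltreexpather.pl itself is not shown in this post. As a rough idea of what such a helper can look like, here is a minimal sketch, assuming HTML::TreeBuilder is installed. The sub names xpath_of and report are made up for illustration, and the XPaths it prints always carry a positional index, so they will not look exactly like the output above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Build a simple positional XPath for an HTML::Element by walking
# from the root down and counting same-tag siblings to the left.
# (Hypothetical helper, not the original htmltreexpather.pl.)
sub xpath_of {
    my ($elem) = @_;
    my @steps;
    for my $node ( reverse( $elem->lineage ), $elem ) {
        my $tag = $node->tag;
        # Text nodes are plain strings, so ref() filters them out.
        my $index = 1 + grep { ref($_) && $_->tag eq $tag } $node->left;
        push @steps, "$tag\[$index]";
    }
    return '/' . join '/', @steps;
}

# For every element matching the look_down() criteria, return its
# address, its text and its XPath as one multi-line string.
sub report {
    my ( $tree, @criteria ) = @_;
    return map { join "\n", $_->address, $_->as_text, xpath_of($_) }
        $tree->look_down(@criteria);
}

# Usage: perl xpath-sketch.pl apod.html _tag b
if (@ARGV) {
    my ( $file, @criteria ) = @ARGV;
    my $tree = HTML::TreeBuilder->new_from_file($file);
    print "$_\n" for report( $tree, @criteria );
    $tree->delete;
}
```

The criteria after the filename are handed straight to look_down(), so `_tag b` finds every b element.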
Then plug those XPaths into the scraper shell that ships with Web::Scraper:
$ scraper apod.html
scraper> d
$VAR1 = {};
scraper> process '/html/body/center/p[2]' => 'Date' => 'TEXT';
scraper> d
$VAR1 = {
'Date' => ' 2012 June 25 '
};
scraper> process '//b' => 'b[]' => 'TEXT';
scraper> y
---
Date: ' 2012 June 25 '
b:
- " Milky Way Over Piton de l'Eau "
- ' Image Credit & Copyright: '
- ' Explanation: '
- ' Help Evaluate APOD: '
- " Tomorrow's picture: "
- ' Authors & editors: '
- 'NASA Official: '
- 'A service of:'
- '&'
scraper> c all
#!c:\perl\5.14.1\bin\MSWin32-x86-multi-thread\perl.exe
use strict;
use Web::Scraper;
use URI;
my $file = \do { my $file = "apod.html"; open my $fh, $file or die "$file: $!"; join '', <$fh> };
my $scraper = scraper {
process '/html/body/center/p[2]' => 'Date' => 'TEXT';
process '//b' => 'b[]' => 'TEXT';
};
my $result = $scraper->scrape($file);
scraper> q
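The generated script stops at scrape() and never prints anything. A minimal sketch of running the same two rules and dumping the result, assuming Web::Scraper is installed; the inline HTML is a made-up stand-in for apod.html, not the real page:

```perl
use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

# Stand-in HTML (invented for this sketch, not the real APOD page).
my $html = <<'HTML';
<html><body>
<center><p>first</p><p>2012 June 25</p></center>
<center><b>Milky Way Over Piton de l'Eau</b></center>
</body></html>
HTML

# Same rules as in the shell session above.
my $scraper = scraper {
    process '/html/body/center/p[2]' => 'Date' => 'TEXT';
    process '//b'                    => 'b[]'  => 'TEXT';
};

# A scalar reference is treated as HTML content, not as a URL.
my $result = $scraper->scrape( \$html );
print Dumper($result);
```

'Date' ends up as a plain string and 'b' as an array reference, because of the [] suffix on the key.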
Repeat the process/d cycle for each field you want. Firefox/Firebug can be useful for extracting XPaths. You can end up with something like
my $scraper = scraper {
process '//b[1]' => 'Title' => 'TEXT';
process '/html/body/center[2]/b[2]' => 'Credit' => 'TEXT';
process '/html/body/p[1]' => 'Desc' => 'TEXT';
process '/html/body/center/p[2]' => 'Date' => 'TEXT';
#~ process q{//a[ @href =~ "image/" ]} => 'Image' => '@HREF';
process q{//a[ contains(@href, "image/") ]} => 'Image' => '@HREF';
};
## NOTE: pass a URI object so the scraper will download (read) the file
my $url = URI->new( 'file:apod.html' );
my $base = 'http://apod.nasa.gov/apod/';
my $ret = $scraper->scrape( $url , $base );
You can also mirror the HTML file (LWP::Simple::mirror()) and only scrape it if it's new.
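A minimal sketch of that mirror-then-scrape idea, assuming LWP::Simple and Web::Scraper are installed. The subs is_fresh_copy and scrape_if_new are hypothetical names for this example, and the single Title rule is only a placeholder for your own rules:

```perl
use strict;
use warnings;
use LWP::Simple qw(mirror);
use Web::Scraper;
use URI;
use Data::Dumper;

# mirror() sends If-Modified-Since and returns the HTTP status code:
# 2xx means the local copy was (re)written, 304 means it is unchanged.
sub is_fresh_copy {
    my ($rc) = @_;
    return $rc >= 200 && $rc < 300;
}

# Mirror $url into $file and scrape the local copy only when it was
# updated. Returns nothing if the file is unchanged or the fetch failed.
sub scrape_if_new {
    my ( $url, $file, $scraper ) = @_;
    my $rc = mirror( $url, $file );
    return unless is_fresh_copy($rc);
    # A URI object makes the scraper read the file; $url is the base
    # for resolving relative links such as @href.
    return $scraper->scrape( URI->new("file:$file"), $url );
}

# Usage: perl mirror-scrape.pl http://apod.nasa.gov/apod/ apod.html
if ( @ARGV == 2 ) {
    my ( $url, $file ) = @ARGV;
    my $scraper = scraper {
        process '//b[1]' => 'Title' => 'TEXT';    # placeholder rule
    };
    my $ret = scrape_if_new( $url, $file, $scraper );
    print $ret ? Dumper($ret) : "$file not updated, nothing scraped\n";
}
```

Run it from cron or by hand; on a 304 the old apod.html stays in place and no scraping happens.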