Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.

Retrieve select information from HTML

by Smaug (Pilgrim)
on Jul 18, 2013 at 19:06 UTC ( [id://1045152] : perlquestion . print w/replies, xml ) Need Help??

Smaug has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have a a file of serial numbers for machines that I need to determine the warranty status of, as well as the model.
The serial numbers are in a file which looks like this:
13XGK4J 9KJDK4J CNOR665J74445994CNTS 3RT1K4J
called serials.txt
Essentially I am searching each machine and retrieving the information on the model which exists in the meta-data at the start of the HTML as follows:
use strict; use warnings; use LWP::Simple; use Data::Dumper; use HTML::HeadParser; #use HTML::Miner; #use HTML::Tree; my $URL; my $head; open (DELLSERIALS, 'serials.txt'); while (<DELLSERIALS>) { chomp; print "<<<Now checking $_>>>\n"; my $funky = " +sdt1/Servicetag/" . $_; $URL = get("$funky"); $head = HTML::HeadParser->new; $head->parse("$URL"); # print Dumper($head); print $head->header('x-meta-supportservicetag') . "\n"; print $head->header('x-meta-supportproductcode') . "\n"; } close (DELLSERIALS);

The problem I have now is in getting that warranty information, and some machines have more than one warranty, from somewhere in the middle of the HTML although in the rendering of the page it is shown in the block to the right.
For each serial number I need the Next Business Day and the date (e.g. 11/12/2012) which are in bold in a hash or an array, something like:
$VAR1 = { 'warranty' => {'NBD ProSupport for Enterprise' => '02/12/201 +2', 'Next Business Day' => '02/12/2012' }, 'model' => 'Latitude E6500', 'serial' => '13XGK4J ' };

Any help would be appreciated. I did look at HTML::Miner and HTML::Tree but neither seemed to accomplish what I needed with my limited knowledge of HTML.
The longer serial number is a monitor and should be ignored, but I will handle that by not processing items with more than 7 digits in the serial number.
Peddle faster monkeys!! I need more power!!

Replies are listed 'Best First'.
Re: Retrieve select information from HTML
by Anonymous Monk on Jul 19, 2013 at 01:13 UTC
Re: Retrieve select information from HTML
by Anonymous Monk on Jul 19, 2013 at 02:37 UTC

    Here is a head up using "HTML::TreeBuilder" module.
    You will have to get the output you want and display it yourself that is left for the OP as exercise.

    use warnings; use strict; use LWP::UserAgent; use HTML::TreeBuilder; use URI; my $ua = LWP::UserAgent->new; my @linker = qw(13XGK4J 9KJDK4J CNOR665J74445994CNTS 3RT1K4J ); for (@linker) { # get the basic url for each product my $url = URI->new_abs( $_, " +vicetag/" ); my $broswer = $ua->get($url); if ( $broswer->is_success ) { my $tree = HTML::TreeBuilder->new; $tree->parse( $broswer->decoded_content ); $tree->eof; $tree->look_down( 'warrantyDetails', 'warrantServiceTagSub' ); print $tree->as_trimmed_text; $tree->delete; } else { die $broswer->status_line(), $/; } }

Re: Retrieve select information from HTML
by poj (Abbot) on Jul 19, 2013 at 07:12 UTC
    #!perl use strict; use warnings; use LWP::Simple; use HTML::TreeBuilder; use Data::Dump; my $URL = ' +ervicetag/'; my %machine=(); while (<DATA>){ chomp; $machine{$_}{'serial'} = $_ if (length($_)<=7); } my $ua = LWP::UserAgent->new; for my $sn (sort keys %machine){ print "Checking $sn .. "; parse_dell($sn); print "done\n"; } sub parse_dell { my ($sn) = @_; my $response = $ua->get($URL.$sn); if ($response->is_success) { my $tree = HTML::TreeBuilder->new(); $tree->parse($response->decoded_content); my @e = $tree->look_down('class', 'TopTwoWarrantyListItem'); for (@e){ my $t = $_->as_text; if ($t =~ m!(NBD ProSupport for Enterprise|Next Business Day).*( +\d{2}/\d{2}/\d{4})! ){ $machine{$sn}{$1}=$2; } } # meta tags @e = $tree->find_by_attribute('name','supportproductcode'); for (@e){ $machine{$sn}{'model'} = $_->attr('content'); } } else { warn $response->statusline; } } dd %machine; __DATA__ 13XGK4J 9KJDK4J CNOR665J74445994CNTS 3RT1K4J
Re: Retrieve select information from HTML
by Preceptor (Deacon) on Jul 18, 2013 at 19:20 UTC

    I can't comment directly, as I don't know what your webpage looks like. But have you had a look at HTML::TableExtract - that may do what you need.