Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Retrieve select information from HTML

by Smaug (Pilgrim)
on Jul 18, 2013 at 19:06 UTC ( [id://1045152]=perlquestion: print w/replies, xml ) Need Help??

Smaug has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have a a file of serial numbers for machines that I need to determine the warranty status of, as well as the model.
The serial numbers are in a file which looks like this:
13XGK4J 9KJDK4J CNOR665J74445994CNTS 3RT1K4J
called serials.txt
Essentially I am searching each machine and retrieving the information on the model which exists in the meta-data at the start of the HTML as follows:
use strict; use warnings; use LWP::Simple; use Data::Dumper; use HTML::HeadParser; #use HTML::Miner; #use HTML::Tree; my $URL; my $head; open (DELLSERIALS, 'serials.txt'); while (<DELLSERIALS>) { chomp; print "<<<Now checking $_>>>\n"; my $funky = "http://www.dell.com/support/troubleshooting/ed/en/edb +sdt1/Servicetag/" . $_; $URL = get("$funky"); $head = HTML::HeadParser->new; $head->parse("$URL"); # print Dumper($head); print $head->header('x-meta-supportservicetag') . "\n"; print $head->header('x-meta-supportproductcode') . "\n"; } close (DELLSERIALS);

The problem I have now is in getting that warranty information, and some machines have more than one warranty, from somewhere in the middle of the HTML although in the rendering of the page it is shown in the block to the right.
For each serial number I need the Next Business Day and the date (e.g. 11/12/2012) which are in bold in a hash or an array, something like:
$VAR1 = { 'warranty' => {'NBD ProSupport for Enterprise' => '02/12/201 +2', 'Next Business Day' => '02/12/2012' }, 'model' => 'Latitude E6500', 'serial' => '13XGK4J ' };

Any help would be appreciated. I did look at HTML::Miner and HTML::Tree but neither seemed to accomplish what I needed with my limited knowledge of HTML.
The longer serial number is a monitor and should be ignored, but I will handle that by not processing items with more than 7 digits in the serial number.
Regards,
Smaug.
Peddle faster monkeys!! I need more power!!

Replies are listed 'Best First'.
Re: Retrieve select information from HTML
by Anonymous Monk on Jul 19, 2013 at 01:13 UTC
Re: Retrieve select information from HTML
by Anonymous Monk on Jul 19, 2013 at 02:37 UTC

    Here is a head up using "HTML::TreeBuilder" module.
    You will have to get the output you want and display it yourself that is left for the OP as exercise.

    use warnings; use strict; use LWP::UserAgent; use HTML::TreeBuilder; use URI; my $ua = LWP::UserAgent->new; my @linker = qw(13XGK4J 9KJDK4J CNOR665J74445994CNTS 3RT1K4J ); for (@linker) { # get the basic url for each product my $url = URI->new_abs( $_, "http://www.dell.com/support/troubleshooting/ed/en/edbsdt1/Ser +vicetag/" ); my $broswer = $ua->get($url); if ( $broswer->is_success ) { my $tree = HTML::TreeBuilder->new; $tree->parse( $broswer->decoded_content ); $tree->eof; $tree->look_down( 'warrantyDetails', 'warrantServiceTagSub' ); print $tree->as_trimmed_text; $tree->delete; } else { die $broswer->status_line(), $/; } }

Re: Retrieve select information from HTML
by Preceptor (Deacon) on Jul 18, 2013 at 19:20 UTC

    I can't comment directly, as I don't know what your webpage looks like. But have you had a look at HTML::TableExtract - that may do what you need.

Re: Retrieve select information from HTML
by poj (Abbot) on Jul 19, 2013 at 07:12 UTC
    #!perl use strict; use warnings; use LWP::Simple; use HTML::TreeBuilder; use Data::Dump; my $URL = 'http://www.dell.com/support/troubleshooting/ed/en/edbsdt1/S +ervicetag/'; my %machine=(); while (<DATA>){ chomp; $machine{$_}{'serial'} = $_ if (length($_)<=7); } my $ua = LWP::UserAgent->new; for my $sn (sort keys %machine){ print "Checking $sn .. "; parse_dell($sn); print "done\n"; } sub parse_dell { my ($sn) = @_; my $response = $ua->get($URL.$sn); if ($response->is_success) { my $tree = HTML::TreeBuilder->new(); $tree->parse($response->decoded_content); my @e = $tree->look_down('class', 'TopTwoWarrantyListItem'); for (@e){ my $t = $_->as_text; if ($t =~ m!(NBD ProSupport for Enterprise|Next Business Day).*( +\d{2}/\d{2}/\d{4})! ){ $machine{$sn}{$1}=$2; } } # meta tags @e = $tree->find_by_attribute('name','supportproductcode'); for (@e){ $machine{$sn}{'model'} = $_->attr('content'); } } else { warn $response->statusline; } } dd %machine; __DATA__ 13XGK4J 9KJDK4J CNOR665J74445994CNTS 3RT1K4J
    poj

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1045152]
Approved by kevbot
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (7)
As of 2024-04-25 11:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found