As kaif points out, my script above did indeed produce questionable output, and at first I couldn't figure out why.
After a lot of head scratching and cursing I noticed that the OP's data had a variety of quote characters around the attribute values. I changed them to ordinary quotes and it now works OK.
kaif++ for spotting the snag.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use Data::Dumper;

my $p = HTML::TokeParser::Simple->new(*DATA)
    or die "couldn't parse DATA: $!\n";

# $i flags that the job description has already been captured
my (@records, %record, $i);

while (my $t = $p->get_token) {
    if ($t->is_start_tag('span')) {
        # work out which field this span holds from its class/name attribute
        if ($t->get_attr('class') and $t->get_attr('class') eq 'jobname') {
            $record{jobname} = $p->get_trimmed_text('/span');
        }
        elsif ($t->get_attr('class') and $t->get_attr('class') eq 'jobserial') {
            $record{jobserial} = $p->get_trimmed_text('/span');
        }
        elsif ($t->get_attr('name') and $t->get_attr('name') eq 'em') {
            # there can be several members, so collect them in an array
            push @{$record{em}}, $p->get_trimmed_text('/span');
        }
        elsif ($t->get_attr('name') and $t->get_attr('name') eq 'offices') {
            $record{offices} = $p->get_trimmed_text('/span');
        }
    }
    if ($t->is_start_tag('blockquote')) {
        # only the first blockquote is the job description; skip the rest
        next if $i;
        # read up to the closing </blockquote>
        my $txt = $p->get_trimmed_text('/blockquote');
        $record{job_desc} = $txt;
        push @records, {%record};
        %record = ();
        $i++;
    }
}

print Dumper \@records;
__DATA__
<p><b>
<span class="jobname">Accounting Assistant, Level 2</span>
<span class="jobserial">(19203)</span>
<br />Current members:<br />
<span name="em">Plow, Elliot</span>
<span name="em">Wang, Susan</span>
<br />
<span name="offices">Huston</span>
</p>
<blockquote>
Job descriptions here.
This block quoted text contains a job description
and it what I am really looking to recover.
</blockquote>
<blockquote>
<a href="#top">Go to the top of this page</a>.
</blockquote>
<blockquote>
<a href="companyHR.html">Check for open positions now!</a>
</blockquote>
---------- Capture Output ----------
> "c:\perl\bin\perl.exe" _new.pl
$VAR1 = [
          {
            'em' => [
                      'Plow, Elliot',
                      'Wang, Susan'
                    ],
            'job_desc' => 'Job descriptions here. This block quoted text contains a job description and it what I am really looking to recover.',
            'offices' => 'Huston',
            'jobserial' => '(19203)',
            'jobname' => 'Accounting Assistant, Level 2'
          }
        ];
> Terminated with exit code 0.
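For anyone who hits the same snag and would rather not edit the data by hand, here is a rough sketch of converting the quotes before parsing. The sub name `normalize_quotes` is made up for illustration, and the list of quote characters is only a guess at what "a variety of quotes" might cover, since I don't know exactly which characters were in the OP's file. It also assumes the HTML has already been decoded into a Perl character string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): map common
# typographic quote characters to plain ASCII quotes. The character
# list is an assumption; extend it to match the real data.
sub normalize_quotes {
    my ($html) = @_;
    # curly/low double quotes and double prime -> "
    $html =~ tr/\x{201C}\x{201D}\x{201E}\x{2033}/"/;
    # curly/low single quotes and prime -> '
    $html =~ tr/\x{2018}\x{2019}\x{201A}\x{2032}/'/;
    return $html;
}

# Sample input with curly quotes around the attribute value
my $raw   = qq{<span class=\x{201C}jobname\x{201D}>Accounting Assistant</span>};
my $clean = normalize_quotes($raw);
print $clean, "\n";
```

The cleaned string could then be handed to the parser instead of the DATA handle, for example with `HTML::TokeParser::Simple->new(string => $clean)`. Note that this replaces the quotes everywhere, including in the visible text, which may or may not be what you want.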