Re: Parsing HTML files to recover data...

This gives you an array of hashes. It uses the second blockquote to trigger the start of the next record.

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TokeParser::Simple;
use Data::Dumper;

my $p = HTML::TokeParser::Simple->new(*DATA)
  or die "couldn't parse DATA: $!\n";
  
my (@records, %record, $start);

while (my $t = $p->get_token){
  
  if ($t->is_start_tag('span')){
    if ($t->get_attr('class') and $t->get_attr('class') eq 'jobname'){
      $record{jobname} = $p->get_trimmed_text('span');
    }
    elsif ($t->get_attr('class') and $t->get_attr('class') eq 'jobserial'){
      $record{jobserial} = $p->get_trimmed_text('span');
    }
    
    elsif ($t->get_attr('name') and $t->get_attr('name') eq 'em'){
      push @{$record{em}}, $p->get_trimmed_text('span');
    }
    elsif ($t->get_attr('name') and $t->get_attr('name') eq 'offices'){
        $record{offices} = $p->get_trimmed_text('span');
    }
  }

  if ($t->is_start_tag('blockquote')){
    next if exists $record{job_desc};
    $record{job_desc} = $p->get_trimmed_text('blockquote');
    #die Dumper \%record;
    push @records, \%record;
    %record = ();
  }
    
}

print Dumper \@records;

__DATA__
<p><b><span class="jobname">
Accounting Assistant, Level 2
</span>  

<span class="jobserial">(19203)</span>
<br />
Current members:
<br />
<span name="em">Plow, Elliot</span> 
<span name="em">Wang, Susan</span>
<br />

<span name=”offices”>Huston</span>
</p>
<blockquote>
Job descriptions here.
This block quoted text contains a job description and it what I am really looking to recover.
</blockquote>
<blockquote><a href="#top">Go to the top of this page</a>.</blockquote>
<blockquote><a href=”companyHR.html”>Check for open positions now!</a></blockquote>

output:

$VAR1 = {
  'job_desc' => 'Job descriptions here. This block quoted text contains a job description and it what I am really looking to recover.',
  'em' => [
    'Plow, Elliot',
    'Wang, Susan'
  ],
  'jobserial' => '(19203) Current members:',
  'jobname' => 'Accounting Assistant, Level 2'
};

update: see my reply below.
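
A side note on the script above: it pushes \%record onto @records and then empties %record. Since every pushed element is a reference to the same hash, clearing it should also empty what was stored, which is presumably why the revised script in the reply pushes {%record} (a copy) instead. Here is a minimal sketch of the difference using core Perl only; the names @by_ref and @by_copy are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my (@by_ref, @by_copy, %record);

for my $name ('Plow, Elliot', 'Wang, Susan') {
    %record = (em => $name);
    push @by_ref,  \%record;     # every element refers to the one %record
    push @by_copy, { %record };  # each element is a fresh anonymous hash
    %record = ();                # empties the shared hash, not the copies
}

# @by_ref holds two references to the same (now empty) hash;
# @by_copy still holds one populated hash per name.
print Dumper(\@by_ref, \@by_copy);
```

Copying with { %record } is shallow, so nested structures such as the 'em' array are still shared, but that is harmless here because %record is reset rather than modified in place.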

Re^2: Parsing HTML files to recover data... by wfsp (Abbot) on Nov 22, 2006 at 14:07 UTC
As kaif points out, my script above did indeed produce questionable output, and I also couldn't figure out why. After a lot of head scratching and cursing I noticed that the OP's data had a variety of quotes around the attribute values. I changed them to ordinary quotes and it now works OK. kaif++ for spotting the snag.

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TokeParser::Simple;
use Data::Dumper;

my $p = HTML::TokeParser::Simple->new(*DATA)
  or die "couldn't parse DATA: $!\n";

my (@records, %record, $start, $i);

while (my $t = $p->get_token){

  if ($t->is_start_tag('span')){
    # '/span' stops each read at the closing tag of the current span
    if ($t->get_attr('class') and $t->get_attr('class') eq 'jobname'){
      $record{jobname} = $p->get_trimmed_text('/span');
    }
    elsif ($t->get_attr('class') and $t->get_attr('class') eq 'jobserial'){
      $record{jobserial} = $p->get_trimmed_text('/span');
    }
    elsif ($t->get_attr('name') and $t->get_attr('name') eq 'em'){
      push @{$record{em}}, $p->get_trimmed_text('/span');
    }
    elsif ($t->get_attr('name') and $t->get_attr('name') eq 'offices'){
      $record{offices} = $p->get_trimmed_text('/span');
    }
  }

  if ($t->is_start_tag('blockquote')){
    next if $i;   # only the first blockquote holds the job description
    my $txt = $p->get_trimmed_text(('blockquote'));
    $record{job_desc} = $txt;
    push @records, {%record};   # push a copy, then reset for the next record
    %record = ();
    $i++;
  }
}

print Dumper \@records;

__DATA__
<p><b>
<span class="jobname">Accounting Assistant, Level 2</span>
<span class="jobserial">(19203)</span>
<br />Current members:<br />
<span name="em">Plow, Elliot</span>
<span name="em">Wang, Susan</span>
<br />
<span name="offices">Huston</span>
</p>
<blockquote>
Job descriptions here.
This block quoted text contains a job description and it what I am really looking to recover.
</blockquote>
<blockquote>
<a href="#top">Go to the top of this page</a>.
</blockquote>
<blockquote>
<a href="companyHR.html">Check for open positions now!</a>
</blockquote>

output:

---------- Capture Output ----------
> "c:\perl\bin\perl.exe" _new.pl
$VAR1 = [
  {
    'em' => [
      'Plow, Elliot',
      'Wang, Susan'
    ],
    'job_desc' => 'Job descriptions here. This block quoted text contains a job description and it what I am really looking to recover.',
    'offices' => 'Huston',
    'jobserial' => '(19203)',
    'jobname' => 'Accounting Assistant, Level 2'
  }
];

> Terminated with exit code 0.
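
If editing the source files by hand is not practical, the typographic quotes could be normalised before the text reaches the parser. The following is only a sketch under some assumptions: straighten_quotes is a made-up helper name, the input is taken to be an already decoded character string (not raw bytes), and the curly quotes are assumed to be U+201C/U+201D and U+2018/U+2019.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map typographic quotes to their ASCII equivalents.
# Assumes $html is a decoded character string, not a raw byte string.
sub straighten_quotes {
    my ($html) = @_;
    $html =~ tr/\x{201C}\x{201D}\x{2018}\x{2019}/""''/;
    return $html;
}

my $in  = "<span name=\x{201D}offices\x{201D}>Huston</span>";
my $out = straighten_quotes($in);
print "$out\n";   # the attribute value is now wrapped in plain double quotes
```

Note that this rewrites curly quotes everywhere, including inside the job descriptions themselves, so it may be worth restricting it to the markup if the typography of the text matters.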
Re^2: Parsing HTML files to recover data... by kaif (Friar) on Nov 22, 2006 at 12:18 UTC
That's a questionable value for the 'jobserial' key. Looking at your code, I can't figure out why that could happen ...
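
A likely explanation, judging from the two scripts in this thread: the first one calls get_trimmed_text('span'), which reads text up to the next <span> start tag, while the revised one passes '/span' and so stops at the closing tag. The sketch below compares the two on a cut-down copy of the sample markup. The helper text_after_first_span is made up for illustration, and it assumes HTML::TokeParser::Simple is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $html = '<span class="jobserial">(19203)</span><br />Current members:<br />'
         . '<span name="em">Plow, Elliot</span>';

# Hypothetical helper: return the text that follows the first <span>,
# reading until $end_tag is seen.
sub text_after_first_span {
    my ($end_tag) = @_;
    my $p = HTML::TokeParser::Simple->new(\$html)
        or die "couldn't parse string\n";
    while (my $t = $p->get_token) {
        return $p->get_trimmed_text($end_tag) if $t->is_start_tag('span');
    }
    return;
}

for my $end_tag ('span', '/span') {
    printf "%-6s => %s\n", $end_tag, text_after_first_span($end_tag);
}
```

With 'span' the read should run on past the closing tag and pick up "Current members:" as well, which matches the value kaif flagged; with '/span' it should stop at "(19203)".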