PerlMonks  

Re: HTML::Parser / Regex

by AnomalousMonk (Archbishop)
on May 26, 2017 at 22:03 UTC (#1191327)


in reply to HTML::Parser / Regex

I know that ... some other module might have easier way to do this. But for now, I want to learn and apply HTML::Parser and regex ...

Ok, so you're committed to drilling all those holes in your head just to prove to yourself for sure that drilling holes in your head is a bad idea. Here's one approach:

c:\@Work\Perl\monks>perl -wMstrict -le
"use warnings;
 use strict;
 ;;
 use Regexp::Common;
 ;;
 use Data::Dump qw(dd);
 ;;
 my @lines = (
   'Summary</h1><table border=\"1\"><tr><th>Employee John Doe</th><th>-0.82</th>',
   'Summary</h1><table border=\"1\"><tr><th> Employee Fred D. Poe </th><th> -5.03 </th>',
   'Summary</h1><table border=\"1\"><tr><th>Employee Billy-Bob Toe</th><th> </th>',
   'Summary</h1><table border=\"1\"><tr><th>Employee</th><th>999</th>',
   '<th>Employee Prince </th><th> 123</th>',
   '<th>Employee O</th><th> 1.23 </th>',
   );
 ;;
 my $rx_name     = qr{ \S+? (?: \s+ \S+)*? }xms;
 my $rx_th_open  = qr{ \s* < th > \s* }xms;
 my $rx_th_close = qr{ \s* < / th > \s* }xms;
 ;;
 my %per_employee;
 ;;
 LINE:
 for my $line (@lines) {
   my $parsed = my ($name, $amount) = $line =~ m{
       $rx_th_open Employee \s+ ($rx_name) $rx_th_close
       $rx_th_open ($RE{num}{real})? $rx_th_close
       }xms;
 ;;
   if (not $parsed) {
     warn qq{'$line' failed to parse};
     next LINE;
     }
 ;;
   $amount = 'no amount' unless defined $amount;
   $per_employee{$name} = $amount;
   }
 ;;
 dd \%per_employee;
 "
'Summary</h1><table border="1"><tr><th>Employee</th><th>999</th>' failed to parse at -e line 1.
{
  "Billy-Bob Toe" => "no amount",
  "Fred D. Poe" => "-5.03",
  "John Doe" => "-0.82",
  O => "1.23",
  Prince => 123,
}
(Note that the $rx_name regex is a very naive match for an actual human name. Update: See the off-site article Falsehoods Programmers Believe About Names.)

Update: Significant changes to the example code: the $rx_th_open and $rx_th_close regexes were made more elegant (?); rudimentary error handling was added; corner and error test cases were added.
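Since the original question also asked about HTML::Parser, here is a rough sketch of the same extraction with the tag handling delegated to that module and a regex used only on the cell text. It is not the code from the example above. The helper names th_cells and employees_from are made up for illustration, and unlike the $RE{num}{real} version, this sketch does not check that the amount is actually a number.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the trimmed text of every <th> cell in $html.
sub th_cells {
    my ($html) = @_;
    my (@cells, $in_th);

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start a new, empty cell at each <th> tag.
        start_h => [ sub { if ($_[0] eq 'th') { $in_th = 1; push @cells, '' } }, 'tagname' ],
        # Stop collecting at </th>.
        end_h   => [ sub { $in_th = 0 if $_[0] eq 'th' }, 'tagname' ],
        # Text may arrive in several chunks, so append rather than assign.
        text_h  => [ sub { $cells[-1] .= $_[0] if $in_th }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;

    s/\A\s+|\s+\z//g for @cells;    # trim surrounding whitespace
    return @cells;
}

# Hypothetical helper: build a name => amount hash from lines of HTML.
sub employees_from {
    my (@lines) = @_;
    my %per_employee;

    for my $line (@lines) {
        my ($label, $amount) = th_cells($line);

        # The first cell must look like 'Employee <name>'.
        my ($name) = defined $label
            ? $label =~ m{ \A Employee \s+ (\S .*) \z }xms
            : ();

        if (not defined $name) {
            warn qq{'$line' failed to parse\n};
            next;
        }

        # NOTE: the amount is taken as-is; it is not validated as a number.
        $per_employee{$name} =
            (defined $amount and length $amount) ? $amount : 'no amount';
    }
    return \%per_employee;
}

my $per_employee = employees_from(
    'Summary</h1><table border="1"><tr><th>Employee John Doe</th><th>-0.82</th>',
    'Summary</h1><table border="1"><tr><th>Employee Billy-Bob Toe</th><th> </th>',
    'Summary</h1><table border="1"><tr><th>Employee</th><th>999</th>',
);

print "$_ => $per_employee->{$_}\n" for sort keys %$per_employee;
```

The name pattern here is just as naive as $rx_name; the only thing gained is that the parser, not the regex, decides where a th cell starts and ends.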


Give a man a fish:  <%-{-{-{-<
