PerlMonks  

HTML::Parser / Regex

by MissPerl (Sexton)
on May 26, 2017 at 20:30 UTC ( [id://1191319]=perlquestion )

MissPerl has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow Perl Monks,

I am trying to extract text/numbers from an HTML file and store them in variables. I know that HTML::TableExtract or some other module might offer an easier way to do this, but for now I want to learn and apply HTML::Parser and regex first.

This is part of my failed attempt at the Perl script. It produced errors like "bareword found (might be runaway multi-line)" and "can't use global $1 in my". At the beginning of the script, it prompts the user for input and stores it in a variable. For now, I am writing the part that reads the $ca HTML file and finds the matches. The next part of the script will do the same for the other states' HTML files.

use HMTL::Parser;

my $ca = "california.html";
open (my $f1, "<" , $ca) || die ("Can't open file : california.html");
while (<$f1>){
    if (my $text =~ /Employee\sA</th><th>.\d</){
        my $one = $1;
    }elsif (my $text =~ /Employee\sB</th><th>.\d</){
        my $two = $1;
    }elsif (my $text =~ /Employee\sC</th><th>.\d</){
        my $three = $1;
    }
}
close ($f1);

Below are a few lines from two different html files.

The employee A/B/C is fixed, but sometimes there will be no value between the <th> tags.

</tr></table></body><body bgcolor="black"><h1>
Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>
</tr><tr><th>Employee B</th><th>-5.02</th>
</tr><tr><th>Employee C</th><th>19</th>
</tr></table></body><body bgcolor="black"><h1>
Summary</h1><table border="1"><tr><th>Employee A</th><th></th>
</tr><tr><th>Employee B</th><th></th>
</tr><tr><th>Employee C</th><th></th>

And I've been trying to get the values into variables so that I can use them later:

$one = -0.82
$two = -5.02
$three = 19

I apologize in advance for the failed attempt at a Perl script that I wrote. I would understand if it's too painful to watch, but kindly point out my mistakes and guide me to the correct way. Thank you so much.

P/S: You could say that I am a slow learner, because after a few days of self-learning Perl I still couldn't quite pick up how this string-matching works. I'm using perl v5.8.8

Replies are listed 'Best First'.
Re: HTML::Parser / Regex
by Mr. Muskrat (Canon) on May 26, 2017 at 21:05 UTC

    You are not using strict and warnings.

    You have loaded HMTL::Parser instead of HTML::Parser but then you do not try to use it.

    You are trying to search an undefined variable $text.

    You are trying to use captured values but do not have any capture groups.

    (May not be a problem in your real code but) you are defining a different variable in each part of the if/elsif blocks.

    The pattern you are using to match the numbers is a bit odd. Take a look at Regexp::Common.

    You are trying to use regular expressions to search for slashes without changing the "/" delimiters. See Regexp Quote-Like Operators.

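    # partial snippet
    use strict;
    use warnings;
    use Regexp::Common;
    # ...
    # Declared outside the loop so the values are still available after it.
    my ($one, $two, $three); # Also these variable names are not very descriptive.
    while ( my $text = <$f1> ) {
        # chomp separately: its return value is a count, not the line.
        chomp $text;
        if ($text =~ m!Employee\sA</th><th>($RE{num}{real})<!) {
            $one = $1;
        }
        # ...
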
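
For readers who cannot install Regexp::Common, the two regex points in the list above (a delimiter other than "/" and an explicit capture group) can be sketched with a hand-written number pattern. This is only a sketch: extract_amount and $NUMBER are hypothetical names, not part of the original reply, and the pattern assumes plain signed decimals such as -0.82 or 19 with no exponents.

```perl
use strict;
use warnings;

# Hand-written pattern for an optionally signed integer or decimal.
# Assumption: values look like -0.82, 19 or .5 (no exponents).
my $NUMBER = qr{[-+]?(?:\d+(?:\.\d*)?|\.\d+)};

# Hypothetical helper: return the number in the cell that follows
# "Employee <label>", or undef if that cell is empty or absent.
sub extract_amount {
    my ($line, $label) = @_;
    # m{...} delimiters mean the "/" in "</th>" needs no escaping;
    # the parentheses around $NUMBER form capture group 1.
    if ($line =~ m{Employee\s+\Q$label\E</th><th>\s*($NUMBER)\s*</th>}) {
        return $1;
    }
    return;    # no match: undef in scalar context
}

my $line = 'Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>';
my $one  = extract_amount($line, 'A');
print defined $one ? "one = $one\n" : "no value for A\n";
```

The helper only reads $1 inside the branch where the match succeeded, which avoids picking up a stale capture from an earlier match.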
      Hi Mr. Muskrat,

      Thank you for your reply. I did try to use HTML::Parser, but it ended up pretty ugly, so I did not include that part of the code.

      Do you have any recommended links for HTML::Parser?

      I apologize for not mentioning which Perl version I am using in the first place. I am using v5.8.8.

      I've tried the solution you provided, and it seems that the version I am using doesn't support Regexp::Common.

      Also, thanks for pointing out the mistakes I made! And I totally forgot to turn on strict and warnings!
        ... the version that I am using doesn't support Regexp::Common.

        Why do you say that? What errors/system messages do you get? The code I posted here uses Regexp::Common and runs under Perl 5.8.9. Are you sure you have the module installed on your system?


        Give a man a fish:  <%-{-{-{-<

        ... tried to use HTML::Parser, but it ended up pretty ugly,

        What didn't you like about using HTML::Parser?

        #!/usr/bin/perl
        use warnings;
        use strict;
        use HTML::Parser;

        my %inside = ();
        my $tbl = -1;
        my $col;
        my $row;
        my @table = ();

        my $p = HTML::Parser->new(
            handlers => {
                start => [ \&start,'tagname' ],
                end   => [ \&end,  'tagname' ],
                text  => [ \&text, 'text' ],
            }
        );
        $p->parse_file(\*DATA); # or filename

        # output
        for my $t (0..$#table){
            print "\nTable $t\n";
            for my $r (0..$#{$table[$t]}){
                my $line = join "\t",$r,@{$table[$t][$r]};
                print "$line\n";
            }
        }

        sub start {
            my $tag = shift;
            $inside{$tag} = 1;
            if ($tag eq 'table'){
                ++$tbl;
                $row = -1;
            }
            elsif ($tag eq 'tr'){
                ++$row;
                $col = -1;
            }
            elsif ($tag eq 'th'){
                ++$col;
                $table[$tbl][$row][$col] = ''; # or undef
            }
        }

        sub end {
            my $tag = shift;
            $inside{$tag} = 0;
        }

        sub text {
            my $str = shift;
            if ( $inside{'th'} ){
                $table[$tbl][$row][$col] = $str;
            }
        }

        __DATA__
        </table></body><body bgcolor="black"><h1>
        Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>
        </tr><tr><th>Employee B</th><th>-5.02</th>
        </tr><tr><th>Employee C</th><th>19</th>
        </tr></table></body><body bgcolor="black"><h1>
        Summary</h1><table border="1"><tr><th>Employee A</th><th></th>
        </tr><tr><th>Employee B</th><th></th>
        </tr><tr><th>Employee C</th><th></th>

        poj
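
The parser code above collects each table as a list of rows, where each row is a list of cell strings. As a possible next step toward the named values the original question asks for, the rows of one table could be turned into a hash keyed by the first cell. This is a minimal sketch that needs no modules; rows_to_hash is a hypothetical helper, and the sample rows are assumed to have the same shape as one entry of the parser's table list.

```perl
use strict;
use warnings;

# Hypothetical helper: turn rows such as ['Employee A', '-0.82'] into
# a hash keyed by the first cell. An empty second cell becomes undef.
sub rows_to_hash {
    my ($rows) = @_;
    my %value_for;
    for my $row (@$rows) {
        my ($name, $value) = @$row;
        next unless defined $name && length $name;
        # Treat an empty or whitespace-only cell as "no value".
        $value = undef if defined $value && $value !~ /\S/;
        $value_for{$name} = $value;
    }
    return \%value_for;
}

# Assumed shape: one table as an array of [name, value] rows.
my @summary = (
    [ 'Employee A', '-0.82' ],
    [ 'Employee B', '-5.02' ],
    [ 'Employee C', '' ],
);
my $value_for = rows_to_hash(\@summary);
for my $name (sort keys %$value_for) {
    my $value = $value_for->{$name};
    print "$name => ", (defined $value ? $value : '(empty)'), "\n";
}
```

A hash keyed by employee name avoids having to invent one scalar variable per employee.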
Re: HTML::Parser / Regex
by AnomalousMonk (Archbishop) on May 26, 2017 at 21:18 UTC
    ... I still couldn't quite pick up how this string-matching works.

    If you haven't already seen them, please take a look at perlrequick, perlretut (and, of course, at perlre for the hard core stuff), and at Pattern Matching, Regular Expressions, and Parsing in the monastery's own Tutorials.

    Update: Oh, and please let us know the version of Perl you're working with so we can know the regex features available to you!


    Give a man a fish:  <%-{-{-{-<

      Hi AnomalousMonk,

      Thank you for all those links. I think I had overlooked most of them. I'm going to have a look at them.

      Also, I am using perl v5.8.8. It is pretty old, but yeah...
        ... using perl v5.8.8. ... pretty old ...

        Not to worry. The example code I posted here runs under 5.8.9. It's intentionally ambitious; I hope you will come back to it repeatedly as you read more about Perl, regexes, etc.


        Give a man a fish:  <%-{-{-{-<

Re: HTML::Parser / Regex
by kcott (Archbishop) on May 27, 2017 at 06:09 UTC

    G'day MissPerl,

    Welcome to the Monastery.

    "... after a few days of self-learning perl ..."

    I'd recommend reading through "perlintro - Perl introduction for beginners". It's not particularly long (about 10 screenfuls on my monitor) and will walk you through the basics. As you've only just started learning Perl, you have (quite understandably) made a number of novice mistakes: this document should go a long way to clearing up those problems.

    In addition, it's peppered with links to more detailed information and advanced topics. Return to this document when the need arises, and delve into the specifics as required.

    — Ken

      Good day Ken!

      Thank you for the link!

      I am going through all the useful materials from fellow PerlMonks. Time is ticking, but I'll be sure to come back if I come across something I don't understand. Thanks!!

Re: HTML::Parser / Regex
by AnomalousMonk (Archbishop) on May 26, 2017 at 22:03 UTC
    I know that ... some other module might have easier way to do this. But for now, I want to learn and apply HTML::Parser and regex ...

    Ok, so you're committed to drilling all those holes in your head just to prove to yourself for sure that drilling holes in your head is a bad idea. Here's one approach:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use warnings;
     use strict;
     ;;
     use Regexp::Common;
     ;;
     use Data::Dump qw(dd);
     ;;
     my @lines = (
       'Summary</h1><table border=\"1\"><tr><th>Employee John Doe</th><th>-0.82</th>',
       'Summary</h1><table border=\"1\"><tr><th> Employee Fred D. Poe </th><th> -5.03 </th>',
       'Summary</h1><table border=\"1\"><tr><th>Employee Billy-Bob Toe</th><th> </th>',
       'Summary</h1><table border=\"1\"><tr><th>Employee</th><th>999</th>',
       '<th>Employee Prince </th><th> 123</th>',
       '<th>Employee O</th><th> 1.23 </th>',
       );
     ;;
     my $rx_name = qr{ \S+? (?: \s+ \S+)*? }xms;
     my $rx_th_open  = qr{ \s* < th > \s* }xms;
     my $rx_th_close = qr{ \s* < / th > \s* }xms;
     ;;
     my %per_employee;
     ;;
     LINE:
     for my $line (@lines) {
       my $parsed = my ($name, $amount) = $line =~ m{
           $rx_th_open Employee \s+ ($rx_name) $rx_th_close
           $rx_th_open ($RE{num}{real})? $rx_th_close
           }xms;
       ;;
       if (not $parsed) {
         warn qq{'$line' failed to parse};
         next LINE;
         }
       ;;
       $amount = 'no amount' unless defined $amount;
       $per_employee{$name} = $amount;
       }
     ;;
     dd \%per_employee;
    "
    'Summary</h1><table border="1"><tr><th>Employee</th><th>999</th>' failed to parse at -e line 1.
    {
      "Billy-Bob Toe" => "no amount",
      "Fred D. Poe" => "-5.03",
      "John Doe" => "-0.82",
      O => "1.23",
      Prince => 123,
    }

    (Note that the  $rx_name regex for an actual, human name is very naive. (Update: See off-site Falsehoods Programmers Believe About Names.))

    Update: Significant changes to example code:  $rx_th_open $rx_th_close regexes made more elegant (?); added rudimentary error handling; added corner and error test cases.


    Give a man a fish:  <%-{-{-{-<

Re: HTML::Parser / Regex
by eyepopslikeamosquito (Archbishop) on May 28, 2017 at 06:30 UTC

    What you're attempting as a first program is too tough for a complete beginner IMHO ... So, like kcott, I suggest you read perlintro or some of these Learning Perl links.

    Then write some simpler programs first, to gain some confidence. Feel free to ask more questions if you get stumped. Once you've done that (will probably take a week or two) return to your original problem.

    That said, I can see you're very determined to try to solve your real world problem immediately! If so, try running this simple program:

    use strict;
    use warnings;

    my $ca = "california.html";
    open(my $f1, "<" , $ca) or die "Can't open file '$ca': $!";
    while ( my $line = <$f1> ) {
        print "line: $line";
        if ( $line =~ m{Employee +([^<]+)</th><th>([^<]+)} ) {
            my $name = $1;
            my $two  = $2;
            print "  name='$name' two='$two'\n";
        }
    }
    close ($f1);

    on your original test california.html file:

    </tr></table></body><body bgcolor="black"><h1>
    Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>
    </tr><tr><th>Employee B</th><th>-5.02</th>
    </tr><tr><th>Employee C</th><th>19</th>
    </tr></table></body><body bgcolor="black"><h1>
    Summary</h1><table border="1"><tr><th>Employee A</th><th></th>
    </tr><tr><th>Employee B</th><th></th>
    </tr><tr><th>Employee C</th><th></th>

    which should produce the following output:

    line: </tr></table></body><body bgcolor="black"><h1>
    line: Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>
      name='A' two='-0.82'
    line: </tr><tr><th>Employee B</th><th>-5.02</th>
      name='B' two='-5.02'
    line: </tr><tr><th>Employee C</th><th>19</th>
      name='C' two='19'
    line: </tr></table></body><body bgcolor="black"><h1>
    line: Summary</h1><table border="1"><tr><th>Employee A</th><th></th>
    line: </tr><tr><th>Employee B</th><th></th>
    line: </tr><tr><th>Employee C</th><th></th>

    Now, take the time to understand how the above program works by reading the introductory Perl links above. Feel free to ask any questions about it.

    Please note that I am NOT endorsing the above program as a sound way to solve your real world problem. It is just a simple program, directly related to your real world problem, to help motivate you to learn some Perl basics. For a sound solution to your problem, I suspect HTML-Parser is the way to go.
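
To keep the values for later use rather than just printing them, the same loop could store each match in a hash. The sketch below is an illustration in the same spirit as the simple program above, not a sound HTML solution: collect_values is a hypothetical helper, an in-memory filehandle stands in for california.html, and skipping empty cells (so that a later empty table does not overwrite an earlier value) is a design assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from an open filehandle and return a
# hash reference of employee label => value. Empty cells are skipped.
sub collect_values {
    my ($fh) = @_;
    my %value_for;
    while ( my $line = <$fh> ) {
        # [^<]* (rather than [^<]+) lets the second cell be empty.
        next unless $line =~ m{Employee +([^<]+)</th><th>([^<]*)};
        # Copy the captures at once, before any other match runs.
        my ($name, $value) = ($1, $2);
        next unless $value =~ /\S/;
        $value_for{$name} = $value;
    }
    return \%value_for;
}

# An in-memory filehandle stands in for california.html here.
my $html = <<'END_HTML';
Summary</h1><table border="1"><tr><th>Employee A</th><th>-0.82</th>
</tr><tr><th>Employee B</th><th>-5.02</th>
</tr><tr><th>Employee C</th><th>19</th>
Summary</h1><table border="1"><tr><th>Employee A</th><th></th>
END_HTML
open(my $fh, '<', \$html) or die "Can't open in-memory file: $!";
my $value_for = collect_values($fh);
close($fh);
print "$_ = $value_for->{$_}\n" for sort keys %$value_for;
```

After the loop, the value for employee A is available as $value_for->{A}, and likewise for B and C.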

      Hi eyepopslikeamosquito,

      Thanks for your sample code!

      However, I came across the error "Can't use global $1 in "my"". This isn't the first time I have seen it. I tried googling to get around it, but unfortunately nothing worked.

      As I am still reading the beginners' material, my understanding is that I need $_ or $1 to scan the current line?

      And I figured you are the best person I could ask for advice?!

        "... the error "Can't use global $1 in "my"" ..."

        Somewhere in your code, you have "... my $1 ...". Here's a couple of examples:

        $ perl -e 'my $1;'
        Can't use global $1 in "my" at -e line 1, near "my $1"
        Execution of -e aborted due to compilation errors.
        $ perl -e 'my $1 = 42;'
        Can't use global $1 in "my" at -e line 1, near "my $1 "
        Execution of -e aborted due to compilation errors.

        You can find the full description of the problem from "perldiag - Perl diagnostic messages". Until you're familiar with that document, it can be a bit difficult finding the information. In this instance, you'd need to search for "Can't use global" (not "Can't use global $1"). Doing so, locates this:

        Can't use global %s in "%s"
        (F) You tried to declare a magical variable as a lexical variable. This is not allowed, because the magic can be tied to only one location (namely the global variable) and it would be incredibly confusing to have variables in your program that looked like magical variables but weren't.

        While you're learning, you may find it useful to use the diagnostics pragma. Put this line near the start of your code:

        use diagnostics;

        That will give you a full description, rather than the somewhat terse shortened form.

        Important: That pragma is intended as a developer tool. Do not leave it in production code.

        — Ken
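
As a small illustration of the diagnostic described above: $1 may be read and copied into a lexical of your own, but it may not itself be declared with my. The sketch below shows one way to sidestep $1 entirely by matching in list context. first_number is a hypothetical helper, and the number pattern is a simplifying assumption (signed integers or decimals only).

```perl
use strict;
use warnings;

# my $1;          # compile error: Can't use global $1 in "my"
# my $one = $1;   # fine: $one is the lexical, $1 is only read

# Hypothetical helper: return the first number in a string, or undef.
sub first_number {
    my ($string) = @_;
    # In list context a match returns its captures (an empty list on
    # failure), so a stale $1 from an earlier match is never read.
    my ($number) = $string =~ m{(-?\d+(?:\.\d+)?)};
    return $number;
}

for my $cell ('<th>-0.82</th>', '<th></th>') {
    my $number = first_number($cell);
    print "$cell: ", (defined $number ? $number : 'no number'), "\n";
}
```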

        ... the error "Can't use global $1 in "my""

        What is the specific code that produces this error? If you don't show us the code, we can only make more or less wild guesses. This just wastes our time and yours. Please see How do I post a question effectively? and How (Not) To Ask A Question.

        Update: BTW: I ran the code eyepopslikeamosquito posted here under Perl 5.8.9 and I get the advertised output with no errors or warnings.


        Give a man a fish:  <%-{-{-{-<

Node Type: perlquestion [id://1191319]
Approved by ww
Front-paged by Corion