PerlMonks  

Parsing Forms with selections

by Helter (Chaplain)
on Sep 10, 2002 at 20:52 UTC ( [id://196812]=perlquestion )

Helter has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse some HTML I've downloaded from a site I have no control over (so no "just change the site" answers :) and am having trouble parsing the form I'm pasting below.

I am also pasting a test script that has the basic flow of my code. I must be missing something, which is entirely possible considering my newbie-ness to web/HTML/HTTP processing, so be gentle if it turns out I'm doing something really stupid :)

A few notes may be in order. I have changed the HTML to protect the innocent.
The actual page contains 2 forms, which is why my code parses multiple forms. If parsing them one at a time works, I'm all for code that works.
    #!/usr/bin/perl -w
    use HTTP::Request::Common qw(POST GET);
    use HTTP::Cookies;
    use HTTP::Request::Form;
    use LWP::UserAgent;
    use HTML::Form;
    use HTML::TreeBuilder;

    open( TEST, "<test.html") or die "cannot open test.html\n$!";

    #Read in the whole file as a single string. $/ is the delimiter used; use small scope
    $text = do { local $/; <TEST> };
    print $text;

    $tree = HTML::TreeBuilder->new;
    $tree->parse( $text );
    $tree->eof();

    my @requestForms = HTTP::Request::Form->new_many( $tree );
    $tree->delete();

    #just some code to pause on while I play with the debugger.
    foreach $form (@requestForms) {
        print "$form\n"; # hash...awesome!
        #do something
    }

The HTML I'm trying to parse:

    <form name="form1" method="post" action="/foo/bar.php">
      <table width="550" border="0" cellspacing="0" cellpadding="0" align="center">
        <tr>
          <td>
            <div align="center">IMAGE</div>
          </td>
        </tr>
        <tr>
          <td>
            <div align="center">
              <table width="550" border="0" cellspacing="0" cellpadding="0">
                <tr>
                  <td width="275">
                    <div align="center"><span class="plaintextbold">IMAGE</span></div>
                  </td>
                  <td>
                    <div align="center" class="fightdamage">
                      TEXT <br>
                      <hr width="175">
                      <span class="plaintextbold">TEXT: 1 <br> </span>
                      <div align="center"><span class="plaintextbold">TEXT:</span> 14 </div>
                      <div align="center"><span class="plaintextbold">TEXT:</span> 11 </div>
                      <div align="center"><span class="plaintextbold">TEXT:</span> 3 </div>
                      <div align="center"><span class="plaintextbold">TEXT:</span> 3 </div>
                      <div align="center"><span class="plaintextbold">TEXT:</span> 15 </div>
                      <div align="center"><span class="plaintextbold">TEXT:</span> 16 </div>
                      <div align="center"><span class="plaintextbold">TEXT:</span> 16 </div>
                      <div align="center"><span class="plaintextbold">TEXT:</span> 0 </div>
                      <div align="center"><span class="plaintextbold">TEXT:</span> 0 </div>
                      <div align="center"><span class="plaintextbold">TEXT:</span> 31 / 31 </div>
                      <div align="center"><span class="plaintextbold">TEXT:</span> 0 / 0 <br>
                        <hr width="175">
                      </div>
                    </div>
                  </td>
                </tr>
              </table>
            </div>
          </td>
        </tr>
        <tr>
          <td>
            <div align="center"></div>
          </td>
        </tr>
        <tr>
          <td>
            <div align="center">
              <hr width="550">
              <span class="plaintextbold"> <br>
                <select name="bar">
                  <option>OPTION1</option>
                  <option>OPTION2</option>
                  <option>OPTION3</option>
                </select>
                <input type="submit" name="Submit2" value="Go!" class="login">
                <br> <br>
              </span></div>
          </td>
        </tr>
        <tr>
          <td>
            <div align="center">IMAGE</div>
          </td>
        </tr>
        <tr>
          <td>
            <div align="center"><br> </div>
          </td>
        </tr>
      </table>
    </form>
    </div>
    </td>
    </tr>

I'm a debugger kind of guy, so here is the output when I examine the @requestForms array:

      DB<3> x @requestForms
    0  HTTP::Request::Form=HASH(0x864f6b0)
       'allfields' => ARRAY(0x864fb4c)
          0  'bar'
       'base' => undef
       'buttons' => ARRAY(0x8686edc)
          0  'Submit2'
       'buttontypes' => HASH(0x8668c70)
          'Submit2' => ARRAY(0x8650b78)
             0  'submit'
       'buttonvals' => HASH(0x86828a4)
          'Submit2' => ARRAY(0x864f518)
             0  'Go!'
       'checkboxstate' => HASH(0x8682484)
            empty hash
       'debug' => undef
       'fields' => ARRAY(0x864fb34)
          0  'bar'
       'fieldtypes' => HASH(0x866bee0)
          'bar' => 'select'
       'fieldvals' => HASH(0x86805b4)
            empty hash
       'link' => '/foo/bar.php'
       'method' => 'post'
       'name' => 'form1'
       'selections' => HASH(0x8680518)
          'bar' => ARRAY(0x8650ab8)
             0  undef
             1  undef
             2  undef
       'upload' => 0

You may note the undef items; I'm assuming they are supposed to contain OPTION1, OPTION2 and OPTION3?
I'm not sure how the default option is specified.
I'm not sure which function to use to choose one of the options (the field function does not seem to do what I want).
My copy of Perl & LWP should be here any day now.....maybe that will solve everything :)

Thanks!

Replies are listed 'Best First'.
Re: Parsing Forms with selections
by mp (Deacon) on Sep 11, 2002 at 05:02 UTC
    Depending on what you are trying to accomplish with the data you extract, you may want to also look at HTML::Form as an alternative to HTML::TreeBuilder. I have used both and found HTML::Form much easier to work with for both parsing forms from HTML and generating POST data.

    Update: Fixed cut-and-paste error. Second link should have been to HTML::TreeBuilder.
      Maybe I'm missing something obvious here, but what's the difference between the two modules/links you mention?

      rdfield

        Thanks for catching the error. The second link was incorrect. I have corrected the node.
Re: Parsing Forms with selections
by Helter (Chaplain) on Sep 11, 2002 at 10:23 UTC
    I'll try to answer some questions that have come up.
    podmaster, here is the Data::Dumper output of the array:

        $VAR1 = bless( {
            'fieldvals' => {},
            'checkboxstate' => {},
            'base' => undef,
            'buttontypes' => { 'Submit2' => [ 'submit' ] },
            'buttons' => [ 'Submit2' ],
            'upload' => 0,
            'selections' => { 'bar' => [ undef, undef, undef ] },
            'buttonvals' => { 'Submit2' => [ 'Go!' ] },
            'method' => 'post',
            'fieldtypes' => { 'bar' => 'select' },
            'fields' => [ 'bar' ],
            'name' => 'form1',
            'allfields' => [ 'bar' ],
            'debug' => undef,
            'link' => '/foo/bar.php'
        }, 'HTTP::Request::Form' );

    As to what my problem is: the 'selections' field shows 3 undef values. So when I use code like this:

        $response = $userAgent->request( $form->press( "Submit2" ));

    the selection is not filled in, and I do not get the response I need.
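
For anyone hitting the same undef entries: in HTML, an <option> with no value attribute submits its text content, and a single-choice <select> with no option marked selected defaults to its first option. One plausible reading of the dump above (not checked against the HTTP::Request::Form source) is that only the value attribute is being read. The sketch below is a hand-rolled workaround that walks the tree with HTML::TreeBuilder directly; the %choices and %default hashes are names made up for this illustration, and the form is a cut-down copy of the one in the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Cut-down copy of the form from the question.
my $html = q{
<form name="form1" method="post" action="/foo/bar.php">
  <select name="bar">
    <option>OPTION1</option>
    <option>OPTION2</option>
    <option>OPTION3</option>
  </select>
</form>
};

my $tree = HTML::TreeBuilder->new_from_content($html);

my %choices;    # select name => list of values that could be submitted
my %default;    # select name => value submitted if the user changes nothing

for my $select ( $tree->look_down( _tag => 'select' ) ) {
    my $name = $select->attr('name');
    next unless defined $name;
    $choices{$name} ||= [];

    for my $option ( $select->look_down( _tag => 'option' ) ) {
        # Prefer the value attribute; fall back to the option's text.
        my $value = $option->attr('value');
        if ( !defined $value ) {
            $value = $option->as_text;
            $value =~ s/^\s+|\s+$//g;
        }
        push @{ $choices{$name} }, $value;

        # An option carrying the selected attribute becomes the default.
        $default{$name} = $value if defined( $option->attr('selected') );
    }

    # No option marked selected: fall back to the first one.
    $default{$name} = $choices{$name}[0] unless exists $default{$name};
}
$tree->delete;

print "$_: @{ $choices{$_} } (default $default{$_})\n" for sort keys %choices;
```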
    mp and rdfield, I found HTML::Form; before that I was using HTTP::Request::Form. The Data::Dumper output from this code:

        @forms2 = HTML::Form->parse( $text, "here.com");
        $d2 = Data::Dumper->new( \@forms2 );
        print $d2->Dump;

    Gives me:

        $VAR1 = bless( {
            'inputs' => [
                bless( {
                    'seen' => [ 1, 0, 0 ],
                    'menu' => [ 'OPTION1', 'OPTION2', 'OPTION3' ],
                    'current' => 0,
                    'type' => 'option',
                    'name' => 'bar'
                }, 'HTML::Form::ListInput' ),
                bless( {
                    'class' => 'login',
                    'value' => 'Go!',
                    'type' => 'submit',
                    'name' => 'Submit2'
                }, 'HTML::Form::SubmitInput' )
            ],
            'extra_attr' => { 'name' => 'form1' },
            'enctype' => 'application/x-www-form-urlencoded',
            'method' => 'POST',
            'action' => bless( do{\(my $o = '/foo/bar.php')}, 'URI::_foreign' )
        }, 'HTML::Form' );

    Which looks like it parsed it right! I have yet to try it in my actual application, but it looks like it offers the same functionality, without having to build the tree first. Thanks!!!
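
A minimal sketch of how the HTML::Form route might look end to end, assuming a reasonably recent HTML::Form that provides find_input, possible_values, value and click. The form is a cut-down copy of the one in the question, and http://example.com/ is a placeholder base URI, not the real site.

```perl
use strict;
use warnings;
use HTML::Form;

# Cut-down copy of the form from the question.
my $html = q{
<form name="form1" method="post" action="/foo/bar.php">
  <select name="bar">
    <option>OPTION1</option>
    <option>OPTION2</option>
    <option>OPTION3</option>
  </select>
  <input type="submit" name="Submit2" value="Go!" class="login">
</form>
};

# The second argument is the base URI used to resolve the relative action.
# http://example.com/ is a placeholder for illustration only.
my @forms = HTML::Form->parse( $html, 'http://example.com/' );
my $form  = $forms[0];

# List the values the <select> accepts.
my $select  = $form->find_input('bar');
my @choices = $select->possible_values;
print "choices: @choices\n";

# Choose one of the options by value.
$form->value( bar => 'OPTION2' );

# Build the request that pressing the Submit2 button would send.
my $request = $form->click('Submit2');
print $request->method, ' ', $request->uri, "\n";
print $request->content, "\n";
```

The HTTP::Request returned by click can then be handed to LWP::UserAgent, for example $userAgent->request($request), in place of the press call used with HTTP::Request::Form.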

Node Type: perlquestion [id://196812]
Approved by mdillon