PerlMonks  

using mechanize to expand a collapsed menu structure.

by Anonymous Monk
on Jul 04, 2006 at 15:09 UTC ( [id://559192]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks. Can I request your help? I'm trying to use WWW::Mechanize to click every link on an HTML page. There is a collapsed menu on the page. Clicking on a link opens a node, which reveals more collapsed nodes. There are thousands of them. My code is below. It's the first time I've used WWW::Mechanize and HTML::TokeParser.

I believe they work as follows. It's like a factory: the while loop powers the conveyor belt.

The HTML is examined one token at a time; get_token() does this. get_tag() can be used when you want only a certain type of tag.

It's similar to a person examining products on a factory production line.

After being examined, the tokens are either thrown away or acted on. The if clauses within the while loop find the tokens we want and say how to act.

The original HTML is stored in a string; this is kept and not modified.

It's not possible to jump to a certain place in the string, just as it's not possible to jump to a certain place on a VHS cassette.
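A minimal sketch of the token-stream model described above, using HTML::TokeParser on an inline string. The sample HTML and the variable names are made up for illustration and are not taken from the real site. It shows get_token() walking every token, and get_tag('a') skipping straight to anchor tags on a second, fresh parser (the stream only moves forward, so starting over means creating a new parser).

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page; the real menu markup will differ.
my $html = <<'HTML';
<html><body>
<!-- begin title -->
<a href="SiteManager?fnno=100">Open</a>
<a href="about.html">About</a>
</body></html>
HTML

# Passing a scalar reference parses the string itself, not a file name.
my $stream = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my (@comments, @hrefs);
while (my $token = $stream->get_token) {
    my $type = $token->[0];
    if ($type eq 'C') {
        # Comment token: ["C", $text], where $text includes the delimiters.
        push @comments, $token->[1];
    }
    elsif ($type eq 'S' && $token->[1] eq 'a') {
        # Start-tag token: ["S", $tag, $attr, $attrseq, $text].
        my $href = $token->[2]{href};
        push @hrefs, $href if defined $href;
    }
}

# get_tag('a') discards everything up to the next <a> start tag and
# returns [$tag, $attr, $attrseq, $text].
my $links_only = HTML::TokeParser->new(\$html) or die "Cannot create parser";
my @via_get_tag;
while (my $tag = $links_only->get_tag('a')) {
    push @via_get_tag, $tag->[1]{href};
}

print "comment: $_\n" for @comments;
print "href:    $_\n" for @hrefs;
```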

How can I output the final page with all the menu nodes expanded?

Anyway this is what I want to do.

#1 Search through the HTML until the comment <!-- begin title --->

#2 Find the next A HREF link.

#3 Click the link it is associated with (this will expand a menu option).

#4 Reload the page (wait while this happens).

#5 If there are more links left, repeat steps 1-4.

#6 If there are no more closed nodes, print the HTML to a file and exit the function.

My code is below. The problem occurs when trying to follow the link using $agent->get($a_href);. I'm getting an error:

main::searchHTML() called too early to check prototype at sitemech1.pl line 75.

I've no idea what this means. Should I be using a sleep function to wait for the page to reload?

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTML::TokeParser;

my $login_un = "xxxxxxxx";
my $login_pwd ="yyyyyyyy";

my $agent = WWW::Mechanize->new();
$agent->get("http://somedomain.com");
$agent->form(1);
$agent->field("login_un", $login_un);
$agent->field("login_pwd", $login_pwd);
$agent->click();

searchHTML();

#my $stream = HTML::TokeParser->new("source.html")|| die "Can't open: $!";
#
# <IMG SRC="/T4SiteManager/images/explore-another-item.gif" HEIGHT="21" WIDTH="15" VSPACE="0" HSPACE="0" ALT=""><A HREF="SiteManager?ctfn=hierarchy&fnno=100&nOP=1257&oH=hierarchy&oF=0">
#<IMG SRC="/T4SiteManager/images/explore-node-closed.gif" width="15" height="21" border="0" vspace="0" hspace="0" alt="Open">
#</A>
#

sub searchHTML(){
    my $stream = HTML::TokeParser->new(\$agent->{content});

    while (my $token = $stream ->get_token) {

        # start searching from <!-- begin title -->
        if($token->[0] eq "C") # start tag?
        {
            my $comment = $token->[1];
            #print ("\n\nFound a comment $comment\n\n" );
            if ($comment eq "<!-- begin title -->") {
                print("FOUND $comment");
            };
        }

        ### search the A tags
        my $ttype = shift @{ $token };
        if($ttype eq "S") # start tag?
        {
            my($tag, $attr, $attrseq, $rawtxt) = @{ $token };
            if($tag eq "a") {
                my $a_href = $attr->{'href'};
                if ($a_href =~ m/fnno/) { #this filters the correct links
                    print("link found: $a_href \n\n");
                    $agent->get($a_href);
                    #searchHTML();
                };
            }
        }
        ### end searching the A tags
    }
    print("All finished\n");
} # close searchHTML sub

############# comments ####################
#1 search through html until the comment <!-- begin title --->
#2 find the next A href link
#3 click the link it is associated with (this will expand a menu option )
#4 reload the page showing the expanded menu option (wait while this happens
#5 if there are more links left then repeat the steps 1 -4 .
#6 if there are no more nodes closed then print html to a file and exit function.
############## end comments ##################

Edited by planetscape - removed sensitive information from script
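On the "called too early to check prototype" message: it is a compile-time warning and has nothing to do with timing, so sleep would not change it (get() in WWW::Mechanize returns only once the response has arrived). Perl issues it when a sub declared with a prototype is called before the parser has seen that declaration, and the empty parentheses in sub searchHTML(){ count as a prototype. The sketch below reproduces the warning and shows that it goes away without the parentheses. The sub names early and no_proto are hypothetical, and the string evals exist only so the warnings can be captured.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

# Case 1: the definition carries a "()" prototype and the call comes first.
eval q{
    early();
    sub early() { return 1 }
    1;
} or die $@;
my $with_proto = grep { /called too early to check prototype/ } @warnings;

# Case 2: same layout, but no prototype, so there is nothing to check.
@warnings = ();
eval q{
    no_proto();
    sub no_proto { return 1 }
    1;
} or die $@;
my $without_proto = grep { /called too early to check prototype/ } @warnings;

print "with prototype:    $with_proto warning(s)\n";
print "without prototype: $without_proto warning(s)\n";
```

Applied to the script above, that would mean writing sub searchHTML { ... } rather than sub searchHTML(){ ... }, or defining the sub before the first call.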


Replies are listed 'Best First'.
Re: using mechanize to expand a collapsed menu structure.
by marto (Cardinal) on Jul 04, 2006 at 15:42 UTC
    Anonymous Monk,

    Firstly, I hope that the logon details you have given are fake. Secondly, how are these links generated? Is this a dynamic menu (driven by a database), or are all of the links in one HTML page? By this I mean, is this some kind of collapsible menu (CSS or whatever)? The $mech->find_all_links() method may be of interest to you: find all of the links, click them, then examine $mech->content. If in doubt, have a look at the documentation. Also, please read the PerlMonks FAQ and How do I post a question effectively? if you have not done so already.

    Hope this helps.

    Martin
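A rough sketch of the find_all_links() suggestion above. The helper name matching_links is hypothetical, and the qr/fnno/ filter is borrowed from the question's own regex. The sketch only collects the matching URLs from whatever page the Mechanize object currently holds; what to do with them depends on how the site builds its menu.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Return the absolute URLs of all <a> links on the page that $mech
# currently holds whose URL matches $pattern.
sub matching_links {
    my ($mech, $pattern) = @_;
    return map { $_->url_abs->as_string }
           $mech->find_all_links(tag => 'a', url_regex => $pattern);
}

# Possible use after logging in (needs the real site, so left commented out):
# my $mech = WWW::Mechanize->new;
# $mech->get('http://somedomain.com');   # placeholder URL from the question
# print "$_\n" for matching_links($mech, qr/fnno/);
```

Each element returned by find_all_links() is a WWW::Mechanize::Link object, so $_->text and $_->url are also available if the link label or the raw href is needed.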
Re: using mechanize to expand a collapsed menu structure.
by Ieronim (Friar) on Jul 04, 2006 at 16:29 UTC
    Study the HTML code of the page you are trying to process. The collapsed menu can be based on JavaScript. In this case you don't need to use click() at all, as all links are already present in the page's HTML!
      I agree.

      There are really two possibilities here.

      One, the page actually contains all the links and some of them are hidden from modern GUI browsers using CSS or JavaScript.

      Two, the page actually reloads when you click the links, and displays links using some kind of CGI script, based on the URL/query string.

      If it's One, then all the links are in the page and you just need to find them. If it's Two, then it's not really "a page" at all, but a series of pages and you might as well treat them as such:

      @pages=('domain.com/script?shown=foo', 'domain.com/script?shown=bar'); # etc
      get_links_from(@pages);

      With, of course, some kind of test to make sure you don't fetch any link more than once.



      ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
      =~y~b-v~a-z~s; print
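The "don't fetch any link more than once" test mentioned above could look roughly like this. It is a sketch only: crawl_once and the $links_for callback are hypothetical names, and the %site hash stands in for real fetching so the example runs without a network. With WWW::Mechanize, the callback would fetch the URL and return the links found on that page.

```perl
use strict;
use warnings;

# Visit every page reachable from $start exactly once.
# $links_for is a code ref that takes a URL and returns the URLs it links to.
sub crawl_once {
    my ($start, $links_for) = @_;
    my %seen;
    my @order;
    my @queue = ($start);
    while (@queue) {
        my $url = shift @queue;
        next if $seen{$url}++;          # skip anything already fetched
        push @order, $url;
        push @queue, $links_for->($url);
    }
    return @order;
}

# Stand-in for the real site: which page links to which.
my %site = (
    'script?shown=top' => ['script?shown=foo', 'script?shown=bar'],
    'script?shown=foo' => ['script?shown=bar', 'script?shown=top'],
    'script?shown=bar' => [],
);

my @visited = crawl_once('script?shown=top', sub { @{ $site{ $_[0] } || [] } });
print "$_\n" for @visited;
```

The %seen hash is what stops the loop: pages link back to each other, but each URL is processed only the first time it comes off the queue.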
Re: using mechanize to expand a collapsed menu structure.
by shonorio (Hermit) on Jul 04, 2006 at 16:07 UTC
    If you are working in a Win32 environment, take a look at SAMIE or Win32::IE::Mechanize.

    Solli Moreira Honorio
    Sao Paulo - Brazil
