Re^7: collect data from web pages and insert into mysql

by wfsp (Abbot)
on Aug 02, 2010 at 14:23 UTC


in reply to Re^6: collect data from web pages and insert into mysql
in thread collect data from web pages and insert into mysql

When you go to the first page, the top right hand side says "PAGE 2 >". When you click on that you're on page 2. Then the top right hand side says "PAGE 3 >". On page 3 there is nothing (there isn't a next page).

What that sub (get_next_page) does is check whether there is a link to a next page. If there is, it returns the page number and that is the page that is processed next. If there isn't, it returns undef and that exits you out of the

while ($page){
loop. With hindsight I should have called the sub get_next_page_number, because that is what it does (it doesn't load the page).
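
For reference, here is a minimal sketch of how such a sub might look. The real sub isn't reproduced in this note, so the exact "PAGE n >" link text it matches on is an assumption:

sub get_next_page {
    my ($t) = @_;    # the HTML::TreeBuilder tree for the current page
    # look for an anchor whose text reads like "PAGE 3 >"
    for my $anchor ($t->look_down(_tag => q{a})) {
        return $1 if $anchor->as_text =~ /PAGE\s+(\d+)\s*>/i;
    }
    return;          # no next-page link: undef ends the while loop
}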

The sub (get_sids) returns a list of all the sids. I reckon it would be simplest to do that and then decide which ones you want. grep might help with that. A tab delimited record sounds as though it would do fine.
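
For instance, something along these lines; the cutoff test and the field order are only guesses at what you'd want:

my @sids   = get_sids($url, $pid);
my @wanted = grep { $_ > $lproc } @sids;   # keep only sids newer than the last processed one
print join(qq{\t}, $pid, @wanted), qq{\n}; # one tab delimited record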

By the way, there are, in this case, three calls to the website. So you have to give it a moment to finish.

Let us know how you get on.

Replies are listed 'Best First'.
Re^8: collect data from web pages and insert into mysql
by SteinerKD (Acolyte) on Aug 02, 2010 at 14:42 UTC

    Ah! Now I get it, smart thinking! So that solves condition 1 then.

    As for condition 2, I don't think first fetching everything and then sorting it is the way to go, as people can rack up many hundreds of sorties in quite a short time, which means processing dozens of pages even if the lproc was on page 1.

      Adjust the way the get_sids() sub is called:

      my $lproc = 621557;
      my @sids  = get_sids($url, $pid, $lproc);

      Change the sub:

      use LWP::Simple;
      use URI;
      use HTML::TreeBuilder;

      sub get_sids{
          my ($url, $pid, $lproc) = @_;
          my $page = 1;
          my $uri  = URI->new($url);
          my $i    = 0;
          my @sids;
          while ($page){
              # build the uri for this page
              $uri->query_form(page => $page, pid => $pid);
              # get the content, check for success
              my $content = get $uri->as_string;
              die qq{LWP get failed\n} unless $content;
              # build the tree
              my $t = HTML::TreeBuilder->new_from_content($content)
                  or die qq{new from content failed\n};
              # get a list of all anchor tags
              my @anchors = $t->look_down(_tag => q{a})
                  or die qq{no anchors found\n};
              # look at each anchor
              my $more = 1; # flag
              for my $anchor (@anchors){
                  # get the href
                  my $href = $anchor->attr(q{href});
                  if ($href){
                      # test for a sid in the query string
                      my $uri = URI->new($href);
                      my %q   = $uri->query_form;
                      my $sid = $q{sid};
                      next unless $sid;
                      # exit the while loop if it
                      # is the last processed sid
                      $more--, last if $sid == $lproc;
                      # otherwise save it
                      push @sids, $sid;
                  }
              }
              last unless $more;
              # see if there is another page
              $page = get_next_page($t);
              # avoid accidental indefinite loops
              # hammering the server, adjust to suit
              die if $i++ > 5;
          }
          # send 'em back
          return @sids;
      }
      Have a look at the URI docs to see what $uri->query_form does. Very useful.
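
      A quick illustration (the URL is just a placeholder):

      use URI;

      my $uri = URI->new(q{http://www.example.com/sorties});
      $uri->query_form(page => 2, pid => 123);  # build the query string
      print $uri->as_string, qq{\n};            # http://www.example.com/sorties?page=2&pid=123
      my %q = $uri->query_form;                 # and parse one back into a hash
      print qq{$q{page} $q{pid}\n};             # prints: 2 123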

      Update: corrected the sub

        OK, have looked it over and think I understand most of it fairly well now. Adapted it to do the whole list as well as import PID/Lproc for processing.

        There's a bug somewhere though, making it abort if a PID has 0 SIDs. Instead of moving on to the next PID for processing it simply ends.
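
        A hedged guess, based only on the sub posted above: every die in get_sids ends the whole program, so a page that yields no content or no anchors for one PID stops the run rather than skipping to the next PID. Returning whatever has been collected so far would let an outer loop carry on, for example:

        my @anchors = $t->look_down(_tag => q{a});
        return @sids unless @anchors;   # nothing on this page; let the caller move on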

        Here's what I have so far (I added some print stuff so I can see it progressing):
