Re^5: collect data from web pages and insert into mysql

by wfsp (Abbot)
on Aug 01, 2010 at 15:03 UTC


in reply to Re^4: collect data from web pages and insert into mysql
in thread collect data from web pages and insert into mysql

One step at a time.

This will get a list of sid numbers from all the pages available.

#! /usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TreeBuilder;
use LWP::Simple;
use URI;

my $url = q{http://csr.wwiionline.com/scripts/services/persona/sorties.jsp};
my $pid = 173384;

my @sids = get_sids($url, $pid);
die qq{no sids found\n} unless @sids;
print Dumper \@sids;

sub get_sids{
  my ($url, $pid) = @_;

  my $page = 1;
  my $uri  = URI->new($url);
  my ($i, @sids);

  while ($page){

    # build the uri
    $uri->query_form(page => $page, pid => $pid);

    # get the content, check for success
    my $content = get $uri->as_string;
    die qq{LWP get failed: $!\n} unless $content;

    # build the tree
    my $t = HTML::TreeBuilder->new_from_content($content)
      or die qq{new from content failed: $!\n};

    # get a list of all anchor tags
    my @anchors = $t->look_down(_tag => q{a})
      or die qq{no anchors found\n};

    # look at each anchor
    for my $anchor (@anchors){

      # get the href
      my $href = $anchor->attr(q{href});

      if ($href){

        # test for a sid in the query fragment
        my $uri = URI->new($href);
        my %q   = $uri->query_form;

        # save it if it is there
        push @sids, $q{sid} if exists $q{sid};
      }
    }

    # see if there is another page
    $page = get_next_page($t);

    # avoid accidental indefinite loops
    # hammering the server, adjust to suit
    die if $i++ > 5;
  }

  # send 'em back
  return @sids;
}

sub get_next_page{
  my ($t) = @_;

  # we want table 9
  my @tables = $t->look_down(_tag => q{table});
  my $table  = $tables[8];

  # first row
  my @trs = $table->look_down(_tag => q{tr});
  my $tr  = $trs[0];

  # second column
  my @tds = $tr->look_down(_tag => q{td});
  my $td  = $tds[1];

  # get any text
  my $page_number_txt = $td->as_text;

  # and test if it is a page number
  # will be undef otherwise
  my ($page) = $page_number_txt =~ /PAGE (\d) >/;

  return $page;
}
Some points to note:

It uses HTML::TreeBuilder to parse the HTML. I find it easier than using regexes. There are many parsers available and monks have their preferences; I've settled on this one and have got used to it.
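
If it helps, here is a minimal, self-contained sketch of the parse/search/extract cycle the script relies on (the HTML fragment is made up for illustration):

use strict;
use warnings;
use HTML::TreeBuilder;

# a made-up fragment, just to show the idea
my $html = q{<p><a href="sorties.jsp?sid=12345">sortie</a></p>};

my $t = HTML::TreeBuilder->new_from_content($html);

# look_down returns every element matching the criteria
for my $anchor ($t->look_down(_tag => q{a})){
  print $anchor->attr(q{href}), "\n";   # sorties.jsp?sid=12345
  print $anchor->as_text, "\n";         # sortie
}

$t->delete;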

It also uses URI to construct/parse URIs. Could be overkill in this case but if someone else has done all the work I'm happy to take advantage. :-)
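
For what it's worth, here's what it buys us in both directions (the numbers are only examples):

use strict;
use warnings;
use URI;

# building a query string from key/value pairs
my $uri = URI->new(q{http://csr.wwiionline.com/scripts/services/persona/sorties.jsp});
$uri->query_form(page => 2, pid => 173384);
print $uri->as_string, "\n";
# http://csr.wwiionline.com/scripts/services/persona/sorties.jsp?page=2&pid=173384

# and parsing one back out of an href
my %q = URI->new(q{sorties.jsp?pid=173384&sid=12345})->query_form;
print $q{sid}, "\n";   # 12345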

And all those 'q's? They're alternatives to single and double quote marks (there are some others too). You don't have to use them; again, it's a preference. I started using them for the very scientific reason that my code highlighter is particularly bad at handling single and double quotes. :-)
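
In other words, these pairs behave the same:

my $plain  = 'no interpolation here';
my $same   = q{no interpolation here};    # identical to the line above

my $name   = q{wfsp};
my $double = "hello, $name";
my $also   = qq{hello, $name};            # qq{} interpolates, just like ""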

If you download it, first see if it compiles. Then see if it runs. If the output is not as expected, make a note of what Perl says about the matter and post it here. If all goes fine, let us know the next step.

Fingers crossed.

Replies are listed 'Best First'.
Re^6: collect data from web pages and insert into mysql
by SteinerKD (Acolyte) on Aug 02, 2010 at 14:02 UTC

    I haven't tried it yet but saw some things I'd like to comment on (if I've understood the code correctly; this new one was a bit above my level).

    sub get_next_page{
      my ($t) = @_;

      # we want table 9
      my @tables = $t->look_down(_tag => q{table});
      my $table  = $tables[8];

      # first row
      my @trs = $table->look_down(_tag => q{tr});
      my $tr  = $trs[0];

      # second column
      my @tds = $tr->look_down(_tag => q{td});
      my $td  = $tds[1];

      # get any text
      my $page_number_txt = $td->as_text;

      # and test if it is a page number
      # will be undef otherwise
      my ($page) = $page_number_txt =~ /PAGE (\d) >/;

      return $page;
    }

    If I understand correctly, you load the next page, go through the source to a particular spot on the page and look at the page number? This will fail for my scenario, as the server keeps serving whatever page number you request even if it contains no data, so no matter what page number you enter it will give you a valid answer. I just used last if $content =~ /No sorties/; which seems to do the trick.
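
    If that text really does appear on every empty page, the check would slot into the while loop right after the page is fetched, something like:

    # get the content, check for success
    my $content = get $uri->as_string;
    die qq{LWP get failed: $!\n} unless $content;

    # the server happily serves any page number, so bail out
    # as soon as the page says there's nothing on it
    last if $content =~ /No sorties/;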

    if ($href){

      # test for a sid in the query fragment
      my $uri = URI->new($href);
      my %q   = $uri->query_form;

      # save it if it is there
      push @sids, $q{sid} if exists $q{sid};
    }

    I guess this would be the perfect place for the second loop-exit condition: we want to stop processing sids when we find the last one previously processed ($lproc).
    This variable needs to be read from the pid list file as well (not sure which delimiter is best, tab or semicolon?); instead of one number per line as now, the actual DB export will contain two numbers per line (pid and lproc).
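
    Something along these lines might do it. This is only a sketch; the file name (pids.txt), the tab layout and the assumption that the newest sorties are listed first are all mine:

    # read pid / last-processed-sid pairs, one tab separated pair per line
    open my $fh, '<', 'pids.txt' or die qq{can't open pids.txt: $!\n};
    my %lproc;
    while (my $line = <$fh>){
      chomp $line;
      my ($pid, $last_sid) = split /\t/, $line;
      $lproc{$pid} = $last_sid;
    }
    close $fh;

    # then, inside get_sids, the anchor loop could stop at the last sid
    # we already have and skip any further pages
    my $done;
    for my $anchor (@anchors){
      my $href = $anchor->attr(q{href}) or next;
      my %q = URI->new($href)->query_form;
      next unless exists $q{sid};
      if ($q{sid} == $lproc{$pid}){
        $done = 1;          # reached the last one already processed
        last;
      }
      push @sids, $q{sid};
    }
    $page = $done ? undef : get_next_page($t);   # don't fetch further pages either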

    Q: Where do the sids end up?

    Going to try it now and will be back with more comments. I really appreciate your help with this!

      When you go to the first page the top right side says, "PAGE 2 >". When you click on that you're on page 2. Then the top right hand side says, "PAGE 3 >". On page three there is nothing (there isn't a next page).

      What that sub (get_next_page) does is check whether there is a link to a next page. If there is, it returns the page number and that is the page processed next. If there isn't a page number, it returns undef and that exits you out of the

      while ($page){

      loop. With hindsight I should have called the sub get_next_page_number, because that is what it is doing (it's not loading the page).

      The sub (get_sids) returns a list of all the sids. I reckon it would be simplest to do that and then decide which ones you want. grep might help with that. A tab delimited record sounds as though it would do fine.
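
      For example, if the sids turn out to be plain increasing numbers, grep could pick out just the unprocessed ones (the values below are made up):

      # made-up values, just to show the filtering
      my $lproc = 852000;
      my @sids  = (852345, 852201, 851990, 851500);

      # keep only the sids newer than the last one processed
      my @new_sids = grep { $_ > $lproc } @sids;

      # a tab delimited record: pid, then the newest sid we now hold
      print join("\t", 173384, $new_sids[0]), "\n" if @new_sids;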

      By the way, there are, in this case, three calls to the website. So you have to give it a moment to finish.

      Let us know how you get on.

        Ah! Now I get it, smart thinking! So that solves condition 1 then.

        As for condition 2, I don't think fetching everything first and then sorting it out is the way to go, as people can rack up many hundreds of sorties in quite a short time, which means processing dozens of pages even if the lproc was on page 1.
