PerlMonks  

Re^5: Combining Excel Parser with Google Scholar Scraper

by kennethk (Abbot)
on Apr 14, 2009 at 16:53 UTC ( [id://757436]=note )


in reply to Re^4: Combining Excel Parser with Google Scholar Scraper
in thread Combining Excel Parser with Google Scholar Scraper

Don't worry about a present lack of experience - only Larry Wall was born with knowledge of Perl. The rest of us are acolytes.

The error being reported means that some code in the Mechanize module is attempting to call a method named url on an undefined value. Since a bug is more likely in this code than in WWW::Mechanize, it implies you are either passing it bad values or calling it incorrectly. My best guess is that the Excel file is misformatted; replicating a parsing issue without the file in question is difficult. Try running the following code and see if the output gives you any indications of what lines in the file may be problematic.

#!/usr/bin/perl
use strict;
use WWW::Mechanize;
use Win32::OLE qw(in with);
use Win32::OLE::Const 'Microsoft Excel';

$Win32::OLE::Warn = 3;    # die on errors...

# get already active Excel application or open new
my $Excel = Win32::OLE->GetActiveObject('Excel.Application')
    || Win32::OLE->new('Excel.Application', 'Quit');

# open Excel file
my $Book = $Excel->Workbooks->Open("C:/Documents and Settings/rto5u/My Documents/CV.xls");

# select worksheet number 1 (you can also select a worksheet by name)
my $Sheet = $Book->Worksheets(1);

foreach my $row (2..4) {
    foreach my $col (1..1) {
        # skip empty cells
        next unless defined $Sheet->Cells($row,$col)->{'Value'};

        my $URL = 'http://scholar.google.com/advanced_scholar_search';
        my $FORM_NAME = 'f';

        #print "Author Name: ";
        #chomp ($AUTHOR = <>);
        my $AUTHOR = "MD Li";
        print "Author Name: $AUTHOR\n";

        #print "Paper Title: ";
        #chomp ($TITLE = <>);
        my $TITLE = $Sheet->Cells($row,$col)->{'Value'};
        print "Paper Title: $TITLE\n";
        #print "$TITLE";
        #my $TITLE = "Region-specific transcriptional response to chronic nicotine in rat brain";

        my $mech = WWW::Mechanize->new(stack_depth=>10);
        $mech->get($URL) || die ("Could not connect to $URL.\n");
        my $res = $mech->submit_form(
            form_name => $FORM_NAME,
            fields    => {
                'num'         => 100,
                'as_epq'      => $TITLE,
                'as_occt'     => 'title',
                'as_sauthors' => $AUTHOR,
                'as_allsubj'  => 'all',
            },
        );

        while ($res && $res->is_success()){
            my $content = $res->content;
            #print $content;
            while ($content =~ /<p class=g>(.*?)<\/font>\s\s\s/gs){
                my $section = $1;
                my $title = "";
                my $citedby = 0;

                # get title
                $title = getTitle($section);
                $title =~ s/<.*?>//g;
                $title =~ s/&hellip;/\.\.\./g;

                # get citedby
                #
                $citedby = getCitedBy($section);
                if ($citedby){
                    print "\"$title\"\nCited by: $citedby\n\n";
                }
            }
            $res = $mech->follow_link( text_regex => qr/Next/i);
        }
    }
}
$Book->Close;

#############################################################################
sub getTitle($){
    my ($section) = @_;
    my $title;
    if ($section =~ /<span class="w">.*?<a href.*?>(.*?)<\/a><\/span>/s){ # papers with a link
        $title = $1;
    }elsif ($section =~ /&nbsp;(.*?)<font size=-1>/s){ # papers w/o a link
        $title = $1;
    }else{
        $title = $1;
    }
    return $title;
}
#----------------------------------------------------------------------------
sub getCitedBy($){
    my ($section) = @_;
    my $citedby;
    if ($section =~ />Cited by (\d+)</s){
        $citedby = $1;
    }
    return $citedby;
}
#----------------------------------------------------------------------------
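To illustrate the error message itself, here is a minimal sketch that needs neither Excel nor a network connection. Demo::Link and find_link are hypothetical stand-ins made up for this example (they are not part of WWW::Mechanize); the only point is that calling a method on the undef returned by a failed lookup produces this kind of message, and that a defined check avoids it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a link object with a url method.
package Demo::Link;
sub new { my ($class, $url) = @_; return bless { url => $url }, $class; }
sub url { my ($self) = @_; return $self->{url}; }

package main;

# Hypothetical lookup: returns the first matching link, or undef if none match.
sub find_link {
    my ($links, $pattern) = @_;
    foreach my $link (@$links) {
        return $link if $link->url =~ $pattern;
    }
    return undef;
}

my @links = ( Demo::Link->new('http://example.com/page2') );

# Unchecked: no link matches /Next/, so the method call is made on undef.
my $missing = find_link(\@links, qr/Next/);
my $error   = '';
eval { $missing->url; 1 } or $error = $@;
print "Unchecked call died with: $error";

# Checked: test for definedness before calling the method.
my $found    = find_link(\@links, qr/page2/);
my $result   = defined $found   ? $found->url   : 'no link found';
my $fallback = defined $missing ? $missing->url : 'no link found';
print "Checked call returned: $result\n";
print "Checked call on missing link returned: $fallback\n";
```

If your own script dies the same way, printing the value you are about to call a method on (or testing it with defined) just before the failing line is usually the quickest way to find which input produced the undef.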

A couple notes on the code:

  1. The lines starting with #! are used to tell Unix-like systems how to interpret the file. They are only meaningful if they are on the first line of a file. The -w switch turns on warnings much like the warnings pragma does, except that -w applies globally while the pragma is lexically scoped.
  2. On your subroutines, you use prototypes, i.e. the ($). A prototype is supposed to tell the Perl interpreter what the argument list looks like. Prototypes are generally not used (see subroutine prototypes still bad?). If you are going to use them, the subroutines must be declared at the top of the file, i.e. before they are called in code. This just involves a copy-paste for you.
  3. The foreach indices on $row and $col may not correspond to the areas of the file you intend to loop over.
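As a small sketch of the prototype point in note 2 (the sub names early and late and the sample data are made up for illustration), the following shows that a ($) prototype only affects calls that are compiled after Perl has seen the declaration. The prototype warning category is switched off around the late call only to keep the "called too early to check prototype" warning out of the output.

```perl
use strict;
use warnings;

# Forward declaration: the ($) prototype is visible to every call below.
sub early($);

my @items = ('a', 'b', 'c');

# Prototype known: ($) imposes scalar context, so @items becomes its count.
my $with = early(@items);

# Prototype not yet seen: @items is flattened and the first element arrives.
my $without;
{
    no warnings 'prototype';
    $without = late(@items);
}

print "declared before the call: $with\n";
print "declared after the call:  $without\n";

sub early($) { my ($arg) = @_; return $arg; }
sub late($)  { my ($arg) = @_; return $arg; }
```

With the forward declaration the first call receives 3 (the element count), while the second receives 'a'. Moving the sub definitions above the calling code, as suggested in the note, has the same effect as the forward declaration.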

If the above does not elucidate your issue, I'll need to see the Excel file in order to debug further.
