

by Amoe (Friar)
on Jan 21, 2002 at 03:10 UTC ( #140263=sourcecode )
Category: Web Stuff
Author/Contact Info Amoe. See pod.

Replacement for the WWW::Search::Google module. I apologise for the scrappiness of the code, but at least it works.

Thanks crazyinsomniac and hacker.

Update 06/03/2002: Surprisingly, this module still works. After all the changes that Google has gone through since I first released it, I would have expected it to break long ago, considering it parses HTML rather than some stable format. There's an interesting story at Slashdot about googling via SOAP; maybe this is the future direction this module could take?

package WWW::Google;
use strict;

# - amoe 20/01/2002
# hackish module to search google programmatically

use LWP::UserAgent;
use HTTP::Request;
use HTML::TokeParser;
use URI::Escape;

# /me apologises in advance

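sub new {
    my $class = shift;
    my $self = bless {}, $class;
    # first argument: name of the search robot (optional)
    my $agent_name = shift
    || "WWW-Google/0.1 ($^O)";
    my $agent = LWP::UserAgent->new;
    $agent->agent($agent_name);    # identify ourselves with the chosen name
    # base URL and script name; the base URL is blank in this listing,
    # so set it with cgiloc() before calling build()
    $self->{cgiloc} = ['', 'search'];
    $self->{place}  = 0;
    $self->{agent}  = $agent;
    # remaining arguments are name-value pairs (cgiloc, place, query)
    while (my ($key, $value) = splice @_, 0, 2) {
        $self->{$key} = $value;
    }
    return $self;
}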

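sub build {
    my $self = shift;
    my @bits = $self->cgiloc;
    # base URL . script name . '?q=' . already-escaped query
    my $query = join('' => shift @bits, shift @bits,
                            '?', 'q=', $self->query);
    # anything left in cgiloc is an extra CGI parameter, e.g. 'start=10'
    if (@bits) {
        $query .= '&' . join('&', @bits);
    }
    my $res = $self->agent->request(HTTP::Request->new(GET => $query));
    my $parsee = HTML::TokeParser->new(\$res->content);
    $self->parsee($parsee);    # store the parser for next_result
    return $res;
}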

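sub next_result {
    my $self = shift;
    my $result = {};
    while (!%$result) {
        # each result is a <p> immediately followed by an <a>
        while (my $tag = $self->parsee->get_tag('p')) {
            my $anchor = $self->parsee->get_tag;
            last unless $anchor;    # ran off the end of the page
            unless ($anchor->[0] eq 'a') {
                next;
            }
            $result->{url}   = $anchor->[1]->{href};
            $result->{title} = $self->parsee->get_trimmed_text('/a');
            return $result;
        }
    } continue {
        # current page exhausted: move on ten results and fetch the next page
        $self->place($self->place + 10);
        $self->cgiloc(($self->cgiloc)[0, 1],
                       'start=' . $self->place);
        $self->build;
    }
}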

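# get/set the query string; the setter URI-escapes its argument
sub query {
    my $self = shift;
    if (@_) {
        $self->{query} = uri_escape(shift);
    } else {
        return $self->{query};
    }
}

# get/set the offset of the first result to fetch
sub place {
    my $self = shift;
    if (@_) {
        $self->{place} = shift;
    } else {
        return $self->{place};
    }
}

# get/set the list (base URL, script name, extra CGI parameters...)
sub cgiloc {
    my $self = shift;
    if (@_) {
        $self->{cgiloc} = [@_];
    } else {
        return @{$self->{cgiloc}};
    }
}

# get/set the HTML::TokeParser object for the current page
sub parsee {
    my $self = shift;
    if (@_) {
        $self->{parsee} = shift;
    } else {
        return $self->{parsee};
    }
}

sub agent { shift->{agent} }

1;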




=head1 NAME

WWW::Google - Temporary replacement for WWW::Search::Google

=head1 SYNOPSIS

 use WWW::Google;

 my $search = WWW::Google->new;

 # build up query in $q

 $search->query($q);
 $search->build;

 while (my $res = $search->next_result) {
     print $res->{url}, ': ', $res->{title}, "\n";
 }

 $search->cgiloc('', 'search');    # use german google
 $search->place(50);    # start at result 50

=head1 DESCRIPTION

This module uses the search engine Google to find websites related to a
particular term. The C<WWW::Search> modules are supposed to do this, but it
seems none of them work properly. So I decided to code up a hackish
replacement to use in the meantime. And here it is. And here are its methods:

=over 4

=item new

Returns a C<WWW::Google> object. Takes the name of the search robot as the
first argument, followed by an optional list of name-value pairs to set the
object up. Possible names are cgiloc, place and query, all of which perform
basically the same task as the methods of the same names, with one exception:
query strings are autoescaped in the C<query> method, whereas they're passed
in raw if you use the C<new> interface. Note that cgiloc must be given as an
array reference here, since the pairs are stored in the object as-is.

=item build

Gets a query page and sets it up for parsing. It takes no arguments, and must
be called before C<next_result> is.

=item query

Sets the query for the object to use when C<build> gets called. If called
without an argument, returns the current query string. Queries are
automatically escaped.

=item place

The result offset to start the search at. By default, it starts at the
first page of results, i.e. C<0>. Multiples of ten are probably best.

=item cgiloc

Specify a different location for C<build> to get the query result from. Can be
used to specify national variants of Google, presuming they use the same HTML
format as the default one. This is experimental.

=item next_result

Returns a reference to a hash containing two keys, C<url> and C<title>, which
contain the path to the search result and the title of the search result. This
is what you use to get the search results. If you use this in a loop, it will
probably turn infinite because of the sheer amount of search results. You'll
have to exit it early with a C<last> or something once you hit your desired
amount of results.

=back

=head1 NOTES


This is almost certainly very buggy - it was written in about an hour, but it
does the job. The code looks horrible and probably runs slower than it should.

People will probably be wanting the excerpt of text Google provides. Well, I
found it was pretty hard to parse this - the problem being that some sites have
categories and some don't, so how can you judge where the text ends? Well, you
can, but I couldn't be bothered at the time. I will get around to it.

=head1 AUTHOR

Amoe. Thanks to crazyinsomniac and hacker.

=head1 CONTACT

Amoe on

or email C<subvert underscore you at hotmail dot com>.

The website will be at

if I ever get it up.


=head1 COPYRIGHT

Free (substandard) software, daddy.

This program is free software. You may copy or
redistribute it under the same terms as Perl itself.

=cut
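
For anyone wondering how fragile the scraping is, here is a minimal sketch of the token walk that next_result performs: find a <p>, check that the very next tag is an <a>, then take its href and link text. The HTML fragment is made up for illustration and is not Google's actual markup, and extract_results is a hypothetical helper rather than part of the module.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up fragment for illustration only; real result pages differ.
my $html = <<'HTML';
<p><a href="http://example.com/one">First result</a><br>some excerpt</p>
<p><b>not a link</b></p>
<p><a href="http://example.com/two">Second result</a></p>
HTML

# Hypothetical helper: collect every link that directly follows a <p> tag.
sub extract_results {
    my ($content) = @_;
    my $parser = HTML::TokeParser->new(\$content);
    my @results;
    while ($parser->get_tag('p')) {
        my $next = $parser->get_tag or last;   # end of document
        next unless $next->[0] eq 'a';         # <p> not followed by a link
        push @results, {
            url   => $next->[1]{href},
            title => $parser->get_trimmed_text('/a'),
        };
    }
    return @results;
}

print "$_->{url}: $_->{title}\n" for extract_results($html);
```

Any change to the page layout (say, results no longer wrapped in <p>) makes the walk return nothing, which is why a structured format would be the safer thing to parse.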

Re: WWW::Google
by IlyaM (Parson) on Jan 21, 2002 at 03:37 UTC
    IIRC Google can return search results in XML. It could be slightly easier and less error-prone to parse that than to parse HTML, which can change any day.

    Ilya Martynov

      Definitely, it would be preferable to do that. I put a little research into the topic and couldn't find anything, though I mostly only searched their "Services" page.

      That would be much better, I could solve some parsing problems...*checks*

      my one true love
        This PDF file mentions that you can use HTTP requests like
        to get search results in XML.

        But I've just checked it again and it seems it doesn't work anymore :(

        Ilya Martynov

Re: WWW::Google
by Amoe (Friar) on Mar 19, 2002 at 16:33 UTC

    Having seen hossman's enlightening node, I suppose I'd better disclaim this module. It probably goes without saying that I didn't read the Google TOS. I agree with hossman on this issue; the existence of this code isn't a violation of the TOS. If you're paranoid, you may wish to change the user-agent string it sends. Use at your own risk, and stuff.
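
    As a rough sketch of what changing the user-agent string involves: the module's constructor takes the robot name as its first argument, so WWW::Google->new($agent_name) is the way in. Underneath, that name ends up on an LWP::UserAgent object, which is what the snippet below shows on its own. The name my-search-script is a made-up placeholder, not anything the module ships with.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical identifier; pick something that describes your own script.
my $agent_name = "my-search-script/0.1 ($^O)";

my $ua = LWP::UserAgent->new;
$ua->agent($agent_name);    # replaces the default libwww-perl string

# With no argument, agent() returns the string that will be sent.
print $ua->agent, "\n";
```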

    And as for the TOS itself, if it wasn't serious it would be funny. For those worried, I think there isn't much chance that Google will sue you for using this module. I think that the TOS itself is overly harsh; as hossman noted, you could call all sorts of things automated searching. For example, I search by typing a phrase into my location bar in Mozilla and clicking a search button. Because I didn't enter the phrase in the textbox, does that make it illegal?

    As ever, IANAL. God, I hate the legal system.

    my one true love
