Re: ...How to parse search engine results fast?

by inman (Curate)
on Feb 03, 2005 at 18:04 UTC


in reply to ...How to parse search engine results fast?

The following example gets information from three sources: Google, MSN and Yahoo!. You would need to create a custom parser for each engine. You may wish to look at HTML::Parser for this.

#!/usr/bin/perl -w
use strict;
use warnings;
use LWP;
use threads;
use Thread::Queue;

my $query = "perl";
my $dataQueue = Thread::Queue->new;
my $threadCount = 0;

while (<DATA>) {
    chomp;
    s/^\s+//;
    s/\s+$//;
    my ($engine, $url) = split /\s+/;
    next unless $url;
    $url .= $query;
    print "$url\n";
    my $thr = threads->new(\&doSearch, $engine, $url);
    $thr->detach;
    $threadCount++;
}

while ($threadCount) {
    my $engine  = $dataQueue->dequeue;
    my $content = $dataQueue->dequeue;
    print "$engine returned: $content\n";
    $threadCount--;
}

print "Parse and return remaining content\n";

sub doSearch {
    my $engine = shift;
    my $url    = shift;

    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/5.0');
    $ua->timeout(10);
    $ua->env_proxy;

    my $response = $ua->get($url);
    if ($response->is_success) {
        $dataQueue->enqueue($engine, $response->content);
    }
    else {
        $dataQueue->enqueue($engine, $response->message);
    }
}

__DATA__
Google http://www.google.com/search?q=
Yahoo! http://search.yahoo.com/search?p=
MSN http://beta.search.msn.co.uk/results.aspx?q=
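
As a rough illustration of the "custom parser for each engine" step, here is a minimal sketch that uses HTML::Parser to pull anchors out of a page. The helper name extract_links and the sample HTML are made up for this example. A real parser would also have to filter on whatever markup each engine wraps around its results, which is not shown here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the original post): return every
# <a href="..."> in $html as a hash ref { href => ..., text => ... }.
sub extract_links {
    my ($html) = @_;
    my @links;
    my $current;    # the anchor currently being collected, if any

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tag: begin a new record when we see <a href=...>.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'a' && defined $attr->{href};
            $current = { href => $attr->{href}, text => '' };
        }, 'tagname, attr' ],
        # Text: append decoded text while inside an anchor.
        text_h => [ sub {
            my ($text) = @_;
            $current->{text} .= $text if $current;
        }, 'dtext' ],
        # End tag: store the finished record on </a>.
        end_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'a' && $current;
            push @links, $current;
            undef $current;
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @links;
}

# Made-up sample input standing in for a fetched result page.
my $sample = '<p><a href="http://www.perl.org/">The Perl <b>Programming</b> Language</a></p>';
for my $link (extract_links($sample)) {
    print "$link->{href} => $link->{text}\n";
}
```

In the script above, something like this could be called on $content after it is dequeued, with one filtering rule per engine.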

Replies are listed 'Best First'.
Re^2: ...How to parse search engine results fast?
by A200560 (Novice) on Feb 03, 2005 at 18:13 UTC
    Do you think that handling 3-4 requests to different search engines with LWP::Parallel would give me any speed benefit?


    V.B.
Re^2: ...How to parse search engine results fast?
by tphyahoo (Vicar) on Mar 03, 2005 at 16:46 UTC
    If I comment out the print line in
    while ($threadCount) {
        my $engine  = $dataQueue->dequeue;
        my $content = $dataQueue->dequeue;
        #print "$engine returned: $content\n";
        $threadCount--;
    }
    I frequently get the error (warning?) "A thread exited while two threads were running".

    I am a thread newbie and don't know why this is happening, nor how "bad" this is, or if it's bad at all.

    You may want to check back at What is the fastest way to download a bunch of web pages? where BrowserUK does something similar which doesn't give this warning. At least, not yet.

    At any rate, thanks for giving me something to get my fingers dirty with in thread world.
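
A likely explanation for that warning: the workers are detached, so once the main thread has dequeued the last result it can reach the end of the program while a worker is still shutting down. Without the print there is less delay, which makes that more likely. One way to avoid it, sketched below under that assumption, is to keep the thread objects and join them instead of detaching. fakeSearch is a made-up stand-in for doSearch that does no network access.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $dataQueue = Thread::Queue->new;

# Hypothetical stand-in for doSearch: it only queues a fake result,
# so the sketch runs without LWP or a network connection.
sub fakeSearch {
    my ($engine) = @_;
    # Both items go in with one enqueue call, so each pair stays together.
    $dataQueue->enqueue($engine, "results for $engine");
}

my @engines = ('Google', 'Yahoo!', 'MSN');

# Keep the thread objects instead of detaching them.
my @workers = map { threads->create(\&fakeSearch, $_) } @engines;

# Wait for every worker to finish before the main thread carries on.
$_->join for @workers;

# All results are queued by now, so the non-blocking dequeue is enough.
my %result;
while (defined(my $engine = $dataQueue->dequeue_nb)) {
    $result{$engine} = $dataQueue->dequeue_nb;
}

print "$_ returned: $result{$_}\n" for sort keys %result;
```

The trade-off is that nothing is processed until the slowest request has finished. Dequeuing results as they arrive and joining the workers afterwards would keep the original behaviour and should still avoid the warning.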

Re^2: ...How to parse search engine results fast?
by A200560 (Novice) on Jan 11, 2006 at 16:43 UTC
    inman, can you send me your private e-mail address? Thanks.
