Wget using backquotes

by Anonymous Monk
on Feb 18, 2007 at 16:10 UTC ( [id://600706] )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am using backquotes (qx) to capture the output of a wget shell command in Perl, but my return value is empty. When I execute the same command from the shell, it works fine. I am not using LWP::Simple because I find wget a lot faster, and I am running this against thousands of URLs:

    $response = `wget --spider -nv $url`;

i.e., check the URL, don't download anything, non-verbose mode. But $response comes back as ''.

I just want to capture the status code from the headers, taking any redirection into consideration. I tried curl, but then I need to parse the results, which is one more time-consuming step. I am just investigating the best way to do this. Any ideas/recommendations?

Also, while I am at it, any idea why I sometimes get a 200 OK status code and other times a 500 Server Error when retrieving the same URL with wget from the shell?

Thanks for your help.

Replies are listed 'Best First'.
Re: Wget using backquotes
by liverpole (Monsignor) on Feb 18, 2007 at 16:29 UTC
    It sounds like you'd be better off using system, perhaps?

    With system, you get the exit status of the executed command, whereas with qx (or "backticks", `...`, which are the same thing as qx) you get the collected standard output of the command.

    Please look at perlop, specifically the section titled "qx/STRING/".  You'll see that qx is the command to use when you need to capture the output of your system command.

    With the system command, on the other hand, you care more about the side effects of running the command (in your case, the wget to fetch webpages), and in the end only want to know whether the command succeeded or failed.
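
    To make the contrast concrete, here's a minimal sketch (assuming $url holds the address being checked; the wget flags are from the original post, and this runs wget twice just to show both styles):

        my $output = `wget --spider -nv $url`;          # qx/backticks: collects STDOUT
        my $rc     = system("wget --spider -nv $url");  # system: returns the exit status
        printf "wget %s (exit code %d)\n",
               $rc == 0 ? "succeeded" : "failed", $? >> 8;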

    As for your "OK" versus "Server Error" status messages -- sorry, but I don't know much about wget.  Try reading its man page, or googling for documentation, and see if there's a verbose flag or something that will produce a more detailed failure message.


    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
Re: Wget using backquotes
by andyford (Curate) on Feb 18, 2007 at 16:30 UTC

    Looks like wget is putting the response code onto STDERR. I did this:

    $response = `wget --spider -nv $url 2>&1`;
    And got the code into $response.
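
    Once the status line is in $response, pulling out the numeric code is one regex away. A rough sketch (the pattern is an assumption about -nv's output format, e.g. "200 OK"):

        my $response = `wget --spider -nv $url 2>&1`;
        if ($response =~ /\b([1-5]\d\d)\b/) {
            print "status for $url: $1\n";
        }
        else {
            warn "no status code found for $url\n";
        }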

    As for the varying return codes, it probably means that the server you're hitting is not able to respond properly every time. Maybe it's overloaded, or purposely throttling you. Could be many things.

    non-Perl: Andy Ford

      Thank you! The 2>&1 did the trick for me!
Re: Wget using backquotes
by zentara (Archbishop) on Feb 18, 2007 at 16:41 UTC
    wget is one of those apps that doesn't play nice with backticks. It writes directly to the tty and sends some messages to STDERR. You can try this with IPC::Open3, but you might need to use IO::Pty.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use IPC::Open3;
    use IO::Select;

    # run wget with separate handles for its STDOUT and STDERR
    my $pid = open3(0, \*READ, \*ERROR,
        "wget --spider -nv http://zentara.net");

    my $sel = IO::Select->new();
    $sel->add(\*READ);
    $sel->add(\*ERROR);

    foreach my $h ($sel->can_read) {
        my $buf = '';
        if ($h eq \*ERROR) {
            sysread(ERROR, $buf, 4096);
            if ($buf) { print "ERROR-> $buf\n" }
        }
        else {
            sysread(READ, $buf, 4096);
            if ($buf) { print "response-> $buf\n" }
        }
    }
    waitpid($pid, 1);
    Output: ERROR-> 200 OK

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: Wget using backquotes
by zentara (Archbishop) on Feb 18, 2007 at 17:28 UTC
    Another IPC idea is to start a bash shell, and send it wget commands and collect output.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use IPC::Open3;

    $| = 1;

    my $pid = open3(\*IN, \*OUT, \*ERR, '/bin/bash');
    my $url = 'wget --spider -nv http://zentara.net';

    # setup a while loop here and feed it urls
    # while (@urls) {
        print IN "$url\n";
        my $buf = '';
        sysread(ERR, $buf, 4096);
        if ($buf) { print "ERROR-> $buf\n" }
        sysread(OUT, $buf, 4096);
        if ($buf) { print "response-> $buf\n" }
    # }
    waitpid($pid, 1);

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: Wget using backquotes
by perrin (Chancellor) on Feb 18, 2007 at 16:56 UTC
    You can get better speed without needing to use wget if you use a faster module than LWP::Simple. Try HTTP::Lite, HTTP::MHTTP, or HTTP::GHTTP.
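
    For example, HTTP::Lite hands you the numeric status code directly as the return value of request(), so there is nothing to parse. A minimal sketch (assumes HTTP::Lite is installed from CPAN; note that, unlike LWP, it does not follow redirects for you):

        use HTTP::Lite;

        my $url  = shift @ARGV;          # URL to check
        my $http = HTTP::Lite->new;
        my $code = $http->request($url)
            or die "unable to connect: $!";
        print "status for $url: $code\n";
        $http->reset;                    # required before reusing the handle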
Re: Wget using backquotes
by hacker (Priest) on Feb 18, 2007 at 16:22 UTC
Re: Wget using backquotes
by pemungkah (Priest) on Feb 18, 2007 at 20:06 UTC
    Since you don't care about the contents of the pages, you should use the HEAD method rather than GET; you won't be transferring the pages at all, just a few tens of bytes in headers.

    Using one of the alternate HTTP modules as well should get you very high speed indeed.
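
    With LWP::UserAgent that's a one-liner per URL, and it follows redirects for HEAD requests by default, so you get the final status code. A sketch (the 10-second timeout is just an example value):

        use LWP::UserAgent;

        my $url = shift @ARGV;
        my $ua  = LWP::UserAgent->new(timeout => 10);
        my $res = $ua->head($url);          # HEAD: headers only, no body transferred
        print $res->code, " for $url\n";    # e.g. 200, 404, 500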

Re: Wget using backquotes
by Anonymous Monk on Feb 18, 2007 at 18:34 UTC
    Thank you all for your excellent answers. Zentara, you're really beyond human ;-} I appreciate the code. Just to be clear, I cannot use "system" because I need to report on the status of the pages I am analyzing. Most of these links are huge PDF files; I am not really interested in the content, just the status header information. MHTTP and GHTTP sound very useful. I will have to benchmark them against wget. Frankly, I'd rather just use Perl. Again, thank you for your help. You guys rock!

      As I previously mentioned... you want IPC::Run or IPC::Open3...

      You do not want to use system here either. Well, OK, you CAN use system in "list mode" here to avoid spawning a shell, but that is NOT what zentara showed you; that approach is unsafe.
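
      For reference, the list form looks like this; each element is passed straight to exec(), so the shell never sees or interpolates $url (a sketch only -- the recommendation above still stands):

          system('wget', '--spider', '-nv', $url) == 0
              or warn "wget failed with exit code ", $? >> 8, "\n";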

      Using "backticks" (otherwise properly known as "accent grave" is deathly unsafe, and you should never use anything of the sort. From the pod:

      IPC::Open3, open3 - open a process for reading, writing, and error handling

      Also, you should be using one of the standard LWP modules here, and catching the response codes that come back, instead of relying on a userland binary (which can easily be faked, opening a hole in your system).

      If you don't value the security of the system, then go ahead and implement the unsafe, incorrect approach.

Re: Wget using backquotes
by Anonymous Monk on Feb 18, 2007 at 22:08 UTC
    I understand the security concerns associated with backticks. But using the LWP modules is so slow that it takes three days to check the links on this site. I also used the head and get binaries that come with libwww-perl, but they are slow as well. I am using head to speed up the process; however, when this fails, I need to use get, excluding the contents, and even that is time-consuming. With wget, on the other hand, it runs like a breeze. wget is already installed on the client's machine and comes with Red Hat. HTTP::MHTTP sounds like it has a lot of potential; however, it requires installing libghttp. I just want to use the most efficient (and, of course, safest) way possible. At this stage, it seems that wget/curl are the best applications optimized for this task. Please correct me if I am wrong. Thank you again.
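
    If it helps, the core Benchmark module makes that comparison straightforward. A rough sketch (assumes HTTP::Lite is installed; http://example.com/ is a placeholder for one of your links):

        use Benchmark qw(timethese);
        use HTTP::Lite;

        my $url = 'http://example.com/';   # substitute a real link

        timethese(50, {
            wget => sub { my $r = `wget --spider -nv $url 2>&1` },
            lite => sub {
                my $http = HTTP::Lite->new;
                $http->request($url);
            },
        });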
