PerlMonks  

wget not working from perl

by harangzsolt33 (Chaplain)
on Sep 10, 2019 at 01:57 UTC ( [id://11105920]=perlquestion )

harangzsolt33 has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow Monks, This is a little exercise which I thought was going to be super easy. But my script is not working!! :(

My perl script downloads a web page and saves it in a temporary file in a folder called "WGET" and then reads it and sends it to STDOUT. It seems to work fine when I enter the web address manually thru stdin, but it doesn't do anything if I pass the argument thru the URL online. Why is that???

Test Script online and see the results

To prove that WGET exists on my server, I also created a wget_test.pl script that simply prints the result of wget --help: See this script in action

Here is the script that works only when I run it from my computer:

#!/usr/bin/perl -w
use strict;
use warnings;
#
# This perl script downloads a web page using an application
# called WGET and returns its contents as an encoded file.
# It can be run from command line or from the web:
#
# Usage (from the web): www.something.com/wget.pl?escaped_url
#
# Usage (Command line): wget.pl <URL>
#
####################################################################

my $ROOT = ENV('DOCUMENT_ROOT');
my $INPUT = ENV('QUERY_STRING');
my $UNIQUE = ENV('UNIQUE_ID');
my $ONLINE = length($UNIQUE) ? 1 : 0;

# Only the following characters are allowed in the URL,
# anything else will be rejected:
my $ALLOW = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789:;/+!?&#%=-._';

##############################################################
##############################################################

if ($ONLINE)
{
  print "Content-type: text/javascript\n\n";
  length($INPUT) or EXIT(2, 'No URL specified.');
  $INPUT = substr($INPUT, 0, 1900);
  _isFromCharSet($INPUT, $ALLOW) or EXIT(3, 'Illegal characters found in URL.');
  $INPUT = unescape($INPUT);
  Download($INPUT);
  EXIT(0, 'SUCCESS. Argument was received from URL.');
}

$ROOT = GetPath($0);

if (@ARGV)
{
  if (@ARGV == 1)
  {
    Download($ARGV[0]);
    EXIT(0, 'SUCCESS. Argument was received from command line.');
  }
  else
  {
    PrintUsage();
    EXIT(-1, 'Argument missing.');
  }
}
else
{
  Download(GetArgs());
  EXIT(0, 'SUCCESS. Argument was received from stdin.');
}

################################################################
################################################################
#
# This function asks the user to enter the web address (URL)
# of the web page to download and returns the URL string.
#
sub GetArgs
{
  print "\n\n This Perl script downloads a web page from the internet\n and prints its content to STDOUT.\n\n Enter web address: ";
  return scalar <STDIN>;
}

#################################################################
sub Download
{
  my $URL = defined $_[0] ? $_[0] : '';
  $URL = Trim($URL);
  length($URL) or return;

  my $P = index($URL, '://');
  if ($P < 0 || $P > 10) { $URL = 'http://' . $URL; }
  $URL = '"' . $URL . '"';

  $ROOT = JoinPath(GetPath($0), 'WGET');
  my $FILENAME = JoinPath($ROOT, RandomString(8).'.TXT');
  my $COMMAND = "wget -q -O $FILENAME $URL";

  print "Content-type: text/javascript\n\n";
  print "// Script name: $0\n";
  print "// URL: $URL\n";
  print "// Root dir: $ROOT\n";
  print "// File Name: $FILENAME\n";
  print "// Creating directory: $ROOT\n";
  mkdir $ROOT, 0777;
  print "// Executing: $COMMAND\n";
  print `$COMMAND`;

  my $SIZE = -s $FILENAME;
  print "// File Size: $SIZE bytes\n";
  sysopen(FH, $FILENAME, 0) or EXIT(4, 'Cannot open file for reading.');
  print "// File opened for reading - $FILENAME\n";
  my @DATA = <FH>;
  my $CONTENT = join('', @DATA);
  print "// Read " . length($CONTENT) . " bytes\n";
  print "\nReceiver(\"" . toJStr($CONTENT) . "\");\n\n";
  close FH;
  print "// File was closed.\n";

  if (unlink($FILENAME) == 1)
  {
    print "// File was deleted - $FILENAME\n";
  }
  else
  {
    print "// File could not be deleted - $FILENAME\n";
  }
}

#################################################################
#
# This function receives a binary string and converts it to a
# JavaScript string that can be safely inserted between "..." marks.
#
# Usage: STRING = toJStr(STRING)
#
#
sub toJStr
{
  @_ or return '';
  my $S = shift;
  defined $S or return '';
  my $L = length($S);
  $L or return '';
  my $c;
  my $J = '';
  for (my $i = 0; $i < $L; $i++)
  {
    $c = vec($S, $i, 8);
    if ($c == 9) { $J .= '\t'; next; }
    if ($c == 13) { $J .= '\r'; next; }
    if ($c == 10) { $J .= '\n'; next; }
    if ($c == 60) { $J .= '\x3C'; next; }
    if ($c == 62) { $J .= '\x3E'; next; }
    if ($c == 38) { $J .= '\x26'; next; }
    if ($c == 34) { $J .= '\"'; next; }
    if ($c == 92) { $J .= '\\'; next; }
    if ($c >= 0 && $c <= 7) { $J .= "\\$c"; next; }
    if ($c < 32 || $c > 126) { $J .= '\x' . toHex($c); next; }
    $J .= chr($c);
  }
  return $J;
}

##############################################################
# This function sends an error code to the browser.
# Usage: EXIT(INTEGER, MESSAGE)
sub EXIT
{
  my $ERRCODE = @_ ? shift : 0;
  my $MESSAGE = @_ ? shift : '';
  print "\n";
  if (length($MESSAGE)) { print "// $MESSAGE\n"; }
  print "ERRCODE = $ERRCODE;\n";
  exit;
}

##############################################################
sub PrintUsage
{
  print "\n This Perl script downloads a web page from the internet using a program\n called 'WGET' and prints its content to STDOUT. This script can be called\n from a browser or from command line. Either way it expects one argument,\n the URL address. The URL string should be escaped when used online.\n\n Online Usage: wget.pl?URL\n Command-Line Usage: wget.pl <URL>\n\n";
}

###############################################################
###############################################################

# v2019.09.05 STRING = escape(STRING)
# Converts a binary string to URL-safe string.
sub escape{my$X=defined$_[0]?$_[0]:'';my$Z='';for(my$i=0;$i<length($X);){my$C=vec($X,$i++,8);$Z.=$C==32?'+':$C==96?'%60':$C>44&&$C<58||$C>94&&$C<123||$C>63&&$C<91||$C==42?chr($C):'%'.sprintf('%.02X',$C);}$Z}

# v2019.09.08 STRING = unescape(STRING)
# Converts an URL string to regular binary string. It's the opposite of the escape() function. This function silently ignores errors.
sub unescape{my$X=defined$_[0]?$_[0]:'';$X=~tr|+| |;my$i=index($X,'%')>=0||return$X;my($H,$j,$C,$D)=('0123456789ABCDEF',$i);while($i<length($X)){$C=vec($X,$i++,8);if($C==37){$C=substr($X,$i++,1);length($C)||last;$C=index($H,uc($C));if($C<0){$i--;next;}$D=substr($X,$i++,1);if(length($D)){$D=index($H,uc($D));if($D<0){$i--;}else{$C<<=4;$C+=$D;}}}vec($X,$j++,8)=$C;}substr($X,0,$j)}

# v2019.09.08 INTEGER = Ceil(NUMBER)
# Returns the smallest integer greater than or equal to a number.
sub Ceil{my$N=defined$_[0]?$_[0]:0;my$I=int($N);$N<0?$I:$N-$I==0?$I:$I+1;}

# v2019.09.08 INTEGER = Floor(NUMBER)
# Returns the largest integer less than or equal to a number.
sub Floor{my$N=defined$_[0]?$_[0]:0;my$I=int($N);$N>0?$I:$N-$I==0?$I:$I-1;}

# v2019.08.25 STRING = Trim(STRING)
# Removes whitespace from before and after string and returns a new string.
sub Trim{my$X=defined$_[0]?$_[0]:'';my$L=length($X);my$P=0;while($P<=$L&&vec($X,$P++,8)<33){}for($P--;$P<=$L&&vec($X,$L--,8)<33;){}substr($X,$P,$L-$P+2)}

# v2019.6.15 VALUE = ENV(NAME, [DEFAULT, [OVERRIDE]])
# Returns the named environment variable. Returns "" or DEFAULT if the environment variable doesn't exist. If a third argument is provided, this function will return the value of the third argument ALWAYS without even checking the environment variable.
sub ENV{my$N=defined$_[0]?shift:'';my$D=@_?shift:'';return @_?shift:length($N)&&exists($ENV{$N})?Trim($ENV{$N}):$D;}

# v2019.09.08 STRING = RandomString(LENGTH)
# Creates a random string of letters and numbers.
sub RandomString{defined$_[0]||return'';my$S='';my$L=shift;my$A='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';while(length($S)<$L){vec($S,length($S),8)=vec($A,int(rand(length($A))),8);}$S}

# v2019.09.08 STRING = GetPath(FULL_NAME)
# This function returns the path portion of a full file name without the trailing / or \ character. Example: GetPath($0) returns this perl script's path.
sub GetPath{@_||return'';my$F=shift;$F=~tr#\\#/#;my$P=rindex($F,'/');return($P>0)?substr($F,0,$P):'.';}

# v2019.06.16 STRING = JoinPath(STRING, [STRING], [STRING])
# This function joins two names into a single path by adding / in between the names. It also simplifies the resulting path by removing repeated \\ // characters, and tries to resolve the "." and ".." in a path name to literal names only.
sub JoinPath { @_ or return ''; my $P = join('/', @_); defined $P or return ''; length($P) or return ''; $P = Trim($P); $P =~ tr#\\#/#; if (uc(substr($P, 0, 8)) eq 'FILE:///') { $P = substr($P, 8, length($P)); } $P =~ s|///|/|g; $P =~ s|//|/|g; my $DRIVE = (vec($P, 1, 8) == 58) ? vec($P, 0, 8) & 223 : 0; if ($DRIVE) { $P = substr($P, 2, length($P)); } my $SLASH = (vec($P, 0, 8) == 47) ? 47 : 0; if ($SLASH) { $P = substr($P, 1, length($P)); } my @A = split('/', $P); for (my $i = 0; $i < @A; $i++) { if ($A[$i] eq '.') { splice(@A, $i--, 1); } if ($A[$i] eq '..') { if ($i > 0) { splice(@A, --$i, 2); $i--; } else { splice(@A, $i, 1); $i--; } } } return ($DRIVE ? chr($DRIVE) . ':' : '') . ($SLASH ? '/' : '') . join('/', @A); }

# v2019.08.28 STRING = toHex(INTEGER)
# Converts a small integer to a two-digit hex string.
sub toHex{my$N=defined$_[0]?$_[0]:0;$N>0||return'00';$N<255||return'FF';sprintf('%.02X',$N&255)}

# v2019.6.24 INTEGER = _isFromCharSet(STRING, KNOWN)
# Returns 1 if string is strictly made up of characters listed in string KNOWN. Returns 0 if string contains any "unknown" characters.
sub _isFromCharSet { @_ or return 1; my $S = shift; defined $S or return 1; my $L = length($S); $L or return 1; @_ or return 0; my $K = shift; defined $K or return 0; length($K) or return 0; while ($L--) { index($K, substr($S, $L, 1)) >= 0 or return 0; } return 1; }

Replies are listed 'Best First'.
Re: wget not working from perl
by Tux (Canon) on Sep 10, 2019 at 06:41 UTC
    1. There shall NEVER be an else after a return/exit/croak/die! NEVER!
    2. You call wget from a composed string. If the $url has " in there (or worse, a ; followed by a command as haukex pointed out in Re: wget not working from perl), quotation as you use it might fail:
      my $COMMAND = "wget -q -O $FILENAME $URL";
      print `$COMMAND`;

      --->

      open my $sh, "-|", "wget", "-q", "-O", $FILENAME, $URL;
      print <$sh>;
      close $sh;

      is much safer.

    3. Why do you use all those capitalized variable names?
    4. Why use a shell escape anyway?
      use LWP::Simple;
      is_success (getprint ($URL)) or warn "Fetch failed";

    Enjoy, Have FUN! H.Merijn
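The list-form pipe open from point 2 can be tried without wget installed. The sketch below is only an illustration under stated assumptions: the child process is the running Perl interpreter ($^X) echoing its argument, standing in for wget, and run_capture and $hostile are made-up names that do not appear in the original script.

```perl
use strict;
use warnings;

# Run a command given as a list (no shell involved) and return its STDOUT.
sub run_capture {
    my @cmd = @_;
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    local $/;                       # slurp mode: read everything at once
    my $out = <$fh>;
    close $fh or warn "$cmd[0] exited with status $?\n";
    return defined $out ? $out : '';
}

# A hostile "URL" that would break out of the quotes in a backtick command.
my $hostile = 'example.com";echo INJECTED;"';

# The child receives the string as one argument, metacharacters and all.
my $echoed = run_capture($^X, '-e', 'print $ARGV[0]', $hostile);
print $echoed eq $hostile ? "argument passed through untouched\n"
                          : "argument was altered\n";
```

With wget the call would look something like run_capture('wget', '-q', '-O', '-', $url), where -O - sends the page to STDOUT. Note that the literal " characters the original script wraps around $URL would have to be dropped, since no shell is there to strip them.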
Re: wget not working from perl
by haukex (Archbishop) on Sep 14, 2019 at 08:36 UTC

    I waited with posting this until you'd taken down your script, which it seems you've done now, because it contains at least one classic and major security issue: As I've described in my node here, allowing practically unfiltered user input to be used directly in backticks allows anyone to execute arbitrary shell commands on your server. (Not to mention the fact that this script is basically a proxy open to everyone, which is an issue by itself.)

    For example, a QUERY_STRING of example.com%22%3Bcat+%22%2Fetc%2Fpasswd would have caused the script to execute the shell command wget ... "http://example.com";cat "/etc/passwd". I hope you see the major problem with that or any other arbitrary command.
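To make the example above concrete, here is a small sketch that builds (but never runs) the command string. decode_query is a simplified stand-in for the script's unescape(), and out.txt is a placeholder for the random file name; both are assumptions made for illustration.

```perl
use strict;
use warnings;

# Decode a query string the way a typical unescape routine would:
# '+' becomes a space, %HH becomes the byte with that hex value.
sub decode_query {
    my ($s) = @_;
    $s =~ tr/+/ /;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

my $query = 'example.com%22%3Bcat+%22%2Fetc%2Fpasswd';

# Like Download(): prepend the scheme and wrap the URL in double quotes.
my $url     = '"http://' . decode_query($query) . '"';
my $command = "wget -q -O out.txt $url";    # out.txt is a placeholder name

# Shown only; this string must never reach backticks or a shell.
print "$command\n";
```

Every character of the query string is in the script's $ALLOW list, so the character check does not stop it; the dangerous characters only appear after decoding.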

    I've also commented on your style of reinventing all the wheels before. I don't do this just for the sake of the criticism itself or because I want to discourage learning or take from any enjoyment you might get from writing code in this style - I'm very much a fan of TIMTOWTDI - and if you want to write these scripts like this for yourself, that's fine. But as soon as you put these into some kind of "production", the things I've said before become real issues: the more code you write yourself, the more code you have to test and maintain*. (And for asking questions, it gives others much more code to wade through.)

    And if you expose this to the world, there's the added issue of having much more code to secure properly. And with security issues, your site can quickly become the next spam relay or home for scammers, so it affects everyone.

    If you're going to be putting stuff online like this, I implore you to use the proper modules and follow the best practices for security.

    * Just for example, your sub unescape contains at least one bug: If the input string starts with a %HH encoded character, that is skipped, because my $i = index( $X, '%' ) >= 0 || return $X; doesn't actually get the index, $i will always be the return value of the logical expression. Every single one of your obfuscated subs has a corresponding function in a popular, well-maintained module, or in the Perl core itself.
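The precedence problem in the footnote can be reproduced in isolation. In this sketch the two helper subs are hypothetical names written for the demonstration; only the assignment expression is taken from unescape().

```perl
use strict;
use warnings;

# As written in unescape(): >= binds tighter than ||, which binds tighter
# than =, so $i receives the result of the comparison (1), not the position.
sub percent_index_as_written {
    my ($x) = @_;
    my $i = index($x, '%') >= 0 || return -1;
    return $i;
}

# One possible fix: take the index first and test it afterwards.
sub percent_index_fixed {
    my ($x) = @_;
    my $i = index($x, '%');
    return $i;                      # -1 when there is no '%'
}

for my $s ('%41bc', 'ab%41') {
    printf "%-6s as written: %d, fixed: %d\n",
        $s, percent_index_as_written($s), percent_index_fixed($s);
}
```

For '%41bc' the version as written starts scanning at position 1, one character past the leading %, which is why a %HH sequence at the very start of the input is never decoded.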

Re: wget not working from harangzsolt33 perl program
by Anonymous Monk on Sep 10, 2019 at 03:50 UTC
Re: wget not working from perl
by roboticus (Chancellor) on Sep 10, 2019 at 17:51 UTC

    harangzsolt33:

    In your first section, you tell the user 'Argument missing.' when they provide too many arguments. That's misleading; I'd change that to something like "Please provide only one URL" or similar.

    In your Download() function, I see a couple things:

    • You're constructing a wget command and then running it, but causing yourself a bit of grief here: Using the backticks (or qx) executes the command as a shell command, meaning that you have to take extra care quoting the arguments--otherwise the shell may make a hash of things for you.
      Even worse is that you're opening yourself up to unexpected command injection, possibly giving a malicious user an opportunity to break into your machine. You can avoid both of these problems by using the system() function to execute a command on your system. By giving it the command and arguments as a list, like this:
      my $status = system 'wget', '-q', '-O', $FILENAME, $URL;
      my $err_code = $status >> 8;   # system() returns the wait status; the exit code is in the high byte
      if ($err_code) {
          if    ($err_code == 1) { EXIT(-1, "wget: Generic error code") }
          elsif ($err_code == 2) { EXIT(-1, "wget: Parse error (cmdline, .wgetrc, ...)") }
          elsif ($err_code == 3) { EXIT(-1, "wget: File I/O error") }
          # ... etc. ...
      }
      you can avoid the quoting difficulties and presenting the shell as an attack surface.
    • Why are you creating a directory on your local machine? Just download the file to a temporary location, send it to the console and delete it. There's no reason you need to create directories in various locations. Additionally, since you're not cleaning the directories up when you're done with them, malicious users could start creating hundreds of thousands of directories on your machine, which can cause your machine to get very slow or even run out of disk space.
    • Speaking of downloading to a temporary location, you can use File::Temp to generate temporary file handles for you, so you don't accidentally make a collision. Your RandomString(8) is unlikely to collide, but File::Temp is a much-used (and tested!) module that will avoid running into files that already exist.
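Putting the last two bullets together, a download step might look roughly like the sketch below. It is not the original script's method: run_to_tempfile is a hypothetical helper, and because wget may not be available where this is tried, the child process is Perl itself ($^X) writing a line to the file it is handed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Run a command that writes to a temporary file and return the file's contents.
# $make_cmd is a code ref that receives the temp file name and returns the
# command as a list, so no shell ever sees the arguments.
sub run_to_tempfile {
    my ($make_cmd) = @_;

    # File::Temp picks an unused name and removes the file at program exit.
    my ($fh, $filename) = tempfile(SUFFIX => '.txt', UNLINK => 1);
    close $fh;

    my @cmd    = $make_cmd->($filename);
    my $status = system(@cmd);
    die "Could not start $cmd[0]: $!\n" if $status == -1;
    die sprintf("%s exited with code %d\n", $cmd[0], $status >> 8) if $status != 0;

    open(my $in, '<:raw', $filename) or die "Cannot read $filename: $!\n";
    local $/;                       # slurp mode
    my $content = <$in>;
    close $in;
    return defined $content ? $content : '';
}

# Stand-in child process: Perl writes one line to the file it is given.
my $content = run_to_tempfile(sub {
    my ($file) = @_;
    return ($^X, '-e',
        'open my $o, ">", $ARGV[0] or die $!; print {$o} "downloaded\n"; close $o;',
        $file);
});
print $content;
```

For the real thing the code ref would return something like ('wget', '-q', '-O', $_[0], $url), with $url passed as-is and without added quote characters.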

    The toJStr() function is a bit suspicious, as it looks homegrown--like a "simple" task where you keep adding in exceptions as you find new ones. What happens if you get a binary string containing UTF-8 characters not in the ASCII range? Should your string be encoded?

    One interesting thing might be what happens when you pass it something like "\00765". I didn't test it, but I'm thinking it's not going to end well.
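One way to sidestep that kind of ambiguity is to let a JSON encoder produce the string literal. The following is a sketch using the core JSON::PP module; to_js_literal is a made-up name, and treating a JSON string as a JavaScript string literal is an assumption that holds for the ASCII-only output requested here.

```perl
use strict;
use warnings;
use JSON::PP ();

# Encoder that emits pure ASCII and accepts a bare string (not only
# arrays or objects) as the top-level value.
my $json = JSON::PP->new->ascii->allow_nonref;

# Return a double-quoted string literal for $text.
sub to_js_literal {
    my ($text) = @_;
    return $json->encode(defined $text ? $text : '');
}

# BEL (character 7) followed by the digits 6 and 5, plus quotes and a backslash.
my $tricky  = "\x0765 \"quoted\" back\\slash\n";
my $literal = to_js_literal($tricky);

print "Receiver($literal);\n";
```

Control characters come out as fixed-width \uXXXX escapes, so following digits cannot be absorbed into the escape. Unlike toJStr(), this does not escape <, > or &, and bytes above 127 are treated as individual characters, so downloaded content would likely need decoding first.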

    The escape() and unescape() functions look unnecessarily opaque, and I worry that they may be homegrown things that don't have all the special cases handled. If they're "tried and true", you may want to mention that in a comment, as well as where they came from so you can treat them like "black boxes". That way, if some error occurs in the future, you can decide to dig into them or not based on your comment(s).

    Also, why do you hate whitespace? If they were formatted better, at least you could look them over and get a feel for what they do.

    Your Ceil() and Floor() functions are straightforward enough that being one-liners seems OK. Even so, I'd still put a little whitespace in there for readability.
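For comparison, several of the helper subs have rough counterparts in modules that ship with Perl. The sketch below is an illustration only: the path is a made-up example, and the results are not guaranteed to match the hand-written versions in every edge case (Trim(), for instance, strips all characters below 33, not just whitespace).

```perl
use strict;
use warnings;
use POSIX          qw(ceil floor);
use File::Basename qw(dirname);
use File::Spec;

# Ceil() / Floor()
my @rounded = (ceil(2.1), ceil(-1.5), floor(2.9), floor(-1.5));

# GetPath(): directory part of a file name (hypothetical path)
my $dir = dirname('/var/www/cgi-bin/wget.pl');

# JoinPath(): join path components with the platform's separator
my $file = File::Spec->catfile($dir, 'WGET', 'page.txt');

# Trim(): strip leading and trailing whitespace
(my $trimmed = "  http://example.com \n") =~ s/^\s+|\s+$//g;

print "@rounded\n$dir\n$file\n[$trimmed]\n";
```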

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: wget not working from perl
by Anonymous Monk on Sep 10, 2019 at 04:30 UTC
    It seems to work fine when I enter the web address manually thru stdin, but it doesn't do anything if I pass the argument thru the URL online.

    When run manually your program has the same powers as your user account. When a web server runs the program it only has the privileges of the server. Make sure the files and folders you want to access can be changed by the web server user account on your system.

      More specifically, make sure to give the web server user access to ONLY the files that you want the WHOLE internet to be able to access.

      Additionally, the webserver may not allow you to spawn external commands through system. This was the case for me when I used a free-hosting service once.
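A quick way to check the permission theory is to have the script report who it runs as and whether the folder is usable. This is a diagnostic sketch, not a fix: describe_access is a hypothetical helper, 'WGET' is the folder name from the question, and getpwuid is assumed to be available (it is on Unix-like systems).

```perl
use strict;
use warnings;

# Report the effective user and whether a directory exists and is writable.
sub describe_access {
    my ($dir) = @_;
    my $user = getpwuid($>);            # name of the effective user, if known
    return {
        user     => defined $user ? $user : "uid $>",
        exists   => (-d $dir)            ? 1 : 0,
        writable => (-d $dir && -w $dir) ? 1 : 0,
    };
}

my $info = describe_access('WGET');
print "Content-type: text/plain\n\n";
printf "running as: %s\nWGET exists: %d\nWGET writable: %d\n",
    @{$info}{qw(user exists writable)};
```

Running this once from the command line and once through the web server shows whether the two environments see the folder differently.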

Re: wget not working from perl
by Anonymous Monk on Sep 10, 2019 at 11:24 UTC
    Messy, not at all portable (using wget rather than perl) and downright dangerous.

Node Type: perlquestion [id://11105920]
Approved by Athanasius