PerlMonks  

downloading a file on a page with javascript

by Aldebaran (Curate)
on Mar 30, 2020 at 21:30 UTC ( [id://11114814]=perlquestion )

Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I thought I would try something easy, namely using perl to download a file, and I have yet to achieve it. The page in question is found on google's public download site, so there isn't a question of permissions: useful utilities.

I've tried it a few different ways and even another syntax, and what I get with the download is the html for the page itself, which I won't show, but it definitely is the same as when you go to the above site and select "view page source."

use LWP::Simple;
my $url = 'https://code.google.com/archive/p/dotnetperls-controls/downloads/enable1.txt';
my $file = 'a.txt';
getstore($url, $file);

Fishing for tips,

Replies are listed 'Best First'.
Re: downloading a file on a page with javascript
by choroba (Cardinal) on Mar 30, 2020 at 21:46 UTC
    Where did you find the URL? If I point my mouse on the file and save the link, I get
    https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/dotnetperls-controls/enable1.txt

    Using this URL instead of the one you used also stores a list of words to the output file, which I guess is the output you had expected.
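A minimal sketch of the same download with a status check, since `getstore` returns the HTTP status code and the original snippet ignores it. The helper name `fetch_to_file` and the command-line usage are assumptions for illustration, not part of the thread. Note that a status check alone may not catch the original problem, because a server can answer with an HTML page and a success code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# Hypothetical helper (not from the thread): download $url to $file and
# report whether the HTTP status was a 2xx code.
# getstore() returns the numeric HTTP status of the response.
sub fetch_to_file {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);
    warn "Download of $url failed with status $status\n"
        unless is_success($status);
    return is_success($status);
}

# Usage: perl fetch.pl URL FILE
if (@ARGV == 2) {
    my ($url, $file) = @ARGV;
    exit( fetch_to_file($url, $file) ? 0 : 1 );
}
```

Run with the storage.googleapis.com URL above and an output filename, this should exit 0 on success and print a warning otherwise.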

    Getting this URL from the Archive page without JavaScript is hard. Search the Monastery for related questions.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      Where did you find the URL?

      I cobbled it together from the base url and the file I wanted.

      If I point my mouse on the file and save the link,

      I get the same thing. What I realize from your and bliako's post is that I underused the power of the browser to figure this out.

      Using this URL instead of the one you used also stores a list of words to the output file, which I guess is the output you had expected.

      Thx, choroba, that is indeed what I seek for my wordgames. With the correct url, my script gets the English dictionary. I decided to try it out with an older source post of yours: Re^7: Words in Words. "Correct" entries are words that have a properly-encompassing word. A hybrid is this:

      Source:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use LWP::Simple;
      use 5.016;
      my $url = 'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/dotnetperls-controls/enable1.txt';
      my $file = '/home/hogan/Documents/phone/from_laptop/my_data/bb.txt';
      getstore($url, $file);
      ##
      open my $IN, '<', $file or die "$!";
      my %words;
      while (my $word = <$IN>) {
          chomp $word;
          undef $words{$word};
      }
      my %reported;
      for my $word (keys %words) {
          my $length = length $word;
          for my $pos (0 .. $length - 1) {
              my $skip_itself = ! $pos;
              for my $len (1 .. $length - $pos - $skip_itself) {
                  my $subword = substr($word, $pos, $len);
                  next if exists $reported{$subword};
                  next if $word eq $subword . q{s}
                      or $word eq $subword . q{'s};
                  if (exists $words{$subword}) {
                      say "$subword";
                      undef $reported{$subword};
                  }
              }
          }
      }

      Logophiles like me play gladly with such output. I speak English natively, so I'm rarely challenged with English vocabulary. The resulting list is fascinating:

      $ grep phosphorylating bb.txt
      dephosphorylating
      phosphorylating
      $ grep aerially bb.txt
      aerially
      subaerially
      $ grep physiology bb.txt
      ecophysiology
      electrophysiology
      histophysiology
      neurophysiology
      pathophysiology
      physiology
      psychophysiology
      $ grep quids bb.txt
      equids
      liquids
      nonliquids
      quids
      semiliquids
      soliquids
      squids
      $ grep consciouses bb.txt
      consciouses
      preconsciouses
      subconsciouses
      unconsciouses
      $

      Who knew that there were 4 different consciouses? I couldn't find an example that failed to have a larger including word.

      Anyways, thanks for your comment that got me on the right track and also for the fun of replicating your "words within words" script.

      "Perl scripting: great for pandemics...."

Re: downloading a file on a page with javascript
by bliako (Monsignor) on Mar 30, 2020 at 22:21 UTC

    there are at least two ways to approach this.

    The first is to use WWW::Mechanize::Chrome which is like running a browser but without the gui (headless) from inside your script. With it you will be able to dive into the fetched page's DOM and extract anything you like from it, including those divs that you don't see with a view-page-source because they are fetched later via javascript/ajax.

    The second is to open the site in your browser and open the developer tools (Firefox, but other browsers have similar functionality). Go to the Network tab, select XHR, and reload the page. You will see all the data fetched via ajax and where it comes from: urls just like the one you tried to download. Copy that request as cURL (it's on the right-click menu somewhere) and you can see exactly what the url is and what its parameters are. Now note the url, its parameters, whether it is a POST or a GET, and what request headers it has. It's easy to translate those into LWP::UserAgent.

    Edit: converting a beast of a CURL commandline to LWP::UserAgent can be done easily by using Corion's curl2lwp (see http://blogs.perl.org/users/max_maischein/2018/11/curl2lwp---convert-curl-command-line-arguments-to-lwp-mechanize-perl-code.html)

      there are at least two ways to approach this.

      I was particularly pleased to see this response from bliako, whose pm posts are at a level where I can, about half the time, stretch my game to replicate, understand, and incorporate into "my game," whatever that is. I was thinking there should be several ways that perl could do either natively, or by wrapping C, or with modules. Getting the url right needs to be a part of any solution.

      The first is to use WWW::Mechanize::Chrome

      I had trouble installing WWW::Mechanize::Chrome, but it was all of the variety where I needed only to make better web searches for prereq's.

      The first "problem" was getting WWW::Mechanize::Chrome to install on ubuntu. I lacked 2 things at the beginning: a chrome executable, and headers for png.h .

      For ubuntu, a good command line install for chrome is here. Since being able to save a screenshot as a png is necessary, I also needed:

      sudo apt-get install libpng-dev

      This is as far as I got along this prong. Output, then source:

      $ ./1.mai.pl
      enable1.txt
      Yay
      #!/usr/bin/perl
      use strict;
      use Log::Log4perl qw(:easy);
      use WWW::Mechanize::Chrome;
      use Data::Dump;
      use 5.016;
      my $mech = WWW::Mechanize::Chrome->new();
      my $url = 'https://code.google.com/archive/p/dotnetperls-controls/downloads';
      $mech->get($url);
      print $_->text . "\n" for $mech->find_all_links( text_regex => qr/enable/i );
      $mech->follow_link( xpath => '//a[text() = "enable1.txt"]' );
      my @words;
      # check the outcome
      if ($mech->success) {
          #print $res->decoded_content;
          #@words = mech->decoded_content;
          print "Yay\n";
      }
      else {
          print "Error: " . $mech->status . "\n";
      }
      if (@words) {
          print "@words\n";
      }
      sleep 1;

      Aspects of downloads are yet to be implemented according to the 35:06 mark here: corion's presentation from 2017

      Q1) How do I bridge the gap from $mech->follow_link to populating @words ?

      The second is to open the site in your browser and open the developer tools (Firefox, but other browsers have similar functionality). Go to the Network tab, select XHR, and reload the page. You will see all the data fetched via ajax and where it comes from: urls just like the one you tried to download. Copy that request as cURL (it's on the right-click menu somewhere) and you can see exactly what the url is and what its parameters are. Now note the url, its parameters, whether it is a POST or a GET, and what request headers it has. It's easy to translate those into LWP::UserAgent.

      I did something close to this dozens of different ways. What ended up working for me was left-clicking on the link while the developer tools--including network tab--are on and then finding the copy to curl on the right click menu as one hovers over it in the tools. This yields:

      curl 'https://www.googleapis.com/storage/v1/b/google-code-archive/o/v2%2Fcode.google.com%2Fdotnetperls-controls%2Fproject.json?alt=media&stripTrailingSlashes=false' \
        -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0' \
        -H 'Accept: application/json, text/plain, */*' \
        -H 'Accept-Language: en-US,en;q=0.5' \
        --compressed \
        -H 'Origin: https://code.google.com' \
        -H 'Connection: keep-alive' \
        -H 'Referer: https://code.google.com/archive/p/dotnetperls-controls/downloads' \
        -H 'Cache-Control: max-age=0' \
        -H 'TE: Trailers'

      Then I turned to Corion's curl2lwp converter. I'm super pleased by this:

      $ ./2.curl.pl | tail -5
      zymotic
      zymurgies
      zymurgy
      zyzzyva
      zyzzyvas
      $ cat 2.curl.pl
      #!/usr/bin/perl
      use strict;
      use warnings;
      use LWP::UserAgent;
      my $ua = LWP::UserAgent->new( 'send_te' => '0' );
      my $r = HTTP::Request->new(
          'GET' => 'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/dotnetperls-controls/enable1.txt',
          [
              'Connection' => 'keep-alive',
              'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
              'Accept-Encoding' => 'gzip, x-gzip, deflate, x-bzip2, bzip2',
              'Accept-Language' => 'en-US,en;q=0.5',
              'Host' => 'storage.googleapis.com:443',
              'Referer' => 'https://code.google.com/archive/p/dotnetperls-controls/downloads',
              'User-Agent' => 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0',
              'Upgrade-Insecure-Requests' => '1',
          ],
      );
      my $res = $ua->request( $r, );
      ### begin Aldebaran-added source
      my @words;
      # check the outcome
      if ($res->is_success) {
          #print $res->decoded_content;
          @words = $res->decoded_content;
      }
      else {
          print "Error: " . $res->status_line . "\n";
      }
      if (@words) {
          print "@words\n";
      }
      __END__
      $
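One detail in the added part of that script: `decoded_content` returns the whole body as a single string, so `@words` ends up holding one element rather than one entry per word. A minimal sketch of splitting the body into a word list follows. The helper name `body_to_words` and the sample string are assumptions for illustration; the sample stands in for `$res->decoded_content` so the sketch runs without a network.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): turn a response body into
# a list of words, one per line, tolerating CRLF line endings and
# dropping empty lines.
sub body_to_words {
    my ($body) = @_;
    return grep { length } split /\r?\n/, $body;
}

# Stand-in for $res->decoded_content (assumed sample data).
my $body  = "zymurgy\r\nzyzzyva\nzyzzyvas\n";
my @words = body_to_words($body);

print scalar(@words), " words, last is $words[-1]\n";
```

In the script above, this would amount to replacing the assignment with `@words = body_to_words($res->decoded_content);`.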

      This represents a huge learning curve partially-ascended for me, including considering the Bigger picture with introduction to DOM.

      I have one more question at this point, regarding the practice scripts at examples, all of which use Log::Log4perl. If I have:

      $ cat /etc/2.log.conf
      ###############################################################################
      # Log::Log4perl Conf #
      ###############################################################################
      log4perl.rootLogger = DEBUG, LOG1, SCREEN
      log4perl.appender.SCREEN = Log::Log4perl::Appender::Screen
      log4perl.appender.SCREEN.stderr = 0
      log4perl.appender.SCREEN.layout = Log::Log4perl::Layout::PatternLayout
      log4perl.appender.SCREEN.layout.ConversionPattern = %m %n
      log4perl.appender.LOG1 = Log::Log4perl::Appender::File
      log4perl.appender.LOG1.filename = /home/hogan/Documents/hogan/logs/2.log4perl.txt
      log4perl.appender.LOG1.mode = append
      log4perl.appender.LOG1.layout = Log::Log4perl::Layout::PatternLayout
      log4perl.appender.LOG1.layout.ConversionPattern = %d %p %m %n
      $

      , and this successfully logs events and errors:

      #!/usr/bin/perl
      use Log::Log4perl;
      # Initialize Logger
      my $log_conf = "/etc/2.log.conf";
      Log::Log4perl::init($log_conf);
      my $logger = Log::Log4perl->get_logger();
      $logger->info("===== before system call");
      system('ls -l qwerty');
      if( $? > 0 ) {
          $logger->error("there was an error: $?");
      }
      $logger->info("===== after system call");

      Q2) How do I log using this scheme? For example, do I go from

      else { print "Error: " . $mech->status . "\n"; }

      to:

      else { $logger->error("there was an error: $mech->status" . "\n") ; }
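A caveat on that second form: Perl does not interpolate method calls inside double quotes, so `"$mech->status"` expands only `$mech` (the stringified object) and leaves `->status` as literal text. A minimal sketch of one way around it, using concatenation, follows. `My::FakeMech` and `status_message` are hypothetical stand-ins so the sketch runs without a browser, and `easy_init` is used here in place of the config file from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Log::Log4perl qw(:easy);

# Hypothetical stand-in for the $mech object (an assumption, not the
# real WWW::Mechanize::Chrome class); only its status() method matters.
package My::FakeMech {
    sub new    { return bless { status => 404 }, shift }
    sub status { return $_[0]{status} }
}

# Build the message by concatenation, because "$mech->status" inside
# double quotes interpolates only $mech, not the method call.
sub status_message {
    my ($mech) = @_;
    return "there was an error: " . $mech->status;
}

# The post calls Log::Log4perl::init($log_conf) instead; easy_init keeps
# this sketch independent of /etc/2.log.conf.
Log::Log4perl->easy_init($ERROR);
my $logger = Log::Log4perl->get_logger();

my $mech = My::FakeMech->new;
$logger->error( status_message($mech) );
```

The trailing `"\n"` is probably unnecessary with the configuration shown earlier, since both conversion patterns already end in `%n`.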

      Again, thanks all for comments, which seem to be the "service work" that most of us can do in these unusual times of "social distancing." Stay healthy!

      2020-04-07 Athanasius fixed formatting of over-long code line.
