
Re: downloading a file on a page with javascript

by bliako (Monsignor)
on Mar 30, 2020 at 22:21 UTC


in reply to downloading a file on a page with javascript

There are at least two ways to approach this.

The first is to use WWW::Mechanize::Chrome, which is like running a browser without the GUI (headless) from inside your script. With it you can dive into the fetched page's DOM and extract anything you like from it, including those divs that you don't see with view-page-source because they are fetched later via JavaScript/AJAX.

The second is to open the site in your browser and open the developer tools (Firefox here, but other browsers have similar functionality). Go to the Network tab, select XHR and reload the page. You will see all the data fetched via AJAX, and where that data comes from: URLs just like the one you tried to download. Copy that request as cURL (it's on the right-click menu somewhere) and you can see exactly what the URL is and what its parameters are. Now note the URL, its parameters, whether it is a POST or a GET, and what request headers it has. It's easy to translate those into LWP::UserAgent.

Edit: converting a beast of a curl command line to LWP::UserAgent can be done easily with Corion's curl2lwp (see http://blogs.perl.org/users/max_maischein/2018/11/curl2lwp---convert-curl-command-line-arguments-to-lwp-mechanize-perl-code.html)

Re^2: downloading a file on a page with javascript
by Aldebaran (Curate) on Apr 06, 2020 at 22:33 UTC
    there are at least two ways to approach this.

    I was particularly pleased to see this response from bliako, whose pm posts are at a level where I can, about half the time, stretch my game to replicate, understand, and incorporate into "my game," whatever that is. I was thinking there should be several ways that Perl could do this: natively, by wrapping C, or with modules. Getting the URL right needs to be a part of any solution.

    The first is to use WWW::Mechanize::Chrome

    I had trouble installing WWW::Mechanize::Chrome, but it was all of the variety where I only needed to make better web searches for prerequisites.

    The first "problem" was getting WWW::Mechanize::Chrome to install on Ubuntu. I lacked two things at the beginning: a Chrome executable, and the headers for png.h.

    For Ubuntu, a good command line install for Chrome is here. Since being able to save a screenshot as a png is necessary, I also needed:

    sudo apt-get install libpng-dev

    This is as far as I got along this prong. Output, then source:

    $ ./1.mai.pl
    enable1.txt
    Yay

    #!/usr/bin/perl
    use strict;
    use Log::Log4perl qw(:easy);
    use WWW::Mechanize::Chrome;
    use Data::Dump;
    use 5.016;

    my $mech = WWW::Mechanize::Chrome->new();
    my $url  = 'https://code.google.com/archive/p/dotnetperls-controls/downloads';
    $mech->get($url);
    print $_->text . "\n" for $mech->find_all_links( text_regex => qr/enable/i );
    $mech->follow_link( xpath => '//a[text() = "enable1.txt"]' );

    my @words;
    # check the outcome
    if ($mech->success) {
        #print $res->decoded_content;
        #@words = mech->decoded_content;
        print "Yay\n";
    }
    else {
        print "Error: " . $mech->status . "\n";
    }
    if (@words) {
        print "@words\n";
    }
    sleep 1;

    Aspects of downloads are yet to be implemented, according to the 35:06 mark here: Corion's presentation from 2017

    Q1) How do I bridge the gap from $mech->follow_link to populating @words ?

    The second is to open the site in your browser and open the developer tools (Firefox here, but other browsers have similar functionality). Go to the Network tab, select XHR and reload the page. You will see all the data fetched via AJAX, and where that data comes from: URLs just like the one you tried to download. Copy that request as cURL (it's on the right-click menu somewhere) and you can see exactly what the URL is and what its parameters are. Now note the URL, its parameters, whether it is a POST or a GET, and what request headers it has. It's easy to translate those into LWP::UserAgent.

    I did something close to this dozens of different ways. What ended up working for me was left-clicking on the link while the developer tools, including the Network tab, were open, and then finding "copy as cURL" on the right-click menu for that request in the tools. This yields:

    curl 'https://www.googleapis.com/storage/v1/b/google-code-archive/o/v2%2Fcode.google.com%2Fdotnetperls-controls%2Fproject.json?alt=media&stripTrailingSlashes=false' \
      -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0' \
      -H 'Accept: application/json, text/plain, */*' \
      -H 'Accept-Language: en-US,en;q=0.5' \
      --compressed \
      -H 'Origin: https://code.google.com' \
      -H 'Connection: keep-alive' \
      -H 'Referer: https://code.google.com/archive/p/dotnetperls-controls/downloads' \
      -H 'Cache-Control: max-age=0' \
      -H 'TE: Trailers'

    Then I turned to Corion's curl2lwp converter. I'm super pleased by this:

    $ ./2.curl.pl | tail -5
    zymotic
    zymurgies
    zymurgy
    zyzzyva
    zyzzyvas
    $ cat 2.curl.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new( 'send_te' => '0' );
    my $r  = HTTP::Request->new(
        'GET' =>
          'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/dotnetperls-controls/enable1.txt',
        [
            'Connection' => 'keep-alive',
            'Accept' =>
              'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding' => 'gzip, x-gzip, deflate, x-bzip2, bzip2',
            'Accept-Language' => 'en-US,en;q=0.5',
            'Host'            => 'storage.googleapis.com:443',
            'Referer' =>
              'https://code.google.com/archive/p/dotnetperls-controls/downloads',
            'User-Agent' =>
              'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0',
            'Upgrade-Insecure-Requests' => '1',
        ],
    );
    my $res = $ua->request( $r, );

    ### begin Aldebaran-added source
    my @words;
    # check the outcome
    if ($res->is_success) {
        #print $res->decoded_content;
        @words = $res->decoded_content;
    }
    else {
        print "Error: " . $res->status_line . "\n";
    }
    if (@words) {
        print "@words\n";
    }
    __END__
    $
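One thing to watch in the added block: assigning `$res->decoded_content` to `@words` stores the whole response body as a single element, so the array holds one long string rather than one word per entry. A minimal sketch of splitting it, assuming the file is one word per line; the literal `$content` string below is a hypothetical stand-in for the downloaded text so the example runs offline:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for $res->decoded_content: a few lines in the
# shape of enable1.txt (one word per line).
my $content = "zymurgy\nzyzzyva\nzyzzyvas\n";

# Assigning a single scalar to an array gives a one-element array.
my @one = ($content);

# Splitting on line endings gives one word per element; split drops
# the trailing empty field left by the final newline.
my @words = split /\r?\n/, $content;

printf "%d element(s) vs %d word(s)\n", scalar @one, scalar @words;
print "last word: $words[-1]\n";
```

With the real response, the same `split` would be applied to `$res->decoded_content` inside the `is_success` branch.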

    This represents a huge learning curve partially ascended for me, including considering the bigger picture with an introduction to the DOM.

    I have one more question at this point, regarding the practice scripts at examples, all of which use Log::Log4perl. If I have:

    $ cat /etc/2.log.conf
    ###############################################################################
    #                              Log::Log4perl Conf                             #
    ###############################################################################
    log4perl.rootLogger                                = DEBUG, LOG1, SCREEN
    log4perl.appender.SCREEN                           = Log::Log4perl::Appender::Screen
    log4perl.appender.SCREEN.stderr                    = 0
    log4perl.appender.SCREEN.layout                    = Log::Log4perl::Layout::PatternLayout
    log4perl.appender.SCREEN.layout.ConversionPattern  = %m %n
    log4perl.appender.LOG1                             = Log::Log4perl::Appender::File
    log4perl.appender.LOG1.filename                    = /home/hogan/Documents/hogan/logs/2.log4perl.txt
    log4perl.appender.LOG1.mode                        = append
    log4perl.appender.LOG1.layout                      = Log::Log4perl::Layout::PatternLayout
    log4perl.appender.LOG1.layout.ConversionPattern    = %d %p %m %n
    $

    and this successfully logs events and errors:

    #!/usr/bin/perl
    use Log::Log4perl;

    # Initialize Logger
    my $log_conf = "/etc/2.log.conf";
    Log::Log4perl::init($log_conf);
    my $logger = Log::Log4perl->get_logger();

    $logger->info("===== before system call");
    system('ls -l qwerty');
    if( $? > 0 ) {
        $logger->error("there was an error: $?");
    }
    $logger->info("===== after system call");
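A side note on the value being logged there: after `system`, `$?` holds the raw wait status rather than the command's exit code, so the number in the log will usually look larger than expected. A minimal sketch of decoding it, using the running perl binary (`$^X`) as a portable stand-in for the `ls -l qwerty` call; that substitution is an assumption made only so the example behaves the same everywhere:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a child that exits with status 2. $^X is the running perl binary,
# used here only as a stand-in for the 'ls -l qwerty' call.
system($^X, '-e', 'exit 2');

my $raw    = $?;          # raw wait status as set by system()
my $exit   = $raw >> 8;   # the child's actual exit code
my $signal = $raw & 127;  # non-zero if the child died from a signal

print "raw=$raw exit=$exit signal=$signal\n";
```

On a POSIX system this should print `raw=512 exit=2 signal=0`; logging `$? >> 8` instead of `$?` gives the exit code the command actually returned.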

    Q2) How do I log using this scheme? For example, do I go from

    else { print "Error: " . $mech->status . "\n"; }

    to:

    else { $logger->error("there was an error: $mech->status" . "\n") ; }
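One caveat with the second form: Perl does not interpolate method calls inside double-quoted strings, so `"$mech->status"` expands `$mech` to its object string and leaves `->status` as literal text. A minimal sketch of the difference, using a hypothetical `FakeMech` class as a stand-in for the real `$mech` object so it runs without a browser:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for $mech, so the sketch runs without a browser.
package FakeMech {
    sub new    { return bless {}, shift }
    sub status { return 404 }
}

my $mech = FakeMech->new;

# Interpolation stops at the variable: this yields something like
# "there was an error: FakeMech=HASH(0x...)->status".
my $wrong = "there was an error: $mech->status";

# Calling the method outside the string gives the intended message.
my $right = "there was an error: " . $mech->status;

print "$wrong\n$right\n";
```

So the logger call would more likely be `$logger->error("there was an error: " . $mech->status);`. The trailing `"\n"` is probably unnecessary as well, since the `%n` in the ConversionPattern lines of the config above already ends each message with a newline.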

    Again, thanks all for comments, which seem to be the "service work" that most of us can do in these unusual times of "social distancing." Stay healthy!

    2020-04-07 Athanasius fixed formatting of over-long code line.
