PerlMonks  

Re: LWP and Mechanize

by pryrt (Abbot)
on May 21, 2022 at 20:44 UTC ( [id://11144055] )


in reply to LWP and Mechanize

Have you checked the TOS for that site?

When I tried the script you showed, it gave me a 403 Forbidden error. When I checked with Chrome, it downloaded fine. When I tried a curl -v https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx, it was a bit more specific:

    < HTTP/1.1 403 Forbidden
    < Server: AkamaiGHost
    < Mime-Version: 1.0
    < Content-Length: 4793
    < Cache-Control: no-cache, no-store, must-revalidate
    < Pragma: no-cache
    < Expires: 0
    < Content-Type: text/html
    < Date: Sat, 21 May 2022 20:40:48 GMT
    < Connection: keep-alive
    < Strict-Transport-Security: max-age=31536000 ; includeSubDomains ; preload
    ...
    <title>SEC.gov | Request Rate Threshold Exceeded</title>
    ...
    <h1>Your Request Originates from an Undeclared Automated Tool</h1>
    <p>To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic.</p>
    <p>Please declare your traffic by updating your user agent to include company specific information.</p>
    ...
    <p>For best practices on efficiently downloading information from SEC.gov, including the latest EDGAR filings, visit <a href="https://www.sec.gov/developer" target="_blank">sec.gov/developer</a>. You can also <a href="https://public.govdelivery.com/accounts/USSEC/subscriber/new?topic_id=USSEC_260" target="_blank">sign up for email updates</a> on the SEC open data program, including best practices that make it more efficient to download data, and SEC.gov enhancements that may impact scripted downloading processes. For more information, contact <a href="mailto:opendata@sec.gov">opendata@sec.gov</a>.</p>
    <p>For more information, please see the SEC’s <a href="#internet">Web Site Privacy and Security Policy</a>. Thank you for your interest in the U.S. Securities and Exchange Commission.
    <p>Reference ID: 0.9db31bb8.1653165648.37b3e960</p>

Basically, you need to follow their TOS on load limits and define a user-agent string that meets their rules. (Or, if you want to risk violating the SEC's rules, use a user-agent string that mimics a browser's without looking up what their rules are ↗.) Both LWP::UserAgent and WWW::Mechanize allow setting the user agent, and both document how to do so.


↗: Looks like LanX determined that wouldn't work in id://11144056, which wasn't there when I started writing my post.
edit 2: you could have seen the full error message yourself if you had checked for content as well as status during the else condition, like else {die $response->status_line . ($response->content||'');}
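
To make the two suggestions above concrete, here is a minimal sketch using LWP::UserAgent: it sets the `agent` attribute at construction time and, on failure, dies with the response body as well as the status line. The agent string `Example Company admin@example.com` and the helper names `make_ua` and `fetch_or_die` are placeholders I made up for illustration; check sec.gov/developer for what the SEC actually wants in the string.

```perl
use strict;
use warnings;
use LWP::UserAgent ();

# Hypothetical identification string: replace with your own
# organisation name and contact address.
my $AGENT = 'Example Company admin@example.com';

# Build a user agent that declares who is making the requests.
sub make_ua {
    my $ua = LWP::UserAgent->new(
        agent   => $AGENT,   # sent as the User-Agent header
        timeout => 10,
    );
    $ua->env_proxy;          # honour *_proxy environment variables
    return $ua;
}

# Fetch a URL; on failure, die with the status line plus
# whatever body the server sent back (often the real explanation).
sub fetch_or_die {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    return $response->decoded_content if $response->is_success;
    die $response->status_line . "\n" . ($response->content || '');
}

# Example call (performs a real request, so it is left commented out):
# print fetch_or_die(make_ua(),
#     'https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx');
```

The agent can also be changed after construction with `$ua->agent('...')`. Setting the string only addresses the "undeclared tool" part of the message; staying within the site's request-rate limits is still up to the script.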

Re^2: LWP and Mechanize
by perlmike (Initiate) on May 21, 2022 at 21:12 UTC

    This is very helpful! Thank you very much. How to print out content during the else condition?

      How to print out content during the else condition?

      That was in my edit2 section: else {die $response->status_line . ($response->content||'');}

      Or, putting it into your whole script:

          use strict;
          use warnings;
          use LWP::UserAgent();

          my $ua = LWP::UserAgent->new(timeout => 10);
          $ua->env_proxy;

          my $response = $ua->get('https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx');

          open(OUT, ">" . "master") or die "Cannot open master";
          if ($response->is_success) {print OUT $response->decoded_content; }
          else {die $response->status_line . ($response->content||'');}
          close OUT;

        I see now. Can you tell me how to set up the user agent?
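
For the WWW::Mechanize half of the thread title, one possible sketch: Mechanize is a subclass of LWP::UserAgent, so the `agent` option can be passed to its constructor. The agent string and the helper names `make_mech` and `fetch_or_die` below are hypothetical placeholders, not anything the SEC or the module prescribes; `autocheck => 0` is used so the script can print the error body itself instead of having Mechanize die on its own.

```perl
use strict;
use warnings;
use WWW::Mechanize ();

# Build a Mechanize object that identifies itself.
# The agent string is a hypothetical placeholder: use your own
# organisation name and contact address.
sub make_mech {
    return WWW::Mechanize->new(
        agent     => 'Example Company admin@example.com',
        autocheck => 0,    # do not die automatically; inspect the result ourselves
        timeout   => 10,
    );
}

# Fetch a URL; on failure, die with the status line and the body.
sub fetch_or_die {
    my ($mech, $url) = @_;
    $mech->get($url);
    return $mech->content if $mech->success;
    die $mech->response->status_line . "\n" . ($mech->content // '');
}

# Example call (performs a real request, so it is left commented out):
# print fetch_or_die(make_mech(),
#     'https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx');
```

Mechanize also offers `agent_alias` to imitate common browsers, but that is the "mimic a browser" route, which does not declare the traffic the way the error page asks.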
