When I tried the script you showed, it gave me a 403 Forbidden error. When I checked with Chrome, it downloaded fine. Running curl -v https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx gave a more specific response:
< HTTP/1.1 403 Forbidden
< Server: AkamaiGHost
< Mime-Version: 1.0
< Content-Length: 4793
< Cache-Control: no-cache, no-store, must-revalidate
< Pragma: no-cache
< Expires: 0
< Content-Type: text/html
< Date: Sat, 21 May 2022 20:40:48 GMT
< Connection: keep-alive
< Strict-Transport-Security: max-age=31536000 ; includeSubDomains ; preload
...
<title>SEC.gov | Request Rate Threshold Exceeded</title>
...
<h1>Your Request Originates from an Undeclared Automated Tool</h1>
<p>To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic.</p>
<p>Please declare your traffic by updating your user agent to include company specific information.</p>
...
<p>For best practices on efficiently downloading information from SEC.gov, including the latest EDGAR filings, visit <a href="https://www.sec.gov/developer" target="_blank">sec.gov/developer</a>. You can also <a href="https://public.govdelivery.com/accounts/USSEC/subscriber/new?topic_id=USSEC_260" target="_blank">sign up for email updates</a> on the SEC open data program, including best practices that make it more efficient to download data, and SEC.gov enhancements that may impact scripted downloading processes. For more information, contact <a href="mailto:opendata@sec.gov">opendata@sec.gov</a>.</p>
<p>For more information, please see the SEC’s <a href="#internet">Web Site Privacy and Security Policy</a>. Thank you for your interest in the U.S. Securities and Exchange Commission.
<p>Reference ID: 0.9db31bb8.1653165648.37b3e960</p>
Basically, you need to follow their terms of service on load limits and set a user-agent string that meets their rules. (Or, if you want to risk violating the SEC's rules, use a user-agent string that mimics a browser's without looking up what their rules are.) Both LWP::UserAgent and WWW::Mechanize let you set the user agent, and both document how to do so.
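As a rough illustration of declaring your traffic, here is a minimal sketch using LWP::UserAgent. The agent string below is a made-up placeholder, and its "company name plus contact address" shape is my assumption about what the SEC wants; check sec.gov/developer for the actual rules. The fetch_index helper is a hypothetical name for this example, and fetching https URLs assumes LWP::Protocol::https is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Placeholder identity (an assumption, not a real declaration):
# replace with your own company name and contact address.
my $agent_string = 'Example Company admin@example.com';

# The agent option sets the User-Agent header sent with every request.
my $ua = LWP::UserAgent->new(
    agent   => $agent_string,
    timeout => 30,
);

# Hypothetical helper: fetch one URL and die with the status line on failure.
sub fetch_index {
    my ($url) = @_;
    my $response = $ua->get($url);
    die 'Request failed: ' . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $index = fetch_index($ARGV[0]);
    printf "Fetched %d characters\n", length $index;
}
```

Run it with the master.idx URL as its only argument. WWW::Mechanize is a subclass of LWP::UserAgent, so the same agent option should work in WWW::Mechanize->new as well. This only covers the user-agent part; staying under their request rate limits is still up to you.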
In reply to Re: LWP and Mechanize
by pryrt
in thread LWP and Mechanize
by perlmike