downloading a file on a page with javascript

Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

Re: downloading a file on a page with javascript by choroba (Cardinal) on Mar 30, 2020 at 21:46 UTC
Where did you find the URL? If I point my mouse on the file and save the link, I get `https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/dotnetperls-controls/enable1.txt`

Using this URL instead of the one you used also stores a list of words to the output file, which I guess is the output you had expected.

Getting this URL from the Archive page without JavaScript is hard. Search the Monastery for related questions.

`map{substr$_->[0],$_->[1]||0,1}[\||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`
Re^2: downloading a file on a page with javascript by Aldebaran (Curate) on Apr 06, 2020 at 22:38 UTC
> Where did you find the URL?

I cobbled it together from the base URL and the file I wanted. If I point my mouse on the file and save the link, I get the same thing. What I realize from your and bliako's posts is that I underused the power of the browser to figure this out.

> Using this URL instead of the one you used also stores a list of words to the output file, which I guess is the output you had expected.

Thx, choroba, that is indeed what I seek for my wordgames. With the correct URL, my script gets the English dictionary. I decided to try it out with an older source post of yours: Re^7: Words in Words. "Correct" entries are words that have a properly-encompassing word. A hybrid is this:

Source:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use 5.016;

    my $url  = 'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/dotnetperls-controls/enable1.txt';
    my $file = '/home/hogan/Documents/phone/from_laptop/my_data/bb.txt';
    getstore($url, $file);

    ##
    open my $IN, '<', $file or die "$!";
    my %words;
    while (my $word = <$IN>) {
        chomp $word;
        undef $words{$word};
    }

    my %reported;
    for my $word (keys %words) {
        my $length = length $word;
        for my $pos (0 .. $length - 1) {
            my $skip_itself = ! $pos;
            for my $len (1 .. $length - $pos - $skip_itself) {
                my $subword = substr($word, $pos, $len);
                next if exists $reported{$subword};
                next if $word eq $subword . q{s}
                     or $word eq $subword . q{'s};
                if (exists $words{$subword}) {
                    say "$subword";
                    undef $reported{$subword};
                }
            }
        }
    }

Logophiles like me play gladly with such output. I speak English natively, so I'm rarely challenged with English vocabulary.
The resulting list is fascinating:

    $ grep phosphorylating bb.txt
    dephosphorylating
    phosphorylating
    $ grep aerially bb.txt
    aerially
    subaerially
    $ grep physiology bb.txt
    ecophysiology
    electrophysiology
    histophysiology
    neurophysiology
    pathophysiology
    physiology
    psychophysiology
    $ grep quids bb.txt
    equids
    liquids
    nonliquids
    quids
    semiliquids
    soliquids
    squids
    $ grep consciouses bb.txt
    consciouses
    preconsciouses
    subconsciouses
    unconsciouses
    $

Who knew that there were 4 different consciouses? I couldn't find an example that failed to have a larger including word. Anyways, thanks for your comment that got me on the right track and also for the fun of replicating your "words within words" script.

"Perl scripting: great for pandemics...."
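For anyone who wants to try the substring scan without downloading the dictionary, here is a minimal sketch on a tiny hard-coded word list. The list and the name `words_in_words` are made up for illustration; the loop follows the same idea as the script in this post (look up every proper substring of every word in the hash), but it is not a drop-in replacement for it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: returns the sorted list of
# dictionary words that occur as a proper substring of another word.
sub words_in_words {
    my @dict  = @_;
    my %words = map { $_ => 1 } @dict;
    my %found;
    for my $word (@dict) {
        my $length = length $word;
        for my $pos (0 .. $length - 1) {
            for my $len (1 .. $length - $pos) {
                next if $len == $length;    # skip the word itself
                my $subword = substr $word, $pos, $len;
                # ignore plain plural / possessive pairs, as the script does
                next if $word eq $subword . 's' or $word eq $subword . q{'s};
                $found{$subword} = 1 if exists $words{$subword};
            }
        }
    }
    return sort keys %found;
}

# Made-up sample list standing in for enable1.txt
my @sample = qw(aerially subaerially quids squids liquids zyzzyva);
print "$_\n" for words_in_words(@sample);
```

With this sample list the sketch prints `aerially` and `quids`, the two entries that sit inside a longer entry.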
Re: downloading a file on a page with javascript by bliako (Abbot) on Mar 30, 2020 at 22:21 UTC
There are at least two ways to approach this.

The first is to use WWW::Mechanize::Chrome, which is like running a browser without the GUI (headless) from inside your script. With it you will be able to dive into the fetched page's DOM and extract anything you like from it, including those divs that you don't see with view-page-source because they are fetched later via JavaScript/AJAX.

The second is to open the site with your browser and open the developer tools (Firefox, but other browsers have similar functionality). Go to the Network tab, select XHR and reload the page. You will see all the data fetched via AJAX, and you will see where that data comes from: URLs just like the one you tried to download. Copy that URL as cURL (it's on the right-click menu somewhere) and you can see exactly what the URL is and what its parameters are. Now, note the URL, its parameters, whether it is a POST or a GET, and what request headers it has. It's easy to translate those into LWP::UserAgent.

Edit: converting a beast of a curl command line to LWP::UserAgent can be done easily by using Corion's curl2lwp (see http://blogs.perl.org/users/max_maischein/2018/11/curl2lwp---convert-curl-command-line-arguments-to-lwp-mechanize-perl-code.html)
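The hand translation described in the second approach can be sketched roughly as follows. This is only an illustration: the URL and header values are placeholders rather than a real endpoint, and `build_request` is a hypothetical helper, not part of LWP. The request is built and printed but not sent, so the sketch runs offline.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;

# Hypothetical helper: turn the pieces read off a "Copy as cURL" string
# (method, URL, -H headers) into an HTTP::Request object.
sub build_request {
    my ($method, $url, %headers) = @_;
    my $req = HTTP::Request->new($method => $url);
    $req->header($_ => $headers{$_}) for sort keys %headers;
    return $req;
}

# Placeholder values, assumed for illustration only
my $req = build_request(
    GET => 'https://example.com/data.json?alt=media',
    'Accept'  => 'application/json, text/plain, */*',
    'Referer' => 'https://example.com/page',
);

# Inspect the request before sending it
print $req->as_string;

# Sending it would be one more step with LWP::UserAgent:
# use LWP::UserAgent;
# my $res = LWP::UserAgent->new->request($req);
# print $res->is_success ? $res->decoded_content : $res->status_line, "\n";
```

For a long curl command line, curl2lwp does this conversion mechanically and is probably the safer route.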
Re^2: downloading a file on a page with javascript by Aldebaran (Curate) on Apr 06, 2020 at 22:33 UTC
> there are at least two ways to approach this.

I was particularly pleased to see this response from bliako, whose pm posts are at a level where I can, about half the time, stretch my game to replicate, understand, and incorporate into "my game," whatever that is. I was thinking there should be several ways that Perl could do this, either natively, or by wrapping C, or with modules. Getting the URL right needs to be a part of any solution.

> The first is to use WWW::Mechanize::Chrome

I had trouble installing WWW::Mechanize::Chrome, but it was all of the variety where I needed only to make better web searches for prereqs. The first "problem" was getting WWW::Mechanize::Chrome to install on Ubuntu. I lacked 2 things at the beginning: a chrome executable, and headers for png.h. For Ubuntu, a good command line install for chrome is here. Since being able to save a screenshot as a png is necessary, I also needed: `sudo apt-get install libpng-dev`

This is as far as I got along this prong. Output, then source:

    $ ./1.mai.pl
    enable1.txt
    Yay

    #!/usr/bin/perl
    use strict;
    use Log::Log4perl qw(:easy);
    use WWW::Mechanize::Chrome;
    use Data::Dump;
    use 5.016;

    my $mech = WWW::Mechanize::Chrome->new();
    my $url  = 'https://code.google.com/archive/p/dotnetperls-controls/downloads';
    $mech->get($url);
    print $_->text . "\n" for $mech->find_all_links( text_regex => qr/enable/i );
    $mech->follow_link( xpath => '//a[text() = "enable1.txt"]' );

    my @words;
    # check the outcome
    if ($mech->success) {
        #print $res->decoded_content;
        #@words = mech->decoded_content;
        print "Yay\n";
    }
    else {
        print "Error: " . $mech->status . "\n";
    }
    if (@words) {
        print "@words\n";
    }
    sleep 1;

Aspects of downloads are yet to be implemented according to the 35:06 mark here: corion's presentation from 2017

Q1) How do I bridge the gap from `$mech->follow_link` to populating @words?
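One part of Q1 can be shown without a browser: once the text of the file is available as a single string, by whatever route it was fetched, filling `@words` is a matter of splitting on line endings. The sketch below assumes exactly that; `body_to_words` is a hypothetical helper and the sample string is made up. It does not address how WWW::Mechanize::Chrome hands over a downloaded file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.016;

# Hypothetical helper: split a fetched body into one word per element,
# tolerating both LF and CRLF line endings.
sub body_to_words {
    my ($body) = @_;
    my @words;
    for my $line (split /\R/, $body) {
        $line =~ s/^\s+|\s+$//g;    # trim stray whitespace
        push @words, $line if length $line;
    }
    return @words;
}

# Made-up sample standing in for the downloaded text
my $body  = "zymurgy\r\nzyzzyva\nzyzzyvas\n";
my @words = body_to_words($body);
say scalar(@words), " words, last is $words[-1]";
```

Here the sample yields three words, ending in `zyzzyvas`.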
> The second is to open the site with your browser and open the developer tools (Firefox, but other browsers have similar functionality). Go to the Network tab, select XHR and reload the page. You will see all the data fetched via AJAX, and you will see where that data comes from: URLs just like the one you tried to download. Copy that URL as cURL (it's on the right-click menu somewhere) and you can see exactly what the URL is and what its parameters are. Now, note the URL, its parameters, whether it is a POST or a GET, and what request headers it has. It's easy to translate those into LWP::UserAgent.

I did something close to this dozens of different ways. What ended up working for me was left-clicking on the link while the developer tools, including the network tab, are open, and then finding the copy-as-cURL entry on the right-click menu as one hovers over the request in the tools. This yields:

    curl 'https://www.googleapis.com/storage/v1/b/google-code-archive/o/v2%2Fcode.google.com%2Fdotnetperls-controls%2Fproject.json?alt=media&stripTrailingSlashes=false' \
      -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0' \
      -H 'Accept: application/json, text/plain, */*' \
      -H 'Accept-Language: en-US,en;q=0.5' --compressed \
      -H 'Origin: https://code.google.com' \
      -H 'Connection: keep-alive' \
      -H 'Referer: https://code.google.com/archive/p/dotnetperls-controls/downloads' \
      -H 'Cache-Control: max-age=0' \
      -H 'TE: Trailers'

Then I turned to Corion's curl2lwp converter. I'm super pleased by this:

    $ ./2.curl.pl | tail -5
    zymotic
    zymurgies
    zymurgy
    zyzzyva
    zyzzyvas
    $ cat 2.curl.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new( 'send_te' => '0' );
    my $r  = HTTP::Request->new(
        'GET' => 'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/dotnetperls-controls/enable1.txt',
        [
            'Connection' => 'keep-alive',
            'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding' => 'gzip, x-gzip, deflate, x-bzip2, bzip2',
            'Accept-Language' => 'en-US,en;q=0.5',
            'Host' => 'storage.googleapis.com:443',
            'Referer' => 'https://code.google.com/archive/p/dotnetperls-controls/downloads',
            'User-Agent' => 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0',
            'Upgrade-Insecure-Requests' => '1',
        ],
    );
    my $res = $ua->request( $r, );

    ### begin Aldebaran-added source
    my @words;
    # check the outcome
    if ($res->is_success) {
        #print $res->decoded_content;
        @words = $res->decoded_content;
    }
    else {
        print "Error: " . $res->status_line . "\n";
    }
    if (@words) {
        print "@words\n";
    }
    __END__
    $

This represents a huge learning curve partially ascended for me, including considering the bigger picture with an introduction to the DOM. I have one more question at this point, regarding the practice scripts at examples, all of which use Log::Log4perl. If I have:

    $ cat /etc/2.log.conf
    ###############################################################################
    #                              Log::Log4perl Conf                             #
    ###############################################################################
    log4perl.rootLogger              = DEBUG, LOG1, SCREEN
    log4perl.appender.SCREEN         = Log::Log4perl::Appender::Screen
    log4perl.appender.SCREEN.stderr  = 0
    log4perl.appender.SCREEN.layout  = Log::Log4perl::Layout::PatternLayout
    log4perl.appender.SCREEN.layout.ConversionPattern = %m %n
    log4perl.appender.LOG1           = Log::Log4perl::Appender::File
    log4perl.appender.LOG1.filename  = /home/hogan/Documents/hogan/logs/2.log4perl.txt
    log4perl.appender.LOG1.mode      = append
    log4perl.appender.LOG1.layout    = Log::Log4perl::Layout::PatternLayout
    log4perl.appender.LOG1.layout.ConversionPattern = %d %p %m %n
    $

and this successfully logs events and errors:

    #!/usr/bin/perl
    use Log::Log4perl;

    # Initialize Logger
    my $log_conf = "/etc/2.log.conf";
    Log::Log4perl::init($log_conf);
    my $logger = Log::Log4perl->get_logger();

    $logger->info("===== before system call");
    system('ls -l qwerty');
    if( $? > 0 ) {
        $logger->error("there was an error: $?");
    }
    $logger->info("===== after system call");

Q2) How do I log using this scheme? For example, do I go from `else { print "Error: " . $mech->status . "\n"; }` to: `else { $logger->error("there was an error: $mech->status" . "\n") ; }`

Again, thanks all for comments, which seem to be the "service work" that most of us can do in these unusual times of "social distancing." Stay healthy!

2020-04-07 Athanasius fixed formatting of over-long code line.
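On Q2, one detail worth checking before wiring in the logger: Perl does not call methods inside double-quoted strings, so `"$mech->status"` interpolates the object reference and leaves `->status` as literal text. The sketch below shows the difference with a stand-in class (`Fake::Mech` is invented here so the example runs without a browser; only the `status` method name comes from the post).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.016;

# Stand-in for $mech, assumed for illustration only
package Fake::Mech {
    sub new    { return bless { status => 404 }, shift }
    sub status { return $_[0]{status} }
}

my $mech = Fake::Mech->new;

# Not what is wanted: $mech is stringified, "->status" stays as text
my $wrong = "there was an error: $mech->status";

# Concatenation calls the method and appends its return value
my $right = "there was an error: " . $mech->status;

say $wrong;    # something like: there was an error: Fake::Mech=HASH(0x...)->status
say $right;    # there was an error: 404

# With the logger from the post, the call would then read:
# $logger->error("there was an error: " . $mech->status);
```

The trailing `"\n"` is probably unnecessary with the configuration shown, since both conversion patterns already end in `%n`, which is the newline placeholder in Log::Log4perl's PatternLayout.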