tbone654 has asked for the wisdom of the Perl Monks concerning the following question:

I have some code on a Yahoo Small Business server that scrapes an http:// webpage.
The key code is here:

use lib "../lib4"; use URI; use Web::Scraper;
In ../lib4 I upload the perl modules (.pm) from my laptop that are required.

Keeping in mind that:
1- All code and Perl Modules must reside on a Yahoo server
2- No additional page garbage (id, password, cookies, etc.) is required.

Does anyone have a simple way to scrape an https:// url?
Edit - By simple I mean something NOT dependent on linked libraries or multiple Perl modules. I don't need to break through any SSL security; I just need the raw HTML from an https page.
Note: WWW::Mechanize does not work, as I can't get all the Perl Module files uploaded to the Yahoo Server.

I have code that works locally.
use strict;
use URI;
use Web::Scraper;
use WWW::Mechanize;

my $url = 'https://finance.yahoo.com/quote/SPY/history?p=SPY';
my $m = WWW::Mechanize->new();
$m->get($url);

# Collect the text of every table cell into the t1 array
my $testdata = scraper { process "tr > td", 't1[]' => 'TEXT'; };
my $res = $testdata->scrape(URI->new($url));

print "\nURL is: " . $m->uri() . "\n";
for (my $a = 0; $a < 6; $a++) { print $res->{t1}[$a] . "\t"; }

But I can't seem to get all the modules I need uploaded to make it run on Yahoo... Edit -> It blows up on LWP, which requires XSLoader or DynaLoader (see lower for the exact software error), which means it's linking to a library it cannot find.
I've been working on this for about 3 weeks and need to reach out for guidance please.
So I guess I am looking for a clever way of scraping https without using Mechanize

Replies are listed 'Best First'.
Re: Get https simple solution desired
by stevieb (Canon) on May 31, 2018 at 21:02 UTC
    "Note: WWW::Mechanize does not work, as I can't get all the Perl Module files uploaded to the Yahoo Server."

    Providing a full description of what that means would be prudent here. What fails? How does it fail? How are you attempting to upload/install the "problematic" distributions? What errors occur?

    Edit your question and add the output/error messages from the locally-working script when run on your Yahoo! server.

    Can you upload LWP::Simple? I don't have time to test, but I've scraped using that before. The test was HTTP not HTTPS, but perhaps that may work. Take a look at this post, particularly the perlmonks() subroutine.

    Unfortunately, I have not the time right now to help with actual testing, nor is any of my equipment currently configured to simply re-run that test. May provide enough of a baseline though for you to adapt to.

Re: Get https simple solution desired
by Corion (Patriarch) on Jun 01, 2018 at 09:43 UTC

    WWW::Mechanize uses LWP::UserAgent. LWP::UserAgent needs LWP::Protocol::https to access https:// websites. That module isn't easily installed as it needs SSL libraries installed on the target system.

    Your best approach would be to contact the webserver administrator to get them to install LWP::Protocol::https on the machine.

    Alternatively, maybe some command line tools are available that already have https capabilities, like wget or curl. Then you can use these to fetch the data.
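    As a rough sketch of the curl route, assuming a curl binary with https support is on the server's PATH and that the host allows CGI scripts to spawn processes (neither is confirmed in this thread), something like the following uses only core Perl. The names curl_command and fetch_https are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for curl. Keeping it as a list (not one string)
# means the URL is never interpreted by a shell.
sub curl_command {
    my ($url) = @_;
    return ('curl', '--silent', '--show-error', '--location',
            '--max-time', '30', $url);
}

# Run curl and return the response body; die if curl exits non-zero.
sub fetch_https {
    my ($url) = @_;
    open(my $fh, '-|', curl_command($url))
        or die "Cannot start curl: $!";
    local $/;                 # slurp mode: read everything at once
    my $html = <$fh>;
    close($fh)
        or die "curl failed with exit status " . ($? >> 8);
    return $html;
}

# Only fetch when run as a script with a URL argument.
if (!caller && @ARGV) {
    print fetch_https($ARGV[0]);
}
```

    The returned HTML string could then be handed to Web::Scraper's scrape method instead of a URI object, so no https-capable Perl module is needed.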

      "Alternatively, maybe some command line tools are available that already have https capabilities, like wget or curl. Then you can use these to fetch the data."

      Or better yet, use modules such as Net::Curl::Easy which leverage their libraries and avoid shelling out.

        ... if that library is installed, and a C compiler is installed. Both of which are close prerequisites to getting SSL working.

        A statically linked wget or curl binary is more likely to be available in cases where SSL libraries are unavailable.

      That is exactly the problem with using WWW::Mechanize

      Software error: GET https://finance.yahoo.com/quote/SPY/history?p=SPY failed: 501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed) at /test/aaa3.pl line 42
      I know I have LWP::Protocol::https.pm installed in the correct spot, and I know the problem is the same with any module that has XSLoader or DynaLoader as a dependency to load libraries.

      Yahoo does have these modules (32 with LWP in them), and maybe the libraries too, but I think they must be a little stale and I don't know how to make the link to them.

      LWP::Protocol - Base class for LWP protocols
      LWP::Simple - simple procedural interface to LWP
      LWP::UserAgent - Web user agent class
      DBI 5.8.7::LWP - The World-Wide Web library for Perl
      5.8.7::Bundle::LWP - install all libwww-perl related modules
      5.8.7::LWP::Protocol - Base class for LWP protocols
      5.8.7::LWP::Simple - simple procedural interface to LWP
      5.8.7::LWP::UserAgent - Web user agent class


      Do I just call them by use 5.8.7::LWP::Protocol; and hope for the best?
      My experience is that Yahoo is not interested in adding libraries, and I have had no luck figuring out how to upload a library and then get LWP to recognize its path.

      Net::SSLeay object version 1.25 does not match bootstrap parameter 1.85 at ../lib4/Net/SSLeay.pm line 444.
      Compilation failed in require at ../lib4/IO/Socket/SSL.pm line 19.

      What I believe it is telling me relates to these lines of IO/Socket/SSL.pm:
      18 use IO::Socket;
      19 use Net::SSLeay 1.46;
      20 use IO::Socket::SSL::PublicSuffix;
      21 use Exporter ();

      The object file (ver 1.25) is the library on the host... IO/Socket/SSL.pm wants to know (ver 1.46)... and Net/SSLeay.pm wants (ver 1.85)
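      One way to check that reading, sketched here with core modules only, is to list every Net/SSLeay.pm and every compiled auto/Net/SSLeay object that Perl can see in @INC. If the .pm comes from ../lib4 while the only compiled object sits in a system directory, that would match the 1.85 versus 1.25 mismatch. The helper name find_module_files is invented for this example, and the auto/ layout is the conventional XS install location, assumed rather than verified on the Yahoo host.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Config;
use File::Spec;

# For a module such as Net::SSLeay, list every .pm file and every compiled
# shared object found under the given directories.
sub find_module_files {
    my ($module, @dirs) = @_;
    my @parts = split /::/, $module;
    my $pm = File::Spec->catfile(@parts) . '.pm';
    # XS modules conventionally install their object as auto/A/B/B.<dlext>
    my $so = File::Spec->catfile('auto', @parts, "$parts[-1].$Config{dlext}");
    my @found;
    for my $dir (@dirs) {
        next if ref $dir;    # skip code references that may live in @INC
        for my $rel ($pm, $so) {
            my $path = File::Spec->catfile($dir, $rel);
            push @found, $path if -e $path;
        }
    }
    return @found;
}

# When run directly (e.g. as a CGI script), report what is visible.
if (!caller) {
    print "Content-type: text/plain\n\n";
    print "$_\n" for find_module_files('Net::SSLeay', @INC);
}
```

      Adding use lib "../lib4"; near the top before running it would show the same search order the failing script sees.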

      Thank you very much for your help

Re: Get https simple solution desired
by bliako (Abbot) on Jun 01, 2018 at 09:38 UTC

    Hi, the code you provided works for me as is. I can also confirm that I have used LWP::UserAgent over https successfully. Many scripts are provided for doing that (e.g. see LWP not working with HTTPS protocol (SOLVED)); I can include some once you confirm that the module actually works for you on the server.

    I had problems with installing modules in web-hosting servers where you do not have shell access. As sundialsvc4 said you can upload those modules from your local computer install locations to any dir you own at the remote server and then include the path of said dir in your scripts like use lib 'PATH_TO_DIR'. I can confirm that this worked for me in similar settings and without shell access.

    A key point here is to make sure that the path you uploaded the modules to is the path you pass to use lib in your scripts. When you do not have shell access (ssh or telnet) to the server, and therefore can't just log in and run your script from the shell, relative paths usually do not work, because who knows where your script is executed from; e.g. it may be executed as a cgi-bin by the web server. In this case prepare a test script and have it print out the complete path to the script, e.g. using __FILE__ or Cwd::getcwd(); see How to determine absolute path of current Perl file?
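    A minimal sketch of such a test script, using only core modules, might look like this. The function name path_report is made up for the example, and the plain-text CGI header assumes the script is invoked by the web server.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Cwd qw(getcwd abs_path);
use File::Basename qw(dirname);

# Collect the working directory, the script's own location and the
# module search path into one plain-text report.
sub path_report {
    # Fall back to the raw __FILE__ value if it cannot be resolved
    my $script = abs_path(__FILE__) || __FILE__;
    return join "\n",
        "cwd:    " . getcwd(),
        "script: $script",
        "dir:    " . dirname($script),
        "\@INC:",
        (map { "  $_" } @INC),
        '';
}

# When run directly, emit a CGI header followed by the report.
if (!caller) {
    print "Content-type: text/plain\n\n";
    print path_report();
}
```

    If the cwd line differs from the dir line, a relative use lib path is being resolved against a directory other than the one the script lives in.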

    The next hurdle is that some modules include XS and require compilation, or depend on other system-installed libraries (I mean binary libs, not Perl modules). In that case you can't simply ftp them over unless you are lucky and the architectures match. Note: you may go a bit further if you link statically. A wild guess would be that LWP depends on some SSL library for doing its https business.

    You can sneak under the webmaster's nose and crawl the barbed wire a bit further by preparing a cgi-bin script which installs the modules at the server (a proper install, compiling where necessary, not a simple upload) by spawning a system command for cpanm from the cgi-bin script. I leave this as an exercise to the intrepid reader, but in my experience the web server allocates a tiny allowance of CPU to your scripts, and spawned system commands, if allowed to run at all, gradually become so nice they are irritating and, like Zeno, never arrive.

    In any case your mileage may vary (I always wanted to write this)!

Re: Get https simple solution desired
by locked_user sundialsvc4 (Abbot) on May 31, 2018 at 21:25 UTC

    I am not directly familiar with Yahoo’s server offering, but what exactly is the roadblock that you are now experiencing?   In your original post, you never actually say.

    If you are able to run Perl there but are having trouble “uploading the necessary modules,” perhaps this is simply another manifestation of the usual problems of running Perl in a shared-hosting environment.   (There are plenty of web-pages on that, here and elsewhere.)   While you cannot update the host’s set of installed libraries, you can create a new subdirectory, point the local cpan(m) command to it, install CPAN packages there, and arrange for that library to occur in the PERL5LIB library search list when your application runs.   When you do this, Perl will see your locally-installed packages first.

    Again, what you need to add ... as a reply, please ... is exact details of what you have tried and what is the roadblock as you now see it.   I am quite sure that you will receive an immediate helpful response here once these details have been given.

      For once, I have to commend you.

      This is actually a situation where no code need-be supplied, so we'll get that facet out of the way right off the bat.

      You've given reasonably decent potential direction to OP in a sane and orderly way, and it isn't off-the-charts ridiculous with far too much HTML emphasis.

      A point of advice here though for your own information... if one installs Perlbrew (assuming a Unix system, berrybrew for Windows), most, if not all of the hack-type requirements of setting environment variables and such while installing become irrelevant, obsolete and not required.

      Cheers,

      -stevieb

        "While you cannot update the host’s set of installed libraries, you can create a new subdirectory, point the local cpan(m) command to it, install CPAN packages there, and arrange for that library to occur in the PERL5LIB library search list when your application runs. When you do this, Perl will see your locally-installed packages first."

        What you say makes sense, but I don't have access to the server... It's a Yahoo Small Business server and all I can do is FTP files up...

        What I do to get perl modules uploaded is first run the package on my laptop installation of cygwin with perl as follows:

         cpanm -vf --local-lib /home/dirname/perl5 LWP::Protocol::https

        This forces installation into a directory I can control vs. cpan updating a directory that was installed as a package with the perl installation.
        Then I use FileZilla to upload to a directory I create to hold all the perl modules my code needs to run.

        Then place code in the perl .pl file to
         use lib "../lib4";
        Which puts my path to the front of the path list.
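        A hedged variation on that line: "../lib4" is resolved against whatever the current working directory happens to be when the script starts, which under a web server may not be the script's own directory. Assuming the core FindBin module is available on the host, the path can be anchored to the script's location instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use FindBin qw($Bin);        # $Bin holds the directory containing this script
use lib "$Bin/../lib4";      # resolved at compile time, independent of the cwd

# Show which directory is now searched first
print "First \@INC entry: $INC[0]\n";
```

        This only fixes where Perl looks for .pm files; any compiled objects uploaded alongside them still have to match the server's architecture and library versions.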

        Thank you very much for your help...

        "A point of advice here though for your own information... if one installs Perlbrew (assuming a Unix system, berrybrew for Windows), most, if not all of the hack-type requirements of setting environment variables and such while installing become irrelevant, obsolete and not required."

        Well, that's simply not true and you know it.