Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

I'm having a Facebook discussion with a friend from high school who claimed that the word 'enemy' appeared on a webpage, and I wanted to prove that he was making it up, so I thought to do so with Perl. I seem to be unable to get the content of the page and thought I'd ask for a work-around here:

use strict;
use warnings;
use feature 'say';
use LWP::Simple;

my $url     = 'https://berniesanders.com/issues/racial-justice/';
my $content = get $url;
die "Couldn't get $url" unless defined $content;

if ( $content =~ m/enemy/i ) {
    say "enemy found";
}
else {
    say $content;
}

Output:

Couldn't get https://berniesanders.com/issues/racial-justice/  at rm1.pl line 8.

I suspect that my problem is that https is different from http, but I see no work-around on CPAN for LWP::Simple. Am I using the right tool? If so, how do I use it correctly? Thanks for your comment.

Replies are listed 'Best First'.
Re: getting content of an https website
by tangent (Parson) on Aug 31, 2015 at 23:06 UTC
    It works for me if I use LWP::UserAgent - note that you have to set the user-agent string for this site; if the default is used you get a 'banned' message.
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new();
    $ua->agent('Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0');

    my $response = $ua->get('https://berniesanders.com/issues/racial-justice/');
    my $content  = $response->content;

      Thanks, tangent, that's got it. With a little help from HTML::Tree, this suffices:

      use strict;
      use warnings;
      use feature 'say';
      use LWP::UserAgent;
      use HTML::Tree;

      my $url = 'https://berniesanders.com/issues/racial-justice/';
      my $ua  = LWP::UserAgent->new();
      $ua->agent('Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0');

      my $response = $ua->get($url);
      my $content  = $response->content;

      if ( $content =~ m/enemy/i ) {
          say "enemy found";
      }
      else {
          my $tree = HTML::Tree->new();
          $tree->parse($content);
          print $tree->as_text;
      }

      I've seen code like this before, and I thought I actually needed to have the browser in question, but apparently not. Am I correct to think that the string needs to have nothing to do with the actual machine it runs on? Does the string you used make a good overall choice for such queries?

      I'd like to consider a related question, given that we're barely warmed up here. I've always wanted the functionality of having mechanized events happen and then having an actual browser opened. I don't know if one browser works better than another for this, but I use Chrome for most of my day-in and day-out surfing, viewing or whatever. Clearly, I would have to define a path to the executable, which I believe is here:

      Directory of C:\Program Files (x86)\Google\Chrome\Application

      08/22/2015  03:42 AM    <DIR>          .
      08/22/2015  03:42 AM    <DIR>          ..
      08/14/2015  12:43 PM    <DIR>          44.0.2403.155
      08/22/2015  03:42 AM    <DIR>          44.0.2403.157
      08/17/2015  10:23 PM           813,896 chrome.exe
      06/03/2013  04:26 PM            18,546 master_preferences
      06/19/2014  02:37 AM    <DIR>          Plugins
      08/22/2015  03:42 AM               399 VisualElementsManifest.xml

      How might I open the url from the original post in this browser?

        system($url); will usually do it. Depending on how paranoid you are, you might want to ensure that only properly encoded strings are executed.
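        One cautious way to do that is to call the browser explicitly with the list form of system, which skips the shell altogether. A minimal sketch, assuming the chrome.exe path from the directory listing above (adjust it for your machine):

```perl
use strict;
use warnings;

# Path taken from the directory listing earlier in the thread;
# adjust for the machine the script actually runs on.
my $chrome = 'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe';
my $url    = 'https://berniesanders.com/issues/racial-justice/';

# The list form of system() bypasses the shell entirely, so shell
# metacharacters in the URL cannot be interpreted as shell syntax.
system( $chrome, $url ) == 0
    or die "Could not launch Chrome: $?";
```

        The single-argument form, system($url), hands the string to the shell, which is what makes encoding matter in the first place; the list form sidesteps that.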
Re: getting content of an https website
by Anonymous Monk on Sep 01, 2015 at 17:25 UTC
    Why not "wget" the file and "grep" it?
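    Something along these lines, using the URL from the original post. The --user-agent override is an assumption, added because the thread above reports that this site bans the default agent:

```shell
# Fetch the page quietly to stdout and count lines matching 'enemy',
# case-insensitively. The user-agent string is a placeholder.
wget -qO- --user-agent='Mozilla/5.0' \
    'https://berniesanders.com/issues/racial-justice/' | grep -ci 'enemy'
```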
Re: getting content of an https website
by Anonymous Monk on Sep 01, 2015 at 17:26 UTC
    Go to the site, view the page source, and Ctrl-F to find the word. Also look at the DOM structure.