Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

I'm having a Facebook discussion with a friend from high school who claimed that the word 'enemy' appeared on a webpage, and I wanted to prove that he was making it up, so I thought to do so with Perl. I seem to be unable to get the content of the page and thought I'd ask for a work-around here:

use strict;
use warnings;
use feature 'say';
use LWP::Simple;

my $url     = 'https://berniesanders.com/issues/racial-justice/';
my $content = get $url;
die "Couldn't get $url" unless defined $content;

if ( $content =~ m/enemy/i ) {
    say "enemy found";
}
else {
    say $content;
}

Output:

Couldn't get https://berniesanders.com/issues/racial-justice/  at rm1.pl line 8.

I suspect that my problem is that https is different from http, but I see no work-around on CPAN for LWP::Simple. Am I using the right tool? If so, how do I use it correctly? Thanks for your comment.

Replies are listed 'Best First'.
Re: getting content of an https website
by tangent (Parson) on Aug 31, 2015 at 23:06 UTC
    It works for me if I use LWP::UserAgent - note that you have to set the user-agent string for this site; if the default is used you get a 'banned' message.
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new();
    $ua->agent('Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0');

    my $response = $ua->get('https://berniesanders.com/issues/racial-justice/');
    my $content  = $response->content;

      Thanks, tangent, that's got it. With a little help from HTML::Tree, this suffices:

      use strict;
      use warnings;
      use feature 'say';
      use LWP::UserAgent;
      use HTML::Tree;

      my $url = 'https://berniesanders.com/issues/racial-justice/';
      my $ua  = LWP::UserAgent->new();
      $ua->agent('Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0');

      my $response = $ua->get($url);
      my $content  = $response->content;

      if ( $content =~ m/enemy/i ) {
          say "enemy found";
      }
      else {
          my $tree = HTML::Tree->new();
          $tree->parse($content);
          print $tree->as_text;
      }

      I've seen code like this before, and I thought I actually needed to have the browser in question, but apparently not. Am I correct to think that the string needs to have nothing to do with the actual machine it runs on? Does the string you used make a good overall choice for such queries?

      I'd like to consider a related question, given that we're barely warmed up here. I've always wanted the functionality of having mechanized events happen and then having an actual browser opened. I don't know if one browser works better than another for this, but I use Chrome for most of my day-in and day-out surfing, viewing or whatever. Clearly, I would have to define a path to the executable, which I believe is here:

      Directory of C:\Program Files (x86)\Google\Chrome\Application

      08/22/2015  03:42 AM    <DIR>          .
      08/22/2015  03:42 AM    <DIR>          ..
      08/14/2015  12:43 PM    <DIR>          44.0.2403.155
      08/22/2015  03:42 AM    <DIR>          44.0.2403.157
      08/17/2015  10:23 PM           813,896 chrome.exe
      06/03/2013  04:26 PM            18,546 master_preferences
      06/19/2014  02:37 AM    <DIR>          Plugins
      08/22/2015  03:42 AM               399 VisualElementsManifest.xml

      How might I open the url from the original post in this browser?

        system($url); will usually do it. Depending on how paranoid you are, you might want to ensure that only properly encoded strings are executed.
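        One cautious way to do that is to call the browser explicitly with the list form of system, which skips the shell altogether. A minimal sketch, assuming the chrome.exe path from the directory listing above (adjust it for your machine):

```perl
use strict;
use warnings;

# Path taken from the directory listing earlier in the thread;
# adjust for the machine the script actually runs on.
my $chrome = 'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe';
my $url    = 'https://berniesanders.com/issues/racial-justice/';

# The list form of system() bypasses the shell entirely, so shell
# metacharacters in the URL cannot be interpreted as shell syntax.
system( $chrome, $url ) == 0
    or die "Could not launch Chrome: $?";
```

        The single-argument form, system($url), hands the string to the shell, which is what makes encoding matter in the first place; the list form sidesteps that.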
Re: getting content of an https website
by Anonymous Monk on Sep 01, 2015 at 17:25 UTC
    Why not "wget" the file and "grep" it?
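    Something along these lines, using the URL from the original post. The --user-agent override is an assumption, added because the thread above reports that this site bans the default agent:

```shell
# Fetch the page quietly to stdout and count lines matching 'enemy',
# case-insensitively. The user-agent string is a placeholder.
wget -qO- --user-agent='Mozilla/5.0' \
    'https://berniesanders.com/issues/racial-justice/' | grep -ci 'enemy'
```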
Re: getting content of an https website
by Anonymous Monk on Sep 01, 2015 at 17:26 UTC
    Go to the site, view the page source, and Ctrl-F to find the word. Also look at the DOM structure.