using HTML::TreeBuilder effectively

Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I'm trying to replicate the example that CPAN gives for HTML::TreeBuilder, and I'm falling short. I took a look at the Yahoo site, which I still use to get news and where I have a useful internet identity, so I'd never want Yahoo to disappear.

Q1) First, I'd like a diagnosis of the errors posted below, after the source, which comes from https://metacpan.org/pod/HTML::Tree::Scanning

use strict;
use HTML::TreeBuilder 2.97;
use LWP::UserAgent;
sub get_headlines {
  my $url = $_[0] || die "What URL?";
   
  my $response = LWP::UserAgent->new->request(
    HTTP::Request->new( GET => $url )
  );
  unless($response->is_success) {
    warn "Couldn't get $url: ", $response->status_line, "\n";
    return;
  }
   
  my $tree = HTML::TreeBuilder->new();
  $tree->parse($response->content);
  $tree->eof;
   
  my @out;
  foreach my $link (
    $tree->look_down(   # !
      '_tag', 'a',
      sub {
        return unless $_[0]->attr('href');
        my @c = $_[0]->content_list;
        @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
      }
    )
  ) {
    push @out, [ $link->attr('href'), $link->as_text ];
  }
   
  warn "Odd, fewer than 6 stories in $url!" if @out < 6;
  $tree->delete;
  return @out;
}

foreach my $section (qw[tc sc hl wl en]) {
  my @links = get_headlines(
    "http://dailynews.yahoo.com/h/$section/"
  );
  print
    $section, ": ", scalar(@links), " stories\n",
    map(("  ", $_->[0], " : ", $_->[1], "\n"), @links),
    "\n";
}

Judging by the terminal output, the script is requesting URLs that no longer exist:

C:\cygwin64\home\Fred\pages2\hunt>perl lib2.pl
Couldn't get http://dailynews.yahoo.com/h/tc/: 500 Can't connect to dailynews.yahoo.com:80 (Bad hostname)
tc: 0 stories

Couldn't get http://dailynews.yahoo.com/h/sc/: 500 Can't connect to dailynews.yahoo.com:80 (Bad hostname)
sc: 0 stories

Couldn't get http://dailynews.yahoo.com/h/hl/: 500 Can't connect to dailynews.yahoo.com:80 (Bad hostname)
hl: 0 stories

Couldn't get http://dailynews.yahoo.com/h/wl/: 500 Can't connect to dailynews.yahoo.com:80 (Bad hostname)
wl: 0 stories

Couldn't get http://dailynews.yahoo.com/h/en/: 500 Can't connect to dailynews.yahoo.com:80 (Bad hostname)
en: 0 stories

Q3) My next question goes to syntax. What is this creature: $_->[0]

Q4) What is a clean, contemporary update for this example?

Thank you for your comment,



Re: using HTML::TreeBuilder effectively
by Athanasius (Archbishop) on Sep 16, 2015 at 07:11 UTC

Hello Datz_cozee75,

(1) When I enter http://dailynews.yahoo.com/h/tc/ — or even just http://dailynews.yahoo.com — into Google Chrome, I get:

This web page is not available
ERR_NAME_NOT_RESOLVED
...
The server at dailynews.yahoo.com can't be found because the DNS look-up failed....

Looks as though the web address has changed to http://news.yahoo.com/?

(2) Is there a question 2?

(3) Within the call to map, each element of the array @links is in turn aliased to $_ (see map). @links was previously initialised via a call to get_headlines(), which returns the elements in @out. The latter is populated by this line:

push @out, [ $link->attr('href'), $link->as_text ];

which creates an anonymous array containing two elements, and pushes a reference to it onto the array @out (see perlreftut). So, within the map, $_->[0] is the first element of the anonymous array currently referenced by $_, and $_->[1] is the second element.
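The explanation above can be tried out in isolation. This is a minimal sketch using core Perl only; the @links data is made up to stand in for whatever get_headlines() would return, and nothing is fetched from the network.

```perl
use strict;
use warnings;

# Hypothetical stand-in for what get_headlines() returns:
# a list of references to two-element anonymous arrays.
my @links = (
    [ '/story/1', 'First headline'  ],
    [ '/story/2', 'Second headline' ],
);

# Inside map, $_ is aliased to each array reference in turn,
# so $_->[0] is the href and $_->[1] is the link text.
my @lines = map { "  $_->[0] : $_->[1]\n" } @links;
print @lines;

# The arrow dereferences: $ref->[0] is shorthand for ${$ref}[0].
print "same element\n" if $links[0]->[0] eq ${ $links[0] }[0];
```

The arrow form and the ${...}[0] form reach the same element; the arrow is simply easier to read once references are nested.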

(4) Sorry, I don’t know how to get just the headlines from http://news.yahoo.com/.

Anyway, hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re: using HTML::TreeBuilder effectively
by Anonymous Monk on Sep 16, 2015 at 07:24 UTC

your title is misleading


Re^2: using HTML::TreeBuilder effectively

by skaryzgik (Novice) on Sep 16, 2015 at 20:57 UTC

HTML::TreeBuilder

If the current title is misleading, is there another that might be better?


Re^3: using HTML::TreeBuilder effectively

by Anonymous Monk on Sep 16, 2015 at 23:10 UTC

It seems to me that someone who doesn't understand the error message could easily not realize the problem isn't with the usage of HTML::TreeBuilder.

"If the current title is misleading, is there another that might be better?"

Maybe "How I spent my summer vacation?"

Maybe the error message, i.e. "Couldn't get http://dailynews.yahoo.com/h/tc/: 500 Can't connect to dailynews.yahoo.com:80 (Bad hostname)"?

Sure, it's possible the OP doesn't understand the message ... but the OP seems to have done fine for titles in "getting content of an https website" and "Using example script correctly for opening cpan module", but not in "creating a useful browser from automation".


Re^2: using HTML::TreeBuilder effectively

by Aldebaran (Curate) on Sep 17, 2015 at 07:21 UTC

I have to admit that I'm curious how you think I should have typed the subject for this thread. It also seems to be the case that the script in the original post has outlived its assumptions for how it gives useful output. It has interesting syntax, and I'd like to be able to say that I had that part mastered by now, but I do not.

My Q2 may have been pinned on that script, but I'd prefer not to speak about it again until we obtain output as described in the subject of the original post.

How does one make yahoo able to find its own news?

C:\cygwin64\home\Fred\pages2\hunt>perl lib6.pl
GET https://search.yahoo.com/search [s]
  p=                             (text)
  <NONAME>=Search                (submit)
  fr=sfp                         (hidden readonly)
  fr2=                           (hidden readonly)
  iscqry=                        (hidden readonly)

search string is Yahoo News

C:\cygwin64\home\Fred\pages2\hunt>type lib6.pl
#! /usr/bin/perl
use warnings;
use strict;
use 5.01;

# create a new browser
use WWW::Mechanize;
my $browser = WWW::Mechanize->new();

# tell it to get the main page
$browser->get('https://search.yahoo.com/');

# make sure the request succeeded
if ( $browser->success ) {
  $browser->dump_forms;

  my $brand = 'Yahoo';
  my $collection = 'News';
  my $search_string = "$brand $collection";
  say "search string is $search_string";

  my $url = $browser->uri;
  system( 'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe', $url );

}
else {
  $browser->back;
  say "tja";

  my $url = $browser->uri;
  system( 'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe', $url );

}

Where I'd rather focus is on a generic way to specify searches, which are the trade of the website we're talking to. They announce themselves as 'Search' in this example, but I would not want to be wed to the notion that it had to be upper case, for example. Here is where I think the output is useful:

<NONAME>=Search (submit)

So I would like to populate the search string, submit, and then follow the first link suggested.
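One way this could be approached is sketched below, with several assumptions. The field name p comes from the dump_forms output above. The result-link pattern qr/news\.yahoo\.com/ is only a guess at what a useful result URL looks like, and Yahoo may wrap result links in redirect URLs that need a different pattern. The --live switch and both sub names are made up for this sketch. It relies on submit_form with with_fields, which picks the first form containing the named field, so the label on the submit button ('Search', upper or lower case) never has to be matched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Join the query parts, as lib6.pl does with $brand and $collection.
sub build_search_string {
    return join ' ', @_;
}

# Fill the text field "p" in whichever form contains it, submit,
# then follow the first link whose URL matches $result_re.
# Returns the final URI, or nothing if no link matched.
sub search_and_follow {
    my ( $mech, $query, $result_re ) = @_;
    $mech->get('https://search.yahoo.com/');
    $mech->submit_form( with_fields => { p => $query } );
    my $link = $mech->find_link( url_regex => $result_re ) or return;
    $mech->get( $link->url_abs );
    return $mech->uri;
}

# Hypothetical --live switch so the network is only touched on request.
if ( @ARGV and $ARGV[0] eq '--live' ) {
    require WWW::Mechanize;
    my $mech  = WWW::Mechanize->new;    # autocheck is on by default: failed requests die
    my $query = build_search_string( 'Yahoo', 'News' );
    my $where = search_and_follow( $mech, $query, qr/news\.yahoo\.com/ );
    print defined $where ? "Landed on $where\n" : "No matching link found\n";
}
```

Run with --live, this should print the URI it landed on, or say that no link matched the pattern; whether the first matching link is really the wanted one depends on the markup of the result page at the time.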

Thank you for your comment,


Re^3: using HTML::TreeBuilder effectively

by poj (Abbot) on Sep 17, 2015 at 07:33 UTC

Re: your OP, this works in the UK; you may have to amend it for your location.

#!perl
use strict;
use HTML::TreeBuilder 2.97;
use LWP::UserAgent;

sub get_headlines {
  my $url = $_[0] || die "What URL?";
   
  my $response = LWP::UserAgent->new->request(
    HTTP::Request->new( GET => $url )
  );
  unless($response->is_success) {
    warn "Couldn't get $url: ", $response->status_line, "\n";
    return;
  }
   
  my $tree = HTML::TreeBuilder->new();
  $tree->parse($response->content);
  $tree->eof;
   
  my @out;
  foreach my $link (
    $tree->look_down(   # !
      '_tag', 'a',
      sub {
        # guard against anchors that have no class attribute
        my $class = $_[0]->attr('class');
        return 1 if defined $class and $class =~ /title/;
#        my @c = $_[0]->content_list;
#        @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
      }
    )
  ) {
    push @out, [ $link->attr('href'), $link->as_text, ];
  }
   
  warn "Odd, fewer than 6 stories in $url!" if @out < 6;
  $tree->delete;
  return @out;
}

#science health world entertainment
open OUT,'>:utf8','yahoo.txt' or die "$!";
foreach my $section (qw[tech science health world entertainment]) {
  my @links = get_headlines(
    "https://uk.news.yahoo.com/$section/"
  );
  print OUT
    $section, ": ", scalar(@links), " stories\n",
    map(("  ", $_->[1], "\n"), @links),"\n";
}
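The class-based selection in the script above can be tested without a network connection. The following is a minimal sketch, not a definitive update: the markup is invented for illustration, headlines_from_html is a hypothetical helper name, and the class pattern "title" is taken from the script above rather than from any current Yahoo page. It passes the regex directly to look_down as an attribute criterion, so no callback is needed and anchors without a class attribute simply fail to match.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return [href, text] pairs for every <a> whose class contains "title".
sub headlines_from_html {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @out  = map { [ $_->attr('href'), $_->as_text ] }
        $tree->look_down( _tag => 'a', class => qr/title/ );
    $tree->delete;    # release the tree once the data has been copied out
    return @out;
}

# Made-up markup standing in for a fetched page.
my $html = <<'HTML';
<html><body>
  <a class="nav" href="/home">Home</a>
  <a class="story-title" href="/a"><b>First story</b></a>
  <a href="/plain">No class here</a>
  <a class="title" href="/b">Second story</a>
</body></html>
HTML

my @stories = headlines_from_html($html);
print "$_->[0] : $_->[1]\n" for @stories;
```

Keeping the parsing in its own sub, separate from the fetch, means the selector can be adjusted against a saved copy of the page whenever the site changes its markup.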


Re^3: using HTML::TreeBuilder effectively

by Anonymous Monk on Sep 17, 2015 at 07:54 UTC

I have to admit that I'm curious how you think I should have typed the subject for this thread... tl;dr

Yeah, that part is covered in Re^3: using HTML::TreeBuilder effectively

FWIW, I would have tried to engage with this reply
