cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow seminarians. My sympathies to those of you who, like me, are programming on Christmas Eve.

I am using HTML::TreeBuilder to parse HTML so I can extract information from some web pages, working with the resulting HTML::Element objects. When I do something like my $d = $tree->look_down('class','date'); the object I get back is a hash with a {_content} array holding the values I want.

The docs describe keys beginning with underscore as "internal attributes," which I suppose means I'm not supposed to access them directly, but I can and do. For example: my $date = $d->{_content}->[1];

But I'm wondering if there is a more proper construction for this. For example, in LWP::UserAgent you can get $response->status_line which I think accesses an internal attribute. In the date example my $date = $d->content[1] doesn't work. I've searched around and have not been able to find an answer.

Re: HTML::Element accessing "internal attributes" the proper way
by jcb (Parson) on Dec 25, 2020 at 03:27 UTC

    A quick look at the HTML::Element documentation suggests using the content_list method, as in my $date = ($d->content_list)[1]. If there is no element 1 in the content list (as when the element is empty), this will set $date to undef and might produce a warning.

    Alternatively, your use of content was very close: the content method returns an arrayref or undef, so you want my $date = $d->content->[1]. Note that this will crash if the element has no content, so you might want to use eval: my $date = eval { $d->content->[1] }. If the element has no content, that sets $date to undef (and sets $@ to an error about attempting to dereference something that was not a reference).

    (All code obviously untested.)
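
    As a rough sketch of the content_list suggestion above: the markup below is made up for illustration (the original post does not show the real page), and it assumes the value wanted is the second child of the matched element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup; the real page is not shown in the question.
my $html = '<div class="date"><b>Posted:</b> Dec 24, 2020</div>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element (or undef).
my $d = $tree->look_down(class => 'date');

# content_list returns the children as a list: HTML::Element objects for
# child elements, plain strings for text segments. Here child 0 is the <b>
# element and child 1 is the text that follows it.
my $date = ($d->content_list)[1];

print "date: $date\n";

$tree->delete;    # release the tree once finished with it
```

    If look_down finds nothing, $d is undef, so real code would probably want to check it before calling content_list.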

      > this will set $date to undef and might produce a warning.

      this would surprise me

      DB<71> use warnings; $a=(1..2)[9]
      DB<72>
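
      The same check can be written as a standalone script. This is only a sketch: the __WARN__ handler is an addition, used here so that any warning would become a fatal error rather than scroll past unnoticed.

```perl
use strict;
use warnings;

# Make any warning fatal so it cannot go unnoticed.
local $SIG{__WARN__} = sub { die "unexpected warning: $_[0]" };

# Index 9 is past the end of the two-element list (1, 2).
my $x = (1 .. 2)[9];

print defined $x ? "defined\n" : "undef, no warning\n";
```

      This prints "undef, no warning": the out-of-range list slice quietly yields undef.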

      > so you might want to use eval: my $date = eval { $d->content->[1] }

      That's why the use of ->content is discouraged and ->content_array_ref is offered

      $content_array_ref = $h->content_array_ref(); # never undef

      "This is like content (with all its caveats and deprecations) except that it is guaranteed to return an array reference."

      > (All code obviously untested.)

      ditto! :)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        I said "might" because I was not sure off the top of my head whether reading off the end of a list produces a warning in Perl or not.

        As the documentation says, content_array_ref is just as deprecated as content, with content_list being the preferred interface for new code.
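
        To put the three accessors discussed in this thread side by side, here is a small sketch against an element with no children. The element is built by hand purely for illustration; it does not come from the original poster's page.

```perl
use strict;
use warnings;
use HTML::Element;

# A hand-built element with no children (illustrative only).
my $empty = HTML::Element->new('span', class => 'date');

# content returns undef for an element with no content, so dereferencing
# its result directly dies; eval traps that and leaves the message in $@.
my $via_content = eval { $empty->content->[1] };
my $error       = $@;

# content_array_ref always returns an array reference, so no eval is needed.
my $via_ref = $empty->content_array_ref->[1];

# content_list returns an empty list here, and the slice yields undef.
my $via_list = ($empty->content_list)[1];

print "content died: ", ($error ? "yes" : "no"), "\n";
print "all three results undefined\n"
    unless grep { defined } $via_content, $via_ref, $via_list;
```

        All three end up undefined, but only the content_list form gets there without either an eval or a deprecated method.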

Re: HTML::Element accessing "internal attributes" the proper way
by Anonymous Monk on Dec 25, 2020 at 01:44 UTC
      Wow, not sure how I missed that in the docs. Thanks for the pointers.