cookersjs has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I am trying to access a table written in HTML code from a website. Currently I am just trying to get the program to output the HTML code in the terminal so that I know I have captured the right code. From there I plan to use HTML::TableExtract. The problem I am having is that when using HTML::TreeBuilder, it only outputs the page source, which doesn't contain the table of interest.


My code:
    use strict;
    use warnings;
    use feature 'say';    # say is not available without this
    use HTML::TreeBuilder;

    my $url  = "https://cancer.sanger.ac.uk/census";
    my $tree = HTML::TreeBuilder->new_from_url($url);
    say $tree->as_HTML;

The output is identical to the page source: view-source:https://cancer.sanger.ac.uk/census

What I am trying to access is the table under the 'Abbreviations' tab at this link: https://cancer.sanger.ac.uk/census



Thanks!

Re: How to access a web pages HTML elements?
by huck (Prior) on Dec 13, 2016 at 20:54 UTC

    There is scripting on that page that I doubt HTML::TreeBuilder is able to deal with. Using the Network tab of the Firefox web developer tools, I think I found that the page you are looking for is https://cancer.sanger.ac.uk/census/abbreviations and TreeBuilder may be able to deal with that.
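
    A minimal sketch of fetching that fragment URL directly instead of the script-driven parent page. The helper name fetch_tree is made up for illustration, and it assumes LWP and LWP::Protocol::https are installed, since new_from_url relies on them. Nothing is fetched unless a URL is passed on the command line.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Hypothetical helper: fetch a URL and return the parsed tree.
# Returns undef and warns if the fetch or parse fails.
sub fetch_tree {
    my ($url) = @_;
    my $tree = eval { HTML::TreeBuilder->new_from_url($url) };
    warn "Could not fetch $url: $@" unless $tree;
    return $tree;
}

# Only touch the network when a URL is given as an argument.
if (@ARGV) {
    if (my $tree = fetch_tree($ARGV[0])) {
        say $tree->as_HTML;
        $tree->delete;    # free the tree when done
    }
}
```

    Saved under a name of your choosing, for example fetch_fragment.pl, it could be run as: perl fetch_fragment.pl https://cancer.sanger.ac.uk/census/abbreviations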

    edit:s/Firefix/Firefox/g

      This looks perfect, thank you so much!
Re: How to access a web pages HTML elements?
by 1nickt (Canon) on Dec 13, 2016 at 22:50 UTC

    See HTML::TableExtract for extracting tables from HTML.

    But you don't want that here because:

    • The table you want (as seen in the source of the page you linked to) is on the page at https://cancer.sanger.ac.uk/cosmic/help/census#abbrev
    • The table is not in fact an HTML table but a definition list (dt/dd pairs) rendered as a table with:
      <h3 class="emboldened">Cancer Gene Census - Abbreviations</h3>
      <h4> Glossary Terms </h4>
      <dt class="l30">A</dt><dd class="l50"> Amplification; </dd>
      <dt class="l30">AEL</dt><dd class="l50"> Acute eosinophilic leukemia; </dd>
      <dt class="l30">AL</dt><dd class="l50"> Acute leukemia;</dd>
      <dt class="l30">ALCL</dt><dd class="l50"> Anaplastic large-cell lymphoma; </dd>
      <dt class="l30">ALL</dt><dd class="l50"> Acute lymphocytic leukemia;</dd>
      ...
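
    Since the data sits in dt/dd pairs, HTML::TreeBuilder's look_down can pull it out without HTML::TableExtract. What follows is only a sketch: it parses a few lines copied from the excerpt above rather than the live page, and the %abbrev hash and the stripping of the trailing semicolon are my own choices for illustration.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Sample markup copied from the excerpt; a real run would fetch the page instead.
my $html = <<'HTML';
<h3 class="emboldened">Cancer Gene Census - Abbreviations</h3>
<h4> Glossary Terms </h4>
<dt class="l30">A</dt><dd class="l50"> Amplification; </dd>
<dt class="l30">AEL</dt><dd class="l50"> Acute eosinophilic leukemia; </dd>
<dt class="l30">AL</dt><dd class="l50"> Acute leukemia;</dd>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my (%abbrev, $term);

# Walk the <dt> and <dd> elements in document order and pair each
# term with the definition that follows it.
for my $el ($tree->look_down(sub { $_[0]->tag =~ /^d[td]\z/ })) {
    my $text = $el->as_trimmed_text;
    if ($el->tag eq 'dt') {
        $term = $text;
    }
    elsif (defined $term) {
        $text =~ s/;\s*\z//;    # drop the trailing semicolon
        $abbrev{$term} = $text;
        undef $term;
    }
}
$tree->delete;

say "$_ => $abbrev{$_}" for sort keys %abbrev;
```

    Pairing by document order assumes every dt is followed by exactly one dd, which holds for the excerpt but is worth checking against the full page.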

    Please note the copyright restriction included on the page(s) you referenced. Hope this helps!


    The way forward always starts with a minimal test.
      I was aware of the restriction but for my purposes it won't be a problem.

      Thanks!