How to access a web page's HTML elements?

cookersjs has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I am trying to access a table written in HTML code from a website. Currently I am just trying to get the program to output the HTML code in the terminal so that I know I have captured the right code. From there I plan to use HTML::TableExtract. The problem I am having is that when using HTML::TreeBuilder, it only outputs the page source, which doesn't contain the table of interest.

My code:

 
       use strict;
       use warnings;
       use feature 'say';    # 'say' is not enabled by default
       use HTML::TreeBuilder;
       my $url = "https://cancer.sanger.ac.uk/census";
       my $tree = HTML::TreeBuilder->new_from_url($url);
       say $tree->as_HTML;

This is the output: view-source:https://cancer.sanger.ac.uk/census

What I am trying to access is the table under the 'Abbreviations' tab at this link: https://cancer.sanger.ac.uk/census

Thanks!


Re: How to access a web page's HTML elements? by huck (Prior) on Dec 13, 2016 at 20:54 UTC
There is scripting on that page that I doubt HTML::TreeBuilder is able to deal with, since it only parses the HTML the server sends and does not run scripts. Using the Network tab of the Firefox web developer tools, I think I found that the page you are looking for is https://cancer.sanger.ac.uk/census/abbreviations and TreeBuilder may be able to deal with that.
Edit: s/Firefix/Firefox/g
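A minimal sketch of that suggestion, assuming the abbreviations URL above still serves plain HTML. `fetch_tree` is a hypothetical helper name, not something from the original post. As far as I know, fetching `https` URLs through LWP also needs LWP::Protocol::https installed.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Hypothetical helper: fetch a URL and return the parsed tree.
# new_from_url dies if the request fails or the response is not HTML,
# so trap that with eval and report it with the URL attached.
sub fetch_tree {
    my ($url) = @_;
    my $tree = eval { HTML::TreeBuilder->new_from_url($url) };
    die "Could not fetch or parse $url: $@" unless $tree;
    return $tree;
}

# Usage: perl fetch.pl https://cancer.sanger.ac.uk/census/abbreviations
if (@ARGV) {
    my $tree = fetch_tree($ARGV[0]);
    say $tree->as_HTML(undef, '  ');    # indent the dump for readability
    $tree->delete;                      # release the tree when done
}
```

Passing the URL on the command line makes it easy to compare what the server returns for the main page and for the abbreviations page.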
Re^2: How to access a web page's HTML elements? by Anonymous Monk on Dec 14, 2016 at 13:25 UTC
This looks perfect, thank you so much!
Re: How to access a web page's HTML elements? by 1nickt (Canon) on Dec 13, 2016 at 22:50 UTC
See HTML::TableExtract for extracting tables from HTML. But you don't want that here, for two reasons.

The table you want (as seen in the source of the page you linked to) is on the page at https://cancer.sanger.ac.uk/cosmic/help/census#abbrev

The table is not in fact an HTML table, but is rendered as a table with:

       <h3 class="emboldened">Cancer Gene Census - Abbreviations</h3>
       <h4> Glossary Terms </h4>
       <dt class="l30">A</dt><dd class="l50"> Amplification; </dd>
       <dt class="l30">AEL</dt><dd class="l50"> Acute eosinophilic leukemia; </dd>
       <dt class="l30">AL</dt><dd class="l50"> Acute leukemia;</dd>
       <dt class="l30">ALCL</dt><dd class="l50"> Anaplastic large-cell lymphoma; </dd>
       <dt class="l30">ALL</dt><dd class="l50"> Acute lymphocytic leukemia;</dd>
       ...

Please note the copyright restriction included on the page(s) you referenced.

Hope this helps!

The way forward always starts with a minimal test.
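Since the markup quoted above is a run of `<dt>`/`<dd>` pairs rather than a `<table>`, one possible approach is to walk those elements with HTML::TreeBuilder itself. This is only a sketch: `extract_abbreviations` is a hypothetical helper, the sample string is copied from the fragment above, and the real page may be structured differently from what is assumed here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return a list of [term, definition] pairs
# from <dt>/<dd> markup like the fragment quoted above.
sub extract_abbreviations {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my (@pairs, $term);
    # look_down returns the matching elements in document order.
    for my $el ($tree->look_down(_tag => qr/^d[td]$/)) {
        my $text = $el->as_trimmed_text;
        if ($el->tag eq 'dt') {
            $term = $text;               # remember the abbreviation
        }
        elsif (defined $term) {
            $text =~ s/;\s*$//;          # drop the trailing semicolon
            push @pairs, [$term, $text];
            undef $term;
        }
    }
    $tree->delete;                       # release the tree when done
    return @pairs;
}

# Sample input taken from the fragment quoted above.
my $sample = <<'HTML';
<h4> Glossary Terms </h4>
<dt class="l30">A</dt><dd class="l50"> Amplification; </dd>
<dt class="l30">AEL</dt><dd class="l50"> Acute eosinophilic leukemia; </dd>
<dt class="l30">AL</dt><dd class="l50"> Acute leukemia;</dd>
HTML

printf "%-5s %s\n", @$_ for extract_abbreviations($sample);
```

Pairing the elements in document order avoids relying on sibling navigation, which can be thrown off by whitespace nodes or by any wrapper elements the parser adds.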
Re^2: How to access a web page's HTML elements? by cookersjs (Acolyte) on Dec 14, 2016 at 13:26 UTC
I was aware of the restriction, but for my purposes it won't be a problem. Thanks!