in reply to Using HTML::Treebuilder effectively to capture data

HTML::TreeBuilder is too low-level. Use HTML::TableExtract.

I also used WWW::Mechanize::GZip to handle the download.

#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use Syntax::Construct qw{ // };

use open ':std', OUT => ':utf8';

use WWW::Mechanize::GZip;
use HTML::TableExtract;

my $site = 'http://www.fourmilab.ch/yoursky/cities.html';

my $mech = 'WWW::Mechanize::GZip'->new;
$mech->get($site);
$mech->follow_link( text => 'Portland OR' );

my $te = 'HTML::TableExtract'->new;
$te->parse($mech->content);

my $table = ($te->tables)[3];
for my $row ($table->rows) {
    say join "\t", map $_ // q(), @$row;
}
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Re^2: Using HTML::Treebuilder effectively to capture data
by FryingFinn (Beadle) on Jun 16, 2015 at 18:47 UTC
    You may want to look into Web::Query, which lets you scrape an HTML page with jQuery-like constructs.
Re^2: Using HTML::Treebuilder effectively to capture data
by Aldebaran (Curate) on Jun 18, 2015 at 05:02 UTC

    Thank you choroba for your concise yet effective routine. I had to learn quite a bit of syntax just to catch up with it, and I have remaining questions. Maybe I should get those out of the way before moving on. I assume this statement handles exotic characters but by what means is it connected with the output?

    use open ':std', OUT => ':utf8';

    Also, I couldn't pick my way through all of the join..map syntax here, nor could I herd it into a lexical variable that a bumbling scribe like me can deal with. I littered it with say statements that told me very little about what was going on. The join with the tabs makes it all nicely columnar. The // is this supercool defined $a ? $a : $b syntax, and the q() is a literal quote, but I can't put the whole thing together.

    for my $row ($table->rows) { say join "\t", map $_ // q(), @$row; }

    Based on this script I was able to move in closer on the things I'm trying to zero in on:

    #! /usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    use Syntax::Construct qw{ // };

    use open ':std', OUT => ':utf8';

    use WWW::Mechanize::GZip;
    use HTML::TableExtract qw(tree);

    my $site = 'http://www.fourmilab.ch/yoursky/cities.html';

    my $mech = 'WWW::Mechanize::GZip'->new;
    $mech->get($site);
    $mech->follow_link( text => 'Portland OR' );

    my $te = 'HTML::TableExtract'->new;
    $te->parse($mech->content);

    my $table = ($te->tables)[3];
    my $table_tree = $table->tree;
    my $table_text = $table_tree->as_text;
    say "table text is $table_text";
    my $venus = $table_tree->cell(4,1)->as_text;
    say "say venus is $venus";
    my $jupiter = $table_tree->cell(7,1)->as_text;
    say "say jupiter is $jupiter";
    my $lub = 2457204.63659; #least upper bound
    my $glb = 2457207.63659; #greatest lower bound
    __END__
    $ perl tree4.pl
    table text is  RightAscensionDeclinationDistance(AU)From 45°31'5"N 122°40'33"W:AltitudeAzimuthSun5h 45m 15s+23° 23.5'1.0161.776122.364UpMercury4h 19m 31s+17° 16.1'0.711−14.916135.207SetVenus8h 57m 45s+19° 5.3'0.61730.89886.249UpMoon7h 6m 7s+17° 35.0'61.4 ER10.488104.477UpMars5h 40m 59s+24° 0.7'2.5731.626123.509UpJupiter9h 27m 56s+15° 51.6'5.92533.91177.612UpSaturn15h 52m 22s−18° 0.9'9.06617.428−38.534UpUranus1h 14m 31s+7° 12.0'20.373−37.278−179.032SetNeptune22h 46m 36s−8° 36.9'29.655−40.890−126.794SetPluto19h 1m 59s−20° 38.6'31.925−11.937−72.601Set
    say venus is 8h 57m 45s
    say jupiter is 9h 27m 56s

    I could get the right ascension for Venus and Jupiter by using a regex on the table text or by just using the two values in the cells. The latter might be more concise. What I want to do now is to enter different Julian dates to see precisely when this confluence occurs. I have defined a least upper bound time of July 1, as Jupiter has a higher right ascension then. Likewise, I have defined July 4th as a greatest lower bound, as the reverse is the case at this Julian date. From here I intend to write a control that will contract these values until they sandwich the event itself.
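
The "contract these values until they sandwich the event" step reads like a plain bisection, so here is a minimal sketch of one. The names ra_to_hours, bracket_crossing and $fake_diff are made up for illustration, and $fake_diff is an invented linear model standing in for the real lookup (fetching the page for a given Julian date and subtracting the two right ascensions); it is not real ephemeris data.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

# Convert a string like '8h 57m 45s' to decimal hours.
sub ra_to_hours {
    my ($ra) = @_;
    my ($h, $m, $s) = $ra =~ /(\d+)h\s+(\d+)m\s+(\d+)s/
        or die "Cannot parse right ascension '$ra'";
    return $h + $m / 60 + $s / 3600;
}

# Narrow [$lo, $hi] until it is no longer than $tolerance days.
# $diff is a code ref returning (Venus RA - Jupiter RA) for a Julian date;
# it must have opposite signs at the two ends of the interval.
sub bracket_crossing {
    my ($diff, $lo, $hi, $tolerance) = @_;
    my $sign_lo = $diff->($lo) <=> 0;
    die "No sign change between $lo and $hi"
        if $sign_lo == ($diff->($hi) <=> 0);
    while ($hi - $lo > $tolerance) {
        my $mid = ($lo + $hi) / 2;
        if (($diff->($mid) <=> 0) == $sign_lo) { $lo = $mid }
        else                                   { $hi = $mid }
    }
    return ($lo, $hi);
}

# Hypothetical stand-in for the real lookup: crosses zero at JD 2457205.5.
my $fake_diff = sub { my ($jd) = @_; return 0.01 * ($jd - 2457205.5) };

my ($lo, $hi) = bracket_crossing($fake_diff, 2457204.63659, 2457207.63659, 0.001);
say "crossing lies between $lo and $hi";
say 'Venus RA in hours: ', ra_to_hours('8h 57m 45s');
```

In the real script, the code ref would fetch the table for the given date and return ra_to_hours($venus) - ra_to_hours($jupiter). Each pass halves the interval, so a three-day window shrinks below a minute in about a dozen requests.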

    I tried to work out the WWW::Mechanize part: getting the relevant radio button selected and the corresponding jd value supplied. The relevant HTML from the site is here:

    <input type="radio" name="date" onclick="0" value="2" />
    <a href="/yoursky/help/controls.html#Julian">Julian day:</a>
    </td>
    <td>
    <input type="text" name="jd" value="2457189.88345" size="20" onchange="document.request.date[2].checked=true;" />

    What needs to happen here (I think) is to have onclick go to 1 on the first control and then supply the same values on the second one, except that 'value' should equal a lexical variable of my choice, say $guess.
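
One possible reading of the HTML above: onclick is only a JavaScript handler, and what a form submission actually sends is name=value pairs, so selecting the radio button means setting the field date to 2 and the field jd to the guess. The sketch below tries that offline with HTML::Form (the module WWW::Mechanize uses for forms). The form wrapper, its action URL and the helper name request_for_jd are placeholders invented for the example; the form name "request" is only inferred from document.request in the onchange handler.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use HTML::Form;

# Cut-down form built around the two controls quoted above. The <form>
# wrapper and its action URL are placeholders, not taken from the site.
my $html = <<'HTML';
<form name="request" action="http://example.invalid/hypothetical-action">
<input type="radio" name="date" onclick="0" value="2" />
<input type="text" name="jd" value="2457189.88345" size="20" />
</form>
HTML

# Build the request that submitting the form with a given Julian day would send.
sub request_for_jd {
    my ($guess) = @_;
    my $form = HTML::Form->parse($html, 'http://example.invalid/');
    $form->value(date => 2);        # select the "Julian day" radio button
    $form->value(jd   => $guess);   # fill in the text field
    return $form->make_request;     # an HTTP::Request object; nothing is sent
}

say request_for_jd(2457205.5)->uri;
```

With a live $mech already on the page, the analogous call would presumably be $mech->submit_form( with_fields => { date => 2, jd => $guess } ), though that is untested against this site.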

    Alright, well I hope I'm making sense here, and I certainly appreciate the help. Thanks again, choroba.

      handles exotic characters but by what means is it connected with the output
      See open.
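
      To make the connection visible: with :std, the pragma applies the output layer to STDOUT and STDERR as well as to filehandles opened later in the same lexical scope. A minimal sketch that lists the layers on STDOUT (stdout_layers is just a helper name chosen for this example):

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use open ':std', OUT => ':utf8';

# Return the PerlIO layers currently stacked on STDOUT.
sub stdout_layers { return PerlIO::get_layers(\*STDOUT) }

say 'STDOUT layers: ', join ' ', stdout_layers();

# The degree sign is above ASCII; the :utf8 layer encodes it as UTF-8
# on the way out instead of writing a single raw byte.
say "45\x{b0}31'5\"N";
```

      Without the pragma, a similar effect on STDOUT alone can be had with binmode(STDOUT, ':encoding(UTF-8)').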

      say join "\t", map $_ // q(), @$row;

      Read it from the right: take $row and dereference it as an array (@$row). map then goes over each of its members and replaces undefined ones with an empty string. The resulting elements are joined with tabs.
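
      The same steps can be spelled out with one lexical variable per stage. This is only a sketch: the row data is made up, and row_to_line is a name invented for the example.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

# Hypothetical row; undef stands for an empty table cell.
my $row = [ 'Venus', undef, '8h 57m 45s' ];

# The one-liner expanded into named steps.
sub row_to_line {
    my ($row) = @_;
    my @cells   = @$row;                      # dereference the array ref
    my @defined = map { $_ // q() } @cells;   # undef becomes an empty string
    return join "\t", @defined;               # glue the cells with tabs
}

say row_to_line($row);
```

      Printing @cells and @defined separately (for example with Data::Dumper) shows what each stage contributes.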

      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ