in reply to Should I use; Html Parser, table extract, Extractor

If what you are trying to do is extract the data from the table, then the following code, using HTML::TreeBuilder and HTML::ElementTable, may be a good starting point for you:

use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;
use HTML::ElementTable;

my $page = get ('http://www.ovt.ncsu.edu/cotton_soy/2004/table_11.html');
my $root = HTML::TreeBuilder->new_from_content ($page);
my $theTable = $root->find ('table');

die "Table not found" if ! defined $theTable;

$theTable = HTML::ElementTable->new_from_tree($theTable);

for my $row (1..$theTable->maxrow()-2) {
    for (0..$theTable->maxcol()) {
        my $cellText = $theTable->cell ($row, $_)->as_text ();
        print "$cellText ";
    }
    print "\n";
}
Prints:

LINT PLANT PERCENT UHM
VARIETY OR YIELD LINT HEIGHT BOLLS S.L. UNIFORMITY T1
BRAND VARIETY LB/ACRE % INCHES OPENED (IN.) INDEX (G/TEX) MIKE ELONGATION
FiberMax 991BR 855** 38.8 32 31 1.17 85.0 34.8 5.3 4.3
Stoneville ST5599BR 848* 39.9 31 30 1.14 82.7 31.5 5.0 3.6
Deltapine DP555BG/RR 816* 43.0 34 25 1.19 83.1 32.9 4.6 4.0
Deltapine DP449BG/RR 815* 38.2 32 38 1.16 83.4 32.0 4.8 4.4
Deltapine DP488BG/RR 778* 40.3 31 36 1.25 85.5 34.6 4.9 4.8
Deltapine DP 445 BG/RR 775* 41.8 28 22 1.18 84.4 32.7 5.2 6.0
Deltapine DP 543 BGII/RR 772* 39.5 32 36 1.15 83.3 31.7 5.0 4.0
FiberMax 989B2R 761* 39.3 25 21 1.17 83.2 33.4 5.2 3.3
Deltapine DP 455 BG/RR 751* 40.4 36 34 1.17 84.8 34.3 4.3 4.1
FiberMax 989BR 721 38.7 32 12 1.17 83.2 31.9 5.0 3.8
Stoneville ST5454B2R 667 36.9 30 35 1.12 82.7 30.5 5.2 5.5
FiberMax 991B2R 665 36.5 27 33 1.22 85.0 35.8 4.6 3.9
Stoneville ST5242BR 630 40.3 25 27 1.13 85.5 28.3 4.9 5.6
Deltapine DP451B/RR 629 36.2 29 60 1.18 85.1 30.5 4.8 5.2
Stoneville ST6636BR 594 37.5 30 34 1.18 84.4 32.8 4.8 4.3
Deltapine DP493 492 40.2 36 46 1.18 83.9 33.1 4.2 4.1
Stoneville ST5303R 477 39.1 33 61 1.10 85.2 32.4 4.7 4.8
Deltapine DP 5415RR 458 38.4 35 33 1.17 85.0 31.6 4.1 5.3
BCG 24R 445 39.7 34 36 1.12 85.4 29.6 4.7 6.4
BCG 295 428 38.4 26 41 1.22 84.5 32.1 4.6 4.2
Deltapine DP491 426 40.5 33 41 1.26 84.9 38.7 4.4 4.3
Stoneville ST6848R 379 37.9 33 38 1.19 86.0 35.5 4.5 4.4
Deltapine DPLX02T57R 353 37.5 32 55 1.14 84.3 28.3 4.1 7.0
FiberMax 989R 345 39.4 30 27 1.20 87.2 36.3 4.8 4.5
Deltapine DP494RR 328 40.1 34 37 1.20 84.1 34.5 4.2 4.8
Deltapine DP 5690RR 307 37.2 34 44 1.19 84.9 34.2 4.7 5.0
Mean 581 39.2 31 36 1.17 84.5 32.9 4.7 4.6
Adj.R2 (%) 78
C.V.(%) 19
BLSD(K-50) 115
s.e. 51
Error d.f. 108

Note that the row loop runs from 1 to $theTable->maxrow()-2: the last two rows are ignored to avoid a problem with missing cells in those rows, and the first row (row 0) is skipped for the same reason.
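
If trimming the row range by hand feels fragile, one possible alternative is to skip HTML::ElementTable and walk the tr elements directly with HTML::TreeBuilder, so that only cells which actually exist are visited. The following is just a sketch under assumptions: the inline HTML is a made-up stand-in for the fetched page (its last row is deliberately short), and @rows is a name introduced here for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the fetched page; the last row has fewer cells.
my $html = <<'HTML';
<table>
  <tr><th>BRAND</th><th>VARIETY</th><th>LB/ACRE</th></tr>
  <tr><td>FiberMax</td><td>991BR</td><td>855**</td></tr>
  <tr><td>Mean</td><td>581</td></tr>
</table>
HTML

my $root  = HTML::TreeBuilder->new_from_content($html);
my $table = $root->find('table') or die "Table not found";

my @rows;
for my $tr ($table->look_down(_tag => 'tr')) {
    # Collect only the td/th elements present in this row,
    # so a short row simply yields fewer cells.
    my @cells = map { $_->as_text }
        $tr->look_down(sub { $_[0]->tag =~ /^t[dh]$/ });
    push @rows, join ' ', @cells;
}

print "$_\n" for @rows;

$root->delete;    # free the parse tree when done
```

With the sample above this should print three lines, the last being "Mean 581". The trade-off is that spanned cells are not expanded into a grid the way HTML::ElementTable does it, so column positions are not guaranteed to line up between rows.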


DWIM is Perl's answer to Gödel