in reply to Displaying Web Table Data Properly

First, we need to see *exactly* what your @$row array elements contain. Use Data::Dumper!

    use Data::Dumper;
    $Data::Dumper::Useqq = 1;
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
            print Dumper $row;
        }
    }

Output:
    $VAR1 = [
      "\n\t\tACME Communications, Inc.\n\t\t\n\t",
      "\n\t\t\n\t\tACME\n\t",
      "\n\t\t-\$0.43\n\t",
      "\n\n\t\n\t\t\240\n\t\t-\$0.67\n\t\t\n\t\n\n\t"
    ];
    ...
    $VAR1 = [
      "\n\t\tJP Realty Inc.\n\t\t\n\t\t\n\t\t*\n\t\t\n\t\t\n\t",
      "\n\t\t\n\t\tJPR\n\t",
      "\n\t\t\$0.64\n\t",
      "\n\n\t\n\t\t\240\n\t\t\n\t\t\n\t\n\n\t"
    ];
    ...

Notice that there are:

  1. leading and trailing newlines and tabs,
  2. embedded newlines and tabs (between "JP Realty Inc." and "*"),
  3. non-breaking spaces (&nbsp; in HTML; translated by TableExtract to \240 octal or \xA0 hex).

For your purposes, the simplest solution is:

  1. translate non-breaking spaces, tabs, and newlines into regular spaces, "squeezing" each run of whitespace into a single space,
  2. remove any remaining leading and trailing whitespace.

This code does exactly that:

    foreach (@$row) {
        tr{ \t\n\xA0}{ }s;
        s{^\s+}{};
        s{\s+$}{};
    }
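
To see the effect on one of the rows dumped above, here is a minimal, self-contained sketch. The `clean_row` helper name is only for illustration (it is not part of HTML::TableExtract), and the sample row is copied by hand from the Dumper output rather than fetched from the site.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, named here only for illustration: cleans each
# cell of a row in place and returns the same array reference.
sub clean_row {
    my ($row) = @_;
    foreach (@$row) {
        tr{ \t\n\xA0}{ }s;   # space, tab, newline, NBSP -> space; squeeze runs
        s{^\s+}{};           # strip leading whitespace
        s{\s+$}{};           # strip trailing whitespace
    }
    return $row;
}

# Sample row copied from the Dumper output above.
my $row = [
    "\n\t\tJP Realty Inc.\n\t\t\n\t\t\n\t\t*\n\t\t\n\t\t\n\t",
    "\n\t\t\n\t\tJPR\n\t",
    "\n\t\t\$0.64\n\t",
    "\n\n\t\n\t\t\240\n\t\t\n\t\t\n\t\n\n\t",
];

clean_row($row);

# Bracket each cell so that empty cells are visible.
print join('|', map { "[$_]" } @$row), "\n";
```

This should print `[JP Realty Inc. *]|[JPR]|[$0.64]|[]`. The last cell is empty because it held nothing but whitespace and a non-breaking space.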

The final program (after a bandwidth-saving change) looks like this:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;

    my $file = 'earnings.dat';
    my $url  = "http://www.earnings.com/fin/earnListing.jsp?date=2003-05-04";

    # Be kind to the website, at least during testing!
    # Get the $url only when $file is missing or more than 1 day old.
    mirror($url, $file) if !-e $file or -M $file > 1;

    # Slurp $file into $content.
    my $content = do {
        local (*F, $/);
        open F, $file or die "Can't open $file: $!";
        <F>;
    };

    my $te = HTML::TableExtract->new(
        headers => [qw(Company Symbol Estimate Actual)],
    );
    $te->parse($content);

    #use Data::Dumper;
    #$Data::Dumper::Useqq = 1;

    # Examine all matching tables
    foreach my $ts ($te->table_states) {
        print "Table (", join(',', $ts->coords), "):\n";
        foreach my $row ($ts->rows) {
            # Normalize whitespace in every cell of the row.
            foreach (@$row) {
                tr{ \t\n\xA0}{ }s;
                s{^\s+}{};
                s{\s+$}{};
            }
            # print Dumper $row;
            print join(',', @$row), "\n";
        }
    }

You may want to change your join to use tab instead of comma, since some of the company names contain commas.
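
The comma problem is easy to reproduce. In this small sketch, the sample row is the ACME row from above after cleanup, typed in by hand, and the variable names are only illustrative. Splitting each joined line back apart shows how many fields a downstream reader would see:

```perl
use strict;
use warnings;

# Cleaned sample row; note the comma inside the company name.
my @row = ('ACME Communications, Inc.', 'ACME', '-$0.43', '-$0.67');

my $comma_line = join(',',  @row);
my $tab_line   = join("\t", @row);

# Split on the same delimiter that was used to join.
my @from_comma = split /,/,  $comma_line;
my @from_tab   = split /\t/, $tab_line;

print scalar(@from_comma), " fields from the comma-joined line\n";
print scalar(@from_tab),   " fields from the tab-joined line\n";
```

The comma-joined line comes back as five fields, the tab-joined line as the original four. Tab is a safe delimiter here because the cleanup step has already turned every tab inside a cell into a space.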

Re: Re: Displaying Web Table Data Properly
by canguro (Novice) on May 12, 2003 at 21:18 UTC
    Thank you tremendously for your code suggestions. They worked without any problem, and your advice about the embedded commas is well taken, and I will think carefully about the choices. This situation was really driving me nuts, but I am impressed by the way you took it apart.

    Again, many thanks.