First, we need to see *exactly* what your @$row array elements contain. Use Data::Dumper!

    use Data::Dumper;
    $Data::Dumper::Useqq = 1;
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
            print Dumper $row;
        }
    }

Output:

    $VAR1 = [
        "\n\t\tACME Communications, Inc.\n\t\t\n\t",
        "\n\t\t\n\t\tACME\n\t",
        "\n\t\t-\$0.43\n\t",
        "\n\n\t\n\t\t\240\n\t\t-\$0.67\n\t\t\n\t\n\n\t"
    ];
    ...
    $VAR1 = [
        "\n\t\tJP Realty Inc.\n\t\t\n\t\t\n\t\t*\n\t\t\n\t\t\n\t",
        "\n\t\t\n\t\tJPR\n\t",
        "\n\t\t\$0.64\n\t",
        "\n\n\t\n\t\t\240\n\t\t\n\t\t\n\t\n\n\t"
    ];
    ...

Notice that there are:

  1. leading and trailing newlines and tabs,
  2. embedded newlines and tabs (between "JP Realty Inc." and "*"),
  3. non-breaking spaces (&nbsp; in HTML; translated by TableExtract to \240 octal or \xA0 hex).
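
If the two notations for the non-breaking space look unrelated, this small sketch (not part of the original program; the variable names are only for illustration) shows that "\240" and "\xA0" are the same single character:

```perl
use strict;
use warnings;

my $octal = "\240";   # the escape as it appears in the Dumper output
my $hex   = "\xA0";   # the same character written in hex

# ord() returns the character's code point; print it in three bases.
printf "code point: %d (octal %o, hex %X)\n",
    ord($octal), ord($octal), ord($octal);
print "same character\n" if $octal eq $hex;
```

This prints "code point: 160 (octal 240, hex A0)" followed by "same character", so either escape can be used in the tr{}{} search list.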

For your purposes, the simplest solution is:

  1. translate non-breaking spaces, tabs, newlines, and regular spaces into regular spaces, "squeezing" any groups of them into a single space,
  2. remove the remaining leading and trailing whitespace.

This code does exactly that:

    foreach (@$row) {
        tr{ \t\n\xA0}{ }s;   # map each to a space, squeezing runs
        s{^\s+}{};           # strip leading whitespace
        s{\s+$}{};           # strip trailing whitespace
    }
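
As a quick check, here is a sketch that applies the same three statements to the "JP Realty" row copied from the Dumper output. The clean_cells name is made up for this illustration; it is not part of HTML::TableExtract.

```perl
use strict;
use warnings;

# Hypothetical helper: cleans every cell of a row in place.
sub clean_cells {
    my ($row) = @_;
    foreach (@$row) {
        tr{ \t\n\xA0}{ }s;   # whitespace and NBSP become single spaces
        s{^\s+}{};           # strip leading whitespace
        s{\s+$}{};           # strip trailing whitespace
    }
    return $row;
}

# The "JP Realty" row, copied from the Dumper output.
my $row = [
    "\n\t\tJP Realty Inc.\n\t\t\n\t\t\n\t\t*\n\t\t\n\t\t\n\t",
    "\n\t\t\n\t\tJPR\n\t",
    "\n\t\t\$0.64\n\t",
    "\n\n\t\n\t\t\240\n\t\t\n\t\t\n\t\n\n\t",
];

clean_cells($row);

# Bracket each cell so that empty cells are visible.
print join('|', map { "[$_]" } @$row), "\n";
```

This prints [JP Realty Inc. *]|[JPR]|[$0.64]|[], so the last cell, which held only whitespace and a non-breaking space, ends up as an empty string.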

The final program (after a bandwidth-saving change) looks like this:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;

    my $file = 'earnings.dat';
    my $url  = "http://www.earnings.com/fin/earnListing.jsp?date=2003-05-04";

    # Be kind to the website, at least during testing!
    # Get the $url only when $file is missing or more than 1 day old.
    mirror($url, $file) if !-e $file or -M $file > 1;

    # Slurp $file into $content.
    my $content = do {
        local $/;
        open my $fh, '<', $file or die "Cannot open $file: $!";
        <$fh>;
    };

    my $te = HTML::TableExtract->new(
        headers => [qw(Company Symbol Estimate Actual)],
    );
    $te->parse($content);

    #use Data::Dumper;
    #$Data::Dumper::Useqq = 1;

    # Examine all matching tables
    foreach my $ts ($te->table_states) {
        print "Table (", join(',', $ts->coords), "):\n";
        foreach my $row ($ts->rows) {
            foreach (@$row) {
                tr{ \t\n\xA0}{ }s;
                s{^\s+}{};
                s{\s+$}{};
            }
            # print Dumper $row;
            print join(',', @$row), "\n";
        }
    }

You may want to change your join to use tab instead of comma, since some of the company names contain commas.
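
To see why the separator matters, here is a small sketch using the cleaned ACME row from the output above. It splits each joined line again, the way a downstream reader of the file might; the variable names are only for illustration.

```perl
use strict;
use warnings;

# The ACME row as it looks after the whitespace cleanup.
my @row = ('ACME Communications, Inc.', 'ACME', '-$0.43', '-$0.67');

my $csv_line = join(',',  @row);   # the comma in the name adds a bogus field
my $tsv_line = join("\t", @row);   # the cleanup already removed every tab

# Split the lines back into fields.
my @csv_fields = split /,/,  $csv_line;
my @tsv_fields = split /\t/, $tsv_line;

printf "comma: %d fields, tab: %d fields\n",
    scalar @csv_fields, scalar @tsv_fields;
```

This prints "comma: 5 fields, tab: 4 fields". Because the tr{}{} step turns every tab into a space, a tab can never appear inside a cleaned cell, which makes it a safe separator here.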


In reply to Re: Displaying Web Table Data Properly by Util
in thread Displaying Web Table Data Properly by canguro
