canguro has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I posted this question a few days ago, hoping to resolve my problem, which is that I am able to successfully retrieve the table data from a website but the 4 columns each come out on a separate line. I received a reply from someone with the suggestion to chomp newlines, but I have tried a whole bunch of variations and I am getting nowhere. A big part of my problem is lack of experience in this area (with tables). Can someone please straighten me out on where I am going wrong? What I am hoping to get is something like this:

a1,a2,a3,a4
b1,b2,b3,b4
.
.
.

But it's just not happening. Every field i.e., a1, a2, etc., is coming out on separate lines. I need some heavy-duty expert advice. This is the code I have so far:
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;

my $url;
$url="http://www.earnings.com/fin/earnListing.jsp?tckr=&exch=&eff=&date=2003-05-04";
my $content=get $url;

my $te = new HTML::TableExtract( headers => [qw(Company Symbol Estimate Actual)] );
$te->parse($content);

# Examine all matching tables
foreach my $ts ($te->table_states) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}

Again, any help would be greatly appreciated.

edited: Mon May 12 14:17:10 2003 by jeffa - code tags, title change (was: This Newbie Needs Perl Monks Help In Displaying Web Table Data Properly)

Replies are listed 'Best First'.
Re: Displaying Web Table Data Properly
by Jaap (Curate) on May 12, 2003 at 11:16 UTC
    Your problem lies in this line of code:
    print join(',', @$row), "\n";
    The elements of the array @$row appear to have newlines in them. These need to be chomped.
    Try something like this:
    foreach my $element (@$row) {
        chomp($element);      ### remove the newline
        print "$element,";    ### print it with a comma
    }
    print "\n";
Re: Displaying Web Table Data Properly
by Util (Priest) on May 12, 2003 at 13:02 UTC

    First, we need to see *exactly* what your @$row array elements contain. Use Data::Dumper!

    use Data::Dumper;
    $Data::Dumper::Useqq = 1;
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
            print Dumper $row;
        }
    }
    Output:
    $VAR1 = [
              "\n\t\tACME Communications, Inc.\n\t\t\n\t",
              "\n\t\t\n\t\tACME\n\t",
              "\n\t\t-\$0.43\n\t",
              "\n\n\t\n\t\t\240\n\t\t-\$0.67\n\t\t\n\t\n\n\t"
            ];
    ...
    $VAR1 = [
              "\n\t\tJP Realty Inc.\n\t\t\n\t\t\n\t\t*\n\t\t\n\t\t\n\t",
              "\n\t\t\n\t\tJPR\n\t",
              "\n\t\t\$0.64\n\t",
              "\n\n\t\n\t\t\240\n\t\t\n\t\t\n\t\n\n\t"
            ];
    ...

    Notice that there are:

    1. leading and trailing newlines and tabs,
    2. embedded newlines and tabs ( in JP Realty Inc. .... * ),
    3. non-breaking spaces (&nbsp; in HTML; translated by TableExtract to \240 octal or \xA0 hex).
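
    To see why a plain chomp does not help with cells like these, here is a small sketch using two of the values from the dump above. The helper name chomped is made up for this illustration; it is not part of the original program.

```perl
use strict;
use warnings;

# Two cells copied from the Data::Dumper output above.
my @cells = ("\n\t\tACME\n\t", "\n\t\t-\$0.43\n\t");

# Hypothetical helper: returns a chomped copy of its argument.
sub chomped {
    my ($text) = @_;
    chomp $text;    # removes one trailing newline, and only if the string ends with it
    return $text;
}

for my $cell (@cells) {
    my $after = chomped($cell);
    # Both cells end in a tab, so chomp has nothing to remove.
    print $after eq $cell ? "unchanged by chomp\n" : "changed by chomp\n";
}
```

    Because each cell ends in a tab rather than a newline, and the other newlines are leading or embedded, chomp leaves these strings as they are.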

    For your purposes, the simplest solution is:

    1. translate non-breaking spaces, tabs, newlines, and regular spaces into regular spaces, "squeezing" any groups of them into a single space,
    2. remove the remaining leading and trailing whitespace.
    This code does exactly that:
    foreach (@$row) {
        tr{ \t\n\xA0}{ }s;
        s{^\s+}{};
        s{\s+$}{};
    }
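
    If it helps to try the cleanup on its own, the same three statements can be wrapped in a small sub and run against one row from the dump. This is only a sketch; the name clean_cell is hypothetical and does not come from HTML::TableExtract.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the three cleanup statements.
sub clean_cell {
    my ($text) = @_;
    $text =~ tr{ \t\n\xA0}{ }s;   # map space, tab, newline, \xA0 to space; squeeze runs
    $text =~ s{^\s+}{};           # drop leading whitespace
    $text =~ s{\s+$}{};           # drop trailing whitespace
    return $text;
}

# The "JP Realty Inc." row, copied from the Data::Dumper output.
my @row = (
    "\n\t\tJP Realty Inc.\n\t\t\n\t\t\n\t\t*\n\t\t\n\t\t\n\t",
    "\n\t\t\n\t\tJPR\n\t",
    "\n\t\t\$0.64\n\t",
    "\n\n\t\n\t\t\240\n\t\t\n\t\t\n\t\n\n\t",
);

print join("\t", map { clean_cell($_) } @row), "\n";
```

    Working on a copy inside the sub leaves the original row untouched, which makes it easy to compare before and after with Data::Dumper.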

    The final program (after a bandwidth-saving change) looks like this:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;

    my $file = 'earnings.dat';
    my $url  = "http://www.earnings.com/fin/earnListing.jsp?date=2003-05-04";

    # Be kind to the website, at least during testing!
    # Get the $url only when $file is missing or more than 1 day old.
    mirror($url, $file) unless -e $file and -M $file <= 1;

    # Slurp $file into $content.
    my $content = do { local (*F, $/); open F, $file or die "Cannot open $file: $!"; <F>; };

    my $te = HTML::TableExtract->new(
        headers => [qw(Company Symbol Estimate Actual)],
    );
    $te->parse($content);

    #use Data::Dumper;
    #$Data::Dumper::Useqq = 1;

    # Examine all matching tables
    foreach my $ts ($te->table_states) {
        print "Table (", join(',', $ts->coords), "):\n";
        foreach my $row ($ts->rows) {
            foreach (@$row) {
                tr{ \t\n\xA0}{ }s;
                s{^\s+}{};
                s{\s+$}{};
            }
            # print Dumper $row;
            print join(',', @$row), "\n";
        }
    }
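
    The "missing or more than 1 day old" rule from the comment in the program can also be written as a separate check, which makes the condition easier to read and to test without touching the network. This is a sketch only; needs_refresh is a hypothetical name, and the file name is just the one used in the program.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $file is missing or older than $max_days.
sub needs_refresh {
    my ($file, $max_days) = @_;
    return 1 unless -e $file;                 # no cached copy yet
    return (-M $file) > $max_days ? 1 : 0;    # -M gives the age in days at script start
}

print needs_refresh('earnings.dat', 1)
    ? "would fetch a fresh copy\n"
    : "would reuse the cached copy\n";
```

    With a check like this, the call to mirror would run only when needs_refresh($file, 1) returns true.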

    You may want to change your join to use tab instead of comma, since some of the company names contain commas.
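
    As a rough sketch of the two options: join on a tab, or keep the comma and quote any field that contains one. The csv_field helper below is hypothetical and covers only the basic quoting rules; for real CSV output, the Text::CSV module on CPAN handles the details more thoroughly.

```perl
use strict;
use warnings;

# Hypothetical helper: quote a field only when it contains a comma,
# a double quote, or a newline; embedded double quotes are doubled.
sub csv_field {
    my ($field) = @_;
    return $field unless $field =~ /[",\n]/;
    (my $quoted = $field) =~ s/"/""/g;
    return qq{"$quoted"};
}

# One cleaned row, using values from the thread.
my @row = ('ACME Communications, Inc.', 'ACME', '-$0.43', '-$0.67');

print join("\t", @row), "\n";                        # tab-separated
print join(',', map { csv_field($_) } @row), "\n";   # comma-separated, quoted where needed
```

    The tab version is simpler to produce and to split again later; the quoted version stays readable by spreadsheet programs that expect commas.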

      Thank you tremendously for your code suggestions. They worked without any problem, and your advice about the embedded commas is well taken, and I will think carefully about the choices. This situation was really driving me nuts, but I am impressed by the way you took it apart.

      Again, many thanks.

Re: Displaying Web Table Data Properly
by Abstraction (Friar) on May 12, 2003 at 12:27 UTC
    Try

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;

    my $url;
    $url="http://www.earnings.com/fin/earnListing.jsp?tckr=&exch=&eff=&date=2003-05-04";
    my $content=get $url;

    my $te = new HTML::TableExtract( headers => [qw(Company Symbol Estimate Actual)] );
    $te->parse($content);

    # Examine all matching tables
    foreach my $ts ($te->table_states) {
        print "Table (", join(',', $ts->coords), "):\n";
        foreach my $row ($ts->rows) {
            print join ',', grep { s/\s{2,}//g; } @$row;
            print "\n";
        }
    }
    Output:
    ACME Communications, Inc.,ACME,-$0.43, -$0.67
    American Axle & Manufacturing Holdings Inc,AXL,$0.84, $1.02
    Arch Chemicals Inc,ARJ,-$0.23, $0.11
    Barry (R G) Corporation,RGB,n/a,
    Boots&Coots/Int Well,WEL,n/a, $999.00
    Cameco Corporation,CCJ,n/a,
    Chevron Texaco Corporation,CVX,$1.29,
    Cigna Corp,CI,$1.27, $1.46
    DTE Energy Company,DTE,$1.26,
    Gaylord Entertainment Co. (New),GET,n/a,
    Grant Prideco Inc,GRP,$0.02, $0.04
    Grupo Financiero Galicia S.A. - American Depositary Shares Representing Class B*,GGAL,n/a,
    Hardinge, Inc.,HDNG,n/a,
    Hearst - Argyle Television Series A,HTV,$0.39, $0.11
    Home Properties Of Ny Inc.,HME,$0.77, $0.60
    Insmed, Inc.,INSM,-$0.33,
    JP Realty Inc.*,JPR,$0.64,
    Mge Energy Inc.,MGEE,n/a,
    OSI Systems, Inc.,OSIS,$0.23, $0.29
    OXiGENE, Inc.,OXGN,-$0.23,
    Pinnacle West Capital Corp.,PNW,$0.57,
    Plains Resources, Inc,PLX,$0.41,
    Sun Communities Inc.,SUI,$0.88, $0.92
    Superior Energy Service Inc,SPN,$0.07, $0.10
    Tanox, Inc.,TNOX,-$0.19, -$0.14
    Tyler Technologies Inc,TYL,$0.04,
      Thank you for your advice and time and effort. I tried out your suggestions and found there was a big improvement, although to be honest I am picking up additional data entries. I don't know where they are coming from.

      I also tried the suggestions made by 'Util' and they came out without a hitch. Either way, though, I want you to know your help is very much appreciated.