emmiesix has asked for the wisdom of the Perl Monks concerning the following question:

I found a really nice program that will convert simple html tables into ascii (e.g., |-separated) tables. This is very useful as I am using the data for scientific work so I want to be able to parse it easily.

I successfully turned the program example into a module which takes my html and outputs ascii. Neat! Now the problem is that I don't really want to print the ascii, I want to store it (either string or array, doesn't matter). For some odd reason, if I change that 'print DumpTable(...' line in the convert sub into something like:

my $test = DumpTable(...

I don't get a text string. I've tried this lots of different ways and I'm afraid I'm just over my head a bit with this style of perl (I really, really hate the "operators" style... mostly because I don't understand what the heck is going on, and where things are actually stored, etc). So I guess as a side note, now that I've gone through the beginner perl book, where do I go to learn about this kind of perl coding?

I've attached the module file below:

package htmltoascii;

use strict;
use HTML::TreeBuilder;
use Text::ASCIITable;
use List::Util qw(max);

sub convert {
    my $html = shift;
    my $t = HTML::TreeBuilder->new();
    $t->parse($html);
    $t->eof;
    print DumpTable( $_ ), $/, $/ for $t->find_by_tag_name('table') ;
}

sub DumpTable {
    my $ht = shift;
    die "$ht is not a table" unless $ht->tag eq 'table';
    my $tt = Text::ASCIITable::->new;
    my @co;
    my @da;
    my $da = [];
    for my $ro ( @{ $ht->content() } ) {
        if( $ro->tag eq 'tr' ) {
            push @da, $da if @$da;
            $da = [];
            for my $ce ( @{ $ro->content() } ) {
                if( $ce->tag eq 'td' ) {
                    if( $ce->look_down( '_tag', 'table' ) ) {
                        my $string = '';
                        for my $i ( @{ $ce->content() } ) {
                            if( not ref $i ) {
                                $string .= $i;
                            } elsif( $i->tag eq 'table' ) {
                                $string .= "\n";
                                $string .= DumpTable($i);
                                $string .= "\n";
                            } else {
                                $string .= $i->as_text;
                            }
                        }
                        push @$da, $string;
                    } else {
                        push @$da, $ce->as_text;
                    }
                } elsif( $ce->tag eq 'th' ) {
                    push @co, $ce->as_text;
                }
            }
        }
    }
    push @da, $da if @$da;
    if(@co) {
        $tt->setCols(@co);
    } else {
        use List::Util qw(max);
        my $max = 1 + max( 0, map { $#$_ } @da );
        $tt->setCols( (' ') x $max );
        $tt->setOptions( hide_HeadRow => 1 );
        $tt->setOptions( hide_HeadLine => 1 );
    }
    $tt->addRow($_) for @da;
    $tt->setOptions( 'drawRowLine', 1) if $ht->attr('border');
    # return $tt->draw();
    return $tt->draw(
        [ '.=', '=.', '-', '-' ],   # .=-----------=.
        [ '|', '|', '|' ],          # | info | info |
        [ '|-', '-|', '=', '=' ],   # |-===========-|
        [ '|', '|', '|' ],          # | info | info |
        [ "'=", "='", '-', '-' ],   # '=-----------='
        [ '|=', '=|', '-', '*' ]    # rowseperator
    );
}

1;

Replies are listed 'Best First'.
Re: odd text object problem - how to store as string?
by ikegami (Patriarch) on Jun 24, 2011 at 21:30 UTC

    If you use the object as a string, it becomes a string. You could do

    my $str = "" . $obj;

    but it would be clearer if you did the equivalent

    my $str = $obj->draw();
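
    A minimal sketch of what "use the object as a string" means, assuming a class that overloads stringification. My::Table and its draw method are hypothetical stand-ins invented for this illustration; they are not the Text::ASCIITable implementation.

```perl
use strict;
use warnings;

package My::Table;    # hypothetical class, for illustration only

# When an object of this class is used as a string, call its draw method.
use overload '""' => sub { $_[0]->draw };

sub new {
    my ($class, @rows) = @_;
    return bless { rows => [@rows] }, $class;
}

# Render each stored row on its own line.
sub draw {
    my ($self) = @_;
    return join "\n", map { "| $_ |" } @{ $self->{rows} };
}

package main;

my $obj      = My::Table->new('a', 'b');
my $implicit = "" . $obj;       # concatenation triggers the overload
my $explicit = $obj->draw();    # same text, but the intent is obvious

print "same text\n" if $implicit eq $explicit;
```

    Under these assumptions both variables hold the same plain string; the explicit method call simply makes it clearer to the reader where the text comes from.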

      Sorry to be dense, but what is $obj in your example? I can't figure out why the "print DumpTable..." clearly prints a text object but trying to save the returned object from DumpTable gets me an empty value.

      I have attached a short program which calls this in case anyone wants to try running it...

      Thanks for the help

      #!/usr/bin/env perl
      use htmltoascii;
      use strict;
      use LWP::Simple;

      my $url = 'https://archive.nrao.edu/archive/ArchiveRouter?SCAN_FILE_IDS=181828362';
      my $content = get($url);
      my $ascii = &htmltoascii::convert($content);
      print "test is $ascii\n";

      where I have changed the "convert" sub to:

      sub convert {
          my $html = shift;
          my $t = HTML::TreeBuilder->new();
          $t->parse($html);
          $t->eof;
          my $obj = DumpTable( $_ ), $/, $/ for $t->find_by_tag_name('table') ;
          return($obj);
      }

        Sorry to be dense, but what is $obj in your example?

        When I read

        my $test = DumpTable(...

        I got confused and thought that was the Text::ASCIITable constructor. Of course, that's not the case, so what I said doesn't apply.


        Please avoid

        my ... for ...;

        It straddles edge cases in Perl. my ... if ...; is not allowed, for example, and it's not clear what your code does.

        May I recommend

        my $table = ( $t->find_by_tag_name('table') )[-1];
        return DumpTable($table);

        If the find only returns one table, you can also use

        my ($table) = $t->find_by_tag_name('table');
        return DumpTable($table);
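
        You are re-initializing $obj on every iteration over the tables returned by find_by_tag_name, so nothing accumulates. Declare it before the loop and append to it, joining the pieces with . rather than , (with commas, the two $/ separators are evaluated and thrown away instead of being appended). You need to change your code to something like this:

        sub convert {
            my $html = shift;
            my $t = HTML::TreeBuilder->new();
            $t->parse($html);
            $t->eof;
            my $obj = '';
            $obj .= DumpTable( $_ ) . $/ . $/ for $t->find_by_tag_name('table');
            return $obj;
        }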
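
        Since the original question allowed for either a string or an array, here is a rough sketch of the array variant: one string per table, joined only when a single string is wanted. To keep it runnable without HTML::TreeBuilder, dump_table_stub and the @tables list are hypothetical stand-ins for DumpTable and for the elements returned by find_by_tag_name; they are assumptions for illustration, not part of the posted module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for DumpTable(): takes a plain name instead of an
# HTML::Element and returns a small block of text.
sub dump_table_stub {
    my ($table) = @_;
    return "+---+\n| $table |\n+---+";
}

# Hypothetical stand-in for the list from $t->find_by_tag_name('table').
my @tables = ('first', 'second');

# Declare the result outside any loop; map builds one string per table.
my @ascii = map { dump_table_stub($_) } @tables;

# Join into a single string when needed, with a blank line between tables.
my $all = join "\n\n", @ascii;

print $all, "\n";
```

        In the real module, the map body would presumably call DumpTable($_) on each element, and convert could return either @ascii or $all depending on what the caller wants to parse.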