Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've been battling with awk, grep, and perl to try to get two columns out of the following page (after wget'ting it into a text file): http://setiathome.ssl.berkeley.edu/stats/country_7.html
Basically I want the displayed name and the number of workunits. Would be lovely if it came out as: Joe Bloggs:470

Replies are listed 'Best First'.
Re: Sneeky Snake
by extremely (Priest) on Oct 09, 2000 at 04:12 UTC

    Great subject line... =)

    You may wish to look at HTML::Parser, or a more specific module like HTML::TableExtract, rather than throwing sed, awk, and such at it. The page has the annoying one-<TD>-per-line layout that will make breaking it into columns a real nightmare. Worse, half the names are links and half aren't.
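
    For instance, a minimal HTML::TableExtract sketch might look like this. The cell indexes (name in the second cell, workunits in the third) and the single-table assumption are guesses about the live page, not anything confirmed in this thread, so check them before trusting the output:

    #!/usr/bin/perl -w
    # Hypothetical sketch, not from the original post: pull the two
    # columns with HTML::TableExtract.  depth/count pin down "the
    # first table on the page"; the cell indexes 1 and 2 (name,
    # workunits) are guesses -- check them against the live page.
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;

    my $html = get('http://setiathome.ssl.berkeley.edu/stats/country_7.html')
        or die "couldn't fetch the stats page";

    my $te = HTML::TableExtract->new( depth => 0, count => 0 );
    $te->parse($html);

    foreach my $ts ( $te->tables ) {
        foreach my $row ( $ts->rows ) {
            my ( $name, $workunits ) = @$row[ 1, 2 ];   # guessed columns
            next unless defined $name and defined $workunits;
            print "$name:$workunits\n";                 # e.g. Joe Bloggs:470
        }
    }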

    If you really wish to stick to tinkering with it yourself, at least there is only one table. You should be able to rip off everything up to the <TABLE> tag, rip from the </TABLE> tag to the end, remove everything in <TH> containers, remove all <TD>, <TR>, and </TD></TR> tags, and change all remaining </TD>\n sequences into ";" or "|" or something you like (a rough sketch follows below).

    And then, rewrite it every time it breaks when they reformat that page.
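
    Something along those lines, perhaps -- an untested one-shot filter; the TD-per-line layout and the single-table assumption are guesses about the live page:

    #!/usr/bin/perl -w
    # Hypothetical sketch of the strip-and-delimit recipe above; run as
    #   perl strip.pl country_7.html
    # (strip.pl is just a stand-in name).  It will break whenever the
    # page layout changes.
    use strict;

    undef $/;                          # slurp the whole file
    my $html = <>;

    $html =~ s/^.*?<TABLE[^>]*>//is;   # rip off everything before <TABLE>
    $html =~ s/<\/TABLE>.*$//is;       # ...and from </TABLE> to the end
    $html =~ s/<TR[^>]*>\s*<TH.*?<\/TR>//is;   # drop the <TH> header row
    $html =~ s/<\/TD>[ \t]*\n/|/gi;    # </TD> at end of line becomes "|"
    $html =~ s/<[^>]+>//g;             # strip whatever tags remain
    print $html;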

    --
    $you = new YOU;
    honk() if $you->love(perl)

RE: Sneeky Snake
by Zarathustra (Beadle) on Oct 09, 2000 at 10:50 UTC

    I decided to use your post as an opportunity to teach myself HTML::Parser.

    This being my first time with the module, it took ~2 hours to write, with the last 10% of the code
    taking 90% of the time...

    ( FYI: that was in getting the <a> tags within the <td> tags to parse correctly )

    So, here ya go -- using HTML::Parser and LWP::UserAgent:

    #!/usr/bin/perl -w
    use strict;
    use LWP::UserAgent;
    use HTTP::Request;      # for the explicit HTTP::Request->new below
    use HTML::Parser;

    my ( $href, $ua, $req, $resp, $tmp, $p, @stats );
    my $i = 0;              # current cell number within the row

    $href = "http://setiathome.ssl.berkeley.edu/stats/country_7.html";

    $ua   = LWP::UserAgent->new();
    $req  = HTTP::Request->new( 'GET', $href );
    $resp = $ua->request($req);
    die $resp->status_line unless $resp->is_success;

    # Text handler while inside a <td>: cell 1 is the name, cell 2 the
    # workunit count; anything past cell 2 is ignored.
    sub get_table_text {
        return unless $i < 3;
        my ( $self, $text ) = @_;
        $self->handler( text => sub { return if shift eq "" }, "dtext" );
        $text =~ s/^\d+\)\s//;          # strip the leading "NN) " rank
        if    ( $i == 1 ) { $tmp .= $text }
        elsif ( $i == 2 ) { chomp( $tmp .= ":$text" ) }
    }

    # Text handler while inside an <a>: the linked names sit one level
    # deeper, so grab their text separately.
    sub grab_href_text {
        my $self = shift;
        $self->handler( text => sub { return if shift eq "" }, "dtext" );
        $tmp .= shift;
    }

    # At </tr>, flush the accumulated "name:count" line.
    sub end_table {
        return unless shift eq "tr";
        push @stats, "$tmp\n";
        undef $tmp;
        $i = 0;
    }

    # Swap in the right text handler depending on the tag just opened.
    sub start {
        my ( $tag, $self ) = @_;
        return unless $tag =~ /^(tr|td|a)$/;
        $tag =~ /td/ && do {
            $i++;
            $self->handler( text => \&get_table_text, "self, dtext" );
        };
        $tag =~ /a/ && do {
            $self->handler( text => \&grab_href_text, "self, dtext" );
        };
        $self->handler( end => \&end_table, "tagname, self" );
    }

    $p = HTML::Parser->new( api_version => 3 );
    $p->handler( start => \&start, "tagname, self" );
    $p->parse( $resp->content );    # $resp->{'_content'} pokes at internals
    print @stats;


    Hope that's educational/useful ... despite the conspicuous lack of comments!

    (c8=

Re: Sneeky Snake
by cianoz (Friar) on Oct 09, 2000 at 04:41 UTC
    I wrote this ugly piece of code as a joke; you should write a real parser using HTML::TableExtract, as suggested by extremely.
    For some strange reason it seems to get the work done (sort of):
    #!/usr/bin/perl -w
    # '||' binds tighter than the comma, so the original
    # "open TEXT, 'country_7.html' || die" could never die; use "or".
    open( TEXT, 'country_7.html' ) or die $!;
    my $text;
    while (<TEXT>) {    ## slurps the whole file -- lot of ram...
        chomp;
        $text .= $_;
    }
    $text =~ s/<\/table.*//;           ## drop everything from </table> on
    my @lines = split /<tr>/im, $text;
    shift @lines;                      ## drop the leading html
    shift @lines;                      ## ...and the <th> header row
    foreach my $line (@lines) {
        $line =~ s/<\/tr>//i;
        $line =~ s/<\/td>//gi;
        my @values = split /<td[\sa-zA-Z=]*>/im, $line;
        shift @values;
        print '|';
        foreach my $value (@values) {
            $value =~ s/<\/a>//;
            $value =~ s/&nbsp;//;
            $value =~ s/<a .+>//;
            print $value, "|";
        }
        print "\n";
    }
Re: Sneeky Snake
by Trimbach (Curate) on Oct 09, 2000 at 08:58 UTC
    Or, a lot shorter (but no less RAM-hungry)...
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my $page  = get("http://setiathome.ssl.berkeley.edu/stats/country_7.html");
    my @lines = split /<tr>/, $page;
    for (@lines) {
        # Take out the links (for the lines that have 'em)
        s/<a.*?>(.*)<\/a>/$1/g;
        # Take out the silly &nbsp;
        s/&nbsp;//g;
        # Match the 2 parts you want
        m/<td>(.*?)<\/td>.*?(\d+)/isg;
        # And print it
        print "$1 : $2\n";
    }
    Works fine, although it's definitely in the "one shot" category... any major changes to the web page format will break this program. Although using HTML::TableExtract is a better overall solution, throwing a hack like this together only takes a few minutes. It's an (easy) example of the general idea of loading in a web page and sucking out the bits that you're interested in.

    Gary Blackburn
    Trained Killer