Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've been battling with awk, grep, and perl to try to get two columns out of the following page (after wget'ting it into a text file): http://setiathome.ssl.berkeley.edu/stats/country_7.html
Basically I want the displayed name and the number of workunits. Would be lovely if it came out as: Joe Bloggs:470

Replies are listed 'Best First'.
Re: Sneeky Snake
by extremely (Priest) on Oct 09, 2000 at 04:12 UTC

    Great subject line... =)

    You may wish to look at HTML::Parser, or a more specific module like HTML::TableExtract, rather than throwing sed, awk, and such at it. The page has the annoying one-<TD>-per-line layout that will make breaking it into columns a real nightmare. Worse, half the names are links and half aren't.
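
    For instance, a minimal HTML::TableExtract sketch might look like this. The cell indexes (name in the second cell, workunits in the third) and the single-table assumption are guesses about the live page, not anything confirmed in this thread, so check them before trusting the output:

    #!/usr/bin/perl -w
    # Hypothetical sketch, not from the original post: pull the two
    # columns with HTML::TableExtract.  depth/count pin down "the
    # first table on the page"; the cell indexes 1 and 2 (name,
    # workunits) are guesses -- check them against the live page.
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;

    my $html = get('http://setiathome.ssl.berkeley.edu/stats/country_7.html')
        or die "couldn't fetch the stats page";

    my $te = HTML::TableExtract->new( depth => 0, count => 0 );
    $te->parse($html);

    foreach my $ts ( $te->tables ) {
        foreach my $row ( $ts->rows ) {
            my ( $name, $workunits ) = @$row[ 1, 2 ];   # guessed columns
            next unless defined $name and defined $workunits;
            print "$name:$workunits\n";                 # e.g. Joe Bloggs:470
        }
    }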

    If you really wish to stick to tinkering with it yourself, at least there is only one table. You should be able to rip off everything up to the <TABLE> tag, rip from the </TABLE> tag to the end, remove everything in <TH> containers, remove all <TD>, <TR>, and </TD></TR> tags, and change all remaining </TD>\n sequences into ";" or "|" or something you like (a rough sketch follows below).

    And then, rewrite it every time it breaks when they reformat that page.
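
    Something along those lines, perhaps -- an untested one-shot filter; the TD-per-line layout and the single-table assumption are guesses about the live page:

    #!/usr/bin/perl -w
    # Hypothetical sketch of the strip-and-delimit recipe above; run as
    #   perl strip.pl country_7.html
    # (strip.pl is just a stand-in name).  It will break whenever the
    # page layout changes.
    use strict;

    undef $/;                          # slurp the whole file
    my $html = <>;

    $html =~ s/^.*?<TABLE[^>]*>//is;   # rip off everything before <TABLE>
    $html =~ s/<\/TABLE>.*$//is;       # ...and from </TABLE> to the end
    $html =~ s/<TR[^>]*>\s*<TH.*?<\/TR>//is;   # drop the <TH> header row
    $html =~ s/<\/TD>[ \t]*\n/|/gi;    # </TD> at end of line becomes "|"
    $html =~ s/<[^>]+>//g;             # strip whatever tags remain
    print $html;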

    --
    $you = new YOU;
    honk() if $you->love(perl)

RE: Sneeky Snake
by Zarathustra (Beadle) on Oct 09, 2000 at 10:50 UTC

    I decided to use your post as an opportunity to teach myself HTML::Parser.

    This being my first time with the module, it took ~2 hours to write, with the last 10% of the code
    taking 90% of the time...

    ( FYI: that was in getting the <a> tags within the <td> tags to parse correctly )

    So, here ya go -- using HTML::Parser and LWP::UserAgent:

    #!/usr/bin/perl -w
    use strict;
    use LWP::UserAgent;
    use HTTP::Request;      # for the explicit HTTP::Request->new below
    use HTML::Parser;

    my ( $href, $ua, $req, $resp, $tmp, $p, @stats );
    my $i = 0;              # current cell number within the row

    $href = "http://setiathome.ssl.berkeley.edu/stats/country_7.html";

    $ua   = LWP::UserAgent->new();
    $req  = HTTP::Request->new( 'GET', $href );
    $resp = $ua->request($req);
    die $resp->status_line unless $resp->is_success;

    # Text handler while inside a <td>: cell 1 is the name, cell 2 the
    # workunit count; anything past cell 2 is ignored.
    sub get_table_text {
        return unless $i < 3;
        my ( $self, $text ) = @_;
        $self->handler( text => sub { return if shift eq "" }, "dtext" );
        $text =~ s/^\d+\)\s//;          # strip the leading "NN) " rank
        if    ( $i == 1 ) { $tmp .= $text }
        elsif ( $i == 2 ) { chomp( $tmp .= ":$text" ) }
    }

    # Text handler while inside an <a>: the linked names sit one level
    # deeper, so grab their text separately.
    sub grab_href_text {
        my $self = shift;
        $self->handler( text => sub { return if shift eq "" }, "dtext" );
        $tmp .= shift;
    }

    # At </tr>, flush the accumulated "name:count" line.
    sub end_table {
        return unless shift eq "tr";
        push @stats, "$tmp\n";
        undef $tmp;
        $i = 0;
    }

    # Swap in the right text handler depending on the tag just opened.
    sub start {
        my ( $tag, $self ) = @_;
        return unless $tag =~ /^(tr|td|a)$/;
        $tag =~ /td/ && do {
            $i++;
            $self->handler( text => \&get_table_text, "self, dtext" );
        };
        $tag =~ /a/ && do {
            $self->handler( text => \&grab_href_text, "self, dtext" );
        };
        $self->handler( end => \&end_table, "tagname, self" );
    }

    $p = HTML::Parser->new( api_version => 3 );
    $p->handler( start => \&start, "tagname, self" );
    $p->parse( $resp->content );    # $resp->{'_content'} pokes at internals
    print @stats;


    Hope that's educational/useful ... despite the conspicuous lack of comments!

    (c8=

Re: Sneeky Snake
by cianoz (Friar) on Oct 09, 2000 at 04:41 UTC
    I wrote this ugly piece of code as a joke; you should write a real parser using HTML::TableExtract, as suggested by extremely.
    For some strange reason it seems to get the work done (sort of):
    #!/usr/bin/perl -w
    # '||' binds tighter than the comma, so the original
    # "open TEXT, 'country_7.html' || die" could never die; use "or".
    open( TEXT, 'country_7.html' ) or die $!;
    my $text;
    while (<TEXT>) {    ## slurps the whole file -- lot of ram...
        chomp;
        $text .= $_;
    }
    $text =~ s/<\/table.*//;           ## drop everything from </table> on
    my @lines = split /<tr>/im, $text;
    shift @lines;                      ## drop the leading html
    shift @lines;                      ## ...and the <th> header row
    foreach my $line (@lines) {
        $line =~ s/<\/tr>//i;
        $line =~ s/<\/td>//gi;
        my @values = split /<td[\sa-zA-Z=]*>/im, $line;
        shift @values;
        print '|';
        foreach my $value (@values) {
            $value =~ s/<\/a>//;
            $value =~ s/&nbsp;//;
            $value =~ s/<a .+>//;
            print $value, "|";
        }
        print "\n";
    }
Re: Sneeky Snake
by Trimbach (Curate) on Oct 09, 2000 at 08:58 UTC
    Or, a lot shorter (but no less RAM-hungry)...
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my $page  = get("http://setiathome.ssl.berkeley.edu/stats/country_7.html");
    my @lines = split /<tr>/, $page;
    for (@lines) {
        # Take out the links (for the lines that have 'em)
        s/<a.*?>(.*)<\/a>/$1/g;
        # Take out the silly &nbsp;
        s/&nbsp;//g;
        # Match the 2 parts you want
        m/<td>(.*?)<\/td>.*?(\d+)/isg;
        # And print it
        print "$1 : $2\n";
    }
    Works fine, although it's definitely in the "one shot" category... any major changes to the web page format will break this program. Although using HTML::TableExtract is a better overall solution, throwing a hack like this together only takes a few minutes. It's an (easy) example of the general idea of loading in a web page and sucking out the bits that you're interested in.

    Gary Blackburn
    Trained Killer