wfsp has asked for the wisdom of the Perl Monks concerning the following question:
I have about 100 pages that have a 'see also' right hand column. Over the years they have suffered from 'tweak creep' and it was time to 'normalise' them.
I needed to extract the data in its various guises ready to feed to HTML::Template.
Whilst you end up with a lot of subs (10 in this case), quite complex logic can be accommodated easily and, imo, clearly.
The 'right' div can contain any of 5 different blocks (the example has all of them). The main sub (right_start) sets the conditions for handling each block. The remaining subs store the appropriate data in an array.
I appreciate I'm only scratching the surface here but for a first outing I'm very impressed. Any pointers, criticisms?
```perl
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
use HTML::Parser ();

my $html;
{
    local $/;
    $html = <DATA>
}

my ($h2, $link_item, $img_link, @db);

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&find_right, 'self, tag, attr'],
) or die "can't create parser: $!";

$p->unbroken_text(1);
$p->parse($html);

{
    $, = "\n";
    print @db;
}

# Skip everything until the 'right' div, then hand over to right_start/right_end.
sub find_right{
    my ($self, $tag, $attr) = @_;
    if (
        $tag eq 'div' and
        exists $attr->{class} and
        $attr->{class} =~ /right/
    ) {
        $self->handler(start => \&right_start, 'self, tag, attr');
        $self->handler(end   => \&right_end,   'self, tag, attr');
    }
}

# Main dispatcher: decide which kind of block starts here and install its handlers.
sub right_start{
    my ($self, $tag, $attr) = @_;
    if (
        $tag eq 'p' and
        exists $attr->{class} and
        $attr->{class} eq 'header'
    ) {
        $self->handler(text => \&header_text, 'self, dtext');
    }
    elsif (
        $tag eq 'p' and
        exists $attr->{class} and
        $attr->{class} eq 'link_item'
    ) {
        $self->handler(start => \&link_item_href, 'self, tag, attr');
        $self->handler(end   => \&link_item_end,  'self, tag');
    }
    elsif ($tag eq 'h2'){
        $self->handler(start => \&h2_href, 'self, attr');
    }
    elsif ($tag eq 'img'){
        push @db, 'image:' . $attr->{src};
    }
    elsif ($tag eq 'a'){
        $img_link = 'image-link:' . $attr->{href};
        $self->handler(start => \&img_link, 'self, tag, attr');
    }
}

# Stop parsing at the first closing div after the 'right' div was found.
sub right_end{
    my ($self, $tag) = @_;
    $self->eof if $tag eq '/div';
}

# Called for the start tag following a bare <a>; expected to be the <img>.
sub img_link{
    my ($self, $tag, $attr) = @_;
    $img_link .= ':' . $attr->{src};
    push @db, $img_link;
    $img_link = '';
    $self->handler(start => \&right_start, 'self, tag, attr');
}

# End of a link_item paragraph: store the record and restore the main handlers.
sub link_item_end{
    my ($self, $tag) = @_;
    if ($tag eq '/p'){
        $self->handler(start => \&right_start);
        $self->handler(end   => \&right_end);
        push @db, $link_item;
        $link_item = '';
    }
}

# Collect the trimmed text of each span inside a link_item.
sub span_text{
    my ($txt) = @_;
    for ($txt){
        s/^\s+//;
        s/\s+$//;
    }
    $link_item .= ':' . $txt if $txt;
}

sub link_item_href{
    my ($self, $tag, $attr) = @_;
    if ($tag eq 'a'){
        $link_item = 'link-item:' . $attr->{href};
        $self->handler(text => \&span_text, 'dtext');
    }
}

# Called for the start tag following <h2>; expected to be the <a>.
sub h2_href{
    my ($self, $attr) = @_;
    $h2 = 'link-h2:' . $attr->{href};
    $self->handler(text => \&h2_text, 'self, dtext');
}

sub h2_text{
    my ($self, $txt) = @_;
    $h2 .= ':' . $txt;
    push @db, $h2;
    $h2 = '';
    $self->handler(text  => '');
    $self->handler(start => \&right_start, 'self, tag, attr');
}

sub header_text{
    my ($self, $txt) = @_;
    push @db, "header:$txt";
    $self->handler(start => \&right_start, 'self, tag, attr');
    $self->handler(text  => '');
}

__DATA__
<html>
<head>
  <title>page with right column</title>
</head>
<body>
  <div class="middle">
    <!-- middle -->
  </div>
  <div class="right">
    <p class="header">see also</p>
    <p class="link_item">
      <a href="page1.html">
        <span class="fi">line 1</span>
        <br />
        <span class="fi">line 2</span>
        <br />
        <span class="fi">line 3</span>
      </a>
    </p>
    <a href="some.html">
      <img src="some.jpg" />
    </a>
    <p class="link_item">
      <a href="page2.html">
        <span class="fi">line1</span>
        <br />
        <span class="fi">line2</span>
        <br />
        <span class="fi">line3</span>
      </a>
    </p>
    <p class="header">see something else</p>
    <h2><a href="more.html">link to more</a></h2>
    <img src="another.jpg" />
  </div>
  <div class="footer">
    <h4>footer links</h4>
  </div>
</body>
</html>
```

output:

```
---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" parser.pl
header:see also
link-item:page1.html:line 1:line 2:line 3
image-link:some.html:some.jpg
link-item:page2.html:line1:line2:line3
header:see something else
link-h2:more.html:link to more
image:another.jpg
> Terminated with exit code 0.
```
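Since the data is destined for HTML::Template, here is a minimal sketch of one way the colon-separated records above might be reshaped into the array of hash refs that a TMPL_LOOP takes. The helper name `records_to_loop` and the key names (`type`, `href`, `src`, `text`, `lines`) are assumptions for illustration only, and splitting on ':' would need rethinking for hrefs that themselves contain a colon (absolute URLs, for instance).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample records in the same "type:field:field" shape as the output above.
my @db = (
    'header:see also',
    'link-item:page1.html:line 1:line 2:line 3',
    'image-link:some.html:some.jpg',
    'link-h2:more.html:link to more',
    'image:another.jpg',
);

# Hypothetical helper: turn each record into a hash ref.
# All key names are assumptions, not taken from the original script.
sub records_to_loop {
    my @rows;
    for my $record (@_) {
        my ($type, @fields) = split /:/, $record;
        my %row = (type => $type);
        if    ($type eq 'header')     { $row{text} = $fields[0] }
        elsif ($type eq 'image')      { $row{src}  = $fields[0] }
        elsif ($type eq 'image-link') { @row{qw(href src)}  = @fields }
        elsif ($type eq 'link-h2')    { @row{qw(href text)} = @fields }
        elsif ($type eq 'link-item')  {
            $row{href}  = shift @fields;
            # Nested loop data: one hash ref per line of link text.
            $row{lines} = [ map { +{ text => $_ } } @fields ];
        }
        push @rows, \%row;
    }
    return \@rows;
}

# Print a one-line summary of each row to show the resulting structure.
for my $row (@{ records_to_loop(@db) }) {
    my $n = $row->{lines} ? scalar @{ $row->{lines} } : 0;
    print $row->{type},
          ($row->{href} ? " -> $row->{href}" : ''),
          ($n ? " ($n lines)" : ''),
          "\n";
}
```

The array ref returned by `records_to_loop` could then be passed to something like `$template->param(see_also => $rows)`, where `see_also` is a hypothetical loop name in the template.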
Re: Getting to grips with HTML::Parser
by tphyahoo (Vicar) on Jun 02, 2006 at 10:47 UTC
by Anonymous Monk on Jun 02, 2006 at 16:33 UTC

Re: Getting to grips with HTML::Parser
by merlyn (Sage) on Jun 02, 2006 at 13:08 UTC