Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?

Here's an example using Ovid's HTML::TokeParser::Simple, the one module you didn't mention in your post ;)

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple ();

use constant SKIP => 0;
use constant COPY => 1;

die "usage: $0 inputfile > outputfile\n" if @ARGV != 1;

my $p = HTML::TokeParser::Simple->new(shift);
my @results;
my $state = SKIP;

while(my $t = $p->get_token) {
    if (   $state == SKIP
        && $t->is_start_tag('table')
        && $t->return_attr->{border} =~ /^0$/
        && $t->return_attr->{align}  =~ /center/ )
    {
        $state = COPY;
    }
    if ( $state == COPY && $t->is_end_tag('table') ) {
        $state = SKIP;
    }
    elsif($state == COPY) {
        push @results, $t->as_is;
    }
    elsif ( $state == SKIP ) {
        next;
    }
    else {
        die "I'm confused about my state ($state) at token " . $t->as_is;
    }
}

print "$_\n" for @results;

Thanks to Aristotle for helping me with a similar problem months ago.

--
Allolex
