(Ovid) Re: How do I extract all text around a keyword between two symbols?

Don't use regular expressions to parse HTML. There are simply too many ways to write HTML. What happens if the closing tag is </td > (note the whitespace), for example?
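To make that concrete, here is a small sketch of how a naive pattern goes wrong. The sample markup and the pattern are my own assumptions for illustration, not taken from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample row: the second cell closes with "</td >" and the
# third cell carries an attribute.
my $html = '<tr><td>first</td><td>second</td ><td class="x">third</td></tr>';

# Naive pattern: assumes every cell is written exactly as <td>...</td>
my @cells = $html =~ m{<td>(.*?)</td>}g;

print scalar(@cells), " capture(s)\n";
print "[$_]\n" for @cells;
```

The row has three cells, but the pattern returns only two captures, and the second one runs past the "</td >" it fails to recognise and swallows the third cell's markup along with it. You can keep patching the pattern for each variation, but a real parser already handles them.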

The following code uses HTML::TokeParser and stuffs the data you want into the @data array.

#!/usr/bin/perl -w
require 5.004;
use strict;
use HTML::TokeParser;
use Data::Dumper;
    
my $doc      = shift or &usage;
my $p        = HTML::TokeParser->new($doc) || die "Can't open: $!";
my @data;

# walk through document and get each tag
while (my $token = $p->get_tag) {
    my $tag = $token->[0];
    if ( $tag eq 'td' ) {
        # get_text until a closing 'td' is found
        push @data, $p->get_trimmed_text( "/td" );
    }
}

print Dumper \@data;

sub usage {
    print "Usage:\ttest.pl some.html\n";
    exit;
}
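
If you want to try the approach without a file on disk, here is a minimal sketch that parses a string instead. The sample markup is made up; passing a reference to a string to new() and giving get_tag() a tag name are both documented HTML::TokeParser features.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample row with awkward spacing and an attribute
my $html = '<tr><td>first</td><td> second </td ><td class="x">third</td></tr>';

# A reference to a scalar is parsed as the document itself, not as a filename
my $p = HTML::TokeParser->new( \$html ) or die "Can't parse document";

my @data;

# get_tag('td') skips ahead to the next opening <td> and returns
# undef when there are no more
while ( $p->get_tag('td') ) {
    # collect text up to the closing tag, with surrounding whitespace trimmed
    push @data, $p->get_trimmed_text('/td');
}

print "[$_]\n" for @data;
```

Asking get_tag for 'td' directly saves the explicit check on the tag name inside the loop. The parser should treat "</td >" as an ordinary closing tag, so all three cells end up in @data.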

Cheers,
Ovid

