Re: Problem with parsing HTML with Regex's

Parsing HTML with regexes should be a last resort, IMHO. Use a parser instead. That's why it's called HTML parsing! Here is a demonstration using HTML::TokeParser::Simple and doc's data.

use strict;
use warnings;
use HTML::TokeParser::Simple;
use Data::Dumper;

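# Parse the HTML that follows __DATA__ at the end of this script.
my $parser = HTML::TokeParser::Simple->new(\*DATA);
my (@img, @link, @a);   # collected src/href values, one list per tag type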

while (my $token = $parser->get_token) {

   if ($token->is_start_tag('img')) {
      push @img, $token->return_attr->{src};

   } elsif ($token->is_start_tag('link')) {
      push @link, $token->return_attr->{href};

   } elsif ($token->is_start_tag('a')) {
      push @a, $token->return_attr->{href};
   }
}

print Dumper \@img,\@link,\@a;

__DATA__
<A href=normal.link2 class="foo" >
<img src="img.link2" alt="foo">
<a class=foo href='normal.link4'>
<img height=20 width=25 src=img.link3 >
<IMG src='img.link4'>
<link href="css.link1">
<a class=foo href="normal.link1">
<img src="img.link1">
<a href="normal.link3">
<a Href='normal.link5'>

It is no shorter than doc's code, but it is much more readable. As written, a tag with no href or src attribute pushes undef onto its list, so you might want to throw in a check for missing attributes.
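For what it's worth, here is a rough sketch of what that check for missing attributes could look like. It is only one way to do it: the sample markup, the %wanted and %found hashes, and the wording of the warning are all made up for illustration, and it sticks to the same calls as the script above (new, get_token, is_start_tag, return_attr). I believe newer releases of HTML::TokeParser::Simple prefer get_attr over return_attr, so adjust to whatever your installed version documents.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Made-up sample markup: one <img> has no src and one <a> has no href.
my $html = <<'END_HTML';
<img src="img.link1">
<img alt="no src here">
<a href="normal.link1">
<a name="anchor-only">
<link href="css.link1">
END_HTML

# Attribute to collect for each tag we care about (hypothetical lookup table).
my %wanted = (img => 'src', link => 'href', a => 'href');
my %found  = map { $_ => [] } keys %wanted;

# A scalar reference tells the parser to read from the string itself.
my $parser = HTML::TokeParser::Simple->new(\$html);

while (my $token = $parser->get_token) {
    for my $tag (keys %wanted) {
        next unless $token->is_start_tag($tag);

        my $value = $token->return_attr->{ $wanted{$tag} };
        if (defined $value && length $value) {
            push @{ $found{$tag} }, $value;
        }
        else {
            # Skip the tag instead of pushing undef onto the list.
            warn "<$tag> tag without a usable $wanted{$tag} attribute, skipped\n";
        }
        last;   # a token can only match one tag name
    }
}

printf "%-4s => %s\n", $_, join(', ', @{ $found{$_} }) for sort keys %found;
```

Driving the loop from a hash also means that adding another tag is a one-line change, at the cost of being a bit less obvious than the if/elsif chain.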

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Re: Re: Problem with parsing HTML with Regex's
by PodMaster (Abbot) on Nov 10, 2003 at 13:56 UTC

use YAPE::HTML;
use Data::Dumper;
use strict;
use warnings;

my $content = q[
<img src="img.link1">
<img src="img.link2" alt="foo">
<img height=20 width=25 src=img.link3 >
<IMG src='img.link4'>
<link href="css.link1">
<a class=foo href="normal.link1">
<A href=normal.link2 class="foo" >
<a href="normal.link3">
<a class=foo href='normal.link4'>
<a Href='normal.link5'>
];

my $parser  = YAPE::HTML->new($content);

my( @a, @link, @img );

# here is the tokenizing part
while ( my $chunk = $parser->next ) {
    if( $chunk->type eq 'tag' ){
        if( $chunk->tag eq 'a' ){
            push @a,
                $chunk->get_attr('href')
                    if $chunk->has_attr('href');
        }
        elsif( $chunk->tag eq 'link' ){
            push @link,
                $chunk->get_attr('href')
                    if $chunk->has_attr('href');
        }
        elsif($chunk->tag eq 'img'){
            push @img,
                $chunk->get_attr('src')
                    if $chunk->has_attr('src');
        }
    }
}

print Dumper \@img,\@link,\@a;

__END__

$VAR1 = [
          'img.link1',
          'img.link2',
          'img.link3',
          'img.link4'
        ];
$VAR2 = [
          'css.link1'
        ];
$VAR3 = [
          'normal.link1',
          'normal.link2',
          'normal.link3',
          'normal.link4',
          'normal.link5'
        ];

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.