Problem with parsing HTML with Regex's

OverlordQ has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Problem with parsing HTML with Regex's
by jeffa (Bishop) on Nov 10, 2003 at 13:30 UTC

parsing

HTML::TokeParser::Simple

use strict;
use warnings;
use HTML::TokeParser::Simple;
use Data::Dumper;

my $parser = HTML::TokeParser::Simple->new(\*DATA);
my (@img,@link,@a);

while (my $token = $parser->get_token) {

   if ($token->is_start_tag('img')) {
      push @img, $token->return_attr->{src};

   } elsif ($token->is_start_tag('link')) {
      push @link, $token->return_attr->{href};

   } elsif ($token->is_start_tag('a')) {
      push @a, $token->return_attr->{href};
   }
}

print Dumper \@img,\@link,\@a;

__DATA__
<A href=normal.link2 class="foo" >
<img src="img.link2" alt="foo">
<a class=foo href='normal.link4'>
<img height=20 width=25 src=img.link3 >
<IMG src='img.link4'>
<link href="css.link1">
<a class=foo href="normal.link1">
<img src="img.link1">
<a href="normal.link3">
<a Href='normal.link5'>

doc

href

src

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)


Re: Re: Problem with parsing HTML with Regex's

by PodMaster (Abbot) on Nov 10, 2003 at 13:56 UTC

use YAPE::HTML;
use Data::Dumper;
use strict;
use warnings;

my $content = q[
<img src="img.link1">
<img src="img.link2" alt="foo">
<img height=20 width=25 src=img.link3 >
<IMG src='img.link4'>
<link href="css.link1">
<a class=foo href="normal.link1">
<A href=normal.link2 class="foo" >
<a href="normal.link3">
<a class=foo href='normal.link4'>
<a Href='normal.link5'>
];

my $parser  = YAPE::HTML->new($content);

my( @a, @link, @img );

# here is the tokenizing part
while ( my $chunk = $parser->next ) {
    if( $chunk->type eq 'tag' ){
        if( $chunk->tag eq 'a' ){
            push @a,
                $chunk->get_attr('href')
                    if $chunk->has_attr('href');
        }
        elsif( $chunk->tag eq 'link' ){
            push @link,
                $chunk->get_attr('href')
                    if $chunk->has_attr('href');
        }
        elsif($chunk->tag eq 'img'){
            push @img,
                $chunk->get_attr('src')
                    if $chunk->has_attr('src');
        }
    }
}

print Dumper \@img,\@link,\@a;

__END__

$VAR1 = [
          'img.link1',
          'img.link2',
          'img.link3',
          'img.link4'
        ];
$VAR2 = [
          'css.link1'
        ];
$VAR3 = [
          'normal.link1',
          'normal.link2',
          'normal.link3',
          'normal.link4',
          'normal.link5'
        ];

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.


Re: Problem with parsing HTML with Regex's
by diotalevi (Canon) on Nov 10, 2003 at 07:27 UTC

I already gave you an expression in the chatterbox. It's as close as I think you'll get without the use of some module (which is a really good idea, BTW).

s(
    ((?#1) # Capture the entire tag beginning
     <img(?s:.+?)
     src\s*=\s* # Optional space
    )
    ((?#2)['"]) # Capture the delimiter
    ((?#3)(?s:.*?)) # Capture the URL
    \2 # Use whatever delimiter was used to start the URL
)
    { "$1$2" . resolveimg( $3 ) . $2 }gixe;

Re: Re: Problem with parsing HTML with Regex's

by OverlordQ (Hermit) on Nov 10, 2003 at 07:43 UTC

$text =~ s#src=\"(.*?)\"#&resolveimg($1)#sige;


Re: Problem with parsing HTML with Regex's
by Anonymous Monk on Nov 10, 2003 at 07:48 UTC

Which doesn't seem to work very well as I'm getting some rather strange output. TIA.

Parsing HTML (which is what you're trying to do) with regular expressions is hard (more so when you're green). But as usual, there is always CPAN (HTML::StripScripts::Regex, YAPE::HTML).

use Regexp::Common qw /delimited/;
my $text = q~
qqq <img   src  = "src"  >
sss <img   src='src' >

~;
$text =~ s~
         img \s+ src
        \s* \= \s*
        (?: $RE{delimited}{-delim=>'"'} |
            $RE{delimited}{-delim=>"'"}
        )
         ~bongo~sigx;
print $text,$/,$/;
__END__

qqq <bongo  >
sss <bongo >

Re: Re: Problem with parsing HTML with Regex's

by OverlordQ (Hermit) on Nov 10, 2003 at 07:51 UTC

Original Script

My 'modified' Regex's


Re: Problem with parsing HTML with Regex's
by doc (Scribe) on Nov 10, 2003 at 13:05 UTC

local $/;
my $data = <DATA>;
my @img = $data =~ m/<\s*img[^>]*?src\s*=\s*['"]?([^"' >\n]+)/gi;
my @css = $data =~ m/<\s*link[^>]*?href\s*=\s*['"]?([^"' >\n]+)/gi;
my @lnk = $data =~ m/<\s*a[^>]*?href\s*=\s*['"]?([^"' >\n]+)/gi;
use Data::Dumper;
print Dumper \@img, \@css, \@lnk;
__DATA__
<img src="img.link1">
<img src="img.link2" alt="foo">
<img height=20 width=25 src=img.link3 >
<IMG src='img.link4'>
<link href="css.link1">
<a class=foo href="normal.link1">
<A href=normal.link2 class="foo" >
<a href="normal.link3">
<a class=foo href='normal.link4'>
<a Href='normal.link5'>
__END__
$VAR1 = [
          'img.link1',
          'img.link2',
          'img.link3',
          'img.link4'
        ];
$VAR2 = [
          'css.link1'
        ];
$VAR3 = [
          'normal.link1',
          'normal.link2',
          'normal.link3',
          'normal.link4',
          'normal.link5'
        ];

Re: Problem with parsing HTML with Regex's
by ysth (Canon) on Nov 10, 2003 at 07:28 UTC

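Without a sample line that is going wrong, this is just a shot in the dark, but does replacing each .*? with [^=]* and putting \b before and after img, a, and link help?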

You really ought to be doing this with HTML::Parser.


Re: Re: Problem with parsing HTML with Regex's

by Anonymous Monk on Nov 10, 2003 at 07:58 UTC

Without a sample line that is going wrong, this is just a shot in the dark, but does replacing each .*? with [^=]* and putting \b before and after img, a, and link help?

[^=]*

.*?

precisely


Re: Re: Re: Problem with parsing HTML with Regex's

by Anonymous Monk on Nov 10, 2003 at 08:02 UTC

But who's to say that the input will always look like "img src...", it could be "img border" or anything like that.
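To make the trade-off concrete, here is a minimal sketch comparing the lazy .*? form with the [^=]* suggestion. The sample HTML and the variable names are made up for illustration and are not from the original script.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the original script.
my $html = join ' ',
    q{<img src="a.gif">},
    q{<img alt="logo">},
    q{<script src="evil.js"></script>},
    q{<img border=0 src="b.gif">};

# Lazy .*? with /s walks out of the src-less <img> tag and into <script>.
my @lazy = $html =~ /img.*?src="(.*?)"/sig;

# [^=]* cannot cross an '=', so src has to be the first attribute.
my @strict = $html =~ /\bimg\b[^=]*src="(.*?)"/sig;

print "lazy:   @lazy\n";    # a.gif evil.js b.gif
print "strict: @strict\n";  # a.gif
```

So the lazy version picks up a src that belongs to another tag, while the [^=]* version silently drops any img whose src is not the first attribute, which is exactly the "img border" objection.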


Re: Re: Problem with parsing HTML with Regex's

by OverlordQ (Hermit) on Nov 10, 2003 at 07:39 UTC

original script

bad regex's


Re: Re: Re: Problem with parsing HTML with Regex's

by ysth (Canon) on Nov 10, 2003 at 08:10 UTC

s#(?:\bimg\b[^<>]*src|\blink\b[^<>]*href)\=\"(.*?)\"#...

and

s#(?:\ba\b[^<>]*href)\=\"(.*?)\"#

diotalevi

.+

Or you may want to follow this suggestion; I had assumed you wanted to cover even something like <a title="whoohoo" href=...>, so I didn't switch to \s*.

But given that you want to be able to handle *any* web page that is entered, you ought to use a real HTML parser instead. (BTW, I tried http://google.com and was disappointed that the buttons didn't get translated.)
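A rough illustration of why "any web page" is a tall order for these patterns: [^<>]* keeps the match inside a single tag, but a pattern that insists on src="..." still misses single-quoted and unquoted values. The three sample tags and the variable names below are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample tags covering the three quoting styles.
my $html = <<'HTML';
<img src="a.gif">
<img src='b.gif'>
<img width=25 src=c.gif >
HTML

# [^<>]* stays inside one tag, but the value must be double-quoted.
my @dq_only = $html =~ /\bimg\b[^<>]*src="(.*?)"/sig;

# Allowing an optional quote of either kind picks up all three.
my @any = $html =~ /\bimg\b[^<>]*?src\s*=\s*['"]?([^"'\s>]+)/ig;

print "double-quoted only: @dq_only\n";  # a.gif
print "any quoting:        @any\n";      # a.gif b.gif c.gif
```

Each fix like this covers one more case, and real pages keep supplying new ones, which is the argument for a parser.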


Re: Problem with parsing HTML with Regex's
by Anonymous Monk on Nov 10, 2003 at 08:21 UTC

Edit: Yes, I know I should be using a module, but that's the reason I'm asking: I don't want to use one.

use


Re: Problem with parsing HTML with Regex's
by idsfa (Vicar) on Nov 10, 2003 at 16:01 UTC

None of the proposed regex solutions properly handle:

<img alt=">click here<" src="/images/button.gif" NAME=">click<">

Yet Another Reason to use an HTML parser.

Updated:
None of the proposed purely regex solutions ... happy now? ;-)

Yes, the given regexes can all be modified to work with this (psychotic) example. The point is that parsing HTML is difficult, and that this wheel has already been invented a few times.
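For the curious, a sketch of what such a modification might look like, using doc's img pattern from earlier in the thread as the "before". The attribute-skipping pattern and the names $value and $src are my own assumptions, and this is still nowhere near a full HTML tokenizer (comments and script bodies will fool it).

```perl
use strict;
use warnings;

my $tag = q{<img alt=">click here<" src="/images/button.gif" NAME=">click<">};

# doc's img pattern: [^>]*? gives up at the '>' inside the alt value.
my @naive = $tag =~ m/<\s*img[^>]*?src\s*=\s*['"]?([^"' >\n]+)/gi;

# One attribute value: double-quoted, single-quoted, or bare.
my $value = qr/"[^"]*"|'[^']*'|[^\s>]+/;

# Step over whole attributes (quoted values included) until src turns up.
my ($src) = $tag =~ m{
    <img\b
    (?: \s+ [\w-]+ (?: \s*=\s* $value )? )*?   # skip earlier attributes
    \s+ src \s*=\s*
    ( $value )                                 # src value, quotes included
}xi;

# Strip a matching pair of surrounding quotes, if any.
$src =~ s/^(["'])(.*)\1$/$2/s if defined $src;

print "naive: ", scalar(@naive), " match(es)\n";  # 0
print "src:   $src\n";                            # /images/button.gif
```

It works on this one tag, but the regex is already harder to read than the TokeParser loop above, which is rather the point.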

My parents just came back from a planet where the dominant life form had no
bilateral symmetry, and all I got was this stupid F-Shirt.