OverlordQ has asked for the wisdom of the Perl Monks concerning the following question:

Title sums it up, but let me try to give you a brief rundown. I made the (Pseudo)-Spanglish Translator, but I noticed a bug while using it: it kills the stylesheet links. Original regexes:
    $text =~ s#src=\"(.*?)\"#&resolveimg($1)#sige;
    $text =~ s#href\=\"(.*?)\"#&resolvehref($1)#sige;
What I want is this: if it's an image link (img src="blah") or a CSS stylesheet link (link href="blah"), use the subroutine &resolveimg on it; for the rest of the links (a href="blah"), use &resolvehref. This is what I *tried* to come up with:
    $text =~ s#(?:img.*?src|link.*?href)\=\"(.*?)\"#&resolveimg($1)#sige;
    $text =~ s#(?:a.*?href)\=\"(.*?)\"#&resolvehref($1)#sige;
Which doesn't seem to work very well as I'm getting some rather strange output. TIA.
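
A likely reason for the strange output (a guess, since no sample was posted): the (?:...) prefix in the attempt above is part of the match but not part of the replacement, so the tag name and any attributes before src/href are thrown away. With /s and /i, "a.*?href" can also start at any letter "a" in the page and run across tag boundaries. Below is a minimal sketch that captures the prefix and puts it back. The two resolver subs are hypothetical stand-ins that take a URL and return a URL; the real subs are not shown in the post and may behave differently.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the poster's subs (assumed: URL in, URL out).
sub resolveimg  { my ($url) = @_; return "IMG($url)" }
sub resolvehref { my ($url) = @_; return "HREF($url)" }

sub rewrite_links {
    my ($text) = @_;

    # Group 1 keeps everything from "<img"/"<link" up to the opening quote,
    # group 2 is the URL. [^<>] keeps the match inside a single tag.
    $text =~ s#(<img\b[^<>]*?\bsrc="|<link\b[^<>]*?\bhref=")(.*?)"#$1 . resolveimg($2) . '"'#sige;

    # Same idea for anchors; "<a\b" will not match "<abbr" or "</a>".
    $text =~ s#(<a\b[^<>]*?\bhref=")(.*?)"#$1 . resolvehref($2) . '"'#sige;

    return $text;
}

print rewrite_links(qq{<link rel="stylesheet" href="style.css"><a class="x" href="page.html">go</a>\n});
# prints: <link rel="stylesheet" href="IMG(style.css)"><a class="x" href="HREF(page.html)">go</a>
```

This still only handles double-quoted values and will trip over a ">" inside an attribute, so it is a patch for the posted regexes rather than a real HTML parser.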

Edit: ysth's solution has been the closest match so far; a quick edit gives me:
    $text =~ s#(\bimg\b[^<>]*src|\blink\b[^<>]*href)\=\"(.*?)\"#$1 . "=\"" . &resolveimg($2) . "\""#sige;
    $text =~ s#(?:\ba\b[^<>]*href)\=\"(.*?)\"#&resolvehref($1)#sige;

20031110 Edit by jeffa: Changed title from 'YaRP (Yet another Regex Problem) '

Re: Problem with parsing HTML with Regex's
by jeffa (Bishop) on Nov 10, 2003 at 13:30 UTC
    Parsing HTML with regexes should be a last resort, IMHO. Use a parser instead. That's why it's called HTML parsing! Here is a demonstration using HTML::TokeParser::Simple and doc's data.
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;
    use Data::Dumper;

    my $parser = HTML::TokeParser::Simple->new(\*DATA);
    my (@img, @link, @a);

    while (my $token = $parser->get_token) {
        if ($token->is_start_tag('img')) {
            push @img, $token->return_attr->{src};
        }
        elsif ($token->is_start_tag('link')) {
            push @link, $token->return_attr->{href};
        }
        elsif ($token->is_start_tag('a')) {
            push @a, $token->return_attr->{href};
        }
    }

    print Dumper \@img, \@link, \@a;

    __DATA__
    <A href=normal.link2 class="foo" >
    <img src="img.link2" alt="foo">
    <a class=foo href='normal.link4'>
    <img height=20 width=25 src=img.link3 >
    <IMG src='img.link4'>
    <link href="css.link1">
    <a class=foo href="normal.link1">
    <img src="img.link1">
    <a href="normal.link3">
    <a Href='normal.link5'>
    It is not less code than doc's, but it is much more readable. You might want to throw in error checking for missing href and src attributes.
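
The error checking mentioned above could look roughly like this. It is a sketch, not a drop-in: it assumes a release of HTML::TokeParser::Simple that provides get_attr (the newer name for return_attr, as far as I know), and the sub name collect_links and the %url_attr table are made up for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Which attribute carries the URL for each tag we care about.
my %url_attr = (img => 'src', link => 'href', a => 'href');

# Returns a hashref of tag name => list of URLs found for that tag.
sub collect_links {
    my ($html) = @_;
    my %found = map { $_ => [] } keys %url_attr;

    my $parser = HTML::TokeParser::Simple->new(\$html);
    while (my $token = $parser->get_token) {
        for my $tag (keys %url_attr) {
            next unless $token->is_start_tag($tag);
            my $url = $token->get_attr($url_attr{$tag});
            if (defined $url && length $url) {
                push @{ $found{$tag} }, $url;
            }
            else {
                # e.g. <a name="top"> or an <img> with no src
                warn "skipping <$tag> without $url_attr{$tag}\n";
            }
        }
    }
    return \%found;
}

my $links = collect_links(q{<img src="a.png"><img alt="no source"><a name="top"></a><A HREF=b.html>});
print "$_: @{ $links->{$_} }\n" for sort keys %$links;
```

Tags without the expected attribute are reported on STDERR and skipped, so no undef values end up in the lists.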

    jeffa

    
      Same exact thing using YAPE::HTML ;) (YAPE::HTML is pure Perl, for those who don't know)
      use YAPE::HTML;
      use Data::Dumper;
      use strict;
      use warnings;

      my $content = q[
      <img src="img.link1">
      <img src="img.link2" alt="foo">
      <img height=20 width=25 src=img.link3 >
      <IMG src='img.link4'>
      <link href="css.link1">
      <a class=foo href="normal.link1">
      <A href=normal.link2 class="foo" >
      <a href="normal.link3">
      <a class=foo href='normal.link4'>
      <a Href='normal.link5'>
      ];

      my $parser = YAPE::HTML->new($content);
      my( @a, @link, @img );

      # here is the tokenizing part
      while ( my $chunk = $parser->next ) {
          if( $chunk->type eq 'tag' ){
              if( $chunk->tag eq 'a' ){
                  push @a, $chunk->get_attr('href') if $chunk->has_attr('href');
              }
              elsif( $chunk->tag eq 'link' ){
                  push @link, $chunk->get_attr('href') if $chunk->has_attr('href');
              }
              elsif( $chunk->tag eq 'img' ){
                  push @img, $chunk->get_attr('src') if $chunk->has_attr('src');
              }
          }
      }

      print Dumper \@img, \@link, \@a;

      __END__
      $VAR1 = [
                'img.link1',
                'img.link2',
                'img.link3',
                'img.link4'
              ];
      $VAR2 = [
                'css.link1'
              ];
      $VAR3 = [
                'normal.link1',
                'normal.link2',
                'normal.link3',
                'normal.link4',
                'normal.link5'
              ];


Re: Problem with parsing HTML with Regex's
by diotalevi (Canon) on Nov 10, 2003 at 07:27 UTC

    I already gave you an expression in the chatterbox. It's as close as I think you'll get without the use of some module (which is a really good idea, BTW).

    s(
        ((?#1)                # Capture the entire tag beginning
            <img(?s:.+?)
            src\s*=\s*        # Optional space
        )
        ((?#2)['"])           # Capture the delimiter
        ((?#3)(?s:.*?))       # Capture the URL
        \2                    # Use whatever delimiter was used to start the URL
    )
    { "$1$2" . resolveimg( $3 ) . $2 }gixe
      Doesn't that basically do the exact same as: $text =~ s#src=\"(.*?)\"#&resolveimg($1)#sige; ? And that wasn't the problem, the problem was the stylesheet links.
Re: Problem with parsing HTML with Regex's
by Anonymous Monk on Nov 10, 2003 at 07:48 UTC
    Which doesn't seem to work very well as I'm getting some rather strange output. TIA.
    Such as?

    Parsing HTML (which is what you're trying to do) with regular expressions is hard (more so when you're green). But as usual, there is always CPAN (HTML::StripScripts::Regex, YAPE::HTML).

    use Regexp::Common qw /delimited/;

    my $text = q~ qqq <img src = "src" > sss <img src='src' > ~;

    $text =~ s~
        img \s+ src \s* \= \s*
        (?:
            $RE{delimited}{-delim=>'"'}
          |
            $RE{delimited}{-delim=>"'"}
        )
    ~bongo~sigx;

    print $text,$/,$/;

    __END__
    qqq <bongo > sss <bongo >
Re: Problem with parsing HTML with Regex's
by doc (Scribe) on Nov 10, 2003 at 13:05 UTC
    local $/;
    my $data = <DATA>;

    my @img = $data =~ m/<\s*img[^>]*?src\s*=\s*['"]?([^"' >\n]+)/gi;
    my @css = $data =~ m/<\s*link[^>]*?href\s*=\s*['"]?([^"' >\n]+)/gi;
    my @lnk = $data =~ m/<\s*a[^>]*?href\s*=\s*['"]?([^"' >\n]+)/gi;

    use Data::Dumper;
    print Dumper \@img, \@css, \@lnk;

    __DATA__
    <img src="img.link1">
    <img src="img.link2" alt="foo">
    <img height=20 width=25 src=img.link3 >
    <IMG src='img.link4'>
    <link href="css.link1">
    <a class=foo href="normal.link1">
    <A href=normal.link2 class="foo" >
    <a href="normal.link3">
    <a class=foo href='normal.link4'>
    <a Href='normal.link5'>

    __END__
    $VAR1 = [
              'img.link1',
              'img.link2',
              'img.link3',
              'img.link4'
            ];
    $VAR2 = [
              'css.link1'
            ];
    $VAR3 = [
              'normal.link1',
              'normal.link2',
              'normal.link3',
              'normal.link4',
              'normal.link5'
            ];
Re: Problem with parsing HTML with Regex's
by ysth (Canon) on Nov 10, 2003 at 07:28 UTC
    Without a sample line that is going wrong, this is just a shot in the dark, but does replacing each .*? with [^=]* and putting \b before and after img, a, and link help?

    You really ought to be doing this with HTML::Parser.

      Without a sample line that is going wrong, this is just a shot in the dark, but does replacing each .*? with [^=]* and putting \b before and after img, a, and link help?
      [^=]* is not much better than .*?. "A" is not "=", "B" is not "=", "C" is not "=", "D" is not "=", .... Since he's parsing html, he should replace .*? with \s* (regular expressions are easy to write if you know precisely what you're matching).
        But who's to say that the input will always look like "img src...", it could be "img border" or anything like that.
        Try:
            s#(?:\bimg\b[^<>]*src|\blink\b[^<>]*href)\=\"(.*?)\"#...
        and
            s#(?:\ba\b[^<>]*href)\=\"(.*?)\"#
        (diotalevi's solution may be a better place to start, adjusting the .+ to not allow intervening tags.)

        Or you may want to follow this suggestion; I had assumed you wanted to cover even something like <a title="whoohoo" href=...>, so I didn't switch to \s*.

        But given that you want to be able to handle *any* web page that is entered, you ought to use a real html parser instead. (BTW, I tried http://google.com and was disappointed that the buttons didn't get translated.)

Re: Problem with parsing HTML with Regex's
by Anonymous Monk on Nov 10, 2003 at 08:21 UTC
    Edit: Yes I know I should be using a module, but that's the reason I'm asking, I don't want to use one.
    You don't have to use it to use it. Look at the source and borrow the appropriate regular expressions.
Re: Problem with parsing HTML with Regex's
by idsfa (Vicar) on Nov 10, 2003 at 16:01 UTC

    None of the proposed regex solutions properly handle:

    <img alt=">click here<" src="/images/button.gif" NAME=">click<">

    Yet Another Reason to use an HTML parser.

    Updated:
    None of the proposed purely regex solutions ... happy now? ;-)

    Yes, the given regexes can all be modified to work with this (psychotic) example. The point is that parsing HTML is difficult, and that this wheel has already been invented a few times.
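
For what it's worth, one way such a modification might look is sketched below. Patterns of the form <img[^>]*?src stop at the first ">", which in the example sits inside the alt value, so they find nothing. The sketch instead skips quoted attribute values as whole units. The sub name img_srcs is made up for illustration, and this is still not a parser: comments, script blocks and malformed quoting will fool it.

```perl
use strict;
use warnings;

# Return the src of every <img> tag, treating quoted attribute values as
# opaque units so that a ">" inside them does not end the tag early.
sub img_srcs {
    my ($html) = @_;
    my @srcs;
    while ($html =~ m{
        <\s*img\b
        (?: [^>"'] | "[^"]*" | '[^']*' )*?    # skip over other attributes
        \bsrc\s*=\s*
        (?: "([^"]*)" | '([^']*)' | ([^\s>]+) )   # double, single, or no quotes
    }gisx) {
        # Exactly one of the three capture groups is defined per match.
        push @srcs, grep { defined } ($1, $2, $3);
    }
    return @srcs;
}

print "$_\n" for img_srcs(q{<img alt=">click here<" src="/images/button.gif" NAME=">click<">});
# prints: /images/button.gif
```

Getting from the one-line regexes to this took three alternations and a loop, which is rather the point about reusing an existing parser.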


      *cough*cough* YAPE::HTML is a regex solution ;)