hostile17 has asked for the wisdom of the Perl Monks concerning the following question:

I have a regex which is supposed to strip images out of HTML and replace them with
   [image]
and if there's an ALT attribute for an image, replace it with
   [image:"alt tag"]
my code is like this:
    $html =~ s/<IMG.*?ALT="(\[^"\]*)"\[^>\]*>/\[image: "$1"\]/sgi;
but it doesn't work when it encounters an image without an alt tag --- it matches forward until it comes across an image tag which *does* have one.
So instead of:
<IMG SRC="foo"><BR> bar bar bar<BR> <IMG SRC="foo" alt="bar">

becoming
[image]<BR> bar bar bar<BR> [image: "bar"]

it matches everything including the text and I just get
[image: "bar"]
I need the /sgi options at the end due to the variability of the HTML, but how do I rewrite the regex
to make the matching of the ALT tag optional?
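
The behaviour described above can be reproduced with a small sketch. It assumes the backslashes before the square brackets in the posted regex are not meant literally, and the variable names `$with_dot` and `$with_class` are made up for illustration. It only shows why the match runs across tags; it does not yet make ALT optional.

```perl
use strict;
use warnings;

my $html = '<IMG SRC="foo"><BR> bar bar bar<BR> <IMG SRC="foo" alt="bar">';

# .*? is lazy, but it may still step over the first tag's closing >
# (and, with /s, over newlines) until it reaches the first ALT=" it can find.
(my $with_dot = $html) =~ s/<IMG.*?ALT="([^"]*)"[^>]*>/[image: "$1"]/sgi;
print "$with_dot\n";      # [image: "bar"]

# [^>]*? cannot leave the tag it started in, so the first image is simply
# left alone instead of swallowing the text that follows it.
(my $with_class = $html) =~ s/<IMG[^>]*?ALT="([^"]*)"[^>]*>/[image: "$1"]/sgi;
print "$with_class\n";    # <IMG SRC="foo"><BR> bar bar bar<BR> [image: "bar"]
```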

Re: Regex For HTML Image Tags?
by Desdinova (Friar) on Mar 27, 2001 at 12:10 UTC
    You could also look at HTML::TokeParser on CPAN. It is a great little module for parsing HTML elements. I personally like to avoid trying to parse something like HTML with regexes; there are way too many gotchas for what I know of them.
Re: Regex For HTML Image Tags?
by alfie (Pilgrim) on Mar 27, 2001 at 11:51 UTC
    You need to tweak your regular expression a little bit. Let me start with this fast diddle:
    $html =~ s/<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>/[image: "$1"]/sgi;
    Let me explain what I did: I changed your .*? to [^>]+? because here you only want to match characters that don't end the <img> tag. I also think + is more sensible than * there, since there is no need to catch an empty tag and there must be at least one whitespace character in between :-)

    Secondly, why did you escape the brackets in the alt-tag, and at the end? That doesn't really make sense, for you want the special meaning of it at that point. There is also no need to escape it in the replacement string for they don't have a special meaning there.

    And you need to put parentheses followed by a ? around the alt part because, as you already noticed, the regex otherwise wouldn't match tags without an alt attribute. I used (?: so the group won't get captured.

    This will produce the following:

    Input:  <img foo><img alt="bar">
    Output: [image: ""][image: "bar"]
    If you want just a plain [image] in the replacement when there is no alt attribute present, I guess that wouldn't be possible with a single substitution, but you can still do the following substitution afterwards:
    $html =~ s/\[image: ""\]/[image]/g;
    HTH & HAND!
    --
    Alfie
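
Put together, the two-pass approach from this reply might look like the sketch below. The sub name `strip_images_two_pass` is hypothetical, and the first regex assumes the optional-group `?` that the explanation describes. Warnings for an undefined `$1` are silenced locally, since the ALT group does not take part in every match.

```perl
use strict;
use warnings;

# Hypothetical helper; the two substitutions follow the reply above.
sub strip_images_two_pass {
    my ($html) = @_;
    no warnings 'uninitialized';    # $1 is undef when the ALT group did not match
    # Pass 1: every image becomes [image: "..."], with an empty string if no ALT.
    $html =~ s/<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>/[image: "$1"]/sgi;
    # Pass 2: collapse the empty form to a plain [image].
    $html =~ s/\[image: ""\]/[image]/g;
    return $html;
}

print strip_images_two_pass(
    '<IMG SRC="foo"><BR> bar bar bar<BR> <IMG SRC="foo" alt="bar">'
), "\n";
# prints: [image]<BR> bar bar bar<BR> [image: "bar"]
```

One side effect of the second pass: an image with an explicitly empty alt="" also ends up as a plain [image].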
      it is possible in one regular expression though:
      $html=~s/<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>/"[image".((defined $1)?": \"$1\"":"")."]"/sgei;
      short explanation:
      the match starts with "<img" followed by characters that are not the end of the tag (non-greedy, or they would swallow the ALT part as well), then the ALT part, which is made optional by "(?: )?". That should be self-explanatory with basic Perl knowledge.

      we then substitute with an expression (the /e modifier)

      long explanation:
      I am too lazy to write this.
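
For anyone who wants a longer version anyway, here is a sketch of the same single-pass idea with the replacement spelled out. The sub name `strip_images_one_pass` and the brace delimiters are choices made for this example only; the pattern is the one given above.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the single substitution.
sub strip_images_one_pass {
    my ($html) = @_;
    # /e makes the right-hand side Perl code that runs once per match.
    # $1 is only defined when the optional ALT group took part in the match.
    $html =~ s{<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>}
              { defined $1 ? qq{[image: "$1"]} : '[image]' }sgei;
    return $html;
}

print strip_images_one_pass(
    '<IMG SRC="foo"><BR> bar bar bar<BR> <IMG SRC="foo" alt="bar">'
), "\n";
# prints: [image]<BR> bar bar bar<BR> [image: "bar"]
```

Because the code tests defined($1) rather than truth, an image with alt="" comes out as [image: ""] here, not as a plain [image].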

        Let me see.

        You would match that text inside of attribute values for other tags.

        You fail to consider that the closing > can appear in the values of other attributes for the IMG tag. There are quite a few which could have it.

        The alt attribute may be quoted with "", '', or nothing at all. You only deal with one of these cases.

        There is optional whitespace between ALT and = and = and the value. Not accounted for.

        In my experience the odds of your being bitten are highest for the different delimiter, then for munging up text that appeared in quoted delimiters. The others are possible but unlikely.

        If you know your data, then an RE is OK. I have certainly done that. But if you don't, then an RE hack will break sooner or later...
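
To make some of these points concrete, here is a rough sketch of a regex that tolerates a > inside quoted attribute values, all three quoting styles for alt, and whitespace around the =. The sub names `strip_images` and `image_label` are invented for this example. It is still a hack: it does nothing about image-like text inside the attribute values of other tags, and it can be fooled by the text " alt=" appearing inside another attribute's value.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one complete <img ...> tag into its label.
sub image_label {
    my ($tag) = @_;
    # alt value may be "double-quoted", 'single-quoted', or bare.
    if ($tag =~ /\salt\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>"']+))/i) {
        my ($alt) = grep { defined } $1, $2, $3;
        return qq{[image: "$alt"]};
    }
    return '[image]';
}

# Hypothetical helper: replace every <img> tag in a string.
sub strip_images {
    my ($html) = @_;
    $html =~ s{
        ( <img\b                               # start of the tag
          (?: "[^"]*" | '[^']*' | [^>"'] )*    # quoted values may contain a >
          > )                                  # the real end of the tag
    }{ image_label($1) }sgeix;
    return $html;
}

print strip_images(q{<IMG SRC="a>b.gif" ALT = 'it is > here'> <img src=x alt=plain> <img src="y">}), "\n";
# prints: [image: "it is > here"] [image: "plain"] [image]
```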

        I had been munging on a regex as well (just an exercise), and I think the extended regex clarifies things a bit:
        $html = <DATA>;
        $html =~ s/<IMG \s+                           #match the IMG tag
                   SRC \s* = \s* "[^"]+" \s*          #match the Source
                   (ALT \s* = \s* "([^"]+)" \s*)?     #match an optional Alt
                   >                                  #end of tag
                  /'[image' . ($2 ? ": $2" : '') .']' #print the image stuff
                  /sgixe;
        print $html;
        __DATA__
        <IMG SRC="foo"><BR> bar bar bar<BR> <IMG SRC="foo" alt="bar">
        This works, but keep in mind that the IMG tag is still valid if for example, the SRC and the ALT are reversed in order.

        That's why HTML::TokeParser (as Desdinova pointed out already) or maybe even (if the HTML is yours) Template Toolkit are better approaches.

        Cheers,

        Jeroen
        "We are not alone"(FZ)

        I knew about the /e modifier (somewhere, deeply hidden in my memories) but couldn't find it quickly in the manual pages. Strangely, it's the first modifier described in the perlop section *hmm*
        Thanks for pointing it out, I simply hadn't found it :)
        --
        Alfie
Re: Regex For HTML Image Tags?
by merlyn (Sage) on Mar 27, 2001 at 21:42 UTC
    Unable to test right now, but this should work:
    use HTML::Parser;
    HTML::Parser->new(
      default_h => [sub { print shift; }, "text"],
      start_h => [sub {
        my ($text, $tagname, $attr) = @_;
        return print $text unless $tagname eq "img";
        if ($attr->{alt}) {
          print "[image: \"$attr->{alt}\"]";
        } else {
          print "[image]";
        }
      }, "text,tagname,attr"],
    )->parse(join "", <DATA>);
    __END__
    <IMG SRC="foo"><BR> bar bar bar<BR> <IMG SRC="foo" alt="bar">

    -- Randal L. Schwartz, Perl hacker

      Thank you all very much indeed. I appreciate it.

      I do know in my heart that I should use a module, but I also feel the need to wrestle with and write my own code, in order to learn...

      I'm going to write another more detailed question about the procedure I'm using, which I'm sure will give you all lots to laugh at.

      h17