Regex For HTML Image Tags?

hostile17 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex For HTML Image Tags? by Desdinova (Friar) on Mar 27, 2001 at 12:10 UTC
You could also look at HTML::TokeParser on CPAN. It is a great little module for parsing HTML elements. Personally, I like to avoid trying to parse something like HTML with regexes; there are way too many gotchas for what I know of regexes.
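A minimal sketch of the HTML::TokeParser route suggested above, assuming the module is installed from CPAN. The sub name `replace_images` is made up for illustration. Every token other than an `img` start tag is copied through unchanged.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # CPAN module, not part of core Perl

# Hypothetical helper: replace each <img> tag with "[image]" or
# '[image: "alt text"]' and pass everything else through untouched.
sub replace_images {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html);
    my $out    = '';
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' && $token->[1] eq 'img') {
            # Start tag: [ "S", $tag, \%attr, \@attrseq, $text ]
            my $alt = $token->[2]{alt};
            $out .= (defined $alt && length $alt)
                  ? qq([image: "$alt"])
                  : '[image]';
        }
        elsif ($type eq 'S')  { $out .= $token->[4] }   # other start tags
        elsif ($type eq 'E')  { $out .= $token->[2] }   # end tags
        elsif ($type eq 'PI') { $out .= $token->[2] }   # processing instructions
        else                  { $out .= $token->[1] }   # text, comments, declarations
    }
    return $out;
}

print replace_images(q{<IMG SRC="foo"><BR> bar <IMG SRC="foo" alt='bar'>}), "\n";
```

Because the parser deals with quoting and attribute order, the single-quoted `alt` value in the sample needs no special handling.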
Re: Regex For HTML Image Tags? by alfie (Pilgrim) on Mar 27, 2001 at 11:51 UTC
You need to tweak your regular expression a little bit. Let me start with this fast diddle: `$html =~ s/<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>/[image: "$1"]/sgi;` Let me explain what I did: I changed your `.*?` to `[^>]+?`, because you only want to match characters that are not the end delimiter of the `<img>` tag there. I also think it's more sensible to use `+` than `*`, because there is no need to catch an empty tag, and there must be at least one whitespace character in between :-) Secondly, why did you escape the brackets in the alt part, and at the end? That doesn't really make sense, because you want their special meaning at that point. There is also no need to escape them in the replacement string, because they have no special meaning there. And you need to put brackets followed by a `?` around the alt part because, as you already noticed, it otherwise wouldn't match tags without an alt attribute. I did it with `(?:` so it won't get stored. With this, the input `<img foo><img alt="bar">` becomes `[image: ""][image: "bar"]`. If you want just a plain `[image]` in the replacement when no alt attribute is present, I guess that wouldn't be possible with a single substitution, but you can still run the following substitution afterwards: `$html =~ s/\[image: ""\]/[image]/g;` HTH & HAND! -- Alfie
Re: Re: Regex For HTML Image Tags? by zodiac (Beadle) on Mar 27, 2001 at 12:45 UTC
It is possible in one regular expression, though: `$html =~ s/<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>/"[image".((defined $1)?": \"$1\"":"")."]"/sgei;` Short explanation: the match starts with "<img" followed by something that is not the end of the tag (but don't be greedy, or it will also swallow the ALT part). Then comes the ALT part, which is optional: "(?: )?", which should be self-explanatory (with basic Perl knowledge). We then substitute with an expression (the /e modifier). Long explanation: I am too lazy to write this.
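For readers who want the longer explanation, here is a sketch of the same idea spread out with the /x modifier and commented piece by piece. The sub name `images_to_text` is invented for this example, and the replacement is written with `qq()` rather than escaped quotes. Otherwise it follows the one-liner above, with the `*` quantifiers written out.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the single-substitution approach.
sub images_to_text {
    my ($html) = @_;
    $html =~ s{
        <IMG              # start of the tag (any case, because of /i)
        [^>]+?            # as few non-'>' characters as possible
        (?:               # optional, non-capturing group for the ALT part
            ALT="([^"]*)" # capture the text between the double quotes
            [^>]*         # whatever else is left inside the tag
        )?
        >                 # end of the tag
    }{
        # With /e this block is Perl code; its value is the replacement.
        '[image' . (defined $1 ? qq(: "$1") : '') . ']'
    }sgeix;
    return $html;
}

print images_to_text('<img foo><img alt="bar">'), "\n";
# prints: [image][image: "bar"]
```

The `defined $1` test is what lets one substitution cover both cases: when the optional group does not take part in the match, `$1` is undefined and the plain `[image]` form is produced.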
Re (tilly) 3: Regex For HTML Image Tags? by tilly (Archbishop) on Mar 27, 2001 at 17:13 UTC
Let me see. You would match that text inside of attribute values for other tags. You fail to consider that the closing > can appear in the values of other attributes for the IMG tag. There are quite a few which could have it. The alt attribute may be quoted with "", '', or nothing at all. You only deal with one of these cases. There is optional whitespace between ALT and = and = and the value. Not accounted for. In my experience the odds of your being bitten are highest for the different delimiter, then for munging up text that appeared in quoted delimiters. The others are possible but unlikely. If you know your data, then an RE is OK. I have certainly done that. But if you don't, then an RE hack will break sooner or later...
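The failure modes listed above can be seen with a small test script. This is only a sketch: the sub name `naive_replace` and the sample tags are made up, and the pattern is the single-expression substitution posted earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex-only approach from this thread.
sub naive_replace {
    my ($html) = @_;
    $html =~ s{<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>}
              { defined $1 ? qq([image: "$1"]) : '[image]' }sgei;
    return $html;
}

my @cases = (
    q{<img src="a.gif" alt="bar">},                  # intended case: [image: "bar"]
    q{<img src="a.gif" alt='bar'>},                  # single quotes: alt text lost, [image]
    q{<img src="a.gif" alt=bar>},                    # no quotes: alt text lost, [image]
    q{<img src="a.gif" alt = "bar">},                # spaces around =: alt text lost, [image]
    q{<img src="a.gif" title="a > b" alt="bar">},    # '>' in a value: tag cut short
    q{<a href="x" title='<img alt="bar">'>link</a>}, # match inside another tag's attribute
);

print "$_\n  => ", naive_replace($_), "\n" for @cases;
```

The fifth case leaves ` b" alt="bar">` behind in the output, and the sixth rewrites the `title` value of the link, which are the two kinds of damage described above.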
Re:{3} Regex For HTML Image Tags? by jeroenes (Priest) on Mar 27, 2001 at 14:21 UTC
I had been munging on a regex as well (just an exercise), and I think the extended regex clarifies things a bit:

    $html = <DATA>;
    $html =~ s/<IMG \s+                       #match the IMG tag
        SRC \s* = \s* "[^"]+" \s*             #match the Source
        (ALT \s* = \s* "([^"]+)" \s*)?        #match an optional Alt
        >                                     #end of tag
        /'[image' . ($2 ? ": $2" : '') .']'   #print the image stuff
        /sgixe;
    print $html;
    __DATA__
    <IMG SRC="foo"><BR> bar bar bar<BR> <IMG SRC="foo" alt="bar">

This works, but keep in mind that the IMG tag is still valid if, for example, the SRC and the ALT are reversed in order. That's why HTML::TokeParser (as Desdinova pointed out already) or maybe even (if the HTML is yours) Template Toolkit are better approaches. Cheers, Jeroen "We are not alone"(FZ)
Re: Re: Re: Regex For HTML Image Tags? by alfie (Pilgrim) on Mar 27, 2001 at 13:22 UTC
I knew about that (somewhere, deep hidden in my memories), but couldn't find it quickly in the manual pages. Strangely, it's the first modifier described in the perlop section, hmm. Thanks for pointing it out, I simply hadn't found it :) -- Alfie
Re: Regex For HTML Image Tags? by merlyn (Sage) on Mar 27, 2001 at 21:42 UTC
Unable to test right now, but this should work:

    use HTML::Parser;

    HTML::Parser->new(
        default_h => [sub { print shift; }, "text"],
        start_h   => [sub {
            my ($text, $tagname, $attr) = @_;
            return print $text unless $tagname eq "img";
            if ($attr->{alt}) {
                print "[image: \"$attr->{alt}\"]";
            } else {
                print "[image]";
            }
        }, "text,tagname,attr"],
    )->parse(join "", <DATA>);
    __END__
    <IMG SRC="foo"><BR> bar bar bar<BR> <IMG SRC="foo" alt="bar">

-- Randal L. Schwartz, Perl hacker
Re: Re: Regex For HTML Image Tags? by hostile17 (Novice) on Mar 28, 2001 at 04:35 UTC
Thank you all very much indeed. I appreciate it. I do know in my heart that I should use a module, but I also feel the need to wrestle with and write my own code, in order to learn... I'm going to write another more detailed question about the procedure I'm using, which I'm sure will give you all lots to laugh at. h17