in reply to Problem with parsing HTML with Regex's

Without a sample line that is going wrong, this is just a shot in the dark, but does replacing each .*? with [^=]* and putting \b before and after img, a, and link help?

You really ought to be doing this with HTML::Parser.

Re: Re: Problem with parsing HTML with Regex's
by Anonymous Monk on Nov 10, 2003 at 07:58 UTC
    Without a sample line that is going wrong, this is just a shot in the dark, but does replacing each .*? with [^=]* and putting \b before and after img, a, and link help?
    [^=]* is not much better than .*?, because it still matches almost every character: "A" is not "=", "B" is not "=", "C" is not "=", "D" is not "=", and so on. Since he's parsing HTML, he should replace .*? with \s* (regular expressions are easy to write if you know precisely what you're matching).
      But who's to say that the input will always look like "img src..."? It could be "img border" or anything like that.
Re: Re: Problem with parsing HTML with Regex's
by OverlordQ (Hermit) on Nov 10, 2003 at 07:39 UTC
      Try:
      s#(?:\bimg\b[^<>]*src|\blink\b[^<>]*href)\=\"(.*?)\"#... and s#(?:\ba\b[^<>]*href)\=\"(.*?)\"#
      (diotalevi's solution may be a better place to start, adjusting the .+ to not allow intervening tags.)

      Or you may want to follow this suggestion; I had assumed you wanted to cover even something like <a title="whoohoo" href=...>, so I didn't switch to \s*.
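
To make the trade-off concrete, here is a minimal sketch comparing the three approaches on the a/href case. The original substitution is not shown in this thread, so the "loose" pattern is only an assumed shape of it, and the names %pattern and first_href as well as both sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Three candidate patterns for pulling the href out of an <a> tag.
my %pattern = (
    loose   => qr/a.*?href="(.*?)"/,          # assumed shape of the original
    strict  => qr/\ba\b\s*href="(.*?)"/,      # the \s* suggestion
    bounded => qr/\ba\b[^<>]*href="(.*?)"/,   # \b plus [^<>]*, as above
);

# Return the captured URL, or undef when the pattern does not match.
sub first_href {
    my ($name, $html) = @_;
    return $html =~ $pattern{$name} ? $1 : undef;
}

# Hypothetical inputs: a real anchor with an extra attribute,
# and a line that contains no <a> tag at all.
my $anchor    = '<a title="whoohoo" href="page.html">x</a>';
my $no_anchor = '<table summary="x"><link href="style.css">';

for my $name (qw(loose strict bounded)) {
    for my $html ($anchor, $no_anchor) {
        my $url = first_href($name, $html);
        printf "%-8s %-45s => %s\n",
            $name, $html, defined $url ? $url : '(no match)';
    }
}
```

On these two inputs the loose pattern picks up the "a" inside "table" and wrongly reports style.css, the \s* pattern misses the real anchor because of the title attribute, and the [^<>]* pattern gets both right. It is still only a regex, so comments, unquoted attributes, and ">" inside attribute values will trip it up.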

      But given that you want to be able to handle *any* web page that is entered, you ought to use a real HTML parser instead. (BTW, I tried http://google.com and was disappointed that the buttons didn't get translated.)
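
For comparison, a rough sketch of the parser route using HTML::Parser (a CPAN module, not part of core Perl). It only collects the URLs; what the original code did with them is not shown in this thread, so the rewriting step is left out. The helper name extract_urls, the %url_attr table, and the sample markup are assumptions made up for this example.

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, not part of core Perl

# Which attribute carries the URL for each tag discussed in the thread.
my %url_attr = (a => 'href', link => 'href', img => 'src');

# Return a list of [tag, url] pairs found in the given HTML string.
sub extract_urls {
    my ($html) = @_;
    my @found;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # start_h fires once per opening tag; tag and attribute names
        # arrive lowercased by default.
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                my $name = $url_attr{$tag} or return;
                push @found, [$tag, $attr->{$name}]
                    if defined $attr->{$name};
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return @found;
}

# Hypothetical input: mixed case, an unquoted value, extra attributes.
my @urls = extract_urls(
    '<A TITLE="whoohoo" HREF=page.html>x</A><img border=0 src="pic.gif">'
);
print "$_->[0]: $_->[1]\n" for @urls;
```

The parser takes care of case, quoting, and attribute order, which are exactly the things the regex versions keep tripping over.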