Re: Problem with parsing HTML with Regex's

Replies are listed 'Best First'.
Re: Re: Problem with parsing HTML with Regex's by Anonymous Monk on Nov 10, 2003 at 07:58 UTC
*Without a sample line that is going wrong, this is just a shot in the dark, but does replacing each `.*?` with `[^=]*` and putting `\b` before and after img, a, and link help?*
`[^=]*` is not much better than `.*?`: "A" is not "=", "B" is not "=", "C" is not "=", "D" is not "=", and so on. Since he's parsing HTML, he should replace `.*?` with `\s` (regular expressions are easy to write if you know precisely what you're matching).
Re: Re: Re: Problem with parsing HTML with Regex's by Anonymous Monk on Nov 10, 2003 at 08:02 UTC
But who's to say that the input will always look like "img src..."? It could be "img border" or anything like that.
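The disagreement above, over what to allow between the tag name and `src`, is easy to check by hand. Here is a minimal sketch, not anyone's posted code: the two sample tags are made up (the original poster never showed a failing line), `src_with` is a hypothetical helper, and the third filler `[^<>]*` is my assumption about what would cope with extra attributes.

```perl
use strict;
use warnings;

# Hypothetical sample tags; the poster's real input was never shown.
my @tags = (
    '<img src="a.gif">',
    '<img border="0" src="b.gif">',
);

# Candidate fillers between the tag name and the src attribute.
my @fillers = (
    [ 'whitespace only'    => qr/\s+/     ],
    [ 'anything but ='     => qr/[^=]*/   ],
    [ 'anything but < or >' => qr/[^<>]*/ ],
);

# Returns the captured src value, or undef when the pattern does not match.
sub src_with {
    my ($filler, $tag) = @_;
    return $tag =~ /\bimg\b${filler}src="(.*?)"/ ? $1 : undef;
}

for my $tag (@tags) {
    for my $f (@fillers) {
        my ($name, $re) = @$f;
        my $src = src_with($re, $tag);
        printf "%-20s %-32s %s\n", $name, $tag,
            defined $src ? $src : '(no match)';
    }
}
```

With these inputs only the third filler finds `b.gif` in the second tag. `\s+` stops at `border`, and `[^=]*` stops at the `=` in `border="0"`, so neither gets as far as `src`.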
Re: Re: Problem with parsing HTML with Regex's by OverlordQ (Hermit) on Nov 10, 2003 at 07:39 UTC
original script bad regex's
Re: Re: Re: Problem with parsing HTML with Regex's by ysth (Canon) on Nov 10, 2003 at 08:10 UTC
Try: `s#(?:\bimg\b[^<>]*src|\blink\b[^<>]*href)\=\"(.*?)\"#...` and `s#(?:\ba\b[^<>]*href)\=\"(.*?)\"#`. (diotalevi's solution may be a better place to start, adjusting the `.+` to not allow intervening tags.) Or you may want to follow this suggestion; I had assumed you wanted to cover even something like `<a title="whoohoo" href=...>`, so I didn't switch to `\s`. But given that you want to be able to handle any web page that is entered, you ought to use a real HTML parser instead. (BTW, I tried http://google.com and was disappointed that the buttons didn't get translated.)
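For what it's worth, here is a rough sketch of how a substitution along these lines might be wired up. It is not the code from the original script: `proxify` and its `/proxy.cgi?url=` target are invented for illustration, `rewrite_links` is a hypothetical wrapper, and the pattern only handles double-quoted attribute values. It is a stopgap at best; for arbitrary pages a real parser (for example the CPAN module HTML::Parser) is still the sounder route.

```perl
use strict;
use warnings;

# Hypothetical rewriter: routes a URL through a made-up proxy script.
sub proxify {
    my ($url) = @_;
    return "/proxy.cgi?url=$url";
}

# Rewrites double-quoted src/href values of img, link and a tags.
# [^<>]*? lets other attributes sit between the tag name and src/href
# without letting the match run into a neighbouring tag.
sub rewrite_links {
    my ($html) = @_;
    $html =~ s{(<(?:img\b[^<>]*?\bsrc|(?:link|a)\b[^<>]*?\bhref)=")([^"]*)(")}
              {$1 . proxify($2) . $3}gie;
    return $html;
}

print rewrite_links('<a title="whoohoo" href="http://example.com/">x</a>'), "\n";
```

The `/e` flag evaluates the replacement as Perl code, which is what lets the captured URL pass through a function. Unquoted or single-quoted attributes, and anything inside comments or scripts, are exactly the cases where this kind of pattern starts to go wrong.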