Re: Re: Problem with parsing HTML with Regex's

Replies are listed 'Best First'.
Re: Re: Re: Problem with parsing HTML with Regex's
by ysth (Canon) on Nov 10, 2003 at 08:10 UTC

Try:

s#(?:\bimg\b[^<>]*src|\blink\b[^<>]*href)\=\"(.*?)\"#...

and

s#(?:\ba\b[^<>]*href)\=\"(.*?)\"#

(diotalevi's solution may be a better place to start, adjusting the `.+` to not allow intervening tags.)

Or you may want to follow this suggestion; I had assumed you wanted to cover even something like `<a title="whoohoo" href=...>`, so I didn't switch to `\s*`.
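
To make the `[^<>]*` versus `\s*` trade-off concrete, here is a minimal sketch using only core Perl. The sample markup and the variable names are assumptions made for illustration, and the patterns are match-only variants of the substitutions above (the replacement side is elided in the post, so it is not reproduced here).

```perl
use strict;
use warnings;

# Hypothetical sample markup, not taken from the thread.
my $html = '<a title="whoohoo" href="page.html">x</a> <img src="pic.png">';

# [^<>]* skips over other attributes but cannot cross a tag boundary,
# so an href that follows a title attribute is still found.
my @loose = $html =~ m#(?:\bimg\b[^<>]*src|\ba\b[^<>]*href)\=\"(.*?)\"#g;

# \s* only matches when the attribute directly follows the tag name,
# so the <a title=... href=...> case is missed.
my @strict = $html =~ m#(?:\bimg\b\s*src|\ba\b\s*href)\=\"(.*?)\"#g;

print "loose:  @loose\n";    # loose:  page.html pic.png
print "strict: @strict\n";   # strict: pic.png
```

Neither variant handles single-quoted or unquoted attribute values, which is part of why a parser is the safer choice for arbitrary pages.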

But given that you want to be able to handle *any* web page that is entered, you ought to use a real html parser instead. (BTW, I tried http://google.com and was disappointed that the buttons didn't get translated.)
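
As one possible starting point for the parser route, the sketch below uses HTML::LinkExtor from the CPAN HTML-Parser distribution (not a core module, so it has to be installed). It only lists the link attributes it finds; rewriting them is left out. The sample markup and the `@found` array are assumptions for illustration, not anything from the original thread.

```perl
use strict;
use warnings;
use HTML::LinkExtor;    # part of the HTML-Parser distribution on CPAN

# Hypothetical sample markup, not taken from the thread.
my $html = '<a title="whoohoo" href="page.html">x</a> <img src="pic.png">'
         . '<link rel="stylesheet" href="style.css">';

my @found;

# The callback receives the tag name followed by the link-bearing
# attribute name/value pairs for that tag.
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @found, "$tag $_=$attr{$_}" for sort keys %attr;
});

$parser->parse($html);
$parser->eof;           # signal end of input to the parser

print "$_\n" for @found;
```

Because the parser tokenizes the tags itself, attribute order and quoting style stop mattering, which is the case the regexes above struggle with.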
