in reply to extract_tagged
Considering some of the nasty ways people can arrange their links, this is about as good as you can get:

    while ($text =~ /(href|<frame .*?src)[ ="']+(.*?)["'>]/g) { print $2; }

If you want to eliminate anything starting with a command: prefix other than http: (like mailto:), you can modify the above as follows:

    while ($text =~ /(href|<frame .*?src)[ ="']+((http:)?[^:]*?)["'>]/g) { print $2; }

If you find a link format that gets past this, feel free to post so I can update the regex.
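For anyone who wants to try the `(http:)?` variant on a sample, here is a minimal sketch. The `extract_links` helper, the sample HTML and the example.com addresses are my own assumptions for illustration, not part of the original snippet. The one addition is a `length` check: when the link is something like mailto:, the regex can still match with an empty capture, so empty results are skipped.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): collect the link targets
# matched by the (http:)? variant of the regex, skipping empty captures.
sub extract_links {
    my ($text) = @_;
    my @links;
    while ($text =~ /(href|<frame .*?src)[ ="']+((http:)?[^:]*?)["'>]/g) {
        # A rejected link such as mailto: can still yield an empty $2.
        push @links, $2 if length $2;
    }
    return @links;
}

# Made-up sample input for illustration only.
my $html = <<'HTML';
<a href="http://example.com/page.html">one</a>
<a href="mailto:someone@example.com">two</a>
<a href='relative/path.html'>three</a>
<frame name="main" src="frame.html">
HTML

print "$_\n" for extract_links($html);
```

Worth knowing before relying on it: because the pattern refuses any colon after the optional http:, it also drops https: links and http: links that carry a port number (http://host:8080/...). If you need those, the character class would have to be loosened, which is a different trade-off from the one made above.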