I was going to propose Regexp::Common as one more alternative, but when I ran a quick test of it, I found that it gives undesired results with some URLs (note the trailing non-URL characters, such as parentheses, commas, and semicolons, in some of the returned URLs):
% wget -qO - http://www.ebay.com |
perl -MRegexp::Common=URI -wnle 'print $1 while /($RE{URI}{HTTP})/g' | head
http://include.ebaystatic.com/js/v/us/homepage.js
http://include.ebaystatic.com/aw/pics/us/css/homepage.css
http://pics.ebaystatic.com/aw/pics/userSitePrefs/bottomDropShadow_20x20.gif)
http://pics.ebaystatic.com/aw/pics/userSitePrefs/sideDropShadow_20x20.gif)
http://pics.ebaystatic.com/aw/pics/userSitePrefs/dropshadow2_20x10.gif)
http://include.ebaystatic.com/aw/pics/css/ebay.css
http://include.ebaystatic.com/';
http://include.ebaystatic.com/js/v/us/ebaybase.js
http://include.ebaystatic.com/js/v/us/ebaysup.js
http://search.ebay.com/',
...while URI::Find::Rule does a better job of DWIM:
% wget -qO - http://www.ebay.com |
perl -MURI::Find::Rule -wlne '
print $_->[1] for URI::Find::Rule->scheme("http")->in($_)'|head
http://include.ebaystatic.com/js/v/us/homepage.js
http://include.ebaystatic.com/aw/pics/us/css/homepage.css
http://pics.ebaystatic.com/aw/pics/userSitePrefs/bottomDropShadow_20x20.gif
http://pics.ebaystatic.com/aw/pics/userSitePrefs/sideDropShadow_20x20.gif
http://pics.ebaystatic.com/aw/pics/userSitePrefs/dropshadow2_20x10.gif
http://include.ebaystatic.com/aw/pics/css/ebay.css
http://include.ebaystatic.com/
http://include.ebaystatic.com/js/v/us/ebaybase.js
http://include.ebaystatic.com/js/v/us/ebaysup.js
http://search.ebay.com/
Update: Fixed the incorrect wording. As merlyn pointed out, the unwanted trailing characters are valid URL characters. Still, I think they could be a problem for the application the OP described, so in this case Regexp::Common is not the most straightforward solution.
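If you would rather stay with Regexp::Common, one possible workaround is to trim the trailing punctuation from each match afterwards. The following is only a rough sketch: trim_url is a hypothetical helper name, the set of characters it strips is my own assumption, and the sample strings are copied from the Regexp::Common output above instead of being matched live. In real use you would call it on each $1 captured by /($RE{URI}{HTTP})/g.

```perl
use strict;
use warnings;

# Hypothetical helper: drop trailing punctuation that is legal in a URL
# but, in scraped HTML/CSS/JS, usually belongs to the surrounding text.
sub trim_url {
    my ($url) = @_;
    # Assumed set of unwanted trailing characters; adjust for your data.
    $url =~ s/[)'",;.]+\z//;
    return $url;
}

# Sample matches copied from the Regexp::Common output shown earlier.
my @matches = (
    "http://pics.ebaystatic.com/aw/pics/userSitePrefs/dropshadow2_20x10.gif)",
    "http://include.ebaystatic.com/';",
    "http://search.ebay.com/',",
    "http://include.ebaystatic.com/aw/pics/css/ebay.css",
);

# Print each URL with the trailing punctuation removed.
print trim_url($_), "\n" for @matches;
```

Note that this is lossy: a URL that legitimately ends in one of those characters (a closing parenthesis, say) would be truncated, which is part of why URI::Find::Rule still looks like the simpler choice here.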