extract URLs from text

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: extract URLs from text by rob_au (Abbot) on Jan 03, 2002 at 08:47 UTC
How about this, using URI::Find:

    #!/usr/bin/perl -Tw
    use URI::Find;
    use strict;

    my $text = " ... long string with lots of URLs ... ";
    my @urls;

    # Collect the original text of every URI found in $text.
    find_uris($text, sub {
        my ($uri, $orig_uri) = @_;
        push @urls, $orig_uri;
    });

    print join("\n", @urls), "\n";
    exit 0;

Note that the CPAN documentation for URI::Find is out of date: the newest version (0.04) exports only one function, `find_uris`, which takes two arguments, the string to be searched and a function reference.

`perl -e 's&&rob@cowsnet.com.au&&&split/[@.]/&&s&.com.&_&&&print'`
Re: extract URLs from text by Anonymous Monk on Jan 03, 2002 at 08:25 UTC
Regex to find URLs in a string may be the place to start. As for putting things in an array, I'm sure that is easier done than said.
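For illustration, the regex-and-array idea might look roughly like the sketch below. It uses core Perl only. The helper name `extract_urls`, the sample text, and the pattern itself are assumptions made for this example; the pattern is deliberately simple, covers only http and https, and will miss or over-match some URLs that a dedicated module would handle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every http/https URL found in $text, in order.
# The pattern is an assumption of this sketch, not a complete URL grammar.
sub extract_urls {
    my ($text) = @_;

    # Match the scheme, then any run of characters that are not
    # whitespace, angle brackets, or double quotes.
    my @urls = $text =~ m{(https?://[^\s<>"]+)}g;

    # Trim trailing punctuation that usually belongs to the sentence.
    s/[.,;:!?)]+\z// for @urls;

    return @urls;
}

my $sample = 'See http://www.perlmonks.org/ and (https://metacpan.org/pod/URI::Find).';
print "$_\n" for extract_urls($sample);
```

A list-context match with `/g` returns all captures at once, which is what puts the URLs straight into an array.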
Re: extract URLs from text by cormanaz (Deacon) on Sep 08, 2023 at 15:29 UTC
Thanks for the responses. Here is what I finally came up with. Should work for any shortened URL:

    use strict;
    use warnings;
    use feature ':5.10';
    use LWP::UserAgent;

    our $ua = LWP::UserAgent->new;
    $ua->max_redirect(10);

    my $uri = 'https://t.co/O4qjsxuCsV';
    say expand($uri);

    # Return the expanded form of $short, or $short itself if it cannot be expanded.
    sub expand {
        my ($short) = @_;
        my $long;
        my $response = $ua->get($short);
        return $short unless $response->is_success;

        my @redirects = $response->redirects();
        if (@redirects) {
            # The Location header of the last redirect is the final target.
            $long = $redirects[-1]->header('location');
        }
        elsif ($response->header('refresh')) {
            # Fall back to a "0;URL=..." Refresh header.
            $long = $response->header('refresh');
            $long =~ s/0\;URL\=//;
        }

        # No redirect and no Refresh header: return the URL unchanged.
        return $long // $short;
    }
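The `s/0\;URL\=//` substitution above only strips the exact prefix `0;URL=`. Refresh values seen in practice can also carry a non-zero delay, spaces after the semicolon, a lower-case `url`, or quotes around the target. A more tolerant parser might look like the sketch below; the helper name `refresh_target` and the sample values are assumptions made for this example, and the pattern is a best-effort guess at common variants rather than a specification.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the target URL out of a Refresh header value.
# Returns undef (in scalar context) when the value carries no "url=" part.
sub refresh_target {
    my ($value) = @_;
    return unless defined $value;

    # delay, separator, "url=" in any case, optional matching quotes
    if ($value =~ m{\A\s*\d+(?:\.\d*)?\s*[;,]\s*url\s*=\s*(["']?)(.*?)\1\s*\z}is) {
        return $2;
    }
    return;
}

print refresh_target('0;URL=https://example.com/long'), "\n";
print refresh_target("5; url='https://example.com/long'"), "\n";
```

In `expand`, the result of `refresh_target($response->header('refresh'))` could stand in for the substitution, with the short URL kept whenever it returns undef.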
Re^2: extract URLs from text by marto (Cardinal) on Sep 08, 2023 at 15:53 UTC
I suspect you intended to reply to Unshortening t.co links.

"Should work for any shortened URL"

The first search result I could find for 'url shorten' that didn't require payment or signing up was https://shorturl[dot]at, which does not work with this code. However, the following does work:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use feature 'say';
    use Data::Dumper;

    my $url    = 'https://shorturl[dot]at/REDACTED';
    my $uaname = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36';

    my $ua = LWP::UserAgent->new;
    $ua->agent($uaname);

    my $resp  = $ua->get($url);
    my $final = $resp->request()->uri();

    # Walk the redirect chain backwards, from the final response to the first.
    while ($resp) {
        say $resp->request()->uri();
        $resp = $resp->previous();
    }

    say "Final URL: $final";

In short, I don't think you're going to find a one-size-fits-all solution without automating a browser, for example with WWW::Mechanize::Chrome.

Update: URL cloaked a little; it's a bit spammy.
Re^2: extract URLs from text by Bod (Parson) on Sep 08, 2023 at 23:05 UTC
"Here is what I finally came up with"

According to the thread dates, it only took you 21 years! Hopefully, marto is right about you being in the wrong thread rather than in a time warp...
Re: extract URLs from text by belg4mit (Prior) on Jan 03, 2002 at 11:35 UTC
There will eventually be TheDamian's Regexp::Common.

`-- perl -pe "s/\b;([st])/'\1/mg"`