Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I was wondering if anyone knows of a way to extract URLs from a plain text file? The usual modules have not been helpful as they are geared towards HTML. I need to parse a plain text file that has numerous URLs (not links), and put them into an array. Any tips or snippets are most appreciated!

Replies are listed 'Best First'.
Re: extract URLs from text
by rob_au (Abbot) on Jan 03, 2002 at 08:47 UTC
    How about this using URI::Find ...

    #!/usr/bin/perl -Tw
    use URI::Find;
    use strict;

    my $text = " ... long string with lots of URLs ... ";
    my @urls;
    find_uris($text, sub {
        my ($uri, $orig_uri) = @_;
        push @urls, $orig_uri;
    });
    print join("\n", @urls), "\n";
    exit 0;

    Note that the CPAN documentation for URI::Find is out of date: the newest version (0.04) exports only one function, find_uris, which takes two arguments, the string to be searched and a function reference.
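
    For readers who cannot install URI::Find, the same idea (scan plain text, collect each URL into an array) can be roughly sketched with a core-Perl regex. This is not how URI::Find works internally; extract_urls is a hypothetical helper, the sample text and example.com addresses are made up, and the pattern only handles http, https and ftp URLs. It also strips trailing punctuation, so a URL that legitimately ends in ")" would be cut short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of URI::Find): return every http/https/ftp
# URL found in a plain-text string, in order of appearance.
sub extract_urls {
    my ($text) = @_;
    my @urls;
    # A scheme followed by a run of characters that cannot appear
    # unescaped in a URL written inside ordinary prose.
    while ($text =~ m{\b((?:https?|ftp)://[^\s<>"']+)}gi) {
        my $url = $1;
        # Trailing punctuation usually belongs to the sentence, not the URL.
        $url =~ s/[.,;:!?)\]]+$//;
        push @urls, $url;
    }
    return @urls;
}

my $text = "See http://www.perlmonks.org/ and (https://example.com/a?b=1), "
         . "or ftp://ftp.example.org/pub.";
print "$_\n" for extract_urls($text);
```

    A module such as URI::Find is still likely to be the better choice for real data, since it knows about more schemes and edge cases than this pattern does.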

     

    perl -e 's&&rob@cowsnet.com.au&&&split/[@.]/&&s&.com.&_&&&print'

Re: extract URLs from text
by Anonymous Monk on Jan 03, 2002 at 08:25 UTC
Re: extract URLs from text
by cormanaz (Deacon) on Sep 08, 2023 at 15:29 UTC
    Thanks for the responses. Here is what I finally came up with. Should work for any shortened URL:
    use strict;
    use feature ':5.10';
    use LWP::UserAgent;

    our $ua = LWP::UserAgent->new;
    $ua->max_redirect(10);

    my $uri = 'https://t.co/O4qjsxuCsV';
    say expand($uri);

    sub expand {
        my ($short) = @_;
        my $long;
        my $response = $ua->get($short);
        if ($response->is_success) {
            my @redirects = $response->redirects();
            if (@redirects) {
                $long = $redirects[$#redirects]->header('location');
            }
            elsif ($response->header('refresh')) {
                $long = $response->header('refresh');
                $long =~ s/0\;URL\=//;
            }
            return $long;
        }
        else {
            return $short;
        }
    }

      I suspect you intended to reply to Unshortening t.co links.

      "Should work for any shortened URL"

      The first search result I could find for 'url shorten' that didn't require payment or signing up was https://shorturldotat. It does not work with this code, but the following does:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use LWP::UserAgent;
      use feature 'say';
      use Data::Dumper;

      my $url    = 'https://shorturl[dot]at/REDACTED';
      my $uaname = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36';

      my $ua = LWP::UserAgent->new;
      $ua->agent( $uaname );

      my $resp  = $ua->get($url);
      my $final = $resp->request()->uri();

      while ($resp) {
          say $resp->request()->uri();
          $resp = $resp->previous();
      }

      say "Final URL: $final";

      In short, I don't think you're going to find a one-size-fits-all solution without automating a browser with something like WWW::Mechanize::Chrome.

      Update: URL cloaked a little; it's a bit spammy.

      Here is what I finally came up with

      According to the thread dates, it only took you 21 years!!!

      Hopefully, marto is right about you being in the wrong thread rather than in a time warp...

Re: extract URLs from text
by belg4mit (Prior) on Jan 03, 2002 at 11:35 UTC