htmanning has asked for the wisdom of the Perl Monks concerning the following question:

Monks, I'm using the following to detect urls and make them clickable links.
use URI;
use URI::Find;

find_uris($description, sub {
    my ($find_uri, $orig_uri) = @_;
    my $uri = URI->new( $orig_uri );
    $uri = $uri->canonical->as_string;
    return '<a href="' . $uri . '" target=_blank>' . $uri . '</a>';
});
I'm having two issues with this. First, some people leave off the "http://" and simply type "www." Second, believe it or not, some people write their URLs in all caps. What is the best way to address these two issues?

Re: Detecting URLs with URI
by stevieb (Canon) on May 26, 2016 at 20:11 UTC

    For the former issue (CAPS in URL), URI's SYNOPSIS has this:

    $u5 = URI->new("HTTP://WWW.perl.com:80")->canonical;

    ...and from the canonical() method's documentation:

    "Returns a normalized version of the URI. The rules for normalization are scheme-dependent. They usually involve lowercasing the scheme and Internet host name components, removing the explicit port specification if it matches the default port, uppercasing all escape sequences, and unescaping octets that can be better represented as plain characters. For efficiency reasons, if the $uri is already in normalized form, then a reference to it is returned instead of a copy."

    I'm not sure about the http:// issue, as I don't web-scrape often, but perusing the documentation in the link above may prove fruitful (the scheme() method looks promising). Failing that, perhaps the URI::Find docs have something.
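
As a minimal sketch of the scheme() idea mentioned above: scheme() returns undef when a string has no "scheme:" prefix, so it can be used to decide whether to prepend a default. The helper name normalize_url and the choice of 'http' as the default are assumptions for illustration, not something from the thread. One caveat to verify before relying on this: an input with a port but no scheme (for example "www.foo.com:8080") may be parsed with the host name as its scheme, so a regex check for "://" could be the safer test.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use URI;

# Hypothetical helper: returns a canonical URL string, assuming
# 'http' is an acceptable default when no scheme is present.
sub normalize_url {
    my ($raw) = @_;

    # scheme() returns undef when the string has no "scheme:" prefix.
    my $has_scheme = defined( URI->new($raw)->scheme );
    my $uri = URI->new( $has_scheme ? $raw : "http://$raw" );

    # canonical() lowercases the scheme and host name.
    return $uri->canonical->as_string;
}

print normalize_url($_), "\n" for 'WWW.FOO.COM', 'HTTP://WWW.FOO.COM:80';
```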

Re: Detecting URLs with URI
by Mr. Muskrat (Canon) on May 26, 2016 at 20:21 UTC

    1. Pick a suitable default and use it if the URL doesn't contain a scheme.
    2. Allow URI->canonical to lowercase it.
    Example code below:

    #!/bin/env perl
    use strict;
    use warnings;
    use URI;

    for my $search ('http://www.foo.com', 'https://www.foo.com', 'www.foo.com', 'WWW.FOO.COM', 'ftp://www.foo.com') {
        my $url = $search; # we can't modify $search directly
        $url = 'http://' . $url unless $url =~ m!^\w+://!; # If there isn't a scheme, add http://
        my $uri = URI->new($url)->canonical->as_string;
        print "$search => $uri\n";
    }
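
To tie this back to the original question, here is a rough sketch of the same canonicalization inside a URI::Find callback, using the module's object interface. The callback receives a URI object and the original matched text; this sketch canonicalizes the object directly instead of re-parsing the original text. The helper name linkify is hypothetical. Which schemeless or all-caps strings URI::Find actually detects is worth testing against real input; the URI-Find distribution also ships URI::Find::Schemeless, which may be worth trying for text that omits the scheme.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use URI::Find;

# Hypothetical helper: returns a copy of $text with each URL that
# URI::Find detects wrapped in an anchor tag.
sub linkify {
    my ($text) = @_;

    my $finder = URI::Find->new(sub {
        my ($uri, $orig) = @_;    # URI object, original matched text
        my $href = $uri->canonical->as_string;
        return qq{<a href="$href" target="_blank">$href</a>};
    });

    $finder->find(\$text);        # rewrites $text in place
    return $text;
}

print linkify('See HTTP://WWW.FOO.COM/docs for details.'), "\n";
```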