Flavia has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

This is my first posting :) ("hi, mom!") I'm hoping someone will be able to help me.

I'm trying to validate the entry of a URL in a form, so I'm using a very simple piece of code that looks like this:

if ($URLcheck =~ m/\W/){ printError(); }
The problem is that a URL can contain "-", but it can't contain accented letters or spaces... so the code above is not quite what I need. Any ideas?

Thanks in advance! :)

Flavia

Replies are listed 'Best First'.
Re (tilly) 1: Form validation
by tilly (Archbishop) on Nov 17, 2000 at 05:32 UTC
    Try URI::Find. It is a little tricky though:
    use strict;
    use URI::Find;

    # Time passes

    sub test_uri {
        my $val  = shift;
        my $copy = $val;
        my $found;
        find_uris($copy, sub { $found = shift });
        # true only if the whole input was recognized as a URI
        return defined $found && $val eq $found;
    }
    Note that this test is also rather picky...you may prefer to do the following which is both simpler and more useful IMO:
    sub normalize_uri {
        my $val = shift;
        my $found;
        find_uris($val, sub { $found = shift });
        return $found;
    }
    If the text is even remotely acceptable, it will try to guess at something valid... :-)
      Wow, I'll have some fun tonight! :) Thank you so much for all the help, guys. I believe my question has been answered. I'll be trying your suggestions now.

      Thanks again and very best,

      Flavia

Re: Form validation
by arturo (Vicar) on Nov 17, 2000 at 04:27 UTC

    When I saw this post, I thought to myself that there must be a module that validates that a string has the right form to be a URL (without actually trying to use the string to connect), but I couldn't find one. If you're looking for a "good enough" solution, something like what cianoz proposes would be a start; it depends on how nit-picky you're going to be. You might want to make sure there's a protocol ID in the string, and at least one period surrounded by other valid, non-period characters (although http://localhost is valid and won't fit that, so even *that's* not guaranteed). I'm not enough of a regex genius to make that happen (yet), and I don't have the RFC handy, so the following comes with heaps of disclaimers. Hopefully, it will get you started and won't damage your career as a Perl programmer or a human being =)

    # since this is messy and we need to use it a bunch of times,
    # let's load it into a pattern.
    my $val = qr/[a-z0-9?+=\-%]/;  # there is no way in heck this is correct;
                                   # it's a *beginning* though

    # get input
    if (is_url($string)) {
        # do something with it
    }
    ...

    sub is_url {
        my $input = shift;
        return $input =~ m#^(?:https?|ftp)://$val+\.$val+#i;
    }
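    The post above notes that http://localhost is valid but would fail a "must contain a dot" rule. A loosened, self-contained sketch that also accepts dot-less hosts (the character class is the same rough guess as above, not the RFC grammar, and the sample inputs are just illustrations):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough guess at URL-ish characters -- an illustration, not RFC syntax.
my $part = qr/[a-z0-9?+=\-%]+/i;

sub is_url_loose {
    my $input = shift;
    # scheme, then one or more runs separated by dots; a single run
    # (no dot at all) covers hosts like "localhost"
    return $input =~ m{^(?:https?|ftp)://$part(?:\.$part)*}i ? 1 : 0;
}

print is_url_loose("http://localhost"), "\n";   # prints 1
print is_url_loose("just some words"), "\n";    # prints 0
```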
    HTH

    Philosophy can be made out of anything. Or less -- Jerry A. Fodor

Re: Form validation
by cianoz (Friar) on Nov 17, 2000 at 04:11 UTC
    Are you trying to validate a URL or just a hostname? A URL has a LOT of valid non-word characters (at least in certain positions... [_-/:@?#+~.], any others?). If you are trying to match just the hostname, then this would do (tell me if I forgot something):
    unless ($URLcheck =~ /^[A-Za-z0-9.-]+$/) {
        # reject this...
    }
    Update! I've found this on Slashdot; I see a short life for my regexp :-)
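    A self-contained sketch of the hostname test, anchored so the whole string must consist of valid characters (the sample hostnames are just illustrations):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Letters, digits, dots and hyphens only, anchored at both ends --
# without ^...$ the test would pass any string that merely
# *contains* one valid character somewhere.
sub looks_like_hostname {
    my $host = shift;
    return $host =~ /^[A-Za-z0-9.-]+$/ ? 1 : 0;
}

print looks_like_hostname("www.example.com"), "\n";  # prints 1
print looks_like_hostname("no spaces.com"), "\n";    # prints 0
```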
Re: Form validation
by the_slycer (Chaplain) on Nov 17, 2000 at 04:25 UTC
    Probably the better way to go here (because URLs can have many funky chars) is to find out what they can't have and regex on that instead. I'm not very good with regexes, but you could try something like:
    if ($urlcheck =~ m/\s/) {
        # stuff to do for a bad URL here
    }
    which would get rid of the ones that have a space.
      Watch out, though! URLs are allowed to have spaces. I checked, and Linux/Apache/Netscape don't have problems with the spaces.

      From RFC 1738:

      2.2. URL Character Encoding Issues

      URLs are sequences of characters, i.e., letters, digits, and special characters. A URL may be represented in a variety of ways: e.g., ink on paper, or a sequence of octets in a coded character set. The interpretation of a URL depends only on the identity of the characters used.

      In most URL schemes, the sequences of characters in different parts of a URL are used to represent sequences of octets used in Internet protocols. For example, in the ftp scheme, the host name, directory name and file names are such sequences of octets, represented by parts of the URL. Within those parts, an octet may be represented by the character which has that octet as its code within the US-ASCII [20] coded character set.
      and
      Reserved:

      Many URL schemes reserve certain characters for a special meaning: their appearance in the scheme-specific part of the URL has a designated semantics. If the character corresponding to an octet is reserved in a scheme, the octet must be encoded. The characters ";", "/", "?", ":", "@", "=" and "&" are the characters which may be reserved for special meaning within a scheme. No other characters may be reserved within a scheme.
      and
      3.1. Common Internet Scheme Syntax

      While the syntax for the rest of the URL may vary depending on the particular scheme selected, URL schemes that involve the direct use of an IP-based protocol to a specified host on the Internet use a common syntax for the scheme-specific data:

      //<user>:<password>@<host>:<port>/<url-path>

      Some or all of the parts "<user>:<password>@", ":<password>", ":<port>", and "/<url-path>" may be excluded. The scheme specific data start with a double slash "//" to indicate that it complies with the common Internet scheme syntax. The different components obey the [...]
      So, I think the URL checking should be a little bit more sophisticated. I don't know how URI::Find implements the check.
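      As a concrete illustration of the encoding rule quoted above, here is a minimal sketch that percent-encodes spaces and RFC 1738's reserved characters by hand (in real code the URI::Escape module's uri_escape() is the right tool; this is just to show the mechanics):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace each space and each RFC 1738 reserved character with its
# %XX form.  Real code should use URI::Escape's uri_escape() instead.
sub encode_reserved {
    my $s = shift;
    $s =~ s/([ ;\/?:@=&])/sprintf("%%%02X", ord($1))/ge;
    return $s;
}

print encode_reserved("a path with spaces"), "\n";  # prints a%20path%20with%20spaces
```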

      Cheers,

      Jeroen

      I was dreaming of guitarnotes that would irritate an executive kind of guy (FZ)