in reply to string matching

Use URI. URIs are more complicated than they seem, and the module makes handling them easier. Here's an example with a couple of surprise cases that show why a regular expression can be much more difficult to get right.

use strict;
use warnings;
use URI;

URI:
for my $raw ( <DATA> ) {
    my $uri = URI->new($raw);
    if ( $uri->scheme ne "https" ) {
        warn "$uri is not secure, skipping\n";
        next URI;
    }
    if ( $uri->path =~ m,/\z, ) {
        warn "$uri has a trailing slash, skipping\n";
        next URI;
    }
    print "GOOD: $uri\n";
}

__DATA__
http://perlmonks.org/?node_id=825405
https://gmail.com
https://gmail.com/
https://perlmonks.org/?
https://mail.google.com/mail/#inbox
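To see why the surprise cases are skipped, it can help to look at the pieces URI pulls out of each string. The following is only a sketch: describe_uri is a hypothetical helper name made up for illustration, and it uses nothing beyond the scheme, path, query and fragment accessors of URI.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of URI): return the components
# that matter for the "https and no trailing slash" checks.
sub describe_uri {
    my ($raw) = @_;
    my $uri = URI->new($raw);
    return {
        scheme   => scalar $uri->scheme,      # lowercased; undef if there is none
        path     => scalar $uri->path,        # excludes query and fragment
        query    => scalar $uri->query,       # undef when there is no "?"
        fragment => scalar $uri->fragment,    # undef when there is no "#"
    };
}

for my $raw ( 'https://perlmonks.org/?', 'https://mail.google.com/mail/#inbox' ) {
    my $parts = describe_uri($raw);
    printf "%s\n    path=%s query=%s fragment=%s\n",
        $raw,
        $parts->{path},
        defined $parts->{query}    ? "'$parts->{query}'"    : 'undef',
        defined $parts->{fragment} ? "'$parts->{fragment}'" : 'undef';
}
```

In both cases the path itself ends in a slash; the empty query string and the fragment are reported separately, which is why a pattern anchored at the end of the whole string is looking at the wrong thing.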

Re^2: string matching
by ungalnanban (Pilgrim) on Feb 27, 2010 at 09:53 UTC
    We can match this requirement in a single line.
    Example:
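    use strict;
    use warnings;

    # Three-argument open with a lexical handle, and fail loudly if the file is missing
    open( my $fh, '<', 'data' ) or die "Cannot open data: $!";
    foreach ( <$fh> ) {
        if ( $_ =~ m/^https.*[^\/]\n$/ ) {
            print $_;
        }
    }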

      I think you missed the point and an i modifier.

      while ( <DATA> ) {
          print if /\Ahttps.*[^\/]\n\z/;
      }
      __DATA__
      http://perlmonks.org/?node_id=825405
      HTTPS://gmail.com
      https://gmail.com/
      httpsux
      https://perlmonks.org/?
      https://mail.google.com/mail/#inbox

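      Gives these, which are either completely invalid or "end" with a trailing slash, since the fragment and the empty query string are not part of the URI path. It also silently drops HTTPS://gmail.com, a perfectly good URI, because the match is case-sensitive.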

      httpsux
      https://perlmonks.org/?
      https://mail.google.com/mail/#inbox

      If you know for a fact that your data set is simple/normalized enough, you could use a straightforward regular expression. URI is simple and robust, however, so not using it is just sloth, and it will eventually bite you or the dev who inherits your code. Trusting input data to be well-formed is risky and only appropriate in one-offs.
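      For comparison, here is a rough sketch of the same two checks wrapped in a predicate and run over the test data from the regex example. The name is_acceptable is hypothetical, and the explicit defined check on the scheme is my addition so that scheme-less junk like httpsux is rejected without an "uninitialized" warning.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: true when the URI is https and its path
# does not end in a slash.
sub is_acceptable {
    my ($raw) = @_;
    my $uri    = URI->new($raw);
    my $scheme = $uri->scheme;    # undef for strings with no scheme at all
    return 0 unless defined $scheme && $scheme eq 'https';
    return 0 if $uri->path =~ m{/\z};
    return 1;
}

my @candidates = (
    'http://perlmonks.org/?node_id=825405',
    'HTTPS://gmail.com',
    'https://gmail.com/',
    'httpsux',
    'https://perlmonks.org/?',
    'https://mail.google.com/mail/#inbox',
);

for my $raw (@candidates) {
    print "GOOD: $raw\n" if is_acceptable($raw);
}
```

      Of that list only HTTPS://gmail.com should pass, which is exactly the one the regex missed.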