japhy has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to solve the following problem with a regex, and so far I have not been successful. I've mostly resigned myself to solving it without a regex, but I'd like to know if anyone here can come up with a clever solution.
I have a potentially malformed URL. It MAY be missing some of the leading characters of the "http://". That is, it might be tp://www.foo.com/ or ://www.foo.com; then again, it might be fine.
I am trying to determine WHAT I need to supply to the beginning of it. I know I could just do a check for ^http://, ^ttp://, ^tp://, and so on, but that seems so barbaric. Any ideas?
Re: What is missing from the beginning of this string? (direct)
by tye (Sage) on Oct 08, 2010 at 05:12 UTC
#!/usr/bin/perl -p
s-^(((((((h?t)?)t)?p)?:)?/)?/)?(?=\w+\.)-http://-;
__END__
http://www.perlmonks.org/
ttp://www.perlmonks.org/
tp://www.perlmonks.org/
p://www.perlmonks.org/
://www.perlmonks.org/
//www.perlmonks.org/
/www.perlmonks.org/
www.perlmonks.org/
produces
http://www.perlmonks.org/
http://www.perlmonks.org/
http://www.perlmonks.org/
http://www.perlmonks.org/
http://www.perlmonks.org/
http://www.perlmonks.org/
http://www.perlmonks.org/
http://www.perlmonks.org/
(update:) Replace \. with [./] if you want to support intranet URLs like http://cvs/. Supporting alternate-port intranet URLs like http://wiki:8080/ with just [./:] would cause ftp://... to become http://ftp://... but you could consider (?=\w+([./]|:\d)).
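As a rough sketch of the update above, the substitution can be wrapped in a small function that uses the wider lookahead. The name fix_url and the sample inputs are made up for illustration; only the regex itself comes from this post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the same substitution as above, with the lookahead
# widened to accept "host/", "host.domain" or "host:port".
sub fix_url {
    my ($url) = @_;
    $url =~ s{^(((((((h?t)?)t)?p)?:)?/)?/)?(?=\w+(?:[./]|:\d))}{http://};
    return $url;
}

# Sample inputs are invented for this sketch.
for my $url ('tp://www.perlmonks.org/', '//cvs/', 'wiki:8080/',
             'ftp://ftp.example.org/') {
    printf "%-26s => %s\n", $url, fix_url($url);
}
```

The ftp:// input should come back unchanged, because "ftp" is followed by ":/" rather than by ".", "/" or ":" plus a digit, so the lookahead fails and no substitution happens.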
Re: What is missing from the beginning of this string?
by Marshall (Canon) on Oct 07, 2010 at 22:32 UTC
Why don't you just get rid of the stuff in front of the www.foo.com part? I.e., assume it's "bad" and put "http://" in front of what remains? Or, for that matter, just leave the http:// off once you've stripped whatever precedes the host name.
#!/usr/bin/perl -w
use strict;
my @urls = ('tp://www.foo.com/', '://www.foo.com',
            'http//:www.foo.com', 'www.foo.com');
foreach (@urls)
{
    # Drop everything before the first "www", then supply a clean prefix.
    s/^.*?www/www/;
    print "http://$_\n";
}
__END__
prints:
http://www.foo.com/
http://www.foo.com
http://www.foo.com
http://www.foo.com
Update: This could be more complex, since a valid URL does not have to start with www. It could be xyz.tv, in which case I guess you would want http://xyz.tv? It helps if you present a representative set of test cases. It also helps if you can say something about the context of the application. Here I suppose you are trying to "guess" the user's intention from a manually entered URL and then automagically "fix" it? Sometimes it is better to just try to use what the user entered and, if it doesn't work, present an error message about what is acceptable for a URL.
Just another regex example... I'm sure that other monks can provide even better regexes, but specifying the problem as clearly as you can is important.
my @urls = ('tp://www.foo.com/', '://www.foo.com',
            'http//:www.foo.com', 'www.foo.com',
            'xxx.tv', 'http//:xxx.tv', 'tp:xx.tv');
foreach (@urls)
{
    # Drop everything before the first "word." (the start of the host name).
    s/^(.*?)(\w+\.)/$2/;
    print "http://$_\n";
}
__END__
prints:
http://www.foo.com/
http://www.foo.com
http://www.foo.com
http://www.foo.com
http://xxx.tv
http://xxx.tv
http://xx.tv
Your regex solution s/^(.*?)(\w+\.)/$2/ should work perfectly for this.
The mad scientist in me, though, is still wondering if there's a way to do this sort of thing abstractly: to provide a prefix for a string where the prefix may be only partially present. I'll think about it later. It's Friday.
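One possible way to do this abstractly, offered only as a sketch: build the nested optional groups from the prefix itself, one character at a time, so the regex matches whatever tail of the prefix is already present. The names tail_regex and complete_prefix are invented here, and unlike a hand-written pattern with a lookahead there is no guard against strings that legitimately start with something else (an ftp:// URL would simply get the full prefix prepended).

```perl
use strict;
use warnings;

# Hypothetical helper: build an anchored regex matching the longest tail
# of $prefix found at the start of a string (possibly the empty tail).
sub tail_regex {
    my ($prefix) = @_;
    my $re = '';
    # Wrap one character at a time: (?:h)?, (?:(?:h)?t)?, and so on.
    $re = "(?:$re" . quotemeta($_) . ")?" for split //, $prefix;
    return qr/^$re/;
}

# Hypothetical helper: replace the partial prefix with the complete one.
sub complete_prefix {
    my ($prefix, $str) = @_;
    my $re = tail_regex($prefix);
    $str =~ s/$re/$prefix/;    # always matches, since every group is optional
    return $str;
}

print complete_prefix('http://', $_), "\n"
    for 'tp://www.foo.com/', '://www.foo.com', 'www.foo.com';
```

Because the innermost groups are given up first during backtracking, the regex tries the complete prefix, then the prefix minus its first character, and so on, so the longest tail present is the one that gets replaced.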
Re: What is missing from the beginning of this string?
by jakeease (Friar) on Oct 08, 2010 at 00:55 UTC
I posted this in the wrong place first, so it may show up twice.
Try something like this:
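use URI;

sub fix_URL {
    my $in = shift;
    my $url = URI->new($in);
    # Overwrite whatever scheme was parsed (e.g. "tp") with "http".
    # Inputs with no scheme at all (e.g. "www.cnn.com") may need separate handling.
    $url->scheme('http');
    print "input is: $in\n";
    print "fixed url is: $url\n";
}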
perl> fix_URL 'tp://www.cnn.com'
input is: tp://www.cnn.com
fixed url is: http://www.cnn.com
perl>
Re: What is missing from the beginning of this string?
by dasgar (Priest) on Oct 07, 2010 at 23:00 UTC
I have a potentially malformed URL. It MAY be missing ... then again, it might be fine.
Honestly, I haven't the faintest clue what you're trying to do or why you think there is a problem. Here's what I'm able to gather from your post:
- You have some source providing your script/program with URLs.
- Your script/program is acting on that information.
- Since your code is not always doing what is expected, you believe that there's a chance you're getting invalid URLs.
Sounds to me like it's time for debugging, which for me means to start adding print statements to figure out what's happening where. For example, print out the URLs, then look to see what kind of issues there are, and then develop a plan to deal with them. Doing this might help you figure out if you're getting invalid URLs and if so, how are they invalid, which in turn helps with figuring out the regex.
The only other idea right now is to find a module that does the URL validation for you.
"I haven't the faintest clue what you're trying to do"
I am trying to correct a potentially malformed "http://" at the beginning of a URL.
I am not in control of the URLs I am receiving. It is not for me to debug, it is simply for me to correct.
Re: What is missing from the beginning of this string?
by ssandv (Hermit) on Oct 08, 2010 at 19:43 UTC
Use index to find "//", or failing that, "/". (It would be easier if all URLs started with www, but good luck on that.) Depending on the results of the previous tests, prepend the appropriate substr from "http://".
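A variation on this idea, as a sketch only: instead of searching for "//" anywhere in the string, test each tail of "http://" with index and return the piece that still has to be prepended. The function name missing_prefix and the default prefix argument are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the part of "http://" (or another prefix
# passed as the second argument) that is missing from the front of $str.
sub missing_prefix {
    my ($str, $full) = @_;
    $full = 'http://' unless defined $full;
    my $len = length $full;
    # Try the longest tail of $full first, then ever shorter ones.
    for my $keep (reverse 0 .. $len) {
        my $tail = substr($full, $len - $keep);
        # index(...) == 0 means $str starts with this tail.
        return substr($full, 0, $len - $keep) if index($str, $tail) == 0;
    }
    return $full;    # not reached: the empty tail always matches
}

for my $url ('tp://www.foo.com/', '://www.foo.com', 'www.foo.com') {
    print missing_prefix($url), $url, "\n";
}
```

This answers the question of what needs to be supplied: for 'tp://www.foo.com/' the function returns "ht", and for an already complete URL it returns the empty string.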