PerlMonks  

Searching for web sites

by FouRPlaY (Monk)
on Oct 24, 2000 at 21:23 UTC ( [id://38149] )

FouRPlaY has asked for the wisdom of the Perl Monks concerning the following question:

I've been writing a program to turn various .plan files into HTML (any suggestions for that would be helpful too). One of the ones I'm working on contains web site addresses. I figured I'd go through and put the proper anchor tags in front of and behind them to make them actual links in the finished HTML. I was wondering if I could get some suggestions, help, or comments on the code I did write:
if ($newline =~ /http/i) {
    $newline =~ s/http\:\/\//\<a href \= \"http\:\/\//ig;
    if (substr ($newline, $#newline - 1, 1) eq ".") {
        $x = substr ($newline, /\G/, $#newline - 1);
    }
    else {
        $x = substr ($newline, /\G/);
    }
    $page = $x;
    $page =~ s/(http\:\/\/)|(\<\/a\>)|\=|\"|(href)|(\<a)| (\w+[ ]+)//g;
    $newline =~ s/$x/$x\"\>$page\<\/a\>/;
}
print "<td>" . $newline . "</td>";
$newline contains a string that may or may not have a web site address in it. Any help/suggestions/comments are great, as I've only been using Perl for a month or so now.
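For comparison, the whole if-block above can be collapsed into a single substitution. This is a minimal, hedged sketch of that idea (the sub name linkify is mine, and the trailing-period handling is only a rough guess at what .plan files contain):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A sketch of the same idea as the code above: wrap any http:// URL
# in an anchor tag with one substitution, keeping a trailing period
# outside the link. The character class and period handling are
# assumptions, not a full URL grammar.
sub linkify {
    my ($line) = @_;
    $line =~ s{(http://\S+?)(\.?)(?=\s|$)}{<a href="$1">$1</a>$2}gi;
    return $line;
}

print "<td>", linkify("Visit http://www.perlmonks.org. for wisdom"), "</td>\n";
```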

Replies are listed 'Best First'.
(Ovid) Re: Searching for web sites
by Ovid (Cardinal) on Oct 24, 2000 at 21:30 UTC
    You may wish to check out the HTML::FromText module. It will, amongst other things, automatically convert URLs to hyperlinks. I've never worked with .plan files, so I can't say for certain whether this is an appropriate solution, but I suspect that it's a good place to start.

    Also, if you wish to do it by hand, switching to a different delimiter on your regexes will help you avoid backslashitis. Further, if your URLs are not broken across lines (i.e., if they don't have embedded newlines or spaces), you could try the following (untested) regex as a starting point for conversion:

    $newline =~ s#(http://[^.]+\.[^.]+\S+)#<a href="$1">$1</a>#gi;
    The above regex assumes that, at a minimum, you will have two groups of characters separated by a period after the http:// portion. The negated character classes should really be replaced by classes that state the allowable characters (and if you really want to be anal, I recall that the first allowable character in a domain is different from the other allowable characters, but sometimes I get into regex overkill).
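    A sketch of that refinement, with the negated classes swapped for explicit "allowable" ones (letters, digits, and hyphens in domain labels, plus a rough path class). The sub name linkify_urls is mine, and real domain-name rules are stricter than this, so treat it as a starting point rather than a spec:

```perl
use strict;
use warnings;

# Link http:// URLs using explicit character classes for the domain
# labels instead of negated classes. The classes here are a rough
# approximation of allowable domain characters, not the RFC rules.
sub linkify_urls {
    my ($text) = @_;
    $text =~ s{(http://[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:/\S*)?)}{<a href="$1">$1</a>}gi;
    return $text;
}

print linkify_urls("docs at http://www.perl.com/pub today"), "\n";
```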

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just go to the link and check out our stats.

      If you're using such a thorough regex that checks for dots and allowable characters, you may wish to ditch the http:// completely. People are more likely to list websites in their .plan files without it (for example, "I visit perlmonks.org" and not "I visit http://www.perlmonks.org"). Personally, I'd feel safe putting anchor tags around anything that looks like xxx.xxx, although you could also include a list of allowable Top Level Domains, something like @TLDs = ("com","net", "org", "edu","us","nl","de","it","se","ch","uk","ca","hr","ae","br","jp","be","us","au","ie","ar","fi","mil","gov","sg","es","mx","no","pt","dk","il","ru","nz","th","pl","id","cy","in","kw","at","za","cn","fr","is","ro","kr","gr","co","ph","bo","hu","cr","pe","cl","tr","arpa","tw","eg","ee","ge","ua","om","ec","hk","ve","ag","cz","ni","to","nu","sm","ni","lt","yu","bg","ba","do","qa","ck","mt","bf","lu","su","bh");
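      A sketch of the TLD-list idea: join the allowed TLDs into one alternation and link anything that looks like xxx.xxx ending in a known TLD, with or without a leading http://. The sub name linkify_bare is mine, and the short list here is illustrative; the full list above (or a current IANA list) would go in @TLDs:

```perl
use strict;
use warnings;

# Link bare host names like perlmonks.org by matching against an
# alternation built from a TLD list. An existing http:// prefix is
# stripped and re-added in the href, so the link text is the bare
# host name either way.
sub linkify_bare {
    my ($text) = @_;
    my @TLDs = qw(com net org edu uk de jp);    # illustrative subset
    my $tld  = join '|', @TLDs;
    $text =~ s{\b(?:http://)?((?:[a-z0-9-]+\.)+(?:$tld))\b}{<a href="http://$1">$1</a>}gi;
    return $text;
}

print linkify_bare("I visit perlmonks.org daily"), "\n";
```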

        Isn't this a little dangerous? Any time new TLDs are added you will need to go and change the list; plus, I cannot see .cx, home of a bunch of free software projects, in this list.

        Matching on http://, or at least on www(\..+)+\.\w+, seems the safest.

Re: Searching for web sites
by AgentM (Curate) on Oct 24, 2000 at 21:29 UTC
    You could use [CGI]; in shell mode to print out nice HTML, as well as to help clean up your code. Another thing you might want to consider is expanding img tags, if that's allowed in the .plan files (I assume every user has an editable .plan which this is supposed to parse into HTML; but in that case, why not require them to use HTML, and use the HTML::Parser subclasses to restrict what you wish to restrict?)
    AgentM Systems nor Nasca Enterprises nor Bone::Easy nor Macperl is responsible for the comments made by AgentM. Remember, you can build any logical system with NOR.
