PerlMonks  

Searching for web sites

by FouRPlaY (Monk)
on Oct 24, 2000 at 21:23 UTC ( [id://38149] )

FouRPlaY has asked for the wisdom of the Perl Monks concerning the following question:

I've been writing a program to turn various .plan files into HTML (any suggestions for that would be helpful too). One of the ones I'm working on contains web site addresses. I figured I'd go through and put the proper anchor tags in front of and behind them to make them actual links in the finished HTML. I was wondering if I could get some suggestions, help, or comments on the code I did write:
if ($newline =~ /http/i) {
    $newline =~ s/http\:\/\//\<a href \= \"http\:\/\//ig;
    if (substr ($newline, $#newline - 1, 1) eq ".") {
        $x = substr ($newline, /\G/, $#newline - 1);
    }
    else {
        $x = substr ($newline, /\G/);
    }
    $page = $x;
    $page =~ s/(http\:\/\/)|(\<\/a\>)|\=|\"|(href)|(\<a)| (\w+[ ]+)//g;
    $newline =~ s/$x/$x\"\>$page\<\/a\>/;
}
print "<td>" . $newline . "</td>";
$newline contains a string that may or may not have a web site address in it. Any help/suggestions/comments are great, as I've only been using Perl for a month or so now.
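For comparison, the whole if-block above can be collapsed into a single substitution. This is a minimal, hedged sketch of that idea (the sub name linkify is mine, and the trailing-period handling is only a rough guess at what .plan files contain):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A sketch of the same idea as the code above: wrap any http:// URL
# in an anchor tag with one substitution, keeping a trailing period
# outside the link. The character class and period handling are
# assumptions, not a full URL grammar.
sub linkify {
    my ($line) = @_;
    $line =~ s{(http://\S+?)(\.?)(?=\s|$)}{<a href="$1">$1</a>$2}gi;
    return $line;
}

print "<td>", linkify("Visit http://www.perlmonks.org. for wisdom"), "</td>\n";
```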

Replies are listed 'Best First'.
(Ovid) Re: Searching for web sites
by Ovid (Cardinal) on Oct 24, 2000 at 21:30 UTC
    You may wish to check out the HTML::FromText module. It will, amongst other things, automatically convert URLs to hyperlinks. I've never worked with .plan files, so I can't say for certain whether this is an appropriate solution, but I suspect that it's a good place to start.

    Also, if you wish to do it by hand, switching to a different delimiter on your regexes will help you avoid backslashitis. Further, if your URLs are not broken across lines (i.e., if they don't have embedded newlines or spaces), you could try the following (untested) regex as a starting point for conversion:

    $newline =~ s#(http://[^.]+\.[^.]+\S+)#<a href="$1">$1</a>#gi;
    The above regex assumes that, at a minimum, you will have two groups of characters separated by a period after the http:// portion. The negated character classes should really be replaced by classes that state the allowable characters (and if you really want to be anal, I recall that the first allowable character in a domain is different from the other allowable characters, but sometimes I get into regex overkill).
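    A sketch of that refinement, with the negated classes swapped for explicit "allowable" ones (letters, digits, and hyphens in domain labels, plus a rough path class). The sub name linkify_urls is mine, and real domain-name rules are stricter than this, so treat it as a starting point rather than a spec:

```perl
use strict;
use warnings;

# Link http:// URLs using explicit character classes for the domain
# labels instead of negated classes. The classes here are a rough
# approximation of allowable domain characters, not the RFC rules.
sub linkify_urls {
    my ($text) = @_;
    $text =~ s{(http://[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:/\S*)?)}{<a href="$1">$1</a>}gi;
    return $text;
}

print linkify_urls("docs at http://www.perl.com/pub today"), "\n";
```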

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just go to the link and check out our stats.

      If you're using such a thorough regex that checks for dots and allowable characters, you may wish to ditch the http:// completely. People are more likely to list websites in their .plan files without it (for example, "I visit perlmonks.org" and not "I visit http://www.perlmonks.org"). Personally, I'd feel safe putting anchor tags around anything that looks like xxx.xxx, although you could also include a list of allowable Top Level Domains, something like @TLDs = ("com","net", "org", "edu","us","nl","de","it","se","ch","uk","ca","hr","ae","br","jp","be","us","au","ie","ar","fi","mil","gov","sg","es","mx","no","pt","dk","il","ru","nz","th","pl","id","cy","in","kw","at","za","cn","fr","is","ro","kr","gr","co","ph","bo","hu","cr","pe","cl","tr","arpa","tw","eg","ee","ge","ua","om","ec","hk","ve","ag","cz","ni","to","nu","sm","ni","lt","yu","bg","ba","do","qa","ck","mt","bf","lu","su","bh");
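      A sketch of the TLD-list idea: join the allowed TLDs into one alternation and link anything that looks like xxx.xxx ending in a known TLD, with or without a leading http://. The sub name linkify_bare is mine, and the short list here is illustrative; the full list above (or a current IANA list) would go in @TLDs:

```perl
use strict;
use warnings;

# Link bare host names like perlmonks.org by matching against an
# alternation built from a TLD list. An existing http:// prefix is
# stripped and re-added in the href, so the link text is the bare
# host name either way.
sub linkify_bare {
    my ($text) = @_;
    my @TLDs = qw(com net org edu uk de jp);    # illustrative subset
    my $tld  = join '|', @TLDs;
    $text =~ s{\b(?:http://)?((?:[a-z0-9-]+\.)+(?:$tld))\b}{<a href="http://$1">$1</a>}gi;
    return $text;
}

print linkify_bare("I visit perlmonks.org daily"), "\n";
```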

        Isn't this a little dangerous? Any time new TLDs are added you will need to go and change the list; plus, I cannot see .cx, home of a bunch of free software projects, in this list.

        Matching on http://, or at least on www(\..+)+\.\w+, seems the safest.

Re: Searching for web sites
by AgentM (Curate) on Oct 24, 2000 at 21:29 UTC
    You could use [CGI]; in shell mode to print out nice HTML, as well as to help clean up your code. Another thing you might want to consider is expanding img tags, if that's allowed in the .plan files (I assume every user has an editable .plan which this is supposed to parse into HTML; but in that case, why not require them to use HTML, and use the HTML::Parser subclasses to restrict what you wish to restrict?)
    AgentM Systems nor Nasca Enterprises nor Bone::Easy nor Macperl is responsible for the comments made by AgentM. Remember, you can build any logical system with NOR.
