Finding & creating links in HTML files

heezy has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

Apologies first off, this is my second question today (it's not a good day here at work).

I have a Perl script that takes a list of file names and cleans up the HTML output that a certain word processor has produced.

I want to modify this so that if the document contains text that should be a hyperlink but is not, such as...

http://www.google.com

...I can identify it and change it to <a href="http://www.google.com">http://www.google.com</a>, so that it produces a nice clickable link like http://www.google.com

My current code to tidy up the HTML looks something like this:


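#!/usr/bin/perl
use strict;
use warnings;

foreach my $fname (@ARGV) {
    print "Processing: $fname\n";

    # Slurp the whole file into a single string.
    open(my $in, '<', $fname) or die "Cannot open file $fname! ($!)\n";
    my $file = join('', <$in>);
    close($in);

    # Keep the original as a backup before overwriting it.
    rename($fname, "$fname.bak") or die "Cannot rename $fname! ($!)\n";

    # Drop the word processor's META tags.
    $file =~ s/<META NAME=.*?>\n//gis;

    # Strip inline styles from paragraphs.
    $file =~ s/<P STYLE.*?>/<P>/gis;

    # Replace typographic entities with plain ASCII.
    $file =~ s/&ndash;/-/gi;
    $file =~ s/&rdquo;/"/gi;
    $file =~ s/&ldquo;/"/gi;

    # Remove FONT and SPAN tags but keep their contents.
    $file =~ s/<FONT.*?>//gis;
    $file =~ s/<\/FONT>//gis;
    $file =~ s/<SPAN.*?>//gis;
    $file =~ s/<\/SPAN>//gis;

    # Write the cleaned HTML back under the original name.
    open(my $out, '>', $fname) or die "Cannot write file $fname! ($!)\n";
    print {$out} $file;
    close($out) or die "Cannot close file $fname! ($!)\n";
}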

Has anybody written anything like this before, or does anyone know a quick Perl-ish way of doing it?

Thanks in advance

Edit by tye, remove PRE tag around wide line


Re: Finding & creating links in HTML files by fruiture (Curate) on Oct 17, 2002 at 23:11 UTC
Cleaning up HTML is a task probably everyone encounters at least once, but this node has given me enough inspiration for HTML tidying to write an HTML cleaner in PHP (and it's hell working with regular expressions in that ugly language, especially when you know Perl). Consider using that code (plus a bit of customisation) instead of trying some quick'n'dirty regular expressions that will fail, if not today, then tomorrow.

As for the linking: you can build regular expressions for URIs from the necessary RFCs, but that results in very complex expressions that are far more accurate than you need for just extracting "stuff that begins with 'http://' and/or 'www.' or 'ftp://' or maybe 'mailto:'". If you want an accurate solution anyway, check out Regexp::Common::URI ;)

-- http://fruiture.de
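
For the loose "stuff that begins with 'http://' and/or 'www.'" approach mentioned in this reply, a minimal sketch might look like the following. The helper name linkify_loose, the exact pattern, and the trailing-punctuation rule are assumptions made for illustration, not code from the thread, and the pattern is deliberately not RFC-accurate.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap anything that looks like a URL in an anchor.
# The pattern is intentionally loose rather than RFC-accurate.
sub linkify_loose {
    my ($text) = @_;
    $text =~ s{
        \b
        (                                   # capture the candidate URL
            (?: https?:// | ftp:// | mailto: | www\. )
            [^\s<>"]+                       # run up to whitespace, a tag or a quote
        )
    }{
        my $url = $1;
        # Trailing punctuation usually belongs to the sentence, not the URL.
        my $tail = $url =~ s/([.,;:!?)]+)\z// ? $1 : '';
        # A bare "www." host needs a scheme in the href (assumed to be http).
        my $href = $url =~ /\Awww\./i ? "http://$url" : $url;
        qq{<a href="$href">$url</a>$tail};
    }gexi;
    return $text;
}

print linkify_loose("See http://www.google.com, or www.perlmonks.org.\n");
```

Note that this sketch makes no attempt to detect URLs that are already inside an anchor or a tag attribute, so running it on text that contains existing links would wrap them a second time.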
Re: Finding & creating links in HTML files by heezy (Monk) on Oct 17, 2002 at 20:20 UTC
Sorry, one last point. There may already be hyperlinks in the document, therefore any search for the string http://www.google.com should match only if it does not have a " before and after it (if it does, it is already a hyperlink).

Thanks people

M
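
One way to honour this constraint without testing for surrounding quotes is to split the HTML into tags and text, then only touch text that is outside any tag and outside an existing <a>...</a> pair. The sketch below is an illustration under those assumptions; linkify_unlinked is a hypothetical name, only http:// and https:// URLs are handled, and comments or script blocks get no special treatment.

```perl
use strict;
use warnings;

# Hypothetical helper: link bare http(s) URLs, leaving existing anchors
# and tag attributes untouched.
sub linkify_unlinked {
    my ($html) = @_;
    my $in_anchor = 0;
    my $out       = '';

    # The capturing group makes split return the tags as well, so the
    # pieces alternate between plain text and markup.
    for my $piece (split /(<[^>]*>)/, $html) {
        if ($piece =~ /\A</) {
            # Track whether we are between <a ...> and </a>.
            $in_anchor = 1 if $piece =~ /\A<a\b/i;
            $in_anchor = 0 if $piece =~ m{\A</a\b}i;
        }
        elsif (!$in_anchor) {
            # Match lazily, stopping before trailing punctuation that is
            # followed by whitespace or the end of this text piece.
            $piece =~ s{(https?://[^\s<>"]+?)(?=[.,;:!?)]*(?:\s|\z))}
                       {<a href="$1">$1</a>}g;
        }
        $out .= $piece;
    }
    return $out;
}

print linkify_unlinked(
    '<p>Visit http://www.google.com today, or '
  . '<a href="http://www.perl.org">http://www.perl.org</a>.</p>'
), "\n";
```

In the tidy-up script this would be called on $file just before it is written back, for example $file = linkify_unlinked($file); for anything beyond simple word-processor output, a real HTML parser is likely the safer choice.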