Re: HTML Parsing

This is what I've ended up with:

    if ($field eq "comments") {        

        # Remove any links (because they break URL to link conversion)
        $$field =~ s/<A.*?HRef.*?>//isg; $$field =~ s/<\/A>//isg;

        # Extract any image links and add them to an array for safe-keeping, replace them with placeholders
        $image_database = 0;
        while ($$field =~ /<Img(.*?)>/) {
            $$field =~ s/(<Img(.*?)>)/\[My_Image=$image_database\]/iso;
            $images[$image_database] = $1;
            $image_database ++;
        }

        # If HTML is not allowed, strip any remaining HTML
        if ($allow_html != 1) { $$field =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs; }

        # Convert URL's and e-mail addresses to links (with regex)
        $$field =~ s/(((ht|f)tp):(\/\/)[a-z0-9%&_\-\+=:@~#\/.\?]+(\/|[a-z]))/<A HRef="$1" Target="_blank">$1<\/A>/isg;
        $$field =~ s/(^\W|\s)([a-z0-9_\-.]+\@[a-z0-9_\-]+\.[a-z]+)(.*?$)/$1<A HRef="mailto:$2">$2<\/A>$3/mig;

        # Replace the image placeholders with their corresponding images
        $image_database = 0;
        while ($$field =~ /\[My_Image=(\d*)\]/) {
            $img_src = $images[$1];
            $$field =~ s/\[My_Image=(\d*)\]/$img_src/iso;
            $image_database ++;
        }

    }

(Yes, I know I'm not using "strict" - this is a prototype only).

Anyone see any problems with this code?

In theory, there is no difference between theory and practice. But in practice, there is.

Jonathan M. Hollin
Digital-Word.com

Re: Re: HTML Parsing by DarkBlue (Sexton) on Feb 12, 2001 at 05:51 UTC
Just realised that `$$field =~ s/<A.*?HRef.*?>//isg; $$field =~ s/<\/A>//isg;` is going to screw up any <A Name...> tags... damn...
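
One possible way around the <A Name...> problem, offered only as a sketch: match the opening and closing tags as a pair, and keep the attribute search inside a single tag by using `[^>]*` instead of `.*?`. The helper name `strip_href_links` and the sample string are made up for illustration and are not part of the original code. It assumes no attribute value contains a literal `>` and that every HRef anchor has its own closing tag.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): remove only anchors
# that carry an HRef attribute, keeping the text between the tags.
sub strip_href_links {
    my ($text) = @_;

    # [^>]* cannot run past the end of the opening tag, so a tag such as
    # <A Name="top"> never satisfies the \bhref\b requirement and is left
    # alone, together with its closing </A>.
    $text =~ s{<a\b[^>]*\bhref\b[^>]*>(.*?)</a\s*>}{$1}gis;

    return $text;
}

my $sample = '<A Name="top">Top</A> see <A HRef="http://example.com/">this</A>';
print strip_href_links($sample), "\n";
```

If an HRef anchor is missing its closing tag, the `(.*?)` will run on to the next `</A>` it finds, which may belong to a named anchor, so input that sloppy would still need a real parser such as HTML::Parser.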