in reply to Re: Stripping HTML tags efficiently
in thread Stripping HTML tags efficiently

Sir, I am having a problem again. The code completely eliminates the HTML tags, but what I want is to convert them into tags, which it is not doing. Can you please tell me how it can be done?

Re^3: Stripping HTML tags efficiently
by gaal (Parson) on Dec 11, 2004 at 08:59 UTC
    If I understand what you're trying to do:

    You want to strip out all the tags from the original data, but gather them all in a separate place? Okay, instead of doing nothing ("1"), gather the data.

    my @extracted_tags; push @extracted_tags, $1 while s/$pattern/" " x length $1/e;

    (Not tested, either!)

    This puts the separate tags in separate elements of @extracted_tags. If you want them all together in a single string, try this.

    my $extracted_tags; $extracted_tags .= $1 while s/$pattern/" " x length $1/e;
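For anyone who wants to try the gathering idea end to end, here is a minimal sketch. The sample input and the pattern are assumptions made for illustration (the thread's actual $pattern is not shown here). It also does the collecting inside the replacement expression in a single /ge pass, rather than in a while loop.

```perl
use strict;
use warnings;

# Hypothetical sample input and pattern; neither comes from the thread.
my $html    = '<p>Hello <b>world</b></p>';
my $pattern = qr/<[^>]*>/;

my @extracted_tags;

# Work on a copy so the original string stays intact.
(my $blanked = $html) =~ s{($pattern)}{
    push @extracted_tags, $1;   # remember the tag that was matched
    ' ' x length $1             # and overwrite it with as many spaces
}ge;

# All tags together in one string, if that is what is wanted.
my $joined_tags = join '', @extracted_tags;

print "[$blanked]\n";
print "$_\n" for @extracted_tags;
print "$joined_tags\n";
```

Because each tag is replaced by a run of spaces of the same length, the blanked string keeps the length of the input, so character offsets into the text still line up.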

    The better you manage to specify what you want to do, the easier it will be for you to do it.

      Sir, I am not concerned with the HTML tags themselves, i.e., I don't want to extract them. My sole purpose is to convert the HTML tags into spaces, that's it. I think you can understand my problem.
        Okay, this time I just tested something on the command line, and it appears to work. At least for simple HTML with no confusing attributes in tags:

        perl -le '$d = "<moose>elk</moose>"; $p = qr/(<.*?>)/; $d =~ s/$p/" " x length $1/ge; print $d'

        This is even simpler than what I suggested previously:

        • the regexp is very simple: make a non-greedy match from < to >
        • no need for the '1 while' construct. Just use the /g modifier.
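The two points above can be put into a small script. This is only a sketch: the helper name blank_tags is made up for illustration, and the second call shows the kind of "confusing attribute" that trips up a plain non-greedy match.

```perl
use strict;
use warnings;

# Hypothetical helper name, for illustration only.
sub blank_tags {
    my ($text) = @_;
    # Non-greedy match from "<" to the nearest ">". The parentheses
    # set $1, which the replacement needs to know the tag length.
    $text =~ s/(<.*?>)/' ' x length $1/ge;
    return $text;
}

# Simple case: both tags become runs of spaces.
print '[', blank_tags('<moose>elk</moose>'), "]\n";

# An attribute value containing ">" ends the match too early,
# so the remainder of the tag leaks into the output.
print '[', blank_tags('<a title="x>y">link</a>'), "]\n";
```

Note that "." does not match a newline unless the /s modifier is added, so a tag that spans several lines would be left untouched by this pattern.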

        But once again, one of the HTML parsers may do a better job at this.