in reply to Re^3: Stripping HTML tags efficiently
in thread Stripping HTML tags efficiently

Sir, I am not concerned with the HTML tags themselves, i.e., I don't want to extract them. My sole purpose is to convert the HTML tags into spaces, that's it. I think you can understand my problem.

Re^5: Stripping HTML tags efficiently
by gaal (Parson) on Dec 11, 2004 at 09:36 UTC
    Okay, this time I just tested something on the command line, and it appears to work. At least for simple HTML with no confusing attributes in tags:

    perl -le '$d = "<moose>elk</moose>"; $p = qr/(<.*?>)/; $d =~ s/$p/" " x length $1/ge; print $d'

    This is even simpler than what I suggested previously:

    • the regexp is very simple: make a non-greedy match from < to >
    • no need for the '1 while' construct. Just use the /g modifier.
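
    The two points above can be sketched as a small script. This is only a rough illustration: the sub name blank_tags is made up for the example, and it assumes each tag should become a run of spaces of the same length. Note the capture group, so that $1 actually holds the matched tag, and the /s modifier, added here on the assumption that a tag may span several lines.

```perl
use strict;
use warnings;

# Replace every tag with the same number of spaces,
# so the remaining text keeps its original offsets.
sub blank_tags {
    my ($html) = @_;
    # (<.*?>) is a non-greedy match from < to >, captured into $1.
    # /g handles every tag in one pass, /s lets . match newlines,
    # /e evaluates the replacement as Perl code.
    $html =~ s/(<.*?>)/" " x length $1/gse;
    return $html;
}

print blank_tags("<moose>elk</moose>"), "\n";
```

    Like the one-liner, this will be confused by a literal > inside an attribute value or a comment.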

    But once again, one of the HTML parsers may do a better job at this.
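
    For the parser route, here is a minimal sketch using HTML::Parser's version 3 handler interface. I have not benchmarked it, and the sub name blank_tags_with_parser is invented for the example. It assumes that text should pass through unchanged and that everything else (tags, comments, declarations) should turn into spaces of the same length.

```perl
use strict;
use warnings;
use HTML::Parser ();

sub blank_tags_with_parser {
    my ($html) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # Plain text is copied to the output as-is.
        text_h    => [ sub { $out .= $_[0] }, 'text' ],
        # Any other event (start/end tags, comments, declarations)
        # contributes spaces equal to the length of its source text.
        default_h => [
            sub { $out .= ' ' x (defined $_[0] ? length $_[0] : 0) },
            'text',
        ],
    );
    $p->parse($html);   # may be called repeatedly with chunks of input
    $p->eof;            # flush anything still buffered
    return $out;
}

print blank_tags_with_parser("<moose>elk</moose>"), "\n";
```

    For a large file you would probably not slurp it into a string; parse() can be fed one chunk at a time, or parse_file() can read from a filehandle, with the handlers printing instead of appending to $out.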

      Sir, I would like help using HTML::Parser only, as the data to be parsed is at least 8 MB. I don't think a regular expression can get through all the tags as fast as HTML::Parser can.