Re^3: Stripping HTML tags efficiently

If I understand what you're trying to do:

You want to strip all the tags out of the original data, but gather them in a separate place? Okay, instead of doing nothing ("1"), gather the data.

my @extracted_tags;
# No /g here: each pass blanks one tag, so $1 is that tag when it is pushed.
push @extracted_tags, $1 while s/($pattern)/" " x length $1/e;

(Not tested, either!)

This puts the separate tags in separate elements of @extracted_tags. If you want them all together in a single string, try this.

my $extracted_tags = '';
# No /g here: each pass blanks one tag, so $1 is that tag when it is appended.
$extracted_tags .= $1 while s/($pattern)/" " x length $1/e;
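For what it's worth, here is a single-pass sketch of the same idea that I believe is easier to check. It assumes the pattern is something like `qr/<[^>]*>/` (the thread never shows `$pattern`, so both the pattern and the sample input are made up), and it uses the `/e` replacement to record each tag before blanking it:

```perl
use strict;
use warnings;

# Hypothetical sample input and pattern; neither comes from the thread.
my $html    = '<p>Hello <b>world</b></p>';
my $pattern = qr/<[^>]*>/;

my @extracted_tags;

# /e evaluates the replacement as code, so each tag is recorded first
# and then replaced by a run of spaces of the same length.
(my $blanked = $html) =~ s{($pattern)}{
    push @extracted_tags, $1;
    ' ' x length($1);
}ge;

print "[$blanked]\n";
print join('', @extracted_tags), "\n";
```

Because every tag is replaced by exactly as many spaces as it had characters, the remaining text keeps its original offsets. Joining `@extracted_tags` gives the single-string form.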

The better you manage to specify what you want to do, the easier it will be for you to do it.


Re^4: Stripping HTML tags efficiently by agynr (Acolyte) on Dec 11, 2004 at 09:14 UTC
Sir, I am not concerned with the HTML tags, i.e., I don't want to extract them. My sole purpose is to convert the HTML tags into spaces, that's it. I think you can understand my problem.
Re^5: Stripping HTML tags efficiently by gaal (Parson) on Dec 11, 2004 at 09:36 UTC
Okay, this time I just tested something on the command line, and it appears to work, at least for simple HTML with no confusing attributes in tags: `perl -le '$d = "<moose>elk</moose>"; $p = qr/(<.*?>)/; $d =~ s/$p/" " x length $1/ge; print $d'` This is even simpler than what I suggested previously. The regexp is very simple: make a non-greedy match from < to >, captured in parentheses so that `$1` holds the tag. There is no need for the '`1 while`' construct; just use the `/g` modifier. But once again, one of the HTML parsers may do a better job at this.
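To illustrate what "confusing attributes" can mean, here is a small sketch comparing the simple non-greedy pattern with one that skips over quoted attribute values. The sample input, the helper name `blank_tags`, and the second pattern are my own assumptions, not something from the posts above, and the second pattern is still not a full HTML parser:

```perl
use strict;
use warnings;

# Hypothetical input: the title attribute contains a literal '>'.
my $html = '<a title="x > y">link</a>';

# Simple pattern: a non-greedy match that stops at the first '>'.
my $simple = qr/<.*?>/s;

# Assumed alternative: treat quoted attribute values as single units.
my $quoted = qr/<(?:[^>"']|"[^"]*"|'[^']*')*>/;

# Hypothetical helper: replace every match of $re with spaces of equal length.
sub blank_tags {
    my ($text, $re) = @_;
    $text =~ s/($re)/' ' x length($1)/ge;
    return $text;
}

print '[', blank_tags($html, $simple), "]\n";
print '[', blank_tags($html, $quoted), "]\n";
```

With the simple pattern, part of the attribute value is left behind in the text; the second pattern blanks the whole opening tag.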
Re^6: Stripping HTML tags efficiently by agynr (Acolyte) on Dec 11, 2004 at 10:53 UTC
Sir, I would like help using HTML::Parser only, as the data to be parsed is no less than 8 MB. A regular expression cannot process all the tags as fast as HTML::Parser can.
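A minimal sketch of the HTML::Parser approach, assuming the module is installed from CPAN and that version 3 style handlers are acceptable. The helper name `blank_markup` and the sample input are hypothetical. Text events are copied through unchanged, and every other event (tags, comments, declarations) is replaced by spaces of the same length:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns a copy of $html with all markup blanked out.
sub blank_markup {
    my ($html) = @_;
    my $out = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Plain text is copied through unchanged (raw, entities not decoded).
        text_h    => [ sub { $out .= $_[0] }, 'text' ],
        # Any event without its own handler is replaced by spaces.
        default_h => [ sub { $out .= ' ' x length($_[0] // '') }, 'text' ],
    );
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print '[', blank_markup('<p>Hello <b>world</b></p>'), "]\n";
```

For an 8 MB input it may be worth feeding the parser from a file with `parse_file` rather than holding the whole document in one string, but I have not measured this against the regex versions.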