in reply to Stripping HTML tags efficiently

(Please surround your code with CODE tags to keep it readable.)

If you just want to de-HTMLify a document, the fastest way I know of doing it would be to run it through lynx -dump. This even gives you a bit of formatting.

If you really need to overwrite tags with spaces, and in the proper amount, then your approach of making a pattern first and then using it is not bad, but you're making two mistakes. First, you're only making a string, not a compiled regexp. You can very easily fix that by changing your first statement to:

my $pattern = qr/  ...whatever was here before...  /;

Secondly, you are doing the work twice: first you just match for tags, then you substitute. Don't do that.

1 while $target_data =~ s/$pattern/' ' x length $1/ge;

(This is not tested! At all!)
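
For what it's worth, here is a minimal sketch of the two fixes combined. The tag pattern and the sample string are my own assumptions, since the original pattern isn't shown here; the pattern is deliberately naive and only needs to capture the whole tag in $1.

```perl
use strict;
use warnings;

# Assumption: a deliberately naive tag pattern, for illustration only.
# It captures everything from '<' up to the next '>' in $1.
my $pattern = qr/(<[^>]*>)/;

# Hypothetical sample input standing in for the real document.
my $target_data = '<p>Hello, <b>world</b>!</p>';

# Work on a copy so the original stays available for comparison.
# /g visits every tag in one pass; /e evaluates the replacement as code.
(my $blanked = $target_data) =~ s/$pattern/' ' x length $1/ge;

print "[$blanked]\n";
```

Because every tag is replaced by as many spaces as it had characters, the result has the same length as the input, so offsets into the text stay valid.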

Finally, don't use regexps to parse HTML. Use an HTML::Parser.
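
A rough sketch of how the same blanking could look with HTML::Parser, assuming the version 3 handler API and a made-up sample string. The idea is to pass text through unchanged and turn every other event into spaces of the same length; treat it as a starting point rather than a finished solution.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample input standing in for the real document.
my $html = '<p>Hello, <b>world</b>!</p>';
my $out  = '';

my $p = HTML::Parser->new(
    api_version => 3,
    # Text between tags is copied through as it appears in the source.
    text_h => [ sub { $out .= shift }, 'text' ],
    # Any event without its own handler (start and end tags, comments,
    # declarations) is replaced by spaces of the same length.
    default_h => [
        sub {
            my $raw = shift // '';    # guard: some events carry no text
            $out .= ' ' x length $raw;
        },
        'text',
    ],
);

$p->parse($html);
$p->eof;    # flush any text the parser is still buffering

print "[$out]\n";
```

The 'text' argspec hands each handler the raw source for that event, which is what keeps the output the same length as the input.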

Re^2: Stripping HTML tags efficiently
by agynr (Acolyte) on Dec 10, 2004 at 11:52 UTC
    Thanks for your useful advice. Until you told me, I was unaware of this module. I have used HTML::Parser, but in a different way: I put my data in a file and then parsed it as shown below.

    use HTML::Parser;
    my $p = HTML::Parser->new( text_h => [ \&text, 'dtext' ] );
    # my data is in this file
    $p->parse_file('try.txt') or die $!;
    open FILE, ">output.txt" or die "Can't: $!\n";
    sub text { my $text = shift; $output .= $text; }

    Anyhow, thanks once again.
Re^2: Stripping HTML tags efficiently
by agynr (Acolyte) on Dec 11, 2004 at 08:33 UTC
    Sir, I have one more problem. The code completely eliminates the HTML tags, but what I want is to convert them into tags, which it is not doing. Can you please tell me how this can be done?
      If I understand what you're trying to do:

      You want to strip out all the tags from the original data, but gather them all in a separate place? Okay, instead of doing nothing ("1"), gather the data.

      my @extracted_tags; push @extracted_tags, $1 while $target_data =~ s/$pattern/" " x length $1/e;

      (Not tested, either!)

      This puts the separate tags in separate elements of @extracted_tags. If you want them all together in a single string, try this.

      my $extracted_tags; $extracted_tags .= $1 while $target_data =~ s/$pattern/" " x length $1/e;
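
A small runnable sketch of the gathering loop, using an assumed naive tag pattern and a made-up sample string. The substitution here runs without /g: with /g all tags would be blanked in a single pass and $1 would hold only the last one, so nothing but that last tag would be collected.

```perl
use strict;
use warnings;

# Assumptions: a naive tag pattern and a hypothetical sample input.
my $pattern     = qr/(<[^>]*>)/;
my $target_data = '<p>Hello, <b>world</b>!</p>';

my @extracted_tags;

# Each pass blanks exactly one tag, so $1 is pushed once per tag.
# The loop ends when no tag is left to match.
push @extracted_tags, $1
    while $target_data =~ s/$pattern/' ' x length $1/e;

print join('|', @extracted_tags), "\n";    # prints <p>|<b>|</b>|</p>
```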

      The better you manage to specify what you want to do, the easier it will be for you to do it.

        Sir, I am not concerned with the HTML tags, i.e., I don't want to extract them. My sole purpose is to convert the HTML tags into spaces, that's it. I think you can understand my problem.