Re: Stripping HTML tags efficiently

(Please surround your code with CODE tags to keep it readable.)

If you just want to de-HTMLify a document, the fastest way I know of doing it would be to run it through lynx -dump. This even gives you a bit of formatting.

If you really need to overwrite tags with spaces, and in the proper amount, then your approach of making a pattern first and then using it is not bad, but you're making two mistakes. First, you're only making a string, not a compiled regexp. You can very easily fix that by changing your first statement to:

my $pattern = qr/ ...whatever was here before... /;

Secondly, you are doing the work twice: first you just match for tags, then you substitute. Don't do that.

1 while $target_data =~ s/($pattern)/' ' x length $1/ge;

(This is not tested! At all!)
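
To make the two fixes above concrete, here is a minimal sketch. The original pattern was never shown, so the `<[^>]*>` used here is only a hypothetical stand-in, and `blank_tags` is a made-up helper name. With /g a single substitution already covers every match, so the sketch drops the `1 while` loop.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the elided pattern: everything from '<' to the
# next '>'. Deliberately naive; it misfires on '>' inside attribute values
# or comments.
my $pattern = qr/<[^>]*>/;

# blank_tags is an illustrative helper name, not something from the thread.
sub blank_tags {
    my ($target_data) = @_;
    # Wrapping $pattern in parentheses makes $1 the whole match, whether or
    # not the pattern has capture groups of its own.
    $target_data =~ s/($pattern)/' ' x length $1/ge;
    return $target_data;
}

my $input = '<p>Hello <b>world</b></p>';
print '[', blank_tags($input), "]\n";
```

Because every tag is replaced by the same number of spaces, the result has the same length as the input and the remaining text keeps its original offsets.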

Finally, don't use regexps to parse HTML. Use an HTML::Parser.
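
A rough sketch of how the same space-filling could look with HTML::Parser instead of a regexp, assuming the module is installed from CPAN. The helper name `blank_tags_with_parser` is made up for illustration. It relies on the 'text' argspec, which passes each event's original source text to the handler: text events are copied through, and everything else (tags, comments, declarations) is replaced by spaces of the same length. Not tested against real-world pages.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Illustrative helper: returns the document with all markup blanked out.
sub blank_tags_with_parser {
    my ($html) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # Plain text: keep the original source text unchanged.
        text_h    => [ sub { $out .= $_[0] }, 'text' ],
        # Every other event: emit as many spaces as the markup was long.
        # The // '' guards against events that carry no source text.
        default_h => [ sub { $out .= ' ' x length( $_[0] // '' ) }, 'text' ],
    );
    $p->parse($html);
    $p->eof;    # flush any text the parser is still buffering
    return $out;
}

print '[', blank_tags_with_parser('<b>Hi</b> there'), "]\n";
```

Using 'text' rather than 'dtext' matters here: 'dtext' decodes entities such as `&amp;`, which would change the length of the text and shift the offsets.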

Re^2: Stripping HTML tags efficiently by agynr (Acolyte) on Dec 10, 2004 at 11:52 UTC
Thanks for your useful advice. Until you told me, I was unaware of this module. I have used HTML::Parser, but in a different way: I put my data in a file and then parsed it as shown below.

my $p = HTML::Parser->new( text_h => [ \&text, 'dtext' ] );
#### my data is in this file
$p->parse_file('try.txt') or die $!;
open FILE, ">output.txt" or die "Can't: $!\n";
sub text { my $text = shift; $output .= $text; }

Anyhow, thanks once again.
Re^2: Stripping HTML tags efficiently by agynr (Acolyte) on Dec 11, 2004 at 08:33 UTC
Sir, I have run into one more problem. The code completely eliminates the HTML tags, but what I want is to convert it into tags, which it is not doing. Can you please tell me how this can be done?
Re^3: Stripping HTML tags efficiently by gaal (Parson) on Dec 11, 2004 at 08:59 UTC
If I understand what you're trying to do: you want to strip all the tags out of the original data, but gather them in a separate place? Okay, instead of doing nothing ("1"), gather the data.

`my @extracted_tags; push @extracted_tags, $1 while $target_data =~ s/($pattern)/" " x length $1/e;`

(Not tested, either!) Note that there is no /g here: the loop replaces one tag per pass, so $1 holds each tag in turn. With /g, every tag would be replaced in the first pass and only the last one would be pushed. This puts the separate tags in separate elements of @extracted_tags. If you want them all together in a single string, try this:

`my $extracted_tags; $extracted_tags .= $1 while $target_data =~ s/($pattern)/" " x length $1/e;`

The better you manage to specify what you want to do, the easier it will be for you to do it.
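
The gathering idea above can also be done in a single pass by pushing from inside the replacement expression. This is only a sketch: the pattern is again a hypothetical stand-in for the one that was never shown, and `extract_and_blank` is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical, naive tag pattern (the real one is not shown in the thread).
my $pattern = qr/<[^>]*>/;

# Illustrative helper: blanks out the tags and returns them separately.
sub extract_and_blank {
    my ($data) = @_;
    my @extracted_tags;
    # With /e the replacement is Perl code: record the tag first, then
    # yield the spaces that take its place.
    $data =~ s/($pattern)/push @extracted_tags, $1; ' ' x length $1/ge;
    return ( $data, \@extracted_tags );
}

my ( $blanked, $tags ) = extract_and_blank('a<b>c</b>');
print "[$blanked] @$tags\n";
```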
Re^4: Stripping HTML tags efficiently by agynr (Acolyte) on Dec 11, 2004 at 09:14 UTC
Sir, I am not concerned with the HTML tags themselves, i.e., I don't want to extract them. My sole purpose is to convert the HTML tags into spaces, that's it. I hope you can understand my problem.
Re^5: Stripping HTML tags efficiently by gaal (Parson) on Dec 11, 2004 at 09:36 UTC
Re^6: Stripping HTML tags efficiently by agynr (Acolyte) on Dec 11, 2004 at 10:53 UTC