in reply to Stripping HTML tags efficiently
If you just want to de-HTMLify a document, the fastest way I know of is to run it through lynx -dump. This even preserves a bit of the formatting.
If you really need to overwrite tags with spaces, one space for each character of the tag, then your approach of building a pattern first and then using it is not bad, but you're making two mistakes. First, you're only building a string, not a compiled regexp. You can easily fix that by changing your first statement to:
my $pattern = qr/ ...whatever was here before... /;
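Secondly, you are doing the work twice: first you match for tags, then you substitute. Don't do that; a single global substitution does both. The parentheses around the pattern make sure $1 holds the whole match:
$target_data =~ s/($pattern)/' ' x length $1/ge;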
(This is not tested! At all!)
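To make the two fixes concrete, here is a minimal sketch of the whole thing in one place. The tag pattern is my own naive stand-in, since I'm not repeating your original one, and the blank_tags sub and the sample string are made up for illustration. It will trip over a ">" inside an attribute value, comments, and so on.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the original pattern: a deliberately naive
# tag matcher. qr// compiles it once instead of storing a plain string.
my $pattern = qr/<[^>]*>/;

# blank_tags is an illustrative helper name, not something from the thread.
sub blank_tags {
    my ($target_data) = @_;
    # One pass: every match is replaced by as many spaces as it had
    # characters, so all offsets in the text stay the same.
    $target_data =~ s/($pattern)/' ' x length($1)/ge;
    return $target_data;
}

my $html  = '<p>Hello, <b>world</b>!</p>';
my $plain = blank_tags($html);
print "[$plain]\n";
```

The result has the same length as the input, which is the point of padding with spaces rather than deleting.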
Finally, don't use regexps to parse HTML. Use HTML::Parser.
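A rough sketch of how the space-padding could look with HTML::Parser (the version 3 API), assuming the module is installed from CPAN and the input is plain ASCII or byte data, since the offset and length the parser reports are in bytes. The blank_markup sub is a name I made up, and this only blanks start and end tags; comments and declarations are left alone.

```perl
use strict;
use warnings;
use HTML::Parser ();

# blank_markup is an illustrative helper name, not part of HTML::Parser.
sub blank_markup {
    my ($html) = @_;
    my $out = $html;

    # Overwrite the source span of an event with spaces. The parser hands
    # us the span via the 'offset' and 'length' argspecs.
    my $blank = sub {
        my ($offset, $length) = @_;
        substr($out, $offset, $length) = ' ' x $length;
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $blank, 'offset,length' ],
        end_h       => [ $blank, 'offset,length' ],
    );
    $parser->parse($html);
    $parser->eof;

    return $out;
}

my $html  = '<p>Hello, <b>world</b>!</p>';
my $plain = blank_markup($html);
print "[$plain]\n";
```

The parser decides where a tag begins and ends, so quoted attribute values and the like are its problem rather than your regexp's.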