in reply to Stripping HTML tags
--------------
The following may do what you want:
    $_ = join '', <DATA>;
    # Loop until nothing matches, so the tag nested inside the comment goes too.
    while (s/<(?:\/?\w|!)[^<>]*>/ /sg) {}
    s/ +/ /g;       # collapse runs of spaces
    s/^ | $//mg;    # trim a space at the start or end of each line
    print;
    __DATA__
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "">
    Once upon a time there was a <a href="page.html">link</a>
    and some <b>bold text</b> and a paragraph break<p>
    <!-- invisible <nested tag> -->
    and a <table cellspacing="0" cellpadding="0" border="0"><tr>
    <td>table</td>
    </tr></table>
    4 < 5 > 3

This doesn't convert entities such as &nbsp;, of course, but you can add code for that on your own fairly easily. Note that the above mostly preserves page structure. If you're just trying to export the text, you may want something more like the following:

    $_ = join '', <DATA>;
    while (s/<(?:\/?\w|!)[^<>]*>/ /sg) {}
    s/\s+/ /g;      # collapse all whitespace, newlines included
    s/^ | $//g;     # trim both ends of the resulting single line
    print;
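For the entity conversion mentioned above, here is a minimal sketch of what that extra code could look like. The helper name `decode_basic_entities` and the `%entity` table are hypothetical, not part of the original post, and the table covers only a handful of named entities (with &nbsp; mapped to a plain space, which is an assumption about what you want in exported text). For complete coverage, the CPAN module HTML::Entities provides `decode_entities`.

```perl
use strict;
use warnings;

# Hypothetical lookup table: only a few common named entities.
my %entity = (
    nbsp => ' ',    # assumption: a plain space is good enough for text export
    amp  => '&',
    lt   => '<',
    gt   => '>',
    quot => '"',
);

# Decode named and numeric references in a single pass, so that text such
# as "&#38;lt;" is decoded once (to "&lt;") rather than twice.
sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s{&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);}{
        my $ref = $1;
        $ref =~ /^#[xX](.+)/   ? chr(hex($1))     # hex reference, e.g. &#x41;
      : $ref =~ /^#(.+)/       ? chr($1)          # decimal reference, e.g. &#65;
      : exists $entity{$ref}   ? $entity{$ref}    # known named entity
      :                          "&$ref;";        # unknown name: leave as-is
    }ge;
    return $text;
}

print decode_basic_entities("4 &lt; 5 &amp;&amp; 5 &gt; 3&nbsp;&#33;"), "\n";
# prints: 4 < 5 && 5 > 3 !
```

Run this after the tag-stripping loop, not before; otherwise a decoded &lt;b&gt; would look like a real tag and be stripped along with the markup.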