HTML to text

Re: HTML to text by davidrw (Prior) on Sep 05, 2006 at 03:24 UTC
I think your posting is incomplete... for example: `echo '<pre>foo</pre>' | perl -pe 's/<.?>//g' # output: <prefoo</pre`. I think you meant `s/<.?>//g` ... And of course, that is _very_ simplistic, e.g. cases like `<select> <option selected>foo</option> <option>bar</option> </select>`, and of course (I know, not valid HTML, but we know it always happens) `<font color="red">2 >= 1</font>`. Personally, my quick & dirty command-line snippet for this is often `lynx --dump <file|url>`.
Re: HTML to text by gellyfish (Monsignor) on Sep 05, 2006 at 08:01 UTC
I'd suggest looking at the entry in perlfaq9 about this. If you want a one-liner that does it in the recommended fashion you could try something like: `perl -MHTML::Parser -e'HTML::Parser->new(text_h => [ sub { print shift }, 'dtext'])->parse_file($ARGV[0])' <filename>` /J\
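For readers who prefer a script to a one-liner, the same idea can be sketched roughly as follows. This is only an illustration, not gellyfish's code: the helper name `html_to_text` is hypothetical, it parses a string rather than a file, and it assumes the CPAN module HTML::Parser is installed.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not from the thread): return the text content
# of an HTML string, with character references decoded.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with entities already decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;    # signal end of input so buffered text is flushed
    return $text;
}

print html_to_text('<p title="x > y">The quick &amp; the dead</p>'), "\n";
```

Collecting into a string instead of printing directly makes the result easy to post-process (for example, to squeeze whitespace), at the cost of holding the whole text in memory.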
Re: HTML to text by davorg (Chancellor) on Sep 05, 2006 at 08:16 UTC
How would it cope with something like: `<a href="some_rule" class="an_attribute_on_a_different_line">` Or even: `<a href="prev"><--</a> <a href="next">--></a>` Parsing HTML with regexes is full of corner cases like this. It's really not worth the effort; use a real HTML parser instead. -- <http://dave.org.uk> "The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
Re: HTML to text by dorward (Curate) on Sep 05, 2006 at 10:27 UTC
In addition to the points raised by others, this would fail to account for character references in the markup: `The quick &amp; the dead`. Or greater-than characters in attribute values: `<p title="x > y">`.
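The corner cases raised in the replies above are easy to reproduce. The sketch below is a hedged illustration only: it assumes the common naive pattern `s/<[^>]*>//g` (which may not be exactly what the original poster used), and `naive_strip` is a hypothetical helper name. It uses core Perl only.

```perl
use strict;
use warnings;

# Hypothetical helper: strip tags with a naive regex.
# The pattern is an assumption chosen for illustration.
sub naive_strip {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;
    return $text;
}

my @samples = (
    '<a href="prev"><--</a> <a href="next">--></a>',   # text that looks like a tag
    '<p title="x > y">text</p>',                       # '>' inside an attribute value
    'The quick &amp; the dead',                        # character reference
);

# Brackets around the result make leading spaces visible.
for my $html (@samples) {
    printf "%s\n    => [%s]\n", $html, naive_strip($html);
}
```

With this pattern the first sample loses its `<--` link text, the second leaks ` y">` from the attribute into the output, and the third keeps `&amp;` undecoded.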