This node falls below the community's minimum standard of quality and will not be displayed.

Re: HTML to text
by davidrw (Prior) on Sep 05, 2006 at 03:24 UTC
    I think your posting is incomplete... for example
    echo '<pre>foo</pre>' | perl -pe 's/<.*?>//g' # output: <prefoo</pre
    I think you meant s/<.*?>//g ...

    And of course, that is _very_ simplistic ... e.g. cases like:
    <select> <option selected>foo</option> <option>bar</option> </select>
    and of course (I know, not valid HTML, but we know it always happens):
    <font color="red">2 >= 1</font>
    Personally, my quick & dirty command-line snippet for this is often:
    lynx --dump <file|url>
Re: HTML to text
by gellyfish (Monsignor) on Sep 05, 2006 at 08:01 UTC

    I'd suggest looking at the entry in perlfaq9 about this.

    If you want a one liner that does it in the recommended fashion you could try something like:

    perl -MHTML::Parser -e'HTML::Parser->new(text_h => [ sub { print shift }, "dtext" ])->parse_file($ARGV[0])' <filename>
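The same idea can be written as a short script, which may be easier to read and adapt than the one-liner. This is only a sketch: it assumes HTML::Parser is installed from CPAN (it is not part of core Perl), and `html_to_text` is a hypothetical helper name, not something from the FAQ.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, assumed to be installed

# Hypothetical helper: return the text content of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' asks for text with character references decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;    # signal end of input so buffered text is flushed
    return $text;
}

print html_to_text('<p title="x > y">The quick &amp; the dead</p>'), "\n";
```

Accumulating into a string rather than printing from the handler makes the result easy to post-process, since the parser may deliver text in several chunks.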

    /J\

Re: HTML to text
by davorg (Chancellor) on Sep 05, 2006 at 08:16 UTC

    How would it cope with something like:

    <a href="some_rule"
       class="an_attribute_on_a_different_line">

    Or even

    <a href="prev"><--</a> <a href="next">--></a>

    Parsing HTML with regexes is full of corner cases like this. It's really not worth the effort; use a real HTML parser instead.
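To make the two corner cases concrete, here is a small sketch that applies the naive substitution from earlier in the thread (`s/<.*?>//g`) to both inputs. The `strip_tags` name is a hypothetical helper for illustration, and the first case assumes line-at-a-time input, as `perl -p` would supply it.

```perl
use strict;
use warnings;

# The naive stripper discussed in this thread:
# delete everything from a '<' to the nearest '>'.
sub strip_tags {
    my ($line) = @_;
    $line =~ s/<.*?>//g;
    return $line;
}

# Case 1: a tag split over two lines, processed one line at a time.
my @lines = (
    qq{<a href="some_rule"\n},
    qq{   class="an_attribute_on_a_different_line">\n},
);
my $out = join '', map { strip_tags($_) } @lines;
print $out;    # both lines come back unchanged: neither holds a full <...> pair

# Case 2: arrows used as link text.
my $arrows = strip_tags('<a href="prev"><--</a> <a href="next">--></a>');
print "[$arrows]\n";    # prints "[ -->]": the "<--" link text is swallowed
```

Reading the whole file into one string and adding the `/s` modifier would deal with the first case, but not the second.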

    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: HTML to text
by dorward (Curate) on Sep 05, 2006 at 10:27 UTC

    In addition to the points raised by others, this would fail to account for character references in the markup.

    The quick &amp; the dead

    Or greater-than characters in attribute values:

    <p title="x > y">
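A quick check of both inputs against the naive substitution (`s/<.*?>//g`) shows what goes wrong. This is a minimal sketch using only core Perl; `strip_tags` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Naive stripper: delete everything from a '<' to the nearest '>'.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# The character reference is left undecoded.
my $entity = strip_tags('The quick &amp; the dead');
print "$entity\n";    # prints "The quick &amp; the dead"

# The match stops at the '>' inside the attribute value,
# so the tail of the tag leaks into the output.
my $attr = strip_tags('<p title="x > y">');
print "[$attr]\n";    # prints '[ y">]'
```

A parser that understands quoted attribute values and decodes character references avoids both problems.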