in reply to Getting the text of the html page

If I understand your question properly, you want to strip out the HTML tags. If so, the following ought to do the trick:

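#!/usr/bin/perl -w
use strict;

# The header block has to end with a blank line.
print "Content-type: text/html\r\n\r\n";

my $file = "path/to/page.html";
open(my $fp, '<', $file) or die "Couldn't open file: $!";

while ( my $output = <$fp> ) {
    $output =~ s/<[^>]*?>//g;    # strip tags that open and close on this line
    $output =~ s/&/&amp;/g;      # escape whatever is left, ampersands first
    $output =~ s/"/&quot;/g;
    $output =~ s/</&lt;/g;
    $output =~ s/>/&gt;/g;
    print $output;               # each line already ends in a newline
}
close($fp);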

Re^2: Getting the text of the html document
by bradcathey (Prior) on Jun 20, 2005 at 12:34 UTC

    Best to let one of the several CPAN modules do this for you. I'd look at HTML::Strip for starters.
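
For reference, a minimal sketch of what that could look like with HTML::Strip, assuming the module has been installed from CPAN. The sample markup and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Strip;    # CPAN module, not part of core Perl

# Sample input, invented for this example.
my $raw_html = '<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>';

my $hs   = HTML::Strip->new();
my $text = $hs->parse($raw_html);    # returns the input with the markup removed
$hs->eof;                            # reset the parser state before reusing $hs

print "$text\n";
```

Call eof between documents if the same object is reused, otherwise state from one page can carry over into the next.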


    —Brad
    "The important work of moving the world forward does not wait to be done by perfect men." George Eliot
Re^2: Getting the text of the html document
by CountZero (Bishop) on Jun 20, 2005 at 13:12 UTC
    The only reliable way to deal with HTML (or any other markup language) is to parse it. A "simple" regex solution is not guaranteed to work in all cases.
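
As a rough illustration of the parsing approach, here is a sketch using HTML::Parser from CPAN (assumed to be installed). The helper name html_to_text and the sample input are hypothetical, and this is only one of several parser modules that could be used.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module (version 3 API), not part of core Perl

# Hypothetical helper: return only the text content of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the handler text with entities already decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->ignore_elements(qw(script style));    # skip non-visible content
    $parser->parse($html);
    $parser->eof;    # flush any text still buffered
    return $text;
}

print html_to_text('<p>Fish &amp; chips</p>'), "\n";
```

Because the parser tracks where tags and attribute values begin and end, it does not depend on guessing tag boundaries from individual characters.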

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      That's a good point. My little regexes there don't convert every single entity, but they strip EVERY tag and convert the <'s, >'s, quotes, and ampersands. Not much else would be left behind, honestly.

      Regardless, bradcathey seems to have a very nice solution, which is much faster than the regexes anyway.

        What would your regex do with a tag like this:

        <img src="next.gif" alt="-->" />

        Honestly, it's best to use a real parser.
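
To make the failure concrete, here is a small sketch that runs the substitutions from the script at the top of the thread over that tag. It uses core Perl only; the helper name naive_strip is invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the same substitutions as the regex-based script.
sub naive_strip {
    my ($output) = @_;
    $output =~ s/<[^>]*?>//g;
    $output =~ s/&/&amp;/g;
    $output =~ s/"/&quot;/g;
    $output =~ s/</&lt;/g;
    $output =~ s/>/&gt;/g;
    return $output;
}

my $tag = '<img src="next.gif" alt="-->" />';
print naive_strip($tag), "\n";    # prints: &quot; /&gt;
```

The match ends at the first > it sees, which is the one inside the alt attribute, so the tail of the tag leaks into the output as text.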

        --
        <http://www.dave.org.uk>

        "The first rule of Perl club is you do not talk about Perl club."
        -- Chip Salzenberg