Re: Parsing/Extracting Data from HTML.

Perl can convert HTML to text too...

$htmltext =~ s/<(.*)>//g;

...will replace all tags with emptiness.

If you wish to convert br's and p's to newlines before they are stripped, add:
$htmltext =~ s/<(br|p)>/\n\n/ig;
before the first command.

Of course, you'll lose all formatting. This method is not guaranteed to strip comments properly.
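
To make the comment caveat concrete, here is a minimal sketch, assuming the input is small and roughly well-formed. The strip_tags name and the sample string are hypothetical, and the tag pattern is a negated character class rather than the .* used above. It removes comments first, turns br and p tags (with or without attributes) into blank lines, and then drops whatever tags remain. A regex still can't cover every case, such as a > inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper: a regex-based sketch, not a real HTML parser.
sub strip_tags {
    my ($html) = @_;
    # Remove comments first; /s lets . match newlines inside a comment.
    $html =~ s/<!--.*?-->//gs;
    # Turn <br>, <br />, <p>, and <p class="..."> into blank lines.
    $html =~ s/<(?:br|p)\b[^>]*>/\n\n/gi;
    # Drop every remaining tag, one tag at a time.
    $html =~ s/<[^>]*>//g;
    return $html;
}

# Assumed sample input; note the > inside the comment.
my $sample = qq{<p>Hello<!-- a > b --> <b>world</b><br />bye</p>};
print strip_tags($sample), "\n";
```

Stripping comments before tags matters here: with the order reversed, the > inside the comment would end the tag match early and leave " b -->" behind in the text.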

Replies are listed 'Best First'.
RE: Re: Parsing/Extracting Data from HTML. by chromatic (Archbishop) on Mar 23, 2000 at 20:48 UTC
No, don't do that. It's too greedy:

my $string = "<first><second>blahblah<third>\n";
$string =~ s/<(.*)>//g;
print $string;

Result: (Hey, it's blank!)

If you really want to do it this way, use:

$string =~ s/<[^>]*?>//g;

The question mark keeps the asterisk from slurping up any character -- including angle brackets -- to the end of the line, and then backtracking to pick up that last angle bracket. Of course, so does the negated character class. Just be more specific.
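
The difference is easy to check side by side. The sketch below runs the greedy pattern, a lazy .*? variant, and the negated character class against the same test string. The strip_with helper name is hypothetical, and the capturing parentheses are left out since nothing uses the capture.

```perl
use strict;
use warnings;

# Hypothetical helper: apply one tag-stripping pattern to a copy of the text.
sub strip_with {
    my ($pattern, $text) = @_;
    $text =~ s/$pattern//g;
    return $text;
}

my $string = "<first><second>blahblah<third>\n";

# Each entry pairs a label with the pattern under test.
my @patterns = (
    [ 'greedy'        => qr/<.*>/    ],
    [ 'lazy'          => qr/<.*?>/   ],
    [ 'negated class' => qr/<[^>]*>/ ],
);

for my $entry (@patterns) {
    my ($name, $re) = @$entry;
    my $out = strip_with($re, $string);
    chomp $out;    # drop the trailing newline so the quotes line up
    printf "%-13s => '%s'\n", $name, $out;
}
```

On this string the greedy pattern leaves nothing but the newline, while the other two leave blahblah. Either fix works on its own, so combining the negated class with the question mark is harmless but redundant.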