Looking at the code you posted in an earlier reply, you could change this line (in "sub text {...}")
$self->{TEXT}.=$text;
to read as follows:
$self->{TEXT}.="$text\n";
I tried your code with this change, and the result may still not be exactly what you want: I saw "nbsp", HTML comments, and other "funny character" entities (©, •, etc.). I think you'll find a way to handle these with HTML::Entities. Also, depending on how far you want to go in filtering the Yahoo page content to get rid of irrelevant material (the comments, the scripting, the forms, etc.), you might get good mileage out of HTML::TokeParser or its ::Simple variant (same functionality, different API).
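To make that suggestion concrete, here is a minimal sketch of the HTML::TokeParser route, combined with HTML::Entities for the entity cleanup. It is not your original code: the helper name extract_text, the choice to skip only script and style content, and the sample HTML string are all assumptions of mine for illustration. Note that the constructor takes a file name, a filehandle, or a reference to a string holding the document, so no separate file is needed.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# extract_text is a hypothetical helper name, not part of either module.
# It returns the text found between tags, one chunk per line.
sub extract_text {
    my ($html) = @_;

    # A reference to a scalar is parsed as the document itself;
    # a plain string would be treated as a file name instead.
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    # Elements whose content is dropped (an assumption; extend as needed).
    my %skip = map { $_ => 1 } qw(script style);
    my $skipping = 0;
    my @chunks;

    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' && $skip{ $token->[1] }) {
            $skipping++;                      # entering <script> or <style>
        }
        elsif ($type eq 'E' && $skip{ $token->[1] }) {
            $skipping-- if $skipping;         # leaving it again
        }
        elsif ($type eq 'T' && !$skipping) {
            # Text tokens from get_token are not entity-decoded.
            my $text = decode_entities($token->[1]);
            $text =~ s/\xA0/ /g;              # non-breaking space to plain space
            $text =~ s/^\s+|\s+$//g;          # trim leading/trailing whitespace
            push @chunks, $text if length $text;
        }
        # Comment ('C'), declaration ('D') and processing-instruction ('PI')
        # tokens fall through here and are ignored.
    }
    return join "\n", @chunks;
}

# Made-up sample input, just to exercise the helper.
my $html = '<html><head><script>var x = 1;</script></head>'
         . '<body><!-- hidden --><p>Fish &amp;&nbsp;chips</p>'
         . '<p>&copy; 2004</p></body></html>';

binmode STDOUT, ':encoding(UTF-8)';
print extract_text($html), "\n";
```

The loop only ever looks at start, end, and text tokens, so HTML comments disappear without any special handling. To drop forms as well, adding form to the skip list should be enough, though I have not tried that against the real Yahoo page.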

Re: Re: Re: HTML::Parser question
by mkurtis (Scribe) on Mar 09, 2004 at 00:54 UTC
    Thanks graff. Since I only want the text between the tags, how would I use TokeParser? From the CPAN docs, it looks like TokeParser is for taking the content from inside the tags, i.e. <content in tag>, as opposed to Parser, which handles the content outside the tags, i.e. <>content out of tag</>. The ::Simple variant seems to work for pulling out the in-between text, but it appears to need a file, given when the parser is declared, that defines the tags it is going to get rid of. How would I go about creating this file? Or maybe I don't need one; the docs make it seem like I do.

    Thanks