in reply to Re: HTML::Parser question
in thread HTML::Parser question

Thanks, Juerd. That quick and ugly fix you were talking about, would that be putting each word on a separate line? I looked at the HTML::FormatText module, but I think that if the parser just put every word on a new line it would work, and all would be well. Do you by chance know how to do this?

Thanks

Re: Re: HTML::Parser question
by Juerd (Abbot) on Mar 07, 2004 at 21:36 UTC

    Thanks, Juerd. That quick and ugly fix you were talking about, would that be putting each word on a separate line? I looked at the HTML::FormatText module, but I think that if the parser just put every word on a new line it would work, and all would be well.

    That "fix" would do whatever you program it to do. It is not the parser's job to modify anything. It parses and does that well.

    Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: Re: HTML::Parser question
by graff (Chancellor) on Mar 08, 2004 at 03:03 UTC
    Looking at the code you posted in an earlier reply, you could change this line (in "sub text {...}"):
    $self->{TEXT}.=$text;
    to read as follows:
    $self->{TEXT}.="$text\n";
    I tried your code with this mod, and the result might still not be exactly what you wanted: I saw "nbsp", HTML comments, and other "funny character" entities (©, •, etc.). I think you'll find a way to handle these with HTML::Entities. Also, depending on how far you want to go with filtering the Yahoo page content to get rid of irrelevant stuff (like the comments, the scripting, the forms, etc.), you might get good mileage out of HTML::TokeParser or its ::Simple variant (same functionality, different API).
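    For the "every word on its own line" goal, here is a rough sketch rather than a drop-in replacement for your code. It assumes the HTML::Parser version 3 handler API instead of your "sub text" subclass, and extract_words() is a made-up helper name. The 'dtext' argspec asks the parser for text with entities already decoded, and HTML comments never reach the text handler at all.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Collect the text between tags, one word per line.
# extract_words() is a hypothetical helper name, not part of HTML::Parser.
sub extract_words {
    my ($html) = @_;
    my @words;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the handler text with entities already decoded
        text_h => [ sub {
            my ($text) = @_;
            $text =~ tr/\xA0/ /;            # treat non-breaking spaces as plain spaces
            push @words, split ' ', $text;  # split on whitespace, dropping empty pieces
        }, 'dtext' ],
    );
    $parser->unbroken_text(1);                   # do not hand over a text run in pieces
    $parser->ignore_elements(qw(script style));  # skip script and style contents

    $parser->parse($html);
    $parser->eof;

    return join "\n", @words;
}

print extract_words('<p>Hello &amp; <b>welcome</b> home</p>'), "\n";
```

    Splitting inside the handler keeps the parser itself untouched; it only reports text, and your code decides how to lay it out.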
      Thanks, graff. Since I only want the text between the tags, how would I use TokeParser? From the CPAN docs, it looks like TokeParser is for taking the content from inside the tags, i.e. <content in tag>, as opposed to Parser, which handles <>content outside the tags</>. TokeParser::Simple seems to work for extracting the in-between text, but it needs a file to define the tags it is going to get rid of when it is declared. How would I go about creating this file? Or maybe I don't need one; the docs make it seem like I do.

      Thanks
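
      On the file question: as far as the HTML::TokeParser docs go, the argument to new() is the HTML document itself (a file name, a file handle, or a reference to a string), not a list of tags to strip. A minimal sketch of keeping only the text tokens, assuming the page is already in a string; text_chunks() is a made-up helper name:

```perl
use strict;
use warnings;
use HTML::TokeParser ();

# Return the non-empty text chunks found between tags.
# text_chunks() is a hypothetical helper name, not part of HTML::TokeParser.
sub text_chunks {
    my ($html) = @_;

    # A reference to a string is parsed as the document itself
    my $tp = HTML::TokeParser->new(\$html)
        or die "Cannot create parser: $!";

    my @chunks;
    while (my $token = $tp->get_token) {
        # Text tokens look like ['T', $text, $is_data]; skip everything else
        next unless $token->[0] eq 'T';
        my $text = $token->[1];
        $text =~ s/^\s+|\s+$//g;                 # trim surrounding whitespace
        push @chunks, $text if length $text;
    }
    return @chunks;
}

print "$_\n" for text_chunks('<h1>Title</h1><p>Some <i>body</i> text</p>');
```

      Note that text from get_token is not entity-decoded; HTML::Entities::decode_entities() could be applied to each chunk if that matters.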