Re: HTML::Parser question

does anyone know how to make this not combine the words?

Are you sure *it* is combining the words? I think your code is doing that. If your sub gets called multiple times, that is because there were tags in between. You do nothing with those tags, but it is very likely that they were meant to render as some sort of white space.

For formatting HTML as plain text, have a look at HTML::FormatText, or consider using w3m -dump, links -dump or lynx -dump.
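
In case it helps, here is a minimal sketch of the HTML::FormatText route. It assumes HTML::TreeBuilder is installed alongside it; the helper name `html_to_text` and the margin values are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder ();
use HTML::FormatText ();

# Hypothetical helper: takes an HTML string, returns formatted plain text.
sub html_to_text {
    my ($html) = @_;

    # Build a parse tree from the HTML string.
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;

    # Margins are arbitrary choices for this sketch.
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);

    $tree->delete;    # release the tree explicitly
    return $text;
}

print html_to_text('<p>Hello</p><p>world</p>');
```

Block-level tags such as paragraphs end up as line breaks here, so the words no longer run together.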

A quick and ugly fix for your problem would probably be having start and end handlers that add a single space to the string and a substitution on eof to remove duplicate whitespace.
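
Something along these lines would do it. This is only a sketch using the version 3 handler interface rather than your subclass; the name `html_to_words` is hypothetical.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns the text of an HTML string with one space
# wherever a tag used to be.
sub html_to_words {
    my ($html) = @_;
    my $out = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Every start and end tag contributes a single space, so text on
        # either side of a tag can no longer run together.
        start_h => [ sub { $out .= ' ' }, 'tagname' ],
        end_h   => [ sub { $out .= ' ' }, 'tagname' ],
        # 'dtext' asks for the text with entities already decoded.
        text_h  => [ sub { $out .= shift }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;

    # At eof: collapse duplicate whitespace and trim the ends.
    $out =~ s/\s+/ /g;
    $out =~ s/^\s+|\s+$//g;
    return $out;
}

print html_to_words('<p>Hello</p><p>big <b>wide</b> world</p>'), "\n";
```

It is ugly because inline tags get a space too, so `wide<b>world</b>` would come out as two words even though a browser renders it as one.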

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: HTML::Parser question by mkurtis (Scribe) on Mar 07, 2004 at 20:54 UTC
Thanks Juerd. That quick and ugly fix you were talking about, would that put each word on a separate line? I looked at the HTML::FormatText module, but I think that if the parser just stuck every word on a new line it would work, and all would be well. Do you by chance know how to do this? Thanks
Re: Re: HTML::Parser question by Juerd (Abbot) on Mar 07, 2004 at 21:36 UTC
thanks juerd, that quick and ugly fix that you were talking about, would that be putting each word on a separate line? i looked at that HTML::FormatText module, but i think that if parser just stuck every word on a new line it would work, and all would be well.

That "fix" would do whatever you program it to do. It is not the parser's job to modify anything. It parses and does that well.
Re: Re: HTML::Parser question by graff (Chancellor) on Mar 08, 2004 at 03:03 UTC
Looking at the code you posted in an earlier reply, you could change this line (in "sub text {...}") `$self->{TEXT}.=$text;` to read as follows: `$self->{TEXT}.="$text\n";` I tried your code with this mod, and the result might still not be exactly what you wanted: I saw "nbsp", HTML comments, and other "funny character" entities (©, •, etc.). I think you'll find a way to handle these with HTML::Entities. Also, depending on how far you want to go with filtering the yahoo page content to get rid of irrelevant stuff (like the comments, the scripting, the forms, etc.), you might get good mileage out of HTML::TokeParser or its ::Simple variant (same functionality, different API).
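A rough sketch of the HTML::TokeParser route, not run against the yahoo page. The helper name `text_lines` is made up. It assumes the HTML is already in a string, which is passed as a scalar reference so that no file is involved.

```perl
use strict;
use warnings;
use HTML::TokeParser ();
use HTML::Entities qw(decode_entities);

# Hypothetical helper: returns one list element per chunk of text found
# between tags, skipping comments and script/style content.
sub text_lines {
    my ($html) = @_;

    # A scalar reference is parsed as the document itself, not a filename.
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    my @lines;
    while (my $token = $p->get_token) {
        # Only text tokens matter here: ["T", $text, $is_data].
        # Tags, comments and declarations are simply dropped.
        next unless $token->[0] eq 'T';
        next if $token->[2];    # literal data such as script or style content

        my $text = decode_entities($token->[1]);
        $text =~ tr/\xA0/ /;    # &nbsp; decodes to a no-break space
        $text =~ s/^\s+|\s+$//g;
        push @lines, $text if length $text;
    }
    return @lines;
}

binmode STDOUT, ':encoding(UTF-8)';    # decoded entities may be wide characters
print "$_\n" for text_lines('<p>Tom &amp; Jerry</p><!-- ad --><b>&copy;&nbsp;2004</b>');
```

This gives one line per text chunk, like the `"$text\n"` change, with entities decoded along the way.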
Re: Re: Re: HTML::Parser question by mkurtis (Scribe) on Mar 09, 2004 at 00:54 UTC
Thanks graff. Since I only want the text between the tags, how would I use TokeParser? From the CPAN docs it looks like TokeParser is for taking the content from inside the tags, i.e. <content in tag>, as opposed to Parser, which does this: <>content out of tag</>. The ::Simple TokeParser seems to work for taking out the in-between text, but it needs a file to define the tags it is going to get rid of when it is declared. How would I go about creating this file? Or maybe I don't need one; the docs make it seem like I do. Thanks