in reply to HTML::Strip Problem

Having looked at the man page for HTML::Strip, I think the problem with your first snippet is that you are passing an array of strings rather than a single scalar string containing the whole HTML document. Slurp the full text into $file instead of reading separate lines into the elements of @file. If your local text file is just the content of the URL in the second snippet, the two versions should then at least behave the same.
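
To make the slurping suggestion concrete, here is a minimal sketch. The helper names slurp_file and strip_file and the file name page.html are made up for illustration; the HTML::Strip calls (new, parse, eof) are the ones shown in that module's synopsis, and the module has to be installed from CPAN for strip_file to work.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into ONE scalar string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    # With $/ undefined, a single read returns the entire file.
    # Without this, a scalar assignment from <$fh> returns only the first line.
    local $/;
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Hypothetical helper: strip the markup from a file in one pass.
sub strip_file {
    my ($path) = @_;
    require HTML::Strip;    # loaded only when needed; not a core module
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse( slurp_file($path) );
    $hs->eof;               # reset the parser state before any reuse
    return $text;
}

# Example use (assumes a local file named page.html exists):
# print strip_file('page.html');
```

The "local $/" line is what turns the read into a slurp; it is scoped to the sub, so line-by-line reads elsewhere in the script are unaffected.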

As for why the second snippet only produces about 3/4 of the expected text output, that might be a matter of a "syntax error" in the Yahoo HTML source. (But how could Yahoo make a mistake like that?? I'm shocked! Shocked!!) Anyway, it appears that HTML::Strip does not do syntax checking (so it probably won't generate parsing errors that you can trap), and there may be some stray angle brackets or flubbed entities in the source text (perhaps 3/4 of the way into the file) that are causing trouble. You would just need to probe the text to see if that's what the problem is -- e.g. run a validating parser on it, or simply try out some simple one-liners that will isolate angle brackets and/or ampersands, along with the things adjacent to them...
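
For the probing idea, something along these lines might do as a starting point. It is only a rough heuristic, not a validator: the sub name find_suspects and the context width are my own choices, and it flags just two patterns (an ampersand that does not start a well-formed entity, and a "<" that does not look like the start of a tag, comment, or declaration). Stray ">" characters and unclosed quotes are not covered.

```perl
use strict;
use warnings;

# Hypothetical helper: return [offset, snippet] pairs for suspicious characters.
sub find_suspects {
    my ($html, $width) = @_;
    $width = 20 unless defined $width;    # characters of context on each side
    my @hits;
    while ($html =~ m{
          &(?!(?:[A-Za-z][A-Za-z0-9]*|\#[0-9]+|\#[xX][0-9A-Fa-f]+);)  # ampersand without a well-formed entity after it
        | <(?![A-Za-z/!?])                                           # opening bracket that does not begin markup
    }gx) {
        my $pos   = $-[0];                                  # offset where this match starts
        my $start = $pos > $width ? $pos - $width : 0;
        push @hits, [ $pos, substr($html, $start, 2 * $width + 1) ];
    }
    return @hits;
}

# Example use, given the page text in $html:
# printf "%6d: %s\n", @$_ for find_suspects($html);
```

If the hits cluster around the point where the stripped output stops, that would support the stray-character theory; if there are none near that point, the cause is probably something else.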

Re: Re: HTML::Strip Problem
by mkurtis (Scribe) on Mar 29, 2004 at 14:56 UTC
    Thanks, graff. I did try placing the text into $file, but it only contained the first line of the file when I did that. When I made it an array, however, it printed out the whole file. I'm not sure why that is, but that's why I used @file. I think I'll just use tachyon's parser, as that seems to fix my previous problems with parser.