Having looked at the man page for HTML::Strip, I think the problem with your first snippet is that you are passing an array of strings to the parser, rather than a single scalar string that contains the whole HTML document. Slurp the full text into $file instead of reading separate lines into the elements of @file. If your local text file is just the content of the URL in the second snippet, the two versions will then at least behave the same.
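Something along these lines is what I mean by slurping. This is only a sketch: slurp() and strip_html() are names I made up, and the file name comes from the command line because I don't know what your local copy is called. The new/parse/eof calls are the ones shown in the HTML::Strip synopsis.

```perl
use strict;
use warnings;

# Read a whole file into one scalar. Localizing $/ (the input record
# separator) to undef makes <$fh> return everything in a single read.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Hand the whole document to HTML::Strip as ONE string.
sub strip_html {
    my ($html) = @_;
    require HTML::Strip;    # loaded here so slurp() is usable on its own
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;               # reset the parser before any further document
    return $text;
}

# Usage: perl strip.pl saved_page.html
if (@ARGV) {
    print strip_html( slurp( $ARGV[0] ) );
}
```

The point is simply that parse() gets one scalar holding the entire page, not a list of lines.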

As for why the second snippet only produces about 3/4 of the expected text output, that might be caused by a "syntax error" in the Yahoo HTML source. (But how could Yahoo make a mistake like that?? I'm shocked! Shocked!!) Anyway, it appears that HTML::Strip does not do syntax checking (so it probably won't generate parsing errors that you can trap), and there may be some stray angle brackets or flubbed entities in the source text (perhaps 3/4 of the way into the file) that are causing trouble. You would need to probe the text to see if that's what the problem is -- e.g. run a validating parser on it, or try some simple one-liners that isolate angle brackets and/or ampersands, along with the things adjacent to them...
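Here is a rough sketch of the kind of probing I have in mind, written as a small script rather than a one-liner so it is easier to read. The sub name suspicious_spots() and the two patterns are my own guesses at what "stray" means (an ampersand that doesn't start something shaped like an entity, and a "<" that isn't followed by a letter, "/", "!" or "?"). It is a crude heuristic, not a validator, and it will flag perfectly innocent things inside script blocks.

```perl
use strict;
use warnings;

# Return [offset, context] pairs for spots that look suspicious.
# $width is how many characters of context to show on each side.
sub suspicious_spots {
    my ($html, $width) = @_;
    $width = 30 unless defined $width;
    my @hits;
    while ( $html =~ /(&(?!#?\w+;)|<(?![A-Za-z\/!?]))/g ) {
        my $pos     = pos($html) - length $1;    # offset of the odd character
        my $start   = $pos > $width ? $pos - $width : 0;
        my $context = substr $html, $start, 2 * $width;
        $context =~ s/\s+/ /g;                   # keep each report on one line
        push @hits, [ $pos, $context ];
    }
    return @hits;
}

# Usage: perl probe.pl saved_page.html
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $html = do { local $/; <$fh> };
    close $fh;
    printf "%8d: %s\n", @$_ for suspicious_spots($html);
}
```

If the offsets it reports cluster around the point where your output stops, that would support the bad-markup theory.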


In reply to Re: HTML::Strip Problem by graff
in thread HTML::Strip Problem by mkurtis
