I actually ran this. I'm not sure what it's supposed to do that makes it superior to the HTML::Parser quickies being kicked around, but it doesn't.

Thanks for running the software. For an idea of what it is supposed to do, download the source of, say, a Wired or CNN news article and run it through the program. Those are two types of input document that I know work well.

Yes, unfortunately it is far from perfect. The intent is to use it on busy weblog and news portal sites: download a page automatically and trim out things like sidebars, boxes interrupting the flow of text, headers, and footers. So I'm not surprised it didn't do too well on a POD page. It assumes there's something to be found, but this assumption doesn't work well on a document that is pretty much all content and no distraction.

What's supposed to make it superior to HTML::Parser quickies (and I've written a few of them in my time) is that it doesn't have to be told how to interpret a given page. This may have to change in the future (the range of HTML out there is pretty big!) but I'm confident the approach is robust enough that with work it'll be a killer. If anyone has an HTML::Parser quickie that works in the general case, I'd be very pleased to see it.

The error you got is very unfortunate and wholly my fault for posting something so premature.


In reply to Re: Re: HTML content extractor by Nooks
in thread HTML content extractor by Nooks
