> I actually ran this. I'm not sure what it's supposed to do that makes it superior to the HTML::Parser quickies being kicked around, but it doesn't.
Thanks for running the software. For an idea of what it is supposed to do, download the source of, say, a Wired or CNN news article and run it through the program. Those are two kinds of input document that I know it handles well.
Yes, unfortunately it is far from perfect. The intent is to use it on busy weblog and news-portal sites to download pages automatically and trim away things like sidebars, boxes interrupting the flow of text, headers, and footers. So yes, I'm not surprised it didn't do well on a POD page: the program assumes there is clutter to be found, and that assumption breaks down on a document that is pretty much all content and no distraction.
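To make that assumption concrete, here is a minimal sketch of one common clutter-stripping heuristic. It's my own illustration for this thread, not necessarily what the posted program does: split the page on block-level container tags and keep the chunk with the highest ratio of plain text to markup.

    #!/usr/bin/perl
    use strict;
    use warnings;

    local $/;              # slurp the whole page at once
    my $html = <>;

    my ($best, $best_score) = ('', 0);

    # Split on opening block-level tags; this tag list is just a
    # plausible starting set, nothing authoritative.
    for my $chunk (split m{<(?:div|td|table|blockquote)\b[^>]*>}i, $html) {
        (my $text = $chunk) =~ s/<[^>]*>//g;         # crude tag stripping
        my $markup = length($chunk) - length($text);
        my $score  = length($text) / (1 + $markup);  # text-to-markup ratio
        ($best, $best_score) = ($chunk, $score)
            if $score > $best_score && length($text) > 200;
    }
    print $best;

On an all-content page like rendered POD, every chunk scores about the same, so the "one main block amid the clutter" assumption buys you nothing.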
What's supposed to make it superior to HTML::Parser quickies (and I've written a few of them in my time) is that it doesn't have to be told how to interpret a given page. That may have to change in the future (the range of HTML out there is pretty big!), but I'm confident the approach is robust enough that, with work, it will be a killer. If anyone has an HTML::Parser quickie that works in the general case, I'd be very pleased to see it.
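For contrast, here is the sort of quickie I mean. It's a hypothetical example, deliberately dumb: it pulls out readable text, but only because the tags to skip are hard-coded, which is exactly the per-page knowledge I'm trying to do away with.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser ();

    # Print all text except what sits inside <script> or <style>.
    # The skip list is hard-coded: hand it a page that keeps its
    # navigation in ordinary markup and you're back to editing code.
    my $skip = 0;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { $skip++ if $_[0] =~ /^(?:script|style)\z/ }, 'tagname' ],
        end_h   => [ sub { $skip-- if $_[0] =~ /^(?:script|style)\z/ }, 'tagname' ],
        text_h  => [ sub { print $_[0] unless $skip },                  'dtext'  ],
    );
    $p->parse_file(shift @ARGV) or die "Can't parse: $!\n";

Run it as perl quickie.pl page.html: it behaves on simple pages and degrades into a pile of site-specific rules everywhere else.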
The error you got is very unfortunate and wholly my fault for posting something so premature.
In reply to Re: Re: HTML content extractor by Nooks, in thread HTML content extractor by Nooks