in reply to Re: HTML content extractor
in thread HTML content extractor
Look at what happens when both programs are given the HTML in this CNN story.
That is not a canned example: I simply looked at what was on CNN right now, downloaded it, and asked my program to search it for content. (Granted, it doesn't run perfectly on that input, since the first few paragraphs are elided, but your program does a truly woeful job: extracting the content from its output would take much more work than it does when the HTML syntax and structure are there to help.)
Of course I looked at the HTML::Parser module. I'm using HTML::TreeBuilder for any number of good reasons.
Oh, and yes, HTML::FormatText would work, except that it does not render forms or tables, which makes it completely useless for dealing with the vast majority of weblogs and news sites out there.
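To make the argument about structure concrete, here is a minimal sketch of tree-based extraction. It is not the program discussed in this thread: the sample HTML, the `nav` and `story` ids, and the "pick the div whose paragraphs hold the most text" heuristic are all assumptions made up for illustration. It uses only HTML::TreeBuilder calls that exist (`new_from_content`, `look_down`, `as_text`, `delete`).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input; a real run would use the downloaded page instead.
my $html = <<'HTML';
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/world">World</a></div>
<div id="story"><p>First paragraph of the story, long enough to count.</p>
<p>Second paragraph of the story, also carrying real text.</p></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Assumed heuristic: the content block is the <div> whose <p>
# descendants hold the most text.
my ($best, $best_len) = (undef, 0);
for my $div ($tree->look_down(_tag => 'div')) {
    my $len = 0;
    $len += length $_->as_text for $div->look_down(_tag => 'p');
    ($best, $best_len) = ($div, $len) if $len > $best_len;
}

# Copy the text out before the tree is destroyed.
my @paragraphs = $best ? map { $_->as_text } $best->look_down(_tag => 'p') : ();
print "$_\n" for @paragraphs;

$tree->delete;
```

Because the navigation links sit in their own element, they never enter the result. Once the markup has been flattened to text, that separation has to be guessed at instead.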
The point is that my `not-so-round attempt' works better than your approach ever will. I defy you to do better without doing something at least as complex (and I don't consider what I've written to be terribly complex).
Re: Re: Re: HTML content extractor
by mirod (Canon) on Feb 12, 2001 at 00:04 UTC
by Nooks (Monk) on Feb 12, 2001 at 02:01 UTC