PerlMonks
Re: Re: HTML content extractor
by Nooks (Monk)
on Feb 11, 2001 at 02:09 UTC ( [id://57671]=note )
Did you run the program?
Look at what happens when both programs are given the HTML in this CNN story. That is not a canned example: I simply looked at what was on CNN right now, downloaded it, and asked my program to search it for content. Granted, it doesn't run perfectly on that input (the first few paragraphs are elided), but your program does a truly woeful job: extracting the content from what it returns would require much more work than it does when the HTML syntax and structure are there to help.

Of course I looked at the HTML::Parser module. I'm using HTML::TreeBuilder for any number of good reasons. Oh, and yes, HTML::FormatText would work, except that it will not render forms and tables, which makes it completely useless for dealing with the vast majority of weblogs and news sites out there.

The point of the matter is that my `not-so-round attempt' works better than your approach ever will. I defy you to do better without doing something at least as complex (and I don't consider what I've written to be terribly complex).
In Section: Cool Uses for Perl