PerlMonks  

Re: Re: HTML content extractor

by Nooks (Monk)
on Feb 11, 2001 at 02:09 UTC


in reply to Re: HTML content extractor
in thread HTML content extractor

Did you run the program?

Look at what happens when both programs are given the HTML in this CNN story.

That is not a canned example---I simply looked at what was on CNN right now, downloaded it, and asked my program to search it for content. (Granted, it doesn't run perfectly on that input---the first few paragraphs are elided---but your program does a truly woeful job: extracting the content from what it returns would require much more work than it does when the HTML syntax and structure are there to help.)

Of course I looked at the HTML::Parser module. I'm using HTML::TreeBuilder for any number of good reasons.

Oh, and yes, HTML::FormatText would work, except it will not render forms and tables, making it completely useless for dealing with the vast majority of weblogs and news sites out there.

The point of the matter is my `not-so-round attempt' works better than your approach ever will. I defy you to do better without doing something at least as complex (and I don't consider what I've written to be terribly complex).

Replies are listed 'Best First'.
Re: Re: Re: HTML content extractor
by mirod (Canon) on Feb 12, 2001 at 00:04 UTC

    My sincere apologies.

    When I read the description you provided of your code, I assumed you had written yet-another-html-pseudo-parser, which you have not. That will teach me to answer posts when I am tired (and too fast).

    Once I started actually reading I found that your code _is_ valuable. I also tried (of course!) to write something similar but simpler, and haven't succeeded so far (man, this CNN page is Hell!).

    What I have managed, though, is to find a bug in XML::PYX and one in XML::Twig, so I did not lose my time ;--)

    Oh, and of course I upvoted the rest of your comments on the thread.

    Sorry...

      Once I started actually reading I found that your code _is_ valuable. I also tried (of course!) to write something similar but simpler, and haven't succeeded so far (man, this CNN page is Hell!).

      Heh, yeah, those pages can be a right pain in the ass. Don't forget, once you have it working on CNN's news pages, it has to work on slashdot and lwn (and maybe even one day perlmonks, not that I've tried it myself).

      Don't worry about bruised egos---I can see now the code probably wasn't ready to be posted, and certainly not without a much better explanation of what it does and why (which I originally cut out to make the node shorter).
