in reply to HTML content extractor
The even more astute will use HTML::Parser. Reading the docs for 10 minutes got me this ultra-crude version:
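#!/bin/perl
use strict;
use warnings;
use HTML::Parser;

my $file = shift;
my @array;    # each text event is pushed here as [ $text ]

# An array reference as the handler makes the parser accumulate events.
my $p = HTML::Parser->new(
    api_version => 3,
    handlers    => { text => [ \@array, "text" ] },
);
$p->parse_file($file);
print $_->[0] foreach @array;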
To keep the formatting I strongly suspect that HTML::FormatText will do a nice job too.
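A minimal sketch of that HTML::FormatText suggestion, assuming HTML::TreeBuilder and HTML::FormatText are installed from CPAN. The helper name `html_to_text` and the margin values are made up for illustration, and how well the result holds up on real pages is untested here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: render an HTML string as plain, wrapped text.
sub html_to_text {
    my ($html) = @_;

    # Build a parse tree from the HTML string.
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;

    # Margins are arbitrary choices for this sketch.
    my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 72 );
    my $text = $formatter->format($tree);

    $tree->delete;    # release the tree explicitly
    return $text;
}

print html_to_text('<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>');
```

For a file on disk, `$tree->parse_file($file)` can stand in for the `parse`/`eof` pair.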
You can certainly re-invent the wheel, but please try not to lure others into using your not-so-round attempt.
Re: Re: HTML content extractor
by eg (Friar) on Feb 10, 2001 at 22:49 UTC
Or even simpler without the accumulating @array,
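The snippet that went with this reply is not preserved here. What follows is only a sketch of the general idea, not eg's code: hand HTML::Parser a code reference as the text handler so that each chunk is printed as it is parsed and nothing is stored. The `print_text` name and its optional filehandle argument are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: print the text of an HTML string as it is parsed.
# $out is an optional filehandle and defaults to STDOUT.
sub print_text {
    my ( $html, $out ) = @_;
    $out ||= \*STDOUT;

    my $p = HTML::Parser->new(
        api_version => 3,
        # A code reference is called once per text event, so no array is needed.
        text_h => [ sub { print {$out} $_[0] }, "text" ],
    );
    $p->parse($html);
    $p->eof;    # flush any buffered text
    return;
}

print_text("<p>Hello <b>world</b></p>\n");
```

To read from a file instead, as the script above does, `$p->parse_file($file)` can replace the `parse`/`eof` pair.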
by Anonymous Monk on Oct 21, 2004 at 14:36 UTC
Re: Re: HTML content extractor
by Nooks (Monk) on Feb 11, 2001 at 02:09 UTC
Look at what happens when both programs are given the HTML in this CNN story. That is not a canned example: I simply looked at what was on CNN right now, downloaded it, and asked my program to search it for content. (Granted, it doesn't run perfectly on that input, since the first few paragraphs are elided, but your program does a truly woeful job: extracting the content from what comes back would require much more work than it does when the HTML syntax and structure are there to help.)

Of course I looked at the HTML::Parser module. I'm using HTML::TreeBuilder for any number of good reasons. Oh, and yes, HTML::FormatText would work, except it will not render forms and tables, making it completely useless for dealing with the vast majority of weblogs and news sites out there.

The point of the matter is that my 'not-so-round attempt' works better than your approach ever will. I defy you to do better without doing something at least as complex (and I don't consider what I've written to be terribly complex).
by mirod (Canon) on Feb 12, 2001 at 00:04 UTC
My sincere apologies. When I read the description you provided of your code, I assumed you had written yet-another-html-pseudo-parser. Which you have not. That will teach me to answer posts when I am tired (and too fast).

Once I started actually reading I found that your code _is_ valuable. I also tried (of course!) to write something similar but simpler, and haven't succeeded so far (man, this CNN page is Hell!). What I have managed, though, is to find a bug in XML::PYX and one in XML::Twig, so I did not lose my time ;--)

Oh, and of course I upvoted the rest of your comments on the thread. Sorry...
by Nooks (Monk) on Feb 12, 2001 at 02:01 UTC
"Once I started actually reading I found that your code _is_ valuable. I also tried (of course!) to write something similar but simpler, and haven't succeeded so far (man, this CNN page is Hell!)."

Heh, yeah, those pages can be a right pain in the ass. Don't forget, once you have it working on CNN's news pages, it has to work on slashdot, lwn, and maybe even one day perlmonks (not that I've tried it myself).

Don't worry about bruised egos. I can see now that the code probably wasn't ready to be posted, and certainly not without a much better explanation of what it does and why (which I originally cut out to make the node shorter).