Re: HTML content extractor

Re: Re: HTML content extractor by eg (Friar) on Feb 10, 2001 at 22:49 UTC
Or even simpler, without the accumulating `@array`: `HTML::Parser->new(text_h => [sub { print @_ }, "text"])->parse_file($file);`
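For readers who want to try the one-liner as a standalone script, here is a minimal sketch of the same idea. It is not from the original post: the helper name `extract_text` is hypothetical, it parses a string with `parse`/`eof` rather than a file so that it is self-contained, it asks for `dtext` (entity-decoded text) instead of `text`, and it assumes you want `<script>` and `<style>` contents skipped.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not part of the original post): collect the
# text content of an HTML string using a single text handler.
sub extract_text {
    my ($html) = @_;
    my $out = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # "dtext" passes entity-decoded text to the handler.
        text_h => [ sub { $out .= shift }, 'dtext' ],
    );

    # Assumption: script and style bodies are not wanted as "content".
    $p->ignore_elements(qw(script style));

    $p->parse($html);
    $p->eof;    # signal end of document so buffered text is flushed

    return $out;
}

print extract_text('<p>Hello <b>world</b></p><script>var x = 1;</script>'), "\n";
```

To read from a file as in the one-liner, replace the `parse`/`eof` pair with `$p->parse_file($file)`. Note that this only strips markup; it makes no attempt to separate an article's body from navigation and other page furniture.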
Re: Re: HTML content extractor by Nooks (Monk) on Feb 11, 2001 at 02:09 UTC
Did you run the program? Look at what happens when both programs are given the HTML in this CNN story. That is not a canned example---I simply looked at what was on CNN right now, downloaded it, and asked my program to search it for content. (Granted, it doesn't run perfectly on that input---the first few paragraphs are elided---but your program does a truly woeful job: extracting the content from what comes back would require much more work than it does when the HTML syntax and structure are there to help.) Of course I looked at the `HTML::Parser` module. I'm using `HTML::TreeBuilder` for any number of good reasons. Oh, and yes, `HTML::FormatText` would work, except that it will not render forms and tables, which makes it completely useless for dealing with the vast majority of weblogs and news sites out there. The point of the matter is that my 'not-so-round attempt' works better than your approach ever will. I defy you to do better without doing something at least as complex (and I don't consider what I've written to be terribly complex).
Re: Re: Re: HTML content extractor by mirod (Canon) on Feb 12, 2001 at 00:04 UTC
My sincere apologies. When I read the description you provided of your code, I assumed you had written yet-another-html-pseudo-parser. Which you have not. That will teach me to answer posts when I am tired (and too fast). Once I started actually reading I found that your code _is_ valuable. I also tried (of course!) to write something similar but simpler, and haven't succeeded so far (man, this CNN page is Hell!). What I have managed, though, is to find a bug in XML::PYX and one in XML::Twig, so I did not lose my time ;--) Oh, and of course I upvoted the rest of your comments on the thread. Sorry...
Re: Re: Re: Re: HTML content extractor by Nooks (Monk) on Feb 12, 2001 at 02:01 UTC
"Once I started actually reading I found that your code _is_ valuable. I also tried (of course!) to write something similar but simpler, and haven't succeeded so far (man, this CNN page is Hell!)."

Heh, yeah, those pages can be a right pain in the ass. Don't forget, once you have it working on CNN's news pages, it has to work on Slashdot, LWN (and maybe even one day PerlMonks, not that I've tried it myself). Don't worry about bruised egos---I can see now that the code probably wasn't ready to be posted, and certainly not without a much better explanation of what it does and why (which I originally cut out to make the node shorter).