HTML::Strip Problem

by mkurtis (Scribe)
on Mar 29, 2004 at 04:28 UTC

mkurtis has asked for the wisdom of the Perl Monks concerning the following question:

I've been looking for a way to parse HTML effectively. I've used HTML::Parser before, but it doesn't parse out all the HTML. I've been trying to get HTML::Strip to work. Here's what I've got so far:
    #!/usr/bin/perl
    use strict;
    use HTML::Strip;

    my $hs = HTML::Strip->new();
    my @file;
    open(FILE,"</home/baelnorn/yahoo.txt");
    @file = <FILE>;
    close(FILE);
    print $hs->parse(@file);
This code just prints whitespace, although yahoo.txt does contain Yahoo's source and @file holds that source. I tried using $file, but it only got the first line. I then tried this code:
    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::Strip;

    my $content = get("http://www.yahoo.com");
    my $hs = HTML::Strip->new();
    print $hs->parse($content);
but it only gets about 3/4 of the text out of Yahoo. There is nothing about a size limit on parsing in the author's CPAN docs. Does anyone know how I should go about fixing this?

Thanks

Replies are listed 'Best First'.
Re: HTML::Strip Problem
by tachyon (Chancellor) on Mar 29, 2004 at 04:53 UTC

    We use something like this:

        use HTML::Parser 3;
        use LWP::Simple;

        my $html = get("http://perlmonks.org");
        print body_text($html);

        sub body_text {
            my $content = $_[0] || return 'EMPTY BODY';
            # HTML::Parser is broken on Javascript and styles
            # (well it leaves it in the text) so we 'fix' it....
            my $p = HTML::Parser->new(
                start_h => [ sub{ $_[0]->{text}.=' '; $_[0]->{skip}++ if $_[1] eq 'script' or $_[1] eq 'style'; } , 'self,tag' ],
                end_h   => [ sub{ $_[0]->{skip}-- if $_[1] eq '/script' or $_[1] eq '/style'; } , 'self,tag' ],
                text_h  => [ sub{ $_[0]->{text}.=$_[1] unless $_[0]->{skip}}, 'self,dtext' ]
            )->parse($content);
            $p->eof();
            my $text = $p->{text};
            # remove escapes
            $text =~ s/&nbsp;/ /gi;
            $text =~ s/&[^;]+;/ /g;
            # remove non ASCII printable chars, leaves punctuation stuff
            $text =~ s/[^\040-\177]+/ /g;
            # remove any < or > in case parser choked - rare but happens
            $text =~ s/[<>]/ /g;
            # crunch whitespace
            $text =~ s/\s{2,}/ /g;
            $text =~ s/^\s+//g;
            return $text;
        }

    Hint: using LWP::Simple to get web pages for a search engine is doomed to failure. For a start, on every page that uses frames you will collect the frameset but not the frames. You will be blocked by a number of sites for not being IE. You will ignore meta refreshes and 302 Found responses (perhaps you want to, perhaps not). You also have no idea what the problem is if you don't get content.

    cheers

    tachyon
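One way to address part of the hint above is to switch from LWP::Simple to LWP::UserAgent, which lets you set a user-agent string and inspect the response. This is only a minimal sketch: `fetch_page` is a hypothetical helper name, the agent string is made up, and the timeout and redirect limit are arbitrary. It does not follow frames or meta refreshes; those would still need to be pulled out of the HTML separately.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: fetch a URL and report what happened,
# instead of silently returning undef as LWP::Simple's get() does.
sub fetch_page {
    my ($url) = @_;

    my $ua = LWP::UserAgent->new(
        agent        => 'Mozilla/5.0 (compatible; ExampleCrawler/0.1)', # made-up string
        timeout      => 30,   # seconds, arbitrary
        max_redirect => 5,    # arbitrary cap on redirects followed
    );

    my $res = $ua->get($url);

    return {
        ok        => $res->is_success ? 1 : 0,
        status    => $res->status_line,          # e.g. the HTTP code and message
        redirects => scalar($res->redirects),    # number of redirects followed
        content   => $res->is_success ? $res->decoded_content : undef,
    };
}
```

A caller can then check `$page->{ok}` and log `$page->{status}` when it is false, which covers the "no idea of the problem" case.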

      Thanks for the code, tachyon. That's not my crawler above. I used LWP::Simple only to get some HTML to see how well HTML::Strip would parse it. Thanks for the tip, though. My real crawler, or something close to it (I might not have updated it), is in the Code section.
Re: HTML::Strip Problem
by graff (Chancellor) on Mar 29, 2004 at 06:06 UTC
    Having looked at the man page for HTML::Strip, I think the problem with your first snippet is that you are passing an array of strings, rather than a single scalar string that contains the whole HTML document. Slurp the full text into $file instead of reading separate lines into the elements of @file. If your local text file is just the content of the url in the second snippet, the two versions will then behave the same, at least.
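A minimal sketch of the slurp suggested above, using only core Perl. `slurp_file` is a hypothetical helper name. Note that a plain `my $file = <FILE>;` reads in scalar context and so returns just one line, which would explain a `$file` that holds only the first line of the document.

```perl
use strict;
use warnings;

# Hypothetical helper: read an entire file into a single scalar.
sub slurp_file {
    my ($path) = @_;

    open(my $fh, '<', $path) or die "Cannot open $path: $!";

    # With the input record separator undefined, <$fh> returns
    # everything up to end-of-file in one read. 'local' restores
    # the previous value when this sub returns.
    local $/;
    my $content = <$fh>;

    close($fh);
    return $content;
}
```

The result could then be handed to the parser as one string, for example `print $hs->parse(slurp_file('/home/baelnorn/yahoo.txt'));`.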

    As for why the second snippet only produces about 3/4 of the expected text output, that might be a matter of a "syntax error" in the yahoo HTML source. (But how could Yahoo make a mistake like that?? I'm shocked! Shocked!!) Anyway, it appears that HTML::Strip does not do syntax checking (so it probably won't generate parsing errors that you can trap), and there may be some stray angle brackets or flubbed entities in the source text (perhaps 3/4 of the way into the file) that are causing trouble. You would need to just probe the text to see if that's what the problem is -- e.g. run a validating parser on it, or simply try out some simple one-liners that will isolate angle brackets and/or ampersands, along with the things adjacent to them...
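The probing described above could be sketched roughly as follows. This is a crude heuristic, not a validator: `find_suspect_markup` is a hypothetical name, it only flags a `<` that does not start something tag-like and an `&` that does not start something entity-like, and it will also flag unescaped `&` inside URLs. Stray `>` characters are not detected, since that would need real tag tracking.

```perl
use strict;
use warnings;

# Hypothetical helper: list places in the HTML that look like
# stray '<' or unescaped '&', each with a little surrounding context.
sub find_suspect_markup {
    my ($html) = @_;
    my @hits;

    # '<' not followed by a letter, '/', '!' or '?'
    # '&' not followed by something shaped like 'name;' or '#123;'
    while ($html =~ /(<(?![A-Za-z\/!?])|&(?!#?\w+;))/g) {
        my $pos   = pos($html) - 1;             # offset of the flagged character
        my $start = $pos > 20 ? $pos - 20 : 0;  # up to 20 characters of lead-in
        my $context = substr($html, $start, 40);
        $context =~ s/\s+/ /g;                  # keep each report on one line
        push @hits, "$pos: $context";
    }
    return @hits;
}
```

Printing the returned list with `print "$_\n" for find_suspect_markup($content);` gives offsets to inspect by hand, which may help show whether the output stops near a malformed spot.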

      Thanks, graff. I did try placing the text into $file, but it only contained the first line of the file when I did. When I made it an array, however, it printed out the whole file. I'm not sure why that is, but that's why I used @file. I think I'll just use tachyon's parser, as that seems to fix my previous problems with HTML::Parser.
Re: HTML::Strip Problem
by Anonymous Monk on Jun 28, 2005 at 23:12 UTC
    Thanks for the initial code... I was looking at playing with the HTML::Strip module also. I got the following version to work. Might be useful to some other monk out there. :)
        #!/usr/bin/perl
        use strict;
        use warnings;
        use HTML::Strip;

        my $hs = HTML::Strip->new();
        my @file;
        open(FILE,"<test.html") || die "Cannot open file: $!";
        while(<FILE>) {
            print $hs->parse($_);
        }
        close(FILE);
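If the same HTML::Strip object is going to be reused for more than one document, the module's documentation describes an `eof` method that resets the parser state. A minimal sketch, assuming HTML::Strip is installed; `strip_html` is a hypothetical wrapper name, not part of the module:

```perl
use strict;
use warnings;
use HTML::Strip;

# Hypothetical wrapper: strip one complete document and reset the
# parser so leftover state does not leak into the next document.
sub strip_html {
    my ($hs, $html) = @_;

    my $text = $hs->parse($html);
    $hs->eof;    # reset state before the next call

    return $text;
}
```

Usage would look like `my $hs = HTML::Strip->new(); print strip_html($hs, $html);`, with `$html` holding the whole document as one string.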
