sulfericacid has asked for the wisdom of the Perl Monks concerning the following question:

Problem: I can't get HTML::TokeParser to return only real words. I'm trying to get trimmed text so that it won't display things like (td, tr, marginheight, spacer, wrap, ...), just the words outside of the tags. Is there a mistake in my code, or am I just not going about it the right way? (Or is it even possible?)

You can test this current script (using yahoo.com) at http://sulfericacid.perlmonk.org/count5.pl .

#!/usr/bin/perl

open (STDERR, ">>/home/sulfericacid/public_html/test/error.log")
    or die "Cannot open error log, weird...an error opening an error log: $!";

use warnings;
use strict;
use CGI qw(:standard);
use LWP::Simple qw(!head);
use HTML::TokeParser;
use diagnostics;

# url or file to scan
my $url = "http://www.yahoo.com";
my $file = "test.txt";
my $count = "0";

my $content = get($url);
getstore($url, $file);

my $p = HTML::TokeParser->new(shift||"$file");

while (my $token = $p->get_tag("title")) {
    my $title = $p->get_trimmed_text("/title");
}

$p = HTML::TokeParser->new(shift||"$file");

while (my $token = $p->get_tag("td")) {
    my $text = $p->get_trimmed_text("/td");
}

my (@words, %search, $first, $second, $line);
my @ignore = qw(a and the this i me us our ok abc def my of in this that
                you if not is it td div align width);

open (FILE, $file) or die "Error $!";
@words = <FILE>;
chomp(@words);
close FILE;

my @search= @words;

foreach my $line (@words) {
    $line = lc $line;
    foreach my $ignore (@ignore) {
        $line =~ s/\b$ignore\b//g;
    }

    # splitting words on a white space but allowing contractions and hyphens
    while ($line =~ /([[:alpha:]]+(?:'[[:alpha:]]+)?)/g) {
        if (exists ($search{$1})) {
            $search{$1}++;
        } else {
            $search{$1}="";
            $search{$1}++;
        }
    }
}

print header, start_html;
print "<table>";
print "<tr><td>$_ </td><td> $search{$_}</td></tr>\n"
    for sort {$search{$b} <=> $search{$a}} keys %search;
print "</table>";


"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

sulfericacid

Replies are listed 'Best First'.
Re: get_trimmed_text (HTML::TokeParser) problems.
by jaa (Friar) on Jun 04, 2003 at 08:34 UTC
    For parsing HTML, I prefer TreeBuilder. Just dumping text is easy:
    #!/usr/local/bin/perl -w
    use strict;
    use HTML::TreeBuilder;

    my $filename = "sample.html";
    dumpText(HTML::TreeBuilder->new_from_file($filename));

    sub dumpText {
        my $element = shift;
        my ( @contents ) = $element->content_list();
        for my $content ( @contents ) {
            if ( ref($content) ) {
                dumpText($content);
            } else {
                print $content,"\n";
            }
        }
    }
    Regards

    Jeff
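
For comparison with the TreeBuilder approach above, here is a minimal sketch of how the same goal might be reached while staying with HTML::TokeParser. In the posted script the values returned by get_trimmed_text are assigned to $title and $text but never used, and the word list is then read from the raw file, which is presumably why tag and attribute names end up in the counts. The sketch instead walks the token stream with get_token and counts words from text tokens only. The inline sample document, the %count hash, and the decision to skip script and style contents are assumptions made for illustration; they are not part of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input; the original script parses a saved copy of a page.
my $html = <<'HTML';
<html><head><title>Test page</title>
<script>var spacer = 1;</script></head>
<body><table><tr><td align="left">Perl monks don't count tags</td>
<td>Perl counts words</td></tr></table></body></html>
HTML

# A reference to a scalar is parsed as the document itself, not as a file name.
my $p = HTML::TokeParser->new(\$html) or die "Cannot parse document";

my %count;
my $skip = 0;    # greater than zero while inside <script> or <style>

while (my $token = $p->get_token) {
    my ($type, $name) = @$token;

    # Start and end tags only toggle the skip state; they are never counted.
    if ($type eq 'S' && $name =~ /^(?:script|style)$/) { $skip++; next }
    if ($type eq 'E' && $name =~ /^(?:script|style)$/) { $skip-- if $skip; next }

    # Only text tokens outside script/style contribute words.
    next unless $type eq 'T' && !$skip;

    my $text = lc $token->[1];
    # Same word pattern as the original script (letters, optional contraction).
    $count{$1}++ while $text =~ /([[:alpha:]]+(?:'[[:alpha:]]+)?)/g;
}

# Most frequent words first, ties broken alphabetically.
for my $word (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$word\t$count{$word}\n";
}
```

Because tag names and attributes only ever appear in start and end tokens, entries such as td, align, and width should no longer need to be in the ignore list; that list can then be kept for ordinary stop words.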