
Problem: I can't get only real words back using HTML::TokeParser. I'm trying to get trimmed text so that the output doesn't include markup words (td, tr, marginheight, spacer, wrap, ...), just the words outside the tags. Is there a mistake in my code, or am I going about it the wrong way? Is it even possible?

You can test the current script (using yahoo.com) at http://sulfericacid.perlmonk.org/count5.pl .

#!/usr/bin/perl

open (STDERR, ">>/home/sulfericacid/public_html/test/error.log")
   or die "Cannot open error log, weird...an error opening an error log: $!";

use warnings;
use strict;
use CGI qw(:standard);
use LWP::Simple qw(!head);
use HTML::TokeParser;
use diagnostics;

# url or file to scan
my $url = "http://www.yahoo.com";
my $file = "test.txt";
my $count = "0";

my $content = get($url);
getstore($url, $file);

my $p = HTML::TokeParser->new(shift||"$file");

  while (my $token = $p->get_tag("title")) {
      my $title = $p->get_trimmed_text("/title");
  }
$p = HTML::TokeParser->new(shift||"$file");
  while (my $token = $p->get_tag("td")) {
      my $text = $p->get_trimmed_text("/td");
  }

my (@words, %search, $first, $second, $line);
my @ignore = qw(a and the this i me us our ok abc def my of in this that you if not is it td div align width);


open (FILE, $file) or die "Error $!";
@words = <FILE>;
chomp(@words);
close FILE;

my @search= @words;

foreach my $line (@words) {
$line = lc $line;

    foreach my $ignore (@ignore) {
      $line =~ s/\b$ignore\b//g;
    }
# splitting words on a white space but allowing contractions and hyphens
    while ($line =~ /([[:alpha:]]+(?:'[[:alpha:]]+)?)/g) {
        if (exists ($search{$1})) {    
            $search{$1}++;    
        } else {
            $search{$1}="";
            $search{$1}++;
        }
    }
}

print header, start_html;
print "<table>";
print "<tr><td>$_ </td><td> $search{$_}</td></tr>\n"
   for sort {$search{$b} <=> $search{$a}} keys %search;
print "</table>";
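
One direction that might work (a sketch only, not tested against yahoo.com): the script above throws away what get_trimmed_text returns and then counts words in the raw saved file, which still contains all the markup. The sketch below instead keeps only the text tokens that HTML::TokeParser hands back and counts words in those. The helper names visible_text and count_words and the sample HTML string are made up for illustration; get_token, get_tag and decode_entities are the calls from HTML::TokeParser and HTML::Entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: return the text of an HTML string with markup removed.
sub visible_text {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";
    my $text = '';
    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'T') {
            # Text token: the only kind of token we keep.
            $text .= $token->[1];
        }
        elsif ($type eq 'S' && $token->[1] =~ /^(?:script|style)$/) {
            # Skip script/style bodies up to the matching end tag.
            $p->get_tag("/$token->[1]");
            $text .= ' ';
        }
        elsif ($type eq 'S' || $type eq 'E') {
            # Any other tag separates words, so put a space in its place.
            $text .= ' ';
        }
    }
    # get_token returns raw text, so decode entities such as &amp; here.
    return decode_entities($text);
}

# Hypothetical helper: count words in plain text, skipping the ignore list.
sub count_words {
    my ($text, @ignore) = @_;
    my %skip = map { $_ => 1 } @ignore;
    my %count;
    # Same word pattern as the original script (allows contractions).
    for my $word (lc($text) =~ /([[:alpha:]]+(?:'[[:alpha:]]+)?)/g) {
        $count{$word}++ unless $skip{$word};
    }
    return %count;
}

# Made-up sample input; in the real script this would be get($url).
my $html = '<html><head><title>Test page</title>'
         . '<style>td { width: 10px }</style></head>'
         . '<body><table><tr><td align="left">The cat &amp; the other cat</td>'
         . '</tr></table></body></html>';

my %count = count_words(visible_text($html), qw(a and the));
print "$_ $count{$_}\n"
    for sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
```

Because tag names and attributes never reach count_words, entries such as td, div, align and width should no longer be needed in the ignore list.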

"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

sulfericacid

In reply to get_trimmed_text (HTML::TokeParser) problems. by sulfericacid
