in reply to Getting Words out of HTML :)

Perhaps you could post what you have, and we could help by critiquing. As for existing solutions, I imagine you could punch Google up for some HTML::Parse examples.

-- Randal L. Schwartz, Perl hacker

RE: Re: Getting Words out of HTML :)
by reyjrar (Hermit) on Aug 30, 2000 at 02:25 UTC
    so far I have this:
        HTML::Parser->new(
            api_version => 3,
            handlers => [
                start => [\&tag,     "self,tagname,attr"],
                end   => [\&tag_end, "self,tagname,attr"],
                text  => [\&text,    "'$WEIGHT',dtext"],
            ],
            marked_sections => 1,
        )->parse($DATA) || die "Huh $!\n";

    then my three subs:

        sub tag {
            my $self = shift;
            my $tagname = shift;
            my $attr = shift;
            my $stuff;
            if($tagname eq "meta") {
                if($attr{'name'} eq ("keywords" || "description")) {
                    $stuff = $attr{'content'};
                    &text($WEIGHT, $stuff);
                }
            }
            elsif($tagname eq "title") {
                $WEIGHT = "2";
            }
        }

        sub tag_end {
            my $self = shift;
            my $tagname = shift;
            my $attr = shift;
            if($tagname eq "title") {
                $WEIGHT = "1";
            }
        }
    &text($weight, $text) just breaks $text up into words and throws each one into the database, incrementing that word's count by $weight. That function is working (for the most part).
    Any ideas?
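For anyone reading along, here is a minimal sketch of what a `text` routine like the one described might look like. The real one writes to a database that isn't shown in the post, so the `%count` hash below is a hypothetical in-memory stand-in, and the word-splitting rule (runs of `\w` characters, lowercased) is an assumption rather than the poster's actual logic.

```perl
use strict;
use warnings;

my %count;   # hypothetical stand-in for the database: word => weighted count

# Break $text into words and add $weight to each word's running count.
sub text {
    my ($weight, $text) = @_;
    return unless defined $text;
    for my $word ($text =~ /\w+/g) {
        $count{lc $word} += $weight;
    }
    return;
}

text(2, "Getting Words out of HTML");   # e.g. title text, weight 2
text(1, "words, words, words");         # e.g. body text, weight 1

printf "%-8s %d\n", $_, $count{$_} for sort keys %count;
```

Keeping the weight as an explicit argument means the same routine can serve both the parser's text events and direct calls such as the one made for meta tags.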
      Now I think I might have a deeper understanding of the HTML::Parser module. I just needed a night to sleep and not stupefy myself with the company's "Perl For Programmer's" classes that I've been in for the past few days. (Generic equivalent of "Multiplication for PhDs of Mathematics".)
      So here's what I hacked out. I'm testing it now, but since I couldn't find too many good examples on Google, I figured I'd throw this up here and hope someone understands it better.
          ###
          # Parse the URLs
          use LWP::UserAgent;
          use HTTP::Request;
          use HTML::Parser;

          my $ua = LWP::UserAgent->new;
          $ua->agent("AgentName/0.1 " . $ua->agent);
          my $req = HTTP::Request->new(GET => $URL);
          $req->content_type('application/x-www-form-urlencoded');
          my $res = $ua->request($req);
          if ($res->is_success) {
              $DATA = $res->content;
          }
          $WEIGHT = 1;
          HTML::Parser->new(
              api_version => 3,
              handlers => [
                  start => [\&tag,     "self,tagname,attr"],
                  end   => [\&tag_end, "self,tagname,attr"],
                  # Look up $WEIGHT when the text arrives; "'$WEIGHT',dtext"
                  # would interpolate it once, when the parser is built.
                  text  => [sub { text($WEIGHT, @_) }, "dtext"],
              ],
              marked_sections => 1,
          )->parse($DATA) || die "Huh $!\n";
          ....
          # Parsing subroutines..
          sub tag {
              my $self = shift;
              my $tagname = shift;
              my $attr = shift;    # hash reference of the tag's attributes
              my $stuff;
              $inside{$tagname} += 1;
              if($tagname eq "meta") {
                  # compare against each name separately;
                  # ("keywords" || "description") is just "keywords"
                  my $name = lc($attr->{'name'} || "");
                  if($name eq "keywords" || $name eq "description") {
                      $stuff = $attr->{'content'};
                      text($WEIGHT, $stuff);
                  }
              }
              elsif($tagname eq "title") {
                  $WEIGHT = 2;
              }
          }

          sub tag_end {
              my $self = shift;
              my $tagname = shift;
              my $attr = shift;
              $inside{$tagname} -= 1;
              if($tagname eq "title") {
                  $WEIGHT = 1;
              }
          }

          sub text {
              my $weight = shift;
              my $text_to_parse = shift;
              #do whatever we want to do..
          }
      This isn't done, but maybe it'll help someone who is looking around for this too :) Thanks for all your help and suggestions.. -brad..
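The `%inside` counters in the code above could also be used to pick the weight directly, instead of flipping a global on `<title>`. The following is only a sketch under some assumptions: `weight_now` and `add_words` are hypothetical helper names, `%count` stands in for the database, the sample HTML is made up, and the `start_h`/`end_h`/`text_h` constructor options are used in place of the `handlers` list. Tags that never get a closing tag (such as `<meta>` or `<br>`) would leave their counter raised, which does not matter here because only `title` is consulted.

```perl
use strict;
use warnings;
use HTML::Parser;

my %inside;   # tagname => how many of that tag are currently open
my %count;    # word => weighted count (in-memory stand-in for the database)

# Decide the weight when the text arrives, from the tags open at that moment.
sub weight_now { return $inside{title} ? 2 : 1 }

# Add $weight to the count of every word found in $text.
sub add_words {
    my ($weight, $text) = @_;
    $count{lc $_} += $weight for $text =~ /\w+/g;
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { $inside{ $_[0] }++ },                        'tagname' ],
    end_h   => [ sub { $inside{ $_[0] }-- if $inside{ $_[0] } },    'tagname' ],
    text_h  => [ sub { add_words(weight_now(), $_[0]) },            'dtext'   ],
);

# Made-up sample document; a real run would pass the fetched page instead.
$parser->parse('<html><head><title>Perl Parsing</title></head>'
             . '<body><p>Parsing HTML with Perl</p></body></html>');
$parser->eof;

print "$_ => $count{$_}\n" for sort keys %count;
```

Because the text handler is a closure that calls `weight_now()` each time, the weight reflects the parser's position in the document rather than whatever value was current when the parser was constructed.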