Greetings,

I've been working on a web crawler to suggest descriptions and keywords for websites that I manage. The code and some output follow a couple of requests for advice.

I like HTML::Summary, although I'd love to check out other summarization engines. Any recommendations? I was hoping to find a Text::Summar* module on CPAN, but it appears nobody has started one yet.

The keywords chosen by Lingua::EN::Keywords aren't doing much for me. Any suggestions for better keyword selection?

Thank you, Troy

---

Source:

#!/usr/bin/perl
package Metabot;

use strict;
use WWW::SimpleRobot;
use HTML::Entities;
require HTML::Parser;
use HTML::Summary;
use HTML::TreeBuilder;
use Lingua::EN::Keywords;

# Metabot subclasses HTML::Parser so that text() below collects page text.
@Metabot::ISA = qw(HTML::Parser);

my $url    = $ARGV[0];
my $parser = Metabot->new;

my $robot = WWW::SimpleRobot->new(
    URLS           => [ $url ],
    FOLLOW_REGEX   => "^$url",
    DEPTH          => 2,
    TRAVERSAL      => 'depth',
    VISIT_CALLBACK => sub {
        my ( $url, $depth, $html, $links ) = @_;
        print "$url - depth $depth\n";

        # Decode entities, then strip document.write() calls and
        # any remaining numeric entities.
        $html = decode_entities($html);
        $html =~ s/document\.write\(.+?\)\;//g;
        $html =~ s/\&\#.+?\;//g;

        my $tree = new HTML::TreeBuilder;
        $tree->parse($html);

        my $summarizer = new HTML::Summary(
            LENGTH   => 250,
            USE_META => 1,
        );
        my $summary = $summarizer->generate( $tree );
        $summary =~ s/\s+/ /gs;
        print "Summary: $summary\n";

        # Note: $parser->{TEXT} is never cleared, so it holds the text
        # of every page parsed so far, not just the current one.
        $parser->parse($html);
        my $text = $parser->{TEXT};
        my @keywords = keywords($summary . $text);
        print "Keywords: " . join(", ", @keywords) . "\n\n";
    },
    BROKEN_LINK_CALLBACK => sub {
        my ( $url, $linked_from, $depth ) = @_;
        print STDERR "$url looks like a broken link on $linked_from\n";
        print STDERR "Depth = $depth\n";
    }
);

$robot->traverse;
my @urls  = @{$robot->urls};
my @pages = @{$robot->pages};
for my $page ( @pages ) {
    my $url               = $page->{url};
    my $depth             = $page->{depth};
    my $modification_time = $page->{modification_time};
}

# HTML::Parser text handler: append each text chunk to $self->{TEXT}.
sub text {
    my ($self, $text) = @_;
    $self->{TEXT} .= $text;
}

---

A bit of output:

[Ganesha:~/Desktop] davistv% perl metabot4.pl http://cincypg.org
http://cincypg.org/ - depth 0
Summary: The Cincinnati Programmers' Guild is founded on the premise that the art of software design is best practiced with a sense of craftsmanship and personal responsibility. 2003 February 11 - Member Tom Wulf going on Safari CPG Member Tom Wulf has volunt
Keywords: art of software design, member tom wulf, guild, tom wulf, safari cpg member tom wulf, cincinnati programmers

http://cincypg.org/legal/Bylaws.html - depth 1
Summary: BYLAWS OF THE CINCINNATI PROGRAMMERS GUILD April 16, 2002 ARTICLE I. ARTICLE II. Section 1. ARTICLE III. Section 1. Section 2. Section 3. Section 4. Section 5. ARTICLE IV. Section 1. Section 2. Section 3. Section 4. Section 5. Section 6. Section 7. S
Keywords: section, meetings, council, member, offices, councilors

http://cincypg.org/contact.shtml - depth 1
Summary: Cincinnati Programmers' Guild: Contacts General Guild Information Troy Davis Foundertroy@glyss.com Secretary Jason Paul Secretaryjason@adaptiveinfosystems.com Webmaster Jeremy Phelps Guild Webmasterwebmaster@cincyp
Keywords: meetings, section, council, member, offices, guild

http://cincypg.org/cgi-bin/links.pl - depth 1
Summary: Submit a link for inclusion on this page: Title: URL: Select a categoryTutorials and Online DocumentationOther Computer GroupsInformation Technology NewsOtherUSENET NewsgroupsAlgorithm sitesComputer humorOpinionated tripeVendor WebsitesOther Guilds
Keywords: meeting, section, council, guild, member, offices

http://cincypg.org/directions.shtml - depth 1
Summary: Cincinnati Programmers' Guild This page has moved.
Keywords: meeting, section, council, member, guild, offices

http://cincypg.org/events.shtml - depth 1
Summary: Many thanks to our host: Future Events Event TypeWhenWhere(Click for directions.)TopicPresenter(s) Monthly MeetingJune 17th, 2003 18:30 (6:30PM)KiZAN TechnologiesInvitation to CVSMr. Possible Future Topics: ActionScript: Flash isn't just for designer
Keywords: meeting, monthly meeting, section, ), 6:00pm, council

http://cincypg.org/subscribe.shtml - depth 1
Summary: Enter your e-mail address below. Check here if you are unsubscribing.
Keywords: meeting, monthly meeting, section, ), council, 6:00pm

http://cincypg.org/join.shtml - depth 1
Summary: Join the Cincinnati Programmers' Guild We are not currently keeping an Official Membership List, nor is there a formal definition of who is and is not a Guild member. Just show up at our next meeting.
Keywords: meeting, monthly meeting, members, section, ), guild

In reply to Keyword extraction, summarization by davistv
