Chris_921 has asked for the wisdom of the Perl Monks concerning the following question:

Hiya monks,

I'm really new to Perl, and I'm also pretty damn awful at it. I was wondering if there is a way I can check the number of occurrences of a word in a website? For example, a user enters the URL of a site into an entry box, and then this code replies saying how many times the particular word occurred. If you could help me I'd be really grateful, thank you.

jdporter - edited title

Replies are listed 'Best First'.
Re: Count words on a website
by dragonchild (Archbishop) on Mar 03, 2004 at 17:22 UTC
    This is actually a (potentially) very complicated thing. What I would do is the following:
    1. Define the problem better. Are you going to do just static pages? What defines if a page is in a given site? There are a lot of different answers, and none are really wrong.
    2. Learn how to do the part where they enter the URL (and word?) and figure out how you can do something with it.
    3. Learn how to return information back to them.
    4. Then, and only then, are you ready to learn how to walk a website (probably using something similar to WWW::Mechanize).

    In other words, I would recommend the following:

    1. Read Learning Perl, by Randal Schwartz.
    2. Play around with website development

    I hope your homework isn't due in the next week, or so.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Count words on a website
by davido (Cardinal) on Mar 03, 2004 at 19:07 UTC
    Well, first, Perl isn't equal to CGI. Perl is a programming language often used to write CGI scripts. So you're really asking a couple of questions. First, is there a way to count the occurrences of a word in a website? (And the subquestion: can you show me how?) Second, can it be done as part of a CGI script written in Perl? (Again, can you show me how?)

    First let me answer the question of scraping a word-count for a specific word from a website. The following code does just that:

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::Strip;

    my $content = get( "http://www.perlmonks.org" );
    die "Couldn't get content!" unless defined $content;

    my $hs = HTML::Strip->new();
    my $page_text = $hs->parse( $content );
    $hs->eof;

    my $word = 'Monastery';
    my $count = () = $page_text =~ /\b$word\b/gi;
    print "$word appears $count times.\n";

    That code counts how many times the word "Monastery" appears on the front page of the Perlmonks website. So as you can see, the answer is "Yes, you can count words from a website using Perl."

    The next question is that of CGI being able to accomplish the task, using Perl as the language of implementation. Again, the answer is yes. You will want to look at the CGI.pm module. That module makes it pretty easy to generate web output and accept web input. But you don't learn to program for the Common Gateway Interface in one day. It's not super-hard, but there are a lot of hangups along the way; things like script permissions, where within your webserver's accessible path to put the CGI script, where to read script runtime warnings and errors, and so on.

    That all adds up to more than you can read in one quick post. But all the info you need is available online. There are a number of good tutorials under the Monastery's Tutorials section. You might start there.

    One thing that I will mention:

    If you're going to be using a regexp to find and count word occurrences, and you're going to allow the outside world (via your CGI script) to specify that word, you had better be careful about things like:

    # $word is direct user input
    my $count = () = $page_text =~ m/\b$word\b/gi;

    I say this because it's easy for someone to hand your regexp something that makes your script die an early death, or worse. For example, I can crash the previously mentioned bit of code by entering a "word" that looks like "(?<=\s+)". The script will immediately die, complaining about a variable-length look-behind. This may be harmless to you, but as a general practice you don't want to give your worldwide web users the ability to crash your CGI scripts. That's just not a good thing. The point here is, yes, you can do all the things you're asking. But do it armed with an understanding of what you're doing, and having planned carefully for tainted user input.
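
    One common way to defuse that kind of input is to quote the user's string with \Q...\E (the inline form of quotemeta), so it is matched as literal text rather than compiled as a pattern. The following is only a minimal sketch, not part of the code above: count_word is a hypothetical helper name, and the sample text is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the original post): count
# whole-word, case-insensitive matches of a user-supplied string that is
# treated as literal text.
sub count_word {
    my ($text, $word) = @_;
    return 0 unless defined $word && length $word;
    # \Q...\E escapes regex metacharacters, so $word cannot inject a pattern.
    my $count = () = $text =~ /\b\Q$word\E\b/gi;
    return $count;
}

# Made-up sample text standing in for the stripped page content.
my $page_text = 'The Monastery gates. Welcome to the monastery (again).';

print count_word($page_text, 'Monastery'), "\n";   # prints 2
print count_word($page_text, '(?<=\s+)'),  "\n";   # prints 0, and does not die
```

    Note that \b only matches next to a word character, so a "word" that starts or ends with punctuation may never match; that is a limitation of this sketch, not a crash. Quoting the input also does nothing about the other half of the problem, which is deciding which URLs the script should be willing to fetch.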


    Dave

Re: Count words on a website
by borisz (Canon) on Mar 03, 2004 at 17:28 UTC
    perl -MLWP::Simple -e 'print( scalar( () = get($ARGV[0]) =~ /\b$ARGV[1]\b/g ) )' http://www.perlmonks.org/ Monks
    Boris
Re: Count words on a website
by blue_cowdawg (Monsignor) on Mar 03, 2004 at 17:20 UTC

        For example, a user enters the URL of a site into an entry box, and then this code replies saying how many times the particular word occurred.

    What have you tried and failed at?

    HINT: Associative arrays.
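
    To make the hint concrete: an associative array (a hash, in Perl terms) can map each word to the number of times it has been seen. This is a rough sketch under assumptions of its own: word_counts is a hypothetical helper name, the sample text is invented, and "word" is taken to mean a run of letters, digits, underscores, or apostrophes.

```perl
use strict;
use warnings;

# Hypothetical helper: return a reference to a hash of
# lowercased word => number of occurrences.
sub word_counts {
    my ($text) = @_;
    my %count;
    # Each match of the pattern is one word; lowercase it so that
    # "Monks" and "monks" share a single hash key.
    $count{ lc $1 }++ while $text =~ /([\w']+)/g;
    return \%count;
}

# Invented sample text; in practice this would be the page's stripped text.
my $counts = word_counts('Monks help monks. The Monastery welcomes monks.');

for my $word (sort keys %$counts) {
    print "$word: $counts->{$word}\n";
}
```

    Once the hash is built, looking up any particular word is a single key access, for example $counts->{monks}.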


    Peter L. Berghold -- Unix Professional
    Peter at Berghold dot Net
       Dog trainer, dog agility exhibitor, brewer of fine Belgian style ales. Happiness is a warm, tired, contented dog curled up at your side and a good Belgian ale in your chalice.
Re: Count words on a website
by csuhockey3 (Curate) on Mar 03, 2004 at 21:05 UTC
    Have a look at LWP. It's loaded with lots of good stuff, like fetching pages and working with what you get back. I think you will find more than you need. Also search for LWP here on PM.

    --CSUhockey3
Re: Count words on a website
by data64 (Chaplain) on Mar 04, 2004 at 03:36 UTC

    >I can check the number of occurrences of a word in a website?

    Do you want to count the words just on a single page, or do you really mean the entire site? If you are truly talking about the entire site, then you should be using something like the Swish-e search engine. Otherwise, follow the excellent advice in the other posts about using LWP.


    Just a tongue-tied, twisted, earth-bound misfit. -- Pink Floyd