in reply to Count words on a website

Well, first, Perl isn't the same thing as CGI. Perl is a programming language often used to write CGI scripts. So you're really asking a couple of questions. First, is there a way to count the number of words in a website? (And the subquestion: can you show me how?) Second, can it be done as part of a CGI script written in Perl? (Again, can you show me how?)

First, let me answer the question of scraping a count for a specific word from a website. The following code does just that:

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::Strip;

    my $content = get( "http://www.perlmonks.org" );
    die "Couldn't get content!" unless defined $content;

    my $hs = HTML::Strip->new();
    my $page_text = $hs->parse( $content );
    $hs->eof;

    my $word = 'Monastery';
    my $count = () = $page_text =~ /\b$word\b/gi;
    print "$word appears $count times.\n";

That code counts how many times the word "Monastery" appears on the front page of the PerlMonks website. So as you can see, the answer is "Yes, you can count words from a website using Perl."

The next question is whether a CGI script can accomplish the task, using Perl as the language of implementation. Again, the answer is yes. You will want to look at the CGI.pm module. That module makes it pretty easy to generate web output and accept web input. But you don't learn to program for the Common Gateway Interface in one day. It's not super-hard, but there are a lot of hangups along the way: things like script permissions, where within your webserver's accessible path to put the CGI script, where to read script runtime warnings and errors, and so on.
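Just to give a feel for the shape of such a script, here is a minimal sketch using CGI.pm. The form-field name 'word' and the helper render_response are my own inventions for illustration, and on recent Perls CGI.pm is no longer bundled with the core, so you may need to install it from CPAN first. It only echoes the parameter back as plain text; it doesn't do any counting.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# Hypothetical helper: build the complete HTTP response for one request.
sub render_response {
    my ($q) = @_;

    # 'word' is an assumed form-field name. Treat its value as untrusted.
    my $word = $q->param('word');
    $word = '' unless defined $word;

    # Send plain text so the browser does not interpret the echoed input as HTML.
    return $q->header( -type => 'text/plain', -charset => 'utf-8' )
         . "You asked about: $word\n";
}

print render_response( CGI->new );
```

Dropped into your server's CGI directory with execute permission, that should answer a request like yourscript.cgi?word=Monastery. Where that directory is, and where the errors end up, depends on your server setup.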

That all adds up to more than you can read in one quick post. But all the info you need is available online. There are a number of good tutorials under the Monastery's Tutorials section. You might start there.

One thing that I will mention:

If you're going to be using a regexp to find and count word occurrences, and you're going to allow the outside world (via your CGI script) to specify that word, you had better be careful about things like:

    # $word is direct user input
    my $count = () = $page_text =~ m/\b$word\b/gi;

I say this because it's easy for someone to hand your regexp something that makes your script die an early death, or worse. For example, I can crash the previously mentioned bit of code by entering a "word" that looks like "(?<=\s+)". The script will immediately die, complaining about a variable-length look-behind. This may be harmless to you, but as a general practice you don't want to give your web users the ability to crash your CGI scripts. That's just not a good thing. The point here is, yes, you can do all the things you're asking. But do it armed with the understanding of what you're doing and having planned carefully for tainted user input.
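One common way to defuse that particular problem is to have Perl treat the user's word as literal text by wrapping it in \Q...\E (the same thing as calling quotemeta on it). Here is a minimal sketch; the helper name count_word and the sample text are made up for illustration, and this is only one piece of handling untrusted input, not a complete answer.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the code above): count whole-word,
# case-insensitive occurrences of $word in $text, treating $word as
# literal text rather than as a regexp.
sub count_word {
    my ( $text, $word ) = @_;

    # An empty or missing word would otherwise match at every word boundary.
    return 0 unless defined $word && length $word;

    # \Q...\E escapes any regexp metacharacters inside $word.
    my $count = () = $text =~ /\b\Q$word\E\b/gi;
    return $count;
}

my $page_text = 'Welcome to the Monastery. The monastery gates are open.';
print count_word( $page_text, 'Monastery' ), "\n";    # prints 2
print count_word( $page_text, '(?<=\s+)' ), "\n";     # prints 0, and does not die
```

Note that \b only makes sense next to word characters, so a "word" made of punctuation will usually just count zero rather than do anything clever.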


Dave