Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

Please bear with me; I have not touched Perl in about six years, but I have a job that I think Perl is perfect for. I have a website that I'm trying to build a glossary of terms for. The idea is to have a glossary page with definitions and then, below each definition, show WHERE that term shows up throughout the site. I essentially need something that searches across all of the webpages and finds the instances of the word I'm looking for. I'd like something like this:

$> perl wordsearch.pl <searchword>
Search Results:
<searchword> found in hello.html
<searchword> found in index.html
<searchword> found in about.html
Does that make sense? I'm not certain which Perl version is installed on the server I'm working on, so a brute-force method might be best (I'm not certain I can add modules, etc.). Thank you all for your help!
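
Something along these lines is what I'm picturing (an untested sketch; ./htdocs is a placeholder for the real document root, and File::Find ships with Perl, so no extra modules would be needed):

#!/usr/bin/perl
# wordsearch.pl -- untested sketch; ./htdocs is a placeholder document root
use strict;
use warnings;
use File::Find;   # core module, no extra install needed

my $word    = shift or die "Usage: $0 <searchword>\n";
my $docroot = './htdocs';                     # change to the real document root

my @hits;
find(sub {
    return unless -f && /\.html?$/i;          # only regular .html/.htm files
    open my $fh, '<', $_ or return;
    local $/;                                 # slurp the whole file
    my $content = <$fh>;
    push @hits, $File::Find::name
        if $content =~ /\b\Q$word\E\b/i;      # case-insensitive whole-word match
}, $docroot);

print "Search Results:\n";
print "$word found in $_\n" for @hits;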

Replies are listed 'Best First'.
Re: search across website for particular terms
by almut (Canon) on Jul 30, 2008 at 21:58 UTC

    Not saying that Perl couldn't do it too... but maybe a simple recursive grep is sufficient. Something like

    grep -r searchword /path/to/htdocs/

    (Of course that would search in the HTML markup etc. as well, not just the rendered page content... — Update: if that's a problem, you'd have to parse the HTML, for example using modules such as HTML::TokeParser, HTML::TokeParser::Simple or HTML::Parser. To recurse through the document tree, you could use File::Find)
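
    A rough, untested sketch of that approach, using HTML::TokeParser (from the HTML-Parser distribution) together with File::Find; /path/to/htdocs is a placeholder:

    use strict;
    use warnings;
    use File::Find;
    use HTML::TokeParser;    # part of the HTML-Parser distribution

    my $word    = shift or die "Usage: $0 <searchword>\n";
    my $docroot = '/path/to/htdocs';    # placeholder

    find(sub {
        return unless -f && /\.html?$/i;
        my $p = HTML::TokeParser->new($_) or return;
        my $text = '';
        while (my $token = $p->get_token) {
            # collect only text tokens, skipping the markup itself
            # (text inside <script>/<style> slips through too -- fine for a sketch)
            $text .= " $token->[1]" if $token->[0] eq 'T';
        }
        print "$word found in $File::Find::name\n"
            if $text =~ /\b\Q$word\E\b/i;
    }, $docroot);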

    Update 2: Maybe you could save yourself a lot of work by using some existing software like ht://Dig to index all of your pages, and then issue a simple search whenever you want to know which documents contain a particular term...

Re: search across website for particular terms
by planetscape (Chancellor) on Jul 31, 2008 at 09:59 UTC

    Several years ago, when I was very new to Perl, I needed to do something a lot like this - I have a wonderful vet, but she thinks computers are some sort of Black Art, and wouldn't even get on the internet. I had just gotten two pet rabbits, and while my vet had treated rabbits before, we agreed that some up-to-date info on this "exotic" pet would be helpful. I mirrored several sites that had some really good info (using a PC Magazine utility called "Site Snagger", though today I'd likely use w3mir or something else), placed the files on CD-ROM, and needed to build a useful index.

    First I needed to know what was in all the files, not being a vet myself (I think I just split the canonicalized text on spaces or word boundaries). I also needed to filter out "stop words" (see also: Lingua::StopWords) - and HTML tags. Much of this I did manually, being, as I said, very new to Perl. ;-)
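
    Today, that stop-word step might look something like this (an untested sketch, assuming Lingua::StopWords is installed; $text stands in for the de-tagged page content):

    use strict;
    use warnings;
    use Lingua::StopWords qw(getStopWords);

    my $text  = 'The quick brown fox jumps over the lazy dog';   # placeholder content
    my $stop  = getStopWords('en');                  # hashref of common English stop words
    my @terms = grep { length && !$stop->{$_} }      # drop empties and stop words
                split /\W+/, lc $text;

    print "@terms\n";   # prints something like: quick brown fox jumps lazy dog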

    But essentially what my index-building process looked like was to take a list of the terms I did want to include, and for each term, iterate through my documents (recursing on directories) to find which ones contained which terms. I did this with nothing more complicated than nested loops and grep, as almut suggested.
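
    In today's terms, the core of it was something like this (untested sketch; the term list and document root below are placeholders):

    use strict;
    use warnings;
    use File::Find;

    my @terms   = qw(hay pellets molt);          # placeholder glossary terms
    my $docroot = '/path/to/mirrored/site';      # placeholder

    # Gather the documents once...
    my @docs;
    find(sub { push @docs, $File::Find::name if -f && /\.html?$/i }, $docroot);

    # ...then, for each term, record which documents mention it.
    my %index;
    for my $term (@terms) {
        for my $doc (@docs) {
            open my $fh, '<', $doc or next;
            local $/;
            my $content = <$fh>;
            push @{ $index{$term} }, $doc if $content =~ /\b\Q$term\E\b/i;
        }
    }

    # terms with no hits simply don't appear in the output
    print "$_: @{ $index{$_} }\n" for sort keys %index;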

    I won't show you any code - as an indication of how "green" I was, my scripts used neither use strict; nor use warnings;, but I did use File::Slurp. ;-)

    You might also find modules such as Ted Pedersen's Ngram Statistics Package useful for building lists of the words that are in your files now. Some of the advice in Creating Dictionaries may also be helpful.


    Update: Oopsie... wrong Ted Pedersen link...

    HTH,

    planetscape
Re: search across website for particular terms
by eosbuddy (Scribe) on Jul 31, 2008 at 03:10 UTC
    Don't know if this will work for you - found it using goo(d)o(ld)gle :-) http://www.linuxjournal.com/article/2200