in reply to search across website for particular terms
Several years ago, when I was very new to Perl, I needed to do something a lot like this - I have a wonderful vet, but she thinks computers are some sort of Black Art, and wouldn't even get on the internet. I had just gotten two pet rabbits, and while my vet had treated rabbits before, we agreed that some up-to-date info on this "exotic" pet would be helpful. I mirrored several sites that had some really good info (using a PC Magazine utility called "Site Snagger", though today I'd likely use w3mir or something else), placed the files on CD-ROM, and needed to build a useful index.
First I needed to know what was in all the files, not being a vet myself (I think I just split the canonicalized text on spaces or word boundaries). I also needed to filter out "stop words" (see also: Lingua::StopWords) and strip HTML tags. Much of this I did manually, being, as I said, very new to Perl. ;-)
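My tokenizing pass is long gone, but a minimal sketch of that step might look something like this (a reconstruction from memory, not my actual script; I'm assuming HTML::Strip for the tag-stripping, which isn't necessarily what I used back then):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Strip;                          # strips markup, leaves text
    use Lingua::StopWords qw( getStopWords );

    my $stop = getStopWords('en');            # hashref: stopword => 1
    my $hs   = HTML::Strip->new();

    my $file = shift @ARGV or die "usage: $0 file.html\n";
    open my $fh, '<', $file or die "open $file: $!";
    my $text = do { local $/; <$fh> };        # slurp the whole file
    $text = $hs->parse($text);                # drop the HTML tags
    $hs->eof;

    # lowercase, split on non-word characters, drop stop words
    my %seen;
    for my $word ( map { lc } split /\W+/, $text ) {
        next if $word eq q{} or $stop->{$word};
        $seen{$word}++;
    }
    print "$_\n" for sort keys %seen;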
But essentially, my index-building process was to take a list of the terms I did want to include and, for each term, iterate through my documents (recursing into directories) to find which files contained it. I did this with nothing more complicated than nested loops and grep, as almut suggested.
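In outline, it was something like this (again a from-memory sketch rather than the real thing; the term list and the mirror directory name are placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;

    my @terms = qw( rabbit coccidia snuffles );   # your own term list here
    my $root  = 'mirror';                         # wherever the snagged site lives

    # recurse the directory tree, collecting the HTML files
    my @files;
    find( sub { push @files, $File::Find::name if -f && /\.html?$/i }, $root );

    my %index;    # term => list of files containing it
    for my $file (@files) {
        open my $fh, '<', $file or die "open $file: $!";
        my $text = do { local $/; <$fh> };
        close $fh;
        # grep the term list against each document in turn
        push @{ $index{$_} }, $file
            for grep { $text =~ /\b\Q$_\E\b/i } @terms;
    }

    for my $term (sort keys %index) {
        print "$term: @{ $index{$term} }\n";
    }

From %index it's a short step to writing out an HTML page of terms, each linking to the documents that mention it.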
I won't show you my original code - as an indication of how "green" I was, my scripts used neither use strict; nor use warnings;, but I did use File::Slurp. ;-)
You might also find modules such as Ted Pedersen's Ngram Statistics Package useful for building lists of the words that are in your files now. Some of the advice in Creating Dictionaries may also be helpful.
Update: Oopsie... wrong Ted Pedersen link...
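By the way, if you just want a quick look at the words your files contain before reaching for NSP, even a plain frequency count goes a long way (once more, only a sketch):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # crude word-frequency count over files named on the command line
    my %freq;
    while (<>) {
        $freq{ lc $_ }++ for grep { length } split /\W+/;
    }

    # print the 50 most common words, most frequent first
    my @top = sort { $freq{$b} <=> $freq{$a} } keys %freq;
    splice @top, 50 if @top > 50;
    print "$freq{$_}\t$_\n" for @top;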
HTH,
planetscape